CHAPTER 10


RAC Database Optimization

by Riyaj Shamsudeen

The strategy and methods employed to optimize an application in a Real Application Clusters (RAC) database are similar to those for tuning a single-instance database. However, in RAC, performance problems in one instance can affect the performance of other instances. In addition to single-instance wait events, RAC introduces numerous RAC-specific wait events. Time model–based tuning is employed to tune RAC databases too.

In this chapter, I discuss RAC-related wait events, statistics, and the Automatic Workload Repository (AWR) report sections applicable to a RAC database. Almost everything in this chapter is tested with and applicable to Oracle Database releases 11.2 and 12c. There may be minor differences in earlier versions.

Introduction to Cache Fusion

To understand RAC wait events, you should first understand the internal workings of cache fusion. In RAC, buffer cache memory areas from all instances are effectively shared to create the effect of a fused buffer cache. Some background processes play critical roles in managing cache fusion processing.

A database block holds row pieces of a database table, and a change modifies a database block atomically. Database blocks are modified in the buffer cache of an instance's System Global Area (SGA) (excluding direct mode writes). Since a database block can be modified in any instance, changes to the buffers in the SGA must be globally coordinated. This coordination mechanism is implemented using resources and locks, which are allocated as memory structures in the Global Resource Directory (GRD).

A resource is allocated in the GRD to protect a database block. Before modifying a block, a lock on that resource must be acquired in exclusive mode. Further, a disk read of a block into the buffer cache might require a lock on the resource too. This scheme of reads acquiring buffer locks (BL) is required to prevent one instance from modifying a block while another instance holds that block in its buffer cache in read-only mode.

A resource name is coined in such a way that it uniquely identifies a block. An instance is also chosen to serve as a master instance of a resource to track the resource state.

image Note   Chapter 11 discusses resources and locks in more detail. I will briefly introduce resources and locks in this chapter, but a much more elaborate description can be found in Chapter 11.

You should realize that cache fusion locks do not replace single-instance locks; both single-instance and cache fusion locks coexist in a RAC database. Row-level locks protect rows from concurrent changes, and similarly, cache fusion BL locks protect buffers from inconsistent concurrent changes. Essentially, row-level locks and buffer-level locks coexist.

The global cache coordination mechanism, also known as the RAC locking scheme, is centered on the following premises:

  1. A block can be resident in a modifiable mode in one instance only. Before a block is modified, a global cache lock is acquired on a resource in exclusive mode, effectively protecting the block from concurrent changes.
  2. A block can be resident in Protected Read (PR) mode in multiple instances. Buffers protected with PR mode locks are considered to be in read-only mode.
  3. A session must acquire the lock on a block in exclusive mode before modifying the block. Therefore, other instances must downgrade the locks to null mode before the lock is acquired in exclusive mode.
  4. Buffers reconstructed for Consistent Read (CR) mode do not require any BL locks to be held.

Cache Fusion Processing

In a single instance, a database block can reside only in the local buffer cache. Therefore, if the block is not in the local buffer cache, it is read from the disk. In RAC, the block can be resident in any buffer cache. So, a special type of processing takes place that deviates sharply from single-instance processing.

The following section provides a rough outline of the cache fusion-processing scheme. This discussion should be considered as an outline of cache fusion processing, not necessarily a documentation of the complete algorithm.

Figure 10-1 shows the mechanics of a cache fusion block transfer.

9781430250449_Fig10-01.jpg

Figure 10-1. Cache fusion processing 3-way

Each of the numbered steps is explained below. Items with dotted lines in the picture indicate the state after the block transfer.

  • 1.  A foreground (FG) process (also known as a server process) tries to access a database block, and the block is not in the local buffer cache. A hashing algorithm is applied to the resource_name protecting the block to identify the resource master instance. In Figure 10-1, the resource master instance for this resource is instance 2.
  • a.  The FG process retrieves the network address of a remote LMS background process running on the resource master instance.
  • b.  The FG process constructs a message for the block access and sends that message to the remote LMS process.
  • c.  The FG process must wait until a response is received from the remote LMS process. The FG process cannot predict the LMS response, since the block could be in a remote buffer cache or may not be resident in any buffer cache at all. So, the FG process waits on a placeholder wait event.
  • 2.  The LMS process in the resource master instance receives the message, identifies the block, and accesses the GRD to identify the current state of the resource.
  • a.  For a CURRENT mode request, if the block is not in any buffer cache, the LMS process allocates a new resource in the GRD, allocates a lock on the resource, and then sends a message back to the FG process with a grant to read the block from the disk.
  • b.  For a CR mode request, a GRD representation of the resource is not required. The block may simply be reconstructed without requiring an additional lock conversion.
  • 3.  In Figure 10-1, the block is resident in the buffer cache of instance 1. So, an LMS process running on the resource master instance sends a request to the LMS process running on the owning instance (instance 1).
  • 4.  The LMS process running in the owning instance constructs the block and sends it to the FG process running in instance 3.
  • 5.  The FG process receives the block, copies it into the instance 3 buffer cache, and continues processing.

After the transfer of a block, the LMS process running on the resource master instance also modifies the resource and adds a lock indicating that the block is now resident in both instances 1 and 3. Obviously, the LMS background process plays a critical role in cache fusion; it is the workhorse process behind cache fusion processing.

Even though an individual session requests locks, cache fusion locks are owned by instances, not individual sessions. Let’s say that another process in instance 3 accesses the same block: no additional cache fusion processing is needed, as the lock is already held in a compatible mode. As long as the lock is held in a compatible mode, any process can access the block locally without additional cache fusion processing. Of course, if the lock is not held in a compatible mode, then the lock mode must be upgraded.

image Note   In this chapter, I am using the term “global cache lock requests.” The more accurate term is “global cache lock conversions.” Initially, locks are acquired in NULL mode and then converted to a higher-level mode such as exclusive or PR mode. The term “global cache lock request” is simply more readable than “global cache lock conversion.” This difference in terminology does not matter much, but if you are reading internal traces the difference becomes obvious.

GRD

A combination of file_id and block_id uniquely identifies a block; BL resource names are coined using the combination of file_id and block_id of the block. A lock on that BL resource is acquired before altering the block or reading from the disk.

Let me explain the BL locking mechanism with a small table. The following code creates a table t_one and populates the table with 500 rows, adds an index, and then collects statistics on the table. With this setup, I will query a row to show how BL resources are employed to maintain cache fusion consistency.

DROP TABLE t_one;
CREATE TABLE t_one (n1 NUMBER , v1 VARCHAR2(100));
INSERT INTO t_one
SELECT n1, lpad (n1, 100,'DEADBEEF')
FROM
  ( SELECT level n1 FROM dual CONNECT BY level <=500
  );
COMMIT;
CREATE INDEX t_one_n1 ON t_one (n1);
EXEC dbms_stats.gather_table_stats ( USER, 't_one', CASCADE =>true);

BL Resources and Locks

The query printed below uses an index range scan access path, so the index block is read using the predicate n1=100, and then the table block is read using the rowid retrieved from the index entry. For this discussion, I will focus exclusively on the BL resources and locks protecting the table data block, although a similar locking scheme applies to the index block as well. As the BL resource_name is derived from the combination of file_id and block_id, I will query the file_id and block_id of the block using the dbms_rowid package.

-- script tc_one_row.sql --
SELECT n1,
  dbms_rowid.rowid_to_absolute_fno (rowid, user,'T_ONE') fno,
  dbms_rowid.rowid_block_number(rowid) block,
  dbms_rowid.rowid_object(rowid) obj,
  LENGTH(v1) v1
FROM t_one
WHERE n1=100;
 
N1        FNO         BLOCK      OBJ        V1
---------- ---------- ---------- ---------- ----------
       100          4        180      75742        250

image Note   The previous SQL statement has the LENGTH(v1) expression in the SELECT list. That expression is needed to force reading of the data block of the table. Query execution will read the block from the disk into the buffer cache.

Resource_name starts with the format [0xblock_id][0xfile_id],BL, where both block_id and file_id are converted to hexadecimal format. The query printed below constructs the resource_name protecting the block using the dbms_rowid package. This resource_name will be used to query the GRD to view resources and locks.

SELECT DISTINCT '[0x'||trim(TO_CHAR(dbms_rowid.rowid_block_number(rowid), 'xxxxxxxx'))
    ||'][0x'|| trim(TO_CHAR(dbms_rowid.rowid_to_absolute_fno (rowid,user,'T_ONE'),'xxxx'))
    ||'],[BL]' res
FROM t_one
WHERE n1=100;
RES
---------------------------
[0xb4][0x4],[BL]

With the derived resource_name, I will query gv$ges_resource to print the GRD resource protecting the block. The output of the query below shows that this resource is visible in inst_id=2 (I am connected to instance 2). The master node of the resource is 1 (master_node numbering starts with 0, so master_node=1 is inst_id=2). This resource protects the block with file_id=4 and block_id=180. Column on_convert_q shows the count of processes waiting in the converting queue to acquire locks, essentially a count of lock waiters. Column on_grant_q shows the count of processes holding locks, essentially a count of lock holders.

RS@ORCL2:2>SELECT resource_name, ON_CONVERT_Q, ON_GRANT_Q, MASTER_NODE
FROM gv$ges_resource
WHERE resource_name LIKE '[0xb4][0x4],[BL]%'
/
RESOURCE_NAME                  ON_CONVERT_Q ON_GRANT_Q MASTER_NODE
------------------------------ ------------ ---------- -----------
[0xb4][0x4],[BL][ext 0x0,0x0]             0          1           1

gv$ges_enqueue externalizes the locks held in the GRD. The following query accesses gv$ges_enqueue to print lock details. We can see that the owner_node column is set to 1 (owner_node numbering starts with 0, so owner_node=1 is inst_id=2). Essentially, in this example, the resource is mastered by inst_id=2, as shown in the output of gv$ges_resource, and the lock on that resource is also owned by the same instance. (Note that the SQL statements in this section are available in the downloadable script Listing_10-1.sql.)

Listing 10-1.  Gv$ges_enqueue Output

col state format a15
RS@ORCL2:2>SELECT resource_name1, grant_level, state, owner_node
FROM v$ges_enqueue
WHERE resource_name1 LIKE '[0xb4][0x4],[BL]%'
/
RESOURCE_NAME1                 GRANT_LEV STATE      OWNER_NODE
------------------------------ --------- ---------- ----------
[0xb4][0x4],[BL][ext 0x0,0x0]  KJUSERPR  GRANTED             1

Connecting to instance ORCL1 (inst_id=1), I will query the same row to read the data blocks into the instance 1 buffer cache.

RS@ORCL1:1> SELECT n1,
  dbms_rowid.rowid_to_absolute_fno (rowid, 'RS','T_ONE') fno,
  dbms_rowid.rowid_block_number(rowid) block,
  dbms_rowid.rowid_object(rowid) obj,
  LENGTH(v1) v1
FROM rs.t_one
WHERE n1=100;
 
N1         FNO        BLOCK      OBJ        V1
---------- ---------- ---------- ---------- ----------
       100          4        180      75742        250

Accessing gv$ges_resource, as printed in the following query output, we see that the block is still mastered by instance 2.

RS@ORCL2:2> SELECT resource_name, ON_CONVERT_Q, ON_GRANT_Q, MASTER_NODE
FROM gv$ges_resource
WHERE resource_name LIKE '[0xb4][0x4],[BL]%';
 
RESOURCE_NAME                  ON_CONVERT_Q ON_GRANT_Q MASTER_NODE
------------------------------ ------------ ---------- -----------
[0xb4][0x4],[BL][ext 0x0,0x0]             0          1           1

Accessing gv$ges_enqueue, we see that there are two locks on that resource, owned by instances 1 and 2 in KJUSERPR (PR) mode. Since the sessions are accessing the block in read-only mode, the locks are acquired in PR (KJUSERPR) mode. Both locks can be held concurrently, as the KJUSERPR lock mode is compatible with another KJUSERPR mode lock.

The KJUSERPR mode protects the block from concurrent changes. So, if another session connected to instance 3 tries to change the block, then both instances 1 and 2 must downgrade their locks before instance 3 can acquire the lock on the resource in exclusive mode.

RS@ORCL2:2>SELECT resource_name1, grant_level, state, owner_node
FROM v$ges_enqueue
WHERE resource_name1 LIKE '[0xb4][0x4],[BL]%'
/
RESOURCE_NAME1                 GRANT_LEV STATE      OWNER_NODE
------------------------------ --------- ---------- ----------
[0xb4][0x4],[BL][ext 0x0,0x0]  KJUSERPR  GRANTED             1
[0xb4][0x4],[BL][ext 0x0,0x0]  KJUSERPR  GRANTED             0

Figure 10-2 represents the resource and locks discussed in this section. The resource master instance has a BL resource allocated to protect the block in discussion. Two GCS (Global Cache Services) locks also have been acquired. However, there is a minor distinction to be made between these two GCS locks, namely, GCS client and GCS shadow: if the buffer represented by the resource is resident in the local buffer cache, then the corresponding GCS lock structure is known as the GCS client; if the buffer is resident in the remote buffer cache, then the lock structure is known as GCS shadow. A GCS client points to lock element, which in turn points to a buffer header structure. The buffer header structure holds a pointer to the buffer in the buffer cache.

9781430250449_Fig10-02.jpg

Figure 10-2. Resources and locks

Fixed table x$kjbr externalizes resource structures, x$kjbl externalizes BL locking structures, x$le externalizes the lock elements, and x$bh externalizes the buffer header structures.

The query in Listing 10-2 can be used to probe the internal structures further. (I should point out that this SQL statement has no practical value; it is provided purely to help you understand the internal structures.) In 12c, if the database is a container database (CDB), this statement must be executed from the root container.

Listing 10-2.  Lock_to_buffer

set lines 120
col kjblname format a30
col kjblname2 format a20
col kjblgrant format a10
col kjblrole format 9999
col kjblrequest format A10
set verify off
select /*+ leading (c a b ) */ b.kjblname, b.kjblname2 ,
b.kjblgrant, b.kjblrole, b.kjblrequest , b.kjblmaster, b.kjblowner
from x$le a , x$kjbl b, x$bh c
where a.le_kjbl = b.kjbllockp
and a.le_addr =c.le_addr
and dbablk=&&block and obj=&&obj
;
KJBLNAME                   KJBLNAME2            KJBLGRANT  KJBLROLE KJBLREQUES KJBLMASTER
-------------------------- -------------------- ---------- -------- ---------- ----------
[0xb4][0x4],[BL][ext 0x0,0 180,4,BL             KJUSEREX          0 KJUSERNL            0

In summary, a BL resource protects a block, the action of reading a block into buffer cache creates a resource in the resource master instance, and a lock is added to that resource in a specific mode. In this example, the block is read with no intent to change and hence KJUSERPR locking mode (PR) is sufficient. We will see that locking mode will be different depending upon whether the block is altered or not. Locking modes are also discussed in Chapter 11.

Performance Analysis

Typically, performance tuning an application in a RAC database is geared towards identifying the layer and the instance causing the slowdown. You should ask the following questions to understand the problem further. While a few of these questions also apply to single-instance performance, they are specifically written for a RAC database.

  • 1.  What wait events is the application suffering from? Analysis of the waits suffered by the FG processes typically focuses on the receiving side of global cache performance.
  • a.  Are there numerous occurrences of wait events?
  • b.  Is the average wait time for an event high?
  • 2.  Is the sending side of global cache performing efficiently? Is there a delay in global cache processing?
  • 3.  Is network latency causing performance issues? The network is discussed in Chapter 9.
  • 4.  Can the performance problem be isolated to an instance? Is that instance suffering delays on the receiving side or the sending side?

Each of these questions can be answered by reviewing wait events or statistics. You can isolate the problem to a specific instance, component, or object using the methods discussed in this section.

Analysis of the Receiving Side

This section focuses on the receiving side of global cache performance metrics. RAC introduces a few wait events for time model instrumentation. You would use AWR or another time-based model to understand the wait events causing performance issues, and then tune the top time-consuming events.

Generally, wait events can be grouped into the following categories:

  1. Current mode wait events: These wait events are incurred if the version of the transferred or requested block is the latest. Current mode wait events can be encountered for both read and write requests. You should note that the block version on disk could be the latest version, and hence the sheer action of reading a block from the disk can acquire locks in current mode.
  2. CR mode wait events: If the current version of the block is ahead of the query environment System Change Number (SCN), then the LMS process must reconstruct a block version for read-only purposes and then send the block to the FG process. This type of block transfer is known as a CR mode transfer.
  3. Busy wait events: Time is accounted to these events if the block transfer suffered from additional processing delays encountered by the LMS process. For example, the LMS process may have applied undo records to create a CR mode block.
  4. Congestion-related events: These events imply that delays were incurred because the LMS process was too busy servicing other requests or was suffering from resource starvation.

Wait events can be further grouped depending on the number of instances participating in a block transfer. Up to three instances can participate in a block transfer. The requesting instance is the instance requesting the block, the master instance tracks the resource status, and the owner instance currently holds a lock on the resource. If three instances participate in a block transfer, then that block transfer is tagged with 3-way wait events. In some cases, the resource master instance might also be the owner instance. In that case, only two instances participate in the block transfer; hence, those wait events are tagged with 2-way wait events.
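To get a quick sense of which of these categories dominates in your database, the cluster waits captured in ASH can be bucketed by a simple pattern match on the event name. The following query is a minimal sketch, not part of the book’s script library; the LIKE patterns are assumptions based on the event-naming convention described above, and the usual ASH sampling and licensing caveats apply.

-- Minimal sketch: bucket ASH cluster waits by category using event-name patterns.
SELECT inst_id,
  CASE
    WHEN event LIKE '%congested%' THEN 'Congested'
    WHEN event LIKE '%busy%'      THEN 'Busy'
    WHEN event LIKE 'gc cr%'      THEN 'CR mode'
    WHEN event LIKE 'gc current%' THEN 'Current mode'
    ELSE 'Other'
  END category,
  COUNT(*) cnt
FROM gv$active_session_history
WHERE wait_class = 'Cluster'
GROUP BY inst_id,
  CASE
    WHEN event LIKE '%congested%' THEN 'Congested'
    WHEN event LIKE '%busy%'      THEN 'Busy'
    WHEN event LIKE 'gc cr%'      THEN 'CR mode'
    WHEN event LIKE 'gc current%' THEN 'Current mode'
    ELSE 'Other'
  END
ORDER BY inst_id, cnt DESC;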

A Generic Analysis for all Wait Events

You can identify the object, sql_id, or distribution of wait time for any wait event. This section provides general guidance to understand the details of all RAC wait events. Use this section in conjunction with individual RAC wait events to understand the performance bottlenecks better.

Identify Object

Active Session History (ASH) can be queried to identify the objects suffering from wait events. Listing 10-3 introduces script ash_gcwait_to_obj.sql to identify the object_name.

Listing 10-3.  Script: ash_gcwait_to_obj.sql

col owner format a30
col object_name format a30
set lines 160
WITH ash_gc AS
  (SELECT inst_id, event, current_obj#, COUNT(*) cnt
  FROM gv$active_session_history
  WHERE event=lower('&event')
  GROUP BY inst_id, event, current_obj#
  HAVING COUNT (*) > &threshold )
SELECT * FROM
  (SELECT inst_id, nvl( owner,'Non-existent') owner ,
           nvl ( object_name,'Non-existent') object_name,
           nvl ( object_type, 'Non-existent') object_type,
           cnt
  FROM ash_gc a, dba_objects o
  WHERE (a.current_obj#=o.object_id (+))
  AND a.current_obj#  >=1
  UNION
  SELECT inst_id, '', '', 'Undo Header/Undo block', cnt
  FROM ash_gc a WHERE a.current_obj#=0
  UNION
  SELECT inst_id, '', '', 'Undo Block', cnt
  FROM ash_gc a
  WHERE a.current_obj#=-1
  )
ORDER BY cnt DESC
/
Enter value for event: gc current block 2-way
Enter value for threshold: 30
 
INST_ID OWNER                OBJECT_NAME                      OBJECT_TYPE          CNT
------- -------------------- -------------------------------- ------------- ----------
      3 PO                   RCV_SHIPMENT_LINES               TABLE               2228
      2 PO                   RCV_SHIPMENT_LINES               TABLE               2199
      1 PO                   RCV_SHIPMENT_LINES               TABLE               2197
      3 PO                   PO_REQUISITION_LINES_ALL         TABLE               2061
      2 PO                   PO_REQUISITION_LINES_ALL         TABLE               1985
      3                                                       Undo Block           120
...

The output of Listing 10-3 shows the tables suffering from waits for the gc current block 2-way wait event. You can supply any event name to identify the objects suffering from that event. A current_obj# of 0 or -1 indicates an undo block or an undo segment header block. If the object was dropped, then the object_name will be marked as non-existent.

Also, to improve the performance of this query, set a higher threshold limit such as 100 or 1,000 in a busy database.

image Note   ASH data is sampled, so the number of samples must be high enough for the data to be accurate. Also, you will need a Diagnostic & Tuning Pack license to access ASH data in release 12c and earlier.

In version 12c, the Pluggable Database (PDB) feature is introduced. As user objects can be in a PDB instead of the CDB root, the code shown in Listing 10-3 needs to be adjusted to access the cdb_objects view. Listing 10-4 shows the modified script; in its output, the con_id column indicates the container_id of the PDB owning the object inducing global cache wait events. You can identify the PDB name by joining to the cdb_pdbs view, as shown in the short query after the listing. In my test cluster, the container_id of the hrdb1 PDB is 3.

Listing 10-4.  Script: ash_gcwait_to_obj_12c.sql

col owner format a30
col object_name format a30
set lines 160
WITH ash_gc AS
  (SELECT inst_id, event, current_obj#,con_id, COUNT(*) cnt
  FROM gv$active_session_history
  WHERE event=lower('&event')
  GROUP BY inst_id, event, current_obj#, con_id
  HAVING COUNT (*) > &threshold )
SELECT * FROM
  (SELECT inst_id, a.con_id, nvl( owner,'Non-existent') owner ,
           nvl ( object_name,'Non-existent') object_name,
           nvl ( object_type, 'Non-existent') object_type,
           cnt
  FROM ash_gc a, cdb_objects o
  WHERE (a.current_obj#=o.object_id (+))
  AND a.current_obj#  >=1
  AND a.con_id =o.con_id (+)
  UNION
  SELECT inst_id,0, '', '', 'Undo Header/Undo block', cnt
  FROM ash_gc a WHERE a.current_obj#=0
  UNION
  SELECT inst_id,0, '', '', 'Undo Block', cnt
  FROM ash_gc a
  WHERE a.current_obj#=-1
  )
ORDER BY cnt DESC
/
Inst Cont OWNER OBJECT_NAME                    OBJECT_TYPE            CNT
---- ---- ----  ------------------------------ ---------------        ---
   2    3 RS    HUGETABLE                      TABLE                   21
   2    3 RS    HUGETABLE_HASH                 TABLE PARTITION          9
...
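To translate the con_id values in this output into PDB names, you can list the cdb_pdbs view from the root container. This is a minimal sketch, not one of the downloadable scripts.

-- Minimal sketch: map con_id values to PDB names (run from the root container).
col pdb_name format a30
SELECT con_id, pdb_name, status
FROM cdb_pdbs
ORDER BY con_id;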

Identify SQL Statement

ASH data can also be queried to identify the SQL statements associated with events. Listing 10-5 shows the script ash_gcwait_to_sql_id.sql to identify the SQL statement associated with a wait event. The output in Listing 10-5 shows that a few SQL statements were inducing waits for this event. Further review of these SQL statements, their access plans, and performance metrics would be required to diagnose the performance problem.

Listing 10-5.  Script: ash_gcwait_to_sql_id.sql

WITH ash_gc AS
  (SELECT    /*+ materialize */ inst_id, event, sql_id, COUNT(*) cnt
  FROM gv$active_session_history
  WHERE event=lower('&event')
  GROUP BY inst_id, event, sql_id
  HAVING COUNT (*) > &threshold )
SELECT inst_id, sql_id, cnt FROM ash_gc
ORDER BY cnt DESC
/
Enter value for event: gc current block 2-way
Enter value for threshold: 100
 
   INST_ID SQL_ID               CNT
---------- ------------- ----------
         3 26shktr5f1bqk       2717
         3 4rfpqz63y34rk       2332
         2 4rfpqz63y34rk       2294
 ...

With the PDB feature, the script shown in Listing 10-5 must also be modified to retrieve the container_id of the PDB; Listing 10-6 shows the modified script. The con_id column can be used to identify the PDB executing costly SQL statements.

Listing 10-6.  Script: ash_gcwait_to_sql_id_12c.sql

WITH ash_gc AS
  (SELECT    /*+ materialize */ inst_id, con_id,event, sql_id, COUNT(*) cnt
  FROM gv$active_session_history
  WHERE event=lower('&event')
  GROUP BY inst_id, event, sql_id, con_id
  HAVING COUNT (*) > &threshold )
SELECT inst_id,con_id, sql_id, cnt FROM ash_gc
ORDER BY cnt DESC
/

Note that PDB is an optional feature in version 12c, and if your database is a non-CDB, then the scripts in Listings 10-3 and 10-5 are sufficient.
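Once a costly sql_id has been identified, the next step is usually to review its execution plan. The following is a minimal sketch using the dbms_xplan package; the sql_id value shown is simply the one from the Listing 10-5 output and is used here only for illustration.

-- Minimal sketch: review the plan of a sql_id identified by the ASH queries.
-- The sql_id literal is from the Listing 10-5 output, used only as an example.
SELECT * FROM TABLE(dbms_xplan.display_cursor('26shktr5f1bqk', NULL, 'TYPICAL'));
-- If the cursor has aged out of the shared pool, retrieve the plan from AWR:
SELECT * FROM TABLE(dbms_xplan.display_awr('26shktr5f1bqk'));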

Understanding Wait Distribution

You should also identify the histogram of wait time, as averages can be misleading. Are a few occurrences of longer waits for an event causing an increase in average response time? Or is the average wait time itself elevated? AWR reports provide a section for wait event histograms too.

In Listing 10-7, histogram information about a wait event is retrieved by querying the gv$event_histogram view. The following output shows that 99% of the waits for the gc cr block 2-way wait event complete in under 4 ms. This output is from a healthy database, and the wait time is within the acceptable range.

Listing 10-7.  Script: event_histogram.sql

SELECT inst_id, event, wait_time_milli, wait_count,
      TRUNC(100*(wait_count/tot),2) per
FROM
  (SELECT inst_id, event, wait_time_milli, wait_count,
    SUM (wait_count) over(partition BY inst_id, event
                  order by inst_id
                  rows BETWEEN unbounded preceding AND unbounded following
                  ) tot
  FROM
    (SELECT * FROM gv$event_histogram
     WHERE event LIKE '%&event_name%'
     ORDER BY inst_id, event#, WAIT_TIME_MILLI
    )
  )
ORDER BY inst_id, event, WAIT_TIME_MILLI ;
Enter value for event_name: gc cr block 2-way
 
   INST_ID EVENT                          WAIT_TIME_MILLI WAIT_COUNT        PER
---------- ------------------------------ --------------- ---------- ----------
         1 gc cr block 2-way                            1     105310       2.03
         1 gc cr block 2-way                            2    3461802      66.87
         1 gc cr block 2-way                            4    1593929      30.79
         1 gc cr block 2-way                            8      15163        .29
         1 gc cr block 2-way                           16        340          0
         1 gc cr block 2-way                           32         26          0
...

Remember that gv$event_histogram gives performance metrics from the start of an instance and could be misleading. So, it is crucial to review the performance metrics during a short period of time using AWR reports or custom scripts. The query in Listing 10-8 shows the wait histogram for the past day. This query uses AWR tables to retrieve the wait_time for an event.

Listing 10-8.  Script: event_hist_from_awr.sql

set lines 160
col begin_interval_time format a30
col event_name format a30
col instance_number format 99 head "Inst"
break on begin_interval_time
SELECT snaps.begin_interval_time,
  snaps.instance_number,
  hist.event_name,
  hist.wait_time_milli,
  hist.wait_count
FROM dba_hist_event_histogram hist, DBA_HIST_SNAPSHOT snaps
WHERE snaps.snap_id           = hist.snap_id
AND snaps.instance_number     = hist.instance_number
AND snaps.begin_interval_time > sysdate-1 -- modify to specify constrained time window
AND hist.event_name           = lower('&event_name')
ORDER BY snaps.snap_id ,
  instance_number,
  wait_time_milli
/
 
BEGIN_INTERVAL_TIME            Inst EVENT_NAME           WAIT_TIME_MILLI WAIT_COUNT
------------------------------ ---- -------------------- --------------- ----------
01-NOV-12 05.00.15.638 PM         1 gc cr block 2-way                  1     110970
                                  1 gc cr block 2-way                  2    3620790
                                  1 gc cr block 2-way                  4    1708432
                                  1 gc cr block 2-way                  8      15685
                                  1 gc cr block 2-way                 16        340
                                  1 gc cr block 2-way                 32         26
                                  1 gc cr block 2-way                 64          8
                                  1 gc cr block 2-way                128          1
                                  1 gc cr block 2-way                256          1
                                  1 gc cr block 2-way               2048          1

Note that this query retrieves performance metrics for a period of one day. If you are analyzing a performance problem in a constrained time frame, then modify the query to retrieve data for that problem time frame, as shown below. If the problem is constrained to a smaller window of time, then reviewing the performance metrics over a longer period can be misleading.
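For example, the sysdate-1 predicate in Listing 10-8 can be replaced with explicit timestamp bounds. The following variant is a minimal sketch; the timestamp literals are hypothetical and should be replaced with your own problem window.

-- Minimal sketch: Listing 10-8 constrained to a specific (hypothetical) time window.
SELECT snaps.begin_interval_time,
  snaps.instance_number,
  hist.event_name,
  hist.wait_time_milli,
  hist.wait_count
FROM dba_hist_event_histogram hist, dba_hist_snapshot snaps
WHERE snaps.snap_id           = hist.snap_id
AND snaps.instance_number     = hist.instance_number
AND snaps.begin_interval_time BETWEEN
        TO_TIMESTAMP('01-NOV-12 05.00.00 PM','DD-MON-RR HH.MI.SS PM')
    AND TO_TIMESTAMP('01-NOV-12 06.00.00 PM','DD-MON-RR HH.MI.SS PM')
AND hist.event_name           = lower('&event_name')
ORDER BY snaps.snap_id, instance_number, wait_time_milli
/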

Eliminate the Network as Root Cause

Another action item is to eliminate the network as a problem. A few RAC wait events serve as a baseline and can be effectively used to rule out the network layer as the probable root cause of a performance problem. Two events primarily can be used as baseline events. These events do not suffer from busy or congestion-related waits. So, if the average wait time for these baseline events is in the 2-4 ms range, that would imply that the network can be eliminated as an issue. This technique is useful when you are debugging a stressful production performance issue.

Event gc cr grant 2-way is a lightweight event. It involves a round-trip message to a remote LMS process, with the remote LMS process responding with a grant to read the block from the disk. Both messages are typically about 300 bytes in size. The average wait time for this event is usually less than 2 ms. If the distribution of wait time for this event indicates that 90% of waits are less than 2 ms, then you can safely say that the network is probably not a problem. If the wait time is much higher than 2 ms, then it could be a network problem or a GC processing problem. This event measures the baseline performance of a small packet transfer.

Other such events are gc current block 2-way and gc cr block 2-way. Time is accounted to these two events if a database block was transferred from one instance to the FG process by an LMS process without incurring any additional performance delays. Typically, the average wait time for these events is 2-4 ms. If the event histogram for these events is mostly in the lower wait buckets, then you can safely eliminate the network as a problem. The messages for these two wait events are large, as each transfers a full database block, so they also serve to verify whether there is a problem with the Jumbo Frames setup (if Jumbo Frames are in play).
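A quick way to apply this check is to compute, per instance, the percentage of waits for the baseline events that complete within 2 ms. The following query is a minimal sketch against gv$event_histogram (cumulative since instance startup); the 2 ms cutoff simply follows the guideline above.

-- Minimal sketch: percentage of baseline-event waits completing within 2 ms.
SELECT inst_id, event,
  TRUNC(100 * SUM(CASE WHEN wait_time_milli <= 2 THEN wait_count ELSE 0 END)
            / SUM(wait_count), 2) pct_le_2ms
FROM gv$event_histogram
WHERE event IN ('gc cr grant 2-way', 'gc cr block 2-way', 'gc current block 2-way')
GROUP BY inst_id, event
HAVING SUM(wait_count) > 0
ORDER BY inst_id, event;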

Identify Problem Instances

When you have a performance problem, you can identify problem instances by reviewing the gv$instance_cache_transfer view. This view provides a detailed map of block receive time broken down by block class, receiving instance, and sending instance. Using the query in Listing 10-9, you can quickly identify the instance causing elevated GC latencies. In the following output, there is no problem: both the average CR receive time and the average current (CUR) receive time have approximately the same values in all instances.

Listing 10-9.  Script: gc_problem_instance.sql

SELECT instance ||'->' || inst_id transfer,
  class,
  cr_block cr_blk,
  TRUNC(cr_block_time     /cr_block/1000,2) avg_Cr,
  current_block cur_blk,
  TRUNC(current_block_time/current_block/1000,2) avg_cur
FROM gv$instance_cache_transfer
WHERE cr_block >0 AND current_block>0
ORDER BY instance, inst_id, class
/
TRANS CLASS                  CR_BLK     AVG_CR    CUR_BLK    AVG_CUR
----- ------------------ ---------- ---------- ---------- ----------
...
1->2  data block           51035711       1.57  930591874        1.6
...
1->2  undo block           21548056       1.05         70       1.82
1->2  undo header          17073000       1.01     272653        1.2
...
1->3  data block           35554782       1.67  804269141       1.78
...
1->3  undo block           16125459       1.06          1       2.03
1->3  undo header          25643014        .99     270995       1.15
...
2->1  data block          178936080       1.44 1382236663       1.55

The gv$instance_cache_transfer view is critically useful for quickly identifying the instance suffering from a performance issue. If there are problems in a huge cluster, I usually use this view to quickly drill down to the instance suffering from performance issues.

The following output shows that instance 3 is causing global cache slowdown. Blocks received from instance 3 have an average wait time of about 20 ms, while the average wait time for the blocks received from other instances is ∼2 ms. You should probe instance 3 to identify the reason for slow block transfers from that instance.

TRANS CLASS                  CR_BLK     AVG_CR    CUR_BLK    AVG_CUR
----- ------------------ ---------- ---------- ---------- ----------
1->2  data block           87934887       1.23    9834152       1.8
2->1  data block           28392332       1.30    1764932       2.1
...
3->1  data block           12519985      11.57    2231921      21.6
...
3->2  undo block            4676398       8.85        320      27.82

Again, the SQL statement shown in Listing 10-9 accesses the gv$instance_cache_transfer view, which accumulates metrics from the start of an instance. We need to review metrics in a more granular fashion, so review an AWR report or use my script gc_instance_cache_tfr.sql from the downloadable scripts.
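If you do not have that script handy, the AWR history of this view can be queried directly. The following is a minimal sketch that assumes the dba_hist_inst_cache_transfer view mirrors the columns of gv$instance_cache_transfer; because the counters are cumulative from instance startup, deltas between consecutive snapshots are computed with LAG.

-- Minimal sketch, assuming dba_hist_inst_cache_transfer mirrors gv$instance_cache_transfer.
-- Counters are cumulative, so per-interval deltas are derived with LAG().
SELECT begin_interval_time, transfer, class, cr_blk,
  TRUNC(cr_time/NULLIF(cr_blk,0)/1000, 2) avg_cr_ms
FROM
  (SELECT sn.begin_interval_time,
    t.instance || '->' || t.instance_number transfer,
    t.class,
    t.cr_block      - LAG(t.cr_block) OVER
      (PARTITION BY t.instance, t.instance_number, t.class ORDER BY t.snap_id) cr_blk,
    t.cr_block_time - LAG(t.cr_block_time) OVER
      (PARTITION BY t.instance, t.instance_number, t.class ORDER BY t.snap_id) cr_time
  FROM dba_hist_inst_cache_transfer t, dba_hist_snapshot sn
  WHERE t.snap_id               = sn.snap_id
  AND t.instance_number         = sn.instance_number
  AND sn.begin_interval_time    > sysdate - 1
  )
WHERE cr_blk > 0
ORDER BY begin_interval_time, transfer, class;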

RAC Wait Events

I will discuss the important RAC wait events in this section. This is not a complete list of all wait events, but a list of the most frequently encountered ones.

GC Current Block 2-Way/3-Way

Current block wait events indicate that the transferred block version is the latest version of the block. These wait events can be encountered for both read and write activities. If the block is accessed for read activity, then a lock on the resource is acquired in KJUSERPR (PR) mode. The example discussed earlier in the “BL Resources and Locks” section shows the locks in KJUSERPR mode.

In the following example, I queried a row from table t_one while connected to node 2, resulting in a disk read. Reviewing the SQL trace file, there are no global cache wait events. The reason is that the block is mastered locally, so the FG process can acquire locks on the resource directly without incurring any global cache wait. This type of locking is also called the affinity-locking scheme. The Dynamic Resource Mastering (DRM) section discusses affinity locking in detail.

RS@ORCL2:2> @tc_one_row.sql
 
        N1        FNO      BLOCK        OBJ         V1
---------- ---------- ---------- ---------- ----------
       100          4        180      75742        250
 
Trace file:
 
nam='db file sequential read' ela= 563 file#=4 block#=180 blocks=1 obj#=75742

I then queried the same row connecting to instance 1. Since the block is already cached in instance 2, the block is transferred from instance 2 to instance 1. The trace line shows that a gc current block 2-way wait event was encountered for the block with file_id=4, block_id=180. This block transfer is a 2-way block transfer, as the owning instance is the same as the resource master instance (instance 2).

SYS@ORCL1:1> @tc_one_row.sql
 
        N1        FNO      BLOCK        OBJ         V1
---------- ---------- ---------- ---------- ----------
       100          4        180      75742        250
 
Trace file:
 
nam='gc current block 2-way' ela= 629 p1=4 p2=180 p3=1 obj#=75742 tim=1350440823837572

Next, I will create the conditions for a 3-way wait event, but first let me flush the buffer cache in all three instances to start with a clean buffer cache. In the following example:

  1. I will query the row connecting to instance 1 (which loads the block into the buffer cache of instance 1).
  2. I will connect to instance 3 and query the same row. The FG process for my session will send the block request to the LMS process running in instance 2 (as instance 2 is the resource master).
  3. The instance 2 LMS process forwards the request to the LMS process in instance 1.
  4. An LMS process in instance 1 will send the block to the FG process in instance 3.

Essentially, three instances participate in a block transfer; hence, this is a 3-way block transfer.

-- alter system flush buffer_cache; from all instances
RS@ORCL1:1> @tc_one_row.sql
 
        N1        FNO      BLOCK        OBJ         V1
---------- ---------- ---------- ---------- ----------
       100          4        180      75742        250
 
RS@ORCL3:3> @tc_one_row.sql
        N1        FNO      BLOCK        OBJ         V1
---------- ---------- ---------- ---------- ----------
       100          4        180      75742        250

Trace file:

nam='gc current block 3-way' ela= 798 p1=4 p2=180 p3=1 obj#=75742

Reviewing the locks on the resource by connecting to instance 2, we see that there are two locks on the resource, held by instances 1 and 3 (owner_node equal to 0 and 2).

RS@ORCL2:2> SELECT resource_name1, grant_level, state, owner_node
     FROM v$ges_enqueue
     WHERE resource_name1 LIKE '[0xb4][0x4],[BL]%';
 
RESOURCE_NAME1                 GRANT_LEV STATE         OWNER_NODE
------------------------------ --------- ------------- ----------
[0xb4][0x4],[BL][ext 0x0,0x0]  KJUSERPR  GRANTED                2
[0xb4][0x4],[BL][ext 0x0,0x0]  KJUSERPR  GRANTED                0

Excessive waits for the gc current block 2-way or gc current block 3-way wait events are generally due to either (a) an inefficient execution plan, leading to numerous block visits, or (b) application affinity not being in play. Consider implementing application affinity if the object access is localized. Also, use the techniques discussed earlier in the section “A Generic Analysis for all Wait Events.”

GC CR Block 2-Way/3-Way

A CR mode block transfer is requested for read-only access. Consider that a block is resident in current mode in instance 2, with instance 2 holding the exclusive mode BL lock on the resource. Another session connected to instance 1 requests that block. As readers do not see uncommitted changes in Oracle Database, SELECT statements request a specific version of the block, as of the query start time. The SCN is employed for block versioning; essentially, a SELECT statement requests a version of a block consistent with an SCN. The LMS process servicing the request in instance 2 will clone the current mode buffer, verify that the SCN version is consistent with the request, and then send the CR copy of the block to the FG process.

The primary difference between CR mode transfers and current mode transfers is that, in the case of CR mode transfers, no resources or locks are maintained in the GRD for the CR buffers. Essentially, CR mode blocks do not require global cache resources or locks. The received CR copy is usable only by the requesting session and only for that specific SQL execution. That’s why Oracle Database does not acquire any lock on the BL resource for CR transfers.

Since there are no global cache locks protecting the buffer, the next execution of a SQL statement accessing that block from instance 1 would also suffer a wait for the gc cr block 2-way or gc cr block 3-way event. So, every access to the block from instance 1 triggers a new CR buffer fabrication. Even if nothing has changed in the buffer in instance 2, the FG process in instance 1 still suffers from CR wait events. CR buffers residing in instance 1 are not reusable either, since the requested query SCN will be different for every SQL execution.

The following trace line shows that a block was transferred from a resource master instance to the requesting instance with a latency of 0.6 ms. Further, file_id, block_id, and object_id information in the trace file can be used to identify the objects suffering from these two wait events. Of course, ASH data can be queried to identify the object also.

nam='gc cr block 2-way' ela= 627 p1=7 p2=6852 p3=1 obj#=76483 tim=37221074057

After executing tc_one_row.sql five times and then querying the buffer cache headers, you can see that there are five CR buffers for that block in both instances 1 and 2. Notice that the CR_SCN_BAS and CR_SCN_WRP columns have different values for each CR buffer copy. Querying gv$ges_resource and gv$ges_enqueue, you can see that there are no GC locks protecting these buffers, though.

Listing 10-10.  Buffer Status

SELECT
  DECODE(state,0,'free',1,'xcur',2,'scur',3,'cr', 4,'read',5,'mrec',
               6,'irec',7,'write',8,'pi', 9,'memory',10,'mwrite',
              11,'donated', 12,'protected', 13,'securefile', 14,'siop',
              15,'recckpt', 16, 'flashfree', 17, 'flashcur', 18, 'flashna') state,
  mode_held, le_addr, dbarfil, dbablk, cr_scn_bas, cr_scn_wrp , class
FROM sys.x$bh
WHERE obj= &&obj
AND dbablk= &&block
AND state!=0 ;
Enter value for obj: 75742
Enter value for block: 180

STATE       MODE_HELD LE        DBARFIL     DBABLK CR_SCN_BAS CR_SCN_WRP      CLASS
---------- ---------- -- -------------- ---------- ---------- ---------- ----------
cr                  0 00              1      75742  649314930       3015          1
cr                  0 00              1      75742  648947873       3015          1
cr                  0 00              1      75742  648926281       3015          1
cr                  0 00              1      75742  648810300       3015          1
cr                  0 00              1      75742 1177328436       3013          1

CR buffer generation is a special case, and no global cache locks are acquired to protect CR buffers. It is possible to create a CR storm if there are long-pending transactions on highly accessed objects. Hence, it is prudent to schedule batch processes that update heavily accessed tables during a less busy timeframe.

GC CR Grant 2-Way/GC Current Grant 2-Way

The gc cr grant 2-way and gc current grant 2-way wait events are encountered if the requested block is not resident in any buffer cache. The FG process requests the block from the LMS process, but the block is not resident in any buffer cache. So, the LMS process replies with a grant message to the FG process to read the block from the disk. The FG process reads the block from the disk and continues processing.

The following trace line shows that, for the block access with file_id=4 and block_id=180, a grant response was received by the FG process from the LMS process. The subsequent line shows that a physical read was performed to read the block from the disk.

nam='gc cr grant 2-way' ela= 402 p1=4 p2=180 p3=1 obj#=75742
nam='db file sequential read' ela= 553 file#=4 block#=180 blocks=1 obj#=75742

Excessive waits for these events imply either that the buffer cache is undersized or that the SQL statements are aggressively flushing the buffer cache. Identify SQL statements and objects suffering from waits for the events, and tune those SQL statements.

The DRM feature is designed to reduce the occurrences of waits for the grants.

GC CR Block Busy/GC Current Block Busy

Busy events indicate that the LMS process performed additional work due to concurrency-related issues. For example, to build a CR block, the LMS process may have to apply undo records to reconstruct a block version consistent with the query SCN. While transferring the block back to the FG process, LMS will mark the block transfer as having suffered from the gc cr block busy event or the gc current block busy event, depending upon the type of block transfer.

GC CR Block Congested/GC Current Block Congested

If the LMS process did not process a request within 1 ms of receiving it, then the LMS process marks the response to indicate that the block transfer suffered from congestion. There are many reasons for congestion-related wait events: the LMS process may be overwhelmed by numerous global cache requests, suffering from CPU scheduling delays, or starved of another resource (such as memory).

Typically, the LMS process runs with real-time CPU scheduling priority, so CPU scheduling delays should be minimal. An excessive number of waits for these events would indicate that there was a sudden spike in global cache requests and the LMS process was not able to process the requests quickly enough. Memory starvation on the server can also lead to paging of the LMS process, affecting global cache performance.

You should review the sending-side metrics to understand why the LMS process is not able to process requests efficiently.
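As a quick real-time check, you can also look at what the LMS processes themselves are currently waiting on. The following query is a minimal sketch, not a book script; the program LIKE pattern is an assumption about how the LMS background processes are named in gv$session.

-- Minimal sketch: current state and wait event of the LMS background processes.
-- The program pattern is an assumption about background process naming.
SELECT inst_id, program, event, state, seconds_in_wait
FROM gv$session
WHERE type = 'BACKGROUND'
AND program LIKE '%LMS%'
ORDER BY inst_id, program;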

Placeholder Wait Events

After sending a message to the LMS process to access a block, an FG process must wait for a response from the LMS process. RAC code is instrumented such that the FG process waits on a placeholder event such as gc cr request or gc current request. The response from the LMS process indicates the type of delay encountered while processing the block. After receiving the response, the FG process accounts the time correctly to an actual wait event.

Generally, these placeholder wait events have no significance, since the actual wait time is recorded against another wait event. If there is no response after a 0.5-second wait (6 seconds on the Windows platform), the FG process declares the block lost, accounts the time to the gc lost block wait event, and then resends the message. These placeholder events are visible only in the v$session/v$session_wait family of views and indicate real-time session waits. As 500 ms is a long time compared to 2–3 ms, you should probe to identify why the reply was not received within 500 ms.

In a healthy database, the amount of time accounted toward these two placeholder wait events will be negligible. If you see many processes waiting on these placeholder wait events, then review other events, such as gcs log flush sync, or background waits to identify the performance delay.
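A quick way to see whether the placeholder events are piling up right now is to count the sessions currently waiting on them. This is a minimal sketch against gv$session; in a healthy database it should return few or no rows.

-- Minimal sketch: sessions currently waiting on the placeholder events.
SELECT inst_id, event, COUNT(*) cnt
FROM gv$session
WHERE state = 'WAITING'
AND event IN ('gc cr request', 'gc current request')
GROUP BY inst_id, event
ORDER BY inst_id, event;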

Sending-Side Analysis

Many analysts focus exclusively on the performance metrics of the receiving side while troubleshooting a global cache performance issue. The RAC wait events suffered by the FG processes instrument the performance of the receiving side.

It is equally important to understand the performance metrics of the sending side. You need to understand that a slowdown on the sending side can lead to longer wait times for FG processes in other nodes. Focusing on the sending side is even more important if the histogram of wait events indicates latencies in the higher-order buckets.

Fortunately, Oracle Database is instrumented with extensive statistics to perform a breakdown analysis on the sending side too. Before you can understand the metrics, you should understand the internal mechanics involved in a global cache block transfer. Figure 10-3 provides a rough overview of global cache processing. In this figure, only two instances are shown to improve readability. The FG process connected to instance 1 tries to access a block, constructs a GC message, and sends the message to the LMS process running in instance 2. Assume that the block is resident in the buffer cache of instance 2 and that the block is also mastered by instance 2.

9781430250449_Fig10-03.jpg

Figure 10-3. LMS processing mechanics

The FG process uses OS calls to send messages. These system calls in turn call network layer kernel functions to send the message through the network hardware. These system calls consume CPU in kernel mode, so higher global cache traffic can induce higher CPU usage in kernel mode too.

An LMS process in instance 2 receives the message. If the LMS process has generated any redo to reconstruct the block, either for CR or for current mode, the LMS process will request the LGWR process to perform a log flush. While LGWR is performing the log flush, the LMS process accounts the wait time to the gcs log flush sync event. After writing the log buffer to the redo log files, LGWR wakes up the LMS process, and LMS sends the block to the FG process in instance 1. The packet is delivered to the FG process through the interconnect.

The FG process reads the packets from network buffers, copies the block to buffer cache, and continues processing.

A block transfer involves numerous atomic steps. Time waited on each of these atomic steps is instrumented as a statistic. Latency to receive a block in CR mode in an instance can be written using the following formula.

gc cr block receive time= Time to send message to a remote LMS process by FG

  • + Time taken by LMS to build block
  •              (statistics: gc cr block build time)
  • + LMS wait for LGWR latency
  •              ( statistics: gc cr block flush time)
  • + LMS send time
  •              (Statistics: gc cr block send time)
  • + Wire latency.
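The components of this formula can also be derived from the cumulative statistics in gv$sysstat. The following query is a minimal sketch; it assumes these time-based statistics are reported in hundredths of a second, so the values are multiplied by 10 to express the averages in milliseconds.

-- Minimal sketch: average receiving-side and sending-side times for CR transfers,
-- assuming the time statistics are in hundredths of a second (x10 converts to ms).
SELECT inst_id,
  TRUNC(10 * SUM(DECODE(name, 'gc cr block receive time', value)) /
        NULLIF(SUM(DECODE(name, 'gc cr blocks received', value)), 0), 2) avg_receive_ms,
  TRUNC(10 * SUM(DECODE(name, 'gc cr block build time', value)) /
        NULLIF(SUM(DECODE(name, 'gc cr blocks served', value)), 0), 2) avg_build_ms,
  TRUNC(10 * SUM(DECODE(name, 'gc cr block flush time', value)) /
        NULLIF(SUM(DECODE(name, 'gc cr blocks served', value)), 0), 2) avg_flush_ms,
  TRUNC(10 * SUM(DECODE(name, 'gc cr block send time', value)) /
        NULLIF(SUM(DECODE(name, 'gc cr blocks served', value)), 0), 2) avg_send_ms
FROM gv$sysstat
WHERE name IN ('gc cr blocks received', 'gc cr block receive time',
               'gc cr blocks served',   'gc cr block build time',
               'gc cr block flush time','gc cr block send time')
GROUP BY inst_id
ORDER BY inst_id;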

The AWR report also captures these performance metrics and prints the average time taken for the atomic steps involved in a block transfer. The following section from an AWR report shows an example. First, the receive metrics are printed, with averages of 1.9 ms and 1.8 ms for a block transfer in CR and current mode, respectively. The next few lines print the sending-side metrics for the LMS process. In this example, the average CR block flush time is 8.3 ms, and only 27% of the block transfers waited for LGWR to complete a log flush. This AWR report shows no sign of a performance issue.

Global Cache and Enqueue Services - Workload Characteristics
∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼
  Avg global cache cr block receive time (ms)               :      1.9
  Avg global cache current block receive time (ms)          :      1.8
 
    Avg global cache cr block build time (ms)               :      0.0
    Avg global cache cr block send time (ms)                :      0.0
    Global cache log flushes for Cr blocks served %         :     27.2
    Avg global cache cr block flush time (ms)               :      8.3

In contrast, the following AWR report was captured when there was a problem on the sending side. In this example, the average CR block receive time is 222 ms in instance 1. Since the receive time in instance 1 is high, we need to review the sending-side metrics in the other nodes; the AWR report from instance 2 shows a glaring problem. The average global cache CR block flush time is very high, at an average of 15.8 seconds per CR block transfer. From this AWR report, we can infer that the performance problem suffered by the LGWR process in instance 2 is affecting the receive time in other instances, notably instance 1.

Instance 1:
 
  Avg global cache cr block receive time (ms)              :    222.1
  Avg global cache current block receive time (ms)         :     27.5
 
Instance 2:
 
    Avg global cache cr block build time (ms)              :      0.0
    Avg global cache cr block send time (ms)               :      0.1
    Global cache log flushes for cr blocks served %        :      2.7
    Avg global cache cr block flush time (ms)              :  15879.9

You may need to review the performance metrics in instance 2 further. The AWR report from instance 2 confirms that the log writer was suffering from I/O-related latency, as the average latency for the log file sync wait event is greater than 11 seconds. (Normally, this event has a latency of a few milliseconds.) In a nutshell, the LMS process was waiting for LGWR to complete the log flush, accounting the wait time to the gc cr block flush time statistic, and the FG process in instance 1 was waiting for the LMS process to respond.

Top 5 Timed FG Events
∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼
                                                           Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
log file sync                         2,054      23,720  11548   45.8 Commit
gc buffer busy acquire               19,505      10,382    532   20.0 Cluster
gc cr block busy                      5,407       4,655    861    9.0 Cluster
enq: SQ - contention                    140       3,432  24514    6.6 Configurat
db file sequential read              38,062       1,305     34    2.5 User I/O

Similarly, the sending-side metrics for a current mode block transfer can be broken down into individual statistics. The latency to receive a block in current mode can be written as follows:

Gc current block receive time = Time taken to send a message to the LMS process by the FG

  • + (LMS) Time taken to pin block
  •              (statistics: gc current block pin time)
  • + (LMS) wait for log flush sync
  •              ( statistics: gc current block flush time)
  • + (LMS) send time
  •              (statistics: gc current block send time)
  • + Wire latency.

Notice that the atomic steps involved in processing a current mode transfer are slightly different from those involved in a CR mode transfer. The statistic gc current block pin time indicates the time taken by the LMS process to pin the block. Higher latency for this statistic indicates that the block is currently pinned by another process and implies that the block was busy.

Your strategy to debug cache fusion performance issue should cover performance metrics of both sending side and receiving side. Sending-side latency can lead to longer receiving-side latency.

Block Types Served

The pattern of block types served can be used to understand global cache workload patterns. The v$cr_block_server view provides a breakdown of CR requests into various types of requests. For example, the output in Listing 10-11 shows that in instance 1, 51% of CR requests were for data blocks, 31% for undo blocks, and 15% for undo header blocks. From this output, you can infer that application affinity would greatly reduce the GC traffic, as nearly half of the blocks served are undo blocks and undo header blocks.

Listing 10-11.  Block Types Served

SELECT inst_id,
  TRUNC(data_requests /tot_req,2) * 100 data_per,
  TRUNC(undo_requests /tot_req,2) * 100 undo_per,
  TRUNC(tx_requests   /tot_req,2) * 100 tx_per,
  TRUNC(other_requests/tot_req,2) * 100 other_per
FROM
  (SELECT inst_id, cr_Requests + current_Requests tot_req,
    data_requests, undo_requests, tx_requests, other_requests
    FROM gv$cr_block_server
   )
ORDER BY inst_id;
 
   INST_ID   DATA_PER   UNDO_PER     TX_PER  OTHER_PER
---------- ---------- ---------- ---------- ----------
         1         51         31         15          0
         2         47         47          5          0
         3         49         42          7          0

GCS Log Flush Sync

When a session issues a commit, that action triggers LGWR to write the log buffer entries to the redo log files. The FG process waits for LGWR to complete the log file write. Until the LGWR process responds, the FG process accounts the wait time to the log file sync event. The action of LGWR writing from the log buffer to the log file makes the transaction permanent. Even if the database crashes at that time, the transaction is not lost, as it is permanently stored in the redo log file. Essentially, the durability in the ACID properties of a database is satisfied with this mechanism (ACID = Atomicity, Consistency, Isolation, and Durability).

In RAC, since multiple buffer caches are fused, a mechanism similar to log file sync is employed by the LMS process to maintain durability. LMS requests a log flush before sending a block if undo records were applied to create a CR copy of the block or if the block contains recent uncommitted transactions. The LMS process waits for LGWR to complete the log flush and post it to continue, accounting the wait time to the gcs log flush sync wait event. While the LMS process is waiting for LGWR, the FG processes in other instances are waiting on global cache wait events. A prolonged wait by the LMS process can therefore explode into massive gc buffer busy waits.

Therefore, it is very important to have consistent LGWR performance in RAC. Any delay in LGWR processing will be magnified as an excessive amount of global cache waits.

If your database is suffering from high gc buffer busy waits or other RAC wait events, and the average wait time for those events is much higher than normal, then probe to see whether gcs log flush sync waits in other nodes are causing the problem, and review the histogram of wait times for the gcs log flush sync wait event. Script event_histogram.sql in Listing 10-5 can be used to understand the waits suffered by the LMS process. The following output of the script shows a healthy database in which about 92% of waits complete within 2 ms. (As always, use an AWR report or a custom script to review the histogram during a specific period of time.)

@event_histogram
Enter value for event_name: gcs log flush sync
 
   INST_ID EVENT                          WAIT_TIME_MILLI WAIT_COUNT        PER
---------- ------------------------------ --------------- ---------- ----------
         1 gcs log flush sync                           1  336241301      78.73
         1 gcs log flush sync                           2   58520524       13.7
         1 gcs log flush sync                           4   22517131       5.27
         1 gcs log flush sync                           8    7629771       1.78
         1 gcs log flush sync                          16    2103830        .49
         1 gcs log flush sync                          32      67629        .01
         1 gcs log flush sync                          64        608          0
         1 gcs log flush sync                         128         26          0

The following output shows a database with high wait times for the gcs log flush sync event. This increase in wait time resulted in longer global cache waits from the perspective of the application processes.

@event_histogram.sql
Enter value for event_name: gcs log flush sync
   INST_ID EVENT                          WAIT_TIME_MILLI WAIT_COUNT        PER
---------- ------------------------------ --------------- ---------- ----------
         1 gcs log flush sync                           1         28        .07
         1 gcs log flush sync                           2         24        .06
         1 gcs log flush sync                           4         31        .08
         1 gcs log flush sync                           8         33        .08
         1 gcs log flush sync                          16      35757      95.96
         1 gcs log flush sync                          32       1378       3.69
         1 gcs log flush sync                          64          6        .01
         1 gcs log flush sync                         128          2          0

If LMS processes suffer from high gcs log flush sync waits, then you need to understand why LGWR is not responding fast enough. In most cases, that requires a review of the output of the script lfsdiag.sql, supplied by Oracle Support, to understand the breakdown of the wait time suffered by the LGWR process.
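As a quick first check before running lfsdiag.sql, you can review the latency histogram of LGWR's own write event. This is a simple sketch against gv$event_histogram (the same view used by event_histogram.sql), assuming the standard event name log file parallel write:

SELECT inst_id, event, wait_time_milli, wait_count
  FROM gv$event_histogram
 WHERE event = 'log file parallel write'
 ORDER BY inst_id, wait_time_milli;

If most log file parallel write waits already fall in the single-digit millisecond buckets, the delay is more likely in LGWR scheduling or posting than in redo I/O.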

Defending LMS Process

As the performance of the LMS process is critical, a few features protect the LMS process from overworking.

As I/O events usually exceed 5 ms, and global cache latency is on the order of a few milliseconds, an I/O call initiated by the LMS process would induce longer latencies on the FG processes. So, the LMS process does not initiate any I/O calls at all. If the LMS process would have to initiate an I/O, for example, to read undo blocks required for CR fabrication because those undo blocks are no longer in the cache, then the LMS process downgrades the lock on the BL resource and sends the buffer to the requestor. The requesting FG process performs the I/O calls and completes the CR fabrication. This feature is known as the “Light Works Rule.”

If a block is requested excessively, as detected by LMS process, then LMS process can downgrade the BL lock and transfer the block to the remote FG process. For example, if a block is requested in CR mode aggressively, then LMS process can downgrade the lock to a null mode and transfer the block to the FG process. This feature is known as “Fairness Down Convert.”

An overworked LMS process can trigger delays in global cache transfers. These features protect the LMS process from working too hard due to an onslaught of excessive global cache requests. View gv$cr_block_server provides statistics on these features.
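For example, the following is a minimal sketch to review how often these defenses were triggered (column names as documented for gv$cr_block_server; verify them in your release):

SELECT inst_id, cr_requests, current_requests,
       light_works, fairness_down_converts, flushes
  FROM gv$cr_block_server
 ORDER BY inst_id;

A rapidly increasing light_works or fairness_down_converts count suggests that the LMS processes are shedding work back to the requesting instances.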

GC Buffer Busy Acquire/Release

Wait events gc buffer busy acquire or gc buffer busy release are usually side effects of another root cause issue. In a single-instance database, if a session s1 is trying to read a block from the disk to the buffer cache, and if another session s2 has issued a read call already for that block and has not completed the read call yet, then session s1 will wait for the read call issued by session s2 to complete. Session s1 will account wait time to an event, namely, read by other session. Additionally, subsequent sessions trying to read the same block, while the read call is still pending, will wait and account time to read by other session event. Further, if a buffer is pinned by a session s2, then another session s1 trying to access the same buffer will wait for the buffer busy wait event.

In RAC, the gc buffer busy wait event is similar to its single-instance counterparts read by other session and buffer busy waits. A wait for this event indicates that there is a global cache operation pending on a buffer; subsequent processes must wait for that global cache operation to complete before accessing the buffer. For example, process p1 may have requested a BL lock on the block, but the GCS lock has not been completely acquired yet. Another process p2 trying to access that buffer waits on the gc buffer busy event, essentially yielding to process p1 to complete the GC operation.

Rarely is the gc buffer busy event itself a root cause. Typically, it is a side effect of another problem. Here is an example: in the following AWR report section for instance 1, 10,382 seconds were spent waiting for gc buffer busy acquire during an AWR period of 30 minutes, and the root cause was that LGWR was not able to complete log flushes quickly enough in instance 2. So, the LMS process in instance 2 was stuck waiting for LGWR to complete a log flush. Meanwhile, a few processes requesting blocks in instance 1 were waiting on gc cr request or gc current request waits. Subsequent processes accessing the same blocks were waiting on the gc buffer busy acquire event, since a global cache request was already pending and the request was initiated from the local node.

Top 5 Timed Foreground Events
∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼
                                                           Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
log file sync                         2,054      23,720  11548   45.8 Commit
gc buffer busy acquire               19,505      10,382    532   20.0 Cluster
gc cr block busy                      5,407       4,655    861    9.0 Cluster
enq: SQ - contention                    140       3,432  24514    6.6 Configurat
db file sequential read              38,062       1,305     34    2.5 User I/O

To troubleshoot gc buffer busy wait events, initially review events suffered by the background processes. In many production problems, slowness in the database background processes is the root cause of gc buffer busy event waits. It is also important to understand the difference between gc buffer busy acquire and gc buffer busy release wait events. The following output from v$session_wait view differentiates these two events nicely:

  1. Notice that session 53 is waiting to access the block with file_id=10 and block_id=560651 and that session is currently waiting for gc current request event (a placeholder wait event as discussed).
  2. Session 47 connected to instance 2 is waiting for log file sync event. LMS process is also waiting for log file sync wait event (not shown in the output, though).
  3. In instance 1, all other sessions trying to access that block with file_id=10 and block_id=560651 are waiting for gc buffer busy acquire wait event.
  4. Meanwhile, sessions connected to instance 2, which are trying to access that block, are waiting for gc buffer busy release wait event.
INST_ID    SID EVENT                          STATE   WIS   P1_P2_P3_TEXT
------- ------ ---------------------------- ------- ----- -------------------------------
      1     53 gc current request            WAITING    26 file# 10-block# 560651-...
      1     40 gc buffer busy acquire        WAITING    26 file# 10-block# 560651-class# 1
      1     60 gc buffer busy acquire        WAITING    26 file# 10-block# 560651-class# 1
      1     59 gc buffer busy acquire        WAITING    26 file# 10-block# 560651-class# 1
      1     58 gc buffer busy acquire        WAITING    26 file# 10-block# 560651-class# 1
...
      2      1 gc buffer busy release        WAITING     1 file# 10-block# 560651-class# 1
      2     68 gc buffer busy release        WAITING     1 file# 10-block# 560651-class# 1
      2     65 gc buffer busy release        WAITING     0 file# 10-block# 560651-class# 1
...
      2     47 log file sync                 WAITING    22 buffer# 4450-sync scn 30839721-
      2     43 gc buffer busy release        WAITING     1 file# 10-block# 560651-class# 1
      2     36 gc buffer busy release        WAITING     0 file# 10-block# 560651-class# 1
...

Essentially, sessions connected to instance 1 are waiting for the global cache locks to be acquired by instance 1, and so they account wait time to the gc buffer busy acquire event. Sessions connected to instance 2 are waiting for instance 1 to release the gc lock, and so their time is accounted to the gc buffer busy release event. Since global cache BL locks are owned by an instance, not by an individual session, sessions in instance 1 wait for their own instance to acquire the lock, while sessions in instance 2 wait for instance 1 to release it.
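The output above was formatted from gv$session_wait. A minimal sketch of such a query follows; the WIS and P1_P2_P3_TEXT columns are formatting conveniences built from the view columns, not columns of the view itself:

SELECT inst_id, sid, event, state,
       seconds_in_wait wis,
       p1text||' '||p1||'-'||p2text||' '||p2||'-'||p3text||' '||p3 p1_p2_p3_text
  FROM gv$session_wait
 WHERE event LIKE 'gc%'
    OR event = 'log file sync'
 ORDER BY inst_id, sid;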

This differentiation is useful to identify the instance that initiated a block request. Next, we need to identify the object suffering from an excessive amount of gc buffer busy events. SQL in Listing 10-3 is handy to identify the objects inducing this wait event. The following output shows that index lines_u1 is suffering from an excessive number of gc buffer busy acquire wait events. Depending upon the type of object, a different troubleshooting method will be employed. Again, if the problem is constrained to a specific time interval, then you should review the AWR report for that time frame and focus on the section listing segments suffering from gc buffer busy waits.

@ash_gcwait_to_obj.sql
Enter value for event: gc buffer busy acquire
Enter value for threshold: 1000
/
      INST OWNER      OBJECT_NAME                      OBJECT_TYP        CNT
---------- ---------- -------------------------------- ---------- ----------
         1 RS         LINES_U1                         INDEX         1205710
         2 AR         RA_CUSTOMER_TRX_ALL              TABLE            4399
         1 AR         RA_CUSTOMER_TRX_ALL              TABLE            4241
         1 WSH        WSH_DELIVERY_DETAILS_IDX1        INDEX            4106
...
         1                                             Undo Header      1481
                                                       /Undo block

The following section discusses the common object types causing the gc buffer busy wait event and a method to diagnose further.

Unique Indexes

In a B-tree index, indexed column values are stored in (key[s], rowid) format in ascending or descending key column order. If the column values are populated using sequence-generated values, then all new index entries will be inserted into the rightmost leaf block of the index. This type of index growth is known as right-hand index growth. If many sessions are trying to insert rows into the table, then the rightmost leaf block will be a hot block. In the case of a single-instance database, this concurrency issue will cause waits for read by other session and buffer busy wait events. In a RAC database, this problem will be magnified as gc buffer busy waits. If the application inserts into that table from all instances, then the rightmost leaf block has to be transferred between the instances aggressively, similar to a Hot Potato game played by children in the United States.

Almost-unique indexes can cause gc buffer busy waits too; indexes on a timestamp column are an example. Column values inserted concurrently will have the same timestamp, or timestamp values close enough, that those values are stored in a few leaf blocks of the index B-tree. So, concurrent inserts into that table induce contention on a few leaf blocks.

You will also notice that the contention moves from one block to another, as leaf blocks fill up and new blocks are added to the rightmost part of the index structure.

Right-hand growth index contention can be resolved by hash partitioning the index on key column values. By hash partitioning the index, values are distributed among multiple leaf blocks, thereby effectively spreading the activity to many blocks. From 10g onward, indexes can be partitioned even if the table is not partitioned.
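As an illustration only (the schema, index, and column names below are hypothetical, modeled on the lines_u1 index from the earlier output), a right-growing unique index could be re-created as a global hash-partitioned index:

CREATE UNIQUE INDEX rs.lines_u1 ON rs.lines (line_id)
  GLOBAL PARTITION BY HASH (line_id) PARTITIONS 32;

The partition count should reflect the insert concurrency of the application, and the existing index would need to be dropped or the change performed during a maintenance window.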

Index unique scan and index range scan access paths that specify at least all partitioning key columns will perform on a hash-partitioned index similarly to a nonpartitioned index. However, if your application performs numerous index range scans without providing all partitioning key columns, then hash partitioning will increase logical I/O per execution.

Reverse key indexes can also be used to combat right-hand growth contention. However, reverse key indexes do not allow a range scan access path, so there are very few reasons to choose a reverse key index over a hash-partitioned index. One of those reasons is that partitioning requires a license for the Partitioning option.
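In that case, the reverse key alternative can be implemented by rebuilding the existing index (a sketch using the hypothetical index name from the previous example):

ALTER INDEX rs.lines_u1 REBUILD REVERSE;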

Table Blocks

If the object type suffering from the gc buffer busy wait event is a table, then it is imperative that you identify the type of block inducing the waits. Listing 10-6 provides a SQL statement querying gv$active_session_history data to identify the blocks inducing gc buffer busy acquire waits.

After identifying the segment, you must determine the type of the block suffering from contention. The output in Listing 10-12 shows that the block with file_id=49 and block_id=635413 belongs to the fnd_concurrent_processes table (a familiar E-Business Suite table). You may have to dump the data block to identify the block type.

Listing 10-12.  Ash_gcwait_to_block.sql

col inst format 9999
col current_file# format 99999  head file
col current_block# format 9999999 head blk
WITH ash_gc AS
  (SELECT * FROM
    (SELECT /*+ materialize */ inst_id, event, current_obj#, current_file#,
      current_block#, COUNT(*) cnt
    FROM gv$active_session_history
    WHERE event=lower('&event')
    GROUP BY inst_id, event, current_obj#,
      current_file#, current_block#
    HAVING COUNT(*) >5
    )
  WHERE rownum <101
  )
SELECT * FROM
  (SELECT inst_id, owner, object_name, object_type, current_file#,
    current_block#, cnt
  FROM ash_gc a, dba_objects o
  WHERE (a.current_obj#  =o.object_id (+))
  AND a.current_obj#  >=1
  UNION
  SELECT inst_id, '', '', 'Undo Header/Undo block' ,
    current_file#, current_block#, cnt
  FROM ash_gc a
  WHERE a.current_obj#=0
  UNION
  SELECT inst_id, '', '', 'Undo Block' ,
    current_file#, current_block#, cnt
  FROM ash_gc a
  WHERE a.current_obj#=-1
  )
ORDER BY 7 DESC
/
Enter value for event: gc buffer busy acquire
 
     INST OWNER      OBJECT_NAME                      OBJECT_TYP   file     blk     CNT
---------- ---------- -------------------------------- ---------- ------ ------- -------
         1 APPLSYS    FND_CONCURRENT_PROCESSES         TABLE          49  635413    2039
         1 APPLSYS    FND_CONCURRENT_PROCESSES         TABLE         301  954692      18
         1 SYSTEM     AQ$_QUEUE_TABLES                 TABLE           4    1129      18
         1 APPLSYS    FND_CONCURRENT_QUEUES            TABLE         382  123951      17
         1                                             Undo Block      0       0      15
...

By dumping the data block using the following command9 and reading through the trace file, you can identify the type of the block. Oracle Support can also help to read the trace file. Table 10-1 shows possible block types and methods to resolve the contention.

alter system dump datafile 49 block min 635413 block max 635413;

Table 10-1. Block Type and Possible Issues

Block type                    Comment
Segment header                Probable freelists contention. Alter the table to have multiple freelist groups.
Table block                   Possibly code updating a few rows of the table aggressively, concurrent deletes and inserts of a few rows, or SQL statements scanning objects aggressively. Partitioning or a code change may be required.
ASSM blocks                   ASSM bitmap block contention. Usually requires further analysis from Oracle Support.
Index root or branch blocks   Heavy concurrent inserts and deletes.
Undo header blocks            Numerous short transactions?
Undo blocks                   Long-pending transactions and excessive CR generation.
Dictionary                    Possible sequence issues if the block belongs to the SEQ$ table. Requires further analysis.
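For the segment header case in Table 10-1, the following is a hedged sketch using hypothetical object names. The FREELISTS change applies only to segments in manual segment space management tablespaces, and depending on the release FREELIST GROUPS may require re-creating the segment; moving the segment to an ASSM tablespace is an alternative that removes freelist management altogether:

ALTER TABLE app_owner.hot_table STORAGE (FREELISTS 8);
-- Alternative: relocate the segment to an ASSM tablespace; indexes become UNUSABLE and must be rebuilt afterward.
ALTER TABLE app_owner.hot_table MOVE TABLESPACE users_assm;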

If your database is a 12c CDB, then join to cdb_objects to identify the PDB associated with the object. The following script is useful in a container database to identify the block in a CDB.

Listing 10-13.  Ash_gcwait_to_block_12c.sql

col inst format 9999
col current_file# format 99999  head file
col current_block# format 9999999 head blk
WITH ash_gc AS
  (SELECT * FROM
    (SELECT /*+ materialize */ inst_id, con_id, event, current_obj#, current_file#,
      current_block#, COUNT(*) cnt
    FROM gv$active_session_history
    WHERE event=lower('&event')
    GROUP BY inst_id, event, current_obj#,
      current_file#, current_block#, con_id
    HAVING COUNT(*) >5
    )
  WHERE rownum <101
  )
SELECT * FROM
  (SELECT a.inst_id, a.con_id, owner, object_name, object_type, current_file#,
    current_block#, cnt
  FROM ash_gc a, cdb_objects o
  WHERE (a.current_obj#  =o.object_id (+) and a.con_id=o.con_id(+) )
  AND a.current_obj#  >=1
  UNION
  SELECT inst_id, a.con_id, '', '', 'Undo Header/Undo block' ,
    current_file#, current_block#, cnt
  FROM ash_gc a
  WHERE a.current_obj#=0
  UNION
  SELECT inst_id,a.con_id, '', '', 'Undo Block' ,
    current_file#, current_block#, cnt
  FROM ash_gc a
  WHERE a.current_obj#=-1
  )
ORDER BY 7 DESC
/

In summary, waits for the gc buffer busy acquire/release events are a symptom of another problem, and you need to investigate the layers underneath to identify the root cause.

DRM

As discussed earlier in this chapter, the mastership of gc resources is distributed among all active instances. If a resource is locally mastered, then the lock on that resource can be acquired using the affinity-locking scheme. With the affinity-locking scheme, the FG process can query GRD in the local instance, identify current locks, and acquire locks typically without any involvement of the background processes. This affinity-locking scheme improves the performance of the application, as the waits for grants are eliminated or reduced.

What if an object obj1 is accessed heavily by one instance? In that scenario, wouldn’t it be better to remaster that object to that instance so that every block of obj1 is locally mastered? With locally mastered locks, the application can perform better, since waits for grants will be minimal. That’s the idea behind the DRM feature. For example, a tkprof output of a batch program accessing a few tables aggressively is shown as follows. Over 1,050 seconds were spent waiting for grants. These waits for grants can be avoided if those heavily accessed objects are locally remastered.

Elapsed times include waiting on the following events:
 
Event waited on                             Times   Max. Wait  Total Waited
----------------------------------------   Waited  ----------  ------------
gc cr grant 2-way                            271931              1.39         905.49
...
gc cr grant congested                          4249              1.47         150.58
...
gc current grant 2-way                           98              0.00           0.22

Overview of DRM Processing

RAC background processes keep track of global cache lock requests at an object level. I should emphasize that it is not object access but rather gc lock requests on an object that are tracked. If the number of BL lock requests from an instance exceeds a threshold in an observation window, then DRM is triggered for that object. After the completion of DRM, the object will be locally mastered, and future access to that object from that instance will use the affinity-locking scheme, reducing waits for grants. Three RAC background processes are at the heart of the DRM implementation:

  1. LCK0 process maintains object-level statistics and decides if remastering should be triggered on an object.
  2. If an object is chosen, a request is queued. LMD process reads the request queue and initiates GRD freeze. In the following LMD trace file, pkey indicates the object_id of the remastered object. Dissolve pkey means that all resources currently mastered by the local instance must be removed from local instance memory.
    *** 2010-01-08 19:41:26.726
    * kjdrchkdrm: found an RM request in the request queue
    Dissolve pkey 6984390
    *** 2010-01-08 19:41:26.727
    Begin DRM(189) - dissolve pkey 6984390 from 2 oscan 1.1
    ftd received from node 1 (8/0.30.0)
    ftd received from node 0 (8/0.30.0)
    ftd received from node 3 (8/0.30.0)
    all ftds received
  3. LMON coordinates reconfiguration using LMS processes. Reconfiguration goes through many stages. DRM reconfiguration is similar to the reconfiguration triggered after the instance crash, but less severe. Reconfiguration triggered after an instance crash will remaster every resource mastered by the failed instance, whereas reconfiguration triggered for DRM will remaster only the resource identified by LCK0 process.
    *** 2010-01-08 19:41:26.793
    Begin DRM(189)
    sent syncr inc 8 lvl 5577 to 0 (8,0/31/0)
    synca inc 8 lvl 5577 rcvd (8.0)

View x$object_policy_statistics externalizes the BL lock requests and shows locking activity at the object level. As visible in the query output from x$object_policy_statistics, there were many Shared BL lock requests on object 12189604. In the following output, columns SOPENS and XOPENS track counts of BL lock requests in shared and exclusive mode, respectively.

SYS@ORCL1:> SELECT node, object, sopens, xopens, xfers, dirty
    FROM x$object_policy_statistics
    ORDER BY sopens+xopens
/
      NODE     OBJECT     SOPENS     XOPENS      XFERS      DIRTY
---------- ---------- ---------- ---------- ---------- ----------
...
         1   12189628     203303          1     127563         43
         3   11877357     247303        249      53641         16
         1    1698508     332013          0          0          0
         1    9972722     367830          0        181          0
         1   12189604     383890          2      22008          0
 

View gv$policy_history shows details about DRM activity. This view provides the object involved in a DRM activity, the target instance the object was remastered to, and the date of the remastering event. In the following output, object_id 76739 was remastered on 12/30/2011 to instance 1. Initiate affinity is the policy event for remastering activity.

RS@ORCL1> select * from gv$policy_history;
 
INST_ID    POLICY_EVENT         DATA_OBJECT_ID TARGET_INSTANCE_NUMBER EVENT_DATE
---------- -------------------- -------------- ---------------------- --------------------
         1 initiate_affinity             76739                      1  12/30/2011 11:43:06

View gv$gcspfmaster_info shows current mastership information about remastered objects. The underlying x$ table is used by the Oracle code to check whether an object is remastered. For every BL lock request, RDBMS code checks if the object is locally mastered; if the object is visible in the underlying x$ table of v$gcspfmaster_info, then the affinity locking code is executed.

RS@ORCL1>select * from v$gcspfmaster_info where object_id=6984154;
 
FILE_ID    OBJECT_ID  CURRENT_MASTER PREVIOUS_MASTER REMASTER_CNT
---------- ---------- -------------- --------------- ------------
         0    6984154              1               0            2

In release 12c, the con_id column is populated with a value of 0 in the x$object_policy_statistics, gv$policy_history, and gv$gcspfmaster_info views, indicating that these rows are associated with the whole CDB. Hopefully, this is just a bug to be fixed in future patches; once con_id is populated correctly, you can use it to identify the PDB associated with the object.

I triggered object remastering by accessing the object in a loop and initiating numerous BL lock requests. (Again, note that it is the number of BL lock requests that triggers DRM activity, not the number of accesses to a block; these are two different metrics.) After the object was remastered again, you can see that the current_master column value changed from 1 to 2 and remaster_cnt increased to 3.

RS@ORCL1>select * from v$gcspfmaster_info where object_id=6984154;
 
FILE_ID    OBJECT_ID  CURRENT_MASTER PREVIOUS_MASTER REMASTER_CNT
---------- ---------- -------------- --------------- ------------
         0    6984154              2               1            3

DRM Stages

LMON goes through various stages to complete DRM activity. One major phase of DRM activity is the GRD freeze. While GRD is frozen, no BL locks can be acquired or released. Existing buffers in the SGA can be accessed, but new global locking requests will be stuck waiting on the gc remaster event. At the end of DRM activity, GRD is unfrozen and global cache locking activity continues. This GRD freeze and unfreeze is performed in a synchronized fashion by all instances. The instance triggering the remastering event acts as the coordinator instance and works with the other instances to complete a specific sequence of tasks for the remastering event.

The following output from the LMON trace file shows an overview of DRM activity in a database. As a first step, GRD is frozen, and no new BL locks can be acquired after the completion of the freeze. Then, old locks and resources of the remastered objects are cleaned out. This cleanup activity must happen in all instances, so there is a synchronization step after each stage; the remastering instance waits for the other instances to complete a task before proceeding to the next step. Finally, locks and resources are transferred to the instance that is about to receive the remastered objects.

*** 2011-08-12 15:24:33.146
* drm freeze : Quiesce- No new opens or converts
* drm cleanup: All old locks and resources closed for the objects
* drm sync 2: wait for sync from all nodes
* drm replay: transfer resource/lock states to new nodes
* drm sync 3: wait for sync
* drm fix writes

You have probably guessed the pain point of DRM processing. GRD freeze means that no new BL resources or locks can be acquired during the initial phases of DRM. This GRD freeze can induce a complete application hang.
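The time spent in each of these DRM phases is externalized in gv$dynamic_remaster_stats. A minimal sketch follows, assuming the 11.2 column names (verify them in your release; the times are cumulative since instance startup):

SELECT inst_id, remaster_ops, remastered_objects,
       quiesce_time, freeze_time, cleanup_time,
       replay_time, fixwrite_time, sync_time
  FROM gv$dynamic_remaster_stats
 ORDER BY inst_id;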

GRD Freeze

GRD is frozen briefly during a resource remastering event. No new global resources and locks can be acquired while GRD is frozen, and in a few scenarios, this freeze can lead to a hung application. Worse yet, DRM can happen in cycles; for example, DRM starts for object 1, at its completion DRM starts for object 2, and so on. In a busy database, even a minute of GRD freeze can throw instability into application availability and performance. Further, as the size of the SGA grows, the severity of the GRD freeze increases. Typically, business-critical applications tend to have bigger SGAs, and so the effect of a GRD freeze can be severe in those applications. Databases with less activity might not suffer excessively due to GRD freeze.

The following output shows a few lines from an AWR report during a DRM-related freeze. The event gcs drm freeze in enter server indicates that the problem is due to DRM activity. The first process requesting a block will wait on the gc remaster or gcs drm freeze in enter server event. Subsequent processes accessing the same block will wait on the gc buffer busy event, since there is an outstanding BL lock request already. So, in the following example, the actual impact on the application due to DRM activity should be considered roughly 70% of the response time. Note that the average CR receive time is also at an elevated level of 17.1 ms, and that elevated latency is due to the DRM freeze.

Top Five Timed Events                                              Avg     %Total
∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼                                              wait    Call
Event                           Waits           Time (s)           (ms)    Time    Wait Class
------------------------------  ------------    -----------        ------  ------  ----------
gc buffer busy                     1,826,073        152,415            83    62.0  Cluster
CPU time                                             30,192          12.3
enq: TX - index contention            34,332         15,535           453     6.3  Concurrency
gcs drm freeze in enter server        22,789         11,279           495     4.6  Other
enq: TX - row lock contention         46,926          4,493            96     1.8  Applicatio
 
Global Cache and Enqueue Services - Workload Characteristics
∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼
Avg global enqueue get time (ms)                : 16.8
Avg global cache cr block receive time (ms)     : 17.1
Avg global cache current block receive time (ms): 14.9

Parameters

There are a few underscore parameters controlling the behavior of DRM activity. Parameter _gc_policy_minimum decides the minimum gc activity per minute on an object for it to be considered for remastering (default value, 1,500). Parameter _gc_policy_time governs the duration of an observation window (default value, 10 minutes). During the observation window, object-level activity is tracked; objects exceeding the minimum threshold are considered for remastering, and DRM is triggered.

There are many other underscore parameters controlling this behavior; those parameters usually start with _lm_drm_ and do not normally need any adjustment. However, if you find that DRM activity is slow, it is possible to tune DRM speed using these parameters, but you must contact Oracle Support before changing these initialization parameters.

If your database performance suffers from an excessive amount of DRM activity, you can reduce it by adjusting the _gc_policy_minimum parameter to a much higher value, such as 1 million. While DRM can be disabled by setting _gc_policy_time to zero, that choice disallows manual remastering too. So, the preferred action is to increase _gc_policy_minimum to a much bigger value.
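The following is a hedged sketch of both actions; underscore parameters must be changed only under the guidance of Oracle Support, and the oradebug syntax for manual remastering is shown purely as an illustration of the technique:

ALTER SYSTEM SET "_gc_policy_minimum"=1000000 SCOPE=SPFILE SID='*';
-- An instance restart may be required for the new value to take effect.

-- Manual remastering of a single object, run as SYSDBA on the instance
-- that should become the master (object_id 6984154 from the earlier example):
oradebug setmypid
oradebug lkdebug -m pkey 6984154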

Changes in 12c

DRM has slight changes in release 12c, mostly for the PDB feature. Every PDB generates its own object_ids, and so the pkey format has been changed to accommodate plugged-in databases. For example, remastering of an object with object_id=89911 uses a pkey that includes the con_id of the plugged-in database. A snippet from an LMD trace file follows.

*** 2013-03-17 21:57:49.718
Rcvd DRM(3) AFFINITY Transfer pkey 4.4.89911 to 1 oscan 1.1
ftd (30) received from node 1 (4 0.0/0.0)
all ftds received

In this example, pkey is 4.4.89911, where the first 4 is the con_id of the object (my guess is that the second 4 is also a container id), and 89911 is the object_id of the object. You can query v$containers or cdb_pdbs to identify the plugged-in database owning this object, as shown in the following sketch.
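For example, a quick sketch mapping the con_id portion of the pkey (4 in this trace) to a PDB name:

SELECT con_id, name, open_mode
  FROM v$containers
 WHERE con_id = 4;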

DRM and Undo

Undo segments can also be remastered. In RAC, every instance has its own undo tablespace. For example, if the undo_tablespace parameter is set to undots1, then all undo segments onlined in instance 1 will be allocated in the undots1 tablespace. So, by default, the instance onlining an undo segment will also be the resource master for that segment. But if another instance accesses an undo segment excessively, then the resource mastership of that undo segment can be transferred to the other instance.

The LMD0 trace file will show an object_id exceeding 4 billion for undo remastering. Here is an example of an undo remastering event:

Begin DRM(104) - transfer pkey 4294951403 to 1 oscan 0.0
*** 2011-08-19

In version 12c, the pkey format has changed and consists of two parts: a value of about 4 billion (4294967295) marking the pkey as an undo segment, and 65541 identifying the specific undo segment.

*** 2013-03-17 19:29:41.008
Begin DRM(2) (swin 1) - AFFINITY transfer pkey 4294967295.65541 to 2 oscan 1.1
kjiobjscn 1
ftd (30) received from node 1 (4 0.0/0.0)
all ftds received

Troubleshooting DRM

If your database is suffering from excessive issues due to GRD freeze (and AWR wait events support that assertion), then use the following guidelines to debug the problem.

  1. Identify objects involved in DRM activity. Views gv$gcspfmaster_info and gv$policy_history can be used to identify objects triggered for DRM activity (see the sketch following this list). The LMD trace file will show the object_id involved in a DRM as the pkey. Verify that these object_ids are valid by checking them against dba_objects (or cdb_objects in 12c). In the following example, DRM is triggered on object 27707121. In a 12c CDB, you also need to identify the PDB associated with the object.
    Rcvd DRM(5) AFFINITY Transfer pkey 27707121.0 to 3 oscan 1.1
  2. Understand if the activity on that object is justified by reviewing application workload pattern. If the activity is justified, then try to tune the SQL statement accessing those objects to reduce object access.
  3. Identify the programs/application processes accessing those objects heavily. Colocate those programs to the same instance.
  4. If none of these actions help, then contact Oracle Support and check if you can reduce DRM activity by tuning _gc_policy_minimum and _gc_policy_time parameters.
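As referenced in step 1, the following sketch maps remastered objects to their owners and names. The object_id column of gv$gcspfmaster_info generally corresponds to the data object id (some releases name the column data_object_id), so the outer join below uses data_object_id; verify the mapping in your release:

SELECT g.inst_id, g.object_id, o.owner, o.object_name, o.object_type,
       g.current_master, g.previous_master, g.remaster_cnt
  FROM gv$gcspfmaster_info g, dba_objects o
 WHERE g.object_id = o.data_object_id (+)
 ORDER BY g.remaster_cnt DESC;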

In summary, dynamic remastering is a good feature to improve the performance of applications in a RAC database if application affinity is in play. You can reduce the RAC tax to a minimal amount with this feature. Also, in a RAC one-node database or a RAC cluster with active/passive instances, you can avoid the RAC tax completely. It is a pity that this feature doesn’t work well for all workloads, and especially that it starts to become problematic when the SGA size exceeds 50 GB.

AWR Reports and ADDM

AWR reports10 and Automatic Database Diagnostic Monitor (ADDM) are the first step in performance diagnosis in a RAC cluster. There are multiple ways you can generate AWR reports in a RAC environment. You can also use OEM to generate an AWR report.

To generate an AWR report for a specific instance, use the awrrpti.sql script. You can supply the instance number and database ID to generate the AWR report. In the following example, I am generating an AWR report for instance 3.

@$ORACLE_HOME/rdbms/admin/awrrpti
...
Instances in this Workload Repository schema
∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼∼
 
   DB Id     Inst Num DB Name      Instance     Host
------------ -------- ------------ ------------ ------------
* 3821847422        2 PROD         PROD2        db1
  3821847422        3 PROD         PROD3        db2
  3821847422        1 PROD         PROD1        db3
 
Enter value for dbid: 3821847422
Using 3821847422 for database Id
Enter value for inst_num: 3
Using 3 for instance number

To generate an AWR report for the current instance, you can call awrrpt.sql file.

@$ORACLE_HOME/rdbms/admin/awrrpt.sql

To generate an AWR report covering all instances, you can execute the awrgrpt.sql script. This script pulls the data for all active instances and prints metrics both at the instance level and aggregated at the cluster level. The AWR report generated from this script is useful for comparing the performance of instances quickly.

@$ORACLE_HOME/rdbms/admin/awrgrpt.sql
...

ADDM is executed automatically after each AWR snapshot capture. You can review the recommendations from ADDM easily using OEM. You can also use SQL*Plus to retrieve the latest task_name and then fetch the ADDM report.

SQL> select  task_name, status from dba_advisor_tasks where task_id=
  (select max(task_id) from dba_advisor_tasks where status='COMPLETED' );
 
ADDM:4227381283_3_59788        COMPLETED
 
 
SQL> SET LONG 1000000 PAGESIZE 0;
SQL> SELECT DBMS_ADDM.GET_REPORT('ADDM:4227381283_3_59788') from dual;

ASH Reports

Similar to AWR reports, you can also generate ASH reports for the local instance or a specific instance. To generate an ASH report for a specific instance, use the ashrpti.sql script.

@$ORACLE_HOME/rdbms/admin/ashrpti.sql

To generate an ASH report for the current instance, use ashrpt.sql.

@$ORACLE_HOME/rdbms/admin/ashrpt.sql

Summary

The methodology to optimize an application in a RAC database is not vastly different from the single-instance methodology. Understand which layer is causing performance issues and tune that layer to resolve the root cause of the performance issue.

In most cases, RAC performance problems are due to single-instance database problems hiding behind RAC wait events.

This chapter describes the internals of cache fusion processing in an easy-to-understand format. You should be able to understand the meaning of the RAC wait events, identify objects inducing those RAC wait events, and identify SQL statements suffering from or inducing RAC wait events after having read this chapter. Further, you should be able to understand the details about LMS processing for a block transfer. Placeholder wait events indicate the current state of the session, but you should identify the final wait event to which the wait time was accounted.

DRM is an excellent feature. It is very useful in environments where the application affinity is implemented. It is a pity that resource mastering freezes GRD completely, and hopefully in a future release, this freeze will be minimized to freeze just a small portion of GRD.

Of course, a review of AWR reports and ASH reports of the problem time frame is an essential step in understanding the root cause of a performance issue.

1 This algorithm is a variant of hashing algorithm. In a nutshell, in a three-instance database, the first contiguous 256 blocks of a data file will be mastered by instance 1, the next 256 blocks will be mastered by instance 2, the third 256-block range is to be mastered by instance 3, and the fourth 256-block range is to be mastered by instance 1, in a cyclic way. This blocking factor is controlled by _lm_contiguous_res_count parameter and defaults to 256 in 11g (in 10gR2, defaults to 128). DRM is a feature that alters the round-robin mastership and is discussed later in this chapter.

2 Note that this algorithm is a rough outline; the actual algorithm is more complicated. The action depends upon the request type CR or current mode, whether the buffer is in buffer cache of resource master, whether the buffer is in another instance buffer cache, whether downgrade is possible or not, etc.

3 The Wireshark utility is a great tool to trace this activity. You can see that FG (server process) receives the blocks directly from LMS process from 10gR2 onward.

4 Note that in Database version 12c, the pluggable databases concept is introduced, and multiple pluggable databases can share the same buffer cache. Since the (file_id, block_id) combination is unique even when multiple databases are plugged into a container database, the BL resource_name format remains the same in 12c also.

5 Notice that I am querying the block from a different instance. There is a RAC optimization by which block locks are kept local and globalized only when the block is accessed in a different instance. So, if you access the block in an instance, global cache structures may not be set up yet. That’s the reasoning behind accessing the block in a different instance.

6 Essentially, queries request a specific version of the block by specifying query SCN as block version. The combination of cr_scn_bas and cr_scn_wrp specifies SCN version of a CR buffer.

7 Event gc lost block is discussed extensively in Chapter 9. I think it is repetitive to probe this event further in this chapter.

8 Sending side refers to the instance and LMS processes sending the block. Receiving side refers to the FG process requesting the blocks.

9 Another cautionary tale: do not dump numerous blocks in a RAC database from version 11.2 onward. For example, dumping all blocks of an extent overloads the LMS processes and can lead to a node crash. So, try to dump a few blocks at a time if required.

10 You need a Diagnostics and Tuning pack license to access AWR tables or report.
