CHAPTER 6


Application Design Issues

by Riyaj Shamsudeen

Inefficient application design strategy is the root cause of many customer RAC escalations.

Bad design practices that cause minor performance issues in a single-instance database are magnified when you convert that database to RAC. Failures and cost overruns in many RAC deployment projects can be attributed to application design patterns that are not suitable for RAC. For example, poor design of sequence caching causes only minor performance issues in a single-instance database; in a RAC database, that same flaw can lead to a completely hung cluster.

In this chapter, I will discuss these design patterns, methods to troubleshoot these issues, and possible strategies for resolving them.

Localized Inserts

Excessive insert concurrency is one of the most common issues encountered in a RAC database. Concurrency itself is not the problem; rather, a flaw in an object's design or improper use of application affinity can trigger massive performance issues.

In a B-Tree index, values are stored in ascending or descending order of the indexed columns. Consider a unique index on the employee_id column of the employee table, populated by an emp_id_seq sequence with a current value of 1,000. An insert into the employee table using that sequence-generated value will store an entry in the rightmost leaf block of the index tree, because 1,000 is the current maximum value. Subsequent inserts retrieving values from the emp_id_seq sequence will also populate entries in that leaf block until the leaf block is full. After the leaf block is full, a new block will be added to the index tree and subsequent inserts will populate index entries in the newly added leaf block. Essentially, new inserts into the employee table populate index entries in the current rightmost leaf block of the index tree. In Figure 6-1, recently inserted values (1,000 to 1,003) are populated in the rightmost leaf block of the index tree.1


Figure 6-1. Leaf block contention

If insert concurrency into the employee table increases, contention on the current rightmost leaf block will increase sharply. Contention on the rightmost leaf block is exacerbated because the leaf block splits when the block becomes full. In a single-instance database, leaf block contention will result in waits for events such as ‘buffer busy’ waits, ‘cache buffers chain’ latch waits, and so on. In a RAC database, the current rightmost leaf block must be transferred between the instances, leading to an increase in global cache transfer. Further, sessions will suffer from wait events such as gc buffer busy acquire and gc buffer busy release as global cache locks are acquired and released. The effect of waiting for these global events in a RAC database will be an order of magnitude larger than the single-instance waits, possibly leading to an unusable application. This phenomenon is called right-hand index growth contention.

Figure 6-2 shows the output from gv$session and the effect of leaf block contention in a busy database. Thousands of sessions were stuck waiting on gc buffer busy events due to index leaf block contention. Waits for buffer busy wait events and ITL contention events also appear in the output.

9781430250449_Fig06-02.jpg

Figure 6-2. Events in a busy database
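You can observe this condition yourself with a query against gv$session. The following is a minimal sketch (filtering out the Idle wait class is one reasonable choice; adjust it to your needs):

SELECT inst_id, event, COUNT(*) cnt
  FROM gv$session
 WHERE wait_class <> 'Idle'
 GROUP BY inst_id, event
 ORDER BY cnt DESC;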

Indexed columns populated by current date or current timestamp values can suffer a similar fate, because rows inserted at the same time will have nearly identical column values, and so recent values will be populated in the current rightmost leaf block of the index. The underlying issue is monotonically increasing values concentrating index entries in the rightmost leaf block of the index; hence, even non-unique indexes can induce right-hand index growth. The root cause of this problem is a flawed application design, and unfortunately the effects are magnified in a RAC database.

There are at least four options for resolving right-hand index growth contention:

  1. Partition the index by a hash partitioning method, with one or more indexed columns as partitioning keys (see the sketch following this list).2  With this hash partitioning technique, multiple index trees are created for an index, and contention is therefore distributed among the current rightmost leaf blocks of multiple index partitions. Figure 6-3 shows an index partitioned by hash into two partitions. Because each partition has its own index tree, the values are spread between two leaf blocks, reducing contention by half. Similarly, if you partition an index by the hash partitioning method with 64 partitions, then the values are distributed among 64 leaf blocks of the index, dividing the contention 64-fold. Further, a hash-partitioned index supports the index range scan operation, so the ill effects of hash partitioning an index are minimal.


    Figure 6-3. Hash-partitioned index

  2. Partition the table by hash and re-create the indexes as locally partitioned indexes. This technique has advantages similar to hash-partitioned indexes, with the difference that the table is also partitioned by hash, which has the added advantage of reducing contention in the table blocks. I should caution you that if your queries do not achieve partition pruning,3  logical reads can increase because all index partitions must be searched for matching rows. For example, consider the EMP table, hash partitioned by employee_id, with a locally partitioned index on the dept_id column. Because the dept_id column is not part of the partitioning key, predicates specifying the dept_id column must be searched in all partitions of the dept_id index. This increase in logical reads poses a problem only if the SQL statement is executed very frequently. However, the advantages of a partitioned table usually outweigh this concern.
  3. Convert the index to a reverse-key index if you do not have a license for the partitioning option or if your database uses Standard Edition software. Because column values are reversed as they are stored in the index blocks, index entries are distributed across numerous leaf blocks, relieving the contention. However, because the values are distributed among almost all leaf blocks of the index, these index blocks can pollute the buffer cache, causing increased physical reads. Another disadvantage of this option is that reverse-key indexes do not support the index range scan operation.
  4. Implement software-driven localized sequence access, with each sequence generating a different range of values. For example, you can convert an application to use multiple sequences generating disjoint ranges of values instead of a single sequence, such as an emp_id_seq1 sequence with a starting value of 1 billion, an emp_id_seq2 sequence with a starting value of 2 billion, and so on. The application is then coded so that sessions connected to instance 1 access the emp_id_seq1 sequence, sessions connected to instance 2 access the emp_id_seq2 sequence, and so on. Hence, the application inserts values in a different range in each instance, reducing leaf block contention. The disadvantages of this option are that it requires a code change, and it does not resolve the problem completely, as leaf block contention can still creep up within an instance.
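As a minimal sketch of options 1 and 3, assuming the employee table and employee_id column from the earlier example (the index name is hypothetical, and the two statements are mutually exclusive alternatives):

-- Option 1: global hash-partitioned unique index (requires the partitioning option)
CREATE UNIQUE INDEX emp_id_ux ON employee (employee_id)
  GLOBAL PARTITION BY HASH (employee_id) PARTITIONS 64;

-- Option 3: reverse-key index (usable in Standard Edition)
CREATE UNIQUE INDEX emp_id_ux ON employee (employee_id) REVERSE;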

To learn about methods to identify objects inducing or suffering from gc buffer busy performance issues, see Chapter 10.

Excessive TRUNCATE or DROP Statements

Applications executing excessive numbers of TRUNCATE and DROP commands will not scale in a single-instance database. In a RAC database, global waits triggered by these DDL commands will cause severe performance issues.

The TRUNCATE and DROP commands trigger object-level checkpoints for the table being truncated or the object being dropped. In a single-instance database, the object queue in the local buffer cache must be scanned to identify blocks currently in the buffer cache. In a RAC database, the object queues in the buffer caches of all instances must be scanned to complete the TRUNCATE or DROP command; these object checkpoints are triggered by local background processes posting the remote background processes. If numerous object checkpoints are triggered concurrently, the checkpoint requests might be queued, leading to longer wait times to complete the commands. Also, as the size of the buffer cache increases, the amount of work needed to complete these commands increases as well, so the problem is magnified with large buffer caches.

Note   An object-level checkpoint is also triggered for dependent objects such as indexes. Hence, tables with numerous indexes can suffer the worst consequences.

The TRUNCATE command is a DDL command, so parse locks (also known as library cache locks) on dependent objects must be broken before the DDL command completes. In a single-instance database, breaking parse locks requires changes only to local shared pool objects. However, in a RAC database, where the parsed representation of dependent objects can exist in another instance's shared pool, parse locks must be broken globally. Further, library cache locks are globalized as Global Enqueue Services (GES) locks, so breaking parse locks requires invalidating GES enqueues. This activity triggers a set of coordination messages between the instance background processes.

I enabled the 10046 event trace in my session and truncated a table. The following are a few lines from the SQL trace file; these wait events are associated with the coordination mechanism between the instances. You can identify the object because its object_id is listed in the ‘local write wait’ event. In this example, the object_id of the truncated table is 89956.

Truncate table t1;
...
 nam='enq: RO - fast object reuse' ela= 976 name|mode=1380909062 2=65598 0=2 obj#=22
...
 nam='enq: RO - fast object reuse' ela= 1022 name|mode=1380909062 2=65598 0=1 obj#=22
 nam='enq: RO - fast object reuse' ela= 478 name|mode=1380909057 2=65598 0=2 obj#=22
...
 nam='local write wait' ela= 270 file#=11 block#=280 p3=0 obj#=89956
 nam='local write wait' ela= 1315 file#=11 block#=281 p3=0 obj#=89956
 nam='local write wait' ela= 1702 file#=11 block#=282 p3=0 obj#=89956
 nam='DFS lock handle' ela= 651 type|mode=1128857605 id1=13 id2=5 obj#=89956
 nam='DFS lock handle' ela= 563 type|mode=1128857605 id1=13 id2=1 obj#=89956
 nam='DFS lock handle' ela= 1492 type|mode=1128857605 id1=13 id2=2 obj#=89956

For further discussion about the DFS lock handle mechanism, see Chapter 11.

In most cases, developers use TRUNCATE or DROP commands on tables that hold data temporarily during processing. Global temporary tables (GTTs) are the preferred method instead of these DDL commands. Rows in a GTT are session specific and not visible to any other session. By default, rows in a GTT are thrown away after a commit; you can also choose to create a GTT with the on commit preserve rows clause to preserve rows in the GTT across commits.
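A minimal sketch of such a GTT follows (the table and column names are hypothetical):

CREATE GLOBAL TEMPORARY TABLE work_items (
  request_id NUMBER,
  payload    VARCHAR2(100)
) ON COMMIT PRESERVE ROWS;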

Because a GTT is session specific, it does not suffer from the ill effects of DDL commands. A DELETE on a GTT does not trigger any checkpoint, and there is no need to invalidate parse locks, either. So, code developed using GTTs will scale better in both a single-instance and a RAC database than code relying on DDL statements.

I am not advocating that you should never use TRUNCATE or DDL statements in your code. On the contrary, be aware of the ill effects of DDL statements in a RAC database and use the method that best suits your requirements. For example, if you must delete all rows from a big table (100K+ rows) infrequently in your code (or ad hoc), then you should use TRUNCATE commands instead of DELETE statements.

A common reason developers shy away from using GTTs is that the optimizer can choose an inefficient execution plan for SQL statements accessing a GTT. It is common for an application to gather schema-level statistics, thereby collecting statistics on GTTs as well. The problem is that the statistics collection process has not populated any rows in those GTTs, so zero-row statistics are stored for them. The optimizer applies special considerations to zero-row statistics, leading to inefficient execution plans.

One option to resolve the GTT statistics issue is to remove the statistics on the GTT and lock them; dynamic sampling is triggered if there are no statistics on the table. Another option is to collect statistics after populating a representative number of rows in the GTT and then lock the table statistics. However, this method assumes that all sessions have a uniform number of rows, which may not be accurate. Release 12c introduces a new feature, session-level private statistics for GTTs, to resolve the statistics issue. You can collect session-level private statistics, shared statistics, or both on a GTT. If private statistics are available, the optimizer uses them to choose an optimal execution plan; otherwise, the optimizer uses shared statistics.
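The following is a minimal sketch of both approaches using the standard DBMS_STATS API (the owner and table names are hypothetical):

-- Option 1: delete and lock statistics so that dynamic sampling kicks in
EXEC DBMS_STATS.DELETE_TABLE_STATS(ownname => 'RS', tabname => 'WORK_ITEMS');
EXEC DBMS_STATS.LOCK_TABLE_STATS(ownname => 'RS', tabname => 'WORK_ITEMS');

-- Release 12c: prefer session-level private statistics for the GTT
EXEC DBMS_STATS.SET_TABLE_PREFS('RS', 'WORK_ITEMS', 'GLOBAL_TEMP_TABLE_STATS', 'SESSION');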

Sequence Cache

Sequences provide unique values, not necessarily strictly sequential values. Due to a misconception among developers and designers, however, applications are often designed assuming that sequences will provide strictly sequential values.

At the first access to a sequence, sequence values are cached in the instance SGA, up to 20 values by default. Subsequent accesses to that sequence retrieve values from the cache until the cache is depleted; after the cached values are exhausted, the next access to the sequence caches 20 more values in the instance SGA. The data dictionary table seq$ keeps a permanent record of the highest cached value for every sequence in the database. Replenishing the sequence cache updates the seq$ table, marking a new highest cached value for that sequence. In a single-instance database, updates to a dictionary table require a row cache lock.

The problem with caching sequence values is that the cached values are permanently lost if the instance fails. For example, consider a sequence with a next value of 1,014 and a maximum cached value of 1,021 recorded in the seq$ table. After an instance crash, values from 1,014 to 1,021 are permanently lost. If the application code assumes that sequence values will be in strict sequential order, the loss of cached values can lead to unnecessary application data issues. (Note that losing sequence values does not imply data loss, only that the cached sequence values are gone.) Application designers sometimes alter the attributes of a sequence to ORDER, NOCACHE to impose a strict sequential order. The ORDER and NOCACHE attributes have adverse effects if the sequence is accessed very frequently: every access to the sequence forces an update to the seq$ table. Because updates to the seq$ table are performed under the protection of a row cache lock, if many sessions try to access the sequence concurrently, the row cache lock event appears as a major wait event.

Note   Updates to the seq$ table are not the only cause of row cache lock event waits. Further review of the details is warranted to identify the root cause. Chapter 11 discusses row cache locks further.

The following lines from a SQL trace show that the seq$ table is updated for each access to the sequence when the attributes of the sequence are set to ORDER, NOCACHE. The wait for the row cache lock event is also shown.

update seq$ set increment$=:2,minvalue=:3,maxvalue=:4,cycle#=:5,order$=:6,cache=:7,highwater=:8,audit$=:9,flags=:10,partcount=:11 where obj#=:1
 
nam='row cache lock' ela= 101 cache id=13 mode=0 request=5 obj#=89956 tim=1364157893629929

In the preceding wait line, cache id is set to 13. By querying the v$rowcache view, you can identify the type of row cache entry. In this example, the update to the seq$ table is the reason for the row cache lock.

select type, parameter from v$rowcache where cache#=13;
 
TYPE        PARAMETER
----------- --------------------------------
PARENT      dc_sequences
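To identify sequences that might be affected, you can check the cache and ordering attributes in the data dictionary. A minimal sketch (the schema name is hypothetical):

SELECT sequence_name, cache_size, order_flag
  FROM dba_sequences
 WHERE sequence_owner = 'RS';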

The lines from a tkprof output of a SQL trace file (see Listing 6-1) show the impact of uncached sequences in a RAC database. About 255 seconds are spent on row cache lock waits in a total run time of 282 seconds. Note that there are no waits for the DFS lock handle event since the sequence attribute is set to NOCACHE. This difference will be relevant when I discuss cached sequences later in this section.

Listing 6-1.  Performance of Uncached Sequence

INSERT INTO RS.T_GEN_SEQ_02
VALUES
 ( RS.T_GEN_SEQ_02_SEQ.NEXTVAL, LPAD ('Gen',25,'DEADBEEF'))
 
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute   5001      7.71     282.75          3        333      20670        5001
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total     5002      7.71     282.75          3        333      20670        5001
 
 Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  row cache lock                               4586        0.76        255.01
  Disk file operations I/O                        7        0.00          0.00
  db file sequential read                         3        0.01          0.03
  gc current block busy                        1064        0.46          7.08
  gc current block 2-way                       2660        0.05          3.36

In a RAC database, sequence values can be cached in all instances. This sequence caching leads to retrieved sequence values not correlating with a time component. Consider the following scenario:

  1. Session 1 connects to PROD1 instance and accesses a sequence emp_id_seq, retrieving a value of 1,022. Values in the range of 1,022 to 1,041 will be cached in PROD1 SGA for that sequence, assuming a default cache of 20.
  2. Another session connecting to PROD2 will retrieve a value of 1,042 from that sequence. Values in the range of 1,042 to 1,061 will be cached in the PROD2 instance SGA.
  3. If session 1 connected to the PROD1 instance accesses the sequence again, a value of 1,023 will be returned.

So, the retrieved sequence values sorted by time will be 1,022, 1,042, and 1,023, clearly not correlating with a time component.

In a RAC database, if the attributes of a frequently accessed sequence are set to ORDER, NOCACHE, performance problems will be magnified. As uncached sequences trigger an update to a seq$ table for every access, row cache locks are acquired aggressively. In a RAC database, row cache locks are globalized as GES layer locks, so additional GES locks must be acquired before updating dictionary tables. Essentially, a magnification effect comes into play, and a huge amount of waiting for GES locks and row cache lock wait events will be the result.

If you must retrieve sequence values in a strict sequential order, one option is to use ORDER, CACHE attributes of sequences. In RAC, a GES resource is used to maintain the current maximum cached value of a sequence, reducing updates to a dictionary table. Using the attributes ORDER and CACHE is a much better option if you must retrieve values in a strict sequential order. Still, if the resource master instance of the sequence crashes, then there is a possibility of losing cached sequence values. Note that an excessive amount of access to sequences with ORDER and CACHE attributes results in longer waits for the DFS lock handle event.

The test case in Listing 6-1 was repeated with the sequence attributes set to ORDER, CACHE 20. The lines from the tkprof output in Listing 6-2 show that the insert statement completed in about 12 seconds, compared to 282 seconds for the uncached sequence test case in Listing 6-1.

Listing 6-2.  Sequence with ORDER and CACHE

INSERT INTO RS.T_GEN_SEQ_02
VALUES
 ( RS.T_GEN_SEQ_02_SEQ.NEXTVAL, LPAD ('Gen',25,'DEADBEEF'))
 
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.01          0          0          0           0
Execute   5001      0.94      12.60          0        910      16440        5001
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total     5002      0.94      12.62          0        910      16440        5001
 
 Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  DFS lock handle                               359        0.05          0.64
  enq: HW - contention                            6        0.03          0.09
  buffer busy waits                             130        0.06          0.50

Our recommendations for sequences in a RAC database are as follows:

  1. If sequences are heavily accessed, then use a bigger cache value (1,000+). A bigger cache improves performance because access to a cached value is an order of magnitude faster than access to an uncached sequence value. The only drawback is that an instance crash can lose all cached sequence values.
  2. If the sequences are accessed infrequently, use the default cache value of 20.
  3. If you must retrieve values sequentially from infrequently accessed sequences, alter the sequence attributes to ORDER, CACHE to avoid the ill effects of uncached sequences in a RAC database, keeping the cache value at the default of 20 (see the sketch after this list). This option has less impact than a NOCACHE sequence because the cached sequence values are maintained in a GES resource.
  4. If you cannot afford to lose any sequence value, then you should consider only the ORDER, NOCACHE attributes. Be aware that altering the attributes of a frequently accessed sequence to ORDER, NOCACHE can lead to a hung cluster.
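A minimal sketch of recommendations 1 and 3 (the sequence names are hypothetical):

-- Heavily accessed sequence: use a big cache
ALTER SEQUENCE emp_id_seq CACHE 1000;

-- Strict ordering required on an infrequently accessed sequence
ALTER SEQUENCE audit_id_seq ORDER CACHE 20;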

Freelists and ASSM

A freelist maintains a list of blocks with sufficient free space in a non-ASSM (Automatic Segment Space Management) tablespace. Sessions search the freelist when looking for a free block in which to populate a row piece. Freelists are stored in the segment header block, so excessive concurrent access to freelists can make the segment header block a hot block. The segment header block must then be transferred between the instances repetitively, causing block contention with possible waits for gc buffer busy events.4

You can alleviate freelist block contention (segment header contention) by creating objects with multiple freelist groups. Each freelist group is stored in a block of its own. Further, each instance has an affinity to one freelist group, and sessions in an instance search the freelist group with affinity before searching other freelist groups. So, segment header contention can be alleviated by creating as many freelist groups as there are instances; that is, if you have four instances in a database, create objects with freelist groups set to 4.
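A minimal sketch of such an object follows (the table name and the MSSM tablespace name are hypothetical; FREELIST GROUPS must be set at creation time):

CREATE TABLE requests (
  request_id NUMBER,
  status     VARCHAR2(1)
)
TABLESPACE mssm_ts
STORAGE (FREELISTS 4 FREELIST GROUPS 4);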

Note that freelists and freelist groups are applicable only to Manual Segment Space Management (MSSM) tablespaces. ASSM tablespaces do not suffer from the contention issue because bitmaps at different levels are used to maintain lists of free blocks. Each instance has affinity to an L1 bitmap, essentially implementing a concept similar to multiple freelist groups.

My recommendation is to avoid using MSSM tablespaces and instead use ASSM tablespaces. You should consider partitioning the table if ASSM is not able to support the concurrency.

Excessive Commits

Executing a COMMIT statement posts the LGWR (log writer) background process to write the contents of the log buffer to the log file members. After the writes to the log file members complete, LGWR posts the waiting foreground (server) processes, marking a successful commit. While LGWR is performing a log flush, a foreground process accounts its wait time to the log file sync wait event. In a single-instance database, excessive commits can overload LGWR.

In a RAC database, CURRENT mode block transfers (and some Consistent Read mode block transfers, too) require log flush, a scheme similar to commit processing. LMS processes must wait for LGWR process to complete log flush before sending blocks to remote instance processes. The LMS process is instrumented to account the wait time for the gcs log flush sync event.

The requirement for a log flush is due to the consistency mechanism in the database. Let me explain what would happen if the log flush were not performed for a CURRENT mode transfer. Consider a scenario in which session #1 updated a block in the PROD1 instance but has not yet committed the changes. Session #2, connected to PROD2, requests to update the same block, so the LMS process transfers the block from the PROD1 instance to the PROD2 instance in CURRENT mode. Assume the PROD1 instance crashes immediately after the block is transferred to PROD2. Session #1 in PROD1 has a pending transaction that must be rolled back, but the log flush did not happen, so the transaction changes were never recorded permanently in the redo log file. In a single-instance database, the changes made by session #1 would have been transient and discarded; here, however, the block was transferred to the PROD2 buffer cache with session #1's changes. Essentially, uncommitted changes would become permanent in the block, leading to block corruption. That is why a CURRENT mode block transfer requires a log flush: if a block has a change SCN higher than the current commit SCN, a log flush is triggered before the block is transferred to another node.

CR block fabrication requires redo change vectors from undo records to be applied to roll back the changes, creating a block consistent with the requested block version. Applying change vectors generates redo, and the redo records are copied into the log buffer. Hence, if the LMS process generated redo to construct a CR block copy, that CR block transfer also requires a log flush. An underscore parameter controls this behavior, but there is absolutely no reason to change it, as doing so can cause data consistency issues.

Delay in LGWR processing can induce delays in LMS processes, leading to a longer wait time for global cache events in other nodes. Other processes trying to access the block in transit will wait for gc buffer busy wait events, leading to a multilayered performance issue.

Note   Performance issues with log file parallel writes can induce massive waits in LMS processing, too. Global cache transfers can be stuck if LGWR is not able to complete log flushes quickly, and LMS processes will accumulate higher wait time for the gcs log flush sync wait event. It is critically important for LGWR and log file writes to be highly efficient to maintain fast global cache transfers.

There is another inherent danger with excessive commits: a phenomenon known as commit cleanout. When a transaction commits, it does not always clean up its entries in the Interested Transaction List (ITL) section of the block headers it touched.5  A subsequent session visiting that data block must clean up the ITL entry by visiting the undo header block to determine the transaction status. In a RAC database, the original transaction might have been initiated in another instance, so commit cleanouts can trigger undo header blocks to be transferred from another instance. As undo header blocks and undo blocks are usually hot blocks, excessive transfers lead to a higher global cache workload and downstream performance issues.

Our recommendation is to avoid excessive commits. Avoid autocommit features in various languages. Commit only on logical transaction boundaries.

Long Pending Transactions

Long pending transactions on critical tables can trigger massive amounts of read-consistent block construction, too. In an Oracle database, a SQL statement cannot see changes from uncommitted transactions created by another session. SQL statements create a read-consistent view of a block6 by applying undo records from uncommitted transactions. Further, every session accessing the block must build its own read-consistent view of the block.

In a single-instance database, transactions are initiated only in a local instance, so the undo application is the only additional workload. In a RAC database, transactions can be initiated in any instance, so the undo header blocks and undo blocks must be transferred between the instances if the data block is accessed in another instance. This transfer will result in an additional global cache workload. If there is a long pending transaction on a heavily accessed table, this problem can quickly escalate to a cluster hung issue.

Further, blocks may not be transferred immediately to another instance if the block is considered busy. A delay in gc transfer can induce further waits for the gc buffer busy event.

Our recommendations are as follows:

  1. Defer long, intrusive transactions on critical tables to a low-activity period. This strategy will reduce the severity of the issue in a RAC database.
  2. Update only the required columns. Developers sometimes write code to delete and reinsert rows instead of updating them. Not only is this delete-and-reinsert strategy unnecessary, it is redo intensive, which results in a higher number of undo records to be applied for read-consistent block fabrication.

Localized Access

Small sets of blocks accessed from all instances can cause contention-related performance issues. It is a common requirement in many applications to query a request table to identify work items to be processed. For example, manufacturing applications typically query a request table to identify the next work item to process. After processing the work item, the row is updated, marking the work item with a completion status. Typically, these work items are picked up by many such concurrent processes, resulting in concurrent read and write on a few blocks of tables and indexes. The concurrent processing layer in an E-Business suite application also uses a similar strategy to execute the next concurrent request.

The performance issue is exacerbated as new rows are concurrently inserted into the table, too. The following pseudocode illustrates the application code: worker programs constantly query request_table to identify new work items to process, and at the end of processing, each row is updated to mark the work item complete.

Loop
        Select request_id into v_req
          from request_table
         where status = 'P'
         order by request_id
           for update of request_id skip locked;
        -- Process the work item
        -- Update the row, marking the work item complete
        Update request_table set status = 'C' where request_id = v_req;
End loop;

There are a few issues with this type of application design:

  1. It is typical of these SQL statements to use index-based access paths. As many processes will be concurrently accessing these index blocks, contention will be centered at fewer index blocks.
  2. Also, concurrent processes are interested in recently inserted rows and thus contention will be concentrated at a few table/index blocks.
  3. Contention is magnified because the cardinality of these indexed columns, such as the status column, is very low, so concurrent worker processes compete against each other, accessing the same blocks aggressively. In a single-instance database, excessive localized access leads to waits for events such as buffer busy waits, ITL contention, and so on.
  4. In a RAC database, worker processes might be connected to all instances. The data blocks of tables and indexes are transferred between instances, causing severe performance issues. Numerous waits for events such as gc buffer busy acquire or gc buffer busy release will be visible in the database while this performance issue occurs.
  5. If the block is busy undergoing changes, then the global cache block transfer can be delayed, resulting in waits for global cache events in other instances. This delay is controlled by the _gc_defer_time parameter, which defaults to 3 centiseconds.
  6. Also, because the rows are locked for update, ITL entries in the block header will show an active transaction. Other processes accessing those blocks must determine the transaction status by accessing the transaction table in the undo header block. Because the original transaction might have been initiated in another instance, the undo header and undo blocks from other instances might need to be transferred. This excessive block transfer leads to complex performance issues.

Therefore, performance problems can be encountered in many layers, such as contention in table blocks, index blocks, undo header blocks, and undo blocks. There are a few options for resolving this design issue:

  1. If the application performs an excessive amount of localized access, consider applying affinity and keeping all workers in the same instance. While this reduces the severity of the issue, it does not completely resolve it.
  2. Redesign application tables to use hash-partitioned tables and hash-partitioned indexes. Modify the query and execution plan to access local hash partitions.
  3. Use the Advanced Queuing feature with multiple queues such that the worker processes read a smaller set of data.
  4. Design the application so that one process assigns work items to the workers. Each worker process then performs its work item and updates the row to completion status.

Small Table Updates

Frequently updated smaller tables can induce block contention. An example is an application that uses a table to keep track of a maximum column value instead of using a sequence. The following pseudocode illustrates this coding practice: the maximum value of the employee_id column is tracked in emp_seq_table. The code retrieves the max_employee_id column value, inserts a row into the employees table with that value, and then updates the max_employee_id column, adding one to the value. If this code is executed aggressively, there can be excessive contention on emp_seq_table:

--Retrieve maximum value;
Select max_employee_id into v_employee_id
from emp_seq_table where column_name='EMPLOYEE_ID';
 
-- Insert;
insert into employees values (v_employee_id,name, ...);
 
-- Update max value;
Update emp_seq_table set max_employee_id = max_employee_id + 1
where column_name='EMPLOYEE_ID';
commit;

This is an inefficient application design. In a single-instance database, it will lead to contention-related wait events such as buffer busy waits and ITL waits. In a RAC database, the application will suffer from more severe RAC-related wait events such as gc buffer busy waits and ITL contention. Further, that block will be transferred aggressively between the instances, increasing program runtime. Use sequences to generate ascending or descending values instead, as sketched below.
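A minimal sketch of the sequence-based alternative (the sequence name and the employees column list are hypothetical):

CREATE SEQUENCE employee_id_seq CACHE 1000;

INSERT INTO employees (employee_id, name)
VALUES (employee_id_seq.NEXTVAL, :name);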

Index Design

Keep fewer indexes on critical, DML-intensive tables. Rows can be populated in any block of a heap table as long as there is sufficient free space in the block. However, index entries must be populated in a specific index block. If there are numerous indexes on a table, and if sessions modify the table from all instances, then index blocks can become hot blocks, leading to aggressive global cache activity. More indexes on a table increase the number of index blocks to modify during inserts/updates to indexed columns. This increased block activity directly translates to an increase in global cache activity.

Bitmap indexes are suitable for read-mostly tables. Excessive DML activity on tables with bitmap indexes increases the size of bitmap indexes sharply and induces row-level locking contention. In a RAC database, bitmap index blocks are transferred between instances, exacerbating the symptoms further.

In addition, a compressed index reduces the number of index blocks by not storing repetitive values of the leading columns. Because the size of a compressed index is smaller than that of the uncompressed index, the number of buffers needed to support the index decreases, thereby reducing global cache activity.
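A minimal sketch of a compressed index, assuming the employees table from earlier (the index name is hypothetical; COMPRESS 1 compresses the leading column):

CREATE INDEX emp_dept_ix ON employees (dept_id, employee_id) COMPRESS 1;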

The following best-practice guidelines for indexing design are applicable to both single-instance databases and RAC databases:

  1. Reduce the number of indexes on DML-intensive tables. In many cases, the number of indexes can be reduced by rearranging index columns in a specific order that matches the application requirement and by modifying queries slightly to match the index design.
  2. Avoid adding bitmap indexes on heavily modified tables. Performance issues will be localized to a few components in a single-instance database. However, in a RAC database this performance issue is magnified. I have personally witnessed complete application downtime due to a bitmap index on a DML-intensive table.
  3. Consider compressed indexes if the cardinality of the leading column is low.

Inefficient Execution Plan

Inefficient SQL execution plans (as a result of poorly written SQL statements or invalid statistics) can lead to excessive amounts of logical reads, translating to excessive buffer cache access and excessive physical reads. In a single-instance database, concurrent processes executing these inefficient statements can induce higher physical reads, waits for cache buffers chain latch contention, and so on.

In a RAC database, excessive logical reads can lead to increased global cache activity. Numerous blocks might need to be transferred back and forth between the instances in a busy database. This increased global cache activity slows down the processes executing the inefficient statements and also causes performance issues for other application components. SQL statements performing nested loops joins with millions of rows in the driving table usually suffer from this type of performance issue: due to incorrect cardinality estimates, the optimizer might choose a nested loops join instead of a hash join, leading to numerous lookups of other tables.

Again, RAC acts as a magnifier for performance issues associated with inefficient execution plans. A reliable method of statistics collection is an important part of the application lifecycle in a RAC database. A few customer sites collect statistics on a cloned copy of the production database and import the statistics into the production database, avoiding downtime due to invalid statistics.

Excessive Parallel Scans

Excessive inter-instance parallel scanning can overload the interconnect. In an inter-instance parallel operation, messages between parallel servers are transmitted over the interconnect. If the interconnect is not sized properly, then it is possible to induce global cache latency in the database due to overloaded interconnect.

Our recommendation is that if your application is designed to use parallelism aggressively, measure the private network traffic for PX traffic carefully and size the interconnect hardware to match the workload. For further discussion about parallelism in a RAC database, see Chapter 12.

Also, most parallel scans read blocks directly from disk into the PGA. These direct reads trigger object-level checkpoints, causing performance issues. For smaller read-mostly tables, caching the entire table in a KEEP buffer pool and performing conventional full scans might be more efficient.
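A minimal sketch of assigning a table to the KEEP pool (the table name is hypothetical, and a KEEP pool must have been sized with the db_keep_cache_size parameter):

ALTER TABLE small_lookup STORAGE (BUFFER_POOL KEEP);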

Further, you can reduce the interconnect traffic for PX messages by localizing all PX servers to the invoking node using the parallel_force_local parameter. If the parameter is set to TRUE, all parallel executions initiated from the local node allocate PX servers from the local node only, avoiding PX interconnect traffic.
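The parameter can be set at the session or system level; a minimal sketch:

ALTER SESSION SET parallel_force_local = TRUE;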

Full Table Scans

Full table scans can utilize direct path or conventional mode reads. With conventional mode read, blocks are read into the buffer cache from the disk. Reading the block into buffer cache requires global cache locks.7  This additional workload affects the application performance in a RAC database. However, blocks read for a full table scan operation are considered to be cold blocks and can be paged out of the buffer cache quickly. Hence, gains from the shared buffer cache are minimal, even if many concurrent processes are scanning the table. To reread the block into the buffer cache, global cache locks must be acquired again. Concurrent sessions from multiple instances compete for the same buffers, increasing waits for global cache events and global cache block transfers. This problem is more acute if the buffer cache is undersized.

A table can be scanned using the direct path read access path even if the SQL executes serially; this feature is known as serial direct read. However, a direct path read access path requires an object-level checkpoint, similar to the object checkpoint triggered for parallel SQL statements. Thus, excessive serial direct reads can be cost prohibitive if the buffer cache is large and very active. Events such as write complete waits and DFS lock handle can then be seen as major wait events in the database.

With an effective table partitioning scheme, you can reduce the number of blocks to be read from disk. For example, if the table is accessed using a time component, then partition the table along the natural time order. However, the SQL statements must be written to match the partitioning method. With partition pruning, the full table scan access path scans only a smaller set of partitions, reducing disk reads and global cache locks, as in the sketch below.
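A minimal sketch of such a scheme, using interval range partitioning by date (the table name, columns, and dates are hypothetical):

CREATE TABLE orders (
  order_id   NUMBER,
  order_date DATE,
  status     VARCHAR2(1)
)
PARTITION BY RANGE (order_date)
INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))
(PARTITION p0 VALUES LESS THAN (DATE '2013-01-01'));

-- A predicate on the partitioning key prunes the scan to one partition:
SELECT COUNT(*)
  FROM orders
 WHERE order_date >= DATE '2013-03-01'
   AND order_date <  DATE '2013-04-01';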

Of course, if you must do full table scans on huge tables, you should use parallelism to improve the performance of those queries.

Application Affinity

Application affinity is an important tool to combat performance impact due to global cache workload delays. Essentially, if an application component (such as multiple threads of a batch process) accesses a few objects aggressively, then modify that application component to connect to one instance. Access to the local buffer cache is an order of magnitude faster than the remote buffer cache access, so application components accessing a few objects aggressively will perform more efficiently if the blocks are cached to the local instance.

An effective application affinity scheme should translate to physical segment-level affinity, not just logical-level separation. For example, a client designed a batch process to execute on multiple instances, with each instance processing data for a disjoint list of organizations; that is, organizations 1 to 100 were processed by PROD1, 101 to 200 by PROD2, and so on. Although the batch program logically divided the data processing among multiple instances, this logical separation did not translate to the segment level: all threads of the batch program were accessing the same segments and transferring blocks aggressively between the instances, leading to a sharp increase in global cache events. To avoid this problem, partitioning can be used. List partitioning by organization results in segment-level affinity; the batch processes running on each node then access a disjoint set of partitions, improving batch process performance.

Creating services is an excellent way to implement application affinity, too. For example, if you have two major application groups (such as supply chain and manufacturing applications) in the same database, you can create a service for each application and separate the application workloads onto different instances: create two services, SCM and MANUFAC, and assign PROD1 as the preferred instance for the SCM service and PROD2 as the preferred instance for the MANUFAC service (as sketched below). With this service design, supply chain tables will be accessed aggressively in the PROD1 instance and manufacturing tables in the PROD2 instance, improving block collocation. The dynamic remastering feature should also help by remastering almost all supply chain tables to the PROD1 instance and all manufacturing tables to the PROD2 instance, thereby reducing the global cache impact further.
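A minimal sketch of such a service layout using srvctl, assuming a database named PROD with instances PROD1 and PROD2 (names taken from the example above):

srvctl add service -d PROD -s SCM -r PROD1 -a PROD2
srvctl add service -d PROD -s MANUFAC -r PROD2 -a PROD1
srvctl start service -d PROD -s SCM
srvctl start service -d PROD -s MANUFAC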

If the database server has enough capacity to handle an application component, then keep all application connections connected to a node. Of course, if one node cannot handle the workload, then you should design affinity at the sub-application component level.

Pipes

Database pipes are single-instance objects and do not span multiple instances. If your application uses pipes, then the code must be modified to access pipes locally, such that all writers and readers of a pipe connect to one instance. Pipes do not work well in a RAC database, so use of the Advanced Queuing feature is recommended instead, as sketched below.
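A minimal sketch of creating a queue with the DBMS_AQADM package (the queue names and payload type are hypothetical choices):

BEGIN
  DBMS_AQADM.CREATE_QUEUE_TABLE(
    queue_table        => 'app.work_qt',
    queue_payload_type => 'RAW');
  DBMS_AQADM.CREATE_QUEUE(
    queue_name  => 'app.work_q',
    queue_table => 'app.work_qt');
  DBMS_AQADM.START_QUEUE(queue_name => 'app.work_q');
END;
/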

Application Change Deployment

Application change deployment should be carefully controlled in a busy RAC database. Even in a single-instance database, modifying a heavily used package while sessions are executing it leads to severe library cache lock and pin waits.

In a RAC database, modifying heavily used packages (or their dependent objects) can lead to global enqueue waits, because library cache locks and pins are globalized as GES enqueues. Sessions can queue on these global locks, leading to an almost hung database. The database is not actually hung, but the queuing causes serialization, because each session must acquire the global lock, validate the package, and then proceed. Because the sessions are queued in the global enqueue layer, even after the package is validated by an initial session, other sessions must still acquire global locks, verify the package state, and only then proceed further.8  This behavior affects application availability severely.

Our recommendation is to avoid deploying application changes during busy hours and to perform application rollouts in a controlled maintenance window. Even creating an index on a critical table can cause major application downtime due to globalized library cache locks and pins.

Similarly, massive data changes should be performed during off hours. Otherwise, an excessive amount of undo block access might be required when user sessions try to access the modified blocks.

Summary

Well-designed applications that scale well in a single-instance database will scale even better in a RAC database. With a few simple application design strategies, such as objects designed with scalability in mind, optimal use of sequences, and application affinity, you can give your application a scalability boost. We realize there are some potential contradictions in our recommendations; for example, we recommend decreasing the number of indexes on a table and also decreasing logical reads, which might require more indexes. Instead of following our recommendations unquestioningly, use a balanced approach: understand the reasoning behind each recommendation, and then apply the principles to match your scenario. The ultimate goal is to reduce application wait time and increase the scalability of the application.

1 Note that Figure 6-1 shows only four rows in the leaf block. Usually, hundreds of row entries will exist in a leaf block.

2 You need a license for the partitioning option to implement this solution.

3 Partition pruning is a SQL optimization technique that limits the search to a few partitions.

4 Note that this is not a complete description of freelist management. There are actually three freelists: master freelist, transaction freelist, and process freelist. The freelist search algorithm is complicated. Please refer to a MOS paper written by Stephen Heisley titled “Freelist Management.”

5 There is no danger to transaction integrity because the transaction table in the undo header block maintains the true status of a transaction.

6 A new feature, _undo_cr, allows construction of a read-consistent row, but it is applicable only to special cases such as accessing a row through a unique index, and so on.

7 The dynamic remastering feature is useful. If the table is remastered to a local instance, then the need for additional global cache locks is reduced significantly.

8 You can reduce the dependency chain by creating cascading packages. See, for example, Expert PL/SQL Practices for Oracle Developers and DBAs (Apress, 2011), specifically Chapter 15 “Dependencies and Invalidations”, by Arup Nanda.
