CHAPTER 12


Parallel Query in RAC

by Riyaj Shamsudeen

Parallel Execution (PX) is a powerful feature that makes it possible to use numerous processes to process millions, if not billions, of rows. In RAC, multiple nodes can be employed to execute a single SQL statement when the data set it processes is huge.

In a single-instance database, PX server processes are allocated in the local instance. In a RAC database, PX server processes can be allocated in any or all active instances, depending upon the configuration. For example, a PX query initiated in instance 1 can employ PX server processes from instances 1, 2, and 3 (in a three-node RAC cluster) to execute a SQL statement. The session initiating a PX execution acts as the query coordinator (QC), breaks the total work into individual units, and employs PX server processes to complete the SQL statement execution. At the end of the execution, the QC process returns the final result set or the result of the SQL execution to the calling program.

Overview

In most cases, parallel statement execution employs PX servers in a classic Producer/Consumer model. Producers collect and filter row pieces and distribute them to the consumers; consumers receive the row pieces and perform further processing, such as a JOIN operation. Typically, the QC process waits for the PX servers to complete the work, but it is also possible for the QC to perform data processing itself, depending upon the execution plan chosen by the optimizer.

The Producer/Consumer model is implemented as two PX server sets: one set of PX server processes acts as producers and the other set as consumers. For example, if you have specified a parallelism of 8, then it is probable that your query will get 16 PX servers for that SQL statement; eight of them will be assigned to act as producers and the other eight as consumers. Essentially, a parallelism of 8 can employ 16 PX servers to execute a SQL statement.

image Note   There are special cases of PX execution that use just one set of PX server processes. For example, the SQL statement select count (*) from table will use just one set of PX server processes. Another example is partition-wise joins (discussed later in this chapter).

The Producer/Consumer concept can be explained with a JOIN operation. Listing 12-1 prints a PX plan. The tables PRODUCTS and SALES are joined using the HASH join technique and aggregated with a group by operation. Column TQ in the execution plan specifies the PX server set.

  1. In the following execution plan, the PRODUCTS table is scanned by a set of PX server processes marked as Q1,00 in the TQ column. This PX server set acts as a producer and distributes row pieces to the second PX server set, Q1,01.
  2. The SALES table is scanned by a set of PX server processes marked as Q1,01 in the TQ column. In addition to scanning the SALES table, this PX server set receives the row pieces of the PRODUCTS table sent by PX server set Q1,00. In this HASH join step, PX server set Q1,01 acts as a consumer.
  3. After completing the HASH join operation, the Q1,01 PX server set redistributes the row pieces to the next set of PX server processes, marked as Q1,02 in the TQ column. PX server set Q1,02 performs a HASH GROUP BY operation. For the HASH GROUP BY operation, PX server set Q1,01 acts as a producer and the Q1,02 PX server set operates as a consumer.
  4. PX server set Q1,02 completes grouping operations and sends the processed data to QC.

Listing 12-1.  Parallel SQL Execution Plan

explain plan FOR
SELECT /*+ parallel (8) full (p) full (s) */
prod_name, SUM (QUANTITY_SOLD) FROM sales s , products p
WHERE s.prod_id = p.prod_id
group by prod_name HAVING SUM(quantity_sold)>100
/
Select * from table(dbms_xplan.display('','','BASIC +parallel'));
 
------------------------------------------------------------------------------
| Id  | Operation                    | Name     |    TQ  |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |          |        |      |            |
|   1 |  PX COORDINATOR              |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)        | :TQ10002 |  Q1,02 | P->S | QC (RAND)  |
|   3 |    FILTER                    |          |  Q1,02 | PCWC |            |
|   4 |     HASH GROUP BY            |          |  Q1,02 | PCWP |            |
|   5 |      PX RECEIVE              |          |  Q1,02 | PCWP |            |
|   6 |       PX SEND HASH           | :TQ10001 |  Q1,01 | P->P | HASH       |
|   7 |        HASH GROUP BY         |          |  Q1,01 | PCWP |            |
|   8 |         HASH JOIN            |          |  Q1,01 | PCWP |            |
|   9 |          PX RECEIVE          |          |  Q1,01 | PCWP |            |
|  10 |           PX SEND BROADCAST  | :TQ10000 |  Q1,00 | P->P | BROADCAST  |
|  11 |            PX BLOCK ITERATOR |          |  Q1,00 | PCWC |            |
|  12 |             TABLE ACCESS FULL| PRODUCTS |  Q1,00 | PCWP |            |
|  13 |          PX BLOCK ITERATOR   |          |  Q1,01 | PCWC |            |
|  14 |           TABLE ACCESS FULL  | SALES    |  Q1,01 | PCWP |            |
------------------------------------------------------------------------------

The parallel query operation discussed earlier is shown in Figure 12-1. In the figure, PX server sets Q1,00, Q1,01, and Q1,02 are drawn as nodes of a PX tree, and the distribution between PX server sets is represented as connections between the nodes of the tree. The PQ Distrib column prints the method of row distribution between PX server sets. In Listing 12-1, at step 10, PX distribution is set to BROADCAST, indicating that all rows of the PRODUCTS table are distributed from the Q1,00 PX server set to all PX server processes of the Q1,01 PX server set. (The broadcast mechanism can be chosen if the row source is smaller.) PX server processes of the Q1,01 PX server set receive the PRODUCTS rows and then join them to the SALES table to complete join processing. Notice that in step 6, the distribution mechanism is HASH (a hashing algorithm applied to the grouping columns), and rows are distributed to PX server processes of the Q1,02 PX server set.

9781430250449_Fig12-01.jpg

Figure 12-1. PX operational details

With BROADCAST, all rows are distributed from one PX server set to another PX server set. With the HASH distribution mechanism, a subset of rows is distributed from one PX server set to another; that is, each PX server process receives a subset of the row pieces. In the preceding example, each PX server process of Q1,02 will receive approximately 1/8th of the rows produced in step 6 of the execution plan.
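You can observe this distribution directly. After the parallel query completes, the v$pq_tqstat view (populated only in the session that acted as QC, and only for the most recent PX execution) shows how many rows each PX server process produced and consumed for every table queue. The following query is a minimal sketch:

select dfo_number, tq_id, server_type, process, num_rows, bytes
from   v$pq_tqstat
order  by dfo_number, tq_id, server_type desc, process;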

You must understand that there can be at most two sets of PX processes active at any time during a SQL execution, even though the execution plan shows more than two PX server sets. PX server processes can be reused to perform another operation in a different part of the execution plan. In this example, PX processes associated with PX server set Q1,00 are reassigned as PX server set Q1,02 in step 3.

Listing 12-2 shows the PX execution in play for the query in Listing 12-1. Parallel statement execution with a degree of parallelism (DOP) of 8 employed 16 PX server processes. Notice that the server_set column is showing the PX server set. Essentially, there are two sets of PX server processes executing this statement.

Listing 12-2.  PX Server Processes in Single Instance: Script pxslaves.sql

col username for a12
col "QC SID" for A6
col SID for A6
col "QC/Slave" for A10
col "Requested DOP" for 9999
col "Actual DOP" for 9999
col server_set for  A10
set pages 100
select
  decode(px.qcinst_id,NULL,username,
        ' - '||lower(substr(s.program,length(s.program)-4,4) ) ) "Username",
  decode(px.qcinst_id,NULL, 'QC', '(Slave)') "QC/Slave" ,
  to_char( px.server_set) server_set,
  to_char(s.sid) "SID",
  decode(px.qcinst_id, NULL ,to_char(s.sid) ,px.qcsid) "QC SID",
  px.req_degree "Requested DOP",
  px.degree "Actual DOP"
from
  v$px_session px,
  v$session s
where
  px.sid=s.sid (+)
 and
  px.serial#=s.serial# (+)
order by 5 , 1 desc
/
Username     QC/Slave   ServerSet  SID    QC SID Requested DOP Actual DOP
------------ ---------- ---------- ------ ------ ------------- ----------
RS           QC                    12376  12376
 - p015      (Slave)    2          10756  12376              8          8
 - p014      (Slave)    2          11295  12376              8          8
 - p013      (Slave)    2          10211  12376              8          8
 - p012      (Slave)    2          9699   12376              8          8
 - p011      (Slave)    2          9147   12376              8          8
 - p010      (Slave)    2          7535   12376              8          8
 - p009      (Slave)    2          8609   12376              8          8
 - p008      (Slave)    2          8071   12376              8          8
 - p007      (Slave)    1          6474   12376              8          8
 - p006      (Slave)    1          7012   12376              8          8
 - p005      (Slave)    1          5403   12376              8          8
 - p004      (Slave)    1          5928   12376              8          8
 - p003      (Slave)    1          4847   12376              8          8
 - p002      (Slave)    1          4325   12376              8          8
 - p001      (Slave)    1          3787   12376              8          8
 - p000      (Slave)    1          2716   12376              8          8

Figure 12-2 shows a SQL execution with a requested DOP of 4. Four processes act as producers and four act as consumers, and the QC receives the row pieces from the consumers. This data flow operation shows that row pieces are distributed from one PX server set to another PX server set.

9781430250449_Fig12-02.jpg

Figure 12-2. PX execution in single instance

Also, in a single-instance database, the PX server processes communicate by exchanging PX buffers.

PX Execution in RAC

PX execution in RAC follows the same principles as single-instance PX execution, except that PX server processes can be allocated from any and all active instances.1  Allocation of PX server processes from multiple instances to execute a single SQL statement is known as inter-instance parallelism, and these PX server processes exchange message buffers through the private network. Allocation of PX server processes within the local instance is termed intra-instance parallelism.

As we saw in Listing 12-1, row pieces (message buffers, in network terminology) are transmitted between PX server sets using PX distribution methods such as HASH and BROADCAST (there are a few other distribution methods, such as PARTITION and NONE). In RAC, row pieces are distributed through the private network, so the optimal placement of PX server processes is of paramount importance to reduce the private network traffic generated by the PX workload.

RAC databases with higher PX workloads generally have higher bandwidth requirements in the private interconnect. Oracle Database versions 11g/12c are optimized to minimize private network traffic with optimal localization of PX server processes.

Figure 12-3 shows a PX server allocation scheme in RAC. In this figure, PX server processes are allocated in both nodes 1 and 2 to perform parallel execution. Two PX servers on each node act as consumers, and two PX servers on each node act as producers. Producer PX servers distribute row pieces to consumer PX servers through the private interconnect. In this example, to minimize interconnect traffic, both producers and their consumers should be allocated from the same node.

In Figure 12-3, distribution of row pieces between the PX server processes is shown as the connection between the PX processes. PQ distribution in an inter-instance parallel operation flows through the private network (cluster interconnect).

9781430250449_Fig12-03.jpg

Figure 12-3. PX execution in RAC

Placement of PX Servers

Placement of PX servers can be controlled using a few techniques in RAC. Traditionally, in earlier versions of Oracle RAC, a combination of instance_group and parallel_instance_group parameters was employed to control PX server placement, but from Oracle Database 11g onward, it is preferable to use services to control the placement of PX servers.

Instance_groups and Parallel_instance_group

The parameter instance_groups specifies the membership of an instance in one or more instance groups. PX servers are allocated from the instance group specified by the parallel_instance_group parameter. Adjustments to these two parameters can be used to control the placement of PX servers.

The following discussion uses a three-node cluster configuration with instances named orcl1, orcl2, and orcl3. In the following parameter setup, the instance_groups parameter is specified at the instance level using the syntax <instance>.<parameter>. The value of the parameter orcl1.instance_groups is set to 'po','scm' and specifies that the orcl1 instance is a member of instance groups po and scm. Similarly, the orcl2 instance is a member of the po and scm instance groups, and the orcl3 instance is a member of the finance and scm instance groups. Hence, all three instances are members of the scm instance group, instances 1 and 2 are members of the po instance group, and instance 3 is the sole member of the finance instance group.

orcl1.instance_groups='po','scm'
orcl2.instance_groups='po','scm'
orcl3.instance_groups='finance','scm'

With the parameter setup just discussed, we can alter the parallel_instance_group parameter to control PX server allocation. In the following example, the value of parameter parallel_instance_group is set to scm in orcl1 instance. PX execution initiated from orcl1 can allocate PX servers in all instances, since all three instances are members of the scm instance group.

orcl1.parallel_instance_group= 'scm'

If we set the parallel_instance_group parameter to po in orcl1 instance, then PX executions initiated from the orcl1 instance can allocate PX servers from both orcl1 and orcl2 instances.

orcl1.parallel_instance_group= 'po'

If the parameter parallel_instance_group is set to finance in orcl3, then the PX execution can allocate PX servers only from the orcl3 instance, as only the orcl3 instance is a member of the finance instance group.

orcl3.parallel_instance_group= 'finance'

If you set the parameter parallel_instance_group to finance in the orcl2 instance, then the PX execution initiated from the orcl2 instance will spawn PX servers in the orcl3 instance, as orcl3 is the sole member of the finance instance group.

orcl2.parallel_instance_group= 'finance'

If the specified parallel_instance_group parameter value has no members, then parallel execution is disabled completely. For example, a parameter value of ar will disable parallel execution, since no instance is a member of the ar instance group.

*.parallel_instance_group= 'ar'

You can alter the value of the parallel_instance_group parameter dynamically at either the session level or the system level. For example, the following alter session statement enables PX server allocation in orcl1 and orcl2 instances in the session. You can override instance- or database-level setup with session-level changes.

Alter session set parallel_instance_group= 'po';

The instance_groups parameter cannot be altered dynamically (as of release 11g), and an instance restart is required to change it. There is no explicit method to create an instance group; a group exists simply because one or more instances name it in their instance_groups parameter setting.
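To verify which instance groups each instance belongs to and which group a PX execution would currently draw from, you can query gv$parameter across all instances; a minimal sketch:

select inst_id, name, value
from   gv$parameter
where  name in ('instance_groups', 'parallel_instance_group')
order  by name, inst_id;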

Prior to version 11g, the optimizer could choose a PX plan even if parallelism was disabled; at run time, the execution engine would execute the chosen PX plan serially, leading to worse performance. From Database version 11.2.0.2 onward, the cost-based optimizer is aware of the parallelism setup: if parallelism is disabled in the session, through either session-level or global-level parameter setup, then the optimizer does not choose a parallel plan at all. For example, I modified parallel_instance_group to dummy and executed the following statement. The cost-based optimizer 10053 event trace indicates that the optimizer used a parallel degree of 1, even though the hint specifies a DOP of 8. Also notice that the cost of the parallel plan (Resp) is the same as the cost of the serial plan (Cost), indicating that the optimizer will not choose a PX plan.

alter session set events '10053 trace name context forever, level 1';
REM parallel_instance_group value is non-existent.
alter session set parallel_instance_group='dummy';
 
select /*+ use_hash (h l ) parallel (h 8) parallel (l 8) full(h) full(l) */
count(*)  from
oe_order_headers_all h, oe_order_lines_all l
where h.header_id = l.header_id and
h.open_flag='N'
 
/
Trace file due to 10053 event:
 
Best:: AccessPath: TableScan
      Cost: 197226.11  Degree: 1  Resp: 197226.11  Card: 10992402.90  Bytes: 0

After altering parallel_instance_group to the correct value, reparsing the same statement shows that the optimizer used a parallel degree of 8. Also, notice that the parallel cost is lower than the earlier serial cost of 197226.11.

Best:: AccessPath: TableScan
         Cost: 61202.86  Degree: 8  Resp: 61202.86  Card: 10992402.90  Bytes: 0

Even though the instance_groups and parallel_instance_group parameters can be used to control PX server placement, this parameter combination is deprecated from Oracle Database version 11.2 onward, and it is now recommended to use services to control the placement of PX servers. Also, in 11.2, an incorrect value for the parallel_instance_group parameter can lead to an excessive number of child cursors due to bugs such as 7352775, and there are a few unpublished bugs in which the use of these two parameters leads to unshareable child cursors. So, it is preferable to use services to control PX server placement.

Services

Services are the preferred approach to controlling PX server placement. By default, PX servers are allocated in instances where the current service is active (“current service” refers to the service to which the session initiating the PX execution is connected). For example, if you connect to the po service and the po service is available in the ORCL1 and ORCL2 instances, then PX servers can be allocated from both the ORCL1 and ORCL2 instances.

The following output shows the service configuration in a database named ORCL. In this configuration, the po service is active in the ORCL1 and ORCL2 instances, the scm service is active in all three instances, and the finance service is active only in the ORCL3 instance. If you connect to the po service and initiate a parallel statement execution, then PX servers can be allocated from the ORCL1 and ORCL2 instances. Similarly, if you connect to the finance service and initiate a parallel statement execution, then PX servers can be allocated from the ORCL3 instance only.

$ srvctl status service -d ORCL
Service po is running on instance(s) ORCL1,ORCL2
Service scm is running on instance(s) ORCL1,ORCL2,ORCL3
Service finance is running on instance(s) ORCL3
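For reference, services like these are created and started with srvctl. The following is a minimal sketch assuming the instance names shown above; the preferred (-r) and available (-a) instance placement here is an assumption you would adapt to your own failover design:

$ srvctl add service -d ORCL -s po -r ORCL1,ORCL2 -a ORCL3
$ srvctl add service -d ORCL -s finance -r ORCL3 -a ORCL1
$ srvctl start service -d ORCL -s po
$ srvctl start service -d ORCL -s finance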

You can also use a combination of services and the parallel_instance_group parameter to control PX server placement. If you set the parallel_instance_group parameter to the name of a service, then PX servers can be allocated from the instances where that service is running. For example, after connecting to the finance service (the connection will go to the ORCL3 instance), you can alter the parallel_instance_group parameter at the session level to the po service. Now, PX executions from the current session can allocate PX servers in the ORCL1 and ORCL2 instances, since the po service is active in those instances, even though the PX execution is triggered from the ORCL3 instance. While this strategy provides fine-grained control over PX server allocation, it is not a recommended approach.

$ sqlplus rs/temp123@//orcl-scan:1521/finance
...
Alter session set parallel_instance_group=po;

PX server placement utilizing the service configuration is elegant, since after a node or instance failure, services automatically fail over to the surviving nodes. You should consider instance failure scenarios while setting up services and configure preferred and available instances for each service accordingly. Further, the service configuration can be modified dynamically. In contrast, if you use the instance_groups and parallel_instance_group parameters, then after an instance failure you may have to modify the instance_groups parameter, which might require a restart of a surviving instance.

Parallel_force_local

The parallel_force_local parameter can be used to localize PX server allocation. If parallel_force_local is set to TRUE, then PX servers are allocated in the same instance in which the PX query was initiated. For example, if you connect to the ORCL1 instance and initiate a PX execution, then all PX servers will be allocated from the ORCL1 instance. This parameter can be modified at the session level or at the system level using the alter session or alter system commands.
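For example, the following statements show both scopes; a minimal sketch:

-- Localize PX server allocation for the current session only
alter session set parallel_force_local=true;

-- Localize PX server allocation database-wide, on all instances
alter system set parallel_force_local=true sid='*';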

The problem with controlling PX server placement with this parameter is that if all PX servers in the local instance are exhausted, the query may run serially or with reduced parallelism. Thus, using services to control the allocation is the preferable approach. Also, this parameter doesn't work well with the Parallel Statement Queuing feature, as discussed later in this chapter.2

Measuring PX Traffic

Since Oracle Database version 11g, private network traffic statistics have been captured at the workload level and externalized in the x$ksxpclient fixed table.

In Listing 12-3, the query retrieves data from the x$ksxpclient fixed table while connected to the PROD1 instance. From the output, we can calculate the bytes sent and received for each workload: the PROD1 instance sent 32TB and received 49TB for the buffer cache (cache fusion) workload, sent 8.7TB and received 5.2TB for the DLM (Distributed Lock Manager) workload, and sent 1.8TB and received 2.1TB for the PX workload. In the following output, the PX workload is indicated by the line with the name ipq.

Listing 12-3.  x$ksxpclient

set pages 100
SYS@PROD1> SELECT name,
           TRUNC (bytes_sent/1024/1024,2) bytes_sent_MB,
           TRUNC(bytes_rcv  /1024/1024,2) bytes_recv_MB
FROM x$ksxpclient;
 
NAME      BYTES_SENT BYTES_RECV
--------- ---------- ----------
cache       32165225 49870288.1
dlm       8754651.32 5217822.01
ipq        1791954.4 2121334.66
ksxr         5069.85    4572.55
cgs           294.73     161.94
osmcache           0          0
streams          .03       4.02
diag         28846.1  103344.72
ksv                0          0
ping         5050.59    5050.58
internal           0          0

Of course, the output in Listing 12-3 shows the traffic breakdown from the start of the PROD1 instance, so output from x$ksxpclient can be misleading if we need to measure interconnect network traffic for a constrained time window. AWR captures the data from x$ksxpclient into AWR tables and externalizes it in the dba_hist_ic_client_stats view. Listing 12-4 queries dba_hist_ic_client_stats and prints the network traffic for the PX workload. By adding the values from all three instances for bytes_received (or bytes_sent), we can calculate the total bytes transmitted for the PX workload. In this example, between 4:00 and 4:30 p.m. on 11-NOV-2012, instance 1 received about 82MB of network traffic for the PX workload.

Listing 12-4.  dba_hist_ic_client_stats

SELECT TO_CHAR(snaps.begin_interval_time, 'DD-MON-YYYY HH24:MI') begin_time,
  snaps.instance_number,
  hist.name,
  TRUNC((hist.bytes_received - lag (hist.bytes_received) OVER (
        PARTITION BY hist.name, startup_time, hist.instance_number
        ORDER BY begin_interval_time, startup_time,
          hist.instance_number, hist.name))/1048576,2) Mbyte_rcvd,
  TRUNC((hist.bytes_sent     - lag (hist.bytes_sent) OVER (
        PARTITION BY hist.name, startup_time, hist.instance_number
        ORDER BY begin_interval_time, startup_time,
          hist.instance_number, hist.name))/1048576,2) Mbyte_sent
FROM dba_hist_ic_client_stats hist,
  DBA_HIST_SNAPSHOT snaps
WHERE snaps.snap_id           = hist.snap_id
AND snaps.instance_number     = hist.instance_number
AND snaps.begin_interval_time > sysdate-1
AND hist.name                 ='ipq'
ORDER BY snaps.snap_id, instance_number;
 
BEGIN_TIME              Inst NAME      MBYTE_RCVD MBYTE_SENT
----------------------- ---- --------- ---------- ----------
...
11-NOV-2012 16:00          1 ipq            31.23       33.7
                           2 ipq            28.31      35.01
                           3 ipq             42.9      33.72
11-NOV-2012 16:30          1 ipq            82.19      44.41
                           2 ipq            32.61      66.86
                           3 ipq            63.57      67.06
11-NOV-2012 17:00          1 ipq            32.89      32.33
                           2 ipq            14.28      36.99
                           3 ipq            55.45      33.32

This metric is also printed in an AWR report, in the Interconnect Throughput by Client section. You can see that the interconnect traffic for the PX workload is very minimal in this database.

Interconnect Throughput by Client      DB/Inst: PROD/PROD1  Snaps: 52265-52313
-> Throughput of interconnect usage by major consumers
-> All throughput numbers are megabytes per second
                        Send     Receive
Used By           Mbytes/sec  Mbytes/sec
---------------- ----------- -----------
Global Cache           15.19       22.93
Parallel Query           .05         .07
DB Locks                4.11        2.46
DB Streams               .00         .00
Other                    .00         .00

PX and Cache Fusion

PX execution generally reads the block directly into the Program Global Area (PGA) of the PX server process, bypassing buffer cache. Since the buffer cache is not used, cache fusion locks are not required for the blocks read using the direct path read method.

image Note   Not all PX executions bypass the buffer cache. For example, a PX plan can use a nested loops join (each PX server performing its own nested loops join for a subset of rows), and such a PX execution might use the buffer cache. Also, the in-memory parallelism feature introduced in Oracle Database release 11.2 enables the use of the buffer cache for PX executions.

The following lines are printed from a SQL trace file after initiating a parallel query execution. You can see that there are no global cache event waits for the direct reads; the direct path method reads the blocks directly into the PGA of the PX server processes. Notice that 128 blocks are read from disk to the PGA directly without incurring a single global cache-related wait event. For massive segment scans, direct path reads are very efficient and reduce the global cache overhead associated with a traditional buffered read. If those 128 blocks had to be buffered in the buffer cache, there would have been numerous waits for global cache events.

select /*+ full(a) parallel (4) */ count(*) from apps.mtl_material_transactions a
...
nam='direct path read' ela= 1676 file number=845 first dba=401920 block cnt=128 obj#=37871
nam='direct path read' ela= 37423 file number=845 first dba=402048 block cnt=128 obj#=37871
nam='direct path read' ela= 26573 file number=845 first dba=402304 block cnt=128 obj#=37871

However, PX execution using the direct path read method triggers an object-level checkpoint in all instances at the start of the PX execution. Excessive PX execution can trigger numerous object-level checkpoints, causing elevated write activity. Further, excessive object-level checkpoints combined with a huge buffer cache and numerous instances (eight-plus instances) can lead to continuous checkpoints in all instances, leading in turn to a slowdown of the query itself. So, you should consider the effect of these checkpoints when designing SQL statements that scan smaller tables using parallel execution.

nam='KJC: Wait for msg sends to complete' ela= 1166 msg=61225879752 ...
nam='enq: KO - fast object checkpoint' ela= 1330 name|mode=1263468550 ...
nam='enq: KO - fast object checkpoint' ela= 722 name|mode=1263468545 ...

Note that it is still possible for parallel execution to suffer from cache fusion waits for undo, undo header, and segment header blocks. For example, if a block has an uncommitted transaction, then PX server processes will access undo blocks/undo header blocks to construct a Consistent Read (CR) version of the block, which would require global cache locks.

PEMS

In an inter-instance PX execution, PX servers transmit buffers through the private interconnect, and the size of the transmission buffer is controlled by the parameter parallel_execution_message_size (PEMS). Prior to 11g, the default value of this parameter was about 2KB. With a smaller PEMS, data transmission between the PX server processes is inefficient due to chattier transmission.

From Oracle Database version 11.2 onward, if the compatibility parameter is set to 11.2.0 or higher, then PEMS defaults to a value of 16K. If your application uses the PX execution feature heavily, then you must verify that PEMS is set to a value of at least 16K.
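You can check the current setting and, if needed, adjust it in the spfile; parallel_execution_message_size is a static parameter, so the change takes effect only after the instances are restarted. A minimal sketch:

show parameter parallel_execution_message_size

alter system set parallel_execution_message_size=16384 scope=spfile sid='*';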

In PX-intensive environments, it is also advisable to use Jumbo Frames so that chatty network traffic can be reduced. In my experience, the combination of Jumbo Frames and PEMS set to 16K increased throughput of PX workload in a production database dramatically.

Parallelism Features and RAC

Oracle Database version 11.2 introduced three important parallelism features, and version 12c enhances them. The In-Memory Parallelism feature introduces buffered reads for the PX workload; the AutoDOP (automatic DOP) feature lets the optimizer choose a parallel plan and its degree of parallelism automatically; and with the Parallel Statement Queuing feature, PX execution is delayed until sufficient PX server processes are available.

All three features are controlled by a master switch parameter, namely, parallel_degree_policy.
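The master switch can be changed at the system level or tested at the session level; a minimal sketch (the values themselves are described in the list that follows):

alter system set parallel_degree_policy=auto sid='*';

-- or, to experiment in a single session first
alter session set parallel_degree_policy=auto;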

  1. If you set the parallel_degree_policy parameter to a value of AUTO, then all three features are enabled.
  2. If you set the parallel_degree_policy parameter to a value of LIMITED, then that parameter value disables PX statement queuing and in-memory parallelism. However, for statements accessing objects decorated with a parallel degree of DEFAULT, the AutoDOP feature is enabled.
  3. If you set the parallel_degree_policy parameter to a value of MANUAL, then these three new features are disabled.
  4. Version 12c introduces AUTO_FEEDBACK, a new value that is a combination of AUTO and FEEDBACK. All three parallel features discussed earlier are enabled, as with the AUTO value. In addition, a feedback mechanism attached to the cursor in memory is used to mark a statement to be reparsed with a different DOP. Two types of feedback are possible: statement-level time feedback and operator-level time feedback. After a parallel statement execution, if the DOP observed during execution is significantly3 different from the calculated DOP, then the statement is marked to be reparsed with the actual DOP for subsequent executions. Operator-level feedback is more granular than statement-level feedback and tracks the DOP at the operation level: if the actual operator-level DOP differs from the estimated DOP, then the operator-level execution DOP is used as feedback when the cursor is reparsed for subsequent executions.

I will introduce these features and discuss how they can be effectively utilized in a RAC environment.

In-Memory Parallelism

Traditionally, PX servers read blocks directly from disk into the PGA, bypassing the buffer cache. In a few cases, it may be better to buffer the blocks in the database buffer cache so that subsequent PX server processes can reuse the buffers. This feature is pertinent on machines with huge amounts of memory allocated to the SGA. Of course, buffering in the SGA requires global cache locks, which might induce global cache event waits for the PX servers reading the blocks from disk.

From Oracle Database version 11.2 onward, if the parallel_degree_policy parameter is set to auto, then the in-memory parallelism feature is enabled. Objects are chosen to be cached in the buffer cache if the object is neither too small nor too big: if the segment is too small, then it can be cached in the PGA itself; if the segment is too big, then buffering would overrun the buffer cache. Thus, this feature cleverly chooses segments of the right size. The advantage of this feature is that subsequent processes accessing the same table do not need to repeat the work of reading the blocks from disk.

The algorithm uses the following directives to determine if the objects can be buffered or not:

  1. Size of the object (must fit in the buffer cache of one instance; the parameter _parallel_cluster_cache_pct limits the maximum percentage of buffer cache that can be used for affinity and defaults to 80%).
  2. Object access frequency.
  3. Object change frequency.

In RAC, object fragments are affinitized among the active instances. For example, if there are three instances in a PX execution, then approximately 1/3rd of the object is affinitized to each instance. Affinitization is performed on a per-extent basis for a non-partitioned table: 1/3rd of the extents are affinitized to each node in a three-node cluster. If the table is hash partitioned, then the object is affinitized on a per-partition basis. PX server processes are also affinitized in such a way that PX servers access extents/partitions in the local buffer cache. For example, if partition P1 is affinitized to the ORCL2 instance, then PX processes are allocated in the ORCL2 instance if the execution plan requires access to partition P1. Affinitization is designed to reduce cache fusion network traffic.

This feature can be enabled at the session level also:

alter session set "_parallel_cluster_cache_policy"=cached;

This feature can be disabled at the session level using the following statement. Of course, it is always better to check with Oracle Support before using underscore parameters in production code. Also, instead of hard-coding these SQL statements in production code, wrap this code in a PL/SQL procedure and call the procedure from the application code. This enables you to change the code in the database without altering the application.

alter session set "_parallel_cluster_cache_policy"=adaptive;

The following SQL statistics show a comparison between parallel executions of a statement with caching enabled and with caching disabled. The comparison shown here is for the second execution of the statement in each mode in a production database. If caching is enabled, subsequent executions of the statement perform almost no disk I/O and reuse the cached buffers. If caching is disabled, then each execution performs direct I/O to read blocks from disk. In the following example, the SQL statement performed about 232K disk reads without caching, with an increased elapsed time of 81 seconds, compared to 18 seconds when caching was enabled.

SQL_ID         EXEC CPU_TIME ELA_TIME BGETS    DRDS     ROWS
-------------  ---- -------- -------- -------- -------- -------
gj1p0kqyxvqas     1    17.37    18.66   455257        0     809 <- with caching enabled
gj1p0kqyxvqas     1    17.09    81.42   237443   232779     809 <- without caching

You need to understand that the effect of this feature depends upon the activity in the database buffer cache. If the buffers are not accessed frequently, then the buffers will be flushed out of the buffer cache, and so PX servers may be doing more work with no future benefit.4  On the Exadata platform, this feature is probably not useful, as the smart scan feature requires direct path reads.

As the tables are buffered, there is a possibility that this feature can increase interconnect network traffic inadvertently. So, if you are planning to enable in-memory parallelism, verify that interconnect hardware is properly sized to support a possible increase in private network traffic.

AutoDOP

AutoDOP is a new feature available from Oracle Database version 11.2 onward. Prior to version 11.2, the optimizer chose a PX plan only if the objects had a DOP greater than 1 or if there were hints in the SQL statement enabling parallelism. From 11.2 onward, parallelism is considered if the estimated run time of a SQL statement exceeds a threshold. The following pseudocode outlines the logic of the AutoDOP feature.

Estimate the total run time of the SQL statement for serial execution.
If estimated_run_time > threshold (10 seconds by default) then
    Re-parse the statement with the default DOP and determine the cost.
    If the estimated parallel cost is lower than the estimated serial cost then
        Execute the statement with the default DOP.
    Else
        Execute the statement serially.
    End if;
Else
    Execute the statement serially.
End if;

The AutoDOP feature can be enabled by setting parallel_degree_policy to auto or limited (11.2) or auto_feedback (12c). If parallel_degree_policy is set to limited, then objects decorated with a parallel degree must be accessed in the SQL statement for AutoDOP to be considered. The following lines, printed from an event 10053 trace file, show that AutoDOP was disabled, since the parallel_degree_policy parameter is set to a value of limited and none of the accessed objects is decorated with a parallel degree.

**************************
Automatic degree of parallelism (ADOP)
**************************
Automatic degree of parallelism is disabled: limited mode but no parallel objects referenced.

If you set the parallel_degree_policy parameter to auto, then the AutoDOP feature is enabled, but I/O must be calibrated for AutoDOP to take effect.

**************************
Automatic degree of parallelism (ADOP)
**************************
Automatic degree of parallelism is enabled for this statement in auto mode.
kkopqSetDopReason: Reason why we chose this DOP is: IO calibrate statistics are missing.

Enabling I/O calibration using the dbms_resource_manager package will enable the AutoDOP feature. Calibration results are visible in the v$io_calibration_status view. Note that there is no documented method to remove the calibration,5  so this change must be tested in a non-production database before calibrating I/O in production.

To calibrate I/O, you can choose any instance of the cluster and trigger I/O calibration. The database engine uses just one instance to calibrate I/O, but since the I/O resource is common to all nodes, performing the calibration from one node is good enough.

SET SERVEROUTPUT ON
DECLARE
  l_lat  INTEGER;
  l_iops INTEGER;
  l_mbps INTEGER;
BEGIN
  -- Calibrate I/O and print statistics
  DBMS_RESOURCE_MANAGER.CALIBRATE_IO (20, 20, l_iops, l_mbps, l_lat);
  DBMS_OUTPUT.PUT_LINE ('max_iops = ' || l_iops);
  DBMS_OUTPUT.PUT_LINE ('latency = ' || l_lat);
  DBMS_OUTPUT.PUT_LINE ('max_mbps = ' || l_mbps);
END;
/6
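Once the calibration run completes, you can confirm that calibration statistics are in place before relying on AutoDOP; a minimal sketch:

select status, calibration_time from v$io_calibration_status;

select * from dba_rsrc_io_calibrate;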

If the estimated run time exceeds the value of the parameter parallel_min_time_threshold, then the statement is reparsed with a higher DOP. The parameter value of parallel_min_time_threshold defaults to a value of AUTO, which indicates a threshold of 10 seconds. If you prefer more queries to be considered for parallelism, then you can decrease the value of the parallel_min_time_threshold parameter to less than 10 seconds.

After the completion of I/O calibration, regenerating the event 10053 trace for the same SQL statement shows that AutoDOP was enabled for this statement execution. The optimizer converts the estimated cost to time using the calibrated I/O statistics. In this example, the estimated time is 38 seconds, which exceeds the parallel_min_time_threshold, and so the optimizer considers AutoDOP.

kkeCostToTime: using io calibrate stats
 maxmbps=10(MB/s) maxpmbps=11(MB/s)
 block_size=8192 mb_io_count=4 mb_io_size=30247 (bytes)
 tot_io_size=422(MB) time=38387(ms)
AutoDOP: Table/Index(#76100) access (scan) cost=14638.09 estTime=38386.60 unit=10000.00
   dop=3 -> maxdop=3
Best:: AccessPath: TableScan
Cost: 14638.09 Degree: 1 Resp: 14638.09 Card: 159987.00 Bytes: 0

The following lines from the trace file show that the optimizer is triggering a reparse of the SQL statement with a statement-level parallelism of 3.

AUTODOP PLANS EVALUATION
***************************************
Compilation completed with DOP: 1.
   Cost_io: 59742.00 Cost_cpu: 1227068251.35
   Card: 318995.00 Bytes: 1279807940.00
   Cost: 59810.86 Est_time: 156847ms
Serial plan is expensive enough to be a candidate for parallelism (59811)
Signal reparse withDOP 3.
*****************************
Number of Compilations tried: 1

Let us review a few important parameters controlling the AutoDOP feature. The parallel_degree_limit parameter governs the maximum DOP that can be chosen for a SQL statement. By default, this parameter is set to CPU, implying that the default DOP is the maximum DOP for an AutoDOP execution. The default DOP is calculated by multiplying the total number of CPUs in the cluster (the sum of cpu_count from all active instances) by the parallel_threads_per_cpu parameter (default 2). So, if there are 32 cores in each node of a three-node cluster, then the default DOP (and hence the limit) is calculated as 32 * 3 * 2 = 192, and up to 192 PX servers can be used for a single SQL statement execution.

image Note   The parameter parallel_degree_limit can also be set to IO or an integer. If you set the parameter value to IO, then the I/O bandwidth determined by the dbms_resource_manager.calibrate_io procedure is used to derive the upper bound for the default DOP. You can also set the parallel_degree_limit parameter to a number specifying the maximum parallelism to be used for AutoDOP calculations.

The parameter parallel_max_servers is a hard upper bound determining the maximum number of PX servers available in any instance, and the parallel_max_servers parameter overrides the parallel_degree_limit parameter if the parallel_max_servers parameter is set to a value lower than the default DOP.

While parallel_degree_limit controls the upper bound for parallelism, the actual parallelism used in a query (the calculated or actual DOP) may be less than the default DOP. The exact algorithm for the actual DOP calculation is not documented; however, from event 10053 trace files, we can infer that the optimizer re-costs the statement with different DOPs to find the optimal DOP for a SQL execution.

There are a few dangers with the AutoDOP feature in a production database.

  1. If many concurrent queries are parallelized inadvertently, then those queries can deplete I/O and CPU resources in the database server, causing performance issues with the online application.
  2. In addition, inadvertent parallel queries can consume all parallel servers, starving critical parallel queries of parallel servers.
  3. Also, in RAC, if many queries are parallelized, then there is a possibility that numerous queries might perform inter-instance PX, leading to worse interconnect latency for online application(s).

So, in the production database, it is generally a good idea to control parallelism so that you don’t accidentally run out of resources. Further, you need to understand that AutoDOP uses estimated run time to decide if AutoDOP should be enabled or not. Hence, it is critical to have optimal statistics in a RAC environment so that excessive PX executions are not accidentally triggered due to a statistical issue.

The AutoDOP feature is more suitable for data warehousing environments. You might need to tune the parallel_min_time_threshold parameter to match the workload in your RAC cluster. By default, the parallel_min_time_threshold parameter is set to AUTO, equivalent to 10 seconds. In my experience, such a low threshold can trigger a tsunami of parallel executions. So, if you are planning to enable the AutoDOP feature, increase the parallel_min_time_threshold parameter to a higher value, at least 600 seconds.
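For example, the threshold can be raised database-wide; a minimal sketch (600 seconds is the starting point suggested above, not a universal recommendation):

alter system set parallel_min_time_threshold=600 sid='*';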

This feature is very useful in ad hoc reporting and data warehousing environments, where queries initiated by the users may not specify selective predicates and parallelism may be the only option to tune those queries. It may not be feasible to tune those ad hoc queries individually either. In these scenarios, it may be prudent to use the AutoDOP feature and let the optimizer choose optimal parallelism, but use Database Resource Manager or proper parallel_max_servers to control query parallelism.

Version 12c introduces a new parameter, parallel_degree_level, which controls the aggressiveness of the default DOP calculation by acting as a scaling factor. For example, if you set this parameter to 50, then the calculated default DOP is multiplied by 0.5, reducing the DOP to one-half of the default.

Parallel Statement Queuing

It is all too common in a production database for SQL statements to acquire fewer PX servers than the requested DOP. Consider a SQL statement requesting a DOP of 32 while the currently executing parallel statements are using all but four of the PX servers; this SQL statement will then execute with a reduced parallelism of 4. Even if more PX servers become available minutes after the start of the parallel execution, the SQL statement will continue to execute with a DOP of 4. A process with a normal run time of 1 hour might run for 8 hours (assuming linear extrapolation) and cause performance issues. Had that SQL statement delayed its execution for a few seconds, it could have acquired all requested PX servers and might have completed in its normal run time.

image Note   A short story about parallel execution: we had a client performance issue in which a job intermittently ran longer than normal. Our task was to identify the root cause, as the data volume was about the same every day. Reviewing ASH data, we were able to identify that a critical SQL statement in the program did not get enough PX servers on the slower days. Even though the parallel_max_servers parameter was set to a much higher value, the SQL statement did not get enough PX servers. Finally, we found the root cause: another process was scheduled at exactly the same second, consuming nearly all available PX servers and starving the critical process. Whichever job allocated PX servers first got them, and the other job could not get all of the PX servers it required. The solution we used was to delay the non-critical job by a minute to allow the critical process to acquire all of its PX servers. However, parallel statement queuing would have been an even better solution for this problem.

Prior to Database version 11.2, the parameter parallel_min_percent could be used (with a little coding) to wait for the requested PX servers to become available. Since version 11.2, a waiting scheme for PX server availability can be achieved more elegantly using the Parallel Statement Queuing feature. Essentially, if the requested number of PX servers is not available, then the parallel execution is delayed and the process waits in a statement queue until the requested PX servers become available.

If the parallel_degree_policy is set to AUTO, then the Parallel Statement Queuing feature is enabled. This feature also can be enabled in a session by altering the _parallel_statement_queuing parameter to true.

alter session set "_parallel_statement_queuing"=true;

The parallel_servers_target parameter controls when the Parallel Statement Queuing feature becomes active: if the number of PX servers currently executing in the cluster is greater than parallel_servers_target, then the feature becomes active. Consider an example: if parallel_servers_target is set to 96 and the number of PX servers currently active is 100, then this feature becomes active. If a SQL statement now requests a parallel degree of 32, then the session will be queued until at least 32 PX servers are available in the cluster.7  As soon as 32 PX servers are available, the session will allocate the PX servers and continue processing. In essence, this feature protects a SQL statement from executing with a reduced number of PX servers. Still, the maximum number of PX servers that can execute at any time in an instance is dictated by the parallel_max_servers parameter.

Consider the following setup to explain this feature with clarity. Assuming that each server has 32 cores (cpu_count=32) and that there are three active instances in the cluster, the default values for the parallel_max_servers and parallel_servers_target parameters are calculated as follows:

Parallel_max_servers       = 32 * parallel_threads_per_cpu * (2 if pga_aggregate_target > 0) * 5
                           = 32 * 2 * 2 * 5 = 640
 
Total parallel_max_servers = parallel_max_servers * active_instance_count
                           = 640 * 3 = 1,920
 
Parallel_servers_target    = 4 * 32 * parallel_threads_per_cpu * active_instance_count
                           = 4 * 32 * 2 * 3 = 768 PX servers

If the total number of active PX servers exceeds 768, then the Parallel Statement Queuing feature becomes active. However, the feature does not necessarily limit PX allocation yet. Suppose that 800 PX servers are active; if another process requests 200 more PX servers, then that process will continue executing in parallel with 200 PX servers without waiting in the statement queue, as the total number of active PX servers is still lower than the total parallel_max_servers value.

Suppose instead that three sessions have consumed 1,500 PX servers in total, leaving just 420 PX servers available. If another process requests 500 PX servers, then that process will wait until all 500 PX servers are available. So, the parallel_servers_target parameter determines when this feature becomes active, and the feature limits PX server allocation only when the requested PX servers are not available.

The default values for the parallel_max_servers and parallel_servers_target parameters are very high and overly optimistic about the I/O resources available in a database machine. It is not uncommon to set the parallel_max_servers parameter to a lower value than the default. If you adjust the parallel_max_servers parameter, then you should adjust the parallel_servers_target parameter as well. You should consider the type of workload, the I/O resources available in the database machine, and the PX activity when setting these parameters.
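A minimal sketch of such an adjustment follows; the values are purely illustrative and should be derived from your own workload and I/O capacity:

alter system set parallel_max_servers=256 sid='*';
alter system set parallel_servers_target=192 sid='*';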

A session enqueued waiting for PX servers waits on the resmgr: pq queued event.
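You can see sessions that are currently held in the parallel statement queue by looking for that wait event; a minimal sketch:

select inst_id, sid, sql_id, event, seconds_in_wait
from   gv$session
where  event = 'resmgr: pq queued';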

It is possible to enable the Parallel Statement Queuing feature for an individual SQL statement using the STATEMENT_QUEUING hint, even if the feature is disabled in the database. Conversely, the NO_STATEMENT_QUEUING hint can be used to disable queuing for a statement if the feature is enabled.
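For example, the following statement requests queuing for just this execution; a minimal sketch against the SALES table used earlier in this chapter:

select /*+ parallel(8) statement_queuing */ count(*) from sales;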

However, the Parallel Statement Queuing feature doesn’t work well with the parallel_force_local parameter. If the parameter parallel_force_local is TRUE, then parallel_statement_queuing keeps the session in the queue, even if PX servers are available in other nodes.8  It is a better idea to use services instead of the parallel_force_local parameter if you are planning to use the Parallel Statement Queuing feature.

Prioritizing Parallel Statement Queues (12c)

The Parallel Statement Queuing feature works in a strict First In, First Out (FIFO) model in Database versions up to 11.2. In version 12c, the feature is enhanced to support multiple parallel statement queues, and Database Resource Manager can be utilized to create those queues and prioritize the next parallel statement execution; parallel statements from a higher-priority statement queue are dequeued first for execution.

It is easier to understand priority among parallel statement queues using an example. Consider that you want to design a queuing scheme with the following objectives:

  1. SYS_GROUP should have the highest priority, and any statements in this consumer group must be dequeued first.
  2. If no session is waiting in the SYS_GROUP queue, then SQL statements should be dequeued in the order of the PX_CRITICAL, PX_BATCH_HIGH, PX_BATCH_MED, and PX_BATCH_LOW queues.

The following code will implement the designed priority objective. In this example, four new consumer groups are created and priority is assigned among those consumer groups. First, a pending area is created to make changes to resource consumer groups. The next step creates a resource plan named PQ_STMT_PLAN.

begin
 dbms_resource_manager.create_pending_area();
end;
/
--------------Create Resource Manager Plan -----------------------------
begin
 dbms_resource_manager.create_plan(
  plan => 'PQ_STMT_PLAN',
  comment => 'Resource plan for PQ statement Queuing 12c');
end;
/

The following step creates four Resource Manager consumer groups. These consumer groups will be associated with a plan directive in the next step.

-------------- Create Resource Manager Consumer Groups -------------------
begin
 dbms_resource_manager.create_consumer_group(
  consumer_group => 'PX_CRITICAL',
  comment => 'Resource consumer group/method for online users / report sessions');
 dbms_resource_manager.create_consumer_group(
  consumer_group => 'PX_BATCH_HIGH',
  comment => 'Resource consumer group/method for batch high users');
 dbms_resource_manager.create_consumer_group(
  consumer_group => 'PX_BATCH_MED',
  comment => 'Resource consumer group/method for batch med users');
 dbms_resource_manager.create_consumer_group(
  consumer_group => 'PX_BATCH_LOW',
  comment => 'Resource consumer group/method for batch low users');
end;
/

The next step creates the priorities among the parallel statement queues of the resource consumer groups. The mgmt_p1 directive decides the priority of a specific queue at level 1. In the following example, SYS_GROUP has 100% priority at level 1, so a parallel statement with SYS_GROUP as the consumer group will be chosen next for execution. If no statement belonging to SYS_GROUP is waiting in the parallel statement queue, then a parallel query from the level 2 PX statement queues will be considered for execution.

The mgmt_p2 directive decides the priority of a parallel statement queue at level 2, and the value of mgmt_p2 is set to 50 for the consumer group PX_CRITICAL. So, there is a 50% probability that the next query scheduled will belong to a session associated with the PX_CRITICAL consumer group. Similarly, PX_BATCH_HIGH has a 30% probability that one of its PX sessions will be chosen to execute next, and PX_BATCH_MED and PX_BATCH_LOW each have a probability of 10%.

The priority of OTHER_GROUPS is set to 100 at level 3 by setting the mgmt_p3 attribute. A statement belonging to OTHER_GROUPS will be executed only if there is no other statement waiting in the higher-level queues.

------------- Create Resource Manager Plan Directives --------------------
begin
 dbms_resource_manager.create_plan_directive(
  plan => 'PQ_STMT_PLAN',
  group_or_subplan => 'SYS_GROUP',
  comment => 'SYS sessions at level 1',
  mgmt_p1 => 100 );
 dbms_resource_manager.create_plan_directive(
  plan => 'PQ_STMT_PLAN',
  group_or_subplan => 'PX_CRITICAL',
  comment => 'Online day users sessions at level 1',
  mgmt_p2 => 50);
 dbms_resource_manager.create_plan_directive(
  plan => 'PQ_STMT_PLAN',
  group_or_subplan => 'PX_BATCH_HIGH',
  comment => 'Batch users level 2 -HIGH- Fast running ',
  mgmt_p2 => 30);
 dbms_resource_manager.create_plan_directive(
  plan => 'PQ_STMT_PLAN',
  group_or_subplan => 'PX_BATCH_MED',
  comment => 'Batch users sessions at level 2 - MED - Medium running',
  mgmt_p2 => 10);
 dbms_resource_manager.create_plan_directive(
  plan => 'PQ_STMT_PLAN',
  group_or_subplan => 'PX_BATCH_LOW',
  comment => 'Batch users sessions at level 2 - LOW - long running',
  mgmt_p2 => 10);
dbms_resource_manager.create_plan_directive(
  plan => 'PQ_STMT_PLAN',
  group_or_subplan => 'OTHER_GROUPS',
  comment => 'Batch users sessions for Other groups',
  mgmt_p3 => 100);
end;
/
begin
 dbms_resource_manager.validate_pending_area();
end;
/
begin
 dbms_resource_manager.submit_pending_area();
end;
/
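The directives take effect only when PQ_STMT_PLAN is made the active Resource Manager plan and sessions are mapped to the consumer groups (for example, with dbms_resource_manager.set_consumer_group_mapping). Activating the plan on all instances is a minimal sketch:

alter system set resource_manager_plan='PQ_STMT_PLAN' sid='*';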

In a nutshell, version 12c enhances the strict FIFO policy employed for PX statement execution in earlier versions. A scheme such as the one discussed in this section can be used to prioritize the workload so that higher-priority sessions do not have to wait behind lower-priority parallel statement executions.

This enhancement is useful in mixed-workload environments, for example when highly critical short-running parallel executions and low-priority long-running parallel executions are mixed in the same database. If a critical SQL statement must wait for a low-priority parallel statement execution, then the critical parallel execution may be delayed. Worse, the low-priority execution can be a long-running statement with a higher number of PX servers, so the wait time for critical statements can be longer. By enabling prioritization within the workload, you can reduce the wait time for critical SQL executions in the parallel statement queues.

Critical Parallel Statement Queues (12c)

In the preceding section, you learned how to prioritize parallel statement execution among consumer groups. You may recall that the Parallel Statement Queuing feature kicks in only after the active PX server count exceeds the parallel_servers_target parameter. In some cases, you might prefer ultra-critical jobs to acquire PX servers by bypassing the PX statement queue entirely. For example, if an important index must be rebuilt quickly, waiting in the parallel statement queue may not be acceptable. Because the parallel_servers_target parameter is configured to be lower than the parallel_max_servers parameter, there is usually headroom for ultra-critical workloads to bypass the queue.

Version 12c introduces the attribute parallel_stmt_critical for a resource consumer group. If you set this attribute to BYPASS_QUEUE, then sessions attached to this resource consumer group will bypass the PX statement queue and start execution immediately. As the parallel_max_servers parameter is a hard threshold, it is possible for this critical statement to execute with fewer PX servers than requested.

begin
 dbms_resource_manager.create_pending_area();
end;
/
begin
 dbms_resource_manager.create_consumer_group(
  consumer_group => 'PX_ULTRA_CRIT',
  comment => 'Resource consumer group/method for critical users');
 dbms_resource_manager.create_plan_directive(
  plan => 'PQ_STMT_PLAN',
  group_or_subplan => 'PX_ULTRA_CRIT',
  comment => 'Very Critical stuff',
  parallel_stmt_critical => 'BYPASS_QUEUE');
end;
/
begin
 dbms_resource_manager.validate_pending_area();
end;
/
begin
 dbms_resource_manager.submit_pending_area();
end;
/

Think of the bypass mechanism as an override. You should design a consumer group with this attribute only if it is absolutely required, as there is a danger of critical statements executing with fewer PX servers than requested.
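
If you need to verify whether a parallel statement actually bypassed the queue or is still waiting, one option (a sketch that assumes Real-Time SQL Monitoring is available for the statement) is to look for the QUEUED status in gv$sql_monitor:

select inst_id, sid, sql_id, status
from   gv$sql_monitor
where  status = 'QUEUED';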

Parallel Statement Queue Timeout (12c)

By default, PX statements will wait in the statement queue indefinitely, which may not be desirable in a production environment. Version 12c introduces a timeout, namely the parallel_queue_timeout directive attribute, for a resource consumer group. If a session waits in the parallel statement queue longer than the parallel_queue_timeout value (specified in seconds), then the statement is terminated with an ORA-7454 error and removed from the parallel statement queue. For example, with the following directive, sessions associated with PX_BATCH_LOW will wait in the parallel statement queue for at most an hour.

begin
 -- Use update_plan_directive instead if this directive already exists in the plan.
 dbms_resource_manager.create_plan_directive(
  plan => 'PQ_STMT_PLAN',
  group_or_subplan => 'PX_BATCH_LOW',
  comment => 'Batch users sessions at level 2 - LOW - long running',
  mgmt_p2 => 10, parallel_queue_timeout => 3600);
end;
/

Also note that it is the responsibility of the application code to handle this error condition and react appropriately when ORA-7454 is received.
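
As a sketch of what that handling might look like, the following anonymous block traps ORA-7454 using a named exception; the table names and the reaction to the error are illustrative only.

declare
  e_px_queue_timeout exception;
  pragma exception_init(e_px_queue_timeout, -7454);
begin
  execute immediate
    'insert /*+ parallel(8) */ into summary_table
     select col1, count(*) from huge_table group by col1';
  commit;
exception
  when e_px_queue_timeout then
    -- The statement was removed from the parallel statement queue;
    -- log the condition and decide whether to resubmit later or run serially.
    dbms_output.put_line('Parallel statement timed out in the PX statement queue');
end;
/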

Grouping Parallel Statement Queue Execution (12c)

A problem with the Parallel Statement Queuing feature is that a transaction may contain multiple parallel statements, and every parallel statement can wait in the statement queue, prolonging the transaction. For example, a transaction may consist of two statements, with the first inserting 500 million rows into a table using parallelism and the second summarizing those 500 million rows using parallelism. Since the Parallel Statement Queuing feature works at the statement level, the session might wait twice within the same transaction, inducing performance issues. Further, multiple waits in the parallel statement queue prolong the transaction, which could lead to unnecessary locking issues or UNDO management issues. The following group of SQL statements shows the example scenario.

-- First parallel statement
Insert /*+ parallel(10) */ into huge_table
  Select  /*+ parallel (10) */ col1, col2 .. from oltp_transactions..;
update  emp..     ;              -- Serial statement
 
-- Second parallel statement
insert /*+ parallel (10) */ into summary_table
select /*+ parallel (10) */ ..from huge_table;
 
commit;

It may be acceptable for the first parallel SQL statement to wait in the parallel statement queue, but the subsequent, nearly immediate, parallel statement should not wait in the parallel statement queue. In 11gR2, every statement will wait in the PX statement queue if the requested PX servers are not available.

Version 12c introduces a new feature to group multiple parallel statements, reducing the number of waits in the parallel statement queue. The procedure calls BEGIN_SQL_BLOCK and END_SQL_BLOCK can be used to group parallel statements. The following code groups the earlier parallel statements and enables reuse of the PX servers acquired by the first parallel statement execution.

exec dbms_resource_manager.begin_sql_block;
Insert /*+ parallel(10) */ into huge_table
  Select  /*+ parallel (10) */ col1, col2 .. from oltp_transactions..;
update  emp..     ;              -- Serial statement
--second parallel statement
insert /*+ parallel (10) */ into summary_table
select /*+ parallel (10) */ ..from huge_table;
commit;
exec dbms_resource_manager.end_sql_block;

With this strategy, the multiple parallel statements in the block are considered as one group, and the waits in the parallel statement queue are minimized. The downside of this feature is that if there is a delay between two parallel statement executions in the same parallel statement block, then the PX servers can be held unnecessarily by the session during that delay.

Debugging PX Execution

Debugging PX execution problems in a RAC cluster requires reviewing the PX execution environment setup. The PX tracing event is useful for understanding why a query was not parallelized at run time even though the optimizer chose a PX plan.9

In the following example, I will review a problem of a SQL statement executing serially even though the optimizer execution plan indicates a PX plan. I will trace the PX execution by setting the _px_trace event to high, all.10

alter session set "_px_trace"=high,all;
 
select /*+ parallel (8) full(h) full(l) */ count(*)  from
oe_order_headers_all h, oe_order_lines_all l
where h.header_id = l.header_id and
h.open_flag='N';

Reviewing the trace file generated by the _px_trace event, we can walk through PX server allocation and parallel execution. The following part of the trace file shows the default DOP calculation. In this database, the calculated default DOP is 168.

kxfrSysInfo                                                    [      12/     0]
        DOP trace -- compute default DOP from system info
        # instance alive  = 3 (kxfrsnins)
kxfrDefaultDOP                                                 [      12/     0]
        DOP Trace -- compute default DOP
            # CPU       = 28
            Threads/CPU = 2 ("parallel_threads_per_cpu")
            default DOP = 56 (# CPU * Threads/CPU)
            default DOP = 168 (DOP * # instance)
            Default DOP = 168

Subsequent lines show the reasoning behind the serial execution. In this database, the Database Resource Manager disabled parallelism for the current user by granting zero PX servers, which is why the SQL statement did not get any PX servers.

PX_Messaging:kxfp.c@17808:kxfpclinfo():     2   (78  :0   :100:35  :16    :0     :0   ...)
    Ressource Manager reduced num server to maxdop = 0
    serial -- Requested 8 threads,Granted 0Aborting!
       Max Servers=32  Adaptive=0 RM Max-DOP=0

After switching to the correct resource consumer group and retracing the execution of the statement, we can see that the query allocated eight PX servers each in instances 1 and 2. As you can see, two sets of eight threads were requested by this parallel query, and all of them were granted by the Resource Manager. In the following output, PX server P000 is allocated from instance 2; the complete trace file lists all 16 PX server allocations.

kxfpgsg                                                        [     586/     0]
          getting 2 sets of 8 threads, client parallel query execution flg=0x234
 
          Height=8, Affinity List Size=0, inst_total=1, coord=2
          Insts     2
          Threads   8
kxfpg1sg                                                       [     587/     1]
          q:e8dfd9668 req_threads:8 nthreads:8 unit:1 #inst:1 normal
          jStart:0  jEnd:32  jIncr:1 isGV:0  i:0 instno:2 kxfpilthno:8
kxfpg1srv                                                      [     587/     0]
          trying to get slaveP000 on instance 2for q:e8dfd9668
          slaveP000 is local
          found slave P000 dp:e11cdc888 flg:18
          Got It. 1 so far.

You can debug PX execution problems by tracing the statements with the _px_trace event and reviewing the generated trace file. Refer to Oracle Support document ID 400886.1 for further information about the _px_trace event.
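
Besides tracing, a quick sanity check after running the statement is to query v$pq_sesstat in the same session; it shows whether the last query was parallelized and how many PX servers it used. A minimal sketch:

select statistic, last_query, session_total
from   v$pq_sesstat
where  statistic in ('Queries Parallelized', 'Allocation Height', 'Server Threads');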

Index Creation in RAC

Parallel index creation is a much-discussed topic in a RAC database. Should you allocate PX servers from all instances to create a big index, or should you allocate PX servers from just one instance? The strategy depends upon the environment, and the following resource constraints should be considered before deciding on a strategy:

  1. Does one node have enough memory, CPU, and I/O capacity to complete index creation?
  2. How much interconnect bandwidth is available for PQ workload? Can the bandwidth sustain the burst of interconnect traffic generated by index creation statements?
  3. Is there sufficient I/O bandwidth such that the LGWR process in one node can write all index redo entries to the log files without incurring any performance delay? Index creation in two nodes uses the LGWR processes in both nodes and might perform better if LGWR is a bottleneck.

For example, a 6GB index creation using two nodes induced 2.6GB of interconnect traffic. The following output shows the interconnect traffic in bytes from x$ksxpclient.

INSTANCE_NUMBER         NAME       DIFF_SENT   DIFF_RECV
-----------------       ---------  ----------  ----------
                1       ipq        1440328342  1297116631
                2       ipq        1299190976  1438331216
                3       ipq            733181      575309
                4       ipq            482387      350462

The same index creation with PX servers allocated in all four nodes generated 4.1GB of interconnect traffic for the PQ workload.11 So, creating an index from many nodes can increase interconnect traffic.

INSTANCE_NUMBER         NAME      DIFF_SENT  DIFF_RECV
---------------         --------- ---------- ----------
              1         ipq       1101141155  989508581
              2         ipq       1015271038  992655090
              3         ipq       1150033881  981687750
              4         ipq        906133897 1204963877
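
The DIFF_SENT and DIFF_RECV values shown above are deltas between two snapshots of x$ksxpclient, taken before and after the index creation on each instance. A minimal sketch follows, assuming a SYS connection on each instance (x$ tables are not externalized through gv$ views) and the bytes_sent and bytes_rcv columns; subtract the before values from the after values to get the per-instance traffic.

select name, bytes_sent, bytes_rcv
from   x$ksxpclient
where  name = 'ipq';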

If your database server has sufficient resources, you should probably set parallel_force_local=TRUE in your session and create the index from that session. This method allocates all PX servers on the same node, reducing interconnect traffic.
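
A minimal sketch of that approach follows; the index name, column, and DOP are illustrative, and the table is the one used in the earlier tracing example.

alter session set parallel_force_local = true;

create index oe_order_headers_n10 on oe_order_headers_all (open_flag)
  parallel 16 nologging;

-- Reset the index DOP so subsequent queries do not inherit the creation DOP.
alter index oe_order_headers_n10 noparallel;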

Also, you should know that index creation on partitioned tables can be slower if it is not executed properly. By default, a partition is the granule of PX allocation while creating a partitioned index, so one PX server is assigned to process one table partition. This strategy can cause performance issues, especially if the partitioned table is huge, which is typically the case. Worse, if the table has non-uniformly sized partitions, then a few PX servers can be stuck processing huge partitions while the other PX servers sit idle. Ideally, at least half of the allocated PX servers should be active throughout index creation. Hence, an optimal approach is to create the index structure in an UNUSABLE state and then rebuild the index partitions individually, as shown in the sketch below. Further, you can design your strategy in such a way that each index partition is rebuilt by a specific instance.
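
A minimal sketch of the UNUSABLE-then-rebuild approach, using hypothetical table, index, and partition names:

-- Create the local index structure without building it.
create index sales_fact_n1 on sales_fact (sale_id) local unusable;

-- Rebuild each index partition individually; different partitions can be
-- rebuilt from sessions on different instances to spread the work.
alter index sales_fact_n1 rebuild partition p1 parallel 8 nologging;
alter index sales_fact_n1 rebuild partition p2 parallel 8 nologging;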

image Note   The dbms_pclxutil package is immensely useful for rebuilding partitioned indexes with ease. The procedure dbms_pclxutil.build_part_index uses a parallel-enabled rebuild strategy to rebuild index partitions effectively, but the package does not allow you to distribute the jobs among instances. If you are planning to use a single instance to rebuild the indexes, then consider using the dbms_pclxutil package.
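
A hedged sketch of a build_part_index call follows; the job and DOP values and the table/index names are illustrative, so check the package documentation for the exact parameter semantics in your release.

begin
  dbms_pclxutil.build_part_index(
    jobs_per_batch => 4,             -- index partitions rebuilt concurrently
    procs_per_job  => 8,             -- DOP used for each partition rebuild
    tab_name       => 'SALES_FACT',  -- hypothetical table name
    idx_name       => 'SALES_FACT_N1',
    force_opt      => TRUE);         -- rebuild all partitions, not only UNUSABLE ones
end;
/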

Parallel DML in RAC

DML statements can be executed in parallel, utilizing PX servers from multiple instances. There are similarities in processing between parallel DML and index creation. Please refer to the section “Index Creation in RAC” earlier in this chapter for discussions about choosing intra-instance or inter-instance parallelism.

Parallel DML must be enabled in the session using the alter session enable parallel dml command. Two types of workload are possible with parallel DML, and the type chosen depends upon the execution plan: (1) parallel scan only, or (2) parallel scan and change. In the first case, PX servers scan the objects and the DML operation is performed by the QC. In the second case, PX servers scan the objects and perform the DML operation too. In the following execution plan output, in step 3, the table is updated and then a message is sent to the QC. Essentially, both the scan and the change are done by the PX server processes themselves. The UPDATE keyword is underneath the PX SEND QC operation, indicating that PX servers perform both the scan and the update operations.

-------------------------------------------------------------------------------------------
| Id  | Operation             | Name           | Starts | E-Rows | A-Rows |   A-Time   |...
-------------------------------------------------------------------------------------------
|   0 | UPDATE STATEMENT      |                |      1 |        |      4 |00:00:24.91 |
|   1 |  PX COORDINATOR       |                |      1 |        |      4 |00:00:24.91 |
|   2 |   PX SEND QC (RANDOM) | :TQ10000       |      0 |  32186 |      0 |00:00:00.01 |     |
|   3 |    UPDATE             | HUGETABLE_HASH |      0 |        |      0 |00:00:00.01 |
|   4 |     PX BLOCK ITERATOR |                |      0 |  32186 |      0 |00:00:00.01 |
|*  5 |      TABLE ACCESS FULL| HUGETABLE_HASH |      0 |  32186 |      0 |00:00:00.01 |
-------------------------------------------------------------------------------------------

In the following execution plan, PX servers are scanning and identifying the rows to update. The QC process performs the change. Notice that the UPDATE keyword precedes the PX SEND QC step, indicating that changes are performed by the QC process.

------------------------------------------------------------------------------------------
| Id  | Operation                         | Name           | Rows  | Bytes | Cost (%CPU)|
-----------------------------------------------------------------------------------------
|   0 | UPDATE STATEMENT                  |                |     1 |  1028 |  2600   (1)|
|   1 |  UPDATE                           | HUGETABLE_HASH |       |       |            |
|   2 |   PX COORDINATOR                  |                |       |       |            |
|   3 |    PX SEND QC (RANDOM)            | :TQ10003       |     1 |  1028 |  2600   (1)|
|   4 |     NESTED LOOPS                  |                |     1 |  1028 |  2600   (1)|
|   5 |      BUFFER SORT                  |                |       |       |            |
|   6 |       PX RECEIVE                  |                |       |       |            |
|   7 |        PX SEND BROADCAST          | :TQ10002       |       |       |            |
|   8 |         VIEW                      | VW_NSO_1       | 32186 |   408K|  2445   (1)|
|   9 |          HASH UNIQUE              |                |     1 |   817K|            |
|  10 |           PX RECEIVE              |                |     1 |   817K|            |
|  11 |            PX SEND HASH           | :TQ10001       |     1 |   817K|            |
|  12 |             HASH UNIQUE           |                |     1 |   817K|            |
|* 13 |              HASH JOIN            |                | 32186 |   817K|  2445   (1)|
|  14 |               PX BLOCK ITERATOR   |                | 32186 |   408K|  1222   (0)|
|  15 |                TABLE ACCESS FULL  | HUGETABLE_HASH | 32186 |   408K|  1222   (0)| ...
...

In RAC, you should write parallel DML so that the PX server processes perform the changes. With this strategy, true parallelism is achieved, as multiple PX servers are performing the DML operation. If only the QC is carrying out the changes and the magnitude of the change is large (for example, numerous rows to be updated), then performance will suffer, as just one process performs the change while the PX servers sit idle. Further, with inter-instance parallelism, multiple LGWR processes can write the redo buffers to the log files concurrently, improving redo write throughput.
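
A minimal sketch of enabling parallel DML so that the PX servers perform the change; the column names and the predicate are illustrative.

alter session enable parallel dml;

update /*+ parallel(t, 8) */ hugetable_hash t
set    t.status = 'PROCESSED'           -- hypothetical column
where  t.created_date < sysdate - 30;   -- hypothetical predicate

commit;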

Concurrent Union Processing (12c)

Prior to 12c, union branches were executed one at a time. For example, if there were three union branches in a SQL statement, then those branches would have been executed sequentially. Note that each individual union branch can execute utilizing PX servers, but only one union branch will be active at a time during SQL execution. This sequential execution is inefficient for certain types of SQL statements. For example, in the following SQL statement, most of the time is spent accessing the remote databases over the database links; that is, the time is spent in network transfer. Sequential execution of each branch adds up to a huge elapsed time for the SQL statement.

select /*+ PQ_CONCURRENT_UNION */ col1, col2, col3 from orders@retail_oltp
union all
select col1, col2, col3 from warehouse_orders@shipping
union all
select col1, col2, col3 from orders;

Version 12c introduces a concurrent union processing feature, and multiple union branches can be executed in parallel. So, in this example, since most of the time is spent in the network, executing the union branches concurrently will reduce the elapsed time of the SQL statement.

In RAC, this new feature has interesting uses. A complex parallel statement with multiple union all branches can be designed to utilize inter-instance parallelism across multiple nodes.12 Essentially, each node will work on its own union all branch.

Partition-Wise Join

Traditional PX execution uses the Producer/Consumer model, and the partition-wise join is an exception to that model. In a partition-wise join, each PX server reads one partition from each of the joined tables and performs the join processing. Finally, the joined result set is sent to the next process in the PX tree or to the QC. Since this technique does not use the Producer/Consumer model, PX messaging between the processes is reduced. In RAC, a partition-wise join operation in an inter-instance parallel execution does not induce excessive interconnect traffic.

An execution plan for a partition-wise join follows. In this plan, you can see that the PX PARTITION HASH ALL operation in step 5 precedes the HASH JOIN step. This execution sequence indicates that a partition-wise join is in play. Essentially, every PX server process performs its operation underneath the PX PARTITION step. Each PX server performs a HASH join between two matching partitions of the tables, aggregates the data, and then sends the aggregated results to the QC. As both the aggregation and the join operations are performed within a PX server process, interconnect traffic is minimal.

----------------------------------------------------------------------------------------------
| Id  | Operation                 | Name           | Starts | E-Rows | A-Rows |   A-Time   |
----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |                |      1 |        |      1 |00:00:37.42 |
|   1 |  SORT AGGREGATE           |                |      1 |      1 |      1 |00:00:37.42 |
|   2 |   PX COORDINATOR          |                |      1 |        |      8 |00:00:37.42 |
|   3 |    PX SEND QC (RANDOM)    | :TQ10000       |      0 |      1 |      0 |00:00:00.01 |
|   4 |     SORT AGGREGATE        |                |      0 |      1 |      0 |00:00:00.01 |
|   5 |      PX PARTITION HASH ALL|                |      0 |  32186 |      0 |00:00:00.01 |
|*  6 |       HASH JOIN           |                |      0 |  32186 |      0 |00:00:00.01 |
|   7 |        TABLE ACCESS FULL  | HUGETABLE_HASH |      0 |  32186 |      0 |00:00:00.01 |
|   8 |        TABLE ACCESS FULL  | HUGETABLE_HASH |      0 |  32186 |      0 |00:00:00.01 |
----------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   6 - access("T1"."N1"="T2"."N1")

In Figure 12-4, two instances participate in a partition-wise join between the SALES and ORDERS tables. PX servers P001 and P002 participate in this join operation in both instances. Process P001 reads partition P1 of the SALES table and partition P1 of the ORDERS table and joins these two partitions. Similarly, process P002 reads partition P2 of the SALES table and partition P2 of the ORDERS table and joins the rows from these two partitions. PX messages are kept minimal between the PX server processes, since this type of execution does not use the Producer/Consumer model. Finally, the result set is sent to the QC, reducing interconnect traffic for an inter-instance parallel operation.

9781430250449_Fig12-04.jpg

Figure 12-4. Partition-wise join

However, specific conditions must be met for a partition-wise join to be chosen. Among the requirements are that the joined tables be equipartitioned on the join key, with matching partition boundaries, and that the parallelism chosen for the joined tables match. A minimal sketch of such a setup follows.
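
This sketch uses hypothetical tables hash partitioned on the join key n1 with the same number of partitions, so a full partition-wise join is possible; the table names, columns, and DOP are illustrative.

create table t1 (n1 number, v1 varchar2(100))
  partition by hash (n1) partitions 8;
create table t2 (n1 number, v2 varchar2(100))
  partition by hash (n1) partitions 8;

select /*+ parallel(8) */ count(*)
from   t1, t2
where  t1.n1 = t2.n1;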

Summary

Parallel execution is an effective feature in a RAC environment. Although Oracle Database is optimized for effective placement of PX servers to reduce interconnect traffic, you should design interconnect bandwidth to match your application workload. If your application workload is mostly data warehouse queries, then you could expect higher interconnect traffic due to parallel execution. The following are the essential key elements of optimal practices in setting up parallelism in RAC:

  1. Use services to control placement of PX servers. Placement using the instance_groups parameter is deprecated, but placement using a combination of services and parallel_instance_group is allowed.
  2. Verify that the PEMS parameter is set to 16K. In PX-intensive environments, use Jumbo Frames.
  3. If parallel queries do not get enough PX servers, use the Parallel Statement Queuing feature to provide consistent elapsed times.
  4. For DDL statements on huge tables, if one node can handle the PX execution, then set up the parallel parameters so that PX servers are allocated from one node, or use the parallel_force_local parameter to force PX server allocation to a single node.
  5. Use raccheck and exacheck tools to verify parallelism setup in a RAC environment.

1 I will discuss methods to control PX server allocation to an instance or a subset of instances later in this chapter.

2 Arup Nanda points out that, in the Exadata platform, by default, the parallel_force_local parameter is set to TRUE. It is interesting to realize that even with the highly efficient Infiniband fabric-based interconnect, this parameter is set to TRUE. However, choosing a value for this parameter might require knowledge about your application workload and database server architecture.

3 Significantly different means that actual DOP is lower than 0.5 * estimated DOP or higher than 2 * estimated DOP.

4 More work here refers to additional tasks of maintaining cache fusion lock structures to read buffers into buffer cache.

5 It is possible to manually delete rows from an underlying table. Please contact Oracle Support to remove I/O calibration if needed.

6 Optimizer estimated cost is roughly a number normalized to single block reads.

7 Note that the parallel_max_servers parameter controls the hard maximum number of PX servers that can be active in an instance. So, even after exceeding the 96 PX servers specified by parallel_servers_target, a SQL statement may be able to acquire PX servers if parallel_max_servers is set to a value greater than 128.

8 It is also possible that I encountered an undocumented bug in the 11gR2 and 12c versions; this behavior needs to be retested in future versions.

9 Note that discussion here is not “why is SQL statement not parallelized?”, but rather “why didn’t the PX plan use the PX servers?”.

10 Note that _px_trace=high, all creates a huge amount of diagnostics details in PX trace files. You might need to contact Oracle Support to understand further details printed in a PX trace file.

11 Calculated by summing up either the DIFF_SENT column or the DIFF_RECV column, but not both.

12 In 12c, you may not be able to force this sort of inter-instance parallelism, where instances operate on disjoint branches of a union all execution. Hopefully, a future release will provide a mechanism to force such concurrent parallel execution.
