Data Guard tuning and wait events

Specific to standby database(s), we may have performance issues to read redo data and to transport over a network, redo write phase because of bad RAID configurations, Redo Apply phase because of huge redo, improper memory settings, or the issues can be with bugs. Here we will discuss some of them.

Network tuning

Standby databases will be placed geographically in different locations with WAN for high availability in case of a disaster. Even though you keep your standby database geographically far away, you should have reasonable bandwidth to avoid data lag between the primary and standby databases. It can be a bigger problem if you are using synchronous redo with AFFIRM. Consider the use of a high latency network to fulfill redo rate shipping as follows:

Required network bandwidth = ((Redo rate bytes per sec. /  0.7) * 8) / 1,000,000 = bandwidth in Mbps.

By using this formula according to the redo generation rate, you can estimate the required network bandwidth. You can get redo rate in bytes per second from DBA_HIST_SNAPSHOT or from the AWR/Statspack reports.

Network throughput can be increased by setting Oracle net parameters RECV_BUF_SIZE and SEND_BUF_SIZE equal to three times of Bandwidth Delay Product. To calculate Bandwidth Delay Product, the bandwidth of the link and the Network Round Trip time are required. RTT is measured by the complete two-way travel from the primary to standby database, including the standby and primary databases.

BDP = (Network speed * RTT) /8 

By this calculation, the optimal send and receive buffer sizes can be estimated with the following formula:

Socket buffer size = 3 * (Bandwidth Speed) * (RTT) 

Or you can also use the following:

Socket buffer size = 3 * (BDP) 

If the value of the socket buffer size is 11718750 bytes, the socket buffer size can be set as the following in sqlnet.ora or at the Operating System level:

[oracle@oracle-primary admin]$ cat sqlnet.ora|grep BUF_SIZE
RECV_BUF_SIZE=11718750
SEND_BUF_SIZE=11718750
[oracle@oracle-primary admin]$

You can also configure to send and receive buffer sizes to the net service for connector descriptor in the client-side sqlnet.ora file as follows:

INDIA =
  (DESCRIPTION =
    (RECV_BUF_SIZE=11718750)
    (SEND_BUF_SIZE=11718750)
    (ADDRESS = (PROTOCOL = TCP)(HOST = 192.168.180.20)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = india_un)
   )
  )

If you are replicating data remotely either using database links for materialized views or Data Guard, the data will be transferred over the network in terms of data sized units (SDU); if a large amount of redo is being transmitted, you can increase the size of the SDU buffer to improve performance and network utilization. You can configure it in the sqlnet.ora file as DEFAULT_SDU_SIZE, which ranges from 512 bytes to 32767 bytes. The default SDU size of 2048 bytes is applicable for the client and dedicated server, where for the shared server the default SDU will be 32767 bytes.

On the standby databases, you can configure either in the sqlnet.ora or listener.ora file where we can specify buffer parameters for the address in description as follows:

SID_LIST_LISTENER =
  (SID_LIST =
    (SID_DESC =
    (SDU = 32767)
    (GLOBAL_DBNAME = india_un)
    (SID_NAME = INDIA)
    (ORACLE_HOME = /u01/home/oracle/product/11.2.0/db_1)
   )
  )

Redo transport and apply tuning

If you are using a redo transport type such as ARCH, consider increasing the number of LOG_ARCHIVE_MAX_PROCESSES parameters. The default value in 11gR2 is 4 and it can be controlled from 1 to 30, if you set this parameter with a higher value. According to the archive processes and the system configuration, all the ARCn processes work in parallel to resolve the archive gaps.

Choose the optimal value after several tests with an archive gap resolution as follows:

SQL> select * from V$PGASTAT where name='total PGA allocated';
NAME                      VALUE UNIT
-------------------- ---------- ------------
total PGA allocated   249153536 bytes

The following output can be extracted using the view v$process:

PROGRAM                        PGA_USED_MEM   PGA_MAX_MEM
------------------------------ ------------   -----------
oracle@oracle-stby (ARC2)          11270688   12050576
oracle@oracle-stby (ARC1)          11297656   12050576
oracle@oracle-stby (ARC4)          28942512   30924944
oracle@oracle-stby (ARC3)          28942512   30924944
oracle@oracle-stby (ARC0)          28942512   30924944

So every archive process is consuming nearly 30 MB of memory; this calculation is completely based on the memory management you have used. Consider the parameter value LOG_ARCHIVE_MAX_PROCESSES depending on the available resources.

If you are using synchronous redo transport with LGWR redo, consider decreasing the value of NET_TIMEOUT to avoid outages on the production database's performance; this value can be defined from one to 1200 according to 11gR2 and the default value is 30 seconds. Oracle recommends setting the value of NET_TIMEOUT to 10 seconds or less to avoid disconnection from the standby database.

Redo data is received from the primary to standby database and it will be applied by the background process MRP0. Redo Apply is a block-to-block physical replication of the primary database. It uses media recovery to read records from standby redo logfiles into memory and applies directly to the standby database. If you start MRP as alter database recover managed standby database disconnect from session, only one MRP process will be started to perform recovery. For huge OLTP databases, there are various possibilities for having a lot of redo to be applied on a standby database, with a single background process recovery being delayed. So we can initiate parallel recovery and it starts slave processes along with MRP background processes as follows:

SQL> alter database recover managed standby database using current logfile disconnect from session parallel 5
SQL> !ps -ef|grep pr0
oracle   32243     1  0 19:33 ?        00:00:00 ora_pr00_INDIA
oracle   32245     1  0 19:33 ?        00:00:00 ora_pr01_INDIA
oracle   32247     1  0 19:33 ?        00:00:00 ora_pr02_INDIA
oracle   32249     1  0 19:34 ?        00:00:00 ora_pr03_INDIA
oracle   32251     1  0 19:34 ?        00:00:00 ora_pr04_INDIA
oracle   32253     1  0 19:34 ?        00:00:00 ora_pr05_INDIA
oracle   32292 31785  0 19:34 pts/2    00:00:00 /bin/bash -c ps -ef|grep pr0

Tip

In the previous example, we have explicitly mentioned parallelism with the number 5. We can also mention parallel without any specific value so that parallelism will be the default for the number of CPUs.

Data Guard wait events

Data Guard wait events are classified into primary-and standby-related wait events. According to the new releases, many of the wait events are introduced; some of them are as follows:

  • Data Guard wait events on the primary database with an ARCH transport

    These wait events are specific to sending redo from the primary database to the standby database using ARCH with synchronous or asynchronous redo transport. There is an ARCH wait on ATTACH, an ARCH wait on SENDREQ, and an ARCH wait on DETACH.

  • Data Guard wait events on the primary database with LGWR transport

    If you are using real-time apply with LGWR redo transport, the LNS process will be working with the standby RFS server in redo transport and the wait events can be LNS wait on ATTACH, LNS wait on SENDREQ, LNS wait on DETACH, LGWR wait on LNS, LNS wait on LGWR, and LGWR-LNS wait on a channel.

  • Database wait events related to Data Guard

    These wait events are applicable even in a normal system and are related to I/O. They are log file sync, log file parallel write, and DB file sequential read.

  • Data Guard wait events on a standby database

    These wait events will occur in case time is spent on I/O on the standby. They are RFS write, RFS random I/O, and RFS sequential I/O.

Use the following query to get the details about all wait events:

SQL> select event,total_waits,time_waited,total_timeouts from v$system_event order by total_waits desc;
EVENT                                    TOTAL_WAITS TIME_WAITED 
---------------------------------------- ----------- ----------- 
parallel recovery slave next change           260168      381493 
control file sequential read                   67687        2373  
parallel recovery change buffer free           60053       82981  
parallel recovery read buffer free             18975       24964  
SQL*Net vector data from client                10534       79202  

Tip

For an in-depth analysis of the events discussed, use a load profile of AWR or a Statspack report or performance dynamic views, and the wait events can be varied from environment to environment depending on the configurations and settings.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.191.233.205