CHAPTER 3


RAC Operational Practices

by Riyaj Shamsudeen

When running a RAC cluster, some operational practices can reduce overall administration costs and improve manageability. The design of a RAC cluster is critical for the effective implementation of these practices. This chapter discusses numerous RAC design patterns, and implementing these design practices will lead to better operational results.

Workload Management

Workload management is an essential tool that simplifies the management of RAC clusters. Skillful use of workload management techniques can make the administration of a production cluster much easier. It can also speed up the detection of failures and accelerate the failover process. Services, VIP listeners, and SCAN listeners play important roles in workload management.

Workload management is centered on two basic principles:

  1. Application affinity: Optimal placement of application resource usage to improve the performance of an application.
  2. Workload distribution: Optimal distribution of resource usage among available nodes to maximize the use of cluster nodes.

Application affinity is a technique to keep intensive access to database objects localized to improve application performance. Latency to access a buffer in the local buffer cache is on the order of microseconds (if not nanoseconds with recent faster CPUs), whereas latency to access a buffer resident in a remote SGA is on the order of milliseconds, typically 1–3ms.1  Disk access latency is roughly 3 to 5ms for single-block reads, which is close to the latency of remote buffer cache access, whereas local buffer cache access is orders of magnitude faster. So, if an application or application components (such as a group of batch programs) access some objects aggressively, connect those application components to one instance so that object access is mostly localized, improving application performance.

A complementary (though not necessarily competing) technique to application affinity is workload distribution. For example, if there are four nodes in a cluster, then you would want to distribute the workload evenly among all four nodes so that no node is overloaded. Services are the key tool for achieving even workload distribution among all nodes of a cluster.

In practice, application affinity and workload distribution coexist. Some application components use application affinity to improve performance, and a few application components use workload distribution.

In addition to administrator-managed databases, you can also use policy-managed databases for workload management. Oracle 11gR2 introduces policy-managed databases and Oracle version 12c enhances policy-managed databases for dynamic policy management. Policy-managed databases are discussed later in this chapter.

Services

A service is a logical abstraction designed to group application components for optimal workload management. By implementing a service-based connection design, the application workload can be affinitized, distributed, or both. A database connection string specifies a service as a connection property, and the cluster processes distribute new connections to an instance depending on the workload specification of the service. The performance metrics of a service are also used by the listener processes to redirect the connection. Further, database processes calculate, maintain, and propagate service metrics to facilitate workload distribution and failover techniques.

A typical database connection using a service is done according to the following steps:

  1. Application process sends a connection request specifying a service_name to the SCAN (Single Client Access Name) listener.
  2. SCAN listener identifies the instance that can provide the best quality of service.
  3. SCAN listener redirects the connection to the VIP (virtual IP address) listener listening locally for that instance.
  4. VIP listener creates a database connection process, and the application continues processing the data by communicating to the connection process.
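
For example, a client can reach a service through the SCAN listener with a simple EZConnect-style connection; the SCAN name and port below (rac-scan.example.com, 25000) are assumed values matching the examples used later in this chapter.

$ sqlplus apps@//rac-scan.example.com:25000/PO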

By default, every database has some default services created, as shown in the following query output. Default services are marked as preferred in all instances; therefore, default services are not useful for workload management. For example, a service SOLRAC matching the database name (db_name initialization parameter) is created, and that service is preferred in all instances of the database. So, workload distribution among the instances of the database is not effective using the SOLRAC service, since connecting through that service can connect to any instance.

    SQL> select name, goal from v$active_services
     
    NAME                           GOAL
    ------------------------------ ------------
    SOLRAC                         NONE
    SYS$BACKGROUND                 NONE
    SYS$USERS                      NONE

You need to create a new service to implement workload management properly; only a service created by you can be used effectively for workload management. There are many ways to create a service, such as using the srvctl command, the dbms_service package, or Oracle Enterprise Manager (OEM). In a single-instance database, you can use the dbms_service package to create and manipulate a service. In RAC, you should not use the dbms_service package directly. Instead, it is better to use the srvctl command to create a service. The srvctl command creates a cluster resource in Clusterware, in addition to creating a service in the database. This cluster resource is monitored for availability, and the resource is failed over to another instance in the case of instance failure.
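
For completeness, here is a minimal single-instance sketch using the dbms_service package; the service name po is only an example.

SQL> exec dbms_service.create_service(service_name => 'po', network_name => 'po');
SQL> exec dbms_service.start_service('po');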

The following command adds a new service, po, to the solrac database. This command also specifies that the service should be preferred in the solrac1 and solrac3 instances and that the service should fail over to the solrac2 and solrac4 instances if a preferred instance fails. Flag -r specifies preferred instances, and flag -a specifies available instances.

    $ srvctl add service -d solrac -s po -r solrac1,solrac3 
         -a solrac2,solrac4 -m BASIC -j short -B service_time

In Oracle Database version 12c, the syntax to add a service is different and the options are more verbose. While the 11.2 command options will also work in version 12c, the abbreviated command options are deprecated in version 12c. The earlier version 11.2 command to add a service can be rewritten as follows using version 12c syntax.

    $ srvctl add service -db solrac -service po
       -preferred solrac1,solrac3
       -available solrac2,solrac4  -failovermethod BASIC
       -clbgoal short -rlbgoal service_time

Adding a service using the srvctl command also adds a resource in Clusterware. Whereas in version 11.2 the user has to start the service manually after adding it, in version 12c the service is started automatically. This cluster resource is a dependent of the database and VIP resources, so as soon as the database and VIP resources are online, the service resource is brought online.

    $ crsctl stat res ora.solrac.po.svc -p |more
    NAME=ora.solrac.po.svc
    TYPE=ora.service.type
    ...
    SERVICE_NAME=PO
    START_DEPENDENCIES=hard(ora.solrac.db,type:ora.cluster_vip_net1.type) weak(type:ora.listener.type) pullup(type:ora.cluster_vip_net1.type) pullup:always(ora.solrac.db)

Starting the database resource will recursively start dependent Clusterware resources. As the services are defined as dependent resources, starting the Clusterware resource of a service enables the service in the database by implicitly calling the dbms_service package.

You could explicitly start a service using the srvctl command, as well. Query the view v$active_services to see that a new service has been created and started.

    $ srvctl start service -d solrac -s po
    SQL> select name, creation_date from v$active_services
         where name='PO';
    NAME                                      CREATION_DATE
    ----------------------------------------- ---------------
    PO                                        17-DEC-12
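
You can also check the state of the service from the Clusterware side; the output shown is indicative of what to expect.

$ srvctl status service -d solrac -s po
Service po is running on instance(s) solrac1,solrac3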

Service Metrics

The MMON (Memory Monitor) database background process calculates the service-level metrics from service statistics, and these metrics are visible in the gv$servicemetric dynamic performance view. Metrics are calculated at intervals of 5 seconds and 1 minute. In the following output, the first line shows the metrics calculated at 1-minute intervals, and the second line shows the metrics calculated at 5-second intervals.

    SQL> select inst_id, begin_time, end_time from gv$servicemetric
         where service_name='PO' order by 1, 2;
     
    INST BEGIN_TIME                 END_TIME
    ---- -------------------------- --------------------------
       1 14-JAN-2013 16:05:57       14-JAN-2013 16:06:57
       1 14-JAN-2013 16:07:06       14-JAN-2013 16:07:11
    ...

Two sets of metrics are calculated by the MMON background process, one indicating service_time and the other indicating throughput. Depending on the service configuration, the goodness of a service is calculated, and the PMON (Process Monitor) or LREG (Listener Registration, introduced in 12c) background process communicates the service metrics to the SCAN and VIP listeners.

Figure 3-1 shows the MMON propagation details. It gives an overview of service metrics calculation and propagation.


Figure 3-1. Service metrics calculation and propagation

  1. The MMON process populates service metrics in AWR (Automatic Workload Repository) base tables.2 A query sketch against the AWR history views follows this list.
  2. Service metrics are propagated to the ONS (Oracle Notification Service) daemon, and ONS daemons propagate the service metrics to the client-side driver and client-side ONS daemon.
  3. In addition, service metrics are enqueued in the sys$service_metrics queue. A client program can register as a subscriber to the queue, receive queue updates, and take necessary action based on the queue alerts.
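
Because MMON persists service metrics in AWR, historical service-level statistics can also be reviewed after the fact. The following query is a minimal sketch against the dba_hist_service_stat history view; the statistic names chosen are only examples.

SQL> select snap_id, instance_number, stat_name, value
     from dba_hist_service_stat
     where service_name = 'PO'
       and stat_name in ('DB time', 'user calls')
     order by snap_id, instance_number;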

Also, MMON propagation is a key component of the FAN (Fast Application Notification) feature discussed later in this chapter.

A set of columns, namely elapsedpercall, cpupercall, and dbtimepercall, indicates the response time quality of the service. The following data from gv$servicemetric shows that response time metrics are better in instance 1 than in instance 2, as dbtimepercall is lower in instance 1 than in instance 2. Using these statistics, the SCAN listener redirects the connection to an instance that can provide better quality of service.

SQL> select inst_id,service_name, intsize_csec, elapsedpercall,
     cpupercall,dbtimepercall
     from gv$servicemetric where service_name='PO'
     order by inst_id, 2,3;
 
INST SERVICE    INTSIZE_CSEC ELAPSEDPERCALL CPUPERCALL DBTIMEPERCALL
---- ---------- ------------ -------------- ---------- -------------
   1 PO                  511     45824.8171 19861.5611    45824.8171
   1 PO                 5979     32517.4693 13470.9456    32517.4693
   2 PO                  512     89777.4994 41022.0318    89777.4994
   2 PO                 5938      50052.119 23055.3317     50052.119
...

Columns callspersec and dbtimepersec indicate the throughput metrics of a service.

SQL> select inst_id,service_name, intsize_csec, callspersec,
     dbtimepersec
     from gv$servicemetric where service_name='PO'
     order by inst_id, 2,3;
 
INST SERVICE    INTSIZE_CSEC CALLSPERSEC DBTIMEPERSEC
---- ---------- ------------ ----------- ------------
   1 PO                  512  1489.64844   4099.97602
   1 PO                 6003  1200.56638     4068.117
...

Note that these metrics are averages of the performance statistics of all the workloads connecting to that service in an instance. As averages can be misleading, it is important to understand that in a few cases listeners can redirect connections to instances that may not be optimal for that service.

Load Balancing Goals

Services can be designed with specific goals, optimizing for either response time or throughput. Service metrics are factored in to identify the best instance matching the service goals. Two attributes of a service definition specify the service goal, namely, GOAL and CLB_GOAL.

  • Attribute CLB_GOAL (connection load balancing goal) can be set to a value of either LONG or SHORT. The value of LONG implies that the connections are long-lived, such as connections originating from a connection pool. When CLB_GOAL is set to LONG, connections are evenly distributed among available instances, meaning that the listener tries to achieve an equal number of connections in all instances where the service is available. An attribute value of SHORT triggers runtime load balancing based upon performance metrics.
  • The GOAL attribute (runtime load balancing goal) is applicable only if CLB_GOAL is set to SHORT. If GOAL is set to SERVICE_TIME, then connections are redirected to the instance that can provide better response time for that service. If GOAL is set to THROUGHPUT, then connections are redirected to the instance that can provide better throughput for that service.

If GOAL is set to SERVICE_TIME, then service metrics with names ending in percall are used to measure the goodness of an instance for a service. If GOAL is set to THROUGHPUT, then metrics ending in persec are used to measure the goodness of an instance for a service. Table 3-1 summarizes these service attributes and offers a short recommendation.

Table 3-1. Service Goals

CLB_GOAL   GOAL           Description
LONG       Ignored        Suitable for long-lived connections such as connection pool connections.
SHORT      SERVICE_TIME   Connections are redirected to an instance that can provide better response time. Suitable for OLTP workloads.
SHORT      THROUGHPUT     Connections are redirected to an instance that can provide better throughput. Suitable for DSS/warehouse workloads.
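
These goals can be set when the service is created or changed later with srvctl modify service. The following command is a sketch using the 12c verbose options (the 11.2 equivalents are -j and -B); the database and service names are those used earlier in this chapter.

$ srvctl modify service -db solrac -service po -clbgoal SHORT -rlbgoal SERVICE_TIME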

The MMON process calculates the goodness and delta columns using service metrics. A lower value for the goodness column indicates better quality of service in that instance (in retrospect, it might have been better named a badness indicator). The delta column indicates how the value of the goodness column will change if a new connection is added to that service. In this example output, the PO service is defined with the CLB_GOAL attribute set to LONG, and so the goodness column simply indicates the number of connections to that service in that instance. Since instance 2 has a lower value for the goodness column, a new connection to the PO service will be redirected to instance 2.

SQL> select inst_id, service_name, intsize_csec,goodness,delta
     from gv$servicemetric
     where service_name='PO' order by inst_id, 2,3;
 
INST SERVICE    INTSIZE_CSEC   GOODNESS      DELTA
---- ---------- ------------ ---------- ----------
   1 PO                  514       1883          1
   1 PO                 5994       1883          1
   2 PO                  528       1820          1
   2 PO                 5972       1820          1

In versions 11.2 and earlier, the PMON process propagates the service metrics to the listeners registered in local_listener and remote_listener initialization parameters. As the remote_listener specifies the address of SCAN listeners and the local_listener parameter specifies the address of the VIP listener, the PMON process propagates the service metrics to both SCAN and VIP listeners. You can trace listener registration using the following command.

alter system set events='immediate trace name listener_registration level 15';

The PMON trace file shows how the quality of service is propagated to listener processes. Here is an example (output of PMON process sending the service quality to the listener process).

kmmgdnu: po
goodness=100, delta=10,
flags=0x4:unblocked/not overloaded, update=0x6:G/D/-

In version 12c, listener registration is performed by a new mandatory background process named LREG. At database start, the LREG process polls the listeners specified in the local_listener and remote_listener parameters to identify whether the listeners are running. If the listeners are available, then the services are registered to the listeners by the LREG process. The earlier command to trace listener registration in version 11.2 is applicable to version 12c also, but the listener registration details are written to an LREG trace file instead of a PMON trace file.

Unix process pid: 3196, image: oracle@oel6rac1.example.com (LREG)
...
Services:
...
  2 -po.example.com
       flg=0x4, upd=0x6
       goodnes=0, delta=1
...
Listen Endpoints:
0 -(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=oel6rac1.example.com)(PORT=5500))
       (Presentation=HTTP)(Session=RAW))
       flg=0x80000000, nse=0, lsnr=, lflg=0x13
       pre=HTTP, sta=0

If a service is not registered properly to the listeners or if the service quality decisions taken by the listeners are incorrect, then you can trace the service registration to check if the listeners received correct service metrics.

Version 11.2 introduces the Valid Node Checking for Registration feature. SCAN listeners will accept registration from any IP address in the same subnet as the SCAN IP address. The VIP listener will accept the registration only from the local node. You can use invited_nodes to specify the permitted nodes for listener registration.
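
The registration restriction is controlled through listener parameters. The following listener.ora fragment is only a sketch, assuming a SCAN listener named LISTENER_SCAN1 and hypothetical node names; adjust the listener name and node list for your environment.

VALID_NODE_CHECKING_REGISTRATION_LISTENER_SCAN1 = ON
REGISTRATION_INVITED_NODES_LISTENER_SCAN1 = (rac1.example.com, rac2.example.com, rac3.example.com)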

Runtime Failover

At runtime, a connection can be configured to fail over quickly to an available instance by configuring the failover type service attribute. A service supports failover types of NONE, SESSION, or SELECT. Failover type NONE does not provide any failover. If the failover type is set to SESSION, then after a node failure, sessions will fail over to surviving instances and the application must restart all work, including in-transit work. If the failover type is SELECT, then cursor fetch details are remembered by the client-side JDBC/OCI driver, the SELECT statement is restarted, and FETCH continues from the last row fetched before the failure.

In version 11.2, while the SELECT failover type supports a transparent restart of queries, it doesn’t restart any DML operation. The application must restart DML statements, and failures in DML statements must be handled by the application.

Version 12c introduces TRANSACTION, a new failover type, and this failovertype attribute value is specific to a new feature called Application Continuity. Application Continuity also captures DML changes and replays the DML activity using the Oracle client-side replay driver. The Application Continuity feature is discussed later in this chapter.

In version 12c, command options are specified more verbosely. For example, in 11.2, option -e specifies the failover type, whereas in version 12c the service attribute failovertype specifies the failover type.
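
For example, an existing service can be switched to SELECT failover using the 12c verbose options; the database and service names follow the earlier examples, and the retry and delay values are illustrative.

$ srvctl modify service -db solrac -service po -failovertype SELECT -failoverretry 10 -failoverdelay 3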

Service in Second Network

A service can be configured to operate in a specific subnet. For example, to make a service available only in the second network, you would specify option -k 2 while creating the service in version 11.2. In version 12c, the clause -netnum 2 specifies that the service will operate only in the second network, as shown in the sketch below. Please refer to Chapter 9 to learn more about the second network in a RAC cluster.
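
As a sketch, the following commands place a hypothetical service named po2 in the second network, first in 11.2 syntax and then in 12c syntax.

$ srvctl add service -d solrac -s po2 -r solrac1 -a solrac2 -k 2
$ srvctl add service -db solrac -service po2 -preferred solrac1 -available solrac2 -netnum 2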

Guidelines for Services

The following guidelines are useful while creating services to improve cluster management.

  • Create a separate service for each major application component. Service is an excellent method to implement application affinity. For example, if PO application accesses numerous PO-related tables and if one instance is sufficient to service the PO application, then keep PO service contained in one instance. You can keep the PO service connecting to an instance by creating the service with just a preferred instance. Application affinity improves performance by reducing global cache latency.
  • Design the service in such a way that both preferred and available instances are configured. While designing services, consider the scenario of each node shutting down and construct the service placement based upon those scenarios.
  • Use policy-managed databases instead of administrator-managed databases if there are numerous nodes (12+) in your cluster.
  • Use of SCAN listeners and services reduces administration overhead. Especially for sites upgrading from versions 11.1 and earlier, it is important to alter the connection string to use SCAN listeners and SCAN IP addresses. It is very hard to change the connection string post-upgrade.
  • Specify optimal clbgoal and rlbgoal parameters matching the application workload. Service metrics are calculated as an average for all the workloads connecting to that service. So, if your application has different workloads and if it is possible to segregate the application components depending upon the workload characteristics, then use disjointed, unique services for those application components. This segregation of application workload to different services improves the accuracy of service metrics and improves the accuracy of load balancing decisions. This segregation also improves the ability to identify application components consuming resources quickly.

There are only a few minor pitfalls in creating a new service; production clusters with 100+ services operate without any major side effects. While we are not recommending that you create hundreds of unnecessary services, you should create as many services as you need to split the applications into manageable, disjoint workloads. In version 11.2, if there are numerous services (100+) and numerous listeners, then the PMON process might spend more time on service registration to listeners due to the sheer number of services and listeners. In version 12c, this possibility is eliminated, as the LREG process performs service registration and PMON is freed from listener registration.

SCAN and SCAN Listeners

Introduced in version 11.2, SCAN (Single Client Access Name) is an effective tool to enhance workload management in RAC, to eliminate complexity in connection string management, and to provide faster connect time/runtime failover. Further, the SCAN feature simplifies connection management by providing a logical abstraction layer of the cluster topology. Cluster topology changes do not trigger changes to connect strings.

Consider an environment with numerous applications and users who are connecting to a RAC database with three nodes in the cluster, using the following connect string:

PO =
  (DESCRIPTION=
    (ADDRESS_LIST=
      (LOAD_BALANCE=YES)
      (FAILOVER=YES)
      (ADDRESS=(PROTOCOL=tcp)(HOST=rac1.example.com)(PORT=12000))
      (ADDRESS=(PROTOCOL=tcp)(HOST=rac2.example.com)(PORT=12000))
      (ADDRESS=(PROTOCOL=tcp)(HOST=rac3.example.com)(PORT=12000))
    )
    (CONNECT_DATA=
      (SERVICE_NAME=PO)
    )
  )

The preceding connect string has some drawbacks:

  1. If we add another node to the cluster, say rac4.example.com, then connect strings must be modified in numerous application configuration files. Worse, topology changes might trigger changes to the tnsnames.ora file in thousands of user computers. This type of configuration change requires seamless coordination.
  2. Similarly, if a node is removed from the RAC cluster, then connect strings must be altered to eliminate unnecessary waits for TCP timeouts while creating new connections. For example, consider that the rac1.example.com node is shut down for a prolonged period of time. As connection strings are configured to try all VIP listeners (load_balance=ON in the connection string), connection attempts will eventually succeed, but only after a TCP timeout is incurred for the down node. This connection delay is visible in applications that open new connections aggressively.
  3. Without SCAN listeners, the connect time load balancing decision is performed by VIP listeners. The use of VIP listeners for workload management is not optimal, as the load balancing decision is not centralized, and so VIP listeners can redirect connections inaccurately. VIP listeners do not use all service attributes for load balancing either.

SCAN listeners resolve these connect time issues and improve load balancing accuracy. SCAN listeners act as the front end of connection management: applications connect to SCAN listeners, and SCAN listeners act as a conduit, forwarding connections to an instance that can provide better quality of service. A SCAN listener identifies the instance to forward a connection to using the service metrics received from the database processes.

SCAN listeners are aware of topology changes. Any change in the cluster topology, such as addition or deletion of a node, is propagated to SCAN listeners immediately. Applications connect only to the SCAN listeners, and so cluster topology changes have no effect on application connect strings.

The following is an example of a connect string utilizing the SCAN listener feature; this connect string specifies the DNS alias of the SCAN IP addresses (also known as the SCAN name) and the port number of the SCAN listeners. In this example, rac-scan.example.com is the SCAN name, and the listeners listen on port 25000.

PO =
(DESCRIPTION=
 (ADDRESS_LIST=
   (ADDRESS=(PROTOCOL=tcp)(HOST=rac-scan.example.com)(PORT=25000))
     )
    (CONNECT_DATA=(SERVICE_NAME=PO))
   )

A new connection request using the preceding connect string is made according to the following high-level steps. Figure 3-2 shows an example of a new connection to the SCAN listener.


Figure 3-2. SCAN listeners

  1. Application connection or a process queries the DNS server to identify the IP addresses associated with SCAN (rac-scan.example.com in this example). Note that as DNS name to IP address mapping can already be cached in a local DNS cache, the DNS query might result in a local DNS cache lookup instead of a query to the DNS server.
  2. DNS server responds with an IP address associated with the DNS alias.
  3. The process sends a connection request to the SCAN IP address and the PORT number providing SERVICE_NAME for the connection request.
  4. SCAN listeners identify instances that can provide better quality of service for that service using service metrics. The connection is forwarded to a VIP listener local to the chosen instance.
  5. A VIP listener creates a connection process (dedicated server mode), and the connection process attaches to the database, and the user process continues processing the data communicating with that connection process.

In a RAC cluster, up to three SCAN IP addresses can be configured. Since a SCAN listener is a simple redirection mechanism (and hence a lightweight process), SCAN listeners can handle a surge of connection requests without any delay.

SCAN IP addresses must be mapped to a SCAN DNS alias in the DNS server or GNS (Grid Naming Service). It is recommended that the DNS server be configured to return the SCAN IP addresses in a round-robin fashion. IP addresses returned in a round-robin fashion provide additional connection load balancing capability and prevent overloading of a single SCAN listener. The following nslookup output shows that three IP addresses are associated with the SCAN name.

$ /usr/sbin/nslookup rac-scan.example.com
...
Name:   rac-scan.example.com
Address: 10.7.11.110
Name:   rac-scan.example.com
Address: 10.7.11.111
Name:   rac-scan.example.com
Address: 10.7.11.112

While installing Grid Infrastructure software, you can provide the SCAN name and port number. Grid Infrastructure queries the DNS server to identify IP addresses associated with SCAN names and creates as many SCAN listener resources as there are SCAN IP addresses. For example, if DNS returns three SCAN IP addresses, then three SCAN IP resources and three SCAN listener resources are created in the Clusterware.

After the Grid Infrastructure installation, you can modify SCAN IP address and SCAN listener attributes using the srvctl command. The following configuration of SCAN IP address shows that Clusterware created three sets of SCAN IP resources and SCAN listeners for the rac-scan.example.com DNS alias.

$ srvctl config scan
SCAN name: rac-scan, Network: 1/10.7.11.0/255.255.254.0/eth1
SCAN VIP name: scan1, IP: /rac-scan.example.com/10.7.11.110
SCAN VIP name: scan2, IP: /rac-scan.example.com/10.7.11.111
SCAN VIP name: scan3, IP: /rac-scan.example.com/10.7.11.112
  

By enabling client-level SQL*Net (TNS) trace, you can see that the SCAN IP address retrieved is different for each connection attempt. This round-robin retrieval of SCAN IP addresses provides a load balancing mechanism among SCAN listeners.

The following lines are from SQL*Net trace files; you can see that the first connection request went to the SCAN listener with IP address 10.7.11.112 and the second connection request went to the SCAN listener with IP address 10.7.11.110.

$ sqlplus apps@po
nttbnd2addr:using host IP address: 10.7.11.112
..
$ sqlplus apps@po
nttbnd2addr:using host IP address: 10.7.11.110
..

The initialization parameter remote_listener is set to the TNS alias of the SCAN listener or to EZConnect syntax using the SCAN name and port number. The PMON or LREG process registers services and propagates service-level changes to the SCAN listeners specified in the remote_listener parameter. The EZConnect syntax is as follows.

*.REMOTE_LISTENER=rac-scan.example.com:25000

As a SCAN listener acts as a redirection mechanism for connection requests, the services supported by a SCAN listener are listed with a remote server keyword, indicating that the SCAN listener will redirect the connection to a VIP listener. In the following lsnrctl services command output, notice the keyword REMOTE SERVER, indicating that the LISTENER_SCAN2 listener will redirect connection requests to the rac2 node VIP listener, thereby forwarding the connection to the RAC2 instance.

$ lsnrctl services LISTENER_SCAN2
LSNRCTL for Solaris: Version 11.2.0.2.0 - Production on 19-JAN-2013 16:43:26
Copyright (c) 1991, 2010, Oracle.  All rights reserved.
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=LISTENER_SCAN2)))
  Services Summary...
    Service "PO" has 1 instance(s).
      Instance "RAC2", status READY, has 1 handler(s) for this service...
Handler(s):
 "DEDICATED" established:1 refused:0 state:ready
  REMOTE SERVER
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=rac2.example.com)(PORT=12000))
       (CONNECT_DATA=(SERVICE_NAME=PO)(INSTANCE_NAME=RAC2)))

If a service is not visible in a SCAN listener services output, the service registration to the SCAN listener(s) is incorrect. Similarly, the lsnrctl services command output of a VIP listener will list services with the LOCAL SERVER keyword, indicating that the VIP listener will handle the connection for that service.

SCAN Listener in Second Network (12c)

From version 11.2 and later, it is possible to create a second network, and VIP listeners can be created in the second network, too. However, SCAN listeners can listen only in the default first network in version 11.2. Version 12c eliminates that restriction, and it is possible to have SCAN listeners listening on a second network also. By specifying the listener_networks parameter, SCAN listeners will redirect the connection to VIP listeners within the same network subnet.

Let us walk through an example of a SCAN listener in the second network in version 12c. The following set of commands adds a network with network number 2 in the 192.168.8.0 subnet (note that the first network uses the 10.7.11.0 subnet). The next three commands add VIP addresses with -netnum 2, specifying that these VIPs belong to the second network. The last command adds a listener listening on port 1522 on these node VIPs.

$ srvctl add network -netnum 2 -subnet 192.168.8.0/255.255.255.0
$ srvctl add vip -node rac1.example.com -address rac1vip.example2.com/255.255.255.0 -netnum 2
$ srvctl add vip -node rac2.example.com -address rac2vip.example2.com/255.255.255.0 -netnum 2
$ srvctl add vip -node rac3.example.com -address rac3vip.example2.com/255.255.255.0 -netnum 2
$ srvctl add listener -netnum 2 -endpoints TCP:1522

The following commands add a SCAN in network 2 with a SCAN name of ebs.example2.com. The second command adds a SCAN listener3 listening on port 20000; Grid Infrastructure identifies the SCAN IP addresses by querying DNS for the SCAN name ebs.example2.com.

$ srvctl add scan -scanname ebs.example2.com -netnum 2
$ srvctl add scan_listener -listener LISTSCAN -endpoints TCP:20000

Database initialization parameter listener_networks must be modified so that the PMON or LREG process can register the services to listeners in both networks. The LISTENER_NETWORKS parameter specifies the network configuration in two parts, the first part specifying the configuration of the first network, named network1, and the second part specifying the configuration of the second network, named network2. In this example, listener_net1 and listener_net2 are TNS aliases connecting to VIP listener in local nodes; the remote_listener parameter refers to the SCAN listeners running in their respective networks specified using EZConnect syntax.

alter system set LISTENER_NETWORKS=
  '((NAME=network1)(LOCAL_LISTENER=listener_net1)(REMOTE_LISTENER=ebs.example.com:10000))',
  '((NAME=network2)(LOCAL_LISTENER=listener_net2)(REMOTE_LISTENER=ebs.example2.com:20000))';

Consider the following connection string from a user connecting to the second network (possibly a connection originating from the example2.com domain). The SCAN listeners will redirect the connection to a VIP listener in the second network only. Essentially, a connection from the second network subnet will remain in the second network subnet, and network packets do not traverse from one subnet to the other. This network demarcation is important, as a firewall might prevent network packets from crossing over to another domain.

PO =
  (DESCRIPTION=
     (ADDRESS_LIST=
(ADDRESS=(PROTOCOL=tcp)(HOST=ebs.example2.com)(PORT=20000)))
      (CONNECT_DATA=(SERVICE_NAME=PO)
     )
  )

Support of SCAN listeners in the second network is an important feature in an enterprise database cluster with multiple network segments and domains, especially after major business events such as a merger of two companies.

Guidelines for SCAN Listeners

Optimally configured SCAN IP addresses and SCAN listeners are essential production components from version 11.2 onward. The SCAN feature greatly reduces administration overhead, as you can publish the TNS connection string once, and that connection string is not affected by RAC cluster topology changes. As SCAN listeners are lightweight forwarders, even if you have numerous nodes in a cluster, just three SCAN listeners are able to service an onslaught of connection requests. In fact, in a busy, connection-intensive production cluster, VIP listeners can become bottlenecks, while SCAN listeners are able to withstand a surge of connection requests easily.

Multiple DNS servers with failover and load balancing provide an extra layer of protection to avoid a single point of failure. This load balancing between DNS servers is usually the responsibility of network administrators. If you configure GNS, Clusterware manages the GNS daemon and GNS VIP. If the server running the GNS daemon or VIP fails, then Clusterware will automatically fail over the GNS daemon and VIP to a surviving node.

In a single-segment network cluster, the local_listener parameter should refer to the VIP listener running on the local node, and the remote_listener parameter should refer to the SCAN listeners using EZConnect syntax, specifying the DNS alias of the SCAN IP addresses and the port number. Do not specify remote VIP listeners from other nodes in the remote_listener parameter, as that can cause cross-registration and accidental connection forwarding to the other node.
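
A typical parameter setup, assuming the SCAN name and ports used earlier in this chapter and hypothetical VIP host names, would look similar to the following spfile entries.

*.remote_listener='rac-scan.example.com:25000'
solrac1.local_listener='(ADDRESS=(PROTOCOL=tcp)(HOST=rac1-vip.example.com)(PORT=12000))'
solrac2.local_listener='(ADDRESS=(PROTOCOL=tcp)(HOST=rac2-vip.example.com)(PORT=12000))'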

Global Data Services (12c)

Oracle Database version 12c introduces the concept of Global Data Services (GDS). The services discussed so far in this chapter distribute workload between instances of a single database. In a truly globalized environment, however, databases are replicated using tools such as GoldenGate, Streams, etc. Global services provide a mechanism to fail over connections globally between databases.

It is easier to explain the concept of GDS with a connection string example. Consider that there are two databases kept in sync using a bidirectional replication technology. The database servicing the California region is physically located near that region, and likewise the database servicing the New York region is physically located there. Thus, clients in the California region will benefit by connecting to the California database during normal operating conditions, and in the case of failure of the database in California, the connections should fail over to the New York database, and vice versa. Until version 11.2, this was achieved using a connect time failover connect string.

As of version 12c, the connect string for the California region specifies two address_list sections. The enclosing description has the FAILOVER attribute set to ON, and load_balance is OFF at the description level (by default). This connection string will try the first address_list in the connect string; that is, the application will connect to the California database. If that database does not respond before the TCP timeout, then the connection will fail over to the New York database.

(DESCRIPTION=
  (FAILOVER=on)
  (ADDRESS_LIST=
    (LOAD_BALANCE=ON)
    (ADDRESS=(host=ca1)(port=1522))
    (ADDRESS=(host=ca2)(port=1522))
    (ADDRESS=(host=ca3)(port=1522)))
  (ADDRESS_LIST=
    (LOAD_BALANCE=ON)
    (ADDRESS=(host=ny1)(port=1522))
    (ADDRESS=(host=ny2)(port=1522))
    (ADDRESS=(host=ny3)(port=1522)))
  (CONNECT_DATA=
    (SERVICE_NAME=PO)
    (REGION=ca))
 )

In addition to the failover specification between address_list sections, each address_list section also specifies multiple addresses with LOAD_BALANCE set to ON. Essentially, a new connection from the California region will try the first address_list, as load_balance is set to OFF by default for a description list. Then, an address from the chosen address_list is picked to send connection packets to the GDS listeners, as load_balance is set to ON within that address_list.

Note that traditionally a VIP listener or a SCAN listener is specified in the ADDRESS section. But in the case of a GDS connect string, these are not traditional listeners but GDS listeners. This feature was introduced in version 12c. GDS listeners can be visualized as global SCAN listeners. Oracle recommends three GDS listeners per region, very similar to the three SCAN listeners in a RAC cluster. Also, local services are managed by the srvctl command, whereas global services are managed by the gdsctl command interface.

GDS listeners connect to the database and register as subscribers of the service metrics queue. Any topology changes and Load Balancing Advisories (LBAs) are propagated to the GDS listeners, and the GDS listeners use the LBAs to forward connections to a database that can provide better quality of service.

GDS is also useful with an Active Data Guard setup. With the Active Data Guard feature, it is possible to use a standby instance for read-only activity. In many cases, the need arises for a global failover; that is, if the Active Data Guard database is not available, then connections should fail over to the primary database, and vice versa. The GDS feature is handy in implementing this requirement. Refer to the Oracle Global Data Services Concepts and Administration Guide for more details.

Failover in RAC

A RAC database offers many failover techniques. Failover methods can be classified into two categories: connection time and runtime failover methods. Transparent Application Failover (TAF) is a reactive runtime failover method, wherein a database connection automatically fails over to a surviving instance if the current instance fails. Fast Connection Failover (FCF) is a proactive runtime failover technique. Changes in node availability, service failures, etc., are propagated to the application, and the application proactively recycles connections even before it encounters connection-related errors.

TAF

TAF fails over the connection to a surviving instance after encountering an instance failure. TAF can operate in either SESSION or SELECT failover mode.

  • In a SESSION failover mode, if a connection is severed then the connection will fail over to a surviving instance. Essentially, the connection is reattempted automatically. Only the connection is restarted, and the application must detect failures and restart failed workload.
  • In a SELECT failover mode, the client-side driver remembers the cursor fetch details, and if the current connection is severed, then the connection will fail over to a surviving node. The SELECT statement is re-executed, rows are fetched from the cursor and scrolled to the point of failure in the cursor fetch operation, and rows are returned from the point of failure onward.

The following connect string is a simple example of TAF using a SCAN listener. With this connection string, the application connects to a SCAN listener listening on port 12099. Notice that the failover_mode parameter specifies the type as SESSION, indicating session failover mode. If the connection is severed, then the client will retry the connection to the SCAN listener, and the SCAN listener will redirect the connection to the VIP listener of a surviving node.

PO_TAF =
 (DESCRIPTION =
   (ADDRESS = (PROTOCOL = TCP)
     (HOST = rac-scan.example.com)(PORT = 12099))
          (CONNECT_DATA =
            (SERVICE_NAME = PO)
            (FAILOVER_MODE= (TYPE=SESSION) (METHOD=basic))
     )
    )

The dynamic view v$session can be queried to verify whether TAF is configured properly. The preceding connect string enables session-level failover for the current connection, as the following query confirms. Killing the connection resulted in the respawning of a new connection, confirming that TAF is working as designed.

SQL> select sid, FAILOVER_TYPE, FAILOVER_METHOD, failed_over
     from v$session where
     sid =(select sid from v$mystat where rownum =1);
  
       SID FAILOVER_TYPE FAILOVER_M FAI
---------- ------------- ---------- ---
      8227 SESSION       BASIC      YES

While it is possible to set up TAF with a client connection string, TAF can be implemented more elegantly using service attributes. The advantage of implementing TAF using service attributes is that configuration changes can be performed on the database server side without altering the client-side connect string. The following srvctl command adds the SESSION failover type to the po service. Option -z specifies the number of failover retry attempts, and the -w option specifies the delay between failover attempts. In this example, the connection will be attempted five times with a 60-second delay between successive attempts.

$  srvctl add service -d orcl12 -s po -r oel6vm1 -a oel6vm2 
             -m BASIC -e SESSION -z 5 -w 60

In version 12c, adding a service requires a different syntax, and options are specified more verbosely. The earlier command is written in version 12c as follows.

$ srvctl add service -db orcl12 -service po 
      -preferred oel6vm1 -available oel6vm2
      -failovermethod BASIC -failovertype SESSION
      -failoverretry 5 -failoverdelay 60

It is a better practice to keep the connection string as simple as possible and modify configuration of services to implement TAF. This strategy simplifies connection string management in an enterprise database setup.

Further, it is best to implement TAF while creating services so that you can improve availability.

Fast Connection Failover

Fast Connection Failover (FCF) is a proactive failover method to avoid downtime when there is an instance failure. Proactive connection management improves application availability and user experience, as you can reduce runtime errors. Here are a few examples where FCF can be used to avoid errors:

  • After a database instance failure, all connections in a connection pool must be recycled by the connection manager. Usually, a stale connection is identified only when it is reused by the application4, so the failure is detected only after an error is encountered by the application. This late detection of the error condition can lead to application errors and other undesirable user experiences.
  • Connection pool connections that were busy in the database executing queries might need to wait for TCP timeouts. This delay can lead to slowness and errors in the application.
  • After the recovery of failed instances, the application might not reconnect to the recovered node. This issue can lead to unnecessary overload of the surviving nodes.

Instance failures can be propagated earlier to the application, and errors can be minimized by requesting the application to recycle its connections after node or service failures. Essentially, service changes, node failures, node up/down events, etc., are communicated immediately to the application, and this propagation is known as FAN. After receiving a FAN event, the application proactively recycles connections or takes corrective actions depending upon the nature of the failure.

FCF is a technique that captures FAN events and reacts to the received FAN events. FAN events can be captured and connections recycled utilizing a few technologies:

  1. Your application can implement connection pool technology that handles the FAN events, such as Oracle WebLogic Active GridLink data source, Oracle JDBC Universal Connection Pool, ODP.Net connection pool, OCI session pools, etc.
  2. Your application can use programmatic interfaces to capture FAN events and react accordingly.
  3. Or, you can use database-side callouts to handle FAN events (a minimal callout sketch follows this list).
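
The following is a minimal sketch of the third option, a server-side FAN callout. Clusterware runs every executable placed in the racg/usrco directory of the Grid home, passing the FAN event payload as command-line arguments; the log file location here is arbitrary.

#!/bin/sh
# Sample FAN callout; place it in $GRID_HOME/racg/usrco/ and make it executable.
# Clusterware passes the FAN event text as arguments; this sketch simply logs it.
echo "`date` FAN event: $*" >> /tmp/fan_events.log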

WebLogic Active GridLink

WebLogic Active GridLink for RAC is a data source available in Oracle WebLogic Server 11g. This data source is integrated with RAC to handle FAN events and the Load Balancing Advisory without any code changes. To make use of Active GridLink for FCF, you configure an Active GridLink data source to connect to the SCAN listener and the ONS daemon in a WebLogic configuration screen.

The ONS daemon propagates RAC topology changes to the registered clients. Any topology change is propagated to the clients listening on the SCAN IP addresses and port 6200. The GridLink data source registers with the ONS daemon to receive notifications. If there is a topology change broadcast by the ONS daemon, such as an instance restart, then the GridLink data source captures the event and recycles the connections in the connection pool. Further, if there is an instance UP event due to an instance restart, then that event is captured by the GridLink data source, and the connections are reapportioned among the available nodes. This connection management avoids overloading too few nodes after node recovery. Also, Active GridLink receives LBAs from the ONS daemons, and the GridLink data source uses connections from the least-loaded instances, implementing the runtime load balancing algorithm.

Essentially, Active GridLink is an important tool to implement FCF, runtime load balancing, and transaction affinity. If your application uses WebLogic as a web server, you should implement Active GridLink-based data sources to maximize the failover capabilities of RAC.

Transaction Guard (12c)

A key problem with failure detection and failure recovery is that the response from the database to the application is transient. So, the response to a COMMIT statement can be lost in the case of instance failures. For example, consider a scenario in which an application modified some rows and sent a commit to the database, but the database instance failed and the application did not receive any response from the database. Hence, the application does not know about the transaction status as the response from the database was lost. Two unique scenarios are possible:

  1. Database committed the changes, but success response was not received by the application.
  2. Database failed even before the successful commit operation, and so the application must replay the changes.

In either scenario, the application must know the transaction status to avoid data corruption. A new feature in version 12c is the Transaction Guard, which provides a mechanism to keep track of the transaction state even after database failures.

The Transaction Guard feature introduces the concept of a logical transaction identifier. A logical transaction_id is unique for a specific task from a connection, and the logical transaction_id is mapped to a database transaction_id. The outcome of the logical transaction_id is tracked in the database. If there is a failure in the database, then the application can query the transaction status using the logical transaction_id to identify whether the transaction committed or not.

To implement this feature, you need to create a service with two attributes, namely, commit_outcome and retention. The commit_outcome attribute determines whether the logical transaction outcome will be tracked or not. Attribute retention specifies the number of seconds a transaction outcome will be stored in the database (defaults to a value of 86,400 seconds = 1 day). The following command shows an example of po service creation with commit_outcome set to TRUE, enabling Transaction Guard.

$ srvctl add service -db orcl12 -service po -preferred oel6vm1 -available oel6vm2 
  -commit_outcome TRUE -retention 86400

The Transaction Guard feature requires code changes, and your application code must be changed to react at the onset of failure. Application code must perform the following actions at the time of failure:

  1. Receive FAN events from ONS daemon for failures.
  2. Get the last logical transaction_id by calling getLTXID from a failed session.
  3. Retrieve the status of the logical transaction_id by calling GET_LTXID_OUTCOME (in the dbms_app_cont package), passing the logical transaction_id retrieved in step 2 (a PL/SQL sketch follows this list).
  4. Handle the transaction depending upon the transaction state.
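
The following PL/SQL block is a minimal sketch of step 3. It assumes the logical transaction_id captured in step 2 is supplied in l_ltxid; in practice, this call is usually issued from the application or connection pool code after a FAN failure notification.

DECLARE
  l_ltxid           RAW(64);  -- logical transaction id from getLTXID (step 2)
  l_committed       BOOLEAN;
  l_call_completed  BOOLEAN;
BEGIN
  -- l_ltxid must be populated with the LTXID captured from the failed session
  dbms_app_cont.get_ltxid_outcome(
      client_ltxid        => l_ltxid,
      committed           => l_committed,
      user_call_completed => l_call_completed);
  IF l_committed THEN
    NULL;  -- the in-doubt transaction committed; do not resubmit it
  ELSE
    NULL;  -- the transaction did not commit; it is safe to resubmit the work
  END IF;
END;
/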

Application Continuity (12c)

Application Continuity is a new feature implemented on top of the Transaction Guard feature. Oracle introduces a new client-side JDBC replay driver to implement Application Continuity. Universal Connection Pool and WebLogic Active GridLink support Application Continuity by default. If your application uses a third-party connection pool mechanism, then you need to modify code to handle Application Continuity.

The application issues database calls through the JDBC driver. The JDBC replay driver remembers each call in a request and submits the calls to the database. In the case of failures, JDBC replay driver uses the Transaction Guard feature to identify the global transaction state. If the transaction is not committed, then the JDBC replay driver replays the captured calls. After executing all database calls, replay driver issues a commit and terminates replay mode.

To support Application Continuity, the service must be created with a few service attributes set, namely, failovertype, failoverretry, failoverdelay, replay_init_time, etc. The following command shows an example of service creation. Notice that failovertype is set to TRANSACTION; this is a new failovertype value introduced in version 12c.

$ srvctl add service -db orcl12 -service po -preferred oel6vm1 -available oel6vm2 
   -commit_outcome TRUE -retention 86400 -failovertype TRANSACTION
   -failoverretry 10 -failoverdelay 5 -replay_init_time 1200

Not all application workloads can be safely replayed. For example, if the application uses autonomous transactions, then replaying the calls can lead to duplicate execution of autonomous transactions. Applications must be carefully designed to support Application Continuity.

Policy-Managed Databases

Traditionally, Clusterware resources (databases, services, listeners, etc.) are managed by database administrators, and this type of management is known as administrator-managed databases. Version 11.2 introduced policy-managed databases. This feature is useful in a cluster with numerous nodes (12+) supporting many different databases and applications.

With a policy-managed database, you create server pools, assign servers to the server pool, and define policies of server pools. Depending upon the policies defined, servers are moved into and out of server pools. For example, you can define a policy such that the online server pool has higher priority during the daytime and the batch server pool has higher priority in the nighttime. Clusterware will manage the servers in the server pool such that more resources can be allocated, matching workload definitions.

By default, two server pools, the free and generic server pools, are created. The free server pool is a placeholder for all new servers; as new server pools are created, servers are automatically reassigned from the free server pool to the new server pools. The generic server pool hosts pre-11.2 databases and administrator-managed databases.

You can associate applications with server pools, and a database is also an application from the Clusterware perspective. An application can be defined as singleton or uniform. If an application is defined as singleton, then that application runs on only one server in the server pool; if defined as uniform, then that application runs on all servers in the server pool. The following sketch shows an example.
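
As a sketch, the following commands create a server pool and place a policy-managed service in it; the pool name, sizes, and cardinality are illustrative values.

$ srvctl add srvpool -serverpool online -min 2 -max 4 -importance 10
$ srvctl add service -db solrac -service po -serverpool online -cardinality UNIFORM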

Temporary Tablespaces

Temporary tablespaces in RAC require special attention, as they are shared between the instances. Temporary tablespaces are divided into extents, and instances cache a subset of extents in their SGA. When a process tries to allocate space in the temporary tablespace, it allocates space from the extents cached by the current instance.

Dynamic performance view gv$temp_extent_pool shows how temporary tablespace extents are cached. Instances try to cache extents equally from all files of a temporary tablespace. For example, approximately 4,000 extents from files 6, 7, 8, and 95 were cached by every instance. So, you should create temporary tablespaces with as many temp files as there are instances. In a nutshell, extents are cached from all temporary files, thereby spreading the workload among the temporary files of a temporary tablespace.

RS@SQL> select inst_id, file_id, extents_cached , extents_used
        from gv$temp_extent_pool
  order by inst_id, file_id;
 
   INST_ID    FILE_ID EXTENTS_CACHED EXTENTS_USED
---------- ---------- -------------- ------------
...
         1          6           4026            0
         1          7           4021            2
         1          8           3984            4
         1          9           4033            2
...
         3          6           3984            0
         3          7           3988            1
         3          8           3985            0
         3          9           3967            1

The size of the cached extent is governed by the attributes of the temporary tablespace. For example, in RAC1 database, the extent size is 128KB.

RS@SQL>select tablespace_name, INITIAL_EXTENT, NEXT_EXTENT
      from dba_tablespaces
where tablespace_name='TEMP';
 
TABLESPACE_NAME                INITIAL_EXTENT NEXT_EXTENT
------------------------------ -------------- -----------
TEMP                                   131072      131072
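
For example, on a four-instance database you might create a temporary tablespace with four temp files and a uniform extent size; the tablespace name, file sizes, and the +DATA disk group are assumptions.

SQL> create temporary tablespace temp_app
     tempfile '+DATA' size 8g, '+DATA' size 8g, '+DATA' size 8g, '+DATA' size 8g
     extent management local uniform size 1m;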

Extent caching is a soft reservation technique. Suppose the RAC1 instance uses all of its cached extents (because a session connected to RAC1 performs a massive sort or join operation); then, that instance will ask the other instances to release the soft reservation on a group of extents in the temporary tablespace. After the extents are uncached from the other instances, the RAC1 instance can cache them.

Extent caching and uncaching operations are performed under the protection of SS Enqueue. Further, the DFS lock handle mechanism (discussed in Chapter 11) with CI Enqueue is used to trigger uncaching of extents in other instances. So, excessive caching/uncaching can lead to waits for SS Enqueue and waits for DFS lock handle. Since version 11g, uncaching of extents is performed in batches of 100 extents per operation, and so SS Enqueue contention is greatly reduced.

The following guidelines can be used while creating temporary tablespaces:

  • Create as many temp files as the number of instances in every temp tablespace. For example, in an eight-instance database, there should be at least eight temp files associated with every temporary tablespace. This strategy reduces the file header level locking issues dramatically.
  • When adding an instance to a database, increase the size of the temporary tablespace and also increase the number of temp files.
  • If an application component uses an excessive amount of temporary space, then create a new user for that application component, create an additional temporary tablespace, and use application affinity to keep that component on a single node. This strategy reduces the ill effects associated with excessive extent caching and uncaching activity.
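
As a hypothetical illustration of the first guideline, the following statement creates a temporary tablespace with four temp files for a four-instance database; the tablespace name, file sizes, and disk group are assumptions for the example.

CREATE TEMPORARY TABLESPACE temp_batch
   TEMPFILE '+DATA' SIZE 8G,
            '+DATA' SIZE 8G,
            '+DATA' SIZE 8G,
            '+DATA' SIZE 8G
   EXTENT MANAGEMENT LOCAL UNIFORM SIZE 1M;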

Massive Data Changes

It is inevitable that every DBA must perform massive data changes in a production environment, for reasons such as upgrades, purges, etc. In a RAC database, performing massive changes requires a careful strategy.

For massive updates, if you are planning to use parallelism, consider setting the parallel_force_local parameter to true. This initialization parameter allocates all PX servers on the local node, reducing the overhead of global cache latency. However, this strategy assumes that one server has enough resources to complete the task. On the other hand, if the activity is spread across many nodes, then CPU, I/O, and memory usage is spread among all nodes. So, the choice between one instance and many depends upon (a) whether a single node can handle the workload without incurring additional latency, and (b) whether the global cache latency is large enough that keeping the workload on one node will markedly improve performance.
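
For example, the parameter can be set at the session level before running the batch update, or for a single instance; this is a minimal sketch, and the instance name shown is only illustrative.

alter session set parallel_force_local=true;
-- or, for the RAC1 instance only
alter system set parallel_force_local=true scope=both sid='RAC1';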

If you choose to use multiple instances, then table partitioning, or data partitioning through effective SQL tuning, is another effective strategy to reduce global cache transfers and improve application performance. Further, index blocks deserve close attention, as they can increase global cache transfers. For massive inserts or updates of indexed columns, it is preferable to drop the indexes before the data change and re-create them afterward.

Another typical production issue is encountered during massive index creation. Index creation is usually performed with many PX servers. As PX servers read blocks directly into the PGA, bypassing the buffer cache, global cache latency is minimal. Still, PX servers distribute blocks among themselves, which can flood the interconnect with PX messages. So, allocating all PX servers within the same node is preferable if a single node has sufficient resources to complete the index creation. Refer to Chapter 12 for further discussion of parallel query execution.
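
A minimal sketch of this approach is shown below; the table, index, and degree of parallelism are hypothetical, and the final statement simply resets the index to serial execution after the build.

alter session set parallel_force_local=true;
create index sales_cust_idx on sales(cust_id) parallel 16 nologging;
alter index sales_cust_idx noparallel;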

Further, if an index creation fails for any reason, always restart the index creation connecting to the same instance. Temporary space allocated for the index (in the permanent tablespace) is soft reserved. If you restart the index creation connecting to another instance, the subsequent index creation will trigger uncaching of free space in the other nodes, leading to the possibility of SS Enqueue contention and DFS lock handle contention.

Performance Metrics Collection

It is critical to collect performance metrics on a regular basis in a RAC cluster. In most cases, RAC node problems are side effects of some other OS issue. For example, network problems, CPU starvation, and memory starvation can lead to node failures and reboots. Root cause analysis of a node reboot requires a deep review of OS metrics; it is almost impossible to identify the root cause of a node reboot without sufficient OS performance data.

At a minimum, OSWatcher or a similar tool should collect performance data on all nodes and retain those statistics for at least a few days. These OS statistics will enable root cause analysis of node failures.

Further, AWR reports or statspack data are essential to understand pattern changes in database performance, so at least a few weeks of history should be kept in the database for future analysis.
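
For example, AWR retention and snapshot frequency can be adjusted with the dbms_workload_repository package; the values below (35 days of history, 30-minute snapshots) are only an illustrative assumption.

begin
  dbms_workload_repository.modify_snapshot_settings(
    retention => 35*24*60,   -- retention, in minutes
    interval  => 30);        -- snapshot interval, in minutes
end;
/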

Parameter File Management

The initialization parameter file is common to all instances. It is a good idea to keep the parameter file in ASM storage, as ASM storage is accessible to all instances of a RAC database.

Initialization parameters can be specified at the scope of an instance or of the whole database. For example, the following command modifies the pga_aggregate_target parameter in the RAC1 instance only.

alter system set pga_aggregate_target=10240M scope=both sid='RAC1';

The following command modifies the parameter globally. If a parameter is specified both at the database level and at the instance level, then the instance-level specification overrides the database-level specification.

alter system set pga_aggregate_target=5120M scope=both sid='*';

SPFILE will contain the following entries for the pga_aggregate_target parameter after executing the preceding two commands.

*.pga_aggregate_target=5120M
RAC1.pga_aggregate_target=10240M
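
As a quick sanity check, the spfile entries can also be inspected through the v$spparameter view; this is a generic query, not tied to the example values above.

SQL> select sid, name, value
     from v$spparameter
     where name = 'pga_aggregate_target';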

The command to reset or remove a parameter from the spfile must match the scope of the existing entry. For example, to remove *.pga_aggregate_target from the spfile completely, you must specify a sid clause matching the spfile entry.

alter system reset pga_aggregate_target scope=both sid='*';

Not all parameters can be modified at the instance level. The column ISINSTANCE_MODIFIABLE in v$parameter shows whether a parameter can be modified at the instance level. The following query of v$parameter shows that the pga_aggregate_target parameter can be modified at the instance level.

SQL> select name, ISINSTANCE_MODIFIABLE from v$parameter
     where name='pga_aggregate_target';
NAME                           ISINS
------------------------------ -----
pga_aggregate_target           TRUE

Password File Management

Password files allow non-SYS users to log in to the database with administrative privileges such as sysdba, sysoper, sysbackup (12c), sysdg (12c), etc. In 11gR2, each database instance is assumed to have its own, non-shared password file, so grants made in one instance are not propagated to the other instances. In version 11.2, you must copy the password file to each node and rerun the grants while connected to each local instance.

In version 11.2 and earlier, a better approach is to store the password file in a shared file system such as NFS or a cluster file system. With this shared password file approach, you can maintain the password file with ease, but grants still need to be executed in every instance. For example, the following command creates a password file with ten entries in the default location.

$ orapwd file='orapworcl12' entries=10

In version 12c, password files can be stored in ASM, and this greatly simplifies password file management. By default, the password file is stored in the database's ASM disk group, for example:

+DATA/ORCL12/PASSWORD/pwdorcl12.268.804626815

The following command shows how a password file can be created in an Oracle ASM disk group. As the password file is stored in ASM, all instances can access it.

$ orapwd file='+DATA/orcl12/orapworcl12' entries=8 dbuniquename='orcl12' format=12

In version 11.2 and earlier, you must grant administrative privileges in every instance explicitly, as sysdba grants operate at the instance level. This behavior is due to the inherent assumption that the password file is not shared between the instances. For example, in the following output, granting sysdba to the RS user while connected to the RAC2 instance populated the password file for instance 2 only.

SQL> grant sysdba to rs;
 
Grant succeeded.
 
SQL> select * from gv$pwfile_users where username='RS';
 
   INST_ID USERNAME                       SYSDB SYSOP SYSAS
---------- ------------------------------ ----- ----- -----
         2 RS                             TRUE  FALSE FALSE

In version 12c, sysdba grants operate at the database level, and so they are populated for all instances immediately. This behavior change in version 12c is logical, since the password file is stored in ASM and shared between the instances. The following output shows that the grant to the user is propagated to all instances, as is visible in the password file.

SQL> grant sysdba to rs;
Grant succeeded.
 
SQL> select * from gv$pwfile_users where username='RS';
 
   INST_ID USERNAME   SYSDB SYSOP SYSAS SYSBA SYSDG SYSKM     CON_ID
---------- ---------- ----- ----- ----- ----- ----- ----- ----------
         1 RS         TRUE  FALSE FALSE FALSE FALSE FALSE          0
         2 RS         TRUE  FALSE FALSE FALSE FALSE FALSE          0

The Clusterware resource for the database also has a new attribute referring to the password file; in this example, the password file is stored in ASM.

$ crsctl stat res ora.orcl12.db -p | more
...
PWFILE=+DATA/orcl12/orapworcl12
...

Version 12c introduces new administrative privileges, namely, sysdg, sysbackup, and syskm, in addition to the sysdba and sysoper privileges available in earlier versions. These new privileges can also be secured with passwords, but the password file must be created in, or migrated to, the version 12c format using the format=12 option of the orapwd command.
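
As a hedged sketch, an existing password file can be migrated into a 12c-format file stored in ASM using the input_file option of orapwd; the file paths and disk group below are assumptions for the example, so verify the exact options against the documentation for your release.

$ orapwd input_file='/u01/app/oracle/product/12.1.0/dbhome_1/dbs/orapworcl12' \
         file='+DATA/orcl12/orapworcl12' dbuniquename='orcl12' format=12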

It is important to maintain the password file consistently across all instances. Storing the password file in a shared file system is a good practice in earlier versions, and storing it in ASM is a good practice in version 12c.

Managing Databases and Instances

To manage a database, it is advisable to use the srvctl command. While it is possible to manage databases as Clusterware resources using crsctl commands, you should use the srvctl command whenever possible. The srvctl command provides an additional layer of security and avoids unnecessary activity in the Clusterware owner account.

The general syntax of the srvctl command is as follows:

srvctl <operation> <object> [<options>]

For example, to stop a database, you would use the following command:

$ srvctl stop database -d ORCL

By default, the database is stopped in immediate shutdown mode. Other shutdown modes, such as abort or transactional, can be specified using the -o option, as follows.

$ srvctl stop database -d ORCL -o abort

The database can also be started using the srvctl command. The following commands start the database in OPEN startup mode. Startup mode can be explicitly specified too.

$ srvctl start database -d ORCL
$ srvctl start database -d ORCL -o open

The parameter file can be explicitly specified using the pfile option.

$ srvctl start database -d ORCL -o open,pfile=initORCL1.ora

Instances can also be managed individually. For example, the ORCL1 instance alone can be stopped using the first command below. It is also possible to specify a node name, as the mapping between instance name and node number is already stored in the OCR. You can also stop multiple instances with a single command by specifying a list of instances.

$ srvctl stop instance -d ORCL -i ORCL1
$ srvctl stop instance -d ORCL -n RAC1
$ srvctl stop instance -d ORCL -i ORCL1,ORCL2
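
The current state of the database and its instances can be checked with the status command; the output below is representative of what srvctl prints, not an exact transcript.

$ srvctl status database -d ORCL
Instance ORCL1 is running on node RAC1
Instance ORCL2 is running on node RAC2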

The configuration of a database can be queried using the config database command, which shows the spfile location, ORACLE_HOME location, default start options, default stop options, the instances of the database, and so on.

$ srvctl config database -d ORCL
Database unique name: ORCL
Database name: ORCL
Oracle home: /u01/app/oracle/product/11.2.0/dbhome_1
Oracle user: oracle
Spfile: +DATA/orcl/spfileorcl1.ora
Domain: rac.example.com
Start options: open
Stop options: immediate
Database role: PRIMARY
Management policy: AUTOMATIC
Server pools: ORCL
Database instances: ORCL1,ORCL2
Disk Groups: DATA
Services:
Database is administrator managed

The mapping between node names and instance names is stored in the OCR, and the array variable USR_ORA_INST_NAME maintains this mapping. For example, in the following configuration, the ORCL1 instance will be started on the RAC1 node and the ORCL2 instance on the RAC2 node. It is possible to modify the USR_ORA_INST_NAME array variable and change the mapping if needed.

$ crsctl stat res ora.racdb1.db -p|grep ORA_INST_NAME
...
GEN_USR_ORA_INST_NAME@SERVERNAME(RAC1)=ORCL1
GEN_USR_ORA_INST_NAME@SERVERNAME(RAC2)=ORCL2
...
USR_ORA_INST_NAME@SERVERNAME(RAC1)=ORCL1
USR_ORA_INST_NAME@SERVERNAME(RAC2)=ORCL2

However, the ASM instance mapping cannot be changed as easily, since the ASM instance name is dynamically derived from the node number. Updating the node number would require Clusterware reconfiguration, as the node number to node name mapping is stored in the voting disk.

$ crsctl stat res ora.asm -p |grep INST_NAME
..
GEN_USR_ORA_INST_NAME@SERVERNAME(RAC1)=+ASM1
GEN_USR_ORA_INST_NAME@SERVERNAME(RAC2)=+ASM2
USR_ORA_INST_NAME=+ASM%CRS_CSS_NODENUMBER%

In a cluster with a large number of nodes, it is typical to name the nodes with an incrementing suffix, such as RAC1, RAC2, RAC3, etc. It is also convenient to keep the suffixes of both database and ASM instances matching the node name suffixes; for example, the ORCL1 and +ASM1 instances on the RAC1 node, the ORCL2 and +ASM2 instances on the RAC2 node, and so on. As the node number is assigned at the time of cluster configuration, execute root.sh in the desired order: execute root.sh on the RAC1 node, wait for completion, execute root.sh on the RAC2 node, wait for completion, and so on. This execution order will maintain the correct suffix mapping.

Managing VIPs, Listeners

To manage VIPs and listeners, use of the srvctl command is recommended. The configuration of a VIP can be queried using the config vip command.

$ srvctl config vip  -n RAC1
VIP exists.:RAC1
VIP exists.: /RAC1_vip/1.2.1.101/255.255.255.0/eth1
 
$ srvctl config vip  -n RAC2
VIP exists.:RAC2
VIP exists.: /RAC2_vip/1.2.1.102/255.255.255.0/eth1

VIPs can be started and stopped using the srvctl command, but Clusterware daemons usually monitor these resources, and so these commands are seldom used.

$ srvctl stop vip  -n RAC2
$ srvctl start vip  -n RAC2

Similarly, listeners can be managed using the srvctl command. For example, configuration of a listener can be queried using the following config command.

$ srvctl config listener  -l LISTENER
Name: LISTENER
Network: 1, Owner: oracle
Home: <CRS home>
End points: TCP:1521

Listeners can be stopped and started using the srvctl command with the following syntax. Executing the lsnrctl command directly is not a recommended practice.

$ srvctl stop listener  -l LISTENER
$ srvctl start listener  -l LISTENER
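
Similarly, a quick status check is available; the output shown is representative, assuming the default listener is running on both nodes.

$ srvctl status listener -l LISTENER
Listener LISTENER is enabled
Listener LISTENER is running on node(s): RAC1,RAC2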

The srvctl command is an option-rich tool; if you are not sure about the syntax, type srvctl -h to see the available options.

Miscellaneous Topics

There are some important topics that require special attention in RAC. For example, in a single-instance database, memory starvation can lead to performance issues, but in RAC, memory starvation can lead to a node reboot.

Process Priority

In RAC, a few background processes must run with higher CPU scheduling priority. For example, the LMS background process should run with elevated CPU priority, as the LMS process is the workhorse of global cache transfer, and CPU starvation of an LMS process increases global cache latency. Further, if an LMS process suffers from CPU starvation, network packets can be dropped, leading to gc blocks lost waits in other nodes. The underscore initialization parameter _high_priority_processes controls which processes execute at an elevated priority. By default, this parameter elevates the priority of the VKTM (Virtual Keeper of Time) and LMS background processes. Generally, you do not need to modify this parameter.

_high_priority_processes= 'LMS*|VKTM'
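
The current setting can be checked by querying the underlying X$ views while connected as SYS; this is a generic technique for inspecting underscore parameters, shown here only as a sketch.

SQL> select i.ksppinm name, v.ksppstvl value
     from x$ksppi i, x$ksppsv v
     where i.indx = v.indx
     and i.ksppinm = '_high_priority_processes';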

For example, on the Solaris platform, the LMS process executes with RT priority. Of course, elevating a process priority requires higher-level permissions, and this is achieved using the oradism binary on Unix platforms and the oradism service on the Windows platform. Notice in the following listing that the oradism executable is owned by the root userid with the setuid flag enabled, and the OS group prod has execute permission on the binary. Essentially, with the setuid permission of the root user, the oradism executable is used to elevate the priority of background processes.

$ ls -lt oradism
-rwsr-x---   1 root     prod     1320256 Sep 11  2010 oradism

Setuid permission on the oradism executable is needed to alter the priority of background processes. When you execute root.sh during GI software installation, this setuid permission is set on the oradism executable; however, during software cloning, DBAs typically do not run root.sh, and so the oradism executable may not have the correct permissions. In that case, the priority of the LMS process is not elevated, leading to global cache latency issues during CPU starvation. After cloning the software, it is recommended to execute the root.sh script and verify that the oradism executable has the correct permissions.
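
On Linux, one quick way to verify the effect is to check the scheduling class and priority of the LMS processes with standard ps options; a real-time scheduling class in the CLS column indicates the priority was elevated. Exact output varies by platform, so treat this as a sketch.

$ ps -eo pid,cls,pri,comm | grep -i lms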

Also, Grid Infrastructure processes such as cssd.bin, cssdagent, and cssdmonitor should run with elevated priority. Again, the oradism binary in the Grid Infrastructure software home must have the proper permissions. Permissions on the oradism file in GI were not a big problem until version 11.2, as cloning of GI was not widely used. However, in version 12c, cloning of the GI software is allowed, and the binary permissions require careful attention if the GI software is cloned.

While elevated priority protects the background processes from CPU starvation, it does not mean that CPU starvation cannot cause node stability issues. For example, for certain types of global cache transfer, the LMS process must wait for the LGWR background process to complete a log flush sync before sending the blocks. If there is CPU starvation on the server, LGWR might not get enough CPU, and the resulting LGWR delays lead to longer waits in the LMS process. Remote instances then suffer from prolonged global cache transfers. So, CPU starvation on one node can increase global cache latency on other nodes, inducing slowness in the application; in a few scenarios, this CPU starvation can even lead to node reboots. It is a recommended practice to keep CPU usage below 75% to avoid CPU starvation and reduce node stability issues.

Memory Starvation

Memory starvation in a cluster node can cause stability issues in a RAC cluster. Elevated process priority protects the background process from CPU starvation but does not prevent memory starvation issues. Memory starvation of an LMS process can lead to higher latency for global cache wait events in other nodes. Further, memory starvation of Grid Infrastructure daemons can lead to node reboots. It is a recommended practice to design cluster nodes with sufficient memory and avoid memory starvation.

Virtual memory swappiness is an important attribute to consider on the Linux platform. The vm.swappiness kernel parameter affects swapping behavior: if it is set to a lower value, pages are kept in memory as long as possible; if it is set to a higher value, pages are swapped out more aggressively. Linux kernel version 2.6.18 improves the throughput of the swapping daemons, so you should upgrade the kernel to at least version 2.6.18. If you cannot upgrade, then reduce the vm.swappiness kernel parameter to a lower value, such as 30 (the default is 60).
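
A minimal sketch of the change is shown below; the value 30 follows the guideline above, and the parameter should also be persisted in /etc/sysctl.conf so it survives a reboot.

# apply immediately (as root)
# sysctl -w vm.swappiness=30

# persist across reboots in /etc/sysctl.conf
vm.swappiness = 30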

We also recommend that you implement HugePages on the Linux platform. At the time of this writing, Oracle recommends HugePages as one of the important ways to improve the stability of a RAC cluster. With HugePages, two important benefits are achieved:

  • The SGA is locked in memory, so the Linux kernel daemons have fewer pages to scan during memory starvation, and the locked SGA is protected from paging.
  • Further, the default memory page size on the Linux platform is 4KB. Pointers to these pages must be maintained in page table entries, and so the page table can easily grow to a few gigabytes of memory. With the HugePages feature, the page size is increased to 2MB and the page table shrinks accordingly.

HugePages and Automatic Memory Management (AMM) do not work together. We recommend you use HugePages instead of AMM to improve RAC stability.
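
A hedged configuration sketch follows; the page count, group id, and memlock values are assumptions sized for a hypothetical 32GB SGA with 2MB pages, so derive the real values from your own SGA size and verify against the documentation for your release.

# /etc/sysctl.conf (values are illustrative)
vm.nr_hugepages = 16500            # enough 2MB pages to cover the SGA
vm.hugetlbfs_shm_group = 507       # gid of the group allowed to use HugePages

# /etc/security/limits.conf (values in KB; at least the SGA size)
oracle soft memlock 60397977
oracle hard memlock 60397977

In the database, the use_large_pages initialization parameter (available in 11.2.0.2 and later) can be set to ONLY so that the instance starts only when the entire SGA fits in HugePages.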

SGA size

If data blocks already reside in the local SGA, then access to those buffers takes on the order of microseconds. Thus, it is important to configure a larger SGA for a database with an OLTP workload, and it is a best practice to increase the SGA size when you convert a database to RAC.

If the application workload is mostly Decision Support System (DSS) or data warehouse type, then keep the PGA bigger and the buffer cache smaller, as PX server processes read blocks directly into the PGA (Program Global Area). The pga_aggregate_target parameter is not a hard limit, and the sum of PGA allocated by the database processes can far exceed it. This unintended excessive memory usage can induce memory starvation on the server and lead to performance issues. Release 12c introduces the pga_aggregate_limit parameter, and the total PGA used by the database cannot exceed the pga_aggregate_limit value. If the database has allocated pga_aggregate_limit amount of PGA, then the process with the largest PGA allocation is stopped until the PGA usage falls below the pga_aggregate_limit value. Therefore, in 12c, you can avoid excessive memory swapping issues by setting pga_aggregate_limit to an optimal value.
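
For example, the limit could be set cluster-wide as shown below; the 16G value is purely illustrative and should be derived from the physical memory available on your nodes.

alter system set pga_aggregate_limit=16G scope=both sid='*';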

It is advisable to keep the SGA size as big as possible without inducing memory starvation, as discussed earlier. Database nodes with ample physical memory are essential while converting a database to a RAC database. More memory can also reduce unnecessary CPU usage, which can be beneficial from a licensing point of view.

Filesystem Caching

The file systems holding the Oracle database software and Grid Infrastructure software must have caching enabled. Without file system caching, every access to a page of the software binaries (which are accessed frequently by executing UNIX processes) leads to a disk read. These excessive disk reads induce latency in the executables and can lead to an instance reboot or even a node reboot.

Summary

In summary, the following guidelines are offered for the operational aspects of a RAC cluster.

  • Use services effectively to partition the application workload. Load balancing and failover can be achieved elegantly with the use of services.
  • Use SCAN listeners and SCAN IP addresses. Publish only the SCAN name to the application. Do not allow an application to connect to VIP listeners directly.
  • Use three SCAN IP addresses mapping to a SCAN name.
  • Set up TAF attributes of a service to implement TAF.
  • Use tools such as universal connection pool or Active GridLink data source to implement faster failover and load balancing capabilities.
  • Use as many temp files as the number of instances for temporary tablespaces. Allocate more space if you are planning to add more instances to the database.
  • Manage application connections and users so as to reduce locking due to soft reservation in temporary tablespaces.
  • Verify that oradism binary has correct permissions so that background processes can execute at an elevated priority.
  • In Linux, implement HugePages to avoid unnecessary reboots.

1 On Exadata platforms, the remote buffer cache access latency is about 0.5ms. Both faster CPUs and InfiniBand fabric hardware provide lower latency.

2 AWR report also prints the service-level metrics.

3 Notice the syntax for adding scan_listener does not specify a  netnum clause. This appears to be a documentation bug, so refer to public documentation to verify the syntax.

4 Applications typically borrow connections from a connection pool, perform database activity, and return the connections to the connection pool. It is likely that some connections are not reused for days, and so application errors may appear after a few days.

5 Only partial output is shown. Files 1 through 5 also exhibit similar caching behavior, but extents in use were 0 for those files when the view was queried. Thus, the output of those files is not shown.
