© Y V Ravi Kumar, Nassyam Basha, Krishna Kumar K M, Bal Mukund Sharma, Konstantin Kerekovski 2019
YV Ravi Kumar, Nassyam Basha, Krishna Kumar K M, Bal Mukund Sharma and Konstantin Kerekovski, Oracle High Availability, Disaster Recovery, and Cloud Services, https://doi.org/10.1007/978-1-4842-4351-0_5

5. Troubleshooting and Conducting Health Checks for High Availability

Y V Ravi Kumar (1), Nassyam Basha (2), Krishna Kumar K M (3), Bal Mukund Sharma (4) and Konstantin Kerekovski (5)
(1)
Chennai, Tamil Nadu, India
(2)
Riyadh, Saudi Arabia
(3)
Bangalore, Karnataka, India
(4)
Belmont, CA, USA
(5)
Pinellas Park, FL, USA
 

The goal of this chapter is to introduce you to the various health check utilities, both new in 12.2 and from previous versions, that can greatly ease management and troubleshooting efforts for High Availability and DR solutions. This chapter will focus mostly on features that are not necessarily Exadata specific but that can and should be used in Exadata and other engineered environments. The chapter will highlight important and often hidden aspects of the different features and software that can be used for health checks as well as cover a few of the new features available in the 12.2 GI and RDBMS software.

Cluster Verification Utility

The Cluster Verification Utility (CVU) is an extremely important component of the Grid Infrastructure installation. This tool can be used to perform a variety of health checks on an Oracle Real Application Clusters (RAC) environment, regardless of whether the cluster runs on commodity hardware or an engineered system. The tool is essential during the installation of Grid Infrastructure as well as in the Database Configuration Assistant (DBCA). However, it also has several great and often underutilized one-off use cases, which this chapter will go over. The Cluster Verification Utility is available in $GRID_HOME, and you will need to set your environment to +ASM to launch CVU successfully.

Software Home Checks

Sometimes, for various reasons, a DBA may come to question the integrity of a software home, such as the RDBMS installation home or the Grid Infrastructure home itself, for example, after patching and having to roll back certain patches or after a particularly problem-ridden Grid Infrastructure upgrade or patch attempt that required manual intervention. It is always good to be able to say that the file permissions and user/group ownerships of all files within an Oracle software installation are correct and as they should be.

The following is an example of a call to cluvfy that points out a file permissions issue and specifies what the expected and observed file permissions are for the files in question to help resolve the situation:
[oracle@stpldb101 ~]$ cluvfy comp software -n all -d /u01/app/12.2.0.1/grid/ -allfiles -r 12.2
Verifying Software home: /u01/app/12.2.0.1/grid/ ...FAILED (PRVG-2033)
Verification of software was unsuccessful on all the specified nodes.
Failures were encountered during execution of CVU verification request "software".
Verifying Software home: /u01/app/12.2.0.1/grid/ ...FAILED
stpldb102: PRVG-2033 : Permissions of file
           "/u01/app/12.2.0.1/grid//bin/kfod.bin" did not match the expected
           octal value on node "stpldb102". [Expected = "751" ; Found = "0755"]
stpldb102: PRVG-2033 : Permissions of file
           "/u01/app/12.2.0.1/grid//crs/install/dropdb.pl" did not match the
           expected octal value on node "stpldb102". [Expected = "700" ; Found
           = "0755"]
stpldb102: PRVG-2033 : Permissions of file
           "/u01/app/12.2.0.1/grid//lib/acfstoolsdriver.sh" did not match the
           expected octal value on node "stpldb102". [Expected = "750" ; Found
           = "0755"]
stpldb101: PRVG-2033 : Permissions of file
           "/u01/app/12.2.0.1/grid//bin/kfod.bin" did not match the expected
           octal value on node "stpldb101". [Expected = "751" ; Found = "0755"]
stpldb101: PRVG-2033 : Permissions of file
           "/u01/app/12.2.0.1/grid//crs/install/dropdb.pl" did not match the
           expected octal value on node "stpldb101". [Expected = "700" ; Found
           = "0755"]
stpldb101: PRVG-2033 : Permissions of file
           "/u01/app/12.2.0.1/grid//lib/acfstoolsdriver.sh" did not match the
           expected octal value on node "stpldb101". [Expected = "750" ; Found
           = "0755"]
CVU operation performed:      software
Date:                         Aug 19, 2017 12:23:27 PM
CVU home:                     /u01/app/12.2.0.1/grid/
User:                         oracle
[oracle@stpldb101 ~]$ echo $?
1
The following is an example of running cluvfy against a 12.2 RDBMS home, which came back as not having any issues:
[oracle@stpldb101 ~]$ cluvfy comp software -n all -d /u01/app/oracle/product/12.2.0.1/dbhome_1 -allfiles -r 12.2
Verifying Software home: /u01/app/oracle/product/12.2.0.1/dbhome_1 ...PASSED
Verification of software was successful.
CVU operation performed:      software
Date:                         Aug 19, 2017 12:26:29 PM
CVU home:                     /u01/app/12.2.0.1/grid/
User:                         oracle
[oracle@stpldb101 ~]$ echo $?
0

One of the great things about the Cluster Verification Utility is that the return code of many of the commands signifies whether an issue was encountered. Because of this, it is actually easy to set up shell scripts that run cluvfy comp software and then email an administrator if an issue is found so that it can be investigated.
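The following is a minimal sketch of such a wrapper, suitable for cron. The mail recipient and the assumption that cluvfy and a local mail command are available in the environment are illustrative and not part of the original example:
#!/bin/bash
# Hypothetical wrapper: run the software home check, keep the report,
# and mail it to an administrator only when cluvfy exits nonzero.
REPORT=/tmp/cluvfy_software_$(date +%Y%m%d).log
if ! cluvfy comp software -n all -d /u01/app/12.2.0.1/grid/ -allfiles -r 12.2 > "$REPORT" 2>&1; then
    mail -s "cluvfy software check failed on cluster stpldb01" [email protected] < "$REPORT"
fi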

File System Space Checks

Database administrators must keep an eye on the free space available on the file systems that contain the Grid Infrastructure and Oracle Database software homes because of logging configurations and other environment-specific reasons. For this reason, every database administrator likely has their own custom script that checks directories or df output for space usage. However, the Cluster Verification Utility can greatly simplify these scripts, provided Grid Infrastructure is installed.

Grid Infrastructure–Only Check

The following is an example of the command used to check the free space available on the file system that contains Grid Infrastructure:
[oracle@stpldb101 grid]$ cluvfy comp freespace -n all
Verifying '/u01/app/12.2.0.1/grid' ...FAILED (PRVG-11104)
Verification of Free Space was unsuccessful.
Checks did not pass for the following nodes:
        stpldb101
Failures were encountered during execution of CVU verification request "Free Space".
Verifying '/u01/app/12.2.0.1/grid' ...FAILED
stpldb101: PRVG-11104 : The disk free space for file system path
           "/u01/app/12.2.0.1/grid" on node "stpldb101" is below "5.0" percent
           of total disk space. The required free space is "4960.700000000001
           MB", the available free space is "1748.0 MB" and the total disk size
           is "99214.0 MB".
CVU operation performed:      Free Space
Date:                         Aug 19, 2017 12:41:43 PM
CVU home:                     /u01/app/12.2.0.1/grid/
User:                         oracle
[oracle@stpldb101 grid]$ echo $?
0

While cluvfy comp freespace is handy, it is a bit limited in what it can do. Still, it has its uses as a quick check of free space in the Grid Infrastructure home. It checks only whether the file system has at least 5 percent free space, and it checks only the Grid Infrastructure home. Moreover, the command's return code does not indicate whether the space check failed, which can complicate scripting; a possible workaround is sketched below.
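Since the exit status cannot be relied upon here, a wrapper has to inspect the command output itself. The following is a minimal sketch under that assumption; the FAILED keyword comes from the sample output shown earlier and may vary between versions:
#!/bin/bash
# Hypothetical workaround: cluvfy comp freespace exits 0 even on failure,
# so scan its output for the FAILED keyword instead of the return code.
OUT=$(cluvfy comp freespace -n all 2>&1)
if echo "$OUT" | grep -q "FAILED"; then
    echo "$OUT"
    exit 1
fi
exit 0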

Generic Space Checks

There exists another type of file system free space check within the Cluster Verification Utility, and that is the one provided by cluvfy comp space. This type of verification is much more robust and comes with the added benefit of return codes that change depending on whether the check fails or succeeds.

To illustrate the utility of this particular verification, the output of a check on the Oracle user’s home ($HOME) is shown here:
[oracle@stpldb101 ~]$ cluvfy comp space -n all -l /home/oracle/ -z 20G -verbose
Verifying Available File System Space ...
  Node Name     Available                 Required                  Status
  ------------  ------------------------  ------------------------  -------
  stpldb102     28.1875GB (2.9556736E7KB)  20GB (2.097152E7KB)       passed
  stpldb101     11.5938GB (1.2156928E7KB)  20GB (2.097152E7KB)       failed
Verifying Available File System Space ...FAILED
Verification of space availability was unsuccessful.
Checks did not pass for the following nodes:
        stpldb101
Failures were encountered during execution of CVU verification request "space availability".
Verifying Available File System Space ...FAILED
stpldb101: Sufficient space is not available at location "/home/oracle" on node
"stpldb101" [Required space = 20GB (2.097152E7KB); available space = 11.5938GB
(1.2156928E7KB)]
CVU operation performed:      space availability
Date:                         Aug 19, 2017 4:14:49 PM
CVU home:                     /u01/app/12.2.0.1/grid/
User:                         oracle
[oracle@stpldb101 ~]$ echo $?
1
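Because the exit status of cluvfy comp space is reliable, it is straightforward to sweep several file systems in one pass. The following sketch is illustrative only; the locations and thresholds are assumptions, not values from the original example:
#!/bin/bash
# Hypothetical sweep: check a list of locations with cluvfy comp space and
# report any that fall below the requested amount of free space.
RC=0
while read -r LOCATION THRESHOLD; do
    if ! cluvfy comp space -n all -l "$LOCATION" -z "$THRESHOLD" > /dev/null 2>&1; then
        echo "Space check failed for $LOCATION (wanted $THRESHOLD free)"
        RC=1
    fi
done <<EOF
/home/oracle 20G
/u01 10G
/tmp 2G
EOF
exit $RC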

Cluster Verification Utility Health Checks

The Cluster Verification Utility can be used to perform health checks on three major components: databases, ASM, and clusters. We will go in depth on the health checks available for all three.

Database Health Checks

This type of check cannot be run without first performing some setup. Because CVU needs to connect to a database to run health checks and check for best-practice adherence, credentials need to be stored. Oracle uses a secure external password store, also known as a wallet. The wallet needs to be set up for health checks to work properly for the database component. This section goes over how to create such a wallet and how to run a health check using CVU.

CVU Wallets

Unlike the stand-alone Oracle connection wallets that can be used by regular clients, the wallet used by the Cluster Verification Utility is managed by the Grid Infrastructure software and is created, modified, and deleted via crsctl commands.

A wallet can be created by issuing the following command as the Grid Infrastructure owner. This will add a wallet entry for the user DBSNMP to the database orclcdb.
[oracle@stpldb101 admin]$ crsctl add wallet -name orclcdb -type CVUDB -user DBSNMP -passwd
Enter Password:
In container databases, the DBSNMP user exists in the CDB$ROOT container. Ensure that the user is unlocked and has the same password that was placed into the wallet. For example, the password database_123 is being used here; it is highly recommended to use secure passwords in a production environment.
SQL> alter user dbsnmp account unlock;
User altered.
SQL> alter user dbsnmp identified by "database_123";
User altered.
SQL> show con_name
CON_NAME
------------------------------
CDB$ROOT
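If the DBSNMP password is later rotated, the stored entry can be inspected and refreshed in place. The following is a hedged sketch assuming the crsctl wallet syntax mirrors the add command shown earlier; check crsctl modify wallet -h on your version before relying on it:
[oracle@stpldb101 admin]$ crsctl query wallet -type CVUDB -name orclcdb -all
[oracle@stpldb101 admin]$ crsctl modify wallet -type CVUDB -name orclcdb -user DBSNMP -passwd
Enter Password: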

Running the Health Check

The following is the expected output when the health check is run. This type of call to cluvfy saves the output in a text file and outputs only the deviations from best practices, per the -deviations flag. If you require HTML output, you can use the -html flag. It is also important to note that cluvfy does not return a usable return code for deciding whether deviations exist.
[oracle@stpldb101 admin]$ cluvfy comp healthcheck -collect database -db orclcdb -bestpractice -deviations -save -savedir /home/oracle/bestpractices
Verifying best practice for database "orclcdb" ...
***************************************************************************
        Summary of environment
***************************************************************************
Date (mm/dd/yyyy)    :  08/19/2018
Time (hh:mm:ss)      :  16:45:11
Cluster name         :  stpldb01
Clusterware version  :  12.2.0.1.0
Grid home            :  /u01/app/12.2.0.1/grid
Grid User            :  oracle
Operating system     :  Linux4.1.12-124.15.2.el6uek.x86_64
Database1            :  Database name     -  orclcdb
                        Database version  -  12.2.0.1.0
                        Database home     -
                        /u01/app/oracle/product/12.
                        2.0.1/dbhome_1
***************************************************************************
Database recommendation checks for "orclcdb"
***************************************************************************
Verification Check        :  Java Role Count
Verification Description  :  Checks JVM roles
Verification Result       :  NOT MET
Verification Summary      :  Check for Java Role Count passed
Additional Details        :  A healthy JVM should contain six roles. If there are more or less than six roles, the JVM is inconsistent.
Verification Check        :  INVALID objects in the application related schemas
Verification Description  :  Checks for the presence of INVALID objects in the
                             application related schemas (non SYS and SYSTEM)
Verification Result       :  NOT MET
Verification Summary      :  Check for INVALID objects in the application related schemas passed
Additional Details        :  Investigate invalid objects in the application related schemas (non SYS and SYSTEM) .
Verification Check        :  Alert log messages indicating internal errors ORA-07445
                             errors
Verification Description  :  Checks for ORA-07445 errors in alert log
Verification Result       :  NOT MET
Verification Summary      :  Check for Alert log messages indicating internal errors ORA-07445 errors passed
Additional Details        :  Recurring ORA-07445 errors may lead to database block corruption or some serious issue. See the trace file for more information next to ORA-07445 error in alert log. If the problem persists, contact Oracle Support Services.
Verification Check        :  FILESYSTEMIO_OPTIONS
Verification Description  :  Checks FILESYSTEMIO_OPTIONS parameter
Verification Result       :  NOT MET
Verification Summary      :  Check for FILESYSTEMIO_OPTIONS passed
Additional Details        :  FILESYSTEMIO_OPTIONS=setall supports both direct I/O and asynchronus I/O which helps to achieve optimal performance with database data files
Complete report of this execution is saved in file "/home/oracle/bestpractices/cvucheckreport_08192017164511.txt"
Verification of Health Check was successful.
Warnings were encountered during execution of CVU verification request "Health Check".
Verifying Java Role Count ...WARNING
orclcdb: PRVE-2623 : The JVM roles appear to be inconsistent
Verifying INVALID objects in the application related schemas ...WARNING
orclcdb: PRVE-2563 : Application objects were found to be invalid
Verifying Alert log messages indicating internal errors ORA-07445 errors
...WARNING
orclcdb(orclcdb1): PRVE-2923 : ORA-07445 errors found in the alert log in alert
                   log destination
                   "/u01/app/oracle/diag/rdbms/orclcdb/orclcdb1/alert/" on node
                   "stpldb101.localdomain".
orclcdb(orclcdb2): PRVE-2923 : ORA-07445 errors found in the alert log in alert
                   log destination
                   "/u01/app/oracle/diag/rdbms/orclcdb/orclcdb2/alert/" on node
                   "stpldb102.localdomain".
Verifying FILESYSTEMIO_OPTIONS ...WARNING
orclcdb: PRVE-10138 : FILESYSTEMIO_OPTIONS is not set to the recommended value
         of setall
CVU operation performed:      Health Check
Date:                         Aug 19, 2018 4:44:59 PM
CVU home:                     /u01/app/12.2.0.1/grid/
User:                         oracle
A set of “mandatory” checks can also be done. If you need to save them, you can save them to a directory using the -save and -savedir flags. The following is an example of the type of checks that can be run:
[oracle@stpldb101 ~]$ cluvfy comp healthcheck -collect database -db orclcdb -mandatory
Verifying mandatory requirements for database "orclcdb"
Verifying Easy Connect configuration:orclcdb ...passed
Verifying AUDIT_FILE_DEST location check ...met
Verifying Database operating system groups consistency check ...met
Verifying DB Initialization Parameters ...met
Verifying ACFS device special file ...met
Verifying /dev/shm mounted as temporary file system ...met
Verifying disk free space for Oracle Clusterware home "/u01/app/12.2.0.1/grid"...passed
***************************************************************************
        Summary of environment
***************************************************************************
Date (mm/dd/yyyy)    :  08/19/2018
Time (hh:mm:ss)      :  16:54:57
Cluster name         :  stpldb01
Clusterware version  :  12.2.0.1.0
Grid home            :  /u01/app/12.2.0.1/grid
Grid User            :  oracle
Operating system     :  Linux4.1.12-124.15.2.el6uek.x86_64
Database1            :  Database name     -  orclcdb
                        Database version  -  12.2.0.1.0
                        Database home     -
                        /u01/app/oracle/product/12.
                        2.0.1/dbhome_1
***************************************************************************
Database requirement checks for "orclcdb"
***************************************************************************
Verification Check        :  Easy Connect configuration:orclcdb
Verification Description  :  This check ensures that the Easy Connect is configured as an Oracle Net name resolution method
Verification Result       :  PASSED
Node                Status    Expected Value                Actual Value
---------------------------------------------------------------------------
stpldb102           PASSED    not applicable                not applicable
stpldb101           PASSED    not applicable                not applicable
___________________________________________________________________________
Verification Check        :  AUDIT_FILE_DEST location check
Verification Description  :  This task checks that AUDIT_FILE_DEST parameter designates a ACFS storage location that is not shared by any other instance of the database
Verification Result       :  MET
Node                Status    Expected Value                Actual Value
---------------------------------------------------------------------------
orclcdb             PASSED    not applicable                not applicable
___________________________________________________________________________
Verification Check        :  Database operating system groups consistency check
Verification Description  :  This task verifies consistency of database operating
                            system groups across nodes.
Verification Result      :  MET
Database(Instance)  Status    Expected Value                Actual Value
---------------------------------------------------------------------------
orclcdb             PASSED    not applicable                not applicable
___________________________________________________________________________
Verification Check        :  DB Initialization Parameters
Verification Description  :  This task collects DB Initialization Parameters
Verification Result       :  MET
Database(Instance)  Status    Expected Value                Actual Value
---------------------------------------------------------------------------
orclcdb2            Unknown   not applicable                not applicable
orclcdb1            Unknown   not applicable                not applicable
orclcdb             PASSED    not applicable                not applicable
___________________________________________________________________________
Verification Check        :  ACFS device special file
Verification Description  :  checks the attributes for the ACFS device special file
Verification Result       :  MET
Verification Summary      :  Check for ACFS device special file passed
Additional Details        :  The device special file attributes for '/dev/ofsctl' must be set correctly in order for the ASM instance to function correctly.
Node                Status    Expected Value                Actual Value
---------------------------------------------------------------------------
stpldb102           PASSED    not applicable                not applicable
stpldb101           PASSED    not applicable                not applicable
___________________________________________________________________________
Verification Check        :  /dev/shm mounted as temporary file system
Verification Description  :  Checks whether /dev/shm is mounted correctly as temporary file system
Verification Result       :  MET
Verification Summary      :  Check for /dev/shm mounted as temporary file system passed
Node                Status    Expected Value                Actual Value
---------------------------------------------------------------------------
stpldb102           PASSED    true                          true
stpldb101           PASSED    true                          true
___________________________________________________________________________
Verification of Health Check was unsuccessful.
Checks did not pass for the following database instances:
        orclcdb2,orclcdb1
CVU operation performed:      Health Check
Date:                         Aug 19, 2018 4:54:46 PM
CVU home:                     /u01/app/12.2.0.1/grid/
User:                         oracle

Cluster Health Checks

The Cluster Verification Utility can also perform ASM- and OS-level health checks. The -collect cluster flag will indicate that both ASM- and OS-level checks need to be done, although -collect asm can be used to gather only ASM-level information.

For example, the following health check found both that there are ASM disks that do not belong to any ASM disk group (i.e., wasted space) and that jumbo frames are not configured for the cluster interconnects, an OS-level check.
[oracle@stpldb101 ~]$ cluvfy comp healthcheck -collect cluster -bestpractice -deviations
Verifying OS Best Practice
Verifying Ethernet Jumbo Frames ...not met
Verifying Clusterware Best Practice
Verifying ASM best practices
***************************************************************************
        Summary of environment
***************************************************************************
Date (mm/dd/yyyy)    :  08/19/2018
Time (hh:mm:ss)      :  17:00:37
Cluster name         :  stpldb01
Clusterware version  :  12.2.0.1.0
Grid home            :  /u01/app/12.2.0.1/grid
Grid User            :  oracle
Operating system     :  Linux4.1.12-124.15.2.el6uek.x86_64
***************************************************************************
System recommendations
***************************************************************************
Verification Check        :  Ethernet Jumbo Frames
Verification Description  :  Checks if Jumbo Frames are configured on the system
Verification Result       :  NOT MET
Verification Summary      :  Check for Ethernet Jumbo Frames failed
Additional Details        :  A performance improvement can be seen with Jumbo Frames, check with your system and network administrator first and if possible, configure Jumbo Frames on interconnect. See reference for more detail specific to platform.
References (URLs/Notes)   :  http://docs.hp.com/en/AB290-90001/ch01s11.html?jumpid=reg_R
                             1002_USEN
                             http://www.sun.com/products/networking/ethernet/jumbo/
                             http://kbserver.netgear.com/kb_web_files/n101539.asp
                             http://www.cisco.com/en/US/products/hw/switches/ps700/produ
                             cts_configuration_example09186a008010edab.shtml
                             http://www.juniper.net/company/presscenter/pr/2005/pr-05081
                             5.html
                             http://darkwing.uoregon.edu/~joe/jumbo-clean-gear.html
                             http://www.dell.com/content/products/productdetails.aspx/pw
                             cnt_2724?c=us&cs=04&l=en&s=bsd
                             http://www.intel.com/support/network/adapter/index.htm
Node            Status    Expected Value                Actual Value
---------------------------------------------------------------------------
stpldb102       NOT MET   eth1=9000;eth2=9000           eth1=1500;eth2=1500
stpldb101       NOT MET   eth1=9000;eth2=9000           eth1=1500;eth2=1500
___________________________________________________________________________
***************************************************************************
ASM recommendations
***************************************************************************
Verification Check        :  Disks without disk_group
Verification Description  :  Check disks without disk group
Verification Result       :  NOT MET
Verification Summary      :  Check for Disks without disk_group passed
Additional Details        :  The GROUP_NUMBER and DISK_NUMBER columns in V$ASM_DISK will only be valid if the disk is part of a disk group which is currently mounted by the instance. Otherwise, GROUP_NUMBER will be 0, and DISK_NUMBER will be a unique value with respect to the other disks that also have a group number of 0 .
Verification of Health Check was unsuccessful.
Checks did not pass for the following nodes:
        stpldb102,stpldb101
Failures were encountered during execution of CVU verification request "Health Check".
Verifying Ethernet Jumbo Frames ...FAILED
stpldb102: PRVE-0293 : Jumbo Frames are not configured for interconnects
           "eth1,eth2" on node "stpldb102.localdomain".
           [Expected="eth1=9000;eth2=9000"; Found="eth1=1500;eth2=1500"]
stpldb101: PRVE-0293 : Jumbo Frames are not configured for interconnects
           "eth1,eth2" on node "stpldb101.localdomain".
           [Expected="eth1=9000;eth2=9000"; Found="eth1=1500;eth2=1500"]
Verifying Disks without disk_group ...WARNING
AFD:OCR_1C: PRVE-3073 : Disks "AFD:OCR_1C" are not part of any disk group.
AFD:OCR_1B: PRVE-3073 : Disks "AFD:OCR_1B" are not part of any disk group.
AFD:OCR_1A: PRVE-3073 : Disks "AFD:OCR_1A" are not part of any disk group.
CVU operation performed:      Health Check
Date:                         Aug 19, 2018 5:00:27 PM
CVU home:                     /u01/app/12.2.0.1/grid/
User:                         oracle

Cluster Verification Utility Baselines

A related feature to the health checks is the ability to create baselines and then compare them. The Cluster Verification Utility will store the contents of a health check so that it can be referenced in the future to help highlight changes in a cluster. The following example shows a way to collect best-practice information on all components on an Oracle Real Application Cluster and save them as a baseline called baseline1. The raw output is shown to help you understand the type of data that is collected by baselines.

It should be noted that if the -collect all flag is used, a best-practice baseline will be collected for every database in the cluster. For this reason, it is recommended that you store DBSNMP passwords for all databases in the cluster.
[oracle@stpldb101 ~]$ cluvfy comp baseline -collect all -bestpractice -n all -reportname baseline1
Collecting OS best practice baseline
Collecting HugePages Existence ...collected
Collecting Hardware Clock synchronization at shutdown ...collected
Collecting availability of port 8888 ...collected
Collecting Ethernet Jumbo Frames ...collected
Collecting Clusterware best practice baseline
Collecting CSS misscount parameter ...collected
Collecting CSS reboottime parameter ...collected
Collecting CSS disktimeout parameter ...collected
Collecting Database best practice baseline for database "orclcdb"
Collecting OPTIMIZER_DYNAMIC_SAMPLING ...collected
Collecting Invalid Java Objects ...collected
Collecting SYSAUX tablespace existence ...collected
Collecting JVM configuration for database ...collected
Collecting Java based user in database ...collected
Collecting Java Role Count ...collected
Collecting Invalid SYS or SYSTEM Schema Objects ...collected
Collecting INVALID objects in the application related schemas ...collected
Collecting SPFILE ...collected
Collecting Database word size(bits) ...collected
Collecting Duplicate SYS or SYSTEM Schema Objects ...collected
Collecting MAX_DUMP_FILE_SIZE ...collected
Collecting REMOTE_LISTENER ...collected
Collecting Database Datafiles in Backup Mode ...collected
Collecting Files Needing Media Recovery ...collected
Collecting In-doubt Distributed Transactions ...collected
Collecting SYS-Owned Object Tables Check ...collected
Collecting Invalid Objects in dba_registry ...collected
Collecting Materialized View Group Refresh Jobs ...collected
Collecting Materialized View Manual Group Refresh ...collected
Collecting Redo Log Size (MB) ...collected
Collecting All tablespaces are locally managed ...collected
Collecting Cluster Interconnects ...collected
Collecting Core files destination ...collected
Collecting Alert log messages indicating internal errors ORA-00600 errors ...collected
Collecting Alert log messages indicating internal errors ORA-07445 errors ...collected
Collecting Old trace files in background dump destination ...collected
Collecting Database alert log file size ...collected
Collecting Automatic segment storage management ...collected
Collecting Average GC CR Block Receive Time ...collected
Collecting Average GC Current Block Receive Time ...collected
Collecting Automatic Undo Management ...collected
Collecting FILESYSTEMIO_OPTIONS ...collected
Verifying ASM best practices
Collecting ASM_POWER_LIMIT ...collected
Collecting ASM disk I/O error ...collected
Collecting Disks without disk_group ...collected
Collecting ASM SHARED_POOL_SIZE parameter ...collected
Collecting ASM disk group free space ...collected
Collecting ASM disk rebalance operations in WAIT status ...collected
Baseline collected.
Collection report for this execution is saved in file "/u01/app/oracle/crsdata/@global/cvu/baseline/users/baseline1.zip".
CVU operation performed:      baseline
Date:                         Aug 19, 2018 5:10:31 PM
Cluster name:                 stpldb01
Clusterware version:          12.2.0.1.0
Grid home:                    /u01/app/12.2.0.1/grid
User:                         oracle
Operating system:             Linux4.1.12-124.15.2.el6uek.x86_64
At a later time, a comparison can be done of stored baselines. Baselines exist as zip files within the CVU home and should be referenced by their absolute file path in a comparison operation. Comparisons can be made of baselines from other clusters using the -cross_compare flag, which is not shown in this example.
[oracle@stpldb101 ~]$ cluvfy comp baseline -compare /u01/app/oracle/crsdata/@global/cvu/baseline/users/baseline1.zip,/u01/app/oracle/crsdata/@global/cvu/baseline/users/baseline2.zip
Generating Baseline Report
Preparing the report..
Comparing "System Best Practices"
Comparing "Clusterware Best Practices"
Comparing "Best Practices for database "orclcdb""
Comparing "ASM Configuration Prerequisites"
Complete report of this execution is saved in file "/u01/app/oracle/crsdata/@global/cvu/report/html/cvucheckreport_08192017171818.html"
Opening the report in the browser..
Baseline report generation successful
Verification of baseline was successful.
CVU operation performed:      baseline
Date:                         Aug 19, 2018 5:18:16 PM
CVU home:                     /u01/app/12.2.0.1/grid/
User:                         oracle

Orachk

Orachk is a tool that can be used to run more comprehensive health checks and best-practice tests in an Oracle environment. Orachk comes in two flavors: Orachk and Exachk. Exachk can be used on all engineered systems other than Oracle Database Appliance, and Orachk can be used on all other commodity or engineered systems running the Oracle stack. Orachk and Exachk are similar and overlap in almost all options that are not specific to the Exadata platform (such as cell storage server checks, and so on).

Orachk is installed in the Grid Infrastructure home as well as the Oracle database software home in $ORACLE_HOME/suptools/ and can also be downloaded from Oracle directly. Each PSU that is applied to these software homes stages a newer Orachk version in the $ORACLE_HOME/suptools directory. It is recommended that you download the latest Orachk version from the Oracle web site and install that instead of using the Orachk version that is bundled with the Grid Infrastructure home.
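Installing a downloaded copy is simply a matter of unzipping it into a stand-alone directory and running the executable from there. The following is a sketch; the zip file name and target directory are assumptions for illustration:
[oracle@stpldb101 ~]$ mkdir -p /home/oracle/orachk_standalone
[oracle@stpldb101 ~]$ unzip -q -d /home/oracle/orachk_standalone orachk.zip
[oracle@stpldb101 ~]$ cd /home/oracle/orachk_standalone
[oracle@stpldb101 orachk_standalone]$ ./orachk -v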

Upgrading Orachk

Provided that the server running Orachk has access to the Internet, Orachk will prompt the user to upgrade if the tool determines that the version in use is more than 120 days old. In this particular example, the Orachk version found within the 12.2 Oracle Database home is used to show how the tool automatically prompts for an upgrade when it finds itself to be too old.
[oracle@stpldb101 orachk]$ ./orachk
This version of orachk was released on 26-May-2016 and its older than 120 days.
Latest version of orachk (ORACHK  VERSION: 12.2.0.1.3_20170719) is available at /u01/app/oracle/product/12.2.0.1/dbhome_1/suptools/.
Do you want to upgrade to the latest version of orachk? [y/n][y]y

Orachk/Exachk and Oracle RESTful Data Services

The latest versions of the Orachk and Exachk tools can now be driven via REST calls. This is made possible by Oracle REST Data Services (ORDS). ORDS can be set up for Orachk only by the root user and is available only on operating systems that are compatible with Orachk daemon mode.

The following is an example of how to configure Orachk with ORDS; it enables the automatic restart of the Orachk daemon in the case of server restarts.

When the flag -ordssetup is used, it will prompt for a user password. This is the password for a new OS user called orachkords, which will be used to run the orachk daemon. The orachkords username and password must be specified in all the REST calls made to the daemon from remote hosts.
[root@stpldb101 orachk_standalone]# ./orachk -ordssetup
Enter a password for the user
Confirm password for the user
Oracle Rest Data Service (ORDS) URL: https://stpldb101.localdomain:7080/ords/tfaml/orachk
Please start the daemon using following command: orachk -d start -ords or orachk -initsetup -ords before submitting APIs
Oracle Rest Data Service (ORDS) started successfully
[root@stpldb101 orachk_standalone]# ./orachk -initsetup -ords
Clusterware stack is running from /u01/app/12.2.0.1/grid. Is this the correct Clusterware Home?[y/n][y] y
Checking ssh user equivalency settings on all nodes in cluster for root
Node stpldb102 is not configured for ssh user equivalency and the orachk uses ssh to execute checks on remote nodes.
Without passwordless ssh orachk  can not run audit checks on the remote nodes.
If necessary due to security policies the orachk can be run on each node using -localonly option.
Do you want to configure SSH for user root on stpldb102 [y/n][y] y
Enter stpldb102 root password :
Verifying root password ...
Node stpldb102 has been configured for ssh user equivalency for root
Searching for running databases . . . . .
.  .
List of running databases registered in OCR
1. orclcdb
2. None of above
Select databases from list for checking best practices. For multiple databases, select 1 for All or comma separated number like 1,2 etc [1-2][1]. 1
.  .  .  .
.  .  .
Copying plug-ins
. .
.  .  .  .  .  .
Setting up orachk auto restart functionality
oracle-orachkscheduler start/running, process 22793
Starting orachk daemon. . . . . . . . . .
orachk daemon started successfully
Daemon log file location is : /home/oracle/orachk_standalone/orachk_daemon.log
The ORDS setup can easily be removed by using the -ordsrmsetup flag as the root user. This will also remove the orachkords user.
[root@stpldb101 ~]# cd /home/oracle/orachk_standalone/
[root@stpldb101 orachk_standalone]# ./orachk -ordsrmsetup
0 collections generated by Oracle Rest Data Service (ORDS) API found at /home/orachkords
Executing (/home/oracle/orachk_standalone/orachk -initrmsetup) to stop the orachk daemon
orachk daemon stopped successfully (mode=init)
Oracle Rest Data Service (ORDS) user - orachkords deleted successfully

Recommended Settings for Orachk Daemons

There are a few settings that should be configured for the Orachk daemon to ensure that regularly scheduled health checks do not silently fail and leave a system unmonitored for best-practice usage.

Notification Emails

At a minimum, one notification email needs to be set for Orachk if automatic health checks will be run. This ensures that noteworthy health check failures generate an alert even if the full health check reports are not being transmitted via e-mail.
[root@stpldb101 orachk_standalone]# ./orachk -set "[email protected]"
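The stored value can be confirmed afterward with the corresponding -get option, in the same way the retention setting is queried in the next section; the exact output format may vary by Orachk version:
[root@stpldb101 orachk_standalone]# ./orachk -get notification_email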

Retention Periods

By default, Orachk and Exachk do not purge collections within any specific retention window. Therefore, a retention window needs to be set if disk space is at a premium. A retention period for collected data can be set in days, per the following example:
[root@stpldb101 orachk_standalone]# ./orachk -get collection_retention
------------------------------------------------------------
ID: orachk.default
------------------------------------------------------------
------------------------------------------------------------
[root@stpldb101 orachk_standalone]# ./orachk -set "COLLECTION_RETENTION=90"
Updated ['COLLECTION_RETENTION=90'] for Id[orachk.DEFAULT]
[root@stpldb101 orachk_standalone]# ./orachk -get collection_retention
------------------------------------------------------------
ID: orachk.default
------------------------------------------------------------
COLLECTION_RETENTION  =  90
------------------------------------------------------------
[root@stpldb101 orachk_standalone]#

Automated Password Verification Checks

Orachk saves passwords that are used for authentication so that the daemon can run health checks without manual intervention. However, if passwords are changed, health checks will fail and not produce output. It is a best practice to set an interval at which Orachk and Exachk verify that the stored passwords are still correct. If the passwords have changed, Orachk will send an e-mail to the list of recipients in the NOTIFICATION_EMAIL setting and stop running to allow the administrator to restart the daemon and re-enter the pertinent passwords. The password verification interval is specified in terms of hours.
[root@stpldb101 orachk_standalone]# ./orachk -set "PASSWORD_CHECK_INTERVAL=12"
Updated ['PASSWORD_CHECK_INTERVAL=12'] for Id[orachk.DEFAULT]

Trace File Analyzer

Trace File Analyzer (TFA) is arguably one of the most useful tools Oracle has released for the Oracle Database and Grid Infrastructure software when it comes to enabling DBAs to react quickly and analyze the many logfiles to which the Oracle software writes errors. Trace File Analyzer, along with the Support Tools Bundle, is extremely powerful as a troubleshooting and diagnosis tool. Most TFA features are somewhat outside the scope of this book, but some of the new and exciting features that can be extremely useful are touched upon in this chapter.

Upgrading TFA to Include the Support Tools Bundle

By default, TFA is upgraded with every quarterly update patch that comes out for Grid Infrastructure and the database software (12.2 and onward). However, the version of TFA that comes shipped with PSUs is generally three months behind what is available for download from Oracle Support. Furthermore, the version of TFA that comes with quarterly patches does not include the Support Tools Bundle, which has many useful features for diagnosing problems. The Support Tools Bundle can be downloaded from TFA Collector - TFA with Database Support Tools Bundle (Doc ID 1513912.1). While it is perfectly acceptable to use the TFA that is bundled with a standard Grid Infrastructure installation, you should download and install the Support Tools Bundle as shown in this chapter. This will upgrade your TFA with a multitude of new and useful features.

Upgrading TFA is extremely easy and consists of unzipping the download and running the executable script contained therein. Upgrading TFA will upgrade TFA for the entire cluster.
[oracle@stpldb101 ~]$ unzip TFA-LINUX_v18.3.0.zip
Archive:  TFA-LINUX_v18.3.0.zip
  inflating: README.txt
  inflating: installTFA-LINUX

TFA must be installed/patched as the root user. This is because certain TFA files are owned by root in a Grid Infrastructure software home and because init scripts need to be modified.

Some of the following output has been abridged as it is essentially the same for remote node operations:
[root@stpldb101 oracle]# ./installTFA-LINUX
TFA Installation Log will be written to File : /tmp/tfa_install_5857_2018_08_20-00_27_36.log
Starting TFA installation
TFA Version: 183000 Build Date: 201808081359
TFA HOME : /u01/app/12.2.0.1/grid/tfa/stpldb101/tfa_home
Installed Build Version: 181100 Build Date: 201803280250
TFA is already installed. Patching /u01/app/12.2.0.1/grid/tfa/stpldb101/tfa_home...
TFA patching typical install from zipfile is written to /u01/app/12.2.0.1/grid/tfa/stpldb101/tfapatch.log
TFA will be Patched on:
stpldb101
stpldb102
Do you want to continue with patching TFA? [Y|N] [Y]: Y
Checking for ssh equivalency in stpldb102
stpldb102 is configured for ssh user equivalency for root user
Using SSH to patch TFA to remote nodes :
Applying Patch on stpldb102:
TFA_HOME: /u01/app/12.2.0.1/grid/tfa/stpldb102/tfa_home
Stopping TFA Support Tools...
Shutting down TFA
oracle-tfa stop/waiting
. . . . .
Killing TFA running with pid 23847
. . .
Successfully shutdown TFA..
Copying files from stpldb101 to stpldb102...
Current version of Berkeley DB in stpldb102 is 5 or higher, so no DbPreUpgrade required
Running commands to fix init.tfa and tfactl in stpldb102...
Updating init.tfa in stpldb102...
Starting TFA in stpldb102...
Starting TFA..
oracle-tfa start/running, process 10467
Waiting up to 100 seconds for TFA to be started..
. . . . .
Successfully started TFA Process..
. . . . .
TFA Started and listening for commands
Enabling Access for Non-root Users on stpldb102...
.-----------------------------------------------------------------.
| Host      | TFA Version | TFA Build ID         | Upgrade Status |
+-----------+-------------+----------------------+----------------+
| stpldb101 |  18.3.0.0.0 | 18300020180808135947 | UPGRADED       |
| stpldb102 |  18.3.0.0.0 | 18300020180808135947 | UPGRADED       |
'-----------+-------------+----------------------+----------------'
cleanup serializable files

Using Trace File Analyzer to Manage Logfiles

Old logs need to be deleted once they are no longer relevant. Every DBA knows this, and most DBAs have scripts that help automate this task, as it quickly becomes burdensome the more databases and clusters a DBA has to manage. Up until 11g, this was a manual process that required custom shell scripts; in 11g, Oracle introduced the Automatic Diagnostic Repository (ADR). The latest versions of Trace File Analyzer are integrated with ADR, which allows for streamlined maintenance of database and Grid Infrastructure logs.

Analyzing Logfile Space Usage

TFA can be used to display the file system space used by database logs.
[oracle@stpldb101 ~]$ tfactl managelogs -show usage -database orclcdb
Output from host : stpldb101
------------------------------
.------------------------------------------------------------------.
|                       Database Homes Usage                       |
+------------------------------------------------------+-----------+
| Location                                             | Size      |
+------------------------------------------------------+-----------+
| /u01/app/oracle/diag/rdbms/orclcdb/orclcdb1/alert    | 1.81 MB   |
| /u01/app/oracle/diag/rdbms/orclcdb/orclcdb1/incident | 25.68 MB  |
| /u01/app/oracle/diag/rdbms/orclcdb/orclcdb1/trace    | 119.94 MB |
| /u01/app/oracle/diag/rdbms/orclcdb/orclcdb1/cdump    | 136.58 MB |
| /u01/app/oracle/diag/rdbms/orclcdb/orclcdb1/hm       | 4.00 KB   |
| /u01/app/oracle/diag/rdbms/orclcdb/orclcdb1/log      | 36.00 KB  |
+------------------------------------------------------+-----------+
| Total                                                | 284.05 MB |
'------------------------------------------------------+-----------'
Output from host : stpldb102
------------------------------
.------------------------------------------------------------------.
|                       Database Homes Usage                       |
+------------------------------------------------------+-----------+
| Location                                             | Size      |
+------------------------------------------------------+-----------+
| /u01/app/oracle/diag/rdbms/orclcdb/orclcdb2/alert    | 1.84 MB   |
| /u01/app/oracle/diag/rdbms/orclcdb/orclcdb2/incident | 7.05 MB   |
| /u01/app/oracle/diag/rdbms/orclcdb/orclcdb2/trace    | 82.30 MB  |
| /u01/app/oracle/diag/rdbms/orclcdb/orclcdb2/cdump    | 139.87 MB |
| /u01/app/oracle/diag/rdbms/orclcdb/orclcdb2/hm       | 4.00 KB   |
| /u01/app/oracle/diag/rdbms/orclcdb/orclcdb2/log      | 32.00 KB  |
+------------------------------------------------------+-----------+
| Total                                                | 231.09 MB |
'------------------------------------------------------+-----------'
More important, Trace File Analyzer can now be used to analyze the change, or variation, in space usage. In the following example, the Old Size shows as - because the database was newly created:
[oracle@stpldb101 ~]$ tfactl managelogs -show variation -older 1d -database orclcdb
Output from host : stpldb101
------------------------------
2018-08-20 00:42:46: INFO Checking space variation for 1 days
2018-08-20 00:42:46: INFO Space is calculated in bytes [without round off]
.-----------------------------------------------------------------------------.
|                           Database Homes Variation                          |
+------------------------------------------------------+----------+-----------+
| Directory                                            | Old Size | New Size  |
+------------------------------------------------------+----------+-----------+
| /u01/app/oracle/diag/rdbms/orclcdb/orclcdb1/cdump    | -        | 136.58 MB |
+------------------------------------------------------+----------+-----------+
| /u01/app/oracle/diag/rdbms/orclcdb/orclcdb1/alert    | -        | 1.81 MB   |
+------------------------------------------------------+----------+-----------+
| /u01/app/oracle/diag/rdbms/orclcdb/orclcdb1/incident | -        | 25.68 MB  |
+------------------------------------------------------+----------+-----------+
| /u01/app/oracle/diag/rdbms/orclcdb/orclcdb1/log      | -        | 36.00 KB  |
+------------------------------------------------------+----------+-----------+
| /u01/app/oracle/diag/rdbms/orclcdb/orclcdb1/trace    | -        | 119.99 MB |
+------------------------------------------------------+----------+-----------+
| /u01/app/oracle/diag/rdbms/orclcdb/orclcdb1/hm       | -        | 4.00 KB   |
'------------------------------------------------------+----------+-----------'
Output from host : stpldb102
------------------------------
2018-08-20 00:42:54: INFO Checking space variation for 1 days
2018-08-20 00:42:54: INFO Space is calculated in bytes [without round off]
.-----------------------------------------------------------------------------.
|                           Database Homes Variation                          |
+------------------------------------------------------+----------+-----------+
| Directory                                            | Old Size | New Size  |
+------------------------------------------------------+----------+-----------+
| /u01/app/oracle/diag/rdbms/orclcdb/orclcdb2/log      | -        | 32.00 KB  |
+------------------------------------------------------+----------+-----------+
| /u01/app/oracle/diag/rdbms/orclcdb/orclcdb2/alert    | -        | 1.84 MB   |
+------------------------------------------------------+----------+-----------+
| /u01/app/oracle/diag/rdbms/orclcdb/orclcdb2/cdump    | -        | 139.87 MB |
+------------------------------------------------------+----------+-----------+
| /u01/app/oracle/diag/rdbms/orclcdb/orclcdb2/incident | -        | 7.05 MB   |
+------------------------------------------------------+----------+-----------+
| /u01/app/oracle/diag/rdbms/orclcdb/orclcdb2/hm       | -        | 4.00 KB   |
+------------------------------------------------------+----------+-----------+
| /u01/app/oracle/diag/rdbms/orclcdb/orclcdb2/trace    | -        | 82.35 MB  |
'------------------------------------------------------+----------+-----------'

Purging Database Logfiles

By default, TFA will purge files that are older than 30 days if the -older flag is not specified. In the following example, all logs are purged because of the -older 0d setting (all files older than zero days):
[oracle@stpldb101 ~]$ tfactl managelogs -purge -older 0d -database orclcdb
Output from host : stpldb101
------------------------------
2018-08-20 01:00:59: INFO Purging files older than 0 days
2018-08-20 01:00:59: INFO Space is calculated in bytes [without round off]
2018-08-20 01:00:59: INFO Cleaning Database Home destinations
2018-08-20 01:01:04: INFO Purging diagnostic destination "diag/rdbms/orclcdb/orclcdb1" for files - 2598 files deleted , 276.98 MB freed
2018-08-20 01:01:04: MESSAGE Database Home : /u01/app/oracle/product/12.2.0.1/dbhome_1 [ Files deleted : 2598 files | Space Freed : 276.98 MB ]
.-------------------------------------------------------------------.
| File System Variation : /u01/app/oracle/product/12.2.0.1/dbhome_1 |
+--------+------+----------+----------+----------+----------+-------+
| State  | Name | Size     | Used     | Free     | Capacity | Mount |
+--------+------+----------+----------+----------+----------+-------+
| Before |      | 96888540 | 76662696 | 15721604 |      83% | /     |
| After  |      | 96888540 | 76373716 | 16010584 |      83% | /     |
'--------+------+----------+----------+----------+----------+-------'
Output from host : stpldb102
------------------------------
2018-08-20 01:01:06: INFO Purging files older than 0 days
2018-08-20 01:01:06: INFO Space is calculated in bytes [without round off]
2018-08-20 01:01:06: INFO Cleaning Database Home destinations
2018-08-20 01:01:08: INFO Purging diagnostic destination "diag/rdbms/orclcdb/orclcdb2" for files - 1513 files deleted , 226.99 MB freed
2018-08-20 01:01:08: MESSAGE Database Home : /u01/app/oracle/product/12.2.0.1/dbhome_1 [ Files deleted : 1513 files | Space Freed : 226.99 MB ]
.-------------------------------------------------------------------.
| File System Variation : /u01/app/oracle/product/12.2.0.1/dbhome_1 |
+--------+------+----------+----------+----------+----------+-------+
| State  | Name | Size     | Used     | Free     | Capacity | Mount |
+--------+------+----------+----------+----------+----------+-------+
| Before |      | 96888540 | 70290896 | 22093400 |      77% | /     |
| After  |      | 96888540 | 70055416 | 22328880 |      76% | /     |
'--------+------+----------+----------+----------+----------+-------'
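To keep the diagnostic destinations trimmed without manual intervention, the purge command can simply be scheduled. The following crontab entry is a hedged sketch; the schedule, retention, and log path are assumptions for illustration:
# Hypothetical oracle crontab entry: purge database logs older than 30 days
# every Sunday at 02:00 and keep a record of what was removed.
0 2 * * 0 /u01/app/12.2.0.1/grid/bin/tfactl managelogs -purge -older 30d -database orclcdb >> /home/oracle/tfactl_purge.log 2>&1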

TFA as a Health Check Tool

Trace File Analyzer can be used to analyze logfiles on several components over a range of time for errors or warnings. This can be extremely useful as an ongoing monitoring solution for large clusters.

To show the power of this type of analysis, this example shows how all of the logfiles managed by TFA can be quickly analyzed to look for a particular type of error. In this case, the logs are searched for any mention of swapping to check for memory pressure conditions on a particular host. All the necessary logs were analyzed within two seconds, and a report was output to the screen. Furthermore, tfactl changes its return code depending on the findings, which lends itself to a tfactl-centered shell script as a monitoring solution (a sketch follows the output below).
[oracle@stpldb101 ~]$ tfactl analyze -comp all -last 1d -type error -search "swapping" -node local
INFO: analyzing all (Alert and Unix System Logs) logs for the last 1440 minutes...  Please wait...
INFO: analyzing host: stpldb101
                    Report title: Analysis of Alert,System Logs
               Report date range: last ~1 day(s)
      Report (default) time zone: EST - Eastern Standard Time
             Analysis started at: 20-Aug-2018 01:13:51 AM EDT
           Elapsed analysis time: 2 second(s).
              Configuration file: /u01/app/12.2.0.1/grid/tfa/stpldb101/tfa_home/ext/tnt/conf/tnt.prop
             Configuration group: all
                       Parameter: swapping
             Total message count:       18,695, from 02-Jul-2018 07:42:39 PM EDT to 20-Aug-2018 01:13:44 AM EDT
Messages matching last ~1 day(s):       9,172, from 19-Aug-2018 01:14:00 AM EDT to 20-Aug-2018 01:13:44 AM EDT
                  Matching regex: swapping
                  Case sensitive: false
                     Match count: 5
[Source: /u01/app/oracle/diag/rdbms/orclcdb/orclcdb1/trace/alert_orclcdb1.log, Line: 4847]
Aug 19 02:01:28 2018
WARNING: Heavy swapping observed on system in last 5 mins.
Heavy swapping can lead to timeouts, poor performance, and instance eviction.
Errors in file /u01/app/oracle/diag/rdbms/orclcdb/orclcdb1/trace/orclcdb1_dbrm_31124.trc  (incident=72122) (PDBNAME=CDB$ROOT):
ORA-00700: soft internal error, arguments: [kskvmstatact: excessive swapping observed], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/orclcdb/orclcdb1/incident/incdir_72122/orclcdb1_dbrm_31124_i72122.trc
[Source: /u01/app/oracle/diag/rdbms/_mgmtdb/-MGMTDB/trace/alert_-MGMTDB.log, Line: 6135]
Aug 19 02:02:35 2018
WARNING: Heavy swapping observed on system in last 5 mins.
Heavy swapping can lead to timeouts, poor performance, and instance eviction.
Errors in file /u01/app/oracle/diag/rdbms/_mgmtdb/-MGMTDB/trace/-MGMTDB_dbrm_14223.trc  (incident=44115) (PDBNAME=CDB$ROOT):
ORA-00700: soft internal error, arguments: [kskvmstatact: excessive swapping observed], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/_mgmtdb/-MGMTDB/incident/incdir_44115/-MGMTDB_dbrm_14223_i44115.trc
[Source: /u01/app/oracle/diag/rdbms/orclcdb/orclcdb1/trace/alert_orclcdb1.log, Line: 4946]
Aug 19 04:56:44 2018
WARNING: Heavy swapping observed on system in last 5 mins.
Heavy swapping can lead to timeouts, poor performance, and instance eviction.
Errors in file /u01/app/oracle/diag/rdbms/orclcdb/orclcdb1/trace/orclcdb1_dbrm_31124.trc  (incident=72123) (PDBNAME=CDB$ROOT):
ORA-00700: soft internal error, arguments: [kskvmstatact: excessive swapping observed], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/orclcdb/orclcdb1/incident/incdir_72123/orclcdb1_dbrm_31124_i72123.trc
[Source: /u01/app/oracle/diag/rdbms/orclcdb/orclcdb1/trace/alert_orclcdb1.log, Line: 4979]
Aug 19 06:08:03 2018
WARNING: Heavy swapping observed on system in last 5 mins.
Heavy swapping can lead to timeouts, poor performance, and instance eviction.
[Source: /u01/app/oracle/diag/rdbms/orclcdb/orclcdb1/trace/alert_orclcdb1.log, Line: 4982]
Aug 19 06:08:06 2018
Errors in file /u01/app/oracle/diag/rdbms/orclcdb/orclcdb1/trace/orclcdb1_dbrm_31124.trc  (incident=72124) (PDBNAME=CDB$ROOT):
ORA-00700: soft internal error, arguments: [kskvmstatact: excessive swapping observed], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/orclcdb/orclcdb1/incident/incdir_72124/orclcdb1_dbrm_31124_i72124.trc
[oracle@stpldb101 ~]$ echo $?
1
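The following is a minimal sketch of that kind of monitoring wrapper; the search patterns, recipient address, and the availability of a local mail command are assumptions for illustration:
#!/bin/bash
# Hypothetical monitoring sketch: tfactl analyze exits nonzero when matches
# are found, so mail the report for each pattern that produces a hit.
PATTERNS="swapping ORA-00600 ORA-04031"
for PATTERN in $PATTERNS; do
    REPORT=/tmp/tfa_analyze_${PATTERN}.txt
    if ! tfactl analyze -comp all -last 1d -type error -search "$PATTERN" -node local > "$REPORT" 2>&1; then
        mail -s "TFA found '$PATTERN' in the last day on $(hostname)" [email protected] < "$REPORT"
    fi
done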

New Health Check and Troubleshooting Features in 12.2

Oracle introduced a new health check and troubleshooting feature in 12.2 that can aid in identifying and resolving issues. Using this new feature in conjunction with the rest of the features outlined in this chapter should help to prepare a DBA for even the most vexing of performance issues.

Cluster Health Advisor

The Cluster Health Advisor is available via the chactl command and can be used to run diagnostics at the cluster level or, if -cluster is replaced with -db <DB_UNIQUE_NAME>, at the database level.
[oracle@stpldb101 ~]$ chactl query diagnosis -cluster -start "2018-08-20 00:00:00" -end "2018-08-20 08:00:00"
2018-08-20 00:01:40.0  Host stpldb102  Host Memory Consumption [detected]
2018-08-20 00:01:50.0  Host stpldb101  Host Memory Consumption [detected]
2018-08-20 00:02:50.0  Host stpldb101  Host CPU Utilization [detected]
2018-08-20 00:04:35.0  Host stpldb101  Host Memory Swapping [detected]
2018-08-20 00:04:45.0  Host stpldb101  Host CPU Utilization [cleared]
2018-08-20 00:09:15.0  Host stpldb101  Host Memory Swapping [cleared]
2018-08-20 00:09:25.0  Host stpldb102  Host CPU Utilization [detected]
2018-08-20 00:16:10.0  Host stpldb102  Host CPU Utilization [cleared]
2018-08-20 00:16:10.0  Host stpldb102  Host Memory Consumption [cleared]
2018-08-20 00:18:20.0  Host stpldb102  Host Memory Consumption [detected]
2018-08-20 00:19:20.0  Host stpldb102  Host CPU Utilization [detected]
2018-08-20 00:27:05.0  Host stpldb102  Host CPU Utilization [cleared]
2018-08-20 00:30:20.0  Host stpldb102  Host CPU Utilization [detected]
2018-08-20 00:31:30.0  Host stpldb101  Host CPU Utilization [detected]
2018-08-20 00:33:50.0  Host stpldb101  Host CPU Utilization [cleared]
2018-08-20 00:34:50.0  Host stpldb102  Host CPU Utilization [cleared]
2018-08-20 02:01:05.0  Host stpldb102  Host CPU Utilization [detected]
2018-08-20 02:05:40.0  Host stpldb102  Host CPU Utilization [cleared]
2018-08-20 02:31:40.0  Host stpldb102  Host CPU Utilization [detected]
2018-08-20 02:56:55.0  Host stpldb102  Host CPU Utilization [cleared]
2018-08-20 04:00:55.0  Host stpldb102  Host CPU Utilization [detected]
2018-08-20 04:05:40.0  Host stpldb102  Host CPU Utilization [cleared]
2018-08-20 06:01:20.0  Host stpldb102  Host CPU Utilization [detected]
2018-08-20 06:05:45.0  Host stpldb102  Host CPU Utilization [cleared]
Problem: Host Memory Consumption
Description: CHA detected that more memory than expected is consumed on this server. The memory is not allocated by sessions of this database.
Cause: The Cluster Health Advisor (CHA) detected an increase in memory consumption by other databases or by applications not connected to a database on this node.
Action: Identify the top memory consumers by using the Cluster Health Monitor (CHM).
Problem: Host CPU Utilization
Description: CHA detected larger than expected CPU utilization on this node. The available CPU resource may not be sufficient to support application failover or relocation of databases to this node.
Cause: The Cluster Health Advisor (CHA) detected an unexpected increase in CPU utilization by databases or applications on this node.
Action: Identify CPU intensive processes and databases by reviewing Cluster Health Monitoring (CHM) data. Relocate databases to less busy machines, or limit the number of connections to databases on this node. Add nodes if more resources are required.
Problem: Host Memory Swapping
Description: CHA detected that there is not enough free memory on this server and memory swapping has started. The performance of databases on other nodes in this cluster will be affected severely. An instance eviction is likely if swapping continues.
Cause: The Cluster Health Advisor (CHA) detected that there was no more free memory because the memory requirements of processes increased.
Action: Stop database services on this server, or shut down less important database instances. Redirect the applications to the available healthy servers. Check the CHA diagnostics for the health of the other servers and databases. Add servers if more memory is required. Ensure that Huge Pages are correctly configured and used.
The Cluster Health Advisor uses data stored in the MGMTDB, the Grid Infrastructure Management Repository (GIMR). The amount of space used by the repository is directly correlated with the maximum retention (up to 168 hours) and the number of entities/targets (up to 512) being tracked by the Cluster Health Advisor.
[oracle@stpldb101 ~]$ chactl query repository
specified max retention time(hrs): 72
available retention time(hrs)    : 617
available number of entities     : 17
allocated number of entities     : 2
total repository size(gb)        : 15.00
allocated repository size(gb)    : 0.26
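If longer retention or more monitored entities are needed, the repository can be resized with chactl. The following is a hedged sketch based on the documented 12.2 syntax; the values are illustrative, and because resizing can grow the MGMTDB, it is worth running chactl calc size first:
[oracle@stpldb101 ~]$ chactl calc size -entities 8
[oracle@stpldb101 ~]$ chactl resize repository -entities 8
[oracle@stpldb101 ~]$ chactl set maxretention -time 168
[oracle@stpldb101 ~]$ chactl query repository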

Summary

This chapter covered many new and old troubleshooting and health check topics in the hope of raising your awareness of the options available to DBAs, especially the new features of TFA and Orachk and the latest capabilities in the 12.2 GI and RDBMS software.
