© Y V Ravi Kumar, Nassyam Basha, Krishna Kumar K M, Bal Mukund Sharma, Konstantin Kerekovski 2019
YV Ravi Kumar, Nassyam Basha, Krishna Kumar K M, Bal Mukund Sharma and Konstantin Kerekovski, Oracle High Availability, Disaster Recovery, and Cloud Services, https://doi.org/10.1007/978-1-4842-4351-0_5

5. Troubleshooting and Conducting Health Checks for High Availability

Y V Ravi Kumar (1), Nassyam Basha (2), Krishna Kumar K M (3), Bal Mukund Sharma (4) and Konstantin Kerekovski (5)
(1)
Chennai, Tamil Nadu, India
(2)
Riyadh, Saudi Arabia
(3)
Bangalore, Karnataka, India
(4)
Belmont, CA, USA
(5)
Pinellas Park, FL, USA
 

The goal of this chapter is to introduce you to the various health check utilities, both new in 12.2 and from previous versions, that can greatly ease management and troubleshooting efforts for High Availability and DR solutions. This chapter will focus mostly on features that are not necessarily Exadata specific but that can and should be used in Exadata and other engineered environments. The chapter will highlight important and often hidden aspects of the different features and software that can be used for health checks as well as cover a few of the new features available in the 12.2 GI and RDBMS software.

Cluster Verification Utility

The Cluster Verification Utility (CVU) is an extremely important component of the Grid Infrastructure installation. This tool can be used to perform a variety of health checks on an Oracle Real Application Clusters (RAC) environment, regardless of whether the cluster runs on commodity hardware or an engineered system. The tool is essential during the installation of Grid Infrastructure as well as in the Database Configuration Assistant (DBCA). However, it also has several great and often underutilized one-off use cases, which this chapter will go over. The Cluster Verification Utility is available in $GRID_HOME, and you will need to set your environment to +ASM to launch CVU successfully.

Software Home Checks

Sometimes, for various reasons, a DBA may come to question the integrity of a software home, such as the RDBMS installation home or the Grid Infrastructure home itself, for example, after patching and having to roll back certain patches or after a particularly problem-ridden Grid Infrastructure upgrade or patch attempt that required manual intervention. It is always good to be able to say that the file permissions and user/group ownerships of all files within an Oracle software installation are correct and as they should be.

The following is an example of a call to cluvfy that points out a file permissions issue and specifies what the expected and observed file permissions are for the files in question to help resolve the situation:
[oracle@stpldb101 ~]$ cluvfy comp software -n all -d /u01/app/12.2.0.1/grid/ -allfiles -r 12.2
Verifying Software home: /u01/app/12.2.0.1/grid/ ...FAILED (PRVG-2033)
Verification of software was unsuccessful on all the specified nodes.
Failures were encountered during execution of CVU verification request "software".
Verifying Software home: /u01/app/12.2.0.1/grid/ ...FAILED
stpldb102: PRVG-2033 : Permissions of file
           "/u01/app/12.2.0.1/grid//bin/kfod.bin" did not match the expected
           octal value on node "stpldb102". [Expected = "751" ; Found = "0755"]
stpldb102: PRVG-2033 : Permissions of file
           "/u01/app/12.2.0.1/grid//crs/install/dropdb.pl" did not match the
           expected octal value on node "stpldb102". [Expected = "700" ; Found
           = "0755"]
stpldb102: PRVG-2033 : Permissions of file
           "/u01/app/12.2.0.1/grid//lib/acfstoolsdriver.sh" did not match the
           expected octal value on node "stpldb102". [Expected = "750" ; Found
           = "0755"]
stpldb101: PRVG-2033 : Permissions of file
           "/u01/app/12.2.0.1/grid//bin/kfod.bin" did not match the expected
           octal value on node "stpldb101". [Expected = "751" ; Found = "0755"]
stpldb101: PRVG-2033 : Permissions of file
           "/u01/app/12.2.0.1/grid//crs/install/dropdb.pl" did not match the
           expected octal value on node "stpldb101". [Expected = "700" ; Found
           = "0755"]
stpldb101: PRVG-2033 : Permissions of file
           "/u01/app/12.2.0.1/grid//lib/acfstoolsdriver.sh" did not match the
           expected octal value on node "stpldb101". [Expected = "750" ; Found
           = "0755"]
CVU operation performed:      software
Date:                         Aug 19, 2017 12:23:27 PM
CVU home:                     /u01/app/12.2.0.1/grid/
User:                         oracle
[oracle@stpldb101 ~]$ echo $?
1
The following is an example of running cluvfy against a 12.2 RDBMS home, which came back as not having any issues:
[oracle@stpldb101 ~]$ cluvfy comp software -n all -d /u01/app/oracle/product/12.2.0.1/dbhome_1 -allfiles -r 12.2
Verifying Software home: /u01/app/oracle/product/12.2.0.1/dbhome_1 ...PASSED
Verification of software was successful.
CVU operation performed:      software
Date:                         Aug 19, 2017 12:26:29 PM
CVU home:                     /u01/app/12.2.0.1/grid/
User:                         oracle
[oracle@stpldb101 ~]$ echo $?
0

One of the great things about the Cluster Verification Utility is that the return code of many of the commands signifies whether an issue was encountered. Because of this, it is actually easy to set up shell scripts that run cluvfy comp software and then email an administrator if an issue is found so that it can be investigated.
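The following is a minimal sketch of such a wrapper, suitable for cron. The mail recipient and the assumption that cluvfy and a local mail command are available in the environment are illustrative and not part of the original example:
#!/bin/bash
# Hypothetical wrapper: run the software home check, keep the report,
# and mail it to an administrator only when cluvfy exits nonzero.
REPORT=/tmp/cluvfy_software_$(date +%Y%m%d).log
if ! cluvfy comp software -n all -d /u01/app/12.2.0.1/grid/ -allfiles -r 12.2 > "$REPORT" 2>&1; then
    mail -s "cluvfy software check failed on cluster stpldb01" [email protected] < "$REPORT"
fi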

File System Space Checks

Database administrators must keep an eye on the free space available on the file systems that contain the Grid Infrastructure and Oracle Database software homes because of logging configurations and other environment-specific reasons. For this reason, every database administrator likely has their own custom script that checks directories or df output for space usage. However, the Cluster Verification Utility can greatly simplify these scripts, provided Grid Infrastructure is installed.

Grid Infrastructure–Only Check

The following is an example of the command used to check the free space available on the file system that contains Grid Infrastructure:
[oracle@stpldb101 grid]$ cluvfy comp freespace -n all
Verifying '/u01/app/12.2.0.1/grid' ...FAILED (PRVG-11104)
Verification of Free Space was unsuccessful.
Checks did not pass for the following nodes:
        stpldb101
Failures were encountered during execution of CVU verification request "Free Space".
Verifying '/u01/app/12.2.0.1/grid' ...FAILED
stpldb101: PRVG-11104 : The disk free space for file system path
           "/u01/app/12.2.0.1/grid" on node "stpldb101" is below "5.0" percent
           of total disk space. The required free space is "4960.700000000001
           MB", the available free space is "1748.0 MB" and the total disk size
           is "99214.0 MB".
CVU operation performed:      Free Space
Date:                         Aug 19, 2017 12:41:43 PM
CVU home:                     /u01/app/12.2.0.1/grid/
User:                         oracle
[oracle@stpldb101 grid]$ echo $?
0

While cluvfy comp freespace is handy, it is a bit limited in what it can do. Still, it has its uses as a quick check of free space in the Grid Infrastructure home. It checks only whether the file system has at least 5 percent free space, and it checks only the Grid Infrastructure home. Moreover, the command's return code does not indicate whether the space check failed, which can complicate scripting; a possible workaround is sketched below.
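Since the exit status cannot be relied upon here, a wrapper has to inspect the command output itself. The following is a minimal sketch under that assumption; the FAILED keyword comes from the sample output shown earlier and may vary between versions:
#!/bin/bash
# Hypothetical workaround: cluvfy comp freespace exits 0 even on failure,
# so scan its output for the FAILED keyword instead of the return code.
OUT=$(cluvfy comp freespace -n all 2>&1)
if echo "$OUT" | grep -q "FAILED"; then
    echo "$OUT"
    exit 1
fi
exit 0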

Generic Space Checks

There exists another type of file system free space check within the Cluster Verification Utility, and that is the one provided by cluvfy comp space. This type of verification is much more robust and comes with the added benefit of return codes that change depending on whether the check fails or succeeds.

To illustrate the utility of this particular verification, the output of a check on the Oracle user’s home ($HOME) is shown here:
[oracle@stpldb101 ~]$ cluvfy comp space -n all -l /home/oracle/ -z 20G -verbose
Verifying Available File System Space ...
  Node Name     Available                 Required                  Status
  ------------  ------------------------  ------------------------  -------
  stpldb102     28.1875GB (2.9556736E7KB)  20GB (2.097152E7KB)       passed
  stpldb101     11.5938GB (1.2156928E7KB)  20GB (2.097152E7KB)       failed
Verifying Available File System Space ...FAILED
Verification of space availability was unsuccessful.
Checks did not pass for the following nodes:
        stpldb101
Failures were encountered during execution of CVU verification request "space availability".
Verifying Available File System Space ...FAILED
stpldb101: Sufficient space is not available at location "/home/oracle" on node
"stpldb101" [Required space = 20GB (2.097152E7KB); available space = 11.5938GB
(1.2156928E7KB)]
CVU operation performed:      space availability
Date:                         Aug 19, 2017 4:14:49 PM
CVU home:                     /u01/app/12.2.0.1/grid/
User:                         oracle
[oracle@stpldb101 ~]$ echo $?
1
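Because the exit status of cluvfy comp space is reliable, it is straightforward to sweep several file systems in one pass. The following sketch is illustrative only; the locations and thresholds are assumptions, not values from the original example:
#!/bin/bash
# Hypothetical sweep: check a list of locations with cluvfy comp space and
# report any that fall below the requested amount of free space.
RC=0
while read -r LOCATION THRESHOLD; do
    if ! cluvfy comp space -n all -l "$LOCATION" -z "$THRESHOLD" > /dev/null 2>&1; then
        echo "Space check failed for $LOCATION (wanted $THRESHOLD free)"
        RC=1
    fi
done <<EOF
/home/oracle 20G
/u01 10G
/tmp 2G
EOF
exit $RC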

Cluster Verification Utility Health Checks

The Cluster Verification Utility can be used to perform health checks on three major components: databases, ASM, and clusters. We will go in depth on the health checks available for all three.

Database Health Checks

This type of check cannot be run without first performing some setup. Because CVU needs to connect to a database to run health checks and check for best-practice adherence, credentials need to be stored. Oracle uses a secure external password store, also known as a wallet. The wallet needs to be set up for health checks to work properly for the database component. This section goes over how to create such a wallet and how to run a health check using CVU.

CVU Wallets

Unlike the stand-alone Oracle connection wallets that can be used by regular clients, the wallet used by the Cluster Verification Utility is managed by the Grid Infrastructure software and is created, modified, and deleted via crsctl commands.

A wallet can be created by issuing the following command as the Grid Infrastructure owner. This will add a wallet entry for the user DBSNMP to the database orclcdb.
[oracle@stpldb101 admin]$ crsctl add wallet -name orclcdb -type CVUDB -user DBSNMP -passwd
Enter Password:
In container databases, the DBSNMP user exists in the CDB$ROOT container. Ensure that the user is unlocked and has the same password that was placed into the wallet. For example, the password database_123 is being used here; it is highly recommended to use secure passwords in a production environment.
SQL> alter user dbsnmp account unlock;
User altered.
SQL> alter user dbsnmp identified by "database_123";
User altered.
SQL> show con_name
CON_NAME
------------------------------
CDB$ROOT
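If the DBSNMP password is later rotated, the stored entry can be inspected and refreshed in place. The following is a hedged sketch assuming the crsctl wallet syntax mirrors the add command shown earlier; check crsctl modify wallet -h on your version before relying on it:
[oracle@stpldb101 admin]$ crsctl query wallet -type CVUDB -name orclcdb -all
[oracle@stpldb101 admin]$ crsctl modify wallet -type CVUDB -name orclcdb -user DBSNMP -passwd
Enter Password: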

Running the Health Check

The following is the expected output when the health check is run. This type of call to cluvfy saves the output in a text file and outputs only the deviations from best practices, per the -deviations flag. If you require HTML output, you can use the -html flag. It is also important to note that cluvfy does not return a usable return code for deciding whether deviations exist.
[oracle@stpldb101 admin]$ cluvfy comp healthcheck -collect database -db orclcdb -bestpractice -deviations -save -savedir /home/oracle/bestpractices
Verifying best practice for database "orclcdb" ...
***************************************************************************
        Summary of environment
***************************************************************************
Date (mm/dd/yyyy)    :  08/19/2018
Time (hh:mm:ss)      :  16:45:11
Cluster name         :  stpldb01
Clusterware version  :  12.2.0.1.0
Grid home            :  /u01/app/12.2.0.1/grid
Grid User            :  oracle
Operating system     :  Linux4.1.12-124.15.2.el6uek.x86_64
Database1            :  Database name     -  orclcdb
                        Database version  -  12.2.0.1.0
                        Database home     -
                        /u01/app/oracle/product/12.
                        2.0.1/dbhome_1
***************************************************************************
Database recommendation checks for "orclcdb"
***************************************************************************
Verification Check        :  Java Role Count
Verification Description  :  Checks JVM roles
Verification Result       :  NOT MET
Verification Summary      :  Check for Java Role Count passed
Additional Details        :  A healthy JVM should contain six roles. If there are more or less than six roles, the JVM is inconsistent.
Verification Check        :  INVALID objects in the application related schemas
Verification Description  :  Checks for the presence of INVALID objects in the
                             application related schemas (non SYS and SYSTEM)
Verification Result       :  NOT MET
Verification Summary      :  Check for INVALID objects in the application related schemas passed
Additional Details        :  Investigate invalid objects in the application related schemas (non SYS and SYSTEM) .
Verification Check        :  Alert log messages indicating internal errors ORA-07445
                             errors
Verification Description  :  Checks for ORA-07445 errors in alert log
Verification Result       :  NOT MET
Verification Summary      :  Check for Alert log messages indicating internal errors ORA-07445 errors passed
Additional Details        :  Recurring ORA-07445 errors may lead to database block corruption or some serious issue. See the trace file for more information next to ORA-07445 error in alert log. If the problem persists, contact Oracle Support Services.
Verification Check        :  FILESYSTEMIO_OPTIONS
Verification Description  :  Checks FILESYSTEMIO_OPTIONS parameter
Verification Result       :  NOT MET
Verification Summary      :  Check for FILESYSTEMIO_OPTIONS passed
Additional Details        :  FILESYSTEMIO_OPTIONS=setall supports both direct I/O and asynchronus I/O which helps to achieve optimal performance with database data files
Complete report of this execution is saved in file "/home/oracle/bestpractices/cvucheckreport_08192017164511.txt"
Verification of Health Check was successful.
Warnings were encountered during execution of CVU verification request "Health Check".
Verifying Java Role Count ...WARNING
orclcdb: PRVE-2623 : The JVM roles appear to be inconsistent
Verifying INVALID objects in the application related schemas ...WARNING
orclcdb: PRVE-2563 : Application objects were found to be invalid
Verifying Alert log messages indicating internal errors ORA-07445 errors
...WARNING
orclcdb(orclcdb1): PRVE-2923 : ORA-07445 errors found in the alert log in alert
                   log destination
                   "/u01/app/oracle/diag/rdbms/orclcdb/orclcdb1/alert/" on node
                   "stpldb101.localdomain".
orclcdb(orclcdb2): PRVE-2923 : ORA-07445 errors found in the alert log in alert
                   log destination
                   "/u01/app/oracle/diag/rdbms/orclcdb/orclcdb2/alert/" on node
                   "stpldb102.localdomain".
Verifying FILESYSTEMIO_OPTIONS ...WARNING
orclcdb: PRVE-10138 : FILESYSTEMIO_OPTIONS is not set to the recommended value
         of setall
CVU operation performed:      Health Check
Date:                         Aug 19, 2018 4:44:59 PM
CVU home:                     /u01/app/12.2.0.1/grid/
User:                         oracle
A set of “mandatory” checks can also be done. If you need to save them, you can save them to a directory using the -save and -savedir flags. The following is an example of the type of checks that can be run:
[oracle@stpldb101 ~]$ cluvfy comp healthcheck -collect database -db orclcdb -mandatory
Verifying mandatory requirements for database "orclcdb"
Verifying Easy Connect configuration:orclcdb ...passed
Verifying AUDIT_FILE_DEST location check ...met
Verifying Database operating system groups consistency check ...met
Verifying DB Initialization Parameters ...met
Verifying ACFS device special file ...met
Verifying /dev/shm mounted as temporary file system ...met
Verifying disk free space for Oracle Clusterware home "/u01/app/12.2.0.1/grid"...passed
***************************************************************************
        Summary of environment
***************************************************************************
Date (mm/dd/yyyy)    :  08/19/2018
Time (hh:mm:ss)      :  16:54:57
Cluster name         :  stpldb01
Clusterware version  :  12.2.0.1.0
Grid home            :  /u01/app/12.2.0.1/grid
Grid User            :  oracle
Operating system     :  Linux4.1.12-124.15.2.el6uek.x86_64
Database1            :  Database name     -  orclcdb
                        Database version  -  12.2.0.1.0
                        Database home     -
                        /u01/app/oracle/product/12.
                        2.0.1/dbhome_1
***************************************************************************
Database requirement checks for "orclcdb"
***************************************************************************
Verification Check        :  Easy Connect configuration:orclcdb
Verification Description  :  This check ensures that the Easy Connect is configured as an Oracle Net name resolution method
Verification Result       :  PASSED
Node                Status    Expected Value                Actual Value
---------------------------------------------------------------------------
stpldb102           PASSED    not applicable                not applicable
stpldb101           PASSED    not applicable                not applicable
___________________________________________________________________________
Verification Check        :  AUDIT_FILE_DEST location check
Verification Description  :  This task checks that AUDIT_FILE_DEST parameter designates a ACFS storage location that is not shared by any other instance of the database
Verification Result       :  MET
Node                Status    Expected Value                Actual Value
---------------------------------------------------------------------------
orclcdb             PASSED    not applicable                not applicable
___________________________________________________________________________
Verification Check        :  Database operating system groups consistency check
Verification Description  :  This task verifies consistency of database operating
                            system groups across nodes.
Verification Result      :  MET
Database(Instance)  Status    Expected Value                Actual Value
---------------------------------------------------------------------------
orclcdb             PASSED    not applicable                not applicable
___________________________________________________________________________
Verification Check        :  DB Initialization Parameters
Verification Description  :  This task collects DB Initialization Parameters
Verification Result       :  MET
Database(Instance)  Status    Expected Value                Actual Value
---------------------------------------------------------------------------
orclcdb2            Unknown   not applicable                not applicable
orclcdb1            Unknown   not applicable                not applicable
orclcdb             PASSED    not applicable                not applicable
___________________________________________________________________________
Verification Check        :  ACFS device special file
Verification Description  :  checks the attributes for the ACFS device special file
Verification Result       :  MET
Verification Summary      :  Check for ACFS device special file passed
Additional Details        :  The device special file attributes for '/dev/ofsctl' must be set correctly in order for the ASM instance to function correctly.
Node                Status    Expected Value                Actual Value
---------------------------------------------------------------------------
stpldb102           PASSED    not applicable                not applicable
stpldb101           PASSED    not applicable                not applicable
___________________________________________________________________________
Verification Check        :  /dev/shm mounted as temporary file system
Verification Description  :  Checks whether /dev/shm is mounted correctly as temporary file system
Verification Result       :  MET
Verification Summary      :  Check for /dev/shm mounted as temporary file system passed
Node                Status    Expected Value                Actual Value
---------------------------------------------------------------------------
stpldb102           PASSED    true                          true
stpldb101           PASSED    true                          true
___________________________________________________________________________
Verification of Health Check was unsuccessful.
Checks did not pass for the following database instances:
        orclcdb2,orclcdb1
CVU operation performed:      Health Check
Date:                         Aug 19, 2018 4:54:46 PM
CVU home:                     /u01/app/12.2.0.1/grid/
User:                         oracle

Cluster Health Checks

The Cluster Verification Utility can also perform ASM- and OS-level health checks. The -collect cluster flag will indicate that both ASM- and OS-level checks need to be done, although -collect asm can be used to gather only ASM-level information.

For example, the following health check found both that there are ASM disks that do not belong to any ASM disk group (i.e., wasted space) and that jumbo frames are not configured for the cluster interconnects, an OS-level check.
[oracle@stpldb101 ~]$ cluvfy comp healthcheck -collect cluster -bestpractice -deviations
Verifying OS Best Practice
Verifying Ethernet Jumbo Frames ...not met
Verifying Clusterware Best Practice
Verifying ASM best practices
***************************************************************************
        Summary of environment
***************************************************************************
Date (mm/dd/yyyy)    :  08/19/2018
Time (hh:mm:ss)      :  17:00:37
Cluster name         :  stpldb01
Clusterware version  :  12.2.0.1.0
Grid home            :  /u01/app/12.2.0.1/grid
Grid User            :  oracle
Operating system     :  Linux4.1.12-124.15.2.el6uek.x86_64
***************************************************************************
System recommendations
***************************************************************************
Verification Check        :  Ethernet Jumbo Frames
Verification Description  :  Checks if Jumbo Frames are configured on the system
Verification Result       :  NOT MET
Verification Summary      :  Check for Ethernet Jumbo Frames failed
Additional Details        :  A performance improvement can be seen with Jumbo Frames, check with your system and network administrator first and if possible, configure Jumbo Frames on interconnect. See reference for more detail specific to platform.
References (URLs/Notes)   :  http://docs.hp.com/en/AB290-90001/ch01s11.html?jumpid=reg_R
                             1002_USEN
                             http://www.sun.com/products/networking/ethernet/jumbo/
                             http://kbserver.netgear.com/kb_web_files/n101539.asp
                             http://www.cisco.com/en/US/products/hw/switches/ps700/produ
                             cts_configuration_example09186a008010edab.shtml
                             http://www.juniper.net/company/presscenter/pr/2005/pr-05081
                             5.html
                             http://darkwing.uoregon.edu/~joe/jumbo-clean-gear.html
                             http://www.dell.com/content/products/productdetails.aspx/pw
                             cnt_2724?c=us&cs=04&l=en&s=bsd
                             http://www.intel.com/support/network/adapter/index.htm
Node            Status    Expected Value                Actual Value
---------------------------------------------------------------------------
stpldb102       NOT MET   eth1=9000;eth2=9000           eth1=1500;eth2=1500
stpldb101       NOT MET   eth1=9000;eth2=9000           eth1=1500;eth2=1500
___________________________________________________________________________
***************************************************************************
ASM recommendations
***************************************************************************
Verification Check        :  Disks without disk_group
Verification Description  :  Check disks without disk group
Verification Result       :  NOT MET
Verification Summary      :  Check for Disks without disk_group passed
Additional Details        :  The GROUP_NUMBER and DISK_NUMBER columns in V$ASM_DISK will only be valid if the disk is part of a disk group which is currently mounted by the instance. Otherwise, GROUP_NUMBER will be 0, and DISK_NUMBER will be a unique value with respect to the other disks that also have a group number of 0 .
Verification of Health Check was unsuccessful.
Checks did not pass for the following nodes:
        stpldb102,stpldb101
Failures were encountered during execution of CVU verification request "Health Check".
Verifying Ethernet Jumbo Frames ...FAILED
stpldb102: PRVE-0293 : Jumbo Frames are not configured for interconnects
           "eth1,eth2" on node "stpldb102.localdomain".
           [Expected="eth1=9000;eth2=9000"; Found="eth1=1500;eth2=1500"]
stpldb101: PRVE-0293 : Jumbo Frames are not configured for interconnects
           "eth1,eth2" on node "stpldb101.localdomain".
           [Expected="eth1=9000;eth2=9000"; Found="eth1=1500;eth2=1500"]
Verifying Disks without disk_group ...WARNING
AFD:OCR_1C: PRVE-3073 : Disks "AFD:OCR_1C" are not part of any disk group.
AFD:OCR_1B: PRVE-3073 : Disks "AFD:OCR_1B" are not part of any disk group.
AFD:OCR_1A: PRVE-3073 : Disks "AFD:OCR_1A" are not part of any disk group.
CVU operation performed:      Health Check
Date:                         Aug 19, 2018 5:00:27 PM
CVU home:                     /u01/app/12.2.0.1/grid/
User:                         oracle

Cluster Verification Utility Baselines

A related feature to the health checks is the ability to create baselines and then compare them. The Cluster Verification Utility will store the contents of a health check so that it can be referenced in the future to help highlight changes in a cluster. The following example shows a way to collect best-practice information on all components on an Oracle Real Application Cluster and save them as a baseline called baseline1. The raw output is shown to help you understand the type of data that is collected by baselines.

It should be noted that if the -collect all flag is used, a best-practice baseline will be collected for every database in the cluster. For this reason, it is recommended that you store DBSNMP passwords for all databases in the cluster.
[oracle@stpldb101 ~]$ cluvfy comp baseline -collect all -bestpractice -n all -reportname baseline1
Collecting OS best practice baseline
Collecting HugePages Existence ...collected
Collecting Hardware Clock synchronization at shutdown ...collected
Collecting availability of port 8888 ...collected
Collecting Ethernet Jumbo Frames ...collected
Collecting Clusterware best practice baseline
Collecting CSS misscount parameter ...collected
Collecting CSS reboottime parameter ...collected
Collecting CSS disktimeout parameter ...collected
Collecting Database best practice baseline for database "orclcdb"
Collecting OPTIMIZER_DYNAMIC_SAMPLING ...collected
Collecting Invalid Java Objects ...collected
Collecting SYSAUX tablespace existence ...collected
Collecting JVM configuration for database ...collected
Collecting Java based user in database ...collected
Collecting Java Role Count ...collected
Collecting Invalid SYS or SYSTEM Schema Objects ...collected
Collecting INVALID objects in the application related schemas ...collected
Collecting SPFILE ...collected
Collecting Database word size(bits) ...collected
Collecting Duplicate SYS or SYSTEM Schema Objects ...collected
Collecting MAX_DUMP_FILE_SIZE ...collected
Collecting REMOTE_LISTENER ...collected
Collecting Database Datafiles in Backup Mode ...collected
Collecting Files Needing Media Recovery ...collected
Collecting In-doubt Distributed Transactions ...collected
Collecting SYS-Owned Object Tables Check ...collected
Collecting Invalid Objects in dba_registry ...collected
Collecting Materialized View Group Refresh Jobs ...collected
Collecting Materialized View Manual Group Refresh ...collected
Collecting Redo Log Size (MB) ...collected
Collecting All tablespaces are locally managed ...collected
Collecting Cluster Interconnects ...collected
Collecting Core files destination ...collected
Collecting Alert log messages indicating internal errors ORA-00600 errors ...collected
Collecting Alert log messages indicating internal errors ORA-07445 errors ...collected
Collecting Old trace files in background dump destination ...collected
Collecting Database alert log file size ...collected
Collecting Automatic segment storage management ...collected
Collecting Average GC CR Block Receive Time ...collected
Collecting Average GC Current Block Receive Time ...collected
Collecting Automatic Undo Management ...collected
Collecting FILESYSTEMIO_OPTIONS ...collected
Verifying ASM best practices
Collecting ASM_POWER_LIMIT ...collected
Collecting ASM disk I/O error ...collected
Collecting Disks without disk_group ...collected
Collecting ASM SHARED_POOL_SIZE parameter ...collected
Collecting ASM disk group free space ...collected
Collecting ASM disk rebalance operations in WAIT status ...collected
Baseline collected.
Collection report for this execution is saved in file "/u01/app/oracle/crsdata/@global/cvu/baseline/users/baseline1.zip".
CVU operation performed:      baseline
Date:                         Aug 19, 2018 5:10:31 PM
Cluster name:                 stpldb01
Clusterware version:          12.2.0.1.0
Grid home:                    /u01/app/12.2.0.1/grid
User:                         oracle
Operating system:             Linux4.1.12-124.15.2.el6uek.x86_64
At a later time, a comparison can be done of stored baselines. Baselines exist as zip files within the CVU home and should be referenced by their absolute file path in a comparison operation. Comparisons can be made of baselines from other clusters using the -cross_compare flag, which is not shown in this example.
[oracle@stpldb101 ~]$ cluvfy comp baseline -compare /u01/app/oracle/crsdata/@global/cvu/baseline/users/baseline1.zip,/u01/app/oracle/crsdata/@global/cvu/baseline/users/baseline2.zip
Generating Baseline Report
Preparing the report..
Comparing "System Best Practices"
Comparing "Clusterware Best Practices"
Comparing "Best Practices for database "orclcdb""
Comparing "ASM Configuration Prerequisites"
Complete report of this execution is saved in file "/u01/app/oracle/crsdata/@global/cvu/report/html/cvucheckreport_08192017171818.html"
Opening the report in the browser..
Baseline report generation successful
Verification of baseline was successful.
CVU operation performed:      baseline
Date:                         Aug 19, 2018 5:18:16 PM
CVU home:                     /u01/app/12.2.0.1/grid/
User:                         oracle

Orachk

Orachk is a tool that can be used to run more comprehensive health checks and best-practice tests in an Oracle environment. Orachk comes in two flavors: Orachk and Exachk. Exachk can be used on all engineered systems other than Oracle Database Appliance, and Orachk can be used on all other commodity or engineered systems running the Oracle stack. Orachk and Exachk are similar and overlap in almost all options that are not specific to the Exadata platform (such as cell storage server checks, and so on).

Orachk is installed in the Grid Infrastructure home as well as the Oracle database software home in $ORACLE_HOME/suptools/ and can also be downloaded from Oracle directly. Each PSU that is applied to these software homes stages a newer Orachk version in the $ORACLE_HOME/suptools directory. It is recommended that you download the latest Orachk version from the Oracle web site and install that instead of using the Orachk version that is bundled with the Grid Infrastructure home.
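Installing a downloaded copy is simply a matter of unzipping it into a stand-alone directory and running the executable from there. The following is a sketch; the zip file name and target directory are assumptions for illustration:
[oracle@stpldb101 ~]$ mkdir -p /home/oracle/orachk_standalone
[oracle@stpldb101 ~]$ unzip -q -d /home/oracle/orachk_standalone orachk.zip
[oracle@stpldb101 ~]$ cd /home/oracle/orachk_standalone
[oracle@stpldb101 orachk_standalone]$ ./orachk -v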

Upgrading Orachk

Provided that the server running Orachk has access to the Internet, Orachk will prompt the user to upgrade if the tool determines that the version in use is more than 120 days old. In this particular example, the Orachk version found within the 12.2 Oracle Database home is used to show how the tool automatically prompts for an upgrade when it finds itself to be too old.
[oracle@stpldb101 orachk]$ ./orachk
This version of orachk was released on 26-May-2016 and its older than 120 days.
Latest version of orachk (ORACHK  VERSION: 12.2.0.1.3_20170719) is available at /u01/app/oracle/product/12.2.0.1/dbhome_1/suptools/.
Do you want to upgrade to the latest version of orachk? [y/n][y]y

Orachk/Exachk and Oracle RESTful Data Services

The latest versions of the Orachk and Exachk tools can now be driven via REST calls. This is made possible by Oracle REST Data Services (ORDS). ORDS can be set up for Orachk only by the root user and is available only on operating systems that are compatible with Orachk daemon mode.

The following is an example of how to configure Orachk with ORDS; it enables the automatic restart of the Orachk daemon in the case of server restarts.

When the flag -ordssetup is used, it will prompt for a user password. This is the password for a new OS user called orachkords, which will be used to run the orachk daemon. The orachkords username and password must be specified in all the REST calls made to the daemon from remote hosts.
[root@stpldb101 orachk_standalone]# ./orachk -ordssetup
Enter a password for the user
Confirm password for the user
Oracle Rest Data Service (ORDS) URL: https://stpldb101.localdomain:7080/ords/tfaml/orachk
Please start the daemon using following command: orachk -d start -ords or orachk -initsetup -ords before submitting APIs
Oracle Rest Data Service (ORDS) started successfully
[root@stpldb101 orachk_standalone]# ./orachk -initsetup -ords
Clusterware stack is running from /u01/app/12.2.0.1/grid. Is this the correct Clusterware Home?[y/n][y] y
Checking ssh user equivalency settings on all nodes in cluster for root
Node stpldb102 is not configured for ssh user equivalency and the orachk uses ssh to execute checks on remote nodes.
Without passwordless ssh orachk  can not run audit checks on the remote nodes.
If necessary due to security policies the orachk can be run on each node using -localonly option.
Do you want to configure SSH for user root on stpldb102 [y/n][y] y
Enter stpldb102 root password :
Verifying root password ...
Node stpldb102 has been configured for ssh user equivalency for root
Searching for running databases . . . . .
.  .
List of running databases registered in OCR
1. orclcdb
2. None of above
Select databases from list for checking best practices. For multiple databases, select 1 for All or comma separated number like 1,2 etc [1-2][1]. 1
.  .  .  .
.  .  .
Copying plug-ins
. .
.  .  .  .  .  .
Setting up orachk auto restart functionality
oracle-orachkscheduler start/running, process 22793
Starting orachk daemon. . . . . . . . . .
orachk daemon started successfully
Daemon log file location is : /home/oracle/orachk_standalone/orachk_daemon.log
The ORDS setup can easily be removed by using the -ordsrmsetup flag as the root user. This will also remove the orachkords user.
[root@stpldb101 ~]# cd /home/oracle/orachk_standalone/
[root@stpldb101 orachk_standalone]# ./orachk -ordsrmsetup
0 collections generated by Oracle Rest Data Service (ORDS) API found at /home/orachkords
Executing (/home/oracle/orachk_standalone/orachk -initrmsetup) to stop the orachk daemon
orachk daemon stopped successfully (mode=init)
Oracle Rest Data Service (ORDS) user - orachkords deleted successfully

Recommended Settings for Orachk Daemons

There are a few settings that should be configured for the Orachk daemon to ensure that regularly scheduled health checks do not silently fail and leave a system unmonitored for best-practice usage.

Notification Emails

At a minimum, one notification email needs to be set for Orachk if automatic health checks will be run. This ensures that noteworthy health check failures generate an alert even if the full health check reports are not being transmitted via e-mail.
[root@stpldb101 orachk_standalone]# ./orachk -set "[email protected]"
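The stored value can be confirmed afterward with the corresponding -get option, in the same way the retention setting is queried in the next section; the exact output format may vary by Orachk version:
[root@stpldb101 orachk_standalone]# ./orachk -get notification_email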

Retention Periods

By default, Orachk and Exachk do not purge collections within any specific retention window. Therefore, a retention window needs to be set if disk space is at a premium. A retention period for collected data can be set in days, per the following example:
[root@stpldb101 orachk_standalone]# ./orachk -get collection_retention
------------------------------------------------------------
ID: orachk.default
------------------------------------------------------------
------------------------------------------------------------
[root@stpldb101 orachk_standalone]# ./orachk -set "COLLECTION_RETENTION=90"
Updated ['COLLECTION_RETENTION=90'] for Id[orachk.DEFAULT]
[root@stpldb101 orachk_standalone]# ./orachk -get collection_retention
------------------------------------------------------------
ID: orachk.default
------------------------------------------------------------
COLLECTION_RETENTION  =  90
------------------------------------------------------------
[root@stpldb101 orachk_standalone]#

Automated Password Verification Checks

Orachk saves passwords that are used for authentication so that the daemon can run health checks without manual intervention. However, if passwords are changed, health checks will fail and not produce output. It is a best practice to set an interval at which Orachk and Exachk verify that the stored passwords are still correct. If the passwords have changed, Orachk will send an e-mail to the list of recipients in the NOTIFICATION_EMAIL setting and stop running to allow the administrator to restart the daemon and re-enter the pertinent passwords. The password verification interval is specified in terms of hours.
[root@stpldb101 orachk_standalone]# ./orachk -set "PASSWORD_CHECK_INTERVAL=12"
Updated ['PASSWORD_CHECK_INTERVAL=12'] for Id[orachk.DEFAULT]

Trace File Analyzer

Trace File Analyzer (TFA) is arguably one of the most useful tools Oracle has released for the Oracle Database and Grid Infrastructure software when it comes to enabling DBAs to react quickly and analyze the many logfiles to which the Oracle software writes errors. Trace File Analyzer, along with the Support Tools Bundle, is extremely powerful as a troubleshooting and diagnosis tool. Most TFA features are somewhat outside the scope of this book, but some of the new and exciting features that can be extremely useful are touched upon in this chapter.

Upgrading TFA to Include the Support Tools Bundle

By default, TFA is upgraded with every quarterly update patch that comes out for Grid Infrastructure and the database software (12.2 and onward). However, the version of TFA that comes shipped with PSUs is generally three months behind what is available for download from Oracle Support. Furthermore, the version of TFA that comes with quarterly patches does not include the Support Tools Bundle, which has many useful features for diagnosing problems. The Support Tools Bundle can be downloaded from TFA Collector - TFA with Database Support Tools Bundle (Doc ID 1513912.1). While it is perfectly acceptable to use the TFA that is bundled with a standard Grid Infrastructure installation, you should download and install the Support Tools Bundle as shown in this chapter. This will upgrade your TFA with a multitude of new and useful features.

Upgrading TFA is extremely easy and consists of unzipping the download and running the executable script contained therein. Upgrading TFA will upgrade TFA for the entire cluster.
[oracle@stpldb101 ~]$ unzip TFA-LINUX_v18.3.0.zip
Archive:  TFA-LINUX_v18.3.0.zip
  inflating: README.txt
  inflating: installTFA-LINUX

TFA must be installed/patched as the root user. This is because certain TFA files are owned by root in a Grid Infrastructure software home and because init scripts need to be modified.

Some of the following output has been abridged as it is essentially the same for remote node operations:
[root@stpldb101 oracle]# ./installTFA-LINUX
TFA Installation Log will be written to File : /tmp/tfa_install_5857_2018_08_20-00_27_36.log
Starting TFA installation
TFA Version: 183000 Build Date: 201808081359
TFA HOME : /u01/app/12.2.0.1/grid/tfa/stpldb101/tfa_home
Installed Build Version: 181100 Build Date: 201803280250
TFA is already installed. Patching /u01/app/12.2.0.1/grid/tfa/stpldb101/tfa_home...
TFA patching typical install from zipfile is written to /u01/app/12.2.0.1/grid/tfa/stpldb101/tfapatch.log
TFA will be Patched on:
stpldb101
stpldb102
Do you want to continue with patching TFA? [Y|N] [Y]: Y
Checking for ssh equivalency in stpldb102
stpldb102 is configured for ssh user equivalency for root user
Using SSH to patch TFA to remote nodes :
Applying Patch on stpldb102:
TFA_HOME: /u01/app/12.2.0.1/grid/tfa/stpldb102/tfa_home
Stopping TFA Support Tools...
Shutting down TFA
oracle-tfa stop/waiting
. . . . .
Killing TFA running with pid 23847
. . .
Successfully shutdown TFA..
Copying files from stpldb101 to stpldb102...
Current version of Berkeley DB in stpldb102 is 5 or higher, so no DbPreUpgrade required
Running commands to fix init.tfa and tfactl in stpldb102...
Updating init.tfa in stpldb102...
Starting TFA in stpldb102...
Starting TFA..
oracle-tfa start/running, process 10467
Waiting up to 100 seconds for TFA to be started..
. . . . .
Successfully started TFA Process..
. . . . .
TFA Started and listening for commands
Enabling Access for Non-root Users on stpldb102...
.-----------------------------------------------------------------.
| Host      | TFA Version | TFA Build ID         | Upgrade Status |
+-----------+-------------+----------------------+----------------+
| stpldb101 |  18.3.0.0.0 | 18300020180808135947 | UPGRADED       |
| stpldb102 |  18.3.0.0.0 | 18300020180808135947 | UPGRADED       |
'-----------+-------------+----------------------+----------------'
cleanup serializable files

Using Trace File Analyzer to Manage Logfiles

Old logs need to be deleted once they are no longer relevant. Every DBA knows this, and most DBAs have scripts that help automate this task, as it quickly becomes burdensome the more databases and clusters a DBA has to manage. Up until 11g, this was a manual process that required custom shell scripts; in 11g, Oracle introduced the Automatic Diagnostic Repository (ADR). The latest versions of Trace File Analyzer are integrated with ADR, which allows for streamlined maintenance of database and Grid Infrastructure logs.

Analyzing Logfile Space Usage

TFA can be used to display the file system space used by database logs.
[oracle@stpldb101 ~]$ tfactl managelogs -show usage -database orclcdb
Output from host : stpldb101
------------------------------
.------------------------------------------------------------------.
|                       Database Homes Usage                       |
+------------------------------------------------------+-----------+
| Location                                             | Size      |
+------------------------------------------------------+-----------+
| /u01/app/oracle/diag/rdbms/orclcdb/orclcdb1/alert    | 1.81 MB   |
| /u01/app/oracle/diag/rdbms/orclcdb/orclcdb1/incident | 25.68 MB  |
| /u01/app/oracle/diag/rdbms/orclcdb/orclcdb1/trace    | 119.94 MB |
| /u01/app/oracle/diag/rdbms/orclcdb/orclcdb1/cdump    | 136.58 MB |
| /u01/app/oracle/diag/rdbms/orclcdb/orclcdb1/hm       | 4.00 KB   |
| /u01/app/oracle/diag/rdbms/orclcdb/orclcdb1/log      | 36.00 KB  |
+------------------------------------------------------+-----------+
| Total                                                | 284.05 MB |
'------------------------------------------------------+-----------'
Output from host : stpldb102
------------------------------
.------------------------------------------------------------------.
|                       Database Homes Usage                       |
+------------------------------------------------------+-----------+
| Location                                             | Size      |
+------------------------------------------------------+-----------+
| /u01/app/oracle/diag/rdbms/orclcdb/orclcdb2/alert    | 1.84 MB   |
| /u01/app/oracle/diag/rdbms/orclcdb/orclcdb2/incident | 7.05 MB   |
| /u01/app/oracle/diag/rdbms/orclcdb/orclcdb2/trace    | 82.30 MB  |
| /u01/app/oracle/diag/rdbms/orclcdb/orclcdb2/cdump    | 139.87 MB |
| /u01/app/oracle/diag/rdbms/orclcdb/orclcdb2/hm       | 4.00 KB   |
| /u01/app/oracle/diag/rdbms/orclcdb/orclcdb2/log      | 32.00 KB  |
+------------------------------------------------------+-----------+
| Total                                                | 231.09 MB |
'------------------------------------------------------+-----------'
More important, Trace File Analyzer can now be used to analyze the change, or variation, in space usage. In the following example, the Old Size shows as - because the database was newly created:
[oracle@stpldb101 ~]$ tfactl managelogs -show variation -older 1d -database orclcdb
Output from host : stpldb101
------------------------------
2018-08-20 00:42:46: INFO Checking space variation for 1 days
2018-08-20 00:42:46: INFO Space is calculated in bytes [without round off]
.-----------------------------------------------------------------------------.
|                           Database Homes Variation                          |
+------------------------------------------------------+----------+-----------+
| Directory                                            | Old Size | New Size  |
+------------------------------------------------------+----------+-----------+
| /u01/app/oracle/diag/rdbms/orclcdb/orclcdb1/cdump    | -        | 136.58 MB |
+------------------------------------------------------+----------+-----------+
| /u01/app/oracle/diag/rdbms/orclcdb/orclcdb1/alert    | -        | 1.81 MB   |
+------------------------------------------------------+----------+-----------+
| /u01/app/oracle/diag/rdbms/orclcdb/orclcdb1/incident | -        | 25.68 MB  |
+------------------------------------------------------+----------+-----------+
| /u01/app/oracle/diag/rdbms/orclcdb/orclcdb1/log      | -        | 36.00 KB  |
+------------------------------------------------------+----------+-----------+
| /u01/app/oracle/diag/rdbms/orclcdb/orclcdb1/trace    | -        | 119.99 MB |
+------------------------------------------------------+----------+-----------+
| /u01/app/oracle/diag/rdbms/orclcdb/orclcdb1/hm       | -        | 4.00 KB   |
'------------------------------------------------------+----------+-----------'
Output from host : stpldb102
------------------------------
2018-08-20 00:42:54: INFO Checking space variation for 1 days
2018-08-20 00:42:54: INFO Space is calculated in bytes [without round off]
.-----------------------------------------------------------------------------.
|                           Database Homes Variation                          |
+------------------------------------------------------+----------+-----------+
| Directory                                            | Old Size | New Size  |
+------------------------------------------------------+----------+-----------+
| /u01/app/oracle/diag/rdbms/orclcdb/orclcdb2/log      | -        | 32.00 KB  |
+------------------------------------------------------+----------+-----------+
| /u01/app/oracle/diag/rdbms/orclcdb/orclcdb2/alert    | -        | 1.84 MB   |
+------------------------------------------------------+----------+-----------+
| /u01/app/oracle/diag/rdbms/orclcdb/orclcdb2/cdump    | -        | 139.87 MB |
+------------------------------------------------------+----------+-----------+
| /u01/app/oracle/diag/rdbms/orclcdb/orclcdb2/incident | -        | 7.05 MB   |
+------------------------------------------------------+----------+-----------+
| /u01/app/oracle/diag/rdbms/orclcdb/orclcdb2/hm       | -        | 4.00 KB   |
+------------------------------------------------------+----------+-----------+
| /u01/app/oracle/diag/rdbms/orclcdb/orclcdb2/trace    | -        | 82.35 MB  |
'------------------------------------------------------+----------+-----------'

Purging Database Logfiles

By default, TFA will purge files that are older than 30 days if the -older flag is not specified. In the following example, all logs are purged because of the -older 0d setting (all files older than zero days):
[oracle@stpldb101 ~]$ tfactl managelogs -purge -older 0d -database orclcdb
Output from host : stpldb101
------------------------------
2018-08-20 01:00:59: INFO Purging files older than 0 days
2018-08-20 01:00:59: INFO Space is calculated in bytes [without round off]
2018-08-20 01:00:59: INFO Cleaning Database Home destinations
2018-08-20 01:01:04: INFO Purging diagnostic destination "diag/rdbms/orclcdb/orclcdb1" for files - 2598 files deleted , 276.98 MB freed
2018-08-20 01:01:04: MESSAGE Database Home : /u01/app/oracle/product/12.2.0.1/dbhome_1 [ Files deleted : 2598 files | Space Freed : 276.98 MB ]
.-------------------------------------------------------------------.
| File System Variation : /u01/app/oracle/product/12.2.0.1/dbhome_1 |
+--------+------+----------+----------+----------+----------+-------+
| State  | Name | Size     | Used     | Free     | Capacity | Mount |
+--------+------+----------+----------+----------+----------+-------+
| Before |      | 96888540 | 76662696 | 15721604 |      83% | /     |
| After  |      | 96888540 | 76373716 | 16010584 |      83% | /     |
'--------+------+----------+----------+----------+----------+-------'
Output from host : stpldb102
------------------------------
2018-08-20 01:01:06: INFO Purging files older than 0 days
2018-08-20 01:01:06: INFO Space is calculated in bytes [without round off]
2018-08-20 01:01:06: INFO Cleaning Database Home destinations
2018-08-20 01:01:08: INFO Purging diagnostic destination "diag/rdbms/orclcdb/orclcdb2" for files - 1513 files deleted , 226.99 MB freed
2018-08-20 01:01:08: MESSAGE Database Home : /u01/app/oracle/product/12.2.0.1/dbhome_1 [ Files deleted : 1513 files | Space Freed : 226.99 MB ]
.-------------------------------------------------------------------.
| File System Variation : /u01/app/oracle/product/12.2.0.1/dbhome_1 |
+--------+------+----------+----------+----------+----------+-------+
| State  | Name | Size     | Used     | Free     | Capacity | Mount |
+--------+------+----------+----------+----------+----------+-------+
| Before |      | 96888540 | 70290896 | 22093400 |      77% | /     |
| After  |      | 96888540 | 70055416 | 22328880 |      76% | /     |
'--------+------+----------+----------+----------+----------+-------'
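To keep the diagnostic destinations trimmed without manual intervention, the purge command can simply be scheduled. The following crontab entry is a hedged sketch; the schedule, retention, and log path are assumptions for illustration:
# Hypothetical oracle crontab entry: purge database logs older than 30 days
# every Sunday at 02:00 and keep a record of what was removed.
0 2 * * 0 /u01/app/12.2.0.1/grid/bin/tfactl managelogs -purge -older 30d -database orclcdb >> /home/oracle/tfactl_purge.log 2>&1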

TFA as a Health Check Tool

Trace File Analyzer can be used to analyze logfiles on several components over a range of time for errors or warnings. This can be extremely useful as an ongoing monitoring solution for large clusters.

To show the power of this type of analysis, this example shows how all of the logfiles managed by TFA can be quickly analyzed to look for a particular type of error. In this case, the logs are searched for any mention of swapping to check for memory pressure conditions on a particular host. All the necessary logs were analyzed within two seconds, and a report was output to the screen. Furthermore, tfactl changes its return code depending on the findings, which lends itself to a tfactl-centered shell script as a monitoring solution (a sketch follows the output below).
[oracle@stpldb101 ~]$ tfactl analyze -comp all -last 1d -type error -search "swapping" -node local
INFO: analyzing all (Alert and Unix System Logs) logs for the last 1440 minutes...  Please wait...
INFO: analyzing host: stpldb101
                    Report title: Analysis of Alert,System Logs
               Report date range: last ~1 day(s)
      Report (default) time zone: EST - Eastern Standard Time
             Analysis started at: 20-Aug-2018 01:13:51 AM EDT
           Elapsed analysis time: 2 second(s).
              Configuration file: /u01/app/12.2.0.1/grid/tfa/stpldb101/tfa_home/ext/tnt/conf/tnt.prop
             Configuration group: all
                       Parameter: swapping
             Total message count:       18,695, from 02-Jul-2018 07:42:39 PM EDT to 20-Aug-2018 01:13:44 AM EDT
Messages matching last ~1 day(s):       9,172, from 19-Aug-2018 01:14:00 AM EDT to 20-Aug-2018 01:13:44 AM EDT
                  Matching regex: swapping
                  Case sensitive: false
                     Match count: 5
[Source: /u01/app/oracle/diag/rdbms/orclcdb/orclcdb1/trace/alert_orclcdb1.log, Line: 4847]
Aug 19 02:01:28 2018
WARNING: Heavy swapping observed on system in last 5 mins.
Heavy swapping can lead to timeouts, poor performance, and instance eviction.
Errors in file /u01/app/oracle/diag/rdbms/orclcdb/orclcdb1/trace/orclcdb1_dbrm_31124.trc  (incident=72122) (PDBNAME=CDB$ROOT):
ORA-00700: soft internal error, arguments: [kskvmstatact: excessive swapping observed], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/orclcdb/orclcdb1/incident/incdir_72122/orclcdb1_dbrm_31124_i72122.trc
[Source: /u01/app/oracle/diag/rdbms/_mgmtdb/-MGMTDB/trace/alert_-MGMTDB.log, Line: 6135]
Aug 19 02:02:35 2018
WARNING: Heavy swapping observed on system in last 5 mins.
Heavy swapping can lead to timeouts, poor performance, and instance eviction.
Errors in file /u01/app/oracle/diag/rdbms/_mgmtdb/-MGMTDB/trace/-MGMTDB_dbrm_14223.trc  (incident=44115) (PDBNAME=CDB$ROOT):
ORA-00700: soft internal error, arguments: [kskvmstatact: excessive swapping observed], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/_mgmtdb/-MGMTDB/incident/incdir_44115/-MGMTDB_dbrm_14223_i44115.trc
[Source: /u01/app/oracle/diag/rdbms/orclcdb/orclcdb1/trace/alert_orclcdb1.log, Line: 4946]
Aug 19 04:56:44 2018
WARNING: Heavy swapping observed on system in last 5 mins.
Heavy swapping can lead to timeouts, poor performance, and instance eviction.
Errors in file /u01/app/oracle/diag/rdbms/orclcdb/orclcdb1/trace/orclcdb1_dbrm_31124.trc  (incident=72123) (PDBNAME=CDB$ROOT):
ORA-00700: soft internal error, arguments: [kskvmstatact: excessive swapping observed], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/orclcdb/orclcdb1/incident/incdir_72123/orclcdb1_dbrm_31124_i72123.trc
[Source: /u01/app/oracle/diag/rdbms/orclcdb/orclcdb1/trace/alert_orclcdb1.log, Line: 4979]
Aug 19 06:08:03 2018
WARNING: Heavy swapping observed on system in last 5 mins.
Heavy swapping can lead to timeouts, poor performance, and instance eviction.
[Source: /u01/app/oracle/diag/rdbms/orclcdb/orclcdb1/trace/alert_orclcdb1.log, Line: 4982]
Aug 19 06:08:06 2018
Errors in file /u01/app/oracle/diag/rdbms/orclcdb/orclcdb1/trace/orclcdb1_dbrm_31124.trc  (incident=72124) (PDBNAME=CDB$ROOT):
ORA-00700: soft internal error, arguments: [kskvmstatact: excessive swapping observed], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/orclcdb/orclcdb1/incident/incdir_72124/orclcdb1_dbrm_31124_i72124.trc
[oracle@stpldb101 ~]$ echo $?
1
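The following is a minimal sketch of that kind of monitoring wrapper; the search patterns, recipient address, and the availability of a local mail command are assumptions for illustration:
#!/bin/bash
# Hypothetical monitoring sketch: tfactl analyze exits nonzero when matches
# are found, so mail the report for each pattern that produces a hit.
PATTERNS="swapping ORA-00600 ORA-04031"
for PATTERN in $PATTERNS; do
    REPORT=/tmp/tfa_analyze_${PATTERN}.txt
    if ! tfactl analyze -comp all -last 1d -type error -search "$PATTERN" -node local > "$REPORT" 2>&1; then
        mail -s "TFA found '$PATTERN' in the last day on $(hostname)" [email protected] < "$REPORT"
    fi
done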

New Health Check and Troubleshooting Features in 12.2

Oracle introduced a new health check and troubleshooting feature in 12.2 that can aid in identifying and resolving issues. Using this new feature in conjunction with the rest of the features outlined in this chapter should help to prepare a DBA for even the most vexing of performance issues.

Cluster Health Advisor

The Cluster Health Advisor is available via the chactl command and can be used to run diagnostics at the cluster level or, if -cluster is replaced with -db <DB_UNIQUE_NAME>, at the database level.
[oracle@stpldb101 ~]$ chactl query diagnosis -cluster -start "2018-08-20 00:00:00" -end "2018-08-20 08:00:00"
2018-08-20 00:01:40.0  Host stpldb102  Host Memory Consumption [detected]
2018-08-20 00:01:50.0  Host stpldb101  Host Memory Consumption [detected]
2018-08-20 00:02:50.0  Host stpldb101  Host CPU Utilization [detected]
2018-08-20 00:04:35.0  Host stpldb101  Host Memory Swapping [detected]
2018-08-20 00:04:45.0  Host stpldb101  Host CPU Utilization [cleared]
2018-08-20 00:09:15.0  Host stpldb101  Host Memory Swapping [cleared]
2018-08-20 00:09:25.0  Host stpldb102  Host CPU Utilization [detected]
2018-08-20 00:16:10.0  Host stpldb102  Host CPU Utilization [cleared]
2018-08-20 00:16:10.0  Host stpldb102  Host Memory Consumption [cleared]
2018-08-20 00:18:20.0  Host stpldb102  Host Memory Consumption [detected]
2018-08-20 00:19:20.0  Host stpldb102  Host CPU Utilization [detected]
2018-08-20 00:27:05.0  Host stpldb102  Host CPU Utilization [cleared]
2018-08-20 00:30:20.0  Host stpldb102  Host CPU Utilization [detected]
2018-08-20 00:31:30.0  Host stpldb101  Host CPU Utilization [detected]
2018-08-20 00:33:50.0  Host stpldb101  Host CPU Utilization [cleared]
2018-08-20 00:34:50.0  Host stpldb102  Host CPU Utilization [cleared]
2018-08-20 02:01:05.0  Host stpldb102  Host CPU Utilization [detected]
2018-08-20 02:05:40.0  Host stpldb102  Host CPU Utilization [cleared]
2018-08-20 02:31:40.0  Host stpldb102  Host CPU Utilization [detected]
2018-08-20 02:56:55.0  Host stpldb102  Host CPU Utilization [cleared]
2018-08-20 04:00:55.0  Host stpldb102  Host CPU Utilization [detected]
2018-08-20 04:05:40.0  Host stpldb102  Host CPU Utilization [cleared]
2018-08-20 06:01:20.0  Host stpldb102  Host CPU Utilization [detected]
2018-08-20 06:05:45.0  Host stpldb102  Host CPU Utilization [cleared]
Problem: Host Memory Consumption
Description: CHA detected that more memory than expected is consumed on this server. The memory is not allocated by sessions of this database.
Cause: The Cluster Health Advisor (CHA) detected an increase in memory consumption by other databases or by applications not connected to a database on this node.
Action: Identify the top memory consumers by using the Cluster Health Monitor (CHM).
Problem: Host CPU Utilization
Description: CHA detected larger than expected CPU utilization on this node. The available CPU resource may not be sufficient to support application failover or relocation of databases to this node.
Cause: The Cluster Health Advisor (CHA) detected an unexpected increase in CPU utilization by databases or applications on this node.
Action: Identify CPU intensive processes and databases by reviewing Cluster Health Monitoring (CHM) data. Relocate databases to less busy machines, or limit the number of connections to databases on this node. Add nodes if more resources are required.
Problem: Host Memory Swapping
Description: CHA detected that there is not enough free memory on this server and memory swapping has started. The performance of databases on other nodes in this cluster will be affected severely. An instance eviction is likely if swapping continues.
Cause: The Cluster Health Advisor (CHA) detected that there was no more free memory because the memory requirements of processes increased.
Action: Stop database services on this server, or shut down less important database instances. Redirect the applications to the available healthy servers. Check the CHA diagnostics for the health of the other servers and databases. Add servers if more memory is required. Ensure that Huge Pages are correctly configured and used.
The Cluster Health Advisor uses data stored in the MGMTDB, the Grid Infrastructure Management Repository (GIMR). The amount of space used by the repository is directly correlated with the maximum retention (up to 168 hours) and the number of entities/targets (up to 512) being tracked by the Cluster Health Advisor.
[oracle@stpldb101 ~]$ chactl query repository
specified max retention time(hrs): 72
available retention time(hrs)    : 617
available number of entities     : 17
allocated number of entities     : 2
total repository size(gb)        : 15.00
allocated repository size(gb)    : 0.26
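If longer retention or more monitored entities are needed, the repository can be resized with chactl. The following is a hedged sketch based on the documented 12.2 syntax; the values are illustrative, and because resizing can grow the MGMTDB, it is worth running chactl calc size first:
[oracle@stpldb101 ~]$ chactl calc size -entities 8
[oracle@stpldb101 ~]$ chactl resize repository -entities 8
[oracle@stpldb101 ~]$ chactl set maxretention -time 168
[oracle@stpldb101 ~]$ chactl query repository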

Summary

This chapter covered many new and old troubleshooting and health check topics in the hope of raising your awareness of the options available to DBAs, especially the new features of TFA and Orachk and the latest capabilities in the 12.2 GI and RDBMS software.
