Recovering an OPS Database

In a parallel server environment, the following types of failures may occur:

  • Node failure

  • Instance failure

  • Crash failure

  • Integrated Distributed Lock Manager (IDLM) failure

  • GMS failure

  • Media failure

These types are described in the following sections.

Some of these failures (for example, an instance failure or a media failure) can also occur in a standalone instance environment. Others (for example, an IDLM failure or a GMS failure) are specific to an OPS environment. You may need to perform a database recovery as a result of any one of these failures.

Node Failure and Recovery

A node may fail because of a power outage, operating system crash, or any other event on the node that makes it nonfunctional. Failure of a node causes the instance, the IDLM processes, and the GMS process running on that node to fail. The recovery from a node failure consists of instance recovery, IDLM recovery, and GMS recovery. A surviving instance will perform instance and IDLM recovery. You will have to restart GMS and the instance manually after you have diagnosed and corrected the cause of the node failure.

Instance Failure and Recovery

When one or more of the Oracle background processes for an instance fail, or when the SGA for an instance is lost, the instance stops running. This type of failure is called an instance failure. Issuing a SHUTDOWN ABORT command also causes instance failure.

The process of recovering from instance failure is called instance recovery. For a single-instance database, instance recovery is performed automatically by the SMON background process at the next database startup.

The failure of more than one instance is still referred to as instance failure. However, the failure of all instances is called a crash. (Crash recovery is discussed in the next section.) When an instance in an OPS environment fails, one of the surviving instances will perform online instance recovery automatically. This is possible as long as at least one instance is running. When one instance fails, other instances keep running without being affected.

The recovery of a failed instance is triggered when either of the following occurs:

  • The SMON process of a surviving instance detects the failure. Each SMON process periodically wakes up to check the status of the other instances.

  • A lock is requested on a data block being managed by the failed instance.

The SMON process of the instance that detected the failure performs the instance recovery. For one instance to be able to recover another, it must have access to the online redo log files of the failed instance. That is why, in an OPS environment, all redo logs must be accessible to all instances.

Committed transactions are never lost as a result of an instance failure. During instance recovery, the SMON background process first rolls forward all the changes by reading from the redo logs and then rolls back any uncommitted changes using information from the database’s rollback segments. Instance recovery releases locks held by the failed instance. After instance recovery is complete, the failed instance is not started automatically. You must detect, diagnose, and correct the cause of the instance failure and then start the instance manually.
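The two phases described above can be illustrated with a small Python model. This is only a sketch of the idea, not Oracle's implementation: the RedoRecord class, the row names, and the transaction identifiers are all made up for the example, and the old value stored in each record stands in for the information Oracle keeps in rollback segments. It also assumes that no two open transactions changed the same row, which row-level locking would normally guarantee.

```python
from dataclasses import dataclass


@dataclass
class RedoRecord:
    txn: str   # hypothetical transaction identifier
    row: str   # hypothetical row the change applies to
    old: int   # value before the change (stands in for rollback-segment data)
    new: int   # value after the change


def instance_recovery(rows, redo_log, committed):
    """Return row contents after rolling forward and then rolling back."""
    recovered = dict(rows)
    # Roll forward: reapply every logged change, committed or not.
    for rec in redo_log:
        recovered[rec.row] = rec.new
    # Roll back: undo the changes of uncommitted transactions, newest first.
    for rec in reversed(redo_log):
        if rec.txn not in committed:
            recovered[rec.row] = rec.old
    return recovered


# State on disk at the time of failure, plus the failed instance's redo.
on_disk = {"A": 10, "B": 20}
redo = [
    RedoRecord("T1", "A", 10, 11),  # T1 committed before the failure
    RedoRecord("T2", "B", 20, 25),  # T2 was still open at the failure
]
result = instance_recovery(on_disk, redo, committed={"T1"})
print(result)
```

In this model the committed change to row A survives, while the uncommitted change to row B is undone.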

You have some control over the behavior of the database during the instance recovery. You can set the initialization parameter FREEZE_DB_FOR_FAST_INSTANCE_RECOVERY to TRUE to freeze the entire database during recovery. Setting this parameter to TRUE avoids contention for the resources used in the recovery process. When this parameter is set to TRUE, all database activity, other than that generated by the recovery process, is stopped. As a result, instance recovery may take less time. The tradeoff is that the database is also unavailable until the recovery process completes. When this parameter is set to FALSE, the data needed for instance recovery (in other words, the data involved in the transactions being recovered) will be unavailable, but the rest of the database will be available for normal use.

Warning

FREEZE_DB_FOR_FAST_INSTANCE_RECOVERY is obsolete in Oracle8i.

The default value for this parameter is determined by Oracle based on the Parallel Cache Management (PCM) locking mechanism that is being used. If any datafile uses fine-grained locks, the default is TRUE. If all datafiles use hash locking, the default is FALSE. Locking mechanisms such as fine-grained and hashed locks are discussed in Chapter 8.
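The default rule just described can be written as a one-line check. The following Python sketch only restates the rule from the text; the function name, the datafile names, and the "fine-grained" and "hashed" labels are illustrative and do not correspond to any Oracle interface.

```python
def default_freeze_setting(datafile_locking):
    """Return the default described in the text.

    datafile_locking maps a datafile name to its PCM locking mechanism,
    either "fine-grained" or "hashed" (labels chosen for this sketch).
    """
    # TRUE if any datafile uses fine-grained locks, FALSE otherwise.
    return any(mode == "fine-grained" for mode in datafile_locking.values())


# Hypothetical datafiles for illustration only.
all_hashed = {"users01.dbf": "hashed", "data01.dbf": "hashed"}
mixed = {"users01.dbf": "hashed", "data01.dbf": "fine-grained"}

print(default_freeze_setting(all_hashed))
print(default_freeze_setting(mixed))
```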

All instances must have the same setting for this parameter. The sooner that instance recovery is completed, the better. Therefore, we recommend that you set this parameter to TRUE unless you can’t afford to stop other database activities during those times when recovery takes place.

Crash Failure and Recovery

For a standalone database, instance failure and crash failure are the same. However, in an OPS environment, crash failure means the failure of all of the instances associated with the parallel server database.

Recovery of all the instances is performed automatically the next time an instance is started. When you start up an instance to open the database, it performs recovery for all the failed instances. However, as in instance recovery, crash recovery does not automatically start the failed instances. You have to start the instances manually.

IDLM Failure and Recovery

As we discussed in Chapter 6, the Integrated Distributed Lock Manager (IDLM) consists of the background processes LMON and LMD0, as well as a distributed lock area. Failure of any of these components results in an IDLM failure. When IDLM on one node fails, the LMON process on another node will detect it. The recovery from IDLM failure is reconfiguration of the IDLM. The LMON process that detected the failure will perform the recovery by remastering the locks managed by the failed node.
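To make the idea of remastering concrete, here is a minimal Python sketch. It is an assumption-laden illustration, not the IDLM's algorithm: the lock and node names are invented, and the round-robin assignment of orphaned locks to surviving nodes is simply one possible policy chosen for the example.

```python
def remaster_locks(lock_masters, failed_node, surviving_nodes):
    """Reassign locks mastered by failed_node to surviving nodes.

    lock_masters maps a lock name to the node that masters it. The
    round-robin distribution below is an assumption made for this sketch.
    """
    if not surviving_nodes:
        raise ValueError("at least one surviving node is required")
    new_masters = dict(lock_masters)
    orphaned = sorted(
        lock for lock, node in lock_masters.items() if node == failed_node
    )
    for i, lock in enumerate(orphaned):
        new_masters[lock] = surviving_nodes[i % len(surviving_nodes)]
    return new_masters


# Hypothetical three-node configuration in which node2 fails.
masters = {"L1": "node1", "L2": "node2", "L3": "node2", "L4": "node3"}
remastered = remaster_locks(masters, "node2", ["node1", "node3"])
print(remastered)
```

Locks that were mastered by surviving nodes keep their masters; only the locks of the failed node are moved.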

GMS Failure and Recovery

The Group Membership Services (GMS) may fail for any of the following reasons:

  • Failure of the GMS daemon

  • Failure of the interconnect between nodes

  • Operating system crash

You have to detect and resolve the cause of a GMS failure before you can manually restart GMS. If the GMS failure has caused any instance to fail, you will also have to start that instance manually after starting GMS.

Media Failure and Recovery

A media failure occurs when a database file becomes lost or corrupted, preventing Oracle from reading from or writing to that file. Media failure can be caused by damage to the disk or to the disk controller.

If the database is running in NOARCHIVELOG mode, the only possible recovery is to restore the complete database from the last cold (offline) backup, thereby losing all the transactions committed since then. If the database is running in ARCHIVELOG mode, it is possible to recover all committed changes up to the time of failure. That’s done by using the most recent backup (hot or cold), together with information from the archived redo log files. You can perform media recovery using the Recovery Manager or by using Oracle’s command-line utilities.

The database must be mounted, but not opened, by a single instance if you are recovering any of the following:

  • The entire database

  • The entire SYSTEM tablespace

  • A datafile in the SYSTEM tablespace

To recover any other tablespace or datafile, the database must be opened by the instance that is performing recovery, and the respective tablespace or datafile must be offline.
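The rules above amount to a simple decision based on what is being recovered. The following Python sketch encodes that decision as a lookup; the function name, its arguments, and the returned wording are made up for illustration and are not part of any Oracle tool.

```python
def recovery_prerequisites(scope, tablespace=None):
    """Return the database state required before media recovery.

    scope is "database", "tablespace", or "datafile"; tablespace names the
    tablespace being recovered or the one that owns the datafile.
    """
    if scope not in ("database", "tablespace", "datafile"):
        raise ValueError(f"unknown scope: {scope}")
    is_system = (tablespace or "").upper() == "SYSTEM"
    if scope == "database" or is_system:
        return "mounted, not open, by a single instance"
    return "open, with the target tablespace or datafile offline"


print(recovery_prerequisites("database"))
print(recovery_prerequisites("datafile", "SYSTEM"))
print(recovery_prerequisites("tablespace", "TS1"))
```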

Multiple instances can perform recovery at the same time. For example, if tablespaces TS1 and TS2 need media recovery, you can recover TS1 from instance 1 and TS2 from instance 2 simultaneously.

Media recovery requires archived log files and online redo log files. Each instance has its own online and archived redo log files. As discussed in Chapter 6, the online redo log files for any one instance are accessible from all other instances. However, the archived redo logs may reside in local destinations that are not accessible to all instances. Therefore, during recovery, you need to place the archived log files of all instances into one location that is accessible by the instance performing the recovery.
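Gathering the archived logs into one place is an ordinary file-copy task. The Python sketch below shows one way to do it, assuming each instance's archive destination is reachable as a directory from the machine running the script. The directory names and log file names are hypothetical, and the demonstration uses empty placeholder files in a temporary directory rather than real archived logs.

```python
import shutil
import tempfile
from pathlib import Path


def gather_archived_logs(source_dirs, target_dir):
    """Copy every file from each archive destination into target_dir."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in source_dirs:
        for log in sorted(Path(src).iterdir()):
            if not log.is_file():
                continue
            dest = target / log.name
            # Refuse to overwrite: log names must be unique across instances.
            if dest.exists():
                raise FileExistsError(f"{dest} already exists")
            shutil.copy2(log, dest)
            copied.append(dest.name)
    return copied


# Demonstration with throwaway directories and placeholder files.
with tempfile.TemporaryDirectory() as root_name:
    root = Path(root_name)
    layout = {"inst1": ["arch_1_100.arc"], "inst2": ["arch_2_57.arc"]}
    for inst, names in layout.items():
        (root / inst).mkdir()
        for name in names:
            (root / inst / name).touch()
    gathered = gather_archived_logs(
        [root / "inst1", root / "inst2"], root / "recovery_area"
    )
    print(gathered)
```

The function refuses to overwrite an existing file, so a name collision between two instances' logs surfaces as an error instead of silently losing a log.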
