Recovery


The first thing to do when discussing recovery and RAC is to dismiss media recovery: it’s not really any different from media recovery for single-instance Oracle. You take a backup copy of the database and a list of (archived) redo log files and work through them, applying the redo log entries that need to be applied. Since Oracle does a two-phase recovery (first working out which blocks need recovering from which redo byte address), it doesn’t really make much difference that the recovery process has to keep log files from multiple instances in sync as it reads and applies them.

There are, inevitably, a couple of interesting little details. For example, the need to (logically) merge all the log files means that when you set up a physical standby for a RAC system, only one of the standby instances can perform the actual redo read and merge. (That’s not as bad as it sounds, since that one instance can hand off the redo records to the other instances in much the same way that single-instance Oracle can do parallel recovery.)
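The logical merge of redo from several instances can be pictured as a k-way merge by SCN. The following is a minimal sketch only: the `RedoRecord` type, its fields, and `merge_redo` are hypothetical names chosen for illustration, and real redo records carry far more than an SCN and a block number.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple


class RedoRecord(NamedTuple):
    scn: int      # system change number of the change
    thread: int   # redo thread, one per instance (illustrative)
    block: int    # hypothetical block identifier


def merge_redo(*threads: Iterable[RedoRecord]) -> Iterator[RedoRecord]:
    """Yield records from several per-instance streams in ascending SCN order.

    Each input stream is assumed to be already ordered by SCN.
    """
    return heapq.merge(*threads, key=lambda r: r.scn)


thread1 = [RedoRecord(100, 1, 7), RedoRecord(130, 1, 9)]
thread2 = [RedoRecord(110, 2, 7), RedoRecord(120, 2, 8)]
merged = list(merge_redo(thread1, thread2))
```

A single reader produces the merged stream; the records could then be handed to several appliers, which is the division of labour the standby description above suggests.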

The more important aspect of recovery is instance recovery—after all, one of the arguments for RAC is high availability, and the fact that “the system” survives even if an instance (or the machine it’s running on) fails. It’s worth taking a look at how this works.

The different instances are constantly talking to each other—so if one instance disappears, the other instances find out about it fairly promptly. At this point they will all race to acquire the single instance recovery (IR) lock—and the instance that gets it will perform instance recovery for the failed instance.
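As a loose analogy for the race, the sketch below has several threads attempt a non-blocking acquire on one lock, so exactly one of them wins. It is an illustration of the idea only: `race_for_ir_lock` is a hypothetical function, and a local `threading.Lock` stands in for the cluster-wide IR lock, which Oracle implements quite differently.

```python
import threading


def race_for_ir_lock(survivors):
    """Return the list of instances that acquired the (stand-in) IR lock."""
    ir_lock = threading.Lock()   # stand-in for the single IR lock
    winners = []

    def attempt(instance_id):
        # Non-blocking attempt: losers simply carry on without recovering.
        if ir_lock.acquire(blocking=False):
            winners.append(instance_id)

    threads = [threading.Thread(target=attempt, args=(i,)) for i in survivors]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return winners


survivors = [1, 2, 3]
winners = race_for_ir_lock(survivors)
```

Which survivor wins is not predictable; the only guarantee in this sketch is that there is exactly one winner.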

Part of the work involves recovering the changes that had been made to blocks in the failed instance’s cache and then rolling back the uncommitted changes; part of it involves reconstructing the master resources that the failed instance was holding, and cleaning out all references to the failed instance from the GRD; and part of it entails “rebalancing” the GRD because the cluster is now a different size and the hashing algorithm that allowed an instance to find a resource won’t work any more.

REBALANCING

The way in which Oracle redistributes master resources keeps changing, but part of it has remained stable since 9i. Oracle keeps a very small map of “logical” and “physical” instances, and at steady state logical instance N will be physical instance N.

When an instance leaves the cluster, Oracle simply adjusts the map to assign the physical instance that has left to a different logical instance, and recreates the lost master resources on that physical instance. This avoids the need to redistribute all the resources because of one instance failure—but it does mean that the resource mastery is no longer operating a fair-share system.
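One reading of the remapping described above is sketched here: the logical slot that used to point at the departed physical instance is pointed at a survivor, so resource hashing (which works on logical instances) does not change. The function names and the modulo hash are assumptions made for illustration, not Oracle's actual algorithm.

```python
def remap_after_failure(logical_to_physical, failed_physical, survivor):
    """Return a new map where slots served by the failed instance use the survivor."""
    new_map = dict(logical_to_physical)
    for logical, physical in logical_to_physical.items():
        if physical == failed_physical:
            new_map[logical] = survivor
    return new_map


def master_for(resource_hash, logical_to_physical):
    """Find the physical master by hashing to a logical instance first."""
    logical = resource_hash % len(logical_to_physical) + 1  # hypothetical hash
    return logical_to_physical[logical]


steady = {1: 1, 2: 2, 3: 3}              # logical N is physical N
after = remap_after_failure(steady, failed_physical=3, survivor=1)
```

In this sketch, resources that hashed to logical instances 1 and 2 keep their masters, while physical instance 1 now carries two shares, which is the loss of fair-share behaviour mentioned above.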

It’s possible that a secondary mechanism subsequently takes over and gradually migrates master resources to a fairer distribution by mapping the correct location for a master when there are N-1 nodes in play, but also keeping a note of where it might be immediately after a failure from N nodes.

It’s probably appropriate to mention at this point that when it comes to BL resources, the hashing algorithm used by Oracle is based on a unit size of 256 blocks (in 11.2). Each datafile is split into chunks of 256 consecutive blocks, and the same instance will be the master for every block in a chunk. Conveniently, this means that during a tablescan, each multiblock read request is likely to cover blocks that all belong to a single instance—especially if you are using locally managed tablespaces that essentially operate on a 1 MB boundary.
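The arithmetic behind that observation can be sketched as follows. The 256-block chunk size comes from the text; the 8 KB block size (so 1 MB is 128 blocks) and the modulo hash in `master_of` are assumptions made purely for illustration.

```python
CHUNK_BLOCKS = 256  # unit size for BL resource mastering (11.2, per the text)


def chunk_of(block_no):
    """Chunk number of a block within its datafile."""
    return block_no // CHUNK_BLOCKS


def master_of(file_no, block_no, n_instances):
    """Hypothetical hash: every block in a chunk gets the same master."""
    return (file_no + chunk_of(block_no)) % n_instances


def masters_touched(file_no, first_block, n_blocks, n_instances):
    """Set of masters involved in one multiblock read."""
    return {
        master_of(file_no, b, n_instances)
        for b in range(first_block, first_block + n_blocks)
    }


# A 128-block read aligned on a 1 MB boundary (assuming 8 KB blocks)
aligned = masters_touched(1, 128, 128, 3)
# The same size of read starting at an unaligned block spans two chunks
unaligned = masters_touched(1, 200, 128, 3)
```

The aligned read stays inside one chunk and so involves one master; the unaligned read crosses a chunk boundary and involves two.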

It’s a little difficult to work out Oracle’s algorithm for redistributing BL resources when you add an instance to a cluster, particularly if you only have a small cluster to work with, but it looks as if the move from N-1 instances to N instances simply involves each instance passing mastery of every Nth chunk to the new node. This doesn’t produce a regular pattern of ownership across a file, but it produces fair shares with a minimum of disruption.
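A small sketch of that apparent behaviour follows, under the assumption that "every Nth chunk" means every Nth chunk in each existing master's own list. The `add_instance` function and the dictionary layout are hypothetical.

```python
def add_instance(ownership, new_instance):
    """Hand every Nth chunk of each existing master to the new instance.

    ownership maps instance id -> ordered list of chunk numbers it masters.
    N is the cluster size after the new instance has joined.
    """
    n = len(ownership) + 1
    result = {new_instance: []}
    for inst, chunks in ownership.items():
        kept = []
        for position, chunk in enumerate(chunks, start=1):
            if position % n == 0:
                result[new_instance].append(chunk)  # passed to the new node
            else:
                kept.append(chunk)
        result[inst] = kept
    return result


# Two instances sharing 12 chunks round-robin, then a third joins
before = {0: [0, 2, 4, 6, 8, 10], 1: [1, 3, 5, 7, 9, 11]}
after_join = add_instance(before, new_instance=2)
```

Here each of the three instances ends up with four chunks, and only four of the twelve chunks change master, but the new instance's chunks (4, 5, 10, 11) do not fall at regular intervals across the file.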

There are two key points to the recovery. On the plus side, it’s fully automatic (and if you’re lucky, and you’ve configured it, the failed instance will be restarted automatically anyway). On the minus side, rebuilding and rebalancing the GRD freezes the instance briefly, and it’s not possible to say how long that might be; ideally, it’s only a handful of seconds.

Shadow resources and past images play an interesting role in what goes on during recovery. Oracle can scan all the shadow resources on the surviving nodes to reconstruct images of the master resources from the failed instance; this is probably the step that is responsible for most of the time spent in the system freeze. As it rebuilds the master resources, it can discard any master resource (from any instance) that holds enqueues for only the failed instance. The exception is master resources that show exclusive locks on BL resources, which are kept because they may identify blocks that will need recovery.

When it comes to the recovery, Oracle can reduce the amount of work it has to do with the redo logs because Past Images are known to be recent copies of blocks that would have been written to disc if another instance had not called them across the network. So any time Oracle finds a redo record for a block that exists as a past image somewhere it can discard the redo if its SCN is lower than the last change SCN on the past image. This helps to reduce the recovery time, but you will still see sessions waiting for blocks that are still pinned by the recovery process.
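The filtering rule in the previous paragraph can be sketched as a simple comparison. The tuple layout and the `redo_to_apply` name are hypothetical, and the strict "lower than" test simply follows the wording above.

```python
def redo_to_apply(redo_records, past_image_scn):
    """Drop redo that is already reflected in a past image.

    redo_records: iterable of (scn, block) tuples.
    past_image_scn: dict mapping block -> last change SCN on its past image.
    """
    needed = []
    for scn, block in redo_records:
        pi_scn = past_image_scn.get(block)
        if pi_scn is not None and scn < pi_scn:
            continue  # the past image already contains this change
        needed.append((scn, block))
    return needed


redo = [(100, 7), (105, 8), (120, 7), (130, 9)]
past_images = {7: 110}  # block 7 has a past image last changed at SCN 110
remaining = redo_to_apply(redo, past_images)
```

In this example only the SCN 100 change to block 7 is skipped; the later change to block 7 and the changes to blocks with no past image still have to be applied.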
