Recovery

The first thing to do when discussing recovery and RAC is to dismiss media recovery: it’s not really any different from media recovery for single-instance Oracle. You take a backup copy of the database and a list of (archived) redo log files, and work through them applying the redo log entries that need to be applied. Since Oracle does a two-phase recovery (first working out which blocks need recovering from which redo byte address), it doesn’t really make much difference that the recovery process has to keep log files from multiple instances in sync as it reads and applies them.
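To make the idea of keeping several instances' log files in sync concrete, here is a minimal sketch that merges per-instance redo streams into one stream ordered by SCN. The `RedoRecord` fields and the sample values are invented for illustration; real redo records and Oracle's actual merge logic are considerably more involved.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple


class RedoRecord(NamedTuple):
    scn: int      # system change number of the change
    thread: int   # redo thread (one per instance)
    block: int    # hypothetical block address the change applies to


def merge_redo(threads: Iterable[Iterable[RedoRecord]]) -> Iterator[RedoRecord]:
    """Merge redo streams, each already in SCN order, into one SCN-ordered stream."""
    return heapq.merge(*threads, key=lambda r: r.scn)


# Invented sample data: two instances changing some of the same blocks.
thread_1 = [RedoRecord(100, 1, 7), RedoRecord(130, 1, 9)]
thread_2 = [RedoRecord(110, 2, 7), RedoRecord(120, 2, 8)]

merged = list(merge_redo([thread_1, thread_2]))
merged_scns = [r.scn for r in merged]
```

Because each input stream is already ordered, the merge only ever needs to compare the head of each stream, which is why a single reader can do the job.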

There are, inevitably, a couple of interesting little details. For example, the need to (logically) merge all the log files means that when you set up a physical standby for a RAC system, only one of the standby instances can perform the actual redo read and merge. (That’s not as bad as it sounds, since that one instance can hand off the redo records to the other instances in much the same way that single-instance Oracle can do parallel recovery.)

The more important aspect of recovery is instance recovery—after all, one of the arguments for RAC is high availability, and the fact that “the system” survives even if an instance (or the machine it’s running on) fails. It’s worth taking a look at how this works.

The different instances are constantly talking to each other, so if one instance disappears the other instances find out about it fairly promptly. At this point they all race to acquire the single instance-recovery (IR) lock, and the instance that gets it performs instance recovery for the failed instance.
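The race for the IR lock can be pictured with an ordinary non-blocking lock. This is only an analogy, assuming threads stand in for surviving instances and a `threading.Lock` stands in for the cluster-wide IR lock; it says nothing about how Oracle implements the lock.

```python
import threading

ir_lock = threading.Lock()         # stands in for the single IR lock
winners = []                       # survivors that obtained the lock
winners_guard = threading.Lock()   # protects the winners list


def survivor(instance_id: int, failed_id: int) -> None:
    # Non-blocking attempt: whoever gets there first wins, the rest carry on.
    if ir_lock.acquire(blocking=False):
        with winners_guard:
            winners.append(instance_id)
        # The winner would perform recovery for failed_id here.
        # The lock is deliberately never released in this sketch,
        # so a late arrival cannot become a second winner.


# Instances 1, 2 and 4 survive; instance 3 has failed (invented ids).
threads = [threading.Thread(target=survivor, args=(i, 3)) for i in (1, 2, 4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Which survivor wins is not predictable; the only guarantee is that exactly one does.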

Part of the work involves recovering the changes that had been made to blocks in the failed instance’s cache and then rolling back the uncommitted changes; part of it involves reconstructing the master resources that the failed instance was holding, and cleaning out all references to the failed instance from the GRD; and part of it entails “rebalancing” the GRD because the cluster is now a different size and the hashing algorithm that allowed an instance to find a resource won’t work any more.
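The reason the hashing algorithm stops working when the cluster changes size can be shown with a deliberately naive rule. The modulo-based `master_of` function below is an assumption made for illustration, not Oracle's actual mastering algorithm.

```python
from typing import Sequence


def master_of(resource_id: int, instances: Sequence[int]) -> int:
    """Hypothetical mastering rule: hash the resource over the live instances."""
    return instances[resource_id % len(instances)]


before = [1, 2, 3, 4]   # four-instance cluster
after = [1, 2, 4]       # instance 3 has failed

resources = range(12)

# Resources whose master was the failed instance: these must be rebuilt.
orphaned = [r for r in resources if master_of(r, before) == 3]

# Resources whose master changes under the new cluster size.
moved = [r for r in resources if master_of(r, before) != master_of(r, after)]
```

With this naive rule, many resources that were never mastered on the failed instance also change master, which gives a feel for why rebalancing is a distinct piece of work from rebuilding the lost masters.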

REBALANCING

There are two key points to the recovery. On the plus side, it’s fully automatic (and if you’re lucky, and you’ve configured it, the failed instance will be restarted automatically as well). On the minus side, rebuilding and rebalancing the GRD freezes the instance briefly, and it’s not possible to say how long that might be; ideally, it’s only a handful of seconds.

Shadow resources and past images play an interesting role in what goes on during recovery. Oracle can scan all the shadow resources on the surviving nodes to reconstruct images of the master resources from the failed instance; this is probably the step that is responsible for most of the time spent in the system freeze. As it rebuilds the master resources, it can discard any master resources (from any instance) that hold enqueues for only the failed instance. The exception is master resources that show exclusive locks on BL resources, as these are possibly blocks that will need recovery.
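The two steps just described, rebuilding lost masters from the survivors' shadow resources and then applying the discard rule, can be sketched as follows. The data layout (dictionaries keyed by a `(type, id)` pair, lock modes `"S"` and `"X"`) and all the sample values are assumptions made for illustration; they are not the GRD's real structures.

```python
from collections import defaultdict

FAILED = 3  # hypothetical id of the failed instance

# Shadow resources on the survivors: instance -> {resource: lock mode}.
shadows = {
    1: {("BL", 101): "S", ("TX", 7): "X"},
    2: {("BL", 101): "S"},
}
lost = {("BL", 101), ("TX", 7)}  # resources that were mastered on the failed instance


def rebuild_masters(shadows, lost):
    """Recreate lost master resources from the survivors' shadow resources."""
    masters = defaultdict(dict)
    for instance, held in shadows.items():
        for resource, mode in held.items():
            if resource in lost:
                masters[resource][instance] = mode
    return dict(masters)


def keep_master(resource, enqueues):
    """Discard rule for one master resource, given its {instance: mode} enqueues."""
    only_failed = set(enqueues) == {FAILED}
    exclusive_bl = resource[0] == "BL" and enqueues.get(FAILED) == "X"
    return (not only_failed) or exclusive_bl


rebuilt = rebuild_masters(shadows, lost)

# Master resources on surviving instances that mention the failed instance.
surviving_masters = {
    ("BL", 202): {FAILED: "X"},          # exclusive BL: block may need recovery, keep
    ("BL", 203): {FAILED: "S"},          # only a shared lock: discard
    ("TX", 9): {FAILED: "X"},            # not a BL resource: discard
    ("BL", 204): {FAILED: "S", 1: "S"},  # a survivor also holds it: keep
}
kept = {r for r, e in surviving_masters.items() if keep_master(r, e)}
```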

When it comes to the recovery, Oracle can reduce the amount of work it has to do with the redo logs because past images are known to be recent copies of blocks that would have been written to disc if another instance had not called them across the network. So any time Oracle finds a redo record for a block that exists as a past image somewhere, it can discard that redo record if the record’s SCN is lower than the last change SCN on the past image. This helps to reduce the recovery time, but you will still see sessions waiting for blocks that are still pinned by the recovery process.
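The past-image shortcut amounts to a simple filter over the redo stream. In this sketch redo records are reduced to `(scn, block)` pairs and past images to a mapping from block to last change SCN; both representations, and the sample values, are assumptions made for illustration.

```python
def redo_to_apply(redo, past_image_scn):
    """Return the redo records that still need applying.

    redo: iterable of (scn, block) pairs in SCN order.
    past_image_scn: {block: last change SCN of a past image of that block}.
    """
    needed = []
    for scn, block in redo:
        pi_scn = past_image_scn.get(block)
        # Following the rule in the text: skip redo whose SCN is lower
        # than the last change SCN on a past image of the same block.
        if pi_scn is not None and scn < pi_scn:
            continue
        needed.append((scn, block))
    return needed


redo = [(100, 7), (110, 7), (120, 8), (130, 7)]
past_images = {7: 125}   # a past image of block 7 exists, last changed at SCN 125

remaining = redo_to_apply(redo, past_images)
```

Here the two oldest changes to block 7 are skipped, while the change to block 8 (no past image) and the newest change to block 7 still have to be applied.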
