Testing the Disaster Recovery Plan

By the time an organization has completed a DRP, it has probably spent hundreds of man-hours and possibly tens (or hundreds) of thousands of dollars on consulting fees. You’d think that after making an investment this big, any organization would want to test its DRP to make sure that it works when a real emergency strikes.

The following sections outline five methods available for testing the Disaster Recovery Plan.

Checklist

A checklist test is a detailed review of DRP documents, performed by individuals working on their own. The purpose of a checklist test is to identify inaccuracies, errors, and omissions in DRP documentation.

It’s easy to coordinate this type of test, because each person who performs the test does it when their schedule permits (provided that they complete it before any deadlines).

By itself, a document review is an insufficient way to test a DRP; however, it’s a logical starting place. Perform one or more of the other tests soon after you do the checklist test.

Structured walkthrough

A structured walkthrough is a team approach to the checklist test. Here, several business and technology experts in the organization gather to “walk” through the BCP documents. A moderator or facilitator leads participants through a discussion of each step in the BCP documents so that they can identify issues and opportunities for making the documents more accurate and complete. Group discussions usually help to identify issues that people will not find when working on their own. Often the participants want to perform the review in a fancy mountain or oceanside retreat, where they can think much more clearly.

During a structured walkthrough, the facilitator writes down “parking lot” issues (items to be considered at a later time, written down now so they will not be forgotten) on a whiteboard or flipchart while the group identifies those issues. These are action items that will serve to make improvements to BCP documents. Each action item needs to have an accountable person assigned, as well as a completion date, so that the action items will be completed in a reasonable time. Depending upon the extent of the changes, a follow-up walkthrough may need to be conducted at a later time.
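If the action items are tracked electronically rather than left on the flipchart, the record for each one can be very small. The following is a minimal sketch, assuming nothing more than the three things the text calls for (a description, an accountable person, and a completion date); the `ActionItem` class and the `overdue` helper are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    """One walkthrough action item (hypothetical structure)."""
    description: str
    owner: str          # the accountable person
    due: date           # the agreed completion date
    done: bool = False


def overdue(items, today: date):
    """Return open action items whose completion date has passed."""
    return [item for item in items if not item.done and item.due < today]
```

Running `overdue` against the list before a follow-up walkthrough shows at a glance which improvements are still outstanding and who owns them.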

Tip: A structured walkthrough usually requires two to eight hours or more to complete.

Simulation

In a simulation, all the designated disaster recovery personnel practice going through the motions associated with a real recovery. However, the team doesn’t actually perform any recovery or alternate processing.

An organization that plans to perform a simulation test appoints a facilitator who develops a disaster scenario, using a type of disaster that’s likely to occur in the region. For instance, an organization in San Francisco might choose an earthquake scenario, and an organization in Miami could choose a hurricane.

In a simple simulation, the facilitator reads out announcements as if they’re news briefs. Such announcements describe an unfolding scenario and can also include information about the organization’s status at the time. An example announcement might read like this:

It is 8:15 a.m. local time, and a magnitude 7.1 earthquake has just occurred, fifteen miles from company headquarters. Building One is heavily damaged and some people are seriously injured. Building Two (the one containing the organization’s computer system) is damaged and personnel are unable to enter the building. Electric power is out, and the generator has not started because of an unknown problem that may be earthquake related. Executives Jeff Finsch and Sarah Brewer (CIO and CFO) are backpacking on the Appalachian Trail and cannot be reached.

The disaster-simulation team, meeting in a conference room, discusses emergency response procedures and how the response might unfold. They consider the conditions described to them and identify any issues that could impact an actual disaster response.

The simulation facilitator makes additional announcements throughout the simulation. Just like in a real disaster, the team doesn’t know everything right away — instead, news trickles in. In the simulation, the facilitator reads scripted statements that, um, simulate the way that information flows in a real disaster.

A more realistic simulation can be held at the organization’s emergency response center, where some resources that support emergency response may be available. Another idea is to hold the simulation on a day that is not announced ahead of time, so that responders will be genuinely surprised and possibly be less prepared to respond.

Tip: Remember to test your backup media to make sure that you can actually restore data from backups!
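One way to act on this tip is to restore a sample of files to a scratch location and compare them with the originals. The following is a minimal sketch, assuming the backup can be restored into an ordinary directory and that the source files have not changed since the backup was taken; `sha256_of` and `verify_restore` are hypothetical helper names, not part of any backup product.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(source_dir: Path, restored_dir: Path) -> list:
    """Compare every file under source_dir with its restored copy.

    Returns a list of problems ("missing: ..." or "mismatch: ...").
    An empty list means every sampled file was restored intact.
    """
    problems = []
    for source_file in sorted(source_dir.rglob("*")):
        if not source_file.is_file():
            continue
        relative = source_file.relative_to(source_dir)
        restored_file = restored_dir / relative
        if not restored_file.is_file():
            problems.append(f"missing: {relative.as_posix()}")
        elif sha256_of(source_file) != sha256_of(restored_file):
            problems.append(f"mismatch: {relative.as_posix()}")
    return problems
```

Where the source data changes constantly, a more dependable variation is to record the checksums at backup time and compare the restored files against that record instead of against the live files.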

Parallel

A parallel test involves performing all the steps of a real recovery, except that you keep the real, live production systems running. The actual production systems run in parallel with the disaster recovery systems. The parallel test is very time-consuming, but it does test the accuracy of the applications because analysts compare data on the test recovery systems with production data.

The technical architecture of the target application determines how a parallel test needs to be conducted. The general principle of a parallel test is that the disaster recovery system (meaning the system that remains on standby until a real disaster occurs, at which time the organization presses it into production service) processes work at the same time that the primary system continues its normal work. Precisely how this is accomplished depends on technical details. For a system that operates on batches of data, those batches can be copied to the DR system for processing there, and results can be compared for accuracy and timeliness.
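For a batch-oriented system, the comparison step might look something like the following sketch. It assumes that each system's batch output can be exported as a mapping from a unique record ID to a result value; `ComparisonReport` and `compare_batch_results` are hypothetical names, and the sketch covers accuracy only, not timeliness.

```python
from dataclasses import dataclass, field


@dataclass
class ComparisonReport:
    """Differences between production and DR output for one batch."""
    missing_on_dr: list = field(default_factory=list)
    extra_on_dr: list = field(default_factory=list)
    mismatched: list = field(default_factory=list)

    @property
    def passed(self) -> bool:
        """True only when the DR output matches production exactly."""
        return not (self.missing_on_dr or self.extra_on_dr or self.mismatched)


def compare_batch_results(production: dict, dr: dict) -> ComparisonReport:
    """Compare batch output keyed by record ID (IDs assumed unique)."""
    report = ComparisonReport()
    for record_id in sorted(production.keys() | dr.keys()):
        if record_id not in dr:
            report.missing_on_dr.append(record_id)
        elif record_id not in production:
            report.extra_on_dr.append(record_id)
        elif production[record_id] != dr[record_id]:
            report.mismatched.append(record_id)
    return report
```

Sorting records into three groups (missing, extra, and mismatched) tells the analysts whether the DR system dropped work, invented work, or computed a different answer, and each of those points to a different kind of fault.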

Highly interactive applications are more difficult to test in a strictly parallel test. Instead, it might be necessary to record user interactions on the live system and then “play back” those interactions using an application testing tool. Responses, accuracy, and timing can then be checked after the test to verify whether the DR system worked properly.
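The record-and-playback idea can be sketched in a few lines. This is an illustration only, assuming that each recorded interaction can be reduced to a request string, the response the live system gave, and how long the live system took; `RecordedInteraction`, `replay`, the `send` callable, and the `slowdown_limit` threshold are all hypothetical and stand in for whatever a real application testing tool provides.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class RecordedInteraction:
    """One request captured on the live system, with its observed result."""
    request: str
    expected_response: str
    live_seconds: float


def replay(interactions, send: Callable[[str], str], slowdown_limit: float = 2.0):
    """Replay recorded requests against the DR system through `send`.

    Returns a list of (request, problem) tuples. An empty list means every
    response matched and none took longer than slowdown_limit times the
    duration observed on the live system.
    """
    problems = []
    for item in interactions:
        started = time.perf_counter()
        response = send(item.request)
        elapsed = time.perf_counter() - started
        if response != item.expected_response:
            problems.append((item.request, "response differs"))
        elif elapsed > item.live_seconds * slowdown_limit:
            problems.append((item.request, "too slow"))
    return problems
```

Passing the DR system in as a plain callable keeps the sketch independent of any particular protocol; in practice `send` would wrap an HTTP client, a database driver, or a terminal emulator, depending on the application.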

While a parallel test may be difficult to set up, its results can provide a good indication of whether disaster recovery systems will perform during a disaster. Also, the risks associated with a parallel test are low, since a failure of the DR system will not impact real business transactions.

Instant answer: The parallel test includes loading data onto recovery systems without taking production systems down.

Interruption (or cutover)

An interruption test (sometimes known as a “cutover” test) is similar to a parallel test except that in an interruption test, a function’s primary systems are actually shut off or disconnected. An interruption test is the ultimate test of a disaster recovery plan because one or more of the business’s critical functions actually depends upon the availability, integrity, and accuracy of the recovery systems.

An interruption test should be performed only after successful walkthroughs and at least one parallel test. In an interruption test, backup systems process the full production workload and support all primary and ancillary functions, including:

- User access
- Administrative access
- Integrations to other applications
- Support
- Reporting
- And whatever else the main production environment needs to support

Remember: An interruption test is the ultimate test of the ability of a disaster recovery system to perform properly in a real disaster, but it’s also the test with the highest risk.
