CHAPTER 7
Business Continuity and Disaster Recovery

This chapter discusses the following topics:

• Types of disasters and their impact on organizations

• Components of the business continuity and disaster recovery process

• Business impact analysis

• Recovery targets

• Testing business continuity and disaster recovery plans

• Training personnel

• Maintaining business continuity and disaster recovery plans

• Auditing business continuity and disaster recovery plans

The topics in this chapter represent 14 percent of the CISA examination.

Business continuity planning (BCP) and disaster recovery planning (DRP) are activities undertaken to reduce risks related to the onset of disasters and other disruptive events. BCP and DRP activities identify risks and mitigate those risks through changes or enhancements in technology or business processes, so that the impact of disasters is reduced and the time to recovery is lessened. The primary objective of BCP and DRP is to improve the chances that the organization will survive a disaster without incurring costly or even fatal damage to its most critical activities.

The activities of business continuity and disaster recovery plan development scale for any size organization. BCP and DRP have the unfortunate reputation of existing only in the stratospheric, thin air of the largest and wealthiest organizations. This misunderstanding hurts the majority of organizations that are too timid to begin any kind of BCP and DRP efforts at all because they feel that these activities are too costly and disruptive. The fact is, any size organization, from a one-person home office to a multinational conglomerate, can successfully undertake BCP and DRP projects that will bring about immediate benefits as well as take some of the sting out of disruptive events that do occur.

Organizations can benefit from BCP and DRP projects, even if a disaster never occurs. The steps in the BCP and DRP development process usually bring immediate benefit in the form of process and technology improvements that increase the resilience, integrity, and efficiency of those processes and systems.

Disasters

I always tried to turn every disaster into an opportunity.—John D. Rockefeller

In a business context, disasters are unexpected and unplanned events that result in the disruption of business operations. A disaster could be a regional event spread over a wide geographic area, or it could occur within the confines of a single room. The impact of a disaster will also vary, from a complete interruption of all company operations to merely a slowdown. (The question invariably comes up: when is a disaster a disaster? This is somewhat subjective, like asking, “When is a person sick?” Is it when he or she is too ill to report to work, or if he or she just has a sniffle and a scratchy throat? We’ll discuss disaster declaration later in this chapter.)

Types of Disasters

BCP and DRP professionals broadly classify disasters as natural or man-made, although the origin of a disaster does not figure into how we respond to it. Let’s examine the types of disasters.

Natural Disasters

Natural disasters are those phenomena that occur in the natural world with little or no assistance from mankind. They are a result of the natural processes that occur in, on, and above the earth.

Examples of natural disasters include

• Earthquakes: Sudden movements of the earth with the capacity to damage buildings, houses, roads, bridges, and dams; to precipitate landslides and avalanches; and to induce flooding and other secondary events.

• Volcanoes: Eruptions of magma, pyroclastic flows, steam, ash, and flying rocks that can cause significant damage over wide geographic regions. Some volcanoes, such as Kilauea in Hawaii, produce a nearly continuous and predictable outpouring of lava in a limited area, whereas the Mount St. Helens eruption in 1980 caused an ash fall over thousands of square miles that brought many metropolitan areas to a standstill for days, and also blocked rivers and damaged roads. Figure 7-1 shows a volcanic eruption as seen from space.

• Landslides: Sudden downhill movements of earth, usually down steep slopes, that can bury buildings, houses, roads, and public utilities, and cause secondary (although still disastrous) effects such as the rerouting of rivers.

• Avalanches: Sudden downward flows of snow, rocks, and debris on a mountainside. A slab avalanche consists of the movement of a large, stiff layer of compacted snow. A loose snow avalanche occurs when the accumulated snowpack exceeds its shear strength. A powder snow avalanche is the largest type and can travel in excess of 200 mph and involve more than 10 million tons of material. All types can damage buildings, houses, roads, and utilities.

• Wildfires: Fires in forests, chaparral, and grasslands are a part of the natural order. However, fires can also damage buildings and equipment and cause injury and death.

Figure 7-1 Mount Etna volcano in Sicily

• Tropical cyclones: The largest and most violent storms are known in various parts of the world as hurricanes, typhoons, tropical cyclones, tropical storms, and cyclones. Tropical cyclones consist of strong winds that can reach 190 mph, heavy rains, and storm surge that can raise the level of the ocean by as much as 20 feet, all of which can result in widespread coastal flooding and damage to buildings, houses, roads, and utilities, and significant loss of life.

• Tornadoes: These violent rotating columns of air can cause catastrophic damage to buildings, houses, roads, and utilities when they reach the ground. Most tornadoes have wind speeds from 40 to 110 mph and travel along the ground for a few miles. Some tornadoes can exceed 300 mph and travel for dozens of miles.

• Windstorms: While generally less intense than hurricanes and tornadoes, windstorms can nonetheless cause widespread damage, including damage to buildings, roads, and utilities. Widespread electric power outages are common, as windstorms can uproot trees that can fall into overhead power lines.

• Lightning: Atmospheric discharges of electricity that occur during thunderstorms, but also during dust storms and volcanic eruptions. Lightning can start fires and also damage buildings and power transmission systems, causing power outages.

• Ice storms: Ice storms occur when rain falls through a layer of colder air, causing raindrops to freeze onto whatever surface they strike. They can cause widespread power outages when ice forms on power lines and the resulting weight causes those power lines to collapse. A notable example is the Great Ice Storm of 1998 in eastern Canada, which resulted in millions being without power for as long as two weeks, and in the virtual immobilization of the cities of Montreal and Ottawa.

• Hail: This form of precipitation consists of ice chunks ranging from 5 mm to 150 mm in diameter. An example of a damaging hailstorm is the April 1999 storm in Sydney, Australia, where hailstones up to 9.5 cm in diameter damaged 40,000 vehicles, 20,000 properties, and 25 airplanes, and caused one direct fatality. The storm caused $1.5 billion in damage.

• Flooding: Standing or moving water spills out of its banks and flows into and through buildings and causes significant damage to roads, buildings, and utilities. Flooding can be a result of locally heavy rains, heavy snow melt, a dam or levee break, tropical cyclone storm surge, or an avalanche or landslide that displaces lake or river water. Figure 7-2 shows severe flooding along the Mississippi River in 1927.

• Tsunamis: A series of waves that usually result from the sudden vertical displacement of a lakebed or ocean floor, but can also be caused by landslides or explosions. A tsunami wave can be barely noticeable in open, deep water, but as it approaches a shoreline, the wave can grow to a height of 50 feet or more. A notable example followed the December 26, 2004, earthquake in the eastern Indian Ocean, resulting in a tsunami that reached virtually all of the countries around the rim of the Indian Ocean and caused more than 200,000 fatalities.

• Pandemic: The spread of infectious disease over a wide geographic region, even worldwide. Pandemics have regularly occurred throughout history and are likely to continue occurring, despite advances in sanitation and immunology. A pandemic is the rapid spread of any type of infectious disease, including typhoid, tuberculosis, bubonic plague, or influenza. Pandemics in the 20th century include the 1918–1920 Spanish flu, the 1957–1958 Asian flu, and the 1968–1969 Hong Kong flu. Figure 7-3 shows an auditorium that was converted into a hospital during the 1918–1920 pandemic. The early 21st century H5N1 avian flu and H1N1 swine flu have health authorities around the world concerned about the start of the next influenza pandemic.

Figure 7-2 The 1927 flood of the Mississippi River

Figure 7-3 An auditorium was used as a temporary hospital during the 1918 flu pandemic.

• Extraterrestrial impacts: This category includes meteorites and other objects that may fall from the sky from way, way up. Sure, these events are extremely rare, and most organizations don’t even include them in their risk analysis, but we’ve included them here for the sake of rounding out the types of natural events.

Man-Made Disasters

Man-made disasters are those events that are directly or indirectly caused by human activity, through action or inaction. The results of man-made disasters are similar to those of natural disasters: localized or widespread damage to businesses that results in potentially lengthy interruptions in operations.

Examples of man-made disasters include

• Civil disturbances: These can take on many forms, including protests, demonstrations, riots, strikes, work slowdowns and stoppages, looting, and resulting actions such as curfews, evacuations, or lockdowns.

• Utility outages: Failures in electric, natural gas, district heating, water, communications, and other utilities. These can be caused by equipment failures, sabotage, or natural events such as landslides or flooding.

• Materials shortages: Interruptions in the supply of food, fuel, supplies, and materials can have a ripple effect on businesses and the services that support them. Readers who are old enough to remember the petroleum shortages of the mid-1970s know what this is all about; Figure 7-4 shows a 1970s-era gas shortage. Shortages can result in spikes in the price of commodities, which can be almost as damaging as not having any supply at all.

Figure 7-4 Citizens wait in long lines to buy fuel during a gas shortage.

• Fires: As contrasted to wildfires, here I mean fires that originate in or involve buildings, equipment, and materials.

• Hazardous materials spills: Many created or refined substances can be dangerous if they escape their confines. Examples include petroleum substances, gases, pesticides and herbicides, medical substances, and radioactive substances.

• Transportation accidents: This broad category includes plane crashes, railroad derailment, bridge collapse, and the like.

• Terrorism and war: Whether they are actions of a nation, nation-state, or group, terrorism and war can have devastating but usually localized effects in cities and regions. Often, terrorism and war precipitate secondary effects such as materials shortages and utility outages.

• Security events: The actions of a lone hacker or a team of organized cyber-criminals can bring down one system, one network, or many networks, which could result in widespread interruption in services. The hackers’ activities can directly result in an outage, or an organization can voluntarily (although reluctantly) shut down an affected service or network in order to contain the incident.

NOTE It is important to remember that real disasters are usually complex events that involve more than just one type of damaging event. For instance, an earthquake directly damages buildings and equipment, but can also cause fires and utility outages. A hurricane also brings flooding, utility outages, and sometimes even hazardous materials events and civil disturbances such as looting.

How Disasters Affect Organizations

Disasters affect organizations in a wide variety of ways, which are discussed in this section. Many disasters have direct effects, but sometimes it is the secondary effects of a disaster event that are most significant from the perspective of ongoing business operations.

A risk analysis is a part of the BCP process (discussed in the next section in this chapter) that will identify the ways in which disasters are likely to affect a particular organization. It is during the risk analysis that the primary, secondary, and downstream effects of likely disaster scenarios need to be identified and considered. Whoever is performing this risk analysis will need to have a broad understanding of the ways in which a disaster will affect ongoing business operations. Similarly, those personnel who are developing contingency and recovery plans also need to be familiar with these effects so that those plans will adequately serve the organization’s needs.

Disasters, by our definition, interrupt business operations in some measurable way. An event that has the appearance of a disaster may occur, but if it doesn’t affect a particular organization, then we would say that no disaster occurred, at least for that particular organization.

It would be shortsighted to say that a disaster only affects operations. Rather, it is appropriate to understand the longer-term effects that a disaster has on the organization’s image, brand, and reputation. The factors affecting image, brand, and reputation have as much to do with how the organization communicates to its customers, suppliers, and shareholders, as with how the organization actually handles a disaster in progress.

Some of the ways that a disaster affects an organization’s operations include

• Direct damage: Events like earthquakes, floods, and fires directly damage an organization’s buildings, equipment, or records. The damage may be severe enough that no salvageable items remain, or may be less severe, where some equipment and buildings may be salvageable or repairable.

• Utility outage: Even if an organization’s buildings and equipment are undamaged, a disaster may affect utilities such as power, natural gas, or water, which can incapacitate some or all business operations. Significant delays in refuse collection can result in unsanitary conditions.

• Transportation: Similarly, a disaster may damage transportation systems such as roads, railroads, shipping, or air transport, or render them unusable for a period. Damaged transportation systems will interrupt supply lines and the movement of personnel.

• Services and supplier shortage: Even if a disaster does not have a direct effect on an organization, if any of its critical suppliers feel the effects of a disaster, that can have an undesirable effect on business operations. For instance, if a regional baker cannot produce and ship bread to its corporate customers, sandwich shops will soon find themselves without a critical resource.

• Staff availability: A communitywide or regional disaster that affects businesses is likely to also affect homes and families. Depending upon the nature of a disaster, employees will place a higher priority on the safety and comfort of family members. Also, workers may not be able or willing to travel to work if transportation systems are affected or if there is a significant materials shortage. Employees may also be unwilling to travel to work if they fear for their personal safety or that of their families.

• Customer availability: Various types of disasters may prevent or dissuade customers from traveling to business locations to conduct business. Many of the factors that keep employees away may also keep customers away.

NOTE The kinds of secondary and tertiary effects that a disaster has on a particular organization depend entirely upon the unique set of circumstances that constitutes its specific critical needs. A risk analysis should be performed to identify these specific factors.

The BCP Process

The proper way to plan for disaster preparedness is to first know what kinds of disasters are likely, and their possible effects on the organization. That is, plan first, act later.

The business continuity process is a life-cycle process. In other words, business continuity planning (and disaster recovery planning) is not a one-time event or activity. It’s a set of activities that result in the ongoing preparedness for disaster that continually adapts to changing business conditions and that continually improves.

The elements of the BCP process life cycle are

• Develop BCP policy

• Conduct business impact analysis (BIA)

• Perform criticality analysis

• Establish recovery targets

• Develop recovery and continuity strategies and plans

• Test recovery and continuity plans and procedures

• Train personnel

• Maintain strategies, plans, and procedures through periodic reviews and updates

The BCP life cycle is shown in Figure 7-5. The elements of this life cycle are described in detail in this chapter.

BCP Policy

A formal BCP effort must, like any strategic activity, flow from the existence of a formal policy and be included in the overall governance model that is the topic of Chapter 2 of this book. BCP should be an integral part of the IT control framework, not lie outside of it. Therefore, BCP policy should include or cite specific controls that ensure that key activities in the BCP life cycle are performed appropriately.

Figure 7-5 The BCP process life cycle

BCP policy should also define the scope of the BCP strategy. This means that the specific business processes (or departments or divisions within an organization) that are included in the BCP and DRP effort must be defined. Sometimes the scope will include a geographic boundary. In larger organizations, it is possible to “bite off more than you can chew” and to define too large a scope for a BCP project, so limiting scope to a smaller, more manageable portion of the organization can be a good approach.

BCP and COBIT Controls

The specific COBIT controls that are involved with BCP and DRP are contained within DS4—Ensure continuous service. DS4 has 11 specific controls that constitute the entire BCP and DRP life cycle:

• Develop IT continuity framework.

• Conduct business impact analysis and risk assessment.

• Develop and maintain IT continuity plans.

• Identify and categorize IT resources based on recovery objectives.

• Define and execute change control procedures to ensure IT continuity plan is current.

• Regularly test IT continuity plan.

• Develop follow-on action plan from test results.

• Plan and conduct IT continuity training.

• Plan IT services recovery and resumption.

• Plan and implement backup storage and protection.

• Establish procedures for conducting post-resumption reviews.

These controls are discussed in this chapter and also in COBIT.

Business Impact Analysis (BIA)

The objective of the business impact analysis (BIA) is to identify the impact that different scenarios will have on ongoing business operations. The BIA is one of several steps of critical, detailed analysis that must be carried out before the development of continuity or recovery plans and procedures.

Inventory Key Processes and Systems

The first step in a BIA is to inventory key business processes and IT systems. Within the overall scope of the BCP project, the objective here is to establish a detailed list of all identifiable processes and systems. The usual approach is the development of a questionnaire or intake form that would be circulated to key personnel in end-user departments and also within IT. A sample intake form is shown in Figure 7-6.

Typically, the information that is gathered on intake forms is transferred to a multi-columned spreadsheet, where information on all of the organization’s in-scope processes can be viewed together. This will become even more useful in subsequent phases of the BCP project such as the criticality analysis.

NOTE Use of an intake form is not the only accepted approach when gathering information about critical processes and systems. It’s also acceptable to conduct one-on-one interviews or group interviews with key users and IT personnel to identify critical processes and systems. I recommend the use of an intake form (whether paper based or electronic), even if the interviewer uses it him/herself as a framework for note-taking.

IT personnel are often eager to get to the fun and meaty part of a project. Developers are anxious to begin coding before design; system administrators are eager to build systems before they are scoped and designed; and BCP/DRP personnel fervently desire to begin designing more robust system architectures and to tinker with replication and backup capabilities before key facts are known. In the case of business continuity and disaster recovery planning, completion of the BIA and other analyses is critical, as the analyses help to define the systems and processes most needed before getting to the fun part.

Figure 7-6 BIA sample intake form for gathering data about key processes

Statements of Impact

When processes and systems are being inventoried and cataloged, it is also vitally important to obtain one or more statements of impact for each process and system. A statement of impact is a qualitative or quantitative description of the impact if the process or system were incapacitated for a time.

For IT systems, you might capture the number of users and the names of departments or functions that are affected by the unavailability of a specific IT system. Include the geography of affected users and functions if that is appropriate. Example statements of impact for IT systems might include

• Three thousand users in France and Italy will be unable to access customer records.

• All users in North America will be unable to read or send e-mail.

Statements of impact for business processes might cite the business functions that would be affected. Some example statements of impact include

• Accounts payable and accounts receivable functions will be unable to process.

• Legal department will be unable to access contracts and addendums.

Statements of impact for revenue-generating and revenue-supporting business functions could quantify financial impact per unit of time (be sure to use the same units of time for all functions so that they can be easily compared with one another). Some examples include

• Inability to place orders for appliances will cost at the rate of $1,200 per hour.

• Delays in payments will cost $45,000 per day in interest charges.

As statements of impact are gathered, it might make sense to create several columns in the main worksheet, so that like units (names of functions, numbers of users, financial figures) can be sorted and ranked later on.
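The worksheet idea can be sketched in a few lines of Python. This is only an illustration, not a prescribed BIA tool: the record layout, the field names, the process names, and the assumption that losses accrue around the clock (24 hours per day) are all hypothetical. Only the two dollar figures come from the examples above.

```python
from dataclasses import dataclass

# Hours per unit of time. Assumes losses accrue around the clock (an assumption).
HOURS_PER_UNIT = {"hour": 1, "day": 24, "week": 168}


@dataclass
class ImpactStatement:
    process: str   # name of the business process or system (hypothetical names)
    cost: float    # financial impact per unit of time, in dollars
    unit: str      # "hour", "day", or "week"

    def cost_per_hour(self) -> float:
        """Normalize the impact to dollars per hour so rows can be compared."""
        return self.cost / HOURS_PER_UNIT[self.unit]


def rank_by_impact(statements):
    """Return statements sorted from highest to lowest hourly impact."""
    return sorted(statements, key=lambda s: s.cost_per_hour(), reverse=True)


statements = [
    ImpactStatement("Appliance order entry", 1200, "hour"),
    ImpactStatement("Payment processing", 45000, "day"),
]

for s in rank_by_impact(statements):
    print(f"{s.process}: ${s.cost_per_hour():,.2f} per hour")
```

Once every row is converted to the same unit, the ranking falls out of a simple sort, which is the reason for insisting on one unit of time across all functions.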

When the BIA is completed, you’ll have the following information about each process and system:

• Name of the system or process

• Who is responsible for it

• A description of its function

• Dependencies on systems

• Dependencies on suppliers

• Dependencies on key employees

• Quantified statements of impact in terms of revenue, users affected, and/or functions impacted

You’re almost home.

Criticality Analysis

When all of the BIA information has been collected and charted, the criticality analysis (CA) can be performed.

The criticality analysis is a study of each system and process, a consideration of the impact on the organization if it is incapacitated, the likelihood of incapacitation, and the estimated cost of mitigating the risk or impact of incapacitation. In other words, it’s a somewhat special type of a risk analysis that focuses on key processes and systems.

The criticality analysis needs to include, or reference, a threat analysis. A threat analysis is a risk analysis that identifies every threat that has a reasonable probability of occurrence, plus mitigating controls or compensating controls, and new probabilities of occurrence with those mitigating/compensating controls in place. In case you’re having a little trouble imagining what this looks like (we’re writing the book and we’re having trouble seeing this!), take a look at Table 7-1, which is a very lightweight example of what I’m talking about.

Table 7-1 Example Threat Analysis Identifies Threats and Controls for Critical Systems and Processes

In the preceding threat analysis, notice a couple of things:

• Multiple threats are listed for a single asset. In the preceding example, I mentioned just eight threats. For all the threats but one, I listed only a single mitigating control. For the extended power outage threat, I listed two mitigating controls.

• Cost of downtime wasn’t listed. For systems or processes where you have a cost per unit of time for downtime, you’ll need to include it here, along with some calculations to show the payback for each control.

• Some mitigating controls can benefit more than one system. That may not have been obvious in this example, but in the case of a UPS (uninterruptible power supply) and electric generator, many systems can benefit, so the cost for these mitigating controls can be allocated across many systems, thereby lowering the cost for each system. Another example is a high-availability SAN (storage area network) located in two different geographic areas; while initially expensive, many applications can use the SAN for storage, and all will benefit from replication to the counterpart storage system.

• Threat probabilities are arbitrary. In Table 7-1, the probabilities were for a single occurrence in an entire year, so, for example, 5 percent means the threat is expected to be realized, on average, once every 20 years.

• The length of outage was not included. You may need to include this also, particularly if you are quantifying downtime per hour or other unit of time.
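To see how annual probability, length of outage, and cost of downtime might combine into a payback figure for a control, here is a rough Python sketch. Every number in it is hypothetical and none is taken from Table 7-1. The simple formula (probability times outage hours times cost per hour) is one common simplification, not the only way to do this.

```python
def expected_annual_loss(annual_probability, outage_hours, cost_per_hour):
    """Expected yearly loss: chance of the event in a year times the cost of one event."""
    return annual_probability * outage_hours * cost_per_hour


def annual_benefit_of_control(prob_before, prob_after, outage_hours, cost_per_hour):
    """Reduction in expected yearly loss once the mitigating control is in place."""
    before = expected_annual_loss(prob_before, outage_hours, cost_per_hour)
    after = expected_annual_loss(prob_after, outage_hours, cost_per_hour)
    return before - after


# Hypothetical figures for illustration only; none of these come from Table 7-1.
prob_before = 0.05            # 5 percent per year, about once every 20 years
prob_after = 0.01             # assumed probability with the control in place
outage_hours = 48             # assumed length of one outage
cost_per_hour = 1200          # assumed cost of downtime in dollars per hour
control_cost_per_year = 2000  # assumed annualized cost of the control

benefit = annual_benefit_of_control(prob_before, prob_after, outage_hours, cost_per_hour)
print(f"Expected annual benefit: ${benefit:,.0f}")
print(f"Worth funding on these numbers: {benefit > control_cost_per_year}")
```

When a control such as a UPS protects several systems, its annual cost could be divided among them before making this comparison.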

It is probably becoming obvious that a threat analysis, and the corresponding criticality analysis, can get pretty complicated. The rule here should be this: the complexity of the threat and criticality analyses should be proportional to the value of the assets (or revenue, or both). For example, in a company where application downtime is measured in thousands of dollars per minute, it’s probably worth taking a few man-weeks or even man-months to work out all of the likely scenarios and a variety of mitigating controls, and to work out which ones are the most cost-effective. On the other hand, for a system or business process where the impact of an outage is far less costly, a good deal less time can be spent on the supporting threat and criticality analysis.

NOTE Test-takers should ensure that any question dealing with BIA and CA places the business impact analysis first. Without this analysis, criticality analysis is impossible to evaluate in terms of likelihood or cost-effectiveness in mitigation strategies. The BIA identifies strategic resources and provides a value to their recovery and operation, which is in turn consumed in the criticality analysis phase. If presented with a question identifying BCP/DRP at a particular stage, make sure that any answers you select facilitate the BIA and then the CA before moving on toward objectives and strategies.

Establishing Key Targets

When the cost or impact of downtime has been established, and the cost and benefit of mitigating controls have been considered, some key targets can be established for each critical process. The two key targets are recovery time objective and recovery point objective.

Recovery Time Objective (RTO)

Recovery time objective (RTO) is the period from the onset of an outage until the resumption of service. RTO is usually measured in hours or days. Each process and system in the BIA should have an RTO value.

RTO does not mean that the system (or process) has been recovered to 100 percent of its former capacity. Far from it—in an emergency situation, management may determine that a DR (disaster recovery) server in another city with, say, 60 percent of the capacity of the original server is adequate. That said, an organization could establish two RTO targets, one for partial capacity and one for full capacity.

NOTE For a given organization, it’s probably best to use one unit of measure for all systems. That will help to avoid any errors that would occur during a rank-ordering of systems, so that two days does not appear to be a shorter period than four hours.
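The ranking mistake described in this note is easy to reproduce. The short Python sketch below uses two hypothetical systems whose RTOs were recorded in different units; the system names and values are made up for illustration.

```python
HOURS_PER_UNIT = {"hours": 1, "days": 24}


def rto_in_hours(value, unit):
    """Convert an RTO expressed in hours or days to hours."""
    return value * HOURS_PER_UNIT[unit]


# Hypothetical systems and RTO values, recorded in mixed units.
rtos = {
    "Order entry": (4, "hours"),
    "Data warehouse": (2, "days"),
}

# Sorting on the raw numbers puts "2 days" ahead of "4 hours".
naive = sorted(rtos, key=lambda name: rtos[name][0])

# Sorting on normalized hours gives the intended order.
ranked = sorted(rtos, key=lambda name: rto_in_hours(*rtos[name]))

print("Naive order:     ", naive)
print("Normalized order:", ranked)
```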

Further, a system that has been recovered in a disaster situation might not have 100 percent of its functionality. For instance, an application that lets users view transactions that are more than two years old may, in a recovery situation, only contain 30 days’ worth of data. Again, such a decision is usually the result of a careful analysis of the cost of recovering different features and functions in an application environment. In a larger, complex environment, some features might be considered critical, while others are less so.

CAUTION Senior management should be involved in any discussion related to recovery system specifications of less than 100 percent capacity or functionality.

Recovery Point Objective (RPO)

A recovery point objective (RPO) is the maximum period for which recent data may be irretrievably lost in a disaster. Like RTO, RPO is usually measured in hours or days. However, for critical transaction systems, RPO could even be measured in minutes.

RPO is usually expressed as a worst-case figure; for instance, the transaction processing system RPO will be two hours or less.

The value of a system’s RPO is a direct result of the frequency of backup or replication. For example, if an application server is backed up once per day, the RPO is going to be 24 hours (or one day, whichever way you like to express it). Maybe it will take three days to rebuild the server, but once data is restored from backup tape, no more than the last 24 hours of transactions are lost. In this case, the RTO is three days and the RPO is one day.
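The link between backup frequency and RPO can be written as a one-line calculation. The sketch below assumes that backups are evenly spaced through the day and that the most recent backup is always usable; both are simplifying assumptions, and the function name is made up for illustration.

```python
def worst_case_rpo_hours(copies_per_day):
    """Worst-case data loss, in hours, when copies are taken at even intervals."""
    if copies_per_day <= 0:
        raise ValueError("copies_per_day must be positive")
    return 24 / copies_per_day


print(worst_case_rpo_hours(1))   # one backup per day
print(worst_case_rpo_hours(2))   # two backups per day
```

Note that the time needed to rebuild the server does not appear anywhere in this calculation. That is the RTO, and it is determined separately.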

Publishing RTO and RPO Figures

If the storage system for an application takes a snapshot every hour, the RPO could be one hour, unless the storage system itself was damaged in a disaster. If the snapshot is replicated to another storage system four times per day, then the RPO might be better expressed as six hours.

The last example brings up an interesting point. There might not be one golden RPO figure for a given system. Instead, the severity of a disrupting event or a disaster will dictate the time to get systems running again (RTO) with a certain amount of data loss (RPO). Here are some examples:

• A server’s CPU or memory fails and is replaced and restarted in two hours. No data is lost. The RTO is two hours and the RPO is zero.

• The storage system supporting an application suffers a hardware failure that results in the loss of all data. Data is recovered from a snapshot on another server taken every six hours. The RPO is six hours in this case.

• The database in a transaction application is corrupted and must be recovered. Backups are taken twice per day. The RPO is 12 hours. However, it takes 10 hours to rebuild indexes on the database, so the RTO, which must cover both restoring the data and rebuilding the indexes, is closer to 22–24 hours, since the application cannot be returned to service until indexes are available.

NOTE When publishing RTO and RPO figures to customers, it’s best to publish the worst-case figures: “If our data center burns to the ground, our RTO is X hours and the RPO is Y hours.” Saying it that way would be simpler than publishing a chart that shows RPO and RTO figures for various types of disasters.
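Picking the worst-case figures from a set of scenarios is a simple maximum. The Python sketch below loosely follows the three examples given earlier; the storage-failure RTO of 8 hours is an assumed value, since the text does not give one, and the class and function names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    name: str
    rto_hours: float
    rpo_hours: float


def published_figures(scenarios):
    """Return the worst-case (largest) RTO and RPO across all scenarios."""
    worst_rto = max(s.rto_hours for s in scenarios)
    worst_rpo = max(s.rpo_hours for s in scenarios)
    return worst_rto, worst_rpo


# Values loosely follow the earlier examples. The storage-failure RTO is assumed,
# because the text does not state one.
scenarios = [
    Scenario("CPU or memory failure", rto_hours=2, rpo_hours=0),
    Scenario("Storage hardware failure", rto_hours=8, rpo_hours=6),
    Scenario("Database corruption", rto_hours=24, rpo_hours=12),
]

rto, rpo = published_figures(scenarios)
print(f"Published RTO: {rto} hours, published RPO: {rpo} hours")
```

The worst RTO and the worst RPO need not come from the same scenario, which is one more reason a single pair of published figures is easier for customers to read than a chart.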

Image

Table 7-2 The Lower the Recovery Time Objective (RTO), the Higher the Cost to Achieve It

Pricing RTO and RPO Capabilities

Generally speaking, the shorter the RTO or RPO for a given system, the more expensive it will be to achieve the target. Table 7-2 depicts a range of RTOs along with the technologies needed to achieve them and their relative cost.

The BCP project team needs to understand the relationship between the time required to recover an application and the cost required to recover the application within that time. A shorter recovery time is more expensive, and this relationship is not linear. For example, reducing the RPO from three days to six hours might double the equipment and software investment, or it might increase it eightfold. So many factors are involved in the supporting infrastructure for a given application that the BCP project team has to just knuckle down and develop the cost for a few different RTO and RPO figures.

The business value of the application itself is the primary driver in determining the amount of investment that senior management is willing to make to reach any arbitrary RTO and RPO figures. This business value may be measured in local currency if the application supports revenue. However, the loss of an application during a disaster may harm the organization’s reputation. Again, management will have to make a decision on how much it will be willing to invest in DR capabilities that bring RTO and RPO figures down to a certain level. Figure 7-7 illustrates these relationships.
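One way to reason about this trade-off is to price a handful of candidate RTO figures and compare the investment for each with the loss the organization expects to suffer at that recovery time. The sketch below assumes the "sweet spot" is the option with the lowest combined cost. Every number in it is hypothetical and exists only to show the calculation.

```python
# All figures below are hypothetical and exist only to show the calculation.
options = [
    # (RTO in hours, cost to achieve that RTO, estimated loss at that RTO)
    (1, 900_000, 20_000),
    (8, 400_000, 150_000),
    (24, 200_000, 450_000),
    (72, 80_000, 1_400_000),
]


def total_cost(option):
    """Investment plus estimated loss for one candidate RTO."""
    _rto, investment, loss = option
    return investment + loss


sweet_spot = min(options, key=total_cost)

for rto, investment, loss in options:
    print(f"RTO {rto:>2} h: investment {investment:>9,} + loss {loss:>9,} "
          f"= {investment + loss:>9,}")
print(f"Lowest combined cost at RTO = {sweet_spot[0]} hours")
```

Note that the investment does not fall in a straight line as the RTO grows, which is why the team has to price several points rather than extrapolate from one.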

Image

Figure 7-7 Aim for the sweet spot

Developing Recovery Strategies

When management has chosen specific RPO and RTO targets for a given system or process, the BCP project team can now roll up its sleeves and devise some ways to meet these targets. This section discusses the technologies and logistics associated with various recovery strategies. This will help the project team to decide which types of strategies are best suited for their organization.

NOTE Developing recovery strategies to meet specific recovery targets is an iterative process. The project team will develop a strategy to reach specific targets for a specific cost; senior management could well decide that the cost is too high and that they are willing to increase RPO and/or RTO targets accordingly. Similarly, the project team could also discover that it is less costly to achieve specific RPO and RTO targets, and management could respond by lowering those targets. This is illustrated in Figure 7-8.

Site Recovery Options

In a worst-case disaster scenario, the site where information systems reside is partially or completely destroyed. In most cases, the organization cannot afford to wait for the damaged or destroyed facility to be restored, as this could take weeks or months. If an organization can take that long to recover an application, you’d have to wonder whether it is needed at all. The assumption has got to be that in a disaster scenario, critical applications will be recovered in another location. This other location is called a recovery site. There are two dimensions to the process of choosing a recovery site: the first is the speed at which the application will be recovered at the recovery site; the second is the location of the recovery site itself. Both are discussed here.

Image

Figure 7-8 Recovery objective development flowchart

As you might expect, speed costs. If a system is to be recovered within a few minutes or hours, the costs will be much higher than if the system can be recovered in five days.

Various types of facilities are available for rapid or not-too-rapid recovery. These facilities are called hot sites, warm sites, and cold sites. As the names might suggest, hot sites permit rapid recovery, while cold sites provide a much slower recovery. The costs associated with these are somewhat proportional as well, as illustrated in Table 7-3.

The details about each type of site are discussed in the remainder of this section.

Hot Sites A hot site is an alternate processing center where backup systems are already running and in some state of near-readiness to assume production workload. The systems at a hot site most likely have application software and database management software already loaded and running, perhaps even at the same patch levels as the systems in the primary processing center.

A hot site is the best choice for systems whose RTO targets range from zero to several hours, perhaps as long as 24 hours.

A hot site may consist of leased rack space (or even a cage for larger installations) at a colocation center. If the organization has its own processing centers, then a hot site for a given system would consist of the required rack space to house the recovery systems. Recovery servers will be installed and running, with the same version and patch level for the operating system, database management system (if used), and application software.

Systems at a hot site require the same level of administration and maintenance as the primary systems. When patches or configuration changes are made to primary systems, they should be made to hot-site systems at the same time or very shortly afterwards.

Because systems at a hot site need to be at or very near a state of readiness, a strategy needs to be developed regarding a method for keeping the data on hot standby systems current. This is discussed in detail in the later section, “Recovery and Resilience Technologies.”

Systems at a hot site should have full network connectivity. A method for quickly directing network traffic toward the recovery servers needs to be worked out in advance so that a switchover can be accomplished. This is also discussed in the “Recovery and Resilience Technologies” section.

Image

Table 7-3 Relative Costs of Recovery Sites

When setting up a hot site, the organization will need to send one or more technical staff members to the site to set up systems. But once the systems are operating, much or all of the system- and database-level administration can be performed remotely. However, in a disaster scenario, the organization may need to send the administrative staff to the site for day-to-day management of the systems. This means that workspace for these personnel needs to be identified so that they can perform their duties during the recovery operation.

NOTE Hot-site planning needs to consider work (desk) space for on-site personnel. Some colocation centers provide limited work areas, but these areas are often shared and often have little privacy for phone discussions. Also, transportation, hotel, and dining accommodations need to be arranged, possibly in advance, if the hot site is in a different city from the primary site.

Warm Sites A warm site is an alternate processing center where recovery systems are present, but at a lower state of readiness than recovery systems at a hot site. For example, while the same version of the operating system may be running on the warm site system, it may be a few patch levels behind primary systems. The same could be said about the versions and patch levels of database management systems (if used) and application software: they may be present, but they’re not as up-to-date.

A warm site is appropriate for an organization whose RTO figures range from roughly one to seven days. In a disaster scenario, recovery teams would travel to the warm site and bring the recovery systems up-to-date with patches and configuration changes so that they reach a state of production readiness.

A warm site is also used when the organization is willing to take the time necessary to recover data from tape or other backup media. Depending upon the size of the database(s), this recovery task can take several hours to a few days.
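A rough estimate of restore time helps decide whether recovery from backup media can fit within a given RTO. The following is a back-of-the-envelope sketch only: the function names are hypothetical, the throughput and database size are made-up inputs, and the estimate ignores media handling, index rebuilds, and verification, all of which add time in practice.

```python
def restore_hours(data_gb: float, throughput_mb_per_s: float) -> float:
    """Estimate hours needed to read data_gb from backup media.

    Uses 1 GB = 1,000 MB and counts only the raw transfer time.
    """
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    seconds = data_gb * 1000 / throughput_mb_per_s
    return seconds / 3600


def restore_fits_rto(data_gb, throughput_mb_per_s, rto_hours):
    """True if the estimated transfer time alone is within the RTO."""
    return restore_hours(data_gb, throughput_mb_per_s) <= rto_hours


# Hypothetical example: a 20 TB database restored at 100 MB/s.
hours = restore_hours(20_000, 100)
print(f"Estimated restore time: {hours:.1f} hours")
print("Fits a 7-day RTO:", restore_fits_rto(20_000, 100, 7 * 24))
print("Fits a 24-hour RTO:", restore_fits_rto(20_000, 100, 24))
```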

The primary advantage of a warm site is that its costs are lower than for a hot site, particularly in the effort required to keep the recovery system up-to-date. The site may not require expensive data replication technology, but instead data can be recovered from backup media.

Cold Sites A cold site is an alternate processing center where the degree of readiness for recovery systems is low. At its most basic, a cold site is nothing more than an empty rack, or just allocated space on a computer room floor. It’s just an address in someone’s data center or colocation site where computers can be set up and used at some future date.

Often, there is little or no equipment at a cold site. When a disaster or other highly disruptive event occurs in which the outage is expected to exceed 7 to 14 days, the organization will order computers from a manufacturer, or perhaps have computers shipped from some other business location, so that they can arrive at the cold site soon after the disaster event has begun. Then personnel would travel to the site and set up the computers, operating systems, databases, network equipment, and so on, and get applications running within several days.

The advantage of a cold site is its low cost. The main disadvantage is the cost, time, and effort required to bring it to operational readiness. But for some organizations, a cold site is exactly what is needed.

Table 7-4 shows a comparison of hot, warm, and cold recovery sites and a few characteristics of each.
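The RTO ranges given in this section can be turned into a simple first-pass rule. The sketch below is a rough guide under stated assumptions, not a substitute for the cost analysis: the function name is hypothetical, and the cutoffs (24 hours for hot, seven days for warm) are taken from the approximate ranges in the text, which themselves overlap.

```python
def suggest_site_type(rto_hours: float) -> str:
    """Map an RTO target to a recovery site type using the text's ranges."""
    if rto_hours < 0:
        raise ValueError("RTO cannot be negative")
    if rto_hours <= 24:        # zero to roughly 24 hours
        return "hot"
    if rto_hours <= 7 * 24:    # roughly one to seven days
        return "warm"
    return "cold"              # longer than about a week


for rto in (2, 24, 72, 240):
    print(f"RTO {rto:>3} hours -> {suggest_site_type(rto)} site")
```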

Mobile Sites A mobile site is a portable recovery center that can be delivered to almost any location in the world. A viable alternative to a fixed location recovery site, a mobile site can be transported by semitruck, and may even have its own generator, communications, and cooling capabilities.

APC and SunGard have mobile sites installed in semitruck trailers. Sun Microsystems has mobile sites that can include a configurable selection of servers and workstations, all housed in shipping containers that can be shipped by truck, rail, ship, or air to any location in the world.

Reciprocal Sites A reciprocal recovery site is a data center that is operated by another company. Two or more organizations with similar processing needs will draw up a legal contract that obligates one or more of the organizations to temporarily house another party’s systems in the event of a disaster.

Often, a reciprocal agreement pledges not only floor space in a data center, but also the use of the reciprocal partner’s computer system. This type of arrangement is less common, but is still used by organizations that use mainframe computers and other high-cost systems.

NOTE With the wide use of Internet colocation centers, reciprocal sites have fallen out of favor. Still, they may be ideal for organizations with mainframe computers that are otherwise too expensive to deploy to a cold or warm site.

Geographical Site Selection An important factor in the process of recovery site selection is the location of the recovery site. The distance between the main processing site and the recovery site is vital and may figure heavily into the viability and success of a recovery operation.

Image

Table 7-4 Detailed Comparison of Cold, Warm, and Hot Sites

A recovery site should not be located in the same geographic region as the primary site. A recovery site in the same region may be involved in the same regional disaster as the primary site and may be unavailable for use or be suffering from the same problems present at the primary site.

By “geographic region” I mean a location that will likely experience the effects of the same regional disaster that affects the primary site. No arbitrarily chosen distance (such as 100 miles) guarantees sufficient separation. In some locales, 50 miles is plenty of distance; in other places, 300 miles is too close. Information on regional disasters should be available from local disaster preparedness authorities or from local disaster recovery experts.
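Because distance alone is not a reliable test, one way to compare candidate sites is by the regional hazards they share with the primary site. The sketch below illustrates that idea with set intersection. The site names and hazard labels are entirely hypothetical; real hazard profiles would come from the authorities and experts mentioned above.

```python
# Hypothetical hazard profiles, invented for this illustration.
site_hazards = {
    "primary": {"earthquake-zone-A", "coastal-storm-B"},
    "candidate-near": {"earthquake-zone-A"},
    "candidate-far": {"river-flood-C"},
}


def shared_hazards(site_a: str, site_b: str) -> set:
    """Return the regional hazards that both sites are exposed to."""
    return site_hazards[site_a] & site_hazards[site_b]


for candidate in ("candidate-near", "candidate-far"):
    overlap = shared_hazards("primary", candidate)
    if overlap:
        verdict = "shares " + ", ".join(sorted(overlap))
    else:
        verdict = "no shared hazards"
    print(f"{candidate}: {verdict}")
```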

Recovery and Resilience Technologies

Once recovery targets have been established, the next major task is the survey and selection of technologies to enable recovery time and recovery point objectives to be met. The important factors when considering each technology are

• Does the technology help the information system achieve the RTO and RPO targets?

• Does the cost of the technology fit within budget constraints?

• Can the technology be used to benefit other information systems (thereby lowering the cost for each system)?

• Does the technology fit well into the organization’s current IT operations?

• Will operations staff require specialized training on the technology?

• Does the technology contribute to the simplicity of the overall IT architecture, or does it complicate it unnecessarily?

These questions are designed to help determine whether a specific technology is a good fit, from a technology as well as from process and operational perspectives.

RAID Redundant Array of Independent Disks (RAID) is a family of technologies that is used to improve the reliability, performance, or size of disk-based storage systems. From a disaster recovery or systems resilience perspective, the feature of RAID that is of particular interest is the characteristic of reliability. RAID is used to create virtual disk volumes over an array of disk storage devices and can be configured so that the failure of any individual disk drive in the array will not affect the availability of data on the disk array.

RAID is usually implemented on a hardware device called a disk array, which is a chassis in which several hard disks can be installed and connected to a server. The individual disk drives can be “hot swapped” in the chassis while the array is still operating. When the array is configured with RAID, a failure of a single disk drive will have no effect on the disk array’s availability to the server to which it is connected. A system operator can be alerted to the disk’s failure, and the defective disk drive can be removed and replaced while the array is still fully operational.

There are several options for RAID configuration, called levels:

• RAID-0 This is known as a striped volume, where a disk volume splits data evenly across two or more disks in order to improve performance.

• RAID-1 This creates a mirror, where data written to one disk in the array is also written to a second disk in the array. RAID-1 makes the volume more reliable, through the preservation of data even when one disk in the array fails.

• RAID-4 This level of RAID employs data striping at the block level and adds a dedicated parity disk. The parity disk permits the rebuilding of data in the event one of the other disks fails.

• RAID-5 This is similar to RAID-4 block-level striping, except that the parity data is distributed evenly across all of the disks instead of dedicated on one disk. Like RAID-4, RAID-5 allows for the failure of one disk without losing information.

• RAID-6 This is an extension of RAID-5, where two parity blocks are used instead of a single parity block. The advantage of RAID-6 is that it can withstand the failure of any two disk drives in the array, instead of a single disk, as is the case with RAID-5.
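The way a single parity block permits rebuilding a failed disk can be shown with the exclusive-OR (XOR) operation, which is how single parity is commonly computed. The following is a toy illustration on short byte strings, not a model of any particular disk array; the helper name and the sample data are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks, as they might be striped across three disks.
data_disks = [b"ACCT", b"0042", b"PAID"]

# The parity block is the XOR of all data blocks.
parity = xor_blocks(data_disks)

# Simulate losing the second disk, then rebuild its contents from the
# surviving data blocks plus the parity block.
surviving = [data_disks[0], data_disks[2], parity]
rebuilt = xor_blocks(surviving)

print(rebuilt)  # b'0042'
```

The same calculation cannot recover two missing blocks from one parity block, which is why single-parity levels tolerate only one failed disk.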

NOTE Several nonstandard RAID levels are developed by various hardware and software companies. Some of these are extensions of RAID standards, while others are entirely different.

Storage systems are hardware devices that are entirely separate from servers—their only purpose is to store a large amount of data and to be highly reliable through the use of redundant components and the use of one or more RAID levels. Storage systems generally come in two forms:

• Storage Area Network (SAN) This is a stand-alone storage system that can be configured to contain several virtual volumes and connected to several servers through fiber optic cables. The servers’ operating systems will consider this storage to be “local,” as though it consisted of one or more hard disks present in the server’s own chassis.

• Network Attached Storage (NAS) This is a stand-alone storage system that contains one or more virtual volumes. Servers access these volumes over the network using the Network File System (NFS) or Server Message Block/Common Internet File System (SMB/CIFS) protocols, common on Unix and Windows operating systems, respectively.

Replication Replication is an activity where data that is written to a storage system is also copied over a network and written to another storage system. The result is up-to-date data on two or more storage systems, each of which could be located in a different geographic region.

Replication can be handled in several ways and at different levels in the technology stack:

• Disk storage system Data write operations that take place in a disk storage system (such as a SAN or NAS) can be transmitted over a network to another disk storage system, where the same data will be written.

• Operating system The operating system can control replication so that updates to a particular file system can be transmitted to another server, where those updates will be applied locally.

• Database management system The database management system (DBMS) can manage replication by sending transactions to a DBMS on another server.

• Transaction management system The transaction management system (TMS) can manage replication by sending transactions to a counterpart TMS located elsewhere.

• Application The application can write its transactions to two different storage systems. This method is not often used.

Replication can take place from one system to another system, called primary-backup replication. This is the typical setup when data on an application server is sent to a distant storage system for data recovery or disaster recovery purposes.

Replication can also be bi-directional, between two active servers, called multiprimary or multimaster. This method is more complicated, because simultaneous transactions on different servers could conflict with one another (such as two reservation agents trying to book a passenger in the same seat on an airline flight). Some form of concurrent transaction control would be required, such as a distributed lock manager.

In terms of the speed and integrity of replicated information, there are two types of replication:

• Synchronous replication Here, writing data to a local and to a remote storage system are performed as a single operation, guaranteeing that data on the remote storage system is identical to data on the local storage system. Synchronous replication incurs a performance penalty, as the speed of the entire transaction is slowed to the rate of the remote transaction.

• Asynchronous replication Writing data to the remote storage system is not kept in sync with updates on the local storage system. Instead, there may be a time lag, and you have no guarantee that data on the remote system is identical to that on the local storage system. However, performance is improved, because transactions are considered complete when they have been written to the local storage system only. Bursts of local updates to data will take a finite period to replicate to the remote server, subject to the available bandwidth of the network connection between the local and remote storage systems.
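The difference between the two modes can be modeled in a few lines. The class below is a toy, in-memory sketch with hypothetical names; it ignores networks, failures, and ordering guarantees, and shows only when the remote copy is updated relative to the local write.

```python
from collections import deque


class ReplicatedStore:
    """Toy model of a local store that replicates writes to a remote store."""

    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.local = {}
        self.remote = {}
        self.pending = deque()  # writes not yet applied remotely (async only)

    def write(self, key, value):
        """Write locally; replicate now (sync) or queue for later (async)."""
        self.local[key] = value
        if self.synchronous:
            # The write is not complete until the remote copy is updated.
            self.remote[key] = value
        else:
            # The write completes immediately; the remote copy lags behind.
            self.pending.append((key, value))

    def drain(self):
        """Apply queued writes to the remote store (async catch-up)."""
        while self.pending:
            key, value = self.pending.popleft()
            self.remote[key] = value


sync_store = ReplicatedStore(synchronous=True)
sync_store.write("order-1", "paid")
print(sync_store.local == sync_store.remote)    # True

async_store = ReplicatedStore(synchronous=False)
async_store.write("order-1", "paid")
print(async_store.local == async_store.remote)  # False until drained
async_store.drain()
print(async_store.local == async_store.remote)  # True
```

Whatever sits in the pending queue when a disaster strikes is the data that would be lost, which is how asynchronous replication translates into a nonzero RPO.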

NOTE Replication is often used for applications where the recovery time objective (RTO) is smaller than the time necessary to recover data from backup media. For example, if a critical application’s RTO is established to be two hours, then recovery from backup tape is probably not a viable option, because restoring a sizable data set from tape typically takes longer than that. While more expensive than recovery from backup media, replication ensures that up-to-date information is present on a remote storage system that can be put online in a short period.

Server Clusters A cluster is a group of two or more servers that appears as a single server resource. Clusters are often the technology of choice for applications that require a high degree of availability and a very small recovery time objective (RTO), measured in minutes.

When an application is implemented on a cluster, even if one of the servers in the cluster fails, the other server (or servers) in the cluster will continue to run the application, usually with no user awareness that such a failure occurred.

There are two typical configurations for clusters, active/active and active/passive. In active/active mode, all servers in the cluster are running and servicing application requests. This is often used in high-volume applications where many servers are required to service the application workload.

In active/passive mode, one or more servers in the cluster are active and servicing application requests, while one or more servers in the cluster are in a “standby” mode; they can service application requests, but won’t do so unless one of the active servers fails or goes offline for any reason. When an active server goes offline and a standby server takes over, this event is called a failover.
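Failover in an active/passive cluster can be sketched as follows. This is a toy model with hypothetical class and node names; real cluster software also handles health checks, shared storage, and network address takeover, none of which is shown here.

```python
class Cluster:
    """Toy active/passive cluster: one active node, the rest on standby."""

    def __init__(self, nodes):
        if not nodes:
            raise ValueError("a cluster needs at least one node")
        self.healthy = list(nodes)  # the first healthy node is the active one

    @property
    def active(self):
        return self.healthy[0] if self.healthy else None

    def fail(self, node):
        """Mark a node as failed; if it was active, a standby takes over."""
        was_active = node == self.active
        self.healthy.remove(node)
        if was_active and self.healthy:
            print(f"failover: {node} -> {self.active}")

    def handle(self, request):
        """Serve a request on whichever node is currently active."""
        if self.active is None:
            raise RuntimeError("no healthy nodes remain")
        return f"{self.active} served {request}"


cluster = Cluster(["app-1", "app-2"])
print(cluster.handle("GET /balance"))  # app-1 served GET /balance
cluster.fail("app-1")                  # failover: app-1 -> app-2
print(cluster.handle("GET /balance"))  # app-2 served GET /balance
```

The caller uses the same `handle` call before and after the failure, which mirrors the point that users are usually unaware a failover occurred.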

A typical server cluster architecture is shown in Figure 7-9.

Image

Figure 7-9 Application and database server clusters

A server cluster is typically implemented in a single physical location such as a data center. However, a cluster can also be implemented where great distances separate the servers in the cluster. This type of cluster is called a geographic cluster, or geo-cluster. Servers in a geo-cluster are connected through a wide-area network (WAN) connection. A typical geographic cluster architecture is shown in Figure 7-10.

Network Connectivity and Services An overall application environment that is required to be resilient and have recoverability must have those characteristics present within the network that supports it. A highly resilient application architecture that includes clustering and replication would be of little value if it had only a single network connection that was a single point of failure.

An application that requires high availability and resilience may require one or more of the following in the supporting network:

• Redundant network connections These may include multiple network adapters on a server, but also a fully redundant network architecture with multiple switches, routers, load balancers, and firewalls. This could also include physically diverse network provider connections, where network service provider feeds enter the building from two different directions.

• Redundant network services Certain network services are vital to the continued operation of applications, such as DNS (Domain Name System, which translates server names like www.mcgraw-hill.com into IP addresses), NTP (Network Time Protocol, used to synchronize computer time clocks), SMTP (Simple Mail Transfer Protocol), SNMP (Simple Network Management Protocol), authentication services, and perhaps others. These services are usually operated on servers, which may require clustering and/or replication of their own, so that the application will be able to continue functioning in the event of a disaster.

Backup and Restoration

Disasters and other disruptive events can damage information and information systems. It’s essential that fresh copies of this information exist elsewhere and in a form that enables IT personnel to easily load this information into alternative systems so that processing can resume as quickly as possible.

Image

Figure 7-10 Geographic cluster with data replication

NOTE Testing backups is important; testing recoverability is critical. In other words, performing backups is only valuable to the extent that backed-up data can be recovered at a future time.
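A recoverability test can be as simple as restoring a backup to a scratch location and confirming that the restored data matches what was backed up. The sketch below does this with file copies and SHA-256 checksums. The file names and function names are hypothetical, and the copy operations are stand-ins for real backup and restore jobs. In an actual disaster the original would be gone, so the comparison would be made against a checksum recorded at backup time.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_is_verified(original: Path, backup: Path, restore_to: Path) -> bool:
    """Restore the backup copy and confirm it matches the original."""
    shutil.copyfile(backup, restore_to)  # stand-in for a real restore job
    return sha256_of(restore_to) == sha256_of(original)


with tempfile.TemporaryDirectory() as workdir:
    work = Path(workdir)
    original = work / "ledger.db"
    backup = work / "ledger.db.bak"
    restored = work / "ledger.db.restored"

    original.write_bytes(b"transaction records")
    shutil.copyfile(original, backup)  # stand-in for a backup job
    print("Good backup verified:", restore_is_verified(original, backup, restored))

    backup.write_bytes(b"corrupted")   # simulate damaged backup media
    print("Damaged backup verified:", restore_is_verified(original, backup, restored))
```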

Backup to Tape and Other Media Tape backup is just about as ubiquitous as power cords. From a disaster recovery perspective, however, the issue probably is not whether the organization has tape backup, but whether its current backup capabilities are adequate in the context of disaster recovery. An organization’s backup capability may need to be upgraded if:

• The current backup system is difficult to manage.

• Whole-system restoration takes too long.

• The system lacks flexibility with regard to disaster recovery (for instance, how difficult it would be to recover information onto a different type of system).

• The technology is old or outdated.

• Confidence in the backup technology is low.

Many organizations view tape backup as a means for restoring files or databases when errors have occurred, and they may have confidence in their backup systems for that purpose. However, they may have somewhat less confidence in the ability of those backup systems to recover all of their critical systems accurately and in a timely manner.

Tape is not the only medium for backups. While tape has been the default medium since the 1960s, using a hard disk as a backup medium is growing in popularity: hard disk transfer rates are far higher, and disk is a random-access medium, whereas tape is a sequential-access medium.

E-vaulting is another viable option for system backup. E-vaulting permits organizations to back up their systems and data to an off-site location, which could be a storage system in another data center or a third-party service provider. This accomplishes two important objectives: reliable backup and off-site storage of backup media.

Backup Media Off-Site Storage Backup media that remains in the same location as backed-up systems is adequate for data recovery purposes, but completely inadequate for disaster recovery purposes: any event that physically damages information systems (such as fire, flood, hazardous chemical spill, and so on) is likely to also damage backup media. To provide disaster recovery protection, backup media must be stored off-site in a secure location. Selection of this storage location is as important as the selection of a primary business location: in the event of a disaster, the survival of the organization may depend upon the protection measures in place at the off-site storage location.

NOTE CISA exam questions relating to off-site backups may include details for safeguarding data during transport and storage, mechanisms for access during restoration procedures, media aging and retention, or other details that may aid you during the exam. Watch for question details involving the type of media, geo-locality (distance, shared disaster spectrum [such as a shared coastline], and so on) of the off-site storage area and the primary site, or access controls during transport and at the storage site, including environmental controls and security safeguards.

The criteria for selection of an off-site media storage facility are similar to the criteria for selection of a hot/warm/cold recovery site discussed earlier in this chapter. If a media storage location is too close to the primary processing site, then it is more likely to be involved in the same regional disaster, which could result in damage to backup media. However, if the media storage location is too far away, then it might take too long for a delivery of backup media, which would result in a recovery operation that runs unacceptably long.

Another location consideration is the proximity of the media storage location and the hot/warm/cold recovery site. If a hot site is being used, then chances are there is some other near-real-time means (such as replication) for data to get to the hot site. But a warm or cold site may be relying on the arrival of backup media from the off-site media storage facility, so it might make sense for the off-site facility to be near the recovery site.

An important factor when considering off-site media storage is the method of delivery to and from the storage location. Chances are that the backup media is being transported by a courier or a shipping company. It is vital that the backup media arrive safely and intact, and that the opportunities for interception or loss be reduced as much as possible. Not only can a lost backup tape make recovery more difficult, but it can also cause an embarrassing security incident if knowledge of the loss were to become public. From a confidentiality/integrity perspective, encryption of backup tapes is a good idea, although this digresses somewhat from disaster recovery (concerned primarily with availability). Backup tape encryption is discussed in Chapter 6.

NOTE The requirements for off-site storage are a little less critical than for a hot/warm/cold recovery site. All you have to do is be able to get your backup media out of that facility. This can occur even if there is a regional power outage, for instance.

Developing Recovery and Continuity Plans

In the previous section, I discussed establishing recovery targets and developing the architectures, processes, and procedures needed to meet them. Those processes and procedures cover the day-to-day operation of the new technologies. Once they have been completed, the disaster recovery plans and procedures (those actions that will take place during and immediately after a disaster) can be developed.

For example, an organization has established RPO and RTO targets for its critical applications. These targets necessitated the development of server clusters and storage area networks with replication. While implementing those new technologies, the organization developed the operations processes and procedures in support of those new technologies that would be carried out every day during normal business operations. As a separate activity, the organization would then develop the procedures to be performed when a disaster strikes the primary operations center for those applications; those procedures would include all of the steps that must be taken so that the applications can continue operating in a warm site or hot site location.

The procedures for operating critical applications during a disaster are a small part of the entire body of procedures that must be developed. Several other sets of procedures must also be developed, including

• Evacuation procedures

• Disaster declaration procedures

• Responsibilities

• Contact information

• Recovery procedures

• Continuing operations

All of these are required so that an organization will be adequately prepared in the event a disaster occurs.

Evacuation Procedures

When a disaster strikes, measures to ensure the safety of personnel need to be taken immediately. If the disaster has occurred or is about to occur to a building, personnel need to be evacuated as soon as possible. Arguably, however, in some situations evacuation is exactly the wrong thing to do; for example, if a hurricane or tornado is bearing down on a facility, then the building itself may be the best shelter for personnel, even if it incurs some damage. The point here is that evacuation procedures need to be carefully developed, and possibly more than one set of evacuation procedures will be needed, depending on the event.

NOTE The highest priority in any disaster or emergency situation is the safety of human life.

Evacuation procedures need to take many factors into account, including

• Ensuring that all personnel are familiar with evacuation procedures

• Ensuring that visitors will know how to evacuate the premises

• Posting signs and placards that indicate emergency evacuation routes and gathering areas outside of the building

• Emergency lighting to aid in evacuation

• Fire extinguishment equipment (portable fire extinguishers, and so on)

• The ability to communicate with public safety and law enforcement authorities, including in situations where communications and electric power have been cut off, and when all personnel are outside of the building

• Care for injured personnel

• CPR and emergency first-aid training

• Safety personnel who can assist evacuation of injured and disabled persons

• The ability to account for visitors and other non-employees

• Emergency shelter in extreme weather conditions

• Emergency food and drinking water

• Periodic tests to ensure that evacuation procedures will be adequate in the event of a real emergency

Local emergency management organizations may have additional information available that can assist an organization with its emergency evacuation procedures.

Disaster Declaration Procedures

Disaster response procedures are initiated when a disaster is declared. However, there needs to be a procedure for the declaration itself, so that there will be little doubt as to the conditions that must be present.

Why is a disaster declaration procedure required? Primarily, because it’s not always clear whether a situation is a real disaster. Sure, a magnitude 7.5 earthquake or a major fire is a disaster, but overcooked popcorn in the microwave that sets off a building’s fire alarm system might not be. Many “in between” situations may or may not be disasters. A disaster declaration procedure must state some basic conditions that will help determine whether a disaster should be declared.

Further, who has the authority to declare a disaster? What if senior management personnel frequently travel and may not be around? Who else can declare a disaster? And, finally, what does it mean to declare a disaster—and what happens next?

Form a Core Team For the declaration procedure to be effective and workable, a core team of personnel needs to be established, all of whom will be familiar with the disaster declaration procedure, as well as the actions that must take place once a disaster has been declared. This core team should consist of middle and upper managers who are familiar with business operations, particularly those that are critical. This core team must be large enough so that a requisite few of them are on-hand when a disaster strikes. In organizations that have second shifts, third shifts, and weekend work, some of the core team members should be those in supervisory positions during those off-hours times. However, some of the core team members can be personnel who work “business hours” and are not on-site all of the time.

Declaration Criteria The declaration procedure must contain some tangible criteria that a core team member can consult to guide him or her down the “is this a disaster” decision path.

The criteria for declaring a disaster should be related to the availability and viability of ongoing critical business operations. Some example criteria include any one or more of the following:

• Forced evacuation of a building containing or supporting critical operations that is likely to last for more than four hours

• Hardware, software, or network failures that result in a critical IT system being incapacitated or unavailable for more than four hours

• Any security incident that results in a critical IT system being incapacitated for more than four hours (security incidents could involve malware, break-in, attack, sabotage, and so on)

• Any event causing employee absenteeism or supplier shortages that, in turn, results in one or more critical business processes being incapacitated for more than eight hours

• Any event causing a communications failure that results in critical IT systems being unreachable for more than four hours

The preceding examples are a mostly complete list of criteria for many organizations. The periods will vary from organization to organization. For instance, a large, pure-online business such as Amazon.com would probably declare a disaster if its main web sites were unavailable for more than a few minutes. But in an organization where computers are far less critical, an outage of four hours might not be considered a disaster.
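The example criteria can be read as a table of event categories and time thresholds. The following Python sketch shows one way such a table could be checked against a situation. The category names and the `criteria_met` function are illustrative assumptions, and the thresholds simply mirror the example list; each organization would substitute its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One declaration criterion: an event category and an outage threshold."""
    category: str
    threshold_hours: float


# Thresholds mirror the example criteria in the text; the category names
# are hypothetical labels chosen for this sketch.
CRITERIA = [
    Criterion("building_evacuation", 4),
    Criterion("it_failure", 4),
    Criterion("security_incident", 4),
    Criterion("absenteeism_or_supply_shortage", 8),
    Criterion("communications_failure", 4),
]


def criteria_met(expected_outage_hours: dict[str, float]) -> list[str]:
    """Return the categories whose expected outage exceeds the threshold."""
    return [
        c.category
        for c in CRITERIA
        if expected_outage_hours.get(c.category, 0) > c.threshold_hours
    ]


# A six-hour IT outage meets a criterion; five hours of absenteeism does not.
situation = {"it_failure": 6, "absenteeism_or_supply_shortage": 5}
print(criteria_met(situation))  # ['it_failure']
```

Any one category in the result would be enough to consider a declaration, consistent with "any one or more of the following."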

Pulling the Trigger When disaster declaration criteria are met, the disaster should be declared. The procedure for disaster declaration could permit any single core team member to declare the disaster, but it may be better to have two or more core team members agree on whether a disaster should be declared. Whether to use a single-person declaration or a group of two or more is each organization’s choice.

All core team members empowered to declare a disaster should have the procedure on-hand at all times. In most cases, the criteria should fit on a small, laminated wallet card that each team member can have with him or her or nearby at all times. For organizations that use the consensus method for declaring a disaster, the wallet card should include the names and contact numbers for other core team members, so that each will have a way of contacting others.
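The choice between single-person and consensus declaration amounts to choosing how many core team members must agree. A minimal Python sketch, in which the function name, the member names, and the default of two are all assumptions made for illustration:

```python
def disaster_declared(votes: dict[str, bool], required: int = 2) -> bool:
    """Return True once at least `required` core team members agree.

    `votes` maps a core team member's name to whether that member judges
    the declaration criteria to be met. Setting required=1 models the
    single-person declaration method.
    """
    agreeing = sum(1 for agrees in votes.values() if agrees)
    return agreeing >= required


votes = {"shift_supervisor": True, "it_director": False, "operations_vp": True}
print(disaster_declared(votes))              # True
print(disaster_declared(votes, required=3))  # False
```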

Next Steps Declaring a disaster will trigger the start of one or more other response procedures, but not necessarily all of them. For instance, if a disaster is declared because of a serious computer or software malfunction, there is no need to evacuate the building. While this example may be obvious, not all instances will be this clear. Either the disaster declaration procedure itself, or each of the subsequent response procedures, should contain criteria that will help determine which response procedures should be enacted.

False Alarms Probably the most common reason personnel fail to declare a disaster is the fear that the situation is not a real disaster. Core team members empowered to declare a disaster should not hesitate unnecessarily. Instead, they can convene with other core team members to reach a firm decision, provided this can be done quickly.

If a disaster has been declared, and later it is clear that a disaster has been averted (or did not exist in the first place), the disaster can simply be called off and declared to be over. Response personnel can be contacted and told to cease response activities and return to their normal activities.

Responsibilities

During a disaster, many important tasks must be performed to evacuate personnel, assess damage, recover critical processes and systems, and carry out many other functions that are critical to the survival of the enterprise.

About 20 different responsibilities are described here. In a large organization, each responsibility may be staffed with a team of two, three, or many individuals. In small organizations, a few people may take on many responsibilities each, switching from role to role as the situation warrants.

These roles will be staffed by whoever is available at the time. It is important to remember that many of the “ideal” persons to fill each role will be unavailable during a disaster for several reasons, including

• Injured, ill, or deceased Some regional disasters will inflict widespread casualties that will include some proportion of response personnel. Those who are injured, ill (in the case of a pandemic, for instance, or who are recovering from a sickness or surgery when the disaster occurs), or who are killed by the disaster are clearly not going to be showing up to help out.

• Caring for family members Some types of disasters may cause widespread injury or require mass evacuation. In some of these situations, many personnel will be caring for family members whose immediate needs for safety will take priority over the needs of the workplace.

• Transportation unavailable Many types of disasters include localized or widespread damage to transportation infrastructure, which may result in many persons who are willing to be on-site to help with emergency operations being unable to get to the work site.

• Out of the area Some disaster response personnel may be away on business travel or on vacation, and be unable to respond. However, some persons being away may actually be opportunities in disguise; unaffected by the physical impact of the disaster, they may be able to help out in other ways, such as communications with suppliers, customers, or other personnel.

• Communications Some types of disasters, particularly those that are localized (versus widespread and obvious to an observer), require that disaster response personnel be contacted and asked to help. If a disaster strikes after hours, some personnel may be unreachable if they are engaged in any activity where they do not have a mobile phone with them or are out of range.

• Fear Some types of disasters (such as pandemic, terrorist attack, flood, and so on) may instill fear for safety on the part of response personnel who will resist the call to help and stay away from the work site.

NOTE Response personnel in all disciplines and responsibilities will need to be able to piece together whatever functionality they are called on to do, using whatever resources are available—this is part art form and part science. While response and contingency plans may make certain assumptions, personnel may find themselves with fewer resources than planned, requiring them to do the best they can with the resources available.

Each function will be working with personnel in many other functions, often working with unfamiliar persons. An entire response and recovery operation may be operating almost like a brand-new organization in unfamiliar settings and with an entirely new set of playing rules. In typical organizations, teams work well when team members are familiar with, and trust, one another. In a response and recovery operation, the stress level is much higher because the stakes—company survival—are higher, and often the teams are composed of persons who have little experience with each other in these new roles. This will cause additional stress that will bring out the best and worst in people, as illustrated in Figure 7-11.

Emergency Response These are the “first responders” during a disaster. Top priorities include evacuation of personnel, first aid, triage of injured personnel, and possibly, firefighting.

Figure 7-11 Stress is compounded by the pressure of disaster recovery and the formation of new teams in times of chaos.

Command and Control (Emergency Management) During disaster response operations, someone has to be in charge. In a disaster, resources may be scarce, and many matters vie for attention. Someone needs to fill the role of decision maker to keep disaster response activities moving and to handle situations that arise. This role may need to be rotated among various personnel, particularly in smaller organizations, to counteract fatigue.

NOTE Although the first person on the scene may be the person in charge initially, that will definitely change as qualified assigned personnel show up and take charge, and as the nature of the disaster and response solidifies. The leadership roles may then be passed among key personnel already designated to be in charge.

Scribe It’s vital that one or more persons continually document the important events during disaster response operations. From decisions to discussions to status to roll call, these events must be written down so that the details of disaster response can be pieced together afterward. This will help the organization better understand how disaster response unfolded, how decisions were made, and who performed which actions, all of which will help the organization be better prepared for future events.

Internal Communications In many disaster scenarios, personnel may be stripped of many or all of their normal means of communication, such as desk phone, voicemail, e-mail, and instant messaging. Yet never are communications as vital as during a disaster, when nothing is going according to plan. Internal communications are needed so that status on various activities can be sent to command and control, and so that priorities and orders can be sent to disaster response personnel.

External Communications People outside of the organization need to know what’s going on when a disaster strikes. There’s a potentially long list of parties who want or need to know the status of business operations during and after a disaster, including

• Customers

• Suppliers

• Partners

• Shareholders

• Neighbors

• Regulators

• Media

• Law enforcement and public safety authorities

These different audiences need different messages, as well as messages in different forms.

Legal and Compliance Several needs may arise during a disaster that require the attention of inside or outside legal counsel. Disasters present unique situations that need legal assistance such as:

• Interpretation of regulations

• Interpretation of contracts with suppliers and customers

• Management of matters of liability to other parties

NOTE Typical legal matters need to be resolved before the onset of a disaster.

Damage Assessment Whether a disaster is a physically violent event such as an earthquake or volcanic eruption, or instead involves no physical manifestation, such as a serious security incident, one or more experts are needed who can examine affected assets and accurately assess the damage. Because most organizations own many different types of assets (from buildings to equipment to information), qualified experts are needed to assess each asset type involved. It is not necessary to call upon all available experts, only those whose expertise matches the type of event that has occurred.

Some expertise may go well beyond the skills present in an organization, such as a building structural engineer who can assess potential earthquake damage. In such cases it may be sensible to retain the services of an outside engineer who will respond and provide an assessment on whether a building is safe to occupy after a disaster.

Salvage Disasters destroy assets that the organization uses to make products or perform services. When a disaster occurs, someone (either a qualified employee or an outside expert) needs to examine assets to determine which are salvageable; then a salvage team needs to perform the actual salvage operation at a pace that meets the organization’s needs.

In some cases, salvage may be a critical-path activity, where critical processes are paralyzed until salvage and repairs to machinery can be performed. In other cases, the salvage operation is performed on inventory of finished goods, raw materials, and other items so that business operations can be resumed. Occasionally, when it is obvious that damaged equipment or materials are a total loss, the salvage effort is one of selling the damaged items or materials to some organization that wants them.

Assessment of damage to assets may be a high priority when an organization will be filing an insurance claim. Insurance may be a primary source of funding for the organization’s recovery effort.

NOTE Salvage operations may be a critical-path activity, or one that can be carried out well after the disaster. The command-and-control function will need to help decide the priority.

Physical Security After a disaster, the organization’s usual physical security controls may be compromised. For instance, fencing, walls, and barricades could be damaged, or video surveillance systems may be disabled or have no electric power. These and other failures could lead to increased risk of loss or damage to assets and personnel until those controls can be fixed. Also, security controls in temporary quarters such as hot/warm/cold sites and temporary work centers may be below those in primary locations.

Supplies During emergency and recovery operations, personnel will require supplies of many kinds, from writing tablets and pens to cell phones, portable generators, and extension cords. This function may also be responsible for ordering replacement assets such as servers and network equipment for a cold site.

Transportation When workers are operating from a temporary location, and/or if regional or local transportation systems have been compromised, many arrangements for all kinds of transportation may be required to support emergency operations. These can include transportation of replacement workers, equipment, or supplies by truck, car, rail, sea, or air. This function could also be responsible for arranging for temporary lodging for personnel.

Network This technology function is responsible for damage assessment to organization voice and data networks, building/configuring networks for emergency operations, or both. This function may require extensive coordination with external telecommunications service providers who, by the way, may be suffering the effects of a local or regional disaster as well.

Network Services This function is responsible for network-centric services such as DNS (Domain Name System), SNMP (Simple Network Management Protocol), and authentication.

Systems This is the function that is responsible for building, loading, and configuring the servers and systems that support critical services, applications, databases, and other functions. Personnel may have other resources such as virtualization technology to enable additional flexibility.

Databases For critical applications that rely upon databases, this function is responsible for building databases on recovery systems and for restoring or recovering data from backup media, replication volumes, or e-vaults onto recovery systems. Database personnel will need to work with systems, network, and applications personnel to ensure that databases are operating properly and available as needed.

Data and Records This function is responsible for access to and re-creation of electronic and paper business records. This is a business function that supports critical business processes and works with database management personnel and, if necessary, works with data-entry personnel to rekey lost data.

Applications The applications function is responsible for recovering application functionality on application servers. This may include reloading application software, performing configuration, provisioning roles and user accounts, connecting the application to databases and network services, and resolving other application integration issues.

Access Management This function is responsible for creating and managing user accounts for network, system, and application access. Personnel with this responsibility may be especially susceptible to social engineering and be tempted to create user accounts without proper authority or approval.

Information Security Personnel in this capacity are responsible for ensuring that proper security controls are being carried out during recovery and emergency operations. They will be expected to identify risks associated with emergency operations and to require remedies to reduce risks.

Security personnel will also be responsible for enforcing privacy controls, so that employee and customer personal data will not be compromised, even as business operations are compromised by the disaster and its effects.

Off-Site Storage This function is responsible for managing the effort of retrieving backup media from off-site storage facilities and for the protection of that media in transit to the scene of recovery operations. If recovery operations take place over an extended period (more than a couple of days), data at the recovery site will need to be backed up and sent to an off-site media storage facility to protect that information should a disaster occur at the hot/warm/cold site (and what bad luck that would be!).

User Hardware In many organizations, little productive work gets done when employees don’t have their workstations, printers, scanners, copiers, and other office equipment. Thus, a function is required to provide, configure, and support the variety of office equipment required by end users working in temporary or alternate locations. This function, like most others, will have to work with many others to ensure that workstations and other equipment are able to communicate with applications and services as needed to support critical processes.

Training During emergency operations, when response personnel and users are working in new locations (and often on new or different equipment and software), some of these personnel may need training so that their productivity can be quickly restored. Training personnel will need to be familiar with many disaster response and recovery procedures, so that they can help people in those roles understand what is expected of them. This function will also need to be able to dispense emergency operations procedures to these personnel.

Relocation This function comes into play when IT is ready to migrate applications running on hot/warm/cold site systems back to the original (or replacement) processing center.

Contract Information This function is responsible for understanding and interpreting legal contracts. Most organizations are a party to one or more legal contracts that require them to perform specific activities, provide specific services, and communicate status if service levels have changed. These contracts may or may not have provisions for activities and services during disasters, including communications regarding any changes in service levels.

This function is vital not only during the disaster planning stages but also during actual disaster response. Customers, suppliers, regulators, and other parties need to be informed according to specific contract terms.

Recovery Procedures

Recovery procedures are the instructions that key personnel use to bootstrap services (such as IT systems and other enabling technologies) that support the critical business functions identified in the BIA. The recovery procedures should work hand-in-hand with the technologies that may have been added to IT systems to make them more resilient.

An example would be useful here. A fictitious company, Acme Rocket Boots, determines that its order-entry business function is highly critical to the ongoing viability of the business and sets recovery objectives to ensure that order entry would be continued within no more than 48 hours after a disaster.

Acme determines that it needs to invest in storage, backup, and replication technologies to make a 48-hour recovery possible. Without these investments, IT systems supporting order-entry would be down for at least ten days until they could be rebuilt from scratch. Acme cannot justify the purchase of systems and software to facilitate an auto-failover of the order-entry application to hot-site DR servers; instead, the recovery procedure would require that the database be rebuilt from replicated data on warm-site servers. Other tasks such as installing recent patches would also be necessary to make recovery servers ready for production use. All of the tasks required to make the systems ready constitute the body of recovery procedures needed to support the business order-entry function.

This example is, of course, a gross oversimplification. Actual recovery procedures could take dozens of pages of documentation, and procedures would also be necessary for network components, end-user workstations, network services, and other supporting IT services required by the order-entry application. And those are the procedures needed just to get the application running again. More procedures would be needed to keep the applications running properly in the recovery environment.
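One way to sanity-check a body of recovery procedures against a recovery objective is to total the estimated duration of its steps. The Python sketch below does this for the Acme example. Only the 48-hour objective comes from the text; the step names and hour estimates are invented for illustration, and the sketch assumes the steps run strictly one after another.

```python
from dataclasses import dataclass

RTO_HOURS = 48  # Acme's recovery objective for order entry, from the example


@dataclass(frozen=True)
class Step:
    """One recovery step with an estimated duration in hours."""
    name: str
    hours: float


# Hypothetical steps and durations, invented for illustration only.
PROCEDURE = [
    Step("Activate warm-site servers", 4),
    Step("Install recent patches", 6),
    Step("Rebuild database from replicated data", 16),
    Step("Reconnect application and verify order entry", 8),
]


def total_hours(steps: list[Step]) -> float:
    """Total elapsed time if the steps are performed sequentially."""
    return sum(step.hours for step in steps)


def meets_rto(steps: list[Step], rto_hours: float = RTO_HOURS) -> bool:
    """True if the sequential total fits within the recovery objective."""
    return total_hours(steps) <= rto_hours


print(total_hours(PROCEDURE), meets_rto(PROCEDURE))  # 34 True
```

Estimates like these are only a planning aid. Whether the procedures actually meet the objective is something testing has to establish.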

Continuing Operations

Procedures for continuing operations have more to do with business processes than they do with IT systems. However, the two are related, since the procedures for continuing critical business processes have to fit hand-in-glove with the procedures for operating supporting IT systems that may also (but not necessarily) be operating in a recovery or emergency mode.

Let me clarify that last statement. It is entirely conceivable that a disaster could strike an organization with critical business processes that operate in one city but that are supported by IT systems located in another city. A disaster could strike the city with the critical business function, which means that personnel might have to continue operating that business function in another location, on the original, fully featured IT application. It is also possible that a disaster could strike the city with the IT application, forcing it into an emergency/recovery mode in an alternate location, while users of the application are operating in a business-as-usual mode. And, of course, a disaster could strike both locations (or a disaster could strike in one location where both the critical business function and its supporting IT applications are), throwing both the critical business function and its supporting IT applications into emergency mode. Any organization’s reality could be even more complex than this: just add dependencies on external application service providers, applications with custom interfaces, or critical business functions that operate in multiple cities. If you wondered why disaster recovery and business continuity planning were so complicated, perhaps your appreciation has grown.

Restoration Procedures

When a disaster has occurred, IT operations need to temporarily take up residence in an alternate processing site while repairs are performed on the original processing site. Once those repairs are completed, IT operations would need to be transitioned back to the main (or replacement) processing facility. You should expect that the procedures for this transition would also be documented (and tested—testing is discussed later in this chapter).

NOTE Transitioning applications back to the original processing site is not necessarily just a second iteration of the initial move to the hot/warm/cold site. Far from it: the recovery site may have been a skeleton (in capacity, functionality, or both) of its original self. The objective is not necessarily to move the functionality at the recovery site back to the original site, but to restore the original functionality at the original site.

To continue the Acme Rocket Boots example: the order-entry application at the DR site had only basic, not extended, functions. For instance, customers could not look at order history, and they could not place custom orders; they could only order off-the-shelf products. But when the application is moved back to the primary processing facility, the history of orders accumulated on the DR application needs to be merged into the main order history database, a step that was not a part of the DR plan.
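The merge step in the Acme example can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any real system: orders are assumed to be keyed by a unique order ID, the record layout is invented, and an order ID that appears in both places with different contents is set aside for manual review rather than overwritten.

```python
def merge_order_history(
    main: dict[str, dict], dr: dict[str, dict]
) -> tuple[dict[str, dict], list[str]]:
    """Merge orders taken at the DR site into a copy of the main history.

    Returns the merged history plus the IDs of orders that exist in both
    places with different contents. Those conflicting orders keep the
    main-site version and are reported for manual review.
    """
    merged = dict(main)
    conflicts = []
    for order_id, order in dr.items():
        if order_id in merged and merged[order_id] != order:
            conflicts.append(order_id)
        else:
            merged[order_id] = order
    return merged, conflicts


# Hypothetical records, for illustration only.
main_history = {"A100": {"item": "rocket boots", "qty": 1}}
dr_history = {
    "A100": {"item": "rocket boots", "qty": 2},  # differs from main
    "A101": {"item": "rocket boots", "qty": 1},  # taken during the disaster
}

merged, conflicts = merge_order_history(main_history, dr_history)
print(sorted(merged), conflicts)  # ['A100', 'A101'] ['A100']
```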

Considerations for Continuity and Recovery Plans

A considerable amount of detailed planning and logistics must go into continuity and recovery plans if they are to be effective.

Availability of Key Personnel

An organization cannot depend upon every member of its regular expert workforce to be available in a disaster. As discussed earlier in this chapter in more detail, personnel may be unavailable for a number of reasons, including

• Injury, illness, or death

• Caring for family members

• Unavailable transportation

• Being out of the area

• Lack of communications

• Fear

NOTE An organization must develop thorough and accurate recovery and continuity documentation as well as cross-training and plan testing. When a disaster strikes, an organization has one chance to survive, and it depends upon how well the available personnel are able to follow recovery and continuity procedures and to keep critical processes functioning properly.

Emergency Supplies

The onset of a disaster may cause personnel to be stranded at a work location, possibly for several days. This can happen for a number of reasons, including inclement weather that makes travel dangerous or transportation infrastructure that is damaged or blocked with debris.

Emergency supplies should be laid up at a work location and made available to personnel stranded there, whether or not they are supporting a recovery effort.

A disaster can also prompt employees to report to a work location (at the primary location or at an alternate site) where they may remain for days at a time, even around the clock if necessary. A situation like this may make the need for emergency supplies less critical, but it still may be beneficial to the recovery effort to make supplies available to support recovery personnel.

An organization stocking emergency supplies at a work location should consider including

• Drinking water

• Food rations

• First-aid supplies

• Blankets

• Flashlights

• Battery or crank-powered radio

Local emergency response authorities may recommend other supplies be kept at a work location.

Communications

Communications within organizations, as well as with customers, suppliers, partners, shareholders, regulators, and others, are vital under normal business conditions. During a disaster and subsequent recovery and restoration operations, they are more important than ever, even though many of the usual means of communication may be impaired.

Identifying Critical Personnel A successful disaster recovery operation requires available personnel who are located near company operations centers. While the primary response personnel may consist of the individuals and teams responsible for day-to-day corporate operations, others need to be identified. In a disaster, some personnel will be unavailable for many reasons (discussed earlier in this chapter).

Key personnel, as well as their backup persons, need to be identified. Backup personnel can consist of other employees who have familiarity with specific technologies, such as operating system, database, and network administration, and who can cover for primary personnel if needed. Sure, it would be desirable for these backup personnel also to be trained in specific recovery operations, but at the very least, if these personnel have access to specific detailed recovery procedures, having them on a call list is probably better than having no available personnel during a disaster.
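A call list that records a backup for each key role can be represented as an ordered list of names per role, consulted from the top until someone answers. The Python sketch below illustrates the idea; the roles, the names, and the `first_available` helper are all hypothetical.

```python
from typing import Optional

# Hypothetical call list: for each role, the primary person comes first,
# followed by backups in order of preference.
CALL_LIST = {
    "database": ["alice", "bob"],
    "network": ["carol", "dave"],
    "operating_system": ["erin"],
}


def first_available(role: str, available: set[str]) -> Optional[str]:
    """Return the first person listed for the role who is available."""
    for person in CALL_LIST.get(role, []):
        if person in available:
            return person
    return None


# Alice (primary DBA) and Erin cannot be reached after the disaster.
available_now = {"bob", "carol", "dave"}
print(first_available("database", available_now))          # bob
print(first_available("operating_system", available_now))  # None
```

A role that returns no one, as in the last line, points to a gap where an additional backup person should be identified.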

Identifying Critical Suppliers, Customers, and Other Parties Besides employees, many other parties need to be notified in the event of a disaster. Outside parties need to be aware of the disaster, as well as of basic changes in business conditions.

In a regional disaster such as a hurricane or earthquake, nearby parties will certainly be aware of the disaster and that your organization is involved in it somehow. However, those parties may not be aware of the status of business operations immediately after the disaster: a regional event’s effects can range from complete destruction of buildings and equipment to no damage at all and business-as-usual conditions. Unless key parties are notified of status, they may have no other way to know for sure.

Parties that need to be contacted may include

• Key suppliers This may include electric and gas utilities, fuel delivery, and materials delivery. An organization in a disaster will often need to impart special instructions to one or more suppliers, requesting delivery of extra supplies or temporary cessation of deliveries.

• Key customers Many organizations have key customers whose relationships are valued above most others. These customers may depend on a steady delivery of products and services that are critical to their own operations; in a disaster, those customers may have a dire need to know whether such deliveries will be able to continue or not, and under what circumstances.

• Public safety Police, fire, and other public safety authorities may need to be contacted, not only for emergency operations such as firefighting, but also for any required inspections or other services. It is important that “business office” telephone numbers for these agencies be included on contact lists, as 9-1-1 and other emergency lines may be flooded by calls from others.

• Insurance adjusters Most organizations rely on insurance companies to protect their assets from damage or loss in a disaster. Because insurance adjustment funds are often a key part of continuing business operations in an emergency, it’s important to be able to reach insurers as soon as possible after a disaster has occurred.

• Regulators In some industries, organizations are required to notify regulators of certain types of disasters. While regulators obviously may be aware of noteworthy regional disasters, they may not immediately know an event’s specific effects on an organization. Further, some types of disasters are highly localized and may not be newsworthy, even in a local city.

• Media Media outlets such as newspapers and television stations may need to be notified as a means of quickly reaching the community or region with information about the effects of a disaster on organizations.

• Shareholders Organizations are usually obliged to notify their shareholders of any disastrous event that affects business operations. This may be the case whether the organization is publicly or privately held.

NOTE The persons or teams responsible for communicating with these outside parties will need to have all of the individuals and organizations included in a list of parties to contact. This information should all be included in emergency response procedures.

Setting Up Call Trees Disaster response procedures need to include a call tree: a method in which the first personnel involved in a disaster begin notifying others in the organization, to inform them of the developing disaster and to enlist their assistance.

Just as the branches of a tree originate at the trunk and are repeatedly subdivided, a call tree is most effective when each person in the tree can make just a few phone calls. Not only will the notification of important personnel proceed more quickly, but each person will not be overburdened with many calls.

Remember, in a disaster a significant portion of personnel may be unavailable or unreachable. Therefore, a call tree should be structured so that there is sufficient flexibility as well as assurance that all critical personnel will be contacted. Figure 7-12 shows an example call tree.

Figure 7-12 Example call tree structure
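The flexibility described above can be illustrated with a minimal sketch. The role names, the `CALL_TREE` structure, and the `run_call_tree` function are hypothetical and are not taken from Figure 7-12. The sketch assumes one simple fallback rule: when someone cannot be reached, the person who tried to call them takes over that person's calls, so the branch below is still notified.

```python
from collections import deque

# Hypothetical call tree: each person calls only a few others.
CALL_TREE = {
    "incident_lead": ["it_manager", "facilities_manager", "hr_manager"],
    "it_manager": ["dba", "network_admin"],
    "facilities_manager": ["security_desk"],
    "hr_manager": ["communications"],
}


def run_call_tree(tree, root, reachable):
    """Walk the tree breadth-first and return (notified, unreachable).

    The root is assumed to be the person who starts the calls. If a callee
    cannot be reached, the caller inherits that callee's call list.
    """
    notified, unreachable = [root], []
    queue = deque((root, callee) for callee in tree.get(root, []))
    while queue:
        caller, callee = queue.popleft()
        if callee in reachable:
            notified.append(callee)
            next_caller = callee
        else:
            unreachable.append(callee)
            next_caller = caller  # fallback: caller covers the missing person
        for child in tree.get(callee, []):
            queue.append((next_caller, child))
    return notified, unreachable
```

In this sketch, if `it_manager` is unreachable, `incident_lead` ends up calling `dba` and `network_admin` directly. A real call tree would also need alternate phone numbers and a limit on how many extra calls any one person inherits.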

An organization can also use an automated outcalling system to notify critical personnel of a disaster. Such a system can play a prerecorded message or request that personnel call an information number to hear a prerecorded message. Most outcalling systems keep a log of which personnel have been successfully reached.

An automated calling system should not be located in the same geographic region as the organization it serves. If it were, a regional disaster could damage the system or make it unavailable just when it is needed. The system should be Internet accessible, so that response personnel can access it to determine which personnel have been notified, and to make any needed changes before or during a disaster.

Wallet Cards Wallet cards containing emergency contact information should be prepared for core team personnel for the organization, as well as for members in each department who would be actively involved in disaster response. Wallet cards are advantageous, because most personnel will have their wallet, pocketbook, or purse nearby at all times, even when away from home, running errands, traveling, or on vacation. Information on the wallet card should include contact information for fellow team members, a few of the key disaster response personnel, and any conference bridges or emergency call-in numbers that are set up. An example wallet card is shown in Figure 7-13.

Transportation

Some types of disasters may make certain modes of transportation unavailable or unsafe. Widespread natural disasters such as earthquakes, volcanic eruptions, hurricanes, and floods can immobilize virtually every form of transportation, including road, rail, water, and air travel. Other types of disasters may impede one or more types of transportation, which could result in overwhelming demand for the modes that remain available. High volumes of emergency supplies may be needed during and after a disaster, but damaged transportation infrastructure often makes the delivery of those supplies difficult.

Figure 7-13 Example laminated wallet card for core team participants with emergency contact information and disaster declaration criteria

Components of a Business Continuity Plan

The complete set of business continuity plan documents will include the following:

• Supporting project documents: These will include the documents created at the beginning of the business continuity project, including the project charter, project plan, statement of scope, and statement of support from executives.

• Analysis documents: These include the

    • Business impact analysis (BIA)

    • Threat assessment and risk assessment

    • Criticality analysis

    • Documents defining recovery targets such as recovery time objective (RTO) and recovery point objective (RPO)

• Response documents: These are all the documents that describe the required action of personnel when a disaster strikes, plus documents containing information required by those same personnel. Examples of these documents include

    • Business recovery (or resumption) plan: This describes the activities required to recover and resume critical business processes and activities.

    • Occupant emergency plan (OEP): This describes activities required to safely care for occupants in a business location during a disaster. This will include both evacuation procedures and sheltering procedures, each of which might be required, depending upon the type of disaster that occurs.

    • Emergency communications plan: This describes the types of communications imparted to many parties, including emergency response personnel, employees in general, customers, suppliers, regulators, public safety organizations, shareholders, and the public.

    • Contact lists: These contain names and contact information for emergency response personnel as well as for critical suppliers, customers, and other parties.

    • Disaster recovery plan: This describes the activities required to restore critical IT systems and other critical assets, whether in alternate or primary locations.

    • Continuity of operations plan (COOP): This describes the activities required to continue critical and strategic business functions at an alternate site.

    • Security incident response plan (SIRP): This describes the steps required to deal with a security incident that could reach disaster-like proportions.

• Test and review documents: This is the entire collection of documents related to tests of all of the different types of business continuity plans, as well as reviews and revisions to documents.

Testing Recovery Plans

It’s surprising what you can accomplish when no one is concerned about who gets the credit.—Ronald Reagan

Business continuity and disaster recovery plans may look elegant and even ingenious on paper, but their true business value is greatly diminished until their worth is proven through testing.

The process of testing DR and BC plans uncovers flaws not only in the plans, but also in the systems and processes that they are designed to protect. For example, testing a system recovery procedure might point out the absence of a critically needed hardware component, or a recovery procedure might contain a syntax or grammatical error that misleads the recovery team member and results in recovery delays. Testing is designed to uncover these types of issues.

Testing Recovery and Continuity Plans

Recovery and continuity plans need to be tested to prove their viability. Without testing, an organization has no way of really knowing whether its plans are effective. With ineffective plans, an organization has a far smaller chance of surviving a disaster.

Recovery and continuity plans have built-in obsolescence—not by design, but by virtue of the fact that technology and business processes in most organizations are undergoing constant change and improvement. Thus, it is imperative that newly developed or updated plans be tested as soon as possible to ensure their effectiveness.

Types of tests range from lightweight and unobtrusive to intense and disruptive. The types of tests are

• Document review

• Walkthrough

• Simulation

• Parallel test

• Cutover test

These tests are described in more detail in this section.

NOTE Usually, an organization will perform the less-intensive tests first, to identify the most obvious flaws, and follow with tests that require more effort.

Test Preparation

Each type of test requires advance preparation and recordkeeping. Preparation will consist of several activities, including

• Participants: You need to identify who will participate in an upcoming test. It is important to identify all relevant skill groups and department stakeholders so that the test will include a full slate of contributors.

• Schedule: The availability of each participant needs to be confirmed so that the test will include participation from all stakeholders.

• Facilities: For all but the document review test, proper facilities need to be identified and set up. This might consist of a large conference room or training room. If the test will take place over several hours, one or more meals and/or refreshments may be needed as well.

• Scripting: The simulation test requires some scripting, usually in the form of one or more documents that describe a developing scenario and related circumstances. Scenario scripting can make parallel and cutover tests more interesting and valuable, but this can be considered optional.

• Recordkeeping: For all of the tests except the document review, one or more persons need to take good notes that can be collected and organized after the test is completed.

• Contingency plan: The cutover test involves the cessation of processing on primary systems and the resumption of processing on recovery systems. This is the highest-risk test, and things can go wrong. A contingency plan to get primary systems running again, in case something goes wrong during the test, needs to be developed.

These preparation activities are shown in Table 7-5.
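As a planning aid, the preparation list above can be captured in a small lookup structure. The following is a minimal sketch built only from the descriptions in that list; it is not a reproduction of Table 7-5. The names `PREPARATION` and `checklist` are hypothetical, and the sketch assumes that participants and scheduling apply to every type of test, since the text states no exception for them.

```python
TEST_TYPES = ["document review", "walkthrough", "simulation", "parallel", "cutover"]

# Status per test type: "required" or "optional"; absent means not needed.
PREPARATION = {
    "participants": {t: "required" for t in TEST_TYPES},  # assumed for all tests
    "schedule": {t: "required" for t in TEST_TYPES},      # assumed for all tests
    "facilities": {t: "required" for t in TEST_TYPES if t != "document review"},
    "scripting": {"simulation": "required", "parallel": "optional", "cutover": "optional"},
    "recordkeeping": {t: "required" for t in TEST_TYPES if t != "document review"},
    "contingency plan": {"cutover": "required"},
}


def checklist(test_type):
    """Return the (activity, status) pairs that apply to one type of test."""
    return [
        (activity, by_test[test_type])
        for activity, by_test in PREPARATION.items()
        if test_type in by_test
    ]
```

Calling `checklist("cutover")` returns all six activities, while `checklist("document review")` returns only participants and schedule.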

The various types of tests are discussed next.

Document Review

A document review test is a review of some or all disaster recovery and business continuity plans, procedures, and other documentation. Individuals typically review these documents on their own, at their own pace, but within whatever time constraints or deadlines have been established.

The purpose of a document review test is to verify the accuracy and completeness of document content. Reviewers should read each document with a critical eye, point out any errors, and annotate the document with questions or comments that can be sent back to the document’s author(s), who can make any necessary changes to the document.

If significant changes are needed in one or more documents, the project team may want to include a second round of document review before moving on to more resource-intensive tests.

Table 7-5 Preparation Activities Required for Each Type of DR/BC Test

The owner or document manager for the organization’s business continuity and disaster recovery planning project should document which persons review which documents, and perhaps even include the review copies or annotations. This practice will create a more complete record of the activities related to the development and testing of important DRP and BCP planning and response documents. It will also help to capture the true cost and effort of the development and testing of DRP and BCP capabilities in the organization.

Walkthrough

A walkthrough is similar to a document review: it’s a review of just the DRP and BCP documents. However, where a document review is carried out by individuals working on their own, a walkthrough is performed by an entire group of individuals in a live discussion.

A walkthrough is usually facilitated by a leader who guides the participants page-by-page through each document. The leader may read sections of the document aloud, describe various scenarios where information in a section might be relevant, and take comments and questions from participants.

A walkthrough is likely to take considerably more time than a document review. One participant’s question on some minor point in the document could spark a worthwhile and lively discussion that could last a few minutes to an hour. The group leader or another person will need to take careful notes, in the event that any deficiencies are found in any of the documents. The leader will also need to be able to control the pace of the review, so that the group does not get unnecessarily hung up on minor points. Some discussions will need to be cut short or tabled for a later time or for an offline conversation among interested parties.

Even if major revisions are needed in recovery documents, it will probably be infeasible to conduct another walkthrough with updated documents. However, follow-up document reviews are probably warranted, to ensure that the documents were updated appropriately, at least in the opinion of the walkthrough participants.

NOTE Participants in the walkthrough should carefully consider that the potential audience for recovery procedures may be persons who are not as familiar as they are with systems and processes. They need to remember that the ideal personnel may not be available during a real disaster. Participants also need to realize that the skill level of recovery personnel might be a little below that of the experts who operate systems and processes in normal circumstances. Finally, walkthrough participants need to remember that systems and processes undergo almost continuous change, which could render some parts of the recovery documentation obsolete or incorrect all too soon.

Simulation

A simulation is a test of disaster recovery and business continuity procedures where the participants take part in a “mock disaster” to add some realism to the process of thinking their way through emergency response documents.

A simulation could be an elaborate and choreographed walkthrough test where a facilitator reads from a script and describes a series of unfolding events in a disaster such as a hurricane or an earthquake. This type of simulation might almost be viewed as “playacting,” where the script is the set of emergency response documentation. By stimulating the imagination of simulation participants, it’s possible for participants to really imagine that a disaster is taking place, which may help them to better understand what real disaster conditions might be like. It will help tremendously if the facilitator has actually experienced one or more disaster scenarios, so that he or she can add more realism when describing events.

To make the simulation more credible and valuable, the scenario that is chosen should be one that has a reasonable chance of actually occurring in the local area. Good choices would include an earthquake in San Francisco or Los Angeles, a volcanic eruption in Seattle, or an avalanche in Switzerland. A poor choice would be a hurricane or tsunami in central Asia, because these events would never occur there.

A simulation can also go a few steps further. For instance, the simulation can take place at an established emergency operations center, the same place where emergency command-and-control would operate in a real disaster. Also, the facilitator could change some of the participants’ roles, to simulate the real absence of certain key personnel, to see how remaining personnel might conduct themselves in a real emergency.

NOTE The facilitator of a simulation is limited only by his or her own imagination when organizing a simulation. One important fact to remember, though, is that a simulation does not actually affect any live or DR systems—it’s all as pretend as the make-believe cardboard television sets and computers found in furniture stores.

Parallel Test

A parallel test is an actual test of disaster recovery and/or business continuity response plans. The purpose of a parallel test is to evaluate the ability of personnel to follow directives in emergency response plans—to actually set up the DR business processing or data processing capability. In a parallel test, personnel are actually setting up the IT systems that would be used in an actual disaster and operating those IT systems with real business transactions to find out if the IT systems perform the processing correctly.

The outcome of a parallel test is threefold:

• It evaluates the accuracy of emergency response procedures.

• It evaluates the ability of personnel to correctly follow the emergency response procedures.

• It evaluates the ability of IT systems and other supporting apparatus to process real business transactions properly.

The test is called parallel because live production systems continue to operate while the backup IT systems process the same business transactions in parallel, to see whether they process them the same way the live production systems do.

Setting up a valid parallel test is complicated in many cases. In effect, you need to insert a logical “Y cable” into the business process flow so that the information flow will split and flow both to production systems (without interfering with their operation) and to the backup systems. Results of transactions need to be compared. Personnel need to be able to determine whether the backup systems would be able to output correct data without actually having them do so. In many complex environments, you would not want the DR system to actually feed information back into a live environment, because that might cause duplicate events to occur someplace else in the organization (or with customers, suppliers, or other parties). For instance, in a travel reservations system, you would not want a DR system to actually book travel, because that would cost real money and consume available space on an airline or other mode of transportation. But it would be important to know whether the DR system would be able to perform those functions. Somewhere along the line, it will be necessary to “unplug” the DR system from the rest of the environment and manually examine results to see if they appear to be correct.
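The logical “Y cable” can be sketched in a few lines. This is an illustration under stated assumptions, not a description of any real reservation system: `price_booking`, `parallel_test`, the rate tables, and the `commit` callback are all hypothetical names. The key idea is that each request is fed to both systems, only the production result is committed, and the DR result is held back and compared.

```python
def price_booking(request, rate_table):
    """Pure computation step: return the booking a system *would* make."""
    fare = rate_table[request["route"]] * request["seats"]
    return {"route": request["route"], "seats": request["seats"], "fare": fare}


def parallel_test(requests, production_rates, dr_rates, commit):
    """Feed every request to both systems; only production commits results."""
    mismatches = []
    for request in requests:
        prod_result = price_booking(request, production_rates)
        commit(prod_result)  # production behaves as usual
        # The DR system computes its answer, but nothing is committed.
        dr_result = price_booking(request, dr_rates)
        if dr_result != prod_result:
            mismatches.append((request, prod_result, dr_result))
    return mismatches


# Example: the DR site holds a stale rate table, which the comparison reveals.
booked = []
requests = [{"route": "SEA-SFO", "seats": 2}]
issues = parallel_test(requests, {"SEA-SFO": 120}, {"SEA-SFO": 110}, booked.append)
```

Separating the computation from the commit step is the design choice that makes the comparison safe: the DR system never books anything, yet its output can still be checked against production.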

Organizations that do wish to see if their backup/DR systems can manage a real workload can perform a cutover test, which is discussed next.

Cutover Test

A cutover test is the most intrusive type of disaster recovery test. It will also provide the most reliable results in terms of answering the question of whether backup systems have the capacity to shoulder the real workload properly.

The consequences of a failed cutover test, however, might resemble an actual disaster: if any part of the cutover test fails, then real, live business processes will be going without the support of IT applications as though a real outage or disaster were in progress. But even a failure like this would show you that “no, the backup systems won’t work in the event a real disaster were to happen later today.”

In some respects, a cutover test is easier to perform than a parallel test. A parallel test is a little trickier, since business information is required to flow to both the production system and the backup system, which means that some artificial component has to be inserted into the environment. With a cutover test, however, business processing takes place on the backup systems only, which can often be achieved through a simple configuration change somewhere in the network or systems layer of the environment.

NOTE Not all organizations perform cutover tests, because they take a lot of resources to set up and are risky. Many organizations find that a parallel test is sufficient to tell whether backup systems are accurate, and the risk of an embarrassing incident is almost zero with a parallel test.

Documenting Test Results

Every type and every iteration of DR plan testing needs to be documented. It’s not enough to say, “We did the test on September 10, 2009, and it worked.” First of all, no test goes perfectly; opportunities for improvement are always identified. But the most important part of testing is to discover which parts of the plan still need work, so that those parts can be fixed before the next test (or a real disaster).

As with any well-organized project, success is in the details. The road to success is littered with big and little mistakes, and all of the things that are identified in every sort of DR test need to be detailed, so that the next iteration of the test will give better results.

Recording and comparing detailed test results from one test to the next will also help the organization to measure progress. By this I mean that the quality of emergency response plans should steadily improve from year to year. Simple mistakes of the past should not be repeated, and the only failures in future tests should be in new and novel parts of the environment that weren’t well thought out to begin with. And even these should diminish over time.
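One simple way to measure this progress is to compare the list of issues recorded in one test with the list from the next. The sketch below is only an illustration; `compare_test_runs` and the issue descriptions are hypothetical, and it assumes that the same issue is recorded with the same wording from test to test.

```python
def compare_test_runs(previous_issues, current_issues):
    """Classify issues found in two consecutive tests of the same plan."""
    previous, current = set(previous_issues), set(current_issues)
    return {
        "resolved": sorted(previous - current),  # fixed since the last test
        "repeated": sorted(previous & current),  # past mistakes made again
        "new": sorted(current - previous),       # first seen in this test
    }


report = compare_test_runs(
    ["missing spare NIC", "outdated contact list"],
    ["outdated contact list", "wrong restore command"],
)
```

A healthy trend is a shrinking "repeated" list; entries there are the simple mistakes of the past that should not recur.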

Improving Recovery and Continuity Plans

Every test of recovery and response plans should include a debrief or review, so that participants can discuss the outcome of the test: what went well, what went wrong, and how things should be done differently next time. All of this information should be collected by someone who will be responsible for making changes to relevant documents. The updated documents should be circulated among the test participants, who can confirm whether their discussion and ideas are properly reflected in the documents.

Training Personnel

The value and usefulness of a high-quality set of disaster response and continuity plans and procedures will be greatly diminished if those responsible for carrying out the procedures are unfamiliar with them.

A person cannot learn to ride a bicycle by reading even the most detailed how-to instructions on the subject, so it’s equally unrealistic to expect personnel to be able to properly carry out disaster response procedures if they are untrained in those procedures.

Several forms of training can be made available for the personnel who are expected to be available if a disaster strikes, including

• Document review: Personnel can carefully read through procedure documents, to become familiar with the nature of the recovery procedures. But as mentioned earlier, this alone may be insufficient.

• Participation in walkthroughs: People who are familiar with specific processes and systems that are the subject of walkthroughs should participate in them. Exposing personnel to the walkthrough process will not only help to improve the walkthrough and recovery procedures, but will also be a learning experience for participants.

• Participation in simulations: Taking part in simulations will similarly benefit the participants by giving them the experience of thinking through a disaster.

• Participation in parallel and cutover tests: Other than experiencing an actual disaster and its recovery operations, no experience is quite like participating in parallel and cutover tests. Here, participants will gain actual hands-on experience with critical business processes and IT environments by performing the actual procedures that they would in the event of a disaster. When a disaster strikes, those participants can draw upon their memory of having performed those procedures in the past, instead of just the experience of having read the procedures.

You can see that all of the levels of tests that need to be performed to verify the quality of response plans are also training opportunities for personnel. The development and testing of disaster-related plans and procedures provide a continuous learning experience for all of the personnel involved.

Making Plans Available to Personnel When Needed

When a disaster strikes, often one of the effects is no access to even the most critical IT systems. Given a 40-hour workweek, there is roughly a 25 percent likelihood that critical personnel will be at the business location when a disaster strikes (at least the violent type of disaster that strikes with no warning, such as an earthquake—other types of disasters, such as hurricanes, may afford the organization a little bit of time to anticipate the disaster’s impact). The point is, chances are very good that the personnel who are available to respond may be unable to access the procedures and other information that they will need, unless special measures are taken.
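The rough 25 percent figure follows from simple arithmetic, shown here as a short sketch. It assumes that a no-warning disaster is equally likely to strike at any hour of the week and ignores vacations, travel, and overtime.

```python
HOURS_PER_WEEK = 7 * 24        # 168 hours in a week
WORK_HOURS_PER_WEEK = 40

# Assumption: the disaster is equally likely at any hour of the week.
on_site_probability = WORK_HOURS_PER_WEEK / HOURS_PER_WEEK
print(f"{on_site_probability:.1%}")  # prints 23.8%
```

Put the other way, for roughly three hours out of every four, critical personnel are somewhere other than the business location.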

NOTE Complete BCP/DRP documentation often contains details of key systems, operating procedures, recovery strategies, and even vendor and model identification of in-place equipment. This information can be misused if available to unauthorized personnel, so the mechanism selected for ensuring availability must include planning to exclude inadvertent disclosure.

There are several ways that response and recovery procedures can be made available to personnel during a disaster, including

• Hard copy: While many have grown accustomed to the paperless office, disaster recovery and response documentation is one type of information that should be available in hard-copy form. Copies, even multiple copies, should be available for each responder, with a copy at the workplace and another at home, and possibly even a set in the responder’s vehicle.

• Soft copy: Traditionally, soft-copy documentation is kept on file servers, but as you might expect, those file servers might be unavailable in a disaster. Soft copies should be available on responders’ portable devices (laptops, PDAs, and perhaps smartphones). An organization can also consider issuing documentation on memory sticks and cards. Depending upon the type of disaster, it can be difficult to know what resources will be available to access documentation, so making it available in more than one form improves the chances that at least one copy will be available to the personnel who need it.

• Alternate work/processing site: Organizations that utilize a hot/warm/cold site for the recovery of critical operations can maintain hard copies and/or soft copies of recovery documentation there. This makes perfect sense; personnel working at an alternate processing or work site will need to know what to do, and having those procedures on-site will facilitate their work.

• Online: Soft copies of recovery documentation can be archived on an Internet-based site that includes the ability to store data. Almost any type of online service that includes authentication and the ability to upload documents could be suitable for this purpose.

• Wallet cards: It’s unreasonable to expect to publish recovery documentation on a laminated wallet card, but those cards could be used to store the contact information for core response team members as well as a few other pieces of information like conference bridge codes, passwords to online repositories of documentation, and so on. An example wallet card appears earlier in this chapter, in Figure 7-13.

Maintaining Recovery and Continuity Plans

Business processes and technology undergo almost continuous change in most organizations. A business continuity plan that is developed and tested is liable to be outdated within months and obsolete within a year. If much more than a year passes, a DR plan in some organizations may approach uselessness. This section discusses how organizations need to keep their DR plans up-to-date and relevant.

A typical organization needs to establish a schedule whereby the principal DR documents will be reviewed. Depending on the rate of change, this could be as frequently as quarterly or as seldom as every two years.
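A review schedule is easy to track mechanically. The following minimal sketch flags documents whose last review is older than a chosen interval. The function name, the document names, and the default interval of 365 days are assumptions for illustration; an organization would choose its own interval somewhere between quarterly and every two years.

```python
from datetime import date, timedelta


def overdue_documents(last_reviewed, today, max_age_days=365):
    """Return names of documents whose last review is older than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(name for name, reviewed in last_reviewed.items() if reviewed < cutoff)


# Hypothetical review log: document name -> date of last review.
review_log = {
    "disaster recovery plan": date(2023, 1, 15),
    "contact lists": date(2024, 3, 1),
}
overdue = overdue_documents(review_log, today=date(2024, 6, 1))
```

Here only the disaster recovery plan is flagged; with `max_age_days=90` (a quarterly schedule), the contact lists would be flagged as well.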

Further, every change, however insignificant, in business processes and information systems should include a step to review and, possibly, to update relevant DR documents. That is, a review of, and possibly changes to, relevant DR documents should be a required step in every business process engineering or information systems change process, and a key component of the organization’s software development life cycle (SDLC). If this is done faithfully, then you would expect that the annual review of DR documents would conclude that few (if any) changes were required (although it is still a good practice to perform a periodic review, just to be sure).

Periodic testing of DR documents and plans, discussed in detail in the preceding section, is another vital activity. Testing validates the accuracy and relevance of DR documents, and any issues or exceptions in the testing process should precipitate updates to appropriate documents.

Sources for Best Practices

It is unnecessary to begin business continuity planning and disaster recovery planning by first inventing a practice or methodology. Business continuity planning and disaster recovery planning are advanced professions with several professional associations, professional certifications, international standards, and publications. Any or all of these are, or can lead to, sources of practices, processes, and methodologies:

• U.S. National Institute of Standards and Technology (NIST): This is a branch of the U.S. Department of Commerce that is responsible for developing business and technology standards for the federal government. The standards developed by NIST are of exceedingly high quality, and as a result many private organizations all over the world are adopting them. The NIST web site is found at www.nist.gov.

• Business Continuity Institute (BCI): This is a membership organization dedicated to the advancement of business continuity management. BCI has over 4,000 members in almost 100 countries. BCI holds several events around the world, prints a professional journal, and has developed several professional certifications, including

    • Associate of the Business Continuity Institute (ABCI)

    • Specialist of the Business Continuity Institute (SBCI)

    • Member of the Business Continuity Institute (MBCI)

    • Fellow of the Business Continuity Institute (FBCI)

The BCI web site can be found at www.thebci.org.

• U.S. National Fire Protection Association (NFPA): The NFPA has developed a pre-incident planning standard, NFPA 1620, which addresses the protection, construction, and features of buildings and other structures. It also requires the development of pre-incident plans that emergency responders can use to deal with fires and other emergencies. The NFPA web site can be found at www.nfpa.org.

• U.S. Federal Emergency Management Agency (FEMA): FEMA is a part of the Department of Homeland Security (DHS) and is responsible for emergency disaster relief planning information and services. FEMA’s most visible activities are its relief operations in the wake of hurricanes and floods in the United States. Its web site can be found at www.fema.gov.

• Disaster Recovery Institute International (DRII): This is a professional membership organization that provides education and professional certifications for disaster recovery planning professionals. Its certifications include

    • Associate Business Continuity Professional (ABCP)

    • Certified Business Continuity Vendor (CBCV)

    • Certified Functional Continuity Professional (CFCP)

    • Certified Business Continuity Professional (CBCP)

    • Master Business Continuity Professional (MBCP)

• Business Continuity Management Institute (BCMI): This is a professional association that specializes in education and professional certification. BCMI is a co-organizer of the World Continuity Congress, an annual conference that is dedicated to business continuity and disaster recovery planning. Its web site can be found at www.bcm-institute.org. Certifications offered by BCMI include

    • Business Continuity Certified Expert (BCCE)

    • Business Continuity Certified Specialist (BCCS)

    • Business Continuity Certified Planner (BCCP)

    • Disaster Recovery Certified Expert (DRCE)

    • Disaster Recovery Certified Specialist (DRCS)

Auditing Business Continuity and Disaster Recovery

Audits of an organization’s business continuity plan are especially difficult because it is impossible to prove whether the plans will work unless there is a real disaster.

The IT auditor has quite a task when it comes to auditing an organization’s business continuity and disaster recovery program. The lion’s share of the audit results hinges on the quality of documentation and walkthroughs with key personnel.

As is typical with most audit activities, an audit of an organization’s BC program is a top-down analysis of key business objectives and a review of documentation and interviews to determine whether the BC strategy and program details support those key business objectives. This approach is depicted in Figure 7-14.

Figure 7-14 Top-down approach to an audit of business continuity and disaster recovery

The objectives of an audit should include the following activities:

• Obtain documentation that describes current business strategies and objectives. Obtain high-level documentation (for example, strategy, charter, objectives) for the BC program, and determine whether the BC program supports business strategies and objectives.

• Obtain the most recent business impact analysis (BIA) and accompanying threat analysis, risk analysis, and criticality analysis. Determine whether these documents are current, complete, and support the BC strategy. Also determine whether the scope of these documents covers those activities considered strategic, according to high-level business objectives. Finally, determine whether the methods in these documents represent good practices for these activities.

• Determine the effectiveness of planning and recovery documentation by examining previous test results.

• Evaluate the methods used to store critical information off-site (which may consist of off-site storage, alternate data centers, or e-vaulting). Examine environmental and physical security controls in any off-site or alternate sites and determine their effectiveness. Note whether off-site or alternate site locations are within the same geographic region, which could mean that both the primary and alternate sites may be involved in common disaster scenarios.

• Determine whether key personnel are ready to respond during a disaster, by reviewing test plans and training plans and results. Find out where emergency procedures are stored and whether key personnel have access to them.

• Verify whether there is a process for the regular review and update of BC documentation. Evaluate the process’s effectiveness by reviewing records to see how frequently documents are being reviewed.

These activities are described in more detail in the following sections.

Reviewing Business Continuity and Disaster Recovery Plans

The bulk of an organization’s business continuity plan lies in its documentation, so it should be of little surprise that the bulk of the audit effort will lie in the examination of this documentation. The following procedure will help the auditor to determine the effectiveness of the organization’s BC plan:

1. Obtain a copy of business continuity and disaster recovery documentation, including response procedures, contact lists, and communication plans.

2. Examine samples of distributed copies of BC documentation, and determine whether they are up-to-date. These samples can be obtained during interviews of key response personnel, which are covered later in this procedure (see step 8).

3. Determine whether all documents are clear and easy to understand, not just for primary responders, but for alternate personnel who may have specific relevant skills but less familiarity with the organization’s critical applications.

4. Examine documentation related to the declaration of a disaster and the initiation of disaster response. Determine whether the methods for declaration are likely to be effective in a disaster scenario.

5. Obtain emergency contact information, and contact some of the personnel to see whether the contact information is accurate and up-to-date. Also determine whether all response personnel are still employed in the organization and whether they are in the same or similar roles in support of disaster response efforts.

6. Obtain contact information for off-site storage providers, hot-site facilities, and critical suppliers. Determine whether these organizations are still providing services to the organization. Call some of the contacts to determine the accuracy of the documented contact information.

7. Obtain logical and physical architecture diagrams for key IT applications that support critical business processes. Determine whether BC documentation includes recovery procedures for all components that support those IT applications. See whether documentation includes recovery for end users and administrators for the applications.

8. Contact some or all of the response personnel who are listed in emergency contact lists. Interview them and see how well they understand their disaster response responsibilities, and whether they are familiar with disaster response procedures. Ask each interviewee if they have a copy of these procedures. See if their copies are current.

9. If the organization uses a hot site, examine one or more systems to determine whether they have the proper versions of software, patches, and configurations. Examine procedures and records related to the tasks in support of keeping standby systems current. Determine whether these procedures are effective.

10. If the organization has a warm site, examine the procedures used to bring standby systems into operational readiness. Examine warm-site systems to see whether they are in a state where readiness procedures will likely be successful.

11. If the organization has a cold site, examine all documentation related to the acquisition of replacement systems and other components. Determine whether the procedures and documentation are likely to result in systems that are capable of hosting critical IT applications and that can be made ready within the period required to meet key recovery objectives.

12. Determine whether any documentation exists regarding the relocation of key personnel to the hot/warm/cold processing site. See whether the documentation specifies which personnel are to be relocated, and what accommodations and supporting logistics are provided. Determine the effectiveness of these relocation plans.

13. Determine whether backup and off-site (or e-vaulting) storage procedures are being followed. Examine systems to ensure that critical IT applications are being backed up, and that proper media are being stored off-site (or that the proper data is being e-vaulted). Determine whether data recovery tests are ever performed, and whether results of those tests are documented and problems are properly dealt with.

14. Evaluate procedures for transitioning processing from the alternate processing facility back to the primary processing facility. Determine whether these procedures are complete and effective.

15. Determine whether a process exists for the formal review and update of business continuity and disaster recovery documentation. Examine records to see how frequently, and how recently, documents have been reviewed and updated. Determine whether this is sufficient and effective, by interviewing key personnel to understand whether significant changes to applications, systems, networks, or processes are reflected in recovery and response documentation.

16. Determine whether response personnel receive any formal or informal training on response and recovery procedures. Determine whether personnel are required to receive training, and whether any records are kept that show which personnel received training and at what time.

17. Examine the organization’s change control process. Determine whether the process includes any steps or procedures that require personnel to determine whether any change has an impact on disaster recovery documentation or procedures.
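
Step 15 asks how frequently, and how recently, documents have been reviewed. The following is a minimal sketch of how an auditor might automate that check, assuming the last review date of each document has already been collected into a simple mapping. The document names, the dates, and the 365-day threshold are hypothetical illustrations, not requirements drawn from any standard.

```python
from datetime import date


def stale_documents(last_reviewed, as_of, max_age_days=365):
    """Return the names of documents last reviewed more than max_age_days ago."""
    return sorted(
        name
        for name, reviewed in last_reviewed.items()
        if (as_of - reviewed).days > max_age_days
    )


# Hypothetical review records gathered during the audit.
reviews = {
    "Emergency contact list": date(2023, 11, 1),
    "Database recovery procedure": date(2021, 3, 15),
    "Disaster declaration procedure": date(2023, 6, 30),
}

# Documents not reviewed in the year before the (hypothetical) audit date.
print(stale_documents(reviews, as_of=date(2024, 1, 15)))
```

A list like this only shows which documents are overdue for review. Whether the content still reflects current applications, systems, and networks must be judged through the interviews described in step 15.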

Reviewing Prior Test Results and Action Plans

Effectiveness of disaster recovery and business continuity plans relies, to a great degree, on the results and outcomes of tests. An IT auditor needs to carefully examine these tests to determine their effectiveness and to what degree they are used to improve procedures and to train personnel. The following procedure will help the IT auditor to determine the effectiveness of business continuity and disaster recovery testing:

1. Determine whether there is a strategy for testing business continuity and disaster recovery procedures. Obtain records for past tests and a plan for future tests. Determine whether prior tests and planned tests are adequate for establishing the effectiveness of response and recovery procedures.

2. Examine records for tests that have been performed over the past year or two. Determine the types of tests that were performed. Obtain a list of participants for each test. Compare the participants to lists of key recovery personnel. Examine test work papers to determine the level of participation by key recovery personnel.

3. Determine whether there is a formal process for recording test results and for using those results to make improvements in plans and procedures. Examine work papers and records to determine the types of changes that were recommended in prior tests. Examine BC and DR documents to see whether these changes were made as expected.

4. Considering the types of tests that were performed, determine the adequacy of testing as an indicator of the effectiveness of the BC program. Did the organization only perform document reviews and walkthroughs, for example, or did the organization also perform parallel or cutover tests?

5. If tests have been performed for two years or more, determine whether there’s a trend showing continuous improvement in response and recovery procedures.

6. If the organization performs parallel tests, determine whether tests are designed in a way that effectively determines the actual readiness of standby systems. Also determine whether parallel tests measure the capacity of standby systems or merely their ability to process correctly but at a lower level of performance.

7. Determine whether any tests included the retrieval of backup data from off-site storage or e-vaulting facilities.
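
The comparison in step 2, test participants against the list of key recovery personnel, amounts to a set difference. The sketch below assumes the auditor has transcribed both lists from test work papers. All names and test labels are hypothetical.

```python
def participation_gaps(key_personnel, test_participants):
    """For each test, list the key recovery personnel who did not take part."""
    key = set(key_personnel)
    return {
        test: sorted(key - set(people))
        for test, people in test_participants.items()
    }


def never_participated(key_personnel, test_participants):
    """List key recovery personnel who took part in none of the tests."""
    attended = set()
    for people in test_participants.values():
        attended.update(people)
    return sorted(set(key_personnel) - attended)


# Hypothetical data transcribed from test work papers.
key_personnel = ["A. Rivera", "B. Chen", "C. Okafor"]
tests = {
    "2023 walkthrough": ["A. Rivera", "B. Chen", "D. Patel"],
    "2024 parallel test": ["A. Rivera"],
}

print(participation_gaps(key_personnel, tests))
print(never_participated(key_personnel, tests))
```

Personnel who appear in the second list are natural candidates for the interviews described later in this chapter.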

Evaluating Off-Site Storage

Storage of critical data and other supporting information is a key component in any organization’s business continuity plan. Because some types of disasters can completely destroy a business location, including its vital records, it is imperative that all critical information be backed up and copies moved to an off-site storage facility. The following procedure will help the IT auditor determine the effectiveness of off-site storage:

1. Obtain the location of the off-site storage or e-vaulting facility. Determine whether the facility is located in the same geographic region as the organization’s primary processing facility.

2. Visit the off-site storage facility. Examine its physical security controls as well as its safeguards to prevent damage to stored information in a disaster. Consider the entire spectrum of physical and logical access controls. Examine procedures and records related to the storage and return of backup media, and of other information that the organization may store there.

3. Take an inventory of backup media and other information stored at the facility. Compare this inventory with a list of critical business processes and supporting IT systems, to determine whether all relevant information is, in fact, stored at the off-site storage facility.

4. Determine how often the organization performs its own inventory of the off-site facility, and whether any deficiencies found are documented and remedied.

5. Examine contracts, terms, and conditions for off-site storage providers or e-vaulting facilities, if applicable. Determine whether data can be recovered to the original processing center and to alternate processing centers within a period that will ensure that disaster recovery can be completed within recovery time objectives.

6. Determine whether the appropriate personnel have current access codes for off-site storage or e-vaulting facilities, and whether they have the ability to recover data from those facilities.

7. Determine what information, in addition to backup data, exists at the off-site storage facility. Information stored off-site should include architecture diagrams, design documentation, operations procedures and documentation, application source code, and configuration information for all logical and physical layers of technology and facilities supporting critical IT applications.

8. Obtain information related to the manner in which backup media and copies of records are transported to and from the off-site storage or e-vaulting facility. Determine whether controls protecting transported information are adequate.

9. Obtain records supporting the transport of backup media and records to and from the off-site storage facility. Examine samples of records and determine whether they match other records such as backup logs.
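
The inventory comparison in step 3 can be sketched as follows, assuming the off-site inventory has been recorded as a list of media entries, each tagged with the system it backs up. The system names, media identifiers, and record layout are hypothetical.

```python
def missing_from_offsite(critical_systems, offsite_inventory):
    """Return critical systems that have no backup media at the off-site facility."""
    stored = {item["system"] for item in offsite_inventory}
    return sorted(set(critical_systems) - stored)


# Hypothetical list of critical systems and inventory taken at the facility.
critical_systems = ["billing", "order-entry", "payroll"]
offsite_inventory = [
    {"media_id": "T-0412", "system": "billing"},
    {"media_id": "T-0413", "system": "payroll"},
    {"media_id": "T-0298", "system": "intranet-wiki"},
]

print(missing_from_offsite(critical_systems, offsite_inventory))
```

The same pattern could be applied to step 9 by comparing transport records with backup logs, although the fields to match on will depend on how the organization keeps those records.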

Evaluating Alternative Processing Facilities

The IT auditor needs to examine alternate processing facilities to determine whether they are sufficient to support the organization’s business continuity and disaster recovery plans. The following procedure will help the IT auditor determine whether an alternate processing facility will be effective:

1. Obtain addresses and other location information for alternate processing facilities. These will include hot sites, warm sites, cold sites, and alternate processing centers owned or operated by the organization.

2. Determine whether alternate facilities are located within the same geographic region as the primary processing facility, and the probability that the alternate facility will be adversely affected by a disaster that strikes the primary facility.

3. Perform a threat analysis on the alternate processing site. Determine which threats and hazards pose a significant risk to the organization and its ability to effectively carry out operations during a disaster.

4. Determine the types of natural and man-made events likely to take place at the alternate processing facility. Determine whether there are adequate controls to mitigate the effect of these events.

5. Examine all environmental controls and determine their adequacy. This should include heating, ventilation, and air conditioning (HVAC); power supply; uninterruptible power supply (UPS); power distribution units (PDUs); and electric generators. Also examine fire detection and suppression systems, including smoke detectors, pull stations, fire extinguishers, sprinklers, and inert gas suppression systems.

6. If the alternate processing facility is a separate organization, obtain the legal contract and all exhibits. Examine these documents and determine whether the contract and exhibits support the organization’s recovery and testing requirements.
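
As a rough first screen for step 2, the great-circle distance between the primary and alternate facilities can be computed from their coordinates with the haversine formula. This is only a sketch: the coordinates and the 150 km threshold are hypothetical, and distance alone does not replace the threat analysis in steps 3 and 4, because a single event such as a hurricane or regional power failure can affect sites that are far apart.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def distance_km(a, b):
    """Great-circle distance between two (latitude, longitude) points in degrees."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = (
        sin((lat2 - lat1) / 2) ** 2
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def too_close(primary, alternate, min_separation_km):
    """Flag an alternate site that lies within the chosen separation distance."""
    return distance_km(primary, alternate) < min_separation_km


# Hypothetical site coordinates and separation threshold.
primary_site = (47.61, -122.33)
alternate_site = (47.25, -122.44)

print(round(distance_km(primary_site, alternate_site)), "km apart")
print("Flag for review:", too_close(primary_site, alternate_site, 150))
```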

Interviewing Key Personnel

The knowledge and experience of key personnel are vital to the success of any disaster response operation. Interviews of key personnel will help the IT auditor determine whether key personnel are prepared and trained to respond during a disaster. The following procedure will guide the IT auditor in interviews:

1. Obtain the name, title, tenure, and full contact information for each person interviewed.

2. Ask the interviewee to summarize his or her professional experience and training, and current responsibilities in the organization.

3. Ask the interviewee whether he or she is familiar with the organization’s business continuity and disaster recovery programs.

4. Determine whether the interviewee is among the key response personnel expected to respond during a disaster.

5. Ask the interviewee if he or she has been issued a copy of any response or recovery procedures. If so, ask to see those procedures; determine whether they are current versions. Ask if the interviewee has additional sets of procedures in any other locations (residence, for example).

6. Ask the interviewee if he or she has received any training. Request evidence of this training (certificate, calendar entry, and so on).

7. Ask the interviewee if he or she has participated in any tests or evaluations of recovery and response procedures. Ask the interviewee whether he or she felt the tests were effective, whether management takes the tests seriously, and whether any deficiencies in tests resulted in any improvements to test procedures or other documents.

Reviewing Service Provider Contracts

No organization is an island. Every organization has critical suppliers without which it could not carry out its critical functions. The ability to recover from a disaster also frequently requires the support of one or more service providers or suppliers. The IT auditor should examine contracts for all critical suppliers and consider the following guidelines:

• Determine whether the contract supports the organization’s requirements for delivery of services and supplies, even in the event of a local or regional disaster.

• Determine whether the service provider has its own disaster recovery capabilities that will ensure its ability to deliver critical services during a disaster.

• Determine the recourse available should the supplier be unable to provide goods or services during a disaster.

Reviewing Insurance Coverage

The IT auditor should examine the organization’s insurance policies related to the loss of property and assets supporting critical business processes. Insurance coverage should cover the actual cost of recovery, or a lesser amount if the organization’s executive management has accepted a lower amount. The IT auditor should obtain documentation that includes cost estimates for various disaster recovery scenarios, including equipment replacement, business interruption, and the cost of performing business functions and operating IT systems in alternate sites. These cost estimates should be compared with the value of insurance policies.
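
The comparison of cost estimates with policy values can be sketched as a simple shortfall calculation. The scenario names and dollar figures below are hypothetical, and the sketch assumes each scenario maps to a single coverage limit, which real policies may not do.

```python
def coverage_gaps(estimated_costs, policy_limits):
    """Return the shortfall for each scenario whose estimated cost exceeds coverage."""
    gaps = {}
    for scenario, cost in estimated_costs.items():
        limit = policy_limits.get(scenario, 0)  # no policy means no coverage
        if cost > limit:
            gaps[scenario] = cost - limit
    return gaps


# Hypothetical recovery cost estimates and insurance coverage limits.
estimated_costs = {
    "equipment replacement": 2_000_000,
    "business interruption": 5_000_000,
    "alternate site operations": 750_000,
}
policy_limits = {
    "equipment replacement": 2_500_000,
    "business interruption": 3_000_000,
}

print(coverage_gaps(estimated_costs, policy_limits))
```

A shortfall is not automatically a finding. The auditor should check whether executive management has formally accepted the lower amount of coverage.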

Summary

Natural and man-made disasters can damage business facilities, assets, and information systems, thus threatening the viability of the organization by halting its critical processes. Even without direct effects, many secondary or indirect effects from a disaster such as crippled transportation systems, damaged communications systems, and damaged public utilities can seriously harm an organization. The development of business continuity plans and disaster recovery plans helps an organization to be better prepared to act when a disaster strikes. A vital part of this preparation is the development of alternative means for continuing the most critical activities, usually in alternative locations that are not damaged by a disaster.

There is an accepted methodology to business continuity and disaster recovery planning, which begins with the development of a business continuity planning policy, a statement of the goals and objectives of a planning effort. This is followed by a business impact analysis (BIA), a study of the organization’s business processes to determine which are the most critical to the organization’s ongoing viability. For each critical process, a statement of impact is developed, which is a brief description of the effect on the organization if the process is incapacitated for any significant period. The statement of impact can be qualitative or quantitative.

A criticality analysis is performed next, where all in-scope business processes are ranked in order of criticality. Ranking can be strictly quantitative, qualitative, or even subjective.

Next, recovery targets for each critical business process are developed. The key targets are recovery time objective (RTO) and recovery point objective (RPO). These targets specify time to system restoration and maximum data loss, respectively. When these targets have been established, the project team can develop plans that include changes to technical architecture as well as business processes that will help achieve these established recovery objectives. Often, project teams discover that achieving specific recovery objectives is too expensive; this requires that the business revisit and consider changing those objectives to more affordable figures. Sometimes, however, the organization is able instead to develop new architectures or processes that can help lower costs overall, including the cost of achieving desired recovery objectives.
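
The relationship between these targets and actual capabilities can be illustrated with a short sketch. It assumes, as a simplification, that worst-case data loss is roughly the interval between backups (ignoring backup duration and any replication), and that recovery time has been measured in a test. The function names and figures are hypothetical.

```python
from datetime import timedelta


def rpo_achievable(backup_interval, rpo):
    """An RPO shorter than the backup interval cannot be met by backups alone."""
    return backup_interval <= rpo


def rto_met(measured_recovery_time, rto):
    """Compare a recovery time measured during testing with the RTO."""
    return measured_recovery_time <= rto


daily_backup = timedelta(hours=24)

print(rpo_achievable(daily_backup, rpo=timedelta(hours=24)))  # True
print(rpo_achievable(daily_backup, rpo=timedelta(hours=12)))  # False
print(rto_met(timedelta(hours=6), rto=timedelta(hours=4)))    # False
```

When either check fails, the organization must either invest in a more capable architecture (for example, more frequent backups or replication) or revisit the target itself.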

Once acceptable architectures and process changes have been determined, the organization sets out to make investments in these areas to bring its systems and processes closer to the recovery objectives. Significant investments may take place over a period of years. Procedures for recovering systems and processes are also developed at this time, as well as procedures for other aspects of disaster response such as emergency communications plans and evacuation plans.

Some of the investment in IT system resilience may involve the establishment of an alternate processing site, where IT systems can be resumed in support of critical business processes. There are several types of alternate sites, including a hot site, where IT systems are in a continual state of near-readiness and can assume production workload within an hour or two; a warm site, where IT systems are present but require several hours to a day of preparation; and a cold site, where no systems are present and replacement systems must be acquired, which may require several days of preparation before they are ready to support business processes. An organization can also establish a reciprocal site agreement, in which two or more organizations each agree to provide a part of their processing center to one of the other organizations in the event it experiences a disaster. Organizations with a reciprocal processing agreement are usually located in different geographic regions.

Some of the technologies that may be introduced in IT systems to improve recovery targets include RAID, a technology that improves the reliability of disk storage systems; replication, a technique for copying data in near–real time to an alternate (and usually distant) storage system; and clustering, a technology where several servers (including some that can be located in another region) act as one logical server, enabling processing to continue even if one or more servers are incapacitated or unreachable.

The effectiveness of business continuity and disaster recovery plans can only be determined by testing; otherwise, there is no real way to know whether the plans and procedures are accurate and can actually be carried out, or whether they will achieve their objectives. There are five types of tests: document review, walkthrough, simulation, parallel test, and cutover test. These five tests represent progressively more complex (and risky) means for testing procedures and IT systems to determine whether they will be able to actually support critical business processes in a real disaster. The parallel test involves the use of backup IT systems in a way that enables them to process real business transactions while primary systems continue to perform the organization’s real work. The cutover test actually transitions business data processing to backup IT systems, where they will process actual business workload for a period. The risk of a cutover test is that the backup systems will not have the required accuracy or capacity, which could actually precipitate a disaster of its own!

Response personnel need to be carefully chosen from available staff, to ensure that sufficient numbers of personnel will be available in a real disaster. Some personnel may be unable to respond for a variety of reasons that are related to the disaster itself. As a result, some of the personnel who respond in an actual disaster may not be as familiar with the systems and procedures required to recover and maintain them. This makes training and accurate procedures critical for effective disaster recovery.

Auditing an organization’s business continuity capabilities involves the examination of BCP policies, plans, and procedures, as well as contracts and technical architectures. The IT auditor also needs to interview response personnel to gauge their readiness and to visit off-site media storage and alternate processing sites to identify risks present there.

Notes

• Business continuity and disaster recovery planning ensure business recovery following a disaster. Business continuity focuses on maintaining service availability with the least disruption to standard operating parameters during an event, while disaster recovery focuses on post-event recovery and restoration of services.

• While disasters are generally grouped in terms of man-made or natural disaster types, individual events may often create combined threats to enterprise operation. For example, a tornado (natural disaster) might also spawn structural fires and transportation accidents (man-made disasters).

• The BCP process encompasses a life cycle beginning with the initial BCP policy, followed by business impact and criticality analysis to evaluate risk and impact factors. Recovery targets facilitate the development of strategies for continuity and recovery, which then must be tested and conveyed to operation personnel through training and exercise. Post-implementation maintenance includes periodic reviews and updates as part of the enterprise continuous-improvement process.

• The BCP policy defines the scope of continuity and recovery strategy, defining boundaries by functional, operational, or geographic alignment.

• The business impact analysis (BIA) measures the impact on enterprise operation posed by various identified areas of risk. The output of the BIA is used in the criticality analysis (CA), which measures the impact of each risk against its likelihood and the cost of mitigation.

• The output of the BIA and CA is used when establishing recovery time objectives (RTOs) and recovery point objectives (RPOs), which can then be measured against relative cost scenarios for each identified risk and mitigation option.

• Once recovery objectives have been identified, strategies can be developed to meet each objective. Many solutions may include redundant (hot, warm, or cold) alternate sites, redundant service operation or storage in high-availability or distributed-cluster environments, alternative network access strategies, and backup/recovery strategies structured to meet identified recovery time and recovery point requirements.

• BC/DR plans require triggers, established mechanisms for implementation and coordination, clearly defined responsibilities, and well-documented procedures for each element necessary to the BC/DR effort. The plan must be documented and available to recovery team members even if displaced and without access to affected systems, and should contain all analysis, response, and testing documents related to each procedure.

• BC/DR plans must be tested to validate effectiveness through document review, walkthrough, simulation, parallel testing, or cutover testing practices. Regular testing must take place to ensure new objectives and procedures meet the requirements of a living enterprise environment. Participation in these tests provides familiarity and training for engaged operational staff members, raising understanding and awareness of requirements and responsibilities.

Questions

1. An organization that is undertaking a business continuity plan should first perform:

A. A risk analysis

B. A business impact analysis

C. A threat analysis

D. A criticality analysis

2. The first step in a business impact analysis is:

A. Identify key assets.

B. Identify key personnel.

C. Establish the scope of the project.

D. Inventory all in-scope business processes and systems.

3. What is the purpose of a statement of impact?

A. The effect on the business if the process is incapacitated

B. A disaster’s effect on the business

C. The effect on the business if a recovery plan is not tested

D. The cost of backup systems

4. What is the purpose of a criticality analysis?

A. Determine feasible recovery targets.

B. Determine which staff members are the most critical.

C. Determine which business processes are the most critical.

D. Determine maximum tolerable downtime.

5. A critical application is backed up once per day. The recovery point objective for this system:

A. Is 48 hours

B. Cannot be determined

C. Is 24 hours

D. Is 12 hours

6. Recovery time objective is defined as:

A. The maximum period of downtime

B. The maximum data loss

C. The minimum period of downtime

D. The minimum data loss

7. An alternate processing center that contains no application servers is known as a:

A. Clear site

B. Warm site

C. Hot site

D. Cold site

8. What is the most important consideration for site selection of a hot site?

A. Time zone

B. Geographic location in relation to the primary site

C. Proximity to major transportation

D. Natural hazards

9. A collection of servers that is designed to operate as a single logical server is known as a:

A. Cluster

B. Grid

C. Cloud

D. Replicant

10. To determine effectiveness of a disaster recovery program, an IT auditor should:

A. Interview personnel

B. Examine test results

C. Examine documentation and interview personnel

D. Examine documentation

Answers

1. B. A business impact analysis is the first major task in a disaster recovery or business continuity planning project. A business impact analysis helps determine which processes in an organization are the most important.

2. D. The first step in a business impact analysis is the inventory of all in-scope business processes and systems.

3. A. A statement of impact describes the effect on the business if a process is incapacitated for any appreciable time.

4. C. A criticality analysis is used to determine which business processes are the most critical, by ranking them in order of criticality.

5. C. The recovery point objective (RPO) for an application that is backed up once per day cannot be less than 24 hours.

6. A. Recovery time objective (RTO) is defined as the maximum period of downtime for a process or application.

7. D. A cold site contains no information processing equipment.

8. B. An important selection criterion for a hot site is the geographic location in relation to the primary site. If they are too close together, then a single event may involve both locations.

9. A. A server cluster is a collection of two or more servers that is designed to appear as a single server.

10. C. An auditor who is auditing an organization’s disaster recovery plan should examine documentation and interview personnel.
