Chapter 7
Risk Management and Disaster Recovery

“Planning for a rainy day” is an old expression that holds true for any organization. What happens today is likely to happen tomorrow. You’ll arrive at work with the normal traffic hassles. Boot your computer. Read your email. And the rest of the day is rather predictable. Even when things don’t go exactly as planned, you handle them routinely because that’s relatively normal too.

What if something doesn’t go as planned? Not only is your routine upset, but normal operations of the organization are also disrupted. Roads leading to work are impassable. The building is shut down because of a fire. There is a power outage lasting days. A key employee quits. No orders are received because of a system failure. There is an endless number of events that can have short- and long-term damaging effects on the organization.

Sustainability is the primary goal of every organization. A business model describes how the organization will survive financially when everything goes according to plan. A disaster recovery plan describes how the organization will survive when things don’t go according to plan, based on a risk assessment/management plan that identifies all possible and probable events that can negatively affect the organization and how each risk is mitigated. A key element of every disaster recovery plan is how the organization functions when computing devices and applications are no longer available—a disaster that can bring an organization to a screeching halt.

Every organization needs a comprehensive, well-thought-out disaster recovery plan that can be implemented flawlessly in a heartbeat when disaster strikes. Employees, customers, vendors, regulatory authorities, lenders, insurers, and shareholders all expect that a disaster recovery plan be in place to protect their interest in the sustainability of the organization. This chapter addresses managing risks associated with organizations requiring high availability, such as hospitals and utilities. Much of it can apply to an organization of any size. The first half of the chapter discusses the various forms of risk and the second half discusses what can be done to ameliorate those risks.

Disaster

Mention the word disaster and it conjures images of a catastrophic event such as a fire, flood, or powerful storm that disrupts business operations. A windstorm, for example, makes accessing the facility impossible because of downed trees and power lines. Employees are unable to leave the building and other employees are unable to come to work. Some may decide to work from home, but the systems, the network, and databases are all unavailable even from remote locations. And the storm seems to come out of nowhere with no forewarning and no time to prepare. Less dramatic events also place the organization in disaster mode: failure of heating, ventilation, and air conditioning in the facility; loss of key employees due to injury on or off the job; a strike at the facility or at a vendor.

A disaster to an organization is any event that negatively impacts the operation of the organization. This can be a strike by union workers, a vendor going out of business, or loss of access to the building. This involves anything that severely disrupts normal operations—not limited to floods, fire, and explosion. Anything that can arise from fear, uncertainty, or doubt of continuity, referred to as the FUD factor, can lead to a disaster.

A disaster is categorized in a number of ways. These are general classifications, class of emergency, and the tier system. There are two general classifications of disasters. These are:

Natural disasters: A natural disaster is an act of nature such as storms, floods, and earthquakes. Natural disasters cannot be prevented.

Human-made disasters: A human-made disaster is an act of humans, such as a failure of the infrastructure resulting in a hazardous material spill. A human-made disaster may be preventable by monitoring and implementing procedures that reduce the likelihood of such an event.

Each general classification is further categorized as a class of emergency. A class of emergency defines the emergency condition by the length of time of the emergency. These are:

Class 1: Class 1 is an emergency that lasts a few hours, such as a brief power outage or an injury onsite.

Class 2: Class 2 is an emergency that lasts 72 hours or less and is more serious than a Class 1 emergency, such as a contained fire that caused slight damage to the facility.

Class 3: Class 3 is an emergency that lasts more than 72 hours that affects one area of the facility, such as the data center.

Class 4: Class 4 is an emergency that lasts more than 72 hours that affects the entire facility.

Class 5: Class 5 is an emergency that affects the entire community, such as a storm or flooding.
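
The five classes can be read as a small decision rule. The following Python sketch is one possible encoding, not part of any standard. The `Emergency` record, the scope labels, and the four-hour cutoff for “a few hours” are assumptions made for illustration, since the text does not quantify that phrase.

```python
from dataclasses import dataclass

# Assumed cutoff for "a few hours"; the chapter does not give a number.
FEW_HOURS = 4


@dataclass
class Emergency:
    duration_hours: float
    scope: str  # hypothetical labels: "area", "facility", or "community"


def emergency_class(e: Emergency) -> int:
    """Return the class of emergency (1-5) for the given event."""
    if e.scope == "community":
        return 5  # community-wide events are Class 5 regardless of duration
    if e.duration_hours > 72:
        # Long emergencies: entire facility is Class 4, a single area is Class 3.
        return 4 if e.scope == "facility" else 3
    # 72 hours or less: Class 1 if it lasts only a few hours, otherwise Class 2.
    return 1 if e.duration_hours <= FEW_HOURS else 2


print(emergency_class(Emergency(duration_hours=2, scope="area")))
```

A rule like this is only a starting point; borderline events still need a judgment call from whoever declares the emergency.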

An alternative classification method is the tier system. The tier system separates operational functions into three tiers:

Tier 1: Tier 1 consists of functions that need to be operational within the first 72 hours of the disaster.

Tier 2: Tier 2 consists of functions that need to be operational by the end of the first week of the disaster.

Tier 3: Tier 3 consists of functions that need to be operational by the end of the first month of the disaster.
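
As a rough illustration, the tier of a function follows from how soon it must be operational again. In the sketch below the function names and their deadlines are hypothetical, and “the first month” is assumed to mean 30 days.

```python
def recovery_tier(deadline_hours: float) -> int:
    """Map the time by which a function must be operational to a tier."""
    if deadline_hours <= 72:
        return 1
    if deadline_hours <= 7 * 24:   # end of the first week
        return 2
    if deadline_hours <= 30 * 24:  # end of the first month (assumed 30 days)
        return 3
    raise ValueError("deadline falls outside the three tiers")


# Hypothetical business functions and the hours within which each is needed.
functions = {"order entry": 24, "payroll": 120, "marketing reports": 600}
tiers = {name: recovery_tier(hours) for name, hours in functions.items()}
print(tiers)
```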

Risk Assessment

An organization faces many risks to its operation. A risk is the possibility of harm or long-term loss as a result of an event. There are obvious risks, such as fire in the facility, and less obvious risks, such as only one employee being able to program the old, dependable order entry application that hasn’t been changed in years. The organization comes to a halt should the order entry application fail and that employee quit. There’s no one to fix the application.

The organization identifies all risks that might disrupt the operation by conducting a risk assessment. Risk assessment is a process of identifying risks and assessing the magnitude of the potential interruption to the organization should the event occur. Risks are classified as direct risk and indirect risk.

Direct risk: A direct risk is an event that directly affects the organization, such as a systems failure, fire in the facility, or a power outage.

Indirect risk: An indirect risk is an event that affects another party needed for the sustainability of the organization, such as an employee, customer, or vendor. For example, a fire at a vendor’s facility can disrupt supplies to the organization.

The risk assessment must also consider secondary consequences that may occur as a result of an event. A secondary consequence is an obligation of the organization, such as loss to customers when the organization is unable to provide service to the customer. The disruption might result in contractual penalties, regulatory violations, and potential litigation.

Risk to the environment (drinkable water, power, and heat) should not be overlooked. Loss of water to the facility, for example, prevents flushing toilets and makes the facility uninhabitable. Even if some employees agree to continue working, a lack of flushing toilets can be a violation of local health regulations, causing government officials to temporarily declare the facility unsafe and require that the building be evacuated.

There are three questions that should be asked when performing a risk assessment:

What is the risk?

What is the probability that the risky event will occur?

What is the impact to the organization should the risky event occur?

There are several formal methods that can be used to identify risk:

Disaster-Based Risk Assessment—focuses on hazards

Asset-Based Risk Assessment—focuses on assets

Business Impact Analysis—focuses on the business

Disaster-Based Risk Assessment

Disaster-based risk assessment focuses on hazards rather than processes and systems. The goal is to identify all potential hazards; the likelihood that a disaster will occur; the impact to business operations; and whether the hazard can be avoided.

The organization may be willing to do nothing to prevent a terrorist attack because there is a low probability that an attack will occur, depending on the type of organization. However, the organization is willing to invest in a backup power generator to power business operations because there is a greater risk of a power failure. Each hazard is entered into a weighted list and assessed for the likelihood that the hazard will occur and a contingency response is planned. A weighted list contains each risk and the probability that the risk will occur.
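
A weighted list can be kept as simply as a sorted table. The sketch below is a minimal illustration. The `Hazard` record is an assumed structure, and the probabilities are invented for the example rather than taken from any data.

```python
from dataclasses import dataclass


@dataclass
class Hazard:
    name: str
    probability: float  # illustrative estimate of the chance of occurring in a year
    contingency: str    # planned contingency response


def weighted_list(hazards):
    """Order hazards from most to least likely."""
    return sorted(hazards, key=lambda h: h.probability, reverse=True)


hazards = [
    Hazard("terrorist attack", 0.001, "accept; no preventive spending"),
    Hazard("power failure", 0.30, "backup power generator"),
    Hazard("fire in the facility", 0.02, "alarms and evacuation plan"),
]

for h in weighted_list(hazards):
    print(f"{h.name}: {h.probability:.1%} -> {h.contingency}")
```

Sorting by probability puts the power failure at the top, which matches the reasoning for investing in a generator before anything else.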

Asset-Based Risk Assessment

The asset-based risk assessment focuses on identifying assets that are vulnerable to hazards. Assets are people, equipment, information, systems—any person or thing that is necessary to keeping the business operational. List each asset, its location, and hazards that may affect the asset. Assign each hazard a probability of occurrence—the chance that the hazard will happen.

The asset list helps identify the vulnerability of each asset, and then focuses attention on how business operations can continue without the asset. A proper risk abatement plan includes developing controls that reduce the likelihood of those hazards occurring, and then measuring the effectiveness of each control. A control is a process that reduces the likelihood that the risk will occur.
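
One way to picture the asset list is as a set of records, each carrying its location and its hazards. The sketch below is illustrative only. The `Asset` structure, the sample assets, and the probabilities are assumptions, and combining hazard probabilities this way assumes the hazards are independent, which is often not true in practice.

```python
from dataclasses import dataclass, field
from math import prod


@dataclass
class Asset:
    name: str
    location: str
    hazards: dict = field(default_factory=dict)  # hazard name -> probability


def vulnerability(asset: Asset) -> float:
    """Chance that at least one hazard occurs, assuming independent hazards."""
    return 1 - prod(1 - p for p in asset.hazards.values())


# Hypothetical assets with illustrative probabilities.
assets = [
    Asset("order entry programmer", "headquarters", {"resignation": 0.10}),
    Asset("database server", "data center", {"power outage": 0.30, "fire": 0.02}),
]

# Most vulnerable assets first, so attention goes where it is needed most.
ranked = sorted(assets, key=vulnerability, reverse=True)
for a in ranked:
    print(f"{a.name} ({a.location}): {vulnerability(a):.1%}")
```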

Business Impact Analysis

A business impact analysis is part of the risk assessment that focuses on the impact a risk has on the operations of the organization. The business impact analysis examines each business process, looking for steps in the process that are at risk for failure. The risk is then evaluated for the probability that disruption might occur. The business impact analysis also determines the effect the risk has on the sustainability of the organization.

Let’s take a look at the order entry process. Here are just some of the resources that may be unavailable:

Sales representatives

Sales assistants

Computing devices used to enter orders

Electricity to power computing devices

Backup power supplies

Network cables

Network routers (see Chapter 2)

The application server that runs the order entry application

The database server that runs the order database

The database management system

The database

The data center that houses the application server and database server

Nothing is assumed to work properly during the risk analysis, including backup resources. In a power failure the backup generator may not work. Here are elements to consider in the business impact analysis:

Assess the minimum effort needed to maintain operational levels.

Review the impact of disruptions.

Identify steps in all processes.

Estimate recovery point objectives. There are steps in the recovery process referred to as recovery points. Each recovery point has a goal referred to as a recovery point objective such as restarting all computing devices.

Assess the needs of direct support departments.

Identify gaps in the operations that can fail.

Legacy Systems

Every organization has an old, reliable application that has been running for decades without a problem, but may also be a hidden minefield of potential disasters. It’s so reliable that even the MIS department doesn’t give it much thought. Yet the organization would come to a standstill if it stopped working. Since the application hasn’t attracted attention, there is a high likelihood that the MIS department would have to hunt down the original program files (see Chapter 3) and, if they could be found, fix the problem. And this assumes that there is a programmer who knows how to fix it.

This application is a legacy system. A legacy system is a system that may be critical to the operations of the organization that has not been replaced or upgraded for a long time. A legacy system can be a computer-based system, a manual system, or a combination. The system may be totally under the control of the organization or a process provided by a vendor. Legacy systems operate without problems for many years—so much so that managers tend not to properly manage the system. In essence, managers may forget about legacy systems.

Legacy systems become problematic when the system ceases to operate and no one presently on staff is familiar with the details of the system—especially the computerized portions of the system—to fix it. The organization may discover that the vendor who supplied the system is no longer in business or no one is available to fix the legacy system on short notice, if at all. The technician who was intimately knowledgeable about the system has long retired or the computer language used to write the application is no longer taught in school. And the backup systems haven’t been tested for years. Bottom line: no one can change or fix the legacy system on short notice, and going back to a manual system is too hard to implement.

It is critical during a risk assessment to identify legacy systems and assess the preparedness to provide adequate support or to decide if there is a need to replace the legacy system. It is also vital that those responsible for the legacy system prove beyond a reasonable doubt that they can repair or modify the system. For example, the MIS manager who oversees the legacy system may identify the programmer responsible for maintaining the system. This is fine but the risk assessment requires that the programmer display the source code (the instructions written by the programmer) and the necessary tools (compiler) to convert the source code into an executable program (the program that actually runs on the computer) and then recompile the source code to recreate the executable program (see Chapter 3).

Points of Failure

The initial step in the risk assessment is to identify points of failure within the organization by performing an information technology (IT) audit. A point of failure is an element in the organization’s operation that might fail, such as the local area network failing to transmit electronic data throughout the facility. When a fail point (a vulnerability to the operations) is identified, assess whether steps have been taken to reduce the likelihood that the event will occur. Determine the significance of the event to the operations. What are the chances the event will happen and what impact does that event have on business operations?

An IT audit is a detailed survey of computing devices, programs, operating systems, networks, cables, and anything required for processing, including employees who use a computing device and employees and vendors who support and maintain computing devices.

The IT audit is conducted by IT auditors who have a background in all areas of information technology. These are usually former technicians or IT managers who use their knowledge to verify that the organization has addressed points of failure. Their goal is to determine if policies and procedures are in place that adhere to industry standards and mitigate points of failure. IT auditors also verify that policies and procedures are implemented. Findings are reported in an IT audit report that also contains recommendations to mitigate any failures in policies, procedures, and practices.

The IT audit begins with the end point—the results of the process. Managers are interviewed to determine their expectations. The organization’s policies and procedures are reviewed, as are regulatory requirements, if any. Employees are observed as they use the process. Any deviations from management expectations and policies and procedures are noted. IT auditors then follow the process—some call it following the cable because IT auditors practically trace the cable from the computing device.

Identifying points of failure requires tracing the hardware used to access the information. Hardware includes computers, cables, servers, and other computing devices. The trace follows the cable. It starts at the network cable or Wi-Fi connection leading from the computing device and then traces the cable through walls into the communication closet. A communication closet is typically a small room on the floor where all cables connect to routers and other computing devices, including more cables that transmit data to a central communications hub somewhere in the facility. The central communications hub is connected to the data center or to outside vendors that operate the servers that run applications, database management systems, and databases. Each computing device is a potential point of failure. Each communication closet or hub is a point of failure. Each cable connection is a point of failure. Any of these could disrupt operations.
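
Following the cable can be sketched as a walk over a small graph of connections. The example below is a simplified illustration, not an audit tool. The component names and the redundant second closet router are hypothetical, added to show how a backup path changes which components are single points of failure.

```python
from collections import deque


def reachable(links, start, goal, removed=None):
    """Breadth-first search from start to goal, skipping the removed component."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in links.get(node, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def single_points_of_failure(links, start, goal):
    """Intermediate components whose failure alone cuts start off from goal."""
    nodes = set(links) | {n for targets in links.values() for n in targets}
    candidates = nodes - {start, goal}
    return sorted(n for n in candidates
                  if not reachable(links, start, goal, removed=n))


# Hypothetical layout: two closet routers (redundant), one hub, one app server.
links = {
    "workstation": ["closet_router_a", "closet_router_b"],
    "closet_router_a": ["central_hub"],
    "closet_router_b": ["central_hub"],
    "central_hub": ["app_server"],
    "app_server": ["database_server"],
}

print(single_points_of_failure(links, "workstation", "database_server"))
```

The endpoints are left out of the result only because the search is between them; the workstation and the database server can of course fail as well.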

In the data center, IT auditors examine security that includes physical access to the facility; physical access to computing devices; physical and electronic access to applications, database management systems, and data. Every element of a process is closely examined. Auditors look for facts. Affirmations—taking someone’s word—are unacceptable. IT auditors trust employees but verify that what is said is true.

Recovery Point Objective (RPO) and Recovery Time Objective (RTO)

A recovery point objective is the acceptable period of time when the process is unavailable to the organization. That is, how long the business can live without the process. For example, the recovery point objective is effectively zero for an online retailer such as Amazon. A large number of orders could be missed if the process goes down. The risk is losing business. However, business-to-business organizations that sell through a sales staff to other businesses may have a recovery point objective of three hours. Orders are usually for large quantities and arrive at various times throughout the day. Most orders can be written by hand and then later entered into the order entry system. The risk is failure to deliver products within an acceptable time period.

The recovery time objective is the time needed to restore the failed process—how long it takes to recover the operations once a disaster occurs. Let’s say the application server running the order entry application crashes. IT requires two hours to replace the application server and restore the order entry application. This is the recovery time objective.

RTO focuses on how long it takes to recover and RPO on how long the organization can operate without the system. The difference indicates the organization’s risk exposure. In the ideal world, the RTO should be less than the RPO. In the real world, this may not be the case and the disaster recovery plan must address how the organization will respond should a point of failure occur.
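
Using the terms as this chapter defines them, the comparison reduces to simple arithmetic. The sketch below is illustrative; the function name is hypothetical, and the figures are the three-hour and two-hour examples from the text plus a zero-tolerance online retailer.

```python
def exposure_hours(rto_hours: float, rpo_hours: float) -> float:
    """Hours of downtime beyond what the business can tolerate (0 if none)."""
    return max(0.0, rto_hours - rpo_hours)


# Business-to-business seller: can tolerate 3 hours, recovery takes 2 hours.
print(exposure_hours(rto_hours=2, rpo_hours=3))

# Online retailer: cannot tolerate any downtime, recovery still takes 2 hours.
print(exposure_hours(rto_hours=2, rpo_hours=0))
```

Any result above zero is a gap that the disaster recovery plan has to cover, for example with a standby server or a manual fallback process.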

Data Synchronization Point

Orders flow through the order entry application and are stored in the database throughout the day. There is always a risk that the database or the computing device running the database will fail. A data synchronization point is a state of the database that was saved prior to the database failure. When a database failure occurs, the database is restored to the data synchronization point.

A disaster recovery plan requires that databases be backed up regularly. Timing of the backup depends on the nature of the application. For example, an online retailer may back up each transaction immediately while a less data-dependent organization may back up data at the close of each business day. Data entered between data synchronization points are lost should the database fail.

Knowing the data synchronization point enables the organization to develop a contingency plan to deal with failures that occur between data synchronization points.
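
To make the idea concrete, the sketch below finds the data synchronization point that a restore would use and lists the entries made after it. The dates, order numbers, and function names are hypothetical, and the example assumes a nightly backup at 6:00 p.m.

```python
from datetime import datetime


def restore_point(sync_points, failure_time):
    """Most recent data synchronization point at or before the failure."""
    candidates = [t for t in sync_points if t <= failure_time]
    if not candidates:
        raise ValueError("no data synchronization point precedes the failure")
    return max(candidates)


def lost_entries(entries, sync_points, failure_time):
    """Entries made after the restore point and up to the failure."""
    point = restore_point(sync_points, failure_time)
    return [name for name, entered in entries if point < entered <= failure_time]


# Hypothetical nightly backups and a failure the following morning.
sync_points = [datetime(2024, 3, 4, 18, 0), datetime(2024, 3, 5, 18, 0)]
failure = datetime(2024, 3, 6, 11, 30)
orders = [
    ("order-101", datetime(2024, 3, 5, 16, 45)),
    ("order-102", datetime(2024, 3, 6, 9, 15)),
    ("order-103", datetime(2024, 3, 6, 11, 0)),
]

print(lost_entries(orders, sync_points, failure))
```

The entries returned are the ones a contingency plan must recover some other way, such as re-keying them from paper or email records.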

Unseen Fail Points

An unseen fail point is one that is not obvious during the risk assessment, such as the configuration of computing devices. You’ve probably experienced this hassle when you get a new computing device and you need to reset all your favorite settings, the default browser page, and your bookmarks. This is more an inconvenience than a disaster for you. It is different for your organization because many settings are for security or are needed to properly run applications.

Settings of a computing device are based on the policies and procedures of the organization and the interaction of the computing device with the network and other devices. Failure of the computing device usually requires reinstallation and configuration of the device. The unseen fail point is often the instructions on how to reinstall and configure the computing device, which may be missing, outdated, or out of reach when they are needed.

During a risk assessment, IT auditors watch the recovery of unseen fail points, especially during off hours. IT auditors ask the following questions. Answers reveal weaknesses in the recovery plan.

Who installs the application?

Has that person installed the application in the past?

Are there special configurations that need to be made to the application?

Are there written instructions on how to install and configure the application?

Where are those instructions stored?

Does the person who is installing the application know where to access the instructions?

How would you operate the organization if information necessary to run the business was in your building, and you had to evacuate the building?

Can employees function if their belongings (car keys, house keys, and so on) are in the building and the building is evacuated?

Who is going to contract with vendors to provide backup locations?

Are maps, routes, and accommodation details stored offsite?

Who has the contact list of employees, and is that person available during the disaster?

Do the critical employees know they are critical to the operation?

Who is contacting vendors and customers and telling them about the temporary change in business operations?

Disaster Recovery

Disaster recovery is the process of restoring functionality to the organization during and after the disaster—first returning to partial operations and then eventually, full operations. The goal of disaster recovery is to get back to normal business. The initial step is to reorganize employees and business operations when a disaster strikes. There are many challenges.

How will you communicate with employees, and how will employees communicate with each other?

Where will employees go if they can’t get back to the office? How will they keep doing their jobs?

What systems and business units are crucial to the basic operations of the organization?

Will employees be focused on the organization during a disaster? Consider Superstorm Sandy. Employees’ homes were destroyed and families uprooted. The organization wasn’t the primary focus.

The primary focus is to maintain the minimum sustainable level of services within the organization and to customers. In doing so the goals include minimizing the potential losses—loss of revenue and penalties for breach of contracts with customers. It is important to identify crucial systems and processes, then spend funds to restore those systems.

Disaster Recovery Team

The disaster recovery team is responsible for disaster planning, testing, and enacting the disaster recovery plan should a disaster occur. The team is composed of key employees, administrators, government agencies such as police and fire departments, and outside organizations such as utilities, vendors, transportation firms, and business partners. Each provides a unique perspective on potential disaster and how to prevent, mitigate, and recover from a disaster.

The disaster recovery team begins with a charter. A charter is a written document that authorizes the team to develop and carry out the disaster recovery plan. The charter contains:

Mission statement: The mission statement is the goal that describes the purpose of the plan and why the organization decided to create a disaster recovery plan.

Scope: The scope is the team’s authority—what the team can and cannot do—and the time frame within which to develop the disaster recovery plan.

Sponsor: The sponsor is the executive who sponsors the disaster recovery plan.

Team: The team leader and team members are designated along with resources (such as a budget) that can be used to develop, test, and enact the disaster recovery plan.

Disaster management: The charter must define every aspect of disaster management, including:

  • What is an emergency?
  • Who declares an emergency?
  • Who is the incident commander?
  • What is the command structure?
  • Who opens the emergency operation center?

There are two categories of disaster recovery teams:

The primary disaster recovery team is led by the recovery manager and consists of key coordinators, each responsible for leading the recovery process for a specific aspect of the business.

The secondary disaster recovery team supports the primary disaster recovery team’s efforts and focuses on rebuilding the business operation from minimum levels of operations.

Disaster Recovery Plan

Critical to the recovery is the disaster recovery plan. Think of a disaster recovery plan as a cookbook for creating a Thanksgiving dinner. You open the first page and prepare the first recipe. When completed, turn the page and continue with the next recipe. A disaster recovery plan contains processes to follow for specific disasters. The goal is to return the organization to normal operational levels with high-priority processes returning immediately and lower priority processes returning over time.

Getting management buy-in is a challenge for many disaster recovery plans. Support peaks moments after a disaster occurs and then declines weeks and months after the event. There is no disaster recovery plan without management support. The key selling point of a disaster recovery plan is cost benefit: you don’t lose business. A lot of time, effort, and money is spent creating and testing a disaster recovery plan without contributing to the bottom line. A disaster recovery plan can be justified by pointing out that:

The organization cannot afford business interruption insurance or cannot get adequate insurance coverage.

Auditors and regulatory agencies may require a disaster recovery plan, and a disaster recovery plan can give investors peace of mind.

Not all disaster recovery plans are successful, primarily because of common errors that can easily be avoided.

Inadequate planning: The presumption is that planning is a straightforward process. In reality, planning is a complex process that requires focus on many details that are overlooked during normal business operations.

Inventory of assets: There isn’t a complete and updated list of assets that clearly identifies the location of each asset, its role within the organization, and the requirements to operate it, such as how to configure it.

Minimizing the recovery effort: The presumption is that staff is available and has the skill set and knowledge on how to recover from a disaster.

Invalid assumptions: Recovery and business continuity needs are based on unfounded assumptions rather than on measurements developed during a risk assessment.

There are two major parts of the disaster recovery plan. These are:

Disaster recovery: Disaster recovery focuses on returning the organization to marginal functionality within the first week of the disaster. This is like picking up the pieces and restoring some semblance of order in a chaotic situation.

Business continuity: Business continuity is focused on the long term, restoring operations beyond one week following the disaster.

Let’s say a storm disrupted power to the organization. The disaster recovery focuses on using battery backup power and the facility’s own generator to provide power to keep the organization marginally operational. Business continuity focuses on restoring full power.

Elements of a Disaster Recovery Plan

The disaster recovery plan defines disaster recovery control measures. A disaster recovery control measure is a process for managing an element of a disaster. There are three types of control measures specified in a disaster recovery plan. These are:

Preventive measures: Preventive measures are processes that prevent a disaster from occurring. For example, placing power lines underground prevents a loss of power due to a storm disrupting power lines.

Detective measures: Detective measures are processes that discover that a disaster is occurring, such as the activation of a fire alarm.

Corrective measures: Corrective measures are processes that restore functionality to the organization during or after a disaster. For example, an electrical generator provides power to the facility until the main power is restored.

Assumptions

The staff that develops a disaster recovery plan makes assumptions about risks that may disrupt the sustainability of the organization. Assumptions are based on the probability that a specific disaster will occur. Probability is determined by evidence that supports the likelihood that the disaster will occur. The staff looks at the experience of the organization and of similar organizations within the region. Government and scientific projects and data are also considered when setting the probability. A list of potential disasters and their probabilities is generated. The assumption used as the basis for setting the probability is listed for each potential disaster.

For example, there might be small tremors over the years, but no earthquake sufficient to cause structural damage based on a review of one hundred years of data for the area. The assumption is that there will never be a significant earthquake and, therefore, there is no need to include an earthquake in the disaster recovery plan.

An assumption may be reasonable but not necessarily true. Although there hasn’t been a significant earthquake in the area, that doesn’t mean there couldn’t be one in the future. Therefore, it is critical that a disaster recovery plan consider all types of disasters—even those that may be remote based on history. An earthquake can happen and the organization needs to be prepared to recover.

Risk Tolerance

How much risk are you willing to take? The answer depends on the risk. There is risk each time you drive your car. On a clear day the risk is minimal. On a stormy day the risk is moderate because the weather increases the chance of a storm-related accident. You weigh the benefits of driving against the risk when you decide whether or not to drive. You may drive to work in a storm because you need to get paid. However, you may forgo driving in a storm to go shopping because the risk of becoming involved in an accident outweighs the benefit of shopping. Whether or not you drive depends on your risk tolerance. The same question must be answered by every organization. How much of a chance are you willing to take that the risky event will occur? The answer depends on the risk tolerance of the organization.

An organization’s risk tolerance is a factor in the development of the disaster recovery plan. Each risk is identified along with its probability of occurring. The disaster recovery team then decides an appropriate response to each risk. There are four common responses to a risk:

Accept: Accept the risk and do nothing now—deal with it if the risk should materialize.

Mitigate: Reduce the risk by doing something that lowers the probability that the risk will occur.

Transfer: Transfer the risk to a vendor. The vendor takes on the responsibility of deciding how to respond to the risk. The organization is still exposed to the risk.

Avoid: The administrator can change the situation to avoid the risk entirely.

Deciding on a response must balance the effort to respond to the risk with the likelihood that the risk will materialize. The effort is usually measured in financial expenditure. How much money is it worth spending now to address the risk?

For example, accepting the risk means no expenditures are made unless the risky event occurs. Mitigating the risk means some expense is incurred, such as purchasing insurance. Transferring the risk also requires expenditure: hiring the vendor. Avoiding the risk also requires expenditures that may change how the organization operates.
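
The balance between spending now and the likelihood of loss can be sketched as a comparison of expected costs. The example below is a deliberate simplification. The upfront costs, the remaining probabilities, and the loss amounts are invented, and a real decision would also weigh the organization’s risk tolerance, which a single number does not capture.

```python
from enum import Enum


class Response(Enum):
    ACCEPT = "accept"
    MITIGATE = "mitigate"
    TRANSFER = "transfer"
    AVOID = "avoid"


def expected_cost(upfront, probability, loss):
    """Money spent now plus the probability-weighted loss that remains."""
    return upfront + probability * loss


def cheapest_response(options, loss):
    """options maps each Response to (upfront cost, remaining probability)."""
    return min(options, key=lambda r: expected_cost(*options[r], loss))


# Hypothetical figures for a single risk.
options = {
    Response.ACCEPT: (0, 0.10),
    Response.MITIGATE: (5_000, 0.02),
    Response.TRANSFER: (12_000, 0.01),
    Response.AVOID: (40_000, 0.0),
}

print(cheapest_response(options, loss=200_000))  # large potential loss
print(cheapest_response(options, loss=20_000))   # small potential loss
```

With these made-up numbers the preferred response shifts from mitigating to accepting as the potential loss shrinks, which is the trade-off the paragraph describes.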

Risk Management

Risk management is the process of managing fail points. The organization has options.

It can do nothing and take the risk that it will not fail.

It can buy insurance to cover losses should it fail.

It can take steps to prevent failure.

It can have redundancy to minimize business interruptions.

Are you going to spend $100,000 for a fence to prevent unauthorized access to your facility? How do you know the risk of unauthorized access is more than $100,000? There is no absolute answer. The decision must be an informed decision based on the probability of the disaster occurring and the loss that would be realized should the disaster happen.
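One common way to frame that informed decision is to compare the expected yearly loss (the probability of the event in a year multiplied by the loss if it occurs) with the yearly cost of the safeguard. The sketch below is illustrative only. The probabilities, the $500,000 loss, the 10-year fence life, and the function names are all hypothetical assumptions, not figures from this chapter.

```python
def expected_annual_loss(probability_per_year, loss_if_event_occurs):
    """Expected yearly loss: chance of the event in a year times the loss it causes."""
    if not 0.0 <= probability_per_year <= 1.0:
        raise ValueError("probability_per_year must be between 0 and 1")
    return probability_per_year * loss_if_event_occurs


def safeguard_pays_off(yearly_cost, loss_if_event_occurs, p_without, p_with):
    """True when the safeguard lowers expected yearly loss by more than it costs per year."""
    reduction = (expected_annual_loss(p_without, loss_if_event_occurs)
                 - expected_annual_loss(p_with, loss_if_event_occurs))
    return reduction > yearly_cost


# Hypothetical figures: a $100,000 fence spread over an assumed 10-year life.
fence_yearly_cost = 100_000 / 10
print(safeguard_pays_off(fence_yearly_cost, 500_000, p_without=0.05, p_with=0.01))
```

With these made-up numbers the fence reduces the expected yearly loss by more than it costs, so it pays off. A different probability estimate or loss figure can reverse the answer, which is why the estimates matter more than the arithmetic.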

The question that every organization must ask: Is a potential event an inconvenience or a disaster? You can ask a subject matter expert to help answer the question or you can follow the money.

Enough cash in the bank can probably sustain an organization through nearly any disaster. The question is: What is enough cash? The answer depends on the organization. There is a point when cash in the bank will run out. Under normal business, revenue from sales replenishes cash removed from the bank. When revenue stops flowing, there is a cash drain—a disaster. All efforts are focused on restoring the revenue stream to stop the cash drain. If you follow the money and identify all processes involved in maintaining the revenue stream—that is, putting cash in the bank—you’ll know if an event is an inconvenience or a disaster.
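The question of what counts as enough cash can be approximated by a simple runway calculation: cash on hand divided by the net monthly drain. This is a minimal sketch with hypothetical figures and an invented function name. It assumes expenses and any remaining revenue stay constant, which a real disaster rarely allows.

```python
def months_of_runway(cash_on_hand, monthly_expenses, monthly_revenue=0.0):
    """Months until cash runs out at a constant net drain; infinite if there is no drain."""
    net_drain = monthly_expenses - monthly_revenue
    if net_drain <= 0:
        return float("inf")  # revenue covers expenses, so cash is not being drained
    return cash_on_hand / net_drain


# Hypothetical example: $600,000 in the bank, $200,000 of monthly expenses, no revenue.
print(months_of_runway(600_000, 200_000))
```

An event that stops revenue for longer than the runway is a disaster by this measure. A shorter interruption is closer to an inconvenience.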

Detail Analysis Is Critical

Let’s say this is the flow of money into the organization—the old-fashioned way.

The mail room staff picks up mail at the post office

Mail containing checks is sent to the accounts receivable department

An accounts receivable employee opens the envelope and separates out the check

The check is endorsed

A messenger takes the checks to the bank

A bank employee opens the package of checks

Checks are sent to a clearing house for processing

The bank credits the company’s account once the check clears

Now let’s take a look at the critical processes and assets.

The mail room staff picks up mail at the post office

  • The employee must awaken
  • Get dressed
  • Feel comfortable leaving their family to go to work
  • Travel safely to work
  • The office building must be intact and opened
  • The elevators in the office building must be working
  • The employee must be able to travel to the post office
  • Postal employees must have arrived at work
  • The post office must be open
  • And so on...

What can go wrong?

The employee oversleeps

The employee’s home is destroyed by the disaster

The safety of the employee’s family is a higher priority than going to work

Transportation is disrupted and the employee is unable to get to work

The office building is closed due to an emergency

The employee responsible for opening the office building doesn’t show up for work

There is a power outage, preventing the elevators from working

Transportation to the post office is disrupted

Postal employees cannot go to work

The post office is closed

And so on...

Think in very basic terms:

You need breathable air, drinkable water, power, and heat. Without any one of these, your business cannot operate. What would happen if there were a water main break causing the utility to turn off water to your office building? The office building would be evacuated.

You need people to build, sell, and buy your products. Without any one of these groups, your business cannot operate. You can build a warehouse full of products but no one will buy them if a series of blizzards forces malls to close.

You need revenue—money coming into the business. Without a revenue stream, your organization will use cash on hand (savings) to pay expenses. Eventually the organization will run out of cash.

Define disaster scenarios and the impact each would have on the organization:

  • What if a major supplier suddenly went out of business?
  • What if employees of a major customer went on strike?

Low-Level Focus

The disaster recovery plan focuses on the operational level, where detailed plans clearly lay out the process to recover from a disaster, so that when a disaster strikes the staff need only follow the disaster recovery plan. There is no need to assess the situation and then decide how to respond. Assessment and the optimum response are made in advance of the disaster, when there is time to evaluate risks and options.

Here are common details that need to be considered in a disaster recovery plan.

What is the staffing level needed to maintain minimum functionality?

Are staffing levels maintained at the minimum level for all shifts?

How will staff arrive during a disaster?

Where will staff be stationed in the facility during a disaster?

Does the staff have the skillsets necessary to provide minimum functionality?

Where will off-duty staff who are not leaving the facility go to sleep, shower, and change clothes?

Is there sufficient food available for staff for the duration of the disaster and the first week following the disaster?

Will employees be more concerned about the disaster affecting their families and homes than coming to work?

Are employees able to come to work if mass transit is not operational?

How long will supplies last?

How will staff communicate with each other during a disaster? Telephone communication within and external to the facility may be unavailable.

Disaster Recovery Options

There are many options to respond to a disaster, and some are better than others. The worst plan is to wait until a disaster strikes and then try to devise viable response options. The military uses the phrase “the fog of war” to describe the confusion and uncertainty that exist during a conflict, and the same mindset exists during a disaster. Few think clearly in the heat of battle. This is why the military has staff dedicated to anticipating conflicts and devising well thought-out response options for each possible conflict.

Disasters—like military conflicts—should be anticipated and response options well defined before a disaster occurs. A response option definition should clearly state what to do; when to do it; how to do it; and how to measure if it worked successfully. When the disaster occurs, the focus is on identifying the best response option and then following the plan for implementing the response option.

Be realistic. The disaster recovery plan must provide detailed instructions on every aspect of how employees will do their jobs during and after the disaster. Most important, the disaster recovery plan should consider the impact the disaster has directly on the employees and the employees’ families. It is not reasonable to expect that the employees will forgo the care and safety of their families to handle the organization’s disaster. Think for a moment... if your house were destroyed and your employer’s operations were disrupted, what would you focus on first?

The organization’s data center is critical to the sustainability of the organization since it contains applications, database management systems, databases, and related computing and networking devices needed to keep the organization functional. The failure of the data center to function is a disaster for the organization. Let’s take a look at the response options to illustrate how advance planning for a disaster mitigates risks.

Hot site: A hot site is a fully operational secondary data center that has all applications, databases, and computing devices found in the primary data center. The hot site is typically located in a different region of the country isolated from the environment (power, flood) that may affect the primary data center. All data from the primary data center is nearly instantaneously copied to the hot site when data is stored in the primary databases. If a disaster occurs in the primary data center, a switch is activated, directing the organization to the secondary data center. There is no downtime.

Warm site: A warm site is a secondary data center, usually located in a different region of the country, that has the same computing devices as the primary data center; however, applications and databases need to be installed and configured before the warm site can be activated. There is relatively short downtime.

Cold site: A cold site is a secondary data center usually located in a different region of the country that doesn’t have computing devices, applications, or databases. The data center must be practically rebuilt within the cold site. There is a long downtime.

Outsource site: An outsource site is an arrangement in which the organization contracts with a vendor to supply data center services. The risk of a data center disaster is transferred to the vendor. The vendor is responsible for anticipating potential disasters and devising response options.

Reciprocal agreement: A reciprocal agreement is an arrangement between companies with similar technology that allows each company to use the other’s technology during a disaster.

Consortium arrangement: A consortium arrangement is an agreement among a group of firms to create a disaster recovery site that can be used by its members.

Each option has its advantages and disadvantages. The hot site has no downtime but is the most expensive since the organization is practically running a duplicate data center. The warm site costs less to operate. Typically, the organization contracts with a vendor to use its standby data center. However, there will be several days when the organization will have to operate without access to the data center. The cold site has the lowest ongoing cost but also can take weeks to become operational.
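The trade-off between standing cost and downtime can be made concrete by adding each option's yearly cost to its probability-weighted downtime loss. The sketch below is a rough illustration, not a pricing model. Every figure (yearly costs, downtime days, daily loss, disaster probability) and every name in the code is a hypothetical assumption chosen only to show how the cheapest option shifts as the cost of downtime rises.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteOption:
    name: str
    yearly_cost: float     # hypothetical cost of keeping the option ready
    downtime_days: float   # hypothetical days without the data center after a disaster


def expected_yearly_cost(option, daily_loss, disaster_probability):
    """Standing cost plus the probability-weighted loss from downtime."""
    return option.yearly_cost + disaster_probability * option.downtime_days * daily_loss


def cheapest_option(options, daily_loss, disaster_probability):
    """Option with the lowest expected yearly cost under the given assumptions."""
    return min(options, key=lambda o: expected_yearly_cost(o, daily_loss, disaster_probability))


OPTIONS = [
    SiteOption("hot", 1_000_000, 0),
    SiteOption("warm", 300_000, 3),
    SiteOption("cold", 50_000, 21),
]

# The same 10% yearly disaster probability, with three different costs of a day offline.
for daily_loss in (100_000, 2_000_000, 5_000_000):
    print(daily_loss, cheapest_option(OPTIONS, daily_loss, 0.1).name)
```

Under these assumptions a business that loses little per day of downtime favors the cold site, while one that loses millions per day favors the hot site. A calculation like this ignores losses that are hard to price, such as reputation or patient safety.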

Outsourcing the data center places a key element of the organization’s operation in the hands of a vendor. The contract explicitly states services that the vendor will provide to the organization. Services not included in the contract will not be provided. The organization usually has little influence on how services are provided unless stated in the contract. The organization’s sustainability depends on the sustainability of the vendor. Anything that influences the vendor’s operation (strikes, suppliers) also affects the organization. The outsource data center must meet security, regulatory, and compliance requirements. Terms of the contract should include:

Contract duration

Termination conditions

Testing

Costs

Special security procedures

Notification of systems changes

Hours of operation during recovery

Specific hardware and equipment requirements for recovery

Personnel requirements during the recovery process

Circumstances constituting an emergency

Process to negotiate extension of services

Priorities for making the recovery site operational

Selection of the recovery site should address the following factors.

Number of available sites

Distance between sites and distance for employees to travel to the recovery site

Facilities requirements

Office supplies

Meals

Living quarters for recovery employees

Postal services

Recreational facilities for recovery employees

Travel cost

Site cost

Cost of temporary living for recovery employees

The decision to own, rent, or share the recovery site with other organizations

Communication requirements

Rerouting mail

A data center must be located in an area at low risk of natural disasters such as floods, hurricanes, tornadoes, and earthquakes. Likewise, employees of the outsource data center must live in low-risk areas too. A data center’s service to the organization is dependent on its employees. If employees are personally affected by the disaster, then there is a high risk that the data center will be unable to provide service.

The data center must have a high level of redundancy. If any element of the data center fails, there are two or three elements that can take its place quickly. For example, if a database server fails, a replacement can be fully operational within an hour.

A backup power source is necessary for all facilities. There are two types of backup power sources: battery backup and an on-site generator. The battery backup is used to power certain electrical devices, such as computing devices, for a few hours. The on-site generator is used to power certain electrical devices until the main power source is back online. Only electrical devices that are needed for the highest priorities should be on the backup power system since limited power may be available during the disaster. Make sure that backup power sources are always in working condition and are sufficient to meet the current needs of the facility. More backup power is required as the organization increases its dependency on applications.

Service Level Agreement

Outsourcing transfers the work of providing a service to a vendor, not the accountability for it. It is important to understand that the organization remains responsible for the service although the contract with the vendor appears to transfer those responsibilities to the vendor. Failure of the vendor to provide the service on behalf of the organization does not relieve the organization from the responsibility to provide the service to customers.

The vendor must provide the organization with a service level agreement (SLA). A service level agreement contains objective metrics that both the vendor and the organization can use to measure the vendor’s performance. The service level agreement is typically part of the contract with the vendor and contains remedies should the vendor fail to perform to the expectations of the service level agreement.

The service level agreement specifies the minimum service that the organization will receive from the vendor. Metrics used to measure the service depend on the nature of the service. Let’s say the vendor provides data center services. A common metric to use is mean time to recovery (MTTR). Mean time to recovery is the average time necessary to restore the data center functionality to the organization.
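As a minimal sketch, mean time to recovery can be computed from a log of outage durations and compared with the figure agreed in the SLA. The function names, the sample durations, and the agreed target are hypothetical and serve only as an illustration.

```python
from typing import Sequence


def mean_time_to_recovery(outage_hours: Sequence[float]) -> float:
    """Average number of hours taken to restore service across recorded outages."""
    if not outage_hours:
        raise ValueError("no outages recorded, so MTTR is undefined")
    return sum(outage_hours) / len(outage_hours)


def meets_agreed_mttr(outage_hours: Sequence[float], agreed_mttr_hours: float) -> bool:
    """True when the measured MTTR is no worse than the agreed figure."""
    return mean_time_to_recovery(outage_hours) <= agreed_mttr_hours


# Hypothetical outage log for one quarter, in hours.
outages = [2.0, 4.0, 6.0]
print(mean_time_to_recovery(outages), meets_agreed_mttr(outages, agreed_mttr_hours=4.0))
```

Because MTTR is an average, one very long outage can hide behind several short ones. An agreement may also need a limit on the longest single outage.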

If the vendor manufactures computing devices such as a server, the commonly used metric is mean time between failures (MTBF). Mean time between failures is the average time period that the computing device will work before the device breaks down. This is important to know when acquiring and managing computing devices. Manufacturers test computing devices under various conditions and simulate extended usage. Test results identify a time range after which the computing device is likely to fail. You should acquire the computing device that meets your specifications and has the longest mean time between failures.
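Mean time between failures is usually estimated as total operating time divided by the number of failures observed. Combined with a mean time to recovery, it gives the widely used steady-state availability estimate of MTBF / (MTBF + MTTR). The sketch below uses hypothetical figures and function names.

```python
def mean_time_between_failures(total_operating_hours, failure_count):
    """Average operating hours per failure over the observation period."""
    if failure_count <= 0:
        raise ValueError("at least one failure is needed to estimate MTBF")
    return total_operating_hours / failure_count


def estimated_availability(mtbf_hours, mttr_hours):
    """Fraction of time the device is expected to be working: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical server: 10,000 operating hours with 4 failures, each taking about 5 hours to fix.
mtbf = mean_time_between_failures(10_000, 4)
print(mtbf, estimated_availability(mtbf, 5.0))
```

Two devices with the same MTBF can still differ in availability if one takes much longer to repair, so it helps to compare both numbers when acquiring equipment.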

Here are other commonly used metrics:

Turnaround time (TAT): Turnaround time is the time that is necessary to complete a specific task.

Uptime (UT): Uptime is the amount of time that the application, computing device, or data center is functioning. For example, a computing device may be unavailable for four hours a week while the MIS department performs maintenance on the device; its uptime is the remaining 164 hours of the 168-hour week.

First call resolution (FCR): First call resolution is the percentage of calls to a help desk that are resolved without the callers calling the help desk again.

Time service factor (TSF): Time service factor is the percentage of calls that are answered within a specific time period.

Abandonment rate (AR): Abandonment rate is the percentage of callers who hang up while waiting in the queue, before their calls are answered.

Average speed to answer (ASA): Average speed to answer is the average number of seconds it takes the help desk to answer the phone.
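The metrics above are simple ratios, and the sketch below shows one way to compute them from period totals. The class name, field names, and sample figures are hypothetical. The choice of denominator (calls offered versus calls answered) is an assumption made for this example, and conventions vary between help desks, so the agreement should state which is used.

```python
from dataclasses import dataclass


def percent(part, whole):
    """part as a percentage of whole."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole


def uptime_percent(hours_in_period, hours_unavailable):
    """Share of the period during which the system was functioning."""
    return percent(hours_in_period - hours_unavailable, hours_in_period)


@dataclass
class HelpDeskTotals:
    calls_offered: int            # every call that reached the queue
    calls_answered: int           # calls picked up by the help desk
    answered_within_target: int   # calls answered within the agreed time period
    resolved_first_call: int      # answered calls that needed no follow-up call
    total_wait_seconds: float     # summed wait of all answered calls

    def abandonment_rate(self):
        return percent(self.calls_offered - self.calls_answered, self.calls_offered)

    def time_service_factor(self):
        return percent(self.answered_within_target, self.calls_offered)

    def first_call_resolution(self):
        return percent(self.resolved_first_call, self.calls_answered)

    def average_speed_to_answer(self):
        return self.total_wait_seconds / self.calls_answered


week = HelpDeskTotals(200, 180, 150, 135, 3600.0)
print(week.abandonment_rate(), week.time_service_factor(),
      week.first_call_resolution(), week.average_speed_to_answer())
print(uptime_percent(168, 4))  # four hours of maintenance in a 168-hour week
```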

The IT department and operating units should have an operational level agreement. An operational level agreement is similar in concept to the service level agreement except the agreement is between internal entities within the organization. For example, the IT department agrees to respond to a problem with an application within a half hour of a call to the help desk. The response is material and not simply an IT department representative answering the telephone. That is, someone knowledgeable about the system will address the concerns. IT managers can staff and plan according to the operational level agreement.

Both a service level agreement and an operational level agreement focus on outcomes and not how those outcomes are achieved, except for ensuring that methodologies will comply with regulatory requirements. The vendor or IT may bring in another source to meet the obligation to deliver the outcome.

Disaster Recovery Operations

The organization should create an emergency incident command system (EICS) that takes over operations of the organization during a disaster or emergency. EICS has a chain of command that enables fast, ongoing assessment of the disaster and the impact the disaster has on the operations. The EICS structure enables the emergency incident response team to respond to known problems and anticipate and mitigate problems that might be forthcoming.

The chain of command structure is documented in a job action sheet. The job action sheet lists each position in the command structure and the corresponding roles and responsibilities. Information about the disaster including the job action sheet is shared among operational staff in the emergency incident command site. Each member of the emergency response team can view, assess, and determine the course of action appropriate to the team member’s responsibility.

The emergency response is led by the emergency incident commander (IC). The emergency incident commander is the person in charge of the emergency response. All decisions rest with that person, although the emergency incident commander relies heavily on subject matter experts such as the medical team and governmental emergency management.

There are four areas of concern for the emergency incident commander. Each area is called a section and has a section chief who is responsible for addressing issues within the domain of that section. Sections are:

Operations: Operations involves maintaining an adequate level of business operations. Included are the organization management, administrative operations, production of services, regulatory and contractual compliance, and customer services.

Logistics: Logistics is the management of resources both internal and external to the operations. Logistics involves staffing, supplies, food supplies, receiving and distribution within the facility, and garbage removal: everything necessary to maintain the business and care for the staff during the disaster.

Planning: Planning involves the emergency response team anticipating needs and devising a way to meet those needs in advance. This includes developing a disaster recovery plan.

Finance: The organization must have funds to pay for ongoing operations and for expenditures that are associated with responding to the emergency, such as overtime cost. Furthermore, the organization must ensure that the incoming revenue stream is not disrupted.

Communications: In addition to the four sections, leadership in the organization must have contact information for staff, customers, and vendors available offsite so the disaster recovery team can communicate with them from anywhere during a disaster. Highlight staff, vendors, and customers who are critical to the operations.

Emergency Operations Center (EOC)

The emergency operations center is where the disaster is managed and, if feasible, is located within the facility. Typically, the emergency operations center is in a central location within the facility, such as a large conference room or auditorium. The room should be divided into five areas. One area is for the emergency incident commander and each of the other four is for a section. Each section must be clearly identified and always staffed by at least one representative of that section’s team. Communication connections should be established for each area, enabling a free flow of communication to the field, if necessary.

The emergency incident commander section should display the job action sheet on a whiteboard or flip chart so each member of the emergency response team can clearly identify their role. Another whiteboard/flip chart should list the status of operations, preferably by unit and department. The status should include required staffing levels, actual staffing levels, supplies, and other factors required to operate the organization.

Downtime Procedures

Downtime procedures are processes that are enacted when a disaster or emergency occurs. These are well thought-out steps that, if followed, will maintain functionality of the operations. Each downtime procedure is a recovery script that clearly states who does what and when—and specifically what should be done if the downtime procedure doesn’t work as planned. Where possible, downtime procedures should be automated, such as having backup power automatically activated when the power fails.

Downtime procedures must reflect any changes in the business process and production systems. Members of the disaster recovery team must review the rationale for changes, approve changes, and then incorporate those changes into the disaster recovery plan.

There are two elements of a downtime procedure. These are to keep the organization operational and to recover once the disaster has passed. For example, sales information prior to the disaster is recorded electronically in the sales order application. However, the sales order application might be unavailable due to a power outage. Therefore, sales information is recorded on paper as part of the downtime procedure. Once the disaster is over, a procedure is necessary to record the information in the sales order application—otherwise, the sales information database is incomplete. All downtime procedures should be incorporated in the disaster recovery plan.
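The recovery half of that example can be sketched as a reconciliation step: after the sales order application is back, every paper order taken during the outage is checked against what has been entered, so nothing is left out of the database. This is a minimal sketch. The `PaperOrder` type, the preprinted form numbers, and the function name are hypothetical, and no real sales order application is modeled.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class PaperOrder:
    form_number: int   # hypothetical preprinted number on each paper order form
    customer: str
    amount: float


def orders_still_to_enter(paper_orders: Iterable[PaperOrder],
                          entered_form_numbers: Iterable[int]) -> List[PaperOrder]:
    """Paper orders not yet recorded in the application, in form-number order."""
    entered = set(entered_form_numbers)
    missing = [order for order in paper_orders if order.form_number not in entered]
    return sorted(missing, key=lambda order: order.form_number)


taken_on_paper = [
    PaperOrder(103, "Customer C", 75.00),
    PaperOrder(101, "Customer A", 250.00),
    PaperOrder(102, "Customer B", 120.00),
]
for order in orders_still_to_enter(taken_on_paper, entered_form_numbers=[101, 103]):
    print(order.form_number, order.customer, order.amount)
```

Numbering the paper forms in advance is the design choice that makes the check possible. A gap in the sequence shows that a form was lost, not just that it has yet to be entered.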

Contact Lists

Contact lists are easily overlooked yet are critical to basic business operations. These are lists of employees, vendors, suppliers, and customers. The disaster recovery plan should specify:

Who initiates the contact

The priority of making contact

Method of how contacts are made

Instructions to give when contacted

Contingencies if unable to contact
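The items above map naturally onto a small record per contact. The sketch below is one hypothetical way to hold them so that the calling order and the fallback method are decided in advance and not improvised during the disaster. All names, fields, and sample entries are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional


@dataclass
class Contact:
    name: str
    priority: int                  # 1 means contact first
    initiator: str                 # who initiates the contact
    methods: List[str] = field(default_factory=list)  # preferred method first
    instructions: str = ""         # what to tell the person when reached
    contingency: str = ""          # what to do if no method works


def contact_order(contacts: Iterable[Contact]) -> List[Contact]:
    """Contacts sorted so the highest-priority (lowest number) comes first."""
    return sorted(contacts, key=lambda c: c.priority)


def next_method(contact: Contact, failed_methods: Iterable[str]) -> Optional[str]:
    """First listed method not yet tried and failed, or None when the contingency applies."""
    failed = set(failed_methods)
    for method in contact.methods:
        if method not in failed:
            return method
    return None


contacts = [
    Contact("Key supplier", 2, "Logistics chief", ["office phone", "email"],
            "Confirm delivery status", "Call the alternate supplier"),
    Contact("Data center lead", 1, "Incident commander", ["mobile phone", "text", "home phone"],
            "Report to the emergency operations center", "Escalate to the deputy"),
]
for c in contact_order(contacts):
    print(c.priority, c.name, next_method(c, failed_methods=["mobile phone", "office phone"]))
```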

Disaster Drills

A disaster recovery plan is only as good as its testing. Every disaster recovery plan must be fully tested to identify gaps. Does the plan work? The only way to answer that question is to test each scenario as if the scenario had occurred.

Testing the disaster recovery plan is challenging. The test requires reliance on backup procedures and backup systems. Executives must determine an acceptable level of business disruption during the test.

Can work stop?

Can employees be diverted from their work?

Is there time to test?

Is there a budget for testing?

How much of a disaster do you want to create to ensure that test results are accurate?

A test also cannot fully reproduce the scale of a real disaster. The World Trade Center had more than 20 million square feet of office space. After 9/11, there were only 10 million square feet of office space available in Manhattan, so businesses affected by 9/11 had limited relocation options.

The organization must hold disaster drills on a regular schedule during the course of the year. Disaster drills should simulate real-life disasters to test the response of the emergency response team. Although drills are planned on a schedule, each drill should be held unannounced. The emergency response team and the facility staff should not be alerted to the drill since disasters are rarely known in advance.

The disaster drill can be segmented. For example, the data center or a portion of the data center can operate on backup power for a few hours to test whether or not the backup power is sufficient to support the data center. There are several types of disaster recovery tests.

A checklist test is a walkthrough where no work stoppage occurs.

A simulation test pretends a disaster occurs and uses utility software to check if the hardware and software are recoverable. No production stops.

Parallel tests create a disaster in a parallel system—no production stops. This is a full interruption test in the non-production system.

A recovery production test requires the business to use a hot recovery site.

Backup activities are tested during a disaster drill to ensure that expected operational function is maintained by using the backup. The disaster recovery plan must be modified if backup activities are unable to support operational levels. A disaster recovery plan that is not tested regularly should not be considered valid, because it has not been validated by scheduled testing.

Here are factors to consider when planning a disaster drill:

All employees, including administrators, must participate in the disaster drill.

Make exercises realistic enough to tap into employees’ emotions.

Practice crisis communication with employees, customers, and the outside world—assume that phone lines are down.

Each employee should perform their expected role in a disaster during the disaster drill.

Be sure that the disaster drill is realistic. A real disaster increases stress on staff. You want to assess how well the staff will perform under the stress of a disaster.

Include community services such as police and fire personnel in the disaster drill.

The goal is to find weaknesses in the disaster response and not simply walk through tasks associated with the disaster drill.

Make sure staff are trained to perform roles secondary to their primary responsibility (e.g., administrators are able to move food carts from the kitchen to the floors).

Make sure employees who evacuate the premises take their belongings with them. They won’t be able to go home without car keys, house keys, and other personal belongings.
