CHAPTER 7

Incident Management Readiness

In this chapter, you will learn about

•   Similarities and differences between security incident response, business continuity planning, and disaster recovery planning

•   Performing a business impact analysis and criticality analysis

•   Developing business continuity and disaster recovery plans

•   Classifying incidents

•   Testing response plans and training personnel

This chapter covers Certified Information Security Manager (CISM) Domain 4, “Incident Management,” part A, “Incident Management Readiness.” The entire Incident Management domain represents 30 percent of the CISM examination.

Supporting Tasks in the CISM job practice that align with the Incident Management / Incident Management Readiness domain include:

29.   Establish and maintain an incident response plan, in alignment with the business continuity plan and disaster recovery plan.

30.   Establish and maintain an information security incident classification and categorization process.

31.   Develop and implement processes to ensure the timely identification of information security incidents.

32.   Establish and maintain processes to investigate and document information security incidents in accordance with legal and regulatory requirements.

33.   Establish and maintain incident handling process, including containment, notification, escalation, eradication, and recovery.

34.   Organize, train, equip, and assign responsibilities to incident response teams.

35.   Establish and maintain incident communication plans and processes for internal and external parties.

36.   Evaluate incident management plans through testing and review, including table-top exercises, checklist review, and simulation testing at planned intervals.

Our world is full of surprises, including events that disrupt our plans and activities. In the context of IT and business, several unexpected events can cause significant disruption to business operations, even to the point of threatening the ongoing viability of the organization itself. These events include:

•   Natural disasters

•   Human-made disasters

•   Malicious acts

•   Cyberattacks

•   Changes with unintended consequences

Organizations cannot, for the most part, specifically anticipate these events. Any of these events may inflict damage on information systems, office equipment, and work centers, making it necessary for the organization to act quickly to continue business operations using alternative means. In the case of a cyberattack, it may or may not be possible to reverse the effects of the attack, eradicate whatever harm was caused to information systems, and continue operations on those systems. In some cases, it may be necessary for the organization to continue information processing on other systems until the attacker has been expelled from the primary systems and the damage has been repaired.

Incident management readiness begins with upfront analyses of business processes and their dependence upon business assets, including information systems. This analysis includes a big-picture prioritization of business processes and an up-close examination of business processes and information systems. This is followed by the development of contingency plans, response plans, and restoration plans. A natural by-product of all of this effort is improved resilience of business processes and information systems—even if disruptive events never occur—because overall weaknesses in processes and systems are identified, leading to steps to make tactical improvements.

This chapter explores the available methods and techniques for responding to these disruptive events and returning business operations to their normal, pre-event state. Chapter 8 continues with discussions of incident management tools, techniques, incident containment, recovery, and post-incident activities.

Incident Response Plan

Although security incident response, business continuity planning (BCP), and disaster recovery planning (DRP) are often considered separate disciplines, they share a common objective: the best possible continuity of business operations during and after a disruptive threat event. A wide variety of threat events, if realized, will call upon one or more of these three disciplines in response. Table 7-1 illustrates responses to threat events.

Table 7-1  Event Types and Typical Response

The last entry in Table 7-1 represents an event in which an attacker damages or destroys information or information systems. An incident of this type may require security incident response, BCP, and DRP together. Security incident response is enacted to discover the techniques used by the attacker to compromise systems so that any vulnerabilities can be remediated, thereby preventing similar attacks in the future. Business continuity response is required so that the organization can operate critical business processes without primary processing systems, and disaster recovery response is needed to recover those systems and resume normal operations as quickly as possible. Figure 7-1 depicts the relationship between incident response, BCP, and DRP.

Figure 7-1  Relationship between incident response, business continuity planning, and disaster recovery planning

Security incident response, business continuity, and disaster recovery all require advance planning so that the organization will have discussed, documented, and outlined the responses required for various types of incidents in advance of their occurrence. Risk assessments are the foundation of planning for all three disciplines, because it is necessary for organizations to discover relevant risks and establish priorities during a response. Additionally, by taking this proactive approach, the team will have a framework to lean on for incidents that may not have been considered otherwise.

The improvement of systems and processes is an important byproduct of planning for security incident response, business continuity, and disaster recovery. Primarily, planning efforts reveal improvement opportunities that, when implemented, will result in information systems being more secure and resilient. These improvements generally mean that incidents are either less likely to occur or that they will have less impact on the organization.

Security Incident Response Overview

A security incident is an event in which the confidentiality, integrity, or availability of information (or an information system) has been or is in danger of being compromised. A security incident can also be any event that represents a violation of an organization’s security policy. For instance, if an organization’s security policy states that it is not permitted for one person to use another person’s computer account, then such use that results in the disclosure of information would be considered a security incident. Several types of security incidents can occur:

•   Computer account abuse   Examples include willful account abuse, such as sharing user account credentials with other insiders or outsiders or one person stealing login credentials from another.

•   Computer or network trespass   An unauthorized person accesses a computer network. The methods of trespass include malware, using stolen credentials, access bypass, or gaining physical access to the computer or network and connecting to it directly.

•   Information exposure or theft   Information that is protected by one or more controls may still be exposed to unauthorized people through a weakness in controls or by deliberate or negligent acts or omissions. For instance, an intruder may be able to intercept e-mail messages, client-server communication, file transfers, login credentials, and network diagnostic information. Or a vulnerability in a system may permit an intruder to compromise the system and obtain information stored or processed there.

•   Malware   A worm or a virus outbreak may occur in an organization’s network. The outbreak may disrupt normal business operations simply through the malware’s spread, or the malware may also damage infected systems in other ways, including destroying or altering information. Malware can also eavesdrop on communications and send intercepted sensitive information back to its source.

•   Ransomware and wiperware   A malware attack may include ransomware, where critical data is encrypted and a ransom is demanded in exchange for a decryption key. Variations of ransomware attacks include exfiltration of critical data with the threat of posting the data publicly and wipers that destroy data instead of encrypting it.

•   Denial-of-service (DoS) attack   An attacker floods a target computer or network with a volume of traffic that overwhelms the target so that it is unable to carry out its regular functions. For example, an attacker might flood an online banking web site with so much traffic that the bank’s depositors are unable to use it. Sending traffic that causes the target to malfunction or cease functioning is another form of a DoS attack. Both types result in the malfunction of the target system.

•   Distributed denial-of-service (DDoS) attack   Similar to a DoS attack, a DDoS attack emanates simultaneously from hundreds or thousands of computers that comprise a botnet. A DDoS attack can be difficult to withstand because of the volume of incoming messages, as well as the large number of attacking systems.

•   Encryption or destruction of critical information   A ransomware or wiper attack can result in encrypted or destroyed information.

•   Disclosure of sensitive information   Any sensitive information may be disclosed to any unauthorized party.

•   Information system theft   Laptop computers, mobile devices, and other information-processing and storage equipment can be stolen, which may directly or indirectly lead to further compromises. If the stolen device contains retrievable sensitive information or the means to access sensitive information stored elsewhere, then what started out as a theft of a tangible asset may expand to become a compromise of sensitive information as well.

•   Information system damage   A human intruder or automated malware may cause temporary or irreversible damage to information or an information system. This may result in an interruption in the availability of information, as well as permanent loss of information.

•   Information corruption   A human intruder or automated malware such as a worm or virus may damage information stored on a system. This damage may or may not be readily noticed.

•   Misconfiguration   An error made by an IT worker can result in data loss or system malfunction.

•   Sabotage   A human intruder or automated malware attack may disrupt or damage information, information systems, or facilities in a single organization, several organizations in a market sector, or an entire nation.

These examples should give you an idea of the nature of a security incident. Not all represent cataclysmic events. Other types of incidents may also be considered security incidents in some organizations.

NOTE   A vulnerability that is discovered in an organization is not an incident. However, the severity of the vulnerability may prompt a response similar to that of an actual incident.

Incident Response Plan Development

The time to repair the roof is when the sun is shining.

—John F. Kennedy, 1962

As for any emergency, the best time to plan for security incident response is prior to the start of any actual incident. The middle of an incident is a poor time to analyze the situation thoughtfully, conduct research, and work out the sequence of events that should take place to restore normal operations quickly and effectively; emotions may run high, and there may be a heightened sense of urgency, especially if there has been little or no advance planning.

Effective incident response plans take time to develop. A security manager who is developing an incident response plan must first thoroughly understand the organization’s business processes and underlying information systems and then discover resource requirements, dependencies, and failure points. A security manager may first develop a high-level incident response plan, which is usually followed by the development of several incident response playbooks, the step-by-step instructions to follow when specific types of security incidents occur.

Executive support is essential in incident response plan development, particularly for escalations and communications. Executives need to be comfortable knowing that low-severity incidents are competently handled without their being notified every time, for instance. Also, executives need to know that they will be notified using established protocols when more serious incidents occur.

Objectives

Similar to any intentional activity, organizations need to establish their objectives prior to undertaking an effort to develop security incident response plans. Otherwise, it may not be clear whether business needs are being met. Following are some objectives that may be applicable to many organizations:

•   Minimal or no interruption to customer-facing or revenue-producing business operations

•   No loss of critical information

•   Recovery of lost or damaged information within DRP and BCP recovery targets, mainly recovery point objective (RPO)

•   Least possible disclosure to affected parties

•   Least possible disclosure to regulators

•   Least possible disclosure to shareholders

•   Incident expenses fully covered by cyber insurance and other insurance policies

•   Sound internal and external communication protocols and consistent messaging

Organizations may develop additional objectives that are germane to their business model, degree and type of regulation, and risk tolerance.

Maturity

When undertaking any effort to develop or improve business processes, an organization should consider its current and desired levels of maturity. As a quick reminder, the levels of maturity according to the Capability Maturity Model Integration for Development (CMMI-DEV) are as follows:

1.   Initial.   This represents a process that is ad hoc, inconsistent, unmeasured, and unrepeatable.

2.   Repeatable.   This represents a process that is performed consistently and with the same outcome. It may or may not be well documented.

3.   Defined.   This represents a process that is well defined and documented.

4.   Managed.   This represents a quantitatively measured process with one or more metrics.

5.   Optimizing.   This represents a measured process that is under continuous improvement.

In addition to the objectives listed previously, a security manager should seek to understand the organization’s existing level of maturity and its desired level. Increasing the maturity level of any process or program takes time, and hastening maturity may be unwise. For example, if an organization’s current maturity for incident response is Initial and the long-term desired level is Managed, a number of improvements over one or more years may be required to reach a Managed maturity level.

Resources

Security incident response requires resources; security managers should keep this in mind when developing incident response plans. Each stage of incident response, from detection to closure, requires the involvement of personnel with different skill sets, as well as various tools that enable personnel to detect an incident, analyze it, contain it, and eradicate it. Various types of incidents require various tools and skills.

Personnel   Personnel are the heart of security incident response. Effective security incident response requires personnel with a variety of skills, including the following:

•   Incident detection and analysis   Security operations center (SOC) analysts and other personnel use a variety of monitoring tools that alert them when actionable events occur. These personnel receive alerts and proceed to analyze an incident by drilling into the details. These same people may also undertake threat hunting, proactively searching systems and networks for signs of reconnaissance, malicious command-and-control (C&C) traffic, and intrusion. This function is often outsourced to managed security service providers (MSSPs) that run large 24/7/365 operations, monitoring hundreds or even thousands of client organizations’ networks.

•   Network, system, and application subject matter experts (SMEs)   With expertise in the network devices, systems, and applications related to alerts, these personnel can help SOC analysts and others better understand the meaning behind incidents and their consequences.

•   Malware analysis and reverse engineering   These personnel use tools to identify and analyze malware to better understand what it does on a system and how it communicates with other systems. This helps the organization decide how to contain the incident and defend itself against similar attacks in the future.

•   Forensics   These persons use tools and techniques to collect evidence that helps the organization better understand the nature of the incident. Some evidence may be protected by a chain of custody in anticipation of later legal proceedings.

•   Incident command and control   These personnel have expertise in overall security incident response and take charge during an incident. Generally, this type of coordination is required only in high-impact incidents involving multiple parties in an organization, as well as external entities such as customers, regulators, and law enforcement.

•   Crisis communications   These personnel are skilled in internal communications as well as communications with external parties, including regulators, shareholders, customers, and the public.

•   Legal / privacy   One or more people in an organization’s legal department will read and interpret applicable laws and make decisions related to external communications with customers, regulators, law enforcement, shareholders, and other parties.

•   Business unit leaders   Also referred to as department heads, these personnel will be called to make critical business decisions during an incident. Examples include decisions to take systems offline or transfer work to other processing centers.

•   Executives   The top leaders in the organization who need to be consistently informed and who will be called upon to ratify or make important decisions.

•   Law enforcement   Personnel in external agencies may be able to assist in incident investigation.

Most of these responsibilities require training, which is discussed later in this chapter.

Outsourcing Incident Response   Incident response sometimes involves the use of forensic tools and techniques by trained and experienced incident response personnel. Larger organizations may have one or more such personnel on staff, though most organizations cannot justify the expense of hiring them full-time. Many organizations opt to utilize forensic experts on an on-demand or contract basis, typically in the form of incident monitoring and incident response retainers.

Incident Response Tools and Techniques   There are many forms of security incident detection, prevention, and alerting tools essential in incident response. These tools and techniques are discussed fully in Chapter 8.

Gap Analysis

Prior to the development of a security incident response plan, the security manager must determine the current state of the organization’s incident response capabilities, as well as the desired end state (for example, a completed security incident response plan with specific capabilities and characteristics). A gap analysis is the best way for the security manager to understand what capabilities and resources are lacking. Once gaps are known, a strategy for developing security incident response plans will consist of the creation or acquisition of all necessary resources and personnel.

A gap analysis in the context of security incident response program development is the same gap analysis activity described in more detail in Chapter 2.

Plan Development

A security incident response plan is a document that defines policies, roles, responsibilities, and actions to be taken in the event of a security incident. Often, a response plan also defines and describes roles, responsibilities, and actions that are related to the detection of a security incident. This portion of an incident response plan is vital, considering the high velocity and high impact of certain types of security incidents.

A security incident response plan typically includes these sections:

•   Policy

•   Roles and responsibilities

•   Incident detection capabilities

•   Playbooks

•   Communications

•   Recordkeeping

Playbooks  Recognizing that there are many types of security incidents, each with its own impacts and issues, many organizations develop a collection of incident response playbooks that provide step-by-step instructions for incidents likely to occur in the organization. A set of playbooks may include procedures for the following incidents:

•   Lost or stolen laptop computer

•   Lost or stolen mobile device

•   Extortion and wire fraud

•   Sensitive data exfiltration

•   Malware, ransomware, and wipers

•   Stolen or compromised user credentials

•   Critical vulnerability

•   Externally reported vulnerability

•   DoS attack

•   Unauthorized access

•   Violation of information security-related law, regulation, or contract

•   Business e-mail compromise

During a serious incident, emotions can run high, and personnel under stress may not be able to remember all of the steps required to handle an incident properly. Playbooks help guide experienced and trained personnel in the steps required to examine, contain, and recover from an incident. They are commonplace in other industries: pilots and astronauts use playbooks to handle various emergency situations, for example, and they practice the steps to help them prepare to respond effectively when needed.

Incidents Involving Third Parties  Organizations outsource many of their critical applications and infrastructure to third-party organizations. The fact that applications and infrastructure supporting critical processes are owned and managed by other parties does not absolve an organization from its responsibilities to detect and respond to security incidents, however, and this makes incident detection and response more complex. As a result, organizations need to develop incident response playbooks that are specific to various incident types at each third party to ensure that the organization will be able to detect and respond to an incident effectively.

Incident response related to a third-party application or infrastructure often requires that the organization and each third party understand their respective roles and responsibilities for incident detection and response. For example, software-as-a-service (SaaS) applications often do not make event and log data available to customers. Instead, organizations must rely on those third parties to develop and manage their incident detection and response capabilities properly, including informing affected customers of an incident in progress. Depending on the architecture of a SaaS solution, both the SaaS provider and the customer may have their own steps to take during incident response, and some of those steps may require coordination or assistance from the other party. Joint exercises between companies and critical SaaS providers help build confidence that their incident response plans will work.

Periodic Updates  All security incident management documents need to be periodically reviewed by all of the responsible parties, SMEs, and management to ensure that all agree on the policies, roles and responsibilities, and steps required to detect, contain, and recover from an incident. Generally, organizations should review and update documents at least once per year, as well as any time a significant change is made in an organization or its supporting systems.

Communication and Escalation

Because orderly internal communication is critical to effective incident response, incident response plans should include procedures regarding communications during a security incident. Effective communication keeps incident responders and other affected parties informed about the proceedings of the response.

Incident response plans should also include information about how to communicate with regard to escalations, which can take two forms:

•   Notifying appropriate levels of upper management when an incident has been detected. It is good practice to establish triggers or thresholds concerning when appropriate upper management should be notified of the incident based on the incident type and its impact on the organization. Some organizations accomplish this by classifying different types and levels of incidents, with specific escalation plans for each.

•   Notifying appropriate levels of management when incident response service level agreements (SLAs) have not been met. For example, various tasks performed during an incident response will be expected to require a specific period of time to complete. If a task has not been completed within a reasonable amount of time, appropriate management should be notified. Escalations, in this case, may trigger the use of external resources that can assist with incident response.

Rather than be an ad hoc activity, escalation should be a documented part of the incident response process so that incident responders know how and when to inform executives about issues that occur during incident response and how to proceed when an incident response is not progressing as expected.
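The two forms of escalation described above can be written down as simple lookup rules. The following is a minimal sketch only: the severity names, the notification lists, and the SLA durations are all hypothetical placeholders, and a real incident response plan would define its own.

```python
from datetime import timedelta

# Hypothetical severity levels and notification lists; a real plan defines its own.
NOTIFY_BY_SEVERITY = {
    "low": [],  # handled by responders without notifying upper management
    "medium": ["security manager"],
    "high": ["security manager", "CISO"],
    "critical": ["security manager", "CISO", "executive team"],
}

# Hypothetical SLA: how long a response task may run before it is escalated.
TASK_SLA = {
    "low": timedelta(hours=24),
    "medium": timedelta(hours=8),
    "high": timedelta(hours=4),
    "critical": timedelta(hours=1),
}


def escalation_targets(severity, task_elapsed):
    """Return everyone who should be notified for an incident.

    Form 1: notification based on the incident's classification.
    Form 2: an extra notification when a task has exceeded its SLA.
    """
    targets = list(NOTIFY_BY_SEVERITY[severity])
    if task_elapsed > TASK_SLA[severity]:
        targets.append("incident response management (SLA missed)")
    return targets


print(escalation_targets("low", timedelta(hours=2)))
print(escalation_targets("high", timedelta(hours=5)))
```

Keeping the thresholds in tables rather than in responders' heads is the point of the sketch: the rules can be reviewed and approved by executives in advance and updated during periodic plan reviews.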

Business Impact Analysis

Business impact analysis (BIA) is the study of business processes in an organization to understand their relative criticality, their dependencies upon resources, and how they are affected when interruptions occur. The objective of the BIA is to identify the impact that different business disruption scenarios will have on ongoing business operations. The results of the BIA drive subsequent activities—namely, BCP and DRP. The BIA is one of several steps of critical, detailed analysis that must be carried out before the development of continuity or recovery plans and procedures.

Inventory of Key Processes and Systems

The first step in a BIA is the identification of key business processes and supporting IT systems. Within the overall scope of a BCP project, the objective is to establish a detailed list of all identifiable processes and systems. The process usually begins with the development of a questionnaire or intake form that is circulated to key personnel in end-user departments and also within IT. Figure 7-2 shows a sample intake form.

Figure 7-2  BIA sample intake form for gathering data about key processes

NOTE   Although the BIA includes an enumeration of information systems, the BIA itself is business- and process-centric. Information systems are not the focus; instead, they are considered supporting assets.

Typically, the information gathered on intake forms is transferred to a multi-columned spreadsheet or a business continuity management system, where data on all of the organization’s in-scope processes can be viewed as a whole. This information will become even more useful in subsequent phases of the BCP project, such as the criticality analysis (discussed a bit later in this chapter).

TIP   Use of an intake form is not the only accepted approach when gathering information about critical processes, dependencies, and systems. It’s also acceptable to conduct one-on-one interviews or group interviews with key users and IT personnel to identify critical processes, dependencies, and systems. I recommend the use of an intake form (whether paper-based or electronic), even if the interviewer uses it herself as a framework for note-taking.

Statements of Impact

When processes and systems are being inventoried and cataloged, it is also vitally important to obtain one or more statements of impact for each process and system. A statement of impact is a qualitative or quantitative description of the impact on the business if the process or system were incapacitated for a time.

For IT systems, you might capture the number of users and the departments or functions that are affected by the unavailability of a specific IT system. Include the geography of affected users and functions if that is appropriate. Here are example statements of impact on IT systems:

•   Three thousand users in France and Italy will be unable to access customer records, resulting in degraded customer service.

•   All users in North America will be unable to read or send e-mail, resulting in productivity slowdowns.

Statements of impact for business processes might cite the business functions that would be affected. Here are some example statements of impact:

•   Accounts payable and accounts receivable functions will be unable to process, impacting the availability of services and supplies and resulting in reduced revenue.

•   Legal department will be unable to access contracts and addendums, resulting in lost or delayed revenue.

Statements of impact for revenue-generating and revenue-supporting business functions could quantify financial impact per unit of time (be sure to use the same units of time for all functions so that they can be easily compared with one another). Here are some examples:

•   Inability to place orders for appliances will cost $12,000 per hour.

•   Delays in payments will cost $1,875 per hour in interest charges.
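The advice to use the same unit of time for every function can be illustrated with a short sketch that converts each financial statement of impact to a cost per hour before ranking. The two hourly figures come from the examples above; the third entry, the function names, and the assumption of a 24-hour operating day are hypothetical and included only to show the conversion.

```python
# Hours in each reporting unit, assuming round-the-clock operations
# (an assumption; a business that runs 8 hours a day would use different values).
HOURS_PER_UNIT = {"hour": 1, "day": 24, "week": 168}


def cost_per_hour(amount, unit):
    """Convert a cost quoted per hour, day, or week into a cost per hour."""
    return amount / HOURS_PER_UNIT[unit]


# (function, cost, unit) as gathered from intake forms
statements = [
    ("Appliance order entry", 12_000, "hour"),        # from the example above
    ("Delayed payments (interest)", 1_875, "hour"),   # from the example above
    ("Partner portal (hypothetical)", 96_000, "day"),  # illustrative only
]

# Normalize every statement to the same unit, then rank from highest to lowest.
ranked = sorted(
    ((name, cost_per_hour(amount, unit)) for name, amount, unit in statements),
    key=lambda item: item[1],
    reverse=True,
)

for name, hourly in ranked:
    print(f"{name}: ${hourly:,.0f} per hour")
```

Without the conversion, the hypothetical daily figure would appear to be the largest impact on the worksheet even though its hourly cost is well below that of order entry.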

As statements of impact are gathered, it may make sense to create several columns in the main worksheet so that like units (names of functions, numbers of users, financial figures) can be sorted and ranked later. The statements of impact should be reviewed for relevance. Although business unit leaders are not trying to elevate their importance, some may believe their system is critical to the organization, even though it may make up only a small percentage of the organization’s overall revenue. When reviewed, the information in the statements of impact should be considered relative to the totality of impact to the organization.

When the BIA is completed, the following information will be available about each process and system:

•   Name of the process

•   Who is responsible for its operation

•   A description of its function

•   Dependencies on systems

•   Dependencies on suppliers

•   Dependencies on service providers

•   Dependencies on key employees

•   Quantified statements of impact in terms of revenue, users affected, and/or functions impacted
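For organizations that keep BIA results in a spreadsheet or a business continuity management system, the list above maps naturally onto one record per process. The following sketch shows one possible shape for such a record; the class name, field names, and sample values are assumptions made for illustration and are not taken from any particular tool.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class BiaRecord:
    """One row of BIA output for a single business process (hypothetical layout)."""
    process_name: str
    owner: str
    description: str
    system_dependencies: List[str] = field(default_factory=list)
    supplier_dependencies: List[str] = field(default_factory=list)
    service_provider_dependencies: List[str] = field(default_factory=list)
    key_employee_dependencies: List[str] = field(default_factory=list)
    impact_statements: List[str] = field(default_factory=list)

    def missing_fields(self):
        """Names of fields still empty, so incomplete intake forms can be followed up."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Sample values are invented for illustration.
record = BiaRecord(
    process_name="Customer order entry",
    owner="Director of Sales Operations",
    description="Takes and confirms customer orders",
    system_dependencies=["Order management system"],
    service_provider_dependencies=["Payment processor"],
    key_employee_dependencies=["Order desk supervisor"],
    impact_statements=["Inability to place orders will cost $12,000 per hour."],
)

print(record.missing_fields())
```

A completeness check of this kind is useful before the criticality analysis begins, because a blank dependency field may mean either that no dependency exists or that nobody was asked.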

Criticality Analysis

When all of the BIA information has been collected and charted, a criticality analysis can be performed. Criticality analysis is a study of each system and process, a consideration of the impact on the organization if it is incapacitated, the likelihood of incapacitation, and the estimated cost of mitigating the risk or impact of incapacitation. In other words, it’s a somewhat special type of risk analysis that focuses on key processes and systems. The criticality analysis should also include a vulnerability analysis (aka vulnerability assessment), an examination of a process or system to identify vulnerabilities that, if exploited, could incapacitate or harm the process or system.

In the context of a BIA, a vulnerability analysis need not be at the level of detail of a security scan to find missing patches or security misconfiguration. Instead, this type of vulnerability analysis seeks to find characteristics in a process or system, such as the following:

•   Single points of failure, such as only one staff member who knows how to perform a key procedure.

•   System not backed up.

•   System lacks resilient architecture features, such as dual power supplies.

•   Procedure uses hard copy records and cannot be performed remotely.

•   No training material is available for workers.

The criticality analysis also needs to include, or reference, a threat analysis, a risk analysis that identifies every threat that has a reasonable probability of occurrence, plus one or more mitigating controls or compensating controls, and new probabilities of occurrence with those mitigating/compensating controls in place. In case you’re having a little trouble imagining what this looks like (I’m writing the book and I’m having trouble seeing this!), take a look at Table 7-2, which is a lightweight example of what I’m talking about.

Table 7-2  Example Threat Analysis Identifying Threats and Controls for Critical Systems and Processes

In Table 7-2, notice the following:

•   Multiple threats are listed for a single asset. Only nine threats are included, and for all the threats but one, only a single mitigating control is listed. For the extended power outage threat, two mitigating controls are included.

•   Cost of downtime wasn’t listed. For systems or processes with a cost per unit of time for downtime, this should be included, along with some calculations to show the payback for each control.

•   Some mitigating controls can benefit more than one system. This may not be obvious in this example, but many systems can benefit from a UPS and an electric generator, so the cost for these mitigating controls can be allocated across many systems, thereby lowering the cost for each system. Another similar example, though not included in the analysis example, is a high-availability storage area network (SAN) located in two different geographic areas; though initially expensive, the SAN can provide storage for many applications, and all will benefit from replication to the counterpart storage system.

•   Threat probabilities are arbitrary. The probabilities are for a single occurrence in an entire year, so, for example, 5 percent means the threat is expected to be realized, on average, about once every 20 years.

•   The length of the outage was not included. This should be included, particularly if you are quantifying downtime per hour or other units of time.
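
The missing pieces noted above (cost of downtime, length of outage, and payback for each control) can be combined in a simple calculation. The following Python sketch is one possible approach, not the method used to build Table 7-2: it assumes expected annual loss is the annual probability multiplied by the outage length and the hourly cost of downtime, and all figures and function names are hypothetical.

```python
def expected_annual_loss(annual_probability, outage_hours, cost_per_hour):
    """Expected yearly loss from one threat (assumed model:
    probability x outage length x hourly cost of downtime)."""
    return annual_probability * outage_hours * cost_per_hour


def payback_years(control_cost, loss_before, loss_after):
    """Years for a control to pay for itself through reduced expected loss."""
    annual_saving = loss_before - loss_after
    if annual_saving <= 0:
        return float("inf")  # the control never pays for itself
    return control_cost / annual_saving


# Hypothetical figures: a threat with a 5 percent annual probability,
# an 8-hour outage, and downtime costing 10,000 per hour.
loss_before = expected_annual_loss(0.05, outage_hours=8, cost_per_hour=10_000)

# Assume a mitigating control lowers the annual probability to 1 percent.
loss_after = expected_annual_loss(0.01, outage_hours=8, cost_per_hour=10_000)

years = payback_years(control_cost=16_000,
                      loss_before=loss_before, loss_after=loss_after)
```

With these invented numbers, the control reduces expected loss from about 4,000 to about 800 per year, so a 16,000 control pays for itself in roughly five years. A control shared by several systems, such as a generator, would have its cost divided among them before this calculation.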

Obviously, a vulnerability analysis, threat analysis, and the corresponding criticality analysis can get complicated. The rule here should be this: the complexity of the vulnerability, threat, and criticality analyses should be proportional to the value of the assets (or revenue, or both). For example, in a company at which application downtime is measured in thousands of dollars per minute, it’s probably worth taking a few weeks or even months to work out all of the likely scenarios, identify a variety of mitigating controls, and determine which ones are the most cost-effective. On the other hand, for a system or business process with a far less costly outage impact, a lot less time may be spent on the supporting analyses.

EXAM TIP   Test-takers should ensure that any question dealing with BIA and criticality analysis places the business impact analysis first. Without this analysis, criticality analysis is impossible to evaluate in terms of likelihood or cost-effectiveness in mitigation strategies. The BIA identifies strategic resources and provides a value to their recovery and operation, which is, in turn, consumed in the criticality analysis phase. If presented with a question identifying BCP at a particular stage, make sure that any answers you select facilitate the BIA and then the CA before moving on toward objectives and strategies.

Determine Maximum Tolerable Downtime

The next step for each critical process is the establishment of a maximum tolerable downtime (MTD), aka acceptable interruption window (AIW). This theoretical period of time is measured from the onset of a disaster to the point at which the organization’s very survival is at risk. Establishing MTD for each critical process is an important step that aids in the establishment of key recovery targets, discussed later in this chapter. Although MTD is sometimes referred to as a target, it would be a mistake to call it one. It is better to consider MTD as an estimated “point-of-no-return” time value.

Executives should ultimately determine MTD values for various critical business functions in their organization; there is often no single MTD for the entire organization, but usually an MTD is established for each major function. For example, an online merchandiser might establish an MTD of 7 days for its online ordering function and 28 days for its payroll function. If the organization’s ability to earn revenue is incapacitated for 7 days, in the opinion of its executives, its business will suffer so greatly that the organization itself may fail. However, the organization could tolerate a payroll system outage of up to four weeks, as enough employees are likely to tolerate a lengthy payroll outage that the organization will survive. After four weeks, enough employees may abandon their jobs that the organization will be unable to continue operations.

MTD is generally used for BCP purposes. However, as some operational and security incidents can become disasters, when severe enough, MTD is also considered in information risk management planning.

Determine Maximum Tolerable Outage

Next, the maximum tolerable outage (MTO) metric needs to be determined. MTO is a measure of the maximum time that an organization can tolerate operating in recovery (or alternate processing) mode. This metric comes into play when systems and processes in recovery mode operate at a lower level of throughput, consistency, quality, or integrity, or at a higher cost. MTO drives the need to reestablish normal production operations within a specific period of time.

Here’s an example: Suppose an organization produces online advertising that is specially targeted to individual users based on their known characteristics. This feature makes the organization competitive in the online ad market. In recovery mode, the organization’s system lacks several key targeting capabilities; it would not be competitive and could not sustain business operations in the long term in such a state, so it has set its MTO at 48 hours. Running in alternate processing mode for more than 48 hours would result in lost revenue and losses in market share.

NOTE   Like MTD, MTO is not a target, but when MTO is set, recovery targets such as recovery time objective (RTO), recovery point objective (RPO), service delivery objective (SDO), recovery consistency objective (RCO), and recovery capacity objective (RCapO) can be established.

Establish Key Recovery Targets

When the cost or impact of downtime has been established and the cost and benefit of mitigating controls have been considered, some key targets can be established for each critical process. The two key targets are RTO and RPO, which determine how quickly key systems and processes are made available after the onset of a disaster and the maximum tolerable data loss that results from the disaster. Following are the key recovery targets:

•   Recovery time objective (RTO)   The maximum period that elapses from the onset of a disaster until the resumption of service

•   Recovery point objective (RPO)   The maximum data loss from the onset of a disaster

•   Recovery capacity objective (RCapO)   The minimum acceptable processing or storage capacity of an alternate process or system, as compared to the primary process or system

•   Service delivery objective (SDO)   The agreed upon level or quality of service at an alternate processing site

•   Recovery consistency objective (RCO)   The consistency and integrity of processing in a recovery system, as compared to the primary processing system

Once these objectives are known, the business continuity team can develop contingency plans to be followed when a disaster occurs, and the disaster recovery team can begin to build system recovery capabilities and procedures that will help the organization to realize these recovery targets economically.

Recovery Time Objective  The RTO establishes a measurable interval of time during which the necessary activities for recovering or resuming business operations must take place. Various business processes in an organization will have different RTO targets, and some business processes will have RTOs that vary according to business cycles on a daily, weekly, monthly, or annual basis. For instance, point-of-sale terminals may have a short RTO during peak business hours, a longer RTO during less busy hours, and a still longer RTO when the business is closed. Similarly, financial and payroll systems will have RTOs that are shorter during times of critical processing, such as payroll cycles and financials at the end of the month. RTOs, data classification, and asset classification are all interrelated. Business processes with shorter RTOs are likely to have data and assets that are classified as more operationally critical.

When establishing RTOs, security managers typically interview personnel in middle management as well as senior and executive management. Personnel at different levels of responsibility will have different perspectives on the criticality of business functions. As a result, any particular business function prioritized at one level by a middle manager may be classified as higher or lower by executives. Ultimately, executive management will prioritize business functions across the entire organization, and executive prioritization will prevail. For example, a middle manager in the accounting department may assert that accounts payable is the most critical business activity because external service providers will stop providing service if they are not paid. But executives, who have control over the entire organization, may stipulate that customer service is the most critical business function since the organization’s future revenue depends on the quality of care customers receive every day.

RTOs are established by conducting a BIA, which helps the security manager understand the criticality of business processes, their resource dependencies and interdependencies, and the costs associated with interruptions in service. RTOs are a cornerstone objective in BCP. Once RTOs are established for a particular business function, contingency plans that support the RTO can be established. While shorter RTOs are most often associated with higher costs, organizations generally seek a break-even point where the cost of recovery is the same as the cost of interruption for the period of time associated with the RTO.
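
The break-even idea can be illustrated with a small comparison. The following Python sketch assumes a handful of candidate RTOs, each with an estimated cost for the recovery capability needed to achieve it, and a flat hourly cost of interruption. These figures, the flat-rate assumption, and the function name are all hypothetical simplifications made for this example.

```python
def closest_to_break_even(options, interruption_cost_per_hour):
    """Pick the (rto_hours, recovery_cost) option whose recovery cost is
    nearest to the interruption cost accrued over that RTO."""
    def gap(option):
        rto_hours, recovery_cost = option
        interruption_cost = rto_hours * interruption_cost_per_hour
        return abs(recovery_cost - interruption_cost)
    return min(options, key=gap)


# Hypothetical candidates: shorter RTOs require costlier recovery capabilities.
options = [(4, 500_000), (24, 120_000), (72, 40_000)]
best = closest_to_break_even(options, interruption_cost_per_hour=5_000)
```

With these invented numbers, the 24-hour option is selected because its recovery cost equals the cost of a 24-hour interruption. In practice, interruption costs rarely accrue at a flat rate, so a real analysis would substitute the impact figures gathered in the BIA.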

NOTE   For a given organization, it’s probably best to use one unit of measure for recovery objectives for all systems. This will help you avoid any errors that would occur during a rank-ordering of systems, so that two days do not appear to be a shorter period than four hours.
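
The ranking error described in this note is easy to demonstrate. The following Python sketch converts every objective to hours before sorting; the system names, RTO values, and helper names are invented for illustration.

```python
# Conversion factors to a single unit of measure (hours).
HOURS_PER_UNIT = {"minutes": 1 / 60, "hours": 1, "days": 24, "weeks": 168}


def to_hours(value, unit):
    """Convert a recovery objective to hours."""
    return value * HOURS_PER_UNIT[unit]


# Hypothetical RTOs recorded in mixed units.
rtos = {
    "Point of sale": (4, "hours"),
    "Payroll": (2, "days"),
    "Email": (90, "minutes"),
}

# Sorting on the raw numbers ignores units and ranks Payroll (2) first.
naive = sorted(rtos, key=lambda name: rtos[name][0])

# Sorting on normalized hours gives the correct shortest-to-longest order.
ranked = sorted(rtos, key=lambda name: to_hours(*rtos[name]))
```

Here the naive ordering places the two-day payroll RTO ahead of the four-hour point-of-sale RTO, which is exactly the mistake a single unit of measure prevents.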

Recovery Point Objective  Generally, the RPO equates to the maximum period of time between backups or data replication intervals. It is generally measured in minutes or hours, and like RTO, shorter RPO targets typically are associated with higher costs. The value of a system’s RPO is usually a direct result of the frequency of data backup or replication. For example, if an application server is backed up once per day, the RPO is going to be at least 24 hours (or one day, whichever way you like to express it). Maybe it will take three days to rebuild the server, but once data is restored from backup tape, no more than the last 24 hours of transactions are lost. In this case, the RTO is three days, and the RPO is one day.

RPOs represent a different aspect of service quality, as any amount of data loss represents required rework. For example, if an organization receives invoices that are entered into the accounts payable system, an RPO of four hours means that up to four hours of rekeying would be required in the event of an incident or disaster.

RPOs are key objectives in BCP. When RPOs are established, contingency plans can be developed that will help the organization meet its RPO targets.
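
The relationship between backup frequency and RPO can be sketched in a few lines of Python. This is a simplified illustration that assumes data is recoverable only up to the most recent backup; the timestamps and function names are hypothetical.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup, failure_time):
    """Span of transactions lost if the system is restored from last_backup."""
    return failure_time - last_backup


def meets_rpo(backup_interval, rpo):
    """A backup schedule can meet an RPO only if backups run at least that often."""
    return backup_interval <= rpo


# Hypothetical daily backup at 1:00 A.M. and a failure late the same evening.
last_backup = datetime(2024, 1, 10, 1, 0)
failure = datetime(2024, 1, 10, 23, 30)
lost = data_loss_window(last_backup, failure)

daily_meets_one_day = meets_rpo(timedelta(hours=24), timedelta(days=1))
daily_meets_four_hours = meets_rpo(timedelta(hours=24), timedelta(hours=4))
```

In this example, 22.5 hours of transactions would need to be re-entered. A daily backup satisfies a one-day RPO but not a four-hour RPO, which would require more frequent backups or replication.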

Recovery Capacity Objective  The RCapO is generally expressed as a percentage. If any incident or disaster results in the organization switching to a temporary or recovery process or system, the capacity of that temporary or recovery process or system may be less than that used during normal business operations. For example, in the event of a communications outage, cashiers in a retail location will hand-write sales receipts, which may take more time than the use of point-of-sale terminals. The manual process may mean cashiers can process 80 percent as much work; this is the RCapO.

For economic reasons, an organization may elect to build a recovery site that has less processing or storage capacity than the primary site. Management may agree that a recovery site with reduced processing capacity is an acceptable trade-off, given the relatively low likelihood that a failover to a recovery site would occur. For instance, an online service may choose to operate its recovery site at 80 percent of the processing capacity of the primary site. In management’s opinion, the relatively low decrease in capacity is worth the cost savings.

In an emergency situation, management may determine that a disaster recovery server in another city with, say, 60 percent of the capacity of the original server is adequate. In that case, the organization could establish two RTO targets: one for partial capacity and one for full capacity. In other words, the organization needs to determine how quickly a lower-capacity system should be running and when a full-capacity system should be running.
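
Because RCapO is expressed as a percentage, checking a proposed recovery capability against it is simple arithmetic. The following Python sketch assumes capacity is measured in some common unit, such as transactions per hour; the figures and function names are hypothetical.

```python
def capacity_percent(recovery_capacity, primary_capacity):
    """Recovery capacity as a percentage of primary capacity."""
    return 100 * recovery_capacity / primary_capacity


def meets_rcapo(recovery_capacity, primary_capacity, rcapo_percent):
    """True if the recovery capability provides at least the RCapO percentage.

    Cross-multiplying avoids floating-point rounding at the boundary.
    """
    return recovery_capacity * 100 >= rcapo_percent * primary_capacity


# Hypothetical capacities in transactions per hour.
primary = 1_000
site_a = meets_rcapo(800, primary, rcapo_percent=80)  # 80 percent site
site_b = meets_rcapo(600, primary, rcapo_percent=80)  # 60 percent server
```

With an RCapO of 80 percent, the first site qualifies and the second does not. The second might still be acceptable as the partial-capacity stage if the organization sets separate RTOs for partial and full capacity.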

Service Delivery Objective  Depending on the nature of the business process in question, SDO may be measured in transaction throughput, service quality, response time, available capabilities and features, or something else that is measurable.

Recovery Consistency Objective  Recovery consistency objective (RCO) is a measure of the consistency and integrity of processing at a recovery site, as compared to the primary processing site.

RCO is calculated as 1 – (number of inconsistent objects) / (number of objects). A system that has been recovered in a disaster situation may no longer have 100 percent of its functionality. For instance, an application that lets users view transactions that are more than two years old may, in a recovery situation, contain only 30 days’ worth of data. The RCO decision is usually the result of a careful analysis of the cost of recovering different features and functions in an application environment. In a larger, complex environment, some features may be considered critical, while others are less so.
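
The RCO formula translates directly into code. The following Python sketch implements the calculation as stated; the function name and the sample counts are hypothetical, and what counts as an "object" (records, transactions, files) is left to the organization.

```python
def recovery_consistency(inconsistent_objects, total_objects):
    """RCO = 1 - (number of inconsistent objects) / (number of objects)."""
    if total_objects <= 0:
        raise ValueError("total_objects must be positive")
    if not 0 <= inconsistent_objects <= total_objects:
        raise ValueError("inconsistent_objects must be between 0 and total_objects")
    return 1 - inconsistent_objects / total_objects


# Hypothetical example: 50 of 1,000 recovered objects are inconsistent.
rco = recovery_consistency(50, 1_000)
```

In this example the result is approximately 0.95, or 95 percent consistency. A fully consistent recovery system yields 1.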

For example, suppose an organization’s online application is used to calculate the current and future costs of a household budget. While the primary site uses inputs and performs calculations based upon 12 external data sources, the recovery site performs calculations based on only 8 external data sources. Economic considerations compelled management to accept the fact that the recovery site will calculate results based upon fewer inputs, and that this is an acceptable trade-off between higher licensing fees for the use of some external sources and small variations in the results shown to users of the site.

The RCO comes into play in organizations that decide to scale back the replication of features and functionality at a recovery site versus the primary processing site. For instance, a recovery site may lack detailed reporting capabilities because of the cost of software or service licensing. An organization may have to pay for a second, expensive license for a recovery site that would rarely be used. Instead, management may decide that users or customers can go without those or other functions at a recovery site, instead focusing on core functions.

NOTE   SDO, RTO, RPO, and RCapO are related to one another. Organizations are free to construct recovery target models in ways that work for them. One organization may start with SDOs and derive appropriate RTO, RPO, and RCapO targets, while others may start with RTO and RPO and figure out their SDOs.

Business Continuity Plan (BCP)

As mentioned earlier in the chapter, BCP and DRP are interrelated disciplines with a common objective: to keep critical business processes operating throughout a disaster scenario, while recovering/rebuilding damaged assets to restore business operations in their primary locations. Figure 7-3 shows the relationship between a BCP and a DRP.

Figure 7-3  The relationship between a BCP and a DRP

As mentioned, before business continuity and disaster recovery plans can be developed, a BIA and criticality analysis should be undertaken to define the organization’s business processes, the information systems supporting them, and interdependencies. The criticality analysis specifically identifies business processes that are most critical and defines how quickly they need to be recovered during and after any disaster scenario.

The primary by-product of effective BCP and DRP is improved business resilience, not only in disaster situations but on a daily basis. Close examinations of processes and systems often reveal numerous opportunities for improvement that result in better resilience and fewer unplanned outages. Thus, for many organizations, BCP and DRP benefit the organization even if a disaster never strikes.

NOTE   Although CISM candidates are not required to understand the details of BCP and DRP, they are required to understand the relationship between incident response and BCP and DRP. The principles, methodologies, recovery procedures, and testing techniques are so similar between the two disciplines that it is important for information security managers to understand these disciplines and how they relate to each other.

Business Continuity Planning

BCP reduces risks related to the onset of disasters and other disruptive events. BCP activities identify risks and mitigate those risks through changes or enhancements in business processes or technology so that the impact of disasters is reduced and the time to recovery is lessened. The primary objective of BCP is to improve the chances that the organization will survive a disaster without incurring costly or even fatal damage to its most critical activities.

The activities of BCP development scale for any size organization. BCP has the unfortunate reputation of existing only in the stratospheric thin air of the largest and wealthiest organizations. This misunderstanding hurts the majority of organizations that are too timid to begin any kind of BCP efforts, because they believe that these activities are too costly and disruptive. The fact is that any size organization, from a one-person home office to a multinational conglomerate, can successfully undertake BCP projects that will bring about immediate benefits and take some of the sting out of disruptive events that do occur.

Organizations can benefit from BCP projects even if a disaster never occurs. The steps in the BCP development life cycle process bring immediate benefit in the form of process and technology improvements that increase the resilience, integrity, and efficiency of those processes and systems. BCP generally is managed outside of the information security function. Further, BCP is generally external to IT, because BCP is focused on the continuity of business processes, not on the recovery of IT systems.

NOTE   Business continuity planning is closely related to disaster recovery planning—both are concerned with the recovery of business operations after a disaster.

Disasters

In a business context, disasters are unexpected and unplanned events that result in the disruption of business operations. A disaster could be a regional event spread over a wide geographic area or an event that occurs within the confines of a single room. The impact of a disaster will also vary, from a complete interruption of all company operations to a mere slowdown. (This question invariably comes up: When is a disaster a disaster? This is somewhat subjective, like asking, “When is a person sick?” Is it when she is too ill to report to work or when she just has a sniffle and a scratchy throat? I’ll discuss disaster declaration later in this chapter in the section “Developing Continuity Plans.”)

Types of Disasters  BCP professionals broadly classify disasters as natural or human-made, although the origin of a disaster does not figure very much into how we respond to it. Let’s examine the types of disasters.

Natural Disasters  Natural disasters occur in the natural world with little or no assistance from humans. They are a result of the natural processes that occur in, on, and above the earth. Here are examples of natural disasters:

•   Earthquakes   Sudden movements of the earth with the capacity to damage buildings, houses, roads, bridges, and dams; precipitate landslides and avalanches; and induce flooding and other secondary events.

•   Floods   Standing or moving water spills out of its banks and flows into and through buildings and causes significant damage to roads, buildings, and utilities. Flooding can be a result of locally heavy rains, heavy snow melt, a dam or levee break, tropical cyclone storm surge, or an avalanche or landslide that displaces lake or river water.

•   Volcanoes   Eruptions of magma, pyroclastic flows, steam, ash, and flying rocks that can cause significant damage over wide geographic regions. Some volcanoes, such as Kilauea in Hawaii, produce a nearly continuous and predictable outpouring of lava in a limited area, whereas the Mount St. Helens eruption in 1980 caused an ash fall over thousands of square miles, brought many metropolitan areas to a standstill for days, and blocked rivers and damaged roads.

•   Landslides   These sudden downhill movements of the earth, usually down steep slopes, can bury buildings, houses, roads, and public utilities and can cause secondary (although still disastrous) effects such as the rerouting of rivers.

•   Avalanches   These are sudden downward flows of snow, rocks, and debris on a mountainside. In a slab avalanche, a large, stiff layer of compacted snow forcefully moves down the slope. A loose snow avalanche occurs when the accumulated snowpack exceeds its shear strength. A powder snow avalanche is the largest type and can travel in excess of 200 mph and contain more than 10 million tons of material. All avalanches can damage buildings, houses, roads, and utilities, resulting in direct or indirect damage affecting businesses.

•   Wildfires   Fires in forests, chaparral, and grasslands are part of the natural order. However, fires can also damage buildings and equipment and cause injury and death, such as in the 2017 wildfires in California. Figure 7-4 shows a map of Sonoma County and nearby wildfires, as seen from the NASA Aqua satellite that year.

Figure 7-4  Wildfires in California (Source: NASA)

•   Tropical cyclones   The largest and most violent storms are known in various parts of the world as hurricanes, typhoons, tropical cyclones, tropical storms, and cyclones. Tropical cyclones, such as Hurricane Harvey, consist of strong winds that can reach 190 mph, heavy rains, and storm surges that can raise the level of the ocean by as much as 20 feet, all of which can result in widespread coastal flooding and damage to buildings, houses, roads, and utilities and in significant loss of life.

•   Tornadoes   These violent rotating columns of air can cause catastrophic damage to buildings, houses, roads, and utilities when they reach the ground. Most tornadoes have wind speeds of 40 to 110 mph and travel along the ground for a few miles. Some tornadoes can exceed 300 mph and travel for dozens of miles.

•   Windstorms   While generally less intense than hurricanes and tornadoes, windstorms can nonetheless cause widespread damage, including damage to buildings, roads, and utilities. Widespread electric power outages are common when windstorms uproot trees that fall into power lines.

•   Lightning   These atmospheric discharges of electricity occur during thunderstorms but also during dust storms and volcanic eruptions. Lightning can start fires and also damage buildings and power transmission systems, causing power outages.

•   Ice storms   When rain falls through a layer of colder air, raindrops freeze onto whatever surface they strike, resulting in widespread power outages after heavy ice coats power lines, causing them to collapse. A notable example is the Great Ice Storm of 1998 in eastern Canada, which resulted in millions being without power for as long as two weeks and in the virtual immobilization of the cities of Montreal and Ottawa.

•   Hail   This form of precipitation consists of ice chunks ranging from 5 to 150 mm in diameter. An example of a damaging hailstorm is the April 1999 storm in Sydney, Australia, where hailstones up to 9.5 cm in diameter damaged 40,000 vehicles, 20,000 properties, and 25 airplanes and caused one direct fatality. The storm caused $1.5 billion in damage.

•   Tsunamis   A tsunami is a series of waves that usually results from the sudden vertical displacement of a lake bed or ocean floor but can also be caused by landslides, asteroids, or explosions. A tsunami wave can be barely noticeable in open, deep water, but as it approaches a shoreline, the wave can grow to a height of 50 feet or more. Recent notable examples are the 2004 Indian Ocean tsunami and the 2011 Japan tsunami. Figure 7-5 shows coastline damage from the Japan tsunami.

Figure 7-5  Damage to structures caused by the 2011 Japan tsunami

•   Pandemic   Infectious diseases may spread over a wide geographic region, even worldwide. Pandemics have regularly occurred throughout history and are likely to continue occurring, despite advances in sanitation and immunology. A pandemic is the rapid spread of any type of disease, including typhoid, tuberculosis, bubonic plague, or influenza. Pandemics include the 1918–1920 Spanish flu, the 1956–1958 Asian flu, the 1968–1969 Hong Kong flu, the 2009–2010 swine flu, and the COVID-19 pandemic, which began at the end of 2019. Figure 7-6 shows a field hospital during the pandemic.

Figure 7-6  A field hospital in Brazil during the 2019 COVID pandemic (Image courtesy of Gustavo Basso)

•   Extraterrestrial impacts   This category includes meteorites and other objects that fall from the sky from way, way up. Sure, these events are extremely rare, and most organizations don’t even include these events in their risk analysis, but I’ve included them here for the sake of rounding out the types of natural events.

Human-Caused Disasters  Human-caused disasters are directly or indirectly caused by human activity through action or inaction. The results of human-caused disasters are similar to natural disasters: localized or widespread damage to businesses that results in potentially lengthy interruptions in operations. These are some examples of human-caused disasters:

•   Civil disturbances   These can include protests, demonstrations, riots, strikes, work slowdowns and stoppages, looting, and resulting actions such as curfews, evacuations, or lockdowns.

•   Utility outages   Failures in electric, natural gas, district heating, water, communications, and other utilities can be caused by equipment failures, sabotage, or natural events such as landslides or flooding.

•   Service outages   Failures in IT equipment, software programs, and online services can be caused by hardware failures, software bugs, or misconfiguration.

•   Materials shortages   Interruptions in the supply of food, fuel, supplies, and materials can have a ripple effect on businesses and the services that support them. Readers who are old enough to remember the petroleum shortages of the mid-1970s know what this is all about. Ripple effects from the COVID pandemic lockdown include shortages of baby formula in 2022, shown in Figure 7-7. Shortages can result in spikes in the price of commodities, which is almost as damaging as not having any supply at all.

Figure 7-7  Empty grocery store shelves exhibiting the lack of available baby formula

•   Fires   These fires originate in or involve buildings, equipment, and materials.

•   Hazardous materials spills   Many created or refined substances can be dangerous if they escape their confines. Examples include petroleum substances, gases, pesticides and herbicides, medical substances, and radioactive substances.

•   Transportation accidents   This broad category includes plane crashes, railroad derailment, bridge collapse, and the like.

•   Terrorism and war   Whether they are actions of a nation, nation-state, or other group, terrorism and war can have devastating but usually localized effects in cities and regions. Often, terrorism and war precipitate secondary effects such as famine, disease, materials shortages, and utility outages.

•   Security events   The actions of a lone hacker or a team of organized cybercriminals can bring down one system or network, or many networks, which may result in a widespread interruption in services. Hackers’ activities can directly result in an outage, or an organization can voluntarily (although reluctantly) shut down an affected service or network to contain the incident.

NOTE   It is important to remember that real disasters are usually complex events that involve more than just one type of damaging event. For instance, an earthquake directly damages buildings and equipment, but fires and utility outages can also result. A hurricane also brings flooding, utility outages, and sometimes even hazardous materials events and civil disturbances such as looting.

How Disasters Affect Organizations  Many disasters have direct effects, but sometimes the secondary effects of a disaster event are most significant, from the perspective of ongoing business operations. A risk analysis, which is a part of the BCP process (discussed in the next section in this chapter), will identify the ways in which disasters are likely to affect an organization. During the risk analysis, the primary, secondary, upstream, and downstream effects of likely disaster scenarios are identified and considered. Whoever is performing this analysis will need to have a broad understanding of the interdependencies of business processes and IT systems, as well as the ways in which a disaster will affect ongoing business operations. Similarly, personnel who are developing contingency and recovery plans also need to be familiar with these effects so that those plans will adequately serve the organization’s needs.

Disasters, by our definition, interrupt business operations in some measurable way. An event that may be a disaster for one organization would not necessarily be a disaster for another, particularly if it doesn’t affect the latter. It would be shortsighted to say that a disaster affects only operations; instead, the longer-term effects created by a disaster can impact the organization’s image, brand, reputation, and ongoing financial viability. The factors affecting image, brand, and reputation have as much to do with how the organization communicates to its customers, suppliers, and shareholders, as with how the organization actually handles a disaster in progress.

A disaster can affect an organization’s operations in several ways:

•   Direct damage   Events such as earthquakes, floods, and fires directly damage an organization’s buildings, equipment, or records. The damage may be severe enough that no salvageable items remain, or it may be less severe, and some equipment and buildings may be salvageable or repairable.

•   Utility interruption   Even if an organization’s buildings and equipment are undamaged, a disaster may affect utilities such as power, natural gas, or water, which can incapacitate some or all business operations. Significant delays in refuse collection, for example, can result in unsanitary conditions.

•   Transportation   A disaster may damage transportation systems such as roads, railroads, shipping, or air transport, or render them unusable for a period, causing interruptions in supply lines and personnel transportation.

•   Services and supplier shortage   Even if a disaster does not directly affect an organization, critical suppliers affected by a disaster can cause problems for business operations. For instance, if a regional baker cannot produce and ship bread to its corporate customers, sandwich shops will soon be without a critical resource.

•   Staff availability   A community-wide or regional disaster that affects businesses is likely to affect homes and families as well. Depending upon the nature of a disaster, employees will place a higher priority on the safety and comfort of family members. Also, workers may not be able or willing to travel to work if transportation systems are affected or if there is a significant materials shortage. Employees may also be unwilling to travel to work if they fear for their personal safety or that of their families.

•   Customer availability   Disasters may prevent or dissuade customers from traveling to business locations to conduct business, as many of the factors that keep employees away may also keep customers away.

TIP   The secondary and tertiary effects a particular organization experiences after a disaster depend entirely upon the unique set of circumstances that constitute the organization’s specific critical needs. A risk analysis should be performed to identify these specific factors.

The BCP Process

To plan for disaster preparedness, the organization must start by determining which kinds of disasters are likely and their possible effects on the organization—that is, plan first, act later. The BCP process is a life-cycle process, as shown in Figure 7-8. In other words, BCP (and DRP) is not a one-time event or activity; it’s a set of activities that result in the ongoing preparedness for disaster that continually adapts to changing business conditions and that continually improves.

Figure 7-8  The BCP process life cycle

The following are the elements of the BCP process life cycle:

1.   Assign ownership of the program.

2.   Develop BCP policy.

3.   Conduct business impact analysis.

4.   Perform criticality analysis.

5.   Establish recovery targets.

6.   Define KRIs and KPIs.

7.   Develop recovery and continuity strategies and plans.

8.   Test recovery and continuity plans and procedures.

9.   Test integration of business continuity and disaster recovery plans.

10.   Train personnel.

11.   Maintain strategies, plans, and procedures through periodic reviews and updates.

BCP Policy  A formal BCP effort must, like any strategic activity, flow from the existence of a formal policy and be included in the overall governance model discussed throughout this chapter. BCP should be an integral part of the IT control framework, not lie outside of it. Therefore, BCP policy should include or cite specific controls that ensure that key activities in the BCP life cycle are performed appropriately. BCP policy should also define the scope of the BCP strategy, so the specific business processes (or departments or divisions within an organization) that are included in the BCP effort must be defined. Sometimes the scope will include a geographic boundary. In larger organizations, it is possible to “bite off more than you can chew” and define too large a scope for a BCP project, so limiting the scope to a smaller, more manageable portion of the organization can be a good approach.

Developing Continuity Plans

In the previous section, I discussed the notion of establishing recovery targets and the development of architectures, processes, and procedures. Those processes and procedures govern how the new technologies are run in normal day-to-day operations. When those processes and procedures have been completed, the disaster recovery plans and procedures (actions that will take place during and immediately after a disaster) can be developed.

Suppose, for example, that an organization has established RPO and RTO targets for its critical applications. These targets necessitated the development of server clusters and storage area networks with replication. While implementing those new technologies, the organization developed supporting operations processes and procedures that would be carried out every day during normal business operations. As a separate activity, the organization developed the procedures to be performed when a disaster strikes the primary operations center for those applications; those procedures include all of the steps that must be taken so that the applications can continue operating in an alternate location.

The procedures for operating critical applications during a disaster are a small part of the entire body of procedures that must be developed. Several other sets of procedures must also be developed to prepare the organization, including the following:

•   Personnel safety procedures

•   Disaster declaration procedures

•   Responsibilities

•   Contact information

•   Recovery procedures

•   Continuing operations

•   Restoration procedures

Personnel Safety Procedures  When a disaster strikes, measures to ensure the safety of personnel are the first priority. If the disaster has occurred or is about to occur in a building, personnel may need to be evacuated as soon as possible. Arguably, however, in some situations, evacuation is exactly the wrong thing to do; for example, if a hurricane or tornado is bearing down on a facility, the building itself may be the best shelter for personnel, even if it incurs some damage. The point here is that personnel safety procedures need to be carefully developed, and possibly more than one set of procedures will be needed, depending on the event.

NOTE   Remember that the highest priority in any disaster or emergency situation is the safety of human life.

Personnel safety procedures should include the following factors:

•   All personnel are familiar with evacuation and sheltering procedures.

•   Visitors know how to evacuate the premises and the location of sheltering areas.

•   Signs and placards are posted to indicate emergency evacuation routes and gathering areas outside of the building.

•   Emergency lighting is available to aid in evacuation or sheltering in place.

•   Fire extinguishment equipment (portable fire extinguishers and so on) is readily available.

•   Communication with public safety and law enforcement authorities is available at all times, including times when communications and electric power have been cut off and when all personnel are outside of the building.

•   Care is available for injured personnel.

•   CPR and emergency first-aid training are provided.

•   Safety personnel are available to assist in the evacuation of injured and disabled people.

•   A process is in place to account for visitors and other nonemployees.

•   Emergency shelter is available in extreme weather conditions.

•   Emergency food and drinking water are available when personnel must shelter in place.

•   Periodic tests are conducted to ensure that evacuation procedures will be adequate in the event of a real emergency.

NOTE   Local emergency management organizations may provide additional information that can assist an organization with its emergency personnel safety procedures.

Disaster Declaration Procedures  Disaster response procedures are initiated when a disaster is declared. However, a procedure for the declaration itself must be created to ensure that there will be little doubt as to the conditions that must be present to declare a disaster. Why is a disaster declaration procedure required? It’s not always clear whether a situation is a “real disaster.” Certainly, a 7.5-magnitude earthquake or a major fire is a disaster, but popcorn overcooked in the microwave that sets off a building’s fire alarm system might not be. A disaster declaration procedure must provide some basic conditions that will help determine whether a disaster should be declared.

Further, who has the authority to declare a disaster? If senior management personnel frequently travel and are not on site when a disaster occurs, who else can declare a disaster? Finally, what does it mean to declare a disaster—and what happens next? The following points constitute the primary items organizations need to consider for their disaster declaration procedure.

Form a Core Team  A core team of personnel needs to be established, all of whom will be familiar with the disaster declaration procedure as well as the actions that must take place once a disaster has been declared. This core team should consist of middle and upper managers who are familiar with business operations, particularly those that are critical. This team must be large enough so that a requisite few of them are on hand when a disaster strikes. In organizations that require second shifts, third shifts, and weekend work, some of the core team members should be supervisory personnel during those times, while others can be personnel who work regular business hours and are not always on site.

Declaration Criteria  The declaration procedure must contain some tangible criteria that core team members can consult to guide them down the “Is this a disaster?” decision path. The criteria for declaring a disaster should be related to the availability and viability of ongoing critical business operations. Some example criteria include one or more of the following:

•   Forced evacuation of a building containing or supporting critical operations that is likely to last for more than four hours

•   Hardware, software, or network failures that result in a critical IT system being incapacitated or unavailable for more than four hours

•   Any security incident that results in a critical IT system being incapacitated for more than four hours (such as malware, break-ins, attacks, sabotage, and so on)

•   Any event causing employee absenteeism or supplier shortages that, in turn, results in one or more critical business processes being incapacitated for more than eight hours

•   Any event causing a communications failure that results in critical IT systems being unreachable for more than four hours

This is a pretty complete list of criteria for many organizations. The periods of downtimes will vary from organization to organization. For instance, a large, pure-online business such as Salesforce.com would probably declare a disaster if its main web sites were unavailable for more than a few minutes. But in an organization whose computers are far less critical, an outage of four hours may not be considered a disaster.
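The example criteria above can be expressed as a simple threshold check. The following is a minimal sketch, assuming the four- and eight-hour thresholds from the list; the event categories, the `Event` type, and the function name are hypothetical and chosen only for illustration. Each organization would substitute its own categories and downtime limits.

```python
from dataclasses import dataclass

# Outage thresholds in hours, taken from the example criteria in the text.
# The category names are illustrative assumptions, not a standard taxonomy.
THRESHOLD_HOURS = {
    "building_evacuation": 4,
    "it_failure": 4,
    "security_incident": 4,
    "staff_or_supplier_shortage": 8,
    "communications_failure": 4,
}


@dataclass
class Event:
    kind: str                     # one of the keys in THRESHOLD_HOURS
    affects_critical: bool        # does it affect a critical process or system?
    expected_outage_hours: float  # best estimate at the time of assessment


def meets_declaration_criteria(event: Event) -> bool:
    """Return True when a critical outage is expected to exceed its threshold."""
    threshold = THRESHOLD_HOURS.get(event.kind)
    if threshold is None or not event.affects_critical:
        return False
    return event.expected_outage_hours > threshold
```

A check like this only supports the core team's judgment; it does not replace it. An organization whose web sites are its business would lower the thresholds to minutes, while one with less critical systems might raise them.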

Pulling the Trigger  When disaster declaration criteria are met, the disaster should be declared. The procedure for disaster declaration could permit any single core team member to declare the disaster, but it may be better in some organizations to have two or more core team members agree on whether a disaster should be declared. All core team members empowered to declare a disaster should have the procedure on hand at all times. In most cases, the criteria should fit on a small, laminated wallet card that each team member can carry or keep nearby at all times. For organizations that use the consensus method for declaring a disaster, the wallet card should include the names and contact numbers of other core team members so that each will have a way of contacting others.
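The difference between the single-member and consensus methods comes down to how many core team members must agree. A minimal sketch follows, assuming each reachable member records a yes or no judgment; the function name, the `required_agreement` parameter, and the member names are hypothetical.

```python
from typing import Mapping


def disaster_declared(votes: Mapping[str, bool], required_agreement: int = 1) -> bool:
    """Decide whether enough core team members agree that the criteria are met.

    votes maps a core team member's name to True if that member judges that
    the declaration criteria are met. required_agreement=1 models the
    single-member method; 2 or more models the consensus method.
    """
    if required_agreement < 1:
        raise ValueError("required_agreement must be at least 1")
    agreeing = sum(1 for judged_met in votes.values() if judged_met)
    return agreeing >= required_agreement


# Hypothetical usage: two of three reachable members agree.
votes = {"shift_supervisor": True, "it_manager": True, "operations_manager": False}
single_member_result = disaster_declared(votes)                    # True
consensus_result = disaster_declared(votes, required_agreement=2)  # True
```

Only members who can actually be reached appear in `votes`, which is why the core team must be large enough that the required number is likely to be available.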

Next Steps  Declaring a disaster will trigger the start of one or more other response procedures, but not necessarily all of them. For instance, if a disaster is declared because of a serious computer or software malfunction, there is no need to evacuate the building. While this example may be obvious, not all instances will be this clear. Either the disaster declaration procedure itself or each of the subsequent response procedures should contain criteria that will help determine which response procedures should be enacted.
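One way to record which response procedures a declaration should trigger is a lookup keyed by the type of event. The sketch below is illustrative only: the event types, the mapping, and the fallback rule are assumptions, and the procedure names simply echo the procedure sets listed earlier in this section.

```python
# Hypothetical mapping from declared event type to the procedures it triggers.
PROCEDURES_BY_EVENT = {
    "software_malfunction": ["recovery", "continuing_operations"],
    "building_fire": [
        "personnel_safety",
        "recovery",
        "continuing_operations",
        "restoration",
    ],
}

# Assumed fallback for an unrecognized event type: start with the broad set
# and let command and control stand down whatever proves unnecessary.
DEFAULT_PROCEDURES = ["personnel_safety", "recovery", "continuing_operations"]


def procedures_to_enact(event_type: str) -> list:
    """Return the response procedures to start for a declared event type."""
    return list(PROCEDURES_BY_EVENT.get(event_type, DEFAULT_PROCEDURES))
```

With this mapping, a serious software malfunction starts recovery and continuing-operations procedures but does not trigger an evacuation, which matches the example in the text.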

False Alarms  Probably the most common reason personnel fail to declare a disaster is the fear that an event is not an actual disaster. Core team members empowered to declare a disaster should not necessarily hesitate, however. Instead, core team members could convene with additional team members to reach a firm decision, provided this can be done quickly. If a disaster has been declared and it later becomes clear that a disaster has been averted (or did not exist in the first place), the disaster can simply be called off and declared to be over. Response personnel can be contacted and told to cease response activities and return to their normal activities.

TIP   Depending on the level of effort that takes place in the opening minutes and hours of disaster response, the consequences of declaring a disaster when none exists may or may not be significant. In the spirit of continuous improvement, any organization that has had a few false alarms should seek to improve its disaster declaration criteria. Well-trained and experienced personnel can usually avoid frequent false alarms.

Responsibilities  During a disaster, many important tasks must be performed to evacuate or shelter personnel, assess damage, recover critical processes and systems, and carry out many other functions that are critical to the survival of the enterprise. About 20 different responsibilities are described here. In a large organization, each responsibility may be staffed with a team of two, three, or many individuals. In small organizations, a few people may assume many responsibilities each, switching from role to role as the situation warrants.

All roles will be staffed by people who are available; remember that many of the “ideal” people to fill each role may be unavailable during a disaster for several reasons:

•   Injured, ill, or deceased   Some regional disasters will inflict widespread casualties that will include some proportion of response personnel. Those who are injured, who are ill (in the case of a pandemic, for instance), who are recovering from a sickness or surgery when the disaster occurs, or who are killed by the disaster are clearly not going to be showing up to help out.

•   Caring for family members   Some types of disasters may cause widespread injury or require mass evacuation. In some situations, many personnel will be caring for family members whose immediate needs for safety will take priority over the needs of the organization.

•   Unavailable transportation   Because some disasters result in localized or widespread damage to transportation infrastructure, many people who are willing to be on-site to help with emergency operations will be unable to travel to the site.

•   Out of the area   Some personnel may be away on business travel or on vacation and be unable to respond. However, these situations may provide opportunities in disguise; unaffected by the physical impact of the disaster, these individuals may be able to help out in other ways, such as communicating with suppliers, customers, or other personnel.

•   Communications   Some types of disasters, particularly those that are localized (versus widespread and obvious to an observer), require that disaster response personnel be contacted and asked to help. If a disaster strikes after hours, some personnel may be unreachable if they do not have a mobile phone with them or are out of range.

•   Fear   Some types of disasters (such as a pandemic, terrorist attack, or flood) may cause response personnel to fear for their safety, disregard the call to help, and stay away from the site.

NOTE   Response personnel in all disciplines and responsibilities will need to be able to piece together whatever functionality they are called on to do, using whatever resources are available—this is part art form and part science. Although response and contingency plans may make certain assumptions, personnel may find themselves with inadequate resources, requiring them to do the best they can with the resources available.

Each function will be working with personnel in many other functions, including unfamiliar people. An entire response and recovery operation may resemble an entirely new organization in unfamiliar settings and with an entirely new set of rules. Typically, teams work best when members are familiar with and trust one another. In a response and recovery operation, the stress level is very high because the stakes—survival—are higher, and teams may be composed of people who have little experience with one another and these new roles. This additional stress will bring out the best and worst in people, as illustrated in Figure 7-9.

Figure 7-9  Stress is compounded by the pressure of disaster recovery and the formation of new teams in times of chaos.

Emergency Response and Command and Control (Emergency Management)  The priorities of “first responders” during a disaster include evacuating or sheltering personnel, first aid, triage of injured personnel, and possibly firefighting. During disaster response operations, someone has to be in charge. Resources may be scarce, and many matters will vie for attention. Someone needs to fill the role of decision-maker to keep disaster response activities moving and to handle situations that arise. This role may need to be rotated among various personnel, particularly in smaller organizations, to counteract fatigue.

TIP   Although the first person on the scene may be the person in charge initially, as more personnel show up and the nature of the disaster and response solidifies, qualified assigned personnel will take charge and leadership roles may then be passed among key personnel.

Scribe  It’s vital that one or more people document the important events continually during disaster response operations. From decisions, to discussions, to status, to roll call, these events must be recorded so that the details of disaster response can be pieced together afterward. This will help the organization better understand how disaster response unfolded, how decisions were made, and who performed which actions—all of which will help the organization be better prepared for future events.

Internal Communications  In many disaster scenarios, personnel may be stripped of many or all of their normal means of communication, such as desk phones, voicemail, e-mail, smartphones, and instant messaging. However, during a disaster, communications are vital, especially when nothing is going according to plan. Good communication ensures that the statuses of various activities can be sent to command and control and priorities and orders can be sent to disaster response personnel. Many organizations establish means for emergency communications, including the following:

•   Broadcast alerts   Sent via text, voice, or mobile app, these help inform large numbers of personnel about events affecting the organization.

•   Emergency radio communications   When wireless and wireline communications are not functioning, emergency communication via radio enables personnel in different locations to pass along important information.

External Communications  People outside of the organization also need to stay informed when a disaster strikes. Many parties may want or need to know the status of business operations during and after a disaster:

•   Customers

•   Suppliers

•   Partners

•   Law enforcement and public safety authorities (including first responders)

•   Insurance companies

•   Shareholders

•   Neighbors

•   Regulators

•   Media

These different audiences need different messages, as well as messages in different forms. For instance, notifications to the public may be sent through media outlets, whereas notifications to customers may be sent through e-mail or surface mail.

Legal and Compliance  Several needs may arise during a disaster that require the attention of inside or outside legal counsel. Disasters present unique situations, such as the following, that may require legal assistance:

•   Interpretation of regulations

•   Interpretation of contracts with suppliers and customers

•   Management of matters of liability to other parties

TIP   Typical legal matters need to be resolved before the onset of a disaster, and this information should be included in disaster response procedures. Remember that legal staff members may be unavailable during the disaster.

Damage Assessment  After a physically violent event such as an earthquake or volcanic eruption, or after an event with no physical manifestation such as a serious security incident, one or more experts are needed to examine affected assets and accurately assess the damage. Because most organizations own many different types of assets (buildings, equipment, and information), qualified experts should assess each asset type; only those whose expertise matches the type of event that has occurred need to be consulted. Some needed expertise may go well beyond the skills present in an organization, such as a structural engineer who can assess potential earthquake damage. In such cases, it may be sensible to retain the services of an outside engineer who will respond and provide an assessment of whether a building is safe to occupy after a disaster. In fact, it may make sense to retain more than one in case they themselves are affected by a disaster.

Salvage  Disasters destroy assets that the organization uses to create products or perform services. When a disaster occurs, someone (either a qualified employee or an outside expert) needs to examine assets to determine which are salvageable; then, a salvage team needs to perform the actual salvage operation at a pace that meets the organization’s needs. In some cases, salvage may be a critical-path activity, where critical processes are paralyzed until salvage and repairs to critically needed machinery can be performed. Or the salvage operation may be performed on the inventory of finished goods, raw materials, and other items so that business operations can be resumed. Occasionally, when it is obvious that damaged equipment or materials are a total loss, the salvage effort involves selling the damaged items or materials to another organization. Assessment of damage to assets may be a high priority when an organization is filing an insurance claim. Insurance may be a primary source of funding for the organization’s recovery effort.

CAUTION   Salvage operations may be a critical-path activity or may be carried out well after the disaster. To the greatest extent possible, this should be decided in advance. Otherwise, the command-and-control function will need to decide the priority of salvage operations, potentially wasting valuable time.

Physical Security  Following a disaster, the organization’s usual physical security controls may be compromised. For instance, fencing, walls, and barricades may be damaged, or video surveillance systems may be disabled or have no electric power. These and other failures could lead to an increased risk of loss or damage to assets and personnel until those controls can be repaired. Also, security controls in temporary quarters such as hot/warm/cold sites and temporary work centers may be less effective than those in primary locations.

Supplies  During emergency and recovery operations, personnel will require drinking water, writing tablets, writing utensils, smartphones, portable generators, and extension cords, among other items. The supplies function may be responsible for acquiring these items and replacement assets such as servers and network equipment for a cold site.

Transportation  When workers are operating from a temporary location, and if regional or local transportation systems have been compromised, many arrangements for all kinds of transportation may be required to support emergency operations. These can include transportation of replacement workers, equipment, or supplies by truck, car, rail, sea, or air. This function could also be responsible for arranging for temporary lodging for personnel.

Work Centers  When a disaster event results in business locations being unusable, workers may need to work in temporary locations. These work centers will require a variety of amenities to permit workers to be productive until their primary work locations are again available.

Network  This technology function is responsible for damage assessment to the organization’s voice and data networks, building/configuring networks for emergency operations, or both. This function may require extensive coordination with external telecommunications service providers, which, by the way, may be suffering the effects of a local or regional disaster as well.

Network Services  This function is responsible for network-centric services such as the Domain Name System (DNS), Simple Network Management Protocol (SNMP), network routing, and authentication.

Systems  This function is responsible for building, loading, and configuring the servers and systems that support critical services, applications, databases, and other functions. Personnel may have other resources such as virtualization technology to enable additional flexibility.

Database Management Systems  For critical applications that rely upon database management systems (DBMSs), this function is responsible for building databases on recovery systems and for restoring or recovering data from backup media, replication volumes, or e-vaults onto recovery systems. Database personnel will need to work with systems, network, and applications personnel to ensure that databases are operating properly and are available as needed.

Data and Records  This function is responsible for access to and re-creation of electronic and paper business records. This business function supports critical business processes, works with database management personnel, and, if necessary, works with data-entry personnel to rekey lost data.

Applications  This function is responsible for recovering application functionality on application servers. This may include reloading application software, performing configuration, provisioning roles and user accounts, and connecting the application to databases and network services, as well as other application integration issues.

Access Management  This function is responsible for creating and managing user accounts for network, system, and application access. Personnel with this responsibility may be especially susceptible to social engineering and may be tempted to create user accounts without proper authority or approval.

Information Security and Privacy  Personnel in this capacity are responsible for ensuring that proper security controls are being carried out during recovery and emergency operations. They will be expected to identify risks associated with emergency operations and to require remedies to reduce risks. Security personnel will also be responsible for enforcing privacy controls so that employee and customer personal data will not be compromised, even as business operations are affected by the disaster.

Offsite Storage  This function is responsible for managing the effort of retrieving backup media from offsite storage facilities and for protecting that media in transit to the scene of recovery operations. If recovery operations take place over an extended period (more than a couple of days), data at the recovery site will need to be backed up and sent to an offsite media storage facility to protect that information should a disaster occur at the hot/warm/cold site (and what bad luck that would be!).

User Hardware  In many organizations, little productive work is done when employees don’t have access to their workstations, printers, scanners, copiers, and other office equipment. Thus, a function is required to provide, configure, and support the variety of office equipment required by end users working in temporary or alternate locations. This function, like most others, will have to work with many other personnel to ensure that workstations and other equipment are able to communicate with applications and services as needed to support critical processes.

Training  During emergency operations, when response personnel and users are working in new locations (and often on new or different equipment and software), some may need to be trained to enable them to restore their productivity as quickly as possible. Training personnel should be familiar with many disaster response and recovery procedures so that they can help people in those roles understand what is expected of them. This function will also need to be able to dispense emergency operations procedures to these personnel.

Restoration  This function comes into play when IT is ready to migrate applications running on hot/warm/cold site systems back to the original (or replacement) processing center.

Contract Information  This function is responsible for understanding and interpreting legal contracts. Most organizations are a party to one or more legal contracts that require them to perform specific activities, provide specific services, and communicate status if service levels have changed. These contracts may or may not have provisions for activities and services during disasters, including communications regarding any changes in service levels. This function is vital not only during the disaster planning stages but also during actual disaster response. Customers, suppliers, regulators, and other parties need to be informed according to specific contract terms.

Recovery Procedures  Recovery procedures are the instructions that key personnel use to bootstrap services (such as IT systems and other business-enabling technologies) that support the critical business functions identified in the BIA and criticality analysis. The recovery procedures should work hand-in-hand with the technologies that may have been added to IT systems to make them more resilient.

An example is useful here. Acme Rocket Boots determines that its order-entry business function is highly critical to the ongoing viability of the business and sets recovery objectives to ensure that order entry will resume within 48 hours after a disaster. Acme determines that it needs to invest in storage, backup, and replication technologies to make a 48-hour recovery possible. Without these investments, IT systems supporting order entry would be down for at least ten days until they could be rebuilt from scratch. Acme cannot justify the purchase of systems and software to facilitate an auto-failover of the order-entry application to hot-site disaster recovery servers; instead, the recovery procedure requires that the database be rebuilt from replicated data on cloud-based servers. Other tasks, such as installing recent patches, are also necessary to make recovery servers ready for production use. All of the tasks required to make the systems ready constitute the body of recovery procedures needed to support the business order-entry function.

This example is, of course, an oversimplification. Actual recovery procedures could take dozens of pages of documentation, and procedures would also be necessary for network components, end-user workstations, network services, and other supporting IT services required by the order-entry application. And those are the procedures needed just to get the application running again. More procedures would be needed to keep the applications running properly in the recovery environment.
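A quick way to sanity-check a set of recovery procedures against a recovery target is to add up the estimated task durations and compare the total with the target. The sketch below uses the 48-hour objective from the Acme example; the task list and every duration in it are hypothetical figures for illustration, and the tasks are assumed to run one after another.

```python
RTO_HOURS = 48  # recovery objective from the Acme order-entry example

# Hypothetical recovery tasks with estimated durations in hours.
recovery_tasks = [
    ("Provision cloud-based recovery servers", 6),
    ("Rebuild database from replicated data", 18),
    ("Install recent patches", 4),
    ("Reconnect application and validate order entry", 8),
]


def total_recovery_hours(tasks):
    """Sum the estimated durations, assuming tasks are performed sequentially."""
    return sum(hours for _name, hours in tasks)


def meets_rto(tasks, rto_hours):
    """Return True when the estimated recovery time fits within the target."""
    return total_recovery_hours(tasks) <= rto_hours


estimated = total_recovery_hours(recovery_tasks)
print(f"Estimated recovery time: {estimated} h; within target: "
      f"{meets_rto(recovery_tasks, RTO_HOURS)}")
```

Estimates like these are only as good as the testing behind them, so durations observed during recovery tests should replace the initial guesses.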

Continuing Operations  Procedures for continuing operations have more to do with business processes than they do with IT systems. However, the two are related, because the procedures for continuing critical business processes have to fit hand in hand with the procedures for operating supporting IT systems that may also (but not necessarily) be operating in a recovery or emergency mode.

Let me clarify that last statement. It is entirely conceivable that a disaster could strike an organization with critical business processes that operate in one city but that are supported by IT systems located in another city. A disaster could strike the city with the critical business function, which means that personnel may have to continue operating that business function in another location, on the original, fully featured IT application. It is also possible that a disaster could strike the city with the IT application, forcing it into an emergency/recovery mode in an alternate location while users of the application are operating in a business-as-usual mode. And, of course, a disaster could strike both locations (or a disaster could strike in one location where both the critical business function and its supporting IT applications reside), throwing both the critical business function and its supporting IT applications into emergency mode. Any organization’s reality could be even more complex than this: just add dependencies on external application service providers, applications with custom interfaces, or critical business functions that operate in multiple cities. If you wondered why disaster recovery and business continuity planning were so complicated, perhaps your appreciation has grown just now.

Restoration Procedures  When a disaster has occurred, IT operations need to take up residence in an alternate processing site temporarily while repairs are performed on the original processing site. Once those repairs are completed, IT operations would need to be transitioned back to the main (or replacement) processing facility. You should expect that the procedures for this transition will also be documented (and tested—testing is discussed later in this chapter).

NOTE   Transitioning applications back to the original processing site is not necessarily just a second iteration of the initial move to the hot/warm/cold site. Far from it. The recovery site may have been a skeleton (in capacity, functionality, or both) of its original self. The objective is not necessarily to move the functionality at the recovery site back to the original site but to restore the original functionality to the original site.

Let’s continue the Acme Rocket Boots example. The order-entry application at the disaster recovery site had only basic, not extended, functions. For instance, customers could not look at order history, and they could not place custom orders; they could order only off-the-shelf products. But when the application is moved back to the primary processing facility, the history of orders accumulated on the disaster recovery application needs to be merged back into the main order history database, which was not part of the DRP.

Considerations for Continuity and Recovery Plans  A considerable amount of detailed planning and logistics must go into continuity and recovery plans if they are to be effective.

Availability of Key Personnel  An organization cannot depend upon every member of its regular expert workforce to be available in a disaster. As discussed earlier, personnel may be unavailable for a number of reasons, including the following:

•   Injury, illness, or death

•   Caring for family members

•   Unavailable transportation

•   Damaged transportation infrastructure

•   Being out of the area

•   Lack of communications

•   Fear, related to the disaster and its effects

TIP   An organization must develop thorough and accurate recovery and continuity documentation as well as cross-training and plan testing. When a disaster strikes, an organization has one chance to survive, and this depends upon how well the available personnel are able to follow recovery and continuity procedures and keep critical processes functioning properly.

Emergency Supplies  The onset of a disaster may cause personnel to be stranded at a work location, possibly for several days. This can happen for a number of reasons, including inclement weather or another natural or human-made event that makes travel dangerous or impossible, or a transportation infrastructure that is damaged or blocked with debris. Emergency supplies should be laid up at a work location and made available to personnel stranded there, regardless of whether they are supporting a recovery effort or not.

A disaster can also prompt employees to report to a work location (at the primary location or at an alternate site), where they may remain for days at a time, even around the clock if necessary. A situation like this may make the need for emergency supplies less critical, but it still may be beneficial to the recovery effort to make supplies available to support recovery personnel.

An organization stocking emergency supplies at a work location should consider including the following:

•   Drinking water

•   Food rations

•   First-aid supplies

•   Blankets

•   Flashlights

•   Battery- or crank-powered radio

•   Out-of-band communications with internal and external parties (beepers, walkie-talkies, line-of-sight systems, and so on)

Local emergency response authorities may recommend other supplies be kept at a work location as well.

Communications  Communication within organizations, as well as with customers, suppliers, partners, shareholders, regulators, and others, is vital under normal business conditions. During a disaster and subsequent recovery and restoration operations, these communications are more important than ever, while many of the usual means for communications may be impaired.

Identifying Critical Personnel  A successful disaster recovery operation requires available personnel who are located near company operations centers. Although the primary response personnel may consist of the individuals and teams responsible for day-to-day corporate operations, others need to be identified. In a disaster, some personnel will be unavailable for many reasons (discussed earlier in this chapter).

Key personnel, as well as multiple backups, need to be identified. Backup personnel can consist of employees who have familiarity with specific technologies, such as operating system, database, and network administration, and who can cover for primary personnel if needed. Sure, it would be desirable for these backup personnel also to be trained in specific recovery operations, but at the least, if these personnel can access specific detailed recovery procedures, having them on a call list is probably better than having no available personnel during a disaster.

Notifying Critical Suppliers, Customers, and Other Parties  Along with employees, many other parties need to be notified in the event of a disaster. Outside parties need to be aware of the disaster and basic changes in business conditions. During a regional disaster such as a hurricane or earthquake, nearby parties will certainly be aware of the situation. However, they may not be aware of the status of business operations immediately after the disaster: a regional event’s effects can range from complete destruction of buildings and equipment to no damage at all and normal conditions. Unless key parties are notified of the status, they may have no other way to know for sure.

The people or teams responsible for communicating with these outside parties will need to have all of the individuals and organizations included in a list of parties to contact. This information should be included in emergency response procedures. Parties that need to be contacted may include the following:

•   Key suppliers   This may include electric and gas utilities, fuel delivery, and materials delivery. In a disaster, an organization will often need to impart special instructions to one or more suppliers, requesting delivery of extra supplies or requesting temporary cessation of deliveries.

•   Key customers   In many organizations, key customer relationships are valued above most others. These customers may depend on a steady delivery of products and services that are critical to their own operations; in a disaster, they may have a dire need to know whether such deliveries will be able to continue or not and under what circumstances.

•   Public safety   Police, fire, and other public safety authorities may need to be contacted, not only for emergency operations such as firefighting but also for any required inspections or other services. It is important that “business office” telephone numbers for these agencies be included on contact lists, as 911 and other emergency lines may be flooded by calls from others.

•   Insurance adjusters   Most organizations rely on insurance companies to protect their assets in case of damage or loss in a disaster. Because insurance adjustment funds are often a key part of continuing business operations in an emergency, it’s important that appropriate personnel are able to reach insurers as soon as possible after a disaster has occurred.

•   Regulators   In some industries, organizations are required to notify regulators of certain types of disasters. Though regulators may be aware of noteworthy regional disasters, they may not immediately know an event’s specific effects on an organization. Further, some types of disasters are highly localized and may not be newsworthy, even in a local city.

•   Media   Media outlets such as newspapers and television stations may need to be notified as a means of quickly reaching the community or region with information about the effects of a disaster on organizations.

•   Shareholders   Organizations are usually obliged to notify their shareholders of any disastrous event that affects business operations. This may be the case whether the organization is publicly or privately held.

•   Stakeholders   Organizations will need to notify other parties, including employees, competitors, and other tenants, if one or more multitenant facilities are lost.

Setting Up Call Trees  Disaster response procedures need to include a call tree, a method by which the first personnel involved in a disaster begin notifying others in the organization, informing them of the developing disaster, and enlisting their assistance. Just as the branches of a tree originate at the trunk and are repeatedly subdivided, a call tree is most effective when each person in the tree can make just a few phone calls. Not only will the notification of important personnel proceed more quickly, but each person will not be overburdened with many calls. Remember that many personnel may be unavailable or unreachable. Therefore, a call tree should be structured with sufficient flexibility as well as assurance that all critical personnel can be contacted. Figure 7-10 shows an example call tree.

Figure 7-10  Example call tree structure
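The branching idea behind a call tree can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the role names, the `CALL_TREE` layout, and the fallback rule (when someone cannot be reached, whoever tried to call them takes over that person's branch) are hypothetical and are not taken from Figure 7-10. The sketch also assumes that the person at the root is the one who starts the tree.

```python
from collections import Counter, deque

# Hypothetical call tree: each key is responsible for calling the people listed.
CALL_TREE = {
    "incident_commander": ["it_lead", "facilities_lead", "comms_lead"],
    "it_lead": ["dba", "network_admin"],
    "facilities_lead": ["security_desk"],
    "comms_lead": ["hr_contact", "pr_contact"],
}


def notify(tree, root, reachable):
    """Walk the call tree breadth-first, starting from root.

    Returns (notified, missed, calls_made). If a person cannot be reached,
    the caller who tried them takes over that person's branch, so the people
    further down the tree are still contacted.
    """
    notified, missed = [], []
    calls_made = Counter()
    queue = deque([(root, None)])  # (person, who is calling them)
    while queue:
        person, caller = queue.popleft()
        if caller is not None:
            calls_made[caller] += 1  # count the attempt, answered or not
        if person in reachable:
            notified.append(person)
            next_caller = person
        else:
            missed.append(person)
            next_caller = caller  # fallback: the caller inherits the branch
        for callee in tree.get(person, []):
            queue.append((callee, next_caller))
    return notified, missed, calls_made


everyone = set(CALL_TREE) | {p for callees in CALL_TREE.values() for p in callees}
notified, missed, calls_made = notify(
    CALL_TREE, "incident_commander", everyone - {"it_lead"}
)
```

In this run the IT lead is unreachable, so the incident commander makes five calls instead of three, while the other branch leaders still make only one or two each. A real call tree would also need alternates for every branch leader, which this sketch does not model.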

An organization can also use an automated outcalling system to notify critical personnel of a disaster. Such a system can play a prerecorded message or request that personnel call an information number to hear a message. Most outcalling systems keep a log of which personnel have been successfully reached. An automated calling system should not be located in the same geographic region as the organization’s primary operations, because a regional disaster could damage the system or make it unavailable when it is needed most. The system should be Internet-accessible so that response personnel can access it to determine which personnel have been notified and to make any needed changes before or during a disaster.

NOTE   Consider the use of texting and automated texting platforms or mobile notification apps to inform personnel of disaster situations.

Preparing Wallet Cards  Wallet cards containing emergency contact information should be prepared for core team personnel for the organization, as well as for members in each department who would be actively involved in disaster response. Wallet cards are advantageous because most personnel will have their wallet, pocketbook, or purse nearby at all times, even when away from home, running errands, traveling, or on vacation. Not everyone carries their mobile devices with them every minute of the day. Information on the wallet card should include contact information for fellow team members, a few of the key disaster response personnel, and any conference bridges or emergency call-in numbers that are set up.

NOTE   A wallet card has no reliance on energy or technology and may still be valuable in extreme disaster scenarios.

Figure 7-11 shows an example wallet card. Organizations may also issue digital versions of wallet cards for people to store on mobile devices.

Figure 7-11  Example of a wallet card for core team participants with emergency contact information and disaster declaration criteria

Electronic Contact Lists  Arguably, most IT personnel and business leaders have smartphones and other mobile devices with onboard storage that is available even when cellular carriers are experiencing outages. Copies of contact lists and even disaster response procedures can be stored in smartphones to keep this information handy during a disaster.

Transportation  Some types of disasters may make certain modes of transportation unavailable or unsafe. Widespread natural disasters, such as earthquakes, volcanoes, hurricanes, and floods, can immobilize virtually every form of transportation, including highways, railroads, boats, and airplanes. Other types of disasters may impede one or more types of transportation, which could result in overwhelming demand for the available modes. High volumes of emergency supplies may be needed during and after a disaster, but damaged transportation infrastructure often makes the delivery of those supplies difficult.

Components of a Business Continuity Plan  The complete set of business continuity plan documents will include the following:

•   Supporting project documents   These include the documents created at the beginning of the business continuity project, including the project charter, project plan, statement of scope, and statement of support from executives.

•   Analysis documents   These include the following:

•   BIA

•   Threat assessment and risk assessment

•   Criticality analysis

•   Documents defining recovery targets such as RTO, RPO, RCO, and RCapO

•   Response documents   These documents describe the required actions of personnel when a disaster strikes, plus documents containing information required by those same personnel. Examples of these documents include the following:

•   Business recovery (or resumption) plan   This describes the activities required to recover and resume critical business processes and activities.

•   Occupant emergency plan   This describes activities required to care for occupants safely in a business location during a disaster. This will include both evacuation procedures and sheltering procedures, each of which may be required, depending upon the type of disaster that occurs.

•   Emergency communications plan   This describes the types of communications imparted to many parties, including emergency response personnel, employees in general, customers, suppliers, regulators, public safety organizations, shareholders, and the public.

•   Contact lists   These contain names and contact information for emergency response personnel as well as for critical suppliers, customers, and other parties.

•   Disaster recovery plan   This describes the activities required to restore critical IT systems and other critical assets, whether in alternate or primary locations.

•   Continuity of operations plan   This describes the activities required to continue critical and strategic business functions at an alternate site.

•   Security incident response plan   This describes the steps required to deal with a security incident that could reach disaster-like proportions.

•   Test and review documents   This is the entire collection of documents related to tests of all of the different types of business continuity plans, as well as reviews and revisions to documents.

Making Plans Available to Personnel when Needed

When a disaster strikes, often one of the effects is no access to even the most critical IT systems. In a 40-hour workweek in an organization with on-site personnel, there is roughly a 25 percent likelihood that critical personnel will be at the business location when a disaster strikes (at least the violent type of disaster that strikes with no warning, such as an earthquake—other types of disasters, such as hurricanes, may afford the organization a little bit of time to anticipate the disaster’s impact). The point is that the chances are good that the personnel who are available to respond may be unable to access the procedures and other information that they will need, unless special measures are taken.
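The "roughly 25 percent" figure can be checked with simple arithmetic. The short Python calculation below is a sketch that assumes a single 40-hour shift and a disaster that is equally likely to strike at any hour of the week; the variable names are illustrative only.

```python
HOURS_PER_WEEK = 7 * 24        # 168 hours in a week
WORK_HOURS_PER_WEEK = 40       # one on-site shift, per the text

# Fraction of the week during which on-site personnel are at the workplace
on_site_fraction = WORK_HOURS_PER_WEEK / HOURS_PER_WEEK
print(f"{on_site_fraction:.1%}")  # prints 23.8%
```

Put another way, under these assumptions there is about a three-in-four chance that a no-warning disaster strikes while critical personnel are away from the workplace.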

CAUTION   Complete BCP documentation often contains details of key systems, operating procedures, recovery strategies, and even vendor and model identification of in-place equipment. This information can be misused if available to unauthorized personnel, so the mechanism selected for ensuring availability must include planning to exclude inadvertent disclosure.

Response and recovery procedures can be made available in several ways to personnel during a disaster, including the following:

•   Hard copy   While many have grown accustomed to the paperless office, disaster recovery and response documentation is one type of information that should be available in hard-copy form. Copies, even multiple copies, should be available for each responder, with a copy at the workplace and another at home, and possibly even a set in the responder’s vehicle.

•   Soft copy   Traditionally, soft-copy documentation is kept on file servers, but as you might expect, those file servers may be unavailable in a disaster. Soft copies should be available on responders’ portable devices (laptops, tablets, and smartphones). An organization can also consider issuing documentation on memory sticks and cards. Depending upon the type of disaster, it can be difficult to know what resources will be available to access documentation, so making it available in more than one form will ensure that at least one copy of it will be available to the personnel who need access to it.

•   Alternate work/processing site   Organizations that utilize a hot/warm/cold site for the recovery of critical operations can maintain hard copies and/or soft copies of recovery documentation there. This makes perfect sense; personnel working at an alternate processing or work site will need to know what to do, and having those procedures on site will facilitate their work.

•   Online   Soft copies of recovery documentation can be archived on an Internet-based site that includes the ability to store data. Almost any type of online service that includes authentication and the ability to upload documents could be suitable for this purpose.

•   Wallet cards   It’s unreasonable to expect to publish recovery documentation on a laminated wallet card. However, as described earlier in this chapter, wallet cards can hold the contact information for core response team members as well as a few other pieces of information, such as conference bridge codes, passwords to online repositories of documentation, and so on. An example wallet card appears in Figure 7-11.

Maintaining Recovery and Continuity Plans

Business processes and technology undergo an almost continuous change in most organizations. A business continuity plan that is developed and tested is liable to be outdated within months and obsolete within a year. If much more than a year passes, a disaster recovery plan in some organizations may approach uselessness. Organizations need to keep disaster recovery plans up-to-date and relevant, and they can do this by establishing a schedule whereby the principal disaster recovery documents will be reviewed. Depending on the rate of change, this could be as frequently as quarterly or as seldom as every two years.

Further, every change, however insignificant, in business processes and information systems should include a step to review, and possibly update, relevant disaster recovery documents. A review/update of relevant documents should be a required step in every business process engineering or information systems change process and a key component of the organization’s information systems development life cycle (SDLC). If this is done faithfully, the annual review of documents will likely conclude that only a few (if any) changes are required, although it is still a good practice to perform a periodic review, just to be sure.

Periodic testing of disaster recovery documents and plans is another vital activity. Testing validates the accuracy and relevance of these documents, and any issues or exceptions in the testing process should precipitate updates to appropriate documents.

Sources for Best Practices

It is unnecessary to begin BCP and DRP by inventing a new practice or methodology. These are advanced professions, and several professional associations, certifications, international standards, and publications can provide or lead to sources of practices, processes, and methodologies:

•   National Institute of Standards and Technology (NIST)   This branch of the U.S. Department of Commerce is responsible for developing business and technology standards for the federal government. NIST-created standards are excellent, and as a result, many private organizations all over the world are adopting them. Visit the NIST web site at www.nist.gov.

•   National Incident Management System (NIMS)   As a part of Homeland Security Presidential Directive 5, NIMS is a standard approach to incident management and facilitates coordination between U.S. public agencies’ and private organizations’ incident response plans and incident responders. Information is available from www.fema.gov/emergency-managers/nims.

•   Business Continuity Institute (BCI)   This membership organization is dedicated to the advancement of business continuity management. BCI has more than 8000 members in almost 100 countries. BCI hosts several events around the world, publishes a professional journal, and has developed a professional certification, the Certificate of the BCI (CBCI). For information, visit www.thebci.org.

•   National Fire Protection Association (NFPA)   NFPA has developed a pre-incident planning standard, NFPA 1620, which addresses the protection, construction, and features of buildings and other structures in the United States. It also requires the development of pre-incident plans that emergency responders can use to deal with fires and other emergencies. Visit the NFPA web site at www.nfpa.org.

•   Federal Emergency Management Agency (FEMA)   FEMA is part of the U.S. Department of Homeland Security (DHS) and is responsible for emergency disaster relief planning information and services. FEMA’s most visible activities are its relief operations in the wake of hurricanes and floods in the United States. For more information, visit www.fema.gov.

•   Disaster Recovery Institute International (DRI International)   This professional membership organization provides education and professional certifications for disaster recovery planning professionals. Visit www.drii.org. Its certifications include the following:

•   Associate Business Continuity Professional (ABCP)

•   Certified Business Continuity Vendor (CBCV)

•   Certified Functional Continuity Professional (CFCP)

•   Certified Business Continuity Professional (CBCP)

•   Master Business Continuity Professional (MBCP)

•   Business Continuity Management Institute (BCM Institute)   This professional association specializes in education and professional certification. It is a co-organizer of the World Continuity Congress, an annual conference dedicated to BCP and DRP. Visit www.bcm-institute.org. Certifications offered by BCM Institute include the following:

•   Business Continuity Certified Expert (BCCE)

•   Business Continuity Certified Specialist (BCCS)

•   Business Continuity Certified Planner (BCCP)

•   Disaster Recovery Certified Expert (DRCE)

•   Disaster Recovery Certified Specialist (DRCS)

Disaster Recovery Plan (DRP)

DRP is undertaken to reduce risks related to the onset of disasters and other events. It is mainly an IT function to ensure that key IT systems are available to support critical business processes. DRP is closely related to, but somewhat separate from, BCP: the groundwork for DRP begins in BCP activities such as the business impact analysis, criticality analysis, establishment of recovery objectives, and testing. The outputs from these activities are the key inputs to DRP:

•   The BIA and criticality analysis help to prioritize which business processes (and, therefore, which IT systems) are the most important.

•   Key recovery targets specify how quickly specific IT applications are to be recovered. This guides DRP personnel as they develop new IT architectures that make IT systems compliant with those objectives.

•   Testing of disaster recovery plans can be performed in coordination with tests of business continuity plans to simulate real disasters and disaster response more accurately.

The relationships between BCP and DRP were discussed in detail earlier in this chapter and depicted in Figure 7-3.

Disaster Response Teams’ Roles and Responsibilities

Disaster recovery plans need to specify the teams that are required for disaster response, as well as each team’s roles and responsibilities. Table 7-3 describes several teams and their roles. Because of variations in organizations’ disaster response plans, some of these teams will not be needed in some organizations.

Table 7-3  Disaster Response Teams’ Roles and Responsibilities

NOTE   Some roles in Table 7-3 may overlap with responsibilities defined in the organization’s BCP. Disaster recovery and business continuity planners should work together to ensure that the organization’s overall response to disaster is appropriate and does not overlook vital functions.

Recovery Objectives

During the BIA and criticality analysis phases of a business continuity and disaster recovery project, the speed with which each business activity (with its underlying IT systems) needs to be restored after a disaster is determined. The primary recovery objectives, as discussed in detail earlier in this chapter, are as follows:

•   RTO

•   RPO

•   RCO

•   RCapO

NOTE   Senior management should be involved in any discussion related to recovery system specifications in terms of capacity, integrity, or functionality.

Publishing Recovery Targets

If the storage system for an application takes a snapshot every hour, the RPO could be one hour, unless the storage system itself was damaged in a disaster. If the snapshot is replicated to another storage system four times per day, the RPO might be better expressed as six to eight hours. This brings up an interesting point. There may not be one golden RPO figure for a given system. Instead, the severity of a disrupting event or a disaster will dictate the time to get systems running again (RTO) with a certain amount of data loss (RPO). Here are some examples:

•   A server’s CPU or memory fails and is replaced and restarted in two hours. No data is lost. The RTO is two hours, and the RPO is zero.

•   The storage system supporting an application suffers a hardware failure that results in the loss of all data. Data is recovered from a snapshot on another server taken every six hours. The RPO is six hours in this case.

•   The database in a transaction application is corrupted and must be recovered. Backups are taken twice per day. The RPO is 12 hours. However, it takes 10 hours to rebuild indexes on the database, so the RTO is closer to 22 to 24 hours since the application cannot be returned to service until indexes are available.

TIP   When publishing RTO and RPO figures to customers, it’s best to publish the worst-case figures: “If our data center burns to the ground, our RTO is X hours and the RPO is Y hours.” Saying it that way would be simpler than publishing a chart that shows RPO and RTO figures for various types of disasters.
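The scenarios above, and the idea of publishing worst-case figures, can be sketched in Python. This is an illustration under assumptions, not a method prescribed here: the `Scenario` type and the function names are hypothetical, the storage-failure scenario has no RTO because the text does not give one, and `replica_rpo_hours` assumes that each replication copies the most recent snapshot and ignores transfer time.

```python
from typing import NamedTuple, Optional


class Scenario(NamedTuple):
    name: str
    rto_hours: Optional[float]  # None where the text gives no figure
    rpo_hours: float


# Figures taken from the three example scenarios in the text
SCENARIOS = [
    Scenario("CPU or memory failure", 2, 0),
    Scenario("Storage hardware failure", None, 6),
    Scenario("Database corruption", 24, 12),  # upper end of "22 to 24 hours"
]


def replica_rpo_hours(snapshot_interval_h, replications_per_day):
    """Worst-case age of the replicated copy just before the next replication.

    Assumption: the replicated snapshot may already be one snapshot
    interval old at the moment it is copied.
    """
    replication_interval_h = 24 / replications_per_day
    return snapshot_interval_h + replication_interval_h


def published_targets(scenarios):
    """Return the worst-case (RTO, RPO) across all scenarios with known figures."""
    rto = max(s.rto_hours for s in scenarios if s.rto_hours is not None)
    rpo = max(s.rpo_hours for s in scenarios)
    return rto, rpo


print(replica_rpo_hours(1, 4))       # prints 7.0
print(published_targets(SCENARIOS))  # prints (24, 12)
```

Under these assumptions, hourly snapshots replicated four times per day yield a worst case of seven hours, which is consistent with the "six to eight hours" range mentioned earlier.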

Organizations that publish RCO and RCapO targets will need to include the practical meaning of these targets, whether they represent an exact match of capacity and integrity or some reduction. For example, if an organization’s recovery site is engineered to process 80 percent of the transaction volume of the primary site, an organization should consider stating that processing capacity at a recovery site may be reduced.

Pricing RTO and RPO Capabilities

Generally speaking, the shorter the RTO or RPO for a given system, the more expensive it will be to achieve the target. Table 7-4 depicts a range of RTOs along with the technologies needed to achieve them and their relative cost.

Table 7-4  The Lower the RTO, the Higher the Cost to Achieve It

The BCP project team needs to understand the relationship between the time required to recover an application and the cost required to recover the application within that time. A shorter recovery time is more expensive, and this relationship is not linear. For example, reducing RPO from three days to six hours could double the equipment and software investment, or it could increase it eightfold. So many factors are involved in the supporting infrastructure for a given application that a BCP project team must knuckle down and develop the cost for a few different RTO and RPO figures.

The business value of the application itself is the primary driver in determining the amount of investment that senior management is willing to make to reach any arbitrary RTO and RPO figures. This business value may be measured in local currency if the application supports revenue, but the loss of an application during a disaster may harm the organization’s reputation, which is difficult to monetize. Management must decide how much it is willing to invest in disaster recovery capabilities that bring RTO and RPO figures down to an acceptable level. Figure 7-12 illustrates these relationships.

Figure 7-12  Aim for the sweet spot and balance the costs of downtime and recovery.
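The balance between the cost of downtime and the cost of recovery capability can be sketched as a simple comparison. Every figure below is hypothetical, including the candidate strategies, their RTOs, their costs, and the hourly cost of downtime. The model is also deliberately crude: it considers a single disaster and ignores losses that are hard to monetize, such as reputational harm.

```python
# (label, rto_hours, cost_of_recovery_capability) -- all values hypothetical
CANDIDATES = [
    ("cold site", 240, 50_000),
    ("warm site", 72, 200_000),
    ("hot site", 4, 900_000),
]
DOWNTIME_COST_PER_HOUR = 5_000  # hypothetical revenue lost per hour of outage


def total_cost(rto_hours, capability_cost, downtime_cost_per_hour):
    """Cost of the recovery capability plus the loss incurred during the outage."""
    return capability_cost + rto_hours * downtime_cost_per_hour


def sweet_spot(candidates, downtime_cost_per_hour):
    """Return the candidate with the lowest combined cost."""
    return min(
        candidates,
        key=lambda c: total_cost(c[1], c[2], downtime_cost_per_hour),
    )


best = sweet_spot(CANDIDATES, DOWNTIME_COST_PER_HOUR)
print(best[0])  # prints warm site
```

With these invented numbers, the cold site is cheap to build but expensive in downtime, the hot site is the reverse, and the warm site has the lowest combined cost. A different hourly downtime cost can move the sweet spot, which is why the project team needs to price several RTO and RPO options.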

Developing Recovery Strategies

When management has chosen specific RPO and RTO targets for a given system or process, the BCP project team can then roll up its sleeves and devise some ways to meet these targets. This section discusses the technologies and logistics associated with various recovery strategies. This will help the project team decide which types of strategies are best suited for their organization.

NOTE   Developing recovery strategies to meet specific recovery targets is an iterative process. The project team will develop a strategy to reach specific targets for a specific cost; senior management could well decide that the cost is too high and may increase RPO and/or RTO targets accordingly. Similarly, the project team could also discover that it is less costly to achieve specific RPO and RTO targets, and management could respond by lowering those targets. This is illustrated in Figure 7-13.

Figure 7-13  Recovery objective development flowchart

Site Recovery Options  In a worst-case disaster scenario, the site where information systems reside is partially or wholly destroyed. In most cases, the organization cannot afford to wait for the damaged or destroyed facility to be restored, because this could take weeks or months. If an organization can take that long to recover an application, you’d have to wonder whether it is needed. The assumption must be that in a disaster scenario, the organization will recover critical applications at another location. This other location is called a recovery site. There are two dimensions to the process of choosing a recovery site: the speed at which the application will be recovered at the recovery site and the location of the recovery site itself.

As you might expect, speed costs: Developing the ability to recover more quickly costs more money and resources. If a system is to be recovered within a few minutes or hours, the costs will be much higher than if the organization can recover the system in five days.

Various types of facilities are available for rapid or not-too-rapid recovery. These facilities are called hot sites, warm sites, cold sites, and cloud sites. As the names suggest, hot sites permit rapid recovery, while cold sites provide a much slower recovery. The costs associated with these are also somewhat proportional, as illustrated in Table 7-5.

Table 7-5  Relative Costs of Recovery Sites

Hot Sites  A hot site is an alternate processing center where backup systems are already running and in some state of near-readiness to assume production workload. The systems at a hot site most likely have application software and database management software already loaded and running, perhaps even at the same patch levels as the systems in the primary processing center. A hot site is the best choice for systems whose RTO targets range from zero to several hours, perhaps as long as 24 hours.

A hot site may consist of leased rack space (or even a cage for larger installations) at a co-location center. If the organization has its own processing centers, a hot site for a given system would consist of the required rack space to house the recovery systems. Recovery servers will be installed and running, with the same version and patch level for the operating system, DBMS (if used), and application software.

Systems at a hot site require the same level of administration and maintenance as the primary systems. When patches or configuration changes are made to primary systems, they should be made to hot-site systems at the same time or very shortly afterward. Because systems at a hot site need to be at or very near a state of readiness, a strategy needs to be developed regarding a method for keeping the data on hot standby systems current. Systems at a hot site should also have full network connectivity. A method for quickly directing network traffic toward the recovery servers needs to be worked out in advance so that a switchover can be accomplished. All this is discussed in the “Recovery and Resilience Technologies” section later in this chapter.

The organization sends one or more technical staff members to the hot site to set up systems; once the systems are operating, much or all of the system- and database-level administration can be performed remotely. In a disaster scenario, however, the organization may need to send the administrative staff to the site for day-to-day management of the systems. This means that workspace for these personnel needs to be identified so that they can perform their duties during the recovery operation.

TIP   Hot-site planning needs to consider work (desk) space for on-site personnel. Some co-location centers provide limited work areas that are often shared and offer little privacy for phone discussions. Also, transportation, hotel, and dining accommodations need to be arranged, possibly in advance, particularly if the hot site is in a different city from the primary site.

Warm Sites  A warm site is an alternate processing center where recovery systems are present but at a lower state of readiness than recovery systems at a hot site. For example, while the same version of the operating system may be running on the warm site system, it may be a few patch levels behind primary systems. The same could be said about the versions and patch levels of DBMSs (if used) and application software.

A warm site is appropriate for an organization whose RTO figures range from roughly one to seven days. In a disaster scenario, recovery teams would travel to the warm site and work to get the recovery systems to a state of production readiness and to get systems up-to-date with patches and configuration changes to bring the systems into a state of complete readiness. A warm site is also used when the organization is willing to take the time necessary to recover data from tape or other backup media. Depending upon the size of the databases, this recovery task can take several hours to a few days.

The primary advantage of a warm site is that its costs are lower than for a hot site, particularly in the effort required to keep the recovery system up-to-date. The site may not require expensive data replication technology, but instead, data can be recovered from backup media.

Cold Sites  A cold site is an alternate processing center where the degree of readiness for recovery systems is low. At the least, a cold site is nothing more than an empty rack or allocated space on a computer room floor. It’s just an address in someone’s data center or co-location site where computers can be set up and used at some future date. Often, cold sites contain little or no equipment. When a disaster or other highly disruptive event occurs in which the outage is expected to exceed 7 to 14 days, the organization will order computers from a manufacturer or perhaps have computers shipped from some other business location to arrive at the cold site soon after the disaster event has begun. Then personnel would travel to the site and set up the computers, operating systems, databases, network equipment, and so on, and get applications running within several days.

The advantage of a cold site is its low cost. The main disadvantage is the time and effort required to bring it to operational readiness in a short period, which can be costly. But for some organizations, a cold site is exactly what is needed.

Table 7-6 shows a comparison of hot, warm, cold, and cloud-based recovery sites and a few characteristics of each.

Table 7-6  Detailed Comparison of Cold, Warm, Hot, and Cloud Sites
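
The approximate RTO ranges given above for hot, warm, and cold sites can be expressed as a simple first-pass screening rule. This is a sketch only: the function name is hypothetical, and the hard cutoffs at 24 hours and 7 days are a simplification of the rough ranges in the text. Cost, data currency, and staffing still have to be weighed separately, and a cloud site is not a separate outcome here because it can be run in a hot, warm, or cold configuration.

```python
def suggest_site_type(rto_hours: float) -> str:
    """Map an RTO target to a site type using the chapter's approximate ranges.

    The cutoffs are simplified assumptions: hot up to 24 hours,
    warm up to 7 days, cold beyond that.
    """
    if rto_hours < 0:
        raise ValueError("RTO cannot be negative")
    if rto_hours <= 24:
        return "hot"
    if rto_hours <= 7 * 24:
        return "warm"
    return "cold"


for rto in (2, 48, 24 * 10):
    print(f"RTO of {rto} hours -> consider a {suggest_site_type(rto)} site")
```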

Mobile Site  A mobile site is a portable recovery center that can be delivered to almost any location in the world. A viable alternative to a fixed-location recovery site, a mobile site can be transported by semitrailer truck and may even have its own generator, communications, and cooling capabilities. APC and SunGard provide mobile sites installed in semis. Oracle can provide mobile sites that include a configurable selection of servers and workstations, all housed in shipping containers that can be shipped by truck, rail, ship, or air to any location in the world.

Cloud Sites  Organizations are increasingly using cloud hosting services as their recovery sites. Such sites charge for the utilization of servers and devices in virtual environments. Hence, capital cost for recovery sites is negligible, and operational costs come into play as recovery sites are used. As organizations become accustomed to building recovery sites in the cloud, they are, with increasing frequency, moving their primary processing sites to the cloud as well.

Reciprocal Sites  A reciprocal recovery site is a data center that is operated by a separate company. Two or more organizations with similar processing needs will draw up a legal contract that obligates one or more of the organizations to house another party’s systems temporarily in the event of a disaster. Often, a reciprocal agreement pledges not only floor space in a data center but also the use of the reciprocal partner’s computer system. This type of arrangement is less common but is used by organizations that use mainframe computers and other high-cost systems.

NOTE   With the wide use of Internet co-location centers, reciprocal sites have fallen out of favor. Still, they may be ideal for organizations with mainframe computers that are otherwise too expensive to deploy to a cold or warm site.

Geographic Site Selection  An important factor in the process of recovery site selection is the location of the site. The distance between the main processing site and the recovery site is vital and may figure heavily into the viability and success of a recovery operation. A recovery site should not be located in the same geographic region as the primary site, because the site may be involved in the same regional disaster that affects the primary site and may be unavailable for use. By “geographic region,” I mean a location that will likely experience the effects of the same regional disaster that affects the primary site. No arbitrarily chosen distance (such as 100 miles) guarantees sufficient separation. In some locales, 50 miles is plenty of distance; in other places, 300 miles is too close—it all depends on the nature of disasters that are likely to occur. Information on regional disasters should be available from local disaster preparedness authorities or from local disaster recovery experts.

Disaster Recovery for SaaS Services  Many organizations’ principal business applications are SaaS-based, so the organization pays a monthly or yearly fee and uses software hosted by a service provider. At first glance, one may believe that DRP for SaaS services is entirely the responsibility of the SaaS provider. For the most part, this is true. There are, however, some issues that organizations need to consider as a part of their DRP, including the following:

•   Direct connectivity   Organizations sometimes employ Multiprotocol Label Switching (MPLS) or virtual private network (VPN) circuitry between their SaaS provider and their core data center. If the SaaS provider experiences a disaster, the provider will host its service from a different location. Or if the organization experiences a disaster, it may be using an alternate processing site for its on-premises systems. In either case, those MPLS/VPN connections will need to change to continue operations. Figure 7-14 depicts this connectivity.

Figure 7-14  Direct connectivity scenarios for disaster recovery with SaaS providers (Source: Peter Gregory)

•   Integrations   In addition to the connectivity issues, other issues regarding integration between the organization and the SaaS provider need to be understood, which can affect disaster response architectures developed for various disaster scenarios in the organization, the SaaS provider, or both. Issues may involve user and machine authentication, license keys, encryption keys, e-mail, and other message routing.

•   Fourth parties   In addition to the organization and SaaS systems and their various integrations, functionality between the organization and its SaaS provider may involve other parties, such as message processors, security systems (such as event monitoring by an MSSP), customer relationship management (CRM), enterprise resource planning (ERP) integrations, and more.

Considerations When Using Third-Party Disaster Recovery Sites  Because most organizations cannot afford to implement their own secondary processing site, the only other option is to use a disaster recovery site that is owned by a third party, including cloud-based sites. This could be a co-location center, a disaster services center, or a cloud-based infrastructure service provider. An organization considering such a site needs to ensure that its service contract addresses the following:

•   Disaster definition   The provider’s definition of a disaster needs to be broad enough to meet the organization’s requirements.

•   Equipment configuration   IT equipment must be configured as needed to support critical applications during a disaster.

•   Availability of equipment during a disaster   IT equipment needs to actually be available during a disaster. The organization needs to know how the disaster service provider will allocate equipment if many of its customers suffer a disaster simultaneously.

•   Customer priorities   The organization needs to know whether any of the disaster services provider’s other customers (government or military, for example) have priorities that may exceed its own.

•   Data communications   The provider must have sufficient bandwidth and capacity for the organization plus other customers who may be operating at the provider’s center at the same time.

•   Data sovereignty   The organization should consider the geographic locations of stored data, particularly when that data involves private citizens. The locations of primary and recovery processing sites, together with the location of data subjects, may be affected by various privacy regulations.

•   Testing   The organization needs to know what testing it is permitted to perform on the service provider’s systems so that the ability to recover from a disaster can be assured prior to a disaster occurring.

•   Right to audit   The organization should have a “right to audit” clause in its contract to verify the presence and effectiveness of all key controls in place at the recovery facility.

•   Security and environmental controls   The organization needs to know what security and environmental controls are in place at the disaster recovery facility.

Considerations for a Distributed Workforce  During and after the COVID-19 pandemic, many organizations began hiring out-of-area personnel who worked from their homes most or all of the time. Organizations still operating data centers may find that few IT workers are located near primary or alternate processing centers. This “just-in-time” approach may be suitable for normal business operations, but disaster scenarios may result in few personnel being available for any necessary onsite work. For this reason, it’s essential that disaster recovery plans be written for an audience with less familiarity with the organization’s operations and practices, because it is possible that outsiders (such as contractors) will be performing some salvage and recovery tasks. For organizations with numerous remote employees, sufficient capacity for remote access (VPN) is essential to support business operations running on an emergency footing.

Acquiring Additional Hardware  Many organizations elect to acquire their own server, storage, and network hardware for disaster recovery purposes. How an organization will go about acquiring hardware will depend on its high-level recovery strategy:

•   Cold site   An organization must be able to purchase hardware as soon as the disaster occurs.

•   Warm site   An organization will need to purchase hardware in advance of the disaster, or it may be able to purchase hardware when the disaster occurs. The choice depends on the RTO.

•   Hot site   An organization should purchase its recovery hardware in advance of the disaster.

•   Cloud   An organization will not need to purchase hardware, as this is provided by the cloud infrastructure provider. Infrastructure in the cloud can likewise take on characteristics of being hot, warm, or cold.

Table 7-7 lists the pros and cons of these strategies. Warm-site strategy is not listed because an organization could purchase hardware either in advance of the disaster or when it occurs. Because cold, hot, and cloud sites are deterministic, they are included in the table.

Table 7-7  Hardware Acquisition Pros and Cons for Hot/Warm, Cold, and Cloud Recovery Sites

The main reason an organization chooses to employ a cloud hosting provider is to eliminate capital costs. The provider supplies all hardware and charges organizations when the hardware is used. The primary business reason for not choosing a hot site is the high capital cost required to purchase disaster recovery equipment that may never be used. One way around this obstacle is to put those recovery systems to work every day. For example, recovery systems could be used for development or testing of the same applications that are used in production. This way, systems that are purchased for recovery purposes are being well utilized for other purposes, and they’ll be ready in case a disaster occurs. When a disaster occurs, the organization will be less concerned about development and testing and more concerned about keeping critical production applications running. It will be a small sacrifice to forgo development or testing (or whatever low-criticality functions are using the recovery hardware) during a disaster.

NOTE   A cloud-based system recovery strategy can also be used in a hot, warm, or cold configuration.

Recovery and Resilience Technologies  Once recovery targets have been established, the next major task is the survey and selection of technologies to enable RTOs and RPOs to be met. The following are important factors when considering each technology:

•   Does the technology help the information system achieve the RTO, RPO, and RCapO targets?

•   Does the cost of the technology fit within budget constraints, or does it exceed them?

•   Can the technology be used to benefit other information systems (thereby lowering the cost for each system)?

•   Does the technology fit well into the organization’s current IT operations?

•   Will operations staff require specialized training to use the technology for recovery?

•   Does the technology contribute to the simplicity of the overall IT architecture, or does it complicate it unnecessarily?

These questions are designed to help determine whether a specific technology is a good fit, from technology, process, and operational perspectives.

RAID  Redundant Array of Independent Disks (RAID) is a family of technologies used to improve the reliability, performance, or size of disk-based storage systems. From a disaster recovery or systems resilience perspective, the feature of RAID that is of particular interest is its reliability. RAID is used to create virtual disk volumes over an array (pun intended) of disk storage devices and can be configured so that the failure of any individual disk drive in the array will not affect the availability of data on the disk array.

RAID is usually implemented on a hardware device called a disk array, which is a chassis in which several hard disks can be installed and connected to a server. The individual disk drives can usually be “hot-swapped” in the chassis while the array is still operating. When the array is configured with RAID, a failure of a single disk drive will have no effect on the disk array’s availability to the server to which it is connected. A system operator can be alerted to the disk’s failure, and the defective disk drive can be removed and replaced while the array is still fully operational.

Several options, or levels, of RAID configuration are available:

•   RAID 0   This is known as a striped volume, in which a disk volume splits data evenly across two or more disks to improve performance.

•   RAID 1   This creates a mirror, where data written to one disk in the array is also written to a second disk in the array. RAID 1 makes the volume more reliable through the preservation of data, even when one disk in the array fails.

•   RAID 4   This level employs block-level data striping and adds a dedicated parity disk, which permits the rebuilding of data in the event one of the other disks fails.

•   RAID 5   This is similar to RAID 4 block-level striping, except that the parity data is distributed evenly across all of the disks instead of being dedicated on one disk. Like RAID 4, RAID 5 allows for the failure of one disk without losing information.

•   RAID 6   This is an extension of RAID 5, in which two parity blocks are used instead of a single parity block. RAID 6 can withstand the failure of any two disk drives in the array instead of a single disk, as is the case with RAID 5.

NOTE   Several nonstandard RAID levels have been developed by various hardware and software companies. Some of these are extensions of RAID standards, while others are entirely different.
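
The way a single parity block permits rebuilding the data from one failed disk, as in RAID 4 and RAID 5, can be illustrated with exclusive-OR (XOR) arithmetic. The sketch below is a toy model under stated assumptions: the helper name and the four-byte "blocks" are invented for illustration, and real disk arrays perform this work in controller hardware or the operating system over much larger blocks. The second parity block used by RAID 6 is calculated differently and is not shown.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe spread across three data disks (toy four-byte blocks).
data_blocks = [b"ACCT", b"2024", b"PAID"]

# The parity block is the XOR of all data blocks in the stripe.
parity = xor_blocks(data_blocks)

# Simulate the failure of the second disk: only the survivors and parity remain.
survivors = [data_blocks[0], data_blocks[2], parity]

# XOR of the surviving blocks and the parity block yields the missing block.
rebuilt = xor_blocks(survivors)
print(rebuilt == data_blocks[1])
```

Because XOR can recover only one unknown per stripe, a second simultaneous disk failure would leave the data unrecoverable under this single-parity scheme.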

Storage systems are hardware devices that are entirely separate from servers—their only purpose is to store a large amount of data. They are highly reliable through the use of redundant components and the use of one or more RAID levels. Storage systems generally come in two forms:

•   Storage area network (SAN)   This stand-alone storage system can be configured to contain several virtual volumes and can be connected to several servers through fiber-optic cables. The servers’ operating systems will often consider this storage to be “local,” as though it consisted of one or more hard disks present in the server’s own chassis.

•   Network-attached storage (NAS)   This stand-alone storage system contains one or more virtual volumes. Servers access these volumes over the network using the Network File System (NFS) or Server Message Block/Common Internet File System (SMB/CIFS) protocols, common on Unix and Windows operating systems, respectively.

Replication  During replication, data that is written to a storage system is also copied over a network to another storage system. The result is the presence of up-to-date data that exists on two or more storage systems, each of which could be located in a different geographic region. Replication can be handled in several ways and at different levels in the technology stack:

•   Disk storage system   Data-write operations that take place in a disk storage system (such as a SAN or NAS) can be transmitted over a network to another disk storage system, where the same data will be written to the other disk storage system.

•   Operating system   The operating system can control replication so that updates to a particular file system can be transmitted to another server, where those updates will be applied locally.

•   Database management system   The DBMS can manage replication by sending transactions to a DBMS on another server.

•   Transaction management system   The transaction management system (TMS) can manage replication by sending transactions to a counterpart TMS located elsewhere.

•   Application   The application can write its transactions to two different storage systems. This method is not often used.

•   Virtualization   Virtual machine images can be replicated to recovery sites to speed the recovery of applications.

Replication can take place from one system to another system, called primary-backup replication, and this is the typical setup when data on an application server is sent to a distant storage system for data recovery or disaster recovery purposes. Replication can also be bidirectional between two active servers; this is known as multiprimary or multimaster replication. This method is more complicated, because simultaneous transactions on different servers could conflict with one another (such as two reservation agents trying to book a passenger in the same seat on an airline flight). Some form of concurrent transaction control would be required, such as a distributed lock manager.

In terms of the speed and integrity of replicated information, there are two types of replication:

•   Synchronous replication   Writing data to a local and to a remote storage system is performed as a single operation, guaranteeing that data on the remote storage system is identical to data on the local storage system. Synchronous replication incurs a performance penalty, as the speed of the entire transaction is slowed to the rate of the remote transaction.

•   Asynchronous replication   Writing data to the remote storage system is not kept in sync with updates on the local storage system. Instead, there may be a time lag, and you have no guarantee that data on the remote system is identical to that on the local storage system. Performance is improved, however, because transactions are considered complete when they have been written to the local storage system only. Bursts of local updates to data will take a finite period to replicate to the remote server, subject to the available bandwidth of the network connection between the local and remote storage systems.
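
The difference between the two types can be sketched in a few lines. The classes below are hypothetical, in-memory stand-ins for storage systems, not any vendor's replication API: the synchronous version completes a write only after the remote copy is updated, while the asynchronous version acknowledges the write after the local update and forwards queued changes later.

```python
from collections import deque


class Store:
    """In-memory stand-in for a storage system (illustrative only)."""

    def __init__(self):
        self.data = {}

    def write(self, key, value):
        self.data[key] = value


class SynchronousReplicator:
    def __init__(self, local, remote):
        self.local, self.remote = local, remote

    def write(self, key, value):
        self.local.write(key, value)
        self.remote.write(key, value)  # caller waits for the remote write too
        return "committed"


class AsynchronousReplicator:
    def __init__(self, local, remote):
        self.local, self.remote = local, remote
        self.pending = deque()  # changes not yet applied remotely

    def write(self, key, value):
        self.local.write(key, value)
        self.pending.append((key, value))  # acknowledged before remote update
        return "committed"

    def drain(self, max_items=None):
        """Forward queued changes to the remote store, oldest first."""
        sent = 0
        while self.pending and (max_items is None or sent < max_items):
            key, value = self.pending.popleft()
            self.remote.write(key, value)
            sent += 1
        return sent


sync = SynchronousReplicator(Store(), Store())
sync.write("order-1", "paid")
print(sync.local.data == sync.remote.data)

asyn = AsynchronousReplicator(Store(), Store())
for i in range(3):
    asyn.write(f"order-{i}", "paid")
print(len(asyn.pending), len(asyn.remote.data))
asyn.drain()
print(asyn.local.data == asyn.remote.data)
```

In this model, whatever is still in the pending queue when the primary site is lost represents data that never reached the remote system, which is why the replication lag matters when setting an RPO.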

NOTE   Replication is often used for applications where the RTO is smaller than the time necessary to recover data from backup media. For example, if a critical application’s RTO is established to be two hours, recovery from backup tape is probably not a viable option if restoring the data takes longer than that. Likewise, an RPO of two hours cannot be met from tape unless backups are performed at least every two hours. While more expensive than recovery from backup media, replication ensures that up-to-date information is present on a remote storage system that can be put online in a short period.

Server Clusters  A cluster is a collection of two or more servers that appear as a single server resource. Clusters are often the technology of choice for applications that require a high degree of availability and a very small RTO, measured in minutes. When an application is implemented on a cluster, even if one of the servers in the cluster fails, the other server (or servers) in the cluster will continue to run the application, usually with no user awareness that such a failure occurred.

There are two typical configurations for clusters, active/active and active/passive. In active/active mode, all servers in the cluster are running and servicing application requests. This is often used in high-volume applications where many servers are required to service the application workload. In active/passive mode, one or more servers in the cluster are active and servicing application requests, while one or more servers in the cluster are in a “standby” mode; they can service application requests but won’t do so unless one of the active servers fails or goes offline for any reason. A failover occurs when an active server goes offline and a standby server takes over. Figure 7-15 shows a typical server cluster architecture.

Figure 7-15  Application and database server clusters
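
The active/passive failover behavior described above can be modeled in a short sketch. The class and server names are hypothetical, and the model leaves out what real cluster software must also handle, such as health checks, shared storage, and network address takeover.

```python
class Server:
    def __init__(self, name, role):
        self.name = name
        self.role = role      # "active" or "standby"
        self.online = True


class Cluster:
    def __init__(self, servers):
        self.servers = servers

    def active_servers(self):
        return [s for s in self.servers if s.online and s.role == "active"]

    def fail(self, name):
        """Take a server offline; promote a standby if it had been active."""
        for server in self.servers:
            if server.name == name:
                was_active = server.role == "active"
                server.online = False
                if was_active:
                    self._promote_standby()
                return
        raise KeyError(name)

    def _promote_standby(self):
        """Failover: the first online standby server becomes active."""
        for server in self.servers:
            if server.online and server.role == "standby":
                server.role = "active"
                return server
        return None

    def handle(self, request):
        active = self.active_servers()
        if not active:
            raise RuntimeError("no active server available")
        return f"{active[0].name} handled {request}"


cluster = Cluster([Server("app1", "active"), Server("app2", "standby")])
print(cluster.handle("request-1"))
cluster.fail("app1")                 # the active server goes offline
print(cluster.handle("request-2"))   # the standby has taken over
```

From the caller's point of view, both requests succeed; only the server that answers has changed.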

A server cluster is typically implemented in a single physical location, such as a data center. However, in a geographic cluster, or geocluster, a cluster can be implemented where great distances separate the servers in the cluster. Servers in a geocluster are connected through a WAN connection. Figure 7-16 shows a typical geographic cluster architecture.

Figure 7-16  Geographic cluster with data replication

Network Connectivity and Services  An overall application environment that is required to be resilient and have recoverability must have those characteristics present within the network that supports it. A highly resilient application architecture that includes clustering and replication would be of little value if it had only a single network connection that was a single point of failure.

An application that requires high availability and resilience may require one or more of the following in the supporting network:

•   Redundant network connections   These may include multiple network adapters on a server but also a fully redundant network architecture with multiple switches, routers, load balancers, and firewalls. This could also include physically diverse network provider connections, where network service provider feeds enter the building from two different directions.

•   Redundant network services   Certain network services are vital to the continued operation of applications, such as Domain Name System (DNS; the function of translating server names such as www.mheducation.com into an IP address), Network Time Protocol (NTP; used to synchronize computer time clocks), Simple Mail Transfer Protocol (SMTP), Simple Network Management Protocol (SNMP), authentication services, and perhaps others. These services are usually operated on servers, which may require clustering and/or replication of their own, so that the application will be able to continue functioning in the event of a disaster.

Developing Disaster Recovery Plans

A DRP effort starts with the initial phases of the BCP project, the BIA and criticality analysis, which lead to the establishment of recovery objectives that determine how quickly critical business processes need to be back up and running. With this information, the disaster recovery team can determine what additional data processing equipment is needed (if any) and establish a road map for acquiring that equipment. Note that “equipment” may represent physical hardware or virtual assets in public or private cloud environments.

The other major component of the disaster recovery project is the development of recovery plans, the process and procedure documents that will be triggered when a disaster has been declared. These processes and procedures will instruct response personnel on how to establish and operate business processes and IT systems after a disaster has occurred. It’s not enough to have all of the technology ready if personnel don’t know what to do.

Most disaster recovery plans are going to have common components:

•   Disaster declaration procedure   Includes criteria for how a disaster is determined and who has the authority to declare a disaster

•   Roles and responsibilities   Specify what activities need to be performed and which people or teams are best equipped to perform them

•   Emergency contact lists   Provide contact information for other personnel so that response personnel can establish and maintain communications as the disaster unfolds and recovery operations begin; lists should contain several different ways of contacting personnel since some disasters have an adverse impact on regional telecommunications infrastructure

•   System recovery procedures   Detailed steps for getting recovery systems up and running, which describe obtaining data, configuring servers and network devices, testing to confirm that the application and business information is healthy, and starting business applications

•   System operations procedures   Detailed steps for operating critical IT systems while they are in recovery mode, because the systems in recovery mode may need to be operated differently than their production counterparts, and they may need to be operated by personnel who have not been doing this before

•   System restoration procedures   Detailed steps to restore IT operations to the original production systems

NOTE   Business continuity and disaster recovery plans work together to get critical business functions operating again after a disaster. Because of this, business continuity and disaster recovery teams need to work closely when developing their respective response procedures to ensure that all activities are covered, but without unnecessary overlap.

Disaster recovery plans should consider all the likely disaster scenarios that may occur to an organization. Understanding these scenarios can help the team take a more pragmatic approach when creating response procedures. It also helps to remember that not all disasters result in the entire loss of a computing facility. Most are more limited in their scope, although any of them can still result in a complete inability to continue operations. Some of these scenarios are as follows:

•   Partial or complete loss of network connectivity

•   Sustained electric power outage

•   Loss of a key system (such as a server, storage system, or network device)

•   Extensive data corruption or data loss

These scenarios are probably more likely to occur than a catastrophe such as a major earthquake or hurricane (depending on where a data center is located).

Data Backup and Recovery

Disasters, cyberattacks (primarily ransomware and destructware), and other disruptive events can damage information and information systems. It’s essential that fresh copies of information exist elsewhere and in a form that enables IT personnel to load the information easily into alternative systems so that processing can resume as quickly as possible.

NOTE   Testing backups is important; testing recoverability is critical. In other words, performing backups is valuable only to the extent that backed-up data can be recovered at a future time. In addition, it is a good practice to ensure that backups are segmented off the corporate network to prevent an attacker from being able to destroy both production and backup data.
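
One basic recoverability check is to restore a file and compare its cryptographic hash with that of the original. The sketch below assumes a deliberately simplified setup: ordinary file copies stand in for the backup and restore steps, and the file and directory names are invented. A real test would restore from the actual backup system onto the intended recovery systems and would also confirm that the application can use the restored data.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(original: Path, restored: Path) -> bool:
    """True when the restored file is byte-for-byte identical to the original."""
    return sha256_of(original) == sha256_of(restored)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    source = root / "ledger.db"                 # hypothetical production file
    source.write_bytes(b"critical business data")

    backup_dir = root / "backup"
    restore_dir = root / "restored"
    backup_dir.mkdir()
    restore_dir.mkdir()

    shutil.copy2(source, backup_dir / "ledger.db")                   # stand-in for backup
    shutil.copy2(backup_dir / "ledger.db", restore_dir / "ledger.db")  # stand-in for restore

    good_restore = verify_restore(source, restore_dir / "ledger.db")

    # Simulate a damaged restore to show that the check detects it.
    (restore_dir / "ledger.db").write_bytes(b"corrupted")
    bad_restore = verify_restore(source, restore_dir / "ledger.db")

print(good_restore, bad_restore)
```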

Backup to Tape and Other Media  In organizations still utilizing their own IT infrastructure, tape backup is just about as ubiquitous as power cords. From a disaster recovery perspective, however, the issue probably is not whether the organization has tape backup, but whether its current backup capabilities are adequate in the context of disaster recovery. An organization’s backup capability may need to be upgraded if:

•   The current backup system is difficult to manage.

•   Whole-system restoration takes too long.

•   The system lacks flexibility with regard to disaster recovery (for instance, a high degree of difficulty is required to recover information onto a different type of system).

•   The technology is old or outdated.

•   Confidence in the backup technology is low.

Many organizations may consider tape backup as a means of restoring files or databases when errors have occurred, and they may have confidence in their backup system for that purpose. However, the organization may have somewhat less confidence in its backup system and its ability to recover all of its critical systems accurately and in a timely manner.

Although tape has been the default medium since the 1960s, many organizations use hard drives for backup: hard disk transfer rates are far higher, and a disk is a random-access medium, whereas tape is a sequential-access medium. A virtual tape library (VTL) is a type of data storage technology that sets up a disk-based storage system with the appearance of tape storage, permitting existing backup software to continue to back data up to “tape,” which is really just more disk storage.

E-vaulting is another viable option for system backup. E-vaulting permits organizations to back up their systems and data to an offsite location, which could be a storage system in another data center or a third-party service provider. This accomplishes two important objectives: reliable backup and offsite storage of backup data.

Backup schemes, backup media rotation methods, and backup media storage are discussed in Chapter 6.

Incident Classification/Categorization

No two security incidents or disasters are alike: some may threaten the very survival of an organization, while others are minor, bordering on insignificant. The degree and type of response must be appropriate for the severity of the incident in terms of mobilization, speed, and communications. Further, an incident may or may not have an impact on sensitive information, including intellectual property and personally identifiable information (PII).

Organizations generally classify incidents according to severity, typically on a 3-, 4-, or 5-point scale. Organizations that store or process intellectual property, confidential data about their employees, or sensitive information about clients or customers often assign incident severity levels according to the level of impact on this information. Tables 7-8 and 7-9 depict two such schemes for classifying security incidents.

In Table 7-8, incidents are assigned a single numeric value of 1 to 5 based upon the impact as described. In Table 7-9, incidents are assigned a numeric value and an alphabetic value, based on impact on operations and impact on information. For example, an incident involving the loss of an encrypted laptop computer would be classified as 1A, whereas a ransomware incident where some production information has been lost would be classified as 4E or 5E.

Table 7-8  Example Single-Dimensional Incident Severity Plan

Table 7-9  Example Two-Dimensional Incident Severity Plan
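A two-dimensional severity code such as 1A or 4E can be represented as a pair of validated values. The sketch below is illustrative only. It assumes that the numeric value (1 to 5) expresses impact on operations and the letter (A to E) expresses impact on information, with 1 and A being the least severe; the class and constant names are hypothetical.

```python
from dataclasses import dataclass

# Assumed scales for illustration: 1 and "A" are the least severe.
OPERATIONS_LEVELS = (1, 2, 3, 4, 5)
INFORMATION_LEVELS = ("A", "B", "C", "D", "E")


@dataclass(frozen=True)
class IncidentSeverity:
    operations_impact: int      # impact on operations, 1 to 5
    information_impact: str     # impact on information, "A" to "E"

    def __post_init__(self) -> None:
        # Reject values outside the assumed scales.
        if self.operations_impact not in OPERATIONS_LEVELS:
            raise ValueError(f"operations impact must be one of {OPERATIONS_LEVELS}")
        if self.information_impact not in INFORMATION_LEVELS:
            raise ValueError(f"information impact must be one of {INFORMATION_LEVELS}")

    @property
    def code(self) -> str:
        """Combined classification code, for example '1A' or '4E'."""
        return f"{self.operations_impact}{self.information_impact}"


lost_encrypted_laptop = IncidentSeverity(1, "A")
print(lost_encrypted_laptop.code)
```

Keeping the two dimensions as separate fields lets response procedures react to either one independently, for example by escalating on information impact alone.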

Classifying incidents by severity level provides guidance on several aspects of response:

•   Numbers of personnel assigned to response and recovery

•   Utilization of external resources for response and recovery

•   Frequency of updates given to executive leadership

•   Updates provided to outside parties, including customers, suppliers, shareholders, regulators, and law enforcement

•   Emergency spending capabilities and limits

•   Notifications to the workforce regarding work assignments
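Some organizations record this guidance as a lookup table so that responders do not have to improvise during an incident. The following sketch shows one possible shape for such a table. Every value in it (staffing numbers, update intervals, spending limits) is invented for illustration and would need to be set by the organization; the names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResponseProfile:
    responders: int                 # personnel assigned to response and recovery
    use_external_resources: bool    # whether outside responders are engaged
    executive_update_hours: int     # hours between updates to executive leadership
    notify_outside_parties: bool    # customers, regulators, law enforcement, and so on
    emergency_spend_limit: int      # spending limit in the organization's currency


# Illustrative values only, for an assumed 1-to-5 severity scale.
RESPONSE_PROFILES = {
    1: ResponseProfile(1, False, 168, False, 0),
    2: ResponseProfile(2, False, 72, False, 5_000),
    3: ResponseProfile(4, False, 24, True, 25_000),
    4: ResponseProfile(8, True, 8, True, 100_000),
    5: ResponseProfile(15, True, 2, True, 500_000),
}


def profile_for(severity: int) -> ResponseProfile:
    """Look up the response guidance for a severity level."""
    try:
        return RESPONSE_PROFILES[severity]
    except KeyError:
        raise ValueError(f"unknown severity level: {severity}") from None
```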

The severity scales discussed here are applicable to security incidents as well as disasters. Both are business-disrupting events that require mobilization, response, communication with key parties, containment, and closure.

Incident Management Training, Testing, and Evaluation

Organizations do not perform incident response plans every day. While low-severity plans may be performed from time to time, high-severity plans may be used rarely, perhaps only once every several years. Organizations that want to have confidence in their incident response plans need to train personnel in their use and test their response plans from time to time to ensure that they will work as expected.

Although security incident response, business continuity plans, and disaster recovery plans are related to one another in some scenarios, it is important to distinguish each from the others. Each has a specific purpose, but in some scenarios, two or all three of these plans may be activated at once.

Security Incident Response Training

Like any procedure, incident response goes far better if responders have been trained before an actual incident occurs. Unlike many security procedures, incident response is often carried out while emotions are running high, and those unfamiliar with the procedures and principles of incident response can get tripped up and make mistakes. This is not unlike the emotion and stress that other types of emergency responders, such as firefighters and police officers, may experience.

Incident response training should cover all of the scenarios that the organization is likely to face, ranging from the not-so-dire events such as stolen mobile devices and laptop computers to the truly catastrophic events such as a prolonged DDoS attack, destructive ransomware, or the exfiltration of large amounts of sensitive data.

Incident response personnel should be trained in the use of tools used to detect, examine, and remediate an incident. This includes SOC personnel who use a security information and event management (SIEM) system; security orchestration, automation, and response (SOAR); threat intelligence platform (TIP); and other detection and investigation tools. It also includes forensic specialists who use specialized forensic analysis tools and all personnel who have administrative responsibilities for every type of IT equipment, application, and tool.

Security Incident Response Professional Certifications

Security professionals specializing in incident response should consider one or more of the specialty certifications in incident response, including the following (in alphabetical order):

•   Certified Computer Examiner (CCE)   www.isfce.com/certification.htm

•   EC-Council Certified Incident Handler (ECIH)   www.eccouncil.org/programs/ec-council-certified-incident-handler-ecih/

•   GIAC Certified Forensic Analyst (GCFA)   www.giac.org/certifications/certified-forensic-analyst-gcfa/

•   GIAC Certified Incident Handler (GCIH)   www.giac.org/certification/certified-incident-handler-gcih

•   GIAC Network Forensic Analyst (GNFA)   www.giac.org/certifications/network-forensic-analyst-gnfa/

•   GIAC Reverse Engineering Malware (GREM)   www.giac.org/certifications/reverse-engineering-malware-grem/

•   Professional Certified Investigator (PCI)   www.asisonline.org/certification/professional-certified-investigator-pci/

A number of vendor-specific certifications are also available, including EnCase Certified Examiner (EnCE) for professionals using the EnCase forensics tool, and AccessData Certified Examiner (ACE) for professionals who use Forensic Toolkit (FTK).

Business Continuity and Disaster Response Training

The value and usefulness of a high-quality set of disaster response and continuity plans and procedures will be greatly diminished if those responsible for carrying out the procedures are unfamiliar with them. A person cannot learn to ride a bicycle by reading even the most detailed instructions on the subject, and it’s equally unrealistic to expect personnel to be able to carry out disaster response procedures properly if they are inexperienced in those procedures. Often, the best way to train responders is to have them participate in the testing of business continuity and disaster recovery plans. Learning will be more effective if responders understand that these tests not only check the accuracy and effectiveness of the plans but also provide an opportunity to become familiar with, and be trained on, those plans.

Training should not be limited to primary operations personnel and should also include others who may be responding in an actual disaster scenario. Remember that some disasters result in some personnel being unavailable for a variety of reasons.

Several forms of training can be made available for personnel who are expected to be available if a disaster strikes, including the following:

•   Document review   Personnel can carefully read through procedure documents to become familiar with the nature of the recovery procedures. As mentioned, this alone may be insufficient.

•   Participation in walk-throughs   People who are familiar with specific processes and systems should participate in the walk-through processes that deal with those issues. Exposing personnel to the walk-through process will not only help to improve the walk-through and recovery procedures but will also be a learning experience for participants.

•   Participation in simulations   Taking part in simulations will benefit the participants by giving them the experience of thinking through a disaster.

•   Participation in parallel and cutover tests   Other than experiencing an actual disaster and its recovery operations, no experience is quite like participating in parallel and cutover tests. Participants can gain actual hands-on experience with critical business processes and IT environments by performing the same procedures that they would perform in the event of a disaster. When a disaster strikes, those participants can draw upon their experience rather than recalling the information they read in procedure documents.

All of the test levels that need to be performed to verify the quality of response plans are also training opportunities for personnel. The development and testing of disaster-related plans and procedures provide a continuous learning experience for all of the personnel involved.

Testing Security Incident Response Plans

Security incident response plans must be documented and reviewed, but they also need to be periodically tested. Security incident response testing helps to improve the quality of those plans, which will help the organization better respond when an incident occurs. A by-product of security incident plan testing is the growing familiarity of personnel with security incident response procedures. Various types of tests should be carried out:

•   Document review   Individual subject-matter experts (SMEs) carefully read security incident response documentation to understand the procedures and identify any opportunities for improvement.

•   Walk-through   Similar to a document review, this is performed by a group of SMEs who talk through the security incident response plan. Discussing each step helps to stimulate new ideas that could lead to improvements in the plan.

•   Simulation   A facilitator describes a realistic security incident scenario, and participants discuss how they will actually respond. A simulation usually takes half a day or longer. It is suggested that the simulation be “scripted” with new information and updates introduced throughout the scenario. A simulation can be limited to the technical aspects of a security incident, or it can involve corporate communications, public relations, legal, and other externally facing parts of the organization that may play a part in a security incident that is known to the public.

•   Live fire   During a penetration test, personnel who are monitoring systems and networks jump into action in response to the scans, probes, and intrusions being performed by penetration testers. Note that those personnel could be told in advance about the penetration test; however, it would be more valuable for them to gain experience in responding to a real attack if they were not told in advance. During a test, incident responders need to respond carefully so that their actions do not cause real incidents.

These tests should be performed once each year or even more often. In the walk-through and simulation tests, someone should be appointed as a note-taker so that any improvements will be recorded and the plan can be updated. Tests should include incidents addressed in each playbook and at each classification level so that all procedures will be tested. Regardless of the type of test conducted, an after-action or lessons-learned session should be conducted. Any identified recommendations or remedial actions should be incorporated into incident response plans and supporting documentation.
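Tracking which playbook and classification-level combinations have been exercised can be reduced to a simple set comparison. The sketch below assumes that every playbook should be tested at every classification level, which may be stricter than a given organization needs; the function name and sample playbook names are hypothetical.

```python
from itertools import product


def untested_combinations(playbooks, severity_levels, completed_tests):
    """Return (playbook, severity) pairs that no completed test has covered.

    `completed_tests` is an iterable of (playbook, severity) pairs taken
    from test records.
    """
    required = set(product(playbooks, severity_levels))
    return sorted(required - set(completed_tests))


gaps = untested_combinations(
    playbooks=["lost-device", "ransomware"],
    severity_levels=[1, 2],
    completed_tests=[("ransomware", 1), ("lost-device", 2)],
)
print(gaps)  # [('lost-device', 1), ('ransomware', 2)]
```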

If the incident response plan contains the names and contact information of response personnel, the plan should be reviewed more frequently to ensure that all contact information is up to date.

Testing Business Continuity and Disaster Recovery Plans

It is amazing how much can be accomplished if no one cares who gets the credit.

—John Wooden

Business continuity and disaster recovery plans may look elegant and even ingenious on paper, but their true business value is unknown until their worth is proven through testing. The process of testing these plans uncovers flaws not only in the plans but also in the systems and processes that they are designed to protect. For example, testing a system recovery procedure might point out the absence of a critically needed hardware component, or a recovery procedure might contain a syntax or grammatical error that misleads the recovery team member and results in recovery delays. Testing is designed to uncover these types of issues.

Testing Recovery and Continuity Plans

Recovery and continuity plans should be tested to prove their viability. Without testing, an organization has no way of really knowing whether its plans are effective. And with ineffective plans, an organization has a far smaller chance of surviving a disaster.

Recovery and continuity plans have built-in obsolescence—not by design but by virtue of the fact that technology and business processes in most organizations are undergoing constant change and improvement. Thus, it is imperative that newly developed or updated plans be tested as soon as possible to ensure their effectiveness.

Types of tests range from lightweight and unobtrusive to intense and disruptive:

•   Walk-through

•   Simulation

•   Parallel test

•   Cutover test

TIP   Usually, an organization should perform the less intensive tests first to identify the most obvious flaws, followed by tests that require more effort.

Test Preparation

Each type of test requires advance preparation and recordkeeping. Preparation will consist of several items:

•   Participants   The organization will identify personnel who will participate in an upcoming test. It is important to identify all relevant skill groups and department stakeholders so that the test will include a full slate of contributors. Participants should also include key vendors and partners, who can support their own systems.

•   Schedule   The availability of each participant needs to be confirmed so that the test will include participation from all stakeholders.

•   Facilities   For all but the document review test, proper facilities, such as a large conference room or training room, should be identified and set up. If the test takes several hours, one or more meals and refreshments may be needed as well.

•   Scripting   The simulation test requires some scripting, usually in the form of one or more documents that describe a developing scenario and related circumstances. Scenario scripting can make parallel and cutover tests more interesting and valuable, but this can be considered optional.

•   Recordkeeping   For all tests except the document review, one or more people should take good notes that can be collected and organized after the test is completed.

•   Contingency plan   The cutover test involves the cessation of processing on primary systems and the resumption of processing on recovery systems. This is the highest-risk type of test, and things can go wrong. Develop a contingency plan to get primary systems running again in case something goes wrong during the test.

Table 7-10 shows these preparation activities.

Table 7-10  Preparation Activities for Disaster Recovery and Business Continuity Tests

Document Review  A document review test reviews some or all disaster recovery and business continuity plans, procedures, and other documentation. Individuals typically review these documents on their own, at their own pace, but within established time constraints or deadlines. The purpose of this test is to review the accuracy and completeness of document content. Reviewers should read each document with a critical eye, point out any errors, and annotate the document with questions or comments that can be returned to the document’s author (or authors), who can make any necessary changes. If significant changes are needed in one or more documents, the project team may want to include a second document review before moving on to more resource-intensive tests.

The owner or document manager for the organization’s BCP and DRP project should document which people review which documents and perhaps include the review copies or annotations. This practice will create a complete record of the activities related to the development and testing of important DRP and BCP documents. It will also help to capture the true cost and effort of the development and testing of BCP capabilities in the organization.

Walk-through  A walk-through is similar to a document review but includes only the BCP documents. However, where a document review is carried out by individuals working on their own, a walk-through is performed by an entire group of individuals in a live discussion. A walk-through is usually facilitated by a leader who guides the participants page by page through each document. The leader may read sections of the document aloud, describe various scenarios where information in a section may be relevant, and take comments and questions from participants.

A walk-through is likely to take considerably more time than a document review. One participant’s question on some minor point in the document could spark a worthwhile and lively discussion that could last from a few minutes to an hour. The group leader or another person should take careful notes to record any deficiencies that are discovered in any of the documents, as well as issues to be handled after the walk-through. The leader should be able to control the pace of the review so that the group does not get unnecessarily hung up on minor points. Some discussions may need to be cut short or tabled for a later time or for an offline conversation among interested parties.

Even if major revisions are required in recovery documents, it will probably be infeasible to conduct another walk-through with the updated documents. Follow-up document reviews are probably warranted, however, to ensure that they were updated appropriately, at least in the opinion of the walk-through participants.

TIP   Participants in the walk-through should carefully consider that the potential audience for recovery procedures may be people who are not as familiar as they are with the organization’s systems and processes. They need to remember that the ideal personnel may not be available during an actual disaster. Participants also need to realize that the skill level of recovery personnel may be a little below that of the experts who operate systems and processes in normal circumstances. Finally, walk-through participants need to remember that systems and processes undergo almost continuous change, which could render some parts of the recovery documentation obsolete or incorrect all too soon.

Simulation  A simulation is a test of disaster recovery and business continuity procedures where the participants take part in a “mock disaster” to add some realism to the process of thinking their way through procedures included in emergency response documents. A simulation could be an elaborate and choreographed walk-through test, where a facilitator reads from a script and describes a series of unfolding events in a disaster such as a hurricane or an earthquake. This type of simulation could be viewed as a sort of playacting, where the script is the emergency response documentation. Once their imagination has been stimulated, participants may find it easier to picture what disaster recovery and business continuity procedures would be like if an actual disaster occurred. It will help tremendously if the facilitator has actually experienced one or more disaster scenarios and can add more realism when describing events.

To make the simulation more credible and valuable, the chosen scenario should have a reasonable chance of actually occurring in the local area. Good choices would include an earthquake in San Francisco or Los Angeles, a volcanic eruption in Seattle, or an avalanche in Switzerland. A poor choice would be a hurricane or tsunami in Central Asia, because these events would never occur there. A simulation can also go a few steps further. For instance, the simulation can take place at an established emergency operations center, the same place where emergency command and control would operate in a real disaster. Also, the facilitator could change some of the participants’ roles to simulate the absence of certain key personnel to see how the remaining personnel may conduct themselves in a real emergency.

NOTE   The facilitator of a simulation is limited only by her own imagination when organizing a simulation. One important fact to remember, though, is that a simulation does not actually affect any live or disaster recovery systems—it’s all as pretend as the make-believe cardboard television sets and computers in furniture stores.

Parallel Test  A parallel test is an actual test of disaster recovery and/or business continuity response plans and their supporting IT systems. Its purpose is to evaluate the ability of personnel to follow directives in emergency response plans—to set up the disaster recovery business processing or data processing capability. In a parallel test, personnel are setting up the IT systems that would be used in an actual disaster and operating those IT systems with real business transactions to determine whether the IT systems perform the processing correctly.

The outcome of a parallel test is threefold:

•   It evaluates the accuracy of emergency response procedures.

•   It evaluates the ability of personnel to follow the emergency response procedures correctly.

•   It evaluates the ability of IT systems and other supporting apparatus to process real business transactions properly.

A parallel test is so named because, as live production systems continue to operate, the backup IT systems are processing business transactions in parallel, to test whether both systems process transactions equally well. Setting up a valid parallel test is complicated in many cases. In effect, you need to insert a logical “Y cable” into the business process flow so that the information flow will split and flow both to production systems (without interfering with their operation) and to the backup systems.

Results of transactions are compared. Personnel need to be able to determine whether the backup systems would be able to output correct data without actually having them do so. In many complex environments, you would not want the disaster recovery system to feed information into a live environment, because that may cause duplicate events to occur someplace else in the organization (or with customers, suppliers, or other parties). For instance, in a travel reservations system, you would not want a disaster recovery system to book actual travel, because that would cost real money and consume available space on an airline or other mode of transportation. But it would be important to know whether the disaster recovery system would be able to perform those functions. Somewhere along the line, it will be necessary to “unplug” the disaster recovery system from the rest of the environment and manually examine the results.
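The comparison step of a parallel test can be illustrated with a small routine that lines up the output of both systems by transaction. This is only a sketch: it assumes that each system's results have already been exported into a dictionary keyed by a transaction identifier, and the function and key names are hypothetical.

```python
def compare_parallel_results(production: dict, recovery: dict) -> dict[str, list[str]]:
    """Compare per-transaction output from production and recovery systems.

    Both arguments map a transaction ID to the result that system produced.
    Nothing is fed back into either system; the comparison is read-only.
    """
    report: dict[str, list[str]] = {
        "matched": [],
        "mismatched": [],
        "missing_on_recovery": [],
        "unexpected_on_recovery": [],
    }
    for txn_id in sorted(production.keys() | recovery.keys()):
        if txn_id not in recovery:
            report["missing_on_recovery"].append(txn_id)
        elif txn_id not in production:
            report["unexpected_on_recovery"].append(txn_id)
        elif production[txn_id] == recovery[txn_id]:
            report["matched"].append(txn_id)
        else:
            report["mismatched"].append(txn_id)
    return report
```

Any entry outside the "matched" list is a finding to investigate before the recovery system can be trusted with a real workload.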

Organizations that want to see whether their backup/disaster recovery systems can manage a real workload can perform a cutover test, which is discussed next.

Cutover Test  A cutover test, the most intrusive type of disaster recovery test, also provides the most reliable results in terms of answering the question of whether backup systems have the capacity and correct functionality to shoulder the real workload. The consequences of a failed cutover test, however, may resemble an actual disaster: if any part of the cutover test fails, real, live business processes will be proceeding without the support of IT applications, as though a real outage or disaster were in progress. But even a failure like this would reveal whether the backup systems will or won’t work if an actual disaster were to happen.

In some respects, a cutover test is easier to perform than a parallel test. A parallel test is a little trickier, because business information is required to flow to the production system and to the backup system, which means that some artificial component has been somehow inserted into the environment. With a cutover test, business processing takes place on the backup systems only, which can often be achieved through a simple configuration someplace in the network or the systems layer of the environment.

When conducting a cutover test, you should determine ahead of time how long the backup platform will be running. Additionally, a cutover test may be a good time to check the security controls of the backup platform.

NOTE   Not all organizations perform cutover tests, because they take a lot of resources to set up and they are risky. Many organizations find that a parallel test is sufficient to determine whether backup systems are accurate, and the risk of an embarrassing incident is almost zero with a parallel test.

Documenting Test Results

Every type and every iteration of disaster recovery plan testing needs to be documented. It’s not enough to say, “We did the test on September 10, 2021, and it worked.” First of all, no test goes perfectly—opportunities for improvement are always identified. But the most important part of testing is to discover what parts of the plan or the test should be reworked before the next test (or a real disaster) occurs.

As with any well-organized project, success is in the details. The road to success is littered with big and little mistakes, and the things that are identified in every sort of disaster recovery test need to be detailed so that the next iteration of the test will provide better results. Here are some key metrics that can be reported:

•   Time required to perform key tasks

•   Accuracy of tasks performed (or number of retries needed)

•   Amount of data recovered

•   Performance against recovery targets, including RTO, RPO, RCO, and RCapO
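Performance against time-based recovery targets can be computed directly from timestamps recorded during a test. The sketch below covers only RTO and RPO; RCO and RCapO are omitted because they are not simple time intervals. The record fields and function name are hypothetical, and the example assumes that recovery time is measured from the moment the outage is declared.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RecoveryTestResult:
    outage_declared: datetime        # when the (simulated) outage began
    service_restored: datetime       # when processing resumed on recovery systems
    last_recoverable_data: datetime  # timestamp of the newest data that was recovered


def evaluate_targets(result: RecoveryTestResult, rto: timedelta, rpo: timedelta) -> dict[str, bool]:
    """Report whether the measured recovery met the RTO and RPO targets."""
    recovery_time = result.service_restored - result.outage_declared
    data_loss_window = result.outage_declared - result.last_recoverable_data
    return {
        "rto_met": recovery_time <= rto,
        "rpo_met": data_loss_window <= rpo,
    }


test_run = RecoveryTestResult(
    outage_declared=datetime(2024, 5, 4, 10, 0),
    service_restored=datetime(2024, 5, 4, 13, 30),
    last_recoverable_data=datetime(2024, 5, 4, 9, 45),
)
outcome = evaluate_targets(test_run, rto=timedelta(hours=4), rpo=timedelta(minutes=10))
print(outcome)  # {'rto_met': True, 'rpo_met': False}
```

In this invented example, service was restored within the four-hour RTO, but 15 minutes of data were lost against a 10-minute RPO.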

Recording and comparing detailed test results from one test to the next will also help the organization measure progress. By this, I mean that the quality of disaster response plans should steadily improve from year to year. Simple mistakes of the past will not be repeated, and the only failures in future tests should be in new and novel parts of the environment that weren’t well thought out to begin with. And even these should diminish over time.

Improving Recovery and Continuity Plans

Every test of recovery and response plans should include a debriefing or review so that participants can discuss the outcome of the test: what went well, what went wrong, and how things should be done differently next time. This information should be recorded by someone who will be responsible for making changes to relevant documents. The updated documents should be circulated among test participants, who can confirm whether their discussion and ideas are properly reflected in the document.

Evaluating Business Continuity Planning

Audits and evaluations of an organization’s business continuity plan are especially challenging, because it is difficult to prove whether the plans will work unless a real disaster is experienced. The lion’s share of an evaluation result hinges on the quality of documentation and discussions with key personnel. The evaluation of an organization’s business continuity program should begin with a top-down analysis of key business objectives and a review of documentation and interviews to determine whether the business continuity strategy and program details support those key business objectives. This approach is depicted in Figure 7-17.

Figure 7-17  Top-down approach to an evaluation of business continuity

The objectives of a BCP evaluation should include the following activities:

•   Obtain documentation that describes current business strategies and objectives. Obtain high-level documentation (such as strategy, charter, and objectives) for the business continuity program and determine whether and how the program aligns with business strategies and objectives.

•   Obtain the most recent BIA and accompanying threat analysis, risk analysis, and criticality analysis. Determine whether these documents are current and complete, and whether they support the business continuity strategy. Also determine whether the scope of these documents covers those activities considered strategic according to high-level business objectives. Finally, determine whether the methods in these documents represent good practices for these activities.

•   Determine whether key personnel are ready to respond during a disaster by reviewing test plans and training plans and results. Learn where emergency procedures are stored and whether key personnel have access to them.

•   Verify whether a process is in place for the regular review and update of business continuity documentation. Evaluate the process’s effectiveness by reviewing records to determine how frequently documents are reviewed.

Examining Business Continuity Documentation

The bulk of an organization’s business continuity plan lies in its documentation, so it should be little surprise that the bulk of any evaluation will rest in the examination of this documentation. The following steps will help determine the effectiveness of the organization’s business continuity plans:

1.   Obtain a copy of business continuity documentation, including response procedures, contact lists, and communication plans.

2.   Examine samples of distributed copies of business continuity documentation and determine whether they are up-to-date. These samples can be obtained during interviews of key response personnel, which are covered in this procedure.

3.   Determine whether all documents are clear and easy to understand, not just for primary responders, but for alternate personnel who may have specific relevant skills but less familiarity with the organization’s critical applications. In some disaster scenarios, primary responders may be unavailable to carry out disaster response activities.

4.   Examine documentation related to the declaration of a disaster and the initiation of disaster response. Determine whether the methods for declaration are likely to be effective in a disaster scenario.

5.   Obtain emergency contact information, and contact some of the personnel to determine whether the contact information is accurate and up-to-date. Also check to see that all response personnel are still employed in the organization and are in the same or similar roles in support of disaster response efforts.

6.   Contact some or all of the response personnel who are listed in emergency contact lists. Interview them to see how well they understand their disaster response responsibilities and whether they are familiar with disaster response procedures. Ask each interviewee whether they have a copy of these procedures, and ask whether their copies are current.

7.   Determine whether a process exists for the formal review and update of business continuity documentation. Examine records to see how frequently, and how recently, documents have been reviewed and updated.

8.   Determine whether response personnel receive any formal or informal training on response and recovery procedures. Determine whether personnel are required to receive training and whether any records are kept that show which personnel received training and at what time.

9.   Determine whether business continuity planners perform tests, walk-throughs, and exercises of plans, and whether retrospectives or after-action reviews of tests are performed to identify opportunities for improvement.
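The review-frequency check in step 7 lends itself to a simple script. The following is a minimal sketch, assuming the organization keeps a log of the date each document was last reviewed; the document names, the dates, and the 365-day review interval are hypothetical and should be replaced with the organization’s own records and policy.

```python
from datetime import date, timedelta

# Hypothetical review log: document name -> date of last formal review.
last_reviewed = {
    "Emergency contact list": date(2024, 1, 15),
    "Disaster declaration procedure": date(2022, 6, 30),
    "Communication plan": date(2023, 11, 2),
}

def stale_documents(review_log, as_of, max_age_days=365):
    """Return the names of documents last reviewed more than max_age_days ago."""
    cutoff = as_of - timedelta(days=max_age_days)
    return sorted(name for name, reviewed in review_log.items() if reviewed < cutoff)

# The 365-day interval is an assumed policy, not a requirement from this chapter.
overdue = stale_documents(last_reviewed, as_of=date(2024, 6, 1))
print(overdue)
```

A list like this only shows which documents are overdue. The examiner still needs to read the documents to judge whether the reviews were substantive.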

Reviewing Prior Test Results and Action Plans

The effectiveness of business continuity plans relies, to a great degree, on the results and outcomes of tests. Examine these tests carefully to determine their effectiveness and to what degree they are used to improve procedures and train personnel. The following will help determine the effectiveness of business continuity testing:

•   Determine whether a strategy exists for testing business continuity procedures. Obtain records for past tests and a plan for future tests. Determine whether prior tests and planned tests are adequate for establishing the effectiveness of response and recovery procedures.

•   Examine records for tests that have been performed over the past few years, and determine the types of tests that were performed. Obtain a list of participants for each test. Compare the participants to lists of key recovery personnel. Examine test work papers to determine the level of participation by key recovery personnel.

•   Determine whether a formal process exists for recording test results and for using those results to make improvements in plans and procedures. Examine documents and records to determine the types of changes that were recommended in prior tests. Examine business continuity documents to determine whether these changes were made as expected.

•   Considering the types of tests that were performed, determine the adequacy of testing as an indicator of the effectiveness of the business continuity program. Were only document reviews and walk-throughs performed, for example, or were parallel or cutover tests conducted?

•   If tests have been performed for two years or more, determine whether continuous improvement in response and recovery procedures exists.

•   If the organization performs parallel tests, determine whether tests are designed in a way that effectively determines the actual readiness of standby processes and systems. Also determine whether parallel tests measure the capacity of standby systems or merely their ability to process correctly but at a lower level of performance.
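Comparing test participants to the list of key recovery personnel is essentially a set comparison. The sketch below illustrates one way to do it; the names, test labels, and data layout are hypothetical, and the function assumes the list of key personnel is not empty.

```python
# Hypothetical data: key recovery personnel and the participants in each test.
key_personnel = {"A. Rivera", "B. Chen", "C. Okafor", "D. Patel"}
test_participants = {
    "2023 walk-through": {"A. Rivera", "B. Chen"},
    "2024 simulation": {"A. Rivera", "C. Okafor", "E. Novak"},
}

def participation_gaps(key_people, tests):
    """Return key personnel who joined no test, and per-test coverage of key personnel."""
    participated = set().union(*tests.values()) if tests else set()
    never_participated = sorted(key_people - participated)
    coverage = {
        name: len(people & key_people) / len(key_people)
        for name, people in tests.items()
    }
    return never_participated, coverage

never_participated, coverage = participation_gaps(key_personnel, test_participants)
print(never_participated)
print(coverage)
```

Key personnel who appear in no test are candidates for follow-up interviews. Attendance alone does not show how actively someone participated, which is why the test work papers still need to be examined.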

Interviewing Key Personnel

The knowledge and experience of key personnel are vital to the success of any business continuity operation. Interviews will help determine whether key personnel are prepared and trained to respond during a disaster. The following will guide discussions:

•   Ask the interviewee to summarize his or her professional experience and training and current responsibilities in the organization.

•   Ask whether he or she is familiar with the organization’s business continuity and disaster recovery programs.

•   Determine whether he or she is among the key response personnel expected to respond during a disaster.

•   Ask whether the interviewee has been issued a copy of any response or recovery procedures. If so, ask to see those procedures to determine whether they are current versions. Ask if the interviewee has additional sets of procedures in any other locations (residence, for example).

•   Ask whether he or she has received any training. Request evidence of this training (certificate, calendar entry, notes, and so on).

•   Ask whether the interviewee has participated in any tests or evaluations of recovery and response procedures. Ask whether the tests were effective, whether management takes the tests seriously, and whether any deficiencies in tests resulted in any improvements to test procedures or other documents.

Reviewing Service Provider Contracts

No organization is an island. Every organization has critical suppliers without which it could not carry out its critical functions. The ability to recover from a disaster also frequently requires the support of one or more service providers or suppliers. The examiner or auditor should examine contracts for all critical suppliers and consider the following questions:

•   Does the contract support the organization’s requirements for delivery of services and supplies, even in the event of a local or regional disaster?

•   Does the service provider have its own disaster recovery capabilities that will ensure its ability to deliver critical services during a disaster?

•   Is recourse available should the supplier be unable to provide goods or services during a disaster?

Finally, the examiner or auditor should determine whether the organization can continue its own critical business processes should a key service provider experience its own disaster. The service provider may elect to activate an alternate processing center; will the organization’s systems be able to connect to the service provider’s disaster recovery systems easily and continue functioning as expected? Are there instructions for connecting systems to the service provider’s disaster recovery systems?

Reviewing Insurance Coverage

The examiner or auditor should examine the organization’s insurance policies related to the loss of property and assets supporting critical business processes. Insurance coverage should include the actual cost of recovery or a lesser amount if the organization’s executive management has accepted that. Obtain documentation that includes cost estimates for various disaster recovery scenarios, including equipment replacement, business interruption, and the cost of performing business functions and operating IT systems in alternate sites. These cost estimates should be compared with the value of insurance policies.
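The comparison of cost estimates with policy values is simple arithmetic. The following sketch assumes a single policy limit and per-scenario estimates broken into the three cost categories named above; the scenarios and all figures are hypothetical.

```python
# Hypothetical cost estimates per disaster scenario, in a single currency.
cost_estimates = {
    "Data center fire": {
        "equipment_replacement": 1_200_000,
        "business_interruption": 800_000,
        "alternate_site_operations": 300_000,
    },
    "Regional flood": {
        "equipment_replacement": 2_000_000,
        "business_interruption": 1_500_000,
        "alternate_site_operations": 600_000,
    },
}
policy_limit = 3_000_000  # hypothetical coverage limit

def coverage_shortfalls(estimates, limit):
    """Return scenarios whose total estimated cost exceeds the policy limit, with the gap."""
    shortfalls = {}
    for scenario, costs in estimates.items():
        total = sum(costs.values())
        if total > limit:
            shortfalls[scenario] = total - limit
    return shortfalls

shortfalls = coverage_shortfalls(cost_estimates, policy_limit)
print(shortfalls)
```

A shortfall is not automatically a finding. As noted above, executive management may have formally accepted a lesser amount of coverage, so the examiner should look for evidence of that decision.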

Evaluating Disaster Recovery Planning

The evaluation of a disaster recovery program and its plans should focus on their alignment with the organization’s business continuity plans. To a great extent, DRP should support BCP so that the organization’s most critical business processes will have companion disaster recovery plans that may need to be activated when a natural or human-made disaster impairs information systems at the organization’s primary processing facility.

The objectives of an examination or audit of DRP should include the following activities:

•   Determine the effectiveness of planning and recovery documentation by examining previous test results.

•   Evaluate the methods used to store critical information offsite (which may consist of offsite storage, alternate data centers, replication, or e-vaulting).

•   Examine environmental and physical security controls in any offsite or alternate sites and determine their effectiveness.

•   Note whether offsite or alternate site locations are within the same geographic region, which could mean that both the primary and alternate sites could be involved in common disaster scenarios.
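As a first pass at the geographic question in the last item, the great-circle distance between two sites can be computed with the haversine formula. This is a minimal sketch: the coordinates and the 100 km threshold are hypothetical, and the formula treats the Earth as a sphere with a mean radius of 6,371 km.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two latitude/longitude points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Hypothetical site coordinates (latitude, longitude).
primary_site = (47.6062, -122.3321)
alternate_site = (47.2529, -122.4443)
MIN_SEPARATION_KM = 100  # assumed threshold for illustration only

distance = haversine_km(*primary_site, *alternate_site)
too_close = distance < MIN_SEPARATION_KM
print(f"{distance:.1f} km apart; flag for review: {too_close}")
```

Distance is only a rough screen. Two sites far apart can still share a hazard such as a fault line, a flood plain, or a power grid, so the result should inform the threat analysis rather than replace it.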

Evaluating Disaster Recovery Plans

The following will help determine the effectiveness of an organization’s disaster recovery plans:

•   Obtain a copy of the disaster recovery documentation, including response procedures, contact lists, and communication plans.

•   Examine samples of distributed copies of the documentation and determine whether they are up-to-date. These samples can be obtained during interviews of key response personnel, which are covered in this procedure.

•   Determine whether all documents are clear and easy to understand, not just for primary responders, but for alternate personnel who may have specific relevant skills but less familiarity with the organization’s critical applications. Remember that primary responders may be unavailable in some disaster scenarios, and that others may need to carry out disaster recovery plan procedures.

•   Obtain contact information for offsite storage providers, hot-site facilities, and critical suppliers. Determine whether these organizations are still providing services to the organization. Call some of the contacts to determine the accuracy of the documented contact information.

•   For organizations using third-party recovery sites such as cloud infrastructure providers, obtain contracts and records that define organization and cloud provider obligations, service levels, and security controls.

•   Obtain logical and physical architecture diagrams for key IT applications that support critical business processes. Determine whether disaster recovery and business continuity documentation includes recovery procedures for all components that support those IT applications. Determine whether documentation includes recovery for end users and administrators for the applications.

•   If the organization uses a hot site, examine one or more systems to determine whether they have the proper versions of software, patches, and configurations. Examine procedures and records related to the tasks in support of keeping standby systems current, and determine whether these procedures are effective.

•   If the organization has a warm site, examine the procedures used to bring standby systems into operational readiness. Examine warm-site systems to see whether they are in a state where readiness procedures will likely be successful.

•   If the organization has a cold site, examine all documentation related to the acquisition of replacement systems and other components. Determine whether the procedures and documentation are likely to result in systems capable of hosting critical IT applications within the period required to meet key recovery objectives.

•   If the organization uses a cloud service provider’s service as a recovery site, examine the procedures used to prepare and bring cloud-based systems to operational readiness. Examine procedures and configurations to see whether they are likely to support the organization successfully during a disaster.

•   Determine whether any documentation exists regarding the relocation of key personnel to the alternate processing site. Check that the documentation specifies which personnel are to be relocated and what accommodations and supporting logistics are provided. Determine the effectiveness of these relocation plans.

•   Determine whether backup and offsite (or replication or e-vaulting) storage procedures are being followed. Examine systems to ensure that critical IT applications are being backed up and that proper media are being stored offsite (or that the proper data is being replicated or e-vaulted). Determine whether data recovery tests are ever performed and, if so, whether the results of those tests are documented and problems are properly dealt with.

•   Evaluate procedures for transitioning processing from the alternate processing facility back to the primary processing facility. Determine whether these procedures are complete and effective, and whether they have been tested.

•   Determine whether a process exists for the formal review and update of business continuity documentation to ensure continued alignment with DRP. Examine records to see how frequently, and how recently, documents have been reviewed and updated. Determine whether this is sufficient and effective by interviewing key personnel to understand whether significant changes to applications, systems, networks, or processes are reflected in recovery and response documentation.

•   Determine whether response personnel receive any formal or informal training on response and recovery procedures. Determine whether personnel are required to receive training, and whether any records are kept that show which personnel received training and at what time.

•   Examine the organization’s change control process. Determine whether the process includes any steps or procedures that require personnel to decide whether any change has an impact on disaster recovery documentation or procedures.
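The hot-site check in the list above, which asks whether standby systems have the proper versions of software, patches, and configurations, can be supported by a simple inventory comparison. The sketch below assumes that component inventories for a primary system and its standby counterpart have already been exported; the component names and version strings are hypothetical.

```python
# Hypothetical inventories: component -> installed version.
primary_inventory = {
    "operating_system": "22.04.4",
    "database": "15.6",
    "billing_app": "4.2.1",
}
standby_inventory = {
    "operating_system": "22.04.4",
    "database": "15.4",
    "report_tool": "1.0.0",
}

def config_drift(primary, standby):
    """Return components whose versions differ, as (primary, standby) pairs.

    A value of None means the component is missing on that side.
    """
    drift = {}
    for component in sorted(primary.keys() | standby.keys()):
        p, s = primary.get(component), standby.get(component)
        if p != s:
            drift[component] = (p, s)
    return drift

drift = config_drift(primary_inventory, standby_inventory)
for component, (p, s) in drift.items():
    print(f"{component}: primary={p}, standby={s}")
```

Any drift found this way points back to the procedures for keeping standby systems current, which is where the examiner should look for the underlying cause.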

Reviewing Disaster Recovery Test Results and Action Plans

The effectiveness of disaster recovery plans relies on the results and outcomes of tests. The examiner or auditor needs to examine these plans and activities to determine their effectiveness. The following will help examine disaster recovery testing:

•   Determine whether a strategy or policy is in place for testing disaster recovery plans. Obtain records for past tests and a plan for future tests.

•   Examine records for tests that have been performed over the past year or two. Determine the types of tests that were performed. Obtain a list of participants for each test, and compare the participants to lists of key recovery personnel. Examine test work papers to determine the level of participation by key recovery personnel.

•   Determine whether there is a formal process for recording test results and for using those results to make improvements in plans and procedures. Examine work papers and records to determine the types of changes that were recommended in prior tests. Examine disaster recovery documents to see whether these changes were made as expected.

•   Considering the types of tests that were performed, check the adequacy of testing as an indicator of the effectiveness of the disaster recovery program. Did the organization perform only document reviews and walk-throughs, for example, or did the organization also perform parallel or cutover tests?

•   If tests have been performed for two years or more, check for continuous improvement in response and recovery procedures.

•   If the organization performs parallel tests, determine whether tests are designed in a way that effectively determines the actual readiness of standby systems. Also, determine whether parallel tests measure the capacity of standby systems or merely their ability to process correctly but at a lower level of performance.

•   Determine whether any tests included the retrieval of backup data from offsite storage, replication, or e-vaulting facilities. See what disaster scenarios were tested and the types of recovery procedures that were performed.

It is important to keep in mind that a cyberattack may trigger a disaster scenario. Disaster recovery plans must address scenarios such as ransomware and wiper attacks.

Evaluating Offsite Storage

Storage of critical data and other supporting information is a key component in any organization’s disaster recovery plan. Because some types of disasters can completely destroy a business location, including its vital records, it is imperative that all critical information is backed up and copies moved to an offsite storage facility. The following will help determine the effectiveness of offsite storage:

•   Obtain the location of the offsite storage or e-vaulting facility. Determine whether the facility is located in the same geographic region as the organization’s primary processing facility.

•   If possible, visit the facility and examine its physical security controls as well as its safeguards to prevent damage to stored information in a disaster. Consider the entire spectrum of physical and logical access controls. Examine procedures and records related to the storage and return of backup media and other information that the organization may store there. If it is not possible to visit the facility, obtain copies of audits or other attestations of controls effectiveness.

•   Take an inventory of backup media and other information stored at the facility. Compare this inventory with a list of critical business processes and supporting IT systems to determine whether all relevant information is, in fact, stored at the facility.

•   Determine how often the organization performs its own inventory of the facility and whether deficiencies are documented and remedied.

•   Examine contracts, terms, and conditions for offsite storage providers or e-vaulting facilities, if applicable. Determine whether data can be recovered to the original processing center and to alternate processing centers within a period that will ensure that disaster recovery can be completed within RTOs.

•   Determine whether the appropriate personnel have current access codes or license keys for offsite storage or e-vaulting facilities and whether they have the ability to recover data from those facilities.

•   Determine what information, in addition to backup data, exists at the facility. Information stored offsite should include architecture diagrams, design documentation, operations procedures and documentation, application source code, software build systems, and configuration information for all logical and physical layers of technology and facilities supporting critical IT applications.

•   Obtain information related to the manner in which backup media and copies of records are transported to and from the offsite storage or e-vaulting facility. Determine the adequacy of controls protecting transported information.

•   Obtain records supporting the transport of backup media and records to and from the storage facility. Examine samples of records and determine whether they match other records, such as backup logs.
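Matching transport records against backup logs, as described in the last item, amounts to reconciling two lists of media identifiers. The following is a minimal sketch; the media labels are hypothetical, and it assumes both record sets identify media by the same label.

```python
# Hypothetical media identifiers taken from two independent record sets.
backup_log_media = {"TAPE-0101", "TAPE-0102", "TAPE-0103"}
transport_record_media = {"TAPE-0101", "TAPE-0103", "TAPE-0200"}

def reconcile(backed_up, transported):
    """Return media backed up but never shipped, and media shipped with no backup record."""
    not_shipped = sorted(backed_up - transported)
    no_backup_record = sorted(transported - backed_up)
    return not_shipped, no_backup_record

not_shipped, no_backup_record = reconcile(backup_log_media, transport_record_media)
print("Backed up but not transported:", not_shipped)
print("Transported without a backup log entry:", no_backup_record)
```

Either kind of mismatch warrants a closer look at the sampled records, since it may indicate missing media or incomplete record keeping.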

NOTE   Organizations need to balance the time spent practicing backup procedures with the time spent practicing the procedures used in different types of recovery scenarios.

Evaluating Alternate Processing Facilities

The examiner or auditor needs to examine alternate processing facilities to determine whether they are sufficient to support the organization’s business continuity and disaster recovery plans. The following will help determine whether an alternate processing facility will be effective:

•   Obtain addresses and other location information for alternate processing facilities. These will include hot sites, warm sites, cold sites, cloud-based services, and alternate processing centers owned or operated by the organization. Note that exact locations of cloud services are often unavailable for security reasons.

•   Determine whether alternate facilities are located within the same geographic region as the primary processing facility and note whether the alternate facility will also be adversely affected by a disaster that strikes the primary facility.

•   Perform a threat analysis on the alternate processing site. Determine which threats and hazards pose a significant risk to the organization and its ability to carry out operations effectively during a disaster.

•   Determine the types of natural and human-made events likely to take place at the alternate processing facility. Determine whether there are adequate controls to mitigate the effect of these events.

•   Examine all environmental controls and determine their adequacy. This should include heating, ventilation, and air conditioning (HVAC), power supply, uninterruptible power supply (UPS), power distribution units (PDUs), switchgear, and electric generators. Also, examine fire detection and suppression systems, including smoke detectors, pull stations, fire extinguishers, sprinklers, and inert gas suppression systems.

•   If the alternate processing facility is a separate organization, obtain the legal contract and all exhibits. Examine these documents and determine whether the contract and exhibits support the organization’s recovery and testing requirements.

NOTE   Cloud-based service providers often do not permit onsite visits. Instead, they may have one or more external audit reports available through standard audits such as SSAE 18, ISAE 3402, SOC 1, SOC 2, ISO, or PCI. It is vital to determine whether external audit reports are reliable and whether any controls are not covered in external audits.

Evaluating Security Incident Response

Evaluating security incident response plans can be a challenge, because it can be difficult to know whether plans will work, or whether personnel will understand how to follow them, when a real security incident occurs. Evaluation should use the top-down approach, shown in Figure 7-17, by examining business strategies and objectives, high-level documentation, and finally the incident response plans and playbooks themselves.

The steps and details for evaluating security incident response plans are virtually the same as those of evaluations of BCPs and DRPs: plans should be thorough, specific, business aligned, and maintained. Various exercises, from document walk-throughs to live-fire testing, should be performed to ensure that plans are accurate and that personnel use them correctly. After-action reviews should be performed, with follow-through on all action items that were identified.

Evaluations of security incident response plans need to be associated with risk and threat assessments: when new risks and threats are identified, the organization must ensure that detective controls, preventive controls, and response plans address them. Because risks, threats, and detective and preventive capabilities change frequently, reviews and updates to incident response plans and playbooks likewise must be frequent.

Chapter Review

Security incident management, disaster recovery planning, and business continuity planning all support a central objective: resilience and rapid recovery when disruptive events occur.

A security incident occurs when the confidentiality, integrity, or availability of information or information systems has been or is in danger of being compromised. The proliferation of connected devices makes life safety an additional consideration in many organizations.

An organization that is developing security incident response plans needs to determine high-level objectives so that response plans will meet these objectives.

With the proliferation of outsourcing to cloud-based service providers, many security incidents now take place in third-party organizations, which requires additional planning and coordination so that any incident response involving a third party is effective.

BCP and DRP work together to ensure the survival of an organization during and after a cyberattack, natural disaster, or human-made disaster.

The business impact analysis identifies the impact of various disaster scenarios and determines the most critical processes and systems in an organization. The BIA helps an organization focus its BCP and DRP on the most critical business functions. Statements of impact help management better understand the results of disruptive events in business terms. In a criticality analysis, each system and process is studied to consider the impact on the organization if it is incapacitated, the likelihood of incapacitation, and the estimated cost of mitigating the risk or impact of incapacitation.

Maximum tolerable downtime and maximum tolerable outage inform the development of recovery targets, including recovery time objective, recovery point objective, and recovery capacity objective, to help an organization understand how quickly various business processes should be recovered after a disaster. Recovery speed is an important factor, because the cost of recovery varies widely with how quickly recovery must be completed.

Business continuity plans define the methods the organization will use to continue critical business operations after a disaster has occurred. Disaster recovery plans define the steps that will be undertaken to salvage and recover systems damaged by a disaster. Both BCP and DRP activities work toward the restoration of capabilities in their original (or replacement) facilities.

The safety of personnel is the most important consideration in any disaster recovery plan.

DRP is concerned with system resilience matters, including data backup and replication, the establishment of alternate processing sites (hot, warm, cold, cloud, mobile, or reciprocal), and the recovery of applications and data. The complexity of a disaster recovery plan necessitates reviews and testing to ensure that the plan is effective and will be successful during an actual disaster.

Recovery targets established during the BIA directly influence disaster recovery plans through the development of suitable infrastructure and response plans.

Security incident response plans, business continuity plans, and disaster recovery plans all need to be evaluated and tested to ensure their suitability. Organizations need to identify and train incident responders to ensure that they will understand how to respond to incidents properly and effectively.

Notes

•   Understanding the computer intrusion kill chain model can help an organization identify opportunities to make their systems more resilient to intruders.

•   The development of custom playbooks that address specific types of security incidents will ensure a more rapid and effective response to an incident. High-velocity incidents such as data wipers and ransomware require a rapid, almost-automated, response.

•   Organizations must carefully understand all of the terms and exclusions in any cyber-insurance policy to ensure that no exclusions would result in a denial of benefits after an incident.

•   With so many organizations using cloud-based services, it’s especially important that organizations understand, in detail, their own roles and responsibilities as well as those of each cloud service provider. This will ensure that the organization can build effective incident response should an incident occur at a cloud-based service provider.

•   Recovery objectives such as recovery time objective and recovery point objective serve as signposts for the development of risk mitigation plans and business continuity plans. Eventually, plans in support of these objectives must be developed and tested, usually in the context of business continuity planning.

Questions

1.   Which of the following recovery objectives is associated with the longest allowable period for a service outage?

A.   Recovery tolerance objective (RTO)

B.   Recovery point objective (RPO)

C.   Recovery capacity objective (RCapO)

D.   Recovery time objective (RTO)

2.   A security manager is developing a strategy for making improvements to the organization’s incident management process. The security manager has defined the desired future state. Before specific plans can be made to improve the process, the security manager should perform a:

A.   Training session

B.   Penetration test

C.   Vulnerability assessment

D.   Gap analysis

3.   A large organization operates hundreds of business applications. How should the security manager prioritize applications for protection from a disaster?

A.   Conduct a business impact analysis.

B.   Conduct a risk assessment.

C.   Conduct a business process analysis.

D.   Rank the applications in order of criticality.

4.   The types of incident response plan testing are:

A.   Document review, walk-through, and simulation

B.   Document review and simulation

C.   Document review, walk-through, simulation, parallel test, and cutover test

D.   Document review, walk-through, and cutover test

5.   An organization has developed its first-ever business continuity plan. What is the first test of the continuity plan that the business should perform?

A.   Walk-through

B.   Simulation

C.   Parallel test

D.   Cutover test

6.   An organization is experiencing a ransomware attack that is damaging critical data. What is the best course of action?

A.   Security incident response

B.   Security incident response followed by business continuity plan

C.   Concurrent security incident response and business continuity plan

D.   Business continuity plan

7.   What is the most important consideration when selecting a hot site?

A.   Time zone

B.   Geographic location in relation to the primary site

C.   Proximity to major transportation

D.   Natural hazards

8.   An organization has established a recovery time objective of 14 days for its most critical business applications. Which recovery strategy would be the best choice?

A.   Mobile site

B.   Warm site

C.   Hot site

D.   Cold site

9.   What technology should an organization use for its application servers to provide continuous service to users?

A.   Dual power supplies

B.   Server clustering

C.   Dual network feeds

D.   Transaction monitoring

10.   An organization currently stores its backup media in a cabinet next to the computers being backed up. A consultant told the organization to store backup media at an offsite storage facility instead. What risk did the consultant most likely have in mind when he made this recommendation?

A.   A disaster that damages computer systems can also damage backup media.

B.   Backup media rotation may result in loss of data backed up several weeks in the past.

C.   Corruption of online data will require rapid data recovery from offsite storage.

D.   Physical controls at the data processing site are insufficient.

11.   A major earthquake has occurred near an organization’s operations center. Which of the following should be the organization’s top priority?

A.   Ensuring that an automatic failover to the recovery site will occur because personnel may be slow to respond

B.   Ensuring that visitors know how to evacuate the premises and that they are aware of the locations of sheltering areas

C.   Ensuring that data replication to a recovery site has been working properly

D.   Ensuring that backup media will be available at the recovery site

12.   An organization wants to protect its data from the effects of a ransomware attack. What is the best data protection approach?

A.   Periodically scan data for malware.

B.   Replicate data to a cloud-based storage provider.

C.   Replicate data to a secondary storage system.

D.   Back up data to offline media.

13.   An auditor is evaluating an organization’s disaster recovery plan. Which of the following artifacts should be examined first?

A.   Business impact analysis

B.   After-action reviews

C.   Test results

D.   Training records

14.   An organization’s top executives are growing tired of receiving reports about minor security incidents. What is the best course of action?

A.   Enact controls to stop the incidents from occurring.

B.   Discontinue informing executives about incidents.

C.   Develop an incident severity schedule.

D.   Review regulatory requirements for incident disclosure.

15.   An organization has established a recovery time objective of four hours for its most critical business applications. Which recovery strategy would be the best choice?

A.   Mobile site

B.   Warm site

C.   Hot site

D.   Cold site

Answers

1.   D. RTO is the maximum period of time from the onset of an outage until the resumption of service.

2.   D. When the desired end state of a process or system is determined, a gap analysis must be performed so that the current state of the process or system can also be known. Then specific tasks can be performed to reach the desired end state of the process.

3.   A. A business impact analysis (BIA) is used to identify the business processes, and the information systems that support them, that are most critical for the organization’s ongoing operations.

4.   A. The types of security incident response plan testing are a document review, a walk-through, and a simulation. Parallel and cutover tests are not part of security incident response planning or testing but are used for disaster recovery planning.

5.   A. The best choice of tests for a first-time business continuity plan is a document review or a walk-through. Since this is a first-time plan, other tests are not the best choices.

6.   C. If an organization’s critical data has been damaged or destroyed by a ransomware incident, the organization should invoke its business continuity plan alongside its security incident response plan. This may help the organization restore services to its customers more quickly.

7.   B. An important selection criterion for a hot site is the geographic location in relation to the primary site. If they are too close together, a single disaster event may involve both locations.

8.   D. An organization that has a 14-day recovery time objective (RTO) can use a cold site for its recovery strategy. Fourteen days is enough time for most organizations to acquire hardware and recover applications.

9.   B. An organization that wants its application servers to be available continuously to its users needs to employ server clustering so that at least one server will always be available to service user requests.

10.   A. The primary reason for employing offsite backup media storage is to mitigate the effects of a disaster that could otherwise destroy computer systems and their backup media.

11.   B. The safety of personnel is always the top priority when any disaster event has occurred. While important, the condition of information systems is a secondary concern.

12.   D. The best approach for protecting data from a high-velocity attack such as ransomware is to back up the data to offline media that cannot be accessed by end users. Replicating data to another storage system may only serve to replicate damaged data to the secondary storage system, making recovery more difficult or expensive.

13.   A. The auditor should first examine business impact analysis documents, as these define the priority of critical business processes as well as recovery targets.

14.   C. It is apparent that the security incident response plan does not have severity levels. One property of severity levels is the frequency and level of internal communications. For instance, executives are spared from being informed about minor incidents while they occur. Higher-severity incidents call for notifications to managers higher in the organization and for more frequent notifications.

15.   C. An organization that has a four-hour recovery time objective (RTO) should use a hot site for its recovery strategy. Only a hot site would be able to perform primary processing within that period of time.
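
The severity schedule described in answer 14 can be sketched in a few lines of Python. This is a minimal illustration only: the level names, audiences, and update intervals below are hypothetical assumptions, not values from this chapter or from any standard, and each organization would define its own.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SeverityLevel:
    name: str
    notify: Tuple[str, ...]               # audiences informed while the incident is open
    update_interval_hours: Optional[int]  # None means no periodic status updates


# Hypothetical schedule (assumed values): higher severity reaches
# higher in the organization and is reported more often.
SEVERITY_SCHEDULE = {
    1: SeverityLevel("Low", ("security team",), None),
    2: SeverityLevel("Medium", ("security team", "IT management"), 24),
    3: SeverityLevel("High", ("security team", "IT management", "executives"), 4),
    4: SeverityLevel(
        "Critical",
        ("security team", "IT management", "executives", "board of directors"),
        1,
    ),
}


def notification_plan(severity: int) -> SeverityLevel:
    """Return the communication rules for a given severity level."""
    if severity not in SEVERITY_SCHEDULE:
        raise ValueError(f"Unknown severity level: {severity}")
    return SEVERITY_SCHEDULE[severity]


def executives_notified(severity: int) -> bool:
    """Report whether executives are informed while the incident is in progress."""
    return "executives" in notification_plan(severity).notify
```

With a schedule like this, minor incidents stay with the security team, while executives hear only about incidents at the higher levels, and hear about them more often.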
