Terms you'll need to understand:
Disaster recovery plan
Business continuity plan
Hot site
Warm site
Cold site
Contingencies
Maximum tolerable downtime
Remote journaling
Electronic vaulting
Disk shadowing
Techniques you'll need to master:
The process and development of contingency plans
Business impact analysis processes and procedures
Backup procedures and alternatives
BCP testing strategies
Recovery strategies
Much of the material you have read in this book has dealt with the ways in which security incidents can be prevented. The business continuity plan (BCP) and disaster-recovery plan (DRP) domains address what to do and how to respond when things go wrong. This chapter discusses how to preserve business operations in the face of major disruptions. The BCP is about assessing risk and determining how the business would respond should these risks occur. Some of the steps of the BCP process include project management and planning, business impact analysis (BIA), continuity planning design and development, and BCP testing and training. The DRP is a subset of the BCP; it covers the planning and restoration actions the business would undertake if a disastrous event occurred.
To pass the business continuity planning domain of the ISC2 Certified Information Systems Security Professional (CISSP) exam, you will need to know the steps that make up the BCP process. You will also need to know the differences between BCP and DRP. Finally, you must understand the ways in which the BCP can be tested, including checklist, tabletop, functional, and full-interruption tests.
Natural disasters such as earthquakes, floods, and fires often come at the least expected time. Others, such as hurricanes and tornados, are increasing in severity and destruction. Many reports and studies have found that only about 50% of businesses have comprehensive business continuity plans in place. Disasters can come in many shapes and forms. To those foolish enough not to be prepared, it could mean the death of the business. Organizations must plan for these types of disasters:
Natural: Earthquakes, storms, fires, floods, hurricanes, tornados, and tidal waves
System/technical: Outages, malicious code, worms, and hackers
Supply systems: Electrical power problems, equipment outages, utility problems, and water shortages
Human-made/political: Disgruntled employees, riots, vandalism, theft, crime, protesters, and political unrest
Each of these can cause an interruption in operations. The length of time that services could be interrupted is defined as follows:
Minor: Operations are disrupted for several hours to less than a day.
Intermediate: An event of this magnitude can cause operations to be disrupted for a day or longer. The organization might need a secondary site to continue operations.
Major: This type of event is a true catastrophe. In this type of disaster, the entire facility would be destroyed. A long-term solution would require building a new facility.
Business Continuity Management is about more than just developing a plan for recovery in case of an outage. It takes a long, hard look at the way in which the organization does business. The goal is not just to reduce outage time, but also to find better ways for the organization to manage its products and services. The Business Continuity Institute (www.thebci.org), a professional body for business continuity management, defines business continuity management in the following terms:
Business Continuity Management is a holistic management process that identifies potential impacts that threaten an organization and provides a framework for building resilience and the capability for an effective response that safeguards the interests of its key stakeholders, reputation, brand, and value creating activities.
The BCP is developed to prevent interruptions to normal business. If these events cannot be prevented, the goal of the plan is to minimize the outage. The other goal of the plan is to reduce the costs that such disruptions might impose on an organization and to mitigate the associated risk. The BCP process as defined by the ISC2 has the following five steps:
Project management and initiation
Business impact analysis
Recovery strategy
Plan design and development
Testing, maintenance, awareness, and training
Each of these is discussed in the following sections.
Before the BCP process can begin, you need to make your case to management. You have to establish the need for the BCP. One way to start is to perform a risk analysis to identify and document potential outages to critical systems. The results should be presented to management so they understand the potential risk. That's a good time to remind them that, ultimately, they are responsible. Customers, shareholders, or anyone else could bring a civil suit against senior managers who they feel have not practiced due care. If you don't get management's support, you will not have the funds to complete the project, and it will be marginally successful, if at all.
With management on board, you can start to develop a plan of action. This management plan should include the following:
Scope of the project: A properly defined scope is a tremendous help in ensuring that an effective BCP is devised. At this point in the process, the decision to do only a partial recovery or a full recovery would be made. In larger organizations, office politics can pull the project in directions that it might not need to be going. Another problem is project creep, which occurs when items that were not part of the original project plan are added to it. This can delay completion of the project or cause it to run over budget.
Appointment of a project planner: The project planner is a key role because this person drives the process. The project planner must ensure that all elements of the plan are properly addressed and that a sufficient level of research, planning, and analysis has been performed before the plan begins. This individual must also have enough credibility with senior management to influence them when the time comes to present the results and recommendations.
Determination of who will be on the team: The team should include representatives from senior management, the legal staff, recovery team leaders, the information security department, various business units, networking, and physical security. You want to make sure that the individuals who would be responsible for executing the plan are involved in the development process.
Finalize the project plan: This step is similar to traditional project plan phases. The team leader and the team must finalize issues such as needed resources (personnel, financial), time schedules, budget estimates, and critical success factors. Scheduling meetings and BCP completion dates are two critical items that must be addressed at this point.
Determine the data-collection method: Different tools can be used to gather the data. Strohl Systems BIA Professional and SunGard's Paragon software can automate much of the BCP process. If you choose to use these tools, be sure to add time into your schedule. A learning curve is involved anytime individuals are introduced to software they are not familiar with.
The BIA is the second step of the process. Its role is to describe what impact a disaster would have on critical business functions. The BIA is an important step in the process because it looks at the threats to these functions and the costs of a potential outage. As an example, the BIA might uncover the fact that DoS attacks that result in 2 hours of downtime of the company's VoIP phone system will result in $28,000 in lost revenue, whereas an 8-hour outage to the web server might cost the company only $1,000 in lost revenue. These types of numbers will help the organization determine what needs to be done to ensure the survival of the company. The eight steps in the BIA process are as follows:
Select individuals to interview.
Determine the methods to be used for gathering information.
Develop a customized questionnaire to gather specific monetary and operational impact information. This should include questions that inquire about both quantitative and qualitative losses. The goal is to use this data to help determine how the loss of any one function would affect the organization.
Analyze the compiled data.
Determine the time-critical business processes and functions.
Determine maximum tolerable downtimes for each process and function.
Prioritize the critical business process or function based on its maximum tolerable downtime (MTD).
Document the findings and report your recommendations to management.
MTD is a measurement of the longest time that an organization can survive without a specific business function. MTD estimates include critical (minutes to hours), urgent (24 hours or less), important (up to 72 hours), average (up to 7 days), and nonessential (these services can experience outages up to 30 days).
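As a rough illustration of BIA steps 6 and 7, the following Python sketch sorts business functions by MTD and labels each with one of the categories above. The loss figures come from the VoIP and web server example earlier in this section. The MTD values, the 4-hour cutoff for "critical" (the text says only "minutes to hours"), and every name in the code are assumptions made for illustration.

```python
from dataclasses import dataclass

# MTD categories from the text as (label, upper bound in hours).
# The 4-hour bound for "critical" is an assumption; the text gives no number.
MTD_CATEGORIES = [
    ("critical", 4),
    ("urgent", 24),
    ("important", 72),
    ("average", 7 * 24),
    ("nonessential", 30 * 24),
]


@dataclass
class BusinessFunction:
    name: str
    mtd_hours: float     # maximum tolerable downtime (hypothetical values below)
    outage_hours: float  # length of the outage studied in the BIA
    outage_loss: float   # estimated revenue lost over that outage, in dollars

    @property
    def loss_per_hour(self) -> float:
        return self.outage_loss / self.outage_hours


def classify_mtd(mtd_hours: float) -> str:
    """Return the first category whose upper bound covers the MTD."""
    for label, limit in MTD_CATEGORIES:
        if mtd_hours <= limit:
            return label
    return "nonessential"


def prioritize(functions):
    """Shortest MTD first; ties go to the function that loses more per hour."""
    return sorted(functions, key=lambda f: (f.mtd_hours, -f.loss_per_hour))


functions = [
    BusinessFunction("web server", mtd_hours=72, outage_hours=8, outage_loss=1_000),
    BusinessFunction("VoIP phone system", mtd_hours=2, outage_hours=2, outage_loss=28_000),
]

for f in prioritize(functions):
    print(f"{f.name}: {classify_mtd(f.mtd_hours)}, ${f.loss_per_hour:,.0f} per hour")
```

Normalizing losses to an hourly rate makes outages of different lengths comparable: the 2-hour VoIP outage works out to $14,000 per hour, against $125 per hour for the 8-hour web server outage.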
The impact or loss that an organization faces because of lost service or data can be felt in many ways. These are generally measured by one of the following:
Allowable business interruption: What is the maximum tolerable downtime (MTD) the organization can survive without that function or service?
Financial and operational considerations: What will this outage cost? Will there be a loss of revenue or operational capital, or will we be held personally liable? Cost can be immediate or delayed. Other potential costs include any losses incurred because of failure in meeting the SLA requirements of customers.
Regulatory requirements: What violations of law or regulations could this cause? Is there a legal penalty?
Organizational reputation: Will this affect our competitive advantage, market share, or reputation?
The BIA builds the groundwork for determining how resources should be appropriated for recovery-planning efforts.
A vulnerability assessment is often part of a BIA. Although the assessment is somewhat similar to the risk-assessment process discussed in Chapter 3, “Security-Management Practices,” this one focuses on providing information that is used just for the business continuity plan.
Recovery strategies are the predefined actions that management has approved to be followed in case normal operations are interrupted. Operations can be interrupted in several different ways:
Data interruptions: The focus here is on recovering the data. Solutions to data interruptions include backups, offsite storage, and remote journaling.
Operational interruptions: The interruption is caused by the loss of some type of equipment. Solutions to this type of interruption include hot sites, redundant equipment, Redundant Array of Inexpensive Disks (RAID), and Backup Power Supplies (BPS).
Facility and supply interruptions: Causes of these interruptions can include fire, loss of inventory, transportation problems, heating, ventilation, and air conditioning (HVAC) problems, and telecommunications failures.
Business interruptions: These interruptions can be caused by strikes or by the loss of personnel, critical equipment, supplies, or office space.
To evaluate the losses that could occur from any of these interruptions and determine the best recovery strategy, follow these steps:
Document all costs for each possible alternative.
Obtain cost estimates for any outside services that might be needed.
Develop written agreements with the chosen vendor for such services.
Evaluate what resumption strategies are possible in case there is a complete loss of the facility.
Document your findings and report your chosen recovery strategies to management for feedback and approval.
In this phase, the team prepares and documents a detailed plan for recovery of critical business systems. The plan should be a guide for implementation. The plan should include information on both long-term and short-term goals and objectives:
Identify critical functions and priorities for restoration.
Identify support systems that are needed by critical functions.
Estimate potential disasters and calculate the minimum resources needed to recover from the catastrophe.
Select recovery strategies and determine what vital personnel, systems, and equipment will be needed to accomplish the recovery.
Determine who will manage the restoration and testing process.
Calculate what type of funding and fiscal management is needed to accomplish these goals.
The plan should also detail how the organization will interface with external groups, such as customers, shareholders, the media, the community, and regional and state emergency services groups. The final step of the phase is to combine this information into the BCP and interface it with the organization's other emergency plans.
This final phase of the process is for testing and maintaining the BCP. Training and awareness programs are also developed at this point. Testing the disaster-recovery plan is critical. Without performing a test, there is no way to know whether the plan will work. Testing helps make theoretical plans reality. As a CISSP candidate, you should be aware of the five different types of BCP testing:
Checklist: Although this is not considered a replacement for a real test, it is a good start. A checklist test is performed by sending copies of the plan to different department managers and business unit managers for review. Each person the plan is sent to can review it to make sure nothing was overlooked.
Tabletop: A tabletop test is performed by having the members of the emergency management team and business unit managers meet in a conference to discuss the plan. The plan then is “walked through” line by line. This gives all attendees a chance to see how an actual emergency would be handled and to discover dependencies. By reviewing the plan in this way, some errors or problems should become apparent.
Walkthrough: This is an actual simulation of the real thing. This drill involves members of the response team acting in the same way as if there had been an actual emergency. This test proceeds to the point of recovery or to relocation to the alternative site. The primary purpose of this test is to verify that members of the response team can perform the required duties.
Functional: A functional test is similar to a walkthrough but actually starts operations at the alternative site. Operations of the new and old sites can be run in parallel.
Full interruption: This test is the most detailed, time-consuming, and thorough. A full interruption test mimics a real disaster, and all steps are performed to start up backup operations. It involves all the individuals who would be involved in a real emergency, including internal and external organizations.
The CISSP exam will require you to know the differences between the BCP test types. You should also note the advantages and disadvantages of each.
When the testing process is complete, a few additional items still need to be done. The organization must put controls in place to maintain the current level of business continuity and disaster recovery. This is best accomplished by implementing change-management procedures. If changes are required to the approved plans, you will then have a documented, structured way to accomplish this. A centralized command and control structure eases this burden. Controls also should be built into the procedures to allow for periodic retesting. Life is not static, and neither should the organization's BCP be. The individuals responsible for specific parts of the BCP process are listed in Table 9.1.
Table 9.1. BCP Process Responsibilities
Person or Department | Responsibility |
---|---|
Senior management | Project initiation, ultimate responsibility, overall approval and support |
Midmanagement or business unit managers | Identification and prioritization of critical systems |
BCP committee and team members | Planning, day-to-day management, implementation and testing of the plan |
Functional business units | Plan implementation, incorporation, and testing |
Senior management is ultimately responsible for the BCP. This includes project initiation, overall approval, and support.
The goal of awareness and training is to make sure all employees know what to do in case of an emergency. If employees are untrained, they might simply stop what they're doing and run for the door anytime there's an emergency. Even worse, they might not leave when an alarm has sounded and they have been instructed to leave because of possible danger. Therefore, the organization should design and develop training programs to make sure each employee knows what to do and how to do it. Employees assigned to specific tasks should be trained to carry out needed procedures. Plan for cross-training of teams, if possible, so those team members are familiar with a variety of recovery roles and responsibilities.
Although BCP deals with what is needed to keep the organization running and what functions are most critical, the DRP's purpose is to get a damaged organization restarted so that critical business functions can resume. Because the DRP is more closely related to IT issues, this portion of the chapter also introduces such topics as alternative sites, reciprocal agreements, backups, and electronic vaulting.
Individuals involved in disaster recovery must deal with many things, but when called to action, their activities center on assessing the damage, restoring operations, and determining whether an alternate location will be needed until repairs can be made. These items can be broadly grouped into salvage and recovery. Both activities are discussed here:
Salvage: Restoring functionality to damaged systems, units, or the facility. This includes the following steps:
A damage assessment to determine the extent of the damage
A salvage operation to recover any repairable equipment
Repair and cleaning to eliminate any damage to the facility and restore equipment to a fully functional state
Restoration of the facility so that it is fully restored, stocked, and ready for business
Recovery: Focused on the responsibilities needed to get an alternate site up and running. This site will be used to stand in for the original site until operations can be restored there.
Physical security is always of great importance after a disaster. Measures such as guards, temporary fencing, and barriers should be deployed to prevent looting and vandalism.
When disaster strikes your organization and your DRP team reports that the data center is unusable, that is not the time to start discussions on alternate sites. This discussion should have occurred long ago. Many options are available, from a dedicated offsite facility, to agreements with other organizations for shared space, to the option of building a prefab building and leaving it empty as a type of cold backup site. The following sections look at some of these options.
A reciprocal agreement is a frequently discussed option that requires two organizations to pledge assistance to one another in case of disaster. This would be carried out by sharing space, computer facilities, and technology resources. On paper, this appears to be a cost-effective approach, but it does have its drawbacks. Each party to the agreement must trust the other organization to come to its aid in case of disaster. However, the nonvictim might be hesitant to follow through if such a disaster did occur. There is also the issue of confidentiality because the damaged organization is placed in a vulnerable position and must trust the other party with confidential information. Finally, if the parties to the agreement are near each other, there is always the danger that disaster could strike both parties, thereby rendering the agreement useless.
Because data centers are expensive and critical to the continuation of business, the organization might decide to use a dedicated hot, warm, cold, or mobile site.
Cold site: This is basically an empty room with only rudimentary electrical, power, and computing capability. It might have a raised floor and some racks, but it is nowhere near ready for use. It might take several weeks to a month to get the site operational.
Warm site: Somewhat of an improvement over a cold site, this facility has data equipment and cables, and is partially configured. It could be made operational in anywhere from a few hours to a few days.
Hot site: This facility is ready to go. It is fully configured and is equipped with the same system as the production network. Although it is capable of taking over operations at a moment's notice, it is the most expensive option discussed.
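The trade-off among the three site types can be sketched as a simple lookup: choose the least expensive site that can be operational within the maximum tolerable downtime of the functions it must support. The text gives only ranges for activation time, so the hour values below are assumptions, as are the function and variable names.

```python
# Site types ordered from least to most expensive, with an assumed
# worst-case time (in hours) to become operational. The text gives only
# ranges: "several weeks to a month" (cold), "a few hours to a few days"
# (warm), and "a moment's notice" (hot). The numbers are illustrative.
SITES = [
    ("cold", 30 * 24),
    ("warm", 3 * 24),
    ("hot", 0),
]


def cheapest_site_for(mtd_hours: float) -> str:
    """Return the least expensive site type that is ready within the MTD."""
    for site, activation_hours in SITES:
        if activation_hours <= mtd_hours:
            return site
    raise ValueError("no site type can meet a negative MTD")


for mtd in (2, 72, 45 * 24):
    print(f"MTD of {mtd} hours -> {cheapest_site_for(mtd)} site")
```

Because the list is ordered by cost, the first match is the cheapest acceptable choice. A real decision would also weigh the actual contract costs and the losses identified in the BIA.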
Another option for the organization is to maintain multiple data centers. Each of these sites is capable of handling all operations if another fails. Although there is an increased cost, it gives the company fault tolerance by maintaining multiple redundant sites. If the redundant sites are geographically dispersed, the possibility of more than one being damaged is low. The organization also does not have to depend on a third party or wait for a hot/warm/cold/mobile site to become operational.
Organizations might opt to contract their offsite needs to a service bureau. The advantage of this option is that the responsibility of this service is placed on someone else. The disadvantage is the cost and possible problems with resource contention if a large-scale emergency occurs.
Some other alternatives for backup and redundancy have not been discussed yet. Some organizations use these by themselves or in combination with other services:
Database shadowing: Databases are a high-value asset for most organizations. File-based incremental backups can read only entire database tables and are considered too slow. A database shadowing system writes the data to two physical disks. It creates good redundancy by duplicating the database sets to mirrored servers, which makes it an excellent way to provide fault tolerance.
Electronic vaulting: Electronic vaulting makes a copy of backup data to a backup location. This is a batch-process operation that functions to keep a copy of all current records, transactions, or files at an offsite location.
Remote journaling: Remote journaling is similar to electronic vaulting, except that information is processed in parallel. By performing live data transfers, it allows the alternate site to be fully synchronized and ready to go at all times. It provides a very high level of fault tolerance.
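The difference between batch transfer and live transfer can be shown with a toy model. This is only a sketch: the class names and the batch size of three transactions are hypothetical, and real products move data over a network rather than between in-memory lists.

```python
class AlternateSite:
    """Holds whatever records have reached the offsite location."""

    def __init__(self):
        self.records = []


class PrimarySite:
    def __init__(self, vault, journal, batch_size=3):
        self.records = []
        self.vault = vault            # receives periodic batch copies
        self.journal = journal        # receives each transaction as it happens
        self.batch_size = batch_size  # hypothetical batch interval, in transactions
        self._pending = []

    def commit(self, txn):
        self.records.append(txn)
        # Remote journaling: live transfer, in parallel with the local write.
        self.journal.records.append(txn)
        # Electronic vaulting: transactions wait until a full batch is sent.
        self._pending.append(txn)
        if len(self._pending) >= self.batch_size:
            self.vault.records.extend(self._pending)
            self._pending.clear()


vault, journal = AlternateSite(), AlternateSite()
primary = PrimarySite(vault, journal, batch_size=3)
for n in range(1, 6):
    primary.commit(f"txn-{n}")

# If the primary site failed now, the vault would lack the two newest records.
print("vault:  ", vault.records)
print("journal:", journal.records)
```

After five transactions, the journaled site matches the primary, whereas the vaulted site holds only the first complete batch. That gap is the data at risk between batch runs.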
Equipment is not much good without the software to run on it. Part of a good disaster-recovery plan should consist of ways to back up and restore software. This backup can be stored either on- or offsite. The decision to store offsite is usually made as a type of insurance policy in case the primary site is damaged or destroyed. Software can be vulnerable even when good backup policies are followed because sometimes software vendors go out of business or no longer support needed applications. In these instances, escrow agreements can help.
Escrow agreements are one possible software-protection mechanism. Escrow agreements allow an organization to obtain access to the source code of business-critical software if the software vendor goes bankrupt or otherwise fails to perform as required.
If you are using tape as your backup solution, you must decide what form of tape backup to perform. The choice is a trade-off among backup speed, restore time, and the number of tapes used.
Full: A full backup backs up all files, regardless of whether they have been modified. It clears the archive bit.
Incremental: An incremental backup backs up only those files that have been modified since the previous backup of any sort. Incremental backups are performed after an initial full backup. Each night, the incremental backup copies any files that changed during the previous day. Because the incremental backup clears the archive bit, each night's backup operation can be completed quickly; however, a restoration will require all incremental backup tapes plus the last full backup.
Differential: A differential backup backs up all files that have been modified since the last full backup. It does not clear the archive bit. Differential backups take longer to perform than incremental backups but can be restored more quickly. Restoring from a differential backup means that only two restores will be required: the full backup and the last differential backup.
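The role of the archive bit in the three backup types can be simulated in a few lines of Python. This is a minimal sketch, assuming a dictionary that maps each file name to its archive bit; the file names and the three-day schedule are made up for illustration.

```python
def full_backup(files):
    """Copy every file, then clear every archive bit."""
    copied = sorted(files)
    for name in files:
        files[name] = False
    return copied


def incremental_backup(files):
    """Copy files whose archive bit is set, then clear those bits."""
    copied = sorted(name for name, bit in files.items() if bit)
    for name in copied:
        files[name] = False
    return copied


def differential_backup(files):
    """Copy files whose archive bit is set, leaving the bits set."""
    return sorted(name for name, bit in files.items() if bit)


def run_schedule(nightly_backup):
    """One full backup, then three nights on which one file changes each day."""
    files = {"a.doc": True, "b.xls": True, "c.db": True}
    tapes = [full_backup(files)]
    for changed in ("a.doc", "b.xls", "c.db"):
        files[changed] = True  # modifying a file sets its archive bit
        tapes.append(nightly_backup(files))
    return tapes


incremental_tapes = run_schedule(incremental_backup)
differential_tapes = run_schedule(differential_backup)

print("incremental: ", incremental_tapes[1:])
print("differential:", differential_tapes[1:])

# Tapes needed for a restore after the third night:
print("incremental restore needs", len(incremental_tapes), "tapes")
print("differential restore needs", len([differential_tapes[0], differential_tapes[-1]]), "tapes")
```

Each incremental tape holds only one night's changes, so a restore needs the full backup plus every incremental tape. Each differential tape grows to hold everything changed since the full backup, so a restore needs only the full backup and the latest differential tape.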
It's important to remember that you will want to periodically test your backup tapes. These tapes will be of little use if you find during a disaster that they have malfunctioned and no longer work. Tape-rotation strategies can range from simple to complex.
Simple: A simple tape-rotation scheme uses one tape for every day of the week and then repeats the next week. One tape can be for Monday, one for Tuesday, and so on. You add a set of new tapes each month and then archive the monthly sets. After a predetermined number of months, you put the oldest tapes back into use.
Grandfather-father-son: This scheme (GFS) includes four tapes for weekly backups, one tape for monthly backups, and four tapes for daily backups (assuming you are using a 5-day work week). It is called grandfather-father-son because the scheme establishes a kind of hierarchy. Grandfathers are the one monthly backup, fathers are the four weekly backups, and sons are the four daily backups.
Tower of Hanoi: This tape-rotation scheme is named after a mathematical puzzle. It involves using five sets of tapes, each set labeled A through E. Set A is used every other day; set B is used on the first non-A backup day and is used every 4th day; set C is used on the first non-A or non-B backup day and is used every 8th day; set D is used on the first non-A, non-B, or non-C day and is used every 16th day; and set E alternates with set D.
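The Tower of Hanoi rotation is easier to follow when generated programmatically. One common way to compute it, sketched below, is to count how many times the backup session number divides evenly by 2 and use that count to pick the tape set, capped at the last set. The function name and the choice to number sessions from 1 are assumptions for this example.

```python
def hanoi_set(session: int, sets: str = "ABCDE") -> str:
    """Return the tape set to use for a 1-based backup session number."""
    if session < 1:
        raise ValueError("session numbers start at 1")
    # Isolate the lowest set bit, then count the zeros below it.
    trailing_zeros = (session & -session).bit_length() - 1
    # Sessions that divide by 16 or more all fall to the last set (E).
    return sets[min(trailing_zeros, len(sets) - 1)]


schedule = "".join(hanoi_set(n) for n in range(1, 17))
print(schedule)  # ABACABADABACABAE
```

In the 16-session cycle, set A appears every 2nd session, B every 4th (sessions 2, 6, 10, 14), C every 8th (sessions 4 and 12), D on session 8, and E on session 16. Sets D and E then each recur every 16 sessions, alternating with one another.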
Business Continuity Institute: http://thebci.org/
Tape-backup strategies: www.exabyte.com/support/online/documentation/whitepapers/basicbackup.pdf
Vulnerability assessment information: www.crime-research.org/library/Richard.html
Information on due care and due diligence: www.professorbainbridge.com/2003/11/substantive_due.html
Electronic vaulting: www.disaster-resource.com/articles/electric_vault_rapid_lindeman.shtml
Recovery strategies: www.ncasia.com/ViewArt.cfm?Artid=15255&catid=4&subcat=43