Elements of a DRP

No specific rules identify what must be in a DRP. Sections and elements can be added and removed to meet the company’s needs. However, many elements are commonly included. These are:

  • Purpose
  • Scope
  • Disaster/emergency declaration
  • Communications
  • Emergency response
  • Activities
  • Recovery procedures
  • Critical operations, customer service, and operations recovery
  • Restoration and normalization

Eight Rs of Disaster Recovery (DR) Planning

Some DR experts list the eight Rs of recovery planning, which provide a good overview of the recovery planning process. The eight Rs are:

  • Reason for planning—The scope and purpose sections address the reasons for the DRP, including personnel safety and the CBFs.
  • Recognition—When the disaster is recognized, it is “declared.” Personnel are notified, and a decision is made to activate the plan.
  • Reaction—Management and DR personnel respond to the emergency by assessing damage and deciding which steps to take next.
  • Recovery—DR personnel follow procedures to recover critical systems. If necessary, they activate alternate locations.
  • Restoration—DR personnel restore CBFs to full operation, which can include restoration of facility resources, such as power; connectivity, such as local area network (LAN) and WAN connections; and the systems that support the CBFs.
  • Return to normal—After the disaster has passed, DR personnel return the systems to normal operations, which could include moving from generator power to commercial power and moving functions from the alternate location back to the primary location.
  • Rest and relax—DRP responders need to have time off after the incident and to be thanked for helping the organization survive the disaster.
  • Reevaluate and redocument—Identify the things that went well and the things that can be improved. Document any lessons learned. Review any weaknesses or deficiencies in the plan. Use this data to update the plan.

The DRP should also be tested and maintained. Testing verifies that the procedures are valid. Additionally, the DRP should regularly be reviewed and updated to ensure it stays current. The following sections explore these elements in greater depth.

Purpose

The DRP starts with a simple statement identifying its purpose. DRPs are often written to support an individual function, service, or system. A DRP could be written to restore a single database server to functionality after a disaster. It would include the steps necessary to recover the database server after the failure.

The DRP could also be written to restore several servers that work together to host a service. For example, a website could be supported by several elements, which could include several web servers in a web farm and several database servers in a failover cluster. If a disaster took the website down, the DRP would include steps to either restore or recover all of these elements.

Either way, the purpose of the DRP needs to be defined early in the process and included in the final product. When defining the purpose, the following activities should be considered:

  • Recovery—Immediately after a disruption in services, a system should be recoverable, which would include a complete system rebuild if necessary and recovery of all the data. The RTO defines the maximum length of time this process should take.
  • Sustaining business operations—CBFs need to continue to operate even during a disaster. Several methods can be used to ensure these CBFs can continue, which could include a fully redundant data center, alternate locations, or redundant data sites. The method used should match the needs of the service and the available budget.
  • Normalization—Once the disaster has passed, the systems need to be normalized, which means different things depending on what was done to sustain business operations. For example, if alternate locations were used, normalization includes moving the CBFs back to the original location.

Scope

The scope of any project helps identify the boundaries. It helps all parties understand what is and is not covered. Without an identified scope, well-meaning people can cause the project to grow and expand, which is commonly known as scope creep.

The purpose of the DRP drives the scope. Based on the purpose of the DRP, elements that should be included and those that should not can be identified. Although the included elements may be obvious to some, they would still need to be identified in the DRP.

When developing the scope, the following areas should be considered:

  • Hardware—Hardware includes servers and network devices necessary to support them. Replacement servers and support equipment, such as office equipment or spare parts for the critical servers, should be available on-site or at another location.
  • Software—All software needed to support the CBFs needs to be considered, which includes operating systems and applications. Many organizations use imaging technologies. An image is an exact replica of a computer’s operating system, applications, settings, and other files. IT personnel might capture an image of a generic server every few months. Then, when a system crashes, IT personnel can use this image to quickly restore a server’s operations.
  • Data—Data considerations are essential to include in the scope of the plan. These considerations include a backup plan that identifies backup requirements if data is needed for CBFs. The recovery point objectives (RPOs) identify the amount of data loss that is acceptable, which depends on the value of the data.
  • Connectivity—Connectivity to the service consumers should also be included, which could be connectivity for users, managers, and customers, depending on who the consumers are. Connectivity could be redundant Internet service provider (ISP) links to the Internet or redundant WAN links.

Disaster/Emergency Declaration

When a disaster or emergency occurs or is imminent, the DRP is implemented. Usually, the overall BCP is activated first, and then, based on what the DRP does, the DRP is activated to support the BCP.

As an example, a hurricane is approaching. The BCP coordinator could activate the BCP when the hurricane is 96 hours out. The DRP might specify that, when the hurricane is 48 hours out, a recovery team deploys to an alternate location to prepare systems to take over operations. In this example, the DRP is not activated immediately with the BCP. The BCP might specify other actions to take immediately, but they are separate from this DRP. Instead, this DRP is activated when the hurricane is within 48 hours of striking.

The point is that the DRP should clearly state what causes it to be activated. Activation could result in the recall of personnel and the movement of equipment. When the time comes to take these steps, then do so. However, taking these steps before they are necessary can result in spending money needlessly.

Consider the hurricane again. Hurricanes don’t always travel in a straight line. A hurricane that is 96 hours away from striking can easily turn or weaken. Instead of hurricane-force winds, the location might just get some rain, in which case, the DRP shouldn’t be activated.
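The activation rule in the hurricane example can be written down as an explicit check. The following is a minimal Python sketch under stated assumptions, not a prescribed implementation: the function name and constant are hypothetical, and the 48-hour threshold comes only from the example above.

```python
# Hypothetical threshold taken from the hurricane example in the text.
DRP_ACTIVATION_HOURS = 48


def drp_should_activate(hours_to_landfall: float, still_a_threat: bool = True) -> bool:
    """Return True when the DRP's stated activation condition is met."""
    if not still_a_threat:
        # The storm turned or weakened, so activating would waste money.
        return False
    return hours_to_landfall <= DRP_ACTIVATION_HOURS


print(drp_should_activate(96))                        # False: only the BCP is active
print(drp_should_activate(48))                        # True
print(drp_should_activate(24, still_a_threat=False))  # False
```

Writing the trigger this plainly in the plan leaves little room for debate about when to recall personnel and move equipment.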

Communications

Several communications elements are important to the success of a DRP. These include:

  • Recall—The DRP should identify all personnel who should be notified when the DRP is activated. They include any personnel who have any responsibilities within the DRP and senior management personnel. Phone trees are often included as part of the BCP and can be used for this purpose.
  • Users—Users may need to be notified if the DRP affects them. For example, critical business operations may not include some routine functions that users expect. Users should be told which services are unavailable due to the disaster. This notification can be done before the disaster. Later, the users can be sent a reminder about these services.
  • Customers—If the disruption affects customers, they should be notified. For example, an online website may be moved to an alternate location, causing it to be down for a short period. The website could post a single page indicating that the DRP is being implemented in response to a disaster and that the website will be operational again within a specific time. Customers will understand and appreciate this. In contrast, if the website is unreachable and displays an error message, it indicates the organization was unprepared for the disaster.
  • Communications plan—DRPs often include both primary and alternate communications plans to ensure that personnel are able to communicate during an outage. The plans may be IT based, such as email or instant messaging, or cell phones or walkie-talkies could be used. They could even include specific meeting times in a central location set up as a “war room” instead of using electronic communications.

Emergency Response

The DRP could include an emergency response element that is used for short-fused disasters. For example, an earthquake will strike without warning, and tornadoes strike with very little warning. Similarly, a fire could result in a disaster that requires an emergency response.

Emergency response steps could include:

  • Recall and notification of personnel
  • Damage assessment
  • Plan activation
  • Implementation of specific steps and procedures

Depending on the location of the organization, the DRP could be written to address specific disasters. For example, many organizations on the East Coast of the United States have DRPs that cover hurricanes. The DRP would include many preparation steps, but the emergency response steps wouldn’t appear until much later in the plan, after the hurricane strikes.

Activities

The emergency response section identifies several emergency response steps to take, such as recalling personnel and assessing the damage. The DRP also identifies other activities to take in response to the disaster.

A primary activity of any DRP is ensuring that personnel safety has been addressed. Ensuring personnel safety and the protection of life should always be at the forefront of any DRP. In other words, the clear message should be people first, things next.

Activities defined in the DRP depend on the purpose and scope of the DRP. If the DRP addresses the recovery of a single system, the activities are limited and focused. If it addresses a large, complex system, many more activities will be required.

If advance warning of the disaster is given, activities may include preparing the environment. This is possible with weather-related events, such as hurricanes and other serious storms. However, many other disasters don’t provide any warning.

When alternate locations are used, a primary activity is preparing the alternate location. A cold site requires the most work. All the required equipment will need to be moved to the alternate location, set up, and configured, which will result in a flurry of activity to prepare. The activity section for a cold site will be quite extensive.

On the other hand, activities required to set up a hot site will be minimal. Personnel may be designated in a flyaway team to go to the hot site to take over operations. How and when operations will cut over to the hot site will need to be identified.

A warm site is a compromise between a hot site and a cold site. The activities to get the warm site up and operational depend on how much equipment and data are normally staged at the alternate location. The activity section can be quite extensive if the warm site is more of a cold site than a hot site. Mobile sites can also be expensive, although cheaper than a hot site. Equipment needs to be transported, and recovery time is usually a few days longer than for a hot site.

Recovery Procedures

The recovery steps and procedures describe all the specific actions required to recover systems or functions. This section often includes multiple procedures. For example, each critical function could have a separate procedure. Different personnel will be recovering separate systems, so the procedures could all be implemented at the same time.

Recovery procedures should also consider contingencies. For example, if certain recovery steps don’t work, the procedures should provide guidance for the recovery personnel. Additionally, procedures should address dependencies. If a server requires specific access to the Internet or another server, the procedure should state these requirements.

Recovering Systems

Recovery procedures identify steps for rebuilding and recovering a system after a disaster. They often include steps for recovering a system from scratch, which includes installing the operating system and all applications. If data is needed, the plan specifies how to restore the data.

For example, a database server could be running an Oracle database on a Microsoft Windows Server operating system. The recovery procedure would start with instructions for installing the Windows operating system. After installation, the procedure would describe how to install Oracle, and, last, the procedure would describe how to restore the database.

The recovery plan should be clear as to which steps must be completed before moving on to the next step. In other words, data shouldn’t be restored to a server until after the operating system and application have been successfully installed.
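One way to make step ordering explicit is to record each step's prerequisites and derive the order from them. The sketch below is illustrative only: it uses Python's standard `graphlib` module (Python 3.9 or later), and the step names are hypothetical labels for the Oracle-on-Windows example above.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps that must be completed first.
# Step names are illustrative labels, not part of any real procedure.
prerequisites = {
    "install Windows Server": set(),
    "install Oracle": {"install Windows Server"},
    "restore database": {"install Oracle"},
}


def recovery_order(prereqs):
    """Return the steps in an order that respects every prerequisite."""
    return list(TopologicalSorter(prereqs).static_order())


print(recovery_order(prerequisites))
# ['install Windows Server', 'install Oracle', 'restore database']
```

If two steps ever depend on each other, `TopologicalSorter` raises a `CycleError`, which flags a procedure that cannot be followed as written.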

TIP

Recovery operations begin after activating the DRP and assessing the damage. Recovery focuses on measures necessary to restore IT capabilities and repair damage. The goal is to restore the mission-critical capabilities at either the original location or an alternate location.

Capturing an image of a server hosting Oracle is possible. If the server crashes, the image can be installed on a system. It will include the operating system, the fully configured Oracle application, and the data that was on the system when the image was captured. IT personnel will have to update the data from a recent backup. Restoring the image is much quicker than reinstalling everything from scratch.

A DRP could include specific recovery procedures for several servers and services. Separate written documents can be created for each procedure. The DRP can reference these procedures as separate appendixes.

These procedures are one of the most important elements of the DRP to test. Although many of the steps in the DRP may be generic in nature, these recovery procedures are often very technical.

TIP

If servers have been imaged, the image may need to be recaptured periodically. For example, if the operating system or application has been updated or modified, the original image won’t have the changes. Either the image must be recaptured after a change has been made or the changes must be verified to have been reapplied after the image is restored on a server.
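The recapture rule in this tip amounts to comparing dates. As a rough sketch, assuming the organization records when each image was captured and when each operating system or application change was made (the function name and sample dates are hypothetical):

```python
from datetime import date


def image_is_stale(captured_on: date, change_dates: list[date]) -> bool:
    """True if any OS or application change was made after the image was captured."""
    return any(changed > captured_on for changed in change_dates)


# Hypothetical dates: the image predates the most recent change.
captured = date(2024, 1, 15)
changes = [date(2023, 12, 1), date(2024, 3, 2)]
print(image_is_stale(captured, changes))  # True
```

A stale result means either recapturing the image or verifying that the later changes are reapplied after a restore.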

During a disaster, the best administrators may not be available. Instead, junior technicians might be recovering systems. With this in mind, be sure that the procedural steps are clear and easy to follow.

Backup Plans

If data needs to be restored, an effective backup plan must be in place. One of the first steps in developing a backup plan is to identify critical data. Critical data is data that supports CBFs. Such data can be large databases or any other types of files that are critical.

The backup plan identifies several elements including:

  • The data to back up
  • Backup procedures for data
  • Length of time to keep the data
  • Types of backups, such as regular, electronic vaulting, or remote journaling
  • Off-site storage location, including how to retrieve a backup during a disaster
  • Testing of restore procedures and schedules
  • Disaster restore procedures

The RPOs identify the amount of data loss that is acceptable for any data. Therefore, the RPO is considered in the backup plan. If the RPO is a short period of time, such as minutes instead of hours or days, backups must be performed more frequently. If the RPO is a longer time, backups can be scheduled less often.

For example, with a high-volume database, the RPO could be 10 minutes, indicating that no more than 10 minutes of data loss is acceptable. The transaction log for the database can be backed up every 10 minutes to ensure that the last 10 minutes of data can be restored. This transaction log backup is restored after the other database backups are restored.
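The relationship between the RPO and the backup schedule can be checked with simple arithmetic. The sketch below is a simplification that ignores how long a backup itself takes to run. The function names and backup labels are hypothetical.

```python
from datetime import datetime, timedelta


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Simplified check: in the worst case, everything written since the
    last backup is lost, so the interval must not exceed the RPO."""
    return backup_interval <= rpo


def restore_order(full_backup: str, log_backups: list[tuple[datetime, str]]) -> list[str]:
    """Restore the full backup first, then the log backups oldest to newest."""
    return [full_backup] + [name for _, name in sorted(log_backups)]


rpo = timedelta(minutes=10)
print(meets_rpo(timedelta(minutes=10), rpo))  # True
print(meets_rpo(timedelta(hours=1), rpo))     # False
```

With a 10-minute RPO, an hourly transaction log backup would fail this check, because up to an hour of data could be lost.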

Mission-Critical Operations

A DRP addresses mission-critical operations. CBFs support these operations, and specific servers and services support the CBFs. Because a CBF is any function that is considered vital to the organization, if the organization loses the ability to perform the CBF, it loses the ability to perform mission-critical operations. By addressing CBFs, the DRP helps ensure that the critical servers and services continue.

TIP

A DRP ensures that backup plans exist for critical data. However, an organization may have backup plans that protect other data in the organization.

As an example, FIGURE 14-5 shows a web farm connected to a back-end database. The web servers in the web farm host an application that sells products online. In this example, the mission-critical operation is sales of products.

A network diagram of a web farm with a database in the back-end.

FIGURE 14-5 Web farm with back-end database.

Several CBFs support the sale of products. First, the web farm hosts the web application. One CBF serves webpages to clients. Users access the website, and one of the servers in the web farm sends webpages to the clients. Additionally, the back-end database server hosts databases, including the product database. Web servers query the database server and populate webpages with product data.

Once a customer decides to buy, an additional CBF comes into play. Existing customer data is retrieved from the customer database. The database also stores new customer data after a sale. Once a customer purchases the product, another CBF handles payment processes. Another CBF ensures that the product is shipped. All of these CBFs support the mission-critical operation of product sales.

FIGURE 14-5 shows that several servers are needed to support some of these CBFs. Specifically, both the web servers in the web farm and the back-end database server are needed. The DRP would ensure that these servers are included.

Critical Operations, Customer Service, and Operations Recovery

The DRP identifies mission-critical operations and CBFs to support. However, specifying steps for other elements of the business is often important.

For example, will customer service activities be stopped when a disaster occurs? Alternatively, will customer service activities move? If customer service is provided via phone, the functions may be easily switched to another location. On the other hand, if the organization does little customer service, it may not consider customer service critical enough to recover during the disaster. Customers could be provided a simple notification. For example, a notification could be posted on the organization’s webpage or a short message recorded on the phone system.

Similarly, the company may have other operations that need to be recovered, which may not necessarily be considered a part of any CBF, but management may still consider them important enough to recover. For example, some personnel may be working on critical research projects. Although the research isn’t critical for current cash flow, management may want to ensure it can continue to operate. In this situation, the systems and services for research will need to be recovered.

This section can also provide another look at normal operations. While preparing the DRP, what is considered a CBF and what is not should be reviewed. Some operations may appear critical when looking at daily activities, yet they were initially omitted from the DRP. These operations should be added.

Restoration and Normalization

Once the disaster has passed, personnel shift their focus to restoration and normalization. Some DRPs refer to this as the reconstitution phase. The BCP coordinator is typically the authority who announces when normalization begins. It might start after damage at the original location has been repaired. Management might also decide to start normalization earlier, based on other considerations.

Personnel return all mission-critical and non–mission-critical functions to normal during this phase. However, they are not all done at the same time. The DRP specifies the order in which they occur.

Unforeseen problems can be expected during the normalization phase. Because of these problems, normalizing the least critical functions first is important, especially if functions were moved to another location. Doing this ensures that the most critical functions aren’t interrupted as problems arise.
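The ordering rule can be expressed as a sort on criticality. In this small sketch, the function names and ranks are illustrative assumptions (rank 1 is the most critical); a real DRP would list its own functions and order.

```python
# (function name, criticality rank); rank 1 is the most critical.
# Names and ranks are illustrative assumptions.
functions = [
    ("payment processing", 1),
    ("product webpages", 2),
    ("customer service", 3),
    ("research systems", 4),
]


def normalization_order(funcs):
    """Least critical first, so early problems do not hit critical functions."""
    return [name for name, _rank in sorted(funcs, key=lambda f: f[1], reverse=True)]


print(normalization_order(functions))
# ['research systems', 'customer service', 'product webpages', 'payment processing']
```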

In some situations, the DRP might require concurrent processing. For example, as critical functions are normalized in the primary location, the recovered systems are kept operating in the alternate location. If problems affect the primary location, the load can easily be shifted back to the alternate location.

Testing

Testing DRPs is important to ensure they perform as expected. Because the DRP is written to restore CBFs, testing of the DRP should not affect operations of these CBFs. The goal of the tests is to identify any problems or omissions in the DRP.

Like a BCP, different testing methods can be used to test a DRP. The following are common testing methods:

  • Desktop exercise—In a desktop exercise, participants meet in a conference room setting. Participants talk through the steps of the DRP. The desktop exercise is similar to a tabletop exercise used in a BCP.
  • Simulation—A simulation goes through the steps and procedures in a controlled manner. The goal is to ensure that the DRP can be completed in the order presented. Simulations may test portions of the DRP without testing them all. For example, only the data at an alternate location could be restored to ensure this procedure works.
  • Full-blown DRP test—The full-blown DRP test goes through all the steps and procedures as if an actual disaster were occurring and helps to determine the actual time required to complete each step and procedure. The full-blown DRP test has the most potential to disrupt operations. Therefore, a full-blown test should be planned so that it has a minimum effect on operations.

The results of all tests should be thoroughly documented and include any lessons learned, mistakes, or weaknesses uncovered during testing. This documentation can then be used to improve the DRP. Updating the DRP is commonly done if testing identifies any deficiencies.

One of the benefits of testing is that it will give an accurate time frame for recovery. For example, with a database server, one administrator may think a technician can rebuild and restore it in 30 minutes, and another administrator may estimate it will take as long as four hours. Actually rebuilding and restoring the server will give an accurate time of how long it takes. A checklist is helpful for tracking the time frame for individual steps. The checklist may look similar to TABLE 14-2.

TABLE 14-2 Recovery Times Checklist

STEP                                          START TIME    END TIME
Locate server and install operating system    __________    __________
Install applications on server                __________    __________
Locate backup tape and restore data           __________    __________
Notify DRP coordinator of completion          __________    __________
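
Once a test has filled in a checklist like TABLE 14-2, the recorded times can be totaled and compared with the RTO. The following sketch assumes times are recorded as HH:MM on a single day. The sample times and the four-hour RTO are invented for illustration and are not measured results.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%H:%M"  # assumes all steps fall on the same day


def elapsed(start: str, end: str) -> timedelta:
    """Duration of one checklist step."""
    return datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)


def total_recovery_time(checklist) -> timedelta:
    """Sum the duration of every step in the checklist."""
    return sum((elapsed(start, end) for _, start, end in checklist), timedelta())


# Hypothetical times filled in during a test.
checklist = [
    ("Locate server and install operating system", "09:00", "10:15"),
    ("Install applications on server", "10:15", "11:00"),
    ("Locate backup tape and restore data", "11:00", "12:30"),
    ("Notify DRP coordinator of completion", "12:30", "12:35"),
]
rto = timedelta(hours=4)  # hypothetical RTO

total = total_recovery_time(checklist)
print(total)          # 3:35:00
print(total <= rto)   # True
```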

Maintenance and DRP Update

The DRP needs to be regularly reviewed and updated. Doing so ensures that it will be ready when needed. IT systems are regularly updated and upgraded, and any of these changes could affect the usability of the DRP.

Most organizations have change management processes in place. These processes ensure that changes to systems are reviewed before the change occurs and that changes are documented. DRP developers should be involved in this process to ensure that they are aware of system changes. When a change is proposed, the DRP developer should review the change to determine whether it affects the DRP.

The DRP review should include the following elements:

  • Systems—Check whether the systems covered by the DRP have changed since the last review. Look for any significant changes that may affect how the systems are recovered. Even smaller changes should be investigated to determine whether the DRP is affected.
  • Critical business functions—Verify that the DRP covers the CBFs and that priorities have not changed. An organization can change, resulting in some CBFs becoming more important than others.
  • Alternate sites—If the DRP requires alternate sites, ensure that the designated sites still support the DRP and that changes to the alternate sites don’t adversely affect the DRP. If possible, these alternate sites should be visited while the DRP is being reviewed to determine whether they still meet the needs of the company.
  • Contacts—Contact information must be accurate. This includes details for management personnel who need to be notified and the recall information used in the phone trees.

Just as with other documents, tracking changes to the DRP is important. The DRP should include a change page or version control page, which identifies the change, when the change was made, and who made the change.
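
A change page needs only three facts per entry: what changed, when, and who changed it. One possible way to keep and print such a log is sketched below. The class, field names, and sample entries are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ChangeEntry:
    version: str
    changed_on: date
    changed_by: str
    description: str


def render_change_page(entries) -> str:
    """Render the change log as plain text, oldest change first."""
    lines = [f"{'VERSION':<8} {'DATE':<11} {'CHANGED BY':<15} CHANGE"]
    for e in sorted(entries, key=lambda entry: entry.changed_on):
        lines.append(
            f"{e.version:<8} {e.changed_on.isoformat():<11} {e.changed_by:<15} {e.description}"
        )
    return "\n".join(lines)


# Hypothetical entries for illustration only.
entries = [
    ChangeEntry("1.1", date(2024, 6, 3), "J. Rivera", "Updated alternate site contacts"),
    ChangeEntry("1.0", date(2024, 1, 10), "A. Chen", "Initial release"),
]
print(render_change_page(entries))
```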

NOTE

A phone tree is a method to facilitate calling a large group of people. The phone tree is a contact list shaped like a pyramid. To start, the person at the top calls a few people, who then each call a few people assigned to them. This process continues until all contacts have been notified.
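
The pyramid shape of a phone tree can be generated from an ordered contact list. This is a minimal sketch assuming unique names, a fixed number of calls per person (`fan_out`, which must be at least 1), and a list ordered from the top of the tree down. The function name and the contact names are hypothetical.

```python
from collections import deque


def build_phone_tree(contacts, fan_out=3):
    """Map each person to the people they call, filling the pyramid level by level."""
    tree = {name: [] for name in contacts}
    if not contacts:
        return tree
    callers = deque([contacts[0]])  # the person at the top starts the calls
    for person in contacts[1:]:
        # Move to the next caller once the current one has a full list.
        while len(tree[callers[0]]) >= fan_out:
            callers.popleft()
        tree[callers[0]].append(person)
        callers.append(person)
    return tree


names = ["Ana", "Ben", "Cy", "Dee", "Eli", "Fay", "Gus"]
print(build_phone_tree(names, fan_out=2))
# {'Ana': ['Ben', 'Cy'], 'Ben': ['Dee', 'Eli'], 'Cy': ['Fay', 'Gus'],
#  'Dee': [], 'Eli': [], 'Fay': [], 'Gus': []}
```

With a fan-out of 2, seven people are reached in two rounds of calls, and no one has to make more than two calls.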
