Chapter 7. Disaster Recovery

The terms disaster recovery (DR) and business continuity planning (BCP) are often confused and treated as interchangeable. They are, however, two different, but related, terms.

Business Continuity pertains to the overall continuation of the business via a number of contingencies and alternative plans. These plans can be executed based on the current situation and the business's tolerance for outages and other disruptions. Disaster Recovery is the set of processes and procedures used to reach the objectives of the Business Continuity Plan.

BCP normally extends to the entire business, not just IT, including such areas as secondary offices and alternate banking systems, power, and utilities. DR is often more IT focused and looks at technologies such as backups and hot standbys.

Why are we talking about DR and BCP in a security book? The CIA triad (confidentiality, integrity, and availability) is considered key to nearly all aspects of Information Security, and BCP and DR are focused very heavily on Availability, while maintaining Confidentiality and Integrity. For this reason, Information Security departments are often very involved in the BCP and DR planning stages.

In this chapter we will discuss setting our objective criteria, strategies for achieving those objectives, and testing, recovery, and security considerations.

Setting Objectives

Objectives ensure that your DR strategy measurably meets business requirements, and they make it easier to weigh time and budget considerations against uptime and recovery targets.

Recovery Point Objective

The recovery point objective (RPO) is the point in time that you wish to recover to: whether you need to be able to recover data right up until seconds before the disaster strikes, or whether the night before is acceptable, or the week before, for example. This does not take into account how long it takes to make this recovery, only the point in time from which you will be resuming once recovery has been made. There is a tendency to jump straight to seconds before the incident; however, the shorter the RPO, the higher the cost and complexity will invariably be.

Recovery Time Objective

The recovery time objective (RTO) is how long it takes to recover, irrespective of the RPO. That is, after the disaster, how long until you have recovered to the point determined by the RPO.

To illustrate with an example, if you operate a server that hosts your brochureware website, the primary goal is probably going to be rapidly returning the server to operational use. Content that is a day old is not nearly as much of a problem as it would be for a system holding financial transactions, where the availability of recent transactions is important. In this case an outage of an hour may be tolerable, with data no older than one day once recovered.

In this example, the RPO would be one day and the RTO would be one hour.
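To make the arithmetic concrete, here is a minimal sketch (the timestamps and targets are purely hypothetical) that checks whether a recovery met an agreed RPO and RTO:

    from datetime import datetime, timedelta

    # Hypothetical targets agreed with the business.
    RPO = timedelta(days=1)   # recovered data may be at most one day old
    RTO = timedelta(hours=1)  # service must be restored within one hour

    # Hypothetical timeline of a disaster and the recovery from it.
    last_backup = datetime(2023, 6, 1, 23, 0)        # point in time the restored data represents
    disaster = datetime(2023, 6, 2, 9, 30)           # moment the disaster struck
    service_restored = datetime(2023, 6, 2, 10, 15)  # moment service was available again

    data_loss_window = disaster - last_backup
    downtime = service_restored - disaster

    print(f"Data loss window: {data_loss_window} (RPO met: {data_loss_window <= RPO})")
    print(f"Downtime: {downtime} (RTO met: {downtime <= RTO})")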

There is often a temptation for someone from a technology department to set these times; however, they should be driven by the business owners of the systems. This is for multiple reasons:

  • It is often hard to justify the cost of DR solutions. Allowing the business to set requirements, and potentially reset requirements if costs are too high, not only enables informed decisions regarding targets, but also reduces the chances of unrealistic expectations on recovery times.

  • IT people may understand the technologies involved, but do not always have the correct perspective to make a determination as to what the business’s priorities are in such a situation.

  • The involvement of the business in the DR and BCP plans eases the process of discussing budget and expectations for these solutions.

Recovery Strategies

A number of different strategies can be deployed in order to meet your organization’s DR needs. Which is most appropriate will depend on the defined RTO and RPO and, as ever, on cost.

Backups

The most obvious strategy for recovering from a disaster is to take regular backups of all systems and to restore those backups to new equipment. The new equipment should be held at a dedicated disaster recovery facility or secondary office, located somewhere where the appropriate connectivity is available and the servers can begin operating right away.

Historically, backups were often made to a tape-based medium such as DLT, with the tapes physically shipped to another location. However, in recent times the cost of storage and network connectivity has come down, so backups can often be made to a more readily available and reliable medium, such as an archive file on a remote hard disk.

Backups will generally have a longer RPO than other strategies, as backups tend not to be continuous but are rather a batch job run overnight, and not necessarily every night. The RPO will be, at best, the time of the most recent backup. Additionally, backups frequently fail, so the RPO is in reality the time of your most recent working backup.

The RTO will vary depending on the speed of the backup media and the location of the backup media in relation to backup equipment. For example, if the backup media needs to be physically shipped to a location, this must be factored in.
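As a minimal sketch of the "archive file on a remote disk" approach, the following nightly job creates a compressed archive and ships it off-site with rsync. The paths, remote host, and schedule are assumptions for illustration; a real backup job would also need retention, monitoring, and regular restore testing.

    import subprocess
    import tarfile
    from datetime import datetime
    from pathlib import Path

    SOURCE = Path("/var/www")                        # hypothetical data to protect
    STAGING = Path("/var/backups")                   # local staging area for archives
    REMOTE = "backup@dr-site.example.com:/backups/"  # hypothetical DR storage location

    def nightly_backup() -> Path:
        """Create a timestamped archive and copy it off-site."""
        STAGING.mkdir(parents=True, exist_ok=True)
        archive = STAGING / f"www-{datetime.now():%Y%m%d-%H%M}.tar.gz"
        with tarfile.open(archive, "w:gz") as tar:
            tar.add(SOURCE, arcname=SOURCE.name)
        # Ship the archive to the secondary site. Failures must be alerted on:
        # a silently failed backup quietly lengthens your real RPO.
        subprocess.run(["rsync", "-a", str(archive), REMOTE], check=True)
        return archive

    if __name__ == "__main__":
        print(f"Backup written and shipped: {nightly_backup()}")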

Warm Standby

A warm standby is a secondary infrastructure, ideally identical to the primary, which is kept in approximate synchronization with primary infrastructure. This infrastructure should be kept a reasonable geographic distance from the primary in case of events such as earthquakes and flooding. In the event of a disaster, services would be manually “cut over” to the secondary infrastructure. The method of doing so varies, but is often repointing DNS entries from primary to secondary or altering routing tables to send traffic to the secondary infrastructure.

The secondary infrastructure is kept in sync via a combination of ensuring that configuration changes and patches are applied to both primary and secondary, and automated processes to keep files synchronized. Ideally the configuration and patching would happen in an automated fashion using management software; however, this is often not the case, and any differences between the two environments can cause problems when a failover occurs.

The RPO is fairly short on a warm standby, typically matching the frequency of the filesystem synchronization process.

The RTO is however long the cut-over mechanism takes. For example, with a DNS change this is the amount of time to make the change, plus the time for old records to expire from caches so that hosts use the new system. With a routing change, the RTO is at least however long the routing change takes to make and, if dynamic routing protocols are in use, for the routing tables to converge.
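To put the DNS cut-over timing in concrete terms, the following back-of-the-envelope sketch (all figures are assumptions) estimates a worst-case RTO as the time to detect and authorize the failover, plus the time to change the record, plus the record's TTL, since cached answers may continue to be served until the TTL expires:

    # All figures are assumed for illustration; substitute your own.
    detection_and_decision_s = 15 * 60  # detect the disaster and authorize the failover
    dns_change_s = 5 * 60               # update the record at the DNS provider
    record_ttl_s = 3600                 # TTL on the record being repointed

    worst_case_rto_s = detection_and_decision_s + dns_change_s + record_ttl_s
    print(f"Worst-case RTO for a DNS cut-over: ~{worst_case_rto_s / 60:.0f} minutes")
    # Lowering the TTL ahead of time is the usual lever for shortening this,
    # at the cost of more frequent DNS lookups during normal operation.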

However, this system does rely on having an entire second infrastructure that is effectively doing nothing until such time as there is a disaster.

High Availability

A high-availability system is typically modeled as a distributed cluster: multiple devices in distributed locations that share the load during normal production periods. During a disaster, one or more devices will be dropped from the pool and the remaining devices will continue operation as normal, additionally absorbing their share of the load from the devices that are no longer operational.

Due to the nature of high availability, it is typical that all devices in the cluster will be fully synchronized, or very close to it, and for this reason the RPO will be very short.

Many clustering technologies allow for devices to drop out of the cluster and the other devices will automatically adjust and compensate. For this reason, the RTO can also be lower than many other solutions.

Although the RPO and RTO are both advantageous when using a high-availability system, it is not without cost. The cluster needs enough capacity for the remaining nodes to handle the additional load should a node be lost. This means running hardware that is not fully utilized during normal operation in order to have spare capacity available in the event of a disaster. Additional investment in areas such as intersite bandwidth will also be required: keeping all devices synchronized in a clustered solution requires sufficient bandwidth at a low enough latency, which places additional requirements on the infrastructure.
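One way to reason about that spare capacity is to work out the maximum utilization each node can run at if the survivors must absorb the load of failed peers. A rough sketch, with assumed cluster sizes, follows:

    def max_safe_utilization(total_nodes: int, tolerated_failures: int) -> float:
        """Highest average per-node utilization at which the surviving nodes
        can still absorb the load of the failed ones without exceeding 100%."""
        surviving = total_nodes - tolerated_failures
        if surviving <= 0:
            raise ValueError("cannot tolerate the loss of every node")
        return surviving / total_nodes

    # Example: a four-node cluster that must survive the loss of one node.
    print(f"{max_safe_utilization(4, 1):.0%}")  # 75%: a quarter of the capacity is headroom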

Alternate System

In some cases, using an alternate system is preferable to running a backup or secondary system in the traditional sense. For example, if an internal Voice over IP solution is rendered unavailable by a disaster, the plan may not be to try to re-instantiate the VoIP infrastructure, but simply to switch to using cellular phones until the disaster is over.

This strategy does not always have an RPO per se, as recovery of the existing system is not part of the plan. This is why this type of approach is typically only taken with systems that do not hold data, but provide a service, such as telephones. There is, however, a measurable RTO in terms of the amount of time taken to switch over to using an alternate system.

System Function Reassignment

An approach that can prove to be cost effective is System Function Reassignment, which is a hybrid of other solutions. This is the repurposing of noncritical systems to replace critical systems in the event of a disaster situation. It is not applicable to all environments, and so should be considered carefully before being used as a strategy.

For example, if you already run two datacenters, structure your environments so that for any production environment housed in one datacenter, its test, pre-production, or QA environment is housed in the other. In this scenario you have a near-production site that is ready, but not idle, at all times. In the event of a disaster, the environment in question ceases to operate as, for example, pre-production and is promoted to a production environment.

This approach requires that the two environments be separated enough that a disaster affecting one will not affect the other. The state of the non-production environments should be tightly controlled so that any differences from production are known and can easily be changed to match the production state prior to going live.

Dependencies

An important part of developing a strategy for DR and BCP is to understand the dependencies of all of the systems. For example, being able to successfully bring up a fileserver in another location is of little use if staff cannot connect to it. Servers typically need a network connection, the associated routing, DNS entries, and access to authentication services such as Active Directory or LDAP. Failure to determine the dependencies required for any particular system may lead to missing the RTO for that service.

For example, if you have an email server with an RTO of 1 hour, and yet the network on which it depends has an RTO of 3 hours, irrespective of how quickly the email server is up and running, it may not be resuming operation in any meaningful sense until 3 hours have elapsed.

By mapping out dependencies such as this, it is much easier to identify unrealistic RTOs, or RTOs of other systems or services that need to be improved to meet these targets. Walking through tabletops and drills as mentioned in Chapter 1 will assist in discovering these dependencies.
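One simple way to surface such mismatches is to compute each service's effective RTO as the larger of its own RTO and the effective RTOs of everything it depends on. The following sketch uses a hypothetical inventory of services and dependencies (and assumes no circular dependencies):

    # Hypothetical stated RTOs (in hours) and dependency map; substitute your own inventory.
    rto = {"email": 1, "network": 3, "dns": 2, "auth": 2}
    depends_on = {"email": ["network", "dns", "auth"], "dns": ["network"], "auth": ["network"]}

    def effective_rto(service: str) -> float:
        """A service is only meaningfully recovered once all of its dependencies are."""
        deps = depends_on.get(service, [])
        return max([rto[service]] + [effective_rto(d) for d in deps])

    for service in rto:
        print(f"{service}: stated RTO {rto[service]}h, effective RTO {effective_rto(service)}h")
    # email: stated RTO 1h, effective RTO 3h -- the one-hour target is unrealistic as written.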

Scenarios

When developing potential disaster plans, it is often useful to walk through a few high-level scenarios and understand how they impact your proposed plan. This exercise normally works most effectively with representatives from other IT teams who can assist with discussing the implications and dependencies of various decisions.

A few broad categories of scenarios are useful to consider, although which ones you choose to use will probably be dependent upon your own circumstances:

  • Hardware failure of mission-critical platform: something that is isolated to a single platform, but that is significant enough to cause a DR incident—for example, the failure of server hardware for the production environment of a key system.

  • Loss of a datacenter, potentially temporarily such as during a power outage, or perhaps for more prolonged periods such as a fire or earthquake.

  • Pandemic: in the event of a pandemic, services may remain available, but physical access may not be possible, which in turn could prevent certain processes from taking place, such as physically changing backup tapes, or could cause users working from home to place extra load on VPN or other remote access services.

Invoking a Fail Over...and Back

It is all very well having a set of contingency plans in place and target times by which to achieve them, but if you do not know when you are in a disaster situation, the plans are of little use. There should be a process in place to determine what is and is not a disaster, and when to invoke the plan.

There may be a few key, high-level scenarios in which the plan would obviously be put into action; the datacenter being on fire, for example, is typically enough to invoke failing over to backup systems. However, care should be taken not to be too prescriptive, or minor deviations from the situations outlined may cause a failure to invoke the plan. Similarly, not being prescriptive enough could cause an inexperienced administrator to invoke a DR plan needlessly. In that case, how do you determine when to invoke the plan? One of the most effective routes is to have a list of named individuals or roles who are authorized to determine when the organization is in a disaster situation and the plan needs to be executed. The process for anyone who is not authorized to make this determination is to escalate to someone who can, who in turn will make the decision. This way the alarm can be raised by anyone, but the ultimate decision to execute is left to someone suitably senior and responsible.

One often overlooked area of DR and BCP is that as well as failing over to contingency systems, there will need to be a process for switching back again after the disaster has ended. Unlike the initial failover, there is the advantage of being able to schedule the switch and take an appropriate amount of time to do so. Nevertheless, this should be a carefully planned and executed process that is, once again, invoked by an authorized person. Always remember to communicate properly during potential outages, as these can be high-stress times; no outage is ever so large that proper communication cannot take place.

Testing

Disaster Recovery can be extremely complex, with many of the complexities and interdependencies not being entirely obvious until you are in a disaster situation. Sometimes you'll find that in order to complete a task, a file is required from a server that is currently under several feet of water. For this reason it is advisable, and under some compliance regimes mandatory, that regular DR tests be carried out. Of course, no one is suggesting that you set the datacenter on fire and attempt to recover from it. Rather, choose a scenario and have the replacement systems brought up within the allotted RTO and RPO. This should be completed without access to any systems or services located on infrastructure affected by the scenario you have chosen.

The test should be observed and notes taken on what worked well and what did not. Holding a post-test debrief with the key people involved, even if the test met all of its targets, is a valuable exercise that can yield very useful insights into what can be improved for next time. Findings from the debrief should be minuted, with clear action items assigned to individuals in order to improve the plans and work toward a more efficient and seamless process. A more in-depth look at this topic is covered in Chapter 1.

Security Considerations

As with any process, there are security considerations involved with most plans. These can be summarized into a few key categories:

Data at rest

Many contingency plans require that data from production systems be duplicated and retained at another site. This is true of both warm standbys and traditional backups, for example. It should always be remembered that this data will have controls placed on it in production in line with its value and sensitivity to the organization. For example, it may be encrypted, require two-factor authentication to access, or be restricted to a small group of people. If equal restrictions are not placed on the contingency systems, the original access controls are largely useless. After all, why would an attacker bother trying to defeat two-factor authentication or encryption on a production system when he can simply access a relatively unprotected copy of the same data from a backup system?
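As a minimal sketch of carrying data-at-rest protection over to backup copies, the following encrypts an archive before it leaves the production environment, using the Fernet interface from the Python cryptography package. The archive name is hypothetical, and key management, meaning keeping the key somewhere other than alongside the backups, is the part that actually matters and is only hinted at here.

    from pathlib import Path

    from cryptography.fernet import Fernet

    # In practice the key comes from a secrets manager or HSM and is never
    # stored alongside the backups it protects.
    key = Fernet.generate_key()
    fernet = Fernet(key)

    archive = Path("www-20230601-2300.tar.gz")            # hypothetical backup archive
    encrypted = archive.parent / (archive.name + ".enc")  # encrypted copy to ship off-site

    # Reading the whole archive into memory keeps the sketch short; stream large files instead.
    encrypted.write_bytes(fernet.encrypt(archive.read_bytes()))
    print(f"Encrypted copy ready to ship off-site: {encrypted}")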

Data in transit

In order to replicate data to a secondary system, it will probably have to be transmitted over a network. Data transmitted for the purposes of recovering from or preparing for a disaster should be treated as carefully as any other time the data is transmitted. The appropriate authentication and encryption of data on the network should still be applied.

Patching and configuration management

It is often easy to fall into the trap of backup systems not being maintained in line with the production environment. This runs the risk of leaving poorly patched equipment or vulnerabilities in your environment for an attacker to leverage. In the event of a disaster, these vulnerabilities could be present on what has become your production system. Aside from the security issues, you cannot be sure that systems with differing configuration or patch levels will operate in the same way as their production counterparts.
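A lightweight way to catch this kind of drift is to regularly diff the installed package versions on the primary and standby systems. The sketch below uses hypothetical inventories; in practice these would be exported from each host by your configuration management tooling.

    # Hypothetical package inventories (name -> version) from each site.
    production = {"openssl": "3.0.13", "nginx": "1.24.0", "postgresql": "15.6"}
    standby = {"openssl": "3.0.8", "nginx": "1.24.0"}

    missing = sorted(set(production) - set(standby))
    drifted = sorted(pkg for pkg in production.keys() & standby.keys()
                     if production[pkg] != standby[pkg])

    print("Missing on standby:", missing)  # ['postgresql']
    print("Version drift:", drifted)       # ['openssl'] -- potentially unpatched on standby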

User access

During a disaster situation there is often a sense of “all hands to the pumps” in order to ensure that production environments are operationally capable as soon as possible. It should be considered that not all data can be accessed by just anybody, particularly if the data is subject to a regulatory compliance regime such as those that protect personally identifiable healthcare or financial data. Any plans should include the continued handling of this type of data in line with established processes and procedures.

Physical security

Often the secondary site is not physically identical to the primary site. Take, for example, a company whose primary production environment is housed in a secure, third-party managed datacenter, while the disaster recovery location makes use of unused office space at the headquarters. A lower standard of physical access control could place data or systems at risk should an attacker be willing to attempt to physically enter a building by force, subterfuge, or stealth.

Conclusion

There is no one-size-fits-all solution to DR, although there are several well-trodden routes that can be reused where appropriate. One of the most important aspects of DR planning is to work with the business to understand what their requirements are for your DR solution. By aligning your solution with their expectations, it is easy to measure the success or failure of the system.
