CHAPTER 8

Disaster Recovery Planning

In this chapter, you will

•   Learn about disaster recovery

•   Identify threat impacts on business operations

•   Understand business continuity

•   Distinguish various replication types

•   Implement appropriate backup techniques

This chapter will help you proactively plan for negative incidents that can affect business operations. Threats must be mapped to their potential impact on the organization, and server technicians must be aware of the procedures within the disaster recovery plan to return IT systems to a functional state.

Disaster Recovery

IT problems resulting from a disaster could be as small as a failed drive in a RAID 5 array or as large as an entire office or data center being unavailable. Your organization’s ability to bounce back quickly from catastrophic IT incidents results from proactive planning and requires technicians who know their roles.

The recovery time objective (RTO) is the maximum amount of time that can be tolerated for an IT service to be down before it has a negative impact on the business. As such, the RTO is an important component of a business impact analysis (BIA). Ensuring that all involved parties know their roles in the disaster recovery plan through simulated tabletop exercises can ensure that the RTO is met.

Alternate Sites

Alternate sites enable business operations to continue when a primary site experiences some kind of disruption. For example, suppose a data center experiences a region-wide communications link failure. Customers that depend on IT services from the unreachable data center can be redirected to an alternate data center, where IT systems are running and customer data has already been replicated from the primary site. The back-end virtual machine servers hosting a cloud load-balanced application can be spread across regions to achieve this. A round-robin load balancing configuration can distribute requests to servers in different regions as well.

Many things must be done before a site failure occurs:

•   An alternate location must be acquired or built.

•   High-speed communication links must be in place between the sites.

•   The IT infrastructure has to be in place at the alternate site:

    •   Hardware

    •   Hypervisors up and running

    •   Dynamic Host Configuration Protocol (DHCP) and Domain Name System (DNS)

•   Data replication between the two sites must be configured:

    •   Varying solutions determine how often data replicates.

    •   Communications link speeds determine how much data can be replicated within a given timeframe.

When an alternate site is active and a negative incident occurs, several things are set into motion, such as the following:

•   Failing over IT services to the alternate site:

    •   DHCP and DNS

    •   Hosted web sites

    •   Virtual machines

    •   Line-of-business applications

•   Ensuring network address changes do not affect IT service consumers:

    •   Dynamic DNS updates for changed IPv4 and IPv6 addresses

•   Ensuring that notifications are sent to affected stakeholders
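The dynamic DNS update step in the preceding list can be sketched with the BIND `nsupdate` utility. This is a minimal illustration only; the server address, zone, and record values shown are hypothetical, and a real environment would typically authenticate the update with a TSIG key (`nsupdate -k`):

```
server 192.0.2.53
zone example.com.
update delete www.example.com. A
update add www.example.com. 300 A 203.0.113.10
send
```

These commands would be fed to nsupdate (for example, `nsupdate < update.txt`) to repoint the `www` record at the alternate site's address after a failover.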

Individual network services can be made highly available using failover clustering solutions. Multiple servers (cluster nodes) use the same shared storage and have the software installed and configured identically, so that if one server fails, another one can take over. When a cluster node connects to shared storage, it normally uses a most recently used (MRU) path selection policy from that point onward; the server will attempt other paths to the shared storage if the current path fails.

Some services, such as DHCP, can be configured specifically for failover, as shown in Figure 8-1. Administrators configure clustered services as either active/active or active/passive. In an active/active configuration, the clustered service runs simultaneously on multiple cluster nodes, which can result in zero downtime (called a live failover). In an active/passive configuration, if the node where the service is running fails, the service fails over and starts up on another cluster node.

Images

Figure 8-1  Configuring DHCP failover for high availability

The configuration can also specify that the failed-over service fails back to the original host once that host is up and running again. In rolling cluster updates, a staggered process of applying cluster node updates ensures that some cluster nodes always remain running.

Clustering solutions use a periodic heartbeat transmission from each cluster node to ensure that nodes have not failed. It’s best to use a dedicated network adapter for cluster heartbeats.

Hot Site

A hot site is an alternate location that can actively continue business operations at a moment’s notice when the primary site becomes unavailable. Communication links, network equipment and software, staff, as well as up-to-date data are ready to go so that business operations can continue well within the RTO.

Large organizations and government agencies control their own data centers, and hot sites are entirely under their own control. Smaller organizations and individual consumers indirectly partake in this through IT service provider disaster recovery (DR) sites.

DR sites are commonly used by public cloud providers. These organizations have large data centers located throughout the world that offer cloud IT services to companies and individuals in different regions. The data centers are designed to withstand natural and manmade disasters—for example, large generators can provide power if the power goes out.

Shared DR hot site facilities are used by multiple companies and are less expensive than a DR hot site dedicated to a single company. Regulations or organizational security policy may prohibit IT system and data cohabitation with other organizations, however.

Hot sites stay up-to-date with continuous data protection (CDP) replication between sites, so that data is accessible from the hot site if the primary site becomes inaccessible.

Images

EXAM TIP   Watch out for exam questions related to items available at an alternate site. A hot site is the most expensive type of alternate site to maintain, because it must contain equipment, software, staff, and up-to-date data resulting from replication from the primary site.

Cold Site

Unlike hot sites, cold sites do not have IT equipment, software, data, and staff already in place. Essentially, a cold site is a location with power and communications links in place. In the event of a disaster, equipment, software, data, and people need to be provided at the cold site.

One common problem with a cold site is software. Getting operating system (OS) and application software installed and patched can take time (less than the RTO; otherwise, a cold site would not be used). Not only is time an issue, but version incompatibilities can slow things down even more. The specific software versions used at the primary location also must be available for use at a cold site.

Then there’s the issue of data. Cold sites don’t continuously replicate with primary sites, so restoring data from backup locations (tape or cloud) is required. Cold sites are much less expensive than hot sites, but they must fit into an organization’s business continuity plan (BCP) to be acceptable. If it takes four days to get a cold site functional, for example, it will not fall within a two-day RTO, so a cold site may not be feasible to use in such a case.

Warm Site

A warm site provides a location with power and communication links, but it also has equipment in place and ready in case of a disaster; software and data, however, are not in place prior to a disaster.

Many organizations use bare-metal server restoration solutions to get server OSs up and running quickly without requiring manual installations and configurations. A bare-metal restoration performs a full system recovery, including the OS for a physical or a virtual machine, and it can be configured to do so even if the physical or virtual hardware configuration is different from the software in place when the system backup or image was taken.

Technicians may use external bootable USB drives to apply the bare-metal images, or a network Preboot Execution Environment (PXE) server may be configured first to enable multiple simultaneous bare-metal deployments over the network.

Once OS and application software are installed, patched, and configured, data is required. This involves restoration from a chosen backup solution that is on-premises, in the cloud, or a hybrid of both. For example, the Amazon Web Services (AWS) Storage Gateway is a virtual machine that runs on a customer network and can present cloud storage as virtual tapes for on-premises backup software. Figure 8-2 shows the AWS Storage Gateway configuration screen.

Images

Figure 8-2  Configuring the AWS Storage Gateway

Images

EXAM TIP   Make sure you know what distinguishes one type of alternate site from another. Exam questions may test you on this using wording that differs from what is presented here, but the concepts are the same. For example, a question could link the RTO to a site type.

Data Replication

Having a readily available copy of up-to-date data is crucial to disaster recovery. Backups are still required, but data replication can immediately provide the data without requiring a restoration procedure. Replication can be implemented at several levels, such as the disk level, server level, and site level.

Synchronous replication writes to the primary and alternate location simultaneously; with asynchronous replication, there is a slight delay before the alternate write completes. Naturally, synchronous replication solutions tend to be more expensive than asynchronous ones, but with a short RTO, the cost of synchronous solutions could be justified, especially when you consider the fact that some data could be lost with an asynchronous solution.

Disk-to-Disk

In RAID 1 (disk mirroring) storage, a second copy of data is written to a disk other than the primary disk that stores the original data. Should the primary disk fail, the mirrored copy kicks in without missing a beat. This server component redundancy provides fault tolerance against disk failures.

Enterprise-class storage array vendors provide replication solutions for their disk array enclosures. The only issue is incompatibility; most vendor replication solutions work only with their own products.

The Linux tar Command  You can also use the Linux tar (tape archiver) command to create compressed archives for backup purposes. Commonly used tar command line parameters are shown in Table 8-1.

Images

Table 8-1  Commonly Used Linux tar Command Line Parameters

For example, to create a compressed archive called UserFiles.tar.gz under /Backup from the /UserFiles folder, you would use the following:

tar -cvzf /Backup/UserFiles.tar.gz /UserFiles

To decompress the same archive and re-create the folder on /, you would use the following:

tar -zxvf /Backup/UserFiles.tar.gz -C /

The Linux dd command can also be used to back up specific disk blocks or even entire partitions. For example, the following example uses the dd command to back up (if means input file) the master boot record (MBR) of a bootable Linux disk (/dev/sda) to a backup file (of means output file) in /backup called sda_mbr_back. The bs means block size for the 512-byte first boot sector that contains the MBR.

dd if=/dev/sda of=/backup/sda_mbr_back bs=512 count=1

The following example restores the MBR to a disk (/dev/sda) from a backup file. Notice here we use only 446 bytes to retain the current partition table and disk signature on /dev/sda:

dd if=/backup/sda_mbr_back of=/dev/sda bs=446 count=1
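After a restore like this, it is worth verifying that the boot code on the disk actually matches the backup file. The `cmp` command can compare a limited byte range; this is a minimal sketch assuming the same hypothetical paths as the preceding dd examples:

```shell
# Compare only the first 446 bytes (the MBR boot code, excluding the
# partition table); cmp exits 0 when the byte ranges are identical.
cmp -n 446 /backup/sda_mbr_back /dev/sda && echo "MBR boot code matches backup"
```

The same technique supports a verify-after-write policy for any dd-based backup: read the bytes back and compare them against the source before trusting the copy.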

Of course, there are many other file and disk backup solutions for Linux; tar and dd just happen to be built into most distributions.

Server-to-Server

Also called host-to-host replication, this solution uses software within the server OS to replicate data between two or more servers. The only issue here is that it introduces more processing work for a server OS that should be focusing on other tasks.

Windows DFSR  Windows Server OSs provide Distributed File System Replication (DFSR) as a role service that can synchronize folder contents between servers. Only file block changes are synchronized, and changes are compressed before being sent over the network.

Replication can be scheduled, or servers can be set to continuous replication. Bandwidth throttling can also be configured so that replication doesn’t consume all of the available bandwidth. Windows DFSR is considered asynchronous replication.

Server administrators can configure one or more servers in a DFSR replication group as read-only to prevent changes from that host. DFSR could be used to replicate data from branch office servers to a central server where data backups take place. The configuration of a DFSR replication group is shown in Figure 8-3.

Images

Figure 8-3  Configuring a DFSR replication group in Windows

Plenty of third-party replication solutions are available for Windows with varying capabilities.

rsync  UNIX and Linux admins often use the rsync tool to replicate data between hosts. In addition, other rsync variants work on the Windows platform.

The rsync tool can be used to synchronize data between two or more local folders over an intranet or the Internet. As with Windows DFSR, only file changes are synchronized. The following example synchronizes the /budgets/2021 folder from a server named server2 to a local folder called /incoming/rsync. The -a preserves extra attributes such as permissions, -v is for verbose output, and -z compresses the transfer:

rsync -avz server2:/budgets/2021/ /incoming/rsync

As with Windows, plenty of third-party replication solutions are available for UNIX and Linux, with varying options.

Site-to-Site

Service providers and emergency contingency organizations, for example, may need to keep data synchronized between sites; these are prime candidates for a primary site paired with a hot site. For serious replication between data centers, instead of host-based replication solutions such as Windows DFSR or the UNIX-based rsync, more advanced (and expensive) solutions from vendors such as HP, IBM, or EMC (among many others) would be used.

Cloud providers use site-to-site replication between their data centers to ensure that customer IT services and data are highly available in the event of a data center failure. Network links between data centers must be able to accommodate large data transfers quickly, especially if a synchronous replication solution is employed; these are often referred to as active/active copies of data, and they can provide a near-zero RTO. Multiple network links can be aggregated to provide network link high availability as well as increased network performance.

Site-to-site replication is not only costly but also complex. Technicians who are certified in the specific solution being used are needed to configure and maintain the replication environment.

Images

EXAM TIP   Many exam questions will expect you to connect the dots. Remember that hot sites imply synchronous replication over high-speed network links.

Keep in mind that site-to-site synchronous replication won’t solve all of your headaches. Consider a malware infection in one data center that somehow goes undetected and gets replicated to other data centers, or a region-wide outage caused by inclement weather. Data backups beyond replication are still (and always will be) very important.

Business Impact Analysis

Two of the first activities related to risk management are identifying and prioritizing assets that have value to the organization. These tasks help you focus on allocating resources to implement security controls and mitigate risk, and they help you determine how the loss of IT systems or data can negatively impact your organization.

Who and What Are Affected?

Prioritizing the impact of failed systems or data inaccessibility is part of a BIA. Even if a mission-critical database server has failed, infrastructure services such as DHCP and DNS must be made operational first if they have also failed.

Following is a sample list of DR priorities that will vary from one organization to the next:

•   Personnel safety

•   Critical organizational data

•   Network infrastructure hardware

•   Network infrastructure software

•   Mission-critical database servers

•   Front-end applications (that use the back-end database servers)

The RTO is a big factor in determining what type of failures can be tolerated and for how long. Organizations must sometimes consider their dependency on third parties for service high availability. Think of public cloud providers that have service level agreements (SLAs) with their customers that guarantee uptime, such as the SLA pictured in Figure 8-4. Resource allocation to protect critical IT systems is related to the amount of tolerable downtime.

Images

Figure 8-4  An example of a public cloud provider SLA

Images

EXAM TIP   Don’t confuse RTO with recovery point objective (RPO), which is discussed later. For the exam, remember that RTO is the maximum amount of tolerable downtime, and RPO is the maximum tolerable amount of data loss.

Business Continuity

Business operations must continue even in the face of natural and manmade disasters or technology failures. Preparation is the key. Where a business continuity plan takes a high-level approach to ensuring that the organization keeps running, disaster recovery plans are more specific to a technological solution.

Disaster Recovery Plan

A DR plan prepares an organization for potential negative incidents that can affect IT systems. Imagine a mission-critical database server running in a virtual machine that no longer boots. A DR plan is needed to get the server up and running again as quickly as possible. Presumably, if the server is that important, it has been made highly available through failover clustering, which is a server redundancy configuration. Nonetheless, the failed server still needs to be corrected. Server and application failover can be simulated before a disaster strikes to determine whether it will work correctly when needed.

DR plans include step-by-step procedures to recover failed systems such as a mission-critical database. Proper documentation makes this known to technicians who must know their roles in the DR plan. Periodic drills in a testing (not production) environment can streamline operations in the event of a real disaster.

Organizations also outsource IT expertise in some cases, such as with public cloud computing or synchronous replication solution vendors. A comprehensive DR plan will include details about these solutions as well as contact information in the event of a problem.

DR plans are effective only if the procedures are known and responsibilities are assigned. When roles and recovery steps are known, RTO is minimized. Table 8-2 lists common failure situations and responses that would be detailed in the DR plan.

Images

Table 8-2  IT Failures and Solutions

Companies should have multiple DR plans for various IT systems. A DR plan document should contain items such as these:

•   Table of contents

•   Scope of the DR document

•   Contact information for escalation and outsourcing

•   Recovery procedures

•   Document revision history

•   Glossary

The mean time to repair (MTTR) is a measure of time that expresses, on average, how long it takes to get failed components back up and running. This helps with planning equipment life cycle costs and helps you quickly recover from failed components. A smaller MTTR value is desirable. The mean time between failures (MTBF) is usually associated with hardware components such as hard disks; based on past data, the manufacturer provides an estimate as to how much time should go by until a failure occurs.

Business Continuity Plan

The BCP ensures that business operations can continue or can resume quickly during or after an IT failure. It should also include preventative measures that show stakeholders that the organization is committed to being prepared for the worst. The BCP relates to the Continuity Of Operations (COOP) to keep business processes up and running at all times.

Backups are helpful in recovering from failure. The RPO relates to the amount of tolerable data loss and is normally associated with backup frequency. For example, if the RPO is 10 hours, backups must occur at least every 10 hours.
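The RPO-to-schedule relationship can be enforced with a simple scheduler entry. As a sketch, a crontab line like the following (the backup script path is hypothetical) runs a backup every 8 hours, comfortably inside a 10-hour RPO:

```
# min hour day month weekday  command
0 */8 * * * /usr/local/bin/run_backup.sh
```

Runs occur at 00:00, 08:00, and 16:00, so no more than 8 hours of data can be lost between backups.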

Creating and using a BCP involves the following steps:

1.   Assemble the BCP team.

2.   Identify and prioritize critical IT systems and data.

3.   Determine whether required skills are available internally or must be outsourced.

4.   Determine whether alternate sites (hot, warm, cold) will be used.

5.   Create a DR plan for each IT service.

6.   Review the BCP with the BCP team.

7.   Run periodic drills to ensure effectiveness.

Data Backup

Data replication technologies provide additional copies of data that are readily available, yet backups must still be performed as well. Figure 8-5 shows the backup feature available with the Windows Server OS.

Images

Figure 8-5  Windows Server backup

The organization’s policies may be influenced by laws or regulations that specify how often backups must be performed, how long they must be retained, and in which country the data must reside. Most of today’s organizations have additional data backup options including offsite cloud backup solutions, although some types of data may be restricted from this type of backup. In some cases, printing highly classified information and storing it in a safe is considered a valid backup.

Backup Types

With today’s big data requirements, there are limitations to how much data can be backed up within a certain timeframe. Data center technicians must periodically reconcile backup software database references with the physical backup tapes themselves to ensure they exist.

In some cases, there just isn't enough time to perform a nightly full backup of all data on an enterprise storage area network (SAN). In such cases, selective backups give server administrators the ability to restore only the files that are required instead of overwriting all files, as well as to restore to an alternate path rather than the original backup location.

Somewhat related to clustering, backup, and availability, Microsoft SQL Server log shipping uses a primary and secondary SQL server. The primary server database supports read/write access, and the secondary server data is updated via transaction log updates from the primary. This is sometimes referred to as a side-by-side backup.

File systems provide an archive attribute that is turned on whenever a file is modified. Essentially, this flag means "I have been changed and I need to be backed up!" Not every backup solution uses the archive bit, but most do (and this can be configured in a GUI such as the Windows dialog shown in Figure 8-6), in addition to file date and time stamps. Different backup types set the archive bit accordingly, as explained in the following sections.

Images

Figure 8-6  The archive bit of a Windows file

Full Backup

As the name implies, a full backup copies all data specified in the backup set. Technicians also call this a normal backup; a copy backup is similar but does not clear the archive bit. Full backups take longer than other backup options, but they take the least amount of time to restore, because all data is contained in a single backup set. As a result, full backups are commonly performed only periodically, such as once a week on weekends.

The archive bit for files is cleared when a full backup is performed. When backed up files are modified in the live system (not on the backup media), the OS turns on the archive bit so that the backup solution knows the file has changed since the last backup. This is also true when new files are created; the OS sets the archive bit on.
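Linux file systems have no archive bit, but the same effect can be approximated by recording when the full backup finished. This is a minimal sketch with hypothetical paths: tar creates the full archive, and a marker file captures the completion time for later differential runs to compare against:

```shell
# Full backup of /UserFiles, then record when it finished so later
# differential backups can find files modified after this point.
tar -czf /Backup/full.tar.gz /UserFiles
touch /Backup/last_full.marker
```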

Differential Backup

This type of backup copies only the files that have changed since the last full backup (not since the last differential backup). So if you perform full backups on Saturdays and differential backups each weeknight, Wednesday night’s differential backup will copy all file changes since last Saturday (Sunday, Monday, Tuesday, and Wednesday).

Differential backups take less time to perform than full backups but more time to restore, because you need not only the full backup set but also the correct differential backup set that includes changes since the last full backup.

The archive bit is not normally cleared with this type of backup, because when the next differential backup runs, you’ll want to copy all changed files since the last full backup (where the archive bit is normally cleared).
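On Linux, where modification times stand in for the archive bit, a differential backup can be approximated by archiving only files newer than a marker file touched at full-backup time. A hedged sketch with hypothetical paths (for simplicity it assumes file names without spaces):

```shell
# Archive only files modified since the last full backup; the marker file
# is NOT updated here, so each differential captures everything since the full.
find /UserFiles -type f -newer /Backup/last_full.marker -print \
  | tar -czf /Backup/differential.tar.gz -T -
```

Because the marker is never touched by the differential run, each successive differential grows to include all changes since the last full backup, mirroring the archive-bit behavior described above.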

Incremental Backup

Where differential backups copy file changes since the last full backup, incremental backups copy only files that have changed since the last incremental or full backup. For example, you might perform a full backup each Saturday and an incremental backup each weeknight. Wednesday evening's backup will include only those files changed since Tuesday evening's backup.

The archive bit is normally cleared when this type of backup runs, because you want to capture only the changes made since each incremental backup ran. This backup type takes the least amount of time to run but the most time to restore.
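GNU tar supports this pattern directly via --listed-incremental (-g), whose snapshot file plays the role of the archive bit. This is a sketch with hypothetical paths: the first run against a fresh snapshot file behaves as a full (level 0) backup, and each later run captures only changes since the previous run:

```shell
# The snapshot file records what has already been backed up; because tar
# updates it on every run, each incremental captures only new changes.
tar -cz -g /Backup/state.snar -f /Backup/incremental.tar.gz /UserFiles
```

To restore, the full archive and every subsequent incremental archive are extracted in order, which is why incrementals are fastest to create but slowest to restore.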

Images

EXAM TIP   Some backup solutions, such as Veeam, support combining backup types to facilitate restoration. A synthetic full backup takes an incremental backup and, in the background, combines it with the older, existing full backup in the same location.

Snapshots

Snapshots, also called checkpoints, can be taken of a virtual machine (VM), as shown in Figure 8-7, to capture its settings as well as data stored in virtual hard disk files. If problems are encountered with the VM in the future, it can be reverted to a previous snapshot. However, a VM snapshot should not be relied upon as the sole backup. Backup agents can be installed within each VM as would normally be done on physical servers to provide granular backup and restore options.

Images

Figure 8-7  Taking a snapshot of a VMware virtual machine

Images

EXAM TIP   For the exam, remember that snapshots do not replace backups. If you are required to restore specific items, such as specific files, VM snapshots will not do the trick; reverting a VM snapshot reverts the VM settings and virtual hard disk contents.

Beyond virtual machines, snapshots can also apply to disk volumes, entire storage arrays, logical unit numbers (LUNs), hypervisors, and databases; they are often used in SAN environments, where they're called storage snapshots.

Windows servers can have the built-in Volume Shadow Copy Service (VSS) configured for each disk volume to enable scheduled snapshots (called volume shadow copies). The snapshots contain only changed disk blocks, so they don't consume much space. Many backup agents actually use the snapshots as their backup source; this eliminates the problem of backing up open files that are locked, which is one reason backups may fail. Windows users can also benefit from restoring previous versions of files or even undeleting files that have been removed from the Windows Recycle Bin.

Bare-Metal Backup

When servers will not boot or are misbehaving and cannot be fixed, they may need to be reinstalled from an OS image. This is also true for VMs, although traditionally the term "bare metal" was used for physical servers. In a virtualization environment, a bare-metal hypervisor is the virtualization software that runs and interacts directly with host hardware, as opposed to running as an app within an OS.

Bare metal indicates that you have only the server (physical or virtual) and must install the OS, applications, configuration settings, and patches, all of which are included in a bare-metal backup. The system state, which can be backed up separately, is included and consists of the Windows Registry, Active Directory database, driver files, and so on. Data can also be included in a bare-metal recovery image. The idea is to get servers back up and running as quickly as possible in the event of some kind of problem, and the specific procedures should be a part of a DR plan. The Windows Server OS includes bare-metal backup options (Figure 8-8).

Images

Figure 8-8  Windows Server bare-metal backup option

Some bare-metal solutions can also be used to deploy new servers quickly while changing unique identifiers such as server names, IP addresses, and licenses. Most bare-metal tools use recovery points, which are essentially snapshots of changes at various points in time. When performing a bare-metal recovery, you normally have a choice of snapshots, or recovery points.

Bare-metal solutions need some kind of a boot device, whether it’s USB or PXE network boot. Recovery of the OS, apps, and data can be done from local media or from a network server.

Backup Media

Now that you understand the types of backups and the reasons for creating backups, let’s consider where the backup data can be stored. You have numerous tape drive and media systems to choose from, depending on your capacity, speed, and archiving needs. Today’s standards include Linear Tape-Open (LTO), Advanced Intelligent Tape (AIT), and Digital Linear Tape (DLT), which are discussed in the following sections.

Most tape media is accessed sequentially, in linear fashion; standard disk storage, including USB drives and CDs, DVDs, and Blu-ray discs, supports random data access instead of having to seek to a specific place on tape media.

As with most technology, tape capacities, number of writes, and transfer speeds change frequently. Generally speaking, modern tape media has the ability to store about 2TB of compressed data with transfer rates ranging from 10 to 80 MBps. Depending on the media, you may get from 200 to 1000 writes before the media needs to be replaced.

Consider the following when choosing a backup solution:

•   What will be backed up?

•   Files

•   Bare-metal recovery images

•   Databases

•   Disk volumes

•   Storage arrays

•   Entire storage appliance

•   How much data will be backed up?

•   How much time is available to perform backups?

•   Which backup types will be used?

•   Full

•   Differential

•   Incremental

•   Snapshot

•   Bare metal

•   Are periodic restore drills run to ensure that valid backups are taken?

•   Is verify-after-write being used to ensure the integrity of backups?

•   Are backups being performed over slow network links?

•   What type of storage media will be used?

•   Hard disks

•   Cloud storage

•   Magnetic tape

•   Must data be archived for the long term?

Linear Access Tape

Linear access tape, often called Linear Tape-Open (LTO), is magnetic storage media that uses the Linear Tape File System (LTFS), which was introduced in 2010 and has since been revised several times. It offers large capacities, fast data seeks, and streaming, and it is commonly used with tape backup systems and for archiving. An XML file is used as a catalog of backed-up content on the LTFS and is not stored at the beginning of the tape, which makes access quicker (no need to rewind the tape). LTFS can also be used as additional storage media for copying purposes, apart from backup purposes.

Advanced Intelligent Tape

Advanced Intelligent Tape (AIT) was introduced in the 1990s. This magnetic tape storage is used with tape backup and archiving systems and has been revised since its inception. Each AIT data cartridge contains a chip with metadata, which means that the backup catalog can be accessed quickly, regardless of what part of the tape is currently being accessed.

Digital Linear Tape

The Digital Linear Tape (DLT) industry standard has been around since the 1980s and has been revised numerous times. Because of its 30-plus–year media rating, DLT is often used for long-term archiving. DLT cartridges should be placed in protective cases to ensure long-term data storage. The SuperDLT (SDLT) standard supports larger capacities and transfer rates, and SDLT systems can access older DLT media, but only with read access.

On-premises Backup

Many organizations continue to use on-premises tape backup solutions instead of, or in addition to, cloud backup solutions.

As mentioned earlier, the AWS Storage Gateway is a virtual appliance that runs on-premises and enables backup solutions to “see” virtual tape devices for cloud backup purposes. Remember that a cloud backup is a type of offsite backup, similar to storing physical backup media at another location.

Wherever backup media is stored, it needs to be physically secured. You may choose to encrypt backups for additional security. Backup media storage must be carefully considered—for example, you don’t want to discover that high humidity has destroyed your backups or archives when you need to restore data.

Physical and virtual servers can have backup agents installed for granular backup and restore options. Your backup solution may also support the backup of storage arrays and databases. You’ll also hear the term “tape library” used on occasion. This refers to a management solution for multiple tape devices and backup media used for backup purposes. Some tape libraries are robotic in that specific tapes can be mounted upon request for access to the backed-up data. Corrupt tape media can result in the inability to mount backup tapes, whereas hardware failures can result in “drive not ready” errors.

Cloud Backup

Over the last few years, individuals as well as organizations have begun trusting public cloud providers to store their data. Cloud backup is basically an extension of your office or data center network hosted on a provider’s equipment and accessed over the Internet.

The biggest showstopper for cloud backup adoption tends to be the perception that cloud security is lacking. Public cloud providers must undergo third-party security audits from various entities to ensure consumer confidence in their services. Because of economies of scale, cloud providers have the resources to secure IT infrastructure properly at a level that often exceeds what we can do in our organizations. Figure 8-9 shows the Microsoft Azure cloud service.

Images

Figure 8-9  Azure is one of many cloud backup solutions.

Cloud Backup Security

The first security consideration is how to connect to the cloud provider’s data center. The default is a plain Internet connection, but there are two more secure alternatives:

•   Connect your network to the cloud provider over the Internet with a site-to-site VPN.

•   Connect your network to the cloud provider with a private network connection that bypasses the Internet.

The second issue is whether cloud backups are encrypted. Some cloud providers support server-side encryption; otherwise, you’ll have to encrypt data before backing it up to the cloud. Figure 8-10 shows an example of encryption options when you’re uploading files to the cloud.

Images

Figure 8-10  Encryption options for cloud backup
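If your provider does not offer server-side encryption, client-side encryption before upload can be sketched as follows. This is a minimal illustration, assuming a POSIX shell with OpenSSL available; the paths and inline passphrase are placeholders, and a real deployment should use proper key management rather than a passphrase on the command line.

```shell
#!/bin/sh
# Sketch: encrypt a backup archive before uploading it to cloud storage.
# Paths and passphrase are illustrative only.
set -e

mkdir -p /tmp/demo_src /tmp/demo_out
echo "payroll data" > /tmp/demo_src/records.txt   # stand-in for real data

# Create a compressed archive, then encrypt it with AES-256 via openssl.
tar -czf /tmp/demo_out/backup.tar.gz -C /tmp demo_src
openssl enc -aes-256-cbc -pbkdf2 -salt \
    -in /tmp/demo_out/backup.tar.gz \
    -out /tmp/demo_out/backup.tar.gz.enc \
    -pass pass:ExamplePassphrase

# Decryption (what a restore would do before extracting):
openssl enc -d -aes-256-cbc -pbkdf2 \
    -in /tmp/demo_out/backup.tar.gz.enc \
    -out /tmp/demo_out/restored.tar.gz \
    -pass pass:ExamplePassphrase
```

Only the encrypted `.enc` file would leave your premises; the plaintext archive never needs to reach the provider.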

Backup and Restore Best Practices

After you select a backup solution, you need to use and maintain it properly. You’ll also need to adopt a backup and tape rotation strategy. Tape rotation enables you to keep backups for a period of time, but remember that the storage media itself degrades over time; refer to vendor documentation for further details.

Although we will focus on the grandfather-father-son (GFS) tape rotation scheme, two other schemes warrant a mention:

•   Tower of Hanoi

•   Incremental rotation

Grandfather-Father-Son

GFS is the most common tape rotation strategy. Because the amount of available tape backup media is finite, media reuse is inevitable. Keep in mind that different tape media enable different numbers of writes to be performed.

The GFS rotation method uses three backup sets, such as daily, weekly, and monthly—where each tape is rotated on a schedule. You could use quarters and years as backup cycles for even longer term archiving. In our example, here’s where the nomenclature comes in:

•   Son  Daily backup cycle

•   Father  Weekly backup cycle

•   Grandfather  Monthly backup cycle

This means that on day seven, the Son tape becomes a Father and will be used next for weekly backups. Other daily tapes keep getting reused as the cycle continues. On week four, the Father becomes a Grandfather, and it will be used next for monthly backups. Monthly backups can then be stored offsite.

From a high-level perspective, the GFS rotation scheme enables you to back up file versions that are recent or that are months, or perhaps even years, old.
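The rotation logic above can be sketched as a small shell function. This is an illustrative sketch only, assuming the backup week ends on day 7 and the monthly tape is written on the last calendar day; `gfs_set` is a hypothetical helper, not part of any backup product.

```shell
#!/bin/sh
# Sketch of GFS tape-set selection. Day values are passed in so the logic
# is easy to test; a real script would derive them from `date`.
gfs_set() {
    day_of_week=$1    # 1-7 (7 = last day of the backup week)
    day_of_month=$2   # calendar day
    last_day=$3       # last calendar day of this month (28-31)
    if [ "$day_of_month" -eq "$last_day" ]; then
        echo "grandfather"   # monthly tape, can be stored offsite
    elif [ "$day_of_week" -eq 7 ]; then
        echo "father"        # weekly tape
    else
        echo "son"           # daily tape, reused within the week
    fi
}

gfs_set 3 14 31   # mid-week, mid-month -> son
gfs_set 7 21 31   # end of backup week -> father
gfs_set 5 31 31   # last day of month -> grandfather
```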

Other Best Practices

Many simple tasks such as the following can help contribute to a dependable backup solution:

•   Clear and concise backup media labeling

•   Data retention policy

•   Integrity verification (often referred to as read-after-write verification)

•   Backup media offsite storage

•   Backup media encryption

•   Backup media environmental controls

•   Periodic data restoration tests
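For disk- or cloud-based backups, the integrity-verification practice listed above can be approximated by storing a checksum alongside each backup file. A minimal sketch, with illustrative paths:

```shell
#!/bin/sh
# Sketch: record a SHA-256 checksum when a backup file is written, then
# verify it before trusting a restore. Paths are illustrative.
set -e
mkdir -p /tmp/bkp_demo
echo "backup payload" > /tmp/bkp_demo/weekly.tar.gz   # stand-in for a real archive

# At backup time: store the checksum alongside the backup file.
( cd /tmp/bkp_demo && sha256sum weekly.tar.gz > weekly.tar.gz.sha256 )

# At verification time: -c re-reads the file and compares hashes; a nonzero
# exit status means the backup no longer matches what was written.
( cd /tmp/bkp_demo && sha256sum -c weekly.tar.gz.sha256 )
```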

Hands-on Exercises

Exercise 8-1: Configure the Volume Shadow Service on Srv2019-1

1.   Make sure you are logged into the Srv2019-1 virtual machine with the Administrator account (Fakedomain\Administrator) with a password of Pa$$w0rd172hfX.

2.   Start Windows Explorer.

3.   In the navigator on the left, click This PC. In the rightmost panel, right-click Local Disk (C:) and choose Properties.

4.   Open the Shadow Copies tab. Notice that each disk volume in the list shows “Disabled” under the Next Run Time column.

5.   Select C: and then click the Settings button.

6.   For the maximum size, set the Use Limit Value to 4000 MB.

7.   Click the Schedule button and select the boxes to the left of Sat and Sun to ensure that volume shadow copies are also taken at 7:00 A.M. on weekends. Click OK three times to return to Windows Explorer. Leave all screens open.

Exercise 8-2: Restore Previous Versions of Files in Windows

1.   On Srv2019-1 in Windows Explorer, on the right, double-click Local Disk (C:).

2.   In the right panel in a blank area, right-click and choose New | Folder. Enter Contracts for the folder name and press ENTER.

3.   Double-click the Contracts folder. Right-click in the right side of the screen and choose New | Text Document. Enter Contract_A for the filename and press ENTER.

4.   Double-click the Contract_A file and enter the following text in the file: Sample Contract Line One. Close Notepad by clicking the X in the upper-right corner and save the change.

5.   Click This PC on the left, and then right-click Local Disk (C:) and choose Properties.

6.   Open the Shadow Copies tab.

7.   Ensure that C: is selected and click the Create Now button to manually create a disk volume snapshot of drive C:. Click OK.

8.   Double-click Local Disk C:, and then double-click the Contracts folder.

9.   Double-click the Contract_A file, enter New Changes on a separate line, close Notepad, and save the changes.

10.   Right-click the Contract_A file and choose Restore Previous Versions.

11.   Notice the version of the file in the snapshot from a minute ago. Click the Open button at the bottom. Notice the text “New Changes” is not displayed because it did not exist when the snapshot was taken. Close Notepad. Close the Contract_A Properties dialog box.

12.   Select the Contract_A file. Press SHIFT-DEL. Choose Yes to delete the file permanently. The file will not be available in the Recycle Bin.

13.   In the address bar at the top, click Local Disk (C:).

14.   On the right, right-click the Contracts folder and choose Restore Previous Versions. Notice the version of the Contracts folder from when the snapshot was manually created earlier in this exercise.

15.   Click Open. Notice that the address bar at the top indicates how old the currently viewed folder contents are. Notice that the Contract_A file is listed; it can be undeleted from here.

16.   Right-click the Contract_A file and choose Copy. Close the newly opened Windows Explorer Contracts window. Close the Contracts Properties dialog box.

17.   Double-click the Contracts folder.

18.   Right-click in the white space on the right and choose Paste. The Contract_A file is restored. Double-click Contract_A. Notice the “New Changes” text is absent in the file; you have undeleted a version of the file that corresponds to when the snapshot was taken. Close Notepad but leave the server virtual machine running.

Exercise 8-3: Configure and Use Windows Server Backup

1.   On Srv2019-1, start PowerShell.

2.   Type get-windowsfeature *backup* to list server components that contain the word “backup.” Notice that the Windows Server Backup component is not installed (no X in the box).

3.   Because of word wrap in this book, sometimes the dash will look separated from the parameter. Keep them together, as in -includemanagementtools. Type the following:

Images

4.   Once the installation has completed (if it seems stuck, press ENTER in the PowerShell window), click the Start button at the bottom left of the taskbar. Type back and wait for Windows Server Backup to appear; then click it.

5.   Click Local Backup on the left. Notice the Backup Schedule, Backup Once, and Recover options on the far right. Click Backup Schedule.

6.   On the Backup Schedule Wizard screen, click Next.

7.   Choose Custom and click Next.

8.   On the Select Items For Backup screen, click Add Items.

9.   Click the + symbol to the left of Local Disk (C:) and select the checkbox next to the Contracts folder. Click OK and then click Next.

10.   Once A Day is currently set to 9:00 P.M.; change this to 8:00 P.M., and then click Next.

11.   On the Specify Destination Type screen, choose Back Up To A Volume. Click Next.

12.   On the Select Destination Volume screen, click the Add button and choose drive I:, which was created in an earlier lab for iSCSI virtual disk files. Click OK and then click Next.

13.   On the Confirmation screen, click Finish and then Close.

14.   On the far right in the Actions panel, click Backup Once.

15.   On the Backup Options screen, ensure that Schedule Backup Options is selected. Click Next.

16.   On the Confirmation screen, click Backup. When the backup status shows “Completed,” click Close. Note the backup status items in the middle of the screen. Leave the screen open.

17.   Open Windows Explorer and select Local Disk (C:).

18.   Select the Contracts folder. Press SHIFT-DEL. Choose Yes on the confirmation screen.

19.   Switch back to the wbadmin Windows Backup screen.

20.   In the Actions panel on the right, click Recover.

21.   On the Getting Started screen, ensure that This Server (SRV2019-1) is selected and click Next.

22.   On the Select Backup Date screen, accept the default and click Next.

23.   On the Select Recovery Type screen, ensure that Files And Folders is selected and click Next.

24.   Click the + symbol to the left of Srv2019-1. Do the same for Local Disk (C:).

25.   Select the Contracts folder and click Next.

26.   Accept the defaults on the Specify Recovery Options page and click Next.

27.   Click Recover.

28.   Once the status shows “Completed,” click Close.

29.   Switch to Windows Explorer and verify that C:\Contracts has been restored. If Windows Explorer was already open, press F5 on the keyboard to refresh the screen.

Exercise 8-4: Use the Linux tar Command for Data Backup

1.   Make sure you are logged into the Ubuntu-1 virtual machine with the uone account, with a password of Pa$$w0rd172hfX. Prefix each command in each exercise step with sudo to gain elevated privileges. You’ll need to enter the password again for uone the first time you issue a sudo command. Press ENTER after each command you issue in this exercise.

2.   Type mkdir /asia_contracts.

3.   Enter touch /asia_contracts/file{1,2,3,4}.txt to create four empty text files (file1.txt, file2.txt, and so on).

4.   Enter ls /asia_contracts to verify that the files have been created.

5.   Enter mkdir /backup.

6.   Enter tar -cvzf /backup/asia_contracts.tar.gz /asia_contracts. The c means create, v means verbose output, z means compress with the gzip utility, and f means file. (Linux compressed tar filenames normally have a .tar.gz file extension, but this is not required.)

7.   Enter ls /backup to ensure that the compressed tar file was created.

8.   Enter rm -rf /asia_contracts to delete the folder and its contents. Type ls / to verify that there is no longer an /asia_contracts folder.

9.   Enter cd /backup.

10.   Enter tar -zxvf asia_contracts.tar.gz -C /. The x means extract, and the -C means extract to a specified path.

11.   Enter ls / to verify that the /asia_contracts folder has been restored.

12.   Close all windows.

Chapter Review

This chapter focused on proactive planning for dealing with negative incidents when they occur. The overarching premise is to restore business operations as soon as possible.

Disaster Recovery Sites

A hot site is a facility that includes power, communications, hardware, software, data, and staff. This is the most expensive type of alternate site to maintain. A warm site provides a location, power, equipment, and communications links, but not up-to-date data. A cold site consists only of a location, power, and communication links. A cold site is the least expensive type of alternate site.

Data Replication

Synchronous replication ensures that data is written to primary and alternate locations without delay; this results in an up-to-date mirror copy of data between a primary and a hot site and is often done in the background automatically. Asynchronous replication includes a slight delay before data is written to alternate sites; as a result, this is less expensive than synchronous solutions, but it can cause problems with applications depending upon database consistency. Replication can be configured to occur in a single direction or bidirectionally.

Disk-level replication solutions include disk enclosure (array) and disk mirroring solutions that are used within a site, not between sites. Server replication solutions include commonly used tools such as Windows DFSR and rsync. Constant replication over a high-speed network link can ensure fast data copies and data consistency across locations.

Business Impact

An inventory of assets is needed before related threats can be identified. This allows for the prioritization of assets and risks and helps you determine the impact that threats can have on business operations.

Assets include the following:

•   IT systems

•   IT processes

•   Business and manufacturing processes

•   Personnel

•   Data

•   Trademarks

The recovery time objective (RTO) determines the maximum amount of tolerable downtime. Disaster recovery plans must take the RTO into account.

Disaster Recovery Plan

A disaster recovery (DR) plan is used to bring failed systems online as quickly and efficiently as possible. The DR plan must be updated periodically to reflect changing threats.

The DR plan contains step-by-step procedures detailing exactly how systems are to be quickly recovered. All stakeholders must know their roles for the effective recovery of failed systems.

The DR plan includes the following:

•   Table of contents

•   DR scope

•   Contact information

•   Recovery procedures

•   Document revision history

•   Glossary

Business Continuity Plan

Where DR plans are specific to a system, a business continuity plan (BCP) is more comprehensive and is not as specific as a DR plan. The purpose of a BCP is to ensure that overall business operations resume quickly after a negative incident; this is also referred to as Continuity of Operations (COOP).

BCPs include preventative measures such as backup policies. The recovery point objective (RPO) indicates the maximum tolerable data loss and is related to backup frequency. So, for example, an RPO of 12 hours means backups must never be more than 12 hours old.
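The relationship between the RPO and the backup schedule can be expressed as a simple check. The sketch below uses a hypothetical `rpo_met` helper; in cron terms, an RPO of 12 hours could be met with an entry such as `0 */12 * * * backup.sh`.

```shell
#!/bin/sh
# Sketch: given an RPO in hours, check whether a backup interval satisfies it.
# rpo_met is an illustrative helper, not part of any product.
rpo_met() {
    rpo_hours=$1
    interval_hours=$2
    if [ "$interval_hours" -le "$rpo_hours" ]; then
        echo "yes"   # worst-case data loss is at most the RPO
    else
        echo "no"    # a failure just before the next backup loses more than the RPO
    fi
}

rpo_met 12 12   # -> yes
rpo_met 12 24   # -> no
```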

BCP activities include these:

•   Assemble a team.

•   Identify and prioritize assets.

•   Identify skill requirements to recover systems.

•   Determine whether alternate sites will be used.

•   Create a DR plan for each IT system.

•   Review the BCP with the BCP team.

•   Conduct periodic BCP drills.

Data Backups

Backups are required even if disk mirroring or data replication solutions are being used. Virtual machine snapshots are point-in-time pictures of virtual machine settings and data. They are useful before making a critical change to a virtual machine, because they serve as a quick way to revert back to a known working configuration; however, they should never replace backups.

Take care when choosing a backup type, which can influence the amount of time taken to back up and restore data.

Common backup types include the following:

•   Full  All data is backed up and the archive bit is cleared.

•   Incremental  Data that has changed since the last full or incremental backup is backed up and the archive bit is cleared.

•   Differential  Data that has changed since the last full backup is backed up; the archive bit is not cleared.

•   Snapshot  Settings and data stored in virtual hard disk files and volumes are captured.

•   Bare metal  This backup enables the entire OS and data to be restored.
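Full and incremental backups can be demonstrated with GNU tar's `--listed-incremental` snapshot file (the directories below are illustrative). Reusing the same snapshot file makes each run incremental, capturing only changes since the previous run; copying the original full-backup snapshot file before each run would approximate a differential.

```shell
#!/bin/sh
# Sketch of full vs. incremental backups, assuming GNU tar is available.
set -e
mkdir -p /tmp/inc_src /tmp/inc_bkp
echo "a" > /tmp/inc_src/file_a.txt

# Full backup: the snapshot (.snar) state file is created from scratch.
tar --listed-incremental=/tmp/inc_bkp/state.snar \
    -czf /tmp/inc_bkp/full.tar.gz -C /tmp inc_src

# Change data, then run again against the same snapshot file: only the
# new file is captured.
echo "b" > /tmp/inc_src/file_b.txt
tar --listed-incremental=/tmp/inc_bkp/state.snar \
    -czf /tmp/inc_bkp/incr1.tar.gz -C /tmp inc_src
```

Restoring means applying the full archive first, then each incremental in order, which is why incrementals are the slowest backup type to restore.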

Important backup considerations include the following:

•   Amount of time to complete backup (backup window)

•   Backup devices and media being used:

•   Media capacity

•   Media lifetime

•   Backup data transfer speed

•   Cloud backup

•   Regulatory compliance

•   Data retention

•   Backup type

•   Backup media rotation strategy

•   Media labeling

•   Verification of backed-up data

•   Media encryption

•   Media offsite storage including cloud

•   Media storage environmental controls

•   Periodic restore drills

Backup rotation strategies are designed to retain data for a period of time and to reuse backup media. Common backup rotation strategies include Tower of Hanoi, incremental rotation, and grandfather-father-son (GFS). GFS is the most common backup rotation type and is used for long-term data archiving.

Questions

1.   Which of the following best describes RTO?

A.   The maximum tolerable amount of lost data

B.   The maximum tolerable amount of failed array disks

C.   The maximum amount of tolerable downtime

D.   The maximum tolerable amount of failed services

2.   Which type of disaster recovery site provides a facility with power and communications links only?

A.   Cold

B.   Warm

C.   Hot

D.   Basic

3.   Which factor enables up-to-date hot site data?

A.   Disk mirroring

B.   Cloud backup

C.   Synchronous replication

D.   Asynchronous replication

4.   Which common Linux tool synchronizes file systems between remote hosts?

A.   chmod

B.   CUPS

C.   NFS

D.   rsync

5.   Which type of replication provides a near-zero RTO?

A.   rsync

B.   Asynchronous

C.   DFSR

D.   Synchronous

6.   After identifying assets in a BIA, what should be done next?

A.   Prioritize assets.

B.   Assemble a BIA team.

C.   Create a DR plan.

D.   Assess risk.

7.   How can servers be managed when network connectivity is unavailable?

A.   iLO

B.   iDRAC

C.   RDP

D.   KVM

8.   You need to prevent a dual disk failure from bringing down a RAID 5 array. What should you do?

A.   Add two hot spares to the RAID 5 array.

B.   Add one hot spare to the RAID 5 array.

C.   Configure RAID 6.

D.   Configure RAID 1.

9.   Which type of backup does not clear the archive bit?

A.   Full

B.   Differential

C.   Incremental

D.   Bare metal

10.   Which backup type takes the longest to restore?

A.   Full

B.   Copy

C.   Incremental

D.   Differential

11.   Which type of backup takes the longest?

A.   Full

B.   Synchronous

C.   Incremental

D.   Differential

12.   Which type of restore should be used for operating system, data, and configuration settings that can be applied to different servers?

A.   Full

B.   Copy

C.   Bare metal

D.   Differential

13.   Your company uses an image to repair failed servers. Optical drives are disabled in all server UEFI settings. Which options can be used to boot and apply images? Choose two.

A.   CD

B.   USB

C.   DVD

D.   PXE

14.   What type of configuration redirects users from a failed server to another running instance of a network service?

A.   PXE

B.   Failover clustering

C.   NIC teaming

D.   Round-robin

15.   Which items would be available at a warm site? Choose two.

A.   Network links

B.   Data

C.   Staff

D.   Power

16.   Which requirement justifies the cost of a hot site?

A.   Low RPO

B.   High RPO

C.   Low RTO

D.   High RTO

17.   An SLA guarantees that user data will be available in at least one data center. What type of replication is needed between data centers?

A.   Hot

B.   Synchronous

C.   Warm

D.   Asynchronous

18.   An organization’s BCP stipulates that the RPO is five hours. How often should backups be performed?

A.   Less than the RPO

B.   Less than the RTO

C.   More than the RPO

D.   More than the RTO

Questions and Answers

1.   Which of the following best describes RTO?

A.   The maximum tolerable amount of lost data

B.   The maximum tolerable amount of failed array disks

C.   The maximum amount of tolerable downtime

D.   The maximum tolerable amount of failed services

C. The recovery time objective (RTO) relates to the maximum amount of tolerable downtime. A, B, and D are incorrect. The maximum tolerable amount of lost data relates to the recovery point objective (RPO). Various RAID levels determine how many failed disks can be tolerated. The RTO is not related to the maximum tolerable amount of failed services.

2.   Which type of disaster recovery site provides a facility with power and communications links only?

A.   Cold

B.   Warm

C.   Hot

D.   Basic

A. Cold sites provide a facility with power and communications links only. B, C, and D are incorrect. Warm sites provide power and communications links, as well as equipment and software, but lack up-to-date data. Hot sites provide a facility with power, communications links, equipment, software, up-to-date data, and staff. A basic site is not a type of disaster recovery site.

3.   Which factor enables up-to-date hot site data?

A.   Disk mirroring

B.   Cloud backup

C.   Synchronous replication

D.   Asynchronous replication

C. Writing up-to-date data simultaneously to multiple locations is called synchronous replication and is often used between a primary and hot disaster recovery site. A, B, and D are incorrect. Disk mirroring is not used to replicate data between sites. Cloud backup normally relies on a schedule. Asynchronous replication introduces a delay before writing to alternate sites.

4.   Which common Linux tool synchronizes file systems between remote hosts?

A.   chmod

B.   CUPS

C.   NFS

D.   rsync

D. rsync is often used in Linux to synchronize file systems between hosts. A, B, and C are incorrect. chmod is used to set Linux file system permissions. Common UNIX Printing System (CUPS) is a Linux print server. Network File System is a Linux network file system standard.

5.   Which type of replication provides a near-zero RTO?

A.   rsync

B.   Asynchronous

C.   DFSR

D.   Synchronous

D. A near-zero recovery time objective (RTO) means very little tolerance for downtime; this is provided by synchronous replication. A, B, and C are incorrect. rsync and DFSR are asynchronous replication tools and do not provide a near-zero RTO.

6.   After identifying assets in a BIA, what should be done next?

A.   Prioritize assets.

B.   Assemble a BIA team.

C.   Create a DR plan.

D.   Assess risk.

A. After asset identification, prioritization must occur. B, C, and D are incorrect. The business impact analysis (BIA) team would already be assembled if assets have been identified. The disaster recovery (DR) plan and risk assessment do not happen until assets are prioritized.

7.   How can servers be managed when network connectivity is unavailable?

A.   iLO

B.   iDRAC

C.   RDP

D.   KVM

D. Keyboard, video, mouse (KVM) switches can be used to administer servers locally. A, B, and C are incorrect. iLO, iDRAC, and RDP are remote management tools that rely on network connectivity.

8.   You need to prevent a dual disk failure from bringing down a RAID 5 array. What should you do?

A.   Add two hot spares to the RAID 5 array.

B.   Add one hot spare to the RAID 5 array.

C.   Configure RAID 6.

D.   Configure RAID 1.

C. RAID 6 can tolerate two disk failures. A, B, and D are incorrect. RAID 5 can tolerate a single disk failure. RAID 1 can tolerate a single disk failure.

9.   Which type of backup does not clear the archive bit?

A.   Full

B.   Differential

C.   Incremental

D.   Bare metal

B. Differential backup must capture all changes since the last full backup and thus does not clear the archive bit for backed-up files. A, C, and D are incorrect. Full and incremental backups do clear the archive bit when backing up files. Bare-metal restores are not related to archive bits.

10.   Which backup type takes the longest to restore?

A.   Full

B.   Copy

C.   Incremental

D.   Differential

C. Restoring using incremental backups means restoring the last full backup and each incremental backup to the point of failure. A, B, and D are incorrect. Full and copy backups are the quickest to restore since only a single backup set is needed. Differential backups are the next quickest to restore following full.

11.   Which type of backup takes the longest?

A.   Full

B.   Synchronous

C.   Incremental

D.   Differential

A. Because everything is being backed up, full backups take the longest to complete. B, C, and D are incorrect. Synchronous describes a replication method, not a backup type. Incremental backups take the least amount of time; only changes since the last full or incremental backup must be captured. Differential backups capture all changes since the last full backup.

12.   Which type of restore should be used for operating system, data, and configuration settings that can be applied to different servers?

A.   Full

B.   Copy

C.   Bare metal

D.   Differential

C. Bare-metal backups can be used to restore an entire operating system, applications, and data quickly, even to a new server. A, B, and D are incorrect. Full, copy, and differential backups are not designed to restore operating systems, applications, and data to different servers.

13.   Your company uses an image to repair failed servers. Optical drives are disabled in all server UEFI settings. Which options can be used to boot and apply images? Choose two.

A.   CD

B.   USB

C.   DVD

D.   PXE

B, D. USB local boot and PXE network boot do not rely on optical media. A and C are incorrect. CDs and DVDs are optical media.

14.   What type of configuration redirects users from a failed server to another running instance of a network service?

A.   PXE

B.   Failover clustering

C.   NIC teaming

D.   Round-robin

B. Failover clustering uses multiple servers (nodes), each having the same network service and access to the same data on shared storage. If a server fails, users get redirected to the network service on a running server. A, C, and D are incorrect. PXE is a network boot standard. NIC teaming groups server NICs together for load balancing or aggregated bandwidth. Round-robin is a term often used with DNS where there are multiple A records with the same name, each pointing to a different IP address.

15.   Which items would be available at a warm site? Choose two.

A.   Network links

B.   Data

C.   Staff

D.   Power

A, D. Warm sites include network links and power; they are missing only staff and up-to-date data. B and C are incorrect. Data and staff are available at hot sites but not warm sites.

16.   Which requirement justifies the cost of a hot site?

A.   Low RPO

B.   High RPO

C.   Low RTO

D.   High RTO

C. A low RTO means very little tolerance for downtime. Hot sites align with this requirement. A, B, and D are incorrect. The recovery point objective (RPO) relates to the amount of tolerable data loss, not downtime, so it does not justify a hot site the way a low RTO does.

17.   An SLA guarantees that user data will be available in at least one data center. What type of replication is needed between data centers?

A.   Hot

B.   Synchronous

C.   Warm

D.   Asynchronous

B. Synchronous replication between data centers means there is no delay when writing data. This aligns with SLA guarantees in this context. A, C, and D are incorrect. Hot and warm describe disaster recovery site types, not replication types. Asynchronous replication introduces a delay from the primary write.

18.   An organization’s BCP stipulates that the RPO is five hours. How often should backups be performed?

A.   Less than the RPO

B.   Less than the RTO

C.   More than the RPO

D.   More than the RTO

A. The RPO defines the maximum amount of tolerable data loss. If the RPO is five hours, then backups must occur at least that often. B, C, and D are incorrect. The RTO relates to downtime, not data loss, so it does not determine backup frequency.
