Chapter 7. Backup and Recovery

In Chapters 5 and 6, we focused on infrastructure design and management. By this point, you should have a good feel for how to build, deploy, and manage distributed infrastructures running databases, including techniques for rapidly adding new nodes for capacity or to replace a failed node. Now, it’s time to discuss the serious meat and potatoes: data backup and recovery.

Let’s face it. Everyone considers backup and recovery dull and tedious. Most think of it as the epitome of toil. It is often relegated to junior engineers, outside contractors, and third-party tooling that the team is loath to interact with. We’ve worked with some pretty horrible backup software before. Trust us, we empathize.

Still, this is one of the most crucial processes in your operations toolkit. Moving data between nodes, across datacenters, and into long-term archives is the constant movement of your business’s most precious commodity: its data. Rather than relegating this process to a second-class citizen of Ops, we strongly suggest you treat it as a VIP. Everyone should not only understand the recovery targets, but also be intimately familiar with operating and monitoring the processes. Many DevOps philosophies propose that everyone should have an opportunity to write and push code to production. We propose that every engineer should participate at least once in the recovery processes of critical data.

We create and store copies of data, otherwise known as backups and archives, as a means to the real need. That need is recovery. Sometimes, this recovery is something nice and leisurely, such as building an environment for auditors or configuring an alternate environment. More often though, the recovery is needed to rapidly replace failed nodes or to add capacity to existing clusters.

Today, in distributed environments, we face new challenges in the backup and recovery realm. Now, as before, most local datasets are kept to reasonable sizes, a few terabytes at most. The difference is that each local dataset is only one fraction of a larger distributed dataset. Recovering a node remains a relatively manageable task, but maintaining consistent state across the cluster becomes more challenging.

Core Concepts

Let’s begin by discussing core concepts around backup and recovery. If you are an experienced database or systems engineer, some of this might be rudimentary. If so, please feel free to fast forward a bit.

Physical versus Logical

When backing up a database physically, you are backing up the actual files in which the data resides. This means that the database-specific file formats are maintained, and there is usually a set of metadata within the database that defines what files exist, and what database structures reside within them. If you back up files and expect another database instance to be able to utilize them, you will need to back up and store the associated metadata that the database relies on in order to make the backup portable.

A logical backup exports the data out of the database into a format that is, theoretically, portable to any system. There will usually be some metadata still, but it is more likely to be focused on the point in time at which the backup was taken. An example of this is an export of all of the insert statements needed to populate an empty database and bring it up to date. Another example could be each row stored in a JSON format. Because of this, logical backups tend to be very time consuming: they are row-by-row extractions rather than physical copy-and-write operations. Similarly, recovery involves all of the normal overhead of the database, such as locking and the generation of redo or undo logs.
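To make the distinction concrete, here is a minimal sketch of a logical backup: a row-by-row export to JSON lines using Python’s built-in sqlite3 module. The table, column, and file names are hypothetical, and a real logical backup tool would also capture schema definitions and a consistent snapshot point.

    import json
    import sqlite3

    def logical_backup(db_path, table, out_path):
        """Export every row of `table` as one JSON document per line."""
        conn = sqlite3.connect(db_path)
        conn.row_factory = sqlite3.Row  # rows behave like dicts
        with open(out_path, "w") as out:
            # Row-by-row extraction: portable, but far slower than copying files.
            for row in conn.execute(f"SELECT * FROM {table}"):
                out.write(json.dumps(dict(row), default=str) + "\n")
        conn.close()

    # Hypothetical usage: logical_backup("orders.db", "orders", "orders.jsonl")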

An excellent example of this dichotomy is the difference between statement-based and row-based replication. In many relational databases, statement-based replication means that upon commit, data manipulation language (DML: insert, update, replace, delete) statements are appended to a log. Those statements are streamed to replicas, where they are replayed. The other approach, row-based replication, also known as change data capture (CDC), ships the changed row images themselves rather than the statements that produced them.
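The difference is easiest to see with a contrived sketch: a single statement that touches many rows produces one entry in a statement-based log, but one change record per affected row in a row-based (CDC) log. The event shapes below are purely illustrative, not any particular database’s log format.

    # One logical change: mark all carts older than a cutoff as abandoned.
    statement_log_entry = {
        "type": "statement",
        "sql": "UPDATE carts SET status = 'abandoned' WHERE updated_at < '2017-01-01'",
    }

    # The same change expressed as row-based / CDC events: one record per row,
    # carrying the before and after images rather than the SQL that produced them.
    row_log_entries = [
        {"type": "row", "table": "carts", "pk": 101,
         "before": {"status": "open"}, "after": {"status": "abandoned"}},
        {"type": "row", "table": "carts", "pk": 102,
         "before": {"status": "open"}, "after": {"status": "abandoned"}},
        # ...one entry for every affected row
    ]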

Online versus Offline

An offline, or cold, backup is one in which the database instance that utilizes the files is shut down. This allows files to be quickly copied with no worries about maintaining a point-in-time state while other processes are reading and writing data. This is an ideal, but very rare, state in which to work.

In an online, or hot, backup, you are still copying all of the files, but you have the added complexity of needing to get a consistent, point-in-time snapshot of the data that must exist for the amount of time it takes a backup to occur. Additionally, if live traffic is accessing the database during the backup, you also must be careful not to overwhelm the Input/Output (IO) throughput of the storage layer. Even throttled, you can find that the mechanisms used to maintain consistency add unreasonable amounts of latency to application activity.

Full, Incremental, and Differential

A full backup, regardless of the approach, means that the entire local dataset is backed up fully. On small datasets, this is a fairly trivial event. For 10 terabytes, it can take an impossible amount of time.

A differential backup allows you to take a backup of only the data changed since the last full backup. In practice, there is usually more data backed up than just what has changed because your data is stored in structures such as pages. A page will be of a particular size, such as 16 KB or 64 KB, and will hold many rows of data. A differential backup will back up any page that has modified data in it. Thus, larger page sizes will back up significantly more than just the changed data.

An incremental backup is similar to a differential backup, except that it will use the last backup, incremental or full, as the point in time at which it will look for changed data. Thus, if you are restoring an incremental backup, you might need to recover the last full backup, and one or more incremental backups, as well, to get to the current point in time.
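A hedged sketch of the difference in restore chains: with differentials you apply the last full backup plus the latest differential; with incrementals you apply the last full backup plus every incremental taken since. The backup records here are hypothetical dictionaries with a type and timestamp.

    def restore_chain(backups, strategy):
        """Return the ordered list of backups to apply for a restore.

        `backups` is a time-ordered list of dicts like
        {"type": "full" | "diff" | "incr", "taken_at": "2017-03-04T02:00"}.
        """
        last_full_idx = max(i for i, b in enumerate(backups) if b["type"] == "full")
        chain = [backups[last_full_idx]]
        after_full = backups[last_full_idx + 1:]
        if strategy == "differential":
            diffs = [b for b in after_full if b["type"] == "diff"]
            if diffs:
                chain.append(diffs[-1])  # only the newest differential is needed
        elif strategy == "incremental":
            # every incremental since the last full must be applied, in order
            chain.extend(b for b in after_full if b["type"] == "incr")
        return chain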

With these concepts in mind, let’s discuss various things to consider when deciding on an effective backup and recovery strategy.

Considerations for Recovery

When you first evaluate an effective strategy, you should look back to your Service-Level Objectives (SLOs), as discussed in Chapter 2. Specifically, you need to consider availability and durability indicators. Any strategy that you choose must allow you to recover data within the predefined uptime constraints. And, you need to back up frequently enough to meet the necessary parameters for durability. If you back up every day and your transaction logs between backups remain only on node-level storage, you could very well lose those transactions if the node fails before the next backup.

Additionally, you need to consider how the dataset functions within the holistic ecosystem. For instance, your orders might be stored in a relational system, where everything is committed in transactions and is thus easily recovered in relation to the rest of the data within that database. However, after an order is placed, a workflow might be triggered via an event stored in a queuing system or a key–value store. Those systems might be eventually consistent, or even ephemeral, relying on the relational system for reference or recoverability. How do you account for those workflows when recovering?

If you are in an environment with rapid development, you might also find that data stored in a backup was written and utilized by a different version of the application than the version running after the restore is done. How will the application interact with that older data? Hopefully the data is versioned to allow for that, but you must be aware of this and prepared for such eventualities. Otherwise, the application could logically corrupt that data and create even larger issues down the road.

Each of these, and many other variables that you cannot plan for, must be taken into account when planning for data recovery. As we discussed in Chapter 3, we simply can’t prepare for every eventuality. But this is a critical service. Data recoverability is one of the most significant responsibilities of the database reliability engineer (DBRE). So, your plan for data recoverability must be as broad as possible, taking into account as many potential issues as you can.

Recovery Scenarios

With that in mind, let’s discuss the types of incidents and operations that might require recovery so that we can plan on supporting each need. We can first sort these into planned versus unplanned scenarios. Treating recovery as an emergency tool only will limit your team’s exposure to the tool to emergencies and emergency simulations. Instead, if we can incorporate recovery into daily activities, we can expect a higher degree of familiarity and success during an emergency. Similarly, we will have more data to determine whether the recovery strategy supports our SLOs. With multiple daily runs, it is easier to get a sample set that can include upper bounds and can be represented with some level of certainty for planning purposes.

Planned Recovery Scenarios

What are the day-to-day recovery needs that can be incorporated? Here is a list that we’ve seen at various sites:

  • Building new production nodes and clusters

  • Building different environments

  • Extract, Transform, and Load (ETL) and pipeline processes for downstream datastores

  • Operational tests

When performing these operations, be sure to plug the process into your operational visibility stack:

Time

How long does each component, as well as the overall process, take to run? Uncompress? Copy? Log applies? Tests? 

Size

How big is your backup compressed and uncompressed?

Throughput

How much pressure are you putting on the hardware?

This data will help you stay ahead of capacity issues, allowing you to ensure that your recovery process stays viable.
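A minimal sketch of what plugging a recovery run into your visibility stack might look like: time each phase, record the backup size, and derive throughput. The phase names and the emit function are placeholders for whatever metrics pipeline you already use (covered in Chapter 4).

    import os
    import time
    from contextlib import contextmanager

    metrics = {}

    @contextmanager
    def timed(phase):
        """Record wall-clock seconds for one phase of the recovery process."""
        start = time.monotonic()
        yield
        metrics[f"{phase}_seconds"] = time.monotonic() - start

    def emit(data):
        # Stand-in for your real metrics pipeline (statsd, Prometheus, etc.).
        print(data)

    def run_recovery(backup_file, restored_dir):
        metrics["backup_bytes_compressed"] = os.path.getsize(backup_file)
        with timed("copy"):
            pass        # copy the backup onto the target host
        with timed("uncompress"):
            pass        # unpack it
        with timed("log_apply"):
            pass        # apply incremental/transaction logs
        with timed("validation"):
            pass        # fast checks first, slow checks later
        total = sum(v for k, v in metrics.items() if k.endswith("_seconds"))
        metrics["throughput_bytes_per_sec"] = metrics["backup_bytes_compressed"] / max(total, 1e-9)
        emit(metrics)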

New production nodes and clusters

Whether your databases are part of an immutable infrastructure or not, there are opportunities for regular rebuilds  that will, of necessity, utilize recovery procedures. Databases are rarely set into autoscaling automation because of the amount of time it can take for a new node to be bootstrapped and brought into a cluster. Still, there is no reason that a team can’t set up a schedule to regularly introduce new nodes into a cluster to test these processes. Chaos Monkey, a tool developed by Netflix that randomly shuts down systems, can do this in such a way that the entire process of monitoring, notification, triage, and recovery can be tested. If you’re not there yet, though, you can still do this as a planned part of a checklist of processes your operations team should be performing at regular enough intervals to keep them all familiar with the procedure. These activities allow you to test not only a full and incremental recovery, but incorporation into the replication stream and the process to take a node into service.

Building different environments

It is inevitable that you will be building environments for development, integration testing, operational testing, and demos, among others. Some of these environments will require complete recovery and should utilize node recovery and full cluster recovery. Some will have other requirements, such as subset recovery for feature testing and data scrubbing for user privacy purposes. This allows you to test point-in-time recovery as well as the recovery of specific objects. Each of these is very different from a standard full dataset recovery and is useful for recovering from operator and application corruption. By creating APIs that allow for object-level and point-in-time recovery, you can facilitate automation and familiarity with these processes.

ETL and pipeline processes for downstream datastores

Similarly to your environment builds, the process of pushing data from production databases into pipelines for downstream analytics and streaming datastores is a perfect place to utilize point-in-time and object-level recovery processes and APIs.

Operational tests

During various testing scenarios, you will need copies of data. Some testing, such as capacity and load testing, requires a full dataset, which is an excellent opportunity for utilizing full-recovery processes. Feature testing might require smaller datasets, which is an excellent opportunity to use point-in-time and object-level restores.

Recovery testing itself can become a continuous operation. In addition to utilizing recovery processes in everyday scenarios, you can set restores to run constantly, allowing automated testing and validation to rapidly surface any issues that might have broken the backup process. When we bring up this process, many people ask how to test the success of a restore.

When taking the backup, you can produce a lot of data that can be used for testing, such as the following:

  • The most recent ID in an auto increment set.

  • Row counts on objects.

  • Checksums on subsets of data that are insert only and thus can be treated as immutable.

  • Checksums on schema definition files.

As with any testing, this should be a tiered approach. There are some tests that will succeed or fail quickly; these should be the first layer of testing. Examples of this are checksum comparisons on metadata/object definitions, the successful starting of a database instance, and the successful connection to a replication thread. Operations that might take longer, such as checksumming data and running table counts, should be run later into the validation process.
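One way to capture that data is to write a small manifest alongside each backup and then verify it in tiers after a test restore, with the cheap checks first. The helper names, tables, and queries below are hypothetical; swap in whatever your datastore exposes.

    import hashlib

    def build_manifest(conn, schema_ddl):
        """Collect cheap-to-compare facts at backup time."""
        return {
            "max_order_id": conn.execute("SELECT MAX(id) FROM orders").fetchone()[0],
            "orders_rowcount": conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0],
            "schema_checksum": hashlib.sha256(schema_ddl.encode()).hexdigest(),
        }

    def validate_restore(conn, schema_ddl, manifest):
        """Tiered checks: fail fast on metadata before running the expensive ones."""
        # Tier 1: quick checks that succeed or fail immediately.
        assert hashlib.sha256(schema_ddl.encode()).hexdigest() == manifest["schema_checksum"]
        assert conn.execute("SELECT 1").fetchone()[0] == 1   # instance is up and answering
        # Tier 2: slower checks, run later in the validation process.
        # For insert-only data, the restored values should be at least the recorded ones.
        assert conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0] >= manifest["orders_rowcount"]
        assert conn.execute("SELECT MAX(id) FROM orders").fetchone()[0] >= manifest["max_order_id"]
        return True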

Unplanned Scenarios

With all of the day-to-day planned scenarios that can be used, the recovery process should be quite finely tuned, well documented, well practiced, and reasonably free of bugs and issues. Thus, the unplanned scenarios are rarely as scary as they could be otherwise. The team should see no difference in these unplanned exercises. Let’s list and dive into each to discuss the possibilities that might cause us to need to exercise our recovery processes:

  • User error

  • Application errors

  • Infrastructure services

  • Operating system and hardware errors

  • Hardware failures

  • Datacenter failures

User error

Ideally, user error should be a somewhat rare occurrence. If you are creating guard rails for engineers, you can prevent a lot of it. Still, there will always be an occasion when an operator accidentally does damage. Some examples of this include the ubiquitous absence of a WHERE clause when executing an UPDATE or DELETE in the database client. Or, perhaps a data cleansing script is executed in the production environment rather than the testing environment. There are also many cases in which something executes correctly, just at the wrong time or against the wrong hosts. All of these are user errors. These errors are often immediately identified and recovered from. However, there can be occasions when the impacts of these changes are not known for days or weeks, which hinders detection.

Application errors

Application errors are the scariest of the scenarios discussed because they can be so insidious. Applications are constantly modifying how they interact with datastores. Many of these applications are also managing referential integrity and external pointers to assets such as files or third-party IDs. It is frighteningly simple to introduce a change that destructively mutates data, removes data, or adds incorrect data in ways that might not be noticed for quite a long time.

Infrastructure services

Chapter 6 covers the magic of infrastructure management services. Unfortunately, these systems can be as destructive as they can be helpful, with wide-ranging consequences from editing a file, pointing to a different environment, or pushing an incorrect configuration.

OS and hardware errors

Operating systems and the hardware they interface with are still systems built by humans, and thus can have bugs and unintended consequences from undocumented or poorly known configurations. In the context of data recovery, this is quite true regarding the path of data from the database through OS caches, filesystems, controllers, and ultimately disks. Data corruption or data loss is much more common than we think. Unfortunately, our trust and reliance on these mechanisms creates cultures in which data integrity is expected rather than something to be skeptical of.

Silent Corruption

This kind of OS and hardware error impact happened to Netflix in 2008. Error detection and correction on disks utilizes error correction code (ECC). ECC corrects single-bit errors automatically and detects double-bit errors. Thus, an ECC can detect an error up to twice the Hamming distance of what it can correct. So, if it can correct 46 bytes in your 512-byte sector hard drive, it can detect up to 92 bytes of error. What isn’t correctable is reported to the controller as uncorrectable, and the disk controller increments the “uncorrectable error” counter in S.M.A.R.T. Errors larger than 92 bytes are passed straight to the controller as good data. That propagates to backups. Terrifying, right?

This is what makes cloud and so-called “serverless” computing something that should be approached with great skepticism. When you do not have access to implementation details, you cannot be sure that data integrity is being treated as top priority. Too often it is ignored, or even tuned down for performance. Without knowledge, there is no power.

Checksumming filesystems such as ZFS will checksum each block, ensuring that bad data is detected. If you are using a RAID configuration that includes mirroring or parity, the filesystem can even repair the bad data from a good copy.
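Conceptually, what a checksumming filesystem does per block can be sketched in a few lines: store a checksum next to each block when it is written, and compare on every read so that bit rot is detected instead of silently propagating into backups. This is an illustration of the idea, not how ZFS is implemented.

    import hashlib

    BLOCK_SIZE = 4096

    def write_blocks(data):
        """Split data into blocks and record a checksum per block."""
        blocks, sums = [], []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            blocks.append(block)
            sums.append(hashlib.sha256(block).hexdigest())
        return blocks, sums

    def read_block(blocks, sums, idx):
        """Detect silent corruption on read instead of passing bad data along."""
        block = blocks[idx]
        if hashlib.sha256(block).hexdigest() != sums[idx]:
            raise IOError(f"checksum mismatch on block {idx}: silent corruption detected")
        return block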

Hardware failures

Hardware components fail, and in distributed systems they fail regularly. You get regular failures of disks, memory, CPUs, controllers, or network devices. These failures of hardware can cause node failures, or latency on nodes that make the system unusable. Shared systems like network devices can affect entire clusters, making them unavailable, or causing them to break into smaller clusters that aren’t aware of the network having been partitioned. This can lead to rapid and significant data divergence that will need to be merged and repaired.

Datacenter failures

Sometimes, hardware failures at the network level can cascade into datacenter failures. Occasionally, congestion of storage backplanes causes cascading failures, as in the case of Amazon Web Services in 2012. Sometimes hurricanes, earthquakes, and tractor trailers create conditions that cause the failure of entire datacenters. Recovery from this will test even the most robust of recovery strategies.

Scenario scope

Having enumerated the planned and unplanned scenarios that might create the need for a recovery event, let’s further add the dimension of scope to these events. This will be useful to determine the most appropriate response. Here are ranges we’ll consider:

  • Localized or single-node

  • Cluster-wide

  • Datacenter or multiple clusters

In a local or single-node scope, the recovery is limited to a single host. Perhaps you are adding a new node to a cluster for capacity or for the replacement of a failed node. Perhaps you are doing a rolling upgrade, and restores are being done node by node. Any of these are local scope.

In a cluster-wide scope, the need to execute recovery is global to all members of that cluster. Perhaps a destructive mutation or data removal occurred and cascaded to all nodes via replication. Or, perhaps you need to build a new cluster for capacity testing.

Datacenter or multiple cluster scope indicates that all data in a physical location or region needs recovery. This could be due to failure of shared storage, or a disaster that has caused the catastrophic failure of a datacenter. This might also be deployment of a new redundant site for planning purposes.

In addition to the locality scope, there is dataset scope. This can be enumerated into three potential types:

  • Single object

  • Multiple objects

  • Database metadata

In single-object scope, one specific object requires recovery of some or all data. The incident discussed previously, in which a DELETE ends up removing more data than planned, is a single-object scope. In multiple objects, the scope is against more than one, and possibly all, objects in a particular database. This can occur in application corruption or during a failed upgrade or shard migration. Finally, there is database metadata scope, in which the data stored in the database is fine, but the metadata that makes the database usable, such as user data, security permissions, or mapping to OS files, is lost.

Scenario Impact

In addition to defining the scenario requiring recovery and the scope enumerated, it is also crucial to define the potential impacts because they will be significant in determining how the recovery option is approached. You can approach data loss that doesn’t affect the SLO methodically and slowly to minimize escalating the impact. More destructive changes that are causing SLO violation must be approached with an eye toward triage and rapid service restoration before any long-term clean up. We can separate the approaches into three categories:

  • SLO impacting, application down, or majority of users affected

  • SLO threatening, some users affected

  • Features affected, non-SLO threatening

With recovery scenario, scope, and impact, we have a potential combination of 72 different scenarios to consider. That’s a lot of scenarios! Too many, really, to give each the level of focus it needs. Luckily, many scenarios can utilize the same recovery approach. Still, even with this overlap, there is no way that we can fully plan for every eventuality. Thus, building a multitiered approach to recovery is required to ensure we have as extensive a toolkit as possible. In the next section, we use the information we just went through to define the recovery strategy.

Anatomy of a Recovery Strategy

There is a reason why we say “recovery strategy” rather than “backup strategy.” Data recovery is the very reason we do backups. Backups are simply a means to the end, and thus are dependent on the true requirement: recovery within parameters. The simple question “Is your database backed up?” is a question that should be followed with the response, “Yes, in multiple ways, depending on the recovery scenario.” A simple yes is naive and promotes a false sense of security that is irresponsible and dangerous.

An effective database recovery strategy not only engages multiple scenarios with the most effective approaches, but also includes the detection of data loss/corruption, recovery testing, and recovery validation.

Building Block 1: Detection

Early detection of potential data loss or corruption is crucial. In our discussion of user and application errors in “Unplanned Scenarios”, we noted that these problems can often go for days, weeks, or even longer before being identified. This means that backups might even have aged out by the time the need for them is noticed. Thus, detection must be a high priority for all of engineering. In addition to building early detection around data loss or corruption, it is also critical to ensure there is as long a recovery window as possible in case early detection fails. Let’s look at the different failure scenarios discussed and identify some real-world approaches to detection and to lengthening recovery windows.

User error

One of the biggest reductions in the time to identify data loss comes from not allowing manual or ad hoc changes to be executed in production environments. By creating wrappers for scripts, or even API-level abstractions, engineers can be guided through effective steps for ensuring all changes are as safe as possible: tested, logged, and surfaced to the appropriate teams.

An effective wrapper or API will be able to do the following:

  • Execute in multiple environments via parameterization

  • Provide a dry-run stage in which execution results can be estimated and validated

  • Run a test suite against the code being executed

  • Validate post-execution that changes met expectations

  • Support soft-deletion or easy rollback via the same API

  • Log, by ID, all data modified, for identification and recovery

By removing the ad hoc and manual components of these processes, you can increase the likelihood that all changes will be trackable by troubleshooting engineers. All changes will be logged so that there is traceability and the change cannot simply disappear into the day-to-day noise. Finally, by soft-staging mutations or deletions and building in easy rollbacks of any data, you give greater windows of time for problems with the change to be identified and corrected. This is not a guarantee. After all, manual processes can be extremely well logged, and people can forget to set up logging in automated processes, or they can bypass them.
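A hedged sketch of what such a wrapper could look like. The names (run_change, Change) and the soft-delete mechanism are hypothetical; the point is the shape: parameterized environment, dry run, guard rails, post-execution validation, logging of affected IDs, and an easy rollback path.

    import logging
    from dataclasses import dataclass
    from typing import Callable, List

    log = logging.getLogger("data_changes")

    @dataclass
    class Change:
        description: str
        target_ids: Callable[[str], List[int]]    # env -> IDs that would be affected
        apply: Callable[[str, List[int]], None]   # soft-deletes/mutates the given IDs
        rollback: Callable[[str, List[int]], None]
        expected_max_rows: int                    # guard rail: abort if estimate exceeds this

    def run_change(change: Change, env: str, dry_run: bool = True):
        ids = change.target_ids(env)              # dry-run estimate of the blast radius
        log.info("change=%r env=%s affected_ids=%s", change.description, env, ids)
        if len(ids) > change.expected_max_rows:
            raise RuntimeError(f"refusing to touch {len(ids)} rows; expected <= {change.expected_max_rows}")
        if dry_run:
            return ids                            # estimated, validated, and logged; nothing mutated
        change.apply(env, ids)                    # soft-delete/mutate so rollback stays cheap
        leftovers = change.target_ids(env)        # post-execution validation
        if leftovers:
            log.warning("post-execution validation found %d leftover rows", len(leftovers))
        return ids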

Application errors

A key to early detection of application errors is data validation. When engineers introduce new objects and attributes, database reliability engineers should work with them to identify data validation that can be done downstream, outside of the application itself.

Like all testing, initial work should focus on quick tests that provide fast feedback loops on critical data components, such as external pointers to files, relationship mapping to enforce referential integrity, and personally identifiable information (PII). As data and applications grow, this validation becomes more expensive and more valuable. Building a culture that holds engineers, rather than the storage engines, accountable for data quality and integrity pays dividends not only in the flexibility to use different databases, but also in helping people feel more confident about experimenting and moving fast on application features. Validation functions as a guard rail, helping everyone feel braver and more confident.
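A small example of the kind of downstream validation that catches application errors early: fast checks on critical invariants such as dangling references or missing external assets. The tables, columns, and the file_exists helper are all hypothetical.

    def validate_orders(conn, file_exists):
        """Cheap invariant checks run outside the application, e.g., against a replica.

        `file_exists` is whatever function checks your external asset store.
        """
        problems = []
        # Referential integrity the application is supposed to maintain.
        orphans = conn.execute(
            "SELECT o.id FROM orders o LEFT JOIN customers c ON o.customer_id = c.id "
            "WHERE c.id IS NULL LIMIT 100").fetchall()
        if orphans:
            problems.append(f"{len(orphans)} orders reference missing customers")
        # External pointers to assets such as invoice files.
        for (order_id, path) in conn.execute(
                "SELECT id, invoice_path FROM orders WHERE invoice_path IS NOT NULL LIMIT 100"):
            if not file_exists(path):
                problems.append(f"order {order_id} points at missing file {path}")
        return problems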

Infrastructure services

Any catastrophic infrastructure impacts that require recovery should be caught rapidly by the operational visibility stack. That being said, some changes can be quieter and still cause data loss, data corruption, or availability impacts. Using golden images and comparing them regularly to your infrastructure components can help you quickly identify drift from those images. Similarly, versioned infrastructure can help identify drifting components and alert the appropriate engineers or automation.

OS and hardware errors

As with infrastructure services, the majority of these problems should be caught rapidly by monitoring of logs and metrics. Edge cases that are not standard will require some thought and experience to identify and add to monitoring for early detection. Checksums on disk blocks are an example of this. Not all filesystems do this, and teams working with critical data need to take the time to choose filesystems that can identify silent corruption via checksumming.

Hardware and datacenter failures

As with infrastructure services, these failures should be easily identifiable via the monitoring that we’ve already gone over in Chapter 4. Isn’t it great that we already did that?

Building Block 2: Tiered Storage

An effective recovery strategy relies on data being placed on multiple storage tiers. Different recovery needs can be served by different storage areas, which not only ensure the right performance, but also the right cost and the right durability for any number of scenarios.

Online, high performance storage

This is the storage pool most of your production datastores will run on. It is characterized by a high amount of throughput, low latency, and thus, a high price point. When recovery time is of the utmost importance, putting recent copies of the datastore, and associated incremental backups on this tier is paramount. Generally speaking, only a few copies of the most recent data will reside here, allowing for rapid recovery for the most common and impactful of scenarios. Typical use cases will be full database copies to bring new nodes into production service after failures, or in response to rapid escalations of traffic that result in a need for additional capacity.

Online, low-performance storage

This storage pool is often utilized for data that is not sensitive to latency. Larger disks with lower throughput, higher latency, and a lower price point make up this pool. These storage pools are often much larger as a result, so more copies of data from further back in time can be stored in this tier. Relatively infrequent, low-impact, or long-running recovery scenarios will utilize these older backups. Typical use cases are finding and repairing application or user errors that slipped by early detection.

Offline storage

Tape storage or even something like Amazon Glacier are examples of this kind of storage. This storage is off-site and often requires movement via vehicle to bring it to an area where it can be made available for recovery. It can support business continuity and audit requirements, but it does not have a place in day-to-day recovery scenarios. Still, due to the size and cost, vast amounts of storage are available here, allowing for the potential of storing all data for the life of the business, or at least for a full legal compliance term.

Object storage

Object storage is a storage architecture that manages data as objects rather than files or blocks. Object storage provides features not available through traditional storage architectures, such as an API available to applications, object versioning, and high degrees of availability via replication and distribution. Object storage allows for scalable and self-healing availability of large numbers of objects with full versions and history. This can be ideal for easy recovery of specific objects that are unstructured and that are not reliant on relationships to other data for coherence. This creates an attractive opportunity for recovery from application or user errors. Amazon S3 is a classic example of an inexpensive, scalable, and reliable object-level storage tier.

Each of these tiers plays a part in a comprehensive strategy for recoverability across multiple potential scenarios. Without being able to predict every possible scenario, it is this level of breadth that is required. Next, we will discuss the tools that utilize these storage tiers to provide recoverability.

Building Block 3: A Varied Toolbox

So, now it is time to evaluate the required recovery processes by going through the scenarios and evaluating options. We know from various sections in this chapter that we have a series of tools available to us. Let’s look at them a little more closely.

Replication Is Not a Backup!

You will note that nowhere do we discuss replication as a way to effectively back up data for recovery. Replication is blind and can cascade user errors, application errors, and corruption. You must look at replication as a necessary tool for data movement and synchronization, but not for creating useful recovery artifacts. If anyone tells you that they are using replication for backups, give them some side eye and move on. Similarly, RAID is not a backup; rather, it provides redundancy.

Full physical backups

We know that we will need to do full restores at each level of scope: node level, cluster level, and datacenter level. Rapid, portable full restores are incredibly powerful and mandatory in dynamic environments. They allow rapid node builds for capacity or for deployment of replacements during failures. A full backup can be done via full data copies over the network, or via volumes that can easily be attached and detached from specific hosts/instances. To do this, you need full backups.

Full backups of a relational database require either the opportunity to lock the database to get a consistent snapshot from which you can copy, or the ability to shut the database down for the duration of the copy. In an asynchronously replicated environment, the replicas cannot be completely trusted to be synchronized with the primary writer, so you should perform these full backups from the primary if at all possible. After the snapshot is created within the database or via a filesystem or infrastructure snapshot, you can copy that snapshot to staging storage.

Full backups of an appending-write datastore, such as Cassandra, involve a snapshot that utilizes hard links at the OS level. Because the data in these distributed datastores is not on all nodes, the backup is considered an eventually consistent backup. Recovery requires bringing the node back into a cluster, at which point regular consistency operations will eventually bring it up to date.

A full backup on online, high-performance storage is for immediate replacement into an online cluster. These backups are typically left uncompressed because decompression takes a lot of time. Full backups on online, low-performance storage are utilized for building different environments, such as for testing, or for analytics and data forensics. Compression is an effective tool for keeping longer timelines of full backups on limited storage pools.
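A sketch of the staging step under those assumptions: keep the newest full backup uncompressed on the fast tier for quick restores, and compress the copy headed to the slower, cheaper tier, using only the standard library. The paths are hypothetical.

    import shutil
    import tarfile
    from pathlib import Path

    def stage_full_backup(snapshot_dir: str, fast_tier: str, slow_tier: str, label: str):
        """Copy a consistent snapshot to both storage tiers."""
        # Fast tier: uncompressed copy, optimized for recovery time.
        shutil.copytree(snapshot_dir, Path(fast_tier) / label)
        # Slow tier: gzip-compressed tarball, optimized for retention on cheap storage.
        with tarfile.open(Path(slow_tier) / f"{label}.tar.gz", "w:gz") as archive:
            archive.add(snapshot_dir, arcname=label)

    # Hypothetical usage:
    # stage_full_backup("/snapshots/db-2017-03-04", "/fast/backups", "/slow/backups", "db-2017-03-04")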

Incremental physical backups

As discussed earlier, incremental backups allow for bridging the gap between the last full backup and a later point in time. Physical incremental backups are generally done via data blocks that contain a changed piece of data. Because full backups can be expensive, both in terms of performance impact during the backup and in storage, incremental backups allow you to quickly bring an older full backup up to date for use in the cluster.

Full and incremental logical backups

Full logical backups provide portability and simpler extraction of subsets of data. They will not be used for rapid recovery of nodes; instead, they are perfect tools for forensics, for moving data between datastores, and for recovering specific subsets of data from large datasets.

Object stores

Object stores, like logical backups, can provide for easy recovery of specific objects. In fact, object storage is optimized for this specific use case, and it can easily be used by APIs to programmatically recover objects as needed.

Building Block 4: Testing

For such an essential infrastructure process as recovery, it is astonishing how often testing tends to fall by the wayside. Testing is an essential process to ensure that your backups are usable for recovery. Testing is often set up as an occasional process, to be run on an intermittent basis such as monthly or quarterly. Although this is better than nothing, it allows for long periods of time between tests during which backups can stop working.

There are two effective approaches to adding testing into ongoing processes. The first one is incorporating recovery into everyday processes. This way, recovery is constantly tested, allowing for rapid identification of bugs and failures. Additionally, constant recovery creates data about how long your recovery takes, which is essential in calibrating your recovery processes to meet Service-Level Agreements (SLAs). Examples of constant integration of recovery into daily processes include the following:

  • Building integration environments

  • Building testing environments

  • Regularly replacing nodes in production clusters

If your environment does not allow for enough opportunities to rebuild datastores, you can also create a continuous testing process, whereby recovery of the most recent backup is a constant process, followed by verification of the success of that restore. Regardless of the presence of automation, even offsite backup tiers do require occasional testing.
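A sketch of what one iteration of a continuous restore test might look like as a scheduled job: restore the newest backup to a scratch host, run the fast checks, then the slow ones, and emit the timings so SLA calibration data accumulates. Every helper here (latest_backup, restore_to_scratch, and so on) is a placeholder for your own tooling.

    import time

    def continuous_restore_test(latest_backup, restore_to_scratch, fast_checks, slow_checks, emit):
        """One iteration of an always-on restore verification loop."""
        backup = latest_backup()
        started = time.monotonic()
        instance = restore_to_scratch(backup)          # full recovery onto a scratch host
        restore_seconds = time.monotonic() - started
        results = {"backup": backup, "restore_seconds": restore_seconds}
        results["fast_checks_ok"] = all(check(instance) for check in fast_checks)
        if results["fast_checks_ok"]:                  # only pay for slow checks if cheap ones pass
            results["slow_checks_ok"] = all(check(instance) for check in slow_checks)
        emit(results)                                  # feeds SLO/SLA calibration and alerting
        return results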

With these building blocks, you can create an in-depth defense for different recovery scenarios. By mapping out the scenarios and tools used to recover them, you can then begin evaluating your needs in terms of development and resources.

A Recovery Strategy Defined

As we discussed earlier in this chapter, we have multiple failure scenarios to prepare for. To do this, we need a rich toolset, and a plan for utilization of each of those tools.

Online, Fast Storage with Full and Incremental Backups

This portion of the strategy supports the meat and potatoes of daily recovery. When you need to build a new node for rapid introduction into production or for testing, you use this strategy.

Use cases

The following scenarios are the primary use cases for this portion of strategy:

  • Replacing failed nodes

  • Introducing new nodes

  • Building test environments for feature integration

  • Building test environments for operations testing

Running a daily full backup is often the highest frequency possible because of the latency the backup introduces. Keeping up to a week’s worth allows for rapid access to any recent changes and is usually more than enough. This means seven full copies of the database, uncompressed, plus the amount of data required to track all changes for incremental backups. Some environments do not have the capacity or the budget for this, so retention period and frequency become the levers for tuning.
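The capacity math is worth doing explicitly, since retention and frequency are the levers you have. A toy estimate, with all numbers hypothetical:

    def fast_tier_storage_needed(full_backup_tb, retained_fulls=7, daily_change_rate=0.05):
        """Rough estimate of online, high-performance storage for this strategy.

        Assumes uncompressed fulls plus roughly one day of changed pages per retained full.
        """
        incrementals_tb = full_backup_tb * daily_change_rate * retained_fulls
        return retained_fulls * full_backup_tb + incrementals_tb

    # e.g., a 2 TB dataset kept for a week with ~5% daily churn:
    # fast_tier_storage_needed(2.0)  ->  14.7 TB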

Detection

Monitoring informs you when there is a node or component failure requiring recovery to new nodes. Capacity planning reviews and projections let you know when you need to add more nodes for capacity purposes.

Tiered storage

Online, high-performance storage is required because production failures require rapid recovery. Similarly, testing must be as fast as possible to support rapid development velocity.

Toolbox

Full and incremental physical backups provide the fastest recovery option, and are the most appropriate here. These backups are left uncompressed due to recovery time needs.

Testing

Because integration testing happens frequently, these recovery scenarios occur frequently. In virtual environments, daily reintroduction of one node into the cluster allows for similar frequent exercising of recovery processes. Finally, a continuous recovery process is introduced due to the significant importance of this process.

Online, Slow Storage with Full and Incremental Backups

Here, we have slower storage with cheaper, more plentiful space.

Use cases

The following scenarios are the primary use cases for this portion of strategy:

  • Application errors

  • User errors

  • Corruption repair

  • Building test environments for operations testing

When new features, failed changes, or inappropriate migrations cause damage to data, you need to be able to access and extract large amounts of data for recovery. That is where this tier comes in. This is perhaps the messiest stage of recovery because there are too many permutations of potential damage to account for. Code often must be written ad hoc during the recovery effort, which can itself introduce more bugs and errors without effective testing.

Copying full backups from high-performance to low-performance storage through a compression mechanism is an easy way to get full backups into this portion of our strategy. Thanks to compression and cheaper storage, keeping up to a month or even longer is possible, depending on budget and needs. In highly dynamic environments, the chance of missing corruption and integrity issues is much higher, which means you need to account for a longer retention window.

Detection

Data validation is the key for identifying the need for recovery from this pool. When validation fails, engineers can use these backups to identify what happened, and when, and begin extracting clean data for reapplication into production.

Tiered storage

Online, low-performance storage is required because a long window of time is required for this part of the strategy. Cheap, large storage is the key.

Toolbox

Full and incremental physical backups are the most appropriate here. These backups are compressed because storage capacity matters more than recovery time at this tier. Here, you can also utilize logical backups, such as replication logs, in addition to physical ones to allow for more flexibility in recovery.

Testing

Because this recovery does not happen as often, continuous automated recovery processes are critical to ensure that all backups are usable and in good shape. Occasional “game day” practice runs of specific recovery scenarios such as a table, or a range of data, are also good to keep teams familiar with the processes and tools.

Offline Storage

By far the least expensive, this is also the slowest of storage tiers from which to extract data.

Use cases

The following scenarios are the primary use cases for this portion of strategy:

  • Audits and compliance

  • Business continuity

So, this part of the solution is really focused on rare, but highly critical needs. Audits and compliance often require data going back seven years or more. But, they are not time sensitive, and can take quite a while to prepare and present. Business continuity requires copies of data away from the same physical locations of the current production systems to ensure that if there are disasters, you can rebuild. Although this is time sensitive, it can be restored in staged approaches that allow for flexibility.

Copying full backups from low-performance to offline storage through a compression mechanism is an easy way to get full backups into this portion of our strategy. Keeping up to seven years or even longer is not only possible, but required.

Detection

Detection is not a substantial part of this component of the strategy.

Tiered storage

Cheap storage in vast sizes is required because a long window of time is required for this part of the strategy. Tape or solutions like Amazon Glacier are often the choices here.

Toolbox

Full backups are the most appropriate here. These backups are also compressed, because storage capacity matters far more than recovery time at this tier.

Testing

Testing strategies here are similar to the online, slow storage tier.

Object Storage

An example of object storage is Amazon’s S3. It is characterized by programmatic access rather than file- or block-level access.

Use cases

The following scenarios are the primary use cases for this portion of the strategy:

  • Application errors

  • User errors

  • Corruption repair

Object storage inspection, placement, and retrieval APIs are given to software engineers for integration into applications and administrative tools so that they can effectively recover from user errors and application errors. With versioning, it becomes trivial to recover from deletes, unexpected mutations, and other incidents that would otherwise be time sinks for administrators.
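As an example, with a versioned Amazon S3 bucket you can restore an object to the version that existed before an incident simply by copying an older version over the current one. This sketch uses boto3; the bucket and key are hypothetical, and a production tool would add error handling and auditing.

    import boto3

    s3 = boto3.client("s3")

    def restore_previous_version(bucket, key, before=None):
        """Copy an earlier version of `key` back over the current one.

        If `before` (a timezone-aware datetime) is given, restore the newest
        version older than it; otherwise restore the newest non-current version.
        """
        resp = s3.list_object_versions(Bucket=bucket, Prefix=key)
        candidates = [
            v for v in resp.get("Versions", [])
            if v["Key"] == key and not v["IsLatest"]
            and (before is None or v["LastModified"] < before)
        ]
        if not candidates:
            raise LookupError(f"no earlier version of {key} to restore")
        target = max(candidates, key=lambda v: v["LastModified"])
        s3.copy_object(
            Bucket=bucket, Key=key,
            CopySource={"Bucket": bucket, "Key": key, "VersionId": target["VersionId"]},
        )
        return target["VersionId"]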

Detection

Data validation and user requests are key for identifying the need for recovery from this pool. When validation fails, engineers can identify the date ranges of the occurrence and programmatically recover from the incident.

Testing

Because object-level recovery becomes a part of the application, standard integration testing should be more than enough to ensure that this works.

With these four approaches to data recovery, we are able to provide a fairly comprehensive strategy for recovering from most scenarios, even those we don’t expect or plan for. There is fine tuning to be done based on recovery service-level expectations, budget, and resources. But, overall, we’ve set the stage for an effective plan that incorporates detection, metrics and tracking, and continuous testing.

Wrapping Up

You should be finishing this chapter with a solid understanding of the potential risks to your environment that could require data recovery. These risks are legion and unpredictable. One of the most important points is that you can’t plan for everything, so you need to build a comprehensive strategy to ensure that you can tackle anything that comes up. Some of this includes working with software engineers to incorporate recovery into the application itself. In other places, you need to build some pretty solid recovery software yourself. And in all cases, you must build off of the previous chapters on service-level management, risk management, infrastructure management, and infrastructure engineering to get there.

In Chapter 8, we discuss release management. It is our hope that, going into the rest of this book, data recovery stays at the forefront of your mind. Every step forward in an application and its infrastructure brings risks to data and stateful services. The prime directive of the DBRE’s world is to ensure that data is recoverable.
