Chapter 10

Develop, deploy, and manage data recovery

The first and foremost responsibility of a production DBA is to ensure that a database can be recovered in the event of a disaster. This chapter outlines the fundamentals of data recovery and SQL Server recovery models. Then, it covers backup devices in SQL Server before discussing the different types of backups. Next, it shows you how to create and verify database backups, and how to restore databases from those backups. This chapter ends with a discussion of defining a recovery strategy based on a fictitious scenario.

Note

As discussed in Chapter 2, “Introduction to database server components,” a disaster is any unplanned event caused by, among other things, natural disaster, hardware or software failure, malicious activity, or human error. Quite a few adverse events and disasters are caused by human error.

This chapter does not provide any guidance for fixing a corrupt database; Microsoft recommends restoring from a last-known good database backup if you experience corruption. Instead, as mentioned, this chapter focuses on database backups and restores. The next chapter, Chapter 11, “Implement high availability and disaster recovery,” covers how to achieve high availability and use SQL Server disaster recovery features to keep your environment running even in the face of disaster. After reading this chapter and the one that follows, you will understand how to achieve close to zero data loss with minimal downtime.

You don’t design a backup strategy; you design a recovery strategy. You must allow for potential downtime and data loss, within acceptable limits. These are defined by the business requirements for getting an environment back up and running after a disaster.

Technical solutions such as those for high availability (HA) and disaster recovery (DR) are available in Microsoft SQL Server to support these organizational requirements. In other words, business requirements define the approach you will take in your organization to plan for and survive a disaster.

We should also point out that this is only a small but important part of a larger business continuity plan (BCP). A BCP is designed to enable ongoing business operations while the DR plan is executed. A BCP defines the critical business functions and processes that might be supported by your SQL Server environment and how these functions will continue after disaster strikes.

Note

This chapter makes several references to Chapter 3, “Design and implement an on-premises database infrastructure”—particularly in the context of transaction log files, virtual log files (VLFs), and log sequence numbers (LSNs). If you have not yet read that chapter, we highly recommend that you do so before reading any further here.

This chapter includes recovery strategies for hybrid and cloud environments, but the sections on backup devices, types of backups, and creating and restoring backups apply to on-premises environments only. For SQL Server virtual machines in Azure, see Chapter 16, “Design and implement hybrid and Azure database infrastructure.”

There is an old saying among seasoned DBAs that a DBA is only as good as their last backup and will only keep their job if they tested it. This should serve as a reminder that backups are important, but just as important (if not more so) is testing those backups.

Prepare for data recovery

Designing a recovery strategy that achieves zero data loss with zero downtime is incredibly expensive and almost impossible. Ultimately, recovery is a balance between budget, acceptable downtime, and acceptable data loss.

Note

Emotions run high when systems are down, so it is incumbent on all organizations to define possible outcomes at the outset and how to deal with them. Meanwhile, runbooks and practice help keep heads cool when dealing with a stressful outage.

The governance of these requirements is outlined in a Service-Level Agreement (SLA). An SLA is a business document that specifies measurable commitments you make to the business related to availability and recovery objectives in case of a failure. Included in the SLA is the Recovery Point Objective (RPO) and Recovery Time Objective (RTO). The RPO expresses the amount of work that may be lost, in units of time such as minutes, when the service is returned to normal operation. The RTO expresses the amount of time that the service may be unavailable in case of a failure or disaster. The SLA might also include consequences and penalties (financial or otherwise) if you do not meet these timelines. Although you use technical solutions to satisfy these requirements, it is important to remember that your recovery strategy should be the best fit for the organization’s business needs.

  • For more information about achieving HA, read Chapter 11.

A disaster recovery scenario

Let’s paint a picture of a disaster recovery scenario. This scenario spirals out of control pretty fast, so buckle up.

Imagine it’s 4:57 p.m. on a beautiful, sunny Friday afternoon. Just as you are about to head home for the weekend, disaster strikes: The electricity goes out for the entire city block where your office is located, and the uninterruptible power supply (UPS) in the server room has been removed for repairs. You haven’t rehearsed this scenario, because no one ever thought the UPS would be removed for maintenance for an extended period.

Your transaction log backups run every 15 minutes, because that’s what the RPO stipulates. Also, you have a batch script in the Windows Task Scheduler that copies your files remotely, so your logs have been copied safely offsite. Or have they?

Suddenly you have a sinking feeling in the pit of your stomach about a warning you saw in your email this morning while you were on the phone with a colleague, which you accidentally deleted, and which, thanks to that annoying muscle-memory habit you have of emptying deleted items whenever you see them, is gone forever.

Your phone squawks to inform you it has 2 percent battery remaining. Your laptop has some charge, but not much, because you were planning to charge it when you got home. Still, you crack open your laptop to check whether you can somehow undelete that email. Oh, right, your Internet is down.

Your phone rings. It’s the vice president of marketing, who is away this week at a conference. He wants to check the sales figures for a report the board is putting together for an important meeting this evening. The pressure is on! As the VP—who, by the way, doesn’t care about power failures because the company spent thousands of dollars on that UPS—asks when he can access those figures, your phone dies.

You decide to charge your phone off the laptop. That way, you can also use the tethered cellular connection to log into the DR site. But the signal in this area is weak, so you must move to the window on the other side of the office. As you stand up to move, your laptop decides that it’s time to install operating system updates, because it’s now after 5 p.m. Eventually, though, the laptop cancels the updates, because there’s no Internet access.

You move to the new location, where the signal is stronger. After an agonizing few minutes, your phone finally starts. Finally! You connect to your offsite datacenter through a Remote Desktop Protocol (RDP) session. It takes three attempts, though, because you had forgotten that RDP to this server works only with the administrator user account.

The SQL Server instance has its own service account, so you must download and use psexec to run SQL Server Management Studio (SSMS) as that service account in interactive mode. First, though, you must change a registry entry to allow that user to use interactive login. You check the backup folder. The latest log file is from 4:30 p.m. Great.

That means the 4:45 p.m. backup didn’t copy over. Oh, it’s because the drive is full. That must have been what the email warning was about.

After clearing out some files that another colleague had put on the drive temporarily, you write the script you’ve been meaning to write to restore the database because you didn’t have time to set up log shipping.

You export the backup directory listing to a text file and begin looking for the latest full backup, differential backup, and transaction log backup files. But now you see that the last differential backup doesn’t make sense, because the size is all wrong.

You remember that one of your developers made a full backup of the production database on Monday evening, but didn’t use the COPY_ONLY option, and you don’t have access to that file. The latest differential file is useless. You must start from Sunday’s full backup file, and then use Monday afternoon’s differential backup and all transaction log files since then. That’s more than 400 files to restore.

Eventually, with a bit of luck and text manipulation, you begin running the restore script. One particular log file takes a very long time. In your panicked state, you wonder whether it has somehow become stuck. After a minute or two of clicking around, you realize SQL Server had to grow the transaction log of the restored database because it was replaying that annoying index rebuild script that failed on Tuesday morning and needed to roll back.

Finally, at 6:33 p.m., your offsite database is up and running with the latest database backups, up to and including the one from 4:30 p.m. Just then, the lights come back on in the office. The power failure that affected the block where your office is has been resolved.

You need to do a full DBCC CHECKDB of the production server as soon as it starts, but it takes hours because the server is five years old and was installed with ECC RAM (which is useful but causes longer bootup times). This will push you out of the two-hour RTO that you and your boss agreed to, so you go with a failover approach.

You update the connection settings in the application to point to the offsite datacenter. Just then, your phone dies once more, but at least the office has power to charge everything again. You send the VP an email to say the reports should be working. Your cellular data bill is coming out of the expenses for his trip, you tell yourself as you pack up to go home.

As you walk to the bus stop, it occurs to you that the files you cleared out to free up drive space probably included the full database backup created by the developer from Monday night, and that you might have saved some time by checking them first.

Define acceptable data loss: RPO

When disaster strikes, you might lose a single byte in a single row in a single 8-KB data page due to memory or drive corruption. How do you recover from that corruption? Worse, what happens if you lose an entire volume or drive, the storage array, or even the entire building?

The RPO should answer the question, “How much data are you prepared to lose?” The RPO is usually measured in seconds or minutes. Your job is to deliver a technical implementation that meets this objective, perhaps by ensuring there is less time between the last known good backup and the moment of the point of failure. To achieve this, you must consider whether your backups are being done correctly, regularly, and copied offsite securely and in a timely manner.

In the nightmarish scenario we just laid out, the organization decided that losing 15 minutes’ worth of data was acceptable, but ultimately data changes over 27 minutes were lost. This is because the drive on the server at the DR site was full, and the most recent backup did not copy over. To satisfy a 15-minute window, the transaction log backups would need to be created more frequently, as would the offsite copy. Ideally, backups are copied offsite immediately after they are taken.

If the organization requires zero data loss, the budget must increase significantly to ensure that, whatever unplanned event occurs, SQL Server’s memory and transaction log remain online, and that all backups are working and are securely copied offsite as soon as possible.

Define acceptable downtime: RTO

Time is money. Every minute an organization is unable to work has a cost, and lost productivity adds up quickly. The RTO is the amount of time by which you need to have everything up and running again after a disaster. You might orchestrate a failover to your DR site in another building, or a manual failover using log shipping. The RTO is usually measured in hours.

In our disaster scenario, the RTO was two hours. Our intrepid but woefully unprepared and accident-prone DBA barely made it. A few factors acted against the plan (if it could be called a plan).

If an organization requires zero downtime, the budget increases exponentially. This is where HA and DR technologies combine to support the requirement.

Note

Different types of adverse events may have different RPOs and RTOs. A failure of a single drive in the enterprise SAN shouldn’t come with any data loss or downtime. A power failure in a branch office might result in downtime, but no data loss. The loss of an entire facility due to an unforeseen natural disaster may require the organization to accept both data loss and downtime.

Different applications may have different RTOs as well. The e-commerce system might have a dramatically smaller RTO than the business travel expense tracking system. Regardless, application leadership should be aware of the RTO and RPO their budget is buying.

Establish and use a runbook

When panic sets in, you need a clear set of instructions to follow, just like our deer-in-the-headlights DBA in our fictional scenario. This set of instructions is called a runbook.

The runbook is one of many documents that make up a DR plan. It is part of the BCP. It covers the steps necessary for someone (including your future self) to bring the databases and supporting services back online after a disaster. In the event that you or your team members are unavailable or incapacitated, the document should be accessible and understandable to another DBA who has appropriate access but does not have intimate knowledge of the environment.

From our example scenario, issues like the RDP user account not being able to log in to SSMS, downloading the requisite tools, knowing to skip an out-of-band differential backup, and so on would not be immediately obvious to many people. Even the most experienced DBA in a panic will struggle with thinking clearly.

The level of detail in a runbook is defined by the complexity of the systems that need recovery and the time available to bring them back again. Your organization might be satisfied with a simple Microsoft Excel spreadsheet containing configurations for a few business-critical systems. Or it might need something more in-depth, complete with screenshots, links to vendor documentation and downloads, and expected outcomes at each step. In either case, the runbook should be updated as the system changes and stored in a version control system (which itself should be backed up properly). It is not unreasonable to have a paper copy, too, as you may not have access to all drives in the case of multiple outages or if you have to walk someone else through the process over the phone in the remote DR location.

The rest of this chapter describes how SQL Server provides backup and restore features to help you develop a recovery strategy that is most appropriate for your environment, so that when your organization wants to produce a BCP, you have sufficient knowledge to guide an appropriate and achievable technical response.

Most important, you must be able to rehearse a DR plan. The runbook won’t be perfect, and rehearsing scenarios will help you produce better documentation. That way, when disaster does strike, even the most panicked individual will be able to figure things out. Many organizations schedule regular walk-throughs and practice scenarios for DR. Some even perform a failover from one datacenter to another each month to rehearse their failover automation. When it is important to your business, you must train for it like an athlete. Financial organizations might check backups daily, perform system checks monthly, conduct walk-throughs of the DR plan quarterly, and perform full system failover annually.

Ransomware attacks

Ransomware is a kind of malware designed to deny a user or organization access to the files on their computer by encrypting them and demanding a ransom payment to decrypt them. Unfortunately, businesses have no guarantee the criminals will decrypt the files after the ransom is paid; those same criminals might strike again, or the decrypted files might not be fully usable.

While organization leadership might feel that it will never happen to them, the reality is it can and does happen to everyone. In a 2020 survey of 600 US companies, 68 percent said they had experienced a ransomware attack and paid the ransom. Another 10 percent said they were infected but did not pay any money, while only 22 percent were never infected at all.

Microsoft issues an annual Digital Defense Report. It is an interesting if not unnerving read. One of the things to learn from the report is that novel or zero-day vulnerabilities are not required to successfully attack your infrastructure. “Most methods still rely on unpatched edge applications, lateral movement via connected drives, and weak credentials on available services” (Microsoft Digital Defense Report 2021, October 2021, https://query.prod.cms.rt.microsoft.com/cms/api/am/binary/RWMFIi). The report also emphasizes that prompt patching is still a critical activity: “Identification and rapid exploitation of unpatched vulnerabilities has become a key tactic. Rapid deployment of security updates is key to defense” (Microsoft Digital Defense Report 2022, November 2022, https://query.prod.cms.rt.microsoft.com/cms/api/am/binary/RE5bUvv).

Obviously, cybersecurity is not just a data platform topic, but databases are a prize target for ransomware and data exfiltration attacks. As noted in the report, “Maintaining a strong baseline of IT security hygiene through prioritized patching, enabling anti-tamper features, using attack surface management tools like RiskIQ to get an outside-in view of an attack surface, and enabling multifactor authentication across the full enterprise have become baseline fundamentals to proactively defend against many sophisticated actors” (Microsoft Digital Defense Report 2022, November 2022, https://query.prod.cms.rt.microsoft.com/cms/api/am/binary/RE5bUvv). RiskIQ’s suite of cybersecurity tools was acquired by Microsoft in 2021, and is now called Microsoft Defender Threat Intelligence and Microsoft Defender External Attack Surface Management.

Perpetrators are referred to as bad actors. They get inside systems in much the same ways as malware and viruses: through social engineering techniques, or through bribery of unwitting or coerced insiders. (We discuss password hashing in Chapter 13, “Protect data through classification, encryption, and auditing.”)

Once a bad actor gains access to an organization’s network, they begin encrypting data, which causes applications to crash and go offline. A ransom is requested, often in a cryptocurrency that is impossible to trace; when the ransom is paid, the data is unencrypted . . . maybe.

Many organizations protect their assets by categorizing their risks and by having security teams assess the threats and methodologies attackers use. They manage cybersecurity with a focus on detection-based defenses like firewalls and antimalware software. But the reality is, there is no perfect ransomware protection, only resilient recovery. The best response to a ransomware attack is a solid strategy to restore your data with very little downtime.

How do you restore if your data has been encrypted? This is where resilient backups are critical: backups that a bad actor with administrative credentials cannot corrupt or delete. A trusted immutable backup is needed. It is the key to your recovery, and it is the one element that all backup software and hardware vendors have in common.

Note

An immutable backup is a way of protecting data that ensures the data is fixed, unchangeable, and can never be deleted, encrypted, or modified.

Any backup is only a good backup if it is usable. You must be able to restore it and make sure it is not corrupt. Some software vendors do this automatically, but you should not rely on them. If you are a DBA, your first and most important responsibility is to be able to restore a database, with no data loss, within the RTO window.

Immutable copies do not just protect against ransomware attacks. They also protect against accidental deletion and other human errors. Before ransomware attacks became a widespread issue, organizations would restore less than 10 percent of their backed-up production data over the course of a year. That is now changing, making it even more vital to ensure data is valid, correct, and accessible.

Understand different types of backups

There are three different backup types:

  • Full

  • Differential

  • Transaction log

To recover a database, you need at least a full database backup. Differential backups are optional. They only include the data pages that have changed since the last full backup. Differential backups can reduce the amount of time required to restore a database. Transaction log backups are incremental backups that enable point-in-time restores. We talk more about each of these backup types in the subsections that follow.

You can also create file-level and filegroup-level backups, as well as partial backups that omit read-only filegroups, to allow a more controlled procedure when restoring. This is especially useful for very large databases (VLDBs).

  • You can read more about the files that make up a SQL Server database in Chapter 3.

A key part of understanding backups and their ability to restore is understanding recovery models. After these introductory paragraphs, we discuss recovery models first. The backup types, along with partial database backups and backup options, are covered afterward.

Regardless of the type of backup, it will always contain the active portion of the transaction log, including relevant LSNs. This ensures full transactional consistency when the backup is restored.

Every SQL Server native backup also contains a header and a payload. The header describes the backup device, what type of backup it is, backup start and stop information (including LSN information), and information about the database files. The payload is the content of the data or transaction log files in the backup. If transparent data encryption (TDE) or backup encryption was enabled, the payload is encrypted.

An overview of SQL Server recovery models

A key part of understanding backups and their ability to restore is grasping recovery models. In SQL Server, a database’s recovery model determines how the transaction log is maintained, which affects your options for backups and restores. SQL Server supports three recovery models:

  • Full. With this model, transactions are logged and kept at least until the transaction log has been backed up. This model allows a full point-in-time recovery. Full, differential, and transaction log backups can be created. (We talk more about these three types of backups shortly.) The transaction log must be regularly backed up to avoid continuous growth. All database changes are fully logged and can be replayed.

  • Bulk-logged. This model reduces the amount of transaction log used for certain bulk operations. SQL Server only logs what the Database Engine needs to undo the bulk operation, but the bulk operation cannot be replayed. Non-bulk operations are logged and maintained as in the full recovery model. This model can allow a point-in-time recovery only if no bulk-logged operations are in that portion of the transaction log backup. Full, transaction log, and differential backups can be created. Here too, the transaction log must be backed up to avoid continuous growth.

  • Simple. The transaction log is truncated (cleared) automatically at each checkpoint, so log records are kept only until the changes they describe have been written to the data file(s) on disk. Only full and differential backups can be created.

These models provide a high level of control over the types of backups, and thus restore options, available to your databases. We talk more about each of these models in the subsections that follow.

Note

These are called recovery models. If you see the term recovery mode, it is incorrect.

You can change the recovery model of a database in SSMS in Object Explorer, or by using the following Transact-SQL (T-SQL) statement and choosing the appropriate option in the square brackets:

ALTER DATABASE <dbname> SET RECOVERY [ FULL | BULK_LOGGED | SIMPLE ];
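To confirm the recovery model currently in effect for each database on an instance, you can query the sys.databases catalog view:

SELECT name, recovery_model_desc
FROM sys.databases;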
Full recovery model

For databases that require point-in-time recovery—which is the case for most business-critical systems—we recommend the full recovery model. The recovery model for new databases is based on the recovery model for the model database. By default, the model database is in the full recovery model.

In this recovery model, after the first full backup takes place, the virtual log files in the transaction log remain active and are not cleared until a transaction log backup writes these log records to a log backup. Only then will the log be truncated (cleared). Keep in mind that a truncated log file does not return unused space to the operating system.

Assuming you implement a process to ensure that these backups are securely copied offsite as soon as they are created, and that you regularly test these backups, you can use them to restore your database in the event of a disaster. Provided the right circumstances are in play, you might even be able to create a tail-log backup to achieve zero data loss if that data has been committed and made durable.

  • Tail-log backups are defined in the “Transaction log backups” section later in this chapter.

  • You can read more about durability, including delayed durability, in Chapter 2.

Bulk-logged recovery model

With the bulk-logged recovery model, typical commands are logged as they are in the full recovery model, but bulk operations are minimally logged. This reduces the size of the transaction log records and subsequent log backups. The downside is that this model eliminates the option of replaying certain operations from a transaction log backup, including BULK INSERT, INSERT … SELECT, SELECT … INTO (all using the TABLOCK hint), and BCP operations. Certain indexing operations are also minimally logged. This affects your ability to restore a database to a point in time if that time is included in a transaction log backup that includes bulk-logged operations.

Note

Minimal logging is not supported for memory-optimized tables.

There is a way to achieve near-point-in-time recovery. This allows a more flexible recovery strategy than the simple recovery model (more on this in the next section), without generating large transaction logs for bulk operations. For example, suppose you want to use the bulk-logged recovery model to perform minimally logged operations, without breaking the log backup chain. Here’s how it works:

Note

Your database must be in the full recovery model before the bulk-logged operation is performed.

  1. You create a transaction log backup. (Again, we talk about these momentarily.)

  2. You switch to the bulk-logged recovery model and perform the bulk-logged operations.

  3. You immediately switch the database back to the full recovery model.

  4. You back up the log again.

This process ensures that the backup chain remains unbroken and allows point-in-time recovery to any point before or after the bulk-logged operation (but not to a point within the log backup that contains it). The Microsoft documentation notes that the bulk-logged recovery model is intended specifically for this scenario: temporarily replacing the full recovery model during bulk operations.
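The following is a minimal sketch of that sequence for the WideWorldImporters sample database, assuming the database starts in the full recovery model and the backup paths exist:

-- 1. Back up the log while still in the full recovery model
BACKUP LOG WideWorldImporters
    TO DISK = N'C:\SQLData\Backup\WWI_LOG_before_bulk.trn';
-- 2. Switch to bulk-logged and perform the minimally logged operation
ALTER DATABASE WideWorldImporters SET RECOVERY BULK_LOGGED;
-- ... perform the bulk operation here ...
-- 3. Switch back to the full recovery model immediately afterward
ALTER DATABASE WideWorldImporters SET RECOVERY FULL;
-- 4. Back up the log again to close the bulk-logged window
BACKUP LOG WideWorldImporters
    TO DISK = N'C:\SQLData\Backup\WWI_LOG_after_bulk.trn';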

Simple recovery model

Databases in the simple recovery model can be restored only to the point in time when a backup was completed. Point-in-time recovery is not supported. In the simple recovery model, the transaction log is truncated (cleared) automatically whenever a checkpoint occurs, so committed transactions are retained in the log only until their changes have been written to the data file(s) on disk.

Databases that use the simple recovery model can use full and differential backups. This recovery model is better suited to development databases, databases that change infrequently, and databases that can be rebuilt from other sources.

Full backups

A full database backup is a transactionally consistent copy of the entire database. The payload of this type of backup includes every 8-KB data page in every database file in all filegroups, FILESTREAM and memory-optimized files, and the portion of the transaction log that is necessary to roll forward or roll back transactions that overlapped with the backup window.

  • You can read more about the active portion of the transaction log in Chapter 3.

You can perform full backups on databases in all recovery models, and you can compress them. Since SQL Server 2016, you can also compress backups of databases encrypted with TDE.

With a minimal amount of T-SQL code, you can perform a full backup to a backup disk target. For example, you can use the following code to back up the WideWorldImporters sample database on the default instance of a machine called SERVER to a local drive (the drive path must exist):

BACKUP DATABASE WideWorldImporters
TO DISK = N'C:\SQLData\Backup\SERVER_WWI_FULL_20221218_210912.BAK';
GO
Transaction log backups

Transaction log backups are incremental backups of a database. In the full recovery model, all transactions are fully logged. This means you can bring a database back to a point in time on or before that transaction log backup was created, provided the restore is successful. These backups allow for a recovery to any moment in time in the sequence (in other words, the backup chain—more on that in a moment).

In a transaction log backup, the active portion of the transaction log is backed up. Transaction log backups apply only to databases in the full and bulk-logged recovery models. Databases in the full recovery model can be restored to a point in time, whereas databases in the bulk-logged recovery model can be restored to a point in time if the transaction log does not contain bulk-logged operations. The transaction log does not contain the information necessary to replay bulk operations in the bulk-logged recovery model.
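For example, the following takes a transaction log backup of the WideWorldImporters sample database to a local backup disk (the path is an assumption and must exist):

BACKUP LOG WideWorldImporters
TO DISK = N'C:\SQLData\Backup\SERVER_WWI_LOG_20221218_211500.TRN';
GO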

Differential backups

A differential backup is a convenience feature to reduce the number of transaction log backups (and time) required to restore a database to a point in time. A differential backup is always based on the last full database backup that wasn’t taken with the COPY_ONLY option. Differential backups contain all changed extents, FILESTREAM files, and memory-optimized data files since the last full backup. They cannot be restored on their own. To restore a database using a differential backup, you need that differential backup’s base full backup.

In many cases, a differential backup is much smaller than a full backup. This allows for a more flexible backup schedule. You can take a full backup less frequently and take the differential backups more regularly, taking up less space than full backups would.

Think back to Chapter 3, where we looked at extents. As a reminder, an extent is a 64-KB segment in the data file comprising a group of eight physically contiguous 8-KB data pages. After a default full backup completes (without the copy-only option, which we cover later), the differential bitmap is cleared. All subsequent changes in the database, at the extent level, are recorded in the differential bitmap. When the differential backup runs, it looks at the differential bitmap and backs up only the extents that have been modified since the full backup, along with the active portion of the transaction log. This is quite different from a transaction log backup, which records every change in the database, even if the change is made repeatedly to the same rows in the same tables. Thus, a differential backup is not the same thing as an incremental backup.

Even though you cannot restore a database to a point in time (or LSN) that occurs within the differential backup itself, differential backups can vastly reduce the number of transaction log files required to effect those same changes.

Differential backups apply to databases in the full, bulk-logged, and simple recovery models.
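A differential backup uses the same BACKUP DATABASE statement as a full backup, with the DIFFERENTIAL option added; for example (the path is an assumption and must exist):

BACKUP DATABASE WideWorldImporters
TO DISK = N'C:\SQLData\Backup\SERVER_WWI_DIFF_20221218_220000.BAK'
WITH DIFFERENTIAL;
GO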

Caution

If a full backup is taken out-of-band without the COPY_ONLY option, this will affect subsequent differential backups. In that case, you will be restricted to using transaction log backups exclusively to restore the backup chain. If you want to create a full backup of a database without affecting the differential backup schedule, always use the COPY_ONLY option. The copy-only backup mechanism is covered later in this chapter.

The backup chain

A backup chain starts with a full backup, followed by a series of differential and/or transaction-log backups. You can combine these into a recovery sequence to restore a database to a particular point in time after the full backup or to the time of the latest backup, whichever is required. Databases in the full recovery model can be restored to a point in time because transactions are fully logged in that recovery model.

Note

You can also restore a database in the bulk-logged recovery model to a point in time, provided the transaction log backup does not contain bulk-logged operations up to that point in time.

Figure 10-1 illustrates a backup chain that starts with a full backup, which contains the most recent LSN of the active portion of the transaction log at the time that backup finished, followed by multiple transaction log backups and differential backups based on that full backup.

Figure 10-1 A backup chain: a full backup followed by differential and transaction log backups that overlap. Differential backups can reduce the number of transaction log backups needed when performing a point-in-time recovery.

You can use a combination of the most recent differential backup (which must be based on that same full backup) and any additional transaction log backups to produce a point-in-time recovery. If you do not have a differential backup, or if the point in time you want to restore to is before the end of the differential backup, you must use transaction log backups. Either option will work, provided the LSNs required in the sequence are contained in each of those backups.

Until you run the very first full backup on a database using the full or bulk-logged recovery model, the database is pseudo-simple—that is, it behaves as though it is in the simple recovery model. Active portions of the log are cleared whenever a database checkpoint is issued, and the transaction log remains at a reasonably stable size unless a large transaction causes it to grow. This may seem unexpected, but there’s no need to keep a transaction log that can’t be restored.

Remember: A database restore always begins with a full backup (piecemeal restores are an uncommon exception and are discussed later in this chapter). Until there is a full backup, there’s no need for historic transactions to be kept. Conversely, once the first full backup is created, less-experienced DBAs can be taken by surprise by the sudden and seemingly uncontrolled growth of the transaction log.

We recommend that you configure appropriate maintenance plans (including transaction log backups and monitoring) immediately after you create a new database. It is also a good idea to continue taking log backups even while a full backup is in progress. These log backups will ensure the backup chain is preserved even if the full backup fails or is cancelled.

Note

Backup chains can survive database migrations and upgrades, as long as the LSNs remain intact. This is what makes certain HA features possible in SQL Server.

  • You can read more about designing an appropriate backup schedule in the “Create and verify backups” section later in this chapter. For more on maintenance plans, read Chapter 9, “Automate SQL Server administration.”

Fix a broken backup chain with differential backups

A backup chain is broken when a database is switched from the full recovery model to the bulk-logged or simple recovery model for any reason (such as shrinking the transaction log file during an emergency to reclaim storage space). A backup chain can also be broken if a full database backup is taken “out-of-band” and is then discarded or becomes unavailable when a restore is required.

After you switch back to the full recovery model, you can restart the log backup chain—without having to perform a full database backup—by taking a differential backup. As long as you use this or a more recent differential backup in a later recovery, along with the accompanying transaction log backups, the backup chain will be repaired.

If you keep the database in the bulk-logged recovery model, you can restore the backup chain by taking a differential backup after bulk-logged operations. Combined with the transaction log backups taken after the differential backup, you can restore the database back to a point in time after the bulk-logged operation, or to a point in time before the transaction log backup containing a bulk-logged operation.
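As a sketch, after switching a database back to the full recovery model, a single differential backup restarts the log backup chain without requiring a new full backup (the path is an assumption):

ALTER DATABASE WideWorldImporters SET RECOVERY FULL;
GO
-- This differential backup is still based on the last full backup and repairs the chain
BACKUP DATABASE WideWorldImporters
TO DISK = N'C:\SQLData\Backup\SERVER_WWI_DIFF_chain_repair.BAK'
WITH DIFFERENTIAL;
GO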

File and filegroup backups

You can take a more granular approach by backing up individual data files and filegroups; these backups can be full or differential as well. These options are available in the SSMS user interface; alternatively, you can use the official documentation to create appropriate T-SQL statements.
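For example, a full backup of a single filegroup might look like the following, assuming the database contains a filegroup named USERDATA (the filegroup name and path are assumptions):

BACKUP DATABASE WideWorldImporters
FILEGROUP = N'USERDATA'
TO DISK = N'C:\SQLData\Backup\SERVER_WWI_FG_USERDATA.BAK';
GO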

Caution

If a single file in a filegroup is offline (for instance, during a restore), the entire filegroup is also offline. This affects backups.

Partial backups

Because read-only filegroups do not change, it does not make sense to include them in ongoing backup processes. Primarily used for VLDBs that contain read-only filegroups, partial backups exclude those read-only filegroups, as required.

Partial backups contain the primary filegroup, any read-write filegroups, and one or more optional read-only filegroups. Partial backups can also contain a single file from a filegroup, but be aware that a filegroup cannot be brought online until all the files are available.
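A partial backup is expressed with the READ_WRITE_FILEGROUPS option, which backs up the primary filegroup and all read-write filegroups while skipping read-only filegroups (the path is an assumption):

BACKUP DATABASE WideWorldImporters
READ_WRITE_FILEGROUPS
TO DISK = N'C:\SQLData\Backup\SERVER_WWI_PARTIAL.BAK';
GO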

File backups

You can use file backups to restore individual files if they become corrupt. This makes restoring easier, because for VLDBs, it takes much less time to restore a single file than the entire database. Unfortunately, file backups increase complexity due to increased administration of the additional file backups over and above the full, differential, and transaction log backups. This overhead extends to recovery script maintenance.

Additional backup options and considerations

This section includes several options and special considerations for creating backups in SQL Server.

Database snapshots

A database snapshot is a read-only static view of a database at the point in time the snapshot was created. A database snapshot isn’t a backup and shouldn’t be considered as such, but it can play a role in disaster recovery when used tactically.

You could, for example, take a database snapshot immediately before making a major change. The snapshot can serve as a read-only copy of the database before the major change, as well as a point to revert to if the major change has unexpected consequences.

To create a database snapshot, you create a special database snapshot file, commonly with the extension .ss. For example, to create a database snapshot of the WideWorldImporters database with a name suffix representing the date and time November 2, 2022, at 10:45 a.m. (202211021045):

CREATE DATABASE WideWorldImporters_202211021045 ON
( NAME = WideWorldImporters, FILENAME =
'C:\Program Files\Microsoft SQL Server\MSSQL16.MSSQLSERVER\MSSQL\Data\WideWorldImporters_data_202211021045.ss' )
AS SNAPSHOT OF WideWorldImporters;

The WideWorldImporters_202211021045 database can now be connected to and read from as a read-only copy of the database at a point in time.

Before reverting to the snapshot, you must remove any other database snapshots for the database. To revert the entire WideWorldImporters to the database snapshot, the T-SQL command is simple:

RESTORE DATABASE WideWorldImporters FROM
DATABASE_SNAPSHOT = 'WideWorldImporters_202211021045';

You can use database snapshots to ease administration when testing new functionality to reduce the time required to restore to a point in time, but they carry the same risks as regular database files. Snapshots are not guaranteed backups, and you should not use them to replace native SQL Server backups.

Database snapshots are unrelated to other features or concepts that share the same vocabulary, such as snapshot backups, snapshot isolation of transactions, or snapshot replication. For more on database snapshots, see https://learn.microsoft.com/sql/relational-databases/databases/database-snapshots-sql-server.

Backup encryption

Since SQL Server 2014, it has been possible to encrypt database backups using an asymmetric key pair. This is different from the encryption provided for backups with TDE enabled. The key difference is that to restore an encrypted backup, you need the asymmetric key, not the database master key (DMK) or the certificate encrypted by the DMK.

Like any asymmetric encryption process, you need a cipher (the encryption algorithm) and an asymmetric key or certificate. Supported ciphers are Advanced Encryption Standard (AES), with key sizes of 128, 192, and 256 bits, and 3DES (also known as Triple DES). As discussed in depth in Chapter 13, AES is a safer and faster cipher than 3DES. You must back up and store the key or certificate in a secure location.
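As a sketch, the following creates an encrypted full backup using AES-256, assuming a server certificate named BackupCert has already been created in the master database and safely backed up (the certificate name and path are assumptions):

BACKUP DATABASE WideWorldImporters
TO DISK = N'C:\SQLData\Backup\SERVER_WWI_FULL_ENCRYPTED.BAK'
WITH ENCRYPTION (ALGORITHM = AES_256, SERVER CERTIFICATE = BackupCert);
GO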

  • For more about security and encryption, see Chapter 13.

As noted in Chapter 3, you can also compress backups. This is recommended unless the database itself is already compressed, in which case you should first evaluate whether your case will benefit from backup compression. SQL Server 2016 introduced support for backup compression of databases encrypted with TDE.

Note

The MAXTRANSFERSIZE data transfer option with the BACKUP statement specifies the largest possible size of transfers between SQL Server and the backup media (in bytes). Backup compression for a database encrypted with TDE does not kick in unless the BACKUP statement contains a MAXTRANSFERSIZE value greater than 65,536. If MAXTRANSFERSIZE is not specified or is set to its minimum value of 65,536 bytes (64 KB), the backup will not be compressed. Valid values for MAXTRANSFERSIZE are multiples of 64 KB, up to 4 MB.
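For a database encrypted with TDE, a compressed backup therefore needs an explicit MAXTRANSFERSIZE larger than 65,536 bytes, as in this sketch (the database name and path are placeholders):

BACKUP DATABASE WideWorldImporters
TO DISK = N'C:\SQLData\Backup\SERVER_WWI_FULL_COMPRESSED.BAK'
WITH COMPRESSION, MAXTRANSFERSIZE = 131072; -- 128 KB
GO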

Backup checksums

SQL Server can perform optional checksum verification on database backups. By default, backups do not perform a checksum. You can change this behavior either by enabling Trace Flag 3023 or by specifying the checksum option in the backup job or T-SQL script that takes the backups. We recommend you enable backup checksums where possible, ideally using the trace flag, so it is the default behavior.

Without backup checksum enabled, no validation is performed on data pages or log blocks. This means any logical corruption will also be backed up without showing any of the errors you might see with a DBCC CHECKDB operation. This is to allow for scenarios in which you want to back up a corrupt database before attempting to fix the corruption. Alternatively, you can create the backup with checksum enabled but add the CONTINUE_AFTER_ERROR option.
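For an individual backup, you enable checksum verification with the CHECKSUM option; CONTINUE_AFTER_ERROR can be added when you deliberately want the backup of a corrupt database to proceed despite checksum failures (the path is an assumption):

BACKUP DATABASE WideWorldImporters
TO DISK = N'C:\SQLData\Backup\SERVER_WWI_FULL_CHECKSUM.BAK'
WITH CHECKSUM; -- optionally add CONTINUE_AFTER_ERROR
GO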

  • To read more about recovering from corruption, see Chapter 8.

With backup checksum enabled, a checksum is calculated over the entire backup file. Additionally, the page checksum on every 8-KB data page (for both page verification types of checksum or torn-page detection) and the log block checksum from the active portion of the log will be validated.

Caution

Physical corruption, in which the data cannot be read from the drive, including corruption in memory-optimized filegroups, will cause the backup to fail.

The backup checksum can significantly increase the time for a backup to run, but adds some peace of mind, short of running a recommended DBCC CHECKDB on a restored database.

Copy-only backup

When a full backup runs with the default settings, a reserved data page known as the differential bitmap (discussed previously in this chapter) is cleared. Any differential backups that are created on the database after that will be based off that full backup.

You can change the default behavior by using the COPY_ONLY option, which does not clear the differential bitmap. Copy-only backups are useful for creating out-of-band backups without affecting the differential backup schedule. In other words, only differential backups are affected by the lack of the COPY_ONLY option. Transaction log backups, and thus the backup chain, are not affected.
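An out-of-band full backup that leaves the differential bitmap untouched simply adds the COPY_ONLY option (the path is an assumption):

BACKUP DATABASE WideWorldImporters
TO DISK = N'C:\SQLData\Backup\SERVER_WWI_FULL_COPYONLY.BAK'
WITH COPY_ONLY;
GO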

Memory-optimized tables

Standard backups include memory-optimized tables. During the backup process, a checksum is performed on the data and delta file pairs to check for corruption. Any corruption detected in a memory-optimized filegroup will cause a backup to fail, requiring you to restore from the last known good backup.

Remember: The storage requirements for a memory-optimized table can be much larger than its usage in memory, which will affect the size of your backups.

System database backups

The SQL Server system databases contain important information about your system, and some of these should be backed up as well.

A system database that does not need to be backed up is tempdb. The tempdb database is re-created every time your SQL Server instance restarts and cannot be backed up or restored.

The model database typically does not contain data or database objects; this system database serves only as a template. It should be backed up for consistency.

The master system database holds all the instance-level information and should be backed up on regular intervals. You should also take a full backup of the master database immediately before and after any significant changes, such as the installation of a cumulative update.

Another system database that should be backed up is msdb. This system database is used by the SQL Server Agent for job control. This database also contains the backup and restore history tables.

In addition to these four system databases (master, msdb, model, and tempdb) that are present on every SQL Server instance, you might also have a distribution system database. This database exists if you have configured replication, and should also be backed up regularly, on the same schedule as full backups for replicated databases.

Back up non-database items

Successfully restoring data may require more than just the backup files. For example, restoring a database with TDE enabled will require the original certificate used to set up encryption. If you’re restoring to a different server, you must restore the certificate, including its private key, before attempting to restore the database.

Items you should consider backing up in addition to the database include the following:

  • Certificates

  • Logins

  • SSIS packages not stored in the SSISDB, but in the file system

  • Flat files, configuration files, or other custom-developed files necessary for jobs and business applications

Availability and security of backup media

Your backups must be safeguarded from corruption or deletion and from the disasters you’re trying to defend against. This means backup files should be available in multiple locations, including both on-premises and offsite. It’s not wise to keep backups only on-premises because a disaster or ransomware attack might affect your entire site. However, it’s advisable to keep one backup copy locally, because that copy is easier and quicker to access.

The devices that store your backups must be secured with appropriate access controls, physical and virtual. Authorized individuals, and only authorized individuals, must be able to access the backups to implement the data recovery plan. The backups contain the same sensitive information as your databases. Unauthorized access may lead to a security breach. (Unfortunately, attackers know organizations sometimes neglect to protect backups as well as data stores.) However, you must also plan for the eventuality that a disaster renders your identity and access management (IAM) infrastructure unavailable.

Similar considerations apply for backups to Azure Storage. You must verify that access to the storage account is limited to authorized individuals. Azure provides mechanisms for securing storage accounts, which are outside the scope of this text.

Understand backup devices

SQL Server writes database backups to physical backup devices. These storage media might be virtualized, but for the purposes of this section, they are considered physical. They include disk, tape, and URL. (Tape backup is now deprecated, and we will not cover it in detail.)

Caution

The option to back up SQL Server databases to tape will be removed from a future version of SQL Server. When creating a recovery strategy, use disk or URL. You should change any existing recovery strategies that involve tape to use another media type.

Back up to disk

SQL Server backups are most commonly stored directly on a local volume or network path, referred to as the backup disk. A backup disk contains one or more backup files, and each file contains one or more database backups.

A database backup might also be split across multiple files. To back up VLDBs in less time, striping the backup across multiple files can be a valuable strategy. Be careful, however, because you must have all files in the set to successfully restore.
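Striping is expressed by listing multiple destinations in the same BACKUP statement; all of the resulting files are required to restore (the paths are assumptions):

BACKUP DATABASE WideWorldImporters
TO DISK = N'D:\Backup\WWI_FULL_1of3.BAK',
   DISK = N'E:\Backup\WWI_FULL_2of3.BAK',
   DISK = N'F:\Backup\WWI_FULL_3of3.BAK';
GO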

Back up to URL

Since SQL Server 2012 (with Service Pack 1 and Cumulative Update 2), you can back up your SQL Server database directly to Azure Blob Storage. This is made possible by using an Azure Storage account URL as a destination.

SQL Server 2016 added the ability to back up to Azure block blobs. This is now the preferred blob type when backing up to Azure Storage. When striping a backup across up to 64 blobs, block blobs support much larger backups and enable higher throughput than page blobs.

Note

Backing up to URL is especially useful for hybrid scenarios and for Azure SQL Managed Instance.

To back up a database to a block blob, you must provide a token for a shared access signature (SAS) for a standard Azure Storage account. (Premium storage accounts are not supported.) You can create a SAS using the Azure portal, PowerShell, the Azure Storage Explorer application, or other means. Before creating the SAS, you should define a matching stored access policy for the blob container that will store the backup files. This policy allows the SAS to be changed or revoked.

A SAS token represents the SAS. It is a query string that contains several elements that refer to the SAS. When creating a SAS token, you must specify the permissions that will be allowed using the SAS token. Read, write, delete, and list permissions are minimally required. The SAS token you obtain will begin with a question mark (?), but that question mark must be removed before SQL Server can successfully use the SAS token. (Readers versed in web technologies will understand that the question mark is the beginning of a URL query string and not part of the content.)

Note

A SAS must have an expiration date. Make sure the expiration date is sufficiently far in the future so your backups don’t start failing unexpectedly. Regardless of the expiration date of the SAS token, it’s a good idea to rotate the secret on a regular schedule. As with other certificates in your environment, mark your calendars in advance of the expiration date you choose.

The SAS token must be used to create a credential. The credential’s name is the URL to the Azure Storage container. After the credential is created, backing up the database to Azure Storage only requires specifying the TO URL clause, as illustrated in the upcoming sample.

The following sample creates a credential using a SAS token and backs up the SSIO2022 database to the SSIO2022 Azure Storage account. The sample uses the WITH FORMAT option to overwrite an existing backup at the same URL. Your use case may require you to keep historic backups, in which case you should probably generate the backup file name by including the date and time of the backup. If you keep historic backups, you must create a scheme to remove outdated backup files; the RETAINDAYS and EXPIREDATE options are not supported with the TO URL clause.

CREATE CREDENTIAL [https://ssio2022.blob.core.windows.net/onprembackup]
    WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
    -- Remember to remove the leading ? from the token
    SECRET = 'sv=2022-03-28&ss=…';
BACKUP DATABASE SamplesTest
    TO URL = 'https://ssio2022.blob.core.windows.net/onprembackup/db.bak'
    -- WITH FORMAT to overwrite the existing file
    WITH FORMAT;

For security reasons, the container holding your SQL Server database backups should be configured as Private. We recommend that you configure the storage account to require secure transfer, meaning only HTTPS connections are allowed. Secure transfer is required by default on all new Azure Storage accounts. You can further secure the SAS token by restricting the IP addresses that can connect to the Azure Storage service using the token. If you choose to do so, you must remember to obtain a new token and update the database credential when your organization’s outbound IP addresses change.

Note

Did you know that SSMS 18 and later support connecting to Azure Storage? While Azure Storage Explorer is a great tool for managing Azure Storage accounts, the option to connect to an Azure Storage account used for SQL Server backups directly from within SSMS can be a time-saver. You can access this functionality from the Back Up Database dialog box.

Backup and media sets

As noted, SQL Server backups are written to media types (devices)—namely tape, disk (which includes locally connected volumes, solid-state drives, and UNC network paths), and Azure Blob Storage. Each of these types has specific properties, including format and block size.

Media set

A media set is an ordered collection of a fixed type and number of devices (see Figure 10-2). For example, if you are using a backup disk, your media set will comprise a fixed number of one or more files.

Figure 10-2 A media set containing three media families spread over three devices: media family 1 contains backup device A, media family 2 contains backup device B, and media family 3 contains backup device C. Backup sets 1 and 2 span all three backup devices.

A backup will always contain at least one media set. With tape backup now deprecated, media sets with multiple devices are less common. Other factors to consider:

  • Backup window. The backup window might be too short to allow the backup to finish unless it’s striped across multiple files or URLs.

  • Backup to URL. The recommended block blobs in Azure Storage can hold 200 GB per blob at most. If your backup is larger than 200 GB, you’ll need to use multiple block blobs. Better backup performance may also be achieved using multiple block blobs, assuming your Internet upload speed is sufficient. We should note that concerns surrounding the integrity of striped backups are less acute in Azure Storage because you can, and should, configure the storage account with automatic redundancy, if not geo-redundancy.

SQL Server Enterprise edition supports mirroring media sets. A mirrored media set writes another copy of the backup. This feature provides you with a redundant copy of your backup on the same type of media. When the mirrored media sets are appropriately segmented on different storage hardware, this redundant copy increases the reliability of the backup, because an error in one backup set might not affect the mirrored copy.
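In Enterprise edition, a mirrored media set is created with the MIRROR TO clause, which writes an identical copy of the backup to a second device (the paths are assumptions):

BACKUP DATABASE WideWorldImporters
TO DISK = N'D:\Backup\WWI_FULL.BAK'
MIRROR TO DISK = N'\\BACKUPSERVER\SQLBackups\WWI_FULL.BAK'
WITH FORMAT;
GO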

Media family

In each media set, all backup devices used in that media set make up the media family. The number of devices in that set is the number of media families. If a media set uses three backup devices that are not mirrored, there are three media families in that media set.

Backup set

A successful backup added to a media set is called a backup set. Information about successful backups is stored in the header of the backup set.

You can reuse an existing backup set by adding new backups for a database to the end of that media. This grows the media by appending the backup. However, we do not recommend this practice, because the integrity of previous backups relies on the consistency of that media. An errant INIT option in the backup command could even accidentally overwrite existing backups in the backup set.

Back up to S3-compatible storage

New in SQL Server 2022 is Simple Storage Service (S3)–compatible object storage integration. This feature extends existing BACKUP TO URL and RESTORE FROM URL functionality to S3 URLs using a REST API. URLs pointing to S3-compatible resources are prefixed with s3:// to denote that the S3 API is being used.

Requirements for S3-compatible storage

URLs beginning with the s3:// prefix assume that the underlying protocol is HTTPS, which means SQL Server will require a TLS certificate. Additionally, you will need credentials (an Access Key ID and Secret Key ID) to access the S3-compatible endpoint. You cannot back up to S3-compatible storage unless a bucket has been created on the S3-compatible storage account. Creating a bucket is not possible from within SQL Server.
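
As a sketch of the credential requirement, the following creates a credential for a hypothetical S3-compatible endpoint and bucket; the endpoint, port, bucket, and key values are placeholders you must replace with your own:

CREATE CREDENTIAL [s3://s3.contoso.local:9000/sqlbackups]
WITH IDENTITY = 'S3 Access Key',
SECRET = '<AccessKeyID>:<SecretKeyID>';
GO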

Control backup size and speed

Your backup will use multiple parts—a maximum of 10,000 parts per URL, ranging in size from 5 MB to 20 MB, with a default of 10 MB. You can adjust this size using MAXTRANSFERSIZE. If a maximum of 10,000 parts at 20 MB per part is not sufficient, you can stripe your backup across multiple URLs, to a maximum of 64.

Note

If you back up to S3-compatible storage and modify the MAXTRANSFERSIZE setting, you must also enable backup compression.

If a backup to S3-compatible storage fails, any temporary files created during the backup process will not be automatically removed. You must remove them manually.
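
Putting these settings together, a backup to S3-compatible storage might look like the following sketch. The URL is hypothetical and assumes a credential like the one shown earlier exists; COMPRESSION is included because MAXTRANSFERSIZE is raised above its default:

BACKUP DATABASE WideWorldImporters
TO URL = 's3://s3.contoso.local:9000/sqlbackups/WideWorldImporters_FULL.bak'
WITH COMPRESSION, MAXTRANSFERSIZE = 20971520, STATS = 10;
GO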

Create and verify backups

You should completely automate your SQL Server backups, whether you use the built-in SSMS Maintenance Plan Wizard or a third-party solution (free or paid). Always ensure that backups are successful by observing that the backup files exist and that the backup task does not error out. Additionally, you must regularly test backups by restoring them, which you should also do with an automated process.

For example, the dbatools.io library of SQL Server administration cmdlets includes Test-DbaLastBackup, which automatically restores, runs a DBCC CHECKDB, and drops a database. Consider this tool or similar automation solutions. For more information, see https://docs.dbatools.io/Test-DbaLastBackup.html.

  • You can read more about maintenance plans in Chapter 9.

SQL Server Agent is an excellent resource for automating backups, and many third-party solutions use it, too—as does the Maintenance Plan Wizard.

Create backups

You can design a backup solution that satisfies a recovery strategy by using all or a combination of the available backup types. The backup types and corresponding code samples were covered earlier in this chapter.

The buffer pool (see Chapter 2) is not used for database backups. The backup buffer is a portion of memory outside the buffer pool, big enough to read pages from the data file and write those pages to the backup file. The backup buffer is usually between 16 MB and 32 MB in size. Memory pressure can reduce the backup and restore buffer sizes, causing backups and restores to take longer.

Caution

It is possible to increase the number of backup buffers (using the BUFFERCOUNT option) as well as the transfer size of each block to the backup media (using the MAXTRANSFERSIZE option) to improve throughput. However, we recommend this cautiously, and only in circumstances where the defaults calculated by SQL Server prove problematic. Allocating too many buffers or increasing the transfer size too much might cause out-of-memory exceptions during the backup operation, which will cause the backup to fail. While this is not a situation that would require a server restart, it does diminish the automation you set up for backups. You can read more about this possible issue at https://blogs.msdn.microsoft.com/sqlserverfaq/2010/05/06/incorrect-buffercount-data-transfer-option-can-lead-to-oom-condition.
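
If you do decide to experiment, the following sketch shows how these options are specified. The values are purely illustrative and are not recommendations:

BACKUP DATABASE WideWorldImporters
TO DISK = N'D:\Backup\WideWorldImporters_FULL.bak'
WITH BUFFERCOUNT = 50,        -- number of backup buffers
MAXTRANSFERSIZE = 4194304,    -- 4 MB per transfer, the maximum for disk targets
CHECKSUM, STATS = 10;
GO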

Verify backups

After you create a backup, we highly recommend that you immediately verify that the backup was successful. Although rare, corruption is always possible. Most of the time, it is caused by the storage layer (including device drivers, network drivers, and filter drivers), but it can also occur in non-ECC RAM or as the result of a bug in SQL Server itself.

Note

A filter driver is software that intercepts all drive reads and writes. This class of software includes defragmentation tools and security products like antivirus software. This is a good opportunity to remind you to exclude SQL Server files (data, logs, and backups) from antivirus scanners, and to note that defragmenting solid-state storage is a bad idea because it will dramatically reduce the lifespan of the drive.

There are two ways to verify a backup. You can probably guess that the most complete way is to restore it and use DBCC CHECKDB to perform a full consistency check on the restored database.

The other, quicker method is to use RESTORE VERIFYONLY. If you backed up your database using the checksum option (which is enabled by default on compressed backups), the restore will verify the backup checksum as well as the data page and log block checksums as it reads through the backup media. The advantage of RESTORE VERIFYONLY is that you do not need to allocate drive space to restore the data and log files, because the verification reads directly from the backup itself.
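
A minimal sketch, assuming the backup was taken with the checksum option and reusing the backup file from the restore examples that follow, looks like this:

RESTORE VERIFYONLY
FROM DISK = N'C:\SQLData\Backup\SERVER_WideWorldImporters_FULL_20220918_210912.BAK'
WITH CHECKSUM, STATS = 10;
GO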

Restore a database

To restore a database, you generally start with a full backup (piecemeal restore is covered in a later section).

If you plan to use differential and/or transaction log backups, you must use the NORECOVERY keyword for all but the last of the backup files you will restore.

  • You restore a database in the simple recovery model by using a full backup to begin, plus the most recent differential backup based on the full backup if it is available.

  • You can restore a database in the bulk-logged recovery model by using a full backup, along with the most recent differential backup based on that full backup, if available. Should you want to restore to a specific point in time for a bulk-logged database, this might be possible if no bulk-logged operations exist in the transaction log backups you use.

  • You can restore a database in the full recovery model to a point in time using a full backup, plus any transaction log backups that form part of the backup chain. You can use a more recent differential backup (based on that full backup) to bypass a number of those transaction log backups, where appropriate.

Caution

Differential and transaction log backups rely on a corresponding full backup. If the full backup is not available, the differential and transaction log backups are useless. All transaction log backups must be available at least to the point in time to which you’re trying to restore.

Each transaction log backup is replayed against the database being recovered, using the NORECOVERY option, as though those transactions are happening in real time. Each file is restored in sequential order up to the required point in time or until the last transaction log backup is reached, whichever comes first.

Note

When restoring a chain of transaction log backups, especially the first one in the sequence after a full or differential backup, it’s possible for the LSN of the transaction log backup to be earlier than the latest LSN of the full or differential backup that was restored. In most cases, you can ignore the error message that appears, because the next transaction log file in the restore sequence will usually contain the required LSN.

After the entire chain has been restored (indicated by the WITH RECOVERY option), only then does the recovery kick in (covered in some detail in Chapter 3). All transactions that are committed will be rolled forward, and any in-flight transactions will be rolled back.

Some examples of restoring a database are included in the section that follows.

Restore a database using a full backup

You can perform a database restore through SSMS, Azure Data Studio, or by using a T-SQL statement. In this example, only a full backup file is available to restore a database. The full backup comes from a different server, where the path of the original database is different, so the database’s files must be relocated (moved) on the new server with the MOVE option.

To see the progress of the restore, you can have progress statistics written to the messages output with the STATS option. If you omit a value, SQL Server reports progress approximately every 10 percent; the following example uses STATS = 5 to report every 5 percent. No statistics will be output until the files have been created on the file system.

RESTORE DATABASE WideWorldImporters
FROM DISK = N'C:\SQLData\Backup\SERVER_WideWorldImporters_FULL_20220918_210912.BAK'
WITH MOVE N'WideWorldImporters' TO N'C:\SQLData\WWI.mdf',
MOVE N'WideWorldImporters_log' TO N'C:\SQLData\WWI.ldf',
STATS = 5,
RECOVERY;
GO

The RECOVERY option (the default) at the end brings the database online immediately after the full backup has been restored. This prevents any further backups from being applied. If you want to restore a differential backup after this full backup, you must use the NORECOVERY option. Later, bring the database online only after restoring subsequent differential or transaction log backups. This is covered in the next section.

Restore a database with differential and log backups

Restoring using full, differential, and transaction log backups is more complicated, but you can still perform it through SSMS’s graphical wizard, Azure Data Studio, or by using a series of T-SQL statements.

For this scenario, we recommend creating your own automated scripts. For example, after every transaction log backup, you can use the information in the msdb database to build a script to restore the entire database to that point in time and then save the script in the same folder as the backup file(s).
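
As a starting point for such a script, the following query is a sketch against the msdb backup history tables; it lists the backups available for a database in the order in which they would be restored:

SELECT bs.backup_finish_date,
    bs.type,                    -- D = full, I = differential, L = transaction log
    bmf.physical_device_name
FROM msdb.dbo.backupset AS bs
INNER JOIN msdb.dbo.backupmediafamily AS bmf
    ON bs.media_set_id = bmf.media_set_id
WHERE bs.database_name = N'WideWorldImporters'
ORDER BY bs.backup_finish_date;
GO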

Note

When restoring a database using more than one backup type (full, plus differential and/or transaction log), each RESTORE statement that will be followed by another restore file must include a WITH NORECOVERY option. This prevents recovery from running until needed. You can either use the WITH RECOVERY option on the final file to run recovery and bring the database online or you can add an extra line to the end of the script as in the following example.

You restore a database by using the RESTORE command. Full and differential restores use the RESTORE DATABASE option by convention. You can also restore transaction logs by using the RESTORE DATABASE option, but you might prefer to use RESTORE LOG, instead, for clarity:

-- First, restore the full backup
RESTORE DATABASE WideWorldImporters
FROM DISK = N'C:\SQLData\Backup\SERVER_WideWorldImporters_FULL_20220918_210912.BAK'
WITH
MOVE N'WideWorldImporters' TO N'C:\SQLData\WWI.mdf',
MOVE N'WideWorldImporters_log' TO N'C:\SQLData\WWI.ldf',
NORECOVERY;
GO
-- Second, restore the most recent differential backup
RESTORE DATABASE WideWorldImporters
FROM DISK = N'C:\SQLData\Backup\SERVER_WideWorldImporters_DIFF_20220926_120100.BAK'
WITH NORECOVERY;
GO
-- Finally, restore all transaction log backups after the differential
RESTORE LOG WideWorldImporters
FROM DISK = N'C:\SQLData\Backup\SERVER_WideWorldImporters_LOG_20220926_121500.BAK'
WITH NORECOVERY;
GO
RESTORE LOG WideWorldImporters
FROM DISK = N'C:\SQLData\Backup\SERVER_WideWorldImporters_LOG_20220926_123000.BAK'
WITH NORECOVERY;
GO
-- Bring the database online
RESTORE LOG WideWorldImporters WITH RECOVERY;
GO

Remember: Specify WITH RECOVERY in the final transaction log file restore, or in a separate command as in the previous example.

The RECOVERY option instructs SQL Server to run recovery on the database, which might include an upgrade step if the new instance has a newer version of SQL Server on it. When recovery is complete, the database is brought online.

Restore a database to a point in time

If configured properly, it is possible to restore a database to the exact moment in time (or more precisely, to the exact LSN) before a disaster occurred. A point-in-time restore requires an LSN or timestamp (meaning any specific date and time value) to let the RESTORE command know when to stop restoring. You can even restore to a specific mark in the transaction log backup, which you specify at transaction-creation time by explicitly naming a transaction. This requires knowing in advance that this transaction might be the cause for a point-in-time restore, which is rarely the case in a disaster scenario.

Note

Some scenarios where you might suspect in advance that a point-in-time recovery is required include performing a schema upgrade, perhaps as part of an application release.

Caution

The timestamp used for a point-in-time restore comes from the transaction log itself and refers to the local date and time on the SQL Server instance when the transaction started. Remember to take time zones and daylight saving time into consideration when restoring a database to a point in time.

Depending on the cause of the disaster, you might want to investigate the current state of the database to determine the best recovery approach. You might need to close connections to the database and set the database to single user mode. In single user mode, you have time to find out what happened (without making modifications), take a tail-log backup when possible and necessary, and recover the database to the moment before disaster struck.
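
A hedged sketch of that sequence follows. The tail-log backup file name is hypothetical, and whether a tail-log backup is possible at all depends on the state of the database and its transaction log:

-- Disconnect other sessions so the database can be investigated safely
ALTER DATABASE WideWorldImporters SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
GO
-- Take a tail-log backup; NORECOVERY leaves the database in a restoring state,
-- ready for the restore sequence to begin
BACKUP LOG WideWorldImporters
TO DISK = N'C:\SQLData\Backup\SERVER_WideWorldImporters_TAILLOG.BAK'
WITH NORECOVERY;
GO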

In some cases, you might opt to leave the database active but restore it to a new database and reload changes from the restored version. The feasibility of this depends on the database schema, the number of changes that have been made since the adverse event, and the RTO.

A point-in-time restore works only when restoring transaction log backups, not full or differential backups. The point in time must be after the full backup from which you begin the restore process. This fact may make it valuable to keep several full backups prior to the most recent one. Some errors, such as human error or database corruption, may not always be detected before a new full backup is taken.

The process is the same as in the previous example, except for the final transaction log file, for which the point in time is specified by using the STOPAT or STOPBEFOREMARK options. Let’s look at each option:

  • STOPAT. A timestamp. You must know this value from the time an unexpected event occurred. If it’s unclear when the event occurred, you might need to explore the transaction log.

  • STOPBEFOREMARK or STOPATMARK. A log sequence number or transaction name. You must know the LSN value from exploring the active portion of the transaction log (see the preceding Inside OUT).

Assuming you have followed the same sequence as shown in the previous example, the final transaction log restore might look like this:

-- Restore point in time using timestamp
RESTORE LOG WideWorldImporters
FROM DISK = N'C:\SQLData\Backup\SERVER_WideWorldImporters_LOG_20220926_123000.BAK'
WITH STOPAT = 'Sep 26, 2022 12:28 PM',
RECOVERY;
GO
-- Or restore point in time using LSN
-- Assume that this LSN is where the bad thing happened
RESTORE LOG WideWorldImporters
FROM DISK = N'C:\SQLData\Backup\SERVER_WideWorldImporters_LOG_20220926_123000.BAK'
WITH STOPBEFOREMARK = 'lsn:0x0000029f:00300212:0002',
RECOVERY;
GO

Restore a database piecemeal

Partial database backups involve file and filegroup backups, which ease the manageability of a very large database (VLDB). This is an advanced topic, and the details depend heavily on your environment.

Partial recovery is useful for bringing a database online as quickly as possible to allow the organization to continue working. You can then restore any secondary filegroups later, during a planned maintenance window.

Piecemeal restores begin with something called the partial-restore sequence. In this sequence, the primary filegroup is restored and recovered first. If the database is under the simple recovery model, all read/write filegroups are then restored.

  • To learn more about the SQL Server recovery process, read Chapter 3.

While this is taking place, the database is offline until restore and recovery is complete. Any unrestored files or filegroups remain offline, but you can bring them online later by restoring them.

Regardless of the database’s recovery model, the RESTORE command must include the PARTIAL option when doing a piecemeal restore, but only at the beginning of the sequence. Because transactions might span more than just the recovered filegroups, these transactions can become deferred, meaning that any transaction that needs to roll back cannot do so while a filegroup is offline. The transactions are deferred until the filegroup can be brought online again, and any data involved in that deferred transaction is locked in the meantime.
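
For illustration, the first step of a piecemeal restore might look like the following sketch, which reuses the full backup file from the earlier examples. Only the primary filegroup is restored here, and the PARTIAL option starts the partial-restore sequence:

-- Begin the partial-restore sequence with the primary filegroup
RESTORE DATABASE WideWorldImporters
FILEGROUP = 'PRIMARY'
FROM DISK = N'C:\SQLData\Backup\SERVER_WideWorldImporters_FULL_20220918_210912.BAK'
WITH PARTIAL, NORECOVERY;
GO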

Restore a database piecemeal under the simple recovery model

To initialize a partial recovery of a database under the simple recovery model, you must begin with a full database or a partial backup. The restore will bring the primary filegroup and all read/write secondary filegroups online. You can then bring the database online. If any read-only filegroups were damaged or corrupted and you need to restore them, you will do those last.

Restore a database piecemeal under the full recovery model

As with the simple recovery model, you must begin with a full database or partial backup (which must include the primary filegroup). When possible, you should also have a tail-log backup to help restore the database to the most recent point in time.

It is possible to restore and recover only the most important read-write filegroups first, then restore additional read-write filegroups, before finally restoring any read-only filegroups. Depending on the design of the database and applications, the organization may resume critical operations sooner than if the entire database needed to be restored.

Point-in-time restore is provided under the following conditions:

Note

If you skip a FILESTREAM filegroup during partial recovery, you can never again recover it unless the entire database is restored in full.

Define a recovery strategy

Consider our scenario from the beginning of the chapter, in which anything and everything that could go wrong did go wrong. This section highlights certain issues that could have been addressed by an appropriate recovery plan. You can then implement this recovery plan, step by step, using your runbook. We also discuss recovery strategies around hybrid environments, and briefly discuss Azure SQL Database.

Note

The word strategy means there is a long-term goal. Your recovery strategy must adapt to environmental changes. A runbook is a living document; it requires incremental improvements as you test it.

A sample recovery strategy for our DR scenario

Several avoidable problems occurred in the fictional DR story at the beginning of the chapter:

  • There was no redundant power system in place.

  • There was no redundant Internet connection (outside of a personal cellular device).

  • There was no runbook to guide the accident-prone DBA.

  • The security on the DR server did not follow recommended best practices.

  • Sending the backups offsite failed.

The following subsections discuss some key considerations for your recovery strategy beyond taking backups.

Keep the lights on

In your runbook, ensure your uninterruptible power supplies (UPSs) and backup generators can run all necessary equipment for long enough to keep emergency lights on, laptops charging, network equipment live, and servers running (including a monitor), so that your DBA can log in to the SQL Server instance long enough to run a tail-log backup if necessary. If the RTO allows it, you might proactively shut down systems gracefully rather than attempt to keep less critical systems running on limited power. A graceful shutdown ahead of a looming hard shutdown is a smart idea.

It might sound like a small detail, but you should even have a list of diesel suppliers in your runbook, especially if your generators must keep running for several hours. Clean power is also important. Generators cause power fluctuations, which can damage sensitive electronic equipment. A power conditioner should be installed with your generator, and you must make sure it works correctly.

In our scenario, the backup power failed. There was no adequate provision to keep electricity flowing after the building lost power, which had several knock-on effects, including the inability to charge laptop and cellphone batteries.

Redundant Internet connection

If your backups are being copied securely offsite or you have a hybrid environment in which systems are connected across datacenters using virtual private networks (VPNs), make sure these routes can stay connected if one of the links goes down. In our scenario, the DBA had no option but to use a low-quality cellular connection. While cellular modems can be a part of your recovery strategy, they can be fickle, and fail under sudden public demand. It’s also difficult to connect servers this way.

Know where the runbook is

The runbook itself should be printed out and stored in a secure but easily accessible location (for example, a fireproof safe or lockbox). Don’t forget to have an electronic copy available, as well, stored with a cloud provider of your choice and kept up to date.

Make sure your offsite backups are secure and tamper-proof

Security of the offsite location for backups is critical, and no one should have access to that site unless they are required to do a recovery. In our sample scenario, our DBA accidentally deleted an email alert indicating that the offsite storage for backups was almost full. This alert is a good one to have, but it was not acted on appropriately. Also, the cause of the alert was an avoidable situation because the colleague who used up the free space should not have been able to access critical organization resources.

Remote, offsite backups ideally aren’t accessible via network shares. Air-gapped backups are protected from ransomware attacks running rampant in your network. Copying offsite backups into your network should be through just-in-time access or other temporary means.

Regularly check and test your backups

Automate your backups and post-backup steps to verify backup success. Use maintenance plans (see Chapter 9) or established third-party backup tools such as dbatools.io or Ola Hallengren’s Maintenance Solution (available from https://ola.hallengren.com).

Verify that you have a process to check that your backups are taking place and test that process often. For example, if SQL Server Agent is crashing, your notifications might not fire. Test the backups, as well, by having a machine that restores backups continuously and by running DBCC CHECKDB where possible. If you can afford it, have log shipping configured so that all backups are restored as soon as they come into offsite storage. Ensure that the backup files are being securely copied offsite as soon as they can be.

If you’re unable to continuously restore databases, run random spot checks of your backups by picking a date and time in the backup history and restoring the backup. Aside from testing the databases themselves, this is a good rehearsal for when something goes wrong. Use the runbook for these exercises to ensure it is, and remains, up to date. Remember to run DBCC CHECKDB on any database you restore.

You might find as databases grow that the SLA becomes out of date and that the RTO is no longer achievable. Running these tests will alert you to this situation long before it becomes a problem and will allow you to tweak how you perform backups in the future. For example, you might discover that it is much quicker to spin up an Azure virtual machine (VM) with SQL Server and restore the business-critical databases stored in Azure Blob Storage than to struggle with VPN connections and failed hardware on-premises.

Our DBA assumed the backups were taking place every 15 minutes and being copied offsite immediately afterward. This was not the case, and instead of losing 15 minutes of data, the organization lost as much as 27 minutes’ worth.

Check the security of your disaster recovery site

Your DR site might have outdated security access controls, whether physical or virtual. Be sure you stay up to date, especially when people leave the company. You don’t want to be in the position of having to call someone who left your organization two years ago to ask for a firewall password—especially not at 3 a.m. on a Sunday (as is this author’s experience). If you must use a remote desktop connection to access a server, protect it by using a VPN.

Prepare tools and credentials

Check that you have already downloaded additional tools and documented how to use them. Unless they pose a security risk, have them available on the recovery systems in advance.

Keep all passwords and keys (symmetric and asymmetric) in a password manager as well as printed and stored in a secure location with the runbook where practical. Make sure all digital certificates are backed up securely.

Automate restore scripts

In case the msdb database is inaccessible, and you cannot generate a restore script from a database backup history, be sure you have a tool that generates a restore script based on files in a folder in your offsite storage. Many tools do this, including the Restore-DbaDatabase command in the free dbatools (available from https://dbatools.io). This is also when self-evident backup file naming and extensions help you to create the recovery chain.

Practice your disaster recovery plan

In concert with your HA strategy, which involves automated and manual failovers (see Chapter 11 for more on this), you should perform regular drills to test your runbook. You can also have people in your organization who are unfamiliar with the environment look through the runbook. They can provide valuable information regarding assumptions you might have made.

The cadence is up to your organization, but a full DR scenario should be tested at least once a year. Any changes you need to make to the runbook should be made immediately. If the recovery fails, add notes of how it failed and what you did to resolve the failure. All this information is extremely valuable.

Recovery strategies for hybrid and cloud environments

Many organizations use a combination of on-premises infrastructure and services in remote locations, including Azure services, third-party datacenters, and other cloud vendors.

Recover data in hybrid and cloud environments

The strategy for recovering data in a hybrid environment is very similar to an on-premises strategy, except that you must take network connection, latency, and bandwidth into account. When designing a recovery strategy for a hybrid environment, pick a DR site that is central, but geo-replicated.

It is prudent to use virtualization technologies that allow for VM and file system snapshots to ensure that your virtual servers are backed up regularly. These VM backups can be spun up in case of a disaster more quickly than rebuilding physical servers. You can augment these with appropriate native SQL Server backups that are tested properly.

If you already use Azure Storage for your backups, this reduces the network bandwidth and latency issues if you can restore your organization’s databases to Azure VMs or databases in Azure SQL Database.

Remember that after failing over to a DR site, you must fail back to your on-premises site when it is up and running again. Based on the magnitude of the disaster, you may need to keep the DR site as the new primary, in which case you must set up a new DR site.

Recovering a database in Azure SQL Database

Chapter 17, “Provision Azure SQL Database,” offers an in-depth look at managing Azure SQL Database, including preparing for DR. There are three options to consider when restoring a point-in-time backup, which play into your runbook:

  • Database replacement. You can replace an existing database using a database backup. This requires you to verify the service tier and performance level of the restored database. To replace your existing database, rename the existing one and restore to the old name.

  • Data recovery. If you need to recover data from a previous point in time, you can restore the database to a new database. Azure SQL Database lets you specify the point in time within the backup retention period to which you wish to restore.

  • Deleted database. If you deleted a database and you are still within the recovery window, you can restore that deleted database to the time just before it was deleted.

The geo-restore feature can restore a full backup to any server in any Azure region from a geo-redundant backup. Databases on the basic performance tier can take up to 12 hours to geo-restore. This estimated recovery time should factor into your RTO. With Azure SQL Database, you’re also giving up control over the RPO. When properly implemented, as discussed in Chapter 17, Azure SQL Database supports an RPO of as little as 5 seconds.
