Chapter 11
Developing, deploying, and managing data recovery

The first and foremost responsibility of a data professional is to ensure that a database can be recovered in the event of a disaster.

You don’t design a backup strategy. You design a restore strategy. You need to allow for potential downtime and loss of data, within acceptable limits. These are defined by the business requirements for getting an environment back up and running after a disaster.

Technical solutions such as high availability (HA) and disaster recovery (DR) are available in Microsoft SQL Server to support these requirements, which are ultimately governed by the organization itself. In other words, business requirements define the approach that you will take in your organization to plan for and survive a disaster. Remember that this is only a small but important part of a larger business continuity plan.

This chapter does not provide any guidance on recovering from a corrupt database. Microsoft recommends restoring from a last known good database backup if you experience corruption. That being said, our objective is that by the end of the next two chapters, you will understand how to achieve close to zero data loss with minimal downtime.

You can read more about data corruption in Chapter 13.

The fundamentals of data recovery

It is incredibly expensive, and almost impossible, to achieve zero data loss with zero downtime. Recovery is a balance between budget, acceptable downtime, and acceptable data loss. Also, emotions run high when systems are down, so it is incumbent on all organizations to define possible outcomes at the outset and how to deal with them.

The governance of these requirements is outlined in a Service-Level Agreement (SLA), which explains the Recovery Point Objective (RPO) and Recovery Time Objective (RTO) from the organization’s perspective, as it relates to business continuity. The SLA might also include the consequences and penalties (financial or otherwise) if you do not meet the timelines.

The SLA is a business document, not a technical one, because the RPO and RTO are business requirements. Although you will use technical solutions to satisfy these requirements, it is important to keep in mind that your recovery strategy should be the best fit for the organization’s business needs.

For more information about achieving HA, read Chapter 12.

A typical disaster recovery scenario

Let’s paint a picture of a beautiful, sunny Friday afternoon, at 4:57 PM. This scenario spirals out of control pretty fast, so buckle up.

Disaster strikes in your office, just as you are about to head home for the weekend. The electricity goes out for the entire city block, and the uninterruptible power supply (UPS) under your desk fails because in all the confusion, you knocked over a half-finished can of soda onto it, which blew out the battery.

As the smell of burned electronics wafts up to your nostrils, you begin to panic. You haven’t rehearsed this scenario, because no one ever thought the UPS would be moved upstairs after the basement was flooded last spring, let alone end up with soda poured over it.

Your transaction log backups run every 15 minutes because that’s what the RPO stipulates, and you have a batch script in the Windows Task Scheduler that copies your files remotely, so your logs should have been copied safely off-premises. Well…that is, you’re pretty sure the log backups were copied correctly, right?

Except that you get a sinking feeling in the pit of your stomach as you remember a warning you saw in your email this morning while a colleague was on the phone with you; your finger slipped on the mouse and you accidentally deleted the notification instead of moving it. Plus, you have that annoying muscle-memory habit of emptying deleted items whenever you see them.

Your smartphone rings. It’s the boss, who is away this week at a conference and wants to check the sales figures for a report the board is putting together for an important meeting this evening. Your phone squawks because it has 2% battery remaining. Your laptop has some charge, but not much because you were planning on charging it when you arrived home.

You crack open your laptop to check whether you can somehow undelete your mail. Oh, right, your internet is down.

And then your phone dies while your boss, who coincidentally doesn’t care about power failures because the company spent hundreds of dollars on that UPS under your desk, is asking you when the reports will be available again and wants you to just get it done.

You could charge the phone off the laptop and use the tethered cellular connection to log into the DR site. But the signal in this area is weak, so you need to move to the window on the other side of the office. As you stand up, the laptop decides that it’s time to install operating system updates because it’s now after 5 PM.

After an agonizing few minutes, your phone finally starts. Meanwhile your laptop has cancelled the updates because there’s no internet access. You connect to your off-premises datacenter through a Remote Desktop session. It takes three attempts because you had forgotten that RDP to this server works only with the administrator user account.

The SQL Server instance has its own service account, so you need to download and use psexec to run SQL Server Management Studio as the service account in interactive mode, after changing a registry entry to allow that user to use interactive login. You check the backup folder, and thankfully the latest log file is from 4:30 PM. Great. That means the 4:45 PM backup didn’t copy over. Oh, it’s because the drive is full. That must have been what the email warning was about.

After clearing out some files that another colleague had put on the drive temporarily, you need to write the script you’ve been meaning to write to restore the database because you didn’t have time to set up log shipping.

You export the backup directory listing to a text file and begin looking for the latest full backup, differential backup, and transaction log backup files. But now you’ve seen that the last differential backup doesn’t make sense, because the size is all wrong.

You remember that one of your developers had made a full backup of the production database this week on Monday evening, didn’t use the COPY_ONLY option, and you don’t have access to that file. The latest differential file is useless. You need to start from Sunday’s full backup file and then use Monday afternoon’s differential backup and all transaction log files since then. That’s more than 400 files to restore.

Eventually, with a bit of luck and text manipulation, you begin running the restore script. One particular log file takes a very long time, and in your panicked state you wonder whether it has somehow become stuck. After a minute or two of clicking around, you realize SQL Server had to grow the transaction log of the restored database because it was replaying that annoying index rebuild script that failed on Tuesday morning and needed to roll back.

Finally, at 6:33 PM, your off-premises database is up and running with the latest database backups up to and including the one from 4:30 PM. Just then, the lights come on in the office, because the power failure that affected the downtown area where your office is located has been resolved. Now you need to run a full DBCC CHECKDB on the production server as soon as it starts, which always takes forever because the server is five years old and was installed without ECC RAM. That check would push you past the two-hour RTO that you and your boss agreed to, so you stick with the failover plan.

You update the connection settings in the application to point to the off-premises datacenter just as your phone dies once more, but at least the office again has power to charge everything. You send the boss an email to say the reports should be working. The cellular data bill is coming out of your boss’s next trip, you tell yourself as you pack up to go home.

As you walk to the bus stop, it occurs to you that the files you cleared out to free up drive space probably included the full database backup from Monday night, and that you might have saved some time by checking them first.

Losing data with the RPO

When disaster strikes, you might lose a single byte in a single row in a single 8-KB data page due to memory or drive corruption. How do you recover from that corruption? What happens if you lose an entire volume or drive, the storage array, or even the entire building?

The RPO should answer the question: “How much data are you prepared to lose?” You need to consider whether your backups are being performed correctly and regularly, and copied off-premises securely and in a timely manner. The RPO is usually measured in seconds or minutes. In other words, it is the acceptable amount of time that may elapse between the last known good backup and the point of failure.

In this hellish scenario that we just laid out, the organization decided that losing 15 minutes of data was acceptable, but ultimately 27 minutes was lost. This is because the drive on the DR server was full, and the most recent backup did not copy over.

To satisfy a 15-minute window, the transaction log backups would need to be taken more frequently, as would the off-premises copy.

If the organization requires “zero data loss,” the budget will need to significantly increase to ensure that whatever unplanned event happens, SQL Server’s memory and transaction log remains online and that all backups are working and being securely copied off-premises as soon as possible.

Inside OUT

Why is the RPO measured in time, and not drive usage?

Transactions vary in size, but time is constant.

In Chapter 3, we looked at how every transaction is assigned an LSN to keep track of things in the active portion of the transaction log. Each new LSN is greater than the previous one (this is where the word “sequence” in Log Sequence Number comes from).

The RPO refers to the most recent point in time in the transaction log history to which you will restore the database, based on the most recently committed LSN at that specific moment in time, or the latest LSN in the log backup chain, whichever satisfies the organization’s RPO.

Losing time with the RTO

Time is money. Every minute that an organization is unable to work has a cost, and lost productivity adds up quickly. The RTO is the amount of time you need to get everything up and running again after a disaster. This might be orchestrating a failover to your disaster recovery site in another building, or a manual failover using log shipping. The RTO is usually measured in hours.

In our disaster scenario, the RTO was two hours. Our intrepid but woefully unprepared and accident-prone DBA barely made it. A number of factors acted against the plan (if it could be called a plan).

If an organization requires zero downtime, the budget increases exponentially. This is where HA and DR technologies combine to support the requirements.

Establishing and using a run book

When panic sets in, you need a clear set of instructions to follow, just like our deer-in-the-headlights DBA in our fictional scenario. This set of instructions is called a run book.

The run book is a business continuity document. It covers the steps necessary for someone (including yourself) to bring the databases and supporting services back online after a disaster. In the event that you or your team members are incapacitated, the document should be accessible to, and understandable by, someone who doesn’t have intimate knowledge of the environment.

From our example scenario, issues like the Remote Desktop Protocol (RDP) user account not being able to log in to SQL Server Management Studio, getting psexec downloaded, knowing to skip an out-of-band differential backup, and so on would not be immediately obvious to many people. Even the most experienced DBA in a panic will struggle with thinking clearly.

The level of detail in a run book is defined by the complexity of the systems that need recovery and the time available to bring them back again. Your organization might be satisfied with a simple Microsoft Excel spreadsheet containing configurations for a few business-critical systems. Or, it might be something more in-depth, updated regularly, and stored in a version control system (which itself should be backed up properly).

The rest of this chapter describes how SQL Server provides backup and restore features to help you come up with a recovery strategy that is most appropriate to your environment, so that when your organization wants to produce business continuity documentation, you have sufficient knowledge to guide an appropriate and achievable technical response.

Most important, you need to be able to rehearse a DR plan. The run book won’t be perfect, and rehearsing scenarios will help you to produce better documentation so that when disaster strikes, even the most panicked individual will be able to figure things out.

An overview of recovery models

SQL Server supports three recovery models: full, bulk-logged, and simple. These models provide a high level of control over the types of backups available to your databases. Let’s take a brief look at each one:

  • Full recovery model. Allows a full point-in-time recovery. Full, differential, and transaction log backups can be taken. All transactions are fully logged.

  • Bulk-logged recovery model. Reduces the amount of transaction log used for certain bulk operations. Can allow a point-in-time recovery if no bulk-logged operations are in that portion of the transaction log backup. Full, differential, and transaction log backups can be taken.

  • Simple recovery model. Does not allow point-in-time recovery. Transactions are still logged, but the log is truncated (cleared) at each checkpoint, so transaction log backups cannot be taken. Full and differential backups can be taken.

You can change the recovery model of a database in SQL Server Management Studio in Object Explorer, or by using the following Transact-SQL (T-SQL) statement (and choosing the appropriate option in the square brackets):

ALTER DATABASE <dbname> SET RECOVERY [ FULL | BULK_LOGGED | SIMPLE ];
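
To confirm the current setting for every database on the instance, you can query the sys.databases catalog view:

SELECT name, recovery_model_desc
FROM sys.databases;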

Before diving into each recovery model in more detail, let’s take a look at point-in-time restores, and how those are affected by the log backup chain.

Recovery to a point in time

If configured properly, it is possible to restore a database to the exact moment in time (or more precisely, to the exact LSN) before disaster struck.

The most common form of disaster is human error, like accidentally leaving out a WHERE clause during an UPDATE statement.

No matter the cause, your reactions should be the same: stop all work on the database in question, find out what happened in a nondestructive way, take a tail-log backup if possible or necessary, and recover to the moment before disaster struck.

Inside OUT

What is a tail-log backup?

A tail-log (or tail-of-the-log) backup is fundamentally the same thing as an ordinary transaction log backup. The difference is in the circumstances in which you would perform this kind of log backup.

In a disaster scenario, the automation for performing transaction log backups might be offline, or your backup drive might be unavailable. Any time you manually perform a transaction log backup after a failure, to ensure that the remaining transactions in the log are safely stored somewhere, you are taking a tail-log backup.

Performing a tail-log backup that you can restore properly later is how you achieve zero data loss.

Keep in mind that a highly available database (using availability groups or database mirroring, for example) is not immune to disasters, especially if these disasters are replicated to downstream database instances.

Inside OUT

How can I tell when the unplanned event took place?

To find out when a disaster occurred that isn’t immediately apparent, you can query the active portion of the transaction log if it is available, making use of an undocumented system function that reads from the active VLF(s), as demonstrated here:

SELECT * FROM sys.fn_dblog(NULL, NULL);

This displays all transactions that have not yet been flushed as a result of a checkpoint operation.

Using a standard WHERE clause, you can trace back to the point immediately before the event took place. For example, if you know that a user deleted a row from a table, you would write a query looking for all delete operations:

SELECT * FROM sys.fn_dblog(NULL, NULL)
WHERE Operation LIKE '%delete%';

To get this to work, SQL Server must still be running, and the transaction log must still be available (although a similar technique works on transaction log backup files, using sys.fn_dump_dblog).

To see more from Paul Randal about reading from the transaction log, go to https://www.sqlskills.com/blogs/paul/using-fn_dblog-fn_dump_dblog-and-restoring-with-stopbeforemark-to-an-lsn.

Point-in-time recovery requires transaction logs that cover the full span of time from the most recent full backup to the time of the incident.

You can see an example of how to restore to a point in time in the section “Restoring a database to a point in time” later in this chapter.

The log backup chain

A backup chain starts with a full backup, followed by differential and/or transaction log backups that you can combine into a recovery sequence to restore a database to a particular point in time or to the time of the latest backup, whichever is required.

Databases in the full recovery model can be restored to a point in time because transactions are fully logged in that recovery model.

As Figure 11-1 illustrates, a backup chain starts with a full backup, which contains the most recent LSN of the active portion of the transaction log at the time that backup finished. You can then use a combination of the most recent differential backup (which must be based on that same full backup) and any additional transaction log backups to produce a point-in-time recovery. If you do not have a differential backup or the point in time you want to restore to is before the end of the differential backup, you must use transaction log backups. Either option will work as long as the LSNs required in the sequence are contained in each of those backups.

Figure 11-1 The log backup chain.

A new database is in the full recovery model by default because it is derived from the model database, which itself is in the full recovery model by default. However, the database will not behave like it is in the full recovery model until the first time a full backup is taken, which is what initializes the backup chain.

Until you run that first full backup on a database in the full or bulk-logged recovery model, the new database is “pseudo-simple,” behaving as though it is in the simple recovery model. Active portions of the log are cleared whenever a database checkpoint is issued, and the transaction log remains at a reasonably stable size, unless a long-running transaction causes it to grow.

For less-experienced data professionals, the sudden and seemingly uncontrolled growth of the transaction log, after the first full backup, can take them by surprise.

We recommend that you configure appropriate maintenance plans (including transaction log backups and monitoring) at the time you create a new database.

Inside OUT

How long can the backup chain be?

Provided that you have an unbroken backup chain for which the LSNs are all intact, you can potentially have many thousands of log backups stretching back over months or even years. You can apply these backups, along with the full backup on which they are based (and assuming the files are intact), to restore the database to a current point in time, even if the database was moved or upgraded during that time.

However, this can be extremely time consuming and will negatively affect the RTO, especially if the backups need to be retrieved from slow storage (including tape). Legend has it that some organizations were forced to close down as a result of missing the RTO.

It is far better practice to perform regular full backups (and differential database backups if they are useful) along with transaction log backups so that the dependency chain is shorter.

You can read more about designing an appropriate backup schedule in the section “Creating backups” later in this chapter. For more on maintenance plans, read Chapter 14.

Full recovery model

For databases that require point-in-time recovery, which is the case for most business-critical systems, we recommend the full recovery model (and it is the default for all new SQL Server databases).

In this recovery model, after the first full backup takes place (which initializes the backup chain), the virtual log files in the transaction log remain active and are not cleared until a transaction log backup writes these log records to a log backup. Only then will the log be truncated (cleared).

Assuming that you implement a process to ensure that these backups are securely copied off-premises as soon as the backups are completed, and that you regularly test these backups, you can easily restore your database in the event of a disaster. Provided the right circumstances are in play, you might even be able to take a tail-log backup to achieve zero data loss, if that data has been committed and made durable.

You can read more about durability, including delayed durability, in Chapter 2.

Bulk-logged recovery model

Under the bulk-logged recovery model, bulk operations are minimally logged, which reduces the size of the transaction log records and subsequent backups. These operations include BULK INSERT, INSERT ... SELECT, SELECT ... INTO, and bcp operations. Certain indexing operations are also minimally logged.

It is usually not possible to restore a database in the bulk-logged recovery model to a point in time, but there is a way to get mostly-point-in-time recovery. This allows a more flexible recovery strategy than the simple recovery model (more on this in the next section), without generating large transaction logs for bulk operations.

Suppose that you want to use the bulk-logged recovery model to perform minimally logged operations, without breaking the log backup chain. Your database must be in the full recovery model before the bulk-logged operation is performed. First, you must take a transaction log backup, and then switch to the bulk-logged recovery model. After the bulk-logged operation is complete, you must immediately switch back to the full recovery model, and then back up the log again. This ensures that the backup chain remains unbroken and allows point-in-time recovery to any point before or after the bulk-logged operation.
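
Here is a minimal sketch of that sequence (the backup file names are illustrative):

-- Close off the log backup chain before switching
BACKUP LOG WideWorldImporters
TO DISK = N'C:\SQLData\Backup\SERVER_WideWorldImporters_LOG_BeforeBulk.BAK';
GO
ALTER DATABASE WideWorldImporters SET RECOVERY BULK_LOGGED;
GO
-- Perform the minimally logged operation here (a BULK INSERT, for example)
ALTER DATABASE WideWorldImporters SET RECOVERY FULL;
GO
-- Back up the log again to capture the bulk-logged portion
BACKUP LOG WideWorldImporters
TO DISK = N'C:\SQLData\Backup\SERVER_WideWorldImporters_LOG_AfterBulk.BAK';
GO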

For more details, read the TechNet article “Operations That Can Be Minimally Logged,” which is available at https://technet.microsoft.com/library/ms191244.aspx.

Simple recovery model

Databases in the simple recovery model cannot make use of point-in-time recovery. Transactions are still logged, but once they are committed or rolled back, their log records are no longer needed; the log is truncated (cleared) automatically whenever a checkpoint occurs.

Databases in the simple recovery model can make use of full and differential backups. This recovery model is better suited to development databases, databases that change infrequently, and databases that can be rebuilt from other sources.

Understanding backup devices

SQL Server writes database backups to physical backup devices. These storage media might be virtualized, but for the purposes of this section, they are considered physical. They include disk, tape, and URL.

Backup disk

The most common form of SQL Server backup is stored directly on a local drive or network path, referred to as the backup disk. A backup disk contains one or more backup files, and each file contains one or more database backups. A database backup might also be split across multiple files.

Backup sets and media

As noted previously, SQL Server backups are written to media types (devices), namely tape, hard drives (referred to as backup disks, which include solid-state drives and UNC network paths), and URLs (through Azure Blob Storage). Each of these types has specific properties, including format and block size. You must initialize (format) tapes before you can write to them.

Media set

This is an ordered collection of a fixed type and number of devices (see Figure 11-2). For example, if you are using a backup disk, your media set will comprise a fixed number of one or more files on the file system or UNC network path.

Figure 11-2 A media set, containing three media families spread over three devices.

With tape backup now deprecated, media sets with multiple devices are less useful. When backing up to a disk, network, or URL, we recommend limiting backup operations to one file at a time. Nevertheless, a backup will always comprise at least one media set.

To read more about mirrored backup media sets, go to https://docs.microsoft.com/sql/relational-databases/backup-restore/mirrored-backup-media-sets-sql-server.

Media family

In each media set, all backup devices used in that media set make up the media family. The number of devices in that set is the number of media families. If a media set uses three backup devices that are not mirrored, there are three media families in that media set.

Backup set

A successful backup added to a media set is called a backup set. Information about the backup is stored in the header of the backup set.

You could reuse an existing backup set by appending new backups for a database to the end of that media. However, we do not recommend this practice, because the integrity of previous backups relies on the consistency of that media. An errant INIT option in the backup command could even accidentally overwrite existing backups in the media set.

Inside OUT

What media set, media family, and backup set should I use?

For typical SQL Server instances, we recommend that you back up your databases to strongly named, self-contained files, where only the backup in question is stored in each file. If you are using a third-party solution (free or paid), make sure that they follow a strong naming convention, as well.

In other words, the file name itself should contain the server and instance name, the database name, the type of backup (full, differential, or log), as well as the date and time of the backup. This makes locating, managing, and restoring backups much easier because each file is the backup set, media family, and media set all rolled into one.

For example, a full backup of the WideWorldImporters database on the default instance of a SQL Server called SERVER, taken on February 9th, 2018 at 10:53:44 PM might have the following filename: SERVER_WideWorldImporters_FULL_20180209_225344.BAK.

Physical backup device

This is the actual tape drive, file on a drive, or Azure Blob. You can split a single backup between as many as 64 backup devices of the same type. Splitting across many files can be useful for backing up very large databases (VLDBs), for performance reasons.

Backup to URL

Since SQL Server 2012 (Service Pack 1 with Cumulative Update 2), you can back up your SQL Server database directly to Azure Blob Storage. This is made possible by using a URL as a destination, along with the existing DISK and TAPE options.

Understanding different types of backups

A SQL Server backup is a process that creates an exact copy of the database, in a transactionally consistent state, at the moment that backup ended.

Regardless of the type of backup, the process will always include the active portion of the transaction log, including relevant LSNs, which ensures full transactional consistency when the backup is restored.

As we discussed earlier in the chapter, you can back up a SQL Server database in three main ways (full, differential, and transaction log) to produce the most efficient recovery strategy. A full database backup is the minimum type required to recover a database. Transaction log backups are incremental backups, based on a full backup, that allow point-in-time restores. Differential backups, also based on a full backup, can reduce the amount of time required to restore a database to a point in time.

In Enterprise edition, especially for VLDBs, you can take file-level and filegroup-level backups to allow a more controlled procedure when restoring.

You can read more about the files that make up a SQL Server database in Chapter 3.

Inside OUT

How large is a VLDB?

Opinions differ as to what constitutes a VLDB, based on individual experience and available resources (such as memory and drive space).

For the purposes of this chapter, any database that exceeds 100 GB is considered very large. Although modern solid-state storage arrays do mitigate many of the challenges facing databases of this size, they are not in widespread use as of this writing.

You can read more about solid-state storage, and storage arrays, in Chapter 2.

Every backup contains a header and a payload. The header describes the backup device, what type of backup it is, backup start and stop information (including LSN information), and information about the database files. The payload is the content of the data and/or transaction log files belonging to that backup. If Transparent Data Encryption (TDE) or Backup Encryption was turned on, the payload is encrypted.

You can read more about TDE in Chapter 7.

Full backups

A full database backup is a transactionally consistent copy of the entire database. This type of backup includes all of the 8-KB data pages in the database file(s) as well as the portion of the transaction log that was active from the start to the end of the backup process.

You can read more about the active portion of the transaction log in Chapter 3.

When a full backup runs with the default settings, a reserved data page known as the differential bitmap is cleared (see the upcoming section “Differential backups”). Any differential backups that are taken on the database after that will be based off that full backup.

You can perform full backups on databases in all recovery models, and you can compress them. Since SQL Server 2016, you can also compress backups of databases that are encrypted with TDE.

You can perform a full backup to a backup disk target with a minimal amount of T-SQL code. For example, you can back up a WideWorldImporters database on the default instance of a machine called SERVER by using the following code:

BACKUP DATABASE WideWorldImporters
TO DISK = N'C:\SQLData\Backup\SERVER_WideWorldImporters_FULL_20170918_210912.BAK';
GO

Copy-only backup

You can change the default behavior by using the COPY_ONLY option, which does not clear the differential bitmap. Copy-only backups are useful for taking out-of-band backups without affecting the differential backup schedule.

In other words, only differential backups are affected by the COPY_ONLY option. Transaction log backups, and thus the backup chain, are not affected.
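
For example, a copy-only full backup of the same database might look like this (the file name is illustrative, following the naming convention suggested earlier):

BACKUP DATABASE WideWorldImporters
TO DISK = N'C:\SQLData\Backup\SERVER_WideWorldImporters_FULLCOPYONLY_20170920_090000.BAK'
WITH COPY_ONLY;
GO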

Transaction log backups

Transaction log backups are incremental backups of a database. In the full recovery model, all transactions are fully logged. This means that you can bring a database back to the exact state it was in when a transaction log backup was taken, provided that the restore is successful. These backups allow for a recovery at any moment in time in the sequence (the backup chain).

In this type of backup, the active portion of the transaction log is backed up. Transaction log backups apply only to databases in the full and bulk-logged recovery models. Databases in the full recovery model can be restored to a point in time, and databases in the bulk-logged recovery model can be restored to a point in time as long as the transaction log does not contain bulk-logged operations.

Tail-of-the-log backups

As noted previously, tail-of-the-log, or tail-log, backups are functionally no different from regular transaction log backups. A disaster can occur that does not affect the transaction log itself. For example, the drive or volume for the data file(s) might become unavailable or corrupt, but SQL Server remains online.

After you have disconnected any other users from the database, you can switch it to single-user mode and perform a manual log backup on the active portion of the log. This creates a log backup that can be used at the very end of the backup chain to guarantee zero data loss.
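
The following is a minimal sketch of a tail-log backup, assuming the transaction log is intact (the file name is illustrative). The NORECOVERY option leaves the database in the RESTORING state so that no further transactions can occur; if the data files are damaged, you might also need the NO_TRUNCATE option:

-- Back up the tail of the log and leave the database in the RESTORING state
BACKUP LOG WideWorldImporters
TO DISK = N'C:\SQLData\Backup\SERVER_WideWorldImporters_TAILLOG_20170926_130000.BAK'
WITH NORECOVERY;
GO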

You can read more about tail-log backups at https://docs.microsoft.com/sql/relational-databases/backup-restore/back-up-the-transaction-log-when-the-database-is-damaged-sql-server.

Differential backups

Differential backups, which are based on a full database backup, are a convenience feature to reduce the number of transaction log backups (and time) required to restore a database to a point in time.

In many cases, a differential backup is much smaller than a full backup, which allows for a more flexible backup schedule. You can run a full backup less frequently, and have differential backups running more regularly, taking up less space than the full backup would have taken.

Think back to Chapter 3 in which we looked at extents. As a reminder, an extent is a 64-KB segment in the data file, comprising a group of eight physically contiguous 8-KB data pages.

After a full backup completes (i.e., the default full backup without the copy-only option), the differential bitmap is cleared. All subsequent changes in the database, at the extent level, are recorded in the differential bitmap.

When the differential backup runs, it looks at the differential bitmap and backs up only the extents that have been modified since the full backup, along with the active portion of the transaction log.

This is quite different to a transaction log backup, which records every change in the database even if it’s to the same tables over and over again.

Thus, a differential backup is not the same thing as an incremental backup. If you want to restore a database using a differential backup, you need only the full backup file plus the most recent differential backup file.

Even though you cannot restore a database to a point in time (or LSN) that occurs within the differential backup itself, it can vastly reduce the number of transaction log files required to effect those same changes.

Differential backups apply to databases in the full, bulk-logged, and simple recovery models.

Inside OUT

What do I do if my differential backup is larger than my full backup?

Differential backups will grow larger as the number of changed extents in the database increases. It is feasible that the differential backup can end up being larger than a full backup over time.

This is possible in situations in which every extent is modified in some way (for instance, if all the indexes in the database are rebuilt), which makes the differential backup the same size as a full backup. Once the active portion of the transaction log is added, you end up with a differential backup that is larger than a full backup.

SQL Server 2017 provides a new column called modified_extent_page_count in the DMV sys.dm_db_file_space_usage to let you know how large a differential backup will be. A good rule of thumb is to take a full backup if the differential backup approaches 80% of the size of a full backup.
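
For example, a sketch of such a check, run in the context of the database in question, might look like this:

-- Estimate how much of each data file has changed since the last full backup
SELECT total_page_count,
    modified_extent_page_count,
    modified_extent_page_count * 100.0 / total_page_count AS percent_changed
FROM sys.dm_db_file_space_usage;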

You can read more about differential backups at https://docs.microsoft.com/sql/relational-databases/backup-restore/differential-backups-sql-server.

Fixing a broken backup chain by using differential backups

A backup chain is broken when a database is switched from the full recovery model to the bulk-logged or simple recovery model for any reason (such as shrinking the transaction log file during an emergency situation).

After you switch back to the full recovery model, you can restart the log backup chain without having to perform a full database backup by taking a differential backup. As long as you make use of this or a more recent differential backup in a later recovery, along with the accompanying transaction log backups, the backup chain has been repaired.
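
A minimal sketch of repairing the chain follows (the file name is illustrative):

ALTER DATABASE WideWorldImporters SET RECOVERY FULL;
GO
-- This differential restarts the log backup chain without a new full backup
BACKUP DATABASE WideWorldImporters
TO DISK = N'C:\SQLData\Backup\SERVER_WideWorldImporters_DIFF_20170927_120000.BAK'
WITH DIFFERENTIAL;
GO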

You can read more about considerations for switching between recovery models at https://msdn.microsoft.com/library/ms178052.aspx.

File and filegroup backups

With SQL Server Enterprise edition, you can take a more granular approach by backing up individual data files and filegroups, which make use of the full or differential options, as well. Although these options are not available in the SQL Server Management Studio user interface, you can use the official documentation to build appropriate T-SQL queries.

Partial backups

Because read-only filegroups do not change, it does not make sense to include them in ongoing backup processes. Primarily used for VLDBs that contain read-only filegroups, partial backups will exclude those read-only filegroups, as required.

Partial backups contain the primary filegroup, any read-write filegroups, and one or more optional read-only filegroups.

To read more about partial backups, go to https://docs.microsoft.com/sql/relational-databases/backup-restore/partial-backups-sql-server.

File backups

You can use file backups to restore individual files in the event that they become corrupt. This makes restoring easier, because for VLDBs it would take much less time to restore a single file than the entire database, and it does not interfere with point-in-time restores.

Unfortunately, it does increase the complexity due to increased administration of the additional file backups over and above the full, differential, and transaction log backups. This overhead extends to recovery script maintenance.

To read more about file backups, visit https://docs.microsoft.com/sql/relational-databases/backup-restore/full-file-backups-sql-server.

Additional backup options

Since SQL Server 2014, it is possible to encrypt database backups using an asymmetric key. This is not the same as the encryption provided for backups with TDE turned on.

As noted in Chapter 3, you can also compress backups, which is recommended in almost all cases, unless the database makes use of page or row compression, or the database is encrypted with TDE.

To learn more about security and encryption, read Chapter 7.

Backup encryption

Like any asymmetric encryption process, you will require a cipher (the encryption algorithm) and an asymmetric key or certificate. Supported ciphers are Advanced Encryption Standard (AES; you can use key sizes of 128, 192, and 256 bits), and 3DES (also known as Triple DES). As discussed in depth in Chapter 7, AES is a safer and faster cipher than 3DES. You should back up and store the key or certificate in a secure location.
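
As a sketch, a compressed and encrypted backup might look like this, assuming a certificate named BackupCertificate has already been created in the master database and safely backed up (the certificate name and file name are illustrative):

BACKUP DATABASE WideWorldImporters
TO DISK = N'C:\SQLData\Backup\SERVER_WideWorldImporters_FULL_20170928_210000.BAK'
WITH COMPRESSION,
    ENCRYPTION (ALGORITHM = AES_256, SERVER CERTIFICATE = BackupCertificate);
GO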

Memory-optimized tables

Standard backups include memory-optimized tables. During the backup process, a checksum is performed on the data and delta file pairs to check for corruption. Any corruption detected in a memory-optimized filegroup will cause a backup to fail, and you will be required to restore from the last known good backup.

Remember that the storage requirements for a memory-optimized table can be much larger than its usage in memory, which will affect the size of your backups.

To learn more about how to back up memory-optimized files, go to https://docs.microsoft.com/sql/relational-databases/in-memory-oltp/backing-up-a-database-with-memory-optimized-tables.

Other options also exist, including file-snapshot backups for database files that are already stored in Azure (see Chapter 3), or turning on Managed Backups (also covered in Chapter 3). Although these options are convenient, there is no substitute for a native SQL Server backup.

Creating and verifying backups

You should completely automate backups when possible, whether you make use of the built-in Maintenance Plan Wizard in SQL Server Management Studio or a third-party solution (free or paid). Always ensure that backups are successful by observing that the backup files exist and the backup task does not error out. Additionally, you must test those backups by restoring them, which you can also do with an automated process.

You can read more about maintenance plans in Chapter 13.

SQL Server Agent is an excellent resource for automating backups, and many third-party solutions make use of it too (as does the Maintenance Plan Wizard).

Inside OUT

How frequently should I run backups?

Business requirements and database size will dictate how long your maintenance window is and what needs to be backed up in that window. You might have a critical database small enough that it can be fully backed up daily, and has no need for differential backups. Larger databases might require a weekly schedule, augmented by daily differential backups.

A database in the full recovery model should have transaction log backups occurring as a factor of the RPO. Assuming that your RPO is five minutes, transaction logs should be backed up and securely copied off-premises at a more frequent interval (every minute or 90 seconds, perhaps). This is to accommodate the implicit delay between when a backup ends and when the backup has been copied off-premises. In the case of a disaster, there is a higher chance that the files are copied off-premises with a smaller copy interval. A backup is realistically not considered part of the RPO until it is copied off-premises.

Databases that can be rebuilt using existing processes might not need to be backed up at all.

Creating backups

You can design a backup solution to satisfy a recovery strategy by using all or a combination of the following methods:

  • A full backup reads the entire database, including the data file(s) and the active portion of the transaction log.

  • A differential backup reads extents that have changed in the data file(s) since the last full (non-copy-only) backup as well as the active portion of the log.

  • A transaction log backup reads only the active portion of the log.

  • A file or filegroup backup reads an individual file or filegroup as well as the active portion of the log.

The buffer pool (see Chapter 2) is not used for database backups. The backup buffer is a portion of memory outside of the buffer pool, big enough to read pages from the data file and write those pages to the backup file. The backup buffer is usually between 16 MB and 32 MB in size. Be aware that memory pressure can reduce the backup and restore buffer sizes, causing backups and restores to take longer.

Backup checksums

SQL Server can perform optional checksum verifications on database backups. By default, backups do not perform a checksum unless they are compressed.

You can change this behavior by using trace flag 3023, through the SQL Server Management Studio properties of a backup, or with the CHECKSUM option in any T-SQL script you create to perform backups. We recommend that you turn on backup checksum where possible.
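
For example, in T-SQL (the file name is illustrative):

BACKUP DATABASE WideWorldImporters
TO DISK = N'C:\SQLData\Backup\SERVER_WideWorldImporters_FULL_20170929_210000.BAK'
WITH CHECKSUM;
GO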

Without backup checksum turned on, no validation is performed on data pages or log blocks. This means that any logical corruption will also be backed up without showing any of the errors you might see with a DBCC CHECKDB operation. This is to allow for scenarios in which you can back up a corrupt database before attempting to fix the corruption.

To read more about recovering from corruption, see Chapter 13.

With backup checksum turned on, a checksum is calculated over the entire backup file. Additionally, the page checksum on every 8-KB data page (for both page verification types of checksum or torn-page detection), and log block checksum from the active portion of the log, will be validated.

The backup checksum can significantly increase the time for a backup to run, but adds some peace of mind, short of running a recommended DBCC CHECKDB on a restored database. Backup compression can offset this additional overhead.

Verifying backups

After you create a backup, we highly recommend that you immediately verify that the backup was successful. Although rare, corruption is always a possibility. Most of the time it is caused by the storage layer (including as a result of device drivers, network drivers, and filter drivers), but it can also occur in non-ECC RAM or as the result of a bug in SQL Server itself.

There are two ways to verify a backup, and you can probably guess that the best way is to restore it and perform a full consistency check on the restored database by using DBCC CHECKDB.

The other, slightly quicker method, is to use RESTORE VERIFYONLY. If you backed up your database using the checksum option (which is on by default on compressed backups), the restore will verify the backup checksum as well as the data page and log block checksums as it reads through the backup media.

The convenience with RESTORE VERIFYONLY is that you do not need to allocate drive space to restore the data and log files, because the restore will read directly from the backup itself.
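
For example, to verify the full backup taken earlier in this chapter:

RESTORE VERIFYONLY
FROM DISK = N'C:\SQLData\Backup\SERVER_WideWorldImporters_FULL_20170918_210912.BAK'
WITH CHECKSUM;
GO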

However, a DBCC CHECKDB is the only way to know that a database is free of corruption.

Inside OUT

Why should I perform a DBCC CHECKDB if I have backup checksums turned on?

Although backup checksums are verified by RESTORE VERIFYONLY, it is possible for corruption to occur after a page is verified, while the backup is being written to the drive or copied off-premises. A successful RESTORE VERIFYONLY is not a clean bill of health for the backup.

You can build an automated process on another server with a lot of cheaper drive space to restore all databases after they have been backed up and perform a DBCC CHECKDB on them. This also gives you an excellent idea of whether you can meet your RTO as databases grow in size.

Restoring a database

To restore a database, you will generally start with a full backup (piecemeal restore is covered in a later section).

If you plan to make use of differential and/or transaction log backups, you must use the NORECOVERY keyword for every restore except the final one.

You restore a database in the simple recovery model by using a full backup to begin, plus the most recent differential backup based on the full backup if one is available.

You can restore a database in the bulk-logged recovery model by using a full backup, along with a most recent differential backup based on that full backup if available. Should you want to restore to a specific point in time for a bulk-logged database, this might be possible if no bulk-logged operations exist in the transaction log backups you use.

You can restore a database in the full recovery model to a point in time using a full backup, plus any transaction log backups that form part of the backup chain. You can use a more recent differential backup (based off that full backup) to bypass a number of those transaction log backups, where appropriate.

Each transaction log backup is replayed against the restoring database, using the NORECOVERY option, as though those transactions are happening in real time. Each file is restored in sequential order up to the required point in time or until the last transaction log backup is reached, whichever comes first.

Only after the entire chain has been restored (the end of which is indicated by the WITH RECOVERY option) does recovery kick in (which was covered in some detail in Chapter 3). All committed transactions will be rolled forward, and any in-flight transactions will be rolled back.

Some examples of restoring a database are included in the section that follows.

Restoring a database using a full backup

You can perform a database restore through SQL Server Management Studio or by using a T-SQL statement. In this example, only a full backup file is available to restore a database. The full backup comes from a different server, where the path of the original database is different, so the files need to be relocated (moved) on the new server.

To see the progress of the restore, you can set the statistics to display to the output window. The default is to write progress for every 10% complete; the example below uses STATS = 5 to report every 5%. No statistics will be output until the files have been created on the file system.

RESTORE DATABASE WideWorldImporters
FROM DISK = N'C:\SQLData\Backup\SERVER_WideWorldImporters_FULL_20170918_210912.BAK'
WITH
    MOVE N'WideWorldImporters' TO N'C:\SQLData\WWI.mdf',
    MOVE N'WideWorldImporters_log' TO N'C:\SQLData\WWI.ldf',
    STATS = 5,
    RECOVERY;
GO

The RECOVERY option (the default) at the end brings the database online immediately after the full backup has been restored. This prevents any further backups from being applied. If you want to restore a differential backup after this full backup, you will need to use the NORECOVERY option and bring the database online only after restoring the differential backup (see the next example).

Restoring a database with differential and log backups

Restoring using full, differential, and transaction log backups is more complicated, but you can still perform it through SQL Server Management Studio or by using a series of T-SQL statements.

For this scenario, we recommend creating your own automated scripts. For example, after every transaction log backup, you can use the information in the msdb database to build a script to restore the entire database to that point in time and then save the script in the same folder as the backup file(s).

To see an example by Steve Stedman, go to http://stevestedman.com/2017/10/building-sql-restore-script-backup-runs/.

You restore a database by using the RESTORE command. Full and differential restores use the RESTORE DATABASE option by convention. You can also restore transaction logs by using the RESTORE DATABASE option, but you might prefer to use RESTORE LOG, instead, for clarity:

-- First, restore the full backup
RESTORE DATABASE WideWorldImporters
FROM DISK = N'C:\SQLData\Backup\SERVER_WideWorldImporters_FULL_20170918_210912.BAK'
WITH
    MOVE N'WideWorldImporters' TO N'C:\SQLData\WWI.mdf',
    MOVE N'WideWorldImporters_log' TO N'C:\SQLData\WWI.ldf',
    STATS = 5,
    NORECOVERY;
GO
-- Second, restore the most recent differential backup
RESTORE DATABASE WideWorldImporters
FROM DISK = N'C:\SQLData\Backup\SERVER_WideWorldImporters_DIFF_20170926_120100.BAK'
WITH STATS = 5,
    NORECOVERY;
GO
-- Finally, restore all transaction log backups after the differential
RESTORE LOG WideWorldImporters
FROM DISK = N'C:\SQLData\Backup\SERVER_WideWorldImporters_LOG_20170926_121500.BAK'
WITH STATS = 5,
    NORECOVERY;
GO
RESTORE LOG WideWorldImporters
FROM DISK = N'C:\SQLData\Backup\SERVER_WideWorldImporters_LOG_20170926_123000.BAK'
WITH STATS = 5,
    NORECOVERY;
GO
-- Bring the database online
RESTORE LOG WideWorldImporters WITH RECOVERY;
GO

Remember that you can instead use the WITH RECOVERY option on the final transaction log restore and omit the final statement in the previous example.

The RECOVERY option instructs SQL Server to run recovery on the database, which might include an upgrade step if the new instance has a newer version of SQL Server on it. When recovery is complete, the database is brought online.

Restoring a database to a point in time

A point-in-time restore requires an LSN or timestamp (meaning a specific date and time value) to let the RESTORE command know when to stop restoring.

You can even restore to a specific mark in the transaction log backup, which you specify at transaction creation time by explicitly naming a transaction, though this is less common.
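
For example, a transaction can be named and marked when it begins (the transaction name, mark description, and UPDATE statement here are purely illustrative):

BEGIN TRANSACTION PriceUpdate WITH MARK 'Before the annual price update';
UPDATE Warehouse.StockItems SET UnitPrice = UnitPrice * 1.05;
COMMIT TRANSACTION PriceUpdate;
GO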

To see more about marking transactions, go to https://docs.microsoft.com/sql/t-sql/statements/restore-statements-transact-sql.

A point-in-time restore works only when restoring transaction log backups, not full or differential backups.

The process is the same as in the previous example, except for the final transaction log file, for which the point in time is specified by using the STOPAT or STOPBEFOREMARK options. Let’s look at each option:

  • STOPAT. A timestamp. You will need to know this value from the time an unexpected event occurred, or from exploring the transaction log.

  • STOPBEFOREMARK (also STOPATMARK). A log sequence number or transaction name. You will need to know the LSN value from exploring the active portion of the transaction log (see the Inside OUT for sys.fn_dblog in the section “Recovery to a point in time” previously in this chapter).

Assuming that you have followed the same sequence as shown in the previous example, the final transaction log restore might look like this:

-- Restore point in time using timestamp
RESTORE LOG WideWorldImporters
FROM DISK = N'C:\SQLData\Backup\SERVER_WideWorldImporters_LOG_20170926_123000.BAK'
WITH STOPAT = 'Sep 26, 2017 12:28 AM',
    STATS = 5,
    RECOVERY;
GO
-- Or restore point in time using LSN
-- Assume that this LSN is where the bad thing happened
RESTORE LOG WideWorldImporters
FROM DISK = N'C:\SQLData\Backup\SERVER_WideWorldImporters_LOG_20170926_123000.BAK'
WITH STOPBEFOREMARK = 'lsn:0x0000029f:00300212:0002',
    STATS = 5,
    RECOVERY;
GO

To read more about database recovery, including syntax and examples, visit https://docs.microsoft.com/sql/t-sql/statements/restore-statements-transact-sql.

Restoring a piecemeal database

Partial database backups deal with file and filegroup backups in order to ease the manageability of your VLDB. This is an advanced topic, so this section is a very high-level overview, which does not cover all of the intricacies involved.

Partial recovery is useful for bringing a database online as quickly as possible to allow the organization to continue working. You can then restore any secondary filegroups later, during a planned maintenance window.

Piecemeal restores begin with what is known as the partial-restore sequence. In this sequence, the primary filegroup is restored and recovered first. If the database is under the simple recovery model, all read/write filegroups are then restored.

To learn more about the SQL Server recovery process, read Chapter 3.

While this is taking place, the database is offline until restore and recovery is complete. Any unrestored files or filegroups remain offline, but you can bring them online later by restoring them.

Regardless of the database’s recovery model, the RESTORE command must include the PARTIAL option when doing a piecemeal restore, but only at the beginning of the sequence. Because transactions might span more than just the recovered filegroups, they can become deferred, meaning that any transaction that needs to roll back cannot do so while a filegroup is offline. Such transactions are deferred until the filegroup can be brought online again, and any data involved in a deferred transaction is locked in the meantime.

To read more about deferred transactions, go to https://docs.microsoft.com/sql/relational-databases/backup-restore/deferred-transactions-sql-server.
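
A minimal sketch of starting a piecemeal restore might look like this (the file name is illustrative). The PARTIAL option appears only in the first RESTORE statement of the sequence:

-- Restore and recover the primary filegroup first; other filegroups remain offline
RESTORE DATABASE WideWorldImporters
FILEGROUP = 'PRIMARY'
FROM DISK = N'C:\SQLData\Backup\SERVER_WideWorldImporters_FULL_20170918_210912.BAK'
WITH PARTIAL, NORECOVERY;
GO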

Restoring a partial database under the simple recovery model

To initialize a partial recovery of a database under the simple recovery model, you must begin with a full database or a partial backup. The restore will bring the primary filegroup online. You can then restore any additional files, if they are valid and transactionally consistent. Finally, if any read-only filegroups were damaged or corrupted and you need to restore them, you will do those last.

Restoring a partial database under the full recovery model

As with the simple recovery model that we just looked at, you must begin with a full database or partial backup (which must include the primary filegroup).

Point-in-time restore is provided under the following conditions:

  • The first RESTORE DATABASE command must include the PARTIAL option.

  • For a point-in-time restore against read/write filegroups, you need an unbroken log backup chain, and you must specify the time in the restore statement.

To see more about piecemeal restores, including code examples, visit https://docs.microsoft.com/sql/relational-databases/backup-restore/piecemeal-restores-sql-server.

Limitations

If you skip a FILESTREAM filegroup during partial recovery, you can never again recover it unless the entire database is restored in full.

Defining a recovery strategy

Consider our scenario from the beginning of the chapter, in which anything and everything that could go wrong did go wrong. We will highlight certain issues that could be addressed by an appropriate recovery plan. This recovery plan can then be implemented, step by step, using your run book.

The word “strategy” implies a long-term goal: your recovery strategy will need to adapt to changes in your environment. A run book is a living document; it requires incremental improvements as you test it.

We also discuss recovery strategies around hybrid environments, and briefly discuss Azure SQL Database.

A sample recovery strategy for a DR scenario

Several avoidable problems occurred in the DR scenario at the beginning of the chapter:

  • A UPS was in the wrong place, with no redundant backup

  • The internet connection did not have a redundant backup

  • There was no run book to guide the accident-prone DBA

  • The security on the DR server did not follow recommended best practices

  • The off-premises backups were failing

Keeping the lights on

First and foremost, the backup power was inappropriate. There was no reliable way to ensure that electricity would continue to flow after the building lost power, which had a number of knock-on effects, including the inability to charge laptops and cellphone batteries. Additionally, the UPS should not have been placed in an area that could be affected by a can of soda falling off a desk.

In your run book, ensure that your backup generators and UPS units can run all necessary equipment for long enough to keep emergency lights on, laptops charging, network equipment live, and the servers running, so that your DBA can log in to the SQL Server instance long enough to take a tail-log backup if necessary.
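
If the data files are damaged but the transaction log is still accessible, a tail-log backup captures the transactions that have not yet been backed up. Here is a minimal sketch, assuming a hypothetical database named SalesDB and an invented path:

-- NO_TRUNCATE permits the backup even if the database is damaged;
-- NORECOVERY leaves the database in a restoring state.
BACKUP LOG SalesDB
    TO DISK = N'E:\Backup\SalesDB_TailLog.trn'
    WITH NO_TRUNCATE, NORECOVERY;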

It might sound like a small detail, but you should even have a list of diesel suppliers in your run book, especially if your generators need to keep running for several hours.

Clean power is also important. Generators cause power fluctuations, which can damage sensitive electronic equipment. Although a power conditioner should be installed with your generator, you need to make sure that it works correctly.

Redundant internet connection

If your backups are being copied securely off-premises or you have a hybrid environment in which systems are connected across datacenters using Virtual Private Networks (VPNs), make sure that these routes can stay connected if one of the links goes down.

Know where the run book is

The run book itself should be printed out and stored in a secure but easily accessible location (for example, a fireproof safe or lock box). Don’t forget to have an electronic copy available, as well, stored with a cloud provider of your choice and kept up to date.

Make sure your off-premises backups are secure and tamper-proof

In our example scenario, our DBA accidentally deleted an email alert indicating that the off-premises storage for backups was almost full. This alert is a good one to have, but it was not acted on appropriately. Also, the cause of the alert was an avoidable situation because the colleague who used up the free space should not have been able to use critical organization resources.

Security of the off-premises location for backups is critical, and no one should have access to that location unless they are required to perform a recovery.

Check your backups regularly and test them regularly

Our DBA assumed that the backups were taking place every 15 minutes and being copied off-premises immediately afterward. This was not the case, and instead of losing 15 minutes of data, the organization lost as much as 27 minutes’ worth.

Automate your backups. Use maintenance plans (see Chapter 13), and make use of established third-party backup tools such as Ola Hallengren’s Maintenance Solution (available from http://ola.hallengren.com) or MinionWare Backup (available from http://www.minionware.net).
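
For example, a log backup job built on Ola Hallengren’s DatabaseBackup procedure might look like the following sketch; the UNC path and 72-hour cleanup window are assumptions for illustration:

-- Back up the log of every user database, verify the backup,
-- and generate page checksums; delete backups older than 72 hours.
EXECUTE dbo.DatabaseBackup
    @Databases = 'USER_DATABASES',
    @Directory = N'\\backupserver\sql',
    @BackupType = 'LOG',
    @Verify = 'Y',
    @CheckSum = 'Y',
    @CleanupTime = 72;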

Verify that you have a process to check that your backups are taking place. Test that process often. For example, if SQL Server Agent crashes, your notifications might never fire. Test the backups, as well, by having a machine that restores backups continuously, running DBCC CHECKDB where possible. If you can afford it, configure log shipping so that all backups are restored as soon as they arrive in off-premises storage. Ensure that the backup files are securely copied off-premises as soon as possible.

Run random spot-checks of your backups, as well, by picking a date and time in the backup history and restoring the backup. Aside from testing the databases themselves, this is a good rehearsal for when something goes wrong. Remember to run DBCC CHECKDB on any database you restore.
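
Such a spot-check might look like the following sketch, assuming a hypothetical database named SalesDB whose logical file names are SalesDB and SalesDB_log; all paths are invented:

-- Restore the chosen backup under a different name, then verify it.
RESTORE DATABASE SalesDB_Verify
    FROM DISK = N'E:\Backup\SalesDB_Full.bak'
    WITH MOVE 'SalesDB' TO N'E:\Data\SalesDB_Verify.mdf',
         MOVE 'SalesDB_log' TO N'E:\Data\SalesDB_Verify.ldf',
         RECOVERY;

DBCC CHECKDB (SalesDB_Verify) WITH NO_INFOMSGS;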

You might find as databases grow that the SLA becomes out of date and that the RTO is no longer achievable. Running these tests will alert you to this situation long before it becomes a problem and will allow you to tweak how you perform backups in the future.

For example, you might discover that it is much quicker to spin up an Azure virtual machine with SQL Server and restore the business-critical databases stored in Azure Blob Storage than to struggle with VPN connections and failed hardware on-premises at another datacenter.

Check the security of your DR site

Your DR site might have outdated security access (physical and virtual). Be sure that you stay up to date, especially when people leave the company. You don’t want to be in the position of having to call someone who left your organization two years ago to ask for a firewall password—especially not at 3 AM on a Sunday.

If you must use a remote desktop connection to access a server, protect it by using a VPN. Also, check that you have already downloaded additional tools (like psexec in the example) and documented how to use them.

Keep all passwords and keys (symmetric and asymmetric) in a password manager, and, where practical, also printed and stored in a secure location with the run book. Make sure that all digital certificates are backed up securely.

Automate restore scripts

If the msdb database is inaccessible and you cannot generate a restore script from the database backup history, make sure that you have a tool that generates a restore script based on the files in your off-premises storage. Many tools exist to do this, including the free dbatools (available from https://dbatools.io).
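
For context, such tools reconstruct the same information that you would normally query from the msdb backup history while it is available, along the lines of this sketch (the database name is hypothetical):

-- List the backup history for one database, newest last.
SELECT bs.database_name,
       bs.backup_start_date,
       bs.type, -- D = full, I = differential, L = log
       bmf.physical_device_name
FROM msdb.dbo.backupset AS bs
INNER JOIN msdb.dbo.backupmediafamily AS bmf
    ON bs.media_set_id = bmf.media_set_id
WHERE bs.database_name = N'SalesDB'
ORDER BY bs.backup_start_date;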

Practice your DR strategy

In concert with your HA strategy, which involves automated and manual failovers (see Chapter 12 for more on this), you should perform regular drills to test your run book. You can also have people in your organization who are unfamiliar with the environment look through the run book. They can provide valuable information with regard to assumptions you might have made.

The cadence is up to your organization, but a full DR scenario should be tested at least once or twice a year. Any changes you need to make to the run book should be made immediately. If the recovery fails, make notes of how it failed and what you did to resolve the failure. All of this information is extremely valuable.

Strategies for a cloud/hybrid environment

Many organizations are making use of a combination of on-premises infrastructure and services in remote locations, including Azure services, third-party datacenters, and other cloud vendors.

Recovering data in a hybrid environment

The strategy for recovering data in a hybrid environment is very similar to an on-premises strategy, except that you must take network connectivity, latency, and bandwidth into account.

With cloud services, don’t keep all of your data in one region. Make use of geo-replication, so that if or when one region becomes unavailable, you still have business continuity.

It can be prudent to make use of virtualization technologies that allow for virtual machine snapshots and file system snapshots to ensure that your virtual servers are backed up regularly. You can augment these with appropriate native SQL Server backups that are tested properly.

When designing a recovery strategy for a hybrid environment, pick a DR site that is central, but geo-replicated.

If you are already using Azure Storage for your backups, restoring your organization’s databases to Azure virtual machines or to databases in Azure SQL Database reduces network bandwidth and latency issues.
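
As a hedged illustration, backing up directly to Azure Blob Storage uses the BACKUP ... TO URL syntax. The storage account, container, and database name here are invented, and a credential for the storage account is assumed to already exist:

-- Requires a credential for the storage account (for example, one
-- created from a Shared Access Signature).
BACKUP DATABASE SalesDB
    TO URL = N'https://myaccount.blob.core.windows.net/backups/SalesDB_Full.bak'
    WITH COMPRESSION, CHECKSUM;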

Remember that after failing over to a DR site, you will need to fail back to your on-premises site when it is up and running again.

It’s always good to remember that the backup chain can survive migrations and upgrades. Keep your databases in the full recovery model, and take regular full and transaction log backups (and differential backups where appropriate). Make sure these backups are copied securely off-premises on a regular basis to a central location accessible from each node of your hybrid environment. Test your backups and your recovery strategy regularly, and write it down.

You are now prepared to handle almost any disaster recovery scenario.

Recovering a database in Azure SQL Database

Chapter 5 offers an in-depth look at managing Azure SQL Database, including backup and restore; however, there are three options to consider when restoring a point-in-time backup, and these play into your run book:

  • Database replacement. You can replace an existing database using a database backup. This requires that you verify the service tier and performance level of the restored database. To replace your existing database, rename the old one and restore the new one to the old name (see the sketch after this list).

  • Database recovery. If you need to recover data from a previous point in time, you can restore your database backup with a different database name and then copy the data you need using T-SQL scripts you write yourself.

  • Deleted database. If you deleted a database and you are still within the recovery window, you can restore that deleted database to the time just before it was deleted.
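
For the database-replacement option, you can perform the rename step with T-SQL in the master database of the logical server. This is a minimal sketch with hypothetical database names, run after the point-in-time restore has completed under the name SalesDB_Restored:

-- Run in the master database of the Azure SQL Database logical server.
ALTER DATABASE SalesDB MODIFY NAME = SalesDB_Old;
ALTER DATABASE SalesDB_Restored MODIFY NAME = SalesDB;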

The geo-restore feature can restore a full and differential backup to any server in the Azure region from a geo-redundant backup. However, databases on the Basic service tier can take up to 12 hours to geo-restore. This estimated recovery time should be a consideration in your RTO.

Image To read more about recovering a database in Azure SQL Database, including associated costs, refer to Chapter 5, or visit https://docs.microsoft.com/azure/sql-database/sql-database-recovery-using-backups.
