Chapter 12. Disaster Recovery Techniques for Microsoft SQL Server

You can make individual parts of a system fault tolerant. You can design in redundancy for the systems, making the solution highly available. But what happens when a catastrophe strikes and your fault-tolerant, redundant, highly available systems do not weather the proverbial storm? This is where disaster recovery comes into play.

Many will argue that high availability and disaster recovery are not related, and some mistake high availability for disaster recovery. The truth is that high availability planning must include the entire scope of what it takes to keep an application, business, or other enterprise up and running. Technology that makes your solutions fault tolerant, such as failover clustering, might not be sufficient for your disaster recovery needs. For example, although a technology like log shipping can be employed as a primary method of availability, it can also be used in a disaster recovery scenario. People and processes are crucial to all aspects of high availability, and they also form the backbone of most aspects of disaster recovery. This chapter guides you through planning and preparation for disaster recovery, as well as what you can do from a SQL Server perspective to recover your systems.

Planning for Disaster Recovery

There is no easy or "add on" solution to disaster recovery needs. Disaster recovery planning does not start late in the cycle; it starts from the first day of any project, when you are first assessing risk. Contrary to popular belief, each disaster recovery plan is unique, even for the same technology, because companies are different and their tolerances are different. When you seriously plan for disaster recovery and have your "local" high availability story well defined, two overriding questions govern the rest of the discussion:

  • What are our site disaster risks, and which ones are acceptable?

  • What are our dependencies from the hardware layer right on through to the end user applications?

Important

This book is mainly focused on SQL Server, but disaster recovery encompasses all aspects of a particular solution or system, from people to Microsoft Windows and everything in between. No stone can be left unturned in putting together a disaster recovery plan.

Run Book

There are several benefits to keeping historical records of your environment, including every change made to your servers. The first is, obviously, to help you rebuild it in the event of a disaster. For this reason a complete document, sometimes known as a run book, is crucial to disaster recovery planning and execution. A run book can contain a subset of the configuration information or a very high level of detail, depending on what suits your needs. The run book should also include contact information, disaster recovery plans, and so on.

The run book can be referred to on a daily basis as an easy way to get information about a system. If you keep contact and related system information in it, it can also help you identify systems that send data to or receive data from your systems and the people who need to be contacted in emergencies (or those who might be willing to answer some questions if, for example, you're researching elusive performance problems).

A run book does not have to be an actual printed book, although you should be sure to print several copies on a regular basis. One copy (whether electronic or printed) should go out with the backup tapes so that the system can be restored in the event of a disaster at the main site. In addition to printing the run book, you could also put the information on a Web site that would be accessible by only authorized users. This could be on an intranet or a secure extranet. For example, you could have a Web page that shows a grid of all the latest service packs, hotfixes, system hardware configuration, and so on. You could also pull that information into a SQL Server database or Extensible Markup Language (XML) and base some self-healing maintenance on that as well. You should also document the organization and procedures for the configuration management of a specific product, project, system, support group, or service.

Items to Place in a Run Book

The following categorized list can be used as the basis for a run book or customized operations guide for your database system. This list is based on Microsoft SQL Server 2000, but it could apply equally to other versions, in concept, and it outlines the configuration items you would use to maintain a high standard of operations support. The following list has been reviewed and approved by operations and field personnel within Microsoft, and it represents the current opinions and best practices of those groups.

On the CD

A sample run book can be found on the CD-ROM in the file Run_Book.doc. This can be used and modified to fit your environment; it serves as a good starting point. In addition, many of the worksheets and templates included for use with other chapters in this book can be used as the starting point for information for your run book.

Important

A few of the items that follow note that you must document passwords. First, make sure this is not in violation of any corporate policy. However, if you cannot document passwords needed for vital accounts during a restore, you must ensure that people who know them can be on site during a recovery operation.

SQL Server Administrative Information

There is a minimum set of information that you must keep to fulfill your responsibilities as database system administrator:

  • Maintain information on maintenance plans: all related scripts, how information is transferred for analysis (if it is), how alerts and errors are handled, and remote servers.

  • Maintain information about database backup files, such as their location and type (full database, differential, file, filegroup, log) and how current these are, per location. Also record the times that files are backed up to tape (or other removable storage), where those tapes are stored, and notes on related directories, data, or files that must also be backed up. Include the password for each backup if one is used. A query against the backup history in msdb, such as the sketch after this list, can gather much of this information.

  • From a command line, run Sqldiag.exe and save its output to a text file. Do this on a regular basis, and keep the files as a historical record.

    Important

    Make sure you know what to look for in the output of SQLDIAG. If you or your staff is not familiar with it, script the relevant portions separately.

  • Store your Data Transformation Services (DTS) packages as files to enable an easy modular restore or transfer. Make notes regarding logins and passwords for the packages, and any alternate data sources.

  • Create a script of all system or application users and passwords (including the sa account). Also create scripts for application roles and their passwords, and for linked or remote servers.

  • Record your software serial numbers, keep a copy of the CDs (including all service packs and hotfixes), and note their location on network file shares.

  • Keep a record of all related interconnected systems, including contacts, configuration information, network topology, and documentation of data interfaces.

  • Record any custom-made database administration objects that you depend on to maintain or administer the server.

  • Record hardware and software vendor support phone numbers and account numbers as well as any login/password information for related Web sites.

  • Record contact information for your remote site, if you have one.

  • Record people to notify in the event that you must re-create standard SQL Server users or reset their passwords.

  • Use a tool like Microsoft Visual SourceSafe to manage script versions (for example, schema, installation, and rollback scripts; maintenance scripts; and perhaps even the run book). Scripting is especially important for encrypted objects.

  • Write down contact information. Remember to list these by each person’s role, so that if the person changes jobs, you can still find the correct contact. Ideally, you should record a group name (department, e-mail discussion list, and so on).
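
Much of the backup location and recency information mentioned earlier in this list can be pulled from the backup history that SQL Server records in msdb. The following is a minimal sketch, not a complete solution; it assumes the history has not been purged from msdb and reports only backups taken through SQL Server:

    -- List the last week of backups recorded in msdb, with their destination files.
    SELECT  bs.database_name,
            bs.type,                    -- D = full, I = differential, L = log, F = file/filegroup
            bs.backup_start_date,
            bs.backup_finish_date,
            bmf.physical_device_name
    FROM    msdb.dbo.backupset AS bs
            INNER JOIN msdb.dbo.backupmediafamily AS bmf
                ON bs.media_set_id = bmf.media_set_id
    WHERE   bs.backup_finish_date >= DATEADD(dd, -7, GETDATE())
    ORDER BY bs.database_name, bs.backup_finish_date DESC

Saving this output on a schedule, alongside the Sqldiag.exe output described earlier, gives you a running record of where each database's backups live.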

Analysis Services Administrative Information

There is a minimum set of information you must record if you are using Analysis Services in your solutions:

  • Information on maintenance plans: all related scripts, how information is transferred for analysis (if it is), how alerts and errors are handled, and any remote servers.

  • Setup information about the Analysis Services server. Where is the metadata stored (Microsoft Access database, SQL Server database, or SQL Server Meta Data Services)? Where are the data files located? What are the other configuration parameters for Microsoft SQL Server 2000 Analysis Services?

  • As described in the previous sections, store DTS packages that are used to populate Analysis Services cubes as files.

  • Any custom-made DBA objects that you depend on to maintain or administer the server.

Application System Information

There is a minimum set of information you must record about the non-SQL Server portions of your solutions:

  • List all the applications that must be in place for the system to run (either on the server itself or another system, if that is required). Include custom-built software.

  • Document the application’s security architecture, including type of logins used, any fixed passwords, and role permissions, and note the process for changing passwords on any multiuser logins.

  • Contacts for the application: include developers and anyone else (analysts, testers, managers) who should be involved if a system change is made to the application or to any related system or process.

Database Components

There is a minimum set of information you must record about your databases:

  • Script out all database schemas, collations, jobs, and custom error messages. Anything that can be saved in script form should be scripted and stored historically. User-defined functions, user-defined datatypes, custom roles, triggers, indexes, and object permissions are all easy to miss in the scripting process. Be careful if you intend for your disaster recovery scenario to encompass these objects.

  • Information related to distributed databases or partitions (if applicable), such as data-dependent routing tables and distributed transaction marks (see the sketch after this list).

  • Linked server connections.
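
Transaction marks are what let you bring a set of related databases back to the same logical point in time, so any marking scheme belongs in the run book. The following is a hedged sketch of the technique; the table and file names are hypothetical, and the mark is recorded only in the log of each database the marked transaction actually modifies:

    -- Place a named mark in the log of every database this transaction touches.
    BEGIN TRANSACTION SyncPoint WITH MARK 'Coordinated recovery point'
        UPDATE CustomDB1.dbo.MarkTable SET MarkTime = GETDATE()
        UPDATE CustomDB3.dbo.MarkTable SET MarkTime = GETDATE()
    COMMIT TRANSACTION SyncPoint

    -- At restore time, stop each database's log restore at the same mark:
    -- RESTORE LOG CustomDB1 FROM DISK = 'E:\Backup\CustomDB1_log.bak'
    --     WITH STOPATMARK = 'SyncPoint', RECOVERY

Recording the mark names and the schedule that creates them makes it possible for someone else to find the matching recovery point later.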

Server Configuration

There is a minimum set of information you must record about your servers to use during disaster recovery:

  • Operating system version, with service pack level and hotfixes.

  • Exact list of hardware in use, and how that hardware is configured.

  • Processor information (for example, speed and class), RAM (amount, speed, and type), and BIOS (manufacturer and version).

  • Dates or version numbers of firmware.

  • Physical and logical disk configuration, including redundant array of independent disks (RAID) levels and disk controller information (including write cache settings), disk type and size, any special options used (that is, allocation units, formerly known as block size), and an explanation of how each disk is used.

  • Notes on anything unusual regarding this server, hardware, configuration, or location; for example, if you have disks on different shelves plugged into the same array controller.

SQL Server Configuration

There is a minimum set of information you must record about each SQL Server instance’s configuration for use in a disaster recovery scenario:

  • SQL Server installation configuration, including the installed version and service pack level, and any hotfixes that were applied (much of this can be captured with the query shown after this list).

  • SQL Server instance names, IP addresses, ports, configuration options, database file locations, service logins and passwords, e-mail accounts, enabled network protocols, and their order.

  • File share information, including the shares that the service login must be able to reach, whether through Universal Naming Convention (UNC) names or any other protocols. This can be a potential nightmare in an off-site disaster recovery scenario, so you need to mitigate this for your environment.

  • Configuration information for any other software that runs on the same server. Make sure complete installation and configuration documentation is available and that the correct support personnel (or job titles) are listed as contacts. Also list support numbers and Web sites for each piece of software.

  • Note any client tools that must be installed for remote database connections (for example, to heterogeneous data sources). Note configuration information such as what data access components must be loaded on clients.

  • Document any data source names (DSNs) that exist on the server.

  • Setup of SQL Server 2000 failover clustering, replication, and log shipping, along with a description of the configuration and topology.

  • Multi-instance configuration.

  • Notes on anything unusual regarding this server. For example, special features in use such as XML support for Internet Information Services (IIS), Active Directory directory service support, and so on.
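
Much of the version and instance information in this list can be captured with a short query and stored with the run book. The following is a minimal sketch; it covers only what SERVERPROPERTY and the system tables expose, so items such as service accounts and network protocol order still have to be documented by hand:

    -- Capture basic instance configuration for the run book.
    SELECT  @@SERVERNAME                     AS ServerName,
            SERVERPROPERTY('InstanceName')   AS InstanceName,
            SERVERPROPERTY('ProductVersion') AS ProductVersion,
            SERVERPROPERTY('ProductLevel')   AS ServicePackLevel,
            SERVERPROPERTY('Edition')        AS Edition,
            SERVERPROPERTY('IsClustered')    AS IsClustered,
            SERVERPROPERTY('Collation')      AS ServerCollation

    -- Database file locations for every database on the instance.
    SELECT  DB_NAME(dbid) AS DatabaseName, name AS LogicalName, filename
    FROM    master.dbo.sysaltfiles
    ORDER BY DatabaseName, fileid

Saving the results on a schedule gives you a historical record of instance configuration to compare against after any change.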

Network Information and Configuration

There is a minimum set of information you must record about your network topology for use in a disaster recovery scenario:

  • Switch, hub, router, and gateway information.

  • Topology graphics or diagrams prepared in tools such as Microsoft Visio.

  • Management and configuration utilities for switches, routers, and gateways, and the settings for each; "how-to" instructions or manuals; and scripts.

  • Any redundant path information based on multipath routing.

  • Notes on anything unusual regarding the network.

Storage Information and Configuration

There is a minimum set of information you must record about your disk subsystem configurations for use in a disaster recovery scenario:

  • Switch information.

  • Fabric topology graphic or diagrams prepared in tools such as Visio.

  • Management and configuration utilities for switches, the settings for each, "how-to" instructions or manuals, scripts, and other items.

  • Any redundant path information based on multipathing.

  • Impact of clustering on the fabric, such as zoning or partitioning.

  • Notes on anything unusual regarding the fabric.

Other Necessary Information

This is a list of other items that should also be recorded that do not fall under one of the preceding categories:

  • Complete contact information (home phone, e-mail address, cell phone, pager number—whatever you have) for anyone who might be involved in a disaster recovery. This information might include the following:

    • All of your company’s IT resources necessary for the recovery. This would include technical operations manager, DBAs, system administrators, networking people, and so on.

    • All factory support personnel and contract numbers. For example, your Microsoft Premier Product Support Services (PSS) support representative, the hardware vendor sales representative, and your agreement contract numbers.

    • Off-site data storage facility. You need access to the librarian software to prepare a list of which tapes you need to recall.

    • In addition to the off-site data storage facility, you might also need the name of a high-speed transcontinental courier service. For example, if there is a major blackout in the eastern United States, you might need to recover in the Midwest. When time is critical and overnight services are not fast enough, there are options. Airlines have services that can often provide four-hour bonded courier point-to-point delivery. Research this option in advance and have current contact numbers, drop-off times, and prices available.

    • The vendor information for your hot site location.

    • Senior managers who will be responsible for the recovery and their alternates.

    • Telecommunications managers responsible for networking.

    • Other senior management as needed. For example, general manager, chief information officer, and chief financial officer.

    • Work schedules.

  • Application dependencies.

  • Backup and restore plans.

  • Disaster recovery plans.

  • Service level agreements (SLAs).

  • Like plants, people need care to survive. Two main concerns are sustenance and a place to sleep. You might want to include the following:

    • Phone numbers and addresses of hotels and rental agencies in the area of your recovery site. You should also list alternates in case they are full.

    • Location of cots, couches, or sleeping bags for on-site personnel.

    • Phone numbers of restaurants that deliver.

    • Names and locations of 24-hour convenience stores in the area where you can get the basics after other stores are closed.

    • Check and see what type of bathroom facilities are available on site, too. For example, are there showers nearby?

    • Note the policy about food and where it can be eaten, and document the break room locations.

Important

The run book is only useful if it is meticulously kept up to date and stored on a separate system (which must also be treated as highly available). If you can automate the process, do so.

Tip

The run book should be stored both on site and off site, with both copies kept up to date. At least one copy should be a printed document or an encapsulated program or both (not a backup tape or a database that needs a separately installed GUI). Use a versioning tool such as Visual SourceSafe to keep track of all changes made to the run book. Also keep a copy of all software installations off site in the event systems need to be rebuilt.

SLAs, Risk, and Disaster Recovery

Think back to Chapter 2, and the discussion there of SLAs. As crucial as they are to high availability, they are even more crucial to disaster recovery. This is where the difference between planned or unplanned downtime of minutes, hours, or even days comes into play. SLAs, in the form of support contracts, should exist for every physical and software component of a system to ensure that your SLAs with all parties can be met. If you cannot meet your end-user SLA because, for example, your disk subsystem was down and you did not have a proper SLA in place for it, you have a problem. In some cases covering every component might not be possible, but you need to try. SLAs typically encompass issues such as physical security, money, uptime, and response times.

Obviously, the cost of buying the right SLA must be considered. Does the risk justify the financial cost of addressing it? For example, your company’s off-site data storage facility might have limited hours. This is probably not a wise choice, and you would need to find a better solution. In the meantime, management decides to put their off-site backups in a bank safe deposit box for easy access. This is a cost-effective solution because it costs only $100 per year. However, only certain individuals can get at the tapes and only during banking hours. Alternatively, you might decide to let the DBA on the night rotation take home a set of tapes. This is not only a risk, it might also violate corporate security policy.

When you are purchasing your SLAs, do not take only high availability into account. Remember disaster recovery, too.

Planning Step 1: Assessing Risk and Defining Dependencies

Disaster recovery, like "local" high availability, is about mitigating risk. Risk comes in several varieties, such as risk to the business, risk to human life, and risk to technology. Unfortunately, assessing risk is a complex task and each company’s risks and tolerance for those risks will be different. The following steps show you the thought process to follow and how to assess risk in your company.

Example Scenario

You have implemented a customized Microsoft Solution for Internet Business (MSIB). This encompasses IIS, SQL Server 2000, Microsoft BizTalk, and Microsoft Commerce Server. This particular solution involves two SQL Server 2000 failover clusters, each with two instances of SQL Server on them.

First, identify the physical servers that are part of this solution and their names. This extends not only to the application and database servers, but also to domain controllers and so on. Missing even one server could prove fatal. Taking an inventory of your environment, you see that you have the following:

  • DOMCONT1

  • DOMCONT2

  • STAGING1

  • FIREWALLDOMCONT1

  • FIREWALLDOMCONT2

  • ISASRV1

  • STAGINGSQL1

  • SQL1ANODE1

  • SQL1ANODE2

  • CLU2NODE1

  • CLU2NODE2

  • MSIB1

  • MSIB2

  • BIZTALK1

Next, take an inventory of all of the software configured on each server and make a detailed list, including versions. A final list should also include service packs and hotfixes. Here is a preliminary list for the example environment:

  • Microsoft BizTalk Server 2002

  • Microsoft Commerce Server 2002

  • Content Management Server 2001

  • Internet Information Server 5.0

  • Microsoft Message Queue

  • SQL Server 2000

  • Microsoft Windows 2000

  • Microsoft Windows Server 2003 (for domain controllers)

  • Custom applications

Now it is time to gather information about the instances of SQL Server. The instance names are as follows:

  • SQL1A

  • SQL1BINST2

  • SQL2A

  • SQL2BINST2

  • STAGINGSQL1

Table 12-1 lists the databases.

Table 12-1. Databases

Database | SQL Instance | Use
CustomDB3 | SQL2A | CustomDB3
CustomDB8 | SQL2A | CustomDB3
CustomDB4 | SQL2A | CustomDB3
CUSTOMDB1 | SQL1BINST2 | CUSTOMDB1
Company_CUSTOMDB1 | STAGINGSQL1 | CUSTOMDB1
ETSBizTalkReference | SQL1A | BizTalk
CustomDB2 | SQL1A | BizTalk
InterchangeBTM | SQL1BINST2 | BizTalk
InterchangeDTA | SQL1BINST2 | BizTalk
InterchangeSQ | SQL1A | BizTalk
MCMS_Data | SQL2A | Content Management Server
MSCS_Admin | SQL2BINST2 | Commerce Server
MSCS_Admin | STAGINGSQL1 | Commerce Server
MSCS_BDPermissions | SQL2BINST2 | Commerce Server
MSCS_BDPermissions | STAGINGSQL1 | Commerce Server
MSCS_Campaigns | SQL2BINST2 | Commerce Server
MSCS_Campaigns | STAGINGSQL1 | Commerce Server
MSCS_CatalogScratch | SQL2BINST2 | Commerce Server
MSCS_CatalogScratch | STAGINGSQL1 | Commerce Server
MSCS_EndeavorCE | STAGINGSQL1 | Commerce Server
MSCS_ProductCatalog | SQL2BINST2 | Commerce Server
MSCS_ProductCatalog | STAGINGSQL1 | Commerce Server
MSCS_Profiles | SQL2BINST2 | Commerce Server
MSCS_Profiles | STAGINGSQL1 | Commerce Server
MSCS_TransactionConfig | SQL2BINST2 | Commerce Server
MSCS_TransactionConfig | STAGINGSQL1 | Commerce Server
MSCS_Transactions | SQL2BINST2 | Commerce Server
MSCS_Transactions | STAGINGSQL1 | Commerce Server
Inventory | SQL2A | CustomDB3
Inventory_Asia | SQL2A | CustomDB3
Inventory_SA | SQL2A | CustomDB3
InventoryArchive | SQL2A | CustomDB3
InventoryArchive2 | SQL2A | CustomDB3
CustomDB9 | SQL2A | CustomDB3
CustomDB2 | SQL2A | CustomDB3
CustomDB11 | SQL2A | CustomDB3
CustomDB5 | SQL2A | CustomDB3
CustomDB7 | SQL2A | CustomDB3
XLANG | SQL1A | BizTalk

Table 12-2 lists database dependencies and their impact on availability.

Table 12-2. Database Dependencies and Impact on Solution Availability

Database: CustomDB6
Impact: CustomDB3 (North America) will not work if this database is not running.
Dependencies: CustomDB1, CustomDB8, Inventory, CustomDB9

Database: CustomDB1 (production)
Impact: This is a crucial database for the solution. It will, over time, also contain tables for other applications. It is a central point of integration.
Dependencies: All CustomDB3 databases

Database: CustomDB7
Impact: These are the database lookups for BizTalk Server (BTS) maps. Some tolerance where CustomDB1, BTM, DTA, and XLANG die here.
Dependencies: N/A

Database: CustomDB5
Impact: Risk is low for losing data because it can be recovered elsewhere, but is part of the whole CustomDB2 process.
Dependencies: N/A

Database: InterchangeBTM
Impact: Contains the definition of BTS objects. This will not change often, but is crucial for BTS being up.
Dependencies: N/A

Database: InterchangeDTA
Impact: This is the logging database for BTS. The use of the textimage field is configurable, and right now it is turned on. There is a low risk to the solution from a high level if this database fails, and it is used mainly for archiving and reporting purposes. However, it is actually a high risk to the solution because if BTS cannot log, BizTalk will not be able to work. You will not lose transactions if this happens, though.
Dependencies: N/A

Database: InterchangeSQ
Impact: Shared Queue database. This is very important, as it contains the state of anything received and recorded. If you lose this database, you lose the transaction. CustomDB2 system dials every 60 minutes, CustomDB3 data (spike between 9 A.M. and 12 P.M. North America and Latin America, Asia 10 hours prior, EMEA 6 hours prior).
Dependencies: N/A

Database: MCMS_Data
Impact: Content Management Server will be down if this database is not available, and the entire solution will be down if this database is down because it contains all the site content.
Dependencies: Commerce Server databases (from a restore standpoint)

Database: MSCS_Admin (production)
Impact: Will affect Commerce Server as it is the administrative definition database.
Dependencies: MCMS_Data (from a restore standpoint)

Database: MSCS_BDPermissions (production)
Impact: Like the MSCS_Admin, this is a configuration database mainly used with BizDesk. BizDesk will be affected if this is down, but site will not go down.
Dependencies: MCMS_Data (from a restore standpoint)

Database: MSCS_Campaigns (production only)
Impact: This will not affect solution because it is not in use.
Dependencies: N/A

Database: MSCS_CatalogScratch (production only)
Impact: Used for virtual catalog generation. There is no need to back up or worry about it because it only affects BizDesk.
Dependencies: N/A

Database: MSCS_ProductCatalog (production only)
Impact: Although the site will not shut down if this database is unavailable (it is the master product catalog), you cannot navigate the solution. You will still be able to register users as long as other databases are up.
Dependencies: MCMS_Data (from a restore standpoint), MSCS_Transactions (GUID orphans), MSCS_ProductCatalog (staging)

Database: MSCS_ProductCatalog (staging)
Impact: Very important to the solution, as it contains the copy of the catalog that is pushed out to the production server.
Dependencies: N/A

Database: MSCS_Profiles (production)
Impact: If down, no one will be able to buy anything on the site or register users.
Dependencies: MCMS_Data (from a restore standpoint)

Database: MSCS_TransactionConfig (production)
Impact: This contains the definition of all pipeline configurations. It is updated with code push. It is needed for BTS operation, but can be re-created as long as staging is available. If it is down, users will not be able to buy anything on the site.
Dependencies: MSCS_TransactionConfig (staging)

Database: MSCS_Transactions (production)
Impact: This is the heart of BTS; it contains all completed transactions and it will definitely affect the solution. If this database is down, users will not be able to buy anything on the site.
Dependencies: MCMS_Data, MSCS_ProductCatalog

Database: Inventory_USA
Impact: This has a minimal impact; CustomDB3 will still be up and running if this database is not available.
Dependencies: CustomDB1, CustomDB8, CustomDB6, CustomDB9

Database: Inventory_Asia
Impact: This has a minimal impact; CustomDB3 will still be up and running if this database is not available.
Dependencies: CustomDB1, CustomDB10, CustomDB12

Database: Inventory_SA
Impact: This has a minimal impact; CustomDB3 will still be up and running if this database is not available.
Dependencies: CustomDB1, CustomDB11, CustomDB13

Database: InventoryArchive
Impact: This has a minimal impact on just archived data for CustomDB3.
Dependencies: CustomDB1

Database: InventoryArchive2
Impact: This has a minimal impact on just archived data for CustomDB3.
Dependencies: CustomDB1

Database: CustomDB9
Impact: CustomDB3 will not work if this database is down. Customers and partners are in here, and it contains all item definitions.
Dependencies: CustomDB1, CustomDB8, CustomDB6, Inventory_USA

Database: CustomDB2
Impact: CustomDB3 (Asia) will not work if this database is not running.
Dependencies: CustomDB1, Inventory_Asia, CustomDB12

Database: CustomDB11
Impact: CustomDB3 (South America) will not work if this database is not running.
Dependencies: CustomDB1, Inventory_SA, CustomDB13

Database: CustomDB5
Impact: CustomDB3 (Asia) will not work if this database is not running.
Dependencies: CustomDB1, Inventory_Asia, CustomDB10

Database: CustomDB7
Impact: CustomDB3 (South America) will not work if this database is not running.
Dependencies: CustomDB1, Inventory_SA, CustomDB11

Database: XLANG
Impact: Similar to InterchangeSQ, but stores state for orchestration schedules. This is very important to keeping BTS up.
Dependencies: N/A

Once the dependencies are known, you can then establish the facts for each product that will influence any disaster recovery planning.

Known Facts About Servers

The following is a list of facts for each server product.

BizTalk

  • CustomDB1, CustomDB2, BizTalk, and Commerce Server are all related.

  • If Internet Information Server (IIS), Message Queuing (also known as MSMQ), or Microsoft Distributed Transaction Coordinator (MS DTC) are down, BizTalk will not function.

  • If BizTalk is down, the site will still technically be up, but not fully functional, which essentially means it is down.

  • If BizTalk and MSMQ are down, the site will not be functional at all.

  • If InterchangeDTA is down, you will not be able to report from BizTalk (and BTS will be down because it cannot log). If you restore it to an older point in time, remember that CustomDB1, CustomDB3, and so on are interdependent: a transaction might exist elsewhere that is not in DTA, so you will need to create a process to get DTA back in sync.

Commerce Server

  • CustomDB1, CustomDB2, BizTalk, and Commerce Server are all related.

  • Commerce Server is dependent on IIS.

  • If you lose your MSCS_ProductCatalog database and need to restore it to an earlier backup, you will invalidate MSCS_Transactions because you will now have orphaned GUIDs.

  • If a problem occurs in Content Management Server or Commerce Server, you will have to restore all Commerce Server databases and Content Management Server databases to the same point if (and only if) the products that exist in one place do not exist in the other. This also means that you will probably need another server to restore the older backups on and reconcile the difference, making it a time-consuming process. If you can restore the catalogs back to the same point in time, this should not be a concern. Otherwise, you will get orphaned postings if products were not added to Content Management Server.

  • Without Commerce Server, there is no solution, because it controls all buying functionality.

  • If Commerce Server order status (that is, MSCS_Transactions) cannot be updated due to Commerce Server being down, the transaction from BTS should roll back if there is a schedule in XLANG, and it might have to be pushed back out. This needs to be addressed.

Content Management Server

  • Content Management Server is dependent on IIS.

  • If a problem occurs in Content Management Server or Commerce Server, you have to restore all Commerce Server databases and Content Management Server databases to the same point if (and only if) the products that exist in one place do not exist in the other. This also means that you probably need another server to restore the older backups on and reconcile the difference, making it a time-consuming process. If you can restore the catalogs back to the same point in time, this should not be a concern. Otherwise, you will get orphaned postings if products were not added to Content Management Server.

  • Without Content Management Server, there is no solution, as it is the presentation server for the MSIB site. All permissions and roles are also stored; the site will not work without authentication, and even if you could authenticate, you need the site content (Content Management Server contains 90 percent of the content).

  • Staging Content Management Server is the master for site content and feeds the production Content Management Server, so staging’s availability is crucial for the production and external portion of the solution.

  • The Content Management Server site definition object (SDO) file created by staging updates the production Content Management Server daily. You can take backups from staging and then apply them to production, but not vice versa. The SDO itself is more crucial than the database, because it is the mechanism by which you can re-create objects on the site. To ensure that the company is protected, the following should probably happen, in this order:

    1. Back up the Content Management Server database currently in production (if not done yet; this would be the previous day’s configuration).

    2. Back up the staging SQL database prior to generation of the daily SDO file.

    3. Generate the new SDO file, immediately back it up, and then apply it to the production environment.

    4. Back up the newly updated Content Management Server database in production.

    5. If an error is found due to a bad SDO push or inaccurate data in the SDO, the SDO should be repushed to production when it is fixed.

    6. Content Management Server has the ability to support release dates for certain content, so what appears to be an error might not be.

CustomDB1, CustomDB2, and CustomDB3

  • CustomDB1, CustomDB2, BizTalk, and Commerce Server are all related.

  • CustomDB2 uses CustomDB4, which keeps track of all activity for CustomDB2 trading partners. There are file pointers for every interchange received, and they remain until archived. It is hit every 45 minutes when CustomDB2 goes out to get transactions; this activity lasts for approximately 10 minutes. Risk of losing some data is low. Three systems feed CustomDB2: an EDI Value Added Network (which has its own archiving and can go back four days), the back-end SAP system (which can go back one day), and BizTalk (which also comes from SAP).

  • CustomDB1 has two transactional tables. If the application cannot identify customers, it goes elsewhere.

  • CustomDB1 is hit most heavily in the morning, and that might be a logical time to back the CustomDB1 database up (instead of in the middle of the night).

  • CustomDB1 is a central point of integration and, as time goes on, other applications will use it (such as telcom and asset), so it is important to the solution and the company.

  • If a transaction is not committed, everything will roll back and go into the retry queue (sits in MSMQ).

  • The CustomDB1 and CustomDB3 databases should be restored in a recovery situation at the same time.

  • CustomDB3 has triggers on tables, which feed MSMQ, which then feeds CustomDB1. If CustomDB1 needs to be rolled back to an older backup, you will be missing entries from CustomDB3, so you need to establish a process to roll forward entries from CustomDB3.

  • The SAP feed is a flat file extract delivered once daily in the morning. The flow is to the staging table, to MSMQ, and finally into CustomDB1. If you need to regenerate data from CustomDB1, it is a difficult process because no transaction ID exists, so you need to devise a disaster recovery process for this interaction.

  • CustomDB3 is broken out by region: North America, Asia, and South America. All backups of dependent databases (see Table 12-2) must be coordinated, which means that you need separate SQL Server Agent jobs that are scheduled to run at the same time to ensure that all dependent databases can be restored to the same point.

  • CustomDB8 might be loaded once or twice a week depending on a reseller sending data. From a recoverability standpoint, the company can go back about a week because the Microsoft Excel spreadsheets are available and the spreadsheet names are stored in the database.

  • You might need a separate server to restore older databases to reconcile differences if a problem happens on one or more databases, especially the dependent ones.

Other Servers

  • Without the DOMCONT1 and DOMCONT2 domain controllers, the SQL Server clusters will not work, and their absence will most likely affect all other functionality as well (not just from an SQL Server perspective), such as MSMQ.

  • MSMQ public queues are dependent on the domain controllers for lookup purposes.

  • Without the Internet Security and Acceleration (ISA) Server domain controllers, ISA will not work.

  • Without IIS, no one will be able to connect to the solution, and Content Management Server, Commerce Server, and BizTalk will be unable to function.

  • If MSMQ is down, BizTalk will not function.

  • If MS DTC is down, BizTalk will not function.

  • Without ISA and the firewall, the solution will not be protected (but it should function).

  • SAP technically has no impact on the site if it is down, but it is part of the solution and needs to be brought within the scope of the disaster recovery plan. Every message for SAP will be in MSMQ until it can be submitted. Obviously, if MSMQ fails, you cannot send the messages. The data is in Commerce Server, so it could be re-created.

Risks and Unknowns

The following is a list of risks and unknowns that must be accounted for:

  • You do not have a backup until you have done a restore. You need to restore the backups you make (possibly on a rotational basis) so that you know they are usable; a quick verification sketch follows this list. This requires hardware that might or might not currently exist.

  • You need to determine the retention time of backup files. Currently, retention is limited by storage space, but that will affect the solution's availability if, in an emergency, you find that all of the retained backups are bad.

  • Schedule fire drills and test the disaster recovery plans and process so that it "feels" like the real thing. Making disaster recovery plans without testing them is not good enough.

  • The company needs to create disaster recovery plans based on this document and the Excel spreadsheets containing the backup information.

  • What will happen if the hardware load balancer fails? How is that being protected?

  • What is the impact of a network failure? How is that being mitigated?

  • Is the company stocking hardware, such as additional disks, host bus adapters (HBAs), network cards, and so on, in the event a component fails in one server?

  • There is no process to move the staging MSCS_ProductCatalog data to the production environment. You need to devise and test a way to re-create it.

  • You must revisit the SLAs to ensure that the steps put in place for disaster recovery reflect the time it takes to do the process.

  • When figuring out exact steps, document where you would have to start the whole process by reinstalling the operating system versus other methods (such as third-party software).

  • Include in this disaster recovery document any relevant support information, such as links to the diagrams for the disk configurations. That will speed up any disaster recovery process.
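
On the first point in this list: the only complete test is an actual restore, but SQL Server does provide quick checks that catch unreadable or incomplete backup files. The following is a minimal sketch; the file and logical names are hypothetical:

    -- Confirm a backup file is readable and complete (this does not replace a test restore).
    RESTORE VERIFYONLY FROM DISK = 'E:\Backup\CustomDB1_full.bak'

    -- See what is actually on the media before relying on it.
    RESTORE HEADERONLY FROM DISK = 'E:\Backup\CustomDB1_full.bak'
    RESTORE FILELISTONLY FROM DISK = 'E:\Backup\CustomDB1_full.bak'

    -- The real proof is a restore to a test server under a different name:
    -- RESTORE DATABASE CustomDB1_Verify FROM DISK = 'E:\Backup\CustomDB1_full.bak'
    --     WITH MOVE 'CustomDB1_Data' TO 'F:\Verify\CustomDB1_Verify.mdf',
    --          MOVE 'CustomDB1_Log'  TO 'F:\Verify\CustomDB1_Verify_log.ldf'

Following a verification restore with DBCC CHECKDB gives additional confidence that the backup contains a healthy database.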

Planning Step 2: Putting the Plan Together

Once you have assembled your lists of known facts, risks, and dependencies and have all pertinent information in hand (such as SLAs), you can put the plan together. This plan should include the exact steps it will take to recover the system and who is responsible for what task (job roles mainly; a list maintained in the run book should tell you who is currently responsible for the action). Other information that you need to put into your disaster recovery plan includes the following:

  • Define a go/no-go point for executing the plan. Many companies like to try to solve the problem on their own before engaging any type of support such as Microsoft's PSS, original equipment manufacturers, and service providers. Keep in mind that seconds and minutes matter with SLAs. If you want to try to diagnose the problem on your own, set a reasonable time frame (say, 15 minutes) to solve the problem. If you cannot do it within that time, pick up the phone. Also keep in mind that if you have, say, a four nines SLA (meaning less than 52 minutes of downtime a year), do you really want to try to diagnose the problem yourself? Putting in explicit instructions avoids what is sometimes known as "the hero syndrome." A clear understanding of what skills and knowledge you have in house makes it clear which failures can be addressed internally and which must be escalated immediately.

  • Define a chain of command. Chaos can quickly ensue with too many chefs and not enough bottle washers, or plenty of bottle washers but no chef to direct them. Roles and responsibilities should be clear. Even if the people performing the job change, everyone should know how they fit into the works and who reports to whom.

    Important

    A chain of command can solve a bottleneck when dealing with support. Often, when support professionals are contacted, they require a seemingly large amount of information, potentially including a copy of your database. Constructing barriers that prevent the support professional from helping you only increases your downtime. Remember that reputable support organizations are not there to peruse your potentially sensitive data or steal your trade secrets. They want to help you get up and running. Work out any corporate security issues that relate to external organizations prior to contacting support as part of your SLA. Conversely, if you will never be able to work with a support organization because you cannot divulge certain information, you might want to rethink your strategy. Is it better to be down or up?

  • Define a communications plan. With a chain of command, you must establish clear communications. Communications plans are based partially on being able to reach everyone who needs to be involved, so make sure that any type of contact information is updated in the run book. This also goes for any type of work schedules, on-call schedules, and so on.

  • Define schedules for the recovery plan itself. A recovery is often a 24-hour operation. It is neither practical nor smart to expect one person to oversee or work a particular aspect of the plan from the beginning to the end. People become fatigued after a period of time and might make poor decisions. The correct solution is to define shifts and staff them appropriately.

  • Define milestones. These are events on the critical path that, when complete, represent forward movement. Milestones make it easy to gauge your progress. Some examples follow.

    • All recovery personnel have been notified.

    • Personnel have shown up or been called in and are ready.

    • Call numbers for the site and personnel are physically available in e-mail.

    • The backup server is physically ready.

    • Operating systems and service packs are installed.

    • SQL Server is installed.

    • Application software is installed.

    • Tapes have arrived from off-site facilities.

    • The restoration of the database has begun.

    • Periodic updates are provided on the restore operation’s percentage complete.

    • The database restore is completed.

    • Logs have been reviewed.

    • Results of various diagnostics attest to the health of the database (for example, DBCC CHECKDB).

    • Applications have been started and connected to the database.

    • Application test reports confirm validity of the data.

    • Management has made the go/no-go decision.

  • Define acceptance criteria and a testing suite for a go/no-go decision. The confidence of recovery personnel in the successful accomplishment of milestones is key. For example, if the restore finished, but the DBCC CHECKDB failed, then clearly there is a lack of confidence in the recovery. Problems influencing a decision to go live can be independent of the database recovery, of course. If SQL Server is completely operational, yet the network is not functional at the time, then clearly the users are not going to be coming online.

When All Else Fails, Go to Plan B

Do not be naive and assume that your plan is perfect. There could be something beyond your control (such as data corruption or inability to get a new disk drive) that might mean you are down for a longer time. Always have a trick up your sleeve and create a backup plan so that if your primary plan fails, you can still get up and running sooner rather than much, much later.

Tip

You might even want to execute Plan B in parallel with Plan A so that there is no lag time in the event Plan A fails.

More Info

For links to SQL Server–specific disaster recovery Knowledge Base articles that contain additional information, consult Knowledge Base article 307775, "INF: Disaster Recovery Articles for Microsoft SQL Server." You might also want to print some of these articles and incorporate them into your test plans or run book.

Testing Disaster Recovery Plans

Remember, you have not done a backup until you test the restore. Similarly, you do not have a disaster recovery plan until you test it. Why would you spend time developing a plan if you know it will never be tested? Equate testing to holding a fire drill. Most municipalities around the world insist that all office buildings practice fire drills to ensure that people can get out and that all systems are working. Protecting your SLAs, as well as equipment and human life, is just as vital to your business. A successful test will give you confidence in your plan.

Every company that cares about disaster recovery needs to have a simulated recovery drill at least once a year, preferably more often. This ensures that everyone knows his or her role and responsibilities, and can execute the plan successfully. If plans have not been proactively updated as they should have been and not everything is documented, the mock drill is the time to find out, not during a real disaster.

A drill should start by simulating a system crash or that the primary site has become physically inaccessible. For example, assume that there has been a major accident at a chemical plant and your offices have been completely evacuated. At that point, the team is assembled, responsibilities are handed out, and the team sets out to the recovery site to put the system back together from the ground up. Disaster recovery could also be as simple as rebuilding a database locally.

Figuring out the logistics for the test is the complicated part. For example, is it really a valid test if everyone knows the exact time and date in advance? If you do not want it to be a complete surprise, you might give a vague time. The problem with publishing a time and date is that people can prepare. You must also notify users, management, and vendors if they will be affected or involved in the test. You should even go so far as to book flights and hotels, if necessary. In all regards, this should feel and act like the real thing.

During the drill, make notes about what is working and what is not, or where things are inaccurate. After the drill comes arguably one of the most important activities, the postmortem. This postrecovery meeting assesses what was good and what was bad. This feedback then needs to be incorporated back into the disaster recovery plan and the run book, both of which should be updated immediately so that whoever executes the plan in the future benefits from the experience, even if they were not part of the drill.

Executing Disaster Recovery Plans

At some point, you might be called on to execute the plans you have put into place. Successful execution means more than just the correct bits on the disk. It is about the process and organization of the team to get to that point. If you have not tested your plan, and do not know with confidence that it will work or how long it will take, you could be in for a rude awakening. It cannot be stressed enough how much preparation shortens downtime. Keep in mind that even with careful planning and testing, you can always encounter a situation that is beyond your control, and you could still miss the agreed-on timeframes detailed in your SLAs. These are the realities of disaster recovery, as it is by nature a difficult and pressure-filled operation. Do not plan for disaster recovery or go into executing a plan thinking that everything will go as planned; expect the unexpected.

During the execution of the actual recovery, everyone involved should be making notes about what is working and what is not, or where things are inaccurate. These notes can be logged directly into the run book text as each step is executed. Notations such as "This did not work for the following reason ..." and "This is what had to be done to resolve the issue" allow process and procedure improvement. After the fact, recall will not be complete and accurate, so any postmortem used to improve the process and procedures will not be as productive as possible, and could even cause future harm if incorrect information is acted on. This postrecovery meeting will assess what was good and what was bad and the feedback should be reflected in an updated disaster recovery plan.

Example Disaster Recovery Execution

The following is a simple example of a basic recovery plan execution.

Roles and Responsibilities

  • Name: Tim

  • Role: Operations Manager

  • Recovery responsibility: Coordinate resources, interface with management and business users. Responsible for go/no-go decision. Sends out status e-mails.

  • Rotation: Watch beginning of recovery. Notified as needed during the restore. End communication with business users.

  • Names: Jane and Bill

  • Role: DBAs

  • Recovery responsibility: Install SQL Server software. Restore SAP, msdb, and master databases. Ensure that SQL Server Agent jobs are running properly. Run DBCC CHECKDB. Review error log and make sure it is clean. Make sure that database network libraries are configured properly. Install SQL Server client on the application servers.

  • Rotation: Jane will start the restore. Bill will come on at 2:00 A.M. and finish up the restore, and check the DBCC and logs.

  • Name: Doug

  • Role: OS administrator

  • Recovery responsibility: Install the operating system. Apply service packs and security patches. Confirm configurations: drive letters and disk layout are as expected, disks are formatted with 64-KB allocation units, the firmware is correct, network cards are set to the proper duplex, and so on. Check Event Viewer and the error logs.

  • Rotation: Build the server on the front end. Be on site for any other issues as they arise.

  • Name: Shrikant

  • Role: SAP basis administrator

  • Recovery responsibility: Build SAP instance. Confirm profile parameters. Ensure connectivity to the database. Check that network library configurations are correct. Make sure batch jobs are running appropriately. Confirm the transport paths are in place and operational. Bring SAP up so it is open to the users.

  • Rotation: Will configure SAP and check the status at the end. Configure SAP at 11:30 P.M. and check at 6:00 A.M.

Recovery Timeline

  • 9:28 P.M. A power surge crashed the storage area network and corrupted the database. The server itself has crashed, and the database is now unavailable. Operations paged the on-call production manager.

  • 9:30 P.M. Operations manager receives notification and dials in to check the system. Recovery team called. Disaster recovery plan activated.

  • 9:45 P.M. Tim calls the CIO and notifies him or her that a business interruption has occurred. Tim posts the first e-mail to business users about the status and plan as well as notification checkpoints.

  • 10:15 P.M. The team starts showing up. Food and beverage orders are taken. Pizza is ordered for the team and coffee is put on.

  • 10:50 P.M. Off-site backup tapes are recalled.

  • 11:00 P.M. Recovery begins. Backup server is confirmed to be on the latest software and firmware revisions. Service packs and security patches are applied. Status e-mail is sent out.

  • 11:30 P.M. SQL Server installation begins. Service packs are applied. When this process is complete, a status e-mail is sent out.

  • 11:36 P.M. SAP is installed and configured on the central instance. Application servers are recovered and the profiles are adjusted for the new database server.

  • 11:45 P.M. Tapes arrive. Contents are confirmed. Another status e-mail is sent.

  • Midnight. Database restore begins. The master and msdb databases are restored. Another status e-mail is sent.

  • 12:30 A.M. SAP database restoration begins. Another status e-mail is sent.

  • 2:00 A.M. Restore is 39 percent complete. Estimated time to completion is 3.5 hours more. A status e-mail is sent out with an estimated completion of 5:30 A.M.

  • 4:00 A.M. Restore is 78 percent complete, as reported by RESTORE with the STATS option (see the sketch after this timeline). Another status e-mail is sent out.

  • 5:30 A.M. Restore is complete. DBCC CHECKDB is started. SAP is brought up and locked in single-user mode. SAP diagnostics are begun. Another status e-mail is sent out.

  • 6:00 A.M. SAP is confirmed to be working correctly.

  • 7:18 A.M. DBCC CHECKDB complete and clean. Status e-mail is sent out to technical team.

  • 7:20 A.M. Go live decision is made by the operations manager in conjunction with the CIO. A go is given to start up production at half past the hour.

  • 7:30 A.M. SAP is started and error logs are checked. Business users are notified. Work can commence.
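
The percentage-complete figures in this timeline come from the STATS option of the RESTORE statement, which prints progress messages as the restore runs. A minimal sketch, with hypothetical database and file names, follows:

    -- Report progress every 10 percent while restoring the full backup.
    RESTORE DATABASE SAPR3
    FROM DISK = 'E:\Backup\SAPR3_full.bak'
    WITH STATS = 10,
         REPLACE,   -- overwrite the damaged database being rebuilt
         RECOVERY   -- use NORECOVERY instead if log backups will follow

Relaying those progress messages in the periodic status e-mails keeps the operations manager and business users informed without interrupting the DBA doing the restore.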

Each recovery for each company is different. Different personnel and capabilities exist. Hence, each schedule will be different. The idea is that one person cannot be expected to do it all and responsibilities and roles need to be parceled out appropriately.

Disaster Recovery Techniques

Chapter 9 and Chapter 10 covered backup and restore from a high availability standpoint. One topic not covered, however, was the restore of the system databases, which are critical for the successful recovery of your user databases in a disaster recovery scenario.

More Info

For specific disaster recovery scenarios with clusters, see Chapter 5, Chapter 6, and Chapter 8. Use that information in conjunction with the information presented here to help recover your systems completely. This section deals mainly with the general SQL Server failure condition.

Warning

Again, it cannot be stressed enough: unless yours is a very experienced shop that has seen this type of failure before and knows with certainty how to handle it, it is crucial to know what procedures to follow. Ultimately, the process is best led by a support professional or senior staff member who knows what he or she is doing. Performing any of the work in the following list improperly could worsen your situation.

You should be aware of some important characteristics of system databases:

  • The default recovery model on system databases is Simple. Although this can be changed, you should leave the default unless you have a good reason to change it. Because the recovery model is Simple, you cannot do point-in-time recovery. This is one of the main reasons that anytime you make a system change (for example, adding new logins or adding or growing databases), you should back them up.

  • Tempdb is rebuilt every time SQL Server is started, so it cannot be backed up, nor does it need to be. It is regenerated at the size specified in its configuration.

  • It is possible to rebuild master, but not any other database. When you are rebuilding the master database, users cannot be on the system. This is not the case when user databases are being restored.

There are further distinctions, but this list is a brief summary of the basic factors that make a rebuild different from other restores.

As important as your user databases are, the system databases are just as—if not more—important to protect. If you lose your master database, SQL Server ceases to operate correctly. This is one of the most basic premises of proper design in architecting your high availability solution. Just as you should separate your data volumes from your transaction log, you should also separate your system databases (master, msdb) onto a more secure area of your system protected by RAID. Prevention is much easier than a rebuild from the ground up.

Tip

System databases are small in comparison to user databases. They range in size depending on system usage. Due to their smaller size, but their importance to SQL Server availability, you should back them up every night along with your user databases. Additionally, scheduling a DBCC CHECKDB on them periodically helps confirm the overall health of SQL Server, too. The work involved in doing this test is a very small price to pay for the long-term benefits.
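
Because the system databases are small, protecting them is cheap. The following is a minimal sketch of such a job; the backup file paths are hypothetical and would normally point at the same location as your user database backups:

    -- Nightly backups of the system databases.
    BACKUP DATABASE master TO DISK = 'E:\Backup\master_full.bak' WITH INIT
    BACKUP DATABASE msdb   TO DISK = 'E:\Backup\msdb_full.bak'   WITH INIT

    -- Periodic health check; NO_INFOMSGS limits the output to real problems.
    DBCC CHECKDB ('master') WITH NO_INFOMSGS
    DBCC CHECKDB ('msdb')   WITH NO_INFOMSGS

Scheduling these statements as a SQL Server Agent job right after the user database backups keeps the system databases in step with every system change.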

Step 1: Assessing Damage

Many times when a system crashes, everything is not completely lost. There are reasonable starting places other than a total flatten and rebuild. It is in your best interests to survey the damage first and then put together a plan of attack.

The five categories that most issues fall into are as follows:

  • Hardware failures. These include CPU, memory, and media failures (that is, a disk drive failing, even if you have RAID).

  • Logical failures. This is human error, in which bad data is inserted or a DBA does something wrong, such as accidentally dropping a critical table.

    Tip

    When making large or critical system changes, not only should you make a backup prior to starting, but you should also wrap the Transact-SQL in an explicit transaction by starting it with BEGIN TRAN. If everything succeeds, you commit the transaction; if something goes wrong (for example, the DBA forgot to put a WHERE clause on a DELETE statement), you can roll back the transaction without damage. A brief sketch appears after this list.

  • Operating system failures. This is a problem you cannot do much about because it is external to SQL Server: the operating system kernel fails to complete a function. All you can do is keep up with service packs and the appropriate hotfixes for your environment.

  • Device driver failures. A driver for a particular piece of hardware malfunctions. Like an operating system failure, this is external to SQL Server, but affects it. A good example is a bad SAN driver on a cluster.

  • Low-level failures. A low-level failure occurs when something abnormal happens to a SQL Server database. It can be situational and is usually rooted in hardware failure. For example, a power outage could cause a torn page or corruption. You would then need to perform the appropriate action, such as issuing a DBCC CHECKDB or restoring a full, good backup.
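
As noted in the tip under logical failures, wrapping a risky change in an explicit transaction gives you a chance to back out. A minimal sketch follows; the table name and criteria are hypothetical:

-- Wrap a risky change in an explicit transaction (table and criteria are hypothetical)
BEGIN TRAN
DELETE FROM dbo.OrderDetail WHERE OrderID = 12345

-- Check the effect before making it permanent
SELECT @@ROWCOUNT AS RowsDeleted

-- If the row count looks right, commit; otherwise roll back:
COMMIT TRAN
-- ROLLBACK TRAN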

Tip

Use battery-backed-up controllers to help protect your data from power outages. Many controllers have extensive amounts of cache on them to increase performance. If you lose power, you have lost the data if there is no battery backup. Additionally, an interrupted write due to power loss can cause a torn page.

As part of any good game plan, you should make a backup of the server if possible. This could be a flat file operating system backup as well as a SQL Server backup. If Microsoft Product Support Services (PSS) needs to be engaged, a "before" image of a corrupt database can offer valuable clues for rescuing data. If you destroy the original state of the data by restoring over it before backing it up, you have lost a good starting point for PSS.

In the same vein, people often find out the hard way that the backup they thought was good really was not, so they are in worse shape after the restore than before. If you made a backup at the operating system level with Automated System Recovery (ASR) before starting the restore process, you can always restart. Remember, the only backup you know is good is the last one you successfully restored.

Tip

Restore your backups often to help ensure their integrity and make yourself familiar with the restore process. It also helps to know how long a database restore takes. Users often want to know, "Is it finished yet?"
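
A minimal sketch of verifying a backup follows; the database name, logical file names, and paths are hypothetical. Note that RESTORE VERIFYONLY only confirms that the backup media is readable, so an actual restore to a scratch name and location remains the better test:

-- Quick readability check of a backup file (path is hypothetical)
RESTORE VERIFYONLY FROM DISK = 'E:\SQLBackups\ProductionDB.bak'

-- Full test restore to a scratch database (names and paths are hypothetical)
RESTORE DATABASE ProductionDB_Test
   FROM DISK = 'E:\SQLBackups\ProductionDB.bak'
   WITH MOVE 'ProductionDB_Data' TO 'F:\SqlTest\ProductionDB_Test.mdf',
        MOVE 'ProductionDB_Log' TO 'F:\SqlTest\ProductionDB_Test_log.ldf'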

If you have any doubts about the process or do not feel comfortable with recovering your system, stop and engage PSS. Not performing any work can actually be a safer alternative than haphazardly beginning an unsuccessful rebuild.

Tip

Any "ground up" recovery starts with a good backup of the master and msdb databases. Make sure you always have current, known, good backups of your system databases and run health checks against them.

Step 2: Preparing for Reconstruction

Before starting any reconstruction of a server, whether minor or full-blown, there are certain steps that need to be taken and items that need to be in place. First and foremost, ensure that you have the right person in place to execute any actions. Do not leave an important task to a rookie system engineer or DBA who does not know your systems well enough. The following steps offer fallback positions if you need to start over or engage other support personnel. They also help if you need to reconstruct at an off-site recovery location.

  1. Refer to the documentation in your run book for your hardware configuration, which should include model numbers, the physical layout (including which card goes in what slot), configuration settings for each component, firmware revisions, driver versions, block size of the file system, and so on.

  2. If your server is up, make an operating system-level backup of the server with SQL Server down, including ASR. This allows you to go back to a known state before attempting any rebuilds and gives you the ability to send the damaged data to PSS if needed.

  3. After making the operating system backup, back up what you can using the standard SQL Server backup processes. Try to run a full backup, and also back up the tail of the transaction log; together these allow you to recover to the point of failure. (A sketch of a tail-of-the-log backup appears after this list.)

  4. If the database is accessible, run DBCC CHECKDB to assess the damage.

    Warning

    Running DBCC CHECKDB can take a long time depending on which options you select and the size of your database. This might not be appropriate for all environments and should be done at the recommendation of a support professional.

  5. When you want to rebuild SQL Server, as with the server itself, a build sheet should be accessible in your run book. The information that should be included is listed in the section "Run Book" earlier in this chapter.

  6. If you are restoring at a remote disaster recovery site, ensure you have compatible tape media hardware and the proper version of the restore software at that location, too. For example, you do not want to find out that your site uses Linear Tape-Open (LTO) technology and the disaster site only has Super DLT technology.

  7. Your run book should also include software and license keys for all components (Windows, SQL Server, third-party applications, and so on) that will be rebuilt.

  8. A connection to the Internet will help you gather service packs and hotfixes, as well as information such as the Knowledge Base articles on http://support.microsoft.com. You should probably pull all relevant service packs and hotfixes and burn them to a CD or DVD that can be easily accessed in a disaster recovery situation.

  9. Reference your vendor call list and contract numbers in the run book. If your support contracts have expired, you might need a credit card handy in case you need to call in other vendors using phone support.
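
For Step 3, a minimal sketch of backing up the tail of the transaction log follows; the database name and path are hypothetical. The NO_TRUNCATE option lets you back up the log even if the data files are no longer accessible:

-- Capture the tail of the log so you can recover to the point of failure
BACKUP LOG ProductionDB
   TO DISK = 'E:\SQLBackups\ProductionDB_tail.trn'
   WITH NO_TRUNCATE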

Important

The key to success is having all this in place prior to a disaster.

Step 3: Reconstructing a System

The following are the basic steps that should be followed if you need to do a full recovery, or "flatten the system," when all components need to be rebuilt:

  1. Rebuild the system, starting with the operating system and any service packs, hotfixes, and so on. This should be done with the proper tools (such as NTBACKUP or a third-party utility like Veritas Backup Exec or another enterprise backup tool). Depending on the scope of the damage, you might even need to low-level format your disks.

  2. Test the operating system to ensure that it is configured properly.

  3. Follow the steps detailed later to restore or recover system and user databases.

  4. Restore full-text indexes (if necessary or possible).

  5. Synchronize logins, and perform any other tasks (such as restoring custom objects with scripts) to bring your SQL Server instance to a usable point.

  6. Test everything extensively.

On the CD

You can find a flowchart that illustrates the disaster recovery process in the file Disaster_Recovery_Process.pdf.

Restoring the Tempdb Database

You do not need to do anything if tempdb is damaged, because tempdb is regenerated each time SQL Server restarts. If the disk that holds tempdb is damaged, however, you might need to use the ALTER DATABASE statement to move tempdb to a new location.
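
A minimal sketch of relocating tempdb follows. The logical file names shown (tempdev and templog) are the SQL Server defaults, and the target paths are hypothetical; the move takes effect the next time SQL Server is restarted:

USE master
GO
ALTER DATABASE tempdb MODIFY FILE
(NAME = tempdev, FILENAME = 'F:\SqlData\tempdb.mdf')
GO
ALTER DATABASE tempdb MODIFY FILE
(NAME = templog, FILENAME = 'G:\SqlLog\templog.ldf')
GO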

Step 1: Rebuilding and Restoring Master

To put it bluntly, without master, there is no SQL Server, so ensuring that master is up and functioning is paramount to any recovery operation. This is why you must go to great lengths during the planning phases of the system to use techniques like RAID and put the proper backup scheme in place. If you lose master, however, it does not necessarily mean that your user databases are damaged. You can rebuild or reinitialize the server, rebuilding only the master database, and then restore the system databases from your last verified backup files. There are some very important requirements to make this work successfully, but if you plan for it properly, it is easy.

Tip

Remember to back up your master database any time you make a change to it.

Caution

You cannot restore a backup of a system database, especially master, that was taken on one build of SQL Server onto a server running a different build, because there are schema or metadata differences between builds. The following is an example of the resulting error message:

Server: Msg 3168, Level 16, State 1, Line 1
The backup of the system database on device d:\temp\master.bak
cannot be restored because it was created by a different version
of the server (134217904) than this server (134217920).
Server: Msg 3013, Level 16, State 1, Line 1
RESTORE DATABASE is terminating abnormally.

You need to reinstall and apply all service packs, hotfixes, and so on, to be able to use the backup. Go to http://support.microsoft.com and consult Knowledge Base article 264474 for information and workarounds. Remember to back up system databases both before and after applying a SQL Server 2000 hotfix or service pack so you can restore to the proper point in time.
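
A quick way to confirm the build you are currently running before attempting the restore is to query the server version; a minimal sketch follows:

-- Confirm the current build matches the build that produced the backup
SELECT @@VERSION
SELECT SERVERPROPERTY('ProductVersion') AS ProductVersion,
       SERVERPROPERTY('ProductLevel') AS ProductLevel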

Rebuilding Master for 32-Bit SQL Servers

The tool used for rebuilding the master database is called Rebuildm.exe, a binary executable located in the Tools\Binn directory of your SQL Server installation on your hard disk.

Caution

Do not rebuild your master database without knowing the consequences and making absolutely sure you need to do this. Follow the text in this entire section relating to master for more information.

Important

To run REBUILDM, you also need access to the SQL Server installation CD-ROM or installation point because it includes various data files needed for the tool to run. Because the CD-ROM is read-only, you must copy the x86\Data directory from the CD to the server before starting the REBUILDM utility, and change the attributes of all the copied files so that they are no longer read-only. This can be done from a command line with attrib -r *.* in the directory, or by right-clicking the files, selecting Properties, and clearing the Read-Only check box.

If you have access to a network share installation of SQL Server, that would also be appropriate. However, ensure that the files do not have a read-only attribute (as they would if it was just a copy of the CD-ROM).

Use the following steps to run REBUILDM.

Important

If you are running REBUILDM on a clustered instance of SQL Server 2000, make sure you are on the node that currently owns the SQL Server resources. You can verify this in Cluster Administrator.

  1. Ensure all files are not read-only.

  2. Take SQL Server offline using SQL Server Service Manager.

  3. From a command line, run Rebuildm.exe. When the main dialog box opens, as shown in Figure 12-1, select or enter the name of the SQL Server in the Server drop-down list box, select the directory of the files that you copied from the CD-ROM, and select the proper collation. Click Rebuild to begin the rebuilding process.

    Figure 12-1. The main Rebuild Master dialog box.

Warning

Make sure you select the right collation for your SQL Server. It should match the previous setting for this particular instance. If you do not select the right one, you might have some incompatibilities with your user databases.

The process then copies the data files for master to the right location and configures the server. A successful completion results in the dialog box shown in Figure 12-2.

Figure 12-2. A successful Rebuild Master completion.

Rebuilding Master for 64-Bit SQL Server

Under the 64-bit edition of SQL Server 2000, you cannot use REBUILDM to rebuild your master database. For a standalone instance, follow the instructions outlined in the SQL Server 64-bit Books Online topic "Repairing a SQL Server Installation (64-bit)." On a SQL Server 2000 64-bit virtual server, you must follow the steps outlined here, as the standard process does not work.

Note

To rebuild the master database, you need the Product ID (PID) used to install SQL Server 2000 (64-Bit). The PID is a globally unique identifier (GUID) that is stored in the registry. For a default instance of SQL Server 2000 (64-Bit), the PID is in the registry under HKEY_LOCAL_MACHINE\Software\Microsoft\MSSQLServer\Setup\ProductID. For a named instance of SQL Server 2000 (64-Bit), the PID is in the registry under HKEY_LOCAL_MACHINE\Software\Microsoft\Microsoft SQL Server\<instance name>\Setup\ProductID.

When you rebuild the master database, you can configure the sa password, collation, cross-database ownership chaining, and error reporting. If you make changes to the authentication mode, they are not reflected after the server is reinstalled. You can change the authentication mode manually after the reinstallation is complete.

You must complete all steps from the cluster node that owns the SQL Server resource.

To rebuild the master database of a default instance of Microsoft SQL Server 2000 (64-Bit), follow these steps:

  1. Take the SQL Server virtual server offline.

  2. In Registry Editor, under HKEY_LOCAL_MACHINE\Software\Microsoft\Windows\CurrentVersion\Uninstall\<PID>, where <PID> is the PID of the Windows Installer package used to install SQL Server, change the SqlCluster value to 0 and note the value of SqlClusterSec. Then change the value of SqlClusterSec to 0. You will need to restore the value of SqlClusterSec after you rebuild the master database.

  3. Run Cliconfg.exe to launch the SQL Server Client Network Utility.

  4. In the SQL Server Client Network Utility, on the Alias tab, click Add. The Add Network Library Configuration window appears.

  5. Under Network Libraries, click Other.

  6. In the Server Alias text box, type <computer name>. For a named instance, type <computer name>\<instance name>, where <computer name> is the name of the local computer and <instance name> is the name of the SQL Server instance. The local computer must be the cluster node that owns the SQL Server resource.

  7. In the File Name text box, type DBNETLIB.

  8. In the Parameters text box, type LPC:<virtual server name>. For a named instance, type LPC:<virtual server name>\<virtual server instance name>, where <virtual server name> is the name of the SQL Server virtual server, and <virtual server instance name> is the name of the named instance of the SQL Server 2000 (64-bit) virtual server.

  9. Click OK.

  10. Run Msiexec.exe to rebuild the master database. Type msiexec.exe /i <PID> REINSTALL=ALL REINSTALLMODE=amus SQLAUTOSTART=0 /L*v <path and filename.txt>

    where <PID> is the PID of the Windows Installer package used to install SQL Server, and <path and filename.txt> is the location and file name for the verbose log file. For example, type d:\rebuild_master_verbose_log.txt.

  11. In the Reinstallation dialog box that appears, make any necessary changes to the sa password, collation, cross-database ownership chaining, and error reporting. If you make changes to the authentication mode, they are not reflected after the server is reinstalled. You can change the authentication mode after the reinstallation is complete.

    Note

    In some situations, the computer might display a message stating that file operations are pending and ask you if you want to reboot the computer. Click No.

  12. In Registry Editor, under HKEY_LOCAL_MACHINE\Software\Microsoft\Windows\CurrentVersion\Uninstall\<PID>, change the SqlCluster value to 1. Restore the value of SqlClusterSec to the value you noted in Step 2.

  13. Run Cliconfg.exe to launch the SQL Server Client Network Utility. On the Alias tab, select the server alias that you created in Step 6, and click Remove.

  14. If the computer requests a restart, restart the computer now.

  15. Bring the virtual server online.

Restoring the Master Database

As noted earlier, once the master database has been rebuilt, it has no knowledge of the user databases or user logins. To finish the recovery process, you need to restore the master database from the most recent available backup. There are two scenarios:

  • Restoring master without a valid backup. There is no way to restore master in this case other than by using REBUILDM and then going through the steps for the other databases.

  • Restoring master with a valid backup. The way to restore master in this case is as follows:

    1. Ensure that your SQL Server 2000 installation is at the same version as your backup, including hotfixes and service packs.

    2. Optionally, create a logical backup device to reference the location where the backup of master is located, or you can just point the restore to the physical file. To create the device, use, for example:

      sp_addumpdevice 'disk', 'master_backup_device',
      'c:\backup\master.bak'

    3. Take SQL Server offline using SQL Server Service Manager.

    4. If you have changed the recovery model for msdb to Full, save the transaction log before attempting any rebuild process for master; this allows you to back up the tail of the log and have up-to-the-minute recovery. To do this, after SQL Server is offline, you have to go to the location of the transaction log for msdb and rename it in a command window or in Windows Explorer because executing REBUILDM overwrites it.

    5. Rebuild the master database (see the previous section).

    6. In a command window, restart SQL Server in single-user mode from the command line using Sqlservr.exe. This is located in the Binn directory under your SQL Server 2000 installation (for example, MSSQL\Binn for a default instance). The following syntax is an example of a default instance and its resulting output:

      D:\Program Files\Microsoft SQL Server\MSSQL\Binn> sqlservr.exe
      -c -m
      2003-02-02 15:44:08.43 spid3  SQL Server started in single user
      mode. Updates allowed to system catalogs.
      2003-02-02 15:44:08.46 spid3  Starting up database 'master'.

      This is the syntax for a named instance:

      D:\Program Files\Microsoft SQL Server\MSSQL$instancename\Binn>
      sqlservr.exe -c -m -sinstancename

      If you do not bring up SQL Server in single-user mode, you will see the following error when you attempt to restore master:

      Server: Msg 3108, Level 16, State 1, Line 1
      RESTORE DATABASE must be used in single user mode when trying to
      restore the master database.
      Server: Msg 3013, Level 16, State 1, Line 1
      RESTORE DATABASE is terminating abnormally.

      Do not close this command window until Step 8 is completed, because you are now running SQL Server as a process in that window.

    7. You can now start Enterprise Manager or open a query tool such as Query Analyzer or OSQL, and connect to your instance as the system administrator (sa) or another trusted administrator to begin the restore process.

    8. Restore the master database from a backup, using either the backup device or the FROM DISK syntax (an example of the latter appears after these steps). For example:

      restore database master from master_backup_device

      When complete, you will see the following output:

      The master database has been successfully restored. Shutting
      down SQL Server.
      SQL Server is terminating this process.

      If you are running Enterprise Manager, you might have to manually terminate it with Task Manager.

    9. Bring SQL Server online using SQL Server Service Manager.

    10. Back up master.
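
For Step 8, the equivalent restore using the FROM DISK syntax might look like the following; the path shown matches the backup device created in Step 2:

RESTORE DATABASE master
   FROM DISK = 'c:\backup\master.bak'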

If you have the ability to back up the tail of the log for msdb, follow these steps:

  1. Create a database similar to msdb, with the same number of data and log files.

  2. Take SQL Server offline.

  3. Delete all of the data files from your newly created database so that it will fail recovery. Replace the log file with the log file you renamed from your original msdb and rename it appropriately.

  4. Restart SQL Server.

  5. Back up the tail of the transaction log of this database with Enterprise Manager or in a query tool with Transact-SQL. A sample syntax follows:

    backup log db_name to disk = device_name WITH NO_TRUNCATE

  6. Use sp_dbremove to remove the temporary database from your new SQL Server. Back up master.

You can now go about restoring other databases, such as msdb, model, and finally, user databases. Then you can worry about tasks such as synchronizing logins and ensuring all objects are there for your use.

Tip

As you go through the rest of your recovery process, make frequent backups of master and any other databases as they are restored or at key points of the recovery process.

Step 2: Restoring the Msdb Database

Restoring msdb is the next step in rebuilding your SQL Server instance. If master is the brains and heart of SQL Server, msdb is the soul. It contains much of the status and configuration information used by SQL Server, such as replication information and the definitions of all SQL Server Agent jobs. It is unlikely that you do not use msdb at all. There are two scenarios for recovering msdb:

  • If you did not change the recovery model for msdb, you can only restore the last full backup that is good. This means that you might have lost quite a bit of information if it is a very old backup.

  • If you changed the recovery model of msdb to Full and you were able to back up the tail of the transaction log as described in the last section (as long as master was not damaged), you can take your point-in-time backup and restore all the transaction logs, meaning you could have an up-to-the-minute recovery.

Tip

It is strongly recommended that you change the recovery model of msdb to Full. However, when you stop and restart SQL Server Agent, the recovery model is reset to Simple. You need to create a job that runs on each SQL Server Agent restart and executes two commands (a sketch of such a job appears after this tip):

ALTER DATABASE msdb SET RECOVERY FULL
BACKUP DATABASE msdb TO DISK = '<location>'

You should back up msdb’s transaction log on a frequent basis.
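
A minimal sketch of such a job follows; the job name and backup path are hypothetical. A schedule of type 64 runs the job automatically whenever the SQL Server Agent service starts:

USE msdb
GO
EXEC sp_add_job @job_name = 'Reset msdb recovery model'
EXEC sp_add_jobstep
    @job_name = 'Reset msdb recovery model',
    @step_name = 'Set Full recovery and back up msdb',
    @subsystem = 'TSQL',
    @database_name = 'master',
    @command = 'ALTER DATABASE msdb SET RECOVERY FULL
BACKUP DATABASE msdb TO DISK = ''E:\SQLBackups\msdb.bak'' WITH INIT'
EXEC sp_add_jobschedule
    @job_name = 'Reset msdb recovery model',
    @name = 'On SQL Server Agent startup',
    @freq_type = 64
EXEC sp_add_jobserver
    @job_name = 'Reset msdb recovery model',
    @server_name = '(local)'
GO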

This is optional, but you might want to stop and restart SQL Server after restoring msdb.

Restoring the Model Database

The model database is used as the basis for all new databases that are created, so you do need to back it up and have a valid backup for a possible disaster recovery scenario. However, unless you make customizations to it that are specific to your environment, you should not need to back up the model database on a very frequent basis. There are two main scenarios for restoring model:

  • You have a valid backup of model from that server. If that is the case, you can stop and then restart SQL Server with trace flag 3608, which causes SQL Server to recover only the master database during startup. You can then restore model like any other database. Remove the trace flag from the startup parameters in Enterprise Manager, and stop and restart SQL Server. (A sketch of this sequence appears after the numbered steps below.)

  • You do not have a backup, but do have another SQL Server with the same code page/collation in your environment with a valid model database (modified or not modified).

If the latter situation is yours, perform the following steps to restore model:

  1. Stop and start the instance you are recovering with trace flag 3608.

  2. On the SQL Server instance that has the valid model database, execute the following syntax in a query window:

    use master
    go
    sp_detach_db 'model'
    go
  3. Copy the Model.mdf and Modellog.ldf files to the proper location for your damaged instance.

  4. Open a query window that is connected to your damaged instance, and execute syntax similar to the following example to attach model:

    use master
    go
    sp_attach_db 'model','E:\Sqldata\model.mdf',
    'E:\Sqldata\modellog.ldf'
    go

  5. Remove the trace flag from the startup parameters in Enterprise Manager, and stop and restart SQL Server.
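
For the first scenario (a valid backup of model from the same server), a minimal sketch of the sequence follows; the backup path is hypothetical. Start the instance with trace flag 3608 from a command window so that only master is recovered, restore model from a query tool, and then remove the flag and restart normally:

REM From the Binn directory of the instance, start SQL Server with trace flag 3608
sqlservr.exe -c -T3608

-- In a query tool connected to the instance, restore model (path is hypothetical)
RESTORE DATABASE model
   FROM DISK = 'E:\SQLBackups\model.bak'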

Restoring User Databases

When it is time to restore user databases, you have two main options: a traditional restore using backup and restore, or sp_attach_db. Restore, as you know, is a straightforward process that can take considerable time depending on the size of your database. However, you might want to consider using sp_attach_db if your server has crashed but your disk array is still intact. In this case, you might be able to recover your data in a few minutes instead of hours. This stored procedure allows you to attach database data and log files that exist at the operating system level but are not yet part of the SQL Server instance. The corresponding stored procedure for detaching a database from SQL Server is sp_detach_db. Because attaching could take far less time than a full restore, it might be worth attaching the files and letting the database attempt recovery. After the sp_attach_db, run DBCC CHECKDB to ensure that the database is in proper condition for use by your users, and make sure the user logins are mapped properly (a sketch of these follow-up checks appears after the example below).

The following is an example of the usage of sp_attach_db:

EXEC sp_attach_db @dbname = N'PRD',
@filename1 = N'H:\mssql\data\PRD_data1.MDF',
@filename2 = N'U:\mssql\log\PRD_log1.LDF',
@filename3 = N'J:\mssql\data\PRD_data3.NDF',
@filename4 = N'L:\mssql\data\PRD_data5.NDF',
@filename5 = N'I:\mssql\data\PRD_data2.NDF',
@filename6 = N'K:\mssql\data\PRD_data4.NDF',
@filename7 = N'M:\mssql\data\PRD_data6.NDF',
@filename8 = N'N:\mssql\data\PRD_data7.NDF',
@filename9 = N'O:\mssql\data\PRD_data8.NDF',
@filename10 = N'P:\mssql\data\PRD_data9.NDF',
@filename11 = N'Q:\mssql\data\PRD_data10.NDF',
@filename12 = N'R:\mssql\data\PRD_data11.NDF',
@filename13 = N'S:\mssql\data\PRD_data12.NDF'
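
As a follow-up to the attach, a minimal sketch of the post-attach checks might look like this; the database name matches the example above, and the user and login names are hypothetical:

-- Verify the physical and logical integrity of the newly attached database
DBCC CHECKDB ('PRD')

-- List database users whose logins are missing or mismatched
USE PRD
EXEC sp_change_users_login 'Report'

-- Remap an orphaned user to an existing login (names are hypothetical)
-- EXEC sp_change_users_login 'Update_One', 'AppUser', 'AppLogin'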

Note

Restoring the pubs and Northwind sample databases is exactly like restoring any other user database.

More Info

For details on the differences between doing a database restore with RESTORE and using sp_attach_db, consult Chapter 13.

Restoring the Full-Text Indexes

If you are using full-text indexing with your user databases, you need to think about how you will restore your indexes if there is a catastrophic failure, as mentioned in earlier chapters such as Chapter 7.

More Info

Knowledge Base article 240867, "INF: How to Move, Copy, and Back Up Full-Text Catalog Folders and Files," is a good reference on backing up, copying, and moving full-text indexes.

Synchronizing Logins

If you have successfully restored your master database, you should not need to synchronize logins unless new ones were added after the date and time of the backup that was restored. If that is the case, consult the section "Synchronizing Logins" in Chapter 14.

Tip

After the full rebuilding process, back up all system databases because you know they are in a good state.

Summary

No one ever hopes to find themselves in a disaster recovery situation. Unfortunately, however, it is usually a matter of when, rather than if. The best offense is a good defense, and as the old Boy Scout motto states, Be Prepared. Up-front planning, including testing of the plan, will pay huge dividends. If you do not test your plan before disaster strikes, all you have is words on a page. Remember that recovering a system is more than performing some steps laid out on a page: it involves people, processes, communication, and technology, among other things. Plans should be executed by qualified individuals only; those with lesser experience or without nerves of steel should not apply. Finally, when the dust settles, do a postmortem on the process so you can learn from the execution of the plan as well as from what went wrong on the servers. Then you can take corrective action and avoid the problem in the future.
