Chapter 21. Data Protection, Recovery, and Availability

In our experience, a great deal of critical business data is being steadily migrated to Microsoft Office SharePoint Server 2007, whether by the administrators or the users. This is primarily because Office SharePoint Server 2007 is easy to use and familiar to the end-users and is thus a logical place to store and process data. Sometimes, SharePoint Server 2007 is used to access content on a third-party system such as SAP, Siebel, or PeopleSoft. Either way, SharePoint Server 2007 usually becomes a critical component in an organization’s infrastructure. As such, it is imperative to properly plan, design, implement, and maintain a data protection, recovery, and availability plan. Simply backing up and restoring via Central Administration is insufficient to fully restore data in all but the simplest implementations.

Because SharePoint Server 2007 usually consists of multiple, interconnected systems and dependencies, this chapter will focus on the planning and designing processes first and then focus on the individual components that make up a SharePoint Server 2007 server farm. The following topics will be presented in this chapter:

You should understand that there isn’t a silver bullet when you plan and implement data protection and availability solutions for SharePoint Server 2007. Therefore, you need to internalize the concepts presented in this chapter and design a solution for your specific environment.

Planning for Recovery

The best time to plan for content recovery is before you implement SharePoint Server 2007. Much of your content recovery plan depends on how your SharePoint Server 2007 server farm is implemented. A common bad practice is trying to force stringent recovery objectives from a system that was poorly installed. Doing so is a lot like trying to get a Yugo to perform like a Ferrari! If you installed via the default options, use native backup tools, and ignore Microsoft SQL Server transaction logs, you are most likely accepting up to 24 hours of data loss in the event of SQL Server failure. If you aren’t moving your backup media off-site, then you are accepting the risk of a total loss of data. Can your company sustain a total loss of data? 24 hours? These are some of the questions you need to answer before implementing SharePoint Server 2007 or at least before moving business-critical content into SharePoint Server 2007.

First, you must define where the valuable content resides or will reside. If SharePoint Server 2007 is simply a front-end dashboard for back-end business data, then you will be more concerned with getting SharePoint Server 2007 back online in a failure and less concerned with the loss of SharePoint Server 2007 content. In this example, your primary recovery target would be the back-end business data. Likewise, you must design for accessing your content. If you require immediate access to your data, then solid backups to tape may not be sufficient. Instead, you may need to plan for disk-to-disk backups or create a mirrored instance of your farm altogether.

Unless you have a very simple installation, your data protection and recovery plan will require some preparation. Often, it isn’t a planning process that you can do alone. It will require discussions with the data owners and stakeholders to understand the criticality of the data, and what the expected availability is. Two key concepts are presented throughout this chapter:

  • Recovery time objective. The recovery time objective (RTO) defines how long your system can be down before it is back online after a disruption. The disruption could be due to anything from a SQL Server outage to a Web front-end (WFE) server failure. You don’t need to have the same RTO all of the time. For example, a bank might have a very short RTO from Monday through Friday, 9 A.M. until 5 P.M., but a longer RTO for all other times. The RTO should include data recovery at the server, farm, database, site, list, and item levels.

  • Recovery point objective. The recovery point objective (RPO) defines your data loss threshold, measured in time. If you run daily backups only and ignore the SQL Server transaction logs, then your RPO is 23 hours, 59 minutes, and 59 seconds. Any data written to SharePoint Server 2007 after you ran the backup cannot be restored via native tools until after the next backup. Many organizations assume this risk without fully understanding the impact of losing 24 hours of data.
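The RPO arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a SharePoint or SQL Server tool: the function names are hypothetical, and it assumes backups complete instantly and are always restorable, so real-world exposure will be somewhat worse than the figure it returns.

```python
from datetime import timedelta
from typing import Optional


def worst_case_data_loss(full_backup_interval: timedelta,
                         log_backup_interval: Optional[timedelta] = None) -> timedelta:
    """Return the longest window of data that could be lost in a failure.

    Assumption: backups are instantaneous and always restorable.
    """
    if log_backup_interval is None:
        # No transaction log backups: everything written since the
        # last full backup is at risk.
        return full_backup_interval
    # With log backups, only changes made since the most recent
    # backup of either kind are at risk.
    return min(full_backup_interval, log_backup_interval)


def meets_rpo(rpo: timedelta,
              full_backup_interval: timedelta,
              log_backup_interval: Optional[timedelta] = None) -> bool:
    """Check a backup schedule against the RPO agreed with stakeholders."""
    return worst_case_data_loss(full_backup_interval, log_backup_interval) <= rpo


daily = timedelta(hours=24)
every_15_minutes = timedelta(minutes=15)

# Daily backups only cannot satisfy a one-hour RPO.
daily_only_ok = meets_rpo(timedelta(hours=1), daily)
# Adding 15-minute transaction log backups (a hypothetical schedule) can.
with_logs_ok = meets_rpo(timedelta(hours=1), daily, every_15_minutes)
```

The point of the sketch is that the RPO is driven by the shortest backup interval you can actually restore from, which is why ignoring the transaction logs leaves you with roughly a full day of exposure.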

Keep these two concepts in mind as you plan your design and as you read through this chapter. Whenever you plan and install a new farm component, such as a Web application or databases, be sure to plan for the appropriate RTO and RPO as defined by the stakeholders.

What Are You Protecting?

You must first decide what SharePoint Server 2007 content you will protect before you decide how you will protect it. As part of your planning process, you should define the criticality of your SharePoint Server 2007 content. Many organizations will calculate the value of the content by lost revenues or the cost to reproduce the content. If you have content within the same server farm and segments of the data have a drastically different value, then you might consider different levels of protection commensurate with the value of the content. A good example is business-critical content, such as architectural drawings, contracts, and designs, versus historical human resources documents, such as vacation requests. The business-critical content should probably be better protected, while the organization could probably withstand a greater data loss to the generic human resources information. You would probably spend more money to protect the business-critical content. Basically, you must decide the value of the content and then the cost to protect it. If a business-critical site collection costs $100,000 to reproduce, then an extra $5,000 spent on your SharePoint Server 2007 design to protect that data would be reasonable.

Putting a price on your content is easier said than done. It can be very difficult to define at what point the cost outweighs the risk. This is really a discussion in risk management, and there is simply not room to discuss it here. Instead, we will discuss those areas directly related to SharePoint Server 2007. Suffice it to say that you alone cannot define acceptable thresholds in data loss and downtime. You need to first educate your stakeholders in the costs of protecting content, gather data loss requirements, and then plan and design accordingly.

More Info

For a good overview of risk management within the Microsoft Operations Framework, browse to http://www.microsoft.com/technet/solutionaccelerators/cits/mo/mof/mofrisk.mspx.

Stakeholder Education

Many stakeholders do not fully appreciate the complexity of fault tolerance and data recovery. In fact, many executives want all of the data, all of the time. You need to have an honest discussion with several people, but especially the users, data owners, and executives. You should ask them what their acceptable data loss is and how long the system can be down. They will probably say that they cannot accept any data loss and that the system must always be up. Your job is to educate these stakeholders about the actual expense of achieving this. Moving to a 99.999 percent availability posture is very, very expensive. When stakeholders are presented with the costs of such high availability (HA), they often come back to reality and a compromise takes place. The compromise is between what you can design, implement, and support and what they are willing to pay.

Remember that not all data must always be online, and not all data must have recent backups. Think of it this way: Records management content usually doesn’t have to be online immediately, but it must always be recoverable. Your design using this scenario might have only a single primary SQL Server system, but leverage transaction log shipping to a second SQL Server instance so a copy can always be recovered that is close to the point of failure. Conversely, some content needs to be online all of the time, but not necessarily fault tolerant to the point of never losing a byte of data. A good example of this is data warehousing and business intelligence. It might be critical to always have a Report Center online, but it might not hurt to lose a small amount of warehoused data because the online transaction processing (OLTP) content still exists. These examples show you how designing for availability doesn’t always mean designing for recoverability.

Ask your stakeholders questions, and be prepared to give rough estimates of costs during these discussions. Here are a few stakeholder questions to get you started:

  • What content must always be online?

  • What is your pain threshold regarding RPO data loss?

  • What role will users play in the recovery process?

  • Do you really need 100 percent of the content in the event of a disruption?

  • Must the system honestly be up all of the time?

  • What is the lost labor cost per hour in the event of a system outage?

  • What is the lost revenue?

  • Will we lose customers in the event of an outage?

  • Will we have to compensate for the outage in marketing costs and sales?

  • Will we be legally liable for lost content?

Did you notice the last question? If you are a publicly traded company, your CFO might be a very good ally in getting overall executive support. Because the CFO is responsible for data that could be Sarbanes-Oxley or HIPAA regulated, it is in his or her best interest to always have the data available. Think about who can help you design the system you know you should build, regardless of what the stakeholders say. You may ultimately be on the hook for losses from a system outage. Keep a record of all data recovery and availability discussions for future use, and to prove your recommendations should the need arise.

Service Level Agreements

Service Level Agreements (SLAs) set the expectations of recoverability and availability. These can be informal documents within your organization or legal contracts with and between service providers. A good SLA should be easy to read, easy to follow, and easy to apply. An overly complex SLA makes your job difficult when you implement SharePoint Server 2007. So what is the anatomy of an SLA? The International Engineering Consortium defines an SLA as "an informal contract between a carrier and a customer that defines the terms of the carrier’s responsibility to the customer and the type and extent of remuneration if those responsibilities are not met." That definition was obviously written for telecom carriers, but it generally states what has become the standard for most SLAs.

An SLA will vary greatly depending on your operating environment, business type, and business requirements. There are some common elements of any SLA, and you should include the following at a minimum:

  • System availability

  • Transactional reliability of the actual data

  • Acceptable performance

  • Mean time to respond to problem requests

  • Mean time to restore service and/or content

When discussing SLAs and availability, we usually talk about the 9s. This is actually a fairly simple concept and can help you educate your stakeholders. We sometimes associate rough design costs with each level of 9s. Table 21-1 shows Microsoft’s 9s table.

Table 21-1. Number of 9s to Calendar Time Equivalents

Acceptable uptime percentage   Downtime per day   Downtime per month   Downtime per year
95                             72.00 minutes      36 hours             18.26 days
99                             14.40 minutes      7 hours              3.65 days
99.9                           86.40 seconds      43 minutes           8.77 hours
99.99                          8.64 seconds       4 minutes            52.60 minutes
99.999                         0.86 seconds       26 seconds           5.26 minutes
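The figures in Table 21-1 follow directly from the uptime percentage, and it can be handy to recompute them for other service levels when talking to stakeholders. The Python below is a minimal sketch: the function name is hypothetical, and the 30-day month and 365.25-day year are assumptions chosen because they are consistent with the published values.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Assumed period lengths; they are consistent with the values in Table 21-1.
DAYS_PER_MONTH = 30
DAYS_PER_YEAR = 365.25


def downtime_seconds(uptime_percent: float, period_days: float) -> float:
    """Allowed downtime, in seconds, for a given uptime percentage and period."""
    downtime_fraction = 1 - uptime_percent / 100
    return downtime_fraction * period_days * SECONDS_PER_DAY


for uptime in (95, 99, 99.9, 99.99, 99.999):
    per_day = downtime_seconds(uptime, 1)
    per_month = downtime_seconds(uptime, DAYS_PER_MONTH)
    per_year = downtime_seconds(uptime, DAYS_PER_YEAR)
    # Report minutes per day, hours per month, and days per year;
    # convert to whatever unit reads best for your audience.
    print(f"{uptime}%: {per_day / 60:.2f} min/day, "
          f"{per_month / 3600:.2f} h/month, "
          f"{per_year / SECONDS_PER_DAY:.2f} days/year")
```

Each additional 9 divides the allowed downtime by ten, which is the simplest way to explain to stakeholders why each step up is so much harder to achieve.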

Most SharePoint Server 2007 installations we have seen were architected to the 99-percent availability level. Most of these were not intentionally built to two 9s, but this is the natural path for most organizations. A two-9s design would allow the occasional outage due to system failure and a regular window for updates and hardware maintenance. Why aren’t most installations of such critical data designed to three 9s or higher? A rough estimate of moving from a 99-percent uptime posture (3.65 days of downtime per year) to a 99.9-percent uptime posture (8.77 hours per year) is a 100-percent increase in cost! Most likely, the 99-percent service level will be a good compromise between availability and cost.

Remember that you have many dependencies with SharePoint Server 2007, including SQL Server, networking, Active Directory directory services, operating systems, and hardware. Simply building a SharePoint Server 2007 server farm to provide 99-percent uptime doesn’t guarantee the solution actually provides 99-percent uptime. If you do not own the dependencies, you should obtain an SLA from the service vendor or consider building the dependency yourself. When defining these SLAs with vendors, don’t assume you need 24-hour coverage, seven days per week. If you are doing business only 12 hours a day, then your organization might need extreme availability only during that window. This would leave plenty of time for software updates, hardware fixes, and testing without the unnecessary design of failover server farms and additional server farm members. It will also reduce the cost of your design.
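To see why a 99-percent farm does not by itself deliver 99-percent uptime, consider a rough Python sketch of serial dependencies. It assumes that every dependency must be up for the solution to be up and that failures are independent, which is a simplification; the function name and the sample figures are illustrative, not measurements.

```python
from math import prod
from typing import Iterable


def composite_availability(component_percents: Iterable[float]) -> float:
    """Availability of a chain of dependencies that must all be up.

    Assumption: failures are independent, so the individual
    availabilities simply multiply.
    """
    return 100 * prod(p / 100 for p in component_percents)


# Hypothetical figures: the farm plus four dependencies, each at 99 percent
# (for example SQL Server, networking, Active Directory, and hardware).
dependencies = [99, 99, 99, 99, 99]
overall = composite_availability(dependencies)
print(f"Overall availability: {overall:.2f}%")
```

Under these assumptions, five 99-percent components in series yield roughly 95 percent overall, which is why the SLAs you obtain for dependencies you do not own matter as much as the design of the farm itself.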

We recommend creating and maintaining SLAs, even if your customer is your employer. SLAs provide a documented method for defining acceptable data loss and system availability. They also provide a way to define multiple tiers of recoverability and availability.

SharePoint Server 2007 can adapt to a multi-tiered SLA arrangement at the farm, Web application, and content database levels. Your SLA needn’t be a blanket agreement covering all facets of your installation. Table 21-2 shows how SharePoint Server 2007 can be architected to support different service levels.

Table 21-2. Tiered SLA Levels for SharePoint Server 2007

Component          Accomplished how?
Farm               Different farm servers, different SQL Server instance, dedicated network hardware
Web application    Dedicated content databases, isolated application pools, dedicated WFE servers, dedicated network hardware
Content database   Group site collections by SLA in their respective content databases; manage SLAs at the SQL Server instance level
Site collection    Critical site collections can be in a dedicated content database and the SLA managed at the SQL Server instance level
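As a planning aid for the content database row of Table 21-2, the grouping of site collections by SLA tier can be sketched in Python. This is only a paper exercise under stated assumptions: the URLs, tier names, and database naming scheme are all hypothetical, and the script does not touch SharePoint Server 2007 or SQL Server.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def plan_content_databases(site_collections: Iterable[Tuple[str, str]],
                           prefix: str = "WSS_Content") -> Dict[str, List[str]]:
    """Group site collections into one planned content database per SLA tier.

    The database names produced here are a hypothetical naming convention.
    """
    plan: Dict[str, List[str]] = defaultdict(list)
    for url, tier in site_collections:
        plan[f"{prefix}_{tier}"].append(url)
    return dict(plan)


# Hypothetical inventory of (site collection URL, SLA tier) pairs.
inventory = [
    ("http://portal/sites/contracts", "Gold"),
    ("http://portal/sites/designs", "Gold"),
    ("http://portal/sites/vacation-requests", "Bronze"),
]

for database, urls in plan_content_databases(inventory).items():
    print(database, urls)
```

With the site collections grouped this way, each content database can be given the backup schedule that matches its tier at the SQL Server instance level.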

Obviously, a multi-tiered SLA within a single farm can complicate things quite a bit. But if you are an experienced systems administrator and comfortable with SharePoint Server 2007, it might be more cost efficient than building a second server farm. While you can design a multi-tiered farm after the fact, it is much easier to do in the very beginning before implementation.
