Chapter 2. SharePoint Disaster Recovery Design and Implementation

In This Chapter

  • Defining Scope

  • Planning the Recovery Process

  • Documenting and Implementing the Disaster Recovery Design

Many administrators of information technology (IT) systems are all too familiar with that famous axiom known as Murphy’s Law, which says, “If anything can go wrong, it will.” Although it may sound fatalistic, having the expectation that one day down the road a mishap of one kind or another will happen to your SharePoint environment is an important perspective to maintain when designing and creating your organization’s disaster recovery plan. This isn’t something you should generate for the sake of crossing an item off your To-Do list or checking a check box in a survey or audit. An effective disaster recovery plan gives you a resource you can use in all situations, regardless of scope or importance. By not losing sight of the fact that this strategy is going to be used and not just gather dust somewhere, you are drastically improving your chances for a successful recovery of your business’s crucial SharePoint systems and data when the chips are down.

Now that you’ve been introduced to the concepts and terminology of disaster recovery in Chapter 1, “SharePoint Disaster Recovery Planning and Key Concepts,” it’s time to start applying those lessons to your organization’s requirements and constraints. This chapter is designed to walk you through the process necessary to design and document your disaster recovery plan. You will gain an understanding of the data you need to collect and maintain in your plan, the parameters necessary for not only its design but its success, and ways to record all that data in a consistent, coherent fashion.

Defining Scope

It’s impossible to plan how you will recover your system in the event of an outage or disaster without understanding what your system is composed of and what its critical components are. For many complex environments, it simply isn’t feasible to attempt to fully restore every server, application, or database at the same time; trying to do so would add hours, days, or even weeks to the time it would take to complete this vital restoration activity. That is why the first step you must take when developing your disaster recovery plan is to define its scope and to evaluate and select the essential parts of your system that must be restored in the event of a disaster.

Note

It’s assumed that you’re not designing and developing your SharePoint environment’s disaster recovery plan on your own, or only from an IT perspective. As discussed in Chapter 1, a disaster recovery strategy is simply part of a larger business continuity plan (BCP) that’s driven primarily by business stakeholders and the cost that is tied to outages in a SharePoint environment. Although you, as an administrator, know what infrastructure components you need to have in place to restore your environment, your users are the ones who should determine which sites are business critical, what content should be preserved at all costs, and what the acceptable levels of downtime are for these items. The results of a business impact analysis (BIA) serve as the primary guide when constructing your disaster recovery plan.

What Are Recovery Targets?

Recovery targets are the critical functions and data of your SharePoint environment that need to be restored following the declaration of a disaster. Seems pretty straightforward, doesn’t it? Well, thanks in part to the complex and modular nature of a SharePoint environment, that is not always the case.

Recovery targets are important because not only do they identify the parts of your system that need to be acknowledged and addressed in some way as a part of your disaster recovery plan, but they are the functions and data that must be restored or replaced as part of a successful recovery operation. A set of recovery targets reads like a checklist, and recovery targets are often used in this fashion during disaster recovery testing to gauge the success or failure of a recovery strategy following its execution.

How Are Your Recovery Targets Defined?

Recovery targets are defined through the process of mapping the results of a BIA (that is, the data and functionality that business stakeholders have identified as being critical in a SharePoint farm) to elements within the farm that were identified during the discovery and documentation phase described in Chapter 1. Each result from the BIA should translate to one or more technical functions and data elements within the SharePoint farm.

For example, consider a BIA that identifies a SharePoint site housing online actuarial capabilities as being highly critical to daily business operations. Technical analysis and cross-referencing of the site mentioned in the BIA might yield numerous recovery targets, including these:

  • The content database housing the SharePoint site containing Excel spreadsheets

  • The Excel Services Service Application providing online calculation functionality

  • The physical server that is dedicated within the farm to carry out the processor-intensive Excel calculations

  • The unattended service account username and password that Excel Services uses for several trusted data connections

  • A custom trusted data provider that is defined within the Excel Services Service Application

  • Several legacy line of business systems that are accessed through trusted data connections to supply data for the actuarial spreadsheets

As you can see, a seemingly straightforward business function could lead to a cascading list of technical requirements during the definition of recovery targets.
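The mapping from one BIA result to several recovery targets can be sketched as a small data structure. This is only an illustration of the actuarial example above; the class names (`BiaItem`, `RecoveryTarget`), the `kind` labels, and the `checklist` helper are hypothetical and not part of SharePoint or any tool described in this book.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RecoveryTarget:
    name: str
    kind: str  # hypothetical label, e.g. "database" or "server"


@dataclass
class BiaItem:
    business_function: str
    criticality: str
    targets: List[RecoveryTarget] = field(default_factory=list)


# One business function from the BIA fans out into six technical targets.
actuarial = BiaItem(
    business_function="Online actuarial calculations",
    criticality="high",
    targets=[
        RecoveryTarget("Content database for the actuarial site", "database"),
        RecoveryTarget("Excel Services Service Application", "service application"),
        RecoveryTarget("Dedicated Excel calculation server", "server"),
        RecoveryTarget("Unattended service account credentials", "credential"),
        RecoveryTarget("Custom trusted data provider", "configuration"),
        RecoveryTarget("Legacy line-of-business systems", "external system"),
    ],
)


def checklist(item: BiaItem) -> List[str]:
    """Render the recovery targets as unchecked checklist lines."""
    return [f"[ ] {t.name} ({t.kind})" for t in item.targets]


for line in checklist(actuarial):
    print(line)
```

Keeping the business function and its targets together in one record makes it easier to reuse the same list later as a checklist during disaster recovery testing.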

For large SharePoint farms, the recovery targets that are ultimately selected may comprise only a subset of the farm’s total functionality. This is especially true if the recovery time objective (RTO) for the functions and data specified is extremely aggressive and the disaster recovery plan involves a substantial manual effort to carry out.

What Should Be Restored?

As the results of the BIA are mapped to recovery targets, you may begin to see that some technical functions or data within your farm have a higher priority than others and that some pieces of key technical functionality or data are required to make their associated business functions available in SharePoint. It’s also perfectly normal for some technical functions to be identified as low-priority components that can be restored once your farm’s core content and technical functionality have been fully restored and verified. This kind of triage activity can be beneficial, because it helps you to focus your activities and energy on the most important aspects of your environment without getting distracted by targets of lower priority.

Often this exercise can help you understand that it isn’t a good idea to fully restore your production environment immediately after an outage. Another benefit of this analysis is the impact it can have on the architecture, configuration, and governance policies of your SharePoint farm to better position or partition key elements for recoverability based on business value and associated disaster recovery priority. Following are a few other factors that you should keep in mind as you analyze the BIA results and consider the recovery targets that result:

  • Content database distribution. How are sites and site collections in your farm distributed across content databases? Consider storing high-priority sites in specific or unique content databases to allow more frequent backups to be made on those databases and prevent lesser sites from using resources. Carefully distributing your sites across databases, and even database instances, can make your backup and restore processes much easier to manage and complete.

  • Content. What types of content or data do users store in different types of sites in your farm? Is the content that users store in their My Sites given the same recovery priority by the BIA as what they store in collaborative team sites? Your organization may already have usage and retention policies that can help to answer these questions about the contents of different types of sites and determine when they should be backed up and restored in the absence of specific directives by the BIA.

  • Service Applications. SharePoint Foundation uses a number of Service Applications, and SharePoint Server 2010 includes an even greater number. If your recovery strategy involves some form of manual rebuild or reconfiguration, it is important to understand the usage patterns for the Service Applications in your SharePoint farm. In the actuarial example that was mentioned earlier, Excel Services is critical to the restoration of business functionality and would likely receive a high priority for recovery. Excel Services could be run locally within the farm, or the service could be consumed from another farm entirely. Recognizing both the importance of the Service Application and the actual origin of the services it provides is key to the proper definition of recovery targets.

  • Dependent systems and interfaces. What applications or configuration items have been identified as recovery targets on your production servers to support the various functions of your SharePoint farm? Some applications provide crucial data or functionality to the users of your SharePoint farm and must be reconnected or restored as part of your farm’s restore effort. Other applications are not identified by the BIA as mission critical and are therefore not a priority.

What’s Out of Scope

It’s just as important to establish what’s out of scope for your disaster recovery plan as it is to identify what’s in scope. This isn’t a simple exercise of listing what platforms, applications, systems, or components are not included in your disaster recovery plan. Yes, such actions are definitely part of the scope definition process, but it’s also important to determine which other groups are expected to support the items deemed out of scope for your plan. For example, if database administrators (DBAs) external to your group manage your SharePoint databases, it may be possible to declare the disaster recovery of those databases out of scope for your plan because those DBAs will handle them.

Tip

Establishing external dependencies within a disaster recovery plan introduces risk and is not the “right” of SharePoint technical owners. Prior to portions of a plan becoming dependent on external systems or personnel, discussions with business owners and stakeholders must take place. Although SharePoint technical owners and personnel are ultimately responsible for meeting the recovery objectives identified through the BIA, business stakeholders are the ones assuming the risk and realizing the ultimate impact of a system outage.

What Are the Costs?

As professors of economics are often fond of stating, “There’s no such thing as a free lunch.” Every choice and decision you make around your disaster recovery plan has a direct impact on how much it will cost to implement that plan. Frequent backups can require extensive storage resources, as well as more time to configure, test, and maintain. Opting to restore every aspect of a farm as quickly as possible is certainly possible, but the hardware, software, and workforce resources necessary to pull off such a plan can prove prohibitively high for all but the largest of enterprises. It’s essential to understand the costs inherent in each aspect of a disaster recovery plan so that you can balance and consider them as part of the plan. You may find that the best solution is not always the right solution for your organization once you introduce costs and expenses into the equation.

Planning the Recovery Process

After you’ve established the recovery targets based on the BIA, it’s time to move on to the steps you must take to actually return your system to acceptable levels of functionality. It’s time to start determining the people, hardware, software, and other resources that need to be in place before you can start the recovery process.

During the planning and design process, it’s common to discover that the level of recoverability that business owners desire isn’t possible with the budget allotted to disaster recovery operations. At this stage, bargaining and compromise are common to reach levels of recoverability and cost that are acceptable to both business stakeholders and SharePoint farm owners.

Setting aside issues of cost, there are a number of additional areas to consider as you begin the process of recovery planning and design. Many factors and drivers are commonly uncovered as a plan evolves, and your approach should be flexible enough to respond to them, but at a minimum an effective disaster recovery plan is built with strong consideration for the following three aspects:

  • RTO and RPO. After reading Chapter 1, you should be familiar with the concepts of RTO and RPO (recovery point objective) and how they impact technical options regarding recoverability. The requirements that are established for each recovery target’s RTO and RPO directly affect your plan’s design, which must be able to meet those objectives to be effective. RTO and RPO can dictate the type and number of resources you need to have available to execute the plan, the sorts of tools and range of feasible technologies you use to preserve and restore your system, and the way you define your success criteria.

  • Your data. What content, such as business documents or task lists, must be immediately restored to enable your users to remain productive? How is that data stored within your SharePoint environment, and how easily can it be backed up and restored? These considerations impact your plan, the tools you use to implement it, and the infrastructure you put in place to support it.

  • Physical limitations. The tangible pieces of your infrastructure, such as your data center, storage, backup technology, and networking configuration, can make a real difference in the options you have available to build into your disaster recovery plan. Can your recovery team directly access your servers in the data center if they need to? Do you have enough storage for your backups? Can you architect enough redundancy into your infrastructure from the ground up to make it highly available? These are just some of the physical limitations you need to keep in mind as you design your disaster recovery plan.
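As a rough illustration of how RTO and RPO constrain a design, the sketch below checks a proposed backup interval and an estimated restore time against the objectives for a single recovery target. The function names and all the numbers are assumptions made for the example, and the rule that worst-case data loss equals one backup interval is a simplification.

```python
from datetime import timedelta


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Simplifying assumption: worst-case data loss is one full backup interval."""
    return backup_interval <= rpo


def meets_rto(estimated_restore_time: timedelta, rto: timedelta) -> bool:
    """The restore must finish within the recovery time objective."""
    return estimated_restore_time <= rto


# Hypothetical objectives for one recovery target.
rpo = timedelta(hours=4)
rto = timedelta(hours=8)

# Nightly backups cannot satisfy a 4-hour RPO.
print(meets_rpo(timedelta(hours=24), rpo))
# A 6-hour estimated restore fits inside an 8-hour RTO.
print(meets_rto(timedelta(hours=6), rto))
```

A check like this is most useful when the restore-time estimate comes from an actual test restore rather than a guess.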

Documenting and Implementing the Disaster Recovery Design

Once you’ve identified the inputs, requirements, and parameters of your plan’s design, you can move on to the fun part: putting it into writing and incorporating its elements into your system. This is where the rubber meets the road—where you must explicitly state how your SharePoint environment is prepared for the declaration of a disaster and how it will be restored after such an event. Thoroughly document your plan and store it in an accessible, visible, and reliable location so it can be quickly accessed by anyone who needs to review, revise, or execute it.

Tip

If your SharePoint disaster recovery strategy includes one or more alternate data centers or facilities, your recovery plan and any associated documentation should be replicated to those facilities to ensure that they are up to date and available in the event of a disaster.

Remember, there’s always a chance that the author of the plan (you) is not going to be the person who actually executes it, so make sure the plan contains all the information and instructions required to execute it even if the reader isn’t intimately familiar with the plan. The recovery plan should clearly state any assumptions it makes about the executor and that person’s knowledge of SharePoint and related systems.

Acquiring Resources

Once you understand your farm’s recovery targets and have an appropriate disaster recovery topology, you can start reviewing your available resources and establishing the assets needed to provide or expand your disaster recovery plan. You can also define the resources your plan requires if a disaster is declared and you need to execute your plan. Obviously, it pays to have those items on hand before you actually need them so you can begin to satisfy the requirements of the plan as quickly as possible. The following list outlines the major resource areas you should review for your SharePoint environment and its disaster recovery plan:

  • Determine your physical requirements and resources. As has already been mentioned, your disaster recovery plan probably identifies some specific pieces of required hardware and infrastructure. Whether the plan’s requirements include rack space in multiple data centers, a high-speed storage area network (SAN), hardware for hosting virtualized servers, or tape backup drives, you need to enumerate these items as completely and specifically as possible. Review your network requirements and usage, power consumption, available storage, and redundant devices such as load balancers and Redundant Array of Independent (or Inexpensive) Disks (RAID) arrays.

  • Acquire your hardware. Once you know what you need, make sure you have it on hand when you need it. Don’t put this off for a rainy day or the next fiscal year. Disasters don’t happen when it’s convenient. You can’t afford to lose millions in business and productivity because you saved thousands waiting to procure the hardware required by your disaster recovery plan.

  • Acquire and license your software. If you have a failover farm, make sure to secure the proper software and licensing for that additional farm to stay in full compliance with your providers. Store copies of any required software or media in a location (or locations) that’s accessible in the event of a disaster. Work closely with your software manufacturer’s licensing representative. Explain exactly how you’re using the software, because the representative often has special provisions (at lower price points) for software running in a failover environment.

  • Review your dependent services. Most SharePoint installations depend heavily on Active Directory (AD) for user authentication, not to mention service accounts and administrative access to servers. Closely examine the disaster recovery plans for your environment’s AD domains, Domain Name System (DNS) services, Dynamic Host Configuration Protocol (DHCP) services, Simple Mail Transfer Protocol (SMTP) services, and all other services that your SharePoint environment depends on. If these service dependencies have RPO or RTO targets that are out of alignment with those that your SharePoint environment has identified, you might need to make alternate arrangements and spend more money.
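The alignment check described in the last item can be sketched as a simple comparison. This is a minimal example: the function name `misaligned_dependencies` and the RTO values for each service are invented for illustration and would come from the owners of those services in practice.

```python
from datetime import timedelta
from typing import Dict, List


def misaligned_dependencies(
    farm_rto: timedelta, dependency_rtos: Dict[str, timedelta]
) -> List[str]:
    """Return the services whose own RTO is longer than the farm's RTO."""
    return sorted(
        name for name, rto in dependency_rtos.items() if rto > farm_rto
    )


# Hypothetical figures gathered from each service owner's recovery plan.
farm_rto = timedelta(hours=8)
dependency_rtos = {
    "Active Directory": timedelta(hours=4),
    "DNS": timedelta(hours=4),
    "DHCP": timedelta(hours=24),
    "SMTP": timedelta(hours=48),
}

# Services listed here cannot be relied on to return before the farm must.
print(misaligned_dependencies(farm_rto, dependency_rtos))
```

Any service the function reports is a candidate for the alternate arrangements mentioned above, or for a conversation with business stakeholders about relaxing the farm's own objective.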

Establishing a Disaster Recovery Baseline

Baselines determine a desired configuration or setup for a given system at a specific point in time and are used as the basis for comparison for subsequent activities in and changes to that system. Establishing a baseline for your SharePoint farm allows you to solidify a specific configuration point and quality of service that your disaster recovery plan should strive to return the system to after a catastrophe. Baselining your system may not be required for your organization, but doing so gives you a defined target for success and goals toward which you can drive your plan. You can also repeat the process at regular intervals, allowing you to quantify how your system has grown and changed over time, which can also provide you with valuable data for future updates to your plan. Regardless of whether you baseline your system, you should strive to have a complete picture of its current state and how compatible that state is with your disaster recovery plan.

Documenting Your Procedures for an Outage

Up until now, most of this chapter has focused on the items and details needed for a SharePoint environment’s disaster recovery plan to establish the best position possible to deal with the declaration of a disaster. Now this chapter turns its attention to some best practices for actually writing the plan and recording it in a consistent and controlled manner. This is important because the plan must be understandable and complete. Its audience is likely to be under a great deal of pressure when using it and won’t have time to spare trying to decipher a dense, ineffective document.

Following Published Standards for Writing

If your organization already has a common set of standards for official technical documents, your disaster recovery plan should follow them. If not, it may be worth the effort to establish them as part of this process. When you’re writing a document, it isn’t enough to simply outline the steps an executor should take to complete a process. A complete technical document should contain several common types of information, including but not limited to these:

  • Involved parties. Lists the people associated with the document, such as its author/owner, reviewer(s), and approver(s)

  • Version and revision history. Details the document’s changes over time

  • Effective date. Records the date that the document became available for use

  • Roles, responsibilities, and capabilities. Includes a list of the positions that need to be filled to execute the document’s instructions, the responsibilities for each of those positions, and the skills a resource must have to fill a position

  • Audience. Defines who the document is intended for

  • Purpose. Explains what purpose the document should be used for

  • Scope. Defines what’s in scope and out of scope for the document

  • Covered systems. Lists the systems or groups that the document applies to

  • Glossary of terms. Defines common terminology used in the document

  • Prerequisites and dependencies. Includes any activities or systems that must be completed or in place prior to the document’s execution

  • Assumptions. Details the assumptions the document makes

  • Primary content. Includes the instructions and procedures the document is intended to cover

  • References. Lists information, documents, or people external to the document that can be consulted for additional information

  • Training. Explains how individuals should be trained on the document’s content and procedures

Verifying Content

Once you’ve completed your disaster recovery plan, have a third party review and verify it. If you don’t, you risk allowing inconsistencies, omissions, or errors to remain in the document that could directly impact the success of a recovery operation. Consider this book as an example. Every page and every word in it has been reviewed, tested, and verified by at least two separate parties. A copy editor checked it for grammatical consistency and proficiency, and a technical editor checked the technical statements, assertions, walk-throughs, and content written. No matter how much authors check their own work, having outside reviewers drastically improves the quality and accuracy of an author’s output. No disaster recovery plan should be allowed to stand without being tested and verified before it’s considered complete; otherwise, you chance introducing additional, avoidable risk into your disaster recovery activities.

Lowering the Impact of Recovery

Take whatever precautions you can to lower the impact of your recovery strategy on your SharePoint environment and its users. These steps will vary depending on your situation, but here are two important areas to keep in mind that can make the recovery process go much more smoothly:

  • Securing your crucial disaster recovery resources. The need for a secure, centralized store for your software installers, license keys, and other associated bits has already been mentioned, but it bears repeating. Ensure that your disaster recovery personnel can access this storage location, and make sure that its contents are backed up and potentially replicated on a regular basis. If your organization lacks a formal disaster recovery department or group, appoint a specific person with the responsibility of maintaining that store and keeping it current. Identify a backup for that person or group in case the primary is unavailable when a disaster is declared.

  • Identifying what to secure. What items, such as service account identities and passwords, software license keys, or data center access, should be secured and unavailable to public access? What items should be commonly available to all resources? Review your system’s assets and the security around them to make sure that you are properly balancing your assets’ safety measures against the need to access them quickly.

Tip

As mentioned in Chapter 1, certain types of privileged configuration data are typically stored separately from other types of data. For configuration data that is deemed secure and stored separately, be sure that your disaster recovery plan identifies how (and from whom) such information should be recovered if a disaster is declared.

Defining the Communication Plan

Your disaster recovery plan should also include a plan for communicating information about the declared outage to everyone associated with your SharePoint environment so that you’re presenting a uniform, consistent, and informative front to those constituents. The plan should identify the various players and roles in the recovery action, such as data center technicians, database administrators, management, quality assurance, end user advocates, and end users in general. It should also detail the manner in which these various players should be contacted, who should manage and coordinate the communication effort, and the approvals required before a message can be sent. In addition, the plan should inform key personnel of how they can obtain information on their own, via sources such as conference calls, Web pages, and phone trees. It may also be beneficial to designate a specific meeting area that the team can use for the duration of the recovery action so that the team always uses a consistent location. Make sure that all key personnel in a recovery action are identified and assigned specific roles to avoid gaps in knowledge and arguments over areas of responsibility.

Determining Success

The last thing your SharePoint disaster recovery plan must provide is a coherent, concrete, agreed-upon list of criteria for a successful recovery. As stated earlier in this chapter, this list is often derived directly from the list of recovery targets.

Define the terms of a successful recovery before you attempt to conduct one so that there are specific goals your team can drive toward and a point where you can declare victory. Keep your business users’ needs in mind during this process. As discussed previously, it does little good to deliver a system that may be fully recovered from a technical standpoint but does not allow business users to get their work done. The success criteria and associated conditions must be agreed upon by all stakeholders in your SharePoint environment and with regard to the recovery targets that the BIA identified. Your plan should also identify a person or group that is responsible for verifying that these criteria have been met and approving the completed recovery effort.
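One way to make such criteria concrete is to record each one as a pass or fail result and to require a named approver before the recovery is declared complete. The sketch below assumes hypothetical criteria names and a hypothetical `evaluate_recovery` helper; it is not a feature of SharePoint.

```python
from typing import Dict, List, Optional, Tuple


def evaluate_recovery(
    criteria_results: Dict[str, bool], approver: Optional[str] = None
) -> Tuple[bool, List[str]]:
    """Return (succeeded, failed_criteria).

    Success requires every criterion to pass and a named approver,
    mirroring the sign-off step described above.
    """
    failed = [name for name, passed in criteria_results.items() if not passed]
    succeeded = not failed and approver is not None
    return succeeded, failed


# Hypothetical results recorded after a recovery test.
results = {
    "Actuarial site reachable by business users": True,
    "Excel Services calculations return results": True,
    "Trusted data connections to legacy systems work": False,
}

succeeded, failed = evaluate_recovery(results, approver="Recovery lead")
print(succeeded)
print(failed)
```

Phrasing each criterion in terms of what business users can do, rather than which server is running, keeps the check aligned with the recovery targets that the BIA identified.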

Tip

You may find it worthwhile to explicitly include a baseline for your SharePoint environment within your disaster recovery plan and use it as a benchmark for a successful recovery. This allows you to solidify a specific configuration and quality of service for your system that your disaster recovery plan should strive to return the system to after a catastrophe, rather than an assorted list of recovery targets.

Conclusion

Creating a useful, effective disaster recovery plan and documenting it properly is one of the most important aspects of a successful disaster recovery strategy. Documentation isn’t one of the more interesting or exciting things that an IT administrator can be tasked with, but it certainly is one of the most crucial. Hopefully this chapter has given you a jump-start on the process.

The goal is for you to use the recommendations and best practices described in this chapter as a starting point for your organization’s SharePoint disaster recovery plan. Don’t forget that what has been presented may not cover everything that your team needs to meet the unique requirements of your SharePoint environment. Also keep in mind that your plan should, at a minimum, address all the concepts this chapter has introduced. Once you have developed your disaster recovery plan, the bad news is that you’re still not done. The good news is that Chapter 3, “SharePoint Disaster Recovery Testing and Maintenance,” walks you through the last steps of the process.

Now that you’ve learned about the importance of an effective disaster recovery plan and what goes into it, you should be able to answer the following questions about a plan’s capabilities. You can find the answers to these questions in Appendix A, “Chapter Review Q&A,” found on the Cengage Learning Web site at http://www.courseptr.com/downloads.

1. What are recovery targets?

2. What are some items to consider when evaluating what components of your SharePoint environment to restore?

3. What are some of the ways your organization’s RTOs and RPOs can impact the design of your disaster recovery plan?

4. What are some examples of resources that must be acquired or provisioned as part of your disaster recovery plan?

5. How do you know when your disaster recovery plan has been completely executed?

 
