Sample continuous availability and disaster recovery scenarios
In this chapter we describe a number of common client scenarios and requirements, along with what we believe to be the most suitable solution for each case.
We discuss the following scenarios:
A client with a single data center that has already implemented Parallel Sysplex with data sharing and workload balancing wants to move to the next level of availability.
A client with two centers needs a disaster recovery capability that will permit application restart in the remote site following a disaster.
A client with two sites (but all production systems running in the primary site) needs a proven disaster recovery capability and a near-continuous availability solution.
A client with two sites at continental distance needs to provide a disaster recovery capability.
The scenarios discussed in this chapter pertain to using the GDPS products that are based on hardware disk replication. The scenarios for GDPS/Active-Active using software data replication are discussed in Chapter 7, “GDPS/Active-Active” on page 173.
10.1 Introduction
In the following sections we describe how the various GDPS service offerings can address different continuous availability (CA) and disaster recovery (DR) requirements. Because every business is unique, the following sections do not completely list all the ways the offerings can address the specific needs of your business, but they do serve to illustrate key capabilities.
Note that the figures accompanying the text show minimal configurations for clarity. Many client configurations are more complex than those shown; such configurations are equally supported.
10.2 Continuous availability in a single data center
In the first scenario the client has only one data center, but wants to have higher availability. The client has already implemented data sharing for their critical applications, and exploits dynamic workload balancing to mask the impact of outages. They already mirror all their disks within the same site, but have to take planned outages when they want to switch from the primary to secondary volumes in preparation for a disk subsystem upgrade or application of a disruptive microcode patch. They are concerned that their disk is their only remaining resource whose failure can take down all their applications. The configuration is shown in Figure 10-1.
Figure 10-1 Data sharing, workload balancing, mirroring - single site
From a disaster recovery perspective, the client relies on full volume dumps. Finding a window of time long enough to create a consistent set of backups is becoming a challenge. In the future, they plan to add a second data center to protect them in case of a disaster. In the interim, they want to investigate the use of FlashCopy to create a consistent set of volumes that they can then dump in parallel with their batch work. Their current focus, however, is on improved resiliency within their existing single center.
Table 10-1 summarizes the client’s situation and requirements, and shows which of those requirements can be addressed by the most suitable GDPS offering for this client’s requirements, namely GDPS/PPRC HyperSwap Manager.
Table 10-1 Mapping client requirements to GDPS/PPRC HyperSwap Manager attributes
Attribute                                           | Supported by GDPS/PPRC HM
Single site                                         | Y
Synchronous remote copy support                     | Y (PPRC)
Transparent swap to secondary disks                 | Y (HyperSwap)
Ability to create a set of consistent tape backups  | Y (1)
Ability to easily move to GDPS/PPRC in the future   | Y
Note 1: To create a consistent set of source volumes for the FlashCopy in GDPS/PPRC HyperSwap Manager, you must create a freeze-inducing event and be running with a Freeze and Go policy.
This client's primary short-term objective is to provide near-continuous availability, but they want to ensure that they address it in a strategic way.
In the near term, they need the ability to transparently swap to their secondary devices in case of a planned or unplanned disk outage. Because they only have a single site, do not currently have a TS7700, and do not currently have the time to fully implement GDPS system and resource management, the full GDPS/PPRC offering is more than they currently need.
By implementing GDPS/PPRC HyperSwap Manager, they can achieve their near-term objectives in a manner that positions them for a move to full GDPS/PPRC in the future.
Figure 10-2 on page 268 shows the client configuration after implementing GDPS/PPRC HyperSwap Manager. Now, if they have a failure on the primary disk subsystem, the Controlling system will initiate a HyperSwap, transparently switching all the systems in the GDPS sysplex over to what were previously the secondary volumes. The darker lines connecting the secondary volumes in the figure indicate that the processor-to-control unit channel capacity is now similar to that used for the primary volumes.
Figure 10-2 Continuous availability within a single data center
After the client has implemented GDPS and enabled the HyperSwap function, their next step will be to install the additional disk capacity needed to use FlashCopy. The client will then be able to use the Freeze function to create a consistent view that can be FlashCopied, producing a set of volumes that can then be full-volume dumped for disaster recovery. Not only will this create a more consistent set of backup tapes than the client has today (because today they are backing up a running system), but the backup window will also shrink from the hours it currently takes to just a few seconds. This will allow the client to take backups more frequently if they prefer.
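Under the covers, a FlashCopy of this kind can be taken with the FCESTABL TSO command. The following is a minimal sketch for a single volume pair; the device numbers are hypothetical, and in practice GDPS drives the FlashCopy for the whole consistency group on the client's behalf. MODE(NOCOPY) copies tracks from the source only when they are about to change, which is normally sufficient when the target is dumped to tape immediately afterward:

   FCESTABL SDEVN(X'8200') TDEVN(X'8300') MODE(NOCOPY)
   FCQUERY DEVN(X'8300')

FCQUERY shows the state of the relationship, and FCWITHDR can be used to remove it after the full-volume dump of the target completes.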
10.3 DR across two data centers at metro distance
The next scenario relates to a client that is under pressure to provide a disaster recovery capability in a short time frame, perhaps for regulatory reasons. The client has a second data center within metropolitan distance, suitable for synchronous mirroring, but has not yet implemented mirroring between the sites. The client had planned to complete their project to implement data sharing and workload balancing before moving to a full GDPS/PPRC environment. However, events have overtaken them and they now need to provide the disaster recovery capability sooner than they had expected.
The client can either implement the full GDPS/PPRC offering now, as they had planned to do in the long term, or install GDPS/PPRC HyperSwap Manager first. Because they will not be using the additional capabilities delivered by GDPS/PPRC in the immediate future, the client decides to implement the lower-cost GDPS/PPRC HyperSwap Manager option. Table 10-2 summarizes the client’s situation and requirements, and shows how those requirements can be addressed by GDPS/PPRC HyperSwap Manager.
Table 10-2 Mapping client requirements to GDPS/PPRC HyperSwap Manager attributes
Attribute                                                     | Supported by GDPS/PPRC HM
Two sites, 12 km apart                                        | Y
Synchronous remote copy support                               | Y (PPRC)
Maintain consistency of secondary volumes                     | Y (Freeze)
Maintain consistency of secondary volumes during PPRC resynch | Y (1) (FlashCopy)
Ability to move to GDPS/PPRC in the future                    | Y
Note 1: FlashCopy is used to create a consistent set of secondary volumes prior to a resynchronization, following a suspension of remote copy sessions.
This client needs to be able to provide a disaster recovery capability quickly. The primary focus in the near term, therefore, is to be able to restart their systems at the remote site as though they were being restarted off the primary disks following a power failure. Longer term, however, the RTO (the time to get the systems up and running again in the remote site) will be reduced to the point where it can no longer be achieved without the use of automation; this will be addressed by a move to GDPS/PPRC. The client also has a requirement to have a consistent restart point at all times, even during DR testing.
This client will implement GDPS/PPRC HyperSwap Manager, with the Controlling system in the primary site and the secondary disks in the remote site. The secondary storage subsystems are configured with sufficient capacity to be able to use FlashCopy for the secondary devices; this will allow the client to run DR tests without impacting their mirroring configuration.
GDPS/PPRC HyperSwap Manager will be installed and the Freeze capability enabled. After the Freeze capability is enabled and tested, the client will install the additional intersite channel bandwidth required to be able to HyperSwap between the sites. This configuration is shown in Figure 10-3. Later, in preparation for a move to full GDPS/PPRC, the client will move the Controlling system (and its disks) to the remote site.
Figure 10-3 GDPS/PPRC two-site HM configuration
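For a sense of what GDPS/PPRC HyperSwap Manager is managing under the covers, a single PPRC pair is established natively with a TSO command similar to the following sketch (after the PPRC paths between the disk subsystems have been defined with CESTPATH). The device number, subsystem IDs, serial numbers, and LSS/CCA values are hypothetical placeholders and the operand layout is abbreviated; GDPS generates the equivalent operations for every pair from its configuration definition:

   CESTPAIR DEVN(X'7100') +
            PRIM(X'2100' 0000012345 X'00' X'21') +
            SEC(X'3100' 0000067890 X'00' X'31') +
            MODE(COPY) CRIT(NO)
   CQUERY DEVN(X'7100')

CQUERY reports the pair status, for example PENDING while the initial copy is in flight and DUPLEX when the volumes are synchronized. The Freeze function then operates on all the pairs in the configuration as a single consistency group rather than on individual devices.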
10.4 DR and CA across two data centers at metro distance
The client in this scenario has two centers within metro distance of each other. The client already uses PPRC to remote copy the primary disks (both CKD and FB) to the second site. They also have the infrastructure in place for a cross-site sysplex; however, all production work still runs in the systems in the primary site.
The client is currently implementing data sharing, along with dynamic workload balancing, across their production applications. In parallel with the completion of this project, they want to start looking at how the two sites and their current infrastructure can best be exploited to provide not only disaster recovery, but also continuous or near-continuous availability across planned and unplanned outage situations, including the ability to dynamically switch the primary disks back and forth between the two sites.
Because the client is already doing remote mirroring, their first priority is to ensure that the secondary disks provide the consistency to allow restart, rather than recovery, in case of a disaster. Because of pressure from their business, the client wants to move to a zero (0) data loss configuration as quickly as possible, and also wants to investigate other ways to reduce the time required to recover from a disaster.
After the disaster recovery capability has been tested and tuned, the client’s next area of focus will be continuous availability, across both planned and unplanned outages of applications, systems, and complete sites.
This client is also investigating the use of z/VM and Linux on System z to consolidate a number of their thousands of PC servers onto the mainframe. However, this is currently a lower priority than their other tasks.
Because of the disaster recovery and continuous availability requirements of this client, together with the work they have already done and the infrastructure in place, the GDPS offering for them is GDPS/PPRC. Table 10-3 shows how this offering addresses this client’s needs.
Table 10-3 Mapping client requirements to GDPS/PPRC attributes
Attribute                                                     | Supported by GDPS/PPRC
Two sites, 9 km apart                                         | Y
Zero data loss                                                | Y (PPRC with Freeze policy of SWAP,STOP)
Maintain consistency of secondary volumes                     | Y (Freeze)
Maintain consistency of secondary volumes during PPRC resynch | Y (1) (FlashCopy)
Remote copy and remote consistency support for FB devices     | Y (Open LUN support)
Ability to conduct DR tests without impacting DR readiness    | Y (FlashCopy)
Automated recovery of disks and systems following a disaster  | Y (GDPS script support)
Ability to transparently swap z/OS disks between sites        | Y (HyperSwap)
DR and CA support for Linux guests under z/VM                 | Y
Note 1: FlashCopy is used to create a consistent set of secondary volumes prior to a resynchronization, following a suspension of remote copy sessions.
Although this client has performed a significant amount of useful work already, fully benefiting from the capabilities of GDPS/PPRC will take a significant amount of time, so the project has been broken up as follows:
1. Install GDPS/PPRC, define the remote copy configuration to GDPS, and start using GDPS to manage and monitor the configuration.
This will make it significantly easier to implement changes to the remote copy configuration. Rather than issuing many PPRC commands, the GDPS configuration definition simply needs to be updated and activated, and the GDPS panels then used to start the new remote copy sessions.
Similarly, any errors in the remote copy configuration will be brought to the operator’s attention through the NetView SDF facility. Changes to the configuration, to stop or restart sessions, or to initiate a FlashCopy, are far easier through the NetView interface than with the equivalent native TSO commands, a sample of which follows.
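As a hypothetical illustration (the device numbers and operands are placeholders), suspending and later resynchronizing just one PPRC pair natively requires commands such as:

   CSUSPEND DEVN(X'7100') +
            PRIM(X'2100' 0000012345 X'00' X'21') +
            SEC(X'3100' 0000067890 X'00' X'31')
   CESTPAIR DEVN(X'7100') +
            PRIM(X'2100' 0000012345 X'00' X'21') +
            SEC(X'3100' 0000067890 X'00' X'31') +
            MODE(RESYNC)

With hundreds or thousands of mirrored volumes, issuing and verifying such commands device by device quickly becomes impractical; this is exactly the work that updating and activating the GDPS configuration definition removes.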
2. After the staff becomes familiar with the remote copy management facilities of GDPS/PPRC, enable the Freeze capability, initially with a PPRCFAILURE=GO policy and moving to PPRCFAILURE=COND or PPRCFAILURE=STOP when the client is confident in the stability of the remote copy infrastructure. Because HyperSwap will not be implemented immediately, they will specify a PRIMARYFAILURE=STOP policy to avoid data loss if recovery on the secondary disks becomes necessary after a primary disk problem.
Although the client has PPRC today, they do not have the consistency on the remote disks that is required to perform a restart rather than a recovery following a disaster. The GDPS Freeze capability will add this consistency, and enhance it with the ability to ensure zero (0) data loss following a disaster when a PPRCFAILURE=STOP policy is implemented.
3. Add the FB disks to the GDPS/PPRC configuration, including those devices in the Freeze group, so that all mirrored devices will be frozen in case of a potential disaster. As part of adding the FB disks, a second Controlling system will be set up.¹
Although the client does not currently have distributed units of work that update both the z/OS and FB disks, the ability to Freeze all disks at the same point in time makes cross-platform recovery significantly simpler.
In the future, if the client implements applications that update data across multiple platforms inside the scope of a single transaction, the ability to have consistency across all disks will move from being “nice to have” to a necessity.
4. Implement GDPS Sysplex Resource Management to manage the sysplex resources within the GDPS, and start using the GDPS Standard actions panels.
GDPS system and sysplex management capabilities are an important aspect of GDPS. They ensure that all changes to the configuration conform to previously prepared and tested rules, and that anyone can check the current configuration at any time, that is, which sysplex data sets and IPL volumes are in use. These capabilities provide the logical equivalent of the whiteboard used in many computer rooms to track this type of information.
5. Implement the GDPS Planned and Unplanned scripts to drive down the RTO following a disaster.
The GDPS scripting capability is key to recovering the systems in the shortest possible time following a disaster. Scripts run at machine speeds, rather than at human speeds. They can be tested over and over until they do precisely what you require. And they will always behave in exactly the same way, providing a level of consistency that is not possible when relying on humans.
However, the scripts are not limited to disaster recovery. This client sometimes has outages as a result of planned maintenance to their primary site. Using the scripts, they can use HyperSwap to keep their applications available as they move their systems one by one to the recovery site in preparation for site maintenance, and then back to their normal locations after the maintenance is complete.
Because all production applications will still be running in the production site at this time, the processor in the second site is much smaller. However, to allow additional capacity to be brought online quickly in case of a disaster, the processor has the CBU feature installed. The GDPS scripts can be used to automatically enable the additional CBU engines as part of the process of moving the production systems to the recovery processor, as the sample script that follows illustrates.
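A takeover script of this kind might look broadly like the following sketch. The statement style follows published GDPS examples, but the exact script keywords and syntax are release-dependent, and the CPC and system names here are invented for illustration:

   COMM='SITE1 FAILURE - RECOVER PRODUCTION IN SITE2'
   DASD='RECOVER'
   CBU='ACTIVATE CPC=CPC2'
   LOAD='PRD1'

Read sequentially: the comment documents the action, the secondary disks are recovered so they can be used, the CBU engines on the recovery processor are activated, and a production system is IPLed there. Because the script is prepared and tested in advance, the operator initiates one action rather than dozens of individual steps.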
6. After the disaster recovery aspect has been addressed, HyperSwap will be implemented to provide a near-continuous availability capability for the z/OS systems. It is recommended that a Controlling system be set up in each site when using HyperSwap, to ensure there is always a system available to initiate a HyperSwap regardless of where the primary disks are at the time. In the case of this client, the second Controlling system is already in place, having been set up when the FB devices were added to the GDPS configuration.
The client will exploit both planned HyperSwap (to move their primary disks prior to planned maintenance on the primary subsystems) and unplanned HyperSwap (allowing the client to continue processing across a primary subsystem failure). They will test planned HyperSwap while their PRIMARYFAILURE policy option is still set to STOP. However, when they are comfortable and ready, they will change to running with a PRIMARYFAILURE=SWAP,STOP policy to enable unplanned HyperSwap.
7. Finally, and assuming the consolidation onto Linux on System z has proceeded, the Heterogeneous Disaster Recovery capability will be implemented to manage the z/VM systems and their guests, and to add planned and unplanned HyperSwap support for z/VM and the Linux guests.
Although the ability to transparently swap FB devices using HyperSwap is not available for z/VM guest Linux systems using FB disks, it is still possible to manage PPRC for these disks. GDPS will provide data consistency, perform the physical swap, and manage the re-IPL on the swapped-to disks.
z/VM systems hosting Linux guests using CKD disks will be placed under GDPS xDR control, providing them with near-equivalent management to what is provided for z/OS systems in the sysplex, including planned and unplanned HyperSwap.
And because it is all managed by the same GDPS, the swap can be initiated as a result of a problem on a z/OS disk, meaning that you do not have to wait for the problem to spread to the Linux disks before the swap is initiated. Equally, a problem on a CKD Linux disk can result in a HyperSwap of the Linux disks and the z/OS disks.
The projected final configuration is shown in Figure 10-4 (for clarity, we have not included the Linux components in the figure).
Figure 10-4 Active/standby workload GDPS/PPRC configuration
10.4.1 Active/active workload
As mentioned, this client is in the process of enabling all its applications for data sharing and dynamic workload balancing. This project will proceed in parallel with the GDPS project. When the critical applications have been enabled for data sharing, the client plans to move to an active/active workload configuration, with several production systems in the primary site and others in the “recovery” site.
To derive the maximum benefit from this configuration, it must be possible to transparently swap from the primary to the secondary disks. Therefore, it is expected that the move to an active/active workload will not take place until after HyperSwap is enabled. The combination of multisite data sharing and HyperSwap means that the client’s applications will remain available across outages affecting a software subsystem (DB2, for example), an operating system, a processor, a Coupling Facility, or a disk subsystem (primary or secondary). The only event that can potentially result in a temporary application outage is an instantaneous outage of all resources in the primary site; this can result in the database managers in the recovery site having to be restarted.
The move to an active/active workload might require minor changes to the GDPS definitions, several new GDPS scripts, and modifications to existing scripts, depending on whether new systems will be added or some of the existing systems moved to the other site. Apart from that, however, there is no fundamental change in the way GDPS is set up or operated.
10.5 DR in two data centers, global distance
The client in this scenario has a data center in Asia and another in Europe. Following the tsunami disaster in 2004, the client decided to remote copy their production sysplex data to their data center in Europe. The client is willing to accept the small data loss that will result from the use of asynchronous remote copy.
However, there is a requirement that the data in the remote site be consistent, to allow application restart. In addition, to minimize the restart time, the solution must provide the ability to automatically recover the secondary disks and restart all the systems. The client has about 10,000 primary volumes that they want to mirror. The disks in the Asian data center are IBM disks, but those in the European center that will be used as the secondary volumes are currently non-IBM.
The most suitable GDPS offering for this client is GDPS/XRC. Because of the long distance between the two sites (approaching 15,000 km), using a synchronous remote copy method is out of the question. Because the disks in the two data centers are from different vendors, GDPS/GM is also ruled out. Table 10-4 shows how the client’s configuration and requirements map to the capabilities of GDPS/XRC.
Table 10-4 Mapping client requirements to GDPS/XRC attributes
Attribute                                                     | Supported by GDPS/XRC
Two sites, separated by thousands of km                       | Y
Willing to accept small data loss                             | Y (the actual amount of data loss depends on a number of factors, most notably the available bandwidth)
Maintain consistency of secondary volumes                     | Y
Maintain consistency of secondary volumes during resynch      | Y (1) (FlashCopy)
Over 10,000 volumes                                           | Y (exploits coupled SDM support)
Data replication for and between multiple storage vendors’ products | Y
Only z/OS disks need to be mirrored                           | Y
Automated recovery of disks and systems following a disaster  | Y (GDPS script support)
Note 1: FlashCopy is used to create a consistent set of secondary volumes prior to a resynchronization, following a suspension of remote copy sessions.
The first step for the client is to size the required bandwidth for the XRC links. This information will be used in the tenders for the remote connectivity. Assuming the cost of the remote links is acceptable, the client will start installing GDPS/XRC concurrently with setting up the remote connectivity.
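As a purely hypothetical illustration of the sizing arithmetic (the figures below are assumptions for the example, not measurements from this client), suppose the peak sustained write rate across the mirrored volumes is measured at 60 MBps:

   60 MBps x 8 bits/byte          = 480 Mbps line rate at peak
   480 Mbps x 1.4 headroom factor = approximately 670 Mbps required

The headroom factor allows for protocol overhead and workload growth. If the links are undersized, the System Data Mover falls behind during write peaks, and the data loss exposure (the effective RPO) grows accordingly.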
Pending the availability of the remote connectivity, three LPARs will be set up for XRC testing (two SDM LPARs, plus the GDPS Controlling system LPAR). This will allow the systems programmers and operators to become familiar with XRC and GDPS operations and control. The addressing of the SDM disks can be defined, agreed on, and added to the GDPS configuration in preparation for the connectivity becoming available.
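During this testing phase, the team can exercise both the GDPS interfaces and the underlying XRC TSO commands. The following is a minimal sketch with a hypothetical session name and volume serials; a real configuration would also define utility devices and tune the session parameters:

   XSTART XRC1 ERRORLEVEL(SESSION) SESSIONTYPE(XRC) HLQ(SYS1)
   XADDPAIR XRC1 VOLUME(PRD001 SDM001)
   XQUERY XRC1

XSTART starts the SDM session, XADDPAIR adds a primary/secondary volume pair to it, and XQUERY reports the session status. With more than 10,000 volumes, several such sessions will run in parallel and be coupled so that they recover to a single consistent point, which is the coupled SDM support referred to in Table 10-4.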
The final configuration is shown in Figure 10-5. The GDPS systems are in the same sysplex and reside on the same processor as the European production systems. In case of a disaster, additional CBU engines on that processor will automatically be activated by a GDPS script during the recovery process.
Figure 10-5 Final GDPS/XRC configuration
10.6 Other configurations
There are many other possible configuration combinations. However, we believe that the examples provided here cover the options of one or two sites, short and long distances, and continuous availability and disaster recovery requirements. If your configuration does not fit one of the scenarios described here, contact your IBM representative for more information about how GDPS can address your needs.
 

¹ Only the GDPS Controlling systems can see the FB disks. Therefore, a second Controlling system is recommended to ensure the FB disks can always be managed even if a Controlling system is down for some reason.