Disaster recovery
This chapter describes the use of the TS7700 in disaster recovery (DR).
This chapter includes the following sections:
TS7700 disaster recovery principles
Failover scenarios
Planning for disaster recovery
High availability and disaster recovery configurations
Disaster recovery testing
A real disaster
Geographically Dispersed Parallel Sysplex for z/OS
5.1 TS7700 disaster recovery principles
To understand the DR capabilities of the TS7700 grid, the following topics are described:
Data availability in the grid
Deferred Copy Queue
Volume ownership
5.1.1 Data availability
The fundamental function of the TS7700 is that all logical volumes are accessible through any of the virtual device addresses on the clusters in the grid configuration. If a copy of the logical volume is not available at that TS7700 cluster (either because it does not have a copy or the copy it does have is inaccessible because of an error), and a copy is available at another TS7700 cluster in the grid, the volume is accessed through the Tape Volume Cache (TVC) at the TS7700 cluster that has the available copy. If a recall is required to place the logical volume in the TVC on the other TS7700 cluster, it is done as part of the mount operation.
Whether a copy is available at another TS7700 cluster in a multi-cluster grid depends on the Copy Consistency Policy that was assigned to the logical volume when it was written. The Copy Consistency Policy is set through the Management Class (MC) storage construct. It specifies whether and when a copy of the data is made between the TS7700 clusters in the grid configuration. The following Copy Consistency Policies can be assigned:
Synchronous Copy (Synch): Data that is written to the cluster is compressed and simultaneously written to another specified cluster.
Rewind Unload (RUN): Data that is created on one cluster is copied to the other cluster as part of successful RUN command processing.
Deferred Copy (Deferred): Data that is created on one cluster is copied to the specified clusters after successful RUN command processing.
No Copy (None): Data that is created on one cluster is not copied to the other cluster.
Consider when the data is available on the cluster at the DR site. With Synchronous Copy, the data is written to a secondary cluster, so if the primary site is unavailable, the volume can be accessed on the cluster that specified Synch. With RUN, unless the Copy Count Override is enabled, any cluster that specified RUN has an available copy of the volume at close time. With No Copy, no copy is written to that cluster. With Deferred, a copy is made later, so it might or might not yet be available at the cluster that specified Deferred. (An illustration of per-cluster settings is shown at the end of this section.)
When you enable the Copy Count Override, it is possible to limit the number of RUN consistency points that are required before device end is returned to the application, which can result in fewer available copies of the data than your copy policies specify.
The Volume Removal policy for hybrid grid configurations is available in any grid configuration that contains at least one TS7720 or TS7720T cluster and should be considered as well. The TS7720 Disk-Only solution has a maximum storage capacity that is the size of its TVC, and TS7720T CP0 works like TS7720. Therefore, after the cache fills, this policy enables logical volumes to be removed automatically from cache while a copy is retained within one or more peer clusters in the grid. If the cache is filling up, it is possible that fewer copies of the volume exist in the grid than is expected based on the copy policy alone.
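As an illustration only, the following combinations show how these policies might be assigned, through the MI, to the two clusters of a two-cluster grid. The MC names are hypothetical; the point is that a consistency point is specified separately for each cluster in the grid:
MCSYNC: Cluster 0 = Synch, Cluster 1 = Synch (data is consistent at both clusters at sync points during the write)
MCRUN: Cluster 0 = RUN, Cluster 1 = RUN (a copy exists at both clusters when the volume is closed)
MCDEFER: Cluster 0 = RUN, Cluster 1 = Deferred (a copy reaches Cluster 1 some time after the volume is closed)
MCNOCOPY: Cluster 0 = RUN, Cluster 1 = No Copy (no copy is made at Cluster 1)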
5.1.2 Deferred Copy Queue
Aside from a copy policy of No Copy, a Deferred Copy policy has the least impact on the applications that are running on the host. Immediately after the volume is closed, device end is passed back to the application, and a copy is queued to be made later. These copies are placed on the Deferred Copy Queue.
With the standard settings, host application I/O always has a higher priority than the Deferred Copy Queue. The configuration and capacity of the grid are normally expected to be such that all queued copies complete each day; otherwise, incoming copies cause the Deferred Copy Queue to grow continually and the recovery point objective (RPO) might not be met.
When a cluster becomes unavailable because of broken grid links, an error, or a disaster, the incoming copy queue might not be complete, and some data might not be available on the other clusters in the grid. You can use the Bulk Volume Information Retrieval (BVIR) function to analyze the incoming copy queue, but the possibility exists that some volumes are not available. For backups, this situation might be acceptable, but for primary data, a Synch copy policy might be preferable to Deferred.
5.1.3 Volume ownership
If a logical volume is written on one of the clusters in the grid configuration and copied to another cluster, the copy can be accessed through either the original cluster or the other cluster.
At any time, however, a logical volume is owned by a single cluster, called the owning cluster. The owning cluster controls access to the volume and changes to the attributes that are associated with the volume (such as category or storage constructs). The cluster that has ownership of a logical volume can surrender it dynamically to another cluster in the grid configuration that is requesting a mount of the volume.
When a mount request is received on a virtual device address, the cluster for that virtual device must have ownership of the volume to be mounted, or must obtain the ownership from the cluster that owns it. If the clusters in a grid configuration and the communication paths between them are operational (grid network), the change of ownership and the processing of logical volume-related commands are transparent to the operation of the TS7700.
However, if a cluster that owns a volume is unable to respond to requests from other clusters, the operation against that volume fails, unless more direction is given. Clusters will not automatically assume or take over ownership of a logical volume without being directed.
This restriction prevents a failure of the grid network communication paths between the clusters from resulting in two clusters both believing that they have ownership of the volume. If more than one cluster has ownership of a volume, the volume’s data or attributes might be changed differently on each cluster, resulting in a data integrity issue with the volume.
If a cluster fails, is known to be unavailable (for example, a power fault in the IT center), or must be serviced, its ownership of logical volumes is transferred to the other cluster through one of the following modes.
These modes are set through the Management Interface (MI):
Read Ownership Takeover (ROT): When ROT is enabled for a failed cluster, ownership of a volume is allowed to be taken from a cluster that has failed. Only read access to the volume is allowed through the other cluster in the grid. After ownership for a volume is taken in this mode, any operation that attempts to modify data on that volume or change its attributes fails. The mode for the failed cluster remains in place until a different mode is selected or the failed cluster is restored.
Write Ownership Takeover (WOT): When WOT is enabled for a failed cluster, ownership of a volume is allowed to be taken from a cluster that has been marked as failed. Full access is allowed through the other cluster in the grid. The mode for the failed cluster remains in place until a different mode is selected or the failed cluster is restored.
Service prep/service mode: When a cluster is placed in service preparation mode or is in service mode, ownership of its volumes is allowed to be taken by the other cluster. Full access is allowed. The mode for the cluster in service remains in place until the cluster is taken out of service mode.
In addition to the manual setting of one of the ownership takeover modes, an optional automatic method named Autonomic Ownership Takeover Manager (AOTM) is available when each of the TS7700 clusters is attached to a TS3000 System Console (TSSC) and there is a communication path that is provided between the TSSCs. AOTM is enabled and defined by the IBM Service Support Representative (IBM SSR). If the clusters are near each other, multiple clusters in the same grid can be attached to the same TSSC, and the communication path is not required.
 
Guidance: The links between the TSSCs must not be the same physical links that are also used by the cluster grid gigabit (Gb) links. AOTM must have a different network so that it can determine that a missing cluster is really down and that the problem is not caused by a failure in the grid gigabit wide area network (WAN) links.
When AOTM is enabled by the IBM SSR and a cluster cannot obtain ownership from the owning cluster because it gets no response to an ownership request, a check is made through the TSSCs to determine whether the owning cluster is inoperable or whether the communication paths to it are not functioning. If the TSSCs determine that the owning cluster is inoperable, they enable either read or write ownership takeover, depending on what was set by the IBM SSR.
AOTM enables an ownership takeover mode only after a grace period, and it can be configured only by an IBM SSR. Until AOTM enables the configured takeover mode, jobs can fail with an option to try again. The grace period is set to 20 minutes by default and starts when a cluster detects that a remote cluster has failed; this detection itself can take several minutes.
The following OAM messages can be displayed when AOTM enables the ownership takeover mode:
CBR3750I Message from library libname: G0013 Library libname has experienced an unexpected outage with its peer library libname. Library libname might be unavailable or a communication issue might be present.
CBR3750I Message from library libname: G0009 Autonomic ownership takeover manager within library libname has determined that library libname is unavailable. The Read/Write ownership takeover mode has been enabled.
CBR3750I Message from library libname: G0010 Autonomic ownership takeover manager within library libname determined that library libname is unavailable. The Read-Only ownership takeover mode has been enabled.
When a cluster in the grid becomes inoperable, mounts directed to that cluster will begin to fail with the following messages:
CBR4195I LACS retry possible for job OAM:
IEE763I NAME= CBRLLACS CODE= 140394
CBR4000I LACS WAIT permanent error for drive xxxx.
CBR4171I Mount failed. LVOL=vvvvvv, LIB=libname, PVOL=??????, RSN=10.
IEE764I END OF CBR4195I RELATED MESSAGES
 
CBR4196D Job OAM, drive xxx, volser vvvvvv, error code 140394. Reply 'R' to retry or 'C' to cancel.
The response to this message should be ‘C’, and the logical drives in the failing cluster should then be varied offline with VARY devicenumber,OFFLINE to prevent further attempts from the host to mount volumes on this cluster.
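For example, assuming a hypothetical reply ID of 25 and a hypothetical device range of 1A00-1AFF for the failing cluster’s virtual drives (substitute the values from your own configuration), the operator commands might look as follows:
R 25,C
V 1A00-1AFF,OFFLINE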
A failure of a cluster causes the jobs that use its virtual device addresses to end abnormally (abend). To rerun the jobs, host connectivity to the virtual device addresses in the other cluster must be enabled (if it is not already), and an appropriate ownership takeover mode selected. If the other cluster has a valid copy of a logical volume, the jobs can be tried again.
If a logical volume is being accessed in a remote cache through the Ethernet link and that link fails, the job accessing that volume also fails. If the failed job is attempted again, the TS7700 uses another Ethernet link. If all links fail, access to any data in a remote cache is not possible.
After the failed cluster comes back online and establishes communication with the other clusters in the grid, the following message will be issued:
CBR3750I Message from library libname: G0011 Ownership takeover mode within library libname has been automatically disabled now that library libname has become available.
At this point, it is possible to issue VARY devicenumber,ONLINE to bring the logical drives on the cluster back into use.
After the cluster comes back into operation, if there are any volumes that are in a conflicting state because they were accessed on another cluster, the following message will be issued:
CBR3750I Message from library libname: OP0316 The TS7700 Engine has detected corrupted tokens for one or more virtual volumes.
If this occurs, see “Repair Virtual Volumes window” on page 512 for the process for repairing the corrupted tokens.
It is now possible to change the impact code and impact text that are issued with the CBR3750I messages. For more information, see 10.2.1, “CBR3750I Console Message” on page 598.
5.2 Failover scenarios
As part of a total systems design, you must develop business continuity procedures to instruct information technology (IT) personnel in the actions that they need to take in a failure. Test those procedures either during the initial installation of the system or at another time.
The scenarios are described in detail in the IBM Virtualization Engine TS7700 Series Grid Failover Scenarios white paper, which was written to assist IBM specialists and clients in developing such testing plans. The white paper is available on the IBM Techdocs website.
The white paper documents a series of TS7700 Grid failover test scenarios for z/OS that were run in an IBM laboratory environment. Simulations of single failures of all major components and communication links, and some multiple failures, are run.
5.3 Planning for disaster recovery
Although you can hope that a disaster does not happen, planning for such an event is important. This section provides information that can be used in developing a DR plan as it relates to a TS7700.
Many aspects of DR planning must be considered:
What input/output definition file (IODF) considerations apply to DR site connectivity?
How critical is the data in the TS7700?
Can the loss of some of the data be tolerated?
How much time can be tolerated before resuming operations after a disaster?
What are the procedures for recovery and who runs them?
How will you test your procedures?
5.3.1 Disaster recovery site connectivity IODF considerations
If your production hosts have FICON connectivity to the TS7700 clusters at your DR site, consider including those virtual device addresses in your production IODF. Having those devices configured but offline to your production hosts makes recovery easier if a TS7700 failure requires FICON access to the DR clusters. This approach is distance-dependent and might not be appropriate for all configurations.
To switch over to the DR clusters, a simple vary online of the DR devices is all that the production hosts need to enable their usage. An alternative is to have a separate IODF ready that adds the DR devices; however, that approach requires an IODF activation on the production hosts. A sketch of a predefined DR device definition follows.
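The following lines are a minimal, hedged sketch in IOCP/HCD-statement style. The control unit number, device numbers, and channel paths are assumptions only and depend entirely on your configuration; they show one of the 16 logical control units (CUADD 0 - F) that typically back the base 256 virtual 3490E drives of a cluster.
* One logical control unit (CUADD 0) of the DR cluster; repeat for CUADD 1-F
CNTLUNIT CUNUMBR=0A40,PATH=(50,51),UNITADD=((00,16)),CUADD=0,UNIT=3490
IODEVICE ADDRESS=(0A40,16),CUNUMBR=(0A40),UNITADD=00,UNIT=3490
* In the HCD OS configuration for these devices, set the OFFLINE parameter
* to YES so that they stay offline at IPL on the production hosts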
5.3.2 Grid configuration
With the TS7700, two types of configurations can be installed:
Stand-alone cluster
Multi-cluster grid
With a stand-alone system, a single cluster is installed. If the site at which that system is installed is destroyed, the data that is associated with the TS7700 might be lost unless COPY EXPORT was used and the tapes were removed from the site. If the cluster goes out of service due to failures, whether the data is recoverable depends on the failure type.
The recovery process assumes that the only elements that are available for recovery are the stacked volumes that are produced by COPY EXPORT and removed from the site. It further assumes that only a subset of the volumes is undamaged after the event. If the physical cartridges have been destroyed or irreparably damaged, recovery is not possible, as with any other cartridge types. It is important that you integrate the TS7700 recovery procedure into your current DR procedures.
 
Remember: The DR process is a joint exercise that requires your involvement and that of your IBM SSR to make it as comprehensive as possible.
For many clients, the potential data loss or the recovery time that is required with a stand-alone TS7700 is not acceptable because the COPY EXPORT method might take some time to complete. For those clients, the TS7700 grid provides a near-zero data loss and expedited recovery-time solution. With a multi-cluster grid configuration, up to six clusters are installed, typically at two or three sites, and interconnected so that data is replicated among them. The way that the sites are used then differs, depending on your requirements.
In a two-cluster grid, one potential use case is that one of the sites is the local production center and the other site is a backup or DR center, which is separated by a distance that is dictated by your company’s requirements for DR. Depending on the physical distance between the sites, it might be possible to have two clusters be both a high availability and DR solution.
In a three-cluster grid, the typical use is that two sites are connected to a host and the workload is spread evenly between them. The third site is strictly for DR and there probably are no connections from the production host to the third site. Another use for a three-cluster grid might consist of three production sites, which are all interconnected and holding the backups of each other.
In a four or more cluster grid, DR and high availability can be achieved. The high availability is achieved with two local clusters that keep RUN or SYNC volume copies, with both clusters attached to the host. The third and fourth (or more) remote clusters can hold deferred volume copies for DR. This design can be configured in a crossed way, which means that you can run two production data centers, with each production data center serving as a backup for the other.
The only connection between the production sites and the DR site is the grid interconnection. There is normally no host connectivity between the production hosts and the DR site’s TS7700. When client data is created at the production sites, it is replicated to the DR site as defined through Outboard policy management definitions and storage management subsystem (SMS) settings.
5.3.3 Planning guidelines
As part of planning a TS7700 grid configuration to address this solution, you must consider the following items:
Plan for the necessary WAN infrastructure and bandwidth. You need more bandwidth if you are primarily using a Copy Consistency Point of RUN or SYNC because any delays in copy time that are caused by bandwidth limitations result in longer job run times.
If you have limited bandwidth available between sites, use the Deferred Copy Consistency Point, or copy only the data that is critical to the recovery of your key operations. The amount of data that is sent through the WAN and the distance it is sent might justify establishing a separate, redundant, and dedicated network only for the multi-cluster grid. Newer IBM SAN42B-R extension switches with IP Extension (IPEX) capability are also available and might help with this issue.
A factor to consider in the implementation of Copy Export for DR is that the export does not capture any volumes in the export pool that are not in the TVC of the export cluster. Any data that is migrated to back-end tape is not going to be on the EXPORT COPY volumes.
Plan for host connectivity at your DR site with sufficient resources to run your critical workloads. If the cluster that is local to the production host becomes unavailable and there is no access to the DR site’s cluster by this host, production cannot run. Optionally, plan for an alternative host to take over production at the DR site.
Design and code the Data Facility Storage Management Subsystem (DFSMS) automatic class selection (ACS) routines to control which MC on the TS7700 is assigned. These MCs control which Copy Consistency Points are used. You might need MC assignment policies for testing your procedures at the DR site that differ from the production policies. A sketch of such an ACS routine is shown after this list.
Prepare procedures that your operators can run if the local site becomes unusable. The procedures include various tasks, such as bringing up the DR host, varying the virtual drives online, and placing the DR cluster in one of the ownership takeover modes.
Perform periodic capacity planning of your tape setup and host throughput to evaluate whether the DR setup can still handle the full production workload in a disaster.
If encryption is used in production, ensure that the disaster site also supports encryption. The encryption keys (EKs) must be available at the DR site or the data cannot be read.
Consider how you test your DR procedures. Many scenarios can be set up:
 – Test based on all data from an existing TS7700?
 – Test based on using the Copy Export function and an empty TS7700?
 – Test based on stopping production access to one TS7700 cluster and running production to another cluster?
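The following fragment is a minimal sketch of a Management Class ACS routine, written under the assumption that critical production data can be identified by data set name. The FILTLIST masks and the Management Class names MCSYNC and MCDEFER are hypothetical; they must match MCs that are defined on the TS7700 with the wanted Copy Consistency Points.
PROC MGMTCLAS
  FILTLIST CRITICAL INCLUDE(PROD.PAYROLL.**,PROD.LEDGER.**)  /* hypothetical critical data */
  SELECT
    WHEN (&DSN = &CRITICAL)
      SET &MGMTCLAS = 'MCSYNC'      /* MC with a Synchronous or RUN consistency point */
    OTHERWISE
      SET &MGMTCLAS = 'MCDEFER'     /* MC with a Deferred consistency point */
  END
END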
5.4 High availability and disaster recovery configurations
A few examples of grid configurations are addressed.
5.4.1 Example grid configurations
These examples are a small subset of possible configurations, and are only provided to show how the grid technology can be used. With five-cluster or six-cluster grids, there are many more ways to configure a grid.
Two-cluster grid
With a two-cluster grid, you can configure the grid for DR, high availability, or both. Configuration considerations for two-cluster grids are described. The scenarios that are presented are typical configurations. Other configurations are possible, and might be better suited for your environment.
Disaster recovery configuration
This section provides information that is needed to plan for a TS7700 two-cluster grid configuration to be used specifically for DR purposes.
The assumption in this scenario is that a natural or human-caused event has made the local site’s cluster unavailable. The two clusters are in separate locations, which are separated by a distance that is dictated by your company’s requirements for DR. The only connections between the local site and the DR site are the grid interconnections. There is no host connectivity between the local hosts and the DR site cluster.
Figure 5-1 summarizes this configuration.
Figure 5-1 Disaster recovery configuration
Consider the following information as part of planning a TS7700 grid configuration to implement this solution:
Plan for the necessary WAN infrastructure and bandwidth to meet your copy policy requirements. If you have limited bandwidth available between sites, have critical data copied with a consistency point of RUN, with the rest of the data using the Deferred Copy Consistency Point. RUN and SYNC are acceptable copy policies only for distances less than 100 km (62 miles); greater distances must rely on the Deferred Copy Consistency Point.
Plan for host connectivity at your DR site with sufficient resources to perform your critical workloads.
Design and code the DFSMS ACS routines to control what MC on the TS7700 is assigned, which determines what data gets copied, and by which Copy Consistency Point.
Prepare procedures that your operators can run if the local site becomes unusable. The procedures include various tasks, such as bringing up the DR host, varying the virtual drives online, and placing the DR cluster in one of the ownership takeover modes (unless AOTM is configured).
Configuring for high availability
This section provides the information that is needed to plan for a two-cluster grid configuration to be used specifically for high availability. The assumption is that continued access to data is critical, and no single point of failure, repair, or upgrade can affect the availability of data.
In a high-availability configuration, both clusters are within metro distance of each other. These clusters are connected through a LAN. If one of them becomes unavailable because it has failed, or is undergoing service or being updated, data can be accessed through the other cluster until the unavailable cluster is made available.
As part of planning a grid configuration to implement this solution, consider the following information:
Plan for the virtual device addresses in both clusters to be configured to the local hosts. In this way, a total of 512 or 992 virtual tape devices are available for use (256 or 496 from each cluster).
Set up a Copy Consistency Point of RUN for both clusters for all data to be made highly available. With this Copy Consistency Point, as each logical volume is closed, it is copied to the other cluster.
Design and code the DFSMS ACS routines and MCs on the TS7700 to set the necessary Copy Consistency Points.
Ensure that AOTM is configured for an automated logical volume ownership takeover method in case a cluster becomes unexpectedly unavailable within the grid configuration. Alternatively, prepare written instructions for the operators that describe how to perform the ownership takeover manually, if necessary. See 2.3.34, “Autonomic Ownership Takeover Manager” on page 96 for more details about AOTM.
Figure 5-2 summarizes this configuration.
Figure 5-2 Availability configuration
Configuring for disaster recovery and high availability
You can configure a two-cluster grid configuration to provide both DR and high availability solutions. The assumption is that the two clusters are in separate locations, which are separated by a distance that is dictated by your company’s requirements for DR. In addition to the configuration considerations for DR, you need to plan for the following items:
Access to the FICON channels on the cluster at the DR site from your local site’s hosts. This can involve connections that use dense wavelength division multiplexing (DWDM) or channel extender, depending on the distance that separates the two sites. If the local cluster becomes unavailable, you use this remote access to continue your operations by using the remote cluster.
Because the virtual devices on the remote cluster are connected to the host through a DWDM or channel extension, there can be a difference in read or write performance when compared to the virtual devices on the local cluster.
If performance differences are a concern, consider using the virtual device addresses in the remote cluster only when the local cluster is unavailable. In that case, you must provide operator procedures to vary the virtual devices of the remote cluster online and offline.
You might want to have separate Copy Consistency Policies for your DR data versus your data that requires high availability.
Figure 5-3 summarizes this configuration.
Figure 5-3 Availability and disaster recovery configuration
Three-cluster grid
With a three-cluster grid, you can configure the grid for DR and high availability or use dual production sites that share a common DR site. Configuration considerations for three-cluster grids are described. The scenarios that are presented are typical configurations. Other configurations are possible and might be better suited for your environment.
The planning considerations for a two-cluster grid also apply to a three-cluster grid.
High availability and disaster recovery
Figure 5-4 illustrates a combined high availability and DR solution for a three-cluster grid. In this example, Cluster 0 and Cluster 1 are the high-availability clusters and are local to each other (less than 50 kilometers (31 miles) apart). Cluster 2 is at a remote site that is away from the production site or sites. The virtual devices in Cluster 0 and Cluster 1 are online to the host and the virtual devices in Cluster 2 are offline to the host. The host accesses the virtual devices that are provided by Cluster 0 and Cluster 1.
Host data that is written to Cluster 0 is copied to Cluster 1 at RUN time or earlier with Synchronous mode. Host data that is written to Cluster 1 is written to Cluster 0 at RUN time. Host data that is written to Cluster 0 or Cluster 1 is copied to Cluster 2 on a Deferred basis.
The Copy Consistency Points at the DR site (NNR or NNS) are set to create a copy of host data only at Cluster 2. Copies of data are not made to Cluster 0 and Cluster 1. This setting enables DR testing at Cluster 2 without replicating to the production site clusters.
Figure 5-4 shows an optional host connection that can be established to the remote Cluster 2 by using DWDM or channel extenders. With this configuration, you must define an extra 256 or 496 virtual devices at the host.
Figure 5-4 High availability and disaster recovery configuration
Dual production site and disaster recovery
Figure 5-5 illustrates dual production sites that are sharing a DR site in a three-cluster grid (similar to a hub-and-spoke model). In this example, Cluster 0 and Cluster 1 are separate production systems that can be local to each other or distant from each other. The DR cluster, Cluster 2, is at a remote site at a distance away from the production sites.
The virtual devices in Cluster 0 are online to Host A and the virtual devices in Cluster 1 are online to Host B. The virtual devices in Cluster 2 are offline to both hosts. Host A and Host B access their own set of virtual devices that are provided by their respective clusters. Host data that is written to Cluster 0 is not copied to Cluster 1. Host data that is written to Cluster 1 is not written to Cluster 0. Host data that is written to Cluster 0 or Cluster 1 is copied to Cluster 2 on a Deferred basis.
The Copy Consistency Points at the DR site (NNR or NNS) are set to create a copy of host data only at Cluster 2. Copies of data are not made to Cluster 0 and Cluster 1. This setting enables DR testing at Cluster 2 without replicating to the production site clusters.
Figure 5-5 shows an optional host connection that can be established to remote Cluster 2 using DWDM or channel extenders.
Figure 5-5 Dual production site with disaster recovery
Three-cluster high availability production site and disaster recovery
This model has been adopted by many clients. In this configuration, two clusters are in the production site (same building or separate location within metro area) and the third cluster is remote at the DR site. Host connections are available at the production site (or sites).
In this configuration, each TS7700D replicates to both its local TS7700D peer and to the remote TS7740/TS7700T. Optional copies in both TS7700D clusters provide high availability plus cache access time for the host accesses. At the same time, the remote TS7740/TS7700T provides DR capabilities and the remote copy can be remotely accessed, if needed.
This configuration, which provides high-availability production cache if you choose to run balanced mode with three copies (R-R-D for both Cluster 0 and Cluster 1), is depicted in Figure 5-6.
Figure 5-6 Three-cluster high availability and disaster recovery with two TS7700Ds and one TS7740/TS7700T tape library
Another variation of this model uses a TS7700D and a TS7740/TS7700T for the production site, as shown in Figure 5-7, both replicating to a remote TS7740/TS7700T.
Figure 5-7 Three-cluster high availability and disaster recovery with two TS7740/TS7700T tape libraries and one TS7700D
In both models, if a TS7700D reaches the upper threshold of usage, the PREFER REMOVE data, which has already been replicated to the TS7740/TS7700T, is removed from the TS7700D cache followed by the PREFER KEEP data. PINNED data can never be removed from a TS7700D cache or a TS7700T CP0.
In the example that is shown in Figure 5-7, you can have particular workloads that favor the TS7740/TS7700T, and others that favor the TS7700D, suiting a specific workload to the cluster best equipped to perform it.
Copy Export (shown as optional in both figures) can be used to have an additional copy of the migrated data, if required.
Four-cluster grid
This section describes a four-cluster grid in which both sites serve dual purposes. Both sites are equal players within the grid, and either site can play the role of production or DR, as required.
Dual production and disaster recovery at Metro Mirror distance
In this model, you have dual production and DR sites. Although one site can be labeled the high availability pair and the other the DR site, the sites are equivalent from a technology and functional design standpoint. In this example, you have two production clusters within metro distance of each other and two remote DR clusters, also within metro distance of each other. This configuration delivers the same capacity as a two-cluster grid configuration, with the high availability of a four-cluster grid. See Figure 5-8.
Figure 5-8 Four-cluster high availability and disaster recovery
You can have host workload balanced across both clusters (Cluster 0 and Cluster 1 in Figure 5-8). The logical volumes that are written to a particular cluster are only replicated to one remote cluster. In Figure 5-8, Cluster 0 replicates to Cluster 2 and Cluster 1 replicates to Cluster 3. This task is accomplished by using copy policies. For the described behavior, copy mode for Cluster 0 is RDRN or SDSN and for Cluster 1 is DRNR or DSNS.
This configuration delivers high availability at both sites, production and DR, without four copies of the same tape logical volume throughout the grid.
If this example were not within Metro Mirror distances, use copy policies of RDDN on Cluster 0 and DRND on Cluster 1.
Figure 5-9 shows the four-cluster grid reaction to a cluster outage. In this example, Cluster 0 goes down due to an electrical power outage. You lose all logical drives that are emulated by Cluster 0. The host uses the remaining addresses that are emulated by Cluster 1 for the entire production workload.
Figure 5-9 Four-cluster grid high availability and disaster recovery - Cluster 0 outage
During the outage of Cluster 0 in the example, new jobs for write use only one half of the configuration (the unaffected partition in the lower part of Figure 5-9). Jobs for read can access content in all available clusters. When power is normalized at the site, Cluster 0 starts and rejoins the grid, reestablishing the original balanced configuration.
In a DR situation, the backup host in the DR site operates from the second high availability pair, which is the pair of Cluster 2 and Cluster 3 in Figure 5-9. In this case, copy policies can be RNRD for Cluster 2 and NRNR for Cluster 3.
If these sites are more than Metro Mirror distance, you can have Cluster 2 copy policies of DNRD and Cluster 3 policies of NDDR.
5.4.2 Restoring the host and library environments
Before you can use the recovered logical volumes, you must restore the host environment. The following steps are the minimum steps that you need to continue the recovery process of your applications:
1. Restore the tape management system (TMS) CDS.
2. Restore the DFSMS data catalogs, including the tape configuration database (TCDB).
3. Define the I/O gen by using the Library IDs of the recovery TS7700 tape drives.
4. Update the library definitions in the source control data set (SCDS) with the Library IDs for the recovery TS7700 tape drives in the composite library and distributed library definition windows.
5. Activate the I/O gen and the SMS SCDS.
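As an illustration of step 5 (the IODF suffix 99 and the SCDS name SYS1.SMS.SCDS are assumptions; use your own values), the activation can be performed dynamically from the recovery host console:
ACTIVATE IODF=99,TEST
ACTIVATE IODF=99
SETSMS SCDS(SYS1.SMS.SCDS)
The TEST option verifies that the dynamic activation can succeed before the configuration is changed.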
You might also want to update the library nicknames that are defined through the MI for the grid and cluster to match the library names defined to DFSMS. That way, the names that are shown on the MI windows match those names that are used at the host for the composite library and distributed library.
To set up the composite name that is used by the host to be the grid name, complete the following steps:
1. Select Configuration → Grid Identification Properties.
2. In the window that opens, enter the composite library name that is used by the host in the grid nickname field.
3. You can optionally provide a description.
Similarly, to set up the distributed name, complete the following steps:
1. Select Configuration → Cluster Identification Properties.
2. In the window that opens, enter the distributed library name that is used by the host in the Cluster nickname field.
3. You can optionally provide a description.
These names can be updated at any time.
5.5 Disaster recovery testing
The TS7700 grid configuration provides a solution for DR needs when data loss and the time for recovery must be minimized. Although a real disaster is not something that can be anticipated, it is important to have tested procedures in place in case one occurs. For discussion of DR testing practices, see Chapter 13, “Disaster recovery testing” on page 767.
5.6 A real disaster
To clarify what a real disaster means, consider a hardware issue that, for example, stops the TS7700 for 12 hours. Is this a real disaster? It depends.
For a bank, during the batch window, and without any alternative way to bypass a 12-hour TS7700 outage, this can be a real disaster. However, if the bank has a three-cluster grid (two local and one remote), the same situation is less dire because batch processing can continue by accessing the second local TS7700.
Because no set of fixed answers exists for all situations, you must carefully and clearly define which situations can be considered real disasters, and which actions to perform for all possible situations.
Several differences exist between a DR test situation and a real disaster situation. In a real disaster situation, you do not have to do anything to be able to use the DR TS7700, which makes your task easier. However, this easy-to-use capability does not mean that you have all the cartridge data copied to the DR TS7700.
If your copy mode is RUN, you need to consider only the in-flight tapes that were being created when the disaster happened. You must rerun all of these jobs to re-create the tapes at the DR site. Alternatively, if your copy mode is Deferred, you have tapes that are not yet copied. To identify them, you can go to the MI of the DR TS7700 and find the volumes that are still in the copy queue. With this information, you can use your TMS to discover which data sets are missing, and rerun the jobs to re-create those data sets at the DR site.
Figure 5-10 shows an example of a real disaster situation.
Figure 5-10 Real disaster situation
In a real disaster scenario, the whole primary site is lost. Therefore, you need to start your production systems at the DR site. To do this, you need to have a copy of all your information not only on tape, but all DASD data copied to the DR site.
After you can start the z/OS partitions, from the TS7700 perspective, you must be sure that your hardware configuration definition (HCD) “sees” the DR TS7700. Otherwise, you cannot put the TS7700 online.
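The following console commands are a hedged example of these checks (the composite library name DRGRID and the device range 1B00-1BFF are placeholders for your own values):
D SMS,LIB(ALL),DETAIL
V SMS,LIBRARY(DRGRID),ONLINE
V 1B00-1BFF,ONLINE
D U,,,1B00,16
The DISPLAY SMS command confirms that the composite and distributed libraries are defined and operational, and the DISPLAY UNIT command shows the status of the first 16 virtual drives after they are varied online.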
You must also enable ownership takeover. To perform that task, go to the MI and enable ownership takeover for read and write.
None of the customizations that you made for DR testing are needed during a real disaster. Production tape ranges, scratch categories, SMS definitions, the RMM inventory, and so on, come from the real configuration that is on DASD and is copied from the primary site.
Perform the following changes because of the special situation that a disaster merits:
Change your MC to obtain a dual copy of each tape that is created after the disaster.
Depending on the situation, consider using the Copy Export capability to move one of the copies outside the DR site.
After you are in a stable situation at the DR site, you need to start the tasks that are required to recover your primary site or to create a new site. The old DR site is now the production site, so you must create a new DR site.
5.7 Geographically Dispersed Parallel Sysplex for z/OS
The IBM Z multisite application availability solution, Geographically Dispersed Parallel Sysplex (GDPS), integrates Parallel Sysplex technology and remote copy technology to enhance application availability and improve DR. The GDPS topology is a Parallel Sysplex cluster that is spread across two sites, with all critical data mirrored between the sites. GDPS manages the remote copy configuration and storage subsystems, automates Parallel Sysplex operational tasks, and automates failure recovery from a single point of control, improving application availability.
5.7.1 Geographically Dispersed Parallel Sysplex considerations in a TS7700 grid configuration
A key principle of GDPS is to have all I/O be local to the system that is running production. Another principle is to provide a simplified method to switch between the primary and secondary sites, if needed. The TS7700 grid configuration provides a set of capabilities that can be tailored to enable it to operate efficiently in a GDPS environment. Those capabilities and how they can be used in a GDPS environment are described in the following sections.
Direct production data I/O to a specific TS7740
The hosts are directly attached to the TS7740 that is local to them, which is the first consideration in directing I/O to a specific TS7740. Host channels from each site’s GDPS hosts are also typically installed to connect to the TS7740 at the site that is remote to a host, but only to cover recovery when the TS7740 cluster at the GDPS primary site is down. During normal operation, the remote virtual devices are set offline in each GDPS host.
The default behavior of the TS7740 in selecting which TVC is used for the I/O is to follow the MC definitions and considerations to provide the best overall job performance. However, it uses a logical volume in a remote TS7740’s TVC, if required, to perform a mount operation unless override settings on a cluster are used.
To direct the TS7740 to use its local TVC, complete the following steps:
1. For the MC that is used for production data, ensure that the local cluster has a Copy Consistency Point. If it is important to know that the data is replicated at job close time, specify a Copy Consistency Point of RUN or Synchronous mode copy.
If some amount of data loss after a job closes can be tolerated, a Copy Consistency Point of Deferred can be used. You might have production data with different data loss tolerance. If that is the case, you might want to define more than one MC with separate Copy Consistency Points. In defining the Copy Consistency Points for an MC, it is important that you define the same copy mode for each site because in a site switch, the local cluster changes.
2. Set Prefer Local Cache for Fast Ready Mounts in the MI Copy Policy Override window. This override selects the TVC local to the TS7740 on which the mount was received if it is available and a Copy Consistency Point other than No Copy is specified for that cluster in the MC specified with the mount. The cluster does not have to have a valid copy of the data for it to be selected for the I/O TVC.
3. Set Prefer Local Cache for Non-Fast Ready Mounts in the MI Copy Policy Override window. This override selects the TVC local to the TS7740 on which the mount was received if it is available and the cluster has a valid copy of the data, even if the data is only on a physical tape. Having an available, valid copy of the data overrides all other selection criteria. If the local cluster does not have a valid copy of the data, without the next override, it is possible that the remote TVC is selected.
4. Set Force Volume Copy to Local. This override has two effects, depending on the type of mount requested. For a private mount, if a valid copy does not exist on the cluster, a copy is performed to the local TVC as part of the mount processing. For a scratch mount, it has the effect of OR-ing the specified MC with a Copy Consistency Point of RUN for the cluster, which forces the local TVC to be used. The override does not change the definition of the MC. It serves only to influence the selection of the I/O TVC or to force a local copy.
5. Ensure that these override settings are duplicated on both TS7740 Virtualization Engines.
Switching production from one TS7700 to another
The way that data is accessed by either TS7740 is based on the logical volume serial number. No changes are required in tape catalogs, job control language (JCL), or TMSs. In a failure in a TS7740 grid environment with GDPS, three scenarios can occur:
GDPS switches the primary host to the remote location and the TS7740 grid is still fully functional:
 – No manual intervention is required.
 – Logical volume ownership transfer is done automatically during each mount through the grid.
A disaster happens at the primary site, and the GDPS host and TS7740 cluster are down or inactive:
 – Automatic ownership takeover of volumes that are then accessed from the remote host is not possible.
 – Manual intervention is required. Through the TS7740 MI, the administrator must start a manual ownership takeover. To do so, use the TS7740 MI and click Service → Ownership Takeover Mode.
Only the TS7740 cluster at the GDPS primary site is down. In this case, two manual interventions are required:
 – Vary online remote TS7740 cluster devices from the primary GDPS host.
 – Because ownership of volumes that are owned by the down cluster cannot be taken over automatically when those volumes are accessed from the remote host, manual intervention is required. Through the TS7740 MI, start a manual ownership takeover. To do so, click Service → Ownership Takeover Mode in the TS7740 MI.
5.7.2 Geographically Dispersed Parallel Sysplex functions for the TS7700
GDPS provides TS7700 configuration management and displays the status of the managed TS7700 tape drives on GDPS windows. TS7700 tape drives that are managed by GDPS are monitored, and alerts are generated for abnormal conditions. The capability to control TS7700 replication from GDPS scripts and windows, by using TAPE ENABLE and TAPE DISABLE by library, grid, or site, is provided for managing the TS7700 during planned and unplanned outage scenarios.
The TS7700 provides a capability called Bulk Volume Information Retrieval (BVIR). If there is an unplanned interruption to tape replication, GDPS uses this BVIR capability to automatically collect information about all volumes in all libraries in the grid where the replication problem occurred. In addition to this automatic collection of in-doubt tape information, it is possible to request GDPS to perform BVIR processing for a selected library by using the GDPS window interface at any time.
5.7.3 Geographically Dispersed Parallel Sysplex implementation
Before implementing the GDPS support for TS7700, ensure that you review and understand the following topics:
IBM Virtualization Engine TS7700 Series Best Practices Copy Consistency Points, which is available on the IBM Techdocs website
IBM Virtualization Engine TS7700 Series Best Practices Synchronous Copy Mode, which is also available on the IBM Techdocs website
The complete instructions for implementing GDPS with the TS7700 can be found in the GDPS manuals.