Introduction to business resilience and the role of GDPS
In this chapter, we discuss the objective of this book and briefly introduce the contents and layout. We discuss the topic of business IT resilience from a technical perspective (we refer to it as IT resilience).
The chapter includes a general description that is not specific to mainframe platforms, although the topics are covered from an enterprise systems and mainframe perspective. Finally, we introduce the members of the IBM Geographically Dispersed Parallel Sysplex (GDPS) family of offerings and provide a brief description of the aspects of an IT resilience solution that each offering addresses.
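This chapter includes the following topics:
1.1 Objective
1.2 Layout of this book
1.3 IT resilience
1.4 Characteristics of an IT resilience solution
1.5 GDPS offerings
1.6 Automation and disk replication compatibility
1.7 Summary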
1.1 Objective
Business IT resilience is a high-profile topic across many industries and businesses. Apart from the business drivers requiring near-continuous application availability, government regulations in various industries now take the decision about whether to have an IT resilience capability out of your hands.
This book was developed to provide an introduction to the topic of business resilience from an IT perspective, and to share how GDPS can help you address your IT resilience requirements.
1.2 Layout of this book
This chapter starts by presenting an overview of IT resilience and disaster recovery. These practices have existed for many years. However, recently they have become more complex because of a steady increase in the complexity of applications, the increasingly advanced capabilities of available technology, competitive business environments, and government regulations.
In Chapter 2, “Infrastructure planning for availability and GDPS” on page 13, we briefly describe the available technologies typically used in a GDPS solution to achieve IT resilience goals. To understand the positioning and capabilities of the various offerings (which encompass hardware, software, and services), it is also useful to have at least a basic understanding of the underlying technology.
Following these two introductory chapters and starting with Chapter 3, “GDPS Metro” on page 53, we describe the capabilities and prerequisites of each offering in the GDPS family of offerings. Because each offering addresses fundamentally different requirements, each member of the GDPS family of offerings is described in a chapter of its own.
Finally, we include a section with examples illustrating how the various GDPS offerings can satisfy your requirements for IT resilience and disaster recovery.
1.3 IT resilience
IBM defines IT resilience as the ability to rapidly adapt and respond to any internal or external disruption, demand, or threat, and continue business operations without significant impact.
IT resilience is related to, but broader in scope than, disaster recovery. Disaster recovery concentrates solely on recovering from an unplanned event.
When you investigate IT resilience options, these two terms must be at the forefront of your thinking:
Recovery time objective (RTO)
This term refers to how long your business can afford to wait for IT services to be resumed following a disaster.
If this number is not clearly stated now, think back to the last time that you had a significant service outage. How long was that outage, and how much difficulty did your company suffer as a result? This can help you get a sense of whether to measure your RTO in days, hours, or minutes.
Recovery point objective (RPO)
This term refers to how much data your company is willing to re-create following a disaster. In other words, what is the acceptable time difference between the data in your production system and the data at the recovery site?
As an example, if your disaster recovery solution depends on daily full volume tape dumps, your RPO is 24 - 48 hours depending on when the tapes are taken offsite. If your business requires an RPO of less than 24 hours, you will almost certainly be forced to do some form of offsite real-time data replication instead of relying on these tapes alone.
The terms RTO and RPO are used repeatedly in this book because they are core concepts in the methodology that you can use to meet your IT resilience needs.
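The tape example above can be turned into a small calculation. The following sketch is illustrative only: it assumes that a disaster can strike just before the next backup is safely offsite, so the worst-case data loss is one backup interval plus the offsite transport delay. The function names and parameters are hypothetical and are not part of any GDPS offering.

```python
def worst_case_rpo_hours(backup_interval_hours: float,
                         offsite_delay_hours: float) -> float:
    """Worst-case data loss window for a periodic offsite backup scheme.

    Assumption: the newest usable copy at the recovery site can be as old
    as one full backup interval plus the time that the backup takes to
    reach the offsite location.
    """
    if backup_interval_hours < 0 or offsite_delay_hours < 0:
        raise ValueError("durations must be non-negative")
    return backup_interval_hours + offsite_delay_hours


def meets_rpo(required_rpo_hours: float,
              backup_interval_hours: float,
              offsite_delay_hours: float) -> bool:
    """Return True if the backup scheme satisfies the required RPO."""
    worst_case = worst_case_rpo_hours(backup_interval_hours,
                                      offsite_delay_hours)
    return worst_case <= required_rpo_hours


# Daily dumps that go offsite immediately, or up to a day later.
print(worst_case_rpo_hours(24, 0))
print(worst_case_rpo_hours(24, 24))
# A 4-hour RPO cannot be met with daily tapes alone.
print(meets_rpo(4, 24, 0))
```

Under these assumptions, daily dumps give a worst case of 24 to 48 hours, which matches the range quoted in the example.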
1.3.1 Disaster recovery
As mentioned, the practice of preparing for disaster recovery (DR) is something that has been a focus of IT planning for many years. In turn, there is a wide range of offerings and approaches available to accomplish DR. Several options rely on offsite or even outsourced locations that are contracted to provide data protection or even servers if there is a true IT disaster. Other options rely on in-house IT infrastructures and technologies that can be managed by your own teams.
There is no one correct answer for which approach is better for every business. However, the first step in deciding what makes the most sense for you is to have a good view of your IT resiliency objectives, specifically your RPO and RTO.
Although Table 1-1 does not cover every possible DR offering and approach, it does provide a view of what RPO and RTO might typically be achieved with some common options.
Table 1-1 Typical achievable RPO and RTO for some common DR options
| Description | Typically achievable recovery point objective (RPO) | Typically achievable recovery time objective (RTO) |
| --- | --- | --- |
| No disaster recovery plan | Not applicable: all data is lost | Not applicable |
| Tape vaulting | Measured in days since last stored backup | Days |
| Electronic vaulting | Hours | Hours (hot remote location) to days |
| Active replication to remote site (without recovery automation) | Seconds to minutes | Hours to days (dependent on availability of recovery hardware) |
| Active storage replication to remote “in-house” site | Zero to minutes (dependent on replication technology and automation policy) | One or more hours (dependent on automation) |
| Active software replication to remote “active” site | Seconds to minutes | Seconds to minutes (dependent on automation) |
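One way to use Table 1-1 is as a lookup: state your required RPO and RTO as orders of magnitude, and list the options whose typical figures fit. The following sketch is a rough illustration, not a sizing tool. It assumes a coarse scale (seconds, minutes, hours, days) and conservatively takes the upper end of each range in the table. The `Scale` type and `candidate_options` function are hypothetical names introduced for this example.

```python
from enum import IntEnum


class Scale(IntEnum):
    """Coarse orders of magnitude used in Table 1-1 (assumed ordering)."""
    ZERO = 0
    SECONDS = 1
    MINUTES = 2
    HOURS = 3
    DAYS = 4


# (option, upper end of typical RPO, upper end of typical RTO).
# "No disaster recovery plan" is omitted because RPO and RTO do not apply.
DR_OPTIONS = [
    ("Tape vaulting", Scale.DAYS, Scale.DAYS),
    ("Electronic vaulting", Scale.HOURS, Scale.DAYS),
    ("Active replication to remote site (without recovery automation)",
     Scale.MINUTES, Scale.DAYS),
    ("Active storage replication to remote in-house site",
     Scale.MINUTES, Scale.HOURS),
    ("Active software replication to remote active site",
     Scale.MINUTES, Scale.MINUTES),
]


def candidate_options(max_rpo: Scale, max_rto: Scale) -> list:
    """Return the options whose typical worst case fits both objectives."""
    return [name for name, rpo, rto in DR_OPTIONS
            if rpo <= max_rpo and rto <= max_rto]


# Example: the business tolerates minutes of data loss and hours of downtime.
for option in candidate_options(Scale.MINUTES, Scale.HOURS):
    print(option)
```

Because the sketch uses the upper end of each range, it can exclude an option that would meet the objective in a favorable configuration (for example, storage replication that is configured for an RPO of zero).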
Generally a form of real-time software or hardware replication is required to achieve an RPO of minutes or less, but the only technologies that can provide an RPO of zero are synchronous replication technologies (see 2.3, “Synchronous versus asynchronous data transfer” on page 19) coupled with automation to ensure that no data is written to one location and not the other.
The recovery time is largely dependent on the availability of hardware to support the recovery and control over that hardware. You might have real-time software or hardware-based replication in place, but without server capacity at the recovery site you will have hours to days before you can recover this previously current data.
Furthermore, even with all the spare capacity and current data, you might find that you are relying on people to perform the recovery actions. In a true disaster, these same people might not be available or, even more likely, they might find that the recovery processes and procedures are unpracticed or inaccurate. This is where automation comes in: it mitigates the risk introduced by the human element and helps ensure that you actually meet the RTO required by the business.
Also, you might decide that one DR option is not appropriate for all aspects of the business. Various applications might tolerate a much greater loss of data and might not have an RPO as low as others. At the same time, some applications might not require recovery within hours whereas others most certainly do.
Although there is obvious flexibility in choosing different DR solutions for each application, the added complexity this can bring needs to be balanced carefully against the business benefit. The preferred approach, supported by GDPS, is to provide a single optimized solution for the enterprise. This generally leads to a simpler solution and, because less infrastructure and software might need to be duplicated, often a more cost-effective solution, too. Consider a different DR solution only for your most critical applications, where their requirements cannot be catered for with a single solution.
1.3.2 The next level
In addition to the ability to recover from a disaster, many businesses now look for a greater level of availability covering a wider range of events and scenarios. This larger requirement is called IT resilience. In this book, we concentrate on two aspects of IT resilience: Disaster recovery, as discussed previously, and continuous availability (CA), which encompasses recovering from disasters and keeping your applications up and running throughout the far more common planned and unplanned outages that do not constitute an actual disaster.
For some organizations, a proven disaster recovery capability that meets their RTO and RPO can be sufficient. Other organizations might need to go a step further and provide near-continuous application availability.
There are several market factors that make IT resilience imperative:
High and constantly increasing client and market requirements for continuous availability of IT processes
Financial loss because of lost revenue, punitive penalties or fines, or legal actions that are a direct result of disruption to critical business services and functions
An increasing number of security-related incidents, causing severe business impacts
Increasing regulatory requirements
Major potential business impact in areas such as market reputation and brand image from security or outage incidents
Few events affect a company as much as having an IT outage, even for a matter of minutes, and then finding a report of the incident splashed across the newspapers and the evening news. Today, your clients, employees, and suppliers expect to be able to do business with you around the clock and from around the globe.
To help keep business operations running 24x7, you need a comprehensive business continuity plan that goes beyond disaster recovery. Maintaining high availability and continuous operations in normal day-to-day operations are also fundamental for success. Businesses need resiliency to help ensure two essentials:
Key business applications and data are protected and available
If a disaster occurs, business operations continue with a minimal impact
Regulations
In some countries, government regulations specify how organizations must handle data and business processes. An example is the Health Insurance Portability and Accountability Act (HIPAA) in the United States. This law defines how an entire industry, the US healthcare industry, must handle and account for patient-related data.
Other well-known examples include the US government-released Interagency Paper on Sound Practices to Strengthen the Resilience of the US Financial System, which loosely drove changes in the interpretation of IT resilience within the US financial industry, and the Basel II rules for the European banking sector, which stipulate that banks must have a resilient back-office infrastructure.
This area is also accelerating as financial systems around the world become more interconnected. Although a set of recommendations published in Singapore (such as the SS 540-2008 Standard on Business Continuity Management) might directly address only businesses in a relatively small area, it is common for companies to do business in many countries around the world, where such recommendations might be requirements for ongoing business operations of any kind.
Business requirements
An important concept to understand is that the cost and complexity of a solution can increase as you get closer to true continuous availability, and that the value of a potential loss must be borne in mind when deciding which solution you need, and which one you can afford. You do not want to spend more money on a continuous availability solution than the financial loss you can incur as a result of an outage.
A solution must be identified that balances the costs of the solution with the financial impact of an outage. Several studies have been done to identify the cost of an outage; however, most of them are several years old and do not accurately reflect the degree of dependence most modern businesses have on their IT systems.
Therefore, your company must calculate the impact in your specific case. If you have not already conducted such an exercise, you might be surprised at how difficult it is to arrive at an accurate number. For example, if you are a retailer and you suffer an outage in the middle of the night after all the batch work has completed, the financial impact is far less than if you had an outage of equal duration in the middle of your busiest shopping day. Nevertheless, to understand the value of the solution, you must go through this exercise, using assumptions that are fair and reasonable.
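The exercise described above can be sketched as a simple expected-loss calculation. This is a minimal illustration under stated assumptions: every figure is invented for the example, the loss per hour is treated as constant within a scenario, and the function names are hypothetical. A real analysis would also account for factors such as penalties, reputation, and regulatory exposure.

```python
def outage_loss(duration_hours: float, loss_per_hour: float) -> float:
    """Financial impact of a single outage (assumes a constant hourly loss)."""
    return duration_hours * loss_per_hour


def expected_annual_loss(scenarios) -> float:
    """Sum the expected yearly loss over outage scenarios.

    Each scenario is (outages_per_year, duration_hours, loss_per_hour).
    """
    return sum(freq * outage_loss(duration, rate)
               for freq, duration, rate in scenarios)


def solution_is_justified(annual_solution_cost: float,
                          loss_without_solution: float,
                          loss_with_solution: float) -> bool:
    """True if the solution costs less than the loss that it avoids."""
    return annual_solution_cost < loss_without_solution - loss_with_solution


# Hypothetical retailer: the same 2-hour outage costs far less at night
# than on the busiest shopping day.
scenarios = [
    (1.0, 2, 1_000),      # one overnight outage per year
    (0.5, 2, 250_000),    # one peak-day outage every two years
]
baseline = expected_annual_loss(scenarios)
print(baseline)
print(solution_is_justified(100_000, baseline, 20_000))
```

The point of the sketch is the comparison in `solution_is_justified`: a solution is worth considering only when its cost is lower than the loss that it is expected to avoid.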
1.3.3 Other considerations
In addition to the increasingly stringent availability requirements for traditional mainframe applications, there are other considerations, including those described in this section.
Increasing application complexity
The mixture of disparate platforms, operating systems, and communication protocols found within most organizations intensifies the already complex task of preserving and recovering business operations. Reliable processes are required for recovering the mainframe data and also, perhaps, data accessed by multiple types of UNIX, Microsoft Windows, or even a proliferation of virtualized distributed servers.
It is becoming increasingly common to have business transactions that span and update data on multiple platforms and operating systems. If a disaster occurs, your processes must be designed to recover this data in a consistent manner.
Just as you would not consider recovering half an application’s IBM DB2® data to 8:00 a.m. and the other half to 5:00 p.m., the data touched by these distributed applications must be managed to ensure that all of this data is recovered with consistency to a single point in time. The exponential growth in the amount of data generated by today’s business processes and IT servers compounds this challenge.
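The single point-in-time requirement can be expressed as a check on the recovery points of each platform. The following sketch is an illustration only: it assumes that you can obtain, for each platform, the timestamp to which its data was recovered. The platform names, timestamps, and function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Mapping


def recovery_point_spread(recovery_points: Mapping[str, datetime]) -> timedelta:
    """Time between the oldest and newest recovery point across platforms."""
    if not recovery_points:
        raise ValueError("at least one recovery point is required")
    times = list(recovery_points.values())
    return max(times) - min(times)


def is_consistent(recovery_points: Mapping[str, datetime],
                  tolerance: timedelta = timedelta(0)) -> bool:
    """True if all platforms were recovered to the same point in time."""
    return recovery_point_spread(recovery_points) <= tolerance


# The situation to avoid: half the data at 8:00 a.m., half at 5:00 p.m.
points = {
    "mainframe database": datetime(2024, 1, 15, 8, 0),
    "distributed servers": datetime(2024, 1, 15, 17, 0),
}
print(recovery_point_spread(points))
print(is_consistent(points))
```

A check like this detects inconsistency only after the fact. Achieving a common recovery point in the first place depends on the replication technology and how it is managed.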
Increasing infrastructure complexity
Have you looked in your computer room recently? If you have, you probably found that your mainframe systems are only a small part of the equipment in that room. How confident are you that all those other platforms can be recovered? And if they can be recovered, will it be to the same point in time as your mainframe systems? And how long will that recovery take?
Figure 1-1 shows a typical IT infrastructure. If you have a disaster and recover the mainframe systems, will you be able to recover your service without all the other components that sit between the user and those systems? It is important to remember why you want your applications to be available: so that users can access them.
Therefore, your IT resilience solution must do more than address the non-mainframe parts of your infrastructure. It must also ensure that their recovery is integrated with the mainframe plan.
Figure 1-1 Typical IT infrastructure
Outage types
In the early days of computer data processing, planned outages were relatively easy to schedule. Most of the users of your systems were within your company, so the impact on system availability could be communicated to all users in advance of the outage. Examples of planned outages are software or hardware upgrades that require the system to be brought down. These outages can take minutes or even hours.
Most outages are planned, and even among unplanned outages, most are not disasters. However, in the current business world of 24x7 Internet presence and web-based services shared across and also between enterprises, even planned outages can be a serious disruption to your business.
Unplanned outages are unexpected events. Examples of unplanned outages are software or hardware failures. Although some of these outages can be recovered from quickly, others might be considered a disaster.
You will undoubtedly have both planned and unplanned outages while running your organization, and your business resiliency processes must cater to both types. You will likely find, however, that coordinated efforts to reduce the number and impact of unplanned outages are often complementary to doing the same for planned outages.
Later in this book we discuss the technologies available to you to make your organization more resilient to outages, and perhaps avoid them altogether.
1.4 Characteristics of an IT resilience solution
As the previous sections demonstrate, IT resilience encompasses much more than the ability to get your applications up and running after a disaster with “some” amount of data loss, and after “some” amount of time.
When investigating an IT resilience solution, keep in mind the following points:
Support for planned system outages
Does the proposed solution provide the ability to stop a system in an orderly manner? Does it provide the ability to move a system from the production site to the backup site in a planned manner? Does it support server clustering, data sharing, and workload balancing, so the planned outage can be masked from users?
Support for planned site outages
Does the proposed solution provide the ability to move the entire production environment (systems, software subsystems, applications, and data) from the production site to the recovery site? Does it provide the ability to move production systems back and forth between production and recovery sites with minimal or no manual intervention?
Support for data that spans more than one platform
Does the solution support data from more systems than just z/OS? Does it provide data consistency across all supported platforms, or only within the data from each platform?
Support for managing the data replication environment
Does the solution provide an easy-to-use interface for monitoring and managing the data replication environment? Will it automatically react to connectivity or other failures in the overall configuration?
Support for data consistency
Does the solution provide consistency across all replicated data? Does it provide support for protecting the consistency of the second copy if it is necessary to resynchronize the primary and secondary copy?
Support for continuous application availability
Does the solution support continuous application availability? From the failure of any component? From the failure of a complete site?
Support for hardware failures
Does the solution support recovery from a hardware failure? Is the recovery disruptive (reboot or IPL again) or transparent (HyperSwap, for example)?
Support for monitoring the production environment
Does the solution provide monitoring of the production environment? Is the operator notified of a failure? Can recovery be automated?
Dynamic provisioning of resources
Does the solution have the ability to dynamically allocate resources and manage workloads? Will critical workloads continue to meet their service objectives, based on business priorities, if there is a failure?
Support for recovery across database managers
Does the solution provide recovery with consistency independent of the database manager? Does it provide data consistency across multiple database managers?
End-to-end recovery support
Does the solution cover all aspects of recovery, from protecting the data through backups or remote copy, through to automatically bringing up the systems following a disaster?
Cloned applications
Do your critical applications support data sharing and workload balancing, enabling them to run concurrently in more than one site? If so, does the solution support and use this capability?
Support for recovery from regional disasters
What distances are supported by the solution? What is the impact on response times? Does the distance required for protection from regional disasters permit a continuous application availability capability?
You then need to compare your company’s requirements in each of these categories against your existing or proposed solution for providing IT resilience.
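The comparison can be recorded as a simple gap analysis over the categories listed above. The following sketch is one possible way to do that, not a prescribed method: the category labels are shortened from the list, and the function name and the sample requirements are hypothetical.

```python
from typing import Iterable, List

# Categories taken from the checklist in this section (labels shortened).
CATEGORIES = [
    "planned system outages",
    "planned site outages",
    "data that spans more than one platform",
    "managing the data replication environment",
    "data consistency",
    "continuous application availability",
    "hardware failures",
    "monitoring the production environment",
    "dynamic provisioning of resources",
    "recovery across database managers",
    "end-to-end recovery",
    "cloned applications",
    "recovery from regional disasters",
]


def capability_gaps(required: Iterable[str],
                    provided: Iterable[str]) -> List[str]:
    """Return required categories that the solution does not cover."""
    required_set, provided_set = set(required), set(provided)
    unknown = (required_set | provided_set) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    # Preserve checklist order so that the report is easy to read.
    return [c for c in CATEGORIES
            if c in required_set and c not in provided_set]


# Hypothetical requirements and a hypothetical existing solution.
required = ["planned site outages", "data consistency",
            "recovery from regional disasters"]
provided = ["planned site outages", "data consistency", "hardware failures"]
print(capability_gaps(required, provided))
```

A yes or no per category is a simplification. In practice, each category raises several questions, and a solution might answer only some of them.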
1.5 GDPS offerings
GDPS is a collection of several offerings, each addressing a different set of IT resiliency goals that can be tailored to meet the RPO and RTO for your business. Each offering uses a combination of server and storage hardware or software-based replication and automation and clustering software technologies, many of which are described in more detail in Chapter 2, “Infrastructure planning for availability and GDPS” on page 13.
In addition to the infrastructure that makes up a given GDPS solution, IBM also includes services, particularly for the first installation of GDPS and optionally for subsequent installations to ensure that the solution meets and fulfills your business objectives.
The following list briefly describes each offering, with a view of which IT resiliency objectives it is intended to address. Extra details are included in separate chapters of this book:
GDPS Metro
A near-CA and DR solution across two sites separated by metropolitan distances. The solution is based on the IBM Metro Mirror synchronous disk mirroring technology.
GDPS Metro HyperSwap Manager
A near-CA solution for a single site or entry-level DR solution that is across two sites separated by metropolitan distances. The solution is based on the same mirroring technology as GDPS Metro, but does not include much of the system automation capability that makes GDPS Metro a more complete DR solution.
IBM GDPS Virtual Appliance
A near-CA and DR solution across two sites that are separated by metropolitan distances. The solution is based on the IBM Metro Mirror synchronous disk mirroring technology. The solution provides near-CA and DR protection for IBM z/VM and Linux on IBM Z in environments that do not have IBM z/OS operating systems.
GDPS Global - XRC (also known as GDPS XRC)
A DR solution across two regions that are separated by virtually unlimited distance. The solution is based on the IBM Extended Remote Copy (XRC) asynchronous disk mirroring technology (also called IBM z/OS Global Mirror).
GDPS Global - GM (also known as GDPS GM)
A DR solution across two regions that are separated by virtually unlimited distance. The solution is based on the IBM System Storage Global Mirror technology, which is a disk subsystem-based asynchronous form of remote copy.
GDPS Metro Global - GM (also known as GDPS MGM)
A 3-site or a symmetrical 4-site configuration is supported:
 – GDPS MGM 3-site
A 3-site solution that provides CA across two sites within metropolitan distances in one region and DR to a third site, in a second region, at virtually unlimited distances. It is based on a combination of the Metro Mirror and Global Mirror technologies.
 – GDPS MGM 4-site
A symmetrical 4-site solution that is similar to the 3-site solution in that it provides CA within region and DR cross region. In addition, in the 4-site solution, the two regions are configured symmetrically so that the same levels of CA and DR protection are provided, no matter which region production runs in.
GDPS Metro Global - XRC (also known as GDPS MzGM)
 – GDPS MzGM 3-site
A 3-site solution that provides CA across two sites within metropolitan distances in one region and DR to a third site in a second region at virtually unlimited distances. It is based on a combination of the Metro Mirror and XRC technologies.
 – GDPS MzGM 4-site
A symmetrical 4-site solution that is similar to the 3-site solution in that it provides CA within region and DR cross region. In addition, in the 4-site solution, the two regions are configured symmetrically so that the same levels of CA and DR protection are provided, no matter which region production runs in.
GDPS Continuous Availability
A multisite CA/DR solution at virtually unlimited distances. This solution is based on software-based asynchronous mirroring between two active production sysplexes running the same applications with the ability to process workloads in either site.
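The preceding list can be summarized as a mapping from each offering to the replication technology that it is based on. The following sketch only restates the descriptions above in a form that can be queried. The data structure and function names are hypothetical, and the mapping is not a substitute for the detailed chapters on each offering.

```python
# Replication technologies per offering, as described in the list above.
OFFERING_REPLICATION = {
    "GDPS Metro": {"Metro Mirror"},
    "GDPS Metro HyperSwap Manager": {"Metro Mirror"},
    "IBM GDPS Virtual Appliance": {"Metro Mirror"},
    "GDPS XRC": {"XRC"},
    "GDPS GM": {"Global Mirror"},
    "GDPS MGM": {"Metro Mirror", "Global Mirror"},
    "GDPS MzGM": {"Metro Mirror", "XRC"},
    "GDPS Continuous Availability": {"software replication"},
}

# Metro Mirror is the synchronous technology in the list. XRC, Global
# Mirror, and the software-based replication are described as asynchronous.
SYNCHRONOUS = {"Metro Mirror"}


def offerings_using(technology: str) -> list:
    """Return the offerings (sorted by name) based on a given technology."""
    return sorted(name for name, techs in OFFERING_REPLICATION.items()
                  if technology in techs)


def has_synchronous_copy(offering: str) -> bool:
    """True if the offering includes a synchronously mirrored copy."""
    return bool(OFFERING_REPLICATION[offering] & SYNCHRONOUS)


print(offerings_using("XRC"))
print(has_synchronous_copy("GDPS MGM"))
print(has_synchronous_copy("GDPS GM"))
```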
As mentioned briefly at the beginning of this section, each of these offerings provides the following benefits:
GDPS automation code
This code has been developed and enhanced over several years to use new hardware and software capabilities, to reflect preferred practices based on IBM experience with GDPS clients since the inception of GDPS in 1998, and to address the constantly changing requirements of our clients.
Can use underlying hardware and software capabilities
IBM software and hardware products have support to surface problems that can affect the availability of those components, and to facilitate repair actions.
Services
There is perhaps only one factor in common across all GDPS implementations, namely that each has a unique requirement or attribute that makes it different from every other implementation. The services aspect of each offering provides you with invaluable access to experienced GDPS practitioners.
The amount of service included depends on the scope of the offering. For example, more function-rich offerings, such as GDPS Metro, include a larger services component than GDPS Metro HyperSwap Manager.
 
Note: Detailed information about each of the offerings is provided in the following chapters. It is not necessary to read all chapters if you are interested only in a specific offering. If you do read all of the chapters, you might notice that some information is repeated in multiple chapters.
1.6 Automation and disk replication compatibility
The GDPS automation code relies on the runtime capabilities of IBM Z NetView and IBM System Automation. Although these products provide tremendous first-level automation capabilities in and of themselves, there are alternative solutions you might already have from other vendors.
GDPS continues to deliver features and functions that take advantage of properties unique to the IBM Tivoli products (such as support for alert management through IBM System Automation for Integrated Operations Management), but Z NetView and IBM System Automation also work well alongside other first-level automation solutions. Therefore, although benefits exist to having a comprehensive solution from IBM, you do not have to replace your current automation investments before moving forward with a GDPS solution.
Most of the GDPS solutions rely on the IBM developed disk replication technologies¹ of Metro Mirror for GDPS Metro, XRC for GDPS XRC, and Global Mirror for GDPS GM. These architectures are implemented on IBM enterprise disk storage products. Also, the external interfaces for all of these disk replication technologies (Metro Mirror, XRC, GM, and FlashCopy) were licensed by many major enterprise storage vendors.
This gives clients the flexibility to select the disk subsystems that best match their requirements and to mix and match disk subsystems from different storage vendors within the context of a single GDPS solution. Although most GDPS installations do rely on IBM storage products, there are several production installations of GDPS around the world that rely on storage products from other vendors.
IBM has a GDPS Qualification Program for other enterprise storage vendors to validate that their implementation of the advanced copy services architecture meets the GDPS requirements.
The GDPS Qualification Program offers the following arrangement to vendors:
IBM provides the system environment.
Vendors install their disk in this environment.
Testing is conducted jointly.
A qualification report is produced jointly, describing details of what was tested and results.
Recognize that this qualification program does not imply that IBM provides defect or troubleshooting support for a qualified vendor’s products. It does, however, indicate at least a point-in-time validation that the products are functionally compatible and demonstrates that they work in a GDPS solution.
Check directly with non-IBM storage vendors if you are considering using their products with a GDPS solution because they can share their own approaches and capability to support the specific GDPS offering you are interested in.
1.7 Summary
At this point we have discussed why it is important to have an IT resilience solution, and have provided information about key objectives to consider when developing your own solution. We have also introduced the GDPS family of offerings with a brief description of which objectives of IT resiliency each offering is intended to address.
In Chapter 2, “Infrastructure planning for availability and GDPS” on page 13 we introduce key infrastructure technologies related to IT resilience focused on the mainframe platform. After that, we describe how the various GDPS offerings use those technologies. And finally, we position the various GDPS offerings against typical business scenarios and requirements.
We intend to update this book as new GDPS capabilities are delivered.

1 Disk replication technology is independent of the GDPS Continuous Availability solution, which uses software replication.