Concepts and overview of the IBM PowerHA SystemMirror 7.1.2 Enterprise Edition
In this book we discuss the new features of PowerHA SystemMirror 7.1.2 Enterprise Edition, describe the differences from previous versions, and provide example scenarios to help you become familiar with the product and exploit it.
In this chapter we cover the following topics:
1.1 High availability and disaster recovery
Data center and service availability are among the most important topics for IT infrastructure, and they draw more attention every day. Not only natural disasters affect normal operations; human errors and terrorist acts can also disrupt business continuity, and even a fully redundant infrastructure remains vulnerable to such disasters.
Replication of data between sites is a good way to minimize business disruption, because restoring from backups can take too long to meet business requirements, and depending on the extent of the disaster, the equipment needed to restore the data may itself be damaged or unavailable. Recovery options typically range in cost from the least expensive, with a longer recovery time, to the most expensive, providing the shortest recovery time and the closest approach to zero data loss. A fully manual failover normally requires many specialists to coordinate and perform all the steps needed to bring the services up at another site, and even with a good disaster recovery plan it can take longer than the business requirements allow. High availability software is intended to minimize the downtime of services by automating recovery actions when failures are detected in the various elements of the infrastructure.
PowerHA SystemMirror 7.1.2 Standard Edition helps you provide high availability by automating recovery from node failures and application events. Even with this edition it is possible to build a two-site cluster using LVM cross-site mirroring as a solution; this was tested and documented in Chapter 5, “Cross-site LVM mirroring with IBM PowerHA SystemMirror 7.1.2 Standard Edition” on page 151. The PowerHA Enterprise Edition, in addition, helps you automate recovery actions for failures of selected storage subsystems, controls the storage replication between sites, and enables recovery from entire site failures while ensuring that the copies are in a consistent state for the failover, thus enabling you to build a disaster recovery solution.
If you already have PowerHA SystemMirror 6.1 Enterprise Edition and have implemented it as part of your disaster recovery solution, refer to Chapter 2, “Differences between IBM PowerHA SystemMirror Enterprise Edition version 6.1 and version 7.1.2” on page 15 for the differences between the versions. If you are planning to migrate PowerHA SystemMirror Enterprise Edition, we show three different migration scenarios (snapshot, offline, and rolling) for bringing your current version to 7.1.2. These scenarios guide you through migrating your existing cluster to the latest level using the option that best suits your business requirements. All the testing scenarios are documented in Chapter 8, “Migrating to PowerHA SystemMirror 7.1.2 Enterprise Edition” on page 329.
For both the PowerHA Standard and Enterprise Editions, the IBM Systems Director server can manage clusters through its integrated GUI; you simply install the PowerHA plug-in, which has been enhanced to support the disaster recovery enablement features added in PowerHA SystemMirror 7.1.2 Enterprise Edition.
IBM Systems Director with the PowerHA plug-in gives you the ability to:
Discover the existing PowerHA SystemMirror clusters.
Collect information and a variety of reports about the state and configuration of applications and clusters.
Receive live and dynamic status updates for clusters, sites, nodes, and resource groups.
Use single sign-on capability, which allows full access to all clusters with only one user ID and password.
Access and search log files to display a summary page that you can view for the overall status of all known clusters and resource groups.
Create clusters and add resource groups with wizards.
Apply updates to the PowerHA SystemMirror Agent using the Director Update Manager.
Chapter 9, “PowerHA 7.1.2 for IBM Systems Director plug-in enhancements” on page 379 discusses how to install and use the IBM Systems Director PowerHA plug-in to manage a cluster.
1.1.1 Disaster recovery considerations
The idea of a fast failover in the event of a problem, or the recovery time objective (RTO), is important but should not be the only area of focus. Ultimately, the consistency of the data and whether the solution meets the recovery point objective (RPO) are what make the design worth the investment. You should not enter a disaster recovery planning session and expect to truly achieve the five nines of availability by solely implementing a clustering solution. Table 1-1 outlines the calculations for the uptime criteria.
Table 1-1 Five nines of availability

  Availability    Uptime     Maximum downtime per year
  Five nines      99.999%    5 minutes 15 seconds
  Four nines      99.99%     52 minutes 33 seconds
  Three nines     99.9%      8 hours 46 minutes
  Two nines       99.0%      87 hours 36 minutes
  One nine        90.0%      36 days 12 hours
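The downtime figures follow directly from the uptime percentage. For a 365-day year of 525,600 minutes:

  maximum downtime per year = (1 - uptime) x 525,600 minutes

For example, five nines gives (1 - 0.99999) x 525,600 = 5.26 minutes, or about 5 minutes 15 seconds, and three nines gives (1 - 0.999) x 525,600 = 525.6 minutes, or 8 hours 46 minutes.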
There are some considerations when planning a disaster recovery solution to achieve an accurate RTO. For example, is the time for planned maintenance accounted for? If so, is that time deducted from the maximum downtime figures for the year? While PowerHA has the ability to update the cluster nondisruptively, you have to consider the impact of other service interruptions in the environment, such as upgrades involving the applications, the AIX operating system, and the system firmware, which often require the services to go offline for a certain amount of time.
The IBM PowerHA SystemMirror 7.1.2 Enterprise Edition solution can provide a valuable proposition for reliably orchestrating the acquisition and release of cluster resources from one site to another. It can also provide quick failover in the event of an outage or natural disaster. Figure 1-1 on page 6 shows the various tiers of disaster recovery solutions and how the PowerHA Enterprise Edition is considered a tier 7 recovery solution. Solutions in the lower tiers can all be used to back up data and move it to a remote location, but they lack the automation that the PowerHA Enterprise Edition provides. Looking along the recovery time axis, you can see how an RTO of under four hours can be achieved by implementing an automated multisite clustering solution.
Figure 1-1 Tiers of disaster recovery solutions - IBM PowerHA SystemMirror 7.1.2 Enterprise Edition
Table 1-2 describes the tiers of disaster recovery in more detail and outlines the Power Systems solutions available in each tier.
Table 1-2 Disaster recovery tiers

Tier 7: Zero data loss; recovery up to application restart in minutes.
  Power Systems solutions: Cross-site LVM, GLVM, Metro Mirror, Global Mirror, HyperSwap®
  100% recovery of application data possible? Yes, minus data in transit at the time of the disaster
  Automatic detection of site failure? Yes; failover and failback are automated
  Facility locations supported: All
  (PowerHA Standard Edition suffices for a campus-style DR solution.)

Tier 6: Two-site two-phase commit; recovery time varies from minutes to hours.
  Power Systems solutions: Oracle or DB2® log shipping to a remote standby database; Oracle or DB2 active data replication to a remote database; DB2 HADR solution
  100% recovery of application data possible? No for log shipping (active log data is not included); Yes for active data replication and for HADR
  Automatic detection of site failure? No
  Facility locations supported: All

Tier 5: Continuous electronic vaulting of backup data between active sites; active data management at each site is provided.
  Power Systems solutions: TSM with copy pool duplexing between sites, and active TSM servers at each site
  100% recovery of application data possible? No; recovery takes days or weeks and data must be restored from backups
  Automatic detection of site failure? No
  Facility locations supported: All

Tier 4: Electronic vaulting of critical backup data to a hot site; the hot site is not activated until a disaster occurs.
  Power Systems solutions: TSM with copy pool duplexing to the hot site, and a TSM server at the active site only
  100% recovery of application data possible? No; recovery takes days or weeks
  Automatic detection of site failure? No
  Facility locations supported: N/A

Tier 3: Off-site vaulting with a hot site; backup data is transported to the hot site manually. The hot site is staffed and equipped but not active until a disaster occurs.
  Power Systems solutions: TSM with multiple local storage pools on disk and tape at the active site
  100% recovery of application data possible? No; recovery takes days or weeks
  Automatic detection of site failure? No
  Facility locations supported: N/A

Tier 2: Off-site vaulting of backup data by courier; a third-party vendor collects the data at regular intervals and stores it in its facility. When a disaster occurs, a hot site must be prepared and the backup data must be transported there.
  Power Systems solutions: TSM with multiple local storage pools on disk and tape at the active site
  100% recovery of application data possible? No; recovery takes days or weeks
  Automatic detection of site failure? No
  Facility locations supported: N/A

Tier 1: Pickup Truck Access Method with tape.
  Power Systems solutions: Tape-backup based solution
  100% recovery of application data possible? No
  Automatic detection of site failure? No
  Facility locations supported: No

Tier 0: No disaster recovery plan or protection.
  Power Systems solutions: A local backup solution may be in place, but there is no offsite data storage
  100% recovery of application data possible? No; there is no DR recovery, and the site and data are lost
  Automatic detection of site failure? N/A
  Facility locations supported: N/A
High availability and disaster recovery is a balance between recovery time requirements and cost. Various external studies estimate the financial losses incurred by downtime resulting from service disruptions and unexpected outages. Thus, key decisions must be made to determine which parts of the business are most important and must remain online for business operations to continue.
Beyond the need for secondary servers, storage, and infrastructure to support the replication bandwidth between two sites, there are items that can easily be overlooked, such as:
Where does the staff go in the event of a disaster?
What if the technical staff managing the environment is unavailable?
Are there facilities to accommodate the remaining staff, including desks, phones, printers, desktop PCs, and so on?
Is there a documented disaster recovery plan that can be followed by non-technical staff if necessary?
Chapter 3, “Planning” on page 29 describes the considerations that give you a better understanding of the infrastructure needs, and explains how to plan the implementation of PowerHA to improve availability.
1.2 Local failover, campus style, and extended distance clusters
Clients implementing local PowerHA clusters for high availability often assume that failover between a pair of machines in the same server room is the same as failover between machines located at dispersed sites. Although from a cluster standpoint the graceful release and reacquire functions are effectively the same, a scenario with remote sites always carries a higher risk of data loss in the event of a hard failure, and the failover is likely to take longer than a local one because more steps are needed, for example, managing the storage replication. Thus, it is often more appropriate to use a local high availability cluster as the building block for a disaster recovery solution, where two local nodes are paired within a site and at least one remote node is added to the cluster as part of a second site definition.
If a client's service level agreements (SLAs) require a highly available environment even after a site failure, the client needs a cluster with four nodes, two at each site. This is the configuration that we used in most of the testing environments discussed in this book.
For failover between two sites, there are two different situations for accomplishing the replication needs:
The sites can be near enough to have shared LUNs in the same Storage Area Network (SAN), as seen in Figure 1-2 on page 9.
Figure 1-2 Extended SAN between sites
These environments often present a variety of options for configuring the cluster, including a cross-site LVM mirrored configuration (a short command sketch follows Figure 1-3), disk-based Metro Mirror, or SAN Volume Controller (SVC) VDisk mirroring with a split I/O group between the two sites, which is available since the SVC firmware 5.1 release. Being inherently synchronous, all of these solutions experience minimal to zero data loss, very similar to a local cluster sharing LUNs from the same storage subsystem. Asynchronous replication technologies such as Global Mirror can also be used in this kind of stretched SAN structure if performance with synchronous replication is too slow for the business requirements. However, data replication between the sites cannot then be guaranteed, because the I/O completes to the application before the replication is complete.
The infrastructure can instead use the network to accomplish the storage replication needs (using GLVM, for example) when there is no SAN link between the two sites to share the same LUNs across all cluster nodes, as shown in Figure 1-3 on page 10.
Figure 1-3 One different SAN in each site
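As a brief illustration of the first situation, a cross-site LVM mirrored configuration keeps one LVM copy of each logical volume on disks from each site. This minimal sketch assumes an existing volume group datavg built on local disks, with hdisk4 as a hypothetical LUN presented from the remote site's storage:

  # Add the remote-site disk to the volume group
  extendvg datavg hdisk4
  # Create a second copy of all logical volumes on the remote-site disk
  mirrorvg datavg hdisk4
  # Confirm that each logical volume now has two copies
  lsvg -l datavg

In practice, mirror pools and strict allocation policies are used to ensure that the two copies never land at the same site; Chapter 5 documents the tested procedure.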
PowerHA SystemMirror 7.1.2 Enterprise Edition uses a CAA repository disk (more details about repository disks can be found in 2.1.1, “Cluster Aware AIX” on page 16). Because the infrastructure may be unable to provide the same LUN across all nodes, a new concept was created to accommodate these two scenarios: when the same repository disk can be shared across all nodes in the cluster, it is a stretched cluster; when a different repository disk is used at each site, the cluster is a linked cluster. See section 2.2.1, “Stretched and linked clusters” on page 19 for more information.
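You can verify which disk CAA is using as the repository, and how the nodes see each other, with the lscluster command. The cluster type is chosen when the cluster is defined; the clmgr line below is a minimal sketch in which the cluster, node, and disk names are placeholders, and the TYPE and REPOSITORY attribute names should be verified against the clmgr help in your environment:

  # Show all disks known to CAA, including the repository disk
  lscluster -d
  # Show the cluster nodes and their current state
  lscluster -m
  # Define a stretched cluster (TYPE=SC) sharing one repository disk;
  # a linked cluster (TYPE=LC) instead uses one repository disk per site
  clmgr add cluster xsite_cluster NODES=nodeA1,nodeA2,nodeB1,nodeB2 TYPE=SC REPOSITORY=hdisk2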
For stretched clusters, the repository disk also serves as a heartbeat communication path. In linked clusters this is not possible between the sites, but PowerHA SystemMirror 7.1.2 Enterprise Edition introduces a new feature that you can use to automate decisions in split and merge situations, with the possibility of using a SCSI-3 reservation tie breaker disk. For additional information about this topic, refer to 2.2.2, “Split/merge handling overview” on page 21 and Chapter 10, “Cluster partition management” on page 435.
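As a sketch of how the tie breaker is configured, the split and merge policies are cluster-wide settings managed through clmgr. The attribute names below reflect this release's split/merge support as we understand it, and hdisk9 is a placeholder for a SCSI-3 capable disk reachable from both sites; confirm the details in Chapter 10 before applying this to a live cluster:

  # Use a SCSI-3 reservation on a shared disk to break ties
  # when the sites split and later merge
  clmgr modify cluster SPLIT_POLICY=tiebreaker MERGE_POLICY=tiebreaker TIEBREAKER=hdisk9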
The PowerHA cluster nodes can reside on different machine types and server classes as long as you maintain common AIX and cluster file set levels. This can be very valuable in the scenario where you acquire newer machines and leverage the older hardware to serve as the failover targets. In a campus environment, creating this stretched cluster could serve two roles:
Providing high availability
Providing a recovery site that contains a current second copy of the data
Historically, the main limiting factor for implementing disaster recovery (DR) solutions has been the cost. Today, as more and more clients reap the benefits of geographically dispersing their data, two of the more apparent inhibitors are:
The risks associated with a fully automated clustered solution
The associated value-add of wrapping the cluster software around the replication
A multisite clustered environment is designed so that, by default, if the primary site stops responding, the secondary site terminates the replication relationship and activates the resources at the backup location. As a best practice, plan for multiple redundant heartbeat links between the sites to avoid a cluster partition. Assuming that the cluster design has redundant links between the sites, which minimizes the risk of a false down, we can move on to the value-add that PowerHA SystemMirror 7.1.2 Enterprise Edition brings.
Many clients use the various disk and IP replication technologies without leveraging the integration within the PowerHA Enterprise Edition, only to find that they are left with more to manage on their own. For example, you can control remote physical volumes (RPVs) in the Geographic Logical Volume Manager (GLVM) manually, but even after scripting the commands to activate the disks in the proper order, there are various cluster scenarios where you would need to append extra logic to achieve the desired results. This is partly why the GLVM file sets are included in AIX: the concept of try and buy. Once you have identified that the replication technology meets your needs and have sized the data links appropriately, you quickly realize that the management is significantly simpler when you exploit the integration within the PowerHA Enterprise Edition.
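To illustrate what that integration saves you, the following minimal sketch shows the kind of manual GLVM steps that the Enterprise Edition sequences for you during a failover. The device and volume group names (rpvserver0, hdisk8, datavg) are hypothetical; rpvstat and gmvgstat are the GLVM monitoring utilities:

  # On the node that physically owns the disks, make the RPV server available
  mkdev -l rpvserver0
  # On the remote node that accesses those disks over IP, activate the RPV client
  mkdev -l hdisk8
  # Vary on the geographically mirrored volume group on the node running the application
  varyonvg datavg
  # Monitor remote physical volume I/O and the geographically mirrored VG state
  rpvstat
  gmvgstat

Each of these steps has ordering dependencies and error cases, and various cluster scenarios require extra logic; that is exactly what the PowerHA Enterprise Edition integration manages for you.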
One of the major benefits that the PowerHA Enterprise Edition brings is that it has been comprehensively tested, not just with the basic failover and failback scenarios, but also with cluster-inherent mechanisms such as rg_move and selective failover. In addition, the PowerHA Enterprise Edition automatically reverses the replication flow and restarts the replication after the original site is restored. The integrated cluster clverify functions are also intended to help identify and correct any configuration errors. The cluster event logging is appended to the existing PowerHA logs, and the nightly verification checks identify whether any changes have occurred in the configuration. The replicated resource architecture in the Enterprise Edition allows finer control over the status of the resources. Through features such as application monitoring or the pager notification methods, you can receive updates any time a critical cluster event occurs.
Enabling full integration also facilitates gracefully moving resources and testing your scripts. By simply moving the resources from one site to the other, you can test the application stop and start scripts and ensure that everything is in working order. Leveraging some of the more granular options within the cluster, such as resource group location dependencies, can also facilitate the de-staging of lower priority test or development resources at the failover location whenever a production site failure has taken place. Using the site-specific dependencies, you can also specify that a set of resource groups always coexists within the same site. Another benefit of the PowerHA Enterprise Edition is the integrated logic that automatically passes instructions to the disk storage subsystems, based on the various events detected by the cluster; this logic is built into the code and tested in many different ways.
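For example, a graceful site move is a one-line operation once the cluster is synchronized. In this sketch, app_rg and siteB are placeholder names, and the SITE attribute of clmgr move applies to multisite clusters:

  # Verify and synchronize the cluster definition first
  clmgr sync cluster
  # Move a resource group to the other site to rehearse a controlled site failover
  clmgr move resource_group app_rg SITE=siteB
  # Confirm where each resource group is now online
  clRGinfo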
1.3 Storage replication and the new HyperSwap feature
One of the major concerns when considering a replication solution is whether it maintains the integrity of the data after a hard site failover. The truth is that all of the solutions do a very good job, but they are not bulletproof. One important concept to keep in mind is garbage in, garbage out (GIGO): database corruption at the source site is replicated to the target copy. This is the main reason that the DR design does not stop after selecting a replication technology.
Replicating the data addresses only one problem. In a well-designed disaster recovery solution, a backup and recovery plan must also exist. Tape backups, snapshots, and FlashCopy images are still an integral part of an effective backup and recovery solution. For a thorough design, the frequency of these backups at both the primary and remote locations should also be considered.
 
Tip: An effective backup and recovery strategy should leverage a combination of tape and point-in-time disk copies to protect against unexpected data corruption. Restore is just as important: regular restore tests need to be performed to guarantee that the disaster recovery solution is viable.
IBM PowerHA has been providing clustering and data replication functionality for a number of years, and it continues to strive to be the solution of choice for IBM clients running on AIX. The software tightly integrates with a variety of existing disk replication technologies in the IBM portfolio as well as with third-party technologies. Support for additional third-party replication technologies is continually being tested and planned for future software releases or via service flashes.
There are two types of storage replication: synchronous and asynchronous. Synchronous replication considers an I/O complete only after the write is done on both storage subsystems. Only synchronous replication can guarantee that 100% of transactions are correctly replicated to the other site, but because this can add a considerable amount of I/O time, the distance between the sites must be considered for performance reasons. This is the main reason that asynchronous replication is used between very distant sites or with I/O-sensitive applications.
In synchronous mirroring, both the local and remote copies must be committed to their respective subsystems before the acknowledgement is returned to the application. In contrast, asynchronous transmission mode allows the data replication at the secondary site to be decoupled so that primary site application response time is not impacted. Asynchronous transmission is commonly selected with the exposure that the secondary site’s version of the data may be out of sync with the primary site by a few minutes or more. This lag represents data that is unrecoverable in the event of a disaster at the primary site. The remote copy can lag behind in its updates. If a disaster strikes, it might never receive all of the updates that were committed to the original copy.
Although every environment differs, more contention and disk latency are introduced the farther apart the sites reside. However, there are no hard rules dictating whether you need to replicate synchronously or asynchronously, and it is difficult to provide an exact distance baseline delineating synchronous versus asynchronous replication.
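One useful rule of thumb: light propagates through optical fiber at roughly 200,000 km per second, so each 100 km of site separation adds about 1 ms to a synchronous write (0.5 ms in each direction for the round trip), before any switch or protocol overhead. A write-intensive database issuing thousands of dependent writes per second can therefore see a noticeable slowdown at distances that a read-mostly workload would barely notice.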
Some clients are replicating synchronously between sites that are hundreds of miles apart, and the configuration suits them quite well. This is largely due to the fact that their environments are mostly read intensive and writes only occur sporadically a few times a day, so that the impact of the application response time due to write activity is minimal. Hence, factors such as the application read and write tendencies should be considered along with the current system utilization.
If synchronous replication is suitable for your environment, consider a new feature added in this PowerHA SystemMirror 7.1.2 Enterprise Edition release: HyperSwap. It provides a virtualization layer above replicated storage devices that enables fast takeover and continuous availability in the face of storage failures. See 2.2.3, “HyperSwap overview” on page 21 and Chapter 4, “Implementing DS8800 HyperSwap” on page 59 for more information about the new PowerHA SystemMirror 7.1.2 Enterprise Edition HyperSwap feature.
You can see that the differences between local high availability (HA) and DR revolve around the distance between the sites and the ability, or inability (in which case a linked cluster is required), to extend the storage area network (SAN). A local failover provides a faster transition onto another machine than a failover to a geographically dispersed site. With synchronous replication between two sites, HyperSwap can mask the failure of one of the storage subsystems from the application, so that no site failover is required when only a storage failure has occurred.
In environments requiring the replication of data over greater distances, where asynchronous disk-based replication might be a better fit, there is a greater exposure to data loss: there may be a larger delta between the data in the source and target copies. Also, the nature of that kind of setup results in the need for a failover if the primary storage subsystem goes offline.
For local or stretched clusters, licensing of the IBM PowerHA SystemMirror Standard Edition typically suffices, with the exception of the HyperSwap feature and synchronous or asynchronous disk-level mirroring configurations that could benefit from the additional integrated logic provided with the IBM PowerHA SystemMirror 7.1.2 Enterprise Edition solution. The additional embedded logic would provide automation to the management of the role reversal of the source and target copies in the event of a failover. Local clusters, assuming that virtualized resources are being used, can also benefit from advanced functions such as IBM PowerVM Live Partition Mobility between machines at the same site. This combination of the IBM PowerVM functions and IBM PowerHA SystemMirror clustering is useful for helping to avoid any service interruption for a planned maintenance event while protecting the environment in the event of an unforeseen outage.