Copy Services
Copy Services are a collection of functions that provide capabilities for disaster recovery, data migration, and data duplication solutions. This chapter provides an overview of the IBM Spectrum Virtualize and Storwize family copy services capabilities, including FlashCopy, Metro Mirror and Global Mirror, and Volume Mirroring, along with preferred practices for their use.
This chapter includes the following sections:
Introduction to copy services
FlashCopy
Remote Copy
IP Replication
Volume Mirroring
5.1 Introduction to copy services
IBM Spectrum Virtualize and Storwize family products offer a complete set of copy services functions that provide capabilities for Disaster Recovery, Business Continuity, data movement, and data duplication solutions.
5.1.1 FlashCopy
FlashCopy is a function that allows you to create a point-in-time copy of one of your volumes. This function might be helpful when performing backups or application testing. These copies can be cascaded on one another, read from, written to, and even reversed.
If needed, these copies can conserve storage by being space-efficient copies that record only the changes from the originals rather than full copies.
5.1.2 Metro Mirror and Global Mirror
Metro Mirror and Global Mirror are technologies that enable you to keep a real-time copy of a volume at a remote site that contains another IBM Spectrum Virtualize or Storwize system.
Metro Mirror makes synchronous copies, which means that the original writes are not considered complete until the write to the destination disk has been confirmed. The distance between your two sites is usually determined by how much latency your applications can handle.
Global Mirror makes asynchronous copies of your disk, which means that the write is considered complete after it is complete at the local disk. It does not wait for the write to be confirmed at the remote system as Metro Mirror does. This behavior greatly reduces the latency experienced by your applications if the other system is far away. However, it also means that during a failure, the remote copy might not have the most recent changes that were committed to the local disk.
5.1.3 Global Mirror with Change Volumes
This function (also known as Cycle-Mode Global Mirror), introduced in V6.3, can best be described as “Continuous Remote FlashCopy.” If you use this feature, the system takes periodic FlashCopies of a disk and writes them to your remote destination.
This feature completely isolates the local copy from wide area network (WAN) issues and from sudden spikes in workload that might occur. The drawback is that your remote copy might lag behind the original by a significant amount, depending on how you have set up the cycle time.
5.1.4 Volume Mirroring function
Volume Mirroring is a function that is designed to increase the availability of the storage infrastructure. It provides the ability to create up to two local copies of a volume. Volume Mirroring can use space from two Storage Pools, preferably from two separate back-end disk subsystems.
Primarily, you use this function to insulate hosts from the failure of a Storage Pool and also from the failure of a back-end disk subsystem. During a Storage Pool failure, the system continues to provide service for the volume from the other copy on the other Storage Pool, with no disruption to the host.
You can also use Volume Mirroring to migrate from a thin-provisioned volume to a non-thin-provisioned volume, and to migrate data between Storage Pools of different extent sizes.
5.2 FlashCopy
By using the IBM FlashCopy function of the IBM Spectrum Virtualize and Storwize systems, you can perform a point-in-time copy of one or more volumes. This section describes the inner workings of FlashCopy, and provides some preferred practices for its use.
You can use FlashCopy to help you solve critical and challenging business needs that require duplication of data of your source volume. Volumes can remain online and active while you create consistent copies of the data sets. Because the copy is performed at the block level, it operates below the host operating system and its cache. Therefore, the copy is not apparent to the host.
 
Important: Because FlashCopy operates at the block level, below the host operating system and cache, those levels must be flushed to produce consistent FlashCopies.
While the FlashCopy operation is performed, the source volume is stopped briefly to initialize the FlashCopy bitmap, and then input/output (I/O) can resume. Although several FlashCopy options require the data to be copied from the source to the target in the background, which can take time to complete, the resulting data on the target volume is presented so that the copy appears to complete immediately.
This process is performed by using a bitmap (or bit array) that tracks changes to the data after the FlashCopy is started, and an indirection layer that enables data to be read from the source volume transparently.
5.2.1 FlashCopy use cases
When you are deciding whether FlashCopy addresses your needs, you must adopt a combined business and technical view of the problems that you want to solve. First, determine the needs from a business perspective. Then, determine whether FlashCopy can address the technical needs of those business requirements.
The business applications for FlashCopy are wide-ranging. In the following sections, a short description of the most common use cases is provided.
Backup improvements with FlashCopy
FlashCopy does not reduce the time that it takes to perform a backup to traditional backup infrastructure. However, it can be used to minimize and, under certain conditions, eliminate application downtime that is associated with performing backups. FlashCopy can also move the resource usage of performing intensive backups away from production systems.
After the FlashCopy is performed, the resulting image of the data can be backed up to tape as though it were the source system. After the copy to tape is complete, the image data is redundant and the target volumes can be discarded. For time-limited applications, such as these examples, “no copy” or incremental FlashCopy is used most often. The use of these methods puts less load on your infrastructure.
When FlashCopy is used for backup purposes, the target data usually is managed as read-only at the operating system level. This approach provides extra security by ensuring that your target data was not modified and remains true to the source.
Restore with FlashCopy
FlashCopy can perform a restore from any existing FlashCopy mapping. Therefore, you can restore (or copy) from the target to the source of your regular FlashCopy relationships. It might be easier to think of this method as reversing the direction of the FlashCopy mappings. This capability has the following benefits:
There is no need to worry about pairing mistakes because you trigger a restore.
The process appears instantaneous.
You can maintain a pristine image of your data while you are restoring what was the primary data.
This approach can be used for various applications, such as recovering your production database application after an errant batch process that caused extensive damage.
 
Preferred practices: Although restoring from a FlashCopy is quicker than a traditional tape media restore, do not use restoring from a FlashCopy as a substitute for good archiving practices. Instead, keep one to several iterations of your FlashCopies so that you can near-instantly recover your data from the most recent history. Keep your long-term archive as appropriate for your business.
In addition to the restore option, which copies the original blocks from the target volume to modified blocks on the source volume, the target can be used to perform a restore of individual files. To do that, you must make the target available on a host. Do not make the target available to the source host because seeing duplicates of disks causes problems for most host operating systems. Copy the files to the source by using the normal host data copy methods for your environment.
Moving and migrating data with FlashCopy
FlashCopy can be used to facilitate the movement or migration of data between hosts while minimizing downtime for applications. By using FlashCopy, application data can be copied from source volumes to new target volumes while applications remain online. After the volumes are fully copied and synchronized, the application can be brought down and then immediately brought back up on the new server that is accessing the new FlashCopy target volumes.
This method differs from the other migration methods, which are described later in this chapter. Common uses for this capability are host and back-end storage hardware refreshes.
Application testing with FlashCopy
It is often important to test a new version of an application or operating system that is using actual production data. This testing ensures the highest quality possible for your environment. FlashCopy makes this type of testing easy to accomplish without putting the production data at risk or requiring downtime to create a constant copy.
Create a FlashCopy of your source and use that for your testing. This copy is a duplicate of your production data down to the block level so that even physical disk identifiers are copied. Therefore, it is impossible for your applications to tell the difference.
5.2.2 FlashCopy capabilities overview
FlashCopy occurs between a source volume and a target volume in the same storage system. The minimum granularity that IBM Spectrum Virtualize and Storwize systems support for FlashCopy is an entire volume. It is not possible to use FlashCopy to copy only part of a volume.
To start a FlashCopy operation, a relationship between the source and the target volume must be defined. This relationship is called FlashCopy Mapping.
FlashCopy mappings can be stand-alone or a member of a Consistency Group. You can perform the actions of preparing, starting, or stopping FlashCopy on either a stand-alone mapping or a Consistency Group.
Figure 5-1 shows the concept of FlashCopy mapping.
Figure 5-1 FlashCopy mapping
A FlashCopy mapping has a set of attributes and settings that define the characteristics and the capabilities of the FlashCopy.
These characteristics are explained more in detail in the following sections.
Background copy
The background copy rate is a property of a FlashCopy mapping that allows you to specify whether a background physical copy of the source volume to the corresponding target volume occurs. A value of 0 disables the background copy. If the FlashCopy background copy is disabled, only data that has changed on the source volume is copied to the target volume. A FlashCopy with background copy disabled is also known as a No-Copy FlashCopy.
The benefit of using a FlashCopy mapping with background copy enabled is that the target volume becomes a real clone (independent from the source volume) of the FlashCopy mapping source volume after the copy is complete. When the background copy function is not performed, the target volume remains a valid copy of the source data only while the FlashCopy mapping remains in place.
Valid values for the background copy rate are 0 - 100. The background copy rate can be defined and changed dynamically for individual FlashCopy mappings.
Table 5-1 shows the relationship of the background copy rate value to the attempted amount of data to be copied per second.
Table 5-1 Relationship between the rate and data rate per second
Value        Data copied per second
1 - 10       128 KB
11 - 20      256 KB
21 - 30      512 KB
31 - 40      1 MB
41 - 50      2 MB
51 - 60      4 MB
61 - 70      8 MB
71 - 80      16 MB
81 - 90      32 MB
91 - 100     64 MB
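For illustration, the following CLI sketch creates a mapping with a background copy rate of 50 (a 2 MBps goal, per Table 5-1), starts it, and later turns it into a no-copy mapping. The volume and mapping names are placeholders:
   mkfcmap -source vol_prod -target vol_copy -copyrate 50 -name fcmap_prod
   startfcmap -prep fcmap_prod
   chfcmap -copyrate 0 fcmap_prod
Because the copy rate is a dynamic property, chfcmap can be used at any time to throttle or suspend the background copy without stopping the mapping.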
FlashCopy Consistency Groups
Consistency Groups can be used to create a consistent point-in-time copy across multiple volumes. They are used to preserve the ordering of dependent writes that an application issues in a specific sequence.
When Consistency Groups are used, the FlashCopy commands are issued to the Consistency Group, which performs the operation on all FlashCopy mappings contained within it at the same time.
Figure 5-2 illustrates a Consistency Group consisting of two volume mappings.
Figure 5-2 Multiple volumes mapping in a Consistency Group
FlashCopy mapping considerations: If the FlashCopy mapping has been added to a Consistency Group, it can only be managed as part of the group. This limitation means that FlashCopy operations are no longer allowed on the individual FlashCopy mappings.
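For example, a database data volume and its log volume can be copied at the same point in time through a Consistency Group. The following CLI sketch is a minimal illustration in which the volume, target, and group names are placeholders:
   mkfcconsistgrp -name fccg_db
   mkfcmap -source db_data -target db_data_copy -copyrate 0 -consistgrp fccg_db
   mkfcmap -source db_log -target db_log_copy -copyrate 0 -consistgrp fccg_db
   startfcconsistgrp -prep fccg_db
Starting the group prepares and triggers both mappings together, so the two targets represent the same point in time.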
Incremental FlashCopy
Using Incremental FlashCopy, you can reduce the time that is required to complete the copy. Also, because less data must be copied, the workload put on the system and the back-end storage is reduced.
Basically, Incremental FlashCopy does not require that the entire source volume is copied every time the FlashCopy mapping is started. Only the regions that changed on the source volume are copied to the target volume, as shown in Figure 5-3.
Figure 5-3 Incremental FlashCopy
If the FlashCopy mapping was stopped before the background copy completed, then when the mapping is restarted, the data that was copied before the mapping was stopped will not be copied again. For example, if an incremental mapping reaches 10 percent progress when it is stopped and then it is restarted, that 10 percent of data will not be recopied when the mapping is restarted, assuming that it was not changed.
 
Stopping an incremental FlashCopy mapping: If you are planning to stop an incremental FlashCopy mapping, make sure that the copied data on the source volume will not be changed, if possible. Otherwise, you might have an inconsistent point-in-time copy.
A “difference” value is provided in the query of a mapping, which makes it possible to know how much data has changed since the last copy. The difference value is the percentage (0 - 100 percent) of data that has been changed and that must be copied to the target volume to get a fully independent copy of the source volume again.
An incremental FlashCopy is defined by setting the incremental attribute in the FlashCopy mapping.
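For example, an incremental mapping can be created and then periodically restarted to refresh the target. The names below are placeholders; the difference field in the detailed lsfcmap output reports the percentage of changed data:
   mkfcmap -source vol_prod -target vol_bkp -copyrate 50 -incremental -name fcmap_incr
   startfcmap -prep fcmap_incr
   lsfcmap fcmap_incr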
Multiple Target FlashCopy
In Multiple Target FlashCopy, a source volume can be used in multiple FlashCopy mappings, each with a different target volume, as shown in Figure 5-4.
Figure 5-4 Multiple Target FlashCopy
Up to 256 different mappings are possible for each source volume. These mappings are independently controllable from each other. Multiple Target FlashCopy mappings can be members of the same or different Consistency Groups. In cases where all the mappings are in the same Consistency Group, the result of starting the Consistency Group will be to FlashCopy to multiple identical target volumes.
Cascaded FlashCopy
With Cascaded FlashCopy, a volume can be the source of one FlashCopy mapping and, at the same time, the target of another FlashCopy mapping; this configuration is referred to as a Cascaded FlashCopy. This function is illustrated in Figure 5-5.
Figure 5-5 Cascaded FlashCopy
A total of 255 mappings are possible for each cascade.
Thin-provisioned FlashCopy
When a new volume is created, you can designate it as a thin-provisioned volume, and it has a virtual capacity and a real capacity.
Virtual capacity is the volume storage capacity that is available to a host. Real capacity is the storage capacity that is allocated to a volume copy from a storage pool. In a fully allocated volume, the virtual capacity and real capacity are the same. However, in a thin-provisioned volume, the virtual capacity can be much larger than the real capacity.
The virtual capacity of a thin-provisioned volume is typically larger than its real capacity. On IBM Spectrum Virtualize and Storwize systems, the real capacity is used to store data that is written to the volume, and metadata that describes the thin-provisioned configuration of the volume. As more information is written to the volume, more of the real capacity is used.
Thin-provisioned volumes can also help to simplify server administration. Instead of assigning a volume with some capacity to an application and increasing that capacity following the needs of the application if those needs change, you can configure a volume with a large virtual capacity for the application. You can then increase or shrink the real capacity as the application needs change, without disrupting the application or server.
When you configure a thin-provisioned volume, you can use the warning level attribute to generate a warning event when the used real capacity exceeds a specified amount or percentage of the total real capacity. For example, if you have a volume with 10 GB of total capacity and you set the warning to 80 percent, an event is registered in the event log when you use 80 percent of the total capacity. This technique is useful when you need to control how much of the volume is used.
If a thin-provisioned volume does not have enough real capacity for a write operation, the volume is taken offline and an error is logged (error code 1865, event ID 060001). Access to the thin-provisioned volume is restored by either increasing the real capacity of the volume or increasing the size of the storage pool on which it is allocated.
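As an illustration, the following command creates a 100 GB thin-provisioned volume with 10% initial real capacity, automatic expansion, and a warning threshold at 80%. The pool, I/O group, and volume names are placeholders:
   mkvdisk -mdiskgrp pool0 -iogrp io_grp0 -size 100 -unit gb -rsize 10% -autoexpand -warning 80% -name thin_vol01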
You can use thin-provisioned volumes for cascaded FlashCopy and Multiple Target FlashCopy, and you can mix thin-provisioned and fully allocated volumes. Thin-provisioned volumes can also be used for incremental FlashCopy, although this configuration makes sense only when both the source and the target are thin-provisioned.
Thin-provisioned incremental FlashCopy
The implementation of thin-provisioned volumes does not preclude the use of incremental FlashCopy on the same volumes. It does not make sense to have a fully allocated source volume and then use incremental FlashCopy, which is always a full copy at first, to copy this fully allocated source volume to a thin-provisioned target volume. However, this action is not prohibited.
Consider this optional configuration:
A thin-provisioned source volume can be copied incrementally by using FlashCopy to a thin-provisioned target volume. Whenever the FlashCopy is performed, only data that has been modified is recopied to the target. Note that if space is allocated on the target because of I/O to the target volume, this space will not be reclaimed with subsequent FlashCopy operations.
A fully allocated source volume can be copied incrementally using FlashCopy to another fully allocated volume at the same time as it is being copied to multiple thin-provisioned targets (taken at separate points in time). This combination allows a single full backup to be kept for recovery purposes, and separates the backup workload from the production workload. At the same time, it allows older thin-provisioned backups to be retained.
Reverse FlashCopy
Reverse FlashCopy enables FlashCopy targets to become restore points for the source without breaking the FlashCopy relationship, and without having to wait for the original copy operation to complete. It supports multiple targets (up to 256), and therefore multiple rollback points.
A key advantage of the Multiple Target Reverse FlashCopy function is that the reverse FlashCopy does not destroy the original target. This feature enables processes that are using the target, such as a tape backup, to continue uninterrupted.
IBM Spectrum Virtualize and Storwize family systems also allow you to create an optional copy of the source volume to be made before the reverse copy operation starts. This ability to restore back to the original source data can be useful for diagnostic purposes.
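In CLI terms, a restore is performed by defining a mapping in the reverse direction and starting it with the restore option. The following is a hedged sketch that assumes an existing snapshot target named vol_snap holds the wanted restore point; all names are placeholders:
   mkfcmap -source vol_snap -target vol_prod -copyrate 50 -name fcmap_restore
   startfcmap -prep -restore fcmap_restore
The restore option allows the mapping to be started even though its target volume is the source of other active mappings.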
5.2.3 FlashCopy functional overview
Understanding how FlashCopy works internally helps you configure it the way that you want and obtain more benefit from it.
FlashCopy bitmaps and grains
A bitmap is an internal data structure, stored in a particular I/O Group, that is used to track which data in a FlashCopy mapping has been copied from the source volume to the target volume. Grains are units of data grouped together to optimize the use of the bitmap. One bit in each bitmap represents the state of one grain. The FlashCopy grain size can be either 64 KB or 256 KB.
A FlashCopy bitmap occupies bitmap space in the memory of the I/O group, which must be shared with the bitmaps of other features (such as Remote Copy bitmaps, Volume Mirroring bitmaps, and RAID bitmaps).
Indirection layer
The FlashCopy indirection layer governs the I/O to the source and target volumes when a FlashCopy mapping is started. This process is done by using a FlashCopy bitmap. The purpose of the FlashCopy indirection layer is to enable both the source and target volumes for read and write I/O immediately after FlashCopy starts.
The following description illustrates how the FlashCopy indirection layer works when a FlashCopy mapping is prepared and then started.
When a FlashCopy mapping is prepared and started, the following sequence is applied:
1. Flush the write cache to the source volume or volumes that are part of a Consistency Group.
2. Put the cache into write-through mode on the source volumes.
3. Discard the cache for the target volumes.
4. Establish a sync point on all of the source volumes in the Consistency Group (creating the FlashCopy bitmap).
5. Ensure that the indirection layer governs all of the I/O to the source and target volumes.
6. Enable the cache on source volumes and target volumes.
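From the CLI, this sequence is triggered either by preparing the mapping explicitly and then starting it, or by starting it with the prepare option in a single step. The mapping name is a placeholder:
   prestartfcmap fcmap_prod
   startfcmap fcmap_prod
Or, equivalently:
   startfcmap -prep fcmap_prod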
FlashCopy provides the semantics of a point-in-time copy by using the indirection layer, which intercepts I/O that is directed at either the source or target volumes. The act of starting a FlashCopy mapping causes this indirection layer to become active in the I/O path, which occurs automatically across all FlashCopy mappings in the Consistency Group. The indirection layer then determines how each I/O is routed, based on the following factors:
The volume and the logical block address (LBA) to which the I/O is addressed
Its direction (read or write)
The state of an internal data structure, the FlashCopy bitmap
The indirection layer either allows the I/O through to the underlying volume, redirects the I/O from the target volume to the source volume, or queues the I/O while it arranges for data to be copied from the source volume to the target volume. Table 5-2 summarizes the indirection layer algorithm.
Table 5-2 Summary table of the FlashCopy indirection layer algorithm
Source volume, grain not yet copied:
 – Read: Read from the source volume.
 – Write: Copy the grain to the most recently started target for this source, then write to the source.
Source volume, grain already copied:
 – Read: Read from the source volume.
 – Write: Write to the source volume.
Target volume, grain not yet copied:
 – Read: If any newer targets exist for this source in which this grain has already been copied, read from the oldest of these targets. Otherwise, read from the source.
 – Write: Hold the write. Check the dependency target volumes to see whether the grain has been copied. If the grain is not already copied to the next oldest target for this source, copy the grain to the next oldest target. Then, write to the target.
Target volume, grain already copied:
 – Read: Read from the target volume.
 – Write: Write to the target volume.
Interaction with cache
Starting with V7.3, the cache subsystem was redesigned. Cache is now divided into upper and lower cache. Upper cache serves mostly as write cache and hides the write latency from the hosts and applications. Lower cache is a read/write cache that optimizes I/O to and from the disks. Figure 5-6 shows the IBM Spectrum Virtualize cache architecture.
Figure 5-6 New cache architecture
The FlashCopy copy-on-write process introduces significant latency into write operations. To isolate the active application from this additional latency, the FlashCopy indirection layer is placed logically between the upper and lower cache. Therefore, the additional latency that is introduced by the copy-on-write process is encountered only by the internal cache operations, and not by the application.
The logical placement of the FlashCopy indirection layer is shown in Figure 5-7.
Figure 5-7 Logical placement of the FlashCopy indirection layer
The introduction of the two-level cache provides additional performance improvements to the FlashCopy mechanism. Because the FlashCopy layer is above the lower cache in the IBM Spectrum Virtualize software stack, it can benefit from read prefetching and coalescing writes to back-end storage. Also, preparing FlashCopy is much faster because upper cache write data does not have to go directly to back-end storage, but only to the lower cache layer.
Additionally, in Multiple Target FlashCopy, the target volumes of the same image share cache data. This design differs from previous IBM Spectrum Virtualize code versions, where each volume had its own copy of cached data.
Interaction and dependency between Multiple Target FlashCopy mappings
Figure 5-8 represents a set of four FlashCopy mappings that share a common source. The FlashCopy mappings target volumes Target 0, Target 1, Target 2, and Target 3.
Figure 5-8 Interactions between multi-target FlashCopy mappings
The configuration in Figure 5-8 has these characteristics:
Target 0 is not dependent on a source because it has completed copying. Target 0 has two dependent mappings (Target 1 and Target 2).
Target 1 is dependent upon Target 0. It remains dependent until all of Target 1 has been copied. Target 2 depends on Target 1 because Target 2 is only 20% copy complete. After all of Target 1 has been copied, Target 1 can then move to the idle_copied state.
Target 2 depends on Target 0 and Target 1, and will remain dependent until all of Target 2 has been copied. No target depends on Target 2, so when all of the data has been copied to Target 2, it can move to the idle_copied state.
Target 3 has completed copying, so it is not dependent on any other maps.
Target writes with Multiple Target FlashCopy
A write to an intermediate or newest target volume must consider the state of the grain within its own mapping, and the state of the grain of the next oldest mapping:
If the grain of the next oldest mapping has not been copied yet, it must be copied before the write is allowed to proceed to preserve the contents of the next oldest mapping. The data that is written to the next oldest mapping comes from a target or source.
If the grain in the target being written has not yet been copied, the grain is copied from the oldest already copied grain in the mappings that are newer than the target, or the source if none are already copied. After this copy has been done, the write can be applied to the target.
Target reads with Multiple Target FlashCopy
If the grain being read has already been copied from the source to the target, the read simply returns data from the target being read. If the grain has not been copied, each of the newer mappings is examined in turn and the read is performed from the first copy found. If none are found, the read is performed from the source.
5.2.4 FlashCopy planning considerations
The FlashCopy function, like all the advanced IBM Spectrum Virtualize and Storwize family product features, offers useful capabilities. However, some basic planning considerations must be followed for a successful implementation.
FlashCopy configuration limits
To plan for and implement FlashCopy, you must check the configuration limits and adhere to them. Table 5-3 shows the limits for a system that apply to the latest version at the time of writing this book.
Table 5-3 FlashCopy properties and maximum configurations
FlashCopy targets per source: 256
 – This maximum is the number of FlashCopy mappings that can exist with the same source volume.
FlashCopy mappings per system: 5000 or 4096
 – 5000 applies to SAN Volume Controller 2145 models SV1, DH8, CG8, and CF8, and to Storwize V7000 2076 models 524 (Gen2) and 624 (Gen2+). 4096 applies to any other Storwize models.
FlashCopy Consistency Groups per system: 255
 – This maximum is an arbitrary limit that is policed by the software.
FlashCopy volume space per I/O Group: 4096 TB
 – This maximum is a limit on the quantity of FlashCopy mappings that use bitmap space from one I/O Group.
FlashCopy mappings per Consistency Group: 512
 – This limit is due to the time that is taken to prepare a Consistency Group with many mappings.
 
Configuration Limits: The configuration limits always change with the introduction of new HW and SW capabilities. Check the IBM Spectrum Virtualize/Storwize online documentation for the latest configuration limits.
The total amount of cache memory reserved for the FlashCopy bitmaps limits the amount of capacity that can be used as a FlashCopy target. Table 5-4 illustrates the relationship of bitmap space to FlashCopy address space, depending on the size of the grain and the kind of FlashCopy service being used.
Table 5-4 Relationship of bitmap space to FlashCopy address space for the specified I/O Group
Copy service              Grain size    Volume capacity per 1 MB of bitmap memory (for the specified I/O Group)
FlashCopy                 256 KB        2 TB of target volume capacity
FlashCopy                 64 KB         512 GB of target volume capacity
Incremental FlashCopy     256 KB        1 TB of target volume capacity
Incremental FlashCopy     64 KB         256 GB of target volume capacity
 
Mapping consideration: For multiple FlashCopy targets, you must consider the number of mappings. For example, for a mapping with a 256 KB grain size, 8 KB of memory allows one mapping between a 16 GB source volume and a 16 GB target volume. Alternatively, for a mapping with a 256 KB grain size, 8 KB of memory allows two mappings between one 8 GB source volume and two 8 GB target volumes.
The default amount of memory for FlashCopy is 20 MB. This value can be increased or decreased by using the chiogrp command. The maximum amount of memory that can be specified for FlashCopy is 2048 MB (512 MB for 32-bit systems). The maximum combined amount of memory across all copy services features is 2600 MB (552 MB for 32-bit systems).
 
Bitmap allocation: When creating a FlashCopy mapping, you can optionally specify the I/O group where the bitmap is allocated. If you specify an I/O Group other than the I/O Group of the source volume, the memory accounting goes towards the specified I/O Group, not towards the I/O Group of the source volume. This option can be useful when an I/O group is exhausting the memory that is allocated to the FlashCopy bitmaps and no more free memory is available in the I/O group.
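For example, the FlashCopy bitmap memory of an I/O group can be reviewed and then increased as follows. The 40 MB value is only an example; the detailed lsiogrp output reports the current FlashCopy memory allocation:
   lsiogrp io_grp0
   chiogrp -feature flash -size 40 io_grp0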
Restrictions
The following implementation restrictions apply to FlashCopy:
The size of source and target volumes in a FlashCopy mapping must be the same.
Multiple FlashCopy mappings that use the same target volume can be defined, but only one of these mappings can be started at a time. This limitation means that multiple FlashCopy mappings cannot be active to the same target volume at the same time.
Expansion or shrinking of volumes defined in a FlashCopy mapping is not allowed. To modify the size of a source or target volume, first remove the FlashCopy mapping.
In a cascading FlashCopy, the grain size of all the FlashCopy mappings that participate must be the same.
In a multi-target FlashCopy, the grain size of all the FlashCopy mappings that participate must be the same.
In a reverse FlashCopy, the grain size of all the FlashCopy mappings that participate must be the same.
No FlashCopy mapping can be added to a consistency group while the FlashCopy mapping status is Copying.
No FlashCopy mapping can be added to a consistency group while the consistency group status is Copying.
The use of Consistency Groups is restricted when using Cascading FlashCopy. A Consistency Group serves the purpose of starting FlashCopy mappings at the same point in time. Within the same Consistency Group, it is not possible to have mappings with these conditions:
 – The source volume of one mapping is the target of another mapping.
 – The target volume of one mapping is the source volume for another mapping.
These combinations are not useful because within a Consistency Group, mappings cannot be established in a particular order. This limitation renders the content of the target volume undefined. For instance, it is not possible to determine whether the first mapping was established before the second mapping, whose source volume is the target volume of the first mapping.
Even if it were possible to ensure the order in which the mappings are established within a Consistency Group, the result is equal to Multi Target FlashCopy (that is, two volumes holding the same target data for one source volume). In other words, a cascade is useful for copying volumes in a certain order (and copying the changed content targets of FlashCopies), rather than at the same time in an undefined order (from within one single Consistency Group).
Both source and target volumes can be used as primary in a Remote Copy relationship. However, if the target volume of a FlashCopy is used as primary in a Remote Copy relationship, the following rules apply:
 – The FlashCopy cannot be started if the status of the Remote Copy relationship is different from Idle or Stopped.
 – The FlashCopy cannot be started if the I/O group that is allocating the FlashCopy mapping bitmap is not the same as the I/O group of the FlashCopy target volume.
 – A FlashCopy cannot be started if the target volume is the secondary volume of a Remote Copy relationship.
FlashCopy presets
The IBM Spectrum Virtualize/Storwize GUI provides three FlashCopy presets (Snapshot, Clone, and Backup) to simplify the most common FlashCopy operations.
Although these presets meet most FlashCopy requirements, they do not provide support for all possible FlashCopy options. If more specialized options are required that are not supported by the presets, the options must be performed by using CLI commands.
This section describes the three preset options and their use cases.
Snapshot
This preset creates a copy-on-write point-in-time copy. The snapshot is not intended to be an independent copy. Instead, the copy is used to maintain a view of the production data at the time that the snapshot is created. Therefore, the snapshot holds only the data from regions of the production volume that have changed since the snapshot was created. Because the snapshot preset uses thin provisioning, only the capacity that is required for the changes is used.
Snapshot uses the following preset parameters:
Background copy: None
Incremental: No
Delete after completion: No
Cleaning rate: No
Primary copy source pool: Target pool
A typical use case for the Snapshot is when the user wants to produce a copy of a volume without affecting the availability of the volume. The user does not anticipate many changes to be made to the source or target volume. A significant proportion of the volumes remains unchanged.
By ensuring that only changes require a copy of data to be made, the total amount of disk space that is required for the copy is reduced. Therefore, many Snapshot copies can be used in the environment.
Snapshots are useful for providing protection against corruption or similar issues with the validity of the data. However, they do not provide protection from physical controller failures. Snapshots can also provide a vehicle for performing repeatable testing (including “what-if” modeling that is based on production data) without requiring a full copy of the data to be provisioned.
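For reference, a rough CLI equivalent of the Snapshot preset is a thin-provisioned target volume plus a no-copy mapping, along the lines of the following sketch. The pool and volume names are placeholders, and the GUI preset might set additional attributes:
   mkvdisk -mdiskgrp pool0 -iogrp io_grp0 -size 100 -unit gb -rsize 2% -autoexpand -name vol_prod_snap01
   mkfcmap -source vol_prod -target vol_prod_snap01 -copyrate 0 -name snap_map01
   startfcmap -prep snap_map01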
Clone
The clone preset creates a replica of the volume, which can then be changed without affecting the original volume. After the copy completes, the mapping that was created by the preset is automatically deleted.
Clone uses the following preset parameters:
Background copy rate: 50
Incremental: No
Delete after completion: Yes
Cleaning rate: 50
Primary copy source pool: Target pool
A typical use case for the Clone preset is when users want a copy of the volume that they can modify without affecting the original volume. After the clone is established, there is no expectation that it is refreshed or that there is any further need to reference the original production data again. If the source is thin-provisioned, the target is thin-provisioned for the auto-create target.
Backup
The backup preset creates a point-in-time replica of the production data. After the copy completes, the backup view can be refreshed from the production data, with minimal copying of data from the production volume to the backup volume.
Backup uses the following preset parameters:
Background Copy rate: 50
Incremental: Yes
Delete after completion: No
Cleaning rate: 50
Primary copy source pool: Target pool
The Backup preset can be used when the user wants to create a copy of the volume that can be used as a backup if the source becomes unavailable. This unavailability can happen during loss of the underlying physical controller. The user plans to periodically update the secondary copy, and does not want to suffer from the resource demands of creating a new copy each time. Incremental FlashCopy times are faster than full copy, which helps to reduce the window where the new backup is not yet fully effective. If the source is thin-provisioned, the target is also thin-provisioned in this option for the auto-create target.
Another use case, which the preset name does not suggest, is to create and maintain (periodically refresh) an independent image. This image can be subjected to intensive I/O (for example, data mining) without affecting the source volume’s performance.
Grain size considerations
When creating a mapping, a grain size of 64 KB can be specified instead of the default 256 KB. This smaller grain size was introduced specifically for incremental FlashCopy, even though its use is not restricted to incremental mappings.
In an incremental FlashCopy, the modified data is identified by using the bitmaps. The amount of data to be copied when refreshing the mapping depends on the grain size. If the grain size is 64 KB, as compared to 256 KB, there might be less data to copy to get a fully independent copy of the source again.
 
Incremental FlashCopy: For incremental FlashCopy, the 64 KB grain size is preferred.
Similar to FlashCopy mappings, thin-provisioned volumes also have a grain size attribute, which represents the size of the chunk of storage that is added to the used capacity.
The following are the preferred settings for thin-provisioned FlashCopy:
Thin-provisioned volume grain size must be equal to the FlashCopy grain size.
Thin-provisioned volume grain size must be 64 KB for the best performance and the best space efficiency.
The exception is where the thin target volume is going to become a production volume (and is likely to be subjected to ongoing heavy I/O). In this case, the 256 KB thin-provisioned grain size is preferable because it provides better long-term I/O performance at the expense of a slower initial copy.
 
FlashCopy grain size considerations: Even if the 256 KB thin-provisioned volume grain size is chosen, it is still beneficial to limit the FlashCopy grain size to 64 KB. This approach minimizes the performance impact on the source volume, even though it increases the I/O workload on the target volume.
However, clients with very large numbers of FlashCopy/Remote Copy relationships might still be forced to choose a 256 KB grain size for FlashCopy to avoid constraints on the amount of bitmap memory.
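For instance, an incremental mapping with a 64 KB grain size onto a thin-provisioned target that uses a matching 64 KB grain size can be created as follows. The names and sizes are placeholders:
   mkvdisk -mdiskgrp pool0 -iogrp io_grp0 -size 200 -unit gb -rsize 2% -autoexpand -grainsize 64 -name thin_target01
   mkfcmap -source vol_prod -target thin_target01 -copyrate 50 -incremental -grainsize 64 -name fcmap_64k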
Volume placement considerations
The placement of the source and target volumes across the pools and the I/O groups must be planned to minimize the effect of the underlying FlashCopy processes. In normal conditions (that is, with all the nodes or canisters fully operational), the FlashCopy background copy workload distribution follows this schema:
The preferred node of the source volume is responsible for the background copy read operations
The preferred node of the target volume is responsible for the background copy write operations
For the copy-on-write process, Table 5-5 shows how the operations are distributed across the nodes.
Table 5-5 Workload distribution for the copy-on-write process
 
Node that performs the back-end I/O if the grain is already copied:
 – Read from source: preferred node in the source volume’s I/O group
 – Read from target: preferred node in the target volume’s I/O group
 – Write to source: preferred node in the source volume’s I/O group
 – Write to target: preferred node in the target volume’s I/O group
Node that performs the back-end I/O if the grain is not yet copied:
 – Read from source: preferred node in the source volume’s I/O group
 – Read from target: preferred node in the source volume’s I/O group
 – Write to source: the preferred node in the source volume’s I/O group reads and writes, and the preferred node in the target volume’s I/O group writes
 – Write to target: the preferred node in the source volume’s I/O group reads, and the preferred node in the target volume’s I/O group writes
Note that the data transfer between the source and the target volume’s preferred nodes occurs through the node-to-node connectivity. Consider the following volume placement alternatives:
1. Source and target volumes use the same preferred node.
In this scenario, the node that is acting as preferred for both source and target volume manages all the read and write FlashCopy operations. Only resources from this node are consumed for the FlashCopy operations, and no node-to-node bandwidth is used.
2. Source and target volumes use different preferred nodes.
In this scenario, both nodes that are acting as preferred nodes manage read and write FlashCopy operations according to the schemes described above. The data that is transferred between the two preferred nodes goes through the node-to-node network.
Both alternatives have pros and cons, so there is no general preferred practice to apply. The following are some example scenarios:
1. IBM Spectrum Virtualize or Storwize system with multiple I/O groups where the source volumes are evenly spread across all the nodes. Assuming that the I/O workload is evenly distributed across the nodes, alternative 1 is preferable. The read and write FlashCopy operations are then also evenly spread across the nodes without using any node-to-node bandwidth.
2. IBM Spectrum Virtualize or Storwize system with multiple I/O groups where the source volumes and most of the workload are concentrated in some nodes. In this case, alternative 2 is preferable. Defining the target volumes’ preferred node on the less used nodes relieves the source volume’s preferred node of some additional FlashCopy workload (especially during the background copy).
3. IBM Spectrum Virtualize system with multiple I/O groups in Enhanced Stretched Cluster configuration where the source volumes are evenly spread across all the nodes. In this case, the preferred node placement should follow the location of source and target volumes on the back-end storage. For example, if the source volume is on site A and the target volume is on site B, then the target volume’s preferred node must be in site B. Placing the target volume’s preferred node on site A causes the redirection of the FlashCopy write operation through the node-to-node network.
Placement on the back-end storage is mainly driven by the availability requirements. Generally, use different back-end storage controllers or arrays for the source and target volumes.
Background Copy considerations
The background copy process uses internal resources such as CPU, memory, and bandwidth. This copy process tries to reach the target copy data rate for every volume according to the background copy rate parameter setting (as reported in Table 5-1).
If the copy process is unable to achieve these goals, it starts contending for resources with the foreground I/O (that is, the I/O coming from the hosts). As a result, both background copy and foreground I/O tend to see an increase in latency, and therefore a reduction in throughput, compared to a situation where the resources are not constrained. Degradation is graceful: both background copy and foreground I/O continue to make progress, and will not stop, hang, or cause the node to fail.
To avoid any impact on the foreground I/O, that is, on the host response time, carefully plan the background copy activity, taking into account the overall workload running on the system. The background copy basically reads and writes data to managed disks. Usually, the most affected component is the back-end storage. CPU and memory are not normally significantly affected by the copy activity.
The theoretical workload that is added by the background copy is easily estimated. For instance, starting 20 FlashCopy mappings, each with a background copy rate of 70 (8 MBps per mapping), adds a maximum throughput of 160 MBps for the reads and 160 MBps for the writes.
The source and target volumes distribution on the back-end storage determines where this workload is going to be added. The duration of the background copy depends on the amount of data to be copied. This amount is the total size of volumes for full background copy or the amount of data that is modified for incremental copy refresh.
Performance monitoring tools like IBM Spectrum Control can be used to evaluate the existing workload on the back-end storage in a specific time window. By adding this workload to the foreseen background copy workload, you can estimate the overall workload running toward the back-end storage. Disk performance simulation tools, like Disk Magic, can be used to estimate the effect, if any, of the added back-end workload to the host service time during the background copy window. The outcomes of this analysis can provide useful hints for the background copy rate settings.
When performance monitoring and simulation tools are not available, use a conservative and progressive approach. Consider that the background copy setting can be modified at any time, even when the FlashCopy is already started. The background copy process can even be completely stopped by setting the background copy rate to 0.
Initially, set the background copy rate value to add a limited workload to the back-end (for example, less than 100 MBps). If no effects on the hosts are noticed, the background copy rate value can be increased progressively, and reduced again if negative effects appear. Note that the background copy rate setting follows an exponential scale, so changing, for instance, from 50 to 60 doubles the data rate goal from 2 MBps to 4 MBps.
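For example, a conservative starting point followed by a later increase might look like the following commands, where the mapping name is a placeholder. Per Table 5-1, a rate of 40 targets 1 MBps per mapping, and a rate of 60 targets 4 MBps:
   chfcmap -copyrate 40 fcmap_prod
   (monitor host response times for a while)
   chfcmap -copyrate 60 fcmap_prod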
Cleaning rate
The cleaning rate is the rate at which data is copied among dependent FlashCopies, such as Cascaded and Multiple Target FlashCopy. The cleaning process is a copy process similar to the background copy, so the same guidelines as for background copy apply.
Host and application considerations to ensure FlashCopy integrity
Because FlashCopy is at the block level, it is necessary to understand the interaction between your application and the host operating system. From a logical standpoint, it is easiest to think of these objects as “layers” that sit on top of one another. The application is the topmost layer, and beneath it is the operating system layer.
Both of these layers have various levels and methods of caching data to provide better speed. Because IBM Spectrum Virtualize systems, and therefore FlashCopy, sit below these layers, they are unaware of the cache at the application or operating system layers.
To ensure the integrity of the copy that is made, it is necessary to flush the host operating system and application cache for any outstanding reads or writes before the FlashCopy operation is performed. Failing to flush the host operating system and application cache produces what is referred to as a crash consistent copy.
The resulting copy requires the same type of recovery procedure, such as log replay and file system checks, that is required following a host crash. FlashCopies that are crash consistent often can be used following file system and application recovery procedures.
 
Note: Although the best way to perform FlashCopy is to flush host cache first, some companies, like Oracle, support using snapshots without it, as stated in Metalink note 604683.1.
Various operating systems and applications provide facilities to stop I/O operations and ensure that all data is flushed from host cache. If these facilities are available, they can be used to prepare for a FlashCopy operation. When this type of facility is not available, the host cache must be flushed manually by quiescing the application and unmounting the file system or drives.
 
Preferred practice: From a practical standpoint, when you have an application that is backed by a database and you want to make a FlashCopy of that application’s data, it is sufficient in most cases to use the write-suspend method that is available in most modern databases. You can use this method because the database maintains strict control over I/O.
This method contrasts with flushing data from both the application and the backing database, which is always the suggested method because it is safer. However, the write-suspend method can be used when such facilities do not exist or your environment is time sensitive.
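As a hedged example on a Linux host, a file system freeze can be wrapped around the start of a FlashCopy Consistency Group, assuming the application data is on /db_data and a consistency group named fccg_db already exists. Databases typically offer their own write-suspend commands that can be used instead of a file system freeze:
   fsfreeze -f /db_data              (on the host: block new I/O and flush the file system)
   startfcconsistgrp -prep fccg_db   (on the IBM Spectrum Virtualize CLI)
   fsfreeze -u /db_data              (on the host: resume I/O)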
5.3 Remote Copy services
IBM Spectrum Virtualize and Storwize technology offers various remote copy services functions that address Disaster Recovery and Business Continuity needs.
Metro Mirror is designed for metropolitan distances with a zero recovery point objective (RPO), which is zero data loss. This objective is achieved with a synchronous copy of volumes. Writes are not acknowledged until they are committed to both storage systems. By definition, any vendors’ synchronous replication makes the host wait for write I/Os to complete at both the local and remote storage systems, and includes round-trip network latencies. Metro Mirror has the following characteristics:
Zero RPO
Synchronous
Production application performance that is affected by round-trip latency
Global Mirror is designed to minimize application performance impact by replicating asynchronously. That is, writes are acknowledged as soon as they can be committed to the local storage system, sequence-tagged, and passed on to the replication network. This technique allows Global Mirror to be used over longer distances. By definition, any vendors’ asynchronous replication results in an RPO greater than zero. However, for Global Mirror, the RPO is quite small, typically anywhere from several milliseconds to some number of seconds.
Although Global Mirror is asynchronous, the RPO is still small, and thus the network and the remote storage system must both still be able to cope with peaks in traffic. Global Mirror has the following characteristics:
Near-zero RPO
Asynchronous
Production application performance that is affected by I/O sequencing preparation time
Global Mirror with Change Volumes provides an option to replicate point-in-time copies of volumes. This option generally requires lower bandwidth because it is the average rather than the peak throughput that must be accommodated. The RPO for Global Mirror with Change Volumes is higher than traditional Global Mirror. Global Mirror with Change Volumes has the following characteristics:
Larger RPO
Point-in-time copies
Asynchronous
Possible system performance effect because point-in-time copies are created locally
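As an illustration, a Global Mirror with Change Volumes relationship can be defined along the following lines, assuming that the partnership and the change volumes already exist and that all names are placeholders. Depending on the code level, the auxiliary change volume might have to be assigned from the remote system. The cycle period shown (300 seconds, the default) directly influences the achievable RPO:
   mkrcrelationship -master vol_prod -aux vol_dr -cluster remote_sys -global -cyclingmode multi -name gmcv_rel
   chrcrelationship -masterchange vol_prod_cv gmcv_rel
   chrcrelationship -auxchange vol_dr_cv gmcv_rel
   chrcrelationship -cycleperiodseconds 300 gmcv_rel
   startrcrelationship gmcv_rel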
Successful implementation depends on taking a holistic approach in which you consider all components and their associated properties. The components and properties include host application sensitivity, local and remote SAN configurations, local and remote system and storage configuration, and the intersystem network.
5.3.1 Remote copy functional overview
In this section, the terminology and the basic functional aspects of the remote copy services are presented.
Common terminology and definitions
When such a breadth of technology areas is covered, the same technology component can have multiple terms and definitions. This document uses the following definitions:
Local system or master system
The system on which the foreground applications run.
Local hosts
Hosts that run the foreground applications.
Master volume or source volume
The local volume that is being mirrored. The volume has nonrestricted access. Mapped hosts can read and write to the volume.
Intersystem link or intersystem network
The network that provides connectivity between the local and the remote site. It can be a Fibre Channel network (SAN), an IP network, or a combination of the two.
Remote system or auxiliary system
The system that holds the remote mirrored copy.
Auxiliary volume or target volume
The remote volume that holds the mirrored copy. It is read-access only.
Remote copy
A generic term that is used to describe a Metro Mirror or Global Mirror relationship in which data on the source volume is mirrored to an identical copy on a target volume. Often the two copies are separated by some distance, which is why the term remote is used to describe the copies. However, having remote copies is not a prerequisite. A remote copy relationship includes the following states:
 – Consistent relationship
A remote copy relationship where the data set on the target volume represents the data set on the source volume at a certain point in time.
 – Synchronized relationship
A relationship is synchronized if it is consistent and the point in time that the target volume represents is the current point. The target volume then contains the same data as the source volume.
Synchronous remote copy (Metro Mirror)
Writes are committed to both the source and target volumes in the foreground before completion is confirmed to the local host application.
Asynchronous remote copy (Global Mirror)
A foreground write I/O is acknowledged as complete to the local host application before the mirrored foreground write I/O is cached at the remote system. Mirrored foreground writes are processed asynchronously at the remote system, but in a committed sequential order as determined and managed by the Global Mirror remote copy process.
 
 
Global Mirror Change Volume
Holds earlier consistent revisions of data when changes are made. A change volume must be created for the master volume and the auxiliary volume of the relationship.
The background copy process manages the initial synchronization or resynchronization between source volumes and target mirrored volumes on a remote system.
Foreground I/O is the read and write I/O that is run on the local SAN; each foreground write generates a mirrored foreground write I/O that is sent across the intersystem network and the remote SAN.
Figure 5-9 shows some of the concepts of remote copy.
Figure 5-9 Remote copy components and applications
A successful implementation of an intersystem remote copy service depends on the quality and the configuration of the intersystem network.
Remote copy partnerships and relationships
A remote copy partnership is a partnership that is established between a master (local) system and an auxiliary (remote) system, as shown in Figure 5-10.
Figure 5-10 Remote copy partnership
Partnerships are established between two systems by issuing the mkfcpartnership or mkippartnership command once from each end of the partnership. The parameters that need to be specified are the remote system name (or ID), the available bandwidth (in Mbps), and the maximum background copy rate as a percentage of the available bandwidth. The background copy parameter determines the maximum speed of the initial synchronization and resynchronization of the relationships.
 
Tip: To establish a fully functional Metro Mirror or Global Mirror partnership, issue the mkfcpartnership or mkippartnership command from both systems.
A remote copy relationship is a relationship that is established between a source (primary) volume in the local system and a target (secondary) volume in the remote system. Usually when a remote copy relationship is started, a background copy process that copies the data from source to target volumes is started as well.
After background synchronization or resynchronization is complete, a Global Mirror relationship provides and maintains a consistent mirrored copy of a source volume to a target volume.
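The following sketch summarizes the typical sequence with a Fibre Channel partnership; the system, volume, and relationship names are placeholders, and the same mkfcpartnership command must also be issued on the remote system:
   mkfcpartnership -linkbandwidthmbits 1024 -backgroundcopyrate 50 remote_sys
   mkrcrelationship -master vol_prod -aux vol_dr -cluster remote_sys -name mm_rel
   startrcrelationship mm_rel
Without the -global option, mkrcrelationship creates a Metro Mirror relationship; adding -global creates a Global Mirror relationship instead.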
Copy directions and default roles
When you create a remote copy relationship, the source or master volume is initially assigned the role of the master, and the target auxiliary volume is initially assigned the role of the auxiliary. This design implies that the initial copy direction of mirrored foreground writes and background resynchronization writes (if applicable) is from master to auxiliary.
After the initial synchronization is complete, you can change the copy direction (see Figure 5-11). The ability to change roles is used to facilitate disaster recovery.
Figure 5-11 Role and direction changes
 
Attention: When the direction of the relationship is changed, the roles of the volumes are altered. A consequence is that the read/write properties are also changed, meaning that the master volume takes on a secondary role and becomes read-only.
Consistency Groups
A Consistency Group (CG) is a collection of relationships that can be treated as one entity. This technique is used to preserve write order consistency across a group of volumes that pertain to one application, for example, a database volume and a database log file volume.
After a remote copy relationship is added into a Consistency Group, you cannot manage the relationship in isolation from the Consistency Group. For example, issuing a stoprcrelationship command against such a relationship on its own fails because the system knows that the relationship is part of a Consistency Group.
Note the following points regarding Consistency Groups:
Each volume relationship can belong to only one Consistency Group.
Volume relationships can also be stand-alone, that is, not in any Consistency Group.
Consistency Groups can also be created and left empty, or can contain one or many relationships.
You can create up to 256 Consistency Groups on a system.
All volume relationships in a Consistency Group must have matching primary and secondary systems, but they do not need to share I/O groups.
All relationships in a Consistency Group have the same copy direction and state.
Each Consistency Group is either for Metro Mirror or for Global Mirror relationships, but not both. This choice is determined by the first volume relationship that is added to the Consistency Group.
 
Consistency Group consideration: A Consistency Group relationship does not have to be in a directly matching I/O group number at each site. A Consistency Group owned by I/O group 1 at the local site does not have to be owned by I/O group 1 at the remote site. If you have more than one I/O group at either site, you can create the relationship between any two I/O groups. This technique spreads the workload, for example, from local I/O group 1 to remote I/O group 2.
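As a sketch of this technique, the following commands create a Consistency Group, move two existing relationships into it, and then start the group as one entity; the names DB_CG, DB_REL, LOG_REL, and ITSO_REMOTE are hypothetical:
# DB_CG, DB_REL, LOG_REL, and ITSO_REMOTE are hypothetical names
mkrcconsistgrp -cluster ITSO_REMOTE -name DB_CG
chrcrelationship -consistgrp DB_CG DB_REL
chrcrelationship -consistgrp DB_CG LOG_REL
startrcconsistgrp DB_CG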
Streams
Consistency Groups can also be used as a way to spread replication workload across multiple streams within a partnership.
The Metro or Global Mirror partnership architecture allocates traffic from each Consistency Group in a round-robin fashion across 16 streams. That is, cg0 traffic goes into stream0, and cg1 traffic goes into stream1.
Any volume that is not in a Consistency Group also goes into stream0. You might want to consider creating an empty Consistency Group 0 so that stand-alone volumes do not share a stream with active Consistency Group volumes.
It can also pay to optimize your streams by creating more Consistency Groups. Within each stream, each batch of writes must be processed in tag sequence order and any delays in processing any particular write also delays the writes behind it in the stream. Having more streams (up to 16) reduces this kind of potential congestion.
The sequence tags in each stream are processed by one node, so generally you want to create at least as many Consistency Groups as there are IBM Spectrum Virtualize nodes or Storwize canisters in the system and, ideally, a perfect multiple of the node count.
Layer concept
Version 6.3 introduced the concept of layer, which allows you to create partnerships among IBM Spectrum Virtualize and Storwize products. The key points concerning layers are listed here:
IBM Spectrum Virtualize is always in the Replication layer.
By default, Storwize products are in the Storage layer.
A system can only form partnerships with systems in the same layer.
An IBM Spectrum Virtualize system can virtualize a Storwize system only if the Storwize system is in the Storage layer.
Since version 6.4, a Storwize system in the Replication layer can virtualize a Storwize system in the Storage layer.
Figure 5-12 illustrates the concept of layers.
Figure 5-12 Conceptualization of layers
Generally, changing the layer is only performed at initial setup time or as part of a major reconfiguration. To change the layer of a Storwize system, the system must meet the following pre-conditions:
The Storwize system must not have any IBM Spectrum Virtualize or Storwize host objects defined, and must not be virtualizing any other Storwize controllers.
The Storwize system must not be visible to any other IBM Spectrum Virtualize or Storwize system in the SAN fabric, which might require SAN zoning changes.
The Storwize system must not have any system partnerships defined. If it is already using Metro Mirror or Global Mirror, the existing partnerships and relationships must be removed first.
Changing a Storwize system from Storage layer to Replication layer can only be performed by using the CLI. After you are certain that all of the pre-conditions have been met, issue the following command:
chsystem -layer replication
Partnership topologies
Each system can be connected to a maximum of three other systems for the purposes of Metro or Global Mirror.
Figure 5-13 shows examples of the principal supported topologies for Metro and Global Mirror partnerships. Each box represents an IBM Spectrum Virtualize or Storwize system.
Figure 5-13 Supported topologies for Metro and Global Mirror
Star topology
A star topology can be used, for example, to share a centralized disaster recovery system (3, in this example) with up to three other systems, for example replicating 1 → 3, 2 → 3, and 4 → 3.
Ring topology
A ring topology (three or more systems) can be used to establish a one-in, one-out implementation. For example, the implementation can be 1 → 2, 2 → 3, 3 → 1 to spread replication loads evenly among three systems.
Linear topology
A linear topology of two or more sites is also possible. However, it would generally be simpler to create partnerships between system 1 and system 2, and separately between system 3 and system 4.
Mesh topology
A fully connected mesh topology is where every system has a partnership to each of the three other systems. This topology allows flexibility in that volumes can be replicated between any two systems.
 
Topology considerations:
Although systems can have up to three partnerships, any one volume can be part of only a single relationship. That is, you cannot replicate any given volume to multiple remote sites.
Although various topologies are supported, it is advisable to keep your partnerships as simple as possible, which in most cases means system pairs or a star.
Intrasystem versus intersystem
Although remote copy services are available within a single system (intrasystem), this use has no functional value for production purposes; in particular, intrasystem Global Mirror offers no benefit over intrasystem Metro Mirror, which provides the same capability with less overhead. However, leaving this function in place simplifies testing and allows for experimentation. For example, you can validate server failover on a single test system.
 
Intrasystem remote copy: Intrasystem remote copy is not supported on IBM Spectrum Virtualize/Storwize systems that run V6 or later.
Metro Mirror functional overview
Metro Mirror provides synchronous replication. It is designed to ensure that updates are committed to both the primary and secondary volumes before sending an acknowledgment (Ack) of the completion to the server.
If the primary volume fails completely for any reason, Metro Mirror is designed to ensure that the secondary volume holds the same data as the primary did immediately before the failure.
Metro Mirror provides the simplest way to maintain an identical copy on both the primary and secondary volumes. However, as with any synchronous copy over long distance, there can be a performance impact to host applications due to network latency.
Metro Mirror supports relationships between volumes that are up to 300 km apart. Latency is an important consideration for any Metro Mirror network. With typical fiber optic round-trip latencies of 1 ms per 100 km, you can expect a minimum of 3 ms extra latency, due to the network alone, on each I/O if you are running across the 300 km separation.
Figure 5-14 shows the order of Metro Mirror write operations.
Figure 5-14 Metro Mirror write sequence
A write into mirrored cache on an IBM Spectrum Virtualize or Storwize system is all that is required for the write to be considered as committed. De-staging to disk is a natural part of I/O management, but it is not generally in the critical path for a Metro Mirror write acknowledgment.
Global Mirror functional overview
Global Mirror provides asynchronous replication. It is designed to reduce the dependency on round-trip network latency by acknowledging the primary write in parallel with sending the write to the secondary volume.
If the primary volume fails completely for any reason, Global Mirror is designed to ensure that the secondary volume holds the same data as the primary did at a point a short time before the failure. That short period of data loss is typically between 10 milliseconds and 10 seconds, but varies according to individual circumstances.
Global Mirror provides a way to maintain a write-order-consistent copy of data at a secondary site only slightly behind the primary. Global Mirror has minimal impact on the performance of the primary volume.
Although Global Mirror is an asynchronous remote copy technique, foreground writes at the local system and mirrored foreground writes at the remote system are not wholly independent of one another. The IBM Spectrum Virtualize/Storwize implementation of asynchronous remote copy uses algorithms to maintain a consistent image at the target volume at all times. It achieves this consistency by identifying sets of I/Os that are active concurrently at the source, assigning an order to those sets, and applying the sets of I/Os in the assigned order at the target. The multiple I/Os within a single set are applied concurrently.
The process that marshals the sequential sets of I/Os operates at the remote system, and therefore is not subject to the latency of the long-distance link.
Figure 5-15 shows that a write operation to the master volume is acknowledged back to the host that issues the write before the write operation is mirrored to the cache for the auxiliary volume.
Figure 5-15 Global Mirror relationship write operation
With Global Mirror, write completion is confirmed to the host before the write completes at the auxiliary volume. When a write is sent to a master volume, it is assigned a sequence number. Mirrored writes that are sent to the auxiliary volume are committed in sequence number order. If a write is issued while another write is still outstanding, it might be given the same sequence number.
This function maintains a consistent image at the auxiliary volume at all times. It identifies sets of I/Os that are active concurrently at the primary volume, assigns an order to those sets, and applies these sets of I/Os in the assigned order at the auxiliary volume. Further writes might be received from a host while the secondary write is still active for the same block. In this case, although the primary write might complete, the new write on the auxiliary volume is delayed until the previous write is completed.
Write ordering
Many applications that use block storage are required to survive failures, such as a loss of power or a software crash. They are also required to not lose data that existed before the failure. Because many applications must perform many update operations in parallel to that storage block, maintaining write ordering is key to ensuring the correct operation of applications after a disruption.
An application that performs a high volume of database updates is often designed with the concept of dependent writes. Dependent writes ensure that an earlier write completes before a later write starts. Reversing the order of dependent writes can undermine the algorithms of the application and can lead to problems, such as detected or undetected data corruption.
Colliding writes
Colliding writes are defined as new write I/Os that overlap existing “active” write I/Os.
Before V4.3.1, the Global Mirror algorithm required only a single write to be active on any 512-byte LBA of a volume. If another write was received from a host while the auxiliary write was still active, the new host write was delayed until the auxiliary write was complete (although the master write might complete). This restriction was needed if a series of writes to the auxiliary must be retried (which is known as reconstruction). Conceptually, the data for reconstruction comes from the master volume.
If multiple writes were allowed to be applied to the master for a sector, only the most recent write had the correct data during reconstruction. If reconstruction was interrupted for any reason, the intermediate state of the auxiliary was inconsistent.
Applications that deliver such write activity do not achieve the performance that Global Mirror is intended to support. A volume statistic is maintained about the frequency of these collisions. Starting with V4.3.1, an attempt is made to allow multiple writes to a single location to be outstanding in the Global Mirror algorithm.
A need still exists for master writes to be serialized. The intermediate states of the master data must be kept in a non-volatile journal while the writes are outstanding to maintain the correct write ordering during reconstruction. Reconstruction must never overwrite data on the auxiliary with an earlier version. The colliding-write volume statistic now counts only those writes that are not handled by this mechanism.
Figure 5-16 shows a colliding write sequence.
Figure 5-16 Colliding writes
The following numbers correspond to the numbers that are shown in Figure 5-16:
1. A first write is performed from the host to LBA X.
2. A host is provided acknowledgment that the write is complete, even though the mirrored write to the auxiliary volume is not yet completed.
The first two actions (1 and 2) occur asynchronously with the first write.
3. A second write is performed from the host to LBA X. If this write occurs before the host receives acknowledgment (2), the write is written to the journal file.
4. A host is provided acknowledgment that the second write is complete.
Global Mirror Change Volumes functional overview
Global Mirror with Change Volumes provides asynchronous replication based on point-in-time copies of data. It is designed to allow for effective replication over lower bandwidth networks and to reduce any impact on production hosts.
Metro Mirror and Global Mirror both require the bandwidth to be sized to meet the peak workload. Global Mirror with Change Volumes must be sized only to meet the average workload across a cycle period.
Figure 5-17 shows a high-level conceptual view of Global Mirror with Change Volumes. GM/CV uses FlashCopy to maintain image consistency and to isolate host volumes from the replication process.
Figure 5-17 Global Mirror with Change Volumes
Global Mirror with Change Volumes also only sends one copy of a changed grain that might have been rewritten many times within the cycle period.
If the primary volume fails completely for any reason, GM/CV is designed to ensure that the secondary volume holds the same data as the primary did at a specific point in time. That period of data loss is typically between 5 minutes and 24 hours, but varies according to the design choices that you make.
Change Volumes hold point-in-time copies of 256 KB grains. If any of the disk blocks in a grain change, that grain is copied to the change volume to preserve its contents. Change Volumes are also maintained at the secondary site so that a consistent copy of the volume is always available even when the secondary volume is being updated.
Primary and Change Volumes are always in the same I/O group and the Change Volumes are always thin-provisioned. Change Volumes cannot be mapped to hosts and used for host I/O, and they cannot be used as a source for any other FlashCopy or Global Mirror operations.
Figure 5-18 shows how a Change Volume is used to preserve a point-in-time data set, which is then replicated to a secondary site. The data at the secondary site is in turn preserved by a Change Volume until the next replication cycle has completed.
Figure 5-18 Global Mirror with Change Volumes uses FlashCopy point-in-time copy technology
 
FlashCopy mapping note: These FlashCopy mappings are not standard FlashCopy volumes and are not accessible for general use. They are internal structures that are dedicated to supporting Global Mirror with Change Volumes.
The options for -cyclingmode are none and multi.
Specifying or taking the default none means that Global Mirror acts in its traditional mode without Change Volumes.
Specifying multi means that Global Mirror starts cycling based on the cycle period, which defaults to 300 seconds. The valid range is from 60 seconds to 24*60*60 seconds (86 400 seconds = one day).
If all of the changed grains cannot be copied to the secondary site within the specified time, then the replication is designed to take as long as it needs and to start the next replication as soon as the earlier one completes. You can choose to implement this approach by deliberately setting the cycle period to a short amount of time, which is a perfectly valid approach. However, remember that the shorter the cycle period, the less opportunity there is for peak write I/O smoothing, and the more bandwidth you need.
The -cyclingmode setting can only be changed when the Global Mirror relationship is in a stopped state.
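For example, the following sketch converts an existing, stopped Global Mirror relationship to cycling mode with a 10-minute cycle period; the relationship name DB_REL is hypothetical, the change volumes are assumed to have already been attached (with the -masterchange and -auxchange options), and the exact option names can vary by software version:
# DB_REL is a hypothetical relationship name; the relationship must be stopped
chrcrelationship -cyclingmode multi DB_REL
chrcrelationship -cycleperiodseconds 600 DB_REL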
Recovery point objective using Change Volumes
RPO is the maximum tolerable period in which data might be lost if you switch over to your secondary volume.
If a cycle completes within the specified cycle period, then the RPO is not more than two cycle periods. However, if the cycle does not complete within the cycle period, then the RPO is not more than the sum of the last two cycle times.
The current RPO can be determined by looking at the lsrcrelationship freeze time attribute. The freeze time is the time stamp of the last primary Change Volume that has completed copying to the secondary site. Note the following example:
1. The cycle period is the default of 5 minutes and a cycle is triggered at 6:00 AM. At 6:03 AM, the cycle completes. The freeze time would be 6:00 AM, and the RPO is 3 minutes.
2. The cycle starts again at 6:05 AM. The RPO now is 5 minutes. The cycle is still running at 6:12 AM, and the RPO is now up to 12 minutes because 6:00 AM is still the freeze time of the last complete cycle.
3. At 6:13 AM, the cycle completes and the RPO now is 8 minutes because 6:05 AM is the freeze time of the last complete cycle.
4. Because the cycle period has been exceeded, the cycle immediately starts again.
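You can check the freeze time at any point by displaying the detailed view of the relationship, as in the following sketch; DB_REL is a hypothetical relationship name, and the value appears in the freeze_time field of the output:
# DB_REL is a hypothetical relationship name
lsrcrelationship DB_REL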
5.3.2 Remote copy network planning
Remote copy partnerships and relationships do not work reliably if the connectivity on which they are running is configured incorrectly. This section focuses on the intersystem network, giving an overview of the remote system connectivity options.
Terminology
The intersystem network is specified in terms of latency and bandwidth. These parameters define the capabilities of the link regarding the traffic that is on it. They must be chosen to support all forms of traffic, including mirrored foreground writes, background copy writes, and intersystem heartbeat messaging (node-to-node communication).
Link latency is the time that is taken by data to move across a network from one location to another and is measured in milliseconds. The longer the time, the greater the performance impact.
 
Tip: SCSI write over FC requires two round trips per I/O operation, as shown in the following example:
2 (round trips) x 2 (legs per round trip) x 5 microseconds/km = 20 microseconds/km
At 50 km, the added latency is as follows:
20 microseconds/km x 50 km = 1000 microseconds = 1 msec (msec represents millisecond)
Each SCSI write I/O therefore incurs 1 ms of additional service time. At 100 km, the additional service time becomes 2 ms.
Link bandwidth is the network capacity to move data as measured in millions of bits per second (Mbps) or billions of bits per second (Gbps).
The term bandwidth is also used in the following context:
Storage bandwidth: The ability of the back-end storage to process I/O. Measures the amount of data (in bytes) that can be sent in a specified amount of time.
Remote copy partnership bandwidth (parameter): The rate at which background write synchronization is attempted (unit of MBps).
Intersystem connectivity supports mirrored foreground and background I/O. A portion of the link is also used to carry traffic that is associated with the exchange of low-level messaging between the nodes of the local and remote systems. A dedicated amount of the link bandwidth is required for the exchange of heartbeat messages and the initial configuration of intersystem partnerships.
Interlink bandwidth must support the following traffic:
Mirrored foreground writes, as generated by foreground processes at peak times
Background write synchronization, as defined by the Global Mirror bandwidth parameter
Intersystem communication (heartbeat messaging)
Fibre Channel connectivity is the standard connectivity that is used for the remote copy intersystem networks. It uses the Fibre Channel protocol and SAN infrastructures to interconnect the systems.
Native IP connectivity was introduced with IBM Spectrum Virtualize version 7.2 to implement intersystem networks by using standard TCP/IP infrastructures.
Network latency considerations
The maximum supported round-trip latency between sites depends on the type of partnership between systems, the version of software, and the system hardware that is used. Table 5-6 lists the maximum round-trip latency. This restriction applies to all variants of remote mirroring.
Table 5-6 Maximum round-trip latency

IBM Spectrum Virtualize version | System node hardware | FC | 1 Gbps IP | 10 Gbps IP
7.3 or earlier | All | 80 ms | 80 ms | 10 ms
7.4 or later | CG8 nodes (with a second four-port Fibre Channel adapter installed), DH8 nodes, and SV1 nodes | 250 ms | 80 ms | 10 ms
7.4 or later | All other models | 80 ms | 80 ms | 10 ms
More configuration requirements and guidelines apply to systems that perform remote mirroring over extended distances, where the round-trip time is greater than 80 ms. If you use remote mirroring between systems with 80 - 250 ms round-trip latency, you must meet the following additional requirements:
The RC buffer size setting must be 512 MB on each system in the partnership. This setting can be accomplished by running the chsystem -rcbuffersize 512 command on each system.
 
Note: Changing this setting is disruptive to Metro Mirror and Global Mirror operations. Use this command only before partnerships are created between systems or when all partnerships with the system are stopped.
Two Fibre Channel ports on each node that will be used for replication must be dedicated for replication traffic. This configuration can be achieved by using SAN zoning and port masking.
SAN zoning should be applied to provide separate intersystem zones for each local-remote I/O group pair that is used for replication. See “Remote system ports and zoning considerations” on page 176 for further zoning guidelines.
Link bandwidth that is used by internode communication
IBM Spectrum Virtualize uses part of the bandwidth for its internal intersystem heartbeat. The amount of traffic depends on how many nodes are in each of the local and remote systems. Table 5-7 shows the amount of traffic (in megabits per second) that is generated by different sizes of systems.
Table 5-7 IBM Spectrum Virtualize intersystem heartbeat traffic (megabits per second)
Local or remote system | Two nodes | Four nodes | Six nodes | Eight nodes
Two nodes | 5 | 6 | 6 | 6
Four nodes | 6 | 10 | 11 | 12
Six nodes | 6 | 11 | 16 | 17
Eight nodes | 6 | 12 | 17 | 21
These numbers represent the total traffic between the two systems when no I/O is occurring to a mirrored volume on the remote system. Half of the data is sent by one system, and half of the data is sent by the other system. The traffic is divided evenly over all available connections. Therefore, if you have two redundant links, half of this traffic is sent over each link during fault-free operation.
If the link between the sites is configured with redundancy to tolerate single failures, size the link so that the bandwidth and latency statements continue to be accurate even during single failure conditions.
Network sizing considerations
Proper network sizing is essential for remote copy services operations. Failing to estimate the network sizing requirements can lead to poor performance both in the remote copy services and in the production workload.
Consider that intersystem bandwidth should be capable of supporting the combined traffic of the following items:
Mirrored foreground writes, as generated by your server applications at peak times
Background resynchronization, for example, after a link outage
Inter-system heartbeat
Calculating the required bandwidth is essentially a question of mathematics based on your current workloads, so it is advisable to start by assessing your current workloads.
For Metro or Global Mirror, you need to know your peak write rates and I/O sizes down to at least a 5-minute interval. This information can be easily gained from tools like IBM Spectrum Control. Finally, you need to allow for unexpected peaks.
There are also unsupported tools available from IBM to help with sizing:
Do not compromise on bandwidth or network quality when planning a Metro or Global Mirror deployment. If bandwidth is likely to be an issue in your environment, consider Global Mirror with Change Volumes.
Bandwidth sizing examples
As an example, consider a business with the following I/O profile:
Average write size 8 KB (= 8 x 8 bits/1024 = 0.0625 Mb).
For most of the day between 8 AM and 8 PM, the write activity is around 1500 writes per second.
Twice a day (once in the morning and once in the afternoon), the system bursts up to 4500 writes per second for up to 10 minutes.
Outside of the 8 AM to 8 PM window, there is little or no I/O write activity.
This example is intended to represent a general traffic pattern that might be common in many medium-sized sites. Furthermore, 20% of the bandwidth must be left available for background synchronization.
Here we consider options for Metro Mirror, Global Mirror, and for Global Mirror with Change Volumes based on a cycle period of 30 minutes and 60 minutes.
Metro Mirror or Global Mirror requires bandwidth sized for the instantaneous peak of 4500 writes per second, as follows:
4500 x 0.0625 = 282 Mbps + 20% resync allowance + 5 Mbps heartbeat = 343 Mbps dedicated plus any safety margin plus growth
In the following two examples, the bandwidth for GM/CV needs to be able to handle the peak 30-minute period, or the peak 60-minute period.
GMCV peak 30-minute period example
If we look at this time broken into 10-minute periods, the peak 30-minute period is made up of one 10-minute period of 4500 writes per second, and two 10-minute periods of 1500 writes per second. The average write rate for the 30-minute cycle period can then be expressed mathematically as follows:
(4500 + 1500 + 1500) / 3 = 2500 writes/sec for a 30-minute cycle period
The minimum bandwidth that is required for the cycle period of 30 minutes is as follows:
2500 x 0.0625 = 157 Mbps + 20% resync allowance + 5 Mbps heartbeat = 195 Mbps dedicated plus any safety margin plus growth
GMCV peak 60-minute period example
For a cycle period of 60 minutes, the peak 60-minute period is made up of one 10-minute period of 4500 writes per second, and five 10-minute periods of 1500 writes per second. The average write for the 60-minute cycle period can be expressed as follows:
(4500 + 5 x 1500) / 6 = 2000 writes/sec for a 60-minute cycle period
The minimum bandwidth that is required for a cycle period of 60 minutes is as follows:
2000 x 0.0625 = 125 Mbps + 20% resync allowance + 5 Mbps heartbeat = 155 Mbps dedicated plus any safety margin plus growth
Now consider a case where the business does not have aggressive RPO requirements and does not want to provide dedicated bandwidth for Global Mirror, but the network is available and unused at night, so Global Mirror can use it then. There is an element of risk here: if the network is unavailable for any reason during the night, GM/CV must keep running during the day until it catches up. Therefore, you would need to allow a much higher resync allowance in your replication window, for example, 100 percent.
A GM/CV replication based on daily point-in-time copies at 8 PM each night, and replicating until 8 AM at the latest would probably require at least the following bandwidth:
(9000 + 70 x 1500) / 72 = 1584 x 0.0625 = 99 Mbps + 100% + 5 Mbps heartbeat
= 203 Mbps at night plus any safety margin plus growth, non-dedicated, time-shared with daytime traffic
Global Mirror with Change Volumes provides a way to maintain point-in-time copies of data at a secondary site where insufficient bandwidth is available to replicate the peak workloads in real time.
Another factor that can reduce the bandwidth that is required for Global Mirror with Change Volumes is that it only sends one copy of a changed grain, which might have been rewritten many times within the cycle period.
Remember that these are examples. The central principle of sizing is that you need to know your data write rate, which is the number of write I/Os and the average size of those I/Os. For Metro Mirror and Global Mirror, you need to know the peak write I/O rates. For GM/CV, you need to know the average write I/O rates.
 
GMCV bandwidth: In the preceding examples, the bandwidth estimation for GMCV is based on the assumption that write operations occur in such a way that a Change Volume grain (256 KB in size) is completely changed before it is transferred to the remote site. In real life, this situation is unlikely to occur. Usually only a portion of a grain is changed during a GMCV cycle, but the transfer process always copies the whole grain to the remote site. This behavior can lead to an unforeseen extra load on the transfer bandwidth that, in the edge case, can be even higher than the bandwidth required for standard Global Mirror.
Fibre Channel connectivity
You must remember several considerations when you use Fibre Channel technology for the intersystem network:
Hops
Redundancy
The intersystem network must adopt the same policy toward redundancy as for the local and remote systems to which it is connecting. The ISLs must have redundancy, and the individual ISLs must provide the necessary bandwidth in isolation.
Basic topology and problems
Because of the nature of Fibre Channel, you must avoid ISL congestion whether within individual SANs or across the intersystem network. Although FC (and IBM Spectrum Virtualize) can handle an overloaded host or storage array, the mechanisms in FC are ineffective for dealing with congestion in the fabric in most circumstances. The problems that are caused by fabric congestion can range from dramatically slow response time to storage access loss. These issues are common with all high-bandwidth SAN devices and are inherent to FC. They are not unique to the IBM Spectrum Virtualize/Storwize products.
When an FC network becomes congested, the FC switches stop accepting more frames until the congestion clears. They can also drop frames. Congestion can quickly move upstream in the fabric and clog the end devices from communicating anywhere.
This behavior is referred to as head-of-line blocking. Although modern SAN switches internally have a nonblocking architecture, head-of-line-blocking still exists as a SAN fabric problem. Head-of-line blocking can result in IBM Spectrum Virtualize nodes that cannot communicate with storage subsystems or to mirror their write caches because you have a single congested link that leads to an edge switch.
Switches and ISL oversubscription
As specified in Chapter 2, “Back-end storage” on page 37, the suggested maximum host port to ISL ratio is 7:1. With modern 8 Gbps or 16 Gbps SAN switches, this ratio implies an average bandwidth (in one direction) per host port of approximately 230 MBps (16 Gbps).
You must take peak loads (not average loads) into consideration. For example, while a database server might use only 20 MBps during regular production workloads, it might perform a backup at higher data rates.
Congestion to one switch in a large fabric can cause performance issues throughout the entire fabric, including traffic between IBM Spectrum Virtualize nodes and storage subsystems, even if they are not directly attached to the congested switch. The reasons for these issues are inherent to FC flow control mechanisms, which are not designed to handle fabric congestion. Therefore, any estimates for required bandwidth before implementation must have a safety factor that is built into the estimate.
On top of the safety factor for traffic expansion, implement a spare ISL or ISL trunk. The spare ISL or ISL trunk can provide a fail-safe that avoids congestion if an ISL fails because of issues, such as a SAN switch line card or port blade failure.
Exceeding the standard 7:1 oversubscription ratio requires you to implement fabric bandwidth threshold alerts. When utilization on one of your ISLs exceeds 70%, you must schedule fabric changes to distribute the load further.
You must also consider the bandwidth consequences of a complete fabric outage. Although a complete fabric outage is a fairly rare event, insufficient bandwidth can turn a single-SAN outage into a total access loss event.
Take the bandwidth of the links into account. It is common to have ISLs run faster than host ports, which reduces the number of required ISLs.
Distance extensions options
To implement remote mirroring over a distance by using the Fibre Channel, you have the following choices:
Optical multiplexors, such as dense wavelength division multiplexing (DWDM) or coarse wavelength division multiplexing (CWDM) devices
Long-distance Small Form-factor Pluggable (SFP) transceivers and XFPs
Fibre Channel-to-IP conversion boxes
Of these options, the optical distance extension is the preferred method. IP distance extension introduces more complexity, is less reliable, and has performance limitations. However, optical distance extension can be impractical in many cases because of cost or unavailability.
For the list of supported SAN routers and FC extenders, see the support page at this website:
Optical multiplexors
Optical multiplexors can extend a SAN up to hundreds of kilometers (or miles) at high speeds. For this reason, they are the preferred method for long-distance expansion. If you use multiplexor-based distance extension, closely monitor your physical link error counts in your switches. Optical communication devices are high-precision units. When they shift out of calibration, you will start to see errors in your frames.
Long-distance SFPs and XFPs
Long-distance optical transceivers have the advantage of extreme simplicity. You do not need any expensive equipment, and you have only a few configuration steps to perform. However, ensure that you only use transceivers that are designed for your particular SAN switch.
Fibre Channel over IP
Fibre Channel over IP (FCIP) is by far the most common and least expensive form of distance extension. It is also complicated to configure. Relatively subtle errors can have severe performance implications.
With IP-based distance extension, you must dedicate bandwidth to your FCIP traffic if the link is shared with other IP traffic. Do not assume that because the link between two sites has low traffic or is used only for email, this type of traffic is always the case. FC is far more sensitive to congestion than most IP applications.
Also, when you are communicating with the networking architects for your organization, make sure to distinguish between megabytes per second as opposed to megabits per second. In the storage world, bandwidth often is specified in megabytes per second (MBps), and network engineers specify bandwidth in megabits per second (Mbps).
Hops
The hop count is not increased by the intersite connection architecture. For example, if you have a SAN extension that is based on DWDM, the DWDM components are transparent and do not count toward the number of hops. The hop count limit within a fabric is set by the operating system of the fabric devices (switch or director). It is used to derive a frame hold time value for each fabric device.
This hold time value is the maximum amount of time that a frame can be held in a switch before it is dropped or a fabric-busy condition is returned. For example, a frame might be held if its destination port is unavailable. The hold time is derived from a formula that uses the error detect timeout value and the resource allocation timeout value. Every extra hop is considered to add about 1.2 microseconds of latency to the transmission.
Currently, IBM Spectrum Virtualize and Storwize remote copy services support three hops when protocol conversion exists. Therefore, if you have DWDM extended between primary and secondary sites, three SAN directors or switches can exist between the primary and secondary systems.
Buffer credits
SAN device ports need memory to temporarily store frames as they arrive, assemble them in sequence, and deliver them to the upper layer protocol. The number of frames that a port can hold is called its buffer credit. Fibre Channel architecture is based on a flow control that ensures a constant stream of data to fill the available pipe.
When two FC ports begin a conversation, they exchange information about their buffer capacities. An FC port sends only the number of buffer frames for which the receiving port gives credit. This method avoids overruns and provides a way to maintain performance over distance by filling the pipe with in-flight frames or buffers.
The following types of transmission credits are available:
Buffer_to_Buffer Credit
During login, N_Ports and F_Ports at both ends of a link establish its Buffer to Buffer Credit (BB_Credit).
End_to_End Credit
In the same way during login, all N_Ports establish End-to-End Credit (EE_Credit) with each other. During data transmission, a port must not send more frames than the buffer of the receiving port can handle before you receive an indication from the receiving port that it processed a previously sent frame. Two counters are used: BB_Credit_CNT and EE_Credit_CNT. Both counters are initialized to zero during login.
 
FC Flow Control: Each time that a port sends a frame, it increments BB_Credit_CNT and EE_Credit_CNT by one. When it receives R_RDY from the adjacent port, it decrements BB_Credit_CNT by one. When it receives ACK from the destination port, it decrements EE_Credit_CNT by one.
At any time, if BB_Credit_CNT becomes equal to the BB_Credit, or EE_Credit_CNT becomes equal to the EE_Credit of the receiving port, the transmitting port stops sending frames until the respective count is decremented.
The previous statements are true for Class 2 service. Class 1 is a dedicated connection. Therefore, BB_Credit is not important, and only EE_Credit is used (EE Flow Control). However, Class 3 is an unacknowledged service. Therefore, it uses only BB_Credit (BB Flow Control), but the mechanism is the same in all cases.
Here, you see the importance that the number of buffers has in overall performance. You need enough buffers to ensure that the transmitting port can continue to send frames without stopping to use the full bandwidth, which is true with distance. The total amount of buffer credit needed to optimize the throughput depends on the link speed and the average frame size.
For example, consider an 8 Gbps link connecting two switches that are 100 km apart. At 8 Gbps, a full frame (2148 bytes) occupies about 0.51 km of fiber. In a 100 km link, you can send 198 frames before the first one reaches its destination. You need an ACK to go back to the start to fill EE_Credit again. You can send another 198 frames before you receive the first ACK.
You need at least 396 buffers to allow for nonstop transmission at 100 km distance. The maximum distance that can be achieved at full performance depends on the capabilities of the FC node that is attached at either end of the link extenders, which are vendor-specific. A match should occur between the buffer credit capability of the nodes at either end of the extenders.
Remote system ports and zoning considerations
Ports and zoning requirements for the remote system partnership have changed over time. The current preferred configuration is described in the following Flash Alert:
The preferred practice for the IBM Spectrum Virtualize and Storwize systems is to provision dedicated node ports for local node-to-node traffic (by using port masking) and isolate Global Mirror node-to-node traffic between the local nodes from other local SAN traffic.
 
Remote port masking: To isolate the node-to-node traffic from the remote copy traffic, the local and remote port masking implementation is preferable.
This configuration of local node port masking is less of a requirement on Storwize family systems, where traffic between node canisters in an I/O group is serviced by the dedicated inter-canister link in the enclosure. The following guidelines also apply to the remote system connectivity:
Partnered systems should use the same number of nodes in each system for replication.
For maximum throughput, all nodes in each system should be used for replication, both in terms of balancing the preferred node assignment for volumes and for providing intersystem Fibre Channel connectivity.
Where possible, use the minimum number of partnerships between systems. For example, assume site A contains systems A1 and A2, and site B contains systems B1 and B2. In this scenario, creating separate partnerships between pairs of systems (such as A1-B1 and A2-B2) offers greater performance for Global Mirror replication between sites than a configuration with partnerships defined between all four systems.
For the zoning, the following rules for the remote system partnership apply:
For Metro Mirror and Global Mirror configurations where the round-trip latency between systems is less than 80 milliseconds, zone two Fibre Channel ports on each node in the local system to two Fibre Channel ports on each node in the remote system.
For Metro Mirror and Global Mirror configurations where the round-trip latency between systems is more than 80 milliseconds, apply SAN zoning to provide separate intersystem zones for each local-remote I/O group pair that is used for replication, as shown in Figure 5-19.
Figure 5-19 Zoning scheme for >80 ms remote copy partnerships
Native IP connectivity
Remote Mirroring over IP communication is supported on the IBM Spectrum Virtualize and Storwize Family systems by using Ethernet communication links. The IBM Spectrum Virtualize Software IP replication uses innovative Bridgeworks SANSlide technology to optimize network bandwidth and utilization.
This new function enables the use of a lower-speed and lower-cost networking infrastructure for data replication. Bridgeworks’ SANSlide technology, which is integrated into the IBM Spectrum Virtualize Software, uses artificial intelligence to help optimize network bandwidth use and adapt to changing workload and network conditions. This technology can improve remote mirroring network bandwidth usage up to three times, which can enable clients to deploy a less costly network infrastructure, or speed up remote replication cycles to enhance disaster recovery effectiveness.
The native IP replication is covered in detail in 5.4, “Native IP replication” on page 203.
5.3.3 Remote copy services planning
When you plan for remote copy services, you must keep in mind the considerations that are outlined in the following sections.
Remote copy configuration limits
To plan for and implement remote copy services, you must check the configuration limits and adhere to them. Table 5-8 shows the limits for a system that apply to IBM Spectrum Virtualize V7.8.
Table 5-8 Remote copy maximum limits
Remote copy property | Maximum | Applies to | Comment
Remote Copy (Metro Mirror and Global Mirror) relationships per system | 10000 | SAN Volume Controller models SV1, DH8, CG8, and CF8; Storwize V7000 models 524 (Gen2) and 624 (Gen2+) | This configuration can be any mix of Metro Mirror and Global Mirror relationships.
Remote Copy (Metro Mirror and Global Mirror) relationships per system | 8192 | Any other Storwize model | This configuration can be any mix of Metro Mirror and Global Mirror relationships. Maximum requires an 8-node system (volumes per I/O group limit applies).
Active-Active relationships | 1250 | SAN Volume Controller models SV1, DH8, CG8, and CF8; Storwize V7000 models 524 (Gen2) and 624 (Gen2+) | This is the limit for the number of HyperSwap volumes in a system.
Active-Active relationships | 1024 | Any other Storwize model | This is the limit for the number of HyperSwap volumes in a system.
Remote Copy relationships per consistency group | None | All models | No limit is imposed beyond the Remote Copy relationships per system limit.
Remote Copy consistency groups per system | 256 | All models |
Total Metro Mirror and Global Mirror volume capacity per I/O group | 1024 TB | All models | This limit is the total capacity for all master and auxiliary volumes in the I/O group.
Total number of Global Mirror with Change Volumes relationships per system | 256 | All models |
Inter-system IP partnerships per system | 1 | All models | A system can be partnered with up to three remote systems. A maximum of one of those can be IP; the other two must be FC.
I/O groups per system in IP partnerships | 2 | All models | The nodes from a maximum of two I/O groups per system can be used for IP partnership.
Inter-site links per IP partnership | 2 | All models | A maximum of two inter-site links can be used between two IP partnership sites.
Ports per node | 1 | All models | A maximum of one port per node can be used for IP partnership.
IP partnership software compression limit | 70 MBps | SAN Volume Controller models CG8 and CF8; Storwize V7000 model 124 (Gen1) |
IP partnership software compression limit | 140 MBps | SAN Volume Controller models SV1 and DH8; Storwize V7000 models 524 (Gen2) and 624 (Gen2+) |
Similar to FlashCopy, the remote copy services require memory to allocate the bitmap structures that are used to track updates while volumes are suspended or synchronizing. The default amount of memory for remote copy services is 20 MB. This value can be increased or decreased by using the chiogrp command. The maximum amount of memory that can be specified for remote copy services is 512 MB. The grain size for the remote copy services is 256 KB.
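For example, the following sketch increases the remote copy bitmap memory of I/O group 0 to 40 MB; the 40 MB value is illustrative, and io_grp0 is assumed to be the default I/O group name:
# 40 MB is an illustrative value; io_grp0 is the default I/O group name
chiogrp -feature remote -size 40 io_grp0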
Remote copy restrictions
To use Metro Mirror and Global Mirror, you must adhere to the following rules:
You must have the same target volume size as the source volume size. However, the target volume can be a different type (image, striped, or sequential mode) or have different cache settings (cache-enabled or cache-disabled).
You cannot move Metro Mirror or Global Mirror source or target volumes to different I/O groups.
You cannot resize Metro Mirror or Global Mirror volumes.
Intrasystem Metro Mirror or Global Mirror relationships can be created only between volumes in the same I/O group.
 
Intrasystem remote copy: The intrasystem remote copy is not supported on IBM Spectrum Virtualize/Storwize systems running version 6 or later.
The use of cache-disabled volumes in a Global Mirror relationship is not recommended.
 
Remote copy upgrade scenarios
When you upgrade system software where the system participates in one or more intersystem relationships, upgrade only one system at a time. That is, do not upgrade the systems concurrently.
 
Attention: Upgrading both systems concurrently is not monitored by the software upgrade process.
Allow the software upgrade to complete on one system before it is started on the other system. Upgrading both systems concurrently can lead to a loss of synchronization. In stress situations, it can further lead to a loss of availability.
Pre-existing remote copy relationships are unaffected by a software upgrade that is performed correctly.
Remote copy compatibility cross-reference
Although it is not a best practice, a remote copy partnership can be established, with some restrictions, among systems that run different IBM Spectrum Virtualize versions. For more information about the compatibility of intersystem Metro Mirror and Global Mirror relationships between IBM Spectrum Virtualize code levels, see SAN Volume Controller Inter-system Metro Mirror and Global Mirror Compatibility Cross Reference, S1003646. This publication is available at this website:
Volume placement considerations
You can optimize the distribution of volumes within I/O groups at the local and remote systems to maximize performance.
Although defined at a system level, the bandwidth (the rate of background copy) is then subdivided and distributed on a per-node basis. It is divided evenly between the nodes, which have volumes that perform a background copy for active copy relationships.
This bandwidth allocation is independent from the number of volumes for which a node is responsible. Each node, in turn, divides its bandwidth evenly between the (multiple) remote copy relationships with which it associates volumes that are performing a background copy.
Volume preferred node
Conceptually, a connection (path) goes between each node on the primary system to each node on the remote system. Write I/O, which is associated with remote copying, travels along this path. Each node-to-node connection is assigned a finite amount of remote copy resource and can sustain only in-flight write I/O to this limit.
The node-to-node in-flight write limit is determined by the number of nodes in the remote system. The more nodes that exist at the remote system, the lower the limit is for the in-flight write I/Os from a local node to a remote node. That is, less data can be outstanding from any one local node to any other remote node. Therefore, to optimize performance, Global Mirror volumes must have their preferred nodes distributed evenly between the nodes of the systems.
The preferred node property of a volume helps to balance the I/O load between nodes in that I/O group. This property is also used by remote copy to route I/O between systems.
The IBM Spectrum Virtualize node/Storwize canister that receives a write for a volume is normally the preferred node of the volume. For volumes in a remote copy relationship, that node is also responsible for sending that write to the preferred node of the target volume. The primary preferred node is also responsible for sending any writes that relate to the background copy. Again, these writes are sent to the preferred node of the target volume.
 
Each node of the remote system has a fixed pool of remote copy system resources for each node of the primary system. That is, each remote node has a separate queue for I/O from each of the primary nodes. This queue is a fixed size and is the same size for every node.
If preferred nodes for the volumes of the remote system are set so that every combination of primary node and secondary node is used, remote copy performance is maximized.
Figure 5-20 shows an example of remote copy resources that are not optimized. Volumes from the local system are replicated to the remote system. All volumes with a preferred node of node 1 are replicated to the remote system, where the target volumes also have a preferred node of node 1.
Figure 5-20 Remote copy resources that are not optimized
With this configuration, the resources for remote system node 1 that are reserved for local system node 2 are not used. The resources for local system node 1 that are used for remote system node 2 also are not used.
If the configuration changes to the configuration that is shown in Figure 5-21, all remote copy resources for each node are used and remote copy operates with better performance.
Figure 5-21 Optimized Global Mirror resources
Background copy considerations
The remote copy partnership bandwidth parameter explicitly defines the rate at which the background copy is attempted, but also implicitly affects foreground I/O. Background copy bandwidth can affect foreground I/O latency in one of the following ways:
Increasing latency of foreground I/O
If the remote copy partnership bandwidth parameter is set too high for the actual intersystem network capability, the background copy resynchronization writes use too much of the intersystem network. It starves the link of the ability to service synchronous or asynchronous mirrored foreground writes. Delays in processing the mirrored foreground writes increase the latency of the foreground I/O as perceived by the applications.
Read I/O overload of primary storage
If the remote copy partnership background copy rate is set too high, the added read I/Os that are associated with background copy writes can overload the storage at the primary site and delay foreground (read and write) I/Os.
Write I/O overload of auxiliary storage
If the remote copy partnership background copy rate is set too high for the storage at the secondary site, the background copy writes overload the auxiliary storage. Again, they delay the synchronous and asynchronous mirrored foreground write I/Os.
 
Important: An increase in the peak foreground workload can have a detrimental effect on foreground I/O. It does so by pushing more mirrored foreground write traffic along the intersystem network, which might not have the bandwidth to sustain it. It can also overload the primary storage.
To set the background copy bandwidth optimally, consider all aspects of your environments, starting with the following biggest contributing resources:
Primary storage
Intersystem network bandwidth
Auxiliary storage
To set the background copy bandwidth optimally, ensure that you consider all the above resources. Provision the most restrictive of these three resources between the background copy bandwidth and the peak foreground I/O workload. Perform this provisioning by calculation or by determining experimentally how much background copy can be allowed before the foreground I/O latency becomes unacceptable. Then, reduce the background copy to accommodate peaks in workload.
Changes in the environment, or loading of it, can affect the foreground I/O. IBM Spectrum Virtualize and Storwize technology provides a means to monitor, and a parameter to control, how foreground I/O is affected by running remote copy processes. IBM Spectrum Virtualize software monitors the delivery of the mirrored foreground writes. If latency or performance of these writes extends beyond a (predefined or client defined) limit for a period, the remote copy relationship is suspended (see 5.3.5, “1920 error” on page 191).
Finally, note that with Global Mirror Change Volume, the cycling process that transfers the data from the local to the remote system is a background copy task. For this reason, the background copy rate setting affects the available bandwidth not only during the initial synchronization, but also during the normal cycling process.
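If you determine that the background copy is too aggressive for the link or for the back-end storage, you can lower the partnership background copy rate, as in the following sketch; ITSO_REMOTE is a hypothetical partner system name, the 30% rate is illustrative, and the exact option name can vary by software version:
# ITSO_REMOTE and the 30% rate are hypothetical values
chpartnership -backgroundcopyrate 30 ITSO_REMOTE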
Back-end storage considerations
To reduce the overall solution costs, it is a common practice to provide the remote system with lower performance characteristics compared to the local system, especially when asynchronous remote copy technologies are used. This approach can be risky, especially with Global Mirror, where the application performance at the primary system can indeed be limited by the performance of the remote system.
The recommendation is to perform an accurate back-end resource sizing for the remote system to fulfill the following capabilities:
The peak application workload to the Global Mirror or Metro Mirror volumes
The defined level of background copy
Any other I/O that is performed at the remote site
Remote Copy tunable parameters
Several commands and parameters help to control remote copy and its default settings. You can display the properties and features of the systems by using the lssystem command. Also, you can change the features of systems by using the chsystem command.
relationshipbandwidthlimit
The relationshipbandwidthlimit is an optional parameter that specifies the new background copy bandwidth in the range 1 - 1000 MBps. The default is 25 MBps. This parameter operates system-wide and defines the maximum background copy bandwidth that any relationship can adopt. The existing background copy bandwidth settings that are defined on a partnership continue to operate, with the lower of the partnership rate and this per-relationship limit being attempted.
 
Important: Do not set this value higher than the default without establishing that the higher bandwidth can be sustained.
The relationshipbandwidthlimit parameter also applies to Metro Mirror relationships.
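For example, the following sketch lowers the system-wide per-relationship background copy limit to 10 MBps; the value is illustrative only:
# 10 MBps is an illustrative value
chsystem -relationshipbandwidthlimit 10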
gmlinktolerance and gmmaxhostdelay
The gmlinktolerance and gmmaxhostdelay parameters are critical in the system for deciding internally whether to terminate a relationship due to a performance problem. In most cases, these two parameters need to be considered in tandem. The defaults would not normally be changed unless you had a specific reason to do so.
The gmlinktolerance parameter can be thought of as how long you allow the host delay to go on being significant before you decide to terminate a Global Mirror volume relationship. This parameter accepts values of 20 - 86400 seconds in increments of 10 seconds. The default is 300 seconds. You can disable the link tolerance by entering a value of zero for this parameter.
The gmmaxhostdelay parameter can be thought of as the maximum host I/O impact that is due to Global Mirror. That is, it compares how long a local I/O takes with Global Mirror turned off against how long it takes with Global Mirror turned on. The difference is the host delay due to Global Mirror tag and forward processing.
Although the default settings are adequate for most situations, increasing one parameter while reducing another might deliver a tuned performance environment for a particular circumstance.
Example 5-1 shows how to change gmlinktolerance and the gmmaxhostdelay parameters using the chsystem command.
Example 5-1 Changing gmlinktolerance to 30 and gmmaxhostdelay to 100
chsystem -gmlinktolerance 30
chsystem -gmmaxhostdelay 100
 
Test and monitor: To reiterate, thoroughly test and carefully monitor the host impact of any changes like these before putting them into a live production environment.
Considerations for setting the gmlinktolerance and gmmaxhostdelay parameters are described later in this section.
rcbuffersize
rcbuffersize was introduced with the Version 6.2 code level so that systems with intense and bursty write I/O would not fill the internal buffer while Global Mirror writes were undergoing sequence tagging.
 
Important: Do not change the rcbuffersize parameter except under the direction of IBM Support.
Example 5-2 shows how to change rcbuffersize to 64 MB by using the chsystem command. The default value for rcbuffersize is 48 MB and the maximum is 512 MB.
Example 5-2 Changing rcbuffersize to 64 MB
chsystem -rcbuffersize 64
Remember that any additional buffers you allocate are taken away from the general cache.
maxreplicationdelay and partnershipexclusionthreshold
IBM Spectrum Virtualize version 7.6 introduced two new parameters, maxreplicationdelay and partnershipexclusionthreshold, for remote copy advanced tuning.
maxreplicationdelay is a system-wide parameter that defines a maximum latency (in seconds) for any individual write passing through the Global Mirror logic. If a write is hung for that time, for example due to a rebuilding array on the secondary system, Global Mirror stops the relationship (and any containing consistency group), triggering a 1920 error.
The partnershipexclusionthreshold parameter was introduced to allow users to set the timeout for an I/O that triggers a temporary drop of the link to the remote system. The value must be a number in the range 30 - 315.
 
Important: Do not change the partnershipexclusionthreshold parameter except under the direction of IBM Support.
Link delay simulation parameters
Even though Global Mirror is an asynchronous replication method, there can be an impact to server applications due to Global Mirror managing transactions and maintaining write order consistency over a network. To mitigate this impact, as a testing and planning feature, Global Mirror allows you to simulate the effect of the round-trip delay between sites by using the following parameters:
The gminterclusterdelaysimulation parameter
This optional parameter specifies the intersystem delay simulation, which simulates the Global Mirror round-trip delay between two systems in milliseconds. The default is 0. The valid range is 0 - 100 milliseconds.
The gmintraclusterdelaysimulation parameter
This optional parameter specifies the intrasystem delay simulation, which simulates the Global Mirror round-trip delay in milliseconds. The default is 0. The valid range is 0 - 100 milliseconds.
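For example, the following chsystem commands (a sketch) first simulate a 20 ms intersystem round-trip delay for testing purposes, and then remove the simulation by returning the parameter to its default of 0:
chsystem -gminterclusterdelaysimulation 20
chsystem -gminterclusterdelaysimulation 0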
5.3.4 Remote copy use cases
This section describes the common use cases of remote copy services.
Synchronizing a remote copy relationship
When creating a remote copy relationship, two options regarding the initial synchronization process are available:
The not synchronized option is the default. With this option, when a remote copy relationship is started, a full data synchronization occurs between the source and target volumes. It is the simplest option in that it requires no other administrative activity apart from issuing the necessary IBM Spectrum Virtualize commands. However, in some environments, the available bandwidth makes this option unsuitable.
The already synchronized option does not force any data synchronization when the relationship is started. The administrator must ensure that the source and target volumes contain identical data before a relationship is created. The administrator can perform this check in one of the following ways:
 – Create both volumes with the security delete feature to change all data to zero.
 – Copy a complete tape image (or other method of moving data) from one disk to the other.
With either technique, ensure that no write I/O takes place on the source or target volumes before the relationship is established. The administrator must then complete the following actions:
 – Create the relationship with the already synchronized settings (-sync option)
 – Start the relationship
 
Attention: If you do not perform these steps correctly, the remote copy reports the relationship as being consistent, when it is not. This setting is likely to make any auxiliary volume useless.
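As a minimal sketch of the already synchronized approach, the following commands create and start a relationship with the -sync option. The volume, system, and relationship names (vol01, vol01_aux, RemoteSystem, and rcrel0) are hypothetical:
mkrcrelationship -master vol01 -aux vol01_aux -cluster RemoteSystem -sync -name rcrel0
startrcrelationship rcrel0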
By understanding the methods to start a Metro Mirror and Global Mirror relationship, you can use one of them as a means to implement the remote copy relationship, save bandwidth, and resize the Global Mirror volumes.
Global Mirror relationships, saving bandwidth, and resizing volumes
Consider a situation where you have a large source volume (or many source volumes) that you want to replicate to a remote site. Your planning shows that the mirror initial sync time takes too long (or is too costly if you pay for the traffic that you use). In this case, you can set up the sync by using another medium that is less expensive.
Another reason that you might want to use this method is if you want to increase the size of the volume that is in a Metro Mirror relationship or in a Global Mirror relationship. To increase the size of these volumes, delete the current mirror relationships and redefine the mirror relationships after you resize the volumes.
This example uses tape media to perform the initial synchronization of the Metro Mirror or Global Mirror target before remote copy services are used to maintain the Metro Mirror or Global Mirror relationship. This example does not require downtime for the hosts that use the source volumes.
Before you set up Global Mirror relationships, save bandwidth, and resize volumes, complete the following steps:
1. Ensure that the hosts are up and running and are using their volumes normally. No Metro Mirror relationship nor Global Mirror relationship is defined yet.
Identify all the volumes that become the source volumes in a Metro Mirror relationship or in a Global Mirror relationship.
2. Establish the IBM Spectrum Virtualize system partnership with the target IBM Spectrum Virtualize system.
To set up Global Mirror relationships, save bandwidth, and resize volumes, complete the following steps:
1. Define a Metro Mirror relationship or a Global Mirror relationship for each source disk. When you define the relationship, ensure that you use the -sync option, which stops the system from performing an initial sync.
 
Attention: If you do not use the -sync option, all of these steps are redundant because the IBM Spectrum Virtualize/Storwize system performs a full initial synchronization anyway.
2. Stop each mirror relationship by using the -access option, which enables write access to the target volumes. You need this write access later.
3. Copy the source volume to the alternative media by using the dd command to copy the contents of the volume to tape. Another option is to use your backup tool (for example, IBM Spectrum Protect) to make an image backup of the volume.
 
Change tracking: Although the source is being modified while you are copying the image, the IBM Spectrum Virtualize/Storwize system is tracking those changes. The image that you create might have some of the changes and is likely to also miss some of the changes.
When the relationship is restarted, the IBM Spectrum Virtualize/Storwize system applies all of the changes that occurred since the relationship stopped in step 2. After all the changes are applied, you have a consistent target image.
4. Ship your media to the remote site and apply the contents to the targets of the Metro Mirror or Global Mirror relationship. You can mount the Metro Mirror and Global Mirror target volumes to a UNIX server and use the dd command to copy the contents of the tape to the target volume.
If you used your backup tool to make an image of the volume, follow the instructions for your tool to restore the image to the target volume. Remember to remove the mount if the host is temporary.
 
Tip: It does not matter how long it takes to get your media to the remote site and perform this step. However, the faster you can get the media to the remote site and load it, the quicker IBM Spectrum Virtualize/Storwize system starts running and maintaining the Metro Mirror and Global Mirror.
5. Unmount the target volumes from your host. When you start the Metro Mirror and Global Mirror relationship later, the IBM Spectrum Virtualize/Storwize system stops write access to the volume while the mirror relationship is running.
6. Start your Metro Mirror and Global Mirror relationships. The relationships must be started with the -clean parameter. In this way, any changes that are made on the secondary volume are ignored, and only changes made on the clean primary volume are considered when synchronizing the primary and secondary volumes.
7. While the mirror relationship catches up, the target volume is not usable at all. When it reaches ConsistentSynchronized status, your remote volume is ready for use in a disaster.
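In CLI terms, the key commands of this procedure (steps 1, 2, and 6) can be sketched as follows, reusing the hypothetical names from the earlier example:
mkrcrelationship -master vol01 -aux vol01_aux -cluster RemoteSystem -sync -name rcrel0
stoprcrelationship -access rcrel0
startrcrelationship -clean rcrel0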
Changing the remote copy type
Changing the remote copy type for an existing relationship is a straightforward task: stop the relationship, if it is active, and change its properties to set the new remote copy type. Remember to create the change volumes when changing from Metro Mirror or Global Mirror to Global Mirror with Change Volumes.
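As a hedged sketch, changing a relationship named rcrel0 (hypothetical) from Global Mirror to Global Mirror with Change Volumes might look like the following sequence, where vol01_chg and vol01_aux_chg are previously created change volumes and the auxiliary change volume is normally assigned from the remote system:
stoprcrelationship rcrel0
chrcrelationship -cyclingmode multi rcrel0
chrcrelationship -masterchange vol01_chg rcrel0
chrcrelationship -auxchange vol01_aux_chg rcrel0
startrcrelationship rcrel0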
Remote copy source as a FlashCopy target
Starting with V6.2, a FlashCopy target volume can be used as the primary volume of a Metro Mirror or Global Mirror relationship. The ability to use a Metro Mirror or Global Mirror source as a FlashCopy target helps in disaster recovery scenarios. You can have both the FlashCopy function and Metro Mirror or Global Mirror operating concurrently on the same volume.
However, the way that these functions can be used together has the following constraints:
A FlashCopy mapping must be in the idle_copied state when its target volume is the secondary volume of a Metro Mirror or Global Mirror relationship.
A FlashCopy mapping cannot be manipulated to change the contents of the target volume of that mapping when the target volume is the primary volume of a Metro Mirror or Global Mirror relationship that is actively mirroring. A FlashCopy mapping cannot be started while the target volume is in an active remote copy relationship.
The I/O group for the FlashCopy mappings must be the same as the I/O group for the FlashCopy target volume.
Native controller Advanced Copy Services functions
Native copy services are not supported on all storage controllers. For more information about the known limitations, see Using Native Controller Copy Services, S1002852, at this website:
Using a back-end controller’s copy services
When IBM Spectrum Virtualize uses a LUN from a storage controller that is a source or target of Advanced Copy Services functions, you can use only that LUN as a cache-disabled image mode volume.
If you leave caching enabled on a volume, the underlying controller does not receive any write I/Os as the host writes them. IBM Spectrum Virtualize caches them and processes them later. This process can have more ramifications if a target host depends on the write I/Os from the source host as they are written.
Performing cascading copy service functions
Cascading copy service functions that use IBM Spectrum Virtualize/Storwize are not directly supported. However, you might require a three-way (or more) replication by using copy service functions (synchronous or asynchronous mirroring). You can address this requirement both by using IBM Spectrum Virtualize/Storwize copy services and by combining IBM Spectrum Virtualize/Storwize copy services (with image mode cache-disabled volumes) and native storage controller copy services.
Cascading with native storage controller copy services
Figure 5-22 describes the configuration for three-site cascading by using the native storage controller copy services in combination with IBM Spectrum Virtualize/Storwize remote copy functions.
Figure 5-22 Using three-way copy services
In Figure 5-22, the primary site uses IBM Spectrum Virtualize/Storwize remote copy functions (Global Mirror or Metro Mirror) at the secondary site. Therefore, if a disaster occurs at the primary site, the storage administrator enables access to the target volume (from the secondary site) and the business application continues processing.
While the business continues processing at the secondary site, the storage controller copy services replicate to the third site.
Cascading with IBM Spectrum Virtualize and Storwize systems copy services
A cascading-like solution is also possible by combining the IBM Spectrum Virtualize/Storwize copy services. These remote copy implementations are useful in three-site disaster recovery solutions and data center relocation scenarios.
In the configuration described in Figure 5-23, a Global Mirror (Metro Mirror can also be used) solution is implemented between the Local System in Site A, the production site, and Remote System 1 located in Site B, the primary disaster recovery site. A third system, Remote System 2, is located in Site C, the secondary disaster recovery site. Connectivity is provided between Site A and Site B, between Site B and Site C, and optionally between Site A and Site C.
Figure 5-23 Cascading-like infrastructure
To implement a cascading-like solution, the following steps must be completed:
1. Set up phase. Perform the following actions to initially set up the environment:
a. Create the Global Mirror relationships between the Local System and Remote System 1.
b. Create the FlashCopy mappings in the Remote System 1 using the target Global Mirror volumes as FlashCopy source volumes. The FlashCopy must be incremental.
c. Create the Global Mirror relationships between Remote System 1 and Remote System 2 using the FlashCopy target volumes as Global Mirror source volumes.
d. Start the Global Mirror from Local System to Remote System 1.
After the Global Mirror is in ConsistentSynchronized state, you are ready to create the cascading.
2. Consistency point creation phase. The following actions must be performed every time a consistency point must be created in Site C.
a. Check whether the Global Mirror between Remote System 1 and Remote System 2 is in stopped or idle status; if it is not, stop the Global Mirror.
b. Stop the Global Mirror between the Local System and Remote System 1.
c. Start the FlashCopy in Remote Site 1.
d. Resume the Global Mirror between the Local System and Remote System 1.
e. Start/resume the Global Mirror between Remote System 1 and Remote System 2.
The first time these operations are performed, a full copy between Remote System 1 and Remote System 2 occurs. Later executions of these operations perform incremental resynchronizations instead. After the Global Mirror between Remote System 1 and Remote System 2 is in ConsistentSynchronized state, the consistency point in Site C is created. The Global Mirror between Remote System 1 and Remote System 2 can now be stopped to be ready for the next consistency point creation.
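Assuming individual relationships rather than consistency groups, and the hypothetical names gm_a_b (Local System to Remote System 1), fcmap_b (the incremental FlashCopy mapping in Remote System 1), and gm_b_c (Remote System 1 to Remote System 2), one cycle of the consistency point creation phase can be sketched as follows:
stoprcrelationship gm_b_c
stoprcrelationship gm_a_b
startfcmap -prep fcmap_b
startrcrelationship gm_a_b
startrcrelationship gm_b_c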
5.3.5 1920 error
An IBM Spectrum Virtualize/Storwize system generates a 1920 error message whenever a Metro Mirror or Global Mirror relationship stops because of adverse conditions. The adverse conditions, if left unresolved, might affect performance of foreground I/O.
A 1920 error can result for many reasons. The condition might be the result of a temporary failure, such as maintenance on the intersystem connectivity, unexpectedly higher foreground host I/O workload, or a permanent error because of a hardware failure. It is also possible that not all relationships are affected and that multiple 1920 errors can be posted.
The 1920 error can be triggered for both Metro Mirror and Global Mirror relationships. However, in Metro Mirror configurations the 1920 error is associated only with a permanent I/O error condition. For this reason, the main focus of this section is 1920 errors in a Global Mirror configuration.
Internal Global Mirror control policy and raising 1920 errors
Although Global Mirror is an asynchronous remote copy service, the local and remote sites have some interplay. When data comes into a local volume, work must be done to ensure that the remote copies are consistent. This work can add a delay to the local write. Normally, this delay is low. The IBM Spectrum Virtualize code implements many control mechanisms that mitigate the impact of Global Mirror on foreground I/O.
gmmaxhostdelay and gmlinktolerance
The gmlinktolerance parameter helps to ensure that hosts do not perceive the latency of the long-distance link, regardless of the bandwidth of the hardware that maintains the link or the storage at the secondary site. The hardware and storage must be provisioned so that, when combined, they can support the maximum throughput that is delivered by the applications at the primary that is using Global Mirror.
If the capabilities of this hardware are exceeded, the system becomes backlogged and the hosts receive higher latencies on their write I/O. Remote copy in Global Mirror implements a protection mechanism to detect this condition and halts mirrored foreground write and background copy I/O. Suspension of this type of I/O traffic ensures that misconfiguration or hardware problems (or both) do not affect host application availability.
Global Mirror attempts to detect backlogs that are caused by the operation of the Global Mirror protocol, and to differentiate them from general delays in a heavily loaded system, where a host might see high latency even if Global Mirror were disabled.
To detect these specific scenarios, Global Mirror measures the time that is taken to perform the messaging to assign and record the sequence number for a write I/O. If this process exceeds the expected value over a period of 10 seconds, this period is treated as being overloaded (bad period).
Global Mirror uses the gmmaxhostdelay and gmlinktolerance parameters to monitor Global Mirror protocol backlogs in the following ways:
Users set the gmmaxhostdelay and gmlinktolerance parameters to control how software responds to these delays. The gmmaxhostdelay parameter is a value in milliseconds that can go up to 100.
Every 10 seconds, Global Mirror samples all of the Global Mirror writes and determines how much of a delay it added. If at least a third of these writes are greater than the gmmaxhostdelay setting, that sample period is marked as bad.
Software keeps a running count of bad periods. Each time that a bad period occurs, this count goes up by one. Each time a good period occurs, this count goes down by 1, to a minimum value of 0.
The gmlinktolerance parameter is defined in seconds. Bad periods are assessed at intervals of 10 seconds. The maximum bad period count is the gmlinktolerance parameter value that is divided by 10. For instance, with a gmlinktolerance value of 300, the maximum bad period count is 30. When maximum bad period count is reached, a 1920 error is reported.
Bad periods do not need to be consecutive, and the bad period count is adjusted at each 10-second interval. That is, 10 bad periods, followed by five good periods, followed by 10 bad periods, results in a bad period count of 15.
Within each sample period, Global Mirror writes are assessed. If, in a write operation, the delay added by the Global Mirror protocol exceeds the gmmaxhostdelay value, the operation is counted as a bad write; otherwise, it is counted as a good write. The proportion of bad writes to good writes is then calculated. If at least one third of the writes are identified as bad, the sample period is defined as a bad period. A consequence is that, under a light I/O load, a single bad write can become significant. For example, if only one write I/O is performed in a 10-second period and this write is considered slow, the bad period count increments.
An edge case occurs when the gmmaxhostdelay and gmlinktolerance parameters are set to their minimum values (1 ms and 20 s). With these settings, only two consecutive bad sample periods are needed before a 1920 error condition is reported. Consider a very light foreground write workload, for example, a single I/O in the 20 seconds. With unlucky timing, a single bad I/O (that is, a write I/O that took over 1 ms in remote copy) can span the boundary of two 10-second sample periods. This single bad I/O theoretically can be counted as two bad periods and trigger a 1920 error.
A higher gmlinktolerance value, gmmaxhostdelay setting, or I/O load might reduce the risk of encountering this edge case.
maxreplicationdelay and partnershipexclusionthreshold
IBM Spectrum Virtualize version 7.6 has introduced the maxreplicationdelay and partnershipexclusionthreshold parameters to provide further performance protection mechanisms when remote copy services (Metro Mirror and Global Mirror) are used.
maxreplicationdelay is a system-wide attribute that configures how long a single write can be outstanding from the host before the relationship is stopped, triggering a 1920 error. It can protect the hosts from seeing timeouts due to secondary hung IOs.
This parameter is mainly intended to protect from secondary system issues. It does not help with ongoing performance issues, but can be used to limit the exposure of hosts to long write response times that can cause application errors. For instance, setting maxreplicationdelay to 30 means that if a write operation for a volume in a remote copy relationship does not complete within 30 seconds, the relationship is stopped, triggering a 1920 error. Along with the 1920 error, the specific event ID 985004 is generated with the text “Maximum replication delay exceeded”.
The maxreplicationdelay values can be 0 - 360 seconds. Setting maxreplicationdelay to 0 disables the feature.
The partnershipexclusionthreshold is a system-wide parameter that sets the timeout for an I/O that triggers a temporary drop of the link to the remote system. Similar to maxreplicationdelay, the partnershipexclusionthreshold attribute provides some flexibility in a part of replication that tries to shield a production system from hung I/Os on a secondary system.
In an IBM Spectrum Virtualize/Storwize system, a node assert (restart with a 2030 error) occurs if any individual I/O takes longer than 6 minutes. To avoid this situation, some actions are attempted to clean up anything that might be hanging I/O before the I/O gets to 6 minutes.
One of these actions is temporarily dropping (for 15 minutes) the link between systems if any I/O takes longer than 5 minutes 15 seconds (315 seconds). This action often removes hang conditions caused by replication problems. The partnershipexclusionthreshold parameter introduced the ability to set this value to a time lower than 315 seconds to respond to hung I/O more swiftly. The partnershipexclusionthreshold value must be a number in the range 30 - 315.
If an I/O takes longer than the partnershipexclusionthreshold value, a 1720 error is triggered (with an event ID 987301) and any regular Global Mirror or Metro Mirror relationships stop on the next write to the primary volume.
 
Important: Do not change the partnershipexclusionthreshold parameter except under the direction of IBM Support.
To set the maxreplicationdelay and partnershipexclusionthreshold parameters, the chsystem command must be used, as shown in Example 5-3.
Example 5-3 maxreplicationdelay and partnershipexclusionthreshold setting
IBM_2145:SVC_ESC:superuser>chsystem -maxreplicationdelay 30
IBM_2145:SVC_ESC:superuser>chsystem -partnershipexclusionthreshold 180
The maxreplicationdelay and partnershipexclusionthreshold parameters do not interact with the gmlinktolerance and gmmaxhostdelay parameters.
Troubleshooting 1920 errors
When you are troubleshooting 1920 errors that are posted across multiple relationships, you must diagnose the cause of the earliest error first. You must also consider whether other higher priority system errors exist and fix these errors because they might be the underlying cause of the 1920 error.
The diagnosis of a 1920 error is assisted by SAN performance statistics. To gather this information, you can use IBM Spectrum Control with a statistics monitoring interval of 1 or 5 minutes. Also, turn on the internal statistics gathering function, IOstats, in IBM Spectrum Virtualize. Although not as powerful as IBM Spectrum Control, IOstats can provide valuable debug information if the snap command gathers system configuration data close to the time of failure.
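For example, the internal statistics collection can be started, or its sampling interval changed, with the startstats command. The following sketch sets a 5-minute interval:
startstats -interval 5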
The following are the main performance statistics that should be investigated for the 1920 error:
Write I/O Rate and Write Data Rate
For volumes that are primary volumes in relationships, these statistics are the total amount of write operations submitted per second by hosts on average over the sample period, and the bandwidth of those writes. For secondary volumes in relationships, this is the average number of replicated writes that are received per second, and the bandwidth that these writes consume. Summing the rate over the volumes you intend to replicate gives a coarse estimate of the replication link bandwidth required.
Write Response Time and Peak Write Response Time
On primary volumes, these are the average time (in milliseconds) and peak time between a write request being received from a host and the completion message being returned. The write response time is the best way to show what kind of write performance the host is seeing.
If a user complains that an application is slow, and the statistics show the write response time leaping from 1 ms to 20 ms, the two are most likely linked. However, some applications with high queue depths and low to moderate workloads are not affected by increased response times. Note that a high value here is an effect of some other problem. The peak is less useful because it is very sensitive to individual glitches in performance, but it can show more detail of the distribution of write response times.
On secondary volumes, these statistics describe the time for the write to be submitted from the replication feature into the system cache, and should normally be of a similar magnitude to those on the primary volume. Generally, the write response time should be below 1 ms for a fast-performing system.
Global Mirror Write I/O Rate
This statistic shows the number of writes per second that the (regular) replication feature is processing for this volume. It applies to both types of Global Mirror and to Metro Mirror, but in each case only for the secondary volume. Because writes are always separated into 32 KB or smaller tracks before replication, this statistic might be different from the Write I/O Rate on the primary volume (magnified further because the samples on the two systems are not aligned, so they capture a different set of writes).
Global Mirror Overlapping Write I/O Rate
This statistic monitors the amount of overlapping I/O that the Global Mirror feature is handling for regular Global Mirror relationships. That is where an LBA is written again after the primary volume has been updated, but before the secondary volume has been updated for an earlier write to that LBA. To mitigate the effects of the overlapping I/Os, a journaling feature has been implemented, as discussed in “Colliding writes” on page 164.
Global Mirror secondary write lag
This statistic is valid for regular Global Mirror primary and secondary volumes. For primary volumes, it tracks the length of time in milliseconds that replication writes are outstanding from the primary system. This amount includes the time to send the data to the remote system, consistently apply it to the secondary non-volatile cache, and send an acknowledgment back to the primary system.
For secondary volumes, this statistic records only the time that is taken to consistently apply it to the system cache, which is normally up to 20 ms. Most of that time is spent coordinating consistency across many nodes and volumes. Primary and secondary volumes for a relationship tend to record times that differ by the round-trip time between systems. If this statistic is high on the secondary system, look for congestion on the secondary system’s fabrics, saturated auxiliary storage, or high CPU utilization on the secondary system.
Write-cache Delay I/O Rate
These statistics show how many writes could not be instantly accepted into the system cache because cache was full. It is a good indication that the write rate is faster than the storage can cope with. If this amount starts to increase on auxiliary storage while primary volumes suffer from increased Write Response Time, it is possible that the auxiliary storage is not fast enough for the replicated workload.
Port to Local Node Send Response Time
The time in milliseconds that it takes this node to send a message to other nodes in the same system (which will mainly be the other node in the same I/O group) and get an acknowledgment back. This amount should be well below 1 ms, with values below 0.3 ms being essential for regular Global Mirror to provide a Write Response Time below 1 ms. This requirement is necessary because up to three round-trip messages within the local system will happen before a write completes to the host. If this number is higher than you want, look at fabric congestion (Zero Buffer Credit Percentage) and CPU Utilization of all nodes in the system.
Port to Remote Node Send Response Time
This value is the time in milliseconds that it takes to send a message to nodes in other systems and get an acknowledgment back. This amount is not separated out by remote system, but for environments that replicate to only one remote system, it should be very close to the low-level ping time between your sites. If this value starts going significantly higher, it is likely that the link between your systems is saturated, which usually causes a high Zero Buffer Credit Percentage as well.
Sum of Port to Local Node Send Response Time and Port to Local Node Send Queue Time
The time must be less than 1 ms for the primary system. A number in excess of 1 ms might indicate that an I/O group is reaching its I/O throughput limit, which can limit performance.
System CPU Utilization (Core 1-4)
These values show how heavily loaded the nodes in the system are. If any core has high utilization (say, over 90%) and there is an increase in write response time, it is possible that the workload is being CPU limited. You can resolve this by upgrading to faster hardware, or spreading out some of the workload to other nodes and systems.
Zero Buffer Credit Percentage
This is the fraction of messages that this node attempted to send through Fibre Channel ports that had to be delayed because the port ran out of buffer credits. If you have a long link from the node to the switch it is attached to, there might be benefit in getting the switch to grant more buffer credits on its port.
It is more likely to be the result of congestion on the fabric, as running out of buffer credits is how Fibre Channel performs flow control. Normally, this value is well under 1%. From 1 - 10% is a concerning level of congestion, but you might find the performance acceptable. Over 10% indicates extreme congestion. This amount is also called out on a port-by-port basis in the port-level statistics, which gives finer granularity of where any congestion might be.
When looking at the port-level statistics, high values on ports used for messages to nodes in the same system are much more concerning than those on ports that are used for messages to nodes in other systems.
Back-end Write Response Time
This value is the average response time in milliseconds for write operations to the back-end storage. This time might include several physical I/O operations, depending on the type of RAID architecture.
Poor back-end performance on the secondary system is a frequent cause of 1920 errors, although it is less common for primary systems. Exact values to watch out for depend on the storage technology, but usually the response time should be less than 50 ms. A longer response time can indicate that the storage controller is overloaded. If the response time for a specific storage controller is outside of its specified operating range, investigate it for the same reason.
Focus areas for 1920 errors
The causes of 1920 errors might be numerous. To fully understand the underlying reasons for posting this error, consider the following components that are related to the remote copy relationship:
The intersystem connectivity network
Primary storage and remote storage
IBM Spectrum Virtualize nodes and Storwize node canisters
Storage area network
Data collection for diagnostic purposes
A successful diagnosis depends on the collection of the following data at both systems:
The snap command with livedump (triggered at the point of failure)
I/O Stats running (if possible)
IBM Spectrum Control performance statistics data (if possible)
The following information and logs from other components:
 – Intersystem network and switch details:
 • Technology
 • Bandwidth
 • Typical measured latency on the Intersystem network
 • Distance on all links (which can take multiple paths for redundancy)
 • Whether trunking is enabled
 • How the link interfaces with the two SANs
 • Whether compression is enabled on the link
 • Whether the link is dedicated or shared; if shared, which resources it shares and how much of those resources it uses
 • Whether switch Write Acceleration is used (check with IBM for compatibility or known limitations)
 • Whether switch compression is used (it should be transparent, but it complicates the ability to predict bandwidth)
 – Storage and application:
 • Specific workloads at the time of 1920 errors, which might not be relevant, depending upon the occurrence of the 1920 errors and the volumes that are involved
 • RAID rebuilds
 • Whether 1920 errors are associated with Workload Peaks or Scheduled Backup
Intersystem network
For diagnostic purposes, ask the following questions about the intersystem network:
Was network maintenance being performed?
Consider the hardware or software maintenance that is associated with intersystem network, such as updating firmware or adding more capacity.
Is the intersystem network overloaded?
You can find indications of this situation by using statistical analysis with the help of I/O stats, IBM Spectrum Control, or both. Examine the internode communications, storage controller performance, or both. By using IBM Spectrum Control, you can check the storage metrics for the period before the Global Mirror relationships were stopped, which can be tens of minutes depending on the gmlinktolerance and maxreplicationdelay parameters.
Diagnose the overloaded link by using the following methods:
 – Look at the statistics generated by the routers or switches near your most bandwidth-constrained link between the systems
Exactly what is provided, and how to analyze it varies depending on the equipment used.
 – Look at the port statistics for high response time in the internode communication
An overloaded long-distance link causes high response times in the internode messages (the Port to remote node send response time statistic) that are sent by IBM Spectrum Virtualize. If delays persist, the messaging protocols exhaust their tolerance elasticity and the Global Mirror protocol is forced to delay handling new foreground writes while waiting for resources to free up.
 – Look at the port statistics for buffer credit starvation
The Zero Buffer Credit Percentage statistic can be useful here too, as you normally have a high value here as the link saturates. Only look at ports that are replicating to the remote system.
 – Look at the volume statistics (before the 1920 error is posted):
 • Target volume write throughput approaches the link bandwidth.
If the write throughput on the target volume is equal to your link bandwidth, your link is likely overloaded. Check what is driving this situation. For example, does peak foreground write activity exceed the bandwidth, or does a combination of this peak I/O and the background copy exceed the link capacity?
 • Source volume write throughput approaches the link bandwidth.
This write throughput represents only the I/O that is performed by the application hosts. If this number approaches the link bandwidth, you might need to upgrade the link’s bandwidth. Alternatively, reduce the foreground write I/O that the application is attempting to perform, or reduce the number of remote copy relationships.
 • Target volume write throughput is greater than the source volume write throughput.
If this condition exists, the situation suggests a high level of background copy and mirrored foreground write I/O. In these circumstances, decrease the background copy rate parameter of the Global Mirror partnership to bring the combined mirrored foreground I/O and background copy I/O rates back within the remote link's bandwidth.
 – Look at the volume statistics (after the 1920 error is posted):
 • Source volume write throughput after the Global Mirror relationships were stopped.
If write throughput increases greatly (by 30% or more) after the Global Mirror relationships are stopped, the application host was attempting to perform more I/O than the remote link can sustain.
When the Global Mirror relationships are active, the overloaded remote link causes higher response times to the application host. This overload, in turn, decreases the throughput of application host I/O at the source volume. After the Global Mirror relationships stop, the application host I/O sees a lower response time, and the true write throughput returns.
To resolve this issue, increase the remote link bandwidth, reduce the application host I/O, or reduce the number of Global Mirror relationships.
Storage controllers
Investigate the primary and remote storage controllers, starting at the remote site. If the back-end storage at the secondary system is overloaded, or another problem is affecting the cache there, the Global Mirror protocol fails to keep up. The problem eventually exhausts the (gmlinktolerance) elasticity and has a similar effect at the primary system.
In this situation, ask the following questions:
Are the storage controllers at the remote system overloaded (performing slowly)?
Use IBM Spectrum Control to obtain the back-end write response time for each MDisk at the remote system. A response time for any individual MDisk that exhibits a sudden increase of 50 ms or more, or that is higher than 100 ms, generally indicates a problem with the back end.
 
However, if you followed the specified back-end storage controller requirements and were running without problems until recently, the error is most likely caused by a decrease in controller performance because of maintenance actions or a hardware failure of the controller. Check whether an error condition is on the storage controller, for example, media errors, a failed physical disk, or a recovery activity, such as RAID array rebuilding that uses more bandwidth.
If an error occurs, fix the problem and then restart the Global Mirror relationships.
If no error occurs, consider whether the secondary controller can process the required level of application host I/O. You might improve the performance of the controller in the following ways:
 – Adding more or faster physical disks to a RAID array.
 – Changing the RAID level of the array.
 – Changing the cache settings of the controller and checking that the cache batteries are healthy, if applicable.
 – Changing other controller-specific configuration parameters.
Are the storage controllers at the primary site overloaded?
Analyze the performance of the primary back-end storage by using the same steps that you use for the remote back-end storage. The main effect of bad performance is to limit the amount of I/O that can be performed by application hosts. Therefore, you must monitor back-end storage at the primary site regardless of Global Mirror.
However, if bad performance continues for a prolonged period, a false 1920 error might be flagged.
Node and canister
For the IBM Spectrum Virtualize node and Storwize node canister hardware, a possible cause of 1920 errors is a heavily loaded secondary or primary system. If this condition persists, a 1920 error might be posted.
Global Mirror needs to synchronize its IO processing across all nodes in the system to ensure data consistency. If any node is running out of CPU, it can affect all relationships. So check the CPU usage statistic. If it looks higher when there is a performance problem, then running out of CPU bandwidth might be causing the problem. Of course, CPU usage goes up when the IOPS going through a node goes up, so if the workload increases, you would expect to see CPU usage increase.
If there is an increase in CPU usage on the secondary system but no increase in IOPS, and volume write latency increases too, it is likely that the increase in CPU usage has caused the increased volume write latency. In that case, try to work out what might have caused the increase in CPU usage (for example, starting many FlashCopy mappings at that time). Consider moving that activity to a time with less workload. If there is an increase in both CPU usage and IOPS, and the CPU usage is close to 100%, then that node might be overloaded.
If a primary system is sufficiently busy, the write ordering detection in Global Mirror can delay writes enough to exceed the gmmaxhostdelay latency and cause a 1920 error. Stopping replication potentially lowers CPU usage, and also lowers the opportunities for each I/O to be delayed by slow scheduling on a busy system.
Solve overloaded nodes by upgrading them to newer, faster hardware if possible, or by adding more IO groups/control enclosures (or systems) to spread the workload over more resources.
 
Storage area network
Issues and congestion in both the local and remote SANs can lead to 1920 errors. The Port to Local Node Send Response Time is the key statistic to investigate. It captures the round-trip time between nodes in the same system. Anything over 1.0 ms is surprisingly high and causes high secondary volume write response time. Values greater than 1 ms on the primary system cause an impact of 3 ms or more on the write latency of Global Mirror primary volumes.
If you have checked CPU utilization on all the nodes, and it has not gotten near 100%, a high Port to local node send response time means that there is fabric congestion or a slow-draining Fibre Channel device.
A good indicator of SAN congestion is the Zero Buffer Credit Percentage on the port statistics (see “Buffer credits” on page 175 for more information on Buffer Credit). If any port is seeing over 10% zero buffer credits, that is definitely going to cause a problem for all I/O, not just Global Mirror writes. Values from 1 - 10% are moderately high and might contribute to performance issues.
For both primary and secondary systems, congestion on the fabric from other slow-draining devices becomes much less of an issue when only dedicated ports are used for node-to-node traffic within the system. However, this only really becomes an option on systems with more than four ports per node. Use port masking to segment your ports.
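Port masking is set with the chsystem command by using binary mask strings. The following commands are a sketch only; the mask values are hypothetical and depend on which ports you dedicate to node-to-node traffic and to replication traffic:
chsystem -localfcportmask 0000000000001111
chsystem -partnerfcportmask 0000000011110000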
FlashCopy considerations
Check whether any FlashCopy mappings are in the prepared state. In particular, check whether the Global Mirror target volumes are the sources of a FlashCopy mapping and whether that mapping was in the prepared state for an extended time.
Volumes in the prepared state are cache disabled, so their performance is impacted. To resolve this problem, start the FlashCopy mapping, which reenables the cache and improves the performance of the volume and of the Global Mirror relationship.
Consider also that FlashCopy can add significant workload to the back-end storage, especially when the background copy is active (see “Background Copy considerations” on page 152). In cases where the remote system is used to create golden or practice copies for Disaster Recovery testing, the workload added by the FlashCopy background processes can overload the system. This overload can lead to poor remote copy performance and then to a 1920 error. Careful planning of the back-end resources is particularly important with this kind of scenario. Reducing the FlashCopy background copy rate can also help to mitigate this situation.
FCIP considerations
When you get a 1920 error, always check the latency first. The FCIP routing layer can introduce latency if it is not properly configured. If your network provider reports a much lower latency, you might have a problem at your FCIP routing layer. Most FCIP routing devices have built-in tools to enable you to check the RTT. When you are checking latency, remember that TCP/IP routing devices (including FCIP routers) report RTT by using standard 64-byte ping packets.
In Figure 5-24 on page 201, you can see why the effective transit time must be measured only by using packets that are large enough to hold an FC frame, that is, 2148 bytes (2112 bytes of payload and 36 bytes of header). Allow a safety margin in your estimates because various switch vendors have optional features that might increase this size. After you verify your latency by using the proper packet size, proceed with normal hardware troubleshooting.
Look at the second largest component of your RTT, which is serialization delay. Serialization delay is the amount of time that is required to move a packet of data of a specific size across a network link of a certain bandwidth. The required time to move a specific amount of data decreases as the data transmission rate increases.
Figure 5-24 shows the orders of magnitude of difference between the link bandwidths. It is easy to see how 1920 errors can arise when your bandwidth is insufficient. Never use a TCP/IP ping to measure RTT for FCIP traffic.
Figure 5-24 Effect of packet size (in bytes) versus the link size
In Figure 5-24, the amount of time in microseconds that is required to transmit a packet across network links of varying bandwidth capacity is compared. The following packet sizes are used:
64 bytes: The size of the common ping packet
1500 bytes: The size of the standard TCP/IP packet
2148 bytes: The size of an FC frame
Finally, your path maximum transmission unit (MTU) affects the delay that is incurred to get a packet from one location to another location. An MTU that is too small might cause fragmentation, whereas one that is too large might cause too many retransmits when a packet is lost.
Recovery
After a 1920 error occurs, the Global Mirror auxiliary volumes are no longer in the ConsistentSynchronized state. You must establish the cause of the problem and fix it before you restart the relationship. When the relationship is restarted, you must resynchronize it. During this period, the data on the Metro Mirror or Global Mirror auxiliary volumes on the secondary system is inconsistent, and your applications cannot use the volumes as backup disks.
 
Tip: If the relationship stopped in a consistent state, you can use the data on the auxiliary volume at the remote system as backup. Creating a FlashCopy of this volume before you restart the relationship gives more data protection. The FlashCopy volume that is created maintains the current, consistent image until the Metro Mirror or Global Mirror relationship is synchronized again and back in a consistent state.
To ensure that the system can handle the background copy load, delay restarting the Metro Mirror or Global Mirror relationship until a quiet period occurs. If the required link capacity is unavailable, you might experience another 1920 error, and the Metro Mirror or Global Mirror relationship stops in an inconsistent state.
Adjusting the Global Mirror settings
Although the default values are valid in most configurations, the gmlinktolerance and gmmaxhostdelay settings can be adjusted to accommodate particular environment or workload conditions.
For example, Global Mirror is designed to look at average delays. However, some hosts, such as VMware ESX, might not tolerate even a single I/O taking a long time (for example, 45 seconds) before deciding to reboot. Given that it is better to terminate a Global Mirror relationship than it is to reboot a host, you might want to set gmlinktolerance to something like 30 seconds and then compensate, so that you do not get too many relationship terminations, by setting gmmaxhostdelay to something larger, such as 100 ms.
If you compare the two approaches, the default (gmlinktolerance 300, gmmaxhostdelay 5) is saying “If more than one third of the I/Os are slow and that happens repeatedly for 5 minutes, then terminate the busiest relationship in that stream.” In contrast, the example of gmlinktolerance 30, gmmaxhostdelay 100 says “If more than one third of the I/Os are extremely slow and that happens repeatedly for 30 seconds, then terminate the busiest relationship in the stream.”
So one approach is designed to pick up general slowness, and the other approach is designed to pick up shorter bursts of extreme slowness that might disrupt your server environment. The general recommendation is to change the gmlinktolerance and gmmaxhostdelay values progressively and evaluate the overall impact to find an acceptable compromise between performances and Global Mirror stability.
You can even disable the gmlinktolerance feature by setting the gmlinktolerance value to 0. However, the gmlinktolerance parameter cannot protect applications from extended response times if it is disabled. You might consider disabling the gmlinktolerance feature in the following circumstances:
During SAN maintenance windows, where degraded performance is expected from SAN components and application hosts can withstand extended response times from Global Mirror volumes.
During periods when application hosts can tolerate extended response times and it is expected that the gmlinktolerance feature might stop the Global Mirror relationships. For example, you are testing by using an I/O generator that is configured to stress the back-end storage. Then, the gmlinktolerance feature might detect high latency and stop the Global Mirror relationships. Disabling the gmlinktolerance feature prevents the Global Mirror relationships from stopping, at the risk of exposing the test host to extended response times.
Note that the maxreplicationdelay setting does not mitigate the occurrence of 1920 errors because it actually adds another trigger for the 1920 error itself. However, maxreplicationdelay provides users with a fine-granularity mechanism to manage hung I/O conditions, and it can be used in combination with the gmlinktolerance and gmmaxhostdelay settings to better address particular environment conditions.
In the VMware example, an alternative option is to set the maxreplicationdelay to 30 seconds and leave the gmlinktolerance and gmmaxhostdelay settings to their default. With these settings, the maxreplicationdelay timeout effectively handles the hung I/Os conditions, while the gmlinktolerance and gmmaxhostdelay settings still provide an adequate mechanism to protect from ongoing performance issues.
5.4 Native IP replication
The native IP replication feature enables replication between any IBM Spectrum Virtualize and Storwize family products running code version 7.2 or higher. It does so by using the built-in networking ports or optional 1/10Gbit adapter.
Following a recent partnership with IBM, native IP replication uses SANSlide technology developed by Bridgeworks Limited of Christchurch, UK. They specialize in products that can bridge storage protocols and accelerate data transfer over long distances. Adding this technology at each end of a wide area network (WAN) TCP/IP link significantly improves the utilization of the link. It does this by applying patented artificial intelligence (AI) to hide latency that is normally associated with WANs. Doing so can greatly improve the performance of mirroring services, in particular Global Mirror with Change Volumes (GM/CV), over long distances.
5.4.1 Native IP replication technology
Remote Mirroring over IP communication is supported on the IBM Spectrum Virtualize and Storwize Family systems by using Ethernet communication links. The IBM Spectrum Virtualize Software IP replication uses innovative Bridgeworks SANSlide technology to optimize network bandwidth and utilization. This new function enables the use of a lower-speed and lower-cost networking infrastructure for data replication.
Bridgeworks’ SANSlide technology, which is integrated into the IBM Spectrum Virtualize Software, uses artificial intelligence to help optimize network bandwidth use and adapt to changing workload and network conditions. This technology can improve remote mirroring network bandwidth usage up to three times. It can enable clients to deploy a less costly network infrastructure, or speed up remote replication cycles to enhance disaster recovery effectiveness.
With an Ethernet network data flow, the data transfer can slow down over time. This condition occurs because of the latency that is caused by waiting for the acknowledgment of each set of packets that are sent. The next packet set cannot be sent until the previous packet is acknowledged, as shown in Figure 5-25.
Figure 5-25 Typical Ethernet network data flow
However, by using the embedded IP replication, this behavior can be eliminated with the enhanced parallelism of the data flow. This parallelism uses multiple virtual connections (VCs) that share IP links and addresses. The artificial intelligence engine can dynamically adjust the number of VCs, receive window size, and packet size as appropriate to maintain optimum performance. While the engine is waiting for one VC’s ACK, it sends more packets across other VCs. If packets are lost from any VC, data is automatically retransmitted, as shown in Figure 5-26.
Figure 5-26 Optimized network data flow by using Bridgeworks SANSlide technology
For more information about this technology, see IBM SAN Volume Controller and Storwize Family Native IP Replication, REDP-5103.
Metro Mirror, Global Mirror, and Global Mirror Change Volume are supported with native IP partnership.
5.4.2 IP partnership limitations
The following prerequisites and assumptions must be considered before IP partnership between two IBM Spectrum Virtualize or Storwize family systems can be established:
The systems are successfully installed with V7.2 or later code levels.
The systems have the necessary licenses that enable remote copy partnerships to be configured between two systems. No separate license is required to enable IP partnership.
The storage SANs are configured correctly and the correct infrastructure to support the systems in remote copy partnerships over IP links is in place.
The two systems must be able to ping each other and perform the discovery.
The maximum number of partnerships between the local and remote systems, including both IP and Fibre Channel (FC) partnerships, is limited to the current maximum that is supported, which is three partnerships (four systems total).
Only a single partnership over IP is supported.
A system can have simultaneous partnerships over FC and IP, but with separate systems. The FC zones between two systems must be removed before an IP partnership is configured.
IP partnerships are supported on both 10 gigabits per second (Gbps) links and 1 Gbps links. However, the intermix of both on a single link is not supported.
The maximum supported round-trip time is 80 milliseconds (ms) for 1 Gbps links.
The maximum supported round-trip time is 10 ms for 10 Gbps links.
The minimum supported link bandwidth is 10 Mbps.
The inter-cluster heartbeat traffic uses 1 Mbps per link.
Only nodes from two I/O Groups can have ports that are configured for an IP partnership.
Migrations of remote copy relationships directly from FC-based partnerships to IP partnerships are not supported.
IP partnerships between the two systems can be over IPv4 or IPv6 only, but not both.
Virtual LAN (VLAN) tagging of the IP addresses that are configured for remote copy is supported starting with V7.4.
Management IP and Internet SCSI (iSCSI) IP on the same port can be in a different network starting with V7.4.
An added layer of security is provided by using Challenge Handshake Authentication Protocol (CHAP) authentication.
Direct attached systems configurations are supported with the following restrictions:
 – Only two direct attach links are allowed.
 – The direct attach links must be on the same I/O group.
 – Use two port groups, where a port group contains only the two ports that are directly linked.
Transmission Control Protocol (TCP) ports 3260 and 3265 are used for IP partnership communications. Therefore, these ports must be open in firewalls between the systems.
Network address translation (NAT) between systems that are being configured in an IP Partnership group is not supported.
Only a single Remote Copy data session per physical link can be established. It is intended that only one connection (for sending/receiving Remote Copy data) is made for each independent physical link between the systems.
 
Note: A physical link is the physical IP link between the two sites, A (local) and B (remote). Multiple IP addresses on local system A can be connected (by Ethernet switches) to this physical link. Similarly, multiple IP addresses on remote system B can be connected (by Ethernet switches) to the same physical link. At any point, only a single IP address on cluster A can form an RC data session with an IP address on cluster B.
The maximum throughput is restricted based on the use of 1 Gbps or 10 Gbps Ethernet ports. The output varies based on distance (for example, round-trip latency) and quality of communication link (for example, packet loss):
 – One 1 Gbps port can transfer up to 110 megabytes per second (MBps) unidirectional, 190 MBps bidirectional
 – Two 1 Gbps ports can transfer up to 220 MBps unidirectional, 325 MBps bidirectional
 – One 10 Gbps port can transfer up to 240 MBps unidirectional, 350 MBps bidirectional
 – Two 10 Gbps ports can transfer up to 440 MBps unidirectional, 600 MBps bidirectional
 
Note: The Bandwidth setting definition when the IP partnerships are created has changed. Previously, the bandwidth setting defaulted to 50 MB, and was the maximum transfer rate from the primary site to the secondary site for initial sync/resyncs of volumes.
The Link Bandwidth setting is now configured by using megabits (Mb), not MB. You set the Link Bandwidth setting to a value that the communication link can sustain, or to what is allocated for replication. The Background Copy Rate setting is now a percentage of the Link Bandwidth. The Background Copy Rate setting determines the available bandwidth for the initial sync and resyncs or for GM with Change Volumes.
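As an illustration only, the following sketch shows how an IP partnership might be created with an explicit link bandwidth and background copy rate. The IP address and values are hypothetical, and the parameter names (-linkbandwidthmbits and -backgroundcopyrate) are assumptions that should be verified against the CLI reference for your code level:
mkippartnership -type ipv4 -clusterip 10.10.10.20 -linkbandwidthmbits 100 -backgroundcopyrate 50
In this sketch, the link can sustain 100 Mbps and half of that bandwidth (50%) is made available for initial synchronization and resynchronization traffic.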
5.4.3 VLAN support
VLAN tagging is supported for both iSCSI host attachment and IP replication. Hosts and remote-copy operations can connect to the system through Ethernet ports. Each traffic type has different bandwidth requirements, which can interfere with each other if they share IP connections. VLAN tagging creates two separate connections on the same IP network for different types of traffic. The system supports VLAN configuration on both IPv4 and IPv6 connections.
When the VLAN ID is configured for the IP addresses that are used for either iSCSI host attach or IP replication, the appropriate VLAN settings on the Ethernet network and servers must be configured correctly to avoid connectivity issues. After the VLANs are configured, changes to the VLAN settings disrupt iSCSI and IP replication traffic to and from the partnerships.
During the VLAN configuration for each IP address, the VLAN settings for the local and failover ports on the two nodes of an I/O Group can differ. To avoid any service disruption, configure the switches so that the failover VLANs are configured on the local switch ports and the failover of IP addresses from a failing node to a surviving node succeeds. If failover VLANs are not configured on the local switch ports, there are no paths to the Storwize V7000 system during a node failure and the replication fails.
Consider the following requirements and procedures when implementing VLAN tagging:
VLAN tagging is supported for IP partnership traffic between two systems.
VLAN provides network traffic separation at the layer 2 level for Ethernet transport.
VLAN tagging by default is disabled for any IP address of a node port. You can use the CLI or GUI to set the VLAN ID for port IPs on both systems in the IP partnership.
When a VLAN ID is configured for the port IP addresses that are used in remote copy port groups, appropriate VLAN settings on the Ethernet network must also be properly configured to prevent connectivity issues.
Setting VLAN tags for a port is disruptive. Therefore, stop the partnership before you configure VLAN tags, and restart it when the configuration is complete, as shown in the sketch that follows this list.
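A minimal sketch of this sequence follows. The partner system name, node name, addresses, and VLAN ID are hypothetical, the trailing 1 is the Ethernet port ID, and the -vlan parameter of the cfgportip command should be verified for your code level:
chpartnership -stop ITSO_SystemB
cfgportip -node node1 -ip 192.168.10.11 -mask 255.255.255.0 -gw 192.168.10.1 -vlan 100 1
chpartnership -start ITSO_SystemB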
5.4.4 IP Compression
IBM Spectrum Virtualize version 7.7 introduced the IP compression capability, which can speed up replication cycles or allow the use of less bandwidth. This feature reduces the volume of data that must be transmitted during remote copy operations by using compression capabilities similar to those of existing Real-time Compression implementations.
 
No License: The IP compression feature does not require an RtC software license.
Data compression is performed within the IP replication component of the IBM Spectrum Virtualize code. It can be used with all the remote copy technologies (Metro Mirror, Global Mirror, and Global Mirror with Change Volumes). IP compression is supported on the following systems:
SAN Volume Controller with CF8 nodes
SAN Volume Controller with CG8 nodes
SAN Volume Controller with DH8 nodes
SAN Volume Controller with SV1 nodes
FlashSystem V9000
Storwize V7000 Gen1
Storwize V7000 Gen2 and Gen2+
Storwize V5000 Gen2
The IP compression feature provides two compression mechanisms: hardware (HW) compression and software (SW) compression. HW compression is active when compression accelerator cards are available; otherwise, SW compression is used.
HW compression makes use of accelerator cards that are otherwise underused, and the internal resources are shared between RACE and IP compression. SW compression uses the system CPU and might have an impact on heavily loaded systems.
To evaluate the benefits of the IP compression, the Comprestimator tool can be used to estimate the compression ratio of the data to be replicated. The IP compression can be enabled and disabled without stopping the remote copy relationship by using the mkippartnership and chpartnership commands with the -compress parameter. Furthermore, in systems with replication enabled in both directions, the IP compression can be enabled in only one direction. IP compression is supported for IPv4 and IPv6 partnerships.
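As a hedged example, compression might be enabled on an existing IP partnership with a command similar to the following; the partner system name is hypothetical and the exact spelling of the -compress parameter should be confirmed for your code level:
chpartnership -compress on ITSO_SystemB
Because compression can be enabled per direction, the equivalent setting can be left disabled on the partner system if only one direction benefits from it.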
Figure 5-27 reports the current compression limits by system type and compression mechanism.
Figure 5-27 IP compression limits by systems and compression types
5.4.5 Remote copy groups
This section describes remote copy groups (or remote copy port groups) and different ways to configure the links between the two remote systems. The two systems can be connected to each other over one link or, at most, two links. To address the requirement to enable the systems to know about the physical links between the two sites, the concept of remote copy port groups was introduced.
A remote copy port group ID is a numerical tag that is associated with an IP port of a system to indicate which physical IP link it is connected to. Multiple IBM Spectrum Virtualize nodes can be connected to the same physical long-distance link, and must therefore share a remote copy port group ID.
In scenarios with two physical links between the local and remote clusters, two remote copy port group IDs must be used to designate which IP addresses are connected to which physical link. This configuration must be done by the system administrator by using the GUI or the cfgportip CLI command.
 
Remember: IP ports on both partners must have been configured with identical remote copy port group IDs for the partnership to be established correctly.
The system IP addresses that are connected to the same physical link are designated with identical remote copy port groups. The IBM Spectrum Virtualize and Storwize family systems support three remote copy groups: 0, 1, and 2.
The IP addresses are, by default, in remote copy port group 0. Ports in port group 0 are not considered for creating remote copy data paths between two systems. For partnerships to be established over IP links directly, IP ports must be configured in remote copy group 1 if a single inter-site link exists, or in remote copy groups 1 and 2 if two inter-site links exist.
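For illustration, the following sketch assigns port IPs to remote copy port groups 1 and 2 with the cfgportip command. Node names, addresses, and the trailing Ethernet port IDs are hypothetical, and the -remotecopy parameter should be verified against your code level:
cfgportip -node node1 -ip 192.168.41.11 -mask 255.255.255.0 -gw 192.168.41.1 -remotecopy 1 1
cfgportip -node node2 -ip 192.168.42.11 -mask 255.255.255.0 -gw 192.168.42.1 -remotecopy 2 1
A port left at the default remains in remote copy port group 0 and is not used for IP partnership data paths.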
You can assign one IPv4 address and one IPv6 address to each Ethernet port on the IBM Spectrum Virtualize and Storwize family systems. Each of these IP addresses can be shared between iSCSI host attach and the IP partnership. The user must configure the required IP address (IPv4 or IPv6) on an Ethernet port with a remote copy port group.
The administrator might want to use IPv6 addresses for remote copy operations and use IPv4 addresses on that same port for iSCSI host attach. This configuration also implies that for two systems to establish an IP partnership, both systems must have IPv6 addresses that are configured.
Administrators can choose to dedicate an Ethernet port for IP partnership only. In that case, host access must be explicitly disabled for that IP address and any other IP address that is configured on that Ethernet port.
 
Note: To establish an IP partnership, each Storwize V7000 canister must have only a single remote copy port group configured as 1 or 2. The remaining IP addresses must be in remote copy port group 0.
Failover operations within and between port groups
Within one remote-copy port group, only one port from each system is selected for sending and receiving remote copy data at any one time. Therefore, on each system, at most one port for each remote-copy port group is reported as used.
If the IP partnership becomes unable to continue over an IP port, the system fails over to another port within that remote-copy port group. Some reasons this might occur are the switch to which it is connected fails, the node goes offline, or the cable that is connected to the port is unplugged.
For the IP partnership to continue during a failover, multiple ports must be configured within the remote-copy port group. If only one link is configured between the two systems, configure two ports (one per node) within the remote-copy port group. You can configure these two ports on two nodes within the same I/O group or within separate I/O groups. Configurations 4, 5, and 6 in IP partnership requirements are the supported dual-link configurations.
While failover is in progress, no connections in that remote-copy port group exist between the two systems in the IP partnership for a short time. Typically, failover completes within 30 seconds to 1 minute. If the systems are configured with two remote-copy port groups, the failover process within each port group continues independently of each other.
The disadvantage of configuring only one link between two systems is that, during a failover, a discovery is initiated. When the discovery succeeds, the IP partnership is reestablished. As a result, the relationships might stop, in which case a manual restart is required. To configure two intersystem links, you must configure two remote-copy port groups.
When a node fails in this scenario, the IP partnership can continue over the other link until the node failure is rectified. Failback then happens when both links are again active and available to the IP partnership. The discovery is triggered so that the active IP partnership data path is made available from the new IP address.
In a two-node system, or if there is more than one I/O Group and the node in the other I/O group has IP ports pre-configured within the remote-copy port group, the discovery is triggered. The discovery makes the active IP partnership data path available from the new IP address.
5.4.6 Supported configurations
Multiple IP partnership configurations are available depending on the number of physical links and the number of nodes. In the following sections, some example configurations are described.
Single inter-site link configurations
Consider two 2-node systems in IP partnership over a single inter-site link (with failover ports configured), as shown in Figure 5-28.
Figure 5-28 Only one remote copy group on each system and nodes with failover ports configured
Figure 5-28 shows two systems: System A and System B. A single remote copy port group 1 is configured on two Ethernet ports, one each on Node A1 and Node A2 on System A. Similarly, a single remote copy port group is configured on two Ethernet ports on Node B1 and Node B2 on System B.
Although two ports on each system are configured for remote copy port group 1, only one Ethernet port in each system actively participates in the IP partnership process. This selection is determined by a path configuration algorithm that is designed to choose data paths between the two systems to optimize performance.
The other port on the partner node in the I/O Group behaves as a standby port that is used during a node failure. If Node A1 fails in System A, IP partnership continues servicing replication I/O from Ethernet Port 2 because a failover port is configured on Node A2 on Ethernet Port 2.
However, it might take some time for discovery and path configuration logic to reestablish paths post failover. This delay can cause partnerships to change to Not_Present for that time. The details of the particular IP port that is actively participating in the IP partnership are provided in the lsportip output (reported as used).
This configuration has the following characteristics:
Each node in the I/O group has the same remote copy port group that is configured. However, only one port in that remote copy port group is active at any time at each system.
If Node A1 in System A or Node B2 in System B fails, IP partnership rediscovery is triggered and I/O continues to be serviced from the failover port.
The discovery mechanism that is triggered because of failover might introduce a delay where the partnerships momentarily change to the Not_Present state and recover.
Figure 5-29 shows a configuration with two 4-node systems in IP partnership over a single inter-site link (with failover ports configured).
Figure 5-29 Multinode systems single inter-site link with only one remote copy port group
Figure 5-29 shows two 4-node systems: System A and System B. A single remote copy port group 1 is configured on nodes A1, A2, A3, and A4 on System A, Site A, and on nodes B1, B2, B3, and B4 on System B, Site B. Although four ports are configured for remote copy group 1, only one Ethernet port in each remote copy port group on each system actively participates in the IP partnership process.
Port selection is determined by a path configuration algorithm. The other ports play the role of standby ports.
If Node A1 fails in System A, the IP partnership selects one of the remaining ports that is configured with remote copy port group 1 from any of the nodes from either of the two I/O groups in System A. However, it might take some time (generally seconds) for discovery and path configuration logic to reestablish paths post failover. This process can cause partnerships to change to the Not_Present state.
This result causes remote copy relationships to stop. The administrator might need to manually verify the issues in the event log and start the relationships or remote copy consistency groups, if they do not automatically recover. The details of the particular IP port actively participating in the IP partnership process are provided in the lsportip output (reported as used). This configuration has the following characteristics:
Each node has the remote copy port group that is configured in both I/O groups. However, only one port in that remote copy port group remains active and participates in IP partnership on each system.
If Node A1 in System A or Node B2 in System B fails, IP partnership discovery is triggered and I/O continues to be serviced from the failover port.
The discovery mechanism that is triggered because of failover might introduce a delay where the partnerships momentarily change to the Not_Present state and then recover.
The bandwidth of the single link is used completely.
An eight-node system in IP partnership with four-node system over single inter-site link is shown in Figure 5-30.
Figure 5-30 Multinode systems single inter-site link with only one remote copy port group
Figure 5-30 on page 212 shows an eight-node system (System A in Site A) and a four-node system (System B in Site B). A single remote copy port group 1 is configured on nodes A1, A2, A5, and A6 on System A at Site A. Similarly, a single remote copy port group 1 is configured on nodes B1, B2, B3, and B4 on System B.
Although there are four I/O groups (eight nodes) in System A, a maximum of two I/O groups can be configured for IP partnerships. If Node A1 fails in System A, the IP partnership continues by using one of the ports that is configured in the remote copy port group from any of the nodes from either of the two I/O groups in System A.
However, it might take some time for discovery and path configuration logic to reestablish paths post-failover. This delay might cause partnerships to change to the Not_Present state.
This process can lead to remote copy relationships stopping. The administrator must manually start them if the relationships do not auto-recover. The details of which particular IP port is actively participating in the IP partnership process are provided in the lsportip output (reported as used).
This configuration has the following characteristics:
Each node has the remote copy port group that is configured in both the I/O groups that are identified for participating in IP Replication. However, only one port in that remote copy port group remains active on each system and participates in IP Replication.
If Node A1 in System A or Node B2 in System B fails, the IP partnerships trigger discovery and continue servicing the I/O from the failover ports.
The discovery mechanism that is triggered because of failover might introduce a delay where the partnerships momentarily change to the Not_Present state and then recover.
The bandwidth of the single link is used completely.
Two inter-site link configurations
A two 2-node systems with two inter-site links configuration is depicted in Figure 5-31.
Figure 5-31 Dual links with two remote copy groups on each system configured
As shown in Figure 5-31, remote copy port groups 1 and 2 are configured on the nodes in System A and System B because two inter-site links are available. In this configuration, the failover ports are not configured on partner nodes in the I/O group. Rather, the ports are maintained in different remote copy port groups on both of the nodes. They can remain active and participate in IP partnership by using both of the links.
However, if either of the nodes in the I/O group fail (that is, if Node A1 on System A fails), the IP partnership continues only from the available IP port that is configured in remote copy port group 2. Therefore, the effective bandwidth of the two links is reduced to 50% because only the bandwidth of a single link is available until the failure is resolved.
This configuration has the following characteristics:
There are two inter-site links, and two remote copy port groups are configured.
Each node has only one IP port in remote copy port group 1 or 2.
Both the IP ports in the two remote copy port groups participate simultaneously in IP partnerships. Therefore, both of the links are used.
During node failure or link failure, the IP partnership traffic continues from the other available link and the port group. Therefore, if two links of 10 Mbps each are available and you have 20 Mbps of effective link bandwidth, bandwidth is reduced to 10 Mbps only during a failure.
After the node failure or link failure is resolved and failback happens, the entire bandwidth of both of the links is available as before.
A configuration with two 4-node systems in IP partnership with dual inter-site links is shown in Figure 5-32.
Figure 5-32 Multinode systems with dual inter-site links between the two systems
Figure 5-32 shows two 4-node systems: System A and System B. This configuration is an extension of Configuration 5 to a multinode multi-I/O group environment. As seen in this configuration, there are two I/O groups. Each node in the I/O group has a single port that is configured in remote copy port groups 1 or 2.
Although two ports are configured in remote copy port groups 1 and 2 on each system, only one IP port in each remote copy port group on each system actively participates in IP partnership. The other ports that are configured in the same remote copy port group act as standby ports during a failure. Which port in a configured remote copy port group participates in IP partnership at any moment is determined by a path configuration algorithm.
In this configuration, if Node A1 fails in System A, IP partnership traffic continues from Node A2 (that is, remote copy port group 2). At the same time, the failover also causes discovery in remote copy port group 1. Therefore, the IP partnership traffic continues from Node A3, on which remote copy port group 1 is configured. The details of the particular IP port that is actively participating in the IP partnership process are provided in the lsportip output (reported as used).
This configuration has the following characteristics:
Each node has a port configured in remote copy port group 1 or 2. However, only one port per system in each remote copy port group remains active and participates in IP partnership.
Only a single port per system from each configured remote copy port group participates simultaneously in IP partnership. Therefore, both of the links are used.
During node failure or port failure of a node that is actively participating in IP partnership, IP partnership continues from the alternative port because another port is in the system in the same remote copy port group, but in a different I/O Group.
The pathing algorithm can start discovery of an available port in the affected remote copy port group in the second I/O group, and pathing is reestablished. This process restores the total bandwidth, so both of the links are available to support IP partnership.
Finally, an eight-node system in IP partnership with a four-node system over dual inter-site links is depicted in Figure 5-33.
Figure 5-33 Multinode systems with dual inter-site links between the two systems
Figure 5-33 shows an eight-node System A in Site A and a four-node System B in Site B. Because a maximum of two I/O groups in IP partnership is supported in a system, although there are four I/O groups (eight nodes), nodes from only two I/O groups are configured with remote copy port groups in System A. The remaining I/O groups, or all of them, can be configured for remote copy partnerships over FC.
In this configuration, there are two links and two I/O groups that are configured with remote copy port groups 1 and 2. However, path selection logic is managed by an internal algorithm. Therefore, this configuration depends on the pathing algorithm to decide which of the nodes actively participate in IP partnership. Even if Node A5 and Node A6 are configured with remote copy port groups properly, active IP partnership traffic on both of the links can be driven from Node A1 and Node A2 only.
If Node A1 fails in System A, IP partnership traffic continues from Node A2 (that is, remote copy port group 2). The failover also causes IP partnership traffic to continue from Node A5, on which remote copy port group 1 is configured. The details of the particular IP port actively participating in the IP partnership process are provided in the lsportip output (reported as used).
This configuration has the following characteristics:
There are two I/O Groups with nodes in those I/O groups that are configured in two remote copy port groups because there are two inter-site links for participating in IP partnership. However, only one port per system in a particular remote copy port group remains active and participates in IP partnership.
One port per system from each remote copy port group participates in IP partnership simultaneously. Therefore, both of the links are used.
If a node, or a port on a node, that is actively participating in IP partnership fails, the remote copy (RC) data path is reestablished from another port, because a port with the same remote copy port group is available on an alternative node in the system.
The path selection algorithm starts discovery of available ports in the affected remote copy port group in the alternative I/O groups and paths are reestablished. This process restores the total bandwidth across both links.
The remaining or all of the I/O groups can be in remote copy partnerships with other systems.
5.4.7 Native IP replication performance consideration
A number of factors affect the performance of an IP partnership. Some of these factors are latency, link speed, number of intersite links, host I/O, MDisk latency, and hardware. Since its introduction in version 7.2, many improvements have been made to make IP replication perform better and more reliably.
With version 7.7, a new workload distribution algorithm was introduced that optimizes the usage of the 10 Gbps ports. Nevertheless, in the presence of poor-quality networks with significant packet loss and high latency, the actual usable bandwidth might decrease considerably.
Figure 5-34 shows the throughput trend for a 1 Gbps port with respect to the packet loss ratio and the latency.
Figure 5-34 1 Gbps port throughput trend
The chart shows how the combined effect of packet loss and latency can lead to a throughput reduction of more than 85%. For these reasons, the IP replication option should not be considered for replication configurations that demand high performance, because it requires a high-quality, well-performing network. Because of its low bandwidth requirement, Global Mirror with Change Volumes is the preferred solution with IP replication.
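As a hedged sketch only, a relationship suitable for Global Mirror with Change Volumes could be created in multi-cycling mode with a command similar to the following; the volume and system names are hypothetical, and the -cyclingmode parameter should be confirmed for your code level:
mkrcrelationship -master VOL_APP01 -aux VOL_APP01_DR -cluster ITSO_SystemB -global -cyclingmode multi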
The following recommendations might help improve this performance when using compression and IP partnership in the same system:
Using nodes older than SAN Volume Controller CG8 with IP partnership, or Global Mirror and compression in the same I/O group is not recommended.
To use the IP partnership on a multiple I/O group system that has nodes older than SAN Volume Controller 2145-CG8 and compressed volumes, configure ports for the IP partnership in I/O groups that do not contain compressed volumes.
To use the IP partnership on Storwize Family product that has compressed volumes, configure ports for the IP partnership in I/O groups that do not contain compressed volumes.
For SAN Volume Controller CG8 nodes using IP partnership, or Global Mirror and compression, update your hardware to an “RPQ 8S1296 hardware update for 2145-CG8”.
If you require more than a 100 MBps throughput per intersite link with IP partnership on a node that uses compression, consider virtualizing the system with SAN Volume Controller 2145-SV1.
Use a different port for iSCSI host I/O and IP partnership traffic. Also, use a different VLAN ID for iSCSI host I/O and IP partnership traffic.
5.5 Volume Mirroring
By using Volume Mirroring, you can have two physical copies of a volume that provide a basic RAID-1 function. These copies can be in the same Storage Pool or in different Storage Pools, with different extent sizes of the Storage Pool. Typically the two copies are allocated in different Storage Pools.
The first Storage Pool contains the original (primary volume copy). If one storage controller or Storage Pool fails, a volume copy is not affected if it has been placed on a different storage controller or in a different Storage Pool.
If a volume is created with two copies, both copies use the same virtualization policy. However, you can have two copies of a volume with different virtualization policies. In combination with thin-provisioning, each mirror of a volume can be thin-provisioned or fully allocated, and in striped, sequential, or image mode.
A mirrored (secondary) volume has all of the capabilities of the primary volume copy. It also has the same restrictions (for example, a mirrored volume is owned by an I/O Group, just as any other volume). This feature also provides a point-in-time copy function that is achieved by “splitting” a copy from the volume. However, the mirrored volume does not address other forms of mirroring based on Remote Copy (Global or Metro Mirror functions), which mirrors volumes across I/O Groups or clustered systems.
One copy is the primary copy, and the other copy is the secondary copy. Initially, the first volume copy is the primary copy. You can change the primary copy to the secondary copy if required.
Figure 5-35 provides an overview of Volume Mirroring.
Figure 5-35 Volume Mirroring overview
5.5.1 Read and write operations
The behavior of read and write operations depends on the status of the copies and on other environment settings.
During the initial synchronization or a resynchronization, only one of the copies is in synchronized status, and all the reads are directed to this copy. The write operations are directed to both copies.
When both copies are synchronized, the write operations are again directed to both copies. The read operations usually are directed to the primary copy, unless the system is configured in Enhanced Stretched Cluster topology. With this system topology and the site awareness capability enabled, the concept of primary copy still exists, but is no longer relevant: the read operation follows the site affinity. For example, consider an Enhanced Stretched Cluster configuration with mirrored volumes that have one copy in Site A and the other in Site B. If a host I/O read is attempted to a mirrored disk through an IBM Spectrum Virtualize node in Site A, the read is directed to the copy in Site A, if available. Similarly, a host I/O read attempted through a node in Site B goes to the Site B copy.
 
Important: For best performance, keep the site affinity of hosts, nodes, and storage controllers consistent wherever possible.
During back-end storage failure, note the following points:
If one of the mirrored volume copies is temporarily unavailable, the volume remains accessible to servers.
The system remembers which areas of the volume are written and resynchronizes these areas when both copies are available.
The remaining copy can service read I/O without user intervention while the failing copy is offline, as the query sketch that follows shows.
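For example, the state of the copies can be checked with the lsvdiskcopy command; this is a minimal sketch with a hypothetical volume name, where the sync and primary columns indicate which copy is synchronized and which is the primary:
lsvdiskcopy VOL_DB01
lsvdiskcopy -copy 1 VOL_DB01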
5.5.2 Volume mirroring use cases
Volume Mirroring offers the capability to provide extra copies of the data that can be used for High Availability solutions and data migration scenarios. You can convert a non-mirrored volume into a mirrored volume by adding a copy. When a copy is added using this method, the cluster system synchronizes the new copy so that it is the same as the existing volume. You can convert a mirrored volume into a non-mirrored volume by deleting one copy or by splitting one copy to create a new non-mirrored volume.
 
Server access: Servers can access the volume during the synchronization processes described.
You can use mirrored volumes to provide extra protection for your environment or to perform a migration. This solution offers several options:
Stretched Cluster configurations
Standard and Enhanced Stretched Cluster configuration uses the Volume Mirroring feature to implement the data availability across the sites.
Export to Image mode
This option allows you to move storage from managed mode to image mode. This option is useful if you are using IBM Spectrum Virtualize or Storwize V7000 as a migration device.
For example, suppose vendor A’s product cannot communicate with vendor B’s product, but you need to migrate existing data from vendor A to vendor B. Using “Export to image mode” allows you to migrate data by using the Copy Services functions and then return control to the native array, while maintaining access to the hosts.
Import to Image mode
This option allows you to import an existing storage MDisk or logical unit number (LUN) with its existing data from an external storage system, without putting metadata on it. The existing data remains intact. After you import it, all copy services functions can be used to migrate the storage to the other locations, while the data remains accessible to your hosts.
Volume migration using Volume Mirroring and then using the Split into New Volume option
This option allows you to use the available RAID-1 functionality. You create two copies of data that initially have a set relationship (one primary and one secondary). You then split the relationship so that the two copies become independent volumes; a command-level sketch follows this list of options.
You can use this option to migrate data between storage pools and devices. You might use this option if you want to move volumes to multiple storage pools.
 
Volume migration using the Move to Another Pool option
This option allows any volume to be moved between storage pools without any interruption to the host access. This option is effectively a quicker version of the Volume Mirroring and Split into New Volume option. You might use this option if you want to move volumes in a single step, or you do not have a volume mirror copy already.
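At the CLI level, these migration options map approximately to the following hedged sketch. The pool and volume names are hypothetical; splitvdiskcopy is allowed only when the copies are synchronized, and migratevdisk corresponds to the pool move when the pools have the same extent size:
addvdiskcopy -mdiskgrp POOL_TARGET VOL_APP01
splitvdiskcopy -copy 1 -name VOL_APP01_NEW VOL_APP01
migratevdisk -mdiskgrp POOL_TARGET -vdisk VOL_APP01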
When you use Volume Mirroring, consider how quorum candidate disks are allocated. Volume Mirroring maintains some state data on the quorum disks. If a quorum disk is not accessible and Volume Mirroring is unable to update the state information, a mirrored volume might need to be taken offline to maintain data integrity. To ensure the high availability of the system, ensure that multiple quorum candidate disks, which are allocated on different storage systems, are configured.
 
Quorum disk consideration: Mirrored volumes can be taken offline if there is no quorum disk available. This behavior occurs because synchronization status for mirrored volumes is recorded on the quorum disk. To protect against mirrored volumes being taken offline, follow the guidelines for setting up quorum disks.
The following are other Volume Mirroring usage cases and characteristics:
Creating a mirrored volume:
 – The maximum number of copies is two.
 – Both copies are created with the same virtualization policy.
To have a volume mirrored using different policies, you need to add a volume copy with a different policy to a volume that has only one copy.
 – Both copies can be located in different Storage Pools. The first Storage Pool that is specified contains the primary copy.
 – It is not possible to create a volume with two copies when specifying a set of MDisks.
Add a volume copy to an existing volume:
 – The volume copy to be added can have a different space allocation policy.
 – Two existing volumes with one copy each cannot be merged into a single mirrored volume with two copies.
Remove a volume copy from a mirrored volume:
 – The volume remains with only one copy.
 – It is not possible to remove the last copy from a volume.
Split a volume copy from a mirrored volume and create a new volume with the split copy:
 – This function is only allowed when the volume copies are synchronized. Otherwise, use the -force parameter.
 – It is not possible to recombine the two volumes after they have been split.
 – Adding and splitting in one workflow enables migrations that are not currently allowed.
 – The split volume copy can be used as a means for creating a point-in-time copy (clone).
Repair or validate a mirrored volume in one of three ways. This function compares the volume copies and performs one of the following actions:
 – Reports the first difference found. It can iterate by starting at a specific LBA by using the -startlba parameter.
 – Creates virtual medium errors where there are differences.
 – Corrects the differences that are found (reads from primary copy and writes to secondary copy).
View to list volumes affected by a back-end disk subsystem being offline:
 – Assumes that a standard use is for mirror between disk subsystems.
 – Verifies that mirrored volumes remain accessible if a disk system is being shut down.
 – Reports an error in case a quorum disk is on the back-end disk subsystem.
Expand or shrink a volume:
 – This function works on both of the volume copies at once.
 – All volume copies always have the same size.
 – All copies must be synchronized before expanding or shrinking them.
Delete a volume. When a volume gets deleted, all copies get deleted.
Migration commands apply to a specific volume copy.
Out-of-sync bitmaps share the bitmap space with FlashCopy and Metro Mirror/Global Mirror. Creating, expanding, and changing I/O groups might fail if there is insufficient memory.
GUI views now contain volume copy identifiers.
5.5.3 Mirrored volume components
Note the following points regarding mirrored volume components:
A mirrored volume is always composed of two copies (copy0 and copy1).
A volume that is not mirrored consists of a single copy (which for reference might be copy 0 or copy 1).
A mirrored volume looks the same to upper-layer clients as a non-mirrored volume. That is, upper layers within the cluster software, such as FlashCopy and Metro Mirror/Global Mirror, and storage clients, do not know whether a volume is mirrored. They all continue to handle the volume as they did before without being aware of whether the volume is mirrored.
Figure 5-36 shows the attributes of a volume and Volume Mirroring.
Figure 5-36 Attributes of a volume and Volume Mirroring
In Figure 5-36, XIV and DS8700 illustrate that a mirrored volume can use different storage devices.
5.5.4 Performance considerations of Volume Mirroring
Because the writes of mirrored volumes always occur to both copies, mirrored volumes put more workload on the cluster, the back-end disk subsystems, and the connectivity infrastructure.
The mirroring is symmetrical, and writes are only acknowledged when the write to the last copy completes. The result is that if the volume copies are on Storage Pools with different performance characteristics, the slowest Storage Pool determines the performance of writes to the volume. This performance penalty applies when writes must be destaged to disk.
 
Recommendation: Locate volume copies of one volume on Storage Pools of the same or similar characteristics. Usually, if only good read performance is required, you can place the primary copy of a volume in a Storage Pool with better performance. Because the data is always only read from one volume copy, reads are not faster than without Volume Mirroring.
However, be aware that this is only true when both copies are synchronized. If the primary is out of sync, then reads are submitted to the other copy. Finally, note that these considerations do not apply to IBM Spectrum Virtualize systems in Enhanced Stretched Cluster configuration where the primary copy attribute is irrelevant.
Synchronization between volume copies has a similar impact on the cluster and the back-end disk subsystems as FlashCopy or data migration. The synchronization rate is a property of a volume that is expressed as a value of 0 - 100. A value of 0 disables synchronization.
Table 5-9 shows the relationship between the rate value and the data copied per second.
Table 5-9 Relationship between the rate value and the data copied per second
User-specified rate attribute value per volume     Data copied/sec
0                                                   Synchronization is disabled
1 - 10                                              128 KB
11 - 20                                             256 KB
21 - 30                                             512 KB
31 - 40                                             1 MB
41 - 50                                             2 MB (50% is the default value)
51 - 60                                             4 MB
61 - 70                                             8 MB
71 - 80                                             16 MB
81 - 90                                             32 MB
91 - 100                                            64 MB
 
Rate attribute value: The rate attribute is configured on each volume that you want to mirror. The default value of a new volume mirror is 50%.
In large IBM Spectrum Virtualize or Storwize system configurations, the copy rate settings can considerably affect performance when a back-end storage failure occurs. For instance, consider a scenario where a failure of a back-end storage controller affects one copy of 300 mirrored volumes. The hosts continue operations by using the remaining copies. When the failed controller comes back online, the resynchronization process for all 300 mirrored volumes starts at the same time. With a copy rate of 100 for each volume, this process would add a theoretical workload of 18.75 GBps, which would drastically overload the system.
The general recommendation for the copy rate setting is therefore to evaluate the impact of a massive resynchronization and to set the parameter accordingly, as the sketch that follows shows.
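For example, the synchronization rate of an existing mirrored volume could be lowered to limit the resynchronization workload, as in this minimal sketch with a hypothetical volume name (the value is the 0 - 100 rate attribute from Table 5-9):
chvdisk -syncrate 30 VOL_DB01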
Mirrored Volume and I/O Time-out Configuration
The source volume has pointers to two copies (mirrored volume copies) of data, each in different storage pools, and each write completes on both copies before the host receives I/O completion status.
For a synchronized mirrored volume, if a write I/O to a copy has failed or a long timeout has expired, then the system has completed all available controller-level Error Recovery Procedures (ERPs). In this case, that copy is taken offline and goes out of sync. The volume remains online and continues to service I/O requests from the remaining copy.
The Fast Failover feature isolates hosts from temporarily poorly-performing back-end storage of one Copy at the expense of a short interruption to redundancy.
With the fast failover feature, during normal processing of host write I/O, the system submits writes to both copies with a timeout of 10 seconds (20 seconds for stretched volumes). If one write succeeds and the other write takes longer than 5 seconds, the slow write is aborted. The Fibre Channel abort sequence can take around 25 seconds.
When the abort is done, one copy is marked as out of sync and the host write I/O is completed. The overall fast failover ERP aims to complete the host I/O in around 30 seconds (40 seconds for stretched volumes).
In v6.3.x and later, the fast failover can be set for each mirrored volume by using the chvdisk command and the mirror_write_priority attribute settings:
Latency (default value): A short timeout prioritizing low host latency. This option enables the fast failover feature.
Redundancy: A long timeout prioritizing redundancy. This option indicates a copy that is slow to respond to a write I/O can use the full ERP time. The response to the I/O is delayed until it completes to keep the copy in sync if possible. This option disables the fast failover feature.
Volume Mirroring ceases to use the slow copy for 4 - 6 minutes, and subsequent I/O data is not affected by the slow copy. Synchronization is suspended during this period. After the suspension period completes, Volume Mirroring resumes, allowing I/O data and synchronization operations to the slow copy, which typically completes the synchronization quickly.
If another I/O times out during the synchronization, then the system stops using that copy again for 4 - 6 minutes. If one copy is always slow, then the system tries it every 4 - 6 minutes and the copy gets progressively more out of sync as more grains are written. If fast failovers are occurring regularly, there is probably an underlying performance problem with the copy’s back-end storage.
For mirrored volumes in Enhanced Stretched Cluster configurations, generally set the mirror_write_priority field to latency.
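A minimal sketch of setting this attribute per volume follows; the volume names are hypothetical, and the -mirrorwritepriority parameter spelling should be verified for your code level:
chvdisk -mirrorwritepriority latency VOL_DB01
chvdisk -mirrorwritepriority redundancy VOL_DB02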
5.5.5 Bitmap space for out-of-sync volume copies
The grain size for the synchronization of volume copies is 256 KB. One grain takes up one bit of bitmap space. 20 MB of bitmap space supports 40 TB of mirrored volumes. This relationship is the same as the relationship for copy services (Global and Metro Mirror) and standard FlashCopy with a grain size of 256 KB (Table 5-10).
Table 5-10 Relationship of bitmap space to Volume Mirroring address space
Function: Volume Mirroring (grain size: 256 KB)
1 byte of bitmap space gives a total of 2 MB of volume capacity
4 KB of bitmap space gives a total of 8 GB of volume capacity
1 MB of bitmap space gives a total of 2 TB of volume capacity
20 MB of bitmap space gives a total of 40 TB of volume capacity
512 MB of bitmap space gives a total of 1024 TB of volume capacity
 
Shared bitmap space: This bitmap space on one I/O group is shared between Metro Mirror, Global Mirror, FlashCopy, and Volume Mirroring.
The command to create mirrored volumes can fail if there is not enough space to allocate bitmaps in the target I/O Group. To verify and change the space allocated and available on each I/O Group with the CLI, see Example 5-4.
Example 5-4 A lsiogrp and chiogrp command example
IBM_2145:SVC_ESC:superuser>lsiogrp
id name node_count vdisk_count host_count site_id site_name
0 io_grp0 2 9 0
1 io_grp1 0 0 0
2 io_grp2 0 0 0
3 io_grp3 0 0 0
4 recovery_io_grp 0 0 0
IBM_2145:SVC_ESC:superuser>lsiogrp io_grp0
id 0
name io_grp0
node_count 2
vdisk_count 9
host_count 0
flash_copy_total_memory 20.0MB
flash_copy_free_memory 19.9MB
remote_copy_total_memory 20.0MB
remote_copy_free_memory 19.9MB
mirroring_total_memory 20.0MB
mirroring_free_memory 20.0MB
raid_total_memory 40.0MB
raid_free_memory 40.0MB
.
lines removed for brevity
.
IBM_2145:SVC_ESC:superuser>chiogrp -feature mirror -size 64 io_grp0
IBM_2145:SVC_ESC:superuser>lsiogrp io_grp0
id 0
name io_grp0
node_count 2
vdisk_count 9
host_count 0
flash_copy_total_memory 20.0MB
flash_copy_free_memory 19.9MB
remote_copy_total_memory 20.0MB
remote_copy_free_memory 19.9MB
mirroring_total_memory 64.0MB
mirroring_free_memory 64.0MB
.
lines removed for brevity
.
To verify and change the space allocated and available on each I/O Group with the GUI, see Figure 5-37.
Figure 5-37 IOgrp feature example
5.5.6 Synchronization status of volume copies
When a volume is created with two copies, the copies are initially in the out-of-synchronization state. The primary volume copy (located in the first specified Storage Pool) is defined as in sync and the secondary volume copy as out of sync. The secondary copy is synchronized through the synchronization process. This process runs at the default synchronization rate of 50 (Table 5-9 on page 225), or at the rate defined while creating or modifying the volume. See 5.5.4, “Performance considerations of Volume Mirroring” on page 224 for the effect of the copy rate setting.
The -fmtdisk parameter ensures that both copies are overwritten with zeros. After this process completes, the volume comes online and the copies can be considered synchronized because both are filled with zeros. Starting with version 7.5, the format process is initiated by default at the time of volume creation.
You can specify that a volume is synchronized (-createsync parameter), even if it is not. Using this parameter can cause data corruption if the primary copy fails and leaves an unsynchronized secondary copy to provide data. Using this parameter can cause loss of read stability in unwritten areas if the primary copy fails, data is read from the primary copy, and then different data is read from the secondary copy. To avoid data loss or read stability loss, use this parameter only for a primary copy that has been formatted and not written to. Also, use it with the -fmtdisk parameter.
Another example use case for -createsync is a newly created mirrored volume where both copies are thin-provisioned or compressed, because no data has been written to disk and unwritten areas return zeros. If the synchronization between the volume copies is lost, the resynchronization process is incremental, which means that only grains that have been written to need to be copied to return the volume copies to a synchronized state.
The progress of the volume mirror synchronization can be obtained from the GUI or by using the lsvdisksyncprogress command.
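For illustration, the following hedged sketch creates a mirrored volume with one copy in each of two hypothetical Storage Pools and then checks the synchronization progress; parameter availability depends on the code level:
mkvdisk -mdiskgrp POOL_A:POOL_B -iogrp 0 -size 100 -unit gb -copies 2 -fmtdisk -name VOL_MIRR01
lsvdisksyncprogress VOL_MIRR01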