Designed for performance
This chapter describes the performance characteristics of the IBM DS8880 that relate to the physical and logical configuration. The considerations that are presented in this chapter can help you plan the physical and logical setup of the DS8880.
This chapter covers the following topics:
6.1 DS8880 hardware: Performance characteristics
The IBM DS8880 is designed to support the most demanding business applications with its exceptional all-around performance and data throughput. These features are combined with world-class business resiliency and encryption capabilities to deliver a unique combination of high availability, performance, scalability, and security.
The DS8880 features IBM POWER8 processor-based server technology and uses a
PCI Express (PCIe) I/O infrastructure. The DS8880 includes options for 6-core, 8-core, 16-core, 24-core, or 48-core processors for each processor complex. Up to 2 TB of system memory are available in the DS8880 for increased performance.
 
Note: Six-core processors are used only in the DS8884. The DS8886 uses 8-core, 16-core, and 24-core processors. The DS8888 uses 24-core or 48-core processors.
This section reviews the architectural layers of the DS8880 and describes the performance characteristics that differentiate the DS8880 from other disk systems.
6.1.1 Vertical growth and scalability
The DS8880 offers nondisruptive memory upgrades, and processor upgrades for the DS8886 for increased performance. For connectivity, 8 Gb and 16 Gb host adapters can be added without disruption. A wide range of storage options are available by using hard disk drives (HDDs), flash drives, or solid-state drives (SSDs), and high-performance flash enclosures (HPFEs). Expansion frames can also be added to the base 980, 981, or 982 frame for increased capacity. Other advanced-function software features, such as IBM Easy Tier, I/O Priority Manager, and storage pool striping, contribute to performance potential.
For more information about hardware and architectural scalability, see Chapter 2, “IBM DS8880 hardware components and architecture” on page 23.
Figure 6-1 shows an example of how the DS8886 performance scales as the configuration changes from eight cores to 24 cores in an open systems database environment.
Figure 6-1 DS8886 linear performance scalability
6.1.2 POWER8
The DS8884 model 980 systems use 2U 8284-22A servers, which are based on 3.89 GHz POWER8 processors.
The DS8886 model 981 systems use 4U 8286-42A servers, which are based on 4.15 or 3.52 GHz POWER8 processors.
The DS8888 model 982 systems use 4U 8408-E8E servers, which are based on 3.026 GHz POWER8 processors.
Figure 6-2 shows the Hardware Management Console (HMC) server view.
Figure 6-2 Power S824 servers 8286-42A (DS8886 based on POWER8)
The POWER8 processor is based on 22-nm technology, which allows higher density packaging than the 32-nm lithography on which the earlier POWER7+ processors were based. The number of transistors for each chip doubled from 2.1 billion to 4.2 billion. The POWER8 processor chip contains 1.5 times the number of processor cores that were available in POWER7+.
Each POWER8 processor core contains 32 KB of Level 1 (L1) instruction cache, 64 KB of L1 data cache, and 512 KB of L2 cache, which is double the L2 cache that is available in POWER7+. The POWER8 processor chip contains 96 MB of embedded dynamic random access memory (eDRAM) L3 cache, which is divided into multiple 8 MB regions for each core. These regions can be dynamically shared between processor cores. New to POWER8 is an additional L4 cache, which is contained within the memory buffer chips on the custom memory dual inline memory modules (CDIMMs).
Simultaneous multithreading (SMT) allows a single physical processor core to simultaneously dispatch instructions from more than one hardware thread context. With its larger and faster caches, and faster memory access, POWER8 processors can provide eight concurrent SMT threads for each core, with up to 192 threads for each internal server.
Figure 6-3 shows the POWER8 physical processor chip layout.
Figure 6-3 Twelve-way POWER8 processor module
PCI Express Gen3
POWER8 implements the PCIe Gen3 system bus architecture. Each POWER8 processor module has 48 PCIe Gen3 lanes, each running at 8 Gbps full-duplex, for an aggregate internal bandwidth of 192 GBps full-duplex for each 8286-42A internal server.
PCIe Gen3 I/O enclosures
To complement the improved I/O bandwidth that is provided by POWER8, the DS8880 also includes new PCIe Gen3-based I/O enclosures. Each I/O enclosure connects to both internal servers over a pair of x8 PCIe Gen3 cables, each providing 8 GBps of connectivity. An additional pair of PCIe connectors on the I/O enclosure provides PCIe connections to HPFEs. Each pair of I/O enclosures supports two HPFEs.
 
Note: The DS8888 All Flash array supports four HPFEs per pair of I/O enclosures.
6.1.3 The high-performance flash enclosure
The high-performance flash enclosures provide the highest standard of flash performance that is available in the DS8880. Each HPFE contains two RAID managers with 8-lane PCIe Gen2 interfaces. Each interface connects to the PCIe fabric in an I/O enclosure. The HPFE RAID managers are optimized for the performance capabilities of flash-based storage.
The 400 GB or 800 GB flash cards that are used in the HPFE are encryption-capable, and are packaged in a 1.8-inch form factor. They feature the Enterprise Multi-Level Cell (eMLC) technology. IBM was the first server vendor to provide this flash technology option, which blends enterprise class performance and reliability characteristics with the more cost-effective characteristics of MLC flash storage.
The 1.8-inch NAND flash cards that are used in the HPFE build on this base, using advances in device controller flash memory management and in eMLC technology. Like the earlier IBM eMLC SSDs, flash cards are designed to provide high sustained performance levels and extended endurance and reliability. The eMLC flash modules are designed to sustain 24×7×365 usage, even at write-intensive levels, for at least five years. Typical client usage is expected to be much lower, especially for the average percentage of writes. Therefore, drive lifespan can be much longer.
The HPFE arrays can coexist with Fibre Channel-attached flash drive arrays within the same extent pool. Both storage types are treated by Easy Tier as the highest performance tier, which is Tier 0. However, Easy Tier can differentiate between the performance capabilities of the two and run intra-tier rebalancing, so that “hotter” I/O-intensive extents of the volumes are moved to the high-performance flash arrays within Tier 0. For more information, see IBM DS8000 Easy Tier, REDP-4667.
6.1.4 Switched Fibre Channel Arbitrated Loops
Standard drive enclosures connect to the device adapters (DAs) by using Fibre Channel Arbitrated Loop (FC-AL) topology. A conventional FC-AL configuration creates arbitration issues within the loops. To overcome these issues, IBM employs a switch-based FC-AL topology in the DS8880. By using this approach, individual switched loops are created for each drive interface, providing isolation from the other drives, and routing commands and data only to the destination drives.
The drive enclosure interface cards contain switch logic that receives the FC-AL protocol from the device adapters, and attaches to each of the drives by using a point-to-point connection. The arbitration message for the drive is captured in the switch, processed, and then propagated to the intended drive without routing it through all of the other drives in the loop.
Each drive has two switched point-to-point connections to both enclosure interface cards, which in turn each connect to both DAs. This configuration has these benefits:
This architecture doubles the bandwidth over conventional FC-AL implementations because no arbitration competition exists and no interference occurs between one drive and all the other drives. Each drive effectively has two dedicated logical paths to each of the DAs that allow two concurrent read operations and two concurrent write operations.
In addition to superior performance, the reliability, availability, and serviceability (RAS) are improved in this setup when compared to conventional FC-AL. The failure of a drive is detected and reported by the switch. The switch ports distinguish between intermittent failures and permanent failures. The ports understand intermittent failures, which are recoverable, and collect data for predictive failure statistics. If one of the switches fails, a disk enclosure service processor detects the failing switch and reports the failure by using the other loop. All drives can still connect through the remaining switch.
FC-AL switched loops are shown in Figure 6-4.
Figure 6-4 Switched FC-AL drive loops
A virtualization approach that is built on top of the high-performance architectural design contributes even further to enhanced performance, as described in Chapter 4, “Virtualization concepts” on page 93.
6.1.5 Fibre Channel device adapter
The DS8880 supports standard drive enclosures that connect to the internal servers by using a pair of DAs. The device adapters provide RAID-5, RAID-6, and RAID-10 capabilities, by using PowerPC technology, with an application-specific integrated circuit (ASIC). Device adapters are PCIe Gen2-based. Each adapter provides four 8 Gbps Fibre Channel loops, and the adapter pair provides two independent fabrics to the drive enclosures.
The DS8880 uses the DAs in Split Affinity mode. Split Affinity means that each central processor complex (CPC) unit uses one device adapter in every I/O enclosure. This configuration allows both device adapters in an I/O enclosure to communicate concurrently because each DA uses a different PCIe connection between the I/O enclosure and the CPC. This design significantly improves performance when compared to the approach that was used before the DS8870.
6.1.6 Eight Gbps and 16 Gbps Fibre Channel host adapters
The DS8880 supports up to 32 Fibre Channel host adapters (HAs), with either four or eight FC ports for each adapter. Each port can be independently configured to support Fibre Channel connection (FICON) or Fibre Channel Protocol (FCP).
The DS8880 offers three Fibre Channel host adapter types: 8 Gbps 4-port, 8 Gbps 8-port, and 16 Gbps 4-port. Each adapter type is available in both longwave (LW) and shortwave (SW) versions. The DS8880 I/O bays support up to four HAs for each bay, allowing up to 128 ports maximum for each storage system. This configuration results in a theoretical aggregated host I/O bandwidth of 128 x 16 Gbps. Each port provides industry-leading throughput and I/O rates for FICON and FCP.
The host adapters that are available in the DS8880 include the following characteristics:
8 Gbps HBAs:
 – Four or eight Fibre Channel ports
 – Gen2 PCIe interface
 – Dual-core PowerPC processor
 – Negotiation to 8, 4, or 2 Gbps (1 Gbps is not possible.)
16 Gbps HBAs:
 – Four Fibre Channel ports
 – Gen2 PCIe interface
 – Quad-core PowerPC processor
 – Negotiation to 16, 8, or 4 Gbps (1 and 2 Gbps are not possible.)
The DS8880 supports intermixing 16 Gbps and 8 Gbps host adapters. Hosts with slower FC speeds are still supported if their HBAs are connected through a switch.
With FC adapters that are configured for FICON, the DS8000 series provides the following configuration capabilities:
Fabric or point-to-point topologies
A maximum of 128 host adapter ports, depending on the DS8880 system memory and processor features
A maximum of 509 logins for each FC port
A maximum of 8192 logins for each storage unit
A maximum of 1280 logical paths on each FC port
Access to all 255 control-unit images (65,280 CKD devices) over each FICON port
A maximum of 512 logical paths for each control unit image
IBM z13 servers support 32,000 devices per FICON host channel, whereas IBM zEnterprise® EC12 and IBM zEnterprise BC12 servers support 24,000 devices per FICON host channel. Earlier z Systems servers support 16,384 devices per FICON host channel. To fully access 65,280 devices, it is necessary to connect multiple FICON host channels to the storage system. You can access the devices through a Fibre Channel switch or FICON director to a single storage system FICON port.
The 16 Gbps FC host adapter doubles the data throughput of 8 Gbps links. Compared to
8 Gbps adapters, the 16 Gbps adapters provide improvements in full adapter I/O per second (IOPS) and reduce latency.
6.2 Software performance: Synergy items
Many performance features in the DS8880 work together with the software on IBM hosts and are collectively referred to as synergy items. These items allow the DS8880 to cooperate with the host systems to benefit the overall performance of the systems.
6.2.1 Synergy with Power Systems
The IBM DS8880 can work in cooperation with Power Systems to provide the following performance enhancement functions.
End-to-end I/O priority: Synergy with AIX and DB2 on Power Systems
End-to-end I/O priority is an IBM-requested addition to the Small Computer System Interface (SCSI) T10 standard. This feature allows trusted applications to override the priority that is given to each I/O by the operating system. This feature is only applicable to raw volumes (no file system) and with the 64-bit kernel. Currently, AIX supports this feature with DB2. The priority is delivered to the storage subsystem in the FCP Transport Header.
The priority of an AIX process can be 0 (no assigned priority) or any integer value from 1 (highest priority) to 15 (lowest priority). All I/O requests that are associated with a process inherit its priority value. However, with end-to-end I/O priority, DB2 can change this value for critical data transfers. At the DS8880, the host adapter gives preferential treatment to higher priority I/O, which improves performance for specific requests that are deemed important by the application, such as requests that might be prerequisites for other requests (for example, DB2 logs).
Cooperative caching: Synergy with AIX and DB2 on Power Systems
Another software-related performance item is cooperative caching, a feature that provides a way for the host to send cache management hints to the storage facility. Currently, the host can indicate that information that was recently accessed is unlikely to be accessed again soon. This hint decreases the retention period of that data in the storage system cache, which allows the subsystem to conserve its cache for data that is more likely to be reaccessed. Therefore, the cache hit ratio is improved.
With the implementation of cooperative caching, the AIX operating system allows trusted applications, such as DB2, to provide cache hints to the DS8000. This ability improves the performance of the subsystem by keeping more of the repeatedly accessed data in the DS8000 cache. Cooperative caching is supported in IBM AIX for Power Systems with the Multipath I/O (MPIO) Path Control Module (PCM) that is provided with the Subsystem Device Driver (SDD). It is only applicable to raw volumes (no file system) and with the 64-bit kernel.
Long busy wait host tolerance: Synergy with AIX on Power Systems
The SCSI T10 standard includes support for SCSI long busy wait, which provides a way for the target system to specify that it is busy and how long the initiator must wait before an I/O is tried again.
This information, provided in the FCP status response, prevents the initiator from trying again too soon. This delay, in turn, reduces unnecessary requests and potential I/O failures that can be caused by exceeding a set threshold for the number of times it is tried again. IBM AIX for Power Systems supports SCSI long busy wait with MPIO. It is also supported by the DS8880.
6.2.2 Synergy with z Systems
The DS8880 is able to work in cooperation with z Systems to provide several performance enhancement functions. The following section gives a brief overview of these synergy items.
Parallel access volume and HyperPAV
Parallel access volume (PAV) is included in the DS8880 z-Synergy Services license group, for the z/OS and z/VM operating systems. With PAV, you can run multiple I/O requests to a volume at the same time. With dynamic PAV, the z/OS Workload Manager (WLM) controls the assignment of alias addresses to base addresses. The number of alias addresses defines the parallelism of I/Os to a volume. HyperPAV is an extension to PAV where any alias address from a pool of addresses can be used to drive the I/O, without WLM involvement.
Cross Control Unit PAV or SuperPAV
With DS8880 release 8.1, HyperPAV has been further enhanced. If all alias addresses are busy, z/OS can now use an alias address from another logical control unit (LCU), provided that these conditions are met:
The alias devices are assigned to the same DS8000 server (odd or even LSS)
The alias devices share a path group on the z/OS host
The system now has a larger pool of alias addresses for bursts of I/O to a base volume, which reduces I/O queuing. This capability is known as Cross CU PAV or SuperPAV. See “SuperPAV” on page 107.
DS8000 I/O Priority Manager with z/OS Workload Manager
I/O Priority Manager, together with z/OS Workload Manager (zWLM), enables more effective storage consolidation and performance management when different workloads share a common disk pool (extent pool). This function, now tightly integrated with zWLM, is intended to improve disk I/O performance for important workloads. It drives I/O prioritization to the disk system by allowing WLM to give priority to the system’s resources (disk arrays) automatically when higher priority workloads are not meeting their performance goals. The I/O of less prioritized workloads to the same extent pool is slowed down to give the higher prioritized workload a higher share of the resources, mainly the disk drives. Integration with zWLM is exclusive to the DS8000 and z Systems. For more information about I/O Priority Manager, see DS8000 I/O Priority Manager, REDP-4760.
Easy Tier support
IBM Easy Tier is an intelligent data placement algorithm in the DS8000 that is designed to support both open systems and z Systems workloads.
Specifically for z Systems, IBM Easy Tier provides an application programming interface (API) through which z Systems applications (zDB2 initially) can communicate performance requirements for optimal data set placement. The application hints set the intent, and Easy Tier then moves the data set to the correct tier. For more information, see 6.7, “IBM Easy Tier” on page 168.
Extended address volumes
This capability can help relieve address constraints to support large storage capacity needs by enabling z Systems environments to support volumes that scale up to approximately 1 TB (1,182,006 cylinders).
High Performance FICON for z Systems
High Performance FICON for z Systems (zHPF) is an enhancement of the FICON channel architecture. You can reduce the impact of FICON channel I/O traffic by using zHPF with the FICON channel, the z/OS operating system, and the storage system. zHPF allows the storage system to stream the data for multiple commands back in a single data transfer section for I/Os that are initiated by various access methods. This process improves the channel throughput on small block transfers.
zHPF is an optional feature of the DS8880, included in the z-Synergy Services license group. Recent enhancements to zHPF include Extended Distance capability, zHPF List Pre-fetch, Format Write, and zHPF support for sequential access methods. The DS8880 with zHPF and z/OS V1.13 offers I/O performance improvements for certain I/O transfers for workloads that use these methods:
Queued sequential access method (QSAM)
Basic partitioned access method (BPAM)
Basic sequential access method (BSAM)
With 16 Gbps host adapters on the DS8880 and z Systems server z13 channels, zHPF Extended Distance II supports heavy write I/Os over an extended distance of up to 100 km (62 miles). The result is an increase of the write throughput by 50% or better.
DB2 list prefetch
zHPF is enhanced to support DB2 list prefetch. The enhancements include a new cache optimization algorithm that can greatly improve performance and hardware efficiency. When combined with the latest releases of z/OS and DB2, it can demonstrate up to a 14x - 60x increase in sequential or batch-processing performance. All DB2 I/Os, including format writes and list prefetches, are eligible for zHPF. In addition, DB2 can benefit from the new caching algorithm at the DS8000 level, which is called List Prefetch Optimizer (LPO). For more information about list prefetch, see DB2 for z/OS and List Prefetch Optimizer, REDP-4862.
DB2 castout acceleration
DB2 offers a data sharing feature to improve scalability and availability over separate independent DB2 systems. DB2 uses the concept of a group buffer pool (GBP). It is a z/OS coupling facility structure to cache data that is accessed by multiple applications and ensure consistency.
In environments with write intensive activity, the group buffer pool fills up quickly and must be destaged to storage. This process is bursty and can cause performance problems on read threads. The process of writing pages from the GBP to disk is known as castout.
During a castout, DB2 typically generates long chains of writes, resulting in multiple I/Os. In a Metro Mirror environment, the updates must reach the secondary in order, and before the DS8000 Release 8.1 code became available, each I/O in the chain was synchronized individually with the secondary.
With Release 8.1 and later, the DS8000 can be notified through the DB2 Media Manager that the multiple I/Os in a castout can be treated as a single logical I/O, although there are multiple embedded I/Os. In other words, the data hardening requirement applies to the entire I/O chain. This new process brings a significant response time reduction.
This enhancement applies only to zHPF I/O.
Quick initialization (z Systems)
The DS8880 supports quick volume initialization for z Systems environments, which can help clients who frequently delete volumes, allowing capacity to be reconfigured without waiting for initialization. Quick initialization initializes the data logical tracks or blocks within a specified extent range on a logical volume with the appropriate initialization pattern for the host.
Normal read and write access to the logical volume is allowed during the initialization process. Depending on the operation, the quick initialization can be started for the entire logical volume or for an extent range on the logical volume.
Quick initialization improves device initialization speeds and allows a Copy Services relationship to be established after a device is created.
zHyperwrite
In a Metro Mirror environment, all writes (including DB2 log writes) are mirrored synchronously to the secondary device, which increases transaction response times. zHyperwrite enables DB2 log writes to be performed to the primary and secondary volumes in parallel, which reduces DB2 log write response times. Implementation of zHyperwrite requires that HyperSwap is enabled through either IBM Geographically Dispersed Parallel Sysplex (IBM GDPS) or IBM Tivoli Storage Productivity Center for Replication.
Further DS8880 synergy items with z Systems
The available 16 Gbps adapters offer additional end-to-end z Systems and DS8880 synergy items:
Forward Error Correction (FEC) is a protocol that is designed to capture errors that are generated during data transmission. Both the z Systems z13 and DS8880 extend the use of FEC to complete end-to-end coverage for 16 Gbps links and preserve data integrity with more redundancy.
Fibre Channel Read Diagnostic Parameters (RDP) improve the end-to-end link fault isolation for 16 Gbps links on the z Systems z13 and DS8880. RDP data provides the optical signal strength, error counters, and other critical information that is crucial to determine the quality of the link.
FICON Dynamic Routing (FIDR) is a z Systems z13 and DS8880 feature that supports the use of dynamic routing policies in the switch to balance loads across inter-switch links (ISLs) on a per I/O basis.
Fabric I/O Priority provides end-to-end quality of service (QoS). It is a unique synergy feature between the z13 z/OS Workload Manager, the Brocade SAN fabric, and the DS8880 system that is designed to manage QoS at the single I/O level.
For more information, see IBM DS8870 and IBM z Systems Synergy, REDP-5186, and Get More Out of Your IT Infrastructure with IBM z13 I/O Enhancements, REDP-5134.
6.3 Performance considerations for disk drives
When you are planning your system, determine the number and type of ranks that are required based on the needed capacity and on the workload characteristics in terms of access density, read-to-write ratio, and cache hit rates. These factors are weighed against the performance and capacity characteristics of the physical storage.
Current 15K revolutions per minute (RPM) disks, for example, provide an average seek time of approximately 3 ms and an average latency of 2 ms. For transferring only a small block, the transfer time can be neglected. The result is an average of 5 ms for each random disk I/O operation, or 200 IOPS per drive. Therefore, the eight disks that make up a DS8000 array can potentially sustain 1,600 IOPS when they are spinning at 15K RPM. Reduce the number by 12.5% (to 1,400 IOPS) when you assume a spare drive in the array site.
On the host side, consider an example with 1,000 IOPS from the host, a read-to-write ratio of 70/30, and 50% read cache hits. This configuration leads to the following IOPS numbers:
700 read IOPS.
350 read I/Os must be read from disk (based on the 50% read cache hit ratio).
300 writes with RAID 5 result in 1,200 disk operations because of the RAID 5 write penalty (read old data and parity, write new data and parity).
A total of 1,550 disk I/Os.
With 15K RPM disk drive modules (DDMs) running 1,000 random IOPS from the server, you complete 1,550 I/O operations on disk compared to a maximum of 1,600 operations for 7+P configurations or 1,400 operations for 6+P+S configurations. Therefore, in this scenario, 1,000 random I/Os from a server with a certain read-to-write ratio and a certain cache hit ratio saturate the disk drives. This scenario assumes that server I/O is purely random. When sequential I/Os exist, track-to-track seek times are much lower and higher I/O rates are possible. It also assumes that reads have a cache-hit ratio of only 50%. With higher hit ratios, higher workloads are possible. These considerations show the importance of intelligent caching algorithms as used in the DS8000. These algorithms are described in 6.4, “DS8000 superior caching algorithms” on page 155.
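The back-of-envelope calculation above can be expressed as a short sketch. This is only an illustration of the arithmetic; the drive service time, 70/30 read-to-write split, 50% read-hit ratio, and RAID 5 write penalty of four are the example values from the text, not measured DS8880 figures.

# Back-of-envelope sizing sketch using the example values from the text.
# These are illustrative assumptions, not measured DS8880 figures.

def backend_disk_ios(host_iops, read_ratio, read_hit_ratio, raid5_write_penalty=4):
    """Estimate the back-end disk I/Os generated by a host workload on RAID 5."""
    reads = host_iops * read_ratio
    writes = host_iops * (1 - read_ratio)
    disk_reads = reads * (1 - read_hit_ratio)        # read misses go to disk
    disk_writes = writes * raid5_write_penalty       # RAID 5 write penalty
    return disk_reads + disk_writes

# One 15K RPM drive: ~3 ms seek + ~2 ms latency = ~5 ms per random I/O -> ~200 IOPS
drive_iops = 1000 / 5
array_iops_7p = 8 * drive_iops            # 7+P array, all eight drives active
array_iops_6ps = array_iops_7p * 0.875    # 6+P+S array (one spare, -12.5%)

needed = backend_disk_ios(host_iops=1000, read_ratio=0.70, read_hit_ratio=0.50)

print(f"Back-end disk I/Os needed: {needed:.0f}")        # 350 + 1200 = 1550
print(f"Array capability, 7+P:    {array_iops_7p:.0f}")  # 1600
print(f"Array capability, 6+P+S:  {array_iops_6ps:.0f}") # 1400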
 
Important: When a storage system is sized, consider the capacity and the number of disk drives that are needed to satisfy the space requirements, and also the performance capabilities that will satisfy the IOPS requirements.
For a single disk drive, various disk vendors provide the disk specifications on their websites. Because drives with the same RPM have the same access times but different capacities, their I/O density differs. A 300 GB 15K RPM disk drive can be used for access densities up to, and slightly over, 0.5 I/O per GB. For 600-GB drives, the figure is approximately 0.25 I/O per GB. Although this discussion is theoretical in approach, it provides a first estimate.
After the speed of the disk is decided, the capacity can be calculated based on your storage capacity needs and the effective capacity of the RAID configuration that you use. For more information about calculating these needs, see Table 7-9 on page 197.
6.3.1 Flash storage
From a performance point of view, the best performing choice for your DS8880 storage is flash storage. Flash has no moving parts (no spinning platters and no actuator arm) and a lower energy consumption. The performance advantages are the fast seek time and average access time. Flash storage is targeted at applications with heavy IOPS, bad cache hit rates, and random access workload that necessitate fast response times. Database applications with their random and intensive I/O workloads are prime candidates for deployment on flash.
 
Improved Performance: New 400-GB flash cards for release 8.1 offer equivalent random read performance, and a 15% improvement in random writes.
Flash cards
Flash cards are available in the HPFE feature. Flash cards offer the highest performance option that is available in the DS8880. Integrated dual Flash RAID Adapters with native PCIe 8-lane attachment provide high-bandwidth connectivity, without the protocol impact of Fibre Channel. The DS8880 offers these maximums:
Up to 480 flash cards (400 GB or 800 GB) in the DS8888
Up to 240 flash cards (400 GB or 800 GB) in the DS8886
Up to 120 flash cards (400 GB or 800 GB) in the DS8884
Flash drives
Flash drives offer extra flash storage capacity by using small-form-factor (SFF) drives that are installed in Fibre Channel-attached standard drive enclosures. The DS8880 supports flash drives in 200 GB, 400 GB, 800 GB, and 1.6 TB capacities.
6.3.2 Enterprise drives
Enterprise drives provide high performance and cost-effective storage for various workloads. Enterprise drives rotate at 15,000 or 10,000 RPM. If an application requires high-performance data throughput and continuous, I/O-intensive operations, enterprise drives provide the best price-performance option.
 
New: 1.8-TB drives are introduced with release 8.1, offering a 50% increase in capacity over 1.2-TB models, with equivalent or better 4 KB random and sequential read/write performance. Rebuild rates are equivalent to the 1.2-TB drives.
6.3.3 Nearline drives
When disk capacity is a priority, nearline drives are the largest of the drives that are available for the DS8880. Given their large capacity, and lower (7,200 RPM) rotational speed, as compared to enterprise drives, nearline drives are not intended to support high-performance or I/O-intensive applications that demand fast random data access. For sequential workloads, nearline drives can be an attractive storage option. These nearline drives offer a cost-effective option for lower-priority data, such as fixed content, data archival, reference data, and nearline applications that require large amounts of storage capacity for lighter workloads. These drives are meant to complement, not compete with, enterprise drives.
The 6-TB nearline drives offer a 50% increase in capacity over 4-TB models, with equivalent or better 4 KB random read and write performance, and significantly better sequential performance. Rebuild rates are also equivalent to or better than those of the 4-TB nearline drives.
6.3.4 RAID level
The DS8000 series offers RAID 5, RAID 6, and RAID 10.
 
Important: Flash cards are configured as RAID 5 and RAID 10 arrays only. They cannot be configured for RAID 6.
RAID 5
RAID 5 is most frequently used because it provides good performance for random and sequential workloads, and it does not need much more storage for redundancy (one parity drive). The DS8000 series can detect sequential workload. When a complete stripe is in cache for destage, the DS8000 series switches to a RAID 3-like algorithm. Because a complete stripe must be destaged, the old data and parity do not need to be read. Instead, the new parity is calculated across the stripe, and the data and parity are destaged to disk. This configuration provides good sequential performance. A random write causes a cache hit, but the I/O is not complete until a copy of the write data is put in non-volatile storage (NVS). When data is destaged to disk, a write in RAID 5 causes the following four disk operations, the so-called write penalty:
Old data and the old parity information must be read.
New parity is calculated in the device adapter.
Data and parity are written to disk.
Most of this activity is hidden from the server or host because the I/O is complete when data enters cache and NVS.
RAID 6
RAID 6 increases data fault tolerance. It can tolerate two disk failures in the same rank, as compared to RAID 5, which is single disk fault tolerant. RAID 6 uses a second independent distributed parity scheme (dual parity). RAID 6 provides a read performance that is similar to RAID 5, but RAID 6 has a larger write penalty than RAID 5 because it must calculate and write a second parity stripe.
Consider RAID 6 in situations where you might consider RAID 5, but you need increased reliability. RAID 6 was designed for protection during longer rebuild times on larger capacity drives to cope with the risk of a second drive failure within a rank during rebuild. RAID 6 has the following characteristics:
Sequential Read: About 99% x RAID 5 rate
Sequential Write: About 65% x RAID 5 rate
Random 4 KB 70% read / 30% write IOPS: About 55% x RAID 5 rate
If two disks fail in the same rank, the performance degrades during the rebuild.
 
Important: Configure large-capacity nearline drives as RAID 6 arrays. RAID 6 is also an option for the enterprise drives. In general, the IBM delivery organization highly recommends RAID 6 for all drives of 900 GB and above, and strongly advises RAID 6 even for 600-GB drives.
RAID 10
A workload that consists mostly of random writes benefits from RAID 10. Here, data is striped across several disks and mirrored to another set of disks. A write causes only two disk operations when compared to four operations of RAID 5. RAID 10 requires nearly twice as many disk drives (and twice the cost) for the same capacity when compared to RAID 5, but it can achieve four times greater random write throughput. Therefore, it is worth considering the use of RAID 10 for high-performance random write workloads.
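To make the trade-off concrete, the following sketch compares the back-end operations that each RAID level generates for the same random host workload. The RAID 5 and RAID 10 penalties (four and two operations per write) come from the text; the RAID 6 value of six is the nominal small-block penalty implied by the second parity stripe and is stated here as an assumption. Sequential full-stripe writes and caching change these figures in practice.

# Nominal back-end operations per random small-block host write on destage.
# RAID 5: read old data + old parity, write new data + new parity  -> 4
# RAID 6: as RAID 5, plus read/write of the second (Q) parity      -> 6 (assumed nominal value)
# RAID 10: write the data block and its mirror                     -> 2
WRITE_PENALTY = {"RAID 5": 4, "RAID 6": 6, "RAID 10": 2}

def backend_ops(host_read_iops, host_write_iops, read_hit_ratio, raid_level):
    """Back-end I/Os for a random workload; reads that hit cache never reach disk."""
    disk_reads = host_read_iops * (1 - read_hit_ratio)
    disk_writes = host_write_iops * WRITE_PENALTY[raid_level]
    return disk_reads + disk_writes

for level in WRITE_PENALTY:
    total = backend_ops(host_read_iops=700, host_write_iops=300,
                        read_hit_ratio=0.50, raid_level=level)
    print(f"{level:7}: {total:.0f} back-end I/Os for 1,000 host I/Os (70/30, 50% read hits)")

For the same 1,000 host I/Os, the sketch yields 1,550 back-end operations for RAID 5, 2,150 for RAID 6, and 950 for RAID 10, which illustrates why random-write-heavy workloads favor RAID 10.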
The decision to configure capacity as RAID 5, RAID 6, or RAID 10, and the amount of capacity to configure for each type, can be made at any time. RAID 5, RAID 6, and RAID 10 arrays can be intermixed within a single system. In addition, the physical capacity can be logically reconfigured later (for example, RAID 6 arrays can be reconfigured into RAID 5 arrays). However, the arrays must first be emptied. Changing the RAID level is not permitted when logical volumes exist.
 
Important: For more information about important restrictions on DS8880 RAID configurations, see 3.5.1, “RAID configurations” on page 78.
6.4 DS8000 superior caching algorithms
Most, if not all, high-end disk systems have an internal cache that is integrated into the system design. The DS8880 can be equipped with up to 2,048 GB of system memory, most of which is configured as cache. The DS8880 offers twice the maximum system memory of the DS8870.
With its POWER8 processors, the server architecture of the DS8880 makes it possible to manage such large caches with small cache segments of 4 KB (and large segment tables). The POWER8 processors have the power to support sophisticated caching algorithms, which contribute to the outstanding performance that is available in the IBM DS8880. These algorithms and the small cache segment size optimize cache hits and cache utilization. Cache hits are also optimized for different workloads, such as sequential workloads and transaction-oriented random workloads, which might be active at the same time. Therefore, the DS8880 provides excellent I/O response times.
Write data is always protected by maintaining a second copy of cached write data in the NVS of the other internal server until the data is destaged to disks.
6.4.1 Sequential Adaptive Replacement Cache
The DS8000 series uses the Sequential Adaptive Replacement Cache (SARC) algorithm, which was designed by IBM Storage Development in partnership with IBM Research. It is a self-tuning, self-optimizing solution for a wide range of workloads with a varying mix of sequential and random I/O streams. SARC is inspired by the Adaptive Replacement Cache (ARC) algorithm and inherits many of its features. For more information about ARC, see “Outperforming LRU with an adaptive replacement cache algorithm” by N. Megiddo, et al., in IEEE Computer, volume 37, number 4, pages 58 - 65, 2004. For more information about SARC, see “SARC: Sequential Prefetching in Adaptive Replacement Cache” by Binny Gill, et al., in “Proceedings of the USENIX 2005 Annual Technical Conference”, pages 293 - 308.
SARC attempts to determine the following cache characteristics:
When data is copied into the cache
Which data is copied into the cache
Which data is evicted when the cache becomes full
How the algorithm dynamically adapts to different workloads
The DS8000 series cache is organized in 4-KB pages that are called cache pages or slots. This unit of allocation (which is smaller than the values that are used in other storage systems) ensures that small I/Os do not waste cache memory.
The decision to copy data into the DS8880 cache can be triggered from the following policies:
Demand paging
Eight disk blocks (a 4 KB cache page) are brought in only on a cache miss. Demand paging is always active for all volumes and ensures that I/O patterns with locality find at least the recently used data in the cache.
Prefetching
Data is copied into the cache speculatively even before it is requested. To prefetch, a prediction of likely data accesses is needed. Because effective, sophisticated prediction schemes need an extensive history of page accesses (which is not feasible in real systems), SARC uses prefetching for sequential workloads. Sequential access patterns naturally arise in video-on-demand, database scans, copy, backup, and recovery. The goal of sequential prefetching is to detect sequential access and effectively prefetch the likely cache data to minimize cache misses. Today, prefetching is ubiquitously applied in web servers and clients, databases, file servers, on-disk caches, and multimedia servers.
For prefetching, the cache management uses tracks. A track is a set of 128 disk blocks (16 cache pages). To detect a sequential access pattern, counters are maintained with every track to record whether a track was accessed together with its predecessor. Sequential prefetching becomes active only when these counters suggest a sequential access pattern. In this manner, the DS8880 monitors application read I/O patterns and dynamically determines whether it is optimal to stage into cache the following I/O elements:
Only the page requested
The page that is requested plus the remaining data on the disk track
An entire disk track (or a set of disk tracks) that was not requested
The decision of when and what to prefetch is made in accordance with the Adaptive Multi-stream Prefetching (AMP) algorithm. This algorithm dynamically adapts the amount and timing of prefetches optimally on a per-application basis (rather than a system-wide basis). For more information about AMP, see 6.4.2, “Adaptive Multi-stream Prefetching” on page 158.
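The track-counter idea behind sequential detection can be pictured with a small sketch. This is a conceptual illustration only: the detection threshold and prefetch depth used here are arbitrary assumptions, not the actual SARC or AMP parameters, which are internal to the DS8000 microcode.

# Conceptual sketch of sequential detection with per-track counters.
# SEQ_THRESHOLD and PREFETCH_TRACKS are arbitrary illustrative assumptions.
from collections import defaultdict

SEQ_THRESHOLD = 2      # assumed: consecutive-access count that triggers prefetch
PREFETCH_TRACKS = 4    # assumed: how many tracks to stage ahead

class SequentialDetector:
    def __init__(self):
        self.seq_counter = defaultdict(int)   # per-track "accessed after predecessor" count

    def on_read(self, track):
        # Record whether this track was accessed together with its predecessor.
        self.seq_counter[track] = self.seq_counter[track - 1] + 1 if track > 0 else 1
        if self.seq_counter[track] >= SEQ_THRESHOLD:
            # Looks sequential: stage the following tracks speculatively.
            return [track + i for i in range(1, PREFETCH_TRACKS + 1)]
        return []                             # random access: demand paging only

detector = SequentialDetector()
for t in (10, 11, 12, 13):                    # a short sequential run of tracks
    print(f"read track {t} -> prefetch {detector.on_read(t)}")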
To decide which pages are evicted when the cache is full, sequential and random (non-sequential) data is separated into separate lists. The SARC algorithm for random and sequential data is shown in Figure 6-5.
Figure 6-5 Sequential Adaptive Replacement Cache
A page that was brought into the cache by simple demand paging is added to the head of the Most Recently Used (MRU) section of the RANDOM list. Without further I/O access, it goes down to the bottom of the Least Recently Used (LRU) section. A page that was brought into the cache by a sequential access or by sequential prefetching is added to the head of the MRU section of the SEQ list and then goes down that list. Other rules control the migration of pages between the lists so that the system does not keep the same pages in memory twice.
To follow workload changes, the algorithm trades cache space between the RANDOM and SEQ lists dynamically and adaptively. This function makes SARC scan resistant so that one-time sequential requests do not pollute the whole cache. SARC maintains a wanted size parameter for the sequential list. The wanted size is continually adapted in response to the workload. Specifically, if the bottom portion of the SEQ list is more valuable than the bottom portion of the RANDOM list, the wanted size is increased. Otherwise, the wanted size is decreased. The constant adaptation strives to make optimal use of limited cache space and delivers greater throughput and faster response times for a specific cache size.
Additionally, the algorithm dynamically modifies the sizes of the two lists and the rate at which the sizes are adapted. In a steady state, pages are evicted from the cache at the rate of cache misses. A larger (or smaller) rate of misses leads to a faster (or slower) rate of adaptation.
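The adaptation of the SEQ-list target size can be sketched in a few lines. This is a simplified illustration of the idea described above (grow the SEQ list when its bottom is more valuable than the bottom of the RANDOM list, and adapt faster when the miss rate is higher); the way "value" is modeled here, and the step size, are assumptions, not the actual SARC implementation.

# Simplified sketch of the SARC desired-size adaptation for the SEQ list.
# The "value" of a list bottom is modeled as the recent hit rate observed near
# the LRU end of that list; this is an illustrative assumption.

def adapt_seq_target(seq_target, seq_bottom_hit_rate, rnd_bottom_hit_rate,
                     miss_rate, cache_size, step=1):
    """Return the new desired size (in pages) of the SEQ list."""
    delta = step * miss_rate                  # adapt faster when evictions happen faster
    if seq_bottom_hit_rate > rnd_bottom_hit_rate:
        seq_target += delta                   # SEQ bottom is more valuable: grow the SEQ list
    else:
        seq_target -= delta                   # RANDOM bottom is more valuable: shrink it
    return max(0, min(cache_size, seq_target))

target = 1000
for seq_hits, rnd_hits, misses in [(0.30, 0.10, 50), (0.05, 0.20, 20)]:
    target = adapt_seq_target(target, seq_hits, rnd_hits, misses, cache_size=10_000)
    print(f"new SEQ target size: {target}")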
Other implementation details account for the relationship of read and write (NVS) cache, efficient destaging, and the cooperation with Copy Services. In this manner, the DS8880 cache management goes far beyond the usual variants of the LRU/Least Frequently Used (LFU) approaches, which are widely used in other storage systems on the market.
6.4.2 Adaptive Multi-stream Prefetching
SARC dynamically divides the cache between the RANDOM and SEQ lists, where the SEQ list maintains pages that are brought into the cache by sequential access or sequential prefetching.
In the DS8880, AMP, an algorithm that was developed by IBM Research, manages the SEQ list. AMP is an autonomic, workload-responsive, self-optimizing prefetching technology that adapts the amount of prefetch and the timing of prefetch on a per-application basis to maximize the performance of the system. The AMP algorithm solves the following problems that plague most other prefetching algorithms:
Prefetch wastage occurs when prefetched data is evicted from the cache before it can be used.
Cache pollution occurs when less useful data is prefetched instead of more useful data.
By wisely choosing the prefetching parameters, AMP provides optimal sequential read performance and maximizes the aggregate sequential read throughput of the system. The amount that is prefetched for each stream is dynamically adapted according to the application’s needs and the space that is available in the SEQ list. The timing of the prefetches is also continuously adapted for each stream to avoid misses and any cache pollution.
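A minimal sketch of the per-stream adaptation idea follows. It only illustrates the two knobs that AMP tunes for each stream (how much to prefetch and when to trigger the next prefetch); the doubling and halving rules shown are simplified assumptions for illustration, not the published AMP algorithm.

# Simplified per-stream prefetch adaptation in the spirit of AMP.
# Each sequential stream carries its own prefetch degree (how much) and
# trigger distance (when); the update rules here are illustrative assumptions.

class StreamPrefetchState:
    def __init__(self, degree=4, trigger=2, max_degree=64):
        self.degree = degree        # tracks prefetched per prefetch operation
        self.trigger = trigger      # unread prefetched tracks remaining that trigger the next prefetch
        self.max_degree = max_degree

    def on_prefetch_miss(self):
        # The stream ran ahead of its prefetched data: prefetch more, and earlier.
        self.degree = min(self.max_degree, self.degree * 2)
        self.trigger = min(self.degree, self.trigger + 1)

    def on_prefetch_evicted_unused(self):
        # Prefetched data aged out of the SEQ list before it was read: prefetch less.
        self.degree = max(1, self.degree // 2)
        self.trigger = max(1, self.trigger - 1)

s = StreamPrefetchState()
s.on_prefetch_miss()              # stream outpaced the cache -> degree 8, trigger 3
s.on_prefetch_evicted_unused()    # wastage detected          -> degree 4, trigger 2
print(s.degree, s.trigger)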
SARC and AMP play complementary roles. SARC carefully divides the cache between the RANDOM and the SEQ lists to maximize the overall hit ratio. AMP manages the contents of the SEQ list to maximize the throughput that is obtained for the sequential workloads. SARC affects cases that involve both random and sequential workloads. However, AMP helps any workload that has a sequential read component, including pure sequential read workloads.
AMP dramatically improves performance for common sequential and batch processing workloads. It also provides excellent performance synergy with DB2 by preventing table scans from being I/O-bound and improves performance of index scans and DB2 utilities, such as Copy and Recover. Furthermore, AMP reduces the potential for array hot spots that result from extreme sequential workload demands.
For more information about AMP and the theoretical analysis for its optimal usage, see “AMP: Adaptive Multi-stream Prefetching in a Shared Cache” by Gill. For a more detailed description, see “Optimal Multistream Sequential Prefetching in a Shared Cache” by Gill, et al.
6.4.3 Intelligent Write Caching
Another cache algorithm, which is referred to as Intelligent Write Caching (IWC), was implemented in the DS8000 series. IWC improves performance through better write cache management and a better destaging order of writes. This algorithm is a combination of CLOCK, a predominantly read cache algorithm, and CSCAN, an efficient write cache algorithm. Out of this combination, IBM produced a powerful and widely applicable write cache algorithm.
The CLOCK algorithm uses temporal ordering. It keeps a circular list of pages in memory, with the clock hand pointing to the oldest page in the list. When a page must be inserted in the cache, the R (recency) bit at the clock hand’s location is inspected. If R is zero, the new page is put in place of the page that the clock hand points to, and R is set to 1. Otherwise, the R bit is cleared (set to zero), the clock hand moves one step clockwise, and the process is repeated until a page is replaced.
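The CLOCK replacement step just described can be written compactly. The following is a generic textbook CLOCK sketch, not DS8000 code.

# Generic CLOCK page replacement: a circular buffer of (page, recency bit) entries.
# When a slot is needed, the hand advances until it finds a page with R == 0,
# clearing R bits (giving pages a second chance) along the way.

class Clock:
    def __init__(self, size):
        self.slots = [None] * size    # each entry: [page_id, recency_bit]
        self.hand = 0

    def access(self, page):
        for slot in self.slots:
            if slot and slot[0] == page:
                slot[1] = 1           # hit: set the recency bit
                return
        self._insert(page)

    def _insert(self, page):
        while True:
            slot = self.slots[self.hand]
            if slot is None or slot[1] == 0:
                self.slots[self.hand] = [page, 1]   # replace the page at the hand
                self.hand = (self.hand + 1) % len(self.slots)
                return
            slot[1] = 0               # give this page a second chance
            self.hand = (self.hand + 1) % len(self.slots)

c = Clock(3)
for p in ("A", "B", "C", "D"):        # cache is full after C; D forces a replacement
    c.access(p)
print([s[0] for s in c.slots])        # after the sweep clears the R bits, D replaces A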
The CSCAN algorithm uses spatial ordering. The CSCAN algorithm is the circular variation of the SCAN algorithm. The SCAN algorithm tries to minimize the disk head movement when the disk head services read and write requests. It maintains a sorted list of pending requests with the position on the drive of the request. Requests are processed in the current direction of the disk head until it reaches the edge of the disk. At that point, the direction changes. In the CSCAN algorithm, the requests are always served in the same direction. After the head arrives at the outer edge of the disk, it returns to the beginning of the disk and services the new requests in this one direction only. This process results in more equal performance for all head positions.
The basic idea of IWC is to maintain a sorted list of write groups, as in the CSCAN algorithm. The smallest and the highest write groups are joined, forming a circular queue. The new idea is to maintain a recency bit for each write group, as in the CLOCK algorithm. A write group is always inserted in its correct sorted position, and its recency bit is initially set to zero. When a write hit occurs, the recency bit is set to one. A destage pointer scans the circular list and looks for destage victims. Only write groups whose recency bit is zero are destaged. Write groups with a recency bit of one are skipped and their recency bit is reset to zero, which gives an extra life to those write groups that were hit since the last time the destage pointer visited them. This mechanism is illustrated in Figure 6-6 on page 161.
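The destage scan just described can be sketched as follows. This is a conceptual illustration of the circular, sorted write-group list with recency bits; the real IWC, with its per-rank lists and NVS-proportional destage rates, is considerably more involved.

# Conceptual sketch of the IWC destage scan: write groups kept in sorted circular
# order (CSCAN-like spatial ordering) with a per-group recency bit (CLOCK-like
# second chance). Only groups with recency == 0 are destaged.
import bisect

class IWCList:
    def __init__(self):
        self.groups = []      # sorted list of write-group addresses
        self.recency = {}     # address -> recency bit
        self.pointer = 0      # destage pointer into the circular list

    def write(self, addr):
        if addr in self.recency:
            self.recency[addr] = 1                  # write hit: set the recency bit
        else:
            bisect.insort(self.groups, addr)        # insert in its sorted position
            self.recency[addr] = 0                  # new group starts with recency 0

    def destage_one(self):
        while self.groups:
            self.pointer %= len(self.groups)
            addr = self.groups[self.pointer]
            if self.recency[addr] == 0:
                self.groups.pop(self.pointer)       # destage this write group
                del self.recency[addr]
                return addr
            self.recency[addr] = 0                  # skip it: one extra life
            self.pointer += 1
        return None

iwc = IWCList()
for addr in (50, 10, 90, 10):     # group 10 is written twice, so its recency bit is set
    iwc.write(addr)
print(iwc.destage_one())          # group 10 is skipped; group 50 is destaged first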
In the DS8000 implementation, an IWC list is maintained for each rank. The dynamically adapted size of each IWC list is based on workload intensity on each rank. The rate of destage is proportional to the portion of NVS that is occupied by an IWC list. The NVS is shared across all ranks in a cluster. Furthermore, destages are smoothed out so that write bursts are not translated into destage bursts.
Another enhancement to IWC is an update to the cache algorithm that increases the residency time of data in NVS. This improvement focuses on maximizing throughput with good average response time.
In summary, IWC has better or comparable peak throughput to the best of CSCAN and CLOCK across a wide variety of write cache sizes and workload configurations. In addition, even at lower throughputs, IWC has lower average response times than CSCAN and CLOCK.
6.5 Performance considerations for logical configurations
To determine the optimal DS8880 layout, define the I/O performance requirements of the servers and applications up front because they play a large part in dictating the physical and logical configuration of the disk system. Before the disk system is designed, the disk space requirements of the application must be understood.
6.5.1 Workload characteristics
The answers to “How many host connections do I need?” and “How much cache do I need?” always depend on the workload requirements, such as how many I/Os per second for each server, and the I/Os per second for each gigabyte of storage.
The following information must be considered for a detailed modeling:
Number of I/Os per second
I/O density
Megabytes per second
Relative percentage of reads and writes
Random or sequential access characteristics
Cache-hit ratio
Response time
6.5.2 Data placement in the DS8000
After you determine the disk subsystem throughput, disk space, and the number of disks that are required by your hosts and applications, determine the data placement.
To optimize the DS8880 resource utilization, use the following guidelines:
Balance the ranks and extent pools between the two DS8880 internal servers to support the corresponding workloads on them.
Spread the logical volume workload across the DS8880 internal servers by allocating the volumes equally on rank groups 0 and 1.
Use as many disks as possible. Avoid idle disks, even if all storage capacity is not to be used initially.
Distribute capacity and workload across device adapter pairs.
Use multi-rank extent pools.
Stripe your logical volume across several ranks (the default for multi-rank extent pools).
Consider placing specific database objects (such as logs) on separate ranks.
For an application, use volumes from even-numbered and odd-numbered extent pools. Even-numbered pools are managed by server 0, and odd-numbered pools are managed by server 1.
For large, performance-sensitive applications, consider the use of two dedicated extent pools. One extent pool is managed by server 0, and the other extent pool is managed by server 1.
Consider mixed extent pools with multiple tiers that use flash drives as the highest tier and that are managed by IBM Easy Tier.
In a typical DS8880 configuration with equally distributed workloads on two servers, the two extent pools (Extpool 0 and Extpool 1) are created, each with half of the ranks, as shown in Figure 6-6. The ranks in each extent pool are spread equally on each DA pair.
Figure 6-6 Ranks in a multi-rank extent pool configuration and balanced across the DS8000 servers
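The balancing guidelines above can be illustrated with a trivial allocation sketch: an application's volumes are alternated between an even-numbered extent pool (managed by server 0) and an odd-numbered pool (managed by server 1). This only illustrates the placement idea; the pool names, volume names, and counts are made-up examples, not real DS8880 object names.

# Illustrative volume placement: alternate new volumes between an even-numbered
# extent pool (server 0) and an odd-numbered extent pool (server 1).
# Pool and volume names are made-up examples.

extent_pools = ["P0", "P1"]     # P0 -> rank group 0 / server 0, P1 -> rank group 1 / server 1

def place_volumes(app_name, count):
    """Round-robin the application's volumes across both servers' extent pools."""
    placement = {}
    for i in range(count):
        pool = extent_pools[i % len(extent_pools)]
        placement[f"{app_name}_vol{i:02d}"] = pool
    return placement

for vol, pool in place_volumes("db2prod", 8).items():
    print(vol, "->", pool)      # half of the volumes land in P0, half in P1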
Aim for roughly equivalent utilization of all disk arrays in the storage system. Arrays that are used more heavily than others can reduce overall performance. IBM Easy Tier, with storage pool striping or host-level striping of logical volumes, can improve performance by distributing workload across multiple arrays. Easy Tier auto-rebalancing provides further benefits by automatically migrating data from overused extents to less-used areas of the extent pool.
Data striping
For optimal performance, spread your data across as many hardware resources as possible. RAID 5, RAID 6, or RAID 10 already spreads the data across the drives of an array, but this configuration is not always enough. The following approaches can be used to spread your data across even more disk drives:
Storage pool striping that is combined with single-tier or multitier extent pools
Striping at the host level
Easy Tier auto-rebalancing
With multitier extent pools, Easy Tier can automatically migrate the extents with the highest workload to faster storage tiers, and also to faster storage within a tier, as workloads dictate. Auto-rebalance migrates extents to achieve the best workload distribution within the pools and reduce hotspots. Together with intra-tier rebalancing, Easy Tier provides the optimum performance from each tier, while also reducing performance skew within a storage tier. Furthermore, auto-rebalance automatically populates new ranks that are added to the pool. Auto-rebalance can be enabled for hybrid and homogeneous extent pools.
 
 
 
Important: A preferred practice is to use IBM Easy Tier to balance workload across all ranks even when only a single tier of disk is installed in an extent pool. Use the options that are shown in Figure 6-7 to auto-rebalance all pools.
Figure 6-7 Easy Tier options to auto-balance all extent pools
Storage pool striping: Extent rotation
Storage pool striping is a technique for spreading data across several disk arrays. The I/O capability of many disk drives can be used in parallel to access data on the logical volume.
The easiest way to stripe is to create extent pools with more than one rank and use storage pool striping when you allocate a new volume. This striping method does not require any operating system support.
The number of random I/Os that can be run for a standard workload on a rank is described in 6.3, “Performance considerations for disk drives” on page 152. If a volume is on only one rank, the I/O capability of this rank also applies to the volume. However, if this volume is striped across several ranks, the I/O rate to this volume can be much higher.
The total number of I/Os that can be run on a set of ranks does not change with storage pool striping.
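The rotate-extents idea can be pictured as follows: successive extents of a volume are allocated on successive ranks of the pool. This is a simplified sketch of the allocation pattern only; the real extent allocation method also accounts for free space and Easy Tier placement, and the rank names are made up.

# Simplified view of storage pool striping (rotate extents): consecutive
# 1 GiB extents of a volume are placed on consecutive ranks of the extent pool.

def rotate_extents(volume_size_gib, ranks):
    """Return the rank that holds each 1 GiB extent of the volume."""
    return {f"extent_{i:03d}": ranks[i % len(ranks)] for i in range(volume_size_gib)}

pool_ranks = ["R4", "R8", "R12", "R16"]          # example four-rank extent pool
layout = rotate_extents(volume_size_gib=8, ranks=pool_ranks)
for extent, rank in layout.items():
    print(extent, "->", rank)                    # extents cycle R4, R8, R12, R16, R4, ...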
 
Important: Use storage pool striping and extent pools with a minimum of 4 - 8 ranks to avoid hot spots on the disk drives. Consider combining this configuration with Easy Tier auto-rebalancing.
Figure 6-8 shows an example of storage pool striping.
Figure 6-8 Storage pool striping
A sample configuration is shown in Figure 6-9. The ranks are attached to DS8880 server 0 and server 1 in a half-and-half configuration, which might be used with a mixed count key data (CKD) and fixed-block (FB) workload. Ranks on separate device adapters are combined in a multi-rank extent pool, and volumes are striped across all available ranks in each pool. In all-CKD or all-FB systems, consider the use of a single extent pool for each server.
Figure 6-9 Balanced extent pool configuration
Striping at the host level
Many operating systems include the option to stripe data across several (logical) volumes. An example is the AIX Logical Volume Manager (LVM).
LVM striping is a technique for spreading the data in a logical volume across several disk drives so that the I/O capacity of the disk drives can be used in parallel to access data on the logical volume. The primary objective of striping is high-performance reading and writing of large sequential files, but benefits also exist for random access.
Other examples for applications that stripe data across the volumes include the IBM SAN Volume Controller and IBM System Storage N series Gateway.
If you use a Logical Volume Manager (such as LVM on AIX) on your host, you can create a host logical volume from several DS8000 series logical volumes or logical unit numbers (LUNs). You can select LUNs from different DS8880 servers and device adapter pairs, as shown in Figure 6-10. By striping your host logical volume across the LUNs, the best performance for this LVM volume is realized.
Figure 6-10 Optimal placement of data
Figure 6-10 shows an optimal distribution of eight logical volumes within a DS8880. You might have more extent pools and ranks, but when you want to distribute your data for optimal performance, ensure that you spread it across the two servers, across different device adapter pairs, and across several ranks.
The striping on the host level can also work with storage pool striping to help you spread the workloads on more ranks and disks. In addition, creating LUNs evenly across the extent pools offers an alternative, balanced method to spread data across the DS8880 without the use of storage pool striping, as shown on the left side of Figure 6-11.
If you use multi-rank extent pools and you do not use storage pool striping nor Easy Tier auto-rebalance, you must be careful where to put your data, or you can easily unbalance your system (as shown on the right side of Figure 6-11).
 
Figure 6-11 Spreading data across ranks (without the use of Easy Tier auto-rebalance)
Each striped logical volume that is created by the host’s Logical Volume Manager has a stripe size that specifies the fixed amount of data that is stored on each DS8000 LUN.
The stripe size must be large enough to keep sequential data relatively close together, but not so large that the data remains on a single array.
Define stripe sizes by using your host’s Logical Volume Manager in the range of 4 MB - 64 MB. Select a stripe size close to 4 MB if you have many applications that share the arrays, and a larger size when you have few servers or applications that share the arrays.
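The effect of a host-level stripe size can be shown with a small address-mapping sketch: a logical offset in the striped host volume is translated to a LUN and an offset within that LUN. The 16 MiB stripe size and four LUNs used here are arbitrary example values within the 4 MB - 64 MB range recommended above.

# Host-level striping: map a logical byte offset in the striped volume to
# (LUN index, offset within that LUN). Stripe size and LUN count are example values.

STRIPE_SIZE = 16 * 1024 * 1024      # 16 MiB, within the recommended 4 MB - 64 MB range
NUM_LUNS = 4                        # e.g., LUNs spread over both servers and two DA pairs

def map_offset(logical_offset):
    stripe_number = logical_offset // STRIPE_SIZE
    lun = stripe_number % NUM_LUNS                   # stripes rotate across the LUNs
    lun_stripe = stripe_number // NUM_LUNS           # full rotations completed so far
    offset_in_lun = lun_stripe * STRIPE_SIZE + logical_offset % STRIPE_SIZE
    return lun, offset_in_lun

for off in (0, 16 * 1024 * 1024, 48 * 1024 * 1024, 64 * 1024 * 1024):
    lun, lun_off = map_offset(off)
    print(f"logical offset {off:>11} -> LUN {lun}, LUN offset {lun_off}")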
Combining extent pool striping and Logical Volume Manager striping
Striping by an LVM is performed on a stripe size in the MB range (about 64 MB). Extent pool striping is at a 1-GiB extent size. Both methods can be combined. LVM striping can stripe across extent pools and use volumes from extent pools that are attached to server 0 and server 1 of the DS8000 series. If you already use LVM physical partition (PP) wide striping, you might want to continue to use that striping.
 
Important: Striping at the host layer contributes to an equal distribution of I/Os to the disk drives to reduce hot spots. However, if you are using tiered extent pools with solid-state flash drives, IBM Easy Tier can work best if hot extents can be moved to flash drives.
6.6 I/O Priority Manager
It is common practice to have large extent pools and stripe data across all disks. However, when a production workload and, for example, test systems share the same physical disk drives, the test system can negatively affect the production performance.
The DS8000 series I/O Priority Manager (IOPM) feature is included in the Base Function license group for the DS8880. It enables more effective storage consolidation and performance management, and the ability to align QoS levels to disparate workloads in the system. These workloads might compete for the same shared and possibly constrained storage resources.
I/O Priority Manager constantly monitors system resources to help applications meet their performance targets automatically, without operator intervention. The DS8880 storage hardware resources are monitored by the IOPM for possible contention within the RAID ranks and device adapters.
I/O Priority Manager uses QoS to assign priorities for different volumes, and applies network QoS principles to storage by using an algorithm called Token Bucket Throttling for traffic control. IOPM is designed to understand the load on the system and modify priorities by using dynamic workload control.
The I/O of a less important workload is slowed down to give a higher-priority workload a higher share of the resources.
 
Important: If you separated production and non-production data by using different extent pools and different device adapters, you do not need I/O Priority Manager.
Figure 6-12 shows a three-step example of how I/O Priority Manager uses dynamic workload control.
Figure 6-12 Automatic control of disruptive workload by I/O Priority Manager
In step 1, critical application A works normally. In step 2, a non-critical application B starts, causing performance degradation for application A. In step 3, I/O Priority Manager detects the QoS impact on critical application A automatically and dynamically restores the performance for application A.
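As a hedged sketch, I/O Priority Manager management can typically be enabled at the storage image level with the DS CLI. The -iopmmode parameter, its values, and the storage image ID shown here are assumptions that depend on the DS CLI level; verify them with the DS CLI help before use.
# Set I/O Priority Manager to manage mode for the storage image
# (placeholder ID).
chsi -iopmmode manage IBM.2107-75XXXX1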
6.6.1 Performance policies for open systems
When I/O Priority Manager is enabled, each volume is assigned to a performance group when the volume is created. Each performance group has a QoS target. This QoS target is used to determine whether a volume is experiencing appropriate response times.
A performance group associates the I/O operations of a logical volume with a performance policy that sets the priority of a volume relative to other volumes. All volumes fall into one of the performance policies.
For open systems, I/O Priority Manager includes four defined performance policies: Default (unmanaged), high priority, medium priority, and low priority. I/O Priority Manager includes 16 performance groups: Five performance groups each for the high, medium, and low performance policies, and one performance group for the default performance policy.
The following performance policies are available:
Default performance policy
The default performance policy does not have a QoS target that is associated with it. I/Os to volumes that are assigned to the default performance policy are never delayed by
I/O Priority Manager.
High priority performance policy
The high priority performance policy has a QoS target of 70. I/Os from volumes that are associated with the high performance policy attempt to stay under approximately 1.5 times the optimal response time of the rank. I/Os in the high performance policy are never delayed.
Medium priority performance policy
The medium priority performance policy has a QoS target of 40. I/Os from volumes with the medium performance policy attempt to stay under 2.5 times the optimal response time of the rank.
Low performance policy
Volumes with a low performance policy have no QoS target and have no goal for response times. If no bottleneck occurs for a shared resource, the low priority workload is not restricted. However, if a higher priority workload does not achieve its goal, the I/O of a low priority workload is slowed down first by delaying the response to the host. This delay is increased until the higher-priority I/O meets its goal. The maximum delay that is added is 200 ms.
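The following hedged DS CLI sketch shows how volumes might be assigned to a performance group, either at creation time or later. The performance group name (pg1), pool ID, capacity, and volume IDs are placeholder assumptions; check the performance groups that are defined on your system (for example, with lsperfgrp) and the exact parameters for your DS CLI level.
# Create volumes that are assigned to a high-priority performance group.
mkfbvol -extpool P0 -cap 100 -perfgrp pg1 -name prod_#h 1200-1203
# Move an existing volume to that performance group.
chfbvol -perfgrp pg1 1100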
6.6.2 Performance policies for z Systems
z Systems has 14 performance groups: Three performance groups for high performance policies, four performance groups for medium performance policies, six performance groups for low performance policies, and one performance group for the default performance policy.
With z Systems, two operation modes are available for I/O Priority Manager: Without software support, or with software support.
 
Important: Only z/OS operating systems use the I/O Priority Manager with software support.
I/O Priority Manager count key data support
In a z Systems environment, I/O Priority Manager includes the following characteristics:
A user assigns a performance policy to each CKD volume; this policy applies in the absence of software support.
z/OS can optionally specify parameters that determine the priority of each I/O operation and allow multiple workloads on a single CKD volume to have different priorities.
I/O Priority Manager is supported on z/OS V1.11, V1.12, V1.13, and later.
Without z/OS software support, on ranks in saturation, the volume’s I/O is managed according to the performance policy of the volume’s performance group.
With z/OS software support, the following actions occur:
 – A user assigns application priorities by using IBM Enterprise Workload Manager™ (eWLM).
 – z/OS assigns an importance value to each I/O based on eWLM inputs.
 – z/OS assigns an achievement value to each I/O based on the previous history of I/O response times for I/O with the same importance and based on eWLM expectations for response time.
 – The importance and achievement value on I/O associates this I/O with a performance policy (independently of the volume’s performance group/performance policy).
 – On ranks in saturation, I/O is managed according to the I/O’s performance policy.
If no bottleneck exists for a shared resource, a low priority workload is not restricted. However, if a higher priority workload does not achieve its goal, the I/O of the low priority workload is slowed down first, by delaying the response to the host. This delay is increased until the higher-priority I/O meets its goal. The maximum delay that is added is 200 ms.
 
For more information: See DS8000 I/O Priority Manager, REDP-4760.
6.7 IBM Easy Tier
IBM Easy Tier on the DS8880 can enhance performance and balance workloads through the following capabilities:
Automated hot spot management and data relocation.
Support for thin provisioned and fully provisioned extent space-efficient (ESE) volumes.
Automatic inter-tier and intra-tier rebalancing.
In systems that contain flash storage only, or systems with a single drive type, intra-tier rebalancing monitors for hot spots and relocates extents to balance the workload.
Support for HPFEs, including intra-tier rebalancing for heterogeneous flash storage pools.
Manual volume rebalancing and volume migration.
Rank depopulation.
Extent pool merging.
Directive data placement from applications.
Support is provided for z Systems applications, such as DB2, to communicate application hints to the Easy Tier Application API. An application hint that contains performance requirements for optimal data set placement instructs Easy Tier to move the data set to the correct tier.
Consideration of FlashCopy activities, with workloads assigned adequately to source and target volumes for the best production optimization.
Heat map transfer from Peer-to-Peer Remote Copy (PPRC) source to target.
The eighth generation of Easy Tier supports Easy Tier Heat Map Transfer (HMT) in 3-site and 4-site Metro Global Mirror (MGM) environments. Used with GDPS or IBM Tivoli Storage Productivity Center for Replication, this capability leads to performance-optimized disaster recovery.
The administrator can pause and resume Easy Tier monitoring (learning) at the extent pool and volume level, and can reset Easy Tier learning for pools and volumes. The administrator can now pause and resume migration at the pool level. It is also possible to prevent volumes from being assigned to the nearline tier.
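As a hedged sketch, the Easy Tier monitoring and automatic mode settings can typically be displayed and changed at the storage image level with the DS CLI. The -etmonitor and -etautomode parameters, their values, and the storage image ID are assumptions that vary by DS CLI level; verify them with the DS CLI help before use.
# Display the current Easy Tier settings for the storage image
# (placeholder ID).
showsi IBM.2107-75XXXX1
# Enable Easy Tier monitoring and automatic data relocation for all pools.
chsi -etmonitor all -etautomode all IBM.2107-75XXXX1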
Easy Tier Application is an application-aware storage interface to help deploy storage more efficiently through enabling applications and middleware to direct more optimal placement of data. Easy Tier Application enables administrators to assign distinct application volumes to a particular tier in the Easy Tier pool. This capability provides a flexible option for clients who want certain applications to remain on a particular tier to meet performance and cost requirements. Easy Tier Application is available at no additional cost, and no separate license is required.
 
Full Disk Encryption support: All drive types in the DS8880 support Full Disk Encryption. Encryption usage is optional. With or without encryption, disk performance and Easy Tier functions are the same.
For more information about IBM Easy Tier, see IBM DS8000 Easy Tier, REDP-4667; DS8870 Easy Tier Application, REDP-5014; and IBM DS8870 Easy Tier Heat Map Transfer, REDP-5015.
6.8 Host performance and sizing considerations
This section describes performance and sizing considerations for open systems. For z Systems specific performance and sizing considerations, see IBM DS8870 and IBM z Systems Synergy, REDP-5186.
6.8.1 Performance and sizing considerations for open systems
The following sections describe topics that relate to open systems.
Determining the number of paths to a LUN
When you configure a DS8000 series for an open systems host, you must decide how many paths to configure to a particular LUN because the multipathing software allows (and manages) multiple paths to a LUN. Consider the following opposing factors when you decide on the number of paths to a LUN:
Increasing the number of paths increases availability of the data, which protects against outages.
Increasing the number of paths increases the amount of CPU that is used because the multipathing software must choose among all available paths each time that an I/O is issued.
A good compromise is 2 - 4 paths for each LUN. Consider eight paths if a high data rate is required.
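As a simple illustration on AIX, you can verify how many paths are configured and enabled for a LUN with the lspath command. The hdisk name is a placeholder.
# List the paths (and their states) for one DS8880 LUN.
lspath -l hdisk2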
Dynamic I/O load-balancing: Subsystem Device Driver
Subsystem Device Driver (SDD) is an IBM-provided pseudo-device driver that is designed to support the multipath configuration capabilities in the DS8000. SDD runs on each host system, in cooperation with the native disk device driver.
The dynamic I/O load-balancing option (default) of SDD is suggested to achieve the best performance by using these functions:
SDD automatically adjusts data routing for optimum performance. Multipath load balancing of the data flow prevents a single path from becoming overloaded and causing I/O congestion when many I/O operations are directed to common devices along the same I/O path.
The path to use for an I/O operation is chosen by estimating the load on each adapter to which each path is attached. The load is a function of the number of I/O operations currently in process. If multiple paths include the same load, a path is chosen at random from those paths.
IBM SDD is available for most operating environments. On certain operating systems, SDD is also provided as an installable package that works with the native multipathing software of that operating system. For example, IBM Subsystem Device Driver Path Control Module (SDDPCM) is available for AIX, and IBM Subsystem Device Driver Device Specific Module (SDDDSM) is available for Microsoft Windows.
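As a hedged illustration, the following query commands display the configured multipath devices and their load-balancing policy. The datapath query device command applies to classic SDD, and pcmpath query device applies to SDDPCM on AIX; the output format varies by driver level.
# Display SDD vpath devices, their policy, and the state of each path.
datapath query device
# Display SDDPCM-managed MPIO devices and their paths (AIX).
pcmpath query device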
For more information about the multipathing software that might be required for various operating systems, see the IBM System Storage Interoperation Center (SSIC).
SDD is covered in more detail in the following IBM publications:
IBM System Storage DS8000: Host Attachment and Interoperability, SG24-8887
IBM System Storage DS8000 Host Systems Attachment Guide, SC27-4210
Automatic Port Queues
The DS8880 Fibre Channel host adapters and the server HBAs support I/O queuing. The length of this queue is called the queue depth. Because several servers can, and usually do, communicate with only a few DS8880 ports, the queue depth of a storage host adapter needs to be larger than the queue depth on the server side. Accordingly, the DS8880 supports 2,048 FC commands queued on each port. However, sometimes the port queue in a DS8880 host adapter can still be flooded.
When the number of commands that are sent to the DS8880 port exceeds the maximum number of commands that the port can queue, the port discards these additional commands. This operation is a normal error recovery operation in the Fibre Channel Protocol to allow overprovisioning on the SAN. The normal recovery is a 30-second timeout for the server. The server retries the command until the command retry value is exceeded, at which point the operation fails. Command Timeout entries can be seen in the server logs.
Automatic Port Queues is a mechanism that the DS8880 uses to self-adjust the port queue based on the workload. This mechanism allows a higher port queue oversubscription while it maintains a fair share for the servers and the accessed LUNs. When an I/O port queue fills up, the port enters SCSI Queue Full mode and accepts no additional commands, which slows down the I/Os. By avoiding error recovery and the 30-second blocking SCSI Queue Full recovery interval, overall performance is better with Automatic Port Queues.
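As a simple AIX illustration, the queue depth of an individual LUN can be displayed and adjusted as shown below. The hdisk name and the value of 32 are placeholders; follow the guidance for your host attachment and workload.
# Display the current queue depth of one DS8880 LUN.
lsattr -El hdisk2 -a queue_depth
# Change the queue depth; -P defers the change until the device is
# reconfigured.
chdev -l hdisk2 -a queue_depth=32 -P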
Determining where to attach the host
The DS8000 series host adapters have no server affinity, but the device adapters and the ranks have server affinity. When you determine where to attach multiple paths from a single host system to I/O ports in the storage system, the following considerations apply:
Spread the connections across host adapters in all of the I/O enclosures
Spread the connections across port pairs in the host adapters
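As a hedged DS CLI sketch, the installed I/O ports and their host adapter and I/O enclosure locations can be listed so that the host connections can be spread accordingly. The exact output columns depend on the DS CLI level.
# List all I/O ports with their location and topology information.
lsioport -l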
Figure 6-13 shows a z13 system with four FICON paths, which are connected through two SAN384B-2 switches to eight DS8880 host adapters. The connections are spread evenly across the I/O enclosures. The diagram also shows the DS8880 internal PCIe pathing between the internal servers and the I/O enclosures and to a high performance flash enclosure, and the 8 Gb Fibre Channel connections from the device adapters to the standard drive enclosures.
Figure 6-13 DS8880 multipathing
Options for four-port and eight-port 8 Gbps host adapters are available in the DS8880. Eight-port cards provide more connectivity, but not necessarily more total throughput because all the ports share a single PCIe connection to the I/O enclosure. Additionally, host ports are internally paired, driven by two-port Fibre Channel I/O controller modules. Four-port 16 Gbps adapters are also available.
For maximum throughput, consider the use of fewer ports for each host adapter and spread the workload across more host adapters. Where possible, avoid mixing FICON and FCP connections on a single adapter, or mixing host connections with PPRC connections.
 