Performance
This chapter describes the performance considerations for IBM z13.
The IBM z13 Model NE1 is designed to offer approximately 40% more capacity and 3.3 times as much memory as the IBM zEnterprise EC12 (zEC12) Model HA1 system. Uniprocessor performance also has increased. A z13 Model 701 offers, on average, performance improvements of more than 10% over the zEC12 Model 701. Figure 12-1 shows the estimated capacity ratios for z13, zEC12, z196, z10 EC, and z9 EC.
Figure 12-1 z13 to zEC12, z196, z10 EC, and z9 EC performance comparison
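As a back-of-envelope illustration of how these figures relate, the following Python sketch derives the multiprocessor (MP) scaling effect implied by the numbers stated above (141-way versus 101-way, approximately 40% more total capacity, more than 10% faster uniprocessor). The calculation is a simplified illustration, not an IBM sizing method.

   # Back-of-envelope check of the stated ratios: 141-way z13 vs
   # 101-way zEC12, ~40% more total capacity, >10% faster uniprocessor.
   # The implied MP scaling effect is derived purely for illustration.
   engines_z13, engines_zec12 = 141, 101
   uni_ratio = 1.10            # z13 Model 701 vs zEC12 Model 701, per the text
   total_ratio = 1.40          # 141-way vs 101-way, per the text

   raw_ratio = (engines_z13 / engines_zec12) * uni_ratio
   mp_effect = total_ratio / raw_ratio
   print(f"raw engine ratio: {raw_ratio:.2f}, implied MP effect: {mp_effect:.2f}")

The raw ratio of about 1.54 against the stated 1.40 shows why total capacity does not scale linearly with the number of engines, a theme that 12.6, “Workload performance variation” revisits.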
The Large System Performance Reference (LSPR) numbers that are given for z13 were obtained with z/OS V2R1; the numbers for zEC12 were obtained with z/OS V1R13; and numbers for the z196, z10 EC, and z9 EC systems were obtained with the z/OS V1R11 operating system.
On average, the z13 can deliver up to 40% more performance in a 141-way configuration than a zEC12 101-way. However, the observed performance increase varies depending on the workload type.
Consult the LSPR when you consider performance on the z13. The range of performance ratings across the individual LSPR workloads is likely to have a large spread. More performance variation of individual logical partitions (LPARs) exists because the fluctuating resource requirements of other partitions can be more pronounced with the increased number of partitions and the availability of more processor units (PUs). For more information, see 12.6, “Workload performance variation” on page 463.
For detailed performance information, see the LSPR website:
The millions of service units (MSU) ratings are available from the following website:
This chapter includes the following sections:
LSPR workload suite
Fundamental components of workload capacity performance
Relative nest intensity
LSPR workload categories based on relative nest intensity
Relating production workloads to LSPR workloads
Workload performance variation
Main performance improvement drivers with z13
12.1 LSPR workload suite
Historically, LSPR capacity tables, including pure workloads and mixes, have been identified with application names or a software characteristic. Examples are CICS, IMS, OLTP-T,1 CB-L,2 LoIO-mix,3 and TI-mix.4 However, capacity performance is more closely associated with how a workload uses and interacts with a particular processor hardware design. The CPU Measurement Facility (CPU MF) data that was introduced on the z10 provides insight into the interaction of workload and hardware design in production workloads. CPU MF data helps LSPR to adjust workload capacity curves based on the underlying hardware sensitivities, in particular, the processor access to caches and memory. This is known as nest activity intensity. Using this data, LSPR introduces three new workload capacity categories that replace all prior primitives and mixes.
LSPR contains the internal throughput rate ratios (ITRRs) for the z13 and the previous generation processor families. These ratios are based on measurements and projections that use standard IBM benchmarks in a controlled environment. The throughput that any user experiences can vary depending on the amount of multiprogramming in the user’s job stream, the I/O configuration, and the workload processed. Therefore, no assurance can be given that an individual user can achieve throughput improvements equivalent to the performance ratios that are stated.
12.2 Fundamental components of workload capacity performance
Workload capacity performance is sensitive to three major factors:
Instruction path length
Instruction complexity
Memory hierarchy and memory nest
This section examines each of these three factors.
12.2.1 Instruction path length
A transaction or job runs a set of instructions to complete its task. These instructions are composed of various paths through the operating system, subsystems, and application. The total count of instructions that are run across these software components is referred to as the transaction or job path length. The path length varies for each transaction or job, and depends on the complexity of the tasks that must be run. For a particular transaction or job, the application path length tends to stay the same, presuming that the transaction or job is asked to run the same task each time.
However, the path length that is associated with the operating system or subsystem might vary based on a number of factors:
Competition with other tasks in the system for shared resources. As the total number of tasks grows, more instructions are needed to manage the resources.
The n-way (number of logical processors) of the image or LPAR. As the number of logical processors grows, more instructions are needed to manage resources that are serialized by latches and locks.
12.2.2 Instruction complexity
The type of instructions and the sequence in which they are run interact with the design of a microprocessor to affect a performance component. This is defined as instruction complexity. Many design alternatives affect this component:
Cycle time (GHz)
Instruction architecture
Pipeline
Superscalar
Out-of-order execution
Branch prediction
As workloads are moved between microprocessors with various designs, performance varies. However, when on a processor, this component tends to be similar across all models of that processor.
12.2.3 Memory hierarchy and memory nest
The memory hierarchy of a processor generally refers to the caches, data buses, and memory arrays that stage the instructions and data that must be run on the microprocessor to complete a transaction or job.
Many design choices affect this component:
Cache size
Latencies (sensitive to distance from the microprocessor)
Number of levels, the Modified, Exclusive, Shared, Invalid (MESI) protocol, controllers, switches, the number and bandwidth of data buses, and others
Certain caches are private to the microprocessor core, which means that only that microprocessor core can access them. Other caches are shared by multiple microprocessor cores. The term memory nest for a z Systems processor refers to the shared caches and memory along with the data buses that interconnect them.
Figure 12-2 shows a memory nest in a z13 single CPC drawer system.
Figure 12-2 Memory hierarchy on the z13 one CPC drawer system (two nodes)
Workload capacity performance is sensitive to how deep into the memory hierarchy the processor must go to retrieve the workload instructions and data for running. The best performance occurs when the instructions and data are in the caches nearest the processor. In this configuration, little time is spent waiting before running. If the instructions and data must be retrieved from farther out in the hierarchy, the processor spends more time waiting for their arrival.
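The cost of going deeper into the hierarchy can be pictured as a weighted average access time across the cache levels. The following Python sketch is illustrative only: the hit fractions and cycle costs are invented for the example and do not represent measured z13 latencies.

   # Illustrative average-access-time calculation for a cache hierarchy.
   # The hit fractions and cycle costs are hypothetical placeholders,
   # not measured z13 values.
   levels = [
       ("L1", 0.94, 1),        # (level, fraction of references, cycles)
       ("L2", 0.04, 10),
       ("L3", 0.015, 50),
       ("L4", 0.004, 200),
       ("memory", 0.001, 500),
   ]
   average_cycles = sum(fraction * cost for _, fraction, cost in levels)
   print(f"average access time: {average_cycles:.2f} cycles per reference")

The farther out the references resolve, the larger the weighted average becomes, which is the wait time that the preceding paragraph describes.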
As workloads are moved between processors with various memory hierarchy designs, performance varies because the average time to retrieve instructions and data from within the memory hierarchy varies. Additionally, when on a processor, this component continues to vary. This variation is because the location of a workload’s instructions and data within the memory hierarchy is affected by many factors including, but not limited to, these factors:
Locality of reference
I/O rate
Competition from other applications and LPARs
12.3 Relative nest intensity
The most performance-sensitive area of the memory hierarchy is the activity to the memory nest. This is the distribution of activity to the shared caches and memory. The term Relative Nest Intensity (RNI) indicates the level of activity to this part of the memory hierarchy. Using data from CPU MF, the RNI of the workload running in an LPAR can be calculated. The higher the RNI, the deeper into the memory hierarchy the processor must go to retrieve the instructions and data for that workload.
RNI reflects the distribution and latency of sourcing data from shared caches and memory, as shown in Figure 12-3.
Figure 12-3 Relative Nest Intensity
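As a rough illustration of how such a metric can be built from CPU MF data, the following Python sketch weights the share of L1 misses that are sourced from each shared level of the nest, with deeper levels weighted more heavily. The weights, scale factor, and sample percentages are hypothetical placeholders; the actual RNI formula is specific to each processor generation and is published with the LSPR and implemented in zPCR.

   # Hypothetical RNI-style calculation: weight the share of L1 misses
   # sourced from each shared level of the nest by its relative cost.
   # The weights and inputs are placeholders, not IBM's published formula.
   def relative_nest_intensity(l3_pct, l4_local_pct, l4_remote_pct, mem_pct,
                               weights=(1.0, 2.0, 4.0, 8.0), scale=0.01):
       """Inputs are percentages of L1 misses sourced from shared L3,
       local L4, remote L4, and memory."""
       w3, w4l, w4r, wm = weights
       return scale * (w3 * l3_pct + w4l * l4_local_pct +
                       w4r * l4_remote_pct + wm * mem_pct)

   # Example: most misses resolve in the shared L3, few go to memory.
   print(relative_nest_intensity(70.0, 20.0, 5.0, 5.0))   # prints 1.7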
Many factors influence the performance of a workload. However, what these factors usually influence is the RNI of the workload. The interaction of all these factors results in a net RNI for the workload, which in turn directly relates to the performance of the workload.
These factors are simply tendencies and not absolutes. For example, a workload might have a low I/O rate, intensive processor use, and a high locality of reference, which all suggest a low RNI. But it might be competing with many other applications within the same LPAR and many other LPARs on the processor, which tend to create a higher RNI. It is the net effect of the interaction of all these factors that determines the RNI.
The traditional factors that were used to categorize workloads in the past are listed along with their RNI tendency in Figure 12-4.
Figure 12-4 The traditional factors that were used to categorize workloads
Little can be done to affect most of these factors. An application type is whatever is necessary to do the job. The data reference pattern and processor usage tend to be inherent to the nature of the application. The LPAR configuration and application mix are mostly a function of what must be supported on a system. The I/O rate can be influenced somewhat through buffer pool tuning.
However, one factor, software configuration tuning, is often overlooked but can have a direct effect on RNI. This term refers to the number of address spaces (such as CICS application-owning regions (AORs) or batch initiators) that are needed to support a workload. This factor has always existed, but its sensitivity is higher with the current high-frequency microprocessors. Spreading the same workload over more address spaces than necessary can raise a workload’s RNI. This increase occurs because the working set of instructions and data from each address space increases the competition for the processor caches.
Tuning to reduce the number of simultaneously active address spaces to the correct number that is needed to support a workload can reduce RNI and improve performance. In the LSPR, the number of address spaces for each processor type and n-way configuration is tuned to be consistent with what is needed to support the workload. Therefore, the LSPR workload capacity ratios reflect a presumed level of software configuration tuning. Retuning the software configuration of a production workload as it moves to a larger or faster processor might be needed to achieve the published LSPR ratios.
12.4 LSPR workload categories based on relative nest intensity
A workload’s RNI is the most influential factor in determining workload performance. Other more traditional factors, such as application type or I/O rate, have RNI tendencies. However, it is the net RNI of the workload that is the underlying factor in determining the workload’s capacity performance. The LSPR now runs various combinations of former workload primitives, such as CICS, DB2, IMS, OSAM, VSAM, WebSphere, COBOL, and utilities, to produce capacity curves that span the typical range of RNI.
Three new workload categories are represented in the LSPR tables:
LOW (relative nest intensity)
A workload category that represents light use of the memory hierarchy. This category is similar to past high-scaling primitives.
AVERAGE (relative nest intensity)
A workload category that represents average use of the memory hierarchy. This category is similar to the past LoIO-mix workload, and is expected to represent most production workloads.
HIGH (relative nest intensity)
A workload category that represents heavy use of the memory hierarchy. This category is similar to the past TI-mix workload.
These categories are based on the RNI. The RNI is influenced by many variables, such as application type, I/O rate, application mix, processor usage, data reference patterns, LPAR configuration, and the software configuration that is running. CPU MF data can be collected by z/OS System Management Facilities (SMF) in SMF type 113 records.
12.5 Relating production workloads to LSPR workloads
Historically, a number of techniques were used to match production workloads to LSPR workloads:
Application name (a client running CICS can use the CICS LSPR workload)
Application type (create a mix of the LSPR online and batch workloads)
I/O rate (the low I/O rates used a mix of low I/O rate LSPR workloads)
The previous LSPR workload suite was composed of the following workloads:
Traditional online transaction processing workload OLTP-T (formerly known as IMS)
Web-enabled online transaction processing workload OLTP-W (also known as Web/CICS/DB2)
A heavy Java based online stock trading application that is known as WASDB (previously referred to as Trade2-EJB)
Batch processing, represented by the CB-L (commercial batch with long-running jobs or CBW2)
A new ODE-B Java batch workload, replacing the CB-J workload
The traditional Commercial Batch Short Job Steps (CB-S) workload (formerly CB84) was dropped. Figure 12-4 on page 460 shows the traditional factors that have been used to categorize workloads.
The previous LSPR provided performance ratios for individual workloads and for the default mixed workload. This default workload was composed of equal amounts of four of the previous workloads (OLTP-T, OLTP-W, WASDB, and CB-L). Guidance in converting the previous LSPR categories to the new ones is given in Figure 12-5. The IBM Processor Capacity Reference for z Systems (zPCR) tool5 has been changed to support the new z/OS workload categories.
Figure 12-5 New z/OS workload categories defined
However, as addressed in 12.4, “LSPR workload categories based on relative nest intensity” on page 461, the underlying performance sensitive factor is how a workload interacts with the processor hardware. These past techniques were approximating the hardware characteristics that were not available through software performance reporting tools.
Beginning with the z10 processor, the hardware characteristics can now be measured by using CPU MF (SMF 113) counters data. A production workload can now be matched to an LSPR workload category through these hardware characteristics. For more information about RNI, see 12.4, “LSPR workload categories based on relative nest intensity” on page 461.
The AVERAGE RNI LSPR workload is intended to match most client workloads. When no other data is available, use it for capacity analysis.
Direct access storage device (DASD) I/O rate was used for many years to separate workloads into two categories: Those whose DASD I/O per MSU (adjusted) is <30 (or DASD I/O per Peripheral Component Interconnect (PCI) is <5), and those higher than these values. Most production workloads fell into the “low I/O” category, and a LoIO-mix workload was used to represent them. Using the same I/O test, these workloads now use the AVERAGE RNI LSPR workload. Workloads with higher I/O rates can use the HIGH RNI workload or the AVG-HIGH RNI workload that is included with IBM zPCR. Low-Average and Average-High categories allow better granularity for workload characterization.
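The screening rule in the preceding paragraph can be expressed as a short decision function. The following Python sketch encodes only the thresholds that are stated above (30 DASD I/Os per MSU, or 5 per PCI); the function name and interface are invented for this example, and zPCR remains the tool of record for an actual assessment.

   # Sketch of the DASD I/O screening rule described above: under
   # 30 DASD I/Os per MSU (or under 5 per PCI) counts as "low I/O"
   # and maps to the AVERAGE RNI workload; higher rates map to the
   # HIGH (or AVG-HIGH) RNI workload. The function name is invented.
   def lspr_category_from_io(dasd_io_rate, msu=None, pci=None):
       if msu is not None:
           low_io = dasd_io_rate / msu < 30
       elif pci is not None:
           low_io = dasd_io_rate / pci < 5
       else:
           raise ValueError("need MSU or PCI to normalize the I/O rate")
       return "AVERAGE" if low_io else "HIGH (or AVG-HIGH)"

   print(lspr_category_from_io(12000, msu=500))   # 24 I/O per MSU -> AVERAGE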
For z10 and newer processors, the CPU MF data can be used to provide an extra hint as to workload selection. When available, this data allows the RNI for a production workload to be calculated. By using the RNI and another factor from CPU MF, the L1MP (percentage of data and instruction references that miss the L1 cache), a workload can be classified as LOW, AVERAGE, or HIGH RNI. This classification and the resulting hint are automated in the zPCR tool. It is preferable to use zPCR for capacity sizing.
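The selection logic that zPCR automates can be pictured as a decision table over L1MP and RNI. In the following Python sketch, the threshold values are illustrative placeholders rather than IBM’s published boundaries; zPCR implements the actual, processor-specific rules.

   # Illustrative decision table that maps L1MP and RNI to an LSPR
   # workload category. The thresholds are placeholders; zPCR holds
   # the actual, processor-specific boundaries.
   def lspr_workload_category(l1mp_pct, rni):
       """l1mp_pct: % of references that miss L1; rni: relative nest intensity."""
       if l1mp_pct < 3.0:                    # few L1 misses: nest lightly used
           return "AVERAGE" if rni >= 0.75 else "LOW"
       if l1mp_pct <= 6.0:                   # moderate L1 miss rate
           if rni >= 1.0:
               return "HIGH"
           return "AVERAGE" if rni >= 0.6 else "LOW"
       return "HIGH" if rni >= 0.75 else "AVERAGE"   # heavy L1 miss rate

   print(lspr_workload_category(4.2, 0.9))   # prints AVERAGE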
12.6 Workload performance variation
Because of the nature of the z13 multi-drawer system and resource management across those drawers, performance variability from application to application is expected. This variation is similar to that seen on the zEC12, z196, z10 EC, and z9 EC. This variability can be observed in certain ways. The range of performance ratings across the individual workloads is likely to have a spread, but not as large as with the z10 EC.
The memory and cache designs affect various workloads in many ways. All workloads are improved, with cache-intensive loads benefiting the most. The relative benefit per workload is also likely to vary from one generation to the next: workloads that benefited more than the average when moving from z9 EC to z10 EC tended to benefit less than the average when moving from z10 EC to z196. Nevertheless, the workload variability for moving from zEC12 to z13 is expected to be less than for the last few upgrades.
The effect of this variability is increased deviations of workloads from single-number metric-based factors, such as Millions of Instructions Per Second (MIPS), MSUs, and CPU time charge-back algorithms.
Experience demonstrates that z Systems can be run at up to 100% utilization levels, sustained. However, most clients prefer to leave a bit of room and run at 90% or slightly under. For any capacity comparison exercise, using a single metric, such as MIPS or MSU, is not a valid method. When deciding the number of processors and the uniprocessor capacity, consider both the workload characteristics and the LPAR configuration. For these reasons, using zPCR and involving IBM technical support are recommended when you plan capacity.
12.7 Main performance improvement drivers with z13
The z13 is designed to deliver new levels of performance and capacity for large-scale consolidation and growth. The following attributes and design points of the z13 contribute to overall performance and throughput improvements as compared to the zEC12.
The z/Architecture implementation has the following enhancements:
Transactional Execution (TX) designed for z/OS, Java, DB2, and other users
Runtime Instrumentation (RI) provides dynamic and self-tuning online recompilation capability for Java workloads
Enhanced DAT-2 for supporting 2-GB pages for DB2 buffer pools, Java heap size, and other large structures
Software directives implementation to improve hardware performance
Decimal format conversions for COBOL programs
The z13 microprocessor design has the following enhancements:
Eight processor cores per chip
Improved out-of-order (OOO) execution design
Improved pipeline balance, with up to six instructions that can be decoded per cycle, and up to 10 instructions/operations that can be initiated to run per clock cycle
Simultaneous multithreading
Single-instruction multiple-data (SIMD) unit and 139 new instructions for vector operations
Enhanced branch prediction latency and instruction fetch throughput
Improvements in execution bandwidth and throughput: 10 execution units and two load/store units, which are divided into two symmetric pipelines:
 – Four fixed-point units (FXU) (integer)
 – Two load/store units (LSU)
 – Two binary floating-point units (BFU)
 – Two decimal floating-point units (DFU)
 – Two vector floating-point units (VXU)
Redesigned cache structure:
 – Increased L1I and L1D caches (96 KB instruction and 128 KB data per core)
 – Increased 2 MB + 2 MB eDRAM split (instruction and data) private L2 cache per core
 – On chip 64 MB eDRAM L3 Cache, shared by all cores (eight) - 384 MB per CPC drawer
 – New Inclusive L4 Design: 480 MB L4 with 224 MB NIC Directory (960 MB L4 per CPC drawer)
One cryptographic/compression co-processor per core, redesigned
CPACF (hardware) runs additional UTF conversion operations: UTF8 to UTF32, UTF8 to UTF16, UTF32 to UTF8, and UTF32 to UTF16
Clock frequency at 5.0 GHz
IBM CMOS 14S0 22 nm SOI technology with IBM eDRAM technology
The z13 design has the following enhancements:
Increased total number of PUs that are available on the system, from 120 to 168, and number of characterizable cores, from 101 to 141
Hardware system area (HSA) increased from 32 GB to 96 GB
Up to 85 LPARs (compared to 60 on zEC12)
10 TB of addressable memory (configurable to LPARs) with up to 10 TB of memory per LPAR
Increased default number of SAP processors per CPC drawer
New Coupling Facility Control Code (CFCC) that is available for improved performance:
 – Elapsed time improvements when dynamically altering the size of a cache structure
 – DB2 conditional writes to a group buffer pool (GBP)
 – Performance improvements for coupling facility cache structures to avoid flooding the coupling facility cache with changed data, and avoid excessive delays and backlogs for cast-out processing
 – Performance throughput enhancements for parallel cache castout processing by extending the number of record code check (RCC) cursors beyond 512
 – Coupling facility (CF) storage class and castout class contention avoidance by breaking up individual storage class and castout class queues to reduce storage class and castout class latch contention
The following new features are available on the z13:
Integrated Coupling Adapter (ICA SR)
FICON Express16S
The 10GbE Remote Direct Memory Access (RDMA) over Converged Ethernet (RoCE) Express (10GbE RoCE Express) feature now supports using both physical ports and can be shared among up to 31 LPARs
Crypto Express5S with up to 256 domains
 

1 Traditional online transaction processing workload (formerly known as IMS)
2 Commercial batch with long-running jobs
3 Low I/O Content Mix Workload
4 Transaction Intensive Mix Workload
5 The IBM Processor Capacity Reference tool reflects the latest IBM LSPR measurements. It is available at no extra charge at http://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/PRS1381.