Performance
This chapter describes the performance and capacity planning of z15.
 
Note: Throughout this chapter, z15 refers to IBM z15 Model T01 (Machine Type 8561) unless otherwise specified.
This chapter includes the following topics:
12.1, “IBM z15 performance characteristics”
12.2, “z15 Large System Performance Reference ratio”
12.3, “Fundamental components of workload performance”
12.4, “Relative Nest Intensity”
12.5, “LSPR workload categories based on RNI”
12.6, “Relating production workloads to LSPR workloads”
12.7, “CPU MF counter data and LSPR workload type”
12.8, “Workload performance variation”
12.9, “Capacity planning consideration for z15”
12.1 IBM z15 performance characteristics
The IBM z15 Model T01 Feature Max190 (7J0) is designed to offer up to 25% more capacity and 25% more memory than an IBM z14 Model M05 (7H0) system.
Uniprocessor performance also increased. On average, a z15 Model 701 offers performance improvements of more than 12% over the z14 Model 701. Figure 12-1 shows a system performance comparison of successive IBM Z servers.
Figure 12-1 System performance comparison of successive IBM Z servers
Note: PCI = Processor Capacity Index.
The number of “engines” that is supported varies by operating system.
12.1.1 z15 single-thread capacity
The z15 processor chip runs at the same 5.2 GHz clock speed as the z14 processor chip, but its performance is increased. Uniprocessor capacity increases 10 - 13% on average, and N-way models increase 12 - 14% on average at an equal N-way configuration. These numbers differ depending on the workload type and LPAR configuration.
12.1.2 z15 SMT capacity
As with z13 and z14, customers can choose to run two threads on IFL and zIIP cores by using SMT mode. SMT increases throughput by 10 - 40% (average 25%), depending on workload.
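For capacity planning purposes, this gain can be approximated with simple arithmetic. Example 12-1 is a minimal sketch in Python that applies a workload-dependent gain factor to per-core capacity. The configuration values are hypothetical; the sketch illustrates the arithmetic only and is not an IBM sizing tool.

Example 12-1   Estimating SMT-2 throughput for zIIP or IFL cores (values are hypothetical)

def smt2_throughput(cores: int, per_core_capacity: float,
                    smt_gain: float = 0.25) -> float:
    """Estimated throughput with two threads per core.
    smt_gain: 0.10 - 0.40 depending on workload (0.25 on average)."""
    return cores * per_core_capacity * (1.0 + smt_gain)

# Example: 8 zIIPs with a nominal per-core capacity of 1.0
print(smt2_throughput(cores=8, per_core_capacity=1.0))   # -> 10.0

Note that SMT increases aggregate throughput; the capacity of each individual thread is lower than that of a core running a single thread.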
12.1.3 IBM Integrated Accelerator for zEnterprise Data Compression
Starting with zEC12, IBM introduced the zEnterprise Data Compression (zEDC) Express PCIe feature, bringing efficiency and economies to data storage and data transfer.
The zEDC Express feature was adopted by enterprises because it helps to reduce software costs for compression and decompression operations (by offloading them) and increases the efficiency of data encryption (compression before encryption).
With z15, the zEDC Express functionality moved from the PCIe infrastructure into the processor nest. By moving compression and decompression on-chip, the IBM z15 processor provides a new level of performance for these tasks and eliminates the need to virtualize the zEDC Express feature. It also brings new use cases to the platform.
12.1.4 Primary performance improvement drivers with z15
The attributes and design points of z15 contribute to overall performance and throughput improvements as compared to the z14. The following major items contribute to z15 performance improvements:
z15 microprocessor architecture:
 – Larger GCT (60x3 entries versus 48x3 in z14, 1.25x entries)
 – New Mapper Design (2x entries)
 – Larger Issue Queue (2x36 entries versus 2x30 in z14)
 – 4x larger 2 GB-page TLB2 (256 entries versus 64 in z14)
 – Doubled number of core FARs (12 versus 6 in z14)
 – Double-sized BTB1 (16 K entries versus 8 K in z14, improved prediction)
 – New TAGE-based PHT branch predictor design
 – Pipeline optimization
 – Third-generation SMT processing for zIIPs and IFLs
Cache:
 – L2 I-Cache increased from 2 MB to 4 MB per Core, or 2x
 – L3 Cache increased from 128 MB to 256 MB per CP chip, or 2x
 – L4 Cache increased from 672 MB to 960 MB, or +43%
 – New power efficient logical directory design
Storage hierarchy:
 – Reduced cache latencies (L2 → L3, L3 → L4)
 – Improved system protocols (Reduced contention points)
 – Core stores target L3 cache compartment directly
 – Improved on-chip data busing (shared by 3 versus 5 on z14)
 – On-cluster memory fetches only cached in L3 (LRU write-back to L4)
 – Bus speed and feed improvements (meso-synchronous buses)
 – Nest Early Eviction fetch hints (non-MRU caching of temporal data)
 – Improved hot cache line handling (contention affinity, single SC)
Software and hardware:
 – Drawer-based memory affinity
 – z/OS HiperDispatch Optimizations
 – WiseOps, z/OS MicroTrend Analysis
 – PR/SM Algorithm Improvements (placement, ICF relocation)
 – Hot Cache line handling improvements
 – Post Quantum Encryption
 – Speed Boost
z/Architecture implementation:
 – DEFLATE-Conversion Facility (on-chip compression)
Replaces the zEDC Express PCIe feature. The DEFLATE CONVERSION CALL (DFLTCC) instruction compresses and decompresses data by using the DEFLATE standard (see Example 12-2 after this list).
 – Move-Page-and-Set-Key Facility
Enables setting the storage key that is associated with the page to which data is being moved.
 – PER Storage-Key-Alteration Facility
Identifies changes to the ACC (access key) and F (fetch protect) bits of storage keys.
 – Message-Security-Assist Extension 9
Adds support to CPACF for performing digital signatures.
 – Vector-Enhancements Facility 2:
Provides eight new instructions to help deal with endian conversions.
 – Adapter CPU Directed Interrupts
New architecture to allow native PCI devices to present interrupts that are directed to a specific CPU.
 – PCI Mapped I/O (MIO) Address Space
Allows for an operating system to enable a problem state program to access PCI memory.
 – Miscellaneous-Instruction-Extensions Facility 3
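Applications typically reach the DEFLATE-Conversion Facility through software layers such as zlib rather than by issuing DFLTCC directly. Example 12-2 is a minimal sketch of this application view in Python, whose zlib module wraps the system zlib library. The code is ordinary zlib usage; hardware acceleration on z15 is an attribute of a DFLTCC-enabled zlib build (available, for example, in some Linux on IBM Z distributions), not of the code itself.

Example 12-2   Compressing data with the DEFLATE standard (accelerated on z15 when zlib is built with DFLTCC support)

import zlib

# Ordinary DEFLATE compression; on z15, a DFLTCC-enabled zlib build
# offloads this call to the on-chip accelerator transparently.
original = b"some repetitive payload " * 4096
compressed = zlib.compress(original, level=6)
restored = zlib.decompress(compressed)

assert restored == original
print(f"{len(original)} bytes compressed to {len(compressed)} bytes")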
12.2 z15 Large System Performance Reference ratio
The Large System Performance Reference (LSPR) provides capacity ratios among various processor families that are based on various measured workloads. It is a common practice to assign a capacity scaling value to processors as a high-level approximation of their capacities.
For z/OS V2R3 studies, the capacity scaling factor that is commonly associated with the reference processor is set to a 2094-701 with a Processor Capacity Index (PCI) value of 593. This value is unchanged since the z/OS V1R11 LSPR. Using the same scaling factor across LSPR releases minimizes changes in capacity results for older studies and provides a more accurate capacity view for new studies.
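Because the scaling factor is anchored to the 2094-701 (PCI 593), a model's PCI can be read as an approximate multiple of that reference capacity. Example 12-3 sketches this arithmetic; the target PCI value in the example is a hypothetical placeholder, and real values must be taken from the published LSPR tables.

Example 12-3   Relating a PCI value to relative capacity (target PCI is a placeholder)

REFERENCE_PCI = 593.0          # 2094-701, the z/OS V2R3 LSPR reference

def capacity_ratio(pci: float) -> float:
    """Approximate capacity relative to the 2094-701 reference."""
    return pci / REFERENCE_PCI

# Hypothetical PCI value for a target model; look up real values in the LSPR.
print(f"Relative capacity: {capacity_ratio(1832.0):.2f}x the reference")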
Performance data for z15 servers was obtained with z/OS V2R3 (running DB2 for z/OS V12, CICS TS V5R3, IMS V14, Enterprise COBOL V6R2, and WebSphere Application Server for z/OS V9.0.0.8). All IBM Z server generations are measured in the same environment with the same workloads at high utilization.
 
Note: If your software configuration is different from what is described here, the performance results might vary.
On average, z15 servers can deliver up to 25% more performance in a 190-way configuration than a z14 170-way configuration. However, the observed performance increase varies depending on the workload type.
Consult the LSPR when you consider performance on the z15. The range of performance ratings across the individual LSPR workloads is likely to include a large spread. Performance of the individual logical partitions (LPARs) varies depending on the fluctuating resource requirements of other partitions and the availability of processor units (PUs). Therefore, it is important to know which LSPR workload type suits your production environment. For more information, see 12.8, “Workload performance variation” on page 483.
For more information about performance, see the Large Systems Performance Reference for IBM Z page of the Resource Link website.
For more information about millions of service units (MSU) ratings, see the IBM z Systems Software Contracts page of the IBM IT infrastructure website.
12.2.1 LSPR workload suite
Historically, LSPR capacity tables, including pure workloads and mixes, were identified with application names or a software characteristic; for example, CICS, IMS, OLTP-T,1 CB-L,2 LoIO-mix,3 and TI-mix.4 However, capacity performance is more closely associated with how a workload uses and interacts with a particular processor hardware design.
The CPU Measurement Facility (CPU MF) data that was introduced on the z10 provides insight into the interaction of workload and hardware design in production workloads. CPU MF data helps LSPR to adjust workload capacity curves that are based on the underlying hardware sensitivities; in particular, the processor access to caches and memory. This processor access to caches and memory is called nest. By using this data, LSPR introduces three workload capacity categories that replace all older primitives and mixes.
LSPR contains the internal throughput rate ratios (ITRRs) for the z15 and the previous generation processor families. These ratios are based on measurements and projections that use standard IBM benchmarks in a controlled environment.
The throughput that any user experiences can vary depending on the amount of multiprogramming in the user’s job stream, the I/O configuration, and the workload processed. Therefore, no assurance can be given that an individual user can achieve throughput improvements that are equivalent to the performance ratios that are stated.
12.3 Fundamental components of workload performance
Workload performance is sensitive to the following major factors:
Instruction path length
Instruction complexity
Memory hierarchy and memory nest
These factors are described next.
12.3.1 Instruction path length
A transaction or job runs a set of instructions to complete its task. These instructions are composed of various paths through the operating system, subsystems, and application. The total count of instructions that are run across these software components is referred to as the transaction or job path length.
The path length varies for each transaction or job, and depends on the complexity of the tasks that must be run. For a particular transaction or job, the application path length tends to stay the same, assuming that the transaction or job is asked to run the same task each time.
However, the path length that is associated with the operating system or subsystem can vary based on the following factors:
Competition with other tasks in the system for shared resources. As the total number of tasks grows, more instructions are needed to manage the resources.
The number of logical processors (n-way) of the image or LPAR. As the number of logical processors grows, more instructions are needed to manage resources that are serialized by latches and locks.
12.3.2 Instruction complexity
The type of instructions and the sequence in which they are run interact with the design of a microprocessor to affect a performance component. This factor is defined as instruction complexity. The following design alternatives affect this component:
Cycle time (GHz)
Instruction architecture
Pipeline
Superscalar
Out-of-order execution
Branch prediction
Translation Lookaside Buffer (TLB)
Transactional Execution (TX)
Single instruction multiple data instruction set (SIMD)
Simultaneous multithreading (SMT)5
As workloads are moved between microprocessors with various designs, performance varies. However, when on a processor, this component tends to be similar across all models of that processor.
12.3.3 Memory hierarchy and memory nest
The memory hierarchy of a processor generally refers to the caches, data buses, and memory arrays that stage the instructions and data that must be run on the microprocessor to complete a transaction or job.
The following design choices affect this component:
Cache size
Latencies (sensitive to distance from the microprocessor)
Number of levels, the Modified, Exclusive, Shared, Invalid (MESI) protocol, controllers, switches, the number and bandwidth of data buses, and so on.
Certain caches are private to the microprocessor core, which means that only that microprocessor core can access them. Other caches are shared by multiple microprocessor cores. The term memory nest for an IBM Z processor refers to the shared caches and memory along with the data buses that interconnect them.
A memory nest in a z15 CPC drawer is shown in Figure 12-2.
Figure 12-2 Memory hierarchy in a z15 CPC drawer
Workload performance is sensitive to how deep into the memory hierarchy the processor must go to retrieve the workload instructions and data for running. The best performance occurs when the instructions and data are in the caches nearest the processor because little time is spent waiting before running. If the instructions and data must be retrieved from farther out in the hierarchy, the processor spends more time waiting for their arrival.
As workloads are moved between processors with various memory hierarchy designs, performance varies because the average time to retrieve instructions and data from within the memory hierarchy varies. Also, when on a processor, this component continues to vary because the location of a workload’s instructions and data within the memory hierarchy is affected by several factors that include, but are not limited to, the following factors:
Locality of reference
I/O rate
Competition from other applications and LPARs
12.4 Relative Nest Intensity
The most performance-sensitive area of the memory hierarchy is the activity to the memory nest. This area is the distribution of activity to the shared caches and memory.
The term Relative Nest Intensity (RNI) indicates the level of activity to this part of the memory hierarchy. By using data from CPU MF, the RNI of the workload that is running in an LPAR can be calculated. The higher the RNI, the deeper into the memory hierarchy the processor must go to retrieve the instructions and data for that workload.
RNI reflects the distribution and latency of sourcing data from shared caches and memory, as shown in Figure 12-3.
Figure 12-3 Relative Nest Intensity
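Conceptually, the RNI calculation weights the share of L1 misses that are sourced from each level beyond the private caches, with heavier weights for levels farther from the core, and applies a machine-dependent scaling factor. Example 12-4 shows only the shape of that calculation; the weights and scale in it are illustrative placeholders, not the published z15 coefficients, which are documented with IBM's CPU MF and LSPR materials.

Example 12-4   Shape of an RNI calculation (weights and scale are illustrative placeholders)

def relative_nest_intensity(l3p, l4lp, l4rp, memp,
                            weights=(1.0, 2.0, 4.0, 8.0),
                            scale=1.0):
    """l3p/l4lp/l4rp/memp: percentage of L1 misses sourced from the
    on-chip shared cache, local L4, remote L4, and memory."""
    w3, wl4, wr4, wm = weights
    return scale * (w3 * l3p + wl4 * l4lp + wr4 * l4rp + wm * memp) / 100.0

# Example: a workload that resolves most misses close to the core
print(relative_nest_intensity(l3p=70.0, l4lp=20.0, l4rp=7.0, memp=3.0))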
Many factors influence the performance of a workload. However, these factors often influence the RNI of the workload. The interaction of all these factors results in a net RNI for the workload, which in turn directly relates to the performance of the workload.
These factors are tendencies, not absolutes. For example, a workload might have a low I/O rate, intensive processor use, and a high locality of reference, which all suggest a low RNI. However, it might be competing with many other applications within the same LPAR and many other LPARs on the processor, which tends to create a higher RNI. It is the net effect of the interaction of all these factors that determines the RNI.
The traditional factors that were used to categorize workloads in the past are shown with their RNI tendency in Figure 12-4.
Figure 12-4 Traditional factors that were used to categorize workloads
Little can be done to affect most of these factors. An application type is whatever is necessary to do the job. The data reference pattern and processor usage tend to be inherent to the nature of the application. The LPAR configuration and application mix are mostly a function of what must be supported on a system. The I/O rate can be influenced somewhat through buffer pool tuning.
However, one factor, software configuration tuning, is often overlooked but can have a direct effect on RNI. This term refers to the number of address spaces (such as CICS application-owning regions (AORs) or batch initiators) that are needed to support a workload. This factor always existed, but its sensitivity is higher with the current high frequency microprocessors. Spreading the same workload over more address spaces than necessary can raise a workload’s RNI. This increase occurs because the working set of instructions and data from each address space increases the competition for the processor caches.
Tuning to reduce the number of simultaneously active address spaces to the optimum number that is needed to support a workload can reduce RNI and improve performance. In the LSPR, the number of address spaces for each processor type and n-way configuration is tuned to be consistent with what is needed to support the workload. Therefore, the LSPR workload capacity ratios reflect a presumed level of software configuration tuning. Retuning the software configuration of a production workload as it moves to a larger or faster processor might be needed to achieve the published LSPR ratios.
12.5 LSPR workload categories based on RNI
A workload’s RNI is the most influential factor in determining workload performance. Other more traditional factors, such as application type or I/O rate, have RNI tendencies. However, it is the net RNI of the workload that is the underlying factor in determining the workload’s performance. The LSPR now runs various combinations of former workload primitives, such as CICS, Db2, IMS, OSAM, VSAM, WebSphere, COBOL, and utilities, to produce capacity curves that span the typical range of RNI.
The following workload categories are represented in the LSPR tables:
LOW (relative nest intensity)
A workload category that represents light use of the memory hierarchy.
AVERAGE (relative nest intensity)
A workload category that represents average use of the memory hierarchy. This category is expected to represent most production workloads.
HIGH (relative nest intensity)
A workload category that represents a heavy use of the memory hierarchy.
These categories are based on the RNI. The RNI is influenced by many variables, such as application type, I/O rate, application mix, processor usage, data reference patterns, LPAR configuration, and the software configuration that is running. CPU MF data can be collected by z/OS System Management Facilities (SMF) in type 113 records, or by the z/VM Monitor starting with z/VM V5R4.
12.6 Relating production workloads to LSPR workloads
Historically, the following techniques were used to match production workloads to LSPR workloads:
Application name (a client that is running CICS can use the CICS LSPR workload)
Application type (create a mix of the LSPR online and batch workloads)
I/O rate (low I/O rates used a mix of low I/O rate LSPR workloads)
The IBM Processor Capacity Reference for IBM Z (zPCR) tool supports the following workload categories:
Low
Low-Average
Average
Average-High
High
For more information about the no-charge IBM zPCR tool (which reflects the latest IBM LSPR measurements), see the Getting Started with zPCR (IBM's Processor Capacity Reference) page of the IBM Techdoc Library website.
As described in 12.5, “LSPR workload categories based on RNI” on page 481, the underlying performance sensitive factor is how a workload interacts with the processor hardware.
12.7 CPU MF counter data and LSPR workload type
Beginning with the z10 processor, the hardware characteristics can be measured by using CPU MF (SMF 113) counters data. A production workload can be matched to an LSPR workload category through these hardware characteristics.
The AVERAGE RNI LSPR workload is intended to match most client workloads. When no other data is available, use the AVERAGE RNI LSPR workload for capacity analysis.
The Low-Average and Average-High categories allow better granularity for workload characterization, but these categories apply only in zPCR.
The CPU MF data can be used to determine the workload type. When available, this data allows the RNI for a production workload to be calculated.
By using the RNI and another factor from CPU MF, the L1MP (the percentage of data and instruction references that miss the L1 cache), a workload can be classified as LOW, AVERAGE, or HIGH RNI. This classification and the resulting workload match are automated in the zPCR tool, so it is preferable to use zPCR for capacity sizing.
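Example 12-5 sketches this classification logic. The thresholds are patterned on guidance that IBM published for earlier CPU MF studies and are included for illustration only; zPCR applies the current machine-specific rules.

Example 12-5   Illustrative LOW/AVERAGE/HIGH classification from L1MP and RNI (thresholds are illustrative)

def classify_workload(l1mp: float, rni: float) -> str:
    """l1mp: % of references that miss L1; rni: calculated RNI."""
    if l1mp < 3.0:
        return "AVERAGE" if rni >= 0.75 else "LOW"
    if l1mp <= 6.0:
        if rni >= 1.0:
            return "HIGH"
        return "AVERAGE" if rni >= 0.6 else "LOW"
    return "HIGH" if rni >= 0.75 else "AVERAGE"

print(classify_workload(l1mp=4.2, rni=0.8))   # -> AVERAGE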
Starting with z/OS V2R1 with APAR OA43366, a zFS output file is no longer required for CPU MF and Hardware Instrumentation Services (HIS). HIS is a z/OS function that collects hardware event data for processors in SMF type 113 records and in z/OS UNIX System Services output files.
Only the SMF 113 records are required to determine the proper workload type from CPU MF counter data. The CPU overhead of CPU MF is minimal, and the volume of SMF 113 records is approximately 1% of the typical SMF type 70 and 72 records that RMF writes.
CPU MF and HIS can be used not only for determining the workload type, but also for other purposes. For example, starting with z/OS V2R1, you can record instruction counts in SMF type 30 records when CPU MF is active. Therefore, we strongly recommend that you always activate CPU MF.
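As an example, counter collection on z/OS is typically started through the HIS started task with operator commands similar to the ones in Example 12-6. The procedure name, title, path, and options vary by installation, and SMFPRMxx must include record type 113; verify the command options against your z/OS documentation.

Example 12-6   Typical HIS operator commands for CPU MF counter collection (verify for your installation)

S HIS
F HIS,B,TT='CPUMF',PATH='/his',CTRONLY,CTR=ALL
F HIS,E

The first command starts the HIS started task, the second begins counters-only collection with a title of CPUMF, and the third ends the collection run.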
12.8 Workload performance variation
As transistor sizes approach the size of atoms, which stands as a fundamental physical barrier, a processor chip’s performance can no longer double every two years (Moore’s Law6 no longer applies).
A holistic performance approach is required now that frequency increases no longer drive performance gains. Therefore, hardware and software synergy becomes an absolute requirement.
Starting with z13, Instructions Per Cycle (IPC) improvements in core and cache became the driving factor for performance gains. As these microarchitectural features increase (which contributes to instruction parallelism), overall workload performance variability also increases because not all workloads react the same way to these enhancements.
Because of the nature of the z15 multi-CPC drawer system and resource management across those drawers, performance variability from application to application is expected.
Also, the memory and cache designs affect various workloads in different ways. All workloads are improved, with cache-intensive loads benefiting the most. Because a z15 CPC drawer has more PUs than a z14 drawer, each with higher capacity, more work can fit in a single z15 CPC drawer, which can result in better performance. For example, a two-drawer z14 model M02 can have a maximum of 69 PUs, whereas a two-drawer z15 (feature Max71) can have a maximum of 71 PUs. The two extra PUs can share caches and memory within the drawer, so performance improvements are expected.
The workload variability for moving from z13 and z14 to z15 is expected to be stable. Workloads that are migrating from z10 EC, z196, and zEC12 to z15 can expect to see similar results with slightly less variability than the migration from z13 and z14.
Experience demonstrates that IBM Z servers can be run at up to 100% utilization levels, sustained. However, most clients prefer to leave some room and run at 90% or slightly under.
12.9 Capacity planning consideration for z15
In this section, we describe recommended ways to conduct capacity planning for z15.
 
Do not use MIPS or MSUs for capacity planning: Do not use “one number” capacity comparisons, such as MIPS or MSUs. IBM does not officially announce processor performance in MIPS, and MSU is only a number for software license charging; it does not represent the performance of the processor.
12.9.1 Collect CPU MF counter data
It is important to recognize the LSPR workload type of your production system. As described in 12.7, “CPU MF counter data and LSPR workload type” on page 482, the capacity of a processor differs depending on the LSPR workload type. By collecting the CPU MF SMF 113 records, you can identify the workload type by using an IBM-provided capacity planning tool. Therefore, collecting CPU MF counter data is the first step of capacity planning.
12.9.2 Creating EDF file with CP3KEXTR
An EDF file is the input file for the IBM Z capacity planning tools. You can create this file by using the CP3KEXTR program, which reads SMF records and extracts the data that is needed as input to IBM’s Processor Capacity Reference (zPCR) and z Systems Batch Network Analyzer (zBNA) tools.
CP3KEXTR is offered as a “no-charge” application. It can also create the EDF file for ZCP3000. ZCP3000 is an IBM internal tool, but you can create the EDF file for it on your system. For more information about CP3KEXTR, see the IBM Techdoc z/OS Data Extraction Program (CP3KEXTR) for zPCR and zBNA.
12.9.3 Loading EDF file to the capacity planning tool
By loading the EDF file into an IBM capacity planning tool, you can see the LSPR workload type that is based on CPU MF counter data. Figure 12-5 shows a sample zPCR window with a workload type; in this example, the workload type is displayed in the “Assigned Workload” column. When you load the EDF file into zPCR, it automatically sets your LPAR configuration, which also makes it easy to define the LPAR configuration in zPCR.
Figure 12-5 zPCR LPAR Configuration from EDF window
12.9.4 Tips to maximize z15 server capacity
The server capacity of the z15 can be maximized by using the following tips:
Turn on HiperDispatch in every LPAR. HiperDispatch optimizes processor cache usage by creating an affinity between a PU and the workload.
Assign an appropriate number of logical CPs. If you assign too many logical CPs to an LPAR, unnecessary LPAR management overhead is incurred. This issue reduces the efficiency of the cache.
Server capacity declines as the LCP:RCP ratio (the sum of logical CPs that are defined in all LPARs to the number of physical CPs in your configuration) grows. Therefore, assigning the correct number of logical CPs to each LPAR is important (see Example 12-7, which follows Figure 12-6).
If your configuration’s LCP:RCP ratio exceeds the practical limit, zPCR warns you. Figure 12-6 shows a sample zPCR message window that is displayed when the practical LCP:RCP ratio is exceeded.
Figure 12-6 zPCR message window
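The ratio itself is simple arithmetic, as shown in Example 12-7; the LPAR names and CP counts in it are hypothetical.

Example 12-7   Computing the LCP:RCP ratio for a configuration (LPAR values are hypothetical)

# Logical CPs defined to each active LPAR (hypothetical values)
logical_cps = {"PROD1": 12, "PROD2": 10, "TEST": 6}
physical_cps = 16                 # RCP: physical CPs in the configuration

lcp_rcp = sum(logical_cps.values()) / physical_cps
print(f"LCP:RCP = {lcp_rcp:.2f}")   # 28/16 = 1.75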
 

1 Traditional online transaction processing workload (formerly known as IMS).
2 Commercial batch with long-running jobs.
3 Low I/O Content Mix Workload.
4 Transaction Intensive Mix Workload.
5 Only available for IFL, zIIP, and SAP processors.
6 For more information, see the Moore’s Law website.