Performance terminology
This chapter describes several important concepts and terms that are used to help you understand the performance of IBM z Systems™ hardware and software. These concepts and terms are referenced extensively in this publication when the performance of a CICS environment is described.
The IBM z Systems CPU Measurement Facility (CPU MF) provides information for use with the Large Systems Performance Reference (LSPR) charts. This chapter describes how CPU MF data is collected, how the LSPR charts are used, and how the figures that are obtained from the LSPR reference tables relate to CICS transaction cost and throughput.
This chapter includes the following topics:
1.1, “CPU Measurement Facility”
1.2, “Relative nest intensity”
1.3, “Large Systems Performance Reference”
1.4, “Relating LSPR values to a CICS workload”
1.1 CPU Measurement Facility
This section provides background information about the CPU MF capability that is available in IBM z Systems z10™ (and later) hardware. Data from CPU MF is used alongside the LSPR tables as described in 1.3, “Large Systems Performance Reference” on page 6.
The CPU MF capability provides optional hardware-assisted collection of information about the work that the logical CPUs run over a specified interval in selected logical partitions (LPARs).
This powerful capability was not available before the z10 generation of hardware. CPU MF does not replace any existing functions or capabilities; instead, it enriches them. CPU MF consists of the following important, but independent, functions:
The collection of counters that maintain counts of certain activities.
The collection of samples that provide information about what the CPU is doing at the time of the sample.
The collection of counters function is intended to be run constantly to collect long-term performance data, in a similar manner to how you collect other performance data.
The collection of samples function is a short duration, precise function that identifies where CPU resources are being used to help you improve application efficiency.
CPU MF runs at the LPAR level so that you can collect counter data in one LPAR, counter and sample data in another LPAR, and not use CPU MF at all on a third LPAR. The information that CPU MF gathers pertains to only the LPARs where you enable and start CPU MF.
The implementation of CPU MF is nondisruptive. If the prerequisite hardware and software are in place, you can start CPU MF data collection with no LPAR deactivations or activations. As a result, performing an initial program load (IPL) on the system that CPU MF is used with is not necessary.
CPU MF can run in multiple LPARs simultaneously and can be used with central processors (CPs), IBM System z® Integrated Information Processor (zIIP), and IBM System z Application Assist Processor (zAAP).
For more information about the concepts, configuration, and use of CPU MF data, see the IBM Redpaper™ publication Setting Up and Using the IBM System z CPU Measurement Facility with z/OS, REDP-4727, which is available at this website:
1.2 Relative nest intensity
This section outlines several concepts that apply to IBM z Systems memory hierarchy and then defines the relative nest intensity (RNI) metric that quantifies the interactions between software and hardware.
Included in this book are extracts from the IBM Redbooks publication Large Systems Performance Reference, SC28-1187. For more information about the IBM z Systems memory hierarchy and the LSPR workloads that were used, see the following LSPR for IBM z Systems resource link:
1.2.1 Memory hierarchy and nest
The memory hierarchy of a processor generally refers to the caches, data buses, and memory arrays that stage the instructions and data that must be executed on the micro-processor to complete a transaction or job. Many design alternatives affect this component, such as cache size, latencies (sensitive to distance from the micro-processor), number of levels, modified, exclusive, shared, invalid (MESI) protocol, controllers, switches, and number and bandwidth of data use.
Some of the caches are private to the micro-processor, which means that only that micro-processor can access them. Other caches are shared by multiple micro-processors. In this book, the term memory nest for a z Systems processor refers to the shared caches and memory along with the data buses that interconnect them.
Workload performance is sensitive to how deep into the memory hierarchy the processor must go to retrieve the instructions and data of the workload for execution. Best performance occurs when the instructions and data are found in the cache (or caches) that are nearest the processor so that little time is spent waiting before execution. Where instructions and data must be retrieved from farther out in the hierarchy, the processor spends more time waiting for their arrival.
As workloads are moved between processors with different memory hierarchy designs, performance varies because the average time to retrieve instructions and data from within the memory hierarchy varies. Also, even when a workload stays on one processor, this component continues to vary significantly because the location of the instructions and data of a workload within the memory hierarchy is affected by many factors, including locality of reference, I/O rate, competition from other applications, and other LPARs.
The most performance-sensitive area of the memory hierarchy is the activity to the memory nest, namely, the distribution of activity to the shared caches and memory. The term, relative nest intensity (RNI) indicates the level of activity to this part of the memory hierarchy. By using data from CPU MF, the RNI of the workload running in an LPAR can be calculated. The higher the RNI, the deeper into the memory hierarchy the processor must go to retrieve the instructions and data for that workload.
Micro-processors do not execute instructions at a constant rate. When instructions and data must be retrieved from farther out in the memory hierarchy, the processor spends more time waiting for their arrival. Therefore, a high RNI implies that the instruction execution rate (usually measured as millions of instructions per second, or MIPS) of a processor is lower than that of a workload with a low RNI. Stated another way, a workload with a high RNI requires more cycles to complete each instruction (stated as cycles per instruction, or CPI).
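The inverse relationship between CPI and instruction execution rate can be sketched with arithmetic. The clock frequency and CPI values below are purely illustrative assumptions, not measured LSPR or CPU MF figures:

```python
# Illustrative sketch: how CPI relates to effective instruction rate (MIPS).
# The clock frequency and CPI values are hypothetical, chosen only to show
# the relationship; real values come from CPU MF measurements.
clock_hz = 5.0e9             # assumed 5.0 GHz micro-processor clock

cpi_low_rni = 2.0            # workload that mostly hits in nearby caches
cpi_high_rni = 4.0           # workload that often waits on the memory nest

# instructions per second = cycles per second / cycles per instruction
mips_low_rni = clock_hz / cpi_low_rni / 1e6     # 2500.0 MIPS
mips_high_rni = clock_hz / cpi_high_rni / 1e6   # 1250.0 MIPS
```

Doubling the CPI halves the effective instruction rate, which is why two workloads on identical hardware can see very different MIPS figures.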
1.2.2 Factors that can influence RNI
Many factors influence the performance of a workload. However, what these factors usually influence is the RNI of the workload. The interaction of all of these factors results in a net RNI for the workload, which in turn directly relates to the performance of the workload.
Figure 1-1 shows the traditional factors that were used to categorize workloads in the past, along with their RNI tendency.
Figure 1-1 Relative nest intensity tendencies
An important aspect to emphasize is that these factors are tendencies and not absolutes. For example, a workload might have a low I/O rate, intensive CPU use, and a high locality of reference, which are all factors that suggest a low RNI. But, what if it is competing with many other applications within the same LPAR and many other LPARs on the processor, which tend to push it toward a higher RNI? The net effect of the interaction of all these factors is what determines the RNI of the workload, which in turn greatly influences its performance.
You can do little to affect most of these factors. An application type is whatever is necessary to do the job. Data reference pattern and CPU usage tend to be inherent in the nature of the application. LPAR configuration and application mix are mostly a function of what must be supported on a system. I/O rate can be influenced somewhat through buffer pool tuning.
However, one factor that can be affected (software configuration tuning) is often overlooked but can have a direct effect on RNI. In the context of a CICS workload, software configuration tuning refers to the number of address spaces (such as CICS AORs) that are needed to support a workload. This factor has always existed, but its sensitivity is higher with today’s high-frequency micro-processors. Spreading the same workload over more address spaces than necessary can raise the RNI of a workload because the working set of instructions and data from each address space increases the competition for the processor caches. For more information, see 5.10, “Workload consolidation” on page 62.
Tuning to reduce the number of simultaneously active address spaces to the proper number needed to support a workload can reduce RNI and improve performance. To produce the LSPR reference tables, IBM tunes the number of address spaces for each processor type and count configuration to be consistent with what is needed to support the workload. Therefore, the LSPR workload capacity ratios reflect a presumed level of software configuration tuning. This sensitivity of RNI to the number of supporting address spaces suggests that retuning the software configuration of a production workload as it moves to a bigger or faster processor might be needed to achieve the published LSPR ratios.
1.3 Large Systems Performance Reference
The following important capacity metrics are defined in this section before the use of the LSPR tables is described:
External throughput rate
Internal throughput rate
1.3.1 External throughput rate
The external throughput rate (ETR) is computed by using the following equation:
ETR = (units of work) ÷ (elapsed time)
For a CICS workload, units of work are normally expressed as the number of CICS transactions completed. To be useful, the units of work that are measured must represent a large and repeatable sample of the total workload to best represent the average. Elapsed time is normally expressed in seconds.
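The ETR equation can be sketched directly. The transaction count and interval length below are hypothetical values, chosen only to illustrate the calculation:

```python
def external_throughput_rate(units_of_work, elapsed_seconds):
    """ETR = (units of work) / (elapsed time).

    For a CICS workload, units_of_work is typically the number of
    completed CICS transactions over the measurement interval.
    """
    return units_of_work / elapsed_seconds

# Hypothetical measurement: 180,000 transactions completed
# during a 600-second interval.
etr = external_throughput_rate(180_000, 600)   # 300.0 transactions/second
```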
ETR characterizes system capacity because it is an elapsed time measurement (system capacity encompasses the performance of the processor and all of its external resources, considered together). As such, ETR lends itself to the system comparison methodology. This methodology requires the data processing system to be configured with all intended resources, including the processor, with appropriate amounts of central storage, expanded storage, channels, control units, I/O devices, TP network, and so on.
After the system is configured, the goal is to determine how much work the system, as a whole, can process over time. To accomplish this goal, the system is loaded with the appropriate workload until it cannot absorb work at any greater rate. The highest ETR achieved is the processing capability of the system.
When you make a system measurement of this type, all resources on the system are potential capacity inhibitors. If a resource other than the processor is, in fact, a capacity inhibitor, the processor is likely to be running at something less than optimal utilization.
This system comparison methodology is a legitimate way to measure when the intent is to assess the capacity of the system as a whole. For online systems, response time also becomes an important system-related metric because poor response times inhibit the ability of users to work. Therefore, system measurements for online work usually involve some type of response time criteria. If the response time criteria are not met, whatever ETR can be achieved does not matter.
1.3.2 Internal throughput rate
The internal throughput rate (ITR) is computed by using the following formula:
ITR = (units of work) ÷ (processor busy)
As with ETR, units of work are normally expressed as jobs (or job-steps) for batch workloads, and as transactions or commands for online workloads. System control programs (SCPs) and most major software products have facilities to provide this information. To be useful, the units of work that are measured must represent a large and repeatable sample of the total workload to best represent the average. Processor busy time is normally expressed in seconds.
ITR characterizes processor capacity because it is a CPU busy time measurement. As such, ITR lends itself to the processor comparison methodology. Because the focus of LSPR is on a single resource (the processor), you must modify the measurement approach from that used for a system comparison methodology.
To ensure that the processor is the primary point of focus, you must configure it with all necessary external resources (including central storage, expanded storage, channels, control units, and I/O devices) in adequate quantities so that they do not become constraints. You must avoid the use of processor cycles to manage external resource constraints to assure consistent and comparable measurement data across the spectrum of processors being tested.
Many acceptance criteria for LSPR measurements can help assure that external resources are adequate. For example, internal response times should be subsecond; if they are not, some type of resource constraint must be resolved. For various DASD types, expected nominal service times are known. If the measured service times are high, some type of queuing is occurring, which indicates a constrained resource. When unexpected resource constraints are detected, they are fixed and the measurement is redone.
Because the processor is also a resource that must be managed by the SCP, steps must be taken to ensure that excess queuing on it does not occur. The way to avoid this type of constraint is to make the measurements at preselected utilization levels that are less than 100%. Because the LSPR is designed to relate processor capacity, measurements must be made at reasonably high utilization, but without causing uncontrolled levels of processor queuing. Typically, LSPR measurements for online workloads are made at a utilization level of approximately 90%. Batch workloads are always measured with steady-state utilizations above 90%. Mixed workloads that contain an online and batch component are measured at utilizations near 99%.
One other point must be made about processor utilization. Whenever two processors are to be compared for capacity purposes, they should both be viewed at the same loading point, which means at equal utilization. Assessing relative capacity when one processor is running at low utilization and the other is running at high utilization is imprecise. The LSPR methodology mandates that processor comparisons be made at equivalent utilization levels.
1.3.3 ITR and ETR relationship
An ITR can be viewed as a special case of ETR; that is, an ITR is the measured ETR normalized to full processor utilization. Therefore, an alternative way to compute an ITR is to use the following equation:
ITR = (ETR) ÷ (processor utilization)
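Both forms of the ITR calculation give the same result, which a short sketch can confirm. The interval figures are hypothetical, using the approximately 90% utilization level that is typical of LSPR online measurements:

```python
def itr_from_busy(units_of_work, processor_busy_seconds):
    """ITR = (units of work) / (processor busy time)."""
    return units_of_work / processor_busy_seconds

def itr_from_etr(etr, processor_utilization):
    """ITR = ETR / processor utilization (utilization as a fraction)."""
    return etr / processor_utilization

# Hypothetical interval: 180,000 transactions over 600 elapsed seconds,
# with the processor busy for 540 of those seconds (90% utilization).
etr = 180_000 / 600                      # 300 transactions/second
itr_direct = itr_from_busy(180_000, 540)
itr_normalized = itr_from_etr(etr, 540 / 600)
# Both forms yield the same ITR (about 333.33 transactions/second).
```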
1.3.4 LSPR ITR ratios
LSPR capacity data is presented in the form of ITR ratios for IBM processors, where each model is configured with multiple z/OS images that are based on an average LPAR profile of client systems. All capacity numbers are relative to the IBM 2094-701 running a multi-image z/OS configuration.
Comparing ITR ratios for two processor configurations allows a capacity planner to predict the effects of modifying hardware configuration at a high level. However, the most accurate sizings require the use of the LPAR Configuration Capacity Planning function of the zPCR tool, which can be customized to match a specific multi-image configuration rather than the average configurations that are reflected in the multi-image LSPR table.
1.4 Relating LSPR values to a CICS workload
By using data that is obtained from CPU MF and the reference information that is found in LSPR, you can understand how a CICS workload is expected to perform when moving between hardware configurations.
The example in this section outlines the steps to help you understand the effects of a hardware upgrade. For this example, assume that the workload has an “average” RNI as determined by CPU MF.
In the following example, we look at the expected effects of adding CPs to an IBM z Systems z13™. Table 1-1 lists an extract of ITR ratios for the two processor configurations. This extract was taken from the z/OS V2.1 LSPR ITR ratios reference.
Table 1-1 Extract of LSPR table for selected processors
Processor   # CP   Low     Average   High
2964-703    3       9.08    8.30      7.28
2964-705    5      14.72   13.21     11.45
To calculate the potential throughput improvements that are obtained by upgrading the configuration from 3 CPs to 5 CPs, calculate the ratios of the relevant ITR columns. So, the average throughput scaling is equal to 13.21 ÷ 8.30 = 1.59.
Therefore, in the absence of software constraints, you might expect the throughput of the system to increase by 59%.
To calculate the change in CPU cost per transaction, first calculate the CPU cost of each LSPR transaction, as shown in the following example:
CPU cost = (number of CPs) ÷ (ITR)
Therefore, the LSPR Average RNI transaction for the 2964-703 processor costs 0.361s of the CPU, as shown in the following example:
3 ÷ 8.30 = 0.361s
The same transaction on the 2964-705 processor costs 0.379s of the CPU, as shown in the following example:
5 ÷ 13.21 = 0.379s
From these values, you can see that the CPU cost per transaction increases from 0.361s to 0.379s, which is an increase of 5%.
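The whole sizing calculation can be reproduced in a few lines, using the Average-RNI ITR ratios from Table 1-1:

```python
# Worked example from this section: upgrading a 2964-703 (3 CPs)
# to a 2964-705 (5 CPs), using the Average-RNI ITR ratios.
itr_703, cps_703 = 8.30, 3     # 2964-703 from Table 1-1
itr_705, cps_705 = 13.21, 5    # 2964-705 from Table 1-1

# Throughput scaling is the ratio of the ITR values (about 1.59,
# that is, a 59% throughput increase absent software constraints).
throughput_scaling = itr_705 / itr_703

# CPU cost per LSPR transaction: (number of CPs) / ITR.
cost_703 = cps_703 / itr_703            # about 0.361s
cost_705 = cps_705 / itr_705            # about 0.379s

# Per-transaction CPU cost increases by about 5%.
cost_increase_pct = (cost_705 / cost_703 - 1) * 100
```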
This increase in CPU per transaction is an expected result because increasing concurrency through the addition of CPUs increases contention for common cache lines. As described in 1.2.2, “Factors that can influence RNI” on page 5, workload performance is sensitive to how deep into the memory hierarchy the processor must go to retrieve instructions and data. Increasing concurrency decreases the probability that a cache line is available for exclusive use by a processor at any specific time.
 
Note: This increase in CPU per transaction is important for non-threadsafe CICS transactions. Non-threadsafe applications run on the CICS QR TCB; therefore, non-threadsafe applications in CICS are limited by the capacity of the single QR TCB within a CICS region.
1.4.1 LSPR alternative
The LSPR shows relative capacity ratios that are sensitive to workload type. However, LPAR configuration is also a sensitive factor in capacity relationships. IBM offers the Processor Capacity Reference (zPCR) tool for customer use. The tool takes the LSPR to the next level by estimating capacity relationships that are sensitive to workload type and LPAR configuration, processor configuration, and specialty engine configuration. All of these factors can be customized to match your configuration. The LSPR data is contained in the tool.
For the most accurate capacity sizings, zPCR should be used. By using CPU-MF data that is collected in your environment, the zPCR tool can calculate the overall RNI value of a workload and determine the most appropriate LSPR workload to model the environment.
For more information about the zPCR tooling, see the IBM Techdoc Getting Started with zPCR, which is available at this website:
 