Performance planning tools
This chapter gives an overview of IBM Storage Modeller, the storage modeling and sizing tool for IBM storage systems.
We also consider special sizing aspects, such as planning for Multi-Target PPRC or working with DS8000 storage systems that use Easy Tier.
6.1 IBM Storage Modeller
IBM Storage Modeller (StorM) is a sizing and modeling tool for IBM storage systems. It is based on an open client/server architecture, is prepared to connect to other tools, and uses a web-based graphical user interface. Its ability to support collaboration by sharing project files is a major differentiator, and its interactive canvas enables easy design and visualization of a solution architecture. StorM is platform agnostic and runs on most current web browsers.
Only IBMers and registered Business Partners can access StorM.
Precise storage modeling, for both capacity and performance, is paramount, and StorM follows the latest mathematical modeling for it. Supported storage systems include the DS8000 models starting with the DS8880 (POWER8-based) generation.
Some aspects of the outgoing Disk Magic tool are still mentioned here, because Disk Magic can also model DS8000 generations before the DS8880, and because it still offers excellent means of importing collected host performance data and quickly crunching the numbers. However, Disk Magic no longer picks up support for product features introduced in 2021 and later.
This chapter gives an idea of the basic usage of these tools. Clients who want such a study performed should contact their IBM or Business Partner representative.
6.1.1 The need for performance planning and modeling tools
There are two ways to approach planning and modeling:
Perform a complex workload benchmark based on your workload.
Do a modeling by using performance data retrieved from your system.
Performing an extensive and elaborate lab benchmark by using the correct hardware and software provides a more accurate result because it is real-life testing. Unfortunately, this approach requires much planning, time, and preparation, plus a significant amount of resources, such as technical expertise and hardware/software in an equipped lab.
Doing a study with the sizing tool requires much less effort and resources. All that is required from the client is to retrieve the performance data of the workload and the configuration data of the servers and storage systems. These tools are especially valuable when replacing a storage system or when consolidating several. In the past, installations often had small amounts of flash (5% or less) and many 10K-rpm HDDs in multi-frame configurations. Today, footprints shrink sharply with all-flash installations that include a major share of high-capacity flash, and several older, smaller DS8000 models are often consolidated into one larger model of a newer generation: all of that can be simulated. We can also simulate an upgrade for an existing storage system that we want to keep longer. For instance: Does a cache upgrade help, and what would be the positive effect on latency? Or, if the host bus adapters have a utilization problem, how many must be added to resolve it?
When collecting your performance data, especially on the host side, it is important that the time frames are truly representative and cover the busiest times of the year. Also, different applications can peak at different times. For example, a processor-intensive online application might drive processor utilization to a peak while users are actively using the system. However, the bus or drive utilization might be at a peak when the files are backed up during off-hours. So, you might need to model multiple intervals to get a complete picture of your processing environment.
6.1.2 Configuration data of your legacy storage systems
Generally, to perform a sizing and modeling study, we should obtain the following information about the existing storage systems that we want to upgrade or potentially replace:
Control unit type and model
Cache size
Disk drive sizes (flash, HDD), their numbers and their speeds, and RAID formats
Number, type, and speed of channels
Number and speed of host adapter (HA) ports and, for IBM Z, FICON ports
Parallel access volume (PAV) for Z
Speed in the SAN, and when this is due for replacement
For Remote Copy, you need the following information:
Remote Copy type(s) (synchronous, asynchronous; any multi-target?)
Distance
Number of links and speed
The capacity planning must also be clear: Is today's capacity to be retained 1:1, or will it grow further, and by what factor? Similarly for performance: Should some load growth be factored in, and to what extent?
If architectural changes are planned, such as introducing additional multi-target replication or Safeguarded Copy, they must be noted and taken into account.
Part of this data can be obtained from collected IBM Z host performance data, or from a tool such as IBM Spectrum Control or IBM Storage Insights.
6.2 Data from IBM Spectrum Control and Storage Insights
IBM Storage Insights and IBM Spectrum Control are strategic single-pane-of-glass monitoring tools that can be recommended for every IBM storage customer. Because the recommendation is also to base the modeling on measurements of your real storage system and workload data, monitoring tools like these can be used. They are especially useful for distributed systems, where large-scale host-based performance monitoring (such as RMF for z/OS clients) is typically not available.
In both Storage Insights and IBM Spectrum Control, when you go to Block Storage Systems and right-click, a menu opens with the option “Export Performance Data”, as in Figure 6-1 on page 128. This performance data package provides valuable and detailed performance and configuration information, for instance, drive types and quantities, RAID type, extent pool structure, number of HBAs, ports and their speeds, whether PPRC is used, and more.
Figure 6-1 SC/SI Export Performance Data package
It also shows ports that are not used, as well as RAID arrays that are outside pools and therefore not used.
Before exporting the package, extend the period from the default of a few hours to the desired longer interval, incorporating many representative busy days, as in Figure 6-2.
Figure 6-2 SC/SI exporting performance data: Select the right time duration
You can consider many aspects of the configuration, including PPRC (synchronous or asynchronous; the distance in km; Metro/Global Mirror; or multi-site).
The resulting performance data come in the following format:
SC_PerfPkg_<DS8000_name>_<date>_<duration>.ZIP
and contain mainly CSV files, as shown in Figure 6-3 on page 129.
Figure 6-3 DS8000 SC/SI Performance Data Package content
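As a quick sanity check before importing, the package contents can be listed programmatically. The following is only an illustrative sketch; the member and package names used in the comment are assumptions based on the layout described above:

```python
import zipfile

def list_package_csvs(zip_path):
    """Return the CSV report names contained in an SC/SI performance data package."""
    with zipfile.ZipFile(zip_path) as pkg:
        return sorted(name for name in pkg.namelist()
                      if name.lower().endswith(".csv"))

# Hypothetical call, following the SC_PerfPkg_<DS8000_name>_<date>_<duration>.ZIP pattern:
# list_package_csvs("SC_PerfPkg_DS8950_20210713_7d.ZIP")
```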
The StorageSystem CSV file is essential because it contains the view of the entire storage system in terms of performance data to the SAN and to the hosts. But for setting up your model, the other files also provide important information beforehand:
From the Nodes CSV, get some information on the DS8000 cache available and used.
In the Pools CSV, get information about how many extent pools exist.
In the RAIDArrays CSV, get the drive types used and the type of RAID formatting for each rank (e.g., RAID-6). When you see arrays that consistently show zero load, do not include them in your modeling.
In the StoragePorts CSV, see how many HBAs are installed, how many of their ports are in use, and their speeds. You also see how many of these HBAs and ports carry PPRC traffic.
Look at different points in time: If some ports (or even HBAs) consistently show zero load, take them out of the modeling.
Looking at the HostConnections CSV, you might want to check whether possibly not all hosts take part in the PPRC.
The StorageSystem CSV tells you the exact DS8000 model (e.g., 986) and provides the primary peak data for the modeling. Here, too, you can check the amount of PPRC activity.
Also check whether there are exceptionally high values anywhere; these give additional insight into why a customer might be asking to replace the storage system.
All this information from the several CSV files lets you build up the configuration of an existing DS8000. Competitive storage systems can also be monitored by SC/SI.
The CSV or ZIP file can then be imported into StorM in an automated way, letting the tool determine the peak load intervals and use them for your modeling.
Workload profiles
After importing the workload file into the StorM tool, we see several categories that typically make up a full workload profile: the Read I/O Percentage (or Read:Write ratio), the transfer sizes for reads and writes, the read-cache hit ratio, and the sequential ratios for reads and writes. The latter are optional, for the case that you enter a workload manually, and hence are found under the Advanced Options, as in Figure 6-4 on page 130.
Figure 6-4 Open Systems workload after import into StorM
The workload peak profiles thus obtained can then be scaled further with a factor for “Future growth”. It is even possible to enter negative values here, such as -50%, for instance if you want to cut a workload in half when splitting it into pools and into the StorM “Data Spaces”.
The Cache Read Hit percentage is shown exactly as measured on the previous (legacy) storage system. However, if you are already modeling a replacement system with a bigger cache, you can switch from automated input to manual and then use the Estimator button to calculate the new cache-hit ratio on the new storage system.
Once the data is read in automatically, you can choose from up to nine kinds of peak workload intervals for your modeling: for instance, the Total I/O Rate (usually the most important overall), the Total Data Rate (also important, and it should always be looked at), response time peaks (interesting to consider because they often coincide with low cache-hit ratios, deviating read:write ratios, or suddenly larger block sizes), or the peak of the Write Data Rate (which should additionally be considered, for example, when modeling PPRC projects, because flash drives can also be limited by a high write throughput). See Figure 6-5 for this selection of the peak intervals.
Figure 6-5 Peak interval selection in StorM
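The peak-interval selection that StorM automates can be sketched in a few lines: scan all measurement intervals and keep, per metric, the interval where that metric peaks. The row keys used below are illustrative assumptions, not the actual CSV column headers:

```python
def peak_intervals(rows, metrics):
    """Return, for each metric name, the interval (row) where that metric peaks.

    rows    -- one dict per measurement interval, e.g. parsed from the
               StorageSystem CSV (keys here are illustrative only)
    metrics -- metric names to scan for their peak interval
    """
    return {m: max(rows, key=lambda row: row[m]) for m in metrics}

intervals = [
    {"time": "09:00", "total_io_rate": 520000, "write_data_rate": 900},
    {"time": "21:00", "total_io_rate": 380000, "write_data_rate": 2400},
]
peaks = peak_intervals(intervals, ["total_io_rate", "write_data_rate"])
# The I/O-rate peak (daytime) and the write-throughput peak (batch night)
# can fall in different intervals, which is why several peaks are modeled.
```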
6.3 StorM with IBM i
Modeling for an IBM i host can also be done by gathering storage data via IBM Storage Insights or Spectrum Control. When the customer mixes IBM i with other platforms on their storage system, SC/SI can be the preferred choice for the performance data collection.
However, a significant number of IBM i clients are entirely focused on that platform, and they prefer to collect their storage performance data on the host.
On the host side, one option is the Performance Tools (PT1) reports. IBM i clients can sometimes have very heavy short-term I/O peaks, and on an IBM i-only platform, data collection with the PT1 reports can give you a precise picture of what is happening on the storage side.
The PT1 reports consist of:
System Report
 – Disk Utilization
 – Storage Pool Utilization
Component Report
 – Disk Activity
Resource Interval Report
 – Disk Utilization Summary
From the PT1 reports, I/O peaks can be manually transferred into StorM.
With StorM, a newer reporting option can be used, based on the IBM MustGather Data Capture tool (QMGTOOLS) for IBM i. A QMGTOOLS build of 23 June 2021 or later is required to exploit this new reporting option.
Use a collection interval of 5 min or less with it, and find more background on this tool and method at:
Figure 6-6 shows which options exist in the QMGTOOLS QPERF menu to gather data for modeling storage performance.
Figure 6-6 QMGTOOLS QPERF menu: Gathering data for storage performance modeling
The newer option is 15, which collects the needed metrics so that they can be used directly with the StorM tool.
IBM i typically has a very low Easy Tier skew. Unless you have a precisely measured skew factor, enter “Easy Tier Skew = Very Low” (factor 2.0) in the Performance tab of StorM.
6.4 Modeling and sizing for IBM Z
In a z/OS environment, gathering data for performance modeling requires the System Management Facilities (SMF) record types 70 - 78. Specifically, for a representative time period, you would extract the following SMF records: 73, 74-1, 74-5, 74-7, 74-8, and 78-3.
The SMF data are packed by using RMFPACK. The installation package, when expanded, shows a user guide and two XMIT files: RMFPACK.JCL.XMIT and RMFPACK.LOAD.XMIT.
To pack the SMF data set into a ZRF file, complete the following steps:
1. Install RMFPACK on your z/OS system.
2. Prepare the collection of SMF data.
3. Run the $1SORT job to sort the SMF records.
4. Run the $2PACK job to compress the SMF records and to create the ZRF file.
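As a plausibility check on the collected SMF data before sorting and packing, the record types present in a binary dump can be counted. The sketch below is an illustration, not part of RMFPACK, and it assumes the data set was transferred in binary with the 4-byte record descriptor words (RDWs) preserved:

```python
from collections import Counter

def count_smf_record_types(dump: bytes) -> Counter:
    """Count SMF record types in a binary dump with RDWs preserved.

    Each record begins with a 2-byte big-endian length (which includes
    the 4-byte RDW itself) and 2 reserved bytes; the SMF record type is
    the byte at offset 5, after the 1-byte flag at offset 4.
    """
    counts = Counter()
    pos = 0
    while pos + 6 <= len(dump):
        reclen = int.from_bytes(dump[pos:pos + 2], "big")
        if reclen < 6 or pos + reclen > len(dump):
            break  # malformed or truncated record: stop scanning
        counts[dump[pos + 5]] += 1
        pos += reclen
    return counts

# RMF modeling needs the 70-78 range; other types could be filtered out.
```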
For most mainframe customers, sampling periods of 15 min are fine. The main thing is to cover the really busy days, which might occur only at certain times in a month, or even in a year. So performance data of one or two busy days can be sufficient, but make sure they are the right days, and cover both night and daytime, because workload patterns in the night (batch) often differ from the more transaction-oriented workloads during the day.
Storage modeling output: Utilization
The performance modeling checks several things: it not only calculates the expected new latency (response time) for a replacement or upgraded storage system, it can also calculate the expected internal utilization levels of all the various components of such a storage system.
Figure 6-7 shows the modeling of DS8000 component utilization for a given peak workload interval. We can make the following observations:
For the selected (peak) workload of 600 KIOps, and with the selected number of drives, host bus adapters, amount of cache, number of DS8000 processor cores, and so on, the model shows how busy each internal component would be. We see utilization numbers of around 50% for the FICON adapters and the FICON ports. This utilization rate could easily be lowered by adding more HBAs, even though it is still below the critical “amber” threshold value.
The utilization numbers for the HPFE and for the Flash Tier 1 drives (consisting of 3.84 TB drives) are extremely low. As a result, you could consider a cheaper configuration using 7.68 TB Flash Tier 2 drives.
The expected utilization levels for the internal DS8950 bus and for the processing capability are about 69%, which is a bit too high. Assuming the storage system is currently a DS8950F with dual 10-core processors, you could upgrade it to dual 20-core processors.
Figure 6-7 Storage Systems component Utilization chart for a DS8950F
We also see that this storage system actively uses zHyperLink, Metro Mirror, and z/OS Global Mirror. Mirroring consumes a significant percentage of the internal storage system processing resources and must be modeled specifically. Metro Mirror also drives significant FICON adapter usage, and a Metro Mirror model lets you see whether the throughput at the DS8000 host adapters could become too high. Throughput can also be limited by the bandwidth of your wide area network. You must also take into consideration that a typical workload increases over time, and that your Metro Mirror bandwidth must increase by the same factor.
The sizing must be done in a way that optimizes the utilization of all internal components while making certain that all peak workload profiles stay below the critical amber or red thresholds. When the load comes close to these thresholds, response times become elongated; at the threshold, the system latency starts to exhibit non-linear behavior, and any further workload increase deteriorates the response time. This situation must definitely be avoided.
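The non-linear behavior near the threshold can be illustrated with a textbook single-server queueing (M/M/1) approximation, R = S / (1 - ρ). This is only a didactic sketch, not StorM's actual internal model:

```python
def est_response_time(service_ms: float, utilization: float) -> float:
    """M/M/1 open-queueing estimate of response time in milliseconds.

    service_ms  -- service time of the component when it is idle
    utilization -- fraction of time the component is busy, 0 <= rho < 1
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_ms / (1.0 - utilization)

# With a 0.2 ms service time: 0.4 ms at 50% busy, but 2.0 ms at 90% busy
# and 4.0 ms at 95% -- latency explodes as utilization approaches 100%.
```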
Modeling output: Expected response time
For the given workload profile, StorM predicts an expected response time of 0.34 ms. But there is more information: StorM also predicts when the response time starts increasing exponentially, and it provides details about the individual response-time components.
What do the Response time Components tell you?
Figure 6-8 on page 135 provides the average response time from minimum to maximum workload for each measured interval. The response time components are Connect time, Disconnect time, Pending time, and IOSQ time:
Connect time: The device was connected to a channel path and transferring data between the device and central storage (processor). High connect time occurs because of contention in the FICON channels.
Disconnect time: The time when the device was in use but not transferring data, typically because the device is waiting for a Metro Mirror copy, a random-read operation (read miss), or write de-stage activity.
Pending time: The time an I/O request must wait in the hardware between the processor and the storage system. This value also includes the time spent waiting for an available channel path and control unit, as well as delays due to shared DASD contention.
IOSQ time: The time an I/O request must wait on an IOS queue before the disk becomes free.
At 0.34 ms overall response time for the 600 KIOps workload, we can compare this expected value for the new or upgraded system with what the customer has today and make sure that there is an improvement. But we can also make sure that we are still in a near-linear range of workload increase, so that further I/O growth does not lead to sudden latency spikes. The total latency is shown with the four components that make up the response time.
Below 700 KIOps, the response time remains at a low level. This indicates that the configured system has enough room for growth beyond the customer's peak workload data, which was collected by using RMF. Based on the modeled latency and response-time curve, the proposed storage system is suitable for a customer proposal.
But, again, we should go back to the utilization figures and make sure that they are all in the lower “green” range first. If that is not the case, expand and upgrade the configuration further. Both aspects, expected utilization levels and expected response times, need to fit; otherwise, the configuration must be upgraded.
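The decomposition of the total response time can be sketched as a simple sum of the four components. The numbers below are hypothetical, merely shaped like the 0.34 ms example discussed above:

```python
def total_response_time(components):
    """Sum the four z/OS I/O response-time components (in ms) and
    report which one dominates."""
    parts = ("iosq", "pending", "connect", "disconnect")
    total = sum(components[p] for p in parts)
    dominant = max(parts, key=lambda p: components[p])
    return total, dominant

# Hypothetical breakdown of a 0.34 ms total response time:
total, dominant = total_response_time(
    {"iosq": 0.0, "pending": 0.04, "connect": 0.2, "disconnect": 0.1}
)
# A dominant disconnect time would point at read misses, de-stage activity,
# or Metro Mirror waits; a dominant connect time at channel contention.
```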
 
Figure 6-8 Response time components chart for DS8950F
IBM Z Batch Network Analyzer (zBNA) tool
zBNA is a PC-based productivity tool designed to provide a means of estimating the elapsed time for batch jobs. It provides a powerful, graphic demonstration of the z/OS batch window.
The zBNA application accepts SMF record types 14, 15, 16, 30, 42, 70, 72, 74, 78, and 113 data to analyze a single batch window of user-defined length.
In case no zHyperLink is used yet, the zBNA tool can for instance be used to determine how big the zHyperLink ratio is expected to be, once zHyperLink is switched on. Find the tool and more background information at:
6.5 Easy Tier modeling
StorM supports DS8000 Easy Tier modeling for multitier configurations. The performance modeling is based on the workload skew level concept. The skew level describes how the I/O activity is distributed across the capacity for a specific workload. A workload with a low skew level has the I/O activity distributed evenly across the available capacity. A heavily skewed workload has many I/Os to only a small portion of the data. The workload skew affects the Easy Tier effectiveness, especially if there is a limited number of high-performance ranks. Heavily skewed workloads benefit most from the Easy Tier capabilities because moving even a small amount of data improves the overall performance. With lightly skewed workloads, Easy Tier is less effective because the I/O activity is spread across such a large amount of data that it cannot all be moved to a higher performance tier.
Lightly skewed workloads might already be handled well with a 100% high-capacity flash design, and in many cases this can even mean using Flash Tier 2 drives only. With higher skew and higher load, a mix of High-Performance Flash (Flash Tier 0) and High-Capacity Flash (then mostly Flash Tier 2) is usually chosen, and the question is which capacity ratio between the tiers is best.
There are three approaches to modeling Easy Tier on a DS8000:
Use one of the predefined skew levels.
Use an existing skew level based on the current workload on the current DS8000 (measuring on the host side).
Use heatmap data from a DS8000 (measuring on storage system side).
6.5.1 Predefined skew levels
StorM supports five predefined skew levels for use in Easy Tier predictions:
Skew level 2: Very low skew
Skew level 3.5: Low Skew
Skew level 7: Intermediate skew
Skew level 14: High skew
Skew level 24: Very high skew
StorM uses this setting to predict the number of I/Os that are serviced by the higher performance tier.
A skew level value of 1 means that the workload does not have any skew at all, meaning that the I/Os are distributed evenly across all ranks.
The skew level settings affect the modeling tool predictions. A heavy skew level selection results in a more aggressive sizing of the higher performance tier. A low skew level selection provides a conservative prediction. It is important to understand which skew level best matches the actual workload before you start the modeling.
6.5.2 Getting the exact skew
For IBM Z workloads, getting an exact skew level is easy when you have RMF measurements. Most Open Systems clients have skew levels between “intermediate” and “high”. IBM i clients, however, typically have a “very low” skew level. If you are modeling for IBM i and cannot retrieve an exact value, it is best to assume “very low” (2.0).
Your IBM support or Business Partner personnel have access to additional utilities that allow them to retrieve the exact skew directly from the DS8000, from the Easy Tier summary.
At the DS8000 Graphical User Interface, you can, at the top, export the Easy Tier Summary, as in Figure 6-9:
Figure 6-9 DS8900 GUI: Export Easy Tier Summary
The output ZIP that is created contains CSV file reports on the Easy Tier activity that can be used for additional planning and more accurate sizing. The skew curve is available in a format to help you differentiate the skew according to whether the skew is for small or large blocks, reads or writes, and IOPS or MBps, and it is described in more detail in IBM DS8000 Easy Tier, REDP-4667.
The comparable DSCLI command, which yields the same data, is shown in Example 6-1.
Example 6-1 DSCLI export of Easy Tier data
dscli> offloadfile -etdataCSV mp
Date/Time: 13 July 2021 14:51:55 CEST IBM DSCLI Version: 7.9.20.440 DS: IBM.2107-75HAL91
CMUC00428I offloadfile: The etdataCSV file has been offloaded to mpet_data_20210713145155.zip.
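To build an intuition for what a skew number means, the share of I/O landing on the hottest slice of capacity can be computed from per-extent activity data, such as the data found in the exported Easy Tier CSV reports. This is an illustration only; the exact StorM skew-factor definition differs (see IBM DS8000 Easy Tier, REDP-4667):

```python
def io_fraction_on_hottest(extent_iops, capacity_fraction):
    """Fraction of total I/O served by the hottest share of capacity.

    extent_iops       -- per-extent (or per-volume) I/O rates
    capacity_fraction -- e.g. 0.1 for the hottest 10% of the extents
    """
    ranked = sorted(extent_iops, reverse=True)
    n = max(1, round(len(ranked) * capacity_fraction))
    total = sum(ranked)
    return sum(ranked[:n]) / total if total else 0.0

# A heavily skewed workload: 4 extents, one of which takes 90% of the I/O.
# The hottest 25% of capacity then serves 90% of the I/O -- a strong case
# for a small high-performance tier. An evenly loaded workload would give
# back exactly the capacity fraction instead.
```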