Reliability, availability, and serviceability
From the Quality perspective, the z15 reliability, availability, and serviceability (RAS) design is driven by a set of high-level program RAS objectives. The IBM Z platform continues to drive toward Continuous Reliable Operation (CRO) at the single footprint level.
 
Note: Throughout this chapter, z15 refers to IBM z15 Model T01 (Machine Type 8561), unless otherwise specified.
The key objectives, in order of priority, are to ensure data integrity and computational integrity, to reduce or eliminate unscheduled outages, to reduce scheduled outages, to reduce planned outages, and to reduce the number of repair actions.
RAS can be accomplished with improved concurrent replace, repair, and upgrade functions for processors, memory, drawers, and I/O. RAS also extends to the nondisruptive capability for installing Licensed Internal Code (LIC) updates. In most cases, a capacity upgrade can be concurrent without a system outage. As an extension to the RAS capabilities, environmental controls are implemented in the system to help reduce power consumption and meet cooling requirements.
This chapter includes the following topics:
9.1 RAS strategy
The RAS strategy is to manage change by learning from previous generations and investing in new RAS function to eliminate or minimize all sources of outages. Enhancements to z14 RAS designs are implemented on the z15 system through the introduction of new technology, structure, and requirements. Continuous improvements in RAS are associated with new features and functions to ensure that IBM Z servers deliver exceptional value to clients.
The following overriding RAS requirements are principles as shown in Figure 9-1:
Include existing (or equivalent) RAS characteristics from previous generations.
Learn from current field issues and address the deficiencies.
Understand the trend in technology reliability (hard and soft) and ensure that the RAS design points are sufficiently robust.
Invest in RAS design enhancements (hardware and firmware) that provide IBM Z and client-valued differentiation.
Figure 9-1 Overriding RAS requirements
9.2 Technology
This section introduces some of the RAS features that are incorporated in the z15 design.
9.2.1 Processor Unit chip
The Processor Unit (PU) chip includes the following features:
The Single Chip Module (SCM) uses 14 nm SOI technology and consists of 17 layers of metal and over 9.1 billion transistors, with 12 cores per PU SCM running at 5.2 GHz, all of which enhances thermal conductivity and improves reliability.
L3:
 – Symbol ECC on L3 data cache
 – 256MB (double the size of L3 in z14)
 – Ability to monitor (dynamically) fenced macros
 – Dynamic cache monitor (“stepper”) to find and demote HSA lines in the cache
 – Wordline span reduced (less impact)
 – Dynamic uMasking for subarrays
L2: L2-I (Instruction) is now 4MB (double compared to z14 L2-I)
Ability to spare the PU core upon non-L2 cache/DIR Core Array delete.
Improved error thresholding on PU cores, which avoids continuous recovery.
Memory Control Unit (MCU) Cache Symbol ECC.
L1 and L1+; L2 protected by PU sparing.
PU Core mandatory address checking.
Redundant parity on error in RU bit to protect wordline (WL).
On-Chip Compression
The On-Chip Compression function, which is new to z15, is a major improvement over the zEDC Express cards. On-Chip Compression offers industry-leading, hardware-based acceleration for data compression with faster single-thread performance.
This On-Chip Compression capability replaces the zEDC Express adapter that is used on IBM z14 and earlier servers, and all data interchange remains compatible.
With zEDC cards, the throughput was 1 GBps per feature, with a maximum of 16 features.
With On-Chip Compression, the throughput is improved to 12 GBps per PU chip, which equates to 48 GBps per drawer and 240 GBps for a fully populated z15.
The On-Chip Compression can be virtualized between all LPARs on the z15, whereas the zEDC feature was limited to 15 LPARs.
The On-Chip Compression module implements the DEFLATE/gzip/zlib algorithms and works in a synchronous mode in problem state and in an asynchronous mode for larger operations under z/OS.
For more information about IBM Integrated Accelerator for z Enterprise Data Compression (zEDC - On-Chip Compression) on z15 servers, see Appendix C, “IBM Integrated Accelerator for zEnterprise Data Compression” on page 503.
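Because the accelerator implements the standard DEFLATE format, applications generally do not call it directly; they continue to use their existing zlib or gzip programming interfaces, and whether the work is offloaded to the On-Chip Compression unit depends on the operating system and runtime levels in use. The following minimal Java sketch uses only the standard java.util.zip interface; nothing in the code is z15-specific, and hardware acceleration (where available) is transparent to it.

import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

public class DeflateSample {
    public static void main(String[] args) {
        // Build a repetitive buffer so the compression ratio is easy to see.
        byte[] input = "z15 on-chip compression sample. ".repeat(4096)
                .getBytes(StandardCharsets.UTF_8);

        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();

        byte[] output = new byte[input.length];
        int compressedLength = deflater.deflate(output);   // standard DEFLATE output
        deflater.end();

        System.out.printf("in=%d bytes, out=%d bytes%n", input.length, compressedLength);
    }
}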
z15 processor memory and cache structure are shown in Figure 9-2.
Figure 9-2 Memory and Cache Structure
9.2.2 System Controller and main memory
The System Controller (SC) and main memory consist of the following features:
The Single Chip Module (SCM) uses 14 nm SOI technology and consists of 17 layers of metal and 9.7 billion transistors, all of which enhances thermal conductivity and improves reliability.
Reduced chip count (single SC chip) improves reliability with fewer components involved.
Reduced SMP cables (fewer SC chips) improves reliability and results in faster repair actions and hardware upgrade times because of fewer components to quiesce, test, and resume.
L4:
 – Symbol ECC on the cache data, directory, configuration array, and store protect key cache.
 – Ability to monitor (dynamically) fenced macros and allow integrated sparing.
 – Array recovery as dynamic array sparing, line deleting, and dynamic uMasking of subarrays.
Preemptive memory channel marking:
 – Analysis of uncorrectable errors considers pattern of prior correctable errors
 – More robust uncorrectable error handling
 – Simplified repair action
Improved resilience in KUE1 repair capability
Virtual Flash Memory (Flash Express replacement) solution is moved to DIMM:
 – Solution is moved to more robust storage Redundant Array of Independent Memory (RAIM) protected (same function that main memory uses)
 – Concurrent Drawer Repair (CDR) and concurrent drawer addition (CDA)
9.2.3 I/O and service
I/O and service consist of the following features:
PCIe+ Fanout Gen3
z15 has a new and improved PCIe+ Fanout Gen3 card with dual 16x ports to provide better availability.
Support partitions (PSPs) for managing native PCIe I/O:
 – Four partitions
 – Reduced effect on MCL updates
 – Better availability
Faster Dynamic Memory Relocation engine:
 – Enables faster reallocation of memory that is used for LPAR activations, CDR, and concurrent upgrade
 – Provides faster, more robust service actions
Dynamic Time Domain Reflectometry (TDR):
 – Hardware facility that is used to isolate failures on wires provides better FRU isolation and improved service actions.
Universal spare for PU SCMs, SC SCMs, and processor drawers.
z15 has a new and improved Fill and Drain tool.
9.3 Structure
The z15 server is built in a new form factor of one to four 19-inch frames. The z15 server can be delivered as an air-cooled or a water-cooled system and fulfills the requirements for an ASHRAE A3 environment.
The z15 server can have up to 11 PCIe+ I/O drawers when delivered with Bulk Power Assembly (BPA) and 12 PCIe+ I/O drawers when delivered with Power Distribution Unit (PDU). The structure changes to the z15 server are done with the following goals:
Enhanced system modularity
Standardization to enable rapid integration
Platform simplification
Cables are keyed to ensure that correct lengths are plugged. Plug detection ensures correct location, and custom latches ensure retention. Further improvements to the fabric bus include symmetric multiprocessing (SMP) cables that connect the drawers.
To improve field-replaceable unit (FRU) isolation, TDR techniques are applied to the SMP cables, between chips (PU-PU, and PU-SC), and between the PU chips and dual inline memory modules (DIMMs).
Enhancements to thermal RAS also were introduced, such as a field-replaceable water manifold for PU cooling. The z15 has the following characteristics:
Processing infrastructure is designed by using drawer technology.
Keyed cables and plugging detection.
SMP cables that are used for fabric bus connections.
Water manifold is a FRU.
Master-master redundant oscillator design in the main memory.
Processor and nest chips are separate FRUs.
Point of load cards are separate FRUs.
Two combined Flexible Service Processor (FSP) and Oscillator Cards (OSC) are provided per CPC drawer.
Built-in time domain reflectometry for FRU isolation of interface errors.
Redundant N+1 Power Supply Units (PSU) to CPC drawer and PCIe+ drawer.
9.4 Reducing complexity
z15 servers continue the z14 enhancements that reduced system RAS complexity. Specifically, simplifications were made in RAIM recovery in the memory subsystem design. Memory DIMMs are no longer cascaded, which eliminates the double FRU call for DIMM errors.
Independent channel recovery with replay buffers on all interfaces allows recovery of a single DIMM channel, while other channels remain active. Further redundancies are incorporated in I/O pins for clock lines to main memory, which eliminates the loss of memory clocks because of connector (pin) failure. The following RAS enhancements reduce service complexity:
Continued use of RAIM ECC.
No cascading of memory DIMM to simplify the recovery design.
Replay buffer for hardware retry on soft errors on the main memory interface.
Redundant I/O pins for clock lines to main memory.
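As a rough illustration of the replay-buffer idea described above, the following sketch keeps the most recent transfers for one channel so that a soft error can be retried on that channel alone. It is conceptual only and makes no assumptions about the actual hardware interfaces.

import java.util.ArrayDeque;
import java.util.Deque;

// Conceptual replay buffer: recent transfers on one memory channel are retained so a
// soft (CRC) error can be retried on that channel while the other channels stay active.
public class ReplayBufferSketch {
    public interface ChannelSink { void send(byte[] frame); }

    private final Deque<byte[]> recent = new ArrayDeque<>();
    private final int depth;

    public ReplayBufferSketch(int depth) { this.depth = depth; }

    // Record a transfer; keep only the last 'depth' entries.
    public void record(byte[] frame) {
        if (recent.size() == depth) recent.removeFirst();
        recent.addLast(frame.clone());
    }

    // On a soft error, re-drive the buffered transfers on this channel only.
    public void replay(ChannelSink channel) {
        for (byte[] frame : recent) channel.send(frame);
    }
}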
9.5 Reducing touches
IBM Z RAS efforts focus on the reduction of unscheduled, scheduled, planned, and unplanned outages. IBM Z technology has a long history of demonstrated RAS improvements, and this effort continues with changes that reduce service touches on the system.
Firmware was updated to improve filtering and resolution of errors that do not require action. Enhanced integrated sparing in processor cores, cache relocates, N+1 SEEPROM and POL N+2 redundancies, and DRAM marking also are incorporated to reduce touches. The following RAS enhancements reduce service touches:
Improved error resolution to enable filtering
Enhanced integrated sparing in processor cores
Cache relocates
N+1 SEEPROM
N+2 POL
DRAM marking
(Dynamic) Spare lanes for PU-SC, PU-PU, PU-mem, and SC-SMP fabric
N+1 radiator pumps, controllers, blowers, and sensors
N+1 Ethernet switches
N+1 Support Element (SE) (with N+1 SE power supplies)
Redundant SEEPROM on memory DIMM
Redundant temperature sensor (one SEEPROM and one temperature sensor per I2C bus)
FICON forward error correction
9.6 z15 availability characteristics
The following functions include availability characteristics on z15 servers:
Enhanced drawer availability (EDA)
EDA is a procedure under which a CPC drawer in a multidrawer system can be removed and reinstalled during an upgrade or repair action with no effect on the workload.
Concurrent memory upgrade or replacement
Memory can be upgraded concurrently by using Licensed Internal Code Configuration Control (LICCC) if physical memory is available on the drawers.
The EDA function can be useful if the physical memory cards must be changed in a multidrawer configuration (requiring the drawer to be removed).
It requires the availability of more memory resources on other drawers, or a reduction in the need for memory resources during this action. Select the flexible memory option to help ensure that the appropriate level of memory is available in a multiple-drawer configuration. This option provides more resources for the use of EDA when repairing a drawer or memory on a drawer, and when upgrading memory if larger memory cards are required.
Enhanced driver maintenance (EDM)
One of the greatest contributors to downtime during planned outages is LIC driver updates that are performed in support of new features and functions. z15 servers are designed to support the concurrent activation of a selected new driver level.
Plan Ahead for Balanced Power (FC 3003)
This feature allows you to order the maximum number of bulk power regulators (BPRs) on any server configuration. This feature helps to ensure that your configuration is in a balanced power environment if you intend to add CPC drawers and I/O drawers to your server in the future. The feature is available with the Bulk Power Assembly (BPA) only.
Concurrent fanout addition or replacement
A PCIe+ fanout card provides the path for data between memory and I/O through PCIe cables. With z15 servers, a hot-pluggable and concurrently upgradeable fanout card is available. Up to 12 PCIe+ fanout cards per CPC drawer are available for z15 servers. A z15 Model T01 feature Max190 holds five CPC drawers and can have 60 PCIe+ fanout slots.
Internal I/O paths from the CPC drawer fanout ports to a PCIe drawer or an I/O drawer are spread across multiple CPC drawers (for features Max71, Max108, Max145, and Max190) and across different nodes within a single CPC drawer (feature Max34). During an outage, a fanout card that is used for I/O can be repaired concurrently while redundant I/O interconnect ensures that no I/O connectivity is lost.
Redundant I/O interconnect
Redundant I/O interconnect helps maintain critical connections to devices. z15 servers allow a single drawer, in a multidrawer system, to be removed and reinstalled concurrently during an upgrade or repair. Connectivity to the system I/O resources is maintained through a second path from a different drawer.
Flexible Service Processor (FSP) / Oscillator Cards (OSC)
z15 servers have two combined Flexible Service Processor (FSP) and Oscillator Cards (OSC) per CPC drawer. The strategy of redundant clock and switchover stays the same: one primary and one backup are available. If the primary OSC fails, the backup detects the failure, takes over transparently, and continues to provide the clock signal to the CPC.
Processor unit (PU) sparing
z15 servers have two spare PUs per CPU drawer to maintain performance levels if an active PU, Internal Coupling Facility (ICF), Integrated Facility for Linux (IFL), IBM Z Integrated Information Processor (zIIP), integrated firmware processor (IFP), or system assist processor (SAP) fails. Transparent sparing for failed processors is supported and sparing is supported across the drawers in the unlikely event that the drawer with the failure does not have spares available.
Application preservation
This function is used when a PU fails and no spares are available. The state of the failing PU is passed to another active PU, where the operating system uses it to successfully resume the task, in most cases without client intervention.
Cooling improvements
The z15 air-cooled configuration includes a newly designed front-to-rear radiator cooling system. The radiator pumps, blowers, controls, and sensors are N+2 redundant. In normal operation, one active pump supports the system. A second pump is turned on and the original pump is turned off periodically, which improves the reliability of the pumps. The replacement of pumps or blowers is concurrent, with no effect on performance.
A water-cooling system also is an option in z15 servers, with water-cooling unit (WCU) technology. Two redundant WCUs run with two independent chilled water feeds. One WCU and one water feed can support the entire system load. The water-cooled configuration is backed up by the rear door heat exchangers in the rare event of a problem with the chilled water facilities of the customer.
FICON Express16SA / FICON Express16S+ with Forward Error Correction (FEC)
FICON Express16SA and FICON Express16S+ features continue to provide a new standard for transmitting data over 16 Gbps links by using 64b/66b encoding. The new standard, which is defined by T11.org FC-FS-3, is more efficient than the earlier 8b/10b encoding.
FICON Express16SA and FICON Express16S+ channels that are running at 16 Gbps can take advantage of FEC capabilities when connected to devices that support FEC.
FEC allows FICON Express16SA and FICON Express16S+ channels to operate at higher speeds, over longer distances, with reduced power and higher throughput. They also retain the same reliability and robustness for which FICON channels are traditionally known.
FEC is a technique that is used for controlling errors in data transmission over unreliable or noisy communication channels. When running at 16 Gbps link speeds, clients often see fewer I/O errors, which reduces the potential effect to production workloads from those I/O errors.
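As a toy illustration of the forward error correction principle only (the FICON implementation uses a much stronger code that is defined by the Fibre Channel standards), the following sketch encodes 4 data bits into a (7,4) Hamming codeword and corrects a single flipped bit.

// Toy (7,4) Hamming code: corrects any single flipped bit in a 7-bit codeword.
public class HammingSketch {
    // codeword[1..7]; index 0 unused. Data bits d[0..3] go to positions 3, 5, 6, 7.
    static int[] encode(int[] d) {
        int[] c = new int[8];
        c[3] = d[0]; c[5] = d[1]; c[6] = d[2]; c[7] = d[3];
        c[1] = c[3] ^ c[5] ^ c[7];   // parity over positions 1, 3, 5, 7
        c[2] = c[3] ^ c[6] ^ c[7];   // parity over positions 2, 3, 6, 7
        c[4] = c[5] ^ c[6] ^ c[7];   // parity over positions 4, 5, 6, 7
        return c;
    }

    // Recompute the parities; a nonzero syndrome is the position of the flipped bit.
    static int[] correct(int[] c) {
        int s = (c[1] ^ c[3] ^ c[5] ^ c[7])
              | (c[2] ^ c[3] ^ c[6] ^ c[7]) << 1
              | (c[4] ^ c[5] ^ c[6] ^ c[7]) << 2;
        if (s != 0) c[s] ^= 1;       // correct the single-bit error in place
        return c;
    }

    public static void main(String[] args) {
        int[] sent = encode(new int[] {1, 0, 1, 1});
        sent[6] ^= 1;                // simulate a transmission error on one bit
        int[] fixed = correct(sent.clone());
        System.out.println(java.util.Arrays.toString(fixed));
    }
}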
Read Diagnostic Parameters (RDP) improve fault isolation. After a link error is detected (for example, an IFCC, CC3, reset event, or a link incident report), link data that is returned by Read Diagnostic Parameters is used to differentiate between errors that result from failures in the optics and failures that are caused by dirty or faulty links. RDP returns the following information:
 – Transmit (Tx) and Receive (Rx) optic power levels from the PCHID, Switch Input and Output, and I/O device
 – Capable and Operating speed between the devices
 – Error counts
 – Operating System requires new function APAR OA49089
Figure 9-3 Read Diagnostic Parameters function
The new IBM Z Channel Subsystem Function performs periodic polling from the channel to the end points for the logical paths that are established, which reduces the number of unnecessary repair actions (RAs).
The RDP data history is used to validate Predictive Failure Algorithms and identify Fibre Channel Links with degrading signal strength before errors start to occur. The new Fibre Channel Extended Link Service (ELS) retrieves signal strength.
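The following sketch is a rough illustration of how RDP-style readings (transmit and receive optical power at both ends of a link) can separate a degraded optic from a dirty or faulty link. The thresholds and field names are illustrative assumptions, not actual channel firmware values.

public class RdpSketch {
    record OpticReadings(double txPowerDbm, double rxPowerDbm) {}

    // Hypothetical minimum launch power and maximum acceptable link loss, in dB.
    static final double MIN_TX_DBM = -5.0;
    static final double MAX_LINK_LOSS_DB = 3.0;

    static String assess(OpticReadings nearEnd, OpticReadings farEnd) {
        if (nearEnd.txPowerDbm() < MIN_TX_DBM) {
            return "Near-end transmit optic degraded: replace the optic/FRU";
        }
        double linkLossDb = nearEnd.txPowerDbm() - farEnd.rxPowerDbm();
        if (linkLossDb > MAX_LINK_LOSS_DB) {
            return "Excessive link loss: inspect or clean the cabling and connectors";
        }
        return "Optics and link within limits: look elsewhere for the error";
    }

    public static void main(String[] args) {
        System.out.println(assess(new OpticReadings(-1.5, -2.0),
                                  new OpticReadings(-1.0, -6.5)));
    }
}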
FICON Dynamic Routing
FICON Dynamic Routing (FIDR) enables the use of storage area network (SAN) dynamic routing policies in the fabric. With the z15 server, FICON channels are no longer restricted to the use of static routing policies for inter-switch links (ISLs) for cascaded FICON directors.
FICON Dynamic Routing dynamically changes the routing between the channel and control unit based on the Fibre Channel Exchange ID. Each I/O operation has a unique exchange ID. FIDR is designed to support static SAN routing policies and dynamic routing policies.
FICON Dynamic Routing can help clients reduce costs by providing the following features:
 – Share SANs between their FICON and FCP traffic.
 – Improve performance because SAN dynamic routing policies better use all the available ISL bandwidth through higher use of the ISLs.
 – Simplify management of their SAN fabrics by eliminating the use of static routing policies, which assign different ISL routes with each power-on reset (POR) and make SAN fabric performance difficult to predict.
Clients must ensure that all devices in their FICON SAN support FICON Dynamic Routing before they implement this feature.
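A conceptual sketch of the exchange-based routing that FIDR relies on is shown below: each I/O operation carries a unique Fibre Channel Exchange ID, and the fabric selects an ISL per exchange so that traffic spreads across all available ISLs. The selection function here is a simple illustrative hash, not an actual switch implementation.

import java.util.List;

public class ExchangeRoutingSketch {
    // Pick an ISL for one exchange; a different exchange can land on a different ISL.
    static String selectIsl(int exchangeId, List<String> availableIsls) {
        int index = Math.floorMod(Integer.hashCode(exchangeId), availableIsls.size());
        return availableIsls.get(index);
    }

    public static void main(String[] args) {
        List<String> isls = List.of("ISL-1", "ISL-2", "ISL-3", "ISL-4");
        for (int exchangeId : new int[] {0x1A01, 0x1A02, 0x1A03, 0x1A04}) {
            System.out.printf("exchange %04X -> %s%n", exchangeId, selectIsl(exchangeId, isls));
        }
    }
}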
9.7 z15 RAS functions
Hardware RAS function improvements focus on addressing all sources of outages. Sources of outages fall into the following classifications:
Unscheduled
This outage occurs because of an unrecoverable malfunction in a hardware component of the system.
Scheduled
This outage is caused by changes or updates that must be done to the system in a timely fashion. A scheduled outage can be caused by a disruptive patch that must be installed, or other changes that must be made to the system.
Planned
This outage is caused by changes or updates that must be done to the system. A planned outage can be caused by a capacity upgrade or a driver upgrade. A planned outage usually is requested by the client, and often requires pre-planning. The z15 design phase focuses on enhancing planning to simplify or eliminate planned outages.
The difference between scheduled outages and planned outages might not be obvious. Generally, a scheduled outage is one that must occur soon, within approximately two weeks. Planned outages are planned well in advance and go beyond this approximate two-week time frame. This chapter does not distinguish between scheduled and planned outages.
Preventing unscheduled, scheduled, and planned outages was addressed by the IBM Z system design for many years.
z15 servers have a fixed-size HSA of 256 GB (up from 192 GB on z14). This size helps eliminate pre-planning requirements for HSA and provides the flexibility to dynamically update the configuration. You can perform the following tasks dynamically:2
Add a logical partition (LPAR)
Add a logical channel subsystem (LCSS)
Add a subchannel set
Add a logical PU to an LPAR
Add a cryptographic coprocessor
Remove a cryptographic coprocessor
Enable I/O connections
Swap processor types
Add memory
Add a physical processor
By addressing the elimination of planned outages, the following tasks also are possible:
Concurrent driver upgrades
Concurrent and flexible customer-initiated upgrades
For more information about the flexible upgrades that are started by clients, see 8.2.2, “Customer Initiated Upgrade facility” on page 340.
STP management of concurrent CTN Split and Merge
Dynamic I/O for stand-alone CF CPCs
Dynamic I/O configuration changes can be made to a stand-alone CF without requiring a disruptive power on reset. An LPAR with a firmware-based appliance version of an HCD instance is used to apply the new I/O configuration changes. The firmware-based LPAR is driven by updates from an HCD instance that is running in a z/OS LPAR on a different CPC that is connected to the same z15 HMC.
System Recovery Boost
System Recovery Boost, which is new on z15, introduces the possibility of reducing the downtime from an operating systems perspective in both scheduled and unscheduled events on a partition basis. This reduction in downtime is achieved by delivering more CP capacity for a boost period before a scheduled shutdown and following a restart.
For more information about System Recovery Boost, see Appendix B, “System Recovery Boost” on page 491.
9.7.1 Scheduled outages
Concurrent hardware upgrades, parts replacement, driver upgrades, and firmware fixes that are available with z15 servers all address the elimination of scheduled outages. Also, the following indicators and functions that address scheduled outages are included:
Double memory data bus lane sparing.
This feature reduces the number of repair actions for memory.
Single memory clock sparing.
Double DRAM chipkill tolerance.
Field repair of the cache fabric bus.
Processor drawer power distribution N+2 design.
The CPC drawer uses point of load (POL) cards in a highly redundant N+2 configuration. POL regulators are daughter cards that contain the voltage regulators for the principal logic voltage boundaries in the z15 CPC drawer. They plug onto the CPC drawer system board and are nonconcurrent FRUs for the affected drawer, similar to the memory DIMMs. If EDA can be used, the replacement of POL cards is concurrent for the Z server as a whole.
Redundant (N+1) Ethernet switches.
Redundant (N+2) humidity sensors.
Redundant (N+2) altimeter sensors.
Redundant (N+2) ambient temperature sensors.
Dual inline memory module (DIMM) field-replaceable unit (FRU) indicators.
These indicators imply that a memory module is not error-free and might fail sometime in the future. This indicator gives IBM a warning and provides time to concurrently repair the storage module if the z15 is a multidrawer system.
The process to repair the storage module is to isolate or “fence off” the drawer, remove the drawer, replace the failing storage module, and then add the drawer. The flexible memory option might be necessary to maintain sufficient capacity while repairing the storage module.
Single processor core checkstop and sparing.
This indicator shows that a processor core malfunctioned and is spared. IBM determines what to do based on the system and the history of that system.
Point-to-point fabric for symmetric multiprocessing (SMP).
Having fewer components that can fail is an advantage. In a multidrawer system, all of the drawers are connected by point-to-point connections. A drawer can always be added concurrently.
Air-cooled system: radiator with redundant (N+2) pumps.
z15 servers implement true N+2 redundancy on pumps and blowers for the radiator.
One radiator unit is installed in frame A and one radiator unit in frame B (configuration-dependent). The radiator cooling system can support up to three CPC drawers simultaneously with a redundant design that consists of two pumps and two blowers.
The replacement of a pump or blower causes no performance effect.
Water-cooled system: N+1 Water-Cooling Units (WCUs).
A water-cooling system is an option in z15 servers with WCU technology. Two redundant WCUs run with two independent chilled water feeds for each frame A and B (if B is installed). One WCU and one water feed can support the entire system load. The water-cooled configuration is backed up by the rear door heat exchangers in the rare event of a problem with the chilled water facilities of the customer.
The PCIe+ I/O drawer is available for z15 servers. It and all of the PCIe+ I/O drawer-supported features can be installed concurrently.
Memory interface logic to maintain channel synchronization when one channel goes into replay. z15 servers can isolate recovery to only the failing channel.
Out-of-band access to DIMM (for background maintenance functions).
Out-of-band access (by using an I2C interface) allows maintenance (such as logging) without disrupting customer memory accesses.
Lane shadowing function in which each lane is periodically taken offline (for recalibration).
The (logical) spare bit lane is rotated through the (physical) lanes. This configuration allows the lane to be tested and recalibrated transparently to customer operations.
Automatic lane recalibration on offline lanes on the main memory interface. Hardware support for transparent recalibration is included.
Automatic dynamic lane sparing based on pre-programmed CRC thresholds on the main memory interface. Hardware support to detect a defective lane and spare it out is included.
Improved DIMM exerciser for testing memory during IML.
PCIe redrive hub cards plug straight in (no blind mating of the connector), which simplifies plugging and improves reliability.
ICA (short distance) coupling cards plug straight in (no blind mating of the connector), which simplifies plugging and improves reliability.
Coupling Express LR (CE LR) coupling cards plug into the PCIe+ I/O drawer, which allows more connections with a faster bandwidth.
Hardware-driven dynamic lane sparing on fabric (SMP) buses. Increased bit lane sparing is featured.
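The dynamic lane sparing items listed above follow a simple pattern: monitor errors per lane and substitute the spare lane when a pre-programmed threshold is crossed. The following sketch shows that pattern only; the threshold value and the structure are illustrative, not the actual z15 firmware design.

public class LaneSparingSketch {
    private final int[] crcErrors;
    private final boolean[] spared;
    private final int threshold;

    public LaneSparingSketch(int lanes, int threshold) {
        this.crcErrors = new int[lanes];
        this.spared = new boolean[lanes];
        this.threshold = threshold;
    }

    // Called by the (hypothetical) link layer whenever a CRC error is seen on a lane.
    public void reportCrcError(int lane) {
        if (spared[lane]) return;
        if (++crcErrors[lane] >= threshold) {
            spared[lane] = true;
            System.out.printf("lane %d exceeded %d CRC errors: rerouting to spare lane%n",
                              lane, threshold);
        }
    }
}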
9.7.2 Unscheduled outages
An unscheduled outage occurs because of an unrecoverable malfunction in a hardware component of the system.
The following improvements can minimize unscheduled outages:
Continued focus on firmware quality
For LIC and hardware design, failures are eliminated through rigorous design rules; design walk-through; peer reviews; element, subsystem, and system simulation; and extensive engineering and manufacturing testing.
Memory subsystem
RAIM on Z servers is a concept similar to the concept of Redundant Array of Independent Disks (RAID). The RAIM design detects and recovers from dynamic random access memory (DRAM), socket, memory channel, or DIMM failures. The RAIM design requires the addition of one memory channel that is dedicated to RAS.
The parity of the four data DIMMs is stored in the DIMMs that are attached to the fifth memory channel. Any failure in a memory component can be detected and corrected dynamically. z15 servers inherited this memory architecture.
The memory system on z15 servers is implemented with an enhanced version of the Reed-Solomon ECC that is known as 90B/64B. It provides protection against memory channel and DIMM failures.
A precise marking of faulty chips helps ensure timely DIMM replacements. The design of the z15 server further improved this chip marking technology. Graduated DRAM marking and channel marking are available, and scrubbing calls for a replacement on the third DRAM failure. For more information about the memory system on z15 servers, see 2.5, “Memory” on page 56.
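Conceptually, RAIM stripes data across the memory channels and uses the fifth channel to reconstruct a failed channel, much like RAID does across disks. The sketch below shows that idea with simple XOR parity; the real z15 design uses Reed-Solomon symbol ECC (90B/64B), not XOR.

public class RaimSketch {
    // Parity channel contents = XOR of the four data channels.
    static byte[] parity(byte[][] dataChannels) {
        byte[] p = new byte[dataChannels[0].length];
        for (byte[] channel : dataChannels)
            for (int i = 0; i < p.length; i++) p[i] ^= channel[i];
        return p;
    }

    // Rebuild the contents of a failed channel from the surviving channels and parity.
    static byte[] rebuild(byte[][] survivingChannels, byte[] parity) {
        byte[] rebuilt = parity.clone();
        for (byte[] channel : survivingChannels)
            for (int i = 0; i < rebuilt.length; i++) rebuilt[i] ^= channel[i];
        return rebuilt;
    }
}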
Soft-switch firmware
z15 servers are equipped with the capabilities of soft-switching firmware. Enhanced logic in this function ensures that every affected circuit is powered off during the soft-switching of firmware components. For example, when you are upgrading the microcode of a FICON feature, enhancements are implemented to avoid any unwanted side effects that were detected on previous systems.
Server Time Protocol (STP) recovery enhancement
When PCIe-based Integrated Coupling Adapter (ICA) Short Reach (SR) links are used, an unambiguous “going away signal” is sent when the server on which the coupling link is running is about to enter a failed (check stopped) state.
When the “going away signal” that is sent by the Current Time Server (CTS) in an STP-only Coordinated Timing Network (CTN) is received by the Backup Time Server (BTS), the BTS can safely take over as the CTS without relying on the previous Offline Signal (OLS) in a two-server CTN, or as the Arbiter in a CTN with three or more servers.
Enhanced Console Assisted Recovery (ECAR) was new with z13s and z13 GA2 and carried forward to z14 and z15. It contains better recovery algorithms during a failing Primary Time Server (PTS) and uses communication over the HMC/SE network to assist with BTS takeover. For more information, see Chapter 10, “Hardware Management Console and Support Element” on page 409.
Coupling Express LR does not support the “going away signal”; however, ECAR can be used to assist with recovery.
Design of pervasive infrastructure controls in processor chips and memory ASICs.
Improved error checking in the processor recovery unit (RU) to better protect against word line failures in the RU arrays.
9.8 z15 enhanced drawer availability
Enhanced drawer availability (EDA) is a procedure in which a drawer in a multidrawer system can be removed and reinstalled during an upgrade or repair action. This procedure has no effect on the running workload.
The EDA procedure and careful planning help ensure that all the resources are still available to run critical applications in an (n-1) drawer configuration. This process allows you to avoid planned outages. Consider the flexible memory option to provide more memory resources when you are replacing a drawer. For more information about flexible memory, see 2.5.7, “Flexible Memory Option” on page 65.
To minimize the effect on current workloads, ensure that sufficient inactive physical resources exist on the remaining drawers to complete a drawer removal. Also, consider deactivating non-critical system images, such as test or development LPARs. After you stop these non-critical LPARs and free their resources, you might find sufficient inactive resources to contain critical workloads while completing a drawer replacement.
9.8.1 EDA planning considerations
To use the EDA function, configure enough physical memory and engines so that the loss of a single drawer does not result in any degradation to critical workloads during the following occurrences:
A degraded restart in the rare event of a drawer failure
A drawer replacement for repair or a physical memory upgrade
The following configurations especially enable the use of the EDA function. These z15 features need enough spare capacity to cover the resources of a fenced (isolated) drawer. This requirement imposes the following limits on the number of client-owned PUs that can be activated when one drawer within a model is fenced:
A maximum of 34 client PUs are configured on the Max34.
A maximum of 71 client PUs are configured on the Max71.
A maximum of 108 client PUs are configured on the Max108.
A maximum of 145 client PUs are configured on the Max145.
A maximum of 190 client PUs are configured on the Max190.
No special feature codes are required for PU and model configuration.
Features Max34 to Max145 have four SAPs in each drawer. Max190 has a total of 22 standard SAPs.
The flexible memory option delivers physical memory so that 100% of the purchased memory increment can be activated even when one drawer is fenced.
The system configuration must have sufficient dormant resources on the remaining drawers in the system for the evacuation of the drawer that is to be replaced or upgraded. Dormant resources include the following possibilities:
Unused PUs or memory that is not enabled by LICCC
Inactive resources that are enabled by LICCC (memory that is not being used by any activated LPARs)
Memory that is purchased with the flexible memory option
Extra drawers
The I/O connectivity must also support drawer removal. Most of the paths to the I/O have redundant I/O interconnect support in the I/O infrastructure (drawers), which enables connections through multiple fanout cards.
If sufficient resources are not present on the remaining drawers, certain non-critical LPARs might need to be deactivated. One or more PUs or storage might need to be configured offline to reach the required level of available resources. Plan to address these possibilities to help reduce operational errors.
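A back-of-the-envelope feasibility check for EDA planning can be expressed as follows: the remaining drawers must have enough inactive (installed but unused) PUs and memory to absorb the active resources of the drawer to be removed. The numbers and structure below are illustrative; the authoritative check is the Prepare for Enhanced Drawer Availability task that is described in 9.8.2.

import java.util.List;

public class EdaPlanningSketch {
    record Drawer(int installedPus, int activePus, long installedMemGb, long activeMemGb) {}

    // True if the remaining drawers can absorb the active PUs and memory of the target drawer.
    static boolean canRemove(Drawer target, List<Drawer> remaining) {
        int sparePus = remaining.stream()
                .mapToInt(d -> d.installedPus() - d.activePus()).sum();
        long spareMemGb = remaining.stream()
                .mapToLong(d -> d.installedMemGb() - d.activeMemGb()).sum();
        return sparePus >= target.activePus() && spareMemGb >= target.activeMemGb();
    }

    public static void main(String[] args) {
        Drawer target = new Drawer(41, 30, 8192, 6144);
        List<Drawer> remaining = List.of(new Drawer(41, 20, 8192, 4096),
                                         new Drawer(41, 25, 8192, 5120));
        System.out.println("EDA possible without freeing resources: "
                + canRemove(target, remaining));
    }
}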
 
Exception: Single-drawer systems cannot use the EDA procedure.
Include the planning as part of the initial installation and any follow-on upgrade that modifies the operating environment. A client can use the Resource Link machine information report to determine the number of drawers, active PUs, memory configuration, and channel layout.
If the z15 server is installed, click Prepare for Enhanced Drawer Availability in the Perform Model Conversion window of the EDA process on the Hardware Management Console (HMC). This task helps you determine the resources that are required to support the removal of a drawer with acceptable degradation to the operating system images.
The EDA process determines which resources, including memory, PUs, and I/O paths, are free to allow for the removal of a drawer. You can run this preparation on each drawer to determine which resource changes are necessary. Use the results as input in the planning stage to help identify critical resources.
With this planning information, you can examine the LPAR configuration and workload priorities to determine how resources might be reduced and still allow the drawer to be concurrently removed.
Include the following tasks in the planning process:
Review of the z15 configuration to determine the following values:
 – Number of drawers that are installed and the number of PUs enabled. Consider the following points:
 • Use the Resource Link machine information or the HMC to determine the model, number, and types of PUs (CPs, IFLs, ICFs, and zIIPs).
 • Determine the amount of memory (physically installed and LICCC-enabled).
 • Work with your IBM Service Support Representative (IBM SSR) to determine the memory card size in each drawer. The memory card sizes and the number of cards that are installed for each drawer can be viewed from the SE under the CPC configuration task list. Use the View Hardware Configuration option.
 – ICA SR fanout layouts and ICA to ICA connections.
Use the Resource Link machine information to review the channel configuration. This process is a normal part of the I/O connectivity planning. The alternative paths must be separated as far into the system as possible.
Review the system image configurations to determine the resources for each image.
Determine the importance and relative priority of each LPAR.
Identify the LPAR or workloads and the actions to be taken:
 – Deactivate the entire LPAR.
 – Configure PUs.
 – Reconfigure memory, which might require the use of reconfigurable storage unit (RSU) values.
 – Vary off the channels.
Review the channel layout and determine whether any changes are necessary to address single paths.
Develop a plan to address the requirements.
When you perform the review, document the resources that can be made available if the EDA is used. The resources on the drawers are allocated during a POR of the system and can change after that process. Perform a review when changes are made to z15 servers, such as adding drawers, PUs, memory, or channels. Also, perform a review when workloads are added or removed, or if the HiperDispatch feature was enabled and disabled since the last time you performed a POR.
9.8.2 Enhanced drawer availability processing
To use the EDA, first ensure that the following conditions are met:
Free the used processors (PUs) on the drawer that is removed.
Free the used memory on the drawer.
For all I/O domains that are connected to the drawer, ensure that alternative paths exist. Otherwise, place the I/O paths offline.
For the EDA process, this phase is the preparation phase. It is started from the SE, directly or on the HMC, by using the Single object operation option on the Perform Model Conversion window from the CPC configuration task list, as shown in Figure 9-4.
Figure 9-4 Clicking Prepare for Enhanced Drawer Availability option
Processor availability
Processor resource availability for reallocation or deactivation is affected by the type and quantity of the resources in use, such as:
Total number of PUs that are enabled through LICCC
PU definitions in the image profiles (dedicated, dedicated reserved, or shared)
Active LPARs with dedicated resources at the time of the drawer repair or replacement
To maximize the PU availability option, ensure that sufficient inactive physical resources are on the remaining drawers to complete a drawer removal.
Memory availability
Memory resource availability for reallocation or deactivation depends on the following factors:
Physically installed memory
Image profile memory allocations
Amount of memory that is enabled through LICCC
Flexible memory option
Virtual Flash Memory if enabled and configured
Fanout card to I/O connectivity requirements
The optimum approach is to maintain maximum I/O connectivity during drawer removal. The redundant I/O interconnect (RII) function provides for redundant connectivity to all installed I/O domains in the PCIe+ I/O drawers.
Preparing for enhanced drawer availability
The Prepare Concurrent Drawer replacement option validates that enough dormant resources are available for this operation. If enough resources are not available on the remaining drawers to complete the EDA process, the process identifies those resources. It then guides you through a series of steps to select and free up those resources. The preparation process does not complete until all processors, memory, and I/O conditions are successfully resolved.
 
Preparation: The preparation step does not reallocate any resources. It is used only to record client choices and produce a configuration file on the SE that is used to run the concurrent drawer replacement operation.
The preparation step can be done in advance. However, if any changes to the configuration occur between the preparation and the physical removal of the drawer, you must rerun the preparation phase.
The process can be run multiple times because it does not move any resources. To view the results of the last preparation operation, click Display Previous Prepare Enhanced Drawer Availability Results from the Perform Model Conversion window in the SE.
The preparation step can be run without performing a drawer replacement. You can use it to dynamically adjust the operational configuration for drawer repair or replacement before IBM SSR activity. The Perform Model Conversion window in which you click Prepare for Enhanced Drawer Availability is shown in Figure 9-4 on page 397.
After you click Prepare for Enhanced Drawer Availability, the Enhanced Drawer Availability window opens. Select the drawer that is to be repaired or upgraded; then, select OK, as shown in Figure 9-5. Only one target drawer can be selected at a time.
Figure 9-5 Selecting the target drawer
The system verifies the resources that are required for the removal, determines the required actions, and presents the results for review. Depending on the configuration, the task can take from a few seconds to several minutes.
The preparation step determines the readiness of the system for the removal of the targeted drawer. The configured processors and the memory in the selected drawer are evaluated against unused resources that are available across the remaining drawers. The system also analyzes I/O connections that are associated with the removal of the targeted drawer for any single path I/O connectivity.
If insufficient resources are available, the system identifies the conflicts so that you can free other resources.
The following states can result from the preparation step:
The system is ready to run the EDA for the targeted drawer with the original configuration.
The system is not ready to run the EDA because of conditions that are indicated by the preparation step.
The system is ready to run the EDA for the targeted drawer. However, to continue with the process, processors are reassigned from the original configuration.
Review the results of this reassignment relative to your operation and business requirements. The reassignments can be changed on the final window that is presented. However, before making any changes or approving reassignments, ensure that the changes are reviewed and approved by the correct level of support based on your organization’s business requirements.
Preparation tabs
The results of the preparation are presented for review in a tabbed format. Each tab indicates conditions that prevent the EDA option from being run. The following tab selections are available:
Processors
Memory
Single I/O
Single Domain I/O
Single Alternate Path I/O
Only the tabs that feature conditions that prevent the drawer from being removed are displayed. Each tab indicates the specific conditions and possible options to correct them.
For example, the preparation identifies single I/O paths that are associated with the removal of the selected drawer. These paths must be varied offline to perform the drawer removal. After you address the condition, rerun the preparation step to ensure that all the required conditions are met.
Preparing the system to perform enhanced drawer availability
During the preparation, the system determines the PU configuration that is required to remove the drawer. The results and the option to change the assignment on non-dedicated processors are shown in Figure 9-6.
Figure 9-6 Reassign Non-Dedicated Processors results
 
Important: Consider the results of these changes relative to the operational environment. Understand the potential effect of making such operational changes. Changes to the PU assignment, although technically correct, can result in constraints for critical system images. In certain cases, the solution might be to defer the reassignments to another time that has less effect on the production system images.
After you review the reassignment results and make any necessary adjustments, click OK.
The final results of the reassignment, which include the changes that are made as a result of the review, are displayed (see Figure 9-7). These results are the assignments when the drawer removal phase of the EDA is completed.
Figure 9-7 Reassign Non-Dedicated Processors, message ACT37294
Summary of the drawer removal process steps
To remove a drawer, the following resources must be moved to the remaining active drawers:
PUs: Enough PUs must be available on the remaining active drawers, including all types of PUs that can be characterized (CPs, IFLs, ICFs, zIIPs, SAPs, and IFP).
Memory: Enough installed memory must be available on the remaining active drawers.
I/O connectivity: Alternative paths to other drawers must be available on the remaining active drawers, or the I/O path must be taken offline.
By understanding the system configuration and the LPAR allocation for memory, PUs, and I/O, you can make the best decision about how to free the necessary resources to allow for drawer removal.
Complete the following steps to concurrently replace a drawer:
1. Run the preparation task to determine the necessary resources.
2. Review the results.
3. Determine the actions to perform to meet the required conditions for EDA.
4. When you are ready to remove the drawer, free the resources that are indicated in the preparation steps.
5. Repeat the step that is shown in Figure 9-4 on page 397 to ensure that the required conditions are all satisfied.
6. Upon successful completion, the system is ready for the removal of the drawer.
The preparation process can be run multiple times to ensure that all conditions are met. It does not reallocate any resources; instead, it produces only a report. The resources are not reallocated until the Perform Drawer Removal process is started.
Rules during EDA
During EDA, the following rules are enforced:
Processor rules
All processors in any remaining drawers are available to be used during EDA. This requirement includes the two spare PUs or any available PU that is non-LICCC.
The EDA process also allows conversion of one PU type to another PU type. One example is converting a zIIP to a CP during the EDA function. The preparation for the concurrent drawer replacement task indicates whether any SAPs must be moved to the remaining drawers.
Memory rules
All physical memory that is installed in the system, including flexible memory, is available during the EDA function. Any physical installed memory, whether purchased or not, is available to be used by the EDA function.
Single I/O rules
Alternative paths to other drawers must be available, or the I/O path must be taken offline.
Review the results. The result of the preparation task is a list of resources that must be made available before the drawer replacement can occur.
Freeing any resources
At this stage, create a plan to free these resources. The following resources and actions are necessary to free them:
Freeing any PUs:
 – Vary off the PUs by using the Perform a Model Conversion window, which reduces the number of PUs in the shared PU pool.
 – Deactivate the LPARs.
Freeing memory:
 – Deactivate an LPAR.
 – Vary offline a portion of the reserved (online) memory. For example, in z/OS, run the following command:
CONFIG STOR(E=1),<OFFLINE/ONLINE>
This command enables a storage element to be taken offline. The size of the storage element depends on the RSU value. In z/OS, the following command configures offline smaller amounts of storage than the amount that was set for the storage element:
CONFIG STOR(nnM),<OFFLINE/ONLINE>
 – A combination of both LPAR deactivation and varying memory offline.
 
Reserved storage: If you plan to use the EDA function with z/OS LPARs, set up reserved storage and an RSU value. Use the RSU value to specify the number of storage units that are to be kept free of long-term fixed storage allocations. This configuration allows for storage elements to be varied offline.
9.9 z15 Enhanced Driver Maintenance
EDM is one more step toward reducing the necessity for and the duration of a scheduled outage. One of the components to planned outages is LIC Driver updates that are run in support of new features and functions.
When correctly configured, z15 servers support concurrently activating a selected new LIC Driver level. Concurrent activation of the selected new LIC Driver level is supported only at specific released sync points. Concurrently activating a selected new LIC Driver level anywhere in the maintenance stream is not possible. Certain LIC updates do not allow a concurrent update or upgrade.
Consider the following key points about EDM:
The HMC can query whether a system is ready for a concurrent driver upgrade.
Previous firmware updates, which require an initial machine load (IML) of the z15 system to be activated, can block the ability to run a concurrent driver upgrade.
An icon on the SE allows you or your IBM SSR to define the concurrent driver upgrade sync point to be used for an EDM.
The ability to concurrently install and activate a driver can eliminate or reduce a planned outage.
z15 servers introduce Concurrent Driver Upgrade (CDU) cloning support to other CPCs for CDU preinstallation and activation.
Concurrent crossover from Driver level N to Driver level N+1, then to Driver level N+2, must be done serially. No composite moves are allowed.
Disruptive upgrades are permitted at any time, and allow for a composite upgrade (Driver N to Driver N+2).
Concurrently backing up to the previous driver level is not possible. The driver level must move forward to driver level N+1 after EDM is started. Unrecoverable errors during an update might require a scheduled outage to recover.
The EDM function does not eliminate the need for planned outages for driver-level upgrades. Upgrades might require a system level or a functional element scheduled outage to activate the new LIC. The following circumstances require a scheduled outage:
Specific complex code changes might dictate a disruptive driver upgrade. You are alerted in advance so that you can plan for the following changes:
 – Design data or hardware initialization data fixes
 – CFCC release level change
OSA CHPID code changes might require PCHID Vary OFF/ON to activate new code.
Crypto code changes might require PCHID Vary OFF/ON to activate new code.
 
Note: zUDX clients should contact their User Defined Extensions (UDX) provider before installing Microcode Change Levels (MCLs). Any changes to Segments 2 and 3 from a previous MCL level might require a change to the client’s UDX. Attempting to install an incompatible UDX at this level results in a Crypto checkstop.
9.9.1 Resource Group and native PCIe features MCLs
Microcode fixes, referred to as individual MCLs or packaged in Bundles, might be required to update the Resource Group code and the native PCIe features. Although the goal is to minimize changes or make the update process concurrent, the maintenance updates at times can require the Resource Group or the affected native PCIe feature to be toggled offline and online to implement the updates. The native PCIe features (managed by Resource Group code) are listed in Table 9-1.
Table 9-1 Native PCIe cards for z15
Native PCIe adapter type     Feature code   Resource required to be offline
25 GbE RoCE Express2.1       0450           FIDs/PCHID
25 GbE RoCE Express2         0430           FIDs/PCHID
10 GbE RoCE Express2.1       0432           FIDs/PCHID
10 GbE RoCE Express2         0412           FIDs/PCHID
zHyperLink Express1.1        0451           FIDs/PCHID
zHyperLink Express           0431           FIDs/PCHID
Coupling Express LR          0433           CHPIDs/PCHID
Consider the following points for managing native PCIe adapter microcode levels:
Updates to the Resource Group require all native PCIe adapters that are installed in that RG to be offline.
Updates to the native PCIe adapter require the adapter to be offline. If the adapter is not defined, the MCL session automatically installs the maintenance that is related to the adapter.
The PCIe native adapters are configured with Function IDs (FIDs) and might need to be configured offline when changes to code are needed. To help reduce the number of adapters (and FIDs) that are affected by a Resource Group code update, the z15 has four Resource Groups per system (CPC).
 
Note: Other adapter types, such as FICON Express, OSA-Express, and Crypto Express, which are installed in the PCIe+ I/O drawer, are not affected because they are not managed by the Resource Groups.
The front, rear, and top view of the PCIe+ I/O drawer and the Resource Group assignment by card slot are shown in Figure 9-8. All PCIe+ I/O drawers that are installed in the system feature the same Resource Group assignment.
Figure 9-8 Resource Group slot assignment
The adapter locations and PCHIDs for the four Resource Groups are listed in Table 9-2.
Table 9-2 Resource Group affinity to native PCIe adapter locations
RG1 (left adapter slots LG02, LG04, LG07, and LG09), PCHIDs by PCIe+ I/O drawer:
  Drawer 1:  100-101, 108-109, 110-111, 118-119
  Drawer 2:  140-141, 148-149, 150-151, 158-159
  Drawer 3:  180-181, 188-189, 190-191, 198-199
  Drawer 4:  1C0-1C1, 1C8-1C9, 1D0-1D1, 1D8-1D9
  Drawer 5:  200-201, 208-209, 210-211, 218-219
  Drawer 6:  240-241, 248-249, 250-251, 258-259
  Drawer 7:  280-281, 288-289, 290-291, 298-299
  Drawer 8:  2C0-2C1, 2C8-2C9, 2D0-2D1, 2D8-2D9
  Drawer 9:  300-301, 308-309, 310-311, 318-319
  Drawer 10: 340-341, 348-349, 350-351, 358-359
  Drawer 11: 380-381, 388-389, 390-391, 398-399
  Drawer 12: 3C0-3C1, 3C8-3C9, 3D0-3D1, 3D8-3D9
RG2 (right adapter slots LG12, LG14, LG17, and LG19), PCHIDs by PCIe+ I/O drawer:
  Drawer 1:  120-121, 128-129, 130-131, 138-139
  Drawer 2:  160-161, 168-169, 170-171, 178-179
  Drawer 3:  1A0-1A1, 1A8-1A9, 1B0-1B1, 1B8-1B9
  Drawer 4:  1E0-1E1, 1E8-1E9, 1F0-1F1, 1F8-1F9
  Drawer 5:  220-221, 228-229, 230-231, 238-239
  Drawer 6:  260-261, 268-269, 270-271, 278-279
  Drawer 7:  2A0-2A1, 2A8-2A9, 2B0-2B1, 2B8-2B9
  Drawer 8:  2E0-2E1, 2E8-2E9, 2F0-2F1, 2F8-2F9
  Drawer 9:  320-321, 328-329, 330-331, 338-339
  Drawer 10: 360-361, 368-369, 370-371, 378-379
  Drawer 11: 3A0-3A1, 3A8-3A9, 3B0-3B1, 3B8-3B9
  Drawer 12: 3E0-3E1, 3E8-3E9, 3F0-3F1, 3F8-3F9
RG3 (left adapter slots LG03, LG05, LG08, and LG10), PCHIDs by PCIe+ I/O drawer:
  Drawer 1:  104-105, 10C-10D, 114-115, 11C-11D
  Drawer 2:  144-145, 14C-14D, 154-155, 15C-15D
  Drawer 3:  184-185, 18C-18D, 194-195, 19C-19D
  Drawer 4:  1C4-1C5, 1CC-1CD, 1D4-1D5, 1DC-1DD
  Drawer 5:  204-205, 20C-20D, 214-215, 21C-21D
  Drawer 6:  244-245, 24C-24D, 254-255, 25C-25D
  Drawer 7:  284-285, 28C-28D, 294-295, 29C-29D
  Drawer 8:  2C4-2C5, 2CC-2CD, 2D4-2D5, 2DC-2DD
  Drawer 9:  304-305, 30C-30D, 314-315, 31C-31D
  Drawer 10: 344-345, 34C-34D, 354-355, 35C-35D
  Drawer 11: 384-385, 38C-38D, 394-395, 39C-39D
  Drawer 12: 3C4-3C5, 3CC-3CD, 3D4-3D5, 3DC-3DD
RG4 (right adapter slots LG13, LG15, LG18, and LG20), PCHIDs by PCIe+ I/O drawer:
  Drawer 1:  124-125, 12C-12D, 134-135, 13C-13D
  Drawer 2:  164-165, 16C-16D, 174-175, 17C-17D
  Drawer 3:  1A4-1A5, 1AC-1AD, 1B4-1B5, 1BC-1BD
  Drawer 4:  1E4-1E5, 1EC-1ED, 1F4-1F5, 1FC-1FD
  Drawer 5:  224-225, 22C-22D, 234-235, 23C-23D
  Drawer 6:  264-265, 26C-26D, 274-275, 27C-27D
  Drawer 7:  2A4-2A5, 2AC-2AD, 2B4-2B5, 2BC-2BD
  Drawer 8:  2E4-2E5, 2EC-2ED, 2F4-2F5, 2FC-2FD
  Drawer 9:  324-325, 32C-32D, 334-335, 33C-33D
  Drawer 10: 364-365, 36C-36D, 374-375, 37C-37D
  Drawer 11: 3A4-3A5, 3AC-3AD, 3B4-3B5, 3BC-3BD
  Drawer 12: 3E4-3E5, 3EC-3ED, 3F4-3F5, 3FC-3FD
9.10 RAS capability for the HMC and SE
The HMC and the SE include the following RAS capabilities:
Back up from HMC and SE
For the customers who do not have an FTP server that is defined for backups, the HMC can be configured as an FTP server.
On a scheduled basis, the HMC hard disk drive (HDD) is backed up to the USB flash memory drive (UFD), a defined FTP server, or both.
SE HDDs are backed up onto the primary SE HDD and the alternative SE HDD. In addition, you can save the backup to a defined FTP server.
Remote Support Facility (RSF)
The HMC RSF provides the important communication to a centralized IBM support network for hardware problem reporting and service. For more information, see 10.4, “Remote Support Facility” on page 427.
Microcode Change Level (MCL)
Regular installation of MCLs is key for RAS, optimal performance, and new functions. Generally, plan to install MCLs quarterly at a minimum. Review hiper MCLs continuously. You must decide whether to wait for the next scheduled apply session, or schedule one earlier if your risk assessment of the new hiper MCLs warrants.
SE
z15 servers are provided with two 1U trusted servers inside the IBM Z server A frame: One is always the primary SE and the other is the alternative SE. The primary SE is the active SE. The alternative acts as the backup. Information is mirrored once per day. The SE servers include N+1 redundant power supplies.
 

1 Key in storage error uncorrected: Indicates that the hardware cannot repair a storage key that was in error.
2 Some pre-planning considerations might exist. For more information, see Chapter 8, “System upgrades” on page 331.