Reliability, availability, and serviceability
From the quality perspective, the z14 RAS design is driven by a set of high-level program reliability, availability, and serviceability (RAS) objectives. The IBM Z platform continues to drive toward Continuous Reliable Operation (CRO) at the single footprint level.
In order of priority, the key objectives are to ensure data and computational integrity, reduce or eliminate unscheduled outages, and reduce scheduled outages, planned outages, and the number of Repair Actions.
RAS can be accomplished with improved concurrent replace, repair, and upgrade functions for processors, memory, drawers, and I/O. RAS also extends to the nondisruptive capability for installing Licensed Internal Code (LIC) updates. In most cases, a capacity upgrade can be concurrent without a system outage. As an extension to the RAS capabilities, environmental controls are implemented in the system to help reduce power consumption and meet cooling requirements.
This chapter includes the following topics:
9.1, “RAS strategy”
9.2, “Structure change”
9.3, “Technology change”
9.4, “Reducing complexity”
9.5, “Reducing touches”
9.6, “z14 ZR1 availability characteristics”
9.7, “z14 ZR1 RAS functions”
9.8, “z14 ZR1 Enhanced Driver Maintenance”
9.9, “RAS capability for the HMC and SE”
9.1 RAS strategy
The RAS strategy is to manage change by learning from previous generations and investing in new RAS function to eliminate or minimize all sources of outages. Enhancements to z13s RAS designs are implemented on the z14 ZR1 system through the introduction of new technology, structure, and requirements. Continuous improvements in RAS are associated with new features and functions to ensure that IBM Z servers deliver exceptional value to clients.
The overriding RAS requirements, which are shown in Figure 9-1, include the following principles:
Include existing (or equivalent) RAS characteristics from previous generations.
Learn from current field issues and address the deficiencies.
Understand the trend in technology reliability (hard and soft) and ensure that the RAS design points are sufficiently robust.
Invest in RAS design enhancements (hardware and firmware) that provide differentiation that IBM Z clients value.
Figure 9-1 Overriding RAS requirements
9.2 Structure change
The z14 ZR1 is a transitional machine, not merely a simplified version of the z14 M0x, with the following goals:
Enhanced system modularity
Standardization to enable rapid integration
Platform simplification
The z14 ZR1 is built in a new form factor: an industry-standard 19-inch rack. It is an air-cooled system that fulfills the requirements of an ASHRAE A3 environment (as does the z13s). The CPC drawer and the PCIe+ I/O drawer use the same chipsets and PCIe I/O adapters as the z14 M0x models.
The power subsystem is completely redesigned and is now based on 200 - 240 V AC power distribution units (PDUs). This configuration uses a Power System Control Network (PSCN) structure, which is the industry standard in data centers.
With z14 ZR1, the processing infrastructure within the CPC is designed by using drawer technology. To improve field-replaceable unit (FRU) isolation, Time Domain Reflectometry (TDR) techniques are applied between SCMs (CP-CP and CP-SC), and between the CP SCM and dual inline memory modules (DIMMs).
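The principle behind TDR-based isolation is simple: a test pulse is driven onto the suspect net, and the time until a reflection returns indicates how far along the wire the discontinuity sits, which in turn points to the most likely FRU. The following Python sketch illustrates only that distance calculation; the velocity factor, element positions, and example values are assumptions for illustration, not z14 firmware values:

# Illustrative sketch of the distance calculation behind TDR-based FRU isolation.
# The velocity factor and element positions are assumptions, not z14 firmware values.

C_M_PER_NS = 0.299792458          # speed of light in m/ns
VELOCITY_FACTOR = 0.5             # assumed signal speed relative to c on a board trace

def tdr_distance_to_fault(round_trip_ns: float) -> float:
    """Distance (in meters) from the driver to a reflecting discontinuity."""
    one_way_ns = round_trip_ns / 2.0
    return one_way_ns * C_M_PER_NS * VELOCITY_FACTOR

def isolate_fru(round_trip_ns: float, element_positions_m: dict) -> str:
    """Map the computed fault distance to the nearest interface element (FRU)."""
    fault_m = tdr_distance_to_fault(round_trip_ns)
    return min(element_positions_m,
               key=lambda fru: abs(element_positions_m[fru] - fault_m))

# Example: a CP-to-DIMM net with assumed element positions along the trace.
positions = {"CP SCM pad": 0.00, "board trace": 0.05, "DIMM connector": 0.10, "DIMM": 0.12}
print(isolate_fru(1.2, positions))   # reflection seen after a 1.2 ns round trip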
Enhancements to thermal RAS also were introduced; for example, two redundant oscillator cards are installed into the central processor complex (CPC) drawer. The following z13s characteristics are continued with z14 ZR1:
Keyed cables and plugging detection
Master-master redundant oscillator design in the main memory
Processor and nest chips are separate FRUs
Point-of-load cards are separate FRUs
Oscillator redundancy and concurrent oscillator card repair capability
Built-in time domain reflectometer for FRU isolation of interface errors
9.3 Technology change
The IBM z14 Model ZR1 builds upon the RAS of the z13s with the following RAS improvements for the PU/Cache/Memory structure (as shown in Figure 9-2 on page 330):
Symbol ECC on the L3 data cache for better availability. This is another layer of ECC protection, in addition to the protection that was already provided by main memory RAIM and L4 symbol ECC.
New PU sparing algorithm, which provides a single dedicated spare PU for all configurations.
The single CPC drawer-only configuration simplifies the structure and avoids CPC drawer to CPC drawer connectivity (no SMP cables and no SC SCM to SC SCM).
L3 ability to monitor (dynamically) fenced macros and allow integrated sparing.
Ability to spare the PU upon non-L2 cache/DIR Core Array delete.
Improved error thresholding on PU, which avoids continuous recovery.
Dynamic L3 cache monitor (“stepper”) to find and demote HSA lines in the cache.
Symbol ECC on the L4 cache data, directory, configuration array, and store protect key cache data.
L4 ability to monitor (dynamically) fenced macros and allow integrated sparing.
MCU (memory control unit) Cache Symbol ECC.
L1, L1+, and L2 caches are protected by PU sparing.
PU mandatory address checking.
Redundant parity bits in the recovery unit (RU) to protect against wordline (WL) failures.
 
Figure 9-2 Memory and cache structure
The following storage controller (SC) single chip module (SCM) and main memory improvements also were made:
Preemptive memory channel marking (a policy sketch follows this list):
 – Analysis of uncorrectable errors considers pattern of previous correctable errors
 – More robust uncorrectable error handling
 – Simplified repair action
Improved resilience in KUE1 repair capability
Virtual Flash Memory (Flash Express replacement) solution is moved to CPC drawer memory
The solution is moved to more robust RAIM-protected storage (the same protection that main memory uses)
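The idea behind preemptive channel marking can be pictured with a simple policy: correctable errors are counted per channel, and an uncorrectable error (or a suspicious pattern of correctable errors) causes the suspect channel to be marked so that the repair action points to a single FRU. The sketch below is an illustration under assumed thresholds, not the actual z14 firmware algorithm:

from collections import Counter

# Illustrative policy for preemptive memory channel marking.
# The threshold and the single-mark restriction are assumptions, not firmware values.
CE_THRESHOLD = 24                                 # correctable errors tolerated per channel

class ChannelMarker:
    def __init__(self):
        self.ce_counts = Counter()                # correctable errors seen per channel
        self.marked = set()                       # channels excluded and flagged for repair

    def record_correctable(self, channel: int) -> None:
        self.ce_counts[channel] += 1
        if self.ce_counts[channel] >= CE_THRESHOLD and not self.marked:
            self.marked.add(channel)              # preemptive mark: repair can be scheduled

    def record_uncorrectable(self, channel_hint=None):
        # The history of correctable errors picks the channel to mark, which
        # simplifies the repair action to a single FRU call.
        if channel_hint is None and self.ce_counts:
            channel_hint = self.ce_counts.most_common(1)[0][0]
        if channel_hint is not None and not self.marked:
            self.marked.add(channel_hint)
        return channel_hint

marker = ChannelMarker()
for _ in range(24):
    marker.record_correctable(channel=3)
print(marker.marked)                              # {3}: channel 3 is marked preemptively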
I/O and service
The following I/O and service improvements were made:
OSA-Express6S adds TCP checksum offload on large send (a checksum sketch follows this list):
 – Now on large and small packet sizes
 – Reduced the cost (CPU time) of error detection for large send
Increased number of PCIe support partitions (PSPs) for managing native PCIe I/O adapters:
 – Now four partitions (was two on previous systems)
 – Reduced effect on MCL updates
 – Better availability
Faster Dynamic Memory Relocation engine:
 – Enables faster reallocation of memory (2x faster) that is used for LPAR activations, CDR, and concurrent upgrade
 – Provides faster, more robust service actions
Dynamic Time Domain Reflectometry (TDR), which was static in previous generations.
This hardware facility, which is used to isolate failures on wires, provides better FRU isolation and improved service actions.
Universal spare for PU SCMs and SC SCMs and processor drawer
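Checksum offload on large send means that the adapter, not the CPU, computes and verifies the ones'-complement Internet checksum that TCP uses, which is where the CPU-time saving comes from. The following sketch shows that checksum calculation in the style of RFC 1071; it illustrates the work that is offloaded and is not OSA firmware code:

# Minimal sketch of the ones'-complement Internet checksum (RFC 1071 style)
# that TCP uses; OSA-Express6S offloads this work to the adapter on large send.

def internet_checksum(data: bytes) -> int:
    """Return the 16-bit ones'-complement checksum of data."""
    if len(data) % 2:                              # pad odd-length data with a zero byte
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # fold carries back into 16 bits
    return ~total & 0xFFFF

segment = b"example TCP payload"
print(hex(internet_checksum(segment)))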
9.4 Reducing complexity
z14 ZR1 continues the z13s enhancements that reduced system complexity. Specifically, simplifications were made in the CPC drawer technology, which reduces the number of CPC drawers to one, and in the RAIM recovery design of the memory subsystem. Memory DIMMs are not cascaded, which eliminates the double FRU call for DIMM errors.
Independent channel recovery with replay buffers on all interfaces allows recovery of a single DIMM channel, while other channels remain active. More redundancies are incorporated in I/O pins for clock lines to main memory, which eliminates the loss of memory clocks because of connector (pin) failure.
The following RAS enhancements reduce service complexity:
Continued use of RAIM with ECC.
No cascading of memory DIMM to simplify the recovery design.
Replay buffer for hardware retry on soft errors on the main memory interface.
Redundant I/O pins for clock lines to main memory.
9.5 Reducing touches
IBM Z RAS efforts focus on the reduction of unscheduled, scheduled, planned, and unplanned outages. IBM Z technology has a long history of demonstrated RAS improvements, and this effort continues with changes that reduce service touches on the system.
Firmware was updated to improve filtering and resolution of errors that do not require action. Enhanced integrated sparing in processor cores, cache relocates, N+1 SEEPROM and POL N+2 redundancies, and DRAM marking also are incorporated to reduce touches. The following RAS enhancements reduce service touches:
Improved error resolution to enable filtering
Enhanced integrated sparing in processor cores
Cache relocates
N+1 SEEPROM
N+2 POL
DRAM marking
(Dynamic) Spare lanes for CP-SC, CP-CP, CP-mem
N+1 controllers, blowers, and sensors
N+1 Support Element (SE) (with N+1 SE power supplies)
Redundant SEEPROM on memory DIMM
Redundant temperature sensor (one SEEPROM and one temperature sensor per I2C bus)
FICON forward error correction
Table 9-1 compares the integrated sparing functions for zBC12, z13s, and z14 ZR1 and shows the improvements that are achieved.
Table 9-1 Integrated Sparing functions
Subsystem             Integrated sparing RAS function               zBC12   z13s   z14 ZR1
PU                    Dynamic PU sparing                            y       y      y
                      Dynamic cache line delete/relocate            y       y      y
                      L4 dynamic subarray delete                    n/a     y      y
                      L3 dynamic subarray delete                    n       n      y
                      Dual module SEEPROM                           y       y      y
Memory                Dynamic DRAM marking - N+2 DRAMs per rank     y       y      y
                      DIMM temperature sensors                      y       y      y
                      Dual DIMM SEEPROM                             y       y      y
                      DIMM sockets - dual clock tabs                n/a     y      y
Power                 Point of load (POL) - N+2, SEEPROM            y       y      y
                      Voltage regulator module (VRM) N+2            n       y      y
Thermal               Corrosion sensor                              y       y      y
Fabric Interconnect   Dynamic PU/SC bus lane sparing                n       y      y
                      Dynamic PU/PU bus lane sparing                n       y      y
                      Dynamic memory bus data lane sparing          n       y      y
9.6 z14 ZR1 availability characteristics
The following functions provide availability characteristics on z14 ZR1:
Concurrent memory upgrade
Memory can be upgraded concurrently by using Licensed Internal Code Configuration Control (LICCC) if physical memory is available on the drawer. Memory plan ahead can be used at the time of initial configuration to provide more resources for future use.
Enhanced driver maintenance (EDM)
One of the greatest contributors to downtime during planned outages is LIC driver updates that are performed in support of new features and functions. z14 ZR1 is designed to support the concurrent activation of a selected new driver level.
Concurrent fanout addition or replacement
A PCIe fanout card provides the path for data between memory and I/O through PCIe cables. With z14 ZR1, hot-pluggable and concurrently upgradeable fanouts are available. Up to eight PCIe fanouts are available in the CPC drawer for z14 ZR1.
Redundant I/O interconnect
Redundant I/O interconnect helps maintain critical connections to devices. z14 ZR1 allows a single PCIe+ I/O drawer adapter, a fanout card, or even a PCIe+ I/O drawer in a system with multiple PCIe+ I/O drawers to be removed and reinstalled concurrently during a repair action. Connectivity to the system I/O resources is maintained through a second path when planned thoroughly.
Dynamic oscillator switch-over
z14 ZR1 has two oscillator cards: a primary and a backup. If the primary card fails, the backup card transparently detects the failure, switches over, and provides the clock signal to the system.
Processor unit (PU) sparing
z14 ZR1 has one spare PU to maintain performance levels if an active CP, Internal Coupling Facility (ICF), Integrated Facility for Linux (IFL), IBM z Integrated Information Processor (zIIP), integrated firmware processor (IFP), or system assist processor (SAP) fails. Transparent integrated sparing for failed processors is supported. One spare PU is available per system.
Application preservation
This function is used when a CP fails and no spares are available. The state of the failing CP is passed to another active CP, where the operating system uses it to successfully resume the task, in most cases without client intervention.
Cooling improvements
The z14 air-cooled configuration includes a newly designed front-to-rear air cooling system. The fans, controls, and sensors are N+1 redundant.
FICON Express16S+ with Forward Error Correction (FEC)
FICON Express16S+ features continue to provide a new standard for transmitting data over 16 Gbps links by using 64b/66b encoding. The new standard that is defined by T11.org FC-FS-3 is more efficient than the current 8b/10b encoding.
FICON Express16S+ channels that are running at 16 Gbps can use FEC capabilities when connected to devices that support FEC.
FEC allows FICON Express16S+ channels to operate at higher speeds, over longer distances, with reduced power and higher throughput. They also retain the same reliability and robustness for which FICON channels are traditionally known.
FEC is a technique that is used for controlling errors in data transmission over unreliable or noisy communication channels. When running at 16 Gbps link speeds, clients often see fewer I/O errors, which reduces the potential effect to production workloads from those I/O errors.
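As a toy illustration of the forward error correction principle, the following sketch encodes four data bits with a Hamming(7,4) code and lets the receiver correct a single flipped bit without a retransmission. FICON FEC uses a different, standardized code defined by T11; this example only shows why a link can tolerate bit errors while keeping throughput:

# Toy Hamming(7,4) example of the forward-error-correction principle:
# the receiver corrects a single flipped bit without a retransmission.
# FICON FEC uses a different, standardized code; this is only an illustration.

def hamming74_encode(d):
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]            # codeword positions 1..7

def hamming74_decode(c):
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]                 # parity check over positions 1,3,5,7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]                 # positions 2,3,6,7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]                 # positions 4,5,6,7
    syndrome = s1 + 2 * s2 + 4 * s3                # index of the flipped bit, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1                        # correct the single-bit error
    return [c[2], c[4], c[5], c[6]]                 # extract d1..d4

codeword = hamming74_encode([1, 0, 1, 1])
codeword[4] ^= 1                                    # simulate a bit error on the link
print(hamming74_decode(codeword))                   # [1, 0, 1, 1] is recovered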
Read Diagnostic Parameters (RDP) improve Fault Isolation. After a link error is detected (for example, IFCC, CC3, reset event, or a link incident report), link data that is returned from Read Diagnostic Parameters is used to differentiate between errors that result from failures in the optics versus failures because of dirty or faulty links.
Key metrics can be displayed on the operator console. The results of a display matrix command with the LINKINFO=FIRST parameter, which collects information from each device in the path from the channel to the I/O device, are shown in Figure 9-3 (a z14 M02 is used in this example). The following output is displayed:
 – Transmit (Tx) and Receive (Rx) optic power levels from the PCHID, Switch Input and Output, and I/O device
 – Capable and Operating speed between the devices
 – Error counts
 – Operating System requires new function APAR OA49089
Figure 9-3 Read Diagnostic Parameters function
The new IBM Z Channel Subsystem Function performs periodic polling from the channel to the end points for the logical paths that are established and reduces the number of unnecessary Repair Actions (RAs).
The RDP data history is used to validate Predictive Failure Algorithms and identify Fibre Channel Links with degrading signal strength before errors start to occur. The new Fibre Channel Extended Link Service (ELS) retrieves signal strength.
FICON Dynamic Routing
FICON Dynamic Routing (FIDR) enables the use of storage area network (SAN) dynamic routing policies in the fabric. With the z14 systems, FICON channels are no longer restricted to the use of static routing policies for inter-switch links (ISLs) for cascaded FICON directors.
FICON Dynamic Routing dynamically changes the routing between the channel and control unit that is based on the Fibre Channel Exchange ID. Each I/O operation has a unique exchange ID. FIDR supports static SAN routing policies and dynamic routing policies.
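The effect of exchange-based routing can be pictured as follows: because every I/O operation carries its own exchange ID, the fabric can spread successive operations across all available ISLs instead of pinning a port pair to one route. The sketch below is purely illustrative; the hash that real FICON directors use is vendor-specific and is not shown here:

# Illustrative sketch of exchange-based routing: each I/O operation (exchange)
# can take a different inter-switch link, spreading load across all ISLs.
# Real FICON directors use vendor-specific hashing; this is not that algorithm.

from zlib import crc32

def pick_isl(source_port: int, dest_port: int, exchange_id: int, isls):
    key = f"{source_port}:{dest_port}:{exchange_id}".encode()
    return isls[crc32(key) % len(isls)]

isls = ["ISL-1", "ISL-2", "ISL-3", "ISL-4"]

# Static routing would pin this port pair to one ISL; with dynamic routing the
# exchange ID varies per I/O, so traffic spreads across the available ISLs.
for exchange_id in range(6):
    print(exchange_id, pick_isl(source_port=0x10, dest_port=0x2A,
                                exchange_id=exchange_id, isls=isls))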
FICON Dynamic Routing can help clients reduce costs by providing the following features:
 – Share SANs between their FICON and FCP traffic.
 – Improve performance because of SAN dynamic routing policies that better use all the available ISL bandwidth through higher use of the ISLs.
 – Simplify management of their SAN fabrics. Static routing policies assign different ISL routes with each power-on-reset (POR), which makes SAN fabric performance difficult to predict.
Clients must ensure that all devices in their FICON SAN support FICON Dynamic Routing before they implement this feature.
9.7 z14 ZR1 RAS functions
Hardware RAS function improvements focus on addressing all sources of outages. Sources of outages feature the following classifications:
Unscheduled
This outage occurs because of an unrecoverable malfunction in a hardware component of the system.
Scheduled
This outage is caused by changes or updates that must be done to the system in a timely fashion. A scheduled outage can be caused by a disruptive patch that must be installed, or other changes that must be made to the system.
Planned
This outage is caused by changes or updates that must be done to the system. A planned outage can be caused by a capacity upgrade or a driver upgrade. A planned outage is usually requested by the client, and often requires pre-planning. The z14 ZR1 design phase focuses on enhancing planning to simplify or eliminate planned outages.
The difference between scheduled outages and planned outages might not be obvious. The general consensus is that scheduled outages occur sometime soon. The time frame is approximately two weeks.
Planned outages are outages that are planned well in advance and go beyond this approximate two-week time frame. The distinction between scheduled and planned outages is beyond the scope of this chapter.
Preventing unscheduled, scheduled, and planned outages has been addressed by the IBM Z system design for many years.
z14 ZR1 introduces a fixed size HSA of 64 GB. This size helps eliminate planning requirements for HSA and provides the flexibility to dynamically update the configuration. You can perform the following tasks dynamically:2
Add a logical partition (LPAR)
Add a logical channel subsystem (LCSS)
Add a subchannel set
Add a logical CP to an LPAR
Add a cryptographic coprocessor
Remove a cryptographic coprocessor
Enable I/O connections
Swap processor types
Add memory
Add a physical processor
By addressing the elimination of planned outages, the following tasks also are possible:
Concurrent driver upgrades
Concurrent and flexible customer-initiated upgrades
For more information about the flexible upgrades that are started by clients, see 8.2.2, “Customer Initiated Upgrade facility” on page 290.
9.7.1 Scheduled outages
Concurrent hardware upgrades, parts replacement, driver upgrades, and firmware fixes that are available with z14 ZR1 all address the elimination of scheduled outages. Also, the following indicators and functions that address scheduled outages are included:
Double memory data bus lane sparing.
Redundant N+1 Power Distribution Units (PDU).
The bulk power hub (BPH) of former Z systems is repackaged into a new design that is based on switchable standard PDUs. This design allows the SEs and the PSCN Ethernet switch to be power cycled and avoids cable misplugging. The number of PDUs is configuration-dependent.
CPC drawer power distribution N+1 design by way of Power Supply Units (PSU).
The CPC Drawer uses point of load (POL) cards in a highly redundant N+2 configuration. POL regulators are daughter cards that contain the voltage regulators for the principle logic voltage boundaries in the z14 CPC drawer. They plug onto the CPC drawer system board and are nonconcurrent FRUs for the affected drawer, similar to the memory DIMMs.
Redundant (N+2) ambient temperature, pressure, and humidity sensors.
Dual inline memory module (DIMM) field-replaceable unit (FRU) indicators.
These indicators imply that a memory module is not error-free and might fail sometime in the future. This indicator gives IBM a warning and provides scheduled time to repair the storage module.
Single PU checkstop and sparing.
This indicator shows that a PU malfunctioned and is spared. IBM determines what course of action to take based on the system and the history of that system.
Air-cooled system, which features fans with N+1 redundancy and a newly designed front-to-rear cooling system.
Redundant 1 Gbps Ethernet service network with virtual LAN (VLAN).
The service network in the system gives the machine code the capability to monitor each internal function in the system. This process helps to identify problems, maintain the redundancy, and concurrently replace a part. Through the implementation of the VLAN to the redundant internal Ethernet service network, these advantages are improved, which makes the service network easier to handle and more flexible.
The PCIe+ I/O drawer is available for z14 ZR1. The drawer and all of its supported PCIe I/O adapters can be installed concurrently.
Memory interface logic to maintain channel synchronization when one channel goes into replay. z14 ZR1 can isolate recovery to only the failing channel.
PCIe redrive hub cards plug straight in (no blind mating of connectors), which makes plugging simpler and more reliable.
ICA (short distance) coupling cards plug straight in (no blind mating of connectors), which makes plugging simpler and more reliable.
Coupling Express LR (CE LR) coupling cards plug into the PCIe I/O Drawer, which allows more connections with the same bandwidth.
9.7.2 Unscheduled outages
An unscheduled outage occurs because of an unrecoverable malfunction in a hardware component of the system.
The following improvements can minimize unscheduled outages:
Continued focus on firmware quality
For LIC and hardware design, failures are eliminated through rigorous design rules; design walk-through; peer reviews; element, subsystem, and system simulation; and extensive engineering and manufacturing testing.
Memory subsystem improvements
RAIM on IBM Z systems is a concept similar to the concept of a Redundant Array of Independent Disks (RAID). The RAIM design detects and recovers from dynamic random access memory (DRAM), socket, memory channel, or DIMM failures. The RAIM design requires the addition of one memory channel that is dedicated to RAS.
The parity of the four data DIMMs is stored in the DIMMs that are attached to the fifth memory channel. Any failure in a memory component can be detected and corrected dynamically. z14 ZR1 inherited this memory architecture.
The memory system on z14 ZR1 is implemented with an enhanced version of the Reed-Solomon ECC that is known as 90B/64B. It provides protection against memory channel and DIMM failures.
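The channel-level redundancy can be pictured with a simplified XOR-parity sketch: data striped over four memory channels plus one redundancy channel survives the loss of any single channel. The real design uses the Reed-Solomon 90B/64B code described above rather than plain XOR, so the following Python sketch is an analogy only:

# Simplified XOR-parity illustration of the RAIM idea: data striped over four
# memory channels plus one redundancy channel survives the loss of any single
# channel. The real z14 design uses a Reed-Solomon (90B/64B) code, not plain XOR.

def xor_bytes(blocks):
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

def write_line(data_blocks):                      # four data channels -> five channels
    return list(data_blocks) + [xor_bytes(data_blocks)]

def read_line(channels, failed=None):             # rebuild data if one channel is lost
    if failed is None:
        return channels[:4]
    survivors = [c for i, c in enumerate(channels) if i != failed]
    rebuilt = xor_bytes(survivors)
    data = list(channels[:4])
    if failed < 4:
        data[failed] = rebuilt
    return data

line = write_line([b"\x11\x22", b"\x33\x44", b"\x55\x66", b"\x77\x88"])
print(read_line(line, failed=2))                  # data from channel 2 is reconstructed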
A precise marking of faulty chips helps ensure timely DIMM replacements. The design of the z14 ZR1 further improved this chip marking technology. Graduated DRAM marking, channel marking, and scrubbing are available, and a replacement is called for on the third DRAM failure. For more information about the memory system on z14 ZR1, see 2.5, “Memory” on page 38.
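A simplified view of that escalation: the first two failing DRAM chips in a rank can be marked and corrected around (the N+2 marking listed in Table 9-1), and a third failure triggers a repair call. The policy below is an illustration of the escalation, not the actual marking firmware:

# Simplified sketch of graduated DRAM marking: up to two failing DRAMs per rank
# are marked and corrected around (N+2), and a third failure calls for repair.
# The escalation policy here is illustrative, not the actual z14 firmware.

class RankMarker:
    MAX_MARKS = 2                      # N+2 DRAM marking per rank

    def __init__(self):
        self.marked_drams = []

    def dram_failed(self, dram_id: int) -> str:
        if dram_id in self.marked_drams:
            return "already marked"
        if len(self.marked_drams) < self.MAX_MARKS:
            self.marked_drams.append(dram_id)
            return f"DRAM {dram_id} marked, no service needed"
        return "third DRAM failure: schedule DIMM replacement"

rank = RankMarker()
for dram in (4, 9, 12):
    print(rank.dram_failed(dram))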
Improved thermal, altitude, and condensation management
Server Time Protocol (STP) recovery enhancement
Enhanced Console Assisted Recovery (ECAR) was new with z13s and z13 GA2 and is carried forward to z14. It provides better recovery algorithms when the Primary Time Server (PTS) fails and uses communication over the HMC/SE network to assist with Backup Time Server (BTS) takeover. For more information, see Chapter 11, “Hardware Management Console and Support Elements” on page 357.
Design of pervasive infrastructure controls in processor chips and memory ASICs.
Improved error checking in the processor recovery unit (RU) to better protect against word line failures in the RU arrays.
9.8 z14 ZR1 Enhanced Driver Maintenance
Enhanced Driver Maintenance (EDM) is one more step toward reducing the necessity for and the duration of a scheduled outage. One of the contributors to planned outages is LIC Driver updates that are run in support of new features and functions.
When correctly configured, z14 ZR1 supports concurrently activating a selected new LIC Driver level. Concurrent activation of the selected new LIC Driver level is supported only at specific released sync points. Concurrently activating a selected new LIC Driver level anywhere in the maintenance stream is not possible. Certain LIC updates do not allow a concurrent update or upgrade.
Consider the following key points regarding EDM:
The HMC can query whether a system is ready for a concurrent driver upgrade.
Previous firmware updates, which require an initial machine load (IML) of the z14 ZR1 to be activated, can block the ability to run a concurrent driver upgrade.
An icon on the SE allows you or your IBM SSR to define the concurrent driver upgrade sync point to be used for an EDM.
The ability to concurrently install and activate a driver can eliminate or reduce a planned outage.
Concurrent crossover from Driver level N to Driver level N+1, then to Driver level N+2, must be done serially. No composite moves are allowed.
Disruptive upgrades are permitted at any time, and allow for a composite upgrade (Driver N to Driver N+2).
Concurrently falling back to the previous driver level is not possible. After EDM is started, the driver level must move forward to driver level N+1. Unrecoverable errors during an update might require a scheduled outage to recover.
The EDM function does not eliminate the need for planned outages for driver-level upgrades. Upgrades might require a system level or a functional element scheduled outage to activate the new LIC. The following circumstances require a scheduled outage:
Specific complex code changes might dictate a disruptive driver upgrade. You are alerted in advance so that you can plan for the following changes:
 – Design data or hardware initialization data fixes
 – CFCC release level change
OSA CHPID code changes might require PCHID Vary OFF/ON to activate new code.
z14 introduced support to concurrently activate an MCL on an OSA-ICC channel, which improves the availability and simplifies firmware maintenance. The OSD channels already feature this capability.
Crypto code changes might require PCHID Vary OFF/ON to activate new code.
 
Note: zUDX clients should contact their User Defined Extensions (UDX) provider before installing Microcode Change Levels (MCLs). Any changes to Segments 2 and 3 from a previous MCL level might require a change to the client’s UDX. Attempting to install an incompatible UDX at this level results in a Crypto checkstop.
9.8.1 Resource Group and native PCIe MCLs
Microcode fixes (referred to as individual MCLs or packaged in Bundles) might be required to update the Resource Group code and the native PCIe features. Although the goal is to minimize changes or make the update process concurrent, the maintenance updates at times can require the Resource Group or the affected native PCIe to be toggled offline and online to implement the updates. The native PCIe features (managed by Resource Group code) are listed in Table 9-2.
Table 9-2 Native PCIe cards for z14 ZR1
Native PCIe adapter type   Feature code   Resource required to be offline
25GbE RoCE Express2        0430           FIDs/PCHID
10GbE RoCE Express         0412           FIDs/PCHID
zEDC Express               0420           FIDs/PCHID
Coupling Express LR        0433           CHPIDs/PCHID
IBM zHyperLink Express     0431           FIDs/PCHID
Consider the following points for managing native PCIe adapters microcode levels:
Updates to the Resource Group require all native PCIe adapters that are installed in that RG to be offline. For more information about this requirement, see Appendix C, “Native Peripheral Component Interconnect Express” on page 419.
Updates to the native PCIe adapter require the adapter to be offline. If the adapter is not defined, the MCL session automatically installs the maintenance that is related to the adapter.
The PCIe native adapters are configured with Function IDs (FIDs) and might need to be configured offline when changes to code are needed. To help alleviate the number of adapters (and FIDs) that are affected by the Resource Group code update, z14 ZR1 increased the number of Resource Groups from two per system (for previous systems) to four per system (CPC).
 
Note: Other adapter types, such as FICON Express, OSA Express, and Crypto Express, that are installed in the PCIe I/O drawer are not affected because they are not managed by the Resource Groups.
The rear view of the PCIe I/O drawer and the Resource Group assignment by card slot are shown in Figure 9-4. All PCIe I/O drawers that are installed in the system feature the same Resource Group assignment.
Figure 9-4 Resource Group slot assignment
9.9 RAS capability for the HMC and SE
The HMC and the SE include the following RAS capabilities:
Back up from HMC and SE
For customers who do not have an FTP server that is defined for backups, the HMC can be configured as an FTP server, which is new with z14.
On a scheduled basis, the HMC hard disk drive (HDD) is backed up to the USB flash memory drive (UFD), a defined FTP server, or both.
SE HDDs are backed up on to the primary SE HDD and an alternative SE HDD. In addition, you can save the backup to a defined FTP server.
Remote Support Facility (RSF)
The HMC RSF provides the important communication to a centralized IBM support network for hardware problem reporting and service. For more information, see 11.4, “Remote Support Facility” on page 374.
Microcode Change Level (MCL)
Regular installation of MCLs is key for RAS, optimal performance, and new functions. Generally, plan to install MCLs quarterly at a minimum. Review hiper MCLs continuously. You must decide whether to wait for the next scheduled apply session, or schedule one earlier if your risk assessment of the new hiper MCLs warrants it.
SE
z14 ZR1 is provided with two 1U trusted servers inside the rack: one is always the primary SE and the other is the alternative SE. The primary SE is the active SE. The alternative acts as the backup. Information is mirrored once per day. The SE servers include N+1 redundant power supplies.
For more information, see 11.2.5, “New SEs” on page 362.
HMC in an ensemble
The serviceability function for the components of an ensemble is delivered through the traditional HMC/SE constructs, as for earlier Z servers. The primary HMC for the ensemble is where portions of the Unified Resource Manager routines run. The Unified Resource Manager is an active part of the ensemble and z14 infrastructure; therefore, the HMC is stateful and needs high availability features to ensure the survival of the system during a failure. Each ensemble must be equipped with two HMCs: a primary and an alternative. The primary HMC performs all HMC activities (including Unified Resource Manager activities). The alternative is only the backup and cannot be used for tasks or activities.
 
Failover: The primary HMC and its alternative must be connected to the same LAN segment. This configuration allows the alternative HMC to take over the IP address of the primary HMC during failover processing.
Alternative HMC preload function
The Manage Alternate HMC task allows you to reload internal code onto the alternative HMC to minimize HMC downtime during an upgrade to a new driver level. After the new driver is installed on the alternative HMC, it can be made active by running an HMC switchover.

1 Key in storage error uncorrected: Indicates that the hardware cannot repair a storage key that was in error.
2 Some planning considerations might be necessary. For more information, see Chapter 8, “System upgrades” on page 281.