Problem determination and debugging an ECE system
The first step in problem determination is to conduct a high-level analysis of the state of the system and isolate where problems might occur.
In this chapter, we provide a checklist that is a good starting point for that analysis.
This chapter includes the following topics:
7.1, "Check whether the ECE nodes are active in the cluster"
7.2, "Check whether the recovery groups are active"
7.3, "Check for pdisks that are ready for replacement"
7.4, "Check for pdisks that are not in OK state"
7.5, "Pdisk states"
7.6, "Check each recovery group's event log for messages"
7.7, "Using the mmhealth command with ECE"
7.8, "System health monitoring use cases"
7.9, "Collecting data for problem determination"
7.10, "Network tools"
7.1 Check whether the ECE nodes are active in the cluster
Use the mmgetstate command to check the state of all cluster nodes:
mmgetstate -a
 
Node number Node name GPFS state
-------------------------------------------
1 node1 active
2 node2 active
3 node3 active
4 node4 active
5 node5 active
6 node6 active
7 node7 active
8 node8 active
9 node9 active
10 node10 active
If any nodes are not in active state, check that the nodes are up and check for network problems. Resolve any problems before moving forward.
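For example, if node3 reports a state other than active, a minimal check-and-restart sequence might look like the following example (the node name node3 is only an illustration):
# Check the state of the suspect node
mmgetstate -N node3
# Verify basic IP connectivity over the daemon network
ping -c 3 node3
# If the node is reachable but GPFS is down, start the daemon on it
mmstartup -N node3
# Confirm that the node returns to the active state
mmgetstate -N node3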
7.2 Check whether the recovery groups are active
Use the mmvdisk rg list --not-ok command to list recovery groups that are not healthy:
mmvdisk rg list --not-ok
 
recovery group remarks
-------------- -------
rg1 down
If any recovery groups are not active, check the mmfs.log.* files on the ECE storage nodes for resign messages. Example 7-1 shows what we expect to see if too many nodes or pdisks are down.
Example 7-1 Example log output showing RG resign
2019-08-26_17:34:04.208-0700: [E] RG rg1 pdisk n009p011 slot both: disk state missing/0048.000; slot status disk unavailable.
2019-08-26_17:34:04.208-0700: [E] RG rg1: Total of 36 unreadable pdisk slots.
2019-08-26_17:34:32.897-0700: [E] Failed to recover RG rg1, error code 217.
2019-08-26_17:34:32.898-0700: [E] Beginning to resign recovery group rg1 due to "recovery failure", caller err 217 when "recovering RG master"
2019-08-26_17:34:32.916-0700: [I] Finished resigning recovery group rg1
In this situation, IBM Spectrum Scale ECE continually retries the recovery until enough pdisks are available.
You might encounter other resign messages when too many pdisks are down, including the following example:
2019-08-23_13:09:30.865-0700: [E] Beginning to resign recovery group rg1 due to "vdisk IO failure with unavailable pdisks", caller err 5 when "root log group resign"
Other resign messages must be analyzed by IBM Support.
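To locate resign messages quickly, you can search the current GPFS log on the ECE storage nodes. The following sketch assumes the default log location of /var/adm/ras and uses mmdsh to run the search on all nodes:
# Search the latest GPFS log on this node for resign-related messages
grep -i resign /var/adm/ras/mmfs.log.latest
# Search all nodes in the cluster at the same time
mmdsh -N all "grep -i resign /var/adm/ras/mmfs.log.latest"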
7.3 Check for pdisks that are ready for replacement
Use the mmvdisk pdisk list --replace command to find pdisks that are ready for replacement:
mmvdisk pdisk list --recovery-group rg1 --replace
 
declustered
recovery group pdisk array paths capacity free space FRU (type) state
-------------- ------------ ----------- ----- -------- ---------- --------------- -----
rg1 n001p001 DA1 0 931 GiB 928 GiB ST91000640NS failing/drained
If any pdisks are listed, use the mmvdisk pdisk replace command to replace the failed drives.
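The following sketch outlines a typical replacement sequence for the pdisk that is shown in the previous example. The recovery group and pdisk names are only examples; confirm the exact options for your code level in IBM Knowledge Center:
# Prepare the pdisk for removal (the drive is suspended and, where supported, its identify LED is lit)
mmvdisk pdisk replace --prepare --recovery-group rg1 --pdisk n001p001
# Physically remove the old drive and insert the replacement drive
# Complete the replacement so that ECE formats the new drive and rebalances data onto it
mmvdisk pdisk replace --recovery-group rg1 --pdisk n001p001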
7.4 Check for pdisks that are not in OK state
Use the mmvdisk pdisk list --not-ok command to find pdisks that are in a state other than “ok”:
mmvdisk pdisk list --recovery-group rg1 --not-ok
This command lists pdisks in states other than “OK”. It includes the pdisks that are ready for replacement with mmvdisk pdisk list --replace, but it also includes pdisks in various other states that are not called out for replacement. These include unavailable, draining, transient, and maintenance states, as shown in the following example:
declustered
recovery group pdisk array paths capacity free space FRU (type) state
-------------- ------------ ----------- ----- -------- ---------- --------------- -----
rg1 n001p001 DA1 0 931 GiB 928 GiB ST91000640NS failing/replace
rg1 n001p002 DA1 0 931 GiB 928 GiB ST91000640NS slow/draining
rg1 n001p003 DA1 0 931 GiB 52 GiB ST91000640NS missing/drained
rg1 n001p004 DA1 0 931 GiB 10 GiB ST91000640NS diagnosing
The pdisk state consists of several flags that combine to form the overall state. When the state is displayed, not all flags are always shown. For example, if a pdisk is in “failing” and “slow” states, only “failing” is displayed because it is the more important state.
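To see the full set of state flags (and other details, such as paths and error counters) for a single pdisk, you can list only that pdisk. The following sketch assumes that your mmvdisk level supports the --pdisk and -L options of the mmvdisk pdisk list command; the pdisk name is only an example:
# Show detailed attributes, including all state flags, for one pdisk
mmvdisk pdisk list --recovery-group rg1 --pdisk n001p002 -L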
The pdisk states are described next.
 
7.5 Pdisk states
Pdisk states fall into the following categories:
Normal state is OK, which indicates that the pdisk is operating normally.
States indicating that the drive is defective:
 – Dead
ECE can communicate with the drive, but the drive cannot read or write data.
 – Read Only
Some blocks of the drive can no longer be successfully written, so ECE designated the entire drive as read-only.
 – Failing
The uncorrectable error rate of the drive is too high. The drive reported a predicted failure through Self-Monitoring, Analysis, and Reporting Technology (SMART), or the disk hospital determined that the medium error rate is too high.
 – Slow
Performance of this drive is slow compared to other drives in the array.
Transient states:
 – Diagnosing
The disk hospital is checking the drive.
 – PTOW
A write request is still pending on the drive after ECE timed out on the I/O. The state stands for pending timed-out write. Further writes are disallowed until the pending writes complete.
Longer-term unavailable states:
 – Missing
ECE has no I/O connectivity with the drive. The most common meaning of this state is that the server that is hosting the drive is down, but the state also can indicate any of the following issues:
 • The drive was physically removed from the system.
 • The drive is not fully plugged into its socket.
 • The drive is not receiving power.
 • A cable is not connected.
 • An I/O expander or I/O HBA is not working.
 • The GPFS labels on the drive were overwritten.
 • The drive failed such that it does not respond to inquiry or identify commands (rare).
Other causes must be ruled out before missing drives are replaced.
 – VWCE
The drive reports that it has volatile write caching enabled. The drive must be reconfigured to disable volatile write caching before ECE uses it.
Test and maintenance states:
 – SimulatedDead
The mmvdisk pdisk change --simulate-dead command was used to inject the equivalent of “dead” state. All data from the disk is rebuilt into the spare space of other drives. When the rebuild process completes, the pdisk is replaceable.
 
Note: This test is not recommended on a production system because putting too many disks into this state causes permanent data loss.
 – SimulatedFailing
The mmvdisk pdisk change --simulate-failing command was used to inject the equivalent of “failing” state. All data from the disk is rebuilt into the spare space of other drives. When the rebuild completes, the pdisk is replaceable. This test is safer than simulatedDead because the system can still read and write the drive, if necessary.
 – Suspended
The pdisk was suspended for all I/O activity for a maintenance action, such as upgrading the drive firmware. For more information, see the mmvdisk pdisk change command, --suspend and --resume options, in IBM Knowledge Center. An example of this workflow is shown after this list.
 – ServiceDrain
All data from the pdisk is being temporarily drained in preparation for a maintenance operation. For more information, see the mmvdisk pdisk change command, --begin-service-drain and --end-service-drain options, in IBM Knowledge Center.
 – PathMaintenance
An I/O path was temporarily taken offline in preparation for a maintenance operation such as upgrading HBA firmware.
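For example, a drive firmware upgrade on a single pdisk might use the --suspend and --resume options that are described above. The recovery group and pdisk names are only examples; confirm the exact procedure for your hardware in IBM Knowledge Center or with IBM Support:
# Suspend the pdisk before the maintenance action
mmvdisk pdisk change --recovery-group rg1 --pdisk n001p005 --suspend
# ... upgrade the drive firmware by using the vendor tool ...
# Resume the pdisk after the maintenance action completes
mmvdisk pdisk change --recovery-group rg1 --pdisk n001p005 --resume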
Deleting, draining, and replace states:
 – deleting
The pdisk is in the process of being deleted from the recovery group.
 – draining
Data from the pdisk is being rebuilt into the spare space of other drives.
 – drained
All data was drained from the pdisk.
 – Undrainable
The DA replacement threshold is set higher than the number of data spares and this pdisk was made replaceable without draining its data to meet the threshold. Replacing an undrainable pdisk effectively lowers the fault tolerance of all VDisks in the DA by one fault. Proceed with care.
 – Replace
The pdisk is replaceable and the replacement threshold was met in the DA.
Except for the pdisks that are listed by using the mmvdisk pdisk list --replace command, most of the pdisk states that are listed here do not necessarily require any user action. For example, defective pdisks that are still in the process of being drained appear here, but these disks cannot be replaced until the drain completes. The reason is that although ECE attempts to avoid sending I/O to defective disks, it does so if the alternative is data loss.
For example, suppose that an 8+2p code is used for a VDisk, two pdisks in the array are in failing state, and a third is in slow state. These three defective disks exceed the fault tolerance of the VDisk, but ECE reads from these disks while it rebuilds the data onto other disks.
If you see pdisks in unexpected states, some troubleshooting might be needed. For example, if you see pdisks in “missing” state when the hosting node is not down, service action is required to figure out why it is in “missing” state.
The following questions must be answered (the example commands after this list can help):
Does Linux see the device?
Does the device appear in the output of the tspreparedisk -s command?
Has the kernel logged errors from the device?
Is the drive present and seated in its socket?
Are there many missing drives with some hardware component in common?
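The following generic Linux and GPFS checks, run on the node that hosts the missing pdisk, can help answer these questions; the device name sdx is only a placeholder for the device in question:
# Does Linux see the block device?
lsblk
# Does the device still appear with its GPFS/ECE label?
/usr/lpp/mmfs/bin/tspreparedisk -s
# Has the kernel logged errors, resets, or removal events for the device?
dmesg -T | grep -iE "sdx|error|reset|offline" | tail -n 50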
7.6 Check each recovery group’s event log for messages
You can list the event log for each recovery group by using the mmvdisk recoverygroup list command:
mmvdisk recoverygroup list --recovery-group RgName --events
Not all messages indicate trouble. For example, drives are expected to have a certain frequency of errors (as many as 25 uncorrectable read errors in a year can be within the manufacturer's specification). Although a message such as “SCSI illegal request” appears to be an error, it is a SCSI drive’s way of reporting that it does not support a particular feature. But in general, the logs should be quiet; seeing any message printing frequently suggests some kind of trouble.
An example is many “I/O error” or “pdisk timeout” messages, pdisks going into the diagnosing state, the disk hospital finding no problem, and then pdisks returning to the “OK” state. Although these issues can indicate bad drives, more often they indicate that something is wrong in the I/O layers between the drive and IBM Spectrum Scale.
Asking the following questions can be helpful (the log-scanning example after this list can help answer them):
Are the errors isolated to only one drive? Are the errors common to a subset of drives sharing a common hardware failure domain, or are they spread across all drives?
If the errors are isolated to one drive, do they always stay with that drive, or do they slowly move from drive to drive (for example, following the drive with current highest queue depth)?
What errors is Linux reporting? Do they suggest connectivity problems, such as low-level SAS ACK/NAK timeouts?
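A quick way to start answering these questions is to count kernel I/O errors per device on each storage node. The following sketch is generic Linux log scanning, not an ECE-specific procedure:
# Count kernel I/O errors per SCSI device to see whether errors are isolated to one drive
dmesg -T | grep -i "i/o error" | grep -oE "sd[a-z]+" | sort | uniq -c | sort -rn
# Compare the error counts across all nodes to spot a shared hardware failure domain
mmdsh -N all 'dmesg -T | grep -ci "i/o error"'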
7.7 Using the mmhealth command with ECE
The use of the mmhealth command helps to monitor the health of the node, network, and services that are hosted on the node in an ECE system. Every service that is hosted on an ECE node has its own health monitoring service.
All subcomponents, such as the file system, network, or disk interfaces, are monitored through the monitoring service of their main component. The use of the mmhealth command provides the health details from these monitoring services.
If the status of a service that is hosted on any node is failed, mmhealth allows the user to view the event log to analyze and determine the problem. After a detailed analysis of these events, a set of troubleshooting steps can be followed to restore the failed service.
 
 
Nodes or services feature the following possible statuses:
UNKNOWN: The status of the component or the service that is hosted on the node is not known.
HEALTHY: The component or the service that is hosted on the node is working as expected. No active error events exist.
CHECKING: The monitoring of a service or a component that is hosted on the node is starting at the moment. This state is a transient state and is updated when the start is completed.
TIPS: An issue might exist with the configuration and tuning of the components. This status is assigned to a tip event only.
DEGRADED: The node or the service that is hosted on the node is not working as expected. That is, a problem occurred with the component, but it did not result in a complete failure.
FAILED: The component or the service that is hosted on the node failed because of errors or no longer can be reached.
DEPEND: The component or the services that are hosted on the node failed because of the failure of components on which they depend.
The statuses are graded in the following order of increasing severity:
HEALTHY < TIPS < DEGRADED < FAILED
For example, the status of a service that is hosted on a node becomes FAILED if at least one active event in the FAILED state exists for that service. When the status of the service is set, FAILED takes priority over DEGRADED, which is followed by TIPS and then HEALTHY. That is, if a service has one active event with a HEALTHY status and another active event with a FAILED status, the system sets the status of the service to FAILED.
7.8 System health monitoring use cases
In this section, the following use cases demonstrate the use of the mmhealth command to verify the health of ECE components. For more information about the components and services that are monitored by using the mmhealth command, see the “Monitoring system health by using mmhealth command” topic in IBM Spectrum Scale: Administration Guide:
To view the status of the current node, issue the following mmhealth command:
mmhealth node show
The output of the command is similar to the following example:
Node name: client25-ib0.sonasad.almaden.ibm.com
Node status: HEALTHY
Status Change: 2 days ago
 
Component Status Status Change Reasons
--------------------------------------------------------------
GPFS HEALTHY 2 days ago -
NETWORK HEALTHY 3 days ago -
FILESYSTEM HEALTHY 2 days ago -
DISK HEALTHY 2 days ago -
NATIVE_RAID HEALTHY 1 day ago -
To view the status of the subcomponents of the NATIVE_RAID component on an ECE storage node, issue the following command:
mmhealth node show NATIVE_RAID
The output of the command is similar to the following example:
Node name: client25-ib0.sonasad.almaden.ibm.com
 
Component Status Status Change Reasons
--------------------------------------------------------------------
NATIVE_RAID HEALTHY 19 hours ago -
ARRAY HEALTHY 19 hours ago -
NVME HEALTHY 1 day ago -
PHYSICALDISK HEALTHY 19 hours ago -
RECOVERYGROUP HEALTHY 19 hours ago -
VIRTUALDISK HEALTHY 19 hours ago -
 
There are no active error events for the component NATIVE_RAID on this node (client25-ib0.sonasad.almaden.ibm.com).
If the subcomponents of NATIVE_RAID have errors or warnings, the output of the mmhealth node show NATIVE_RAID command is similar to the following example:
Node name: cv1
 
Component Status Status Change Reasons
------------------------------------------------------------------------------------------
NATIVE_RAID DEGRADED 3 min. ago gnr_pdisk_replaceable(rg1/n001p023), gnr_array_needsservice(rg1/DA1)
ARRAY DEGRADED 27 min. ago gnr_array_needsservice(rg1/DA1)
PHYSICALDISK DEGRADED 27 min. ago gnr_pdisk_replaceable(rg1/n001p023)
RECOVERYGROUP HEALTHY 27 min. ago -
VIRTUALDISK HEALTHY 3 min. ago -
 
 
Event Parameter Severity Active Since Event Message
------------------------------------------------------------------------------------------
gnr_pdisk_replaceable rg1/n001p023 ERROR 27 min. ago GNR pdisk rg1/n001p023 is replaceable
gnr_array_needsservice rg1/DA1 WARNING 27 min. ago GNR declustered array rg1/DA1 needs service
Here, the pdisk rg1/n001p023 encountered errors that made the drive replaceable. This issue marked the PHYSICALDISK and ARRAY components as DEGRADED, which in turn marked the upper-level NATIVE_RAID component as DEGRADED. The Reasons column shows the events that caused the degradation of each component.
The status of low-level components of a node can be viewed by using the mmhealth node show command. For example, to view the status of physical disks of an ECE storage node, issue the following command:
mmhealth node show NATIVE_RAID PHYSICALDISK
 
The output of the command is similar to the following example:
Node name: client25-ib0.sonasad.almaden.ibm.com
 
Component Status Status Change Reasons
----------------------------------------------------------------
PHYSICALDISK HEALTHY 20 hours ago -
rg_1/n005p001 HEALTHY 1 day ago -
rg_1/n005p002 HEALTHY 1 day ago -
rg_1/n005p003 HEALTHY 1 day ago -
rg_1/n005p004 HEALTHY 1 day ago -
rg_1/n005p005 HEALTHY 1 day ago -
rg_1/n005p006 HEALTHY 1 day ago -
rg_1/n005p007 HEALTHY 1 day ago -
rg_1/n005p008 HEALTHY 1 day ago -
rg_1/n005p009 HEALTHY 1 day ago -
rg_1/n005p010 HEALTHY 1 day ago -
rg_1/n005p011 HEALTHY 1 day ago -
rg_1/n005p012 HEALTHY 1 day ago -
rg_1/n005p013 HEALTHY 1 day ago -
rg_1/n005p014 HEALTHY 1 day ago -
rg_1/n005p015 HEALTHY 1 day ago -
rg_1/n005p016 HEALTHY 1 day ago -
rg_1/n005p017 HEALTHY 1 day ago -
 
There are no active error events for the component PHYSICALDISK on this node (client25-ib0.sonasad.almaden.ibm.com).
To view the status of the subcomponents of a node, issue the mmhealth command:
mmhealth node show --verbose
 
The output of the command is similar to the following:
Node name: client22-ib0.sonasad.almaden.ibm.com
Node status: DEGRADED
Status Change: 2019-09-04 23:24:36
 
Component Status Status Change Reasons
-------------------------------------------------------------------------------------
GPFS HEALTHY 2019-09-04 23:24:32 -
NETWORK DEGRADED 2019-09-03 22:30:34 nic_firmware_not_available
ib0 HEALTHY 2019-09-03 22:30:34 -
FILESYSTEM HEALTHY 2019-09-04 23:25:31 -
gpfs_hd HEALTHY 2019-09-04 23:25:31 -
gpfs_hs HEALTHY 2019-09-04 23:25:31 -
DISK HEALTHY 2019-09-04 23:24:32 -
RG001LG001VS001 HEALTHY 2019-09-04 23:29:32 -
RG001LG001VS002 HEALTHY 2019-09-04 23:24:33 -
RG001LG001VS003 HEALTHY 2019-09-04 23:29:32 -
RG001LG002VS001 HEALTHY 2019-09-04 23:29:33 -
RG001LG002VS002 HEALTHY 2019-09-04 23:24:34 -
RG001LG002VS003 HEALTHY 2019-09-04 23:29:33 -
RG001LG003VS001 HEALTHY 2019-09-04 23:29:33 -
RG001LG003VS002 HEALTHY 2019-09-04 23:24:35 -
RG001LG003VS003 HEALTHY 2019-09-04 23:29:34 -
RG001LG004VS001 HEALTHY 2019-09-04 23:29:33 -
RG001LG004VS002 HEALTHY 2019-09-04 23:24:34 -
RG001LG004VS003 HEALTHY 2019-09-04 23:29:33 -
RG001LG005VS001 HEALTHY 2019-09-04 23:29:32 -
RG001LG005VS002 HEALTHY 2019-09-04 23:24:33 -
RG001LG005VS003 HEALTHY 2019-09-04 23:29:32 -
RG001LG006VS001 HEALTHY 2019-09-04 23:29:33 -
RG001LG006VS002 HEALTHY 2019-09-04 23:24:35 -
RG001LG006VS003 HEALTHY 2019-09-04 23:29:33 -
RG001LG007VS001 HEALTHY 2019-09-04 23:29:33 -
RG001LG007VS002 HEALTHY 2019-09-04 23:24:35 -
RG001LG007VS003 HEALTHY 2019-09-04 23:29:33 -
RG001LG008VS001 HEALTHY 2019-09-04 23:29:32 -
RG001LG008VS002 HEALTHY 2019-09-04 23:24:33 -
RG001LG008VS003 HEALTHY 2019-09-04 23:29:32 -
RG001LG009VS001 HEALTHY 2019-09-04 23:29:33 -
RG001LG009VS002 HEALTHY 2019-09-04 23:24:35 -
RG001LG009VS003 HEALTHY 2019-09-04 23:29:33 -
RG001LG010VS001 HEALTHY 2019-09-04 23:29:33 -
RG001LG010VS002 HEALTHY 2019-09-04 23:24:34 -
RG001LG010VS003 HEALTHY 2019-09-04 23:29:33 -
RG001LG011VS001 HEALTHY 2019-09-04 23:24:32 -
RG001LG011VS002 HEALTHY 2019-09-04 23:24:32 -
RG001LG011VS003 HEALTHY 2019-09-04 23:29:32 -
RG001LG012VS001 HEALTHY 2019-09-04 23:24:34 -
RG001LG012VS002 HEALTHY 2019-09-04 23:24:34 -
RG001LG012VS003 HEALTHY 2019-09-04 23:29:33 -
RG001LG013VS001 HEALTHY 2019-09-04 23:24:32 -
RG001LG013VS002 HEALTHY 2019-09-04 23:24:32 -
RG001LG013VS003 HEALTHY 2019-09-04 23:29:32 -
RG001LG014VS001 HEALTHY 2019-09-04 23:24:33 -
RG001LG014VS002 HEALTHY 2019-09-04 23:24:33 -
RG001LG014VS003 HEALTHY 2019-09-04 23:29:32 -
RG001LG015VS001 HEALTHY 2019-09-04 23:24:35 -
RG001LG015VS002 HEALTHY 2019-09-04 23:24:34 -
RG001LG015VS003 HEALTHY 2019-09-04 23:29:33 -
RG001LG016VS001 HEALTHY 2019-09-04 23:24:35 -
RG001LG016VS002 HEALTHY 2019-09-04 23:24:34 -
RG001LG016VS003 HEALTHY 2019-09-04 23:29:33 -
RG001LG017VS001 HEALTHY 2019-09-04 23:24:32 -
RG001LG017VS002 HEALTHY 2019-09-04 23:24:32 -
RG001LG017VS003 HEALTHY 2019-09-04 23:29:32 -
RG001LG018VS001 HEALTHY 2019-09-04 23:24:35 -
RG001LG018VS002 HEALTHY 2019-09-04 23:24:35 -
RG001LG018VS003 HEALTHY 2019-09-04 23:29:32 -
RG001LG019VS001 HEALTHY 2019-09-04 23:24:33 -
RG001LG019VS002 HEALTHY 2019-09-04 23:24:33 -
RG001LG019VS003 HEALTHY 2019-09-04 23:29:32 -
RG001LG020VS001 HEALTHY 2019-09-04 23:24:34 -
RG001LG020VS002 HEALTHY 2019-09-04 23:24:34 -
RG001LG020VS003 HEALTHY 2019-09-04 23:29:32 -
NATIVE_RAID HEALTHY 2019-09-05 10:02:46 -
ARRAY HEALTHY 2019-09-05 09:14:45 -
rg_1/DA1 HEALTHY 2019-09-05 09:14:52 -
rg_1/DA2 HEALTHY 2019-09-05 09:14:52 -
rg_1/DA3 HEALTHY 2019-09-05 09:14:52 -
NVME HEALTHY 2019-09-03 22:30:20 -
/dev/nvme0 HEALTHY 2019-09-03 22:30:20 -
PHYSICALDISK HEALTHY 2019-09-05 09:14:45 -
rg_1/n002p001 HEALTHY 2019-09-04 23:26:53 -
rg_1/n002p002 HEALTHY 2019-09-04 23:26:52 -
rg_1/n002p003 HEALTHY 2019-09-04 23:26:52 -
rg_1/n002p004 HEALTHY 2019-09-04 23:26:53 -
rg_1/n002p005 HEALTHY 2019-09-04 23:26:52 -
rg_1/n002p006 HEALTHY 2019-09-04 23:26:53 -
rg_1/n002p007 HEALTHY 2019-09-04 23:26:53 -
rg_1/n002p008 HEALTHY 2019-09-04 23:26:53 -
rg_1/n002p009 HEALTHY 2019-09-04 23:26:53 -
rg_1/n002p010 HEALTHY 2019-09-04 23:26:53 -
rg_1/n002p011 HEALTHY 2019-09-04 23:26:53 -
rg_1/n002p012 HEALTHY 2019-09-04 23:26:53 -
rg_1/n002p013 HEALTHY 2019-09-05 01:08:44 -
rg_1/n002p014 HEALTHY 2019-09-04 23:26:53 -
rg_1/n002p015 HEALTHY 2019-09-04 23:26:53 -
rg_1/n002p016 HEALTHY 2019-09-04 23:26:53 -
rg_1/n002p017 HEALTHY 2019-09-04 23:26:53 -
RECOVERYGROUP HEALTHY 2019-09-05 09:14:45 -
rg_1 HEALTHY 2019-09-05 09:14:45 -
VIRTUALDISK HEALTHY 2019-09-05 10:02:46 -
RG001LG001LOGHOME HEALTHY 2019-09-05 09:14:51 -
RG001LG001VS001 HEALTHY 2019-09-05 09:14:46 -
RG001LG001VS002 HEALTHY 2019-09-05 10:02:46 -
RG001LG001VS003 HEALTHY 2019-09-05 09:14:46 -
RG001LG002LOGHOME HEALTHY 2019-09-05 09:14:49 -
RG001LG002VS001 HEALTHY 2019-09-05 09:14:50 -
RG001LG002VS002 HEALTHY 2019-09-05 10:02:47 -
RG001LG002VS003 HEALTHY 2019-09-05 09:14:50 -
RG001LG003LOGHOME HEALTHY 2019-09-05 09:14:47 -
RG001LG003VS001 HEALTHY 2019-09-05 09:14:48 -
RG001LG003VS002 HEALTHY 2019-09-05 10:02:46 -
RG001LG003VS003 HEALTHY 2019-09-05 09:14:48 -
RG001LG004LOGHOME HEALTHY 2019-09-05 09:14:46 -
RG001LG004VS001 HEALTHY 2019-09-05 09:14:49 -
RG001LG004VS002 HEALTHY 2019-09-05 10:02:47 -
RG001LG004VS003 HEALTHY 2019-09-05 09:14:51 -
RG001LG005LOGHOME HEALTHY 2019-09-05 09:14:51 -
RG001LG005VS001 HEALTHY 2019-09-05 09:14:51 -
RG001LG005VS002 HEALTHY 2019-09-05 10:02:46 -
RG001LG005VS003 HEALTHY 2019-09-05 09:14:46 -
RG001LG006LOGHOME HEALTHY 2019-09-05 09:14:49 -
RG001LG006VS001 HEALTHY 2019-09-05 09:14:52 -
RG001LG006VS002 HEALTHY 2019-09-05 10:02:47 -
RG001LG006VS003 HEALTHY 2019-09-05 09:14:51 -
RG001LG007LOGHOME HEALTHY 2019-09-05 09:14:46 -
RG001LG007VS001 HEALTHY 2019-09-05 09:14:51 -
RG001LG007VS002 HEALTHY 2019-09-05 10:02:47 -
RG001LG007VS003 HEALTHY 2019-09-05 09:14:51 -
RG001LG008LOGHOME HEALTHY 2019-09-05 09:14:48 -
RG001LG008VS001 HEALTHY 2019-09-05 09:14:49 -
RG001LG008VS002 HEALTHY 2019-09-05 10:02:46 -
RG001LG008VS003 HEALTHY 2019-09-05 09:14:49 -
RG001LG009LOGHOME HEALTHY 2019-09-05 09:14:50 -
RG001LG009VS001 HEALTHY 2019-09-05 09:14:48 -
RG001LG009VS002 HEALTHY 2019-09-05 10:02:47 -
RG001LG009VS003 HEALTHY 2019-09-05 09:14:51 -
RG001LG010LOGHOME HEALTHY 2019-09-05 09:14:46 -
RG001LG010VS001 HEALTHY 2019-09-05 09:14:50 -
RG001LG010VS002 HEALTHY 2019-09-05 10:02:46 -
RG001LG010VS003 HEALTHY 2019-09-05 09:14:50 -
RG001LG011LOGHOME HEALTHY 2019-09-05 09:14:47 -
RG001LG011VS001 HEALTHY 2019-09-05 09:14:46 -
RG001LG011VS002 HEALTHY 2019-09-05 10:02:46 -
RG001LG011VS003 HEALTHY 2019-09-05 09:14:45 -
RG001LG012LOGHOME HEALTHY 2019-09-05 09:14:48 -
RG001LG012VS001 HEALTHY 2019-09-05 09:14:47 -
RG001LG012VS002 HEALTHY 2019-09-05 10:02:46 -
RG001LG012VS003 HEALTHY 2019-09-05 09:14:47 -
RG001LG013LOGHOME HEALTHY 2019-09-05 09:14:46 -
RG001LG013VS001 HEALTHY 2019-09-05 09:14:48 -
RG001LG013VS002 HEALTHY 2019-09-05 10:02:46 -
RG001LG013VS003 HEALTHY 2019-09-05 09:14:48 -
RG001LG014LOGHOME HEALTHY 2019-09-05 09:14:51 -
RG001LG014VS001 HEALTHY 2019-09-05 09:14:47 -
RG001LG014VS002 HEALTHY 2019-09-05 10:02:46 -
RG001LG014VS003 HEALTHY 2019-09-05 09:14:47 -
RG001LG015LOGHOME HEALTHY 2019-09-05 09:14:47 -
RG001LG015VS001 HEALTHY 2019-09-05 09:14:47 -
RG001LG015VS002 HEALTHY 2019-09-05 10:02:46 -
RG001LG015VS003 HEALTHY 2019-09-05 09:14:47 -
RG001LG016LOGHOME HEALTHY 2019-09-05 09:14:49 -
RG001LG016VS001 HEALTHY 2019-09-05 09:14:47 -
RG001LG016VS002 HEALTHY 2019-09-05 10:02:46 -
RG001LG016VS003 HEALTHY 2019-09-05 09:14:47 -
RG001LG017LOGHOME HEALTHY 2019-09-05 09:14:51 -
RG001LG017VS001 HEALTHY 2019-09-05 09:14:46 -
RG001LG017VS002 HEALTHY 2019-09-05 10:02:46 -
RG001LG017VS003 HEALTHY 2019-09-05 09:14:46 -
RG001LG018LOGHOME HEALTHY 2019-09-05 09:14:51 -
RG001LG018VS001 HEALTHY 2019-09-05 09:14:51 -
RG001LG018VS002 HEALTHY 2019-09-05 10:02:46 -
RG001LG018VS003 HEALTHY 2019-09-05 09:14:48 -
RG001LG019LOGHOME HEALTHY 2019-09-05 09:14:48 -
RG001LG019VS001 HEALTHY 2019-09-05 09:14:46 -
RG001LG019VS002 HEALTHY 2019-09-05 10:02:46 -
RG001LG019VS003 HEALTHY 2019-09-05 09:14:46 -
RG001LG020LOGHOME HEALTHY 2019-09-05 09:14:46 -
RG001LG020VS001 HEALTHY 2019-09-05 09:14:50 -
RG001LG020VS002 HEALTHY 2019-09-05 10:02:47 -
RG001LG020VS003 HEALTHY 2019-09-05 09:14:49 -
RG001ROOTLOGHOME HEALTHY 2019-09-05 09:14:51 -
To view more information about, and the user action for, any event that caused a failure or degradation of a component, use the mmhealth event show command. For example, the mmhealth node show --verbose command shows that the NETWORK component is degraded, and the Reasons column shows the event nic_firmware_not_available.
To view a more detailed description of the nic_firmware_not_available event, issue the following mmhealth command:
mmhealth event show nic_firmware_not_available
The output of the command is similar to the following example:
Event Name: nic_firmware_not_available
Event ID: 998136
Description: The expected firmware level is not available.
Cause: /usr/lpp/mmfs/bin/tslshcafirmware -Y does not return any expected firmware level for this adapter
User Action: /usr/lpp/mmfs/bin/tslshcafirmware -Y does not return any firmware level for the expectedFirmware field. Check if it is working as expecting. This command uses /usr/lpp/mmfs/updates/latest/firmware/hca/FirmwareInfo.hca, which is provided with the ECE packages. Check if the file is available and accessible.
Severity: WARNING
State: DEGRADED
To view the eventlog history of the node, issue the following command:
mmhealth node eventlog
The output of the command is similar to the following example:
Node name: client22-ib0.sonasad.almaden.ibm.com
Timestamp Event Name Severity Details
2019-08-21 21:02:47.136304 PDT disk_vanished INFO
The disk RG001LG001VS002 has vanished
2019-08-21 21:02:47.188529 PDT disk_vanished INFO
The disk RG001LG001VS003 has vanished
2019-08-21 21:02:47.256690 PDT disk_vanished INFO
The disk RG001LG001VS001 has vanished
2019-08-21 21:02:47.308793 PDT disk_vanished INFO
The disk RG001LG019VS003 has vanished
2019-08-21 21:02:47.368934 PDT disk_vanished INFO
The disk RG001LG019VS002 has vanished
2019-08-21 21:02:47.429105 PDT disk_vanished INFO
The disk RG001LG019VS001 has vanished
2019-08-21 21:02:47.481211 PDT disk_vanished INFO
The disk RG001LG006VS002 has vanished
2019-08-21 21:02:47.549418 PDT disk_vanished INFO
The disk RG001LG005VS002 has vanished
2019-08-21 21:02:47.609433 PDT disk_vanished INFO
The disk RG001LG005VS003 has vanished
2019-08-21 21:02:47.664537 PDT disk_vanished INFO
The disk RG001LG005VS001 has vanished
2019-08-21 21:02:47.729973 PDT disk_vanished INFO
The disk RG001LG008VS001 has vanished
2019-08-21 21:02:47.784668 PDT disk_vanished INFO
The disk RG001LG008VS003 has vanished
2019-08-21 21:02:47.840647 PDT disk_vanished INFO
The disk RG001LG008VS002 has vanished
2019-08-21 21:02:47.908559 PDT disk_vanished INFO
The disk RG001LG020VS001 has vanished
2019-08-21 21:02:47.965598 PDT disk_vanished INFO
The disk RG001LG004VS001 has vanished
2019-08-21 21:02:48.032636 PDT disk_vanished INFO
The disk RG001LG015VS002 has vanished
To view the health status summary of the cluster, issue the following command:
mmhealth cluster show
The output of the command is similar to the following example:
Component Total Failed Degraded Healthy Other
------------------------------------------------------------------------------------------
NODE 10 0 4 6 0
GPFS 10 0 0 9 1
NETWORK 10 0 4 6 0
FILESYSTEM 2 0 0 2 0
DISK 60 0 0 60 0
GUI 1 0 1 0 0
NATIVE_RAID 10 0 1 9 0
PERFMON 1 0 0 1 0
THRESHOLD 1 0 0 1 0
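When the cluster summary shows degraded components, you can drill down node by node. The following sketch assumes that your mmhealth level supports a component argument for mmhealth cluster show and the --unhealthy and -N options of mmhealth node show; the node name is only an example:
# List the per-node status of the NATIVE_RAID component across the cluster
mmhealth cluster show NATIVE_RAID
# Show only the unhealthy components and their events on a specific node
mmhealth node show --unhealthy -N client22-ib0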
7.9 Collecting data for problem determination
Regardless of the problem that is encountered with the ECE system, the following data must be collected when contacting the IBM Support Center:
A description of the problem.
A tar file that is generated by the gpfs.snap command that contains data from the nodes in the ECE cluster. In large clusters, the gpfs.snap command can collect data from selected nodes by using the -N option (see the example after this list).
If the gpfs.snap command cannot be run, collect the following data manually:
 – On a Linux node, create a tar file of all entries in the /var/log/messages file from all nodes in the cluster or the nodes that experienced the failure.
 – A master GPFS log file that is merged and chronologically sorted for the date of the failure. For more information about creating a master GPFS log file, see IBM Spectrum Scale: Problem Determination Guide.
 – If the cluster was configured to store dumps, collect any internal GPFS dumps that were written to that directory that relates to the time of the failure. The default directory is /tmp/mmfs.
 – On a failing Linux node, gather the installed software packages and the versions of each package by issuing the rpm -qa command.
 – For file system attributes for all of the failing file systems, issue the mmlsfs Device command.
 – For the current configuration and state of the disks for all of the failing file systems, issue the mmlsdisk Device command.
 – A copy of file /var/mmfs/gen/mmsdrfs from the primary cluster configuration server.
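As an example of the gpfs.snap collection that is mentioned above, the following sketch limits the collection to the ECE storage nodes and writes the package to a specific directory. The node class name ece_storage and the output directory are only examples; the available options for your level are described in IBM Knowledge Center:
# Collect snap data from the nodes in the ECE storage node class only
gpfs.snap -N ece_storage -d /tmp/ece_snap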
In addition to these items, more trace information is needed to assist with problem diagnosis in certain situations. Complete the following steps:
1. Ensure that the /tmp/mmfs directory exists on all nodes. If this directory does not exist, the GPFS daemon does not generate internal dumps.
2. Set the traces on this ECE cluster:
mmtracectl --set --trace=def --trace-recycle=global
3. Start the trace facility by issuing the following command:
mmtracectl --start
4. Recreate the problem, if possible.
5. After the problem is encountered on the node, turn off the trace facility by issuing the following command:
mmtracectl --off
6. Collect gpfs.snap output by using the following command:
gpfs.snap
For performance issues, the following procedure must be performed to collect data at the time of the performance problem. In the following example, the parameter -N all is used. However, for large clusters you might want to substitute a subset of nodes; for example, the mmvdisk node class that is used by the recovery group that is being analyzed:
1. Reset the internal counters and start the GPFS trace facility to collect trace in “overwrite” mode for a reasonable amount of time during which the performance problem is observed. Stop the trace, gather the internal counters, and collect gpfs.snap output. The following high-level procedure is used:
a. Specify the directory where trace data will be collected by using the DUMP_DIR environment variable. This directory must exist on each node and be part of a file system with adequate space for collecting trace data.
We recommend creating a directory under the directory that is specified by the dataStructureDump Spectrum Scale configuration value; for example, /tmp/mmfs/perf_issue. This way, the data is collected and packaged into a single file by using the gpfs.snap command. For more information about the gpfs.snap command and its use of dataStructureDump, see IBM Knowledge Center.
DUMP_DIR=/tmp/mmfs/perf_issue
b. Set the Trace configuration:
mmtracectl --set --trace=def --tracedev-write-mode=overwrite --tracedev-overwrite-buffer-size=2G -N all
c. Reset the I/O counters and I/O history on all of the nodes by using the following commands:
i. mmdsh -N all "/usr/lpp/mmfs/bin/mmfsadm resetstats"
ii. mmdsh -N all "/usr/lpp/mmfs/bin/mmfsadm dump iohist > /dev/null"
d. Start the trace:
mmtracectl --start -N all
e. Trigger the performance issue.
f. Stop the trace:
mmtracectl --stop -N all
g. Gather the I/O counters and I/O history on all of the nodes by using the following commands:
i. mmdsh -N all "/usr/lpp/mmfs/bin/mmfsadm dump iocounters > $DUMP_DIR/$(cat /etc/hostname).iocounters.txt"
ii. mmdsh -N all "/usr/lpp/mmfs/bin/mmfsadm dump iohist > $DUMP_DIR/$(cat /etc/hostname).iohist.txt"
h. Disable Trace config after Trace Capture:
mmtracectl --off -N all
i. Capture GPFS Snap:
gpfs.snap
2. In parallel, collect viostat output on each node. For example, the following command takes viostat output every 5 seconds, for up to 1000 samples, on each node:
mmdsh -N all "/usr/lpp/mmfs/samples/vdisk/viostat --timestamp 5 1000 | tee -a $DUMP_DIR/$(cat /etc/hostname).viostat"
3. Capture GPFS Snap:
gpfs.snap
7.10 Network tools
IBM Spectrum Scale is a distributed parallel file system. Like any other network distributed file system, it is highly dependent on the network stability, throughput, and latency.
Many network synthetic benchmarking tools are available to assess the different types of networks on which IBM Spectrum Scale can run. Because we cannot cover every network tool in this publication, consider the following points as you decide which tool to use to assess the network that runs IBM Spectrum Scale:
IBM Spectrum Scale traffic is not a one-to-one flow or a one-to-many flow; it is many-to-many.
If only an Ethernet network is used, IBM Spectrum Scale uses a single TCP socket between each pair of nodes for data transfer. This does not apply to RDMA or RoCE networks.
Always start by creating a baseline data set that can be used to compare with measurements that are taken later. Because it is your first measurement, it is not considered slow or fast, only your starting baseline.
Because of the nature of many-to-many flows of IBM Spectrum Scale, the inter-switch links (ISL) are important. If starvation exists on any layer by way of ISL, performance degradation can occur.
Ethernet link aggregation, in particular LACP (also referred to as 802.3ad or 802.1AX), does not give a single flow the aggregate bandwidth of all the individual links. A single flow still uses only one port at any time.
Networks evolve: the number of hosts changes, workloads change, and so on. When a network change occurs, measure again and create a new baseline. Also, always keep time stamps of those changes.
Network tuning is a journey, not a one-time event. As with any tuning, you measure, change, measure again and compare, and act based on the data.
Performance is not a feeling. It must be based on hard evidence and data.
Although you can use any tool that you are comfortable with, several tools were developed for IBM Spectrum Scale to assess and measure networks. These tools, which include nsdperf, gpfsperf, and mmnetverify, apply to any IBM Spectrum Scale installation, not only to ECE.
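For example, mmnetverify can run basic connectivity, port, and data checks between the cluster nodes. The following sketch uses a small subset of the available operations; larger bandwidth and flood tests stress the network and should be scheduled outside production peaks:
# Check node-to-node connectivity and daemon port reachability from all nodes
mmnetverify connectivity port -N all
# Run a moderate data transfer test between all nodes
mmnetverify data-medium -N all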
One specific tool, SpectrumScale_NETWORK_READINESS, was introduced with ECE. On any supported ECE installation, this tool must be run and must pass before the product is installed, which also gives you a baseline. You can then always refer to that data as your baseline for future comparisons.
Furthermore, this tool can be used to measure the network performance metrics for any IBM Spectrum Scale network. The tool stresses the Spectrum Scale network and must be run with care on a production cluster.