Problem determination and debugging an ECE system
The first step in problem determination is to conduct a high-level analysis of the state of the system and isolate where problems might occur.
In this chapter, we provide a checklist that is a good starting point for that analysis.
This chapter includes the following topics:
7.1, "Check whether the ECE nodes are active in the cluster"
7.2, "Check whether the recovery groups are active"
7.3, "Check for pdisks that are ready for replacement"
7.4, "Check for pdisks that are not in OK state"
7.5, "Pdisk states"
7.6, "Check each recovery group's event log for messages"
7.7, "Using the mmhealth command with ECE"
7.8, "System health monitoring use cases"
7.9, "Collecting data for problem determination"
7.10, "Network tools"
7.1 Check whether the ECE nodes are active in the cluster
Use the mmgetstate command to check the state of all cluster nodes:
mmgetstate -a
 
Node number Node name GPFS state
-------------------------------------------
1 node1 active
2 node2 active
3 node3 active
4 node4 active
5 node5 active
6 node6 active
7 node7 active
8 node8 active
9 node9 active
10 node10 active
If any nodes are not in active state, check that the nodes are up and check for network problems. Resolve any problems before moving forward.
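For example, if node3 reports a state other than active, a minimal check-and-restart sequence might look like the following example (the node name node3 is only an illustration):
# Check the state of the suspect node
mmgetstate -N node3
# Verify basic IP connectivity over the daemon network
ping -c 3 node3
# If the node is reachable but GPFS is down, start the daemon on it
mmstartup -N node3
# Confirm that the node returns to the active state
mmgetstate -N node3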
7.2 Check whether the recovery groups are active
Use the mmvdisk rg list --not-ok command to list recovery groups that are not healthy:
mmvdisk rg list --not-ok
 
recovery group remarks
-------------- -------
rg1 down
If any recovery groups are not active, check the mmfs.log.* files on the ECE storage nodes for resign messages. Example 7-1 shows what we expect to see if too many nodes or pdisks are down.
Example 7-1 Example log output showing RG resign
2019-08-26_17:34:04.208-0700: [E] RG rg1 pdisk n009p011 slot both: disk state missing/0048.000; slot status disk unavailable.
2019-08-26_17:34:04.208-0700: [E] RG rg1: Total of 36 unreadable pdisk slots.
2019-08-26_17:34:32.897-0700: [E] Failed to recover RG rg1, error code 217.
2019-08-26_17:34:32.898-0700: [E] Beginning to resign recovery group rg1 due to "recovery failure", caller err 217 when "recovering RG master"
2019-08-26_17:34:32.916-0700: [I] Finished resigning recovery group rg1
In this situation, IBM Spectrum Scale ECE continually retries the recovery until enough pdisks are available.
You might encounter other resign messages when too many pdisks are down, including the following example:
2019-08-23_13:09:30.865-0700: [E] Beginning to resign recovery group rg1 due to "vdisk IO failure with unavailable pdisks", caller err 5 when "root log group resign"
Other resign messages must be analyzed by IBM Support.
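To locate resign messages quickly, you can search the current GPFS log on the ECE storage nodes. The following sketch assumes the default log location of /var/adm/ras and uses mmdsh to run the search on all nodes:
# Search the latest GPFS log on this node for resign-related messages
grep -i resign /var/adm/ras/mmfs.log.latest
# Search all nodes in the cluster at the same time
mmdsh -N all "grep -i resign /var/adm/ras/mmfs.log.latest"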
7.3 Check for pdisks that are ready for replacement
Use the mmvdisk pdisk list --replace command to find pdisks that are ready for replacement:
mmvdisk pdisk list --recovery-group rg1 --replace
 
declustered
recovery group pdisk array paths capacity free space FRU (type) state
-------------- ------------ ----------- ----- -------- ---------- --------------- -----
rg1 n001p001 DA1 0 931 GiB 928 GiB ST91000640NS failing/drained
If any pdisks are listed, use the mmvdisk pdisk replace command to replace the failed drives.
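The following sketch outlines a typical replacement sequence for the pdisk that is shown in the previous example. The recovery group and pdisk names are only examples; confirm the exact options for your code level in IBM Knowledge Center:
# Prepare the pdisk for removal (the drive is suspended and, where supported, its identify LED is lit)
mmvdisk pdisk replace --prepare --recovery-group rg1 --pdisk n001p001
# Physically remove the old drive and insert the replacement drive
# Complete the replacement so that ECE formats the new drive and rebalances data onto it
mmvdisk pdisk replace --recovery-group rg1 --pdisk n001p001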
7.4 Check for pdisks that are not in OK state
Use the mmvdisk pdisk list --not-ok command to find pdisks that are in a state other than “ok”:
mmvdisk pdisk list --recovery-group rg1 --not-ok
This command lists pdisks in states other than “OK”. It includes the pdisks that are ready for replacement with mmvdisk pdisk list --replace, but it also includes pdisks in various other states that are not called out for replacement. These include unavailable, draining, transient, and maintenance states, as shown in the following example:
declustered
recovery group pdisk array paths capacity free space FRU (type) state
-------------- ------------ ----------- ----- -------- ---------- --------------- -----
rg1 n001p001 DA1 0 931 GiB 928 GiB ST91000640NS failing/replace
rg1 n001p002 DA1 0 931 GiB 928 GiB ST91000640NS slow/draining
rg1 n001p003 DA1 0 931 GiB 52 GiB ST91000640NS missing/drained
rg1 n001p004 DA1 0 931 GiB 10 GiB ST91000640NS diagnosing
The pdisk state consists of several flags that combine to form the overall state. When the state is displayed, not all flags are always shown. For example, if a pdisk is in “failing” and “slow” states, only “failing” is displayed because it is the more important state.
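To see the full set of state flags (and other details, such as paths and error counters) for a single pdisk, you can list only that pdisk. The following sketch assumes that your mmvdisk level supports the --pdisk and -L options of the mmvdisk pdisk list command; the pdisk name is only an example:
# Show detailed attributes, including all state flags, for one pdisk
mmvdisk pdisk list --recovery-group rg1 --pdisk n001p002 -L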
The pdisk states are described next.
 
7.5 Pdisk states
Pdisk states fall into the following categories:
Normal state is OK, which indicates that the pdisk is operating normally.
States indicating that the drive is defective:
 – Dead
ECE can communicate with the drive, but the drive cannot read or write data.
 – Read Only
Some blocks of the drive can no longer be successfully written, so ECE designated the entire drive as read-only.
 – Failing
The uncorrectable error rate of the drive is too high. The drive reported a predicted failure through Self-Monitoring, Analysis, and Reporting Technology (SMART), or the disk hospital determined that the medium error rate is too high.
 – Slow
Performance of this drive is slow compared to other drives in the array.
Transient states:
 – Diagnosing
The disk hospital is checking the drive.
 – PTOW
A write request is still pending on the drive after ECE timed out on the I/O. The state stands for pending timed-out write. Further writes are disallowed until the pending writes complete.
Longer-term unavailable states:
 – Missing
ECE has no I/O connectivity with the drive. The most common meaning of this state is that the server that is hosting the drive is down, but the state also can indicate any of the following issues:
 • The drive was physically removed from the system.
 • The drive is not fully plugged into its socket.
 • The drive is not receiving power.
 • A cable is not connected.
 • An I/O expander or I/O HBA is not working.
 • The GPFS labels on the drive were overwritten.
 • The drive failed such that it does not respond to inquiry or identify commands (rare).
Other causes must be ruled out before missing drives are replaced.
 – VWCE
The drive reports that it has volatile write caching enabled. The drive must be reconfigured to disable volatile write caching before ECE uses it.
Test and maintenance states:
 – SimulatedDead
The mmvdisk pdisk change --simulate-dead command was used to inject the equivalent of “dead” state. All data from the disk is rebuilt into the spare space of other drives. When the rebuild process completes, the pdisk is replaceable.
 
Note: This test is not recommended on a production system because putting too many disks into this state causes permanent data loss.
 – SimulatedFailing
The mmvdisk pdisk change --simulate-failing command was used to inject the equivalent of “failing” state. All data from the disk is rebuilt into the spare space of other drives. When the rebuild completes, the pdisk is replaceable. This test is safer than simulatedDead because the system can still read and write the drive, if necessary.
 – Suspended
The pdisk was suspended for all I/O activity for a maintenance action, such as upgrading the drive firmware. For more information, see the mmvdisk pdisk change command, --suspend and --resume options, in IBM Knowledge Center. An example of this workflow is shown after this list.
 – ServiceDrain
All data from the pdisk is being temporarily drained in preparation for a maintenance operation. For more information, see the mmvdisk pdisk change command, --begin-service-drain and --end-service-drain options, in IBM Knowledge Center.
 – PathMaintenance
An I/O path was temporarily taken offline in preparation for a maintenance operation such as upgrading HBA firmware.
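For example, a drive firmware upgrade on a single pdisk might use the --suspend and --resume options that are described above. The recovery group and pdisk names are only examples; confirm the exact procedure for your hardware in IBM Knowledge Center or with IBM Support:
# Suspend the pdisk before the maintenance action
mmvdisk pdisk change --recovery-group rg1 --pdisk n001p005 --suspend
# ... upgrade the drive firmware by using the vendor tool ...
# Resume the pdisk after the maintenance action completes
mmvdisk pdisk change --recovery-group rg1 --pdisk n001p005 --resume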
Deleting, draining, and replace states:
 – deleting
The pdisk is in the process of being deleted from the recovery group.
 – draining
Data from the pdisk is being rebuilt into the spare space of other drives.
 – drained
All data was drained from the pdisk.
 – Undrainable
The DA replacement threshold is set higher than the number of data spares and this pdisk was made replaceable without draining its data to meet the threshold. Replacing an undrainable pdisk effectively lowers the fault tolerance of all VDisks in the DA by one fault. Proceed with care.
 – Replace
The pdisk is replaceable and the replacement threshold was met in the DA.
Except for the pdisks that are listed by using the mmvdisk pdisk list --replace command, most of the pdisk states that are listed here do not necessarily require any user action. For example, defective pdisks that are still in the process of being drained appear here, but these disks cannot be replaced until the drain completes. The reason is that although ECE attempts to avoid sending I/O to defective disks, it does so if the alternative is data loss.
For example, suppose that an 8+2p code is used for a VDisk, two pdisks in the array are in failing state, and a third is in slow state. These three defective disks exceed the fault tolerance of the VDisk, but ECE reads from these disks while it rebuilds the data onto other disks.
If you see pdisks in unexpected states, some troubleshooting might be needed. For example, if you see pdisks in “missing” state when the hosting node is not down, service action is required to figure out why it is in “missing” state.
The following questions must be answered (the example commands after this list can help):
Does Linux see the device?
Does the device appear in the output of the tspreparedisk -s command?
Has the kernel logged errors from the device?
Is the drive present and seated in its socket?
Are there many missing drives with some hardware component in common?
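The following generic Linux and GPFS checks, run on the node that hosts the missing pdisk, can help answer these questions; the device name sdx is only a placeholder for the device in question:
# Does Linux see the block device?
lsblk
# Does the device still appear with its GPFS/ECE label?
/usr/lpp/mmfs/bin/tspreparedisk -s
# Has the kernel logged errors, resets, or removal events for the device?
dmesg -T | grep -iE "sdx|error|reset|offline" | tail -n 50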
7.6 Check each recovery group’s event log for messages
You can list the event log for each recovery group by using the mmvdisk recoverygroup list command:
mmvdisk recoverygroup list --recovery-group RgName --events
Not all messages indicate trouble. For example, drives are expected to have a certain frequency of errors (as many as 25 uncorrectable read errors in a year can be within the manufacturer's specification). Although a message such as “SCSI illegal request” appears to be an error, it is a SCSI drive’s way of reporting that it does not support a particular feature. But in general, the logs should be quiet; seeing any message printing frequently suggests some kind of trouble.
An example is many “I/O error” or “pdisk timeout” messages, pdisks going into the diagnosing state, the disk hospital finding no problem, and then pdisks returning to the “OK” state. Although these issues can indicate bad drives, more often they indicate that something is wrong in the I/O layers between the drive and IBM Spectrum Scale.
Asking the following questions can be helpful (the log-scanning example after this list can help answer them):
Are the errors isolated to only one drive? Are the errors common to a subset of drives sharing a common hardware failure domain, or are they spread across all drives?
If the errors are isolated to one drive, do they always stay with that drive, or do they slowly move from drive to drive (for example, following the drive with current highest queue depth)?
What errors is Linux reporting? Do they suggest connectivity problems, such as low-level SAS ACK/NAK timeouts?
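A quick way to start answering these questions is to count kernel I/O errors per device on each storage node. The following sketch is generic Linux log scanning, not an ECE-specific procedure:
# Count kernel I/O errors per SCSI device to see whether errors are isolated to one drive
dmesg -T | grep -i "i/o error" | grep -oE "sd[a-z]+" | sort | uniq -c | sort -rn
# Compare the error counts across all nodes to spot a shared hardware failure domain
mmdsh -N all 'dmesg -T | grep -ci "i/o error"'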
7.7 Using the mmhealth command with ECE
The use of the mmhealth command helps to monitor the health of the node, network, and services that are hosted on the node in an ECE system. Every service that is hosted on an ECE node has its own health monitoring service.
All subcomponents, such as the file system, network, or disk interfaces, are monitored through the monitoring service of their main component. The use of the mmhealth command provides the health details from these monitoring services.
If the status of a service that is hosted on any node is failed, mmhealth allows the user to view the event log to analyze and determine the problem. After a detailed analysis of these events, a set of troubleshooting steps can be followed to restore the failed service.
 
 
Nodes or services feature the following possible statuses:
UNKNOWN: The status of the component or the service that is hosted on the node is not known.
HEALTHY: The component or the service that is hosted on the node is working as expected. No active error events exist.
CHECKING: The monitoring of a service or a component that is hosted on the node is starting at the moment. This state is a transient state and is updated when the start is completed.
TIPS: An issue might exist with the configuration and tuning of the components. This status is assigned to a tip event only.
DEGRADED: The node or the service that is hosted on the node is not working as expected. That is, a problem occurred with the component, but it did not result in a complete failure.
FAILED: The component or the service that is hosted on the node failed because of errors or no longer can be reached.
DEPEND: The component or the services that are hosted on the node failed because of the failure of components on which they depend.
The statuses are graded in the following order of increasing severity:
HEALTHY < TIPS < DEGRADED < FAILED
For example, the status of a service that is hosted on a node becomes FAILED if at least one active event in the FAILED state exists for that service. When the status of the service is set, FAILED takes priority over DEGRADED, which is followed by TIPS and then HEALTHY. That is, if a service has one active event with a HEALTHY status and another active event with a FAILED status, the system sets the status of the service to FAILED.
7.8 System health monitoring use cases
In this section, the following use cases demonstrate the use of the mmhealth command to verify the health of ECE components. For more information about the components and services that are monitored by using the mmhealth command, see the “Monitoring system health by using mmhealth command” topic in IBM Spectrum Scale: Administration Guide:
To view the status of the current node, issue the following mmhealth command:
mmhealth node show
The output of the command is similar to the following example:
Node name: client25-ib0.sonasad.almaden.ibm.com
Node status: HEALTHY
Status Change: 2 days ago
 
Component Status Status Change Reasons
--------------------------------------------------------------
GPFS HEALTHY 2 days ago -
NETWORK HEALTHY 3 days ago -
FILESYSTEM HEALTHY 2 days ago -
DISK HEALTHY 2 days ago -
NATIVE_RAID HEALTHY 1 day ago -
To view the status of the subcomponents of the NATIVE_RAID component on an ECE storage node, issue the following command:
mmhealth node show NATIVE_RAID
The output of the command is similar to the following example:
Node name: client25-ib0.sonasad.almaden.ibm.com
 
Component Status Status Change Reasons
--------------------------------------------------------------------
NATIVE_RAID HEALTHY 19 hours ago -
ARRAY HEALTHY 19 hours ago -
NVME HEALTHY 1 day ago -
PHYSICALDISK HEALTHY 19 hours ago -
RECOVERYGROUP HEALTHY 19 hours ago -
VIRTUALDISK HEALTHY 19 hours ago -
 
There are no active error events for the component NATIVE_RAID on this node (client25-ib0.sonasad.almaden.ibm.com).
If the subcomponents of NATIVE_RAID have errors or warnings, the output of the mmhealth node show NATIVE_RAID command is similar to the following example:
Node name: cv1
 
Component Status Status Change Reasons
------------------------------------------------------------------------------------------
NATIVE_RAID DEGRADED 3 min. ago gnr_pdisk_replaceable(rg1/n001p023), gnr_array_needsservice(rg1/DA1)
ARRAY DEGRADED 27 min. ago gnr_array_needsservice(rg1/DA1)
PHYSICALDISK DEGRADED 27 min. ago gnr_pdisk_replaceable(rg1/n001p023)
RECOVERYGROUP HEALTHY 27 min. ago -
VIRTUALDISK HEALTHY 3 min. ago -
 
 
Event Parameter Severity Active Since Event Message
------------------------------------------------------------------------------------------
gnr_pdisk_replaceable rg1/n001p023 ERROR 27 min. ago GNR pdisk rg1/n001p023 is replaceable
gnr_array_needsservice rg1/DA1 WARNING 27 min. ago GNR declustered array rg1/DA1 needs service
Here, the pdisk rg1/n001p023 encountered errors that made the drive replaceable. This issue marked the PHYSICALDISK and ARRAY components as DEGRADED, which in turn marked the upper-level NATIVE_RAID component as DEGRADED. The Reasons column shows the events that caused the degradation of each component.
The status of low-level components of a node can be viewed by using the mmhealth node show command. For example, to view the status of physical disks of an ECE storage node, issue the following command:
mmhealth node show NATIVE_RAID PHYSICALDISK
 
The output of the command is similar to the following example:
Node name: client25-ib0.sonasad.almaden.ibm.com
 
Component Status Status Change Reasons
----------------------------------------------------------------
PHYSICALDISK HEALTHY 20 hours ago -
rg_1/n005p001 HEALTHY 1 day ago -
rg_1/n005p002 HEALTHY 1 day ago -
rg_1/n005p003 HEALTHY 1 day ago -
rg_1/n005p004 HEALTHY 1 day ago -
rg_1/n005p005 HEALTHY 1 day ago -
rg_1/n005p006 HEALTHY 1 day ago -
rg_1/n005p007 HEALTHY 1 day ago -
rg_1/n005p008 HEALTHY 1 day ago -
rg_1/n005p009 HEALTHY 1 day ago -
rg_1/n005p010 HEALTHY 1 day ago -
rg_1/n005p011 HEALTHY 1 day ago -
rg_1/n005p012 HEALTHY 1 day ago -
rg_1/n005p013 HEALTHY 1 day ago -
rg_1/n005p014 HEALTHY 1 day ago -
rg_1/n005p015 HEALTHY 1 day ago -
rg_1/n005p016 HEALTHY 1 day ago -
rg_1/n005p017 HEALTHY 1 day ago -
 
There are no active error events for the component PHYSICALDISK on this node (client25-ib0.sonasad.almaden.ibm.com).
To view the status of the subcomponents of a node, issue the mmhealth command:
mmhealth node show --verbose
 
The output of the command is similar to the following:
Node name: client22-ib0.sonasad.almaden.ibm.com
Node status: DEGRADED
Status Change: 2019-09-04 23:24:36
 
Component Status Status Change Reasons
-------------------------------------------------------------------------------------
GPFS HEALTHY 2019-09-04 23:24:32 -
NETWORK DEGRADED 2019-09-03 22:30:34 nic_firmware_not_available
ib0 HEALTHY 2019-09-03 22:30:34 -
FILESYSTEM HEALTHY 2019-09-04 23:25:31 -
gpfs_hd HEALTHY 2019-09-04 23:25:31 -
gpfs_hs HEALTHY 2019-09-04 23:25:31 -
DISK HEALTHY 2019-09-04 23:24:32 -
RG001LG001VS001 HEALTHY 2019-09-04 23:29:32 -
RG001LG001VS002 HEALTHY 2019-09-04 23:24:33 -
RG001LG001VS003 HEALTHY 2019-09-04 23:29:32 -
RG001LG002VS001 HEALTHY 2019-09-04 23:29:33 -
RG001LG002VS002 HEALTHY 2019-09-04 23:24:34 -
RG001LG002VS003 HEALTHY 2019-09-04 23:29:33 -
RG001LG003VS001 HEALTHY 2019-09-04 23:29:33 -
RG001LG003VS002 HEALTHY 2019-09-04 23:24:35 -
RG001LG003VS003 HEALTHY 2019-09-04 23:29:34 -
RG001LG004VS001 HEALTHY 2019-09-04 23:29:33 -
RG001LG004VS002 HEALTHY 2019-09-04 23:24:34 -
RG001LG004VS003 HEALTHY 2019-09-04 23:29:33 -
RG001LG005VS001 HEALTHY 2019-09-04 23:29:32 -
RG001LG005VS002 HEALTHY 2019-09-04 23:24:33 -
RG001LG005VS003 HEALTHY 2019-09-04 23:29:32 -
RG001LG006VS001 HEALTHY 2019-09-04 23:29:33 -
RG001LG006VS002 HEALTHY 2019-09-04 23:24:35 -
RG001LG006VS003 HEALTHY 2019-09-04 23:29:33 -
RG001LG007VS001 HEALTHY 2019-09-04 23:29:33 -
RG001LG007VS002 HEALTHY 2019-09-04 23:24:35 -
RG001LG007VS003 HEALTHY 2019-09-04 23:29:33 -
RG001LG008VS001 HEALTHY 2019-09-04 23:29:32 -
RG001LG008VS002 HEALTHY 2019-09-04 23:24:33 -
RG001LG008VS003 HEALTHY 2019-09-04 23:29:32 -
RG001LG009VS001 HEALTHY 2019-09-04 23:29:33 -
RG001LG009VS002 HEALTHY 2019-09-04 23:24:35 -
RG001LG009VS003 HEALTHY 2019-09-04 23:29:33 -
RG001LG010VS001 HEALTHY 2019-09-04 23:29:33 -
RG001LG010VS002 HEALTHY 2019-09-04 23:24:34 -
RG001LG010VS003 HEALTHY 2019-09-04 23:29:33 -
RG001LG011VS001 HEALTHY 2019-09-04 23:24:32 -
RG001LG011VS002 HEALTHY 2019-09-04 23:24:32 -
RG001LG011VS003 HEALTHY 2019-09-04 23:29:32 -
RG001LG012VS001 HEALTHY 2019-09-04 23:24:34 -
RG001LG012VS002 HEALTHY 2019-09-04 23:24:34 -
RG001LG012VS003 HEALTHY 2019-09-04 23:29:33 -
RG001LG013VS001 HEALTHY 2019-09-04 23:24:32 -
RG001LG013VS002 HEALTHY 2019-09-04 23:24:32 -
RG001LG013VS003 HEALTHY 2019-09-04 23:29:32 -
RG001LG014VS001 HEALTHY 2019-09-04 23:24:33 -
RG001LG014VS002 HEALTHY 2019-09-04 23:24:33 -
RG001LG014VS003 HEALTHY 2019-09-04 23:29:32 -
RG001LG015VS001 HEALTHY 2019-09-04 23:24:35 -
RG001LG015VS002 HEALTHY 2019-09-04 23:24:34 -
RG001LG015VS003 HEALTHY 2019-09-04 23:29:33 -
RG001LG016VS001 HEALTHY 2019-09-04 23:24:35 -
RG001LG016VS002 HEALTHY 2019-09-04 23:24:34 -
RG001LG016VS003 HEALTHY 2019-09-04 23:29:33 -
RG001LG017VS001 HEALTHY 2019-09-04 23:24:32 -
RG001LG017VS002 HEALTHY 2019-09-04 23:24:32 -
RG001LG017VS003 HEALTHY 2019-09-04 23:29:32 -
RG001LG018VS001 HEALTHY 2019-09-04 23:24:35 -
RG001LG018VS002 HEALTHY 2019-09-04 23:24:35 -
RG001LG018VS003 HEALTHY 2019-09-04 23:29:32 -
RG001LG019VS001 HEALTHY 2019-09-04 23:24:33 -
RG001LG019VS002 HEALTHY 2019-09-04 23:24:33 -
RG001LG019VS003 HEALTHY 2019-09-04 23:29:32 -
RG001LG020VS001 HEALTHY 2019-09-04 23:24:34 -
RG001LG020VS002 HEALTHY 2019-09-04 23:24:34 -
RG001LG020VS003 HEALTHY 2019-09-04 23:29:32 -
NATIVE_RAID HEALTHY 2019-09-05 10:02:46 -
ARRAY HEALTHY 2019-09-05 09:14:45 -
rg_1/DA1 HEALTHY 2019-09-05 09:14:52 -
rg_1/DA2 HEALTHY 2019-09-05 09:14:52 -
rg_1/DA3 HEALTHY 2019-09-05 09:14:52 -
NVME HEALTHY 2019-09-03 22:30:20 -
/dev/nvme0 HEALTHY 2019-09-03 22:30:20 -
PHYSICALDISK HEALTHY 2019-09-05 09:14:45 -
rg_1/n002p001 HEALTHY 2019-09-04 23:26:53 -
rg_1/n002p002 HEALTHY 2019-09-04 23:26:52 -
rg_1/n002p003 HEALTHY 2019-09-04 23:26:52 -
rg_1/n002p004 HEALTHY 2019-09-04 23:26:53 -
rg_1/n002p005 HEALTHY 2019-09-04 23:26:52 -
rg_1/n002p006 HEALTHY 2019-09-04 23:26:53 -
rg_1/n002p007 HEALTHY 2019-09-04 23:26:53 -
rg_1/n002p008 HEALTHY 2019-09-04 23:26:53 -
rg_1/n002p009 HEALTHY 2019-09-04 23:26:53 -
rg_1/n002p010 HEALTHY 2019-09-04 23:26:53 -
rg_1/n002p011 HEALTHY 2019-09-04 23:26:53 -
rg_1/n002p012 HEALTHY 2019-09-04 23:26:53 -
rg_1/n002p013 HEALTHY 2019-09-05 01:08:44 -
rg_1/n002p014 HEALTHY 2019-09-04 23:26:53 -
rg_1/n002p015 HEALTHY 2019-09-04 23:26:53 -
rg_1/n002p016 HEALTHY 2019-09-04 23:26:53 -
rg_1/n002p017 HEALTHY 2019-09-04 23:26:53 -
RECOVERYGROUP HEALTHY 2019-09-05 09:14:45 -
rg_1 HEALTHY 2019-09-05 09:14:45 -
VIRTUALDISK HEALTHY 2019-09-05 10:02:46 -
RG001LG001LOGHOME HEALTHY 2019-09-05 09:14:51 -
RG001LG001VS001 HEALTHY 2019-09-05 09:14:46 -
RG001LG001VS002 HEALTHY 2019-09-05 10:02:46 -
RG001LG001VS003 HEALTHY 2019-09-05 09:14:46 -
RG001LG002LOGHOME HEALTHY 2019-09-05 09:14:49 -
RG001LG002VS001 HEALTHY 2019-09-05 09:14:50 -
RG001LG002VS002 HEALTHY 2019-09-05 10:02:47 -
RG001LG002VS003 HEALTHY 2019-09-05 09:14:50 -
RG001LG003LOGHOME HEALTHY 2019-09-05 09:14:47 -
RG001LG003VS001 HEALTHY 2019-09-05 09:14:48 -
RG001LG003VS002 HEALTHY 2019-09-05 10:02:46 -
RG001LG003VS003 HEALTHY 2019-09-05 09:14:48 -
RG001LG004LOGHOME HEALTHY 2019-09-05 09:14:46 -
RG001LG004VS001 HEALTHY 2019-09-05 09:14:49 -
RG001LG004VS002 HEALTHY 2019-09-05 10:02:47 -
RG001LG004VS003 HEALTHY 2019-09-05 09:14:51 -
RG001LG005LOGHOME HEALTHY 2019-09-05 09:14:51 -
RG001LG005VS001 HEALTHY 2019-09-05 09:14:51 -
RG001LG005VS002 HEALTHY 2019-09-05 10:02:46 -
RG001LG005VS003 HEALTHY 2019-09-05 09:14:46 -
RG001LG006LOGHOME HEALTHY 2019-09-05 09:14:49 -
RG001LG006VS001 HEALTHY 2019-09-05 09:14:52 -
RG001LG006VS002 HEALTHY 2019-09-05 10:02:47 -
RG001LG006VS003 HEALTHY 2019-09-05 09:14:51 -
RG001LG007LOGHOME HEALTHY 2019-09-05 09:14:46 -
RG001LG007VS001 HEALTHY 2019-09-05 09:14:51 -
RG001LG007VS002 HEALTHY 2019-09-05 10:02:47 -
RG001LG007VS003 HEALTHY 2019-09-05 09:14:51 -
RG001LG008LOGHOME HEALTHY 2019-09-05 09:14:48 -
RG001LG008VS001 HEALTHY 2019-09-05 09:14:49 -
RG001LG008VS002 HEALTHY 2019-09-05 10:02:46 -
RG001LG008VS003 HEALTHY 2019-09-05 09:14:49 -
RG001LG009LOGHOME HEALTHY 2019-09-05 09:14:50 -
RG001LG009VS001 HEALTHY 2019-09-05 09:14:48 -
RG001LG009VS002 HEALTHY 2019-09-05 10:02:47 -
RG001LG009VS003 HEALTHY 2019-09-05 09:14:51 -
RG001LG010LOGHOME HEALTHY 2019-09-05 09:14:46 -
RG001LG010VS001 HEALTHY 2019-09-05 09:14:50 -
RG001LG010VS002 HEALTHY 2019-09-05 10:02:46 -
RG001LG010VS003 HEALTHY 2019-09-05 09:14:50 -
RG001LG011LOGHOME HEALTHY 2019-09-05 09:14:47 -
RG001LG011VS001 HEALTHY 2019-09-05 09:14:46 -
RG001LG011VS002 HEALTHY 2019-09-05 10:02:46 -
RG001LG011VS003 HEALTHY 2019-09-05 09:14:45 -
RG001LG012LOGHOME HEALTHY 2019-09-05 09:14:48 -
RG001LG012VS001 HEALTHY 2019-09-05 09:14:47 -
RG001LG012VS002 HEALTHY 2019-09-05 10:02:46 -
RG001LG012VS003 HEALTHY 2019-09-05 09:14:47 -
RG001LG013LOGHOME HEALTHY 2019-09-05 09:14:46 -
RG001LG013VS001 HEALTHY 2019-09-05 09:14:48 -
RG001LG013VS002 HEALTHY 2019-09-05 10:02:46 -
RG001LG013VS003 HEALTHY 2019-09-05 09:14:48 -
RG001LG014LOGHOME HEALTHY 2019-09-05 09:14:51 -
RG001LG014VS001 HEALTHY 2019-09-05 09:14:47 -
RG001LG014VS002 HEALTHY 2019-09-05 10:02:46 -
RG001LG014VS003 HEALTHY 2019-09-05 09:14:47 -
RG001LG015LOGHOME HEALTHY 2019-09-05 09:14:47 -
RG001LG015VS001 HEALTHY 2019-09-05 09:14:47 -
RG001LG015VS002 HEALTHY 2019-09-05 10:02:46 -
RG001LG015VS003 HEALTHY 2019-09-05 09:14:47 -
RG001LG016LOGHOME HEALTHY 2019-09-05 09:14:49 -
RG001LG016VS001 HEALTHY 2019-09-05 09:14:47 -
RG001LG016VS002 HEALTHY 2019-09-05 10:02:46 -
RG001LG016VS003 HEALTHY 2019-09-05 09:14:47 -
RG001LG017LOGHOME HEALTHY 2019-09-05 09:14:51 -
RG001LG017VS001 HEALTHY 2019-09-05 09:14:46 -
RG001LG017VS002 HEALTHY 2019-09-05 10:02:46 -
RG001LG017VS003 HEALTHY 2019-09-05 09:14:46 -
RG001LG018LOGHOME HEALTHY 2019-09-05 09:14:51 -
RG001LG018VS001 HEALTHY 2019-09-05 09:14:51 -
RG001LG018VS002 HEALTHY 2019-09-05 10:02:46 -
RG001LG018VS003 HEALTHY 2019-09-05 09:14:48 -
RG001LG019LOGHOME HEALTHY 2019-09-05 09:14:48 -
RG001LG019VS001 HEALTHY 2019-09-05 09:14:46 -
RG001LG019VS002 HEALTHY 2019-09-05 10:02:46 -
RG001LG019VS003 HEALTHY 2019-09-05 09:14:46 -
RG001LG020LOGHOME HEALTHY 2019-09-05 09:14:46 -
RG001LG020VS001 HEALTHY 2019-09-05 09:14:50 -
RG001LG020VS002 HEALTHY 2019-09-05 10:02:47 -
RG001LG020VS003 HEALTHY 2019-09-05 09:14:49 -
RG001ROOTLOGHOME HEALTHY 2019-09-05 09:14:51 -
To view more information about, and the user action for, any event that caused a failure or degradation of a component, use the mmhealth event show command. For example, the mmhealth node show --verbose command shows that the NETWORK component is degraded, and the Reasons column shows the event nic_firmware_not_available.
To view a more detailed description of the nic_firmware_not_available event, issue the following mmhealth command:
mmhealth event show nic_firmware_not_available
The output of the command is similar to the following example:
Event Name: nic_firmware_not_available
Event ID: 998136
Description: The expected firmware level is not available.
Cause: /usr/lpp/mmfs/bin/tslshcafirmware -Y does not return any expected firmware level for this adapter
User Action: /usr/lpp/mmfs/bin/tslshcafirmware -Y does not return any firmware level for the expectedFirmware field. Check if it is working as expecting. This command uses /usr/lpp/mmfs/updates/latest/firmware/hca/FirmwareInfo.hca, which is provided with the ECE packages. Check if the file is available and accessible.
Severity: WARNING
State: DEGRADED
To view the eventlog history of the node, issue the following command:
mmhealth node eventlog
The output of the command is similar to the following example:
Node name: client22-ib0.sonasad.almaden.ibm.com
Timestamp Event Name Severity Details
2019-08-21 21:02:47.136304 PDT disk_vanished INFO
The disk RG001LG001VS002 has vanished
2019-08-21 21:02:47.188529 PDT disk_vanished INFO
The disk RG001LG001VS003 has vanished
2019-08-21 21:02:47.256690 PDT disk_vanished INFO
The disk RG001LG001VS001 has vanished
2019-08-21 21:02:47.308793 PDT disk_vanished INFO
The disk RG001LG019VS003 has vanished
2019-08-21 21:02:47.368934 PDT disk_vanished INFO
The disk RG001LG019VS002 has vanished
2019-08-21 21:02:47.429105 PDT disk_vanished INFO
The disk RG001LG019VS001 has vanished
2019-08-21 21:02:47.481211 PDT disk_vanished INFO
The disk RG001LG006VS002 has vanished
2019-08-21 21:02:47.549418 PDT disk_vanished INFO
The disk RG001LG005VS002 has vanished
2019-08-21 21:02:47.609433 PDT disk_vanished INFO
The disk RG001LG005VS003 has vanished
2019-08-21 21:02:47.664537 PDT disk_vanished INFO
The disk RG001LG005VS001 has vanished
2019-08-21 21:02:47.729973 PDT disk_vanished INFO
The disk RG001LG008VS001 has vanished
2019-08-21 21:02:47.784668 PDT disk_vanished INFO
The disk RG001LG008VS003 has vanished
2019-08-21 21:02:47.840647 PDT disk_vanished INFO
The disk RG001LG008VS002 has vanished
2019-08-21 21:02:47.908559 PDT disk_vanished INFO
The disk RG001LG020VS001 has vanished
2019-08-21 21:02:47.965598 PDT disk_vanished INFO
The disk RG001LG004VS001 has vanished
2019-08-21 21:02:48.032636 PDT disk_vanished INFO
The disk RG001LG015VS002 has vanished
To view the health status summary of the cluster, issue the following command:
mmhealth cluster show
The output of the command is similar to the following example:
Component Total Failed Degraded Healthy Other
------------------------------------------------------------------------------------------
NODE 10 0 4 6 0
GPFS 10 0 0 9 1
NETWORK 10 0 4 6 0
FILESYSTEM 2 0 0 2 0
DISK 60 0 0 60 0
GUI 1 0 1 0 0
NATIVE_RAID 10 0 1 9 0
PERFMON 1 0 0 1 0
THRESHOLD 1 0 0 1 0
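When the cluster summary shows degraded components, you can drill down node by node. The following sketch assumes that your mmhealth level supports a component argument for mmhealth cluster show and the --unhealthy and -N options of mmhealth node show; the node name is only an example:
# List the per-node status of the NATIVE_RAID component across the cluster
mmhealth cluster show NATIVE_RAID
# Show only the unhealthy components and their events on a specific node
mmhealth node show --unhealthy -N client22-ib0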
7.9 Collecting data for problem determination
Regardless of the problem that is encountered with the ECE system, the following data must be collected when contacting the IBM Support Center:
A description of the problem.
A tar file that is generated by the gpfs.snap command that contains data from the nodes in the ECE cluster. In large clusters, the gpfs.snap command can collect data from selected nodes by using the -N option (see the example after this list).
If the gpfs.snap command cannot be run, collect the following data manually:
 – On a Linux node, create a tar file of all entries in the /var/log/messages file from all nodes in the cluster or the nodes that experienced the failure.
 – A master GPFS log file that is merged and chronologically sorted for the date of the failure. For more information about creating a master GPFS log file, see IBM Spectrum Scale: Problem Determination Guide.
 – If the cluster was configured to store dumps, collect any internal GPFS dumps that were written to that directory that relates to the time of the failure. The default directory is /tmp/mmfs.
 – On a failing Linux node, gather the installed software packages and the versions of each package by issuing the rpm -qa command.
 – For file system attributes for all of the failing file systems, issue the mmlsfs Device command.
 – For the current configuration and state of the disks for all of the failing file systems, issue the mmlsdisk Device command.
 – A copy of file /var/mmfs/gen/mmsdrfs from the primary cluster configuration server.
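As an example of the gpfs.snap collection that is mentioned above, the following sketch limits the collection to the ECE storage nodes and writes the package to a specific directory. The node class name ece_storage and the output directory are only examples; the available options for your level are described in IBM Knowledge Center:
# Collect snap data from the nodes in the ECE storage node class only
gpfs.snap -N ece_storage -d /tmp/ece_snap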
In addition to these items, more trace information is needed to assist with problem diagnosis in certain situations. Complete the following steps:
1. Ensure that the /tmp/mmfs directory exists on all nodes. If this directory does not exist, the GPFS daemon does not generate internal dumps.
2. Set the traces on this ECE cluster:
mmtracectl --set --trace=def --trace-recycle=global
3. Start the trace facility by issuing the following command:
mmtracectl --start
4. Recreate the problem, if possible.
5. After the problem is encountered on the node, turn off the trace facility by issuing the following command:
mmtracectl --off
6. Collect gpfs.snap output by using the following command:
gpfs.snap
For performance issues, the following procedure must be performed to collect data at the time of the performance problem. In the following example, the parameter -N all is used. However, for large clusters you might want to substitute a subset of nodes; for example, the mmvdisk node class that is used by the recovery group that is being analyzed:
1. Reset the internal counters and start the GPFS trace facility to collect trace in “overwrite” mode for a reasonable amount of time during which the performance problem is observed. Stop the trace, gather the internal counters, and collect gpfs.snap output. The following high-level procedure is used:
a. Specify the directory where trace data will be collected by using the DUMP_DIR environment variable. This directory must exist on each node and be part of a file system with adequate space for collecting trace data.
We recommend creating a directory under the directory that is specified by the dataStructureDump Spectrum Scale configuration value; for example, /tmp/mmfs/perf_issue. This way, the data is collected and packaged into a single file by using the gpfs.snap command. For more information about the gpfs.snap command and its use of dataStructureDump, see IBM Knowledge Center.
DUMP_DIR=/tmp/mmfs/perf_issue
b. Set the Trace configuration:
mmtracectl --set --trace=def --tracedev-write-mode=overwrite --tracedev-overwrite-buffer-size=2G -N all
c. Reset the I/O counters and I/O history on all of the nodes by using the following commands:
i. mmdsh -N all "/usr/lpp/mmfs/bin/mmfsadm resetstats"
ii. mmdsh -N all "/usr/lpp/mmfs/bin/mmfsadm dump iohist > /dev/null"
d. Start the trace:
mmtracectl --start -N all
e. Trigger the performance issue.
f. Stop the trace:
mmtracectl --stop -N all
g. Gather the I/O counters and I/O history on all of the nodes by using the following commands:
i. mmdsh -N all "/usr/lpp/mmfs/bin/mmfsadm dump iocounters > $DUMP_DIR/$(cat /etc/hostname).iocounters.txt"
ii. mmdsh -N all "/usr/lpp/mmfs/bin/mmfsadm dump iohist > $DUMP_DIR/$(cat /etc/hostname).iohist.txt"
h. Disable Trace config after Trace Capture:
mmtracectl --off -N all
i. Capture GPFS Snap:
gpfs.snap
2. In parallel, collect viostat output on each node. For example, the following command takes viostat output every 5 seconds, for up to 1000 samples, on each node:
mmdsh -N all "/usr/lpp/mmfs/samples/vdisk/viostat --timestamp 5 1000 | tee -a $DUMP_DIR/$(cat /etc/hostname).viostat"
3. Capture GPFS Snap:
gpfs.snap
7.10 Network tools
IBM Spectrum Scale is a distributed parallel file system. Like any other network distributed file system, it is highly dependent on the network stability, throughput, and latency.
Many network synthetic benchmarking tools are available to assess the different types of networks on which IBM Spectrum Scale can run. Because we cannot cover every network tool in this publication, consider the following points as you decide which tool to use to assess the network that runs IBM Spectrum Scale:
IBM Spectrum Scale traffic is not a one-to-one flow or a one-to-many flow; it is many-to-many.
If only an Ethernet network is used, IBM Spectrum Scale uses a single TCP socket between each pair of nodes for data transfer. This does not apply to RDMA or RoCE networks.
Always start by creating a baseline data set that can be used to compare with measurements that are taken later. Because it is your first measurement, it is not considered slow or fast, only your starting baseline.
Because of the nature of many-to-many flows of IBM Spectrum Scale, the inter-switch links (ISL) are important. If starvation exists on any layer by way of ISL, performance degradation can occur.
Ethernet link aggregation, in particular LACP (also referred to as 802.3ad or 802.1AX), does not give a single flow the aggregate bandwidth of all the individual links. A single flow still uses only one port at any time.
Networks evolve: the number of hosts changes, workloads change, and so on. When a network change occurs, measure again and create a new baseline. Also, always keep time stamps of those changes.
Network tuning is a journey, not a one-time event. As with any tuning, you measure, change, measure again and compare, and act based on the data.
Performance is not a feeling. It must be based on hard evidence and data.
Although you can use any tool that you are comfortable with, several tools were developed for IBM Spectrum Scale to assess and measure networks. These tools, which include nsdperf, gpfsperf, and mmnetverify, apply to any IBM Spectrum Scale installation, not only to ECE.
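For example, mmnetverify can run basic connectivity, port, and data checks between the cluster nodes. The following sketch uses a small subset of the available operations; larger bandwidth and flood tests stress the network and should be scheduled outside production peaks:
# Check node-to-node connectivity and daemon port reachability from all nodes
mmnetverify connectivity port -N all
# Run a moderate data transfer test between all nodes
mmnetverify data-medium -N all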
One specific tool, SpectrumScale_NETWORK_READINESS, was introduced with ECE. On any supported ECE installation, this tool must be run and must pass before the product is installed, which also gives you a baseline. You can then always refer to that data as your baseline for future comparisons.
Furthermore, this tool can be used to measure the network performance metrics for any IBM Spectrum Scale network. The tool stresses the Spectrum Scale network and must be run with care on a production cluster.