Troubleshooting IBM Spectrum Archive Enterprise Edition
This chapter describes the process that you can use to troubleshoot issues with IBM Spectrum Archive Enterprise Edition (IBM Spectrum Archive EE).
This chapter includes the following topics:
9.1 Overview
This section provides a simple health check procedure for IBM Spectrum Archive EE.
9.1.1 Quick health check
If you are having issues with an IBM Spectrum Archive EE environment, Figure 9-1 shows a simple flowchart that you can follow as the first step to troubleshooting problems with the IBM Spectrum Archive EE components.
Figure 9-1 Quick health check procedure
If your issue remains after you perform these simple checks, follow the procedures that are described in the remainder of this chapter to perform more detailed troubleshooting. If the problem cannot be resolved, contact IBM Spectrum Archive Support.
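If you prefer a command-line starting point, the following sequence is a minimal sketch of the same checks (it assumes you are logged in as root on an IBM Spectrum Archive EE node; adjust names to your environment):
# mmgetstate -a              (verify that IBM Spectrum Scale is active on all nodes)
# systemctl status rpcbind   (verify that rpcbind is running)
# eeadm node list            (verify that the EE nodes are in the available state)
# eeadm drive list           (verify that the tape drives are assigned and usable)
# eeadm tape list            (verify that the pool tapes are in a usable state)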
9.1.2 Common startup errors
IBM Spectrum Archive EE manages multiple components, all of which must start successfully before the system is ready for use. In addition to its own components, several external components must be running and configured properly for IBM Spectrum Archive EE to start correctly. This section walks through some of these components and the common errors that prevent IBM Spectrum Archive EE from starting up correctly.
If the eeadm cluster start command returns an error status code, IBM Spectrum Archive EE failed to start correctly and user action is required to remedy the situation. To view the type of error that occurred during startup and which nodes are affected, run the eeadm node list command.
Failed startup caused by rpcbind
The rpcbind utility is required for MMM to function correctly. If rpcbind is not running, IBM Spectrum Archive EE starts up with errors and is unusable until the issue is resolved. This situation typically occurs when the server was powered down for maintenance and started back up, because rpcbind might not be configured to start automatically.
Example 9-1 shows the output of a failed IBM Spectrum Archive EE startup due to rpcbind not running.
Example 9-1 rpcbind caused startup failure
[root@tora ~]# eeadm cluster start
Library name: lib_tora, library serial: 0000013FA002040C, control node (ltfsee_md) IP address: 9.11.244.63.
Starting - sending a startup request to lib_tora.
Starting - waiting for startup completion : lib_tora.
Starting - opening a communication channel : lib_tora.
.
Starting - waiting for getting ready to operate : lib_tora.
.......
2019-01-21 09:04:17 GLESL657E: Fail to start the IBM Spectrum Archive EE service (MMM) for library lib_tora.
Use the "eeadm node list" command to see the error modules.
The monitor daemon will start the recovery sequence.
 
[root@tora ~]# eeadm node list
 
Spectrum Archive EE service (MMM) for library lib_tora fails to start or is not running on tora.tuc.stglabs.ibm.com Node ID:1
 
Problem Detected:
Node ID Error Modules
1 MMM; rpcbind;
To remedy this issue, run the systemctl start rpcbind command to start the process, and then either wait for the IBM Spectrum Archive EE monitor daemon to start MMM or run eeadm cluster stop followed by eeadm cluster start to start MMM. After rpcbind is started, verify that it is running by running the systemctl status rpcbind command.
Example 9-2 shows how to start rpcbind and wait for the monitor daemon to restart MMM.
Example 9-2 Starting rpcbind to remedy MMM startup failure
[root@tora ~]# systemctl start rpcbind
 
[root@tora ~]# systemctl status rpcbind
● rpcbind.service - RPC bind service
Loaded: loaded (/usr/lib/systemd/system/rpcbind.service; disabled; vendor preset: enabled)
Active: active (running) since Mon 2019-01-21 09:08:37 MST; 1min 12s ago
Process: 23628 ExecStart=/sbin/rpcbind -w $RPCBIND_ARGS (code=exited, status=0/SUCCESS)
Main PID: 23629 (rpcbind)
Tasks: 1
Memory: 572.0K
CGroup: /system.slice/rpcbind.service
+-23629 /sbin/rpcbind -w
 
Jan 21 09:08:37 tora.tuc.stglabs.ibm.com systemd[1]: Starting RPC bind service...
Jan 21 09:08:37 tora.tuc.stglabs.ibm.com systemd[1]: Started RPC bind service.
[root@tora ~]# eeadm node list
Node ID State Node IP Drives Ctrl Node Library Node Group Host Name
1 available 9.11.244.63 0 yes(active) lib_tora G0 tora.tuc.stglabs.ibm.com
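Because this failure often follows a server restart, you can also configure rpcbind to start automatically at boot time so that the problem does not recur. The following commands are a minimal sketch (the service name is assumed to be rpcbind, as shown in Example 9-2):
# systemctl enable rpcbind
# systemctl is-enabled rpcbind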
Failed startup caused by LE
LE is another crucial component of IBM Spectrum Archive EE. If it does not start correctly, MMM does not start. There are two common startup problems that can be remedied quickly. The first problem occurs when the IBM Spectrum Archive EE node has no visibility of its tape drives. This problem can be fixed by connecting the Fibre Channel cables to the node, and can be verified by running the ltfs -o device_list command.
Example 9-3 shows the output of ltfs -o device_list with no drives connected.
Example 9-3 No drives connected to server
[root@tora ~]# ltfs -o device_list
6b6c LTFS14000I LTFS starting, LTFS version 2.4.1.0 (10219), log level 2.
6b6c LTFS14058I LTFS Format Specification version 2.4.0.
6b6c LTFS14104I Launched by "/opt/IBM/ltfs/bin/ltfs -o device_list".
6b6c LTFS14105I This binary is built for Linux (x86_64).
6b6c LTFS14106I GCC version is 4.8.3 20140911 (Red Hat 4.8.3-9).
6b6c LTFS17087I Kernel version: Linux version 3.10.0-862.14.4.el7.x86_64 ([email protected]) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-28) (GCC) ) #1 SMP Fri Sep 21 09:07:21 UTC 2018 i386.
6b6c LTFS17089I Distribution: NAME="Red Hat Enterprise Linux Server".
6b6c LTFS17089I Distribution: Red Hat Enterprise Linux Server release 7.5 (Maipo).
6b6c LTFS17089I Distribution: Red Hat Enterprise Linux Server release 7.5 (Maipo).
6b6c LTFS17085I Plugin: Loading "sg" changer backend.
6b6c LTFS17085I Plugin: Loading "sg" tape backend.
Changer Device list:.
Tape Device list:.
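In addition to ltfs -o device_list, you can check drive visibility at the operating system level. For example, if the lsscsi package is installed (an assumption; it is not part of IBM Spectrum Archive EE), the following command lists the tape drives and medium changers together with their /dev/sg device names:
# lsscsi -g
If no tape or mediumx entries appear, check the Fibre Channel cabling and zoning before restarting IBM Spectrum Archive EE.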
The second most common LE error occurs when the drives are connected to the IBM Spectrum Archive EE node but none of them is configured as a control path drive in the library GUI. IBM Spectrum Archive EE requires at least one control path drive so that it can communicate with the library.
Example 9-4 shows the output of a failed IBM Spectrum Archive EE start up that is caused by LE.
Example 9-4 LE failed MMM startup
[root@tora ~]# eeadm cluster start
Library name: lib_tora, library serial: 0000013FA002040C, control node (ltfsee_md) IP address: 9.11.244.63.
Starting - sending a startup request to lib_tora.
Starting - waiting for startup completion : lib_tora.
Starting - opening a communication channel : lib_tora.
.
Starting - waiting for getting ready to operate : lib_tora.
...
2019-01-21 09:24:33 GLESL657E: Fail to start the IBM Spectrum Archive EE service (MMM) for library lib_tora.
Use the "eeadm node list" command to see the error modules.
The monitor daemon will start the recovery sequence.
[root@tora ~]# eeadm node list
 
Spectrum Archive EE service (MMM) for library lib_tora fails to start or is not running on tora.tuc.stglabs.ibm.com Node ID:1
 
Problem Detected:
Node ID Error Modules
1 LE; MMM;
To remedy this failure, ensure that the node can see its drives and at least one drive is a control path drive.
9.2 Hardware
This section provides information that can help you to identify and resolve problems with the hardware that is used by IBM Spectrum Archive EE.
9.2.1 Tape library
If the TS4500 tape library has a problem, it reports an error in the events page on the TS4500 Management GUI. When an error occurs, IBM Spectrum Archive might not work. Figure 9-2 shows an example of a library error.
Figure 9-2 Tape library error log
For more information about how to solve tape library errors, see the IBM TS4500 R8 Tape Library Guide, SG24-8235.
9.2.2 Tape drives
If an LTO tape drive has a problem, it reports the error on a single-character display (SCD). If a TS1140 (or later) tape drive has a problem, it reports the error on an 8-character message display. When this error occurs, IBM Spectrum Archive might not work. To obtain information about a drive error, determine which drive is reporting the error and then access the events page to see the error by using the TS4500 Management GUI.
Figure 9-3 shows an example from the web interface of a tape drive that has an error and is no longer responding.
Figure 9-3 Tape drive error
If you right-click the event and select Display fix procedure, another window opens and shows suggestions about how to fix the problem. If a drive display reports a specific drive error code, see the tape drive maintenance manual for a solution, or call IBM H/W service. For more information about analyzing the operating system error logs, see 9.5.1, “Linux” on page 303.
If a problem is identified in the tape drive and the tape drive must be repaired, the drive must first be removed from the IBM Spectrum Archive EE system. For more information, see “Taking a tape drive offline” on page 299.
Managing tape drive dump files
This section describes how to manage the automatic erasure of drive dump files. IBM Spectrum Archive automatically generates two tape drive dump files in the /tmp directory when it receives unexpected sense data from a tape drive. Example 9-5 shows the format of the dump files.
Example 9-5 Dump files
[root@ltfs97 tmp]# ls -la *.dmp
-rw-r--r-- 1 root root 3681832 Apr 4 14:26 ltfs_1068000073_2013_0404_142634.dmp
-rw-r--r-- 1 root root 3681832 Apr 4 14:26 ltfs_1068000073_2013_0404_142634_f.dmp
-rw-r--r-- 1 root root 3681832 Apr 4 14:42 ltfs_1068000073_2013_0404_144212.dmp
-rw-r--r-- 1 root root 3697944 Apr 4 14:42 ltfs_1068000073_2013_0404_144212_f.dmp
-rw-r--r-- 1 root root 3697944 Apr 4 15:45 ltfs_1068000073_2013_0404_154524.dmp
-rw-r--r-- 1 root root 3683424 Apr 4 15:45 ltfs_1068000073_2013_0404_154524_f.dmp
-rw-r--r-- 1 root root 3683424 Apr 4 17:21 ltfs_1068000073_2013_0404_172124.dmp
-rw-r--r-- 1 root root 3721684 Apr 4 17:21 ltfs_1068000073_2013_0404_172124_f.dmp
-rw-r--r-- 1 root root 3721684 Apr 4 17:21 ltfs_1068000073_2013_0404_172140.dmp
-rw-r--r-- 1 root root 3792168 Apr 4 17:21 ltfs_1068000073_2013_0404_172140_f.dmp
The size of each drive dump file is approximately 2 MB. By managing the drive dump files that are generated, you can save disk space and enhance IBM Spectrum Archive performance.
It is not necessary to keep dump files after they are used for problem analysis. Likewise, the files are not necessary if the problems are minor and can be ignored. A script program that is provided with IBM Spectrum Archive EE periodically checks the number of drive dump files and their date and time. If some of the dump files are older than two weeks or if the number of dump files exceeds 1000 files, the script program erases them.
The script file is started by using Linux crontab features. A cron_ltfs_limit_dumps.sh file is in the /etc/cron.daily directory. This script file is started daily by the Linux operating system. The interval to run the script can be changed by moving the cron_ltfs_limit_dumps.sh file to other cron folders, such as cron.weekly. For more information about how to change the crontab setting, see the manual for your version of Linux.
In the cron_ltfs_limit_dumps.sh file, the automatic drive dump erase policy is specified by the options of the ltfs_limit_dumps.sh script file, as shown in the following example:
/opt/ibm/ltfsle/bin/ltfs_limit_dumps.sh -t 14 -n 1000
You can modify the policy by editing the options in the cron_ltfs_limit_dumps.sh file. The expiration date is set as a number of days by the -t option. In the example, a drive dump file is erased when it is more than 14 days old. The number of files to keep is set by the -n option. In our example, if the number of files exceeds 1,000, older files are erased so that the 1,000-file maximum is not exceeded. If either option is deleted, the dump files are erased according to the remaining policy.
By editing these options in the cron_ltfs_limit_dumps.sh file, the number of days that files are kept and the number of files that are stored can be modified.
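For example, to keep drive dump files for only 7 days and to limit the number of files to 500, you might change the line in cron_ltfs_limit_dumps.sh as follows (a sketch; keep the script path that is already used in your file):
/opt/ibm/ltfsle/bin/ltfs_limit_dumps.sh -t 7 -n 500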
Although not recommended, you can disable the automatic erasure of drive dump files by removing the cron_ltfs_limit_dumps.sh file from the cron folder.
Taking a tape drive offline
This section describes how to take a drive offline from the IBM Spectrum Archive EE system to perform diagnostic operations while the IBM Spectrum Archive EE system stays operational. To accomplish this task, use software such as the IBM Tape Diagnostic Tool (ITDT) or the IBM LTFS Format Verifier, which are described in 10.3, “System calls and IBM tools” on page 327.
 
Important: If the diagnostic operation you intend to perform requires that a tape cartridge be loaded into the drive, ensure that you have an empty non-pool tape cartridge available in the logical library of IBM Spectrum Archive EE. If a tape cartridge is in the tape drive when the drive is removed, the tape cartridge is automatically moved to the home slot.
To perform diagnostic tests, complete the following steps:
1. Identify the node ID number of the drive to be taken offline by running the eeadm drive list command. Example 9-6 shows the tape drives in use by IBM Spectrum Archive EE.
Example 9-6 Identify the tape drive to remove
[root@tora ~]# eeadm drive list
Drive S/N State Type Role Library Node ID Tape Node Group Task ID
000000014A00  not_mounted TS1160 mrg lib_tora 1 - G0 -
0000078PG20C not_mounted TS1160 mrg lib_tora 1 - G0 -
In this example, we take the tape drive with serial number 000000014A00 on cluster node 1 offline.
2. Remove the tape drive from the IBM Spectrum Archive EE inventory by specifying the eeadm drive unassign <drive serial number> command. Example 9-7 shows the removal of a single tape drive from IBM Spectrum Archive EE.
Example 9-7 Remove the tape drive
[root@tora ~]# eeadm drive unassign 000000014A00
2019-01-22 13:26:45 GLESL700I: Task drive_unassign was created successfully, task id is 6432.
2019-01-22 13:26:51 GLESL121I: Drive serial 000000014A00 is removed from the tape drive list.
2019-01-22 13:36:49 GLESL121I: Drive serial 000000014A00 is removed from the tape drive list.
3. Check the success of the removal. Run the eeadm drive list command and verify that the output shows the drive in the unassigned state. Example 9-8 shows the status of the drives after the drive is removed from IBM Spectrum Archive EE.
Example 9-8 Check the tape drive status
[root@tora ~]# eeadm drive list
Drive S/N State Type Role Library Node ID Tape Node Group Task ID
0000078PG20C not_mounted TS1160 mrg lib_tora 1 - G0 -
000000014A00 unassigned - --- lib_tora - - - -
4. Identify the primary device number of the drive for subsequent operations by running the /opt/ibm/ltfsle/bin/ltfs -o device_list command. The command outputs a list of available drives. Example 9-9 shows the output of this command.
Example 9-9 ltfs -o device_list command output
[root@tora ~]# /opt/ibm/ltfsle/bin/ltfs -o device_list
77d1 LTFS14000I LTFS starting, LTFS version 2.4.1.1 (10226), log level 2.
77d1 LTFS14058I LTFS Format Specification version 2.4.0.
77d1 LTFS14104I Launched by "/opt/IBM/ltfs/bin/ltfs -o device_list".
77d1 LTFS14105I This binary is built for Linux (x86_64).
77d1 LTFS14106I GCC version is 4.8.3 20140911 (Red Hat 4.8.3-9).
77d1 LTFS17087I Kernel version: Linux version 3.10.0-862.14.4.el7.x86_64 ([email protected]) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-28) (GCC) ) #1 SMP Fri Sep 21 09:07:21 UTC 2018 i386.
77d1 LTFS17089I Distribution: NAME="Red Hat Enterprise Linux Server".
77d1 LTFS17089I Distribution: Red Hat Enterprise Linux Server release 7.5 (Maipo).
77d1 LTFS17089I Distribution: Red Hat Enterprise Linux Server release 7.5 (Maipo).
77d1 LTFS17085I Plugin: Loading "sg" changer backend.
77d1 LTFS17085I Plugin: Loading "sg" tape backend.
Changer Device list:.
Device Name = /dev/sg11, Vender ID = IBM , Product ID = 03584L22 , Serial Number = 0000013FA002040C, Product Name = TS3500/TS4500.
Tape Device list:.
Device Name = /dev/sg1, Vender ID = IBM , Product ID = 0359260F , Serial Number = 0000078PG20C, Product Name =[0359260F].
Device Name = /dev/sg0, Vender ID = IBM , Product ID = 0359260F , Serial Number = 000000014A00, Product Name =[0359260F].
 
5. If your diagnostic operations require a tape cartridge to be loaded into the drive, complete the following steps. Otherwise, you are ready to perform diagnostic operations on the drive, which has the drive address /dev/sgnumber, where number is the device number that is obtained in step 4:
a. Move the tape cartridge to the drive from the I/O station or home slot. You can move the tape cartridge by using ITDT (in which case the drive must have the control path), or the TS4500 Management GUI.
b. Perform the diagnostic operations on the drive, which has the drive address /dev/sgnumber, where number is the device number that is obtained in step 4.
c. When you are finished, return the tape cartridge to its original location.
6. Add the drive to the IBM Spectrum Archive EE inventory again by running the eeadm drive assign drive_serial -n node_id command, where node_id is the same node that the drive was assigned to originally in step 1 on page 299.
Example 9-10 shows the tape drive that is readded to IBM Spectrum Archive EE.
Example 9-10 Add again the tape drive
[root@tora ~]# eeadm drive assign 000000014A00 -n 1
2019-01-23 08:37:24 GLESL119I: Drive 000000014A00 assigned successfully.
Running the eeadm drive list command again shows that the tape drive is no longer in the “unassigned” state. Example 9-11 shows the output of this command.
Example 9-11 Check the tape drive status
[root@tora ~]# eeadm drive list
Drive S/N State Type Role Library Node ID Tape Node Group Task ID
000000014A00  not_mounted TS1160 mrg lib_tora 1 - G0 -
0000078PG20C not_mounted TS1160 mrg lib_tora 1 - G0 -
9.3 Recovering data from a write failure tape
A tape that suffers from a write failure goes into a require_replace state. Complete the following steps to recover data from a require_replace tape:
1. Verify that the tape is in the require_replace state by running eeadm tape list.
2. Verify that there is a tape cartridge in the appendable state within the same cartridge pool that has enough available space to hold the data from the require_replace tape.
3. Run the eeadm tape replace command on the require_replace tape to transfer its data to an appendable tape within the same pool; the command unassigns the require_replace tape from the tape cartridge pool when it finishes.
4. Run the eeadm tape move command on the require_replace tape to move it to the I/O station so that the bad tape can be disposed of.
Example 9-12 shows the commands and output to replace a require_replace tape.
Example 9-12 Replacing data from a require_replace tape
 
 
[root@ginza prod]# eeadm tape list | grep pool1
FC0254L8 degraded require_replace 10907 0 0 0% pool1 liba homeslot -
FC0257L8 ok appendable 10907 0 10907 0% pool1 liba homeslot -
 
[root@ginza prod]# eeadm tape replace FC0254L8 -p pool1
2019-02-05 14:49:03 GLESL700I: Task tape_replace was created successfully, task id is 2319.
2019-02-05 14:49:03 GLESL755I: Start a reconcile before starting a replace against 1 tapes.
2019-02-05 14:51:46 GLESS002I: Reconciling tape FC0254L8 complete.
2019-02-05 14:51:49 GLESL756I: Reconcile before replace finished.
2019-02-05 14:51:49 GLESL753I: Starting tape replace for FC0254L8.
2019-02-05 14:51:49 GLESL754I: Found a target tape for tape replace (FC0257L8).
2019-02-05 14:55:03 GLESL749I: The tape replace operation for FC0254L8 is successful.
 
[root@ginza prod]# eeadm tape list | grep pool1
FC0257L8 ok appendable 10907 0 10906 0% pool1 liba drive -
In the rare case where the error on the tape is permanent (several eeadm tape replace commands have failed), it is suggested to try the eeadm tape unassign --safe-remove command instead. The --safe-remove option recalls all the active files on the tape back to an IBM Spectrum Scale file system that has adequate free space. The files then need to be migrated manually to a good tape again. Specify only one tape for use with the --safe-remove option.
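The following invocation is a hypothetical sketch of that procedure, based on the tape and pool names that are used in Example 9-12 (verify the option syntax on your IBM Spectrum Archive EE level before running it):
[root@ginza prod]# eeadm tape unassign FC0254L8 -p pool1 --safe-remove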
9.4 Recovering data from a read failure tape
A tape that suffers from a read failure goes into a need_replace state.
Complete the following steps to copy migrated jobs from a need_replace tape to a valid tape within the same pool:
1. Identify a tape with a read failure by running eeadm tape list to locate the need_replace tape.
2. Verify that there is an appendable tape within the same pool that has enough available space to hold the data from the need_replace tape cartridge.
3. Run the eeadm tape replace command to transfer the data to an appendable tape within the same pool; the command unassigns the need_replace tape from the tape cartridge pool when it finishes.
4. Run the eeadm tape move command on the need_replace tape to move it to the I/O station so that the bad tape can be disposed of.
Example 9-13 shows system output of the steps to recover data from a read failure tape.
Example 9-13 Recovering data from a read failure
[root@hakone prod3]# eeadm tape list | grep test1
IM1229L6 ok appendable 2242 0 2242 0% test1 liba homeslot -
IM1195L6 info need_replace 2242 0 0 0% test1 liba drive -
 
[root@hakone prod3]# eeadm tape replace IM1195L6 -p test1
2019-02-26 14:29:27 GLESL700I: Task tape_replace was created successfully, task id is 1297.
2019-02-26 14:29:27 GLESL755I: Start a reconcile before starting a replace against 1 tapes.
2019-02-26 14:30:05 GLESS002I: Reconciling tape IM1195L6 complete.
2019-02-26 14:30:06 GLESL756I: Reconcile before replace finished.
2019-02-26 14:30:06 GLESL753I: Starting tape replace for IM1195L6.
2019-02-26 14:30:06 GLESL754I: Found a target tape for tape replace (IM1229L6).
2019-02-26 14:31:23 GLESL749I: The tape replace operation for IM1195L6 is successful.
 
In the rare case where the error on the tape is permanent (several eeadm tape replace commands have failed), it is suggested to try the eeadm tape unassign --safe-remove command instead. The --safe-remove option recalls all the active files on the tape back to an IBM Spectrum Scale file system that has adequate free space, and those files then need to be migrated manually to a good tape again. Specify only one tape for use with the --safe-remove option.
9.5 Software
IBM Spectrum Archive EE is composed of four major components, each with its own set of log files. Therefore, problem analysis is slightly more involved than with other products. This section describes troubleshooting issues with each component in turn, and also covers the Linux operating system and Simple Network Management Protocol (SNMP) alerts.
9.5.1 Linux
The log file /var/log/messages contains global Linux system messages, including the messages that are logged during system start and messages that are related to LTFS and IBM Spectrum Archive EE functions. However, three specific log files are also created:
ltfs.log
ltfsee.log
ltfsee_trc.log
Unlike with previous LTFS/IBM Spectrum Archive products, there is no need to enable system logging on Linux because it is configured automatically during the installation process. Example 9-14 shows the changes to the rsyslog.conf file and the location of the three log files.
Example 9-14 The rsyslog.conf file
[root@ltfssn1 ~]# cat /etc/rsyslog.conf | grep ltfs
:msg, startswith, "GLES," /var/log/ltfsee_trc.log;gles_trc_template
:msg, startswith, "GLES" /var/log/ltfsee.log;RSYSLOG_FileFormat
:msg, regex, "LTFS[ID0-9][0-9]*[EWID]" /var/log/ltfs.log;RSYSLOG_FileFormat
By default, after the ltfs.log, ltfsee.log, and ltfsee_trc.log files reach the threshold size, they are rotated and four copies are kept. Example 9-15 shows the log file rotation settings. These settings can be adjusted as needed within the /etc/logrotate.d/ibmsa-logrotate control file.
Example 9-15 Syslog rotation
[root@ltfssn1 ~]# cat /etc/logrotate.d/ibmsa-logrotate
/var/log/ltfsee.log {
size 1M
rotate 4
missingok
compress
sharedscripts
postrotate
/bin/kill -HUP `cat /var/run/syslogd.pid 2> /dev/null` 2> /dev/null || true
/bin/kill -HUP `cat /var/run/rsyslogd.pid 2> /dev/null` 2> /dev/null || true
endscript
}
/var/log/ltfsee_trc.log {
size 10M
rotate 9
missingok
compress
sharedscripts
postrotate
/bin/kill -HUP `cat /var/run/syslogd.pid 2> /dev/null` 2> /dev/null || true
/bin/kill -HUP `cat /var/run/rsyslogd.pid 2> /dev/null` 2> /dev/null || true
endscript
}
/var/log/ltfsee_mon.log {
size 1M
rotate 4
missingok
compress
sharedscripts
postrotate
/bin/kill -HUP `cat /var/run/syslogd.pid 2> /dev/null` 2> /dev/null || true
/bin/kill -HUP `cat /var/run/rsyslogd.pid 2> /dev/null` 2> /dev/null || true
endscript
}
These log files (ltfs.log, ltfsee.log, ltfsee_trc.log, and /var/log/messages) are invaluable in troubleshooting LTFS messages. The ltfsee.log file contains only warning and error messages, so it is a good place to start looking for the reason for a failure. For example, a typical file migration might return the information message that is shown in Example 9-16.
Example 9-16 Simple migration with informational messages
[root@ginza prod]# eeadm migrate mig -p pool2
2019-02-07 18:15:07 GLESL700I: Task migrate was created successfully, task id is 2326.
2019-02-07 18:15:08 GLESM896I: Starting the stage 1 of 3 for migration task 2326 (qualifying the state of migration candidate files).
2019-02-07 18:15:08 GLESM897I: Starting the stage 2 of 3 for migration task 2326 (copying the files to 1 pools).
2019-02-07 18:15:08 GLESM898I: Starting the stage 3 of 3 for migration task 2326 (changing the state of files on disk).
2019-02-07 18:15:08 GLESL159E: Not all migration has been successful.
2019-02-07 18:15:08 GLESL038I: Migration result: 0 succeeded, 100 failed, 0 duplicate, 0 duplicate wrong pool, 0 not found, 0 too small to qualify for migration, 0 too early for migration.
From the GLESL159E message, you know that the migration was unsuccessful, but you do not know why it was unsuccessful. To understand why, you must examine the ltfsee.log file. Example 9-17 shows the end of the ltfsee.log file immediately after the failed migrate command is run.
Example 9-17 The ltfsee.log file
# tail /var/log/ltfsee.log
2019-02-07T18:20:03.301978-07:00 ginza mmm[14807]: GLESM600E(00412): Failed to migrate/premigrate file /ibm/gpfs/prod/FILE1. The specified pool name does not match the existing replica copy.
2019-02-07T18:20:03.493810-07:00 ginza mmm[14807]: GLESL159E(00144): Not all migration has been successful.
In this case, the migration of the file was unsuccessful because it was previously migrated/premigrated to a different tape pool.
With IBM Spectrum Archive EE, there are two logging facilities. One is in a human-readable format that is monitored by users, and the other is in a machine-readable format that is used for further problem analysis. The former is logged to /var/log/ltfsee.log through the “user” syslog facility and contains only warnings and errors. The latter is logged to /var/log/ltfsee_trc.log through the “local2” syslog facility.
The messages in machine-readable format can be converted into human-readable format by the tool ltfsee_catcsvlog, which is run by the following command:
/opt/ibm/ltfsee/bin/ltfsee_catcsvlog /var/log/ltfsee_trc.log
The ltfsee_catcsvlog command accepts multiple log files as command-line arguments. If no argument is specified, ltfsee_catcsvlog reads from stdin.
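For example, because the tool reads from stdin, you can follow the trace log in human-readable form with a command such as the following (a usage sketch; the paths are the defaults described above):
# tail -f /var/log/ltfsee_trc.log | /opt/ibm/ltfsee/bin/ltfsee_catcsvlog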
Persistent problems
This section describes ways to solve persistent IBM Spectrum Archive EE problems.
If an unexpected and persistent condition occurs in the IBM Spectrum Archive EE environment, contact your IBM service representative. Provide the following information to help IBM re-create and solve the problem:
Machine type and model of your IBM tape library in use for IBM Spectrum Archive EE
Machine type and model of the tape drives that are embedded in the tape library
Specific IBM Spectrum Archive EE version
Description of the problem
System configuration
Operation that was performed at the time the problem was encountered
The operating system automatically generates system log files after initial configuration of the IBM Spectrum Archive EE. Provide the results of the ltfsee_log_collection command to your IBM service representative.
9.5.2 IBM Spectrum Scale
IBM Spectrum Scale writes operational messages and error data to the IBM Spectrum Scale log file. The IBM Spectrum Scale log can be found in the /var/adm/ras directory on each node. The IBM Spectrum Scale log file is named mmfs.log.date.nodeName, where date is the time stamp when the instance of IBM Spectrum Scale started on the node and nodeName is the name of the node. The latest IBM Spectrum Scale log file can be found by using the symbolic file name /var/adm/ras/mmfs.log.latest.
The IBM Spectrum Scale log from the prior start of IBM Spectrum Scale can be found by using the symbolic file name /var/adm/ras/mmfs.log.previous. All other files have a time stamp and node name that is appended to the file name.
At IBM Spectrum Scale start, files that were not accessed during the last 10 days are deleted. If you want to save old files, copy them elsewhere.
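For example, the following command is a sketch that copies log files older than 10 days to an archive directory before they are removed at the next IBM Spectrum Scale start (the /var/adm/ras/archive directory is an assumption and must already exist):
# find /var/adm/ras -name 'mmfs.log.*' -mtime +10 -exec cp -p {} /var/adm/ras/archive/ \;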
Example 9-18 shows normal operational messages that appear in the IBM Spectrum Scale log file.
Example 9-18 Normal operational messages in an IBM Spectrum Scale log file
[root@ltfs97 ]# cat /var/adm/ras/mmfs.log.latest
Wed Apr 3 13:25:04 JST 2013: runmmfs starting
Removing old /var/adm/ras/mmfs.log.* files:
Unloading modules from /lib/modules/2.6.32-220.el6.x86_64/extra
Loading modules from /lib/modules/2.6.32-220.el6.x86_64/extra
Module Size Used by
mmfs26 1749012 0
mmfslinux 311300 1 mmfs26
tracedev 29552 2 mmfs26,mmfslinux
Wed Apr 3 13:25:06.026 2013: mmfsd initializing. {Version: 3.5.0.7 Built: Dec 12 2012 19:00:50} ...
Wed Apr 3 13:25:06.731 2013: Pagepool has size 3013632K bytes instead of the requested 29360128K bytes.
Wed Apr 3 13:25:07.409 2013: Node 192.168.208.97 (htohru9) is now the Group Leader.
Wed Apr 3 13:25:07.411 2013: This node (192.168.208.97 (htohru9)) is now Cluster Manager for htohru9.ltd.sdl.
 
Starting ADSM Space Management daemons
Wed Apr 3 13:25:17.907 2013: mmfsd ready
Wed Apr 3 13:25:18 JST 2013: mmcommon mmfsup invoked. Parameters: 192.168.208.97 192.168.208.97 all
Wed Apr 3 13:25:18 JST 2013: mounting /dev/gpfs
Wed Apr 3 13:25:18.179 2013: Command: mount gpfs
Wed Apr 3 13:25:18.353 2013: Node 192.168.208.97 (htohru9) appointed as manager for gpfs.
Wed Apr 3 13:25:18.798 2013: Node 192.168.208.97 (htohru9) completed take over for gpfs.
Wed Apr 3 13:25:19.023 2013: Command: err 0: mount gpfs
Wed Apr 3 13:25:19 JST 2013: finished mounting /dev/gpfs
Depending on the size and complexity of your system configuration, the amount of time to start IBM Spectrum Scale varies. Taking your system configuration into consideration, if you cannot access a file system that is mounted (automatically or by running a mount command) after a reasonable amount of time, examine the log file for error messages.
The IBM Spectrum Scale log is a repository of error conditions that were detected on each node, and operational events, such as file system mounts. The IBM Spectrum Scale log is the first place to look when you are attempting to debug abnormal events. Because IBM Spectrum Scale is a cluster file system, events that occur on one node might affect system behavior on other nodes, and all IBM Spectrum Scale logs can have relevant data.
A common error that might appear when trying to mount GPFS is that it cannot read superblock. Example 9-19 shows the output of the error when trying to mount GPFS.
Example 9-19 Superblock error from mounting GPFS
[root@ltfsml1 ~]# mmmount gpfs
Wed May 24 12:53:59 MST 2017: mmmount: Mounting file systems ...
mount: gpfs: can't read superblock
mmmount: Command failed. Examine previous error messages to determine cause.
The cause of this error and failure to mount GPFS is that the GPFS file system had dmapi enabled, but the HSM process has not been started. To get around this error and successfully mount GPFS, issue the systemctl start hsm command, and make sure it is running by issuing systemctl status hsm. After HSM is running, wait for the recall processes to initiate. This process can be viewed by issuing ps -afe | grep dsm. Example 9-20 shows output of starting HSM, checking the status, and mounting GPFS.
Example 9-20 Starting HSM and mounting GPFS
[root@ltfsml1 ~]# systemctl start hsm
[root@ltfsml1 ~]# systemctl status hsm
● hsm.service - HSM Service
Loaded: loaded (/usr/lib/systemd/system/hsm.service; enabled; vendor preset: disabled)
Active: active (running) since Wed 2017-05-24 13:04:59 MST; 4s ago
Main PID: 16938 (dsmwatchd)
CGroup: /system.slice/hsm.service
+-16938 /opt/tivoli/tsm/client/hsm/bin/dsmwatchd nodetach
 
May 24 13:04:59 ltfsml1.tuc.stglabs.ibm.com systemd[1]: Started HSM Service.
May 24 13:04:59 ltfsml1.tuc.stglabs.ibm.com systemd[1]: Starting HSM Service...
May 24 13:04:59 ltfsml1.tuc.stglabs.ibm.com dsmwatchd[16938]: HSM(pid:16938): start
[root@ltfsml1 ~]# ps -afe | grep dsm
root 7906 1 0 12:56 ? 00:00:00 /opt/tivoli/tsm/client/hsm/bin/dsmwatchd nodetach
root 9748 1 0 12:57 ? 00:00:00 dsmrecalld
root 9773 9748 0 12:57 ? 00:00:00 dsmrecalld
root 9774 9748 0 12:57 ? 00:00:00 dsmrecalld
root 9900 26012 0 12:57 pts/0 00:00:00 grep --color=auto dsm
[root@ltfsml1 ~]# mmmount gpfs
Wed May 24 12:57:22 MST 2017: mmmount: Mounting file systems ...
[root@ltfsml1 ~]# df -h | grep gpfs
gpfs 280G 154G 127G 55% /ibm/glues
If HSM is already running, double check if the dsmrecalld daemons are running by issuing ps -afe | grep dsm. If no dsmrecalld daemons are running, start them by issuing dsmmigfs start. After they have been started, GPFS can be successfully mounted.
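The following sequence is a minimal sketch of that check and recovery, using the same commands as the earlier examples:
# ps -afe | grep dsmrecalld
# dsmmigfs start
# ps -afe | grep dsmrecalld
# mmmount gpfs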
9.5.3 IBM Spectrum Archive LE component
This section describes the options that are available to analyze problems that are identified by the LTFS logs. It also provides links to messages and actions that can be used to troubleshoot the source of an error.
The messages that are referenced in this section provide possible actions only for solvable error codes. The error codes that are reported by LTFS program can be retrieved from the terminal console or log files. For more information about retrieving error messages, see 9.5.1, “Linux” on page 303.
When multiple errors are reported, LTFS attempts to find a message ID and an action for each error code. If you cannot locate a message ID or an action for a reported error code, LTFS encountered a critical problem. If you retry the initial action and it continues to fail, LTFS has also encountered a critical problem. In these cases, contact your IBM service representative for more support.
Message ID strings start with the keyword LTFS and are followed by a four- or five-digit value. However, some message IDs include the uppercase letter I or D after LTFS, but before the four- or five-digit value. When an IBM Spectrum Archive EE command is run and returns an error, check the message ID to ensure that you do not mistake the letter I for the numeral 1.
A complete list of all LTFS messages can be found in the IBM Spectrum Archive EE section of IBM Documentation.
At the end of the message ID, the following single capital letters indicate the importance of the problem:
E: Error
W: Warning
I: Information
D: Debugging
When you troubleshoot, check messages for errors only.
Example 9-21 shows a problem analysis procedure for LTFS.
Example 9-21 LTFS messages
cat /var/log/ltfs.log
2019-02-07T18:33:04.663564-07:00 ginza ltfs[12251]: 478d LTFS14787I Formatting cartridge FC0252L8.
2019-02-07T18:33:42.724406-07:00 ginza ltfs[12251]: 478d LTFS14837I Formatting cartridge FC0252L8 (0x5e, 0x00).
2019-02-07T18:33:42.729350-07:00 ginza ltfs[12251]: 478d LTFS14789E Failed to format cartridge FC0252L8 (-1079).
2019-02-07T18:33:42.729543-07:00 ginza ltfs[12251]: 478d LTFSI1079E The operation is not allowed.
The set of 10 characters represents the message ID, and the text that follows describes the operational state of LTFS. The fourth message ID (LTFSI1079E) in this list indicates that an error was generated because the last character is the letter E. The character immediately following LTFS is the letter I. The complete message, including an explanation and appropriate course of action for LTFSI1079E, is shown in Example 9-22 on page 309.
Example 9-22 Example of message
LTFS14789E Failed to format cartridge FC0252L8 (-1079).
LTFSI1079E The operation is not allowed.
The previous operation did not run because of a tape or drive issue.
In this case, the tape and drive were incompatible.
Based on the description that is provided here, the tape cartridge in the library failed to format. Upon further investigation, the tape cartridge and drive are incompatible. The required user action to solve the problem is to attach a compatible drive to the library and the IBM Spectrum Archive EE node and rerun the operation.
9.5.4 Hierarchical storage management
During installation, hierarchical storage management (HSM) is configured to write log entries to a log file in /opt/tivoli/tsm/client/hsm/bin/dsmerror.log. Example 9-23 shows an example of this file.
Example 9-23 The dsmerror.log file
[root@ltfs97 /]# cat dsmerror.log
03/29/2013 15:24:28 ANS9101E Migrated files matching '/ibm/glues/file1.img' could not be found.
03/29/2013 15:24:28 ANS9101E Migrated files matching '/ibm/glues/file2.img' could not be found.
03/29/2013 15:24:28 ANS9101E Migrated files matching '/ibm/glues/file3.img' could not be found.
03/29/2013 15:24:28 ANS9101E Migrated files matching '/ibm/glues/file4.img' could not be found.
03/29/2013 15:24:28 ANS9101E Migrated files matching '/ibm/glues/file5.img' could not be found.
03/29/2013 15:24:28 ANS9101E Migrated files matching '/ibm/glues/file6.img' could not be found.
04/02/2013 16:24:06 ANS9510E dsmrecalld: cannot get event messages from session 515A6F7E00000000, expected max message-length = 1024, returned message-length = 144. Reason : Stale NFS file handle
04/02/2013 16:24:06 ANS9474E dsmrecalld: Lost my session with errno: 1 . Trying to recover.
04/02/13 16:24:10 ANS9433E dsmwatchd: dm_send_msg failed with errno 1.
04/02/2013 16:24:11 ANS9433E dsmrecalld: dm_send_msg failed with errno 1.
04/02/2013 16:24:11 ANS9433E dsmrecalld: dm_send_msg failed with errno 1.
04/02/2013 16:24:11 ANS9433E dsmrecalld: dm_send_msg failed with errno 1.
04/03/13 13:25:06 ANS9505E dsmwatchd: cannot initialize the DMAPI interface. Reason: Stale NFS file handle
04/03/2013 13:38:14 ANS1079E No file specification entered
04/03/2013 13:38:20 ANS9085E dsmrecall: file system / is not managed by space management.
The HSM log contains information about file migration and recall, threshold migration, reconciliation, and starting and stopping the HSM daemon. You can analyze the HSM log to determine the current state of the system. For example, the logs can indicate when a recall has started but not finished within the last hour. The administrator can analyze a particular recall and react accordingly.
In addition, an HSM log might be analyzed by an administrator to optimize HSM usage. For example, if the HSM log indicates that 1,000 files are recalled at the same time, the administrator might suggest that the files can be first compressed into one .tar file and then migrated.
9.5.5 IBM Spectrum Archive EE logs
This section describes IBM Spectrum Archive EE logs and message IDs and provides some tips for dealing with failed recalls and missing files.
IBM Spectrum Archive EE log collection tool
IBM Spectrum Archive EE writes its logs to the files /var/log/ltfsee.log and /var/log/ltfsee_trc.log. These files can be viewed in a text editor for troubleshooting purposes. Use the IBM Spectrum Archive EE log collection tool to collect data that you can send to IBM Support.
The ltfsee_log_collection tool is in the /opt/ibm/ltfsee/bin folder. To use the tool, complete the following steps:
1. Log on to the operating system as the root user and open a console.
2. Start the tool by running the following command:
# /opt/ibm/ltfsee/bin/ltfsee_log_collection
3. When the following message displays, read the instructions, then enter y or p to continue:
LTFS Enterprise Edition - log collection program
This program collects the following information from your GPFS cluster.
a. Log files that are generated by GPFS, LTFS Enterprise Edition
b. Configuration information that is configured to use GPFS and LTFS Enterprise Edition
c. System information including OS distribution and kernel, and hardware information (CPU and memory)
d. Task information files under the following subdirectory <GPFS mount point>/.ltfsee/statesave.
If you want to collect all the information, enter y.
If you want to collect only a and b, enter p (partial).
If you agree to collect only (4) task information files, input 't'.
If you do not want to collect any information, enter n.
The collected data is compressed in the ltfsee_log_files_<date>_<time>.tar.gz file. You can check the contents of the file before submitting it to IBM.
4. Make sure that a packed file with the name ltfsee_log_files_[date]_[time].tar.gz is created in the current directory. This file contains the collected log files.
5. Send the tar.gz file to your IBM service representative.
Messages reference
For IBM Spectrum Archive EE, message ID strings start with the keyword GLES and are followed by a single letter and then by a three-digit value. The single letter indicates which component generated the message. For example, GLESL is used to indicate all messages that are related to the IBM Spectrum Archive EE command. At the end of the message ID, the following single uppercase letter indicates the importance of the problem:
E: Error
W: Warning
I: Information
D: Debugging
When you troubleshoot, check messages for errors only. For a list of available messages, see IBM Documentation.
Failed reconciliations
Failed reconciliations usually are indicated by the GLESS003E error message with the following description:
Reconciling tape %s failed due to a generic error.
File status
Table 9-1 lists the possible status codes for files in IBM Spectrum Archive EE. They can be viewed for individual files by running the eeadm file state command.
Table 9-1 Status codes for files in IBM Spectrum Archive EE
resident - The resident status indicates that the file is resident in the GPFS namespace and is not saved, migrated, or premigrated to a tape.
migrated - The migrated status indicates that the file was migrated. The file was copied from the GPFS file system to a tape, and exists only as a stub file in the GPFS namespace.
premigrated - The premigrated status indicates that the file was premigrated. The file was copied to a tape (or tapes), but the file was not removed from the GPFS namespace.
saved - The saved status indicates that a file system object that has no data (a symbolic link, an empty directory, or an empty regular file) was saved. The file system object was copied from the GPFS file system to a tape.
offline - The offline status indicates that the file was saved or migrated to a tape cartridge and thereafter the tape cartridge was exported offline.
missing - The missing status indicates that a file has the migrated or premigrated status, but it is not accessible from IBM Spectrum Archive EE because the tape cartridge it is supposed to be on is not accessible. The file might be missing because of tape corruption or because the tape cartridge was removed from the system without exporting.
Files go into the missing status because of tape corruption or because the tape cartridge was removed from the system without exporting it. If the cause is a corrupted index, run the eeadm tape validate command. If the tape is missing from the tape library, bring the tape back into the library and run the eeadm library rescan command.
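For example, the state of an individual file can be checked with a command similar to the following (the file path is hypothetical):
# eeadm file state /ibm/gpfs/prod/FILE1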
9.6 Recovering from system failures
The system failures that are described in this section are the result of hardware failures or temporary outages that result in IBM Spectrum Archive EE errors.
9.6.1 Power failure
When a library power failure occurs, the data on the tape cartridge that is actively being written is probably left in an inconsistent state.
To recover a tape cartridge from a power failure, complete the following steps:
1. Create a mount point for the tape library. For more information, see the procedure described in 6.2.2, “IBM Spectrum Archive Library Edition component” on page 137.
2. If you do not know which tape cartridges are in use, try to access all tape cartridges in the library. If you do know which tape cartridges are in use, try to access the tape cartridge that was in use when the power failure occurred.
3. If a tape cartridge is damaged, it is identified as inconsistent and the corresponding subdirectories disappear from the file system. You can confirm which tape cartridges are damaged or inconsistent by running the eeadm tape list command. The list of tape cartridges that displays indicates the volume name, which is helpful in identifying the inconsistent tape cartridge. For more information, see 6.18, “Checking and repairing tapes” on page 201.
4. Recover the inconsistent tape cartridge by running the eeadm tape validate command. For more information, see 6.18, “Checking and repairing tapes” on page 201.
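For example, for a tape cartridge that belongs to a pool, the recovery command looks similar to the following sketch (the tape ID and pool name are hypothetical; verify the option syntax on your IBM Spectrum Archive EE level):
# eeadm tape validate FC0254L8 -p pool1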
9.6.2 Mechanical failure
When a library receives an error message from one of its mechanical parts, the process to move a tape cartridge cannot be performed.
 
Important: A drive in the library normally continues to operate despite the library failure, so ongoing access to an open file on the loaded tape cartridge is not interrupted and no data is damaged.
To recover a library from a mechanical failure, complete the following steps:
1. Identify the issue on the tape library.
2. Manually repair the failing part.
3. Run the eeadm tape validate command on each affected tape.
4. Follow the procedure that is described in 9.6.1, “Power failure” on page 312.
 
Important: One or more inconsistent tape cartridges might be found in the storage slots and might need to be made consistent by following the procedure that is described in “Unassigned state” on page 322.
9.6.3 Inventory failure
When a library cannot read the tape cartridge bar code for any reason, an inventory operation for the tape cartridge fails. The corresponding media folder does not display, but a specially designated folder that is named UNKN0000 is listed instead. This designation indicates that a tape cartridge is not recognized by the library.
If the user attempts to access the tape cartridge contents, the media folder is removed from the file system. The status of any library tape cartridge can be determined by running the eeadm tape list command. For more information, see 6.23, “Obtaining system resources, and tasks information” on page 214.
To recover from an inventory failure, complete the following steps:
1. Remove any unknown tape cartridges from the library by using the operator panel or Tape Library Specialist web interface, or by opening the door or magazine of the library.
2. Check all tape cartridge bar code labels.
 
Important: If the bar code is removed or about to peel off, the library cannot read it. Replace the label or firmly attach the bar code to fix the problem.
3. Insert the tape cartridge into the I/O station.
4. Check to determine whether the tape cartridge is recognized by running the eeadm tape list command.
5. Add the tape cartridge to the LTFS inventory by running the eeadm tape assign command.
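For example, after the tape cartridge is recognized again, it can be added back to a tape cartridge pool with a command similar to the following sketch (the tape ID and pool name are hypothetical):
# eeadm tape assign FC0254L8 -p pool1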
9.6.4 Abnormal termination
If LTFS terminates because of an abnormal condition, such as a system hang-up or after the user initiates a kill command, the tape cartridges in the library might remain in the tape drives. If this occurs, LTFS locks the tape cartridges in the drives and the following command is required to release them:
# ltfs -o release_device -o changer_devname=[device_name]
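For example, using the changer device name that is reported by ltfs -o device_list (see Example 9-9), a hypothetical invocation looks like the following:
# ltfs -o release_device -o changer_devname=/dev/sg11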
 