Chapter 10. Troubleshooting


This chapter describes the process that you can use to troubleshoot issues with IBM Spectrum Archive EE.

This chapter includes the following topics:

•Overview

•Hardware

•Recovering data from a write failure tape

•Recovering data from a read failure tape

•Handling export errors

•Software

•Recovering from system failures

10.1 Overview

This section provides a simple health check procedure for IBM Spectrum Archive EE.

10.1.1 Quick health check

If you are having issues with an existing IBM Spectrum Archive EE environment, follow the simple flowchart that is shown in Figure 10-1 as the first step in troubleshooting problems with the IBM Spectrum Archive EE components.

Figure 10-1 Quick health check procedure

If your issue remains after you perform these simple checks, follow the procedures that are described in the remainder of this chapter to perform more detailed troubleshooting. If the problem cannot be resolved, contact IBM Spectrum Archive Support.

10.2 Hardware

The topics in this section provide information that can help you to identify and resolve problems with the hardware that is used by IBM Spectrum Archive EE.

10.2.1 Tape library

If the TS4500 tape library has a problem, it reports an error in the events page on the TS4500 Management GUI. When an error occurs, IBM Spectrum Archive might not work. Figure 10-2 shows an example of a library error.

Figure 10-2 Tape library error log

For more information about how to solve tape library errors, see the IBM TS4500 R4 Tape Library Guide, SG24-8235.

10.2.2 Tape drives

If an LTO tape drive has a problem, it reports the error on a single-character display (SCD). If a TS1140 (or later) tape drive has a problem, it reports the error on an 8-character message display. When such an error occurs, IBM Spectrum Archive might not work. To obtain information about a drive error, determine which drive is reporting the error and then use the TS4500 Management GUI to view the error on the events page.

Figure 10-3 shows an example from the web interface of a tape drive that has an error and is no longer responding.

Figure 10-3 Tape drive error

If you right-click the event and select Display fix procedure, another window opens and shows suggestions about how to fix the problem. If a drive display reports a specific drive error code, see the tape drive maintenance manual for a solution. For more information about analyzing the operating system error logs, see 10.6.1, “Linux” on page 293.

If a problem is identified in the tape drive and the tape drive must be repaired, the drive must first be removed from the IBM Spectrum Archive EE system. For more information, see “Taking a tape drive offline” on page 281.

Managing tape drive dump files

This section describes how to manage the automatic erasure of drive dump files. IBM Spectrum Archive automatically generates two tape drive dump files in the /tmp directory when it receives unexpected sense data from a tape drive. Example 10-1 shows the format of the dump files.

Example 10-1 Dump files

[root@ltfs97 tmp]# ls -la *.dmp

-rw-r--r-- 1 root root 3681832 Apr 4 14:26 ltfs_1068000073_2013_0404_142634.dmp

-rw-r--r-- 1 root root 3681832 Apr 4 14:26 ltfs_1068000073_2013_0404_142634_f.dmp

-rw-r--r-- 1 root root 3681832 Apr 4 14:42 ltfs_1068000073_2013_0404_144212.dmp

-rw-r--r-- 1 root root 3697944 Apr 4 14:42 ltfs_1068000073_2013_0404_144212_f.dmp

-rw-r--r-- 1 root root 3697944 Apr 4 15:45 ltfs_1068000073_2013_0404_154524.dmp

-rw-r--r-- 1 root root 3683424 Apr 4 15:45 ltfs_1068000073_2013_0404_154524_f.dmp

-rw-r--r-- 1 root root 3683424 Apr 4 17:21 ltfs_1068000073_2013_0404_172124.dmp

-rw-r--r-- 1 root root 3721684 Apr 4 17:21 ltfs_1068000073_2013_0404_172124_f.dmp

-rw-r--r-- 1 root root 3721684 Apr 4 17:21 ltfs_1068000073_2013_0404_172140.dmp

-rw-r--r-- 1 root root 3792168 Apr 4 17:21 ltfs_1068000073_2013_0404_172140_f.dmp

Each drive dump file is a few megabytes in size (between 3.6 MB and 3.8 MB each in Example 10-1). By managing the drive dump files that are generated, you can save disk space and enhance IBM Spectrum Archive performance.

It is not necessary to keep dump files after they are used for problem analysis. Likewise, the files are not necessary if the problems are minor and can be ignored. A script program that is provided with IBM Spectrum Archive EE periodically checks the number of drive dump files and their date and time. If some of the dump files are older than two weeks or if the number of dump files exceeds 1000 files, the script program erases them.

The script file is started by using Linux crontab features. A cron_ltfs_limit_dumps.sh file is in the /etc/cron.daily directory. This script file is started daily by the Linux operating system. The interval to run the script can be changed by moving the cron_ltfs_limit_dumps.sh file to other cron folders, such as cron.weekly. For more information about how to change the crontab setting, see the manual for your version of Linux.

In the cron_ltfs_limit_dumps.sh file, the automatic drive dump erase policy is specified by the options that are passed to the ltfs_limit_dumps.sh script file, as shown in the following example:

/opt/IBM/ltfs/bin/ltfs_limit_dumps.sh -t 14 -n 1000

You can modify the policy by editing the options in the cron_ltfs_limit_dumps.sh file. The -t option sets the expiration age as a number of days. In the example, a drive dump file is erased when it is more than 14 days old. The -n option sets the number of files to keep. In the example, if the number of files exceeds 1,000, older files are erased so that the 1,000-file maximum is not exceeded. If either option is removed, the dump files are erased according to the remaining option only.

Although not recommended, you can disable the automatic erasure of drive dump files by removing the cron_ltfs_limit_dumps.sh file from the cron folder.
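
Before you change the -t or -n values, it can help to see how many dump files the new policy would affect. The following is a minimal read-only sketch, not the ltfs_limit_dumps.sh implementation. It assumes that the dump files match the ltfs_*.dmp pattern shown in Example 10-1 and that GNU find is available. The function names list_old_dumps and count_dumps are hypothetical.

```shell
#!/bin/sh
# Read-only preview of a dump erase policy. Nothing is deleted.
# Assumption: dump files are named ltfs_*.dmp, as in Example 10-1.

# List dump files whose age in whole days is greater than $2.
# $1: directory to scan (for example, /tmp)
# $2: age threshold in days (for example, 14)
list_old_dumps() {
    find "$1" -maxdepth 1 -type f -name 'ltfs_*.dmp' -mtime +"$2" | sort
}

# Print the total number of dump files in directory $1.
count_dumps() {
    find "$1" -maxdepth 1 -type f -name 'ltfs_*.dmp' | wc -l | tr -d ' '
}
```

For example, list_old_dumps /tmp 14 prints the files that exceed the 14-day age limit, and count_dumps /tmp prints the number to compare against the -n limit.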

Taking a tape drive offline

This section describes how to take a drive offline from the IBM Spectrum Archive EE system to perform diagnostic operations while the IBM Spectrum Archive EE system stays operational. To accomplish this task, use software such as the IBM Tape Diagnostic Tool (ITDT) or the IBM LTFS Format Verifier, which are described in 11.3, “System calls and IBM tools” on page 331.

Important: If the diagnostic operation you intend to perform requires that a tape cartridge be loaded into the drive, ensure that you have an empty non-pool tape cartridge available in the logical library of IBM Spectrum Archive EE. If a tape cartridge is in the tape drive when the drive is removed, the tape cartridge is automatically moved to the home slot.

To perform diagnostic tests, complete the following steps:

1. Identify the node ID number of the drive to be taken offline by running the ltfsee info drives command. Example 10-2 shows the tape drives in use by IBM Spectrum Archive EE.

Example 10-2 Identify the tape drive to remove

[root@ltfssn1 ~]# ltfsee info drives

Drive S/N Status Type Role Library Address Node ID Tape Node Group

000000014A00 Not mounted TS1140 mrg TS4500_3592 257 1 - SINGLE_NODE

0000078D8320 Mounted TS1150 mrg TS4500_3592 258 1 JCC539JC SINGLE_NODE

In this example, we take the tape drive with serial number 000000014A00 on cluster node 1 offline.

2. Remove the tape drive from the IBM Spectrum Archive EE inventory by running the ltfsee drive remove -d <drive serial number> command. Example 10-3 shows the removal of a single tape drive from IBM Spectrum Archive EE.

Example 10-3 Remove the tape drive

[root@ltfssn1 ~]# ltfsee drive remove -d 000000014A00

GLESL121I(00282): Drive serial 000000014A00 is removed from LTFS EE drive list.

3. Check that the removal succeeded. Run the ltfsee info drives command and verify that the Status column for the drive shows Stock. Example 10-4 shows the status of the drives after the drive is removed from IBM Spectrum Archive EE.

Example 10-4 Check the tape drive status

[root@ltfssn1 ~]# ltfsee info drives

Drive S/N Status Type Role Library Address Node ID Tape Node Group

0000078D8320 Mounted TS1150 mrg TS4500_3592 258 1 JCC539JC SINGLE_NODE

000000014A00 Stock UNKNOWN --- TS4500_3592 257 - - -

4. Identify the primary device number of the drive for subsequent operations by running the cat /proc/scsi/IBMtape command. The command output lists the device number in the Number field. Example 10-5 shows the output of this command, in which the offline tape drive (serial number 000000014A00) is device number 13.

Example 10-5 List the tape drives in Linux

[root@ltfs97 /]# cat /proc/scsi/IBMtape

lin_tape version: 1.76.0

lin_tape major number: 248

Attached Tape Devices:

Number model S/N HBA SCSI FO Path

0 ULT3580-TD5 00078A218E qla2xxx 2:0:0:0 NA

1 ULT3580-TD5 1168001144 qla2xxx 2:0:1:0 NA

2 ULTRIUM-TD5 9A700L0077 qla2xxx 2:0:2:0 NA

3 ULT3580-TD6 1013000068 qla2xxx 2:0:3:0 NA

4 03592E07 0000013D0485 qla2xxx 2:0:4:0 NA

5 ULT3580-TD5 00078A1D8F qla2xxx 2:0:5:0 NA

6 ULT3580-TD6 00013B0037 qla2xxx 2:0:6:0 NA

7 03592E07 001013000652 qla2xxx 2:0:7:0 NA

8 03592E07 0000078DDAC3 qla2xxx 2:0:8:0 NA

9 03592E07 001013000255 qla2xxx 2:0:9:0 NA

10 ULT3580-TD5 1068000073 qla2xxx 2:0:10:0 NA

11 03592E07 0000013D0734 qla2xxx 2:0:11:0 NA

12 ULT3580-TD6 00013B0084 qla2xxx 2:0:12:0 NA

13 3592E07 000000014A00 qla2xxx 2:0:13:0 NA

14 ULT3580-TD5 1068000070 qla2xxx 2:0:14:0 NA

15 ULT3580-TD5 1068000016 qla2xxx 2:0:15:0 NA

16 03592E07 0000013D0733 qla2xxx 2:0:19:0 NA

17 03592E07 0000078DDBF1 qla2xxx 2:0:20:0 NA

18 ULT3580-TD5 00078AC0A5 qla2xxx 2:0:21:0 NA

19 ULT3580-TD5 00078AC08B qla2xxx 2:0:22:0 NA

20 03592E07 0000013D0483 qla2xxx 2:0:23:0 NA

21 03592E07 0000013D0485 qla2xxx 2:0:24:0 NA

22 03592E07 0000078D13C1 qla2xxx 2:0:25:0 NA

5. If your diagnostic operations require a tape cartridge to be loaded into the drive, complete the following steps. Otherwise, you are ready to perform diagnostic operations on the drive, which has the drive address /dev/IBMtapenumber, where number is the device number that is obtained in step 4 on page 282:

a. Move the tape cartridge to the drive from the I/O station or home slot. You can move the tape cartridge by using ITDT (in which case the drive must have the control path), or the TS4500 Management GUI.

b. Perform the diagnostic operations on the drive, which has the drive address /dev/IBMtapenumber, where number is the device number that is obtained in step 4 on page 282.

c. When you are finished, return the tape cartridge to its original location.

6. Add the drive to the IBM Spectrum Archive EE inventory again by running the ltfsee drive add -d <drive serial number> -n <node ID> command, where the node ID is the node that the drive was originally assigned to in step 1 on page 281.

Example 10-6 shows the tape drive being added to IBM Spectrum Archive EE again.

Example 10-6 Add the tape drive again

[root@ltfssn1 ~]# ltfsee drive add -d 000000014A00 -n 1

GLESL119I(00174): Drive 000000014A00 added successfully.

Running the ltfsee info drives command again shows that the tape drive is no longer in a “stock” state. Example 10-7 shows the output of this command.

Example 10-7 Check the tape drive status

[root@ltfssn1 ~]# ltfsee info drives

Drive S/N Status Type Role Library Address Node ID Tape Node Group

000000014A00 Not mounted TS1140 mrg TS4500_3592 257 1 - SINGLE_NODE

0000078D8320 Mounted TS1150 mrg TS4500_3592 258 1 JCC539JC SINGLE_NODE
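
The device number lookup in step 4 can be scripted instead of read from the listing by eye. The following is a minimal sketch that assumes the /proc/scsi/IBMtape layout shown in Example 10-5, with the device number in the first field and the serial number in the third. The function name device_number_for_serial is hypothetical.

```shell
#!/bin/sh
# Print the lin_tape device number for a drive serial number.
# $1: drive serial number
# $2: device list file (optional, defaults to /proc/scsi/IBMtape)
# Assumption: field 1 is the device number and field 3 is the serial number,
# as in Example 10-5. The serial is compared as a string, so leading zeros
# are significant.
device_number_for_serial() {
    awk -v serial="$1" \
        '($3 "") == (serial "") && $1 ~ /^[0-9]+$/ { print $1 }' \
        "${2:-/proc/scsi/IBMtape}"
}
```

The result can be appended to /dev/IBMtape to form the drive address that is used in step 5. If a serial number is listed more than once, as can happen when a drive is reachable over several paths, every matching device number is printed.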

10.2.3 Tape cartridge

Table 10-1 shows all possible status conditions for an IBM Spectrum Archive tape cartridge as displayed by the ltfsee info tapes command.

Table 10-1 Tape cartridge status

Tape cartridge status	File system access	Description	How to recover tape cartridges	Valid IBM Spectrum Archive EE commands
Valid	Yes	The Valid status indicates that the cartridge is valid.	N/A	All but imports, recover, and pool add
Exported	No	The Exported status indicates that the cartridge is valid and is exported.	N/A	ltfsee import ltfsee tape move
Offline	No	The Offline status indicates that the cartridge is valid and is exported offline.	N/A	ltfsee import with offline option ltfsee tape move
Unknown	Unknown	The tape cartridge contents are unknown.	The status changes after the cartridge is used. Run the ltfsee tape validate command to load the tape into a drive and check it.	ltfsee pool remove ltfsee reclaim (source/target) ltfsee export ltfsee rebuild ltfsee recall
Write Protected	Read-only	The tape cartridge is physically (or logically) in a write-protected state.	Remove the physical write-protection.	ltfsee pool remove ltfsee recall
Critical	Read-only	Indicates that the tape had a write failure. The index in memory is dirty.	Run the recover command using the -c option to generate a scan list and bring the files back into resident state, then run the recover command using the -r option to double check any missed files. The -r option removes the tape from the tape cartridge pool if no new files were detected remaining on tape. Save the tape just in case there are issues recovering files and contact IBM Spectrum Archive support to determine the root cause of the critical tape.	ltfsee recover ltfsee recall
Warning	Read-only	Indicates that the tape had a read failure.	Files on Warning tapes can be recovered by using the relocate_replica.sh script. The relocate_replica.sh script preserves the pool order the files are migrated on and will recall then remigrate the files to a new tape within the pool of the warning tape. In the end, the warning tape will be moved out of the pool. Contact IBM Spectrum Archive support to determine the root cause of the warning state.	ltfsee pool remove ltfsee reconcile ltfsee reclaim(source) ltfsee rebuild ltfsee recall
Unavailable	No	Indicates that the cartridge is not available in the IBM Spectrum Archive EE system. A tape that is newly inserted into the tape library is in this state.	If this is a brand new cartridge, run the ltfsee pool add command with the -f option to format the tape and add it to the pool. Otherwise, if the tape cartridge contains data, add it to LTFS by running the import command.	ltfsee pool remove ltfsee pool add ltfsee import
Invalid	No	The tape cartridge is inconsistent with the LTFS format and must be checked by using the -c option.	Perform a pool add with -c option to attempt to recover the tape.	ltfsee pool add ltfsee pool remove
Unformatted	No	The tape cartridge is not formatted and must be formatted by using the -f option.	Format the tape cartridge.	ltfsee pool add ltfsee pool remove
Inaccessible	No	The tape cartridge is not allowed to move in the library or might be stuck in the drive.	Remove the stuck tape cartridge and fix the cause.	ltfsee pool remove
Error	No	Indicates that the tape cartridge reported a medium error.	The tape cartridge status returns to Valid by physically removing the medium from the library, then adding it to the library again. If this state occurs again, contact IBM Spectrum Archive support to determine the root cause of the error state.	ltfsee pool remove
Not supported	No	The tape cartridge is an older generation or an LTO Write-Once, Read-Many (WORM) tape cartridge.	LTFS supports the following tape cartridges: •LTO-8 •LTO-8 (M8) •LTO-7 •LTO-6 •LTO-5 •3592 Extended data (JD) •3592 Advanced data (JC) •3592 Extended data (JB) •3592 Economy data (JK) •3592 Economy data (JL) •3592 Extended WORM data (JZ) •3592 Advanced WORM data (JY)	None
Duplicate	No	The tape cartridge has the same bar code as another tape cartridge.	Remove one of the duplicate tape cartridges from the library.	None
Disconnected	No	The Disconnected status indicates that the EE and LE components that are used by this tape cannot communicate. The admin channel connection might be disconnected.	Check the EE and LE components to see whether they are running.	ltfsee pool remove
Unusable	No	The tape has become unusable.	Perform a pool add with -c option to attempt to recover the tape.	ltfsee recall ltfsee export ltfsee rebuild ltfsee reclaim (source/target) ltfsee reconcile ltfsee pool remove
Write Fenced	Read-only	Indicates that the tape had a write failure but the index was successfully written.	Run the recover command with the -c option to generate a scan list and bring the files back into resident state, then run the recover command using the -r option to double check any missed files. The -r option removes the tape from the ltfs ee pool if no new files were detected remaining on tape. Save the tape just in case there are issues recovering files and contact IBM Spectrum Archive support to determine the root cause of the write fenced tape.	ltfsee pool remove ltfsee recover ltfsee tape move ltfsee recall

Unknown status

This status is a temporary condition that can occur when a new tape cartridge was added to the tape library but has not yet been mounted in a tape drive to load the index.

Write-protected status

This status is caused by setting the write-protection tag on the tape cartridge. If you want to use this tape cartridge in IBM Spectrum Archive EE, you must remove the write protection because a write-protected tape cartridge cannot be added to a tape cartridge pool. After the write-protection is removed, you must run the ltfsee retrieve command to update the status of the tape to Valid LTFS.

Critical or Warning status

This status can be caused by actual physical errors on the tape. Automatic recovery has been added to IBM Spectrum Archive V1.2.2. See 10.3, “Recovering data from a write failure tape” on page 288 and 10.4, “Recovering data from a read failure tape” on page 290 for recovery procedures.

Unavailable status

This status is caused by a tape cartridge being removed from LTFS. The process of adding it to LTFS (see 7.7.1, “Adding tape cartridges” on page 151) changes the status back to Valid LTFS. Therefore, this message requires no other corrective action.

Invalid LTFS status

If an error occurs while writing to a tape cartridge, it might be displayed with an Invalid LTFS status that indicates an inconsistent LTFS format. Example 10-8 shows an Invalid LTFS status.

Example 10-8 Check the tape cartridge status

[root@ltfssn1 ~]# ltfsee info tapes

Tape ID Status Type Capacity(GiB) Free(GiB) Unref(GiB) Pool Library Address Drive

JZ0072JZ Offline TS1150 0 0 0 JZJ5WORM TS4500_3592 -1 -

JYA825JY Valid TS1140(J5) 6292 6292 0 JY TS4500_3592 1035 -

JCC539JC Valid TS1140(J5) 6292 6292 0 primary_ltfssn1 TS4500_3592 258 0000078D8320

JCC541JC Invalid TS1140 0 0 0 - TS4500_3592 0 -

JD0226JD Unavailable TS1150 0 0 0 - TS4500_3592 1033 -

JD0427JD Unavailable TS1150 0 0 0 - TS4500_3592 1032 -

JY0321JY Unavailable TS1140 0 0 0 - TS4500_3592 1028 -

This status can be changed back to Valid LTFS by checking the tape cartridge. To do so, run the command that is shown in Example 10-9.

Example 10-9 Add tape cartridge to pool with check option

[root@ltfssn1 ~]# ltfsee pool add -p J5 -t JCC541JC -c

GLESL042I(00640): Adding tape JCC541JC to storage pool J5.

Tape JCC541JC successfully checked.

Added tape JCC541JC to pool J5 successfully.

Example 10-10 shows the updated tape cartridge status for JCC541JC.

Example 10-10 Check the tape cartridge status

[root@ltfssn1 ~]# ltfsee info tapes

Tape ID Status Type Capacity(GiB) Free(GiB) Unref(GiB) Pool Library Address Drive

JCC541JC Valid TS1140(J5) 6292 6292 0 J5 TS4500_3592 1030 -

JZ0072JZ Offline TS1150 0 0 0 JZJ5WORM TS4500_3592 -1 -

JYA825JY Valid TS1140(J5) 6292 6292 0 JY TS4500_3592 1035 -

JCC539JC Valid TS1140(J5) 6292 6292 0 primary_ltfssn1 TS4500_3592 1031 -

JD0226JD Unavailable TS1150 0 0 0 - TS4500_3592 1033 -

JD0427JD Unavailable TS1150 0 0 0 - TS4500_3592 1032 -

JY0321JY Unavailable TS1140 0 0 0 - TS4500_3592 1028 -

Unformatted status

This status is usually observed when a scratch tape is added to LTFS without being formatted. It can be fixed by removing the tape cartridge and adding it again with the -f (format) option, as described in 7.7.3, “Formatting tape cartridges” on page 155.

If the tape cartridge was imported from another system, the IBM LTFS Format Verifier can be useful for checking the tape format. For more information about performing diagnostic tests with the IBM LTFS Format Verifier, see 11.3.2, “Using the IBM LTFS Format Verifier” on page 331.

Inaccessible status

This status is most often the result of a stuck tape cartridge. Removing the stuck tape cartridge and then moving it back to its home slot, as shown in 7.7.2, “Moving tape cartridges” on page 154, should correct the Inaccessible status.

Error status

An Error status is often the result of errors on the tape. The cartridge cannot be used until the condition is cleared. Stop IBM Spectrum Archive EE, clear the dcache for the files on tape, and then restart IBM Spectrum Archive EE, as described in 7.4, “Starting and stopping IBM Spectrum Archive EE” on page 147. If the tape continues to go into an error state, contact IBM Spectrum Archive support to determine the root cause.

Not-supported status

Only LTO-8, M8, 7, 6, 5, and 3592-JB, JC, JD, JK, JL, JY, and JZ tape cartridges are supported by IBM Spectrum Archive EE. This message indicates that the tape cartridge is not one of these types and should be removed from the tape library.

Write Fenced status

This status is caused by actual physical errors on the tape. However, the index was successfully written on one of the tape’s partitions. Recovering such a tape removes it from the LTFS EE pool, but the tape is still marked as Write Fenced. Save the tape for future reference in case the recovery was unsuccessful, and contact IBM Spectrum Archive support. See 10.3, “Recovering data from a write failure tape” on page 288 for steps to recover a Write Fenced tape.
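
When a library holds many cartridges, it can be convenient to list only the tapes in one of the states described above. The following is a minimal sketch that filters ltfsee info tapes output that is piped to it. It assumes the column layout shown in Example 10-8, with the tape ID in the first field and the status starting in the second field. The function name tapes_with_status is hypothetical.

```shell
#!/bin/sh
# Print the IDs of tapes that are in the given status.
# Reads `ltfsee info tapes` output on stdin.
# $1: status to match; may be two words, for example "Write Fenced"
# Assumption: field 1 is the tape ID and the status starts at field 2.
tapes_with_status() {
    awk -v status="$1" '
        BEGIN { n = split(status, want, " ") }
        {
            # Compare each word of the wanted status with fields 2, 3, ...
            for (i = 1; i <= n; i++)
                if ($(i + 1) != want[i]) next
            print $1
        }'
}
```

For example, ltfsee info tapes | tapes_with_status "Write Fenced" prints one tape ID per line. Pass both words of a two-word status, because "Write" alone also matches Write Protected tapes.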

10.3 Recovering data from a write failure tape

The following are steps for recovering data from a write failure tape, including both Critical and Write Fenced tapes:

1. Verify that the tape is in either the Critical or Write Fenced state by running ltfsee info tapes.

2. Run ltfsee recover -c on the Critical/Write Fenced tape to recall all the files on the tape and make them resident again.

3. Run ltfsee recover -r on the Critical/Write Fenced tape to perform a final file check on the tape and finally remove the tape from the pool.

If the tape was in the Critical state, the drive is unlocked for further use after step 3 completes.

Example 10-11 shows the commands and output to recover a Write Fenced tape.

Example 10-11 Recovering data from a write fenced tape

[root@ltfssrv18 ~]# ltfsee info tapes | grep IM1196L6

IM1196L6 Write Fenced LTO6 0 0 0 primary lib_lib0 4112 -

[root@ltfssrv18 ~]# ltfsee recover -p primary -t IM1196L6 -c

Scanning GPFS file systems for finding migrated/saved objects in tape IM1196L6.

Tape IM1196L6 has 101 files to be recovered. The list is saved to /tmp/ltfssrv18.17771.ibm.gpfs.recoverlist.

Bulk recalling files in tape IM1196L6

GLESL268I(00151): 101 file name(s) have been provided to recall.

GLESL263I(00207): Recall result: 101 succeeded, 0 failed, 0 duplicate, 0 not migrated, 0 not found.

Making 101 files resident in tape IM1196L6.

Changed to resident: 10/101.

Changed to resident: 20/101.

Changed to resident: 30/101.

Changed to resident: 40/101.

Changed to resident: 50/101.

Changed to resident: 60/101.

Changed to resident: 70/101.

Changed to resident: 80/101.

Changed to resident: 90/101.

Changed to resident: 100/101.

Scanning remaining objects migrated/saved in tape IM1196L6.

Scanning non EE objects in tape IM1196L6.

Recovery of tape IM1196L6 is successfully done. 101 files are recovered. The list is saved to /tmp/ltfssrv18.17771.ibm.gpfs.recoverlist.

[root@ltfssrv18 ~]# ltfsee recover -p primary -t IM1196L6 -r

Scanning GPFS file systems for finding migrated/saved objects in tape IM1196L6.

Tape IM1196L6 has no file to be recovered.

Removed the tape IM1196L6 from the pool write_fenced.

[root@ltfssrv18 ~]# ltfsee info tapes | grep IM1196L6

IM1196L6 Write Fenced LTO6 0 0 0 - lib_ltfssrv18 4112 -

Example 10-12 shows the commands and output to recover data from a Critical tape.

Example 10-12 Recovering data from a critical tape

[root@ltfssrv18 ~]# ltfsee info tapes | grep Critical

JD0335JD Critical TS1150(J5) 9022 9022 0 PrimPool lib_ltfssrv18 260 0000078D8322

[root@ltfssrv18 ~]# ltfsee recover -p PrimPool -t JD0335JD -c

Scanning GPFS file systems for finding migrated/saved objects in tape JD0335JD.

Tape JD0335JD has 100 files to be recovered. The list is saved to /tmp/ltfssrv18.tuc.stglabs.ibm.com.18168.ibm.glues.recoverlist.

Bulk recalling files in tape JD0335JD

GLESL268I(00151): 100 file name(s) have been provided to recall.

GLESL263I(00207): Recall result: 100 succeeded, 0 failed, 0 duplicate, 0 not migrated, 0 not found.

Making 100 files resident in tape JD0335JD.

Changed to resident: 10/100.

Changed to resident: 20/100.

Changed to resident: 30/100.

Changed to resident: 40/100.

Changed to resident: 50/100.

Changed to resident: 60/100.

Changed to resident: 70/100.

Changed to resident: 80/100.

Changed to resident: 90/100.

Changed to resident: 100/100.

Scanning remaining objects migrated/saved in tape JD0335JD.

Scanning non EE objects in tape JD0335JD.

Recovery of tape JD0335JD is successfully done. 100 files are recovered. The list is saved to /tmp/ltfssrv18.tuc.stglabs.ibm.com.18168.ibm.glues.recoverlist.

[root@ltfssrv18 ~]# ltfsee recover -p PrimPool -t JD0335JD -r

Scanning GPFS file systems for finding migrated/saved objects in tape JD0335JD.

Tape JD0335JD has no file to be recovered.

Removed the tape JD0335JD from the pool PrimPool.

[root@ltfssrv18 ~]# ltfsee info tapes | grep JD0335JD

JD0335JD Error TS1150 0 0 0 - lib_ltfssrv18 1063 -
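
Step 1 of this procedure, confirming the tape state before recover is run, can be turned into a guard in a script. The following is a minimal sketch that assumes the ltfsee info tapes column layout shown in Examples 10-11 and 10-12. The function name is_write_failure_tape is hypothetical, and the sketch itself runs no ltfsee command.

```shell
#!/bin/sh
# Succeed (exit status 0) if the tape is listed as Critical or Write Fenced.
# Reads `ltfsee info tapes` output on stdin.
# $1: tape ID
# Assumption: field 1 is the tape ID and the status starts at field 2.
is_write_failure_tape() {
    awk -v id="$1" '
        ($1 "") == (id "") &&
        ($2 == "Critical" || ($2 == "Write" && $3 == "Fenced")) { found = 1 }
        END { exit !found }'
}
```

A script could then run ltfsee recover with the -c option only when ltfsee info tapes | is_write_failure_tape JD0335JD succeeds, which avoids starting a recovery against a tape that is in some other state.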

10.4 Recovering data from a read failure tape

To copy migrated files from a Warning tape to a valid tape within the same pool, complete the following steps:

1. Identify a tape with a read failure by running ltfsee info tapes to locate the Warning tape.

2. After the Warning tape has been identified, run the relocate_replica.sh script to copy the files from the Warning tape to a new tape within the same pool.

3. After a successful copy, remove the Warning tape from the library and discard it.

The syntax for the script is as follows:

relocate_replica.sh -t <tape id> -p <pool name>@<library name>:<pool name>@<library name>[:<pool name>@<library name>] -P <path name>

•-t <tape id>

Specifies tape ID to relocate replica from

•-p <pool name>@<library name>:[<pool name>@<library name>]

Specifies pool names and library names to store replicas after running this script

•-P <path name>

Specifies path to GPFS file system to scan

Example 10-13 shows system output of the steps to recover data from a read failure tape.

Example 10-13 Recovering data from a read failure

[root@ltfssrv18 ~]# ltfsee info tapes | grep primary

JD2065JD Warning TS1150(J5) 9022 9021 0 primary lib0 1036 -

JD2067JD Valid TS1150(J5) 9022 9021 0 primary lib0 1038 -

JD2066JD Valid TS1150(J5) 9022 9022 0 primary lib0 1037 -

[root@ltfssrv18 ~]# ltfsee info files `ls | head -1`

Name: LTFS_EE_FILE_0qIc9FO1LsaBmyTvA_b1ljgl.bin

Tape id:JD2065JD@lib0:JD2070JD@lib0:JD2064JD@lib0 Status: premigrated

[root@ltfssrv18 ~]# ./relocate_replica.sh -t JD2065JD -p primary@lib0:copy@lib0:copy2@lib0 -P /ibm/gpfs/

1. Getting pool name and library name to which tape JD2065JD belongs

2. Removing specified tape JD2065JD from pool primary@lib0

3. Creating policy file...

4. Performing policy scan...

5. Recalling migrated files to premigrated state

6. Removing replica in target tape JD2065JD

7. Creating replica in alternative tape in pool primary@lib0

8. Creating policy file...

9. Performing policy scan...

All of replicas in tape JD2065JD have been relocated successfully.

[root@ltfssrv18 ~]# ltfsee info tapes | grep JD2065JD

JD2065JD Unavailable TS1150 0 0 0 - lib0 1036 -

[root@ltfssrv18 ~]# ltfsee info files `ls | head -1`

Name: LTFS_EE_FILE_0qIc9FO1LsaBmyTvA_b1ljgl.bin

Tape id:JD2066JD@lib0:JD2070JD@lib0:JD2064JD@lib0 Status: premigrated

Note: To obtain the relocate_replica.sh script, see Appendix A, “Additional material” on page 335.
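
Example 10-13 confirms the relocation by comparing the Tape id line of a file before and after the script runs. That comparison can be automated with the following minimal sketch, which assumes the ltfsee info files output format shown in Example 10-13 (a line that starts with "Tape id:" followed by colon-separated tape@library entries). The function name refers_to_tape is hypothetical.

```shell
#!/bin/sh
# Succeed (exit status 0) if any "Tape id:" line on stdin lists the given tape.
# Reads `ltfsee info files` output on stdin.
# $1: tape ID to look for (for example, the Warning tape)
refers_to_tape() {
    awk -v id="$1" '
        /^Tape id:/ {
            sub(/^Tape id:/, "", $0)       # drop the label; fields are re-split
            n = split($1, replicas, ":")   # field 1 holds tape@library entries
            for (i = 1; i <= n; i++) {
                split(replicas[i], part, "@")
                if (part[1] == id) found = 1
            }
        }
        END { exit !found }'
}
```

After a successful relocation, piping ltfsee info files output for the affected files into refers_to_tape JD2065JD is expected to fail, because no replica remains on the Warning tape.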

10.5 Handling export errors

The following are the steps to clean up files referencing exported tapes on the IBM Spectrum Archive file system when there are export errors:

1. Stop the LTFS EE service by running ltfsee stop.

2. After the process has stopped (verify this by running pidof mmm, which returns no output when the service is stopped), gather all the IBMTPS attributes from the failed export message.

3. Run ltfsee_export_fix -T <IBMTPS_attribute> [<IBMTPS_attribute>].

4. Start the LTFS EE service by running ltfsee start.

Example 10-14 shows a typical export error and then follows the steps above in resolving the problem.

Example 10-14 Fix IBMTPS file pointers on the GPFS file system

[root@ltfssrv18 ~]# ltfsee export -p PrimPool -t JD3592JD

GLESS016I(00184): Reconciliation requested.

GLESM401I(00194): Loaded the global configuration.

GLESS049I(00636): Tapes to reconcile: JD3592JD .

GLESS050I(00643): GPFS file systems involved: /ibm/glues .

GLESS134I(00665): Reserving tapes for reconciliation.

GLESS135I(00698): Reserved tapes: JD3592JD .

GLESS054I(00736): Creating GPFS snapshots:

GLESS055I(00741): Deleting previous reconcile snapshot and creating a new one for /ibm/glues ( gpfs ).

GLESS056I(00762): Scanning GPFS snapshots:

GLESS057I(00767): Scanning GPFS snapshot of /ibm/glues ( gpfs ).

GLESS060I(00843): Processing scan results:

GLESS061I(00848): Processing scan results for /ibm/glues ( gpfs ).

GLESS141I(00861): Removing stale DMAPI attributes:

GLESS142I(00866): Removing stale DMAPI attributes for /ibm/glues ( gpfs ).

GLESS063I(00899): Reconciling the tapes:

GLESS086I(00921): Reconcile is skipped for tape JD3592JD because it is already reconciled.

GLESS137I(01133): Removing tape reservations.

GLESS058I(02319): Removing GPFS snapshots:

GLESS059I(02326): Removing GPFS snapshot of /ibm/glues ( gpfs ).

Export of tape JD3592JD has been requested...

GLESL075E(00660): Export of tape JD3592JD completed with errors. Some GPFS files still refer files in the exported tape.

GLESL373I(00765): Moving tape JD3592JD.

Tape JD3592JD is unmounted because it is inserted into the drive.

GLESL043I(00151): Removing tape JD3592JD from storage pool replicate.

GLESL631E(00193): Failed to export some tapes.

Tapes (<no tape>) were successfully exported.

Tapes (<no tape>) are still in the pool and needs a retry to export them.

Tapes (JD3592JD) are in Exported state but some GPFS files may still refer files in these tapes. TPS list to fix are: JD3592JD@65fb41ec-b42d-4fc5-8957-d57e1567aac1@0000013FA0020404

[root@ltfssrv18 ~]# ltfsee stop

Library name: lib_ltfssrv18, library id: 0000013FA0020404, control node (MMM) IP address: 9.11.121.249.

Stopped LTFS EE service (MMM) for library lib_ltfssrv18.

[root@ltfssrv18 ~]# pidof mmm

[root@ltfssrv18 ~]# ltfsee_export_fix -T JD3592JD@65fb41ec-b42d-4fc5-8957-d57e1567aac1@0000013FA0020404

Please make sure that IBM Spectrum Archive EE is not running on any node in this cluster.

And please do not remove/rename any file in all the DMAPI-enabled GPFS if possible.

Type "yes" to continue.

yes

GLESY016I(00545): Start finding strayed stub files and fix them for GPFS gpfs (id=14593091431837534985)

GLESY020I(00558): Listing up files that needs to be fixed for GPFS gpfs.

GLESY015I(00531): Fix of exported files completes. Total=985, Succeeded=985, Failed=0

GLESY018I(00603): Successfully fixed files in GPFS gpfs.

GLESY025I(00615): ltfsee_export_fix exits with RC=0

[root@ltfssrv18 ~]# ltfsee start

Library name: lib_ltfssrv18, library id: 0000013FA0020404, control node (MMM) IP address: 9.11.121.249.

GLESM397I(00221): Configuration option: DISABLE_AFM_CHECK yes.

GLESM401I(00264): Loaded the global configuration.

GLESM402I(00301): Created the Global Resource Manager.

GLESM403I(00316): Fetched the node groups from the Global Resource Manager.

GLESM404I(00324): Detected the IP address of the MMM (9.11.121.249).

GLESM405I(00335): Configured the node group (G0).

GLESM406I(00344): Created the unassigned list of the library resources.

GLESL536I(00080): Started the Spectrum Archive EE service (MMM) for library lib_ltfssrv18.

10.6 Software

IBM Spectrum Archive EE is composed of four major components, each with its own set of log files. Therefore, problem analysis is slightly more involved than with other products. This section describes how to troubleshoot issues with each component in turn, and also covers the Linux operating system and Simple Network Management Protocol (SNMP) alerts.

10.6.1 Linux

The log file /var/log/messages contains global Linux system messages, including the messages that are logged during system start and messages that are related to LTFS and IBM Spectrum Archive EE functions. However, three specific log files are also created:

•ltfs.log

•ltfsee.log

•ltfsee_trc.log

Unlike previous LTFS/IBM Spectrum Archive products, you do not need to enable system logging on Linux because it is configured automatically during the installation process. Example 10-15 shows the changes to the rsyslog.conf file and the location of the three log files.

Example 10-15 The rsyslog.conf file

[root@ltfssn1 ~]# cat /etc/rsyslog.conf | grep ltfs

:msg, startswith, "GLES," /var/log/ltfsee_trc.log;gles_trc_template

:msg, startswith, "GLES" /var/log/ltfsee.log;RSYSLOG_FileFormat

:msg, regex, "LTFS[ID0-9][0-9]*[EWID]" /var/log/ltfs.log;RSYSLOG_FileFormat
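As an illustration of how the three filters in Example 10-15 divide messages among the log files, the following Python sketch applies the same prefix and regular-expression tests to a message string. It is not how rsyslog is implemented: the function name matching_logs is hypothetical, each rule is evaluated independently, and any discard directives that the grep output does not show are ignored.

```python
import re

# Pattern copied from the third rule in Example 10-15.
LTFS_ID = re.compile(r"LTFS[ID0-9][0-9]*[EWID]")

# Each rule pairs a test on the message text with a destination file.
RULES = [
    (lambda msg: msg.startswith("GLES,"), "/var/log/ltfsee_trc.log"),
    (lambda msg: msg.startswith("GLES"), "/var/log/ltfsee.log"),
    (lambda msg: LTFS_ID.search(msg) is not None, "/var/log/ltfs.log"),
]


def matching_logs(msg):
    """Return every destination whose filter matches msg (hypothetical helper)."""
    return [dest for test, dest in RULES if test(msg)]
```

Under these assumptions, a message that starts with GLESL159E matches only the ltfsee.log rule, and a message that contains LTFS11545I matches only the ltfs.log rule.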

By default, after the ltfs.log, ltfsee.log, and ltfsee_trc.log files reach 1 MB, they are rotated and four copies are kept. Example 10-16 shows the log file rotation settings. These settings can be adjusted as needed within the /etc/logrotate.d/syslog control file.

Example 10-16 Syslog rotation

[root@ltfssn1 ~]# cat /etc/logrotate.d/syslog

/var/log/cron

/var/log/maillog

/var/log/messages

/var/log/secure

/var/log/spooler

{

sharedscripts

postrotate

/bin/kill -HUP `cat /var/run/syslogd.pid 2> /dev/null` 2> /dev/null || true

endscript

}

/var/log/ltfs.log {

size 1M

rotate 4

missingok

compress

sharedscripts

postrotate

/bin/kill -HUP `cat /var/run/syslogd.pid 2> /dev/null` 2> /dev/null || true

/bin/kill -HUP `cat /var/run/rsyslogd.pid 2> /dev/null` 2> /dev/null || true

endscript

}

/var/log/ltfsee.log {

size 1M

rotate 4

missingok

compress

sharedscripts

postrotate

/bin/kill -HUP `cat /var/run/syslogd.pid 2> /dev/null` 2> /dev/null || true

/bin/kill -HUP `cat /var/run/rsyslogd.pid 2> /dev/null` 2> /dev/null || true

endscript

}

/var/log/ltfsee_trc.log {

size 1M

rotate 4

missingok

compress

sharedscripts

postrotate

/bin/kill -HUP `cat /var/run/syslogd.pid 2> /dev/null` 2> /dev/null || true

/bin/kill -HUP `cat /var/run/rsyslogd.pid 2> /dev/null` 2> /dev/null || true

endscript

}

These log files (ltfs.log, ltfsee.log, ltfsee_trc.log, and /var/log/messages) are invaluable in troubleshooting LTFS and IBM Spectrum Archive EE problems. The ltfsee.log file contains only warning and error messages, which makes it a good place to start looking for the cause of a failure. For example, a failed file migration might return the messages that are shown in Example 10-17.

Example 10-17 Simple migration with informational messages

# ltfsee migrate mig -p PrimPool@lib_lto

GLESL167I(00400): A list of files to be migrated has been sent to LTFS EE using scan id 1842482689.

GLESL159E(00440): Not all migration has been successful.

GLESL038I(00448): Migration result: 0 succeeded, 1 failed, 0 duplicate, 0 duplicate wrong pool, 0 not found, 0 too small to qualify for migration, 0 too early for migration.

From the GLESL159E message, you know that the migration was unsuccessful, but you do not know why it was unsuccessful. To understand why, you must examine the ltfsee.log file. Example 10-18 shows the end of the ltfsee.log file immediately after the failed migrate command is run.

Example 10-18 The ltfsee.log file

# tail /var/log/ltfsee.log

2016-12-14T09:05:38.494320-07:00 ltfs97 mmm[7889]: GLESM600E(01691): Failed to migrate/premigrate file /ibm/gpfs/file1.mpeg. The specified pool name does not match the existing replica copy.

2016-12-14T09:05:48.500848-07:00 ltfs97 ltfsee[29470]: GLESL159E(00440): Not all migration has been successful.

In this case, the migration of the file was unsuccessful because the file was previously migrated or premigrated to a tape cartridge in a pool other than the one that was specified on the command.
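Scanning ltfsee.log by eye works for short files. For longer files, a small script can pull out only the error entries. The following Python sketch assumes that each line has the layout that is seen in Example 10-18 (time stamp, host, process and PID, message ID with a number in parentheses, and text). The regular expression and the error_entries name are illustrative assumptions and are not part of the product.

```python
import re

# Layout assumed from Example 10-18:
# <timestamp> <host> <process>[<pid>]: <message ID>(<number>): <text>
LINE = re.compile(
    r"^(?P<time>\S+) (?P<host>\S+) (?P<process>[^\[\s]+)\[(?P<pid>\d+)\]: "
    r"(?P<msg_id>GLES[A-Z]\d{3}[EWID])\((?P<number>\d+)\): (?P<text>.*)$"
)


def error_entries(lines):
    """Yield (time, process, msg_id, text) for entries whose ID ends in E."""
    for line in lines:
        match = LINE.match(line.strip())
        if match and match.group("msg_id").endswith("E"):
            yield (match.group("time"), match.group("process"),
                   match.group("msg_id"), match.group("text"))
```

Because the function accepts any iterable of lines, it can be given an open file object for /var/log/ltfsee.log or a list of lines that were copied from a terminal.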

IBM Spectrum Archive EE has two logging facilities. One is in a human-readable format that is intended to be monitored by users, and the other is in a machine-readable format that is used for further problem analysis. The former is written to /var/log/ltfsee.log through the “user” syslog facility and contains only warnings and errors. The latter is written to /var/log/ltfsee_trc.log through the “local2” syslog facility.

The messages in machine-readable format can be converted into human-readable format by the newly created tool ltfsee_catcsvlog, which is run by the following command:

/opt/ibm/ltfsee/bin/ltfsee_catcsvlog /var/log/ltfsee_trc.log

The ltfsee_catcsvlog command accepts multiple log files as command-line arguments. If no argument is specified, ltfsee_catcsvlog reads from stdin.

Persistent problems

This section describes ways to solve persistent IBM Spectrum Archive EE problems.

If an unexpected and persistent condition occurs in the IBM Spectrum Archive EE environment, contact your IBM service representative. Provide the following information to help IBM re-create and solve the problem:

•Machine type and model of your IBM tape library in use for IBM Spectrum Archive EE

•Machine type and model of the tape drives that are embedded in the tape library

•Specific IBM Spectrum Archive EE version

•Description of the problem

•System configuration

•Operation that was performed at the time the problem was encountered

The operating system automatically generates system log files after the initial configuration of IBM Spectrum Archive EE. Provide the results of the ltfsee_log_collection command to your IBM service representative.

10.6.2 IBM Spectrum Scale

IBM Spectrum Scale writes operational messages and error data to the IBM Spectrum Scale log file. The IBM Spectrum Scale log can be found in the /var/adm/ras directory on each node. The IBM Spectrum Scale log file is named mmfs.log.date.nodeName, where date is the time stamp when the instance of IBM Spectrum Scale started on the node and nodeName is the name of the node. The latest IBM Spectrum Scale log file can be found by using the symbolic file name /var/adm/ras/mmfs.log.latest.

The IBM Spectrum Scale log from the prior start of IBM Spectrum Scale can be found by using the symbolic file name /var/adm/ras/mmfs.log.previous. All other files have a time stamp and node name that is appended to the file name.

At IBM Spectrum Scale start, files that were not accessed during the last 10 days are deleted. If you want to save old files, copy them elsewhere.
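To see which log files are at risk before the next start, you can compare their access times with the 10-day limit. The following Python sketch lists the files in a directory whose last access time is older than a given number of days. The function name stale_logs and the mmfs.log. prefix filter are assumptions for illustration. The sketch only reports files and does not copy or delete anything.

```python
import os
import time


def stale_logs(directory="/var/adm/ras", days=10, now=None):
    """Return sorted names of mmfs.log.* files not accessed for `days` days."""
    now = time.time() if now is None else now
    cutoff = now - days * 24 * 60 * 60
    stale = []
    for entry in os.scandir(directory):
        # Skip symbolic names such as mmfs.log.latest and mmfs.log.previous.
        if entry.is_symlink() or not entry.is_file():
            continue
        if entry.name.startswith("mmfs.log.") and entry.stat().st_atime < cutoff:
            stale.append(entry.name)
    return sorted(stale)
```

Calling the function with a smaller value for days, such as 7, gives a margin in which to copy the listed files elsewhere.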

Example 10-19 shows normal operational messages that appear in the IBM Spectrum Scale log file.

Example 10-19 Normal operational messages in an IBM Spectrum Scale log file

[root@ltfs97 ]# cat /var/adm/ras/mmfs.log.latest

Wed Apr 3 13:25:04 JST 2013: runmmfs starting

Removing old /var/adm/ras/mmfs.log.* files:

Unloading modules from /lib/modules/2.6.32-220.el6.x86_64/extra

Loading modules from /lib/modules/2.6.32-220.el6.x86_64/extra

Module Size Used by

mmfs26 1749012 0

mmfslinux 311300 1 mmfs26

tracedev 29552 2 mmfs26,mmfslinux

Wed Apr 3 13:25:06.026 2013: mmfsd initializing. {Version: 3.5.0.7 Built: Dec 12 2012 19:00:50} ...

Wed Apr 3 13:25:06.731 2013: Pagepool has size 3013632K bytes instead of the requested 29360128K bytes.

Wed Apr 3 13:25:07.409 2013: Node 192.168.208.97 (htohru9) is now the Group Leader.

Wed Apr 3 13:25:07.411 2013: This node (192.168.208.97 (htohru9)) is now Cluster Manager for htohru9.ltd.sdl.

Starting ADSM Space Management daemons

Wed Apr 3 13:25:17.907 2013: mmfsd ready

Wed Apr 3 13:25:18 JST 2013: mmcommon mmfsup invoked. Parameters: 192.168.208.97 192.168.208.97 all

Wed Apr 3 13:25:18 JST 2013: mounting /dev/gpfs

Wed Apr 3 13:25:18.179 2013: Command: mount gpfs

Wed Apr 3 13:25:18.353 2013: Node 192.168.208.97 (htohru9) appointed as manager for gpfs.

Wed Apr 3 13:25:18.798 2013: Node 192.168.208.97 (htohru9) completed take over for gpfs.

Wed Apr 3 13:25:19.023 2013: Command: err 0: mount gpfs

Wed Apr 3 13:25:19 JST 2013: finished mounting /dev/gpfs

Depending on the size and complexity of your system configuration, the amount of time to start IBM Spectrum Scale varies. Taking your system configuration into consideration, if you cannot access a file system that is mounted (automatically or by running a mount command) after a reasonable amount of time, examine the log file for error messages.

The IBM Spectrum Scale log is a repository of error conditions that were detected on each node, and operational events, such as file system mounts. The IBM Spectrum Scale log is the first place to look when you are attempting to debug abnormal events. Because IBM Spectrum Scale is a cluster file system, events that occur on one node might affect system behavior on other nodes, and all IBM Spectrum Scale logs can have relevant data.

A common error when you try to mount GPFS is that the mount command cannot read the superblock. Example 10-20 shows the output of this error.

Example 10-20 Superblock error from mounting GPFS

[root@ltfsml1 ~]# mmmount gpfs

Wed May 24 12:53:59 MST 2017: mmmount: Mounting file systems ...

mount: gpfs: can't read superblock

mmmount: Command failed. Examine previous error messages to determine cause.

This error occurs, and the mount fails, because the GPFS file system has DMAPI enabled but the HSM process has not been started. To resolve the error and mount GPFS, run the systemctl start hsm command, and verify that HSM is running by running systemctl status hsm. After HSM is running, wait for the recall processes to start. You can check for them by running ps -afe | grep dsm. Example 10-21 shows the output of starting HSM, checking its status, and mounting GPFS.

Example 10-21 Starting HSM and mounting GPFS

[root@ltfsml1 ~]# systemctl start hsm

[root@ltfsml1 ~]# systemctl status hsm

● hsm.service - HSM Service

Loaded: loaded (/usr/lib/systemd/system/hsm.service; enabled; vendor preset: disabled)

Active: active (running) since Wed 2017-05-24 13:04:59 MST; 4s ago

Main PID: 16938 (dsmwatchd)

CGroup: /system.slice/hsm.service

+-16938 /opt/tivoli/tsm/client/hsm/bin/dsmwatchd nodetach

May 24 13:04:59 ltfsml1.tuc.stglabs.ibm.com systemd[1]: Started HSM Service.

May 24 13:04:59 ltfsml1.tuc.stglabs.ibm.com systemd[1]: Starting HSM Service...

May 24 13:04:59 ltfsml1.tuc.stglabs.ibm.com dsmwatchd[16938]: HSM(pid:16938): start

[root@ltfsml1 ~]# ps -afe | grep dsm

root 7906 1 0 12:56 ? 00:00:00 /opt/tivoli/tsm/client/hsm/bin/dsmwatchd nodetach

root 9748 1 0 12:57 ? 00:00:00 dsmrecalld

root 9773 9748 0 12:57 ? 00:00:00 dsmrecalld

root 9774 9748 0 12:57 ? 00:00:00 dsmrecalld

root 9900 26012 0 12:57 pts/0 00:00:00 grep --color=auto dsm

[root@ltfsml1 ~]# mmmount gpfs

Wed May 24 12:57:22 MST 2017: mmmount: Mounting file systems ...

[root@ltfsml1 ~]# df -h | grep gpfs

gpfs 280G 154G 127G 55% /ibm/glues

If HSM is already running, verify that the dsmrecalld daemons are running by running ps -afe | grep dsm. If no dsmrecalld daemons are running, start them by running dsmmigfs start. After they start, GPFS can be mounted successfully.
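The daemon check can also be scripted. The following Python sketch counts dsmwatchd and dsmrecalld processes in text that has the same column layout as the ps -afe output in Example 10-21. The function name hsm_daemon_counts is hypothetical, and the sketch does not run ps itself; it assumes that the output was already captured as a string.

```python
def hsm_daemon_counts(ps_output):
    """Count dsmwatchd and dsmrecalld processes in `ps -afe` style output."""
    counts = {"dsmwatchd": 0, "dsmrecalld": 0}
    for line in ps_output.splitlines():
        # Columns: UID PID PPID C STIME TTY TIME CMD (CMD can contain spaces).
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue
        command = fields[7].split()[0]     # executable, without its arguments
        name = command.rsplit("/", 1)[-1]  # drop any directory prefix
        if name in counts:
            counts[name] += 1
    return counts
```

A dsmrecalld count of zero corresponds to the case in which the daemons must be started with dsmmigfs start. The grep process itself is not counted because its executable name is grep.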

10.6.3 IBM Spectrum Archive LE+ component

This section describes the options that are available to analyze problems that are identified by the LTFS logs. It also provides links to messages and actions that can be used to troubleshoot the source of an error.

The messages that are referenced in this section provide possible actions only for solvable error codes. The error codes that are reported by the LTFS program can be retrieved from the terminal console or log files. For more information about retrieving error messages, see 10.6.1, “Linux” on page 293.

When multiple errors are reported, LTFS attempts to find a message ID and an action for each error code. If you cannot locate a message ID or an action for a reported error code, LTFS encountered a critical problem. The same is true if you retry the initial action and it continues to fail. In these cases, contact your IBM service representative for more support.

Message ID strings start with the keyword LTFS and are followed by a four- or five-digit value. However, some message IDs include the uppercase letter I or D after LTFS, but before the four- or five-digit value. When an IBM Spectrum Archive EE command is run and returns an error, check the message ID to ensure that you do not mistake the letter I for the numeral 1.

A complete list of all LTFS messages can be found in the IBM Spectrum Archive EE section of IBM Knowledge Center, which is available here:

https://www.ibm.com/support/knowledgecenter/ST9MBR_1.2.5/ltfs_ee_messages.html

At the end of the message ID, the following single capital letters indicate the importance of the problem:

•E: Error

•W: Warning

•I: Information

•D: Debugging

When you troubleshoot, check messages for errors only.

Example 10-22 shows a problem analysis procedure for LTFS.

Example 10-22 LTFS messages

cat /var/log/ltfs.log

LTFS11571I State of tape '' in slot 0 is changed from 'Not Initialized' to 'Non-supported'

LTFS14546W Cartridge '' does not have an associated LTFS volume

LTFS11545I Rebuilding the cartridge inventory

LTFSI1092E This operation is not allowed on a cartridge without a bar code

LTFS14528D [localhost at [127.0.0.1]:34888]: --> CMD_ERROR

The first 10 characters of each line represent the message ID, and the text that follows describes the operational state of LTFS. The fourth message ID (LTFSI1092E) in this list indicates that an error was generated because the last character is the letter E. The character immediately following LTFS is the letter I, not the numeral 1. The complete message, including an explanation and appropriate course of action for LTFSI1092E, is shown in Example 10-23.

Example 10-23 Example of message

LTFSI1092E This operation is not allowed on a cartridge without a bar code

Explanation

A bar code must be attached to the medium to perform this operation.

Action

Attach a bar code to the medium and try again.

Based on the description that is provided here, the tape cartridge in the library does not have a bar code. Therefore, the operation is rejected by LTFS. The required user action to solve the problem is to attach a bar code to the medium and try again.
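The message ID rules that are described in this section can be expressed as a short pattern. The following Python sketch splits an LTFS message ID into its optional letter, its numeric value, and its severity. The pattern is derived only from the description in this section, and the parse_ltfs_id name is hypothetical, so treat it as an approximation rather than a definitive grammar.

```python
import re

# LTFS, an optional letter I or D, four or five digits, then a severity letter.
LTFS_ID = re.compile(
    r"^LTFS(?P<letter>[ID]?)(?P<number>\d{4,5})(?P<severity>[EWID])$"
)

SEVERITY = {"E": "Error", "W": "Warning", "I": "Information", "D": "Debugging"}


def parse_ltfs_id(message_id):
    """Split an LTFS message ID into (letter, number, severity name)."""
    match = LTFS_ID.match(message_id)
    if match is None:
        raise ValueError(f"not an LTFS message ID: {message_id!r}")
    return (match.group("letter"), match.group("number"),
            SEVERITY[match.group("severity")])
```

With this pattern, LTFSI1092E is read as the letter I, the value 1092, and the severity Error, which makes the distinction between the letter I and the numeral 1 explicit.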

10.6.4 Hierarchical storage management

During installation, hierarchical storage management (HSM) is configured to write log entries to a log file in /opt/tivoli/tsm/client/hsm/bin/dsmerror.log. Example 10-24 shows an example of this file.

Example 10-24 The dsmerror.log file

[root@ltfs97 /]# cat dsmerror.log

03/29/2013 15:24:28 ANS9101E Migrated files matching '/ibm/glues/file1.img' could not be found.

03/29/2013 15:24:28 ANS9101E Migrated files matching '/ibm/glues/file2.img' could not be found.

03/29/2013 15:24:28 ANS9101E Migrated files matching '/ibm/glues/file3.img' could not be found.

03/29/2013 15:24:28 ANS9101E Migrated files matching '/ibm/glues/file4.img' could not be found.

03/29/2013 15:24:28 ANS9101E Migrated files matching '/ibm/glues/file5.img' could not be found.

03/29/2013 15:24:28 ANS9101E Migrated files matching '/ibm/glues/file6.img' could not be found.

04/02/2013 16:24:06 ANS9510E dsmrecalld: cannot get event messages from session 515A6F7E00000000, expected max message-length = 1024, returned message-length = 144. Reason : Stale NFS file handle

04/02/2013 16:24:06 ANS9474E dsmrecalld: Lost my session with errno: 1 . Trying to recover.

04/02/13 16:24:10 ANS9433E dsmwatchd: dm_send_msg failed with errno 1.

04/02/2013 16:24:11 ANS9433E dsmrecalld: dm_send_msg failed with errno 1.

04/03/13 13:25:06 ANS9505E dsmwatchd: cannot initialize the DMAPI interface. Reason: Stale NFS file handle

04/03/2013 13:38:14 ANS1079E No file specification entered

04/03/2013 13:38:20 ANS9085E dsmrecall: file system / is not managed by space management.

The HSM log contains information about file migration and recall, threshold migration, reconciliation, and starting and stopping the HSM daemon. You can analyze the HSM log to determine the current state of the system. For example, the logs can indicate when a recall has started but not finished within the last hour. The administrator can analyze a particular recall and react accordingly.

In addition, an HSM log might be analyzed by an administrator to optimize HSM usage. For example, if the HSM log indicates that 1,000 files are recalled at the same time, the administrator might suggest that the files first be combined into one .tar file and then migrated.

10.6.5 IBM Spectrum Archive EE logs

This section describes IBM Spectrum Archive EE logs and message IDs and provides some tips for dealing with failed recalls and lost or strayed files.

IBM Spectrum Archive EE log collection tool

IBM Spectrum Archive EE writes its logs to the files /var/log/ltfsee.log and /var/log/ltfsee_trc.log. These files can be viewed in a text editor for troubleshooting purposes. Use the IBM Spectrum Archive EE log collection tool to collect data that you can send to IBM Support.

The ltfsee_log_collection tool is in the /opt/ibm/ltfsee/bin folder. To use the tool, complete the following steps:

1. Log on to the operating system as the root user and open a console.

2. Start the tool by running the following command:

# /opt/ibm/ltfsee/bin/ltfsee_log_collection

3. When the following message displays, read the instructions, then enter y or p to continue:

LTFS Enterprise Edition - log collection program

This program collects the following information from your GPFS cluster.

a. Log files that are generated by GPFS, LTFS Enterprise Edition

b. Configuration information that is configured to use GPFS and LTFS Enterprise Edition

c. System information including OS distribution and kernel, and hardware information (CPU and memory)

If you want to collect all the information, enter y.

If you want to collect only a and b, enter p (partial).

If you do not want to collect any information, enter n.

The collected data is compressed in the ltfsee_log_files_<date>_<time>.tar.gz file. You can check the contents of the file before submitting it to IBM.

4. Make sure that a packed file with the name ltfsee_log_files_[date]_[time].tar.gz is created in the current directory. This file contains the collected log files.

5. Send the tar.gz file to your IBM service representative.
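The tool output in step 3 notes that you can check the contents of the archive before you submit it. One way to do so without extracting anything is sketched in the following Python example, which uses the standard tarfile module. The function name list_collected_files is hypothetical, and the archive path is whatever file name the tool created in your current directory.

```python
import tarfile


def list_collected_files(archive_path):
    """Return the member names of a .tar.gz archive without extracting it."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        return archive.getnames()
```

Printing the returned list shows which log and configuration files are included in the package.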

Messages reference

For IBM Spectrum Archive EE, message ID strings start with the keyword GLES and are followed by a single letter and then by a three-digit value. The single letter indicates which component generated the message. For example, GLESL is used to indicate all messages that are related to the IBM Spectrum Archive EE command. At the end of the message ID, the following single uppercase letter indicates the importance of the problem:

•E: Error

•W: Warning

•I: Information

•D: Debugging

When you troubleshoot, check messages for errors only. For a list of available messages, see this website:

https://www.ibm.com/support/knowledgecenter/ST9MBR_1.2.5/ltfs_ee_messages.html

Failed reconciliations

Failed reconciliations usually are indicated by the GLESS003E error message with the following description:

Reconciling tape %s failed due to a generic error.

Lost or strayed files

Table 10-2 describes the seven possible status codes for files in IBM Spectrum Archive EE. They can be viewed for individual files by running the ltfsee info files command.

Table 10-2 Status codes for files in IBM Spectrum Archive EE

Status code	Description
Resident	The Resident status indicates that the file is resident in the GPFS namespace and is not saved, migrated, or premigrated to a tape.
Migrated	The file was migrated. The file was copied from a GPFS file system to a tape, and exists only as a stub file in the IBM Spectrum Scale namespace.
Premigrated	The file was partially migrated. An identical copy exists on your local file system and on tape.
Saved	The Saved status indicates that a file system object that has no data (a symbolic link, an empty directory, or an empty regular file) was saved. The file system object was copied from the GPFS file system to a tape.
Offline	The file was migrated to a tape cartridge and then the tape cartridge was exported offline.
Lost	The file was in the migrated status, but the file is not accessible from IBM Spectrum Archive EE because the tape cartridge that the file is supposed to be on is not accessible. The file might be lost because of tape corruption or if the tape cartridge was removed from the system without first exporting it.
Strayed	The file was in the premigrated status, but the file is not accessible from IBM Spectrum Archive EE because the tape that the file is supposed to be on is not accessible. The file might be lost because of tape corruption or if the tape cartridge was removed from the system without first exporting it.

The only two status codes that indicate an error are lost and strayed. Files in these states should be fixed where possible by returning the missing tape cartridge to the tape library or by attempting to repair the tape corruption. For more information, see 7.20, “Checking and repairing” on page 198. If this is not possible, they should be restored from a redundant copy. For more information, see 7.14.2, “Selective recall” on page 187. If a redundant copy is not available, the stub files must be deleted from the GPFS file system.
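When many files must be checked, the rule that only two of the seven status codes indicate an error is simple to encode. The following Python sketch takes a status code as a string and reports whether it needs attention. The function name needs_attention is hypothetical, and the sketch deliberately does not parse the output of the ltfsee info files command because its exact layout is not shown in this section.

```python
# The seven status codes from Table 10-2, in lowercase for comparison.
ERROR_STATUSES = {"lost", "strayed"}
ALL_STATUSES = {"resident", "migrated", "premigrated", "saved",
                "offline"} | ERROR_STATUSES


def needs_attention(status):
    """Return True for the two status codes that indicate an error."""
    normalized = status.strip().lower()
    if normalized not in ALL_STATUSES:
        raise ValueError(f"unknown status code: {status!r}")
    return normalized in ERROR_STATUSES
```

Raising an error for an unknown value, instead of returning False, avoids silently treating a misspelled or unexpected status as healthy.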

10.7 Recovering from system failures

The system failures that are described in this section are the result of hardware failures or temporary outages that result in IBM Spectrum Archive EE errors.

10.7.1 Power failure

When a library power failure occurs, the data on the tape cartridge that is actively being written is probably left in an inconsistent state.

To recover a tape cartridge from a power failure, complete the following steps:

1. Create a mount point for the tape library. For more information, see the procedure described in 7.2.2, “IBM Spectrum Archive Library Edition Plus component” on page 140.

2. If you do not know which tape cartridges are in use, try to access all tape cartridges in the library. If you do know which tape cartridges are in use, try to access the tape cartridge that was in use when the power failure occurred.

3. If a tape cartridge is damaged, it is identified as inconsistent and the corresponding subdirectories disappear from the file system. You can confirm which tape cartridges are damaged or inconsistent by running the ltfsee info tapes command. The list of tape cartridges that displays indicates the volume name, which is helpful in identifying the inconsistent tape cartridge. For more information, see 7.20, “Checking and repairing” on page 198.

4. Recover the inconsistent tape cartridge by running the ltfsee pool add command with the -c option. For more information, see 7.20, “Checking and repairing” on page 198.

10.7.2 Mechanical failure

When a library receives an error message from one of its mechanical parts, the process to move a tape cartridge cannot be performed.

Important: A drive in the library normally performs well despite a failure so that ongoing access to an opened file on the loaded tape cartridge is not interrupted or damaged.

To recover a library from a mechanical failure, complete the following steps:

1. Run the umount or fusermount -u command to start a tape cartridge unload operation for each tape cartridge in each drive.

Important: The umount command might encounter an error when it tries to move a tape cartridge from a drive to a storage slot.

To confirm the status of all tapes, access the library operator panel or the web user interface to view the status of the tape cartridges.

2. Turn off the power to the library and remove the source of the error.

3. Follow the procedure that is described in 10.7.1, “Power failure” on page 301.

Important: One or more inconsistent tape cartridges might be found in the storage slots and might need to be made consistent by following the procedure that is described in “Unavailable status” on page 286.

10.7.3 Inventory failure

When a library cannot read the tape cartridge bar code for any reason, an inventory operation for the tape cartridge fails. The corresponding media folder does not display, but a specially designated folder that is named UNKN0000 is listed instead. This designation indicates that a tape cartridge is not recognized by the library.

If the user attempts to access the tape cartridge contents, the media folder is removed from the file system. The status of any library tape cartridge can be determined by running the ltfsee info tapes command. For more information, see 7.26, “Obtaining inventory, job, and scan status” on page 213.

To recover from an inventory failure, complete the following steps:

1. Remove any unknown tape cartridges from the library by using the operator panel or Tape Library Specialist web interface, or by opening the door or magazine of the library.

2. Check all tape cartridge bar code labels.

Important: If the bar code is removed or about to peel off, the library cannot read it. Replace the label or firmly attach the bar code to fix the problem.

3. Insert the tape cartridge into the I/O station.

4. Check to determine whether the tape cartridge is recognized by running the ltfsee info tapes command.

5. Add the tape cartridge to the LTFS inventory by running the ltfsee pool add command.

10.7.4 Abnormal termination

If LTFS terminates because of an abnormal condition, such as a system hang-up or after the user initiates a kill command, the tape cartridges in the library might remain in the tape drives. If this occurs, LTFS locks the tape cartridges in the drives and the following command is required to release them:

# ltfs release_device -o changer_devname=[device name]
