7. Device Failure and Replacement

Whether the red LED is flashing or the syslog is filling up with cryptic messages, a hardware failure is never a day at the beach. The goal of this chapter is to provide a guide for identifying and remedying device failures. We begin with a discussion of supported devices before proceeding with a discussion of how to look for errors. We then discuss how to identify a failed device. Finally, we consider replacements and alternative options to remedy the problem.

Supported Devices

Before spending hours determining why a device has failed, we must first confirm that the device is supported by the running kernel. Next, we must confirm that the device meets the hardware requirements set forth by the vendor. Of course, most of us never check to see whether a device we are about to play with is supported; instead, we just plug it in, and if the lights come on, we think we have struck gold. Though this approach might work most of the time, to reduce a system’s downtime, troubleshooting a failed component should start with the fundamentals.

Each OS distribution provides a supported hardware list, so it is best to check with your applicable distribution to determine whether the device in question is supported. A good example is Red Hat’s complete hardware support matrix, which is located at http://hardware.redhat.com/hcl/?pagename=hcl&view=allhardware#form. Another example is Mandriva’s supported hardware database, which is found at http://wwwnew.mandriva.com/en/hardware/. For our last example, the SUSE Linux hardware support list can be found at http://hardwaredb.suse.de/index.php?LANG=en_UK.

It is not sufficient merely to determine that the device is supported. As previously noted, it also is crucial to confirm that the device meets the hardware requirements set forth by the vendor. Each driver for any given device is designed with certain hardware restrictions. For example, Emulex Corporation writes a driver that enables its Host Bus Adapters (HBAs) to function with the Linux kernel, yet restrictions do apply. Emulex has a vast number of HBAs to choose from, so determining which adapter, as well as which driver, is supported on any given kernel is critical. Determining these boundaries makes remedying the problem more manageable. Of course, recall that when a new hardware device such as an HBA is acquired, it is important to check its supportability with the distribution’s hardware support matrix. Finally, the driver code for a device also contains information regarding the supported hardware types. This is demonstrated in the following example. Upon reviewing the source code for the Emulex HBA in file fcLINUXfcp.c, which is used to build the driver module lpfcdd.o, we find the following:

if(pdev != NULL) {
    switch(pdev->device){
    case PCI_DEVICE_ID_CENTAUR:
      sprintf(buf,
        "Emulex %s (LP9000) SCSI on PCI bus %02x device %02x irq %d",
        multip, p_dev_ctl->pcidev->bus->number, p_dev_ctl->pcidev->devfn,
        p_dev_ctl->pcidev->irq);
      break;

Although the previous example provides the information needed to determine the supported HBA types for Emulex, a faster method of extracting the same data exists. As with any hardware vendor, common characteristics exist in naming conventions. For example, Emulex model names all start with “LP.” By using this shortcut, we can extract the supported HBA types much faster than by reading through the source. The following command demonstrates this faster method:

[root@cyclops lpfc]# cat fcLINUXfcp.c | grep "Emulex %s (LP"

"Emulex %s (LP9000) SCSI on PCI bus %02x device %02x irq %d",
            "Emulex %s (LP8000) SCSI on PCI bus %02x device %02x irq %d",
            "Emulex %s (LP7000) SCSI on PCI bus %02x device %02x irq %d",
            "Emulex %s (LP950) SCSI on PCI bus %02x device %02x irq %d",
            "Emulex %s (LP850) SCSI on PCI bus %02x device %02x irq %d",

We also can identify the version of the kernel for which the source is designed and confirm that the driver source works with the applicable kernel we are using. As shown in the following example, we can distinguish that the source code for our Emulex HBAs, contained in fcLINUXfcp.c, has notes pertaining to which kernels are supported.

/*
* LINUX specific code for lpfcdd driver
* This driver is written to work with 2.2 and 2.4 LINUX kernel threads.
*
*/

Next, we need to determine the driver version for the source code we are reviewing. To find the driver version, simply search for “version” and “driver” simultaneously. The driver developers usually make it easy to find out which source code version is being used. See the next example:

[root@cyclops lpfc]# cat fcLINUXfcp.c| grep -i version | grep -i driver
#define LPFC_DRIVER_VERSION "4.20p"
...
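Both searches can be scripted when several driver sources need to be checked. The following Python sketch is illustrative only: it works on a sample string modeled on the lines shown above rather than on the real fcLINUXfcp.c, and the function names supported_models and driver_version are our own.

```python
import re

# Sample text modeled on the lines shown above; in practice it would be
# read from the driver source file.
SAMPLE = '''
"Emulex %s (LP9000) SCSI on PCI bus %02x device %02x irq %d",
"Emulex %s (LP8000) SCSI on PCI bus %02x device %02x irq %d",
"Emulex %s (LP7000) SCSI on PCI bus %02x device %02x irq %d",
"Emulex %s (LP950) SCSI on PCI bus %02x device %02x irq %d",
"Emulex %s (LP850) SCSI on PCI bus %02x device %02x irq %d",
#define LPFC_DRIVER_VERSION "4.20p"
'''

def supported_models(source_text):
    """Return the LP model names found in the driver's format strings."""
    return re.findall(r"Emulex %s \((LP\d+)\)", source_text)

def driver_version(source_text):
    """Return the LPFC_DRIVER_VERSION string, or None if it is absent."""
    match = re.search(r'#define\s+LPFC_DRIVER_VERSION\s+"([^"\s]+)',
                      source_text)
    return match.group(1) if match else None

print(supported_models(SAMPLE))
print(driver_version(SAMPLE))
```

The patterns are tied to Emulex's naming convention and macro name, so another vendor's driver would need its own expressions.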

Note that we also can determine other needed information about the device from the manufacturer or Linux distribution. Data requirements with respect to drivers, kernels, hardware, and so on always can be found at the OS distribution’s Web site, from the hardware vendor, or in the driver code. The README files provided with almost all drivers contain specifications to adhere to as well. Always check the previous data points before continuing down a broken path.

Where to Look for Errors

The second step in remedying a device failure is to locate the error. Many tools are available to assist with this task, including dmesg, lspci, lsmod, syslog/messages, and the /proc filesystem, among others. This section covers dmesg and syslog/messages in detail. The remaining tools are discussed in more detail later in this chapter.

dmesg is often helpful; therefore, it is a good place to begin. dmesg is a command that reads the kernel ring buffer, which holds the latest kernel messages. dmesg reports the current errors detected by the kernel with respect to the hardware or application. This tool provides a fast and easy way to capture the latest errors from the kernel. We provide the man page here for dmesg for quick reference:

DMESG(8)                                                 DMESG(8)

NAME
       dmesg - print or control the kernel ring buffer

SYNOPSIS
       dmesg [ -c ] [ -n level ] [ -s bufsize ]

DESCRIPTION
       dmesg is used to examine or control the kernel ring buffer.

       The program helps users to print out their bootup messages.
       Instead of copying the messages by hand, the user need only:
              dmesg > boot.messages
       and mail the boot.messages file to whoever can debug their problem.

Although the dmesg command is simple to use, its extensive reports are critical to finding errors promptly. To assist in your understanding of dmesg, we now walk you through an example of a standard dmesg from a booted Linux machine.

greg@nc6000:/tmp> dmesg
Linux version 2.6.8-24.11-default (geeko@buildhost) (gcc version 3.3.4
(pre 3.3.5 20040809)) #1 Fri Jan 14 13:01:26 UTC 2005
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 000000001ffd0000 (usable)
BIOS-e820: 000000001ffd0000 - 000000001fff0c00 (reserved)
BIOS-e820: 000000001fff0c00 - 000000001fffc000 (ACPI NVS)
BIOS-e820: 000000001fffc000 - 0000000020000000 (reserved)

As shown in this example, the first line of the dmesg output provides information about the running kernel version, including who built the kernel, what compiler was used, and when the kernel was compiled. Therefore, if you are compiling source and have a GCC failure, this is a place to start looking. Next, we see the following:

502MB vmalloc/ioremap area available.

vmalloc is defined in arch/i386/kernel/setup.c for IA32 machines. Similarly, vmalloc for IA64 machines is defined in arch/ia64/kernel/perfmon.c. Complete details on memory structure are outside the scope of this chapter, but knowing where to look for details and documentation is a critical starting point. Also note that ioremap.c defines the space for kernel access. In the following code, we see basic boundaries of High and Low memory limits detected on boot by the communication between the BIOS and kernel.

0MB HIGHMEM available.
511MB LOWMEM available.
On node 0 totalpages: 131024
  DMA zone: 4096 pages, LIFO batch:1
  Normal zone: 126928 pages, LIFO batch:16
  HighMem zone: 0 pages, LIFO batch:1
DMI 2.3 present.
ACPI: RSDP (v000 COMPAQ                                    ) @ 0x000f6f80
ACPI: RSDT (v001 HP     HP0890   0x23070420 CPQ  0x00000001) @ 0x1fff0c84
ACPI: FADT (v002 HP     HP0890   0x00000002 CPQ  0x00000001) @ 0x1fff0c00
ACPI: DSDT (v001 HP       nc6000 0x00010000 MSFT 0x0100000e) @ 0x00000000
ACPI: PM-Timer IO Port: 0x1008
ACPI: local apic disabled
Built 1 zonelists

The following is the command line that the bootloader passed to the kernel. Whether GRUB or LILO is used, the kernel records the command line in a form similar to the lines shown next.

Kernel command line: root=/dev/hda1 vga=0x317 selinux=0
resume=/dev/hda5 desktop elevator=as splash=silent PROFILE=Home
bootsplash: silent mode.
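When the boot options must be compared across several machines, the recorded command line can be split into individual options. The following Python sketch is a minimal illustration; the function name parse_cmdline is our own, and the simple whitespace split assumes that no option value contains quoted spaces.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a dict of option -> value.

    Options without '=' (such as 'desktop') are stored with the value True.
    """
    options = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        options[key] = value if sep else True
    return options

# Command line copied from the dmesg output above (joined onto one line).
CMDLINE = ("root=/dev/hda1 vga=0x317 selinux=0 resume=/dev/hda5 "
           "desktop elevator=as splash=silent PROFILE=Home")

opts = parse_cmdline(CMDLINE)
print(opts["root"], opts["elevator"], opts["desktop"])
```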

If a user forgets the boot options provided at boot time, dmesg or syslog will have the values recorded. Note that Chapter 6, “Disk Partitions and Filesystems,” discusses Master Boot Record (MBR) in great detail, offering information about how the BIOS uses a bootloader such as GRUB or LILO. Continuing with dmesg output, we see in the following that the processor and video console are detected:

Initializing CPU#0
PID hash table entries: 2048 (order: 11, 32768 bytes)
Detected 1694.763 MHz processor.
Using pmtmr for high-res timesource
Console: colour dummy device 80x25

The next entries report directory entry and inode cache allocation.

Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)

The document located in kernel source at /usr/src/linux/Documentation/filesystems/vfs.txt describes dentry and inode cache in great detail. Following dentry and inode cache, the amount of system memory is detected and displayed:

Memory: 513740k/524096k available (2076k kernel code, 9744k reserved,
780k data, 212k init, 0k highmem)

Checking if this processor honours the WP bit even in supervisor
mode... Ok.
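The memory figures in these excerpts can be cross-checked against one another. The following Python sketch is only an illustration: it assumes 4 KiB pages (the usual size on IA32, but not stated in the output), and all values are copied by hand from the dmesg lines shown earlier.

```python
PAGE_KIB = 4  # assumption: 4 KiB pages, the usual size on IA32

# Values copied from the dmesg excerpts above.
E820_USABLE = [(0x0000000000000000, 0x000000000009fc00),
               (0x0000000000100000, 0x000000001ffd0000)]
ZONE_PAGES = {"DMA": 4096, "Normal": 126928, "HighMem": 0}
TOTALPAGES = 131024
MEMORY_TOTAL_KIB = 524096   # second number in "Memory: 513740k/524096k"

def top_of_usable_kib(ranges):
    """Highest usable address in the BIOS map, expressed in KiB."""
    return max(end for _start, end in ranges) // 1024

zone_total = sum(ZONE_PAGES.values())
print(zone_total, zone_total * PAGE_KIB, top_of_usable_kib(E820_USABLE))
```

On this machine the three zone sizes add up to the reported totalpages, and both the page count and the top of the last usable BIOS range correspond to the 524096k total on the Memory line.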

Now, let us wrap up this short demonstration of dmesg with some information about BogoMIPS. The kernel computes this value at boot in its calibrate_delay() routine, which times a simple busy-wait loop so that short delays can be calibrated for the CPU. In simplest terms, BogoMIPS measures how many million times per second the processor can run through that do-nothing loop. It is not a meaningful benchmark, which is the reason for its name (“bogus” MIPS); at best, it allows a rough comparison of similar CPUs. Please note that the kernel uses the underlying loops-per-jiffy value (lpj) for its delay loops, but the end user will find no large benefit in it. More details on BogoMIPS can be found at http://www.tldp.org/HOWTO/BogoMips/.

Calibrating delay loop... 3358.72 BogoMIPS (lpj=1679360)
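The two numbers on this line are related: the lpj value is the number of delay-loop iterations per timer tick, and the BogoMIPS figure is derived from it. The following Python sketch shows the arithmetic. The timer frequency HZ is not printed in the output, so the value of 1000 used here is an assumption; it is the value that reproduces the figure shown.

```python
HZ = 1000  # assumption: timer frequency of this 2.6 IA32 kernel

def bogomips(lpj, hz=HZ):
    """Convert loops per jiffy (lpj) into the BogoMIPS figure."""
    return lpj * hz / 500000

print(f"{bogomips(1679360):.2f}")
```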

In addition to dmesg, there are other places to look for hardware errors. The kernel messages that dmesg reads from the ring buffer are also passed to the syslog daemon (through klogd on most distributions), which in turn records them in the log file /var/log/messages by default. Although other tools exist for reporting hardware errors, dmesg and syslog are the most prominent. Other tools such as lspci are used in conjunction with dmesg/syslog later in this chapter to locate a failed hardware component.

Identifying Failed Devices

After an error is located, the failed device can be identified. The goal is to determine the root cause. As previously mentioned, dmesg and /var/log/messages are commonly used with lspci and the /proc filesystem. These combined tools are used to troubleshoot and locate hardware faults.

The lspci command presents a user with the hardware layout on a machine. In the following example, we use a small Linux IA32 server with dual processors and Fibre Channel attached storage to demonstrate what an lspci would look like. First, we display the kernel we are using with the uname command.

[root@cyclops lpfc]# uname -a
Linux cyclops 2.4.9-e.10custom-gt #4 SMP Mon Nov 1 14:17:36 EST 2004
i686 unknown

We continue with dmesg to list the PCI bus. Note that because we use the Emulex HBAs often in this chapter’s examples, their entries are annotated in the output.

[root@cyclops lpfc]# dmesg | grep PCI
PCI: PCI BIOS revision 2.10 entry at 0xfda11, last bus=1
PCI: Using configuration type 1
PCI: Probing PCI hardware
PCI: Discovered primary peer bus 01 [IRQ]
PCI->APIC IRQ transform: (B0,I2,P0) -> 22
PCI->APIC IRQ transform: (B0,I8,P0) -> 23
PCI->APIC IRQ transform: (B0,I15,P0) -> 33
PCI->APIC IRQ transform: (B1,I3,P0) -> 17 <-Bus 1, interface/slot 3,
function/port 0 (Emulex HBA), lspci will help identify this HBA in a
future example.
PCI->APIC IRQ transform: (B1,I4,P0) -> 27 <-Bus 1, interface/slot 4,
function/port 0 (Emulex HBA), lspci will help identify this HBA in a
future example.

PCI->APIC IRQ transform: (B1,I5,P0) -> 24
PCI->APIC IRQ transform: (B1,I5,P1) -> 25
Serial driver version 5.05c (2001-07-08) with MANY_PORTS MULTIPORT
SHARE_IRQ SERIAL_PCI ISAPNP enabled
ide: Assuming 33MHz PCI bus speed for PIO modes; override with idebus=xx
ServerWorks OSB4: IDE controller on PCI bus 00 dev 79
pci_hotplug: PCI Hot Plug PCI Core version: 0.3
sym53c8xx: at PCI bus 1, device 5, function 0
sym53c8xx: at PCI bus 1, device 5, function 1

We conclude by using lspci to depict the PCI bus devices.

[root@cyclops lpfc]# lspci
00:00.0 Host bridge: ServerWorks CNB20LE Host Bridge (rev 06)
00:00.1 Host bridge: ServerWorks CNB20LE Host Bridge (rev 06)
00:02.0 Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100]
(rev 08)
00:07.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)
00:08.0 Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100]
(rev 08)
00:0f.0 ISA bridge: ServerWorks OSB4 South Bridge (rev 50)
00:0f.1 IDE interface: ServerWorks OSB4 IDE Controller
00:0f.2 USB Controller: ServerWorks OSB4/CSB5 OHCI USB Controller (rev 04)
01:03.0 Fibre Channel: Emulex Corporation: Unknown device f800 (rev 02)
01:04.0 Fibre Channel: Emulex Corporation: Unknown device f800 (rev 02)
01:05.0 SCSI storage controller: LSI Logic / Symbios Logic (formerly NCR)
53c1010 Ultra3 SCSI Adapter (rev 01)
01:05.1 SCSI storage controller: LSI Logic / Symbios Logic (formerly NCR)
53c1010 Ultra3 SCSI Adapter (rev 01)
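The IRQ transform entries in the dmesg output and the addresses in the lspci output describe the same devices in two notations. The following Python sketch converts one into the other. It assumes that the B, I, and P numbers in dmesg are decimal while lspci prints bus and slot in hexadecimal, which is consistent with (B0,I15,P0) appearing as 00:0f.0 above; the function name transform_to_lspci is our own.

```python
import re

def transform_to_lspci(entry):
    """Convert '(B1,I3,P0)' into an lspci-style address such as '01:03.0'.

    Assumes the B, I, and P numbers are decimal, while lspci prints the
    bus and slot in hexadecimal.
    """
    match = re.fullmatch(r"\(B(\d+),I(\d+),P(\d+)\)", entry)
    if match is None:
        raise ValueError(f"unrecognized entry: {entry!r}")
    bus, slot, function = (int(n) for n in match.groups())
    return f"{bus:02x}:{slot:02x}.{function}"

for entry in ("(B0,I15,P0)", "(B1,I3,P0)", "(B1,I4,P0)", "(B1,I5,P1)"):
    print(entry, "->", transform_to_lspci(entry))
```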

Note that lspci uses /proc/bus/pci to build its device tree. Each device reports numeric vendor and device IDs, which lspci translates into names by using the decode list at /usr/share/pci.ids (note that the latest PCI IDs are maintained at http://pciids.sf.net/). By using lspci with the -v flag, a user can obtain complete details of PCI devices including subsystem, flags, memory, I/O ports, and expansion ROM locations. However, using the -t flag with the -v flag yields only the basic description. Although the -v output loses this detail when used in conjunction with the -t flag, the -t flag remains a very nice option because it delivers a tree view of the devices seen from the master bus.

For example, the following lspci -t -v output is from the same machine as the dmesg and lspci discussed previously. In addition, notice how the Emulex HBAs are denoted by lspci as unknown under the device ID f800. This is because we are using a 2001 pci.ids update on our test server. Our latest production lab machine has a mid-August 2004 version loaded. With the pci.ids file dated 2004-08-24, f800 is decoded to be “LP8000 Fibre Channel Host Adapter.” Note there is no penalty for having an older pci.ids file. However, be prepared for lots of “unknown” devices to appear in lspci’s output. In addition, the pci.ids file is updated for each distribution, but it also can be researched manually by going to http://pciids.sourceforge.net/.

[root@cyclops root]# lspci -t -v

-+-[01]-+-03.0  Emulex Corporation: Unknown device f800
|       +-04.0  Emulex Corporation: Unknown device f800
|       +-05.0  LSI Logic / Symbios Logic (formerly NCR) 53c1010 Ultra3
                SCSI Adapter
|       -05.1  LSI Logic / Symbios Logic (formerly NCR) 53c1010 Ultra3
                SCSI Adapter
-[00]-+-00.0   ServerWorks CNB20LE Host Bridge
       +-00.1   ServerWorks CNB20LE Host Bridge
       +-02.0   Intel Corporation 82557 [Ethernet Pro 100]
       +-07.0   ATI Technologies Inc Rage XL
       +-08.0   Intel Corporation 82557 [Ethernet Pro 100]
       +-0f.0   ServerWorks OSB4 South Bridge
       +-0f.1   ServerWorks OSB4 IDE Controller
       -0f.2   ServerWorks OSB4/CSB5 OHCI USB Controller

Having a good understanding of the bus structure is critical to finding the device that is failing. In the next example, we find that SCSI disk errors are being reported by syslogd, which by default writes to /var/log/messages (we can confirm where syslogd writes by viewing /etc/syslog.conf or by using the logger command), and we can view the same errors through the dmesg buffer. The errors look like the following:

SCSI disk error : host 2 channel 0 id 0 lun 2 return code = 70022
I/O error: dev 08:80, sector 5439488
SCSI disk error : host 2 channel 0 id 0 lun 2 return code = 70022
I/O error: dev 08:80, sector 5439552

Viewing the same data in /var/log/messages, note that a timestamp is included.

Feb  4 14:04:37 cyclops kernel: SCSI disk error : host 2 channel 0 id 0
lun 2 return code = 70022
Feb  4 14:04:37 cyclops kernel:  I/O error: dev 08:80, sector 5439488
Feb  4 14:04:37 cyclops kernel: SCSI disk error : host 2 channel 0 id 0
lun 2 return code = 70022
Feb  4 14:04:37 cyclops kernel:  I/O error: dev 08:80, sector 5439552
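The dev 08:80 field in the I/O error is the device's major and minor number in hexadecimal, and it can be translated into a device name. The following Python sketch is a hedged illustration that covers only the first SCSI disk major number (8), where each disk owns 16 consecutive minor numbers; the function name scsi_dev_name is our own.

```python
SCSI_DISK_MAJOR = 8      # first SCSI disk major number
MINORS_PER_DISK = 16     # whole disk plus 15 partitions

def scsi_dev_name(dev):
    """Translate a 'major:minor' pair in hex, such as '08:80', to a name."""
    major_hex, minor_hex = dev.split(":")
    major, minor = int(major_hex, 16), int(minor_hex, 16)
    if major != SCSI_DISK_MAJOR:
        raise ValueError(f"major {major} is not handled by this sketch")
    disk, partition = divmod(minor, MINORS_PER_DISK)
    name = "sd" + chr(ord("a") + disk)
    return name + (str(partition) if partition else "")

print(scsi_dev_name("08:80"))
```

For dev 08:80, minor 0x80 is 128, which is the ninth disk with no partition offset, so the errors are against the whole disk sdi.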

The previous error log reports a disk I/O error on host 2, channel 0, id 0, lun 2, with a return code of 70022. The next task is to decipher the return code. Note that the return code is always 4 bytes (8 hexadecimal digits) long, so in our previous example, 70022 is actually 00070022.

Before decoding it, we work through an I/O error from a different host in our lab, which received a return code of 2603007f, and break down every component of its syslog entry.

Month Day Hour:Min:Sec localhost kernel: SCSI disk error :
     host 0 channel 0 id 0 lun 0 return code = 2603007f
Month Day Hour:Min:Sec localhost kernel: scsidisk I/O error:
     dev 08:01, sector 10
Month Day Hour:Min:Sec localhost kernel: raid1: Disk failure on
     sda1, disabling device.

As shown here, the first entry in the syslog with respect to the SCSI error identifies the device that is failing: host 0, channel 0, id 0, lun 0. The same entry also provides the return code, which we must break down to determine root cause. Breaking down the return code of 2603007f is not very difficult, as long as bit order is maintained. Bit order is explained in greater detail in Chapter 6.

To break down the return code, we must look at scsi_lib.c from the SCSI source included in a current Linux kernel release. Reviewing the source, we see that the breakdown of the SCSI hardware address is as follows:

printk("SCSI error : <%d %d %d %d> return code = 0x%x ",
                       cmd->device->host->host_no,
                       cmd->device->channel,
                       cmd->device->id,
                       cmd->device->lun, result);

Upon reviewing the scsi_ioctl.c code, we find the following:

* If the SCSI command succeeds then 0 is returned.
* Positive numbers returned are the compacted SCSI error codes
(4 bytes in one int) where the lowest byte is the SCSI status.
See the drivers/scsi/scsi.h file for more information on this.

While reviewing drivers/scsi/scsi.h, we determine that the driver design for SCSI is changing and that the file provides a reference point for many SCSI subsets. However, the scope of this chapter excludes building drivers and focuses on device failure and status return codes.

Now that we have a general understanding of the device location of the PCI bus, we need to understand the order of the return code bytes. The return code is made up of four bytes, appearing in the order of 3, 2, 1, 0 and breaking down as follows:

  lsb   |    ...    |    ...    |     msb
========|===========|===========|============
 status | sense key | host code | driver byte

(far left)   Byte 3:    SCSI driver status byte
             Byte 2:    Host adapter driver status byte
             Byte 1:    Message following the status byte returned by the drive
(far right)  Byte 0:    Status byte returned by the drive (bits 5-1)

Now that we have defined the byte locations, we need to define the possible values to be able to decode the return code. Again, upon reviewing include/scsi/scsi.h, the user finds that the previous bytes are defined as follows:

...
/*
*  SCSI Architecture Model (SAM) Status codes. Taken from SAM-3 draft
*  T10/1561-D Revision 4 Draft dated 7th November 2002.
*/
#define SAM_STAT_GOOD            0x00
#define SAM_STAT_CHECK_CONDITION 0x02
#define SAM_STAT_CONDITION_MET   0x04
#define SAM_STAT_BUSY            0x08
#define SAM_STAT_INTERMEDIATE    0x10
#define SAM_STAT_INTERMEDIATE_CONDITION_MET 0x14
#define SAM_STAT_RESERVATION_CONFLICT 0x18
#define SAM_STAT_COMMAND_TERMINATED 0x22 /* obsolete in SAM-3 */
#define SAM_STAT_TASK_SET_FULL   0x28
#define SAM_STAT_ACA_ACTIVE      0x30
#define SAM_STAT_TASK_ABORTED    0x40

/** scsi_status_is_good - check the status return.
*
* @status: the status passed up from the driver (including host and
*          driver components)
*
* This returns true for known good conditions that may be treated as
* command completed normally
*/
static inline int scsi_status_is_good(int status)
{
       /*
       * FIXME: bit0 is listed as reserved in SCSI-2, but is
       * significant in SCSI-3. For now, we follow the SCSI-2
       * behaviour and ignore reserved bits.
       */
      status &= 0xfe;
      return ((status == SAM_STAT_GOOD) ||
            (status == SAM_STAT_INTERMEDIATE) ||
            (status == SAM_STAT_INTERMEDIATE_CONDITION_MET) ||
            /* FIXME: this is obsolete in SAM-3 */
            (status == SAM_STAT_COMMAND_TERMINATED));
}

/*
*  Status codes. These are deprecated as they are shifted 1 bit right
*  from those found in the SCSI standards. This causes confusion for
*  applications that are ported to several OSes. Prefer SAM Status codes
*  above.
*/

#define GOOD                 0x00
#define CHECK_CONDITION      0x01
#define CONDITION_GOOD       0x02
#define BUSY                 0x04
#define INTERMEDIATE_GOOD    0x08

#define INTERMEDIATE_C_GOOD  0x0a
#define RESERVATION_CONFLICT 0x0c
#define COMMAND_TERMINATED   0x11
#define QUEUE_FULL           0x14

#define STATUS_MASK          0x3e

/*
*  SENSE KEYS
*/

#define NO_SENSE             0x00
#define RECOVERED_ERROR      0x01
#define NOT_READY            0x02
#define MEDIUM_ERROR         0x03
#define HARDWARE_ERROR       0x04
#define ILLEGAL_REQUEST      0x05
#define UNIT_ATTENTION       0x06
#define DATA_PROTECT         0x07
#define BLANK_CHECK          0x08
#define COPY_ABORTED         0x0a
#define ABORTED_COMMAND      0x0b
#define VOLUME_OVERFLOW      0x0d
#define MISCOMPARE           0x0e

/*
* Host byte codes
*/

#define DID_OK           0x00   /* NO error                                */
#define DID_NO_CONNECT   0x01   /* Couldn't connect before timeout period  */
#define DID_BUS_BUSY     0x02   /* BUS stayed busy through time out period */
#define DID_TIME_OUT     0x03   /* TIMED OUT for other reason              */
#define DID_BAD_TARGET   0x04   /* BAD target.                             */
#define DID_ABORT        0x05   /* Told to abort for some other reason     */

#define DID_PARITY       0x06   /* Parity error                            */
#define DID_ERROR        0x07   /* Internal error                          */
#define DID_RESET        0x08   /* Reset by somebody                       */
#define DID_BAD_INTR     0x09   /* Got an interrupt we weren't expecting   */
#define DID_PASSTHROUGH  0x0a   /* Force command past mid-layer            */
#define DID_SOFT_ERROR   0x0b   /* The low level driver just wish a retry  */
#define DID_IMM_RETRY    0x0c   /* Retry without decrementing retry count  */
#define DRIVER_OK        0x00   /* Driver status                           */

/*
*  These indicate the error that occurred, and what is available.
*/

#define DRIVER_BUSY         0x01
#define DRIVER_SOFT         0x02
#define DRIVER_MEDIA        0x03
#define DRIVER_ERROR        0x04

#define DRIVER_INVALID      0x05
#define DRIVER_TIMEOUT      0x06
#define DRIVER_HARD         0x07
#define DRIVER_SENSE        0x08

#define SUGGEST_RETRY       0x10
#define SUGGEST_ABORT       0x20
#define SUGGEST_REMAP       0x30
#define SUGGEST_DIE         0x40
#define SUGGEST_SENSE       0x80
#define SUGGEST_IS_OK       0xff

#define DRIVER_MASK         0x0f
#define SUGGEST_MASK        0xf0
...

The return code from our previously mentioned example had a value of 2603007f, and it breaks down as follows:

26 = SCSI driver byte           (0x06 DRIVER_TIMEOUT with 0x20 SUGGEST_ABORT)
03 = Host adapter driver byte   (DID_TIME_OUT - TIMED OUT
                                for other reason)
00 = Message byte               (COMMAND_COMPLETE)
7f = Status byte                (bogus value), bits 5-1 = 1f
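The same breakdown can be expressed as shifts and masks. The following Python sketch is an illustration, not the kernel's own code: it follows the byte order given above (byte 3 on the far left), reuses the DRIVER_MASK and SUGGEST_MASK values and the host byte codes from the header listing, and the function name decode_return_code is our own.

```python
HOST_CODES = {0x00: "DID_OK", 0x01: "DID_NO_CONNECT", 0x02: "DID_BUS_BUSY",
              0x03: "DID_TIME_OUT", 0x04: "DID_BAD_TARGET", 0x05: "DID_ABORT",
              0x06: "DID_PARITY", 0x07: "DID_ERROR", 0x08: "DID_RESET",
              0x09: "DID_BAD_INTR", 0x0a: "DID_PASSTHROUGH",
              0x0b: "DID_SOFT_ERROR", 0x0c: "DID_IMM_RETRY"}

def decode_return_code(text):
    """Split a logged return code (hex text) into its four bytes."""
    code = int(text, 16)
    driver = (code >> 24) & 0xff
    host = (code >> 16) & 0xff
    return {
        "padded": f"{code:08x}",           # always 8 hex digits
        "driver": driver & 0x0f,           # DRIVER_MASK
        "suggest": driver & 0xf0,          # SUGGEST_MASK
        "host": HOST_CODES.get(host, f"unknown 0x{host:02x}"),
        "message": (code >> 8) & 0xff,
        "status": code & 0xff,
        "status_bits_5_1": (code >> 1) & 0x1f,
    }

print(decode_return_code("2603007f"))
```

Because the logged value is parsed as a number, a shortened code such as 70022 is padded back to eight digits automatically.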

Additional information to help break down SCSI error detection can be found for the following topics at http://tldp.org/HOWTO/SCSI-Generic-HOWTO/index.html (search for scsi—the documents will change over time):

• SCSI programming HOWTO

• Decoding SCSI error status

• Sense codes and sense code qualifiers

Now that we have discussed where to find status codes and how to decode return codes for SCSI errors, we can decipher our lab error code of 00070022.

• 00 = Driver byte: define DRIVER_OK: 0x00. The driver status is good, so we can tell that it is not a driver issue.

• 07 = Host byte: define DID_ERROR: 0x07, Internal error. The host adapter driver could not complete the command.

• 00 = Message byte: 0x00 (COMMAND_COMPLETE).

• 22 = Status byte returned by the drive:

• 0x22 = define SAM_STAT_COMMAND_TERMINATED.

• Bits 5-1 = 0x11 = define COMMAND_TERMINATED, the deprecated code shifted 1 bit right.

Note that in our previous lab case, the host byte suggests a problem with the host adapter driver; however, that indication is a little misleading. The actual cause is a drive issue. Although the drive appeared to be visible, it was in fact locked: we locked all the read and write I/Os to the drive while the LUN remained visible to the host, thus causing the error return codes to be misleading.

Replacement of a Failed Device

After we determine the applicable error and decoding information, we are ready to determine whether replacement is necessary and possible. Replacement might not be necessary if a good workaround is applicable. Furthermore, it might be impossible to replace a device for a number of reasons—most importantly, the need for keeping the system online.

To determine whether a device can be replaced online or offline, we must first define these terms. Within the computer industry, the terms online and offline simply refer to the application’s status. In short, the primary goal should always be to keep the application online. When a disk failure occurs, unless you are using a large storage array, such as an HP XP, EMC, or IBM Shark array, you must be able to handle the failure through other means, such as a hardware RAID controller or software RAID (for example, logical volume manager mirroring).

If errors are being logged for disk I/O failures, as shown previously, the easiest thing to do is find the device driver that controls the device and determine the impact of removing it.

In the previous example, we saw an I/O error on a LUN within a storage array connected to an Emulex HBA. Installing the lpfcdd.o driver in a 2.4.9-e.10 kernel allowed access to many LUNs for the HP storage array through the Emulex HBA. It is critical to understand how to map the LUNs back to the driver that allows access. As with any SCSI LUN, the SCSI I/O driver allows read and write I/O; however, the lpfcdd driver allows the path to be available for all the LUNs down the HBA path.

By using dmesg, after running the command insmod lpfcdd, we can see the newly found disk, as depicted in the following example.

[root@cyclops root]# insmod lpfcdd
Using /lib/modules/2.4.9-e.10custom-gt/kernel/drivers/scsi/lpfcdd.o
Warning: loading /lib/modules/2.4.9-e.10custom-
gt/kernel/drivers/scsi/lpfcdd.o will taint the kernel: no license
(Note: Tainting of kernel is discussed in Chapter 2)

[root@cyclops log]# dmesg
Emulex LightPulse FC SCSI/IP 4.20p
PCI: Enabling device 01:03.0 (0156 -> 0157)
!lpfc0:045:Vital Product Data Data: 82 23 0 36
!lpfc0:031:Link Up Event received Data: 1 1 0 0
PCI: Enabling device 01:04.0 (0156 -> 0157)
scsi2 : Emulex LPFC (LP8000) SCSI on PCI bus 01 device 18 irq 17
scsi3 : Emulex LPFC (LP8000) SCSI on PCI bus 01 device 20 irq 27
  Vendor: HP        Model: OPEN-9-CVS-CM Rev: 2110 <---
Scsi_scan.c code scanning PCI bus to find devices...
  Type:   Direct-Access                     ANSI SCSI revision: 02
  Vendor: HP        Model: OPEN-9-CVS-CM Rev: 2110
  Type:   Direct-Access                     ANSI SCSI revision: 02
  Vendor: HP        Model: OPEN-8*13        Rev: 2110
  Type:   Direct-Access                     ANSI SCSI revision: 02
Attached scsi disk sdg at scsi2, channel 0, id 0, lun 0
Attached scsi disk sdh at scsi2, channel 0, id 0, lun 1
Attached scsi disk sdi at scsi2, channel 0, id 0, lun 2
SCSI device sdg: 1638720 512-byte hdwr sectors (839 MB)
sdg: sdg1
SCSI device sdh: 1638720 512-byte hdwr sectors (839 MB)
sdh: sdh1
SCSI device sdi: 186563520 512-byte hdwr sectors (95521 MB)
sdi: unknown partition table <---VERY Important... MBR is discussed
in great detail in Chapter 6.
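The "device 18" and "device 20" values in the scsi2 and scsi3 lines can be tied back to the lspci addresses 01:03.0 and 01:04.0. The driver format string shown earlier prints the PCI devfn field with %02x, and in the standard PCI encoding devfn holds the slot number in its upper five bits and the function in its lower three. The following Python sketch illustrates the split; the function name split_devfn is our own, and bus 01 is taken from the dmesg lines rather than computed.

```python
def split_devfn(devfn_hex):
    """Split a PCI devfn value (hex text) into (slot, function)."""
    devfn = int(devfn_hex, 16)
    return devfn >> 3, devfn & 0x07

for scsi_host, devfn in (("scsi2", "18"), ("scsi3", "20")):
    slot, function = split_devfn(devfn)
    print(f"{scsi_host}: device {devfn} -> 01:{slot:02x}.{function}")
```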

We can force I/O on a drive by using the dd command. For example, dd if=/dev/sdi of=/dev/null bs=1024k easily creates 55+ MBps of read on a quality array device. The following depicts the I/O load discussed previously. For a complete picture, a general background must first be established.

The HBA used on our Linux server connects to a Brocade switch on port 15, with port 1 going to an upstream ISL for its storage allocation. By using the switchshow command, we can see the WWN of the HBA connected to port 15, and by using the portperfshow command, we can determine the exact performance of our previous dd command. Again, the goal is to demonstrate a total I/O failure under heavy I/O load. First, switchshow illustrates the HBA, followed by portperfshow, which illustrates the performance.

roadrunner:admin> switchshow
switchName:     roadrunner
switchType:     5.4
switchState:    Online
switchMode:     Interop
switchRole:     Subordinate
switchDomain:   123
switchId:       fffc7b
switchWwn:      10:00:00:60:69:10:6b:0e
switchBeacon:   OFF
Zoning:         ON (STC-zoneset-1)
port  0: sw  Online        E-Port 10:00:00:60:69:10:64:7e "coyote"
                           (downstream)
port  1: sw  Online        E-Port 10:00:00:60:69:10:2b:37 "pepe"
                           (upstream) <--- Upstream ISL to Storage Switch.
port  2: --  No_Module
port  3: sw  Online        F-Port 20:00:00:05:9b:a6:65:40
port  4: sw  Online        F-Port 50:06:0b:00:00:0a:b8:9e
port  5: sw  Online        F-Port 50:00:0e:10:00:00:96:e4
port  6: sw  Online        F-Port 50:00:0e:10:00:00:96:ff
port  7: sw  Online        F-Port 50:00:0e:10:00:00:96:e5
port  8: sw  Online        F-Port 50:00:0e:10:00:00:96:fe
port  9: sw  Online        F-Port 20:00:00:e0:69:c0:81:b3
port 10: sw  Online        L-Port 1 private, 1 phantom
port 11: sw  No_Sync
port 12: sw  Online        L-Port 1 private, 1 phantom
port 13: sw  No_Sync
port 14: sw  No_Light
port 15: sw  Online        F-Port 10:00:00:00:c9:24:13:27 <--- HBA on Linux host

roadrunner:admin> portperfshow
     0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
    ----------------------------------------------------------------
     0  57m  0   0   0   0   0   0   0   0   0   0   0   0   0  57m
     0  55m  0   0   0   0   0   0   0   0   0   0   0   0   0  55m
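When many ports are listed, it can help to pick out the active ones programmatically. The following is a rough sketch that parses one portperfshow sample line against the port header line and returns only the ports carrying traffic. The function names are hypothetical, and treating the trailing k, m, or g as a decimal multiplier is an assumption.

```python
def parse_rate(token):
    """Convert a portperfshow cell such as '57m' or '0' to a number.

    Assumption: a trailing 'k', 'm', or 'g' is a decimal multiplier.
    """
    multipliers = {"k": 10**3, "m": 10**6, "g": 10**9}
    suffix = token[-1].lower()
    if suffix in multipliers:
        return float(token[:-1]) * multipliers[suffix]
    return float(token)


def busy_ports(header_line, sample_line):
    """Return {port_number: rate} for ports with non-zero traffic."""
    ports = [int(p) for p in header_line.split()]
    rates = [parse_rate(t) for t in sample_line.split()]
    return {p: r for p, r in zip(ports, rates) if r > 0}


header = "     0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15"
sample = "     0  57m  0   0   0   0   0   0   0   0   0   0   0   0   0  57m"
active = busy_ports(header, sample)
```

For the first sample line above, only ports 1 and 15 are returned, which matches the upstream ISL and the HBA port seen in the switchshow output.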

To create a complete hardware failure, we disable port 15, thus halting all I/O. By using dmesg, we can capture what the kernel is logging about the failure.

After we disable port 15, dmesg logs the following:

!lpfc0:031:Link Down Event received Data: 2 2 0 20

After waiting 60 seconds for the I/O acknowledgement, the Emulex driver abandons the bus, forcing the SCSI layer to recognize the I/O failure. The dmesg command shows the following:

[root@cyclops log]# dmesg
!lpfc0:120:Device disappeared, nodev timeout: Data: 780500 0 0 1e
I/O error: dev 08:80, sector 5140160
I/O error: dev 08:80, sector 5140224
I/O error: dev 08:80, sector 5140160
I/O error: dev 08:80, sector 5140224
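The "dev 08:80" in these messages can be mapped back to a device name. The following sketch assumes that the pair is major:minor in hexadecimal, that major 8 is the first SCSI disk major, and that each disk owns 16 minor numbers (the whole disk plus 15 partitions). The function name is hypothetical.

```python
import string

SCSI_DISK_MAJOR = 8     # first SCSI disk major number
MINORS_PER_DISK = 16    # whole disk plus up to 15 partitions


def decode_sd_dev(dev):
    """Map a 'major:minor' hex pair such as '08:80' to an sd device name."""
    major_hex, minor_hex = dev.split(":")
    major, minor = int(major_hex, 16), int(minor_hex, 16)
    if major != SCSI_DISK_MAJOR:
        raise ValueError("not SCSI disk major 8: %r" % dev)
    # Quotient selects the disk (0 = sda), remainder selects the partition.
    disk_index, partition = divmod(minor, MINORS_PER_DISK)
    name = "sd" + string.ascii_lowercase[disk_index]
    return name if partition == 0 else name + str(partition)
```

Under these assumptions, decode_sd_dev("08:80") yields sdi: minor 0x80 is 128, and 128 / 16 = 8 with no remainder, so the errors are against the whole ninth disk. This agrees with the dd command that was reading /dev/sdi.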

The user shell prompt looks similar to the following:

[root@cyclops proc]# dd if=/dev/sdi of=/dev/null bs=1024k

dd: reading '/dev/sdi': Input/output error
12067+1 records in
12067+1 records out
[root@cyclops proc]#

In the previous case, the Emulex driver detected a Fibre Channel Protocol (FCP) failure and deallocated the storage. After the path is restored, the link and I/O access return to normal. To demonstrate I/O returning, we simply issue the portenable command for port 15 on the Brocade switch, thus re-enabling the HBA's FCP connection. The following dd command demonstrates the I/O returning without the host having gone offline. Note, however, that an application using the device would have lost access to it during the outage, resulting in an offline condition for the application.

[root@cyclops proc]# dd if=/dev/sdi of=/dev/null bs=1024k
57+0 records in
57+0 records out
[root@cyclops proc]#

In the meantime, while the portenable and dd commands are being issued, dmesg reports the following:

[root@cyclops proc]# dmesg
!lpfc0:031:Link Up Event received Data: 3 3 0 20
!lpfc0:031:Link Up Event received Data: 5 5 0 74
!lpfc1:031:Link Up Event received Data: 1 1 0 0
!lpfc1:031:Link Up Event received Data: 3 3 0 74
[root@cyclops proc]#
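On a host with several adapters, the link events can be reduced to the current state of each one. The sketch below scans dmesg lines for the Link Up and Link Down messages in the form shown in this chapter and keeps the most recent state per adapter. The regular expression is derived only from the sample messages here and is an assumption about other driver versions.

```python
import re

# Pattern based on the lpfc messages shown above; other versions may differ.
LINK_EVENT = re.compile(r"!(lpfc\d+):\d+:Link (Up|Down) Event received")


def link_states(dmesg_lines):
    """Return the most recent link state seen for each lpfc adapter."""
    states = {}
    for line in dmesg_lines:
        match = LINK_EVENT.search(line)
        if match:
            adapter, state = match.groups()
            states[adapter] = state    # later events overwrite earlier ones
    return states


sample_log = [
    "!lpfc0:031:Link Down Event received Data: 2 2 0 20",
    "!lpfc0:031:Link Up Event received Data: 3 3 0 20",
    "!lpfc1:031:Link Up Event received Data: 1 1 0 0",
]
current = link_states(sample_log)
```

For the sample lines, both lpfc0 and lpfc1 end in the Up state, because the later Link Up event for lpfc0 replaces the earlier Link Down.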

In the previous condition, the entire device was offline. Although recovery was simple, applications would have failed. Depending on application parameters, such as buffer cache, remaining I/O operations could cause a system-like hang. Hangs of this nature are discussed in Chapter 2, “System Hangs and Panics.” Having a complete hardware device failure on an entire bus, in a SAN, or on some other type of storage network is usually quick to isolate and recover with the tactics described in this chapter; however, data integrity is another matter. Data integrity is beyond the scope of this chapter.

As mentioned earlier, logical path failures are easier to manage than the failure of a given device on a logical path. In the following example, we block the I/O to a given LUN on a bus, as done previously in this chapter to illustrate return code 70022. However, now the goal is to determine the best course of corrective action.

Repeating the same LUN I/O block as before, the dd read test results in the following errors:

[root@cyclops root]# dd if=/dev/sdi of=/dev/null bs=1024k
dd: reading '/dev/sdi': Input/output error
6584+1 records in
6584+1 records out

Notice that the prompt did not return; instead, the process is hung in kernel space (discussed in Chapter 8, “Linux Processes: Structure, Hangs, and Core Dumps”). This behavior results because the kernel knows the size of the disk and because we set such a large block size; the remaining I/Os should eventually fail, and the process will die.

[root@cyclops root]# dmesg
SCSI disk error : host 2 channel 0 id 0 lun 2 return code = 70022
I/O error: dev 08:80, sector 13485760
SCSI disk error : host 2 channel 0 id 0 lun 2 return code = 70022
I/O error: dev 08:80, sector 13485824
SCSI disk error : host 2 channel 0 id 0 lun 2 return code = 70022
I/O error: dev 08:80, sector 13485762
SCSI disk error : host 2 channel 0 id 0 lun 2 return code = 70022
I/O error: dev 08:80, sector 13485826
SCSI disk error : host 2 channel 0 id 0 lun 2 return code = 70022
I/O error: dev 08:80, sector 13485760
~~~Errors continue

While the read I/O errors continue, recall that we have already decoded the return code, which told us that the disk is read/write protected; thus, we must find a way to restore I/O or move the data to a new location. To gather information about the process accessing the device, we run the command ps -ef | grep dd, which confirms a PID of 8070. With the PID established, we go to the PID's directory under the /proc filesystem and check the status of the process as follows:

[root@cyclops /]# cd /proc/8070
[root@cyclops 8070]# cat status
Name:   dd
State:  D (disk sleep) <--- Note the state. The state should be in R (running) condition.
Pid:    8070
PPid:   8023
TracerPid:       0
Uid:    0        0        0        0
Gid:    0        0        0        0
TGid:   8070
FDSize: 256

Groups: 0 1 2 3 4 6 10
VmSize:     1660 kB
VmLck:         0 kB
VmRSS:       580 kB
VmData:       28 kB
VmStk:        24 kB
VmExe:        28 kB
VmLib:      1316 kB
SigPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000000
SigCgt: 0000000000001206
CapInh: 0000000000000000
CapPrm: 00000000fffffeff
CapEff: 00000000fffffeff

[root@cyclops 8070]# cat cpu
cpu  8 8534
cpu0 8 8527
cpu1 0 7

[root@cyclops 8070]# cat cmdline
ddif=/dev/sdiof=/dev/nullbs=1024k

The status of the process on the device is disk sleep, meaning that the process is waiting on the device to return the outstanding request before processing the next one. In this condition, I/O errors will not continue forever; they will stop after the read I/O block has expired or timed out at the SCSI layer. However, if an application continues to queue multiple I/O threads to the device, removing any device driver from the kernel will be impossible.
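The same checks can be scripted. The following sketch parses the text of /proc/PID/status into a dictionary, tests for the D (disk sleep) state, and splits the contents of /proc/PID/cmdline into arguments. The arguments look run together when displayed with cat because they are separated by NUL bytes rather than spaces. The helper names and the sample strings are illustrative assumptions modeled on the output above.

```python
def parse_status(text):
    """Turn the 'Key:<tab>value' lines of /proc/<pid>/status into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def is_uninterruptible(fields):
    """True when the State field starts with 'D' (disk sleep)."""
    return fields.get("State", "").startswith("D")


def split_cmdline(raw):
    """/proc/<pid>/cmdline separates arguments with NUL bytes."""
    return [arg.decode() for arg in raw.split(b"\0") if arg]


# Sample data modeled on the dd process shown above.
sample_status = "Name:\tdd\nState:\tD (disk sleep)\nPid:\t8070\nPPid:\t8023\n"
sample_cmdline = b"dd\0if=/dev/sdi\0of=/dev/null\0bs=1024k\0"

fields = parse_status(sample_status)
args = split_cmdline(sample_cmdline)
```

On a live system, the inputs would come from open("/proc/8070/status").read() and open("/proc/8070/cmdline", "rb").read(), with 8070 replaced by the PID of interest.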

~~~~Errors continue~~~~~
SCSI disk error : host 2 channel 0 id 0 lun 2 return code = 70022
I/O error: dev 08:80, sector 13485822
SCSI disk error : host 2 channel 0 id 0 lun 2 return code = 70022
I/O error: dev 08:80, sector 13485886

Notice the last sector that failed; the difference between the last failed sector (13485886) and the first failed sector reported earlier (13485760) is 126 sectors, or 64,512 bytes at 512 bytes each. No matter the drive size, if a user issues a dd command on a failed drive, the duration of the I/O hang depends on the size of the outstanding block request. Setting the block size to 64K or less hangs the I/O at the SCSI layer on a buffer wait on each outstanding 2048-byte read request. To see the wait channel, issue ps -ef | grep dd, and then using the PID, issue ps -eo comm,pid,wchan | grep PID to find something such as dd ##### wait_on_buffer. Refer to Chapter 8 to acquire a better understanding of process structure because a discussion of system calls is beyond the scope of this chapter. The point of this general discussion is to demonstrate that the I/O will eventually time out or abort.
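The 126-sector figure can be verified with simple arithmetic. It is the distance from sector 13485760, the first sector reported in the earlier dmesg output, to sector 13485886, the last one reported. The variable names below are only illustrative.

```python
SECTOR_SIZE = 512          # bytes per sector, as reported for this disk

first_failed = 13485760    # first sector reported in the dmesg output
last_failed = 13485886     # last sector reported before the errors stop

span_sectors = last_failed - first_failed
span_bytes = span_sectors * SECTOR_SIZE
print(span_sectors, span_bytes)   # prints: 126 64512
```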

Again, the most important thing to understand about a failed device on a given bus is that we cannot remove the driver that controls the bus access path, such as lpfcdd.o, and of course we cannot remove the protocol driver, which is SCSI in this case. The reason is that all remaining devices on the bus are online and in production, transmitting I/Os. Issuing a command such as rmmod lpfcdd simply yields the result “device busy.” The only recovery method for this particular example is to restore access to the given device. If this involves replacing the device, such as installing a new LUN, the running kernel will have the incorrect device tree characteristics for the data construct. This is discussed in great detail in Chapter 8. If device access cannot be restored, a new device must be put online, the server rebooted, and the data restored from backup.

Summary

Whether a user decides to replace, repair, or remove old hardware, isolating and understanding the error return codes is critical to determining how to conduct the repair. We hope that this chapter provided you with insight into troubleshooting SCSI devices and the necessary tools to make the replacement decision easier, and possibly faster, the next time around.
