CHAPTER 1

Storage Essentials

In this chapter, you will learn how to

•  Explain how drive storage systems work

•  Describe the types of disk interfaces and their characteristics

•  Combine multiple disks to achieve larger storage and fault tolerance

•  Classify the way hosts interact with disks

At its most basic level, storage starts with a disk. These disks, connected either internally or externally to systems via interfaces, can be accessed by systems to read or write data. Disks can also be combined to achieve greater speed than each disk could individually provide, or they can be combined to protect against one or more disk failures. One or more disks are then presented to a host, enabling it and its applications to work with the storage.

How Disk Storage Systems Work

The hard disk drive (HDD) or hard drive is a good place to begin a discussion on storage. HDDs have been in use in computer systems for decades, and the principles behind their basic operation remain largely unchanged. The solid-state drive (SSD) is a more recent introduction, and although it uses different principles for storing data, it has not replaced the HDD. SSD and HDD both have their uses in storage systems today, which will be explained in this section.

Physical Components

Originally, the term hard disk drive was used to differentiate the rigid, inflexible metal drive from floppy disks, which could be easily bent. While floppy disks are mostly a thing of the past, the hard disk drive term remains. HDDs are physically composed of platters, heads, a spindle, and an actuator arm all sealed within a rectangular container that is 3.5 or 2.5 inches wide called the head disk assembly (HDA). This is pictured in Figure 1-1. Higher capacity can be obtained from 3.5-inch drives because they have a larger surface area. These 3.5-inch drives are typically installed in desktops and servers, while 2.5-inch drives are commonly found in laptops. Some servers use 2.5-inch drives in cases where an increased number of drives is preferred over increased capacity. Beginning with the platter, each of these components is discussed in more detail, and key performance metrics are introduced for several components.



Figure 1-1 HDD components

Platter

If you open an HDD, the first thing you notice is a stack of round flat surfaces known as platters. Platters are made of thin but rigid sheets of aluminum, glass, or ceramic. The platters are covered on both the top and the bottom with a thin coating of substrate filled with small bits of metal. The majority of drives contain two or three platters providing four or six usable surfaces for storing data.

Spindle

An HDD has multiple platters connected to a rod called a spindle. The spindle and all the platters are rotated at a consistent rate by the spindle motor. Spindles commonly spin at 7,200, 10,000, or 15,000 rotations per minute (rpm). Disk rpm is the determining factor in rotational latency, an important disk performance metric.

Rotational latency is the amount of time it takes to move the platter to the desired location, measured in milliseconds (ms). Full rotational latency is the amount of time it takes to turn the platter 360 degrees. Average rotational latency is roughly half the full rotational latency. Average rotational latency is a significant metric in determining the time it takes to read random data. Rotational latency is directly related to rotational speed, with faster disks (higher rpm) providing lower rotational latency. Average rotational latency can be computed using the following formula:

Average rotational latency = 0.5 / (rpm/60) × 1,000

With this formula, a disk with 10k (10,000) rpm would have a rotational latency of 3 ms.
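
If you want to check this math yourself, the formula translates directly into a few lines of code. The following Python sketch (the function name and the sample spindle speeds are simply illustrative) computes average rotational latency for common drive speeds.

```python
def average_rotational_latency_ms(rpm: float) -> float:
    """Return the average rotational latency in milliseconds.

    Average latency is half the time of one full revolution:
    0.5 / (rpm / 60) * 1,000 converts rotations per minute
    into milliseconds per half revolution.
    """
    return 0.5 / (rpm / 60) * 1000

# Common spindle speeds and their average rotational latency
for rpm in (7200, 10000, 15000):
    print(f"{rpm:>6} rpm -> {average_rotational_latency_ms(rpm):.1f} ms")
```

At 7,200 rpm the average rotational latency works out to roughly 4.2 ms, and at 15,000 rpm to 2 ms, which is one reason higher-rpm drives are favored for random workloads.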

Cylinder

HDDs are organized into a series of concentric rings similar to the rings on a tree. Each ring—called a cylinder—is numbered, starting at 0 with the outermost cylinder. Cylinders provide a precise location where data is located relative to the platter’s center. Each cylinder consists of sections from the platters located directly above and below it. These cylinder sections on each platter start and stop at the same relative location. The cylinder area on an individual platter is called a track, so cylinder 0 would comprise track 0 on the top and bottom of all platters in the disk. Figure 1-2 shows hard disk platters and the tracks that make up cylinder 0. A physical demarcation, delineating the tracks on the platter, enables the head (discussed next) to follow and remain in proper alignment.



Figure 1-2 Cylinder 0 and the tracks comprising it

Tracks are organized into sectors, each of a fixed length. A process, often performed by the disk manufacturer, called low-level formatting creates the sectors on the disk and marks the areas before and after each sector with an ID, making them ready for use. These sectors make it easier to more precisely locate data on the disk, and they are more manageable because of their consistent size. Disks used a sector size of 512 bytes until 2009, when advanced format drives were released that use 4,096 bytes, or 4 kilobytes (4KB), per sector. Advanced format drives are the most common ones in use today.

Earlier HDD designs stored the same amount of data on each track even though the inner tracks were much smaller physically than the outer tracks. Later designs incorporated zone bit recording to divide the tracks into zones, with some zones having more sectors than others. Each track in a zone has the same number of sectors, with the zone closest to the center having the fewest sectors. In this way, more data could be written to the outside tracks to make use of much more space on the platters. However, with zone bit recording, the disk still spins at a constant rate, so data stored in the outside zones is read much more quickly than data in the inside zones. Since performance is partially determined by where benchmarking data is stored, this can lead to inconsistent results. More comparable results can be achieved when the benchmark is performed on the outer zones of all the disks being compared.

Head

HDDs contain small electromagnets called heads. These heads, one each for the top and bottom surfaces of each platter in the disk, are attached to an actuator arm. All heads are attached to the same actuator arm, so they move together back and forth across the platter. The actuator arm can move the heads across the radius of the platters to a specific cylinder. To locate a specific section of the disk, the actuator will move the heads to the appropriate cylinder, and then the head must wait for the platter to rotate to the desired sector. The actuator arm and platter rotation thus allow the heads to access a point almost anywhere on the platter surface.

A common disk metric associated with the head is seek time, which is the time it takes to move the head, measured in milliseconds. There are three types of seek times. Full stroke seek time is the time it takes to move the head from the first cylinder to the last cylinder. Average seek time is usually one-third of the full stroke time, and it represents the average amount of time it takes to move from one cylinder to another. Average seek time is an important metric in determining the time it takes to read random data. The last type of seek time is track-to-track, which is the time it takes to move to the next track or cylinder. Track-to-track seek time is a useful metric for determining the seek time for reading or writing sequential data.


EXAM TIP    If a question asks about random reads, look for an answer that contains average seek time, but if the question asks about sequential reads, look for track-to-track seek time.

When a disk is not in use or powered down, heads are parked in a special area of the platter known as the landing zone. This landing zone is typically located on the innermost area of the platter. As the platters rotate, the disk heads float on a cushion of air nanometers above the platter surface. The actuator arm moves the heads back to the landing zone when the disk spins down again. Normally, heads should not come in contact with the platter in any area but the landing zone because contact can cause damage to both the head and the platter. This is known as a head crash, and it is important to avoid because it can lead to data loss.

Head crashes can be avoided by following these best practices:

•   Secure hard disks into mountings or disk caddies using four screws, two on each side.

•   Wait for disks to spin down and park heads before removing them from a bay.

•   Transport hard disks in impact-resistant packaging such as bubble wrap or foam.

•   Do not place heavy objects on top of hard disks to avoid warping the casing.

•   Give disks time to acclimate to a warmer or cooler environment before powering them on.

•   Replace hard disks that are nearing their expected life span.

•   Connect disks to reliable power supplies to avoid power surges or power loss during operation.

•   Shut down servers and equipment containing hard disks properly to give disks time to spin down and park heads.

The head is shaped like a horseshoe where current can be passed through in one of two directions. When current passes through the head, an electromagnetic field is generated, and the metallic particles in the substrate are polarized, aligning them with the field. These polarized particles are described as being in magnetic flux. Polarization can occur in one of two directions depending on the direction the current is flowing through the head. At first glance, it might seem logical to assume that the two polarities correspond to a binary 1 or 0, but the reality is more complex. When polarity is positive, the next transition will always be to a negative polarity, and vice versa. This means a sequence of identical bits cannot be represented simply by assigning each polarity a bit value: the polarity must keep switching back and forth, so it indicates only that a transition has occurred, not what that transition represents.

When performing read operations, the disk head can detect only changes in polarity, known as flux transitions, so write operations are designed to create transitions from positive to negative or from negative to positive. When the head passes over an area of the platter that has a flux transition, a current is generated in the head corresponding to the direction of the flux transition. Here are some examples:

•   The head generates negative voltage when the flux transition goes from positive to negative.

•   The head generates positive voltage when the flux transition goes from negative to positive.

Each flux transition takes up space on the platter. Disk designers want to minimize the number of transitions necessary to represent binary numbers on the disk, and this is where encoding methods come in. HDDs utilize an encoding method, consisting of a pattern of flux transitions, to represent one or more binary numbers. The method currently in use today is known as run length limited (RLL), and it has been in use since the 1980s. The method represents groups of bits in several flux transitions.


NOTE    RLL is not specific to disks. It is used in many situations where binary numbers need to be represented on a physical medium such as cable passing light or current.

Controller Board

A controller board mounted to the bottom of the disk controls the physical components in the disk and contains an interface port for sending and receiving data to and from a connected device. The controller board consists of a processor, cache, connecting circuits, and read-only memory (ROM) containing firmware that includes instructions for how the disk components are controlled. Data that comes in from the interface and data that is waiting to be transmitted on the interface reside in cache.

All of these components in the disk work together to store data on the disk. Disks receive data over an interface and place this data into cache until it can be written to the disk. Depending on the cache method, an acknowledgment may immediately be sent back over the interface once the data is written to cache, or it may wait until the data is written to the disk. The data in cache is marked as dirty so that the controller board knows that the data has not been written to disk yet. The data is then flushed to the disk. In this write operation example, the controller board will look up the closest available sector to the head’s current position and instruct the actuator arm to move and the platters to rotate to that location.

Once the head is positioned at the correct location, a current passes through the head to polarize the platter substrate. Sections of the substrate are polarized in a pattern determined by the disk’s encoding method to represent the binary numbers that make up the data to be written. This will continue until the sector is full. The disk head will continue writing in the next sector if it is also available. If it is not available, the controller board will send instructions to move to the next available location and continue the process of writing data until complete.

The controller locates data based on cylinder, head, and sector (CHS). The head information describes which platter and side the data resides on, while the cylinder information locates the track on that platter. Lastly, the sector addresses the area within that track so that the head can be positioned to read or write data to that location.

When the data has been fully written to the disk, the disk may send an acknowledgment back to the unit if it did not do so upon writing the data to cache. The controller will then reset the dirty bit in cache to show that the data has been successfully written to the disk. Additionally, the data may be removed from cache if the space is needed for another operation.
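
The write caching behavior just described can be modeled in a few lines of code. The sketch below is a toy simulation, not actual drive firmware; the class name, method names, and write-back flag are invented for illustration. It shows the dirty flag the controller keeps for cached data and the difference between acknowledging a write once it is in cache and acknowledging it only after it has been flushed to the platters.

```python
class DiskCacheModel:
    """Toy model of a drive controller's write cache (illustrative only)."""

    def __init__(self, write_back: bool = True):
        self.write_back = write_back
        self.cache = {}      # sector -> (data, dirty flag)
        self.platters = {}   # sector -> data persisted "on disk"

    def write(self, sector: int, data: bytes) -> str:
        # Data always lands in cache first and is marked dirty.
        self.cache[sector] = (data, True)
        if self.write_back:
            return "ACK"     # acknowledged as soon as the data is cached
        self.flush()
        return "ACK"         # write-through: acknowledged only after the flush

    def flush(self) -> None:
        # Move dirty data to the platters and clear the dirty flag.
        for sector, (data, dirty) in self.cache.items():
            if dirty:
                self.platters[sector] = data
                self.cache[sector] = (data, False)

drive = DiskCacheModel(write_back=True)
print(drive.write(0, b"payload"))   # ACK returned before data reaches the platters
drive.flush()                       # dirty data is later written to disk
```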


EXAM TIP    If a question asks about controller performance metrics, look for an answer with the words transfer rate.

The disk metric often associated with the controller is transfer rate. This is the rate at which data is moved, and it comprises two metrics. Internal transfer rate is the rate at which data moves between the disk controller buffer and a physical sector. External transfer rate is the rate at which data moves over the disk interface. External transfer rates will be discussed in the “Available Disk Interfaces and Their Characteristics” section.

Solid-State Drive

The solid-state drive is a form of data storage device far different from a hard disk drive. The only real similarities between SSD and HDD are that both store data, come in a similar 2.5-inch size, and share the SATA and SAS interfaces discussed in the next section. SSD does not have platters, spindles, or heads. In fact, there are no moving parts at all in SSD. Because of this, SSD uses far less power than HDD. SSDs do have a controller board and cache that operate similarly to their HDD counterparts. In addition to the controller board and cache, SSD consists of solid-state memory chips and an interface. Internal circuitry connects the solid-state memory chips to the controller board. Each circuit over which data can pass is called a channel, and the number of channels in a drive determines how much data can be read or written at once. The speed of SSD is far superior to HDD.

SSD organizes data into blocks and pages instead of tracks and sectors. A block consists of multiple pages and is not the same size as a block in HDD. Pages are typically 128 bytes. A block may be as small as 4KB (4,096 bytes) made up of 32 pages, or it could be as large as 16KB (16,384 bytes) made up of 128 pages. SSD disks support logical block addressing (LBA). LBA is a way of referencing a location on a disk without a device having knowledge of the physical disk geometry; the SSD page is presented to a system as a series of 512-byte blocks, and the controller board converts the LBA block address to the corresponding SSD block and page. This allows systems to reference SSD just as they would HDD.
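
To see what the controller’s LBA conversion involves, consider the following sketch. It assumes the 128-byte pages and 32-page (4KB) blocks from the example above and a fixed arithmetic mapping; real SSD controllers use a flash translation layer with remapping tables, so treat this only as an illustration.

```python
LOGICAL_BLOCK_SIZE = 512   # bytes per logical block presented to the host
PAGE_SIZE = 128            # bytes per SSD page (per the example above)
PAGES_PER_BLOCK = 32       # 32 pages x 128 bytes = 4KB SSD block

def lba_to_ssd_location(lba: int):
    """Map a logical block address to the SSD block and first page holding it."""
    byte_offset = lba * LOGICAL_BLOCK_SIZE
    page_index = byte_offset // PAGE_SIZE            # absolute page number
    block = page_index // PAGES_PER_BLOCK            # SSD block number
    page_in_block = page_index % PAGES_PER_BLOCK     # page within that block
    pages_spanned = LOGICAL_BLOCK_SIZE // PAGE_SIZE  # a 512-byte LBA covers 4 pages
    return block, page_in_block, pages_spanned

print(lba_to_ssd_location(10))   # -> (1, 8, 4): block 1, starting at page 8
```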

SSD has a shorter usable life than HDD because its cells wear out. SSD disks will deteriorate over time, resulting in longer access times, until they eventually need to be replaced. Enterprise SSD is often equipped with wear-leveling techniques that store data evenly across cells. Data writes and changes are written to the block with the least use so that solid-state cells age at roughly the same rate.

SSD gets its name from the solid-state memory it uses to store and retrieve data. This memory consists of transistors and circuits similar to the memory in a computer, except that it is nonvolatile—meaning that the data stored on solid state does not need to be refreshed, nor does it require power to remain in memory. This technology is also called negated AND (NAND) flash; thus, the alternative name for SSD is flash drive. NAND chips come in two types, as discussed next: single-level cell (SLC) and multiple-level cell (MLC).


NOTE    This book will use the acronym SSD instead of the term flash drive to differentiate SSD from USB-based flash media also known as flash drives.

Single-Level Cell

Single-level cell NAND flash is arranged in rows and columns, and current can be passed or blocked to cells. Each intersection of a row and column is referred to as a cell, similar to the cells in a table. Two transistors in the cell control whether the cell represents a binary 0 or 1. Those with current are a 1, and those without are a 0. A transistor called the floating gate is surrounded by an oxide layer that is normally nonconductive except when subjected to a significant electric field. The control gate charges the floating gate by creating an electric field to make the oxide layer conductive, thus allowing current to flow into the floating gate. Once the floating gate is charged, it will remain so until the control gate drains it. In this way, the charge remains even when power is disconnected from the disk. The threshold voltage of the cell can be measured to read whether the cell is a 1 or a 0 without having to open the control gate.

The control gate can make the oxide layer conductive only so many times before it breaks down. This change in conductivity happens each time there is a write or erase, so the life span of the disk is tracked based on write/erase cycles. The average rating for SLC is 100,000 write/erase cycles. This is much better than the write/erase rating for MLC of 10,000 but much less than a standard HDD, which does not have a write/erase limit.

Multiple-Level Cell

Multiple-level cell technology stores two or more bits per cell, greatly increasing the storage density and overall capacity of SSD disks using MLC. MLC NAND flash is organized like SLC NAND flash, but MLC control gates can regulate the amount of charge given to a floating gate. Different voltages can represent different bit combinations. To store two bits of information, the MLC cell must be able to hold four different charges, mapping to the four combinations of two binary digits: 00, 01, 10, and 11. Likewise, the MLC cell would need to hold eight different charges to store three bits of information. More of the oxide layer breaks down in each write or erase cycle on MLC cells, resulting in a much lower life span than SLC cells. The average rating for MLC is 10,000 write/erase cycles, which is one-tenth the rating for SLC. MLC cells also have lower performance than SLC since it takes longer for them to store and measure voltage.
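
One way to picture MLC reads is as a set of voltage thresholds dividing the cell’s charge range into four levels, each mapped to a two-bit pattern. The thresholds and bit assignments in the sketch below are invented for illustration; actual NAND devices use manufacturer-specific reference voltages and encodings.

```python
# Illustrative only: four charge levels in an MLC cell mapped to two-bit values.
LEVEL_TO_BITS = {0: "11", 1: "10", 2: "01", 3: "00"}
THRESHOLDS = (1.0, 2.0, 3.0)    # assumed voltages separating the four levels

def read_mlc_cell(cell_voltage: float) -> str:
    """Return the two-bit pattern stored in a cell based on its measured voltage."""
    level = sum(cell_voltage >= t for t in THRESHOLDS)   # count thresholds crossed
    return LEVEL_TO_BITS[level]

for volts in (0.4, 1.5, 2.5, 3.6):
    print(f"{volts:.1f} V -> bits {read_mlc_cell(volts)}")
```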

SLC’s higher performance and write/erase rating over MLC makes it the primary technology used in enterprise SSD, whereas MLC is most commonly seen in USB flash drives and flash media cards such as SD, MMC, Compact Flash, and xD.


EXAM TIP    Choose SLC if an exam question lists high performance or longevity as the primary concern, but choose MLC if cost is the primary concern.

Electrical and Mechanical Differences

The choice of whether to put enterprise, midrange, or entry-level drives in a storage array or server is not often one you will be allowed to make. Server and storage vendors generally build their systems to recognize only drives that contain proprietary firmware. This allows them to mark up the drives they sell, and it ensures that the drives used in their systems are compatible and capable of meeting their published performance and reliability metrics. However, there will be cases where you will have a choice of which drive to put in a system. You will likely be able to choose which drives to place in single-shelf NAS devices, custom servers, and equipment made by manufacturers not large enough to justify developing their own drive firmware and quality-testing procedures.

Your choice of drive falls into three categories—enterprise, midrange, and consumer—with enterprise drives offering the greatest performance and reliability at the highest price and consumer drives the lowest performance and reliability at the lowest price. You should select a drive type that meets the performance and reliability requirements of your application and the server or storage device’s operating environment. You might then adjust your selection upward to factor in higher utilization down the road. However, be careful to understand (and be able to explain) the impact of adjusting downward to meet cost requirements because application performance or reliability will likely be impacted.

Performance

Consumer drives, also called desktop or entry-level drives, are made for a single user who only occasionally needs to access information. Midrange drives offer somewhat higher performance than consumer drives but not the performance of an enterprise drive. Midrange drives are intended for small office NAS devices and servers with low I/O and utilization. An enterprise-grade drive is made to access data for many different users and run many different applications at the same time, which makes the drive work continuously.

Enterprise drives will perform better than consumer drives in a multiuser environment because the internal components are usually made to access the data faster. These mechanisms include faster spindle speeds, denser disks, and much more on-board cache, which is used to increase cache hits for reads and to send write acknowledgments sooner.

Reliability

Enterprise-class drives are made with an estimated mean time between failures (MTBF) of at least 1.2 million hours. Consumer drives, on the other hand, have an estimated MTBF of only 700,000 hours. While these numbers may seem close at first, remember that consumer drives have an average utilization of 10–20 percent, whereas enterprise drives have 75–90 percent utilization. If both drives were run at 50 percent utilization, the MTBF might be 350,000 hours for the consumer drive and 1.8 million hours for the enterprise drive. In short, using a drive for an unintended purpose can drastically affect the MTBF.

If you look at these numbers rationally, they will immediately seem implausible, because 1.2 million hours is roughly 137 years. However, let me add some clarity to these reliability numbers before you add that shiny new drive to your last will and testament. It is impossible to test drives for decades before releasing them, so statistical formulas are used and unit testing is performed to obtain these estimates. It is important to note that the accuracy of these metrics is limited to first-year operation only. The drive manufacturers gained these numbers by running a large number of drives for a shorter period of time and then computed an annualized failure rate (AFR). They then divided one year by the AFR to get the MTBF. The MTBF can be converted to the AFR by dividing the number of hours in a year by the MTBF. Thus, the enterprise drives with the 1.2 million MTBF have an AFR of 0.73 percent, whereas the consumer drives would have an AFR of 1.25 percent. The following computations are used to convert the MTBF to AFR for the enterprise drive:

•   Hours in a year / MTBF = AFR

•   Hours in a year: 24 × 365 = 8,760

•   8,760 / 1,200,000 = .0073, or 0.73%

If we take this a step further, we can see how many drives a storage administrator could expect to replace in the first year of a storage system if the storage system had 400 drives. Four hundred drives multiplied by the AFR of 0.0073 is 2.92, so the storage administrator could expect to replace about three drives the first year.
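
These conversions are easy to script. The following sketch uses the MTBF figures and 400-drive example from this section; the function names are illustrative rather than part of any vendor tool.

```python
HOURS_PER_YEAR = 24 * 365   # 8,760 hours

def mtbf_to_afr(mtbf_hours: float) -> float:
    """Convert mean time between failures (hours) to annualized failure rate."""
    return HOURS_PER_YEAR / mtbf_hours

def expected_first_year_failures(drive_count: int, mtbf_hours: float) -> float:
    """Estimate first-year drive replacements for a population of drives."""
    return drive_count * mtbf_to_afr(mtbf_hours)

print(f"Enterprise AFR: {mtbf_to_afr(1_200_000):.2%}")   # ~0.73%
print(f"Consumer AFR:   {mtbf_to_afr(700_000):.2%}")     # ~1.25%
print(f"Expected failures, 400 enterprise drives: "
      f"{expected_first_year_failures(400, 1_200_000):.1f}")  # ~2.9 drives
```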

Also, the AFR metric is for the first year of life. The likelihood of failure increases exponentially each consecutive year. By the third year, the AFR is about five times what it was in the first year. For this reason, even enterprise drives are often warrantied only three years.

Enterprise drives have more components in them that detect read errors. When a consumer-grade drive tries to read a bad sector, it tries to read that sector over and over again, and there might be a long wait before the user realizes that the data cannot be accessed. An enterprise environment does not have the tolerance to wait around for one drive to fix the issue. The data needs to be available at all times. Because most enterprise drives are in a RAID configuration, the computer can immediately go to another drive that holds a copy of the data on the drive with the bad sector and read the data from that one. After the data has been given to the user, the computer can then try to fix the bad portion of the drive that failed. Using consumer drives in such an environment increases the likelihood of drives being marked “offline,” which can result in kernel panics and system crashes.

Enterprise-class drives use checksums when data is transmitted to make sure that the data has not been tampered with or corrupted. If the checksum does not match the data provided, the computer can make another request for the data until it is received without error. Enterprise drives often include error-correction techniques that are not used on consumer drives. Each sector in an enterprise-class drive has enough information to store a checksum of the sector and restore information if the sector is damaged. Consumer drives may have error-detecting checksums that detect that an error has occurred, but these sectors do not carry enough data to rebuild a corrupted area of the disk if something goes wrong. Enterprise-class hard drives are recommended by the manufacturers because when a consumer-class drive detects an error, it enters a lengthy recovery cycle, repeatedly trying to recover from the error. This process can take so long that a RAID controller will often mark the entire drive as unusable, even though only one sector went bad, which can then take the whole array offline. Enterprise-class drives use time-limited error recovery to prevent this from occurring. This problem will progressively get worse for desktop-class drives as they age and may result in data loss. Enterprise-class drives are designed to have the RAID controller decide how the sector will be recovered.

Enterprise-grade hard disks are manufactured to be more reliable than the consumer models. Consumer drives are designed to operate at 10–20 percent average utilization. This is the type of utilization that would occur with one primary user of the device and possibly some shared access to media over a home or office network. In contrast, enterprise drives are designed for 75–90 percent utilization. They are designed to be utilized for a significant portion of the day, with multiple users or applications reading from them and writing to them concurrently. The higher workload requires greater reliability because the drive must be able to withstand a consistently higher level of read and write I/O than a consumer drive.

The cycle of a drive spinning up to read or write data and then stopping after it is done is known as a load/unload cycle. An enterprise-level drive is usually rated to have about twice as many of these cycles as an entry-level drive. This is an important point to think about when deciding which drive to use in a system that will be constantly spinning up and down the disks. A business wants to have as little latency as possible when performing a new read, and the spin-up time can significantly add to that latency.

Operating Environment

Drives differ in their operating environments, just as they differ in use cases. Enterprise drives are typically installed in high-density storage devices or servers. Today’s high-density servers can hold up to 24 2.5-inch drives in a 2U chassis or 24 3.5-inch drives in a 3U chassis, whereas high-density storage array shelves can hold 48 2.5-inch drives in a 2U chassis or 48 3.5-inch drives in a 3U chassis. Midrange systems might have six or ten drives in them, and the drives have more space for air flow and heat dissipation. Desktop systems are usually built with only one or two drives that are located away from many of the other computer components.

High-density enterprise environments need drives that can run hotter and withstand more rotational vibration. Spinning platters in the hard drives can transmit vibrations into the computer case, which can then cause drives in close proximity to vibrate as well. Consistent vibration causes stress on a drive, eventually leading to the breakdown of components such as the actuator arm or read/write heads. Enterprise drives have a much greater tolerance for these vibrations and are often equipped with sensors that track vibration and Rotational Vibration Feed Forward (RVFF), which determines the correction to apply to the voice coil motor actuator to keep the heads in the right position based on the vibrations detected. Instead of RVFF and sensors, midrange drives come equipped with dual-plane balance control. Imbalance of the platters can produce excessive vibration, and dual-plane balance control stabilizes this imbalance to reduce internal vibration.

Enterprise-class drives are designed for continuous operation at 45°C with a maximum operating temperature around 60°C. Consumer-grade hard drives, on the other hand, are designed to operate at 25°C with a maximum operating temperature around 40°C.

I/O vs. Throughput

Throughput is the amount of data transferred over a medium in a measurable time interval. The throughput of disks and RAID sets is often measured in terms of I/O operations per second (IOPS). IOPS is often calculated for different types of data usage patterns as follows:

•   Sequential read IOPS The number of read operations performed on data that resides in contiguous locations on the disk per second

•   Random read IOPS The number of read operations performed on data spread across the disk (random) per second

•   Sequential write IOPS The number of write operations performed on data that resides in contiguous locations on the disk per second

•   Random write IOPS The number of write operations performed on data spread across the disk (random) per second

IOPS is an important metric because it can be calculated at differing levels of abstraction from the storage. For example, IOPS can be calculated from the operating system or from the application or even the storage array, usually with built-in tools. Each calculation reflects additional factors as you move further away from the disks themselves. IOPS is calculated end-to-end, so it can comprise many factors such as seek time and rotational latency of disk, transfer rates of interfaces and cables between source and destination, and application latency or load in a single metric. IOPS is fairly consistent for HDD, but SSDs are heavily customized by manufacturers, so their IOPS values will be specified per disk. SSD IOPS can range from as low as 5,000 to 1,000,000 IOPS, so it pays to know which disks are in a system. Table 1-1 shows standard IOPS ratings for common SATA, SAS, and SSD disks.

Table 1-1 Standard IOPS Ratings for SATA, SAS, and SSD Disks

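For a spinning drive, you can roughly estimate per-disk random IOPS from the two mechanical delays discussed earlier: average seek time and average rotational latency. The sketch below applies that rule of thumb; the seek times are typical published values assumed for illustration, and the estimate ignores transfer time, caching, and queuing, so treat the results only as ballpark figures.

```python
def estimate_random_iops(avg_seek_ms: float, rpm: float) -> float:
    """Rough random IOPS for an HDD: one I/O per (seek + rotational latency)."""
    rotational_latency_ms = 0.5 / (rpm / 60) * 1000
    service_time_ms = avg_seek_ms + rotational_latency_ms
    return 1000 / service_time_ms     # I/Os that fit in one second

# Typical published seek times, assumed here for illustration
print(f"7,200 rpm SATA: ~{estimate_random_iops(8.5, 7200):.0f} IOPS")
print(f"15,000 rpm SAS: ~{estimate_random_iops(3.5, 15000):.0f} IOPS")
```

The results, roughly 79 IOPS for a 7,200 rpm drive and 182 IOPS for a 15,000 rpm drive, are in the same neighborhood as the published ratings for those drive types.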

Capacity vs. Speed

Determining the right storage for the job starts with determining the requirements of capacity and speed. Capacity involves how much space the system or application requires, and speed is how fast the storage needs to be. I/O and throughput were discussed in the previous section, providing an understanding of how to measure the speed of a device. An application may also have a speed requirement, which is often represented in IOPS. Disks may need to be combined to obtain the necessary capacity and speed for the application. For example, an application requires 5,000 IOPS and 2TB capacity; 600GB 15,000 rpm SAS disks that get 175 IOPS and 200GB SSD disks that get 4,000 IOPS are available for use. Twenty-nine SAS disks would be necessary to get 5,000 IOPS, and this would provide more than enough capacity (17.4TB). Alternatively, two SSD disks would provide the necessary IOPS but not the required capacity. It would take ten SSD disks to reach the required 2TB capacity, which would be more than enough IOPS at this point. These rough calculations assume that all disks would be able to be used concurrently to store and retrieve data, which would offer the most ideal performance circumstances. The “RAID Levels” section will show different methods for combining multiple disks and how they can be used to gain additional speed and prevent against data loss when one or more disks fail.
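
The sizing exercise above follows a simple pattern: compute the number of disks needed for IOPS, compute the number needed for capacity, and take the larger of the two. The sketch below reproduces that calculation using the disk specifications assumed in the example.

```python
import math

def disks_needed(req_iops: float, req_capacity_gb: float,
                 disk_iops: float, disk_capacity_gb: float) -> int:
    """Return the number of disks needed to satisfy both IOPS and capacity."""
    for_iops = math.ceil(req_iops / disk_iops)
    for_capacity = math.ceil(req_capacity_gb / disk_capacity_gb)
    return max(for_iops, for_capacity)

# Application requirement: 5,000 IOPS and 2TB (2,000GB)
print("SAS (175 IOPS, 600GB):", disks_needed(5000, 2000, 175, 600))     # 29 disks
print("SSD (4,000 IOPS, 200GB):", disks_needed(5000, 2000, 4000, 200))  # 10 disks
```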

Available Disk Interfaces and Their Characteristics

Interfaces are the connection between data storage devices and other devices such as computers, networks, or enterprise storage equipment. Internal interfaces connect storage devices within a computer or storage system, while external interfaces connect to stand-alone storage equipment such as disk enclosures, tape libraries, or compact disc jukeboxes.

A variety of interfaces have been introduced over the years, some designed for high cost and high performance with a large amount of throughput and others designed for lower cost and lower performance.

ATA

Versions of Advanced Technology Attachment (ATA) provide a low-cost, low-performance option and are ideal when a large amount of storage is needed but does not have to be accessed directly by end users or applications. Examples of good uses of ATA include backup storage or storage for infrequently used data.

PATA

Parallel ATA (PATA), also known as Integrated Drive Electronics (IDE), was introduced in 1986 and standardized in 1988. It was the dominant interface used for HDDs and CD-ROM drives until 2003. PATA disks were originally referred to only as ATA disks, but the acronym PATA now differentiates these disks from SATA disks. The IDE name is derived from the fact that PATA disks have a controller integrated onto the disks instead of requiring a separate controller on the motherboard or expansion card. The integrated controller handles commands such as moving the actuator arm, spinning the disk up, and parking the heads.

A PATA cable, depicted in Figure 1-3, has 40 or 80 wires that make contact with the connectors on the disk and the motherboard. PATA cables also have a maximum length of 18 inches. Data is transferred in parallel, meaning that multiple bits of data are transferred at once. The number of bits transferred in a single operation is called the data bus width; in the case of PATA, it is 16.



Figure 1-3 A PATA cable

Up to two devices may be attached, but in order for both to share the same cable, one device must be configured as the master (device 0) and the other as the slave (device 1). The master or slave setting is configured by a jumper on the back of the PATA device, depicted in Figure 1-4. Some devices support a mode called cable select, whereby the device automatically configures itself as master or slave depending on its location on the cable.



Figure 1-4 A jumper for the master, slave, or cable select setting

PATA cables are a shared medium allowing only one device to communicate over the cable at a time. Connecting multiple devices to a single cable can impact the performance of the devices if disk operations execute on both disks at once, as is common when transferring files between disks or when installing software from a CD-ROM attached to the same cable. While earlier computers typically were equipped with two connectors on the motherboard, supporting up to four PATA devices, most computers today do not even include a PATA port on the motherboard. The best practice in this case is to place highly used devices on separate cables and pair lesser used devices with highly used ones so as to give each device maximum time on the cable.

PATA disks are powered by a plug with four pins and four wires known as a molex connector. The molex connector has one yellow 12-volt wire, one 5-volt red wire, and two black ground wires.

The last version of PATA, called Ultra ATA/133, had a maximum theoretical transfer rate of 133 MBps that was quickly eclipsed by SATA. Table 1-2 lists the PATA versions and their significant properties. Note that megabytes per second (MBps) is different from megabits per second (Mbps). There are 8 bits in a byte, so the megabit transfer speed is 8 times the megabyte transfer speed.

Table 1-2 PATA Versions


SATA

Serial ATA (SATA) was introduced in 2003 to replace the aging PATA. The parallel architecture of PATA created difficulties for exceeding the 133-MBps limit because its parallel transmissions were susceptible to electromagnetic interference (EMI) among the wires in the bus. SATA, however, runs on lower voltages and uses a serial bus, so data is sent one bit at a time. These enhancements bypassed some of the problems faced with PATA and allowed for higher transfer rates. SATA utilized the ATA command set, making it easier for equipment to include SATA ports since ATA was already supported on a wide variety of systems and BIOSs. SATA disks can also be connected as an external disk using an eSATA cable and enclosure, which offers faster speeds than external disks connected over USB or IEEE 1394 FireWire. Table 1-3 provides the speeds of each of these external disk technologies.

Table 1-3 External Disk Speeds by Type


SATA cables have seven conductors and a maximum length of 39 inches. SATA disks do not use a molex connector. Instead, the SATA power connector is smaller with 15 pins. Only one device can be connected to a cable, meaning the issue of a shared bus with master and slave designations does not exist with SATA disks. There have been multiple generations of SATA, depicted in Table 1-4. It should be noted that the bit per second transfer rates are not what one would expect since they are ten times the megabyte rate. This is intentional because SATA uses 8b/10b encoding. This form of encoding uses ten bits to represent one byte. The extra two bits are used to keep the clock rate the same on the sending and receiving ends.

Table 1-4 SATA Versions

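The effect of 8b/10b encoding on usable throughput is easy to verify. The helper below (an illustrative function, not part of any SATA tooling) divides the line rate by ten for interfaces that use 8b/10b encoding, such as SATA and SAS, and by eight otherwise.

```python
def line_rate_to_mbps(gbps: float, uses_8b10b: bool = True) -> float:
    """Convert an interface line rate in Gbps to usable MBps.

    With 8b/10b encoding, 10 bits on the wire carry 1 byte of data,
    so divide by 10; otherwise divide by 8.
    """
    bits_per_byte_on_wire = 10 if uses_8b10b else 8
    return gbps * 1000 / bits_per_byte_on_wire

print(f"SATA 3 (6 Gbps): {line_rate_to_mbps(6):.0f} MBps")   # 600 MBps
print(f"SATA 2 (3 Gbps): {line_rate_to_mbps(3):.0f} MBps")   # 300 MBps
```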

What really differentiates SATA from previous interfaces is the host of new features added along the way. Some of the features include

•   Command queuing Command queuing optimizes the order in which operations are executed in a disk, based on the location of data. Without command queuing, operations are processed in the order they are received. Command queuing reorders the operations so that the data can be fetched with minimal actuator and platter movement.

•   Hot-pluggable/hot-swappable This option allows a disk to be plugged or unplugged from a system while the system is running. Systems without hot-plug interfaces will need to be restarted before disks can be added or removed.

•   Hard disk passwords SATA devices supporting this feature allow disks to be password protected by setting a user and a master password in the BIOS. User passwords are designed to be given to the primary user of a machine, while the master password can be retained by IT administrators. In this way, should an individual leave or forget their password, IT administrators can still unlock the disk. Once the disk is password-protected, each time the machine is started up, a prompt appears requiring the password to be entered. This password will be required even if the hard disk is moved to another machine. Hard disk passwords can also be synchronized with BIOS system passwords for ease of administration.

•   Host protected area (HPA) HPA reserves space on the hard disk so that the system can be reset to factory defaults, including the original operating system, applications, and configuration that shipped with the machine. HPA space is hidden from the operating system and most partitioning tools so that it is not accidentally overwritten.

SCSI

Small Computer System Interface (SCSI) was standardized in 1986 and was the predominant format for high-performance disks for almost 30 years. There have been many versions of SCSI along with different cable types and speeds, as shown in Table 1-5. Internal SCSI uses ribbon cables similar to PATA. Over the years, 50-, 68-, and 80-pin connectors have been used. The last few versions of SCSI shared much in common; despite the many variations, expect to find SCSI that uses a 16-bit bus supporting 16 devices and 68- or 80-pin connectors. SCSI’s maximum speed is 640 MBps, or 5,120 Mbps. You will typically find 68-pin cables connecting SCSI devices within a server, such as the connection between a SCSI controller card and the disk backplane, but the connection between disks and the backplane most often uses the hot-pluggable 80-pin connector known as Single Connector Attachment (SCA).

Table 1-5 SCSI Versions


Devices on a SCSI bus are given an ID number unique to them on the bus. This was originally performed using jumpers on the back of SCSI devices similar to the master slave configuration on PATA, but more recent disks allow for the SCSI ID to be configured automatically by software or the SCSI adapter BIOS. Up to 16 devices can exist on a single bus.

Each device on a SCSI bus can have multiple addressable storage units, and these units are called logical units (LUs). Each LU has a logical unit number (LUN). You will not find a new machine today equipped with SCSI, but the command set and architecture are present in many current technologies, including SAS, Fibre Channel, and iSCSI.

Fibre Channel

Fibre Channel (FC) picks up in enterprise storage where SCSI left off. It is a serial disk interface with transfer speeds of 2, 4, or 8 Gbps. The latest versions of FC disks use a 40-pin Enhanced SCA-2 connector that is SFF-8454 compliant, and they are hot-pluggable, so they can be added or removed while the system is powered on without interrupting operations. They utilize the SCSI command set and are commonly found in enterprise storage arrays. FC disks are expensive compared to other disk types.

SAS

Serial Attached SCSI (SAS) is a high-speed interface used in a broad range of storage systems. SAS interfaces are less expensive than FC, and they offer transfer speeds up to 1.2 GBps (12 Gbps). The SAS interface uses the same 8b/10b encoding that SATA uses. Thus, converting from bits per second to bytes per second requires dividing by ten instead of eight. SAS can take advantage of redundant paths to the storage to increase speed and fault tolerance. SAS expanders are used to connect up to 65,535 devices to a single channel. SAS cables can be a maximum of 33 feet long. The SAS connector is similar to the SATA connector except that it has a plastic bridge between the power and interface cables. This allows for SATA disks to be connected to a SAS port, but SAS disks cannot be connected to a SATA port because there is no place for the bridge piece to fit.

SAS version 1 (SAS1) was limited to transfer speeds of 300 MBps (3 Gbps), but SAS version 2 (SAS2) achieves transfer speeds of 600 MBps (6 Gbps), with the latest SAS3 disks released in late 2013 reaching speeds of 1.2 GBps (12 Gbps). SAS2 and SAS3 are backward compatible with SAS1, so SAS1 and SAS2 drives can coexist on a SAS2 or SAS3 controller or in a SAS2 or SAS3 storage array, and SAS2 drives can likewise coexist with SAS3 drives on SAS3 arrays or controllers. However, it is not wise to mix SAS1, SAS2, and SAS3 drives in the same RAID array because the slower drives will reduce the performance of the array.

PCI Versions

Peripheral Component Interconnect (PCI) is a standard used for computer expansion cards and slots. PCI replaced the older ISA standard. As a standard, it can be used by any manufacturer. This led to easy adoption of PCI on computer and server motherboards and the creation of numerous PCI devices, such as graphics cards, sound cards, modems, network interface cards (NICs), host bus adapters (HBAs), and drive controllers. The initial PCI standard set the stage for two other standards: PCI-X and PCIe. Figure 1-5 shows a PCI Ethernet card.



Figure 1-5 PCI Ethernet card

As you install HBAs, NICs, and other expansion cards mentioned in this book, consider which technology you will use to ensure that the card provides the required throughput. Modern servers will likely use PCIe, but you may run into some that use PCI-X. PCI is described here so that you can understand PCI-X better and the limitations of utilizing a PCI card in a PCI-X slot. It is important that you choose a technology that provides the performance necessary and does not impact the ability of other devices to function properly.

PCI

The original PCI specification was standardized in 1992. It defined a method to expand the capabilities of a computer system through the addition of cards that plugged into slots on a motherboard using a parallel 32-bit bus. Parallel busses have multiple paths, and bits travel down all the paths simultaneously. This is in contrast to serial technologies where a single path is used and data is transmitted sequentially. The number of bits transferred in a single parallel operation is called the data bus width. PCI’s 32-bit bus allows for 32 bits of data to be transferred at once.

PCI initially had a clock speed of 33 MHz in version 1.0, meaning the bus performs 33 million cycles, or operations, per second. Transfer rate is a measurement of how much data can be transferred in a second, and it is calculated by multiplying the clock speed by the data bus width. Each operation sends 32 bits of data over the parallel bus, so we multiply the clock speed by the data bus width to get the transfer rate in bits and then divide by 8 to turn the bits into bytes. Therefore, 33,000,000 × 32 = 1,056,000,000 bits per second, which divided by 8 produces 132,000,000 bytes, or 132MB, per second. PCI 1.0 thus has a transfer rate of 132 MBps.
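
The same arithmetic applies to any parallel bus, so it is handy to capture as a small helper. The function below is illustrative only; the transfers_per_cycle parameter anticipates the DDR and QDR variants of PCI-X discussed in the next section.

```python
def parallel_bus_transfer_rate_mbps(clock_mhz: float, bus_width_bits: int,
                                    transfers_per_cycle: int = 1) -> float:
    """Transfer rate in MBps: clock (MHz) x bus width (bits) / 8 bits per byte."""
    bits_per_second = clock_mhz * 1_000_000 * bus_width_bits * transfers_per_cycle
    return bits_per_second / 8 / 1_000_000

print(parallel_bus_transfer_rate_mbps(33, 32))    # PCI 1.0: 132 MBps
print(parallel_bus_transfer_rate_mbps(133, 64))   # PCI-X: ~1,064 MBps (1.06 GBps)
```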

PCI had several revisions. Version 3.0 increased the maximum clock speed to 66 MHz, which, combined with an optional 64-bit bus width, increased the transfer rate to 512 MBps, as shown in Table 1-6. PCI 3.0 was widely adopted, so most implementations you will see run at 33 or 66 MHz.

Table 1-6 PCI Versions


PCI is a shared bus, so only one device can communicate over the bus at a time. Devices must contend for time on the PCI bus. This also means that some cards can be choked for speed. Imagine a server with a SCSI controller and an Ethernet card on a PCI bus. If the server hosts files for the network, user requests will be received over the Ethernet adapter, and then the SCSI controller will need to obtain the data and send it to the motherboard to be sent to the Ethernet card. The transmission of the data over the Ethernet card and obtaining the data from disk cannot occur at the same time over PCI, so one would constantly be waiting for the other in order to serve the data to network users.

PCI-X

PCI-eXtended (PCI-X) was the immediate successor to PCI. PCI-X’s major differentiation from PCI is its 64-bit data bus width. It has a clock speed of 133 MHz, resulting in a 1.06-GBps transfer rate. Later versions of PCI-X utilize Double Data Rate (DDR) and Quad Data Rate (QDR) technologies to transfer data two or four times per clock cycle. This effectively doubles or quadruples the transfer rate. DDR sends data on both the rising and falling edges of a clock cycle, or hertz (Hz), whereas QDR sends data at four points within the cycle. These versions are depicted in Table 1-7.

Table 1-7 PCI-X Versions


PCI-X is fully backward compatible with PCI cards. PCI-X achieves this backward compatibility by taking a PCI slot and adding an extension bus to the rear of the slot. However, although PCI devices can connect to a PCI-X slot and have full PCI functionality, they will only utilize 32 bits of the available 64-bit PCI-X data bus width, so there is no advantage to plugging a PCI card into a PCI-X slot. This extension also means that a PCI-X slot requires more room on a motherboard than either a PCI or a PCIe slot. Figure 1-6 shows a PCI-X Ethernet card.



Figure 1-6 PCI-X Ethernet card

One unfortunate carryover from PCI is the shared bus. PCI-X still allows for only one device to communicate over the bus at a time. Furthermore, if a PCI expansion card is plugged into a PCI-X slot, all devices on the PCI-X bus will be reduced to PCI speeds in order to maintain compatibility. This can have a very serious impact on system performance.

I had a situation where one of my clients could not figure out why his HBAs did not come close to the manufacturer’s stated performance benchmarks, and he was getting very low throughput from his servers. He tried multiple HBAs, and each provided the same dismal performance. After talking through theories for an hour in his office, we went to lunch. On the way back we stopped by the data center, and the problem became immediately apparent once I saw the back of the server. He had installed PCI VGA graphics cards in the servers just in case the onboard graphics would fail. He had a bad experience once and was determined not to repeat it. However, this was causing the PCI-X cards to run at PCI speeds. Once the PCI cards were removed, performance increased dramatically. He still kept a few PCI cards tucked away in his toolbox, but he stopped putting them in servers.

PCIe

PCI Express (PCIe) is the current standard for expansion cards and slots. The older PCI and PCI-X devices do not function in the newer PCIe slots. Each PCIe slot is accessed independently, so it does not have the shared-bus bottleneck of being dragged down to the speed of the slowest device. Each slot can also combine up to 16 lanes, providing much better throughput than a single lane could provide. The slots are also very compact; you can fit a 16-lane slot into less space than a single PCI-X slot. Figure 1-7 shows a PCIe Ethernet card.



Figure 1-7 PCIe Ethernet card

Parallel busses can transmit more data in a single operation, but the timing required to ensure that the data over each path arrives at the same time demands greater precision and often results in lower clock speeds. Serial busses, on the other hand, send only one bit of data for each operation, but they do not need to break apart, synchronize, and reassemble data from multiple paths, so they can run at a much higher clock speed.

PCIe is a bus that is divided into lanes, each capable of supporting an equal bandwidth concurrently. This is a significant departure from the PCI and PCI-X architectures, which use a shared bus for all slots on a mainboard. Because each PCIe bus is independent, each card can perform at full speed, which starts off at 250 MBps in version 1 and increases to 500 and 984.6 MBps in versions 2 and 3. PCIe slots are sized to take advantage of one or more lanes, as described by their name. The x1 slot has one lane, whereas the x16 slot has 16 lanes and 16 times the bandwidth of the x1 slot. The slot bandwidth provided by each slot type and version is shown in Table 1-8. The architecture of PCIe is drastically different from PCI and PCI-X, so it is not backward compatible with older PCI devices.

Table 1-8 PCIe Bandwidth by Slot Type and Version

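Because every lane carries the same bandwidth, slot bandwidth is simply the per-lane rate multiplied by the lane count. The sketch below uses the per-lane figures quoted above for versions 1 through 3; it is a convenience calculation, not an official PCIe reference.

```python
# Approximate per-lane bandwidth in MBps for PCIe versions 1-3 (quoted above)
PER_LANE_MBPS = {1: 250.0, 2: 500.0, 3: 984.6}

def pcie_slot_bandwidth_mbps(version: int, lanes: int) -> float:
    """Bandwidth of a PCIe slot: per-lane rate multiplied by the lane count."""
    return PER_LANE_MBPS[version] * lanes

for lanes in (1, 4, 8, 16):
    print(f"PCIe 3.0 x{lanes}: {pcie_slot_bandwidth_mbps(3, lanes):,.1f} MBps")
```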

PCIe devices can be plugged into any PCIe slot, regardless of lane size. The slot limits the connected device to the maximum speed that its lanes provide, which means that a PCIe x4 card can be connected to a PCIe x1 slot with reduced device performance. Smaller PCIe cards can also be placed in slots that have many more lanes than the device needs and still function. If a PCIe x1 expansion card is needed and there are no x1 slots available, the card can be plugged into an x4, x8, or even an x16 slot, but it will still be limited to the maximum number of lanes supported by its connector. Figure 1-8 shows the different PCIe card types. PCIe x1 is in the upper left, x4 in the lower left, x8 in the upper right, and x16 in the lower right.



Figure 1-8 PCIe card connectors

Multiple Disks for Larger Storage and Fault Tolerance

Storage professionals are often called on to create groups of disks for data storage. Grouping disks together allows for the creation of a larger logical drive, and disks in the group can be used to protect against the failure of one or more disks. Most groupings are performed by creating a Redundant Array of Independent Disks (RAID). To create a RAID, multiple disks must be connected to a compatible RAID controller, and controllers may support one or more types of RAID known as RAID levels. There are two designations for capacity. Raw capacity is the total capacity of disks when not configured in a RAID. For example, five 100GB disks would have a raw capacity of 500GB. Usable capacity is the amount of storage available once the RAID has been configured.

RAID Levels

Several RAID specifications have been made. These levels define how multiple disks can be used together to provide increased storage space, increased reliability, increased speed, or some combination of the three. Although not covered here, RAID levels 2 to 4 were specified but never adopted in the industry, so you will not see them in the field. When a RAID is created, the collection of disks is referred to as a group. It is always best to use identical disks when creating a RAID group. However, if different disks are used, the capacity and speed will be limited by the smallest and slowest disk in the group. RAID 0, 1, 5, and 6 are basic RAID groups introduced in the following sections. RAID 10 and 0+1 are nested RAID groups because they are made up of multiple basic RAID groups. Nested RAID groups require more disks than basic RAID groups and are more commonly seen in networked or direct attached storage groups that contain many disks.

RAID 0

RAID 0 writes a portion of data to all disks in the group in a process known as striping. At least two identical disks are required to create a RAID 0 group, and these disks make up the stripe set. Figure 1-9 shows how a file would be written to a RAID 0 consisting of four disks. The file is broken into pieces and then written to multiple disks at the same time. This increases both read and write speeds because the work of reading or writing a file is evenly distributed among the disks in the group. RAID 0 usable capacity is the total capacity of all disks in the stripe set and thus the same as the raw capacity. The main drawback to RAID 0 is its lack of redundancy. If one disk fails, all the data in the group is lost.



Figure 1-9 Writing a file to RAID 0

RAID 1

RAID 1 writes the data being saved to two disks at the same time. If a single disk fails, no data is lost, since all the data is contained on the other disk or disks. RAID 1 is also known as mirroring. Figure 1-10 depicts how a file would be stored on a two-disk RAID 1. The usable capacity of the mirror is the same as the capacity of one disk in the group, but because data can be retrieved from both disks simultaneously, mirroring results in a slight boost to read performance. When a failed disk is replaced in a RAID 1 group, the mirror is rebuilt by copying data from the existing disk to the new disk. This rebuild time has a significant impact on group performance, even though it takes significantly less time to perform a rebuild on a RAID 1 than on a RAID 5. RAID 1 is good for high read situations since it can read data at twice the speed of a single disk. However, it is not best for high write situations because each write must take place on both disks, so writes occur at only the speed of a single disk.



Figure 1-10 Writing a file to RAID 1

RAID 5

RAID 5 stripes data across the disks in the group and also computes parity data that is spread across the disks. If one disk in the group is lost, the remaining disks can use the parity data to rebuild the data onto a new disk. RAID 5 requires at least three disks and, because of its speed, is popular. Figure 1-11 shows a file write to a RAID 5 array of four disks. Since parity data needs to be computed each time data is written, RAID 5 suffers from a write penalty. Additionally, the parity data takes up space in the group. The portion of space required for parity diminishes as the number of disks in the group increases, but it takes more time to compute parity for larger RAID 5 groups. The equivalent of one disk in the group is used for parity, which means the usable capacity of a RAID 5 group can be computed by subtracting one from the number of disks in the group and multiplying that by the capacity of a single disk. For example, if six 1TB disks are placed into a RAID 5 group, the usable capacity will be 5,000GB (5TB). We thus describe a RAID 5 group by listing the number of disks upon which the capacity is based and then the number of parity disks. Accordingly, a six-disk RAID 5 group would be described as a 5+1.

Images


Figure 1-11 Writing a file to RAID 5

A RAID 5 group goes into a degraded state if a disk within the group is lost. Once that disk is replaced, the group begins rebuilding the data onto the replacement disk. Group performance is significantly impacted while the group rebuilds, and the loss of any additional disk in the group before the rebuild completes will result in a complete loss of data. Additionally, the increased load on the remaining disks in the group makes them more susceptible to failure. Rebuild time increases with the capacity or number of disks in the group. Thus, a four-disk group takes longer to rebuild than a three-disk group, and a group of 3TB disks takes longer to rebuild than one made up of 1TB disks.

RAID 5 performs well in read-heavy situations because of its use of striping. It also has decent performance in write-heavy situations, although not as good as RAID 10 or RAID 1, because of the need for parity computations.

RAID 6

RAID 6 is similar to RAID 5, but it computes additional parity data, so up to two disks can fail in the group before data is lost. RAID 6 requires at least four disks and suffers from an even greater write penalty than RAID 5. The additional parity data of RAID 6 consumes the equivalent of two disks in the RAID group, so the usable capacity of the group is the number of disks in the group minus two, multiplied by the capacity of a single disk. Figure 1-12 shows a file write to a RAID 6 array consisting of four disks. Like RAID 5 groups, RAID 6 groups are described using their data and parity disks, so a RAID 6 group created from eight disks is described as a 6+2. A RAID 6 set made up of eight 500GB disks would have a usable capacity of 3,000GB (3TB). Although RAID 6 suffers from an even longer rebuild time than RAID 5, the group is not at risk of total data loss if another disk fails during the rebuild.

Images


Figure 1-12 Writing a file to RAID 6
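
The usable-capacity rules for the basic RAID levels can be summarized in a short Python sketch. This is an illustration under the assumption of identical disks; the function name and structure are mine, but the formulas follow the descriptions above.

```python
def usable_capacity_gb(level: int, num_disks: int, disk_gb: int) -> int:
    """Usable capacity for a basic RAID group built from identical disks."""
    if level == 0:                        # striping: usable equals raw capacity
        return num_disks * disk_gb
    if level == 1:                        # mirroring: capacity of a single disk
        return disk_gb
    if level == 5:                        # one disk's worth of parity
        return (num_disks - 1) * disk_gb
    if level == 6:                        # two disks' worth of parity
        return (num_disks - 2) * disk_gb
    raise ValueError("unsupported RAID level")

print(usable_capacity_gb(5, 6, 1000))   # six 1TB disks in RAID 5 (a 5+1) -> 5000 GB
print(usable_capacity_gb(6, 8, 500))    # eight 500GB disks in RAID 6 (a 6+2) -> 3000 GB
```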

RAID 10 (1+0)

RAID 10 is a stripe made up of mirrors. RAID 10 requires an even number of at least four disks to operate. Pairs of disks in the group are mirrored, and those mirrored pairs are striped. RAID 10 offers the closest performance to RAID 0 and extremely high reliability as well, since one disk from each mirrored pair can fail before data is lost. Figure 1-13 shows how a file would be written to a RAID 10 array of eight disks. The main drawback of RAID 10 is its usable capacity. Since RAID 10 consists of mirrored pairs, the usable capacity is half the raw capacity. A ten-disk RAID 10 group made up of 100GB disks would have a usable capacity of 500GB and a raw capacity of 1TB (1,000GB). This configuration would be described as a 5+5. RAID 10 rebuild times are comparable to those of RAID 1, and the performance impact of a failed disk is much shorter in duration than it would be on a RAID 5 or RAID 50. RAID 10 is best for read-heavy situations since it can read from the stripes and mirrors of all the disks in the set. It also performs well in write-heavy situations since it uses striping and requires no parity computation.

Images


Figure 1-13 Writing a file to RAID 10
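
As a rough illustration of the nesting, the following Python sketch (hypothetical disk names and sizes, not values mandated by any product) pairs disks into mirrors that are then striped, showing why a ten-disk RAID 10 of 100GB disks yields 500GB usable out of 1,000GB raw.

```python
def raid10_pairs(disks: list[str]) -> list[tuple[str, str]]:
    """Group disks into mirrored pairs; the pairs are then striped."""
    if len(disks) < 4 or len(disks) % 2:
        raise ValueError("RAID 10 needs an even number of at least four disks")
    return [(disks[i], disks[i + 1]) for i in range(0, len(disks), 2)]

disks = [f"disk{i}" for i in range(10)]   # ten hypothetical 100GB disks
pairs = raid10_pairs(disks)               # five mirrored pairs, described as 5+5
raw_gb = len(disks) * 100
usable_gb = raw_gb // 2                   # mirroring halves the raw capacity
print(len(pairs), raw_gb, usable_gb)      # -> 5 1000 500
```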

RAID 0+1

RAID 0+1 is a mirror of stripes—the opposite of RAID 10. RAID 0+1 does not have the same reliability as RAID 10 because losing one disk from each of the two stripe sets results in a total loss of data. RAID 0+1 can, however, sustain a multiple-disk failure if all the failures occur within one of the stripe sets. RAID 0+1 is less commonly implemented because most implementations favor RAID 10. RAID 0+1 has fast rebuild times similar to RAID 10. Figure 1-14 shows how a file would be written to a RAID 0+1 array of eight disks.

Images


Figure 1-14 Writing a file to RAID 0+1

RAID 50

RAID 50 is a stripe made up of multiple RAID 5 groups. A RAID 50 can be created from at least six drives. RAID 50 offers better performance and reliability than a single RAID 5 but not as much as a RAID 10. However, RAID 50 offers greater capacity than a RAID 10, so it is often used when performance requirements fall in between RAID 5 and RAID 10. One drive from each RAID 5 group in the RAID 50 can fail before loss of data occurs. The usable capacity for a RAID 50 depends on how many RAID 5 sets are used in the construction of the RAID 50.

Similar to a RAID 5, rebuild times increase with an increase in the capacity or number of drives in the group. However, the rebuild time for a loss of a single drive in a RAID 50 group would be much faster than a single drive loss in a RAID 5 group with a similar number of disks because the rebuild effort would be isolated to one RAID 5 within the RAID 50. Performance would still be degraded because the degraded RAID 5 is within the overall stripe set for the RAID 50 group, so each new data operation would need to access the degraded RAID 5.

RAID 51

RAID 51 is a mirror of two RAID 5 groups. It provides more reliability than RAID 5 and RAID 50. The usable capacity of the set is the capacity of half the drives minus one drive; if there were ten 100GB drives in a RAID 51, the usable capacity would be 400GB, because five drives make up each mirrored RAID 5 set and one drive's worth of capacity in each set is used for parity. The second RAID 5 set is an exact copy of the first, so only the first set's capacity, minus its parity drive, counts toward usable capacity. RAID 51 can sustain the loss of multiple drives, up to an entire RAID 5 set, as long as no more than one drive has failed in the other mirrored RAID 5 set.
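
A minimal sketch of the RAID 51 capacity calculation, assuming identical drives and two equal RAID 5 sets (the function name is mine, not a standard term):

```python
def raid51_usable_gb(num_drives: int, drive_gb: int) -> int:
    """Half the drives form one RAID 5 set; one drive's worth of it is parity."""
    per_side = num_drives // 2          # drives in each mirrored RAID 5 set
    return (per_side - 1) * drive_gb    # subtract one drive's worth for parity

print(raid51_usable_gb(10, 100))        # ten 100GB drives -> 400 GB usable
```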

RAID 51 rebuild times can be quite short because the data from the corresponding drive in the mirrored RAID 5 set can be used to rebuild the failed drive. However, some implementations of RAID 51 use multiple controllers to protect against RAID controller failure. In that case, RAID 51 rebuild times are comparable to RAID 5 because rebuilds take place within a single controller, so the data from the mirror set cannot be used to rebuild individual drives on another controller.

RAID 51 is often used with two RAID controllers. Both controllers host a RAID 5 array, and the arrays are mirrored. If an entire controller fails, the RAID 51 is still accessible because it can operate in failover mode on the other RAID 5 set and controller until the controller and/or drives are replaced in the other RAID 5 set.

JBOD

Just a Bunch of Disks (JBOD) is not a RAID set, but it is a way to group drives to achieve greater capacity. JBOD uses a concatenation method to group disks, and data is written to a disk until it is full; then data is written to the next disk in the group. JBOD does not offer higher availability because the loss of a single drive in the group will result in a loss of all data. JBOD, however, is simple to implement, and it results in a usable capacity that is the same as its raw capacity. Use JBOD when you need the maximum usable capacity and when the reliability of the data is not a concern.
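
The following Python sketch (illustrative only; the sizes are arbitrary example values) shows the concatenation behavior: a write fills one disk completely before spilling onto the next disk in the group.

```python
def concat_write_gb(write_gb: int, free_gb_per_disk: list[int]) -> list[int]:
    """Allocate a write across concatenated disks, filling each disk in turn."""
    allocation = []
    remaining = write_gb
    for free in free_gb_per_disk:
        used = min(free, remaining)     # fill this disk before moving to the next
        allocation.append(used)
        remaining -= used
        if remaining == 0:
            break
    if remaining:
        raise ValueError("not enough space in the JBOD group")
    return allocation

print(concat_write_gb(250, [100, 100, 100]))   # -> [100, 100, 50]
```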

Hardware, Host-Based, and Software Implementations

RAID can be configured on a hardware RAID controller, on a host-based disk controller with RAID firmware, or in software. A hardware RAID controller can exist on an expansion card or be built onto the motherboard. Hardware RAID controllers perform all RAID functions on the controller and do not place additional burden on the computer processor. Hardware RAID is best when high performance is required for disks.

Host-based implementations utilize a dedicated drive controller, which usually contains custom firmware to understand how to group drives into a RAID. However, the host-based implementation relies on the computer processor to perform RAID computations. Host-based RAID is a low-cost way to add RAID to existing hardware when excess processor cycles are available for RAID operations.

Software-based RAID is implemented at the operating system level. Software RAID is the easiest to implement because it can be implemented on any hardware. No hardware needs to be purchased for a software RAID, so it is also the least expensive option. However, it suffers from reduced performance because RAID operations must pass through the operating system before being processed. It also places a burden on the computer processor. Software RAID systems usually cannot be used for boot drives because the operating system must first be available before RAID operations can begin. Some operating systems allow for a RAID 1 mirror of the operating system but no other forms of RAID.

RAID operations include parity computations, rebuilds, disk health tracking, and swapping in hot spares. The operation of main concern, however, is parity computation. Parity is used when writing data to disks and during disk rebuild operations, and it can result in a great deal of processing for write-intensive applications.

RAID controllers must compute parity for each write to a RAID array that uses parity. As a reminder, RAID 5, RAID 6, RAID 50, and RAID 51 use parity. Parity is a value computed for a bit string in a stripe set that indicates whether the number of 1s in the string is even or odd. Strings with an even number of 1s have a parity of 0, and strings with an odd number of 1s have a parity of 1. For example, consider a five-disk RAID 5, which can also be referenced as a 4+1, with four drives for data and one for parity. If the bit string in the first stripe is 1, 0, 1, 0, the parity bit would be 0 because the number of 1s is even. If the bit string in the stripe were 0, 0, 0, 1, the parity bit would be 1 because the number of 1s is odd.
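
The parity rule can be expressed as a simple XOR across the bit string. The following Python sketch reproduces the two example stripes above; it is illustrative only, and the function name is mine.

```python
def parity_bit(bits: list[int]) -> int:
    """Return 0 for an even number of 1s in the stripe, 1 for an odd number."""
    parity = 0
    for bit in bits:
        parity ^= bit          # XOR tracks whether the count of 1s is even or odd
    return parity

print(parity_bit([1, 0, 1, 0]))   # even number of 1s -> parity 0
print(parity_bit([0, 0, 0, 1]))   # odd number of 1s  -> parity 1
```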

The RAID controller goes through a process called “read, modify, write.” It first reads the existing data bits and the current parity from the stripe. It then XORs the old data with the new data and XORs that result with the old parity to determine the new parity. Finally, it writes the new bits to the stripe and writes the new parity if it has changed.
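
A minimal sketch of the parity update in read, modify, write, assuming a single bit position and using XOR as described above (the variable names are mine):

```python
def new_parity(old_bit: int, new_bit: int, old_parity: int) -> int:
    """Parity after replacing old_bit with new_bit in the stripe."""
    return old_parity ^ old_bit ^ new_bit   # XOR removes the old bit, adds the new one

# Stripe 1, 0, 1, 0 has parity 0. Changing the second disk's 0 to a 1 makes the
# number of 1s odd, so the parity must change to 1.
print(new_parity(old_bit=0, new_bit=1, old_parity=0))   # -> 1
```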

The second function a RAID controller performs is a rebuild. If a drive in a parity RAID set fails, the parity bits can be used to rewrite the data to a new drive. In the previous example, the first bit string is 1, 0, 1, 0 with a parity of 0, so disk 0 has a 1, disk 1 has a 0, disk 2 has a 1, disk 3 has a 0, and disk 4 (the parity disk) has a 0 in its first location. If disk 1 fails, the parity on disk 4 is used to reconstruct the bit on disk 1. Disk 4 holds a 0, so the number of 1s in the data bits must be even. The remaining disks, 0, 2, and 3, hold the values 1, 1, and 0, which already contain an even number of 1s, so the missing bit on disk 1 must be a 0 to keep the stripe's parity at 0.
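
The rebuild step is the same XOR operation run in reverse: combining the surviving data bits with the parity bit recovers the missing bit. A minimal illustrative sketch follows, reproducing the example above.

```python
def rebuild_missing_bit(surviving_bits: list[int], parity: int) -> int:
    """Recover the bit that was stored on the failed disk."""
    recovered = parity
    for bit in surviving_bits:
        recovered ^= bit       # XOR out each surviving data bit
    return recovered

# Stripe 1, 0, 1, 0 with parity 0; disk 1 (which held the 0) fails.
print(rebuild_missing_bit([1, 1, 0], parity=0))   # -> 0, the lost bit
```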

Our example computed parity and then used parity to reconstruct the data for the first bit striped to the disks in this RAID 5 set. If you store 4MB of data on the RAID set, the data is striped across four disks, so each disk receives 1MB, which is 8 million bits, because each byte contains 8 bits and a megabyte (MB) contains 1 million bytes. This 4MB file therefore results in 8 million parity computations. As you can see, parity computations can consume a lot of processing cycles, and this is where a dedicated hardware solution pays off. Both the host-based and software solutions use the main central processing unit (CPU) to perform these computations. The CPU already has to process data for everything else running on the server, so parity work places an additional burden on the CPU and can slow down other operations.

How Hosts Interact with Disks

To be used, storage must be made available to hosts. Hosts store and organize data in a directory structure known as a file system, and file systems exist on logical volumes that can be made up of one or more physical volumes.

A physical volume is a hard disk; a portion of a hard disk, called a partition; or, if the storage is remote, a LUN. One or more physical volumes can be combined on the host into a volume group. The volume group can then be divided into manageable sections known as logical volumes. Thus, a physical disk may contain multiple logical volumes, and a logical volume may span multiple physical disks. Figure 1-15 depicts the relationship between these concepts.

Images


Figure 1-15 Physical disks, volume groups, and logical volumes
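
To illustrate the relationship, here is a hypothetical Python model (not a real volume-manager API; the class and method names are my own) of physical volumes pooled into a volume group and carved into logical volumes.

```python
from dataclasses import dataclass, field

@dataclass
class VolumeGroup:
    physical_gb: list[int]                    # capacity of each physical volume
    logical: dict[str, int] = field(default_factory=dict)

    def free_gb(self) -> int:
        return sum(self.physical_gb) - sum(self.logical.values())

    def create_logical_volume(self, name: str, size_gb: int) -> None:
        if size_gb > self.free_gb():
            raise ValueError("not enough free space in the volume group")
        self.logical[name] = size_gb          # the volume may span physical disks

vg = VolumeGroup(physical_gb=[500, 500, 1000])   # three physical volumes pooled
vg.create_logical_volume("data", 1200)           # larger than any single disk
vg.create_logical_volume("logs", 300)
print(vg.free_gb())                              # -> 500
```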

For a system to use a logical volume, it must be formatted and mounted. Mounting the volume establishes the location where it can be referenced by the system. In Unix and Linux, this location is typically a directory path. In Windows, mount points are usually given a drive letter, referenced as the letter followed by a colon, such as C:. Windows also allows drives to be given directory mount points similar to Unix and Linux systems.

The formatting process establishes a file system on the logical volume and sets a cluster size. The cluster size determines how large the chunks are that the file system divides the available disk space into. Larger cluster sizes result in fewer, larger chunks, which makes it faster for the system to read large data files into memory. However, if many small files are stored on a system with large cluster sizes, the unused space in each partially filled cluster is wasted, which decreases available space and performance. Understand the size of the files that will be stored on a system before configuring the cluster size. Additionally, the stripe size on RAID 0, RAID 5, and RAID 6 volumes is often a configurable option on the RAID controller. The stripe size is the size of each chunk that is placed on a disk when files are split into pieces. Select a stripe size that divides evenly into the cluster size you will use when formatting the logical drive. You can divide the cluster size by the number of disks in the stripe to find a suitable value, as shown in the sketch that follows. Stripe and cluster sizes are powers of 2, so a four-disk stripe for a 512KB cluster might use 64KB or 128KB stripes, since these numbers divide evenly into 512. This ensures that file reads and writes are distributed evenly across the stripe set.
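
A minimal sketch of that sizing rule, assuming the cluster size and disk count from the example above (the function name is mine):

```python
def stripe_unit_kb(cluster_kb: int, data_disks: int) -> int:
    """Largest power-of-2 stripe unit no bigger than cluster size / disk count."""
    candidate = cluster_kb // data_disks
    unit = 1
    while unit * 2 <= candidate:
        unit *= 2
    return unit

print(stripe_unit_kb(512, 4))   # 512KB clusters across four disks -> 128KB stripe units
                                # (64KB also divides evenly into 512KB)
```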

File Systems

File systems are used to organize data on a logical drive. File systems track the location of files on the disk and file metadata. Metadata is data about the file such as creation date, last modified date, author, and permissions. File systems are structured hierarchically, beginning with a root directory under which other directories can be created. File systems also collect information used to track faulty areas of the drive.

Some file systems are journaled. Journaled file systems write file metadata to a journal before creating or modifying a file on the file system. Once the file has been created or modified, metadata is also modified. The journal protects against corruption should the drive or system go offline while the file operation is in progress. The journal can be used to determine whether files on the file system are in an inconsistent state, and sometimes the files can be corrected by using the journal data. Journaling increases the amount of time required to write files to the file system, but many file systems today use it to protect data integrity.

A file system is created by formatting a logical drive. Formatting consumes some of the available space on the drive, so a formatted disk has a lower usable capacity than an unformatted one. However, formatting is an essential step in order to use the space on the drive. The file system types available depend on the operating system used to perform the formatting.

Whereas a controller identifies data based on CHS, logical volumes map blocks using logical block addressing (LBA). LBA is a way of referencing a location on a disk without the device needing knowledge of the physical disk geometry. The drive presents the system with the number of blocks it contains, while the system keeps track of the data stored in each block. The system requests data by block number, and the drive maps the block to a CHS address and retrieves the data. Blocks are numbered starting with cylinder 0, head 0, and sector 1 and continuing until all sectors on that cylinder and head are numbered. The numbering then continues with head 1, the bottom of the first platter, along cylinder 0. This progresses until the sectors for all heads in cylinder 0 are counted, at which point the process begins again for cylinder 1. The block count continues until all sectors for all heads and cylinders are accounted for.
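
The numbering order described above corresponds to the conventional CHS-to-LBA mapping. The following Python sketch is illustrative; the geometry values are arbitrary examples, not values from the text.

```python
def chs_to_lba(c: int, h: int, s: int, heads: int, sectors_per_track: int) -> int:
    """Sectors are numbered from 1; cylinders and heads are numbered from 0."""
    return (c * heads + h) * sectors_per_track + (s - 1)

def lba_to_chs(lba: int, heads: int, sectors_per_track: int) -> tuple[int, int, int]:
    c, remainder = divmod(lba, heads * sectors_per_track)
    h, s = divmod(remainder, sectors_per_track)
    return c, h, s + 1

# Block 0 is cylinder 0, head 0, sector 1; numbering walks through every sector
# of a head, then the next head, then the next cylinder.
print(chs_to_lba(0, 0, 1, heads=4, sectors_per_track=63))   # -> 0
print(lba_to_chs(63, heads=4, sectors_per_track=63))        # -> (0, 1, 1)
```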

Chapter Summary

A hard disk drive (HDD) is a component of storage systems, computers, and servers that is used to store data. HDDs consist of round flat surfaces known as platters that rotate on a spindle. The platters are organized into cylinders and sectors. Disk heads write data to the platters by aligning a substrate on the platters using an electromagnetic field. Solid-state drives (SSDs) store data by retaining an electrical current in a gate. Control gates allow current to be placed into floating gates. These floating gates can hold one or more binary numbers depending on the type of solid-state cell in use. MLC disks can hold more than one binary number per cell to achieve a greater capacity than SLC disks, which hold only one binary number per cell. However, MLC disks do not last as long as SLC disks, and they have lower performance. Both HDDs and SSDs can be purchased in different classes for certain use cases. Entry-level or consumer drives are the least expensive but offer the lowest performance and reliability, while enterprise drives are the most expensive with the highest performance and reliability. Midrange drives fall in between entry-level and enterprise drives. Entry-level drives are best for end-user machines and very light workloads. Midrange drives are best for servers or shared systems that have low I/O requirements, and enterprise drives are best for servers with high I/O requirements, such as servers that have continual use and frequent disk reads and writes.

Interfaces are used to connect disks to other components in a computer or storage system. The ATA interfaces PATA and SATA offer a low-cost interface for lower performance needs, while SCSI, FC, and SAS disks are used for high-performance storage. Disks may also be connected using expansion cards. The PCI bus and its successor, PCI-X, are both older standards that have been largely eclipsed by PCIe. However, you may need to work on these interfaces. PCIe is the only bus that has dedicated bandwidth to each slot, whereas PCI and PCI-X use a shared bus.

A single disk often does not provide enough speed and resiliency for business applications, so disks are grouped together into a RAID set. RAID 0 writes pieces of data to multiple disks to improve speed in a process known as striping. Read and write operations are divided among the disks in the set, but a loss of a single disk causes a loss of data on all disks in the stripe set. RAID 1 stores identical data on two disks in a process called mirroring. RAID 5 uses striping to achieve high-data speeds while using parity data to protect against the loss of a single disk in the set. RAID 6 operates similarly to RAID 5, but it calculates additional parity to protect against the loss of two disks in the set. RAID 10 and RAID 0+1 combine striping and mirroring to achieve higher performance and resiliency than RAID 5 or 6. RAID 10 and RAID 0+1 can lose up to half the disks in a set before data is lost.

RAID can be configured on a hardware RAID controller, on a host-based disk controller with RAID firmware, or in software. Hardware RAID controllers perform all RAID functions on the controller and do not place additional burden on the computer processor, but that requires dedicated RAID controller hardware. Host-based implementations utilize a dedicated drive controller, which usually contains custom firmware to understand how to group drives into a RAID. However, the host-based implementation relies on the computer processor to perform RAID computations. Software-based RAID is implemented at the operating system level. No hardware needs to be purchased for a software RAID, so it is also the least expensive option. However, it suffers from reduced performance because RAID operations must pass through the operating system before being processed and it uses the computer processor to perform RAID computations.

Hosts interface with storage by using logical volumes. A logical volume can be a portion of a single disk, or it can be multiple physical disks together. The host uses a file system to organize the data on the logical volume. Hosts format a logical volume to create the initial file system and make the volume ready for use. The logical volume is then mounted into a directory or as a drive letter.

Chapter Review Questions

1.   Which of the following components would not be found in a solid-state drive?

A.   Controller board

B.   Cache

C.   Spindle

D.   Interface

2.   You are configuring a server that will store large database backups once a day. Which interface would give you the greatest capacity at the lowest price?

A.   SAS

B.   FC

C.   SATA

D.   SCSI

3.   A head does which of the following?

A.   Controls the flow of data to and from a hard drive

B.   Aligns tiny pieces of metal using an electromagnetic field

C.   Serves as the starting point in a file system under which files are created

D.   Connects platters in a hard drive and spins them at a consistent rate

4.   Which of the following statements is true about RAID types?

A.   RAID 1 is known as striping, and it writes the same data to all disks.

B.   RAID 0 is known as mirroring, and it writes pieces of data across all disks in the set.

C.   RAID 5 uses mirroring and parity to achieve high performance and resiliency.

D.   RAID 1 is known as mirroring, and it writes the same data to all disks.

5.   You have created a RAID set from disks on a server, and you would like to make the disks available for use. Which order of steps would you perform?

A.   Create a logical volume, mount the logical volume, and format the logical volume.

B.   Format the logical volume, mount the logical volume, and create a directory structure.

C.   Create a directory structure, mount the logical volume, and format the logical volume.

D.   Create a logical volume from the RAID set, format the logical volume, and mount the logical volume.

6.   Which interface can transmit at 6 Gbps?

A.   SATA and SAS

B.   Fibre Channel

C.   SCSI

D.   PATA

7.   Which of the following is an advantage of IOPS versus other metrics such as seek time or rotational latency?

A.   IOPS identifies whether an individual component in a system is exhibiting acceptable performance.

B.   IOPS measures end-to-end transfer to provide a single metric for transmission.

C.   The IOPS metric is specified by the IEEE 802.24, so it is implemented consistently across all hardware and software that is IEEE 802.24 compliant.

D.   IOPS eliminates the need to run further benchmarking or performance tests on equipment.

8.   A customer needs 250GB of storage for a high-performance database server requiring 2,000 IOPS. They want to minimize the number of disks but still provide for redundancy if one disk fails. Which solution would best meet the customer’s requirements?

A.   RAID 1 with two 3,000 IOPS 128GB solid-state drives

B.   RAID 5 with three 3,000 IOPS 128GB solid-state drives

C.   RAID 5 with seven 350 IOPS 300GB SAS drives

D.   RAID 0 with two 3,000 IOPS 128GB solid-state drives

9.   You plug a PCI card into a PCI-X slot. What will the maximum transfer rate be?

A.   132 MBps

B.   264 MBps

C.   512 MBps

D.   1.06 GBps

10.   What is the primary advantage of a hardware RAID versus a software RAID?

A.   Hardware RAID costs less to implement.

B.   Hardware RAID can support more drives than software RAID.

C.   Hardware RAID has a dedicated processor for performing parity computations.

D.   Hardware RAID has its own I/O interrupt to communicate with the processor.

Chapter Review Answers

1.   C is correct. The spindle would not be found in an SSD because SSDs do not have moving parts. A, B, and D are incorrect because SSDs consist of flash memory cells, a controller board, a cache, and an interface.

2.   C is correct. Performance was not a concern in this question. The only requirement was for capacity and lowest cost, so SATA disks would be the best fit. A, B, and D are incorrect because SATA disks are less expensive per gigabyte than SAS, FC, or SCSI. Additionally, SATA disks come in larger capacities than SCSI disks.

3.   B is correct. The head is a small electromagnet that moves back and forth on an actuator arm across a platter. The head generates an electromagnetic field that aligns the substrate on the platter in the direction of the field. The field direction is determined based on the flow of current in the head. Current can flow in one of two directions in the head. A, C, and D are incorrect. Data flow is controlled by the drive controller, so A is incorrect. C is incorrect because the starting point in a file system is the root. D is incorrect because the spindle connects platters in a hard drive and spins them at a consistent rate.

4.   D is correct. RAID 1 is mirroring. The mirroring process writes identical data to both disks in the mirror set. A, B, and C are incorrect. A is incorrect because RAID 1 is not known as striping; RAID 0 is known as striping. B is incorrect because RAID 0 is not known as mirroring, although the rest of the statement is true of RAID 0. C is incorrect because RAID 5 uses striping and parity, not mirroring and parity. Mirroring is used in the RAID implementations that have a 1 in them, such as RAID 1, RAID 10, and RAID 0+1.

5.   D is correct. A logical volume must be created from the RAID set before the volume can be formatted. Mounting occurs last. The step of creating a directory structure is not required for the disk to be made available. A, B, and C are incorrect because they are either not in the right order or they do not contain the correct steps. It is not necessary to create a directory structure, as this is performed during the format process.

6.   A is correct. SATA version 3 and SAS both support 6-Gbps speeds. B, C, and D are incorrect. Fibre Channel supports speeds of 2, 4, or 8 Gbps. SCSI has a max speed of 5,120 Mbps, while PATA has a max speed of 133 MBps.

7.   B is correct. IOPS measures the transfer rate from the point at which the IOPS metric is initiated to the storage device. A, C, and D are incorrect. A is incorrect because IOPS does not identify an individual component. C is incorrect because the IOPS metric is not standardized by the IEEE. The 802.24 deals with smart grids and has nothing to do with IOPS. D is incorrect because other metrics may be needed to diagnose where performance issues lie or to document granular metrics on individual components.

8.   A is correct. Two SSD disks would achieve the required IOPS and still provide for redundancy if one disk fails. B, C, and D are incorrect. Options B and C use more disks than option A, so they are not ideal, and option D does not provide redundancy in the case of disk failure.

9.   B is correct. The PCI card in the PCI-X 1.0 slot will run at the maximum PCI speed. Because this is a PCI-X slot and has a maximum clock rate of 133 MHz, the highest PCI standard supported will be 2.1 running at 66 MHz. It will only be able to use 32 bits of the 64-bit data bus width, so we take 32 × 66 to get 2,112 and then divide by 8 to convert megabits to megabytes. Thus, we get 264 MBps. A, C, and D are incorrect. A is incorrect because the device plugged into the PCI-X slot would be able to perform at 66 MHz, above the initial 33 MHz specification for PCI that results in 132 MBps. C is incorrect because it assumes that later standards of PCI would be possible, but PCI-X is limited to 133 MHz, so it would not be capable of providing the 266 MHz required for PCI 2.3 speeds. D is incorrect because this is the base speed of PCI-X and would require a PCI-X card to achieve it.

10.   C is correct. The primary benefit of hardware RAID over software RAID is that it has a dedicated processor for performing parity computations. Parity computations can place a great burden on the CPU when applications are write intensive. A, B, and D are incorrect. A is incorrect because a hardware RAID is more expensive than a software RAID because it requires additional hardware. B is incorrect because both systems can support a large number of drives. D is incorrect because the RAID controller does not have a dedicated interrupt. It uses the interrupt of the bus it is attached to, such as PCI, PCI-X, or PCIe.
