Block Device Drivers

Typical block devices like hard disks have very high average access times. Each operation requires several milliseconds to complete, mainly because the hard disk controller must move the heads on the disk surface to reach the exact position where the data is recorded. However, when the heads are correctly placed, data transfer can be sustained at rates of tens of megabytes per second.

To achieve acceptable performance, hard disks and similar devices transfer several adjacent bytes at once. In the following discussion, we say that groups of bytes are adjacent when they are recorded on the disk surface in such a manner that a single seek operation can access them.

The organization of Linux block device handlers is quite involved. We won’t be able to discuss in detail all the functions that are included in the kernel to support the handlers. But we outline the general software architecture and introduce the main data structures. Kernel support for block device handlers includes the following features:

  • A uniform interface through the VFS

  • Efficient read-ahead of disk data

  • Disk caching for the data

Keeping Track of Block Device Drivers

When a block device file is being opened, the kernel must determine whether the device file is already open. In fact, if the file is already open, the kernel must not initialize the corresponding block device driver.

This problem is not as easy as it appears at first look. On the one hand, we stated in Section 13.2 that block device files that have the same major number are usually associated with the same block device driver. On the other hand, each block device driver that handles more than one minor number can be regarded as several specialized block device drivers, so this case doesn’t create problems. In the rest of this section, when we use the term “block device driver,” we mean the kernel layer that handles I/O data transfers from/to a hardware device specified by both a major number and a minor number.

A real complication, however, is that block device files that have the same major and minor numbers but different pathnames are regarded by the VFS as different files, even though they refer to the same block device driver. Therefore, the kernel cannot determine whether a block device driver is already in use simply by checking the inode cache for an object of the block device file.

To keep track of which block device drivers are currently in use, the kernel uses a hash table indexed by the major and minor numbers. Every time a block device driver is used, the kernel checks whether the block device driver identified by those major and minor numbers is already stored in the hash table. If so, the block device driver is already in use. Because the hash function works on the major and minor numbers of the block device file, it doesn’t matter whether the block device driver was previously activated by accessing a given block device file or another one that has the same major and minor numbers. Conversely, if no block device driver associated with the given major and minor numbers is found, the kernel inserts a new element into the hash table.

The hash table array is stored in the bdev_hashtable variable; it includes 64 lists of block device descriptors. Each descriptor is a block_device data structure whose fields are shown in Table 13-4.

Table 13-4. The fields of the block device descriptor

Type                               Field        Description
struct list_head                   bd_hash      Pointers for the hash table list
atomic_t                           bd_count     Usage counter for the block device descriptor
struct inode *                     bd_inode     Pointer to the main inode object of the block device driver
dev_t                              bd_dev       Major and minor numbers of the block device
int                                bd_openers   Counter of how many times the block device driver has been opened
struct block_device_operations *   bd_op        Pointer to the block device driver operation table
struct semaphore                   bd_sem       Semaphore protecting the block device driver
struct list_head                   bd_inodes    List of inodes of opened block device files for this driver

The bd_inodes field of the block device descriptor stores the head (the first dummy element) of a doubly linked circular list of inodes relative to opened block device files that refer to the block device driver. The i_devices field of the inode object stores the pointers for the previous and next element in this list.

Each block device descriptor stores in the bd_inode field the address of a special block device inode object for the driver. This inode doesn’t correspond to a disk file; rather, it belongs to the bdev special filesystem (see Section 12.3.1). Essentially, the block device inode stores the “master copy” of the information shared by all inode objects of the block device files that refer to the same block device.

Initializing a Block Device Driver

Let’s now describe how a block device driver is initialized. We already described in Section 13.2.3 how the kernel customizes the methods of the file object when a block device file is opened: its f_op field is set to the address of the def_blk_fops table, whose contents are shown in Table 13-5. The dentry_open( ) function checks whether the open method is defined; this is always true for a block device file, so the blkdev_open( ) function is executed.

Table 13-5. The default file operation methods for block device files

Method    Function for block device file
open      blkdev_open( )
release   blkdev_close( )
llseek    block_llseek( )
read      generic_file_read( )
write     generic_file_write( )
mmap      generic_file_mmap( )
fsync     block_fsync( )
ioctl     blkdev_ioctl( )

This function checks whether the block device driver is already in use:

bd_acquire(inode);
do_open(inode->i_bdev, filp);

The bd_acquire( ) function essentially executes the following operations:

  1. Checks whether the block device file corresponding to the inode object is already open (in this case, inode->i_bdev field points to the block device descriptor). If the file is already open, increments the usage counter of the block device descriptor (inode->i_bdev->bd_count) and returns.

  2. Looks up the block device driver in the hash table using the major and minor numbers stored in inode->i_rdev. If the descriptor is not found because the driver is not in use, allocates a new block_device and a new inode object for the block device, and then inserts the new descriptor into the hash table.

  3. Stores the address of the block device driver descriptor in inode->i_bdev.

  4. Adds inode to the list of inodes of the driver descriptor.

Next, blkdev_open( ) invokes do_open( ), which executes the following main steps:

  1. If the bd_op field of the block device driver descriptor is NULL, initializes it from the blkdevs table’s element corresponding to the major number of the block device file

  2. Invokes the open method of the block device driver descriptor (bd_op->open) if it is defined

  3. Increments the bd_openers counter of the block device driver descriptor

  4. Sets the i_size and i_blkbits fields of the block device inode object (bd_inode)

The open method of the block device driver descriptor can further customize the methods of the block device driver, allocate resources, and take other measures based on the minor number of the block device file.

Among other things, the device driver initialization function must determine the size of the physical block device corresponding to the device file. This length, represented in 1,024-byte units, is stored in the blk_size global array indexed by both the major and minor number of the device file.

Sectors, Blocks, and Buffers

Each data transfer operation for a block device acts on a group of adjacent bytes called a sector. In most disk devices, the size of a sector is 512 bytes, although there are devices that use larger sectors (1,024 and 2,048 bytes). Notice that the sector should be considered the basic unit of data transfer; it is never possible to transfer less than a sector, although most disk devices are capable of transferring several adjacent sectors at once.

The kernel stores the sector size of each hardware block device in a table named hardsect_size. Each element in the table is indexed by the major number and the minor number of the corresponding block device file. Thus, hardsect_size[3][2] represents the sector size of /dev/hda2, which is the second primary partition of the first IDE disk (see Table 13-2). If hardsect_size[maj] is NULL, all block devices sharing the major number maj have a standard sector size of 512 bytes.

Block device drivers transfer a large number of adjacent bytes called a block in a single operation. A block should not be confused with a sector. The sector is the basic unit of data transfer for the hardware device, while the block is simply a group of adjacent bytes involved in an I/O operation requested by a device driver.

In Linux, the block size must be a power of 2 and cannot be larger than a page frame. Moreover, it must be a multiple of the sector size, since each block must include an integral number of sectors. Therefore, on PC architecture, the permitted block sizes are 512, 1,024, 2,048, and 4,096 bytes. The same block device driver may operate with several block sizes, since it has to handle a set of device files sharing the same major number, while each block device file has its own predefined block size. For instance, a block device driver could handle a hard disk with two partitions containing an Ext2 filesystem and a swap area (see Chapter 16 and Chapter 17). In this case, the device driver uses two different block sizes: 1,024 bytes for the Ext2 partition and 4,096 bytes for the swap partition.

The kernel stores the block size in a table named blksize_size; each element in the table is indexed by the major number and the minor number of the corresponding block device file. If blksize_size[maj] is NULL, all block devices sharing the major number maj have a standard block size of 1,024 bytes. (You should not confuse blksize_size, which stores the block sizes, with the blk_size array introduced earlier, which stores the sizes of the block devices themselves.)

Each block requires its own buffer, which is a memory area used by the kernel to store the block’s content. When a device driver reads a block from disk, it fills the corresponding buffer with the values obtained from the hardware device; similarly, when a device driver writes a block on disk, it updates the corresponding group of adjacent bytes on the hardware device with the actual values of the associated buffer. The size of a buffer always matches the size of the corresponding block.

Buffer Heads

The buffer head is a descriptor of type buffer_head associated with each buffer. It contains all the information needed by the kernel to know how to handle the buffer; thus, before operating on each buffer, the kernel checks its buffer head.

The buffer head fields are listed in Table 13-6. The b_data field of each buffer head stores the starting address of the corresponding buffer. Since a page frame may store several buffers, the b_this_page field points to the buffer head of the next buffer in the page. This field facilitates the storage and retrieval of entire page frames (see Section 13.4.8.2 later in this chapter). The b_blocknr field stores the logical block number (i.e., the index of the block inside the disk partition).

Table 13-6. The fields of a buffer head

Type                    Field             Description
struct buffer_head *    b_next            Next item in collision hash list
unsigned long           b_blocknr         Logical block number
unsigned short          b_size            Block size
unsigned short          b_list            LRU list including the buffer head
kdev_t                  b_dev             Virtual device identifier
atomic_t                b_count           Block usage counter
kdev_t                  b_rdev            Real device identifier
unsigned long           b_state           Buffer status flags
unsigned long           b_flushtime       Flushing time for buffer
struct buffer_head *    b_next_free       Next item in list of buffer heads
struct buffer_head *    b_prev_free       Previous item in list of buffer heads
struct buffer_head *    b_this_page       Per-page buffer list
struct buffer_head *    b_reqnext         Next item in the request queue
struct buffer_head **   b_pprev           Previous item in collision hash list
char *                  b_data            Pointer to buffer
struct page *           b_page            Pointer to the descriptor of the page that stores the buffer
void (*)( )             b_end_io          I/O completion method
void *                  b_private         Specialized device driver data
unsigned long           b_rsector         Block number on real device
wait_queue_head_t       b_wait            Buffer wait queue
struct inode *          b_inode           Pointer to inode object to which the buffer belongs
struct list_head        b_inode_buffers   Pointers for list of inode buffers

The b_state field stores the following flags:

BH_Uptodate

Set if the buffer contains valid data. The value of this flag is returned by the buffer_uptodate( ) macro.

BH_Dirty

Set if the buffer is dirty—that is, if it contains data that must be written to the block device. The value of this flag is returned by the buffer_dirty( ) macro.

BH_Lock

Set if the buffer is locked, which happens if the buffer is involved in a disk transfer. The value of this flag is returned by the buffer_locked( ) macro.

BH_Req

Set if the corresponding block is requested (see the next section) and has valid (up-to-date) data. The value of this flag is returned by the buffer_req( ) macro.

BH_Mapped

Set if the buffer is mapped to disk—that is, if the b_dev and b_blocknr fields of the corresponding buffer head are significant. The value of this flag is returned by the buffer_mapped( ) macro.

BH_New

Set if the corresponding file block has just been allocated and has never been accessed. The value of this flag is returned by the buffer_new( ) macro.

BH_Async

Set if the buffer is being processed by end_buffer_io_async( ) (described later in Section 13.4.8.2). The value of this flag is returned by the buffer_async( ) macro.

BH_Wait_IO

Used to delay flushing the buffer to disk when reclaiming memory (see Chapter 16).

BH_launder

Set when the buffer is being flushed to disk when reclaiming memory (see Chapter 16).

BH_JBD

Set if the buffer is used by a journaling filesystem (see Chapter 17).

The b_dev field identifies the virtual device containing the block stored in the buffer, while the b_rdev field identifies the real device. This distinction, which is meaningless for simple hard disks, has been introduced to model Redundant Array of Independent Disks (RAID) storage units consisting of several disks operating in parallel. For reasons of safety and efficiency, files stored in a RAID array are scattered across several disks that the applications think of as a single logical disk. Besides the b_blocknr field representing the logical block number, it is necessary to specify the specific disk unit in the b_rdev field and the corresponding sector number in the b_rsector field.

An Overview of Block Device Driver Architecture

Although block device drivers are able to transfer a single block at a time, the kernel does not perform an individual I/O operation for each block to be accessed on disk; this would lead to poor disk performance, since locating the physical position of a block on the disk surface is quite time-consuming. Instead, the kernel tries, whenever possible, to cluster several blocks and handle them as a whole, thus reducing the average number of head movements.

When a process, the VFS layer, or any other kernel component wishes to read or write a disk block, it actually creates a block device request. That request essentially describes the requested block and the kind of operation to be performed on it (read or write). However, the kernel does not satisfy a request as soon as it is created—the I/O operation is just scheduled and will be performed at a later time. This artificial delay is paradoxically the crucial mechanism for boosting the performance of block devices. When a new block data transfer is requested, the kernel checks whether it can be satisfied by slightly enlarging a previous request that is still waiting (i.e., whether the new request can be satisfied without further seek operations). Since disks tend to be accessed sequentially, this simple mechanism is very effective.

Deferring requests complicates block device handling. For instance, suppose a process opens a regular file and, consequently, a filesystem driver wants to read the corresponding inode from disk. The block device driver puts the request on a queue and the process is suspended until the block storing the inode is transferred. However, the block device driver itself cannot be blocked because any other process trying to access the same disk would be blocked as well.

To keep the block device driver from being suspended, each I/O operation is processed asynchronously. Thus, no kernel control path is forced to wait until a data transfer completes. In particular, block device drivers are interrupt-driven (see Section 13.4.5.2 earlier in this chapter): a high-level driver creates a new block device request or enlarges an already existing block device request and then terminates. A low-level driver, which is activated at a later time, invokes a so-called strategy routine, which takes the request from a queue and satisfies it by issuing suitable commands to the disk controller. When the I/O operation terminates, the disk controller raises an interrupt and the corresponding handler invokes the strategy routine again, if necessary, to process another request in the queue.

Each block device driver maintains its own request queues; there should be one request queue for each physical block device, so that the requests can be ordered in such a way as to increase disk performance. The strategy routine can thus sequentially scan the queue and service all requests with the minimum number of head movements.

Request descriptors

Each block device request is represented by a request descriptor, which is stored in the request data structure illustrated in Table 13-7. The direction of the data transfer is stored in the cmd field; it is either READ (from block device to RAM) or WRITE (from RAM to block device). The rq_status field is used to specify the status of the request; for most block devices, it is simply set either to RQ_INACTIVE (for a request descriptor not in use) or to RQ_ACTIVE (for a valid request that is to be serviced or is already being serviced by the low-level driver).

Table 13-7. The fields of a request descriptor

Type                   Field                Description
struct list_head       queue                Pointers for request queue list
int                    elevator_sequence    The “age” of the request for the elevator algorithm
volatile int           rq_status            Request status
kdev_t                 rq_dev               Device identifier
int                    cmd                  Requested operation
int                    errors               Success or failure code
unsigned long          sector               First sector number on the (virtual) block device
unsigned long          nr_sectors           Number of sectors of the request on the (virtual) block device
unsigned long          hard_sector          First sector number on the (real) block device
unsigned long          hard_nr_sectors      Number of sectors of the request on the (real) block device
unsigned int           nr_segments          Number of segments in the request on the (virtual) block device
unsigned int           nr_hw_segments       Number of segments in the request on the (real) block device
unsigned long          current_nr_sectors   Number of sectors in the block currently transferred
void *                 special              Used only by drivers of SCSI devices
char *                 buffer               Memory area for I/O transfer
struct completion *    waiting              Wait queue associated with request
struct buffer_head *   bh                   First buffer descriptor of the request
struct buffer_head *   bhtail               Last buffer descriptor of the request
request_queue_t *      q                    Pointer to request queue descriptor

The request may encompass many adjacent blocks on the same device. The rq_dev field identifies the block device, while the sector field specifies the number of the first sector of the first block in the request. The nr_sectors field specifies the number of sectors in the request yet to be transferred. The current_nr_sectors field stores the number of sectors in the first block of the request. As we’ll see later in Section 13.4.7, the sector, nr_sectors, and current_nr_sectors fields can be dynamically updated while the request is being serviced.

The nr_segments field stores the number of segments in the request. Although all blocks in the request must be adjacent on the block device, their corresponding buffers are not necessarily contiguous in RAM. A segment is a sequence of adjacent blocks in the request whose corresponding buffers are also contiguous in memory. Of course, a low-level device driver could program the DMA controller so as to transfer all blocks in the same segment in a single operation.

The hard_sector, hard_nr_sectors, and nr_hw_segments fields usually have the same value as the sector, nr_sectors, and nr_segments fields, respectively. They differ, however, when the request refers to a driver that handles several physical block devices at once. A typical example of such a driver is the Logical Volume Manager (LVM), which is able to handle several disks and disk partitions as a single virtual disk partition. In this case, the two series of fields differ because the former refers to the real physical block device, while the latter refers to the virtual device. Another example is software RAID, a driver that duplicates data on several disks to enhance reliability.

All buffer heads of the blocks in the request are collected in a singly linked list. The b_reqnext field of each buffer head points to the next element in the list, while the bh and bhtail fields of the request descriptor point, respectively, to the first and last elements in the list.

The buffer field of the request descriptor points to the memory area used for the actual data transfer. If the request involves a single block, buffer is just a copy of the b_data field of the buffer head. However, if the request encompasses several blocks whose buffers are not consecutive in memory, the buffers are linked through the b_reqnext fields of their buffer heads, as shown in Figure 13-3. On a read, the low-level driver could choose to allocate a large memory area referred to by buffer, read all sectors of the request at once, and then copy the data into the various buffers. Similarly, for a write, the low-level device driver could copy the data from many nonconsecutive buffers into a single memory area referred to by buffer and then perform the whole data transfer at once.

Figure 13-3. A request descriptor and its buffers and sectors

Figure 13-3 illustrates a request descriptor encompassing three blocks. The buffers of two of them are consecutive in RAM, while the third buffer is by itself. The corresponding buffer heads identify the logical blocks on the block device; the blocks must necessarily be adjacent. Each logical block includes two sectors. The sector field of the request descriptor points to the first sector of the first block on disk, and the b_reqnext field of each buffer head points to the next buffer head.

During the initialization phase, each block device driver usually allocates a fixed number of request descriptors to handle its forthcoming I/O requests. The blk_init_queue( ) function sets up two equally sized lists of free request descriptors: one for READ operations and another for WRITE operations. The size of these lists is set to 64 if the RAM size is greater than 32 MB, or to 32 if the RAM size is less than or equal to 32 MB. The status of all request descriptors is initially set to RQ_INACTIVE.

The fixed number of request descriptors may become a bottleneck under very heavy loads and high disk activity. A dearth of free descriptors may force processes to wait until an ongoing data transfer terminates. Thus, a wait queue is used to queue processes waiting for a free request element. The get_request_wait( ) function tries to get a free request descriptor and puts the current process to sleep in the wait queue if none is found; the get_request( ) function is similar but simply returns NULL if no free request descriptor is available.

A threshold value known as batch_requests (set to 32 or to 16, depending on the RAM size) is used to cut down kernel overhead; when releasing a request descriptor, processes waiting for free request descriptors are not woken up unless there are at least batch_requests free descriptors. Conversely, when looking for a free request descriptor, get_request_wait( ) relinquishes the CPU if there are fewer than batch_requests free descriptors.

Request queue descriptors

Request queues are represented by means of request queue descriptors; each of them is a request_queue_t data structure whose fields are listed in Table 13-8.

Table 13-8. The fields of a request queue descriptor

Type                    Field               Description
struct request_list []  rq                  READ and WRITE free lists of requests
struct list_head        queue_head          List of pending requests
elevator_t              elevator            Methods of the elevator algorithm
request_fn_proc *       request_fn          Strategy routine of the driver
merge_request_fn *      back_merge_fn       Method to append a block to the request
merge_request_fn *      front_merge_fn      Method to insert a block in front of the request
merge_requests_fn *     merge_requests_fn   Method to attempt merging an enlarged request with the adjacent ones
make_request_fn *       make_request_fn     Method that passes a request to a driver (usually it inserts the request in the proper queue)
plug_device_fn *        plug_device_fn      Method to plug the driver
void *                  queuedata           Private data of the device driver
struct tq_struct        plug_tq             Task queue element for the plugging mechanism
char                    plugged             Flag denoting whether the driver is currently plugged
char                    head_active         Flag denoting whether the first request in queue is active when the driver is unplugged
spinlock_t              queue_lock          Request queue lock
wait_queue_head_t       wait_for_request    Wait queue for lack of request descriptors

When the kernel initializes a device driver, it creates and fills a request queue descriptor for each request queue handled by the driver.

Essentially, a request queue is a doubly linked list whose elements are request descriptors (that is, request data structures). The queue_head field of the request queue descriptor stores the head (the first dummy element) of the list, while the pointers in the queue field of the request descriptor link any request to the previous and next elements in the list. The ordering of the elements in the queue list is specific to each block device driver; the Linux kernel offers, however, two predefined ways of ordering elements, which are discussed later in Section 13.4.6.2.

Block device low-level driver descriptor

Each block device driver may define one or more request queues. To keep track of the request queues of each driver, a low-level driver descriptor is used. The descriptor is a data structure of type blk_dev_struct, whose fields are listed in Table 13-9. The descriptors for all the block devices are stored in the blk_dev table, which is indexed by the major number of the block device.

Table 13-9. The fields of a block device driver descriptor

Type              Field           Description
request_queue_t   request_queue   Common request queue (for drivers that do not define per-device queues)
queue_proc *      queue           Method returning the address of a per-device queue
void *            data            Data (e.g., minor number) used by queue

If the block device driver has a unique request queue for all physical block devices, its address is stored in the request_queue field. Conversely, if the block device driver maintains several queues, the queue field points to a custom driver method that receives the identifier of the block device file, selects one of the queues according to the value of the data field, then returns the address of the proper request queue.

The ll_rw_block( ) Function

The ll_rw_block( ) function creates a block device request. It is invoked from several places in the kernel to trigger the I/O data transfer of one or more blocks. The function receives the following parameters:

  • The type of operation, rw, whose value can be READ, WRITE, or READA. The last operation type differs from the first two in that the function does not block when a request descriptor is not available.

  • The number, nr, of blocks to be transferred.

  • A bhs array of nr pointers to buffer heads describing the blocks (all of them must have the same block size and must refer to the same block device).

The buffer heads were previously initialized, so each specifies the block number, the block size, and the virtual device identifier (see Section 13.4.4 earlier in this chapter).

The function performs the following actions:

  1. Checks that the block size b_size matches the block size of the virtual device b_dev for each buffer head in the bhs array.

  2. If the operation is WRITE, checks that the block device is not read-only.

  3. For each buffer head in the bhs array, performs the following steps:

    1. Sets the BH_Lock flag of the buffer head. If it is already set by some other kernel thread, it skips that buffer.

    2. Increments the b_count field of the buffer head.

    3. Sets the b_end_io field of the buffer head to end_buffer_io_sync( ); that is, to the function that updates the buffer head when the data transfer is completed (see Section 13.4.7 later in this chapter).

    4. If the block must be written, tests the BH_Dirty flag of the buffer head in one of the following ways:

      • If BH_Dirty is reset, executes the b_end_io method (the end_buffer_io_sync( ) function) and continues with the next buffer because there is no need to write this block.

      • If BH_Dirty is set, resets it and places the buffer head in the list of locked buffer heads.

      As a general rule, the caller of ll_rw_block( ) must set the BH_Dirty flag for each block that is going to be written. Therefore, if ll_rw_block( ) finds that the flag is clear, then the block is already involved in a write operation, so nothing has to be done.

    5. If the block must be read, tests the BH_Uptodate flag of the buffer head. If it is set, executes the b_end_io method (the end_buffer_io_sync( ) function) and continues with the next buffer. The kernel never rereads a block from disk when its buffer contains valid (up-to-date) data.

    6. Invokes the submit_bh( ) function, which:

      1. Determines the number of the first block’s sector on the disk—that is, the value of the b_rsector field—from the fields b_blocknr (the logical block number) and b_size (the block size). This field could be later modified by the block device driver if it handles the Logical Volume Manager (LVM) or a RAID disk.

      2. Sets the BH_Req flag in b_state to denote that the block has been requested.

      3. Initializes the b_rdev field from the b_dev field. As before, this field could be modified later by the block device driver if it handles the LVM or a RAID disk.

      4. Invokes generic_make_request( ).

The generic_make_request( ) function posts the request to the low-level driver. It receives as parameters the buffer head bh and the type of operation rw (READ, WRITE, or READA), and performs the following operations:

  1. Checks that bh->b_rsector does not exceed the number of sectors of the block device. If it does, prints a kernel error message, invokes the b_end_io method of the buffer head, and terminates.

  2. Extracts the major number maj of the block device driver from bh->b_rdev.

  3. Gets the descriptor of the device driver request queue from the low-level driver descriptor blk_dev[maj]. To do this, it invokes the blk_dev[maj].queue method, if it is defined (the driver makes use of several queues); otherwise, it reads the blk_dev[maj].request_queue field (the driver uses a single queue).

  4. Invokes the make_request_fn method of the request queue descriptor identified in the previous step.

In most cases, block device drivers implement the make_request_fn method with the __make_request( ) function. It receives as parameters the queue descriptor, the buffer head bh, and the type of operation rw, and performs the following operations:

  1. Checks whether the type of the operation rw is READA; in this case, it sets the local flag rw_ahead to 1 and sets rw to READ.

  2. Invokes the create_bounce( ) function, which looks at the PG_highmem flag in bh->b_page->flags and determines whether the bh->b_data buffer is stored in high memory or not (see Section 7.1.6). If the buffer is in high memory, the low-level driver might not be able to handle it. Therefore, create_bounce( ) temporarily allocates a new buffer in low memory and a new buffer head pointing to it. The new buffer head is almost identical to bh, except for the b_data field, which points to the new buffer, the b_private field, which points to the original buffer head bh, and the b_end_io method, which points to a custom method that releases the low-memory buffer when the I/O operation terminates. If rw is WRITE, then the low-memory buffer is filled with the high memory buffer contents by create_bounce( ); otherwise, if rw is READ, the low memory buffer is copied into the high-memory buffer by the b_end_io custom method.

  3. Checks whether the request queue is empty:

    • If the request queue is empty, inserts a new request descriptor in it and schedules activation of the strategy routine of the low-level driver at a later time.

    • If the request queue is not empty, inserts a new request descriptor in it, trying to cluster it with other requests that are already queued. As we’ll see shortly, there is no need to schedule the activation of the strategy routine.

    Let’s look closer at these two cases.
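In outline, the decision just described can be simulated in user space; all the types and names below are invented stand-ins for illustration, not kernel code:

```c
#include <stdbool.h>

/* Invented stand-in for the request queue state __make_request()
 * inspects. */
struct sketch_queue {
    int nr_requests;   /* number of queued request descriptors */
    bool plugged;      /* device plugging state */
};

/* Returns true when the strategy routine must be scheduled later,
 * i.e., when the new request lands in an empty queue. */
static bool enqueue_block(struct sketch_queue *q)
{
    if (q->nr_requests == 0) {
        /* empty queue: plug the device and schedule unplugging */
        q->plugged = true;
        q->nr_requests++;
        return true;
    }
    /* non-empty queue: try clustering; the device was plugged when
     * the first request was inserted, so nothing to schedule */
    q->nr_requests++;
    return false;
}
```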

Scheduling the activation of the strategy routine

As we saw earlier, it’s expedient to delay activation of the strategy routine in order to increase the chances of clustering requests for adjacent blocks. The delay is accomplished through a technique known as device plugging and unplugging. As long as a block device driver is plugged, its strategy routine is not activated even if there are requests to be processed in the driver’s queues.

If the real device’s request queue is empty and the device isn’t already plugged, __make_request( ) carries out a device plugging. The plug_device_fn method that performs this task is usually implemented by means of the generic_plug_device( ) function. This function sets the plugged field of the request queue descriptor to 1 and inserts the plug_tq task queue element (statically included in the request queue descriptor) in the tq_disk task queue (see Section 4.7.3.1) to cause the device’s strategy routine to be activated later.

The __make_request( ) function then allocates a new request descriptor by invoking get_request( ). If no request descriptor is available, the function checks the value of the rw_ahead flag. If it is set, the function is handling a relatively unimportant read-ahead operation, so it invokes the b_end_io method and terminates without performing the I/O data transfer. Otherwise, the function invokes the get_request_wait( ) function to force the process to sleep until a request descriptor is freed.

Next, __make_request( ) initializes the new request descriptor with the information read from the buffer head, inserts it into the proper real device’s request queue, and terminates.

How is the actual I/O data transfer started? The kernel checks periodically whether the tq_disk task queue contains any elements. This occurs in a kernel thread such as kswapd, or when the kernel must wait for some resource related to block device drivers, such as buffers or request descriptors. During the tq_disk check, the kernel removes any element in the queue and executes the corresponding function.

Usually, the function stored in any plug_tq task queue element points to the generic_unplug_device( ) function, which resets the plugged field of the request queue descriptor and invokes its request_fn method, thus executing the low-level driver’s strategy routine. This activity is referred to as unplugging the device. As a result, the requests included in the queues of the driver are processed, and the corresponding I/O data transfers take place.
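The plugging protocol just described can be reduced to a tiny state machine; the structure and function names below are invented user-space stand-ins:

```c
/* Invented stand-ins for the plugging protocol. */
struct sketch_rq {
    int plugged;         /* plugged field analogue */
    int strategy_runs;   /* how many times request_fn ran */
};

static void sketch_plug(struct sketch_rq *q)
{
    q->plugged = 1;      /* generic_plug_device(): mark plugged */
}

static void sketch_unplug(struct sketch_rq *q)
{
    if (!q->plugged)
        return;          /* nothing to do for an unplugged device */
    q->plugged = 0;      /* generic_unplug_device(): clear the flag */
    q->strategy_runs++;  /* ...and invoke the request_fn method */
}
```

Note that unplugging an already unplugged device is a no-op, which is why deferred runs of the task queue element are harmless.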

Extending the request queue

If the request queue is not empty, the driver was already plugged when the kernel inserted the first request in the queue. Therefore, there is no need to schedule the activation of the strategy routine again. Either the low-level driver is already unplugged, or it soon will be.

Notice that if __make_request( ) finds that the queue is not empty, the low-level driver could be actively handling the requests of the queue. Nevertheless, the function can safely modify the queue because the low-level driver usually removes the requests from the queue before processing them. As a particular case, however, the function never touches the first request in the queue when the head_active field of the request queue descriptor is set. This flag is set when the low-level driver’s policy is always to process the first request in the queue and not to remove the request from the queue until the I/O data transfer completes.

The __make_request( ) function must either add a new element in the queue or include the new block in an existing request; the second case is known as block clustering.

Block clustering requires that all the following conditions be satisfied:

  • The block to be inserted belongs to the same block device as the other blocks in the request and is adjacent to them: it either immediately precedes the first block in the request or immediately follows the last block in the request.

  • The blocks in the request have the same I/O operation type (READ or WRITE) as the block to be inserted.

  • The extended request does not exceed the allowed maximum number of sectors. This value is stored in the max_sectors table, which is indexed by the major and minor number of the block device. The default value is 255 sectors.

  • The extended request does not exceed the allowed maximum number of segments (see Section 13.4.5.1 earlier in this chapter), which is usually 128.

  • No process is waiting for the completion of the request (i.e., the waiting field of the request descriptor is NULL).
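The conditions above can be collected into a single predicate; the following user-space sketch uses made-up types and the default limits quoted in the text:

```c
#include <stdbool.h>

/* Invented summary of a queued request: device, operation, sector
 * range, segment count, and whether a process waits on it. */
struct sketch_req {
    int dev;                    /* block device */
    int rw;                     /* READ or WRITE */
    unsigned long start_sector; /* first sector in the request */
    unsigned long nr_sectors;   /* total sectors in the request */
    int nr_segments;
    bool has_waiter;            /* a process waits for completion */
};

/* Can a new block of `count` sectors at `sector` be clustered? */
static bool can_cluster(const struct sketch_req *req,
                        int dev, int rw,
                        unsigned long sector, unsigned long count)
{
    if (req->dev != dev || req->rw != rw)
        return false;
    /* the block must immediately precede or follow the request */
    if (sector + count != req->start_sector &&
        req->start_sector + req->nr_sectors != sector)
        return false;
    if (req->nr_sectors + count > 255)   /* max_sectors default */
        return false;
    if (req->nr_segments + 1 > 128)      /* max segments */
        return false;
    if (req->has_waiter)
        return false;
    return true;
}
```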

When the __make_request( ) function must determine how to insert the requested block in the queue, it uses a program that is traditionally called an elevator algorithm. The elevator algorithm basically defines the ordering of the elements in the queue; usually, this ordering is also followed by the low-level driver when it is handling the requests.

Although each block device driver may define its own elevator algorithm, most block device drivers use either one of the following:

ELEVATOR_NOOP algorithm

New request descriptors are inserted at the end of the queue. Therefore, older requests precede younger ones. Notice that block clustering enlarges a request but doesn’t make it younger. This algorithm favors fairness of servicing time among the various requests.

ELEVATOR_LINUS algorithm

The queue elements tend to be ordered by the position of the corresponding sectors on the block device. This algorithm tries to minimize the number and extent of seek operations on the physical device. However, the algorithm must also rely on an aging mechanism to prevent requests near the end of the queue from remaining unhandled for long periods of time. When searching for a request that might include the block, the algorithm starts from the bottom of the queue and stops scanning as soon as it finds a very old request.
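A rough user-space sketch of this insertion policy follows; the array representation, the helper name, and the aging threshold are all invented for illustration:

```c
/* Hypothetical ELEVATOR_LINUS-style insertion: scan the queue
 * backward and place the new request before any request with a
 * larger sector, but never move it past a very old request. */
#define MAX_AGE 10   /* made-up aging threshold */

static int linus_insert_pos(const unsigned long *sectors,
                            const int *ages, int n,
                            unsigned long new_sector)
{
    int pos = n;  /* default: append at the end of the queue */
    for (int i = n - 1; i >= 0; i--) {
        if (ages[i] > MAX_AGE)
            break;            /* don't overtake very old requests */
        if (sectors[i] > new_sector)
            pos = i;          /* new request goes before this one */
        else
            break;            /* sector order is now satisfied */
    }
    return pos;
}
```

With queued sectors {10, 50, 90} and fresh ages, a request for sector 70 is inserted before the one at sector 90; if the request at sector 50 were very old, even a request for sector 5 would not be moved past it.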

The elevator algorithm is implemented by three methods included in the elevator field of the request queue descriptor:

elevator_merge_fn

Scans the queue, searching for a candidate request for block clustering. If block clustering is possible, the method returns the request that has to be enlarged to include the new block; otherwise, it returns the position where the new request should be inserted.

elevator_merge_cleanup_fn

Invoked after a successful block clustering operation. It should increase the age of all requests in the queue that follow the enlarged request. (The method does nothing in the ELEVATOR_NOOP algorithm).

elevator_merge_req_fn

Invoked when the kernel merges two existing requests of the queue. It should assign the age of the new enlarged request. (The method does nothing in the ELEVATOR_NOOP algorithm).

To add the new block at the back or at the front of an existing request, the __make_request( ) function uses the back_merge_fn and front_merge_fn methods of the request queue descriptor, respectively. After a successful block clustering operation, __make_request( ) also checks whether the enlarged request can be merged with the previous or the next request in the queue by invoking the merge_requests_fn method of the request queue descriptor.

Low-Level Request Handling

We have now reached the lowest level in Linux’s block device handling. This level is implemented by the strategy routine, which interacts with the physical block device to satisfy the requests collected in the queue.

As mentioned earlier, the strategy routine is usually started after inserting a new request in an empty request queue. Once activated, the low-level driver should handle all requests in the queue and terminate when the queue is empty.

A naïve implementation of the strategy routine could be the following: for each element in the queue, interact with the block device controller to service the request and wait until the data transfer completes. Then remove the serviced request from the queue and proceed with the next one.

Such an implementation is not very efficient. Even assuming that data can be transferred using DMA, the strategy routine must suspend itself while waiting for I/O completion; hence, an unrelated user process would be heavily penalized. (The strategy routine does not necessarily execute on behalf of the process that has requested the I/O operation but at a random, later time, since it is activated by means of the tq _disk task queue.)

Therefore, many low-level drivers adopt the following strategy:

  • The strategy routine handles the first request in the queue and sets up the block device controller so that it raises an interrupt when the data transfer completes. Then the strategy routine terminates.

  • When the block device controller raises the interrupt, the interrupt handler activates a bottom half. The bottom half handler removes the request from the queue and reexecutes the strategy routine to service the next request in the queue.

Basically, low-level drivers can be further classified into the following:

  • Drivers that service each block in a request separately

  • Drivers that service several blocks in a request together

Drivers of the second type are much more complicated to design and implement than drivers of the first type. Indeed, although the sectors are adjacent on the physical block devices, the buffers in RAM are not necessarily consecutive. Therefore, any such driver may have to allocate a temporary area for the DMA data transfer, and then perform a memory-to-memory copy of the data between the temporary area and each buffer in the request’s list.

Since the requests handled by both types of drivers consist of adjacent blocks, disk performance is enhanced in both cases because fewer seek commands are issued. However, drivers of the second type do not further reduce the number of seek commands, so transferring several blocks from disk together is not as effective in boosting disk performance as one might expect.

The kernel doesn’t offer any support for the second type of drivers: they must handle the request queues and the buffer head lists on their own. The choice to leave the job up to the driver is not capricious or lazy. Each physical block device is inherently different from all others (for example, a floppy driver groups blocks in disk tracks and transfers a whole track in a single I/O operation), so making general assumptions on how to service each clustered request makes very little sense.

However, the kernel offers a limited degree of support for the low-level drivers in the first class. We’ll spend a little more time describing such drivers.

A typical strategy routine should perform the following actions:

  1. Get the current request from a request queue. If all request queues are empty, terminate the routine.

  2. Check that the current request has consistent information. In particular, compare the major number of the block device with the value stored in the rq_rdev field of the request descriptor. Moreover, check that the first buffer head in the list is locked (the BH_Lock flag should have been set by ll_rw_block( )).

  3. Program the block device controller for the data transfer of the first block. The data transfer direction can be found in the cmd field of the request descriptor and the address of the buffer in the buffer field, while the initial sector number and the number of sectors to be transferred are stored in the sector and current_nr_sectors fields, respectively.[96] Also, set up the block device controller so that an interrupt is raised when the DMA data transfer completes.

  4. If the routine is handling a block device file for which ll_rw_block( ) accomplishes block clustering, increment the sector field and decrement the nr_sectors field of the request descriptor to keep track of the blocks to be transferred.

The interrupt handler associated with the termination of the DMA data transfer for the block device should invoke (either directly or via a bottom half) the end_request function (or a custom function of the block device driver that does the same things). The function, which receives as a parameter the value 1 if the data transfer succeeded or the value 0 if an error occurred, performs the following operations:

  1. If an error occurred (parameter value is 0), updates the sector and nr_sectors fields so as to skip the remaining sectors of the block. In Step 3a below, the buffer content will also be marked as not up-to-date.

  2. Removes the buffer head of the transferred block from the request’s list.

  3. Invokes the b_end_io method of the buffer head. When the ll_rw_block( ) function allocates the buffer head, it loads this field with the address of the end_buffer_io_sync( ) function, which essentially performs two operations:

    1. Sets the BH_Uptodate flag of the buffer head to 1 or 0, according to the success or failure of the data transfer

    2. Clears the BH_Lock, BH_Wait_IO, and BH_launder flags of the buffer head and wakes up all processes sleeping in the wait queue to which the b_wait field of the buffer head points

    The b_end_io field could also point to other functions. For instance, if the create_bounce( ) function created a temporary buffer in low memory, the b_end_io field points to a suitable function that updates the original buffer in high memory and then invokes the b_end_io method of the original buffer head.

  4. If there is another buffer head on the request’s list, sets the current_nr_sectors field of the request descriptor to the number of sectors of the new block, and loads the buffer field with the address of the new buffer (from the b_data field of the new buffer head).

  5. Otherwise, if the request’s list is empty, all blocks have been processed. Therefore, it performs the following operations:

    1. Removes the request descriptor from the request queue

    2. Wakes up any process waiting for the request to complete (waiting field in the request descriptor)

    3. Sets the rq_status field of the request to RQ_INACTIVE

    4. Puts the request descriptor in the list of free requests

After invoking end_request, the low-level driver checks whether the request queue is empty. If it is not, the strategy routine is executed again. Notice that the strategy routine and end_request together perform two nested iterations: the outer one on the elements of the request queue and the inner one on the elements in the buffer head list of each request. The strategy routine is thus invoked once for each block in the request queue.
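The two nested iterations can be simulated in user space; the structure below is an invented stand-in, with each inner step standing for one strategy-routine invocation followed by end_request:

```c
/* Hypothetical simulation of the strategy-routine / end_request
 * interplay: the outer loop walks the request queue, the inner
 * loop consumes one block per strategy invocation. */
struct sketch_request {
    int nr_blocks;               /* blocks left in this request */
};

/* Returns the total number of strategy-routine invocations needed
 * to drain a queue of n requests. */
static int drain_queue(struct sketch_request *queue, int n)
{
    int invocations = 0;
    for (int i = 0; i < n; i++) {          /* outer: requests */
        while (queue[i].nr_blocks > 0) {   /* inner: blocks */
            invocations++;                 /* one strategy run */
            queue[i].nr_blocks--;          /* end_request drops
                                              one buffer head */
        }
    }
    return invocations;
}
```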

Block and Page I/O Operations

We’ll discuss in the forthcoming chapters how the kernel uses the block device drivers. We’ll see that there are a number of cases in which the kernel activates disk I/O data transfers. However, let’s describe here the two fundamental kinds of I/O data transfer for block devices:

Block I/O operations

Here the I/O operation transfers a single block of data, so the transferred data can be kept in a single RAM buffer. The disk address consists of a device number and a block number. The buffer is associated with a specific disk block, which is identified by the major and minor numbers of the block device and by the logical block number.

Page I/O operations

Here the I/O operation transfers as many blocks of data as needed to fill a single page frame (the exact number depends both on the disk block size and on the page frame size). If the size of a page frame is a multiple of the block size, several disk blocks are transferred in a single I/O operation. Each page frame contains data belonging to a file. Since this data is not necessarily stored in adjacent disk blocks, it is identified by the file’s inode and by an offset within the file.

Block I/O operations are most often used when the kernel reads or writes single blocks in a filesystem (for example, a block containing an inode or a superblock). Conversely, page I/O operations are used mainly for reading and writing files (both regular files and block device files), for accessing files through the memory mapping, and for swapping.

Both kinds of I/O operations rely on the same functions to access a block device, but the kernel uses different algorithms and buffering techniques with them.

Block I/O operations

The bread( ) function reads a single block from a block device and stores it in a buffer. It receives as parameters the device identifier, the block number, and the block size, and returns a pointer to the buffer head of the buffer containing the block.

The function performs the following operations:

  1. Invokes the getblk( ) function to search for the block in a software cache called the buffer cache (see Section 14.2.4). If the block is not included in the cache, getblk( ) allocates a new buffer for it.

  2. Invokes mark_page_accessed( ) on the buffer page containing the data (see Section 16.7.2).

  3. If the buffer already contains valid data, it terminates.

  4. Invokes ll_rw_block( ) to start the READ operation (see Section 13.4.6 earlier in this chapter).

  5. Waits until the data transfer completes. This is done by invoking a function named wait_on_buffer( ), which inserts the current process in the b_wait wait queue and suspends the process until the buffer is unlocked.

  6. Checks whether the buffer contains valid data. If so, it returns the address of the buffer head; otherwise, it returns a NULL pointer.

No function exists to directly write a block to disk. Declaring a buffer dirty is sufficient to force its flushing to disk at some later time. In fact, write operations are not considered critical for system performance, so they are deferred whenever possible (see Section 14.2.4).
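The control flow of bread( ) can be mimicked with a toy user-space cache; every name below is hypothetical, and the "disk read" is reduced to a boolean outcome:

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy analogue of a buffer-cache entry. */
struct sketch_buf {
    int block;
    bool uptodate;
};

/* cache/n: a tiny "buffer cache"; read_ok: simulated outcome of the
 * ll_rw_block()/wait_on_buffer() sequence on a cache miss. */
static struct sketch_buf *sketch_bread(struct sketch_buf *cache, int n,
                                       int block, bool read_ok)
{
    for (int i = 0; i < n; i++) {
        if (cache[i].block == block) {
            if (cache[i].uptodate)
                return &cache[i];        /* hit: no disk I/O needed */
            cache[i].uptodate = read_ok; /* miss: read from disk */
            return read_ok ? &cache[i] : NULL;
        }
    }
    return NULL;   /* real bread() would allocate a buffer here */
}
```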

Page I/O operations

Block devices transfer information one block at a time, while process address spaces (or to be more precise, memory regions allocated to the process) are defined as sets of pages. This mismatch can be hidden to some extent by using page I/O operations. They may be activated in the following cases:

  • A process issues a read( ) or write( ) system call on a file (see Chapter 15).

  • A process reads a location of a page that maps a file in memory (see Chapter 15).

  • The kernel flushes some dirty pages related to a file memory mapping to disk (see Section 15.2.5).

  • When swapping in or swapping out, the kernel loads from or saves to disk the contents of whole page frames (see Chapter 16).

Page I/O operations can be activated by several kernel functions. In this section, we’ll present the brw_page( ) function used to read or write swap pages (see Chapter 16). Other functions that start page I/O operations are discussed in Chapter 15.

The brw_page( ) function receives the following parameters:

rw

Type of I/O operation (READ, WRITE, or READA)

page

Address of a page descriptor

dev

Block device number (major and minor numbers)

b

Array of logical block numbers

size

Block size

The page descriptor refers to the page involved in the page I/O operation. It must already be locked (PG_locked flag on) before invoking brw_page( ) so that no other kernel control path can access it. The page is considered as split into 4096/size buffers; the i th buffer in the page is associated with the block b[i] of device dev.
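The page-splitting arithmetic can be sketched as follows, assuming 4096-byte page frames as the text does; the helper names are invented:

```c
/* Sketch of how brw_page() splits the page: a 4096-byte page holds
 * 4096/size buffers, and the i-th buffer starts at byte offset
 * i*size within the page. */
#define SKETCH_PAGE_SIZE 4096

static int buffers_in_page(unsigned int size)
{
    return SKETCH_PAGE_SIZE / size;
}

static unsigned int buffer_offset(int i, unsigned int size)
{
    return i * size;
}
```

With 1024-byte blocks, a page is split into four buffers, bound to blocks b[0] through b[3].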

The function performs the following operations:

  1. Checks the page->buffers field; if it is NULL, invokes create_empty_buffers( ) to allocate temporary buffer heads for all buffers included in the page (such buffer heads are called asynchronous; they are discussed in Section 14.2.1). The address of the buffer head for the first buffer in the page is stored in the page->buffers field. The b_this_page field of each buffer head points to the buffer head of the next buffer in the page.

    Conversely, if the page->buffers field is not NULL, the kernel does not need to allocate temporary buffer heads. In fact, in this case, the page stores some buffers already included in the buffer cache, presumably because some of them were previously involved in block I/O operations (see Section 14.2.2 for further details).

  2. For each buffer head in the page, performs the following substeps:

    1. Sets the BH_Lock (locks the buffer for the I/O data transfer) and the BH_Mapped (the buffer maps a file on disk) flags of the buffer head.

    2. Stores in the b_blocknr field the value of the corresponding element of the array b.

    3. Since it is an asynchronous buffer head, sets the BH_Async flag, and in the b_end_io field, stores a pointer to end_buffer_io_async( ) (described next).

  3. For each buffer head in the page, invokes submit_bh( ) to request the buffer (see Section 13.4.6 earlier in this chapter.)

The submit_bh( ) function activates the device driver of the block device being accessed. As described earlier in Section 13.4.7, the device driver performs the actual data transfer and then invokes the b_end_io method of all asynchronous buffer heads that have been transferred. The b_end_io field points to the end_buffer_io_async( ) function, which performs the following operations:

  1. Sets the BH_Uptodate flag of the asynchronous buffer head according to the result of the I/O operation.

  2. If the BH_Uptodate flag is off, sets the PG_error flag of the page descriptor because an error occurred while transferring the block. The function gets the page descriptor address from the b_page field of the buffer head.

  3. Gets the page_update_lock spin lock.

  4. Clears both the BH_Async and the BH_Lock flags of the buffer head, and awakens each process waiting for the buffer.

  5. If any of the buffer heads in the page are still locked (i.e., the I/O data transfer is not yet terminated), releases the page_update_lock spin lock and returns.

  6. Otherwise, releases the page_update_lock spin lock and checks the PG_error flag of the page descriptor. If it is cleared, then all data transfers on the page have successfully completed, so the function sets the PG_uptodate flag of the page descriptor.

  7. Unlocks the page, clears the PG_locked flag, and wakes any process sleeping on page->wait wait queue.

Notice that once the page I/O operation terminates, the temporary buffer heads allocated by create_empty_buffers( ) are not automatically released. As we shall see in Chapter 16, the temporary buffer heads are released only when the kernel tries to reclaim some memory.
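The completion logic of end_buffer_io_async( ) can be mimicked with a small user-space sketch; the structure is an invented stand-in that counts still-locked buffers instead of scanning the b_this_page list:

```c
#include <stdbool.h>

/* Invented page state: the page becomes up-to-date only when the
 * last buffer unlocks and no transfer reported an error. */
struct sketch_page {
    int nr_locked;     /* buffers still under I/O */
    bool error;        /* PG_error analogue */
    bool uptodate;     /* PG_uptodate analogue */
    bool locked;       /* PG_locked analogue */
};

/* One buffer's I/O completed; ok reports its success. */
static void sketch_end_buffer_io(struct sketch_page *p, bool ok)
{
    if (!ok)
        p->error = true;      /* record the failed transfer */
    p->nr_locked--;           /* this buffer is now unlocked */
    if (p->nr_locked > 0)
        return;               /* other transfers still pending */
    if (!p->error)
        p->uptodate = true;   /* all transfers succeeded */
    p->locked = false;        /* unlock the page, wake sleepers */
}
```

Note that a single failed buffer leaves the whole page not up-to-date, even though the page is still unlocked at the end.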



[96] Recall that current_nr_sectors contains the number of sectors in the first block of the request, while nr_sectors contains the total number of sectors in the request.
