The most important function in a block driver is the request function, which performs the low-level operations related to reading and writing data. This section introduces the design of such a procedure.
When the kernel schedules a data transfer, it queues the ``request'' in a list, ordered so that it maximizes system performance. The linked list of requests is then passed to the driver’s request function, which should perform the following tasks for each request in the linked list:
Check the validity of the current request. This task is performed by the macro INIT_REQUEST, defined in blk.h.
Perform the actual data transfer. The CURRENT variable (a macro, actually) can be used to retrieve the details of the outstanding request. CURRENT is a pointer to struct request, whose fields are described in the next section.
Clean up the current request. This operation is performed by end_request, a static function whose code resides in blk.h. The driver passes the function a single argument, which is 1 in case of success and 0 in case of failure. When end_request is called with an argument of zero, an ``I/O error'' message is delivered to the system logs (via printk).
Loop back to the beginning to consume the next request. A goto, a surrounding for(;;), or a surrounding while(1) can be used, at the programmer’s discretion.
In practice, the code for the request function is structured like this:
    void sbull_request(void)
    {
        while(1) {
            INIT_REQUEST;   /* returns when the request list is empty */
            printk("request %p: cmd %i sec %li (nr. %li), next %p\n", CURRENT,
                   CURRENT->cmd, CURRENT->sector, CURRENT->current_nr_sectors,
                   CURRENT->next);
            end_request(1); /* success */
        }
    }
Although this code does nothing but print messages, running this function provides good insight into the basic design of data transfer. The only unclear part of the code at this point should be the exact meaning of CURRENT and its fields, which I’ll describe in the next section.
My first sbull implementation contained exactly the empty code just shown. I managed to make a filesystem on the ``nonexistent'' device and use it for a while, as long as data remained in the buffer cache. Looking at the system logs while running a verbose request function like this one can help you understand how the buffer cache works.
This empty-and-verbose function can still be run in sbull by defining the symbol SBULL_EMPTY_REQUEST at compile time. If you want to understand how the kernel handles different block sizes, you can experiment with blksize= on the insmod command line. The empty request function uncovers the internal kernel workings by printing the details of each request. You might also play with hardsect=, but currently this is disabled because it’s dangerous (see Section 12.1 at the beginning of this chapter).
The code in a request function doesn’t explicitly issue return(), because INIT_REQUEST does it for you when the list of pending requests is exhausted.
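The hidden return can be illustrated with a small user-space sketch. Every name here (sketch_request, SKETCH_CURRENT, SKETCH_INIT_REQUEST, sketch_end_request) is a hypothetical stand-in invented for illustration, not the kernel's definition; the real macro also performs validity checks that are not modeled.

```c
#include <stddef.h>

/* Hypothetical user-space stand-ins; not the kernel's definitions. */
struct sketch_request {
    unsigned long sector;
    struct sketch_request *next;
};

static struct sketch_request *sketch_queue;  /* head of the pending list */
static int sketch_served;                    /* requests consumed so far */

#define SKETCH_CURRENT (sketch_queue)

/* Mimics the behavior described in the text: leave the enclosing
 * function as soon as no request is pending. */
#define SKETCH_INIT_REQUEST \
    do { if (SKETCH_CURRENT == NULL) return; } while (0)

/* Stand-in for end_request(): retire the head, advance to the next one. */
static void sketch_end_request(int uptodate)
{
    (void)uptodate;                 /* success/failure is ignored here */
    sketch_queue = sketch_queue->next;
    sketch_served++;
}

void sketch_request_fn(void)
{
    while (1) {
        SKETCH_INIT_REQUEST;        /* the only way out of the loop */
        sketch_end_request(1);
    }
}
```

Because the return statement is expanded inside sketch_request_fn, the endless loop terminates once the list is empty, even though no return appears in the function body.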
In order to build a working data transfer for sbull, let’s look at how the kernel describes a request within a struct request. The structure is defined in <linux/blkdev.h>. By accessing the fields in CURRENT, the driver can retrieve all the information needed to transfer data between the buffer cache and the physical block device.
CURRENT is a macro that is used to access the current request (the one to be serviced first). As you might guess, CURRENT is a short form of blk_dev[MAJOR_NR].current_request.
The following fields of the current request carry useful information for the request function:
kdev_t rq_dev;
The device accessed by the request. The same request function is used for every device managed by the driver. A single request function deals with all the minor numbers; rq_dev can be used to extract the minor device being acted upon. Although Linux 1.2 called this field dev, you can access this field through the macro CURRENT_DEV, which is portable to any kernel version in the range we are addressing.
int cmd;
This field is either READ or WRITE.
unsigned long sector;
The first sector the request refers to.
unsigned long current_nr_sectors;
unsigned long nr_sectors;
The number of sectors (the size) of the current request. The driver should refer to current_nr_sectors and ignore nr_sectors (which is listed here just for completeness). See the next section, Section 12.3.2, for more detail.
char *buffer;
The area in the buffer cache to which data should be written (cmd==READ) or from which data should be read (cmd==WRITE).
struct buffer_head *bh;
The structure describing the first buffer in the list for this request. We’ll use this field in Section 12.3.2.
There are other fields in the structure, but they are primarily meant for internal use in the kernel; the driver is not expected to use them.
The implementation for the working request function in the sbull device is shown below. In the following code, sbull_devices is like scull_devices, introduced in Section 3.5.1 in Chapter 3.
    void sbull_request(void)
    {
        Sbull_Dev *device;
        u8 *ptr;
        int size;

        while(1) {
            INIT_REQUEST;

            /* Check if the minor number is in range (valid: 0..sbull_devs-1) */
            if (DEVICE_NR(CURRENT_DEV) >= sbull_devs) {
                static int count = 0;
                if (count++ < 5) /* print the message at most 5 times */
                    printk(KERN_WARNING "sbull: request for unknown device\n");
                end_request(0);
                continue;
            }

            /* pointer to device structure, from the global array */
            device = sbull_devices + DEVICE_NR(CURRENT_DEV);
            ptr = device->data + CURRENT->sector * sbull_hardsect;
            size = CURRENT->current_nr_sectors * sbull_hardsect;
            if (ptr + size > device->data + sbull_blksize*sbull_size) {
                static int count = 0;
                if (count++ < 5)
                    printk(KERN_WARNING "sbull: request past end of device\n");
                end_request(0);
                continue;
            }

            switch(CURRENT->cmd) {
              case READ: /* from sbull to buffer */
                memcpy(CURRENT->buffer, ptr, size);
                break;
              case WRITE: /* from buffer to sbull */
                memcpy(ptr, CURRENT->buffer, size);
                break;
              default: /* can't happen */
                end_request(0);
                continue;
            }

            end_request(1); /* success */
        }
    }
Since sbull is just a RAM disk, its ``data transfer'' reduces to a memcpy call. The only ``strange'' feature of the function is the conditional statement that limits it to reporting five errors. This is intended to avoid clobbering the system logs with too many messages, since end_request(0) already prints an ``I/O error'' message when the request fails. The static counter is a standard way to limit message reporting and is used several times in the kernel.
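The static-counter idiom can be tried in isolation with a user-space sketch. The function name sketch_warn_limited and the use of fprintf in place of printk are assumptions made for illustration; the limit of five comes from the listing above.

```c
#include <stdio.h>

/* Hypothetical helper: print msg at most five times over the life of
 * the program. Returns 1 if the message was printed, 0 if suppressed. */
int sketch_warn_limited(const char *msg)
{
    static int count = 0;   /* keeps its value across calls */

    if (count < 5) {
        count++;            /* stop counting at the limit, so it never wraps */
        fprintf(stderr, "%s\n", msg);
        return 1;
    }
    return 0;
}
```

This variant increments the counter only while it is below the limit, a small departure from the count++ < 5 form in the listing that keeps the counter from overflowing after a very large number of failures.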
Each iteration of the loop in the request function above transfers a number of sectors--usually the number of sectors that equals a ``block'' of data, according to the use of such data. For instance, swapping is performed PAGE_SIZE bytes at a time, while an extended-2 filesystem transfers 1KB blocks.
Although a block is the most convenient data size for I/O, you can get a significant performance boost by clustering the reading or writing of adjacent blocks. In this context, ``adjacent'' refers to the location of blocks on the disk, while ``consecutive'' refers to consecutive memory areas.
There are two advantages to clustering adjacent blocks. First, clustering speeds up the transfer (for example, the floppy driver assembles adjacent blocks and transfers a whole track at a time). It can also save memory in the kernel by avoiding allocation of redundant request structures.
You can, if you want, completely ignore clustering. The skeletal request function shown above works flawlessly, independent of clustering. If you want to exploit clustering, on the other hand, you need to deal in greater detail with the internals of struct request.
Unfortunately, all kernels I know of (up to at least 2.1.51) don’t perform clustering for custom drivers, just for internal drivers like SCSI and IDE. If you aren’t interested in the internals of the kernel, you can skip the rest of this section. On the other hand, clustering might be available to modules in the future, and it is an interesting way to increase data-transfer performance by reducing inter-request delays for adjacent sectors.
Before I describe how a driver can exploit clustered requests, let’s look at what happens when a request is queued.
When the kernel requests the transfer of a data block, it scans the linked list of active requests for the target device. If the new block is adjacent on the disk to a block that has already been requested, the new block is clustered to the first block; the existing request is enlarged without creating a new one.
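The queueing step just described can be sketched in user space. The types skq_bh and skq_req and the function skq_try_merge are hypothetical and greatly simplified; only the case in which the new block directly follows an existing request is modeled, and nothing here should be read as the kernel's actual code.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a buffer head. */
struct skq_bh {
    unsigned long sector;       /* first sector of this block on disk */
    unsigned long nr_sectors;   /* block length in sectors */
    struct skq_bh *b_reqnext;   /* next buffer in the same request */
};

/* Hypothetical, simplified stand-in for a queued request. */
struct skq_req {
    int cmd;                    /* transfer direction */
    unsigned long sector;       /* first sector of the request */
    unsigned long nr_sectors;   /* total sectors in the request */
    struct skq_bh *bh;          /* first buffer */
    struct skq_bh *bhtail;      /* last buffer */
    struct skq_req *next;
};

/* Scan the queue for a request that ends exactly where bh begins.
 * On a match, enlarge that request and return 1; otherwise return 0,
 * and the caller would have to queue a new request. */
int skq_try_merge(struct skq_req *queue, int cmd, struct skq_bh *bh)
{
    struct skq_req *req;

    for (req = queue; req != NULL; req = req->next) {
        if (req->cmd != cmd)
            continue;           /* never mix transfer directions */
        if (req->sector + req->nr_sectors != bh->sector)
            continue;           /* not adjacent on disk */
        bh->b_reqnext = NULL;
        req->bhtail->b_reqnext = bh;   /* append to the buffer list */
        req->bhtail = bh;
        req->nr_sectors += bh->nr_sectors;
        return 1;
    }
    return 0;
}
```

Note that the test is purely about disk position: nothing in it requires the two data buffers to be consecutive in memory.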
Unfortunately, the fact that the contents of two data buffers are adjacent on disk doesn’t necessarily mean that they are consecutive in memory. This observation, plus the need to efficiently manage the buffer cache, led to the creation of a buffer_head structure. One buffer_head is associated with each data buffer. A ``clustered'' request, then, is a single struct request that refers to a linked list of buffer_head structures.
The end_request function takes care of this problem, and that’s
why the request function shown earlier works independent of clustering.
In other words, end_request either cleans up the current
request and prepares to service the next one, or prepares to deal
with the next buffer in the same request. Clustering is therefore
transparent to the device driver that doesn’t care about it; the
sbull function above is such an example.
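The two outcomes described above can be modeled with a user-space sketch. The types ske_buf and ske_request, the function ske_end_request, and the fixed 512-byte sector size are assumptions made for illustration; in particular, the way the sketch advances the sector field is a choice of this example, not a statement about the code in blk.h.

```c
#include <stddef.h>

#define SKE_SECTOR_SIZE 512UL   /* assumed sector size */

/* Hypothetical stand-in for a buffer head. */
struct ske_buf {
    char *b_data;
    unsigned long b_size;       /* buffer size in bytes */
    int done;                   /* set once the buffer is released */
    struct ske_buf *b_reqnext;
};

/* Hypothetical stand-in for a request. */
struct ske_request {
    unsigned long sector;
    unsigned long current_nr_sectors;
    char *buffer;               /* cached copy of bh->b_data */
    struct ske_buf *bh;
    struct ske_request *next;
};

/* Release the buffer just transferred and return the request to be
 * serviced next: the same one if it still has buffers, else req->next. */
struct ske_request *ske_end_request(struct ske_request *req)
{
    struct ske_buf *bh = req->bh;

    if (bh != NULL) {
        req->bh = bh->b_reqnext;        /* detach the finished buffer */
        bh->b_reqnext = NULL;
        bh->done = 1;
        if (req->bh != NULL) {
            /* More buffers: expose the next one through the fields
             * used by a driver that ignores clustering. */
            req->sector += bh->b_size / SKE_SECTOR_SIZE;
            req->current_nr_sectors = req->bh->b_size / SKE_SECTOR_SIZE;
            req->buffer = req->bh->b_data;
            return req;
        }
    }
    return req->next;                   /* request exhausted */
}
```

A driver that looks only at sector, current_nr_sectors, and buffer therefore sees a clustered request as a series of ordinary single-buffer requests.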
A driver may want to benefit from clustering by dealing with the whole linked list of buffer heads at each pass through the loop in its request_fn function. To do this, the driver should refer to both CURRENT->current_nr_sectors (the field I already used above in sbull_request) and CURRENT->nr_sectors, which contains the number of adjacent sectors that are clustered in the ``current'' list of buffer_heads.
The current buffer head is CURRENT->bh, while the data block is CURRENT->bh->b_data. The latter pointer is cached in CURRENT->buffer for drivers like sbull that ignore clustering.
Request clustering is implemented in drivers/block/ll_rw_blk.c, in the function make_request; however, as suggested above, clustering is performed only for a few drivers (floppy, IDE, and SCSI), according to their major number. I’ve been able to see how clustering works by loading sbull with major=34 because 34 is IDE3_MAJOR, and I don’t have the third IDE controller on my system.[30]
The following list summarizes what needs to be done when scanning a clustered request. bh is the buffer head being processed--the first in the list. For every buffer head in the list, the driver should carry out the following sequence of operations:
Transfer the data block at address bh->b_data, of size bh->b_size bytes. The direction of the data transfer is CURRENT->cmd, as usual.
Retrieve the next buffer head in the list: bh->b_reqnext. Then detach the buffer just transferred from the list, by zeroing its b_reqnext--the pointer to the new buffer you just retrieved.
Tell the kernel you’re done with the previous buffer, by calling mark_buffer_uptodate(bh,1) and then unlock_buffer(bh). These calls guarantee that the buffer cache is kept sane, without wild pointers lying around. The ``1'' argument to mark_buffer_uptodate indicates success; if the transfer failed, substitute ``0''.
Loop back to the beginning to transfer the next adjacent block.
When you are done with the clustered request, CURRENT->bh must be updated to point to the first buffer that was ``processed but not unlocked.'' If all the buffers in the list were processed and unlocked, CURRENT->bh can be set to NULL.
At this point, the driver can call end_request. If CURRENT->bh is valid, the function unlocks it before moving to the next buffer--this is what happens for non-clustered operation, where end_request takes care of everything. If the pointer is NULL, the function just moves to the next request.
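The scanning procedure can be sketched in user space against a RAM-backed ``disk.'' The types skc_buf and skc_request, the constants SKC_READ and SKC_WRITE, and the function skc_transfer_cluster are hypothetical; the uptodate and locked flags merely stand in for the effect of mark_buffer_uptodate and unlock_buffer, and the sketch assumes the caller has already checked that the request fits within the device.

```c
#include <stddef.h>
#include <string.h>

enum { SKC_READ = 0, SKC_WRITE = 1 };   /* assumed direction codes */

/* Hypothetical stand-in for a buffer head. */
struct skc_buf {
    char *b_data;
    unsigned long b_size;       /* bytes */
    int uptodate;
    int locked;
    struct skc_buf *b_reqnext;
};

/* Hypothetical stand-in for a clustered request. */
struct skc_request {
    int cmd;
    unsigned long sector;       /* first sector of the whole cluster */
    struct skc_buf *bh;         /* first buffer in the list */
};

/* Transfer every buffer of the request; returns the bytes moved.
 * No bounds checking is done here (see the assumption above). */
unsigned long skc_transfer_cluster(struct skc_request *req, char *disk,
                                   unsigned long sect_size)
{
    char *ptr = disk + req->sector * sect_size;
    struct skc_buf *bh = req->bh;
    unsigned long total = 0;

    while (bh != NULL) {
        struct skc_buf *next;

        /* Transfer the block; adjacent on disk, so ptr just advances. */
        if (req->cmd == SKC_READ)
            memcpy(bh->b_data, ptr, bh->b_size);
        else
            memcpy(ptr, bh->b_data, bh->b_size);
        ptr += bh->b_size;
        total += bh->b_size;

        /* Fetch the next buffer, then detach the one just handled. */
        next = bh->b_reqnext;
        bh->b_reqnext = NULL;

        /* Stand-ins for mark_buffer_uptodate(bh, 1); unlock_buffer(bh); */
        bh->uptodate = 1;
        bh->locked = 0;

        bh = next;
    }
    req->bh = NULL;     /* every buffer was processed and unlocked */
    return total;
}
```

After a call like this, the request has no buffer left, which corresponds to the case in which end_request simply moves on to the next request.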
A full-featured implementation of clustering appears in drivers/block/floppy.c, while a summary of the operations required appears in end_request, in blk.h. Neither floppy.c nor blk.h is easy to understand, but the latter is a better place to start.
[30] While this is a handy trick to play dirty games on one’s home computer, I strongly discourage doing it in a production driver.