If you try to create partitions with fdisk, you’ll find out that there’s something wrong with them. The fdisk program calls the partitions /dev/sbull01, /dev/sbull02, and so on, but those names don’t exist on the filesystem. Indeed, the base sbull device is a byte array with no entry points to provide access to subregions of the data area, so partitioning sbull doesn’t work.
In order to be able to partition a device, we must assign several minor numbers to each physical device. One number is used to access the whole device (for example, /dev/hda), and the others are used to access the various partitions (such as /dev/hda1). Since fdisk creates partition names by adding a numerical suffix to the whole-disk device name, we’ll follow the same naming convention in our next block driver.
The device I’m going to introduce in this section is called spull, because it is a ``Simple Partitionable Utility.'' The device resides in the spull directory and is completely detached from sbull, even though they share a lot of code.
In the char driver scull, different minor numbers were able to implement different behaviors, so that a single driver could show several different implementations. Differentiating according to the minor number is not possible with block devices, and that’s why sbull and spull are kept separate. The inability to differentiate devices according to the minor number is a basic feature of block drivers, as several of the data structures and macros are defined only as a function of the major number.
As far as porting is concerned, it’s worth noting that partitionable modules can’t be loaded into the 1.2 kernel versions, because the symbol resetup_one_dev (introduced later in this section) wasn’t exported to modules. Before SCSI disk support was modularized, nobody ever considered partitionable modules.
The device nodes I’m going to introduce are called pd, for ``partitionable disk.'' The four whole devices (also called ``units'') are thus called /dev/pda through /dev/pdd; each device supports at most 15 partitions. Minor numbers have the following meaning: the least significant four bits represent the partition number (where 0 is the whole device), and the most significant four bits represent the unit number. This convention is expressed in the source file by the following macros:
    #define SPULL_SHIFT 4       /* max 16 partitions */
    #define SPULL_MAXNRDEV 4    /* max 4 device units */
    #define DEVICE_NR(device) (MINOR(device)>>SPULL_SHIFT)
    #define DEVICE_NAME "pd"    /* name for messaging */
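The minor-number convention can be tried out in user space. The following is only a sketch: spull_unit, spull_partition, and spull_node_name are hypothetical helpers written for illustration, not functions from the spull source, and they take a plain minor number rather than a kernel device number.

```c
#include <assert.h>
#include <stdio.h>

#define SPULL_SHIFT     4   /* max 16 partitions, as in the macros above */
#define SPULL_MAXNRDEV  4   /* max 4 device units */

/* Unit number: the bits above the partition field. */
static int spull_unit(unsigned int minor)
{
    return (int)(minor >> SPULL_SHIFT);
}

/* Partition number: the low four bits; 0 means the whole device. */
static int spull_partition(unsigned int minor)
{
    return (int)(minor & ((1u << SPULL_SHIFT) - 1));
}

/* Hypothetical helper: build a node name such as "pdb3" from a minor. */
static void spull_node_name(unsigned int minor, char *buf, size_t len)
{
    int unit = spull_unit(minor);
    int part = spull_partition(minor);

    assert(unit < SPULL_MAXNRDEV);
    if (part == 0)
        snprintf(buf, len, "pd%c", 'a' + unit);          /* whole device */
    else
        snprintf(buf, len, "pd%c%d", 'a' + unit, part);  /* partition */
}
```

With these definitions, minor number 19 (0x13) belongs to unit 1, partition 3, so it would be named pdb3, while minor 0 is the whole first unit, pda.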
Every partitionable device needs to know how it is partitioned. The information is available in the partition table, and part of the initialization process consists of decoding the partition table and updating the internal data structures to reflect the partition information.
This decoding isn’t easy, but fortunately, the kernel offers ``Generic Hard Disk'' support usable by all block drivers, which considerably reduces the amount of code needed in the driver for handling partitions. Another advantage of the generic support is that the driver writer doesn’t need to understand how the partitioning is done, and new partitioning schemes can be supported in the kernel without requiring changes to driver code.
A block driver that wants to support partitions should include <linux/genhd.h> and should declare a struct gendisk structure. All such structures are arranged in a linked list, whose head is the global pointer gendisk_head.
Before we go further, let’s look at the fields in struct gendisk. You’ll need to understand them in order to exploit generic device support.
int major
The major number identifies the device driver that the structure refers to.
const char *major_name
The base name for devices belonging to this major number. Each device name is derived from this name by adding a letter for each unit and a number for each partition. For example, ``hd'' is the base name that is used to build /dev/hda1 and /dev/hdb3. The base name should be at most five characters long, because add_partition builds the full name of the partition in an eight-byte buffer, and the letter that identifies the unit, the partition number, and the '\0' terminator have to be appended. The name for spull is pd (``Partitionable Disk'').
int minor_shift
The number of bit-shifts needed to extract the drive number from the device minor number. In spull the number is 4. The value in this field should be consistent with the definition of the macro DEVICE_NR(device) (see Section 12.2 earlier in this chapter). The macro in spull expands to MINOR(device)>>4.
int max_p
The maximum number of partitions. In our example, max_p is 16, or more generally, 1 << minor_shift.
int max_nr
The maximum number of units. In spull, this number is 4. The result of the maximum number of units shifted by minor_shift should fit in the available range of minor numbers, which is currently 0-255. The IDE driver can support both many drives and many partitions per drive because it registers several major numbers to work around the small range of minor numbers.
void (*init)(struct gendisk *)
The initialization function for the driver, which is called after initializing the device and before the partition check is performed. I’ll describe this function in more detail below.
struct hd_struct *part
The decoded partition table for the device. The driver uses this item to determine what range of the disk’s sectors is accessible through each minor number. The driver is responsible for allocation and deallocation of this array, which most drivers implement as a static array of max_nr << minor_shift structures. The driver should initialize the array to zero before the kernel decodes the partition table.
int *sizes
This field points to an array of integers. The array holds the same information as blk_size. The driver is responsible for allocating and deallocating the data area. Note that the partition check for the device copies this pointer to blk_size, so a driver handling partitionable devices doesn’t need to allocate the latter array.
int nr_real
The number of real devices (units) that exist. This number must be less than or equal to max_nr.
void *real_devices
This pointer is used internally by each driver that needs to keep additional private information (this is similar to filp->private_data).
struct gendisk *next
A link in the list of generic hard disks.
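The numeric fields constrain one another, and the constraints are easy to check mechanically. The sketch below uses a user-space stand-in, struct gendisk_model, holding only those fields; the structure name, the MINOR_RANGE constant, and gendisk_model_ok are assumptions made for this illustration and are not part of <linux/genhd.h>.

```c
#include <assert.h>
#include <stddef.h>

/* User-space stand-in for the numeric fields of struct gendisk. */
struct gendisk_model {
    int minor_shift;   /* bits used for the partition number */
    int max_p;         /* maximum partitions per unit */
    int max_nr;        /* maximum number of units */
    int nr_real;       /* units actually present */
};

#define MINOR_RANGE 256    /* minor numbers run from 0 to 255 */

/* Return 1 if the fields agree with the rules described above, else 0. */
static int gendisk_model_ok(const struct gendisk_model *g)
{
    assert(g != NULL);
    if (g->minor_shift < 0 || g->minor_shift > 8)
        return 0;
    if (g->max_p != (1 << g->minor_shift))
        return 0;                      /* max_p must be 1 << minor_shift */
    if (g->max_nr < 0 || g->max_nr > (MINOR_RANGE >> g->minor_shift))
        return 0;                      /* max_nr << minor_shift must fit */
    if (g->nr_real < 0 || g->nr_real > g->max_nr)
        return 0;                      /* nr_real <= max_nr */
    return 1;
}
```

The spull values (shift 4, 16 partitions, 4 units) pass the check, and so does a shift of 6 with 4 units, which uses the whole 0-255 range; 17 units with a shift of 4 would not fit.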
The design of partition checking is best suited to drivers directly linked to the kernel image, so I’ll start by introducing the basic structure of the kernel code. Later I’ll introduce the way the spull module handles its partitions.
At boot time, init/main.c calls the various initialization functions. One of those functions, start_kernel, initializes all drivers by calling device_setup. This function in turn calls blk_dev_init and then checks the partition information of all registered generic hard disks. Any block driver that finds at least one of its devices registers the driver’s gendisk structure in the kernel list so its partitions will be correctly detected.
A partitionable driver, therefore, should declare its own struct gendisk. The structure looks like the following:
    struct gendisk my_gendisk = {
        MAJOR_NR,       /* Major number */
        "my",           /* Major name */
        6,              /* Bits to shift to get real from partition */
        1 << 6,         /* Number of partitions per real */
        MY_MAXNRDEV,    /* Maximum number of devices */
        my_geninit,     /* Init function */
        my_partitions,  /* hd_struct array, filled at partition check */
        my_sizes,       /* Block sizes */
        0,              /* Number of units: updated by init code */
        NULL,           /* "real_devices" pointer: use at will */
        NULL            /* next: updated by the lines shown below */
    };
In the initialization function for the driver, then, the structure is queued on the main list of partitionable devices.
The initialization function of a driver that is linked to the kernel is the equivalent of init_module, even though it is called in a different way. The function must include the following two lines, to take care of queueing the structure:
    my_gendisk.next = gendisk_head;
    gendisk_head = &my_gendisk;
By inserting the structure into the linked list, these simple lines are all that’s needed in the driver’s entry point for all its partitions to be properly recognized and configured.
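The two queueing lines are an ordinary head insertion into a singly linked list. The following user-space sketch mirrors them with a stand-in node type; struct disk_node, disk_head, and the three functions are hypothetical names chosen for the example. The disk_unregister function shows one possible way to take a structure back out of such a list; it is not taken from any driver.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for struct gendisk: only the fields needed for the list. */
struct disk_node {
    const char *major_name;
    struct disk_node *next;
};

static struct disk_node *disk_head = NULL;   /* plays the role of gendisk_head */

/* Head insertion: the same two assignments shown above. */
static void disk_register(struct disk_node *d)
{
    assert(d != NULL);
    d->next = disk_head;
    disk_head = d;
}

/* Unlink a node; returns 0 on success, -1 if it isn't in the list. */
static int disk_unregister(struct disk_node *d)
{
    struct disk_node **pp;

    for (pp = &disk_head; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == d) {
            *pp = d->next;
            d->next = NULL;
            return 0;
        }
    }
    return -1;
}

/* Count the registered nodes by walking the list. */
static int disk_count(void)
{
    const struct disk_node *d;
    int n = 0;

    for (d = disk_head; d != NULL; d = d->next)
        n++;
    return n;
}
```

Because insertion happens at the head, the most recently registered structure is the first one found when the list is walked.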
Additional setup can be performed by my_geninit. In the example shown above, the function fills the ``number of units'' field to reflect the actual hardware setup of the computer system. After my_geninit terminates, genhd.c performs the actual partition detection for all the disks (the units). You can see the partitions being detected at system boot because genhd.c prints Partition check: on the system console, followed by all the partitions it finds on the available generic hard disks.
You can modify the previous code by delaying allocation of both my_sizes and my_partitions until the my_geninit function. This saves a small amount of kernel memory because the arrays can be as small as nr_real << minor_shift elements, whereas static arrays must be max_nr << minor_shift elements long. The typical savings, however, are a few hundred bytes per physical unit.
A modularized driver differs from a driver linked to the kernel in that it can’t benefit from the centralized initialization. Instead, it should handle its own setup. There’s no two-step initialization for a module, so the gendisk structure for spull has a NULL pointer in its init function pointer:
    struct gendisk spull_gendisk = {
        0,               /* Major no., assigned after dynamic retrieval */
        "pd",            /* Major name */
        SPULL_SHIFT,     /* Bits to shift to get real from partition */
        1 << SPULL_SHIFT, /* Number of partitions per real */
        SPULL_MAXNRDEV,  /* Maximum no. of devices */
        NULL,            /* No init function (isn't called, anyway) */
        NULL,            /* Partition array, allocated by init_module */
        NULL,            /* Block sizes, allocated by init_module */
        0,               /* Number of units: set by init_module */
        NULL,            /* "real_devices" pointer: not used */
        NULL             /* Next */
    };
It is also unnecessary to register the gendisk structure in the global linked list of generic disks.
The file genhd.c is prepared to handle a ``late'' initialization like the one needed by modules by exporting the function resetup_one_dev, which scans the partitions for a single physical device. The prototype for resetup_one_dev is:
void resetup_one_dev(struct gendisk *dev, int drive);
You can see from the name of the function that it is meant to change the setup information for a device. The function was designed to be called by the BLKRRPART implementation within ioctl, but it can also be used for the initial setup of a module.
When a module is initialized, it should call resetup_one_dev for each physical device it is going to access so that the partition information can be stored in the part array of its gendisk structure (my_gendisk.part). The partition information is then used in the request_fn function of the device.
In spull, the init_module function includes the following code in addition to the usual instructions. It allocates the arrays needed for partition check and initializes the whole-disk entries in the arrays.
    /* Prepare the `size' array and zero it. */
    spull_sizes = kmalloc( (spull_devs << SPULL_SHIFT) * sizeof(int),
                          GFP_KERNEL);
    if (!spull_sizes)
        goto fail_malloc;

    /* Start with zero-sized partitions, and correctly sized units */
    memset(spull_sizes, 0, (spull_devs << SPULL_SHIFT) * sizeof(int));
    for (i=0; i< spull_devs; i++)
        spull_sizes[i<<SPULL_SHIFT] = spull_size;
    blk_size[MAJOR_NR] = spull_gendisk.sizes = spull_sizes;

    /* Allocate the partitions, and refer the array in spull_gendisk. */
    spull_partitions = kmalloc( (spull_devs << SPULL_SHIFT) *
                               sizeof(struct hd_struct), GFP_KERNEL);
    if (!spull_partitions)
        goto fail_malloc;

    memset(spull_partitions, 0, (spull_devs << SPULL_SHIFT) *
           sizeof(struct hd_struct));
    /* fill whole-disk entries */
    for (i=0; i < spull_devs; i++) {
        /* start_sect is already 0, and sects are 512 bytes long */
        spull_partitions[i << SPULL_SHIFT].nr_sects = 2 * spull_size;
    }
    spull_gendisk.part = spull_partitions;

    #if 0
    /*
     * Well, now a *real* driver should call resetup_one_dev().
     * Avoid it here, as there's no allocated data in spull yet.
     */
    for (i=0; i< spull_devs; i++) {
        printk(KERN_INFO "Spull partition check: ");
        resetup_one_dev(&spull_gendisk, i);
    }
    #endif
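The layout of the two arrays is easier to see in a user-space model. The sketch below replaces kmalloc with calloc and struct hd_struct with a stand-in, struct part_model; those names, struct spull_tables, and the two functions are assumptions made for illustration. It relies on sizes being expressed in kilobytes, which is why the sector count of a whole unit is twice its size.

```c
#include <assert.h>
#include <stdlib.h>

#define SPULL_SHIFT 4

/* Stand-in for struct hd_struct. */
struct part_model {
    long start_sect;
    long nr_sects;
};

struct spull_tables {
    int *sizes;               /* kilobytes, one entry per minor number */
    struct part_model *part;  /* 512-byte sectors, one entry per minor */
    int entries;
};

/* Allocate zeroed tables and fill only the whole-disk entries. */
static int spull_tables_init(struct spull_tables *t, int devs, int size_kb)
{
    int i;

    assert(t != NULL && devs > 0);
    t->entries = devs << SPULL_SHIFT;
    t->sizes = calloc((size_t)t->entries, sizeof(int));
    t->part = calloc((size_t)t->entries, sizeof(struct part_model));
    if (t->sizes == NULL || t->part == NULL) {
        free(t->sizes);
        free(t->part);
        t->sizes = NULL;
        t->part = NULL;
        t->entries = 0;
        return -1;
    }
    for (i = 0; i < devs; i++) {
        /* entry i << SPULL_SHIFT is partition 0 of unit i */
        t->sizes[i << SPULL_SHIFT] = size_kb;
        t->part[i << SPULL_SHIFT].nr_sects = 2L * size_kb;
    }
    return 0;
}

static void spull_tables_free(struct spull_tables *t)
{
    free(t->sizes);
    free(t->part);
    t->sizes = NULL;
    t->part = NULL;
    t->entries = 0;
}
```

For two units of 2048 KB each, the tables hold 32 entries; entries 0 and 16 describe the whole units (4096 sectors each), and every other entry stays zero until a partition check fills it in.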
It’s interesting to note that resetup_one_dev prints partition information by repeatedly calling:
printk(" %s:", disk_name(hd, minor, buf));
That’s why spull would print a leading string. It’s meant to add some context to the information that gets stuffed into the system log.
When a partitionable module is unloaded, the driver should arrange for all the partitions to be flushed, by calling fsync_dev for every supported major/minor pair. Moreover, if the gendisk structure was inserted in the global list, it should be removed; note that spull didn’t insert itself, for the reasons outlined above.
The cleanup function for spull is:
    for (i = 0; i < (spull_devs << SPULL_SHIFT); i++)
        fsync_dev(MKDEV(spull_major, i)); /* flush the devices */
    blk_dev[major].request_fn = NULL;
    read_ahead[major] = 0;
    kfree(blk_size[major]); /* which is gendisk->sizes as well */
    blk_size[major] = NULL;
    kfree(spull_gendisk.part);
If you want to mount your root filesystem from a device whose driver is available only in modularized form, you must use the Initrd facility offered by modern Linux kernels. I won’t introduce Initrd here; this subsection is aimed at readers who know about Initrd and wonder how it affects block drivers.
When you boot a kernel with Initrd, it establishes a temporary running environment before it mounts the real root filesystem. Modules are usually loaded from within the ramdisk being used as the temporary root file system.
Since the Initrd process is run after all boot-time initialization is complete (but before the real root filesystem has been mounted), there’s no difference between loading a normal module and one living in the Initrd ramdisk. If a driver can be correctly loaded and used as a module, all Linux distributions that have Initrd available can include the driver on their installation disks without requiring you to hack in the kernel source.
In addition to initialization and cleanup, there are other differences between partitionable devices and non-partitionable devices. Basically, the differences are due to the fact that if the disk is partitionable, the same physical device can be accessed using different minor numbers. The mapping from the minor number to the physical position on the disk is stored by resetup_one_dev in the part array of the gendisk structure. The code below includes only those parts of spull that differ from sbull, because most of the code is exactly the same.
First of all, open and close must keep track of the usage count for each device. Since the usage count refers to the physical device (unit), the following assignment is used for the dev variable:
Spull_Dev *dev = spull_devices + DEVICE_NR(inode->i_rdev);
The DEVICE_NR macro used here is the one that must be declared before <linux/blk.h> is included.
While almost every device method works with the physical device, ioctl should access specific information for each partition. For example, mkfs should be told the size of each partition, not the size of the whole device. Here is how the BLKGETSIZE ioctl command is affected by the change from one minor number per device to multiple minor numbers per device. As you might expect, spull_gendisk.part is used as the source of the partition size.
    case BLKGETSIZE:
        /* Return the device size, expressed in sectors */
        if (!arg) return -EINVAL; /* NULL pointer: not valid */
        err=verify_area(VERIFY_WRITE, (long *) arg, sizeof(long));
        if (err) return err;
        size = spull_gendisk.part[MINOR(inode->i_rdev)].nr_sects;
        put_user (size, (long *) arg);
        return 0;
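From user space, a program such as mkfs reaches this code through the ioctl system call. The following is a minimal sketch of such a caller, assuming a Linux system with <linux/fs.h> available; device_sectors is a hypothetical helper name, and the current headers document the argument as an unsigned long pointer rather than the long pointer used in the driver code above.

```c
#include <assert.h>
#include <fcntl.h>
#include <linux/fs.h>      /* BLKGETSIZE */
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Ask a block device node for its size in 512-byte sectors.
 * Returns 0 on success, or -1 if the node can't be opened or the
 * driver refuses the ioctl command.
 */
static int device_sectors(const char *path, unsigned long *sectors)
{
    int fd, ret;

    assert(path != NULL && sectors != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ret = ioctl(fd, BLKGETSIZE, sectors);
    close(fd);
    return ret < 0 ? -1 : 0;
}
```

Called on a partition node such as /dev/pda1 (assuming the module is loaded and the node exists), the helper would report the size of that partition only, because the driver indexes the part array with the full minor number.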
The other ioctl command that is different for partitionable devices is BLKRRPART. Re-reading the partition table makes sense for partitionable devices and is equivalent to revalidating a disk after a disk change:
    case BLKRRPART: /* re-read partition table: fake a disk change */
        return spull_revalidate(inode->i_rdev);
The function spull_revalidate in turn calls resetup_one_dev to rebuild the partition table. First, however, it must clear any previous information--otherwise, trailing partitions would still appear at the end of the partition table in case the new one contains fewer partitions than before.
    int spull_revalidate(kdev_t i_rdev)
    {
        /* first partition, # of partitions */
        int part1 = (DEVICE_NR(i_rdev) << SPULL_SHIFT) + 1;
        int npart = (1 << SPULL_SHIFT) -1;

        /* first clear old partition information */
        memset(spull_gendisk.sizes+part1, 0, npart*sizeof(int));
        memset(spull_gendisk.part +part1, 0, npart*sizeof(struct hd_struct));

        /* then fill new info */
        printk(KERN_INFO "Spull partition check: ");
        resetup_one_dev(&spull_gendisk, DEVICE_NR(i_rdev));
        return 0;
    }
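The index arithmetic in the clearing step deserves a closer look: only the 15 partition entries of the affected unit are zeroed, while the whole-disk entry and all other units are left alone. The user-space sketch below reproduces just that step; struct part_model and clear_unit_partitions are stand-in names introduced for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SPULL_SHIFT 4

/* Stand-in for struct hd_struct. */
struct part_model {
    long start_sect;
    long nr_sects;
};

/* Zero the entries for partitions 1..15 of one unit, nothing else. */
static void clear_unit_partitions(int *sizes, struct part_model *part,
                                  int unit)
{
    int part1 = (unit << SPULL_SHIFT) + 1;   /* first partition entry */
    int npart = (1 << SPULL_SHIFT) - 1;      /* 15 partition entries */

    assert(sizes != NULL && part != NULL && unit >= 0);
    memset(sizes + part1, 0, (size_t)npart * sizeof(int));
    memset(part + part1, 0, (size_t)npart * sizeof(struct part_model));
}
```

For unit 1, the cleared range is entries 17 through 31; entry 16, which describes the whole unit, keeps its size.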
But the major difference between sbull and spull is in the request function. In spull, the request function needs to use the partition information in order to correctly transfer data for the different minor numbers.
Information in spull_gendisk.part is used to locate each partition on the physical device. part[minor].nr_sects is the partition size, and part[minor].start_sect is its offset from the beginning of the disk. The request function eventually falls back to the whole-disk implementation.
Here are the relevant lines in spull_request:
    /* the sector size is 512 bytes */
    ptr = device->data +
          512 * (spull_partitions[minor].start_sect + CURRENT->sector);
    size = CURRENT->current_nr_sectors * 512;

    if (CURRENT->sector + CURRENT->current_nr_sectors >
            spull_gendisk.part[minor].nr_sects) {
        printk(KERN_WARNING "spull: request past end of device\n");
        end_request(0);
        continue;
    }
The number of sectors is multiplied by 512, the sector size (which is hardwired in spull), to get the size of the transfer in bytes.
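The translation from a partition-relative sector to a position in the device’s data area can be modeled outside the kernel. The sketch below is a simplified stand-in, not the spull request function: struct part_model and map_request are hypothetical names, and the bounds check is done before the offset is computed.

```c
#include <assert.h>
#include <stddef.h>

#define SECTOR_SIZE 512    /* hardwired, as in spull */

/* Stand-in for struct hd_struct. */
struct part_model {
    unsigned long start_sect;   /* offset from the start of the disk */
    unsigned long nr_sects;     /* partition size in sectors */
};

/*
 * Translate a request relative to a partition into a byte offset and
 * a byte count within the whole device. Returns 0 on success, or -1
 * if the request runs past the end of the partition.
 */
static int map_request(const struct part_model *p, unsigned long sector,
                       unsigned long nr_sectors,
                       size_t *offset, size_t *size)
{
    assert(p != NULL && offset != NULL && size != NULL);
    if (sector + nr_sectors > p->nr_sects)
        return -1;
    *offset = (size_t)SECTOR_SIZE * (p->start_sect + sector);
    *size = (size_t)SECTOR_SIZE * nr_sectors;
    return 0;
}
```

For a partition that starts at sector 100 and is 50 sectors long, a 4-sector request at sector 10 maps to byte offset 56320 (512 * 110) with a length of 2048 bytes, while a 4-sector request at sector 48 is rejected.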