Partitionable Devices

If you try to create partitions with fdisk, you’ll find out that there’s something wrong with them. The fdisk program calls the partitions /dev/sbull01, /dev/sbull02, and so on, but those names don’t exist on the filesystem. Indeed, the base sbull device is a byte array with no entry points to provide access to subregions of the data area, so partitioning sbull doesn’t work.

In order to be able to partition a device, we must assign several minor numbers to each physical device. One number is used to access the whole device (for example, /dev/hda), and the others are used to access the various partitions (such as /dev/hda1). Since fdisk creates partition names by adding a numerical suffix to the whole-disk device name, we’ll follow the same naming convention in our next block driver.

The device I’m going to introduce in this section is called spull, because it is a “Simple Partitionable Utility.” The device resides in the spull directory and is completely detached from sbull, even though they share a lot of code.

In the char driver scull, different minor numbers were able to implement different behaviors, so that a single driver could show several different implementations. Differentiating according to the minor number is not possible with block devices, and that’s why sbull and spull are kept separate. The inability to differentiate devices according to the minor number is a basic feature of block drivers, as several of the data structures and macros are defined only as a function of the major number.

As far as porting is concerned, it’s worth noting that partitionable modules can’t be loaded into the 1.2 kernel versions, because the symbol resetup_one_dev (introduced later in this section) wasn’t exported to modules. Before SCSI disk support was modularized, nobody ever considered partitionable modules.

The device nodes I’m going to introduce are called pd, for “partitionable disk.” The four whole devices (also called “units”) are thus called /dev/pda through /dev/pdd; each device supports at most 15 partitions. Minor numbers have the following meaning: the least significant four bits represent the partition number (where 0 is the whole device), and the most significant four bits represent the unit number. This convention is expressed in the source file by the following macros:

#define SPULL_SHIFT 4                         /* max 16 partitions  */
#define SPULL_MAXNRDEV 4                      /* max 4 device units */
#define DEVICE_NR(device) (MINOR(device)>>SPULL_SHIFT)
#define DEVICE_NAME "pd"                      /* name for messaging */
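
The bit layout described above can be tried out in user space. The following is only a sketch: the helper names spull_unit, spull_partition, and spull_make_minor are hypothetical and do not appear in the spull source, which relies on DEVICE_NR alone.

```c
#include <assert.h>

#define SPULL_SHIFT    4   /* bits reserved for the partition number */
#define SPULL_MAXNRDEV 4   /* units pda through pdd */

/* Unit (drive) index: the upper bits of the minor number. */
static int spull_unit(unsigned int minor)
{
    return (int)(minor >> SPULL_SHIFT);
}

/* Partition index: the lower SPULL_SHIFT bits; 0 means the whole device. */
static int spull_partition(unsigned int minor)
{
    return (int)(minor & ((1u << SPULL_SHIFT) - 1));
}

/* Hypothetical helper: build a minor number from unit and partition. */
static unsigned int spull_make_minor(int unit, int partition)
{
    assert(unit >= 0 && unit < SPULL_MAXNRDEV);
    assert(partition >= 0 && partition < (1 << SPULL_SHIFT));
    return ((unsigned int)unit << SPULL_SHIFT) | (unsigned int)partition;
}
```

With this layout, /dev/pdb is unit 1, partition 0 (minor 16), and the third partition on the third unit, /dev/pdc3, is minor 35.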


The Generic Hard Disk

Every partitionable device needs to know how it is partitioned. The information is available in the partition table, and part of the initialization process consists of decoding the partition table and updating the internal data structures to reflect the partition information.

This decoding isn’t easy, but fortunately, the kernel offers “Generic Hard Disk” support usable by all block drivers, which considerably reduces the amount of code needed in the driver for handling partitions. Another advantage of the generic support is that the driver writer doesn’t need to understand how the partitioning is done, and new partitioning schemes can be supported in the kernel without requiring changes to driver code.

A block driver that wants to support partitions should include <linux/genhd.h> and should declare a struct gendisk structure. All such structures are arranged in a linked list, whose head is the global pointer gendisk_head.

Before we go further, let’s look at the fields in struct gendisk. You’ll need to understand them in order to exploit generic device support.

int major

The major number identifies the device driver that the structure refers to.

const char *major_name

The base name for devices belonging to this major number. Each device name is derived from this name by adding a letter for each unit and a number for each partition. For example, “hd” is the base name that is used to build /dev/hda1 and /dev/hdb3. The base name should be at most five characters long, because add_partition builds the full name of the partition in an eight-byte buffer, and the letter that identifies the unit, the partition number, and the '\0' terminator have to be appended. The name for spull is pd (“Partitionable Disk”).

int minor_shift

The number of bit-shifts needed to extract the drive number from the device minor number. In spull the number is 4. The value in this field should be consistent with the definition of the macro DEVICE_NR(device) (see Section 12.2 earlier in this chapter). The macro in spull expands to MINOR(device)>>4.

int max_p

The maximum number of partitions. In our example, max_p is 16, or more generally, 1 << minor_shift.

int max_nr

The maximum number of units. In spull, this number is 4. The result of the maximum number of units shifted by minor_shift should fit in the available range of minor numbers, which is currently 0-255. The IDE driver can support both many drives and many partitions per drive because it registers several major numbers to work around the small range of minor numbers.

void (*init)(struct gendisk *)

The initialization function for the driver, which is called after initializing the device and before the partition check is performed. I’ll describe this function in more detail below.

struct hd_struct *part

The decoded partition table for the device. The driver uses this item to determine what range of the disk’s sectors are accessible through each minor number. The driver is responsible for allocation and deallocation of this array, which most drivers implement as a static array of max_nr << minor_shift structures. The driver should initialize the array to zero before the kernel decodes the partition table.

int *sizes

This field points to an array of integers. The array holds the same information as blk_size. The driver is responsible for allocating and deallocating the data area. Note that the partition check for the device copies this pointer to blk_size, so a driver handling partitionable devices doesn’t need to allocate the latter array.

int nr_real

The number of real devices (units) that exist. This number must be less than or equal to max_nr.

void *real_devices

This pointer is used internally by each driver that needs to keep additional private information (this is similar to filp->private_data).

struct gendisk *next

A link in the list of generic hard disks.
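
To see how major_name, minor_shift, and the eight-byte name buffer interact, here is a user-space sketch of the naming rule described for major_name. It is not the kernel’s own routine: build_disk_name and NAME_BUF_LEN are hypothetical names, and the sketch refuses names that would not fit instead of assuming they do.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NAME_BUF_LEN 8   /* buffer size mentioned in the text */

/*
 * Hypothetical helper (not the kernel's code): build
 * "<base><unit letter>[<partition>]" into an eight-byte buffer.
 * Returns 0 on success, -1 if the name would not fit.
 */
static int build_disk_name(char buf[NAME_BUF_LEN], const char *base,
                           unsigned int minor, int minor_shift)
{
    unsigned int unit = minor >> minor_shift;
    unsigned int part = minor & ((1u << minor_shift) - 1);
    int n;

    assert(buf != NULL && base != NULL);
    if (part == 0)      /* whole device: no numeric suffix */
        n = snprintf(buf, NAME_BUF_LEN, "%s%c", base, (char)('a' + unit));
    else
        n = snprintf(buf, NAME_BUF_LEN, "%s%c%u", base,
                     (char)('a' + unit), part);

    return (n < 0 || n >= NAME_BUF_LEN) ? -1 : 0;
}
```

Called with the base name "pd", a shift of 4, and minor number 0x12, the helper produces "pdb2"; a base name that is too long makes it return -1 rather than overrun the buffer.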

The design of partition checking is best suited to drivers directly linked to the kernel image, so I’ll start by introducing the basic structure of the kernel code. Later I’ll introduce the way the spull module handles its partitions.

Partition Detection in the Kernel

At boot time, init/main.c calls the various initialization functions. One of those functions, start_kernel, initializes all drivers by calling device_setup. This function in turn calls blk_dev_init and then checks the partition information of all registered generic hard disks. Any block driver that finds at least one of its devices registers the driver’s genhd structure in the kernel list so its partitions will be correctly detected.

A partitionable driver, therefore, should declare its own struct gendisk. The structure looks like the following:

struct gendisk my_gendisk = {
    MAJOR_NR,        /* Major number */
    "my",            /* Major name */
    6,               /* Bits to shift to get real from partition */
    1 << 6,          /* Number of partitions per real */
    MY_MAXNRDEV,     /* Maximum number of devices */
    my_geninit,      /* Init function */
    my_partitions,   /* hd_struct array, filled at partition check */
    my_sizes,        /* Block sizes */
    0,               /* Number of units: updated by init code */
    NULL,            /* "real_devices" pointer: use at will */
    NULL             /* next: updated by the lines shown below */
};

In the initialization function for the driver, then, the structure is queued on the main list of partitionable devices.

The initialization function of a driver that is linked to the kernel is the equivalent of init_module, even though it is called in a different way. The function must include the following two lines, to take care of queueing the structure:

my_gendisk.next = gendisk_head;
gendisk_head = &my_gendisk;

By inserting the structure into the linked list, these simple lines are all that’s needed in the driver’s entry point for all its partitions to be properly recognized and configured.
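
The two assignments are an ordinary push onto the head of a singly linked list. The following user-space model is only a sketch: struct gendisk_model, model_head, model_register, and model_find are hypothetical stand-ins for struct gendisk, gendisk_head, and the kernel’s own list handling.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced stand-in for struct gendisk: only the fields used here. */
struct gendisk_model {
    int major;
    struct gendisk_model *next;
};

/* Plays the role of the global gendisk_head pointer. */
static struct gendisk_model *model_head = NULL;

/* The same two assignments as in the text: push at the head of the list. */
static void model_register(struct gendisk_model *gd)
{
    assert(gd != NULL);
    gd->next = model_head;
    model_head = gd;
}

/* Walk the list looking for a major number, as a list scan would do. */
static struct gendisk_model *model_find(int major)
{
    struct gendisk_model *p;

    for (p = model_head; p != NULL; p = p->next)
        if (p->major == major)
            return p;
    return NULL;
}
```

Because each new structure is pushed at the head, the most recently registered driver is the first one met when the list is scanned.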

Additional setup can be performed by my_geninit. In the example shown above, the function fills the “number of units” field to reflect the actual hardware setup of the computer system. After my_geninit terminates, gendisk.c performs the actual partition detection for all the disks (the units). You can see the partitions being detected at system boot because gendisk.c prints Partition check: on the system console, followed by all the partitions it finds on the available generic hard disks.

You can modify the previous code by delaying allocation of both my_sizes and my_partitions until the my_geninit function. This saves a small amount of kernel memory because the arrays can be as small as nr_real << minor_shift elements, whereas static arrays must be max_nr << minor_shift elements long. The typical savings, however, are a few hundred bytes per physical unit.
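
The arithmetic behind that saving can be sketched as follows. The two-field layout of struct hd_struct_model is an assumption made for illustration (check <linux/genhd.h> for the real definition), and table_entries and table_bytes are hypothetical helpers.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout, for illustration only. */
struct hd_struct_model {
    long start_sect;
    long nr_sects;
};

/* Number of array elements (one per minor number) for `units` drives. */
static size_t table_entries(int units, int minor_shift)
{
    assert(units >= 0 && minor_shift >= 0 && minor_shift <= 8);
    return (size_t)units << minor_shift;
}

/* Bytes needed for the sizes array plus the partition array. */
static size_t table_bytes(int units, int minor_shift)
{
    return table_entries(units, minor_shift) *
           (sizeof(int) + sizeof(struct hd_struct_model));
}
```

Comparing table_bytes(max_nr, minor_shift) with table_bytes(nr_real, minor_shift) gives the memory saved by allocating for the detected units only.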

Partition Detection in Modules

A modularized driver differs from a driver linked to the kernel in that it can’t benefit from the centralized initialization. Instead, it should handle its own setup. There’s no two-step initialization for a module, so the gendisk structure for spull has a NULL pointer in its init function pointer:

struct gendisk spull_gendisk = {
    0,                /* Major no., assigned after dynamic retrieval */
    "pd",             /* Major name */
    SPULL_SHIFT,      /* Bits to shift to get real from partition */
    1 << SPULL_SHIFT, /* Number of partitions per real */
    SPULL_MAXNRDEV,   /* Maximum no. of devices */
    NULL,             /* No init function (isn't called, anyways) */
    NULL,             /* Partition array, allocated by init_module */
    NULL,             /* Block sizes, allocated by init_module */
    0,                /* Number of units: set by init_module */
    NULL,             /* "real_devices" pointer: not used */
    NULL              /* Next */
};

It is also unnecessary to register the gendisk structure in the global linked list of generic disks.

The file gendisk.c is prepared to handle a “late” initialization like the one needed by modules by exporting the function resetup_one_dev, which scans the partitions for a single physical device. The prototype for resetup_one_dev is:

void resetup_one_dev(struct gendisk *dev, int drive);

You can see from the name of the function that it is meant to change the setup information for a device. The function was designed to be called by the BLKRRPART implementation within ioctl, but it can also be used for the initial setup of a module.

When a module is initialized, it should call resetup_one_dev for each physical device it is going to access so that the partition information can be stored in my_gendisk.part. The partition information is then used in the request_fn function of the device.

In spull, the init_module function includes the following code in addition to the usual instructions. It allocates the arrays needed for partition check and initializes the whole-disk entries in the arrays.

/* Prepare the `size' array and zero it. */
spull_sizes = kmalloc( (spull_devs << SPULL_SHIFT) * sizeof(int),
                      GFP_KERNEL);
if (!spull_sizes)
    goto fail_malloc;


/* Start with zero-sized partitions, and correctly sized units */
memset(spull_sizes, 0, (spull_devs << SPULL_SHIFT) * sizeof(int));
for (i=0; i< spull_devs; i++)
    spull_sizes[i<<SPULL_SHIFT] = spull_size;
blk_size[MAJOR_NR] = spull_gendisk.sizes = spull_sizes;

/* Allocate the partitions, and refer the array in spull_gendisk. */
spull_partitions = kmalloc( (spull_devs << SPULL_SHIFT) *
                           sizeof(struct hd_struct), GFP_KERNEL);
if (!spull_partitions)
    goto fail_malloc;

memset(spull_partitions, 0, (spull_devs << SPULL_SHIFT) *
       sizeof(struct hd_struct));
/* fill whole-disk entries */
for (i=0; i < spull_devs; i++) {
    /* start_sect is already 0, and sects are 512 bytes long */
    spull_partitions[i << SPULL_SHIFT].nr_sects = 2 * spull_size;
}
spull_gendisk.part = spull_partitions;

#if 0
    /*
     * Well, now a *real* driver should call resetup_one_dev().
     * Avoid it here, as there's no allocated data in spull yet.
     */
    for (i=0; i< spull_devs; i++) {
        printk(KERN_INFO "Spull partition check: ");
        resetup_one_dev(&spull_gendisk, i);
    }
#endif
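
The layout of the two arrays, and the factor of two between kilobytes and 512-byte sectors, can be checked with a user-space model. This is a sketch under assumptions: calloc stands in for kmalloc plus memset, the two-field struct part_model is an assumed layout, and tables_init is a hypothetical helper rather than part of spull.

```c
#include <assert.h>
#include <stdlib.h>

#define SHIFT 4   /* same value as SPULL_SHIFT */

struct part_model { long start_sect; long nr_sects; };  /* assumed layout */

/*
 * Allocate zeroed tables for `units` drives of `size_kb` kilobytes each and
 * fill only the whole-disk entries (partition 0 of every unit).
 * Returns 0 on success, -1 on allocation failure.
 */
static int tables_init(int units, int size_kb,
                       int **sizes_out, struct part_model **parts_out)
{
    size_t n;
    int *sizes;
    struct part_model *parts;
    int i;

    assert(units > 0 && size_kb >= 0);
    n = (size_t)units << SHIFT;
    sizes = calloc(n, sizeof(*sizes));
    parts = calloc(n, sizeof(*parts));
    if (!sizes || !parts) {
        free(sizes);            /* free(NULL) is harmless */
        free(parts);
        return -1;
    }
    for (i = 0; i < units; i++) {
        sizes[i << SHIFT] = size_kb;                /* kilobytes */
        parts[i << SHIFT].nr_sects = 2L * size_kb;  /* 512-byte sectors */
    }
    *sizes_out = sizes;
    *parts_out = parts;
    return 0;
}
```

For two units of 2048 KB each, entries 0 and 16 describe the whole disks (4096 sectors each), while every other entry stays zero until a partition check fills it in.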

It’s interesting to note that resetup_one_dev prints partition information by repeatedly calling:

printk(" %s:", disk_name(hd, minor, buf));

That’s why spull would print a leading string. It’s meant to add some context to the information that gets stuffed into the system log.

When a partitionable module is unloaded, the driver should arrange for all the partitions to be flushed, by calling fsync_dev for every supported major/minor pair. Moreover, if the gendisk structure was inserted in the global list, it should be removed—note that spull didn’t insert itself, for the reasons outlined above.

The cleanup function for spull is:

for (i = 0; i < (spull_devs << SPULL_SHIFT); i++)
    fsync_dev(MKDEV(spull_major, i)); /* flush the devices */
blk_dev[major].request_fn = NULL;
read_ahead[major] = 0;
kfree(blk_size[major]); /* which is gendisk->sizes as well */
blk_size[major] = NULL;
kfree(spull_gendisk.part);
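
Spull never links itself into the global list, so its cleanup has nothing to unlink. For a driver that did insert its structure, the removal step might look like the following user-space sketch; struct gd_node and gd_unlink are hypothetical names, and a real driver would operate on gendisk_head and struct gendisk instead.

```c
#include <assert.h>
#include <stddef.h>

struct gd_node {              /* reduced stand-in for struct gendisk */
    int major;
    struct gd_node *next;
};

/*
 * Unlink gd from the list starting at *head.
 * Returns 1 if it was found and removed, 0 if it was not on the list.
 */
static int gd_unlink(struct gd_node **head, struct gd_node *gd)
{
    struct gd_node **pp;

    assert(head != NULL);
    for (pp = head; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == gd) {
            *pp = gd->next;   /* bypass the node */
            gd->next = NULL;
            return 1;
        }
    }
    return 0;
}
```

Walking the list through a pointer to the link field lets the same code remove the first node and any later node without a special case.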

Partition Detection Using Initrd

If you want to mount your root filesystem from a device whose driver is available only in modularized form, you must use the Initrd facility offered by modern Linux kernels. I won’t introduce Initrd here; this subsection is aimed at readers who know about Initrd and wonder how it affects block drivers.

When you boot a kernel with Initrd, it establishes a temporary running environment before it mounts the real root filesystem. Modules are usually loaded from within the ramdisk being used as the temporary root file system.

Since the Initrd process is run after all boot-time initialization is complete (but before the real root filesystem has been mounted), there’s no difference between loading a normal module and one living in the Initrd ramdisk. If a driver can be correctly loaded and used as a module, all Linux distributions that have Initrd available can include the driver on their installation disks without requiring you to hack in the kernel source.

The Device Methods for spull

In addition to initialization and cleanup, there are other differences between partitionable devices and non-partitionable devices. Basically, the differences are due to the fact that if the disk is partitionable, the same physical device can be accessed using different minor numbers. The mapping from the minor number to the physical position on the disk is stored by resetup_one_dev in the gendisk->part array. The code below includes only those parts of spull that differ from sbull, because most of the code is exactly the same.

First of all, open and close must keep track of the usage count for each device. Since the usage count refers to the physical device (unit), the following assignment is used for the dev variable:

Spull_Dev *dev = spull_devices + DEVICE_NR(inode->i_rdev);

The DEVICE_NR macro used here is the one that must be declared before <linux/blk.h> is included.

While almost every device method works with the physical device, ioctl should access specific information for each partition. For example, mkfs should be told the size of each partition, not the size of the whole device. Here is how the BLKGETSIZE ioctl command is affected by the change from one minor number per device to multiple minor numbers per device. As you might expect, spull_gendisk.part is used as the source of the partition size.

case BLKGETSIZE:
  /* Return the device size, expressed in sectors */
  if (!arg) return -EINVAL; /* NULL pointer: not valid */
  err=verify_area(VERIFY_WRITE, (long *) arg, sizeof(long));
  if (err) return err;
  size = spull_gendisk.part[MINOR(inode->i_rdev)].nr_sects;
  put_user (size, (long *) arg);
  return 0;
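
From user space, a program such as mkfs obtains this value through the BLKGETSIZE ioctl. The following sketch shows one way to issue the call on a Linux system; device_sectors is a hypothetical helper, and any device path passed to it is up to the caller.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* BLKGETSIZE */

/* Returns the size in 512-byte sectors, or -1 on any error. */
static long device_sectors(const char *path)
{
    unsigned long sectors = 0;
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (ioctl(fd, BLKGETSIZE, &sectors) < 0) {
        close(fd);          /* not a block device, or ioctl unsupported */
        return -1;
    }
    close(fd);
    return (long)sectors;
}
```

Called on a partition node such as /dev/pda1, the function would report the size of that partition only, because the driver indexes its part array by the full minor number.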

The other ioctl command that is different for partitionable devices is BLKRRPART. Re-reading the partition table makes sense for partitionable devices and is equivalent to revalidating a disk after a disk change:

case BLKRRPART: /* re-read partition table: fake a disk change */
  return spull_revalidate(inode->i_rdev);

The function spull_revalidate in turn calls resetup_one_dev to rebuild the partition table. First, however, it must clear any previous information; otherwise, stale trailing partitions would still appear at the end of the partition table if the new one contains fewer partitions than before.

int spull_revalidate(kdev_t i_rdev)
{
    /* first partition, # of partitions */
    int part1 = (DEVICE_NR(i_rdev) << SPULL_SHIFT) + 1;
    int npart = (1 << SPULL_SHIFT) -1;

    /* first clear old partition information */
    memset(spull_gendisk.sizes+part1, 0, npart*sizeof(int));
    memset(spull_gendisk.part +part1, 0, npart*sizeof(struct hd_struct));

    /* then fill new info */
    printk(KERN_INFO "Spull partition check: ");
    resetup_one_dev(&spull_gendisk, DEVICE_NR(i_rdev));
    return 0;
}
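
The index arithmetic in spull_revalidate is easy to get wrong by one, so here is a user-space sketch that isolates it. The helper clear_unit_partitions and the two-field struct part_entry are assumptions for illustration; the point is that slot 0 of each unit, the whole-disk entry, must survive the clearing.

```c
#include <assert.h>
#include <string.h>

#define SHIFT 4   /* same value as SPULL_SHIFT */

struct part_entry { long start_sect; long nr_sects; };  /* assumed layout */

/* Zero partitions 1..15 of `unit`, leaving the whole-disk slot untouched. */
static void clear_unit_partitions(int *sizes, struct part_entry *parts,
                                  int unit)
{
    int first = (unit << SHIFT) + 1;   /* skip partition 0 */
    int count = (1 << SHIFT) - 1;      /* 15 partition slots */

    assert(sizes != NULL && parts != NULL && unit >= 0);
    memset(sizes + first, 0, count * sizeof(*sizes));
    memset(parts + first, 0, count * sizeof(*parts));
}
```

Clearing unit 1 wipes entries 17 through 31 and leaves entry 16, as well as all entries belonging to unit 0, exactly as they were.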

But the major difference between sbull and spull is in the request function. In spull, the request function needs to use the partition information in order to correctly transfer data for the different minor numbers.

Information in spull_gendisk.part is used to locate each partition on the physical device. part[minor].nr_sects is the partition size, and part[minor].start_sect is its offset from the beginning of the disk. The request function eventually falls back to the whole-disk implementation.

Here are the relevant lines in spull_request:

/* the sector size is 512 bytes */
ptr = device->data +
      512 * (spull_partitions[minor].start_sect + CURRENT->sector);
size = CURRENT->current_nr_sectors * 512;

if (CURRENT->sector + CURRENT->current_nr_sectors >
        spull_gendisk.part[minor].nr_sects) {
    printk(KERN_WARNING "spull: request past end of device\n");
    end_request(0);
    continue;
}

The number of sectors is multiplied by 512, the sector size (which is hardwired in spull), to get the size of the transfer in bytes.
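
The translation from a partition-relative sector to a position on the unit can be tested in isolation. The following is a user-space sketch, not the spull code: map_request and struct part_range are hypothetical, and unlike the excerpt above the sketch performs the bounds check before computing the offset.

```c
#include <assert.h>
#include <stddef.h>

#define SECTOR_SIZE 512L   /* hardwired, as in spull */

struct part_range { long start_sect; long nr_sects; };  /* assumed layout */

/*
 * Translate a partition-relative request into a byte offset on the unit.
 * Returns 0 and stores the offset, or -1 if the request runs past the
 * end of the partition.
 */
static int map_request(const struct part_range *p, long sector, long nr,
                       long *byte_offset)
{
    assert(p != NULL && byte_offset != NULL);
    if (sector < 0 || nr < 0 || sector + nr > p->nr_sects)
        return -1;
    *byte_offset = SECTOR_SIZE * (p->start_sect + sector);
    return 0;
}
```

For a partition that starts at sector 100 and is 50 sectors long, sector 0 of the partition maps to byte 51200 of the unit, and a two-sector request starting at sector 49 is rejected.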
