Blocking I/O

One problem that might arise with read is what to do when there’s no data yet, but we’re not at end-of-file.

The default answer is ``we must go to sleep waiting for data.'' This section shows how a process is put to sleep, how it is awakened, and how an application can ask if there is data, without blocking within the read call. We’ll then apply the same concepts to write.

As usual, before I show you the real code, I’ll explain a few concepts.

Going to Sleep and Awakening

When a process is waiting for an event (be it input data, the termination of a child process, or whatever else) it should be put to sleep so another process can use the computational resources. You can put a process to sleep by calling one of the following functions:

void interruptible_sleep_on(struct wait_queue **q);
void sleep_on(struct wait_queue **q);

Processes are then awakened by one of:

void wake_up_interruptible(struct wait_queue **q);
void wake_up(struct wait_queue **q);

In the preceding functions, the wait_queue pointer-pointer is used to refer to an event; we’ll discuss it in detail later in Section 5.2.3. For now, it will suffice to say that processes are awakened using the same queue that put them to sleep. Thus, you’ll need one wait queue for each event that can block processes. If you manage four devices, you’ll need four wait queues for blocking-read and four for blocking-write. The preferred place to put such queues is the hardware data structure associated with each device (Scull_Dev in our example).

But what’s the difference between ``interruptible'' and plain calls?

sleep_on can’t be aborted by a signal, while interruptible_sleep_on can. In practice, sleep_on is called only by critical sections of the kernel; for example, while waiting for a swap page to be read from disk. The process can’t proceed without the page, and interrupting the operation with a signal doesn’t make sense. interruptible_sleep_on, on the other hand, is used during so-called ``long system calls,'' like read. It does make sense to kill a process with a signal while it’s waiting for keyboard input.

Similarly, wake_up wakes any process sleeping on the queue, while wake_up_interruptible wakes only interruptible processes.

As a driver writer, you’ll call interruptible_sleep_on and wake_up_interruptible, because a process sleeps in the driver’s code only during read or write. Actually, you could call wake_up as well, since no ``uninterruptible'' processes will sleep on your queue. However, that’s not usually done, for the sake of consistency in the source code. (wake_up is also slightly slower than its counterpart.)

Writing Reentrant Code

When a process is put to sleep, the driver is still alive and can be called by another process. Let’s consider the console driver as an example. While an application is waiting for keyboard input on tty1, the user switches to tty2 and spawns a new shell. Now both shells are waiting for keyboard input within the console driver, although they sleep on different wait queues: one on the queue associated with tty1 and the other on the queue associated with tty2. Each process is locked within the interruptible_sleep_on function, but the driver can still receive and answer requests from other ttys.

Such situations can be handled painlessly by writing ``reentrant code.'' Reentrant code is code that doesn’t keep status information in global variables and thus is able to manage interwoven invocation without mixing anything up. If all the status information is process-specific, no interference will ever happen.

If status information is needed, it can either be kept in local variables within the driver function (each process has a different stack page where local variables are stored), or it can reside in private_data within the filp accessing the file. Using local variables is preferred, because sometimes the same filp can be shared between two processes (usually parent and child).

If you need to save large amounts of status data, you can keep the pointer in a local variable and use kmalloc to retrieve the actual storage space. In this case you must remember to kfree the data, because there’s no equivalent to ``everything is released at process termination'' when you’re working in kernel space.

You need to make reentrant any function that calls a flavor of sleep_on (or just schedule) and any function that can be in its call-trace. If sample_read calls sample_getdata, which in turn can block, then sample_read must be reentrant as well as sample_getdata, because nothing prevents another process from calling it while it is already executing on behalf of a process that went to sleep. Moreover, any function that copies data to or from user space must be reentrant, as access to user space might page-fault, and the process will be put to sleep while the kernel deals with the missing page.
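The difference can be shown with a minimal user-space sketch; it is not driver code, and the names format_id_static and format_id_reentrant are hypothetical, invented only for this illustration. The first function keeps its result in a static buffer, so a second, interleaved call overwrites what the first caller was given. The second keeps all of its state in storage supplied by the caller, which plays the role that local variables or kmalloc'ed memory play in a driver.

```c
#include <stdio.h>

/* NOT reentrant: the result lives in one static buffer shared by all callers. */
const char *format_id_static(int id)
{
    static char buf[32];

    snprintf(buf, sizeof(buf), "dev%d", id);
    return buf;
}

/* Reentrant: each caller passes its own storage, so calls cannot interfere. */
const char *format_id_reentrant(int id, char *buf, size_t len)
{
    snprintf(buf, len, "dev%d", id);
    return buf;
}
```

If two callers invoke format_id_static one after the other and then compare results, both see the string produced by the later call. That is exactly the kind of mix-up that happens when a driver function sleeps while holding status in a global variable.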

Wait Queues

The next question I hear you ask is, ``How exactly can I use a wait queue?''

A wait queue is easy to use, although its design is quite subtle and you are not expected to peek at its internals. The best way to deal with wait queues is to stick to the following operations:

  • Declare a struct wait_queue * variable. You need one such pointer variable for each event that can put processes to sleep. This is the item that I suggested you put in the structure describing hardware features.

  • Pass a pointer to this variable as argument to the various sleep_on and wake_up functions.

It’s that easy. For example, let’s imagine you want to put a process to sleep when it reads your device and awaken it when someone else writes to the device. The following code does just that:

struct wait_queue *wq = NULL; /* must be zeroed at the beginning */

read_write_t sleepy_read (struct inode *inode, struct file *filp,
                          char *buf, count_t count)
{
    printk(KERN_DEBUG "process %i (%s) going to sleep\n",
           current->pid, current->comm);
    interruptible_sleep_on(&wq);
    printk(KERN_DEBUG "awoken %i (%s)\n", current->pid, current->comm);
    return 0; /* EOF */
}

read_write_t sleepy_write (struct inode *inode, struct file *filp,
                           const char *buf, count_t count)
{
    printk(KERN_DEBUG "process %i (%s) awakening the readers...\n",
           current->pid, current->comm);
    wake_up_interruptible(&wq);
    return count; /* succeed, to avoid a retry */
}

The code for this device is available as sleepy in the example programs and can be tested using cat and input/output redirection, as usual.

The two operations listed above are the only ones you are allowed to use with a wait queue. However, I know that some readers might be interested in the internals and grasping them from the sources can be difficult. If you’re not interested in more detail, you can skip to the next subsection without missing anything. Note that I talk about the ``current'' implementation (version 2.0.x), but there’s nothing forcing kernel developers to stick to that implementation. If a better one comes along, the kernel can easily switch to the new one without bad effects as long as driver writers use the wait queue only through the two legal operations.

The current implementation of struct wait_queue uses two fields: a pointer to struct task_struct (the waiting process), and a pointer to struct wait_queue (the next item in the list). A wait queue is always circular, with the last structure pointing to the first.

The compelling feature of the design is that driver writers never declare or use such a structure; they only pass along pointers and pointer-pointers. Actual structures do exist, but only in one place: as a local variable within the function __sleep_on, which is called by both the sleep_on functions introduced above.

Strange as it appears, this is really a smart choice, because there’s no need to deal with allocation and deallocation of such structures. A process sleeps on a single queue at a time, and the data structure describing its sleeping exists in the non-swappable stack page associated with the process.

The actual operations performed when a process is added to or removed from a wait queue are schematically represented in Figure 5-1.

Figure 5-1. The workings of wait queues
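The two-field layout described above can be modeled in user space. The following is only a simplified sketch under stated assumptions: struct task is a stand-in for struct task_struct, and wq_add, wq_remove, and wq_count are hypothetical helpers, not kernel functions. The real code also protects the list against interrupts, which is omitted here. The sketch shows only how a circular, singly linked list whose entries live on the sleepers' stacks can grow and shrink without any allocation.

```c
#include <stddef.h>

struct task { int pid; };              /* stand-in for struct task_struct */

struct wait_queue {
    struct task *task;                 /* the sleeping process */
    struct wait_queue *next;           /* next entry; the last points to the first */
};

/* Link node into the circle, or make it the only entry of an empty queue. */
void wq_add(struct wait_queue **q, struct wait_queue *node)
{
    if (*q == NULL) {
        node->next = node;             /* a single entry points to itself */
        *q = node;
    } else {
        node->next = (*q)->next;
        (*q)->next = node;
    }
}

/* Unlink node (which must be on the queue) by finding its predecessor. */
void wq_remove(struct wait_queue **q, struct wait_queue *node)
{
    struct wait_queue *prev = node;

    while (prev->next != node)
        prev = prev->next;
    if (prev == node) {                /* it was the only entry */
        *q = NULL;
    } else {
        prev->next = node->next;
        if (*q == node)
            *q = node->next;
    }
    node->next = NULL;
}

/* Count the sleepers by going once around the circle. */
int wq_count(const struct wait_queue *head)
{
    const struct wait_queue *p = head;
    int n = 0;

    if (p == NULL)
        return 0;
    do {
        n++;
        p = p->next;
    } while (p != head);
    return n;
}
```

In this model each caller would declare its struct wait_queue as a local variable, add it before sleeping, and remove it after waking, which mirrors the role of the local variable inside __sleep_on.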

Blocking and Nonblocking Operations

There is another point we need to touch on before we look at the implementation of full-featured read and write methods, and that is the O_NONBLOCK flag in filp->f_flags. The flag is defined in <linux/fcntl.h>, which is automatically included by <linux/fs.h> in recent kernels. You should include <linux/fcntl.h> manually if you want your module to compile with version 1.2 of the kernel.

The flag gets its name from ``open-nonblock,'' because it can be specified at open time (and originally could only be specified there). The flag is cleared by default, because the normal behavior of a process waiting for data is just to sleep. In the case of a blocking operation, the following behavior should be implemented:

  • If a process calls read, but no data is (yet) available, the process must block. The process is awakened as soon as some data arrives, and that data is returned to the caller, even if there is less than the amount requested in the count argument to the method.

  • If a process calls write and there is no space in the buffer, the process must block, and it must be on a different wait queue from the one used for reading. When some data has been written to the device, and space becomes free in the output buffer, the process is awakened, and the write call succeeds, although the data may be only partially written if there isn’t room in the buffer for the count bytes that were requested.

Both statements in the previous list assume that there is an input and an output buffer, and almost every device driver has them. The input buffer is required to avoid losing data that arrives when nobody is reading, and the output buffer is useful for squeezing more performance out of the computer, though it’s not strictly compulsory. Data can’t be lost on write, because if the system call doesn’t accept data bytes, they remain in the user-space buffer.

The performance gain of implementing an output buffer in the driver results from the diminished number of context switches and user-level/kernel-level transitions. Without an output buffer (assuming a slow device), only one or a few characters are accepted by each system call, and while one process sleeps in write, another process runs (that’s one context switch). When the first process is awakened, it resumes (another context switch), write returns (kernel/user transition), and the process reiterates the system call to write more data (user/kernel transition); the call blocks, and the loop continues. If the output buffer is big enough, write succeeds on the first attempt; data is pushed out to the device at interrupt time, without control ever going back to user space. The choice of a suitable dimension for the output buffer is clearly device-specific.

We didn’t use an input buffer in scull, because data is already available when read is issued. Similarly, no output buffer was used, as data is simply copied to the memory area associated with the device. We’ll see the use of buffers in Chapter 9, in Section 9.6.

The behavior of read and write is different if O_NONBLOCK is specified. In this case, the calls simply return -EAGAIN if a process calls read when no data is available, or if it calls write when there’s no space in the buffer.

As you might expect, nonblocking operations return immediately, allowing the application to poll for data. Applications must be careful when using the stdio functions when dealing with nonblocking files, because you can easily mistake a nonblocking return for EOF. You always have to check errno.
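The need to check errno can be illustrated from the application's side. The following user-space sketch is an assumption-laden illustration, not part of the scull sources: set_nonblocking, nb_read, and the nb_result values are hypothetical names. It calls read directly rather than going through stdio and classifies the three outcomes that are easy to confuse: data, a real end-of-file, and ``nothing yet.''

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Outcome of one nonblocking read attempt. */
enum nb_result { NB_DATA, NB_EOF, NB_WOULD_BLOCK, NB_ERROR };

/* Turn on O_NONBLOCK while preserving the other file status flags. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/*
 * Read once and classify the result. len is assumed to be nonzero;
 * *n receives the return value of read().
 */
enum nb_result nb_read(int fd, char *buf, size_t len, ssize_t *n)
{
    *n = read(fd, buf, len);
    if (*n > 0)
        return NB_DATA;
    if (*n == 0)
        return NB_EOF;                 /* a real end-of-file */
    if (errno == EAGAIN || errno == EWOULDBLOCK)
        return NB_WOULD_BLOCK;         /* no data yet: try again later */
    return NB_ERROR;
}
```

On an empty pipe whose write end is still open, nb_read reports NB_WOULD_BLOCK; once the write end is closed and the data has been drained, it reports NB_EOF. A driver that returns -EAGAIN for nonblocking reads should look the same to this code.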

As you may imagine from its name, O_NONBLOCK is meaningful also in the open method. This happens when the call can actually block for a long time; for example, when opening a FIFO that has no writers (yet), or accessing a disk file with a pending lock. Usually, opening a device either succeeds or fails, without the need to wait for external events. Sometimes, however, opening the device requires a long initialization, and you may choose to check O_NONBLOCK, returning immediately with -EAGAIN (try it again) if the flag is set, after spawning device initialization. You might also decide to implement a blocking open to support access policies in a way similar to file locks. We’ll see one such implementation later in Section 5.6.3.

Only the read, write, and open file operations are affected by the nonblocking flag.

A Sample Implementation: scullpipe

The /dev/scullpipe devices (there are four of them by default) are part of the scull module and are used to show how blocking I/O is implemented.

Within a driver, a process blocked in a read call is awakened when data arrives; usually the hardware issues an interrupt to signal such an event, and the driver awakens processes while handling the interrupt. The goal of scull is different, since you should be able to run scull on any computer without requiring any particular hardware--and without any interrupt handler. I chose to use another process to generate the data and wake the reading process; similarly, reading processes are used to wake sleeping writer processes. The resulting implementation is similar to that of a FIFO (or ``named pipe'') filesystem node, whence the name.

The device driver uses a device structure that embeds two wait queues and a buffer. The size of the buffer is configurable in the usual ways (at compile time, load time, or run time).

typedef struct Scull_Pipe {
    struct wait_queue *inq, *outq;  /* read and write queues */
    char *buffer, *end;             /* begin of buf, end of buf */
    int buffersize;                 /* used in pointer arithmetic */
    char *rp, *wp;                  /* where to read, where to write */
    int nreaders, nwriters;         /* number of openings for r/w */
    struct fasync_struct *async_queue; /* asynchronous readers */
} Scull_Pipe;

The read implementation manages both blocking and nonblocking input and looks like this:

read_write_t scull_p_read (struct inode *inode, struct file *filp,
                           char *buf, count_t count)
{
    Scull_Pipe *dev = filp->private_data;

    while (dev->rp == dev->wp) { /* nothing to read */
        if (filp->f_flags & O_NONBLOCK)
            return -EAGAIN;
        PDEBUG("\"%s\" reading: going to sleep\n",current->comm);
        interruptible_sleep_on(&dev->inq);
        if (current->signal & ~current->blocked) /* a signal arrived */
          return -ERESTARTSYS;     /* tell the fs layer to handle it */
        /* otherwise loop */
    }
    /* ok, data is there, return something */
    if (dev->wp > dev->rp)
        count = min(count, dev->wp - dev->rp);
    else /* the write pointer has wrapped, return data up to dev->end */
        count = min(count, dev->end - dev->rp);
    memcpy_tofs(buf, dev->rp, count);
    dev->rp += count;
    if (dev->rp == dev->end)
        dev->rp = dev->buffer; /* wrapped */

    /* finally, awake any writers and return */
    wake_up_interruptible(&dev->outq);
    PDEBUG("\"%s\" did read %li bytes\n",current->comm, (long)count);
    return count;
}

As you can see, I left some PDEBUG statements in the code. When you compile the driver, you can enable messaging to make it easier to follow the interaction of different processes.

The if statement that follows interruptible_sleep_on takes care of signal handling. This statement ensures the proper and expected reaction to signals, which is to let the kernel take care of restarting the system call or returning -EINTR (the kernel handles -ERESTARTSYS internally, and what reaches user space is -EINTR instead). We don’t want the kernel to do this for blocked signals, though, because we want to ignore them. That is why we check current->blocked and screen out those signals. Otherwise, we pass a -ERESTARTSYS error value back to let the kernel do its work. We’ll use the same statement to deal with signal handling for every read and write implementation.
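What the application sees when the call is not restarted can be observed with a short user-space sketch. It is not part of the scull sources: errno_after_interrupted_read and on_alarm are hypothetical names, and an empty pipe stands in for a device with no data. The handler is installed without SA_RESTART, so the kernel does not restart the system call and read fails with EINTR.

```c
#define _XOPEN_SOURCE 700              /* for sigaction() and setitimer() */
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

static void on_alarm(int sig)
{
    (void)sig;                         /* nothing to do: arrival is enough */
}

/*
 * Block in read() on an empty pipe and let SIGALRM interrupt it.
 * Returns the errno value seen after read() fails, 0 if read() did
 * not fail, or -1 if the setup itself failed.
 */
int errno_after_interrupted_read(void)
{
    struct sigaction sa, old;
    struct itimerval tv;
    int p[2], saved;
    char c;
    ssize_t n;

    if (pipe(p) < 0)
        return -1;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;                   /* no SA_RESTART: do not restart read() */
    if (sigaction(SIGALRM, &sa, &old) < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    memset(&tv, 0, sizeof(tv));
    tv.it_value.tv_usec = 50000;       /* first expiry after 50 ms */
    tv.it_interval.tv_usec = 50000;    /* keep firing until disarmed */
    setitimer(ITIMER_REAL, &tv, NULL);

    n = read(p[0], &c, 1);             /* sleeps: the pipe is empty */
    saved = (n < 0) ? errno : 0;

    memset(&tv, 0, sizeof(tv));
    setitimer(ITIMER_REAL, &tv, NULL); /* disarm the timer */
    sigaction(SIGALRM, &old, NULL);    /* restore the previous handler */
    close(p[0]);
    close(p[1]);
    return saved;
}
```

Had the handler been installed with SA_RESTART, the read would be restarted after the handler returned, and the process would go back to sleep instead of seeing an error.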

The implementation for write is quite similar to that for read. Its only ``peculiar'' feature is that it never completely fills the buffer, always leaving a hole of at least one byte. Thus when the buffer is empty, wp and rp are equal; when there is data there, they are always different.

read_write_t scull_p_write (struct inode *inode, struct file *filp,
                            const char *buf, count_t count)
{
    Scull_Pipe *dev = filp->private_data;
    /* left is the free space including the one-byte hole (0 if empty) */
    int left = (dev->rp + dev->buffersize - dev->wp) % dev->buffersize;

    PDEBUG("write: left is %i\n",left);
    while (left==1) { /* full: only the hole is left */
        if (filp->f_flags & O_NONBLOCK)
            return -EAGAIN;
        PDEBUG("\"%s\" writing: going to sleep\n",current->comm);
        interruptible_sleep_on(&dev->outq);
        if (current->signal & ~current->blocked) /* a signal arrived */
          return -ERESTARTSYS; /* tell the fs layer to handle it */
        /* otherwise loop, but recalculate free space */
        left = (dev->rp + dev->buffersize - dev->wp) % dev->buffersize;
    }
    /* ok, space is there, accept something */
    if (dev->wp >= dev->rp) {
        count = min(count, dev->end - dev->wp); /* up to */
                                                /* end-of-buffer */
        /* leave a hole: wp must never wrap onto rp */
        if (dev->wp + count == dev->end && dev->rp == dev->buffer)
            count--;
    }
    else /* the write pointer has wrapped, fill up to rp-1 */
        count = min(count, dev->rp - dev->wp - 1);
    PDEBUG("Going to accept %li bytes to %p from %p\n",
           (long)count, dev->wp, buf);
    memcpy_fromfs(dev->wp, buf, count);
    dev->wp += count;
    if (dev->wp == dev->end)
        dev->wp = dev->buffer; /* wrapped */

    /* finally, awake any reader */
    wake_up_interruptible(&dev->inq);  /* blocked in read() */
                                       /* and select() */
    if (dev->async_queue)
        kill_fasync (dev->async_queue, SIGIO); /* asynchr. readers */
    PDEBUG("\"%s\" did write %li bytes\n",current->comm, (long)count);
    return count;
}
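The buffer arithmetic used by the two methods can be tried out in user space. The sketch below is a simplified model, not the scull code: it uses offsets instead of pointers, a fixed and deliberately tiny RING_SIZE, and hypothetical names (struct ring, ring_used, ring_free, ring_write, ring_read). It keeps the same two conventions as the driver: rp equal to wp means ``empty,'' and one byte is always left unused so that a full buffer never looks empty. Like the driver, each call transfers at most one contiguous chunk and may therefore accept or return less than was requested.

```c
#include <stddef.h>
#include <string.h>

#define RING_SIZE 8                    /* hypothetical size, tiny for testing */

struct ring {
    char buffer[RING_SIZE];
    size_t rp, wp;                     /* read and write offsets */
};

/* Bytes available for reading: zero exactly when rp == wp. */
size_t ring_used(const struct ring *r)
{
    return (r->wp + RING_SIZE - r->rp) % RING_SIZE;
}

/* Bytes that can still be written; one byte always stays free as the hole. */
size_t ring_free(const struct ring *r)
{
    return RING_SIZE - 1 - ring_used(r);
}

/* Accept one contiguous chunk, stopping at the end of the buffer. */
size_t ring_write(struct ring *r, const char *src, size_t count)
{
    size_t to_end = RING_SIZE - r->wp; /* contiguous room before the wrap */
    size_t n = ring_free(r);

    if (n > count)
        n = count;
    if (n > to_end)
        n = to_end;
    memcpy(r->buffer + r->wp, src, n);
    r->wp = (r->wp + n) % RING_SIZE;   /* wraps to 0 at the end */
    return n;
}

/* Return one contiguous chunk, stopping at the end of the buffer. */
size_t ring_read(struct ring *r, char *dst, size_t count)
{
    size_t to_end = RING_SIZE - r->rp; /* contiguous data before the wrap */
    size_t n = ring_used(r);

    if (n > count)
        n = count;
    if (n > to_end)
        n = to_end;
    memcpy(dst, r->buffer + r->rp, n);
    r->rp = (r->rp + n) % RING_SIZE;
    return n;
}
```

With RING_SIZE set to 8, an empty ring accepts at most 7 bytes in one call, and a ring_write that returns 0 corresponds to the situation in which the driver would either sleep or return -EAGAIN.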

The device, as I conceived it, doesn’t implement blocking open and is simpler than a real FIFO. If you want to look at the real thing, you can find it in fs/pipe.c, in the kernel sources.

To test the blocking operation of the scullpipe device, you can run some programs on it, using input/output redirection as usual. Testing nonblocking activity is trickier, as the conventional programs don’t perform nonblocking operations. The misc-progs source directory contains the following simple program, called nbtest, for testing nonblocking operations. All it does is copy its input to its output, using nonblocking I/O and delaying between retries. The delay time is passed on the command line and is one second by default.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>

char buffer[4096];

int main(int argc, char **argv)
{
    int delay=1, n, m=0;

    if (argc>1) delay=atoi(argv[1]);
    fcntl(0, F_SETFL, fcntl(0,F_GETFL) | O_NONBLOCK); /* stdin */
    fcntl(1, F_SETFL, fcntl(1,F_GETFL) | O_NONBLOCK); /* stdout */

    while (1) {
        n=read(0, buffer, 4096);
        if (n>=0)
            m=write(1, buffer, n);
        if ((n<0 || m<0) && (errno != EAGAIN))
            break; /* a real error, not just "try again" */
        sleep(delay);
    }
    perror( n<0 ? "stdin" : "stdout");
    exit(1);
}