One problem that might arise with read is what to do when there’s no data yet, but we’re not at end-of-file.
The default answer is ``we must go to sleep waiting for data.'' This section shows how a process is put to sleep, how it is awakened, and how an application can ask if there is data, without blocking within the read call. We’ll then apply the same concepts to write.
As usual, before I show you the real code, I’ll explain a few concepts.
When a process is waiting for an event (be it input data, the termination of a child process, or whatever else) it should be put to sleep so another process can use the computational resources. You can put a process to sleep by calling one of the following functions:
    void interruptible_sleep_on(struct wait_queue **q);
    void sleep_on(struct wait_queue **q);
Processes are then awakened by one of:
    void wake_up_interruptible(struct wait_queue **q);
    void wake_up(struct wait_queue **q);
In the preceding functions, the wait_queue pointer-pointer is used to refer to an event; we’ll discuss it in detail later in Section 5.2.3. For now, it will suffice to say that processes are awakened using the same queue that put them to sleep. Thus, you’ll need one wait queue for each event that can block processes. If you manage four devices, you’ll need four wait queues for blocking-read and four for blocking-write. The preferred place to put such queues is the hardware data structure associated with each device (Scull_Dev in our example).
But what’s the difference between ``interruptible'' and plain calls?
sleep_on can’t be aborted by a signal, while interruptible_sleep_on can. In practice, sleep_on is called only by critical sections of the kernel; for example, while waiting for a swap page to be read from disk. The process can’t proceed without the page, and interrupting the operation with a signal doesn’t make sense. interruptible_sleep_on, on the other hand, is used during so-called ``long system calls,'' like read. It does make sense to kill a process with a signal while it’s waiting for keyboard input.
Similarly, wake_up wakes any process sleeping on the queue, while wake_up_interruptible wakes only interruptible processes.
As a driver writer, you’ll call interruptible_sleep_on and wake_up_interruptible, because a process sleeps in the driver’s code only during read or write. Actually, you could call wake_up as well, since no ``uninterruptible'' processes will sleep on your queue. However, that’s not usually done, for the sake of consistency in the source code. (In addition, wake_up is slightly slower than its counterpart.)
When a process is put to sleep, the driver is still alive and can be called by another process. Let’s consider the console driver as an example. While an application is waiting for keyboard input on tty1, the user switches to tty2 and spawns a new shell. Now both shells are waiting for keyboard input within the console driver, although they sleep on different wait queues: one on the queue associated with tty1 and the other on the queue associated with tty2. Each process is locked within the interruptible_sleep_on function, but the driver can still receive and answer requests from other ttys.
Such situations can be handled painlessly by writing ``reentrant code.'' Reentrant code is code that doesn’t keep status information in global variables and thus is able to manage interwoven invocation without mixing anything up. If all the status information is process-specific, no interference will ever happen.
If status information is needed, it can either be kept in local variables within the driver function (each process has a different stack page where local variables are stored), or it can reside in private_data within the filp accessing the file. Using local variables is preferred, because sometimes the same filp can be shared between two processes (usually parent and child).
If you need to save large amounts of status data, you can keep the pointer in a local variable and use kmalloc to retrieve the actual storage space. In this case you must remember to kfree the data, because there’s no equivalent to ``everything is released at process termination'' when you’re working in kernel space.
You need to make reentrant any function that calls a flavor of sleep_on (or just schedule) and any function that can be in its call-trace. If sample_read calls sample_getdata, which in turn can block, then sample_read must be reentrant as well as sample_getdata, because nothing prevents another process from calling it while it is already executing on behalf of a process that went to sleep. Moreover, any function that copies data to or from user space must be reentrant, as access to user space might page-fault, and the process will be put to sleep while the kernel deals with the missing page.
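The difference between shared and per-caller state can be shown with a small user-space sketch. It is not driver code, and the function names label_shared and label_reentrant are made up for the illustration. The first function keeps its result in a static buffer, so interleaved callers overwrite each other; the second keeps all state in storage supplied by the caller, which is what a reentrant driver function does with its local variables.

```c
#include <assert.h>
#include <stdio.h>

/* Non-reentrant: every caller shares one static buffer, so a second
 * call overwrites the result the first caller may still be using. */
const char *label_shared(int id)
{
    static char buf[32];

    snprintf(buf, sizeof(buf), "dev%d", id);
    return buf;
}

/* Reentrant: all state lives in storage supplied by the caller,
 * typically a local variable on the caller's own stack. */
const char *label_reentrant(int id, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    snprintf(buf, len, "dev%d", id);
    return buf;
}
```

With label_shared, two interleaved callers end up looking at the same string; with label_reentrant, each caller keeps its own result no matter how the calls are interwoven.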
The next question I hear you ask is, ``How exactly can I use a wait queue?''
A wait queue is easy to use, although its design is quite subtle and you are not expected to peek at its internals. The best way to deal with wait queues is to stick to the following operations:
Declare a struct wait_queue * variable. You need one such pointer variable for each event that can put processes to sleep. This is the item that I suggested you put in the structure describing hardware features.
Pass a pointer to this variable as argument to the various sleep_on and wake_up functions.
It’s that easy. For example, let’s imagine you want to put a process to sleep when it reads your device and awaken it when someone else writes to the device. The following code does just that:
    struct wait_queue *wq = NULL; /* must be zeroed at the beginning */

    read_write_t sleepy_read (struct inode *inode, struct file *filp,
                              char *buf, count_t count)
    {
        printk(KERN_DEBUG "process %i (%s) going to sleep\n",
               current->pid, current->comm);
        interruptible_sleep_on(&wq);
        printk(KERN_DEBUG "awoken %i (%s)\n", current->pid, current->comm);
        return 0; /* EOF */
    }

    read_write_t sleepy_write (struct inode *inode, struct file *filp,
                               const char *buf, count_t count)
    {
        printk(KERN_DEBUG "process %i (%s) awakening the readers...\n",
               current->pid, current->comm);
        wake_up_interruptible(&wq);
        return count; /* succeed, to avoid retrial */
    }
The code for this device is available as sleepy in the example programs and can be tested using cat and input/output redirection, as usual.
The two operations listed above are the only ones you are allowed to use with a wait queue. However, I know that some readers might be interested in the internals and grasping them from the sources can be difficult. If you’re not interested in more detail, you can skip to the next subsection without missing anything. Note that I talk about the ``current'' implementation (version 2.0.x), but there’s nothing forcing kernel developers to stick to that implementation. If a better one comes along, the kernel can easily switch to the new one without bad effects as long as driver writers use the wait queue only through the two legal operations.
The current implementation of struct wait_queue uses two fields: a pointer to struct task_struct (the waiting process), and a pointer to struct wait_queue (the next item in the list). A wait queue is always circular, with the last structure pointing to the first.
The compelling feature of the design is that driver writers never declare or use such a structure; they only pass along pointers and pointer-pointers. Actual structures do exist, but only in one place: as a local variable within the function __sleep_on, which is called by both the sleep_on functions introduced above.
Strange as it appears, this is really a smart choice, because there’s no need to deal with allocation and deallocation of such structures. A process sleeps on a single queue at a time, and the data structure describing its sleeping exists in the non-swappable stack page associated with the process.
The actual operations performed when a process is added or removed from a wait queue are schematically represented in Figure 5.1.
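The idea of a circular list whose entries live on the sleeper's stack can be modeled in user space. The following is only a sketch of the concept, not the kernel's code: the names struct task, struct wait_entry, queue_add, queue_remove, queue_len, and sleep_model are invented for the example, and representing an empty queue as a NULL pointer is an assumption of the model.

```c
#include <assert.h>
#include <stddef.h>

struct task { int pid; };              /* stand-in for struct task_struct */

struct wait_entry {                    /* models the two-field wait_queue */
    struct task *task;                 /* the waiting process */
    struct wait_entry *next;           /* next item in the circular list */
};

/* Link an entry into the circle; *q is NULL while the queue is empty. */
void queue_add(struct wait_entry **q, struct wait_entry *e)
{
    if (*q == NULL) {
        e->next = e;                   /* a single entry points to itself */
        *q = e;
    } else {
        e->next = (*q)->next;
        (*q)->next = e;
    }
}

/* Unlink an entry by walking the circle to find its predecessor. */
void queue_remove(struct wait_entry **q, struct wait_entry *e)
{
    struct wait_entry *prev = e;

    while (prev->next != e)
        prev = prev->next;
    if (prev == e) {
        *q = NULL;                     /* e was the only entry */
    } else {
        prev->next = e->next;
        if (*q == e)
            *q = prev;
    }
    e->next = NULL;
}

/* Count the entries currently linked into the queue. */
int queue_len(const struct wait_entry *head)
{
    const struct wait_entry *p = head;
    int n = 0;

    if (head == NULL)
        return 0;
    do {
        n++;
        p = p->next;
    } while (p != head);
    return n;
}

/* The entry is a local variable: it exists only while the "sleep" lasts. */
int sleep_model(struct wait_entry **q, struct task *t)
{
    struct wait_entry wait = { t, NULL };
    int sleepers;

    queue_add(q, &wait);
    sleepers = queue_len(*q);          /* a real sleeper would block here */
    queue_remove(q, &wait);
    return sleepers;
}
```

Since the entry is linked in and unlinked within a single function, nothing has to be allocated or freed, which is the point made above about the design.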
There is another point we need to touch on before we look at the implementation of full-featured read and write methods, and that is the O_NONBLOCK flag in filp->f_flags. The flag is defined in <linux/fcntl.h>, which is automatically included by <linux/fs.h> in recent kernels. You should include fcntl.h manually if you want your module to compile with 1.2.
The flag gets its name from ``open-nonblock,'' because it can be specified at open time (and originally could only be specified there). The flag is reset by default, because the normal behavior of a process waiting for data is just sleeping. In the case of a blocking operation, the following behavior should be implemented:
If a process calls read, but no data is (yet) available, the process must block. The process is awakened as soon as some data arrives, and that data is returned to the caller, even if there is less than the amount requested in the count argument to the method.
If a process calls write and there is no space in the buffer, the process must block, and it must be on a different wait queue from the one used for reading. When some data has been written to the device, and space becomes free in the output buffer, the process is awakened, and the write call succeeds, although the data may be only partially written if there isn’t room in the buffer for the count bytes that were requested.
Both statements in the previous list assume that there is an input and an output buffer; in practice, nearly every device driver has them. The input buffer is required to avoid losing data that arrives when nobody is reading, and the output buffer is useful for squeezing more performance out of the computer, though it’s not strictly compulsory. Data can’t be lost on write, because if the system call doesn’t accept data bytes, they remain in the user-space buffer.
The performance gain of implementing an output buffer in the driver results from the diminished number of context switches and user-level/kernel-level transitions. Without an output buffer (assuming a slow device), only one or a few characters are accepted by each system call, and while one process sleeps in write, another process runs (that’s one context switch). When the first process is awakened, it resumes (another context switch), write returns (kernel/user transition), and the process reiterates the system call to write more data (user/kernel transition); the call blocks, and the loop continues. If the output buffer is big enough, write succeeds on the first attempt; data is pushed out to the device at interrupt time, without control ever going back to user space. The choice of a suitable dimension for the output buffer is clearly device-specific.
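Seen from the application side, the loop described above is the usual way to cope with a write call that accepts only part of the data. The following user-space sketch shows one possible form of it; the helper name write_all is hypothetical, and retrying on EINTR is a design choice of the example rather than something a driver requires.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Write all len bytes, reissuing write after each partial success.
 * Returns 0 on success, -1 on error (errno is left as set by write). */
int write_all(int fd, const char *buf, size_t len)
{
    assert(buf != NULL);
    while (len > 0) {
        ssize_t n = write(fd, buf, len);

        if (n < 0) {
            if (errno == EINTR)
                continue;      /* interrupted before writing anything: retry */
            return -1;
        }
        buf += n;              /* skip what the driver already accepted */
        len -= (size_t)n;
    }
    return 0;
}
```

Each iteration of this loop costs one user/kernel round trip, so the fewer bytes the driver accepts per call, the more often the loop runs.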
We didn’t use an input buffer in scull, because data is already available when read is issued. Similarly, no output buffer was used, as data is simply copied to the memory area associated with the device. We’ll see the use of buffers in Chapter 9, in Section 9.6.
The behavior of read and write is different if O_NONBLOCK is specified. In this case, the calls simply return -EAGAIN if a process calls read when no data is available, or if it calls write when there’s no space in the buffer.
As you might expect, nonblocking operations return immediately, allowing the application to poll for data. Applications must be careful when using the stdio functions when dealing with nonblocking files, because you can easily mistake a nonblocking return for EOF. You always have to check errno.
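At the level of the read system call, the three outcomes (data, end-of-file, and ``try again'') can be told apart as in the following user-space sketch. The names set_nonblocking, classify_read, and enum read_status are hypothetical helpers introduced for the example.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

enum read_status { READ_DATA, READ_EOF, READ_WOULD_BLOCK, READ_ERROR };

/* Set O_NONBLOCK on an open descriptor, preserving the other flags. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Issue one read and report what happened; *got receives read's result. */
enum read_status classify_read(int fd, char *buf, size_t len, ssize_t *got)
{
    ssize_t n = read(fd, buf, len);

    assert(got != NULL);
    *got = n;
    if (n > 0)
        return READ_DATA;
    if (n == 0)
        return READ_EOF;           /* a real end-of-file */
    if (errno == EAGAIN)
        return READ_WOULD_BLOCK;   /* no data yet: not EOF, not a failure */
    return READ_ERROR;
}
```

The point is that a return value of 0 and a return value of -1 with errno set to EAGAIN mean different things, and only the second one says ``come back later.''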
As you may imagine from its name, O_NONBLOCK is also meaningful in the open method. This happens when the call can actually block for a long time; for example, when opening a FIFO that has no writers (yet), or accessing a disk file with a pending lock. Usually, opening a device either succeeds or fails, without the need to wait for external events. Sometimes, however, opening the device requires a long initialization, and you may choose to check O_NONBLOCK, returning immediately with -EAGAIN (try it again) if the flag is set, after starting device initialization. You might also decide to implement a blocking open to support access policies in a way similar to file locks. We’ll see one such implementation later in Section 5.6.3.
Only the read, write, and open file operations are affected by the nonblocking flag.
The /dev/scullpipe devices (there are four of them by default) are part of the scull module and are used to show how blocking I/O is implemented.
Within a driver, a process blocked in a read call is awakened when data arrives; usually the hardware issues an interrupt to signal such an event, and the driver awakens processes while handling the interrupt. The goal of scull is different, since you should be able to run scull on any computer without requiring any particular hardware--and without any interrupt handler. I chose to use another process to generate the data and wake the reading process; similarly, reading processes are used to wake sleeping writer processes. The resulting implementation is similar to that of a FIFO (or ``named pipe'') filesystem node, whence the name.
The device driver uses a device structure that embeds two wait queues and a buffer. The size of the buffer is configurable in the usual ways (at compile time, load time, or run time).
    typedef struct Scull_Pipe {
        struct wait_queue *inq, *outq;      /* read and write queues */
        char *buffer, *end;                 /* begin of buf, end of buf */
        int buffersize;                     /* used in pointer arithmetic */
        char *rp, *wp;                      /* where to read, where to write */
        int nreaders, nwriters;             /* number of openings for r/w */
        struct fasync_struct *async_queue;  /* asynchronous readers */
    } Scull_Pipe;
The read implementation manages both blocking and nonblocking input and looks like this:
    read_write_t scull_p_read (struct inode *inode, struct file *filp,
                               char *buf, count_t count)
    {
        Scull_Pipe *dev = filp->private_data;

        while (dev->rp == dev->wp) { /* nothing to read */
            if (filp->f_flags & O_NONBLOCK)
                return -EAGAIN;
            PDEBUG("\"%s\" reading: going to sleep\n",current->comm);
            interruptible_sleep_on(&dev->inq);
            if (current->signal & ~current->blocked) /* a signal arrived */
                return -ERESTARTSYS; /* tell the fs layer to handle it */
            /* otherwise loop */
        }
        /* ok, data is there, return something */
        if (dev->wp > dev->rp)
            count = min(count, dev->wp - dev->rp);
        else /* the write pointer has wrapped, return data up to dev->end */
            count = min(count, dev->end - dev->rp);
        memcpy_tofs(buf, dev->rp, count);
        dev->rp += count;
        if (dev->rp == dev->end)
            dev->rp = dev->buffer; /* wrapped */

        /* finally, awake any writers and return */
        wake_up_interruptible(&dev->outq);
        PDEBUG("\"%s\" did read %li bytes\n",current->comm, (long)count);
        return count;
    }
As you can see, I left some PDEBUG statements in the code. When you compile the driver, you can enable messaging to make it easier to follow the interaction of different processes.
The if statement that follows interruptible_sleep_on takes care of signal handling. This statement ensures the proper and expected reaction to signals, which is to let the kernel take care of restarting the system call or returning -EINTR (the kernel handles -ERESTARTSYS internally, and what reaches user space is -EINTR instead). We don’t want the kernel to do this for blocked signals, though, because we want to ignore them. That is why we check current->blocked and screen out those signals. Otherwise, we pass a -ERESTARTSYS error value back to let the kernel do its work. We’ll use the same statement to deal with signal handling for every read and write implementation.
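The bitmask test itself is easy to try out in user space. The sketch below models the expression current->signal & ~current->blocked with two plain unsigned long values. The function names are hypothetical, and the convention that bit n-1 stands for signal number n is an assumption of the model, used only to build sample masks.

```c
#include <assert.h>
#include <signal.h>

/* Bit standing for signal signo in this model (assumed: bit signo-1). */
unsigned long sigbit(int signo)
{
    assert(signo >= 1 && signo <= 32);
    return 1UL << (signo - 1);
}

/* Model of the test "current->signal & ~current->blocked":
 * nonzero when at least one pending signal is not blocked. */
int deliverable_signal_pending(unsigned long pending, unsigned long blocked)
{
    return (pending & ~blocked) != 0;
}
```

A pending SIGINT makes the test succeed unless SIGINT is also set in the blocked mask, in which case the sleeper simply loops and goes back to sleep.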
The implementation for write is quite similar to that for read. Its only ``peculiar'' feature is that it never completely fills the buffer, always leaving a hole of at least one byte. Thus when the buffer is empty, wp and rp are equal; when there is data there, they are always different.
    read_write_t scull_p_write (struct inode *inode, struct file *filp,
                                const char *buf, count_t count)
    {
        Scull_Pipe *dev = filp->private_data;
        /* left is the free space in the buffer, but it must be positive */
        int left = (dev->rp + dev->buffersize - dev->wp) % dev->buffersize;
        if (left == 0) /* rp == wp: the buffer is empty */
            left = dev->buffersize;

        PDEBUG("write: left is %i\n",left);
        while (left==1) { /* full: only the one-byte hole is left */
            if (filp->f_flags & O_NONBLOCK)
                return -EAGAIN;
            PDEBUG("\"%s\" writing: going to sleep\n",current->comm);
            interruptible_sleep_on(&dev->outq);
            if (current->signal & ~current->blocked) /* a signal arrived */
                return -ERESTARTSYS; /* tell the fs layer to handle it */
            /* otherwise loop, but recalculate free space */
            left = (dev->rp + dev->buffersize - dev->wp) % dev->buffersize;
            if (left == 0) /* the readers emptied the buffer */
                left = dev->buffersize;
        }
        /* ok, space is there, accept something */
        if (dev->wp >= dev->rp) {
            count = min(count, dev->end - dev->wp); /* up to end-of-buffer */
            if (count == left) /* leave a hole, even if at e-o-b */
                count--;
        }
        else /* the write pointer has wrapped, fill up to rp-1 */
            count = min(count, dev->rp - dev->wp - 1);
        PDEBUG("Going to accept %li bytes to %p from %p\n",
               (long)count, dev->wp, buf);
        memcpy_fromfs(dev->wp, buf, count);
        dev->wp += count;
        if (dev->wp == dev->end)
            dev->wp = dev->buffer; /* wrapped */

        /* finally, awake any reader */
        wake_up_interruptible(&dev->inq); /* blocked in read() and select() */
        if (dev->async_queue)
            kill_fasync (dev->async_queue, SIGIO); /* asynchr. readers */
        PDEBUG("\"%s\" did write %li bytes\n",current->comm, (long)count);
        return count;
    }
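The one-byte-hole rule can be explored outside the kernel with a small model of the circular buffer. The sketch below is not the scullpipe code: it uses array indices instead of pointers, a fixed RING_SIZE of 8 bytes, and hypothetical names (struct ring, ring_free, ring_used, ring_write, ring_read). Like the driver methods, each call transfers at most up to the end of the array, so a transfer that wraps takes two calls.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RING_SIZE 8

struct ring {
    char buf[RING_SIZE];
    size_t rp, wp;          /* read and write positions; rp == wp is empty */
};

/* Bytes that can still be accepted while keeping the one-byte hole. */
size_t ring_free(const struct ring *r)
{
    return (r->rp + RING_SIZE - r->wp - 1) % RING_SIZE;
}

/* Bytes waiting to be read. */
size_t ring_used(const struct ring *r)
{
    return (r->wp + RING_SIZE - r->rp) % RING_SIZE;
}

/* Accept up to count bytes, stopping at the end of the array. */
size_t ring_write(struct ring *r, const char *src, size_t count)
{
    size_t room = ring_free(r);
    size_t to_end = RING_SIZE - r->wp;

    if (count > room)
        count = room;
    if (count > to_end)
        count = to_end;
    memcpy(r->buf + r->wp, src, count);
    r->wp = (r->wp + count) % RING_SIZE;    /* wrap at the end */
    return count;
}

/* Return up to count bytes, stopping at the end of the array. */
size_t ring_read(struct ring *r, char *dst, size_t count)
{
    size_t avail = ring_used(r);
    size_t to_end = RING_SIZE - r->rp;

    if (count > avail)
        count = avail;
    if (count > to_end)
        count = to_end;
    memcpy(dst, r->buf + r->rp, count);
    r->rp = (r->rp + count) % RING_SIZE;    /* wrap at the end */
    return count;
}
```

In this model an empty ring accepts at most RING_SIZE - 1 bytes, so after a write the two positions can never coincide, and rp == wp unambiguously means ``empty.''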
The device, as I conceived it, doesn’t implement blocking open and is simpler than a real FIFO. If you want to look at the real thing, you can find it in fs/pipe.c, in the kernel sources.
To test the blocking operation of the scullpipe device, you can run
some programs on it, using input/output redirection as
usual. Testing nonblocking activity is trickier, as the conventional
programs don’t perform nonblocking operations.
The misc-progs source directory contains the following simple program, called nbtest, for testing nonblocking operations. All it does is copy its input to its output, using nonblocking I/O and delaying between retrials. The delay time is passed on the command line and is one second by default.
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <errno.h>

    char buffer[4096];

    int main(int argc, char **argv)
    {
        int delay=1, n, m=0;

        if (argc>1) delay=atoi(argv[1]);
        fcntl(0, F_SETFL, fcntl(0,F_GETFL) | O_NONBLOCK); /* stdin */
        fcntl(1, F_SETFL, fcntl(1,F_GETFL) | O_NONBLOCK); /* stdout */

        while (1) {
            n=read(0, buffer, 4096);
            if (n>=0)
                m=write(1, buffer, n);
            if ((n<0 || m<0) && (errno != EAGAIN))
                break;
            sleep(delay);
        }
        perror( n<0 ? "stdin" : "stdout");
        exit(1);
    }