When using nonblocking I/O, applications often exploit the select system call, which relies on a device method when it involves device files. This system call is also used to multiplex input from different sources. In the following discussion, I’m assuming that you understand the use of the select semantics in user space. Note that version 2.1.23 of the kernel introduced the poll system call, thus changing the way the driver method works in order to account for both the system calls.
The implementation of the select system call in Linux 2.0 uses a select_table structure to keep information about all the files (or devices) being waited for. Once again, you’re expected not to look inside the structure (but we’ll do it anyway a little later) and are allowed only to call the functions that act on such a structure.
When the select method discovers that there’s no need to block, it returns 1; when the process should wait, it should “almost” go to sleep. In this case, the correct wait queue is added to the select_table structure, and the function returns 0.
The process actually goes to sleep only if no file being selected can accept or return data. This happens in sys_select, within fs/select.c.
The code for the select operation is far easier to write than to describe, and it’s high time to show the implementation used in scull:
    int scull_p_select (struct inode *inode, struct file *filp,
                        int mode, select_table *table)
    {
        Scull_Pipe *dev = filp->private_data;

        if (mode == SEL_IN) {
            if (dev->rp != dev->wp) return 1; /* readable */
            PDEBUG("Waiting to read\n");
            select_wait(&dev->inq, table); /* wait for data */
            return 0;
        }
        if (mode == SEL_OUT) {
            /*
             * the buffer is full if "wp" is right behind "rp",
             * and the buffer is circular. "left" can't drop
             * to 0, as this would be taken as empty buffer;
             * so left == 1 means full, and left == 0 means
             * empty (and therefore writable)
             */
            int left = (dev->rp + dev->buffersize - dev->wp) % dev->buffersize;
            if (left != 1) return 1; /* writable */
            PDEBUG("Waiting to write\n");
            select_wait(&dev->outq, table); /* wait for free space */
            return 0;
        }
        return 0; /* never exception-able */
    }
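The two readiness tests can be tried outside the kernel. What follows is only a user-space sketch: struct pipe_model and the model_ functions are hypothetical names of mine, and plain offsets stand in for the driver's rp and wp pointers. It assumes the convention stated in the code comments, where rp == wp means an empty buffer and wp sitting one position behind rp means a full one.

```c
#include <stddef.h>

/* Hypothetical user-space stand-in for the scullpipe state;
 * offsets replace the driver's rp and wp pointers. */
struct pipe_model {
    size_t rp;          /* read offset into the circular buffer */
    size_t wp;          /* write offset into the circular buffer */
    size_t buffersize;  /* total size of the circular buffer */
};

/* Readable whenever the read and write positions differ. */
int model_readable(const struct pipe_model *p)
{
    return p->rp != p->wp;
}

/* Writable unless wp is immediately behind rp (left == 1).
 * left == 0 is the empty buffer, which is writable. */
int model_writable(const struct pipe_model *p)
{
    size_t left = (p->rp + p->buffersize - p->wp) % p->buffersize;
    return left != 1;
}
```

With an 8-byte buffer, the model holds at most seven bytes: one slot always stays unused so that a full buffer can be told apart from an empty one.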
There’s no code for the “third form of select,” selecting for exceptions. This form is identified by mode == SEL_EX, but most of the time you code it as the default case, to be executed when the other checks fail. The meaning of exception events is device-specific, so you can choose whether or not to use them in your own driver. Such a feature will be used only by programs specifically designed to use your driver, but that’s exactly its intent. In that respect, it is similar to the device-dependency of the ioctl call. In the real world, the main use of exception conditions in select is to signal arrival of Out-Of-Band (urgent) data on a network connection, though it is also used in the tty layer and in the pipe/FIFO implementation (you can look for SEL_EX in fs/pipe.c). Note, however, that other Unix systems don’t implement exception conditions for pipes and FIFOs.
The select code as shown is missing end-of-file support. When a read call is at end-of-file, it should return 0, and select must support this behavior by reporting that the device is readable, so the application will actually issue the read without waiting forever. With real FIFOs, for example, the reader sees an end-of-file when all the writers close the file, while in scullpipe the reader never sees end-of-file. The behavior is different because a FIFO is intended to be a communication channel between two processes, while scullpipe is a trashcan where everyone can put data as long as there’s at least one reader. Moreover, it makes no sense to reimplement what is already available in the kernel.
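The end-of-file behavior of a real pipe can be observed from user space. The sketch below is an illustration, not driver code: readable_now and eof_is_readable are hypothetical helper names, and the result describes the kernel's own pipe implementation rather than scullpipe.

```c
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Returns 1 if select reports fd readable right now, 0 if not, -1 on error. */
int readable_now(int fd)
{
    fd_set rfds;
    struct timeval tv = {0, 0};   /* zero timeout: check, do not sleep */

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    if (select(fd + 1, &rfds, NULL, NULL, &tv) < 0)
        return -1;
    return FD_ISSET(fd, &rfds) ? 1 : 0;
}

/* Closes the only writer of a fresh pipe and checks what the reader sees.
 * Returns 1 if the read end was unreadable while the writer was open,
 * readable once it closed, and read then returned 0; -1 if pipe fails. */
int eof_is_readable(void)
{
    int fds[2];
    char c;
    int before, after;
    ssize_t n;

    if (pipe(fds) < 0)
        return -1;
    before = readable_now(fds[0]);   /* writer open, no data yet */
    close(fds[1]);                   /* the last writer goes away */
    after = readable_now(fds[0]);    /* end-of-file counts as readable */
    n = read(fds[0], &c, 1);         /* returns 0 without blocking */
    close(fds[0]);
    return before == 0 && after == 1 && n == 0;
}
```

If select did not report the read end as readable at end-of-file, a program that waits in select before every read would never get to see the 0 return value.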
Implementing end-of-file as FIFOs do would mean checking dev->nwriters, both in read and in select-for-reading, and acting accordingly. Unfortunately though, if a reader opens the scullpipe device before the writer, it sees end-of-file, without having a chance to wait for data. The best way to fix this problem is to implement blocking within open, but this task is left as an exercise to the reader.
The purpose of the select call is to determine in advance if an I/O operation will block. In that respect, it complements read and write. select is also useful because it lets the application wait simultaneously for several data streams (but this is not relevant in the case at hand).
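As a reminder of what waiting for several data streams looks like from the application side, here is a minimal user-space sketch. The function name first_readable and the one-second timeout are arbitrary choices of mine; any two readable descriptors (pipes, ttys, sockets) can be passed in.

```c
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Waits up to one second for either descriptor to become readable.
 * Returns the readable descriptor (fd_a wins a tie), or -1 on
 * timeout or error. */
int first_readable(int fd_a, int fd_b)
{
    fd_set rfds;
    struct timeval tv = {1, 0};
    int maxfd = fd_a > fd_b ? fd_a : fd_b;

    FD_ZERO(&rfds);
    FD_SET(fd_a, &rfds);
    FD_SET(fd_b, &rfds);
    if (select(maxfd + 1, &rfds, NULL, NULL, &tv) <= 0)
        return -1;
    if (FD_ISSET(fd_a, &rfds))
        return fd_a;
    if (FD_ISSET(fd_b, &rfds))
        return fd_b;
    return -1;
}
```

The caller can then read from the returned descriptor knowing that the read will not block, which is exactly the guarantee the driver's select method has to uphold.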
A correct implementation of the three calls is fundamental in order to make applications work correctly. Though the following rules have more or less already been stated, I’ll summarize them here.
If there is data in the input buffer, the read call should return immediately, with no noticeable delay, even if less data than requested is available and the driver is sure the remaining data will arrive soon. You can always return less data than you’re asked for if this is convenient (we did it in scull), provided you return at least one byte. The implementation of bus mice in the current kernel is faulty in this respect, and several programs (like dd) fail to correctly read the device.
If there is no data in the input buffer, read must block until at least one byte is there, unless O_NONBLOCK is set. A nonblocking read returns immediately with a return value of -EAGAIN (although some old versions of System V return 0 in this case). select must report that the device is unreadable until at least one byte arrives. As soon as there is some data, we fall back to the previous case.
If we are at end-of-file, read should return immediately with a return value of 0, independent of O_NONBLOCK. select should report that the file is readable.
If there is space in the output buffer, write should return without delay. It can accept less data than the call requested, but it must accept at least one byte. In this case, select reports that the device is writable.
If the output buffer is full, write blocks until some space is freed, unless O_NONBLOCK is set. A nonblocking write returns immediately, with a return value of -EAGAIN (or conditionally 0, as stated previously for older System V reads). select should report that the file is not writable. If, on the other hand, the device is not able to accept any more data, write returns -ENOSPC (“No space left on device”), independently of O_NONBLOCK.
If the program using the device wants to ensure that the data it queues in the output buffer is actually transmitted, the driver must provide an fsync method. For instance, a removable device should have an fsync entry point. Never make a write call wait for data transmission before returning, even if O_NONBLOCK is clear. This is because many applications use select to find out whether a write will block. If the device is reported as writable, the call must consistently not block.
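Two of these rules are easy to check from user space against the kernel's own pipes. The sketch below uses hypothetical helper names (set_nonblock, empty_read_gives_eagain, full_write_gives_eagain); it shows how an application sees the -EAGAIN return value (as -1 with errno set to EAGAIN) and assumes that a full pipe is reported as not writable, which holds for Linux pipes but is a property of that implementation, not of scull.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Sets O_NONBLOCK on fd; returns 0 on success, -1 on failure. */
static int set_nonblock(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Returns 1 if a nonblocking read of an empty pipe fails with EAGAIN. */
int empty_read_gives_eagain(void)
{
    int fds[2], err, ok;
    char c;
    ssize_t n = 0;

    if (pipe(fds) < 0)
        return -1;
    ok = set_nonblock(fds[0]) == 0;
    if (ok)
        n = read(fds[0], &c, 1);    /* no data, a writer still exists */
    err = errno;
    close(fds[0]);
    close(fds[1]);
    return ok ? (n == -1 && err == EAGAIN) : -1;
}

/* Fills a pipe one byte at a time. Returns 1 if the first refused
 * write fails with EAGAIN and select then reports "not writable". */
int full_write_gives_eagain(void)
{
    int fds[2], err = 0, rc = -1, writable = 0;
    ssize_t n = 0;
    fd_set wfds;
    struct timeval tv = {0, 0};     /* zero timeout: check, do not sleep */

    if (pipe(fds) < 0)
        return -1;
    if (set_nonblock(fds[1]) == 0) {
        while ((n = write(fds[1], "x", 1)) == 1)
            ;                       /* one byte accepted per call */
        err = errno;
        FD_ZERO(&wfds);
        FD_SET(fds[1], &wfds);
        rc = select(fds[1] + 1, NULL, &wfds, NULL, &tv);
        writable = rc > 0 && FD_ISSET(fds[1], &wfds);
    }
    close(fds[0]);
    close(fds[1]);
    if (rc < 0)
        return -1;
    return n == -1 && err == EAGAIN && !writable;
}
```

Writing one byte at a time makes the point of the write rule visible: as long as the call can accept at least one byte it succeeds, and only a completely full buffer produces EAGAIN.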
We’ve seen how the write method doesn’t account for all data output needs. The fsync function, invoked by the system call of the same name, fills the gap.
If an application ever needs to be assured that data has been sent to the device, the fsync method must be implemented. A call to fsync should return only when the device has been completely flushed (i.e., the output buffer is empty), even if that takes some time, regardless of whether O_NONBLOCK is set.
The fsync method has no unusual features. The call isn’t time-critical, so every device driver can implement it to the author’s taste. Most of the time, char drivers just have a NULL pointer in their fops. Block devices, on the other hand, always implement the method by calling the general-purpose block_fsync, which in turn flushes all the blocks of the device, waiting for I/O to complete.
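From the application's point of view, the call looks like the following user-space sketch. The function write_and_sync and the scratch file name are hypothetical, and a regular file is used only because it is always available; with a device file, the same fsync system call would reach the driver's fsync method instead of the filesystem's.

```c
#include <stdlib.h>
#include <unistd.h>

/* Writes len bytes to a scratch file and waits until they are flushed.
 * Returns 0 on success, -1 on failure. */
int write_and_sync(const char *data, size_t len)
{
    char path[] = "/tmp/fsync_demo_XXXXXX";  /* hypothetical scratch name */
    int fd = mkstemp(path);
    int rc = -1;

    if (fd < 0)
        return -1;
    if (write(fd, data, len) == (ssize_t)len)
        rc = fsync(fd);   /* returns only once the data has been flushed */
    close(fd);
    unlink(path);
    return rc;
}
```

The write call returns as soon as the data is queued; it is the separate fsync call that pays the cost of waiting for the transfer to complete.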
The particular implementation of select used in 2.0 kernels is quite efficient and slightly complex. If you’re not interested in understanding the secrets of the operating system, you can jump directly to the next section.
First of all, I suggest that you look at Figure 5.2, which represents graphically the steps involved in making a select call. Looking at the figure will make it easier to follow the discussion.
The select work is performed by the functions select_wait, declared inline in <linux/sched.h>, and free_wait, defined in fs/select.c. The underlying data structure is an array of struct select_table_entry, where each entry is made up of a struct wait_queue and a struct wait_queue **. The former is the actual structure that gets inserted in the wait queue for the device (the one that only exists as a local variable when calling sleep_on), while the latter is the “handle” that’s needed to remove the current process from the queue when at least one of the selected conditions becomes true; for example, it contains &dev->inq when selecting scullpipe for reading (see the earlier example in Section 5.3).
In short, select_wait inserts the next free select_table_entry into the specified wait queue. When the system call returns, free_wait removes every entry from its own wait queue, using the associated pointer-pointer.
The select_table structure (made up of a pointer to the array of entries and the number of active entries) is declared as a local variable in do_select, similar to what happens for __sleep_on. The array of entries, on the other hand, resides in a different page, because it could overflow the stack page for the current process.
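The bookkeeping just described can be mimicked in a few lines of user-space C. This is a simplified model, not the kernel source: every model_ name and MODEL_MAX_ENTRIES are hypothetical, the wait queue is reduced to a NULL-terminated singly linked list, and nothing actually sleeps. It only shows how an entry is linked into a queue and later unlinked through its pointer-pointer.

```c
#include <stddef.h>

#define MODEL_MAX_ENTRIES 8            /* hypothetical table capacity */

struct model_wait {                    /* stands in for struct wait_queue */
    struct model_wait *next;
};

struct model_entry {                   /* stands in for a select table entry */
    struct model_wait wait;            /* node linked into the device queue */
    struct model_wait **wait_address;  /* handle: the queue the node is on */
};

struct model_table {                   /* stands in for select_table */
    int nr;                            /* number of active entries */
    struct model_entry *entry;         /* array of MODEL_MAX_ENTRIES slots */
};

/* Link the next free entry into the queue found at wait_address. */
void model_select_wait(struct model_wait **wait_address,
                       struct model_table *t)
{
    struct model_entry *e;

    if (t == NULL || wait_address == NULL || t->nr >= MODEL_MAX_ENTRIES)
        return;
    e = &t->entry[t->nr++];
    e->wait_address = wait_address;
    e->wait.next = *wait_address;      /* push at the head of the queue */
    *wait_address = &e->wait;
}

/* Unlink every active entry from its own queue, newest first. */
void model_free_wait(struct model_table *t)
{
    while (t->nr > 0) {
        struct model_entry *e = &t->entry[--t->nr];
        struct model_wait **pp = e->wait_address;

        while (*pp != NULL && *pp != &e->wait)
            pp = &(*pp)->next;         /* walk to the link pointing at us */
        if (*pp != NULL)
            *pp = e->wait.next;        /* splice the node out */
    }
}
```

In this model, a device queue such as inq would be a struct model_wait * that starts out NULL; passing its address to model_select_wait plays the role of the select_wait(&dev->inq, table) call in the driver.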
If you’re having trouble understanding this description, try looking at the source code. Once you understand the implementation, you’ll see that it is compact and efficient.