Select

When using nonblocking I/O, applications often exploit the select system call, which relies on a device method when it involves device files. This system call is also used to multiplex input from different sources. In the following discussion, I’m assuming that you understand the select semantics in user space. Note that version 2.1.23 of the kernel introduced the poll system call, thus changing the way the driver method works in order to account for both system calls.

The implementation of the select system call in Linux 2.0 uses a select_table structure to keep information about all the files (or devices) being waited for. Once again, you’re expected not to look inside the structure (but we’ll do it anyway a little later) and are allowed only to call the functions that act on such a structure.

When the select method discovers that there’s no need to block, it returns 1. When the process should wait, the method ``almost'' puts it to sleep: it adds the correct wait queue to the select_table structure and returns 0.

The process actually goes to sleep only if no file being selected can accept or return data. This happens in sys_select, within fs/select.c.

The code for the select operation is far easier to write than to describe, and it’s high time to show the implementation used in scull:

int scull_p_select (struct inode *inode, struct file *filp,
                    int mode, select_table *table)
{
    Scull_Pipe *dev = filp->private_data;

    if (mode == SEL_IN) {
        if (dev->rp != dev->wp) return 1; /* readable */
        PDEBUG("Waiting to read\n");
        select_wait(&dev->inq, table); /* wait for data */
        return 0;
    }
    if (mode == SEL_OUT) {
        /*
         * The buffer is circular, and it is full when "wp" is
         * right behind "rp" (left == 1). Writes never let "left"
         * drop to 0, because rp == wp means an empty buffer.
         */
        int left = (dev->rp + dev->buffersize - dev->wp) %
                    dev->buffersize;
        if (left != 1) return 1; /* writable, even when empty (left == 0) */
        PDEBUG("Waiting to write\n");
        select_wait(&dev->outq, table); /* wait for free space */
        return 0;
    }
    return 0; /* never exception-able */
}
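
The arithmetic behind the two checks is easier to follow outside the kernel. What follows is only a user-space sketch, not part of scull: the pipe_model structure and the model_ functions are hypothetical names, and plain offsets stand in for the driver's rp and wp pointers. I'm assuming that the buffer counts as full only when left is 1, as the comment in the listing describes, and that an empty buffer (left equal to 0) is writable.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model of the scullpipe buffer state;
 * offsets replace the driver's rp and wp pointers. */
struct pipe_model {
    size_t rp;          /* read position */
    size_t wp;          /* write position */
    size_t buffersize;  /* size of the circular buffer */
};

/* Mirrors the SEL_IN test: there is data whenever rp != wp. */
static int model_readable(const struct pipe_model *p)
{
    return p->rp != p->wp;
}

/* Distance from wp forward to rp: 0 when empty, 1 when full. */
static size_t model_left(const struct pipe_model *p)
{
    assert(p->buffersize > 0);
    assert(p->rp < p->buffersize && p->wp < p->buffersize);
    return (p->rp + p->buffersize - p->wp) % p->buffersize;
}

/* Assumption: writable unless the buffer is full (left == 1). */
static int model_writable(const struct pipe_model *p)
{
    return model_left(p) != 1;
}
```

With an 8-byte buffer, for instance, rp and wp both at 0 describe an empty buffer that is writable but not readable, while rp at 0 and wp at 7 describe a full one: seven bytes are stored, and the eighth slot is kept free so that a full buffer can be told apart from an empty one.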

There’s no code for the ``third form of select,'' selecting for exceptions. This form is identified by mode == SEL_EX, but most of the time you code it as the default case, to be executed when the other checks fail. The meaning of exception events is device-specific, so you can choose whether or not to use them in your own driver. Such a feature will be used only by programs specifically designed to use your driver, but that’s exactly its intent. In that respect, it is similar to the device-dependency of the ioctl call. In the real world, the main use of exception conditions in select is to signal arrival of Out-Of-Band (urgent) data on a network connection, though it is also used in the tty layer and in the pipe/FIFO implementation (you can look for SEL_EX in fs/pipe.c). Note, however, that other Unix systems don’t implement exception conditions for pipes and FIFOs.

The select code as shown is missing end-of-file support. When a read call is at end-of-file, it should return 0, and select must support this behavior by reporting that the device is readable, so the application will actually issue the read without waiting forever. With real FIFOs, for example, the reader sees an end-of-file when all the writers close the file, while in scullpipe the reader never sees end-of-file. The behavior is different because a FIFO is intended to be a communication channel between two processes, while scullpipe is a trashcan where everyone can put data as long as there’s at least one reader. Moreover, it makes no sense to reimplement what is already available in the kernel.

Implementing end-of-file as FIFOs do would mean checking dev->nwriters, both in read and in select-for-reading, and acting accordingly. Unfortunately though, if a reader opens the scullpipe device before the writer, it sees end-of-file, without having a chance to wait for data. The best way to fix this problem is to implement blocking within open, but this task is left as an exercise to the reader.
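
To make the problem concrete, here is a small user-space model of the FIFO-like policy just described. It is only a sketch under my own assumptions: eof_model, read_outcome, and the model_ functions are hypothetical names, and nwriters simply mirrors the counter mentioned in the text.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the device state relevant to end-of-file. */
struct eof_model {
    size_t rp, wp;   /* buffer is empty when rp == wp */
    int nwriters;    /* processes that have the device open for writing */
};

enum read_outcome { READ_DATA, READ_EOF, READ_WAIT };

/* What read would do: data first, then end-of-file, else wait. */
static enum read_outcome model_read(const struct eof_model *d)
{
    assert(d != NULL);
    if (d->rp != d->wp)
        return READ_DATA;   /* pending data is delivered before EOF */
    if (d->nwriters == 0)
        return READ_EOF;    /* no data and nobody left to write any */
    return READ_WAIT;       /* a writer exists, so data may still come */
}

/* select-for-reading must agree with read: report the device as
 * readable whenever read would return without waiting. */
static int model_select_in(const struct eof_model *d)
{
    return model_read(d) != READ_WAIT;
}
```

A freshly opened device with no writers has rp equal to wp and nwriters equal to 0, so the model reports end-of-file right away. That is exactly the case of the reader that opens the device before any writer.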

Interaction with read and write

The purpose of the select call is to determine in advance if an I/O operation will block. In that respect, it complements read and write. select is also useful because it lets the application wait simultaneously for several data streams (but this is not relevant in the case at hand).

A correct implementation of the three calls is fundamental in order to make applications work correctly. Though the following rules have more or less already been stated, I’ll summarize them here.

Reading data from the device

If there is data in the input buffer, the read call should return immediately, with no noticeable delay, even if less data than requested is available and the driver is sure the remaining data will arrive soon. You can always return less data than you’re asked for if this is convenient (we did it in scull), provided you return at least one byte. The implementation of bus mice in the current kernel is faulty in this respect, and several programs (like dd) fail to correctly read the device.

If there is no data in the input buffer, read must block until at least one byte is there, unless O_NONBLOCK is set. A nonblocking read returns immediately with a return value of -EAGAIN (although some old versions of System V return 0 in this case). select must report that the device is unreadable until at least one byte arrives. As soon as there is some data, we fall back to the previous case.

If we are at end-of-file, read should return immediately with a return value of 0, independent of O_NONBLOCK. select should report that the file is readable.
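
The three read rules can be observed from user space on an ordinary pipe, which the kernel already implements this way. The following is a sketch rather than a test of any driver: is_readable and check_read_rules are hypothetical helper names, and the program assumes a POSIX system where a nonblocking read on an empty pipe fails with EAGAIN. Note that the application sees -1 with errno set to EAGAIN, which is how the driver's -EAGAIN return value reaches user space.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Ask select whether fd is readable right now (zero timeout, so it
 * never sleeps). Returns 1 if readable, 0 if not, -1 on error. */
static int is_readable(int fd)
{
    fd_set rfds;
    struct timeval tv = { 0, 0 };

    assert(fd >= 0 && fd < FD_SETSIZE);
    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    return select(fd + 1, &rfds, NULL, NULL, &tv);
}

/* Walk through the three read cases on a nonblocking pipe. Returns 0
 * if every observation matches the rules, the number of the first
 * step that differs, or -1 if the pipe cannot be set up. */
static int check_read_rules(void)
{
    int fds[2], flags, step = 0;
    char buf[16];

    if (pipe(fds) < 0)
        return -1;
    flags = fcntl(fds[0], F_GETFL);
    if (flags < 0 || fcntl(fds[0], F_SETFL, flags | O_NONBLOCK) < 0)
        step = -1;

    /* No data: select says "not readable", read fails with EAGAIN. */
    else if (is_readable(fds[0]) != 0)
        step = 1;
    else if (read(fds[0], buf, sizeof buf) != -1 || errno != EAGAIN)
        step = 2;

    /* Some data: select says "readable", and read returns the 3
     * bytes available even though 16 were requested. */
    else if (write(fds[1], "abc", 3) != 3)
        step = 3;
    else if (is_readable(fds[0]) != 1)
        step = 4;
    else if (read(fds[0], buf, sizeof buf) != 3)
        step = 5;

    /* End-of-file: once the writer closes, select says "readable"
     * and read returns 0, even with O_NONBLOCK set. */
    else {
        close(fds[1]);
        fds[1] = -1;
        if (is_readable(fds[0]) != 1)
            step = 6;
        else if (read(fds[0], buf, sizeof buf) != 0)
            step = 7;
    }

    close(fds[0]);
    if (fds[1] >= 0)
        close(fds[1]);
    return step;
}
```

Calling check_read_rules from main and printing its return value is enough to try it. The empty-pipe case comes first only because a new pipe starts out empty.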

Writing to the device

If there is space in the output buffer, write should return without delay. It can accept less data than the call requested, but it must accept at least one byte. In this case, select reports that the device is writable.

If the output buffer is full, write blocks until some space is freed, unless O_NONBLOCK is set. A nonblocking write returns immediately, with a return value of -EAGAIN (or conditionally 0, as stated previously for older System V reads). select should report that the file is not writable. If, on the other hand, the device is not able to accept any more data, write returns -ENOSPC (``No space left on device''), independently of O_NONBLOCK.

If the program using the device wants to ensure that the data it queues in the output buffer is actually transmitted, the driver must provide an fsync method. For instance, a removable device should have an fsync entry point. Never make a write call wait for data transmission before returning, even if O_NONBLOCK is clear. This is because many applications use select to find out whether a write will block. If the device is reported as writable, the call must consistently not block.
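
The write rules can be checked in the same way on a pipe. Again, this is a sketch under assumptions rather than a statement about any particular driver: is_writable and check_write_rules are hypothetical names, the 512-byte chunk is an arbitrary size no larger than the smallest PIPE_BUF that POSIX allows, and the behavior of select on a full pipe is the one I expect on Linux.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Ask select whether fd is writable right now (zero timeout).
 * Returns 1 if writable, 0 if not, -1 on error. */
static int is_writable(int fd)
{
    fd_set wfds;
    struct timeval tv = { 0, 0 };

    assert(fd >= 0 && fd < FD_SETSIZE);
    FD_ZERO(&wfds);
    FD_SET(fd, &wfds);
    return select(fd + 1, NULL, &wfds, NULL, &tv);
}

/* Returns 0 if the pipe follows the write rules, the number of the
 * first step that differs, or -1 if the pipe cannot be set up. */
static int check_write_rules(void)
{
    static char chunk[512];   /* zero-filled payload */
    int fds[2], i, step = 0;

    if (pipe(fds) < 0)
        return -1;
    for (i = 0; i < 2; i++) {
        int flags = fcntl(fds[i], F_GETFL);
        if (flags < 0 || fcntl(fds[i], F_SETFL, flags | O_NONBLOCK) < 0)
            step = -1;
    }

    /* Space available: select reports the pipe as writable. */
    if (step == 0 && is_writable(fds[1]) != 1)
        step = 1;

    /* Fill the buffer. The nonblocking write that finds it full
     * fails with EAGAIN, and select stops reporting "writable". */
    if (step == 0) {
        while (write(fds[1], chunk, sizeof chunk) > 0)
            ;
        if (errno != EAGAIN)
            step = 2;
        else if (is_writable(fds[1]) != 0)
            step = 3;
    }

    /* Drain the buffer: the pipe becomes writable again. */
    if (step == 0) {
        while (read(fds[0], chunk, sizeof chunk) > 0)
            ;
        if (is_writable(fds[1]) != 1)
            step = 4;
    }

    close(fds[0]);
    close(fds[1]);
    return step;
}
```

The important point is the agreement between the two calls: whenever select reports the descriptor as writable, the following write accepts data without blocking.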

Flushing pending output

We’ve seen how the write method doesn’t account for all data output needs. The fsync function, invoked by the system call of the same name, fills the gap.

If an application will ever need to be sure that data has been sent to the device, the driver must implement the fsync method. A call to fsync should return only when the device has been completely flushed (i.e., the output buffer is empty), even if that takes some time, regardless of whether O_NONBLOCK is set.

The fsync method has no unusual features. The call isn’t time-critical, so every device driver can implement it to the author’s taste. Most of the time, char drivers just have a NULL pointer in their fops. Block devices, on the other hand, always implement the method by calling the general-purpose block_fsync, which in turn flushes all the blocks of the device, waiting for I/O to complete.
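
From the application's point of view, the division of labor between write and fsync looks like the following sketch. It uses a scratch file instead of a device, since that is the only thing I can assume is available everywhere; write_and_flush and flush_demo are hypothetical helpers, and the path under /tmp is just an example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

/* Queue all of data with write, then wait for it to be flushed.
 * Returns 0 on success, -1 on error (errno is left set). */
static int write_and_flush(int fd, const char *data, size_t len)
{
    assert(data != NULL);
    while (len > 0) {
        ssize_t n = write(fd, data, len);   /* may accept less than len */
        if (n < 0) {
            if (errno == EINTR)
                continue;                   /* interrupted, try again */
            return -1;
        }
        data += n;
        len -= (size_t)n;
    }
    return fsync(fd);   /* returns only after the data is flushed */
}

/* Exercise the helper on a temporary file; returns 0 on success. */
static int flush_demo(void)
{
    char path[] = "/tmp/fsync_demo_XXXXXX";   /* example scratch path */
    int fd = mkstemp(path), rc;

    if (fd < 0)
        return -1;
    rc = write_and_flush(fd, "hello\n", 6);
    close(fd);
    unlink(path);
    return rc;
}
```

The loop around write reflects the rule that a write call may accept less data than requested. Only the final fsync is allowed to wait for the output to complete.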

The Underlying Data Structure

The particular implementation of select used in 2.0 kernels is quite efficient and slightly complex. If you’re not interested in understanding the secrets of the operating system, you can jump directly to the next section.

First of all, I suggest that you look at Figure 5-2, which represents graphically the steps involved in making a select call. Looking at the figure will make it easier to follow the discussion.

Figure 5-2. The internals of select

The select work is performed by the functions select_wait, declared inline in <linux/sched.h>, and free_wait, defined in fs/select.c. The underlying data structure is an array of struct select_table_entry, where each entry is made up of a struct wait_queue and a struct wait_queue **. The former is the actual structure that gets inserted in the wait queue for the device (the one that only exists as a local variable when calling sleep_on), while the latter is the ``handle'' that’s needed to remove the current process from the queue when at least one of the selected conditions becomes true. For example, it contains &dev->inq when selecting scullpipe for reading (see the earlier example in Section 5.3).

In short, select_wait inserts the next free select_table_entry into the specified wait queue. When the system call returns, free_wait removes every entry from its own wait queue, using the associated pointer-pointer.

The select_table structure (made up of a pointer to the array of entries and the number of active entries) is declared as a local variable in do_select, similar to what happens for __sleep_on. The array of entries, on the other hand, resides in a different page, because it could overflow the stack page for the current process.
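
The relationship between the table, its entries, and the device queues can be modeled in a few lines of user-space C. This is a simplified sketch, not the kernel source: every identifier beginning with model_ or MODEL_ is hypothetical, an integer stands in for the current process, the capacity of 8 entries is arbitrary, and a plain NULL-terminated list replaces the kernel's wait queue.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_ENTRIES 8   /* assumption: small capacity for the demo */

struct model_wait_queue {
    int task;                         /* stands in for the process */
    struct model_wait_queue *next;
};

struct model_table_entry {
    struct model_wait_queue wait;            /* node put on the device queue */
    struct model_wait_queue **wait_address;  /* handle: the queue it went on */
};

struct model_select_table {
    int nr;                           /* number of active entries */
    struct model_table_entry *entry;  /* array kept outside the table */
};

static void model_add_wait_queue(struct model_wait_queue **q,
                                 struct model_wait_queue *w)
{
    w->next = *q;
    *q = w;
}

static void model_remove_wait_queue(struct model_wait_queue **q,
                                    struct model_wait_queue *w)
{
    while (*q && *q != w)
        q = &(*q)->next;
    if (*q)
        *q = w->next;
}

/* Take the next free entry and insert it into the given queue. */
static void model_select_wait(struct model_wait_queue **wait_address,
                              struct model_select_table *p, int task)
{
    struct model_table_entry *e;

    if (!p || !wait_address || p->nr >= MODEL_MAX_ENTRIES)
        return;
    e = p->entry + p->nr;
    e->wait_address = wait_address;
    e->wait.task = task;
    e->wait.next = NULL;
    model_add_wait_queue(wait_address, &e->wait);
    p->nr++;
}

/* Remove every entry from its own queue, using the stored handle. */
static void model_free_wait(struct model_select_table *p)
{
    assert(p != NULL && p->nr >= 0);
    while (p->nr > 0) {
        struct model_table_entry *e = p->entry + --p->nr;
        model_remove_wait_queue(e->wait_address, &e->wait);
    }
}
```

After two calls to model_select_wait on two different queues, the table holds two entries and each queue points at its own entry. A single call to model_free_wait then empties both queues without the caller having to remember which queues were involved.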

If you’re having trouble understanding this description, try looking at the source code. Once you understand the implementation, you’ll see that it is compact and efficient.
