This chapter deals with disk caches. It shows how Linux uses sophisticated techniques to improve system performances by reducing disk accesses as much as possible.
As mentioned in Section 12.1.1, a disk cache is a software mechanism that allows the system to keep in RAM some data that is normally stored on a disk, so that further accesses to that data can be satisfied quickly without accessing the disk.
Besides the dentry cache, which is used by the VFS to speed up the translation of a file pathname to the corresponding inode, two main disk caches—the buffer cache and the page cache—are used by Linux.
As suggested by its name, the buffer cache is a disk cache consisting of buffers; as we know from Section 13.4.3, each buffer stores a single disk block. The block I/O operations (described in Section 13.4.8.1 in the same chapter) rely on the buffer cache to reduce the number of disk accesses.
Conversely, the page cache is a disk cache consisting of pages; each page in the cache corresponds to several blocks of a regular file or a block device file. Of course, the exact number of blocks contained in a page depends on the size of the block. All such blocks are logically contiguous — that is, they represent an integral portion of a regular file or of a block device file. To reduce the number of disk accesses, before activating a page I/O operation (described in Section 13.4.8.2), the kernel should check whether the wanted data is already stored in the page cache.
Table 14-1 shows how some widely used I/O operations use the buffer and page caches. Some of the examples given refer to the Ext2 filesystem, but they can apply to almost all disk-based filesystems.
Table 14-1. Use of the buffer cache and page cache
Kernel function |
System call |
Cache |
I/O operation |
---|---|---|---|
|
None |
Buffer |
Read an Ext2 superblock |
|
None |
Buffer |
Read an Ext2 inode |
|
|
Page |
Read an Ext2 directory |
|
|
Page |
Read an Ext2 regular file |
|
|
Page |
Write an Ext2 regular file |
|
|
Page |
Read a block device file |
|
|
Page |
Write a block device file |
|
None |
Page |
Access a memory-mapped file |
|
None |
Page |
Access to swapped-out page |
Each operation in this table appears in subsequent chapters:
See Section 13.4.8.1. See also Chapter 17.
See Section 15.1.3. See also Chapter 17 for the Ext2 filesystem.
See Section 15.1.1. See also Chapter 17 for the Ext2 filesystem.
See Section 15.1.1.
See Section 15.1.3. See also Chapter 17 for the Ext2 filesystem.
See Section 15.1.1.
See Section 15.1.3.
See Section 15.2.
See Section 13.4.8.2. See also Chapter 16.
For each type of I/O activity, the table also shows the system call required to start it (if any) and the main corresponding kernel function that handles it.
The table shows that accesses to memory-mapped files and swapped-out pages do not require system calls; they are transparent to the programmer. Once a file memory mapping is set up and swapping is activated, the application program can access the mapped file or the swapped-out page as if it were present in memory. It is the kernel’s responsibility to delay the process until the required page is located on disk and brought into RAM.
You’ll also notice that the same kernel function,
namely generic_file_read( )
, is used to read from
block device files and from regular files. Similarly,
generic_file_write( )
is used to write both into
block device files and into regular files. We describe these
functions in Chapter 15.
To avoid unnecessary disk accesses, the kernel never tries to read a page from disk without looking into the page cache and verifying that it does not already include the requested data. To take the maximum advantage from the page cache, searching into it should be a very fast operation.
The unit of information kept in the page cache is, of course, a whole
page of data. A page does not necessarily contain physically adjacent
disk blocks, so it cannot be identified by a device number and a
block number. Instead, a page in the page cache is identified by an
address of a special data structure named
address_space
, and by an offset within the file
(or whatever) referenced by the address_space
data
structure.
Table 14-1 suggests that the Linux page cache speeds up several different kinds of I/O operations. In fact, the page cache may include the following types of pages:
Pages containing data of regular files and directories of disk-based filesystems; in Chapter 15, we describe how the kernel handles read and write operations on them.
Pages containing data of a memory-mapped file; see Chapter 15 for details.
Pages containing data directly read from block device files (skipping the filesystem layer); as discussed in Chapter 15, the kernel handles them using the same set of functions as for pages containing data of regular files.
Pages containing data of User Mode processes that have been swapped out on disk. As we shall see in Chapter 16, the kernel could be forced to keep in the page cache some pages whose contents have been already written on a swap area.
Pages belonging to an Interprocess Communication (IPC) shared memory region; we describe IPC resources in Chapter 19.
So far, so good, but how is the kernel supposed to keep track of how every page in the page cache should be handled? For instance, suppose the kernel wishes to update the content of a page included in the page cache — reading the page contents from a regular file, from a directory, from a block device file, or from a swap area are quite different operations, and the kernel must execute the proper operation according to the type of page.
The key data structure that establishes the relationship between
pages and methods that operate on the pages is the
address_space
object. Formally, each
address_space
object establishes a link between a
generic kernel object (the so-called
owner
)
and a set of methods that operate on the pages belonging to the
owner.
As stated before, the page cache includes five kinds of pages, so a page may belong to five possible kinds of owners.
For instance, if a page belongs to a regular file that is stored in
an Ext2 filesystem, the owner of the page is an inode object. The
i_mapping
field of this object points to an
address_space
object. In turn, the
address_space
object defines a set of methods that
allow the kernel to act on the pages containing the data of our
regular file.
Specifically, the address_space
object includes
the fields shown in Table 14-2.
Table 14-2. The fields of the address_space object
Type |
Field |
Description |
---|---|---|
|
|
List of owner’s clean pages |
|
|
List of owner’s nonlocked dirty pages |
|
|
List of owner’s locked dirty pages |
|
|
Total number of owner’s pages |
|
|
Methods that operate on the owner’s pages |
|
|
Pointer to the owning inode |
|
|
List of memory regions for private memory mapping |
|
|
List of memory regions for shared memory mapping |
|
|
Spin lock for the lists of memory regions |
|
|
Memory allocator flags for the owner’s pages |
The clean_pages
, dirty_pages
,
and locked_pages
fields represent the heads of
three lists of page descriptors. Together, these lists include all
pages that belong to the owner of the
address_space
object. We discuss the role of each
list in the next section. The nrpages
field stores
the total number of pages inserted in the three lists.
Although the owner of the address_space
object
could be any generic kernel object, usually it is a VFS inode object.
(After all, the page cache was introduced to speed up disk accesses!)
In this case, the host
field points to the inode
that owns the address_space
object.
The i_mmap
, i_mmap_shared
,
i_shared_lock
, and gfp_mask
fields are used whenever the owner of the
address_space
object is an inode of a
memory-mapped file. We discuss them in Section 15.2.1.
The most important field of the address_space
object is a_ops
, which points to a table of type
address_space_operations
containing the methods
that define how the owner’s pages are handled. These
methods are shown in Table 14-3.
Table 14-3. The methods of the address_space object
Method |
Description |
---|---|
|
Write operation (from the page to the owner’s disk image) |
|
Read operation (from the owner’s disk image to the page) |
|
Start the I/O data transfer of already scheduled operations on the page |
|
Prepare the write operation (used by disk-based filesystems) |
|
Complete the write operation (used by disk-based filesystems) |
|
Get a logical block number from a file block index |
|
Prepare to delete the page from the owner’s disk image |
|
Used by journaling filesystems to prepare the release of a page |
|
Direct I/O transfer of the data of the page |
The most important methods are readpage
,
writepage
, prepare_write
, and
commit_write
. We discuss them in Chapter 15. In most cases, the methods link the owner
inode objects with the low-level drivers that access the physical
devices. For instance, the function that implements the
readpage
method for an inode of a regular file
“knows” how to locate the positions
on the physical disk device of the blocks corresponding to any page
of the file. In this chapter, however, we don’t have
to discuss the address_space
methods
further.
The page cache uses the following main data structures:
This lets the kernel quickly derive the page descriptor address for
the page associated with a specified address_space
object and a specified offset (presumably, a file offset)
address_space
object
This
lets the kernel quickly retrieve all pages in a given state owned by
a particular inode object (or other kernel object) referenced by an
address_space
object
Manipulation of the page cache involves adding and removing entries
from these data structures, as well as updating the fields in all
objects that reference the cached pages. The
pagecache_lock
spin lock protects the page cache
data structures against concurrent accesses in multiprocessor
systems.
When a process reads a large file, the page cache may become filled with pages related to that file. In such cases, scanning a long list of page descriptors to find the page that maps the required file portion could become a time-consuming operation.
For this reason, Linux uses a hash table of page descriptor pointers
named page_hash_table
. Its size depends on the
amount of available RAM; for example, for systems having 128 MB of
RAM, page_hash_table
is stored in 32 page frames
and includes 32,768 page descriptor pointers.
The page_hash
macro uses the address of an
address_space
object and an offset value to derive
the address of the corresponding entry in the hash table. As usual,
chaining is introduced to handle entries that cause a collision: the
next_hash
and pprev_hash
fields
of the page descriptors are used to implement doubly circular lists
of entries that have the same hash value. The
page_cache_size
variable specifies the number of
page descriptors included in the collision lists of the page hash
table (and therefore in the whole page cache).
The add_page_to_hash_queue( )
and
remove_page_from_hash_queue( )
functions add an
element into the hash table and remove an element from it,
respectively.
As we have seen, the
address_space
object includes three lists of page
descriptors, whose heads are stored in the
clean_pages
, dirty_pages
, and
locked_pages
fields. The lists allow the kernel to
quickly find all pages of a file (or whatever) in a specific state:
clean_pages
Includes the pages not locked and not dirty (the
PG_locked
and PG_dirty
flags in
the page descriptor are equal to 0). The
PG_uptodate
flag indicates whether the data in the
pages is up to date. Typically, a page is not up to date when its
contents have yet to be read from the corresponding image on disk.
dirty_pages
Includes the pages that contain up-to-date data, but whose image on
disk have yet to be updated. The PG_uptodate
and
PG_dirty
flags in the page descriptor are set,
while the PG_locked
flag is clear.
locked_pages
Includes the pages whose contents are being transferred to or from
disk, so the pages cannot be currently accessed. The
PG_locked
flag is set.
The add_page_to_inode_queue( )
function is used to
insert a page descriptor into the clean_pages
list
of an address_space
object. Conversely, the
remove_page_from_inode_queue( )
is used to remove
a page descriptor from the list that is currently including
it.[98] The kernel
moves a page descriptor from a list to another one whenever the page
changes its state.
When a page is included in the page cache, some fields of the corresponding page descriptor have special meanings:
list
Depending on the state of the page, includes pointers for the next
and previous elements in the doubly linked list of clean, dirty, or
locked pages of the address_space
object.
mapping
Points to the address_space
object to which the
page belongs. If the page does not belong to the page cache, this
field is NULL
.
index
When the page’s owner is an inode object, specifies the position of the data contained in the page within the disk image. The value is in page-size units.
next_hash
Points to the next colliding page descriptor in the page hash list.
pprev_hash
Points to the next_hash
field of the previous
colliding page descriptor in the page hash list.
In addition, when a page is inserted into the page cache, the usage
counter (count
field) of the corresponding page
descriptor is incremented. If the count
field is
exactly 1, the page belongs to the cache but is not being accessed by
any process; it can thus be removed from the page cache whenever free
memory becomes scarce, as described in Chapter 16.
The high-level functions that use the page cache involve finding, adding, and removing a page.
The find_get_page
macro receives as parameters the
address of an address_space
object and an offset
value. It uses the page_hash
macro to derive the
address of the hash table entry corresponding to the values of the
parameters, and invokes the _ _find_get_page( )
function to search for the requested page descriptor in the proper
collision list. In turn, _ _find_get_page( )
acquires the pagecache_lock
spin lock, scans the
list of entries that have the same hash value, then releases the spin
lock. If the page is found, the function increments the
count
field of the corresponding page descriptor
and returns its address; otherwise, it returns
NULL
.
The add_to_page_cache( )
function inserts a new
page descriptor (whose address is passed as a parameter) in the page
cache. This is achieved by performing the following operations:
Acquires the pagecache_lock
spin lock.
Clears the PG_uptodate
,
PG_error
, PG_dirty
,
PG_referenced
, PG_arch_1
, and
PG_checked
flags, and sets the
PG_locked
flag of the page frame to indicate that
the page is locked and present in the cache, but not yet filled with
data.
Increments the count
field of the page descriptor.
Initializes the index
field of the page descriptor
with a value passed as a parameter, which specifies the position of
the data contained in the page within the page’s
disk image.
Invokes add_page_to_inode_queue( )
to insert the
page descriptor in the clean_pages
list of an
address_space
object, whose address is passed as a
parameter.
Invokes add_page_to_hash_queue( )
to insert the
page descriptor in the hash table, using the
address_space
object address and the value of
page’s index
field as hash keys.
Releases the pagecache_lock
spin lock.
Invokes lru_cache_add( )
to add the page
descriptor in the inactive list (see Chapter 16).
The find_or_create_page( )
function is similar to
find_get_page
; however, if the requested page is
not in the cache, the function invokes alloc_page( )
to get a new page frame, then invokes
add_to_page_cache( )
to insert the page descriptor
in the page cache.
The remove_inode_page( )
function removes a page
descriptor from the page cache. This is achieved by acquiring the
pagecache_lock
spin lock, invoking
remove_page_from_inode_queue( )
and
remove_page_from_hash_queue( )
, and then releasing
the spin lock.
3.147.79.45