Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Chapter 14. Disk Caches

This chapter deals with disk caches. It shows how Linux uses sophisticated techniques to improve system performances by reducing disk accesses as much as possible.

As mentioned in Section 12.1.1, a disk cache is a software mechanism that allows the system to keep in RAM some data that is normally stored on a disk, so that further accesses to that data can be satisfied quickly without accessing the disk.

Besides the dentry cache, which is used by the VFS to speed up the translation of a file pathname to the corresponding inode, two main disk caches—the buffer cache and the page cache—are used by Linux.

As suggested by its name, the buffer cache is a disk cache consisting of buffers; as we know from Section 13.4.3, each buffer stores a single disk block. The block I/O operations (described in Section 13.4.8.1 in the same chapter) rely on the buffer cache to reduce the number of disk accesses.

Conversely, the page cache is a disk cache consisting of pages; each page in the cache corresponds to several blocks of a regular file or a block device file. Of course, the exact number of blocks contained in a page depends on the size of the block. All such blocks are logically contiguous — that is, they represent an integral portion of a regular file or of a block device file. To reduce the number of disk accesses, before activating a page I/O operation (described in Section 13.4.8.2), the kernel should check whether the wanted data is already stored in the page cache.

Table 14-1 shows how some widely used I/O operations use the buffer and page caches. Some of the examples given refer to the Ext2 filesystem, but they can apply to almost all disk-based filesystems.

Table 14-1. Use of the buffer cache and page cache

Kernel function	System call	Cache	I/O operation
`bread( )`	None	Buffer	Read an Ext2 superblock
`bread( )`	None	Buffer	Read an Ext2 inode
`generic_file_read( )`	`getdents( )`	Page	Read an Ext2 directory
`generic_file_read( )`	`read( )`	Page	Read an Ext2 regular file
`generic_file_write( )`	`write( )`	Page	Write an Ext2 regular file
`generic_file_read( )`	`read( )`	Page	Read a block device file
`generic_file_write( )`	`write( )`	Page	Write a block device file
`filemap_nopage( )`	None	Page	Access a memory-mapped file
`brw_page( )`	None	Page	Access to swapped-out page

Each operation in this table appears in subsequent chapters:

Read an Ext2 superblock: See Section 13.4.8.1. See also Chapter 17.
Read an Ext2 inode: See Section 15.1.3. See also Chapter 17 for the Ext2 filesystem.
Read an Ext2 directory: See Section 15.1.1. See also Chapter 17 for the Ext2 filesystem.
Read an Ext2 regular file: See Section 15.1.1.
Write an Ext2 regular file: See Section 15.1.3. See also Chapter 17 for the Ext2 filesystem.
Read a block device file: See Section 15.1.1.
Write a block device file: See Section 15.1.3.
Access a memory-mapped file: See Section 15.2.
Access to swapped-out page: See Section 13.4.8.2. See also Chapter 16.

For each type of I/O activity, the table also shows the system call required to start it (if any) and the main corresponding kernel function that handles it.

The table shows that accesses to memory-mapped files and swapped-out pages do not require system calls; they are transparent to the programmer. Once a file memory mapping is set up and swapping is activated, the application program can access the mapped file or the swapped-out page as if it were present in memory. It is the kernel’s responsibility to delay the process until the required page is located on disk and brought into RAM.

You’ll also notice that the same kernel function, namely generic_file_read( ), is used to read from block device files and from regular files. Similarly, generic_file_write( ) is used to write both into block device files and into regular files. We describe these functions in Chapter 15.

The Page Cache

To avoid unnecessary disk accesses, the kernel never tries to read a page from disk without looking into the page cache and verifying that it does not already include the requested data. To take the maximum advantage from the page cache, searching into it should be a very fast operation.

The unit of information kept in the page cache is, of course, a whole page of data. A page does not necessarily contain physically adjacent disk blocks, so it cannot be identified by a device number and a block number. Instead, a page in the page cache is identified by an address of a special data structure named address_space, and by an offset within the file (or whatever) referenced by the address_space data structure.

The address_space Object

Table 14-1 suggests that the Linux page cache speeds up several different kinds of I/O operations. In fact, the page cache may include the following types of pages:

Pages containing data of regular files and directories of disk-based filesystems; in Chapter 15, we describe how the kernel handles read and write operations on them.
Pages containing data of a memory-mapped file; see Chapter 15 for details.
Pages containing data directly read from block device files (skipping the filesystem layer); as discussed in Chapter 15, the kernel handles them using the same set of functions as for pages containing data of regular files.
Pages containing data of User Mode processes that have been swapped out on disk. As we shall see in Chapter 16, the kernel could be forced to keep in the page cache some pages whose contents have been already written on a swap area.
Pages belonging to an Interprocess Communication (IPC) shared memory region; we describe IPC resources in Chapter 19.

So far, so good, but how is the kernel supposed to keep track of how every page in the page cache should be handled? For instance, suppose the kernel wishes to update the content of a page included in the page cache — reading the page contents from a regular file, from a directory, from a block device file, or from a swap area are quite different operations, and the kernel must execute the proper operation according to the type of page.

The key data structure that establishes the relationship between pages and methods that operate on the pages is the address_space object. Formally, each address_space object establishes a link between a generic kernel object (the so-called owner ) and a set of methods that operate on the pages belonging to the owner.

As stated before, the page cache includes five kinds of pages, so a page may belong to five possible kinds of owners.

For instance, if a page belongs to a regular file that is stored in an Ext2 filesystem, the owner of the page is an inode object. The i_mapping field of this object points to an address_space object. In turn, the address_space object defines a set of methods that allow the kernel to act on the pages containing the data of our regular file.

Specifically, the address_space object includes the fields shown in Table 14-2.

Table 14-2. The fields of the address_space object

Type	Field	Description
`struct list_head`	`clean_pages`	List of owner’s clean pages
`struct list_head`	`dirty_pages`	List of owner’s nonlocked dirty pages
`struct list_head`	`locked_pages`	List of owner’s locked dirty pages
`unsigned long`	`nrpages`	Total number of owner’s pages
`struct` `address_space_operations *`	`a_ops`	Methods that operate on the owner’s pages
`struct inode *`	`host`	Pointer to the owning inode
`struct vm_area_struct *`	`i_mmap`	List of memory regions for private memory mapping
`struct vm_area_struct *`	`i_mmap_shared`	List of memory regions for shared memory mapping
`spinlock_t`	`i_shared_lock`	Spin lock for the lists of memory regions
`int`	`gfp_mask`	Memory allocator flags for the owner’s pages

The clean_pages, dirty_pages, and locked_pages fields represent the heads of three lists of page descriptors. Together, these lists include all pages that belong to the owner of the address_space object. We discuss the role of each list in the next section. The nrpages field stores the total number of pages inserted in the three lists.

Although the owner of the address_space object could be any generic kernel object, usually it is a VFS inode object. (After all, the page cache was introduced to speed up disk accesses!) In this case, the host field points to the inode that owns the address_space object.

The i_mmap, i_mmap_shared, i_shared_lock, and gfp_mask fields are used whenever the owner of the address_space object is an inode of a memory-mapped file. We discuss them in Section 15.2.1.

The most important field of the address_space object is a_ops, which points to a table of type address_space_operations containing the methods that define how the owner’s pages are handled. These methods are shown in Table 14-3.

Table 14-3. The methods of the address_space object

Method	Description
`writepage`	Write operation (from the page to the owner’s disk image)
`readpage`	Read operation (from the owner’s disk image to the page)
`sync_page`	Start the I/O data transfer of already scheduled operations on the page
`prepare_write`	Prepare the write operation (used by disk-based filesystems)
`commit_write`	Complete the write operation (used by disk-based filesystems)
`bmap`	Get a logical block number from a file block index
`flushpage`	Prepare to delete the page from the owner’s disk image
`releasepage`	Used by journaling filesystems to prepare the release of a page
`direct_IO`	Direct I/O transfer of the data of the page

The most important methods are readpage, writepage, prepare_write, and commit_write. We discuss them in Chapter 15. In most cases, the methods link the owner inode objects with the low-level drivers that access the physical devices. For instance, the function that implements the readpage method for an inode of a regular file “knows” how to locate the positions on the physical disk device of the blocks corresponding to any page of the file. In this chapter, however, we don’t have to discuss the address_space methods further.

Page Cache Data Structures

The page cache uses the following main data structures:

A page hash table: This lets the kernel quickly derive the page descriptor address for the page associated with a specified address_space object and a specified offset (presumably, a file offset)
Page descriptor lists in the address_space object: This lets the kernel quickly retrieve all pages in a given state owned by a particular inode object (or other kernel object) referenced by an address_space object

Manipulation of the page cache involves adding and removing entries from these data structures, as well as updating the fields in all objects that reference the cached pages. The pagecache_lock spin lock protects the page cache data structures against concurrent accesses in multiprocessor systems.

The page hash table

When a process reads a large file, the page cache may become filled with pages related to that file. In such cases, scanning a long list of page descriptors to find the page that maps the required file portion could become a time-consuming operation.

For this reason, Linux uses a hash table of page descriptor pointers named page_hash_table. Its size depends on the amount of available RAM; for example, for systems having 128 MB of RAM, page_hash_table is stored in 32 page frames and includes 32,768 page descriptor pointers.

The page_hash macro uses the address of an address_space object and an offset value to derive the address of the corresponding entry in the hash table. As usual, chaining is introduced to handle entries that cause a collision: the next_hash and pprev_hash fields of the page descriptors are used to implement doubly circular lists of entries that have the same hash value. The page_cache_size variable specifies the number of page descriptors included in the collision lists of the page hash table (and therefore in the whole page cache).

The add_page_to_hash_queue( ) and remove_page_from_hash_queue( ) functions add an element into the hash table and remove an element from it, respectively.

The lists of page descriptors in the address_space object

As we have seen, the address_space object includes three lists of page descriptors, whose heads are stored in the clean_pages, dirty_pages, and locked_pages fields. The lists allow the kernel to quickly find all pages of a file (or whatever) in a specific state:

clean_pages: Includes the pages not locked and not dirty (the PG_locked and PG_dirty flags in the page descriptor are equal to 0). The PG_uptodate flag indicates whether the data in the pages is up to date. Typically, a page is not up to date when its contents have yet to be read from the corresponding image on disk.
dirty_pages: Includes the pages that contain up-to-date data, but whose image on disk have yet to be updated. The PG_uptodate and PG_dirty flags in the page descriptor are set, while the PG_locked flag is clear.
locked_pages: Includes the pages whose contents are being transferred to or from disk, so the pages cannot be currently accessed. The PG_locked flag is set.

The add_page_to_inode_queue( ) function is used to insert a page descriptor into the clean_pages list of an address_space object. Conversely, the remove_page_from_inode_queue( ) is used to remove a page descriptor from the list that is currently including it.^[98] The kernel moves a page descriptor from a list to another one whenever the page changes its state.

Page descriptor fields related to the page cache

When a page is included in the page cache, some fields of the corresponding page descriptor have special meanings:

list: Depending on the state of the page, includes pointers for the next and previous elements in the doubly linked list of clean, dirty, or locked pages of the address_space object.
mapping: Points to the address_space object to which the page belongs. If the page does not belong to the page cache, this field is NULL.
index: When the page’s owner is an inode object, specifies the position of the data contained in the page within the disk image. The value is in page-size units.
next_hash: Points to the next colliding page descriptor in the page hash list.
pprev_hash: Points to the next_hash field of the previous colliding page descriptor in the page hash list.

In addition, when a page is inserted into the page cache, the usage counter (count field) of the corresponding page descriptor is incremented. If the count field is exactly 1, the page belongs to the cache but is not being accessed by any process; it can thus be removed from the page cache whenever free memory becomes scarce, as described in Chapter 16.

Page Cache Handling Functions

The high-level functions that use the page cache involve finding, adding, and removing a page.

The find_get_page macro receives as parameters the address of an address_space object and an offset value. It uses the page_hash macro to derive the address of the hash table entry corresponding to the values of the parameters, and invokes the _ _find_get_page( ) function to search for the requested page descriptor in the proper collision list. In turn, _ _find_get_page( ) acquires the pagecache_lock spin lock, scans the list of entries that have the same hash value, then releases the spin lock. If the page is found, the function increments the count field of the corresponding page descriptor and returns its address; otherwise, it returns NULL.

The add_to_page_cache( ) function inserts a new page descriptor (whose address is passed as a parameter) in the page cache. This is achieved by performing the following operations:

Acquires the pagecache_lock spin lock.
Clears the PG_uptodate, PG_error, PG_dirty, PG_referenced, PG_arch_1, and PG_checked flags, and sets the PG_locked flag of the page frame to indicate that the page is locked and present in the cache, but not yet filled with data.
Increments the count field of the page descriptor.
Initializes the index field of the page descriptor with a value passed as a parameter, which specifies the position of the data contained in the page within the page’s disk image.
Invokes add_page_to_inode_queue( ) to insert the page descriptor in the clean_pages list of an address_space object, whose address is passed as a parameter.
Invokes add_page_to_hash_queue( ) to insert the page descriptor in the hash table, using the address_space object address and the value of page’s index field as hash keys.
Releases the pagecache_lock spin lock.
Invokes lru_cache_add( ) to add the page descriptor in the inactive list (see Chapter 16).

The find_or_create_page( ) function is similar to find_get_page; however, if the requested page is not in the cache, the function invokes alloc_page( ) to get a new page frame, then invokes add_to_page_cache( ) to insert the page descriptor in the page cache.

The remove_inode_page( ) function removes a page descriptor from the page cache. This is achieved by acquiring the pagecache_lock spin lock, invoking remove_page_from_inode_queue( ) and remove_page_from_hash_queue( ), and then releasing the spin lock.

^[98]The names of these functions are inherited from the old Version 2.2 of the kernel.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for 14. Disk Caches

Create new playlist

Sign In

Sign Up