Memory Mapping

As already mentioned in Section 8.3, a memory region can be associated with some portion of either a regular file in a disk-based filesystem or a block device file. This means that an access to a byte within a page of the memory region is translated by the kernel into an operation on the corresponding byte of the file. This technique is called memory mapping.

Two kinds of memory mapping exist:

Shared

Any write operation on the pages of the memory region changes the file on disk; moreover, if a process writes into a page of a shared memory mapping, the changes are visible to all other processes that map the same file.

Private

Meant to be used when the process creates the mapping just to read the file, not to write it. For this purpose, private mapping is more efficient than shared mapping. However, any write operation on a privately mapped page causes the kernel to stop mapping that page to the file. Thus, a write does not change the file on disk, nor is the change visible to any other process that accesses the same file.

A process can create a new memory mapping by issuing an mmap( ) system call (see Section 15.2.2 later in this chapter). Programmers must specify either the MAP_SHARED flag or the MAP_PRIVATE flag as a parameter of the system call; as you can easily guess, in the former case the mapping is shared, while in the latter it is private. Once the mapping is created, the process can read the data stored in the file by simply reading from the memory locations of the new memory region. If the memory mapping is shared, the process can also modify the corresponding file by simply writing into the same memory locations. To destroy or shrink a memory mapping, the process may use the munmap( ) system call (see the later section Section 15.2.3).

As a general rule, if a memory mapping is shared, the corresponding memory region has the VM_SHARED flag set; if it is private, the VM_SHARED flag is cleared. As we’ll see later, an exception to this rule exists for read-only shared memory mappings.
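
To make the distinction between the two kinds of mapping concrete, here is a minimal user-space sketch (not taken from the kernel sources; the scratch file name is just an example). It maps the same file twice, once privately and once shared, and shows that only the write performed through the shared mapping reaches the file:

    /* Sketch: writes through a MAP_PRIVATE mapping trigger Copy On Write and
     * never reach the file; writes through a MAP_SHARED mapping do. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("mapdemo.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);
        if (fd < 0 || write(fd, "original", 8) != 8)
            return 1;

        char *priv = mmap(NULL, 8, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
        char *shar = mmap(NULL, 8, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (priv == MAP_FAILED || shar == MAP_FAILED)
            return 1;

        memcpy(priv, "PRIVATE!", 8);   /* Copy On Write: the file is untouched */
        printf("after the private write, the shared view still reads: %.8s\n", shar);

        memcpy(shar, "SHARED!!", 8);   /* this write does change the file */
        printf("after the shared write, the private view still reads: %.8s\n", priv);

        munmap(priv, 8);
        munmap(shar, 8);
        close(fd);
        return 0;
    }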

Memory Mapping Data Structures

A memory mapping is represented by a combination of the following data structures:

  • The inode object associated with the mapped file

  • The address_space object of the mapped file

  • A file object for each different mapping performed on the file by different processes

  • A vm_area_struct descriptor for each different mapping on the file

  • A page descriptor for each page frame assigned to a memory region that maps the file

Figure 15-4 illustrates how the data structures are linked. In the upper-left corner, we show the inode, which identifies the file. The i_mapping field of each inode object points to the address_space object of the file. In turn, the i_mmap or i_mmap_shared fields of each address_space object point to the first element of a doubly linked list that includes all memory regions that currently map the file; if both fields are NULL, the file is not mapped by any memory region. The list contains vm_area_struct descriptors that represent memory regions, and is implemented by means of the vm_next_share and vm_pprev_share fields.


Figure 15-4. Data structures for file memory mapping

The vm_file field of each memory region descriptor contains the address of a file object for the mapped file; if that field is null, the memory region is not used in a memory mapping. The file object contains fields that allow the kernel to identify both the process that owns the memory mapping and the file being mapped.

The position of the first mapped location is stored into the vm_pgoff field of the memory region descriptor; it represents the file offset as a number of page-size units. The length of the mapped file portion is simply the length of the memory region, which can be computed from the vm_start and vm_end fields.
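
As a small illustration of these fields, the following user-space sketch (the structure is a simplified stand-in, not the kernel's vm_area_struct) computes the mapped length and the file page index corresponding to a linear address inside the region:

    /* Simplified stand-in for the three fields discussed above. */
    #include <stdio.h>

    #define PAGE_SHIFT 12                  /* assume 4-KB pages */

    struct vma_stub {
        unsigned long vm_start;            /* first linear address of the region   */
        unsigned long vm_end;              /* first linear address past the region */
        unsigned long vm_pgoff;            /* file offset of the region, in pages  */
    };

    static unsigned long file_page_index(struct vma_stub *vma, unsigned long addr)
    {
        return vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT);
    }

    int main(void)
    {
        /* hypothetical region: three pages mapping the file from page 4 onward */
        struct vma_stub vma = { 0x40000000UL, 0x40003000UL, 4 };

        printf("mapped length: %lu bytes\n", vma.vm_end - vma.vm_start);
        printf("address 0x40001800 corresponds to file page %lu\n",
               file_page_index(&vma, 0x40001800UL));
        return 0;
    }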

Pages of shared memory mappings are always included in the page cache; pages of private memory mappings are included in the page cache as long as they are unmodified. When a process tries to modify a page of a private memory mapping, the kernel duplicates the page frame and replaces the original page frame with the duplicate in the process Page Table; this is one of the applications of the Copy On Write mechanism that we discussed in Chapter 8. The original page frame still remains in the page cache, although it no longer belongs to the memory mapping since it is replaced by the duplicate. In turn, the duplicate is not inserted into the page cache since it no longer contains valid data representing the file on disk.

Figure 15-4 also shows a few page descriptors of pages included in the page cache that refer to the memory-mapped file. Notice that the first memory region in the figure is three pages long, but only two page frames are allocated for it; presumably, the process owning the memory region has never accessed the third page. Although not shown in the figure, the page descriptors are inserted into the clean_pages, dirty_pages, and locked_pages doubly linked lists described in Section 14.1.2.

The kernel offers several hooks to customize the memory mapping mechanism for every different filesystem. The core of memory mapping implementation is delegated to a file object’s method named mmap. For most disk-based filesystems and for block device files, this method is implemented by a general function called generic_file_mmap( ), which is described in the next section.

File memory mapping depends on the demand paging mechanism described in Section 8.4.3. In fact, a newly established memory mapping is a memory region that doesn’t include any page; as the process references an address inside the region, a Page Fault occurs and the Page Fault handler checks whether the nopage method of the memory region is defined. If nopage is not defined, the memory region doesn’t map a file on disk; otherwise, it does, and the method takes care of reading the page by accessing the block device. Almost all disk-based filesystems and block device files implement the nopage method by means of the filemap_nopage( ) function.

Creating a Memory Mapping

To create a new memory mapping, a process issues an mmap( ) system call, passing the following parameters to it:

  • A file descriptor identifying the file to be mapped.

  • An offset inside the file specifying the first character of the file portion to be mapped.

  • The length of the file portion to be mapped.

  • A set of flags. The process must explicitly set either the MAP_SHARED flag or the MAP_PRIVATE flag to specify the kind of memory mapping requested.[106]

  • A set of permissions specifying one or more types of access to the memory region: read access (PROT_READ), write access (PROT_WRITE), or execution access (PROT_EXEC).

  • An optional linear address, which is taken by the kernel as a hint of where the new memory region should start. If the MAP_FIXED flag is specified and the kernel cannot allocate the new memory region starting from the specified linear address, the system call fails.
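
For illustration, the following user-space sketch (the file name and offset are just examples) issues an mmap( ) call with the parameters listed above, mapping one page of a file read-only and shared, starting at a page-aligned offset:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("data.bin", O_RDONLY);
        if (fd < 0)
            return 1;

        long page = sysconf(_SC_PAGESIZE);
        /* NULL hint: let the kernel choose the linear address; the offset
         * must be a multiple of the page size. */
        void *p = mmap(NULL, page, PROT_READ, MAP_SHARED, fd, 2 * page);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        printf("third page of the file mapped at %p\n", p);
        munmap(p, page);
        close(fd);
        return 0;
    }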

The mmap( ) system call returns the linear address of the first location in the new memory region. For compatibility reasons, in the 80 x 86 architecture, the kernel reserves two entries in the system call table for mmap( ): one at index 90 and the other at index 192. The former entry corresponds to the old_mmap( ) service routine (used by older C libraries), while the latter one corresponds to the sys_mmap2( ) service routine (used by recent C libraries). The two service routines differ only in how the six parameters of the system call are passed. Both of them end up invoking the do_mmap_pgoff( ) function described in Section 8.3.4. We now complete that description by detailing the steps performed only when creating a memory region that maps a file.

  1. Checks whether the mmap file operation for the file to be mapped is defined; if not, it returns an error code. A NULL value for mmap in the file operation table indicates that the corresponding file cannot be mapped (for instance, because it is a directory).

  2. Checks whether the get_unmapped_area method of the file object is defined. If so, invokes it; otherwise, invokes the arch_get_unmapped_area( ) function already described in Chapter 8. On the 80 x 86 architecture, a custom method is used only by the frame buffer layer, so we don’t discuss the case further. Remember that arch_get_unmapped_area( ) allocates an interval of linear addresses for the new memory region.

  3. In addition to the usual consistency checks, compares the kind of memory mapping requested and the flags specified when the file was opened. The flags passed as a parameter of the system call specify the kind of mapping required, while the value of the f_mode field of the file object specifies how the file was opened. Depending on these two sources of information, it performs the following checks:

    1. If a shared writable memory mapping is required, checks that the file was opened for writing and that it was not opened in append mode (O_APPEND flag of the open( ) system call)

    2. If a shared memory mapping is required, checks that there is no mandatory lock on the file (see Section 12.7)

    3. For any kind of memory mapping, checks that the file was opened for reading

    If any of these conditions is not fulfilled, an error code is returned.

  4. When initializing the value of the vm_flags field of the new memory region descriptor, sets the VM_READ, VM_WRITE, VM_EXEC, VM_SHARED, VM_MAYREAD, VM_MAYWRITE, VM_MAYEXEC, and VM_MAYSHARE flags according to the access rights of the file and the kind of requested memory mapping (see Section 8.3.2). As an optimization, the VM_SHARED flag is cleared for nonwritable shared memory mappings. This can be done because the process is not allowed to write into the pages of the memory region, so the mapping is treated the same as a private mapping; however, the kernel actually allows other processes that share the file to access the pages in this memory region.

  5. Initializes the vm_file field of the memory region descriptor with the address of the file object and increments the file’s usage counter.

  6. Invokes the mmap method for the file being mapped, passing as parameters the address of the file object and the address of the memory region descriptor. For most filesystems, this method is implemented by the generic_file_mmap( ) function, which performs the following operations:

    1. If a shared writable memory mapping is required, checks that the writepage method of the address_space object of the file is defined; if not, it returns the error code -EINVAL.

    2. Checks that the readpage method of the address_space object of the file is defined; if not, it returns the error code -ENOEXEC.

    3. Stores the current time in the i_atime field of the file’s inode and marks the inode as dirty.

    4. Initializes the vm_ops field of the memory region descriptor with the address of the generic_file_vm_ops table. All methods in this table are null, except the nopage method, which is implemented by the filemap_nopage( ) function.

  7. Recall from Section 8.3.4 that do_mmap( ) invokes vma_link( ). This function inserts the memory region descriptor into either the i_mmap list or the i_mmap_shared list of the address_space object, according to whether the requested memory mapping is private or shared, respectively.
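
The following user-space sketch condenses the four substeps of Step 6 (generic_file_mmap( )) into a few lines; all types are simplified stand-ins rather than the real kernel structures, and only the logic of the checks and assignments is modeled:

    #include <errno.h>
    #include <stdio.h>
    #include <time.h>

    struct mapping_stub { int has_readpage, has_writepage; };
    struct inode_stub   { struct mapping_stub *i_mapping; time_t i_atime; int dirty; };
    struct vma_stub     { int shared_writable; const char *nopage_impl; };

    static int generic_file_mmap_sketch(struct inode_stub *inode, struct vma_stub *vma)
    {
        struct mapping_stub *m = inode->i_mapping;

        if (vma->shared_writable && !m->has_writepage)  /* substep 1 */
            return -EINVAL;
        if (!m->has_readpage)                           /* substep 2 */
            return -ENOEXEC;

        inode->i_atime = time(NULL);                    /* substep 3 */
        inode->dirty = 1;

        /* substep 4: generic_file_vm_ops defines only the nopage method */
        vma->nopage_impl = "filemap_nopage";
        return 0;
    }

    int main(void)
    {
        struct mapping_stub m = { .has_readpage = 1, .has_writepage = 0 };
        struct inode_stub inode = { .i_mapping = &m };
        struct vma_stub vma = { .shared_writable = 1 };

        /* a shared writable mapping of a file with no writepage method fails */
        printf("%d\n", generic_file_mmap_sketch(&inode, &vma));   /* -EINVAL */
        return 0;
    }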

Destroying a Memory Mapping

When a process is ready to destroy a memory mapping, it invokes the munmap( ) system call, passing the following parameters to it:

  • The address of the first location in the linear address interval to be removed

  • The length of the linear address interval to be removed

Notice that the munmap( ) system call can be used to either remove or reduce the size of each kind of memory region. Indeed, the sys_munmap( ) service routine of the system call essentially invokes the do_munmap( ) function already described in Section 8.3.5. However, if the memory region maps a file, the following additional steps are performed for each memory region included in the range of linear addresses to be released:

  1. Invokes remove_shared_vm_struct( ) to remove the memory region descriptor from the address_space object list (either i_mmap or i_mmap_shared).

  2. When executing the unmap_fixup( ) function, decrements the file usage counter if an entire memory region is destroyed, and increments the file usage counter if a new memory region is created — that is, if the unmapping created a hole inside a region. If the region has just been shrunk, it leaves the file usage counter unchanged.

Notice that there is no need to flush to disk the contents of the pages included in a writable shared memory mapping to be destroyed. In fact, these pages continue to act as a disk cache because they are still included in the page cache (see the next section).
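
As a concrete example of the "hole" case mentioned in Step 2, the following user-space sketch (the file name is just an example) maps three pages of a file and then unmaps only the middle one, splitting the original memory region in two:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        long page = sysconf(_SC_PAGESIZE);
        int fd = open("big.dat", O_RDWR | O_CREAT, 0644);
        if (fd < 0 || ftruncate(fd, 3 * page) < 0)
            return 1;

        char *p = mmap(NULL, 3 * page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
            return 1;

        /* removing the middle page leaves a hole between two smaller regions */
        if (munmap(p + page, page) < 0) {
            perror("munmap");
            return 1;
        }
        printf("pages at %p and %p are still mapped; the page in between is not\n",
               (void *)p, (void *)(p + 2 * page));

        munmap(p, page);
        munmap(p + 2 * page, page);
        close(fd);
        return 0;
    }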

Demand Paging for Memory Mapping

For reasons of efficiency, page frames are not assigned to a memory mapping right after it has been created, but rather at the last possible moment — that is, when the process attempts to address one of its pages, thus causing a Page Fault exception.

We saw in Section 8.4 how the kernel verifies whether the faulty address is included in some memory region of the process; if so, the kernel checks the Page Table entry corresponding to the faulty address and invokes the do_no_page( ) function if the entry is null (see Section 8.4.3).

The do_no_page( ) function performs all the operations that are common to all types of demand paging, such as allocating a page frame and updating the Page Tables. It also checks whether the nopage method of the memory region involved is defined. In Section 8.4.3, we described the case in which the method is undefined (anonymous memory region); now we complete the description by discussing the actions performed by the function when the method is defined:

  1. Invokes the nopage method, which returns the address of a page frame that contains the requested page.

  2. If the process is trying to write into the page and the memory mapping is private, avoids a future Copy On Write fault by making a copy of the page just read and inserting it into the inactive list of pages (see Chapter 16). In the following steps, the function uses the new page instead of the page returned by the nopage method so that the latter is not modified by the User Mode process.

  3. Increments the rss field of the process memory descriptor to indicate that a new page frame has been assigned to the process.

  4. Sets up the Page Table entry corresponding to the faulty address with the address of the page frame and the page access rights included in the memory region vm_page_prot field.

  5. If the process is trying to write into the page, forces the Read/Write and Dirty bits of the Page Table entry to 1. In this case, either the page frame is exclusively assigned to the process, or the page is shared; in both cases, writing to it should be allowed.
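
The following user-space sketch models the file-backed branch of do_no_page( ) just described (the page and Page Table entry types are toy stand-ins, and Step 3 is omitted); it shows in particular how a write to a private mapping ends up with a duplicated page frame:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct page_stub { char data[16]; };
    struct pte_stub  { struct page_stub *frame; int writable, dirty; };

    static struct page_stub cached_page = { "file contents" };

    /* stand-in for the memory region's nopage method (Step 1) */
    static struct page_stub *nopage_stub(void) { return &cached_page; }

    static void do_no_page_sketch(struct pte_stub *pte, int write, int shared)
    {
        struct page_stub *page = nopage_stub();           /* Step 1 */

        if (write && !shared) {                           /* Step 2: private write */
            struct page_stub *copy = malloc(sizeof(*copy));
            if (copy) {
                memcpy(copy, page, sizeof(*copy));        /* cached page stays intact  */
                page = copy;                              /* the process gets the copy */
            }
        }

        pte->frame = page;                                /* Step 4 */
        if (write) {                                      /* Step 5 */
            pte->writable = 1;
            pte->dirty = 1;
        }
    }

    int main(void)
    {
        struct pte_stub pte = { 0 };
        do_no_page_sketch(&pte, 1, 0);                    /* write, private mapping */
        printf("the PTE points at %s page frame\n",
               pte.frame == &cached_page ? "the cached" : "a duplicated");
        return 0;
    }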

The core of the demand paging algorithm consists of the memory region’s nopage method. Generally speaking, it must return the address of a page frame that contains the page accessed by the process. Its implementation depends on the kind of memory region in which the page is included.

When handling memory regions that map files on disk, the nopage method must first search for the requested page in the page cache. If the page is not found, the method must read it from disk. Most filesystems implement the nopage method by means of the filemap_nopage( ) function, which receives three parameters:

area

Descriptor address of the memory region, including the required page.

address

Linear address of the required page.

unused

Parameter of the nopage method that is not used by filemap_nopage( ).

The filemap_nopage( ) function executes the following steps:

  1. Gets the file object address file from the area->vm_file field. Derives the address_space object address from file->f_dentry->d_inode->i_mapping. Derives the inode object address from the host field of the address_space object.

  2. Uses the vm_start and vm_pgoff fields of area to determine the offset within the file of the data corresponding to the page starting from address.

  3. Checks whether the file offset exceeds the file size. When this happens, returns NULL, which means failure in allocating the new page, unless the Page Fault was caused by a debugger tracing another process through the ptrace( ) system call. We are not going to discuss this special case.

  4. Invokes find_get_page( ) to look in the page cache for the page identified by the address_space object and the file offset.

  5. If the page is not in the page cache, checks the value of the VM_RAND_READ flag of the memory region. The value of this flag can be changed by means of the madvise( ) system call; when the flag is set, it indicates that the user application is not going to read more pages of the file than those just accessed.

    • If the VM_RAND_READ flag is set, invokes page_cache_read( ) to read just the requested page from disk (see the earlier section Section 15.1.1).

    • If the VM_RAND_READ flag is cleared, invokes page_cache_read( ) several times to read a cluster of adjacent pages inside the memory region, including the requested page. The length of the cluster is stored in the page_cluster variable; its default value is three pages, but the system administrator may tune its value by writing into the /proc/sys/vm/page-cluster special file.

    Then the function jumps back to Step 4 and repeats the page cache lookup operation (the process might have been blocked while executing the page_cache_read( ) function).

  6. The page is inside the page cache. Checks its PG_uptodate flag. If the flag is not set (page not up to date), the function performs the following substeps:

    1. Locks the page by setting the PG_locked flag, sleeping if necessary.

    2. Invokes the readpage method of the address_space object to trigger the I/O data transfer.

    3. Invokes wait_on_page( ) to sleep until the I/O transfer completes.

  7. The page is up to date. The function checks the VM_SEQ_READ flag of the memory region. The value of this flag can be changed by means of the madvise( ) system call; when the flag is set, it indicates that the user application is going to reference the pages of the mapped file sequentially, so the pages should be aggressively read in advance and freed after they are accessed. If the flag is set, the function invokes nopage_sequential_readahead( ). This function uses a large, fixed-size read-ahead window, whose length is approximately the maximum read-ahead window size of the underlying block device (see the earlier section Section 15.1.2). The vm_raend field of the memory region descriptor stores the ending position of the current read-ahead window. The function shifts the read-ahead window forward (by reading in advance the corresponding pages) whenever the requested page falls exactly at the midpoint of the current read-ahead window. Moreover, the function should release the pages in the memory region that are far behind the requested page; if the function reads the nth read-ahead window of the memory region, it releases the pages belonging to the (n-3)rd window (however, kernel Version 2.4.18 doesn’t release them; see the next section).

  8. Invokes mark_page_accessed( ) to mark the requested page as accessed (see Chapter 16).

  9. Returns the address of the requested page.
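
The control flow of filemap_nopage( ) can be summarized by the following toy sketch; the helpers mirror the kernel function names only for readability, the "page cache" is a small array, and the read-ahead of Step 7 is omitted:

    #include <stdio.h>

    #define FILE_PAGES 8                   /* pretend the file is 8 pages long */
    #define CLUSTER    4                   /* illustrative cluster length      */
    #define VM_RAND_READ 0x1               /* illustrative flag value          */

    struct page_stub { int cached, uptodate; };

    static struct page_stub cache[FILE_PAGES];   /* toy stand-in for the page cache */

    static struct page_stub *find_get_page_stub(unsigned long idx)
    {
        return cache[idx].cached ? &cache[idx] : NULL;
    }

    static void page_cache_read_stub(unsigned long idx)
    {
        if (idx < FILE_PAGES)
            cache[idx].cached = 1;         /* "read the page from disk" */
    }

    static struct page_stub *filemap_nopage_sketch(unsigned long vm_flags,
                                                   unsigned long index)
    {
        struct page_stub *page;

        if (index >= FILE_PAGES)           /* Step 3: offset beyond end of file */
            return NULL;

        while (!(page = find_get_page_stub(index))) {     /* Steps 4 and 5 */
            if (vm_flags & VM_RAND_READ)
                page_cache_read_stub(index);              /* just this page */
            else
                for (unsigned long i = 0; i < CLUSTER; i++)
                    page_cache_read_stub(index + i);      /* a cluster of pages */
        }

        if (!page->uptodate)               /* Step 6: trigger readpage and wait */
            page->uptodate = 1;

        return page;                       /* Steps 8 and 9 */
    }

    int main(void)
    {
        filemap_nopage_sketch(0, 2);       /* fault on page 2, sequential case */
        printf("pages now cached: ");
        for (int i = 0; i < FILE_PAGES; i++)
            if (cache[i].cached)
                printf("%d ", i);          /* expect 2 3 4 5 */
        printf("\n");
        return 0;
    }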

Flushing Dirty Memory Mapping Pages to Disk

The msync( ) system call can be used by a process to flush to disk dirty pages belonging to a shared memory mapping. It receives as parameters the starting address of an interval of linear addresses, the length of the interval, and a set of flags that have the following meanings:

MS_SYNC

Asks the system call to suspend the process until the I/O operation completes. In this way, the calling process can assume that when the system call terminates, all pages of its memory mapping have been flushed to disk.

MS_ASYNC

Asks the system call to return immediately without suspending the calling process.

MS_INVALIDATE

Asks the system call to remove all pages included in the memory mapping from the process address space (not really implemented).
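
For illustration, the following user-space sketch (the file name is just an example) modifies a page of a shared file mapping and then uses MS_SYNC to block until the dirty page has been written to disk:

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("journal.dat", O_RDWR | O_CREAT, 0644);
        if (fd < 0 || ftruncate(fd, 4096) < 0)
            return 1;

        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
            return 1;

        memcpy(p, "committed record", 16);

        /* MS_SYNC: suspend until the I/O completes; MS_ASYNC would return
         * immediately instead. */
        if (msync(p, 4096, MS_SYNC) < 0)
            return 1;

        munmap(p, 4096);
        close(fd);
        return 0;
    }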

The sys_msync( ) service routine invokes msync_interval( ) on each memory region included in the interval of linear addresses. In turn, the latter function performs the following operations:

  1. If the vm_file field of the memory region descriptor is NULL, or if the VM_SHARED flag is clear, returns 0 (the memory region is not a writable shared memory mapping of a file).

  2. Invokes the filemap_sync( ) function, which scans the Page Table entries corresponding to the linear address intervals included in the memory region. For each page found, it invokes flush_tlb_page( ) to flush the corresponding translation lookaside buffers, and marks the page as dirty.

  3. If the MS_SYNC flag is not set, returns. Otherwise, continues with the following steps to flush the pages in the memory region to disk, sleeping until all I/O data transfers terminate. Notice that, at least in the last stable version of the kernel at the time of this writing, the function does not take the MS_INVALIDATE flag into consideration.

  4. Acquires the i_sem semaphore of the file’s inode.

  5. Invokes the filemap_fdatasync( ) function, which receives the address of the file’s address_space object. For every page belonging to the dirty pages list of the address_space object, the function performs the following substeps:

    1. Moves the page from the dirty pages list to the locked pages list.

    2. If the PG_dirty flag is not set, continues with the next page in the list (the page is already being flushed by another process).

    3. Increments the usage counter of the page and locks it, sleeping if necessary.

    4. Clears the PG_dirty flag of the page.

    5. Invokes the writepage method of the address_space object on the page (described following this list).

    6. Decrements the usage counter of the page.

    The writepage method for block device files and almost all disk-based filesystems is just a wrapper for the block_write_full_page( ) function; it is used to pass to block_write_full_page( ) the address of a filesystem-dependent function that translates the block numbers relative to the beginning of the file into logical block numbers relative to the position of the block in the disk partition. (This is the same mechanism already described in the earlier section Section 15.1.1 and used for the readpage method.) In turn, block_write_full_page( ) is very similar to block_read_full_page( ) described earlier: it allocates asynchronous buffer heads for the page, and invokes the submit_bh( ) function on each of them, specifying the WRITE operation.

  6. Checks whether the fsync method of the file object is defined; if so, executes it. For regular files, this method usually limits itself to flushing the inode object of the file to disk. For block device files, however, the method invokes sync_buffers( ), which activates the I/O data transfer of all dirty buffers of the device.

  7. Executes the filemap_fdatawait( ) function. For each page in the locked pages list of the address_space object, the function waits until the page becomes unlocked — when the ongoing I/O data transfer on the page terminates.

  8. Releases the i_sem semaphore of the file.
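
The interplay between filemap_fdatasync( ) and filemap_fdatawait( ) (Steps 5 and 7) can be modeled by the following toy sketch, in which the dirty and locked lists are reduced to per-page markers and the write-out is a simple print statement:

    #include <stdio.h>

    #define NPAGES 4

    enum list { CLEAN, DIRTY, LOCKED };

    struct page_stub { enum list list; int pg_dirty; };

    /* the last page is already being flushed by someone else (PG_dirty clear) */
    static struct page_stub pages[NPAGES] = {
        { DIRTY, 1 }, { DIRTY, 1 }, { CLEAN, 0 }, { DIRTY, 0 }
    };

    static void writepage_stub(int i) { printf("writing page %d to disk\n", i); }

    static void filemap_fdatasync_sketch(void)
    {
        for (int i = 0; i < NPAGES; i++) {
            if (pages[i].list != DIRTY)
                continue;
            pages[i].list = LOCKED;        /* 5.1: dirty list -> locked list   */
            if (!pages[i].pg_dirty)
                continue;                  /* 5.2: already being flushed       */
            pages[i].pg_dirty = 0;         /* 5.4: clear PG_dirty              */
            writepage_stub(i);             /* 5.5: start the write-out         */
        }
    }

    static void filemap_fdatawait_sketch(void)
    {
        for (int i = 0; i < NPAGES; i++)
            if (pages[i].list == LOCKED)
                pages[i].list = CLEAN;     /* Step 7: wait until unlocked      */
    }

    int main(void)
    {
        filemap_fdatasync_sketch();
        filemap_fdatawait_sketch();
        return 0;
    }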



[106] The process could also set the MAP_ANONYMOUS flag to specify that the new memory region is anonymous — that is, not associated with any disk-based file (see Section 8.4.3). This flag is supported by some Unix operating systems, including Linux, but it is not defined by the POSIX standard. In Linux 2.4, a process can also create a memory region that is both MAP_SHARED and MAP_ANONYMOUS. In this case, the region maps a special file in the shm filesystem (see Section 19.3.5), which can be accessed by all the process’s descendants.
