Chapter 5
Memory Management

5.1 Terminology

A central component of any operating system is the memory-management system. As the name implies, memory-management facilities are responsible for the management of memory resources available on a machine. These resources are typically layered in a hierarchical fashion, with memory-access times inversely related to their proximity to the CPU (see Fig. 5.1). The primary memory system is main memory; the next level of storage is secondary storage or backing storage. Main-memory systems usually are constructed from random-access memories, whereas secondary stores are placed on moving-head disk drives. In certain workstation environments, the common two-level hierarchy is becoming a three-level hierarchy, with the addition of file-server machines connected to a workstation via a local-area network [Gingell, Moran, & Shannon, 1987].

Figure 5.1 Hierarchical layering of memory.


In a multiprogrammed environment, it is critical for the operating system to share available memory resources effectively among the processes. The operation of any memory-management policy is directly related to the memory required for a process to execute. That is, if a process must reside entirely in main memory for it to execute, then a memory-management system must be oriented toward allocating large units of memory. On the other hand, if a process can execute when it is only partially resident in main memory, then memory-management policies are likely to be substantially different. Memory-management facilities usually try to optimize the number of runnable processes that are resident in main memory. This goal must be considered with the goals of the process scheduler (Chapter 4), so that conflicts that can adversely affect overall system performance are avoided.

Although the availability of secondary storage permits more processes to exist than can be resident in main memory, it also requires additional algorithms that can be complicated. Space management typically requires algorithms and policies different from those used for main memory, and a policy must be devised for deciding when to move processes between main memory and secondary storage.

Processes and Memory

Each process operates on a virtual machine that is defined by the architecture of the underlying hardware on which it executes. We are interested in only those machines that include the notion of a virtual address space. A virtual address space is a range of memory locations that a process references independently of the physical memory present in the system. In other words, the virtual address space of a process is independent of the physical address space of the CPU. For a machine to support virtual memory, we also require that the whole of a process’s virtual address space does not need to be resident in main memory for that process to execute.

References to the virtual address space—virtual addresses—are translated by hardware into references to physical memory. This operation, termed address translation, permits programs to be loaded into memory at any location without requiring position-dependent addresses in the program to be changed. Address translation and virtual addressing are also important in efficient sharing of a CPU, because position independence usually permits context switching to be done quickly.

Most machines provide a contiguous virtual address space for processes. Some machines, however, choose to partition visibly a process’s virtual address space into regions termed segments [Intel, 1984]; such segments usually must be physically contiguous in main memory and must begin at fixed addresses. We shall be concerned with only those systems that do not visibly segment their virtual address space. This use of the word segment is not the same as its earlier use in Section 3.5, when we were describing 4.4BSD process segments, such as text and data segments.

When multiple processes are coresident in main memory, we must protect the physical memory associated with each process’s virtual address space to ensure that one process cannot alter the contents of another process’s virtual address space. This protection is implemented in hardware and is usually tightly coupled with the implementation of address translation. Consequently, the two operations usually are defined and implemented together as hardware termed the memory-management unit.

Virtual memory can be implemented in many ways, some of which are software based, such as overlays. Most effective virtual-memory schemes are, however, hardware based. In these schemes, the virtual address space is divided into fixed-sized units, termed pages, as shown in Fig. 5.2. Virtual-memory references are resolved by the address-translation unit to a page in main memory and an offset within that page. Hardware protection is applied by the memory-management unit on a page-by-page basis.
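
For concreteness, consider a machine with 4-Kbyte pages. The translation hardware conceptually splits each virtual address into a page number, which is looked up in the translation tables, and an offset within that page, which is used unchanged. A minimal sketch of this decomposition follows; the 4-Kbyte page size is an assumption made for illustration:

    #include <stdint.h>

    #define PAGE_SHIFT 12                   /* assumed: log2 of 4-Kbyte pages */
    #define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)

    /* Split a virtual address into a page number and an offset. */
    uint32_t
    page_number(uint32_t va)
    {
        return (va >> PAGE_SHIFT);
    }

    uint32_t
    page_offset(uint32_t va)
    {
        return (va & PAGE_MASK);
    }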

Some systems provide a two-tiered virtual-memory system in which pages are grouped into segments [Organick, 1975]. In these systems, protection is usually at the segment level. In the remainder of this chapter, we shall be concerned with only those virtual-memory systems that are page based.

Paging

Address translation provides the implementation of virtual memory by decoupling the virtual address space of a process from the physical address space of the CPU. Each page of virtual memory is marked as resident or nonresident in main memory. If a process references a location in virtual memory that is not resident, a hardware trap termed a page fault is generated. The servicing of page faults, or paging, permits processes to execute even if they are only partially resident in main memory.

Figure 5.2 Paged virtual-memory scheme. Key: MMU—memory-management unit.


Coffman and Denning [1973] characterize paging systems by three important policies:

1.  When the system loads pages into memory—the fetch policy

2.  Where the system places pages in memory—the placement policy

3.  How the system selects pages to be removed from main memory when pages are unavailable for a placement request—the replacement policy

In normal circumstances, all pages of main memory are equally good, and the placement policy has no effect on the performance of a paging system. Thus, a paging system’s behavior is dependent on only the fetch policy and the replacement policy. Under a pure demand-paging system, a demand-fetch policy is used, in which only the missing page is fetched, and replacements occur only when main memory is full. Consequently, the performance of a pure demand-paging system depends on only the system’s replacement policy. In practice, paging systems do not implement a pure demand-paging algorithm. Instead, the fetch policy often is altered to do prepaging—fetching pages of memory other than the one that caused the page fault—and the replacement policy is invoked before main memory is full.

Replacement Algorithms

The replacement policy is the most critical aspect of any paging system. There is a wide range of algorithms from which we can select in designing a replacement strategy for a paging system. Much research has been carried out in evaluating the performance of different page-replacement algorithms [Belady, 1966; King, 1971; Marshall, 1979].

A process’s paging behavior for a given input is described in terms of the pages referenced over the time of the process’s execution. This sequence of pages, termed a reference string, represents the behavior of the process at discrete times during the process’s lifetime. Corresponding to the sampled references that constitute a process’s reference string are real-time values that reflect whether or not the associated references resulted in a page fault. A useful measure of a process’s behavior is the fault rate, which is the number of page faults encountered during processing of a reference string, normalized by the length of the reference string.

Page-replacement algorithms typically are evaluated in terms of their effectiveness on reference strings that have been collected from execution of real programs. Formal analysis can also be used, although it is difficult to perform unless many restrictions are applied to the execution environment. The most common metric used in measuring the effectiveness of a page-replacement algorithm is the fault rate.
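
The fault rate of a replacement policy on a given reference string can be measured by simulation. The following sketch counts the faults incurred by a least-recently-used (LRU) policy, one common choice; the three-frame memory and the sample reference string are arbitrary values chosen for illustration:

    #include <stdio.h>

    #define NFRAMES 3                   /* arbitrary memory size in frames */

    /* Count the page faults an LRU policy incurs on a reference string. */
    int
    lru_faults(const int *refs, int n)
    {
        int frame[NFRAMES], age[NFRAMES];
        int faults = 0, i, j;

        for (j = 0; j < NFRAMES; j++) {
            frame[j] = -1;              /* all frames start empty */
            age[j] = 0;
        }
        for (i = 0; i < n; i++) {
            int hit = -1, victim = 0;

            for (j = 0; j < NFRAMES; j++) {
                age[j]++;               /* every frame ages by one reference */
                if (frame[j] == refs[i])
                    hit = j;
            }
            if (hit >= 0) {
                age[hit] = 0;           /* referenced page becomes youngest */
                continue;
            }
            faults++;                   /* page not resident: a page fault */
            for (j = 1; j < NFRAMES; j++)
                if (age[j] > age[victim])
                    victim = j;         /* empty frames age fastest, fill first */
            frame[victim] = refs[i];
            age[victim] = 0;
        }
        return (faults);
    }

    int
    main(void)
    {
        int refs[] = { 0, 1, 2, 0, 1, 3, 0, 3, 2, 1 };
        int n = sizeof(refs) / sizeof(refs[0]);

        printf("fault rate = %.2f\n", (double)lru_faults(refs, n) / n);
        return (0);
    }

For the string shown, the program reports a fault rate of 0.60: six faults over ten references.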

Page-replacement algorithms are defined in terms of the criteria that they use for selecting pages to be reclaimed. For example, the optimal replacement policy [Denning, 1970] states that the “best” choice of a page to replace is the one with the longest expected time until its next reference. Clearly, this policy is not applicable to dynamic systems, as it requires a priori knowledge of the paging characteristics of a process. The policy is useful for evaluation purposes, however, as it provides a yardstick for comparing the performance of other page-replacement algorithms.

Practical page-replacement algorithms require a certain amount of state information that the system uses in selecting replacement pages. This state typically includes the reference pattern of a process, sampled at discrete time intervals. On some systems, this information can be expensive to collect [Babaoglu & Joy, 1981]. As a result, the “best” page-replacement algorithm may not be the most efficient.

Working-Set Model

The working-set model assumes that processes exhibit a slowly changing locality of reference. For a period of time, a process operates in a set of subroutines or loops, causing all its memory references to refer to a fixed subset of its address space, termed the working set. The process periodically changes its working set, abandoning certain areas of memory and beginning to access new ones. After a period of transition, the process defines a new set of pages as its working set. In general, if the system can provide the process with enough pages to hold that process’s working set, the process will experience a low page-fault rate. If the system cannot provide the process with enough pages for the working set, the process will run slowly and will have a high page-fault rate.

Precise calculation of the working set of a process is impossible without a priori knowledge of that process’s memory-reference pattern. However, the working set can be approximated by various means. One method of approximation is to track the number of pages held by a process and that process’s page-fault rate. If the page-fault rate increases above a high watermark, the working set is assumed to have increased, and the number of pages held by the process is allowed to grow. Conversely, if the page-fault rate drops below a low watermark, the working set is assumed to have decreased, and the number of pages held by the process is reduced.
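
A sketch of this feedback loop appears below. The watermark values and the one-quarter adjustment step are hypothetical tuning parameters, not values taken from any real system:

    #define FAULT_HIGH 10.0         /* faults per second: assumed high watermark */
    #define FAULT_LOW   2.0         /* faults per second: assumed low watermark */

    /*
     * Adjust the page allotment of a process from its observed
     * fault rate, approximating the size of its working set.
     */
    int
    adjust_allotment(double fault_rate, int npages)
    {
        if (fault_rate > FAULT_HIGH)
            npages += npages / 4;   /* working set has grown; add pages */
        else if (fault_rate < FAULT_LOW && npages > 4)
            npages -= npages / 4;   /* working set has shrunk; reclaim pages */
        return (npages);
    }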

Swapping

Swapping is the term used to describe a memory-management policy in which entire processes are moved to and from secondary storage when main memory is in short supply. Swap-based memory-management systems usually are less complicated than are demand-paged systems, since there is less bookkeeping to do. However, pure swapping systems are typically less effective than are paging systems, since the degree of multiprogramming is lowered by the requirement that processes be fully resident to execute. Swapping is sometimes combined with paging in a two-tiered scheme, whereby paging satisfies memory demands until a severe memory shortfall requires drastic action, in which case swapping is used.

In this chapter, a portion of secondary storage that is used for paging or swapping is termed a swap area or swap space. The hardware devices on which these areas reside are termed swap devices.

Advantages of Virtual Memory

There are several advantages to the use of virtual memory on computers capable of supporting this facility properly. Virtual memory allows large programs to be run on machines with main-memory configurations that are smaller than the program size. On machines with a moderate amount of memory, it allows more programs to be resident in main memory to compete for CPU time, as the programs do not need to be completely resident. When programs use sections of their program or data space for some time, leaving other sections unused, the unused sections do not need to be present. Also, the use of virtual memory allows programs to start up faster, as they generally require only a small section to be loaded before they begin processing arguments and determining what actions to take. Other parts of a program may not be needed at all during individual runs. As a program runs, additional sections of its program and data spaces are paged in on demand (demand paging). Finally, there are many algorithms that are more easily programmed by sparse use of a large address space than by careful packing of data structures into a small area. Such techniques are too expensive for use without virtual memory, but may run much faster when that facility is available, without using an inordinate amount of physical memory.

On the other hand, the use of virtual memory can degrade performance. It is more efficient to load a program all at one time than to load it entirely in small sections on demand. There is a finite cost for each operation, including saving and restoring state and determining which page must be loaded. So, some systems use demand paging for only those programs that are larger than some minimum size.

Hardware Requirements for Virtual Memory

Nearly all versions of UNIX have required some form of memory-management hardware to support transparent multiprogramming. To protect processes from modification by other processes, the memory-management hardware must prevent programs from changing their own address mapping. The 4.4BSD kernel runs in a privileged mode (kernel mode or system mode) in which memory mapping can be controlled, whereas processes run in an unprivileged mode (user mode). There are several additional architectural requirements for support of virtual memory. The CPU must distinguish between resident and nonresident portions of the address space, must suspend programs when they refer to nonresident addresses, and must resume programs’ operation once the operating system has placed the required section in memory. Because the CPU may discover missing data at various times during the execution of an instruction, it must provide a mechanism to save the machine state, so that the instruction can be continued or restarted later. The CPU may implement restarting by saving enough state when an instruction begins that the state can be restored when a fault is discovered. Alternatively, instructions could delay any modifications or side effects until after any faults would be discovered, so that the instruction execution does not need to back up before restarting. On some computers, instruction backup requires the assistance of the operating system.

Most machines designed to support demand-paged virtual memory include hardware support for the collection of information on program references to memory. When the system selects a page for replacement, it must save the contents of that page if they have been modified since the page was brought into memory. The hardware usually maintains a per-page flag showing whether the page has been modified. Many machines also include a flag recording any access to a page for use by the replacement algorithm.

5.2 Overview of the 4.4BSD Virtual-Memory System

The 4.4BSD virtual-memory system differs completely from the system that was used in 4.3BSD and predecessors. The implementation is based on the Mach 2.0 virtual-memory system [Tevanian, 1987], with updates from Mach 2.5 and Mach 3.0. The Mach virtual-memory system was adopted because it features efficient support for sharing and a clean separation of machine-independent and machine-dependent features, as well as (currently unused) multiprocessor support. None of the original Mach system-call interface remains. It has been replaced with the interface first proposed for 4.2BSD that has been widely adopted by the UNIX industry; the 4.4BSD interface is described in Section 5.5.

The virtual-memory system implements protected address spaces into which can be mapped data sources (objects) such as files or private, anonymous pieces of swap space. Physical memory is used as a cache of recently used pages from these objects, and is managed by a global page-replacement algorithm much like that of 4.3BSD.

The virtual address space of most architectures is divided into two parts. Typically, the top 30 to 100 Mbyte of the address space is reserved for use by the kernel. The remaining address space is available for use by processes. A traditional UNIX layout is shown in Fig. 5.3 (on page 124). Here, the kernel and its associated data structures reside at the top of the address space. The initial text and data areas start at or near the beginning of memory. Typically, the first 4 or 8 Kbyte of memory are kept off limits to the process. The reason for this restriction is to ease program debugging; indirecting through a null pointer will cause an invalid address fault, instead of reading or writing the program text. Memory allocations made by the running process using the malloc() library routine (or the sbrk system call) are done on the heap that starts immediately following the data area and grows to higher addresses. The argument and environment vectors are at the top of the user portion of the address space. The user’s stack starts just below these vectors and grows to lower addresses. Subject to only administrative limits, the stack and heap can each grow until they meet. At that point, a process running on a 32-bit machine will be using nearly 4 Gbyte of address space.
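
The ordering of these regions is easy to observe from a running process. The following user-level sketch prints one address from each region; the exact values are machine dependent, but under the traditional layout just described, the text address is lowest and the stack address is highest:

    #include <stdio.h>
    #include <stdlib.h>

    int initialized = 1;                    /* resides in the data region */

    int
    main(void)
    {
        int on_stack;                       /* resides in the stack region */
        char *on_heap = malloc(1);          /* allocated from the heap */

        printf("text:  %p\n", (void *)main);
        printf("data:  %p\n", (void *)&initialized);
        printf("heap:  %p\n", (void *)on_heap);
        printf("stack: %p\n", (void *)&on_stack);
        free(on_heap);
        return (0);
    }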

Figure 5.3 Layout of virtual address space.


In 4.4BSD and other modern UNIX systems that support the mmap system call, address-space usage is less structured. Shared library implementations may place text or data arbitrarily, rendering the notion of predefined regions obsolete. For compatibility, 4.4BSD still supports the sbrk call that malloc() uses to provide a contiguous heap region, and the kernel has a designated stack region where adjacent allocations are performed automatically.

At any time, the currently executing process is mapped into the virtual address space. When the system decides to context switch to another process, it must save the information about the current-process address mapping, then load the address mapping for the new process to be run. The details of this address-map switching are architecture dependent. Some architectures need to change only a few memory-mapping registers that give the base and length of the memory-resident page tables. Other architectures store the page-table descriptors in special high-speed static RAM. Switching these maps may require dumping and reloading hundreds of map entries.

Both the kernel and user processes use the same basic data structures for the management of their virtual memory. The data structures used to manage virtual memory are as follows:

vmspace

Structure that encompasses both the machine-dependent and machine-independent structures describing a process’s address space

vm_map

Highest-level data structure that describes the machine-independent virtual address space

vm_map_entry

Structure that describes a virtually contiguous range of address space that shares protection and inheritance attributes

object

Structure that describes a source of data for a range of addresses

shadow object

Special object that represents a modified copy of the original data

vm_page

The lowest-level data structure that represents the physical memory being used by the virtual-memory system

In the remainder of this section, we shall describe briefly how all these data structures fit together. The remainder of this chapter will describe the details of these structures and how they are used.

Figure 5.4 shows a typical process address space and associated data structures. The vmspace structure encapsulates the virtual-memory state of a particular process, including the machine-dependent and machine-independent data structures, as well as statistics. The machine-dependent pmap structure is opaque to all but the lowest level of the system, and contains all information necessary to manage the memory-management hardware. This pmap layer is the subject of Section 5.13 and is ignored for the remainder of the current discussion. The machine-independent data structures include the address space that is represented by a vm_map structure. The vm_map contains a linked list of vm_map_entry structures, hints for speeding up lookups during memory allocation and page-fault handling, and a pointer to the associated machine-dependent pmap structure contained in the vmspace. A vm_map_entry structure describes a virtually contiguous range of address space that has the same protection and inheritance attributes. Every vm_map_entry points to a chain of vm_object structures that describes sources of data (objects) that are mapped at the indicated address range. At the tail of the chain is the original mapped data object, usually representing a persistent data source, such as a file. Interposed between that object and the map entry are one or more transient shadow objects that represent modified copies of the original data. These shadow objects are discussed in detail in Section 5.5.

Figure 5.4 Data structures that describe a process address space.


Each vm_object structure contains a linked list of vm_page structures representing the physical-memory cache of the object, as well as a pointer to the pager_struct structure that contains information on how to page in or page out data from its backing store. There is a vm_page structure allocated for every page of physical memory managed by the virtual-memory system, where a page here may be a collection of multiple, contiguous hardware pages that will be treated by the machine-dependent layer as though they were a single unit. The structure also contains the status of the page (e.g., modified or referenced) and links for various paging queues.
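
In outline, the relationships just described might be expressed as the following simplified declarations. These are an illustrative sketch, not the actual 4.4BSD definitions; the field names and types are assumptions chosen for clarity:

    typedef unsigned long vm_offset_t;      /* a virtual address or offset */
    struct pager_struct;                    /* opaque here; see Section 5.10 */
    struct pmap;                            /* opaque machine-dependent layer */

    struct vm_page {
        struct vm_page      *next;          /* next resident page of object */
        vm_offset_t          offset;        /* where this page falls in object */
        int                  flags;         /* e.g., modified, referenced */
    };

    struct vm_object {
        struct vm_page      *page_list;     /* resident pages of this object */
        struct vm_object    *shadow;        /* object that this one shadows */
        struct pager_struct *pager;         /* fills and saves pages */
        int                  ref_count;     /* referencing entries and objects */
        vm_offset_t          size;          /* size of file or anonymous area */
    };

    struct vm_map_entry {
        struct vm_map_entry *next;          /* kept in ascending address order */
        vm_offset_t          start, end;    /* virtual range described */
        struct vm_object    *object;        /* head of the object chain */
        vm_offset_t          offset;        /* start of mapping within object */
        int                  protection;    /* read, write, execute */
    };

    struct vm_map {
        struct vm_map_entry *entries;       /* sorted list of map entries */
        struct vm_map_entry *hint;          /* entry found by the last lookup */
        struct pmap         *pmap;          /* machine-dependent translations */
    };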

All structures contain the necessary interlocks for multithreading in a multiprocessor environment. The locking is fine grained, with at least one lock per instance of a data structure. Many of the structures contain multiple locks to protect individual fields.

5.3 Kernel Memory Management

There are two ways in which the kernel’s memory can be organized. The most common is for the kernel to be permanently mapped into the high part of every process address space. In this model, switching from one process to another does not affect the kernel portion of the address space. The alternative organization is to switch between having the kernel occupy the whole address space and mapping the currently running process into the address space. Having the kernel permanently mapped does reduce the amount of address space available to a large process (and the kernel), but it also reduces the cost of data copying. Many system calls require data to be transferred between the currently running user process and the kernel. With the kernel permanently mapped, the data can be copied via the efficient block-copy instructions. If the kernel is alternately mapped with the process, data copying requires the use of special instructions that copy to and from the previously mapped address space. These instructions are usually a factor of 2 slower than the standard block-copy instructions. Since up to one-third of the kernel time is spent in copying between the kernel and user processes, slowing this operation by a factor of 2 significantly slows system throughput.

Although the kernel is able freely to read and write the address space of the user process, the converse is not true. The kernel’s range of virtual address space is marked inaccessible to all user processes. The reason for restricting writing is so that user processes cannot tamper with the kernel’s data structures. The reason for restricting reading is so that user processes cannot watch sensitive kernel data structures, such as the terminal input queues, that include such things as users typing their passwords.

Usually, the hardware dictates which organization can be used. All the architectures supported by 4.4BSD map the kernel into the top of the address space.

Kernel Maps and Submaps

When the system boots, the first task that the kernel must do is to set up data structures to describe and manage its address space. Like any process, the kernel has a vm_map with a corresponding set of vm_map_entry structures that describe the use of a range of addresses. Submaps are a special kernel-only construct used to isolate and constrain address-space allocation for kernel subsystems. One use is in subsystems that require contiguous pieces of the kernel address space. So that intermixing of unrelated allocations within an address range is avoided, that range is covered by a submap, and only the appropriate subsystem can allocate from that map. For example, several network buffer (mbuf) manipulation macros use address arithmetic to generate unique indices, thus requiring the network buffer region to be contiguous. Parts of the kernel may also require addresses with particular alignments or even specific addresses. Both can be ensured by use of submaps. Finally, submaps can be used to limit statically the amount of address space and hence the physical memory consumed by a subsystem.

A typical layout of the kernel map is shown in Fig. 5.5. The kernel’s address space is described by the vm_map structure shown in the upper-left corner of the figure. Pieces of the address space are described by the vm_map_entry structures that are linked in ascending address order from K0 to K8 on the vm_map structure. Here, the kernel text, initialized data, uninitialized data, and initially allocated data structures reside in the range K0 to K1 and are represented by the first vm_map_entry. The next vm_map_entry is associated with the address range from K2 to K6; this piece of the kernel address space is being managed via a submap headed by the referenced vm_map structure. This submap currently has two parts of its address space used: the address range K2 to K3, and the address range K4 to K5. These two ranges represent the kernel malloc arena and the network buffer arena, respectively. The final part of the kernel address space is being managed in the kernel’s main map; the address range K7 to K8 represents the kernel I/O staging area.

Figure 5.5 Kernel address-space maps.


Kernel Address-Space Allocation

The virtual-memory system implements a set of primitive functions for allocating and freeing the page-aligned, page-rounded virtual-memory ranges that the kernel uses. These ranges may be allocated either from the main kernel-address map or from a submap. The allocation routines take a map and size as parameters, but do not take an address. Thus, specific addresses within a map cannot be selected. There are different allocation routines for obtaining nonpageable and pageable memory ranges.

A nonpageable, or wired, range has physical memory assigned at the time of the call, and this memory is not subject to replacement by the pageout daemon. Wired pages must never cause a page fault that might result in a blocking operation. Wired memory is allocated with kmem_alloc() and kmem_malloc(). Kmem_alloc() returns zero-filled memory and may block if insufficient physical memory is available to honor the request. It will return a failure only if no address space is available in the indicated map. Kmem_malloc() is a variant of kmem_alloc() used by only the general allocator, malloc(), described in the next subsection. This routine has a nonblocking option that protects callers against inadvertently blocking on kernel data structures; it will fail if insufficient physical memory is available to fill the requested range. The nonblocking option is used when allocating memory at interrupt time and during other critical sections of code. In general, wired memory should be allocated via the general-purpose kernel allocator. Kmem_alloc() should be used only to allocate memory from specific kernel submaps.

Pageable kernel virtual memory can be allocated with kmem_alloc_pageable() and kmem_alloc_wait(). A pageable range has physical memory allocated on demand, and this memory can be written out to backing store by the pageout daemon as part of the latter’s normal replacement policy. Kmem_alloc_pageable() will return an error if insufficient address space is available for the desired allocation; kmem_alloc_wait() will block until space is available. Currently, pageable kernel memory is used only for temporary storage of exec arguments and for the kernel stacks of processes that have been swapped out.

Kmem_free() deallocates kernel wired memory and pageable memory allocated with kmem_alloc_pageable(). Kmem_free_wakeup() should be used with kmem_alloc_wait() because it wakes up any processes waiting for address space in the specified map.
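
A sketch of how a kernel subsystem might use these primitives follows. The prototypes shown capture the general shape of the interface described above (a map and a size, with the caller remembering the size when freeing); they are assumptions for illustration, not the exact 4.4BSD declarations, and buffer_submap is a hypothetical submap:

    /* Assumed shapes of the primitives; not the exact 4.4BSD prototypes. */
    typedef struct vm_map *vm_map_t;
    typedef unsigned long vm_offset_t, vm_size_t;

    extern vm_offset_t kmem_alloc(vm_map_t map, vm_size_t size);
    extern void        kmem_free(vm_map_t map, vm_offset_t addr,
                           vm_size_t size);
    extern vm_map_t    buffer_submap;       /* hypothetical subsystem submap */

    void
    example(void)
    {
        vm_size_t size = 3 * 4096;          /* a page-rounded request */
        vm_offset_t buf;

        /*
         * Wired memory: may block waiting for physical memory, but
         * fails only if the submap's address space is exhausted.
         */
        buf = kmem_alloc(buffer_submap, size);
        if (buf == 0)
            return;                         /* no address space in submap */
        /* ... use the buffer ... */
        kmem_free(buffer_submap, buf, size); /* caller supplies the size */
    }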

Kernel Malloc

The kernel also provides a generalized nonpageable memory-allocation and freeing mechanism that can handle requests with arbitrary alignment or size, as well as allocate memory at interrupt time. Hence, it is the preferred way to allocate kernel memory. This mechanism has an interface similar to that of the well-known memory allocator provided for applications programmers through the C library routines malloc() and free(). Like the C library interface, the allocation routine takes a parameter specifying the size of memory that is needed. The range of sizes for memory requests is not constrained. The free routine takes a pointer to the storage being freed, but does not require the size of the piece of memory being freed.

Often, the kernel needs a memory allocation for the duration of a single system call. In a user process, such short-term memory would be allocated on the run-time stack. Because the kernel has a limited run-time stack, it is not feasible to allocate even moderate blocks of memory on it. Consequently, such memory must be allocated dynamically. For example, when the system must translate a pathname, it must allocate a 1-Kbyte buffer to hold the name. Other blocks of memory must be more persistent than a single system call, and have to be allocated from dynamic memory. Examples include protocol control blocks that remain throughout the duration of a network connection.

The design specification for a kernel memory allocator is similar to, but not identical to, the design criteria for a user-level memory allocator. One criterion for a memory allocator is that the latter make good use of the physical memory. Use of memory is measured by the amount of memory needed to hold a set of allocations at any point in time. Percentage utilization is expressed as

         utilization = requested / required

Here, requested is the sum of the memory that has been requested and not yet freed; required is the amount of memory that has been allocated for the pool from which the requests are filled. An allocator requires more memory than requested because of fragmentation and a need to have a ready supply of free memory for future requests. A perfect memory allocator would have a utilization of 100 percent. In practice, a 50-percent utilization is considered good [Korn & Vo, 1985].

Good memory utilization in the kernel is more important than in user processes. Because user processes run in virtual memory, unused parts of their address space can be paged out. Thus, pages in the process address space that are part of the required pool that are not being requested do not need to tie up physical memory. Since the kernel malloc arena is not paged, all pages in the required pool are held by the kernel and cannot be used for other purposes. To keep the kernel-utilization percentage as high as possible, the kernel should release unused memory in the required pool, rather than hold it, as is typically done with user processes. Because the kernel can manipulate its own page maps directly, freeing unused memory is fast; a user process must do a system call to free memory.

The most important criterion for a kernel memory allocator is that the latter be fast. A slow memory allocator will degrade the system performance because memory allocation is done frequently. Speed of allocation is more critical when executing in the kernel than it is in user code because the kernel must allocate many data structures that user processes can allocate cheaply on their run-time stack. In addition, the kernel represents the platform on which all user processes run, and, if it is slow, it will degrade the performance of every process that is running.

Another problem with a slow memory allocator is that programmers of frequently used kernel interfaces will think that they cannot afford to use the memory allocator as their primary one. Instead, they will build their own memory allocator on top of the original by maintaining their own pool of memory blocks. Multiple allocators reduce the efficiency with which memory is used. The kernel ends up with many different free lists of memory, instead of a single free list from which all allocations can be drawn. For example, consider the case of two subsystems that need memory. If they have their own free lists, the amount of memory tied up in the two lists will be the sum of the greatest amount of memory that each of the two subsystems has ever used. If they share a free list, the amount of memory tied up in the free list may be as low as the greatest amount of memory that either subsystem used. As the number of subsystems grows, the savings from having a single free list grow.

The kernel memory allocator uses a hybrid strategy. Small allocations are done using a power-of-2 list strategy; the typical allocation requires only a computation of the list to use and the removal of an element if that element is available, so it is fast. Only if the request cannot be fulfilled from a list is a call made to the allocator itself. To ensure that the allocator is always called for large requests, the lists corresponding to large allocations are always empty.

Freeing a small block also is fast. The kernel computes the list on which to place the block, and puts the block there. The free routine is called only if the block of memory is considered to be a large allocation.

Because of the inefficiency of power-of-2 allocation strategies for large allocations, the allocation method for large blocks is based on allocating pieces of memory in multiples of pages. The algorithm switches to the slower but more memory-efficient strategy for allocation sizes larger than 2×pagesize. This value is chosen because the power-of-2 algorithm yields sizes of 1, 2, 4, 8, …, n pages, whereas the large block algorithm that allocates in multiples of pages yields sizes of 1, 2, 3, 4, …, n pages. Thus, for allocations of sizes between one and two pages, both algorithms use two pages; a difference emerges beginning with allocations of sizes between two and three pages, where the power-of-2 algorithm will use four pages, whereas the large block algorithm will use three pages. Thus, the threshold between the large and small allocators is set to two pages.

Large allocations are first rounded up to be a multiple of the page size. The allocator then uses a “first-fit” algorithm to find space in the kernel address arena set aside for dynamic allocations. On a machine with a 4-Kbyte page size, a request for a 20-Kbyte piece of memory will use exactly five pages of memory, rather than the eight pages used with the power-of-2 allocation strategy. When a large piece of memory is freed, the memory pages are returned to the free-memory pool and the vm_map_entry structure is deleted from the submap, effectively coalescing the freed piece with any adjacent free space.
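
The choice between the two strategies reduces to a small size computation. This sketch, which assumes a 4-Kbyte page size and a smallest bucket of 16 bytes, rounds a request up to the size class that the hybrid allocator would use; it reproduces the 20-Kbyte example above, yielding five pages rather than eight:

    #define PAGE_SIZE 4096UL                /* assumed page size */

    /* Round a request up to the size class the hybrid allocator uses. */
    unsigned long
    alloc_size(unsigned long req)
    {
        unsigned long size;

        if (req > 2 * PAGE_SIZE)            /* large: round to whole pages */
            return ((req + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1));
        for (size = 16; size < req; size <<= 1)
            ;                               /* small: next power of 2 */
        return (size);
    }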

Another technique to improve both the efficiency of memory utilization and the speed of allocation is to cluster same-sized small allocations on a page. When a list for a power-of-2 allocation is empty, a new page is allocated and is divided into pieces of the needed size. This strategy speeds future allocations because several pieces of memory become available as a result of the call into the allocator.

Because the size is not specified when a block of memory is freed, the allocator must keep track of the sizes of the pieces that it has handed out. Many allocators increase the allocation request by a few bytes to create space to store the size of the block in a header just before the allocation. However, this strategy doubles the memory requirement for allocations that request a power-of-2–sized block. Therefore, instead of storing the size of each piece of memory with the piece itself, the kernel associates the size information with the memory page. Figure 5.6 shows how the kernel determines the size of a piece of memory that is being freed, by calculating the page in which it resides and looking up the size associated with that page. Locating the allocation size outside of the allocated block improved utilization far more than expected. The reason is that many allocations in the kernel are for blocks of memory whose size is exactly a power of 2. These requests would be nearly doubled in size if the more typical strategy were used. Now they can be accommodated with no wasted memory.
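
In outline, freeing then looks like the following sketch, in which a hypothetical kmemsizes array, indexed by the page number within the malloc arena, records the allocation size for each page; the real kernel’s bookkeeping differs in detail:

    #define PAGE_SHIFT 12                   /* assumed: log2 of the page size */

    extern char          *kmembase;         /* hypothetical start of arena */
    extern unsigned long  kmemsizes[];      /* hypothetical per-page records */

    /* Recover the size of a block without a header stored beside it. */
    unsigned long
    block_size(void *addr)
    {
        unsigned long pageno;

        pageno = (unsigned long)((char *)addr - kmembase) >> PAGE_SHIFT;
        return (kmemsizes[pageno]);         /* recorded when page was split */
    }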

The allocator can be called both from the top half of the kernel that is willing to wait for memory to become available, and from the interrupt routines in the bottom half of the kernel that cannot wait for memory to become available. Clients show their willingness (and ability) to wait with a flag to the allocation routine. For clients that are willing to wait, the allocator guarantees that their request will succeed. Thus, these clients do not need to check the return value from the allocator. If memory is unavailable and the client cannot wait, the allocator returns a null pointer. These clients must be prepared to cope with this (hopefully infrequent) condition (usually by giving up and hoping to succeed later). The details of the kernel memory allocator are further described in [McKusick & Karels, 1988].
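
In outline, the two styles of call look like this sketch. The M_WAITOK and M_NOWAIT flag names follow the traditional interface; the M_TEMP allocation type, the constants’ values, and the buffer sizes are illustrative assumptions:

    #include <stddef.h>

    /* Sketched kernel allocator interface; values are illustrative. */
    #define M_WAITOK 0x0000                 /* caller is willing to wait */
    #define M_NOWAIT 0x0001                 /* caller cannot wait */
    #define M_TEMP   74                     /* hypothetical allocation type */

    extern void *malloc(unsigned long size, int type, int flags);
    extern void  free(void *addr, int type);

    int
    receive_packet(unsigned long len)
    {
        char *pkt;

        /* Interrupt level: cannot sleep, so failure must be handled. */
        pkt = malloc(len, M_TEMP, M_NOWAIT);
        if (pkt == NULL)
            return (1);                     /* give up; try again later */
        /* ... copy the packet and queue it ... */
        free(pkt, M_TEMP);
        return (0);
    }

    void
    top_half_example(char **bufp)
    {
        /* Top half: willing to wait, so no need to check for failure. */
        *bufp = malloc(1024, M_TEMP, M_WAITOK);
    }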

Figure 5.6 Calculation of allocation size. Key: free—unused page; cont—continuation of previous page.


5.4 Per-Process Resources

As we have already seen, a process requires a process entry and a kernel stack. The next major resource that must be allocated is its virtual memory. The initial virtual-memory requirements are defined by the header in the process’s executable. These requirements include the space needed for the program text, the initialized data, the uninitialized data, and the run-time stack. During the initial startup of the program, the kernel will build the data structures necessary to describe these four areas. Most programs need to allocate additional memory. The kernel typically provides this additional memory by expanding the uninitialized data area.

Most 4.4BSD systems also provide shared libraries. The header for the executable will describe the libraries that it needs (usually the C library, and possibly others). The kernel is not responsible for locating and mapping these libraries during the initial execution of the program. Finding, mapping, and creating the dynamic linkages to these libraries is handled by the user-level startup code prepended to the file being executed. This startup code usually runs before control is passed to the main entry point of the program [Gingell et al., 1987].

4.4BSD Process Virtual-Address Space

The initial layout of the address space for a process is shown in Fig. 5.7. As discussed in Section 5.2, the address space for a process is described by that process’s vmspace structure. The contents of the address space are defined by a list of vm_map_entry structures, each structure describing a region of virtual address space that resides between a start and an end address. A region describes a range of memory that is being treated in the same way. For example, the text of a program is a region that is read-only and is demand paged from the file on disk that contains it. Thus, the vm_map_entry also contains the protection mode to be applied to the region that it describes. Each vm_map_entry structure also has a pointer to the object that provides the initial data for the region. It also stores the modified contents either transiently when memory is being reclaimed or more permanently when the region is no longer needed. Finally, each vm_map_entry structure has an offset that describes where within the object the mapping begins.

The example shown in Fig. 5.7 represents a process just after it has started execution. The first two map entries both point to the same object; here, that object is the executable. The executable consists of two parts: the text of the program that resides at the beginning of the file and the initialized data area that follows at the end of the text. Thus, the first vm_map_entry describes a read-only region that maps the text of the program. The second vm_map_entry describes the copy-on-write region that maps the initialized data of the program that follows the program text in the file (copy-on-write is described in Section 5.6). The offset field in the entry reflects this different starting location. The third and fourth vm_map_entry structures describe the uninitialized data and stack areas, respectively. Both of these areas are represented by anonymous objects. An anonymous object provides a zero-filled page on first use, and arranges to store modified pages in the swap area if memory becomes tight. Anonymous objects are described in more detail later in this section.

Figure 5.7 Layout of an address space.


Page-Fault Dispatch

When a process attempts to access a piece of its address space that is not currently resident, a page fault occurs. The page-fault handler in the kernel is presented with the virtual address that caused the fault. The fault is handled with the following four steps:

1.  Find the vmspace structure for the faulting process; from that structure, find the head of the vm_map_entry list.

2.  Traverse the vm_map_entry list starting at the entry indicated by the map hint; for each entry, check whether the faulting address falls within its start and end address range. If the kernel reaches the end of the list without finding any valid region, the faulting address is not within any valid part of the address space for the process, so send the process a segmentation-fault signal.

3.  Having found a vm_map_entry that contains the faulting address, convert that address to an offset within the underlying object. Calculate the offset within the object as

                  object_offset = fault_address
                       - vm_map_entry→start_address
                       + vm_map_entry→object_offset

Subtract off the start address to give the offset into the region mapped by the vm_map_entry. Add in the object offset to give the absolute offset of the page within the object.

4.  Present the absolute object offset to the underlying object, which allocates a vm_page structure and uses its pager to fill the page. The object then returns a pointer to the vm_page structure, which is mapped into the faulting location in the process address space.

Once the appropriate page has been mapped into the faulting location, the page-fault handler returns and reexecutes the faulting instruction.
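
Steps 1 through 3 can be sketched in a few lines of code, using the simplified structures from Section 5.2. The function name is illustrative, although the KERN_-style outcome codes follow the Mach convention used by 4.4BSD:

    #define KERN_SUCCESS          0
    #define KERN_INVALID_ADDRESS  1

    int
    vm_fault_lookup(struct vm_map *map, vm_offset_t addr,
        struct vm_object **objp, vm_offset_t *offp)
    {
        struct vm_map_entry *entry;

        /*
         * Step 2: walk the sorted entry list; a real implementation
         * first tries map->hint before falling back to a full scan.
         */
        for (entry = map->entries; entry != NULL; entry = entry->next)
            if (addr >= entry->start && addr < entry->end)
                break;
        if (entry == NULL)
            return (KERN_INVALID_ADDRESS);  /* caller posts the signal */

        /* Step 3: convert the address into an offset within the object. */
        *objp = entry->object;
        *offp = addr - entry->start + entry->offset;
        return (KERN_SUCCESS);
    }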

Mapping to Objects

Objects hold information either about a file or about an area of anonymous memory. Whether a file is mapped by a single process in the system or by many processes in the system, it will always be represented by a single object. Thus, the object is responsible for maintaining all the state about those pages of a file that are resident. All references to that file will be described by vm_map_entry structures that reference the same object. An object never stores the same page of a file in more than one memory page, so that all mappings will get a consistent view of the file.

An object stores the following information:

• A list of the pages for that object that are currently resident in main memory; a page may be mapped into multiple address spaces, but it is always claimed by exactly one object

• A count of the number of vm_map_entry structures or other objects that reference the object

• The size of the file or anonymous area described by the object

• The number of memory-resident pages held by the object

• Pointers to copy or shadow objects (described in Section 5.5)

• A pointer to the pager for the object; the pager is responsible for providing the data to fill a page, and for providing a place to store the page when it has been modified (pagers are covered in Section 5.10)

There are four types of objects in the system:

Named objects represent files; they may also represent hardware devices that are able to provide mapped memory such as frame buffers.

Anonymous objects represent areas of memory that are zero filled on first use; they are abandoned when they are no longer needed.

Shadow objects hold private copies of pages that have been modified; they are abandoned when they are no longer referenced.

Copy objects hold old pages from files that have been modified after they were privately mapped; they are abandoned when the private mapping is abandoned.

The shadow and copy objects are often referred to as “internal” objects in the source code. The type of an object is defined by the pager that that object uses to fulfill page-fault requests.

A named object uses either (an instance of) the device pager, if it maps a hardware device, or the vnode pager, if it is backed by a file in the filesystem. The device pager services a page fault by returning the appropriate address for the device being mapped. Since the device memory is separate from the main memory on the machine, it will never be selected by the pageout daemon. Thus, the device pager never has to handle a pageout request.

The vnode pager provides an interface to objects that represent files in the filesystem. A vnode-pager instance keeps a reference to a vnode that represents the file being mapped by the object. A vnode pager services a pagein request by doing a read on the vnode; it services a pageout request by doing a write to the vnode. Thus, the file itself stores the modified pages. In cases where it is not appropriate to modify the file directly, such as an executable that does not want to modify its initialized data pages, the kernel must interpose an anonymous shadow object between the vm_map_entry and the object representing the file.

Anonymous objects use the swap pager. An anonymous object services pagein requests by getting a page of memory from the free list, and zeroing that page. When a pageout request is made for a page for the first time, the swap pager is responsible for finding an unused page in the swap area, writing the contents of the page to that space, and recording where that page is stored. If a pagein request comes for a page that had been previously paged out, the swap pager is responsible for finding where it stored that page and reading back the contents into a free page in memory. A later pageout request for that page will cause the page to be written out to the previously allocated location.

Shadow objects and copy objects also use the swap pager. They work just like anonymous objects, except that the swap pager provides their initial pages by copying existing pages in response to copy-on-write faults, instead of by zero-filling pages.

Further details on the pagers are given in Section 5.10.

Objects

Each virtual-memory object has a pager associated with it; objects that map files have a vnode pager associated with them. Each instance of a vnode pager is associated with a particular vnode. Objects are stored on a hash chain and are identified by their associated pager. When a fault occurs for a file that is mapped into memory, the kernel checks its vnode pager cache to see whether a pager already exists for that file. If a pager exists, the kernel then looks to see whether there is an object still associated with that pager. If the object exists, it can be checked to see whether the faulted page is resident. If the page is resident, it can be used. If the page is not resident, a new page is allocated, and the pager is requested to fill the new page.

Caching in the virtual-memory system is organized around objects: each object holds the pages that are the cached contents of the file or region that it represents. Objects that represent anonymous memory are reclaimed as soon as the reference count drops to zero. However, objects that refer to files are persistent. When their reference count drops to zero, the object is stored on a least-recently used (LRU) list known as the object cache. The object remains on its hash chain, so that future uses of the associated file will cause the existing object to be found. The pages associated with the object are moved to the inactive list, which is described in Section 5.12. However, their identity is retained, so that, if the object is reactivated and a page fault occurs before the associated page is freed, that page can be reattached, rather than being reread from disk.

This cache is similar to the text cache found in earlier versions of BSD in that it provides performance improvements for short-running but frequently executed programs. Frequently executed programs include those to list the contents of directories, to show system status, or to do the intermediate steps involved in compiling a program. For example, consider a typical application that is made up of multiple source files. Each of several compiler steps must be run on each file in turn. The first time that the compiler is run, the objects associated with its various components are read in from the disk. For each file compiled thereafter, the previously created objects are found, alleviating the need to reload them from disk each time.

Objects to Pages

When the system is first booted, the kernel looks through the physical memory on the machine to find out how many pages are available. After the physical memory that will be dedicated to the kernel itself has been deducted, all the remaining pages of physical memory are described by vm_page structures. These vm_page structures are all initially placed on the memory free list. As the system starts running and processes begin to execute, they generate page faults. Each page fault is matched to the object that covers the faulting piece of address space. The first time that a piece of an object is faulted, the object must allocate a page from the free list and initialize that page either by zero filling it or by reading its contents from the filesystem. That page then becomes associated with the object. Thus, each object has its current set of vm_page structures linked to it. A page can be associated with at most one object at a time. Although a file may be mapped into several processes at once, all those mappings reference the same object. Having a single object for each file ensures that all processes will reference the same physical pages. One anomaly is that the object offset in a vm_map_entry structure may not be page aligned (the result of an mmap call with a non–page-aligned offset parameter). Consequently, a vm_page may be filled and associated with the object with a non–page-aligned tag that will not match another access to the same object at the page-aligned boundary. Hence, if two processes map the same object with offsets of 0 and 32, two vm_pages will be filled with largely the same data, and that can lead to inconsistent views of the file.

If memory becomes scarce, the paging daemon will search for pages that have not been used recently. Before these pages can be used by a new object, they must be removed from all the processes that currently have them mapped, and any modified contents must be saved by the object that owns them. Once cleaned, the pages can be removed from the object that owns them and can be placed on the free list for reuse. The details of the paging system are described in Section 5.12.

5.5 Shared Memory

In Section 5.4, we explained how the address space of a process is organized. This section shows the additional data structures needed to support shared address space between processes. Traditionally, the address space of each process was completely isolated from the address space of all other processes running on the system. The only exception was read-only sharing of program text. All interprocess communication was done through well-defined channels that passed through the kernel: pipes, sockets, files, and special devices. The benefit of this isolated approach is that, no matter how badly a process destroys its own address space, it cannot affect the address space of any other process running on the system. Each process can precisely control when data are sent or received; it can also precisely identify the locations within its address space that are read or written. The drawback of this approach is that all interprocess communication requires at least two system calls: one from the sending process and one from the receiving process. For high volumes of interprocess communication, especially when small packets of data are being exchanged, the overhead of the system calls dominates the communications cost.

Shared memory provides a way to reduce interprocess-communication costs dramatically. Two or more processes that wish to communicate map the same piece of read–write memory into their address space. Once all the processes have mapped the memory into their address space, any changes to that piece of memory are visible to all the other processes, without any intervention by the kernel. Thus, interprocess communication can be achieved without any system-call overhead, other than the cost of the initial mapping. The drawback to this approach is that, if a process that has the memory mapped corrupts the data structures in that memory, all the other processes mapping that memory also are corrupted. In addition, there is the complexity faced by the application developer who must develop data structures to control access to the shared memory, and must cope with the race conditions inherent in manipulating and controlling such data structures that are being accessed concurrently.

Some variants of UNIX have a kernel-based semaphore mechanism to provide the needed serialization of access to the shared memory. However, both getting and setting such semaphores require system calls. The overhead of using such semaphores is comparable to that of using the traditional interprocess-communication methods. Unfortunately, these semaphores have all the complexity of shared memory, yet confer little of its speed advantage. The primary reason to introduce the complexity of shared memory is for the commensurate speed gain. If this gain is to be obtained, most of the data-structure locking needs to be done in the shared memory segment itself. The kernel-based semaphores should be used for only those rare cases where there is contention for a lock and one process must wait. Consequently, modern interfaces, such as POSIX Pthreads, are designed such that the semaphores can be located in the shared memory region. The common case of setting or clearing an uncontested semaphore can be done by the user process, without calling the kernel. There are two cases where a process must do a system call. If a process tries to set an already-locked semaphore, it must call the kernel to block until the semaphore is available. This system call has little effect on performance because the lock is contested, so it is impossible to proceed and the kernel has to be invoked to do a context switch anyway. If a process clears a semaphore that is wanted by another process, it must call the kernel to awaken that process. Since most locks are uncontested, the applications can run at full speed without kernel intervention.
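
The division of labor can be illustrated with a lock word that lives in the shared region itself. The sketch below uses C11 atomics as a stand-in for the machine’s test-and-set instruction; the sem_kernel_block(), sem_kernel_wakeup(), and sem_has_waiters() routines are hypothetical stand-ins for the kernel’s sleep and wakeup services:

    #include <stdatomic.h>

    typedef atomic_int shlock_t;            /* lock word in shared memory */

    /* Hypothetical kernel services standing in for sleep and wakeup. */
    extern void sem_kernel_block(shlock_t *lock);
    extern void sem_kernel_wakeup(shlock_t *lock);
    extern int  sem_has_waiters(shlock_t *lock);

    void
    shlock_acquire(shlock_t *lock)
    {
        int expected = 0;

        /* Uncontested case: one atomic exchange, no kernel involvement. */
        while (!atomic_compare_exchange_strong(lock, &expected, 1)) {
            sem_kernel_block(lock);         /* contested: sleep in kernel */
            expected = 0;                   /* retry after being awakened */
        }
    }

    void
    shlock_release(shlock_t *lock)
    {
        atomic_store(lock, 0);
        if (sem_has_waiters(lock))          /* contested: must enter kernel */
            sem_kernel_wakeup(lock);
    }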

Mmap Model

When two processes wish to create an area of shared memory, they must have some way to name the piece of memory that they wish to share, and they must be able to describe its size and initial contents. The system interface describing an area of shared memory accomplishes all these goals by using files as the basis for describing a shared memory segment. A process creates a shared memory segment by using

        caddr_t addr = mmap(
            caddr_t addr,      /* base address */
            size_t len,        /* length of region */
            int prot,          /* protection of region */
            int flags,         /* mapping flags */
            int fd,            /* file to map */
            off_t offset);     /* offset to begin mapping */

to map the file referenced by descriptor fd starting at file offset offset into its address space starting at addr and continuing for len bytes with access permission prot. The flags parameter allows a process to specify whether it wants to make a shared or private mapping. Changes made to a shared mapping are written back to the file and are visible to other processes. Changes made to a private mapping are not written back to the file and are not visible to other processes. Two processes that wish to share a piece of memory request a shared mapping of the same file into their address space. Thus, the existing and well-understood filesystem name space is used to identify shared objects. The contents of the file are used as the initial value of the memory segment. All changes made to the mapping are reflected back into the contents of the file, so long-term state can be maintained in the shared memory region, even across invocations of the sharing processes.

Some applications want to use shared memory purely as a short-term interprocess-communication mechanism. They need an area of memory that is initially zeroed and whose contents are abandoned when they are done using it. Such processes want neither to pay the relatively high start-up cost associated with paging in the contents of a file to initialize a shared memory segment, nor to pay the shutdown costs of writing modified pages back to the file when they are done with the memory. Although an alternative naming scheme was considered to provide a rendezvous mechanism for such short-term shared memory, the designers ultimately decided that all naming of memory objects should use the filesystem name space. To provide an efficient mechanism for short-term shared memory, they created a virtual-memory–resident filesystem for transient objects. The details of the virtual-memory–resident filesystem are described in Section 8.4. Unless memory is in high demand, files created in the virtual-memory–resident filesystem reside entirely in memory. Thus, both the initial paging and later write-back costs are eliminated. Typically, a virtual-memory–resident filesystem is mounted on /tmp. Two processes wishing to create a transient area of shared memory create a file in /tmp that they can then both map into their address space.
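The rendezvous just described takes only a few lines of code. The sketch below assumes an agreed-upon file name under /tmp and omits error handling; each cooperating process runs the same sequence, and the first to arrive creates and extends the file.

        #include <sys/types.h>
        #include <sys/mman.h>
        #include <fcntl.h>
        #include <unistd.h>

        #define SHM_NAME "/tmp/shared.region"   /* agreed-upon name */
        #define SHM_SIZE (64 * 1024)

        char *
        map_shared_region(void)
        {
            int fd;

            fd = open(SHM_NAME, O_RDWR | O_CREAT, 0600);
            ftruncate(fd, SHM_SIZE);    /* give the file its full size */
            return ((char *)mmap(0, SHM_SIZE, PROT_READ | PROT_WRITE,
                MAP_SHARED, fd, 0));
        }

Because the mapping is shared, a store done by one process is immediately visible to every other process that has mapped the file; and because the file lives in the memory-resident filesystem, neither the initial paging nor the final write-back involves the disk.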

When a mapping is no longer needed, it can be removed using

        munmap(caddr_t addr, size_t len);

The munmap system call removes any mappings that exist in the address space, starting at addr and continuing for len bytes. There are no constraints between previous mappings and a later munmap. The specified range may be a subset of a previous mmap, or it may encompass an area that contains many mmap’ed files. When a process exits, the system does an implied munmap over its entire address space.

During its initial mapping, a process can set the protections on a page to allow reading, writing, and/or execution. The process can change these protections later by using

        mprotect(caddr_t addr, int len, int prot);

This feature can be used by debuggers when they are trying to track down a memory-corruption bug. By disabling writing on the page containing the data structure that is being corrupted, the debugger can trap all writes to the page and verify that they are correct before allowing them to occur.
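A user-level sketch of this technique appears below. It assumes a 4-Kbyte page size and that the protection fault is delivered as a signal that can be caught (the specific signal varies from system to system); the handler shown simply re-enables writing, where a real debugger would first verify the pending store.

        #include <sys/types.h>
        #include <sys/mman.h>
        #include <signal.h>

        #define PAGESIZE 4096           /* assumed page size */

        static caddr_t watched_page;    /* page holding the suspect data */

        static void
        on_fault(int sig)
        {
            /* A real debugger would examine the faulting access here. */
            mprotect(watched_page, PAGESIZE, PROT_READ | PROT_WRITE);
        }

        void
        watch_page(caddr_t page)
        {
            watched_page = page;
            signal(SIGSEGV, on_fault);  /* or SIGBUS on some systems */
            /* Revoke write access; the next store traps to on_fault(). */
            mprotect(page, PAGESIZE, PROT_READ);
        }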

Traditionally, programming for real-time systems has been done with specially written operating systems. In the interests of reducing the costs of real-time applications and of using the skills of the large body of UNIX programmers, companies developing real-time applications have expressed increased interest in using UNIX-based systems for writing these applications. Two fundamental requirements of a real-time system are guaranteed maximum latencies and predictable execution times. Predictable execution time is difficult to provide in a virtual-memory–based system, since a page fault may occur at any point in the execution of a program, resulting in a potentially large delay while the faulting page is retrieved from the disk or network. To avoid paging delays, the system allows a process to force its pages to be resident, and not paged out, by using

        mlock(caddr_t addr, size_t len);

As long as the process limits its accesses to the locked area of its address space, it can be sure that it will not be delayed by page faults. To prevent a single process from acquiring all the physical memory on the machine to the detriment of all other processes, the system imposes a resource limit to control the amount of memory that may be locked. Typically, this limit is set to no more than one-third of the physical memory, and it may be set to zero by a system administrator who does not want random processes to be able to monopolize system resources.

When a process has finished with its time-critical use of an mlock’ed region, it can release the pages using

        munlock(caddr_t addr, size_t len);

After the munlock call, the pages in the specified address range are still accessible, but they may be paged out if memory is needed and they are not accessed.
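The usage pattern is thus to wire the time-critical region, run, and unwire. A sketch, with error handling reduced to the single case that matters:

        #include <sys/types.h>
        #include <sys/mman.h>

        #define BUFSIZE (256 * 1024)

        static char sample_buf[BUFSIZE];    /* time-critical work area */

        int
        run_realtime_phase(void (*work)(char *, size_t))
        {
            /* Wire the buffer; fails if over the locked-memory limit. */
            if (mlock((caddr_t)sample_buf, sizeof(sample_buf)) < 0)
                return (-1);
            /* Accesses confined to the wired range cannot page fault. */
            work(sample_buf, sizeof(sample_buf));
            munlock((caddr_t)sample_buf, sizeof(sample_buf));
            return (0);
        }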

The architecture of some multiprocessing machines does not provide consistency between a high-speed cache local to a CPU and the machine’s main memory. For these machines, it may be necessary to flush the cache to main memory before the changes made in that memory are visible to processes running on other CPUs. A process does this synchronization using

        msync(caddr_t addr, int len);

For a region containing a mapped file, msync also writes back any modified pages to the filesystem.

Shared Mapping

When multiple processes map the same file into their address space, the system must ensure that all the processes view the same set of memory pages. As shown in Section 5.4, each file that is being used actively by a client of the virtual-memory system is represented by an object. Each mapping that a process has to a piece of a file is described by a vm_map_entry structure. An example of two processes mapping the same file into their address space is shown in Fig. 5.8. When a page fault occurs in one of these processes, the process’s vm_map_entry references the object to find the appropriate page. Since all mappings reference the same object, the processes will all get references to the same set of physical memory, thus ensuring that changes made by one process will be visible in the address spaces of the other processes as well.

A second organization arises when a process with a shared mapping does a fork. Here, the kernel interposes a sharing map between the two processes and the shared object, so that both processes’ map entries reference this map, instead of the object. A sharing map is identical in structure to an address map: It is a linked list of map entries. The intent is that a sharing map, referenced by all processes inheriting a shared memory region, will be the focus of map-related operations that should affect all the processes. Sharing maps are useful in the creation of shadow objects for copy-on-write operations because they affect part or all of the shared region. Here, all sharing processes should use the same shadow object, so that all will see modifications made to the region. Sharing maps are an artifact of the virtual-memory code’s early Mach origin; they do not work well in the 4.4BSD environment because they work for only that memory shared by inheritance. Shared mappings established with mmap do not use them. Hence, even if a sharing map exists for a shared region, it does not necessarily reflect all processes involved. The only effect that sharing maps have in 4.4BSD is to extend across forks the delayed creation of shadow and copy objects. This delay does not offer a significant advantage, and the small advantage is outweighed by the added amount and complexity of code necessary to handle sharing maps. For this reason, sharing maps probably will be eliminated from systems derived from 4.4BSD, as they were from later versions of Mach.

Figure 5.8 Multiple mappings to a file.

image

Private Mapping

A process may request a private mapping of a file. A private mapping has two main effects:

1. Changes made to the memory mapping the file are not reflected back into the mapped file.

2. Changes made to the memory mapping the file are not visible to other processes mapping the file.

An example of the use of a private mapping would be during program debugging. The debugger will request a private mapping of the program text so that, when it sets a breakpoint, the modification is not written back into the executable stored on the disk and is not visible to the other (presumably nondebugging) processes executing the program.

The kernel uses shadow objects to prevent changes made by a process from being reflected back to the underlying object. The use of a shadow object is shown in Fig. 5.9. When the initial private mapping is requested, the file object is mapped into the requesting-process address space, with copy-on-write semantics. If the process attempts to write a page of the object, a page fault occurs and traps into the kernel. The kernel makes a copy of the page to be modified and hangs it from the shadow object. In this example, process A has modified page 0 of the file object. The kernel has copied page 0 to the shadow object that is being used to provide the private mapping for process A.

If free memory is limited, it would be better simply to move the modified page from the file object to the shadow object. The move would reduce the immediate demand on the free memory, because a new page would not have to be allocated. The drawback to this optimization is that, if there is a later access to the file object by some other process, the kernel will have to allocate a new page. The kernel will also have to pay the cost of doing an I/O operation to reload the page contents. In 4.4BSD, the virtual-memory system always copies the page rather than moving it.

Figure 5.9 Use of a shadow object for a private mapping.

image

When a page fault for the private mapping occurs, the kernel traverses the list of objects headed by the vm_map_entry, looking for the faulted page. The first object in the chain that has the desired page is the one that is used. If the search gets to the final object on the chain without finding the desired page, then the page is requested from that final object. Thus, pages on a shadow object will be used in preference to the same pages in the file object itself. The details of page-fault handling are given in Section 5.11.
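The traversal can be pictured with a pared-down version of the structures involved; the field names below are illustrative rather than the kernel's own.

        #include <stddef.h>

        /* Pared-down structures; names are illustrative only. */
        struct page {
            struct page  *next;      /* next page cached by this object */
            size_t        offset;    /* offset of the page in the object */
        };
        struct object {
            struct object *backing;  /* next object down the chain */
            struct page   *pages;    /* resident pages of this object */
        };

        /*
         * Walk from the topmost shadow object toward the file object;
         * the first object holding the page wins, so shadow copies
         * take precedence over the file's own pages.
         */
        struct page *
        chain_lookup(struct object *top, size_t offset)
        {
            struct object *obj;
            struct page *pg;

            for (obj = top; obj != NULL; obj = obj->backing)
                for (pg = obj->pages; pg != NULL; pg = pg->next)
                    if (pg->offset == offset)
                        return (pg);
            /* Not resident anywhere: the caller must now request the
               page from the final object's pager. */
            return (NULL);
        }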

When a process removes a mapping from its address space (either explicitly from an munmap request or implicitly when the address space is freed on process exit), pages held by its shadow object are not written back to the file object. The shadow-object pages are simply placed back on the memory free list for immediate reuse.

When a process forks, it does not want changes to its private mappings to be visible in its child; similarly, the child does not want its changes to be visible in its parent. The result is that each process needs to create a shadow object if it continues to make changes in a private mapping. When process A in Fig. 5.9 forks, a set of shadow object chains is created, as shown in Fig. 5.10 (on page 144). In this example, process A modified page 0 before it forked, then later modified page 1. Its modified version of page 1 hangs off its new shadow object, so that those modifications will not be visible to its child. Similarly, its child has modified page 0. If the child were to modify page 0 in the original shadow object, that change would be visible in its parent. Thus, the child process must make a new copy of page 0 in its own shadow object.

If the system runs short of memory, the kernel may need to reclaim inactive memory held in a shadow object. The kernel assigns to the swap pager the task of backing the shadow object. The swap pager creates a swap map that is large enough to describe the entire contents of the shadow object. It then allocates enough swap space to hold the requested shadow pages and writes them to that area. These pages can then be freed for other uses. If a later page fault requests a swapped-out page, then a new page of memory is allocated and its contents are reloaded with an I/O from the swap area.

Figure 5.10 Shadow-object chains.

image

Collapsing of Shadow Chains

When a process with a private mapping removes that mapping either explicitly with an munmap system call or implicitly by exiting, its parent or child process may be left with a chain of shadow objects. Usually, these chains of shadow objects can be collapsed into a single shadow object, often freeing up memory as part of the collapse. Consider what happens when process A exits in Fig. 5.10. First, shadow object 3 can be freed, along with its associated page of memory. This deallocation leaves shadow objects 1 and 2 in a chain with no intervening references. Thus, these two objects can be collapsed into a single shadow object. Since they both contain a copy of page 0, and since only the page 0 in shadow object 2 can be accessed by the remaining child process, the page 0 in shadow object 1 can be freed, along with shadow object 1 itself.

If the child of process A were to exit, then shadow object 2 and the associated page of memory could be freed. Shadow objects 1 and 3 would then be in a chain that would be eligible for collapse. Here, there are no common pages, so the remaining collapsed shadow object would contain page 0 from shadow object 1, as well as page 1 from shadow object 3. A limitation of the implementation is that it cannot collapse two objects if either of them has allocated a pager. This limitation is serious, since pagers are allocated when the system begins running short of memory—precisely the time when reclaiming of memory from collapsed objects is most necessary.

Private Snapshots

When a process makes read accesses to a private mapping of an object, it continues to see changes made to that object by other processes that are writing to the object through the filesystem or that have a shared mapping to the object. When a process makes a write access to a private mapping of an object, a snapshot of the corresponding page of the object is made and is stored in the shadow object, and the modification is made to that snapshot. Thus, further changes to that page made by other processes that are writing to the page through the filesystem or that have a shared mapping to the object are no longer visible for that page. However, changes to unmodified pages of the object continue to be visible. This mix of changing and unchanging parts of the file can be confusing.

To provide a more consistent view of a file, a process may want to take a snapshot of the file at the time that it is initially privately mapped. A process takes such a snapshot by using a copy object, as shown in Fig. 5.11 (on page 146). In this example, process B has a shared mapping to the file object, whereas process A has a private mapping. Modifications made by process B will be reflected in the file, and hence will be visible to any other process (such as process A) that is mapping that file. To avoid seeing the modifications made by process B after process B has done its mapping, process A interposes a copy object between itself and the file object. At the same time, it changes the protections on the file object to be copy-on-write. Thereafter, when process B tries to modify the file object, it will generate a page fault. The page-fault handler will save a copy of the unmodified page in the copy object, then will allow process B to write the original page. If process A later tries to access one of the pages that process B has modified, it will get the page that was saved in the copy object, instead of getting the version that process B changed.

In 4.4BSD, private snapshots work correctly only if all processes modifying the file do so through the virtual-memory interface. For example, in Fig. 5.11, assume that a third process C writes page 2 of the file using write before A or B reference page 2. Now, even though A has made a snapshot of the file, it will see the modified version of page 2, since the virtual-memory system has no knowledge that page 2 was written. This behavior is an unwelcome side effect of the separate virtual memory and filesystem caches; it would be eliminated if the two caches were integrated.

Most non-BSD systems that provide the mmap interface do not provide copy-object semantics. Thus, 4.4BSD does not provide copy semantics by default; such semantics are provided only when they are requested explicitly. It is debatable whether the copy semantics are worth providing at all, because a process can obtain them trivially by reading the file in a single request into a buffer in the process address space. The added complexity and overhead of copy objects may well exceed the value of providing copy semantics in the mmap interface.

Figure 5.11 Use of a copy object.

image

5.6 Creation of a New Process

Processes are created with a fork system call. The fork is usually followed shortly thereafter by an exec system call that overlays the virtual address space of the child process with the contents of an executable image that resides in the filesystem. The process then executes until it terminates by exiting, either voluntarily, or involuntarily by receiving a signal. In Sections 5.6 to 5.9, we trace the management of the memory resources used at each step in this cycle.

A fork system call duplicates the address space of an existing process, creating an identical child process. Fork is the only way that new processes are created in 4.4BSD (except for its variant, vfork, which is described in the last subsection of this section). Fork duplicates all the resources of the original process, and copies that process’s address space.

The virtual-memory resources of the process that must be allocated for the child include the process structure and its associated substructures, and the user area that includes both the user structure and the kernel stack. In addition, the kernel must reserve storage (either memory, filesystem space, or swap space) used to back the process. The general outline of the implementation of a fork is as follows:

• Reserve virtual address space for the child process.

• Allocate a process entry for the child process, and fill it in.

• Copy to the child the parent’s process group, credentials, file descriptors, limits, and signal actions.

• Allocate a new user area, copying the current one to initialize it.

• Allocate a vmspace structure.

• Duplicate the address space, by creating copies of the parent vm_map_entry structures marked copy-on-write.

• Arrange for the child process to return 0, to distinguish its return value from the new PID that is returned by the parent process.

The allocation and initialization of the process structure, and the arrangement of the return value, were covered in Chapter 4. The remainder of this section discusses the other steps involved in duplicating a process.
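The arranged return values give fork its familiar programming idiom:

        #include <sys/types.h>
        #include <unistd.h>
        #include <stdio.h>

        int
        main(void)
        {
            pid_t pid = fork();

            if (pid == 0)
                printf("child: fork returned 0\n");
            else if (pid > 0)
                printf("parent: child has PID %d\n", (int)pid);
            else
                perror("fork");     /* insufficient resources */
            return (0);
        }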

Reserving Kernel Resources

The first resource to be reserved when an address space is duplicated is the required virtual address space. To avoid running out of memory resources, the kernel must ensure that it does not promise to provide more virtual memory than it is able to deliver. The total virtual memory that can be provided by the system is limited to the amount of physical memory available for paging plus the amount of swap space that is provided. A few pages are held in reserve to stage I/O between the swap area and main memory.

The reason for this restriction is to ensure that processes get synchronous notification of memory limitations. Specifically, a process should get an error back from a system call (such as sbrk, fork, or mmap) if there are insufficient resources to allocate the needed virtual memory. If the kernel promises more virtual memory than it can support, it can deadlock trying to service a page fault. Trouble arises when it has no free pages to service the fault and no available swap space to save an active page. Here, the kernel has no choice but to send a segmentation-fault signal to the process unfortunate enough to be page faulting. Such asynchronous notification of insufficient memory resources is unacceptable.

Excluded from this limit are those parts of the address space that are mapped read-only, such as the program text. Any pages that are being used for a read-only part of the address space can be reclaimed for another use without being saved because their contents can be refilled from the original source. Also excluded from this limit are parts of the address space that map shared files. The kernel can reclaim any pages that are being used for a shared mapping after writing their contents back to the filesystem from which they are mapped. Here, the filesystem is being used as an extension of the swap area. Finally, any piece of memory that is used by more than one process (such as an area of anonymous memory being shared by several processes) needs to be counted only once toward the virtual-memory limit.

The limit on the amount of virtual address space that can be allocated causes problems for applications that want to allocate a large piece of address space, but want to use the piece only sparsely. For example, a process may wish to make a private mapping of a large database from which it will access only a small part. Because the kernel has no way to guarantee that the access will be sparse, it takes the pessimistic view that the entire file will be modified and denies the request. One extension that many BSD-derived systems have made to the mmap system call is to add a flag that tells the kernel that the process is prepared to accept asynchronous faults in the mapping. Such a mapping would be permitted to use up to the amount of virtual memory that had not been promised to other processes. If the process then modifies more of the file than this available memory, or if the limit is reduced by other processes allocating promised memory, the kernel can then send a segmentation-fault signal to the process. On receiving the signal, the process must munmap an unneeded part of the file to release resources back to the system. The process must ensure that the code, stack, and data structures needed to handle the segmentation-fault signal do not reside in the part of the address space that is subject to such faults.

Tracking the outstanding virtual memory accurately is a complex task. The 4.4BSD system makes no effort to calculate the outstanding-memory load and can be made to promise more than it can deliver. When memory resources run out, it either picks a process to kill or simply hangs. An important future enhancement is to track the amount of virtual memory being used by the processes in the system.

Duplication of the User Address Space

The next step in fork is to allocate and initialize a new process structure. This operation must be done before the address space of the current process is duplicated because it records state in the process structure. From the time that the process structure is allocated until all the needed resources are allocated, the parent process is locked against swapping to avoid deadlock. The child is in an inconsistent state and cannot yet run or be swapped, so the parent is needed to complete the copy of its address space. To ensure that the child process is ignored by the scheduler, the kernel sets the process’s state to SIDL during the entire fork procedure.

Historically, the fork system call operated by copying the entire address space of the parent process. When large processes fork, copying the entire user address space is expensive. All the pages that are on secondary storage must be read back into memory to be copied. If there is not enough free memory for both complete copies of the process, this memory shortage will cause the system to begin paging to create enough memory to do the copy (see Section 5.12). The copy operation may result in parts of the parent and child processes being paged out, as well as the paging out of parts of unrelated processes.

The technique used by 4.4BSD to create processes without this overhead is called copy-on-write. Rather than copy each page of a parent process, both the child and parent processes resulting from a fork are given references to the same physical pages. The page tables are changed to prevent either process from modifying a shared page. Instead, when a process attempts to modify a page, the kernel is entered with a protection fault. On discovering that the fault was caused by an attempt to modify a shared page, the kernel simply copies the page and changes the protection field for the page to allow modification once again. Only pages modified by one of the processes need to be copied. Because processes that fork typically overlay the child process with a new image with exec shortly thereafter, this technique significantly improves the performance of fork.

The next step in fork is to traverse the list of vm_map_entry structures in the parent and to create a corresponding entry in the child. Each entry must be analyzed and the appropriate action taken:

• If the entry maps a read-only region, the child can take a reference to it.

• If the entry maps a privately mapped region (such as the data area or stack), the child must create a copy-on-write mapping of the region. The parent must be converted to a copy-on-write mapping of the region. If either process later tries to write the region, it will create a shadow map to hold the modified pages.

• If the entry maps a shared region, a sharing map is created referencing the shared object, and both map entries are set to reference this map.

Map entries for a process are never merged (simplified). Only entries for the kernel map itself can be merged. The kernel-map entries need to be simplified so that excess growth is avoided. It might be worthwhile to do such a merge of the map entries for a process when it forks, especially for large or long-running processes.

With the virtual-memory resources allocated, the system sets up the kernel- and user-mode state of the new process, including the hardware memory-management registers and the user area. It then clears the SIDL flag and places the process on the run queue; the new process can then begin execution.

Creation of a New Process Without Copying

When a process (such as a shell) wishes to start another program, it will generally fork, do a few simple operations such as redirecting I/O descriptors and changing signal actions, and then start the new program with an exec. In the meantime, the parent shell suspends itself with wait until the new program completes. For such operations, it is not necessary for both parent and child to run simultaneously, and therefore only one copy of the address space is required. This frequently occurring set of system calls led to the implementation of the vfork system call. In 4.4BSD, the vfork system call still exists, but it is implemented using the same copy-on-write algorithm described in this section. Its only difference is that it ensures that the parent does not run until the child has done either an exec or an exit.

The historic implementation of vfork will always be more efficient than the copy-on-write implementation because the kernel avoids copying the address space for the child. Instead, the kernel simply passes the parent’s address space to the child and suspends the parent. The child process needs to allocate only new process and user structures, receiving everything else from the parent. The child process returns from the vfork system call with the parent still suspended. The child does the usual activities in preparation for starting a new program, then calls exec. Now the address space is passed back to the parent process, rather than being abandoned, as in a normal exec. Alternatively, if the child process encounters an error and is unable to execute the new program, it will exit. Again, the address space is passed back to the parent, instead of being abandoned.

With vfork, the entries describing the address space do not need to be copied, and the page-table entries do not need to be marked and then cleared of copy-on-write. Vfork is likely to remain more efficient than copy-on-write or other schemes that must duplicate the process’s virtual address space. The architectural quirk of the vfork call is that the child process may modify the contents and even the size of the parent’s address space while the child has control. Modification of the parent’s address space is bad programming practice. Some programs that took advantage of this quirk broke when they were ported to 4.4BSD, which implemented vfork using copy-on-write.
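The canonical use, sketched below, is the shell's redirect-and-exec sequence. Because the child is borrowing its parent's address space, it restricts itself to descriptor manipulation before calling execl or _exit; the redirection relies on open returning the lowest free descriptor.

        #include <sys/types.h>
        #include <sys/wait.h>
        #include <fcntl.h>
        #include <unistd.h>

        void
        spawn_with_output(const char *path, const char *outfile)
        {
            int status;

            if (vfork() == 0) {
                /* Child: redirect stdout, then replace the image. */
                close(STDOUT_FILENO);
                open(outfile, O_WRONLY | O_CREAT | O_TRUNC, 0644);
                execl(path, path, (char *)0);
                _exit(1);   /* exec failed; return the address space */
            }
            /* Parent resumes after the child's exec or exit. */
            wait(&status);  /* collect the final exit status */
        }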

5.7 Execution of a File

The exec system call was described in Sections 2.4 and 3.1; it replaces the address space of a process with the contents of a new program obtained from an executable file. During an exec, the target executable image is validated, then the arguments and environment are copied from the current process image into a temporary area of pageable kernel virtual memory.

To do an exec, the system must allocate resources to hold the new contents of the virtual address space, set up the mapping for this address space to reference the new image, and release the resources being used for the existing virtual memory.

The first step is to reserve memory resources for the new executable image. The algorithm for the calculation of the amount of virtual address space that must be reserved was described in Section 5.6. For an executable that is not being debugged (and hence will not have its text space modified), a space reservation needs to be made for only the data and stack space of the new executable. Exec does this reservation without first releasing the currently assigned space, because the system must be able to continue running the old executable until it is sure that it will be able to run the new one. If the system released the current space and the memory reservation failed, the exec would be unable to return to the original process. Once the reservation is made, the address space and virtual-memory resources of the current process are then freed as though the process were exiting; this mechanism is described in Section 5.9.

Now, the process has only a user structure and kernel stack. The kernel now allocates a new vmspace structure and creates the list of four vm_map_entry structures:

1.  A copy-on-write, fill-from-file entry maps the text segment. A copy-on-write mapping is used, rather than a read-only one, to allow active text segments to have debugging breakpoints set without affecting other users of the binary. In 4.4BSD, some legacy code in the kernel debugging interface disallows the setting of breakpoints in binaries being used by more than one process. This legacy code prevents the use of the copy-on-write feature.

2.  A private (copy-on-write), fill-from-file entry maps the initialized data segment.

3.  An anonymous zero-fill-on-demand entry maps the uninitialized data segment.

4.  An anonymous zero-fill-on-demand entry maps the stack segment.

No further operations are needed to create a new address space during an exec system call; the remainder of the work comprises copying the arguments and environment out to the top of the new stack. Initial values are set for the registers: The program counter is set to the entry point, and the stack pointer is set to point to the argument vector. The new process image is then ready to run.

5.8 Process Manipulation of Its Address Space

Once a process begins execution, it has several ways to manipulate its address space. The system has always allowed processes to expand their uninitialized data area (usually done with the malloc() library routine). The stack is grown on an as-needed basis. The 4.4BSD system also allows a process to map files and devices into arbitrary parts of its address space, and to change the protection of various parts of its address space, as described in Section 5.5. This section describes how these address-space manipulations are done.

Change of Process Size

A process can change its size during execution by explicitly requesting more data space with the sbrk system call. Also, the stack segment will be expanded automatically if a protection fault is encountered because of an attempt to grow the stack below the end of the stack region. In either case, the size of the process address space must be changed. The size of the request is always rounded up to a multiple of page size. New pages are marked fill-with-zeros, as there are no contents initially associated with new sections of the address space.

The first step of enlarging a process’s size is to check whether the new size would violate the size limit for the process segment involved. If the new size is in range, the following steps are taken to enlarge the data area:

1. Verify that the virtual-memory resources are available.

2. Verify that the address space of the requested size immediately following the current end of the data area is not already mapped.

3. If the existing vm_map_entry is not constrained to be a fixed size because of the allocation of swap space, increment its ending address by the requested size. If the entry has had one or more of its pages written to swap space, then the current implementation of the swap pager will not permit it to grow. Consequently, a new vm_map_entry must be created with a starting address immediately following the end of the previous fixed-sized entry. Its ending address is calculated to give it the size of the request. Until a pageout forces the allocation of a fixed-sized swap partition of this new entry, the latter will be able to continue growing.

If the change is to reduce the size of the data segment, the operation is easy: Any memory allocated to the pages that will no longer be part of the address space is freed. The ending address of the vm_map_entry is reduced by the size. If the requested size reduction is bigger than the range defined by the vm_map_entry, the entire entry is freed, and the remaining reduction is applied to the vm_map_entry that precedes it. This algorithm is applied until the entire reduction has been made. Future references to these addresses will result in protection faults, as access is disallowed when the address range has been deallocated.

The allocation of the stack segment is considerably different. At exec time, the stack is allocated at its maximum possible size. Due to the lazy allocation of virtual-memory resources, this operation involves allocating only sufficient address space. Physical memory and swap space are allocated on demand as the stack grows. Hence, only step 3 of the data-growth algorithm applies to stack-growth–related page faults. An additional step is required to check that the desired growth does not exceed the dynamically changeable stack-size limit.
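Although most programs grow their data segment indirectly through malloc(), the underlying call can be made directly. A sketch:

        #include <sys/types.h>
        #include <unistd.h>

        /*
         * Extend the data segment by nbytes.  The kernel rounds the
         * request up to a page multiple; the new pages are marked
         * fill-with-zeros and consume no physical memory until touched.
         */
        char *
        grow_data(int nbytes)
        {
            char *old_break = (char *)sbrk(nbytes);

            if (old_break == (char *)-1)
                return (0);     /* size limit or resources exceeded */
            return (old_break); /* base of the newly valid region */
        }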

File Mapping

The mmap system call requests that a file be mapped into an address space. The system call may request either that the mapping be done at a particular address or that the kernel pick an unused area. If the request is for a particular address range, the kernel first checks to see whether that part of the address space is already in use. If it is in use, the kernel first does an munmap of the existing mapping, then proceeds with the new mapping.

The kernel implements the munmap system call by traversing the list of vm_map_entry structures for the process. The various overlap conditions to consider are shown in Fig. 5.12. The five cases are as follows:

Figure 5.12 Five types of overlap that the kernel must consider when adding a new address mapping.

image

1.  The new mapping exactly overlaps an existing mapping. The old mapping is deallocated as described in Section 5.9. The new mapping is created in its place as described in the paragraph following this list.

2.  The new mapping is a subset of the existing mapping. The existing mapping is split into three pieces (two pieces if the new mapping begins at the beginning or ends at the end of the existing mapping). The existing vm_map_entry structure is augmented with one or two additional vm_map_entry structures: one mapping the remaining part of the existing mapping before the new mapping, and one mapping the remaining part of the existing mapping following the new mapping. Its overlapped piece is replaced by the new mapping, as described in the paragraph following this list.

3.  The new mapping is a superset of an existing mapping. The old mapping is deallocated as described in Section 5.9, and a new mapping is created as described in the paragraph following this list.

4.  The new mapping starts part way into and extends past the end of an existing mapping. The existing mapping has its length reduced by the size of the unmapped area. Its overlapped piece is replaced by the new mapping, as described in the paragraph following this list.

5.  The new mapping extends into the beginning of an existing mapping. The existing mapping has its starting address incremented and its length reduced by the size of the covered area. Its overlapped piece is replaced by the new mapping, as described in the paragraph following this list.

In addition to the five basic types of overlap listed, a new mapping request may span several existing mappings. Specifically, a new request may be composed of zero or one of type 4, zero to many of type 3, and zero or one of type 5. When a mapping is shortened, any shadow or copy pages associated with it are released, as they are no longer needed.
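These cases reduce to a simple classification on address ranges. In the sketch below, vm_offset_t is a stand-in typedef for the kernel's address type, and the half-open ranges [ns, ne) and [es, ee) denote the new and existing mappings.

        typedef unsigned long vm_offset_t;  /* stand-in for kernel type */

        /* The five overlap cases of Fig. 5.12, plus no overlap at all. */
        enum overlap { NONE, EXACT, SUBSET, SUPERSET,
                       EXTENDS_PAST, EXTENDS_INTO };

        enum overlap
        classify_overlap(vm_offset_t ns, vm_offset_t ne,
            vm_offset_t es, vm_offset_t ee)
        {
            if (ne <= es || ns >= ee)
                return (NONE);          /* disjoint ranges */
            if (ns == es && ne == ee)
                return (EXACT);         /* case 1 */
            if (ns >= es && ne <= ee)
                return (SUBSET);        /* case 2 */
            if (ns <= es && ne >= ee)
                return (SUPERSET);      /* case 3 */
            if (ns > es)
                return (EXTENDS_PAST);  /* case 4: starts inside, runs past */
            return (EXTENDS_INTO);      /* case 5: covers the beginning */
        }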

Once the target address range has been cleared of any previous mappings, the kernel creates a new vm_map_entry to describe the new address range. If the object being mapped is already being mapped by another process, the new entry gets a reference to the existing object. This reference is obtained in the same way as described in Section 5.6, when a new process is being created and needs to map each of the regions in its parent. If this request is the first mapping of an object, then the kernel checks the object cache to see whether a previous instance of the object still exists. If one does, then that object is activated and referenced by the new vm_map_entry.

If the object is not found, then a new object must be created. First, a new object is allocated. Next, the kernel must determine what is being mapped, so that it can associate the correct pager with the object (pagers are described in Section 5.10). Once the object and its pager have been set up, the new vm_map_entry can be set to reference the object.

Change of Protection

A process may change the protections associated with a region of its virtual memory by using the mprotect system call. The size of the region to be protected may be as small as a single page. Because the kernel depends on the hardware to enforce the access permissions, the granularity of the protection is limited by the underlying hardware. A region may be set for any combination of read, write, and execute permissions. Many architectures do not distinguish between read and execute permissions; on such architectures, the execute permission is treated as read permission.

The kernel implements the mprotect system call by finding the existing vm_map_entry structure or structures that cover the region specified by the call. If the existing permissions are the same as the request, then no further action is required. Otherwise, the new permissions are compared to the maximum protection value associated with the vm_map_entry. The maximum value is set at mmap time and reflects the maximum value allowed by the underlying file. If the new permissions are valid, one or more new vm_map_entry structures have to be set up to describe the new protections. The set of overlap conditions that must be handled is similar to that described in the previous subsection. Instead of replacing the object underlying the new vm_map_entry structures, these vm_map_entry structures still reference the same object; the difference is that they grant different access permissions to it.

5.9 Termination of a Process

The final change in process state that relates to the operation of the virtual-memory system is exit; this system call terminates a process, as described in Chapter 4. The part of exit that is discussed here is the release of the virtual-memory resources of the process. The release is done in two steps:

1.  The user portions of the address space are freed, both in memory and on swap space.

2.  The user area is freed.

These two operations are complicated because the kernel stack in the user area must be used until the process relinquishes the processor for the final time.

The first step—freeing the user address space—is identical to the one that occurs during exec to free the old address space. The free operation proceeds entry by entry through the list of vm_map_entry structures associated with the address space. The first step in freeing an entry is to traverse the latter’s list of shadow and copy objects. If the entry is the last reference to a shadow or copy object, then any memory or swap space that is associated with the object can be freed. In addition, the machine-dependent routines are called to unmap and free up any page table or data structures that are associated with the object. If the shadow or copy object is still referenced by other vm_map_entry structures, its resources cannot be freed, but the kernel still needs to call the machine-dependent routines to unmap and free the resources associated with the current process mapping. Finally, if the underlying object referenced by the vm_map_entry is losing its last reference, then that object is a candidate for deallocation. If it is an object that will never have any chance of a future reuse (such as an anonymous object associated with a stack or uninitialized data area), then its resources are freed as though it were a shadow or copy object. However, if the object maps a file (such as an executable) that might be used again soon, the object is saved in the object cache, where it can be found by newly executing processes or by processes mapping in a file. The number of unreferenced cached objects is limited to a threshold set by the system (typically 100). If adding this new object would cause the cache to grow beyond its limit, the least recently used object in the cache is removed and deallocated.

Next, the memory used by the user area must be freed. This operation begins the problematic time when the process must free resources that it has not yet finished using. It would be disastrous if a page from the user structure or kernel stack were reallocated and reused before the process had finished the exit(). Memory is allocated either synchronously by the page-fault handler or asynchronously from interrupt handlers that use malloc(), such as the network when packets arrive (see Chapter 12). To block any allocation of memory, it is necessary to delay interrupts by raising the processor interrupt-priority level. The process may then free the pages of its user area, safe from having them reused until it has relinquished the processor. The next context switch will lower the priority so that interrupts may resume.

With all its resources free, the exiting process finishes detaching itself from its process group, and notifies its parent that it is done. The process has now become a zombie process—one with no resources, not even a kernel stack. Its parent will collect its exit status with a wait call, and will free its process structure.

There is nothing for the virtual-memory system to do when wait is called: All virtual-memory resources of a process are removed when exit is done. On wait, the system just returns the process status to the caller, and deallocates the process-table entry and the small amount of space in which the resource-usage information was kept.

5.10 The Pager Interface

The pager interface provides the mechanism by which data are moved between backing store and physical memory. The 4.4BSD pager interface is a modification of the interface present in Mach 2.0. The interface is page based, with all data requests made in multiples of the software page size. Vm_page structures are passed around as descriptors providing the backing-store offset and physical cache-page address of the desired data. This interface should not be confused with the current Mach 3.0 external paging interface [Young, 1989], where pagers are typically user applications outside the kernel and are invoked via asynchronous remote procedure calls using the Mach interprocess-communication mechanism. The 4.4BSD interface is internal in the sense that the pagers are compiled into the kernel and pager routines are invoked via simple function calls.

Associated with each object is a pager_struct structure representing an instance of the pager type responsible for supplying the contents of pages within the object. This structure contains pointers to type-specific routines for reading and writing data, as well as a pointer to instance-specific storage. Conceptually, the pager_struct structure describes a logically contiguous piece of backing store, such as a chunk of swap space or a disk file. A pager_struct and any associated instance-specific data are collectively known as a pager instance in the following discussion.

A pager instance is typically created at the same time as the object when a file, device, or piece of anonymous memory is mapped into a process address space. The pager instance continues to exist until the object is deallocated. When a page fault occurs for a virtual address mapping a particular object, the fault-handling code allocates a vm_page structure and converts the faulting address to an offset within the object. This offset is recorded in the vm_page structure, and the page is added to the list of pages cached by the object. The page frame and the object’s pager instance are then passed to the underlying pager routine. The pager routine is responsible for filling the vm_page structure with the appropriate initial value for that offset of the object that it represents.

The pager instance is also responsible for saving the contents of a dirty page if the system decides to push out the latter to backing store. When the pageout daemon decides that a particular page is no longer needed, it requests the object that owns the page to free the page. The object first passes the page with the associated logical offset to the underlying pager instance, to be saved for future use. The pager instance is responsible for finding an appropriate place to save the page, doing any I/O necessary for the save, and then notifying the object that the page can be freed. When it is done, the pager instance notifies the pageout daemon to move the vm_page structure from the object to the free list for future use.

There are seven routines associated with each pager type. The pgo_init routine is called at boot time to do any one-time type-specific initializations, such as allocating a pool of private pager structures. The pgo_alloc and pgo_dealloc routines are called when an instance of a pager should be created or destroyed. The allocation routine is called whenever the corresponding object is mapped into an address space via mmap. Hence, only the first call should create the structure; successive calls just increment the reference count for the associated object and return a pointer to the existing pager instance. The deallocation routine is called only when the object reference count drops to zero.

Pgo_getpages is called to return one or more pages of data from a pager instance either synchronously or asynchronously. Currently, this routine is called from only the page-fault handler to synchronously fill single pages. Pgo_putpages writes back one or more pages of data. This routine is called by the pageout daemon to write back one or more pages asynchronously, and by msync to write back single pages synchronously or asynchronously. Both the get and put routines are called with an array of vm_page structures indicating the affected pages.

Pgo_cluster takes an offset and returns an enclosing offset range representing an optimal I/O transfer unit for the backing store. This range can be used with pgo_getpages and pgo_putpages to help do informed prefetching or clustered cleaning. Currently, it is used by only the pageout daemon for the latter task. The pgo_haspage routine queries a pager instance to see whether that instance has data at a particular backing-store offset. This routine is used in only the page-fault handler, to determine whether an internal copy object already has received a copy of a particular page.
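Conceptually, the seven routines form a per-type table of function pointers. The signatures below are simplified for illustration; the real kernel interface carries additional arguments and flags.

        #include <stddef.h>

        struct vm_page;         /* page descriptors passed to the pager */
        struct pager_struct;    /* instance data: swap map, vnode, ... */

        struct pagerops {
            void (*pgo_init)(void);               /* once, at boot time */
            struct pager_struct *(*pgo_alloc)(void *handle, size_t size);
            void (*pgo_dealloc)(struct pager_struct *pager);
            int  (*pgo_getpages)(struct pager_struct *pager,
                     struct vm_page **pages, int npages, int sync);
            int  (*pgo_putpages)(struct pager_struct *pager,
                     struct vm_page **pages, int npages, int sync);
            void (*pgo_cluster)(struct pager_struct *pager, size_t offset,
                     size_t *lo, size_t *hi);     /* optimal I/O range */
            int  (*pgo_haspage)(struct pager_struct *pager, size_t offset);
        };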

The three types of pagers supported by the system are described in the next three subsections.

Vnode Pager

The vnode pager handles objects that map files in a filesystem. Whenever a file is mapped either explicitly by mmap or implicitly by exec, the vnode-pager allocation routine is called. If the call represents the first mapping of the vnode, the necessary vnode-pager–specific structure is created, and an object of the appropriate size is allocated and is associated with the pager instance. The vnode-pager structure contains a pointer to the vnode and a copy of the latter’s current size. The vnode reference count is incremented to reflect the pager reference. If this initialization call is not the first for a vnode, the existing pager structure is located. In either case, the associated object’s reference count is incremented, and a pointer to the pager instance is returned.

When a pagein request is received by the vnode-pager read routine, the provided physical page is mapped into the kernel address space long enough for the pager instance to call the filesystem VOP_READ vnode operation to load the page with the file contents. Once the page is filled, the kernel mapping can be dropped, and the page can be returned.

When the vnode pager is asked to save a page to be freed, it simply arranges to write back the page to the part of the file from which the page came. The page is mapped into the kernel address space long enough for the pager routine to call the filesystem VOP_WRITE vnode operation to store the page back into the file. Once the page is stored, the kernel mapping can be dropped, and the object can be notified that the page can be freed.

If a file is being privately mapped, then modified pages cannot be written back to the filesystem. Such private mapping must use a shadow object with a swap pager for all pages that are modified. Thus, a privately mapped object will never be asked to save any dirty pages to the underlying file.

When the last address-space mapping of a vnode is removed by munmap or exit, the vnode-pager deallocation routine is called. This routine releases the vnode reference and frees the vnode-pager structure.

The vnode-pager I/O routines use the VOP_READ and VOP_WRITE vnode operations that pass data through any caches maintained by filesystems (e.g., the buffer cache used by UFS and NFS). The problem with this approach is that the virtual-memory system maintains a cache of file pages that is independent of the filesystem caches, resulting in potential double caching of file data. This condition leads to inefficient cache use, and worse, to the potential for inconsistencies between the two caches. Modifications to files that are mapped into memory are not seen by processes that read those files until the mapped file is written back to the filesystem and reread into the filesystem cache. Similarly, changes to files written to the filesystem are not visible to processes that map those files until the file is written back to disk and then page faulted into the process. The writeback and rereading may take seconds to hours, depending on the level of memory activity.

In 4.4BSD, this problem is addressed in an ad hoc and incomplete fashion. Two vnode-pager–specific routines are called from various points in the VFS code. Vnode_pager_setsize() is invoked when a file changes size. If the file has shrunk, any excess cached pages are removed from the object. This page removal guarantees that future mapped references to those pages will cause page faults, and in turn, will generate a signal to the mapping process. Vnode_pager_uncache() removes the object representing a vnode from the object cache. Recall that the object cache contains only objects that are not currently referenced; thus, this routine will not help to maintain consistency for an object that is currently mapped.

A more consistent interface can be obtained by using a common cache for both the virtual-memory system and the filesystem. Three approaches to merging the two caches are being undertaken. One approach is to have the filesystem use objects in the virtual-memory system as its cache; a second approach is to have the virtual-memory objects that map files use the existing filesystem cache; the third approach is to create a new cache that is a merger of the two existing caches, and to convert both the virtual memory and the filesystems to use this new cache. Each of these approaches has its merits and drawbacks; it is not yet clear which approach will work best.

Device Pager

The device pager handles objects representing memory-mapped hardware devices. Memory-mapped devices provide an interface that looks like a piece of memory. An example of a memory-mapped device is a frame buffer, which presents a range of memory addresses with one word per pixel on the screen. The kernel provides access to memory-mapped devices by mapping the device memory into a process’s address space. The process can then access that memory without further operating-system intervention. Writing to a word of the frame-buffer memory causes the corresponding pixel to take on the appropriate color and brightness.

The device pager is fundamentally different from the other two pagers in that it does not fill provided physical-memory pages with data. Instead, it creates and manages its own vm_page structures, each of which describes a page of the device space. This approach makes device memory look like wired physical memory. Thus, no special code should be needed in the remainder of the virtual-memory system to handle device memory.

When a device is first mapped, the device-pager allocation routine will validate the desired range by calling the device d_mmap() routine. If the device allows the requested access for all pages in the range, the device pager creates a device-pager structure and associated object. It does not create vm_page structures at this time—they are created individually by the page-get routine as they are referenced. The reason for this late allocation is that some devices export a large memory range in which either not all pages are valid or the pages may not be accessed for common operations. Complete allocation of vm_page structures for these sparsely accessed devices would be wasteful.

The first access to a device page will cause a page fault and will invoke the device-pager page-get routine. The pager instance creates a vm_page structure, initializes the latter with the appropriate object offset and a physical address returned by the device d_mmap() routine, and flags the page as fictitious. This vm_page structure is added to the list of all such allocated pages in the device-pager structure. Since the fault code has no special knowledge of the device pager, it has preallocated a physical-memory page to fill and has associated that vm_page structure with the object. The device-pager routine removes that vm_page structure from the object, returns the structure to the free list, and inserts its own vm_page structure in the same place.

The device-pager page-put routine expects never to be called and will panic if it is. This behavior is based on the assumption that device-pager pages are never entered into any of the paging queues and hence will never be seen by the pageout daemon. However, it is possible to msync a range of device memory. This operation brings up an exception to the higher-level virtual-memory system’s ignorance of device memory: The object page-cleaning routine will skip pages that are flagged as fictitious.

Finally, when a device is unmapped, the device-pager deallocation routine is invoked. This routine deallocates the vm_page structures that it allocated, as well as the device-pager structure itself.

Swap Pager

The term swap pager refers to two functionally different pagers. In the most common use, swap pager refers to the pager that is used by objects that map anonymous memory. This pager has sometimes been referred to as the default pager because it is the pager that is used if no other pager has been requested. It provides what is commonly known as swap space: nonpersistent backing store that is zero filled on first reference. The zero filling is really done by the fault-handling code, without ever invoking the swap pager. Because of the zero-filling optimization and the transient nature of the backing store, allocation of swap-pager resources for a particular object may be delayed until the first pageout operation. Until that time, the pager-structure pointer in the object is NULL. While the object is in this state, page faults (getpage) are handled by zero filling, and page queries (haspage) are not necessary. The expectation is that free memory will be plentiful enough that it will not be necessary to swap out any pages. The object will simply create zero-filled pages during the process lifetime that can all be returned to the free list when the process exits.

The role of the swap pager is swap-space management: figuring out where to store dirty pages and how to find dirty pages when they are needed again. Shadow objects require that these operations be efficient. A typical shadow object is sparsely populated: It may cover a large range of pages, but only those pages that have been modified will be in the shadow object’s backing store. In addition, long chains of shadow objects may require numerous pager queries to locate the correct copy of an object page to satisfy a page fault. Hence, determining whether a pager instance contains a particular page needs to be fast, preferably requiring no I/O operations. A final requirement of the swap pager is that it be able to do asynchronous writeback of dirty pages. This requirement is imposed by the pageout daemon, which is a single-threaded process. If the pageout daemon blocked waiting for a page-clean operation to complete before starting the next operation, it would be unlikely to keep enough memory free in times of heavy memory demand.

In theory, any pager that meets these criteria can be used as the swap pager. In Mach 2.0, the vnode pager was used as the swap pager. Special paging files could be created in any filesystem and registered with the kernel. The swap pager would then suballocate pieces of the files to back particular anonymous objects. Asynchronous writes were a side effect of the filesystem’s use of the buffer cache. One obvious advantage of using the vnode pager is that swap space can be expanded by the addition of more swap files or the extension of existing ones dynamically (i.e., without rebooting or reconfiguring of the kernel). The main disadvantage is that, in the past, the filesystem has not been able to deliver a respectable fraction of the disk bandwidth.

The desire to provide the highest possible disk bandwidth led to the creation of a special raw-partition pager to use as the swap pager for 4.4BSD. Previous versions of BSD also used dedicated disk partitions, commonly known as swap partitions; hence, this partition pager became the swap pager. The remainder of this section describes how the partition pager is implemented, and how it provides the necessary capabilities for backing anonymous objects.

As mentioned, a swap-pager instance will not be created until the first time that a page from the object is replaced in memory. At that time, a structure is allocated to describe the swap space that can hold the object. This swap space is described by an array of fixed-sized swap blocks. The size of each swap block is selected based on the size of the object that the swap pager is managing. For a small object, a minimal-sized (32-Kbyte) swap block will be used; for a large object, each swap block may be as large as 32 × pagesize. For a machine such as the HP300 with a pagesize of 4 Kbyte, the maximum swap-block size will be 128 Kbyte. A swap block is always an integral number of virtual-memory pages, because those are the units used in the pager interface.

The two structures created by the swap pager are shown in Fig. 5.13. The swpager structure describes the swap area being managed by the pager. It records the total size of the object (object size), the size of each swap block being managed (block size), and the number of swap blocks that make up the swap area for the object (block count). It also has a pointer to an array of block-count swblock structures, each containing a device block number and a bit mask. The block number gives the address of the first device block in a contiguous chunk of block-size DEV_BSIZE–sized blocks that form the swap block, or is zero if a swap block has never been allocated. A mask of 1 bit per page-sized piece within this swap block records which pages in the block contain valid data. A bit is set when the corresponding page is first written to the swap area. Together, the swblock array and associated bit masks provide a two-level page table describing the backing store of an object. This structure provides efficient swap-space allocation for sparsely populated objects, since a given swap block does not need to be allocated until the first time that a page in its block-size range is written back. The structure also allows efficient page lookup: at most an array-indexing operation and a bit-mask operation.
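The sketch below shows how the two structures might be declared and how they combine to answer a page query with a single array index and a single mask test; the declarations are illustrative, not the exact 4.4BSD definitions.

    typedef long daddr_t;                   /* device block number */
    typedef unsigned long vm_offset_t, vm_size_t;
    #define PAGE_SIZE 4096
    #define DEV_BSIZE 512
    #define dbtob(nb) ((nb) * DEV_BSIZE)    /* device blocks to bytes */

    struct swblock {
        daddr_t sw_block;       /* first device block; 0 if unallocated */
        int     sw_mask;        /* 1 bit per page with valid swap data */
    };

    struct swpager {
        vm_size_t       sw_osize;   /* object size, in bytes */
        int             sw_bsize;   /* swap-block size, in DEV_BSIZE units */
        int             sw_nblocks; /* entries in sw_blocks[] */
        struct swblock *sw_blocks;  /* the two-level map itself */
    };

    /*
     * Does the swap area hold a valid copy of the page at byte offset
     * "offset" within the object?  One array index plus one mask test.
     */
    int
    swap_has_page(struct swpager *swp, vm_offset_t offset)
    {
        vm_size_t bytes = dbtob((vm_size_t)swp->sw_bsize);
        struct swblock *swb = &swp->sw_blocks[offset / bytes];
        int bit = 1 << ((offset % bytes) / PAGE_SIZE);

        return (swb->sw_block != 0 && (swb->sw_mask & bit) != 0);
    }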

The size of the object is frozen at the time of allocation. Thus, if the anonymous area continues to grow (such as the stack or heap of a process), a new object must be created to describe the expanded area. On a system that is short of memory, the result is that a large process may acquire many anonymous objects. Changing the swap pager to handle growing objects would cut down on this object proliferation dramatically.

Figure 5.13 Structures used to manage swap space.

image

Figure 5.14 A kernel resource map.

image

Allocation of swap blocks from the system’s pool of swap space is managed with a resource map called swapmap. Resource maps are ordered arrays of <base, size> pairs describing the free segments of a resource (see Fig. 5.14). A segment of a given size is allocated from a resource map by rmalloc(), using a first-fit algorithm, and is subsequently freed with rmfree(). The swapmap is initialized at boot time to contain all the available swap space. The index into the swapmap at which space has been allocated is used as the disk address of that space within the swap area.
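A minimal sketch of a resource map and its first-fit allocator follows. The structure layout mirrors the <base, size> description above, but the real rmalloc() compacts away exact-fit holes and rmfree() coalesces freed segments back into the map; both refinements are omitted here.

    /* A <base, size> resource map; a zero size ends the array. */
    struct mapent {
        long m_addr;            /* base of this free segment */
        long m_size;            /* length of this free segment */
    };

    /*
     * First-fit allocation of "size" units from the map; returns the
     * base of the allocated segment, or 0 on failure.
     */
    long
    rmalloc(struct mapent *map, long size)
    {
        struct mapent *ep;

        for (ep = map; ep->m_size != 0; ep++) {
            if (ep->m_size >= size) {
                long addr = ep->m_addr;

                ep->m_addr += size;     /* shrink the segment in place */
                ep->m_size -= size;
                return (addr);
            }
        }
        return (0);
    }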

The swap pager is complicated considerably by the requirement that it handle asynchronous writes from the pageout daemon. The protocol for managing these writes is described in Section 5.12.

5.11 Paging

When the memory-management hardware detects an invalid virtual address, it generates a trap to the system. This page-fault trap can occur for several reasons. Most BSD programs are created in a format that permits the executable image to be paged into main memory directly from the filesystem. When a program in a demand-paged format is first run, the kernel marks as invalid the pages for the text and initialized data regions of the executing process. The text and initialized data regions share an object that provides fill-on-demand from the filesystem. As each page of the text or initialized data region is first referenced, a page fault occurs.

Page faults can also occur when a process first references a page in the uninitialized data region of a program. Here, the anonymous object managing the region automatically allocates memory to the process and initializes the newly assigned page to zero. Other types of page faults arise when previously resident pages have been reclaimed by the system in response to a memory shortage.

The handling of page faults is done with the vm_fault() routine; this routine services all page faults. Each time vm_fault() is invoked, it is provided the virtual address that caused the fault. The first action of vm_fault() is to traverse the vm_map_entry list of the faulting process to find the entry associated with the fault. The routine then computes the logical page within the underlying object and traverses the list of objects to find or create the needed page. Once the page has been found, vm_fault() must call the machine-dependent layer to validate the faulted page, and return to restart the process.

The details of calculating the address within the object were described in Section 5.4. Having computed the offset within the object and determined the object’s protection and object list from the vm_map_entry, the kernel is ready to find or create the associated page. The page-fault-handling algorithm is shown in Fig. 5.15 (on pages 164 and 165). In the following overview, the lettered points are references to the tags down the left side of the code.

A.  The loop traverses the list of shadow, copy, anonymous, and file objects until it either finds an object that holds the sought-after page, or reaches the final object in the list. If no page is found, the final object will be requested to produce it.

B.  An object with the desired page has been found. If the page is busy, another process may be in the middle of faulting it in, so this process is blocked until the page is no longer busy. Since many things could have happened to the affected object while the process was blocked, it must restart the entire fault-handling algorithm. If the page was not busy, the algorithm exits the loop with the page.

C.  Anonymous objects (such as those used to represent shadow and copy objects) do not allocate a pager until the first time that they need to push a page to backing store. Thus, if an object has a pager, then there is a chance that the page previously existed but was paged out. If the object does have a pager, then the kernel needs to allocate a page to give to the pager to be filled (see D). The special case for the object being the first object is to avoid a race condition between two processes trying to get the same page. The first process through will create the sought-after page in the first object, but will keep it marked as busy. When the second process tries to fault the same page, it will find the page created by the first process and block on it (see B). When the first process completes the pagein processing, it will unlock the first page, causing the second process to awaken, retry the fault, and find the page created by the first process.

D.  If the page is present in the file or swap area, the pager will bring it back into the newly allocated page. If the pagein succeeds, then the sought-after page has been found. If the page never existed, then the pagein request will fail. Unless this object is the first, the page is freed and the search continues. If this object is the first, the page is not freed, so that it will act as a block to further searches by other processes (as described in C).

E.  If the kernel created a page in the first object but did not use that page, it will have to remember that page so that it can free the page when the pagein is done (see M).

F.  If the search has reached the end of the object list and has not found the page, then the fault is on an anonymous object chain, and the first object in the list will handle the page fault using the page allocated in C. The first_page entry is set to NULL to show that it does not need to be freed, the page is zero filled, and the loop is exited.

Figure 5.15 Page-fault handling.

image

image

G.  The search exits the loop with page as the page that has been found or allocated and initialized, and object as the owner of that page. The page has been filled with the correct data at this point.

H.  If the object providing the page is not the first object, then this mapping must be private, with the first object being a shadow object of the object providing the page. If pagein is handling a write fault, then the contents of the page that it has found have to be copied to the page that it allocated for the first object. Having made the copy, it can release the object and page from which the copy came, as the first object and first page will be used to finish the page-fault service. If pagein is handling a read fault, it can use the page that it found, but it has to mark the page copy-on-write to avoid the page being modified in the future.

I.  Pagein is handling a read fault. It can use the page that it found, but has to mark the page copy-on-write to avoid the page being modified before pagein has had a chance to copy the page for the copy object.

J.  If the copy object already has a copy of the page in memory, then pagein does not have to worry about saving the one that it just created.

K.  If there is no page in memory, the copy object may still have a copy in its backing store. If the copy object has a pager, the vm_pager_has_page() routine is called to find out whether the copy object still has a copy of the page in its backing store. This routine does not return any data; the blank page is allocated only to avoid a race with other faults. If the copy object has no pager, the page cannot exist in backing store, so pagein must copy the page that it found to the page owned by the copy object. After doing this copying, pagein has to remove all existing mappings for the page from which it copied, so that future attempts to access that page will fault and find the page that pagein left in the copy object.

L.  If pagein is handling a write fault, then it has made any copies that were necessary, so it can safely make the page writable.

M.  As the page and possibly the first_page are released, any processes waiting for that page of the object will get a chance to run to get their own references.

Note that the page and object locking has been elided in Fig. 5.15 to simplify the explanation. In 4.4BSD, no clustering is done on pagein; only the requested page is brought in from the backing store.
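The sketch below restates steps A through F and M in C form. It collapses Fig. 5.15 considerably: locking and error handling are elided, the write-fault and copy-object steps (H through L) are reduced to a comment, and every helper name is illustrative rather than an actual kernel routine.

    #include <stddef.h>

    typedef struct vm_object { struct vm_object *shadow; void *pager; } *vm_object_t;
    typedef struct vm_page { int busy; } *vm_page_t;
    typedef unsigned long vm_offset_t;
    #define PAGER_OK 0

    /* Illustrative helpers, not actual kernel routines: */
    extern vm_page_t lookup_page(vm_object_t, vm_offset_t);
    extern vm_page_t alloc_busy_page(vm_object_t, vm_offset_t);
    extern int pager_get(void *pager, vm_page_t);
    extern void wait_for(vm_page_t);
    extern void free_page(vm_page_t);
    extern void zero_fill(vm_page_t);

    vm_page_t
    find_or_create_page(vm_object_t first_object, vm_offset_t offset)
    {
        vm_object_t object;
        vm_page_t page, first_page;

    restart:
        first_page = NULL;
        for (object = first_object;; object = object->shadow) {    /* A */
            page = lookup_page(object, offset);
            if (page != NULL) {
                if (page->busy) {                                   /* B */
                    if (first_page != NULL)
                        free_page(first_page);
                    wait_for(page);
                    goto restart;
                }
                break;                          /* usable page found */
            }
            if (object->pager != NULL || object == first_object) { /* C */
                page = alloc_busy_page(object, offset);
                if (object->pager != NULL &&
                    pager_get(object->pager, page) == PAGER_OK)    /* D */
                    break;                      /* pager filled the page */
                if (object != first_object)
                    free_page(page);            /* page never existed here */
                else
                    first_page = page;          /* keep as a block; see E */
            }
            if (object->shadow == NULL) {                           /* F */
                page = first_page;              /* anonymous-chain fault */
                first_page = NULL;
                zero_fill(page);
                break;
            }
        }
        /* G through L: write-fault copying and the copy-object cases. */
        if (first_page != NULL)                                     /* M */
            free_page(first_page);
        return (page);
    }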

5.12 Page Replacement

The service of page faults and other demands for memory may be satisfied from the free list for some time, but eventually memory must be reclaimed for reuse. Some pages are reclaimed when processes exit. On systems with a large amount of memory and low memory demand, exiting processes may provide enough free memory to fill demand. This case arises when there is enough memory for the kernel and for all pages that have ever been used by any current process. Obviously, many computers do not have enough main memory to retain all pages in memory. Thus, it eventually becomes necessary to move some pages to secondary storage—to the swap space. Bringing in a page is demand driven. For paging it out, however, there is no immediate indication when a page is no longer needed by a process. The kernel must implement some strategy for deciding which pages to move out of memory so that it can replace these pages with the ones that are currently needed in memory. Ideally, the strategy will choose pages for replacement that will not be needed soon. An approximation to this strategy is to find pages that have not been used recently.

The system implements demand paging with a page-replacement algorithm that approximates global least-recently-used replacement [Corbato, 1968; Easton & Franaszek, 1979]. It is an example of a global replacement algorithm: one in which the choice of a page for replacement is made according to systemwide criteria. A local replacement algorithm would choose a process for which to replace a page, and would then choose a page based on per-process criteria. Although the algorithm in 4.4BSD is similar in nature to that in 4.3BSD, its implementation is considerably different.

The kernel scans physical memory on a regular basis, considering pages for replacement. The use of a systemwide list of pages forces all processes to compete for memory on an equal basis. Note that it is also consistent with the way that 4.4BSD treats other resources provided by the system. A common alternative to allowing all processes to compete equally for memory is to partition memory into multiple independent areas, each localized to a collection of processes that compete with one another for memory. This scheme is used, for example, by the VMS operating system [Kenah & Bate, 1984]. With this scheme, system administrators can guarantee that a process, or collection of processes, will always have a minimal percentage of memory. Unfortunately, this scheme can be difficult to administer. Allocating too small a number of pages to a partition can result in underutilization of memory and excessive I/O activity to secondary-storage devices, whereas setting the number too high can result in excessive swapping [Lazowska & Kelsey, 1978].

The kernel divides the main memory into four lists:

1.  Wired: Wired pages are locked in memory and cannot be paged out. Typically, these pages are being used by the kernel or have been locked down with mlock. In addition, all the pages being used to hold the user areas of loaded (i.e., not swapped-out) processes are also wired.

2.  Active: Active pages are being used by one or more regions of virtual memory. Although the kernel can page them out, doing so is likely to cause an active process to fault them back again.

3.  Inactive: Inactive pages have contents that are still known, but they are not usually part of any active region. If the system becomes short of memory, the pageout daemon may try to move active pages to the inactive list in the hope of finding pages that are not really in use. The selection criteria that the pageout daemon uses to choose pages to move from the active list to the inactive list are described later in this section. When the free-memory list drops too low, the pageout daemon traverses the inactive list to create more free pages.

4.  Free: Free pages have no useful contents and will be used to fulfill new page-fault requests.

The pages of main memory that can be used by user processes are those on the active, inactive, and free lists.

Ideally, the kernel would maintain a working set for each process in the system. It would then know how much memory to provide to each process to minimize that process’s page-fault behavior. The 4.4BSD virtual-memory system does not use the working-set model because it lacks accurate information about the reference pattern of a process. It does track the number of pages held by a process via the resident-set size, but it does not know which of the resident pages constitute the working set. In 4.3BSD, the count of resident pages was used in making decisions on whether there was enough memory for a process to be swapped in when that process wanted to run. This feature was not carried over to the 4.4BSD virtual-memory system. Because it worked well during periods of high memory demand, this feature should be incorporated in future 4.4BSD systems.

Paging Parameters

The memory-allocation needs of processes compete constantly, through the page-fault handler, with the overall system goal of maintaining a minimum threshold of pages in the free list. As the system operates, it monitors main-memory utilization, and attempts to run the pageout daemon frequently enough to keep the amount of free memory at or above the minimum threshold. When the page-allocation routine, vm_page_alloc(), determines that more memory is needed, it awakens the pageout daemon.

The work of the pageout daemon is controlled by several parameters that are calculated during system startup. These parameters are fine tuned by the pageout daemon as it runs based on the memory available for processes to use. In general, the goal of this policy is to maintain free memory at, or above, a minimum threshold. The pageout daemon implements this policy by reclaiming pages for the free list. The number of pages to be reclaimed by the pageout daemon is a function of the memory needs of the system. As more memory is needed by the system, more pages are scanned. This scanning causes the number of pages freed to increase.

The pageout daemon determines the memory needed by comparing the number of free memory pages against several parameters. The first parameter, free_target, specifies a threshold (in pages) for stopping the pageout daemon. When available memory is above this threshold, no pages will be paged out by the pageout daemon. Free_target is normally 7 percent of user memory. The other interesting limit specifies the minimum free memory considered tolerable, free_min; this limit is normally 5 percent of user memory. If the amount of free memory goes below free_min, the pageout daemon is started. The desired size of the list of inactive pages is kept in inactive_target; this limit is normally 33 percent of available user memory. The size of this threshold changes over time as more or less of the system memory is wired down by the kernel. If the number of inactive pages goes below inactive_target, the pageout daemon begins scanning the active pages to find candidates to move to the inactive list.

The desired values for the paging parameters are communicated to the page-out daemon through global variables. Likewise, the pageout daemon records its progress in a global variable. Progress is measured by the number of pages scanned over each interval that it runs.
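To make the interplay of these parameters concrete, here is a minimal sketch in C. The variable names follow the text, but the check itself, and the helper routines it calls, are illustrative condensations of the real vm_page_alloc() and pageout-daemon logic.

    /* Paging thresholds, expressed in pages (see text for percentages). */
    int free_target;            /* stop reclaiming above this (about 7%) */
    int free_min;               /* wake the daemon below this (about 5%) */
    int inactive_target;        /* desired inactive-list size (about 33%) */

    extern int cnt_free, cnt_inactive;  /* current list sizes, in pages */
    extern void wakeup_pageout_daemon(void);
    extern void fill_inactive_list(int npages);

    /*
     * Condensed from the interaction between vm_page_alloc() and the
     * pageout daemon: wake the daemon when free memory is scarce, and
     * have the daemon refill the inactive list when it runs low.
     */
    void
    check_paging_thresholds(void)
    {
        if (cnt_free < free_min)
            wakeup_pageout_daemon();    /* reclaim from the inactive list */
        if (cnt_inactive < inactive_target)
            fill_inactive_list(inactive_target - cnt_inactive);
    }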

The Pageout Daemon

Page replacement is done by the pageout daemon. When the pageout daemon reclaims pages that have been modified, it is responsible for writing them to the swap area. Thus, the pageout daemon must be able to use normal kernel-synchronization mechanisms, such as sleep(). It therefore runs as a separate process, with its own process structure, user structure, and kernel stack. Like init, the pageout daemon is created by an internal fork operation during system startup (see Section 14.5); unlike init, however, it remains in kernel mode after the fork. The pageout daemon simply enters vm_pageout() and never returns. Unlike other users of the disk I/O routines, the pageout process needs to do its disk operations asynchronously so that it can continue scanning in parallel with disk writes.

The goal of the pageout daemon is to keep at least 5 percent of the memory on the free list. Whenever an operation that uses pages causes the amount of free memory to fall below this threshold, the pageout daemon is awakened. It starts by checking to see whether any processes are eligible to be swapped out (see the next subsection). If the pageout daemon finds and swaps out enough eligible processes to meet the free-page target, then the pageout daemon goes to sleep to await another memory shortage.

If there is still not enough free memory, the pageout daemon scans the queue of inactive pages, starting with the oldest page and working toward the youngest. It frees those pages that it can until the free-page target is met or it reaches the end of the inactive list. The following list enumerates the possible actions that can be taken with each page; a condensed sketch of the scan follows the list:

• If the page is clean and unreferenced, move it to the free list and increment the free-list count.

• If the page has been referenced by an active process, move it from the inactive list back to the active list.

• If the page is dirty and is being written to the swap area or the filesystem, skip it for now. The expectation is that the I/O will have completed by the next time that the pageout daemon runs, so the page will be clean and can be freed.

• If the page is dirty but is not actively being written to the swap space or the filesystem, then start an I/O operation to get it written. As long as a pageout is needed to save the current page, adjacent pages of the region that are resident, inactive, and dirty are clustered together so that the whole group can be written to the swap area or filesystem in a single I/O operation. If they are freed before they are next modified, the free operation will not require the page to be written.
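The scan itself, condensed into C under the assumption of illustrative helper routines (oldest_inactive(), start_pageout_cluster(), and so on), might look like the following sketch; clustering details, queue manipulation, and locking are omitted.

    #include <stddef.h>

    typedef struct vm_page *vm_page_t;      /* opaque in this sketch */

    extern int cnt_free, free_target;
    extern vm_page_t oldest_inactive(void);
    extern vm_page_t next_inactive(vm_page_t);
    extern int referenced(vm_page_t), dirty(vm_page_t), being_cleaned(vm_page_t);
    extern void reactivate(vm_page_t), move_to_free_list(vm_page_t);
    extern void start_pageout_cluster(vm_page_t);

    void
    scan_inactive_list(void)
    {
        vm_page_t m, next;

        for (m = oldest_inactive(); m != NULL && cnt_free < free_target;
            m = next) {
            next = next_inactive(m);
            if (referenced(m)) {
                reactivate(m);          /* back to the active list */
                continue;
            }
            if (!dirty(m)) {
                move_to_free_list(m);   /* clean and unreferenced: free it */
                continue;
            }
            if (being_cleaned(m))
                continue;               /* I/O in progress; skip for now */
            start_pageout_cluster(m);   /* async write of m and neighbors */
        }
    }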

When the scan of the inactive list completes, the pageout daemon checks the size of the inactive list. Its target is to keep one-third of the available (nonwired) pages on the inactive list. If the inactive queue has gotten too small, the pageout daemon moves pages from the active list over to the inactive list until it reaches its target. Like the inactive list, the active list is sorted into a least recently activated order: The pages selected to be moved to the inactive list are those that were activated least recently. Vm_pageout() then goes to sleep until free memory drops below the target.

The procedure for writing the pages of a process to the swap device, a page push, is somewhat complicated. The mechanism used by the pageout daemon to write pages to the swap area differs from normal I/O in two important ways:

1.  The dirty pages are mapped into the virtual address space of the kernel, rather than being part of the virtual address space of the process.

2.  The write operation is done asynchronously.

Both these operations are done by the swap_pager_putpage() routine. Because the pageout daemon does not synchronously wait while the I/O is done, it does not regain control after the I/O operation completes. Therefore, swap_pager_putpage() marks the buffer with a callback flag and sets the routine for the callback to be swap_pager_iodone(). When the push completes, swap_pager_iodone() is called; it places the buffer on the list of completed pageouts. If the pageout daemon has finished initiating paging I/O and has gone to sleep, swap_pager_iodone() awakens it so that it can process the completed-pageout list. If the pageout daemon is still running, it will find the buffer the next time that it processes the completed-pageout list.

Doing the write asynchronously allows the pageout daemon to continue examining pages, possibly starting additional pushes. Because the number of swap buffers is constant, the kernel must take care to ensure that a buffer is available before a commitment to a new page push is made. If the pageout daemon has used all the swap buffers, swap_pager_putpage() waits for at least one write operation to complete before it continues. When pageout operations complete, the buffers are added to the list of completed pageouts and, if a swap_pager_putpage() was blocked awaiting a buffer, swap_pager_putpage() is awakened.
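The callback protocol just described can be sketched as follows. The buffer fields imitate the historic BSD buffer-cache interface, but the declarations and helper routines here are simplified and illustrative, not the actual kernel definitions.

    struct buf {
        int          b_flags;
        void       (*b_iodone)(struct buf *);  /* completion callback */
        struct buf  *b_next;                   /* completed-pageout list link */
    };
    #define B_CALL  0x01                       /* call b_iodone on completion */

    extern void start_async_write(struct buf *);     /* returns immediately */
    extern void add_to_completed_list(struct buf *);
    extern void wakeup_daemon_if_asleep(void);

    static void swap_pager_iodone_sketch(struct buf *);

    static void
    swap_pager_putpage_sketch(struct buf *bp)
    {
        bp->b_flags |= B_CALL;                 /* request a callback */
        bp->b_iodone = swap_pager_iodone_sketch;
        start_async_write(bp);                 /* daemon keeps scanning */
    }

    static void
    swap_pager_iodone_sketch(struct buf *bp)   /* runs at I/O completion */
    {
        add_to_completed_list(bp);             /* for swap_pager_clean() */
        wakeup_daemon_if_asleep();
    }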

The list of completed pageouts is processed by swap_pager_clean() each time a swap-pager instance is deallocated, before a new swap operation is started, and before the pageout daemon sleeps. For each pageout operation on the list, each page (including each in a page cluster) is marked as clean, has its busy bit cleared, and has any processes waiting for it awakened. The page is not moved from its active or inactive list to the free list. If a page remains on the inactive list, it will eventually be moved to the free list during a future pass of the pageout daemon. A count of pageouts in progress is kept for the pager associated with each object; this count is decremented when the pageout completes, and, if the count goes to zero, a wakeup() is issued. This operation is done so that an object that is deallocating a swap pager can wait for the completion of all pageout operations before freeing the pager’s references to the associated swap space.

Swapping

Although swapping is generally avoided, there are several times when it is used in 4.4BSD to address a serious resource shortage. Swapping is done in 4.4BSD when any of the following occurs:

• The system becomes so short of memory that the paging process cannot free memory fast enough to satisfy the demand. For example, a memory shortfall may happen when multiple large processes are run on a machine lacking enough memory for the minimum working sets of the processes.

• Processes are completely inactive for more than 20 seconds. Otherwise, such processes would retain a few pages of memory associated with the user structure and kernel stack.

Swap operations completely remove a process from main memory, including the process page tables, the pages of the data and the stack segments that are not already in swap space, and the user area.

Process swapping is invoked only when paging is unable to keep up with memory needs or when short-term resource needs warrant swapping a process. In general, the swap-scheduling mechanism does not do well under heavy load; system performance is much better when memory scheduling can be done by the page-replacement algorithm than when the swap algorithm is used.

Swapout is driven by the pageout daemon. If the pageout daemon can find any processes that have been sleeping for more than 20 seconds (maxslp, the cutoff for considering the time sleeping to be “a long time”), it will swap out the one sleeping for the longest time. Such processes have the least likelihood of making good use of the memory that they occupy; thus, they are swapped out even if they are small. If none of these processes are available, the pageout daemon will swap out a process that has been sleeping for a shorter time. If memory is still desperately low, it will swap out the runnable process that has been resident the longest. These criteria attempt to avoid swapping entirely until the pageout daemon is clearly unable to keep enough memory free. Once swapping of runnable processes has begun, the processes eligible for swapping should take turns in memory so that no process is frozen out entirely.

The mechanics of doing a swap out are simple. The swapped-in process flag P_INMEM is cleared to show that the process is not resident in memory, and, if necessary, the process is removed from the runnable process queue. Its user area is then marked as pageable, which allows the user area pages, along with any other remaining pages for the process, to be paged out via the standard pageout mechanism. The swapped-out process cannot be run until after it is swapped back into memory.

The Swap-In Process

Swap-in operations are done by the swapping process, process 0. This process is the first one created by the system when the system is started. The swap-in policy of the swapper is embodied in the scheduler() routine. This routine swaps processes back in when memory is available and they are ready to run. At any time, the swapper is in one of three states:

1.  Idle: No swapped-out processes are ready to be run. Idle is the normal state.

2.  Swapping in: At least one runnable process is swapped out, and scheduler() attempts to find memory for it.

3.  Swapping out: The system is short of memory or there is not enough memory to swap in a process. Under these circumstances, scheduler() awakens the pageout daemon to free pages and to swap out other processes until the memory shortage abates.

If more than one swapped-out process is runnable, the first task of the swapper is to decide which process to swap in. This decision may affect the decision whether to swap out another process. Each swapped-out process is assigned a priority based on

• The length of time it has been swapped out

• Its nice value

• The amount of time it was asleep since it last ran

In general, the process that has been swapped out longest or was swapped out because it was not runnable will be brought in first. Once a process is selected, the swapper checks to see whether there is enough memory free to swap in the process. Historically, the 4.3BSD system required as much memory to be available as was occupied by the process before that process was swapped. Under 4.4BSD, this requirement was reduced to a requirement that only enough memory be available to hold the swapped-process user structure and kernel stack. If there is enough memory available, the process is brought back into memory. The user area is swapped in immediately, but the process loads the rest of its working set by demand paging from the swap device. Thus, not all the memory that is committed to the process is used immediately.

The procedure for swapin of a process is the reverse of that for swapout:

1.  Memory is allocated for the user structure and kernel stack, and they are read back from swap space.

2.  The process is marked as resident and is returned to the run queue if it is runnable (i.e., is not stopped or sleeping).

After the swapin completes, the process is ready to run like any other, except that it has no resident pages. It will bring in the pages that it needs by faulting them.

5.13 Portability

Everything discussed in this chapter up to this section has been part of the machine-independent data structures and algorithms. These parts of the virtual-memory system require little change when 4.4BSD is ported to a new architecture. This section describes the machine-dependent parts of the virtual-memory system: the parts that must be written as part of a port of 4.4BSD to a new architecture. The machine-dependent parts of the virtual-memory system control the hardware memory-management unit (MMU). The MMU implements address translation and access control when virtual memory is mapped onto physical memory.

One common MMU design uses memory-resident forward-mapped page tables. These page tables are large contiguous arrays indexed by the virtual address. There is one element, or page-table entry, in the array for each virtual page in the address space. This element contains the physical page to which the virtual page is mapped, as well as access permissions, status bits telling whether the page has been referenced or modified, and a bit indicating whether the entry contains valid information. For a 4-Gbyte address space with 4-Kbyte virtual pages and a 32-bit page-table entry, 1 million entries, or 4 Mbyte, would be needed to describe an entire address space. Since most processes use little of their address space, most of the entries would be invalid, and allocating 4 Mbyte of physical memory per process would be wasteful. Thus, most page-table structures are hierarchical, using two or more levels of mapping. With a hierarchical structure, different portions of the virtual address are used to index the various levels of the page tables. The intermediate levels of the table contain the addresses of the next lower level of the page table. The kernel can mark as unused large contiguous regions of an address space by inserting invalid entries at the higher levels of the page table, eliminating the need for invalid page descriptors for each individual unused virtual page.

This hierarchical page-table structure requires the hardware to make frequent memory references to translate a virtual address. To speed the translation process, most page-table–based MMUs also have a small, fast, fully associative hardware cache of recent address translations, a structure known commonly as a translation lookaside buffer (TLB). When a memory reference is translated, the TLB is first consulted and, only if a valid entry is not found there, the page-table structure for the current process is traversed. Because most programs exhibit spatial locality in their memory-access patterns, the TLB does not need to be large; many are as small as 64 entries.

As address spaces grew beyond 32 to 48 and, more recently, 64 bits, simple indexed data structures became unwieldy, with three or more levels of tables required to handle address translation. A response to this page-table growth is the inverted page table, also known as the reverse-mapped page table. In an inverted page table, the hardware still maintains a memory-resident table, but that table contains one entry per physical page and is indexed by physical address, instead of by virtual address. An entry contains the virtual address to which the physical page is currently mapped, as well as protection and status attributes. The hardware does virtual-to-physical address translation by computing a hash function on the virtual address to select an entry in the table. The system handles collisions by linking together table entries and making a linear search of this chain until it finds the matching virtual address.
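A sketch of the lookup path follows. The table layout and hash function are illustrative; a real implementation would also match an address-space identifier in each entry, since a single table serves all processes.

    typedef unsigned long vm_offset_t;
    #define PAGE_SHIFT  12
    #define NHASH       1024            /* power of two, for the mask below */

    struct ipte {
        vm_offset_t it_vpn;             /* virtual page number mapped here */
        int         it_next;            /* next entry in chain, or -1 */
        /* protection and status bits omitted */
    };

    extern struct ipte ipt[];           /* indexed by physical page number */
    extern int ipt_head[NHASH];         /* hash buckets: indexes into ipt[] */

    /* Return the physical page number for va, or -1 to signal a fault. */
    int
    ipt_lookup(vm_offset_t va)
    {
        vm_offset_t vpn = va >> PAGE_SHIFT;
        int i;

        for (i = ipt_head[vpn & (NHASH - 1)]; i != -1; i = ipt[i].it_next)
            if (ipt[i].it_vpn == vpn)
                return (i);             /* the index is the frame number */
        return (-1);
    }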

The advantages of an inverted page table are that the size of the table is proportional to the amount of physical memory and that only one global table is needed, rather than one table per process. A disadvantage to this approach is that there can be only one virtual address mapped to any given physical page at any one time. This limitation makes virtual-address aliasing—having multiple virtual addresses for the same physical page—difficult to handle. As it is with the forward-mapped page table, a hardware TLB is typically used to speed the translation process.

A final common MMU organization consists of just a TLB. This architecture is the simplest hardware design. It gives the software maximum flexibility by allowing the latter to manage translation information in whatever structure it desires.

The machine-dependent part of the virtual-memory system also may need to interact with the memory cache. Because the speed of CPUs has increased far more rapidly than the speed of main memory, most machines today require the use of a memory cache to allow the CPU to operate near its full potential. There are several cache-design choices that require cooperation with the virtual-memory system.

The design option with the biggest effect is whether the cache uses virtual or physical addressing. A physically addressed cache takes the address from the CPU, runs it through the MMU to get the address of the physical page, then uses this physical address to find out whether the requested memory location is available in the cache. Although the TLB significantly reduces the average latency of the translation, there is still a delay in going through the MMU. A virtually addressed cache uses the virtual address as that address comes from the CPU to find out whether the requested memory location is available in the cache. The virtual-address cache is faster than the physical-address cache because it avoids the time to run the address through the MMU. However, the virtual-address cache must be flushed completely after each context switch, because virtual addresses from one process are indistinguishable from the virtual addresses of another process. By contrast, a physical-address cache needs to flush only a few individual entries when their associated physical page is reassigned. In a system with many short-running processes, a virtual-address cache gets flushed so frequently that it is seldom useful.

A further refinement to the virtual-address cache is to add a process tag to each cache entry. At each context switch, the kernel loads a hardware context register with the tag assigned to the process. Each time an entry is recorded in the cache, both the virtual address and the process tag that faulted it are recorded. The cache looks up the virtual address as before, but, when it finds an entry, it compares the tag associated with that entry to the hardware context register. If they match, the cached value is returned. If they do not match, the correct value and current process tag replace the old cached value. When this technique is used, the cache does not need to be flushed completely at each context switch, since multiple processes can have entries in the cache. The drawback is that the kernel must manage the process tags. Usually, there are fewer tags (eight to 16) than there are processes. The kernel must assign the tags to the active set of processes. When an old process drops out of the active set to allow a new one to enter, the kernel must flush the cache entries associated with the tag that it is about to reuse.

A final consideration is a write-through versus a write-back cache. A write-through cache writes the data back to main memory at the same time as it is writing to the cache, forcing the CPU to wait for the memory access to conclude. A write-back cache writes the data to only the cache, delaying the memory write until an explicit request or until the cache entry is reused. The write-back cache allows the CPU to resume execution more quickly and permits multiple writes to the same cache block to be consolidated into a single memory write.

Often, a port to another architecture with a similar memory-management organization can be used as a starting point for a new port. Most models of the HP300 line of workstations, built around the Motorola 68000 family of processors, use the typical two-level page-table organization shown in Fig. 5.16 (on page 176). An address space is broken into 4-Kbyte virtual pages, with each page identified by a 32-bit entry in the page table. Each page-table entry contains the physical page number assigned to the virtual page, the access permissions allowed, modify and reference information, and a bit indicating that the entry contains valid information. The 4 Mbyte of page-table entries are likewise divided into 4-Kbyte page-table pages, each of which is described by a single 32-bit entry in the segment table. Segment-table entries are nearly identical to page-table entries: They contain access bits, modify and reference bits, a valid bit, and the physical page number of the page-table page described. One 4-Kbyte page—1024 segment-table entries—covers the maximum-sized 4-Gbyte address space. A hardware register contains the physical address of the segment table for the currently active process.

In Fig. 5.16, translation of a virtual address to a physical address during a CPU access proceeds as follows (a C rendering of these steps appears after the list):

Figure 5.16 Two-level page-table organization. Key: V—page-valid bit; M—page-modified bit; R—page-referenced bit; ACC—page-access permissions.

image

• The 10 most significant bits of the virtual address are used to index into the active segment table.

• If the selected segment-table entry is valid and the access permissions grant the access being made, the next 10 bits of the virtual address are used to index into the page-table page referenced by the segment-table entry.

• If the selected page-table entry is valid and the access permissions match, the final 12 bits of the virtual address are combined with the physical page referenced by the page-table entry to form the physical address of the access.
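Rendered in C, with entry formats simplified to a valid bit plus a physical address, with no permission checks, and with the table walk done through ordinary pointers rather than the physical addresses the hardware actually uses, the three steps look like this:

    #define PG_V        0x1             /* entry-valid bit (illustrative) */
    #define SEG_SHIFT   22              /* top 10 bits: segment-table index */
    #define PG_SHIFT    12              /* next 10 bits: page-table index */
    #define IDX_MASK    0x3ff
    #define OFF_MASK    0xfffU          /* low 12 bits: byte offset */

    /* Return the physical address for va, or -1 to signal a fault. */
    long
    translate(unsigned int *segtab, unsigned int va)
    {
        unsigned int ste, pte, *ptpage;

        ste = segtab[(va >> SEG_SHIFT) & IDX_MASK];
        if ((ste & PG_V) == 0)
            return (-1);                /* invalid segment-table entry */
        ptpage = (unsigned int *)(unsigned long)(ste & ~OFF_MASK);
        pte = ptpage[(va >> PG_SHIFT) & IDX_MASK];
        if ((pte & PG_V) == 0)
            return (-1);                /* invalid page-table entry */
        return ((long)(pte & ~OFF_MASK) | (va & OFF_MASK));
    }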

The Role of the pmap Module

The machine-dependent code describes how the physical mapping is done between the user-process and kernel virtual addresses and the physical addresses of main memory. This mapping function includes management of access rights in addition to address translation. In 4.4BSD, the physical mapping (pmap) module manages machine-dependent translation and access tables that are used either directly or indirectly by the memory-management hardware. For example, on the HP300, the pmap maintains the memory-resident segment and page tables for each process, as well as for the kernel. The machine-dependent state required to describe the translation and access rights of a single page is often referred to as a mapping or mapping structure.

The 4.4BSD pmap interface is nearly identical to that in Mach 3.0: it shares many design characteristics. The pmap module is intended to be logically independent of the higher levels of the virtual-memory system. The interface deals strictly in machine-independent page-aligned virtual and physical addresses and in machine-independent protections. The machine-independent page size may be a multiple of the architecture-supported page size. Thus, pmap operations must be able to affect more than one physical page per logical page. The machine-independent protection is a simple encoding of read, write, and execute permission bits. The pmap must map all possible combinations into valid architecture-specific values.

A process’s pmap is considered to be a cache of mapping information kept in a machine-dependent format. As such, it does not need to contain complete state for all valid mappings. Mapping state is the responsibility of the machine-independent layer. With one exception, the pmap module may throw away mapping state at its discretion to reclaim resources. The exception is wired mappings, which should never cause a fault that reaches the machine-independent vm_fault() routine. Thus, state for wired mappings must be retained in the pmap until it is removed explicitly.

In theory, the pmap module may also delay most interface operations, such as removing mappings or changing their protection attributes. It can then do many of them batched together, before doing expensive operations such as flushing the TLB. In practice, however, this delayed operation has never been used, and it is unclear whether it works completely. This feature was dropped from later releases of the Mach 3.0 pmap interface.

In general, pmap routines may act either on a set of mappings defined by a virtual address range or on all mappings for a particular physical address. Being able to act on individual or all virtual mappings for a physical page requires that the mapping information maintained by the pmap module be indexed by both virtual and physical address. For architectures such as the HP300 that support memory-resident page tables, the virtual-to-physical, or forward, lookup may be a simple emulation of the hardware page-table traversal. Physical-to-virtual, or reverse, lookup requires an inverted page table: an array with one entry per physical page indexed by the physical page number. Entries in this table may be either a single mapping structure, if only one virtual translation is allowed per physical page, or a pointer to a list of mapping structures, if virtual-address aliasing is allowed. The kernel typically handles forward lookups in a system without page tables by using a hash table to map virtual addresses into mapping structures in the inverted page table.

There are two strategies that can be used for management of pmap memory resources, such as user-segment or page-table memory. The traditional and easiest approach is for the pmap module to manage its own memory. Under this strategy, the pmap module can grab a fixed amount of wired physical memory at system boot time, map that memory into the kernel’s address space, and allocate pieces of the memory as needed for its own data structures. The primary benefit is that this approach isolates the pmap module’s memory needs from those of the rest of the system and limits the pmap module’s dependencies on other parts of the system. This design is consistent with a layered model of the virtual-memory system in which the pmap is the lowest, and hence self-sufficient, layer.

The disadvantage is that this approach requires the duplication of many of the memory-management functions. The pmap module has its own memory allocator and deallocator for its private heap—a heap that is statically sized and cannot be adjusted for varying systemwide memory demands. For an architecture with memory-resident page tables, it must keep track of noncontiguous chunks of processes’ page tables, because a process may populate its address space sparsely. Handling this requirement entails duplicating much of the standard list-management code, such as that used by the vm_map code.

An alternative approach, used by the HP300, is to use the higher-level virtual-memory code recursively to manage some pmap resources. Here, the page table for a user process appears as a virtually contiguous 4-Mbyte array of page-table entries in the kernel’s address space. Using higher-level allocation routines, such as kmem_alloc_wait(), ensures that physical memory is allocated only when needed and from the systemwide free-memory pool. Page tables and other pmap resources also can be allocated from pageable kernel memory. This approach easily and efficiently supports large sparse address spaces, including the kernel’s own address space.

The primary drawback is that this approach violates the independent nature of the interface. In particular, the recursive structure leads to deadlock problems with global multiprocessor spin locks that can be held while the kernel is calling a pmap routine. Another problem for page-table allocation is that page tables are typically hierarchically arranged; they are not flat, as this technique represents them. With a two-level organization present on some HP300 machines, the pmap module must be aware that a new page has been allocated within the 4-Mbyte range, so that the page’s physical address can be inserted into the segment table. Thus, the advantage of transparent allocation of physical memory is partially lost. Although the problem is not severe in the two-level case, the technique becomes unwieldy for three or more levels.

The pmap data structures are contained in the machine-dependent include directory in the file pmap.h. Most of the code for these routines is in the machine-dependent source directory in the file pmap.c. The main tasks of the pmap module are these:

• System initialization and startup (pmap_bootstrap_alloc(), pmap_bootstrap(), pmap_init())

• Allocation and deallocation of mappings of physical to virtual pages (pmap_enter(), pmap_remove())

• Change of access protections and other attributes of mappings (pmap_change_wiring(), pmap_page_protect(), pmap_protect())

• Maintenance of physical page-usage information (pmap_clear_modify(), pmap_clear_reference(), pmap_is_modified(), pmap_is_referenced())

• Initialization of physical pages (pmap_copy_page(), pmap_zero_page())

• Management of internal data structures (pmap_create(), pmap_reference(), pmap_destroy(), pmap_pinit(), pmap_release(), pmap_copy(), pmap_pageable(), pmap_collect(), pmap_update())

Each of these tasks will be described in the following subsections.
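Representative prototypes for these groups are shown below in the Mach-derived style. These are a sketch for orientation only: the authoritative signatures are the ones in the machine-dependent pmap.h, and they vary in detail from port to port.

    typedef struct pmap *pmap_t;
    typedef unsigned long vm_offset_t, vm_size_t;
    typedef int vm_prot_t, boolean_t;

    pmap_t    pmap_create(vm_size_t size);
    void      pmap_destroy(pmap_t pmap);
    void      pmap_enter(pmap_t pmap, vm_offset_t va, vm_offset_t pa,
                  vm_prot_t prot, boolean_t wired);
    void      pmap_remove(pmap_t pmap, vm_offset_t sva, vm_offset_t eva);
    void      pmap_protect(pmap_t pmap, vm_offset_t sva, vm_offset_t eva,
                  vm_prot_t prot);
    void      pmap_page_protect(vm_offset_t pa, vm_prot_t prot);
    void      pmap_change_wiring(pmap_t pmap, vm_offset_t va, boolean_t wired);
    boolean_t pmap_is_modified(vm_offset_t pa);
    void      pmap_clear_modify(vm_offset_t pa);
    void      pmap_copy_page(vm_offset_t src, vm_offset_t dst);
    void      pmap_zero_page(vm_offset_t pa);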

Initialization and Startup

The first step in starting up the system is for the loader to bring the kernel image from a disk or the network into the physical memory of the machine. The kernel load image looks much like that of any other process; it contains a text segment, an initialized data segment, and an uninitialized data segment. The loader places the kernel contiguously into the beginning of physical memory. Unlike a user process that is demand paged into memory, the text and data for the kernel are read into memory in their entirety. Following these two segments, the loader zeros an area of memory equal to the size of the kernel’s uninitialized memory segment. After loading the kernel, the loader passes control to the starting address given in the kernel executable image. When the kernel begins executing, it is executing with the MMU turned off. Consequently, all addressing is done using the direct physical addresses.

The first task undertaken by the kernel is to set up the kernel pmap, and any other data structures that are necessary to describe the kernel’s virtual address space and to make it possible to enable the MMU. This task is done in pmap_bootstrap(). On the HP300, bootstrap tasks include allocating and initializing the segment and page tables that map the statically loaded kernel image and memory-mapped I/O address space, allocating a fixed amount of memory for kernel page-table pages, allocating and initializing the user structure and kernel stack for the initial process, allocating the empty segment table initially shared by all processes, reserving special areas of the kernel’s address space, and initializing assorted critical pmap-internal data structures. After this call, the MMU is enabled, and the kernel begins running in the context of process zero.

Once the kernel is running in its virtual address space, it proceeds to initialize the rest of the system. This initialization starts with a call to set up the machine-independent portion of the virtual-memory system and concludes with a call to pmap_init(). Any subsystem that requires dynamic memory allocation between enabling of the MMU and the call to pmap_init() must use pmap_bootstrap_alloc(). Memory allocated by this routine will not be managed by the virtual-memory system and is effectively wired down. Pmap_init() allocates all resources necessary to manage multiple user address spaces and synchronizes the higher-level kernel virtual-memory data structures with the kernel pmap.

On the HP300, it first marks as in use the areas of the kernel’s vm_map that were allocated during the bootstrap. These marks prevent future high-level allocations from trying to use those areas. Next, it allocates a range of kernel virtual address space, via a kernel submap, to use for user-process page tables. Pieces of this address range are allocated when processes are created and are deallocated when the processes exit. These areas are not populated with memory on allocation. Page-table pages are allocated on demand when a process first accesses memory that is mapped by an entry in that page of the page table. This allocation is discussed later, in the mapping-allocation subsection. Page tables are allocated from their own submap to limit the amount of kernel virtual address space that they consume. At 4 Mbyte per process page table, 1024 active processes would occupy the entire kernel address space. The available page-table address-space limit is approximately one-half of the entire address space.

Pmap_init() allocates a fixed amount of wired memory to use for kernel page-table pages. In theory, these pages could be allocated on demand from the general free-memory pool, as user page-table pages are; in practice, however, this approach leads to deadlocks, so a fixed pool of memory is used.

After determining the number of pages of physical memory remaining, the startup code allocates the inverted page table, pv_table. This table is an array of pv_entry structures. Each pv_entry describes a single address translation and includes the virtual address, a pointer to the associated pmap structure for that virtual address, a link for chaining together multiple entries mapping this physical address, and additional information specific to entries mapping page-table pages. Figure 5.17 shows the pv_entry references for a set of pages that have a single mapping. The pv_table contains actual instances of pv_entry structures, rather than pointers; this strategy optimizes the common case where physical pages have only one mapping. The purpose of the pv_entry structures is to identify the address space that has the page mapped. Rather than having a pointer from the vm_page structure to its corresponding pv_entry, the relationship is based on the array index of the two entries. In Fig. 5.17, the object is using pages 5, 18, and 79; thus, the corresponding pv_entry structures 5, 18, and 79 point to the physical map for the address space that has page tables referencing those pages.
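A sketch of the pv_entry and its table follows, modeled loosely on the HP300 declarations. The fields specific to page-table-page mappings are omitted, and first_phys_addr is an illustrative name for the base of pmap-managed memory.

    typedef unsigned long vm_offset_t;
    #define PAGE_SHIFT 12

    struct pmap;                        /* machine-dependent; opaque here */

    struct pv_entry {
        struct pv_entry *pv_next;       /* more mappings of this page (Fig. 5.18) */
        struct pmap     *pv_pmap;       /* address space holding the mapping */
        vm_offset_t      pv_va;         /* virtual address of the mapping */
        /* fields describing page-table-page mappings omitted */
    };

    extern struct pv_entry pv_table[];  /* one entry per managed physical page */
    extern vm_offset_t first_phys_addr; /* base of pmap-managed memory */

    /* The chain head for physical address pa; the page and its
     * pv_entry are related by array index rather than by pointer. */
    #define pa_index(pa)   (((pa) - first_phys_addr) >> PAGE_SHIFT)
    #define pa_to_pvh(pa)  (&pv_table[pa_index(pa)])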

Each pv_entry can reference only one physical map. When an object becomes shared between two or more processes, each physical page of memory becomes mapped into two or more sets of page tables. To track these multiple references, the pmap module must create chains of pv_entry structures, as shown in Fig. 5.18. These additional structures are allocated dynamically and are linked from a list headed by the pv_entry that was allocated in the initial table. For example, implementation of copy-on-write requires that the page tables be set to read-only in all the processes sharing the object. The pmap module can implement this request by walking the list of pages associated with the object to be made copy-on-write. For each page, it finds that page’s corresponding pv_entry structure. It then makes the appropriate change to the page table associated with that pv_entry structure. If that pv_entry structure has any additional pv_entry structures linked off it, the pmap module traverses them, making the same modification to each referenced page-table entry.

Figure 5.17 Physical pages with a single mapping.

image

Figure 5.18 Physical pages with multiple mappings.

image

Finally, a page-attribute array is allocated with 1 byte per physical page. This array contains reference and dirty information and is described later in the subsection on the management of page usage information. The first and last physical addresses of the area covered by both the pv_entry and attribute arrays are recorded, and are used by other routines for bounds checking. This area is referred to as the pmap-managed memory.

Mapping Allocation and Deallocation

The primary responsibility of the pmap module is validating (allocating) and invalidating (deallocating) mappings of physical pages to virtual addresses. The physical pages represent cached portions of an object that is providing data from a file or an anonymous memory region. A physical page is bound to a virtual address because that object is being mapped into a process’s address space either explicitly by mmap or implicitly by fork or exec. Physical-to-virtual address mappings are not created at the time that the object is mapped; rather, their creation is delayed until the first reference to a particular page is made. At that point, an access fault will occur, and pmap_enter() will be called. Pmap_enter() is responsible for any required side effects associated with creation of a new mapping. Such side effects are largely the result of entering a second translation for an already mapped physical page—for example, as the result of a copy-on-write operation. Typically, this operation requires flushing uniprocessor or multiprocessor TLB or cache entries to maintain consistency.

In addition to being used to create new mappings, pmap_enter() may also be called to modify the wiring or protection attributes of an existing mapping, or to rebind an existing mapping for a virtual address to a new physical address. The kernel handles attribute changes by calling the appropriate interface routine, described in the next subsection. Changing the target physical address of a mapping is simply a matter of first removing the old mapping and then handling the request like any other new mapping.

Pmap_enter() is the only routine that cannot lose state or delay its action. When called, it must create a mapping as requested, and it must validate that mapping before returning to the caller. On the HP300, pmap_enter() takes the following actions:

1.  If no page table exists for the process, a 4-Mbyte range is allocated in the kernel’s address space to map the process’s address space.

2.  If the process has no segment table of its own (i.e., it still references the initial shared segment table), a private one is allocated.

3.  If a physical page has not yet been allocated to the process’s page table at the location required for the new mapping, one is allocated now. Kernel page-table pages are acquired from the reserved pool allocated at bootstrap time. For user processes, the kernel does the allocation by simulating a fault on the appropriate location in the 4-Mbyte page-table range. This fault forces allocation of a zero-filled page and makes a recursive call to pmap_enter() to enter the mapping of that page in the kernel’s pmap. For either kernel or user page-table pages, the kernel mapping for the new page is flagged as being a page-table page, and the physical address of the page is recorded in the segment table. Recording this address is more complicated on the 68040, which has the top two levels of the page-table hierarchy squeezed into a single segment-table page.

After ensuring that all page-table resources exist for the mapping being entered, pmap_enter() validates or modifies the requested mapping as follows:

1.  Check to see whether a mapping structure already exists for this virtual-to-physical address translation. If one does, the call must be one to change the protection or wiring attributes of the mapping; it is handled as described in the next subsection.

2.  Otherwise, if a mapping exists for this virtual address but it references a different physical address, that mapping is removed.

3.  If the indicated mapping is for a user process, the kernel page-table page containing that page-table entry is marked as nonpageable. This marking is a subtle way of keeping page-table pages wired as long as they contain any valid mappings. The vm_map_pageable() routine keeps a wired count for every virtual page, wiring the page when the count is incremented from zero and unwiring the page when the count is decremented to zero. The wiring and unwiring calls trigger a call to pmap_pageable(), whose function is described in the last subsection on the management of internal data structures. Wiring a page-table page keeps it from being involuntarily paged out, which would effectively invalidate all pages that it currently maps. A beneficial side effect is that, when a page-table page is finally unwired, it contains no useful information and does not need to be paged out. Hence, no backing store is required for page-table pages.

4.  If the physical address is outside the range managed by the pmap module (e.g., a frame-buffer page), no pv_table entry is allocated; only a page-table entry is created. Otherwise, for the common case of a new mapping for a managed physical page, a pv_table entry is created.

5.  For HP300 machines with a virtually indexed cache, a check is made to see whether this physical page already has other mappings. If it does, all mappings may need to be marked cache inhibited, to avoid cache inconsistencies.

6.  A page-table entry is created and validated, with cache and TLB entries flushed as necessary.
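
These steps may be summarized in the following sketch of pmap_enter(). The helper routines (pte_valid(), pte_build(), pv_enter(), pmap_change_attrs(), and so on) are assumptions introduced for illustration; only pmap_pte(), vm_map_pageable(), and the TBIS translation-invalidation primitive correspond to names used on the HP300.

    void
    pmap_enter(pmap_t pmap, vm_offset_t va, vm_offset_t pa,
        vm_prot_t prot, boolean_t wired)
    {
        pt_entry_t *pte;

        /* steps 1-3 above: ensure that the segment table and the
           page-table page for va exist (not shown) */

        pte = pmap_pte(pmap, va);           /* locate the page-table entry */
        if (pte_valid(pte)) {
            if (pte_pa(pte) == pa) {        /* attribute change only */
                pmap_change_attrs(pmap, va, prot, wired);
                return;
            }
            /* stale mapping to a different physical address */
            pmap_remove(pmap, va, va + PAGE_SIZE);
        }
        if (pmap != kernel_pmap)            /* wire the page-table page */
            vm_map_pageable(pt_map, trunc_page((vm_offset_t)pte),
                round_page((vm_offset_t)(pte + 1)), FALSE);
        if (pa >= vm_first_phys && pa < vm_last_phys)
            pv_enter(pa, pmap, va);         /* create the pv_table entry */
        /* on a virtually indexed cache, mark all mappings of the page
           cache inhibited if it is now multiply mapped (not shown) */
        *pte = pte_build(pa, prot, wired);  /* create and validate the entry */
        TBIS(va);                           /* flush any stale TLB entry */
    }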

When an object is unmapped from an address space, either explicitly by munmap or implicitly on process exit, the pmap module is invoked to invalidate and remove the mappings for all physical pages caching data for the object. Unlike pmap_enter(), pmap_remove() can be called with a virtual-address range encompassing more than one mapping. Hence, the kernel does the unmapping by looping over all virtual pages in the range, ignoring those for which there is no mapping and removing those for which there is one. Also unlike pmap_enter(), the implied action can be delayed until pmap_update(), described in the next subsection, is called. This delay may enable the pmap to optimize the invalidation process by aggregating individual operations.

Pmap_remove() on the HP300 is simple. It loops over the specified address range, invalidating individual page mappings. Since pmap_remove() can be called with large sparsely allocated regions, such as an entire process’s virtual address range, it must be able to skip invalid entries within the range efficiently. It does so by first checking the segment-table entry for a particular address and, if that entry is invalid, skipping to the next 4-Mbyte boundary. This check also prevents unnecessary allocation of a page-table page for the empty area. When all page mappings have been invalidated, any necessary global cache flushing is done.
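
A sketch of this loop follows; hp300_round_seg() and the pte and ste predicates are assumed helper names, whereas pmap_ste() and pmap_pte() parallel the HP300 lookup macros.

    void
    pmap_remove(pmap_t pmap, vm_offset_t sva, vm_offset_t eva)
    {
        vm_offset_t va;

        for (va = sva; va < eva; ) {
            /* no page table here: skip to the next 4-Mbyte boundary
               without faulting in a page-table page */
            if (!ste_valid(pmap_ste(pmap, va))) {
                va = hp300_round_seg(va + 1);
                continue;
            }
            if (pte_valid(pmap_pte(pmap, va)))
                pmap_remove_mapping(pmap, va);  /* sketched below */
            va += PAGE_SIZE;
        }
        /* any necessary global cache flush is done once, here */
    }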

To invalidate a single mapping, the kernel locates and marks as invalid the appropriate page-table entry. The reference and modify bits for the page are saved in the separate attribute array for future retrieval. If this mapping was a user mapping, vm_map_pageable() is called to decrement the wired count on the page-table page. When the count reaches zero, the page-table page can be reclaimed because it contains no more valid mappings. If the physical address from the mapping is outside the managed range, nothing more is done. Otherwise, the pv_table entry is found and is deallocated. When a user page-table page is removed from the kernel’s address space (i.e., as a result of removal of the final valid user mapping from that page), the process’s segment table must be updated. The kernel does this update by invalidating the appropriate segment-table entry.
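
The invalidation of a single mapping might be sketched as follows; again, the helper names are assumptions rather than the routines of the 4.4BSD sources.

    void
    pmap_remove_mapping(pmap_t pmap, vm_offset_t va)
    {
        pt_entry_t *pte = pmap_pte(pmap, va);
        vm_offset_t pa = pte_pa(pte);
        int attrs = pte_attrs(pte);         /* hardware reference/modify bits */

        pte_invalidate(pte);                /* mark the entry invalid */
        TBIS(va);                           /* flush the TLB entry */
        if (pmap != kernel_pmap)            /* unwire the page-table page */
            vm_map_pageable(pt_map, trunc_page((vm_offset_t)pte),
                round_page((vm_offset_t)(pte + 1)), TRUE);
        if (pa < vm_first_phys || pa >= vm_last_phys)
            return;                         /* unmanaged page: nothing more */
        pmap_attributes[pa_index(pa)] |= attrs;  /* save for later queries */
        pv_remove(pa, pmap, va);            /* deallocate the pv_table entry */
        /* if this was the final valid mapping in a user page-table page,
           the corresponding segment-table entry is invalidated (not shown) */
    }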

Change of Access and Wiring Attributes for Mappings

An important role of the pmap module is to manipulate the hardware access protections for pages. These manipulations may be applied to all mappings covered by a virtual-address range within a pmap via pmap_protect(), or they may be applied to all mappings of a particular physical page across pmaps via pmap_page_protect(). There are two features common to both calls. First, either form may be called with a protection value of VM_PROT_NONE to remove a range of virtual addresses or to remove all mappings for a particular physical page. Second, these routines should never add write permission to the affected mappings; thus, calls including VM_PROT_WRITE should make no changes. This restriction is necessary for the copy-on-write mechanism to function properly. Write permission is added only via calls to pmap_enter().

Pmap_protect() is used primarily by the mprotect system call to change the protection for a region of process address space. The strategy is similar to that of pmap_remove(): Loop over all virtual pages in the range and apply the change to all valid mappings that are found. Invalid mappings are left alone. As occurs with pmap_remove(), the action may be delayed until pmap_update() is called.

For the HP300, pmap_protect() first checks for the special cases. If the requested permission is VM_PROT_NONE, it calls pmap_remove() to handle the revocation of all access permission. If VM_PROT_WRITE is included, pmap_protect() returns immediately. For a normal protection value, pmap_protect() loops over the given address range, skipping invalid mappings. For valid mappings, the page-table entry is looked up and, if the new protection value differs from the current value, the entry is modified and any necessary TLB and cache flushing is done. As occurs with pmap_remove(), any global cache actions are delayed until the entire range has been modified.
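
The special cases at the head of pmap_protect() reduce to a few lines:

    void
    pmap_protect(pmap_t pmap, vm_offset_t sva, vm_offset_t eva,
        vm_prot_t prot)
    {
        if (prot == VM_PROT_NONE) {         /* revoke all access */
            pmap_remove(pmap, sva, eva);
            return;
        }
        if (prot & VM_PROT_WRITE)           /* never add write permission */
            return;
        /* normal case: loop over the range as in pmap_remove(),
           modifying each valid entry whose protection actually changes */
    }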

Pmap_page_protect() is used internally by the virtual-memory system for two purposes. It is called to set read-only permission when a copy-on-write operation is set up (e.g., during fork). It also removes all access permissions before doing page replacement to force all references to a page to block pending the completion of its operation. In Mach, this routine used to be two separate routines—pmap_copy_on_write() and pmap_remove_all()—and many pmap modules implement pmap_page_protect() as a call to one or the other of these functions, depending on the protection argument.

In the HP300 implementation of pmap_page_protect(), a check is made to ensure that this page is a managed physical page and that VM_PROT_WRITE was not specified. If either of these conditions is not met, pmap_page_protect() returns without doing anything. Otherwise, it locates the pv_table entry for the specified physical page. If the request requires the removal of mappings, pmap_page_protect() loops over all pv_entry structures that are chained together for this page, invalidating the individual mappings as described in the previous subsection. Note that the required TLB and cache flushing differs from that of pmap_remove(), since entries must be invalidated from multiple process contexts, rather than multiple entries from a single process context.

If pmap_page_protect() is called to make mappings read-only, then it loops over all pv_entry structures for the physical address, modifying the appropriate page-table entry for each. As occurs with pmap_protect(), the entry is checked to ensure that it is changing before expensive TLB and cache flushes are done.
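
A sketch of this routine, looping over the pv_entry chain introduced earlier, follows; the pte helpers are once more assumed names.

    void
    pmap_page_protect(vm_offset_t pa, vm_prot_t prot)
    {
        struct pv_entry *pv;
        pt_entry_t *pte;

        if (pa < vm_first_phys || pa >= vm_last_phys ||
            (prot & VM_PROT_WRITE))
            return;                         /* unmanaged, or would add write */
        if (prot == VM_PROT_NONE) {
            /* invalidate every mapping of the page, as described
               in the previous subsection (not shown) */
            return;
        }
        /* make each mapping of the page read-only */
        for (pv = pa_to_pvh(pa); pv != NULL && pv->pv_pmap != NULL;
            pv = pv->pv_next) {
            pte = pmap_pte(pv->pv_pmap, pv->pv_va);
            if (!pte_readonly(pte)) {       /* skip needless flushes */
                pte_set_readonly(pte);
                TBIS(pv->pv_va);            /* flush from that context */
            }
        }
    }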

Pmap_change_wiring() is called to wire or unwire a single machine-independent virtual page within a pmap. As described in the previous subsection, wiring informs the pmap module that a mapping should not cause a hardware fault that reaches the machine-independent vm_fault() code. Wiring is typically a software attribute that has no effect on the hardware MMU state: It simply tells the pmap not to throw away state about the mapping. As such, if a pmap module never discards state, it is not strictly necessary for the module even to track the wired status of pages. The only side effect of not tracking wiring information in the pmap is that the mlock system call cannot be completely implemented without a wired page-count statistic.

The HP300 pmap implementation maintains wiring information. An unused bit in the page-table–entry structure records a page’s wired status. Pmap_change_wiring() sets or clears this bit when it is invoked with a valid virtual address. Since the wired bit is ignored by the hardware, there is no need to modify the TLB or cache when the bit is changed.
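
Given an assumed software-defined PG_WIRED bit that the hardware ignores, and assuming that pt_entry_t is an integral type, the routine is trivial:

    void
    pmap_change_wiring(pmap_t pmap, vm_offset_t va, boolean_t wired)
    {
        pt_entry_t *pte = pmap_pte(pmap, va);

        if (pte == NULL || !pte_valid(pte))
            return;                         /* no mapping to mark */
        if (wired)
            *pte |= PG_WIRED;
        else
            *pte &= ~PG_WIRED;
        /* the hardware ignores this bit, so no TLB or cache flush */
    }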

Management of Page-Usage Information

The machine-independent page-management code needs to be able to get basic information about the usage and modification of pages from the underlying hardware. The pmap module facilitates the collection of this information without requiring the machine-independent code to understand the details of the mapping tables by providing a set of interfaces to query and clear the reference and modify bits. The pageout daemon can call pmap_is_modified() to determine whether a page is dirty. If the page is dirty, the pageout daemon can write it to backing store, then call pmap_clear_modify() to clear the modify bit. Similarly, when the pageout daemon pages out or inactivates a page, it uses pmap_clear_reference() to clear the reference bit for the page. Later, when it considers moving the page from the inactive list, it uses pmap_is_referenced() to check whether the page has been used since it was inactivated. If the page has been used, it is moved back to the active list; otherwise, it is moved to the free list.
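
This usage can be sketched as follows. The two phases are collapsed into a single hypothetical routine for illustration; m is a vm_page pointer, and the page-writing step is elided.

    void
    examine_page(vm_page_t m)
    {
        vm_offset_t pa = VM_PAGE_TO_PHYS(m);

        if (pmap_is_modified(pa)) {
            /* write the page to backing store (not shown), then: */
            pmap_clear_modify(pa);
        }
        pmap_clear_reference(pa);       /* begin its inactive period */

        /* later, when the page is considered for reclamation: */
        if (pmap_is_referenced(pa))
            vm_page_activate(m);        /* used again: back to active */
        else
            vm_page_free(m);            /* still idle: reclaim it */
    }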

One important feature of the query routines is that they should return valid information even if there are currently no mappings for the page in question. Thus, referenced and modified information cannot just be gathered from the hardware-maintained bits of the various page-table or TLB entries; rather, there must be an auxiliary array where the information is retained when a mapping is removed.

The HP300 implementation of these routines is simple. As mentioned in the subsection on initialization and startup, a page-attribute array with one entry per managed physical page is allocated at boot time. Initially zeroed, the entries are updated whenever a mapping for a page is removed. The query routines return FALSE if they are not passed a managed physical page. Otherwise, they test the referenced or modified bit of the appropriate attribute-array entry and, if the bit is set, return TRUE immediately. Since this attribute array contains only past information, they still need to check status bits in the page-table entries for currently valid mappings of the page. Thus, they loop over all pv_entry structures associated with the physical page and examine the appropriate page-table entry for each. They can return TRUE as soon as they encounter a set bit or FALSE if the bit is not set in any page-table entry.
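
For example, pmap_is_modified() might be sketched this way, combining the attribute array with a scan of the live page-table entries; pte_modified() is an assumed helper.

    boolean_t
    pmap_is_modified(vm_offset_t pa)
    {
        struct pv_entry *pv;

        if (pa < vm_first_phys || pa >= vm_last_phys)
            return (FALSE);                 /* not a managed page */
        if (pmap_attributes[pa_index(pa)] & PMAP_ATTR_MOD)
            return (TRUE);                  /* recorded from a past mapping */
        /* also check hardware bits in currently valid mappings */
        for (pv = pa_to_pvh(pa); pv != NULL && pv->pv_pmap != NULL;
            pv = pv->pv_next)
            if (pte_modified(pmap_pte(pv->pv_pmap, pv->pv_va)))
                return (TRUE);              /* a live mapping dirtied it */
        return (FALSE);
    }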

The clear routines also return immediately if they are not passed a managed physical page. Otherwise, the referenced or modified bit is cleared in the attribute array, and they loop over all pv_entry structures associated with the physical page, clearing the hardware-maintained page-table–entry bits. This final step may involve TLB or cache flushes along the way or afterward.

Initialization of Physical Pages

Two interfaces are provided to allow the higher-level virtual-memory routines to initialize physical memory. Pmap_zero_page() takes a physical address and fills the page with zeros. Pmap_copy_page() takes two physical addresses and copies the contents of the first page to the second page. Since both take physical addresses, the pmap module will most likely first have to map those pages into the kernel’s address space before it can access them. Since dynamically mapping and unmapping single pages may be expensive, an alternative is to have all physical memory permanently mapped into the kernel’s address space at boot time. With this technique, adding an offset to the physical address is all that is needed to create a usable kernel virtual address.
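
Under the permanent-mapping alternative, the translation reduces to a single macro. The names here are hypothetical, and, as noted below, the HP300 does not use this technique.

    /* Assumes all physical memory is mapped at boot, starting at the
       kernel virtual address kva_phys_base. */
    #define PHYS_TO_KVA(pa)  ((caddr_t)((pa) + kva_phys_base))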

The HP300 implementation has a pair of global kernel virtual addresses reserved for zeroing and copying pages, and thus is not as efficient as it could be. Pmap_zero_page() calls pmap_enter() with the reserved virtual address and the specified physical address, calls bzero() to clear the page, and then removes the temporary mapping with the single translation-invalidation primitive used by pmap_remove(). Similarly, pmap_copy_page() creates mappings for both physical addresses, uses bcopy() to make the copy, and then removes both mappings.
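
A sketch of the HP300 approach, using a pair of reserved kernel virtual addresses (here called CADDR1 and CADDR2, names assumed), follows; pmap_remove_mapping() is the single-translation invalidation sketched earlier.

    void
    pmap_zero_page(vm_offset_t pa)
    {
        pmap_enter(kernel_pmap, (vm_offset_t)CADDR1, pa,
            VM_PROT_READ | VM_PROT_WRITE, TRUE);
        bzero(CADDR1, PAGE_SIZE);
        pmap_remove_mapping(kernel_pmap, (vm_offset_t)CADDR1);
    }

    void
    pmap_copy_page(vm_offset_t src, vm_offset_t dst)
    {
        pmap_enter(kernel_pmap, (vm_offset_t)CADDR1, src,
            VM_PROT_READ, TRUE);
        pmap_enter(kernel_pmap, (vm_offset_t)CADDR2, dst,
            VM_PROT_READ | VM_PROT_WRITE, TRUE);
        bcopy(CADDR1, CADDR2, PAGE_SIZE);
        pmap_remove_mapping(kernel_pmap, (vm_offset_t)CADDR1);
        pmap_remove_mapping(kernel_pmap, (vm_offset_t)CADDR2);
    }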

Management of Internal Data Structures

The remaining pmap interface routines are used for management and synchronization of internal data structures. Pmap_create() creates an instance of the machine-dependent pmap structure. The value returned is the handle used for all other pmap routines. Pmap_reference() increments the reference count for a particular pmap. In theory this reference count allows a pmap to be shared by multiple processes; in practice, only the kernel submaps that use the kernel’s pmap share references. Since kernel submaps as well as the kernel map are permanent, there is currently no real need to maintain a reference count. Pmap_destroy() decrements the reference count of the given pmap and deallocates the pmap’s resources when the count drops to zero.

Because of an incomplete transition in the virtual-memory code, there is also a second set of routines that effectively create and destroy pmaps. Pmap_pinit() initializes an already-existing pmap structure, and pmap_release() frees any resources associated with a pmap without freeing the pmap structure itself. These routines were added in support of the vmspace structure that encapsulates all storage associated with a process’s virtual-memory state.

On the HP300, the create and destroy routines use the kernel malloc() and free() routines to manage space for the pmap structure, and then use pmap_pinit() and pmap_release() to initialize and release the pmap. Pmap_pinit() sets the process segment-table pointer to the common empty segment table. As noted earlier in the subsection on mapping allocation and deallocation, page-table allocation is delayed until the first access to the process’s address space. Pmap_release() simply frees the process segment and page tables.
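
The pairing might look like this sketch; M_VMPMAP and pm_count are assumed names for the kernel malloc type and the reference count.

    pmap_t
    pmap_create(vm_size_t size)
    {
        pmap_t pmap;

        pmap = (pmap_t)malloc(sizeof(*pmap), M_VMPMAP, M_WAITOK);
        bzero((caddr_t)pmap, sizeof(*pmap));
        pmap_pinit(pmap);           /* points at the common empty segment table */
        pmap->pm_count = 1;
        return (pmap);
    }

    void
    pmap_destroy(pmap_t pmap)
    {
        if (--pmap->pm_count == 0) {
            pmap_release(pmap);     /* free segment and page tables */
            free((caddr_t)pmap, M_VMPMAP);
        }
    }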

Pmap_copy() and pmap_pageable() are optional interface routines that are used to provide hints to the pmap module about the use of virtual-memory regions. Pmap_copy() is called when a copy-on-write operation has been done. Its parameters include the source and destination pmap, and the virtual address and the length of the region copied. On the HP300, this routine does nothing. Pmap_pageable() indicates that the specified address range has been either wired or unwired. The HP300 pmap module uses this interface to detect when a page-table page is empty and can be released. The current implementation does not free the page-table page; it just clears the modified state of the page and allows the page to be reclaimed by the pageout daemon as needed. Clearing the modify bit is necessary to prevent the empty page from being wastefully written out to backing store.

Pmap_update() is called to notify the pmap module that all delayed actions for all pmaps should be done now. On the HP300, this routine does nothing. Pmap_collect() is called to inform the pmap module that the given pmap is not expected to be used for some time, allowing the pmap module to reclaim resources that could be used more effectively elsewhere. Currently, it is called whenever a process is about to be swapped out. The HP300 pmap module does not use this information for user processes, but it does use the information to attempt to reclaim unused kernel page-table pages when none are available on the free list.

Exercises

5.1  What does it mean for a machine to support virtual memory? What four hardware facilities are typically required for a machine to support virtual memory?

5.2  What is the relationship between paging and swapping on a demand-paged virtual-memory system? Explain whether it is desirable to provide both mechanisms in the same system. Can you suggest an alternative to providing both mechanisms?

5.3  What three policies characterize paging systems? Which of these policies usually has no effect on the performance of a paging system?

5.4  Describe a disadvantage of the scheme used for the management of swap space that holds the dynamic per-process segments. Hint: Consider what happens when a process on a heavily paging system expands in many small increments.

5.5  What is copy-on-write? In most UNIX applications, the fork system call is followed almost immediately by an exec system call. Why does this behavior make it particularly attractive to use copy-on-write in implementing fork?

5.6  Explain why the vfork system call will always be more efficient than a clever implementation of the fork system call.

5.7  When a process exits, all its pages may not be placed immediately on the memory free list. Explain this behavior.

5.8  What is clustering? Where is it used in the virtual-memory system?

5.9  What purpose does the pageout-daemon process serve in the virtual-memory system? What facility is used by the pageout daemon that is not available to a normal user process?

5.10  Why is the sticky bit no longer useful in 4.4BSD?

5.11  Give two reasons for swapping to be initiated.

*5.12  The 4.3BSD virtual-memory system had a text cache that retained the identity of text pages from one execution of a program to the next. How does the object cache in 4.4BSD improve on the performance of the 4.3BSD text cache?

*5.13  Postulate a scenario under which the HP300 kernel would deadlock if it were to allocate kernel page-table pages dynamically.

References

Babaoğlu & Joy, 1981.
Ö. Babaoğlu & W. N. Joy, “Converting a Swap-Based System to Do Paging in an Architecture Lacking Page-Referenced Bits,” Proceedings of the Eighth Symposium on Operating Systems Principles, p. 78–86, December 1981.

Belady, 1966.
L. A. Belady, “A Study of Replacement Algorithms for Virtual Storage Systems,” IBM Systems Journal, vol. 5, no. 2, p. 78–101, 1966.

Coffman & Denning, 1973.
E. G. Coffman, Jr. & P. J. Denning, Operating Systems Theory, p. 243, Prentice-Hall, Englewood Cliffs, NJ, 1973.

Corbato, 1968.
F. J. Corbato, “A Paging Experiment with the Multics System,” Project MAC Memo MAC-M-384, Massachusetts Institute of Technology, Boston, MA, July 1968.

Denning, 1970.
P. J. Denning, “Virtual Memory,” Computer Surveys, vol. 2, no. 3, p. 153–190, September 1970.

Easton & Franaszek, 1979.
M. C. Easton & P. A. Franaszek, “Use Bit Scanning in Replacement Decisions,” IEEE Transactions on Computing, vol. 28, no. 2, p. 133–141, February 1979.

Gingell et al., 1987.
R. Gingell, M. Lee, X. Dang, & M. Weeks, “Shared Libraries in SunOS,” USENIX Association Conference Proceedings, p. 131–146, June 1987.

Gingell, Moran, & Shannon, 1987.
R. Gingell, J. Moran, & W. Shannon, “Virtual Memory Architecture in SunOS,” USENIX Association Conference Proceedings, p. 81–94, June 1987.

Intel, 1984.
Intel, “Introduction to the iAPX 286,” Order Number 210308, Intel Corporation, Santa Clara, CA, 1984.

Kenah & Bate, 1984.
L. J. Kenah & S. F. Bate, VAX/VMS Internals and Data Structures, Digital Press, Bedford, MA, 1984.

King, 1971.
W. F. King, “Analysis of Demand Paging Algorithms,” IFIP, p. 485–490, North Holland, Amsterdam, 1971.

Korn & Vo, 1985.
D. Korn & K. Vo, “In Search of a Better Malloc,” USENIX Association Conference Proceedings, p. 489–506, June 1985.

Lazowska & Kelsey, 1978.
E. D. Lazowska & J. M. Kelsey, “Notes on Tuning VAX/VMS,” Technical Report 78-12-01, Department of Computer Science, University of Washington, Seattle, WA, December 1978.

Marshall, 1979.
W. T. Marshall, “A Unified Approach to the Evaluation of a Class of ‘Working Set Like’ Replacement Algorithms,” PhD Thesis, Department of Computer Engineering, Case Western Reserve University, Cleveland, OH, May 1979.

McKusick & Karels, 1988.
M. K. McKusick & M. J. Karels, “Design of a General Purpose Memory Allocator for the 4.3BSD UNIX Kernel,” USENIX Association Conference Proceedings, p. 295–304, June 1988.

Organick, 1975.
E. I. Organick, The Multics System: An Examination of Its Structure, MIT Press, Cambridge, MA, 1975.

Tevanian, 1987.
A. Tevanian, “Architecture-Independent Virtual Memory Management for Parallel and Distributed Environments: The Mach Approach,” Technical Report CMU-CS-88-106, Department of Computer Science, Carnegie-Mellon University, Pittsburgh, PA, December 1987.

Young, 1989.
M. W. Young, Exporting a User Interface to Memory Management from a Communication-Oriented Operating System, CMU-CS-89-202, Department of Computer Science, Carnegie-Mellon University, November 1989.
