Memory Regions

Linux implements a memory region by means of an object of type vm_area_struct; its fields are shown in Table 8-3.

Table 8-3. The fields of the memory region object

Type                            Field             Description
----                            -----             -----------
struct mm_struct *              vm_mm             Pointer to the memory descriptor that owns the region
unsigned long                   vm_start          First linear address inside the region
unsigned long                   vm_end            First linear address after the region
struct vm_area_struct *         vm_next           Next region in the process list
pgprot_t                        vm_page_prot      Access permissions for the page frames of the region
unsigned long                   vm_flags          Flags of the region
rb_node_t                       vm_rb             Data for the red-black tree (see later in this chapter)
struct vm_area_struct *         vm_next_share     Pointer to the next element in the file memory mapping list
struct vm_area_struct **        vm_pprev_share    Pointer to the previous element in the file memory mapping list
struct vm_operations_struct *   vm_ops            Pointer to the methods of the memory region
unsigned long                   vm_pgoff          Offset in the mapped file, if any (see Chapter 15)
struct file *                   vm_file           Pointer to the file object of the mapped file, if any
unsigned long                   vm_raend          End of the current read-ahead window of the mapped file (see Section 15.2.4)
void *                          vm_private_data   Pointer to private data of the memory region

Each memory region descriptor identifies a linear address interval. The vm_start field contains the first linear address of the interval, while the vm_end field contains the first linear address outside of the interval; vm_end - vm_start thus denotes the length of the memory region. The vm_mm field points to the mm_struct memory descriptor of the process that owns the region. We shall describe the other fields of vm_area_struct as they come up.
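
To make the layout concrete, here is a simplified sketch of the descriptor's declaration, condensed from the fields in Table 8-3; the actual 2.4 header declares a few additional fields and orders them differently:

/* Simplified sketch of the Linux 2.4 memory region descriptor;
 * field selection and ordering follow Table 8-3, not the real header. */
struct vm_area_struct {
    struct mm_struct *vm_mm;                /* owning memory descriptor */
    unsigned long vm_start;                 /* first address inside the region */
    unsigned long vm_end;                   /* first address after the region */
    struct vm_area_struct *vm_next;         /* next region in the process list */
    pgprot_t vm_page_prot;                  /* access permissions of the pages */
    unsigned long vm_flags;                 /* region flags (see Table 8-4) */
    rb_node_t vm_rb;                        /* red-black tree linkage */
    struct vm_area_struct *vm_next_share;   /* file mapping list: next */
    struct vm_area_struct **vm_pprev_share; /* file mapping list: previous */
    struct vm_operations_struct *vm_ops;    /* methods of the region */
    unsigned long vm_pgoff;                 /* offset in the mapped file */
    struct file *vm_file;                   /* mapped file object, if any */
    unsigned long vm_raend;                 /* end of read-ahead window */
    void *vm_private_data;                  /* private data of the region */
};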

Memory regions owned by a process never overlap, and the kernel tries to merge regions when a new one is allocated right next to an existing one. Two adjacent regions can be merged if their access rights match.

As shown in Figure 8-1, when a new range of linear addresses is added to the process address space, the kernel checks whether an already existing memory region can be enlarged (case a). If not, a new memory region is created (case b). Similarly, if a range of linear addresses is removed from the process address space, the kernel resizes the affected memory regions (case c). In some cases, the resizing forces a memory region to split into two smaller ones (case d).[58]

Figure 8-1. Adding or removing a linear address interval
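
As a rough illustration of the merge test just described, consider the following hypothetical predicate (the kernel's actual logic lives in vma_merge( ), which also handles file-backed regions): an anonymous region ending exactly at the new interval's start can absorb it only when there is no hole and the flags match.

/* Hypothetical helper, for illustration only: can the region prev be
 * enlarged to cover a new anonymous interval starting at addr? */
static int can_expand(struct vm_area_struct *prev,
                      unsigned long addr, unsigned long new_flags)
{
    return prev != NULL
        && prev->vm_end == addr          /* exactly adjacent, no hole */
        && prev->vm_flags == new_flags   /* identical access rights */
        && prev->vm_file == NULL;        /* anonymous, no file mapping */
}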

The vm_ops field points to a vm_operations_struct data structure, which stores the methods of the memory region. Only three methods are defined:

open

Invoked when the memory region is added to the set of regions owned by a process.

close

Invoked when the memory region is removed from the set of regions owned by a process.

nopage

Invoked by the Page Fault exception handler when a process tries to access a page not present in RAM whose linear address belongs to the memory region (see Section 8.4 later in this chapter).
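
In the 2.4 sources, the structure is declared essentially as follows (the name of nopage's third parameter varies across 2.4 releases and is not significant):

/* Methods of a memory region (essentially the 2.4 declaration) */
struct vm_operations_struct {
    void (*open)(struct vm_area_struct *area);
    void (*close)(struct vm_area_struct *area);
    struct page *(*nopage)(struct vm_area_struct *area,
                           unsigned long address, int unused);
};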

Memory Region Data Structures

All the regions owned by a process are linked in a simple list. Regions appear in the list in ascending order by memory address; however, successive regions can be separated by an area of unused memory addresses. The vm_next field of each vm_area_struct element points to the next element in the list. The kernel finds the memory regions through the mmap field of the process memory descriptor, which points to the first memory region descriptor in the list.

The map_count field of the memory descriptor contains the number of regions owned by the process. A process may own up to MAX_MAP_COUNT different memory regions (this value is usually set to 65,536).
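
Scanning the whole list is thus a plain pointer walk. As a minimal sketch, the following hypothetical helper recounts the regions of a process; its result should always equal the map_count field:

/* Minimal sketch: walk the sorted region list of mm and count its
 * elements; the result should match mm->map_count. */
static int count_regions(struct mm_struct *mm)
{
    struct vm_area_struct *vma;
    int n = 0;

    for (vma = mm->mmap; vma != NULL; vma = vma->vm_next)
        n++;
    return n;
}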

Figure 8-2 illustrates the relationships among the address space of a process, its memory descriptor, and the list of memory regions.

Figure 8-2. Descriptors related to the address space of a process

A frequent operation performed by the kernel is to search the memory region that includes a specific linear address. Since the list is sorted, the search can terminate as soon as a memory region that ends after the specific linear address is found.

However, using the list is convenient only if the process has very few memory regions—say, fewer than a few dozen. Searching, inserting, and deleting elements in the list take time proportional to the list length.

Although most Linux processes use very few memory regions, there are some large applications, such as object-oriented databases, that one might consider “pathological” because they have many hundreds or even thousands of regions. In such cases, managing the memory region list becomes very inefficient, and the performance of memory-related system calls degrades intolerably.

Therefore, Linux 2.4 also stores memory region descriptors in data structures called red-black trees.[59] In a red-black tree, each element (or node) has at most two children: a left child and a right child. The elements in the tree are sorted: for each node N, all elements of the subtree rooted at the left child of N precede N, while, conversely, all elements of the subtree rooted at the right child of N follow N (see Figure 8-3(a), where the key of each node is written inside the node itself).

Moreover, a red-black tree must satisfy four additional rules:

  1. Every node must be either red or black.

  2. The root of the tree must be black.

  3. The children of a red node must be black.

  4. Every path from a node to a descendant leaf must contain the same number of black nodes. When counting the number of black nodes, null pointers are counted as black nodes.

Figure 8-3. Example of red-black trees

These four rules ensure that any red-black tree with n internal nodes has a height of at most 2 log2(n + 1).
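
The bound follows from a standard argument, sketched here: an induction on the tree shows that a node of black-height bh roots a subtree containing at least 2^bh - 1 internal nodes, while rule 3 guarantees that black nodes make up at least half of any root-to-leaf path of length h. In LaTeX notation:

    n \;\ge\; 2^{\mathrm{bh}(\mathrm{root})} - 1 \;\ge\; 2^{h/2} - 1
    \quad\Longrightarrow\quad
    h \;\le\; 2\log_2(n + 1)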

Searching an element in a red-black tree is thus very efficient because it requires operations whose execution time is proportional to the logarithm of the tree size. In other words, doubling the number of memory regions adds just one more iteration to the operation.

Inserting and deleting an element in a red-black tree is also efficient because the algorithm can quickly traverse the tree to locate the position at which the element will be inserted or from which it will be removed. Any new node must be inserted as a leaf and colored red. If the operation breaks the rules, a few nodes of the tree must be moved or recolored.

For instance, suppose that an element having the value 4 must be inserted in the red-black tree shown in Figure 8-3(a). Its proper position is the right child of the node that has key 3, but once it is inserted, the red node that has the value 3 has a red child, thus breaking rule 3. To satisfy the rule, the color of nodes that have the values 3, 4, and 7 is changed. This operation, however, breaks rule 4, thus the algorithm performs a “rotation” on the subtree rooted at the node that has the key 19, producing the new red-black tree shown in Figure 8-3(b). This looks complicated, but inserting or deleting an element in a red-black tree requires a small number of operations—a number proportional to the logarithm of the tree size.

Therefore, to store the memory regions of a process, Linux uses both a linked list and a red-black tree. Both data structures contain pointers to the same memory region descriptors. When inserting or removing a memory region descriptor, the kernel searches the previous and next elements through the red-black tree and uses them to quickly update the list without scanning it.

The head of the linked list is referenced by the mmap field of the memory descriptor. Each memory region object stores the pointer to the next element of the list in the vm_next field. The head of the red-black tree is referenced by the mm_rb field of the memory descriptor. Each memory region object stores the color of the node, as well as the pointers to the parent, the left child, and the right child, in the vm_rb field of type rb_node_t.

In general, the red-black tree is used to locate a region including a specific address, while the linked list is mostly useful when scanning the whole set of regions.

Memory Region Access Rights

Before moving on, we should clarify the relation between a page and a memory region. As mentioned in Chapter 2, we use the term “page” to refer both to a set of linear addresses and to the data contained in this group of addresses. In particular, we denote the linear address interval ranging between 0 and 4,095 as page 0, the linear address interval ranging between 4,096 and 8,191 as page 1, and so forth. Each memory region therefore consists of a set of pages that have consecutive page numbers.
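
With 4-KB pages, the page number of a linear address is simply the address stripped of its 12 low-order bits; a trivial sketch (page_number is a hypothetical helper, not a kernel function):

static inline unsigned long page_number(unsigned long addr)
{
    return addr >> PAGE_SHIFT;   /* PAGE_SHIFT is 12 on the 80 x 86;
                                    e.g., addresses 4,096-8,191 -> page 1 */
}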

We have already discussed two kinds of flags associated with a page:

  • A few flags such as Read/Write, Present, or User/Supervisor stored in each Page Table entry (see Section 2.4.1).

  • A set of flags stored in the flags field of each page descriptor (see Section 7.1).

The first kind of flag is used by the 80 × 86 hardware to check whether the requested kind of addressing can be performed; the second kind is used by Linux for many different purposes (see Table 7-2).

We now introduce a third kind of flag: the flags associated with the pages of a memory region. They are stored in the vm_flags field of the vm_area_struct descriptor (see Table 8-4). Some flags offer the kernel information about all the pages of the memory region, such as what they contain and what rights the process has to access each page. Other flags describe the region itself, such as how it can grow.

Table 8-4. The memory region flags

Flag name        Description
---------        -----------
VM_READ          Pages can be read.
VM_WRITE         Pages can be written.
VM_EXEC          Pages can be executed.
VM_SHARED        Pages can be shared by several processes.
VM_MAYREAD       The VM_READ flag may be set.
VM_MAYWRITE      The VM_WRITE flag may be set.
VM_MAYEXEC       The VM_EXEC flag may be set.
VM_MAYSHARE      The VM_SHARED flag may be set.
VM_GROWSDOWN     The region can expand toward lower addresses.
VM_GROWSUP       The region can expand toward higher addresses.
VM_SHM           The region is used for IPC’s shared memory.
VM_DENYWRITE     The region maps a file that cannot be opened for writing.
VM_EXECUTABLE    The region maps an executable file.
VM_LOCKED        Pages in the region are locked and cannot be swapped out.
VM_IO            The region maps the I/O address space of a device.
VM_SEQ_READ      The application accesses the pages sequentially.
VM_RAND_READ     The application accesses the pages in a truly random order.
VM_DONTCOPY      The region is not copied when a new process is forked.
VM_DONTEXPAND    The region cannot be expanded through the mremap( ) system call.
VM_RESERVED      The region is never swapped out.

Page access rights included in a memory region descriptor may be combined arbitrarily. It is possible, for instance, to allow the pages of a region to be executed but not read. To implement this protection scheme efficiently, the read, write, and execute access rights associated with the pages of a memory region must be duplicated in all the corresponding Page Table entries so that checks can be directly performed by the Paging Unit circuitry. In other words, the page access rights dictate what kinds of access should generate a Page Fault exception. As we shall see shortly, the job of figuring out what caused the Page Fault is delegated by Linux to the Page Fault handler, which implements several page-handling strategies.

The initial values of the Page Table flags (which must be the same for all pages in the memory region, as we have seen) are stored in the vm_page_prot field of the vm_area_struct descriptor. When adding a page, the kernel sets the flags in the corresponding Page Table entry according to the value of the vm_page_prot field.

However, translating the memory region’s access rights into the page protection bits is not straightforward for the following reasons:

  • In some cases, a page access should generate a Page Fault exception even when its access type is granted by the page access rights specified in the vm_flags field of the corresponding memory region. For instance, as we shall see in Section 8.4.4 later in this chapter, the kernel may wish to store two identical, writable private pages (whose VM_SHARED flags are cleared) belonging to two different processes into the same page frame; in this case, an exception should be generated when either one of the processes tries to modify the page.

  • 80 × 86 processors’ Page Tables have just two protection bits, namely the Read/Write and User/Supervisor flags. Moreover, the User/Supervisor flag of any page included in a memory region must always be set, since the page must always be accessible by User Mode processes.

To overcome the hardware limitation of the 80 × 86 microprocessors, Linux adopts the following rules:

  • The read access right always implies the execute access right.

  • The write access right always implies the read access right.

Moreover, to correctly defer the allocation of page frames through the Copy On Write technique (described in Section 8.4.4 later in this chapter), the page frame is write-protected whenever the corresponding page must not be shared by several processes. Therefore, the 16 possible combinations of the read, write, execute, and share access rights are scaled down to the following three:

  • If the page has both write and share access rights, the Read/Write bit is set.

  • If the page has the read or execute access right but does not have either the write or the share access right, the Read/Write bit is cleared.

  • If the page does not have any access rights, the Present bit is cleared so that each access generates a Page Fault exception. However, to distinguish this condition from the real page-not-present case, Linux also sets the Page size bit to 1.[60]

The downscaled protection bits corresponding to each combination of access rights are stored in the protection_map array.
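
The following sketch restates the three rules as code; it is behaviorally equivalent to indexing protection_map with the low four bits of vm_flags, but it is not the kernel's actual implementation (which is a precomputed table):

/* Sketch of the downscaling rules; PAGE_SHARED, PAGE_READONLY, and
 * PAGE_NONE are the 80 x 86 protection values used by Linux 2.4. */
static pgprot_t downscale(unsigned long vm_flags)
{
    if ((vm_flags & VM_SHARED) && (vm_flags & VM_WRITE))
        return PAGE_SHARED;     /* Read/Write bit set */
    if (vm_flags & (VM_READ | VM_WRITE | VM_EXEC))
        return PAGE_READONLY;   /* present, but write-protected */
    return PAGE_NONE;           /* Present bit cleared, Page size bit set */
}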

Memory Region Handling

Having gained a basic understanding of the data structures and state information that control memory handling, we can look at a group of low-level functions that operate on memory region descriptors. They should be considered auxiliary functions that simplify the implementation of do_mmap( ) and do_munmap( ). Those two functions, which are described in Section 8.3.4 and Section 8.3.5 later in this chapter, enlarge and shrink the address space of a process, respectively. Working at a higher level than the functions we consider here, they do not receive a memory region descriptor as their parameter, but rather the initial address, the length, and the access rights of a linear address interval.

Finding the closest region to a given address: find_vma( )

The find_vma( ) function takes two parameters: the address mm of a process memory descriptor and a linear address addr. It locates the first memory region whose vm_end field is greater than addr and returns the address of its descriptor; if no such region exists, it returns a NULL pointer. Notice that the region selected by find_vma( ) does not necessarily include addr because addr may lie outside of any memory region.

Each memory descriptor includes a mmap_cache field that stores the descriptor address of the region that was last referenced by the process. This additional field is introduced to reduce the time spent in looking for the region that contains a given linear address. Locality of address references in programs makes it highly likely that if the last linear address checked belonged to a given region, the next one to be checked belongs to the same region.

The function thus starts by checking whether the region identified by mmap_cache includes addr. If so, it returns the region descriptor pointer:

vma = mm->mmap_cache; 
if (vma && vma->vm_end > addr && vma->vm_start <= addr) 
    return vma;

Otherwise, the memory regions of the process must be scanned, and the function looks up the memory region in the red-black tree:

rb_node = mm->mm_rb.rb_node;
vma = NULL;
while (rb_node) {
    vma_tmp = rb_entry(rb_node, struct vm_area_struct, vm_rb);
    if (vma_tmp->vm_end > addr) {
        vma = vma_tmp;
        if (vma_tmp->vm_start <= addr)
            break;
        rb_node = rb_node->rb_left;
    } else
        rb_node = rb_node->rb_right;
} 
if (vma)
    mm->mmap_cache = vma;
return vma;

The function uses the rb_entry macro, which derives from a pointer to a node of the red-black tree the address of the corresponding memory region descriptor.

The kernel also defines the find_vma_prev( ) function (which returns the descriptor addresses of the memory region that precedes the linear address given as parameter and of the memory region that follows it) and the find_vma_prepare( ) function (which locates the position of the new leaf in the red-black tree that corresponds to a given linear address and returns the addresses of the preceding memory region and of the parent node of the leaf to be inserted).

Finding a region that overlaps a given interval: find_vma_intersection( )

The find_vma_intersection( ) function finds the first memory region that overlaps a given linear address interval; the mm parameter points to the memory descriptor of the process, while the start_addr and end_addr linear addresses specify the interval:

vma = find_vma(mm,start_addr); 
if (vma && end_addr <= vma->vm_start) 
    vma = NULL; 
return vma;

The function returns a NULL pointer if no such region exists. To be exact, if find_vma( ) returns a valid address but the memory region found starts after the end of the linear address interval, vma is set to NULL.
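
A typical use, sketched here with a hypothetical wrapper, is to verify that a candidate interval is completely free before establishing a new mapping:

/* Hypothetical helper: returns 1 if no region of mm overlaps
 * the interval [start, end), 0 otherwise. */
static int interval_is_free(struct mm_struct *mm,
                            unsigned long start, unsigned long end)
{
    return find_vma_intersection(mm, start, end) == NULL;
}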

Finding a free interval: arch_get_unmapped_area( )

The arch_get_unmapped_area( ) function searches the process address space to find an available linear address interval. The len parameter specifies the interval length, while the addr parameter may specify the address from which the search is started. If the search is successful, the function returns the initial address of the new interval; otherwise, it returns the error code -ENOMEM.

if (len > TASK_SIZE)
    return -ENOMEM;
addr = (addr + 0xfff) & 0xfffff000;
if (addr && addr + len <= TASK_SIZE) {
    vma = find_vma(current->mm, addr);
    if (!vma || addr + len <= vma->vm_start)
        return addr;
} 
addr = (TASK_SIZE/3 + 0xfff) & 0xfffff000;
for (vma = find_vma(current->mm, addr); ; vma = vma->vm_next) {
    if (addr + len > TASK_SIZE)
        return -ENOMEM;
    if (!vma || addr + len <= vma->vm_start)
        return addr;
    addr = vma->vm_end;
}

The function starts by checking to make sure the interval length is within the limit imposed on User Mode linear addresses, usually 3 GB. If addr is different from zero, the function tries to allocate the interval starting from addr. To be on the safe side, the function rounds up the value of addr to a multiple of 4 KB. If addr is 0 or the previous search failed, the search’s starting point is set to one-third of the User Mode linear address space. Starting from addr, the function then repeatedly invokes find_vma( ) with increasing values of addr to find the required free interval. During this search, the following cases may occur:

  • The requested interval is larger than the portion of linear address space yet to be scanned (addr + len > TASK_SIZE). Since there are not enough linear addresses to satisfy the request, return -ENOMEM.

  • The hole following the last scanned region is not large enough (vma != NULL && vma->vm_start < addr + len). Consider the next region.

  • If neither one of the preceding conditions holds, a large enough hole has been found. Return addr.

Inserting a region in the memory descriptor list: insert_vm_struct( )

insert_vm_struct( ) inserts a vm_area_struct structure in the memory region object list and red-black tree of a memory descriptor. It uses two parameters: mm, which specifies the address of a process memory descriptor, and vma, which specifies the address of the vm_area_struct object to be inserted. The vm_start and vm_end fields of the memory region object must have already been initialized. The function invokes the find_vma_prepare( ) function to look up the position in the red-black tree mm->mm_rb where vma should go. Then insert_vm_struct( ) invokes the vma_link( ) function, which in turn:

  1. Acquires the mm->page_table_lock spin lock.

  2. Inserts the memory region in the linked list referenced by mm->mmap.

  3. Inserts the memory region in the red-black tree mm->mm_rb.

  4. Releases the mm->page_table_lock spin lock.

  5. Increments by 1 the mm->map_count counter.

If the region contains a memory-mapped file, the vma_link( ) function performs additional tasks that are described in Chapter 16.

The kernel also defines the _ _insert_vm_struct( ) function, which is identical to insert_vm_struct( ) but doesn’t acquire any lock before modifying the memory region data structures referenced by mm. The kernel uses it when it is sure that no concurrent accesses to the memory region data structures can happen—for instance, because it already acquired a suitable lock.

The _ _vma_unlink( ) function receives as parameter a memory descriptor address mm and two memory region object addresses vma and prev. Both memory regions should belong to mm, and prev should precede vma in the memory region ordering. The function removes vma from the linked list and the red-black tree of the memory descriptor.
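
In the 2.4 sources, its body is essentially the following:

static inline void __vma_unlink(struct mm_struct *mm,
                                struct vm_area_struct *vma,
                                struct vm_area_struct *prev)
{
    prev->vm_next = vma->vm_next;       /* bypass vma in the list */
    rb_erase(&vma->vm_rb, &mm->mm_rb);  /* drop it from the tree */
    if (mm->mmap_cache == vma)          /* don't leave a stale cache */
        mm->mmap_cache = prev;
}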

Allocating a Linear Address Interval

Now let’s discuss how new linear address intervals are allocated. To do this, the do_mmap( ) function creates and initializes a new memory region for the current process. However, after a successful allocation, the memory region could be merged with other memory regions defined for the process.

The function uses the following parameters:

file and offset

File descriptor pointer file and file offset offset are used if the new memory region will map a file into memory. This topic is discussed in Chapter 15. In this section, we assume that no memory mapping is required and that file and offset are both NULL.

addr

This linear address specifies where the search for a free interval must start.

len

The length of the linear address interval.

prot

This parameter specifies the access rights of the pages included in the memory region. Possible flags are PROT_READ, PROT_WRITE, PROT_EXEC, and PROT_NONE. The first three flags mean the same things as the VM_READ, VM_WRITE, and VM_EXEC flags. PROT_NONE indicates that the process has none of those access rights.

flag

This parameter specifies the remaining memory region flags:

MAP_GROWSDOWN, MAP_LOCKED, MAP_DENYWRITE, and MAP_EXECUTABLE

Their meanings are identical to those of the flags listed in Table 8-4.

MAP_SHARED and MAP_PRIVATE

The former flag specifies that the pages in the memory region can be shared among several processes; the latter flag has the opposite effect. Both flags refer to the VM_SHARED flag in the vm_area_struct descriptor.

MAP_ANONYMOUS

No file is associated with the memory region (see Chapter 15).

MAP_FIXED

The initial linear address of the interval must be exactly the one specified in the addr parameter.

MAP_NORESERVE

The function doesn’t have to do a preliminary check on the number of free page frames.
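
For instance, an anonymous mapping request might reach do_mmap( ) as in the following sketch; the parameter values are illustrative, and we assume, as in the 2.4 sources, that the caller must acquire the mmap_sem semaphore of the process for writing:

unsigned long addr;

down_write(&current->mm->mmap_sem);
addr = do_mmap(NULL, 0, 2 * PAGE_SIZE,     /* two anonymous pages */
               PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, 0);
up_write(&current->mm->mmap_sem);
/* on failure, addr holds a small negative error code such as -ENOMEM */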

The do_mmap( ) function performs some preliminary checks on the value of offset and then executes the do_mmap_pgoff( ) function. Assuming that the new interval of linear addresses does not map a file on disk, the latter function executes the following steps:

  1. Checks whether the parameter values are correct and whether the request can be satisfied. In particular, it checks for the following conditions that prevent it from satisfying the request:

    • The linear address interval has zero length or includes addresses greater than TASK_SIZE.

    • The process has already mapped too many memory regions, so the value of the map_count field of its mm memory descriptor exceeds MAX_MAP_COUNT.

    • The flag parameter specifies that the pages of the new linear address interval must be locked in RAM, and the number of pages locked by the process exceeds the threshold stored in the rlim[RLIMIT_MEMLOCK].rlim_cur field of the process descriptor.

    If any of the preceding conditions holds, do_mmap_pgoff( ) terminates by returning a negative value. If the linear address interval has a zero length, the function returns without performing any action.

  2. Obtains a linear address interval for the new region; if the MAP_FIXED flag is set, a check is made on the addr value. Otherwise, the arch_get_unmapped_area( ) function is invoked to get it:

        if (flags & MAP_FIXED) {
            if (addr + len > TASK_SIZE)
                return -ENOMEM;
            if (addr & ~PAGE_MASK)
                return -EINVAL;
        } else 
            addr = arch_get_unmapped_area(file, addr, len, pgoff, flags);
  3. Computes the flags of the new memory region by combining the values stored in the prot and flags parameters:

        vm_flags = calc_vm_flags(prot,flags) | mm->def_flags 
                   | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
        if (flags & MAP_SHARED)
            vm_flags |= VM_SHARED | VM_MAYSHARE;

    The calc_vm_flags( ) function sets the VM_READ, VM_WRITE, and VM_EXEC flags in vm_flags only if the corresponding PROT_READ, PROT_WRITE, and PROT_EXEC flags in prot are set; it also sets the VM_GROWSDOWN, VM_DENYWRITE, and VM_EXECUTABLE flags in vm_flags only if the corresponding MAP_GROWSDOWN, MAP_DENYWRITE, and MAP_EXECUTABLE flags in flags are set. A few other flags are set to 1 in vm_flags: VM_MAYREAD, VM_MAYWRITE, VM_MAYEXEC, the default flags for all memory regions in mm->def_flags,[61] and both VM_SHARED and VM_MAYSHARE if the memory region has to be shared with other processes. (A behaviorally equivalent sketch of calc_vm_flags( ) appears after this list.)

  4. Invokes find_vma_prepare( ) to locate the object of the memory region that shall precede the new interval, as well as the position of the new region in the red-black tree:

        for (;;) {
            vma = find_vma_prepare(mm, addr, &prev, &rb_link, &rb_parent);
            if (!vma || vma->vm_start >= addr + len)
                break;
            if (do_munmap(mm, addr, len))
                    return -ENOMEM;
        }

    The find_vma_prepare( ) function also checks whether a memory region that overlaps the new interval already exists. This occurs when the function returns a non-NULL address pointing to a region that starts before the end of the new interval. In this case, do_mmap_pgoff( ) invokes do_munmap( ) to remove the new interval and then repeats the whole step (see Section 8.3.5 later in this chapter).

  5. Checks whether inserting the new memory region causes the size of the process address space (mm->total_vm<<PAGE_SHIFT)+len to exceed the threshold stored in the rlim[RLIMIT_AS].rlim_cur field of the process descriptor. In the affirmative case, it returns the error code -ENOMEM. Notice that the check is done here and not in Step 1 with the other checks because some memory regions could have been removed in Step 4.

  6. Returns the error code -ENOMEM if the MAP_NORESERVE flag was not set in the flags parameter, the new memory region contains private writable pages, and the number of free page frames is less than the size (in pages) of the linear address interval; this last check is performed by the vm_enough_memory( ) function.

  7. If the new interval is private (VM_SHARED not set) and it does not map a file on disk, invokes vma_merge( ) to check whether the preceding memory region can be expanded in such a way to include the new interval. Of course, the preceding memory region must have exactly the same flags as those stored in the vm_flags local variable. If the preceding memory region can be expanded, vma_merge( ) also tries to merge it with the following memory region (this occurs when the new interval fills the hole between two memory regions and all three have the same flags). In case it succeeds in expanding the preceding memory region, jump to Step 12.

  8. Allocates a vm_area_struct data structure for the new memory region by invoking the kmem_cache_alloc( ) slab allocator function.

  9. Initializes the new memory region object (pointed to by vma):

        vma->vm_mm = mm;
        vma->vm_start = addr;
        vma->vm_end = addr + len;
        vma->vm_flags = vm_flags;
        vma->vm_page_prot = protection_map[vm_flags & 0x0f];
        vma->vm_ops = NULL;
        vma->vm_pgoff = pgoff;
        vma->vm_file = NULL;
        vma->vm_private_data = NULL;
        vma->vm_raend = 0;
  10. If the MAP_SHARED flag is set (and the new memory region doesn’t map a file on disk), the region is used for IPC shared memory. The function invokes shmem_zero_setup( ) to initialize it (see Chapter 19).

  11. Invokes vma_link( ) to insert the new region in the memory region list and red-black tree (see Section 8.3.3.4 earlier in this chapter).

  12. Increments the size of the process address space stored in the total_vm field of the memory descriptor.

  13. If the VM_LOCKED flag is set, increments the counter of locked pages mm->locked_vm:

        if (vm_flags & VM_LOCKED) {
            mm->locked_vm += len >> PAGE_SHIFT;
            make_pages_present(addr, addr + len);
        }

    The make_pages_present( ) function is invoked to allocate all the pages of the memory region in succession and lock them in RAM. The function, in turn, invokes get_user_pages( ) as follows:

        write = (vma->vm_flags & VM_WRITE) != 0;
        get_user_pages(current, current->mm, addr, len, write, 0, NULL, NULL);

    The get_user_pages( ) function cycles through all starting linear addresses of the pages between addr and addr+len; for each of them, it invokes follow_page( ) to check whether there is a mapping to a physical page in the Page Tables of the current process. If no such physical page exists, get_user_pages( ) invokes handle_mm_fault( ), which, as we shall see in Section 8.4.2, allocates one page frame and sets its Page Table entry according to the vm_flags field of the memory region descriptor.

  14. Finally, terminates by returning the linear address of the new memory region.
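
As promised in Step 3 above, here is a behaviorally equivalent sketch of the flag computation; the real 2.4 calc_vm_flags( ) obtains the same result through a small translation macro:

/* Sketch of calc_vm_flags(); behaviorally equivalent, not the
 * kernel's actual implementation. */
static unsigned long calc_vm_flags_sketch(unsigned long prot,
                                          unsigned long flags)
{
    unsigned long vm_flags = 0;

    if (prot & PROT_READ)       vm_flags |= VM_READ;
    if (prot & PROT_WRITE)      vm_flags |= VM_WRITE;
    if (prot & PROT_EXEC)       vm_flags |= VM_EXEC;
    if (flags & MAP_GROWSDOWN)  vm_flags |= VM_GROWSDOWN;
    if (flags & MAP_DENYWRITE)  vm_flags |= VM_DENYWRITE;
    if (flags & MAP_EXECUTABLE) vm_flags |= VM_EXECUTABLE;
    return vm_flags;
}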

Releasing a Linear Address Interval

The do_munmap( ) function deletes a linear address interval from the address space of the current process. The parameters are the starting address addr of the interval and its length len. The interval to be deleted does not usually correspond to a memory region; it may be included in one memory region or span two or more regions.

The function goes through two main phases. First, it scans the list of memory regions owned by the process and removes all regions that overlap the linear address interval. In the second phase, the function updates the process Page Tables and reinserts a downsized version of the memory regions that were removed during the first phase.

First phase: scanning the memory regions

The do_munmap( ) function executes the following steps:

  1. Performs some preliminary checks on the parameter values. If the linear address interval includes addresses greater than TASK_SIZE, if addr is not a multiple of 4,096, or if the linear address interval has a zero length, it returns the error code -EINVAL.

  2. Locates the first memory region that overlaps the linear address interval to be deleted:

        mpnt = find_vma_prev(current->mm, addr, &prev); 
        if (!mpnt || mpnt->vm_start >= addr + len) 
            return 0;
  3. If the linear address interval is located inside a memory region, its deletion splits the region into two smaller ones. In this case, do_munmap( ) checks whether current is allowed to obtain an additional memory region:

        if ((mpnt->vm_start < addr && mpnt->vm_end > addr + len) && 
             current->mm->map_count > MAX_MAP_COUNT) 
            return -ENOMEM;
  4. Attempts to get a new vm_area_struct descriptor. There may be no need for it, but the function makes the request anyway so that it can terminate right away if the allocation fails. This cautious approach simplifies the code since it allows an easy error exit.

  5. Builds up a list that includes all descriptors of the memory regions that overlap the linear address interval. This list is created by setting the vm_next field of the memory region descriptor (temporarily) so it points to the previous item in the list; this field thus acts as a backward link. As each region is added to this backward list, a local variable named free points to the last inserted element. The regions inserted in the list are also removed from the list of memory regions owned by the process and from the red-black tree (by means of the rb_erase( ) function):

        npp = (prev ? &prev->vm_next : &current->mm->mmap); 
        free = NULL; 
        spin_lock(&current->mm->page_table_lock);
        for ( ; mpnt && mpnt->vm_start < addr + len; mpnt = *npp) { 
            *npp = mpnt->vm_next; 
            mpnt->vm_next = free; 
            free = mpnt; 
            rb_erase(&mpnt->vm_rb, &current->mm->mm_rb);
        } 
        current->mm->mmap_cache = NULL;
        spin_unlock(&current->mm->page_table_lock);

Second phase: updating the Page Tables

A while loop is used to scan the list of memory regions built in the first phase, starting with the memory region descriptor that free points to.

In each iteration, the mpnt local variable points to the descriptor of a memory region in the list. The map_count field of the current->mm memory descriptor is decremented (since the region has been removed in the first phase from the list of regions owned by the process) and a check is made (by means of two question-mark conditional expressions) to determine whether the mpnt region must be eliminated or simply downsized:

current->mm->map_count--; 
st = addr < mpnt->vm_start ? mpnt->vm_start : addr;
end = addr+len;
end = end > mpnt->vm_end ? mpnt->vm_end : end;
size = end - st;

The st and end local variables delimit the linear address interval in the mpnt memory region that should be deleted; the size local variable specifies the length of the interval.

Next, do_munmap( ) releases the page frames allocated for the pages included in the interval from st to end:

zap_page_range(mm, st, size);

The zap_page_range( ) function deallocates the page frames included in the interval from st to end and updates the corresponding Page Table entries. The function invokes in nested fashion the zap_pmd_range( ) and zap_pte_range( ) functions for scanning the Page Tables; the latter function clears the Page Table entries and frees the corresponding page frames (or slot in a swap area; see Chapter 14). While doing this, zap_pte_range( ) also invalidates the TLB entries corresponding to the interval from st to end.

The last action performed in each iteration of the do_munmap( ) loop is to check whether a downsized version of the mpnt memory region must be reinserted in the list of regions of current:

extra = unmap_fixup(mm, mpnt, st, size, extra);

The unmap_fixup( ) function considers four possible cases:

  1. The memory region has been totally canceled. It returns the address of the previously allocated memory region object (see Step 4 in Section 8.3.5.1 earlier in this chapter), which can be released by invoking kmem_cache_free( ).

  2. Only the lower part of the memory region has been removed:

    (mpnt->vm_start < st) && (mpnt->vm_end == end)

    In this case, it updates the vm_end field of mpnt, invokes _ _insert_vm_struct( ) to insert the downsized region in the list of regions belonging to the process, and returns the address of the previously allocated memory region object.

  3. Only the upper part of the memory region has been removed:

    (mpnt->vm_start == st) && (mpnt->vm_end > end)

    In this case, it updates the vm_start field of mpnt, invokes _ _insert_vm_struct( ) to insert the downsized region in the list of regions belonging to the process, and returns the address of the previously allocated memory region object.

  4. The linear address interval is in the middle of the memory region:

    (mpnt->vm_start < st) && (mpnt->vm_end > end)

    It updates the vm_start and vm_end fields of mpnt and of the previously allocated extra memory region object so that they refer to the linear address intervals, respectively, from mpnt->vm_start to st and from end to mpnt->vm_end. Then it invokes _ _insert_vm_struct( ) twice to insert the two regions in the list of regions belonging to the process and in the red-black tree, and returns NULL, thus preserving the memory region object previously allocated.
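
For this middle-split case, the fixup amounts to the following sketch (split_region is a hypothetical name; the real unmap_fixup( ) also adjusts vm_pgoff and the file-mapping links):

/* Sketch of case 4: the interval [st, end) lies strictly inside mpnt,
 * so mpnt keeps the lower part and the preallocated extra descriptor
 * receives the upper part. */
static void split_region(struct mm_struct *mm, struct vm_area_struct *mpnt,
                         struct vm_area_struct *extra,
                         unsigned long st, unsigned long end)
{
    *extra = *mpnt;          /* inherit flags, protections, and methods */
    extra->vm_start = end;   /* upper fragment: [end, old vm_end) */
    mpnt->vm_end = st;       /* lower fragment: [vm_start, st) */
    __insert_vm_struct(mm, mpnt);
    __insert_vm_struct(mm, extra);
}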

This terminates the description of what must be done in a single iteration of the second-phase loop of do_munmap( ).

After handling all the memory region descriptors in the list built during the first phase, do_munmap( ) checks whether the extra memory region descriptor allocated in Step 4 has been used. If the address returned by unmap_fixup( ) is NULL, the descriptor has been used; otherwise, do_munmap( ) invokes kmem_cache_free( ) to release it. Finally, do_munmap( ) invokes the free_pgtables( ) function: it again scans the Page Table entries corresponding to the linear address interval just removed and reclaims the page frames that store unused Page Tables.



[58] Removing a linear address interval may theoretically fail because no free memory is available for a new memory descriptor.

[59] Up to Version 2.4.9, the Linux kernel used another type of balanced search tree called an AVL tree.

[60] You might consider this use of the Page size bit to be a dirty trick, since the bit was meant to indicate the real page size. But Linux can get away with the deception because the 80 × 86 chip checks the Page size bit in Page Directory entries, but not in Page Table entries.

[61] Actually, the def_flags field of the memory descriptor is modified only by the mlockall( ) system call, which can be used to set the VM_LOCKED flag, thus locking all future pages of the calling process in RAM.
