Chapter 13. Mmap and DMA

This chapter introduces the internals of Linux memory management and memory mapping. It also describes how ``Direct Memory Access'' (DMA) is used by device drivers. Although you might object that DMA belongs more to hardware handling than to the software interface, I feel it is more closely related to memory management than to hardware control.

This chapter is quite advanced; most driver writers won’t need to go so deep into the system internals. Nonetheless, understanding how memory works will help you design a driver that makes effective use of the system’s capabilities.

Memory Management in Linux

Rather than describing the theory of memory management in operating systems, this section tries to pinpoint the main features of the Linux implementation of that theory. This section is mainly informative; skipping it shouldn’t prevent you from understanding the later, more implementation-oriented topics.

Page Tables

When a program looks up a virtual address, the processor splits the address into bitfields. Each bitfield is used as an index into an array, called a page table, to retrieve either the address of the next table or the address of the physical page that holds the virtual address.
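
As an illustration, this is how the split might look on a two-level, 32-bit processor with 4-KB pages (the classic x86 arrangement); the variable names here are invented for the example:

    /* 10 bits of directory index, 10 bits of table index,
       12 bits of offset within the page */
    unsigned long dir_index   = address >> 22;            /* top 10 bits */
    unsigned long table_index = (address >> 12) & 0x3ff;  /* middle 10 bits */
    unsigned long page_offset = address & 0xfff;          /* low 12 bits */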

The Linux kernel manages three levels of page tables in order to map virtual addresses to physical addresses. This might appear strange at first. As most PC programmers are aware, the x86 hardware implements only two levels of page table. In fact, most 32-bit processors supported by Linux implement two levels, but the kernel implements three anyway.

The use of three levels in a processor-independent implementation allows Linux to support both two-level and three-level processors (such as the Alpha) without cluttering the code with a lot of #ifdef statements. This kind of ``conservative coding'' doesn’t lead to additional overhead when the kernel runs on two-level processors, because the compiler actually optimizes out the unused level.
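
For instance, the x86 headers fold the middle level away with an inline function much like the following sketch; since the function is inline and does nothing but cast its argument, no code at all is generated for the extra level:

    extern inline pmd_t * pmd_offset(pgd_t * dir, unsigned long address)
    {
        return (pmd_t *) dir;    /* the pmd is folded over the pgd entry */
    }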

But let’s look for a moment at the data structures used to implement paging. To follow the discussion, you should remember that most data items used for memory management are kept internally as unsigned long, because they represent addresses that are not meant to be dereferenced.

The list below summarizes the implementation of the three levels in Linux, and Figure 13-1 depicts them:

Figure 13-1. The three levels of Linux page tables

  • A ``PaGe Directory'' (PGD) is the top-level page table. The PGD is an array of pgd_t items, each of which points to a second-level page table. Each process has its own page directory. You can think of the page directory as a page-aligned array of pgd_ts.

  • The second-level table is called a ``Page Mid-level Directory,'' or PMD. The PMD is a page-aligned array of pmd_t items. A pmd_t is a pointer to the third-level page table. Two-level processors, such as the x86 and the Sparc4c, have no physical PMD; they declare their PMD as an array with a single element, whose value is the pgd entry itself--we’ll see in a while how this is handled in C and how the compiler optimizes this level away.

  • What comes next is called simply a ``Page Table.'' Once again, it is a page-aligned array of items, each of which is called a ``Page Table Entry.'' The kernel uses the pte_t type for the items. A pte_t contains the physical address of the data page.

The types introduced in this list are defined in <asm/page.h>, which must be included by every source file that plays with paging.

The kernel doesn’t need to worry about doing page-table lookups during normal program execution, because they are done in hardware. Nonetheless, the kernel must arrange things so the hardware can do its work. It must build the page tables and look them up whenever the processor reports a page fault; that is, whenever a virtual address needed by the processor is not present in memory.

The following symbols are used to access the page tables. Both <asm/page.h> and <asm/pgtable.h> must be included for all of them to be accessible.

PTRS_PER_PGD , PTRS_PER_PMD , PTRS_PER_PTE

The number of entries in each of the tables. Two-level processors set PTRS_PER_PMD to 1, to avoid dealing with the middle level.

unsigned long pgd_val(pgd_t pgd) , unsigned long pmd_val(pmd_t pmd) , unsigned long pte_val(pte_t pte)

These three macros are used to retrieve the unsigned long value from the typed data item. The macros help in using strict data typing in source code without introducing computational overhead.

pgd_t * pgd_offset(struct mm_struct * mm, unsigned long address) , pmd_t * pmd_offset(pgd_t * dir, unsigned long address) , pte_t * pte_offset(pmd_t * dir, unsigned long address)

These inline functions[31] are used to retrieve the pgd, pmd, and pte entries associated with address. Page-table lookup begins with a pointer to struct mm_struct. The pointer associated with the memory map of the current process is current->mm. The pointer to kernel space is described by &init_mm, which isn’t exported to modules because they don’t need it. Two-level processors define pmd_offset(dir,add) as (pmd_t *)dir, thus folding the pmd over the pgd. Functions that scan page tables are always declared as inline, and the compiler optimizes out any pmd lookup.

unsigned long pte_page(pte_t pte)

This function extracts the address of a physical page from its page-table entry. Using pte_val(pte) wouldn’t work, because microprocessors use the low bits of the pte to store additional information about the page. The bits are not part of the actual address, and pte_page is needed to extract the real address from the page table.

pte_present(pte_t pte)

This macro returns a boolean value that indicates whether the data page is currently in memory. This is the most used of several functions that access the low bits in the pte--the bits that are discarded by pte_page. It’s interesting to note that while physical pages can be present or not, page tables are always present (in the current Linux implementation). This simplifies the kernel code because pgd_offset and friends never fail; on the other hand, even a process with a ``resident set size'' of zero keeps its page tables in real RAM. The sketch following this list shows how the functions just described combine in a complete lookup.
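
Putting the pieces together, here is a sketch (the function name is made up) that walks the page tables of the current process to find the page holding a given virtual address. It follows the 2.0 interface just described; since page tables are always present, no intermediate checks are needed:

    #include <linux/sched.h>    /* current */
    #include <asm/page.h>
    #include <asm/pgtable.h>

    unsigned long sketch_lookup(unsigned long address)
    {
        pgd_t *pgd = pgd_offset(current->mm, address);
        pmd_t *pmd = pmd_offset(pgd, address);  /* a no-op on two-level CPUs */
        pte_t *pte = pte_offset(pmd, address);

        if (!pte_present(*pte))
            return 0;            /* the data page is not in memory */
        /* pte_page discards the low status bits and yields the page address */
        return pte_page(*pte) + (address & ~PAGE_MASK);
    }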

Just seeing the list of these functions is not enough for you to be proficient in the Linux memory management algorithms; real memory management is much more complex and must deal with other complications, like cache coherence. The list above should nonetheless be sufficient to give you a feel for how page management is implemented; you can get more information from the include/asm and mm subtrees of the kernel source.

Virtual Memory Areas

While paging sits at the lowest level of memory management, something more is necessary before you can use the computer’s resources efficiently. The kernel needs a higher-level mechanism to handle the way a process sees its memory. This mechanism is implemented in Linux by means of ``Virtual Memory Areas,'' which I’ll refer to as ``areas'' or ``VMAs.''

An area is a homogeneous region in the virtual memory of a process, a contiguous range of addresses featuring the same permission flags. It corresponds loosely to the concept of a ``segment,'' although it is better described as ``a memory object with its own properties.'' The memory map of a process is made up of: an area for program code (text); one each for data, BSS (uninitialized data),[32] and the stack; and one for each active memory mapping. The memory areas of a process can be seen by looking in /proc/pid/maps. /proc/self is a special case of /proc/pid, as it always refers to the current process. As an example, here are three different memory maps, to which I added short comments after a sharp sign:

morgana.root# head /proc/1/maps /proc/self/maps
==> /proc/1/maps <==                         #### "init" is a.out on my x86
00000000-00003000 r-xp 00000400 03:01 30818  # hda1:/bin/init--text
00003000-00004000 rwxp 00003400 03:01 30818  # hda1:/bin/init--data
00004000-0000c000 rwxp 00000000 00:00 0      # zero-mapped bss
5ffff000-6009a000 rwxp 00000000 03:01 26621  # hda1:/lib/libc.so.4.7.2
6009a000-600c9000 rwxp 00000000 00:00 0
bfffd000-c0000000 rwxp ffffe000 00:00 0      # zero-mapped stack

==> /proc/self/maps <==                      ####   "head" is ELF on my x86
08000000-08002000 r-xp 00000000 03:01 16778  # hda1:/bin/head--text
08002000-08003000 rw-p 00001000 03:01 16778  # hda1:/bin/head--data
08003000-0800a000 rwxp 00000000 00:00 0      # zero-mapped bss
40000000-40005000 r-xp 00000000 03:01 26863  # /lib/ld-linux.so.1.7.3--text
40005000-40006000 rw-p 00004000 03:01 26863  # /lib/ld-linux.so.1.7.3--data
40006000-40007000 rw-p 00000000 00:00 0
40008000-40080000 r-xp 00000000 03:01 27025  # /lib/libc.so.5.0.9--text
40080000-40085000 rw-p 00077000 03:01 27025  # /lib/libc.so.5.0.9--data
40085000-400b8000 rw-p 00000000 00:00 0
bfffe000-c0000000 rwxp fffff000 00:00 0      # zero-mapped stack

morgana.root# rsh wolf head /proc/self/maps  #### alpha-axp: static ecoff
000000011fffe000-0000000120000000 rwxp 0000000000000000 00:00 0     # stack
0000000120000000-0000000120014000 r-xp 0000000000000000 08:03 2844  # text
0000000140000000-0000000140002000 rwxp 0000000000014000 08:03 2844  # data
0000000140002000-0000000140008000 rwxp 0000000000000000 00:00 0     # bss

The fields in each line are:

               start-end perm offset major:minor inode.

perm represents a bit mask including the read, write, and execute permissions; it represents what the process is allowed to do with pages belonging to the area. The last character in the field is either p for ``private'' or s for ``shared.''

Each field in /proc/*/maps corresponds to a field in struct vm_area_struct, and is described in the list below.

A driver that implements the mmap method needs to fill a VMA structure in the address space of the process mapping the device. The driver writer should therefore have at least a minimal understanding of VMAs in order to use them.

Let’s look at the most important fields in struct vm_area_struct (defined in <linux/mm.h>). These fields may be used by device drivers in their mmap implementation; a small example follows the list. Note that the kernel maintains lists and trees of VMAs to optimize area lookup, and several fields of vm_area_struct are used to maintain this organization. A driver therefore can’t create VMAs at will, or the structures will break. The main fields of VMAs are the following:

unsigned long vm_start; , unsigned long vm_end;

A VMA describes virtual addresses between vma->vm_start and vma->vm_end. These fields are the first two fields shown in /proc/*/maps.

struct inode * vm_inode;

If the area is associated with an inode (such as a disk file or a device node), this field is a pointer to the inode. Otherwise, it is NULL.

unsigned long vm_offset;

The offset of the area in the inode. When a file or device is mapped, this is the file position of the first byte mapped in this area (the offset argument that was passed to the mmap system call).

struct vm_operations_struct *vm_ops;

The presence of vma->vm_ops indicates that the memory area is a kernel ``object,'' like the struct file we have been using throughout the book. The area declares the ``methods'' that act on its contents, and this field points to the list of methods.
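
As an example of how a driver fills these fields, here is a bare-bones mmap method for the 2.0 interface (the function name is made up). It uses remap_page_range, the kernel function that builds page tables for a range of physical addresses, treating vm_offset as the physical address of the device memory being mapped:

    static int sketch_mmap(struct inode *inode, struct file *filp,
                           struct vm_area_struct *vma)
    {
        unsigned long size = vma->vm_end - vma->vm_start;

        if (vma->vm_offset & (PAGE_SIZE - 1))
            return -ENXIO;       /* the offset must be page-aligned */

        /* build the page tables for the whole range at mmap time */
        if (remap_page_range(vma->vm_start, vma->vm_offset,
                             size, vma->vm_page_prot))
            return -EAGAIN;

        vma->vm_inode = inode;   /* tie the area to the device node */
        inode->i_count++;        /* the mapping counts as a reference */
        return 0;
    }

Because the whole range is mapped up front, no nopage method is needed in this sketch.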

Like struct vm_area_struct, the vm_operations_struct is defined in <linux/mm.h>; it includes the operations listed below, in the order they are declared. These operations are the only ones needed to handle the process’s memory needs. The prototypes shown are for 2.0; the minor changes from 1.2.13 are described in each entry. Later in this chapter, some of these functions will be implemented and described more completely; a skeleton declaration of the whole structure appears after this list.

void (*open)(struct vm_area_struct *vma);

After the kernel creates a VMA, it opens it. When an area is copied--when fork duplicates the areas of an existing process into the new one, for example--the child inherits the operations from the parent, and the new area is opened via vm_ops->open. Whenever mmap executes, on the other hand, the area is created before file->f_op->mmap is called, and no vm_ops->open gets invoked.

void (*close)(struct vm_area_struct *vma);

When an area is destroyed, the kernel calls its close operation. Note that there’s no usage count associated with VMAs; the area is opened and closed only once.

void (*unmap)(struct vm_area_struct *vma, unsigned long addr, size_t len);

The kernel calls this method to ``unmap'' part or all of an area. If the entire area is unmapped, then the kernel calls vm_ops->close as soon as vm_ops->unmap returns.

void (*protect)(struct vm_area_struct *vma, unsigned long, size_t, unsigned int newprot);

Currently not used. The handling of permission (protection) bits doesn’t depend on the area itself.

int (*sync)(struct vm_area_struct *vma, unsigned long, size_t, unsigned int flags);

This method is called by the msync system call to save a dirty memory region to the storage medium. The return value is expected to be 0 to indicate success and negative if there was an error. Kernel version 1.2 used void as the return value for this method because the function was not expected to fail.

void (*advise)(struct vm_area_struct *vma, unsigned long, size_t, unsigned int advise);

Currently not used.

unsigned long (*nopage)(struct vm_area_struct *vma, unsigned long address, int write_access);

When a process tries to access a page that belongs to a valid VMA, but that is currently not in memory, the nopage method is called if it is defined for the related area. The method returns the (physical) address of the page. If the method isn’t defined for the area, an empty page is allocated by the kernel. It’s unusual for drivers to implement nopage, because regions mapped by a driver are usually mapped completely in the system’s physical address space up front. Version 1.2 of the kernel features a different prototype for nopage and different semantics as well: the third argument, write_access, counts as ``no-share''--a non-zero value means the page must be owned by the current process, while zero means that sharing is possible.

unsigned long (*wppage)(struct vm_area_struct *vma, unsigned long address, unsigned long page);

This method handles ``write-protected'' page faults but is currently unused. The kernel handles any attempts to write over a protected page without invoking the area-specific callback. Write-protect faults are used to implement copy-on-write. A private page can be shared across processes until one process writes to it. When that happens, the page is cloned, and the process writes on its own copy of the page. If the whole area is marked as read-only, a SIGSEGV is sent to the process, and the copy-on-write is not performed.

int (*swapout)(struct vm_area_struct *vma, unsigned long offset, pte_t *page_table);

This method is called when a page is selected for swap-out. The offset argument is the file offset: virt_address - vma->vm_start + vma->vm_offset. A return value of 0 signals success; any other value signals an error. In case of error, the process owning the page is sent a SIGBUS. In version 1.2, the function returned void because it was never expected to fail. The method usually writes useful information to *page_table, so it can be retrieved at swap-in. Such information can be, for example, the offset in the swap device.

pte_t (*swapin)(struct vm_area_struct *, unsigned long offset, unsigned long entry);

This method is used to retrieve a page from swap space. The offset argument is relative to the area (as above for swapout), while entry is the current pte for the page--if swapout saved some information in the entry, that information can now be used to retrieve the page.

It’s unlikely a driver will ever need to implement swapout or swapin, because drivers usually map pages of I/O memory, not regular memory. I/O pages are ranges of physical addresses that are accessed like memory but belong to the device hardware rather than to RAM. I/O memory regions are either marked as ``reserved'' or live above the top of physical memory, so they never get swapped out--swapping I/O memory wouldn’t make much sense anyway.
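
To tie the list together, here is how a driver might declare its operations under 2.0, providing only open and close and leaving everything else to the kernel defaults; all the names are invented for the example:

    static void sketch_vma_open(struct vm_area_struct *vma)
    {
        MOD_INC_USE_COUNT;       /* e.g., pin the module while maps exist */
    }

    static void sketch_vma_close(struct vm_area_struct *vma)
    {
        MOD_DEC_USE_COUNT;
    }

    static struct vm_operations_struct sketch_vm_ops = {
        sketch_vma_open,    /* open */
        sketch_vma_close,   /* close */
        NULL,               /* unmap */
        NULL,               /* protect */
        NULL,               /* sync */
        NULL,               /* advise */
        NULL,               /* nopage (NULL maps a zero page) */
        NULL,               /* wppage */
        NULL,               /* swapout */
        NULL,               /* swapin */
    };

The driver’s mmap method would then assign vma->vm_ops = &sketch_vm_ops and, since vm_ops->open is not invoked automatically at mmap time, call sketch_vma_open(vma) itself.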

The Memory Map

There is a third data structure related to memory management in Linux. While VMAs and page tables organize the virtual address space, the physical address space is summarized in the memory map.

The kernel needs a description of the current usage of physical memory. Since memory can be considered as an array of pages, the information is organized in an array. If you need information about a page, you just use its physical address to access the memory map. Here are the symbols used by the kernel code to access the memory map:

typedef struct { /* ... */ } mem_map_t; , extern mem_map_t mem_map[];

The map itself is an array of mem_map_ts. Each physical page in the system, including kernel code and kernel data, has an entry in mem_map.

PAGE_OFFSET

This macro represents the virtual address in the kernel address space to which physical addresses are mapped. PAGE_OFFSET must be considered whenever ``physical'' addresses are used. What the kernel considers to be a physical address is actually a virtual address, offset by PAGE_OFFSET from the real physical address--the one that is used in the electrical address lines outside the CPU. Through Linux 2.0.x, PAGE_OFFSET was zero for the PC and non-zero for most other platforms. Version 2.1.0 changed the PC implementation so it now uses offset-mapping as well. Mapping physical space to high virtual addresses has sound advantages as far as kernel code is concerned, but the topic is beyond the scope of this book.

int MAP_NR(addr);

Whenever a program needs to access the memory map, MAP_NR returns the index in the mem_map array associated with addr. The addr argument can either be an unsigned long or a pointer. Since the macro is used several times by critical memory management functions, it performs no validity checking on addr; the calling code must make its own checks when needed.

((nr << PAGE_SHIFT) + PAGE_OFFSET)

No standardized function or macro exists to translate a map number into a physical address. If you ever need the inverse function of MAP_NR, this expression will work.
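
For instance, the following round trip goes from an address you already hold (buffer here is just a placeholder) to its mem_map entry and back to the page-aligned address:

    unsigned long addr = (unsigned long) buffer;   /* any valid address */
    int nr = MAP_NR(addr);                         /* index into mem_map[] */
    mem_map_t *info = &mem_map[nr];                /* per-page information */
    unsigned long page =                           /* the page address, rebuilt */
        ((unsigned long) nr << PAGE_SHIFT) + PAGE_OFFSET;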

The memory map is used to maintain some low-level information about each memory page. The exact definition of the memory map structure changed several times during kernel development; you don’t need to understand the details because drivers aren’t expected to look inside the map.

If, however, you are interested in looking at the internals of page management, the header <linux/mm.h> includes a long comment that explains the meaning of the fields in mem_map_t.



[31] As a matter of fact, on the Sparc the functions are not inline, but rather real extern functions, which are not exported to modularized code. Therefore you won’t be able to use these functions in a module running on the Sparc, but you won’t usually need to.

[32] The name BSS is an historical relic, from an old assembly operator meaning ``Block Started by Symbol.'' The BSS segment of executable files isn’t stored on disk, and the kernel maps the zero-page to the BSS address range.
