This chapter introduces the internals of Linux memory management and memory mapping. It also describes how ``Direct Memory Access'' (DMA) is used by device drivers. Although you might object that DMA belongs more to hardware handling than to the software interface, I feel it is more closely related to memory management than to hardware control.
This chapter is quite advanced; most driver writers won’t need to go so deep into the system internals. Nonetheless, understanding how memory works will help you design a driver that makes effective use of the system’s capabilities.
Rather than describing the theory of memory management in operating systems, this section tries to pinpoint the main features of the Linux implementation of the theory. This section is mainly informative and skipping over it shouldn’t prevent you from understanding the later topics that are more implementation-oriented.
When a program looks up a virtual address, the processor splits the address into bitfields. Each bitfield is used as an index into an array, called a page table, to retrieve either the address of the next table or the address of the physical page that holds the virtual address.
The Linux kernel manages three levels of page tables in order to map virtual addresses to physical addresses. This might appear strange at first, since most PC programmers are aware that the x86 hardware implements only two levels of page tables. In fact, most 32-bit processors supported by Linux implement two levels, but the kernel implements three anyway.
The use of three levels in a processor-independent implementation allows Linux to support both two-level and three-level processors (such as the Alpha) without cluttering the code with a lot of #ifdef statements. This kind of ``conservative coding'' doesn’t lead to additional overhead when the kernel runs on two-level processors, because the compiler actually optimizes out the unused level.
But let’s look for a moment at the data structures used to implement paging. To follow the discussion, you should remember that most data items used for memory management are kept internally as unsigned long, because they represent addresses that are not meant to be dereferenced.
The list below summarizes the implementation of the three levels in Linux, and Figure 13.1 depicts them:
A ``Page Directory'' (PGD) is the top-level page table. The PGD is an array of pgd_t items, each of which points to a second-level page table. Each process has its own page directory, which you can think of as a page-aligned array of pgd_t items.
The second-level table is called a ``Page Mid-level
Directory,'' or PMD. The PMD is a page-aligned array of
pmd_t
items. A pmd_t
is a pointer to
the third-level page table. Two-level processors, such as the
x86 and the Sparc-4c, have no physical PMD; they declare their
PMD as an array with a single element, whose value is the PMD
itself--we’ll see in a while how this is handled in C and how
the compiler optimizes this level away.
What comes next is called simply a ``Page Table.''
Once again, it is a page-aligned array of items, each of which
is called a ``Page Table Entry.'' The kernel uses the
pte_t
type for the items. A pte_t
contains the physical address of the data page.
The types introduced in this list are defined in <asm/page.h>, which must be included by every source file that plays with paging.
The kernel doesn’t need to worry about doing page-table lookups during normal program execution, because they are done in hardware. Nonetheless, the kernel must arrange things so the hardware can do its work. It must build the page tables and look them up whenever the processor reports a page fault; that is, whenever a virtual address needed by the processor is not present in memory.
The following symbols are used to access the page
tables. Both <asm/page.h>
and
<asm/pgtable.h>
must be included for all of them to
be accessible.
PTRS_PER_PGD, PTRS_PER_PMD, PTRS_PER_PTE
The size of each table. Two-level processors set
PTRS_PER_PMD
to 1, to avoid dealing with the
middle level.
unsigned long pgd_val(pgd_t pgd), unsigned long pmd_val(pmd_t pmd), unsigned long pte_val(pte_t pte)
These three macros are used to retrieve the unsigned long
value from the typed data item. The
macros help in using strict data typing in source code without
introducing computational overhead.
pgd_t * pgd_offset(struct mm_struct * mm, unsigned long address), pmd_t * pmd_offset(pgd_t * dir, unsigned long address), pte_t * pte_offset(pmd_t * dir, unsigned long address)
These inline functions[31] are used to retrieve the pgd, pmd, and pte entries associated with address. Page-table lookup begins with a pointer to struct mm_struct. The pointer associated with the memory map of the current process is current->mm, while the pointer to kernel space is described by &init_mm, which isn’t exported to modules because they don’t need it. Two-level processors define pmd_offset(dir,add) as (pmd_t *)dir, thus folding the pmd over the pgd. Functions that scan page tables are always declared as inline, and the compiler optimizes out any pmd lookup.
unsigned long pte_page(pte_t pte)
This function extracts the address of a physical page
from its page-table entry. Using pte_val(pte)
wouldn’t
work, because microprocessors use the low bits of the pte
to store additional information about the page. The bits are
not part of the actual address, and pte_page is needed
to extract the real address from the page table.
pte_present(pte_t pte)

This macro returns a boolean value that indicates whether the data page is currently in memory. It is the most commonly used of several functions that access the low bits in the pte--the bits that are discarded by pte_page.
It’s interesting to note that while physical pages can be
present or not, page tables are always present (in the
current Linux implementation). This simplifies the kernel code
because pgd_offset and friends never fail; on the other
hand, even a process with a ``resident storage size'' of zero keeps
its page tables in real RAM.
Just seeing the list of these functions is not enough for you to be
proficient in the Linux memory management algorithms; real memory management is
much more complex and must deal with other complications,
like cache coherence. The list above should nonetheless be sufficient to
give you a feel for how page management is implemented; you can get more
information from the include/asm
and mm
subtrees of the
kernel source.
While paging sits at the lowest level of memory management, something more is necessary before you can use the computer’s resources efficiently. The kernel needs a higher-level mechanism to handle the way a process sees its memory. This mechanism is implemented in Linux by means of ``Virtual Memory Areas,'' which I’ll refer to as ``areas'' or ``VMAs.''
An area is a homogeneous region in the virtual memory of a
process, a contiguous range of addresses featuring the same permission
flags. It corresponds loosely to the concept of a ``segment,''
although it is better described as ``a memory object with its own properties.''
The memory map of a process is made up of: an area for program code (text);
one each for data, BSS (uninitialized data),[32] and the stack; and one for each active memory
mapping. The memory areas of a process can be seen by looking in /proc/pid/maps. /proc/self is a special case of /proc/pid, as it always refers to the current process. As an example, here are three different memory maps, to which I added short comments after a sharp sign:
morgana.root# head /proc/1/maps /proc/self/maps
==> /proc/1/maps <==                         #### "init" is a.out on my x86
00000000-00003000 r-xp 00000400 03:01 30818  # hda1:/bin/init--text
00003000-00004000 rwxp 00003400 03:01 30818  # hda1:/bin/init--data
00004000-0000c000 rwxp 00000000 00:00 0      # zero-mapped bss
5ffff000-6009a000 rwxp 00000000 03:01 26621  # hda1:/lib/libc.so.4.7.2
6009a000-600c9000 rwxp 00000000 00:00 0
bfffd000-c0000000 rwxp ffffe000 00:00 0      # zero-mapped stack

==> /proc/self/maps <==                      #### "head" is ELF on my x86
08000000-08002000 r-xp 00000000 03:01 16778  # hda1:/bin/head--text
08002000-08003000 rw-p 00001000 03:01 16778  # hda1:/bin/head--data
08003000-0800a000 rwxp 00000000 00:00 0      # zero-mapped bss
40000000-40005000 r-xp 00000000 03:01 26863  # /lib/ld-linux.so.1.7.3--text
40005000-40006000 rw-p 00004000 03:01 26863  # /lib/ld-linux.so.1.7.3--data
40006000-40007000 rw-p 00000000 00:00 0
40008000-40080000 r-xp 00000000 03:01 27025  # /lib/libc.so.5.0.9--text
40080000-40085000 rw-p 00077000 03:01 27025  # /lib/libc.so.5.0.9--data
40085000-400b8000 rw-p 00000000 00:00 0
bfffe000-c0000000 rwxp fffff000 00:00 0      # zero-mapped stack

morgana.root# rsh wolf head /proc/self/maps  #### alpha-axp: static ecoff
000000011fffe000-0000000120000000 rwxp 0000000000000000 00:00 0     # stack
0000000120000000-0000000120014000 r-xp 0000000000000000 08:03 2844  # text
0000000140000000-0000000140002000 rwxp 0000000000014000 08:03 2844  # data
0000000140002000-0000000140008000 rwxp 0000000000000000 00:00 0     # bss
The fields in each line are:
start-end perm offset major:minor inode.
perm
represents a bit mask including the
read, write, and execute permissions; it represents what the process
is allowed to do with pages belonging to the area. The last character
in the field is either p
for ``private'' or s
for ``shared.''
Each field in /proc/*/maps corresponds to a field in struct vm_area_struct, and is described in the list below.
A driver that implements the mmap method needs to fill a VMA structure in the address space of the process mapping the device. The driver writer should therefore have at least a minimal understanding of VMAs in order to use them.
Let’s look at the most important fields in struct vm_area_struct (defined in <linux/mm.h>). These fields may be used by device drivers in their mmap implementation. Note that the kernel maintains lists and trees of VMAs to optimize area lookup, and several fields of vm_area_struct are used to maintain this organization. VMAs can’t be created at will by a driver, or the structures will break.
The main fields of VMAs are the following:
unsigned long vm_start;
unsigned long vm_end;

A VMA describes virtual addresses between vma->vm_start and vma->vm_end. These fields are the first two fields shown in /proc/*/maps.
struct inode * vm_inode;
If the area is associated with an inode (such as a disk file or a device node), this field is a pointer to the inode. Otherwise, it is NULL.
unsigned long vm_offset;
The offset of the area in the inode. When a file or device is mapped, this is the file position (filp->f_pos) of the first byte mapped in this area.
struct vm_operations_struct *vm_ops;
vma->vm_ops
indicates that the memory area is a
kernel ``object'' like the struct file
we have been
using throughout the book. The area declares the
``methods'' to act on its contents, and this field is used
to list the methods.
Like struct vm_area_struct, the vm_operations_struct is defined in <linux/mm.h>; it includes the operations listed below. These operations are the only ones needed to handle the process’s memory needs, and they are listed in the order they are declared. The prototypes shown are for 2.0; the minor changes from 1.2.13 are described in each entry. Later in this chapter, some of these functions will be implemented, and they will be described more completely at that point.
void (*open)(struct vm_area_struct *vma);

After the kernel creates a VMA, it opens it. When an area is copied, the child inherits its operations from the parent, and the new area is opened via vm_ops->open. When fork copies the areas of the existing process to the new one, for example, vm_ops->open is called to open all the maps. Whenever mmap executes, on the other hand, the area is created before file->f_ops->mmap is called, and no vm_ops->open gets invoked.
void (*close)(struct vm_area_struct *vma);
When an area is destroyed, the kernel calls its close operation. Note that there’s no usage count associated with VMAs; the area is opened and closed only once.
void (*unmap)(struct vm_area_struct *vma, unsigned long addr, size_t len);

The kernel calls this method to ``unmap'' part or all of an area. If the entire area is unmapped, the kernel calls vm_ops->close as soon as vm_ops->unmap returns.
void (*protect)(struct vm_area_struct *vma, unsigned long, size_t, unsigned int newprot);
Currently not used. The handling of permission (protection) bits doesn’t depend on the area itself.
int (*sync)(struct vm_area_struct *vma, unsigned long, size_t, unsigned int flags);
This method is called by the msync system call to
save a dirty memory region to the storage medium.
The return value is expected to
be 0 to indicate success and negative if there was an error.
Kernel version 1.2 used void
as the return value for this method because the function was not
expected to fail.
void (*advise)(struct vm_area_struct *vma, unsigned long, size_t, unsigned int advise);

This method is currently unused, like protect.
unsigned long (*nopage)(struct vm_area_struct *vma, unsigned long address, int write_access);
When a process tries to access a page that belongs to a
valid VMA, but that is currently not in memory, the nopage
method is called if it is defined for the related area.
The method returns the (physical) address of
the page. If the method isn’t defined for the area, an
empty page is allocated by the kernel.
It’s unusual for drivers to implement nopage,
because regions mapped by a driver are usually completely
mapped in the system’s physical addresses. Version 1.2 of the
kernel features a different prototype for nopage and
a different semantics as well. The third argument, write_access, counts as ``no-share''--a non-zero value means the page must be owned by the current process, while zero means that sharing is possible.
unsigned long (*wppage)(struct vm_area_struct *vma, unsigned long address, unsigned long page);
This method handles ``write-protected'' page faults but
is currently unused. The kernel handles any attempts to write
over a protected page without invoking the area-specific
callback. Write-protect faults are used to implement
copy-on-write. A private page can be shared across processes
until one process writes to it. When that happens, the page is
cloned, and the process writes on its own copy of the page. If
the whole area is marked as read-only, a SIGSEGV
is sent to the process, and the copy-on-write is not
performed.
int (*swapout)(struct vm_area_struct *vma, unsigned long offset, pte_t *page_table);
This method is called when a page is selected for swap-out. The offset argument is the file offset: virt_address - vma->vm_start + vma->vm_offset. A return value of 0 signals success; any other value signals an error. In case of error, the process owning the page is sent a SIGBUS. In version 1.2, the function returned void because it was never expected to fail. The method usually writes useful information to *page_table, so it can be retrieved at swap-in; such information can be, for example, the offset in the swap device.
pte_t (*swapin)(struct vm_area_struct *, unsigned long offset, unsigned long entry);
This method is used to retrieve a page from swap space. The offset argument is relative to the area (as above for swapout), while entry is the current pte for the page--if swapout saved some information in the entry, that information can now be used to retrieve the page.
It’s unlikely a driver will ever need to implement swapout or swapin, because drivers usually map pages of I/O memory, not regular memory. I/O pages are physical addresses that are accessed like memory but map to the device hardware instead of to RAM. I/O memory regions are either marked as ``reserved'' or live above the top of physical memory, so they never get swapped out--swapping I/O memory wouldn’t make much sense anyway.
There is a third data structure related to memory management in Linux. While VMAs and page tables organize the virtual address space, the physical address space is summarized in the memory map.
The kernel needs a description of the current usage of physical memory. Since memory can be considered as an array of pages, the information is organized in an array. If you need information about a page, you just use its physical address to access the memory map. Here are the symbols used by the kernel code to access the memory map:
typedef struct { /* ... */ } mem_map_t;
extern mem_map_t mem_map[];

The map itself is an array of mem_map_t items. Each physical page in the system, including kernel code and kernel data, has an entry in mem_map.
PAGE_OFFSET
This macro represents the virtual address in the kernel
address space to which physical addresses are mapped.
PAGE_OFFSET
must be considered whenever ``physical''
addresses are used. What the kernel considers to be a physical
address is actually a virtual address, offset by
PAGE_OFFSET
from the real physical address--the one
that is used in the electrical address lines outside the CPU.
Through
Linux 2.0.x, PAGE_OFFSET
was zero for the
PC and non-zero for most other platforms. Version
2.1.0 changed the PC implementation so it now uses offset-mapping
as well. Mapping physical space to high virtual addresses has sound
advantages as far as kernel code is concerned, but the topic
is beyond the scope of this book.
int MAP_NR(addr);
Whenever a program needs to access the memory map, MAP_NR returns the index in the mem_map array associated with addr. The addr argument can be either an unsigned long or a pointer. Since the macro is used several times by critical memory management functions, it performs no validity checking on addr; the calling code must make its own checks when needed.
((nr << PAGE_SHIFT) + PAGE_OFFSET)
No standardized function or macro exists to translate a map number into a physical address. If you ever need the inverse function of MAP_NR, this clause will work.
The memory map is used to maintain some low-level information about each memory page. The exact definition of the memory map structure changed several times during kernel development; you don’t need to understand the details because drivers aren’t expected to look inside the map.
If, however, you are interested in looking at the internals of page management, the header <linux/mm.h> includes a long comment that explains the meaning of the fields in mem_map_t.
[31] As a matter of fact, on the Sparc the functions are not inline, but rather real extern functions, which are not exported to modularized code. Therefore you won’t be able to use these functions in a module running on the Sparc, but you won’t usually need to.
[32] The name BSS is an historical relic, from an old assembly operator meaning ``Block Started by Symbol.'' The BSS segment of executable files isn’t stored on disk, and the kernel maps the zero-page to the BSS address range.