5.7. Global Page Management

Pages are the fundamental unit of physical memory in the Solaris memory management subsystem. In this section, we discuss how pages are structured, how they are located, and how free lists manage pools of pages within the system.

5.7.1. Pages—The Basic Unit of Solaris Memory

Physical memory is divided into pages. Every active (not free) page in the Solaris kernel is a mapping between a file (vnode) and memory; the page can be identified by a vnode pointer and its page-aligned offset within that vnode. A page's identity is its vnode/offset pair, which names the page's backing store: the file, and the offset within that file, that the page is mapping.

The hardware address translation (HAT) and address space layers manage the mapping between a physical page and its virtual address space (more about that in “The Hardware Address Translation Layer”). The key property of the vnode/offset pair is reusability; that is, we can reuse each physical page for another task by simply synchronizing its contents in RAM with its backing store (the vnode and offset) before the page is reused.

For example, we can reuse a page of heap memory from a process by simply copying the contents to its vnode and offset, which in this case will copy the contents to the swap device. The same mechanism is used for caching files, and we simply use the vnode/offset pair to reference the file that the page is caching. If we were to reuse a page of memory that was caching a regular file, then we simply synchronize the page with its backing store (if the page has been modified) or just reuse the page if it is not modified and does not need resyncing with its backing store.

Figure 5.17. The Page Structure

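In code, the identity and list linkage just described look roughly like the sketch below. This is a simplified sketch only: the real struct page in <vm/page.h> contains many more fields (locks, flags, and mapping state), and the exact layout varies between releases.

/* Simplified sketch of struct page: identity and linkage fields only. */
typedef struct page {
        u_offset_t      p_offset;       /* identity: offset within the vnode */
        struct vnode    *p_vnode;       /* identity: vnode backing this page */
        struct page     *p_hash;        /* next page on the page_hash chain */
        struct page     *p_vpnext;      /* next page of this vnode */
        struct page     *p_vpprev;      /* previous page of this vnode */
        struct page     *p_next;        /* next page on the free/cache list */
        struct page     *p_prev;        /* previous page on the free/cache list */
} page_t;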

5.7.2. The Page Hash List

The VM system hashes pages with identity (a valid vnode/offset pair) onto a global hash list so that they can be located by vnode and offset. Three page functions search the global page hash list: page_find(), page_lookup(), and page_lookup_nowait(). These functions take a vnode and offset as arguments and return a pointer to a page structure if found.

The global hash list is an array of pointers to linked lists of pages. The functions use a hash to index into the page_hash array to locate the list of pages that contains the page with the matching vnode/offset pair. Figure 5.18 shows how the page_find() function indexes into the page_hash array to locate a page matching a given vnode/offset.

Figure 5.18. Locating Pages by Their Vnode/Offset Identity


page_find() locates a page as follows:

  1. It calculates the slot in the page_hash array containing a list of potential pages by using the PAGE_HASH_FUNC macro, shown below.

    Example. Header File <vm/page.h>
    #define PAGE_HASHSZ     page_hashsz
    #define PAGE_HASHAVELEN         4
    #define PAGE_HASHVPSHIFT        6
    #define PAGE_HASH_FUNC(vp, off) \
            ((((uintptr_t)(off) >> PAGESHIFT) + \
                    ((uintptr_t)(vp) >> PAGE_HASHVPSHIFT)) & \
                    (PAGE_HASHSZ - 1))
    

  2. It uses the PAGE_HASH_SEARCH macro, shown below, to search the list referenced by that slot for a page matching the vnode/offset pair. The macro traverses the linked list of pages until it finds such a page. (A sketch combining both steps follows the listing.)

    Example. Header File <vm/page.h>
    #define PAGE_HASH_SEARCH(index, pp, vp, off) { \
            for ((pp) = page_hash[(index)]; (pp); (pp) = (pp)->p_hash) { \
                    if ((pp)->p_vnode == (vp) && (pp)->p_offset == (off)) \
                            break; \
            } \
    }

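Putting the two steps together, the lookup flow can be sketched as follows. This is illustrative rather than the actual kernel source; the real page_find() also takes the hash chain mutex and asserts that the caller holds the page locked.

#include <sys/types.h>
#include <sys/vnode.h>
#include <vm/page.h>

/* Sketch only: the real page_find() also locks the hash chain. */
static page_t *
page_find_sketch(vnode_t *vp, u_offset_t off)
{
        page_t *pp;
        ulong_t index = PAGE_HASH_FUNC(vp, off);   /* step 1: pick the slot */

        PAGE_HASH_SEARCH(index, pp, vp, off);      /* step 2: walk the chain */
        return (pp);    /* NULL if no page with this identity is hashed */
}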
5.7.3. MMU-Specific Page Structures

The defined page structure is the same across different platforms and hence contains no machine-specific structures. We do, however, need to keep machine-specific data about every page, for example, the HAT information that describes how the page is mapped by the MMU. The kernel wraps the machine-independent page structure with a machine-specific page structure, struct machpage. The contents of the machine-specific page structure are hidden from the generic kernel—only the HAT machine-specific layer can see or manipulate its contents. Figure 5.19 shows how each page structure is embedded in a machine-dependent struct machpage.

Figure 5.19. Machine-Specific Page Structures: sun4u Example


The machine-specific page structure contains a pointer to the HAT-specific mapping information, and information about the page's HAT state is stored in the machine-specific machpage. The stored information includes bits that indicate whether the page has been referenced or modified, for use by the page scanner (covered later in the chapter). Both the machine-independent and machine-dependent page structures share the same start address in memory, so a pointer to a page structure can be cast to a pointer to a machine-specific page structure (see Figure 5.19). Macros for converting between machine-independent pages and machine-dependent page structures make the cast.
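In outline, the wrapping looks like the sketch below. The field and macro names are illustrative rather than the exact sun4u definitions; the essential point is that the generic page structure is the first member, so the two structures share a start address and the conversion is a pointer cast.

/* Illustrative sketch; not the exact sun4u definitions. */
struct machpage {
        struct page     p_paget;        /* machine-independent page (first member) */
        void            *p_mapping;     /* HAT mapping list for this page */
        uint_t          p_nrm;          /* HAT referenced/modified bits */
};

/* Conversion macros: a simple cast in each direction. */
#define PP2MACHPP(pp)   ((struct machpage *)(pp))
#define MACHPP2PP(mpp)  ((struct page *)(mpp))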

5.7.4. Physical Page Lists

The Solaris kernel uses a segmented global physical page list, consisting of segments of contiguous physical memory. (Many hardware platforms now present memory in noncontiguous groups.) Contiguous physical memory segments are added during system boot. They are also added and deleted dynamically when physical memory is added and removed while the system is running. Figure 5.20 shows the arrangement of the physical page lists into contiguous segments.

Figure 5.20. Contiguous Physical Memory Segments


5.7.4.1. Free List and Cache List

The free list and the cache list hold pages that are not mapped into any address space and that have been freed by page_free(). The combined size of these lists is reported in the free column by vmstat. Even though vmstat reports these pages as free, pages on the cache list still hold a valid page of a vnode/offset and hence are still part of the global page cache; pages that are caching files can therefore show up as free memory. Memory on the cache list is not really free; it is a valid cache of a page from a file. The cache list exemplifies how the file systems use memory as a file system cache.

The free list contains pages that no longer have a vnode and offset associated with them—which can only occur if the page has been destroyed and removed from a vnode's hash list. The free list is generally very small, since most pages that are no longer used by a process or the kernel still keep their vnode/offset information intact. Pages are put on the free list when a process exits, at which point all of the anonymous memory pages (heap, stack, and copy-on-write pages) are freed.

The cache list is a hashed list of pages that still have a valid vnode and offset. Recall that pages can be obtained from the cache list by the page_lookup() routine, which takes a vnode and offset as arguments and returns a page structure. If the page is found on the cache list, it is removed from the cache list and returned to the caller. When we find and remove a page from the cache list, we are reclaiming the page. Page reclaims are reported by vmstat in the “re” column.
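The relationship between the two lists can be summarized by a sketch of the decision page_free() makes, shown below. This is a simplification for illustration: the real page_free() also handles locking and wakeups, and the page_list_add() call and its flags here are assumptions patterned on the kernel's naming.

/* Sketch of the free-vs-cache decision; not the actual page_free(). */
void
page_free_sketch(page_t *pp)
{
        if (pp->p_vnode != NULL) {
                /* Identity intact: keep the contents as file cache. */
                /* The page can later be reclaimed by page_lookup(). */
                page_list_add(pp, PG_CACHE_LIST | PG_LIST_TAIL);
        } else {
                /* No identity: the page is truly free. */
                page_list_add(pp, PG_FREE_LIST | PG_LIST_TAIL);
        }
}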

5.7.5. The Page-Level Interfaces

The Solaris virtual memory system implementation has grouped page management and manipulation into a central group of functions. These functions are used by the segment drivers and file systems to create, delete, and modify pages. The major page-level interfaces are shown in Table 5-10.

Table 5-10. Solaris 7 Page Level Interfaces
Method Description
page_create() Creates pages. Page coloring is based on a hash of the vnode offset. page_create() is provided for backward compatibility only. Don't use it if you don't have to. Instead, use the page_create_va() function so that pages are correctly colored.
page_create_va() Creates pages, taking into account the virtual address they will be mapped to. The address is used to calculate page coloring.
page_exists() Tests that a page for vnode/offset exists.
page_find() Searches the hash list for a page with the specified vnode and offset that is known to exist and is already locked.
page_first() Finds the first page on the global page hash list.
page_free() Frees a page. Pages with vnode/offset go onto the cache list; other pages go onto the free list.
page_isfree() Checks whether a page is on the free list.
page_ismod() Checks whether a page is modified. This function checks only the software bit in the page structure. To sync the MMU bits with the page structure, you may need to call hat_pagesync() before calling page_ismod().
page_isref() Checks whether a page has been referenced; checks only the software bit in the page structure. To sync the MMU bits with the page structure, you may need to call hat_pagesync() before calling page_isref().
page_isshared() Checks whether a page is shared across more than one address space.
page_lookup() Finds a page representing the specified vnode/offset. If the page is found on a free list, then it will be removed from the free list.
page_lookup_nowait() Finds a page representing the specified vnode/offset that is not locked or on the free list.
page_needfree() Informs the VM system that we need some pages freed up. Calls to page_needfree() must be symmetric; that is, each call must be followed by another page_needfree() with the same amount of memory multiplied by -1, after the task is complete.
page_next() Finds the next page on the global page hash list.

The page_create_va() function allocates pages. It takes the number of pages to allocate as an argument and returns a page list linked with the pages that have been taken from the free list. page_create_va() also takes a virtual address as an argument so that it can implement page coloring (discussed in Section 5.7.8, “Page Coloring”). The new page_create_va() function subsumes the older page_create() function and should be used by all newly developed subsystems because page_create() may not correctly color the allocated pages.
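As a concrete illustration, a segment driver or file system might allocate a single, correctly colored page as sketched below. The signature shown is the Solaris 7-era page_create_va(); treat the details as an assumption and check <vm/page.h> for the release at hand.

#include <sys/types.h>
#include <sys/vnode.h>
#include <vm/page.h>
#include <vm/seg.h>

/* Sketch: allocate one page with identity (vp, off), to be mapped at vaddr. */
static page_t *
alloc_one_page(vnode_t *vp, u_offset_t off, struct seg *seg, caddr_t vaddr)
{
        /* PG_WAIT blocks (subject to the page throttle) until memory is */
        /* available; PG_EXCL returns the page exclusively locked. */
        return (page_create_va(vp, off, PAGESIZE, PG_WAIT | PG_EXCL,
            seg, vaddr));
}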

5.7.6. The Page Throttle

Solaris implements a page creation throttle so that a small core of memory remains available for critical parts of the kernel. The throttle, implemented in the page_create() and page_create_va() functions, causes page creates that specify the PG_WAIT flag to block when available memory falls below the system global, throttlefree. By default, throttlefree is set to the same value as the system global parameter minfree. Memory allocated through the kernel memory allocator specifies PG_WAIT by default and is therefore subject to the page creation throttle. (See Section 6.2, “Kernel Memory Allocation,” for more information on kernel memory allocation.)
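In outline, the throttle behaves like the sketch below. This is an illustration of the logic just described, not the kernel source; the real code sleeps on a condition variable rather than polling.

/* Sketch of the page creation throttle. freemem and throttlefree are */
/* the system globals named in the text. */
static int
page_create_throttle_sketch(uint_t flags)
{
        while (freemem < throttlefree) {
                if (!(flags & PG_WAIT))
                        return (0);     /* nonblocking caller: fail */
                /* Blocking caller: wait for the scanner to free memory. */
                delay(hz);              /* simplified; real code uses cv_wait() */
        }
        return (1);                     /* OK to proceed with page creation */
}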

5.7.7. Page Sizes

The Solaris kernel uses a fundamental page size that varies according to the underlying hardware. On UltraSPARC and beyond, the fundamental page size is 8 Kbytes. The hardware on which Solaris runs has several different types of memory management units, which support a variety of page sizes, as listed in Table 5-11.

Table 5-11. Page Sizes on Different Sun Platforms
System Type Kernel Architecture MMU Page Size Capability Solaris 2.x Page Size
Early SPARC systems sun4c 4K 4K
microSPARC-I, -II sun4m 4K 4K
SuperSPARC-I, -II sun4m 4K, 4M 4K, 4M
UltraSPARC-I, -II sun4u 8K, 64K, 512K, 4M 8K, 4M
Intel x86 architecture i86pc 4K, 4M 4K, 4M

The optimal MMU page size is a trade-off between performance and memory size efficiency. A larger page size has less memory management overhead and hence better performance, but a smaller page size wastes less memory (memory is wasted when a page is not completely filled). (See “Large Pages” for further information on large pages.)

5.7.8. Page Coloring

Some interesting effects result from the organization of pages within the processor caches, and as a result, the page placement policy within these caches can dramatically affect processor performance. When pages overlay other pages in the cache, they can displace cache data that we might not want overlaid, resulting in less cache utilization and “hot spots.”

The optimal placement of pages in the cache often depends on the memory access patterns of the application; that is, is the application accessing memory in a random order, or is it doing some sort of strided ordered access? Several different algorithms can be selected in the Solaris kernel to implement page placement; the default attempts to provide the best overall performance.

To understand how page placement can affect performance, let's look at the cache configuration and see when page overlaying and displacement can occur. The UltraSPARC-I and -II implementations use virtually addressed L1 caches and physically addressed L2 caches. The L2 cache is arranged in lines of 64 bytes, and transfers are done to and from physical memory in 64-byte units. Figure 5.27 shows the architecture of the UltraSPARC-I and -II CPU modules with their caches. The L1 cache is 16 Kbytes, and the L2 (external) cache can vary between 512 Kbytes and 8 Mbytes. We can query the operating system with adb to see the size of the caches reported to the operating system. The L1 cache sizes are recorded in the vac_size parameter, and the L2 cache size is recorded in the ecache_size parameter.

Figure 5.27. UltraSPARC-I and -II MMUs


# adb -k
physmem 7a97
vac_size/D
vac_size:
vac_size:       16384
ecache_size/D
ecache_size:
ecache_size:    1048576

We'll start by using the L2 cache as an example of how page placement can affect performance. Because the L2 cache is physically addressed, it is organized in page-sized multiples of the physical address space, so it effectively has only a limited number of page-aligned slots. The number of effective page slots in the cache is the cache size divided by the page size. To simplify our examples, let's assume we have a 32-Kbyte L2 cache (much smaller than reality); with a page size of 8 Kbytes, there are four page-sized slots in the L2 cache. The cache does not read and write 8-Kbyte units from memory; it transfers data in 64-byte lines, so in reality our 32-Kbyte cache has 512 addressable line slots. Figure 5.21 shows how our cache would look if we laid it out linearly.

Figure 5.21. Physical Page Mapping into a 64-Kbyte Physical Cache


The L2 cache is direct-mapped from physical memory. If we were to access physical addresses a multiple of the 32-Kbyte cache size apart, for example, offsets 0 and 32768, then both memory locations would map to the same cache line. Accessing these two addresses alternately causes the cache line for offset 0 to be read in, then flushed (evicted) when offset 32768 is read in, which is in turn flushed when offset 0 is reloaded, and so on. This ping-pong effect in the cache is known as cache flushing (or cache ping-ponging), and it effectively reduces our performance to memory speed rather than cache speed. By accessing memory on our 32-Kbyte cache-size boundary, we have effectively used only 64 bytes of the cache (one cache line), rather than the full cache size. Memory is often 10–20 times slower than cache, so this effect can have a dramatic impact on performance.

Our simple example was based on the assumption that we were accessing physical memory in a regular pattern, but we don't program to physical memory; rather, we program to virtual memory. Therefore, the operating system must provide a sensible mapping between virtual memory and physical memory; otherwise, effects such as our example can occur.

By default, physical pages are assigned to an address space in the order in which they appear on the free list. In general, the first time a machine boots, the free list may hold physical memory in linear order, and we may see the behavior described in our ping-pong example. Once a machine has been running for a while, the free list becomes randomly ordered, and subsequent reruns of an identical application can get very different physical page placement and, as a result, very different performance. On early Solaris implementations, this is exactly what customers saw: as much as 30 percent difference in performance between identical runs.

To provide better and consistent performance, the Solaris kernel uses a page coloring algorithm when pages are allocated to a virtual address space. Rather than being randomly allocated, the pages are allocated with a specific predetermined relationship between the virtual address to which they are being mapped and their underlying physical address. The virtual-to-physical relationship is predetermined as follows: The free list of physical pages is organized into specifically colored bins, one color bin for each slot in the physical cache; the number of color bins is determined by the ecache size divided by the page size. (In our example, there would be exactly four colored bins.)

When a page is put on the free list, the page_free() algorithms assign it to a color bin. When a page is consumed from the free list, the virtual-to-physical algorithm takes the page from a color bin chosen as a function of the virtual address to which the page will be mapped. The algorithm therefore requires that, when pages are allocated from the free list, the page create function know the virtual address to which each page will be mapped.
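For the simplest placement policy, where physical addresses mirror virtual addresses (as in our example), the bin choice can be sketched as follows. The function name is hypothetical; the point is that the color is the page's position within a cache-sized window of the virtual address, so virtual and physical addresses fall on the same cache lines.

/* Hypothetical sketch: color bin for "physical mirrors virtual" placement. */
static uint_t
vaddr_to_color(caddr_t vaddr, size_t ecache_size)
{
        uint_t ncolors = ecache_size / PAGESIZE;    /* 4 in our 32-Kbyte example */

        return (((uintptr_t)vaddr >> PAGESHIFT) & (ncolors - 1));
}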

New pages are allocated by calling the page_create_va() function. The page_create_va() function accepts the virtual address of the location to which the page is going to be mapped as an argument; then, the virtual-to-physical color bin algorithm can decide which color bin to take physical pages from. The page_create_va() function is described with the page management functions in Table 5-10.

Note

The page_create_va() function deprecates the older page_create() function. We chose to add a new function rather than adding an additional argument to the existing page_create() function so that existing third-party loadable kernel modules that call page_create() remain functional. However, because page_create() does not know about virtual addresses, it has to pick a color at random, which can cause significant performance degradation. The page_create_va() function should always be used for new code.


No one algorithm suits all applications because different applications have different memory access patterns. Over time, the page coloring algorithms used in the Solaris kernel have been refined as a result of extensive simulation, benchmarks, and customer feedback. The kernel supports a default algorithm and two optional algorithms. The default algorithm was chosen according to the following criteria:

  • Fairly consistent, repeatable results

  • Good overall performance for the majority of applications

  • Acceptable performance across a wide range of applications

The default algorithm uses a hash of the virtual address to distribute pages as evenly as possible throughout the cache. The default and the three other available page coloring algorithms are shown in Table 5-12.

Table 5-12. Solaris Page Coloring Algorithms
No. Name Description Availability (Solaris 2.5.1 / 2.6 / 7)
0 Hashed VA The physical page color bin is chosen by a hash of the virtual address, to ensure even distribution of virtual addresses across the cache. Default / Default / Default
1 P. Addr = V. Addr The physical page color is chosen so that physical addresses map directly to virtual addresses (as in our example). Yes / Yes / Yes
2 Bin Hopping Physical pages are allocated by a round-robin method. Yes / Yes / Yes
6 Kessler's Best Bin Keeps a per-process history of used colors and chooses the least used color; on a tie, chooses the largest bin. E10000 only (default) / E10000 only (default) / Not available

The Ultra Enterprise 10000 has a different default algorithm, Kessler's Best Bin, which tries to distribute colors evenly across each process's address space so that no one color is used more than another. This algorithm does well most of the time, but in some cases the hashed or direct algorithms (0 or 1) can perform better.

You can change the default algorithm by setting the system parameter consistent_coloring, either on-the-fly with adb or permanently in /etc/system.

# adb -kw
physmem 7a97
consistent_coloring/D
consistent_coloring:
consistent_coloring:            0
consistent_coloring/W 1
consistent_coloring:            0x0             =       0x1

So, which algorithm is best? Well, your mileage will vary, depending on your application. Page coloring usually makes a difference only for memory-intensive scientific applications; the defaults are usually fine for commercial or database systems. If you have a time-critical scientific application, then we recommend that you experiment with the different algorithms and see which is best. Remember that some algorithms will produce different results for each run, so aggregate as many runs as possible.
