5.9. The Hardware Address Translation Layer

The hardware address translation (HAT) layer controls the hardware that manages mapping of virtual to physical memory. The HAT layer provides interfaces that implement the creation and destruction of mappings between virtual and physical memory and provides a set of interfaces to probe and control the MMU. The HAT layer also implements all of the low-level trap handlers to manage page faults and memory exceptions. Figure 5.26 shows the logical demarcation between elements of the HAT layer.

Figure 5.26. Role of the HAT Layer in Virtual-to-Physical Translation


The HAT implementation is different for each type of hardware MMU, and hence there are several different HAT implementations. The HAT layer hides the platform-specific implementation and is used by the segment drivers to implement the segment driver's view of virtual-to-physical translation. The HAT uses the struct hat data structure to hold the top-level translation information for an address space. The hat structure is platform specific and is referenced by the address space structure (see Figure 5.7). HAT-specific data structures existing in every page represent the translation information at a page level.

The HAT layer is called when the segment drivers want to manipulate the hardware MMU. For example, when a segment driver wants to create or destroy an address space mapping, it calls the HAT functions specifying the address range and the action to be taken. We can call the HAT functions without knowing anything about the underlying MMU implementation; the arguments to the HAT functions are machine independent and usually consist of virtual addresses, lengths, page pointers, and protection modes.

Table 5-16 summarizes HAT functions.

Table 5-16. Machine-Independent HAT Functions
Function Description
hat_alloc() Allocates a HAT structure in the address space.
hat_chgattr() Changes the protections for the supplied virtual address range.
hat_clrattr() Clears the protections for the supplied virtual address range.
hat_free_end() Informs the HAT layer that a process has exited.
hat_free_start() Informs the HAT layer that a process is exiting.
hat_get_mapped_size() Returns the number of bytes that have valid mappings.
hat_getattr() Gets the protections for the supplied virtual address range.
hat_memload() Creates a mapping for the supplied page at the supplied virtual address.
hat_setattr() Sets the protections for the supplied virtual address range.
hat_stats_disable() Finishes collecting stats on an address space.
hat_stats_enable() Starts collecting page reference and modification stats on an address space.
hat_swapin() Allocates resources for a process that is about to be swapped in.
hat_swapout() Frees resources for a process that is about to be swapped out.
hat_sync() Synchronizes the struct page software referenced and modified bits with the hardware MMU.
hat_unload() Unloads a mapping for the given page at the given address.
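
As an illustration of how a segment driver drives these interfaces, the following minimal sketch shows hypothetical fault and unmap routines. It assumes the Solaris 7-era prototypes for hat_memload() and hat_unload(); the example_fault() and example_unmap() functions themselves are invented for illustration.

/*
 * Hedged sketch: a segment driver creating and destroying a mapping
 * through the machine-independent HAT interfaces.  Only virtual
 * addresses, lengths, page pointers, and protection/flag bits are
 * passed; no MMU-specific detail appears here.
 */
#include <sys/mman.h>
#include <vm/hat.h>
#include <vm/page.h>
#include <vm/seg.h>
#include <vm/as.h>

static int
example_fault(struct seg *seg, caddr_t addr, struct page *pp)
{
        struct hat *hat = seg->s_as->a_hat;     /* HAT for this address space */

        /* Create a virtual-to-physical mapping for one page. */
        hat_memload(hat, addr, pp, PROT_READ | PROT_WRITE, HAT_LOAD);
        return (0);
}

static void
example_unmap(struct seg *seg, caddr_t addr, size_t len)
{
        /* Tear down the translations for the range. */
        hat_unload(seg->s_as->a_hat, addr, len, HAT_UNLOAD);
}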

5.9.1. Virtual Memory Contexts and Address Spaces

A virtual memory context is a set of virtual-to-physical translations that maps an address space. We change virtual memory contexts when the scheduler wants to switch execution from one process to another or when a trap or interrupt from user mode to kernel occurs. The Solaris kernel always uses virtual memory context zero to indicate the kernel context. The machine-specific HAT layer implements functions required to create, delete, and switch virtual memory contexts. Different hardware MMUs support different numbers of concurrent virtual memory contexts.

When the number of concurrent address spaces (processes) exceeds the number of concurrent contexts supported by the hardware, the kernel has to steal contexts from peer processes at the time of context switch. Stealing contexts from other address spaces has an indirect effect on performance. However, this issue was only a concern on older systems such as the SPARCstation 1 and 2.
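
The effect of running out of hardware contexts can be pictured with a small, purely illustrative sketch; none of these names (ctx_alloc, ctx_owner, and so on) come from the Solaris source, and the real HAT code is considerably more involved.

#include <stddef.h>

struct as;                              /* opaque address space structure */

#define NCTXS   8                       /* e.g., a SPARCstation 1/2-class MMU */

static struct as *ctx_owner[NCTXS];     /* which address space owns each context */
static int        ctx_rover = 1;        /* context 0 is reserved for the kernel  */

int
ctx_alloc(struct as *as)
{
        int c;

        /* Prefer a free hardware context. */
        for (c = 1; c < NCTXS; c++) {
                if (ctx_owner[c] == NULL) {
                        ctx_owner[c] = as;
                        return (c);
                }
        }

        /*
         * All contexts are in use: steal one from another address space.
         * The victim's TLB entries must be flushed, so it later faults its
         * translations back in; this is the indirect performance cost
         * noted above.
         */
        c = ctx_rover;
        if (++ctx_rover >= NCTXS)
                ctx_rover = 1;
        /* (flush the victim's TLB entries here before reassigning) */
        ctx_owner[c] = as;
        return (c);
}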

5.9.1.1. Hardware Translation Acceleration

The hardware MMU converts virtual addresses to physical addresses by looking up entries in a page translation table. All of the platforms on which Solaris runs have a hardware cache of recent translations, known as the translation lookaside buffer (TLB). The number of entries in the TLB is typically 64 on SPARC systems. Some platforms (such as Intel and older SPARC implementations) use hardware to populate the TLB, and others (like the UltraSPARC architecture) use software algorithms.

The characteristics of the MMU hardware for the different Solaris platforms are shown in Table 5-17.

Table 5-17. Solaris MMU HAT Implementations
Platform No. of Contexts Size of TLB TLB Fill Virtual Bits Physical Bits
SPARC 1,2 8 64 Hardware 32 32
MicroSPARC 65536 64 Hardware 32 32
SuperSPARC 65536 64 Hardware 32 36
UltraSPARC-I and -II 8192 64 x 2 Software 44 41
Intel Pentium   Hardware 32 36

5.9.2. The UltraSPARC-I and -II HAT

The UltraSPARC-I and -II MMUs do the following:

  • Implement mapping between a 44-bit virtual address and a 41-bit physical address

  • Support page sizes of 8 Kbytes, 64 Kbytes, 512 Kbytes, and 4 Mbytes

  • Share their implementation with the entire range of UltraSPARC-based desktop and server machines, from the Ultra 1 to the Enterprise Server 10000

The MMU is an integral part of the UltraSPARC chip, which hosts two MMUs: one for instructions and one for data. Figure 5.27 illustrates the topology of the CPU and MMU.

The UltraSPARC-I and -II MMU supports 44-bit virtual addresses (64 bits with a virtual address hole) and 41-bit physical addresses. During memory access, the MMU translates virtual addresses to physical addresses by translating a virtual page number into a physical page number and using a common page offset between the virtual and physical pages. The page number of a page is the high-order bits of the address, excluding the page offset. For 8-Kbyte pages, the page number is bits 13 through 63 of the virtual address and bits 13 through 40 of the physical address. For 4-Mbyte pages, the page number is bits 22 through 63 of the virtual address and bits 22 through 40 of the physical address.
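
The split between page number and page offset can be expressed as a short worked example in C; the PAGESHIFT constants simply restate the 8-Kbyte and 4-Mbyte page sizes described above (the 4-Mbyte case is analogous).

#include <stdint.h>

#define PAGESHIFT_8K    13              /* 8-Kbyte pages:  2^13-byte offset */
#define PAGESHIFT_4M    22              /* 4-Mbyte pages:  2^22-byte offset */

/* For an 8-Kbyte page, bits 0-12 are the page offset; bits 13-63 of the
 * virtual address (13-40 of the physical address) are the page number. */
uint64_t
vpn_8k(uint64_t vaddr)
{
        return (vaddr >> PAGESHIFT_8K);
}

uint64_t
page_offset_8k(uint64_t vaddr)
{
        return (vaddr & ((1ULL << PAGESHIFT_8K) - 1));
}

/* Translation swaps the page number and keeps the offset: given the
 * physical page number (ppn) the MMU produced, rebuild the physical address. */
uint64_t
phys_addr_8k(uint64_t ppn, uint64_t vaddr)
{
        return ((ppn << PAGESHIFT_8K) | page_offset_8k(vaddr));
}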

Figure 5.28 illustrates the relationship between virtual and physical addresses.

Figure 5.28. Virtual-to-Physical Translation


We traditionally use page tables to map virtual-to-physical addresses, so that for any given virtual address, the MMU hardware can look up the corresponding physical address. The microSPARC and SuperSPARC processors translate addresses this way: their page tables comprise mapping entries, one per page, known as page table entries, each of which maps a virtual page to a physical page.

In UltraSPARC-I and -II, we use a translation table to describe the virtual-to-physical translation. A translation table is the functional equivalent of a page table, with some significant differences. For example, unlike the older page table implementation (like that on SuperSPARC), the UltraSPARC-I and -II MMU uses a software-based translation mechanism to fill the hardware TLB translation cache.

The UltraSPARC-I and -II software page table entries are known as translation table entries (TTEs), one TTE for each page. The TTE is a translation map entry that contains a virtual address tag and the high bits of the physical address for each page (the physical page number) so that the hardware MMU can convert the virtual page address into a physical address by finding entries matching the virtual page number.

The TTE virtual page tag contains the virtual page number of the virtual address it maps and the address space context number to identify the context to which each TTE belongs. The context information in the TTE allows the MMU to find a TTE specific to an address space context, which allows multiple contexts to be simultaneously active in the TLB. This behavior offers a major performance benefit: traditionally, all of the TLB entries for an address space had to be flushed at each context switch. Having entries from multiple contexts in the TLB dramatically decreases the time taken for a context switch because translations do not need to be reloaded into the TLB each time we context-switch.

The TTEs found in the page structure are a software representation of the virtual-to-physical mapping and must be loaded into the TLB to enable the MMU to perform a hardware translation. Once a TTE is loaded, the hardware MMU can translate addresses for that TTE without further interaction. The hardware MMU relies on the hardware copies of the TTEs in the TLB to do the real translation. The TLB contains the 64 most recent TTEs and is used directly by the hardware to assemble the physical addresses of each page as virtual addresses are accessed.

The MMU finds the TTE entry that matches the virtual page number and current context. When it finds the match, it retrieves the physical page information. The physical page information contains the physical page number (the bits of the physical address above the page offset: bits 13 through 40 for 8-Kbyte pages, or bits 22 through 40 for 4-Mbyte pages), the size of the page, a bit to indicate whether the page can be written to, and a bit to indicate whether the page can only be accessed in privileged mode.
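
Conceptually, then, a TTE is a tag/data pair. The sketch below is a simplified software view; the exact bit positions and field widths are defined in the UltraSPARC-I and -II User's Manual [30], not here.

#include <stdint.h>

/*
 * Simplified, illustrative view of a TTE as a tag/data pair.  Field
 * widths and layout are not the real hardware encoding.
 */
struct tte_sketch {
        /* Tag: identifies which virtual page and which context this maps. */
        uint64_t  tag_context;  /* address space context number              */
        uint64_t  tag_vpn;      /* virtual page number (VA bits 13/22..63)   */

        /* Data: where the page is and how it may be used. */
        uint64_t  data_ppn;     /* physical page number (PA bits 13/22..40)  */
        unsigned  data_size:2;  /* page size: 8K, 64K, 512K, or 4M           */
        unsigned  data_w:1;     /* writable                                  */
        unsigned  data_p:1;     /* privileged (kernel-only) access           */
        unsigned  data_v:1;     /* valid                                     */
};

A TLB lookup compares the context number and virtual page number of the access against the tag and, on a match, substitutes the physical page number from the data portion.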

Figure 5.29 illustrates how TTEs are used.

Figure 5.29. UltraSPARC-I and -II Translation Table Entry (TTE)


Software populates the TLB entries from the hment structures in the machine-specific page structure. Since the process of converting a software TTE into a TLB entry is fairly expensive, an intermediate software cache of TTEs is arranged as a direct-mapped cache of the TLB. The software cache of TTEs is the translation storage buffer (TSB) and is simply an array of TTEs in regular physical memory. Figure 5.30 shows the relationship between software TTEs, the TSB, and the TLB.

Figure 5.30. Relationship of TLBs, TSBs, and TTEs


The TLB on UltraSPARC-I and -II has 64 entries, and the TSB has between 2,048 and 262,144 entries, depending on the amount of memory in the system. Unlike the previous generation SPARC MMU, the UltraSPARC-I and -II MMU does not use a hardware page-table walk to access the TSB entries, but it still provides a form of hardware assistance to speed up TSB access: Hardware precomputes TSB table offsets to help the software handler find TTEs in the TSB. When the CPU needs to access a virtual address, the system takes the following steps (sketched in code after the list):

1. The MMU first looks in the TLB for a valid TTE for the requested virtual address.

2. If a valid TTE is not found, then the MMU automatically generates a pointer to the location of the TTE entry in the TSB and generates a TLB miss trap.

3. The trap handler reads the hardware-constructed pointer, retrieves the entry from the TSB, and places it in the appropriate TLB slot with an atomic write into the TLB.

4. If the TTE is not found in the TSB, then the TLB miss handler jumps to a more complex, but slower, TSB miss handler, which retrieves the TTE from the page structure by using a software hashed index structure.
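
The flow of steps 1 through 4 can be summarized in C-style pseudocode. All of the function names below are hypothetical; the real miss handlers are hand-written SPARC assembly running in the dedicated trap-handling registers described next.

#include <stdint.h>

struct tte;

/* Hypothetical helpers standing in for hardware and assembly handlers. */
extern struct tte *tlb_lookup(uint64_t va, int ctx);        /* hardware TLB search   */
extern struct tte *tsb_entry_for(uint64_t va, int ctx);     /* hardware TSB pointer  */
extern int         tte_matches(struct tte *, uint64_t va, int ctx);
extern struct tte *hash_lookup(uint64_t va, int ctx);       /* software hashed index */
extern void        tlb_fill(struct tte *);                  /* atomic TLB write      */

void
translate(uint64_t va, int ctx)
{
        struct tte *ttep;

        if (tlb_lookup(va, ctx) != NULL)
                return;                         /* 1. TLB hit: hardware translates */

        /* 2. TLB miss trap: hardware has already computed a TSB pointer. */
        ttep = tsb_entry_for(va, ctx);

        if (ttep != NULL && tte_matches(ttep, va, ctx)) {
                tlb_fill(ttep);                 /* 3. copy the TSB entry into the TLB */
                return;
        }

        /* 4. TSB miss: slower handler walks the hashed page structures. */
        ttep = hash_lookup(va, ctx);
        tlb_fill(ttep);
}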

UltraSPARC-I and -II also provide a separate set of global registers to process MMU traps. This approach dramatically reduces the time taken for the TLB miss handler to locate TSB entries since the CPU does not need to save process state during the trap—it simply locates the entry, atomically stores the entry, and returns to execution of the running process.

To optimize performance, the TSB is sized according to the amount of memory in the system. In addition, if memory is sufficient, a separate TSB is allocated for the kernel context. The maximum size of a TSB on UltraSPARC-I and -II is 512 Kbytes, so to provide a large TSB, multiple 512-Kbyte TSBs are allocated. When multiple TSBs are allocated, contexts are distributed among them by a mask of the context number, so multiple contexts share the same TSB. An array of TSB base addresses, indexed by the masked context number, implements this scheme. When a context switch is performed, the new TSB base address is looked up in the TSB base address array and loaded into the MMU TSB base address register.
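
A sketch of the TSB selection on context switch follows; tsb_bases, ntsbs, and tsb_base_for_context() are illustrative names, not the kernel's own.

#include <stdint.h>

#define TSB_BYTES_MAX   (512 * 1024)    /* maximum TSB size on UltraSPARC-I/II */

extern uint64_t *tsb_bases[];           /* base address of each allocated TSB      */
extern int       ntsbs;                 /* number of 512-Kbyte TSBs (power of two) */

uint64_t *
tsb_base_for_context(int ctx)
{
        /*
         * Masking the context number selects one of the TSBs; every
         * context whose masked value is the same shares that TSB.  On a
         * context switch the result would be loaded into the MMU TSB
         * base address register.
         */
        return (tsb_bases[ctx & (ntsbs - 1)]);
}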

The memory sizes and the corresponding TSB sizes are shown in Table 5-18.

Table 5-18. Solaris 7 UltraSPARC-I and -II TSB Sizes
Memory Size Kernel TSB Entries Kernel TSB Size User TSB Entries User TSB Size
< 32 Mbytes 2,048 128 Kbytes
32 Mbytes–64 Mbytes 4,096 256 Kbytes 8,192–16,383 512 Kbytes–1 Mbyte
64 Mbytes–2 Gbytes 4,096–262,144 256 Kbytes–16 Mbytes 16,384–524,287 1 Mbyte–32 Mbytes
2 Gbytes–8 Gbytes 262,144 16 Mbytes 524,288–2,097,151 32 Mbytes–128 Mbytes
> 8 Gbytes 262,144 16 Mbytes 2,097,152 128 Mbytes

5.9.3. Address Space Identifiers

SPARC systems use an address space identifier (ASI) to describe the MMU mode and hardware used to access pages in the current environment. UltraSPARC uses an 8-bit ASI that is derived from the instruction being executed and the current trap level. Most of the 50+ different ASIs can be grouped into three different modes of physical memory access, shown in Table 5-19. The MMU translation context used to index TLB entries is derived from the ASI.

Table 5-19. UltraSPARC-I and -II Address Space Identifiers
ASI Description Derived Context
Primary The default address translation; used for regular SPARC instructions. The address space translation is done through TLB entries that match the context number in the MMU primary context register.
Secondary A secondary address space context; used for accessing another address space context without requiring a context switch. The address space translation is done through TLB entries that match the context number in the MMU secondary context register.
Nucleus The nucleus address translation; used for TLB miss handlers, system calls, and interrupts. The nucleus context is always zero (the kernel's context).

The three UltraSPARC ASIs allow an instruction stream to access up to three different address space contexts without actually having to perform a context switch. The Solaris kernel uses this feature to place the kernel in a completely separate address space while still allowing very fast access to user space from kernel mode. The kernel can access user space by using SPARC instructions that allow an ASI to be specified with the address of a load/store, and the user space is always available from the kernel in the secondary ASI.
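
From a device driver's point of view, this shows up as the copyin(9F) and copyout(9F) routines, which copy data between user and kernel addresses without a context switch; on UltraSPARC they can be implemented with loads and stores tagged with the secondary (user) ASI. A minimal sketch of typical copyin() usage follows; the ioctl argument structure is hypothetical.

#include <sys/types.h>
#include <sys/errno.h>
#include <sys/ddi.h>
#include <sys/sunddi.h>

struct my_args {                /* hypothetical ioctl argument block */
        int     flags;
        size_t  len;
};

static int
my_ioctl_copy(intptr_t uarg)
{
        struct my_args args;

        /* Copy the user-space structure into kernel memory. */
        if (copyin((void *)uarg, &args, sizeof (args)) != 0)
                return (EFAULT);

        /* ... use args ... */
        return (0);
}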

Other SPARC ASIs access hardware registers in the MMU itself, and special ASIs access I/O space. For further details on UltraSPARC ASIs, see the UltraSPARC-I and -II Users Manual [30].

5.9.3.1. UltraSPARC-I and II Watchpoint Implementation

The UltraSPARC-I and -II MMU implementation provides support for watchpoints. Virtual address and physical address watchpoint registers, when enabled, describe the address of watchpoints for the address space. Watchpoint traps are generated when watchpoints are enabled and the data MMU detects a load or store to the virtual or physical address specified by the virtual address data watchpoint register or the physical data watchpoint register, respectively. (See “Virtual Memory Watchpoints” for further information.)

5.9.3.2. UltraSPARC-I and -II Protection Modes

Protection modes are implemented by the instruction and data TTEs. Table 5-20 shows the resultant protection modes.

Table 5-20. UltraSPARC MMU Protection Modes
TTE in D-MMU TTE in I-MMU Writable Attribute Bit Resultant Protection Mode
Yes No 0 Read-only
No Yes Don't Care Execute-only
Yes No 1 Read/Write
Yes Yes 0 Read-only/Execute
Yes Yes 1 Read/Write/Execute

5.9.3.3. UltraSPARC-I and -II MMU-Generated Traps

The UltraSPARC MMU generates traps to implement the software handlers for events that the hardware MMU cannot handle. Such events occur when the MMU encounters an MMU-related instruction exception or when the MMU cannot find a TTE in the TSB for a virtual address. Table 5-21 describes UltraSPARC-I and -II traps.

Table 5-21. UltraSPARC-I and -II MMU Traps
Trap Description
Instruction_access_miss A TTE for the virtual address of an instruction was not found in the instruction TLB.
Instruction_access_exception An instruction privilege violation or invalid instruction address occurred.
Data_access_MMU_miss A TTE for the virtual address of a load was not found in the data TLB.
Data_access_exception A data access privilege violation or invalid data address occurred.
Data_access_protection A data write was attempted to a read-only page.
Privileged_action An attempt was made to access a privileged address space.
Watchpoint Watchpoints were enabled and the CPU attempted to load or store at the address equivalent to that stored in the watchpoint register.
Mem_address_not_aligned An attempt was made to load or store from an address that is not correctly word aligned.

5.9.4. Large Pages

Large pages, typically 4 Mbytes in size, optimize the effectiveness of the hardware TLB. They were introduced in Solaris 2.5 to map core kernel text and data, and support for them has since been extended to other areas of the operating system. Let's take a look at why large pages help optimize TLB performance.

5.9.4.1. TLB Performance and Large Pages

The memory performance of a modern system is largely influenced by the effectiveness of the TLB because of the time spent servicing TLB misses. The objective of the TLB is to cache as many recent page translations in hardware as possible, so that it can satisfy a process's or thread's memory accesses by performing all of the virtual-to-physical translations on-the-fly. If we don't find a translation in the TLB, then we need to look up the translation from a larger table, either in software (UltraSPARC) or with lengthy hardware steps (Intel or SuperSPARC).

Most TLBs are limited in size because of the amount of transistor space available on the CPU die. For example, the UltraSPARC-I and -II TLBs have only 64 entries. This means that the TLB can hold translations for at most 64 pages at any time; on UltraSPARC, 64 x 8 Kbytes, or 512 Kbytes.

The amount of memory the TLB can address concurrently is known as the TLB reach, so we can say that UltraSPARC-I and -II have a TLB reach of 512 Kbytes. If an application makes heavy use of less than 512 Kbytes of memory, then the TLB is able to cache the entire set of translations. But if the application makes heavy use of more than 512 Kbytes of memory, then the TLB begins to miss, and translations have to be loaded from the larger translation table.
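
The reach arithmetic is simple enough to show directly; this trivial worked example just multiplies the 64 entries by the page size mapped by each entry.

#include <stdio.h>

/* TLB reach = number of TLB entries x page size mapped by each entry. */
int
main(void)
{
        const long entries = 64;

        printf("reach with 8-Kbyte pages: %ld Kbytes\n", entries * 8);  /* 512 Kbytes */
        printf("reach with 4-Mbyte pages: %ld Mbytes\n", entries * 4);  /* 256 Mbytes */
        return (0);
}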

Table 5-22 shows the TLB miss rate and the amount of time spent servicing TLB misses from a study done by Madhu Talluri [33] on older SPARC architectures. We can see from the table that only a small range of compute-bound applications fit well in the SuperSPARC TLB (gcc, ML, and pthor), whereas the others spend a significant amount of their time in the TLB miss handlers.

Table 5-22. Sample TLB Miss Data from a SuperSPARC Study
Workload Total Time (secs) User Time (secs) # User TLB Misses % User Time in TLB Miss Handling Cache Misses ('000s) Peak Memory Usage (MB)
coral 177 172 85974 50 71053 19.9
nasa7 387 385 152357 40 64213 3.5
compress 99 77 21347 28 21567 1.4
fftpde 55 53 11280 21 14472 14.7
wave5 110 107 14510 14 4583 14.3
mp3d 37 36 4050 11 5457 4.8
spice 620 617 41923 7 81949 3.6
pthor 48 35 2580 7 6957 15.4
ML 945 917 38423 4 314137 32.0
gcc 118 105 2440 2 9980 5.6

TLB effectiveness has become a larger issue in the past few years because the average amount of memory used by applications has grown significantly—by as much as double per year according to recent statistical data. The easiest way to increase the effectiveness of the TLB is to increase the TLB reach so that the working set of the application fits within the TLB's reach. You can increase TLB reach in two ways:

  • Increase the number of entries in the TLB. This approach adds complexity to the TLB hardware and increases the number of transistors required, taking up valuable CPU die space.

  • Increase the page size that each entry reflects. This approach increases the TLB reach without the need to increase the size of the TLB.

A trade-off to increasing the page size is this: If we increase the page size, we may boost the performance of some applications at the expense of slower performance elsewhere, and because of fragmentation within the larger pages, we would almost certainly increase the memory usage of many applications. Luckily, a solution is at hand: Some of the newer processor architectures allow us to use two or more different page sizes at the same time. For example, the UltraSPARC microprocessor provides hardware support to concurrently select 8-Kbyte, 64-Kbyte, 512-Kbyte, or 4-Mbyte pages. If we were to use 4-Mbyte pages to map all memory, then the TLB would have a theoretical reach of 64 x 4 Mbytes, or 256 Mbytes. We do, however, need operating system support to take advantage of large pages.

5.9.4.2. Solaris Support for Large Pages

Prior to the introduction of the first UltraSPARC processor, the Solaris kernel did not support multiple page sizes, so significant kernel development effort was required before the kernel could take advantage of the underlying hardware's support for multiple page sizes. This development was phased over several Solaris releases, starting with Solaris 2.5 (the release in which UltraSPARC support first appeared).

The default page size for Solaris on UltraSPARC is 8 Kbytes, chosen to give a good mix of performance across the range of smaller machines (32 Mbytes) to larger machines (several gigabytes). The 8-Kbyte page size is appropriate for many applications, but those with a larger working set spend a lot of their time in TLB miss handling; as a result, the 8-Kbyte page size hurts a class of applications, mainly large-memory scientific applications and large-memory databases. The 8-Kbyte page size also hurts kernel performance, since the kernel's working set is on the order of 2 to 3 megabytes.

In Solaris 2.5, the first 4 Mbytes of kernel text and data are mapped with two 4-Mbyte pages at system boot time. This large page size significantly reduces the number of TLB entries required to satisfy kernel memory translations, speeding up the kernel code path and freeing up valuable TLB slots for hungry applications. In addition, the Ultra Creator graphics adapter makes use of a large translation to accelerate graphics performance.

However, user applications had no way to take advantage of the large-page support until Solaris 2.6. With that release, Sun enabled the use of 4-Mbyte pages for shared memory segments. This capability primarily benefits databases, since databases typically use extremely large shared memory segments for database data structures and shared file buffers. Shared memory segments ranging from several hundred megabytes to several gigabytes are typical on most database installations, and the 8-Kbyte page size means that very few 8-Kbyte shared memory page translations fit into the TLB. The 4-Mbyte pages allow large contiguous memory to be mapped by just a few pages.

Starting with Solaris 2.6, System V shared memory created with the intimate shared memory flag, SHM_SHARE_MMU, is created with as many large 4-Mbyte pages as possible, greatly increasing the performance of database applications. Table 5-23 shows a sample of the performance gains from adding large-page shared memory support.
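
Requesting intimate shared memory from an application is a one-flag change at shmat() time. A minimal user-level sketch follows; the segment size and abbreviated error handling are arbitrary examples, while SHM_SHARE_MMU is the Solaris flag named above.

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <stdio.h>

int
main(void)
{
        size_t  size = 256 * 1024 * 1024;       /* 256-Mbyte segment */
        int     shmid;
        void   *addr;

        shmid = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
        if (shmid == -1) {
                perror("shmget");
                return (1);
        }

        /* SHM_SHARE_MMU requests ISM: shared translation structures and
         * large (4-Mbyte) pages where size and alignment permit. */
        addr = shmat(shmid, NULL, SHM_SHARE_MMU);
        if (addr == (void *)-1) {
                perror("shmat");
                return (1);
        }

        /* ... database-style use of the segment ... */
        return (0);
}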

Table 5-23. Large-Page Database Performance Improvements
Database Performance Improvement
Oracle TPC-C 12%
Informix TPC-C 1%
Informix TPC-D 6%
