Chapter 8. Native Memory Best Practices

The heap is the largest consumer of memory in a Java application, but the JVM also allocates and uses a large amount of native memory. While Chapter 7 discussed ways to efficiently manage the heap from a programmatic point of view, the configuration of the heap and how it interacts with the native memory of the operating system is another important factor in the overall performance of an application.

This chapter discusses these aspects of native (or operating system) memory. We start with a discussion of the entire memory use of the JVM, with a goal of understanding how to monitor that usage for performance issues. Then we’ll discuss various ways to tune the JVM and operating system for optimal memory use.

Footprint

The heap (usually) accounts for the largest amount of memory used by the JVM, but the JVM also uses memory for its internal operations. This non-heap memory is native memory. Native memory can also be allocated in applications (via JNI calls to malloc() and similar methods, or when using NIO). The total of native and heap memory used by the JVM yields the total footprint of an application.

From the point of view of the operating system, this total footprint is the key to performance. If enough physical memory to contain the entire total footprint of an application is not available, performance may begin to suffer. The operative word here is “may.” There are parts of native memory that are really used only during startup (for instance, the memory associated with loading the JAR files on the classpath), and if that memory is swapped out, it won’t necessarily be noticed. Some of the native memory used by one Java process is shared with other Java processes on the system, and some smaller part is shared with other kinds of processes on the system. For the most part, though, for optimal performance you want to be sure that the total footprint of all Java processes does not exceed the physical memory of the machine (plus you want to leave some memory available for other applications).

Measuring Footprint

To measure the total footprint of a process, you need to use an operating-system-specific tool. On Unix-based systems, programs like top and ps can show you that data at a basic level; on Windows, you can use perfmon or VMMap. No matter which tool or platform is used, you need to look at the actual allocated memory (as opposed to the reserved memory) of the process.

The distinction between allocated and reserved memory comes about as a result of the way the JVM (and all programs) manage memory. Consider a heap specified with the parameters -Xms512m -Xmx2048m. The heap starts by using 512 MB, and it will be resized as needed to meet the GC goals of the application.

That concept is the essential difference between committed (or allocated) memory and reserved memory (sometimes called the virtual size of a process). The JVM must tell the operating system that it might need as much as 2 GB of memory for the heap, so that memory is reserved: the operating system promises that when the JVM attempts to allocate additional memory as it expands the heap, that memory will be available.

Still, only 512 MB of that memory is actually allocated initially, and that 512 MB is all of the memory that actually is being used (for the heap). That (actually allocated) memory is known as the committed memory. The amount of committed memory will fluctuate as the heap resizes; in particular, as the heap size increases, the committed memory correspondingly increases.
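You can observe this distinction from inside a Java program with the MemoryMXBean, which reports the initial, used, committed, and maximum sizes of the heap (in bytes). The following is a minimal sketch; run it with the flags above (-Xms512m -Xmx2048m) and the committed size will start near 512 MB while the maximum remains 2 GB.

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class HeapFootprint {
    public static void main(String[] args) {
        // Heap sizes as the JVM reports them: initial, used, committed, and maximum
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.println("Initial heap:   " + heap.getInit());
        System.out.println("Used heap:      " + heap.getUsed());
        System.out.println("Committed heap: " + heap.getCommitted());
        System.out.println("Maximum heap:   " + heap.getMax());
    }
}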

Is Over-Reserving a Problem?

This difference applies to almost all significant memory that the JVM allocates. The code cache grows from an initial to a maximum value as more code gets compiled. Permgen or the class metaspace is allocated separately and grows between its initial (committed) size and its maximum (reserved) size.

One exception to this is thread stacks. Every time the JVM creates a thread, the OS allocates some native memory to hold that thread’s stack, committing more memory to the process (until the thread exits, at least). Thread stacks, though, are fully allocated when they are created.

On Unix systems, the actual footprint of an application can be estimated by the resident set size (RSS) of the process as reported by various OS tools. That value is a good estimate of the amount of committed memory a process is using, though it is inexact in two ways. First, the few pages that are shared at the OS level between the JVM and other processes (that is, the text portions of shared libraries) are counted in the RSS of each process. Second, a process may have committed more memory than it has actually paged in at any moment. Still, tracking the RSS of a process is a good first-pass way to monitor the total memory use. On more recent Linux kernels, the proportional set size (PSS) is a refinement of the RSS that accounts for the data shared with other programs.
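On Linux, for example, the reserved (virtual) size and the resident set size of a process can be compared with ps:

% ps -o pid,vsz,rss -p process_id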

On Windows systems, the equivalent idea is called the working set of an application, which is what is reported by the task manager.

Minimizing Footprint

To minimize the footprint used by the JVM, limit the amount of memory used by the following:

Heap
The heap is the biggest chunk of memory, though surprisingly it may take up only 50% to 60% of the total footprint. Using a smaller maximum heap (or setting the GC tuning parameters such that the heap never fully expands) limits the program’s footprint.
Thread Stacks
Thread stacks are quite large, particularly for a 64-bit JVM. See Chapter 9 for ways to limit the amount of memory consumed by thread stacks.
Code Cache
The code cache uses native memory to hold compiled code. As discussed in Chapter 4, this can be tuned (though performance will suffer if all the code cannot be compiled due to space limitations).
Direct Byte Buffers
These are discussed in the next section.
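Each of those areas can be capped with a command-line flag. The following is only a sketch with illustrative values and a hypothetical main class; appropriate sizes depend entirely on the application:

% java -Xmx1g -Xss256k -XX:ReservedCodeCacheSize=64m \
       -XX:MaxDirectMemorySize=64m MyApp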

Native NIO Buffers

Developers can allocate native memory via JNI calls, but NIO byte buffers will also allocate native memory if they are created via the allocateDirect() method. Native byte buffers are quite important from a performance perspective, since they allow native code and Java code to share data without copying it. The most common example here is buffers that are used for filesystem and socket operations. Writing data to a native NIO buffer and then sending that data to the channel (e.g., the file or socket) requires no copying of data between the JVM and the C library used to transmit the data. If a heap byte buffer is used instead, the contents of the buffer must be copied by the JVM.
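Here is a minimal sketch of that pattern (the file name and payload are arbitrary): data placed in a direct byte buffer is written to a file channel without being copied through the Java heap.

import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class DirectBufferWrite {
    public static void main(String[] args) throws Exception {
        // Allocated in native memory, outside the Java heap
        ByteBuffer buffer = ByteBuffer.allocateDirect(4096);
        buffer.put("some payload".getBytes(StandardCharsets.UTF_8));
        buffer.flip();
        try (FileChannel channel = FileChannel.open(Paths.get("out.dat"),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            channel.write(buffer);   // no copy between the heap and native I/O code
        }
    }
}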

The allocateDirect() method call is quite expensive; direct byte buffers should be reused as much as possible. The ideal situation is when threads are independent and each can keep a direct byte buffer as a thread-local variable. That can sometimes use too much native memory if many threads need buffers of variable sizes, since eventually each thread will end up with a buffer at the maximum possible size. For that kind of situation—or when thread-local buffers don’t fit the application design—an object pool of direct byte buffers may be more useful.
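A minimal sketch of the thread-local approach, assuming each thread can make do with a fixed 64 KB buffer, looks like this:

import java.nio.ByteBuffer;

public class BufferHolder {
    private static final ThreadLocal<ByteBuffer> BUFFER =
            new ThreadLocal<ByteBuffer>() {
                @Override protected ByteBuffer initialValue() {
                    // One direct buffer per thread, allocated lazily and then reused
                    return ByteBuffer.allocateDirect(64 * 1024);
                }
            };

    public static ByteBuffer getBuffer() {
        ByteBuffer buf = BUFFER.get();
        buf.clear();   // reset position and limit before each reuse
        return buf;
    }
}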

Byte buffers can also be managed by slicing them. The application can allocate one very large direct byte buffer, and individual requests can allocate a portion out of that buffer using the slice() method of the ByteBuffer class. This solution can become unwieldy when the slices are not always the same size: the original byte buffer can then become fragmented in the same way the heap becomes fragmented when allocating and freeing objects of different sizes. Unlike the heap, however, the individual slices of a byte buffer cannot be compacted, so this solution really works well only when all the slices are a uniform size.
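A sketch of the slicing approach, assuming uniform 4 KB slices handed out from one large direct buffer, looks like this:

import java.nio.ByteBuffer;

public class SlicedBuffers {
    private static final int SLICE_SIZE = 4 * 1024;
    private static final int SLICE_COUNT = 256;

    private final ByteBuffer master =
            ByteBuffer.allocateDirect(SLICE_SIZE * SLICE_COUNT);

    // Hand out the n-th fixed-size slice of the master buffer
    public synchronized ByteBuffer sliceFor(int n) {
        master.position(n * SLICE_SIZE);
        master.limit((n + 1) * SLICE_SIZE);
        ByteBuffer slice = master.slice();
        master.clear();   // restore position and limit for the next caller
        return slice;
    }
}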

From a tuning perspective, the one thing to realize with any of these programming models is that the amount of direct byte buffer space that an application can allocate can be limited by the JVM. The total amount of memory that can be allocated for direct byte buffers is specified by setting the -XX:MaxDirectMemorySize=N flag. Starting in Java 7, the default value for this flag is 0, which means there is no limit (subject to the address space size and any operating-system limits on the process). That flag can be set to limit the direct byte buffer use of an application (and to provide compatibility with previous releases of Java, where the limit was 64 MB).

Quick Summary

  1. The total footprint of the JVM has a significant effect on its performance, particularly if physical memory on the machine is constrained. Footprint is another aspect of performance tests that should be commonly monitored.
  2. From a tuning perspective, the footprint of the JVM can be limited in the amount of native memory it uses for direct byte buffers, thread stack sizes, and the code cache (as well as the heap).

Native Memory Tracking

Beginning in Java 8, the JVM allows some visibility into how it allocates native memory via the option -XX:NativeMemoryTracking=off|summary|detail. By default, Native Memory Tracking (NMT) is off. If the summary or detail mode is enabled, you can get the native memory information at any time from jcmd:

% jcmd process_id VM.native_memory summary

If the JVM is started with the argument -XX:+PrintNMTStatistics (which is false by default), it will print out information about its native memory allocation when the program exits.
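For example, to run an application (the class name here is hypothetical) with summary tracking and a final report at exit:

% java -XX:NativeMemoryTracking=summary -XX:+PrintNMTStatistics MyApp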

Here is the summary output from a JVM running with a 512 MB initial heap size and a 4 GB maximum heap size:

Native Memory Tracking:

 Total:  reserved=4787210KB,  committed=1857677KB

Although the JVM has made memory reservations totaling 4.7 GB, it has used much less than that: only 1.8 GB total. This is fairly typical (and one reason not to pay particular attention to the virtual size of the process displayed in OS tools, since that reflects only the memory reservations).

This memory usage breaks down as follows:

-                 Java Heap (reserved=4296704KB, committed=1470428KB)
                            (mmap: reserved=4296704KB, committed=1470428KB)

The heap itself is (unsurprisingly) the largest part of the reserved memory at 4 GB. But the dynamic sizing of the heap meant it grew only to 1.4 GB.

-                     Class (reserved=65817KB, committed=60065KB)
                            (classes #19378)
                            (malloc=6425KB, #14245)
                            (mmap: reserved=59392KB, committed=53640KB)

This is the native memory used to hold class metadata. Again, note that the JVM has reserved more memory than it actually used to hold the 19,378 classes in the program.

-                    Thread (reserved=84455KB, committed=84455KB)
                            (thread #77)
                            (stack: reserved=79156KB, committed=79156KB)
                            (malloc=243KB, #314)
                            (arena=5056KB, #154)

Here, 77 thread stacks were allocated at about 1 MB each.

-                      Code (reserved=102581KB, committed=15221KB)
                            (malloc=2741KB, #4520)
                            (mmap: reserved=99840KB, committed=12480KB)

This is the JIT code cache. 19,378 classes is not very many, so just a small section of the code cache is committed.

-                        GC (reserved=183268KB, committed=173156KB)
                            (malloc=5768KB, #110)
                            (mmap: reserved=177500KB, committed=167388KB)

These are areas outside of the heap that GC algorithms use for their processing.

-                  Compiler (reserved=162KB, committed=162KB)
                            (malloc=63KB, #229)
                            (arena=99KB, #3)

Similarly, this area is used by the compiler for its operations, apart from the resulting code placed in the code cache.

-                    Symbol (reserved=12093KB, committed=12093KB)
                            (malloc=10039KB, #110773)
                            (arena=2054KB, #1)

Interned String references and symbol table references are held here.

-           Memory Tracking (reserved=22466KB, committed=22466KB)
                            (malloc=22466KB, #1872)

Native memory tracking itself needs some space for its operation.

Detailed Memory Tracking Information

NMT provides two key pieces of information:

Total Committed Size
The total committed size of the process is the actual amount of physical memory that the process will consume. This is close to the RSS (or working set) of the application, but those OS-provided measurements don’t include any memory that has been committed but paged out of the process. In fact, if the RSS of the process is less than the committed memory, that is often an indication that the OS is having difficulty fitting all of the JVM in physical memory.
Individual Committed Sizes
When it is time to tune maximum values—of the heap, the code cache, the metaspace—it is helpful to know how much of that memory the JVM is actually using. Overallocating those areas usually leads only to harmless memory reservations, though in those cases where the reserved memory is important, NMT can help to track down where those maximum sizes can be trimmed.
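If the JVM was started in detail mode, the individual allocation sites behind those committed sizes can also be inspected:

% jcmd process_id VM.native_memory detail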

NMT over time

NMT also allows you to track how memory allocations occur over time. After the JVM is started with NMT enabled, you can establish a baseline for memory usage by using this command:

% jcmd process_id VM.native_memory baseline

That causes the JVM to mark its current memory allocations. Later, you can compare the current memory usage to that mark:

% jcmd process_id VM.native_memory summary.diff
Native Memory Tracking:

Total:  reserved=5896078KB  -3655KB, committed=2358357KB -448047KB

-                 Java Heap (reserved=4194304KB, committed=1920512KB -444927KB)
                            (mmap: reserved=4194304KB, committed=1920512KB -444927KB)
....

In this case, the JVM has reserved 5.8 GB of memory and is presently using 2.3 GB. That committed size is 448 MB less than when the baseline was established. Similarly, the committed memory used by the heap has declined by 444 MB (and the rest of the output could be inspected to see where else the memory use declined to account for the remaining 4 MB).

This is a very useful technique to examine the footprint of the JVM over time.

Quick Summary

  1. Available in Java 8, Native Memory Tracking (NMT) provides details about the native memory usage of the JVM. From an operating system perspective, that includes the JVM heap (which to the OS is just a section of native memory).
  2. The summary mode of NMT is sufficient for most analysis, and allows you to determine how much memory the JVM has committed (and what that memory is used for).

JVM Tunings for the Operating System

There are several tunings that the JVM can use to improve the way in which it uses OS memory.

Large Pages

Discussions about memory allocation and swapping occur in terms of pages. A page is the unit of memory by which operating systems manage physical memory, and it is the minimum unit of allocation: when one byte is allocated, the operating system must allocate an entire page. Further allocations for that program come from that same page until it is filled, at which point a new page is allocated.

The operating system allocates many more pages than can fit in physical memory, which is why there is paging: pages of the address space are moved to and from swap space (or other storage, depending on what the page contains). This means there must be some mapping between these pages and where they are currently stored in the computer’s RAM. Those mappings are handled in two ways. All page mappings are held in a global page table (which the OS can scan to find a particular mapping), and the most frequently used mappings are held in translation lookaside buffers (TLBs). TLBs are held in a fast cache, so accessing a page through a TLB entry is much faster than accessing it through the page table.

Machines have a limited number of TLB entries, so it becomes important to maximize the hit rate on TLB entries (the TLB functions as a least-recently-used cache). Since each entry represents a page of memory, it is often advantageous to increase the page size used by an application. If each page represents more memory, fewer TLB entries are required to encompass the entire program, and it is more likely that a page will be found in the TLB when required. For example, mapping a 4 GB heap requires more than a million 4 KB pages but only 2,048 2 MB pages. This is true in general for any program, and it is true in particular for Java application servers or other Java programs with even a moderately sized heap.

Java supports this with the -XX:+UseLargePages option. The default value of this flag varies depending on the operating system configuration. On Windows, large pages must be enabled in the OS. In Windows terms, this means giving individual users the ability to lock pages into memory, which is possible only on server versions of Windows. Even so, the JVM on Windows defaults to using regular pages unless the UseLargePages flag is explicitly enabled.

On Linux, the UseLargePages flag is enabled by default, but the OS must also be configured to support large pages; otherwise, regular pages will be used. Hence, the effective behavior depends on the OS configuration.

On Solaris, no OS configuration is required, and large pages are enabled by default.

If the UseLargePages flag is enabled on a system that does not support large pages, no warning is given and the JVM uses regular pages. If the UseLargePages flag is enabled on a system that does support large pages, but for which no large pages are available (either because they are already all in use or because the operating system is misconfigured), the JVM will print a warning.

Linux Huge (Large) Pages

Linux refers to large pages as huge pages. The configuration of huge pages on Linux varies somewhat from release to release; for the most accurate instructions, consult the documentation for your release. But the general procedure for Linux 5 is this:

  1. Determine which huge page sizes the kernel supports. The size is based on the computer’s processor and the boot parameters given when the kernel starts, but the most common value is 2 MB:

    # grep Hugepagesize /proc/meminfo
    Hugepagesize:       2048 kB
    
  2. Figure out how many huge pages are needed. If a JVM will allocate a 4 GB heap and the system has 2 MB huge pages, 2048 huge pages will be needed for that heap. The number of huge pages that can be used is defined globally in the Linux kernel, so repeat this process for all the JVMs that will run (plus any other programs that will use huge pages). You should overestimate this value by 10% to account for other non-heap uses of huge pages (so the example here uses 2200 huge pages).
  3. Write out that value to the operating system (so it takes effect immediately):

    # echo 2200 > /proc/sys/vm/nr_hugepages
    
  4. Save that value in /etc/sysctl.conf so that it is preserved after rebooting:

    vm.nr_hugepages=2200
  5. On many versions of Linux, the amount of huge page memory that a user can allocate is limited. Edit the /etc/security/limits.conf file and add memlock entries for the user running your JVMs (e.g., in the example, the user appuser). The value is the huge page allocation in bytes (2200 pages × 2 MB = 4,613,734,400):

    appuser soft    memlock        4613734400
    appuser hard    memlock        4613734400

At this point, the JVM should be able to allocate the necessary huge pages. To verify that it works, run the following command:

# java -Xms4G -Xmx4G -XX:+UseLargePages -version
java version "1.7.0_17"
Java(TM) SE Runtime Environment (build 1.7.0_17-b02)
Java HotSpot(TM) 64-Bit Server VM (build 23.7-b01, mixed mode)

Successful completion of that command indicates that the huge pages are configured correctly. If the huge page memory configuration is not correct, a warning will be given:

Java HotSpot(TM) 64-Bit Server VM warning:
Failed to reserve shared memory (errno = 22).

Linux Transparent Huge Pages

Linux kernels starting with version 2.6.32 support transparent huge pages, which obviate the need for the configuration described above. Transparent huge pages must still be enabled for Java, which is done by changing the contents of /sys/kernel/mm/transparent_hugepage/enabled:

# cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
# echo always > /sys/kernel/mm/transparent_hugepage/enabled
# cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never

The default value in that file (shown in the output of the first command) is madvise—huge pages are used only for programs that explicitly advise the kernel that they will be using huge pages. The JVM does not issue that advisory, so the default value must be set to always (by issuing the second command). Be aware that this affects the JVM and any other programs run on the system; they will all run with huge pages.

If transparent huge pages are enabled, do not specify the UseLargePages flag. If that flag is explicitly set, the JVM will revert to using traditional huge pages if they are configured, or standard pages if traditional huge pages are not configured. If the flag is left at its default value, then transparent huge pages will be used (if they have been configured).

Windows Large Pages

Windows large pages can only be enabled on server-based Windows versions. Exact instructions for Windows 7 are given here; there will be some variations between releases.

  1. Start the Microsoft Management Console: press the Start button and in the Search box, type mmc.
  2. If the left-hand panel does not display a “Local Computer Policy” icon, select “Add/Remove Snap-in” from the File menu and add the “Group Policy Object Editor.” If that option is not available, then the version of Windows in use does not support large pages.
  3. In the left-hand panel, expand Local Computer Policy → Computer Configuration → Windows Settings → Security Settings → Local Policies and click on the “User Rights Assignment” folder.
  4. In the right-hand panel, double click on “Lock pages in memory.”
  5. In the pop up, add the user or group.
  6. Click OK.
  7. Quit the MMC.
  8. Reboot.

At this point, the JVM should be able to allocate the necessary large pages. To verify that it works, run the following command:

# java -Xms4G -Xmx4G -XX:+UseLargePages -version
java version "1.7.0_17"
Java(TM) SE Runtime Environment (build 1.7.0_17-b02)
Java HotSpot(TM) 64-Bit Server VM (build 23.7-b01, mixed mode)

If the command completes successfully like that, large pages are set up correctly. If the large memory configuration is incorrect, a warning is given:

Java HotSpot(TM) Server VM warning: JVM cannot use large page memory
because it does not have enough privilege to lock pages in memory.

Remember that the command will not print an error on a Windows system (like “home” versions) that does not support large pages—once the JVM finds out that large pages are not supported on the OS, it sets the UseLargePages flag to false, regardless of the command-line setting.

Large Page Sizes

On most Linux and Windows systems, the OS uses 2 MB large pages, but that number can vary depending on the operating system configuration.

Strictly speaking, it is the processor that defines the possible page sizes. Most current Intel and SPARC processors support a number of possible page sizes: 4 KB, 8 KB, 2 MB, 256 MB, and so on. However, the operating system determines which page sizes can actually be allocated. On Solaris, all processor page sizes are supported, and the JVM is free to allocate pages of any size. On Linux kernels (at least as of this writing), you can specify which processor-supported large page size should be used when the kernel is booted, but that is the only large page size an application can actually allocate. On Windows, the large page size is fixed (again, at least for now) at 2 MB.

To support Solaris, Java allows the size of the large pages it allocates to be set via the -XX:LargePageSizeInBytes=N flag. By default, that flag is set to 0, which means that the JVM should choose a processor-specific large page size.

That flag can be set on all platforms, and there is never any indication that the specified page size was or was not used. On a Linux system where you are allocating a very large heap, you might think you should specify -XX:LargePageSizeInBytes=256M to get the best chance of getting TLB cache hits. You can do that, and the JVM won’t complain, but it will still allocate only 2 MB pages (or whatever page size the kernel is set to support). In fact, it is possible to specify page sizes that don’t make any sense at all, like -XX:LargePageSizeInBytes=11111. Because that page size is unavailable, the JVM will simply use the default large page size for the platform.

So—for now at least—this flag is really useful only on Solaris. On Solaris, choose a different page size to use larger pages than the default (which is 4 MB). On systems with a large amount of memory, this will increase the number of pages that will fit in the TLB cache and improve performance. To find the available page sizes on Solaris, use the pagesize -a command.
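For example, on a Solaris system whose processor supports 256 MB pages, the JVM might be started like this (the heap sizes here are just illustrative):

# java -Xms4G -Xmx4G -XX:+UseLargePages -XX:LargePageSizeInBytes=256m -version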

Quick Summary

  1. Using large pages will usually measurably speed up applications.
  2. Large page support must be explicitly enabled in most operating systems.

Compressed oops

Chapter 4 mentioned that the performance of a 32-bit JVM is anywhere from 5% to 20% faster than the performance of a 64-bit JVM for the same task. This assumes, of course, that the application can fit in a 32-bit process space, which limits the size of the heap to less than 4 GB (in practical terms, often less than 3.5 GB, since the JVM needs some native memory space, and on certain versions of Windows, the limit is 3 GB).

This performance gap is due to 64-bit object references: they take up twice the space in the heap (eight bytes) that 32-bit references do (four bytes). That leads to more GC cycles, since there is now less room in the heap for other data.

The JVM can compensate for that additional memory by using compressed oops. “oop” stands for ordinary object pointer—oops are the handles the JVM uses as object references. When oops are only 32 bits long, they can reference only 4 GB of memory (2**32), which is why a 32-bit JVM is limited to a 4 GB heap size.[48] When oops are 64 bits long, they can reference terabytes of memory.

There is a middle ground here—what if there were 35-bit oops? Then the pointer could reference 32 GB of memory (2**35) and still take up less space in the heap than 64-bit references. The problem is that there aren’t 35-bit registers in which to store such references. Instead, the JVM can assume that the last three bits of the reference are all 0. Now every reference can be stored in 32 bits in the heap. When the reference is loaded into a 64-bit register, the JVM can shift it left by three bits (adding three zeros at the end). When the reference is saved back from a register into the heap, the JVM can shift it right by three bits, discarding the zeros at the end.

This leaves the JVM with pointers that can reference 32 GB of memory while using only 32 bits in the heap. However, it also means that the JVM cannot access any object at an address that isn’t divisible by 8, since any address from a compressed oop ends with three zeros. The first possible oop is 0x1, which when shifted becomes 0x8. The next oop is 0x2, which when shifted becomes 0x10 (16). Objects must therefore be located on an 8-byte boundary.
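The arithmetic can be sketched in a few lines of Java. This is purely illustrative (the real encoding and decoding happens in JVM-generated machine code), but it shows why every compressed oop refers to an 8-byte-aligned address:

public class CompressedOopSketch {
    // Decode: zero-extend the 32-bit compressed oop and shift left by 3 (multiply by 8)
    static long decode(int compressedOop) {
        return (compressedOop & 0xFFFFFFFFL) << 3;
    }

    // Encode: drop the three low-order bits, which are zero for 8-byte-aligned addresses
    static int encode(long address) {
        return (int) (address >>> 3);
    }

    public static void main(String[] args) {
        System.out.println(Long.toHexString(decode(0x1)));   // prints 8
        System.out.println(Long.toHexString(decode(0x2)));   // prints 10 (decimal 16)
    }
}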

It turns out that objects are already aligned on an 8-byte boundary in the JVM (both the 32- and 64-bit versions); this is the optimal alignment for most processors. So nothing is lost by using compressed oops. If the first object in the JVM is stored at location 0 and occupies 57 bytes, then the next object will be stored at location 64—wasting 7 bytes that cannot be allocated. That memory trade-off is worthwhile (and will occur whether compressed oops are used or not), because the object can be accessed faster given that 8-byte alignment.

But that is the reason the JVM doesn’t try to emulate a 36-bit reference, which could access 64 GB of memory. In that case, objects would have to be aligned on a 16-byte boundary, and the savings from storing the compressed pointer in the heap would be outweighed by the memory wasted between the aligned objects.

There are two implications of this. First, for heaps that are between 4 GB and 32 GB, use compressed oops. Compressed oops are enabled using the -XX:+UseCompressedOops flag; in Java 7 and later versions, they are enabled by default whenever the maximum heap size is less than 32 GB.[49]

Second, a program that uses a 31 GB heap and compressed oops will usually be faster than a program that uses a 33 GB heap. Although the 33 GB heap is larger, the extra space used by the pointers in that heap means that the larger heap will perform more frequent GC cycles and have worse performance.

Hence, it is better to use heaps that are less than 32 GB, or heaps that are at least a few GB larger than 32 GB. Once extra memory is added to the heap to make up for the space used by the uncompressed references, the number of GC cycles will be reduced. There is no hard rule for how much memory is needed before the GC impact of the uncompressed oops is ameliorated—but given that 20% of an average heap might be used for object references, and those references double in size when uncompressed, planning on at least 38 GB (roughly 20% more than 32 GB) is a good start.

Quick Summary

  1. Compressed oops are enabled by default whenever they are most useful.
  2. A 31 GB heap using compressed oops will often outperform slightly larger heaps that are too big to use compressed oops.

Summary

Although the Java heap is the memory region that gets the most attention, the entire footprint of the JVM is crucial to its performance, particularly in relation to the operating system. The tools discussed in this chapter allow you to track that footprint over time (and, crucially, to focus on the committed memory of the JVM rather than the reserved memory).

Certain ways that the JVM uses OS memory—particularly large pages—can also be tuned to improve performance. Long-running JVMs will almost always benefit by using large pages, particularly if they have large heaps.



[48] The same restriction applies at the operating system level, which is why any 32-bit process is limited to 4GB of address space.

[49] In Reducing Object Size, it was noted that the size of an object reference on a 64-bit JVM with a 32 GB heap is four bytes—which is the default case since compressed oops are enabled by default.
