The heap is typically the largest consumer of memory in a Java application, but the JVM also allocates and uses a significant amount of native memory. And while Chapter 7 discussed ways to efficiently manage the heap from a programmatic point of view, the configuration of the heap and how it interacts with the native memory of the operating system is another important factor in the overall performance of an application.
This chapter discusses these aspects of native (or operating system) memory. We start with a discussion of the entire memory use of the JVM, with a goal of understanding how to monitor that usage for performance issues. Then we’ll discuss various ways to tune the JVM and operating system for optimal memory use.
The heap (usually) accounts for the largest amount of memory used by the JVM, but the JVM also uses memory for its internal operations. This non-heap memory is native memory. Native memory can also be allocated in applications (via JNI calls to malloc() and similar methods, or when using NIO). The total of native and heap memory used by the JVM yields the total footprint of an application.
From the point of view of the operating system, this total footprint is the key to performance. If enough physical memory to contain the entire total footprint of an application is not available, performance may begin to suffer. The operative word here is “may.” There are parts of native memory that are really used only during startup (for instance, the memory used to load the JAR files in the classpath), and if that memory is swapped out, it won’t necessarily be noticed. Some of the native memory used by one Java process is shared with other Java processes on the system, and some smaller part is shared with other kinds of processes on the system. For the most part, though, for optimal performance you want to be sure that the total footprint of all Java processes does not exceed the physical memory of the machine (plus you want to leave some memory available for other applications).
To measure the total footprint of a process, you need to use an operating-system-specific tool. On Unix-based systems, programs like top and ps can show you that data at a basic level; on Windows, you can use perfmon or VMMap. No matter which tool and platform are used, you need to look at the actual allocated memory (as opposed to the reserved memory) of the process.
The distinction between allocated and reserved memory comes about as a result of the way the JVM (and all programs) manage memory. Consider a heap that is specified with the parameters -Xms512m -Xmx2048m. The heap starts by using 512 MB, and it will be resized as needed to meet the GC goals of the application.
That concept is the essential difference between committed (or allocated) memory and reserved memory (sometimes called the virtual size of a process). The JVM must tell the operating system that it might need as much as 2 GB of memory for the heap, so that memory is reserved—the operating system promises that when the JVM attempts to allocate additional memory when it increases the size of the heap, that memory will be available.
Still, only 512 MB of that memory is actually allocated initially, and that 512 MB is all of the memory that actually is being used (for the heap). That (actually allocated) memory is known as the committed memory. The amount of committed memory will fluctuate as the heap resizes; in particular, as the heap size increases, the committed memory correspondingly increases.
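The committed size of the heap can be observed from within a running program via the standard java.lang.management API, where the heap's "max" value corresponds to the reserved size. A minimal sketch (the class name is an illustrative choice):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class CommittedVsReserved {
    public static void main(String[] args) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        // init ~ -Xms, max ~ -Xmx (reserved); committed is what the OS
        // has actually allocated to the heap right now.
        System.out.println("init      = " + heap.getInit());
        System.out.println("used      = " + heap.getUsed());
        System.out.println("committed = " + heap.getCommitted());
        System.out.println("max       = " + heap.getMax());
    }
}
```

Running this under different -Xms/-Xmx settings shows the committed value fluctuating between the initial and maximum sizes as the heap resizes.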
This difference applies to almost all significant memory that the JVM allocates. The code cache grows from an initial to a maximum value as more code gets compiled. Permgen or the class metaspace is allocated separately and grows between its initial (committed) size and its maximum (reserved) size.
One exception to this is thread stacks. Every time the JVM creates a thread, the OS allocates some native memory to hold that thread’s stack, committing more memory to the process (until the thread exits, at least). Thread stacks, though, are fully allocated when they are created.
On Unix systems, the actual footprint of an application can be estimated by the resident set size (RSS) of the process as reported by various OS tools. That value is a good estimate of the amount of committed memory a process is using, though it is inexact in two ways. First, the few pages that are shared at the OS level between the JVM and other processes (that is, the text portions of shared libraries) are counted in the RSS of each process. Second, a process may have committed more memory than it has actually paged in at any moment. Still, tracking the RSS of a process is a good first-pass way to monitor the total memory use. On more recent Linux kernels, the proportional set size (PSS) is a refinement of the RSS that removes the data shared with other programs.
On Windows systems, the equivalent idea is called the working set of an application, which is what is reported by the task manager.
To minimize the footprint used by the JVM, limit the amount of memory used by the heap, thread stacks, the code cache, and direct byte buffers.
Developers can allocate native memory via JNI calls, but NIO byte buffers will also allocate native memory if they are created via the allocateDirect() method. Native byte buffers are quite important from a performance perspective, since they allow native code and Java code to share data without copying it. The most common example here is buffers that are used for filesystem and socket operations. Writing data to a native NIO buffer and then sending that data to the channel (e.g., the file or socket) requires no copying of data between the JVM and the C library used to transmit the data. If a heap byte buffer is used instead, the contents of the buffer must be copied by the JVM.
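A minimal sketch of that pattern, writing through a direct buffer to a FileChannel (the class name, temp-file name, and buffer size are illustrative choices, not from the original text):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class DirectBufferWrite {
    public static void main(String[] args) throws IOException {
        // A direct buffer's contents live outside the heap, so the
        // channel can hand the data to the OS without an extra copy.
        ByteBuffer buf = ByteBuffer.allocateDirect(4096);
        buf.put("hello".getBytes(StandardCharsets.US_ASCII));
        buf.flip(); // switch the buffer from writing to reading

        Path tmp = Files.createTempFile("direct", ".bin");
        try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.WRITE)) {
            ch.write(buf); // no intermediate heap copy needed
        }
        System.out.println(new String(Files.readAllBytes(tmp), StandardCharsets.US_ASCII));
        Files.delete(tmp);
    }
}
```

Had the buffer been allocated with ByteBuffer.allocate() instead, the JVM would copy its contents into a temporary native buffer during the write.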
The allocateDirect() method call is quite expensive, so direct byte buffers should be reused as much as possible. The ideal situation is when threads are independent and each can keep a direct byte buffer as a thread-local variable. That can sometimes use too much native memory if many threads need buffers of variable sizes, since eventually each thread will end up with a buffer at the maximum possible size. For that kind of situation, or when thread-local buffers don’t fit the application design, an object pool of direct byte buffers may be more useful.
Byte buffers can also be managed by slicing them. The application can allocate one very large direct byte buffer, and individual requests can allocate a portion out of that buffer using the slice() method of the ByteBuffer class. This solution can become unwieldy when the slices are not always the same size: the original byte buffer can then become fragmented in the same way the heap becomes fragmented when allocating and freeing objects of different sizes. Unlike the heap, however, the individual slices of a byte buffer cannot be compacted, so this solution really works well only when all the slices are a uniform size.
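Such uniform slicing can be sketched like this (the class name and the 1 KB slice size are illustrative assumptions):

```java
import java.nio.ByteBuffer;

public class BufferSlices {
    public static void main(String[] args) {
        // One large direct buffer, carved into fixed-size slices that
        // individual requests can use independently; uniform sizes avoid
        // the fragmentation problem described in the text.
        final int SLICE_SIZE = 1024;
        ByteBuffer master = ByteBuffer.allocateDirect(4 * SLICE_SIZE);

        ByteBuffer[] slices = new ByteBuffer[4];
        for (int i = 0; i < slices.length; i++) {
            // Set the window for this slice, then slice() shares that
            // region of the master buffer (no copying).
            master.position(i * SLICE_SIZE).limit((i + 1) * SLICE_SIZE);
            slices[i] = master.slice();
        }

        slices[2].putInt(0, 42); // write through one slice
        System.out.println(slices[2].getInt(0));   // 42
        System.out.println(slices[0].capacity());  // 1024
    }
}
```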
From a tuning perspective, the one thing to realize with any of these programming models is that the amount of direct byte buffer space that an application can allocate can be limited by the JVM. The total amount of memory that can be allocated for direct byte buffers is specified by setting the -XX:MaxDirectMemorySize=N flag. Starting in Java 7, the default value for this flag is 0, which means there is no limit (subject to the address space size and any operating-system limits on the process). That flag can be set to limit the direct byte buffer use of an application (and to provide compatibility with previous releases of Java, where the limit was 64 MB).
Beginning in Java 8, the JVM allows some visibility into how it allocates native memory via this option: -XX:NativeMemoryTracking=off|summary|detail. By default, Native Memory Tracking (NMT) is off. If the summary or detail mode is enabled, you can get the native memory information at any time from jcmd:
% jcmd process_id VM.native_memory summary
If the JVM is started with the argument -XX:+PrintNMTStatistics (by default, false), the JVM will print out information about the allocations when the program exits.
Here is the summary output from a JVM running with a 512 MB initial heap size and a 4 GB maximum heap size:
Native Memory Tracking: Total: reserved=4787210KB, committed=1857677KB
Although the JVM has made memory reservations totaling 4.7 GB, it has used much less than that: only 1.8 GB total. This is fairly typical (and one reason not to pay particular attention to the virtual size of the process displayed in OS tools, since that reflects only the memory reservations).
This memory usage breaks down as follows:
- Java Heap (reserved=4296704KB, committed=1470428KB) (mmap: reserved=4296704KB, committed=1470428KB)
The heap itself is (unsurprisingly) the largest part of the reserved memory at 4 GB. But the dynamic sizing of the heap meant it grew only to 1.4 GB.
- Class (reserved=65817KB, committed=60065KB) (classes #19378) (malloc=6425KB, #14245) (mmap: reserved=59392KB, committed=53640KB)
This is the native memory used to hold class metadata. Again, note that the JVM has reserved more memory than it actually used to hold the 19,378 classes in the program.
- Thread (reserved=84455KB, committed=84455KB) (thread #77) (stack: reserved=79156KB, committed=79156KB) (malloc=243KB, #314) (arena=5056KB, #154)
77 thread stacks were allocated at about 1 MB each.
- Code (reserved=102581KB, committed=15221KB) (malloc=2741KB, #4520) (mmap: reserved=99840KB, committed=12480KB)
This is the JIT code cache. 19,378 classes is not very many, so just a small section of the code cache is committed.
- GC (reserved=183268KB, committed=173156KB) (malloc=5768KB, #110) (mmap: reserved=177500KB, committed=167388KB)
These are areas outside of the heap that GC algorithms use for their processing.
- Compiler (reserved=162KB, committed=162KB) (malloc=63KB, #229) (arena=99KB, #3)
Similarly, this area is used by the compiler for its operations, apart from the resulting code placed in the code cache.
- Symbol (reserved=12093KB, committed=12093KB) (malloc=10039KB, #110773) (arena=2054KB, #1)
Interned String
references and symbol table references are held here.
- Memory Tracking (reserved=22466KB, committed=22466KB) (malloc=22466KB, #1872)
Native memory tracking itself needs some space for its operation.
NMT provides two key pieces of information: the total committed size of the process, which is the value to compare against the machine’s physical memory, and the committed size of each JVM subsystem, which shows where that memory is going.
NMT also allows you to track how memory allocations occur over time. After the JVM is started with NMT enabled, you can establish a baseline for memory usage by using this command:
% jcmd process_id VM.native_memory baseline
That causes the JVM to mark its current memory allocations. Later, you can compare the current memory usage to that mark:
% jcmd process_id VM.native_memory summary.diff
Native Memory Tracking:
Total: reserved=5896078KB -3655KB, committed=2358357KB -448047KB
- Java Heap (reserved=4194304KB, committed=1920512KB -444927KB)
(mmap: reserved=4194304KB, committed=1920512KB -444927KB)
....
In this case, the JVM has reserved 5.8 GB of memory and is presently using 2.3 GB. That committed size is 448 MB less than when the baseline was established. Similarly, the committed memory used by the heap has declined by 444 MB (and the rest of the output could be inspected to see where else the memory use declined to account for the remaining 4 MB).
This is a very useful technique to examine the footprint of the JVM over time.
There are several tunings that the JVM can use to improve the way in which it uses OS memory.
Discussions about memory allocation and swapping occur in terms of pages. A page is a unit of memory by which operating systems manage physical memory. A page is the minimum unit of allocation for the operating system: when one byte is allocated, the operating system must allocate an entire page. Further allocations for that program come from that same page until it is filled, at which point a new page is allocated.
The operating system allocates many more pages than can fit in physical memory, which is why there is paging: pages of the address space are moved to and from swap space (or other storage, depending on what the page contains). This means there must be some mapping between these pages and where they are currently stored in the computer’s RAM. Those mappings are handled in two different ways. All page mappings are held in a global page table (which the OS can scan to find a particular mapping), and the most frequently used mappings are held in translation lookaside buffers (TLBs). TLBs are held in a fast cache, so accessing pages through a TLB entry is much faster than accessing them through the page table.
Machines have a limited number of TLB entries, so it becomes important to maximize their hit rate (the TLB functions as a least-recently-used cache). Since each entry represents a page of memory, it is often advantageous to increase the page size used by an application. If each page represents more memory, fewer TLB entries are required to encompass the entire program, and it is more likely that a page will be found in the TLB when required. This is true in general for any program, and so is true in particular for Java application servers or other Java programs with even a moderately sized heap.
Java supports this with the -XX:+UseLargePages option. The default value of this flag varies depending on the operating system configuration.

On Windows, large pages must be enabled in the OS. In Windows terms, this means giving individual users the ability to lock pages into memory, which is possible only on server versions of Windows. Even so, the JVM on Windows defaults to using regular pages unless the UseLargePages flag is explicitly enabled.

On Linux, the UseLargePages flag is enabled by default, but the OS must also be configured to support large pages; otherwise, regular pages will be used. Hence, the default depends on the OS configuration.

On Solaris, no OS configuration is required, and large pages are enabled by default.
If the UseLargePages flag is enabled on a system that does not support large pages, no warning is given and the JVM uses regular pages. If the UseLargePages flag is enabled on a system that does support large pages, but for which no large pages are available (either because they are already all in use or because the operating system is misconfigured), the JVM will print a warning.
Linux refers to large pages as huge pages. The configuration of huge pages on Linux varies somewhat from release to release; for the most accurate instructions, consult the documentation for your release. But the general procedure for Linux 5 is this:
Determine which huge page sizes the kernel supports. The size is based on the computer’s processor and the boot parameters given when the kernel is started, but the most common value is 2 MB:
# grep Hugepagesize /proc/meminfo
Hugepagesize: 2048 kB
Write out that value to the operating system (so it takes effect immediately):
# echo 2200 > /proc/sys/vm/nr_hugepages
Save that value in /etc/sysctl.conf so that it is preserved after rebooting:
vm.nr_hugepages=2200
On many versions of Linux, the amount of huge page memory that a user can allocate is limited. Edit the /etc/security/limits.conf file and add memlock
entries for the user running your JVMs (e.g., in the example, the user appuser
):
appuser soft memlock 4613734400
appuser hard memlock 4613734400
At this point, the JVM should be able to allocate the necessary huge pages. To verify that it works, run the following command:
# java -Xms4G -Xmx4G -XX:+UseLargePages -version
java version "1.7.0_17"
Java(TM) SE Runtime Environment (build 1.7.0_17-b02)
Java HotSpot(TM) 64-Bit Server VM (build 23.7-b01, mixed mode)
Successful completion of that command indicates that the huge pages are configured correctly. If the huge page memory configuration is not correct, a warning will be given:
Java HotSpot(TM) 64-Bit Server VM warning: Failed to reserve shared memory (errno = 22).
Linux kernels starting with version 2.6.32 support transparent huge pages, which obviate the need for the configuration described above. Transparent large pages must still be enabled for Java, which is done by changing the contents of /sys/kernel/mm/transparent_hugepage/enabled.
# cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
# echo always > /sys/kernel/mm/transparent_hugepage/enabled
# cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
The default value in that file (shown in the output from the first command) is madvise: huge pages are used only for programs that explicitly advise the kernel that they will be using huge pages. The JVM does not issue that advisory, so the default value must be set to always (by issuing the second command). Be aware that this affects the JVM and any other programs run on the system; they will all run with huge pages.
If transparent huge pages are enabled, do not specify the UseLargePages flag. If that flag is explicitly set, the JVM will return to using traditional huge pages if they are configured, or standard pages if traditional huge pages are not configured. If the flag is left at its default value, transparent huge pages will be used (if they have been configured).
Windows large pages can be enabled only on server-based Windows versions. Exact instructions for Windows 7 are given here; there will be some variations between releases. The configuration is done by granting the user who runs the JVM the “Lock pages in memory” privilege, using the Group Policy snap-in within mmc.
At this point, the JVM should be able to allocate the necessary large pages. To verify that it works, run the following command:
# java -Xms4G -Xmx4G -XX:+UseLargePages -version
java version "1.7.0_17"
Java(TM) SE Runtime Environment (build 1.7.0_17-b02)
Java HotSpot(TM) 64-Bit Server VM (build 23.7-b01, mixed mode)
If the command completes successfully like that, large pages are set up correctly. If the large memory configuration is incorrect, a warning is given:
Java HotSpot(TM) Server VM warning: JVM cannot use large page memory because it does not have enough privilege to lock pages in memory.
Remember that the command will not print an error on a Windows system (like the “home” versions) that does not support large pages: once the JVM finds out that large pages are not supported on the OS, it sets the UseLargePages flag to false, regardless of the command-line setting.
On most Linux and Windows systems, the OS uses 2 MB large pages, but that number can vary depending on the operating system configuration.
Strictly speaking, it is the processor that defines the possible page sizes. Most current Intel and SPARC processors support a number of possible page sizes: 4 KB, 8 KB, 2 MB, 256 MB, and so on. However, the operating system determines which page sizes can actually be allocated. On Solaris, all processor page sizes are supported, and the JVM is free to allocate pages of any size. On Linux kernels (at least as of this writing), you can specify which processor-supported large page size should be used when the kernel is booted, but that is the only large page size an application can actually allocate. On Windows, the large page size is fixed (again, at least for now) at 2 MB.
To support Solaris, Java allows the size of the large pages it allocates to be set via the -XX:LargePageSizeInBytes=N flag. By default, that flag is set to 0, which means that the JVM should choose a processor-specific large page size.
That flag can be set on all platforms, but there is never any indication of whether the specified page size was actually used. On a Linux system where you are allocating a very large heap, you might think you should specify -XX:LargePageSizeInBytes=256M to get the best chance of getting TLB cache hits. You can do that, and the JVM won’t complain, but it will still allocate only 2 MB pages (or whatever page size the kernel is set to support). In fact, it is possible to specify page sizes that don’t make any sense at all, like -XX:LargePageSizeInBytes=11111. Because that page size is unavailable, the JVM will simply use the default large page size for the platform.
So, for now at least, this flag is really useful only on Solaris. On Solaris, choose a different page size to use larger pages than the default (which is 4 MB). On systems with a large amount of memory, this will increase the number of pages that fit in the TLB cache and improve performance. To find the available page sizes on Solaris, use the pagesize -a command.
Chapter 4 mentioned that the performance of a 32-bit JVM is anywhere from 5% to 20% faster than the performance of a 64-bit JVM for the same task. This assumes, of course, that the application can fit in a 32-bit process space, which limits the size of the heap to less than 4 GB. (In practical terms, this often means less than 3.5 GB, since the JVM needs some native memory space, and on certain versions of Windows, the limit is 3 GB.)
This performance gap exists because of 64-bit object references: 64-bit references take up twice the space in the heap (eight bytes) as 32-bit references (four bytes). That leads to more GC cycles, since there is now less room in the heap for other data.
The JVM can compensate for that additional memory by using compressed oops. “oop” stands for ordinary object pointer; oops are the handles the JVM uses as object references. When oops are only 32 bits long, they can reference only 4 GB of memory (2**32), which is why a 32-bit JVM is limited to a 4 GB heap size.[48] When oops are 64 bits long, they can reference terabytes of memory.
There is a middle ground here—what if there were 35-bit oops? Then the pointer could reference 32 GB of memory (2**35) and still take up less space in the heap than 64-bit references. The problem is that there aren’t 35-bit registers in which to store such references. Instead, though, the JVM can assume that the last three bits of the reference are all 0. Now every reference can be stored in 32 bits in the heap. When the reference is stored into a 64-bit register, the JVM can shift it left by three bits (adding three zeros at the end). When the reference is saved from a register, the JVM can right-shift it by three bits, discarding the zeros at the end.
This leaves the JVM with pointers that can reference 32 GB of memory while using only 32 bits in the heap. However, it also means that the JVM cannot access any object at an address that isn’t divisible by 8, since any address from a compressed oop ends with three zeros. The first possible oop is 0x1, which when shifted becomes 0x8. The next oop is 0x2, which when shifted becomes 0x10 (16). Objects must therefore be located on an 8-byte boundary.
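The shift arithmetic can be sketched in plain Java (this is an illustration of the encoding, not the JVM’s actual implementation; the class and method names are hypothetical):

```java
public class CompressedOopMath {
    // Decompress: a 32-bit compressed oop becomes a real address by
    // shifting left 3 bits (appending three zero bits).
    static long decode(int compressedOop) {
        return (compressedOop & 0xFFFFFFFFL) << 3;
    }

    // Compress: a real address (always 8-byte aligned, so its low
    // 3 bits are zero) is shifted right 3 bits to fit in 32 bits.
    static int encode(long address) {
        return (int) (address >>> 3);
    }

    public static void main(String[] args) {
        System.out.println(decode(0x1)); // 8
        System.out.println(decode(0x2)); // 16
        // The largest 32-bit compressed oop reaches just under 32 GB:
        System.out.println(decode(0xFFFFFFFF) == (1L << 35) - 8); // true
    }
}
```

Since (2**32 - 1) * 8 is just below 2**35 bytes, 32-bit compressed oops span the full 32 GB range described above.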
It turns out that objects are already aligned on an 8-byte boundary in the JVM (both the 32- and 64-bit versions); this is the optimal alignment for most processors. So nothing is lost by using compressed oops. If the first object in the JVM is stored at location 0 and occupies 57 bytes, then the next object will be stored at location 64—wasting 7 bytes that cannot be allocated. That memory trade-off is worthwhile (and will occur whether compressed oops are used or not), because the object can be accessed faster given that 8-byte alignment.
But that is the reason the JVM doesn’t try to emulate a 36-bit reference, which could access 64 GB of memory. In that case, objects would have to be aligned on a 16-byte boundary, and the savings from storing the compressed pointer in the heap would be outweighed by the amount of memory wasted between the memory-aligned objects.
There are two implications of this. First, for heaps that are between 4 GB and 32 GB, use compressed oops. Compressed oops are enabled using the -XX:+UseCompressedOops flag; in Java 7 and later versions, they are enabled by default whenever the maximum heap size is less than 32 GB.[49]
Second, a program that uses a 31 GB heap and compressed oops will usually be faster than a program that uses a 33 GB heap. Although the 33 GB heap is larger, the extra space used by the pointers in that heap means that the larger heap will perform more frequent GC cycles and have worse performance.
Hence, it is better to use heaps that are less than 32 GB, or heaps that are at least a few GB larger than 32 GB. Once extra memory is added to the heap to make up for the space used by the uncompressed references, the number of GC cycles will be reduced. There is no hard rule there for how much memory is needed before the GC impact of the uncompressed oops is ameliorated—but given that 20% of an average heap might be used for object references, planning on at least 38 GB is a good start.
Although the Java heap is the memory region that gets the most attention, the entire footprint of the JVM is crucial to its performance, particularly in relation to the operating system. The tools discussed in this chapter allow you to track that footprint over time (and, crucially, to focus on the committed memory of the JVM rather than the reserved memory).
Certain ways that the JVM uses OS memory—particularly large pages—can also be tuned to improve performance. Long-running JVMs will almost always benefit by using large pages, particularly if they have large heaps.
[48] The same restriction applies at the operating system level, which is why any 32-bit process is limited to 4 GB of address space.
[49] In Reducing Object Size, it was noted that the size of an object reference on a 64-bit JVM with a 32 GB heap is four bytes—which is the default case since compressed oops are enabled by default.