3. Garbage First Garbage Collector Performance Tuning

Performance engineering deals with the nonfunctional performance requirements of a system or its software and ensures that the design requirements are met in the product implementation. Thus it goes hand-in-hand with systems or software engineering.

In this chapter, we will discuss performance tuning in detail, concentrating on the newest garbage collector in Java HotSpot VM: Garbage First, or G1. We will skip young generation tuning advice already provided in Java™ Performance [1], particularly Chapter 7, “Tuning the JVM, Step by Step.” Readers are encouraged to read that chapter and also the previous chapters in this supplemental book.

The Stages of a Young Collection

A G1 young collection has serial and parallel phases. The pause is serial in the sense that several tasks can be carried out only after certain other tasks are completed during a given stop-the-world pause. The parallel phases employ multiple GC worker threads, each with its own work queue; a worker thread that has completed the tasks in its own queue can steal work from other threads' work queues.


Tip

The serial stages of the young collection pause can be multithreaded and use the value of -XX:ParallelGCThreads to determine the GC worker thread count.


Let’s look at an excerpt from a G1 GC output log generated while running DaCapo with the HotSpot VM command-line option -XX:+PrintGCDetails. Here is the command-line and “Java version” output:

JAVA_OPTS="-XX:+UseG1GC -XX:+PrintGCDetails -Xloggc:jdk8u45_h2.log"
       MB:DaCapo mb$ java -version
       java version "1.8.0_45"
       Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
       Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)

This is the GC log snippet (jdk8u45_h2.log):

108.815: [GC pause (G1 Evacuation Pause) (young), 0.0543862 secs]
   [Parallel Time: 52.1 ms, GC Workers: 8]
      [GC Worker Start (ms): Min: 108815.5, Avg: 108815.5, Max: 108815.6, Diff: 0.1]
      [Ext Root Scanning (ms): Min: 0.1, Avg: 0.2, Max: 0.2, Diff: 0.1, Sum: 1.2]
      [Update RS (ms): Min: 12.8, Avg: 13.0, Max: 13.2, Diff: 0.4, Sum: 103.6]
         [Processed Buffers: Min: 15, Avg: 16.0, Max: 17, Diff: 2, Sum: 128]
      [Scan RS (ms): Min: 13.4, Avg: 13.6, Max: 13.7, Diff: 0.3, Sum: 109.0]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
      [Object Copy (ms): Min: 25.1, Avg: 25.2, Max: 25.2, Diff: 0.1, Sum: 201.5]
      [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.1, Diff: 0.1, Sum: 0.4]
      [GC Worker Total (ms): Min: 51.9, Avg: 52.0, Max: 52.1, Diff: 0.1, Sum: 416.0]
      [GC Worker End (ms): Min: 108867.5, Avg: 108867.5, Max: 108867.6, Diff: 0.1]
   [Code Root Fixup: 0.1 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 0.2 ms]
   [Other: 2.0 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 0.1 ms]
      [Ref Enq: 0.0 ms]
      [Redirty Cards: 0.2 ms]
      [Humongous Reclaim: 0.0 ms]
      [Free CSet: 1.2 ms]
   [Eden: 537.0M(537.0M)->0.0B(538.0M) Survivors: 23.0M->31.0M Heap: 849.2M(1024.0M)->321.4M(1024.0M)]

The snippet shows one G1 GC young collection pause, identified in the first line by (G1 Evacuation Pause) and (young). The line’s timestamp is 108.815, and total pause time is 0.0543862 seconds:

108.815: [GC pause (G1 Evacuation Pause) (young), 0.0543862 secs]

Start of All Parallel Activities

The second line of the log snippet shows the total time spent in the parallel phase and the GC worker thread count:

[Parallel Time: 52.1 ms, GC Workers: 8]

The following lines show the major parallel work carried out by the eight worker threads:

[GC Worker Start (ms): Min: 108815.5, Avg: 108815.5, Max: 108815.6, Diff: 0.1]
[Ext Root Scanning (ms): Min: 0.1, Avg: 0.2, Max: 0.2, Diff: 0.1, Sum: 1.2]
[Update RS (ms): Min: 12.8, Avg: 13.0, Max: 13.2, Diff: 0.4, Sum: 103.6]
    [Processed Buffers: Min: 15, Avg: 16.0, Max: 17, Diff: 2, Sum: 128]
[Scan RS (ms): Min: 13.4, Avg: 13.6, Max: 13.7, Diff: 0.3, Sum: 109.0]
[Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
[Object Copy (ms): Min: 25.1, Avg: 25.2, Max: 25.2, Diff: 0.1, Sum: 201.5]
[Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
[GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.1, Diff: 0.1, Sum: 0.4]
[GC Worker Total (ms): Min: 51.9, Avg: 52.0, Max: 52.1, Diff: 0.1, Sum: 416.0]
[GC Worker End (ms): Min: 108867.5, Avg: 108867.5, Max: 108867.6, Diff: 0.1]

GC Worker Start and GC Worker End tag the starting and ending timestamps respectively of the parallel phase. The Min timestamp for GC Worker Start is the time at which the first worker thread started; similarly, the Max timestamp for GC Worker End is the time at which the last worker thread completed all its tasks. The lines also contain Avg and Diff values in milliseconds. The things to look out for in those lines are:

- How far away the Diff value is from 0, 0 being the ideal.

- Any major variance in Max, Min, or Avg. This indicates that the worker threads could not start or finish their parallel work around the same time. That could mean that some sort of queue-handling issue exists that requires further analysis by looking at the parallel work done during the parallel phase.
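As a quick way to eyeball these values across a large log, the Diff field can be pulled out of each GC Worker Start line with standard text tools. This is only an illustrative one-liner, assuming the exact line format shown above:

```shell
# Extract the Diff value from a "GC Worker Start" line (the field layout
# is assumed to match the -XX:+PrintGCDetails output shown above).
LINE='[GC Worker Start (ms): Min: 108815.5, Avg: 108815.5, Max: 108815.6, Diff: 0.1]'
DIFF=$(printf '%s\n' "$LINE" | awk -F'Diff: ' '{print $2}' | tr -d ']')
printf '%s\n' "$DIFF"   # 0.1
```

Running the same extraction over every pause in a log and sorting the results makes outlier pauses with poor worker-thread balance easy to spot.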

External Root Regions

External root region scanning (Ext Root Scanning) is one of the first parallel tasks. During this phase the external (off-heap) roots, such as the JVM's system dictionary, VM data structures, JNI thread handles, hardware registers, global variables, and thread stack roots, are scanned to find any that point into the current pause's collection set (CSet).

[Ext Root Scanning (ms): Min: 0.1, Avg: 0.2, Max: 0.2, Diff: 0.1, Sum: 1.2]

Here again, we look for Diff >> 0 and major variance in Max, Min, or Avg.


Tip

The variance (Diff) is shown for all the timed activities that make up the parallel phase. A high variance usually means that the work is not balanced across the parallel threads for that particular activity. This knowledge is an analysis starting point, and ideally a deeper dive will identify the potential cause, which may require refactoring the Java application.


Another thing to watch out for is a worker thread getting caught up in dealing with a single root. We have seen issues where the system dictionary, which is treated as a single root, ends up holding up a worker thread when there is a large number of loaded classes. A worker thread held up in this way will also be late for “termination” (explained later in this section).

Remembered Sets and Processed Buffers

[Update RS (ms): Min: 12.8, Avg: 13.0, Max: 13.2, Diff: 0.4, Sum: 103.6]
    [Processed Buffers: Min: 15, Avg: 16.0, Max: 17, Diff: 2, Sum: 128]

As explained in Chapter 2, “Garbage First Garbage Collection in Depth,” G1 GC uses remembered sets (RSets) to help maintain and track references into G1 GC regions that “own” those RSets. The concurrent refinement threads, also discussed in Chapter 2, are tasked with scanning the update log buffers and updating RSets for the regions with dirty cards. In order to supplement the work carried out by the concurrent refinement threads, any remainder buffers that were logged but not yet processed by the refinement threads are handled during the parallel phase of the collection pause and are processed by the worker threads. These buffers are what are referred to as Processed Buffers in the log snippet.

In order to limit the time spent updating RSets, G1 sets a target time as a percentage of the pause time goal (-XX:MaxGCPauseMillis). The target time defaults to 10 percent of the pause time goal. Any evacuation pause should spend most of its time copying live objects, and 10 percent of the pause time goal is considered a reasonable amount of time to spend updating RSets. If after looking at the logs you realize that spending 10 percent of your pause time goal in updating RSets is undesirable, you can change the percentage by updating the -XX:G1RSetUpdatingPauseTimePercent command-line option to reflect your desired value. It is important to remember, however, that if the number of updated log buffers does not change, any decrease in RSet update time during the collection pause will result in fewer buffers being processed during that pause. This will push the log buffer update work off onto the concurrent refinement threads and will result in increased concurrent work and sharing of resources with the Java application mutator threads. Also, worst case, if the concurrent refinement threads cannot keep up with the log buffer update rate, the Java application mutators must step in and help with the processing—a scenario best avoided!
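As a hypothetical illustration (the values below are examples, not recommendations), lowering the budget from the default 10 percent to 5 percent of a 200 ms pause goal leaves roughly 10 ms per pause for Update RS, pushing more of the log buffer processing onto the concurrent refinement threads:

```shell
# Hypothetical tuning sketch: shrink the RSet-updating share of the pause.
PAUSE_GOAL_MS=200
RSET_PCT=5                       # default is 10
JAVA_OPTS="-XX:+UseG1GC -XX:MaxGCPauseMillis=${PAUSE_GOAL_MS} \
-XX:G1RSetUpdatingPauseTimePercent=${RSET_PCT}"
# Per-pause time budget available for Update RS:
echo $(( PAUSE_GOAL_MS * RSET_PCT / 100 ))   # 10 (ms)
```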


Tip

As discussed in Chapter 2, there is a command-line option called -XX:G1ConcRefinementThreads. By default it is set to the same value as -XX:ParallelGCThreads, which means that any change to -XX:ParallelGCThreads will change the -XX:G1ConcRefinementThreads value as well.


Before collecting regions in the current CSet, the RSets for the regions in the CSet must be scanned for references into the CSet regions. As discussed in Chapter 2, a popular object in a region or a popular region itself can lead to its RSet being coarsened from a sparse PRT (per-region table) to a fine-grained PRT or even a coarsened bitmap, and thus scanning such an RSet will require more time. In such a scenario, you will see an increase in the Scan RS time shown here since the scan times depend on the coarseness gradient of the RSet data structures:

[Scan RS (ms): Min: 13.4, Avg: 13.6, Max: 13.7, Diff: 0.3, Sum: 109.0]

Another parallel task related to RSets is code root scanning, during which the code root set is scanned to find references into the current CSet:

[Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]

In earlier versions of HotSpot, the entire code cache was treated as a single root and was claimed and processed by a single worker thread. A large and full or nearly full code cache would thus hold up that worker thread and lead to an increase in the total pause time. With the introduction of code root scanning as a separate parallel activity, the work of scanning the nmethods is reduced to just scanning the RSets for references from the compiled code. Hence for a particular region in the CSet, only if the RSet for that region has strong code roots is the corresponding nmethod scanned.


Tip

Developers often refer to the dynamically compiled code for a Java method by the HotSpot term of art nmethod. An nmethod is not to be confused with a native method, which refers to a JNI method. Nmethods include auxiliary information such as constant pools in addition to generated code.



Tip

To reduce nmethod scanning times, only the RSets of the regions in the CSet are scanned for references introduced by the compiler, rather than the “usual” references that are introduced by the Java application mutator threads.


Summarizing Remembered Sets

The option -XX:+G1SummarizeRSetStats can be used to provide a window into the total number of RSet coarsenings (fine-grained PRT or coarsened bitmap) to help determine if concurrent refinement threads are able to handle the updated buffers and to gather more information on nmethods. This option summarizes RSet statistics every nth GC pause, where n is set by -XX:G1SummarizeRSetStatsPeriod=n.


Tip

-XX:+G1SummarizeRSetStats is a diagnostic option and hence must be enabled by adding -XX:+UnlockDiagnosticVMOptions to the command line, for example,

JAVA_OPTS="-XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+PrintGCDetails -XX:+G1SummarizeRSetStats -XX:G1SummarizeRSetStatsPeriod=1 -Xloggc:jdk8u45_h2.log"


Here is a GC log snippet with the RSets summarized on every GC pause:

Before GC RS summary

 Recent concurrent refinement statistics
  Processed 23270 cards
  Of 96 completed buffers:
           96 (100.0%) by concurrent RS threads.
            0 (  0.0%) by mutator threads.
  Did 0 coarsenings.
  Concurrent RS threads times (s)
          1.29     1.29     1.29     1.29     1.29     1.29     1.29     0.00
  Concurrent sampling threads times (s)
          0.97

 Current rem set statistics
  Total per region rem sets sizes = 4380K. Max = 72K.
         767K ( 17.5%) by 212 Young regions
          29K (  0.7%) by 9 Humongous regions
        2151K ( 49.1%) by 648 Free regions
        1431K ( 32.7%) by 155 Old regions
   Static structures = 256K, free_lists = 0K.
    816957 occupied cards represented.
        13670 (  1.7%) entries by 212 Young regions
            4 (  0.0%) entries by 9 Humongous regions
            0 (  0.0%) entries by 648 Free regions
       803283 ( 98.3%) entries by 155 Old regions
    Region with largest rem set = 4:(O)[0x00000006c0400000,0x00000006c0500000,0x00000006c0500000], size = 72K, occupied = 190K.
  Total heap region code root sets sizes = 40K.  Max = 22K.
           3K (  8.7%) by 212 Young regions
           0K (  0.3%) by 9 Humongous regions
          10K ( 24.8%) by 648 Free regions
          27K ( 66.2%) by 155 Old regions
    1035 code roots represented.
            5 (  0.5%) elements by 212 Young regions
            0 (  0.0%) elements by 9 Humongous regions
            0 (  0.0%) elements by 648 Free regions
         1030 ( 99.5%) elements by 155 Old regions
    Region with largest amount of code roots = 4:(O)[0x00000006c0400000,0x00000006c0500000,0x00000006c0500000], size = 22K, num_elems = 0.

After GC RS summary

 Recent concurrent refinement statistics
  Processed 3782 cards
  Of 26 completed buffers:
           26 (100.0%) by concurrent RS threads.
            0 (  0.0%) by mutator threads.
  Did 0 coarsenings.
  Concurrent RS threads times (s)
          0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
  Concurrent sampling threads times (s)
          0.00

 Current rem set statistics
  Total per region rem sets sizes = 4329K. Max = 73K.
          33K (  0.8%) by 10 Young regions
          29K (  0.7%) by 9 Humongous regions
        2689K ( 62.1%) by 810 Free regions
        1577K ( 36.4%) by 195 Old regions
   Static structures = 256K, free_lists = 63K.
    805071 occupied cards represented.
            0 (  0.0%) entries by 10 Young regions
            4 (  0.0%) entries by 9 Humongous regions
            0 (  0.0%) entries by 810 Free regions
       805067 (100.0%) entries by 195 Old regions
    Region with largest rem set = 4:(O)[0x00000006c0400000,0x00000006c0500000,0x00000006c0500000], size = 73K, occupied = 190K.
  Total heap region code root sets sizes = 40K.  Max = 22K.
           0K (  0.8%) by 10 Young regions
           0K (  0.3%) by 9 Humongous regions
          12K ( 30.9%) by 810 Free regions
          27K ( 68.0%) by 195 Old regions
    1036 code roots represented.
            2 (  0.2%) elements by 10 Young regions
            0 (  0.0%) elements by 9 Humongous regions
            0 (  0.0%) elements by 810 Free regions
         1034 ( 99.8%) elements by 195 Old regions
    Region with largest amount of code roots = 4:(O)[0x00000006c0400000,0x00000006c0500000,0x00000006c0500000], size = 22K, num_elems = 0.

The main things to look out for in this snippet are as follows:

Processed 23270 cards
  Of 96 completed buffers:
           96 (100.0%) by concurrent RS threads.
            0 (  0.0%) by mutator threads.
and
Did 0 coarsenings.

The log output is printed for both before and after the GC pause. The Processed cards tag summarizes the work done by the concurrent refinement threads and sometimes, though very rarely, the Java application mutator threads. In this case, 96 completed buffers had 23,270 processed cards, and all (100 percent) of the work was done by the concurrent RSet refinement threads. There were no RSet coarsenings as indicated by Did 0 coarsenings.
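A quick sanity check on these numbers: 23,270 cards across 96 buffers works out to roughly 242 cards per completed buffer, a ratio you can compare across pauses to see whether the refinement workload per buffer is drifting:

```shell
# Average processed cards per completed buffer, from the summary above.
CARDS=23270
BUFFERS=96
echo $(( CARDS / BUFFERS ))   # 242 (integer division)
```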

Other parts of the log output describe concurrent RSet times and current RSet statistics, including their sizes and occupied cards per region type (young, free, humongous, or old). You can use the log to figure out how the code root sets are referencing the RSets for each region type as well as the total number of references per region type.


Tip

The ability to visualize four areas of potential improvement—RSet coarsenings, updating RSets, scanning RSets, and scanning nmethods referencing RSets—can help significantly in understanding your application and may pave the way for application improvements.


Evacuation and Reclamation

Now that G1 knows about its CSet for the current collection pause, along with a complete set of references into the CSet, it can move on to the most expensive part of a pause: the evacuation of live objects from the CSet regions and reclamation of the newly freed space. Ideally, the object copy times are the biggest contributor to the pause. Live objects that need to be evacuated are copied to thread-local GC allocation buffers (GCLABs) allocated in target regions. Worker threads compete to install a forwarding pointer to the newly allocated copy of the old object image. With the help of work stealing [2], a single “winner” thread helps with copying and scanning the object. Work stealing also provides load balancing between the worker threads.

[Object Copy (ms): Min: 25.1, Avg: 25.2, Max: 25.2, Diff: 0.1, Sum: 201.5]


Tip

G1 GC uses the copy times as weighted averages to predict the time it takes to copy a single region. Users can adjust the young generation size if the prediction logic fails to keep up with the desired pause time goal.


Termination

After completing the tasks just described, each worker thread offers termination if its work queue is empty. A thread requesting termination checks the other threads’ work queues to attempt work stealing. If no work is available, it terminates. Termination tags the time that each worker thread spends in this termination protocol. A GC worker thread that gets caught up in a single root scan can be late to complete all the tasks in its queue and hence eventually be late for termination.

[Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]

If any (or all) worker threads are getting caught up somewhere, it will show up in long termination times and may indicate a work-stealing or load-balancing issue.

Parallel Activity Outside of GC

Termination marks the end of parallel activities for worker threads during the evacuation/collection pause. The next line in the log snippet, tagged GC Worker Other, is time spent in the parallel phase but not in any of the “usual” parallel activities described so far. Though attributed to “GC time,” it could very easily be taken up by something outside of the GC that happens to occur during the parallel phase of the pause and that effectively stalls the GC worker threads. We have seen GC Worker Other times being high when there is an increase in compiler work due to JVM activities triggered by ill-considered compiler options. If you observe long times here, investigate such non-GC activities.

[GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.1, Diff: 0.1, Sum: 0.4]

Summarizing All Parallel Activities

The last line of the parallel phase, tagged GC Worker Total, is the sum of the “usual” and “unusual” GC worker thread times:

[GC Worker Total (ms): Min: 51.9, Avg: 52.0, Max: 52.1, Diff: 0.1, Sum: 416.0]

Start of All Serial Activities

After the parallel phase is complete, the serial phase begins with the lines tagged Code Root Fixup, Code Root Purge, and Clear CT. During these times, the main GC thread updates code roots with the new locations of evacuated objects and purges the code root set table. The Clear CT phase (which is carried out with the help of parallel worker threads) clears the card table marks: as mentioned in Chapter 2, when scanning RSets, once a card is scanned, G1 GC marks a corresponding entry in the global card table to avoid rescanning that card. That mark is cleared during the Clear CT phase of the pause.


Tip

The main GC thread is the VM thread that executes the GC VM operation during a safepoint.


[Code Root Fixup: 0.1 ms]
[Code Root Purge: 0.0 ms]
[Clear CT: 0.2 ms]


Tip

During a mixed collection pause, Code Root Fixup will include the time spent in updating non-evacuated regions.


Other Serial Activities

The final part of the serial phase is tagged Other. The bigger contributors to Other involve choosing the CSet for the collection, reference processing and enqueuing, card redirtying, reclaiming free humongous regions, and freeing the CSet after the collection. These major contributors are shown in the -XX:+PrintGCDetails output:

   [Other: 2.0 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 0.1 ms]
      [Ref Enq: 0.0 ms]
      [Redirty Cards: 0.2 ms]
      [Humongous Reclaim: 0.0 ms]
      [Free CSet: 1.2 ms]


Tip

For a young collection, all the young regions are collected; hence there is no “choosing” as such, since all the young regions automatically become a part of the young CSet. The “choosing” occurs for a mixed collection pause and becomes an important factor in understanding how to “tame” your mixed collections. We will discuss choosing the CSet in detail later in this chapter.


Reference processing and enqueuing are for soft, weak, phantom, final, and JNI references. We discuss this and more in the section titled “Reference Processing Tuning.”

The act of reference enqueuing may require updating the RSets. Hence, the updates need to be logged and their associated cards need to be marked as dirty. The time spent redirtying the cards is shown as the Redirty Cards time (0.2 ms in the preceding example).

Humongous Reclaim is new in JDK 8u40. If a humongous object is found to be unreachable by looking at all references from the root set or young generation regions and by making sure that there are no references to the humongous object in the RSet, that object can be reclaimed during the evacuation pause. (See Chapter 2 for detailed descriptions of humongous regions and humongous objects.)

The remainder of the Other time is spent in fixing JNI handles and similar work. The collective time in Other should be very small, and any time hog should have a reasonable explanation. As an example, you could see higher times in Free CSet if your CSet per pause is very large. Similarly, Ref Proc and Ref Enq could show higher times depending on how many references are used in your application. Similar reasoning can be applied to Humongous Reclaim times, if you have many short-lived humongous objects.

Young Generation Tunables

As covered in Java™ Performance [1], Chapter 7, “Tuning the JVM, Step by Step,” there are some tunables that can help with tuning the young generation itself.

G1 GC has great tuning potential. There are initial and default values that feed G1 GC’s heuristics, and in order to tune G1 GC, one needs to understand these defaults and their effect on the heuristics. Options such as -XX:MaxGCPauseMillis (pause time goal; defaults to 200 ms), -XX:G1NewSizePercent (initial young generation size expressed as a percentage of the total heap; defaults to 5 percent), and -XX:G1MaxNewSizePercent (maximum young generation growth limit expressed as a percentage of the total heap; defaults to 60 percent) help grow or shrink the young generation based on the initial and upper bounds, the pause time goal, and the weighted average of previous copy times. If you understand the workload well and perceive a benefit from circumventing the adaptive resizing (for example, if you see that the predicted time varies drastically from the actual time observed), you can adjust the defaults. The side effect is that you forgo adaptive resizing in exchange for more predictable limits. Keep in mind also that the new limits apply only to the one application you are tuning and will not carry over to other applications, even ones with similar pause time requirements, since applications differ in their allocation rates, promotion rates, steady and transient live data sets, object sizes, and life spans.
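As a hypothetical starting point, the sketch below pins the young generation between 20 and 40 percent of the heap instead of the 5/60 percent defaults; the values are illustrative only. Note that G1NewSizePercent and G1MaxNewSizePercent are experimental options and must be unlocked first:

```shell
# Hypothetical sketch: trade adaptive young generation resizing for
# more predictable bounds. Values are examples, not recommendations.
JAVA_OPTS="-XX:+UseG1GC -XX:MaxGCPauseMillis=50 \
-XX:+UnlockExperimentalVMOptions \
-XX:G1NewSizePercent=20 -XX:G1MaxNewSizePercent=40 \
-XX:+PrintGCDetails -XX:+PrintAdaptiveSizePolicy"
```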

Let’s look at some examples. These are once again run using JDK 8u45 with different values for -XX:MaxGCPauseMillis (as can be seen in the time noted as the target pause time). The benchmark suite was DaCapo, and -XX:+PrintAdaptiveSizePolicy was enabled at the JVM command line.

6.317: [GC pause (G1 Evacuation Pause) (young) 6.317: [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 5800, predicted base time: 20.39 ms, remaining time: 179.61 ms, target pause time: 200.00 ms]
 6.317: [G1Ergonomics (CSet Construction) add young regions to CSet, eden: 225 regions, survivors: 68 regions, predicted young region time: 202.05 ms]
 6.317: [G1Ergonomics (CSet Construction) finish choosing CSet, eden: 225 regions, survivors: 68 regions, old: 0 regions, predicted pause time: 222.44 ms, target pause time: 200.00 ms]
, 0.1126312 secs]

In the above example, the pause time goal was left at its default value of 200 ms, and even though the prediction logic predicted the pause time to be 222.44 ms, the actual pause time was only 112.63 ms. In this case, we could easily have added more young regions to the CSet.

36.931: [GC pause (G1 Evacuation Pause) (young) 36.931: [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 9129, predicted base time: 14.46 ms, remaining time: 35.54 ms, target pause time: 50.00 ms]
 36.931: [G1Ergonomics (CSet Construction) add young regions to CSet, eden: 284 regions, survivors: 16 regions, predicted young region time: 60.90 ms]
 36.931: [G1Ergonomics (CSet Construction) finish choosing CSet, eden: 284 regions, survivors: 16 regions, old: 0 regions, predicted pause time: 75.36 ms, target pause time: 50.00 ms]
, 0.0218629 secs]

The second example above showcases a similar scenario to the one shown before, only here the pause time target was changed to 50 ms (no adjustments were made to the young generation). Once again, the prediction logic was off and it predicted a pause time of 75.36 ms, whereas the actual pause time was 21.86 ms.

After adjusting the young generation (as can be seen in the number of eden and survivor regions added to the CSet in the third example, below), we could get the pause times to be in the 50 ms range as shown here:

58.373: [GC pause (G1 Evacuation Pause) (young) 58.373: [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 5518, predicted base time: 10.00 ms, remaining time: 40.00 ms, target pause time: 50.00 ms]
 58.373: [G1Ergonomics (CSet Construction) add young regions to CSet, eden: 475 regions, survivors: 25 regions, predicted young region time: 168.35 ms]
 58.373: [G1Ergonomics (CSet Construction) finish choosing CSet, eden: 475 regions, survivors: 25 regions, old: 0 regions, predicted pause time: 178.35 ms, target pause time: 50.00 ms]
, 0.0507471 secs]

Here, even though the prediction logic is still way off, our pause time (50.75 ms) is in the desired range (50 ms).

Concurrent Marking Phase Tunables

For G1, the tunable -XX:InitiatingHeapOccupancyPercent=n (here n defaults to 45 percent of the total Java heap size and takes into account the old generation occupancy, which includes old and humongous regions) helps decide when to initiate the concurrent marking cycle.
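For instance, with the 1024 MB heap used in the earlier log snippets and the default threshold, a concurrent marking cycle is initiated once total heap occupancy crosses roughly 460 MB:

```shell
# Occupancy (in MB) at which a concurrent marking cycle is initiated,
# given -XX:InitiatingHeapOccupancyPercent (default 45) and the heap size.
HEAP_MB=1024
IHOP_PCT=45
echo $(( HEAP_MB * IHOP_PCT / 100 ))   # 460
```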


Tip

Unlike CMS GC’s initiation of a marking cycle, which is with respect to its old generation size, G1’s InitiatingHeapOccupancyPercent is with respect to the entire Java heap size.


The concurrent marking cycle starts with an initial marking pause, which is piggybacked onto a young collection pause. This pause marks the beginning of the collection cycle and is followed by other concurrent and parallel tasks for root region scanning, concurrent marking and liveness accounting, final mark, and cleanup. Figure 3.1 shows all the pauses in a concurrent marking cycle: initial mark, remark, and cleanup. To learn more about the concurrent marking cycle, please refer to Chapter 2.

277.559: [GC pause (G1 Evacuation Pause) (young) (initial-mark), 0.0960289
secs]


Figure 3.1 Young collection pauses, mixed collection pauses, and pauses in a concurrent marking cycle

The concurrent marking tasks can take a long time if the application’s live object graph is large, and they may often be interrupted by young collection pauses. The concurrent marking cycle must be complete before a mixed collection pause can start; the cycle is immediately followed by a young collection that calculates the thresholds required to trigger a mixed collection on the next pause, as shown in Figure 3.1. The figure shows an initial-mark pause (which, as mentioned earlier, piggybacks on a young collection). There could be more than one young collection while the concurrent phase is under way (only one pause is shown in the figure). The final mark (also known as remark) completes the marking, and a small cleanup pause helps with the cleanup activities as described in Chapter 2. A young generation evacuation pause right after the cleanup pause helps prepare for the mixed collection cycle. The four pauses after this young collection pause are the mixed collection evacuation pauses that successfully collect all the garbage out of the target CSet regions.

If any of the concurrent marking tasks and hence the entire cycle take too long to complete, a mixed collection pause is delayed, which could eventually lead to an evacuation failure. An evacuation failure will show up as a to-space exhausted message on the GC log, and the total time attributed to the failure will be shown in the Other section of the pause. Here is an example log snippet:

276.731: [GC pause (G1 Evacuation Pause) (young) (to-space exhausted), 0.8272932 secs]
   [Parallel Time: 387.0 ms, GC Workers: 8]
      [GC Worker Start (ms): Min: 276731.9, Avg: 276731.9, Max: 276732.1, Diff: 0.2]
      [Ext Root Scanning (ms): Min: 0.0, Avg: 0.2, Max: 0.2, Diff: 0.2, Sum: 1.3]
      [Update RS (ms): Min: 17.0, Avg: 17.2, Max: 17.3, Diff: 0.4, Sum: 137.3]
         [Processed Buffers: Min: 19, Avg: 21.0, Max: 23, Diff: 4, Sum: 168]
      [Scan RS (ms): Min: 10.5, Avg: 10.7, Max: 10.9, Diff: 0.4, Sum: 85.4]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
      [Object Copy (ms): Min: 358.7, Avg: 358.8, Max: 358.9, Diff: 0.2, Sum: 2870.3]
      [Termination (ms): Min: 0.0, Avg: 0.1, Max: 0.1, Diff: 0.1, Sum: 0.7]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.2]
      [GC Worker Total (ms): Min: 386.7, Avg: 386.9, Max: 387.0, Diff: 0.2, Sum: 3095.3]
      [GC Worker End (ms): Min: 277118.8, Avg: 277118.8, Max: 277118.9, Diff: 0.0]
   [Code Root Fixup: 0.1 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 0.2 ms]
   [Other: 440.0 ms]
      [Evacuation Failure: 437.5 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 0.1 ms]
      [Ref Enq: 0.0 ms]
      [Redirty Cards: 0.9 ms]
      [Humongous Reclaim: 0.0 ms]
      [Free CSet: 0.9 ms]
   [Eden: 831.0M(900.0M)->0.0B(900.0M) Survivors: 0.0B->0.0B Heap: 1020.1M(1024.0M)->1020.1M(1024.0M)]
 [Times: user=3.64 sys=0.20, real=0.83 secs]

When you see such messages in your log, you can try the following to avoid the problem:

- It is imperative to set the marking threshold to fit your application’s static plus transient live data needs. If you set the marking threshold too high, you risk running into evacuation failures. If you set the marking threshold too low, you may prematurely trigger concurrent cycles and may reclaim close to no space during your mixed collections. It is generally better to err on the side of starting the marking cycle too early rather than too late, since the negative consequences of an evacuation failure tend to be greater than those of the marking cycle running too frequently.

- If you think that the marking threshold is correct, but the concurrent cycle is still taking too long and your mixed collections end up “losing the race” to reclaim regions and triggering evacuation failures, try increasing your total concurrent thread count. -XX:ConcGCThreads defaults to one-fourth of -XX:ParallelGCThreads. You can either increase the concurrent thread count directly or increase the parallel GC thread count, which effectively increases the concurrent thread count.
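The default relationship between the two thread counts can be sketched as a quick calculation. This is an illustrative sketch of the one-fourth rule stated above, with a floor of one thread; the method name is ours, and HotSpot’s exact rounding may differ:

```java
public class ConcThreadsSketch {
    // Sketch: -XX:ConcGCThreads defaults to roughly one-fourth of
    // -XX:ParallelGCThreads, with at least one concurrent thread.
    static int defaultConcGcThreads(int parallelGcThreads) {
        return Math.max(parallelGcThreads / 4, 1);
    }

    public static void main(String[] args) {
        System.out.println(defaultConcGcThreads(8)); // 8 parallel GC threads -> 2 concurrent threads
        System.out.println(defaultConcGcThreads(2)); // small machines still get 1 concurrent thread
    }
}
```

So raising -XX:ParallelGCThreads from 8 to 16 would, by default, also raise the concurrent thread count from 2 to 4.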


Tip

Increasing the concurrent thread count will take processing time away from mutator (Java application) threads since the concurrent GC threads work at the same time as your application threads.


A Refresher on the Mixed Garbage Collection Phase

Now that we have tuned young collections and concurrent marking cycles, we can focus on old generation collection carried out by mixed collection cycles. Recall from Chapter 2 that a mixed collection CSet consists of all the young regions plus a few regions selected from the old generation. Tuning mixed collections can be broken down into varying the number of old regions in the mixed collection’s CSet and adding enough back-to-back mixed collections to diffuse the cost of any single one of them over the time it takes to collect all eligible old regions. Taming mixed collections will help you achieve your service-level agreement for GC overhead and responsiveness.

The -XX:+PrintAdaptiveSizePolicy option dumps details of G1’s ergonomics heuristic decisions. An example follows.

First the command line:

 JAVA_OPTS="-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintAdaptiveSizePolicy -Xloggc:jdk8u45_h2.log"

And the log snippet:

97.859: [GC pause (G1 Evacuation Pause) (mixed) 97.859: [G1Ergonomics
(CSet Construction) start choosing CSet, _pending_cards: 28330, predicted
base time: 17.45 ms, remaining time: 182.55 ms, target pause time: 200.00
ms]
 97.859: [G1Ergonomics (CSet Construction) add young regions to CSet,
eden: 37 regions, survivors: 14 regions, predicted young region time:
16.12 ms]
 97.859: [G1Ergonomics (CSet Construction) finish adding old regions to
CSet, reason: old CSet region num reached max, old: 103 regions, max: 103
regions]
 97.859: [G1Ergonomics (CSet Construction) finish choosing CSet, eden: 37
regions, survivors: 14 regions, old: 103 regions, predicted pause time:
123.38 ms, target pause time: 200.00 ms]
 97.905: [G1Ergonomics (Mixed GCs) continue mixed GCs, reason: candidate
old regions available, candidate old regions: 160 regions, reclaimable:
66336328 bytes (6.18 %), threshold: 5.00 %]
, 0.0467862 secs]

The first line tells us the evacuation pause type, in this case a mixed collection, and the predicted times for activities such as CSet selection and adding young and old regions to the CSet.

On the fifth timestamp, tagged Mixed GCs, you can see G1 decide to continue with mixed collections since there are candidate regions available and reclaimable bytes are still higher than the default 5 percent threshold.

This example highlights two tunables: the number of old regions to be added to the CSet as can be seen here:

97.859: [G1Ergonomics (CSet Construction) finish adding old regions
to CSet, reason: old CSet region num reached max, old: 103 regions,
max: 103 regions]

and the reclaimable percentage threshold:

97.905: [G1Ergonomics (Mixed GCs) continue mixed GCs, reason: candidate
old regions available, candidate old regions: 160 regions, reclaimable:
66336328 bytes (6.18 %), threshold: 5.00 %]

Let’s talk more about these and other tunables next.

The Taming of a Mixed Garbage Collection Phase

The reclaimable percentage threshold, -XX:G1HeapWastePercent, is a measure of the total amount of garbage (or fragmentation, if you prefer) that you are able to tolerate in your application. It is expressed as a percentage of the application’s total Java heap and defaults to 5 percent (JDK 8u45). Let’s look at an example:

123.563: [GC pause (G1 Evacuation Pause) (mixed) 123.563: [G1Ergonomics
(CSet Construction) start choosing CSet, _pending_cards: 7404, predicted
base time: 6.13 ms, remaining time: 43.87 ms, target pause time: 50.00 ms]
 123.563: [G1Ergonomics (CSet Construction) add young regions to CSet,
eden: 464 regions, survivors: 36 regions, predicted young region time:
80.18 ms]
 123.563: [G1Ergonomics (CSet Construction) finish adding old regions
to CSet, reason: predicted time is too high, predicted time: 0.70 ms,
remaining time: 0.00 ms, old: 24 regions, min: 24 regions]
 123.563: [G1Ergonomics (CSet Construction) added expensive regions to
CSet, reason: old CSet region num not reached min, old: 24 regions,
expensive: 24 regions, min: 24 regions, remaining time: 0.00 ms]
 123.563: [G1Ergonomics (CSet Construction) finish choosing CSet, eden:
464 regions, survivors: 36 regions, old: 24 regions, predicted pause time:
101.83 ms, target pause time: 50.00 ms]
 123.640: [G1Ergonomics (Mixed GCs) continue mixed GCs, reason: candidate
old regions available, candidate old regions: 165 regions, reclaimable:
109942200 bytes (10.24 %), threshold: 5.00 %]
, 0.0771597 secs]

The last line of this example shows that mixed GCs will be continued since there is enough garbage to be reclaimed (10.24 percent). Here the reclaimable percentage threshold is kept at its default value of 5 percent.
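The continue-or-stop decision in that last line can be written out numerically. The following is an illustrative sketch, not HotSpot code; the 1GB heap size is inferred from the example log, and the method name is hypothetical:

```java
public class HeapWasteSketch {
    // Sketch of the -XX:G1HeapWastePercent check described above: mixed GCs
    // continue while the reclaimable bytes exceed the threshold percentage
    // of the total Java heap.
    static boolean continueMixedGc(long reclaimableBytes, long heapBytes, double wastePercent) {
        double reclaimablePercent = 100.0 * reclaimableBytes / heapBytes;
        return reclaimablePercent > wastePercent;
    }

    public static void main(String[] args) {
        long heap = 1024L * 1024 * 1024; // 1GB heap, as in the example log
        System.out.println(continueMixedGc(109_942_200L, heap, 5.0));  // true: 10.24% > 5%
        System.out.println(continueMixedGc(109_942_200L, heap, 15.0)); // false: raising the threshold stops mixed GCs sooner
    }
}
```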

If mixed collections are becoming exponentially expensive as can be seen in Figure 3.2, increasing this threshold will help. Remember, however, that the increase will leave more regions fragmented and occupied. This means that the old generation will retain more (transient) live data, which must be accounted for by adjusting your marking threshold accordingly.


Figure 3.2 Mixed GC collection cycle showing exponentially expensive collection times

The minimum number of old regions to be included in the CSet per mixed collection pause within a mixed collection cycle is derived from -XX:G1MixedGCCountTarget, which defaults to 8. As briefly discussed in Chapter 2, the minimum number of old regions per mixed collection pause is

Minimum old CSet size per mixed collection pause = (number of candidate old regions identified for the mixed collection cycle) / G1MixedGCCountTarget

This formula sets the minimum number of old regions per CSet so that all candidate old regions can be collected across roughly G1MixedGCCountTarget back-to-back mixed collections. The set of back-to-back mixed collections carried out after a completed concurrent marking cycle constitutes a mixed collection cycle. Let’s look at line 4 of the preceding example:

123.563: [G1Ergonomics (CSet Construction) added expensive regions to
CSet, reason: old CSet region num not reached min, old: 24 regions,
expensive: 24 regions, min: 24 regions, remaining time: 0.00 ms]

Line 4 tells us that only 24 regions were added to the CSet, since the minimum number of old regions to be added per CSet was not met. The previous (young) collection pause tells us that there are 189 candidate old regions available for reclamation, hence G1 GC should start a mixed collection cycle:

 117.378: [G1Ergonomics (Mixed GCs) start mixed GCs, reason: candidate
old regions available, candidate old regions: 189 regions, reclaimable:
134888760 bytes (12.56 %), threshold: 5.00 %]

So, dividing the 189 candidate regions by the default value of G1MixedGCCountTarget (8) gives ceiling(189/8) = 24, which is how we arrive at the minimum of 24 regions.
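That ceiling division can be checked with a few lines of code. This is a sketch of the formula above, using a hypothetical method name rather than anything in HotSpot:

```java
public class MixedCountSketch {
    // Sketch of the formula above: minimum old regions per mixed collection
    // pause = ceiling(candidate old regions / G1MixedGCCountTarget).
    static int minOldRegionsPerPause(int candidateOldRegions, int mixedGcCountTarget) {
        // Integer ceiling division without floating point
        return (candidateOldRegions + mixedGcCountTarget - 1) / mixedGcCountTarget;
    }

    public static void main(String[] args) {
        System.out.println(minOldRegionsPerPause(189, 8)); // 24, matching the log above
    }
}
```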

Just as there is a minimum threshold for old regions to be added to the CSet, there is a maximum as well. The maximum number of old regions to be added to the CSet is specified by -XX:G1OldCSetRegionThresholdPercent, which defaults to 10 percent of the total Java heap size. Once again, let’s look at an example:

97.859: [GC pause (G1 Evacuation Pause) (mixed) 97.859: [G1Ergonomics
(CSet Construction) start choosing CSet, _pending_cards: 28330, predicted
base time: 17.45 ms, remaining time: 182.55 ms, target pause time: 200.00
ms]
 97.859: [G1Ergonomics (CSet Construction) add young regions to CSet,
eden: 37 regions, survivors: 14 regions, predicted young region time:
16.12 ms]
 97.859: [G1Ergonomics (CSet Construction) finish adding old regions to
CSet, reason: old CSet region num reached max, old: 103 regions, max: 103
regions]
 97.859: [G1Ergonomics (CSet Construction) finish choosing CSet, eden: 37
regions, survivors: 14 regions, old: 103 regions, predicted pause time:
123.38 ms, target pause time: 200.00 ms]
 97.905: [G1Ergonomics (Mixed GCs) continue mixed GCs, reason: candidate
old regions available, candidate old regions: 160 regions, reclaimable:
66336328 bytes (6.18 %), threshold: 5.00 %]

Lines 3 and 5 show that even though there were more candidate old regions available for collection, the total number of old regions in the current CSet was capped at 103 regions. The 103-region cap comes from the 1GB total heap size, divided into 1024 regions of 1MB each, and the 10 percent default value for G1OldCSetRegionThresholdPercent: 10 percent of 1024 regions, rounded up, is 103.
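The cap can be reproduced with a short sketch. This assumes the 1GB heap and 1MB region size from the example log; the rounding-up behavior matches the 103-region cap in the log, though HotSpot’s exact arithmetic may differ:

```java
public class OldCSetMaxSketch {
    // Sketch: maximum old regions per CSet, taken as
    // -XX:G1OldCSetRegionThresholdPercent of the total region count, rounded up.
    static int maxOldRegionsPerCSet(long heapBytes, long regionBytes, int thresholdPercent) {
        long regions = heapBytes / regionBytes;
        return (int) Math.ceil(regions * thresholdPercent / 100.0);
    }

    public static void main(String[] args) {
        long oneMb = 1024L * 1024;
        // 1GB heap, 1MB regions, default 10 percent threshold
        System.out.println(maxOldRegionsPerCSet(1024 * oneMb, oneMb, 10)); // 103
    }
}
```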

Now that we know how to specify the minimum and maximum number of regions per CSet per mixed collection, we can modify the thresholds to suit our pause time goal and at the same time maintain the desired amount of transient live data in the old generation.

The -XX:G1MixedGCLiveThresholdPercent option, which defaults to 85 percent (JDK 8u45), is the maximum percentage of live data within a region that will allow it to be included in a CSet. Per-region live data percentages are computed during the concurrent marking phase. An old region deemed to be too expensive to evacuate—that is, whose live data percentage is above the liveness threshold—is not included as a CSet candidate region. This option directly controls fragmentation per region, so be careful.


Tip

Increasing the G1MixedGCLiveThresholdPercent value means that it will take longer to evacuate old regions, which means that mixed collection pauses will also be longer.


Avoiding Evacuation Failures

In the “Concurrent Marking Phase Tunables” section, we discussed a couple of ways to avoid evacuation failures. Here are a few more important tuning parameters:

- Heap size. Make sure that your Java heap can accommodate all the static and transient live data plus your short- and medium-lived application data. Beyond the live data, additional Java heap space, or headroom, should be available in order for GC to operate efficiently. The more headroom available, the higher the possible throughput and/or the lower the possible latency.

- Avoid over-specifying your JVM command-line options! Let the defaults work for you. Get a baseline with just your initial and maximum heap settings and a desired pause time goal. If you already know that the default marking threshold is not helping, add your tuned marking threshold on the command line for the baseline run. So, your base command would look something like this:

-Xms2g -Xmx4g -XX:MaxGCPauseMillis=100

or

-Xms2g -Xmx4g -XX:MaxGCPauseMillis=100 -XX:InitiatingHeapOccupancyPercent=55

- If your application has long-lived humongous objects, make sure that your marking threshold is set low enough to accommodate them. Also, make sure that the long-lived objects that you deem “humongous” are treated as such by G1. You can ensure this by setting -XX:G1HeapRegionSize to a value that guarantees that objects greater than or equal to 50 percent of the region size are treated as humongous. As mentioned in Chapter 2, the default value is calculated based on your initial and maximum heap sizes and can range from 1 to 32MB.

Here is a log snippet using -XX:+PrintAdaptiveSizePolicy:

 91.890: [G1Ergonomics (Concurrent Cycles) request concurrent cycle
initiation, reason: occupancy higher than threshold, occupancy: 483393536
bytes, allocation request: 2097168 bytes, threshold: 483183810 bytes
(45.00 %), source: concurrent humongous allocation]

A concurrent cycle is being requested since the heap occupancy crossed the marking threshold due to a humongous allocation request. The request was for 2,097,168 bytes, which is far larger than the 1MB G1 heap region size default set by G1 at JVM start-up time.
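The 50-percent classification rule can be sketched as follows. The names are illustrative, and the 1MB region size is taken from the log above; the 2,097,168-byte request from the log easily qualifies as humongous:

```java
public class HumongousSketch {
    // Sketch of G1's humongous rule described above: an allocation is
    // humongous when it is at least half the region size.
    static boolean isHumongous(long allocationBytes, long regionBytes) {
        return allocationBytes >= regionBytes / 2;
    }

    public static void main(String[] args) {
        long regionSize = 1024L * 1024; // 1MB default region from the log
        System.out.println(isHumongous(2_097_168L, regionSize)); // true: the log's allocation request
        System.out.println(isHumongous(262_144L, regionSize));   // false: 256KB is below the 512KB cutoff
    }
}
```

Doubling -XX:G1HeapRegionSize to 2MB would raise the cutoff to 1MB, so the same 2,097,168-byte request would still be humongous, but smaller “large” objects would no longer be.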

- There are times when evacuation failures are caused by not having enough space in the survivor regions for newly promoted objects. When you observe this happening, try increasing -XX:G1ReservePercent. The reserve creates a false ceiling on usable heap space so as to accommodate any variation in promotion patterns. The default value is 10 percent of the total Java heap, and G1 caps the setting at 50 percent, since a larger reserve would simply waste space.
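The reserve size and its 50-percent cap can be sketched numerically. This is an illustration of the behavior described above, not HotSpot code:

```java
public class ReserveSketch {
    // Sketch: -XX:G1ReservePercent carves out headroom as a percentage of the
    // heap; as noted above, G1 limits the setting to 50 percent.
    static long reserveBytes(long heapBytes, int reservePercent) {
        int effective = Math.min(reservePercent, 50); // G1 caps the value at 50
        return heapBytes * effective / 100;
    }

    public static void main(String[] args) {
        long oneGb = 1024L * 1024 * 1024;
        System.out.println(reserveBytes(oneGb, 10)); // ~102.4MB reserved by default
        System.out.println(reserveBytes(oneGb, 80)); // clamped to 50 percent: 512MB
    }
}
```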

Reference Processing

Garbage collection must treat Java reference objects—phantom references, soft references, and weak references—differently from other Java objects. Reference objects require more work to collect than non-reference objects.

In this section, you will learn how to identify if the time required to perform reference processing during a G1 garbage collection pause is an issue for your application and how to tune G1 to reduce this overhead, along with tips for isolating which reference object type is inducing the most overhead. Depending on the application and its pause time requirements, refactoring the application’s source code may be required to reduce reference processing overhead.


Tip

Java Platform, Standard Edition (Java SE) API documentation for each of the reference object types (http://docs.oracle.com/javase/8/docs/api) and the java.lang.ref package APIs for phantom reference, soft reference, and weak reference are both good sources for understanding how each reference object type behaves and how and when they are garbage collected.


Observing Reference Processing

There are several major activities associated with garbage collecting reference objects: discovering the reference objects, pushing them onto the JVM’s reference queue, and pulling them off the reference queue and processing them. When using G1 with -XX:+PrintGCDetails, the time spent enqueuing reference objects is reported separately from the time spent processing them. The two times are reported in the Other section of the log on both young and mixed GCs:

   [Other: 9.9 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 8.2 ms]
      [Ref Enq: 0.3 ms]
      [Redirty Cards: 0.7 ms]
      [Humongous Reclaim: 0.0 ms]
      [Free CSet: 0.5 ms]

Ref Proc is the time spent processing reference objects, and Ref Enq is the time spent enqueuing reference objects. As the example suggests, the time spent in Ref Enq is rarely as long as the time spent in Ref Proc. In fact, we have yet to see an application that consistently has higher Ref Enq times. If it happens, it means that the amount of effort required to process a reference is very small relative to its enqueuing time, which is unlikely for most reference object types.

G1 also reports reference processing activity during the remark phase during G1’s concurrent cycle. Using -XX:+PrintGCDetails, the log output from remark will include reference processing time:

[GC remark [Finalize Marking, 0.0007422 secs][GC ref-proc, 0.0129203 secs][Unloading, 0.0160048 secs], 0.0308670 secs]

-XX:+PrintReferenceGC reports details for each reference object type at each collection and is very useful for isolating a specific reference object type that the collector is spending the most time processing.

[GC remark [Finalize Marking, 0.0003322 secs][GC ref-proc [SoftReference,
2 refs, 0.0052296 secs][WeakReference, 3264308 refs, 2.0524238 secs]
[FinalReference, 215 refs, 0.0028225 secs][PhantomReference, 1787 refs,
0.0050046 secs][JNI Weak Reference, 0.0007776 secs], 2.0932150 secs]
[Unloading, 0.0060031 secs], 2.1201401 secs]

Note that G1 is the only HotSpot GC that reports reference processing times with -XX:+PrintGCDetails. However, all HotSpot GCs will report per-reference-object-type information using -XX:+PrintReferenceGC.

As a general guideline, Ref Proc times in PrintGCDetails output that are more than 10 percent of the total GC pause time for a G1 young or mixed GC are cause to tune the garbage collector’s reference processing. For G1 remark events it is common to see a larger percentage of time spent in reference processing since the remark phase of the concurrent cycle is when the bulk of reference objects discovered during an old generation collection cycle are processed. If the elapsed time for the G1 remark pause exceeds your target pause time and a majority of that time is spent in reference processing, tune as described in the next section.
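The 10-percent guideline above is easy to apply mechanically once you have the two times from the log. The following sketch uses illustrative numbers, not values from a specific log, and a hypothetical method name:

```java
public class RefProcGuideline {
    // Sketch of the guideline above: flag a pause whose Ref Proc time exceeds
    // 10 percent of the total GC pause time.
    static boolean shouldTuneRefProcessing(double refProcMs, double totalPauseMs) {
        return refProcMs > 0.10 * totalPauseMs;
    }

    public static void main(String[] args) {
        System.out.println(shouldTuneRefProcessing(8.2, 54.4)); // true: roughly 15% of the pause
        System.out.println(shouldTuneRefProcessing(0.1, 54.4)); // false: negligible
    }
}
```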

Reference Processing Tuning

The first thing to do is enable multithreaded reference processing using -XX:+ParallelRefProcEnabled, because HotSpot defaults to single-threaded reference processing. This default exists to minimize memory footprint and to make CPU cycles available to other applications. Enabling this option will burn more CPU during reference processing, but the elapsed, or wall clock, time to complete reference processing will go down.

If reference processing time remains larger than 10 percent of young or mixed GC pause times after adding -XX:+ParallelRefProcEnabled, or G1’s remark phase is spending too much time in reference processing, the next step is to determine which reference object type or set of reference object types G1 is spending most of its time processing. -XX:+PrintReferenceGC will report statistics on each reference object type at each GC:

[GC pause (young) [SoftReference, 0 refs, 0.0001139 secs][WeakReference,
26845 refs, 0.0050601 secs][FinalReference, 5216 refs, 0.0032409 secs]
[PhantomReference, 336 refs, 0.0000986 secs][JNI Weak Reference, 0.0000836
secs], 0.0726876 secs]

A large number of a particular reference object type indicates that the application is heavily using it. Suppose you observe the following:

[GC remark [Finalize Marking, 0.0003322 secs][GC ref-proc [SoftReference,
2 refs, 0.0052296 secs][WeakReference, 3264308 refs, 2.0524238 secs]
[FinalReference, 215 refs, 0.0028225 secs][PhantomReference, 1787 refs,
0.0050046 secs][JNI Weak Reference, 0.0007776 secs], 2.0932150 secs]
[Unloading, 0.0060031 secs], 2.1201401 secs]

Further suppose that you regularly observe a similar pattern of a high number of weak references across other GCs. You can use this information to refactor your application in order to reduce the use of the identified reference object type, or to reduce the reclamation time of that reference object type. In the preceding example, the number of processed weak references is very high relative to the other reference object types, and the amount of time to process them also dominates reference processing time. Corrective actions include the following:

1. Verify that -XX:+ParallelRefProcEnabled is enabled. If it is not, enable it and observe whether it reduces pause time enough to reach your pause time goal.

2. If -XX:+ParallelRefProcEnabled is enabled, tell the application developers that you are observing a very high weak reference reclamation rate and ask that they consider refactoring the application to reduce the number of weak references being used.

One reference object type to watch for and be careful of using is the soft reference. If PrintReferenceGC log output suggests that a large number of soft references are being processed, you may also be observing frequent old generation collection cycles, which consist of concurrent cycles followed by a sequence of mixed GCs. If you see a large number of soft references being processed and GC events are occurring too frequently, or heap occupancy consistently stays near the maximum heap size, tune the aggressiveness with which soft references are reclaimed using -XX:SoftRefLRUPolicyMSPerMB. It defaults to a value of 1000, and its units are milliseconds.

The default setting of -XX:SoftRefLRUPolicyMSPerMB=1000 means that a soft reference will be cleared and made eligible for reclamation if the time it was last accessed is greater than 1000ms times the amount of free space in the Java heap, measured in megabytes. To illustrate with an example, suppose -XX:SoftRefLRUPolicyMSPerMB=1000, and the amount of free space is 1GB, that is, 1024MB. Any soft reference that has not been accessed since 1024 × 1000 = 1,024,000ms, or 1024 seconds, or slightly over 17 minutes ago, is eligible to be cleared and reclaimed by the HotSpot garbage collector.
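The arithmetic in that example can be captured in a one-line helper. This is a sketch of the policy as described above, with hypothetical names:

```java
public class SoftRefPolicySketch {
    // Sketch of the clearing policy described above: a soft reference becomes
    // eligible for clearing once it has gone unaccessed longer than
    // SoftRefLRUPolicyMSPerMB multiplied by the free heap in megabytes.
    static long clearThresholdMs(long freeHeapMb, long msPerMb) {
        return freeHeapMb * msPerMb;
    }

    public static void main(String[] args) {
        // 1024MB free at the default 1000ms/MB: just over 17 minutes
        System.out.println(clearThresholdMs(1024, 1000)); // 1024000
    }
}
```

Note how the threshold shrinks as the heap fills: with only 64MB free, the same default clears soft references unaccessed for just 64 seconds.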

The effect of setting a lower value for -XX:SoftRefLRUPolicyMSPerMB is to provoke more aggressive clearing and reclamation of soft references, which leads to lower heap occupancy after GC events, or in other words, less live data. Conversely, setting -XX:SoftRefLRUPolicyMSPerMB higher causes less aggressive soft reference clearing and reclamation, which leads to more live data and higher heap occupancy. Tuning -XX:SoftRefLRUPolicyMSPerMB may not actually lead to lower reference processing times and in fact may increase them.

The primary reason to tune -XX:SoftRefLRUPolicyMSPerMB is to reduce the frequency of old generation collection events by reducing the amount of live data in the heap. We recommend against the use of soft references as a means for implementing memory-sensitive object caches in Java applications because doing so will increase the amount of live data and result in additional GC overheads. See the sidebar “Using Soft References” for more detail.

References

[1] Charlie Hunt and Binu John. Java™ Performance. Addison-Wesley, Upper Saddle River, NJ, 2012. ISBN 978-0-13-714252-1.

[2] Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. “Thread Scheduling for Multiprogrammed Multiprocessors.” Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures. ACM, New York, 1998, pp. 119–29. ISBN 0-89791-989-0.
