Common bottlenecks and how to avoid them

As in most of our chapters so far, we will finish by reviewing a few common mistakes and false optimizations. In the benchmarking world, this is all about understanding the bottlenecks and anti-patterns that frequently show up in application code, and how they can be avoided.

Note

Care should be taken that any instrumentation is not too intrusive. If, for example, the chosen instrumentation tool inserts extra bytecode operations all over the application code, the overall timing of the program can change completely. This may make the resulting profile useless for drawing any kind of conclusion about the original program's behavior. While small bytecode instrumenters may be handy for things like implementing counters for specific kinds of events, they rarely produce a true execution profile. Bytecode instrumenters also make it necessary to recompile and restart the application. The JRockit Mission Control suite, on the other hand, can plug in at runtime and profile the application with virtually no extra overhead.

A benchmark or instrumentation result can provide great insights into why an application contains performance bottlenecks. Over the years, the authors of this book have examined many applications to determine why they aren't running as fast as they should. Some findings keep recurring, and the following section provides information on several common areas that cause performance problems and what practices should be avoided or used with caution when programming Java.

The -XXaggressive flag

From time to time we discover customers using the undocumented and experimental -XXaggressive flag for JRockit. This flag is a wrapper for other flags that tell JRockit to perform at high speed and try to reach a stable state as soon as possible. The cost of this is higher resource usage at startup. The parameters that this option modifies are subject to change from release to release. Because of its experimental nature, frivolous use of -XXaggressive is discouraged. However, it can be useful to try as one of many different setups when profiling. Use this flag at your own risk.

Too many finalizers

Finalizers are, as we have already discussed in Chapter 3, unsafe in that they can resurrect objects and interfere with the GC. They usually incur processing overhead in the JVM as well.

Objects waiting for finalization have to be kept track of separately by the garbage collector. There is also call overhead when the finalize method is invoked (not to mention the execution time of the finalize method itself, if it does something fairly complex). Finalizers should simply be avoided.
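As a minimal sketch of the resurrection problem (the class and the static field are our own illustration, not from the JDK):

public class Resurrector {
    // Illustrative static field that makes the dying object reachable again.
    static Resurrector zombie;

    @Override
    protected void finalize() throws Throwable {
        // The object was found unreachable, but this assignment resurrects
        // it: the GC must now keep it alive and cannot reclaim its memory.
        zombie = this;
    }
}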

Too many reference objects

As with finalizers, the garbage collector has to treat soft, weak, and phantom references specially. Although all of these can be a great aid in, for example, simplifying a cache implementation, too many live Reference objects will make the garbage collector run slower. A Reference object is usually an order of magnitude more expensive than a normal object (strong reference) to bookkeep.
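For illustration, a minimal soft reference cache sketch of the kind alluded to above (the class name and structure are our own, not from the original text) might look like this:

import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

public class SoftCache<K, V> {
    // Each value is held through a SoftReference, so the GC may clear it
    // under memory pressure — but every mapping adds a Reference object
    // that the GC has to treat specially.
    private final Map<K, SoftReference<V>> map =
        new HashMap<K, SoftReference<V>>();

    public void put(K key, V value) {
        map.put(key, new SoftReference<V>(value));
    }

    public V get(K key) {
        SoftReference<V> ref = map.get(key);
        return (ref == null) ? null : ref.get(); // null if the GC cleared it
    }
}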

To get information on Reference object processing along with garbage collections, JRockit can be started with the flag -Xverbose:refobj. Following is an example of its output:

hastur:material marcus$ java -Xverbose:refobj GarbageCollectionTest
[INFO ][refobj ] [YC#1] SoftRef: Reach: 25 Act: 0 PrevAct: 0 Null: 0
[INFO ][refobj ] [YC#1] WeakRef: Reach: 103 Act: 0 PrevAct: 0 Null: 0
[INFO ][refobj ] [YC#1] Phantom: Reach: 0 Act: 0 PrevAct: 0 Null: 0
[INFO ][refobj ] [YC#1] ClearPh: Reach: 0 Act: 0 PrevAct: 0 Null: 0
[INFO ][refobj ] [YC#1] Finaliz: Reach: 12 Act: 3 PrevAct: 0 Null: 0
[INFO ][refobj ] [YC#1] WeakHnd: Reach: 217 Act: 0 PrevAct: 0 Null: 0
[INFO ][refobj ] [YC#1] SoftRef: @Mark: 25 @Preclean: 0 @FinalMark: 0
[INFO ][refobj ] [YC#1] WeakRef: @Mark: 94 @Preclean: 0 @FinalMark: 9
[INFO ][refobj ] [YC#1] Phantom: @Mark: 0 @Preclean: 0 @FinalMark: 0
[INFO ][refobj ] [YC#1] ClearPh: @Mark: 0 @Preclean: 0 @FinalMark: 0
[INFO ][refobj ] [YC#1] Finaliz: @Mark: 0 @Preclean: 0 @FinalMark: 15
[INFO ][refobj ] [YC#1] WeakHnd: @Mark: 0 @Preclean: 0 @FinalMark: 217
[INFO ][refobj ] [YC#1] SoftRef: SoftAliveOnly: 24 SoftAliveAndReach:1
[INFO ][refobj ] [YC#1] NOTE: This count only applies to a part of the heap.

The program in the previous example seems to have only a small number of Reference objects and the GC has no trouble keeping up. Beware of applications where each GC needs to handle hundreds of thousands of soft references.

Object pooling

As was discussed in the chapter on memory management, object pooling, the practice of keeping a collection of objects alive for reuse in order to reduce allocation overhead, is usually a bad idea.

Apart from interfering with GC workloads and heuristics, pooling objects also causes them to live longer, and thus eventually forces their promotion to the old generation on the heap. This introduces extra overhead and contributes to fragmentation. Recall from Chapter 3 that large amounts of live data are a GC bottleneck and that the GC is optimized to handle many objects with short life spans. Object pooling contributes both to more live data and to longer object life spans.

Also, allocating fresh objects instead of keeping old objects alive will most likely be better for cache locality.

No rule without exception, however. In very specific applications, allocation time is actually a program bottleneck, especially the clearing part of allocation. As Java guarantees that the fields of every new object are initialized to default values (null or zero), freshly allocated objects have to be cleared. Performance in an environment with many large objects, for example large arrays, may therefore occasionally benefit from object pooling. JRockit tries to remove unnecessary object clearings if it can prove that they are not needed.
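As a sketch of this exceptional case, a pool for large arrays (the class name and sizing are hypothetical) could look as follows — note that a recycled array, unlike a fresh allocation, is not zeroed:

import java.util.ArrayDeque;
import java.util.Deque;

public class ByteArrayPool {
    // Single-threaded sketch; a real pool would need synchronization
    // and an upper bound on the number of retained arrays.
    private final Deque<byte[]> pool = new ArrayDeque<byte[]>();
    private final int arraySize;

    public ByteArrayPool(int arraySize) {
        this.arraySize = arraySize;
    }

    public byte[] acquire() {
        byte[] buffer = pool.pollFirst();
        // A new array forces the JVM to clear it; a pooled one skips the
        // clearing but keeps its old contents.
        return (buffer != null) ? buffer : new byte[arraySize];
    }

    public void release(byte[] buffer) {
        pool.addFirst(buffer);
    }
}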

In general, though, it is probably a good idea to just keep things simple.

Bad algorithms and data structures

It should be fairly obvious that a hash table is a better data structure for fast element lookups than a linked list. It should also be clear that a QuickSort algorithm with average runtime complexity O(n log n) is better than a naively implemented BubbleSort of O(n²). We assume that the reader is fairly proficient at picking the correct algorithms and data structures in order to minimize algorithmic complexity.

Classic textbook issues

However, when working with a poorly written third-party application, bad data structures can still be a problem. By benchmarking and by working backwards from where the program spends its time, serious runtime improvements can sometimes be made.

public List<Node> breadthFirstSearchSlow(Node root) {
    List<Node> order = new LinkedList<Node>();
    List<Node> queue = new LinkedList<Node>();
    queue.add(root);
    while (!queue.isEmpty()) {
        Node node = queue.remove(0);
        order.add(node);
        for (Node succ : node.getSuccessors()) {
            if (!order.contains(succ) && !queue.contains(succ)) {
                queue.add(succ);
            }
        }
    }
    return order;
}

The previous code is a standard breadth first search algorithm for (possibly cyclic) graphs. Given a root node, the algorithm uses a queue to traverse its successors in breadth first order. In order to avoid duplicates and potentially infinite loops, a check to see whether a node has already been processed is done before adding it to the queue.

The method contains, used here for finding an element in a linked list, is implemented in the JDK as a linear scan of the list. This means that in the worst case, our search algorithm is quadratic in the number of nodes, which will be very slow for large data sets.

public List<Node> breadthFirstSearchFast(Node root) {
    List<Node> order = new LinkedList<Node>();
    List<Node> queue = new LinkedList<Node>();
    Set<Node> visited = new HashSet<Node>();
    queue.add(root);
    visited.add(root);
    while (!queue.isEmpty()) {
        Node node = queue.remove(0);
        order.add(node);
        for (Node succ : node.getSuccessors()) {
            if (!visited.contains(succ)) {
                queue.add(succ);
                visited.add(succ);
            }
        }
    }
    return order;
}

The previous code corrects the problem by adding a HashSet data structure to keep track of visited nodes. Expected constant time lookups replace the linear traversals when checking whether a node has already been visited. The potential difference for large data sets is enormous.

Unwanted intrinsic properties

There are, however, other, more subtle problems caused by picking the wrong data structure. Consider a queue implementation in the form of a linked list. This is seemingly a good general-purpose data structure that can be used as a queue without any modifications. It provides the ability to insert elements at the end of the list and remove elements from the head. These are both constant time operations and no other functionality is required. So, what can go wrong? First of all, even if your program never iterates over the entire linked list, the garbage collector still has to.

If the queue contains large amounts of data, many objects are kept alive for as long as they are referenced from the linked list. As each linked list element consists of object references that both point out the next list element and wrap a payload of data, the queue elements may exist anywhere on the heap. Accessing objects through a linked list can therefore lead to bad cache locality and long pause times. Bad cache locality ensues because neither payloads nor element wrappers are guaranteed to be stored next to each other in memory. Long pause times follow because, if the object pointers are spread over a very large heap area, the garbage collector will repeatedly miss the cache while doing pointer chasing during its mark phase.
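As an illustration, contrast the two queue spines below (ArrayDeque is a standard JDK alternative; suggesting it here is our own addition, not from the original text):

import java.util.ArrayDeque;
import java.util.LinkedList;
import java.util.Queue;

public class QueueSpines {
    public static void main(String[] args) {
        // LinkedList allocates one wrapper node per element; the wrapper
        // nodes may end up scattered anywhere on the heap, so both the
        // program and the GC chase one pointer per queued element.
        Queue<String> listBacked = new LinkedList<String>();

        // ArrayDeque keeps its element references in one contiguous array,
        // which is friendlier to the hardware cache and to the GC's mark
        // phase. Offer and poll remain constant time operations.
        Queue<String> arrayBacked = new ArrayDeque<String>();

        listBacked.offer("payload");
        arrayBacked.offer("payload");
    }
}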

So, seemingly innocent data structures with low complexities for the operations involved can turn out to introduce intrinsic problems in systems with automatic memory management.

Misuse of System.gc

There is no guarantee from the Java language specification that calling System.gc will do anything at all. But if it does, it probably does more than you want or doesn't do the same thing every time you call it. To reiterate, don't confuse the garbage collector by trying to tell it what to do. To be safe, don't call System.gc.

Too many threads

While it is a good thing to be able to break up a problem into many computational threads with little inter-thread communication, context switches always incur overhead. We covered the different thread implementations, from green threads to OS threads, in Chapter 4. However, true for all thread implementations is that some kind of context switch, during which no useful program execution can be performed, takes place when shifting execution from one thread to another. The number of context switches grows proportionally to the number of fairly scheduled threads, and there may also be hidden overhead here.

Note

A worst-case example from real life is the Intel IA-64 processor, with its massive number of registers, where a native thread context is on the order of several KB. Every memory copy performed in order to initiate a new thread context after a context switch contributes to the overhead. This makes large numbers of threads executing in parallel particularly expensive, as their contexts are large.
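A minimal sketch of keeping thread counts bounded with a fixed-size pool (the sizing heuristic and class name are our own illustration):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BoundedWorkers {
    public static void main(String[] args) {
        // Size the pool to the hardware rather than to the number of tasks,
        // so context switch overhead stays bounded as the workload grows.
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < 10000; i++) {
            pool.execute(new Runnable() {
                public void run() {
                    // ... the actual unit of work goes here ...
                }
            });
        }
        pool.shutdown();
    }
}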

One contended lock is the global bottleneck

Contended locks are bottlenecks, as their presence means that several threads want to access the same resource or execute the same piece of code at the same time. It is not uncommon that one lock is the main source of all contention in a program. A typical example is an application using some third-party library for logging that acquires a global lock each time log file information is to be written. When many mutually independent threads try to log output at the same time, the log lock might be the one bottleneck that brings an otherwise well-written application to its knees.
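A hypothetical globally locked logger of the kind described above might look like this — every thread that logs serializes on the single class-level lock:

public class GlobalLogger {
    // static synchronized means all calls, from all threads, contend for
    // the same monitor: the one on GlobalLogger.class.
    public static synchronized void log(String message) {
        System.out.println(message);
    }
}

Per-thread buffering or an asynchronous logging design removes the single point of contention.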

Unnecessary exceptions

Handling exceptions takes time and interrupts normal program flow. Using exceptions in the common case, as a means of communicating results or implementing control flow, is definitely bad practice.
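As a hypothetical sketch of the anti-pattern, consider signaling an ordinary "not found" outcome with an exception instead of a return value:

public class LookupExample {
    // Bad practice: "not found" is an expected outcome, not an error, so
    // it should be communicated through the return value.
    public static int indexOf(int[] values, int key) {
        for (int i = 0; i < values.length; i++) {
            if (values[i] == key) {
                return i;
            }
        }
        throw new RuntimeException("key not found: " + key);
    }
}

Returning a sentinel value such as -1, as String.indexOf does, keeps the common case off the exception path.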

It is useful to try to create some kind of exception profile of the program to find out which exceptions are thrown and from where. Unnecessary hardware exceptions, such as null pointers and divisions by zero, should be removed wherever possible. Hardware exceptions are the most expensive type of exception, as they are triggered by an interrupt at the native level, whereas throwing a Java exception explicitly from code keeps most of the (still expensive) work of handling the exception inside the JVM.

Note

We have seen cases with customer applications throwing tens of thousands of unnecessary NullPointerExceptions every second, as part of normal control flow. Once this behavior was rectified, performance gains of an order of magnitude were achieved.

The simplest way to discover which exceptions, both caught and uncaught, are thrown by JRockit, is to use the flag -Xverbose:exceptions. An example of its output is shown as follows:

hastur:~ marcus$ java -Xverbose:exceptions Jvm98Wrapper _200_check
[INFO ][excepti][00004] java/io/FileNotFoundException: /localhome/jrockits/R28.0.0_R28.0.0-454_1.6.0/jre/classes
[INFO ][excepti][00004] java/lang/ArrayIndexOutOfBoundsException: 6
[INFO ][excepti][00004] java/lang/ArithmeticException: / by zero
[INFO ][excepti][00004] java/lang/ArithmeticException: fisk
[INFO ][excepti][00004] java/lang/ArrayIndexOutOfBoundsException: 11
[INFO ][excepti][00004] java/lang/RuntimeException: fisk

Each line in the log corresponds to a thrown exception. To get stack traces along with the exceptions, use -Xverbose:exceptions=debug. The JRockit Mission Control suite also contains frameworks for profiling exceptions and for introspecting them in a more user-friendly way. The following is an example of the output, showing exceptions along with their stack traces as they are thrown by the JVM:

hastur:~ marcus$ java -Xverbose:exceptions=debug Jvm98Wrapper _200_check
[DEBUG][excepti][00004] java/lang/ArrayIndexOutOfBoundsException: 6
at spec/jbb/validity/PepTest.testArray()Ljava/lang/String; (Unknown Source)
at spec/jbb/validity/PepTest.instanceMain()V(Unknown Source)
at spec/jbb/validity/Check.doCheck()Z(Unknown Source)
at spec/jbb/JBBmain.main([Ljava/lang/String;)V(Unknown Source)
at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method)
--- End of stack trace
[DEBUG][excepti][00004] java/lang/ArithmeticException: / by zero
at jrockit/vm/Reflect.fillInStackTrace0(Ljava/lang/Throwable;) V(Native Method)
at java/lang/Throwable.fillInStackTrace()Ljava/lang/Throwable; (Native Method)
at java/lang/Throwable.<init>(Throwable.java:196)
at java/lang/Exception.<init>(Exception.java:41)
at java/lang/RuntimeException.<init>(RuntimeException.java:43)
at java/lang/ArithmeticException.<init>(ArithmeticException.java:36)
at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method)
at jrockit/vm/ExceptionHandler.throwPendingType()V(Native Method)
at spec/jbb/validity/PepTest.testDiv()Ljava/lang/String; (Unknown Source)
at spec/jbb/validity/PepTest.instanceMain()V(Unknown Source)
at spec/jbb/validity/Check.doCheck()Z(Unknown Source)
at spec/jbb/JBBmain.main([Ljava/lang/String;)V(Unknown Source)
at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method)
--- End of stack trace
[DEBUG][excepti][00004] java/lang/ArithmeticException: fisk
at spec/jbb/validity/PepTest.testExc1()Ljava/lang/String; (Unknown Source)
at spec/jbb/validity/PepTest.instanceMain()V(Unknown Source)
at spec/jbb/validity/Check.doCheck()Z(Unknown Source)
at spec/jbb/JBBmain.main([Ljava/lang/String;)V(Unknown Source)
at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method)
--- End of stack trace
[DEBUG][excepti][00004] java/lang/ArrayIndexOutOfBoundsException: 11
at spec/jbb/validity/PepTest.testExc1()Ljava/lang/String; (Unknown Source)
at spec/jbb/validity/PepTest.instanceMain()V(Unknown Source)
at spec/jbb/validity/Check.doCheck()Z(Unknown Source)
at spec/jbb/JBBmain.main([Ljava/lang/String;)V(Unknown Source)
at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method)
--- End of stack trace
[DEBUG][excepti][00004] java/lang/RuntimeException: fisk
at spec/jbb/validity/PepTest.testExc2()Ljava/lang/String; (Unknown Source)
at spec/jbb/validity/PepTest.instanceMain()V(Unknown Source)
at spec/jbb/validity/Check.doCheck()Z(Unknown Source)
at spec/jbb/JBBmain.main([Ljava/lang/String;)V(Unknown Source)
at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method)
--- End of stack trace

Any JRockit Flight Recording contains an exception profile of the running program that can be examined using JRockit Mission Control.

Large objects

Large objects sometimes have to be allocated directly on the heap and not in thread local areas. The rationale is that they would mostly contribute to overhead in a small TLA and cause frequent evacuations. In JRockit versions prior to R28, an explicit large object size could be given, and objects that were larger were never allocated in a TLA. From R28 onward, the modifiable property is instead the TLA waste limit, which allows large objects to be allocated in a TLA if they fit well enough.

Large objects on the heap are bad in that they contribute to fragmentation more quickly. This is because they might not readily fit in most of the spaces provided by the free list, where smaller "normal" objects have previously been garbage collected.

Allocation time increases dramatically on a fragmented heap, and juggling many large objects extensively contributes to fragmentation. As large objects may be allocated directly on the heap and not in the TLA, large object allocation in JRockit also contributes to overhead because it may require taking a global heap lock on allocation.

The worst case scenario is that an overuse of large objects leads to full heap compaction being done too frequently, which is very disruptive and requires stopping the world for large amounts of time.

As it is hard to pick a "one size fits all" value for an explicit large object limit in any given application, the large object limit in JRockit (or the TLA waste limit in R28) can be changed. This is useful if analysis shows that there are, for example, a large number of fixed-size objects slightly larger than the default, or that many direct-to-heap allocations take place because the TLAs are packed too tightly. In such cases, it might be beneficial to increase the large object limit or the TLA waste limit, depending on the JRockit version.
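As an illustration only — the exact flag names and syntax vary between JRockit releases, so treat the following invocations as assumptions to be verified against the documentation for your release:

java -XXlargeObjectLimit:4k MyLargeObjectApp               (pre-R28)
java -XXtlaSize:min=4k,wasteLimit=4k MyLargeObjectApp      (R28 and later)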

The bad corner cases, the large object death spirals, occur when objects are frequently on the order of several megabytes. Real-world examples are typically database query results or very large arrays. Avoid them at all costs. Do whatever it takes to keep them off the heap, even implement a native layer for them, but do not let your Java Virtual Machine juggle large numbers of humongous objects.

Native memory versus heap memory

To the JVM, all memory that exists is system memory, available from the underlying operating system. Some of this system memory is used by the JVM to allocate the Java heap. The amount of memory used for the heap can be controlled by the -Xms (initial heap size) and -Xmx (maximum heap size) flags.
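For example, the following hypothetical invocation starts the JVM with an initial heap of 256 MB that may grow to at most 1024 MB (the application name is our own placeholder):

java -Xms256m -Xmx1024m MyApplication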

The JVM will throw OutOfMemoryError both from inside the JVM, when there is not enough memory available to complete some internal operation, and from Java, when a program tries to allocate more objects than will fit on the current heap.

A JVM is a native application that also consumes system memory for its own purposes, for example, to allocate data structures used for code optimization. Internal JVM memory management is, to a large extent, kept off the Java heap and allocated natively in the operating system, through system calls like malloc. We refer to non-heap system memory allocated by the JVM as native memory.

While heap memory can be reclaimed when the JVM garbage collects objects in the running program, native memory can't. If all native memory management were handled by the JVM alone, and the JVM were economical enough in its native memory usage, all would be well. However, there are complications.

In certain scenarios, we can run out of native memory. One example is when several parallel threads perform code optimizations in the JVM. Code optimization is typically one of the JVM operations that consumes the largest amounts of native memory, though only while the optimizing JIT is running, and only on a per-method basis. There are also mechanisms that allow the Java program, and not just the JVM, to allocate native memory, for example through JNI calls. If a JNI call executes a native malloc to reserve a large amount of memory, this memory will be unavailable to the JVM until it is freed.
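Besides JNI, a pure-Java way to reserve native memory is through NIO direct buffers (our example, not mentioned in the text above): the backing storage of a direct buffer is allocated outside the Java heap.

import java.nio.ByteBuffer;

public class DirectBufferExample {
    public static void main(String[] args) {
        // Backed by 64 MB of native memory, outside the Java heap.
        ByteBuffer direct = ByteBuffer.allocateDirect(64 * 1024 * 1024);

        // Backed by a byte[] on the Java heap, counted against -Xmx.
        ByteBuffer onHeap = ByteBuffer.allocate(64 * 1024 * 1024);

        System.out.println("direct: " + direct.isDirect()
            + ", heap: " + onHeap.isDirect());
    }
}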

Note

Mechanisms for tracking native memory usage are available in JRockit, and can be accessed through the JRockit Mission Control suite or JRCMD. Histograms that show the native memory usage of individual JVM modules are also available.

If the heap is too large, it may well be the case that not enough native memory is left for JVM internal usage—bookkeeping, code optimizations, and so on. In that case, the JVM may have no other choice than to throw an OutOfMemoryError from native code. For JRockit, increasing the amount of available native memory is done implicitly by lowering the maximum Java heap size using -Xmx.
