Adaptive code generation

Java is dynamic in nature and certain code generation strategies fit less well than others. From the earlier discussion, the following conclusions can be drawn:

  • Code generation should be done at runtime, not ahead of time.
  • Not all methods can be treated equally by the code generator. There needs to be a way to discern a hot method from a cold one. Otherwise, unnecessary optimization effort is spent on cold methods or, worse, not enough optimization effort on hot methods.
  • In a JIT compiler, bookkeeping needs to be in place in order to keep up with the adaptive runtime. This is because generated native code invalidated by changes to the running program must be thrown away and potentially regenerated.

Achieving code execution efficiency in an adaptive runtime, no matter which JIT or interpretation strategy it uses, boils down to the equation:

Total Execution Time = Code Generation Time + Execution Time

In other words, if we spend lots of effort carefully generating and optimizing every method to make sure it turns into efficient native code, we contribute too much code generation time to the total execution time. We want the JVM to execute our Java code in every available clock cycle, not use the expensive cycles to garbage collect or generate code.

If we spend too little time preparing methods for execution, their runtime performance is likely to be bad and thus contribute too many "inefficient" cycles to the total execution time.
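
For example, with made-up numbers: if an optimizing compile of a method costs 100 milliseconds of code generation time but saves one millisecond of execution time per invocation compared to interpreting it, the effort only starts to pay off once the method has been called more than a hundred times. For a method that is only ever called twice, the cheapest strategy by far is not to compile it at all.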

The JVM needs to know precisely which methods are worth the extra time spent on more elaborate code generation and optimization efforts.

There are, of course, other aspects of total execution time, such as time spent in garbage collection. This, however, is beyond the scope of this chapter and will be covered in more detail in the chapter on memory management. Here it is sufficient to mention that the code optimizer can sometimes help reduce garbage collection overhead by generating efficient code that is less memory bound. One example is escape analysis, which is briefly covered later in this chapter.

Determining "hotness"

As we have seen, "one size fits all" code generation that interprets every method, or JIT compiling every method with a high optimization level, is a bad idea in an adaptive runtime. The former, because although it keeps code generation time down, execution time goes way up. The latter, because even though execution is fast, generating the highly optimized code takes up a significant part of the total runtime. We need to know whether a method is hot or not in order to decide if it should get lots of code generator attention, as we can't treat all methods the same.

Profiling to determine "hotness" can, as was hinted at in the previous sections, be implemented in several different ways. The common denominator for all of them is that a number of samples of where code spends its execution time is collected. These are used by the runtime to make optimization decisions; the more samples available, the better informed those decisions are. Just a few isolated samples in different methods won't really tell us much about the execution profile of a program. Naturally, collecting samples always incurs some overhead in itself, and there is a tradeoff between having enough samples and the cost of collecting them.

Invocation counters

One way to sample hot methods is to use invocation counters. An invocation counter is typically associated with each method and is incremented when the method is called. This is done either by the bytecode interpreter or in the form of an extra add instruction compiled into the prologue of the native code version of the method.
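
The following sketch illustrates the idea in plain Java rather than inside the JVM; the class name, the method identifiers, and the threshold are hypothetical and only serve to show the mechanism: one counter per method, bumped on every entry, with a recompilation request once a threshold is crossed.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class InvocationCounters {
    // Hypothetical hotness threshold; a real JVM tunes this value.
    private static final int HOT_THRESHOLD = 10_000;

    private final ConcurrentHashMap<String, AtomicInteger> counters =
            new ConcurrentHashMap<>();

    // Conceptually called on every method entry, either from the
    // interpreter loop or from an extra add instruction compiled into
    // the method prologue.
    void methodEntered(String methodId) {
        int count = counters
                .computeIfAbsent(methodId, id -> new AtomicInteger())
                .incrementAndGet();
        if (count == HOT_THRESHOLD) {
            scheduleOptimizingCompile(methodId);
        }
    }

    private void scheduleOptimizingCompile(String methodId) {
        // Placeholder: hand the hot method over to the optimizing JIT.
        System.out.println("Hot method, scheduling recompilation: " + methodId);
    }
}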

Especially in the JIT compiled world, where code execution speed doesn't disappear into interpretation overhead, invocation counters may incur some visible runtime overhead, usually in the form of cache misses in the CPU. This is because a particular location in memory has to be frequently written to by the add at the start of each method.

Software-based thread sampling

Another, more cache friendly, way to determine hotness is by using thread sampling. This means periodically examining where in the program Java threads are currently executing and logging their instruction pointers. Thread sampling requires no code instrumentation.

Stopping threads, which is normally required in order to extract their contexts, is, however, quite an expensive operation. Thus, getting a large number of samples without disrupting anything at all requires a complete JVM-internal thread implementation, a custom operating system such as in Oracle JRockit Virtual Edition, or specialized hardware support.
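
The principle can be illustrated at the Java level with nothing but standard APIs. The sketch below is not how a JVM-internal sampler works (it pays exactly the thread-stopping cost discussed above), but it shows the idea: periodically record where every live thread is executing and count the hits per location. The class name and the 10 millisecond interval are arbitrary choices for the example.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class ThreadSampler {
    private final Map<String, Long> samples = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    void start() {
        // Sample every 10 ms; the interval is the tradeoff between sample
        // quality and sampling overhead mentioned earlier.
        scheduler.scheduleAtFixedRate(this::sampleAllThreads,
                0, 10, TimeUnit.MILLISECONDS);
    }

    private void sampleAllThreads() {
        for (Map.Entry<Thread, StackTraceElement[]> entry :
                Thread.getAllStackTraces().entrySet()) {
            StackTraceElement[] stack = entry.getValue();
            if (stack.length > 0) {
                // Log the topmost frame, that is, where the thread is right now.
                String location = stack[0].getClassName()
                        + "." + stack[0].getMethodName();
                samples.merge(location, 1L, Long::sum);
            }
        }
    }

    Map<String, Long> hotLocations() {
        return samples;
    }
}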

Hardware-based sampling

Certain hardware platforms, such as Intel IA-64, provide hardware instrumentation mechanisms that may be used by an application. One example is the hardware IP sample buffer. While generating code for IA-64 is a rather complex business, the hardware architecture at least allows for collecting a large number of samples cheaply, thus facilitating better optimization decisions.

Another benefit of hardware-based sampling is that it may cheaply provide other data besides instruction pointers. For example, hardware profilers may export data on how often a hardware branch predictor makes an incorrect assumption, or on how often the CPU caches miss in particular locations. The runtime can use this information to generate more optimal code, for example by inverting the condition of the jump instruction that caused the branch prediction miss, or by prefetching data ahead of the instruction that caused the cache miss. Thus, efficient hardware-based sampling can lay an excellent groundwork for further adaptive code optimizations in the runtime.

Optimizing a changing program

In assembly code, method calls typically end up as call instructions. Variants of these exist in all hardware architectures. Depending on the type of call, the format of the call instruction varies.

In object-oriented languages, virtual method dispatch is usually compiled as indirect calls (that is, the destination has to be read from memory) to addresses in a dispatch table. This is because a virtual call can have several possible receivers depending on the class hierarchy. A dispatch table exists for every class and contains the receivers of its virtual calls. A static method, or a virtual method that is known to have only one implementation, can instead be turned into a direct call with a fixed destination. This is typically much faster to execute.

Note

In native code, a static call would look something like:

call 0x2345670 (a jump to a fixed location)

A virtual call would look something like:

mov eax, [esi] (load type info from receiver in esi)

call [eax+0x4c] (eax + 0x4c is the dispatch table entry)

As we have to dereference memory twice for the virtual call, it is slower than just calling a fixed destination address.

Consider a static environment, such as a compiled C++ program. For the code generator, everything that can be known about the application is known at compile time. For example, we know that any given virtual method with a single implementation will never be overridden, simply because no overriding method exists and, since new code cannot enter the system, none ever will. This not only removes the need for the extra bookkeeping required for throwing out old code, but it also allows the C++ compiler to generate static calls to the virtual method.

Now, consider the same virtual method in a Java program. At the moment it exists in only one version, but Java allows it to be overridden at any time during program execution. When the JIT compiler wants to generate a call to this method, it would prefer that the method remained a single implementation forever. Then, the previous C++ optimization could be used and the call could be generated as a fast fixed call instead of a slower virtual dispatch. However, if the method is not declared final, it can be overridden at any time. It looks like we don't dare use the direct call at all, even though it is highly unlikely that the method will ever be overridden.
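
A small, hypothetical Java example may make the situation more concrete. The classes Shape and Renderer below are made up for illustration; what matters is that area() is virtual but currently has exactly one implementation:

class Shape {
    // Virtual (non-final) method with, so far, only one implementation.
    double area() {
        return 0.0;
    }
}

class Renderer {
    double total;

    void render(Shape s) {
        // As long as no loaded class overrides area(), the JIT can gamble
        // and compile this virtual call as a direct call, or even inline it.
        // If a subclass such as
        //     class Circle extends Shape { double area() { ... } }
        // is loaded later, the assumption is violated and the generated
        // code for render() must be thrown away and regenerated with a
        // real virtual dispatch.
        total += s.area();
    }
}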

There are several other situations in Java where the world looks good right now to the compiler, and optimizations can be applied, but if the world changes in the future, the optimizations would have to be immediately reverted. For compiled Java, in order to match compiled C++ in speed, there must be a way to do these kinds of optimizations anyway.

The JVM solves this by "gambling". It bases its code generation decisions on assumptions that the world will remain unchanged forever, which is usually the case. If it turns out not to be so, its bookkeeping system triggers callbacks if any assumption is violated. When this happens, the code containing the original assumption needs to be regenerated—in our example the static dispatch needs to be replaced by a virtual one. Having to revert code generated based on an assumption about a closed world is typically very costly, but if it happens rarely enough, the benefit of the original assumption will deliver a performance increase anyway.
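
What such bookkeeping might look like can be sketched conceptually. The following is not JRockit's internal API, just an illustration of the idea that compiled code registers the assumptions it depends on, and that violating an assumption invalidates every dependent piece of code:

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArraySet;

class AssumptionRegistry {
    // Maps an assumption key, for example "Shape.area has a single
    // implementation", to the compiled methods that depend on it.
    private final Map<String, Set<CompiledMethod>> dependents =
            new ConcurrentHashMap<>();

    void recordDependency(String assumption, CompiledMethod code) {
        dependents.computeIfAbsent(assumption,
                key -> new CopyOnWriteArraySet<>()).add(code);
    }

    // Called, for example, by the class loading machinery when a newly
    // loaded class overrides a method that was assumed to be unique.
    void assumptionViolated(String assumption) {
        Set<CompiledMethod> affected = dependents.remove(assumption);
        if (affected != null) {
            for (CompiledMethod code : affected) {
                code.invalidate(); // throw the code away; regenerate on demand
            }
        }
    }

    interface CompiledMethod {
        void invalidate();
    }
}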

Some typical assumptions that the JIT compiler and JVM, in general, might bet on are:

  • A virtual method probably won't be overridden. As it exists in only one version, it can always be called with a fixed destination address, like a static method.
  • A float will probably never be NaN. We can use hardware instructions instead of an expensive call to the native floating point library that is required for corner cases.
  • The program probably won't throw an exception in a particular try block. Schedule the catch clause as cold code and give it less attention from the optimizer.
  • The hardware instruction fsin probably has the right precision for most trigonometry. If it doesn't, cause an exception and call the native floating point library instead.
  • A lock probably won't be too saturated and can start out as a fast spinlock.
  • A lock will probably be repeatedly taken and released by the same thread, so the unlock operation and future reacquisitions of the lock can optimistically be treated as no-ops (a minimal example of this pattern follows the list).
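
As an example of the last assumption, consider a monitor that only ever sees one thread. The Counter class below is made up for illustration; a runtime gambling on this pattern can make the repeated lock and unlock operations nearly free until a second thread actually competes for the monitor:

class Counter {
    private long value;

    synchronized void increment() {
        value++;
    }

    public static void main(String[] args) {
        Counter c = new Counter();
        // Only the main thread ever acquires the monitor on c, one million
        // times in a row with no contention.
        for (int i = 0; i < 1_000_000; i++) {
            c.increment();
        }
        System.out.println(c.value);
    }
}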

A static environment that was compiled ahead of time and runs in a closed world cannot, in general, make these kinds of assumptions. An adaptive runtime, however, can revert decisions that turn out to be invalid when the criteria they were based on are violated. In theory, it can make any crazy assumption that might pay off, as long as it can be reverted with a small enough cost. Thus, an adaptive runtime is potentially far more powerful than a static environment, given that the "gambling" pays off.

Getting the gambling right is a very difficult problem. If we assume that relatively rare events will occur frequently, in order to avoid regenerating code, we can never achieve anything near the performance of a static compiler. However, if very frequent events are assumed to be rare, we will instead have to pay the penalty in increased code generation time for reoptimizations or invalidations. There is a fine middle ground here regarding what kinds of assumptions can be made, and there is a significant art to finding it; this is where a high performance runtime can make its impact. Given that we find this middle ground (and JRockit bases its decisions on runtime information feedback in all relevant areas to do exactly that), an adaptive runtime has the potential to outperform a static environment every time.
