Near-real-time garbage collection

Real-time systems tend to fit badly in a garbage-collected world. No matter how well a garbage collector performs, it still adds a non-deterministic runtime overhead. Even if the latencies introduced by the GC are few and stopping the world completely is a rare event, a certain degree of non-determinism cannot be avoided.

So what do we mean by real-time? The terminology suffers from a certain degree of misuse. To avoid some of the confusion associated with real-time, we will divide the concept into hard real-time and soft real-time.

Hard and soft real-time

Hard real-time should be understood as the more traditional kind of real-time system, perhaps a synthesizer or a pacemaker, where 100 percent determinism is an absolute requirement. Few runtimes with automatic memory management can work in this kind of environment, at least not without extensive modifications to the application and some kind of programming language constructs for controlling garbage collection.

A typical example is the Java real-time effort specified in Java Specification Request (JSR) 1, which defines an API (javax.realtime) for interacting with the runtime and for controlling the occurrence of garbage collection at certain program points. When using Java to develop a new application, this might be a feasible way forward, but porting an existing Java production system to a new API with new semantics is often very challenging or downright impossible. Even if it is technically feasible, modifying key aspects of an existing system is very costly. Hence the concept of soft real-time.

We use the term soft real-time to mean a runtime system where it is possible to specify a quality of service level for latencies, and control pause times so that, even though they are non-deterministic, no single pause will last longer than a certain amount of time. This is the technique that is implemented in the product JRockit Real Time.
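
The soft real-time contract described above can be expressed as a simple check over observed pauses: individual pauses may vary, but none may exceed the configured bound. The following is a minimal sketch of that check only. PauseBudget and its methods are hypothetical names chosen for illustration and are not part of JRockit or any JDK API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: checks observed pauses against a soft real-time bound. */
final class PauseBudget {
    private final long targetNanos;
    private final List<Long> pausesNanos = new ArrayList<>();

    PauseBudget(long targetMillis) {
        this.targetNanos = targetMillis * 1_000_000L;
    }

    /** Records the length of one observed pause. */
    void record(long pauseNanos) {
        pausesNanos.add(pauseNanos);
    }

    /** The longest single pause seen so far, or 0 if none was recorded. */
    long longestPauseNanos() {
        long max = 0;
        for (long pause : pausesNanos) {
            max = Math.max(max, pause);
        }
        return max;
    }

    /** Total time spent paused; soft real-time puts no bound on this. */
    long totalPauseNanos() {
        long sum = 0;
        for (long pause : pausesNanos) {
            sum += pause;
        }
        return sum;
    }

    /** The soft real-time condition: no single pause exceeds the target. */
    boolean withinTarget() {
        return longestPauseNanos() <= targetNanos;
    }
}
```

Note that withinTarget() looks only at the longest pause. The total pause time is tracked separately because a soft real-time collector may legitimately spend more time in GC overall, as long as each individual pause stays under the bound.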

JRockit Real Time

It turns out that guaranteeing a quality of service level in the form of a maximum pause time setting is sufficient for most complex systems that require a certain degree of determinism. It is enough for the system to guarantee that latencies stay below the given bound. If this works as it should, the immediate benefit is of course that more deterministic and lower latencies can be gained without modifying an existing application.

The main selling point of JRockit Real Time is that getting deterministic latencies requires no modifications to the application—it just plugs in. The only thing that needs to be specified from the user side is the pause time target in milliseconds. Current JRockit releases have no problems maintaining single millisecond pause time targets on modern CPU architectures.

No world is perfect, however, and as we discussed in the section about concurrent GC, low latencies have to be paid for with a longer total garbage collection time. Recall that it is more difficult to garbage collect efficiently while the application is running, and if we have to interrupt the GC more often, it might be even more problematic. In practice, this has turned out not to be a problem. For most customers who want JRockit Real Time, it is more important that latencies are predictable and bounded than that the total garbage collection time goes down. Most customers feel that response times are their main problem, and that a sudden increase in pause time while a large garbage collection takes place is more harmful than an increase in the total time spent in GC.

The following graph illustrates response times over time for a running application. The application in question is a benchmark for WebLogic SIP Server, a product for the telecom industry. JRockit Real Time is not enabled. As can be seen, the deviation in response times is large.

[Graph: response times over time for the WebLogic SIP Server benchmark, JRockit Real Time not enabled]

Does the soft real-time approach work?

The soft real-time approach in JRockit Real Time has turned out to be a major winner. But how can a non-deterministic system like a garbage collector provide the degree of determinism required to never pause for longer than a single millisecond? The complete answer is that it can't, but the boundary cases are rare enough that it doesn't matter.

Of course there is no silver bullet, and there are indeed scenarios when a pause time target cannot be guaranteed. It turns out, however, that practically all standard applications, with live data sets up to about 30 to 50 percent of the heap size, can be successfully handled by JRockit Real Time with pause times shorter than, or equal to, the supported service level. This fits the majority of all Java applications that customers run. The live data set bound of 30 to 50 percent is constantly being improved by tuning and gets better with each new JRockit Real Time release. The minimum supported pause time is also continuously made lower.

In the event that JRockit Real Time isn't a perfect first fit for an application, several other things can be done to tune the behavior of the garbage collector. When looking for the cause of latencies in a Java program, there are frequently user issues involved that have nothing to do with GC. For example, it is common for a lock in the Java code to be so contended that it contributes more to program latencies than the GC itself. JRockit Mission Control contains a set of diagnostic tools that can fairly easily point out problems like this from a runtime recording.

Note

We often hear success stories from the field, such as when a trading system started making tens of thousands of dollars more per day because of lower latencies and consequently faster response times. The system could complete significantly more trades per day on the same hardware. No action other than switching VMs to JRockit Real Time was required.

The following graph shows the same benchmark run as before, with JRockit Real Time enabled and a maximum latency service level set to 10 milliseconds using the -XpauseTarget flag. Note that after the initial warm-up spikes, there is virtually no unpredictability left in the latencies.

[Graph: response times over time for the same benchmark, JRockit Real Time enabled with a 10 millisecond pause target]

Note

One might easily theorize that the spikes in the beginning of the run are caused by the VM aggressively trying to reach a steady state, for example through large amounts of code optimization. This can be true, and indeed this kind of pattern can show up. For this particular benchmark run, however, the initial latencies were actually caused by a bug in the Java application, unrelated to GC or adaptive optimization. The problem was subsequently fixed.

We also note that JRockit Real Time has no trouble fulfilling the 10 millisecond guarantee it was given. All of this comes at the affordable price of a slightly longer total time spent in garbage collection.

How does it work?

So how can JRockit deliver this kind of garbage collection performance? There are three key issues at work here:

  • Efficient parallelization
  • Splitting garbage collection work into work packets, transactions that may be rolled back or aborted if they fail to complete in time
  • Efficient heuristics

Efficient parallelization isn't a novel concept. There are several concurrent garbage collectors in existing literature, and there are few conceptual changes or technological leaps in how JRockit Real Time handles concurrency. Performance is, as always, in the details—synchronize efficiently, avoid locks if possible, make sure existing locks aren't saturated, and schedule the worker threads in an efficient manner.

The key to low latency is still to let the Java program run as much as possible, and to keep heap usage and fragmentation at a decent level. We can think of JRockit Real Time as a greedy strategy for keeping Java programs running. The basic strategy is to postpone stopping the world for as long as possible, hoping that whatever problem caused us to want to stop the world in the first place will resolve itself, or that the time required to stop the world will have gone down once stopping becomes inevitable. Hopefully, there are fewer objects to compact or sweep when we finally pause.

All garbage collector work in JRockit Real Time is split up into work packets. If we start to execute a work packet, for example a compaction job for part of the heap, with the Java program halted, and it takes too long, we can throw away whatever work it has done so far and restart the application. Sometimes the partial work can be kept, but the entire transaction doesn't have time to complete. The time to completion while the world is stopped is governed by the quality of service level for latencies that the user has specified. If a very low latency bound has been specified we might have to throw away more of a partially completed transaction in order to keep the Java program running than with a higher one.
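
The packet-as-transaction idea described above might be sketched as follows. This is not JRockit's implementation, only an illustration under stated assumptions: WorkPacket, PacketRunner, and the injected clock are hypothetical names, and a packet is assumed to be divisible into small steps so that the deadline can be checked between them.

```java
import java.util.Deque;
import java.util.function.LongSupplier;

/** Hypothetical unit of stop-the-world GC work that can be undone. */
interface WorkPacket {
    /** Performs one small step; returns true once the packet is finished. */
    boolean step();

    /** Discards partial results so the heap is left in a consistent state. */
    void rollback();
}

/** Hypothetical runner that executes packets inside a bounded pause. */
final class PacketRunner {
    private final LongSupplier clockNanos;

    /** The clock is injected so the sketch can be exercised deterministically. */
    PacketRunner(LongSupplier clockNanos) {
        this.clockNanos = clockNanos;
    }

    /**
     * Runs queued packets until the pause budget is used up.
     * A packet that cannot finish in time is rolled back and left
     * at the head of the queue for a later pause.
     * Returns the number of packets completed during this pause.
     */
    int runWithinBudget(Deque<WorkPacket> queue, long budgetNanos) {
        long deadline = clockNanos.getAsLong() + budgetNanos;
        int completed = 0;
        while (!queue.isEmpty() && clockNanos.getAsLong() < deadline) {
            WorkPacket packet = queue.peekFirst();
            boolean done = false;
            while (!done && clockNanos.getAsLong() < deadline) {
                done = packet.step();
            }
            if (!done) {
                packet.rollback(); // out of time: abort and resume the application
                break;
            }
            queue.pollFirst();
            completed++;
        }
        return completed;
    }
}
```

In production code the clock would typically be System::nanoTime. A smaller budget means the deadline check fires earlier, so more partially completed packets are rolled back, which mirrors the trade-off between a low latency bound and wasted work.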

The mark phase is, as has already been covered, simple to modify so that it runs concurrently with the Java program. However, both the sweep phase and compaction need to stop the world at times. Luckily, mark time tends to make up around 90 percent of the total garbage collection time. If the phases that need to stop the world take too long, we just have to make sure we can terminate what they are doing and restart the concurrent phase, hoping that the problem goes away in the meantime. The work packet abstraction makes it easier to implement this functionality.

There are, of course, several heuristics involved. Slight modifications to the runtime system help JRockit Real Time make more informed decisions. One example is a somewhat more complex write barrier that keeps track of the number of cards dirtied on a per-thread basis. The code takes a little more time to execute than that of a traditional generational GC write barrier, but provides better profiling data to the GC. If one thread is much more active at dirtying cards than others, it probably needs special attention. JRockit also uses the sum of all executed write barriers in a thread as a heuristic trigger.
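
A write barrier that counts per thread, as described above, could look roughly like the sketch below. It models the heap as a range of byte offsets and is purely illustrative. CountingCardTable, the 512-byte card size, and the method names are assumptions, not JRockit internals, and a real barrier would be emitted inline by the JIT compiler rather than called as a method.

```java
/** Hypothetical card table whose write barrier also gathers per-thread statistics. */
final class CountingCardTable {
    private static final int CARD_SHIFT = 9; // assumed card size: 512 bytes

    // One byte per card; 0 = clean, 1 = dirty. Updates are deliberately
    // unsynchronized: a racy double count is acceptable for a heuristic.
    private final byte[] cards;

    // Per-thread counters: [0] = cards dirtied, [1] = barriers executed.
    private final ThreadLocal<long[]> stats =
            ThreadLocal.withInitial(() -> new long[2]);

    CountingCardTable(long heapBytes) {
        long cardCount = (heapBytes + (1L << CARD_SHIFT) - 1) >>> CARD_SHIFT;
        cards = new byte[(int) cardCount];
    }

    /** Barrier to run after a reference store at the given heap offset. */
    void writeBarrier(long heapOffset) {
        long[] mine = stats.get();
        mine[1]++;                          // every executed barrier is counted
        int card = (int) (heapOffset >>> CARD_SHIFT);
        if (cards[card] == 0) {
            cards[card] = 1;
            mine[0]++;                      // only clean-to-dirty transitions count
        }
    }

    long cardsDirtiedByCurrentThread() {
        return stats.get()[0];
    }

    long barriersExecutedByCurrentThread() {
        return stats.get()[1];
    }
}
```

The extra work compared with a plain card-marking barrier is the thread-local lookup and the two increments, which is the kind of small added cost the text refers to. A collector could compare the per-thread counts to spot a thread that dirties far more cards than the others.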
