Chapter 5. Benchmarking and Tuning

This chapter introduces benchmarking as a fundamental method to measure the performance of a Java application. It also covers the JVM side of performance, discussing how the virtual machine can be made to execute Java faster.

Benchmarking can, and should, be used to regression test an application during development, to ensure that new code modifications do not degrade performance. Time and time again during our careers, we have seen unexpected performance regressions crop up from seemingly innocent changes. Continuous, preferably automated, benchmarking is the best way to prevent this from happening. Each software project should have a performance goal, and benchmarking is the way to make sure that this goal is achieved.

Once we have discussed the hows and whys of good benchmarks, we will go on to discuss how to draw conclusions from what is measured, and when it is better to change the application itself rather than just reconfigure the JVM by tuning parameters and setup. Tuning will be discussed in general terms, but the concrete examples will be JRockit specific.

You will learn the following from this chapter:

  • The relevance of benchmarking an application: finding bottlenecks, avoiding regressions, and making sure that performance goals for your software are achieved.
  • How to create a benchmark appropriate for a particular problem set. This includes deciding what to measure and making sure that the benchmark actually does that. This also includes correctly extracting core application functionality into a smaller benchmark.
  • How some of the various industry-standard benchmarks for Java work.
  • How to use benchmark results to tune an application and the JVM for increased runtime performance.
  • How to recognize bottlenecks in a Java program and how to avoid them. This includes standard pitfalls, common mistakes, and false optimizations.

    Note

    Throughout this chapter, we will, among other things, discuss the SPEC benchmarks. SPEC (www.spec.org) is a non-profit organization that establishes, maintains, and endorses standardized benchmarks to evaluate performance for the newest generation of computing systems. Product and service names mentioned within this chapter are SPEC's intellectual property and are protected by trademarks and service marks.

Reasons for benchmarking

Benchmarking is always needed in a complex environment. There are several reasons for benchmarking, for example, making sure that an application is actually usable in the real world, or detecting and avoiding performance regressions when new code is checked in. Benchmarking can also help optimize a complex application by breaking it down into more manageable problem domains (specialized benchmarks) that are easier to optimize. Finally, benchmarking should not be underestimated as a tool for marketing purposes.

Performance goals

Benchmarking is relevant in software development on all levels, from OEM or Java Virtual Machine vendors to developers of standalone Java applications. It is too often the case in software development that while the functionality goals of an application are well specified, no performance goals are defined at all. Without performance goals and benchmarks in place to track the progress of those goals, the end result may be stable but completely unusable. During our careers we have seen this many times, including during the development of business critical systems. If a critical performance issue is discovered too late in the development cycle, the entire application may need to be scrapped.

Note

Performance benchmarking needs to be a fundamental part of any software development process—set a performance goal, create benchmarks for it, examine the benchmarking results, tune the application, and repeat until done.

To avoid these kinds of embarrassments, performance must be a fundamental requirement of a system throughout the entire software development process, and it needs to be verified with regular benchmarking. The importance of application performance should never be underestimated. Inadequate performance should be treated like any other bug.

Performance regression testing

An application that is developed without a good Quality Assurance (QA) infrastructure in place from day one is likely to be prone to bugs and instabilities. More specifically, an application without good functional unit tests that run whenever new code is checked in is likely to break, no matter how well the new code has been reviewed. This is conventional wisdom in software engineering.

The first and foremost purpose of regression testing is to maintain stability. Whenever a new bug is discovered along with its fix, it is good practice to check in a regression test in the form of a reproducer, possibly based on code left over from debugging the problem. The ideal reproducer is a program with a few lines of code in a main function that triggers whatever was wrong with the system, but reproducers can also be more complex. It is almost always worth spending the time to turn a more complex reproducer into a regression test. Keeping the regression test running upon new source code check-ins will prevent the particular problem from recurring, potentially saving future hours of debugging an issue that has already been fixed at least once. Naturally, not all functionality tests break down easily into simple regression tests or self-contained reproducers. Extensive runs of large, hard-to-setup applications are still often needed to validate stability.
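To make this concrete, the following is a minimal sketch of a reproducer distilled into a regression test. The SimpleCache class and the stale-value bug it guards against are purely hypothetical stand-ins for whatever component the original defect lived in; a real test would exercise the actual application code.

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical regression test: the fixed bug was that updating an
    // existing key left a stale value behind in a cache.
    public class StaleValueRegressionTest {

        static class SimpleCache {
            private final Map<String, Integer> map = new HashMap<String, Integer>();
            void put(String key, int value) { map.put(key, value); }
            int get(String key)             { return map.get(key); }
        }

        public static void main(String[] args) {
            SimpleCache cache = new SimpleCache();
            cache.put("order-1", 100);
            cache.put("order-1", 200);   // the update that once exposed the bug

            int value = cache.get("order-1");
            if (value != 200) {
                // A non-zero exit code lets an automated QA harness flag the regression.
                System.err.println("REGRESSION: expected 200, got " + value);
                System.exit(1);
            }
            System.out.println("OK");
        }
    }

A test of this size can run on every check-in with negligible cost, which is exactly the property that makes it worth the effort of distilling.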

The other side of regression testing is to maintain performance. For some reason, this has not been as large a part of the conventional wisdom in software engineering as functionality testing, but it is equally important. New code can introduce performance regressions just as easily as functional problems. The former are harder to spot, as performance degradations normally don't cause a program to actually break down and crash. Thus, including performance tests as part of the QA infrastructure makes a lot of sense. Discovering too late that application performance has gone down requires plenty of detective work to figure out exactly where the regression happened. Possibly this involves going over a large number of recent source code check-ins, recompiling the application at different revisions, and rerunning the affected application until the guilty check-in is spotted. Integrating simple performance regression tests into the source code check-in criteria therefore makes as much sense as unit tests or regression tests for functionality.

Note

A good QA infrastructure should contain benchmarking for the purpose of detecting slowdowns as well as traditional functionality tests for the purpose of detecting bugs. A performance regression should be treated as no less serious than any other traditional bug.

Performance regression testing is also good for detecting unexplained performance boosts. While this might seem like a good thing, it sometimes indicates that a bug has been introduced; for example, important code may no longer be executed. In general, all unexpected performance changes should be investigated, and performance regression tests should trigger warnings both when performance unexpectedly goes up and when it goes down.
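The following is a minimal sketch of such a two-sided check. The baseline value, the tolerance, and the measureScore() placeholder are hypothetical; a real harness would obtain the baseline from results recorded by the QA infrastructure and measure the actual workload being guarded.

    // Warn both when a score drops below the recorded baseline and when it
    // unexpectedly rises above it, since either change needs investigation.
    public class PerformanceBandCheck {
        static final double BASELINE_OPS_PER_SEC = 50000.0; // from a known-good build
        static final double TOLERANCE = 0.10;               // accept +/- 10%

        public static void main(String[] args) {
            double score = measureScore();
            if (score < BASELINE_OPS_PER_SEC * (1.0 - TOLERANCE)) {
                System.err.println("WARNING: possible regression: " + score + " ops/s");
                System.exit(1);
            }
            if (score > BASELINE_OPS_PER_SEC * (1.0 + TOLERANCE)) {
                // A surprising speedup may mean that important code is no longer executed.
                System.err.println("WARNING: unexpected speedup, verify correctness: " + score + " ops/s");
                System.exit(2);
            }
            System.out.println("Score within expected band: " + score + " ops/s");
        }

        // Placeholder: replace with a measurement of the operation being guarded,
        // for example a micro benchmark of the kind discussed later in this chapter.
        static double measureScore() {
            return 50000.0;
        }
    }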

A cardinal rule when regression testing performance is to have as many points of measurements as possible, in order to quickly detect regressions. One per source control modification or check-in is ideal, if the infrastructure and test resources allow for it.

Let us, for a moment, disregard large and complex tests in the QA infrastructure and concentrate on simple programs that can serve as unit tests. While a unit test for functionality usually takes the form of a small program that either works or doesn't, a unit test for performance is a micro benchmark. A micro benchmark should be fairly simple to set up and should run only for a short time, and it can be used to quickly determine whether a performance requirement is met. We will extensively discuss techniques for implementing micro benchmarks, as well as more complex benchmarks, later in this chapter.
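As a minimal, hedged sketch of what such a micro benchmark can look like, the following program warms up the measured code so that the JIT compiler has had a chance to optimize it before measurement starts, and then reports the throughput of a single isolated operation. The array-summing workload is a hypothetical stand-in for real application code, and a production-quality harness would need further care with issues such as dead code elimination and reaching a steady state, of the kind covered when micro benchmarks are discussed in more detail later in this chapter.

    // A self-contained micro benchmark sketch: warm up, then measure throughput.
    public class SummingMicroBenchmark {
        static volatile long sink;   // consume results so the JIT cannot remove the work

        // The operation under test; a stand-in for real application code.
        static long work(int[] data) {
            long sum = 0;
            for (int i = 0; i < data.length; i++) {
                sum += data[i];
            }
            return sum;
        }

        // Run the operation repeatedly for the given duration and return ops/s.
        static double run(int[] data, long millis) {
            long iterations = 0;
            long end = System.nanoTime() + millis * 1000000L;
            while (System.nanoTime() < end) {
                sink = work(data);
                iterations++;
            }
            return iterations / (millis / 1000.0);
        }

        public static void main(String[] args) {
            int[] data = new int[10000];
            for (int i = 0; i < data.length; i++) {
                data[i] = i;
            }
            run(data, 2000);                        // warmup round, result discarded
            double opsPerSecond = run(data, 5000);  // measured round
            System.out.println("Throughput: " + opsPerSecond + " ops/s");
        }
    }

Comparing the reported throughput against a required threshold turns a benchmark like this into a pass/fail unit test for performance.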

Easier problem domains to optimize

Another reason for keeping a collection of relevant benchmarks around is that performance is a difficult thing to quantify—are we talking about high throughput or low latency? Surely we can't have both at the same time, or can we? If so, how can we ensure that our application is good enough in both these problem domains?

While running an application with a large set of inputs and workloads is a good idea for the general QA perspective, it might provide too complex a picture to illustrate where the application performs well and where it requires improvements.

A lot of trouble can be avoided if a program can be broken down into sub-programs that can be treated as individual benchmarks, and if each of them can be shown to perform well. Not only is it easier to understand the different aspects of the performance of the program, it is also easier to work on improving these aspects in a simpler problem domain that only measures one thing at a time. Furthermore, it is simpler to verify that code optimizations actually result in performance improvements. It should be common sense for an engineer that the fewer factors that are affected at once, the easier it is to measure and draw conclusions from the results.

Also, if a simple and self-contained benchmark correctly reflects the behavior of a larger application, performance improvements to the benchmarks will most likely be applicable to the larger application as well. In that case, working with the benchmark code instead of with the complete application may significantly speed up the development process.

Commercial success

Finally, a large number of industry-standard benchmarks for various applications and environments exist on the Internet. These are useful for verifying and measuring performance in a specific problem domain, for example, XML processing, MP3 decoding, or processing database transactions.

Industry benchmarks also provide standards against which to measure the performance of an application relative to competing products. Later in this chapter, we will introduce a few common industry-standard benchmarks, targeting both the JVM itself and Java applications at various levels of the stack.

Note

Marketing based on standard benchmark scores is naturally a rather OEM-centric (or JVM-centric) practice. It can also be important when developing a competing product in a market segment with many vendors. It is not as relevant for more unique third-party software.

Being the world leader on a recognized benchmark makes for good press release material and excellent marketing.
