What to think of when creating a benchmark

Creating a benchmark, whether large or small, for an application without knowing much about how the application behaves is a fairly futile exercise. In order to understand which benchmarks may be relevant for performance testing an application, the application needs to be well profiled.

There are several tools available for examining Java applications. They work either by creating a special version of the application, with instrumentation code inserted into the bytecode, or by analyzing an unmodified program online. The JRockit Mission Control suite is an example of the latter. The next part of this book explains in detail how to use the JRockit Mission Control suite for profiling purposes.

Profiling an application will reveal things such as in which methods most of the run time is spent, common patterns in garbage collection, and which locks are often contended versus those that do not matter for overall performance.

Naturally, profiling or instrumenting an application does not necessarily require advanced tools. In some cases, it can be as simple as using System.out.println to occasionally output statistics to the console and examine them.
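For example, a minimal sketch of this approach might be a counter that the application updates and occasionally prints. The class name, counter, and reporting interval below are arbitrary choices for illustration:

import java.util.concurrent.atomic.AtomicLong;

// A minimal sketch of println-based instrumentation, assuming a hypothetical
// application that calls SimpleStats.recordRequest() from its request
// handling code. The ten second reporting interval is an arbitrary choice.
public class SimpleStats {
    private static final AtomicLong requests = new AtomicLong();
    private static volatile long lastReport = System.currentTimeMillis();

    public static void recordRequest() {
        long count = requests.incrementAndGet();
        long now = System.currentTimeMillis();
        if (now - lastReport > 10000) {
            lastReport = now;
            System.out.println(count + " requests served so far");
        }
    }
}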

Once enough data about an application has been collected, suitable subsets of the application can hopefully be isolated for benchmarking. However, before creating a benchmark, we need to determine if it is relevant, and whether the benchmark and the application are subject to the same kind of performance issues or not.

When normalizing a benchmark against an application, warm-up time is an important factor to consider. If an application requires warm-up time to reach a steady state, as most server-side applications do, the benchmark might require this as well. Can the application be turned into a small self-contained benchmark with lower startup time, for the sake of simplicity? Is it still a case of comparing apples with apples if the benchmarking time is shrunk to five minutes from an application runtime of an hour, including warm-up, or does this kind of scale-down not work? Has the benchmark turned into an application with a completely different kind of behavior?

Note

An example of a complex application that needs to be broken down into benchmark domains is an application server, which typically is a collection of a vast number of subcomponents with varying functions.

The ideal benchmark is a small self-contained program that emulates a relevant part of an application. If this isn't easy to implement, maybe individual subsystems of the application can still be broken out as "black boxes" that can be fed with limited subsets of the input data. Then the subsystems may still form a simpler basis for measurements than the entire application, which might be hard to set up and require multiple input sources.

Measuring outside the system

For anything but very small benchmarks and small pieces of proof of concept code, it is usually a good idea to measure "outside the system". By that, we mean running the benchmark with some kind of external driver. The driver is an application independent of the actual benchmark code on which the performance evaluation is to take place.

Working with a driver usually means that the driver injects a workload into the benchmark, for example over the network. The entire response time, including network traffic, for the benchmark to run its payload code and respond is then measured.

Using a driver has the benefit that response times can be accurately measured without mixing up the measurements with the data generation or load generation. The driver can also be moved to another machine, and can be run under lower total load, to make sure that neither data generation nor load generation is a bottleneck in the benchmark.
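As a sketch of the idea, a driver can be as simple as a loop that sends requests to the system under test over the network and records the complete round-trip times. The target URL and request count below are placeholder values:

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// A minimal driver sketch: inject HTTP requests into the system under test
// from a separate machine and measure complete round-trip times from the
// outside. The target URL and the number of requests are placeholders.
public class SimpleDriver {
    public static void main(String[] args) throws Exception {
        URL target = new URL("http://benchmark-host:8080/payload");
        int requests = 1000;
        long totalMillis = 0;
        for (int i = 0; i < requests; i++) {
            long t0 = System.currentTimeMillis();
            HttpURLConnection conn = (HttpURLConnection) target.openConnection();
            InputStream in = conn.getInputStream();
            while (in.read() != -1) {
                // drain the response so that the full round trip is measured
            }
            in.close();
            totalMillis += System.currentTimeMillis() - t0;
        }
        System.out.println("Average response time: "
                + (totalMillis / (double) requests) + " ms");
    }
}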

The need for externalizing certain measurements can perhaps be more simply illustrated by a smaller example. Consider the following benchmark that measures how fast an implementation of the MD5 message digest algorithm runs on random data:

import java.util.Random;
import java.security.*;

public class Md5ThruPut {
    static MessageDigest algorithm;
    static Random r = new Random();
    static int ops;

    public static void main(String args[]) throws Exception {
        algorithm = MessageDigest.getInstance("MD5");
        algorithm.reset();
        long t0 = System.currentTimeMillis();
        test(100000);
        long t1 = System.currentTimeMillis();
        System.out.println((long)ops / (t1 - t0) + " ops/ms");
    }

    public static void test(int size) {
        for (int i = 0; i < size; i++) {
            byte b[] = new byte[1024];
            r.nextBytes(b);
            digest(b);
        }
    }

    public static void digest(byte[] data) {
        algorithm.update(data);
        algorithm.digest();
        ops++;
    }
}

If our goal is only to measure the performance of the MD5 algorithm, the previous benchmark is less than optimal, as the generation of the random input data is included in the time measured for each operation. So, the total runtime will reflect not only the performance of the MD5 algorithm, but also the performance of the random number generator. This was probably not intended. A better version of the MD5 benchmark would look like the following program:

import java.util.Random;
import java.security.*;

public class Md5ThruPutBetter {
    static MessageDigest algorithm;
    static Random r = new Random();
    static int ops;
    static byte[][] input;

    public static void main(String args[]) throws Exception {
        algorithm = MessageDigest.getInstance("MD5");
        algorithm.reset();
        generateInput(100000);
        long t0 = System.currentTimeMillis();
        test();
        long t1 = System.currentTimeMillis();
        System.out.println((long)ops / (t1 - t0) + " ops/ms");
    }

    public static void generateInput(int size) {
        input = new byte[size][];
        for (int i = 0; i < size; i++) {
            input[i] = new byte[1024];
            r.nextBytes(input[i]);
        }
    }

    public static void test() {
        for (int i = 0; i < input.length; i++) {
            digest(input[i]);
        }
    }

    public static void digest(byte[] data) {
        algorithm.update(data);
        algorithm.digest();
        ops++;
    }
}

Measuring several times

It is also of the utmost importance to collect a large amount of statistics before drawing any conclusions from a benchmark. The simplest way to do this is to repeat the measurements many times, that is, to do multiple benchmark runs. This gives a better grasp of the standard deviation in the benchmark results. A benchmark score is only meaningful if the typical deviation for that score in your benchmarking setup is known.
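For instance, given the scores from a number of runs, the mean and standard deviation can be computed along the following lines (the scores below are made-up values):

// Sketch: compute mean and standard deviation over a set of benchmark scores.
// The scores array contains made-up example values.
public class BenchmarkStats {
    public static void main(String[] args) {
        double[] scores = { 1020.0, 998.0, 1011.0, 1043.0, 989.0 };
        double sum = 0;
        for (double score : scores) {
            sum += score;
        }
        double mean = sum / scores.length;
        double squaredDiff = 0;
        for (double score : scores) {
            squaredDiff += (score - mean) * (score - mean);
        }
        double stdDev = Math.sqrt(squaredDiff / scores.length);
        System.out.println("mean = " + mean + ", standard deviation = " + stdDev);
    }
}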

If possible, multiple runs should also be spread over multiple equivalent machines. That way, configuration errors can be discovered and removed from the data. For example, a forgotten load generator may be running on one of the benchmarking machines, contributing to lower scores. If all runs take place on that machine, erroneous measurements will be recorded.

Micro benchmarks

Micro benchmarks are benchmarks that contain a very small amount of code and only measure a small piece of functionality, for example, how quickly the JVM multiplies instances of java.math.BigInteger or how quickly it does AES encryption. Micro benchmarks are simple to write and often require just a single or a few function calls that contain the algorithmic payload.

Note

Micro benchmarks are convenient, as they are simple to write and simple to run. They may prove invaluable in understanding a particular bottleneck in a very large application. They form an excellent backbone both for regression testing against performance loss and for optimizing the execution of known problems in the code, as discussed in the first section of this chapter.

We strongly encourage any application developer to keep a number of micro benchmarks around for regression testing, the same way we encourage creating and keeping unit tests around when fixing bugs. In both cases, this is to verify that a resolved issue doesn't resurface later and cause a regression.

If any large application could be reduced to a number of micro benchmarks, or at least "mini benchmarks", life would be easier. Sadly, for modern complex applications this is rarely the case. However, almost always, a fair (but for industrial purposes, incomplete) number of micro benchmarks can be created from the understanding gained by profiling an application and figuring out what it does. The following code is a function from a real-world micro benchmark that is used as a performance regression test for the JRockit JVM:

public Result testArrayClear(Loop loop, boolean validate) {
    long count = 0;
    OpsPerMillis t = new OpsPerMillis("testArrayClear");
    t.start();
    loop.start();
    while (!loop.done()) {
        int[] intArray = new int[ARRAYSIZE];
        System.arraycopy(this.sourceIntArray, 0, intArray, 0, ARRAYSIZE);
        // Introduce side effects:
        // this call prevents dead code elimination
        // from removing the entire allocation.
        escape(intArray);
        count++;
    }
    t.end();
    return new OpsPerMillis(count, t.elapsed());
}

Java requires that objects are cleared on allocation, so that all their fields are initialized to their default values. The code optimizer in JRockit should be able to detect that the newly allocated array intArray is immediately and completely overwritten by data, and so, as it is non-volatile, need not be cleared. If, for some reason, this code optimization starts to fail, this benchmark will take longer to complete, and the QA infrastructure will trigger a warning when examining the result database.

Micro benchmarks can (and should) also be created from scratch for problems that are known to be performance-critical in an application. For example, for an XML parser, it makes sense to create a small benchmark that operates on a set of auto-generated XML files of different sizes. For a math package that wants to use java.math.BigDecimal, it makes sense to write a couple of small self-contained applications that operate on BigDecimal instances.

Creating micro benchmarks that aren't valid or that don't produce useful results for the problem set is not just a waste of time and effort, but it is also potentially harmful if the benchmark is believed to accurately measure all important aspects of a problem. For example, testing an implementation of java.util.HashMap by just creating a HashMap and filling it up with data might not be good enough. How long does rehashing take? Extracting elements? What about collisions in HashMaps of different sizes?
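A slightly more complete HashMap micro benchmark might, as a sketch, time insertion and lookup separately and over more than one map size. The sizes, key types, and output format below are arbitrary choices:

import java.util.HashMap;

// Sketch: time HashMap insertion and lookup separately, for several map sizes.
// Sizes, key types, and iteration counts are arbitrary example values.
public class HashMapMicro {
    public static void main(String[] args) {
        int[] sizes = { 1000, 100000, 1000000 };
        for (int size : sizes) {
            HashMap<Integer, Integer> map = new HashMap<Integer, Integer>();
            long t0 = System.currentTimeMillis();
            for (int i = 0; i < size; i++) {
                map.put(i, i);          // includes any rehashing that occurs
            }
            long t1 = System.currentTimeMillis();
            long hits = 0;
            for (int i = 0; i < size; i++) {
                if (map.get(i) != null) {
                    hits++;
                }
            }
            long t2 = System.currentTimeMillis();
            System.out.println("size=" + size + " put=" + (t1 - t0)
                    + " ms get=" + (t2 - t1) + " ms hits=" + hits);
        }
    }
}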

Similarly, testing a java.math.BigDecimal implementation by just performing a large number of additions is almost certainly not good enough. What if there is a fatal performance flaw in the division algorithm?
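A sketch of a less one-sided BigDecimal micro benchmark might time several operations separately, so that a flaw in one of them is not hidden by the others. The operands and iteration counts below are arbitrary example values:

import java.math.BigDecimal;
import java.math.MathContext;

// Sketch: exercise several BigDecimal operations separately, so that a
// performance flaw in one of them is not hidden by the others. The operands
// and iteration counts are arbitrary example values.
public class BigDecimalMicro {
    // keep the results alive so the loops cannot be removed as dead code
    static BigDecimal sumResult;
    static BigDecimal quotientResult;

    public static void main(String[] args) {
        BigDecimal a = new BigDecimal("12345.6789");
        BigDecimal b = new BigDecimal("3.14159");
        int iterations = 1000000;

        long t0 = System.currentTimeMillis();
        BigDecimal sum = BigDecimal.ZERO;
        for (int i = 0; i < iterations; i++) {
            sum = sum.add(a);
        }
        long t1 = System.currentTimeMillis();
        BigDecimal quotient = BigDecimal.ONE;
        for (int i = 0; i < iterations; i++) {
            quotient = a.divide(b, MathContext.DECIMAL64);
        }
        long t2 = System.currentTimeMillis();

        sumResult = sum;
        quotientResult = quotient;
        System.out.println("add: " + (t1 - t0) + " ms, divide: " + (t2 - t1) + " ms");
    }
}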

Note

When creating a micro benchmark, the main rule is always to understand what you are measuring. Verify that the benchmark is valid and that the result is useful.

While the previous two examples might seem somewhat artificial, they still illustrate the kind of thinking that can lead you astray when creating benchmarks. A somewhat more relevant example is benchmarking an important synchronized operation in a class library. If the lock in the synchronized operation is contended in the real application, this obviously won't show up in a single-threaded micro benchmark. It may seem trivial that going from many threads to few fundamentally changes the lock behavior, but this is one of the main reasons why benchmarks fail to represent real workloads. Make sure that any lock in a benchmark is actually stressed by many threads, if that is the case in the real application.
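For example, a sketch of a lock micro benchmark that actually generates contention might start several threads that all hammer the same synchronized operation. The thread and iteration counts below are arbitrary example values:

// Sketch: stress a synchronized operation from several threads so that lock
// contention is actually part of what is measured. Thread and iteration
// counts are arbitrary example values.
public class ContendedLockMicro {
    private static final Object lock = new Object();
    private static long counter;

    public static void main(String[] args) throws Exception {
        int threads = 8;
        final int iterations = 1000000;
        Thread[] workers = new Thread[threads];
        long t0 = System.currentTimeMillis();
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(new Runnable() {
                public void run() {
                    for (int j = 0; j < iterations; j++) {
                        synchronized (lock) {
                            counter++;   // the contended critical section
                        }
                    }
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();
        }
        long t1 = System.currentTimeMillis();
        System.out.println(counter + " ops in " + (t1 - t0) + " ms");
    }
}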

Finally, it might make sense to try to eliminate parts of the runtime that aren't relevant to the problem from the benchmark. If you want to measure pure code performance for some algorithm, it probably makes little sense to create large numbers of objects and stress the garbage collector at the same time. Or at least, pick a good garbage collection strategy that won't interfere too much with the execution of the algorithm (large heap, no nursery, and optimized for throughput).

Micro benchmarks and on-stack replacement

Another common benchmarking mistake is to assume that any JVM will perform on-stack replacement, i.e. that any method can be optimized and replaced in the middle of its execution. As was also mentioned in Chapter 2 on adaptive code generation, not all VMs perform on-stack replacement. Thus, if the entire payload of the benchmark exists in a loop in the main function, code replacement may never have a chance to take place, even though all relevant methods are flagged as hot and reoptimized into more optimal versions.

The following benchmark executes some kind of complex operation in each iteration of a loop in main. A JVM that, like JRockit, doesn't use on-stack replacement may well select main for optimization and make the operation in the loop execute a lot faster. However, since main never returns during the execution of the benchmark, the new code is never run. Moving the main benchmark operation into a separate method, and calling that method in each iteration of the loop, would solve the problem, as sketched after the listing.

public class BadMicro {
    public static void main(String args[]) {
        long t0 = System.currentTimeMillis();
        for (int i = 0; i < 1000000; i++) {
            // complex benchmarking operation
        }
        long t1 = System.currentTimeMillis();
        System.out.println("Time: " + (t1 - t0) + " ms");
    }
}
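A version restructured along those lines could, as a sketch, look like the following, where each call to the payload method can enter newly optimized code:

public class BetterMicro {
    public static void main(String args[]) {
        long t0 = System.currentTimeMillis();
        for (int i = 0; i < 1000000; i++) {
            work();   // each call may enter a newly optimized version of work
        }
        long t1 = System.currentTimeMillis();
        System.out.println("Time: " + (t1 - t0) + " ms");
    }

    private static void work() {
        // complex benchmarking operation
    }
}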

Micro benchmarks and startup time

Recall, from the chapter on code generation, that JVM startup time is dependent on the time it takes to load initial classes and generate bootstrap code. If the intention is to measure only the runtime of the benchmark payload, startup time needs to be subtracted from the overall runtime. The problem can also be solved by having the benchmark perform enough operations so that startup time won't be a factor.

It is important to realize that if a micro benchmark does its work quickly and then exits, the JVM startup time contributes significantly to the overall benchmark runtime. A micro benchmark that measures the time it takes to multiply 100 floating point numbers will be all startup time and nothing else, but if it multiplies trillions of floating point numbers instead, startup time won't matter.

This point is somewhat more relevant for JVMs, such as JRockit, that lack interpreters and consequently start up slightly slower than JVMs that use bytecode interpretation for cold or previously unseen code.

So, it is important to start timing the micro benchmark only when the main workload function is called, and not from the start of main. Similarly, using an external library for timing that measures the time a Java program takes from start to finish will also implicitly factor in startup time.

Note

There may of course be cases where startup time is a very relevant thing to measure, even in a micro benchmark.

The runtime of the benchmark in the following example will almost certainly only be startup time, since the required time to get the VM up and running significantly exceeds the time required to add 1,000 random numbers.

import java.util.Random;

public class AnotherBadMicro {
    static Random r = new Random();
    static int sum;

    public static void main(String args[]) {
        long t0 = System.currentTimeMillis();
        int s = 0;
        for (int i = 0; i < 1000; i++) {
            s += r.nextInt();
        }
        sum = s;
        long t1 = System.currentTimeMillis();
        System.out.println("Time: " + (t1 - t0) + " ms");
    }
}

Give the benchmark a chance to warm up

Different VMs use different optimization heuristics that trigger at different times. So, it might be possible to improve the quality of measurements and decrease deviation by doing a small number of "dry runs", or warm-up rounds, before starting the actual measurements. The warm-up gives the VM a chance to gather runtime feedback from the executing code and perform optimizations in the relevant places before measurements start. This ensures that the measurements are done in a steady, optimized state.
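A sketch of this pattern might look as follows, where the numbers of warm-up and measurement iterations, as well as the payload, are arbitrary example values:

// Sketch: run untimed warm-up rounds before the timed measurement rounds, so
// the JVM has a chance to optimize the workload first. Iteration counts and
// the payload method are arbitrary example values.
public class WarmupMicro {
    public static void main(String[] args) {
        for (int i = 0; i < 10000; i++) {
            work();                       // warm-up, results discarded
        }
        long t0 = System.currentTimeMillis();
        for (int i = 0; i < 100000; i++) {
            work();                       // measured rounds
        }
        long t1 = System.currentTimeMillis();
        System.out.println("Time: " + (t1 - t0) + " ms");
    }

    private static double sink;

    private static void work() {
        double d = 0;
        for (int i = 0; i < 1000; i++) {
            d += Math.sqrt(i);
        }
        sink = d;   // keep the result alive to avoid dead code elimination
    }
}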

Many industry standard benchmarks, such as the SPECjvm2008 suite mentioned later in this chapter, come with warm-up rounds built into the benchmark.
