Microbenchmarking

Microbenchmarking is measuring the performance of a small code fragment. When we want to optimize our code, we have to measure it. Without measurement, code optimization is like shooting blindfolded: you will not hit the target, but you will likely shoot somebody else.

Shooting is a good metaphor because you should usually not do it, but when you really have to, you have no choice. If there is no performance issue and the software meets the requirements, then any optimization, including speed measurement, is a waste of money. This does not mean that you are encouraged to write slow and sloppy code. When we measure performance, we compare it against a requirement, and the requirement is usually at the user level, something like: the response time of the application should be less than 2 seconds. To do such a measurement, we usually create load tests in a test environment and use different profiling tools that tell us what consumes the most time and where we should optimize. Many times, it is not only the Java code that needs tuning, but also the configuration: a larger database connection pool, more memory, and similar things.

Microbenchmarking is a different story. It is about the performance of a small Java code fragment and, as such, is closer to Java programming itself.

It is rarely needed, and before starting a microbenchmark in a real commercial environment, we have to think twice. A microbenchmark is a luring tool: it tempts us to optimize something small without knowing whether that code is worth optimizing at all. When we have a huge application with several modules running on several servers, how can we be sure that improving one special part of it drastically improves the overall performance? Will it pay back in increased revenue that generates enough profit to cover the cost we burned into performance testing and development? Statistically, it is almost certain that such an optimization, including the microbenchmarking, will not pay off.

Once, I was maintaining the code of a senior colleague. He had created highly optimized code to recognize the configuration keywords present in a file. The program structure represented a decision tree based on the characters of the key string. If a keyword in the configuration file was misspelled, the code threw an exception at the very first character where it could decide that the keyword could not be correct.
To insert a new keyword, one had to walk through the code structure to find the place where the new keyword first differed from the already existing ones and extend the deeply nested if/else structures. The list of handled keywords could only be read from the comments, which listed all the keywords that he did not forget to document. The code worked blazingly fast, probably saving a few milliseconds of the servlet application's startup time. The application was started only after system maintenance, every few months.
You feel the irony, don't you? Seniority is not always the number of years. Lucky ones can save their inner child.

So when to use microbenchmarking? I can see two areas:

  • You identified the code segment that eats most of the resources in your application and the improvement can be tested by microbenchmarks
  • You cannot identify the code segment that will eat most of the resources in an application but you suspect it

The first is the usual case. The second is when you develop a library, and you just do not know all the applications that will use it. In this case, you will try to optimize the part that you think is the most crucial for most of the imagined, suspected applications. Even in that case, it is better to take some sample applications that are created by users of your library and collect some statistics about the use.

Why should we talk about microbenchmarking in detail? What are the pitfalls? Benchmarking is an experiment. The first programs I wrote were TI calculator code, and I could simply count the number of steps the program made to factor two large (10-digit, in those days) prime numbers. Even at that time, I used an old Russian stopwatch to measure the time, being too lazy to calculate the number of steps. Experiment and measurement were easier.

Today, you could not calculate the number of steps the CPU makes even if you wanted to. There are so many small factors that may change the performance of an application, and that are out of the programmer's control, that calculating the steps is impossible. We are left with measurement, and we gain all the problems that come with measurement.

What is the biggest problem? We are interested in something, say X, and we usually cannot measure that directly. So, we measure Y instead and hope that the values of Y and X are coupled. We want to measure the length of a room, but instead we measure the time it takes for a laser beam to travel from one end to the other. In this case, the length X and the time Y are strongly coupled. Many times, X and Y correlate only more or less. Most of the time, when people do measurements, the X and Y values have no relation to each other at all. Still, people bet their money, and more, on decisions backed by such measurements.

Microbenchmarking is no different. The first question is: how do we measure the execution time? Small code runs for a short time, and System.currentTimeMillis() may simply return the same value when the measurement starts and when it ends, because we are still in the same millisecond. Even if the execution takes 10 ms, the error of the measurement is still at least 10%, purely because of the quantization of the time as we measure it. Luckily, there is System.nanoTime(). But is there? Just because the name says it returns the number of nanoseconds elapsed from a specific start time, it does not necessarily mean that it really can.
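To make the quantization problem tangible, here is a minimal sketch of the naive approach. The somethingSmall() method is a placeholder of my own for whatever fragment we want to measure; it is not part of the example code of this chapter:

import static java.lang.System.currentTimeMillis;
import static java.lang.System.nanoTime;

public class NaiveTiming {
    public static void main(String[] args) {
        long startMs = currentTimeMillis();
        somethingSmall();
        long elapsedMs = currentTimeMillis() - startMs;
        // may well print 0, because start and end fall into the same millisecond
        System.out.println("elapsed ms: " + elapsedMs);

        long startNs = nanoTime();
        somethingSmall();
        long elapsedNs = nanoTime() - startNs;
        // nanosecond resolution in the API, but not necessarily nanosecond accuracy
        System.out.println("elapsed ns: " + elapsedNs);
    }

    private static void somethingSmall() {
        Math.sqrt(42.0); // stand-in for the code fragment under test
    }
}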

It very much depends on the hardware and on the implementation of the method in the JDK. It is called nano because this is a precision that we cannot certainly reach. If it were microseconds, then some implementations might be limited by the definition, even if, on the specific hardware, there is a more precise clock. However, this is not only about the precision of the available hardware clock; it is also about how precisely we can read it.

Let's remember the heartbeat of the bureaucrats and the time needed to read something from memory. Calling a method such as System.nanoTime() is like asking the bellboy in a hotel to run down from the second floor to the lobby, peek out to look at the clock on the tower on the other side of the road, come back, and tell us, with seconds precision, what the time was when we asked. Nonsense. We should know the precision of the clock on the tower and the speed of the bellboy running from the floor to the lobby and back. This is a bit more than just calling nanoTime. This is what a microbenchmarking harness does for us.
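If you want a feeling for this on your own machine, a small, rough experiment like the following sketch can estimate how long a nanoTime call itself takes and what the smallest nonzero difference is that two back-to-back calls report. It is only an illustration, not a replacement for a harness; it suffers from exactly the JIT and warm-up effects discussed later in this section:

public class NanoTimeProbe {
    public static void main(String[] args) {
        // smallest nonzero difference reported by two back-to-back calls
        long granularity = Long.MAX_VALUE;
        for (int i = 0; i < 1_000_000; i++) {
            long t1 = System.nanoTime();
            long t2 = System.nanoTime();
            if (t2 > t1) {
                granularity = Math.min(granularity, t2 - t1);
            }
        }
        System.out.println("observed granularity (ns): " + granularity);

        // average cost of a single nanoTime call
        int n = 10_000_000;
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            System.nanoTime();
        }
        long end = System.nanoTime();
        System.out.println("average call cost (ns): " + (double) (end - start) / n);
    }
}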

The Java Microbenchmark Harness (JMH) has been available as a library for some time. It is developed by Oracle and used to tune the performance of some core JDK classes, and with Java 9, these performance measurements and their results become part of the distributed JDK. This is good news for those who develop the Java platform for new hardware, but also for ordinary developers, because it means that JMH is, and will be, supported by Oracle.

"JMH is a Java harness to build, run, and analyze nano/micro/milli/macro benchmarks written in Java and other languages targeting the JVM." (quote from the official site of JMH, http://openjdk.java.net/projects/code-tools/jmh/).

You can run JMH as a separate project, independent of the actual project you measure, or you can store the measurement code in a separate directory. The harness compiles against the production class files and executes the benchmarks. The easiest way, as I see it, is to use the Gradle plugin to execute JMH. You can store the benchmark code in a directory called jmh (at the same level as main and test) and create a main class that can start the benchmark.

The Gradle build script is extended with the following lines:

buildscript {
    repositories {
        jcenter()
    }
    dependencies {
        classpath "me.champeau.gradle:jmh-gradle-plugin:0.2.0"
    }
}
apply plugin: "me.champeau.gradle.jmh"

jmh {
    jmhVersion = '1.13'
    includeTests = true
}
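With the plugin applied, the benchmarks can be started through the main method of the benchmark class shown below; the plugin typically also registers a jmh task, so they can usually be started from the command line as well:

gradle jmh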

And the microbenchmark class is as follows:

import java.io.IOException;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class MicroBenchmark {
    public static void main(String... args)
            throws IOException, RunnerException {
        Options opt = new OptionsBuilder()
                .include(MicroBenchmark.class.getSimpleName())
                .forks(1)
                .build();
        new Runner(opt).run();
    }

    // parameters injected by the harness; every combination of values is measured
    @State(Scope.Benchmark)
    public static class ThreadsAndQueueSizes {
        @Param(value = {"1", "4", "8"})
        String nrThreads;
        @Param(value = {"-1", "1", "10", "100", "1000000"})
        String queueSize;
    }

    @Benchmark
    @Fork(1)
    public void playParallel(ThreadsAndQueueSizes t3qs) throws InterruptedException {
        int nrThreads = Integer.valueOf(t3qs.nrThreads);
        int queueSize = Integer.valueOf(t3qs.queueSize);
        new ParallelGamePlayer(nrThreads, queueSize).play();
    }

    @Benchmark
    @Fork(1)
    public void playSimple() {
        new SimpleGamePlayer().play();
    }
}

ParallelGamePlayer is created to play the game with 1, 4, and 8 IntervalGuesser threads, and in each case, there is a test running with a queue size of -1, 1, 10, 100, and 1 million. These are 15 test executions. When the queue size is negative, the constructor uses a LinkedBlockingDeque. There is another, separate measurement for the nonparallel player. The tests were executed with unique guesses and secrets (no color used more than once), ten colors, and six columns.
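The constructor itself is not shown here, but based on this description the queue selection presumably looks something like the following sketch. The class name, the String element type, and the field name are placeholders of my own, not the actual ParallelGamePlayer code:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingDeque;

// hedged sketch of the queue selection described above
public class QueueChoiceSketch {
    private final BlockingQueue<String> queue;

    public QueueChoiceSketch(int queueSize) {
        // negative size: unbounded, linked list-based queue
        // positive size: bounded, array-based queue
        this.queue = queueSize < 0
                ? new LinkedBlockingDeque<>()
                : new ArrayBlockingQueue<>(queueSize);
    }
}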

When the harness starts, it does all the calibration automatically and runs the tests for many iterations to let the JVM warm up. You may recall the code that just never stopped unless we used the volatile modifier for the variable that was used to signal the code to stop. That happened because the JIT compiler optimized the code, and this is done only after the code has already run a few thousand times. The harness performs these executions to warm up the code and ensure that the measurement is done when the JVM is at full speed.
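If the default number of warm-up and measurement iterations does not fit your case, JMH lets you control them with annotations. The following is a minimal, hedged illustration; the iteration counts and the benchmark body are arbitrary values of my own, not the settings used for the measurements in this chapter:

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Warmup;

public class WarmupExample {
    @Benchmark
    @BenchmarkMode(Mode.Throughput)   // report operations per time unit
    @OutputTimeUnit(TimeUnit.SECONDS) // per second, as in the table below
    @Warmup(iterations = 5)           // iterations that run but are not measured
    @Measurement(iterations = 10)     // iterations that contribute to the score
    public double warmedUpWork() {
        return Math.log(System.nanoTime()); // returning the result prevents dead-code elimination
    }
}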

Running this benchmark takes approximately 15 minutes on my machine. During the execution, it is recommended to stop all other processes and let the benchmark use all available resources. If there is anything using resources during the measurement, then it will be reflected in the result.

Benchmark      (nrThreads)  (queueSize)    Score    Error
playParallel             1           -1   15,636  ± 1,905
playParallel             1            1   15,316  ± 1,237
playParallel             1           10   15,425  ± 1,673
playParallel             1          100   16,580  ± 1,133
playParallel             1      1000000   15,035  ± 1,148
playParallel             4           -1   25,945  ± 0,939
playParallel             4            1   25,559  ± 1,250
playParallel             4           10   25,034  ± 1,414
playParallel             4          100   24,971  ± 1,010
playParallel             4      1000000   20,584  ± 0,655
playParallel             8           -1   24,713  ± 0,687
playParallel             8            1   24,265  ± 1,022
playParallel             8           10   24,475  ± 1,137
playParallel             8          100   24,514  ± 0,836
playParallel             8      1000000   16,595  ± 0,739
playSimple             N/A          N/A   18,613  ± 2,040

The actual output of the program is a bit more verbose; it was edited for printing purposes. The Score column shows how many times the benchmark could run in a second. The Error column shows that the scattering of the measurements is less than 10%.

The fastest performance we get is when the algorithm runs on eight threads, which is the number of threads the processor can handle independently on my machine. It is interesting that limiting the size of the queue did not help the performance; I actually expected it to be different. Using a one-million-long array as a blocking queue has a huge overhead, and it is not a surprise that, in this case, the execution is slower than when we have only 100 elements in the queue. The handling of the unlimited, linked list-based queue, on the other hand, is fairly fast, and it clearly shows that the extra speed of the queue limited to 100 elements does not come from the fact that the limit keeps the IntervalGuesser threads from running too far ahead.

When we start one thread, we expect results similar to those of the serial algorithm. The fact that the serial algorithm beats the parallel algorithm running on one thread is not a surprise: thread creation and the communication between the main thread and the extra thread have an overhead. The overhead is significant, especially when the queue is unnecessarily large.
