Avoiding Benchmarking Pitfalls on the JVM

by Julien Ponge
Published July 2014

Use JMH to write useful benchmarks that produce accurate results.

Benchmarks are an endless source of debates, especially because they do not always represent real-world usage patterns. It is often quite easy to produce the outcome you want, so skepticism is a good thing when looking at benchmark results.

Yet, evaluating the performance of certain critical pieces of code is essential for developers who create applications, frameworks, and tools. Stressing critical portions of code and obtaining metrics that are meaningful is actually difficult in the Java Virtual Machine (JVM) world, because the JVM is an adaptive virtual machine. As we will see in this article, the JVM does many optimizations that render the simplest benchmark irrelevant unless many precautions are taken.

Originally published in the July/August 2014 issue of Java Magazine.

In this article, we will start by creating a simple yet naive benchmarking framework. We will see why things do not turn out as well as we hoped. We then will look at JMH, a benchmark harness that gives us a solid foundation for writing benchmarks. Finally, we’ll discuss how JMH makes writing concurrent benchmarks simple.

A Naive Benchmarking Framework

Benchmarking does not seem so difficult. After all, it should boil down to measuring how long some operation takes, and if the operation is too fast, we can always repeat it in a loop. While this approach is sound for a program written in a statically compiled language, such as C, things are very different with an adaptive virtual machine. Let’s see why.

Implementation. Let’s take a naive approach and design a benchmarking framework ourselves. The solution fits into a single static method, as shown in Listing 1.




public class WrongBench {

  public static void bench(String name, long runMillis, int loop,
      int warmup, int repeat, Runnable runnable) {
    System.out.println("Running: " + name);
    int max = repeat + warmup;
    long average = 0L;
    for (int i = 0; i < max; i++) {
      long nops = 0;
      long duration = 0L;
      long start = System.currentTimeMillis();
      while (duration < runMillis) {
        for (int j = 0; j < loop; j++) {
          runnable.run();
          nops++;
        }
        duration = System.currentTimeMillis() - start;
      }
      long throughput = nops / duration;
      boolean benchRun = i >= warmup;
      if (benchRun) {
        average = average + throughput;
      }
      System.out.print(throughput + " ops/ms" +
          (!benchRun ? " (warmup) | " : " | "));
    }
    average = average / repeat;
    System.out.println("\n[ ~" + average + " ops/ms ]\n");
  }
}

Listing 1

The bench method executes a benchmark expressed as a java.lang.Runnable. The other parameters include a descriptive name (name), a benchmark run duration (runMillis), the inner loop upper bound (loop), the number of warm-up rounds (warmup), and the number of measured rounds (repeat).

Looking at the implementation, we can see that this simple benchmarking method measures a throughput. The time a benchmark takes to run is one thing, but a throughput measurement is often more helpful, especially when designing microbenchmarks.

Sample usage. Let’s use our fresh benchmarking framework with the following method:



static double distance(double x1, double y1, double x2, double y2) {
  double dx = x2 - x1;
  double dy = y2 - y1;
  return Math.sqrt((dx * dx) + (dy * dy));
}

The distance method computes the Euclidean distance between two points (x1, y1) and (x2, y2).

Let’s introduce the following constants for our experiments: 4-second runs, 10 measurements, 15 warm-up rounds, and an inner loop of 10,000 iterations:



static final long RUN_MILLIS = 4000;
static final int REPEAT = 10;
static final int WARMUP = 15;
static final int LOOP = 10_000;

Running the benchmark is done as follows:


    
public static void main(String... args) {
  bench("distance", RUN_MILLIS, LOOP, WARMUP, REPEAT,
      () -> distance(0.0d, 0.0d, 10.0d, 10.0d));
}

On a test machine, a random execution produces the following shortened trace:



Running: distance
  (...)
[ ~30483613 ops/ms ]

According to our benchmark, the distance method has a throughput of 30483613 operations per millisecond (ms). Another run would yield a slightly different throughput. Java developers will not be surprised by that. After all, the JVM is an adaptive virtual machine: bytecode is first interpreted, and then native code is generated by a just-in-time compiler. Hence, performance results are subject to random variations that tend to stabilize as time increases.

Great; but still... is 30483613 operations per ms for distance a meaningful result?

What Could Possibly Go Wrong?

The raw throughput value does not give us much perspective, so let’s compare our result for distance with the throughput of other methods.

Looking for a baseline. Let’s take the same method signature as distance and return a constant instead of doing a computation with the parameters:



static double constant(double x1, double y1, double x2, double y2) {
  return 0.0d;
}

public static void main(String... args) {
  bench("distance", RUN_MILLIS, LOOP, WARMUP, REPEAT,
      () -> distance(0.0d, 0.0d, 10.0d, 10.0d));
  bench("constant", RUN_MILLIS, LOOP, WARMUP, REPEAT,
      () -> constant(0.0d, 0.0d, 10.0d, 10.0d));
}

Listing 2

We also update our benchmark as shown in Listing 2. The constant method will give us a good baseline for our measurements, since it just returns a constant. Unfortunately, the results are not what we would intuitively expect:



Running: distance
  (...)
[ ~30302907 ops/ms ]

Running: constant
  (...)
[ ~475665 ops/ms ]


The throughput of constant appears to be lower than that of distance, although constant is doing no computation at all.



static void nothing() {

}

// (...)

public static void main(String... args) {
  bench("distance", RUN_MILLIS, LOOP, WARMUP, REPEAT,
      () -> distance(0.0d, 0.0d, 10.0d, 10.0d));
  bench("constant", RUN_MILLIS, LOOP, WARMUP, REPEAT,
      () -> constant(0.0d, 0.0d, 10.0d, 10.0d));
  bench("nothing", RUN_MILLIS, LOOP, WARMUP, REPEAT,
      WrongBench::nothing);
}

Listing 3

To give more depth to this observation, let’s benchmark an empty method (see Listing 3). The results get even more surprising.



Running: distance
  (...)
[ ~29975598 ops/ms ]

Running: constant
  (...)
[ ~421092 ops/ms ]

Running: nothing
  (...)
[ ~274938 ops/ms ]

nothing has the lowest throughput, although it is doing the least.

Isolating runs. This is the first lesson: mixing benchmarks within the same JVM run is wrong. Indeed, let’s change the benchmark order:



Running: nothing
  (...)
[ ~30146676 ops/ms ]

Running: distance
  (...)
[ ~493272 ops/ms ]

Running: constant
  (...)
[ ~284219 ops/ms ]

We observe the same pattern of throughput drops despite the different ordering: whichever benchmark runs first in the JVM gets the best figures. Let's now run a single benchmark per JVM process, as shown in Listing 4. By repeating the process for each benchmark, we get the following results:



Running: nothing
  (...)
[ ~30439911 ops/ms ]

Running: distance
  (...)
[ ~30213221 ops/ms ]

Running: constant
[ ~30229883 ops/ms ]

In some runs, distance could even turn out faster than constant. The general observation is that all these measurements are very similar, with nothing being marginally faster. This result is suspicious in itself: the distance method performs computations on double numbers, so we would expect a much lower throughput for it. We will come back to this later, but first let's discuss why mixing benchmarks within the same JVM process was a bad idea.



public static void main(String... args) {
  bench("nothing", RUN_MILLIS, LOOP, WARMUP, REPEAT,
      WrongBench::nothing);
  // bench("distance", RUN_MILLIS, LOOP, WARMUP, REPEAT,
  //     () -> distance(0.0d, 0.0d, 10.0d, 10.0d));
  // bench("constant", RUN_MILLIS, LOOP, WARMUP, REPEAT,
  //     () -> constant(0.0d, 0.0d, 10.0d, 10.0d));
}

Listing 4

The main reason why benchmarks get slower as more of them run in the same JVM is the Runnable.run() method call in the bench method. While the first benchmark runs, the corresponding call site sees only one implementation class for java.lang.Runnable.

Given enough runs, the virtual machine speculates that run() always dispatches to the same target class, and it can generate very efficient native code. This assumption gets invalidated with the second benchmark, because it introduces a second class to dispatch run() calls to. The virtual machine has to deoptimize the generated code. It eventually generates efficient code to dispatch to either of the seen classes, but this is slower than in the previous case. Similarly, the third benchmark introduces a third implementation of java.lang.Runnable. Its execution gets slower because Java HotSpot VM generates efficient native code for up to two different types at a call site, and then it falls back to a more generic dispatch mechanism for additional types.
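
To see this effect in isolation, consider the following minimal sketch (a hypothetical illustration, not code from the article). A single call site in callManyTimes() successively observes one, two, and then three distinct Runnable classes; running it with the -XX:+PrintCompilation flag typically shows the surrounding compiled code being discarded and recompiled as new receiver classes appear.

public class CallSiteDemo {

  static void callManyTimes(Runnable r) {
    for (int i = 0; i < 5_000_000; i++) {
      r.run(); // the call site whose receiver classes the JIT speculates on
    }
  }

  public static void main(String... args) {
    // Each lambda below has its own runtime class, so the call site
    // ends up seeing three different types.
    Runnable first  = () -> { }; // one class seen: monomorphic dispatch
    Runnable second = () -> { }; // a second class: bimorphic dispatch
    Runnable third  = () -> { }; // a third class: generic dispatch

    callManyTimes(first);
    callManyTimes(second);
    callManyTimes(third);
  }
}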


This is not the sole factor, though. Indeed, the bench method’s code and the Runnable objects’ code blend when seen by the virtual machine. The virtual machine tries to speculate on the entire code using optimizations such as loop unrolling, method inlining, and on-stack replacement.

Calling System.currentTimeMillis() has an effect on throughput as well, and our benchmarks would need to accurately subtract the time taken for each of these calls. We could also play with different inner-loop upper-bound values and observe very different results.
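
We could estimate that timer cost with a quick, hypothetical sketch like the one below, keeping in mind that such a measurement suffers from the very pitfalls discussed in this article.

public class TimerCost {

  public static void main(String... args) {
    int calls = 10_000_000;
    long sink = 0L; // accumulate the results so the calls cannot be eliminated
    long start = System.nanoTime();
    for (int i = 0; i < calls; i++) {
      sink += System.currentTimeMillis();
    }
    long elapsed = System.nanoTime() - start;
    System.out.println("~" + (elapsed / calls) + " ns per call (sink: " + sink + ")");
  }
}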

The OpenJDK wiki Performance Techniques page provides a great overview of the various techniques being used in Java HotSpot VM. As you can see, ensuring that we measure only the code to be benchmarked is difficult.

More pitfalls. Going back to the performance evaluation of the distance method, we noted that its throughput was very similar to the throughput measured for a method that would do no computation and return a constant.

In fact, Java HotSpot VM applied dead-code elimination: since the return value of distance is never used by our java.lang.Runnable under test, the computation was simply removed. This was possible because the method has no side effects and a simple, recursion-free control flow.

To convince ourselves of this, let’s modify the java.lang.Runnable lambda that we pass to our benchmark method, as shown in Listing 5.



// (...)

static double last = 0.0d;

public static void main(String... args) {
  bench("distance_use_return", RUN_MILLIS, LOOP, WARMUP, 
REPEAT, () -> last = distance(0.0, 0.0, 10.0, 10.0));
  System.out.println(last);
}

Listing 5

Instead of just calling distance, we now assign its return value to a field and eventually print it, to force the virtual machine not to ignore it. The benchmark figures are now quite different:



Running: distance_use_return
  (...)
[ ~18865939 ops/ms ]

We now have a more meaningful result, because the constant method had a throughput of about 30229883 operations per ms on our test machine.

Although it’s not perceptible in this example, we could also highlight the effect of constant folding. Given a simple method with constant arguments and a return value that evidently depends on those parameters, the virtual machine is able to speculate that it is not useful to evaluate each call. A hypothetical sketch of such a candidate is shown below; beyond that, let’s focus on writing benchmarks with a good harness framework.
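
The following sketch is purely illustrative (it is not part of the article’s benchmarks): a method whose result depends only on its parameters and that is always invoked with literal arguments.

static double scaledHypotenuse(double a, double b) {
  return Math.sqrt((a * a) + (b * b)) / 2.0;
}

// If this method is only ever called as scaledHypotenuse(3.0, 4.0), the
// virtual machine may speculate that the result is always 2.5 and skip
// the computation entirely.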

Indeed, the following should be clear by now:

  • Our simple benchmarking framework has flaws.
  • The virtual machine does so many optimizations that it is difficult to ensure that what we are benchmarking is actually what we expect to benchmark.

Introducing JMH

JMH is a Java harness library for writing benchmarks on the JVM, and it was developed as part of the OpenJDK project. JMH provides a very solid foundation for writing and running benchmarks whose results are not erroneous due to unwanted virtual machine optimizations. JMH itself does not prevent the pitfalls that we exposed earlier, but it greatly helps in mitigating them.

JMH is popular for writing microbenchmarks, that is, benchmarks that stress a very specific piece of code. JMH also excels at concurrent benchmarks. That being said, JMH is a general-purpose benchmarking harness. It is useful for larger benchmarks, too.

Creating and running a JMH project. While JMH releases are regularly published to the Maven Central repository, JMH development is very active, so it is a good idea to build JMH yourself. To do so, you need to clone the JMH Mercurial repository and then build it with Apache Maven, as shown in Listing 6. Once this is done, you can bootstrap a new Maven-based JMH project, as shown in Listing 7.




$ hg clone http://hg.openjdk.java.net/code-tools/jmh/ openjdk-jmh
  (...)
$ cd openjdk-jmh
  (...)
$ mvn install
  (…)

Listing 6



$ mvn archetype:generate \
    -DinteractiveMode=false \
    -DarchetypeGroupId=org.openjdk.jmh \
    -DarchetypeArtifactId=jmh-java-benchmark-archetype \
    -DgroupId=com.mycompany \
    -DartifactId=benchmarks \
    -Dversion=1.0-SNAPSHOT

Listing 7

This creates a project in the benchmarks folder. A sample benchmark can be found in src/main/java/com/mycompany/MyBenchmark.java. While we will dissect the sample benchmark in a minute, we can already build the project with Apache Maven:



$ cd benchmarks/
$ mvn package
  (...)
$ java -jar target/microbenchmarks.jar
  (...)

When you run the self-contained microbenchmarks.jar executable JAR file, JMH launches all the benchmarks of the project with default settings. In this case, it runs MyBenchmark with the default JDK and no specific JVM tuning. Each benchmark is run with 20 warm-up rounds of 1 second each and then with 20 measurement rounds of 1 second each. Also, JMH launches a new JVM 10 times for running each benchmark.

As we will see later, this behavior can be customized in the benchmark source code, and it can be overridden using command-line flags. Running java -jar target/microbenchmarks.jar -help allows us to see the available flags.
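
JMH provides annotations for this, which we will look at in more detail shortly. As a quick illustration, a benchmark class could declare its own defaults roughly as in the following sketch (the values are illustrative, not those of the archetype):

package com.mycompany;

import org.openjdk.jmh.annotations.*;

import java.util.concurrent.TimeUnit;

@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(2)
public class MyBenchmark {

  @GenerateMicroBenchmark
  public void testMethod() {
    // place your benchmarked code here
  }
}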

Let’s instead run the benchmark with the parameters shown in Listing 8. These parameters specify the following:

  • We use only one fork (-f 1).
  • We run five warm-up iterations (-wi 5).
  • We run five iterations of 3 seconds each (-i 5 -r 3s).
  • We tune the JVM configuration with jvmArgs.
  • We run all benchmarks whose class name matches the .*Benchmark.* regular expression. 


$ java -jar target/microbenchmarks.jar \
    -f 1 -wi 5 -i 5 -r 3s \
    -jvmArgs '-server -XX:+AggressiveOpts' \
    .*Benchmark.*

Listing 8

The execution gives a recap of the configuration, the information for each iteration and, finally, a summary of the results that includes confidence intervals, as shown in Listing 9.



# Run progress: 0.00% complete, ETA 00:00:20
# VM invoker: /Library/Java/JavaVirtualMachines/jdk1.8.0.jdk/Contents/Home/jre/bin/java
# VM options: -server -XX:+AggressiveOpts
# Fork: 1 of 1
# Warmup: 5 iterations, 1 s each
# Measurement: 5 iterations, 3 s each
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: com.mycompany.MyBenchmark.testMethod
# Warmup Iteration   1: 1292809.889 ops/ms
# Warmup Iteration   2: 1320406.283 ops/ms
# Warmup Iteration   3: 1313474.495 ops/ms
# Warmup Iteration   4: 1320902.931 ops/ms
# Warmup Iteration   5: 1324933.533 ops/ms
Iteration   1: 1323529.164 ops/ms
Iteration   2: 1324869.829 ops/ms
Iteration   3: 1318025.798 ops/ms
Iteration   4: 1309566.744 ops/ms
Iteration   5: 1320382.335 ops/ms

Result : 1319274.774 ±(99.9%) 23298.541 ops/ms
  Statistics: (min, avg, max) = (1309566.744, 1319274.774, 1324869.829), stdev = 6050.557
  Confidence interval (99.9%): [1295976.233, 1342573.315]

Benchmark                    Mode   Samples          Mean   Mean error    Units
c.m.MyBenchmark.testMethod   thrpt        5   1319274.774    23298.541   ops/ms

Listing 9



package com.mycompany;

import org.openjdk.jmh.annotations.GenerateMicroBenchmark;

public class MyBenchmark {

  @GenerateMicroBenchmark
  public void testMethod() {
    // place your benchmarked code here
  }
}

Listing 10

Anatomy of a JMH benchmark. The sample benchmark that was generated looks like Listing 10. A JMH benchmark is simply a class in which each @GenerateMicroBenchmark annotated method is a benchmark. Let’s transform the benchmark to measure the cost of adding two integers (see Listing 11).



package com.mycompany;

import org.openjdk.jmh.annotations.*;

import java.util.concurrent.TimeUnit;

@State(Scope.Thread)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Fork(value = 3,
    jvmArgsAppend = {"-server", "-disablesystemassertions"})
public class MyBenchmark {

  int x = 923;
  int y = 123;

  @GenerateMicroBenchmark
  @Warmup(iterations = 10, time = 3, timeUnit = TimeUnit.SECONDS)
  public int baseline() {
    return x;
  }

  @GenerateMicroBenchmark
  @Warmup(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
  public int sum() {
    return x + y;
  }
}

Listing 11

We have a baseline benchmark that gives us a reference on returning an int value. JMH takes care of consuming return values so as to defeat dead-code elimination. We also return the value of field x; because the value can be changed from a large number of sources, the virtual machine is unlikely to attempt constant folding optimizations. The code of sum is very similar.

The benchmark has more configuration annotations present. The @State annotation is useful in the context of concurrent benchmarks. In our case, we simply hint to JMH that x and y are thread-scoped.


The other annotations are self-explanatory. Note that these values can be overridden from the command line. By running the benchmark on a sample machine, we get the results shown in Listing 12.



Benchmark                   Mode   Samples         Mean   Mean error    Units
c.m.MyBenchmark.baseline   thrpt        60   527635.162      756.927   ops/ms
c.m.MyBenchmark.sum        thrpt        60   440033.766      623.455   ops/ms

Listing 12

Lifecycle and parameter injection. In simple cases, class fields can hold the benchmark state values.

In more-elaborate contexts, it is better to extract those into separate @State-annotated classes. Benchmark methods can then take parameters of these state class types, and JMH takes care of injecting instances. A state class can also have its own lifecycle, with setup and tear-down methods. We can also specify whether a state holds for the whole benchmark, for one trial, or for one invocation.

We can also require JMH to inject a Blackhole object. A Blackhole is used when it is not convenient to return a single value from a benchmark method. This happens when the benchmark produces several values, and we want to make sure that the virtual machine will not optimize them away simply because the benchmark code does not use them. The Blackhole class provides several consume(...) methods for this purpose.

The class shown in Listings 13a and 13b is an elaborated version of the previous benchmark with a state class, a lifecycle for the state class, and a Blackhole.



package com.mycompany;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.logic.BlackHole;

import java.util.Random;
import java.util.concurrent.TimeUnit;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Fork(value = 3,
    jvmArgsAppend = {"-server", "-disablesystemassertions"})
public class MyBenchmark {

  @State(Scope.Thread)
  static public class AdditionState {

    int x;
    int y;

    @Setup(Level.Iteration)
    public void prepare() {
      Random random = new Random();
      x = random.nextInt();
      y = random.nextInt();
    }

    @TearDown(Level.Iteration)
    public void shutdown() {
      x = y = 0; // useless in this benchmark...
    }
  }

Listing 13a



  @GenerateMicroBenchmark
  @Warmup(iterations = 10, time = 3, timeUnit = TimeUnit.SECONDS)
  public int baseline(AdditionState state) {
    return state.x;
  }

  @GenerateMicroBenchmark
  @Warmup(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
  public int sum(AdditionState state) {
    return state.x + state.y;
  }

  @GenerateMicroBenchmark
  @Warmup(iterations = 10, time = 3, timeUnit = TimeUnit.SECONDS)
  public void baseline_blackhole(AdditionState state, BlackHole blackHole) {
    blackHole.consume(state.x);
  }

  @GenerateMicroBenchmark
  @Warmup(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
  public void sum_blackhole(AdditionState state, BlackHole blackHole) {
    blackHole.consume(state.x + state.y);
  }
}

Listing 13b

When a benchmark method returns a value, JMH takes it and consumes it into a Blackhole. Returning a value and using a Blackhole object are equivalent, as shown by the benchmark results in Listing 14.



Benchmark                            Mode   Samples          Mean   Mean error    Units
c.m.MyBenchmark.baseline            thrpt        60    527565.188     1531.198   ops/ms
c.m.MyBenchmark.baseline_blackhole  thrpt        60    528168.519      710.463   ops/ms
c.m.MyBenchmark.sum                 thrpt        60    439957.824      956.078   ops/ms
c.m.MyBenchmark.sum_blackhole       thrpt        60    439852.867     1001.242   ops/ms

Listing 14

The @TearDown annotation was illustrated for the sake of completeness, but we could clearly have omitted the shutdown() method for this simple benchmark. It is mostly useful for cleaning up resources such as files.
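
For example, a state class that owns an external resource can release it in a tear-down method. The following sketch is hypothetical (it is not part of the article’s benchmarks): it creates a temporary file once per trial and deletes it when the trial ends.

@State(Scope.Benchmark)
public static class TempFileState {

  java.io.File file;

  @Setup(Level.Trial)
  public void createFile() {
    try {
      // one temporary file for the whole benchmark run
      file = java.io.File.createTempFile("bench", ".data");
    } catch (java.io.IOException e) {
      throw new RuntimeException(e);
    }
  }

  @TearDown(Level.Trial)
  public void deleteFile() {
    // release the resource once the trial is over
    file.delete();
  }
}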

Our “wrong” benchmark, JMH-style. We can now use JMH to revisit the benchmark from the beginning of the article. The enclosing class looks like Listing 15.



@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public class GoodBench {

  public static double constant(double x1, double y1, double x2, double y2) {
    return 0.0;
  }

  public static double distance(double x1, double y1, double x2, double y2) {
    double dx = x2 - x1;
    double dy = y2 - y1;
    return Math.sqrt((dx * dx) + (dy * dy));
  }

  @State(Scope.Thread)
  public static class Data {
    double x1 = 0.0;
    double y1 = 0.0;
    double x2 = 10.0;
    double y2 = 10.0;
  }

  // (...)
}

Listing 15

We will be measuring throughput in terms of operations per ms. Data is enclosed within an @State-annotated static inner class whose mutable fields will prevent Java HotSpot VM from doing certain optimizations that we discussed earlier.

We use two baselines. The first is an empty void method, and the second simply returns a constant double value, as shown in Listing 16. Benchmarking constant() and distance() is as simple as Listing 17.




@GenerateMicroBenchmark
public void baseline_return_void() {

}

@GenerateMicroBenchmark
public double baseline_return_zero() {
  return 0.0;
}

Listing 16



@GenerateMicroBenchmark
public double constant(Data data) {
  return constant(data.x1, data.y1, data.x2, data.y2);
}

@GenerateMicroBenchmark
public double distance(Data data) {
  return distance(data.x1, data.y1, data.x2, data.y2);
}

Listing 17

To put things into perspective, we also include flawed measurements subject to dead-code elimination and constant folding optimizations (see Listing 18).



@GenerateMicroBenchmark
public double distance_folding() {
  return distance(0.0, 0.0, 10.0, 10.0);
}

@GenerateMicroBenchmark
public void distance_deadcode(Data data) {
  distance(data.x1, data.y1, data.x2, data.y2);
}

@GenerateMicroBenchmark
public void distance_deadcode_and_folding() {
  distance(0.0, 0.0, 10.0, 10.0);
}

Listing 18

Finally, we can also provide a main method to this benchmark using the JMH builder API, which mimics the command-line arguments that can be given to the self-contained JAR executable. See Listing 19.



public static void main(String... args) throws RunnerException {
  Options opts = new OptionsBuilder()
    .include(".*.GoodBench.*")
    .warmupIterations(20)
    .measurementIterations(5)
    .measurementTime(TimeValue.milliseconds(3000))
    .jvmArgsPrepend("-server")
    .forks(3)
    .build();
  new Runner(opts).run();
}

Listing 19

Figure 1 shows the results as a bar chart with the mean error included for each benchmark.

Figure 1

Given the two baselines, we clearly see the effects of dead-code elimination and constant folding. The only meaningful measurement of distance() is when the value is being consumed by JMH and parameters are passed through field values. All other cases converge to either the performance of returning a constant double or an empty void-returning method.

Devising Concurrent Benchmarks

JMH was designed with concurrent benchmarks in mind. These kinds of benchmarks are very difficult to measure correctly, because they involve several threads and inherently nondeterministic behaviors. Next, let’s examine concurrent benchmarking with JMH for the comparison of readers and writers over an incrementing long value. To do so, we use a pessimistic implementation based on a long value for which every access is protected by a synchronized block, and an optimistic implementation based on java.util.concurrent.atomic.AtomicLong. We want to compare the performance of each implementation depending on the proportion of readers and writers that we have.

JMH has the ability to execute a group of threads with different benchmark code. We can specify how many threads will be allocated to a certain benchmark method. In our case, we will have cases with more readers than writers and, conversely, cases with more writers than readers.

Benchmarking the pessimistic implementation. We start with the following benchmark class code:



@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class ConcurrentBench {
  // (...)
}


The pessimistic case is implemented using an inner class of ConcurrentBench, as shown in Listing 20.



@State(Scope.Group)
@Threads(8)
public static class Pessimistic {

  long value = 0L;
  final Object lock = new Object();

  @Setup(Level.Iteration)
  public void prepare() {
    value = 0L;
  }

  public long get() {
    synchronized (lock) {
      return value;
    }
  }

  public long incrementAndGet() {
    synchronized (lock) {
      value = value + 1L;
      return value;
    }
  }
}

Listing 20

The @State annotation specifies that there should be a shared instance per group of threads while running benchmarks. The @Threads annotation specifies that eight threads should be allocated to run the benchmarks (the default value is 4).

Benchmarking the pessimistic case is done through the methods shown in Listing 21. The @Group annotation gives a group name, while the @GroupThreads annotation specifies how many threads from the group should be allocated to a certain benchmark.



@GenerateMicroBenchmark
@Group("pessimistic_more_readers")
@GroupThreads(7)
public long pessimistic_more_readers_get(Pessimistic state) {
  return state.get();
}

@GenerateMicroBenchmark
@Group("pessimistic_more_readers")
@GroupThreads(1)
public long pessimistic_more_readers_incrementAndGet(Pessimistic state) {
  return state.incrementAndGet();
}

@GenerateMicroBenchmark
@Group("pessimistic_more_writers")
@GroupThreads(1)
public long pessimistic_more_writers_get(Pessimistic state) {
  return state.get();
}

@GenerateMicroBenchmark
@Group("pessimistic_more_writers")
@GroupThreads(7)
public long pessimistic_more_writers_incrementAndGet(Pessimistic state) {
  return state.incrementAndGet();
}

Listing 21

We therefore have two groups: one with seven readers and one writer, and one with one reader and seven writers.

Benchmarking the optimistic implementation. This case is quite symmetrical, albeit with a different implementation (see Listing 22). The benchmark methods are also split in two groups, as shown in Listing 23.



@State(Scope.Group)
@Threads(8)
public static class Optimistic {

  AtomicLong atomicLong;

  @Setup(Level.Iteration)
  public void prepare() {
    atomicLong = new AtomicLong(0L);
  }

  public long get() {
    return atomicLong.get();
  }

  public long incrementAndGet() {
    return atomicLong.incrementAndGet();
  }
}

Listing 22



@GenerateMicroBenchmark
@Group("optimistic_more_readers")
@GroupThreads(7)
public long optimistic_more_readers_get(Optimistic state) {
  return state.get();
}

@GenerateMicroBenchmark
@Group("optimistic_more_readers")
@GroupThreads(1)
public long optimistic_more_readers_incrementAndGet(Optimistic state) {
  return state.incrementAndGet();
}

@GenerateMicroBenchmark
@Group("optimistic_more_writers")
@GroupThreads(1)
public long optimistic_more_writers_get(Optimistic state) {
  return state.get();
}

@GenerateMicroBenchmark
@Group("optimistic_more_writers")
@GroupThreads(7)
public long optimistic_more_writers_incrementAndGet(Optimistic state) {
  return state.incrementAndGet();
}

Listing 23

Execution and plotting. JMH offers a variety of output formats beyond plain-text console output, including JSON and CSV output. The JMH configuration shown in Listing 24 allows us to obtain results in a .csv file.



public static void main(String... args) throws RunnerException {
  Options opts = new OptionsBuilder()
    .include(".*.ConcurrentBench.*")
    .warmupIterations(5)
    .measurementIterations(5)
    .measurementTime(TimeValue.milliseconds(5000))
    .forks(3)
    .result("results.csv")
    .resultFormat(ResultFormatType.CSV)
    .build();
  new Runner(opts).run();
}

Listing 24

The console output provides detailed results with metrics for each benchmarked method. In our case, we can distinguish the performance of reads and writes. There is also a consolidated performance result for the whole benchmark.

Figure 2

The resulting .csv file can be processed with a variety of tools, including spreadsheet software and plotting tools. For concurrent benchmarks, it contains only the consolidated results. Listing 25 is a processing example using the Python matplotlib library. The result is shown in Figure 2.




import numpy as np
import matplotlib.pyplot as plt

data = np.genfromtxt('results.csv', delimiter=',', names=True, dtype=None)

x   = data['Mean']
y   = np.arange(len(data['Benchmark']))
err = data['Mean_Error_999']
labels = []
for name in data['Benchmark']:
  labels.append(name[len('"bench.ConcurrentBench.'):-1])

plt.rcdefaults()
plt.barh(y, x, xerr=err, color='blue', ecolor='red', alpha=0.4, align='center')
plt.yticks(y, labels)
plt.xlabel("Performance (ns/op)")
plt.title("Benchmark")
plt.tight_layout()
plt.savefig('plot.png')

Listing 25

As we could expect, the pessimistic implementation is very predictable: reads and writes share a single intrinsic lock, which is consistent, albeit slow. The optimistic case takes advantage of compare-and-swap, and reads are very fast when there is low write contention. A word of warning: if we further increased contention with more writers, the optimistic implementation could end up performing worse than the pessimistic one.

Conclusion

This article introduced JMH, a benchmark harness for the JVM. We started with our own benchmarking code and quickly realized that the JVM was doing optimizations that rendered the results meaningless. By contrast, JMH provides a coherent framework for writing benchmark code while avoiding common pitfalls. As usual, benchmarks should always be taken with a grain of salt. Microbenchmarks are particularly tricky, since stressing a small portion of code says little about how that code behaves when it is part of a larger application. Nevertheless, such benchmarks are great quality assets for performance-critical code, and JMH provides a reliable foundation for writing them correctly.

Julien Ponge (@jponge) is a longtime open source craftsman who is currently an associate professor in computer science and engineering at INSA de Lyon. He focuses his research on programming languages, virtual machines, and middleware as part of the CITI Laboratory activities.