Lies, Damned Lies, and Microbenchmarks

A Fibonacci microbenchmark runs slightly faster with Java 8 than Java 17 on some fellow's laptop. Should you stick with Java 8?


The Benchmarks

In this blog post, Marian Čaikovski claims to demonstrate that “Java 17 is slower at calculations” than Java 8.

He offers two pieces of evidence.

The well-known slow Fibonacci method

public static long fibonacci(long n) {
   if (n <= 1) {
      return n;
   } else {
      return fibonacci(n - 1) + fibonacci(n - 2);
   }
}

runs 9% slower with Java 17 than with Java 8 on his laptop when computing fibonacci(40) a hundred times.

And computing the same number is 16% slower when using the code in the RecursiveTask API docs:

class Fibonacci extends RecursiveTask<Integer> {
   final int n;
   Fibonacci(int n) { this.n = n; }
   protected Integer compute() {
      if (n <= 1)
         return n;
      Fibonacci f1 = new Fibonacci(n - 1);
      f1.fork();
      Fibonacci f2 = new Fibonacci(n - 2);
      return f2.compute() + f1.join();
   }
}

The complete code is in this repo.
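
If you'd rather not dig through it, the driver presumably amounts to something like the following sketch. This is illustrative only, not the actual code from the repo; it assumes the Fibonacci task class shown above is on the classpath.

import java.util.concurrent.ForkJoinPool;

public class Main {
   public static long fibonacci(long n) {
      if (n <= 1) return n;
      return fibonacci(n - 1) + fibonacci(n - 2);
   }

   public static void main(String[] args) {
      int n = Integer.parseInt(args[0]);     // e.g. 40
      int reps = Integer.parseInt(args[1]);  // e.g. 100

      // Time the plain recursive method
      long start = System.nanoTime();
      long seqResult = 0;
      for (int i = 0; i < reps; i++) seqResult = fibonacci(n);
      System.out.printf("sequential: %d in %d ms%n",
            seqResult, (System.nanoTime() - start) / 1_000_000);

      // Time the RecursiveTask version on the common fork/join pool
      start = System.nanoTime();
      int parResult = 0;
      for (int i = 0; i < reps; i++)
         parResult = ForkJoinPool.commonPool().invoke(new Fibonacci(n));
      System.out.printf("parallel: %d in %d ms%n",
            parResult, (System.nanoTime() - start) / 1_000_000);
   }
}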

Elsewhere he points out that the language features added between Java 8 and 17 are mostly meh. Combined with what he calls “performance regressions”, he doubts whether it's worth “upgrading an existing perfectly working application” to the latest Java version.

I agree that there haven't been huge language and API changes. The principal reason I upgrade is the ongoing maintenance. There are over 50,000 fixed issues between Java 9 and Java 17 in the bug database.

Of course, many of these issues are pretty technical, but I do run into them on a regular basis. One example is the performance impact of java.lang.System.getProperty described at https://blog.fastthread.io/2021/10/06/performance-impact-of-java-lang-system-getproperty/, which was quietly solved in Java 11.

Back to those benchmarks. Do they really measure “calculations”? Not really. The first one does mostly method calls. The second exercises the fork/join pool. Based on these data, I wouldn't say “whoa, our app does calculations; let's stick with Java 8”.

Are the numbers accurate? Java Champion Henri Tremblay wrote a JMH benchmark (code here) that is a bit more reliable than running the program directly, since it warms up the JVM first. On his laptop, Java 17 is also slower.
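
In outline, such a benchmark boils down to two @Benchmark methods, roughly like the sketch below. The real code is at the link above; the annotation parameters here are my guesses, chosen to match the reported output.

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)        // "avgt" in the results below
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 5)                 // JMH warms up the JVM before measuring
@Measurement(iterations = 20)           // 20 measured iterations ("Cnt" below)
@Fork(1)
public class MyBenchmark {
   static long fibonacci(long n) {
      return n <= 1 ? n : fibonacci(n - 1) + fibonacci(n - 2);
   }

   static class Fibonacci extends RecursiveTask<Integer> {
      final int n;
      Fibonacci(int n) { this.n = n; }
      protected Integer compute() {
         if (n <= 1) return n;
         Fibonacci f1 = new Fibonacci(n - 1);
         f1.fork();
         Fibonacci f2 = new Fibonacci(n - 2);
         return f2.compute() + f1.join();
      }
   }

   @Benchmark
   public long fibSeq() {
      return fibonacci(40);
   }

   @Benchmark
   public int fibPar() {
      return ForkJoinPool.commonPool().invoke(new Fibonacci(40));
   }
}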

Java 8
Benchmark           Mode  Cnt    Score   Error  Units
MyBenchmark.fibPar  avgt   20   71.610 ± 3.832  ms/op
MyBenchmark.fibSeq  avgt   20  326.094 ± 0.536  ms/op

Java 17
Benchmark           Mode  Cnt    Score   Error  Units
MyBenchmark.fibPar  avgt   20   82.433 ± 2.607  ms/op
MyBenchmark.fibSeq  avgt   20  356.393 ± 0.601  ms/op

Could it be that the code generated by the just-in-time compiler has gotten worse? I looked at the assembly code that the JIT generated for Java 8 and Java 17. This article by Gunnar Morling, another Java Champion, showed me how to build the HSDIS library that is necessary for disassembly, and how to run the program with a log that can be viewed in JITWatch.

java -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation -XX:LogFile=8.log -cp src/main/java com.acme.performance.Main 40 100

The C2 disassembly looked essentially identical for both versions (Java 8 line 69253, Java 17 line 2939). Interestingly, it inlined one level of the recursive calls, effectively computing

fibonacci(n - 2) + fibonacci(n - 3) + fibonacci(n - 3) + fibonacci(n - 4)
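
In source terms, that corresponds roughly to the hand-expanded version below. This is an illustration of the shape of the inlined code, not a transcription of the actual assembly.

public class InlineSketch {
   public static long fibonacci(long n) {
      return n <= 1 ? n : fibonacci(n - 1) + fibonacci(n - 2);
   }

   // One level of the recursion expanded by hand, roughly what the inlining amounts to
   public static long fibonacciInlinedOnce(long n) {
      if (n <= 1) return n;
      long a = n - 1 <= 1 ? n - 1 : fibonacci(n - 2) + fibonacci(n - 3); // fibonacci(n - 1) expanded
      long b = n - 2 <= 1 ? n - 2 : fibonacci(n - 3) + fibonacci(n - 4); // fibonacci(n - 2) expanded
      return a + b;
   }
}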

I also played with the parallel solution. I never liked this example. Clearly, computing Fibonacci numbers without memoization is dumb. I know it's just an example, and it's supposed to teach recursive subdivision. But the structure is curiously asymmetric. Why compute one subtask directly and join the other, as in

f2.compute() + f1.join();

What if you fork each subtask and combine the results? I rewrote the code like that:

Fibonacci f1 = new Fibonacci(n - 1);
f1.fork();
Fibonacci f2 = new Fibonacci(n - 2);
f2.fork();
return f1.join() + f2.join(); 

You would think that it shouldn't make much of a difference, but when I ran that variant, it was faster than the original. And it ran faster with JDK 17 than with JDK 8. At least on my laptop. But when I ran it again today, I couldn't reproduce that.

Heinz Kabutz (also a Java Champion) reported this:

Hi Henri,

your benchmark on my server:

Java 8
# VM version: JDK 1.8.0_302, OpenJDK 64-Bit Server VM, 25.302-b08
Benchmark           Mode  Cnt    Score   Error  Units
MyBenchmark.fibPar  avgt   20   60.407 ± 0.527  ms/op
MyBenchmark.fibSeq  avgt   20  495.456 ± 0.035  ms/op

Java 17
# VM version: JDK 17, OpenJDK 64-Bit Server VM, 17+35-2724
Benchmark           Mode  Cnt    Score   Error  Units
MyBenchmark.fibPar  avgt   20   59.035 ± 0.525  ms/op
MyBenchmark.fibSeq  avgt   20  495.776 ± 0.219  ms/op

And he later wrote: “I'm not a great fan of running benchmarks on laptops. Things that I thought were issues often went away when running on server hardware.”

The takeaway is that these microbenchmarks are really hard to get right. There is a reason that JMH has the following message after it completes a run:

REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.


So, should you stay with JDK 8? Unless you want to compute Fibonacci numbers the dumb way on a laptop, you might want to take these results with many grains of salt. If in doubt, benchmark your workload under realistic conditions and give yourself the time to do it right. It's not an easy thing to do.
