
Benchmarking Java Code Correctly: JMH, Warmup, and the JIT Optimizations That Invalidate Naive Tests

11 min read Chapter 7 of 90


The JIT compiler is smarter than your benchmark.

When you write a timing loop around a method call, the C2 compiler observes the loop, profiles the method, and then applies optimizations that remove the work you intended to measure. Your benchmark reports 2 nanoseconds per operation. You conclude the method is fast. The method is not fast. The compiler eliminated it.

This is not a theoretical concern. It is the default behavior of the HotSpot JVM. Every naive Java benchmark is potentially invalid because the JIT compiler’s job is to make your code faster, and it cannot distinguish “code the user wants to run” from “code the user wants to measure.”

JMH exists because this problem is impractical to solve by hand. JMH interposes barriers that prevent the JIT compiler from optimizing away the measured code while still allowing the JIT to optimize that code as it would in production. This distinction is critical. You want the benchmark to measure JIT-optimized code. You do not want the JIT to optimize away the benchmark itself.

The JIT Compilation Pipeline

Before understanding what the JIT does to your benchmarks, you need to understand how the JIT works.

The HotSpot JVM uses tiered compilation with five levels, numbered 0 through 4:

[Figure: JIT Compilation Pipeline]

The diagram shows the progression from bytecode through the compilation tiers. Tier 0 (the interpreter) executes bytecode without compiling it, while counting invocations and loop back-edges. When those counters cross a threshold, the method is compiled by the C1 compiler (Tiers 1-3), which produces moderately optimized native code; Tier 3 also collects the type and branch profiles that C2 relies on. The thresholds are tunable and depend on the compilation mode: with tiered compilation the defaults are on the order of a few hundred invocations for C1 and a few thousand for C2, while the often-quoted 10,000 is the classic default for the non-tiered server compiler. When C1-compiled code remains hot, the C2 compiler (Tier 4) recompiles it with aggressive optimizations including dead code elimination, constant folding, loop unrolling, and escape analysis. These C2 optimizations, highlighted in red in the diagram, are precisely the ones that invalidate naive benchmarks. If a C2 speculation fails at runtime (for example, a call site that was monomorphic becomes polymorphic), the JVM deoptimizes the method back to the interpreter, and it may later be recompiled with the updated profile.

The key insight for benchmarking: your method’s performance changes as it moves through tiers. A method that takes 500ns in the interpreter might take 50ns after C1 compilation and 15ns after C2 compilation. A naive benchmark that starts timing from the first invocation averages across all three speeds. This average represents no real-world execution mode.
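To make the averaging problem concrete, here is a small worked calculation using the illustrative per-tier costs above. The number of calls spent in each tier (2,000 interpreted, 8,000 under C1) is an assumption invented for the arithmetic, not measured HotSpot behavior.

```java
// Worked arithmetic only: the per-tier costs come from the text above, and the
// number of calls spent in each tier is a hypothetical assumption.
final class BlendedAverage {
    static final long INTERPRETED_CALLS = 2_000;  // assumed calls at 500 ns each
    static final long C1_CALLS = 8_000;           // assumed calls at 50 ns each

    /**
     * Mean ns/op a timing loop would report if it timed every call from the first.
     * Expects totalCalls >= 10,000 so that all three phases are present.
     */
    static double blendedMeanNanos(long totalCalls) {
        long c2Calls = totalCalls - INTERPRETED_CALLS - C1_CALLS;
        long totalNanos = INTERPRETED_CALLS * 500 + C1_CALLS * 50 + c2Calls * 15;
        return (double) totalNanos / totalCalls;
    }

    public static void main(String[] args) {
        System.out.println(blendedMeanNanos(100_000));    // 27.5
        System.out.println(blendedMeanNanos(1_000_000));  // 16.25
    }
}
```

Under these assumptions the same method reports 27.5 ns/op over 100,000 calls and 16.25 ns/op over 1,000,000 calls, while its steady-state cost is 15 ns. The reported figure tracks the loop count, not the method.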

JMH’s @Warmup annotation lets you run enough iterations for the JIT to reach steady state (typically C2-compiled) before measurement begins. The number of warmup iterations required depends on the method’s complexity and call graph. Five iterations of 2 seconds each is a reasonable starting point; confirm it by checking that the scores of the last warmup iterations have stopped changing.

The Four JIT Traps

Four C2 compiler optimizations invalidate naive benchmarks. JMH provides specific countermeasures for each.

Trap 1: Dead Code Elimination

The compiler removes code whose result is never used.

// SLOW: This benchmark may measure nothing
long start = System.nanoTime();
for (int i = 0; i < 1_000_000; i++) {
    String json = mapper.writeValueAsString(article);
    // 'json' is never used. Once the call is inlined, C2 may remove
    // part or all of the work that exists only to produce it.
}
long elapsed = System.nanoTime() - start;

// FAST: JMH prevents dead code elimination
@Benchmark
public String serialize() throws Exception {
    // Returning the result tells JMH to consume it
    return mapper.writeValueAsString(article);
}

// Alternative: use Blackhole to consume multiple results
@Benchmark
public void serializeMultiple(Blackhole bh) throws Exception {
    bh.consume(mapper.writeValueAsString(article1));
    bh.consume(mapper.writeValueAsString(article2));
}

JMH’s Blackhole.consume() is a method the JIT cannot see through. It accepts the value in a way that the compiler must assume has side effects. The return-value approach works because JMH’s generated benchmark harness passes the returned value to a Blackhole on your behalf.

Trap 2: Constant Folding

The compiler evaluates constant expressions at compile time and replaces them with the result.

// SLOW: The compiler may precompute this
long start = System.nanoTime();
for (int i = 0; i < 1_000_000; i++) {
    double result = computeScore(42, 100, 0.8);  // constant arguments
    // If computeScore is pure and gets inlined, C2 can compute it once
    // (and 'result' is unused, so Trap 1 applies as well)
}
long elapsed = System.nanoTime() - start;

// FAST: JMH @State provides values the compiler cannot constant-fold
@State(Scope.Benchmark)
public class ScoreBenchmark {
    int views;
    int shares;
    double weight;

    @Setup
    public void setup() {
        views = 42;
        shares = 100;
        weight = 0.8;
    }

    @Benchmark
    public double computeScore() {
        // 'views', 'shares', 'weight' come from @State fields
        // The compiler cannot prove they are constant
        return RecommendationScorer.score(views, shares, weight);
    }
}

Non-final @State fields are opaque to the JIT. The compiler sees that views is a field read from a heap object and cannot prove it will not change between invocations. This prevents constant folding while still allowing the JIT to optimize the method body itself. Keep these fields non-final: a literal or a static final value can still be folded.

Trap 3: Loop Unrolling and Hoisting

The compiler transforms loops to eliminate loop overhead and hoists invariant computations out of the loop body.

// In a naive benchmark loop, the compiler may:
// 1. Unroll the loop (execute 4 iterations per loop check)
// 2. Hoist invariant expressions before the loop
// 3. Combine iterations algebraically
for (int i = 0; i < n; i++) {
    result += array[i] * coefficient;
    // 'coefficient' is loop-invariant, hoisted before the loop
    // The multiplication is vectorized across iterations
}

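JMH avoids this problem by keeping the loop out of your code. The measurement loop lives in generated harness code, which calls your @Benchmark method once per iteration, consumes its result, and checks a volatile termination flag each time around. The JIT may still inline your method into that loop, but it cannot merge, hoist, or discard work across invocations. The corollary: do not write your own loop inside a @Benchmark method.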

Trap 4: Escape Analysis and Scalar Replacement

If the JIT determines that an object does not escape the method (no reference to it is stored in a heap field, passed to code that is not inlined, or returned), it can eliminate the allocation entirely by replacing the object’s fields with local variables. This is scalar replacement. HotSpot does not move the object onto the stack; it removes the object altogether.

// SLOW: Benchmark that accidentally measures scalar replacement
@Benchmark
public double benchmarkArticleScore() {
    // If Article doesn't escape this method, C2 may:
    // 1. Eliminate the Article allocation entirely
    // 2. Replace article.views with a local variable
    Article article = new Article(1L, "test", "body", "slug",
        List.of(), List.of(), 1L, Instant.now(), Instant.now());
    return article.id();  // Trivially returns 1L
}

// FAST: Ensure the benchmark measures real work
@State(Scope.Benchmark)
public class ArticleScoreBenchmark {
    private Article article;

    @Setup
    public void setup() {
        // Created outside the benchmark method
        // Cannot be scalar-replaced because it's a heap field
        article = new Article(1L, "Performance Engineering",
            "A".repeat(10_000), "perf-eng",
            List.of("java", "performance"),
            List.of("jvm", "profiling"),
            42L, Instant.now(), Instant.now());
    }

    @Benchmark
    public double computeScore() {
        // 'article' comes from @State, not created in the method
        return RecommendationScorer.score(article);
    }
}
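Outside JMH, the difference between an escaping and a non-escaping object can be sketched in plain Java. The class and method names below are hypothetical, and the sketch only shows the two code shapes: whether C2 actually removes the allocation depends on inlining decisions and JVM flags, and nothing here forces or verifies it.

```java
// Illustrative shapes only. Whether C2 actually scalar-replaces the allocation
// depends on inlining and JVM flags; nothing here forces or verifies it.
final class EscapeShapes {
    static final class Point {
        final int x;
        final int y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    static Point published;  // a heap location visible outside the method

    // The Point is never stored or returned, so it does not escape.
    // C2 is free to keep x and y in registers and skip the allocation.
    static int sumNonEscaping(int x, int y) {
        Point p = new Point(x, y);
        return p.x + p.y;
    }

    // Storing the reference in a static field makes the object escape,
    // so the allocation has to happen on the heap.
    static int sumEscaping(int x, int y) {
        Point p = new Point(x, y);
        published = p;
        return p.x + p.y;
    }

    public static void main(String[] args) {
        System.out.println(sumNonEscaping(3, 4));  // 7
        System.out.println(sumEscaping(3, 4));     // 7
    }
}
```

Both methods return the same value, which is the point: the optimization is invisible in the result and shows up only in timing and allocation rate.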

Benchmark Modes

JMH supports four benchmark modes, each answering a different question:

Mode.Throughput: How many operations per second? Use this when you care about bulk processing rate. Example: “How many articles can the serializer process per second?”

Mode.AverageTime: What is the average time per operation? Use this for latency-sensitive code where you need a per-operation time. Example: “What is the average time to serialize one article?”

Mode.SampleTime: What is the time distribution? JMH samples individual operation times and reports percentiles (p50, p90, p99, p99.9). Use this when you care about tail latency. Example: “What is the p99 serialization time?”

Mode.SingleShotTime: What is the time for a single invocation without warmup? Use this for cold-start performance measurement. Example: “What is the first-invocation time for the recommendation engine?”

// Measure both average time and time distribution
@BenchmarkMode({Mode.AverageTime, Mode.SampleTime})
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)
@Fork(2)
@State(Scope.Benchmark)
public class ArticleSearchBenchmark {

    private SearchService searchService;
    private String query;

    @Setup(Level.Trial)
    public void setup() {
        searchService = new SearchService(/* ... */);
        query = "java performance engineering";
    }

    @Benchmark
    public List<SearchResult> search() {
        return searchService.search(query, 20);
    }
}

The SampleTime mode is underused. AverageTime tells you the mean, which can hide bimodal distributions. If 99% of operations take 10us and 1% take 5ms, the average is 60us, which describes neither mode. SampleTime shows both modes in its percentile output.
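The arithmetic behind that example can be checked with synthetic data. The sketch below builds 10,000 samples in the stated proportions and uses a nearest-rank percentile, which is one common definition and an assumption here; it is not necessarily how JMH computes its percentiles.

```java
import java.util.Arrays;

// Synthetic data matching the example in the text: 99% of samples at 10 us,
// 1% at 5,000 us. The nearest-rank percentile is this sketch's own choice.
final class BimodalLatency {
    static long[] samples() {
        long[] micros = new long[10_000];
        Arrays.fill(micros, 0, 9_900, 10L);          // fast mode
        Arrays.fill(micros, 9_900, 10_000, 5_000L);  // slow mode (5 ms)
        return micros;
    }

    static double mean(long[] values) {
        long sum = 0;
        for (long v : values) sum += v;
        return (double) sum / values.length;
    }

    // Nearest-rank percentile p = num/den, computed in integer arithmetic
    // to avoid floating-point rounding at the rank boundary.
    static long percentile(long[] values, long num, long den) {
        long[] sorted = values.clone();
        Arrays.sort(sorted);
        long rank = (sorted.length * num + den - 1) / den;  // ceil(n * p)
        return sorted[(int) Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        long[] s = samples();
        System.out.println("mean  = " + mean(s));
        System.out.println("p50   = " + percentile(s, 50, 100));
        System.out.println("p99   = " + percentile(s, 99, 100));
        System.out.println("p99.9 = " + percentile(s, 999, 1000));
    }
}
```

With this data the mean is 59.9 us, p50 and p99 are both 10 us, and p99.9 is 5,000 us. No operation actually takes 59.9 us, and with exactly 1% slow samples the slow mode only becomes visible at p99.9, which is why the higher percentiles are worth reading.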

Common JMH Mistakes

Mistake 1: Not enough forks.

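// Wrong: single fork
@Fork(1)  // One JVM launch: variance between launches is invisible

// Correct: multiple forks
@Fork(2)  // Each fork starts a fresh JVM with clean JIT profiles

Each fork runs the benchmark in a fresh JVM process, with its own warmup followed by its own measurement. JIT compilation is not deterministic: inlining decisions, compilation order, and memory layout can differ from one JVM launch to the next, and the difference persists for the life of that JVM. With a single fork you see one such outcome and cannot tell how much the score varies between launches. With two or more forks, you get results from independent JVMs, and JMH reports the aggregate with an error that reflects the variance between them.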

Mistake 2: Measuring setup cost.

// Wrong: setup cost included in measurement
@Benchmark
public List<SearchResult> search() {
    SearchService service = new SearchService(dataSource);  // SLOW: constructor cost
    return service.search("java", 20);
}

// Correct: setup in @Setup method
@Setup(Level.Trial)
public void setup() {
    service = new SearchService(dataSource);
}

@Benchmark
public List<SearchResult> search() {
    return service.search("java", 20);
}

Mistake 3: Shared mutable state in Scope.Benchmark.

// Wrong: shared counter creates contention in multi-threaded benchmarks
@State(Scope.Benchmark)
public class CounterBenchmark {
    int counter = 0;  // Shared across threads

    @Benchmark
    public int increment() {
        return counter++;  // Race condition
    }
}

// Correct: per-thread state
@State(Scope.Thread)
public class CounterBenchmark {
    int counter = 0;  // Each thread has its own counter

    @Benchmark
    public int increment() {
        return counter++;
    }
}

Scope.Benchmark shares a single state instance across all threads. Scope.Thread creates one state instance per thread. For benchmarks that measure contention, use Scope.Benchmark intentionally. For benchmarks that measure single-threaded throughput, use Scope.Thread.

Mistake 4: Wrong output time unit.

// Wrong: nanoseconds for a 50ms operation
@OutputTimeUnit(TimeUnit.NANOSECONDS)  // Output: 50,000,000 ns/op (hard to read)

// Correct: match the unit to the magnitude
@OutputTimeUnit(TimeUnit.MILLISECONDS)  // Output: 50.000 ms/op (readable)

Use nanoseconds for sub-microsecond operations (hash lookups, field access). Use microseconds for micro-operations (serialization, small computations). Use milliseconds for I/O-bound operations (database queries, network calls).
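As a rough illustration, the rule of thumb can be written as a small helper. The helper and its cutoffs (1 us and 1 ms) are assumptions of this sketch; JMH does not choose the unit for you, and the result would still be written by hand into @OutputTimeUnit.

```java
import java.util.concurrent.TimeUnit;

// A hypothetical helper that encodes the rule of thumb above. The cutoffs
// (1 us and 1 ms) are this sketch's assumption, not something JMH defines.
final class OutputUnitChooser {
    static TimeUnit unitFor(long expectedNanosPerOp) {
        if (expectedNanosPerOp < 1_000) return TimeUnit.NANOSECONDS;
        if (expectedNanosPerOp < 1_000_000) return TimeUnit.MICROSECONDS;
        return TimeUnit.MILLISECONDS;
    }

    public static void main(String[] args) {
        long fiftyMillis = 50_000_000L;  // the 50 ms operation from Mistake 4
        TimeUnit unit = unitFor(fiftyMillis);
        // Convert the nanosecond figure into the chosen unit for display.
        System.out.println(unit + ": " + unit.convert(fiftyMillis, TimeUnit.NANOSECONDS));
    }
}
```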

Mistake 5: Ignoring the error margin.

Benchmark              Mode  Cnt  Score    Error   Units
serializeV1            avgt   10  8432 ±  124.5   ns/op
serializeV2            avgt   10  8389 ±  131.2   ns/op

The difference between V1 and V2 is 43 ns, smaller than either error margin (±124.5 and ±131.2), so the two intervals overlap almost completely. This result is not statistically significant: the benchmark cannot distinguish the two versions from measurement noise. Do not ship “optimizations” that are within the error margin.

JMH computes the error as a 99.9% confidence interval. If the error bars do not overlap, the difference is statistically significant with high confidence. If they overlap, the difference may be noise.
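The overlap check for the sample output above can be sketched in a few lines. The scores and errors are copied from that output; the overlap rule is the coarse heuristic described here, not a formal hypothesis test.

```java
// Interval check for the table above. Scores and errors are copied from the
// sample output; the overlap rule is a coarse heuristic, not a formal test.
final class ErrorMarginCheck {
    /** True if [scoreA ± errA] and [scoreB ± errB] share at least one point. */
    static boolean intervalsOverlap(double scoreA, double errA, double scoreB, double errB) {
        double lowA = scoreA - errA, highA = scoreA + errA;
        double lowB = scoreB - errB, highB = scoreB + errB;
        return lowA <= highB && lowB <= highA;
    }

    public static void main(String[] args) {
        // serializeV1: 8432 ± 124.5 ns/op, serializeV2: 8389 ± 131.2 ns/op
        System.out.println(intervalsOverlap(8432, 124.5, 8389, 131.2));  // true
    }
}
```

V1 spans 8307.5 to 8556.5 ns/op and V2 spans 8257.8 to 8520.2 ns/op, so the check reports an overlap and the 43 ns difference should be treated as noise.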

Running Benchmarks in CI

Performance regressions are bugs. Detect them in CI.

# Run benchmarks and output JSON results
java -jar target/benchmarks.jar -rf json -rff results.json

# Compare against baseline with a threshold
java -jar target/benchmarks.jar -rf json -rff current.json
# Then compare current.json against baseline.json

# compare_benchmarks.py
# Flag regressions that exceed a threshold.
# Usage: python compare_benchmarks.py baseline.json current.json
import json
import sys

REGRESSION_THRESHOLD = 0.10  # 10% regression threshold


def load(path):
    """Index JMH results by benchmark name, mode, and @Param values."""
    with open(path) as f:
        results = json.load(f)
    # The name alone is not unique: one benchmark can run in several modes
    # and, with @Param, under several parameter sets.
    return {
        (r["benchmark"], r["mode"], json.dumps(r.get("params", {}), sort_keys=True)): r
        for r in results
    }


def compare(baseline_path, current_path):
    baseline = load(baseline_path)
    current = load(current_path)

    regressions = []
    for key, curr in current.items():
        if key not in baseline:
            continue  # New benchmark: nothing to compare against
        name, mode, _ = key
        base_score = baseline[key]["primaryMetric"]["score"]
        curr_score = curr["primaryMetric"]["score"]

        # For throughput, lower is worse. For time, higher is worse.
        if mode == "thrpt":
            change = (base_score - curr_score) / base_score
        else:
            change = (curr_score - base_score) / base_score

        if change > REGRESSION_THRESHOLD:
            regressions.append(
                f"{name} [{mode}]: {change:.1%} worse "
                f"({base_score:.1f} -> {curr_score:.1f})"
            )

    if regressions:
        print("PERFORMANCE REGRESSIONS DETECTED:")
        for r in regressions:
            print(f"  {r}")
        sys.exit(1)
    print("No regressions detected.")


if __name__ == "__main__":
    if len(sys.argv) != 3:
        sys.exit("usage: compare_benchmarks.py BASELINE.json CURRENT.json")
    compare(sys.argv[1], sys.argv[2])

Set the regression threshold based on your application’s requirements. 10% is a reasonable default for most operations. For latency-critical hot paths (view counting, cache lookups), use 5% or lower. For cold paths (admin operations, batch jobs), 20% may be acceptable.

The next two sections go deeper. The first section examines each JIT trap in detail with specific benchmarks that demonstrate the trap and the fix. The second section covers advanced JMH patterns: parameterized benchmarks, asymmetric state, and multi-threaded benchmarks for contention measurement.