JMH Mastery: State, Setup, and Parameterized Benchmarks

Basic JMH usage measures a single operation with fixed inputs. Production performance questions are rarely that simple. You need to know how performance scales with input size, how contention affects throughput under multiple threads, and how different implementation strategies compare across a parameter space. JMH provides tools for all of these.

Parameterized Benchmarks with @Param

@Param generates separate benchmark runs for each parameter value. This is how you answer questions like “how does serialization time scale with article body size?”

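import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.datatype.jsr310.JavaTimeModule;
import org.openjdk.jmh.annotations.*;

import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Article is the content platform's domain class; import it from your own package.
@BenchmarkMode(Mode.AverageTime)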
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)
@Fork(2)
@State(Scope.Thread)
public class SerializationScalingBenchmark {

    @Param({"100", "1000", "10000", "50000"})
    private int bodySize;

    @Param({"0", "5", "20"})
    private int tagCount;

    private ObjectMapper mapper;
    private Article article;

    @Setup(Level.Trial)
    public void setup() {
        mapper = new ObjectMapper();
        mapper.registerModule(new JavaTimeModule());

        List<String> tags = new ArrayList<>();
        for (int i = 0; i < tagCount; i++) {
            tags.add("tag-" + i);
        }

        article = new Article(
            1L,
            "Performance Engineering",
            "A".repeat(bodySize),
            "perf-eng",
            List.of("java", "performance"),
            tags,
            42L,
            Instant.now(),
            Instant.now()
        );
    }

    @Benchmark
    public byte[] serialize() throws Exception {
        return mapper.writeValueAsBytes(article);
    }
}

This generates 12 benchmark runs (4 body sizes x 3 tag counts). The output shows how serialization time varies across the parameter space:

Benchmark                                (bodySize)  (tagCount)  Mode  Cnt   Score   Error  Units
SerializationScalingBenchmark.serialize         100           0  avgt   10   1.247 ± 0.031  us/op
SerializationScalingBenchmark.serialize         100           5  avgt   10   1.382 ± 0.028  us/op
SerializationScalingBenchmark.serialize         100          20  avgt   10   1.891 ± 0.044  us/op
SerializationScalingBenchmark.serialize        1000           0  avgt   10   2.834 ± 0.062  us/op
SerializationScalingBenchmark.serialize        1000           5  avgt   10   2.961 ± 0.057  us/op
SerializationScalingBenchmark.serialize        1000          20  avgt   10   3.472 ± 0.089  us/op
SerializationScalingBenchmark.serialize       10000           0  avgt   10   8.432 ± 0.124  us/op
SerializationScalingBenchmark.serialize       10000           5  avgt   10   8.589 ± 0.131  us/op
SerializationScalingBenchmark.serialize       10000          20  avgt   10   9.047 ± 0.198  us/op
SerializationScalingBenchmark.serialize       50000           0  avgt   10  34.218 ± 0.742  us/op
SerializationScalingBenchmark.serialize       50000           5  avgt   10  34.891 ± 0.812  us/op
SerializationScalingBenchmark.serialize       50000          20  avgt   10  36.124 ± 0.934  us/op

This data reveals that serialization time is dominated by body size, not tag count. The 50KB body takes 34us; adding 20 tags adds only 2us. If you want to optimize serialization for the content platform, optimize body handling, not tag handling.
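
To put numbers on that conclusion, the sketch below derives per-unit costs from four scores in the table. It is plain Java, not a JMH benchmark. The class name MarginalCost and the two-point slope are illustrative assumptions, and a straight line through two measurements is only a rough model of the real cost curve.

```java
// Marginal serialization cost, estimated from avgt scores in the table above.
class MarginalCost {

    /** Slope between two measurements: extra microseconds per extra unit of input. */
    static double slope(double x1, double score1, double x2, double score2) {
        return (score2 - score1) / (x2 - x1);
    }

    public static void main(String[] args) {
        // tagCount = 0: bodySize 100 -> 1.247 us/op, bodySize 50000 -> 34.218 us/op
        double usPerBodyChar = slope(100, 1.247, 50_000, 34.218);
        // bodySize = 50000: tagCount 0 -> 34.218 us/op, tagCount 20 -> 36.124 us/op
        double usPerTag = slope(0, 34.218, 20, 36.124);

        // Convert microseconds to nanoseconds for readability.
        System.out.printf("per body char: %.2f ns%n", usPerBodyChar * 1_000);
        System.out.printf("per tag:       %.1f ns%n", usPerTag * 1_000);
    }
}
```

By this estimate a body character costs roughly 0.66 ns and a tag roughly 95 ns. A single tag is far more expensive than a single character, but the largest body has 50,000 characters against at most 20 tags, which is why the body dominates the total.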

Use @Param to test your assumptions about what drives performance. If you think “the number of categories affects search speed,” parameterize the category count and measure. The data may contradict your assumption.

Parameterized Implementation Comparison

Compare multiple implementations across the same parameter space:

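import org.openjdk.jmh.annotations.*;

import java.time.Instant;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.TimeUnit;

// Article is the content platform's domain class; import it from your own package.
@BenchmarkMode(Mode.AverageTime)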
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)
@Fork(2)
@State(Scope.Thread)
public class LookupStrategyBenchmark {

    @Param({"10", "100", "1000", "10000"})
    private int articleCount;

    private Map<Long, Article> hashMap;
    private TreeMap<Long, Article> treeMap;
    private Article[] sortedArray;
    private long lookupId;

    @Setup(Level.Trial)
    public void setup() {
        var random = new java.util.Random(42);
        hashMap = new HashMap<>(articleCount);
        treeMap = new TreeMap<>();
        sortedArray = new Article[articleCount];

        for (int i = 0; i < articleCount; i++) {
            Article article = new Article(
                (long) i, "Article " + i, "Body", "slug-" + i,
                List.of(), List.of(), 1L, Instant.now(), Instant.now()
            );
            hashMap.put(article.id(), article);
            treeMap.put(article.id(), article);
            sortedArray[i] = article;
        }
        lookupId = random.nextLong(articleCount); // Random.nextLong(bound) requires Java 17+
    }

    @Benchmark
    public Article hashMapLookup() {
        return hashMap.get(lookupId);
    }

    @Benchmark
    public Article treeMapLookup() {
        return treeMap.get(lookupId);
    }

    @Benchmark
    public Article binarySearchLookup() {
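        // The search key is passed as null so no probe Article is allocated;
        // the comparator substitutes lookupId wherever it receives null.
        int idx = Arrays.binarySearch(sortedArray, null,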
            (a, b) -> Long.compare(
                a != null ? a.id() : lookupId,
                b != null ? b.id() : lookupId
            ));
        return idx >= 0 ? sortedArray[idx] : null;
    }
}

At 10 elements, all three perform similarly. At 10,000 elements, HashMap stays constant-time (5ns), while TreeMap (40ns) and binary search on a sorted array (35ns) both pay O(log n). At scale, the choice of data structure changes lookup cost by nearly an order of magnitude.
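
A plain-Java sketch (no JMH) of the same three strategies can make the complexity argument concrete. The class and helper names are illustrative, titles stand in for the Article type, and the binary search runs over a primitive long[] of ids, which is a simpler variant than the comparator-based search in the benchmark. The maxProbes helper gives the worst-case number of elements a binary search inspects.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: checks that the three strategies agree and shows how
// the binary-search probe bound grows with collection size.
class LookupStrategies {

    static final int COUNT = 10_000;
    static final Map<Long, String> hashMap = new HashMap<>();
    static final TreeMap<Long, String> treeMap = new TreeMap<>();
    static final long[] sortedIds = new long[COUNT];  // ascending ids
    static final String[] titles = new String[COUNT]; // titles[i] belongs to sortedIds[i]

    static {
        for (int i = 0; i < COUNT; i++) {
            String title = "Article " + i;
            hashMap.put((long) i, title);
            treeMap.put((long) i, title);
            sortedIds[i] = i;
            titles[i] = title;
        }
    }

    static String viaHashMap(long id) { return hashMap.get(id); }

    static String viaTreeMap(long id) { return treeMap.get(id); }

    static String viaBinarySearch(long id) {
        int idx = Arrays.binarySearch(sortedIds, id);
        return idx >= 0 ? titles[idx] : null;
    }

    /** Worst-case number of elements a binary search inspects: floor(log2 n) + 1. */
    static int maxProbes(int n) {
        return 32 - Integer.numberOfLeadingZeros(n);
    }

    public static void main(String[] args) {
        long id = 4242;
        System.out.println(viaHashMap(id) + " | " + viaTreeMap(id) + " | " + viaBinarySearch(id));
        for (int n : new int[] {10, 100, 1_000, 10_000}) {
            System.out.println(n + " elements -> at most " + maxProbes(n) + " probes");
        }
    }
}
```

Growing the collection from 10 to 10,000 entries raises the binary-search bound from 4 probes to 14, while a hash lookup still goes to one bucket on average. That growth is what the TreeMap and sorted-array timings reflect.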

Asymmetric Benchmarks with @Group

Real systems have concurrent readers and writers. JMH’s @Group and @GroupThreads annotations benchmark asymmetric concurrency patterns.

In the content platform, the view counter is read by article serving (frequent) and incremented by view events (also frequent). The benchmark needs to measure both operations concurrently:

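import org.openjdk.jmh.annotations.*;

import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

@BenchmarkMode(Mode.Throughput)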
@OutputTimeUnit(TimeUnit.SECONDS)
@Warmup(iterations = 5, time = 3, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 3, timeUnit = TimeUnit.SECONDS)
@Fork(2)
@State(Scope.Group)
public class ViewCounterContentionBenchmark {

    private AtomicLong atomicCounter;
    private LongAdder adderCounter;

    // Default level is Level.Trial; with Scope.Group, each thread group gets its own instance.
    @Setup
    public void setup() {
        atomicCounter = new AtomicLong();
        adderCounter = new LongAdder();
    }

    // AtomicLong: single CAS point of contention
    @Benchmark
    @Group("atomic")
    @GroupThreads(3)  // 3 reader threads
    public long atomicRead() {
        return atomicCounter.get();
    }

    @Benchmark
    @Group("atomic")
    @GroupThreads(1)  // 1 writer thread
    public void atomicWrite() {
        atomicCounter.incrementAndGet();
    }

    // LongAdder: distributed contention
    @Benchmark
    @Group("adder")
    @GroupThreads(3)
    public long adderRead() {
        return adderCounter.sum();
    }

    @Benchmark
    @Group("adder")
    @GroupThreads(1)
    public void adderWrite() {
        adderCounter.increment();
    }
}

Expected results at high contention:

Benchmark                                           Mode  Cnt        Score        Error  Units
ViewCounterContentionBenchmark.atomic              thrpt   10  128432891.2 ± 4231082.1  ops/s
ViewCounterContentionBenchmark.atomic:atomicRead   thrpt   10   96421823.4 ± 3182041.2  ops/s
ViewCounterContentionBenchmark.atomic:atomicWrite  thrpt   10   32011067.8 ± 1049040.9  ops/s
ViewCounterContentionBenchmark.adder               thrpt   10  287612043.1 ± 8941230.4  ops/s
ViewCounterContentionBenchmark.adder:adderRead     thrpt   10  201438201.3 ± 6723410.2  ops/s
ViewCounterContentionBenchmark.adder:adderWrite    thrpt   10   86173841.8 ± 2217820.2  ops/s

LongAdder delivers 2.2x higher total throughput under this read/write mix because it distributes write contention across multiple cells. For the content platform’s view counter at 10,000 reads/second and 10,000 writes/second, LongAdder is the correct choice.

The @GroupThreads ratio matters. 3 readers to 1 writer reflects the content platform’s read-heavy traffic. Change the ratio to match your workload: 10:1 for read-dominated, 1:1 for balanced, 1:3 for write-dominated.
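
Outside JMH, the API difference between the two counters can be seen in a few lines. The following is a minimal sketch with illustrative names (CounterSketch, countWith); it checks correctness only and says nothing about throughput. One detail worth knowing before you choose LongAdder: its sum() is not an atomic snapshot while updates are in flight, which is normally acceptable for a view counter.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

// Illustrative only: both counters end at the same total once all writers finish.
class CounterSketch {

    /** Runs `writers` threads that each increment both counters `perThread` times. */
    static long[] countWith(int writers, int perThread) {
        AtomicLong atomic = new AtomicLong();
        LongAdder adder = new LongAdder();

        Thread[] threads = new Thread[writers];
        for (int t = 0; t < writers; t++) {
            threads[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    atomic.incrementAndGet(); // every thread updates the same memory word
                    adder.increment();        // may spread across internal cells under contention
                }
            });
            threads[t].start();
        }
        try {
            for (Thread thread : threads) {
                thread.join(); // after join(), all of that thread's increments are visible
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for writers", e);
        }
        // sum() is exact here because no updates are in flight any more.
        return new long[] { atomic.get(), adder.sum() };
    }

    public static void main(String[] args) {
        long[] totals = countWith(4, 100_000);
        System.out.println("AtomicLong: " + totals[0] + ", LongAdder: " + totals[1]);
    }
}
```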

Setup Levels

JMH offers three setup levels, each with different timing semantics:

Level.Trial: Runs once per fork (per JVM instance). Use for expensive, reusable setup: database connections, large data structures, ObjectMapper initialization.

Level.Iteration: Runs before each iteration, warmup and measurement alike. Use when state must be refreshed periodically but the refresh is cheap: resetting a counter, clearing a small collection.

Level.Invocation: Runs before every benchmark invocation. Use with extreme caution. To keep setup time out of the score, JMH has to timestamp each invocation individually, and for short methods that timestamping overhead swamps the work being measured. The JMH Javadoc warns that Level.Invocation is only usable for benchmarks that take more than a millisecond per invocation, where this overhead is negligible.

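import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.datatype.jsr310.JavaTimeModule;
import org.openjdk.jmh.annotations.*;

import java.util.List;
import java.util.concurrent.TimeUnit;

// Article, ArticleRepository, and InMemoryArticleRepository come from the content platform's own code.
@BenchmarkMode(Mode.AverageTime)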
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)
@Fork(2)
@State(Scope.Thread)
public class SetupLevelBenchmark {

    private ArticleRepository repository;
    private ObjectMapper mapper;
    private List<Article> articles;

    // Trial: create the repository once per fork
    @Setup(Level.Trial)
    public void setupTrial() {
        repository = new InMemoryArticleRepository();
        mapper = new ObjectMapper();
        mapper.registerModule(new JavaTimeModule());
    }

    // Iteration: reload articles each iteration
    // (simulates cache invalidation between measurement windows)
    @Setup(Level.Iteration)
    public void setupIteration() {
        articles = repository.findAll();
    }

    @Benchmark
    public byte[] serializeAll() throws Exception {
        return mapper.writeValueAsBytes(articles);
    }
}

Never use Level.Invocation to “reset state” when the reset is cheap enough for Level.Iteration. Level.Invocation introduces measurement overhead and distorts results for fast operations.
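
To get a feel for how often each level fires, the sketch below simulates the nesting of forks, iterations, and invocations. It is not JMH code and does not reflect JMH internals. The class name and the invocations-per-iteration figure are hypothetical, and it assumes a single benchmark thread.

```java
// Simplified simulation of when each @Setup level fires for one benchmark method.
// NOT JMH code: it only mirrors the fork -> iteration -> invocation nesting.
class SetupLevelSimulation {

    /** Returns {trialSetups, iterationSetups, invocationSetups}. */
    static long[] countSetups(int forks, int warmupIterations, int measurementIterations,
                              int invocationsPerIteration) {
        long trial = 0;
        long iteration = 0;
        long invocation = 0;
        for (int f = 0; f < forks; f++) {
            trial++; // Level.Trial: once per forked JVM
            int iterations = warmupIterations + measurementIterations;
            for (int i = 0; i < iterations; i++) {
                iteration++; // Level.Iteration: warmup iterations count too
                for (int n = 0; n < invocationsPerIteration; n++) {
                    invocation++; // Level.Invocation: before every single call
                }
            }
        }
        return new long[] { trial, iteration, invocation };
    }

    public static void main(String[] args) {
        // @Fork(2), 5 warmup + 5 measurement iterations, as in the listings above.
        // 1,000,000 invocations per iteration is a HYPOTHETICAL figure for a fast method.
        long[] counts = countSetups(2, 5, 5, 1_000_000);
        System.out.println("Trial: " + counts[0]
            + ", Iteration: " + counts[1]
            + ", Invocation: " + counts[2]);
    }
}
```

With @Fork(2) and 5 + 5 iterations, a Level.Trial method runs twice and a Level.Iteration method 20 times. A Level.Invocation method runs once per call, which for a fast benchmark means millions of individually timestamped invocations.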

Profiler Integration

JMH can invoke built-in profilers during the benchmark run, providing deeper insight than raw timing numbers.

GC Profiler

java -jar target/benchmarks.jar ArticleSerializationBenchmark -prof gc

Output:

Benchmark                                                    Mode  Cnt      Score     Error  Units
ArticleSerializationBenchmark.serialize                      avgt   10   8432.241 ± 124.556  ns/op
ArticleSerializationBenchmark.serialize:·gc.alloc.rate       avgt   10   1247.832 ±  32.461  MB/sec
ArticleSerializationBenchmark.serialize:·gc.alloc.rate.norm  avgt   10  41232.018 ±   0.421  B/op
ArticleSerializationBenchmark.serialize:·gc.count            avgt   10     42.000            counts
ArticleSerializationBenchmark.serialize:·gc.time             avgt   10     31.000            ms

gc.alloc.rate.norm shows bytes allocated per operation: 41KB per serialization. This confirms the allocation analysis from Chapter 2. When you optimize allocation, this number should decrease.
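
If you want a rough bytes-per-operation figure outside JMH, HotSpot exposes per-thread allocation counters. The sketch below is an approximation in the spirit of gc.alloc.rate.norm, not a description of how JMH computes it. It relies on the HotSpot-specific com.sun.management.ThreadMXBean, and the class and method names are illustrative.

```java
import java.lang.management.ManagementFactory;

// Rough per-operation allocation measurement using HotSpot's per-thread counters.
class AllocationPerOp {

    static volatile Object sink; // publishing the result keeps the allocation from being optimized away

    /** Average bytes the current thread allocates per call of op, or -1 if the JVM cannot report it. */
    static long bytesPerOp(Runnable op, int iterations) {
        java.lang.management.ThreadMXBean standard = ManagementFactory.getThreadMXBean();
        if (!(standard instanceof com.sun.management.ThreadMXBean)) {
            return -1; // not a HotSpot-style JVM
        }
        com.sun.management.ThreadMXBean bean = (com.sun.management.ThreadMXBean) standard;
        if (!bean.isThreadAllocatedMemorySupported() || !bean.isThreadAllocatedMemoryEnabled()) {
            return -1;
        }
        long threadId = Thread.currentThread().getId();
        long before = bean.getThreadAllocatedBytes(threadId);
        for (int i = 0; i < iterations; i++) {
            op.run();
        }
        long after = bean.getThreadAllocatedBytes(threadId);
        return (after - before) / iterations;
    }

    public static void main(String[] args) {
        // A 1 KB array plus its object header; expect a little over 1024 bytes per op.
        long perOp = bytesPerOp(() -> sink = new byte[1024], 100_000);
        System.out.println("approx. bytes allocated per op: " + perOp);
    }
}
```

For real benchmarks, prefer -prof gc: it reports the same kind of figure alongside the score, under JMH's warmup and forking discipline.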

async-profiler Integration

java -jar target/benchmarks.jar ArticleSerializationBenchmark \
     -prof "async:libPath=/path/to/libasyncProfiler.so;output=flamegraph;dir=/tmp/profiles"

This produces a flame graph for the forked benchmark JVM, so the samples are dominated by serialize() and the code it calls. JMH's generated harness frames still appear near the base of each stack, but they account for very little of the width.

Stack Profiler

java -jar target/benchmarks.jar ViewCounterContentionBenchmark -prof stack

The stack profiler samples thread stack traces and reports the most common stack shapes. This is a lightweight alternative to async-profiler when you want quick stack analysis without installing external tools.

Output:

....[Thread state: RUNNABLE].............
 42.3%  42.3%  java.util.concurrent.atomic.AtomicLong.incrementAndGet
 31.2%  31.2%  java.util.concurrent.atomic.AtomicLong.get
 15.7%  15.7%  org.openjdk.jmh.infra.Blackhole.consume
  8.1%   8.1%  (other)

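This tells you that 42.3% of the RUNNABLE samples landed in the atomic increment, which is the CAS contention point. That is consistent with contention, not computation, being the bottleneck. Treat the percentages as approximate: the stack profiler samples at safepoints, so it can misattribute time in tight loops.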
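
The idea behind a sampling stack profiler fits in a few lines of plain Java. The sketch below is a toy, not JMH's implementation: it samples one worker thread with Thread.getStackTrace() and counts the top frame of each sample. All names are illustrative.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.CountDownLatch;

// Toy sampling profiler: NOT JMH's implementation, only the underlying idea.
class StackSampler {

    static volatile long sink; // gives the busy loop an observable side effect

    /** Busy loop for the sampler to observe; exits when the thread is interrupted. */
    static void spin() {
        long x = 0;
        while (!Thread.currentThread().isInterrupted()) {
            x += System.nanoTime() & 1;
            sink = x;
        }
    }

    /** Takes `samples` stack snapshots of `target` and counts the top frame of each. */
    static Map<String, Integer> sample(Thread target, int samples, long intervalMillis)
            throws InterruptedException {
        Map<String, Integer> topFrames = new TreeMap<>();
        for (int i = 0; i < samples; i++) {
            StackTraceElement[] stack = target.getStackTrace();
            if (stack.length > 0) {
                String frame = stack[0].getClassName() + "." + stack[0].getMethodName();
                topFrames.merge(frame, 1, Integer::sum);
            }
            Thread.sleep(intervalMillis);
        }
        return topFrames;
    }

    /** Starts a spinning worker, samples it, then stops it. */
    static Map<String, Integer> profileSpin(int samples, long intervalMillis) {
        CountDownLatch started = new CountDownLatch(1);
        Thread worker = new Thread(() -> {
            started.countDown();
            spin();
        }, "spin-worker");
        worker.setDaemon(true);
        worker.start();
        try {
            started.await();
            return sample(worker, samples, intervalMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("sampling interrupted", e);
        } finally {
            worker.interrupt(); // stop the busy loop
        }
    }

    public static void main(String[] args) {
        int samples = 50;
        profileSpin(samples, 5).forEach((frame, count) ->
            System.out.printf("%5.1f%%  %s%n", 100.0 * count / samples, frame));
    }
}
```

Which frame ends up on top depends on where the JVM can stop the thread, so the exact split varies between runs. That variability is a reason to read stack-profiler percentages as a rough guide.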

Multi-Threaded Scaling Benchmarks

To measure how throughput scales with thread count, use @Threads or the -t command-line flag:

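import org.openjdk.jmh.annotations.*;

import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

// SearchService and SearchResult come from the content platform's own code.
@BenchmarkMode(Mode.Throughput)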
@OutputTimeUnit(TimeUnit.SECONDS)
@Warmup(iterations = 5, time = 3, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 3, timeUnit = TimeUnit.SECONDS)
@Fork(2)
@State(Scope.Benchmark)
public class SearchScalingBenchmark {

    private SearchService searchService;
    private String[] queries;

    @Setup(Level.Trial)
    public void setup() {
        searchService = new SearchService(/* ... */);
        queries = new String[]{
            "java performance", "database indexing",
            "cache invalidation", "distributed systems",
            "jvm tuning", "garbage collection"
        };
    }

    @Benchmark
    public List<SearchResult> search() {
        String query = queries[ThreadLocalRandom.current().nextInt(queries.length)];
        return searchService.search(query, 20);
    }
}

Run with increasing thread counts:

# 1 thread
java -jar target/benchmarks.jar SearchScalingBenchmark -t 1

# 4 threads
java -jar target/benchmarks.jar SearchScalingBenchmark -t 4

# 8 threads
java -jar target/benchmarks.jar SearchScalingBenchmark -t 8

# Number of available CPUs
java -jar target/benchmarks.jar SearchScalingBenchmark -t max

If throughput doubles when threads double (from 1 to 2, from 2 to 4), the operation scales linearly. If throughput plateaus or decreases, you have a contention bottleneck: a shared lock, a synchronized block, a single-threaded resource like a database connection.

For the content platform’s search endpoint, expect near-linear scaling until you hit the connection pool limit. If the pool has 10 connections and you test with 20 threads, half the threads wait for connections. A wall-clock flame graph will show HikariPool.getConnection as a wide frame; a CPU-only profile will not, because waiting threads burn no CPU. The fix is not more threads. The fix is a larger connection pool or fewer queries per request.
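
Scaling is easier to judge as a single number. The sketch below computes scaling efficiency: measured throughput divided by the ideal of single-thread throughput times thread count. The throughput figures in main are hypothetical, chosen to resemble a workload that saturates around 10 threads; they are not measurements from this chapter.

```java
// Scaling efficiency: 1.0 means perfectly linear scaling, lower means contention.
class ScalingEfficiency {

    /** Fraction of ideal linear speedup achieved at the given thread count. */
    static double efficiency(double singleThreadOps, double multiThreadOps, int threads) {
        return multiThreadOps / (singleThreadOps * threads);
    }

    public static void main(String[] args) {
        // HYPOTHETICAL ops/s figures for runs with -t 1, -t 4, -t 8, and -t 20.
        int[] threads = {1, 4, 8, 20};
        double[] opsPerSec = {1_000, 3_900, 7_600, 9_800};

        for (int i = 0; i < threads.length; i++) {
            double e = efficiency(opsPerSec[0], opsPerSec[i], threads[i]);
            System.out.println(threads[i] + " threads: efficiency " + e);
        }
    }
}
```

In this made-up data set, efficiency stays near 1.0 up to 8 threads and falls to about 0.5 at 20 threads, the signature of a shared resource with roughly 10 slots.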

Benchmark Anti-Patterns Recap

Pattern	Problem	Fix
Return type `void` without Blackhole	Dead code elimination	Return the result or use `bh.consume()`
Literal values as method arguments	Constant folding	Use `@State` fields
`@Fork(1)`	A single JVM hides run-to-run variance (JIT decisions, memory layout)	Use `@Fork(2)` or higher
`@Setup(Level.Invocation)` for fast ops	Per-invocation timestamping overhead distorts the score	Use `Level.Trial` or `Level.Iteration`
Shared mutable state with `Scope.Benchmark`	Unintentional contention	Use `Scope.Thread` unless measuring contention
Ignoring error margin	Reporting noise as signal	Only report statistically significant changes
No `-prof gc`	Missing allocation data	Always run with `-prof gc` for allocation-sensitive code
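
The error-margin rule can be turned into a quick check. JMH's Error column is the half-width of a 99.9% confidence interval by default, so one coarse heuristic is to call two results different only when their intervals do not overlap. The sketch below applies that heuristic to scores from the serialization table earlier in this chapter. The class and method names are illustrative, and this is a screening rule, not a formal statistical test.

```java
// Coarse significance screen: do two "score ± error" intervals overlap?
class ErrorMargin {

    /** True when [a - errA, a + errA] and [b - errB, b + errB] do not overlap. */
    static boolean clearlyDifferent(double a, double errA, double b, double errB) {
        return Math.abs(a - b) > errA + errB;
    }

    public static void main(String[] args) {
        // bodySize = 10000: tagCount 0 (8.432 ± 0.124) vs tagCount 5 (8.589 ± 0.131)
        System.out.println("0 vs 5 tags:  " + clearlyDifferent(8.432, 0.124, 8.589, 0.131));
        // bodySize = 10000: tagCount 0 (8.432 ± 0.124) vs tagCount 20 (9.047 ± 0.198)
        System.out.println("0 vs 20 tags: " + clearlyDifferent(8.432, 0.124, 9.047, 0.198));
    }
}
```

The first comparison returns false: the 0.157us gap between 0 and 5 tags is smaller than the combined error of 0.255us, so it should not be reported as a real difference. The second returns true, because the 0.615us gap exceeds the combined error of 0.322us.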

Every benchmark in the remaining chapters applies these fixes. When you see a benchmark result, you can trust it because the benchmark was designed to avoid these traps. When you write your own benchmarks for the content platform, follow the same rules.