JMH Mastery: State, Setup, and Parameterized Benchmarks
Basic JMH usage measures a single operation with fixed inputs. Production performance questions are rarely that simple. You need to know how performance scales with input size, how contention affects throughput under multiple threads, and how different implementation strategies compare across a parameter space. JMH provides tools for all of these.
Parameterized Benchmarks with @Param
@Param generates separate benchmark runs for each parameter value. This is how you answer questions like “how does serialization time scale with article body size?”
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)
@Fork(2)
@State(Scope.Thread)
public class SerializationScalingBenchmark {
@Param({"100", "1000", "10000", "50000"})
private int bodySize;
@Param({"0", "5", "20"})
private int tagCount;
private ObjectMapper mapper;
private Article article;
@Setup(Level.Trial)
public void setup() {
mapper = new ObjectMapper();
mapper.registerModule(new JavaTimeModule());
List<String> tags = new ArrayList<>();
for (int i = 0; i < tagCount; i++) {
tags.add("tag-" + i);
}
article = new Article(
1L,
"Performance Engineering",
"A".repeat(bodySize),
"perf-eng",
List.of("java", "performance"),
tags,
42L,
Instant.now(),
Instant.now()
);
}
@Benchmark
public byte[] serialize() throws Exception {
return mapper.writeValueAsBytes(article);
}
}
This generates 12 benchmark runs (4 body sizes × 3 tag counts). The output shows how serialization time varies across the parameter space:
Benchmark (bodySize) (tagCount) Mode Cnt Score Error Units
SerializationScalingBenchmark.serialize 100 0 avgt 10 1.247 ± 0.031 us/op
SerializationScalingBenchmark.serialize 100 5 avgt 10 1.382 ± 0.028 us/op
SerializationScalingBenchmark.serialize 100 20 avgt 10 1.891 ± 0.044 us/op
SerializationScalingBenchmark.serialize 1000 0 avgt 10 2.834 ± 0.062 us/op
SerializationScalingBenchmark.serialize 1000 5 avgt 10 2.961 ± 0.057 us/op
SerializationScalingBenchmark.serialize 1000 20 avgt 10 3.472 ± 0.089 us/op
SerializationScalingBenchmark.serialize 10000 0 avgt 10 8.432 ± 0.124 us/op
SerializationScalingBenchmark.serialize 10000 5 avgt 10 8.589 ± 0.131 us/op
SerializationScalingBenchmark.serialize 10000 20 avgt 10 9.047 ± 0.198 us/op
SerializationScalingBenchmark.serialize 50000 0 avgt 10 34.218 ± 0.742 us/op
SerializationScalingBenchmark.serialize 50000 5 avgt 10 34.891 ± 0.812 us/op
SerializationScalingBenchmark.serialize 50000 20 avgt 10 36.124 ± 0.934 us/op
This data shows that serialization time is dominated by body size, not tag count. The 50,000-character body takes about 34 us; adding 20 tags adds only about 2 us. If you want to optimize serialization for the content platform, optimize body handling, not tag handling.
Use @Param to test your assumptions about what drives performance. If you think “the number of categories affects search speed,” parameterize the category count and measure. The data may contradict your assumption.
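Every additional @Param field multiplies the number of runs, so it helps to estimate the size of the grid before launching it. The following is a minimal sketch in plain Java, not JMH code: the class name ParamGrid and both method names are hypothetical, and the time estimate ignores JVM startup and harness overhead.

```java
import java.util.ArrayList;
import java.util.List;

/** Enumerates the cross product that two @Param fields would produce. */
final class ParamGrid {

    /** Returns every (bodySize, tagCount) pair, body size varying slowest. */
    static List<int[]> combinations(int[] bodySizes, int[] tagCounts) {
        List<int[]> grid = new ArrayList<>();
        for (int bodySize : bodySizes) {
            for (int tagCount : tagCounts) {
                grid.add(new int[] {bodySize, tagCount});
            }
        }
        return grid;
    }

    /** Lower bound on wall-clock seconds: every combination runs in every fork. */
    static long estimatedSeconds(int combinations, int forks, int warmupIterations,
                                 int measurementIterations, int secondsPerIteration) {
        return (long) combinations * forks
                * (warmupIterations + measurementIterations) * secondsPerIteration;
    }

    public static void main(String[] args) {
        List<int[]> grid = combinations(
                new int[] {100, 1000, 10000, 50000}, new int[] {0, 5, 20});
        System.out.println(grid.size() + " combinations");              // 12 combinations
        System.out.println(estimatedSeconds(grid.size(), 2, 5, 5, 2)
                + " seconds at minimum");                                // 480 seconds at minimum
    }
}
```

With the annotations used above (2 forks, 5 warmup and 5 measurement iterations of 2 seconds each), the 12 combinations need at least 480 seconds. Adding a third parameter with four values would quadruple that.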
Parameterized Implementation Comparison
Compare multiple implementations across the same parameter space:
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)
@Fork(2)
@State(Scope.Thread)
public class LookupStrategyBenchmark {
@Param({"10", "100", "1000", "10000"})
private int articleCount;
private Map<Long, Article> hashMap;
private TreeMap<Long, Article> treeMap;
private Article[] sortedArray;
private long lookupId;
@Setup(Level.Trial)
public void setup() {
var random = new java.util.Random(42);
hashMap = new HashMap<>(articleCount);
treeMap = new TreeMap<>();
sortedArray = new Article[articleCount];
for (int i = 0; i < articleCount; i++) {
Article article = new Article(
(long) i, "Article " + i, "Body", "slug-" + i,
List.of(), List.of(), 1L, Instant.now(), Instant.now()
);
hashMap.put(article.id(), article);
treeMap.put(article.id(), article);
sortedArray[i] = article;
}
lookupId = random.nextLong(articleCount);
}
@Benchmark
public Article hashMapLookup() {
return hashMap.get(lookupId);
}
@Benchmark
public Article treeMapLookup() {
return treeMap.get(lookupId);
}
@Benchmark
public Article binarySearchLookup() {
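// The search key is null on purpose: the comparator substitutes lookupId
// for the null side, so no probe Article has to be built per lookup.
int idx = Arrays.binarySearch(sortedArray, null,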
(a, b) -> Long.compare(
a != null ? a.id() : lookupId,
b != null ? b.id() : lookupId
));
return idx >= 0 ? sortedArray[idx] : null;
}
}
At 10 elements, all three perform similarly. At 10,000 elements, HashMap stays at constant time (about 5ns), while TreeMap (about 40ns) and binary search on a sorted array (about 35ns) both grow as O(log n). At scale, the choice of data structure changes lookup cost by close to an order of magnitude.
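The O(log n) figure can be made concrete by counting probes. The sketch below is an alternative layout, not the benchmark's own code: it assumes the ids are kept in a separate sorted long[] next to a parallel array of titles, which lets Arrays.binarySearch work on primitives without a comparator or a null key. The class name SortedIdLookup and the parallel-array design are assumptions made for illustration.

```java
import java.util.Arrays;

/** Binary search over sorted primitive ids, with payloads in a parallel array. */
final class SortedIdLookup {
    private final long[] sortedIds;   // must be in ascending order
    private final String[] titles;    // titles[i] belongs to sortedIds[i]

    SortedIdLookup(long[] sortedIds, String[] titles) {
        if (sortedIds.length != titles.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        this.sortedIds = sortedIds;
        this.titles = titles;
    }

    /** Returns the title for the id, or null when the id is absent. */
    String find(long id) {
        int idx = Arrays.binarySearch(sortedIds, id);
        return idx >= 0 ? titles[idx] : null;
    }

    /** Worst-case number of probes for n >= 1 elements: floor(log2(n)) + 1. */
    static int maxProbes(int n) {
        return 32 - Integer.numberOfLeadingZeros(n);
    }

    public static void main(String[] args) {
        SortedIdLookup lookup = new SortedIdLookup(
                new long[] {1L, 2L, 3L}, new String[] {"a", "b", "c"});
        System.out.println(lookup.find(2L));     // b
        System.out.println(maxProbes(10));       // 4
        System.out.println(maxProbes(10_000));   // 14
    }
}
```

Going from 10 to 10,000 elements raises the worst case from 4 probes to 14, which is why the tree and the sorted array slow down with size while the hash lookup does not.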
Asymmetric Benchmarks with @Group
Real systems have concurrent readers and writers. JMH’s @Group and @GroupThreads annotations benchmark asymmetric concurrency patterns.
In the content platform, the view counter is read by article serving (frequent) and incremented by view events (also frequent). The benchmark needs to measure both operations concurrently:
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@Warmup(iterations = 5, time = 3, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 3, timeUnit = TimeUnit.SECONDS)
@Fork(2)
@State(Scope.Group)
public class ViewCounterContentionBenchmark {
private AtomicLong atomicCounter;
private LongAdder adderCounter;
@Setup
public void setup() {
atomicCounter = new AtomicLong();
adderCounter = new LongAdder();
}
// AtomicLong: single CAS point of contention
@Benchmark
@Group("atomic")
@GroupThreads(3) // 3 reader threads
public long atomicRead() {
return atomicCounter.get();
}
@Benchmark
@Group("atomic")
@GroupThreads(1) // 1 writer thread
public void atomicWrite() {
atomicCounter.incrementAndGet();
}
// LongAdder: distributed contention
@Benchmark
@Group("adder")
@GroupThreads(3)
public long adderRead() {
return adderCounter.sum();
}
@Benchmark
@Group("adder")
@GroupThreads(1)
public void adderWrite() {
adderCounter.increment();
}
}
Expected results with three readers and one writer:
Benchmark Mode Cnt Score Error Units
ViewCounterContentionBenchmark.atomic thrpt 10 128432891.2 ± 4231082.1 ops/s
ViewCounterContentionBenchmark.atomic:atomicRead thrpt 10 96421823.4 ± 3182041.2 ops/s
ViewCounterContentionBenchmark.atomic:atomicWrite thrpt 10 32011067.8 ± 1049040.9 ops/s
ViewCounterContentionBenchmark.adder thrpt 10 287612043.1 ± 8941230.4 ops/s
ViewCounterContentionBenchmark.adder:adderRead thrpt 10 201438201.3 ± 6723410.2 ops/s
ViewCounterContentionBenchmark.adder:adderWrite thrpt 10 86173841.8 ± 2217820.2 ops/s
LongAdder delivers 2.2x higher total throughput under this read/write mix because it distributes write contention across multiple cells. For the content platform’s view counter at 10,000 reads/second and 10,000 writes/second, LongAdder is the correct choice.
The @GroupThreads ratio matters. 3 readers to 1 writer reflects the content platform’s read-heavy traffic. Change the ratio to match your workload: 10:1 for read-dominated, 1:1 for balanced, 1:3 for write-dominated.
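Outside the benchmark, the counter itself can be a thin wrapper around LongAdder. This is a minimal sketch under assumptions: the class name ViewCounter and its methods are hypothetical and do not come from the content platform's code. The helper countWith only demonstrates that no increments are lost once all writer threads have finished.

```java
import java.util.concurrent.atomic.LongAdder;

/** A view counter backed by LongAdder: cheap concurrent writes, summed reads. */
final class ViewCounter {
    private final LongAdder views = new LongAdder();

    void recordView() {
        views.increment();
    }

    /** Sums the adder's internal cells; not an atomic snapshot while writes are in flight. */
    long current() {
        return views.sum();
    }

    /** Runs the given number of writer threads and returns the final count. */
    static long countWith(int writers, int incrementsPerWriter) {
        ViewCounter counter = new ViewCounter();
        Thread[] threads = new Thread[writers];
        for (int i = 0; i < writers; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < incrementsPerWriter; j++) {
                    counter.recordView();
                }
            });
            threads[i].start();
        }
        for (Thread thread : threads) {
            try {
                thread.join();   // after join, every increment is visible to sum()
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for writers", e);
            }
        }
        return counter.current();
    }

    public static void main(String[] args) {
        System.out.println(countWith(4, 10_000));   // prints 40000
    }
}
```

The trade-off is on the read side: sum() walks the cells and can miss updates that land during the walk, which is acceptable for a view count but not for values that must be read exactly.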
Setup Levels
JMH offers three setup levels, each with different timing semantics:
Level.Trial: Runs once per fork (per JVM instance). Use for expensive, reusable setup: database connections, large data structures, ObjectMapper initialization.
Level.Iteration: Runs before every iteration, warmup and measurement alike. Use when state must be refreshed periodically but the refresh is cheap: resetting a counter, clearing a small collection.
Level.Invocation: Runs before every benchmark invocation. Use with extreme caution. JMH excludes the setup time from the score, but to do so it must timestamp every single invocation, and that timestamping overhead distorts the results of fast benchmarks. The JMH documentation warns that Level.Invocation is only usable when a single invocation takes more than a millisecond, so that the per-call overhead is negligible.
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)
@Fork(2)
@State(Scope.Thread)
public class SetupLevelBenchmark {
private ArticleRepository repository;
private ObjectMapper mapper;
private List<Article> articles;
// Trial: create the repository once per fork
@Setup(Level.Trial)
public void setupTrial() {
repository = new InMemoryArticleRepository();
mapper = new ObjectMapper();
mapper.registerModule(new JavaTimeModule());
}
// Iteration: reload articles each iteration
// (simulates cache invalidation between measurement windows)
@Setup(Level.Iteration)
public void setupIteration() {
articles = repository.findAll();
}
@Benchmark
public byte[] serializeAll() throws Exception {
return mapper.writeValueAsBytes(articles);
}
}
Never use Level.Invocation to “reset state” when the reset is cheap enough for Level.Iteration. Level.Invocation forces JMH to timestamp every call, and that overhead distorts results for fast operations.
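How often each level fires is easiest to see by counting. The sketch below is a simplified model of one fork and one thread, written in plain Java; it is not how JMH is implemented, and the class name LifecycleSketch and the invocation count of 1,000 per iteration are assumptions chosen for illustration.

```java
/** Counts how often each setup level would run in one fork with one thread. */
final class LifecycleSketch {
    int trialSetups;
    int iterationSetups;
    int invocationSetups;

    /** Models a single benchmark run; iterations covers warmup plus measurement. */
    void run(int iterations, int invocationsPerIteration) {
        trialSetups++;                      // corresponds to @Setup(Level.Trial)
        for (int i = 0; i < iterations; i++) {
            iterationSetups++;              // corresponds to @Setup(Level.Iteration)
            for (int j = 0; j < invocationsPerIteration; j++) {
                invocationSetups++;         // corresponds to @Setup(Level.Invocation)
                // the @Benchmark method itself would be called here
            }
        }
    }

    public static void main(String[] args) {
        LifecycleSketch sketch = new LifecycleSketch();
        sketch.run(10, 1_000);   // 5 warmup + 5 measurement iterations
        System.out.println(sketch.trialSetups);        // 1
        System.out.println(sketch.iterationSetups);    // 10
        System.out.println(sketch.invocationSetups);   // 10000
    }
}
```

A real microsecond-scale benchmark executes far more than 1,000 invocations per iteration, so anything attached to Level.Invocation runs orders of magnitude more often than the other two levels.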
Profiler Integration
JMH can invoke built-in profilers during the benchmark run, providing deeper insight than raw timing numbers.
GC Profiler
java -jar target/benchmarks.jar ArticleSerializationBenchmark -prof gc
Output:
Benchmark Mode Cnt Score Error Units
ArticleSerializationBenchmark.serialize avgt 10 8432.241 ± 124.556 ns/op
ArticleSerializationBenchmark.serialize:·gc.alloc.rate avgt 10 1247.832 ± 32.461 MB/sec
ArticleSerializationBenchmark.serialize:·gc.alloc.rate.norm avgt 10 41232.018 ± 0.421 B/op
ArticleSerializationBenchmark.serialize:·gc.count avgt 10 42.000 counts
ArticleSerializationBenchmark.serialize:·gc.time avgt 10 31.000 ms
gc.alloc.rate.norm shows bytes allocated per operation: 41KB per serialization. This confirms the allocation analysis from Chapter 2. When you optimize allocation, this number should decrease.
async-profiler Integration
java -jar target/benchmarks.jar ArticleSerializationBenchmark \
-prof "async:libPath=/path/to/libasyncProfiler.so;output=flamegraph;dir=/tmp/profiles"
This produces a flame graph scoped to the benchmark's forked JVM, so unrelated application work stays out of the picture. Harness frames such as the generated benchmark stub still appear in the stacks, but serialize() and its callees should dominate the graph.
Stack Profiler
java -jar target/benchmarks.jar ViewCounterContentionBenchmark -prof stack
The stack profiler samples thread stack traces and reports the most common stack shapes. This is a lightweight alternative to async-profiler when you want quick stack analysis without installing external tools.
Output:
....[Thread state: RUNNABLE].............
42.3% 42.3% java.util.concurrent.atomic.AtomicLong.incrementAndGet
31.2% 31.2% java.util.concurrent.atomic.AtomicLong.get
15.7% 15.7% org.openjdk.jmh.infra.Blackhole.consume
8.1% 8.1% (other)
This tells you that 42.3% of samples land in the atomic increment, the single memory location every write must update. That points to contention rather than computation as the bottleneck. The stack profiler samples at safepoints, so treat the percentages as approximate.
Multi-Threaded Scaling Benchmarks
To measure how throughput scales with thread count, use @Threads or the -t command-line flag:
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@Warmup(iterations = 5, time = 3, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 3, timeUnit = TimeUnit.SECONDS)
@Fork(2)
@State(Scope.Benchmark)
public class SearchScalingBenchmark {
private SearchService searchService;
private String[] queries;
@Setup(Level.Trial)
public void setup() {
searchService = new SearchService(/* ... */);
queries = new String[]{
"java performance", "database indexing",
"cache invalidation", "distributed systems",
"jvm tuning", "garbage collection"
};
}
@Benchmark
public List<SearchResult> search() {
String query = queries[ThreadLocalRandom.current().nextInt(queries.length)];
return searchService.search(query, 20);
}
}
Run with increasing thread counts:
# 1 thread
java -jar target/benchmarks.jar SearchScalingBenchmark -t 1
# 4 threads
java -jar target/benchmarks.jar SearchScalingBenchmark -t 4
# 8 threads
java -jar target/benchmarks.jar SearchScalingBenchmark -t 8
# Number of available CPUs
java -jar target/benchmarks.jar SearchScalingBenchmark -t max
If throughput doubles when threads double (from 1 to 2, from 2 to 4), the operation scales linearly. If throughput plateaus or decreases, you have a contention bottleneck: a shared lock, a synchronized block, a single-threaded resource like a database connection.
For the content platform’s search endpoint, expect near-linear scaling until you hit the connection pool limit. If the pool has 10 connections and you test with 20 threads, half the threads wait for connections. Because waiting threads consume no CPU, it takes a wall-clock flame graph to show HikariPool.getConnection as a wide frame. The fix is not more threads. The fix is a larger connection pool or fewer queries per request.
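Scaling is easier to judge as a ratio than as raw throughput. The helper below is a minimal sketch, not part of JMH: the class name ScalingEfficiency is hypothetical, and the throughput figures in main are made-up placeholders rather than measurements.

```java
/** Turns throughput measurements at different thread counts into a scaling ratio. */
final class ScalingEfficiency {

    /** 1.0 means perfectly linear scaling; lower values indicate contention. */
    static double efficiency(double singleThreadOps, int threads, double measuredOps) {
        return measuredOps / (singleThreadOps * threads);
    }

    /** Threads left without a connection when every thread wants one at once. */
    static int waitingThreads(int threads, int poolSize) {
        return Math.max(0, threads - poolSize);
    }

    public static void main(String[] args) {
        // Placeholder numbers: 1,000 ops/s on one thread, 4,000 ops/s on four and on eight.
        System.out.println(efficiency(1_000.0, 4, 4_000.0));   // 1.0
        System.out.println(efficiency(1_000.0, 8, 4_000.0));   // 0.5
        System.out.println(waitingThreads(20, 10));            // 10
    }
}
```

An efficiency that stays near 1.0 and then drops sharply at a specific thread count usually marks the point where a shared resource, such as the 10-connection pool in the example, is saturated.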
Benchmark Anti-Patterns Recap
| Pattern | Problem | Fix |
|---|---|---|
| Return type void without Blackhole | Dead code elimination | Return the result or use bh.consume() |
| Literal values as method arguments | Constant folding | Use @State fields |
| @Fork(1) | A single JVM run hides run-to-run variance | Use @Fork(2) or higher |
| @Setup(Level.Invocation) for fast ops | Per-invocation timestamping distorts the measurement | Use Level.Trial or Level.Iteration |
| Shared mutable state with Scope.Benchmark | Unintentional contention | Use Scope.Thread unless measuring contention |
| Ignoring error margin | Reporting noise as signal | Only report statistically significant changes |
| No -prof gc | Missing allocation data | Always run with -prof gc for allocation-sensitive code |
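The “ignoring error margin” row can be turned into a mechanical check. The sketch below treats two results as distinguishable only when their score ± error intervals do not overlap, which is a conservative rule of thumb rather than a formal significance test. The class name ErrorMargin is hypothetical; the numbers in main are taken from the serialization results table earlier in this section.

```java
/** Compares two JMH results by checking whether their error intervals overlap. */
final class ErrorMargin {

    /** True when [scoreA ± errorA] and [scoreB ± errorB] share no common value. */
    static boolean distinguishable(double scoreA, double errorA,
                                   double scoreB, double errorB) {
        double lowA = scoreA - errorA;
        double highA = scoreA + errorA;
        double lowB = scoreB - errorB;
        double highB = scoreB + errorB;
        return highA < lowB || highB < lowA;
    }

    public static void main(String[] args) {
        // bodySize=100: 0 tags vs 5 tags. The intervals are disjoint.
        System.out.println(distinguishable(1.247, 0.031, 1.382, 0.028));     // true
        // bodySize=50000: 0 tags vs 5 tags. The intervals overlap.
        System.out.println(distinguishable(34.218, 0.742, 34.891, 0.812));   // false
    }
}
```

At a body size of 100, the cost of five tags is a real effect. At 50,000, the same five tags disappear into the noise, so that difference should not be reported as a change.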
Every benchmark in the remaining chapters avoids these anti-patterns. When you see a benchmark result, you can trust it because the benchmark was designed to avoid these traps. When you write your own benchmarks for the content platform, follow the same rules.