GC Collectors Compared: G1, ZGC, and Shenandoah Under Load
Choosing a garbage collector without benchmarking it against your workload is guessing. Vendor documentation tells you what the collector is designed to do. Benchmarks tell you what it actually does with your allocation patterns, your object lifetimes, and your heap size.
The content platform has a specific allocation profile: each request allocates 50-200KB of short-lived objects (deserialized articles, search results, JSON response buffers), plus a stable 2.5GB of long-lived data (article cache, connection pools, configuration). This profile stresses the young generation collector because allocation rate is high, but rarely stresses the old generation because most objects die young.
The Benchmark Workload
The benchmark simulates the content platform’s hot path: fetch an article from a cache, assemble a response with recommendations, serialize it, and discard the temporary objects. The allocation pattern mimics production.
import java.nio.charset.StandardCharsets;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.SampleTime)
@Warmup(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 15, time = 10, timeUnit = TimeUnit.SECONDS)
@Fork(value = 3, jvmArgs = {"-Xms8g", "-Xmx8g"})
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
public class GcCollectorComparison {
    private static final int ARTICLE_COUNT = 50_000;
    private static final int RECOMMENDATION_COUNT = 10;

    // Long-lived data: stands in for the production article cache
    private Article[] articleCache;
    private Random random;

    @Setup
    public void setup() {
        articleCache = new Article[ARTICLE_COUNT];
        random = new Random(42);
        for (int i = 0; i < ARTICLE_COUNT; i++) {
            // Random.nextInt(origin, bound) requires Java 17 or later
            articleCache[i] = new Article(
                "Article Title " + i,
                generateContent(random.nextInt(2000, 8000)),
                List.of("java", "performance", "gc"),
                Instant.now(),
                random.nextInt(1000, 50000)
            );
        }
    }

    @Benchmark
    public byte[] serveArticleWithRecommendations() {
        // Simulate: fetch article
        Article article = articleCache[random.nextInt(ARTICLE_COUNT)];

        // Simulate: build recommendation list (short-lived allocations)
        List<ArticleSummary> recommendations = new ArrayList<>(RECOMMENDATION_COUNT);
        for (int i = 0; i < RECOMMENDATION_COUNT; i++) {
            Article rec = articleCache[random.nextInt(ARTICLE_COUNT)];
            recommendations.add(new ArticleSummary(
                rec.title(),
                rec.content().substring(0, Math.min(150, rec.content().length())),
                rec.viewCount()
            ));
        }

        // Simulate: build response (short-lived StringBuilder)
        StringBuilder response = new StringBuilder(article.content().length() + 2000);
        response.append("{\"article\":{\"title\":\"").append(article.title()).append("\",");
        response.append("\"content\":\"").append(article.content()).append("\",");
        response.append("\"views\":").append(article.viewCount()).append(",");
        response.append("\"recommendations\":[");
        for (int i = 0; i < recommendations.size(); i++) {
            if (i > 0) response.append(",");
            ArticleSummary rec = recommendations.get(i);
            response.append("{\"title\":\"").append(rec.title()).append("\",");
            response.append("\"preview\":\"").append(rec.preview()).append("\",");
            response.append("\"views\":").append(rec.viewCount()).append("}");
        }
        response.append("]}");

        // Simulate: serialize to bytes (short-lived byte array)
        return response.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Deterministic filler text of the requested length (a-z repeated)
    private String generateContent(int length) {
        char[] chars = new char[length];
        for (int i = 0; i < length; i++) {
            chars[i] = (char) ('a' + (i % 26));
        }
        return new String(chars);
    }

    record Article(String title, String content, List<String> tags,
                   Instant publishedAt, int viewCount) {}

    record ArticleSummary(String title, String preview, int viewCount) {}
}
This benchmark is run three times, once per collector, with identical heap settings. The only change between runs is the collector flag.
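One way to arrange the three runs is to leave the heap flags in @Fork and pass only the collector flag on the JMH command line. The sketch below only builds those command lines; it does not launch anything. It assumes the benchmarks are packaged as target/benchmarks.jar (the usual name from the JMH Maven archetype, not something stated here) and relies on JMH's -jvmArgsAppend option. The class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative helper: one JMH command line per collector.
// Assumption: the benchmark jar lives at target/benchmarks.jar.
class CollectorRuns {
    static final List<String> COLLECTOR_FLAGS =
            List.of("-XX:+UseG1GC", "-XX:+UseZGC", "-XX:+UseShenandoahGC");

    static List<String> commandFor(String collectorFlag) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add("target/benchmarks.jar");
        cmd.add("GcCollectorComparison"); // benchmark name filter
        cmd.add("-jvmArgsAppend");        // added to the forked JVM's arguments
        cmd.add(collectorFlag);
        return cmd;
    }

    public static void main(String[] args) {
        for (String flag : COLLECTOR_FLAGS) {
            System.out.println(String.join(" ", commandFor(flag)));
        }
    }
}
```

Keeping the collector flag out of the source file means the same compiled benchmark is used for all three runs, so the only variable is the one being compared.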
G1 Results
# JVM flags: -XX:+UseG1GC -Xms8g -Xmx8g
Benchmark                               Mode      Cnt     Score  Error  Units
serveArticleWithRecommendations         sample  45000      42.3 ±  1.2  us/op
serveArticleWithRecommendations:p0.50   sample             38.1         us/op
serveArticleWithRecommendations:p0.90   sample             51.4         us/op
serveArticleWithRecommendations:p0.99   sample            284.7         us/op
serveArticleWithRecommendations:p0.999  sample           8241.0         us/op
serveArticleWithRecommendations:p1.00   sample          18432.0         us/op
The p50 is 38 microseconds. The p99 jumps to 284 microseconds. The p999 reaches 8.2 milliseconds. That p999 spike is a young GC pause. The maximum of 18.4ms is a mixed collection.
The allocation rate during this benchmark is approximately 1.2 GB/s. G1’s young generation fills every 200-300ms, triggering a young collection. Every 10-15 young collections, G1 runs a mixed collection that also evacuates old-generation regions.
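The figures above imply a rough size for the space G1 fills between young collections. The following back-of-envelope sketch multiplies the allocation rate by the collection interval, and the young-collection interval by the number of young collections per mixed collection. It uses decimal units (1 GB = 1000 MB) and ignores that G1 resizes the young generation adaptively, so treat the output as an estimate only. The class and method names are illustrative.

```java
// Back-of-envelope estimate from the measured allocation rate and GC interval.
class YoungGenEstimate {
    // Megabytes allocated between two young collections.
    static double filledMegabytes(double allocRateGbPerSec, double intervalMillis) {
        double mbPerSec = allocRateGbPerSec * 1000.0;
        return mbPerSec * (intervalMillis / 1000.0);
    }

    // Seconds between mixed collections, given young collections per mixed collection.
    static double mixedIntervalSeconds(int youngCollections, double intervalMillis) {
        return youngCollections * intervalMillis / 1000.0;
    }

    public static void main(String[] args) {
        System.out.printf("Filled per young cycle: %.0f to %.0f MB%n",
                filledMegabytes(1.2, 200), filledMegabytes(1.2, 300));
        System.out.printf("Mixed collection every %.1f to %.1f s%n",
                mixedIntervalSeconds(10, 200), mixedIntervalSeconds(15, 300));
    }
}
```

At 1.2 GB/s this works out to roughly 240 to 360 MB per young cycle and a mixed collection every 2 to 4.5 seconds, which is why the mixed pauses show up only at the very top of the latency distribution.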
ZGC Results
# JVM flags: -XX:+UseZGC -Xms8g -Xmx8g
Benchmark                               Mode      Cnt     Score  Error  Units
serveArticleWithRecommendations         sample  43200      44.8 ±  0.9  us/op
serveArticleWithRecommendations:p0.50   sample             40.2         us/op
serveArticleWithRecommendations:p0.90   sample             53.8         us/op
serveArticleWithRecommendations:p0.99   sample             72.4         us/op
serveArticleWithRecommendations:p0.999  sample             98.6         us/op
serveArticleWithRecommendations:p1.00   sample            412.0         us/op
The p50 is 2 microseconds higher (40.2 vs 38.1). That is the load barrier overhead. But the tail is transformed: p99 drops from 284 to 72 microseconds. p999 drops from 8,241 to 98 microseconds. The maximum drops from 18,432 to 412 microseconds.
ZGC’s pauses are invisible in the benchmark results because they are sub-millisecond. The p99 and p999 numbers reflect application-level variance (cache misses, thread scheduling), not GC pauses.
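Why a stop-the-world pause moves p999 but leaves p50 untouched comes down to how percentiles weight rare events. The sketch below uses synthetic numbers, not the benchmark's samples: 100,000 operations at 40 microseconds each, with one in every 500 operations stretched to 8,000 microseconds by a hypothetical pause. It uses a simple nearest-rank percentile, which is not necessarily how JMH computes its own percentiles.

```java
import java.util.Arrays;

// Synthetic illustration: rare long pauses only show up in the far tail.
class TailPercentiles {
    // Nearest-rank percentile over a sorted array; p is in (0, 1].
    static long percentile(long[] sorted, double p) {
        int rank = (int) Math.ceil(p * sorted.length);
        int index = Math.min(Math.max(rank, 1), sorted.length) - 1;
        return sorted[index];
    }

    // baseMicros for every sample, except pauseMicros once every `period` samples.
    static long[] synthetic(int n, long baseMicros, long pauseMicros, int period) {
        long[] samples = new long[n];
        for (int i = 0; i < n; i++) {
            samples[i] = (i % period == 0) ? pauseMicros : baseMicros;
        }
        Arrays.sort(samples);
        return samples;
    }

    public static void main(String[] args) {
        long[] s = synthetic(100_000, 40, 8_000, 500);
        System.out.println("p50  = " + percentile(s, 0.50));
        System.out.println("p99  = " + percentile(s, 0.99));
        System.out.println("p999 = " + percentile(s, 0.999));
    }
}
```

With 0.2% of samples hit by the pause, the median and p99 stay at the base latency while p999 lands on the pause. Remove the pause and every percentile collapses to the base value, which is the shape the ZGC results show.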
Shenandoah Results
# JVM flags: -XX:+UseShenandoahGC -Xms8g -Xmx8g
Benchmark                               Mode      Cnt     Score  Error  Units
serveArticleWithRecommendations         sample  42800      45.6 ±  1.1  us/op
serveArticleWithRecommendations:p0.50   sample             41.0         us/op
serveArticleWithRecommendations:p0.90   sample             55.2         us/op
serveArticleWithRecommendations:p0.99   sample             78.1         us/op
serveArticleWithRecommendations:p0.999  sample            118.3         us/op
serveArticleWithRecommendations:p1.00   sample            524.0         us/op
Shenandoah’s numbers are close to ZGC’s: a slightly higher p50 (its barrier and forwarding-pointer overhead costs marginally more than ZGC’s load barriers in this workload) and a similar tail latency improvement over G1.
Throughput Comparison
The sample-time benchmark above measures per-operation latency. A separate throughput benchmark measures aggregate operations per second:
import java.nio.charset.StandardCharsets;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.Throughput)
@Warmup(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 15, time = 10, timeUnit = TimeUnit.SECONDS)
@Fork(value = 3, jvmArgs = {"-Xms8g", "-Xmx8g"})
@Threads(8)
@OutputTimeUnit(TimeUnit.SECONDS)
@State(Scope.Benchmark)
public class GcThroughputComparison {
    // Simplified variant of the benchmark above, run with @Threads(8) for concurrency
    private Article[] articleCache;

    // Per-thread random source: avoids contention on a shared Random
    private ThreadLocalRandom random() { return ThreadLocalRandom.current(); }

    @Setup
    public void setup() {
        articleCache = new Article[50_000];
        Random r = new Random(42);
        for (int i = 0; i < 50_000; i++) {
            articleCache[i] = new Article(
                "Article Title " + i,
                "x".repeat(r.nextInt(2000, 8000)),
                List.of("java", "performance"),
                Instant.now(),
                r.nextInt(1000, 50000)
            );
        }
    }

    @Benchmark
    public byte[] serveArticle() {
        ThreadLocalRandom r = random();
        Article article = articleCache[r.nextInt(articleCache.length)];
        List<String> recTitles = new ArrayList<>(10);
        for (int i = 0; i < 10; i++) {
            recTitles.add(articleCache[r.nextInt(articleCache.length)].title());
        }
        StringBuilder sb = new StringBuilder(article.content().length() + 500);
        sb.append(article.title()).append(article.content());
        for (String t : recTitles) sb.append(t);
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    record Article(String title, String content, List<String> tags,
                   Instant publishedAt, int viewCount) {}
}
Results:
Collector    Throughput (ops/s)    Relative
G1           1,847,234 ± 23,145    100% (baseline)
ZGC          1,762,108 ± 18,932    95.4%
Shenandoah   1,738,542 ± 21,087    94.1%
G1 delivers the highest throughput. ZGC costs 4.6%. Shenandoah costs 5.9%. These percentages are stable across runs.
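The relative column is each score divided by the G1 score. The sketch below reproduces that arithmetic and adds one check that the table does not show: whether two scores are separated by more than their combined error margins. It treats the ± column as an interval half-width, which is an assumption about how to read the output, and the class and method names are illustrative.

```java
// Reproduces the "Relative" column and checks whether error intervals overlap.
class RelativeThroughput {
    static double relativePercent(double opsPerSec, double baselineOpsPerSec) {
        return 100.0 * opsPerSec / baselineOpsPerSec;
    }

    // True when [a - errA, a + errA] and [b - errB, b + errB] intersect.
    static boolean intervalsOverlap(double a, double errA, double b, double errB) {
        return Math.abs(a - b) <= errA + errB;
    }

    public static void main(String[] args) {
        double g1 = 1_847_234, zgc = 1_762_108, shen = 1_738_542;
        System.out.printf("ZGC:        %.1f%%%n", relativePercent(zgc, g1));
        System.out.printf("Shenandoah: %.1f%%%n", relativePercent(shen, g1));
        System.out.println("G1 vs ZGC overlap:         "
                + intervalsOverlap(g1, 23_145, zgc, 18_932));
        System.out.println("ZGC vs Shenandoah overlap: "
                + intervalsOverlap(zgc, 18_932, shen, 21_087));
    }
}
```

Read this way, the G1 advantage is well outside the error margins, while the ZGC and Shenandoah intervals overlap, so the gap between those two is less certain than the gap between either of them and G1.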
Allocation Rate Sensitivity
The throughput gap between G1 and ZGC widens as allocation rate increases. When each operation allocates more memory, ZGC’s load barriers fire more frequently on the additional object references.
// Added to GcThroughputComparison; also needs java.util.HashMap and java.util.Map imports
@Benchmark
@Fork(value = 2, jvmArgs = {"-Xms8g", "-Xmx8g"})
public byte[] highAllocationServe() {
    ThreadLocalRandom r = random();
    // Allocate a large intermediate map (high allocation rate)
    Map<String, Object> response = new HashMap<>(50);
    for (int i = 0; i < 50; i++) {
        Article a = articleCache[r.nextInt(articleCache.length)];
        response.put("article_" + i, Map.of(
            "title", a.title(),
            "preview", a.content().substring(0, 200),
            "tags", new ArrayList<>(a.tags()),
            "views", a.viewCount()
        ));
    }
    StringBuilder sb = new StringBuilder(10_000);
    for (var entry : response.entrySet()) {
        sb.append(entry.getKey()).append("=").append(entry.getValue());
    }
    return sb.toString().getBytes(StandardCharsets.UTF_8);
}
With the high-allocation variant (5x more allocation per operation):
Collector    Throughput (ops/s)   Relative
G1           412,890 ± 8,234      100% (baseline)
ZGC          381,045 ± 7,112      92.3%
Shenandoah   374,218 ± 7,890      90.6%
The ZGC overhead grows from 4.6% to 7.7% with higher allocation rates. This is expected: more allocated objects mean more reference loads, and each reference load incurs a barrier check.
The latency improvement remains decisive:
Collector    p99 (us)   p999 (us)   max (us)
G1              1,842      24,576     41,088
ZGC               184         298        712
Shenandoah        201         342        856
G1’s p999 is 24.6 milliseconds. That is a mixed collection pause under high allocation pressure. ZGC’s p999 is 298 microseconds, an 82x improvement in tail latency.
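The improvement factor is the G1 figure divided by the other collector's figure at the same percentile. A minimal sketch of that calculation over the latency table, with illustrative names:

```java
// Improvement factor at a given percentile: baseline latency / candidate latency.
class TailRatio {
    static double improvement(double baselineMicros, double candidateMicros) {
        return baselineMicros / candidateMicros;
    }

    public static void main(String[] args) {
        // Values copied from the high-allocation latency table (microseconds).
        System.out.printf("ZGC        p99 %.0fx, p999 %.0fx, max %.0fx%n",
                improvement(1_842, 184), improvement(24_576, 298), improvement(41_088, 712));
        System.out.printf("Shenandoah p99 %.0fx, p999 %.0fx, max %.0fx%n",
                improvement(1_842, 201), improvement(24_576, 342), improvement(41_088, 856));
    }
}
```

The factor is largest at p999 because that is where G1's mixed collection pauses land; at p99 the gap is closer to 10x.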
Multi-threaded Scaling
GC pauses affect all application threads simultaneously. The more threads your application uses, the more requests are affected by each pause.
With 32 application threads at 200 req/s per thread (6,400 total req/s), the number of requests that arrive during a pause is the request rate multiplied by the pause duration:
G1: ~320 requests delayed per 50ms mixed GC pause
ZGC: ~2 requests delayed per 0.3ms pause (effectively zero)
The impact of GC pauses scales linearly with request throughput. At 6,400 req/s, a single 50ms G1 mixed collection delays roughly 320 requests. At the content platform’s peak traffic, that is the difference between meeting the SLO and breaching it.
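One way to estimate the effect is to count the requests that arrive while a pause is in progress: request rate multiplied by pause duration. The sketch below does only that. It ignores requests already in flight and any queueing after the pause ends, so it is a lower bound, and the names are illustrative.

```java
// Requests arriving during a stop-the-world pause = arrival rate * pause length.
class PauseImpact {
    static double requestsDuringPause(double requestsPerSecond, double pauseMillis) {
        return requestsPerSecond * pauseMillis / 1000.0;
    }

    public static void main(String[] args) {
        double rate = 32 * 200; // 32 threads at 200 req/s each
        System.out.printf("G1  (50 ms pause):  %.1f requests%n", requestsDuringPause(rate, 50));
        System.out.printf("ZGC (0.3 ms pause): %.1f requests%n", requestsDuringPause(rate, 0.3));
    }
}
```

Doubling either the request rate or the pause length doubles the count, which is the linear scaling described above.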
When G1 Wins
G1 is not always the wrong choice. For batch processing workloads where latency does not matter, G1’s higher throughput is preferable:
// Batch indexing: process 100,000 articles for search indexing
// G1 completes in 45 seconds
// ZGC completes in 47 seconds
// The 2-second difference matters when you run this every 30 minutes
For the content platform’s indexing pipeline, which runs every 30 minutes to re-index new articles into the search engine, G1 finishes faster. No user is waiting for the result. The occasional 100ms GC pause during indexing is irrelevant.
The pattern: use ZGC or Shenandoah on the serving path where latency SLOs apply. Use G1 on the batch path where throughput determines completion time.
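If services are launched from a shared script or deployment tool, the pattern can be encoded as a small lookup from workload type to collector flag. The sketch below is hypothetical: the Workload enum and method name are invented for illustration, and it maps the serving path to ZGC only, although Shenandoah would fit the same slot.

```java
// Hypothetical mapping from workload type to the collector flag used at launch.
class CollectorChoice {
    enum Workload { SERVING, BATCH }

    static String collectorFlag(Workload workload) {
        return switch (workload) {
            case SERVING -> "-XX:+UseZGC"; // latency SLO applies
            case BATCH -> "-XX:+UseG1GC";  // throughput decides completion time
        };
    }

    public static void main(String[] args) {
        for (Workload w : Workload.values()) {
            System.out.println(w + " -> " + collectorFlag(w));
        }
    }
}
```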
Collector Warm-up Behavior
Collectors also differ in warm-up characteristics. G1 reaches steady-state GC behavior after 2-3 young collections (a few seconds). ZGC needs to complete its first concurrent cycle to establish its pacing heuristics, which takes 10-30 seconds under load.
During ZGC’s warm-up period, it may over-collect (wasting CPU) or under-collect (growing the heap). The -XX:SoftMaxHeapSize flag helps ZGC establish pacing faster by giving it a target below the hard maximum.
For the content platform, this means ZGC-based services should receive traffic gradually during deployment (rolling restart with health check gates) rather than receiving full load immediately. G1 is more forgiving of cold-start traffic spikes.
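Gradual traffic can also be approximated with a linear slow-start weight at the load balancer. This is a different mechanism from health-check gating and is shown only as a sketch. The 30-second ramp matches the upper end of the warm-up range quoted above; the 10% starting fraction is an arbitrary assumption, and the names are illustrative.

```java
// Linear slow-start: share of full traffic to send to a freshly started instance.
class WarmupRamp {
    static double trafficFraction(double secondsSinceStart, double rampSeconds,
                                  double minFraction) {
        if (secondsSinceStart >= rampSeconds) return 1.0;
        if (secondsSinceStart <= 0) return minFraction;
        return minFraction + (1.0 - minFraction) * (secondsSinceStart / rampSeconds);
    }

    public static void main(String[] args) {
        for (int t = 0; t <= 30; t += 10) {
            System.out.printf("t=%2ds -> %.0f%% of full traffic%n",
                    t, 100 * trafficFraction(t, 30, 0.10));
        }
    }
}
```

The ramp length is the value to tune: it should cover at least the time ZGC needs to complete its first concurrent cycle under the load it is being given.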
The Verdict for the Content Platform
The numbers make the decision. The content platform’s serving tier has a p99 SLO of 20ms at the application level. Network and serialization consume 2-5ms. That leaves 15-18ms for application processing and GC pauses combined.
G1’s mixed collection pauses (18-120ms) blow through that budget. ZGC’s sub-millisecond pauses are invisible within the budget. The 4.6% throughput cost of ZGC is paid back by eliminating the retry storms and queue buildup that follow G1 mixed collection pauses.
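The budget arithmetic can be written out directly. The sketch below subtracts the network and serialization overhead from the SLO and checks whether a single pause fits in what remains. It treats a pause as consuming the budget on its own, leaving application processing time out of the picture, so it is optimistic for the collector. The names are illustrative.

```java
// Latency budget: SLO minus fixed overhead, compared against a single GC pause.
class LatencyBudget {
    static double remainingMillis(double sloMillis, double overheadMillis) {
        return sloMillis - overheadMillis;
    }

    static boolean pauseFits(double pauseMillis, double sloMillis, double overheadMillis) {
        return pauseMillis <= remainingMillis(sloMillis, overheadMillis);
    }

    public static void main(String[] args) {
        double slo = 20.0;
        System.out.printf("Budget: %.0f to %.0f ms%n",
                remainingMillis(slo, 5), remainingMillis(slo, 2));
        System.out.println("G1 18.4 ms pause fits (5 ms overhead):   " + pauseFits(18.4, slo, 5));
        System.out.println("G1 120 ms pause fits (2 ms overhead):    " + pauseFits(120, slo, 2));
        System.out.println("ZGC 0.412 ms pause fits (5 ms overhead): " + pauseFits(0.412, slo, 5));
    }
}
```

Even the shortest G1 mixed pause measured here exceeds the worst-case budget before the application has done any work, while the longest ZGC sample uses under 3% of it.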
ZGC for serving. G1 for batch. This is not a universal prescription. It is the right answer for this workload, at this heap size, with this SLO.