GC Collectors Compared: G1, ZGC, and Shenandoah Under Load

Choosing a garbage collector without benchmarking it against your workload is guessing. Vendor documentation tells you what the collector is designed to do. Benchmarks tell you what it actually does with your allocation patterns, your object lifetimes, and your heap size.

The content platform has a specific allocation profile: each request allocates 50-200KB of short-lived objects (deserialized articles, search results, JSON response buffers), plus a stable 2.5GB of long-lived data (article cache, connection pools, configuration). This profile stresses the young generation collector because allocation rate is high, but rarely stresses the old generation because most objects die young.

The Benchmark Workload

The benchmark simulates the content platform’s hot path: fetch an article from a cache, assemble a response with recommendations, serialize it, and discard the temporary objects. The allocation pattern mimics production.

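import java.nio.charset.StandardCharsets;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.SampleTime)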
@Warmup(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 15, time = 10, timeUnit = TimeUnit.SECONDS)
@Fork(value = 3, jvmArgs = {"-Xms8g", "-Xmx8g"})
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
public class GcCollectorComparison {

    private static final int ARTICLE_COUNT = 50_000;
    private static final int RECOMMENDATION_COUNT = 10;

    private Article[] articleCache;
    private Random random;

    @Setup
    public void setup() {
        articleCache = new Article[ARTICLE_COUNT];
        random = new Random(42);
        for (int i = 0; i < ARTICLE_COUNT; i++) {
            articleCache[i] = new Article(
                "Article Title " + i,
                generateContent(random.nextInt(2000, 8000)),
                List.of("java", "performance", "gc"),
                Instant.now(),
                random.nextInt(1000, 50000)
            );
        }
    }

    @Benchmark
    public byte[] serveArticleWithRecommendations() {
        // Simulate: fetch article
        Article article = articleCache[random.nextInt(ARTICLE_COUNT)];

        // Simulate: build recommendation list (short-lived allocations)
        List<ArticleSummary> recommendations = new ArrayList<>(RECOMMENDATION_COUNT);
        for (int i = 0; i < RECOMMENDATION_COUNT; i++) {
            Article rec = articleCache[random.nextInt(ARTICLE_COUNT)];
            recommendations.add(new ArticleSummary(
                rec.title(),
                rec.content().substring(0, Math.min(150, rec.content().length())),
                rec.viewCount()
            ));
        }

        // Simulate: build response (short-lived StringBuilder)
        StringBuilder response = new StringBuilder(article.content().length() + 2000);
        response.append("{\"article\":{\"title\":\"").append(article.title()).append("\",");
        response.append("\"content\":\"").append(article.content()).append("\",");
        response.append("\"views\":").append(article.viewCount()).append(",");
        response.append("\"recommendations\":[");
        for (int i = 0; i < recommendations.size(); i++) {
            if (i > 0) response.append(",");
            ArticleSummary rec = recommendations.get(i);
            response.append("{\"title\":\"").append(rec.title()).append("\",");
            response.append("\"preview\":\"").append(rec.preview()).append("\",");
            response.append("\"views\":").append(rec.viewCount()).append("}");
        }
        response.append("]}");

        // Simulate: serialize to bytes (short-lived byte array)
        return response.toString().getBytes(StandardCharsets.UTF_8);
    }

    private String generateContent(int length) {
        char[] chars = new char[length];
        for (int i = 0; i < length; i++) {
            chars[i] = (char) ('a' + (i % 26));
        }
        return new String(chars);
    }

    record Article(String title, String content, List<String> tags,
                   Instant publishedAt, int viewCount) {}
    record ArticleSummary(String title, String preview, int viewCount) {}
}

This benchmark is run three times, once per collector, with identical heap settings. The only change between runs is the collector flag.
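One way to keep the three runs identical apart from the collector is to generate their command lines from a single template. The sketch below is illustrative only: it assumes the benchmarks are packaged as target/benchmarks.jar (the usual name in the JMH Maven archetype) and that the collector flag is added to the forked JVM with JMH's -jvmArgsAppend command-line option, while the heap flags stay in the @Fork annotation. The class name CollectorRunPlan is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CollectorRunPlan {

    // Collector flags as listed in the three result sections.
    static final List<String> COLLECTOR_FLAGS =
            List.of("-XX:+UseG1GC", "-XX:+UseZGC", "-XX:+UseShenandoahGC");

    // Assumed jar path and benchmark name filter; adjust to your build.
    static final String JAR = "target/benchmarks.jar";
    static final String BENCHMARK = "GcCollectorComparison";

    /** Builds one command line per collector; only the collector flag differs. */
    static Map<String, String> commandLines() {
        Map<String, String> commands = new LinkedHashMap<>();
        for (String flag : COLLECTOR_FLAGS) {
            commands.put(flag, String.join(" ",
                    "java", "-jar", JAR, BENCHMARK, "-jvmArgsAppend", flag));
        }
        return commands;
    }

    public static void main(String[] args) {
        commandLines().values().forEach(System.out::println);
    }
}
```

Generating the commands from one list makes it harder for a stray flag to creep into a single run and skew the comparison.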

G1 Results

# JVM flags: -XX:+UseG1GC -Xms8g -Xmx8g

Benchmark                                          Mode    Cnt     Score    Error  Units
serveArticleWithRecommendations                    sample  45000   42.3 ±    1.2  us/op
serveArticleWithRecommendations:p0.50              sample          38.1           us/op
serveArticleWithRecommendations:p0.90              sample          51.4           us/op
serveArticleWithRecommendations:p0.99              sample         284.7           us/op
serveArticleWithRecommendations:p0.999             sample        8241.0           us/op
serveArticleWithRecommendations:p1.00              sample       18432.0           us/op

The p50 is 38 microseconds. The p99 jumps to 284 microseconds. The p999 reaches 8.2 milliseconds. That p999 spike is a young GC pause. The maximum of 18.4ms is a mixed collection.
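To see why a rare pause leaves the median untouched but dominates the far tail, it can help to compute percentiles over a small synthetic sample set. The sketch below uses a nearest-rank percentile and made-up data (1,000 samples, two of them stretched by a simulated pause); it is not JMH's own percentile implementation, and the numbers are not taken from the benchmark.

```java
import java.util.Arrays;

final class TailPercentiles {

    /** Nearest-rank percentile; perMille is 500 for p50, 990 for p99, 999 for p99.9. */
    static long percentile(long[] samplesMicros, int perMille) {
        long[] sorted = samplesMicros.clone();
        Arrays.sort(sorted);
        // Integer ceiling of n * perMille / 1000 avoids floating-point rounding.
        int rank = (int) (((long) sorted.length * perMille + 999) / 1000);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Synthetic data: 1,000 samples of 40us, two of them stretched to 8,000us. */
    static long[] syntheticSamples() {
        long[] samples = new long[1000];
        Arrays.fill(samples, 40L);
        samples[100] = 8_000L;
        samples[700] = 8_000L;
        return samples;
    }

    public static void main(String[] args) {
        long[] samples = syntheticSamples();
        System.out.println("p50   = " + percentile(samples, 500) + " us");
        System.out.println("p99   = " + percentile(samples, 990) + " us");
        System.out.println("p99.9 = " + percentile(samples, 999) + " us");
    }
}
```

With this synthetic input, p50 and p99 stay at 40 microseconds while p99.9 lands on one of the stretched samples, which is the same shape as the G1 table.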

The allocation rate during this benchmark is approximately 1.2 GB/s. G1’s young generation fills every 200-300ms, triggering a young collection. Every 10-15 young collections, G1 runs a mixed collection that also evacuates old-generation regions.
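The fill interval and the allocation rate together imply how much memory G1 hands out between young collections. A rough sketch of that arithmetic follows; it assumes a steady allocation rate and decimal units, and it ignores survivor spaces and G1's adaptive young-generation sizing, so treat the result as an order-of-magnitude estimate.

```java
final class YoungGenFillEstimate {

    /** Megabytes allocated between two young collections at a steady rate. */
    static double impliedEdenMegabytes(double allocationRateGbPerSec, double fillIntervalMs) {
        double megabytesPerSecond = allocationRateGbPerSec * 1000.0; // decimal GB to MB
        return megabytesPerSecond * (fillIntervalMs / 1000.0);
    }

    public static void main(String[] args) {
        // Figures quoted in the text: about 1.2 GB/s, one young collection every 200-300ms.
        System.out.printf("fill every 200ms: about %.0f MB%n", impliedEdenMegabytes(1.2, 200));
        System.out.printf("fill every 300ms: about %.0f MB%n", impliedEdenMegabytes(1.2, 300));
    }
}
```

At 1.2 GB/s, a 200-300ms fill interval corresponds to roughly 240-360 MB allocated per young collection cycle.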

ZGC Results

# JVM flags: -XX:+UseZGC -Xms8g -Xmx8g

Benchmark                                          Mode    Cnt     Score    Error  Units
serveArticleWithRecommendations                    sample  43200   44.8 ±    0.9  us/op
serveArticleWithRecommendations:p0.50              sample          40.2           us/op
serveArticleWithRecommendations:p0.90              sample          53.8           us/op
serveArticleWithRecommendations:p0.99              sample          72.4           us/op
serveArticleWithRecommendations:p0.999             sample          98.6           us/op
serveArticleWithRecommendations:p1.00              sample         412.0           us/op

The p50 is 2 microseconds higher (40.2 vs 38.1). That is the load barrier overhead. But the tail is transformed: p99 drops from 284 to 72 microseconds. p999 drops from 8,241 to 98 microseconds. The maximum drops from 18,432 to 412 microseconds.

ZGC’s pauses barely register in the benchmark results: they are sub-millisecond and infrequent, so they can surface only near the maximum (412 microseconds), if at all. The p99 and p999 numbers reflect application-level variance (cache misses, thread scheduling), not GC pauses.
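Attributing tail samples to GC or to application variance is easier with the collector's own counters alongside the latency numbers. One option is GC logging (-Xlog:gc); another is the standard java.lang.management API, sketched below. How each collector maps its cycles and pauses onto these beans varies by collector and JDK version, so the output is a cross-check rather than a precise pause measurement. The class name GcActivityReport is hypothetical.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

final class GcActivityReport {

    /** One line per GC bean exposed by the running JVM: name, collection count, accumulated time. */
    static List<String> describeCollectors() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Count and time are -1 when the collector does not report them.
            lines.add(String.format("%s: count=%d, time=%dms",
                    bean.getName(), bean.getCollectionCount(), bean.getCollectionTime()));
        }
        return lines;
    }

    public static void main(String[] args) {
        describeCollectors().forEach(System.out::println);
    }
}
```

Calling describeCollectors() before and after a measurement window and comparing the counters gives a quick indication of how much collector activity overlapped with the samples.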

Shenandoah Results

# JVM flags: -XX:+UseShenandoahGC -Xms8g -Xmx8g

Benchmark                                          Mode    Cnt     Score    Error  Units
serveArticleWithRecommendations                    sample  42800   45.6 ±    1.1  us/op
serveArticleWithRecommendations:p0.50              sample          41.0           us/op
serveArticleWithRecommendations:p0.90              sample          55.2           us/op
serveArticleWithRecommendations:p0.99              sample          78.1           us/op
serveArticleWithRecommendations:p0.999             sample         118.3           us/op
serveArticleWithRecommendations:p1.00              sample         524.0           us/op

Shenandoah’s numbers are close to ZGC’s: slightly higher p50 overhead (Shenandoah’s load-reference barriers and forwarding-pointer checks appear to cost marginally more than ZGC’s load barriers in this workload) and a similar tail latency improvement over G1.

Throughput Comparison

The sample-time benchmark above measures per-operation latency. A separate throughput benchmark measures aggregate operations per second:

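import java.nio.charset.StandardCharsets;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.Throughput)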
@Warmup(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 15, time = 10, timeUnit = TimeUnit.SECONDS)
@Fork(value = 3, jvmArgs = {"-Xms8g", "-Xmx8g"})
@Threads(8)
@OutputTimeUnit(TimeUnit.SECONDS)
@State(Scope.Benchmark)
public class GcThroughputComparison {

    // Simplified variant of the setup and benchmark method above, run with @Threads(8) for concurrency

    private Article[] articleCache;
    private ThreadLocalRandom random() { return ThreadLocalRandom.current(); }

    @Setup
    public void setup() {
        articleCache = new Article[50_000];
        Random r = new Random(42);
        for (int i = 0; i < 50_000; i++) {
            articleCache[i] = new Article(
                "Article Title " + i,
                "x".repeat(r.nextInt(2000, 8000)),
                List.of("java", "performance"),
                Instant.now(),
                r.nextInt(1000, 50000)
            );
        }
    }

    @Benchmark
    public byte[] serveArticle() {
        ThreadLocalRandom r = random();
        Article article = articleCache[r.nextInt(articleCache.length)];

        List<String> recTitles = new ArrayList<>(10);
        for (int i = 0; i < 10; i++) {
            recTitles.add(articleCache[r.nextInt(articleCache.length)].title());
        }

        StringBuilder sb = new StringBuilder(article.content().length() + 500);
        sb.append(article.title()).append(article.content());
        for (String t : recTitles) sb.append(t);
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    record Article(String title, String content, List<String> tags,
                   Instant publishedAt, int viewCount) {}
}

Results:

Collector     Throughput (ops/s)    Relative
G1            1,847,234 ± 23,145   100%  (baseline)
ZGC           1,762,108 ± 18,932   95.4%
Shenandoah    1,738,542 ± 21,087   94.1%

G1 delivers the highest throughput. ZGC costs 4.6%. Shenandoah costs 5.9%. These percentages are stable across runs.
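The relative column is each collector's score divided by the G1 baseline. A small sketch of that calculation follows, together with a simple check of whether two results' ± intervals overlap. The overlap check is an added sanity test under the assumption that the ± column is a symmetric error margin; it is not part of the original analysis.

```java
final class RelativeThroughput {

    /** Score as a percentage of the baseline score. */
    static double relativePercent(double opsPerSec, double baselineOpsPerSec) {
        return 100.0 * opsPerSec / baselineOpsPerSec;
    }

    /** True when the intervals [mean - err, mean + err] of two results overlap. */
    static boolean intervalsOverlap(double meanA, double errA, double meanB, double errB) {
        return Math.abs(meanA - meanB) <= errA + errB;
    }

    public static void main(String[] args) {
        double g1 = 1_847_234, zgc = 1_762_108, shenandoah = 1_738_542;
        System.out.printf("ZGC:        %.1f%% of G1%n", relativePercent(zgc, g1));
        System.out.printf("Shenandoah: %.1f%% of G1%n", relativePercent(shenandoah, g1));
        System.out.println("G1/ZGC intervals overlap:         "
                + intervalsOverlap(g1, 23_145, zgc, 18_932));
        System.out.println("ZGC/Shenandoah intervals overlap: "
                + intervalsOverlap(zgc, 18_932, shenandoah, 21_087));
    }
}
```

On the figures above, the G1 and ZGC intervals do not overlap, while the ZGC and Shenandoah intervals do, so the ordering between those two is less certain than their gap to G1.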

Allocation Rate Sensitivity

The throughput gap between G1 and ZGC widens as allocation rate increases. When each operation allocates more memory, ZGC’s load barriers fire more frequently on the additional object references.

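// Assumes this method is added to GcThroughputComparison (it reuses articleCache and random()).
// Requires: import java.util.HashMap; import java.util.Map;
@Benchmark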
@Fork(value = 2, jvmArgs = {"-Xms8g", "-Xmx8g"})
public byte[] highAllocationServe() {
    ThreadLocalRandom r = random();
    // Allocate a large intermediate map (high allocation rate)
    Map<String, Object> response = new HashMap<>(50);
    for (int i = 0; i < 50; i++) {
        Article a = articleCache[r.nextInt(articleCache.length)];
        response.put("article_" + i, Map.of(
            "title", a.title(),
            "preview", a.content().substring(0, 200),
            "tags", new ArrayList<>(a.tags()),
            "views", a.viewCount()
        ));
    }

    StringBuilder sb = new StringBuilder(10_000);
    for (var entry : response.entrySet()) {
        sb.append(entry.getKey()).append("=").append(entry.getValue());
    }
    return sb.toString().getBytes(StandardCharsets.UTF_8);
}

With the high-allocation variant (5x more allocation per operation):

Collector     Throughput (ops/s)    Relative
G1              412,890 ± 8,234    100%  (baseline)
ZGC             381,045 ± 7,112    92.3%
Shenandoah      374,218 ± 7,890    90.6%

The ZGC overhead grows from 4.6% to 7.7% with higher allocation rates. This is expected: more allocated objects mean more reference loads, and each reference load incurs a barrier check.

The latency improvement remains decisive:

Collector     p99 (us)    p999 (us)    max (us)
G1            1,842       24,576       41,088
ZGC             184          298          712
Shenandoah      201          342          856

G1’s p999 is 24.6 milliseconds. That is a mixed collection pause under high allocation pressure. ZGC’s p999 is 298 microseconds. The tail latency improvement is 82x.
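The improvement factor is the G1 value divided by the ZGC value at the same percentile, and it differs by percentile. A minimal sketch over the table's G1 and ZGC rows:

```java
final class TailLatencyRatio {

    /** How many times larger the baseline latency is than the candidate latency. */
    static double ratio(double baselineMicros, double candidateMicros) {
        return baselineMicros / candidateMicros;
    }

    public static void main(String[] args) {
        // G1 vs ZGC, values in microseconds from the high-allocation latency table.
        System.out.printf("p99:   %.1fx%n", ratio(1_842, 184));
        System.out.printf("p99.9: %.1fx%n", ratio(24_576, 298));
        System.out.printf("max:   %.1fx%n", ratio(41_088, 712));
    }
}
```

For these figures the ratio is roughly 10x at p99, 82x at p99.9, and 58x at the maximum, so the headline factor depends on which percentile the SLO is written against.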

Multi-threaded Scaling

GC pauses affect all application threads simultaneously. The more threads your application uses, the more requests are affected by each pause.

With 32 application threads at 200 req/s per thread (6,400 total req/s):

G1:   ~320 requests delayed per 50ms mixed GC pause
ZGC:  ~2 requests delayed per 0.3ms pause (effectively zero)

The impact of GC pauses scales linearly with request throughput. At 6,400 req/s, a single 50ms G1 mixed collection stalls the roughly 320 requests that arrive while it runs, each by up to 50ms. At the content platform’s peak traffic, that is the difference between meeting SLO and breaching it.
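The arithmetic behind these estimates is the arrival rate multiplied by the pause duration. The sketch below assumes requests arrive evenly over time and that every request arriving during a stop-the-world pause is delayed; real traffic is burstier, so this is a rough estimate rather than a prediction.

```java
final class PauseImpact {

    /** Requests that arrive, and are therefore delayed, during one stop-the-world pause. */
    static double requestsArrivingDuringPause(double requestsPerSecond, double pauseMillis) {
        return requestsPerSecond * pauseMillis / 1000.0;
    }

    public static void main(String[] args) {
        double totalRate = 32 * 200; // 32 threads at 200 req/s each
        System.out.printf("50ms pause:  %.1f requests%n", requestsArrivingDuringPause(totalRate, 50));
        System.out.printf("0.3ms pause: %.1f requests%n", requestsArrivingDuringPause(totalRate, 0.3));
    }
}
```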

When G1 Wins

G1 is not always the wrong choice. For batch processing workloads where latency does not matter, G1’s higher throughput is preferable:

// Batch indexing: process 100,000 articles for search indexing
// G1 completes in 45 seconds
// ZGC completes in 47 seconds
// The 2-second difference adds up when the job runs every 30 minutes

For the content platform’s indexing pipeline, which runs every 30 minutes to re-index new articles into the search engine, G1 finishes faster. No user is waiting for the result. The occasional 100ms GC pause during indexing is irrelevant.

The pattern: use ZGC or Shenandoah on the serving path where latency SLOs apply. Use G1 on the batch path where throughput determines completion time.
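In a deployment script or launcher, that pattern can be reduced to a single lookup from workload type to collector flag. The sketch below is a hypothetical helper, not part of the platform's tooling; it picks ZGC for serving (Shenandoah would be the alternative) and G1 for batch.

```java
final class CollectorChoice {

    enum Workload { SERVING, BATCH }

    /** Collector flag for a workload type, following the serving/batch split described above. */
    static String collectorFlagFor(Workload workload) {
        switch (workload) {
            case SERVING:
                return "-XX:+UseZGC";   // latency SLO applies
            case BATCH:
                return "-XX:+UseG1GC";  // throughput determines completion time
            default:
                throw new IllegalArgumentException("Unknown workload: " + workload);
        }
    }

    public static void main(String[] args) {
        for (Workload workload : Workload.values()) {
            System.out.println(workload + " -> " + collectorFlagFor(workload));
        }
    }
}
```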

Collector Warm-up Behavior

Collectors also differ in warm-up characteristics. G1 reaches steady-state GC behavior after 2-3 young collections (a few seconds). ZGC needs to complete its first concurrent cycle to establish its pacing heuristics, which takes 10-30 seconds under load.

During ZGC’s warm-up period, it may over-collect (wasting CPU) or under-collect (letting heap usage grow). Setting -XX:SoftMaxHeapSize below the hard maximum can help: ZGC tries to keep heap usage under that soft limit, so it starts collecting before the heap approaches the -Xmx ceiling.

For the content platform, this means ZGC-based services should receive traffic gradually during deployment (rolling restart with health check gates) rather than receiving full load immediately. G1 is more forgiving of cold-start traffic spikes.
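A gradual ramp can be as simple as a traffic weight that grows linearly over the warm-up window. The sketch below is one hypothetical way to compute such a weight for a load balancer or health gate; the 30-second window is taken from the upper end of the warm-up range quoted above, and the linear shape is an assumption rather than a recommendation.

```java
final class TrafficRamp {

    /** Fraction of full traffic (0.0 to 1.0) for an instance that started elapsedSeconds ago. */
    static double trafficShare(double elapsedSeconds, double rampSeconds) {
        if (rampSeconds <= 0) {
            return 1.0; // no ramp configured: send full traffic immediately
        }
        return Math.max(0.0, Math.min(1.0, elapsedSeconds / rampSeconds));
    }

    public static void main(String[] args) {
        double rampSeconds = 30.0; // assumed warm-up window
        for (double t : new double[] {0, 10, 20, 30, 60}) {
            System.out.printf("t=%2.0fs -> %.0f%% of full traffic%n",
                    t, 100 * trafficShare(t, rampSeconds));
        }
    }
}
```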

The Verdict for the Content Platform

The numbers make the decision. The content platform’s serving tier has a p99 SLO of 20ms at the application level. Network and serialization consume 2-5ms. That leaves 15-18ms for application processing and GC pauses combined.

G1’s mixed collection pauses (18-120ms) blow through that budget. ZGC’s sub-millisecond pauses are invisible within the budget. The 4.6% throughput cost of ZGC is paid back by eliminating the retry storms and queue buildup that follow G1 mixed collection pauses.
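The budget check itself is simple subtraction and comparison. A minimal sketch using the SLO figures above follows; the helper names are illustrative, and the check is optimistic because it compares the pause against the whole remaining budget, which must also cover application processing.

```java
final class LatencyBudget {

    /** Time left for application processing and GC pauses after fixed overheads. */
    static double remainingMs(double sloMs, double fixedOverheadMs) {
        return sloMs - fixedOverheadMs;
    }

    /** True when a pause alone is shorter than the remaining budget. */
    static boolean fits(double pauseMs, double remainingMs) {
        return pauseMs < remainingMs;
    }

    public static void main(String[] args) {
        double worstCaseBudget = remainingMs(20, 5); // 20ms SLO, 5ms network and serialization
        System.out.println("budget: " + worstCaseBudget + " ms");
        System.out.println("18ms pause fits:  " + fits(18, worstCaseBudget));
        System.out.println("120ms pause fits: " + fits(120, worstCaseBudget));
        System.out.println("1ms pause fits:   " + fits(1, worstCaseBudget));
    }
}
```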

ZGC for serving. G1 for batch. This is not a universal prescription. It is the right answer for this workload, at this heap size, with this SLO.