fast by design

Memory Model and Garbage Collection: Choosing a Collector, Tuning Heap Regions, and Eliminating GC Pauses

12 min read Chapter 10 of 90

Garbage collection is not a background concern. It is a latency event.

Every GC pause is a request that waited. Every mixed collection that runs long is a p99 violation. Every promotion failure that triggers a full GC is a user who saw a timeout. The choice of garbage collector is a performance architecture decision, not a default you leave untouched.

The content platform serves article pages at 200 requests per second. Each request allocates temporary objects: the deserialized article, the search result list, the recommendation payload, the serialized response. These objects are dead within a few milliseconds, as soon as the response has been written. The GC must reclaim them without pausing the threads that serve requests.

This chapter covers three production-grade collectors: G1, ZGC, and Shenandoah. Each makes a different trade-off between throughput and pause time. The right choice depends on your latency SLO, your heap size, and your allocation rate.

The Three Collectors

G1 (Garbage-First) is the default collector since JDK 9. It divides the heap into equal-sized regions and collects the regions with the most garbage first. It targets a configurable maximum pause time but meets that target with variable success. Under high allocation rates, young and mixed collections can pause the application for tens or hundreds of milliseconds, and if reclamation still cannot keep up with promotion, G1 falls back to a stop-the-world full GC.

ZGC performs nearly all of its work concurrently with application threads. Its pause times are sub-millisecond regardless of heap size. The cost is CPU overhead: ZGC uses load barriers on every object reference load, consuming 3-5% additional CPU cycles.

Shenandoah, like ZGC, is a concurrent collector with sub-millisecond pauses. It relocates objects concurrently using forwarding pointers (the Brooks-pointer design in its original form) together with its own barriers, achieving similar pause characteristics with slightly different throughput trade-offs.

GC Heap Region Layout and Pause Timeline

This diagram shows the G1 heap divided into regions tagged as Eden (new allocations), Survivor (objects that survived a young GC), Old (long-lived objects), Humongous (objects larger than half a region), and Free (unallocated). Below the heap layout, a pause timeline compares the three collectors over a 5-second window. G1 shows several short pauses (8-15ms) plus one 82ms mixed collection spike. ZGC and Shenandoah show only sub-millisecond pauses with no spikes. The latency impact panel at the bottom quantifies this: G1 delivers p99 of 18ms but p999 of 95ms due to mixed GC pauses, while ZGC achieves p99 of 6ms and p999 of 9ms.

Why GC Pauses Destroy p99

A GC pause is a stop-the-world event. All application threads are suspended. No requests are processed. The requests that arrive during the pause queue in the network stack or load balancer, and they all complete simultaneously when the pause ends. This creates a latency spike that affects not just the requests during the pause but the burst of queued requests immediately after.

Consider the arithmetic. The content platform serves 200 requests per second. A 50ms GC pause means about 10 requests arrive while the application is stopped. Those requests, which would have completed in 3-5ms, now take up to 50-55ms depending on when in the pause they arrived. A single pause can push as many as 10 requests past the SLO.
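
The arithmetic can be written out as a small sketch. The class and method names are illustrative, and the model assumes arrivals are spread evenly across the pause, which real traffic only approximates.

```java
final class PauseImpact {

    /** Requests that arrive while all application threads are stopped. */
    static long delayedRequests(double requestsPerSecond, double pauseMillis) {
        return Math.round(requestsPerSecond * pauseMillis / 1000.0);
    }

    /** Worst case: a request that arrives just as the pause begins. */
    static double worstCaseLatencyMillis(double serviceMillis, double pauseMillis) {
        return serviceMillis + pauseMillis;
    }

    public static void main(String[] args) {
        // 200 req/s and a 50ms pause, as in the text
        System.out.println(delayedRequests(200, 50));
        // 5ms of normal service time stacked on top of the full pause
        System.out.println(worstCaseLatencyMillis(5, 50));
    }
}
```

Requests that arrive late in the pause wait less, so the delayed requests spread between the normal service time and the worst case rather than all landing on it.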

If your SLO is p99 < 20ms, you can tolerate occasional 15ms pauses from G1’s young collections. You cannot tolerate a single 82ms mixed collection. If your SLO is p999 < 10ms, only ZGC or Shenandoah will work.

Choosing a Collector: The Decision Framework

The decision is not “which collector is best.” It is “which collector matches your constraints.”

Choose G1 when:

  • Heap size is 4-32GB
  • Throughput matters more than tail latency
  • p99 SLO is 20ms or higher
  • You need the most mature, best-documented collector

Choose ZGC when:

  • Heap size is 8GB to 16TB
  • p99 SLO is under 10ms
  • You can afford 3-5% CPU overhead
  • Application is latency-sensitive (real-time serving, trading)

Choose Shenandoah when:

  • Requirements similar to ZGC
  • Running on Red Hat / Fedora JDK builds (Shenandoah originated at Red Hat)
  • Heap size is moderate (4-64GB)

The content platform runs an 8GB heap serving read-heavy traffic. The recommendation engine and search service are latency-sensitive. ZGC is the right choice for the serving tier. G1 is acceptable for the batch indexing tier where throughput matters and occasional pauses are tolerable.
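
One possible encoding of the framework above, as a sketch rather than a rule. The thresholds come from the lists in this section; the order in which the rules are checked, the handling of SLOs between 10ms and 20ms, and every name in the code are assumptions made for illustration.

```java
enum Collector { G1, ZGC, SHENANDOAH }

final class CollectorChooser {

    /**
     * @param heapGb            configured heap size in GB
     * @param p99SloMillis      the p99 latency objective
     * @param throughputFirst   true when throughput matters more than tail latency
     * @param shenandoahBuild   true when the JDK build in use favors Shenandoah (assumption)
     */
    static Collector recommend(int heapGb, double p99SloMillis,
                               boolean throughputFirst, boolean shenandoahBuild) {
        // Throughput-oriented tiers and relaxed SLOs stay on G1.
        if (throughputFirst || p99SloMillis >= 20.0) {
            return Collector.G1;
        }
        // Below the 8GB lower bound this chapter gives for ZGC.
        if (heapGb < 8) {
            return Collector.SHENANDOAH;
        }
        // Above the 64GB upper bound this chapter gives for Shenandoah.
        if (heapGb > 64) {
            return Collector.ZGC;
        }
        return shenandoahBuild ? Collector.SHENANDOAH : Collector.ZGC;
    }

    public static void main(String[] args) {
        System.out.println(recommend(8, 10.0, false, false)); // serving tier
        System.out.println(recommend(8, 50.0, true, false));  // batch indexing tier
    }
}
```

Treat the output as a starting hypothesis to measure, not a substitute for a load test.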

Baseline: G1 with Default Settings

Before tuning, measure the default. This is the baseline against which every change is evaluated.

// JVM flags for G1 baseline
// -XX:+UseG1GC
// -Xms8g -Xmx8g
// -Xlog:gc*:file=gc.log:time,uptime,level,tags

The content platform under G1 defaults with an 8GB heap and 200 req/s load produces this GC profile:

Young GC count:     ~40/min
Young GC avg:       12ms
Young GC p99:       25ms
Mixed GC count:     ~3/min
Mixed GC avg:       45ms
Mixed GC max:       120ms
Full GC count:      0

Those mixed collections at 45ms average and 120ms max are the problem. Once old-generation occupancy crosses the Initiating Heap Occupancy Percent (IHOP) threshold, G1 starts a concurrent marking cycle. When marking finishes, the collections that follow are mixed: G1 evacuates old regions alongside the young regions.
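
For a quick sanity check alongside the GC log, the standard java.lang.management API exposes per-collector counts and accumulated times. The sketch below is a minimal example with an illustrative class name. Bean names differ by collector and JDK version, and for concurrent collectors the reported time may cover whole cycles rather than pauses, so the GC log remains the source for pause percentiles.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

final class GcBaseline {

    /** One line per collector bean: name, collection count, accumulated time. */
    static List<String> snapshot() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();  // -1 when the collector does not report it
            long millis = gc.getCollectionTime();  // -1 when the collector does not report it
            double avg = count > 0 ? (double) millis / count : 0.0;
            lines.add(String.format("%s: count=%d totalMs=%d avgMs=%.2f",
                    gc.getName(), count, millis, avg));
        }
        return lines;
    }

    public static void main(String[] args) {
        snapshot().forEach(System.out::println);
    }
}
```

Taking one snapshot before and one after a load test, then subtracting, gives collection counts and average times for that window.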

G1 Tuning: Controlling Mixed Collection Pauses

G1 accepts a -XX:MaxGCPauseMillis target (default: 200ms). This is a soft target. G1 adjusts the number of regions it collects per pause to stay within the target, but it cannot guarantee compliance.

// SLOW: G1 with default MaxGCPauseMillis=200
// -XX:+UseG1GC -Xms8g -Xmx8g
// Mixed GC pauses: avg 45ms, max 120ms

// FAST: G1 with aggressive pause target
// -XX:+UseG1GC -Xms8g -Xmx8g
// -XX:MaxGCPauseMillis=15
// -XX:G1HeapRegionSize=4m
// -XX:InitiatingHeapOccupancyPercent=35
// Mixed GC pauses: avg 18ms, max 40ms, but more frequent

Lowering MaxGCPauseMillis to 15ms forces G1 to collect fewer old regions per mixed collection. This reduces individual pause times but increases the number of mixed collections. The total time spent in GC increases slightly, reducing throughput by 1-2%.
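
Because the pause target is soft, it helps to see how often it is missed. The sketch below uses the JDK's GC notification API (com.sun.management, available on HotSpot-based JDKs) to report collections that ran longer than a threshold. The class name and the message format are illustrative. For G1 young and mixed collections the reported duration corresponds to the pause; for concurrent collectors it may describe a whole cycle instead.

```java
import com.sun.management.GarbageCollectionNotificationInfo;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.function.Consumer;
import javax.management.NotificationEmitter;
import javax.management.NotificationListener;
import javax.management.openmbean.CompositeData;

final class PauseTargetWatcher {

    /** Registers a listener on every collector bean; returns how many were registered. */
    static int install(long targetMillis, Consumer<String> sink) {
        NotificationListener listener = (notification, handback) -> {
            if (!GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION
                    .equals(notification.getType())) {
                return;
            }
            GarbageCollectionNotificationInfo info =
                    GarbageCollectionNotificationInfo.from((CompositeData) notification.getUserData());
            long durationMillis = info.getGcInfo().getDuration();
            if (durationMillis > targetMillis) {
                sink.accept(info.getGcName() + " (" + info.getGcAction() + ") took "
                        + durationMillis + "ms, target " + targetMillis + "ms");
            }
        };
        int installed = 0;
        for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans()) {
            if (bean instanceof NotificationEmitter) {
                ((NotificationEmitter) bean).addNotificationListener(listener, null, null);
                installed++;
            }
        }
        return installed;
    }

    public static void main(String[] args) throws InterruptedException {
        install(15, System.out::println);
        System.gc();          // provoke at least one collection for the demo
        Thread.sleep(200);    // notifications are delivered asynchronously
    }
}
```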

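InitiatingHeapOccupancyPercent=35 (default: 45) triggers concurrent marking earlier, giving G1 more time to complete marking before the old generation fills. This reduces the risk of the emergency full GC that occurs when marking cannot keep up with promotion. G1 adapts this threshold at runtime by default (G1UseAdaptiveIHOP), so the flag sets the starting value rather than a fixed trigger.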

G1HeapRegionSize=4m with an 8GB heap creates 2048 regions. Larger regions mean fewer regions to track (less overhead) but coarser-grained collection. For the content platform, 4MB regions work well because article objects are typically under 1MB, well below the 2MB humongous threshold that 4MB regions imply.

ZGC: Sub-Millisecond Pauses at a Cost

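ZGC’s pause times are independent of heap size. A 128GB heap pauses for the same duration as an 8GB heap. The pauses are short synchronization points between concurrent phases. Marking, relocation, and (since JDK 16) thread-stack scanning all run concurrently, so pause duration does not grow with the heap or the live set.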

// ZGC configuration for the content platform
// -XX:+UseZGC
// -Xms8g -Xmx8g
// -XX:SoftMaxHeapSize=6g
// -Xlog:gc*:file=gc.log:time,uptime,level,tags

SoftMaxHeapSize=6g tells ZGC to try to keep heap usage below 6GB, triggering collection earlier. It is a soft limit: ZGC can still use the full 8GB set by -Xmx rather than stall allocating threads, which leaves 2GB of headroom for allocation spikes.

ZGC’s throughput cost comes from load barriers. Every time the application loads an object reference from the heap, ZGC inserts a small check to see if the reference needs remapping. On a read-heavy workload like the content platform, this check executes millions of times per second. The aggregate CPU cost is 3-5%.

// JMH benchmark comparing throughput under G1 vs ZGC.
// Run it once per collector, for example by passing
// -jvmArgsAppend "-XX:+UseG1GC" or -jvmArgsAppend "-XX:+UseZGC" to the JMH runner.
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.Throughput)
@Warmup(iterations = 5, time = 3, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 10, time = 5, timeUnit = TimeUnit.SECONDS)
@Fork(value = 2, jvmArgs = {"-Xms8g", "-Xmx8g"})
@State(Scope.Benchmark)
public class GcThroughputBenchmark {

    private List<Article> articles;
    private Random random;

    @Setup
    public void setup() {
        articles = new ArrayList<>(100_000);
        random = new Random(42); // fixed seed so both collectors see the same data
        for (int i = 0; i < 100_000; i++) {
            articles.add(new Article(
                "Title " + i,
                // Random.nextInt(origin, bound) requires Java 17 or later
                "x".repeat(random.nextInt(500, 5000)),
                List.of("java", "performance"),
                Instant.now()
            ));
        }
    }

    @Benchmark
    public Article serveArticle() {
        int index = random.nextInt(articles.size());
        Article article = articles.get(index);
        // Simulate response serialization: allocate a short-lived copy.
        // Returning it keeps the JIT from eliminating the allocation.
        return new Article(
            article.title().toUpperCase(),
            article.content().substring(0, Math.min(200, article.content().length())),
            article.tags(),
            article.publishedAt()
        );
    }

    record Article(String title, String content, List<String> tags, Instant publishedAt) {}
}

Results with -XX:+UseG1GC:

Benchmark                            Mode  Cnt      Score     Error  Units
GcThroughputBenchmark.serveArticle  thrpt   20  1,847,234 ± 23,145  ops/s

Results with -XX:+UseZGC:

Benchmark                            Mode  Cnt      Score     Error  Units
GcThroughputBenchmark.serveArticle  thrpt   20  1,762,108 ± 18,932  ops/s

ZGC’s throughput is 4.6% lower. That is the load barrier tax. For the content platform, this trade-off is acceptable because the latency improvement is dramatic: p99 drops from 18ms to 6ms, and p999 drops from 95ms to 9ms.
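
The 4.6% figure follows directly from the two scores, and it is worth confirming that the gap is larger than the reported error. A minimal sketch, with illustrative names and the scores and error margins copied from the results above:

```java
import java.util.Locale;

final class ThroughputDelta {

    /** How much lower the candidate score is, as a percentage of the baseline. */
    static double percentLower(double baselineOps, double candidateOps) {
        return (baselineOps - candidateOps) / baselineOps * 100.0;
    }

    /** True when the two score-plus-or-minus-error intervals overlap. */
    static boolean intervalsOverlap(double scoreA, double errorA, double scoreB, double errorB) {
        return scoreA - errorA <= scoreB + errorB && scoreB - errorB <= scoreA + errorA;
    }

    public static void main(String[] args) {
        double g1 = 1_847_234, g1Error = 23_145;
        double zgc = 1_762_108, zgcError = 18_932;
        System.out.println(String.format(Locale.ROOT, "%.1f%% lower", percentLower(g1, zgc)));
        System.out.println("intervals overlap: " + intervalsOverlap(g1, g1Error, zgc, zgcError));
    }
}
```

The intervals do not overlap, so the difference is not explained by run-to-run noise in this benchmark.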

Shenandoah: The Third Option

Shenandoah achieves concurrent collection through forwarding pointers. In the original Brooks-pointer design, every object carried an extra word that pointed to the object’s current location. Since JDK 13, the forwarding pointer is stored in the object’s existing mark word and resolved by a load reference barrier. During concurrent evacuation, the forwarding pointer is updated to point to the new location, and subsequent accesses follow the pointer.

// Shenandoah configuration
// -XX:+UseShenandoahGC
// -Xms8g -Xmx8g
// -XX:ShenandoahGCHeuristics=adaptive
// -Xlog:gc*:file=gc.log:time,uptime,level,tags

Shenandoah’s pause profile is similar to ZGC: sub-millisecond pauses for root scanning. The throughput overhead is comparable (3-6%), but it comes from Shenandoah’s barriers resolving forwarding pointers rather than from ZGC’s colored-pointer load barriers.

The practical difference between ZGC and Shenandoah is diminishing. Both deliver sub-millisecond pauses. Both handle multi-gigabyte heaps. The choice often comes down to JDK distribution: ZGC is available in all OpenJDK builds, while Shenandoah is available in most but was historically absent from Oracle JDK builds.

Humongous Objects: The Silent Killer

G1 treats objects larger than half a region as “humongous.” These objects bypass Eden and are allocated directly in contiguous old-generation regions. Each humongous allocation needs contiguous free regions, wastes the unused tail of its last region, and can start a concurrent marking cycle early, which in turn brings mixed collections forward.

In the content platform, article content bodies can be large. A 3MB article body in a heap with 2MB regions requires two contiguous humongous regions. If article bodies regularly exceed half the region size, G1 spends excessive time managing humongous allocations.

// SLOW: 2MB region size, article bodies trigger humongous allocation
// -XX:+UseG1GC -XX:G1HeapRegionSize=2m
// Articles > 1MB are humongous, triggers special handling

// FAST: 4MB region size, most articles fit in regular Eden allocation
// -XX:+UseG1GC -XX:G1HeapRegionSize=4m
// Only articles > 2MB are humongous, rare in practice

The fix is straightforward: increase the region size so that your common large objects are below the humongous threshold. Monitor with -Xlog:gc+humongous=debug to see humongous allocation frequency.
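
The region arithmetic from this section can be checked with a few lines of code. This is a sketch with illustrative names. It follows the "larger than half a region" rule as stated here and ignores object headers and alignment, so treat results near the threshold as approximate.

```java
final class G1Regions {
    static final long MB = 1024L * 1024L;

    static long regionCount(long heapBytes, long regionBytes) {
        return heapBytes / regionBytes;
    }

    /** Humongous when the object is larger than half a region. */
    static boolean isHumongous(long objectBytes, long regionBytes) {
        return objectBytes > regionBytes / 2;
    }

    /** Contiguous regions a humongous object occupies (rounded up). */
    static long regionsNeeded(long objectBytes, long regionBytes) {
        return (objectBytes + regionBytes - 1) / regionBytes;
    }

    /** Space left unused at the end of the last region. */
    static long wastedBytes(long objectBytes, long regionBytes) {
        return regionsNeeded(objectBytes, regionBytes) * regionBytes - objectBytes;
    }

    public static void main(String[] args) {
        System.out.println(regionCount(8 * 1024 * MB, 4 * MB));   // 8GB heap, 4MB regions
        System.out.println(isHumongous(3 * MB, 2 * MB));          // 3MB body, 2MB regions
        System.out.println(regionsNeeded(3 * MB, 2 * MB));
        System.out.println(wastedBytes(3 * MB, 2 * MB) / MB + "MB wasted");
        System.out.println(isHumongous(3 * MB / 2, 4 * MB));      // 1.5MB body, 4MB regions
    }
}
```

Note that a 3MB body is still humongous with 4MB regions, because it exceeds the 2MB threshold. Raising the region size only helps objects that end up below half of the new size.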

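ZGC and Shenandoah also place large objects in dedicated pages or regions, but because both relocate and reclaim concurrently, large allocations are far less likely to surface as long pauses. That is another reason to prefer them for workloads with variable object sizes.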

Heap Sizing: Not Too Small, Not Too Large

Heap sizing is not about giving the JVM as much memory as possible. An oversized heap delays GC but makes each collection more expensive when it finally occurs. An undersized heap forces frequent collections that consume CPU.

The rule of thumb: set -Xms equal to -Xmx (no heap resizing at runtime) and size the heap so that live data occupies 30-50% of the total heap after a full GC.

For the content platform:

  • Live data on the heap (article cache, connection pools): ~2.5GB
  • Target heap occupancy: 30-40%
  • Calculated heap: 2.5GB / 0.35 = ~7GB
  • Set: -Xms8g -Xmx8g
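
The sizing steps above reduce to one division and a round-up. A minimal sketch, with illustrative names; rounding up to the next whole gigabyte is an assumption that happens to match the 8GB chosen here.

```java
final class HeapSizing {

    /** Heap size at which live data occupies the target fraction of the heap. */
    static double rawHeapGb(double liveDataGb, double targetOccupancy) {
        return liveDataGb / targetOccupancy;
    }

    /** Raw size rounded up to a whole number of gigabytes. */
    static int roundedHeapGb(double liveDataGb, double targetOccupancy) {
        return (int) Math.ceil(rawHeapGb(liveDataGb, targetOccupancy));
    }

    /** Occupancy that results from the heap size actually configured. */
    static double occupancy(double liveDataGb, double heapGb) {
        return liveDataGb / heapGb;
    }

    public static void main(String[] args) {
        System.out.println(rawHeapGb(2.5, 0.35));      // a little over 7GB
        System.out.println(roundedHeapGb(2.5, 0.35));  // value to use for -Xms and -Xmx
        System.out.println(occupancy(2.5, 8));         // resulting live-data fraction
    }
}
```

With an 8GB heap the live set sits at about 31% occupancy, inside the 30-40% target.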

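Setting -Xms equal to -Xmx removes heap resizing, so the collector never spends pause time growing or shrinking the heap and the memory footprint is predictable from the start. The JVM reserves the full heap at startup and does not hand it back to the OS. Add -XX:+AlwaysPreTouch if you also want the pages faulted in before the first request arrives.

// BAD: Heap can resize, adding expansion work and extra collections during warm-up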
// -Xms2g -Xmx8g

// GOOD: Fixed heap, no resizing pauses
// -Xms8g -Xmx8g

Concurrent GC Threads

Both ZGC and G1 use concurrent threads for marking and other phases. The default is derived from the processor count: roughly a quarter of the available processors, with the exact formula depending on the collector and JDK version. On a 16-core machine, that is 3 to 4 concurrent GC threads.

If your application is CPU-bound, 4 concurrent GC threads competing with application threads reduce throughput. If your application is I/O-bound (like the content platform, which spends most time waiting on database and cache responses), concurrent GC threads are nearly free.

// For CPU-bound workloads: reduce concurrent threads
// -XX:ConcGCThreads=2

// For I/O-bound workloads: increase concurrent threads for faster marking
// -XX:ConcGCThreads=6

The content platform is I/O-bound. We increase concurrent threads to 6, which accelerates concurrent marking and reduces the window where the old generation might fill before marking completes.
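
Before overriding the default, it is worth checking what the JVM actually chose. On HotSpot-based JDKs the effective flag values can be read at runtime through HotSpotDiagnosticMXBean. The sketch below is a minimal example with an illustrative class name; the value can be 0 when the active collector does not use concurrent threads.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

final class GcThreadFlags {

    /** Returns the effective value of a HotSpot flag as a string. */
    static String flag(String name) {
        HotSpotDiagnosticMXBean hotspot =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        return hotspot.getVMOption(name).getValue();
    }

    public static void main(String[] args) {
        System.out.println("availableProcessors = " + Runtime.getRuntime().availableProcessors());
        System.out.println("ParallelGCThreads   = " + flag("ParallelGCThreads"));
        System.out.println("ConcGCThreads       = " + flag("ConcGCThreads"));
    }
}
```

Run it with the same flags as the service, inside the same container limits, since the defaults depend on the processor count the JVM detects.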

Measuring the Impact

The only way to validate GC tuning is to measure under production-representative load. GC behavior changes dramatically between idle and loaded states because allocation rate drives collection frequency.

Use a load test that runs for at least 10 minutes at steady-state load. Capture GC logs with -Xlog:gc*:file=gc.log:time,uptime,level,tags. Analyze with GCViewer or GCeasy to extract:

  1. Pause time distribution: p50, p99, p999, max
  2. Pause frequency: pauses per minute
  3. Allocation rate: MB/s allocated
  4. Promotion rate: MB/s promoted from young to old
  5. Heap after GC: live data size trend

If heap-after-GC trends upward over time, suspect a memory leak or an unbounded cache. If promotion rate is high, short-lived objects are surviving young GC and being promoted unnecessarily. If pause frequency is increasing at constant load, the collector is reclaiming less per cycle and is falling behind the allocation rate.
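
When a dedicated tool is not at hand, the pause distribution can be pulled from the log with a short parser. This is a sketch under assumptions: it expects unified-logging pause summary lines that contain the word "Pause" and end with a duration in milliseconds, and the sample lines in main are made up for illustration. The percentile uses the nearest-rank method.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PausePercentiles {
    // Matches lines such as "... Pause Young (Normal) (...) 512M->128M(8192M) 12.345ms"
    private static final Pattern PAUSE_LINE =
            Pattern.compile("Pause .*\\s(\\d+(?:\\.\\d+)?)ms\\s*$");

    /** Extracts pause durations in milliseconds, sorted ascending. */
    static List<Double> pauseMillis(List<String> logLines) {
        List<Double> pauses = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = PAUSE_LINE.matcher(line);
            if (m.find()) {
                pauses.add(Double.parseDouble(m.group(1)));
            }
        }
        Collections.sort(pauses);
        return pauses;
    }

    /** Nearest-rank percentile over an ascending list; p is in (0, 100]. */
    static double percentile(List<Double> sortedPauses, double p) {
        if (sortedPauses.isEmpty()) {
            throw new IllegalArgumentException("no pause samples");
        }
        int rank = (int) Math.ceil(p / 100.0 * sortedPauses.size());
        return sortedPauses.get(Math.max(rank, 1) - 1);
    }

    public static void main(String[] args) {
        List<String> sample = List.of(   // illustrative lines, not real measurements
            "[1.234s][info][gc,start] GC(0) Pause Young (Normal) (G1 Evacuation Pause)",
            "[1.246s][info][gc] GC(0) Pause Young (Normal) (G1 Evacuation Pause) 512M->128M(8192M) 12.345ms",
            "[9.876s][info][gc] GC(1) Pause Young (Mixed) (G1 Evacuation Pause) 900M->300M(8192M) 82.000ms");
        List<Double> pauses = pauseMillis(sample);
        System.out.println("p50 = " + percentile(pauses, 50) + "ms");
        System.out.println("p99 = " + percentile(pauses, 99) + "ms");
    }
}
```

In practice, read the lines with Files.readAllLines on the gc.log path and compare the percentiles against the SLO before and after each flag change.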

The Trade-Off Matrix

Metric               G1 (tuned)   ZGC                  Shenandoah
p99 pause            15-25ms      <1ms                 <1ms
p999 pause           40-120ms     <2ms                 <2ms
Throughput overhead  Baseline     -3 to -5%            -3 to -6%
CPU overhead         Low          Moderate (barriers)  Moderate (forwarding)
Heap range           4-32GB       8GB-16TB             4-64GB
Maturity             Highest      High                 High

G1 wins on throughput. ZGC and Shenandoah win on tail latency. There is no collector that wins on both. The choice is a trade-off, and the right trade-off depends on your SLO.

For the content platform: ZGC on the serving tier (latency-sensitive), G1 on the indexing and batch processing tier (throughput-sensitive). Two different JVM configurations for two different workload profiles.