Heap Tuning and GC Log Analysis
Heap Tuning and GC Log Analysis
GC logs are the most underused diagnostic tool in Java. They record every collection event, every pause duration, every heap transition. They cost nothing to enable in production. They tell you exactly what the garbage collector did, why it did it, and how long it took.
Not enabling GC logs is like running a database without slow query logging. You have no baseline, no history, no evidence when something goes wrong.
Enabling GC Logs
The Unified Logging framework (introduced in JDK 9) consolidates all GC logging under a single flag syntax.
// Minimum viable GC logging for production
-Xlog:gc*:file=/var/log/app/gc.log:time,uptime,level,tags:filecount=10,filesize=50M
// Detailed GC logging for tuning sessions
-Xlog:gc*,gc+heap=debug,gc+phases=debug,gc+humongous=debug:file=/var/log/app/gc-detail.log:time,uptime,level,tags:filecount=10,filesize=100M
// ZGC-specific detailed logging
-Xlog:gc*,gc+phases=debug,gc+alloc=trace:file=/var/log/app/gc-zgc.log:time,uptime,level,tags:filecount=10,filesize=100M
The filecount=10,filesize=50M rotation keeps total disk usage under 500MB. Ten rotated files of 50MB each give you enough history to diagnose issues from the last several hours.
The cost of GC logging is negligible. The JVM writes log entries asynchronously. Even with debug-level logging enabled, the overhead is under 0.1% of CPU time. There is no reason to disable GC logging in production.
Reading G1 GC Logs
A G1 young collection log entry looks like this:
[2.145s][info][gc] GC(14) Pause Young (Normal) (G1 Evacuation Pause) 4812M->1842M(8192M) 12.456ms
Breaking this down:
[2.145s]: Wall clock time since JVM startGC(14): The 14th GC eventPause Young (Normal): A normal young-generation collection (not initial-mark)4812M->1842M(8192M): Heap usage went from 4812MB to 1842MB, total heap is 8192MB12.456ms: The stop-the-world pause duration
A mixed collection entry:
[8.934s][info][gc] GC(42) Pause Young (Mixed) (G1 Evacuation Pause) 5634M->2156M(8192M) 45.231ms
Pause Young (Mixed) means G1 collected both young and old regions. The pause is longer (45ms vs 12ms) because it evacuated old-generation regions in addition to Eden and Survivor regions.
The detail log shows phases within the pause:
[8.934s][debug][gc,phases] GC(42) Pre Evacuate Collection Set: 0.2ms
[8.934s][debug][gc,phases] GC(42) Merge Heap Roots: 1.4ms
[8.934s][debug][gc,phases] GC(42) Evacuate Collection Set: 38.1ms
[8.934s][debug][gc,phases] GC(42) Post Evacuate Collection Set: 3.2ms
[8.934s][debug][gc,phases] GC(42) Other: 2.3ms
Evacuate Collection Set: 38.1ms is where the time goes. This phase copies live objects from collected regions to free regions. The duration depends on the amount of live data in the collected regions. More live data means more copying.
Identifying Promotion Failure
A promotion failure occurs when G1 cannot find enough free regions to evacuate live objects during a young collection. This forces a full GC, which pauses the application for seconds.
The log entry for a promotion failure:
[45.678s][info][gc] GC(189) Pause Young (Normal) (G1 Evacuation Pause) 7845M->7845M(8192M) 0.892ms
[45.678s][info][gc] GC(189) To-space exhausted
[45.679s][info][gc] GC(190) Pause Full (G1 Evacuation Pause) 7845M->2134M(8192M) 4523.456ms
To-space exhausted is the critical indicator. G1 tried to evacuate objects but found no free regions. The subsequent Full GC paused for 4.5 seconds. Four and a half seconds. Every request during that window timed out.
The symptom: Application freezes for multiple seconds. Monitoring shows a latency spike of 4-10 seconds.
The cause: Heap is too full. Live data plus allocation rate exceeds the heap’s ability to provide free regions for evacuation. Common triggers:
-Xmxtoo small for the live data set- Memory leak causing live data to grow over time
- Allocation rate spike (e.g., batch import running alongside serving)
- IHOP set too high, delaying concurrent marking
The fix: One or more of:
- Increase heap size
- Lower IHOP to trigger concurrent marking earlier
- Fix the memory leak
- Separate batch and serving workloads
// SLOW: IHOP too high, concurrent marking starts too late
// -XX:+UseG1GC -Xms8g -Xmx8g
// -XX:InitiatingHeapOccupancyPercent=45 (default)
// Concurrent marking cannot finish before heap fills -> promotion failure
// FAST: Lower IHOP, concurrent marking starts earlier
// -XX:+UseG1GC -Xms8g -Xmx8g
// -XX:InitiatingHeapOccupancyPercent=30
// Concurrent marking finishes with headroom -> no promotion failure
The content platform experienced promotion failures during the recommendation re-ranking batch job that runs every 30 minutes. The batch job temporarily doubles the allocation rate. Lowering IHOP from 45% to 30% gave G1 enough lead time to complete concurrent marking before the heap filled.
Heap Sizing Methodology
Heap sizing requires measuring live data size. Live data is the memory occupied by objects that survive a full GC: caches, connection pools, static data structures, long-lived session state.
Step 1: Run the application under production load for at least 30 minutes.
Step 2: Trigger a full GC (or wait for one to occur naturally):
jcmd <pid> GC.run
Step 3: Read the post-full-GC heap size from the log:
[1800.234s][info][gc] GC(504) Pause Full (System.gc()) 6234M->2487M(8192M) 1234.567ms
Post-full-GC heap: 2487MB. This is the live data size.
Step 4: Size the heap at 2.5x to 3x the live data:
- Conservative: 2487MB * 3 = 7.5GB ->
-Xms8g -Xmx8g - Aggressive: 2487MB * 2.5 = 6.2GB ->
-Xms7g -Xmx7g
The multiplier provides headroom for young-generation collections. G1 needs free regions to copy survivors into. If the heap is too close to live data size, G1 cannot maintain free regions, and every collection risks promotion failure.
For ZGC, the multiplier can be lower (2x) because ZGC handles heap pressure more gracefully, but it still needs headroom for concurrent collection.
Region Size Tuning
G1 divides the heap into regions of equal size. The JVM automatically chooses a region size based on heap size, targeting approximately 2048 regions. For an 8GB heap, the default region size is 4MB.
You can override this with -XX:G1HeapRegionSize:
// Auto-selected: 8GB heap -> 4MB regions -> 2048 regions
// -XX:+UseG1GC -Xms8g -Xmx8g
// Override: 8GB heap -> 8MB regions -> 1024 regions
// -XX:+UseG1GC -Xms8g -Xmx8g -XX:G1HeapRegionSize=8m
// Override: 8GB heap -> 2MB regions -> 4096 regions
// -XX:+UseG1GC -Xms8g -Xmx8g -XX:G1HeapRegionSize=2m
Larger regions (8MB, 16MB):
- Fewer regions to track, lower remembered set overhead
- Humongous threshold is higher (objects < 4MB for 8MB regions), fewer humongous allocations
- Each collected region contains more data, so individual pause times may be longer
Smaller regions (1MB, 2MB):
- More regions, finer-grained collection selection
- Humongous threshold is lower (objects < 0.5MB for 1MB regions), more humongous allocations
- Each collected region contains less data, so individual pauses are shorter
For the content platform, article content bodies average 3KB but occasionally reach 2MB (long-form articles with embedded data). With 4MB regions (default for 8GB heap), the humongous threshold is 2MB. Only the rare outlier article triggers humongous allocation. This is acceptable.
If your workload regularly allocates objects in the 2-4MB range, increase region size to 8MB to push the humongous threshold to 4MB.
MaxGCPauseMillis Tuning
-XX:MaxGCPauseMillis (default: 200ms) is G1’s pause time target. G1 adjusts three parameters to meet this target:
- Young generation size: Smaller Eden means less data to evacuate per young GC
- Number of old regions in mixed GC: Fewer old regions means less work per mixed GC
- Collection frequency: More frequent, shorter collections rather than fewer, longer ones
// Default: 200ms target -> large young gen, infrequent collection, long pauses
// -XX:+UseG1GC -Xms8g -Xmx8g -XX:MaxGCPauseMillis=200
// Young GC: ~30ms avg, Mixed GC: ~80ms avg, Full GC: rare
// Aggressive: 10ms target -> small young gen, frequent collection, short pauses
// -XX:+UseG1GC -Xms8g -Xmx8g -XX:MaxGCPauseMillis=10
// Young GC: ~8ms avg, Mixed GC: ~18ms avg, Full GC: rare
// BUT: GC frequency increases from 40/min to 120/min
// Total GC CPU time increases from 3% to 7%
The trade-off: Aggressive pause targets reduce individual pause times but increase total GC overhead. The application spends more total time in GC, even though each pause is shorter.
For the content platform serving tier with a p99 latency SLO of 20ms, -XX:MaxGCPauseMillis=15 provides a balance. G1 keeps most pauses under 15ms, with occasional spikes to 25-30ms during mixed collections. The total GC CPU overhead is 4-5%.
Concurrent GC Thread Tuning
The -XX:ConcGCThreads flag controls how many threads perform concurrent marking. The default is 25% of available processors (rounded up).
// 16-core machine, default: 4 concurrent GC threads
// -XX:+UseG1GC -Xms8g -Xmx8g
// I/O-bound workload: increase concurrent threads
// -XX:ConcGCThreads=6
// Faster concurrent marking, less risk of IHOP-triggered full GC
// CPU-bound workload: reduce concurrent threads
// -XX:ConcGCThreads=2
// Less CPU competition, but slower concurrent marking
The content platform is I/O-bound (90% of request time is waiting for database and cache responses). Increasing concurrent GC threads to 6 (on a 16-core machine) accelerates concurrent marking with minimal impact on application throughput, because the application threads are mostly idle waiting for I/O.
A CPU-bound workload (e.g., the recommendation ranking algorithm) would suffer from additional concurrent GC threads competing for CPU time. Reduce concurrent threads for compute-intensive services.
ZGC Log Analysis
ZGC logs look different from G1. A ZGC cycle:
[3.456s][info][gc] GC(8) Garbage Collection (Proactive)
[3.456s][info][gc] GC(8) Pause Mark Start 0.021ms
[3.612s][info][gc] GC(8) Concurrent Mark 156.234ms
[3.613s][info][gc] GC(8) Pause Mark End 0.018ms
[3.614s][info][gc] GC(8) Concurrent Mark Free 0.001ms
[3.615s][info][gc] GC(8) Concurrent Process Non-Strong References 1.234ms
[3.616s][info][gc] GC(8) Concurrent Reset Relocation Set 0.012ms
[3.617s][info][gc] GC(8) Concurrent Select Relocation Set 2.345ms
[3.618s][info][gc] GC(8) Pause Relocate Start 0.019ms
[3.789s][info][gc] GC(8) Concurrent Relocate 171.234ms
[3.789s][info][gc] GC(8) Load: 3.21/2.87/2.45
[3.789s][info][gc] GC(8) MMU: 2ms/98.1%, 5ms/99.3%, 10ms/99.8%, 20ms/99.9%, 50ms/100.0%, 100ms/100.0%
[3.789s][info][gc] GC(8) Mark: 4096M(50%) / 4096M(50%)
[3.789s][info][gc] GC(8) Relocate: 234M(2%) / 234M(2%)
[3.789s][info][gc] GC(8) NMethods: 12345 registered, 234 unregistered
[3.789s][info][gc] GC(8) Metaspace: 89M used, 92M committed, 1024M reserved
[3.789s][info][gc] GC(8) Soft: 2345 encountered, 0 discovered, 0 enqueued
[3.789s][info][gc] GC(8) Weak: 8901 encountered, 2345 discovered, 2345 enqueued
[3.789s][info][gc] GC(8) Heap: 6234M(76%)->3456M(42%)
Key observations:
The pauses are Pause Mark Start (0.021ms) and Pause Mark End (0.018ms) and Pause Relocate Start (0.019ms). All under 0.1ms. These are the only stop-the-world events.
The concurrent phases (Concurrent Mark: 156ms, Concurrent Relocate: 171ms) run while the application is serving requests. They consume CPU but do not pause threads.
The MMU line (Minimum Mutator Utilization) shows what percentage of time application threads were running. 2ms/98.1% means that in any 2ms window, application threads ran for at least 98.1% of the time. This is ZGC’s guarantee: near-zero pause impact.
Heap: 6234M(76%)->3456M(42%) shows the heap went from 76% to 42% occupancy. This is healthy. If the post-GC occupancy stays above 70%, the heap is too small or there is a leak.
Diagnosing Memory Leaks from GC Logs
A memory leak manifests in GC logs as steadily increasing post-GC heap occupancy:
[ 100s] GC(10) Heap: 5234M->2100M(8192M) # Post-GC: 2100M
[ 200s] GC(20) Heap: 5890M->2340M(8192M) # Post-GC: 2340M (up 240M)
[ 300s] GC(30) Heap: 6012M->2580M(8192M) # Post-GC: 2580M (up 480M)
[ 400s] GC(40) Heap: 6234M->2820M(8192M) # Post-GC: 2820M (up 720M)
[ 500s] GC(50) Heap: 6512M->3060M(8192M) # Post-GC: 3060M (up 960M)
Post-GC heap is growing by ~240MB every 100 seconds. At this rate, the heap will be full in about 35 minutes, triggering promotion failures and full GC pauses.
The diagnostic workflow:
- Confirm the trend in GC logs (at least 5 data points)
- Take a heap dump:
jcmd <pid> GC.heap_dump /tmp/heap.hprof - Analyze with Eclipse MAT or VisualVM
- Look for the dominator tree: which objects retain the most memory
- In the content platform, common culprits: unbounded caches, leaked database connections, growing collections in
@Singletonbeans
// SLOW: Unbounded article cache causes memory leak
@Singleton
public class ArticleCache {
private final Map<String, Article> cache = new ConcurrentHashMap<>();
public void put(String id, Article article) {
cache.put(id, article); // Never evicted, grows forever
}
}
// FAST: Bounded cache with LRU eviction
@Singleton
public class ArticleCache {
private final Map<String, Article> cache = new LinkedHashMap<>(1000, 0.75f, true) {
@Override
protected boolean removeEldestEntry(Map.Entry<String, Article> eldest) {
return size() > 10_000;
}
};
public synchronized void put(String id, Article article) {
cache.put(id, article); // Evicts oldest entry when over 10,000
}
}
A Tuning Checklist
Before changing any flag, establish a baseline under production-representative load:
- Enable GC logging with rotation: you need the data before you can tune
- Measure live data size: trigger a full GC and note post-GC heap
- Set heap size:
-Xms=-Xmx= 2.5x to 3x live data - Set region size: Default is fine unless you have frequent humongous allocations
- Set pause target:
-XX:MaxGCPauseMillisbased on your latency SLO minus application processing time - Set IHOP: Lower than default (30-35%) if you see promotion failures
- Set concurrent threads: Increase for I/O-bound workloads, decrease for CPU-bound
- Measure again: Compare p50/p99/p999 pause times and total GC overhead
- Repeat: Change one flag at a time. Measure after each change.
The worst mistake in GC tuning is changing multiple flags simultaneously. You cannot attribute the result to any individual change, and you may have made one improvement and one regression that cancel out.
Every flag change must be justified by a measurement. If you cannot explain why you set a flag to a specific value, remove the flag and use the default.