Allocation Profiling and Heap Dumps

Every Java object you allocate becomes work for the garbage collector sooner or later. Most objects die young and are reclaimed cheaply in minor GC cycles, but objects that survive long enough to be promoted to the Old Generation eventually force far more expensive major collections. Understanding where objects are allocated and where they are retained is the foundation of GC-aware programming.

Finding Allocation Hotspots

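async-profiler’s allocation profiling does not instrument every allocation. It samples two kinds of events: an object allocated in a freshly created TLAB (Thread-Local Allocation Buffer), and an object allocated on the slow path outside a TLAB. For each sample it records the stack trace and the class and size of the allocated object, which keeps overhead low enough for production use.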

# Profile all allocations for 30 seconds
./asprof -e alloc -d 30 -f /tmp/alloc.html <pid>

# Sample roughly once per 512 KB allocated
# (--alloc sets the sampling interval in bytes; it is not a size filter)
./asprof -e alloc --alloc 512k -d 30 -f /tmp/alloc-512k.html <pid>

# Weight the flame graph by total allocated bytes instead of sample count
./asprof -e alloc -d 30 --total -f /tmp/alloc-total.html <pid>

The --total flag makes the flame graph width proportional to total allocated bytes rather than to the number of samples. This distinction matters: allocating 10 million 32-byte objects (about 320 MB) stresses the GC differently than allocating 1,000 32KB objects (about 32 MB). The former fills Eden quickly and triggers frequent minor GCs. The latter triggers fewer collections, but every object that survives is a large one that is expensive to copy.
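The arithmetic behind that comparison can be checked with a small sketch. The class and method names are illustrative only, and 32KB is taken to mean 32,768 bytes.

```java
// Hypothetical helper: compares two workloads by object count and by total bytes.
class AllocationWeighting {
    static long totalBytes(long objectCount, long bytesPerObject) {
        return objectCount * bytesPerObject;
    }

    public static void main(String[] args) {
        long small = totalBytes(10_000_000L, 32);      // many tiny objects
        long large = totalBytes(1_000L, 32 * 1024);    // few large objects
        System.out.printf("small objects: %,d bytes%n", small);
        System.out.printf("large objects: %,d bytes%n", large);
        // By count the first workload is 10,000x bigger; by bytes only about 10x.
        System.out.printf("count ratio: %d, byte ratio: %.1f%n",
                10_000_000L / 1_000L, (double) small / large);
    }
}
```

A count-weighted graph would make the small-object workload look overwhelmingly dominant, while a byte-weighted graph shows the two within an order of magnitude of each other.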

Content Platform Allocation Hotspots

Profile the article serving endpoint under load:

locust -f locust_baseline.py --host http://localhost:8080 \
       --users 100 --spawn-rate 10 --tags read &

./asprof -e alloc --total -d 60 -f /tmp/article-alloc.html <pid>

Typical allocation hotspots in the content platform:

Jackson serialization buffers. ObjectMapper.writeValueAsBytes() allocates intermediate byte[] buffers. For a 10KB article response, Jackson allocates approximately 40KB of buffers (initial buffer, growth copies, and the final output). At 5,000 requests per second, that is 200MB/s of allocation from serialization alone.

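// SLOW: Default ObjectMapper creates new buffers per serialization
@RestController
public class ArticleController {
    private final ObjectMapper mapper = new ObjectMapper();
    private final ArticleService articleService;  // injected; type name assumed

    public ArticleController(ArticleService articleService) {
        this.articleService = articleService;
    }

    @GetMapping("/api/articles/{id}")
    public byte[] getArticle(@PathVariable long id) throws JsonProcessingException {
        Article article = articleService.findById(id);
        return mapper.writeValueAsBytes(article);  // 40KB allocation per call
    }
}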

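// FAST: Reuse serialization buffers with Jackson's ByteArrayBuilder
// Jackson imports shown; Spring imports omitted for brevity
import com.fasterxml.jackson.core.JsonEncoding;
import com.fasterxml.jackson.core.JsonGenerator;
import com.fasterxml.jackson.core.util.ByteArrayBuilder;
import com.fasterxml.jackson.databind.ObjectMapper;

@RestController
public class ArticleController {
    private final ObjectMapper mapper;
    private final ArticleService articleService;  // injected; type name assumed
    // One buffer per request thread, sized to fit most articles without growth
    private final ThreadLocal<ByteArrayBuilder> bufferPool =
        ThreadLocal.withInitial(() -> new ByteArrayBuilder(32_768));

    public ArticleController(ObjectMapper mapper, ArticleService articleService) {
        this.mapper = mapper;
        this.articleService = articleService;
    }

    @GetMapping("/api/articles/{id}")
    public byte[] getArticle(@PathVariable long id) throws Exception {
        Article article = articleService.findById(id);
        ByteArrayBuilder buf = bufferPool.get();
        buf.reset();  // discard the previous response, keep the allocated space
        try (JsonGenerator gen = mapper.getFactory()
                .createGenerator(buf, JsonEncoding.UTF8)) {
            mapper.writeValue(gen, article);
        }
        return buf.toByteArray();  // the only remaining copy: the response body
    }
}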

The ThreadLocal<ByteArrayBuilder> reuses the buffer across requests on the same thread. The initial size of 32KB avoids growth copies for most articles. Allocation per serialization drops from 40KB to approximately 10KB (the final output copy).
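The reuse pattern itself does not depend on Jackson. The following is a minimal JDK-only sketch that substitutes java.io.ByteArrayOutputStream for ByteArrayBuilder; the class name and the toy serialize method are hypothetical, and the method skips JSON escaping and assumes ASCII input.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// JDK-only stand-in for the Jackson example: one reusable buffer per thread.
class ReusableBufferSketch {
    private static final ThreadLocal<ByteArrayOutputStream> BUFFER =
        ThreadLocal.withInitial(() -> new ByteArrayOutputStream(32_768));

    // Toy serializer (hypothetical): no JSON escaping, ASCII input assumed.
    static byte[] serialize(long id, String title) {
        ByteArrayOutputStream buf = BUFFER.get();
        buf.reset();  // discards old content but keeps the allocated array
        writeAscii(buf, "{\"id\":");
        writeAscii(buf, Long.toString(id));
        writeAscii(buf, ",\"title\":\"");
        writeAscii(buf, title);
        writeAscii(buf, "\"}");
        return buf.toByteArray();  // the one remaining copy: the response body
    }

    private static void writeAscii(ByteArrayOutputStream out, String s) {
        for (int i = 0; i < s.length(); i++) {
            out.write(s.charAt(i));  // writes the low 8 bits of each char
        }
    }

    static String serializeToString(long id, String title) {
        return new String(serialize(id, title), StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        System.out.println(serializeToString(7, "hello"));
        System.out.println(serializeToString(8, "hi"));
    }
}
```

One caveat of this sketch: a buffer that once grew for an unusually large payload stays that large for the life of its thread, so the pattern trades steady-state allocation for a higher resident footprint per thread.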

String concatenation in logging. A common hidden allocation source:

// SLOW: String concatenation even when debug logging is disabled
log.debug("Processing article " + article.id() + " with "
         + article.tags().size() + " tags");

// FAST: Parameterized logging avoids allocation when level is disabled
log.debug("Processing article {} with {} tags",
          article.id(), article.tags().size());

When debug logging is disabled (as it should be in production), the first version still builds the concatenated string and then discards it. The second version defers formatting until after the level check, so no message string is built; the arguments themselves are still evaluated, and a primitive argument such as a long id is still boxed. At 10,000 requests per second with 5 log statements per request, the first version allocates at least 50,000 unnecessary strings per second.
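The difference between eager and deferred message construction can be observed directly. The sketch below uses java.util.logging and its Supplier overload as a stand-in, since the text does not name the logging facade behind log; the class name and the counter are hypothetical.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Level;
import java.util.logging.Logger;

// Counts how many message strings are actually built while FINE is disabled.
class LazyLoggingSketch {
    static final Logger LOG = Logger.getLogger("LazyLoggingSketch");
    static final AtomicInteger messagesBuilt = new AtomicInteger();

    static String describe(long articleId, int tagCount) {
        messagesBuilt.incrementAndGet();
        return "Processing article " + articleId + " with " + tagCount + " tags";
    }

    static int run() {
        LOG.setLevel(Level.INFO);          // FINE (debug-level) output is off
        messagesBuilt.set(0);
        LOG.fine(describe(42L, 3));        // eager: string built, then discarded
        LOG.fine(() -> describe(42L, 3));  // deferred: supplier is never invoked
        return messagesBuilt.get();
    }

    public static void main(String[] args) {
        System.out.println("messages built while disabled: " + run());
    }
}
```

Only the eager call builds a message. Parameterized logging in other facades achieves the same effect by postponing formatting until after the level check.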

Autoboxing in metrics. Counter libraries that accept Object instead of long:

// SLOW: Autoboxing creates Integer/Long objects on every call
public void recordLatency(Map<String, Object> metrics, String endpoint, long latencyMs) {
    metrics.put(endpoint + ".count", (Long) metrics.getOrDefault(endpoint + ".count", 0L) + 1);
    metrics.put(endpoint + ".latency", latencyMs);  // autoboxed
}

// FAST: Use primitive-specialized collections or LongAdder
public class EndpointMetrics {
    private final LongAdder count = new LongAdder();
    private final LongAdder totalLatency = new LongAdder();

    public void record(long latencyMs) {
        count.increment();
        totalLatency.add(latencyMs);
    }
}
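A usage sketch may help here: it keeps one EndpointMetrics per endpoint in a ConcurrentHashMap and reads the values back. The registry class and its method names are assumptions added for illustration, not part of the original design.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical registry: one allocation per endpoint, none per recorded request.
class MetricsRegistrySketch {
    static final class EndpointMetrics {
        private final LongAdder count = new LongAdder();
        private final LongAdder totalLatency = new LongAdder();

        void record(long latencyMs) {
            count.increment();
            totalLatency.add(latencyMs);   // primitive long, no boxing
        }
    }

    private final ConcurrentHashMap<String, EndpointMetrics> byEndpoint =
        new ConcurrentHashMap<>();

    void record(String endpoint, long latencyMs) {
        byEndpoint.computeIfAbsent(endpoint, k -> new EndpointMetrics())
                  .record(latencyMs);
    }

    long countFor(String endpoint) {
        EndpointMetrics m = byEndpoint.get(endpoint);
        return m == null ? 0L : m.count.sum();
    }

    double meanLatencyMsFor(String endpoint) {
        EndpointMetrics m = byEndpoint.get(endpoint);
        long n = m == null ? 0L : m.count.sum();
        return n == 0 ? 0.0 : (double) m.totalLatency.sum() / n;
    }

    public static void main(String[] args) {
        MetricsRegistrySketch registry = new MetricsRegistrySketch();
        registry.record("/api/articles", 10);
        registry.record("/api/articles", 20);
        registry.record("/api/articles", 30);
        System.out.println(registry.countFor("/api/articles") + " calls, mean "
                + registry.meanLatencyMsFor("/api/articles") + " ms");
    }
}
```

The endpoint string is used as the key directly, so the hot path also avoids the key concatenation that the slow version performs on every call.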

Heap Dump Analysis

Allocation profiling tells you where objects are created. Heap dumps tell you which objects are alive at a point in time and what keeps them alive. These are complementary: high allocation rate causes GC pressure; high retention causes memory growth.

Capturing Heap Dumps

Three methods, each with different trade-offs:

# Method 1: jcmd (recommended for production)
# Pauses the JVM for the duration of the dump
jcmd <pid> GC.heap_dump /tmp/heap.hprof

# Method 2: jmap (older, equivalent functionality)
jmap -dump:format=b,file=/tmp/heap.hprof <pid>

# Method 3: Automatic on OOM (set once at startup)
java -XX:+HeapDumpOnOutOfMemoryError \
     -XX:HeapDumpPath=/tmp/ \
     -jar content-platform.jar

Method 3 is non-negotiable. Every production JVM should have -XX:+HeapDumpOnOutOfMemoryError set. When an OOM occurs without a heap dump, you are left guessing. With a heap dump, you have the exact memory state at the moment of failure.
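Besides the three command-line methods, a dump can also be triggered from inside the application, for example from an admin-only endpoint. The following is a minimal sketch assuming a HotSpot-based JVM, where com.sun.management.HotSpotDiagnosticMXBean is available; the class name, helper names, and temp-file location are illustrative choices.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

// Assumes a HotSpot JVM; other JVMs may not expose this MXBean.
class HeapDumpSketch {
    // Writes a heap dump of the current JVM and returns the file size in bytes.
    static long dump(Path target, boolean liveObjectsOnly) {
        try {
            HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            Files.deleteIfExists(target);  // dumpHeap does not overwrite existing files
            bean.dumpHeap(target.toString(), liveObjectsOnly);
            return Files.size(target);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Convenience wrapper: dumps live objects to a fresh temporary directory.
    static long dumpToTempFile() {
        try {
            Path dir = Files.createTempDirectory("heapdump");
            return dump(dir.resolve("heap.hprof"), true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("heap dump size in bytes: " + dumpToTempFile());
    }
}
```

As with jcmd, the application pauses while the dump is written, and the file contains everything on the heap, so any such trigger should be access-controlled.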

Eclipse MAT Analysis Workflow

Open the .hprof file in Eclipse MAT. The tool presents several starting points:

Leak Suspects Report. MAT’s automated analysis identifies objects or collections that retain disproportionately large memory. Start here. The report shows suspect objects with their retained size and the reference chain from GC root to the suspect.

Histogram. Lists every class with its instance count and shallow heap; retained size is calculated on demand. Use this to find which classes account for the most instances or the most memory. If java.lang.String has 4 million instances and 600MB of retained heap, strings are your problem.

Dominator Tree. Shows the “dominates” relationship: object A dominates object B if every path from a GC root to B passes through A. Freeing A would free B. Sort by retained size. The top entries are the objects whose removal would reclaim the most memory.
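The relationship between dominators and retained size can be illustrated on a toy object graph. This sketch uses a simple iterative set-based dominator computation, which is not necessarily how MAT implements it; the graph, shallow sizes, and all names are made up for illustration, and every node is assumed reachable from the root.

```java
import java.util.BitSet;

// Toy dominator computation: node 0 plays the role of the GC root.
class DominatorSketch {
    // Returns dom[v], the set of nodes that dominate v (v itself included).
    static BitSet[] dominators(int n, int[][] edges) {
        BitSet[] dom = new BitSet[n];
        for (int v = 0; v < n; v++) {
            dom[v] = new BitSet(n);
            if (v == 0) dom[v].set(0); else dom[v].set(0, n);
        }
        boolean changed = true;
        while (changed) {
            changed = false;
            for (int v = 1; v < n; v++) {
                BitSet next = new BitSet(n);
                next.set(0, n);
                for (int[] e : edges) {
                    if (e[1] == v) next.and(dom[e[0]]);  // intersect over predecessors
                }
                next.set(v);
                if (!next.equals(dom[v])) {
                    dom[v] = next;
                    changed = true;
                }
            }
        }
        return dom;
    }

    // Retained size of a: shallow sizes of all nodes that a dominates.
    static long retainedSize(int a, BitSet[] dom, long[] shallow) {
        long total = 0;
        for (int v = 0; v < dom.length; v++) {
            if (dom[v].get(a)) total += shallow[v];
        }
        return total;
    }

    // Graph: root->A, root->C, A->B, A->D, C->D (A=1, B=2, C=3, D=4).
    static long demoRetainedSize(int node) {
        int[][] edges = {{0, 1}, {0, 3}, {1, 2}, {1, 4}, {3, 4}};
        long[] shallow = {0, 16, 100, 16, 50};
        return retainedSize(node, dominators(shallow.length, edges), shallow);
    }

    public static void main(String[] args) {
        System.out.println("retained(A) = " + demoRetainedSize(1));
        System.out.println("retained(C) = " + demoRetainedSize(3));
    }
}
```

In this graph A retains itself and B, but not D: D is also reachable through C, so freeing A alone would not free it. This is why a shared object often shows up under neither of its holders in the dominator tree.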

OOM Investigation: The Content Platform

The content platform’s analytics aggregation service crashes with OutOfMemoryError: Java heap space during peak traffic. A heap dump captured automatically by -XX:+HeapDumpOnOutOfMemoryError is available.

Step 1: Open the Leak Suspects Report.

MAT reports:

Problem Suspect 1:
  Thread "analytics-aggregator-1" retains 1.8 GB
  Accumulated in java.util.HashMap @ 0x7f2a3c000000
  Retained by com.contentplatform.analytics.UsageAggregator.eventBuffer

Problem Suspect 2:
  34,271 instances of com.contentplatform.analytics.ViewEvent
  retained in the HashMap above, totaling 1.2 GB

Step 2: Examine the dominator tree.

Navigate to the UsageAggregator object. The dominator tree shows:

UsageAggregator
  └── eventBuffer: HashMap (1.8 GB retained)
        ├── table: Node[] (524,288 entries)
        │     ├── Node → ViewEvent (34,271 entries)
        │     │     ├── articleId: Long
        │     │     ├── userId: String (avg 200 bytes)
        │     │     ├── metadata: HashMap (avg 2 KB)
        │     │     └── rawPayload: byte[] (avg 34 KB) ← largest field
        │     ...

Each ViewEvent retains a rawPayload byte array averaging 34KB. With 34,271 events, that is approximately 1.1 GB of raw payloads alone, most of the 1.2 GB that MAT attributes to ViewEvent instances.

Step 3: Find the root cause in code.

// SLOW: Accumulating raw events in memory without bounds
public class UsageAggregator {
    // This buffer grows without limit during peak traffic
    private final Map<Long, List<ViewEvent>> eventBuffer = new HashMap<>();

    public void bufferEvent(ViewEvent event) {
        eventBuffer.computeIfAbsent(event.articleId(), k -> new ArrayList<>())
                   .add(event);
    }

    @Scheduled(fixedRate = 3600_000)  // Flush every hour
    public void flush() {
        // Process and persist aggregated data
        aggregateAndPersist(eventBuffer);
        eventBuffer.clear();
    }
}

The eventBuffer accumulates all view events for an entire hour. During peak traffic (10,000 views/second), that is 36 million events per hour. Each event retains its raw payload, so a full hour of peak traffic would need roughly 1.2 TB (36 million × 34KB). No heap can hold that; the service runs out of memory long before the hourly flush.

// FAST: Aggregate incrementally, discard raw payloads
public class UsageAggregator {
    private final ConcurrentHashMap<Long, LongAdder> viewCounts =
        new ConcurrentHashMap<>();
    private final BlockingQueue<ViewEvent> recentEvents =
        new ArrayBlockingQueue<>(10_000);

    public void recordView(ViewEvent event) {
        // Aggregate the count immediately (O(1) memory per article)
        viewCounts.computeIfAbsent(event.articleId(), k -> new LongAdder())
                  .increment();

        // Keep only recent events for detailed analytics.
        // Strip the payload once; if the queue is full, drop the oldest
        // entry and retry (bounded memory).
        ViewEvent slim = event.withoutRawPayload();
        while (!recentEvents.offer(slim)) {
            recentEvents.poll();
        }
    }

    @Scheduled(fixedRate = 60_000)  // Flush every minute, not every hour
    public void flush() {
        // Remove counters one key at a time rather than copy-then-clear, so
        // articles first seen during the flush are not wiped out.
        // A view that races with the removal of its own counter can still
        // be missed; that small loss is accepted here.
        Map<Long, Long> snapshot = new HashMap<>();
        for (Long articleId : viewCounts.keySet()) {
            LongAdder adder = viewCounts.remove(articleId);
            if (adder != null) {
                snapshot.put(articleId, adder.sum());
            }
        }
        persistAggregatedCounts(snapshot);  // receives plain per-article counts
    }
}

Three changes fix the OOM:

Incremental aggregation. View counts are aggregated in one LongAdder per article (a few dozen bytes plus a map entry) instead of buffered as full events (34KB per event). Memory usage is proportional to distinct articles, not event count.
Raw payload stripping. event.withoutRawPayload() creates a lightweight copy without the 34KB byte array. Only the metadata needed for analytics is retained.
Bounded queue. The ArrayBlockingQueue with capacity 10,000 provides a hard memory ceiling. If events arrive faster than they can be processed, the oldest are dropped. Losing a few events is acceptable; OOM is not.
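The bounded, drop-oldest behavior can be exercised in isolation. This is a JDK-only sketch that stores article ids in place of ViewEvent objects; the class name and the dropped counter are additions for illustration and are not part of the service code.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Demonstrates that memory stays bounded no matter how many views arrive.
class BoundedAggregatorSketch {
    private final ConcurrentHashMap<Long, LongAdder> viewCounts =
        new ConcurrentHashMap<>();
    private final BlockingQueue<Long> recentArticleIds;
    private final LongAdder dropped = new LongAdder();

    BoundedAggregatorSketch(int capacity) {
        this.recentArticleIds = new ArrayBlockingQueue<>(capacity);
    }

    void recordView(long articleId) {
        viewCounts.computeIfAbsent(articleId, k -> new LongAdder()).increment();
        // Drop-oldest policy: evict from the head until the new entry fits.
        while (!recentArticleIds.offer(articleId)) {
            if (recentArticleIds.poll() != null) {
                dropped.increment();
            }
        }
    }

    long count(long articleId) {
        LongAdder adder = viewCounts.get(articleId);
        return adder == null ? 0L : adder.sum();
    }

    int buffered() { return recentArticleIds.size(); }

    long dropped() { return dropped.sum(); }

    public static void main(String[] args) {
        BoundedAggregatorSketch agg = new BoundedAggregatorSketch(100);
        for (int i = 0; i < 1_000; i++) {
            agg.recordView(i % 10);
        }
        System.out.println("buffered=" + agg.buffered()
                + " dropped=" + agg.dropped() + " count(3)=" + agg.count(3));
    }
}
```

Counting drops is worth considering in a real service as well: a dropped-events metric shows when the queue capacity is too small for the traffic, which the original design would otherwise hide.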

Measuring the Fix

After deploying the fix, verify with allocation profiling:

# Before fix: measure allocation rate
./asprof -e alloc --total -d 60 -f /tmp/before-alloc.html <pid>
# Result: 2.4 GB/s allocation rate in UsageAggregator path

# After fix: measure allocation rate
./asprof -e alloc --total -d 60 -f /tmp/after-alloc.html <pid>
# Result: 12 MB/s allocation rate in UsageAggregator path (200x reduction)

Monitor Old Generation usage after deployment:

# Before fix: Old Gen grows 50MB/minute during peak traffic
# After fix: Old Gen stable at 180MB

# Verify with jstat
jstat -gcutil <pid> 5000
# S0     S1     E      O      M      CCS    YGC    YGCT   FGC    FGCT
# 0.00   42.31  67.23  22.14  97.32  95.18  1847   4.231  0      0.000

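Zero Full GC events (FGC = 0). Old Generation at 22%. A single sample is only a snapshot; when the O column stays flat across successive samples under peak traffic, the memory leak is fixed.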
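For automated checks, the jstat row can be parsed and tested against a threshold. The sketch below assumes every column is numeric (some collectors print "-" in certain columns, which this parser does not handle), and the 50% Old Gen limit is an arbitrary illustrative value.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Parses one jstat -gcutil sample by pairing header names with row values.
class JstatGcutilSketch {
    static Map<String, Double> parse(String header, String row) {
        String[] names = header.trim().split("\\s+");
        String[] values = row.trim().split("\\s+");
        if (names.length != values.length) {
            throw new IllegalArgumentException("header and row column counts differ");
        }
        Map<String, Double> sample = new LinkedHashMap<>();
        for (int i = 0; i < names.length; i++) {
            sample.put(names[i], Double.parseDouble(values[i]));
        }
        return sample;
    }

    // Healthy here means: no Full GCs so far and Old Gen below the given limit.
    static boolean looksHealthy(Map<String, Double> sample, double maxOldGenPercent) {
        return sample.get("FGC") == 0.0 && sample.get("O") <= maxOldGenPercent;
    }

    // Applies the check to the sample row shown in the text.
    static boolean demo() {
        String header = "S0     S1     E      O      M      CCS    YGC    YGCT   FGC    FGCT";
        String row = "0.00   42.31  67.23  22.14  97.32  95.18  1847   4.231  0      0.000";
        return looksHealthy(parse(header, row), 50.0);
    }

    public static void main(String[] args) {
        System.out.println("sample looks healthy: " + demo());
    }
}
```

Running the same check on each sample in a series, rather than on one row, is what distinguishes a stable Old Generation from one that happens to be low at the moment of measurement.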

Allocation Profiling Checklist

For every performance investigation where GC metrics show elevated pause times or growing heap usage:

Profile allocations: ./asprof -e alloc --total -d 60 -f /tmp/alloc.html <pid>
Identify the widest frames in the allocation flame graph
For each hotspot, ask: can this object be reused, pre-allocated, or eliminated?
For retained objects (objects that survive to Old Gen), take a heap dump and find the retention chain
Verify the fix reduces allocation rate with a second allocation profile
Verify the fix stabilizes Old Gen with jstat -gcutil

Objects that die young are cheap. Objects that survive are expensive. Objects that never die are bugs.