CPU Flame Graphs: Finding the Hot Frame

Flame graphs were invented by Brendan Gregg to visualize profiled stack traces. They encode three pieces of information: which methods ran, how long they ran, and who called them. Every other profiler output (flat profiles, call trees, hot method lists) loses at least one of these dimensions. Flame graphs preserve all three.

The Grammar of Flame Graphs

A flame graph is a visualization of sampled stack traces aggregated by frequency. The profiler samples the call stack at regular intervals (typically 10ms). After profiling, it merges identical stack traces and renders them as stacked rectangles.

The x-axis is not time. This is the most common misreading. The horizontal axis represents the proportion of samples where a frame appeared. Frames are sorted alphabetically within each level, not chronologically. Two adjacent frames did not necessarily execute sequentially. They are adjacent because their names are alphabetically close.

Width is CPU proportion. A frame that spans 30% of the total width appeared in 30% of all samples. If you profiled for 30 seconds at 100 samples per second (3,000 total samples), that frame appeared in 900 samples. Width is the only quantitative dimension.

Height is stack depth. Taller stacks indicate deeper call chains. A frame sitting on top of a tall stack consumed CPU time deep in a call hierarchy. Height does not indicate slowness. A shallow frame occupying 40% width is a bigger bottleneck than a deep frame occupying 2% width.

Top frames are where CPU was spent. The topmost frame in a stack is the leaf method, the method that was executing when the profiler sampled. If java.util.HashMap.getNode is a top frame occupying 15% width, your application spends 15% of CPU time inside the hash table lookup code itself, not in methods that call it.

Bottom frames are roots. The bottom of every stack is the thread entry point: Thread.run() for standard threads, ForkJoinWorkerThread.run() for ForkJoinPool threads. You rarely care about the bottom frames. They are infrastructure.

async-profiler Command Reference

The commands you will use most often:

# CPU profile to HTML flame graph (interactive, searchable)
./asprof -d 30 -f /tmp/cpu.html <pid>

# CPU profile to collapsed stacks (text format, for diff tools)
./asprof -d 30 -o collapsed -f /tmp/cpu.txt <pid>

# Wall-clock profile (includes threads waiting on I/O)
./asprof -e wall -d 30 -f /tmp/wall.html <pid>

# Per-thread flame graph
./asprof -d 30 -t -f /tmp/cpu-threads.html <pid>

# Include kernel frames (requires root or perf_event_paranoid=-1)
./asprof -d 30 --all-kernel -f /tmp/cpu-kernel.html <pid>

# Profile a specific set of threads by name regex
./asprof -d 30 -I 'http-nio-*' -f /tmp/http-threads.html <pid>

# Continuous profiling with periodic output (every 60 seconds)
./asprof start -i 10000000 --loop 60s -f /tmp/profile-%t.html <pid>

The -I flag (include pattern) is essential for the content platform. Profiling only HTTP worker threads eliminates noise from GC threads, JIT compiler threads, and scheduled tasks. When your flame graph shows that 30% of CPU is in C2 CompilerThread, that is the JIT compiler, not your application. Filter it out.

Reading the Content Platform’s Flame Graph

Here is a systematic procedure for reading a flame graph from the content platform’s article serving endpoint.

Step 1: Look at the top. Scan across the top of the flame graph. Identify the widest top-level frames. These are where CPU time is actually consumed. In the content platform under load, typical top-level frames include:

java.util.HashMap.getNode (hash table lookups in caching)
com.fasterxml.jackson.core.json.WriterBasedJsonGenerator.writeString (JSON serialization)
org.postgresql.core.v3.QueryExecutorImpl.receiveFields (waiting for PostgreSQL response)
java.lang.String.hashCode (hash computations for cache keys)

Step 2: Follow the widest top-level frame down. Click on the widest top-level frame in the HTML flame graph. This zooms in and shows its callers. Trace the call chain from top to bottom. This tells you which code path is responsible for the CPU consumption.

For example, if WriterBasedJsonGenerator.writeString is 18% of CPU time, trace downward:

WriterBasedJsonGenerator.writeString                    18.2%
  DefaultSerializerProvider.serializeValue              18.2%
    ArticleSerializer.serialize                         18.2%
      ArticleController.getArticle                      18.2%
        DispatcherServlet.doDispatch                    18.2%

This tells you that JSON serialization of the article response is the source. Now you know what to optimize: the article response payload.

Step 3: Compare adjacent stacks. Two stacks might both lead through ArticleController.getArticle but diverge at different sub-methods. One path goes through articleRepository.findById (database). Another goes through viewCounter.increment (Redis). The relative width of each sub-path tells you the proportion of CPU each operation consumes within the controller method.

Step 4: Search for specific frames. The HTML flame graph supports search (Ctrl+F). Search for class names or method names:

Search jackson to highlight all Jackson serialization frames
Search postgresql to highlight all PostgreSQL driver frames
Search HikariPool to highlight connection pool frames
Search GC to highlight garbage collection frames

The search highlights matching frames and shows the total percentage of samples where they appear. If searching for jackson highlights 22% of the graph, serialization is 22% of your CPU.

The Hot Frame Investigation: Content Platform Search

Let us work through a complete investigation. The search endpoint (GET /api/articles/search?q=java+performance) has degraded from p99 of 180ms to p99 of 620ms after a code change introduced full-text highlighting.

Profile under load:

locust -f locust_baseline.py --host http://localhost:8080 \
       --users 100 --spawn-rate 10 --tags search &

./asprof -d 60 -I 'http-nio-*' -f /tmp/search-cpu.html <pid>

The flame graph shows a new wide frame that was not present before:

HighlightService.highlightSnippets           38.7%
  ├── Pattern.matcher                        21.3%
  │     └── Pattern.compile                  18.9%
  └── String.replaceAll                      17.4%
        └── Pattern.compile                  14.2%

The hot frame is Pattern.compile, appearing in two sub-paths totaling 33.1% of CPU. The cause: the code compiles a regex pattern on every invocation instead of compiling it once and reusing the compiled Pattern.

// SLOW: Compiling regex on every call
public class HighlightService {
    public String highlightSnippets(String text, String query) {
        String[] terms = query.split("\\s+");
        for (String term : terms) {
            // Pattern.compile called per term, per result, per request
            text = text.replaceAll(
                "(?i)(" + Pattern.quote(term) + ")",
                "<mark>$1</mark>"
            );
        }
        return text;
    }
}

// FAST: Pre-compiled pattern, single pass
public class HighlightService {
    private final Map<String, Pattern> patternCache = new ConcurrentHashMap<>();

    public String highlightSnippets(String text, String query) {
        String[] terms = query.split("\\s+");
        String patternStr = Arrays.stream(terms)
            .map(Pattern::quote)
            .collect(Collectors.joining("|"));

        Pattern pattern = patternCache.computeIfAbsent(
            patternStr,
            p -> Pattern.compile("(?i)(" + p + ")")
        );

        return pattern.matcher(text).replaceAll("<mark>$1</mark>");
    }
}

The fix reduces regex compilation from O(terms * results * requests) to O(unique queries). The ConcurrentHashMap cache holds compiled patterns for repeated queries. For the content platform, where users frequently search for the same terms, this eliminates 33% of CPU consumption on the search endpoint.

Benchmark the fix:

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)
@Fork(2)
@State(Scope.Benchmark)
public class HighlightBenchmark {

    private String text;
    private String query;
    private HighlightService naiveService;
    private HighlightService cachedService;

    @Setup
    public void setup() {
        text = "Java performance engineering requires measurement. "
             + "Performance without measurement is guessing. "
             + "The JVM optimizes hot paths through JIT compilation.";
        text = text.repeat(50);  // Simulate a real article snippet
        query = "java performance jvm";
        naiveService = new NaiveHighlightService();
        cachedService = new CachedHighlightService();
    }

    @Benchmark
    public String naiveHighlight() {
        return naiveService.highlightSnippets(text, query);
    }

    @Benchmark
    public String cachedHighlight() {
        return cachedService.highlightSnippets(text, query);
    }
}

Expected results:

Benchmark                           Mode  Cnt     Score     Error  Units
HighlightBenchmark.naiveHighlight   avgt   10  1247.832 ±  42.118  us/op
HighlightBenchmark.cachedHighlight  avgt   10    89.451 ±   3.227  us/op

14x faster per operation. At 200 search requests per second with 20 results each, that is 4,000 fewer Pattern.compile calls per second.

Differential Flame Graphs

When comparing performance before and after a change, differential flame graphs show which frames grew and which shrank.

# Capture before profile
./asprof -d 60 -o collapsed -f /tmp/before.txt <pid>

# Apply the fix, restart

# Capture after profile
./asprof -d 60 -o collapsed -f /tmp/after.txt <pid>

# Generate differential flame graph using Brendan Gregg's tools
./difffolded.pl /tmp/before.txt /tmp/after.txt | \
    ./flamegraph.pl --negate > /tmp/diff.svg

In the differential flame graph, red frames grew (took more CPU after the change) and blue frames shrank (took less CPU). If your fix was correct, the target frame should be blue (smaller) and no other frame should be significantly red (larger).

A differential flame graph that shows the target frame shrinking but another frame growing means your fix moved the bottleneck, not removed it. This happens when the fixed code path was blocking other work. Once the block is removed, the previously starved code path gets more CPU and may itself become the new bottleneck. This is progress, not a failure. Profile again and address the next bottleneck.

When Not to Profile CPU

CPU profiling is the wrong tool in three situations:

The application is waiting, not computing. If latency is high but CPU utilization is low (under 20%), the application is waiting for something: database responses, network calls, lock acquisition, thread pool capacity. Use wall-clock profiling (-e wall) instead of CPU profiling.

The problem is garbage collection. If your application pauses periodically and the flame graph shows large GCTaskThread frames, the problem is allocation pressure, not CPU-bound computation. Switch to allocation profiling (-e alloc).

The problem is in the database. If the flame graph shows that most CPU time is in PgStatement.executeQuery and PgConnection.receiveFields, the JVM is not the bottleneck. The database is. Use EXPLAIN ANALYZE in PostgreSQL to profile the query itself. No amount of JVM profiling optimizes a missing index.

Know which tool to reach for. CPU flame graphs are for CPU-bound problems. Wall-clock profiles are for latency problems. Allocation profiles are for GC problems. Query plans are for database problems. Using the wrong tool is not just inefficient. It leads to wrong conclusions.