The Flamegraph: Your Performance X-Ray
Summary
Explains what flamegraphs are, how to generate them across Python, Java, Go, and Linux-native tooling, walks through interpreting a realistic flamegraph to find a JSON serialization bottleneck, and covers memory profiling and the observer effect in production profiling.
Brendan Gregg created the flamegraph in 2011 while investigating a MySQL performance problem at Joyent. He had DTrace profiles with thousands of stack samples, but the text output was unreadable. He wrote a Perl script to convert stack traces into an SVG visualization where bar width represents the proportion of samples and the y-axis represents call stack depth. The result became one of the most widely used performance visualizations in software.
Before flamegraphs, understanding where CPU time went required reading profiler output that looked like accounting spreadsheets — flat lists of function names with percentages. You could see that processRequest() consumed 45% of CPU, but you couldn’t see why. Was it the function itself, or something it called, or something called three layers deep? Flamegraphs made the answer visual and immediate.
What a Flamegraph Actually Shows
A flamegraph is a visualization of sampled call stacks. The profiler interrupts the running program at fixed intervals (typically 99 times per second, or roughly once every 10ms) and records the current call stack: which function is running, which function called it, which function called that, all the way down to the entry point.
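The sampling loop described above can be imitated in a few lines of Python. This is only a toy sketch, assuming CPython (sys._current_frames() is a CPython-specific API); production profilers such as py-spy sample from outside the process instead. The names sample_stacks, busy_work, and profile_busy_work are made up for illustration.

```python
import sys
import threading
import time
from collections import Counter


def sample_stacks(target_thread_id, interval_s, stop_event, counts):
    """Periodically record one thread's call stack as a tuple of function names."""
    while not stop_event.is_set():
        frame = sys._current_frames().get(target_thread_id)
        stack = []
        while frame is not None:
            stack.append(frame.f_code.co_name)
            frame = frame.f_back  # walk towards the entry point
        if stack:
            counts[tuple(reversed(stack))] += 1  # entry point first, leaf last
        time.sleep(interval_s)


def busy_work(duration_s):
    """Hypothetical CPU-bound workload that spins for about duration_s seconds."""
    deadline = time.perf_counter() + duration_s
    total = 0
    while time.perf_counter() < deadline:
        total += sum(i * i for i in range(1000))
    return total


def profile_busy_work(duration_s=0.3, interval_s=0.01):
    """Sample the calling thread roughly every interval_s while busy_work runs."""
    counts = Counter()
    stop_event = threading.Event()
    sampler = threading.Thread(
        target=sample_stacks,
        args=(threading.get_ident(), interval_s, stop_event, counts),
        daemon=True,
    )
    sampler.start()
    busy_work(duration_s)
    stop_event.set()
    sampler.join()
    return counts


counts = profile_busy_work()
for stack, hits in counts.most_common(3):
    print(hits, ";".join(stack))
```

Each distinct stack ends up with a hit count, which is the raw material a flamegraph is drawn from.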
After collecting thousands of these samples, the tool merges identical stacks and renders them as a stacked bar chart:
- Each bar is a function. Its width is proportional to the number of samples where that function appeared in the stack — either running directly or as an ancestor of the running function.
- The y-axis is stack depth. The bottom bar is the entry point (usually main() or the thread’s start function). Each bar above it is a function called by the bar below.
- Width equals time. A function that appears in 60% of samples takes a bar that’s 60% of the total width. This is the critical insight: wide bars are where the time goes.
- The x-axis is not time. This confuses everyone at first. The x-axis is alphabetically sorted, not chronologically ordered. Adjacent bars at the same level are siblings or different code paths, not sequential operations. The only dimension that matters is width.
- Color is typically arbitrary: random warm tones that distinguish adjacent bars. Some tools use color to encode frame type (for example, separating kernel, native, and JIT-compiled frames), but the default is just visual differentiation.
A plateau, a wide bar with little or nothing stacked on top of it, means a single function is directly consuming that CPU time. It’s doing computation, not delegating. This is a hot function.
A tower — a narrow column many bars deep — means a code path with many layers of indirection but little total time. Deep stacks aren’t inherently bad; they’re only problematic if they’re also wide.
A wide bar that narrows into many children means a function that delegates to many different code paths. The parent aggregates time from many callees. To optimize, you’d investigate each child independently.
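The merging step and the width rules above can be sketched in Python. This is a simplified illustration, not how any particular tool is implemented: it aggregates by function name, whereas a real flamegraph draws one bar per position in the call tree. The sample data and the names fold and frame_widths are hypothetical.

```python
from collections import Counter


def fold(samples):
    """Merge identical stacks into 'root;child;leaf' -> sample count."""
    return Counter(";".join(stack) for stack in samples)


def frame_widths(folded):
    """Return (inclusive, self_time) sample counts per function name.

    inclusive counts samples where the function appears anywhere in the
    stack (its bar width); self_time counts samples where it is the leaf
    (the exposed top edge that forms a plateau).
    """
    inclusive, self_time = Counter(), Counter()
    for stack, count in folded.items():
        frames = stack.split(";")
        for name in set(frames):  # count a recursive function once per sample
            inclusive[name] += count
        self_time[frames[-1]] += count
    return inclusive, self_time


# Ten hypothetical samples, entry point first.
samples = (
    [("main", "handle", "serialize")] * 6
    + [("main", "handle", "predict")] * 2
    + [("main", "handle", "parse")] * 1
    + [("main", "handle")] * 1
)

folded = fold(samples)
inclusive, self_time = frame_widths(folded)
total = sum(folded.values())
for name, count in inclusive.most_common():
    print(f"{name}: width {count / total:.0%}, self {self_time[name] / total:.0%}")
```

In this data, handle is wide but has almost no self time, so it is a delegating parent; serialize is wide and all self time, so it is the plateau worth investigating.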
Generating Flamegraphs by Language
Python: py-spy
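py-spy is a sampling profiler for Python that attaches to a running process without any code changes. It runs as a separate process (written in Rust) and reads the target’s memory from outside instead of executing inside its interpreter, which keeps overhead low: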
# Profile a running process
py-spy record -o profile.svg --pid 12345
# Profile a command from start
py-spy record -o profile.svg -- python my_service.py
# Top-like real-time view
py-spy top --pid 12345
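py-spy can also separate running code from waiting code. By default, py-spy record leaves idle threads out of the profile; pass --idle to include stacks that are blocked on I/O or waiting for the GIL, or --gil to count only threads that currently hold the GIL. Comparing these views helps distinguish CPU-bound time (in Python functions) from I/O-bound time (in native calls like select() or recv()).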
For CPU profiling specifically, use --native to include C extension call stacks, which reveals whether your bottleneck is in Python code or in a C library:
py-spy record --native -o profile.svg --pid 12345
Java: async-profiler
async-profiler is the gold standard for JVM profiling. Profilers built on the JVM’s standard stack-sampling APIs suffer from safepoint bias: they can only sample at JVM safepoints and miss the work between them. async-profiler avoids this by combining the perf_events kernel mechanism with the JVM’s AsyncGetCallTrace API, so it samples actual CPU usage:
# Attach to a running JVM and profile for 30 seconds
./asprof -d 30 -f profile.html <pid>
# CPU profiling, walking native stacks via frame pointers (sees native code too)
./asprof -d 30 -e cpu --cstack fp -f cpu_profile.html <pid>
# Allocation profiling (what's creating objects?)
./asprof -d 30 -e alloc -f alloc_profile.html <pid>
The allocation profiling mode is particularly valuable. Instead of sampling CPU time, it samples object allocations. The flamegraph then shows which code paths are creating the most objects — directly revealing code that’s likely to cause GC pressure.
Go: pprof
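Go has profiling built into the standard library. For an HTTP service, add a single import. It registers the /debug/pprof/ handlers on http.DefaultServeMux, so the service must serve that mux on some port (the commands in this section assume localhost:6060):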
import _ "net/http/pprof"
Then collect and visualize profiles:
# CPU profile for 30 seconds (URL quoted so the shell does not interpret the ?)
go tool pprof -http=:8080 "http://localhost:6060/debug/pprof/profile?seconds=30"
# Heap profile (current allocations)
go tool pprof -http=:8080 "http://localhost:6060/debug/pprof/heap"
# Goroutine profile (what are all goroutines doing?)
go tool pprof -http=:8080 "http://localhost:6060/debug/pprof/goroutine"
The -http=:8080 flag opens an interactive web UI with flamegraph, graph, and source views. Go’s pprof is genuinely the most ergonomic profiling experience in any language.
Linux: perf + FlameGraph
For any language, or when you need to include kernel-level stacks, use Linux’s perf:
# Record CPU samples at 99 Hz for 30 seconds
perf record -F 99 -g -p <pid> -- sleep 30
# Convert to flamegraph
perf script | stackcollapse-perf.pl | flamegraph.pl > profile.svg
The stackcollapse-perf.pl and flamegraph.pl scripts are from Brendan Gregg’s FlameGraph repository. This pipeline gives you a very complete view: user-space stacks, kernel stacks, and the transitions between them. If you suspect the bottleneck is in a system call or kernel path, a system-level profiler like perf is the tool to reach for, because most language-level profilers stop at the user-space boundary.
Reading a Flamegraph: A Walkthrough
Here’s a realistic scenario. You profile a web API endpoint that returns product recommendations. The average response time is 280ms. The team assumes the machine learning inference step is the bottleneck. The flamegraph tells a different story:
[100%] handle_request()
├── [8%] parse_request()
├── [12%] load_user_profile()
│   ├── [3%] db_query()
│   └── [9%] deserialize_profile()
├── [14%] run_ml_inference()
│   ├── [11%] model.predict()
│   └── [3%] feature_extraction()
├── [61%] serialize_response()
│   ├── [42%] json.dumps()
│   │   └── [38%] _encoder.encode()
│   │       └── [35%] _make_iterencode.<locals>._iterencode()
│   └── [19%] format_products()
│       ├── [12%] _resolve_image_urls()
│       └── [7%] _compute_display_price()
└── [5%] send_response()
The ML inference, the part the team worried about, is 14% of CPU time. Response serialization is 61%: well over half of every request’s CPU budget goes to turning Python objects into a JSON response.
Drilling deeper: within serialization, 42% is json.dumps() itself. The _iterencode frames show the standard library’s pure-Python encoder walking the object tree and dispatching on each value’s type. The other 19% is format_products(), which includes resolving image URLs (12%) through string concatenation and path manipulation for each of the 50 product images in the response.
The optimization targets, in priority order:
1. json.dumps() (42%). Switch to orjson, which often benchmarks 5-10x faster than the standard library because encoding runs in compiled Rust rather than interpreted Python. Note that orjson.dumps() returns bytes, not str. Pydantic v2’s compiled serialization is another option.
2. _resolve_image_urls() (12%). The image URL resolution is doing string formatting inside a loop. Pre-compute the URL template with the CDN prefix once, then apply per-image IDs. Or better: move URL resolution to the client and return image IDs only, reducing both computation and response size.
3. deserialize_profile() (9%). User profiles are deserialized from JSON on every request. If a user makes multiple requests in a session, cache the deserialized object.
4. model.predict() (11%). The ML inference is already well-optimized. Leave it alone unless all the above are fixed and it becomes the new dominant cost.
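One low-risk way to try the json.dumps() fix is a small wrapper that uses orjson when it is installed and falls back to the standard library otherwise. This is a sketch under assumptions: the payload shape and the name dumps_bytes are invented for illustration, and the timings it prints depend entirely on your machine and data.

```python
import json
import timeit

try:
    import orjson  # third-party and optional; assumed to be installed separately
except ImportError:
    orjson = None


def dumps_bytes(payload):
    """Serialize to UTF-8 JSON bytes, preferring orjson when available."""
    if orjson is not None:
        return orjson.dumps(payload)  # orjson returns bytes, not str
    return json.dumps(payload, separators=(",", ":")).encode("utf-8")


# Hypothetical response body with 50 products, mirroring the walkthrough.
payload = {
    "products": [
        {"id": i, "name": f"Product {i}", "price": 9.99 + i, "image_id": f"img-{i}"}
        for i in range(50)
    ]
}

stdlib_s = timeit.timeit(lambda: json.dumps(payload), number=1000)
wrapped_s = timeit.timeit(lambda: dumps_bytes(payload), number=1000)
backend = "orjson" if orjson is not None else "stdlib fallback"
print(f"json.dumps: {stdlib_s:.4f}s, dumps_bytes ({backend}): {wrapped_s:.4f}s")
```

Measuring on a payload shaped like your real responses matters more than any published benchmark, and a fresh flamegraph afterwards confirms whether the bar actually shrank.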
After implementing fixes 1 and 2, the new flamegraph shows serialize_response() dropping from 61% to 18%, and overall response time drops from 280ms to 94ms. The same instances now handle 3x the traffic.
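The URL and profile-caching ideas from the list could look roughly like the following. Everything here is hypothetical: the CDN prefix, the URL layout, the in-memory profile store, and the function names are stand-ins, not the service’s real code.

```python
import json
from functools import lru_cache

CDN_PREFIX = "https://cdn.example.com/images"  # hypothetical CDN base URL


def resolve_image_urls(image_ids, size="medium"):
    """Build the shared URL prefix once, then only append each image ID."""
    prefix = f"{CDN_PREFIX}/{size}/"  # computed once per call, not once per image
    return [prefix + image_id + ".jpg" for image_id in image_ids]


# Stand-in for the database row that holds a user's profile as JSON text.
_FAKE_PROFILE_STORE = {"u1": '{"user_id": "u1", "likes": ["shoes", "hats"]}'}


@lru_cache(maxsize=1024)
def load_profile(user_id):
    """Deserialize a profile once per user and reuse the parsed object."""
    return json.loads(_FAKE_PROFILE_STORE[user_id])


urls = resolve_image_urls([f"img-{i}" for i in range(50)])
first = load_profile("u1")
second = load_profile("u1")  # served from the cache, no second json.loads
print(urls[0], load_profile.cache_info())
```

The cache returns the same mutable dict to every caller and never expires entries on its own, so a real service would need an invalidation or TTL strategy before adopting this.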
Memory Profiling
CPU flamegraphs show where time goes. Memory profilers show where bytes go. Different question, equally important.
Python: tracemalloc
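import tracemalloc
# Start tracing allocations as early as possible
tracemalloc.start()
# ... do work ...
snapshot = tracemalloc.take_snapshot()
# Print the ten source lines holding the most live memory
for stat in snapshot.statistics('lineno')[:10]:
    print(stat)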
This shows which lines of code allocated the most memory that’s still alive. For leak detection, take two snapshots and compare:
snapshot1 = tracemalloc.take_snapshot()
# ... do more work ...
snapshot2 = tracemalloc.take_snapshot()
# Largest changes between the two snapshots, grouped by source line
for stat in snapshot2.compare_to(snapshot1, 'lineno')[:10]:
    print(stat)
Java: jmap + Eclipse MAT
# Dump the heap
jmap -dump:live,format=b,file=heap.hprof <pid>
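Open heap.hprof in Eclipse Memory Analyzer (MAT). The “Leak Suspects” report automatically identifies objects that dominate the heap. The “Dominator Tree” shows the hierarchy of object retention: which object is keeping which other objects alive. Note that the live option forces a full GC before the dump, and the JVM is paused while the heap is written, so schedule it carefully in production.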
C/C++: heaptrack
heaptrack ./my_program
heaptrack_gui heaptrack.my_program.<pid>.gz
Heaptrack produces flamegraphs of allocation sites, showing both total allocations and peak memory. It’s the modern replacement for Valgrind’s massif tool, with dramatically lower overhead.
Development vs. Production Profiling
Profiling in development is safe but often misleading. Your development machine has different hardware, different load patterns, different data sizes, and different concurrency levels. A function that’s fast with 10 items might be quadratic and catastrophic with 10,000 items. Development profiling catches gross inefficiencies, but production is where the real bottlenecks emerge.
Production profiling introduces the observer effect: the act of measuring changes what you’re measuring. A CPU profiler that interrupts the process 99 times per second adds roughly 2-5% overhead. A memory profiler that hooks every allocation can add 10-50% overhead. A tracing profiler that records every function entry and exit can add 100%+ overhead.
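The cost of tracing is easy to observe with Python’s built-in cProfile, which hooks every function call and return. This is a rough sketch: the recursive fib workload and the timed helper are arbitrary choices, and the measured ratio will vary by machine and Python version.

```python
import cProfile
import time


def fib(n):
    """Deliberately call-heavy workload: naive recursive Fibonacci."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)


def timed(fn):
    """Run fn() and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


plain_result, plain_s = timed(lambda: fib(20))

profiler = cProfile.Profile()
traced_result, traced_s = timed(lambda: profiler.runcall(fib, 20))

print(f"plain: {plain_s:.4f}s, under cProfile: {traced_s:.4f}s")
```

Call-heavy code like this is the worst case for a tracing profiler, because the per-call bookkeeping is large relative to the work each call does.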
The solution is sampling profilers with low overhead. py-spy, async-profiler, and Go’s pprof are all designed for production use. They sample rather than trace: they capture a statistical snapshot of behavior instead of recording every event. Accuracy improves with the number of samples, with statistical error shrinking roughly as the square root of the sample count. 30 seconds of sampling at 99Hz gives you 2,970 samples per continuously busy thread, which is sufficient to identify any function consuming more than ~1% of CPU time.
Run production profiling for 30-60 seconds during representative load, download the flamegraph, and analyze it offline. The overhead is small, the insight is invaluable, and you’ll discover performance truths that no amount of code reading can reveal.
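The sample-count arithmetic can be checked directly. The sketch below assumes a simple binomial model, where each sample independently lands in a given function with probability equal to its true CPU share, and a single thread that is on-CPU for the whole window. Real samples are not perfectly independent, so treat the error figure as a rough guide.

```python
import math


def sample_count(duration_s, hz):
    """Number of samples collected from one continuously running thread."""
    return int(duration_s * hz)


def estimate(true_share, n_samples):
    """Expected hits and standard error of the measured share (binomial model)."""
    expected_hits = true_share * n_samples
    std_error = math.sqrt(true_share * (1 - true_share) / n_samples)
    return expected_hits, std_error


n = sample_count(30, 99)
hits, err = estimate(0.01, n)
print(f"{n} samples; a 1% function gets about {hits:.1f} hits, "
      f"measured share 1% +/- {err * 100:.2f} percentage points")
```

Under this model a 1% function collects roughly 30 of the 2,970 samples, which is enough to show up as a visible bar and not as noise.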
The flamegraph is not just a visualization. It’s a worldview. It says: “Don’t guess where the time goes. Measure.” And once you’ve measured, the path to optimization is clear — not always easy, but clear. The widest bar is where you start. Everything else is distraction.