Why Your Performance Intuition Fails
A senior Java developer looks at a slow API endpoint and says: “The database query is probably slow. Let me add an index.” They add the index, latency drops by 15ms, and they move on.
The actual bottleneck was Jackson serialization of a 200KB JSONB column, consuming 47% of the request’s CPU time. The database query was 8ms. The index they added helped, but the endpoint is still slow. They optimized an 8ms operation instead of a 180ms operation. The profiler would have shown this in 30 seconds.
This is not a contrived example. This pattern repeats in every organization, on every team, with every level of engineer. The mechanism is well-understood: cognitive biases distort performance reasoning in predictable ways.
The Availability Bias
You optimize what you have optimized before.
A developer who spent three months tuning PostgreSQL queries last quarter will see every performance problem as a database problem. A developer who just debugged a garbage collection issue will suspect GC pauses. A developer who reads Hacker News articles about Redis will suggest adding a cache.
The availability bias causes you to weight recent experience disproportionately when diagnosing new problems. The symptom is identical (high latency), but the cause is different every time. Your brain searches for matches in recent memory and finds one. That match feels like insight. It is pattern-matching, not analysis.
Consider this scenario in the content platform. The recommendation endpoint is slow, returning p99 latency of 450ms against a 200ms target. Three engineers diagnose the problem:
Engineer A recently optimized a PostgreSQL query with a missing index. Their diagnosis: “The recommendation query probably needs an index.” They run EXPLAIN ANALYZE and find the query uses an index scan with a 12ms execution time. The query is not the problem.
Engineer B recently debugged a memory leak in a different service. Their diagnosis: “We’re probably creating too many objects in the recommendation loop.” They attach VisualVM and see normal GC behavior. Memory is not the problem.
Engineer C attaches async-profiler and captures a 30-second flame graph. The flame graph shows that 62% of CPU time is in RecommendationScorer.computeCosineSimilarity, which iterates over a 512-dimensional embedding vector for every candidate article. With 500 candidates per request, that is 256,000 floating-point multiplications. The algorithm is O(n*d) where n is the number of candidates and d is the embedding dimension. The fix is to precompute and cache normalized vectors, reducing each similarity computation from a dot product plus two norms to a single dot product.
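Engineer C’s fix can be sketched as follows. This is a minimal illustration, not the platform’s actual RecommendationScorer: the class name NormalizedEmbeddings and its methods are hypothetical, and it assumes embeddings are stored as float arrays of equal length. The idea is to normalize each vector once, when it is written or cached, so that the per-request work is a plain dot product.

```java
/** Hypothetical helper; names and float[] storage are assumptions for illustration. */
final class NormalizedEmbeddings {

    /** Returns a unit-length copy of v, or an all-zero vector if v has zero norm. */
    static float[] normalize(float[] v) {
        double sumSquares = 0.0;
        for (float x : v) {
            sumSquares += (double) x * x;
        }
        double norm = Math.sqrt(sumSquares);
        float[] unit = new float[v.length];
        if (norm == 0.0) {
            return unit; // zero vector: similarity with anything is 0
        }
        for (int i = 0; i < v.length; i++) {
            unit[i] = (float) (v[i] / norm);
        }
        return unit;
    }

    /** Baseline: a dot product plus two norms on every call. */
    static double cosineSimilarity(float[] a, float[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** For inputs that are already unit-length, cosine similarity is just the dot product. */
    static double dot(float[] unitA, float[] unitB) {
        double dot = 0.0;
        for (int i = 0; i < unitA.length; i++) {
            dot += (double) unitA[i] * unitB[i];
        }
        return dot;
    }
}
```

The complexity is still O(n*d). What changes is the constant: roughly one multiply-add per dimension instead of three, plus no square roots on the request path. Whether that is enough is a question for the next flame graph, not for intuition.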
Engineer C found the bottleneck because they looked at the profiler instead of their memory.
The Streetlight Effect
You look where the tools are easy to use, not where the problem is.
Most developers know how to read application logs. Fewer know how to read a flame graph. Almost nobody on a typical team knows how to read PostgreSQL EXPLAIN ANALYZE output with buffers enabled. The consequence: teams debug performance problems using log timestamps and System.currentTimeMillis() differences, which show where time was spent at method granularity but not why.
Log-based timing tells you that ArticleService.getArticle() took 340ms. It does not tell you whether those 340ms were spent in a database query, in JSON serialization, in GC pauses, or waiting for a network response. You need a profiler for that decomposition.
Here is a common investigation pattern that illustrates the problem:
// SLOW: Timing individual methods with log statements
public ArticleResponse getArticle(long id) {
    long t0 = System.nanoTime();
    Article article = articleRepository.findById(id).orElseThrow();
    long t1 = System.nanoTime();
    log.debug("findById: {}ms", (t1 - t0) / 1_000_000);

    long t2 = System.nanoTime();
    List<RecommendationResult> recs = recommendationService.rank(article);
    long t3 = System.nanoTime();
    log.debug("rank: {}ms", (t3 - t2) / 1_000_000);

    long t4 = System.nanoTime();
    ArticleResponse response = mapper.toResponse(article, recs);
    long t5 = System.nanoTime();
    log.debug("toResponse: {}ms", (t5 - t4) / 1_000_000);

    // The method returns the mapped response, so the return type is ArticleResponse
    return response;
}
// FAST: Use async-profiler instead of manual timing
// No code changes needed. Attach to the running JVM:
// ./asprof -d 30 -f /tmp/article-flamegraph.html <pid>
//
// The flame graph shows the FULL stack, including:
// - Time inside Jackson serialization (inside toResponse)
// - Time inside JDBC driver (inside findById)
// - Time spent in GC work (folded invisibly into nanoTime deltas)
// - Time inside kernel calls (socket reads, file I/O)
The manual timing approach has three defects. First, it measures wall-clock time, which includes time spent in GC pauses that are not attributable to the method. Second, it cannot decompose time within a method. If mapper.toResponse() is slow because Jackson is slow, the timing tells you the method is slow but not which part of Jackson is slow. Third, the timing code itself introduces overhead and must be removed before production, which means you cannot profile production behavior.
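The first defect is easy to demonstrate. The sketch below wraps a task in the same nanoTime pattern and also reads the JVM’s garbage collector counters before and after. GcAwareTiming is a hypothetical helper written for this illustration. GarbageCollectorMXBean.getCollectionTime() is a standard JDK API, but it reports approximate, JVM-wide accumulated collection time, and for concurrent collectors not all of that time is a pause, so treat the number as a hint that GC ran during the measurement, not as an exact attribution.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper showing that wall-clock timing silently includes GC activity. */
final class GcAwareTiming {

    /** Sums accumulated collection time (ms) across all collectors; undefined (-1) values are skipped. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime();
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    /** Runs the task and returns {wallMillis, gcMillisObservedDuringTask}. */
    static long[] timeWithGc(Runnable task) {
        long gcBefore = totalGcMillis();
        long start = System.nanoTime();
        task.run();
        long wallMillis = (System.nanoTime() - start) / 1_000_000;
        long gcMillis = totalGcMillis() - gcBefore;
        return new long[] {wallMillis, gcMillis};
    }
}
```

If the second number is a meaningful fraction of the first, the log line you were about to trust is blaming a method for work the collector did, possibly on behalf of some other thread entirely.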
async-profiler requires zero code changes. Attach it to production. Get a flame graph. Read the flame graph. Remove async-profiler. Done.
The Focusing Illusion
The component you are currently examining feels more important than it is.
When you open a method in your IDE and study it line by line, every line looks significant. You notice that a String.format() call could be replaced with a StringBuilder. You notice that a HashMap is created on every request and could be reused. You start optimizing these micro-details because you are staring at them.
Meanwhile, the actual bottleneck is three layers up: the HTTP client that calls a downstream service has a connection timeout of 30 seconds, and when that service is slow, threads pile up waiting for connections. No amount of StringBuilder optimization fixes a thread pool exhaustion problem.
The focusing illusion is the reason code-review-driven performance optimization fails. A reviewer sees new ArrayList<>() in a hot path and requests a change to pre-sized new ArrayList<>(expectedSize). The change is correct in isolation. But the hot path is hot because it is called 50,000 times per second, and the actual performance problem is that each call makes a database query that should have been batched. Pre-sizing the ArrayList saves 200 nanoseconds per call. Batching the queries saves 15 milliseconds per call. The reviewer optimized a 200ns operation because it was visible in the diff. The 15ms operation was not in the diff.
// SLOW: Optimizing what's visible (micro-optimization)
public List<Article> getArticlesForFeed(List<Long> articleIds) {
    // Code reviewer says: "pre-size this list"
    List<Article> articles = new ArrayList<>(articleIds.size());
    for (Long id : articleIds) {
        // THIS is the actual problem: N+1 query
        articles.add(articleRepository.findById(id).orElseThrow());
    }
    return articles;
}

// FAST: Optimizing what matters (algorithmic fix)
public List<Article> getArticlesForFeed(List<Long> articleIds) {
    // Single query instead of N queries
    return articleRepository.findAllById(articleIds);
}
The pre-sized ArrayList saves roughly 100-300 nanoseconds depending on the list size. The batched query saves 10-50 milliseconds depending on network latency and the number of IDs. These are not in the same category. They are not in the same universe.
Anchoring on the Wrong Metric
You optimize latency when the problem is throughput. You optimize throughput when the problem is tail latency.
The content platform serves article pages at a median latency of 25ms. The p99 latency is 800ms. The team focuses on reducing median latency to 20ms. They succeed. The p99 is still 780ms.
The problem was never median latency. The problem is that 1% of requests hit a code path that triggers a full-text search re-index operation that holds a database lock. The fix is to move the re-index to an async background task. With that fix alone, the median stays where it started, at 25ms, and the p99 drops from 800ms to 90ms.
Anchoring on a single metric causes you to optimize the wrong dimension. Performance engineering requires looking at the full latency distribution, not a single number.
# After your Locust run, check the DISTRIBUTION, not just the average
# baseline_stats.csv shows:
#
# Name Avg p50 p90 p95 p99
# /api/articles/[id] 25 22 35 48 800
#
# The avg and p50 look fine. The p99 is 32x the average and about 36x the p50.
# Optimizing the median is wasted work.
# Find out what happens on those 1% of requests.
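If you have raw latency samples rather than a pre-aggregated report, computing the distribution yourself takes a few lines. The sketch below uses the nearest-rank definition of a percentile. The class name LatencyPercentiles is hypothetical, and other tools may interpolate between samples and report slightly different values.

```java
import java.util.Arrays;

/** Hypothetical helper: nearest-rank percentiles over raw latency samples in milliseconds. */
final class LatencyPercentiles {

    /** Returns the smallest sample such that at least p percent of samples are less than or equal to it. */
    static long percentile(long[] samplesMillis, double p) {
        if (samplesMillis.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0.0 || p > 100.0) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = samplesMillis.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Arithmetic mean, shown only to contrast with the percentiles. */
    static double mean(long[] samplesMillis) {
        return Arrays.stream(samplesMillis).average().orElse(0.0);
    }
}
```

Feed it 98 samples of 22ms and 2 samples of 800ms and compare mean, percentile(samples, 50), and percentile(samples, 99). The mean and the median both look healthy. Only the p99 tells you that some users are waiting most of a second.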
The Confirmation Bias Loop
You suspect the database is slow. You add monitoring to the database. You see that some queries take 50ms. You confirm your hypothesis. You optimize the queries.
But you never checked whether the application was slow in a way that has nothing to do with the database. You confirmed your suspicion because you only measured the thing you suspected.
A flame graph does not have this bias. It shows all sampled time, distributed across all stack frames, weighted by how often each frame appeared in the samples. The database might be 8% of the profile. Jackson serialization might be 45%. GC might be 12%. Thread contention might be 35% (visible when you capture in wall-clock or lock mode rather than CPU mode). The flame graph shows all of these simultaneously. Your monitoring dashboard, which has database query time and request latency and nothing in between, shows a correlation that may not be causation.
The Antidote
The antidote to all five biases is the same: profile first, hypothesize second.
This is not a philosophical position. It is a workflow:
- Reproduce the performance problem with a Locust test.
- While the Locust test is running, attach async-profiler to the JVM.
- Read the flame graph. Find the widest top-level frames.
- Now form your hypothesis. Now look at the code. Now investigate.
The order matters. If you form your hypothesis before profiling, you will interpret the profile through the lens of your hypothesis. You will see the frames you expect to see. If you profile first, the data leads.
Some objections:
“I don’t have time to profile. I need to fix this now.” Profiling takes 30 seconds. Attaching async-profiler, capturing a flame graph, and identifying the widest frame takes less time than reading the logs, forming a hypothesis, writing a fix, testing the fix, and discovering it did not help. Profiling is faster than guessing.
“I can’t run a profiler in production.” Yes you can. async-profiler’s overhead is under 2% CPU. It does not require a JVM restart. It does not require a special JVM build. It works with OpenJDK, GraalVM, and Amazon Corretto. If your organization prohibits profiling in production, your organization has decided that guessing is cheaper than measuring.
“My intuition is usually right.” Test this claim. For the next five performance investigations, write down your hypothesis before profiling. Then profile. Compare. If your intuition is right more than 70% of the time, you have unusually accurate intuition. Most engineers score under 40%.
The remainder of this chapter sets up the measurement tools. After completing the setup, you will be able to run every benchmark, every profile, and every load test in this book.