fast by design

Escape Analysis and Scalar Replacement in Practice

11 min read Chapter 15 of 90

Every new keyword in Java source code is a potential allocation. But the JIT compiler can often prove that the allocation is unnecessary. Escape analysis identifies objects that never leave the compiled method, including any callees inlined into it. Scalar replacement then decomposes those objects into their individual fields and keeps them in CPU registers or stack slots. The allocation disappears, and the garbage collector never sees the object.

This optimization is invisible. It does not change the program’s behavior, and the eliminated objects never show up in heap dumps or allocation profiles. In allocation-heavy hot paths it can cut the allocation rate by as much as 10x, and young GC frequency falls roughly in proportion.

How Scalar Replacement Works

When the C2 compiler determines that an object does not escape its compilation unit, it replaces the object with its fields. The new instruction is removed from the compiled code. No object header is allocated. No GC metadata is created. The fields exist as scalar values in registers or stack slots.

// Source code:
public record SearchResult(String articleId, double score, int rank) {}

public SearchResult findBestMatch(String query, List<Article> articles) {
    SearchResult best = null;
    for (int i = 0; i < articles.size(); i++) {
        Article a = articles.get(i);
        double score = computeScore(query, a);
        SearchResult current = new SearchResult(a.id(), score, i);  // Allocation?
        if (best == null || current.score() > best.score()) {
            best = current;
        }
    }
    return best;  // Escapes! Cannot scalar-replace
}

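In this example, best escapes the method because it is returned, so the JIT cannot scalar-replace it.

Most current objects are dead by the next iteration, but that does not help with C2. Its escape analysis is flow-insensitive: it classifies the allocation site as a whole, and because an object created there can become best and be returned, every iteration heap-allocates a SearchResult. Compilers with partial escape analysis, such as Graal, can defer the allocation to the branch where the object actually escapes, so measure on the JVM you deploy.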
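One way to sidestep the question of what the JIT can prove is to keep the loop state in primitives and build the record once, after the loop. The sketch below is an illustration rather than the platform's actual code: the Article record and the body of computeScore are stand-ins invented for the example.

```java
import java.util.List;

final class BestMatchSketch {
    // Hypothetical stand-in types; the real Article and scoring logic are not shown in the text.
    record Article(String id, String body) {}
    record SearchResult(String articleId, double score, int rank) {}

    // Placeholder scoring (an assumption): matching articles score by body length.
    static double computeScore(String query, Article a) {
        return a.body().contains(query) ? a.body().length() : 0;
    }

    // Tracks the winner as two primitives and allocates at most one SearchResult.
    static SearchResult findBestMatch(String query, List<Article> articles) {
        int bestIndex = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < articles.size(); i++) {
            double score = computeScore(query, articles.get(i));
            if (bestIndex < 0 || score > bestScore) {
                bestIndex = i;
                bestScore = score;
            }
        }
        if (bestIndex < 0) {
            return null; // empty list: same result as the original method
        }
        return new SearchResult(articles.get(bestIndex).id(), bestScore, bestIndex);
    }

    public static void main(String[] args) {
        List<Article> articles = List.of(
                new Article("a1", "gc tuning"),
                new Article("a2", "gc gc pauses"),
                new Article("a3", "inlining"));
        System.out.println(findBestMatch("gc", articles));
    }
}
```

The single allocation still escapes through the return value, but it happens once per call instead of once per candidate, whatever the compiler manages to prove about the loop.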

Now consider a version where the result is consumed within the method:

// Object does NOT escape -> candidate for scalar replacement
public double findBestScore(String query, List<Article> articles) {
    SearchResult best = null;
    for (int i = 0; i < articles.size(); i++) {
        Article a = articles.get(i);
        double score = computeScore(query, a);
        SearchResult current = new SearchResult(a.id(), score, i);
        if (best == null || current.score() > best.score()) {
            best = current;
        }
    }
    return best.score();  // Only the score escapes, not the object
}

// Conceptual result of scalar replacement (an empty list returns -Infinity here; the original throws NullPointerException):
public double findBestScore(String query, List<Article> articles) {
    String best_articleId = null;  // Scalar fields
    double best_score = Double.NEGATIVE_INFINITY;
    int best_rank = -1;

    for (int i = 0; i < articles.size(); i++) {
        Article a = articles.get(i);
        double score = computeScore(query, a);
        // No SearchResult allocation
        if (score > best_score) {
            best_articleId = a.id();
            best_score = score;
            best_rank = i;
        }
    }
    return best_score;
}

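When scalar replacement succeeds, SearchResult is decomposed into three scalar variables. Zero allocations per iteration. The loop runs at the speed of the computeScore calculation, not at the speed of memory allocation. Treat the listing as a conceptual picture, though: best merges a fresh object with its previous value on every iteration, and C2 has historically declined to scalar-replace allocations that meet at such control-flow merges. Confirm the outcome with an allocation profile rather than assuming it.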

JMH Proof: Measuring Allocation Elimination

JMH can measure allocation rate with the -prof gc profiler. This profiler reports gc.alloc.rate.norm: the allocation rate normalized by throughput, that is, the number of bytes allocated per benchmark operation.

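import java.util.Random;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)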
@Warmup(iterations = 5, time = 3, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 10, time = 5, timeUnit = TimeUnit.SECONDS)
@Fork(value = 3)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class EscapeAnalysisBenchmark {

    private double[] vector1;
    private double[] vector2;

    @Setup
    public void setup() {
        Random r = new Random(42);
        vector1 = r.doubles(128).toArray();
        vector2 = r.doubles(128).toArray();
    }

    // Object escapes -> heap allocated
    @Benchmark
    public ScoredVector escapingAllocation() {
        double dot = 0;
        for (int i = 0; i < vector1.length; i++) {
            dot += vector1[i] * vector2[i];
        }
        double mag1 = magnitude(vector1);
        double mag2 = magnitude(vector2);
        return new ScoredVector(dot / (mag1 * mag2), vector1.length);  // Escapes
    }

    // Object consumed locally -> scalar replaced
    @Benchmark
    public double nonEscapingAllocation() {
        double dot = 0;
        for (int i = 0; i < vector1.length; i++) {
            dot += vector1[i] * vector2[i];
        }
        double mag1 = magnitude(vector1);
        double mag2 = magnitude(vector2);
        ScoredVector sv = new ScoredVector(dot / (mag1 * mag2), vector1.length);
        return sv.score();  // Only primitive escapes
    }

    private double magnitude(double[] v) {
        double sum = 0;
        for (double d : v) sum += d * d;
        return Math.sqrt(sum);
    }

    record ScoredVector(double score, int dimensions) {}
}

Run with allocation profiling:

java -jar benchmarks.jar EscapeAnalysisBenchmark -prof gc

Results:

Benchmark                                 Mode  Cnt  Score   Error  Units
escapingAllocation                        avgt   30  82.34 ±  1.45  ns/op
escapingAllocation:gc.alloc.rate.norm     avgt   30  24.00 ±  0.01   B/op

nonEscapingAllocation                     avgt   30  78.12 ±  1.23  ns/op
nonEscapingAllocation:gc.alloc.rate.norm  avgt   30   0.00 ±  0.01   B/op

The escaping version allocates 24 bytes per operation: a 12-byte object header (with compressed class pointers, the default on 64-bit HotSpot), 8 bytes for the double, and 4 bytes for the int. The non-escaping version allocates 0 bytes. The ScoredVector object was scalar-replaced.
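The 24-byte figure can be reproduced with simple layout arithmetic. The helper below is a back-of-the-envelope sketch, not a JVM API: it assumes a 64-bit HotSpot with 8-byte object alignment, and the 12-byte and 16-byte headers are the usual sizes with and without compressed class pointers. Real field layout can add internal padding, so a tool such as JOL remains the authority.

```java
final class ObjectSizeSketch {
    // Rounds size up to the next multiple of alignment (alignment must be a power of two).
    static long alignUp(long size, long alignment) {
        return (size + alignment - 1) & ~(alignment - 1);
    }

    // Estimated shallow size: header plus field bytes, padded to 8-byte alignment.
    // Ignores padding between fields, so treat the result as a lower bound.
    static long estimate(long headerBytes, long... fieldBytes) {
        long size = headerBytes;
        for (long f : fieldBytes) {
            size += f;
        }
        return alignUp(size, 8);
    }

    public static void main(String[] args) {
        // ScoredVector(double score, int dimensions)
        System.out.println(estimate(12, 8, 4)); // prints 24 (compressed class pointers)
        System.out.println(estimate(16, 8, 4)); // prints 32 (uncompressed class pointers)
    }
}
```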

The timing difference is small (82ns vs 78ns) because the allocation itself is fast (bump pointer in TLAB). But at 200 requests per second, each request calling this method 10,000 times for search scoring, the allocation difference is:

  • Escaping: 24 bytes * 10,000 * 200 = 48 MB/s of garbage
  • Non-escaping: 0 bytes of garbage

48 MB/s of additional garbage means more frequent young GC, which means more pauses, which means higher p99 latency. Scalar replacement does not make individual operations faster. It makes the system faster by reducing GC pressure.
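To see how allocation rate translates into collection frequency, a rough model is that a young collection triggers each time eden fills, so the interval is about eden size divided by allocation rate. The sketch below applies that model to the numbers above as if this method were the only source of allocation. The 512 MB eden is an assumed value chosen for illustration, and real collectors resize eden adaptively.

```java
final class GcPressureSketch {
    // Garbage produced per second by one allocation site.
    static double garbageBytesPerSecond(long bytesPerCall, long callsPerRequest, long requestsPerSecond) {
        return (double) bytesPerCall * callsPerRequest * requestsPerSecond;
    }

    // Simplified model: one young GC each time eden fills up.
    static double secondsBetweenYoungGcs(double edenBytes, double allocBytesPerSecond) {
        return edenBytes / allocBytesPerSecond;
    }

    public static void main(String[] args) {
        double rate = garbageBytesPerSecond(24, 10_000, 200); // 48,000,000 bytes/s
        double eden = 512e6; // ASSUMPTION: 512 MB eden, not a figure from the text
        System.out.printf("%.0f MB/s, about one young GC every %.1f s%n",
                rate / 1e6, secondsBetweenYoungGcs(eden, rate));
    }
}
```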

When Escape Analysis Fails

Escape analysis fails in predictable circumstances. Knowing these failure modes lets you write code that avoids them.

Failure 1: Object Stored in a Field

// EA FAILS: Object stored in instance field
public class ArticleRanker {
    private SearchResult lastResult;  // Field reference

    public double rank(Article article) {
        SearchResult result = new SearchResult(article.id(), computeScore(article), 0);
        this.lastResult = result;  // GlobalEscape: stored in field
        return result.score();
    }
}

Storing the object in a field makes it globally reachable. Any other thread or method could access lastResult. The JIT must allocate on the heap.
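If callers only need the last score rather than the whole result, one option is to keep a primitive in the field so that no object reference is ever published. This is a sketch under that assumption: the lastScore field and the simplified Article and computeScore are illustrative, not taken from the platform code.

```java
final class ArticleRankerSketch {
    // Hypothetical stand-ins for the types used in the text.
    record Article(String id, long viewCount) {}
    record SearchResult(String articleId, double score, int rank) {}

    // Primitive field: there is no reference for the temporary object to escape through.
    private double lastScore = Double.NaN;

    // Placeholder scoring (an assumption).
    static double computeScore(Article a) {
        return Math.log(a.viewCount() + 1);
    }

    double rank(Article article) {
        // The temporary is only read locally, so it stays a candidate for scalar replacement.
        SearchResult result = new SearchResult(article.id(), computeScore(article), 0);
        this.lastScore = result.score(); // stores a double, not the object
        return result.score();
    }

    double lastScore() {
        return lastScore;
    }

    public static void main(String[] args) {
        ArticleRankerSketch ranker = new ArticleRankerSketch();
        System.out.println(ranker.rank(new Article("a1", 99)));
        System.out.println(ranker.lastScore());
    }
}
```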

Failure 2: Object Passed to Un-inlined Method

// EA FAILS: Object passed to method that is not inlined
public double processResult(SearchResult result) {
    return externalLibrary.analyze(result);  // Not inlined -> EA assumes escape
}

If externalLibrary.analyze() is not inlined (too large, or megamorphic), the JIT cannot prove that result does not escape inside analyze(). Conservative assumption: the object escapes.
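When the callee cannot be inlined, passing only the values it needs keeps the object's reference out of the call altogether. The sketch below assumes the analysis needs just the score and rank. The Analyzer interface and its analyze(double, int) signature are hypothetical stand-ins for externalLibrary, which the text does not define.

```java
final class PrimitiveArgsSketch {
    record SearchResult(String articleId, double score, int rank) {}

    // Hypothetical library boundary that accepts primitives instead of a SearchResult.
    interface Analyzer {
        double analyze(double score, int rank);
    }

    static double processResult(Analyzer analyzer, String articleId, double rawScore, int rank) {
        SearchResult result = new SearchResult(articleId, rawScore, rank);
        // Only primitive values cross the call, so the reference stays inside this method
        // whether or not analyze() is inlined.
        return analyzer.analyze(result.score(), result.rank());
    }

    public static void main(String[] args) {
        Analyzer discountByRank = (score, rank) -> score / (rank + 1);
        System.out.println(processResult(discountByRank, "a1", 8.0, 3)); // prints 2.0
    }
}
```

This only works when you control the callee's signature or it already offers a primitive overload. For a third-party API that insists on the object, the allocation is the price of the call.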

Failure 3: Object Added to a Collection

// EA FAILS: Object added to a collection
public void collectResults(List<SearchResult> results, Article article) {
    SearchResult result = new SearchResult(article.id(), computeScore(article), 0);
    results.add(result);  // GlobalEscape: stored in collection's backing array
}

Collections store object references in backing arrays. Adding an object to a collection is equivalent to storing it in a field. The object must be heap-allocated.

Failure 4: Object Used in Synchronized Block

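// EA OUTCOME VARIES: Object used as lock target
public double computeScore(Article article) {
    Object lock = new Object();  // NoEscape: never visible to another thread
    synchronized (lock) {        // Lock is a candidate for elision
        return article.getViewCount() * 0.01;
    }
}

Because lock never escapes, no other thread can contend for it, and C2 can remove the monitor operations (lock elision). Whether the allocation disappears as well depends on the JVM: current HotSpot builds can usually eliminate both the lock and the object, but if the compiler cannot remove every monitor operation on the object, it stays on the heap. Synchronizing on a freshly created object protects nothing in any case, so the simplest fix is to delete the block and confirm the effect with -prof gc.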

Failure 5: Too Many Fields

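The C2 compiler limits how much state it will track for each replaced object. Objects with a very large number of fields (the count includes fields from nested, also-replaced objects) may not be scalar-replaced. Treat the figure of roughly 100 fields as a rule of thumb: the exact thresholds are implementation details that vary by JDK version. Arrays have their own cap, -XX:EliminateAllocationArraySizeLimit, which defaults to 64 elements.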

// EA MAY FAIL: Object with many fields
public record ArticleMetadata(
    String id, String title, String author, String category,
    long viewCount, long shareCount, long commentCount,
    double readTime, double scrollDepth, double bounceRate,
    Instant publishedAt, Instant updatedAt, Instant indexedAt,
    List<String> tags, List<String> categories, List<String> relatedIds,
    // ... 30 more fields
) {}

Large records may exceed the scalar replacement limit. Decompose large objects into smaller, focused records used in specific contexts.
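As an illustration of that decomposition, the sketch below pulls the three engagement counters into their own small record and uses it in a method that needs nothing else. EngagementStats, engagementScore, and the weights are hypothetical names and values chosen for the example.

```java
final class FocusedRecordsSketch {
    // Hypothetical focused record carved out of the larger ArticleMetadata.
    record EngagementStats(long viewCount, long shareCount, long commentCount) {}

    // Three fields, used only locally: a small, easy target for scalar replacement.
    static double engagementScore(long views, long shares, long comments) {
        EngagementStats stats = new EngagementStats(views, shares, comments);
        // Illustrative weights, not taken from the text.
        return Math.log1p(stats.viewCount())
                + 2.0 * stats.shareCount()
                + 0.5 * stats.commentCount();
    }

    public static void main(String[] args) {
        System.out.println(engagementScore(0, 3, 4)); // prints 8.0
    }
}
```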

Verifying Escape Analysis with JVM Flags

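The -XX:+PrintEscapeAnalysis flag shows what the JIT decided for each allocation. It is a develop-only option: it exists in debug (fastdebug) builds of HotSpot, and release builds refuse to start when it is specified, even with -XX:+UnlockDiagnosticVMOptions. On a debug build: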

java -XX:+UnlockDiagnosticVMOptions \
     -XX:+PrintEscapeAnalysis \
     -jar content-platform.jar

Output:

======== Connection graph for method ArticleService.serveArticle
  JavaObject NoEscape(NoEscape) -> [ ... ] SearchResult
  JavaObject GlobalEscape -> [ ... ] ArrayList
  JavaObject NoEscape(NoEscape) -> [ ... ] StringBuilder

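SearchResult is NoEscape, which makes it a candidate for scalar replacement. ArrayList is GlobalEscape: it will be heap-allocated (it is returned from the method). StringBuilder is also NoEscape, but NoEscape alone does not guarantee replacement: the object must also fit within C2's size limits and must not be merged with other references at a control-flow join.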
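Where the diagnostic output is not available, allocation can also be observed from inside the process. The sketch below uses com.sun.management.ThreadMXBean.getThreadAllocatedBytes, a HotSpot-specific extension of the standard thread bean, to compare a method whose object stays local with one that publishes it to a static field. The class and method names are illustrative. The numbers depend on warm-up and on the JIT's decisions, so treat it as a quick probe rather than a replacement for JMH.

```java
import java.lang.management.ManagementFactory;
import java.util.function.DoubleUnaryOperator;

final class AllocationProbe {
    record ScoredVector(double score, int dimensions) {}

    static volatile Object sink; // writing here forces the object to escape

    static double local(double x) {
        ScoredVector sv = new ScoredVector(x * 0.5, 128);
        return sv.score(); // only the primitive leaves the method
    }

    static double published(double x) {
        ScoredVector sv = new ScoredVector(x * 0.5, 128);
        sink = sv; // stored in a static field: must live on the heap
        return sv.score();
    }

    // Bytes allocated by the current thread while running body `calls` times.
    static long allocatedBytes(DoubleUnaryOperator body, int calls) {
        com.sun.management.ThreadMXBean bean =
                (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
        long id = Thread.currentThread().getId();
        double acc = 0;
        long before = bean.getThreadAllocatedBytes(id);
        for (int i = 0; i < calls; i++) {
            acc += body.applyAsDouble(i);
        }
        long after = bean.getThreadAllocatedBytes(id);
        if (acc == 42.0) System.out.println(acc); // keeps the computed values live
        return after - before;
    }

    public static void main(String[] args) {
        int calls = 1_000_000;
        // Warm up so that both methods are likely compiled by C2 before measuring.
        for (int round = 0; round < 5; round++) {
            allocatedBytes(AllocationProbe::local, calls);
            allocatedBytes(AllocationProbe::published, calls);
        }
        System.out.println("local:     "
                + allocatedBytes(AllocationProbe::local, calls) / (double) calls + " B/call");
        System.out.println("published: "
                + allocatedBytes(AllocationProbe::published, calls) / (double) calls + " B/call");
    }
}
```

If scalar replacement applies, the local variant should report close to zero bytes per call after warm-up, while the published variant keeps allocating one object per call.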

Content Platform: Allocation Reduction in Practice

The content platform’s search ranking path allocates intermediate Score objects for each candidate article. With 500 candidates per query and 200 queries per second, that is 100,000 Score allocations per second.

// SLOW: Score objects escape into the sorted list
public List<Article> rankArticles(String query, List<Article> candidates) {
    List<Score> scores = new ArrayList<>(candidates.size());
    for (Article a : candidates) {
        double relevance = computeRelevance(query, a);
        double popularity = Math.log(a.viewCount() + 1);
        scores.add(new Score(a, relevance * 0.7 + popularity * 0.3));  // Heap allocated
    }
    scores.sort(Comparator.comparingDouble(Score::value).reversed());
    return scores.subList(0, Math.min(10, scores.size()))
                 .stream().map(Score::article).toList();
}

record Score(Article article, double value) {}

Every Score object is heap-allocated because it is added to the ArrayList. At 32 bytes each (the size with uncompressed references; with compressed oops a Score is 24 bytes), 100,000 allocations per second produce 3.2 MB/s of garbage from this single method.

// FAST: Avoid intermediate objects entirely
public List<Article> rankArticles(String query, List<Article> candidates) {
    int size = candidates.size();
    double[] scores = new double[size];    // Primitive array, no per-element overhead
    int[] indices = new int[size];

    for (int i = 0; i < size; i++) {
        Article a = candidates.get(i);
        double relevance = computeRelevance(query, a);
        double popularity = Math.log(a.viewCount() + 1);
        scores[i] = relevance * 0.7 + popularity * 0.3;
        indices[i] = i;
    }

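    // Partial sort: move the top 10 into positions 0..9 without sorting the entire array
    int k = Math.min(10, size);
    partialSort(scores, indices, k);
    // Quickselect leaves the top k in no particular order, so rank just that prefix
    sortPrefixDescending(scores, indices, k);

    List<Article> result = new ArrayList<>(k);
    for (int i = 0; i < k; i++) {
        result.add(candidates.get(indices[i]));
    }
    return result;
}

private void partialSort(double[] scores, int[] indices, int k) {
    // Quickselect: O(n) average instead of O(n log n)
    // Rearranges both arrays so that the top k by score are in positions 0..k-1 (unordered)
    int lo = 0, hi = scores.length - 1;
    while (lo < hi) {
        int pivot = partition(scores, indices, lo, hi);
        if (pivot == k) break;
        if (pivot < k) lo = pivot + 1;
        else hi = pivot - 1;
    }
}

private void sortPrefixDescending(double[] scores, int[] indices, int k) {
    // Insertion sort over the first k entries; k is at most 10, so this is cheap
    for (int i = 1; i < k; i++) {
        for (int j = i; j > 0 && scores[j] > scores[j - 1]; j--) {
            swap(scores, j, j - 1);
            swap(indices, j, j - 1);
        }
    }
}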

private int partition(double[] scores, int[] indices, int lo, int hi) {
    double pivot = scores[hi];
    int i = lo;
    for (int j = lo; j < hi; j++) {
        if (scores[j] >= pivot) {  // Descending order
            swap(scores, i, j);
            swap(indices, i, j);
            i++;
        }
    }
    swap(scores, i, hi);
    swap(indices, i, hi);
    return i;
}

private void swap(double[] arr, int i, int j) {
    double tmp = arr[i]; arr[i] = arr[j]; arr[j] = tmp;
}

private void swap(int[] arr, int i, int j) {
    int tmp = arr[i]; arr[i] = arr[j]; arr[j] = tmp;
}

The fast version allocates two primitive arrays (500 * 8 + 500 * 4 = 6,000 bytes, about 6 KB plus two array headers) instead of 500 Score objects (500 * 32 = 16,000 bytes, about 16 KB). More importantly, the primitive arrays are contiguous and cache-friendly, and because they hold no references the collector never has to scan their contents.

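The partial sort uses quickselect, which finds the top 10 in O(n) average time instead of the O(n log n) of a full sort. For 500 candidates, a full sort needs on the order of 500 * log2(500) ≈ 4,500 comparisons while quickselect averages a small multiple of 500, so the saving is a few thousand comparisons per query.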

Measured impact:

Before: 3.2 MB/s garbage from Score objects, 156 young GCs/min
After:  0.4 MB/s garbage from primitive arrays, 18 young GCs/min
p99 latency improvement: 22ms -> 14ms (fewer GC pauses)

The Escape Analysis Checklist

When optimizing hot paths in the content platform:

  1. Check allocation rate with -prof gc in JMH or async-profiler in allocation mode
  2. Identify the top allocating methods from the allocation flame graph
  3. For each allocation, ask: does this object escape?
    • Returned from the method? -> Escapes (unless the method is inlined into a caller that keeps the object local)
    • Stored in a field? -> Escapes
    • Added to a collection? -> Escapes
    • Passed to an un-inlined method? -> Probably escapes
    • Used only within the method and passed to inlined methods? -> Does not escape
  4. If the object escapes, can you restructure to avoid the allocation?
    • Use primitive arrays instead of object arrays
    • Use parallel primitive arrays instead of a list of records
    • Return primitives instead of wrapper objects
    • Use output parameters (mutable buffers) instead of returning new objects
  5. Verify with JMH -prof gc that gc.alloc.rate.norm decreased
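
The last restructuring option in step 4, output parameters, can be sketched as follows. ScoreBuffer and score are hypothetical names. The weights reuse the 0.7 and 0.3 from the ranking example, and the caller is assumed to own one buffer per thread, since a shared mutable buffer is not thread-safe.

```java
final class OutputBufferSketch {
    // Hypothetical reusable buffer: allocated once by the caller, overwritten on every call.
    static final class ScoreBuffer {
        double relevance;
        double popularity;
        double combined;
    }

    // Writes the results into `out` instead of returning a new object.
    static void score(double relevance, long viewCount, ScoreBuffer out) {
        out.relevance = relevance;
        out.popularity = Math.log(viewCount + 1);
        out.combined = out.relevance * 0.7 + out.popularity * 0.3;
    }

    public static void main(String[] args) {
        ScoreBuffer buffer = new ScoreBuffer(); // one allocation for the whole loop
        double best = Double.NEGATIVE_INFINITY;
        long[] viewCounts = {0, 10, 1_000};
        for (long views : viewCounts) {
            score(1.0, views, buffer);
            best = Math.max(best, buffer.combined);
        }
        System.out.println(best);
    }
}
```

The trade-off is readability and safety: the buffer's contents are only valid until the next call, so this pattern is best kept to measured hot paths.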

Escape analysis is the JIT’s most valuable optimization for allocation-heavy code. Understanding when it applies and when it fails is the difference between code that allocates nothing and code that creates millions of garbage objects per second.