JIT Compilation Internals: Inlining, Escape Analysis, and the Code the JVM Rewrites Behind You
The Java code you write is not the code that runs.
The C2 JIT compiler observes your code executing, profiles its behavior, and then compiles a version optimized for the patterns it observed. Methods are inlined. Allocations are eliminated. Virtual calls are devirtualized. Bounds checks are hoisted out of loops. The compiled code may bear little resemblance to your source.
This is why Java can be as fast as C++ for sustained workloads. The JIT has information that a static compiler lacks: actual runtime type profiles, branch frequencies, and call site behavior. It uses this information to make speculative optimizations that would be illegal for a static compiler.
But the JIT’s optimizations are conditional. They depend on type profiles remaining stable, methods remaining small enough to inline, and objects remaining local to methods. When these conditions break, the JIT cannot optimize, or worse, it deoptimizes code it previously compiled, causing a performance cliff.
Understanding JIT internals is not academic. It is the difference between code that runs at 2ns per operation and code that runs at 25ns per operation. Both versions look identical in the source.
The C2 Optimization Pipeline
The C2 compiler applies optimizations in phases. Each phase depends on the results of prior phases. The most important chain is:
- Type profiling (interpreter and C1): Record the concrete types observed at each call site
- Devirtualization (C2): Replace virtual/interface calls with direct calls based on type profiles
- Inlining (C2): Copy the callee’s code into the caller, eliminating call overhead
- Escape analysis (C2): Determine if objects allocated in inlined code escape the compilation unit
- Scalar replacement (C2): Replace non-escaping objects with their individual fields on the stack
- Other optimizations (C2): Loop unrolling, bounds check elimination, constant folding
This diagram shows the JIT optimization decision tree. When a hot method is compiled by C2, each call site is classified by its type profile. Monomorphic sites (one observed type) are devirtualized and inlined, enabling escape analysis and potential scalar replacement. Bimorphic sites (two types) get a type guard with both paths inlined. Megamorphic sites (three or more types) fall through to vtable/itable dispatch with no inlining and no escape analysis. The performance impact section shows the consequence: monomorphic at 2ns, bimorphic at 8ns, megamorphic at 25ns per operation.
The critical insight: inlining is the gateway optimization. Without inlining, escape analysis cannot see the object’s lifetime. Without escape analysis, objects are heap-allocated. Each blocked optimization cascades into the next.
Inlining: The Foundation
Inlining replaces a method call with the method’s body. This eliminates call overhead (stack frame creation, argument passing, return value handling) and, more importantly, exposes the callee’s code to the caller’s optimization context.
// Before inlining:
public long sumViewCounts(List<Article> articles) {
    long sum = 0;
    for (Article a : articles) {
        sum += a.getViewCount(); // virtual call, opaque to optimizer
    }
    return sum;
}
// After inlining getViewCount():
public long sumViewCounts(List<Article> articles) {
    long sum = 0;
    for (Article a : articles) {
        sum += a.viewCount; // direct field access, optimizer can see everything
    }
    return sum;
}
The JVM decides whether to inline based on three factors:
Method size: Limits are measured in bytes of bytecode, not in source lines. Methods up to -XX:MaxInlineSize=35 bytes are inlined even at call sites that are not hot, provided the other limits allow it. Methods up to -XX:FreqInlineSize=325 bytes are inlined if the call site is “hot” (frequently executed). Methods larger than that are normally not inlined.
Inline depth: The JVM limits inline nesting to -XX:MaxInlineLevel=15 (the default since JDK 14; earlier releases used 9). Method A inlines B, B inlines C, and so on. Beyond the depth limit, inlining stops.
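Total inlined size: The total compiled size of a method (including all inlined callees) is also capped. C2 stops inlining once the graph it is building passes NodeCountInliningCutoff intermediate representation nodes. In current HotSpot sources this defaults to 18000 and is a develop-level flag, so it cannot be changed on a product JVM.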
These limits exist because inlining has a cost: larger compiled methods consume more code cache and more compilation time. Inlining everything would exhaust the code cache and slow compilation.
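A minimal sketch for watching these decisions on a real JVM. The Article class and the loop counts here are illustrative assumptions, not part of the platform code; the accessor compiles to only a few bytes of bytecode, so it should sit well under MaxInlineSize.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the article type used in the text.
final class Article {
    private final long viewCount;

    Article(long viewCount) {
        this.viewCount = viewCount;
    }

    // Trivial accessor: a handful of bytes of bytecode.
    long getViewCount() {
        return viewCount;
    }
}

final class InlineDemo {
    static long sumViewCounts(List<Article> articles) {
        long sum = 0;
        for (Article a : articles) {
            sum += a.getViewCount();
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Article> articles = new ArrayList<>();
        for (int i = 0; i < 1_000; i++) {
            articles.add(new Article(i));
        }
        long total = 0;
        // Repeat the call often enough for the method to become hot.
        // The count is an assumption, not a documented threshold.
        for (int i = 0; i < 20_000; i++) {
            total += sumViewCounts(articles);
        }
        // Print the result so the work cannot be optimized away.
        System.out.println(total);
    }
}
```

Running this with -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining should list the inlining decision for getViewCount under sumViewCounts, though the exact wording of the output varies between JDK versions.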
Devirtualization and the Type Profile
Java methods are virtual by default. An interface call or a call on a non-final class requires dynamic dispatch: the JVM looks up the target method in the object’s vtable (for class methods) or itable (for interface methods). This indirection can cost 5-15ns per call because of the chain of dependent loads (load the klass pointer from the object header, load the vtable entry, call through the pointer), and it also hides the callee from the optimizer.
The JIT eliminates this overhead through devirtualization. While the method runs in the interpreter and in C1-compiled code, the JVM records the concrete types observed at each call site. If a call site always sees the same type, it is monomorphic. The C2 compiler replaces the virtual call with a direct call guarded by a type check:
// Source code:
ContentProcessor processor = getProcessor();
processor.process(article); // interface call
// C2 compiled (monomorphic, always sees HtmlProcessor):
if (processor.getClass() == HtmlProcessor.class) {
    // Direct call, inlined
    HtmlProcessor.process_inlined(article);
} else {
    // Uncommon trap -> deoptimize
    deoptimize();
}
The type check costs 1ns. The inlined direct call runs in the caller’s context with full optimization. As long as the speculation holds (the concrete type is always HtmlProcessor), the code runs at maximum speed.
When the speculation fails (a different type appears), the JVM triggers an “uncommon trap.” It deoptimizes the method, discards the compiled code, and falls back to the interpreter. The method is eventually recompiled with a broader type profile, but this recompilation takes time and the new code is less optimized.
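The effect of the type profile can be sketched with two identical loops that differ only in the receiver types they see. The processor classes below are hypothetical stand-ins, and the timing loop is a rough illustration rather than a rigorous benchmark (a harness such as JMH is the better tool for real measurements). The two loops are deliberately separate methods so that each call site gets its own profile.

```java
// Hypothetical processor types, named after the example in the text.
interface ContentProcessor {
    int process(int input);
}

final class HtmlProcessor implements ContentProcessor {
    public int process(int input) { return input + 1; }
}

final class MarkdownProcessor implements ContentProcessor {
    public int process(int input) { return input + 2; }
}

final class PlainTextProcessor implements ContentProcessor {
    public int process(int input) { return input + 3; }
}

final class CallSiteDemo {
    // Call site A: only ever receives HtmlProcessor.
    static long sumMonomorphic(ContentProcessor[] processors) {
        long sum = 0;
        for (int i = 0; i < processors.length; i++) {
            sum += processors[i].process(i);
        }
        return sum;
    }

    // Call site B: same code, but it receives three receiver types.
    static long sumMegamorphic(ContentProcessor[] processors) {
        long sum = 0;
        for (int i = 0; i < processors.length; i++) {
            sum += processors[i].process(i);
        }
        return sum;
    }

    static ContentProcessor[] oneType(int n) {
        ContentProcessor[] result = new ContentProcessor[n];
        for (int i = 0; i < n; i++) {
            result[i] = new HtmlProcessor();
        }
        return result;
    }

    static ContentProcessor[] threeTypes(int n) {
        ContentProcessor[] result = new ContentProcessor[n];
        for (int i = 0; i < n; i++) {
            switch (i % 3) {
                case 0: result[i] = new HtmlProcessor(); break;
                case 1: result[i] = new MarkdownProcessor(); break;
                default: result[i] = new PlainTextProcessor(); break;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        ContentProcessor[] mono = oneType(10_000);
        ContentProcessor[] mega = threeTypes(10_000);
        long checksum = 0;
        // Warm-up so both methods are compiled before timing starts.
        for (int i = 0; i < 2_000; i++) {
            checksum += sumMonomorphic(mono);
            checksum += sumMegamorphic(mega);
        }
        int rounds = 2_000;
        long t0 = System.nanoTime();
        for (int i = 0; i < rounds; i++) checksum += sumMonomorphic(mono);
        long t1 = System.nanoTime();
        for (int i = 0; i < rounds; i++) checksum += sumMegamorphic(mega);
        long t2 = System.nanoTime();
        double calls = (double) rounds * mono.length;
        System.out.printf("monomorphic: %.2f ns/call%n", (t1 - t0) / calls);
        System.out.printf("megamorphic: %.2f ns/call%n", (t2 - t1) / calls);
        System.out.println("checksum: " + checksum);
    }
}
```

The absolute numbers depend on the CPU and JDK version, so treat the output as a relative comparison only.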
Escape Analysis: Eliminating Allocations
Escape analysis determines whether an object allocated inside a method can be accessed from outside that method. If the object does not escape, the JVM can:
- Scalar replace it: decompose the object into its fields and keep them in CPU registers or stack slots
- Avoid the heap allocation: HotSpot’s C2 achieves this through scalar replacement rather than by placing the whole object on the stack (eliminates GC pressure)
- Eliminate synchronization: remove locks on objects that never escape the thread
The three escape states are:
- NoEscape: Object is only used within the allocating method (including inlined callees). Eligible for scalar replacement.
- ArgEscape: Object is passed to a callee that is not inlined but does not escape the thread. It is still heap-allocated, but locks on it can be removed.
- GlobalEscape: Object is stored in a field, returned from the method, or otherwise accessible beyond the compilation unit. Must be heap-allocated.
// Object does NOT escape -> scalar replaced (no allocation)
public double distance(double x1, double y1, double x2, double y2) {
    Point p = new Point(x2 - x1, y2 - y1); // NoEscape
    return Math.sqrt(p.x * p.x + p.y * p.y);
}
// After scalar replacement:
public double distance(double x1, double y1, double x2, double y2) {
    double p_x = x2 - x1; // p.x on stack
    double p_y = y2 - y1; // p.y on stack
    return Math.sqrt(p_x * p_x + p_y * p_y);
    // No Point object ever allocated
}
Escape analysis depends on inlining. If the callee that receives the object is not inlined, the JIT cannot prove the object does not escape. This is why megamorphic call sites block escape analysis: the call is not inlined, so the JIT must assume the worst.
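One way to see the difference is to measure per-thread allocation around an escaping and a non-escaping variant. The sketch below assumes a HotSpot-based JVM, because it casts to com.sun.management.ThreadMXBean to read allocated bytes; the Vec class, the static sink field, and the loop counts are illustrative assumptions.

```java
import java.lang.management.ManagementFactory;

// Hypothetical value holder, equivalent to the Point in the text.
final class Vec {
    final double x;
    final double y;

    Vec(double x, double y) {
        this.x = x;
        this.y = y;
    }
}

final class EscapeDemo {
    // Anything stored in a static field escapes globally.
    static Vec sink;

    // The Vec never leaves this method, so it is a scalar replacement candidate.
    static double distance(double x1, double y1, double x2, double y2) {
        Vec p = new Vec(x2 - x1, y2 - y1);
        return Math.sqrt(p.x * p.x + p.y * p.y);
    }

    // Same computation, but the Vec is published and must live on the heap.
    static double distanceEscaping(double x1, double y1, double x2, double y2) {
        Vec p = new Vec(x2 - x1, y2 - y1);
        sink = p;
        return Math.sqrt(p.x * p.x + p.y * p.y);
    }

    // HotSpot-specific: bytes allocated so far by the current thread.
    static long allocatedBytes() {
        com.sun.management.ThreadMXBean bean =
                (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
        return bean.getThreadAllocatedBytes(Thread.currentThread().getId());
    }

    public static void main(String[] args) {
        double checksum = 0;
        // Warm-up so both methods are compiled before measuring.
        for (int i = 0; i < 200_000; i++) {
            checksum += distance(0, 0, i, 1);
            checksum += distanceEscaping(0, 0, i, 1);
        }
        int calls = 1_000_000;
        long before = allocatedBytes();
        for (int i = 0; i < calls; i++) checksum += distance(0, 0, i, 1);
        long middle = allocatedBytes();
        for (int i = 0; i < calls; i++) checksum += distanceEscaping(0, 0, i, 1);
        long after = allocatedBytes();
        System.out.println("non-escaping bytes/call: " + (double) (middle - before) / calls);
        System.out.println("escaping bytes/call:     " + (double) (after - middle) / calls);
        System.out.println("checksum: " + checksum);
    }
}
```

Running the same program with -XX:-DoEscapeAnalysis gives a baseline in which both variants allocate, which helps confirm that any difference comes from escape analysis rather than from the measurement itself.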
Intrinsics: When the JVM Replaces Your Code
The JVM recognizes certain method calls and replaces them with hand-tuned assembly. These “intrinsics” are faster than anything the JIT could generate from the Java source. The JVM ships with hundreds of intrinsics.
Common intrinsics relevant to the content platform:
// String.equals() -> vectorized comparison using SIMD
// Arrays.copyOf() -> optimized memcpy
// Math.min(), Math.max() -> conditional move (no branch)
// Integer.bitCount() -> POPCNT instruction
// System.arraycopy() -> hand-tuned copy loop
// String.hashCode() -> unrolled multiply-accumulate
// Objects.checkIndex() -> implicit bounds check elimination
You will rarely beat an intrinsic with hand-written Java. The intrinsic maps directly to a CPU instruction or a hand-tuned assembly sequence. A hand-written copy loop is at best recognized by the JIT and turned into the same copy routine, and otherwise it is slower than System.arraycopy().
// SLOW: Manual array copy
public static void copyArray(int[] src, int[] dst, int len) {
    for (int i = 0; i < len; i++) {
        dst[i] = src[i];
    }
}
// FAST: Intrinsic array copy
public static void copyArray(int[] src, int[] dst, int len) {
    System.arraycopy(src, 0, dst, 0, len);
}
For small arrays, the difference is 2-3x. For large arrays (> 1KB), the intrinsic uses AVX2/AVX-512 instructions to copy 32 or 64 bytes per cycle, achieving memory-bandwidth-limited throughput.
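Because an intrinsic must behave exactly like the Java method it replaces, switching to one is a drop-in change. The sketch below checks that equivalence for two of the intrinsics listed above; the class and method names are illustrative, and it demonstrates correctness only, not speed.

```java
import java.util.Arrays;

final class IntrinsicDemo {
    // Hand-written copy loop.
    static int[] copyManual(int[] src) {
        int[] dst = new int[src.length];
        for (int i = 0; i < src.length; i++) {
            dst[i] = src[i];
        }
        return dst;
    }

    // Same result through the intrinsic candidate.
    static int[] copyIntrinsic(int[] src) {
        int[] dst = new int[src.length];
        System.arraycopy(src, 0, dst, 0, src.length);
        return dst;
    }

    // Hand-written population count: clears the lowest set bit per iteration.
    static int bitCountManual(int value) {
        int count = 0;
        while (value != 0) {
            value &= value - 1;
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        int[] data = {5, 3, 8, 13, 21};
        System.out.println(Arrays.equals(copyManual(data), copyIntrinsic(data)));
        System.out.println(bitCountManual(0xFF) == Integer.bitCount(0xFF));
    }
}
```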
Deoptimization: The Performance Cliff
Deoptimization occurs when the JIT’s speculative optimizations are proven wrong at runtime. The JVM discards the compiled code, restores the interpreter state, and continues execution in the interpreter.
Common triggers:
- Type profile change: A monomorphic call site sees a new type
- Class loading: A new subclass is loaded, invalidating CHA (Class Hierarchy Analysis) assumptions
- Unreached path: Execution enters a branch or exception path the JIT compiled as “never taken”, hitting the uncommon trap placed there
- Method replacement: A class is redefined (e.g., hot-reload in development)
// Code that triggers deoptimization:
public long processArticles(List<ContentProcessor> processors) {
    long total = 0;
    for (ContentProcessor p : processors) {
        total += p.process(); // Initially monomorphic (HtmlProcessor)
    }
    return total;
}
// After adding a new processor type to the list:
// The call site becomes bimorphic or megamorphic
// C2 deoptimizes, recompiles with broader type profile
// Performance drops temporarily (100-500ms of interpreter execution)
In the content platform, deoptimization is triggered during deployment when new processor types are loaded. The service runs in the interpreter for a few hundred milliseconds while C2 recompiles hot methods. This is visible as a latency spike in the first few seconds after deployment.
The mitigation: warm up the application before routing traffic to it. Run a health check that exercises all code paths, ensuring the JIT has compiled hot methods before the instance receives production load.
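A warm-up gate along these lines could look like the following sketch. The JitWarmup class and its method names are hypothetical, and the iteration count is something to tune by measurement rather than a documented threshold. The readiness flag is what a health check endpoint would report to the load balancer.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical warm-up helper: runs each hot path repeatedly, then reports ready.
final class JitWarmup {
    private final AtomicBoolean ready = new AtomicBoolean(false);

    // Executes every supplied path 'iterations' times before flipping the flag.
    void run(List<Runnable> hotPaths, int iterations) {
        for (Runnable path : hotPaths) {
            for (int i = 0; i < iterations; i++) {
                path.run();
            }
        }
        ready.set(true);
    }

    // A health check would return success only once this is true.
    boolean isReady() {
        return ready.get();
    }
}
```

The warm-up inputs should resemble production traffic, including every processor type the service will see. Warming up with a single type builds a monomorphic profile that production traffic then invalidates, which reproduces the deoptimization the warm-up was meant to avoid.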
Observing JIT Behavior
The -XX:+PrintCompilation flag (or -Xlog:jit+compilation=debug with unified logging) shows what the JIT compiles:
567 42 % 4 com.platform.ArticleService::serveArticle @ 23 (245 bytes)
589 43 4 com.platform.ArticleService::buildResponse (89 bytes)
612 44 4 com.platform.ArticleCache::get (12 bytes)
1234 45 4 com.platform.ArticleService::serveArticle @ 23 (245 bytes) made not entrant
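Column 1: Timestamp (ms since JVM start). Column 2: Compilation ID. Next come optional attribute flags, where the % symbol indicates on-stack replacement (compiling a loop while the method is executing) and the @ 23 after the method name is the bytecode index where the replacement code is entered. The following column is the tier (4 = C2). made not entrant means the compiled version was invalidated, so no new calls enter it (deoptimization).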
The -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining flag shows what was inlined:
@ 15 com.platform.ArticleCache::get (12 bytes) inline (hot)
@ 23 com.platform.Article::getTitle (5 bytes) accessor
@ 28 com.platform.Article::getContent (5 bytes) accessor
@ 45 com.platform.ContentProcessor::process (156 bytes) inline (hot)
@ 67 java.lang.StringBuilder::<init> (7 bytes) inline (hot)
@ 89 com.platform.RecommendationEngine::rank (412 bytes) too big
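too big on the rank line means the callee exceeded the applicable size limit and was not inlined; at 412 bytes it is above FreqInlineSize, so it would be rejected even at a hot call site. If rank allocates objects that would otherwise be scalar-replaced, this inlining failure cascades into unnecessary allocations.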
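JIT activity can also be observed from inside the process through the standard java.lang.management API. The sketch below reads the cumulative compilation time before and after a hot loop; the probe class and the work method are illustrative assumptions, and the reported value is approximate and may not be supported on every JVM.

```java
import java.lang.management.CompilationMXBean;
import java.lang.management.ManagementFactory;

final class JitActivityProbe {
    // Returns accumulated JIT compilation time in ms, or -1 if unavailable.
    static long totalCompilationMillis() {
        CompilationMXBean bean = ManagementFactory.getCompilationMXBean();
        if (bean == null || !bean.isCompilationTimeMonitoringSupported()) {
            return -1;
        }
        return bean.getTotalCompilationTime();
    }

    // Arbitrary CPU-bound work that becomes hot when called repeatedly.
    static long work(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += i % 7;
        }
        return sum;
    }

    public static void main(String[] args) {
        long before = totalCompilationMillis();
        long checksum = 0;
        for (int i = 0; i < 50_000; i++) {
            checksum += work(1_000);
        }
        long after = totalCompilationMillis();
        System.out.println("compilation time before: " + before + " ms");
        System.out.println("compilation time after:  " + after + " ms");
        System.out.println("checksum: " + checksum);
    }
}
```

Exporting this value as a metric is one way to tell when post-deployment compilation has settled, since the counter stops climbing once the hot methods are compiled.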
Writing JIT-Friendly Code
The practical guidelines for writing code the JIT can optimize:
Keep hot methods small. Methods under 325 bytes of bytecode are candidates for inlining at hot call sites. If a method is too large, extract the cold paths (error handling, logging) into separate methods.
Prefer monomorphic call sites. Use concrete types in hot loops. If you must use interfaces, ensure the hot path sees one or two implementations, not three or more.
Avoid unnecessary abstraction in hot paths. Each layer of abstraction is a method call the JIT must inline. Frameworks that wrap every call in interceptors, decorators, and proxies create deep inline chains that exceed the inline depth limit.
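Use final and sealed where appropriate. final classes and methods cannot be overridden, so the JIT can bind the call directly without a type guard or a class-hierarchy assumption that later class loading could invalidate. sealed types limit the possible implementations, which helps keep the set of types seen at a call site small and stable.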
// SLOW: Deep abstraction chain in hot path
public Article serve(String id) {
    return pipeline.execute(              // Layer 1: Pipeline
        new FetchStage(                   // Layer 2: Stage
            new CacheDecorator(           // Layer 3: Decorator
                new ArticleRepository(    // Layer 4: Repository
                    dataSource            // Layer 5: DataSource
                )
            )
        ),
        id
    );
}
// FAST: Direct call in hot path
public Article serve(String id) {
    Article cached = cache.get(id);
    if (cached != null) return cached;
    Article article = repository.findById(id);
    cache.put(id, article);
    return article;
}
The slow version requires inlining through 5 layers. If any layer exceeds the inline size limit or introduces a megamorphic call site, the entire chain loses optimization.
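The fast version has three call sites (cache.get, repository.findById, cache.put), each with a single expected receiver type and each a candidate for inlining. When the JIT inlines cache.get and sees that a temporary wrapper allocated there does not escape, it can scalar-replace the cache entry wrapper. The cache-hit path then compiles to little more than a hash lookup and a conditional branch.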
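The advice to extract cold paths can be sketched as follows. ViewCounter is a hypothetical class; the point is that the exception message is built in a separate method, so the string concatenation bytecode does not count toward the size of the hot method.

```java
// Hypothetical counter illustrating cold-path extraction.
final class ViewCounter {
    private final long[] counts;

    ViewCounter(int articleCount) {
        this.counts = new long[articleCount];
    }

    // Hot path: a range check, an increment, and a return.
    long increment(int articleIndex) {
        if (articleIndex < 0 || articleIndex >= counts.length) {
            throw unknownArticle(articleIndex);
        }
        return ++counts[articleIndex];
    }

    // Cold path: message formatting lives here, outside the hot method.
    private IllegalArgumentException unknownArticle(int articleIndex) {
        return new IllegalArgumentException(
                "Unknown article index " + articleIndex
                        + " (valid range: 0.." + (counts.length - 1) + ")");
    }
}
```

The class is also declared final, in line with the guideline above, so calls to increment can be bound without a type guard.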
The JIT compiler is your most powerful optimizer. Write code that lets it work.