fast by design

False Sharing and the @Contended Annotation

Chapter 18 of 90

False sharing is one of the most insidious performance problems in concurrent Java. Two threads update independent variables that happen to share the same cache line. Neither thread modifies the other’s data. But because the CPU cache coherence protocol operates at cache line granularity, every write by one thread invalidates the entire cache line on the other thread’s core. Both threads effectively serialize on cache line ownership.

The symptom: adding more threads makes performance worse. The cause: invisible cache line contention on data that appears independent.

The MESI Protocol and Cache Line Invalidation

Modern multi-core CPUs maintain cache coherence with the MESI (Modified, Exclusive, Shared, Invalid) protocol or a close variant such as MESIF or MOESI. In the basic protocol, each cache line in a core's cache is in one of four states:

  • Modified: This core has the only valid copy, and it has been written to
  • Exclusive: This core has the only valid copy, but it has not been written to
  • Shared: Multiple cores have valid read-only copies
  • Invalid: This cache line is not valid (must be re-fetched from memory or another core’s cache)

When core A writes to a cache line in the Shared state, it must:

  1. Send an invalidation message to all other cores holding that cache line
  2. Wait for acknowledgment from each core
  3. Transition the cache line to Modified state
  4. Perform the write

This invalidation round-trip takes 40-100 nanoseconds on modern hardware, depending on the interconnect latency (within a socket vs across sockets).
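The write sequence above can be traced with a toy model. The following is a minimal sketch, not a real coherence implementation: the class name MesiLine and its methods are hypothetical, it tracks a single cache line, and it ignores timing, acknowledgments, and write-back to memory.

```java
import java.util.Arrays;

/** Toy model of one cache line tracked across several cores (illustrative only). */
final class MesiLine {
    enum State { MODIFIED, EXCLUSIVE, SHARED, INVALID }

    private final State[] states;   // one MESI state per core for this single line
    private int invalidationsSent;  // invalidation messages caused by writes so far

    MesiLine(int cores) {
        states = new State[cores];
        Arrays.fill(states, State.INVALID);
    }

    /** A read miss fetches the line: Exclusive if no other core holds it, otherwise Shared. */
    void read(int core) {
        if (states[core] != State.INVALID) return;   // hit: no state change
        boolean othersHold = false;
        for (int i = 0; i < states.length; i++) {
            if (i != core && states[i] != State.INVALID) {
                states[i] = State.SHARED;            // Modified/Exclusive holders downgrade
                othersHold = true;
            }
        }
        states[core] = othersHold ? State.SHARED : State.EXCLUSIVE;
    }

    /** A write invalidates every other valid copy, then this core's copy becomes Modified. */
    void write(int core) {
        for (int i = 0; i < states.length; i++) {
            if (i != core && states[i] != State.INVALID) {
                states[i] = State.INVALID;           // steps 1-2: invalidate (ack is implicit here)
                invalidationsSent++;
            }
        }
        states[core] = State.MODIFIED;               // steps 3-4: take ownership and write
    }

    State stateOf(int core) { return states[core]; }
    int invalidationsSent() { return invalidationsSent; }
}
```

In this model, alternating write(0) and write(1) calls increase the invalidation counter on every call. That ping-pong between two cores is the pattern false sharing produces.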

False sharing occurs when two variables on the same cache line are written by different cores. Each write triggers the invalidation sequence, even though the two variables are logically independent.

Reproducing False Sharing

This benchmark demonstrates false sharing with two counters that share a cache line:

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.Throughput)
@Warmup(iterations = 5, time = 3, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 10, time = 5, timeUnit = TimeUnit.SECONDS)
@Fork(value = 3)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Group)
public class FalseSharingBenchmark {

    // SLOW: Two counters adjacent in memory -> same cache line
    private volatile long counter1;
    private volatile long counter2;

    @Benchmark
    @Group("falseSharing")
    @GroupThreads(1)
    public long incrementCounter1() {
        return ++counter1;
    }

    @Benchmark
    @Group("falseSharing")
    @GroupThreads(1)
    public long incrementCounter2() {
        return ++counter2;
    }
}

The two volatile long fields are 8 bytes each. They are adjacent in the object layout, both within the same 64-byte cache line. Thread 1 writes counter1, invalidating the cache line on Thread 2’s core. Thread 2 writes counter2, invalidating the cache line on Thread 1’s core. Every increment requires a cross-core cache line transfer.
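Whether two fields share a line comes down to integer division of their addresses by the line size. The sketch below illustrates that arithmetic; the helper name CacheLineMath, the 64-byte line size, and the field offsets 16 and 24 used in the usage note are assumptions for illustration, not values read from a real JVM.

```java
/** Cache-line arithmetic for field addresses; line size and offsets are assumptions. */
final class CacheLineMath {
    static final int LINE_SIZE = 64;   // common on x86-64; other CPUs may differ

    /** Index of the cache line that contains the given (non-negative) address. */
    static long lineOf(long address) {
        return address / LINE_SIZE;
    }

    /** True if two fields of an object starting at objectBase fall on the same cache line. */
    static boolean sameLine(long objectBase, int offsetA, int offsetB) {
        return lineOf(objectBase + offsetA) == lineOf(objectBase + offsetB);
    }
}
```

With an object base of 0, sameLine(0, 16, 24) is true: both longs sit in line 0. Two fields whose offsets differ by 64 bytes or more, such as 16 and 80, can never share a line, whatever the object's base address. Adjacent fields can occasionally straddle a line boundary (for example at base 40), so the unpadded layout shares a line for most object placements but not all.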

Now add padding to separate them:

@BenchmarkMode(Mode.Throughput)
@Warmup(iterations = 5, time = 3, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 10, time = 5, timeUnit = TimeUnit.SECONDS)
@Fork(value = 3)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Group)
public class FalseSharingFixedBenchmark {

    // FAST: Counters separated by padding -> different cache lines
    private volatile long counter1;
    private long p1, p2, p3, p4, p5, p6, p7;  // 56 bytes of padding
    private volatile long counter2;

    @Benchmark
    @Group("noPadding")
    @GroupThreads(1)
    public long incrementCounter1() {
        return ++counter1;
    }

    @Benchmark
    @Group("noPadding")
    @GroupThreads(1)
    public long incrementCounter2() {
        return ++counter2;
    }
}

Results:

Benchmark                                          Mode  Cnt      Score    Error   Units
FalseSharingBenchmark.falseSharing                 thrpt  30   48,234 ±  1,234   ops/ms
FalseSharingFixedBenchmark.noPadding               thrpt  30  312,567 ±  5,678   ops/ms

The padded version is about 6.5x faster (312,567 vs 48,234 ops/ms). The false-sharing version runs at cache invalidation speed. The padded version runs at L1 cache write speed.

The @Contended Annotation

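Manual padding is fragile. The JVM is free to reorder fields, and a future JVM could pack or eliminate unused padding fields. HotSpot provides the @Contended annotation as a dedicated mechanism: sun.misc.Contended in Java 8, jdk.internal.vm.annotation.Contended from Java 9 onward. Because the newer annotation lives in an internal package, application code also has to be compiled with --add-exports java.base/jdk.internal.vm.annotation=ALL-UNNAMED.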

import jdk.internal.vm.annotation.Contended;

@BenchmarkMode(Mode.Throughput)
@Warmup(iterations = 5, time = 3, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 10, time = 5, timeUnit = TimeUnit.SECONDS)
@Fork(value = 3, jvmArgs = {"-XX:-RestrictContended"})  // Required to use @Contended outside JDK
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Group)
public class ContendedBenchmark {

    @Contended
    private volatile long counter1;

    @Contended
    private volatile long counter2;

    @Benchmark
    @Group("contended")
    @GroupThreads(1)
    public long incrementCounter1() {
        return ++counter1;
    }

    @Benchmark
    @Group("contended")
    @GroupThreads(1)
    public long incrementCounter2() {
        return ++counter2;
    }
}

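By default, @Contended tells HotSpot to add 128 bytes of padding around the annotated field; the width can be changed with -XX:ContendedPaddingWidth. That is twice the usual 64-byte cache line, so the field sits on a line of its own with padding on both sides. This prevents false sharing with adjacent fields and leaves headroom for CPUs that prefetch cache lines in adjacent pairs.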

The -XX:-RestrictContended flag is required to use @Contended in application code. Without it, the annotation is silently ignored outside the JDK’s own classes.

Results:

Benchmark                                 Mode  Cnt      Score    Error   Units
ContendedBenchmark.contended              thrpt  30  308,912 ±  4,567   ops/ms

Performance matches the manual padding version (308k vs 312k ops/ms), confirming that @Contended provides the same cache line isolation.

@Contended with Grouping

@Contended supports a group parameter. Fields in the same group are placed together (they share a cache line) but are padded away from other groups and non-contended fields.

public class ViewCounters {
    // Group "reads": these two fields are accessed together by read threads
    @Contended("reads")
    private volatile long totalPageViews;
    @Contended("reads")
    private volatile long uniqueVisitors;

    // Group "writes": this field is updated by write threads
    @Contended("writes")
    private volatile long articleUpdates;
}

totalPageViews and uniqueVisitors share a cache line (both are read by the same threads). articleUpdates is on a separate cache line because it is written by different threads. Without grouping, all three fields would each get their own cache line, wasting 128 bytes of padding per field.

False Sharing in the Content Platform

The content platform tracks real-time view counts for articles. The view counter is updated by every serving thread for every page view. The analytics aggregator reads the counters periodically.

// SLOW: View counter with false sharing
public class ViewCounterService {
    // Four small AtomicLong objects allocated back to back -> likely the same or adjacent cache lines
    private final AtomicLong totalViews = new AtomicLong();
    private final AtomicLong uniqueArticlesViewed = new AtomicLong();
    private final AtomicLong searchQueries = new AtomicLong();
    private final AtomicLong recommendationClicks = new AtomicLong();

    // Called by every request thread
    public void recordPageView(String articleId) {
        totalViews.incrementAndGet();        // Writes to cache line X
        uniqueArticlesViewed.incrementAndGet(); // Writes to cache line X (adjacent!)
    }

    // Called by different threads
    public void recordSearch() {
        searchQueries.incrementAndGet();     // Writes to cache line X or X+1
    }

    public void recordClick() {
        recommendationClicks.incrementAndGet(); // Writes to cache line X or X+1
    }
}

AtomicLong objects are small: on 64-bit HotSpot with compressed class pointers, a 12-byte header, 4 bytes of alignment, and the 8-byte value come to 24 bytes. Objects allocated consecutively usually sit next to each other on the heap, so two AtomicLong objects fit in the same 64-byte cache line. Four counters, written by different thread groups, end up packed into two or three cache lines. Maximum contention.

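// FAST: Padded counters, each on its own cache line
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import jdk.internal.vm.annotation.Contended;

public class ViewCounterService {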

    @Contended
    private volatile long totalViews;
    @Contended
    private volatile long uniqueArticlesViewed;
    @Contended
    private volatile long searchQueries;
    @Contended
    private volatile long recommendationClicks;

    // Use VarHandle for atomic operations on the fields directly
    private static final VarHandle TOTAL_VIEWS;
    private static final VarHandle UNIQUE_VIEWS;
    private static final VarHandle SEARCH_QUERIES;
    private static final VarHandle RECOMMENDATION_CLICKS;

    static {
        try {
            var lookup = MethodHandles.lookup();
            TOTAL_VIEWS = lookup.findVarHandle(ViewCounterService.class, "totalViews", long.class);
            UNIQUE_VIEWS = lookup.findVarHandle(ViewCounterService.class, "uniqueArticlesViewed", long.class);
            SEARCH_QUERIES = lookup.findVarHandle(ViewCounterService.class, "searchQueries", long.class);
            RECOMMENDATION_CLICKS = lookup.findVarHandle(ViewCounterService.class, "recommendationClicks", long.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public void recordPageView(String articleId) {
        TOTAL_VIEWS.getAndAdd(this, 1L);
        UNIQUE_VIEWS.getAndAdd(this, 1L);
    }

    public void recordSearch() {
        SEARCH_QUERIES.getAndAdd(this, 1L);
    }

    public void recordClick() {
        RECOMMENDATION_CLICKS.getAndAdd(this, 1L);
    }

    public long getTotalViews() { return totalViews; }
    public long getUniqueArticlesViewed() { return uniqueArticlesViewed; }
    public long getSearchQueries() { return searchQueries; }
    public long getRecommendationClicks() { return recommendationClicks; }
}

This version uses @Contended on each field and VarHandle for atomic operations, avoiding the overhead of four separate AtomicLong objects. Each counter occupies its own cache line.
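The VarHandle part of this pattern can be tried on its own, without @Contended or any JVM flags. The following is a minimal sketch under that assumption: HitCounter and runDemo are hypothetical names, and the field is not padded, so it shows only the atomic-update mechanism and not the cache line isolation.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

/** Hypothetical single-counter class: VarHandle-based atomic increments on a volatile field. */
final class HitCounter {
    private volatile long hits;

    private static final VarHandle HITS;
    static {
        try {
            HITS = MethodHandles.lookup().findVarHandle(HitCounter.class, "hits", long.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    /** Atomically adds one; unlike hits++ this is safe with many writer threads. */
    void record() {
        HITS.getAndAdd(this, 1L);
    }

    long hits() { return hits; }

    /** Starts the given number of threads, each calling record() perThread times. */
    static long runDemo(int threads, int perThread) {
        HitCounter counter = new HitCounter();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int n = 0; n < perThread; n++) counter.record();
            });
            workers[i].start();
        }
        try {
            for (Thread t : workers) t.join();   // join makes the final read see all updates
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return counter.hits();
    }
}
```

Because getAndAdd is an atomic read-modify-write, the final count equals threads times perThread. A plain hits++ on the volatile field would lose updates under the same load.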

Benchmarking the Fix

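import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import jdk.internal.vm.annotation.Contended;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.Throughput)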
@Warmup(iterations = 5, time = 3, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 10, time = 5, timeUnit = TimeUnit.SECONDS)
@Fork(value = 3, jvmArgs = {"-XX:-RestrictContended"})
@Threads(8)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
public class ViewCounterBenchmark {

    private ViewCounterSlow slowCounter;
    private ViewCounterFast fastCounter;

    @Setup
    public void setup() {
        slowCounter = new ViewCounterSlow();
        fastCounter = new ViewCounterFast();
    }

    @Benchmark
    public void slowIncrement() {
        slowCounter.totalViews.incrementAndGet();
    }

    @Benchmark
    public void fastIncrement() {
        ViewCounterFast.TOTAL_VIEWS.getAndAdd(fastCounter, 1L);
    }

    static class ViewCounterSlow {
        final AtomicLong totalViews = new AtomicLong();
        final AtomicLong uniqueViews = new AtomicLong();  // Adjacent, causes false sharing
        final AtomicLong searchQueries = new AtomicLong();
        final AtomicLong clicks = new AtomicLong();
    }

    static class ViewCounterFast {
        @Contended volatile long totalViews;
        @Contended volatile long uniqueViews;
        @Contended volatile long searchQueries;
        @Contended volatile long clicks;

        static final VarHandle TOTAL_VIEWS;
        static {
            try {
                TOTAL_VIEWS = MethodHandles.lookup()
                    .findVarHandle(ViewCounterFast.class, "totalViews", long.class);
            } catch (ReflectiveOperationException e) {
                throw new ExceptionInInitializerError(e);
            }
        }
    }
}

Results with 8 threads:

Benchmark                            Mode  Cnt       Score     Error   Units
ViewCounterBenchmark.slowIncrement   thrpt  30   124,567 ±   3,456   ops/ms
ViewCounterBenchmark.fastIncrement   thrpt  30   845,234 ±  12,345   ops/ms

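The padded version is 6.8x faster with 8 threads. With more threads (16, 32), the ratio increases because more cores contend on the shared cache line. One caveat when reading these numbers: every benchmark thread increments the same totalViews counter and the neighboring counters are never written, so much of the contention measured here is true sharing on one counter, which padding does not remove. Giving each thread its own counter, as the grouped benchmarks above do, isolates the false-sharing effect.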

When False Sharing Does Not Matter

False sharing is only a problem when:

  1. Multiple threads write to fields on the same cache line
  2. The writes are frequent (thousands or millions per second)
  3. The threads run on different cores

Read-only access does not cause false sharing. The MESI protocol allows multiple cores to hold a cache line in the Shared state. Only writes trigger invalidation.

Fields that are written rarely (configuration, startup-initialized values) do not benefit from @Contended padding. The occasional invalidation is amortized over millions of reads.

Single-threaded code cannot have false sharing. Only one core is touching the data at any time, so there is no cross-core invalidation.

Detecting False Sharing

False sharing is hard to detect from application-level metrics. The symptoms are:

  1. Throughput does not scale with threads: Adding threads produces less-than-linear improvement, or negative improvement
  2. High CPU utilization with low throughput: CPUs are busy doing cache coherence work, not application work
  3. perf stat shows high L1-dcache-load-misses: Cache lines are being invalidated and re-fetched

Use perf stat to check for abnormal cache miss rates:

perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses \
    java -jar content-platform.jar

A cache miss rate above 5% on L1 data cache, combined with high CPU utilization and lower-than-expected throughput, suggests false sharing.
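The miss rate is the ratio of the two L1 counters that perf stat prints. As a small worked sketch, with a hypothetical helper name and made-up counter values, the check against the 5% rule of thumb looks like this:

```java
/** Miss-rate arithmetic for perf stat counters; the 5% threshold is the heuristic from the text. */
final class PerfStatMath {
    static final double L1_MISS_THRESHOLD = 0.05;

    /** Fraction of L1 data cache loads that missed. */
    static double missRate(long loads, long loadMisses) {
        if (loads <= 0) throw new IllegalArgumentException("loads must be positive");
        return (double) loadMisses / loads;
    }

    /** True if the miss rate is above the rule-of-thumb threshold. */
    static boolean looksSuspicious(long loads, long loadMisses) {
        return missRate(loads, loadMisses) > L1_MISS_THRESHOLD;
    }
}
```

For example, 1,000,000 loads with 80,000 misses gives a rate of 0.08, which is above the threshold, while 20,000 misses gives 0.02, which is not. A high rate alone does not prove false sharing; it is one signal to combine with the others listed above.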

Use perf c2c (cache-to-cache) for definitive diagnosis:

perf c2c record -- java -jar content-platform.jar
perf c2c report

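perf c2c reports cache lines that experience frequent cross-core transfers, identifying the exact memory addresses involved. Mapping those addresses back to Java fields takes an object-layout tool such as JOL (Java Object Layout), which prints field offsets and object addresses. -XX:+PrintFlagsFinal only lists JVM flag values (useful for confirming ContendedPaddingWidth and RestrictContended), not object locations.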

The @Contended Checklist

  1. Identify hot concurrent counters: Any volatile field or Atomic* variable written by multiple threads at high frequency
  2. Verify false sharing: Use perf c2c or benchmark with/without padding
  3. Apply @Contended: Add the annotation and -XX:-RestrictContended
  4. Consider grouping: Fields accessed by the same thread group share a group
  5. Measure the improvement: JMH throughput benchmark with production thread count
  6. Accept the memory cost: Each @Contended field wastes 128 bytes. For a few counters, this is negligible. For millions of objects, it is prohibitive.
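The memory cost in item 6 is simple multiplication. The sketch below uses the 128 bytes per field quoted in the checklist as an assumption; the real overhead depends on the JVM's padding width and on how fields are grouped, and the class name ContendedCost is hypothetical.

```java
/** Back-of-the-envelope padding cost, using the 128 bytes per @Contended field from the checklist. */
final class ContendedCost {
    static final long PADDING_PER_FIELD = 128;   // assumption: default padding width

    /** Total padding bytes for `objects` instances that each carry `contendedFields` padded fields. */
    static long paddingBytes(long objects, int contendedFields) {
        return objects * contendedFields * PADDING_PER_FIELD;
    }
}
```

A single service object with four padded counters costs 512 bytes of padding. One padded field on each of a million objects costs 128,000,000 bytes, roughly 122 MiB, which is why the annotation belongs on a handful of hot fields and not on a widely instantiated class.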

False sharing is a targeted problem with a targeted fix. Do not add @Contended to every field. Add it to the fields that are written concurrently at high frequency, verified by measurement, and confirmed by improvement in benchmarks.