Concurrency Performance: Lock Contention, Atomic Operations, and the Cases Where Fewer Threads Is Faster
Concurrency Performance: Lock Contention, Atomic Operations, and the Cases Where Fewer Threads Is Faster
Adding threads does not add performance. Adding threads adds contention. Contention for locks, for cache lines, for memory bandwidth, for CPU scheduling. The performance gain from parallelism must exceed the cost of coordination, and that cost grows non-linearly.
The content platform’s view counter processes 50,000 increment requests per second across hundreds of articles. A naive synchronized counter handles 12 million ops/sec on a single thread. Add 16 threads and throughput drops to 3 million. 32 threads: 1.8 million. The code is correct at every thread count. It is also progressively slower.
This chapter dissects why. We will measure lock contention at every level of the JVM’s locking hierarchy, benchmark each synchronization primitive under increasing thread pressure, and build a view counter that scales linearly. Then we will address thread pool sizing, where the industry’s most quoted formula is wrong for the workloads that matter.
The Locking Hierarchy
The JVM does not use one lock implementation. It uses a hierarchy, escalating from cheap to expensive as contention increases.
Biased locking (removed in recent JDK releases, but worth understanding): the JVM assumes one thread owns the lock and avoids atomic operations entirely. Cost: near zero. This works when a single thread repeatedly enters the same synchronized block. The moment a second thread arrives, the bias is revoked, which is expensive.
Thin locks use a CAS (compare-and-swap) operation to acquire the monitor. If the CAS succeeds, the thread enters the critical section with no kernel transition. Cost: 20-40 nanoseconds. This works when contention is low and lock hold times are short.
Fat locks escalate when CAS spinning fails. The JVM allocates an OS-level mutex, and waiting threads park via the kernel. Cost: 5-15 microseconds for the context switch alone, plus cache pollution when the thread resumes on a different core.
The escalation is one-directional during a contention spike. Once a lock inflates to a fat lock, threads pay the full parking cost for every acquisition, even if contention subsides.
The diagram shows three concurrent views. The left panel illustrates lock escalation from biased to thin to fat to full thread parking, with the cost growing by orders of magnitude at each step. The right panel plots throughput versus thread count for three synchronization strategies: synchronized hits a contention cliff around 8 threads where throughput drops instead of rising, while StampedLock and LongAdder continue scaling. The bottom panel shows thread pool behavior at three sizes, illustrating that both undersized and oversized pools degrade performance, but for different reasons.
synchronized vs ReentrantLock vs StampedLock
The benchmark uses a counter protected by each lock type. This is not a synthetic exercise. The content platform’s view counter, trending score accumulator, and rate limiter all share this pattern.
// SLOW: synchronized counter
@BenchmarkMode(Mode.Throughput)
@Warmup(iterations = 3, time = 2)
@Measurement(iterations = 5, time = 2)
@Fork(2)
@State(Scope.Benchmark)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class SynchronizedCounter {
private long count = 0;
@Benchmark
@Threads(16)
public long increment() {
synchronized (this) {
return ++count;
}
}
}
// FAST: StampedLock with optimistic read
@BenchmarkMode(Mode.Throughput)
@Warmup(iterations = 3, time = 2)
@Measurement(iterations = 5, time = 2)
@Fork(2)
@State(Scope.Benchmark)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class StampedLockCounter {
private final StampedLock lock = new StampedLock();
private long count = 0;
@Benchmark
@Threads(16)
public long increment() {
long stamp = lock.writeLock();
try {
return ++count;
} finally {
lock.unlockWrite(stamp);
}
}
@Benchmark
@Threads(16)
public long read() {
long stamp = lock.tryOptimisticRead();
long current = count;
if (!lock.validate(stamp)) {
stamp = lock.readLock();
try {
current = count;
} finally {
lock.unlockRead(stamp);
}
}
return current;
}
}
Representative results on a 16-core server (ops/μs, higher is better):
| Threads | synchronized | ReentrantLock | StampedLock (write) | StampedLock (read) |
|---|---|---|---|---|
| 1 | 210 | 195 | 180 | 310 |
| 4 | 85 | 120 | 115 | 290 |
| 8 | 42 | 95 | 90 | 280 |
| 16 | 18 | 72 | 68 | 270 |
| 32 | 11 | 55 | 52 | 265 |
Three observations:
synchronized degrades catastrophically. At 32 threads, it delivers 5% of single-threaded throughput. The monitor inflation triggers OS-level parking, and the thundering herd effect when the lock releases means threads compete for cache lines holding the monitor metadata.
ReentrantLock degrades gracefully. It uses CLH queue-based parking, which is fairer and avoids the thundering herd. Waiting threads park in order and wake one at a time.
StampedLock reads barely degrade. Optimistic reads do not acquire the lock at all. They read the stamp, read the data, then validate. If no write occurred, the read completes without any atomic operation. This is the correct primitive for the content platform’s article metadata, which is read 1000x for every write.
AtomicLong vs LongAdder
When the critical section is a single counter increment, locks are the wrong tool entirely.
// SLOW: AtomicLong under heavy contention
@BenchmarkMode(Mode.Throughput)
@Warmup(iterations = 3, time = 2)
@Measurement(iterations = 5, time = 2)
@Fork(2)
@State(Scope.Benchmark)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class AtomicLongBench {
private final AtomicLong counter = new AtomicLong();
@Benchmark
@Threads(16)
public long increment() {
return counter.incrementAndGet();
}
}
// FAST: LongAdder distributes contention across cells
@BenchmarkMode(Mode.Throughput)
@Warmup(iterations = 3, time = 2)
@Measurement(iterations = 5, time = 2)
@Fork(2)
@State(Scope.Benchmark)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class LongAdderBench {
private final LongAdder counter = new LongAdder();
@Benchmark
@Threads(16)
public void increment() {
counter.increment();
}
}
At 16 threads, AtomicLong.incrementAndGet() delivers roughly 80 million ops/sec. LongAdder.increment() delivers 600 million. The difference: AtomicLong forces every thread to CAS on the same memory location. When a CAS fails (because another thread modified the value between read and compare), the thread retries. Under 16-thread contention, most CAS attempts fail.
LongAdder maintains an internal array of cells. Each thread increments a different cell, eliminating cross-thread cache line contention. Calling sum() reads all cells and adds them. This trades read precision (the sum is not atomic) for write throughput.
The trade-off: LongAdder is not suitable when you need the exact current value after every write. For the view counter, where displaying “12,847 views” versus “12,849 views” is irrelevant, LongAdder is the correct choice. For a sequence generator or account balance, AtomicLong is required.
Lock Striping for the View Counter
The content platform tracks views per article. A single ConcurrentHashMap with AtomicLong values is the starting point:
// SLOW: Single ConcurrentHashMap, contention on popular articles
public class ViewCounter {
private final ConcurrentHashMap<String, AtomicLong> counts =
new ConcurrentHashMap<>();
public void recordView(String articleId) {
counts.computeIfAbsent(articleId, k -> new AtomicLong())
.incrementAndGet();
}
public long getViews(String articleId) {
AtomicLong count = counts.get(articleId);
return count != null ? count.get() : 0;
}
}
This works, but popular articles create hot spots. When 10,000 concurrent users read the same trending article, they all CAS on the same AtomicLong. The fix is lock striping combined with LongAdder:
// FAST: LongAdder per article eliminates CAS contention
public class StripedViewCounter {
private final ConcurrentHashMap<String, LongAdder> counts =
new ConcurrentHashMap<>();
public void recordView(String articleId) {
counts.computeIfAbsent(articleId, k -> new LongAdder())
.increment();
}
public long getViews(String articleId) {
LongAdder adder = counts.get(articleId);
return adder != null ? adder.sum() : 0;
}
public Map<String, Long> snapshot() {
Map<String, Long> result = new HashMap<>();
counts.forEach((k, v) -> result.put(k, v.sum()));
return result;
}
}
The computeIfAbsent call is thread-safe in ConcurrentHashMap. It guarantees that only one LongAdder is created per key, even under concurrent access. After the initial insertion, all subsequent calls to increment() operate on a per-key LongAdder with its internal cell striping.
Under 16-thread contention on a single hot article, the LongAdder version sustains 580 million ops/sec versus the AtomicLong version’s 78 million.
Thread Pool Sizing
The formula Ncpu * 2 appears in countless blog posts, StackOverflow answers, and framework defaults. It is wrong for most workloads.
The formula originates from a specific assumption: pure CPU-bound work with zero I/O wait. Under that assumption, Ncpu threads saturate the CPU, and Ncpu * 2 provides headroom for scheduling jitter. For a content platform that makes database queries, HTTP calls, and file reads, this assumption is false.
Little’s Law provides the correct framework:
$$L = \lambda \times W$$
Where $L$ is the number of concurrent requests in the system, $\lambda$ is the arrival rate (requests/sec), and $W$ is the mean service time. For a thread pool, $L$ is the number of threads needed.
A more actionable formulation for thread pool sizing:
$$\text{threads} = N_{cpu} \times U_{target} \times \left(1 + \frac{W_{wait}}{W_{compute}}\right)$$
Where $U_{target}$ is the target CPU utilization (0.0 to 1.0), $W_{wait}$ is the time a task spends waiting (I/O, locks), and $W_{compute}$ is the time a task spends computing.
For the content platform’s article fetcher:
- 8 CPU cores
- Target utilization: 0.8
- Wait time per request: 45ms (database query + Redis lookup)
- Compute time per request: 5ms (JSON serialization)
$$\text{threads} = 8 \times 0.8 \times \left(1 + \frac{45}{5}\right) = 8 \times 0.8 \times 10 = 64$$
64 threads, not 16. The I/O ratio drives the calculation. A CPU-bound image processing pipeline with 0.1ms wait and 50ms compute yields 8 * 0.8 * 1.002 = 6 threads.
The problem with formulas is that they assume you know the wait/compute ratio. You do not. Measure it.
Contention Profiling with async-profiler
async-profiler’s lock mode reveals exactly where threads spend time waiting:
# Profile lock contention for 30 seconds
./asprof -e lock -d 30 -f lock-profile.html <pid>
The flame graph shows lock acquisition stacks, with width proportional to time spent waiting. Look for:
- Wide frames in application code: your locks are held too long
- Wide frames in
java.util.concurrent: you are using the wrong concurrent data structure - Wide frames in
sun.misc.Unsafe.park: threads are parked waiting for a fat lock - Narrow, tall stacks: locks are acquired frequently but held briefly. Consider lock coarsening or batching
For the content platform, the lock contention profile revealed three hotspots:
- View counter:
AtomicLong.incrementAndGetshowing 23% of lock wait time. Fixed by switching toLongAdder. - Article cache:
synchronizedblock in the cache refresh method. Fixed by switching toStampedLockwith optimistic reads. - Metrics aggregation:
ConcurrentHashMap.computewith a lambda that made a database call. Fixed by moving the database call outside the compute lambda.
The third case is the most instructive. ConcurrentHashMap.compute holds a lock on the hash bucket for the entire duration of the lambda. If the lambda does I/O, every other thread accessing keys in the same bucket blocks on a segment lock. The fix:
// SLOW: I/O inside compute lambda holds the bucket lock
metrics.compute(key, (k, existing) -> {
MetricData fresh = database.fetchMetric(k); // BLOCKS BUCKET
return fresh;
});
// FAST: I/O outside compute, only the merge is locked
MetricData fresh = database.fetchMetric(key);
metrics.merge(key, fresh, (existing, update) -> update);
The merge lambda runs under the bucket lock, but it performs no I/O. It completes in nanoseconds.
Decision Framework
Choose your synchronization primitive based on the access pattern:
Write-only counter (view counts, metrics): LongAdder. No locks. Reads are eventually consistent.
Read-heavy, write-rare (article metadata, configuration): StampedLock with optimistic reads. Reads pay no synchronization cost when no write is in progress.
Mixed read-write with fairness needs (rate limiter, connection pool): ReentrantLock with fairness. Prevents starvation at the cost of throughput.
Single atomic value needing exact reads (sequence generator): AtomicLong or AtomicReference. The CAS retry loop is acceptable when exact-after-write semantics are required.
Never: synchronized on a hot path with more than 4 threads. There are zero cases where synchronized outperforms ReentrantLock under contention. The only advantage of synchronized is syntactic brevity, and that advantage disappears the first time a production system hits 100% CPU on lock inflation.