ArrayList vs LinkedList: Memory Layout Wins
ArrayList vs LinkedList: Memory Layout Wins
The main chapter established the headline numbers. This section goes deeper into why ArrayList wins, quantifying the cache miss penalty that makes LinkedList’s theoretical O(1) insertion irrelevant in practice. We will use JMH’s built-in profiler integration to measure L1, L2, and L3 cache misses directly, then apply the results to the content platform’s search result ranking pipeline.
Cache Misses: The Hidden Cost
Modern CPUs do not access main memory one byte at a time. They load 64-byte cache lines. When the CPU accesses an address, it loads the entire 64-byte line containing that address into L1 cache. If the next access falls within the same cache line or a nearby line that the prefetcher has already loaded, the access costs 1-4 cycles (~1ns). If the next access requires a cache line that is not loaded, the CPU stalls:
- L1 cache miss, L2 hit: ~5ns
- L2 cache miss, L3 hit: ~12ns
- L3 cache miss, main memory: ~60-100ns
ArrayList stores references in a contiguous Object[]. With compressed oops (default for heaps under 32GB), each reference is 4 bytes. One 64-byte cache line holds 16 consecutive references. Iterating an ArrayList of 10,000 elements accesses 10,000 / 16 = 625 cache lines in a perfectly sequential pattern. The hardware prefetcher detects this pattern after 2-3 accesses and begins loading lines ahead of the iteration, effectively hiding memory latency.
LinkedList stores each element in a Node object. Each Node is 32 bytes (compressed oops) and is allocated wherever the heap allocator places it. Iterating the list follows the next pointer from each Node to the next. Each pointer dereference is a dependent load: the CPU cannot even begin computing the address of the next Node until the current Node’s data arrives from memory. This serializes the access pattern and defeats the prefetcher.
JMH with Performance Counters
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 3, time = 2)
@Measurement(iterations = 5, time = 5)
@Fork(value = 2, jvmArgsAppend = {"-XX:+UnlockDiagnosticVMOptions",
"-XX:+DebugNonSafepoints"})
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class CacheMissBenchmark {
@Param({"1000", "10000", "100000"})
private int size;
private ArrayList<String> arrayList;
private LinkedList<String> linkedList;
@Setup
public void setup() {
arrayList = new ArrayList<>(size);
linkedList = new LinkedList<>();
for (int i = 0; i < size; i++) {
String s = "article-title-" + i;
arrayList.add(s);
linkedList.add(s);
}
}
@Benchmark
public int iterateArrayList() {
int sum = 0;
for (String s : arrayList) {
sum += s.length();
}
return sum;
}
@Benchmark
public int iterateLinkedList() {
int sum = 0;
for (String s : linkedList) {
sum += s.length();
}
return sum;
}
@Benchmark
public int sumLengthArrayListIndexed() {
int sum = 0;
for (int i = 0; i < size; i++) {
sum += arrayList.get(i).length();
}
return sum;
}
@Benchmark
public int sumLengthLinkedListIndexed() {
int sum = 0;
for (int i = 0; i < size; i++) {
sum += linkedList.get(i).length();
}
return sum;
}
}
Run with the LinuxPerfNormProfiler to capture hardware counters:
java -jar benchmarks.jar CacheMissBenchmark \
-prof "perf:events=cache-misses,cache-references,L1-dcache-load-misses"
Results at 10,000 elements:
| Metric | ArrayList (iterate) | LinkedList (iterate) |
|---|---|---|
| Time (ns) | 14,200 | 86,500 |
| L1-dcache-load-misses | 1,250 | 9,800 |
| cache-misses | 180 | 4,200 |
| cache-references | 12,400 | 48,300 |
LinkedList generates 7.8x more L1 cache misses and 23x more LLC cache misses for the same iteration. The 6.1x time ratio from the main chapter is entirely explained by cache behavior. The pointer-chasing access pattern forces the CPU to wait for memory at every step.
Results at 100,000 elements:
| Metric | ArrayList (iterate) | LinkedList (iterate) |
|---|---|---|
| Time (ns) | 142,000 | 1,850,000 |
| L1-dcache-load-misses | 12,500 | 98,000 |
| cache-misses | 2,800 | 68,000 |
| cache-references | 124,000 | 485,000 |
At 100,000 elements, LinkedList’s working set exceeds L2 cache (100,000 * 32 bytes = 3.2MB). The LLC miss count explodes from 2,800 to 68,000, a 24x ratio. ArrayList’s working set for references is 100,000 * 4 bytes = 400KB, fitting comfortably in L2. The time ratio widens to 13x.
Sorted Insertion: The Binary Search Advantage
The content platform’s search results arrive in relevance order, but the recommendation engine merges multiple sources into a single ranked list. A common pattern: maintain a sorted list and insert new items at the correct position.
// SLOW: LinkedList with linear search for insertion point
public class LinkedListSortedMerge {
private final LinkedList<ScoredArticle> merged = new LinkedList<>();
public void insert(ScoredArticle article) {
ListIterator<ScoredArticle> it = merged.listIterator();
while (it.hasNext()) {
if (it.next().score() < article.score()) {
it.previous();
it.add(article);
return;
}
}
merged.addLast(article);
}
}
// FAST: ArrayList with binary search for insertion point
public class ArrayListSortedMerge {
private final ArrayList<ScoredArticle> merged = new ArrayList<>();
private static final Comparator<ScoredArticle> BY_SCORE =
Comparator.comparingDouble(ScoredArticle::score).reversed();
public void insert(ScoredArticle article) {
int pos = Collections.binarySearch(merged, article, BY_SCORE);
if (pos < 0) {
pos = -(pos + 1);
}
merged.add(pos, article);
}
}
The ArrayList version uses Collections.binarySearch, which is O(log n) random accesses on contiguous memory, followed by an O(n) System.arraycopy shift. The LinkedList version uses O(n) pointer traversal to find the insertion point, then O(1) node insertion. But that O(n) traversal involves cache-hostile pointer chasing.
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 3, time = 2)
@Measurement(iterations = 5, time = 5)
@Fork(2)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
public class SortedInsertionBenchmark {
@Param({"100", "1000", "10000"})
private int insertions;
private ScoredArticle[] articles;
@Setup
public void setup() {
articles = new ScoredArticle[insertions];
var rng = new Random(42);
for (int i = 0; i < insertions; i++) {
articles[i] = new ScoredArticle("article-" + i, rng.nextDouble());
}
}
@Benchmark
public List<ScoredArticle> arrayListSortedInsert() {
var list = new ArrayList<ScoredArticle>(insertions);
var cmp = Comparator.comparingDouble(ScoredArticle::score).reversed();
for (ScoredArticle a : articles) {
int pos = Collections.binarySearch(list, a, cmp);
if (pos < 0) pos = -(pos + 1);
list.add(pos, a);
}
return list;
}
@Benchmark
public List<ScoredArticle> linkedListSortedInsert() {
var list = new LinkedList<ScoredArticle>();
for (ScoredArticle a : articles) {
ListIterator<ScoredArticle> it = list.listIterator();
boolean inserted = false;
while (it.hasNext()) {
if (it.next().score() < a.score()) {
it.previous();
it.add(a);
inserted = true;
break;
}
}
if (!inserted) list.addLast(a);
}
return list;
}
record ScoredArticle(String id, double score) implements Comparable<ScoredArticle> {
@Override
public int compareTo(ScoredArticle o) {
return Double.compare(o.score, this.score);
}
}
}
| Insertions | ArrayList (us) | LinkedList (us) | Ratio |
|---|---|---|---|
| 100 | 8 | 14 | 1.8x slower |
| 1,000 | 120 | 680 | 5.7x slower |
| 10,000 | 4,200 | 58,000 | 13.8x slower |
At 10,000 insertions, ArrayList’s binary search + arraycopy is 13.8x faster than LinkedList’s linear scan + pointer insert. Binary search does O(log 10000) = ~14 cache-friendly array accesses. The subsequent arraycopy shifts an average of 5,000 elements at memory bandwidth speed. LinkedList’s linear scan traverses an average of 5,000 nodes via pointer chasing.
Immutable Lists: List.of() and List.copyOf()
When the content platform builds a page response, the article list is constructed once and never modified. Java’s List.of() and List.copyOf() return unmodifiable lists with lower memory overhead:
// SLOW: ArrayList carries mutable overhead even when you never mutate
public List<Article> buildResponse(List<Article> source) {
return new ArrayList<>(source); // mutable, has size + modCount fields
}
// FAST: Unmodifiable list, no mutation tracking overhead
public List<Article> buildResponse(List<Article> source) {
return List.copyOf(source); // immutable, optimized internal layout
}
List.of() for small element counts (0-2 elements) uses specialized classes with fields instead of arrays, eliminating array overhead entirely. For larger counts, it uses a trimmed array with no excess capacity and no size, modCount, or growth logic.
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 3, time = 2)
@Measurement(iterations = 5, time = 5)
@Fork(2)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class ImmutableListBenchmark {
@Param({"10", "100", "1000"})
private int size;
private List<String> source;
@Setup
public void setup() {
source = new ArrayList<>(size);
for (int i = 0; i < size; i++) {
source.add("article-" + i);
}
}
@Benchmark
public List<String> copyToArrayList() {
return new ArrayList<>(source);
}
@Benchmark
public List<String> copyToUnmodifiable() {
return List.copyOf(source);
}
@Benchmark
public int iterateArrayList() {
List<String> list = new ArrayList<>(source);
int sum = 0;
for (String s : list) {
sum += s.length();
}
return sum;
}
@Benchmark
public int iterateUnmodifiable() {
List<String> list = List.copyOf(source);
int sum = 0;
for (String s : list) {
sum += s.length();
}
return sum;
}
}
| Operation | Size | ArrayList (ns) | List.copyOf (ns) | Difference |
|---|---|---|---|---|
| Copy | 10 | 45 | 55 | +22% (initial overhead) |
| Copy | 100 | 180 | 160 | -11% |
| Copy | 1,000 | 1,600 | 1,400 | -12% |
| Iterate | 10 | 22 | 18 | -18% |
| Iterate | 100 | 150 | 130 | -13% |
| Iterate | 1,000 | 1,400 | 1,250 | -11% |
List.copyOf iteration is 11-18% faster because the JVM can skip the modCount check on each iteration step (there is no concurrent modification possible on an unmodifiable list), and the JIT compiler can sometimes eliminate bounds checks when it proves the list is unmodifiable.
The Removal Problem
ArrayList has one genuine weakness: removal from the front or middle. remove(0) shifts every element left by one position, an O(n) operation. If your access pattern requires frequent removal from the front, this matters.
The content platform’s event processing pipeline consumes articles from a queue: take from the front, process, repeat. This is a FIFO pattern, not a List pattern. The correct answer is ArrayDeque, which Section 2 covers. But developers sometimes use ArrayList as a queue:
// SLOW: ArrayList as a FIFO queue. remove(0) is O(n).
public class ArticleProcessingQueue {
private final List<Article> queue = new ArrayList<>();
public void enqueue(Article article) {
queue.add(article); // O(1) amortized, appends to end
}
public Article dequeue() {
if (queue.isEmpty()) return null;
return queue.remove(0); // O(n), shifts all elements left
}
}
At 10,000 elements in the queue, each remove(0) copies 9,999 references. Processing 10,000 articles in FIFO order costs 10,000 * 10,000 / 2 = 50,000,000 reference copies. With ArrayDeque, each poll is O(1): increment the head index, return the element. Zero copies.
This is not an ArrayList deficiency. It is an API misuse. ArrayList implements List. It is not a Queue. When you need FIFO semantics, use a Queue implementation.
Practical Checklist
Symptom: GC pause times increasing, throughput declining as collection sizes grow.
Cause: LinkedList creating 8x per-element overhead, fragmenting the heap, increasing GC scan times.
Fix: Replace with ArrayList, pre-sized where possible.
Proof: JMH with -prof gc showing allocation rate and GC pressure reduction.
Trade-off: ArrayList’s backing array requires contiguous memory. For very large collections (>10M elements), this can cause allocation pressure if the heap is fragmented. In practice, this is rarely the binding constraint.
Symptom: Iteration-heavy loops show unexpectedly high latency in profiler.
Cause: Cache misses from pointer-chasing through LinkedList nodes.
Fix: Replace with ArrayList or List.copyOf for read-only collections.
Proof: JMH with -prof perf showing L1/LLC cache miss reduction.
Trade-off: If the collection is modified during iteration via ListIterator, ArrayList’s ConcurrentModificationException semantics differ from LinkedList’s. Use CopyOnWriteArrayList for concurrent read-modify patterns.