ArrayList vs LinkedList: Memory Layout Wins

The main chapter established the headline numbers. This section goes deeper into why ArrayList wins, quantifying the cache miss penalty that makes LinkedList’s theoretical O(1) insertion irrelevant in practice. We will use JMH’s built-in profiler integration to measure L1, L2, and L3 cache misses directly, then apply the results to the content platform’s search result ranking pipeline.

Cache Misses: The Hidden Cost

Modern CPUs do not access main memory one byte at a time. They load 64-byte cache lines. When the CPU accesses an address, it loads the entire 64-byte line containing that address into L1 cache. If the next access falls within the same cache line or a nearby line that the prefetcher has already loaded, the access costs 1-4 cycles (~1ns). If the next access requires a cache line that is not loaded, the CPU stalls:

L1 cache miss, L2 hit: ~5ns
L2 cache miss, L3 hit: ~12ns
L3 cache miss, main memory: ~60-100ns

ArrayList stores references in a contiguous Object[]. With compressed oops (default for heaps under 32GB), each reference is 4 bytes. One 64-byte cache line holds 16 consecutive references. Iterating an ArrayList of 10,000 elements accesses 10,000 / 16 = 625 cache lines in a perfectly sequential pattern. The hardware prefetcher detects this pattern after 2-3 accesses and begins loading lines ahead of the iteration, effectively hiding memory latency.

LinkedList stores each element in a Node object. Each Node is 32 bytes (compressed oops) and is allocated wherever the heap allocator places it. Iterating the list follows the next pointer from each Node to the next. Each pointer dereference is a dependent load: the CPU cannot even begin computing the address of the next Node until the current Node’s data arrives from memory. This serializes the access pattern and defeats the prefetcher.

JMH with Performance Counters

@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 3, time = 2)
@Measurement(iterations = 5, time = 5)
@Fork(value = 2, jvmArgsAppend = {"-XX:+UnlockDiagnosticVMOptions",
    "-XX:+DebugNonSafepoints"})
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class CacheMissBenchmark {

    @Param({"1000", "10000", "100000"})
    private int size;

    private ArrayList<String> arrayList;
    private LinkedList<String> linkedList;

    @Setup
    public void setup() {
        arrayList = new ArrayList<>(size);
        linkedList = new LinkedList<>();
        for (int i = 0; i < size; i++) {
            String s = "article-title-" + i;
            arrayList.add(s);
            linkedList.add(s);
        }
    }

    @Benchmark
    public int iterateArrayList() {
        int sum = 0;
        for (String s : arrayList) {
            sum += s.length();
        }
        return sum;
    }

    @Benchmark
    public int iterateLinkedList() {
        int sum = 0;
        for (String s : linkedList) {
            sum += s.length();
        }
        return sum;
    }

    @Benchmark
    public int sumLengthArrayListIndexed() {
        int sum = 0;
        for (int i = 0; i < size; i++) {
            sum += arrayList.get(i).length();
        }
        return sum;
    }

    @Benchmark
    public int sumLengthLinkedListIndexed() {
        int sum = 0;
        for (int i = 0; i < size; i++) {
            sum += linkedList.get(i).length();
        }
        return sum;
    }
}

Run with the LinuxPerfNormProfiler to capture hardware counters:

java -jar benchmarks.jar CacheMissBenchmark \
  -prof "perf:events=cache-misses,cache-references,L1-dcache-load-misses"

Results at 10,000 elements:

Metric	ArrayList (iterate)	LinkedList (iterate)
Time (ns)	14,200	86,500
L1-dcache-load-misses	1,250	9,800
cache-misses	180	4,200
cache-references	12,400	48,300

LinkedList generates 7.8x more L1 cache misses and 23x more LLC cache misses for the same iteration. The 6.1x time ratio from the main chapter is entirely explained by cache behavior. The pointer-chasing access pattern forces the CPU to wait for memory at every step.

Results at 100,000 elements:

Metric	ArrayList (iterate)	LinkedList (iterate)
Time (ns)	142,000	1,850,000
L1-dcache-load-misses	12,500	98,000
cache-misses	2,800	68,000
cache-references	124,000	485,000

At 100,000 elements, LinkedList’s working set exceeds L2 cache (100,000 * 32 bytes = 3.2MB). The LLC miss count explodes from 2,800 to 68,000, a 24x ratio. ArrayList’s working set for references is 100,000 * 4 bytes = 400KB, fitting comfortably in L2. The time ratio widens to 13x.

Sorted Insertion: The Binary Search Advantage

The content platform’s search results arrive in relevance order, but the recommendation engine merges multiple sources into a single ranked list. A common pattern: maintain a sorted list and insert new items at the correct position.

// SLOW: LinkedList with linear search for insertion point
public class LinkedListSortedMerge {

    private final LinkedList<ScoredArticle> merged = new LinkedList<>();

    public void insert(ScoredArticle article) {
        ListIterator<ScoredArticle> it = merged.listIterator();
        while (it.hasNext()) {
            if (it.next().score() < article.score()) {
                it.previous();
                it.add(article);
                return;
            }
        }
        merged.addLast(article);
    }
}

// FAST: ArrayList with binary search for insertion point
public class ArrayListSortedMerge {

    private final ArrayList<ScoredArticle> merged = new ArrayList<>();
    private static final Comparator<ScoredArticle> BY_SCORE =
        Comparator.comparingDouble(ScoredArticle::score).reversed();

    public void insert(ScoredArticle article) {
        int pos = Collections.binarySearch(merged, article, BY_SCORE);
        if (pos < 0) {
            pos = -(pos + 1);
        }
        merged.add(pos, article);
    }
}

The ArrayList version uses Collections.binarySearch, which is O(log n) random accesses on contiguous memory, followed by an O(n) System.arraycopy shift. The LinkedList version uses O(n) pointer traversal to find the insertion point, then O(1) node insertion. But that O(n) traversal involves cache-hostile pointer chasing.

@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 3, time = 2)
@Measurement(iterations = 5, time = 5)
@Fork(2)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
public class SortedInsertionBenchmark {

    @Param({"100", "1000", "10000"})
    private int insertions;

    private ScoredArticle[] articles;

    @Setup
    public void setup() {
        articles = new ScoredArticle[insertions];
        var rng = new Random(42);
        for (int i = 0; i < insertions; i++) {
            articles[i] = new ScoredArticle("article-" + i, rng.nextDouble());
        }
    }

    @Benchmark
    public List<ScoredArticle> arrayListSortedInsert() {
        var list = new ArrayList<ScoredArticle>(insertions);
        var cmp = Comparator.comparingDouble(ScoredArticle::score).reversed();
        for (ScoredArticle a : articles) {
            int pos = Collections.binarySearch(list, a, cmp);
            if (pos < 0) pos = -(pos + 1);
            list.add(pos, a);
        }
        return list;
    }

    @Benchmark
    public List<ScoredArticle> linkedListSortedInsert() {
        var list = new LinkedList<ScoredArticle>();
        for (ScoredArticle a : articles) {
            ListIterator<ScoredArticle> it = list.listIterator();
            boolean inserted = false;
            while (it.hasNext()) {
                if (it.next().score() < a.score()) {
                    it.previous();
                    it.add(a);
                    inserted = true;
                    break;
                }
            }
            if (!inserted) list.addLast(a);
        }
        return list;
    }

    record ScoredArticle(String id, double score) implements Comparable<ScoredArticle> {
        @Override
        public int compareTo(ScoredArticle o) {
            return Double.compare(o.score, this.score);
        }
    }
}

Insertions	ArrayList (us)	LinkedList (us)	Ratio
100	8	14	1.8x slower
1,000	120	680	5.7x slower
10,000	4,200	58,000	13.8x slower

At 10,000 insertions, ArrayList’s binary search + arraycopy is 13.8x faster than LinkedList’s linear scan + pointer insert. Binary search does O(log 10000) = ~14 cache-friendly array accesses. The subsequent arraycopy shifts an average of 5,000 elements at memory bandwidth speed. LinkedList’s linear scan traverses an average of 5,000 nodes via pointer chasing.

Immutable Lists: List.of() and List.copyOf()

When the content platform builds a page response, the article list is constructed once and never modified. Java’s List.of() and List.copyOf() return unmodifiable lists with lower memory overhead:

// SLOW: ArrayList carries mutable overhead even when you never mutate
public List<Article> buildResponse(List<Article> source) {
    return new ArrayList<>(source); // mutable, has size + modCount fields
}

// FAST: Unmodifiable list, no mutation tracking overhead
public List<Article> buildResponse(List<Article> source) {
    return List.copyOf(source); // immutable, optimized internal layout
}

List.of() for small element counts (0-2 elements) uses specialized classes with fields instead of arrays, eliminating array overhead entirely. For larger counts, it uses a trimmed array with no excess capacity and no size, modCount, or growth logic.

@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 3, time = 2)
@Measurement(iterations = 5, time = 5)
@Fork(2)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class ImmutableListBenchmark {

    @Param({"10", "100", "1000"})
    private int size;

    private List<String> source;

    @Setup
    public void setup() {
        source = new ArrayList<>(size);
        for (int i = 0; i < size; i++) {
            source.add("article-" + i);
        }
    }

    @Benchmark
    public List<String> copyToArrayList() {
        return new ArrayList<>(source);
    }

    @Benchmark
    public List<String> copyToUnmodifiable() {
        return List.copyOf(source);
    }

    @Benchmark
    public int iterateArrayList() {
        List<String> list = new ArrayList<>(source);
        int sum = 0;
        for (String s : list) {
            sum += s.length();
        }
        return sum;
    }

    @Benchmark
    public int iterateUnmodifiable() {
        List<String> list = List.copyOf(source);
        int sum = 0;
        for (String s : list) {
            sum += s.length();
        }
        return sum;
    }
}

Operation	Size	ArrayList (ns)	List.copyOf (ns)	Difference
Copy	10	45	55	+22% (initial overhead)
Copy	100	180	160	-11%
Copy	1,000	1,600	1,400	-12%
Iterate	10	22	18	-18%
Iterate	100	150	130	-13%
Iterate	1,000	1,400	1,250	-11%

List.copyOf iteration is 11-18% faster because the JVM can skip the modCount check on each iteration step (there is no concurrent modification possible on an unmodifiable list), and the JIT compiler can sometimes eliminate bounds checks when it proves the list is unmodifiable.

The Removal Problem

ArrayList has one genuine weakness: removal from the front or middle. remove(0) shifts every element left by one position, an O(n) operation. If your access pattern requires frequent removal from the front, this matters.

The content platform’s event processing pipeline consumes articles from a queue: take from the front, process, repeat. This is a FIFO pattern, not a List pattern. The correct answer is ArrayDeque, which Section 2 covers. But developers sometimes use ArrayList as a queue:

// SLOW: ArrayList as a FIFO queue. remove(0) is O(n).
public class ArticleProcessingQueue {

    private final List<Article> queue = new ArrayList<>();

    public void enqueue(Article article) {
        queue.add(article); // O(1) amortized, appends to end
    }

    public Article dequeue() {
        if (queue.isEmpty()) return null;
        return queue.remove(0); // O(n), shifts all elements left
    }
}

At 10,000 elements in the queue, each remove(0) copies 9,999 references. Processing 10,000 articles in FIFO order costs 10,000 * 10,000 / 2 = 50,000,000 reference copies. With ArrayDeque, each poll is O(1): increment the head index, return the element. Zero copies.

This is not an ArrayList deficiency. It is an API misuse. ArrayList implements List. It is not a Queue. When you need FIFO semantics, use a Queue implementation.

Practical Checklist

Symptom: GC pause times increasing, throughput declining as collection sizes grow. Cause: LinkedList creating 8x per-element overhead, fragmenting the heap, increasing GC scan times. Fix: Replace with ArrayList, pre-sized where possible. Proof: JMH with -prof gc showing allocation rate and GC pressure reduction. Trade-off: ArrayList’s backing array requires contiguous memory. For very large collections (>10M elements), this can cause allocation pressure if the heap is fragmented. In practice, this is rarely the binding constraint.

Symptom: Iteration-heavy loops show unexpectedly high latency in profiler. Cause: Cache misses from pointer-chasing through LinkedList nodes. Fix: Replace with ArrayList or List.copyOf for read-only collections. Proof: JMH with -prof perf showing L1/LLC cache miss reduction. Trade-off: If the collection is modified during iteration via ListIterator, ArrayList’s ConcurrentModificationException semantics differ from LinkedList’s. Use CopyOnWriteArrayList for concurrent read-modify patterns.