Skip to main content
fast by design

Algorithmic Complexity on Real Data: When O(n log n) Beats O(n) and Why

10 min read Chapter 37 of 90

Algorithmic Complexity on Real Data: When O(n log n) Beats O(n) and Why

Algorithm textbooks teach that O(n) is faster than O(n log n), which is faster than O(n^2). This is true in the limit. It is often wrong in practice. An O(n) algorithm that chases pointers across the heap, allocating objects on every step, will lose to an O(n log n) algorithm that operates on a contiguous array in cache. At 100,000 elements, the constant factor difference can be 10x.

The content platform’s recommendation engine scores articles for each user request. The scoring pipeline filters candidates (O(n)), computes features (O(n)), sorts by composite score (O(n log n)), and returns the top k (O(k)). The total complexity is O(n log n), dominated by the sort. But when the team optimized the O(n) filtering step from a LinkedList scan to an array scan, latency dropped more than when they optimized the O(n log n) sort. The reason: the LinkedList scan was cache-hostile, and the “O(n)” operation was performing like O(n * cache_miss_penalty).

This chapter proves that algorithmic complexity is necessary but not sufficient for performance prediction. You must also account for constant factors, memory layout, and hardware effects.

Constant Factors: The Hidden Multiplier

Big-O notation strips constant factors. O(3n + 7) becomes O(n). But the constant 3 matters. If algorithm A runs in 3n operations and algorithm B runs in 0.1 * n * log2(n) operations, algorithm B is faster when:

$$0.1 \cdot n \cdot \log_2(n) < 3n$$ $$0.1 \cdot \log_2(n) < 3$$ $$\log_2(n) < 30$$ $$n < 2^{30} \approx 1{,}073{,}741{,}824$$

For any dataset under 1 billion elements, the O(n log n) algorithm with a 0.1 constant factor beats the O(n) algorithm with a 3 constant factor. Real-world algorithms routinely have constant factor ratios of 10x or more due to cache behavior, branch prediction, and allocation patterns.

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 5, time = 1)
@Fork(2)
@State(Scope.Benchmark)
public class ConstantFactorBenchmark {

    @Param({"10000", "100000", "1000000"})
    int size;

    int[] data;
    Node head; // Linked list

    record Node(int value, Node next) {}

    @Setup(Level.Trial)
    public void setup() {
        ThreadLocalRandom rng = ThreadLocalRandom.current();
        data = rng.ints(size).toArray();

        // Build linked list with same values (scattered on heap)
        head = null;
        for (int i = size - 1; i >= 0; i--) {
            head = new Node(data[i], head);
        }
    }

    @Benchmark
    public long arrayScan() {
        // O(n) with excellent cache behavior
        long sum = 0;
        for (int v : data) {
            sum += v;
        }
        return sum;                             // FAST: sequential memory
    }

    @Benchmark
    public long linkedListScan() {
        // O(n) with terrible cache behavior
        long sum = 0;
        Node current = head;
        while (current != null) {
            sum += current.value;
            current = current.next;             // SLOW: pointer chasing
        }
        return sum;
    }

    @Benchmark
    public int[] arraySort() {
        // O(n log n) but cache-friendly
        int[] copy = data.clone();
        Arrays.sort(copy);
        return copy;                            // FAST: sequential access
    }
}
OperationComplexity10K (us)100K (us)1M (us)
Array scanO(n)328280
LinkedList scanO(n)425808,200
Array sortO(n log n)851,20014,000

At 100,000 elements, the linked list scan (O(n)) takes 580us while the array sort (O(n log n)) takes 1,200us. The sort is only 2x slower despite being O(n log n) vs O(n). At small sizes, the linked list scan is already 14x slower than the array scan despite identical complexity.

The linked list’s constant factor comes from pointer chasing. Each node.next dereference loads a memory address that the CPU cannot predict. The array scan accesses sequential memory that the hardware prefetcher loads ahead of the access pattern. The effective cost per element:

  • Array scan: ~0.28ns/element (data in L1/L2 cache, prefetcher active)
  • LinkedList scan: ~8.2ns/element (random heap access, ~5ns L3 miss penalty per node)
  • Array sort: ~14ns/element * log2(n) comparisons, but each comparison accesses cached data

Cache Effects: Memory Layout Determines Performance

Modern CPUs have three cache levels with different access times:

Cache LevelSize (typical)Latency (cycles)Latency (ns at 3GHz)
L132 KB41.3
L2256 KB124.0
L38 MB4013.3
Main memoryGB20066.7

A cache line is 64 bytes. When the CPU loads one int from an array, it loads 16 adjacent int values (64 bytes / 4 bytes per int). The next 15 accesses hit L1 cache. This is spatial locality, and it gives array-based algorithms a 50x advantage over pointer-based algorithms for sequential access.

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 5, time = 1)
@Fork(2)
@State(Scope.Benchmark)
public class CacheEffectBenchmark {

    @Param({"1000000"})
    int size;

    int[] sequentialArray;
    int[] randomAccessIndices;
    int[] stridedIndices;

    @Setup(Level.Trial)
    public void setup() {
        sequentialArray = new int[size];
        for (int i = 0; i < size; i++) sequentialArray[i] = i;

        // Random access: indices in random order
        randomAccessIndices = new int[size];
        for (int i = 0; i < size; i++) randomAccessIndices[i] = i;
        ThreadLocalRandom rng = ThreadLocalRandom.current();
        for (int i = size - 1; i > 0; i--) {
            int j = rng.nextInt(i + 1);
            int tmp = randomAccessIndices[i];
            randomAccessIndices[i] = randomAccessIndices[j];
            randomAccessIndices[j] = tmp;
        }

        // Strided access: every 16th element (one per cache line)
        stridedIndices = new int[size / 16];
        for (int i = 0; i < stridedIndices.length; i++) {
            stridedIndices[i] = i * 16;
        }
    }

    @Benchmark
    public long sequentialAccess() {
        long sum = 0;
        for (int i = 0; i < size; i++) {
            sum += sequentialArray[i];              // FAST: prefetcher active
        }
        return sum;
    }

    @Benchmark
    public long randomAccess() {
        long sum = 0;
        for (int idx : randomAccessIndices) {
            sum += sequentialArray[idx];            // SLOW: cache misses
        }
        return sum;
    }

    @Benchmark
    public long stridedAccess() {
        long sum = 0;
        for (int idx : stridedIndices) {
            sum += sequentialArray[idx];            // Medium: one useful int per cache line
        }
        return sum;
    }
}
Access PatternTime (ns/element)vs Sequential
Sequential0.281.0x
Strided (every 16)1.86.4x slower
Random12.544.7x slower

Random access to the same array is 45x slower than sequential access. The data is identical. The algorithm is identical (sum all elements). The only difference is access order. This is not a microarchitecture detail. This is the difference between a recommendation engine that responds in 10ms and one that responds in 450ms.

The Recommendation Engine: Sorting Candidates by Score

The content platform scores article candidates for each user:

// SLOW: Object-oriented approach with pointer chasing
public class SlowRecommendationEngine {

    record ScoredArticle(Article article, double score) {}

    public List<Article> recommend(User user, List<Article> candidates, int topK) {
        List<ScoredArticle> scored = candidates.stream()       // SLOW: boxing
            .map(a -> new ScoredArticle(a, computeScore(user, a)))
            .sorted(Comparator.comparingDouble(ScoredArticle::score).reversed())
            .limit(topK)
            .toList();
        return scored.stream().map(ScoredArticle::article).toList();
    }

    private double computeScore(User user, Article article) {
        return 0.0; // Feature computation
    }
}
// FAST: Array-based approach with cache locality
public class FastRecommendationEngine {

    public List<Article> recommend(User user, Article[] candidates, int topK) {
        int n = candidates.length;
        double[] scores = new double[n];

        // Phase 1: Compute scores into contiguous array
        for (int i = 0; i < n; i++) {
            scores[i] = computeScore(user, candidates[i]);    // FAST: sequential
        }

        // Phase 2: Find top-k indices using partial sort (min-heap)
        int[] topIndices = topKIndices(scores, topK);

        // Phase 3: Collect results
        List<Article> result = new ArrayList<>(topK);
        for (int idx : topIndices) {
            result.add(candidates[idx]);
        }
        return result;
    }

    private int[] topKIndices(double[] scores, int k) {
        // Min-heap of (score, index) pairs using parallel arrays
        double[] heapScores = new double[k + 1];
        int[] heapIndices = new int[k + 1];
        int heapSize = 0;

        for (int i = 0; i < scores.length; i++) {
            if (heapSize < k) {
                heapScores[heapSize] = scores[i];
                heapIndices[heapSize] = i;
                heapSize++;
                siftUp(heapScores, heapIndices, heapSize - 1);
            } else if (scores[i] > heapScores[0]) {
                heapScores[0] = scores[i];
                heapIndices[0] = i;
                siftDown(heapScores, heapIndices, 0, heapSize);
            }
        }

        return Arrays.copyOf(heapIndices, heapSize);
    }

    private void siftUp(double[] scores, int[] indices, int k) {
        while (k > 0) {
            int parent = (k - 1) / 2;
            if (scores[k] < scores[parent]) {
                swap(scores, indices, k, parent);
                k = parent;
            } else break;
        }
    }

    private void siftDown(double[] scores, int[] indices, int k, int size) {
        while (2 * k + 1 < size) {
            int child = 2 * k + 1;
            if (child + 1 < size && scores[child + 1] < scores[child]) child++;
            if (scores[k] <= scores[child]) break;
            swap(scores, indices, k, child);
            k = child;
        }
    }

    private void swap(double[] scores, int[] indices, int a, int b) {
        double ts = scores[a]; scores[a] = scores[b]; scores[b] = ts;
        int ti = indices[a]; indices[a] = indices[b]; indices[b] = ti;
    }

    private double computeScore(User user, Article article) {
        return 0.0;
    }
}
Approach10K candidates, top 50 (us)100K candidates, top 50 (us)
Stream + sorted + limit1,80022,000
Array scores + heap top-k1801,650

The array-based approach is 10-13x faster. The improvement comes from three sources:

  1. Score computation: Sequential array write vs stream pipeline with lambda dispatch.
  2. Top-k selection: O(n log k) heap vs O(n log n) full sort. For k=50, log2(50)=~6 vs log2(100,000)=~17.
  3. Memory layout: Scores in a contiguous double[] vs ScoredArticle objects scattered on the heap.

Amortized Analysis: ArrayList vs LinkedList

ArrayList uses amortized O(1) append: most appends write to the next slot in the backing array. When the array is full, ArrayList allocates a new array 1.5x larger, copies all elements, and discards the old array. The copy is O(n), but it happens every n/2 operations, so the amortized cost is O(1).

LinkedList uses O(1) append: create a new Node, link it. No copying ever.

The amortized analysis says they are equivalent for append. The benchmark says otherwise:

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 5, time = 1)
@Fork(2)
@State(Scope.Benchmark)
public class AppendBenchmark {

    @Param({"100000", "1000000"})
    int size;

    @Benchmark
    public ArrayList<Integer> arrayListAppend() {
        ArrayList<Integer> list = new ArrayList<>();
        for (int i = 0; i < size; i++) {
            list.add(i);                              // FAST: amortized O(1)
        }
        return list;
    }

    @Benchmark
    public LinkedList<Integer> linkedListAppend() {
        LinkedList<Integer> list = new LinkedList<>();
        for (int i = 0; i < size; i++) {
            list.add(i);                              // SLOW: O(1) but allocates
        }
        return list;
    }

    @Benchmark
    public ArrayList<Integer> arrayListPreSized() {
        ArrayList<Integer> list = new ArrayList<>(size);
        for (int i = 0; i < size; i++) {
            list.add(i);                              // FAST: no resizing
        }
        return list;
    }
}
Collection100K appends (us)1M appends (us)
ArrayList (default)4205,200
ArrayList (pre-sized)3103,800
LinkedList1,80022,000

LinkedList is 4.2x slower than ArrayList for appending 1M elements, despite both being O(1) amortized. The reasons:

  1. Allocation: LinkedList allocates a Node object (24 bytes header + prev/next/item references) per append. ArrayList allocates nothing per append (except during resize).
  2. GC pressure: 1M Node objects create 1M objects for the garbage collector to track.
  3. Cache: LinkedList nodes are scattered across the heap. ArrayList’s backing array is contiguous.

Pre-sizing ArrayList eliminates the resize copies, saving another 27% at 1M elements. For the content platform’s article list building, always pre-size:

// FAST: Pre-sized ArrayList
List<Article> articles = new ArrayList<>(expectedCount);

The sections that follow benchmark parallel streams (when fork-join overhead exceeds the parallelism benefit) and provide a decision framework for choosing algorithms based on real hardware effects rather than theoretical complexity alone.

Algorithmic Complexity on Real Data