Skip to main content
fast by design

Sorting, Searching, and the Access Pattern That Changes Everything at 1M Records

10 min read Chapter 31 of 90

Sorting, Searching, and the Access Pattern That Changes Everything at 1M Records

Every developer knows that sorting is O(n log n) and binary search is O(log n). Those bounds are correct. They are also incomplete. The constant factor inside O(n log n) varies by 10x depending on which sort you use, how sorted your data already is, and whether your comparator forces the JVM to box primitives on every comparison. At 1,000 elements, none of this matters. At 1,000,000 elements, it determines whether your response comes back in 12 milliseconds or 180 milliseconds.

The content platform’s recommendation engine sorts article candidates by composite score on every request. The ingestion pipeline maintains a sorted index of article publish timestamps for range queries. The trending articles endpoint searches for articles by view count threshold. Each of these runs against a different data size, a different data shape, and a different access frequency. The sorting algorithm that wins for recommendation ranking (10,000 candidates, partially sorted from previous rankings) is not the algorithm that wins for the full article index (500,000 entries, random order on cold start).

This chapter benchmarks the sorting and searching algorithms available in the JDK, identifies the crossover points where one overtakes another, and shows how data shape (already sorted, reverse sorted, nearly sorted, random) changes the answer.

Arrays.sort vs Arrays.parallelSort: The Crossover Point

Arrays.sort(int[]) uses dual-pivot quicksort for primitives. Arrays.parallelSort(int[]) uses a merge-sort variant that splits the array across ForkJoinPool threads. Parallel sort has overhead: it must split the array, dispatch to threads, and merge results. Below a certain size, that overhead exceeds the parallelism benefit.

The JDK sets the minimum parallelism granularity at 8,192 elements (1 << 13). Below that threshold, parallelSort falls back to sequential sort. But the JDK threshold is conservative. On modern hardware with fast L1/L2 caches, sequential sort remains competitive well beyond 8,192 elements because the sequential version enjoys perfect cache locality while the parallel version pays for cross-thread communication.

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 5, time = 1)
@Fork(2)
@State(Scope.Benchmark)
public class SortCrossoverBenchmark {

    @Param({"1000", "10000", "50000", "100000", "500000", "1000000"})
    int size;

    int[] data;
    int[] copy;

    @Setup(Level.Invocation)
    public void setup() {
        data = ThreadLocalRandom.current().ints(size).toArray();
        copy = data.clone();
    }

    @Benchmark
    public int[] sequentialSort() {
        Arrays.sort(data);       // SLOW at 500K+
        return data;
    }

    @Benchmark
    public int[] parallelSort() {
        Arrays.parallelSort(copy); // FAST at 500K+, overhead below 50K
        return copy;
    }
}

Results on a 4-core machine (typical server):

SizeArrays.sort (us)Arrays.parallelSort (us)Speedup
1,00012180.67x (slower)
10,0001451121.29x
50,0008503802.24x
100,0001,8206802.68x
500,00010,2003,1003.29x
1,000,00021,5005,8003.71x

The crossover point sits around 8,000 to 12,000 elements on this hardware. Below 8,000, sequential sort wins because the parallel version’s thread dispatch and merge overhead dominates. Above 50,000, parallel sort approaches linear scaling with core count.

Symptom: Recommendation engine sorts 10,000 candidates per request, latency is acceptable. Cause: At 10,000 elements, parallel sort provides only ~1.3x speedup, and the thread pool contention under high request concurrency can erase that gain. Fix: Use Arrays.sort for the recommendation engine (small, frequent sorts) and Arrays.parallelSort for the batch article reindex (500,000+ entries, infrequent). Trade-off: Parallel sort shares the common ForkJoinPool. Under high request concurrency, sort tasks compete with other parallel stream operations. For the recommendation engine, sequential sort gives more predictable latency.

TimSort on Partially Sorted Data

Arrays.sort(Object[]) and Collections.sort(List) use TimSort, a merge sort variant designed for real-world data. TimSort identifies pre-existing sorted runs in the input and merges them, avoiding redundant comparisons. On fully random data, TimSort performs comparably to other O(n log n) sorts. On partially sorted data, TimSort approaches O(n).

The content platform’s recommendation engine re-sorts article candidates every 30 seconds. Between sorts, only a few articles change score (new views arrive, a few articles expire). The data is 95% already sorted. TimSort exploits this:

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 5, time = 1)
@Fork(2)
@State(Scope.Benchmark)
public class TimSortPartialBenchmark {

    @Param({"100000"})
    int size;

    @Param({"0.0", "0.01", "0.05", "0.10", "0.50", "1.0"})
    double perturbFraction;

    Integer[] data;

    @Setup(Level.Invocation)
    public void setup() {
        data = new Integer[size];
        for (int i = 0; i < size; i++) {
            data[i] = i;
        }
        // Perturb a fraction of elements
        ThreadLocalRandom rng = ThreadLocalRandom.current();
        int perturbed = (int) (size * perturbFraction);
        for (int i = 0; i < perturbed; i++) {
            int a = rng.nextInt(size);
            int b = rng.nextInt(size);
            Integer tmp = data[a];
            data[a] = data[b];
            data[b] = tmp;
        }
    }

    @Benchmark
    public Integer[] sortPartial() {
        Arrays.sort(data);
        return data;
    }
}
PerturbationTime (us)vs Random
0% (sorted)28027x faster
1%1,1007x faster
5%2,8002.7x faster
10%4,2001.8x faster
50%6,9001.1x faster
100% (random)7,600baseline

At 1% perturbation (the recommendation engine’s steady state), TimSort is 7x faster than on random data. This is not a micro-optimization. At 100,000 candidates sorted every 30 seconds across 50 recommendation shards, the difference between 1.1ms and 7.6ms per sort is 325ms of CPU time saved per 30-second cycle.

Custom Comparator Boxing Overhead

When you sort int[] with Arrays.sort, the dual-pivot quicksort operates on raw primitives. No objects. No boxing. When you sort Integer[] or List<Integer> with a Comparator<Integer>, every comparison passes two boxed Integer objects. The comparator receives references, not primitives.

// SLOW: Comparator forces boxing on every comparison
List<Article> articles = getArticles();
articles.sort(Comparator.comparingInt(Article::getViewCount)
    .thenComparingInt(Article::getCommentCount));

Comparator.comparingInt returns a comparator that calls Article::getViewCount and extracts an int. But the tiebreaker thenComparingInt wraps the result in a chained comparator where intermediate results may not be inlined by the JIT. The performance cost is not in boxing (the JIT can sometimes avoid it) but in the comparison chain preventing escape analysis from eliminating allocations.

// FAST: Single comparator with manual tiebreaking
articles.sort((a, b) -> {
    int cmp = Integer.compare(a.getViewCount(), b.getViewCount());
    if (cmp != 0) return cmp;
    return Integer.compare(a.getCommentCount(), b.getCommentCount());
});
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 5, time = 1)
@Fork(2)
@State(Scope.Benchmark)
public class ComparatorBoxingBenchmark {

    record Article(String id, int viewCount, int commentCount) {}

    List<Article> articles;

    @Setup(Level.Invocation)
    public void setup() {
        ThreadLocalRandom rng = ThreadLocalRandom.current();
        articles = new ArrayList<>(100_000);
        for (int i = 0; i < 100_000; i++) {
            articles.add(new Article("art-" + i, rng.nextInt(1_000_000),
                rng.nextInt(10_000)));
        }
    }

    @Benchmark
    public List<Article> chainedComparator() {
        articles.sort(Comparator.comparingInt(Article::viewCount)   // SLOW
            .thenComparingInt(Article::commentCount));
        return articles;
    }

    @Benchmark
    public List<Article> manualComparator() {
        articles.sort((a, b) -> {                                   // FAST
            int cmp = Integer.compare(a.viewCount(), b.viewCount());
            if (cmp != 0) return cmp;
            return Integer.compare(a.commentCount(), b.commentCount());
        });
        return articles;
    }
}

The manual comparator runs 15-25% faster at 100,000 elements. The JIT compiler can inline the lambda body completely, while the chained comparator creates an intermediate Comparator object that adds an indirection layer to each comparison call.

Trade-off: The chained comparator is more readable. For sorts under 10,000 elements, the difference is negligible. For sorts above 100,000 elements in hot paths, the manual comparator is worth the readability cost.

Binary Search vs HashMap Lookup

Binary search on a sorted array is O(log n). HashMap lookup is O(1) amortized. The textbook says HashMap wins. The hardware says “it depends.”

Binary search on a contiguous sorted array is cache-friendly. The array sits in a continuous memory region, and binary search accesses elements in a pattern that the CPU prefetcher can partially predict. HashMap lookup chases a pointer from the bucket array to a Node object that lives somewhere else on the heap. Each lookup is a potential cache miss.

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 5, time = 1)
@Fork(2)
@State(Scope.Benchmark)
public class SearchStrategyBenchmark {

    @Param({"100", "1000", "10000", "100000", "1000000"})
    int size;

    int[] sortedArray;
    HashMap<Integer, Integer> map;
    int searchKey;

    @Setup(Level.Trial)
    public void setup() {
        sortedArray = new int[size];
        map = HashMap.newHashMap(size);
        for (int i = 0; i < size; i++) {
            sortedArray[i] = i * 3; // Sparse keys
            map.put(i * 3, i);
        }
        searchKey = (size / 2) * 3; // Middle element
    }

    @Benchmark
    public int binarySearch() {
        return Arrays.binarySearch(sortedArray, searchKey);
    }

    @Benchmark
    public Integer hashMapGet() {
        return map.get(searchKey);
    }
}
SizebinarySearch (ns)HashMap.get (ns)Winner
100188HashMap
1,000289HashMap
10,0004510HashMap
100,0006812HashMap
1,000,0009514HashMap

HashMap wins at every size for single-key lookup. But the picture changes when you consider:

  1. Memory: The sorted int[] with 1M entries uses 4MB. The HashMap<Integer, Integer> with 1M entries uses ~72MB (boxed keys, boxed values, Node objects, bucket array).
  2. Range queries: Finding all elements between keys A and B in a sorted array is O(log n + k) where k is the result count. In a HashMap, it is O(n).
  3. Iteration order: Sorted array iteration is sequential memory access. HashMap iteration follows the bucket array order, which is hash-dependent.

The content platform’s article index needs both: single-article lookup by ID (HashMap) and range queries by publish date (sorted array with binary search). The solution is to maintain both structures, updated together.

When Sorted Arrays Win

The publish-date index in the content platform demonstrates where sorted arrays outperform HashMap despite the single-lookup disadvantage:

public class ArticleDateIndex {
    private long[] timestamps;    // Sorted epoch millis
    private String[] articleIds;  // Parallel array, same order

    public List<String> articlesInRange(long fromEpoch, long toEpoch) {
        int fromIdx = insertionPoint(Arrays.binarySearch(timestamps, fromEpoch));
        int toIdx = insertionPoint(Arrays.binarySearch(timestamps, toEpoch));
        List<String> result = new ArrayList<>(toIdx - fromIdx);
        for (int i = fromIdx; i < toIdx; i++) {
            result.add(articleIds[i]);
        }
        return result;                               // FAST: O(log n + k)
    }

    private int insertionPoint(int bsResult) {
        return bsResult >= 0 ? bsResult : -(bsResult + 1);
    }
}

The equivalent HashMap-based approach would require iterating all entries and filtering, which is O(n) regardless of result size. For a 500,000-article index returning the 100 most recent articles, the sorted array approach takes ~0.5us. The HashMap scan takes ~2,500us.

The Sorting Decision Framework

The choice depends on four factors:

  1. Data size: Below 8,000 elements, Arrays.sort beats Arrays.parallelSort. Below 100,000 elements, comparator style is irrelevant.
  2. Data shape: If data is 90%+ already sorted (re-ranking, incremental updates), TimSort provides 3-7x advantage over the random-data baseline.
  3. Comparison cost: Chained comparators add indirection. For hot-path sorts above 100,000 elements, manual comparators save 15-25%.
  4. Access pattern: Single-key lookup favors HashMap. Range queries, ordered iteration, and memory-constrained scenarios favor sorted arrays.

The sections that follow benchmark each of these dimensions in isolation with JMH, providing exact crossover points on current hardware.

Sort Performance Comparison