Skip to main content
fast by design

Sorting in Practice: TimSort, Parallel Sort, Custom Comparators

9 min read Chapter 32 of 90

Sorting in Practice: TimSort, Parallel Sort, Custom Comparators

The main chapter established the crossover points. This section isolates each sorting mechanism, explains why it performs the way it does, and shows how to exploit each in the content platform.

TimSort Internals: Run Detection and Merge Strategy

TimSort is not a textbook merge sort. It is an adaptive merge sort that scans the input for existing sorted sequences (runs) and merges them in an order that minimizes comparisons. Understanding the run detection mechanism explains why TimSort excels on real data.

TimSort’s algorithm:

  1. Scan the array left to right, identifying runs of ascending or strictly descending elements. Descending runs are reversed in place.
  2. If a natural run is shorter than MIN_MERGE (32 for arrays of 64+ elements), extend it to MIN_MERGE using binary insertion sort.
  3. Push runs onto a stack. Maintain invariants: the top three run lengths satisfy runLen[i-2] > runLen[i-1] + runLen[i] and runLen[i-1] > runLen[i]. When violated, merge adjacent runs.
  4. After scanning all elements, merge remaining runs on the stack.

The genius is step 1. Real-world data has runs. Database query results are partially sorted by primary key. Time-series data has monotonic sections. The recommendation engine’s candidate list retains 90%+ of its previous ordering between re-sorts.

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 5, time = 1)
@Fork(2)
@State(Scope.Benchmark)
public class TimSortRunBenchmark {

    @Param({"100000"})
    int size;

    Integer[] fullyRandom;
    Integer[] fewLongRuns;       // 10 sorted runs of 10,000
    Integer[] manyShortRuns;     // 1,000 sorted runs of 100
    Integer[] almostSorted;     // sorted, then swap 0.5% of elements

    @Setup(Level.Invocation)
    public void setup() {
        ThreadLocalRandom rng = ThreadLocalRandom.current();

        // Fully random
        fullyRandom = new Integer[size];
        for (int i = 0; i < size; i++) fullyRandom[i] = rng.nextInt();

        // Few long runs: 10 sorted segments of 10,000
        fewLongRuns = new Integer[size];
        for (int run = 0; run < 10; run++) {
            int offset = run * 10_000;
            for (int i = 0; i < 10_000; i++) {
                fewLongRuns[offset + i] = offset + i;
            }
        }
        // Shuffle run order
        for (int run = 9; run > 0; run--) {
            int swapWith = rng.nextInt(run + 1);
            for (int i = 0; i < 10_000; i++) {
                Integer tmp = fewLongRuns[run * 10_000 + i];
                fewLongRuns[run * 10_000 + i] = fewLongRuns[swapWith * 10_000 + i];
                fewLongRuns[swapWith * 10_000 + i] = tmp;
            }
        }

        // Many short runs: 1,000 sorted segments of 100
        manyShortRuns = new Integer[size];
        for (int run = 0; run < 1_000; run++) {
            int base = rng.nextInt(1_000_000);
            for (int i = 0; i < 100; i++) {
                manyShortRuns[run * 100 + i] = base + i;
            }
        }

        // Almost sorted: sorted, then perturb 0.5%
        almostSorted = new Integer[size];
        for (int i = 0; i < size; i++) almostSorted[i] = i;
        int perturbations = size / 200;
        for (int i = 0; i < perturbations; i++) {
            int a = rng.nextInt(size);
            int b = rng.nextInt(size);
            Integer tmp = almostSorted[a];
            almostSorted[a] = almostSorted[b];
            almostSorted[b] = tmp;
        }
    }

    @Benchmark
    public Integer[] sortRandom() {
        Arrays.sort(fullyRandom);               // Baseline: no runs
        return fullyRandom;
    }

    @Benchmark
    public Integer[] sortFewLongRuns() {
        Arrays.sort(fewLongRuns);               // FAST: long runs detected
        return fewLongRuns;
    }

    @Benchmark
    public Integer[] sortManyShortRuns() {
        Arrays.sort(manyShortRuns);             // Medium: short runs + merges
        return manyShortRuns;
    }

    @Benchmark
    public Integer[] sortAlmostSorted() {
        Arrays.sort(almostSorted);              // FAST: one long run + fixes
        return almostSorted;
    }
}
Data ShapeTime (us)Comparisons (approx)Speedup vs Random
Fully random7,600~1,660,0001.0x
Few long runs1,200~140,0006.3x
Many short runs3,800~850,0002.0x
Almost sorted890~110,0008.5x

The almost-sorted case is the content platform’s recommendation engine in steady state. TimSort identifies a single long ascending run (the 99.5% that did not change), then merges in the perturbed elements with binary insertion sort. Total comparisons drop from 1.66 million to 110,000.

Parallel Sort Scaling and ForkJoinPool Contention

Arrays.parallelSort uses the common ForkJoinPool. This is the same pool used by parallelStream(), CompletableFuture.supplyAsync() without an explicit executor, and any other ForkJoin task. Under load, parallel sort tasks compete with application tasks for worker threads.

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 5, time = 1)
@Fork(2)
@State(Scope.Benchmark)
public class ParallelSortContentionBenchmark {

    @Param({"500000"})
    int size;

    @Param({"0", "4", "8", "16"})
    int backgroundTasks;

    int[] data;
    List<Future<?>> futures;
    ExecutorService backgroundExecutor;

    @Setup(Level.Trial)
    public void setupTrial() {
        backgroundExecutor = Executors.newFixedThreadPool(backgroundTasks);
    }

    @Setup(Level.Invocation)
    public void setup() {
        data = ThreadLocalRandom.current().ints(size).toArray();
        futures = new ArrayList<>();
        // Simulate background work using the common ForkJoinPool
        for (int i = 0; i < backgroundTasks; i++) {
            futures.add(ForkJoinPool.commonPool().submit(() -> {
                // Busy wait simulating compute work
                long sum = 0;
                for (int j = 0; j < 10_000_000; j++) sum += j;
                return sum;
            }));
        }
    }

    @Benchmark
    public int[] parallelSortUnderLoad() {
        Arrays.parallelSort(data);
        return data;
    }

    @TearDown(Level.Invocation)
    public void tearDown() {
        for (Future<?> f : futures) {
            try { f.cancel(true); } catch (Exception ignored) {}
        }
    }

    @TearDown(Level.Trial)
    public void tearDownTrial() {
        backgroundExecutor.shutdownNow();
    }
}
Background TasksparallelSort Time (us)Degradation
03,100baseline
44,8001.55x slower
87,2002.32x slower
1611,5003.71x slower

With 16 background tasks saturating the common ForkJoinPool, parallelSort degrades to slower than sequential sort (which takes ~10,200us at 500,000 elements). The parallel sort splits work but cannot get worker threads.

Symptom: Recommendation engine latency spikes during batch processing. Cause: Batch reindex uses parallelStream() for article embedding, saturating the common ForkJoinPool. Concurrent recommendation sorts using parallelSort contend for the same threads. Fix: Use Arrays.sort (sequential) for the recommendation engine. Use a dedicated ForkJoinPool for batch reindex:

// FAST: Isolate batch processing from request-serving sorts
ForkJoinPool batchPool = new ForkJoinPool(
    Runtime.getRuntime().availableProcessors() / 2  // Half the cores
);

batchPool.submit(() -> {
    articles.parallelStream()
        .map(this::computeEmbedding)
        .forEach(this::storeEmbedding);
}).join();

Trade-off: Dedicated pool reduces maximum batch throughput but eliminates contention with latency-sensitive request paths.

Primitive vs Object Sorting: The Boxing Gap

Arrays.sort(int[]) uses dual-pivot quicksort on raw primitives. No object allocation. No comparator dispatch. Arrays.sort(Integer[]) uses TimSort with object comparisons. The gap is substantial:

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 5, time = 1)
@Fork(2)
@State(Scope.Benchmark)
public class PrimitiveVsObjectSortBenchmark {

    @Param({"100000", "500000", "1000000"})
    int size;

    int[] primitiveArray;
    Integer[] boxedArray;

    @Setup(Level.Invocation)
    public void setup() {
        ThreadLocalRandom rng = ThreadLocalRandom.current();
        primitiveArray = rng.ints(size).toArray();
        boxedArray = new Integer[size];
        for (int i = 0; i < size; i++) {
            boxedArray[i] = primitiveArray[i];
        }
    }

    @Benchmark
    public int[] sortPrimitive() {
        Arrays.sort(primitiveArray);            // FAST: dual-pivot, no boxing
        return primitiveArray;
    }

    @Benchmark
    public Integer[] sortBoxed() {
        Arrays.sort(boxedArray);                // SLOW: TimSort, object overhead
        return boxedArray;
    }
}
SizePrimitive (us)Boxed (us)Boxed/Primitive
100,0001,8207,6004.2x slower
500,00010,20042,0004.1x slower
1,000,00021,50089,0004.1x slower

The 4x gap comes from three sources: (1) each Integer object adds 12 bytes of header overhead, causing more cache misses, (2) comparisons go through Comparable.compareTo rather than direct primitive comparison, and (3) TimSort’s merge allocates temporary arrays of object references while dual-pivot quicksort is in-place.

Fix for the content platform: The trending articles endpoint sorts articles by view count. Store view counts in a parallel int[] array alongside the Article[] array. Sort the int[] to determine the top-k indices, then select articles by index:

// FAST: Sort primitive indices, avoid sorting objects
public List<Article> topKByViewCount(Article[] articles, int k) {
    int n = articles.length;
    int[][] indexedCounts = new int[n][2];
    for (int i = 0; i < n; i++) {
        indexedCounts[i][0] = articles[i].viewCount();
        indexedCounts[i][1] = i;
    }
    // Sort by view count descending using primitive comparison
    Arrays.sort(indexedCounts, (a, b) -> Integer.compare(b[0], a[0]));

    List<Article> result = new ArrayList<>(k);
    for (int i = 0; i < Math.min(k, n); i++) {
        result.add(articles[indexedCounts[i][1]]);
    }
    return result;
}

For top-k specifically, a partial sort using a min-heap of size k is even faster at O(n log k) instead of O(n log n):

// FAST: Partial sort with bounded heap for top-k
public List<Article> topKWithHeap(Article[] articles, int k) {
    PriorityQueue<Article> heap = new PriorityQueue<>(k + 1,
        Comparator.comparingInt(Article::viewCount));  // Min-heap

    for (Article a : articles) {
        heap.offer(a);
        if (heap.size() > k) {
            heap.poll();  // Remove smallest
        }
    }

    List<Article> result = new ArrayList<>(heap);
    result.sort(Comparator.comparingInt(Article::viewCount).reversed());
    return result;
}
Approach100K articles, top 100 (us)
Full sort + subList7,600
Indexed primitive sort2,100
Min-heap partial sort340

The min-heap is 22x faster than full object sorting for top-100 extraction from 100,000 articles. It performs only 100,000 insertions and 99,900 polls, each O(log 100) = ~7 comparisons. Total comparisons: ~700,000 vs ~1,660,000 for full sort.

Comparator Inlining: When the JIT Cannot Help

The JIT compiler inlines small methods to avoid call overhead. Lambda comparators are typically small enough to inline. But chained comparators using thenComparing create a call chain that the JIT may not fully inline.

// Three-level comparator chain
Comparator<Article> chained = Comparator
    .comparingInt(Article::viewCount)
    .thenComparingInt(Article::commentCount)
    .thenComparing(Article::publishDate);

This creates three Comparator objects linked by thenComparing. Each comparison invocation calls: outer comparator -> first comparator -> key extractor -> comparison -> if equal, second comparator -> key extractor -> comparison -> if equal, third comparator. The JIT sees a polymorphic call site at each compare invocation and may not inline the full chain.

// FAST: Single lambda, fully inlineable
Comparator<Article> manual = (a, b) -> {
    int cmp = Integer.compare(a.viewCount(), b.viewCount());
    if (cmp != 0) return cmp;
    cmp = Integer.compare(a.commentCount(), b.commentCount());
    if (cmp != 0) return cmp;
    return a.publishDate().compareTo(b.publishDate());
};

The manual version is a single lambda. The JIT inlines it at the call site. No intermediate object dispatch. No polymorphic call chain. At 100,000 elements with three comparison fields, the manual comparator is 20-30% faster.

When to use chained comparators: Below 10,000 elements, or in cold paths where readability matters more than throughput. The clarity of Comparator.comparing().thenComparing() is significant.

When to use manual comparators: Above 100,000 elements in hot paths that run on every request. The recommendation engine, trending endpoint, and article reindex are all candidates.