Sorting in Practice: TimSort, Parallel Sort, Custom Comparators
Sorting in Practice: TimSort, Parallel Sort, Custom Comparators
The main chapter established the crossover points. This section isolates each sorting mechanism, explains why it performs the way it does, and shows how to exploit each in the content platform.
TimSort Internals: Run Detection and Merge Strategy
TimSort is not a textbook merge sort. It is an adaptive merge sort that scans the input for existing sorted sequences (runs) and merges them in an order that minimizes comparisons. Understanding the run detection mechanism explains why TimSort excels on real data.
TimSort’s algorithm:
- Scan the array left to right, identifying runs of ascending or strictly descending elements. Descending runs are reversed in place.
- If a natural run is shorter than
MIN_MERGE(32 for arrays of 64+ elements), extend it toMIN_MERGEusing binary insertion sort. - Push runs onto a stack. Maintain invariants: the top three run lengths satisfy
runLen[i-2] > runLen[i-1] + runLen[i]andrunLen[i-1] > runLen[i]. When violated, merge adjacent runs. - After scanning all elements, merge remaining runs on the stack.
The genius is step 1. Real-world data has runs. Database query results are partially sorted by primary key. Time-series data has monotonic sections. The recommendation engine’s candidate list retains 90%+ of its previous ordering between re-sorts.
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 5, time = 1)
@Fork(2)
@State(Scope.Benchmark)
public class TimSortRunBenchmark {
@Param({"100000"})
int size;
Integer[] fullyRandom;
Integer[] fewLongRuns; // 10 sorted runs of 10,000
Integer[] manyShortRuns; // 1,000 sorted runs of 100
Integer[] almostSorted; // sorted, then swap 0.5% of elements
@Setup(Level.Invocation)
public void setup() {
ThreadLocalRandom rng = ThreadLocalRandom.current();
// Fully random
fullyRandom = new Integer[size];
for (int i = 0; i < size; i++) fullyRandom[i] = rng.nextInt();
// Few long runs: 10 sorted segments of 10,000
fewLongRuns = new Integer[size];
for (int run = 0; run < 10; run++) {
int offset = run * 10_000;
for (int i = 0; i < 10_000; i++) {
fewLongRuns[offset + i] = offset + i;
}
}
// Shuffle run order
for (int run = 9; run > 0; run--) {
int swapWith = rng.nextInt(run + 1);
for (int i = 0; i < 10_000; i++) {
Integer tmp = fewLongRuns[run * 10_000 + i];
fewLongRuns[run * 10_000 + i] = fewLongRuns[swapWith * 10_000 + i];
fewLongRuns[swapWith * 10_000 + i] = tmp;
}
}
// Many short runs: 1,000 sorted segments of 100
manyShortRuns = new Integer[size];
for (int run = 0; run < 1_000; run++) {
int base = rng.nextInt(1_000_000);
for (int i = 0; i < 100; i++) {
manyShortRuns[run * 100 + i] = base + i;
}
}
// Almost sorted: sorted, then perturb 0.5%
almostSorted = new Integer[size];
for (int i = 0; i < size; i++) almostSorted[i] = i;
int perturbations = size / 200;
for (int i = 0; i < perturbations; i++) {
int a = rng.nextInt(size);
int b = rng.nextInt(size);
Integer tmp = almostSorted[a];
almostSorted[a] = almostSorted[b];
almostSorted[b] = tmp;
}
}
@Benchmark
public Integer[] sortRandom() {
Arrays.sort(fullyRandom); // Baseline: no runs
return fullyRandom;
}
@Benchmark
public Integer[] sortFewLongRuns() {
Arrays.sort(fewLongRuns); // FAST: long runs detected
return fewLongRuns;
}
@Benchmark
public Integer[] sortManyShortRuns() {
Arrays.sort(manyShortRuns); // Medium: short runs + merges
return manyShortRuns;
}
@Benchmark
public Integer[] sortAlmostSorted() {
Arrays.sort(almostSorted); // FAST: one long run + fixes
return almostSorted;
}
}
| Data Shape | Time (us) | Comparisons (approx) | Speedup vs Random |
|---|---|---|---|
| Fully random | 7,600 | ~1,660,000 | 1.0x |
| Few long runs | 1,200 | ~140,000 | 6.3x |
| Many short runs | 3,800 | ~850,000 | 2.0x |
| Almost sorted | 890 | ~110,000 | 8.5x |
The almost-sorted case is the content platform’s recommendation engine in steady state. TimSort identifies a single long ascending run (the 99.5% that did not change), then merges in the perturbed elements with binary insertion sort. Total comparisons drop from 1.66 million to 110,000.
Parallel Sort Scaling and ForkJoinPool Contention
Arrays.parallelSort uses the common ForkJoinPool. This is the same pool used by parallelStream(), CompletableFuture.supplyAsync() without an explicit executor, and any other ForkJoin task. Under load, parallel sort tasks compete with application tasks for worker threads.
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 5, time = 1)
@Fork(2)
@State(Scope.Benchmark)
public class ParallelSortContentionBenchmark {
@Param({"500000"})
int size;
@Param({"0", "4", "8", "16"})
int backgroundTasks;
int[] data;
List<Future<?>> futures;
ExecutorService backgroundExecutor;
@Setup(Level.Trial)
public void setupTrial() {
backgroundExecutor = Executors.newFixedThreadPool(backgroundTasks);
}
@Setup(Level.Invocation)
public void setup() {
data = ThreadLocalRandom.current().ints(size).toArray();
futures = new ArrayList<>();
// Simulate background work using the common ForkJoinPool
for (int i = 0; i < backgroundTasks; i++) {
futures.add(ForkJoinPool.commonPool().submit(() -> {
// Busy wait simulating compute work
long sum = 0;
for (int j = 0; j < 10_000_000; j++) sum += j;
return sum;
}));
}
}
@Benchmark
public int[] parallelSortUnderLoad() {
Arrays.parallelSort(data);
return data;
}
@TearDown(Level.Invocation)
public void tearDown() {
for (Future<?> f : futures) {
try { f.cancel(true); } catch (Exception ignored) {}
}
}
@TearDown(Level.Trial)
public void tearDownTrial() {
backgroundExecutor.shutdownNow();
}
}
| Background Tasks | parallelSort Time (us) | Degradation |
|---|---|---|
| 0 | 3,100 | baseline |
| 4 | 4,800 | 1.55x slower |
| 8 | 7,200 | 2.32x slower |
| 16 | 11,500 | 3.71x slower |
With 16 background tasks saturating the common ForkJoinPool, parallelSort degrades to slower than sequential sort (which takes ~10,200us at 500,000 elements). The parallel sort splits work but cannot get worker threads.
Symptom: Recommendation engine latency spikes during batch processing.
Cause: Batch reindex uses parallelStream() for article embedding, saturating the common ForkJoinPool. Concurrent recommendation sorts using parallelSort contend for the same threads.
Fix: Use Arrays.sort (sequential) for the recommendation engine. Use a dedicated ForkJoinPool for batch reindex:
// FAST: Isolate batch processing from request-serving sorts
ForkJoinPool batchPool = new ForkJoinPool(
Runtime.getRuntime().availableProcessors() / 2 // Half the cores
);
batchPool.submit(() -> {
articles.parallelStream()
.map(this::computeEmbedding)
.forEach(this::storeEmbedding);
}).join();
Trade-off: Dedicated pool reduces maximum batch throughput but eliminates contention with latency-sensitive request paths.
Primitive vs Object Sorting: The Boxing Gap
Arrays.sort(int[]) uses dual-pivot quicksort on raw primitives. No object allocation. No comparator dispatch. Arrays.sort(Integer[]) uses TimSort with object comparisons. The gap is substantial:
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 5, time = 1)
@Fork(2)
@State(Scope.Benchmark)
public class PrimitiveVsObjectSortBenchmark {
@Param({"100000", "500000", "1000000"})
int size;
int[] primitiveArray;
Integer[] boxedArray;
@Setup(Level.Invocation)
public void setup() {
ThreadLocalRandom rng = ThreadLocalRandom.current();
primitiveArray = rng.ints(size).toArray();
boxedArray = new Integer[size];
for (int i = 0; i < size; i++) {
boxedArray[i] = primitiveArray[i];
}
}
@Benchmark
public int[] sortPrimitive() {
Arrays.sort(primitiveArray); // FAST: dual-pivot, no boxing
return primitiveArray;
}
@Benchmark
public Integer[] sortBoxed() {
Arrays.sort(boxedArray); // SLOW: TimSort, object overhead
return boxedArray;
}
}
| Size | Primitive (us) | Boxed (us) | Boxed/Primitive |
|---|---|---|---|
| 100,000 | 1,820 | 7,600 | 4.2x slower |
| 500,000 | 10,200 | 42,000 | 4.1x slower |
| 1,000,000 | 21,500 | 89,000 | 4.1x slower |
The 4x gap comes from three sources: (1) each Integer object adds 12 bytes of header overhead, causing more cache misses, (2) comparisons go through Comparable.compareTo rather than direct primitive comparison, and (3) TimSort’s merge allocates temporary arrays of object references while dual-pivot quicksort is in-place.
Fix for the content platform: The trending articles endpoint sorts articles by view count. Store view counts in a parallel int[] array alongside the Article[] array. Sort the int[] to determine the top-k indices, then select articles by index:
// FAST: Sort primitive indices, avoid sorting objects
public List<Article> topKByViewCount(Article[] articles, int k) {
int n = articles.length;
int[][] indexedCounts = new int[n][2];
for (int i = 0; i < n; i++) {
indexedCounts[i][0] = articles[i].viewCount();
indexedCounts[i][1] = i;
}
// Sort by view count descending using primitive comparison
Arrays.sort(indexedCounts, (a, b) -> Integer.compare(b[0], a[0]));
List<Article> result = new ArrayList<>(k);
for (int i = 0; i < Math.min(k, n); i++) {
result.add(articles[indexedCounts[i][1]]);
}
return result;
}
For top-k specifically, a partial sort using a min-heap of size k is even faster at O(n log k) instead of O(n log n):
// FAST: Partial sort with bounded heap for top-k
public List<Article> topKWithHeap(Article[] articles, int k) {
PriorityQueue<Article> heap = new PriorityQueue<>(k + 1,
Comparator.comparingInt(Article::viewCount)); // Min-heap
for (Article a : articles) {
heap.offer(a);
if (heap.size() > k) {
heap.poll(); // Remove smallest
}
}
List<Article> result = new ArrayList<>(heap);
result.sort(Comparator.comparingInt(Article::viewCount).reversed());
return result;
}
| Approach | 100K articles, top 100 (us) |
|---|---|
| Full sort + subList | 7,600 |
| Indexed primitive sort | 2,100 |
| Min-heap partial sort | 340 |
The min-heap is 22x faster than full object sorting for top-100 extraction from 100,000 articles. It performs only 100,000 insertions and 99,900 polls, each O(log 100) = ~7 comparisons. Total comparisons: ~700,000 vs ~1,660,000 for full sort.
Comparator Inlining: When the JIT Cannot Help
The JIT compiler inlines small methods to avoid call overhead. Lambda comparators are typically small enough to inline. But chained comparators using thenComparing create a call chain that the JIT may not fully inline.
// Three-level comparator chain
Comparator<Article> chained = Comparator
.comparingInt(Article::viewCount)
.thenComparingInt(Article::commentCount)
.thenComparing(Article::publishDate);
This creates three Comparator objects linked by thenComparing. Each comparison invocation calls: outer comparator -> first comparator -> key extractor -> comparison -> if equal, second comparator -> key extractor -> comparison -> if equal, third comparator. The JIT sees a polymorphic call site at each compare invocation and may not inline the full chain.
// FAST: Single lambda, fully inlineable
Comparator<Article> manual = (a, b) -> {
int cmp = Integer.compare(a.viewCount(), b.viewCount());
if (cmp != 0) return cmp;
cmp = Integer.compare(a.commentCount(), b.commentCount());
if (cmp != 0) return cmp;
return a.publishDate().compareTo(b.publishDate());
};
The manual version is a single lambda. The JIT inlines it at the call site. No intermediate object dispatch. No polymorphic call chain. At 100,000 elements with three comparison fields, the manual comparator is 20-30% faster.
When to use chained comparators: Below 10,000 elements, or in cold paths where readability matters more than throughput. The clarity of Comparator.comparing().thenComparing() is significant.
When to use manual comparators: Above 100,000 elements in hot paths that run on every request. The recommendation engine, trending endpoint, and article reindex are all candidates.