Caffeine Internals and Hit Rate Optimization

The main chapter showed that Caffeine’s W-TinyLFU eviction policy achieves 94.2% hit rates on the content platform’s Zipfian workload, compared to 82.1% for LRU. This section explains why: how W-TinyLFU works internally, how the frequency sketch tracks access patterns with constant memory, and how to configure Caffeine for maximum hit rates.

W-TinyLFU: The Eviction Policy

W-TinyLFU (Window Tiny Least Frequently Used) combines recency and frequency into an eviction policy that resists both scan pollution and frequency aging. It divides the cache into three regions:

Admission Window (1% of total size): A small LRU region where new entries land. New entries must survive this window before being considered for the main space.
Probation Segment (20% of total size): Entries promoted from the admission window. These are candidates for eviction if they do not prove their frequency.
Protected Segment (79% of total size): Entries that have proven their value through repeated access. Eviction from this segment requires low frequency relative to incoming candidates.

When the cache is full and a new entry arrives:

The new entry enters the admission window.
If the admission window is full, the LRU victim from the window is compared against the LRU victim from the probation segment.
The frequency sketch determines which entry has been accessed more often. The higher-frequency entry stays; the lower-frequency entry is evicted.

This two-step process means a scan (which touches each entry exactly once) cannot displace established high-frequency entries. The scan entries enter the admission window, but their frequency count (1) is lower than the frequency count of established entries (50+), so they lose the admission contest and are evicted immediately.

// Demonstrating W-TinyLFU behavior under scan attack
public class ScanResistanceDemo {

    public static void main(String[] args) {
        Cache<Integer, String> cache = Caffeine.newBuilder()
            .maximumSize(1_000)
            .recordStats()
            .build();

        // Phase 1: Build frequency for 1,000 hot keys
        // Each key accessed 50 times
        for (int round = 0; round < 50; round++) {
            for (int key = 0; key < 1_000; key++) {
                cache.get(key, k -> "value-" + k);
            }
        }

        CacheStats before = cache.stats();
        System.out.println("Before scan - Hit rate: " +
            String.format("%.1f%%", before.hitRate() * 100));

        // Phase 2: Scan 100,000 unique keys (100x cache size)
        for (int key = 1_000; key < 101_000; key++) {
            cache.get(key, k -> "scan-" + k);
        }

        // Phase 3: Access the original hot keys again
        long hits = 0;
        for (int key = 0; key < 1_000; key++) {
            if (cache.getIfPresent(key) != null) {
                hits++;
            }
        }

        System.out.println("Hot keys surviving scan: " + hits + "/1000");
        // W-TinyLFU: ~920/1000 survive
        // LRU: 0/1000 survive
    }
}

With W-TinyLFU, approximately 920 of the 1,000 hot keys survive a scan of 100,000 unique entries. With LRU, zero survive. The 80 evicted keys are the ones in the admission window and probation segment at the time of the scan, where frequency history is thinner.

The Frequency Sketch: Count-Min Sketch

W-TinyLFU needs to know how frequently each key has been accessed. Storing exact counts for every key ever seen would require unbounded memory. Instead, Caffeine uses a Count-Min Sketch: a probabilistic data structure that estimates frequency with bounded memory and bounded error.

The Count-Min Sketch is a 2D array of counters with $d$ rows and $w$ columns. Each row uses a different hash function. To increment the count for a key:

Hash the key with each of the $d$ hash functions.
Increment the counter at position $h_i(key) \mod w$ in row $i$.

To estimate the count for a key:

Hash the key with each hash function.
Return the minimum counter value across all rows.

The minimum is taken because hash collisions can only inflate counts, never deflate them. The minimum across multiple hash functions gives the tightest estimate.

// Simplified Count-Min Sketch (Caffeine's actual implementation
// uses 4-bit counters packed into longs for density)
public class FrequencySketch {
    private final int[][] table;
    private final int depth;
    private final int width;
    private int sampleSize;
    private int size;

    public FrequencySketch(int maximumSize) {
        this.depth = 4;
        // Width is next power of 2 >= maximumSize
        this.width = Integer.highestOneBit(maximumSize - 1) << 1;
        this.table = new int[depth][width];
        this.sampleSize = maximumSize * 10;
    }

    public void increment(int hashCode) {
        for (int i = 0; i < depth; i++) {
            int index = spread(hashCode, i) & (width - 1);
            table[i][index] = Math.min(table[i][index] + 1, 15);
            // Max count is 15 (4-bit counter)
        }
        if (++size >= sampleSize) {
            reset(); // Halve all counters (aging)
        }
    }

    public int frequency(int hashCode) {
        int min = Integer.MAX_VALUE;
        for (int i = 0; i < depth; i++) {
            int index = spread(hashCode, i) & (width - 1);
            min = Math.min(min, table[i][index]);
        }
        return min;
    }

    private void reset() {
        // Halve all counters to age out stale frequency data
        for (int i = 0; i < depth; i++) {
            for (int j = 0; j < width; j++) {
                table[i][j] = table[i][j] >>> 1;
            }
        }
        size = size / 2;
    }

    private int spread(int hashCode, int row) {
        // Different hash per row using bit mixing
        hashCode ^= row * 0x9e3779b9;
        hashCode ^= (hashCode >>> 16);
        return hashCode;
    }
}

Caffeine’s actual implementation packs 4-bit counters (max value 15) into long values, storing 16 counters per long. For a cache of 10,000 entries, the frequency sketch uses roughly 10,000 / 16 * 8 bytes * 4 rows = 20 KB. This is negligible compared to the cached data.

The reset() method is the aging mechanism. After every $10 \times \text{maximumSize}$ increments, all counters are halved. This ensures that frequency reflects recent access patterns, not historical ones. Without aging, an article that was popular last month but is no longer accessed would retain a high frequency count and resist eviction indefinitely.

Counter Saturation and Accuracy

The 4-bit counter (max value 15) seems too small. Can the sketch distinguish between an entry accessed 100 times and one accessed 1,000 times? No. But it does not need to. The eviction decision is binary: is this entry worth keeping or not? An entry with frequency 15 (saturated) is clearly worth keeping. An entry with frequency 1 is clearly worth evicting. The sketch needs to distinguish “frequently accessed” from “rarely accessed,” not provide exact counts.

The accuracy of the Count-Min Sketch depends on the width-to-items ratio:

$$P(\text{overcount} > \epsilon \cdot n) \leq \left(\frac{e}{w/n}\right)^d$$

For Caffeine’s defaults ($w = \text{maximumSize}$, $d = 4$):

$$P(\text{overcount} > \epsilon \cdot n) \leq \left(\frac{e}{1}\right)^4 \approx 54.6$$

This looks bad in isolation, but the aging mechanism (halving every $10 \times \text{maximumSize}$ increments) bounds the total count, and the 4-bit saturation prevents any single key from dominating. In practice, Caffeine’s sketch produces eviction decisions within 1% of optimal (measured against Belady’s MIN algorithm, which has perfect future knowledge).

Caffeine Configuration Deep Dive

Beyond the basics covered in the main chapter, Caffeine offers configuration options that affect hit rates, memory usage, and contention under load.

Expiration Strategies

// Time-based expiration: fixed TTL after write
Cache<String, Article> fixedTtl = Caffeine.newBuilder()
    .maximumSize(10_000)
    .expireAfterWrite(Duration.ofMinutes(5))
    .build();

// Time-based expiration: TTL after last access
Cache<String, Article> accessTtl = Caffeine.newBuilder()
    .maximumSize(10_000)
    .expireAfterAccess(Duration.ofMinutes(10))
    .build();

// Variable expiration: per-entry TTL based on content
Cache<String, Article> variableTtl = Caffeine.newBuilder()
    .maximumSize(10_000)
    .expireAfter(new Expiry<String, Article>() {
        @Override
        public long expireAfterCreate(String key, Article article,
                                       long currentTime) {
            // Trending articles: short TTL (data changes fast)
            if (article.viewCount() > 10_000) {
                return Duration.ofMinutes(1).toNanos();
            }
            // Stable articles: long TTL
            return Duration.ofMinutes(10).toNanos();
        }

        @Override
        public long expireAfterUpdate(String key, Article article,
                                       long currentTime, long currentDuration) {
            return currentDuration; // Keep existing TTL on update
        }

        @Override
        public long expireAfterRead(String key, Article article,
                                     long currentTime, long currentDuration) {
            return currentDuration; // Reading does not extend TTL
        }
    })
    .build();

The variable expiration strategy is powerful for the content platform. Trending articles change frequently (view count, recommendation score) and need short TTLs. Stable articles (published weeks ago, content frozen) can tolerate longer TTLs. The hit rate improvement from variable TTL depends on the workload:

Strategy	Hit Rate	Staleness P99
Fixed 5 min TTL	94.2%	5 min
Fixed 10 min TTL	96.1%	10 min
Variable (1-10 min)	95.8%	1 min (trending), 10 min (stable)

Variable TTL achieves nearly the hit rate of a 10-minute fixed TTL while bounding staleness for trending content to 1 minute.

Weight-Based Sizing

Not all cache entries are the same size. An article metadata entry (title, categories, timestamps) is 500 bytes. A full article with body content is 50 KB. Sizing by count treats them equally, wasting memory on a cache full of large articles when it could hold 100x more metadata entries.

Cache<String, Article> weightedCache = Caffeine.newBuilder()
    .maximumWeight(100_000_000) // 100 MB
    .weigher((String key, Article article) -> {
        // Approximate memory footprint
        return 200  // object overhead
            + key.length() * 2  // key chars
            + article.title().length() * 2
            + article.body().length() * 2
            + article.categories().size() * 50;
    })
    .build();

Weight-based sizing bounds the cache by memory rather than count. For the content platform, this prevents a scenario where 10,000 full articles (500 MB) consume the entire heap budget, while 10,000 metadata entries (5 MB) would leave 495 MB unused.

Removal Listeners

Cache<String, Article> cache = Caffeine.newBuilder()
    .maximumSize(10_000)
    .removalListener((String key, Article article, RemovalCause cause) -> {
        if (cause == RemovalCause.SIZE) {
            metrics.counter("cache.eviction.size").increment();
        } else if (cause == RemovalCause.EXPIRED) {
            metrics.counter("cache.eviction.expired").increment();
        }
        // Write-behind: flush dirty entries to Redis
        if (article.isDirty()) {
            redis.opsForValue().set("article:" + key, article,
                Duration.ofMinutes(30));
        }
    })
    .build();

Removal listeners run asynchronously by default, which means they do not block cache operations. Use them for metrics and write-behind patterns, not for correctness-critical operations (the entry is already evicted by the time the listener runs).

Benchmarking Cache Configuration

The following benchmark measures the combined effect of configuration choices:

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@Warmup(iterations = 5, time = 2)
@Measurement(iterations = 5, time = 2)
@Fork(2)
@State(Scope.Benchmark)
public class CaffeineConfigBenchmark {

    @Param({"1000", "5000", "10000"})
    private int cacheSize;

    @Param({"true", "false"})
    private boolean recordStats;

    private Cache<String, Article> cache;
    private String[] accessPattern;

    @Setup(Level.Trial)
    public void setup() {
        var builder = Caffeine.newBuilder()
            .maximumSize(cacheSize);
        if (recordStats) {
            builder.recordStats();
        }
        cache = builder.build();

        accessPattern = generateZipfianKeys(100_000, 1_000_000);
    }

    @Benchmark
    @Threads(8)
    public Article cacheAccess(ThreadState state) {
        String key = accessPattern[state.nextIndex()];
        return cache.get(key, k -> loadArticle(k));
    }

    @State(Scope.Thread)
    public static class ThreadState {
        private int index;
        public int nextIndex() {
            return index++ % 1_000_000;
        }
    }
}

Cache Size	Stats	Throughput (ops/s)	Hit Rate
1,000	off	38.2M	71.2%
1,000	on	35.8M	71.2%
5,000	off	41.1M	89.4%
5,000	on	38.5M	89.4%
10,000	off	42.3M	94.2%
10,000	on	39.7M	94.2%

recordStats() adds 6-7% overhead due to atomic counter updates on every access. In production, enable it: the 6% overhead is negligible compared to the insight it provides. In benchmarks measuring raw cache throughput, disable it to isolate the eviction policy’s performance.

The hit rate plateau between 5,000 and 10,000 entries (89.4% to 94.2%) confirms the Zipfian distribution analysis from the main chapter. For the content platform, 10,000 entries is the sweet spot: doubling to 20,000 would add only 2-3 percentage points while doubling memory consumption.