Skip to main content
resilience patterns in production

Cache Stampede Prevention and Distributed Cache Resilience

4 min read Chapter 24 of 40

Cache Stampede Prevention and Distributed Cache Resilience

When a popular cache entry expires, every concurrent request finds the entry missing and attempts to refresh it simultaneously. This is the cache stampede: a burst of identical requests to the dependency, triggered by a single expiration event.

Cache Stampede Mechanics

The fraud balance cache has 50,000 entries. Entries expire after 5 minutes. If a single account receives 200 requests per second and its entry expires, 200 threads simultaneously call balanceClient.getBalance(accountId). The balance service sees a 200x spike for that single account. Multiply by the number of popular accounts expiring within the same second, and the stampede can overwhelm the dependency.

Caffeine’s refreshAfterWrite mitigates this: the first thread to access a stale entry triggers an async refresh, and all other threads receive the stale value. But this only works when the entry is stale (between refreshAfterWrite and expireAfterWrite). After expireAfterWrite, Caffeine evicts the entry, and the next access is a synchronous load call. If 200 threads arrive during this window, Caffeine’s internal locking ensures only one thread loads the value, but the other 199 block waiting for it.

Probabilistic Early Expiration (XFetch)

Probabilistic early expiration prevents synchronized expirations across entries:

// PRODUCTION - XFetch-style probabilistic early refresh
public class ProbabilisticCache<K, V> {

    private final LoadingCache<K, StampedValue<V>> innerCache;
    private final double beta; // Controls refresh aggressiveness

    record StampedValue<V>(V value, Instant fetchedAt, Duration computeTime) {}

    public V get(K key) {
        StampedValue<V> stamped = innerCache.get(key);
        Duration age = Duration.between(stamped.fetchedAt(), Instant.now());
        Duration ttl = Duration.ofMinutes(5);
        Duration remaining = ttl.minus(age);

        // XFetch formula: refresh early with increasing probability
        // as the entry approaches expiration
        double delta = stamped.computeTime().toMillis();
        double random = -delta * beta * Math.log(Math.random());

        if (remaining.toMillis() < random) {
            // Probabilistically trigger early refresh
            innerCache.refresh(key);
        }

        return stamped.value();
    }
}

The probability of early refresh increases as the entry ages. Entries with longer compute times (delta) refresh earlier, because replacing them costs more if they expire and cause a synchronous load. The beta parameter controls aggressiveness: higher values trigger earlier refreshes.

The effect: instead of all entries expiring at exactly 5 minutes, entries probabilistically refresh between 4 and 5 minutes, spreading the refresh load across time.

Two-Tier Caching for Cross-Instance Resilience

A local Caffeine cache provides millisecond access and survives balance service outages. But when a payment service instance restarts, its local cache is cold. Under high traffic, the restarting instance stampedes the balance service with cache misses.

A distributed Redis cache provides a shared second tier:

// PRODUCTION - Two-tier resilient cache: Caffeine (L1) + Redis (L2)
@Configuration
public class TwoTierBalanceCacheConfig {

    @Bean
    public TwoTierCache<String, BigDecimal> balanceCache(
            BalanceClient balanceClient,
            RedisTemplate<String, BigDecimal> redisTemplate,
            MeterRegistry meterRegistry) {

        LoadingCache<String, BigDecimal> localCache = Caffeine.newBuilder()
                .maximumSize(50_000)
                .refreshAfterWrite(Duration.ofSeconds(30))
                .expireAfterWrite(Duration.ofMinutes(5))
                .build(accountId -> loadFromRedisOrOrigin(
                        accountId, redisTemplate, balanceClient));

        return new TwoTierCache<>(localCache, redisTemplate, meterRegistry);
    }

    private BigDecimal loadFromRedisOrOrigin(
            String accountId,
            RedisTemplate<String, BigDecimal> redis,
            BalanceClient balanceClient) {

        // Try Redis (L2) first
        String redisKey = "balance:" + accountId;
        BigDecimal fromRedis = redis.opsForValue().get(redisKey);

        if (fromRedis != null) {
            return fromRedis;
        }

        // Redis miss: fetch from origin
        BigDecimal fromOrigin = balanceClient.getBalance(accountId);

        // Populate Redis with longer TTL than local cache
        redis.opsForValue().set(redisKey, fromOrigin,
                Duration.ofMinutes(15));

        return fromOrigin;
    }
}

The two-tier design provides three levels of resilience:

  1. L1 (Caffeine): Survives balance service outages for up to 5 minutes per entry.
  2. L2 (Redis): Survives balance service outages for up to 15 minutes. Shared across all payment service instances. Prevents cold-start stampedes.
  3. Origin (balance service): The source of truth, called only when both cache layers miss.

When Redis itself is unavailable, the local Caffeine cache still functions. The loadFromRedisOrOrigin method should catch Redis exceptions and fall back directly to the origin:

private BigDecimal loadFromRedisOrOrigin(
        String accountId,
        RedisTemplate<String, BigDecimal> redis,
        BalanceClient balanceClient) {

    try {
        BigDecimal fromRedis = redis.opsForValue()
                .get("balance:" + accountId);
        if (fromRedis != null) return fromRedis;
    } catch (Exception e) {
        // Redis unavailable: skip L2, go directly to origin
    }

    BigDecimal fromOrigin = balanceClient.getBalance(accountId);

    try {
        redis.opsForValue().set("balance:" + accountId,
                fromOrigin, Duration.ofMinutes(15));
    } catch (Exception e) {
        // Redis write failed: L1 still caches the value
    }

    return fromOrigin;
}

Each layer degrades independently. Redis outage: L1 + origin. Balance service outage: L1 + L2 (stale). Both down: L1 only (stale, draining). All three down: failure propagates to the caller. The cache stack buys time proportional to the longest TTL across all layers.