Cache as a Resilience Primitive
Caching is commonly treated as a performance optimization. In resilience engineering, caching serves a different purpose: it provides a known-good response when the dependency is unavailable. The cached value may be stale, but a stale response is better than no response. This reframes the cache from “fast path” to “survival path.”
The Failure Mode
The balance service returns the current account balance. Under normal operation, the payment service calls the balance service for every payment request. When the balance service is unavailable, the payment service cannot process payments.
But what does “balance” mean for the fraud detection check? Fraud scoring uses the balance as one input among many: transaction velocity, merchant category, geographic distance, device fingerprint. The fraud score changes by less than 2% when the balance is off by $100. A balance that is 30 seconds stale is functionally equivalent to a real-time balance for fraud scoring purposes.
This creates a natural separation. The balance check for payment authorization needs real-time data and cannot be cached. The balance check for fraud scoring tolerates staleness and can be cached. The same dependency, called by two different consumers, has two different staleness tolerances.
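One way to make this separation explicit is to record each consumer's tolerance next to the consumer itself. The sketch below is illustrative only: the enum name, its methods, and the choice to encode "cannot cache" as a zero tolerance are assumptions, and the 30-second figure is taken from the fraud scoring discussion above.

```java
import java.time.Duration;

// Hypothetical policy table: the names and the zero-tolerance encoding are assumptions.
enum BalanceConsumer {
    // Authorization must see the live balance, so no staleness is tolerated.
    PAYMENT_AUTHORIZATION(Duration.ZERO),
    // Fraud scoring treats a 30-second-old balance as equivalent to a live one.
    FRAUD_SCORING(Duration.ofSeconds(30));

    private final Duration maxStaleness;

    BalanceConsumer(Duration maxStaleness) {
        this.maxStaleness = maxStaleness;
    }

    Duration maxStaleness() {
        return maxStaleness;
    }

    // A consumer may read through a cache only if it tolerates some staleness.
    boolean mayUseCache() {
        return !maxStaleness.isZero();
    }

    // True when a cached value of the given age is still acceptable to this consumer.
    boolean accepts(Duration age) {
        return mayUseCache() && age.compareTo(maxStaleness) <= 0;
    }
}
```

With this in place, a routing layer could call BalanceConsumer.FRAUD_SCORING.accepts(age) before serving a cached balance, while PAYMENT_AUTHORIZATION always goes to the balance service.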
From Scratch: Stale-While-Revalidate
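// SCRATCH - Stale-while-revalidate cache
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.function.Supplier;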
public class ResilientCache<K, V> {
private final ConcurrentHashMap<K, CacheEntry<V>> store =
new ConcurrentHashMap<>();
private final Duration freshDuration;
private final Duration staleDuration;
private final ExecutorService revalidationPool;
record CacheEntry<V>(V value, Instant fetchedAt, boolean revalidating) {
CacheEntry<V> withRevalidating(boolean r) {
return new CacheEntry<>(value, fetchedAt, r);
}
}
public ResilientCache(Duration freshDuration,
Duration staleDuration,
ExecutorService revalidationPool) {
this.freshDuration = freshDuration;
this.staleDuration = staleDuration;
this.revalidationPool = revalidationPool;
}
/**
* Get a value from the cache with resilient fallback behavior:
* - FRESH: return cached value
* - STALE: return cached value, trigger async revalidation
* - EXPIRED: call fetcher synchronously, cache result
* - MISSING: call fetcher synchronously, cache result
* - FETCHER FAILS + STALE: return stale value
* - FETCHER FAILS + EXPIRED: throw exception
*/
public V get(K key, Supplier<V> fetcher) {
CacheEntry<V> entry = store.get(key);
if (entry == null) {
// Cache miss: fetch synchronously
return fetchAndCache(key, fetcher);
}
Duration age = Duration.between(entry.fetchedAt(), Instant.now());
if (age.compareTo(freshDuration) <= 0) {
// Fresh: serve from cache
return entry.value();
}
if (age.compareTo(staleDuration) <= 0) {
// Stale but within grace period: serve stale, revalidate async
triggerRevalidation(key, entry, fetcher);
return entry.value();
}
// Expired: must fetch synchronously
try {
return fetchAndCache(key, fetcher);
} catch (Exception e) {
// Fetch failed and cache is expired: no fallback available
throw new CacheRefreshException(
"Cannot refresh cache for key " + key +
" and cached entry has expired", e);
}
}
private V fetchAndCache(K key, Supplier<V> fetcher) {
V value = fetcher.get();
store.put(key, new CacheEntry<>(value, Instant.now(), false));
return value;
}
private void triggerRevalidation(K key,
CacheEntry<V> entry,
Supplier<V> fetcher) {
// Only one revalidation at a time per key
if (entry.revalidating()) return;
CacheEntry<V> revalidating = entry.withRevalidating(true);
if (!store.replace(key, entry, revalidating)) return;
revalidationPool.submit(() -> {
try {
V fresh = fetcher.get();
store.put(key,
new CacheEntry<>(fresh, Instant.now(), false));
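} catch (Exception e) {
// Revalidation failed: keep serving stale and clear the flag so a
// later request can retry. replace() only swaps back our own
// marker entry, so a newer value written meanwhile is not lost.
store.replace(key, revalidating, entry);
}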
});
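}
/** Thrown when the cached entry has expired and the fetcher also fails. */
public static class CacheRefreshException extends RuntimeException {
public CacheRefreshException(String message, Throwable cause) {
super(message, cause);
}
}
}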
What this reveals:
Three states, not two. Most cache implementations have two states: hit and miss. A resilient cache has three: fresh (serve immediately), stale (serve immediately, refresh in background), and expired (must refresh or fail). The stale state is the resilience primitive.
Revalidation must be deduplicated. Without the revalidating flag, every request for a stale key triggers a revalidation. At 1,000 requests per second for that key, the dependency receives up to 1,000 extra calls per second, which is exactly the thundering herd problem. The flag, set atomically with store.replace, ensures only one revalidation per key is in flight.
Stale-if-error is implicit. When async revalidation fails, the stale entry remains and the cache keeps serving it until it ages past staleDuration. The caller needs no explicit error handling during that window; the cache absorbs the failure.
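The first two points can be isolated into two small pieces, sketched below under stated assumptions. FreshnessPolicy mirrors the age comparisons in get. RevalidationGate shows deduplication with a separate in-flight key set, which is a different mechanism from the revalidating flag stored in the entry but has the same effect. Both class names are hypothetical and do not appear in the listing above.

```java
import java.time.Duration;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// The three states of a resilient cache entry.
enum Freshness { FRESH, STALE, EXPIRED }

// Hypothetical helper: classifies an entry by age, using the same
// comparisons as the get method in the scratch listing.
final class FreshnessPolicy {
    private final Duration freshDuration;
    private final Duration staleDuration;

    FreshnessPolicy(Duration freshDuration, Duration staleDuration) {
        if (staleDuration.compareTo(freshDuration) < 0) {
            throw new IllegalArgumentException("staleDuration must be >= freshDuration");
        }
        this.freshDuration = freshDuration;
        this.staleDuration = staleDuration;
    }

    Freshness classify(Duration age) {
        if (age.compareTo(freshDuration) <= 0) return Freshness.FRESH;
        if (age.compareTo(staleDuration) <= 0) return Freshness.STALE;
        return Freshness.EXPIRED;
    }
}

// Hypothetical helper: allows one revalidation per key at a time.
final class RevalidationGate<K> {
    private final Set<K> inFlight = ConcurrentHashMap.newKeySet();

    // Returns true for exactly one caller per key until finish(key) is called.
    boolean tryStart(K key) {
        return inFlight.add(key);
    }

    // Call when the revalidation completes, whether it succeeded or failed.
    void finish(K key) {
        inFlight.remove(key);
    }
}
```

A caller would invoke tryStart before submitting a background fetch and finish in a finally block, so a failed fetch does not block later retries.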
Production Implementation
// PRODUCTION - Caffeine cache with stale-if-error for fraud scoring
@Configuration
public class FraudBalanceCacheConfig {
@Bean
public LoadingCache<String, BigDecimal> fraudBalanceCache(
BalanceClient balanceClient,
MeterRegistry meterRegistry) {
return Caffeine.newBuilder()
.maximumSize(50_000) // Max 50K accounts cached
.expireAfterWrite(Duration.ofMinutes(5))
.refreshAfterWrite(Duration.ofSeconds(30))
// refreshAfterWrite: the first read after 30s triggers an async reload
// expireAfterWrite: the entry is evicted 5 minutes after its last write
// Between 30s and 5m: entry is stale, served while refreshing
.recordStats()
.build(new CacheLoader<>() {
@Override
public BigDecimal load(String accountId) {
return balanceClient.getBalance(accountId);
}
@Override
public CompletableFuture<BigDecimal> asyncReload(
String accountId,
BigDecimal oldBalance,
Executor executor) {
return CompletableFuture.supplyAsync(() -> {
try {
return balanceClient.getBalance(accountId);
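} catch (Exception e) {
meterRegistry.counter(
"cache.reload.failure",
"cache", "fraudBalance")
.increment();
// Rethrow instead of returning oldBalance: when a reload
// fails, Caffeine keeps the old value in place. Returning
// oldBalance would be stored as a new write and restart the
// expireAfterWrite clock, so staleness would be unbounded.
throw new CompletionException(e);
}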
}, executor);
}
});
}
}
Caffeine’s refreshAfterWrite implements stale-while-revalidate natively: the first read after 30 seconds returns the cached balance immediately and triggers asyncReload in the background. When the refresh fails, the previous balance stays in the cache and keeps being served, which gives stale-if-error behavior.
// PRODUCTION - Fraud service using cached balance
@Service
public class FraudScoringService {
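private final LoadingCache<String, BigDecimal> balanceCache;
private final FraudScoringEngine scoringEngine;
public FraudScoringService(LoadingCache<String, BigDecimal> balanceCache,
FraudScoringEngine scoringEngine) {
this.balanceCache = balanceCache;
this.scoringEngine = scoringEngine;
}
public FraudScore scoreFraud(PaymentRequest payment) {
// During a balance service outage this returns the stale cached
// value. It can still throw on a cache miss or after the entry
// has expired, because the loader then runs synchronously.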
BigDecimal balance = balanceCache.get(payment.accountId());
return scoringEngine.score(
payment,
balance,
// ... other fraud signals
);
}
}
The fraud scoring service does not know whether the balance is fresh or stale, and it does not need to. The cache abstracts the dependency’s availability into a simple get call that returns a value. The staleness is bounded at 5 minutes by expireAfterWrite, provided a failed reload never writes the old value back as a new entry, and within that window the fraud score stays within acceptable accuracy.
Testing Cache-Based Resilience
// PRODUCTION - Test: cache serves stale data during dependency outage
@SpringBootTest
class FraudBalanceCacheTest {
@Autowired
private LoadingCache<String, BigDecimal> fraudBalanceCache;
@MockBean
private BalanceClient balanceClient;
@Test
void servesStaleDataDuringOutage() throws Exception {
String accountId = "ACC-12345";
BigDecimal originalBalance = new BigDecimal("5000.00");
// Prime the cache
when(balanceClient.getBalance(accountId))
.thenReturn(originalBalance);
fraudBalanceCache.get(accountId);
// Simulate dependency outage
when(balanceClient.getBalance(accountId))
.thenThrow(new RuntimeException("Connection refused"));
// Force refresh (simulates time passing beyond refreshAfterWrite)
fraudBalanceCache.refresh(accountId);
Thread.sleep(100); // Allow async refresh to complete
// Cache should still serve the original value
BigDecimal cached = fraudBalanceCache.get(accountId);
assertThat(cached).isEqualByComparingTo(originalBalance);
}
@Test
void updatesWhenDependencyRecovers() throws Exception {
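// Different account from the previous test: the cache bean is shared
// across tests in the same Spring context, so entries carry over
String accountId = "ACC-67890";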
BigDecimal oldBalance = new BigDecimal("5000.00");
BigDecimal newBalance = new BigDecimal("4800.00");
// Prime with old balance
when(balanceClient.getBalance(accountId))
.thenReturn(oldBalance);
fraudBalanceCache.get(accountId);
// Dependency recovers with new balance
when(balanceClient.getBalance(accountId))
.thenReturn(newBalance);
fraudBalanceCache.refresh(accountId);
Thread.sleep(100);
BigDecimal cached = fraudBalanceCache.get(accountId);
assertThat(cached).isEqualByComparingTo(newBalance);
}
}
The Observable Signal
# PRODUCTION - Prometheus metrics for cache-based resilience
# cache_gets_total and cache_evictions_total come from Micrometer's Caffeine
# binder (needs recordStats() and the cache registered with the binder);
# cache_reload_failure_total is the custom counter incremented in asyncReload
# Cache hits (hit ratio should be >95% under normal operation)
cache_gets_total{cache="fraudBalance", result="hit"}
# Cache misses (each one triggers a synchronous load)
cache_gets_total{cache="fraudBalance", result="miss"}
# Reload failures (dependency unreachable, stale data served)
cache_reload_failure_total{cache="fraudBalance"}
# Cache evictions (entries removed by maximumSize or expireAfterWrite)
cache_evictions_total{cache="fraudBalance"}
A healthy cache shows a high hit ratio (>95%), a low miss ratio (<5%), and near-zero reload failures. During a dependency outage, the hit ratio stays high because stale entries are still served, reload failures spike because async refreshes fail, and evictions stay flat at first because entries have not yet reached expireAfterWrite. Reads do not extend that deadline. Once the outage lasts longer than expireAfterWrite, entries expire, evictions rise, and cache misses spike. This is the point where cache-based resilience is exhausted and the caller must handle the failure explicitly.
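The critical alert: reload failures sustained for a period approaching expireAfterWrite. This means the cache is draining and stale data will soon be unavailable. An alert that waits for the full expireAfterWrite window fires too late, because entries are already expiring by then.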
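Once the cache is exhausted, the lookup itself starts failing and the caller has to decide what to do. The sketch below shows one possible shape for that explicit handling, assuming the fraud engine can score without a balance signal. BalanceLookup is a hypothetical wrapper, and the injected function stands in for fraudBalanceCache::get so the example does not depend on Caffeine.

```java
import java.math.BigDecimal;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical wrapper: turns a failing cache lookup into an explicit "no balance" result.
final class BalanceLookup {
    // Stand-in for fraudBalanceCache::get; an assumption made to keep the sketch self-contained.
    private final Function<String, BigDecimal> cacheGet;

    BalanceLookup(Function<String, BigDecimal> cacheGet) {
        this.cacheGet = cacheGet;
    }

    // Returns the balance if the cache can supply one, or empty when the
    // entry is gone and the synchronous load has failed.
    Optional<BigDecimal> balanceFor(String accountId) {
        try {
            return Optional.ofNullable(cacheGet.apply(accountId));
        } catch (RuntimeException e) {
            // Cache exhausted and dependency still down: report "unknown"
            // rather than propagating the failure into fraud scoring.
            return Optional.empty();
        }
    }
}
```

Whether an empty result should lead to scoring without the balance, a more conservative score, or a rejected payment is a policy decision that this sketch leaves open.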