Skip to main content
surviving the spike

Computed Aggregate Caching and Invalidation Strategies

8 min read Chapter 20 of 66

Computed Aggregate Caching and Invalidation Strategies

The surge pricing multiplier is the most financially sensitive cached value in the ride-hailing platform. Get it wrong by even 0.5x and you are either overcharging riders (they switch to a competitor) or undercharging during peak demand (you lose thousands per minute in revenue). This section covers the Redis data structures, invalidation strategies, and operational metrics for getting computed aggregates right.

Surge Multiplier in a Redis Hash

The Symptom

Riders in zone-12 are seeing a 1.0x fare while zone-12 actually has a 3.4x surge. The Kafka consumer that updates surge pricing is behind by 45 seconds after a deployment triggered a consumer group rebalance. The Redis cache is serving a stale TTL-based entry. Revenue loss during the 45-second window: approximately $2,800 based on 62 rides at an average $45 fare delta.

The Cause

The surge multiplier was stored as a simple string key with SET surge:zone-12 "3.4" EX 30. This loses context. You cannot tell when the value was computed, what the demand and supply inputs were, or which invalidation path wrote it. When staleness occurs, you have no diagnostic data.

The Baseline

Simple string-based surge cache:

// BOTTLENECK: Opaque string cache, no diagnostic metadata
public Mono<Double> getSurgeMultiplier(String zoneId) {
    String key = "surge:" + zoneId;
    return redisTemplate.opsForValue().get(key)
        .map(Double::parseDouble)
        .switchIfEmpty(
            computationService.compute(zoneId)
                .flatMap(surge -> redisTemplate.opsForValue()
                    .set(key, String.valueOf(surge.multiplier()), Duration.ofSeconds(30))
                    .thenReturn(surge.multiplier()))
        );
}

No way to distinguish a fresh value from a 29-second-old one. No visibility into what triggered the cache write. When surge accuracy drifts, you are debugging blind.

The Fix: Redis Hash with Full State

Store the complete surge computation result in a Hash. Each field carries diagnostic value.

// SCALED: Full surge state in a Redis Hash
@Service
public class SurgeCacheService {

    private final ReactiveRedisTemplate<String, String> redisTemplate;
    private final SurgeComputationService computationService;
    private final MeterRegistry meterRegistry;

    private static final Duration SURGE_TTL = Duration.ofSeconds(30);

    public Mono<SurgeResponse> getSurge(String zoneId) {
        String key = "surge:" + zoneId;
        return readFromCache(key, zoneId)
            .switchIfEmpty(computeAndCache(zoneId, key));
    }

    private Mono<SurgeResponse> readFromCache(String key, String zoneId) {
        return redisTemplate.opsForHash().entries(key)
            .collectMap(e -> (String) e.getKey(), e -> (String) e.getValue())
            .filter(map -> !map.isEmpty())
            .map(cached -> {
                meterRegistry.counter("surge.cache", "result", "hit", "zone", zoneId).increment();
                return new SurgeResponse(
                    zoneId,
                    Double.parseDouble(cached.get("multiplier")),
                    Instant.parse(cached.get("computedAt")),
                    Long.parseLong(cached.get("demand")),
                    Long.parseLong(cached.get("supply")),
                    cached.get("invalidatedBy")
                );
            });
    }

    public Mono<SurgeResponse> computeAndCache(String zoneId, String key) {
        return computationService.compute(zoneId)
            .flatMap(surge -> writeSurgeCache(key, surge, "miss"))
            .doOnNext(s -> meterRegistry.counter("surge.cache",
                "result", "miss", "zone", zoneId).increment());
    }

    public Mono<SurgeResponse> writeSurgeCache(
            String key, SurgeResponse surge, String trigger) {
        Map<String, String> fields = Map.of(
            "multiplier", String.valueOf(surge.multiplier()),
            "computedAt", surge.computedAt().toString(),
            "demand", String.valueOf(surge.demand()),
            "supply", String.valueOf(surge.supply()),
            "invalidatedBy", trigger
        );
        return redisTemplate.opsForHash().putAll(key, fields)
            .then(redisTemplate.expire(key, SURGE_TTL))
            .thenReturn(surge);
    }
}

The invalidatedBy field is the key diagnostic tool. In Grafana, chart the ratio of invalidatedBy=event vs invalidatedBy=ttl vs invalidatedBy=miss. A healthy system shows 80%+ event-driven invalidations. If TTL starts dominating, your event pipeline is lagging.

Driver Counts in a Redis Sorted Set

The Symptom

The driver matching service reports 23 available drivers in zone-47. The rider-facing app shows “High availability” with a green badge. The rider requests a ride and waits 9 minutes because 21 of those 23 drivers went offline 45 seconds ago. The driver count cache is stale.

The Cause

Driver counts were cached as a simple integer with a 10-second TTL. Between cache writes, drivers go offline without decrementing the count. The cache is a snapshot that decays immediately.

The Fix: Sorted Set with Heartbeat Timestamps

Instead of caching a count, cache the individual driver states. Each driver’s last heartbeat timestamp is the score in a Sorted Set. To get the count, run ZCOUNT with a time window. Stale drivers naturally fall below the cutoff without explicit removal.

// SCALED: Driver availability as a live Sorted Set
@Service
public class DriverAvailabilityService {

    private final ReactiveRedisTemplate<String, String> redisTemplate;

    public Mono<Boolean> updateDriverHeartbeat(String zoneId, String driverId) {
        String key = "drivers:available:" + zoneId;
        double score = Instant.now().toEpochMilli();
        return redisTemplate.opsForZSet().add(key, driverId, score);
    }

    public Mono<Long> getAvailableCount(String zoneId) {
        String key = "drivers:available:" + zoneId;
        double cutoff = Instant.now().minus(60, ChronoUnit.SECONDS).toEpochMilli();
        double now = Instant.now().toEpochMilli();
        return redisTemplate.opsForZSet()
            .count(key, Range.closed(cutoff, now));
    }

    public Flux<String> getAvailableDrivers(String zoneId) {
        String key = "drivers:available:" + zoneId;
        double cutoff = Instant.now().minus(60, ChronoUnit.SECONDS).toEpochMilli();
        double now = Instant.now().toEpochMilli();
        return redisTemplate.opsForZSet()
            .rangeByScore(key, Range.closed(cutoff, now));
    }

    // Periodic cleanup of stale entries to prevent memory growth
    @Scheduled(fixedRate = 300_000) // every 5 minutes
    public void pruneStaleDrivers() {
        double cutoff = Instant.now().minus(5, ChronoUnit.MINUTES).toEpochMilli();
        redisTemplate.keys("drivers:available:*")
            .flatMap(key -> redisTemplate.opsForZSet()
                .removeRangeByScore(key, Range.closed(0.0, cutoff)))
            .subscribe();
    }
}

The Sorted Set approach eliminates the stale count problem entirely. A driver who stops sending heartbeats automatically drops out of the count after 60 seconds. No explicit invalidation needed for individual driver state.

TTL-Based vs Event-Driven vs Hybrid Invalidation

TTL-Based Invalidation

The simplest approach. Set a TTL on the cache key and let Redis expire it. On the next request, recompute and cache.

When TTL works: Data that changes slowly and tolerates staleness. User profiles, trip history, route geometry. A 5-minute TTL on trip history is fine because new trips only appear a few times per day.

When TTL fails: Surge pricing. A 30-second TTL means riders see prices up to 30 seconds stale. During a rapid demand spike (concert ending, rain starting), 30 seconds of stale pricing means hundreds of underpriced rides.

Event-Driven Invalidation

A Kafka consumer subscribes to ride-requested, ride-completed, and driver-status-changed events. Each event triggers a targeted cache update for the affected zone.

// SCALED: Event-driven surge cache invalidation
@Component
public class SurgeCacheInvalidator {

    private final SurgeCacheService surgeCacheService;
    private final SurgeComputationService computationService;
    private final MeterRegistry meterRegistry;

    @KafkaListener(topics = {"ride-requested", "ride-completed", "driver-status-changed"},
                   groupId = "surge-cache-invalidator")
    public void onEvent(ConsumerRecord<String, RideEvent> record) {
        RideEvent event = record.value();
        String key = "surge:" + event.zoneId();

        computationService.compute(event.zoneId())
            .flatMap(surge -> surgeCacheService.writeSurgeCache(key, surge, "event"))
            .doOnSuccess(v -> meterRegistry.counter("surge.invalidation",
                "trigger", "event",
                "topic", record.topic(),
                "zone", event.zoneId()).increment())
            .doOnError(e -> meterRegistry.counter("surge.invalidation.error",
                "trigger", "event",
                "zone", event.zoneId()).increment())
            .subscribe();
    }
}

When events work: Primary invalidation for high-frequency, financially sensitive data. Surge pricing, driver availability, fare estimates during active promotions.

When events fail: Kafka consumer group rebalances during deployments. For 15 to 45 seconds, partitions are unassigned. Events queue up in the topic but are not consumed. The cache goes stale with no TTL to save it.

Hybrid: Events Primary, TTL Safety Net

Use events for normal operation. Keep the TTL as a backstop. Track which fires first.

// SCALED: Hybrid invalidation with staleness tracking
public Mono<SurgeResponse> getSurgeHybrid(String zoneId) {
    String key = "surge:" + zoneId;
    return redisTemplate.opsForHash().entries(key)
        .collectMap(e -> (String) e.getKey(), e -> (String) e.getValue())
        .flatMap(cached -> {
            if (cached.isEmpty()) {
                return surgeCacheService.computeAndCache(zoneId, key);
            }

            Instant computedAt = Instant.parse(cached.get("computedAt"));
            Duration age = Duration.between(computedAt, Instant.now());

            meterRegistry.timer("surge.cache.age", "zone", zoneId)
                .record(age);

            String lastTrigger = cached.getOrDefault("invalidatedBy", "unknown");
            if ("ttl".equals(lastTrigger) || "miss".equals(lastTrigger)) {
                meterRegistry.counter("surge.invalidation.ttl_fired_first",
                    "zone", zoneId).increment();
            }

            return Mono.just(SurgeCacheService.fromCache(zoneId, cached));
        });
}

Build a Grafana dashboard with three panels:

  1. surge.cache.age histogram: should cluster under 5 seconds with hybrid, under 30 seconds with TTL-only.
  2. surge.invalidation counter by trigger: healthy ratio is 85%+ event-driven.
  3. surge.invalidation.ttl_fired_first counter: spikes here correlate with deployments and Kafka rebalances. Alert if this exceeds 20% of total invalidations for more than 2 minutes.

Locust Test: Comparing Invalidation Strategies

This test simulates a demand spike in a single zone and measures surge pricing accuracy across the three invalidation strategies.

# locustfile_invalidation_comparison.py
from locust import HttpUser, task, between, events
import random
import time
import json

class SurgeAccuracyUser(HttpUser):
    wait_time = between(0.05, 0.2)

    @task(6)
    def get_surge_ttl_only(self):
        """Hit the TTL-only surge endpoint"""
        zone = "zone-12"
        with self.client.get(
            f"/api/v1/surge/ttl/{zone}",
            name="/surge/ttl [accuracy test]",
            catch_response=True
        ) as resp:
            if resp.status_code == 200:
                data = resp.json()
                self._check_accuracy(resp, data, "ttl")

    @task(6)
    def get_surge_event_driven(self):
        """Hit the event-driven surge endpoint"""
        zone = "zone-12"
        with self.client.get(
            f"/api/v1/surge/event/{zone}",
            name="/surge/event [accuracy test]",
            catch_response=True
        ) as resp:
            if resp.status_code == 200:
                data = resp.json()
                self._check_accuracy(resp, data, "event")

    @task(6)
    def get_surge_hybrid(self):
        """Hit the hybrid surge endpoint"""
        zone = "zone-12"
        with self.client.get(
            f"/api/v1/surge/hybrid/{zone}",
            name="/surge/hybrid [accuracy test]",
            catch_response=True
        ) as resp:
            if resp.status_code == 200:
                data = resp.json()
                self._check_accuracy(resp, data, "hybrid")

    @task(3)
    def generate_demand(self):
        """Simulate demand spike by flooding ride requests"""
        zone = "zone-12"
        self.client.post(
            "/api/v1/ride/request",
            json={
                "zoneId": zone,
                "riderId": f"rider-{random.randint(1, 50000)}",
                "pickupLat": 40.7128 + random.uniform(-0.01, 0.01),
                "pickupLng": -74.0060 + random.uniform(-0.01, 0.01)
            },
            name="/ride/request [demand spike]"
        )

    def _check_accuracy(self, resp, data, strategy):
        computed_at = data.get("computedAt", "")
        if computed_at:
            from datetime import datetime, timezone
            try:
                cached_time = datetime.fromisoformat(computed_at.replace("Z", "+00:00"))
                age_seconds = (datetime.now(timezone.utc) - cached_time).total_seconds()
                if age_seconds > 30:
                    resp.failure(f"{strategy}: stale by {age_seconds:.1f}s")
            except (ValueError, TypeError):
                pass

Run with: locust -f locustfile_invalidation_comparison.py --users 500 --spawn-rate 100 --run-time 5m

Comparison Results

MetricTTL-Only (30s)Event-DrivenHybrid
Max staleness30s45s (during rebalance)30s (bounded by TTL)
Avg staleness15s0.3s0.4s
Staleness during spike30s0.5s0.6s
Staleness during deploy30s45s30s
DB queries/min (peak)280320 (event-driven recomputes)290
Revenue accuracy91%99.2%99.1%

The hybrid approach wins because it combines the best of both. Event-driven handles the normal case with sub-second staleness. TTL catches the edge case where events stop flowing. You get 99%+ revenue accuracy with a hard upper bound on staleness.