Skip to main content
surviving the spike

Partial Availability Architectures

11 min read Chapter 57 of 66

Partial Availability Architectures

The Symptom

The team has circuit breakers (CH18) and feature flags (CH19-S1). The surge pricing failure playbook works. Then four things happen in one weekend:

  1. Friday 11 PM: PostgreSQL primary crashes. Failover takes 14 seconds. 5,600 ride requests fail.
  2. Saturday 2 AM: Redis hits maxmemory. Eviction clears the feature flag hash. All features default to enabled, including a broken promotion engine that starts returning negative fares.
  3. Saturday 9 AM: Kafka broker goes down for maintenance. Trip completion events queue in memory, the pod restarts, events are lost. 340 trips have no completion record.
  4. Sunday 3 PM: Surge pricing service OOMs. Circuit breaker opens. Cached multipliers serve stale data for 45 minutes because nobody noticed the circuit was open.

Each failure had a fix. None of the fixes worked together. The system needed a unified architecture for partial availability.

The Cause

Each dependency has a different failure mode, a different fallback strategy, and a different recovery process. Treating them identically produces either over-engineering (triple redundancy for analytics) or under-engineering (no fallback for PostgreSQL writes).

Partial availability architecture maps each dependency to:

  1. What breaks when it is gone
  2. What serves as a temporary replacement
  3. How to recover without data loss
Dependency     Failure Mode          Fallback              Recovery
PostgreSQL     Connection timeout    Redis read cache +    Replay WAL
                                     local WAL for writes
Redis          Connection timeout    Caffeine in-memory    Warm cache on restart
                                     + flags default on
Kafka          Broker unavailable    In-memory bounded     Flush on reconnect
                                     queue
Surge service  Circuit breaker open  Cached multiplier     Circuit half-open
                                     or default 1.0x       auto-recovery

The Baseline

Without partial availability, each dependency is a hard requirement:

// BOTTLENECK: PostgreSQL is a hard dependency for writes
@Service
public class TripService {

    private final TripRepository tripRepository; // PostgreSQL

    public Mono<Trip> createTrip(RideRequest request, MatchResult match) {
        return tripRepository.save(new Trip(request, match));
        // PG down = trip creation fails = booking fails
    }
}
// BOTTLENECK: Kafka is a hard dependency for events
@Service
public class TripCompletionService {

    private final KafkaTemplate<String, TripEvent> kafkaTemplate;

    public Mono<Void> completeTrip(Trip trip) {
        return Mono.fromFuture(
            kafkaTemplate.send("trip-events", trip.getId(),
                TripEvent.completed(trip))
        ).then();
        // Kafka down = trip completion fails
    }
}

The Fix

PostgreSQL Down: Redis Cache + Local WAL

// SCALED: Trip service with PostgreSQL fallback
@Service
public class TripService {

    private final TripRepository tripRepository;
    private final ReactiveRedisTemplate<String, String> redis;
    private final WriteAheadLog wal;
    private final ObjectMapper mapper;
    private final HealthIndicator pgHealth;

    private static final String TRIP_CACHE_PREFIX = "trip:active:";

    public Mono<Trip> createTrip(RideRequest request, MatchResult match) {
        Trip trip = new Trip(request, match);

        return tripRepository.save(trip)
            .doOnNext(saved -> cacheTrip(saved))
            .onErrorResume(ex -> {
                Metrics.counter("trip.fallback.wal").increment();
                return saveToWAL(trip)
                    .then(cacheTrip(trip))
                    .thenReturn(trip);
            });
    }

    public Mono<Trip> getActiveTrip(String riderId) {
        return tripRepository.findActiveByRiderId(riderId)
            .onErrorResume(ex ->
                redis.opsForValue()
                    .get(TRIP_CACHE_PREFIX + riderId)
                    .flatMap(json -> deserializeTrip(json)));
    }

    private Mono<Void> cacheTrip(Trip trip) {
        return redis.opsForValue()
            .set(TRIP_CACHE_PREFIX + trip.getRiderId(),
                serializeTrip(trip),
                Duration.ofHours(2))
            .then();
    }

    private Mono<Void> saveToWAL(Trip trip) {
        return Mono.fromCallable(() -> {
            wal.append(new WALEntry(
                UUID.randomUUID().toString(),
                "TRIP_CREATE",
                serializeTrip(trip),
                Instant.now()
            ));
            return null;
        }).subscribeOn(Schedulers.boundedElastic()).then();
    }

    private String serializeTrip(Trip trip) {
        try {
            return mapper.writeValueAsString(trip);
        } catch (JsonProcessingException e) {
            throw new RuntimeException(e);
        }
    }

    private Mono<Trip> deserializeTrip(String json) {
        try {
            return Mono.just(mapper.readValue(json, Trip.class));
        } catch (JsonProcessingException e) {
            return Mono.empty();
        }
    }
}

File-Based WAL with Idempotent Replay

// SCALED: Write-ahead log for PostgreSQL fallback
@Component
public class WriteAheadLog {

    private final Path walDirectory;
    private final ObjectMapper mapper;
    private final AtomicLong sequence = new AtomicLong(0);

    public WriteAheadLog(@Value("${wal.directory:/data/wal}") String dir,
                         ObjectMapper mapper) {
        this.walDirectory = Path.of(dir);
        this.mapper = mapper;
        try {
            Files.createDirectories(walDirectory);
        } catch (IOException e) {
            throw new RuntimeException("Cannot create WAL directory", e);
        }
    }

    public synchronized void append(WALEntry entry) {
        long seq = sequence.incrementAndGet();
        Path file = walDirectory.resolve(
            String.format("wal-%020d-%s.json", seq, entry.id()));

        try {
            String json = mapper.writeValueAsString(entry);
            // Write to temp file first, then atomic rename
            Path temp = walDirectory.resolve(".tmp-" + seq);
            Files.writeString(temp, json, StandardCharsets.UTF_8);
            Files.move(temp, file, StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
            throw new RuntimeException("WAL append failed", e);
        }
    }

    public List<WALEntry> readPending() {
        try (Stream<Path> files = Files.list(walDirectory)) {
            return files
                .filter(p -> p.getFileName().toString().startsWith("wal-"))
                .sorted()
                .map(this::readEntry)
                .filter(Objects::nonNull)
                .toList();
        } catch (IOException e) {
            return List.of();
        }
    }

    public void markProcessed(WALEntry entry) {
        try {
            Path file = walDirectory.resolve(
                String.format("wal-%020d-%s.json",
                    0, entry.id())); // find by id pattern
            Files.list(walDirectory)
                .filter(p -> p.getFileName().toString().contains(entry.id()))
                .forEach(p -> {
                    try { Files.delete(p); }
                    catch (IOException ignored) {}
                });
        } catch (IOException ignored) {}
    }

    private WALEntry readEntry(Path file) {
        try {
            return mapper.readValue(
                Files.readString(file), WALEntry.class);
        } catch (IOException e) {
            return null;
        }
    }
}

public record WALEntry(
    String id,
    String operation,
    String payload,
    Instant timestamp
) {}
// SCALED: WAL replay on PostgreSQL recovery
@Component
public class WALReplayJob {

    private final WriteAheadLog wal;
    private final TripRepository tripRepository;
    private final ObjectMapper mapper;

    @Scheduled(fixedDelay = 5000)
    public void replayPendingWrites() {
        List<WALEntry> pending = wal.readPending();
        if (pending.isEmpty()) return;

        for (WALEntry entry : pending) {
            try {
                if ("TRIP_CREATE".equals(entry.operation())) {
                    Trip trip = mapper.readValue(entry.payload(), Trip.class);
                    // Idempotent: INSERT ON CONFLICT DO NOTHING
                    tripRepository.saveIdempotent(trip).block(Duration.ofSeconds(5));
                    wal.markProcessed(entry);
                    Metrics.counter("wal.replay.success").increment();
                }
            } catch (Exception e) {
                Metrics.counter("wal.replay.error").increment();
                break; // Stop on first error, retry next cycle
            }
        }
    }
}

The replay is idempotent. saveIdempotent uses INSERT ... ON CONFLICT (ride_id) DO NOTHING. If the original write succeeded before the WAL entry was created (race condition), the replay skips it.

Redis Down: Caffeine Fallback

// SCALED: Redis with Caffeine fallback
@Service
public class CacheService {

    private final ReactiveRedisTemplate<String, String> redis;
    private final Cache<String, String> localCache;

    public CacheService(ReactiveRedisTemplate<String, String> redis) {
        this.redis = redis;
        this.localCache = Caffeine.newBuilder()
            .maximumSize(10_000)
            .expireAfterWrite(Duration.ofMinutes(5))
            .recordStats()
            .build();
    }

    public Mono<String> get(String key) {
        return redis.opsForValue().get(key)
            .doOnNext(val -> localCache.put(key, val))
            .onErrorResume(ex -> {
                Metrics.counter("cache.redis.fallback").increment();
                String local = localCache.getIfPresent(key);
                return local != null ? Mono.just(local) : Mono.empty();
            });
    }

    public Mono<Void> set(String key, String value, Duration ttl) {
        localCache.put(key, value);
        return redis.opsForValue().set(key, value, ttl)
            .onErrorResume(ex -> {
                Metrics.counter("cache.redis.write.fallback").increment();
                return Mono.empty(); // Local cache has it, Redis will catch up
            })
            .then();
    }
}
// SCALED: Feature flags default to all-enabled when Redis is down
@Service
public class FeatureFlagService {

    private final ReactiveRedisTemplate<String, String> redis;
    private static final String FLAGS_KEY = "feature_flags";

    public Mono<Boolean> isEnabled(String feature) {
        return redis.opsForHash()
            .get(FLAGS_KEY, feature)
            .map(val -> "true".equals(val))
            .defaultIfEmpty(true)  // Redis miss = enabled
            .onErrorReturn(true);  // Redis down = all enabled
    }
    // When Redis is down, no features get disabled.
    // A broken feature is better than a missing feature.
    // The circuit breaker handles broken features.
}

Kafka Down: In-Memory Bounded Queue

// SCALED: Kafka with in-memory queue fallback
@Service
public class EventPublisher {

    private final KafkaTemplate<String, String> kafkaTemplate;
    private final ObjectMapper mapper;
    private final BlockingQueue<PendingEvent> pendingQueue;
    private final AtomicBoolean kafkaHealthy = new AtomicBoolean(true);

    public EventPublisher(KafkaTemplate<String, String> kafkaTemplate,
                          ObjectMapper mapper) {
        this.kafkaTemplate = kafkaTemplate;
        this.mapper = mapper;
        this.pendingQueue = new LinkedBlockingQueue<>(10_000);
    }

    public Mono<Void> publish(String topic, String key, Object event) {
        return Mono.fromCallable(() -> mapper.writeValueAsString(event))
            .flatMap(json -> sendToKafka(topic, key, json))
            .onErrorResume(ex -> {
                kafkaHealthy.set(false);
                Metrics.counter("kafka.fallback.queued", "topic", topic).increment();
                return queueLocally(topic, key,
                    serializeEvent(event));
            });
    }

    private Mono<Void> sendToKafka(String topic, String key, String value) {
        return Mono.fromFuture(
            kafkaTemplate.send(topic, key, value)
        ).then();
    }

    private Mono<Void> queueLocally(String topic, String key, String value) {
        return Mono.fromCallable(() -> {
            PendingEvent event = new PendingEvent(topic, key, value, Instant.now());
            if (!pendingQueue.offer(event)) {
                Metrics.counter("kafka.fallback.dropped").increment();
                // Queue full: oldest events are more important, drop this one
            }
            return null;
        }).then();
    }

    private String serializeEvent(Object event) {
        try {
            return mapper.writeValueAsString(event);
        } catch (JsonProcessingException e) {
            throw new RuntimeException(e);
        }
    }

    @Scheduled(fixedDelay = 2000)
    public void flushPendingEvents() {
        if (pendingQueue.isEmpty()) return;

        List<PendingEvent> batch = new ArrayList<>();
        pendingQueue.drainTo(batch, 100);

        for (PendingEvent event : batch) {
            try {
                kafkaTemplate.send(event.topic(), event.key(), event.value())
                    .get(5, TimeUnit.SECONDS);
                kafkaHealthy.set(true);
                Metrics.counter("kafka.fallback.flushed").increment();
            } catch (Exception e) {
                kafkaHealthy.set(false);
                pendingQueue.offer(event); // Put it back
                break; // Stop flushing, Kafka still down
            }
        }
    }

    record PendingEvent(String topic, String key, String value, Instant created) {}
}

The queue is bounded at 10,000 events. At 500 events/second, this holds 20 seconds of traffic. If Kafka is down for longer, events drop. The dropped events are analytics and trip completion notifications. Trips still exist in PostgreSQL or the WAL. The events are convenience, not data.

Surge Pricing Service Down: Circuit Breaker + Cache + Default

// SCALED: Three-level surge pricing fallback
@Component
public class SurgePricingClient {

    private final WebClient webClient;
    private final ReactiveRedisTemplate<String, String> redis;
    private final CircuitBreakerRegistry cbRegistry;
    private final BulkheadRegistry bulkheadRegistry;

    private static final String CACHE_PREFIX = "surge:last_known:";
    private static final BigDecimal DEFAULT_MULTIPLIER = BigDecimal.ONE;

    public Mono<SurgeResult> getMultiplier(String zoneId) {
        CircuitBreaker cb = cbRegistry.circuitBreaker("surgePricing");
        Bulkhead bh = bulkheadRegistry.bulkhead("surgePricing");

        // Level 1: Live service
        return webClient.get()
            .uri("/api/surge/{zoneId}", zoneId)
            .retrieve()
            .bodyToMono(SurgeResponse.class)
            .map(r -> SurgeResult.live(r.getMultiplier()))
            .timeout(Duration.ofSeconds(2))
            .doOnNext(r -> cacheMultiplier(zoneId, r.multiplier()))
            .transformDeferred(CircuitBreakerOperator.of(cb))
            .transformDeferred(BulkheadOperator.of(bh))
            // Level 2: Cached multiplier
            .onErrorResume(ex ->
                redis.opsForValue().get(CACHE_PREFIX + zoneId)
                    .map(val -> SurgeResult.cached(new BigDecimal(val)))
                    // Level 3: Default 1.0x
                    .defaultIfEmpty(SurgeResult.defaultValue()))
            .onErrorReturn(SurgeResult.defaultValue());
    }

    private void cacheMultiplier(String zoneId, BigDecimal multiplier) {
        redis.opsForValue()
            .set(CACHE_PREFIX + zoneId, multiplier.toString(),
                Duration.ofMinutes(10))
            .subscribe();
    }

    public record SurgeResult(BigDecimal multiplier, String source) {
        public static SurgeResult live(BigDecimal m) {
            return new SurgeResult(m, "live");
        }
        public static SurgeResult cached(BigDecimal m) {
            return new SurgeResult(m, "cached");
        }
        public static SurgeResult defaultValue() {
            return new SurgeResult(DEFAULT_MULTIPLIER, "default");
        }
    }
}

DI Wiring for Primary/Fallback

// SCALED: Conditional bean wiring for fallback implementations
@Configuration
public class PersistenceConfig {

    @Bean
    @Primary
    @ConditionalOnProperty(name = "persistence.mode", havingValue = "normal",
                           matchIfMissing = true)
    public TripPersistence postgresqlTripPersistence(
            TripRepository repository,
            CacheService cache) {
        return new PostgresTripPersistence(repository, cache);
    }

    @Bean
    @ConditionalOnProperty(name = "persistence.mode", havingValue = "degraded")
    public TripPersistence degradedTripPersistence(
            CacheService cache,
            WriteAheadLog wal) {
        return new DegradedTripPersistence(cache, wal);
    }

    @Bean
    public TripPersistence adaptiveTripPersistence(
            TripRepository repository,
            CacheService cache,
            WriteAheadLog wal,
            HealthIndicator pgHealth) {
        return new AdaptiveTripPersistence(repository, cache, wal, pgHealth);
    }
}
// SCALED: Adaptive persistence that switches at runtime
public class AdaptiveTripPersistence implements TripPersistence {

    private final TripRepository repository;
    private final CacheService cache;
    private final WriteAheadLog wal;
    private final HealthIndicator pgHealth;

    @Override
    public Mono<Trip> save(Trip trip) {
        if (pgHealth.health().getStatus() == Status.UP) {
            return repository.save(trip)
                .doOnNext(t -> cache.set("trip:" + t.getRideId(),
                    serialize(t), Duration.ofHours(2)).subscribe())
                .onErrorResume(ex -> fallbackSave(trip));
        }
        return fallbackSave(trip);
    }

    private Mono<Trip> fallbackSave(Trip trip) {
        return Mono.fromCallable(() -> {
            wal.append(new WALEntry(
                trip.getRideId(), "TRIP_CREATE",
                serialize(trip), Instant.now()));
            return trip;
        }).subscribeOn(Schedulers.boundedElastic())
          .flatMap(t -> cache.set("trip:" + t.getRideId(),
              serialize(t), Duration.ofHours(2))
              .thenReturn(t));
    }

    private String serialize(Trip trip) {
        try {
            return new ObjectMapper().writeValueAsString(trip);
        } catch (JsonProcessingException e) {
            throw new RuntimeException(e);
        }
    }
}

The Proof

Load test: PostgreSQL killed at T+30s, stays down for 60 seconds. Locust running 500 users throughout.

# SCALED: Locust test for partial availability
from locust import HttpUser, task, between

class PartialAvailabilityTest(HttpUser):
    wait_time = between(0.05, 0.2)

    @task(10)
    def book_ride(self):
        with self.client.post("/api/rides/book", json={
            "riderId": f"rider-{self.environment.runner.user_count}",
            "pickupLat": 40.7128, "pickupLng": -74.0060,
            "dropoffLat": 40.7580, "dropoffLng": -73.9855,
            "zoneId": "manhattan-midtown"
        }, catch_response=True) as resp:
            if resp.status_code == 200:
                resp.success()
            else:
                resp.failure(f"Status {resp.status_code}")

    @task(3)
    def complete_trip(self):
        self.client.post("/api/trips/complete", json={
            "rideId": "ride-test-1",
            "fare": 24.50
        })

Results with PostgreSQL down for 60 seconds:

Locust: 500 users, PG killed at T+30s, restored at T+90s

Phase           Duration  Booking p50  Error Rate   Throughput   Persistence
Before PG fail  30s       120ms        0.03%        4,980 RPS    PostgreSQL
PG failover     3s        800ms        4.2%         3,100 RPS    Transitioning
PG down         57s       145ms        0.2%         4,230 RPS    Redis + WAL
PG recovery     5s        310ms        0.5%         4,500 RPS    WAL replay
After recovery  60s       120ms        0.03%        4,980 RPS    PostgreSQL

During the 57-second PostgreSQL outage:

  • 4,230 RPS at 0.2% error rate (85% of normal throughput)
  • Trip records written to the WAL: 12,400 entries
  • Trip records cached in Redis: 12,375 (25 cache write failures)
  • WAL replay on recovery: 12,400 entries replayed in 8 seconds
  • Duplicate records (PG had the original): 0 (idempotent insert)
  • Data loss: 0

The 0.2% error rate came from new riders with no cached driver location data. Their first request failed at the driver matching stage because the matching service needed PostgreSQL for the initial driver pool query. Subsequent requests for those riders succeeded because the matching service cached the driver pool in Redis after the first successful query (before PG went down).

The 15% throughput reduction came from the WAL write overhead (file I/O on every request) and the Redis round-trip replacing the PostgreSQL round-trip (similar latency but different connection pool sizing).

WAL replay statistics:
Entries written:        12,400
Entries replayed:       12,400
Replay duration:        8.2 seconds
Replay rate:            1,512 entries/second
Duplicate skipped:      0
Errors:                 0
Data integrity:         100%