surviving the spike

Feature Criticality and Graceful Degradation Chains

8 min read Chapter 56 of 66

The Symptom

The on-call engineer gets paged at 4 AM. The surge pricing service is returning 500s. The runbook says “restart the surge pricing pods.” The engineer restarts them. The pods come up. The surge pricing service works. 22 minutes later, it fails again. The engineer restarts again. This cycle repeats three times before someone senior wakes up and asks: “Why does surge pricing failing prevent ride bookings?”

It should not. But nobody ranked the features by criticality. Nobody built fallback chains. Nobody asked: “If this component disappears, what should the rider experience?”

The Cause

Every feature in the ride-hailing platform is treated as equally important. The booking pipeline calls surge pricing, fare calculation, driver matching, trip persistence, analytics, promotions, and ETA estimation in sequence. Any failure in any step returns 500 to the rider.

This is wrong. The platform exists to connect riders with drivers. Everything else is optimization. If surge pricing is down, book at base fare. If fare calculation is down, show an estimate. If analytics is down, nobody cares until the morning standup.

The criticality matrix forces engineering decisions about what matters:

Criticality    Feature              Must Work?   Fallback Acceptable?   Can Disable?
Critical       Ride booking         Yes          No                     No
Critical       Driver matching      Yes          Queue for retry        No
Important      Fare calculation     No           Estimated fare         No
Important      Surge pricing        No           Cached / base fare     Yes
Important      Payment processing   No           Charge after trip      No
Deferrable     Trip history         No           Show later             Yes
Deferrable     Driver ETA           No           "On the way"           Yes
Expendable     Analytics            No           Drop silently          Yes
Expendable     Promotions           No           Full price             Yes

Critical features need redundancy, circuit breakers, and fast failover. Important features need fallback chains. Deferrable features need graceful hiding. Expendable features need kill switches.
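One way to keep the matrix from living only in a document is to encode it next to the code that consults it. The following is a minimal sketch, not the platform's actual model: the enum names `Criticality` and `Feature` and the helper `failureBlocksBooking` are hypothetical, while the tiers, features, and "Can Disable?" values are copied from the table above.

```java
/** Hypothetical encoding of the criticality tiers and the strategy each tier calls for. */
enum Criticality {
    CRITICAL("redundancy, circuit breakers, fast failover"),
    IMPORTANT("fallback chain"),
    DEFERRABLE("graceful hiding"),
    EXPENDABLE("kill switch");

    private final String strategy;

    Criticality(String strategy) { this.strategy = strategy; }

    String strategy() { return strategy; }
}

/** Hypothetical feature registry; values mirror the matrix rows. */
enum Feature {
    RIDE_BOOKING(Criticality.CRITICAL, false),
    DRIVER_MATCHING(Criticality.CRITICAL, false),
    FARE_CALCULATION(Criticality.IMPORTANT, false),
    SURGE_PRICING(Criticality.IMPORTANT, true),
    PAYMENT_PROCESSING(Criticality.IMPORTANT, false),
    TRIP_HISTORY(Criticality.DEFERRABLE, true),
    DRIVER_ETA(Criticality.DEFERRABLE, true),
    ANALYTICS(Criticality.EXPENDABLE, true),
    PROMOTIONS(Criticality.EXPENDABLE, true);

    private final Criticality criticality;
    private final boolean canDisable;

    Feature(Criticality criticality, boolean canDisable) {
        this.criticality = criticality;
        this.canDisable = canDisable;
    }

    Criticality criticality() { return criticality; }

    boolean canDisable() { return canDisable; }

    /** Only a critical feature's failure is allowed to fail the booking itself. */
    boolean failureBlocksBooking() { return criticality == Criticality.CRITICAL; }
}
```

With something like this in place, the 4 AM question has a checkable answer: `Feature.SURGE_PRICING.failureBlocksBooking()` is false, so a surge pricing outage that returns 500 to the rider is a bug rather than expected behavior.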

The Baseline

The fare calculation path without a fallback chain:

// BOTTLENECK: Single path, no fallback
@Service
public class FareService {

    private final PricingRuleRepository pricingRules; // PostgreSQL
    private final SurgePricingClient surgeClient;      // External service

    public Mono<FareEstimate> calculateFare(RideRequest request) {
        return pricingRules.findByZone(request.getPickupZoneId()) // PG query
            .switchIfEmpty(Mono.error(
                new FareException("No pricing rules for zone")))
            .flatMap(rules ->
                surgeClient.getMultiplier(request.getZoneId())
                    .map(multiplier -> computeFare(request, rules, multiplier)));
    }
}

If PostgreSQL is slow, the fare calculation is slow. If PostgreSQL is down, the fare calculation fails. If the surge pricing service is down, the fare calculation fails. Two single points of failure in one method.

The Fix

Fallback Chain for Fare Calculation

// SCALED: Four-level fallback chain
@Service
public class FareService {

    private final PricingRuleRepository pricingRules;
    private final ReactiveRedisTemplate<String, String> redis;
    private final SurgePricingClient surgeClient;
    private final ObjectMapper objectMapper;

    private static final String FARE_CACHE_PREFIX = "fare:rules:";
    private static final String ZONE_BASE_RATES = "fare:base_rates";

    public Mono<FareEstimate> calculateFare(RideRequest request,
                                             List<String> degraded) {
        // Level 1: Exact fare from PostgreSQL + live surge
        return calculateExactFare(request)
            .onErrorResume(ex -> {
                degraded.add("exact_fare");
                // Level 2: Cached rules from Redis + live surge
                return calculateFromCachedRules(request);
            })
            .onErrorResume(ex -> {
                degraded.add("cached_fare");
                // Level 3: Base fare from Redis (no surge)
                return calculateBaseFare(request);
            })
            .onErrorResume(ex -> {
                degraded.add("base_fare");
                // Level 4: Fixed fare with post-trip calculation
                return Mono.just(FareEstimate.deferred(request,
                    "Fare will be calculated after your trip"));
            });
    }

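    private Mono<FareEstimate> calculateExactFare(RideRequest request) {
        return pricingRules.findByZone(request.getPickupZoneId())
            .flatMap(rules -> {
                cachePricingRules(request.getPickupZoneId(), rules);
                return surgeClient.getMultiplier(request.getZoneId())
                    .map(m -> computeFare(request, rules, m));
            })
            // An empty Mono is not an error, so onErrorResume would not fire.
            // Convert "no rules" or "no multiplier" into an error to reach Level 2.
            .switchIfEmpty(Mono.error(
                new FareException("Exact fare unavailable for zone")));
    }

    private Mono<FareEstimate> calculateFromCachedRules(RideRequest request) {
        return redis.opsForValue()
            .get(FARE_CACHE_PREFIX + request.getPickupZoneId())
            .flatMap(json -> {
                PricingRules rules = deserialize(json);
                return surgeClient.getMultiplier(request.getZoneId())
                    .map(m -> computeFare(request, rules, m))
                    // Surge unavailable: keep the cached rules, drop the multiplier
                    .onErrorResume(ex ->
                        Mono.just(computeFare(request, rules, BigDecimal.ONE)));
            })
            // A cache miss completes empty; raise an error to reach Level 3.
            .switchIfEmpty(Mono.error(
                new FareException("No cached pricing rules for zone")));
    }

    private Mono<FareEstimate> calculateBaseFare(RideRequest request) {
        return redis.opsForHash()
            .get(ZONE_BASE_RATES, request.getPickupZoneId())
            .map(rate -> FareEstimate.estimated(request,
                new BigDecimal(rate.toString()), BigDecimal.ONE))
            // A zone with no base rate completes empty; raise an error to reach Level 4.
            .switchIfEmpty(Mono.error(
                new FareException("No base rate for zone")));
    }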

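    private void cachePricingRules(String zoneId, PricingRules rules) {
        // Fire-and-forget cache write. Serialization runs inside the Mono so
        // that neither a serialization failure nor a Redis failure can fail
        // the fare calculation that triggered the write.
        Mono.fromCallable(() -> serialize(rules))
            .flatMap(json -> redis.opsForValue()
                .set(FARE_CACHE_PREFIX + zoneId, json, Duration.ofHours(1)))
            .onErrorResume(ex -> Mono.empty())
            .subscribe();
    }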

    private PricingRules deserialize(String json) {
        try {
            return objectMapper.readValue(json, PricingRules.class);
        } catch (JsonProcessingException e) {
            throw new RuntimeException(e);
        }
    }

    private String serialize(PricingRules rules) {
        try {
            return objectMapper.writeValueAsString(rules);
        } catch (JsonProcessingException e) {
            throw new RuntimeException(e);
        }
    }
}

The fallback chain:

Level 1: Exact fare (PG rules + live surge)
  ↓ PG fails or surge fails
Level 2: Cached fare (Redis rules + live surge, or Redis rules + no surge)
  ↓ Redis cache miss
Level 3: Base fare (Redis base rate for zone, no surge)
  ↓ Redis fails entirely
Level 4: Deferred fare ("Fare calculated after trip")
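The shape of the chain is independent of Reactor. The sketch below restates it in plain Java so the control flow is easy to see: levels are tried in order, each failed level is recorded, and the last resort always succeeds. The class `FallbackChain` and its method names are hypothetical. Both a thrown exception and an empty result count as a failed level, which matches the diagram, where a cache miss moves the request down a level just as an outage does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Supplier;

/** Hypothetical, blocking analogue of the reactive fallback chain. */
final class FallbackChain<T> {

    private record Level<T>(String name, Supplier<Optional<T>> attempt) {}

    private final List<Level<T>> levels = new ArrayList<>();

    /** Registers a level; levels are tried in registration order. */
    FallbackChain<T> level(String name, Supplier<Optional<T>> attempt) {
        levels.add(new Level<>(name, attempt));
        return this;
    }

    /**
     * Returns the first available result. The name of every level that
     * threw or came back empty is appended to {@code degraded}.
     */
    T resolve(List<String> degraded, T lastResort) {
        for (Level<T> level : levels) {
            try {
                Optional<T> result = level.attempt().get();
                if (result.isPresent()) {
                    return result.get();
                }
            } catch (RuntimeException ex) {
                // Treated the same as an empty result: fall through to the next level.
            }
            degraded.add(level.name());
        }
        return lastResort;
    }
}
```

A call such as `new FallbackChain<String>().level("exact_fare", ...).level("cached_fare", ...).level("base_fare", ...).resolve(degraded, "deferred")` leaves `degraded` holding exactly the levels that were skipped, which is the list the booking response later exposes to the client.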

Each level produces a FareEstimate with a source field indicating how the fare was calculated. The frontend adjusts the display:

// SCALED: FareEstimate with degradation tracking
public record FareEstimate(
    BigDecimal amount,
    BigDecimal surgeMultiplier,
    String currency,
    FareSource source,
    String message
) {
    public enum FareSource {
        EXACT,       // PG + live surge
        CACHED,      // Redis rules + surge
        ESTIMATED,   // Redis base rate
        DEFERRED     // Calculate after trip
    }

    public static FareEstimate deferred(RideRequest request, String message) {
        return new FareEstimate(null, null, "USD", FareSource.DEFERRED, message);
    }

    public static FareEstimate estimated(RideRequest request,
                                          BigDecimal baseRate,
                                          BigDecimal surge) {
        BigDecimal distance = calculateDistance(request);
        return new FareEstimate(
            baseRate.multiply(distance).multiply(surge),
            surge, "USD", FareSource.ESTIMATED,
            "Estimated fare based on zone base rate");
    }
}
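How the display adapts to each source is not shown in the chapter. As a rough illustration only, a formatter along the following lines could turn the `source` field into rider-facing text. The class `FareDisplay` and all label wording are hypothetical, and the enum is redeclared here only to keep the sketch self-contained.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

/** Hypothetical display helper; the wording of each label is illustrative. */
final class FareDisplay {

    enum FareSource { EXACT, CACHED, ESTIMATED, DEFERRED }

    static String label(FareSource source, BigDecimal amount, String currency) {
        // A deferred fare carries no amount, so it must be handled before formatting.
        if (source == FareSource.DEFERRED || amount == null) {
            return "Fare will be calculated after your trip";
        }
        String price = currency + " "
            + amount.setScale(2, RoundingMode.HALF_UP).toPlainString();
        return switch (source) {
            case EXACT -> price;
            case CACHED -> price + " (based on recent rates)";
            case ESTIMATED -> "About " + price + " (estimate)";
            case DEFERRED -> throw new IllegalStateException("handled above");
        };
    }
}
```

The point of the sketch is the null check: `FareEstimate.deferred` sets `amount` and `surgeMultiplier` to null, so any consumer that formats the amount unconditionally will fail on exactly the requests the fallback chain worked hardest to save.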

Redis Feature Flags with Kill Switches

// SCALED: Feature flag service with health tracking
@Service
public class FeatureFlagService {

    private final ReactiveRedisTemplate<String, String> redis;
    private final MeterRegistry meterRegistry;
    private static final String FLAGS_KEY = "feature_flags";

    private static final Map<String, Boolean> DEFAULTS = Map.of(
        "surge_pricing_enabled", true,
        "trip_history_enabled", true,
        "analytics_enabled", true,
        "promotions_enabled", true,
        "driver_eta_enabled", true,
        "exact_fare_enabled", true
    );

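    // Micrometer gauges keep only a weak reference to the number they observe,
    // and re-registering the same name and tags does not update the value.
    // Each flag therefore gets one strongly referenced AtomicInteger, registered once.
    // Requires java.util.concurrent.ConcurrentHashMap and
    // java.util.concurrent.atomic.AtomicInteger.
    private final Map<String, AtomicInteger> flagGauges = new ConcurrentHashMap<>();

    public Mono<Boolean> isEnabled(String feature) {
        return redis.opsForHash()
            .get(FLAGS_KEY, feature)
            .map(val -> "true".equals(val))
            .defaultIfEmpty(DEFAULTS.getOrDefault(feature, true))
            .onErrorReturn(DEFAULTS.getOrDefault(feature, true))
            .doOnNext(enabled -> flagGauges
                .computeIfAbsent(feature, f -> meterRegistry.gauge(
                    "feature.flag.status",
                    Tags.of("feature", f),
                    new AtomicInteger()))
                .set(enabled ? 1 : 0));
    }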

    public Mono<Void> disable(String feature) {
        return redis.opsForHash()
            .put(FLAGS_KEY, feature, "false")
            .doOnSuccess(v -> meterRegistry.counter(
                "feature.flag.changed",
                Tags.of("feature", feature, "action", "disable"))
                .increment())
            .then();
    }

    public Mono<Void> enable(String feature) {
        return redis.opsForHash()
            .put(FLAGS_KEY, feature, "true")
            .doOnSuccess(v -> meterRegistry.counter(
                "feature.flag.changed",
                Tags.of("feature", feature, "action", "enable"))
                .increment())
            .then();
    }
}
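The resolution order inside `isEnabled` is worth spelling out, because it decides what happens when Redis itself is the thing that is down. The plain-Java sketch below mirrors that order under stated assumptions: `FlagResolver` is a hypothetical name, and the `store` function stands in for the Redis hash lookup, returning an empty `Optional` when a flag has never been set and throwing when the store is unreachable.

```java
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical blocking analogue of FeatureFlagService.isEnabled. */
final class FlagResolver {

    private final Map<String, Boolean> defaults;
    // Stand-in for the Redis hash: empty = flag never set, exception = store unreachable.
    private final Function<String, Optional<String>> store;

    FlagResolver(Map<String, Boolean> defaults,
                 Function<String, Optional<String>> store) {
        this.defaults = defaults;
        this.store = store;
    }

    boolean isEnabled(String feature) {
        boolean fallback = defaults.getOrDefault(feature, true);
        try {
            // 1. A stored value wins. 2. A missing value falls back to the default.
            return store.apply(feature).map("true"::equals).orElse(fallback);
        } catch (RuntimeException ex) {
            // 3. A store failure also falls back to the default.
            return fallback;
        }
    }
}
```

Because every default in the chapter's map is `true`, this is a fail-open design: if Redis becomes unreachable after an operator has disabled a feature, the flag reads as enabled again. That keeps the booking path independent of the flag store, at the cost of kill switches that do not survive a Redis outage.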

WebFilter Kill Switch

// SCALED: Kill switch filter for deferrable/expendable features
@Component
@Order(1)
public class KillSwitchFilter implements WebFilter {

    private final FeatureFlagService featureFlags;
    private final MeterRegistry meterRegistry;

    private static final Map<String, KillSwitchConfig> KILL_SWITCHES = Map.of(
        "/api/trips/history", new KillSwitchConfig(
            "trip_history_enabled", "deferrable",
            "{\"message\":\"Trip history is temporarily unavailable\"}"),
        "/api/analytics", new KillSwitchConfig(
            "analytics_enabled", "expendable", ""),
        "/api/promotions", new KillSwitchConfig(
            "promotions_enabled", "expendable",
            "{\"promotions\":[]}"),
        "/api/drivers/eta", new KillSwitchConfig(
            "driver_eta_enabled", "deferrable",
            "{\"eta\":null,\"message\":\"ETA temporarily unavailable\"}")
    );

    @Override
    public Mono<Void> filter(ServerWebExchange exchange, WebFilterChain chain) {
        String path = exchange.getRequest().getPath().value();

        return KILL_SWITCHES.entrySet().stream()
            .filter(e -> path.startsWith(e.getKey()))
            .findFirst()
            .map(entry -> featureFlags.isEnabled(entry.getValue().flag())
                .flatMap(enabled -> {
                    if (!enabled) {
                        meterRegistry.counter("killswitch.activated",
                            Tags.of("feature", entry.getValue().flag(),
                                    "criticality", entry.getValue().criticality()))
                            .increment();

                        ServerHttpResponse response = exchange.getResponse();
                        response.setStatusCode(HttpStatus.SERVICE_UNAVAILABLE);
                        response.getHeaders().setContentType(MediaType.APPLICATION_JSON);
                        response.getHeaders().add("X-Degraded", entry.getValue().flag());

                        String body = entry.getValue().responseBody();
                        if (body.isEmpty()) {
                            return response.setComplete();
                        }
                        DataBuffer buffer = response.bufferFactory()
                            .wrap(body.getBytes(StandardCharsets.UTF_8));
                        return response.writeWith(Mono.just(buffer));
                    }
                    return chain.filter(exchange);
                }))
            // orElseGet avoids building the pass-through Mono when a kill switch matched
            .orElseGet(() -> chain.filter(exchange));
    }

    record KillSwitchConfig(String flag, String criticality, String responseBody) {}
}

Response Contract with Degraded Field

// SCALED: API response with degradation transparency
public record RideBookingResponse(
    String rideId,
    String driverId,
    FareEstimate fare,
    List<String> degradedFeatures,
    Map<String, String> degradedMessages
) {
    public static RideBookingResponse from(Trip trip, List<String> degraded) {
        Map<String, String> messages = new LinkedHashMap<>();
        for (String feature : degraded) {
            messages.put(feature, DEGRADED_MESSAGES.getOrDefault(feature,
                "This feature is temporarily in degraded mode"));
        }
        return new RideBookingResponse(
            trip.getRideId(),
            trip.getDriverId(),
            trip.getFare(),
            degraded,
            messages
        );
    }

    private static final Map<String, String> DEGRADED_MESSAGES = Map.of(
        "exact_fare", "Showing estimated fare. Exact fare calculated after trip.",
        "surge_pricing", "Surge pricing unavailable. Booking at standard rate.",
        "trip_persistence", "Trip saved temporarily. Full receipt available soon.",
        "driver_eta", "Driver ETA temporarily unavailable."
    );
}

Grafana Degraded Mode Dashboard

# SCALED: Grafana dashboard for degraded mode monitoring
# Panels:

# Panel 1: Feature Flag Status (Stat panel, red/green)
# Query: feature_flag_status{feature=~".*"}
# Threshold: 0 = red (disabled), 1 = green (enabled)

# Panel 2: Kill Switch Activations (Time series)
# Query: rate(killswitch_activated_total[5m])
# Group by: feature

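# Panel 3: Fallback Chain Usage (Pie chart)
# Query: sum by (source) (fare_estimate_total)
# Shows distribution: exact vs cached vs estimated vs deferred
# Assumes a fare_estimate counter tagged with source is incremented
# wherever a FareEstimate is produced.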

# Panel 4: Degraded Response Rate (Time series)
# Query: sum(rate(http_server_requests_seconds_count{degraded="true"}[5m]))
#       / sum(rate(http_server_requests_seconds_count[5m])) * 100
# Note: degraded is not a default tag on http_server_requests; it has to be
# added as a custom tag for this query to match anything.
# Alert if > 20% of responses are degraded for > 5 minutes

# Alert rule: more than 20% degraded responses for 5 minutes
- alert: HighDegradationRate
  expr: |
    sum(rate(http_server_requests_seconds_count{degraded="true"}[5m]))
    / sum(rate(http_server_requests_seconds_count[5m])) * 100
    > 20
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "{{ $value | humanize }}% of responses are degraded"
    description: "Check which features are in degraded mode and investigate root cause"
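The alert expression combines two ideas: a ratio of two request rates expressed as a percentage, and a `for` clause that requires the condition to hold continuously before the alert fires. The sketch below walks through that arithmetic in plain Java. It is an illustration of the logic, not a model of how Prometheus evaluates rules; `DegradationAlert`, `Sample`, and `shouldFire` are hypothetical names, and one sample is assumed per evaluation interval.

```java
import java.util.List;

/** Hypothetical walk-through of the HighDegradationRate condition. */
final class DegradationAlert {

    /** Request rates observed at one evaluation. */
    record Sample(double degradedPerSec, double totalPerSec) {
        double degradedPercent() {
            // No traffic means nothing is degraded; avoids dividing by zero.
            return totalPerSec == 0 ? 0.0 : degradedPerSec / totalPerSec * 100.0;
        }
    }

    /**
     * True when each of the most recent {@code requiredConsecutive} samples
     * exceeds the threshold, mirroring "expr > 20" held "for: 5m".
     */
    static boolean shouldFire(List<Sample> samples,
                              double thresholdPercent,
                              int requiredConsecutive) {
        if (samples.size() < requiredConsecutive) {
            return false;
        }
        return samples
            .subList(samples.size() - requiredConsecutive, samples.size())
            .stream()
            .allMatch(s -> s.degradedPercent() > thresholdPercent);
    }
}
```

With one sample per minute, five consecutive samples at 25% degraded fire the alert, while a single dip to 10% in the middle of the window resets it. That is the intended behavior: a brief fallback during a deploy should not page anyone, but a sustained one should.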

The Proof

Scenario: surge pricing killed, PostgreSQL at 50% capacity, analytics service down.

Feature               Without Degraded Mode Design   With Degraded Mode Design

Surge pricing         500 errors                     Base fare (1.0x)
Fare calculation      Slow (PG at 50%)               Cached rules from Redis
Trip persistence      Works (PG at 50%)              Works (PG at 50%)
Trip history          Slow                           Kill-switched (503)
Analytics             500 errors                     Kill-switched (silent)
Promotions            Works                          Kill-switched (full price)

Booking error rate    34%                            0.3%
Booking throughput    3,200 RPS                      4,600 RPS (92%)
p50 booking           1,400 ms                       155 ms

Booking throughput holds at 92% of normal with three services degraded or down. The 8% reduction comes from Redis-based fare lookups being slightly slower than the hot PostgreSQL cache under normal conditions.

The 0.3% error rate comes from edge cases: new zones with no cached pricing rules and new riders with no historical data. These requests fall through to Level 4 of the fallback chain (deferred fare). The booking itself succeeds, but the response carries no fare amount, so the frontend has to handle it differently.

Kill switches for trip history, analytics, and promotions freed up 15% of the rider API’s capacity. Those features were consuming thread pool slots, Redis connections, and PostgreSQL queries that the critical booking path now uses instead.