surviving the spike

Algorithms and Health Checks That Actually Work

Chapter 41 of 66


The Symptom

The incident postmortem reads: “Health check returned 200 for 7 minutes while the service was unable to serve requests.” The team stares at the timeline. Pod-3’s PostgreSQL connection pool exhausted at 18:42. Every request to pod-3 started failing at 18:42. The load balancer continued routing 430 RPS to pod-3 until 18:49 when the on-call engineer manually removed it. 180,600 failed requests. Customer-facing error rate: 8.3%.

The health check endpoint:

// BOTTLENECK: Health check that lies
@GetMapping("/health")
public ResponseEntity<Map<String, String>> health() {
    return ResponseEntity.ok(Map.of("status", "UP"));
}

This endpoint checks nothing. It confirms the JVM is running and the HTTP server is listening. It does not confirm the service can do its job. The load balancer asked “are you alive?” and the pod answered “yes” while dropping every request.

The Cause

Health checks serve two audiences with conflicting needs.

The load balancer needs to know: “Can this pod serve requests right now?” If the answer is no, stop routing traffic to it. This is the readiness check. It should verify every external dependency the pod needs to handle a request: database connectivity, cache availability, downstream service reachability.

The container runtime needs to know: “Is this pod in a recoverable state?” If the answer is no, restart it. This is the liveness check. It should verify only the pod’s internal state: is the JVM responsive, is the main thread alive, is the process deadlocked. It must not check external dependencies because restarting a pod does not fix a broken database.

The rider API depends on PostgreSQL (primary data store) and Redis (session cache, rate limiting). A request cannot be served without both. The readiness check must verify both:

// SCALED: Readiness check that verifies real capabilities
@GetMapping("/health/ready")
public Mono<ResponseEntity<Map<String, Object>>> readiness() {
    Mono<HealthStatus> dbCheck = checkPostgres();
    Mono<HealthStatus> redisCheck = checkRedis();
    Mono<HealthStatus> poolCheck = checkConnectionPool();

    return Mono.zip(dbCheck, redisCheck, poolCheck)
        .map(tuple -> {
            Map<String, Object> details = new LinkedHashMap<>();
            details.put("postgres", tuple.getT1());
            details.put("redis", tuple.getT2());
            details.put("connectionPool", tuple.getT3());

            boolean healthy = tuple.getT1().isUp()
                && tuple.getT2().isUp()
                && tuple.getT3().isUp();

            details.put("status", healthy ? "UP" : "DOWN");

            return healthy
                ? ResponseEntity.ok(details)
                : ResponseEntity.status(503).body(details);
        });
}

private Mono<HealthStatus> checkPostgres() {
    return Mono.fromCallable(() -> {
        try (Connection conn = dataSource.getConnection()) {
            try (PreparedStatement stmt = conn.prepareStatement("SELECT 1")) {
                stmt.setQueryTimeout(3);
                stmt.executeQuery();
                return HealthStatus.up("postgres");
            }
        }
    })
    .subscribeOn(Schedulers.boundedElastic())
    .timeout(Duration.ofSeconds(3))
    .onErrorResume(e -> Mono.just(
        HealthStatus.down("postgres", e.getMessage())
    ));
}

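private Mono<HealthStatus> checkRedis() {
    // Release the connection after each probe so repeated checks do not leak connections.
    return Mono.usingWhen(
            Mono.fromSupplier(redisConnectionFactory::getReactiveConnection),
            connection -> connection.ping(),
            connection -> connection.closeLater())
        .map(pong -> HealthStatus.up("redis"))
        .timeout(Duration.ofSeconds(2))
        .onErrorResume(e -> Mono.just(
            HealthStatus.down("redis", e.getMessage())
        ));
}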

private Mono<HealthStatus> checkConnectionPool() {
    HikariPoolMXBean pool = ((HikariDataSource) dataSource)
        .getHikariPoolMXBean();

    int active = pool.getActiveConnections();          // connections currently in use
    int total = pool.getTotalConnections();            // connections the pool currently holds
    int pending = pool.getThreadsAwaitingConnection(); // threads blocked waiting for a connection

    boolean healthy = pending < 5 && active < total;

    // The same gauges are reported whether the check passes or fails.
    String detail = String.format("active=%d total=%d pending=%d",
        active, total, pending);

    return Mono.just(healthy
        ? HealthStatus.up("connectionPool", detail)
        : HealthStatus.down("connectionPool", detail));
}
// Supporting HealthStatus record
public record HealthStatus(String component, String status, String detail) {
    public boolean isUp() { return "UP".equals(status); }
    public static HealthStatus up(String component) {
        return new HealthStatus(component, "UP", "");
    }
    public static HealthStatus up(String component, String detail) {
        return new HealthStatus(component, "UP", detail);
    }
    public static HealthStatus down(String component, String detail) {
        return new HealthStatus(component, "DOWN", detail);
    }
}

The connection pool check deserves attention. threadsAwaitingConnection is the number of threads blocked waiting for a database connection. When this reaches 5, the pool is under pressure. When activeConnections equals totalConnections, every connection the pool currently holds is in use; on a fixed-size pool (HikariCP's default, where minimumIdle equals maximumPoolSize) that means the pool is exhausted. The readiness check catches pool exhaustion before request timeouts do.
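The threshold logic can be pulled out of the HikariCP call so it can be exercised without a database. The following is a minimal sketch under that assumption; PoolSnapshot and PoolHealth are hypothetical names introduced for illustration, and the thresholds simply mirror the readiness check above.

```java
/** Hypothetical snapshot of the three pool gauges read from HikariPoolMXBean. */
record PoolSnapshot(int active, int total, int pending) {}

/** Hypothetical helper: the same predicate as checkConnectionPool, without the pool. */
final class PoolHealth {
    /** Mirrors the chapter's threshold: healthy only with fewer than 5 waiting threads. */
    static final int MAX_PENDING = 5;

    private PoolHealth() {}

    static boolean isHealthy(PoolSnapshot s) {
        // Healthy only if few threads are queued AND at least one connection is idle.
        return s.pending() < MAX_PENDING && s.active() < s.total();
    }

    public static void main(String[] args) {
        // Spare capacity, nobody waiting.
        System.out.println(isHealthy(new PoolSnapshot(4, 10, 0)));
        // Every connection in use and threads queued, as on pod-3 at 18:42.
        System.out.println(isHealthy(new PoolSnapshot(10, 10, 12)));
    }
}
```

Keeping the predicate separate makes it easy to tune the thresholds against recorded pool metrics before changing what the load balancer sees.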

The liveness check is minimal:

// SCALED: Liveness check - only JVM responsiveness
@GetMapping("/health/live")
public ResponseEntity<Map<String, String>> liveness() {
    return ResponseEntity.ok(Map.of("status", "UP"));
}

If the JVM can execute this handler and return a response, it is alive. If it cannot (deadlock, full GC loop, out of file descriptors), the HTTP server will not respond, the liveness probe will time out, and Kubernetes will restart the pod.

The Baseline

Comparison of health check approaches:

Check Type      Detects DB Failure    Detects Redis Failure    False Restarts    Cost
TCP connect     No                    No                       No                ~0
HTTP 200        No                    No                       No                ~0
Shallow /health No                    No                       No                0.5ms
Deep readiness  Yes                   Yes                      No                5ms
Deep liveness   Yes                   Yes                      YES               5ms

Deep liveness checks (checking dependencies in the liveness probe) cause false restarts. When PostgreSQL goes down, the liveness probe fails on all pods. Kubernetes restarts all pods simultaneously. The pods come back, attempt to connect to the still-down database, fail the liveness check again, and restart again. A restart loop that amplifies the outage.

The Fix

Kubernetes probe configuration

# SCALED: Probe configuration for the rider API
spec:
  containers:
    - name: rider-api
      ports:
        - containerPort: 8080
      readinessProbe:
        httpGet:
          path: /health/ready
          port: 8080
        initialDelaySeconds: 15
        periodSeconds: 10
        timeoutSeconds: 5
        failureThreshold: 3
        successThreshold: 1
      livenessProbe:
        httpGet:
          path: /health/live
          port: 8080
        initialDelaySeconds: 30
        periodSeconds: 15
        timeoutSeconds: 3
        failureThreshold: 5
        successThreshold: 1
      startupProbe:
        httpGet:
          path: /health/live
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 5
        failureThreshold: 30
        successThreshold: 1

The startupProbe runs during pod startup only. It gives the JVM up to 155 seconds (5 + 30*5) to start. During this time, neither the liveness nor readiness probe runs. Without a startup probe, the liveness probe’s initialDelaySeconds: 30 might be too short for a cold start with class loading and Spring context initialization. The startup probe prevents liveness from killing a pod that is still booting.

Timing decisions:

Readiness periodSeconds: 10. Check every 10 seconds. A failing pod is marked unready within about 30 seconds (3 failures * 10 seconds). Recovery takes up to 10 seconds (1 success). Up to 30 seconds of failed requests on a single pod is an acceptable trade-off; once the pod is removed, the other pods absorb its traffic.

Readiness failureThreshold: 3. Three consecutive failures before removal. A single failed check (network blip, momentary connection pool spike) does not remove the pod. Three consecutive failures over 30 seconds indicate a real problem.

Liveness failureThreshold: 5. Five consecutive failures before restart, at 15-second intervals: 75 seconds. This is intentionally high. Restarting is destructive. All in-flight requests are dropped. The pod’s warm caches are lost. If the JVM is truly hung, 75 seconds is a reasonable detection time. If the JVM is experiencing a long GC pause (which G1 can have during a mixed collection), 75 seconds gives it time to recover without an unnecessary restart.

Readiness timeoutSeconds: 5. The deep health check queries PostgreSQL and Redis. If either takes more than 5 seconds to respond, the check fails. A 5-second timeout is generous for “SELECT 1” and Redis PING. If these operations take 5 seconds, the service cannot serve real requests within SLA anyway.
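The timing figures above are simple products of the probe settings. As a rough sketch (ProbeTiming is a hypothetical helper, and the model ignores timeoutSeconds and where in the probe period a failure begins), the arithmetic looks like this:

```java
/** Hypothetical helper that reproduces the probe timing arithmetic from the configuration. */
final class ProbeTiming {
    private ProbeTiming() {}

    /** Approximate seconds of consecutive failures before Kubernetes acts on a probe. */
    static int detectionSeconds(int failureThreshold, int periodSeconds) {
        return failureThreshold * periodSeconds;
    }

    /** Startup budget: the initial delay plus every allowed failed attempt. */
    static int startupBudgetSeconds(int initialDelaySeconds,
                                    int failureThreshold,
                                    int periodSeconds) {
        return initialDelaySeconds + failureThreshold * periodSeconds;
    }

    public static void main(String[] args) {
        System.out.println("readiness removal: " + detectionSeconds(3, 10) + "s");
        System.out.println("liveness restart:  " + detectionSeconds(5, 15) + "s");
        System.out.println("startup budget:    " + startupBudgetSeconds(5, 30, 5) + "s");
    }
}
```

Plugging in alternative values shows the trade-off directly: halving the readiness period halves detection time and doubles the probe load on the dependencies.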

Algorithm comparison for the ride-hailing platform

Testing with 12 pods, 5,000 RPS, one pod injected with 200ms artificial latency:

Algorithm           p50     p99      Max      Degraded Pod RPS
Round-robin         95ms    1,450ms  3,200ms  430 (8.3%)
Weighted RR         90ms    980ms    2,100ms  215 (4.1%)
Least connections   92ms    210ms    480ms    85 (1.6%)
Power of 2 choices  93ms    240ms    520ms    110 (2.1%)

Round-robin sends equal traffic to the degraded pod. The degraded pod’s 200ms additional latency affects 8.3% of all requests, pulling the p99 to 1,450ms.

Least connections detects the degraded pod’s higher connection count within seconds. Connection count rises because requests take longer: at 430 RPS with 200ms additional latency, the pod has ~86 extra concurrent connections compared to healthy pods. The load balancer shifts traffic away.
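The ~86 figure follows from Little's law: average in-flight requests equal arrival rate times time in system. A minimal sketch of that calculation, with LittlesLaw as a hypothetical helper name:

```java
/** Hypothetical helper applying Little's law (L = lambda * W) to one pod. */
final class LittlesLaw {
    private LittlesLaw() {}

    /** Average number of in-flight requests at the given rate and latency. */
    static double inFlight(double requestsPerSecond, double latencySeconds) {
        return requestsPerSecond * latencySeconds;
    }

    public static void main(String[] args) {
        // Extra concurrency caused by 200ms of added latency at 430 RPS.
        double extra = inFlight(430.0, 0.200);
        System.out.printf("extra concurrent connections: %.0f%n", extra);
    }
}
```

This is the signal least connections reacts to: the balancer never measures latency, only the connection backlog that the latency produces.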

Power of two choices performs similarly to least connections but avoids the thundering-herd effect. When the degraded pod recovers, least connections might briefly stampede all traffic to it (it has the fewest connections). Power of two choices randomizes the selection, spreading the recovery load.

For the ride-hailing platform, least connections is the correct default. The thundering herd risk is low with 12+ pods because recovery traffic distributes across the fleet, not to a single pod.
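To make the difference between the two strategies concrete, here is a minimal selection sketch. It is not the load balancer's actual implementation; BalancerSketch, the array of in-flight counts, and the example numbers are assumptions for illustration.

```java
import java.util.Random;

/** Hypothetical sketch of two backend-selection strategies over in-flight counts. */
final class BalancerSketch {
    private BalancerSketch() {}

    /** Least connections: scan every backend and take the lowest in-flight count. */
    static int leastConnections(int[] inFlight) {
        int best = 0;
        for (int i = 1; i < inFlight.length; i++) {
            if (inFlight[i] < inFlight[best]) {
                best = i;
            }
        }
        return best;
    }

    /** Power of two choices: sample two distinct backends, keep the less loaded one. */
    static int powerOfTwoChoices(int[] inFlight, Random rng) {
        if (inFlight.length < 2) {
            throw new IllegalArgumentException("need at least two backends");
        }
        int a = rng.nextInt(inFlight.length);
        int b = rng.nextInt(inFlight.length - 1);
        if (b >= a) {
            b++; // shift past a so the two samples are always distinct
        }
        return inFlight[a] <= inFlight[b] ? a : b;
    }

    public static void main(String[] args) {
        // Twelve pods; index 3 is the degraded pod carrying ~86 extra connections.
        int[] inFlight = {20, 21, 19, 106, 20, 22, 18, 20, 21, 19, 20, 20};
        System.out.println("least connections picks pod " + leastConnections(inFlight));
        System.out.println("power of two picks pod "
            + powerOfTwoChoices(inFlight, new Random()));
    }
}
```

Least connections always lands on the single emptiest pod, which is what causes the stampede on a freshly recovered backend. Power of two choices only guarantees that the most loaded pod is never chosen, so recovery traffic is spread across whichever pods happen to be sampled.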

The health check that lied: a postmortem

The timeline:

18:42:00  PostgreSQL connection pool exhausted on pod-3
18:42:01  All new requests to pod-3 start timing out (5s DB timeout)
18:42:01  Health check still returns 200 (does not check DB)
18:42:10  First health check runs: 200 OK
18:42:20  Second health check: 200 OK
18:42:30  Third health check: 200 OK
          ... (health check returns 200 for 7 minutes)
18:49:00  On-call engineer runs: kubectl delete pod rider-api-pod-3
18:49:05  New pod starts, connects to PostgreSQL successfully
18:49:35  New pod passes readiness, starts receiving traffic

With the deep readiness check, the timeline would have been:

18:42:00  PostgreSQL connection pool exhausted on pod-3
18:42:10  Readiness check: DB check fails (pool exhausted), returns 503 (failure 1)
18:42:20  Readiness check: 503 (failure 2)
18:42:30  Readiness check: 503 (failure 3) → pod removed from endpoints
18:42:31  Traffic stops routing to pod-3
18:42:31  Pod-3's connection pool starts recovering (no new requests)
18:43:00  Connection pool recovers, readiness check returns 200
18:43:00  Pod-3 re-added to endpoints, resumes serving traffic

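Total downtime: about 30 seconds of failing traffic to pod-3. At a steady 430 RPS that is an upper bound of roughly 12,900 requests (430 × 30); the ~4,300 figure corresponds to about 10 seconds at that rate, so it holds only if traffic to the pod tapers off as its requests begin to stall. With auto-recovery, the pod comes back without manual intervention.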
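The removal and re-add steps in that timeline come from counting consecutive probe results against failureThreshold and successThreshold. The sketch below is a simplified model of that counting, not Kubernetes' own code; ReadinessGate is a hypothetical name.

```java
/** Hypothetical, simplified model of readiness threshold counting. */
final class ReadinessGate {
    private final int failureThreshold;
    private final int successThreshold;
    private int consecutiveFailures;
    private int consecutiveSuccesses;
    private boolean ready = true;

    ReadinessGate(int failureThreshold, int successThreshold) {
        this.failureThreshold = failureThreshold;
        this.successThreshold = successThreshold;
    }

    /** Records one probe result; returns whether the pod is routable afterwards. */
    boolean record(boolean probeSucceeded) {
        if (probeSucceeded) {
            consecutiveFailures = 0;
            consecutiveSuccesses++;
            if (!ready && consecutiveSuccesses >= successThreshold) {
                ready = true;
            }
        } else {
            consecutiveSuccesses = 0;
            consecutiveFailures++;
            if (ready && consecutiveFailures >= failureThreshold) {
                ready = false;
            }
        }
        return ready;
    }

    public static void main(String[] args) {
        ReadinessGate gate = new ReadinessGate(3, 1);
        // Replay the timeline: three 503 responses, then one 200.
        boolean[] probes = {false, false, false, true};
        for (boolean ok : probes) {
            System.out.println((ok ? "200" : "503") + " -> routable=" + gate.record(ok));
        }
    }
}
```

A single success resets the failure counter, which is why one network blip between two real failures does not remove the pod.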

The Proof

After deploying deep readiness checks, startup probes, and least-connections balancing:

Metric                           Before           After            Delta
DB outage detection time         7 min (manual)   30s (auto)       -93%
Requests affected by DB issue    180,600          4,300            -97%
GC pause blast radius            430 requests     22 requests      -95%
False pod restarts/month         0                0                No change
Health check latency overhead    0.5ms            5ms              +4.5ms

The 5ms health check overhead is the cost of querying PostgreSQL and Redis every 10 seconds. At 12 pods, that is 1.2 health check queries per second to PostgreSQL. The database handles 50,000 queries per second. The overhead is 0.002%.
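The overhead estimate can be reproduced from the probe period and fleet size. A minimal sketch, with ProbeOverhead as a hypothetical helper and the 50,000 queries per second capacity taken from the text above:

```java
/** Hypothetical helper estimating the database load added by readiness probes. */
final class ProbeOverhead {
    private ProbeOverhead() {}

    /** Each pod issues one database check per probe period. */
    static double probeQueriesPerSecond(int pods, int periodSeconds) {
        return (double) pods / periodSeconds;
    }

    /** Probe load as a percentage of the database's query capacity. */
    static double percentOfCapacity(double probeQps, double capacityQps) {
        return 100.0 * probeQps / capacityQps;
    }

    public static void main(String[] args) {
        double qps = probeQueriesPerSecond(12, 10);
        System.out.printf("probe queries/s: %.1f%n", qps);
        System.out.printf("share of capacity: %.4f%%%n", percentOfCapacity(qps, 50_000));
    }
}
```

The same calculation is worth rerunning before scaling the fleet or shortening the probe period, since probe load grows linearly with both.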