surviving the spike

Load Balancers: Algorithms, Health Checks, and the Sticky Session Trap

The Symptom

The rider API runs 12 pods behind a Kubernetes Service with the default load balancing: round-robin via iptables rules. Grafana shows an even distribution of requests: each pod handles ~430 of the 5,200 RPS. The p99 latency is 180ms. Everything looks balanced.

Then pod-7 triggers a full GC pause. For 1.2 seconds, pod-7 stops responding. During those 1.2 seconds, round-robin keeps sending pod-7 its usual share: about 520 requests (430 RPS × 1.2 s). They queue behind the GC pause. When the pause ends, pod-7 works through the backlog. Those requests see 800ms-2,200ms latency. The p99 for the entire service spikes to 2,200ms because about 8% of traffic hit the paused pod.

With least-connections balancing, the load balancer would have noticed pod-7’s connection count climbing during the GC pause. After 50ms, new requests would route to other pods. The blast radius: ~22 requests instead of ~520. The p99 would barely blip.
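The blast-radius figures are the per-pod request rate multiplied by how long the balancer keeps sending to the stalled pod. A minimal sketch of that arithmetic, using the numbers above (the function name is illustrative, and the 50 ms reaction time is the estimate from the text, not a measured value):

```python
def requests_exposed(per_pod_rps: float, exposure_seconds: float) -> float:
    """Requests routed to a stalled pod while the balancer still sends to it."""
    return per_pod_rps * exposure_seconds

per_pod_rps = 5200 / 12  # 12 pods sharing 5,200 RPS, about 433 RPS each

round_robin_exposed = requests_exposed(per_pod_rps, 1.2)   # whole GC pause
least_conn_exposed = requests_exposed(per_pod_rps, 0.050)  # ~50 ms to shift traffic

print(round(round_robin_exposed), round(least_conn_exposed))
```

The exposure window, not the pause itself, sets the damage: shrinking it from 1.2 s to 50 ms cuts the affected requests by the same factor of 24.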

A second incident, 3 weeks later. The health check endpoint returns 200 OK on all pods. Pod-3’s PostgreSQL connection pool is exhausted. Every database query times out after 5 seconds. But /health does not check the database. It returns {"status": "UP"} unconditionally. The load balancer keeps routing 430 RPS to pod-3. Every request to pod-3 returns a 500 error after the 5-second database timeout. The error rate is 8.3% for 7 minutes until the on-call engineer notices.

A third pattern, more subtle. The platform uses sticky sessions for the rider API because an early version stored session data in local memory. That code was refactored to use Redis months ago, but the sticky session configuration stayed. One pod handles 3x the traffic of others because power users (frequent riders who open the app 40+ times per day) create long-lived sessions that pin to a single pod. When that pod restarts during a rolling deployment, 3x the expected traffic redistributes. The remaining pods, sized for 1x load, buckle.

Three problems. Three different aspects of load balancing.

Figure: L7 load balancer distributing traffic to 3 healthy pods with deep health checks, while an unhealthy pod with exhausted DB connections is removed from rotation and drains existing connections.

The diagram shows the target architecture: an L7 load balancer using least-connections routing with deep health checks that verify database and Redis connectivity. When Pod 4’s connection pool is exhausted, the deep health check detects the failure and removes it from rotation. In-flight requests drain gracefully over a 30-second window. Traffic redistributes to the 3 healthy pods automatically, with no engineer intervention.

The Cause

Load balancing has three components: the algorithm (how to pick a backend), the health check (how to know a backend is healthy), and session affinity (whether to route repeat requests to the same backend). Getting any one wrong undermines the other two.

Round-robin distributes requests equally. It assumes all backends have identical capacity and response times. When a backend is slow (GC pause, cold cache, noisy neighbor on the node), round-robin keeps sending it the same share of traffic. The slow backend accumulates a request queue. Latency for those requests degrades. The other backends are underutilized.
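A minimal sketch of why round-robin cannot react: the selection is a fixed rotation with no input from the backends. The pod names and the one-second window are illustrative assumptions.

```python
from collections import Counter
from itertools import cycle

backends = [f"pod-{i}" for i in range(1, 13)]
next_backend = cycle(backends)  # round-robin: fixed rotation, no feedback

# One second of traffic at 5,200 RPS. Nothing here knows that pod-7 is slow.
sent = Counter(next(next_backend) for _ in range(5200))

print(sent["pod-7"], max(sent.values()) - min(sent.values()))
```

Every pod ends up within one request of every other, whether or not it is keeping up with the work.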

Least connections routes each request to the backend with the fewest active connections. A slow backend naturally accumulates connections (requests take longer to complete). The load balancer sends it fewer new requests. The algorithm adapts without explicit health information.
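A toy version of that feedback loop, as a sketch rather than how any particular proxy implements it. The class name, the random tie-breaking, and the assumption that healthy pods finish each request before the next one arrives are all simplifications introduced for the example.

```python
import random
from collections import Counter

class LeastConnections:
    """Tracks in-flight requests per backend and picks the least loaded one."""

    def __init__(self, backends, seed=0):
        self.active = {b: 0 for b in backends}
        self.rng = random.Random(seed)  # seeded so the demo is repeatable

    def acquire(self):
        fewest = min(self.active.values())
        candidates = [b for b, n in self.active.items() if n == fewest]
        backend = self.rng.choice(candidates)  # break ties at random
        self.active[backend] += 1
        return backend

    def release(self, backend):
        self.active[backend] -= 1

lb = LeastConnections(["pod-6", "pod-7", "pod-8"])
sent = Counter()
for _ in range(300):
    backend = lb.acquire()
    sent[backend] += 1
    if backend != "pod-7":      # pod-7 is stalled: its requests never finish
        lb.release(backend)     # healthy pods finish before the next arrival

print(sent)
```

Once pod-7 holds a single unfinished request, it is never the least-loaded backend again, so it receives at most one request for the rest of the run. No health check was involved; the open connection itself is the signal.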

Power of two random choices picks two backends at random, then routes to the one with fewer connections. This avoids the thundering herd problem with pure least-connections: when one backend finishes a batch of requests, its connection count drops to the lowest, and every new request targets it simultaneously. Power of two choices adds randomness to prevent that pile-on.
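The pile-on and its fix can be sketched with a frozen snapshot of connection counts. Freezing the counts for the whole burst is an assumption that stands in for the short window before the balancer sees updated numbers; the pod names and counts are made up for the example.

```python
import random
from collections import Counter

def pick_least(active):
    """Pure least-connections: always the global minimum."""
    return min(active, key=active.get)

def pick_two_choices(active, rng):
    """Sample two distinct backends, keep the one with fewer connections."""
    a, b = rng.sample(list(active), 2)
    return a if active[a] <= active[b] else b

# pod-12 just finished a batch; the other 11 pods are busy.
active = {f"pod-{i}": 40 for i in range(1, 12)}
active["pod-12"] = 0

rng = random.Random(0)
burst = 1000
herd = Counter(pick_least(active) for _ in range(burst))
spread = Counter(pick_two_choices(active, rng) for _ in range(burst))

print(herd["pod-12"], spread["pod-12"])
```

Pure least-connections sends the entire burst to pod-12. With two random choices, pod-12 only wins when it is one of the two sampled backends, which happens for 11 of the 66 possible pairs, so it takes roughly a sixth of the burst while still being preferred whenever it is sampled.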

Health checks determine whether a backend can serve traffic. A shallow health check (TCP connect, HTTP 200) confirms the process is running. A deep health check (database query, Redis ping, downstream service call) confirms the process can do useful work. The gap between “process is running” and “process can serve requests” is where outages hide.

Sticky sessions (session affinity) pin a client to a specific backend. They exist because some applications store state locally: HTTP sessions, WebSocket connections, in-memory caches. Sticky sessions trade load distribution for consistency. The cost is uneven load, cascading failure when a backend dies, and sublinear scaling (adding backends does not proportionally increase capacity because sticky clients do not redistribute).
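The sublinear-scaling cost can be shown with a toy affinity table. This is a simplified sketch: the `pins` dictionary stands in for the affinity cookie, cookie expiry is ignored, and the client and pod names are hypothetical.

```python
from collections import Counter

pins = {}  # client id -> backend, standing in for the affinity cookie

def route(client_id, backends):
    """Pin a client on first contact, then always return the same backend."""
    if client_id not in pins:
        pins[client_id] = backends[len(pins) % len(backends)]
    return pins[client_id]

clients = [f"rider-{i}" for i in range(6)]
backends = ["pod-1", "pod-2", "pod-3"]
for c in clients:
    route(c, backends)              # first contact creates the pins

backends.append("pod-4")            # scale out by one pod
after_scale_out = Counter(route(c, backends) for c in clients)

print(after_scale_out["pod-4"])
```

The new pod receives nothing from existing clients. It only picks up load as new clients arrive or old pins expire, which is why adding capacity during a spike helps less than the pod count suggests.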

The Baseline

Current load balancing configuration:

Component          Setting                     Problem
Algorithm          Round-robin (iptables)       Ignores backend health
Health check       GET /health → 200            Does not check dependencies
Sticky sessions    Cookie-based, 1hr TTL        3x traffic skew
Liveness probe     GET /health/live             Same as readiness
Readiness probe    GET /health/live             Does not check DB/Redis

Impact during incidents:

Incident                           Duration    Error Rate    Root Cause
GC pause on pod-7                  8s          8.3% (burst)  Round-robin to paused pod
DB pool exhaustion on pod-3        7 min       8.3%          Health check did not check DB
Rolling deploy with sticky         45s         12%           3x traffic redistribution

Target configuration:

Component          Setting                     Benefit
Algorithm          Least connections            Adapts to slow pods
Health check       Deep: DB + Redis + JVM       Detects real failures
Sticky sessions    Removed (state in Redis)     Even distribution
Liveness probe     JVM responsiveness only      No false restarts
Readiness probe    DB + Redis connectivity      Remove unhealthy pods from rotation

The Fix

Algorithm: least connections via an Ingress controller

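A Kubernetes Service has no least-connections option. In the default iptables mode, kube-proxy picks a backend at random with equal probability, which in aggregate spreads traffic as evenly as round-robin and is just as blind to slow pods. For least-connections, use an Ingress controller that supports it: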

# SCALED: Ingress with least-connections balancing
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: rider-api
  namespace: ridehailing
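  annotations:
    # Empty value leaves consistent hashing off.
    nginx.ingress.kubernetes.io/upstream-hash-by: ""
    # Assumes the installed controller accepts least_conn here. Recent
    # ingress-nginx releases document only round_robin and ewma for this
    # annotation, so check the docs for your controller version.
    nginx.ingress.kubernetes.io/load-balance: "least_conn"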
spec:
  rules:
    - host: rider-api.ridehailing.internal
      http:
        paths:
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: rider-api
                port:
                  number: 8080

Health checks: readiness vs liveness

The rider API needs two health endpoints with different purposes:

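// SCALED: Health endpoints that detect real problems
import java.sql.Connection;
import java.time.Duration;
import java.util.Map;
import javax.sql.DataSource;

import org.springframework.data.redis.connection.ReactiveRedisConnectionFactory;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;
import reactor.core.publisher.Mono;
import reactor.core.scheduler.Schedulers;

@RestController
public class HealthController {

    private final DataSource dataSource;
    private final ReactiveRedisConnectionFactory redisFactory;

    // Constructor injection: the final fields must be assigned exactly once
    public HealthController(DataSource dataSource,
                            ReactiveRedisConnectionFactory redisFactory) {
        this.dataSource = dataSource;
        this.redisFactory = redisFactory;
    }

    @GetMapping("/health/ready")
    public Mono<ResponseEntity<Map<String, Object>>> readiness() {
        // Blocking JDBC call, so run it off the event loop
        Mono<String> dbCheck = Mono.fromCallable(() -> {
            try (Connection conn = dataSource.getConnection()) {
                try (var stmt = conn.prepareStatement("SELECT 1")) {
                    stmt.setQueryTimeout(3);
                    stmt.executeQuery();
                    return "UP";
                }
            }
        }).subscribeOn(Schedulers.boundedElastic())
          // getConnection() can block on an exhausted pool for longer than
          // the probe timeout, so bound the whole check
          .timeout(Duration.ofSeconds(3))
          .onErrorReturn("DOWN");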

        Mono<String> redisCheck = redisFactory.getReactiveConnection()
            .ping()
            .map(pong -> "UP")
            .onErrorReturn("DOWN")
            .timeout(Duration.ofSeconds(2), Mono.just("DOWN"));

        return Mono.zip(dbCheck, redisCheck)
            .map(tuple -> {
                String db = tuple.getT1();
                String redis = tuple.getT2();
                boolean healthy = "UP".equals(db) && "UP".equals(redis);

                Map<String, Object> body = Map.of(
                    "status", healthy ? "UP" : "DOWN",
                    "postgres", db,
                    "redis", redis
                );

                return healthy
                    ? ResponseEntity.ok(body)
                    : ResponseEntity.status(503).body(body);
            });
    }

    @GetMapping("/health/live")
    public ResponseEntity<Map<String, String>> liveness() {
        // Liveness only checks JVM responsiveness
        // Do NOT check external dependencies here
        return ResponseEntity.ok(Map.of("status", "UP"));
    }
}

The readiness probe removes a pod from the Service’s endpoint list when it cannot reach PostgreSQL or Redis. Traffic stops routing to it. The pod stays running, and when the dependency recovers, the readiness probe passes again and traffic resumes.

The liveness probe only checks that the JVM is responsive. Never put dependency checks in the liveness probe. If PostgreSQL goes down, a liveness probe that checks PostgreSQL will restart every pod. Restarting does not fix a database outage. It makes it worse: every pod restarts simultaneously, connection pools surge on recovery, and the database falls over again.

# SCALED: Pod spec with correct probe configuration
spec:
  containers:
    - name: rider-api
      readinessProbe:
        httpGet:
          path: /health/ready
          port: 8080
        initialDelaySeconds: 15
        periodSeconds: 10
        timeoutSeconds: 5
        failureThreshold: 3
        successThreshold: 1
      livenessProbe:
        httpGet:
          path: /health/live
          port: 8080
        initialDelaySeconds: 30
        periodSeconds: 15
        timeoutSeconds: 3
        failureThreshold: 5
        successThreshold: 1

Readiness probe timing: check every 10 seconds, fail after 3 consecutive failures (30 seconds to remove from rotation), recover after 1 success. This avoids flapping during brief network hiccups.

Liveness probe timing: check every 15 seconds, fail after 5 consecutive failures (75 seconds before restart). The high failure threshold prevents restarts during transient issues. If the JVM is truly hung, 75 seconds to detect it is acceptable. An unnecessary restart during a GC pause is not.
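The two windows above are periodSeconds multiplied by failureThreshold. A small sketch of that arithmetic, plus what the readiness window costs in requests at the per-pod rate from the incidents. The function name is illustrative, and the result is an approximation: it ignores timeoutSeconds, where in the period the failure starts, and the time for endpoint changes to propagate.

```python
def detection_window_seconds(period_seconds: int, failure_threshold: int) -> int:
    """Approximate time from the first failed probe to the kubelet acting on it."""
    return period_seconds * failure_threshold

readiness_window = detection_window_seconds(period_seconds=10, failure_threshold=3)
liveness_window = detection_window_seconds(period_seconds=15, failure_threshold=5)

# Requests still routed to a failing pod before readiness removes it,
# at the ~430 RPS per-pod rate used earlier in the chapter.
requests_before_removal = 430 * readiness_window

print(readiness_window, liveness_window, requests_before_removal)
```

Under these assumptions a pod with an exhausted pool can still receive on the order of 12,900 requests before it leaves rotation. Lowering periodSeconds or failureThreshold shrinks that window, at the price of more flapping on brief hiccups.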

Eliminating sticky sessions

The rider API’s sticky session configuration exists because of legacy local session storage. After migrating to Redis (CH3), the stickiness is unnecessary overhead:

# BOTTLENECK: Sticky session configuration
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "RIDER_AFFINITY"
    nginx.ingress.kubernetes.io/session-cookie-max-age: "3600"

Remove the annotations. All session data lives in Redis. Any pod can serve any request.

# SCALED: No sticky sessions, state externalized to Redis
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    nginx.ingress.kubernetes.io/load-balance: "least_conn"
    # No affinity annotations

Locust: round-robin vs least-connections

# SCALED: Locust test comparing LB algorithms under heterogeneous latency
from locust import HttpUser, task, constant_throughput

class RiderApiUser(HttpUser):
    # 0.5 task runs/s per user, so 10,000 users target ~5,000 RPS in total
    wait_time = constant_throughput(0.5)

    @task(5)
    def fare_estimate(self):
        self.client.get("/api/rides/fare-estimate", params={
            "pickup_lat": 40.7128, "pickup_lng": -74.0060,
            "dropoff_lat": 40.7589, "dropoff_lng": -73.9851
        })

    @task(3)
    def nearby_drivers(self):
        self.client.get("/api/drivers/nearby", params={
            "lat": 40.7128, "lng": -74.0060, "radius_km": 2
        })

    @task(1)
    def request_ride(self):
        self.client.post("/api/rides/request", json={
            "rider_id": "rider-load-test",
            "pickup_lat": 40.7128, "pickup_lng": -74.0060,
            "dropoff_lat": 40.7589, "dropoff_lng": -73.9851,
            "ride_type": "standard"
        })

Run against both configurations at 5,000 RPS while one pod has 150-250ms of artificial latency injected (about 200ms on average, simulating GC pressure):

# Inject latency into pod-7
kubectl exec rider-api-pod-7 -- \
  curl -X POST localhost:8080/actuator/chaosmonkey/assaults \
  -H 'Content-Type: application/json' \
  -d '{"latencyActive": true, "latencyRangeStart": 150, "latencyRangeEnd": 250}'

# Run Locust
locust -f locust_lb_test.py \
  --host=https://rider-api.ridehailing.internal \
  --users 10000 --spawn-rate 500 \
  --run-time 300s --headless --csv=lb_comparison

The Proof

Results at 5,000 RPS with one degraded pod:

Metric              Round-Robin       Least-Conn        Delta
p50 latency         95ms              92ms              -3%
p99 latency         1,450ms           210ms             -86%
Max latency         3,200ms           480ms             -85%
Error rate          2.1%              0.08%             -96%
Degraded pod RPS    430 (equal)       85 (reduced)      -80%

Least connections reduced the degraded pod’s traffic from 430 RPS (roughly 1/12 of total, the same share as each healthy pod) to 85 RPS. The load balancer saw the higher connection count on that pod and routed new requests elsewhere. The p99 dropped from 1,450ms to 210ms because far fewer requests hit the slow pod.

After removing sticky sessions:

Metric                   Sticky Sessions   No Sticky      Delta
Max pod traffic          1,560 RPS (3x)    445 RPS (1x)   -71%
Min pod traffic          280 RPS           410 RPS        +46%
Rolling deploy errors    12%               0.3%           -97%

Traffic distribution went from a 5.6:1 ratio (busiest to quietest pod) to 1.1:1. Rolling deployments no longer cause error spikes because no single pod carries disproportionate load.

CH14-S1 covers algorithm internals and health check design in depth. CH14-S2 covers sticky session elimination, connection draining, and graceful shutdown during rolling deployments.