Deployment Latency: Connection Draining, Health Checks, and Warm-Up

The content platform deploys 4 times per day. Each deployment triggers a rolling update that replaces pods one at a time. During this transition, P99 latency spikes from 30ms to 500ms+. Three factors cause deployment latency: connection draining (existing requests on dying pods), cold JVM startup (no JIT compilation), and empty connection pools (cold downstream connections).

This section eliminates all three.

Connection Draining: Finishing In-Flight Requests

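When a pod is terminated, it must finish processing in-flight requests before shutting down. Kubernetes sends SIGTERM, waits up to terminationGracePeriodSeconds (30 seconds by default), and then sends SIGKILL. An application that does not drain gracefully either exits on SIGTERM and aborts its active requests immediately, or keeps running until SIGKILL aborts them: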

Deployment timeline without proper draining:
  t=0:    Kubernetes sends SIGTERM to old pod
  t=0:    Kubernetes removes pod from Service endpoints
  t=0:    Load balancer still has old endpoints cached (stale for 1-5s)
  t=0-5s: New requests still arrive at dying pod
  t=0-5s: Pod immediately stops accepting → 502 errors from proxy
  t=30s:  Kubernetes sends SIGKILL (default grace period)

Deployment timeline WITH proper draining:
  t=0:    Kubernetes begins terminating old pod and removes it from Service endpoints
  t=0-5s: Pod keeps serving while stale load balancer routes catch up
  t=5s:   Pod stops accepting NEW connections and starts draining
  t=5-30s: In-flight long requests complete normally
  t=30s:  Pod exits cleanly (all requests finished)

Spring Boot Graceful Shutdown

// application.yml: Enable graceful shutdown
// server:
//   shutdown: graceful
// spring:
//   lifecycle:
//     timeout-per-shutdown-phase: 30s

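// Programmatic graceful shutdown with connection draining.
// The handler is registered once through @Component; a separate @Bean method
// for the same class would register a duplicate bean and is not needed.
// (jakarta.annotation is for Spring Boot 3+; on 2.x use javax.annotation.)
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import jakarta.annotation.PreDestroy;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

@Component
public class GracefulShutdownHandler {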

    private static final Logger log = LoggerFactory.getLogger(GracefulShutdownHandler.class);
    private final AtomicBoolean shuttingDown = new AtomicBoolean(false);
    private final AtomicInteger activeRequests = new AtomicInteger(0);
    private final CountDownLatch drainComplete = new CountDownLatch(1);

    public boolean isShuttingDown() {
        return shuttingDown.get();
    }

    public void incrementActive() {
        activeRequests.incrementAndGet();
    }

    public void decrementActive() {
        int remaining = activeRequests.decrementAndGet();
        if (shuttingDown.get() && remaining == 0) {
            drainComplete.countDown();
        }
    }

    @PreDestroy
    public void shutdown() {
        log.info("SIGTERM received. Starting graceful drain. Active requests: {}",
            activeRequests.get());
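        shuttingDown.set(true);

        // If nothing is in flight, release the latch now. Otherwise await() would
        // block for the full 25s, because no request is left to count down to zero.
        if (activeRequests.get() == 0) {
            drainComplete.countDown();
        }

        // Wait for in-flight requests to complete (max 25s, leave 5s for cleanup)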
        try {
            boolean drained = drainComplete.await(25, TimeUnit.SECONDS);
            if (drained) {
                log.info("All requests drained successfully");
            } else {
                log.warn("Drain timeout. {} requests still active",
                    activeRequests.get());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            log.warn("Drain interrupted");
        }
    }
}

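// Filter that tracks active requests and rejects new ones during shutdown.
// Imports use jakarta.servlet (Spring Boot 3+); on Spring Boot 2.x use javax.servlet.
import java.io.IOException;
import jakarta.servlet.Filter;
import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.ServletRequest;
import jakarta.servlet.ServletResponse;
import jakarta.servlet.http.HttpServletResponse;
import org.springframework.core.Ordered;
import org.springframework.core.annotation.Order;
import org.springframework.stereotype.Component;

@Component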
@Order(Ordered.HIGHEST_PRECEDENCE)
public class DrainFilter implements Filter {

    private final GracefulShutdownHandler handler;

    public DrainFilter(GracefulShutdownHandler handler) {
        this.handler = handler;
    }

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        if (handler.isShuttingDown()) {
            HttpServletResponse response = (HttpServletResponse) res;
            response.setStatus(503);
            response.setHeader("Connection", "close");
            response.getWriter().write("Service shutting down");
            return;
        }

        handler.incrementActive();
        try {
            chain.doFilter(req, res);
        } finally {
            handler.decrementActive();
        }
    }
}
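
The drain logic above can be tried outside Spring. The following is a minimal sketch, not the platform's actual implementation: the class name DrainTracker and its methods are hypothetical. It shows the same counter-plus-latch idea, including the case where nothing is in flight when shutdown starts. A request increments the counter before it checks the shutdown flag, so a request that races with shutdown is either counted or rejected, never missed.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: tracks in-flight requests and signals when they have drained.
final class DrainTracker {
    private final AtomicBoolean shuttingDown = new AtomicBoolean(false);
    private final AtomicInteger active = new AtomicInteger(0);
    private final CountDownLatch drained = new CountDownLatch(1);

    /** Returns true if the request may proceed, false if the pod is draining. */
    boolean tryEnter() {
        // Count first, then check the flag, so shutdown cannot miss this request.
        active.incrementAndGet();
        if (shuttingDown.get()) {
            exit();
            return false;
        }
        return true;
    }

    /** Must be called exactly once for every tryEnter() that returned true. */
    void exit() {
        if (active.decrementAndGet() == 0 && shuttingDown.get()) {
            drained.countDown();
        }
    }

    /** Starts draining and waits up to timeoutMillis; true if everything finished. */
    boolean beginShutdownAndAwait(long timeoutMillis) {
        shuttingDown.set(true);
        if (active.get() == 0) {
            drained.countDown(); // nothing in flight: do not wait for the timeout
        }
        try {
            return drained.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    boolean isDrained() {
        return drained.getCount() == 0;
    }
}
```

A servlet filter would call tryEnter() before chain.doFilter(...) and exit() in a finally block, returning 503 when tryEnter() is false.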

Kubernetes Configuration for Zero-Downtime Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: article-service
spec:
  # selector, pod labels, and container image omitted for brevity
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0      # Never reduce below desired replicas
      maxSurge: 1            # Create new pod before killing old
  template:
    spec:
      terminationGracePeriodSeconds: 45  # Must be > preStop sleep (5s) + drain timeout (25s) + cleanup margin
      containers:
        - name: article-service
          lifecycle:
            preStop:
              exec:
                command: ["sh", "-c", "sleep 5"]
                # Kubernetes runs this hook BEFORE it sends SIGTERM to the container.
                # The 5s sleep gives the endpoint controller, kube-proxy, and any
                # load balancers time to remove this pod from rotation while it
                # is still serving. Without the sleep, traffic can arrive after
                # the application has already begun shutting down.
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 2
            failureThreshold: 1    # Remove from endpoints on first failure
            successThreshold: 2    # Require 2 successes before adding back
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
            failureThreshold: 3

The preStop sleep is critical. Without it, there is a race condition:

Race condition without preStop sleep:
  t=0.000s: Kubernetes marks the pod Terminating and sends SIGTERM
  t=0.001s: Application begins shutdown, stops accepting requests
  t=0.050s: Kubernetes endpoint controller updates Service endpoints
  t=0.050s: kube-proxy updates iptables rules
  t=0.100s: Nginx upstream config refreshed (if using DNS-based discovery)

  Between t=0.001s and t=0.100s: requests routed to dying pod → 502/503 errors
  (Times are illustrative; stale load balancer caches can stretch this window to 1-5s.)

With preStop sleep(5):
  t=0.000s: Kubernetes marks the pod Terminating, preStop hook runs (SIGTERM not yet sent)
  t=0.000-5.0s: Pod still accepts requests normally (sleep running)
  t=0.050s: Kubernetes removes pod from endpoints (happens during sleep)
  t=0.100s: All load balancers updated (no more new traffic to this pod)
  t=5.000s: preStop sleep finishes, Kubernetes sends SIGTERM
  t=5.001s: Application stops accepting new requests, drains existing
  t=5.001-30s: In-flight requests complete
  t=30s: Pod exits
  Errors: 0
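
The grace period has to cover the whole sequence above, because the terminationGracePeriodSeconds countdown starts when the pod is marked Terminating and therefore includes the preStop sleep. The arithmetic can be sketched as follows. The class and method names are hypothetical, and the 5s cleanup margin is the one mentioned in the drain handler's comment.

```java
// Hypothetical helper: checks that a termination grace period covers
// the preStop sleep, the drain timeout, and a cleanup margin.
final class GracePeriodBudget {
    private GracePeriodBudget() {}

    /** Seconds the pod needs between being marked Terminating and exiting cleanly. */
    static int requiredSeconds(int preStopSleep, int drainTimeout, int cleanupMargin) {
        return preStopSleep + drainTimeout + cleanupMargin;
    }

    /** True if SIGKILL cannot arrive before the drain and cleanup are finished. */
    static boolean fits(int terminationGracePeriod,
                        int preStopSleep, int drainTimeout, int cleanupMargin) {
        return requiredSeconds(preStopSleep, drainTimeout, cleanupMargin)
                <= terminationGracePeriod;
    }
}
```

With the values used in this section, requiredSeconds(5, 25, 5) is 35, so the configured 45s leaves 10s of headroom, while the Kubernetes default of 30s would not be enough.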

JVM Warm-Up: The First 90 Seconds

A freshly started JVM runs code in interpreter mode. The JIT compiler needs thousands of method invocations before it compiles hot paths. During this warm-up period, latency is 5-20x higher than steady state:

Content platform article service latency after fresh start:
  t=0-5s:    P50 = 180ms (interpreter mode, class loading)
  t=5-15s:   P50 = 85ms (C1 compiled, basic optimizations)
  t=15-45s:  P50 = 32ms (C2 compiling hot paths)
  t=45-90s:  P50 = 18ms (C2 complete, inlining stabilized)
  t=90s+:    P50 = 14ms (steady state, all optimizations applied)

Latency ratio: cold/warm = 180/14 ≈ 12.9x worse at startup
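
The same ratio can be computed for every phase in the table, which shows how quickly the penalty shrinks. This is a small illustrative sketch with hypothetical class and method names, using only the P50 figures listed above.

```java
// Hypothetical helper: expresses each warm-up phase as a multiple of steady state.
final class WarmupRatios {
    private WarmupRatios() {}

    /** Ratio of a phase's P50 latency to the steady-state P50 latency. */
    static double ratio(double phaseP50Ms, double steadyP50Ms) {
        if (steadyP50Ms <= 0) {
            throw new IllegalArgumentException("steady-state latency must be positive");
        }
        return phaseP50Ms / steadyP50Ms;
    }

    /** Rounds to one decimal place for display. */
    static double roundToTenth(double value) {
        return Math.round(value * 10.0) / 10.0;
    }
}
```

Against the 14ms steady state, the four earlier phases (180ms, 85ms, 32ms, 18ms) come out at roughly 12.9x, 6.1x, 2.3x, and 1.3x.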

Warm-Up Strategy: Synthetic Load Before Accepting Traffic

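// JVM warm-up: exercise hot paths with synthetic requests
// Run AFTER application context is ready, BEFORE readiness probe passes
import java.util.List;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.boot.context.event.ApplicationReadyEvent;
import org.springframework.context.event.EventListener;
import org.springframework.stereotype.Component;

@Component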
public class JvmWarmer {

    private static final Logger log = LoggerFactory.getLogger(JvmWarmer.class);

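    private final ArticleRepository articleRepository;
    private final SearchClient searchClient;
    private final ArticleRenderingService renderingService;
    // HealthController (see Health Check Optimization) owns the started/ready flags
    private final HealthController healthController;

    // Constructor injection: the final fields must be assigned here to compile
    public JvmWarmer(ArticleRepository articleRepository,
                     SearchClient searchClient,
                     ArticleRenderingService renderingService,
                     HealthController healthController) {
        this.articleRepository = articleRepository;
        this.searchClient = searchClient;
        this.renderingService = renderingService;
        this.healthController = healthController;
    }

    @EventListener(ApplicationReadyEvent.class)
    public void warmJvm() {
        // The process is up: let the liveness and startup probes pass.
        // Readiness stays false until warm-up finishes.
        healthController.markStarted();

        log.info("Starting JVM warm-up (exercising hot paths)");
        long start = System.nanoTime();

        // Phase 1: Warm class loading and basic JIT (C1)
        warmPhase1_classLoading();

        // Phase 2: Warm hot paths to trigger C2 compilation
        warmPhase2_hotPaths();

        // Phase 3: Warm connection pools (covered in CH24-S2)
        warmPhase3_connections();

        long elapsed = (System.nanoTime() - start) / 1_000_000;
        log.info("JVM warm-up completed in {}ms. Marking ready.", elapsed);
        healthController.markReady();
    }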

    private void warmPhase1_classLoading() {
        // Load all classes in the request path
        // This prevents class loading latency during real requests
        for (int i = 0; i < 100; i++) {
            try {
                articleRepository.findById("warmup-" + i);
            } catch (Exception ignored) {
                // Expected: warmup articles do not exist
            }
        }
    }

    private void warmPhase2_hotPaths() {
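        // Execute the full rendering path enough times to trigger C2.
        // With tiered compilation (the default), HotSpot's default thresholds are
        // roughly 200 invocations for C1 and 5,000 for C2.
        // -XX:CompileThreshold (default 10,000) applies only when tiered compilation is disabled.
        int iterations = 5000;
        List<String> sampleArticleIds = articleRepository.findRecentIds(10);
        if (sampleArticleIds.isEmpty()) {
            // Avoid a division by zero in the modulo below
            log.warn("No sample articles found; skipping hot-path warm-up");
            return;
        }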

        for (int i = 0; i < iterations; i++) {
            String articleId = sampleArticleIds.get(i % sampleArticleIds.size());
            try {
                // Exercise the full request path
                renderingService.renderArticle(articleId, "warmup-user");
            } catch (Exception ignored) {
                // Some downstream calls may fail; that is acceptable
            }
        }
    }

    private void warmPhase3_connections() {
        // Already covered in ConnectionPoolWarmer (CH24-S2)
        // Ensure search, recommendation, analytics, image connections are warm
    }
}
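
A fixed iteration count can overrun the startup window if a downstream dependency is slow during warm-up. One way to guard against that is to cap the loop with a time budget. The sketch below is an assumption-laden illustration rather than part of the article service: BoundedWarmup, Result, and run are hypothetical names, and the step stands in for a call such as renderingService.renderArticle(...).

```java
import java.util.function.IntConsumer;

// Hypothetical helper: runs a warm-up step up to N times or until a time budget expires.
final class BoundedWarmup {
    private BoundedWarmup() {}

    /** Outcome of a warm-up run. */
    static final class Result {
        final int attempted;   // iterations started
        final int failed;      // iterations that threw
        final boolean timedOut;

        Result(int attempted, int failed, boolean timedOut) {
            this.attempted = attempted;
            this.failed = failed;
            this.timedOut = timedOut;
        }
    }

    static Result run(int iterations, long budgetMillis, IntConsumer step) {
        long deadline = System.nanoTime() + budgetMillis * 1_000_000L;
        int attempted = 0;
        int failed = 0;
        for (int i = 0; i < iterations; i++) {
            // Overflow-safe deadline check, evaluated before each iteration
            if (System.nanoTime() - deadline >= 0) {
                return new Result(attempted, failed, true);
            }
            attempted++;
            try {
                step.accept(i);
            } catch (RuntimeException e) {
                failed++; // failures are tolerated during warm-up, but counted
            }
        }
        return new Result(attempted, failed, false);
    }
}
```

Logging the attempted and failed counts makes it visible when warm-up was cut short or when most synthetic requests failed, both of which mean the pod is colder than the readiness signal suggests.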

JVM Flags for Faster Warm-Up

# JVM startup flags for the content platform article service.
# Comments are listed first because a comment line placed between
# backslash-continued lines would break the command.
#
#   -XX:+TieredCompilation      Tiered compilation (already the default in modern JVMs)
#   -XX:CICompilerCount=4       Total number of JIT compiler threads (C1 and C2 combined)
#   -XX:SharedArchiveFile=...   Application class-data sharing (CDS) archive;
#                               reduces class loading time
#   -XX:+AlwaysPreTouch         Pre-touch memory pages (avoid page faults during request processing)
#
# -XX:CompileThreshold is omitted: it is ignored when tiered compilation is enabled.
# The tiered C2 threshold is -XX:Tier4InvocationThreshold (default 5000).
# -XX:SharedClassListFile is a dump-time option (see Step 2), not a runtime flag.
java \
  -XX:+TieredCompilation \
  -XX:CICompilerCount=4 \
  -XX:SharedArchiveFile=app-cds.jsa \
  -XX:+AlwaysPreTouch \
  -jar article-service.jar

# Step 1: Generate class list during warm-up run
java -XX:DumpLoadedClassList=classlist.txt \
     -jar article-service.jar --warmup-mode

# Step 2: Create shared archive from class list
java -Xshare:dump \
     -XX:SharedClassListFile=classlist.txt \
     -XX:SharedArchiveFile=app-cds.jsa \
     -jar article-service.jar

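# Step 3: Use shared archive in production
# (-Xshare:on aborts startup if the archive cannot be used; the default,
#  -Xshare:auto, falls back to normal class loading instead)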
java -Xshare:on \
     -XX:SharedArchiveFile=app-cds.jsa \
     -jar article-service.jar

# Impact on content platform startup:
#   Without CDS: class loading = 4.2s, total startup = 12s
#   With CDS:    class loading = 0.8s, total startup = 8.6s
#   Savings: 3.4s (28% faster startup)

Measuring Deployment Latency

# Locust script that continuously measures latency during deployment
# Run this alongside `kubectl rollout restart deployment/article-service`
from locust import HttpUser, task, between, events
import time
import csv

# Shared by all simulated users in this process
latency_log = []

class DeploymentLatencyMonitor(HttpUser):
    """Measures P99 latency during rolling deployment"""
    wait_time = between(0.01, 0.05)  # High frequency for accurate percentiles
    host = "http://content-platform.example.com"

    @task
    def fetch_article(self):
        start = time.perf_counter()
        response = self.client.get("/api/articles/12345",
                                   name="GET /api/articles/:id")
        elapsed_ms = (time.perf_counter() - start) * 1000
        latency_log.append({
            "timestamp": time.time(),
            "latency_ms": elapsed_ms,
            "status": response.status_code  # 0 if the connection itself failed
        })

        # Alert on deployment spike
        if elapsed_ms > 100:
            print(f"SPIKE: {elapsed_ms:.1f}ms at {time.strftime('%H:%M:%S')}")

# Module-level listener: runs once when Locust shuts down
@events.quitting.add_listener
def on_quitting(environment, **kwargs):
    """Save latency data for analysis"""
    with open("deployment_latency.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["timestamp", "latency_ms", "status"])
        writer.writeheader()
        writer.writerows(latency_log)
    print(f"Saved {len(latency_log)} measurements")

# Results during rolling deployment (3 pods, 1 at a time):
#
# WITHOUT warm-up and proper draining:
#   Pre-deploy P99:    30ms
#   During deploy P99: 520ms (cold JVM + connection pool miss)
#   Duration of spike:  90s (30s per pod * 3 pods)
#   502 errors:        12 (race condition, no preStop sleep)
#
# WITH full optimization (drain + preStop + JVM warm + pool warm):
#   Pre-deploy P99:    30ms
#   During deploy P99: 42ms (slight increase while pods are replaced; ready capacity is not reduced)
#   Duration of spike:  0s (no spike; new pods are warm before receiving traffic)
#   502 errors:        0
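
The P99 figures above are percentiles over the recorded latency_ms values. As an illustration of how such a figure can be derived, here is a nearest-rank percentile sketch. It is an assumption about the method, since the text does not say how the percentiles were computed, and the class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical helper: nearest-rank percentile over latency samples.
final class LatencyPercentiles {
    private LatencyPercentiles() {}

    /**
     * Returns the p-th percentile (1..100) using the nearest-rank method:
     * the smallest sample such that at least p% of samples are <= it.
     */
    static double percentile(double[] samplesMs, int p) {
        if (samplesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p < 1 || p > 100) {
            throw new IllegalArgumentException("p must be between 1 and 100");
        }
        double[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        // ceil(p * n / 100) in integer arithmetic, avoiding floating-point rounding
        int rank = (p * sorted.length + 99) / 100;
        return sorted[rank - 1];
    }
}
```

With few samples, P99 is simply the maximum, which is one reason the load script uses a high request rate: a 90-second deployment window needs thousands of samples for P99 to be meaningful.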

Rolling Deployment Strategy

# Optimized deployment for zero-latency-spike rolling updates:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: article-service
spec:
  # selector, pod labels, and container image omitted for brevity
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0    # Always maintain 3 ready pods
      maxSurge: 1          # Create 4th pod, warm it, then kill 1 old pod
  template:
    spec:
      terminationGracePeriodSeconds: 45
      containers:
        - name: article-service
          resources:
            requests:
              cpu: "2"       # Ensure warm-up has CPU for JIT compilation
              memory: "2Gi"
            limits:
              cpu: "4"       # Allow burst during warm-up
              memory: "2Gi"
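          env:
            # Assumes the container entrypoint passes $JAVA_OPTS to the java command.
            # -XX:CompileThreshold is not set: it is ignored under tiered compilation.
            - name: JAVA_OPTS
              value: >-
                -XX:+TieredCompilation
                -XX:CICompilerCount=4
                -XX:+AlwaysPreTouch
                -Xshare:on
                -XX:SharedArchiveFile=/app/app-cds.jsa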
          startupProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 2
            failureThreshold: 30    # Allow up to 65s for startup
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 0   # Start checking immediately after startup probe passes
            periodSeconds: 2
            failureThreshold: 1
            successThreshold: 2      # Must pass twice (prevent flapping)
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 0
            periodSeconds: 10
            failureThreshold: 3
          lifecycle:
            preStop:
              exec:
                command: ["sh", "-c", "sleep 5"]
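
The "up to 65s" comment on the startup probe follows from its three settings. A rough sketch of that calculation is below, with hypothetical class and method names; it approximates the worst case as the initial delay plus one probe period per allowed failure.

```java
// Hypothetical helper: approximate time a startup probe allows before the
// container is restarted.
final class ProbeBudget {
    private ProbeBudget() {}

    /** initialDelaySeconds + periodSeconds * failureThreshold. */
    static int startupBudgetSeconds(int initialDelaySeconds,
                                    int periodSeconds,
                                    int failureThreshold) {
        return initialDelaySeconds + periodSeconds * failureThreshold;
    }

    /** True if the expected time until the probe first succeeds fits the budget. */
    static boolean allows(int budgetSeconds, int expectedStartupSeconds) {
        return expectedStartupSeconds <= budgetSeconds;
    }
}
```

For the manifest above, startupBudgetSeconds(5, 2, 30) gives 65. The startup probe targets /health/live, so the budget only needs to cover the time until the liveness flag is set, not the full warm-up, which is gated separately by the readiness probe.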

Timeline: Optimized Deployment

t=0s:     kubectl rollout triggered. New pod (v2) created.
t=5s:     v2 JVM starts, class loading begins.
t=8s:     v2 application context ready. JVM warm-up starts.
t=8-20s:  v2 exercises hot paths (5000 iterations).
t=20-22s: v2 warms connection pools to downstream services.
t=22s:    v2 readiness probe passes. Added to Service endpoints.
t=23s:    Load balancer routes traffic to v2. Pod serves at full speed.
t=23s:    v1-pod1 marked Terminating. preStop sleep(5) begins.
t=28s:    v1-pod1 receives SIGTERM, stops accepting requests. Drains in-flight.
t=28-53s: v1-pod1 finishes remaining requests (25s drain limit).
t=53s:    v1-pod1 exits. Process repeats for v1-pod2, v1-pod3.

Total deployment time: ~90s (3 pods)
User-visible impact: no latency spike (P99 stays within 30-42ms), 0 errors
Capacity during deployment: never below 3 ready pods

Health Check Optimization

Health checks must distinguish three states: starting (not ready), running (ready), and draining (no longer ready):

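import java.lang.management.CompilationMXBean;
import java.lang.management.ManagementFactory;
import java.util.Map;
import java.util.concurrent.atomic.AtomicBoolean;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController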
public class HealthController {

    private final AtomicBoolean started = new AtomicBoolean(false);
    private final AtomicBoolean ready = new AtomicBoolean(false);
    private final AtomicBoolean draining = new AtomicBoolean(false);

    // Liveness: Is the process alive? Should Kubernetes restart it?
    @GetMapping("/health/live")
    public ResponseEntity<String> liveness() {
        if (!started.get()) {
            return ResponseEntity.status(503).body("starting");
        }
        return ResponseEntity.ok("alive");
    }

    // Readiness: Should traffic be sent to this pod?
    @GetMapping("/health/ready")
    public ResponseEntity<Map<String, Object>> readiness() {
        if (draining.get()) {
            return ResponseEntity.status(503).body(Map.of(
                "status", "draining",
                "message", "Pod is shutting down"
            ));
        }
        if (!ready.get()) {
            return ResponseEntity.status(503).body(Map.of(
                "status", "warming",
                "message", "JVM warm-up in progress"
            ));
        }
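        return ResponseEntity.ok(Map.of(
            "status", "ready",
            "jit_compile_time_ms", getJitCompilationTimeMs(),
            "connections_warm", getWarmConnectionCount()
        ));
    }

    // Lifecycle transitions.
    // markStarted: the process is up (must be called, or /health/live returns 503 forever)
    // markReady:   JVM warm-up and connection pool warm-up are complete
    // markDraining: shutdown has begun
    public void markStarted() { started.set(true); }
    public void markReady() { ready.set(true); }
    public void markDraining() { draining.set(true); ready.set(false); }

    // Cumulative time the JIT compiler has spent compiling, in milliseconds.
    // This is a coarse warm-up signal, not a count of compiled methods.
    private long getJitCompilationTimeMs() {
        CompilationMXBean compilation = ManagementFactory.getCompilationMXBean();
        if (compilation == null || !compilation.isCompilationTimeMonitoringSupported()) {
            return -1; // No JIT compiler, or compilation time is not tracked
        }
        return compilation.getTotalCompilationTime();
    }

    private int getWarmConnectionCount() {
        // Placeholder value. A real implementation would return the number of
        // established connections reported by the ConnectionPoolWarmer metrics.
        return 24;
    }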
}
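
Three independent booleans allow combinations that do not correspond to a real state, such as ready and draining at once. An alternative worth considering is a single state value with one-way transitions. The sketch below is a hypothetical variant, not the controller used by the platform; PodLifecycle and its methods are illustrative names, and the status codes mirror the endpoints above.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical alternative: one state value instead of three flags.
final class PodLifecycle {
    enum State { STARTING, WARMING, READY, DRAINING }

    private final AtomicReference<State> state = new AtomicReference<>(State.STARTING);

    /** STARTING -> WARMING. Returns false if the pod was not in STARTING. */
    boolean markStarted() {
        return state.compareAndSet(State.STARTING, State.WARMING);
    }

    /** WARMING -> READY. Returns false otherwise, so a draining pod never becomes ready again. */
    boolean markReady() {
        return state.compareAndSet(State.WARMING, State.READY);
    }

    /** Any state -> DRAINING. */
    void markDraining() {
        state.set(State.DRAINING);
    }

    State current() {
        return state.get();
    }

    /** HTTP status for /health/live: 503 only while the process is still starting. */
    int livenessStatus() {
        return state.get() == State.STARTING ? 503 : 200;
    }

    /** HTTP status for /health/ready: 200 only when warm and not draining. */
    int readinessStatus() {
        return state.get() == State.READY ? 200 : 503;
    }
}
```

A draining pod stays live (200 on liveness) so Kubernetes does not restart it mid-drain, but it reports 503 on readiness so no new traffic is routed to it.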

Summary: The Deployment Latency Checklist

Before deployment (zero-downtime requirements):
  ✓ preStop sleep(5) configured (prevents race condition)
  ✓ terminationGracePeriodSeconds > drain timeout + preStop sleep
  ✓ maxUnavailable: 0 (never reduce ready replicas)
  ✓ maxSurge: 1 (new pod ready before old pod dies)
  ✓ Graceful shutdown drains in-flight requests

During startup (eliminate cold-start penalty):
  ✓ DNS prefetched for all downstream services
  ✓ Connection pools warmed with health check requests
  ✓ JVM hot paths exercised (5000+ iterations)
  ✓ CDS archive loaded (3.4s startup savings)
  ✓ Readiness probe gates on warm-up completion

Steady state (maintain low latency):
  ✓ Connection max-lifetime rotates connections (DNS rebalancing)
  ✓ Stale connection detection enabled (validateAfterInactivity)
  ✓ Passive health checks detect backend failures in < 1ms
  ✓ Response buffering protects backends from slow clients

Result: P99 latency remains at 30-42ms throughout deployment.
No 502 errors. No cold-start spikes visible to users.

The content platform deploys 4 times daily with zero user-visible impact. The engineering cost is a startup-to-ready time of about 22 seconds per pod, roughly 14 seconds of which is JVM and connection warm-up, and all of it is hidden behind the readiness probe. Users never see a cold JVM.