The Sticky Session Trap and Connection Draining
The Symptom
The rider API runs 12 pods with cookie-based sticky sessions. The Nginx Ingress controller sets a RIDER_AFFINITY cookie with a 1-hour TTL, pinning each rider to a specific pod.
Pod traffic distribution during Friday evening peak:
Pod       RPS      % of Total    Active Sessions
pod-0     380      7.3%          1,200
pod-1     410      7.9%          1,340
pod-2     395      7.6%          1,280
pod-3     1,560    30.0%         5,100
pod-4     360      6.9%          1,180
pod-5     440      8.5%          1,420
pod-6     375      7.2%          1,210
pod-7     390      7.5%          1,260
pod-8     355      6.8%          1,150
pod-9     280      5.4%          920
pod-10    420      8.1%          1,370
pod-11    335      6.4%          1,090
Pod-3 handles 30% of all traffic. The other 11 pods share the remaining 70%. Pod-3’s CPU is at 68%. Its connection pool utilization is at 92%. Its p99 latency is 890ms while the fleet average is 165ms.
Why pod-3? The platform’s most active riders (frequent business travelers, daily commuters) generate disproportionate traffic. They open the app 40+ times per day. Their long-lived affinity cookies stick them to whichever pod they first landed on. Pod-3 accumulated a cluster of power users through random initial assignment. The sticky session cookie ensures they never redistribute.
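The effect of random initial assignment can be illustrated with a small simulation. This is only a sketch under stated assumptions: the user counts (20 power users, 180 casual users) and the daily opens per user (40 and 3) are hypothetical, and "least-loaded" is a simplified stand-in for least-connections routing.

```python
import random


def simulate(num_pods=12, power_users=20, casual_users=180, seed=7):
    """Compare per-pod request counts under sticky vs per-request routing."""
    rng = random.Random(seed)
    # Hypothetical daily opens: power users 40, casual users 3.
    users = [40] * power_users + [3] * casual_users

    # Sticky: each user is pinned to a random pod and every request follows.
    sticky = [0] * num_pods
    for opens in users:
        sticky[rng.randrange(num_pods)] += opens

    # Per-request: every request goes to the currently least-loaded pod.
    balanced = [0] * num_pods
    for _ in range(sum(users)):
        balanced[balanced.index(min(balanced))] += 1

    return sticky, balanced


def skew(counts):
    """Max/min ratio; the floor of 1 avoids dividing by zero for an idle pod."""
    return max(counts) / max(min(counts), 1)


sticky, balanced = simulate()
print(f"sticky skew:   {skew(sticky):.2f}:1")
print(f"balanced skew: {skew(balanced):.2f}:1")
```

Whatever the seed, the pinned assignment ends up more skewed than per-request routing, because a handful of heavy users landing on the same pod cannot be moved afterwards.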
During a rolling deployment, pod-3 restarts. Its 5,100 sessions lose their affinity target, because the cookie now points at a pod that no longer exists. The 1,560 RPS redistributes across the remaining 11 pods. Each pod’s load increases by ~142 RPS. Pod-4, which was handling 360 RPS, jumps to 502 RPS. Several pods cross their connection pool threshold. The error rate spikes to 12% for 45 seconds until the new pod-3 starts and begins accepting sessions.
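The redistribution arithmetic can be checked with a short sketch. It uses the RPS figures from the table above and assumes, as the text does, that the failed pod's traffic spreads evenly over the survivors; the `redistribute` helper is hypothetical.

```python
# Per-pod RPS from the Friday evening peak table.
rps = {
    "pod-0": 380, "pod-1": 410, "pod-2": 395, "pod-3": 1560,
    "pod-4": 360, "pod-5": 440, "pod-6": 375, "pod-7": 390,
    "pod-8": 355, "pod-9": 280, "pod-10": 420, "pod-11": 335,
}


def redistribute(rps_by_pod, failed):
    """Spread the failed pod's RPS evenly over the surviving pods."""
    survivors = {p: r for p, r in rps_by_pod.items() if p != failed}
    extra = rps_by_pod[failed] / len(survivors)
    return extra, {p: r + extra for p, r in survivors.items()}


extra, after = redistribute(rps, "pod-3")
print(f"extra per pod: {extra:.0f} RPS")           # 1560 / 11
print(f"pod-4 after:   {after['pod-4']:.0f} RPS")  # 360 + extra
```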
The Cause
Sticky sessions create a positive feedback loop. A pod that accumulates more sessions handles more traffic. More traffic means higher resource utilization. Higher utilization means longer response times. Longer response times mean connections are held longer. The pod cannot shed load because the affinity cookie bypasses the load balancing algorithm. Least connections would route traffic away from a busy pod, but the sticky session cookie overrides it.
The ride-hailing team added sticky sessions in month 2 of the project. The rider API stored authentication tokens and user preferences in an HttpSession object backed by an in-memory ConcurrentHashMap. Without sticky sessions, a rider’s second request might hit a different pod that had no session data. The rider would be forced to re-authenticate.
In month 5, the team migrated session storage to Redis (CH3). Every pod reads the same session data from the same Redis cluster. The sticky session configuration was not removed. Nobody tested what happens without it because the cookie was invisible. It kept working. It kept creating skew.
The second problem: connection draining during deployments. When Kubernetes terminates a pod during a rolling update, it sends SIGTERM. By default, the pod has 30 seconds (terminationGracePeriodSeconds) to shut down. If the pod does not implement graceful shutdown, SIGTERM kills the process immediately. In-flight requests receive a connection reset. The client sees a 502.
Spring Boot has supported graceful shutdown since version 2.3, but in releases before 3.4 it is disabled by default and must be enabled explicitly.
The Baseline
Impact of sticky sessions:
Metric                      With Sticky     Without Sticky    Delta
Max pod traffic ratio       5.6:1           1.1:1             -80%
p99 (busiest pod)           890ms           175ms             -80%
Connection pool max         92%             45%               -51%
Rolling deploy errors       12%             0.3%              -97%
Pod failure blast radius    30% of users    8.3% of users     -72%
Connection draining during rolling deployment:
Metric                      No Draining     With Draining     Delta
502 errors during deploy    847             12                -99%
Request failures            3.2%            0.04%             -99%
Deploy duration             45s             90s               +100%
In-flight request loss      ~430            ~3                -99%
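The Delta columns are plain percentage changes relative to the first value. A minimal sketch of that calculation, using three rows from the tables above (the `pct_change` helper is hypothetical):

```python
def pct_change(before, after):
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100


rows = {
    "p99 (busiest pod), ms": (890, 175),
    "502 errors during deploy": (847, 12),
    "Deploy duration, s": (45, 90),
}

for metric, (before, after) in rows.items():
    print(f"{metric}: {pct_change(before, after):+.0f}%")
```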
The Fix
Eliminating sticky sessions
Step 1: Verify all session state is in Redis. Query the codebase for local state:
grep -r "HttpSession\|@SessionScope\|SessionAttribute\|ConcurrentHashMap" \
    --include="*.java" src/main/java/
If any results appear, migrate that state to Redis before removing sticky sessions.
Step 2: Configure Spring Session with Redis:
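// SCALED: Redis-backed session configuration
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.redis.connection.RedisStandaloneConfiguration;
import org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory;
import org.springframework.session.data.redis.config.annotation.web.http.EnableRedisHttpSession;
import org.springframework.session.web.http.CookieSerializer;
import org.springframework.session.web.http.DefaultCookieSerializer;

@Configuration
@EnableRedisHttpSession(maxInactiveIntervalInSeconds = 3600) // sessions expire after 1 hour idle
public class SessionConfig {

    // Every pod connects to the same Redis instance, so any pod can serve any rider.
    @Bean
    public LettuceConnectionFactory connectionFactory() {
        RedisStandaloneConfiguration config =
            new RedisStandaloneConfiguration("redis.ridehailing.internal", 6379);
        return new LettuceConnectionFactory(config);
    }

    // Session cookie (identifies the session in Redis; it does not pin the rider to a pod).
    @Bean
    public CookieSerializer cookieSerializer() {
        DefaultCookieSerializer serializer = new DefaultCookieSerializer();
        serializer.setCookieName("RIDER_SESSION");
        serializer.setSameSite("Lax");
        serializer.setUseSecureCookie(true);
        return serializer;
    }
}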
Step 3: Remove sticky session annotations from the Ingress:
# BOTTLENECK: Remove these annotations
#   nginx.ingress.kubernetes.io/affinity: "cookie"
#   nginx.ingress.kubernetes.io/session-cookie-name: "RIDER_AFFINITY"
#   nginx.ingress.kubernetes.io/session-cookie-max-age: "3600"
# SCALED: Clean Ingress with least-connections
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: rider-api
  namespace: ridehailing
  annotations:
    # Verify this value against your controller version; recent ingress-nginx
    # releases may document only round_robin and ewma for this setting.
    nginx.ingress.kubernetes.io/load-balance: "least_conn"
spec:
  rules:
    - host: rider-api.ridehailing.internal
      http:
        paths:
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: rider-api
                port:
                  number: 8080
Step 4: Deploy and monitor. Watch the traffic distribution flatten across pods within minutes as riders’ next requests route via least-connections instead of cookie affinity.
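One way to watch the distribution flatten is to track the max/min RPS ratio across pods. The sketch below is illustrative: `traffic_skew` and `is_flat` are hypothetical helpers, the 1.5 threshold is an arbitrary assumption, and the input is the pre-fix peak table rather than live metrics.

```python
def traffic_skew(rps_by_pod):
    """Ratio of the busiest pod's RPS to the quietest pod's RPS."""
    values = list(rps_by_pod.values())
    return max(values) / min(values)


def is_flat(rps_by_pod, threshold=1.5):
    """True when the skew is at or below the (assumed) alert threshold."""
    return traffic_skew(rps_by_pod) <= threshold


# Per-pod RPS from the sticky-session peak table.
peak_rps = {
    "pod-0": 380, "pod-1": 410, "pod-2": 395, "pod-3": 1560,
    "pod-4": 360, "pod-5": 440, "pod-6": 375, "pod-7": 390,
    "pod-8": 355, "pod-9": 280, "pod-10": 420, "pod-11": 335,
}

print(f"skew with sticky sessions: {traffic_skew(peak_rps):.1f}:1")
print(f"flat enough: {is_flat(peak_rps)}")
```

In practice the per-pod RPS would come from the metrics system already in use; the calculation stays the same.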
Connection draining with preStop hook
When Kubernetes sends SIGTERM to a pod during a rolling update, three things must happen in order:
1. The pod is removed from the Service’s endpoint list (no new traffic)
2. In-flight requests complete
3. The pod shuts down
The problem: step 1 is asynchronous. Kubernetes updates the endpoint list, but the Ingress controller and kube-proxy take time to propagate the change. For 1-5 seconds after SIGTERM, new requests may still arrive at the terminating pod. If the pod shuts down immediately on SIGTERM, those requests get connection resets.
The preStop hook adds a delay:
# SCALED: Pod spec with connection draining
spec:
  terminationGracePeriodSeconds: 60
  containers:
    - name: rider-api
      lifecycle:
        preStop:
          exec:
            command: ["sh", "-c", "sleep 10"]
      ports:
        - containerPort: 8080
The timeline:
T+0s      Kubernetes marks the pod Terminating, starts removing it from endpoints
T+0s      preStop hook starts: sleep 10
T+1-5s    Endpoint propagation completes, no new traffic reaches pod
T+10s     preStop hook ends, kubelet sends SIGTERM to the application
T+10s     Spring Boot begins graceful shutdown
T+10-40s  In-flight requests complete, new requests rejected
T+40s     Application exits cleanly
T+60s     terminationGracePeriodSeconds expires (SIGKILL if still running)
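The grace period countdown starts when the pod is marked for termination, so it has to cover the preStop sleep and the application's own drain window together. A small sketch of that budget check follows; the `shutdown_budget` helper is hypothetical, and the numbers are the ones used in this section.

```python
def shutdown_budget(pre_stop_sleep_s, shutdown_phase_timeout_s, grace_period_s):
    """Return (seconds needed, headroom before the hard kill)."""
    needed = pre_stop_sleep_s + shutdown_phase_timeout_s
    return needed, grace_period_s - needed


needed, headroom = shutdown_budget(
    pre_stop_sleep_s=10,          # preStop: sleep 10
    shutdown_phase_timeout_s=30,  # timeout-per-shutdown-phase: 30s
    grace_period_s=60,            # terminationGracePeriodSeconds: 60
)
print(f"needed: {needed}s, headroom: {headroom}s")

# With the Kubernetes default of 30s, the same settings would not fit.
_, default_headroom = shutdown_budget(10, 30, 30)
print(f"headroom with the 30s default: {default_headroom}s")
```

Negative headroom means the container can be killed while requests are still draining.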
Spring Boot graceful shutdown
# SCALED: application.yml for graceful shutdown
server:
  shutdown: graceful
spring:
  lifecycle:
    timeout-per-shutdown-phase: 30s
management:
  endpoint:
    health:
      probes:
        enabled: true
When server.shutdown=graceful is set, Spring Boot does the following on SIGTERM:
- Stops accepting new connections
- Rejects new requests on existing connections (with a 503 on Undertow; Tomcat, Jetty, and Reactor Netty refuse them at the network layer)
- Waits for in-flight requests to complete (up to timeout-per-shutdown-phase)
- Stops the embedded web server (for Reactor Netty, closes the event loop)
- Destroys the Spring application context
The 30-second timeout handles requests that are mid-database-transaction. A ride request that is partway through fare calculation, driver matching, and payment authorization may take 5-10 seconds to complete. 30 seconds gives generous headroom.
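The drain-with-timeout idea can be sketched independently of any framework. The following is not Spring Boot's implementation, only a minimal illustration of the pattern: stop admitting requests, then wait a bounded time for in-flight ones. The `RequestDrainer` class and its method names are hypothetical.

```python
import threading


class RequestDrainer:
    """Tracks in-flight requests and blocks shutdown until they finish."""

    def __init__(self):
        self._cond = threading.Condition()
        self._in_flight = 0
        self._accepting = True

    def try_begin(self):
        """Admit a request unless shutdown has started."""
        with self._cond:
            if not self._accepting:
                return False
            self._in_flight += 1
            return True

    def end(self):
        """Mark one request as finished and wake any waiting shutdown."""
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def shutdown(self, timeout_s):
        """Stop admitting requests, then wait up to timeout_s for the rest.

        Returns True if all in-flight requests finished in time.
        """
        with self._cond:
            self._accepting = False
            return self._cond.wait_for(
                lambda: self._in_flight == 0, timeout=timeout_s
            )


drainer = RequestDrainer()
drainer.try_begin()                          # one request is in flight
threading.Timer(0.2, drainer.end).start()    # it finishes 200 ms later
drained = drainer.shutdown(timeout_s=2.0)    # blocks until it finishes
print("drained:", drained)
print("new request admitted:", drainer.try_begin())
```

If the timeout expires first, `shutdown` returns False and the caller has to decide whether to abandon the remaining requests, which is the situation the 30-second headroom is meant to avoid.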
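// SCALED: Graceful shutdown listener for cleanup
import org.springframework.context.event.ContextClosedEvent;
import org.springframework.context.event.EventListener;
import org.springframework.stereotype.Component;

@Component
public class ShutdownListener {
    private final DriverLocationSseController sseController;

    // Constructor injection; required because the field is final.
    public ShutdownListener(DriverLocationSseController sseController) {
        this.sseController = sseController;
    }

    // ContextClosedEvent is published at the start of context shutdown.
    @EventListener(ContextClosedEvent.class)
    public void onShutdown(ContextClosedEvent event) {
        // Close all SSE connections gracefully
        sseController.closeAllConnections();
        // Deregister from service discovery
        // Flush metrics buffers
    }
}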
Locust: rolling deployment error comparison
# SCALED: Locust test for rolling deployment error rate
from locust import HttpUser, task, between


class RollingDeployUser(HttpUser):
    wait_time = between(0.1, 0.3)

    @task
    def fare_estimate(self):
        with self.client.get(
            "/api/rides/fare-estimate",
            params={
                "pickup_lat": 40.7128, "pickup_lng": -74.0060,
                "dropoff_lat": 40.7589, "dropoff_lng": -73.9851,
            },
            catch_response=True,
        ) as response:
            # Label deploy-related failures explicitly; other statuses
            # keep Locust's default success/failure handling.
            if response.status_code == 502:
                response.failure("502 during rolling deploy")
            elif response.status_code == 503:
                response.failure("503 during rolling deploy")
Test procedure:
# Terminal 1: Start Locust
locust -f locust_deploy_test.py \
    --host=https://rider-api.ridehailing.internal \
    --users 5000 --spawn-rate 500 \
    --run-time 300s --headless --csv=deploy_test

# Terminal 2: Trigger rolling deployment 30 seconds after Locust starts
sleep 30 && kubectl set image deployment/rider-api \
    rider-api=rider-api:v2.1.0 --namespace=ridehailing
Results without graceful shutdown:
Time       Event                     RPS      Error Rate
T+0-30s    Baseline                  5,000    0.02%
T+30s      Deploy starts             5,000    0.02%
T+32s      First pod terminating     5,000    3.8%
T+35s      Second pod terminating    5,000    6.2%
T+38s      First new pod ready       5,000    4.1%
T+45s      Rolling continues         5,000    2.8%
T+75s      Deploy complete           5,000    0.02%
Results with preStop hook and graceful shutdown:
Time       Event                     RPS      Error Rate
T+0-30s    Baseline                  5,000    0.02%
T+30s      Deploy starts             5,000    0.02%
T+32s      First pod draining        5,000    0.03%
T+42s      First pod terminated      5,000    0.04%
T+45s      First new pod ready       5,000    0.03%
T+90s      Deploy complete           5,000    0.02%
The deployment finishes later (at T+90s instead of T+75s in this test run) because each pod spends 10 seconds in the preStop sleep plus up to 30 seconds draining connections. The error rate stays below 0.05% throughout. No connection resets. No 502s. The trade: 15 seconds of additional deployment time in this run buys a 99% reduction in deployment-related errors.
Rolling update strategy
# SCALED: Deployment with tuned rolling update
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rider-api
  namespace: ridehailing
spec:
  replicas: 12
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 0
  template:
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: rider-api
          lifecycle:
            preStop:
              exec:
                command: ["sh", "-c", "sleep 10"]
maxUnavailable: 0 ensures no capacity reduction during the deployment. At least 12 pods are always serving traffic. maxSurge: 25% allows up to 3 extra pods (15 total) during the transition. New pods start before old pods terminate. The cluster temporarily runs more pods than usual, consuming more resources, but the service never drops below full capacity.
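The pod-count bounds follow from how Kubernetes resolves percentages: maxSurge rounds up and maxUnavailable rounds down. A short sketch of that arithmetic (the `rollout_bounds` helper is hypothetical):

```python
import math


def rollout_bounds(replicas, max_surge_pct, max_unavailable_pct):
    """Return (max total pods, min available pods) during a rolling update."""
    surge = math.ceil(replicas * max_surge_pct / 100)               # rounds up
    unavailable = math.floor(replicas * max_unavailable_pct / 100)  # rounds down
    return replicas + surge, replicas - unavailable


max_pods, min_available = rollout_bounds(12, max_surge_pct=25, max_unavailable_pct=0)
print(f"max pods: {max_pods}, min available: {min_available}")
```

The upper bound is also the figure to check against spare cluster capacity before a deploy, since the surge pods need somewhere to schedule.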
The Proof
After removing sticky sessions and implementing connection draining:
Metric                        Before            After            Delta
Traffic skew (max/min pod)    5.6:1             1.1:1            -80%
p99 during normal ops         890ms (skewed)    175ms            -80%
Rolling deploy error rate     12%               0.04%            -99.7%
Pod failure blast radius      30% of users      8.3% of users    -72%
502 errors per deploy         847               3                -99.6%
Deploy duration               45s               90s              +100%
The deployment takes twice as long. That is the correct trade. 45 seconds of fast, error-prone deployments cost the platform 847 failed requests per deploy. With 4 deploys per day, that is 3,388 failed requests daily. At 90-second deploys with connection draining, the daily failure count drops to 12. The extra 45 seconds per deploy is invisible to users. The 847 errors per deploy were not.
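The daily totals above are simple multiplication, sketched here with the figures from the table and the stated 4 deploys per day (the `daily_total` helper is hypothetical):

```python
DEPLOYS_PER_DAY = 4


def daily_total(per_deploy, deploys_per_day=DEPLOYS_PER_DAY):
    """Scale a per-deploy figure to a full day of deploys."""
    return per_deploy * deploys_per_day


failures_before = daily_total(847)     # 502 errors per deploy, before
failures_after = daily_total(3)        # 502 errors per deploy, after
extra_seconds = daily_total(90 - 45)   # added deploy time per day

print(f"failed requests per day: {failures_before} -> {failures_after}")
print(f"extra deploy time per day: {extra_seconds}s")
```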