HPA, VPA, and Why CPU-Based Scaling Fails for I/O-Bound Services

The Symptom

The rider API deployment has CPU-based HPA configured with a target of 70%. During Friday evening surge, the Grafana dashboard shows a flat line at 18% CPU across all 3 pods. The HPA controller evaluates every 15 seconds, computes desiredReplicas = ceil(3 * (18 / 70)) = ceil(0.77) = 1, and decides the deployment is over-provisioned. It wants to scale down.

The service is handling 5,200 RPS across 3 pods. Connection pool exhaustion on PostgreSQL causes request queuing. The p99 climbs from 150ms to 4,200ms over 12 minutes. Riders see spinning loading screens. The HPA does nothing.

The on-call engineer manually scales to 12 pods with kubectl scale deployment rider-api --replicas=12. Latency drops to 200ms within 90 seconds. The engineer adds a TODO to fix the autoscaling configuration. The TODO stays open for 3 months, surviving through two more Friday evening incidents.

The Cause

HPA uses a simple formula:

desiredReplicas = ceil(currentReplicas * (currentMetricValue / desiredMetricValue))

For CPU-based scaling with a 70% target and current CPU at 18%:

desiredReplicas = ceil(3 * (18 / 70)) = ceil(0.77) = 1

HPA wants to scale down to 1 replica. The minReplicas floor of 3 prevents that. But HPA will never scale up because CPU will never approach 70%.
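
The arithmetic above can be reproduced with a small sketch. The function name and parameters are illustrative, not the controller's actual code. The 10% tolerance is the HPA controller's default, and maxReplicas=50 is an assumption borrowed from the request-rate manifest later in this section.

```python
import math

def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas, max_replicas, tolerance=0.1):
    """Sketch of the HPA replica calculation (illustrative, not controller code)."""
    ratio = current_value / target_value
    # Inside the tolerance band the controller leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    raw = math.ceil(current_replicas * ratio)
    # The raw result is clamped to the [minReplicas, maxReplicas] range.
    return max(min_replicas, min(max_replicas, raw))

# CPU at 18% against a 70% target: the raw answer is 1, the floor keeps it at 3.
print(desired_replicas(3, 18, 70, min_replicas=3, max_replicas=50))  # 3
```

The same function shows why scale-up never happens: the ratio would have to exceed 1.1, meaning average CPU above 77%, before the raw result rises above the current count.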

Why does CPU stay low? The rider API uses Spring WebFlux with Netty. Reactor Netty sizes its default event loop pool to Runtime.getRuntime().availableProcessors() threads, with a minimum of 4. On a 2-core pod, that is 4 event loop threads. These threads never block. They accept a request, dispatch the PostgreSQL query asynchronously, and immediately handle the next request. The CPU work per request is approximately:

JSON deserialization:     0.3ms
Route matching:           0.1ms
Request validation:       0.2ms
Response serialization:   0.4ms
Netty frame encoding:     0.2ms
Total CPU time/request:   1.2ms

The remaining 53ms of a typical request (55ms total wall clock) is I/O wait: PostgreSQL query (35ms), Redis lookup (8ms), network write (10ms). The event loop thread is free during that time, handling other requests.

At roughly 1,733 RPS per pod (5,200 / 3), those estimates would add up to 1,733 * 1.2ms ≈ 2,080ms of CPU time per second, about 2.08 cores. That is more than a 2-core pod can deliver and far above what was measured, so the per-request figures are best read as upper bounds rather than steady-state averages. What the HPA actually sees is CPU usage averaged over the metrics window and divided by the pod’s resources.requests.cpu, and during the incident that number was ~18%. Once the connection pool saturates, additional requests queue while waiting for a connection. Waiting costs almost no CPU, so utilization stays flat while latency climbs.

The correct metric is request throughput. When RPS per pod exceeds the capacity of the connection pools and event loop, latency degrades. For the rider API, that threshold is approximately 500 RPS per pod with current pool sizes (PostgreSQL: 20 connections, Redis: 50 connections).
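
The 500 RPS figure is close to what the PostgreSQL pool size implies. The sketch below is a back-of-the-envelope check using Little's law. It assumes one 35ms PostgreSQL query and one 8ms Redis lookup per request. The per-request call counts are assumptions, and the function name is hypothetical.

```python
def pool_throughput_ceiling(connections, seconds_per_call, calls_per_request=1):
    """Upper bound on requests/second a connection pool can sustain (Little's law)."""
    calls_per_second = connections / seconds_per_call
    return calls_per_second / calls_per_request

postgres_ceiling = pool_throughput_ceiling(connections=20, seconds_per_call=0.035)
redis_ceiling = pool_throughput_ceiling(connections=50, seconds_per_call=0.008)

print(f"PostgreSQL pool ceiling: {postgres_ceiling:.0f} RPS per pod")
print(f"Redis pool ceiling:      {redis_ceiling:.0f} RPS per pod")
```

Under these assumptions the PostgreSQL pool tops out near 571 RPS per pod, and the Redis pool is nowhere near its limit. A 500 RPS target leaves a little over 10% headroom below the PostgreSQL ceiling.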

The Baseline

HPA scaling algorithm behavior with CPU vs custom metrics:

Scenario            CPU-Based HPA          RPS-Based HPA
500 RPS (3 pods)    CPU 5%, no scale       167 RPS/pod, no scale
2000 RPS (3 pods)   CPU 12%, no scale      667 RPS/pod, scale to 4
5000 RPS (3 pods)   CPU 18%, no scale      1667 RPS/pod, scale to 10
10000 RPS (3 pods)  CPU 22%, no scale      3333 RPS/pod, scale to 20
10000 RPS (20 pods) N/A                    500 RPS/pod, stable

The CPU column shows why I/O-bound load is invisible to CPU-based HPA. Even at 10,000 RPS (roughly 2x the Friday peak), CPU barely reaches 22%. The service would return 503s long before CPU triggered a scale event.

The Fix

prometheus-adapter: bridging Prometheus metrics to Kubernetes HPA

Spring Boot Actuator exports http_server_requests_seconds_count to Prometheus. The prometheus-adapter converts this Prometheus metric into a Kubernetes custom metric that HPA can query:

# SCALED: prometheus-adapter ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-adapter-config
  namespace: monitoring
data:
  config.yaml: |
    rules:
    - seriesQuery: 'http_server_requests_seconds_count{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "^(.*)_seconds_count$"
        as: "${1}_per_second"
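      # Sum across uri/method/status label combinations so each pod yields one series.
      metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'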

The metricsQuery computes rate() over a 2-minute window. This smooths out per-second spikes. A 1-minute window is too noisy (a 5-second burst of 2,000 RPS would cause unnecessary scaling). A 5-minute window is too slow (a sustained increase from 500 to 2,000 RPS would take 5 minutes to register fully).
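
The smoothing effect can be approximated with simple averaging. This is a rough model of rate(), not the exact PromQL extrapolation. The 450 RPS baseline is a hypothetical value chosen for illustration, and the function name is made up.

```python
def windowed_rate(baseline_rps, burst_rps, burst_seconds, window_seconds):
    """Average RPS over a window that fully contains one burst."""
    burst_requests = burst_rps * burst_seconds
    baseline_requests = baseline_rps * (window_seconds - burst_seconds)
    return (burst_requests + baseline_requests) / window_seconds

# Hypothetical pod running at 450 RPS that sees a 5-second burst of 2,000 RPS.
for window in (60, 120, 300):
    print(f"{window:>3}s window: {windowed_rate(450, 2000, 5, window):.0f} RPS")
```

With these numbers the 1-minute window reads about 579 RPS, which is above the 500 target plus the HPA's default 10% tolerance. The 2-minute window reads about 515 and stays inside the tolerance band, so the burst alone does not trigger a scale-up.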

Verify the custom metric is available:

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/ridehailing/pods/*/http_server_requests_per_second" | jq .

HPA manifest with custom metrics

# SCALED: HPA for rider-api on request rate
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: rider-api-hpa
  namespace: ridehailing
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: rider-api
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_server_requests_per_second
        target:
          type: AverageValue
          averageValue: "500"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
        - type: Pods
          value: 5
          periodSeconds: 60
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
      selectPolicy: Min

Scale-up uses selectPolicy: Max, so each 60-second period allows the larger of “double the current count” or “add 5 pods.” These policies are ceilings on how fast the HPA may grow, not guaranteed increments. At low replica counts (3 pods), the HPA may add up to 5 pods instead of being capped at 3 (100% of 3). At high replica counts (20 pods), it may add up to 20 (100%) instead of just 5.

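Scale-down removes at most 10% of the current pods per 60-second period, and only after the lower recommendation has held for the 300-second stabilization window. With a single policy defined, selectPolicy: Min changes nothing today. It ensures that if a second policy is added later, the more restrictive one applies. Conservative. A traffic dip during a bathroom break at a concert does not mean the surge is over.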
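
To see how the two scale-up policies interact, the sketch below treats each policy as a cap on growth per 60-second period and always takes the larger cap. It is a simplification: the real controller measures growth against the replica count at the start of each period and also applies the stabilization window. The function names are hypothetical.

```python
import math

def scale_up_limit(current, percent=100, pods=5):
    """Largest replica count allowed after one period under selectPolicy: Max."""
    by_percent = current + math.ceil(current * percent / 100)
    by_pods = current + pods
    return max(by_percent, by_pods)

def ramp(start, desired):
    """Replica counts after each period until the desired count is reached."""
    steps, current = [start], start
    while current < desired:
        current = min(desired, scale_up_limit(current))
        steps.append(current)
    return steps

print(ramp(3, 20))   # [3, 8, 16, 20]
print(ramp(20, 50))  # [20, 40, 50]
```

Starting from 3 pods, the Pods policy wins the first period (3 to 8) and the Percent policy wins afterwards. Reaching 20 pods takes three periods under this model.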

VPA for the surge pricing calculator

The surge pricing calculator loads a zone graph into memory. Each zone has pricing coefficients, demand multipliers, and historical baselines. During normal hours, the graph has ~200 active zones consuming 380Mi. During Friday peak, 800+ zones activate, and the graph grows to 1.4Gi.

The deployment has resources.limits.memory: 512Mi. When the graph grows past 512Mi, the JVM’s garbage collector thrashes, then the kernel OOMKills the pod. The pod restarts, reloads the graph (which has already grown), and gets OOMKilled again. A restart loop.

# BOTTLENECK: Fixed memory limits for variable workload
apiVersion: apps/v1
kind: Deployment
metadata:
  name: surge-pricing-calc
spec:
  template:
    spec:
      containers:
        - name: surge-pricing-calc
          resources:
            requests:
              memory: "512Mi"
            limits:
              memory: "512Mi" # OOMKilled during peak

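VPA fixes this by observing actual memory consumption and adjusting the container’s memory request. The limit is scaled along with it, preserving the original request-to-limit ratio (1:1 here):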

# SCALED: VPA for surge pricing calculator
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: surge-pricing-vpa
  namespace: ridehailing
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: surge-pricing-calc
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: surge-pricing-calc
        minAllowed:
          memory: "512Mi"
          cpu: "250m"
        maxAllowed:
          memory: "4Gi"
          cpu: "2"
        controlledResources: ["memory"]

VPA in “Auto” mode evicts pods and recreates them with updated resource requests. This means a brief disruption. For the surge pricing calculator running 2 replicas, VPA evicts one at a time, so at least 1 replica serves traffic during the adjustment.

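Do not let VPA in “Auto” mode and HPA act on the same resource metric (CPU or memory) for the same deployment. They conflict: HPA responds to rising utilization by adding pods, VPA responds by resizing them, and each change shifts the metric the other is reacting to. Where HPA scales on CPU or memory, run VPA in “Off” or “Initial” mode and read the recommendations manually: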

kubectl describe vpa surge-pricing-vpa -n ridehailing | grep -A 20 "Recommendation"

Scaling speed: the hidden cost

The time from HPA detecting the need to scale to the new pod serving traffic:

Step                        Duration     Cumulative (worst case)
Metric scrape interval      15s          15s
HPA evaluation interval     15s          30s
Stabilization window        30s          60s
Pod scheduling              2-5s         65s
Image pull (cached)         1-3s         68s
  or image pull (uncached)  15-45s       110s
JVM startup                 8-12s        122s
Spring context init         5-8s         130s
Readiness probe passes      10-30s       160s

Worst case, with an uncached image: 160 seconds from metric breach to new pod serving traffic. During those 160 seconds, the existing pods absorb the excess load. This is why minReplicas: 3 is not optional. With fewer than 3 pods, a sudden spike has almost no headroom while HPA ramps up.

Optimize each step:

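# SCALED: Multi-stage build with layered JVM image
# The "build" stage that produces /app/target/rider-api.jar is not shown.
FROM eclipse-temurin:21-jre-alpine AS extract
WORKDIR /app
COPY --from=build /app/target/rider-api.jar app.jar
# Split the fat jar into Spring Boot layers
RUN java -Djarmode=layertools -jar app.jar extract

FROM eclipse-temurin:21-jre-alpine AS runtime
WORKDIR /app
# One COPY per layer, least frequently changed first, so a node that has
# cached the dependency layers only pulls the application layer
COPY --from=extract /app/dependencies/ ./
COPY --from=extract /app/spring-boot-loader/ ./
COPY --from=extract /app/snapshot-dependencies/ ./
COPY --from=extract /app/application/ ./

# JarLauncher moved to org.springframework.boot.loader.launch in Spring Boot 3.2;
# older versions use org.springframework.boot.loader.JarLauncher
ENTRYPOINT ["java", \
  "-XX:+UseG1GC", \
  "-XX:MaxRAMPercentage=75.0", \
  "-XX:+TieredCompilation", \
  "-XX:TieredStopAtLevel=1", \
  "-Dspring.main.lazy-initialization=true", \
  "org.springframework.boot.loader.launch.JarLauncher"]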

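-XX:TieredStopAtLevel=1 restricts the JIT to the C1 compiler and disables C2 for the lifetime of the JVM, reducing JVM startup from 12s to 6s. The trade-off is lower peak throughput, because hot paths never receive C2 optimization. For a service that spends most of each request waiting on I/O, that cost is small. -Dspring.main.lazy-initialization=true defers bean creation until first use, cutting Spring context initialization from 8s to 3s. In exchange, the first requests are slower, and configuration errors surface at first use rather than at startup.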

With these optimizations, the scaling timeline drops:

Step                        Duration     Cumulative
Metric + HPA + stabilize    60s          60s
Pod scheduling              2s           62s
Image pull (cached)         1s           63s
JVM startup (optimized)     6s           69s
Spring context (lazy)       3s           72s
Readiness probe             10s          82s

82 seconds. Still not instant. The minReplicas floor and the scale-up aggressiveness in the HPA behavior block exist to cover this gap.

The Proof

After switching from CPU-based HPA to request-rate HPA with prometheus-adapter:

Metric                       CPU-based HPA     RPS-based HPA     Delta
First scale event (5k RPS)   Never             T+45s             Fixed
Pods at peak (10k RPS)       3 (never scaled)  24                +700%
p99 at peak                  4,200ms           185ms             -96%
Error rate at peak           3.2%              0.02%             -99%
Manual interventions/month   3                 0                 -100%

VPA results for the surge pricing calculator:

Metric                  Fixed limits     VPA Auto       Delta
Memory limit            512Mi            1.8Gi (auto)   +260%
OOMKilled events/week   4                0              -100%
p99 during surge        1,800ms          220ms          -88%

The HPA now reacts to actual service pressure instead of a metric that does not correlate with load. The engineer who added the TODO three months ago closes the ticket.