CPU Throttling: The Silent Latency Killer
The main chapter showed CFS bandwidth control throttling the article service for 40ms per burst, creating P99 spikes at 180ms. This section goes deeper: how CFS period and quota interact, why the default 100ms period is wrong for latency-sensitive services, how to read throttling statistics from the cgroup filesystem, and the concrete Kubernetes configuration that eliminated throttling for the content platform.
CFS Bandwidth Control Internals
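The Linux Completely Fair Scheduler (CFS) enforces CPU bandwidth limits with two parameters, shown here under their cgroup v1 names. In cgroup v2 both live in a single file, cpu.max, written as "quota period":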
cpu.cfs_period_us: The time window (microseconds). Default: 100000 (100ms).
cpu.cfs_quota_us: CPU time allowed per period (microseconds).
Relationship to Kubernetes limits:
limits.cpu: "2" → quota = 2 × period = 200000us (200ms per 100ms period)
limits.cpu: "500m" → quota = 0.5 × period = 50000us (50ms per 100ms period)
limits.cpu: "4" → quota = 4 × period = 400000us (400ms per 100ms period)
Key rule: Quota is consumed by ALL threads in the cgroup combined.
8 threads each using 25ms of CPU in a period = 200ms quota consumed.
1 thread using 200ms of CPU in a period = 200ms quota consumed.
Both hit the same limit.
This is the core problem for the JVM. The JVM runs many threads concurrently: application threads, GC threads, JIT compiler threads, Netty I/O threads. During a GC pause, all GC threads run simultaneously, each on its own core when the host has cores free. Quota is consumed in proportion to the number of running threads, not wall-clock time: 25 busy threads burn 25ms of quota for every 1ms that passes.
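The quota arithmetic above can be sketched in a few lines of Java. This is an illustrative helper, not part of the article service: the class and method names are hypothetical, and it assumes every busy thread runs on its own core.

```java
// Sketch: convert a Kubernetes CPU limit into a CFS quota and estimate how
// quickly concurrently running threads drain it. All names are illustrative.
final class CfsQuotaMath {

    /** Quota in microseconds for a limit in millicores ("500m" = 500, "2" = 2000). */
    static long quotaUs(long limitMillicores, long periodUs) {
        return limitMillicores * periodUs / 1000;
    }

    /**
     * Wall-clock milliseconds until the quota is exhausted when busyThreads
     * threads run continuously, each on its own core (an assumption).
     */
    static double msUntilThrottled(long quotaUs, int busyThreads) {
        return (quotaUs / 1000.0) / busyThreads;
    }

    public static void main(String[] args) {
        long periodUs = 100_000; // default CFS period (100ms)
        System.out.println("limits.cpu 2    -> quota " + quotaUs(2000, periodUs) + "us");
        System.out.println("limits.cpu 500m -> quota " + quotaUs(500, periodUs) + "us");
        System.out.println("limits.cpu 4    -> quota " + quotaUs(4000, periodUs) + "us");
        // 8 busy threads drain a 200ms quota in 25ms of wall-clock time.
        System.out.println("8 threads exhaust 2 cores of quota after "
            + msUntilThrottled(quotaUs(2000, periodUs), 8) + "ms");
    }
}
```

The second method is the one that matters for latency: the more threads run at once, the earlier in the period the cgroup is frozen.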
Symptom: P99 Spikes at Low Average CPU
The article service processes search queries. Average response time is 12ms. Average CPU is 35%. Operations sees no problem.
Then the P99 SLO breach alert fires. P99 has risen from 30ms to 180ms. No deployment. No traffic change. No dependency slowdown.
Cause: GC Bursts Exhausting CFS Quota
Timeline of a single P99 spike:
t=0ms:      Period starts. Quota: 200ms available.
t=0-18ms:   Application threads serve 4 requests (8 threads × 18ms = 144ms quota)
t=18ms:     G1GC triggers Young Collection
t=18-38ms:  GC pause. 25 ParallelGCThreads active (JVM saw 32 host cores)
            CPU consumed: 25 threads × 20ms = 500ms
            But quota remaining was only 56ms (200ms - 144ms)
            Quota exhausted at t=20.2ms (56ms / 25 threads = 2.2ms into GC)
t=20.2ms:   CFS THROTTLES the cgroup. All threads frozen.
            GC is mid-pause. Application threads frozen. I/O threads frozen.
t=100ms:    New period starts. 200ms quota refilled.
t=100ms:    GC resumes with remaining work (~18ms of 20ms pause remaining)
            25 threads × 18ms = 450ms quota needed
            Quota exhausted again at ~108ms
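t=200ms:    Third period. ~10ms of the pause remains (25 threads × 10ms = 250ms quota needed)
            Quota exhausted again at ~208ms
t=300ms:    Fourth period. GC finishes at ~302ms.
            Application threads resume.
Total wall time for a 20ms GC pause: ~284ms
Request that arrived at t=15ms waited: ~287ms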
This is the P99 spike.
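A timeline like this can be replayed with a small simulation. The following is a simplified sketch under stated assumptions: every thread runs on its own core, nothing else in the cgroup consumes quota during the burst, and unused quota does not carry over between periods. The class and method names are hypothetical, so treat the output as an estimate rather than a measurement.

```java
// Sketch: replay a CPU burst against CFS period/quota accounting.
final class ThrottleTimeline {

    /**
     * Returns the wall-clock time (ms) at which the burst finishes.
     *
     * @param startMs          time the burst starts
     * @param quotaLeftMs      quota still available in the current period
     * @param quotaMs          quota granted at each period boundary (must be > 0)
     * @param periodMs         CFS period length
     * @param threads          threads running in parallel (must be > 0)
     * @param workPerThreadMs  unthrottled duration of the burst
     */
    static double finishTimeMs(double startMs, double quotaLeftMs, double quotaMs,
                               double periodMs, int threads, double workPerThreadMs) {
        double now = startMs;
        double remaining = workPerThreadMs;
        double quota = quotaLeftMs;
        while (true) {
            double boundary = (Math.floor(now / periodMs) + 1) * periodMs;
            // Threads run until the quota is gone or the period ends.
            double canRun = Math.min(quota / threads, boundary - now);
            if (remaining <= canRun) {
                return now + remaining;
            }
            remaining -= canRun;
            now = boundary;   // throttled (or still running) until the boundary
            quota = quotaMs;  // quota refills at each period boundary
        }
    }

    public static void main(String[] args) {
        // 20ms pause, 25 GC threads, 56ms of quota left at t=18ms.
        System.out.printf("25 GC threads finish at t=%.1fms%n",
            finishTimeMs(18, 56, 200, 100, 25, 20));
        // Same starting point with 4 GC threads and a 28ms pause.
        System.out.printf("4 GC threads finish at t=%.1fms%n",
            finishTimeMs(18, 56, 200, 100, 4, 28));
    }
}
```

Varying `threads` while holding the other inputs fixed shows how quickly extra GC parallelism turns into whole periods of stall.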
Benchmark: Measuring the Throttle
# Step 1: Record throttling stats before load test
cat /sys/fs/cgroup/cpu.stat > /tmp/before.txt
# Step 2: Run load test (wrk2 at constant 5000 RPS for 60 seconds)
wrk2 -t4 -c200 -d60s -R5000 --latency "http://localhost:8080/api/search?q=java"
# Step 3: Record throttling stats after
cat /sys/fs/cgroup/cpu.stat > /tmp/after.txt
# Step 4: Calculate throttling rate
diff /tmp/before.txt /tmp/after.txt
Content platform article service results:
Before load test:
nr_periods 1000000
nr_throttled 28500
throttled_usec 412500000
After load test (60 seconds later):
nr_periods 1000600 (+600 periods = 60 seconds)
nr_throttled 28519 (+19 throttled periods)
throttled_usec 413250000 (+750ms total throttle time)
Throttle rate during test: 19/600 = 3.2% of periods
Average throttle duration: 750ms / 19 = 39.5ms per event
Estimated P99 impact: 39.5ms added latency on ~3.2% of requests
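The same calculation can be automated instead of reading the diff by hand. The sketch below assumes the two snapshot files contain raw cpu.stat output ("key value" per line); the class and method names are illustrative.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Sketch: compare two cpu.stat snapshots and derive throttle rate and duration.
final class CpuStatDiff {

    /** Parses "key value" lines as found in cgroup v2 cpu.stat. */
    static Map<String, Long> parse(String text) {
        Map<String, Long> stats = new HashMap<>();
        for (String line : text.split("\\R")) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2) {
                stats.put(parts[0], Long.parseLong(parts[1]));
            }
        }
        return stats;
    }

    private static long delta(Map<String, Long> before, Map<String, Long> after, String key) {
        return after.getOrDefault(key, 0L) - before.getOrDefault(key, 0L);
    }

    /** Fraction of elapsed periods in which the cgroup was throttled. */
    static double throttleRate(Map<String, Long> before, Map<String, Long> after) {
        long periods = delta(before, after, "nr_periods");
        return periods <= 0 ? 0.0 : (double) delta(before, after, "nr_throttled") / periods;
    }

    /** Average throttled time per throttled period, in milliseconds. */
    static double avgThrottleMs(Map<String, Long> before, Map<String, Long> after) {
        long events = delta(before, after, "nr_throttled");
        return events <= 0 ? 0.0 : delta(before, after, "throttled_usec") / 1000.0 / events;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.out.println("usage: java CpuStatDiff.java /tmp/before.txt /tmp/after.txt");
            return;
        }
        Map<String, Long> before = parse(Files.readString(Path.of(args[0])));
        Map<String, Long> after = parse(Files.readString(Path.of(args[1])));
        System.out.printf("Throttle rate: %.1f%% of periods%n", 100 * throttleRate(before, after));
        System.out.printf("Average throttle: %.1fms per event%n", avgThrottleMs(before, after));
    }
}
```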
Continuous Monitoring
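import io.prometheus.client.Counter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

// Prometheus metric exporter for CFS throttling
// Reads cgroup v2 cpu.stat every 5 seconds (scheduling must be enabled, e.g. @EnableScheduling)
@Component
public class CfsThrottleMetrics {

    // Exported so the alert's denominator carries the same labels as the numerator
    private final Counter periods = Counter.build()
        .name("container_cpu_cfs_periods_total")
        .help("Number of elapsed CFS enforcement periods")
        .register();

    private final Counter throttledPeriods = Counter.build()
        .name("container_cpu_throttled_periods_total")
        .help("Number of CFS periods where CPU was throttled")
        .register();

    private final Counter throttledTime = Counter.build()
        .name("container_cpu_throttled_seconds_total")
        .help("Total time CPU was throttled in seconds")
        .register();

    private boolean hasBaseline = false;
    private long lastPeriods = 0;
    private long lastThrottledCount = 0;
    private long lastThrottledUsec = 0;

    @Scheduled(fixedRate = 5000)
    public void collectThrottleMetrics() {
        try {
            // cgroup v2 path
            Map<String, Long> stats = parseCpuStat("/sys/fs/cgroup/cpu.stat");
            long currentPeriods = stats.getOrDefault("nr_periods", 0L);
            long currentThrottled = stats.getOrDefault("nr_throttled", 0L);
            long currentThrottledUsec = stats.getOrDefault("throttled_usec", 0L);

            // The first sample only sets the baseline. Deltas are clamped at
            // zero because Counter.inc() rejects negative values.
            if (hasBaseline) {
                periods.inc(Math.max(0, currentPeriods - lastPeriods));
                throttledPeriods.inc(Math.max(0, currentThrottled - lastThrottledCount));
                throttledTime.inc(
                    Math.max(0, currentThrottledUsec - lastThrottledUsec) / 1_000_000.0
                );
            }
            hasBaseline = true;
            lastPeriods = currentPeriods;
            lastThrottledCount = currentThrottled;
            lastThrottledUsec = currentThrottledUsec;
        } catch (IOException e) {
            // cgroup filesystem not available (not in container)
        }
    }

    private Map<String, Long> parseCpuStat(String path) throws IOException {
        Map<String, Long> stats = new HashMap<>();
        for (String line : Files.readAllLines(Path.of(path))) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2) {
                stats.put(parts[0], Long.parseLong(parts[1]));
            }
        }
        return stats;
    }
}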
Alert rule:
# Prometheus alert: throttling exceeding 5% of periods
groups:
  - name: container-cpu
    rules:
      - alert: CpuThrottlingHigh
        expr: |
          rate(container_cpu_throttled_periods_total[5m])
            / rate(container_cpu_cfs_periods_total[5m]) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Container CPU throttling > 5%"
          description: "{{ $labels.pod }} throttled {{ $value | humanizePercentage }} of periods"
Fix 1: Reduce GC Thread Count
The main chapter showed the fix: set ParallelGCThreads and ConcGCThreads to match the container CPU limit, not the host core count. Here is the detailed benchmark:
# Test matrix: GC thread count vs throttling and latency
# Setup: article-service, 2-core container, 4GB memory, 500MB live heap
# Load: wrk2 at 5000 RPS constant, 60 seconds
# Test 1: Default GC threads (JVM sees 32 host cores)
java -Xmx2g -jar article-service.jar
# ParallelGCThreads=25, ConcGCThreads=6
# Results:
# P50: 12.1ms P99: 178ms P99.9: 312ms
# Throttled periods: 3.2% Avg throttle: 39ms
# GC pause avg: 18ms GC pause max: 35ms
# Test 2: GC threads = CPU limit (2)
java -Xmx2g -XX:ParallelGCThreads=2 -XX:ConcGCThreads=1 \
-jar article-service.jar
# Results:
# P50: 12.3ms P99: 52ms P99.9: 78ms
# Throttled periods: 0.1% Avg throttle: 8ms
# GC pause avg: 42ms GC pause max: 65ms
# Test 3: GC threads = 2× CPU limit (4)
java -Xmx2g -XX:ParallelGCThreads=4 -XX:ConcGCThreads=2 \
-jar article-service.jar
# Results:
# P50: 12.2ms P99: 38ms P99.9: 62ms
# Throttled periods: 0.4% Avg throttle: 12ms
# GC pause avg: 28ms GC pause max: 48ms
Results summary:
GC Threads      GC Pause   Throttle%   P99     P99.9
─────────────────────────────────────────────────────
25 (default)    18ms       3.2%        178ms   312ms
4 (2× limit)    28ms       0.4%        38ms    62ms
2 (= limit)     42ms       0.1%        52ms    78ms
Best choice: 4 GC threads (2× CPU limit)
- GC pauses stay short (28ms vs 42ms)
- Throttling nearly eliminated (0.4%)
- P99 is lowest at 38ms
- P99.9 is lowest at 62ms
The sweet spot is 2x the CPU limit, not 1x. At 1x (2 threads), GC pauses are too long (42ms) and the P99 is dominated by pause time rather than throttling. At 2x (4 threads), GC pauses are shorter (28ms) and the small amount of throttling (0.4%) adds less than the reduced GC pause saves.
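The 2× heuristic can be captured in a small helper that derives the flags from the container's CPU limit and compares them with what the running JVM actually selected. This is a sketch: the class name is hypothetical, setting ConcGCThreads to half of ParallelGCThreads is an assumption that mirrors the pairing used in Test 3, and the flag lookup relies on `HotSpotDiagnosticMXBean`, so it only works on HotSpot-based JVMs.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Sketch: derive GC thread flags from a CPU limit using the 2x heuristic.
final class GcThreadFlags {

    /** ParallelGCThreads = 2 x limit; ConcGCThreads = half of that (assumption). */
    static String forCpuLimit(int cpuLimitCores) {
        int parallel = Math.max(1, 2 * cpuLimitCores);
        int concurrent = Math.max(1, parallel / 2);
        return "-XX:ParallelGCThreads=" + parallel + " -XX:ConcGCThreads=" + concurrent;
    }

    /** Reads the value the running HotSpot JVM is using for a flag. */
    static String currentValue(String flag) {
        HotSpotDiagnosticMXBean bean =
            ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        return bean.getVMOption(flag).getValue();
    }

    public static void main(String[] args) {
        System.out.println("Suggested for a 2-core limit: " + forCpuLimit(2));
        System.out.println("availableProcessors = " + Runtime.getRuntime().availableProcessors());
        System.out.println("ParallelGCThreads   = " + currentValue("ParallelGCThreads"));
        System.out.println("ConcGCThreads       = " + currentValue("ConcGCThreads"));
    }
}
```

Running the `main` method inside the container is a quick way to confirm whether the JVM sized its GC threads from the host core count or from the container limit.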
Fix 2: Increase the CPU Limit (Allow Bursting)
If GC thread tuning is insufficient, increase the CPU limit to accommodate bursts:
# Before: tight limit triggers throttling
resources:
  requests:
    cpu: "1"
  limits:
    cpu: "2"
# After: higher limit allows GC bursts
resources:
  requests:
    cpu: "1"
  limits:
    cpu: "4"
Impact on throttling:
Limit=2 cores (200ms quota):
GC burst (4 threads × 28ms) = 112ms quota
App threads (8 threads × 10ms) = 80ms quota
Total: 192ms — barely fits, any variance causes throttle
Limit=4 cores (400ms quota):
GC burst (4 threads × 28ms) = 112ms quota
App threads (8 threads × 10ms) = 80ms quota
Total: 192ms — 208ms headroom, no throttle
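The headroom figures above follow from a one-line budget check, sketched here with hypothetical names. It assumes the GC burst and the application work fall into the same CFS period, which is the worst case for quota.

```java
// Sketch: quota headroom in a single CFS period for a combined GC + app burst.
final class BurstBudget {

    /**
     * Returns quota minus demand in milliseconds of CPU time.
     * A negative result means the cgroup is throttled in that period.
     */
    static double headroomMs(double limitCores, double periodMs,
                             int gcThreads, double gcMs,
                             int appThreads, double appMs) {
        double quotaMs = limitCores * periodMs;
        double demandMs = gcThreads * gcMs + appThreads * appMs;
        return quotaMs - demandMs;
    }

    public static void main(String[] args) {
        // 4 GC threads x 28ms plus 8 app threads x 10ms, default 100ms period.
        System.out.println("limit=2 headroom: " + headroomMs(2, 100, 4, 28, 8, 10) + "ms");
        System.out.println("limit=4 headroom: " + headroomMs(4, 100, 4, 28, 8, 10) + "ms");
    }
}
```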
Trade-off: A higher limit does not change pod placement, because the scheduler places pods by requests and requests stay at 1 core. What it changes is contention risk: if several pods on the same node burst toward their limits at once, they compete for the same cores. The limit only matters during bursts, which happen a few percent of the time.
Fix 3: Remove CPU Limits
The most aggressive fix. Set no CPU limit and rely on requests for scheduling:
resources:
  requests:
    cpu: "1"
    memory: "4Gi"
  limits:
    # No cpu limit
    memory: "4Gi"
# Verification: no CFS quota enforcement
cat /sys/fs/cgroup/cpu.max
# max 100000
# "max" means no quota (unlimited)
# Throttle stats will show zero growth:
cat /sys/fs/cgroup/cpu.stat
# nr_throttled 0
# throttled_usec 0
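The same verification can run from inside the service, for example as a startup log line. The sketch below parses the "quota period" format of cpu.max; the class and method names are illustrative, and it assumes a cgroup v2 host where the file is visible at the path used above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.OptionalDouble;

// Sketch: report the effective CPU limit from cgroup v2 cpu.max.
final class CpuMaxReader {

    /** Returns the limit in cores, or empty when the quota is "max" (no limit). */
    static OptionalDouble limitCores(String cpuMax) {
        String[] parts = cpuMax.trim().split("\\s+");
        if (parts.length != 2) {
            throw new IllegalArgumentException("unexpected cpu.max content: " + cpuMax);
        }
        if (parts[0].equals("max")) {
            return OptionalDouble.empty();
        }
        return OptionalDouble.of(Double.parseDouble(parts[0]) / Double.parseDouble(parts[1]));
    }

    public static void main(String[] args) throws IOException {
        Path path = Path.of("/sys/fs/cgroup/cpu.max");
        if (!Files.exists(path)) {
            System.out.println("cpu.max not found (not a cgroup v2 container?)");
            return;
        }
        OptionalDouble limit = limitCores(Files.readString(path));
        System.out.println(limit.isPresent()
            ? "CPU limit: " + limit.getAsDouble() + " cores"
            : "No CPU limit (quota is max)");
    }
}
```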
Content platform results after removing CPU limits:
Metric           With 2-core limit     No CPU limit   Change
──────────────────────────────────────────────────────────────
P50              12.1ms                11.8ms         -2.5%
P99              178ms                 27ms           -84.8%
P99.9            312ms                 45ms           -85.6%
Throttle rate    3.2%                  0%             eliminated
Avg CPU usage    0.7 cores             0.7 cores      unchanged
Peak CPU usage   2.0 cores (capped)    3.8 cores      burst allowed
GC pause avg     18ms                  15ms           -16.7%
GC pause max     35ms                  22ms           -37.1%
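P99 dropped from 178ms to 27ms. Peak CPU bursts to 3.8 cores during GC, but only for about 15ms at a time. Average CPU is unchanged. The node is not overloaded because the scheduler places pods by their requests, and requests are sized to average usage, so sustained demand stays within node capacity and only short bursts borrow idle cores.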
Fix 4: Tune the CFS Period
For workloads where CPU limits are required (multi-tenant clusters, cost allocation), reducing the CFS period reduces the maximum throttle duration:
Default period: 100ms, quota 200ms (2 cores)
Worst case throttle: up to 100ms (entire remaining period)
Reduced period: 10ms, quota 20ms (still 2 cores)
Worst case throttle: up to 10ms (shorter period = shorter max throttle)
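But: a GC burst that needs 112ms of CPU time takes at least 6 periods (112ms / 20ms quota per period = 5.6)
With 4 GC threads, each period's 20ms quota is consumed in ~5ms, followed by a ~5ms stall.
The burst becomes a series of short stalls instead of one long freeze.
Bursts that stay within 20ms of CPU per 10ms window are not throttled at all.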
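The upper bound on a single stall can be estimated from the period, the limit, and the number of threads running at once. This is a rough model with hypothetical names: it assumes the threads stay busy for the whole period and each has its own core, so measured stalls will usually be shorter than the bound.

```java
// Sketch: worst-case single stall per CFS period for continuously busy threads.
final class PeriodStall {

    /**
     * The threads run for quota / threads milliseconds, then wait for the
     * next period boundary. Returns 0 when the quota covers the whole period.
     */
    static double worstStallMs(double periodMs, double limitCores, int threads) {
        double quotaMs = limitCores * periodMs;
        double runMs = quotaMs / threads;
        return Math.max(0.0, periodMs - runMs);
    }

    public static void main(String[] args) {
        double[] periods = {100, 50, 20, 10, 5};
        for (double period : periods) {
            System.out.printf("period=%.0fms  stall bound with 25 threads: %.1fms,"
                    + " with 4 threads: %.1fms%n",
                period, worstStallMs(period, 2, 25), worstStallMs(period, 2, 4));
        }
    }
}
```

The bound scales linearly with the period, which is why shortening the period caps the damage of any single throttle event even when the total amount of throttling stays the same.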
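# Set CFS period to 10ms (requires host-level access or kubelet configuration)
# In Kubernetes, set node-wide via kubelet --cpu-cfs-quota-period=10ms
# (requires the CustomCPUCFSQuotaPeriod feature gate)
# Or per-container via cgroup v2 (a manual write is lost when the container is recreated):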
echo 20000 10000 > /sys/fs/cgroup/cpu.max
# Format: quota_us period_us
# 20000us quota per 10000us period = 2 cores
Period tuning benchmark (2-core limit, article service):
Period   Max Throttle   P99     P99.9
─────────────────────────────────────────
100ms    39ms           178ms   312ms
50ms     22ms           95ms    155ms
20ms     11ms           52ms    88ms
10ms     6ms            38ms    65ms
5ms      3ms            31ms    52ms
Trade-off: Shorter periods increase scheduler overhead. At 5ms periods, the kernel refills and re-checks the quota 200 times per second per cgroup instead of 10. On hosts with hundreds of containers, this adds measurable kernel CPU overhead (0.5-1% of a core per container). The 10-20ms range provides the best balance for latency-sensitive services.
Requests vs Limits: The Complete Decision Framework
Decision matrix for CPU resource configuration:
Workload Type              Requests    Limits      Rationale
──────────────────────────────────────────────────────────────────────────
Latency-sensitive Java     avg usage   none        GC/JIT need burst headroom
Batch processing           avg usage   2× avg      prevent runaway, no latency SLO
Sidecar (envoy, fluentd)   measured    measured    predictable, no bursts
CronJob                    peak        peak        runs briefly, needs all resources
Multi-tenant (billing)     avg usage   allocated   CPU limit tied to cost allocation
Content platform services:
article-service: requests=1, limits=none (latency-critical)
search-indexer: requests=2, limits=4 (batch, bounded burst)
nginx-proxy: requests=200m, limits=500m (predictable, low burst)
recommendation: requests=1, limits=none (latency-critical)
analytics-writer: requests=500m, limits=1 (background, bounded)
Proof: Before and After
The content platform article service running in production for 7 days before and after removing CPU limits and tuning GC threads:
Before (2-core limit, default GC threads):
P50: 12ms (stable)
P99: 45-210ms (fluctuates with GC frequency)
P99.9: 180-450ms
Error rate: 0.02% (OOM-adjacent restarts)
CFS throttled periods: 2.8-4.1% (varies with traffic)
Pod restarts/day: 0.3 (occasional OOM kill)
After (no CPU limit, ParallelGCThreads=4, ConcGCThreads=2):
P50: 11.8ms (stable)
P99: 22-30ms (stable, only GC pause)
P99.9: 35-55ms
Error rate: 0.001%
CFS throttled periods: 0%
Pod restarts/day: 0
P99 improvement: 2-7× reduction (7× at the worst case, 210ms → 30ms)
P99.9 improvement: 5-8× reduction
Stability: throttle-related variance eliminated entirely
The configuration that achieved this:
# Final JVM flags for article-service container
java \
  -XX:+UseContainerSupport \
  -XX:ActiveProcessorCount=2 \
  -XX:ParallelGCThreads=4 \
  -XX:ConcGCThreads=2 \
  -XX:CICompilerCount=2 \
  -XX:MaxRAMPercentage=50.0 \
  -XX:MaxMetaspaceSize=256m \
  -XX:ReservedCodeCacheSize=128m \
  -XX:MaxDirectMemorySize=256m \
  -Xss512k \
  -Xlog:gc*:file=/var/log/gc.log:time,uptime,level,tags:filecount=5,filesize=10m \
  -jar article-service.jar
# Kubernetes resource spec
resources:
  requests:
    cpu: "1"
    memory: "4Gi"
  limits:
    memory: "4Gi"
    # No CPU limit