QoS Classes and the Scheduling Decision
The main chapter introduced the three QoS classes and their eviction ordering. This section dissects the mechanics: how the kubelet determines QoS class assignment, the exact eviction algorithm it follows under memory pressure, how OOM scoring interacts with cgroup limits, and the scheduling consequences of each configuration pattern. The content platform uses all three classes for different workload tiers, and the difference between a correct and incorrect assignment is the difference between a graceful degradation and an outage.
QoS Assignment Rules
The kubelet assigns QoS class at pod admission. The rules are strict and there is no override:
QoS class assignment logic (for each pod):
1. For EVERY container in the pod (including init containers):
- Does it have cpu AND memory requests?
- Does it have cpu AND memory limits?
- Do requests == limits for both cpu AND memory?
2. If ALL containers satisfy condition (1): QoS = Guaranteed
3. If at least one container has a request or limit
for cpu or memory: QoS = Burstable
4. If NO container has any request or limit: QoS = BestEffort
The critical subtlety is that a single container without matching requests and limits downgrades the entire pod. A pod with three containers where two are perfectly configured and one has a missing memory limit is Burstable, not Guaranteed.
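The assignment rules can be sketched as a small function. This is a minimal illustration, not the kubelet's code: the `Container` type and `qos_class` name are hypothetical, and quantities are compared as literal strings, so `"1"` and `"1000m"` would wrongly count as different here.

```python
from dataclasses import dataclass, field

RESOURCES = ("cpu", "memory")


@dataclass
class Container:
    name: str
    requests: dict = field(default_factory=dict)
    limits: dict = field(default_factory=dict)


def qos_class(containers):
    """Classify a pod from its containers (app and init containers alike)."""
    # No request or limit anywhere in the pod: BestEffort.
    if not any(c.requests or c.limits for c in containers):
        return "BestEffort"
    for c in containers:
        for res in RESOURCES:
            limit = c.limits.get(res)
            # An omitted request defaults to the limit, so only the limit is mandatory.
            request = c.requests.get(res, limit)
            if limit is None or request != limit:
                # One non-matching container downgrades the whole pod.
                return "Burstable"
    return "Guaranteed"
```

Passing a fully specified container together with a sidecar that has requests but no limits returns `"Burstable"`, which is the downgrade described above.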
# SLOW: accidentally Burstable
# The sidecar container has no limits, downgrading the entire pod
apiVersion: v1
kind: Pod
metadata:
name: article-service
spec:
containers:
- name: article-service
resources:
requests:
cpu: "2"
memory: "4Gi"
limits:
cpu: "2"
memory: "4Gi"
- name: envoy-sidecar
resources:
requests:
cpu: "100m"
memory: "128Mi"
# Missing limits: entire pod becomes Burstable
# FAST: Guaranteed with all containers configured
apiVersion: v1
kind: Pod
metadata:
name: article-service
spec:
containers:
- name: article-service
resources:
requests:
cpu: "2"
memory: "4Gi"
limits:
cpu: "2"
memory: "4Gi"
- name: envoy-sidecar
resources:
requests:
cpu: "100m"
memory: "128Mi"
limits:
cpu: "100m" # Matches request
memory: "128Mi" # Matches request
The content platform caught this pattern during a production incident. The article service was running as Burstable for three weeks because a logging sidecar had resources: {}. During a memory pressure event, the kubelet evicted the article service before the analytics batch jobs, which were also Burstable but had lower proportional memory usage. The fix was two lines of YAML. The debugging took four hours.
Default Resource Behavior
When a namespace has a LimitRange, Kubernetes applies default requests and limits to containers that omit them. This creates a trap:
# LimitRange in the production namespace
apiVersion: v1
kind: LimitRange
metadata:
name: default-limits
namespace: production
spec:
limits:
- default:
cpu: "1"
memory: "512Mi"
defaultRequest:
cpu: "100m"
memory: "128Mi"
type: Container
A container with no resource spec inherits requests.cpu=100m, limits.cpu=1. Requests do not equal limits, so the pod becomes Burstable. This is usually the correct default, but it means that claiming a pod is Guaranteed requires explicit verification, not assumption.
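The defaulting can be modeled in a few lines to see why the result is Burstable. This is a simplified sketch under stated assumptions: `apply_defaults` and `requests_equal_limits` are hypothetical helpers, the values mirror the LimitRange above, and min/max constraints are ignored.

```python
LIMIT_RANGE = {
    "default": {"cpu": "1", "memory": "512Mi"},
    "defaultRequest": {"cpu": "100m", "memory": "128Mi"},
}


def apply_defaults(requests, limits, limit_range=LIMIT_RANGE):
    """Fill in values a container omitted; explicit values are left untouched."""
    out_req, out_lim = dict(requests), dict(limits)
    for res in ("cpu", "memory"):
        if res not in out_lim:
            out_lim[res] = limit_range["default"][res]
            if res not in out_req:
                out_req[res] = limit_range["defaultRequest"][res]
        elif res not in out_req:
            # An explicit limit without a request: the request follows the limit.
            out_req[res] = out_lim[res]
    return out_req, out_lim


def requests_equal_limits(requests, limits):
    """True only when the container has the shape a Guaranteed pod requires."""
    return all(requests.get(r) == limits.get(r) for r in ("cpu", "memory"))
```

An empty spec comes back with `cpu` request `100m` and limit `1`, so the check fails, while a container that sets only limits keeps matching requests and passes.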
# Verify QoS class of running pods
kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.qosClass}{"\n"}{end}'
# Output for the content platform:
# article-service-7d8b9c4f5-x2k9p Guaranteed
# search-api-5f6a7b8c9-m3n4o Guaranteed
# analytics-pipeline-8e9f0a1b-p5q6r Burstable
# log-collector-2c3d4e5f-s7t8u BestEffort
The Eviction Algorithm
The kubelet runs an eviction loop that monitors node resource usage. When a resource crosses its eviction threshold, the kubelet begins killing pods. The algorithm has specific ordering rules that interact with QoS class, actual resource usage, and pod priority.
Eviction Signals and Thresholds
Eviction signals monitored by kubelet:
Signal               Description                           Default Threshold
memory.available     Available memory on node              100Mi
nodefs.available     Available disk on root filesystem     10%
nodefs.inodesFree    Available inodes on root fs           5%
imagefs.available    Available disk for container images   15%
pid.available        Available process IDs                 (none)
The kubelet supports soft and hard eviction thresholds:
kubelet \
--eviction-hard="memory.available<100Mi,nodefs.available<10%" \
--eviction-soft="memory.available<500Mi,nodefs.available<15%" \
--eviction-soft-grace-period="memory.available=2m,nodefs.available=2m"
Hard thresholds trigger immediate eviction with no grace period. Soft thresholds start a grace period timer. If the condition persists beyond the grace period, eviction proceeds.
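The difference between the two threshold types comes down to whether a timer sits between the breach and the eviction. The following is a minimal sketch of that logic only; `SoftThreshold` and `hard_should_evict` are hypothetical names, and the real eviction manager tracks considerably more state.

```python
from dataclasses import dataclass
from typing import Optional

MI = 1024 * 1024


@dataclass
class SoftThreshold:
    limit_bytes: int
    grace_seconds: float
    breached_since: Optional[float] = None

    def should_evict(self, available_bytes, now):
        """Evict only once the breach has lasted the whole grace period."""
        if available_bytes >= self.limit_bytes:
            self.breached_since = None  # condition cleared, timer resets
            return False
        if self.breached_since is None:
            self.breached_since = now  # first observation of the breach
        return now - self.breached_since >= self.grace_seconds


def hard_should_evict(available_bytes, limit_bytes=100 * MI):
    """Hard thresholds have no timer: any breach means evict now."""
    return available_bytes < limit_bytes
```

With `memory.available<500Mi` and a 2m grace period, a node sitting at 400Mi triggers nothing for the first two minutes, and a brief recovery above 500Mi restarts the clock.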
Eviction Ordering
When the kubelet decides to evict, it ranks pods using this algorithm:
Eviction order (first evicted to last):
1. BestEffort and Burstable pods whose memory usage exceeds their request
   (BestEffort pods request nothing, so any usage counts as exceeding)
   Ranked by: pod priority (lowest first),
   then by usage above request (largest first)
2. Burstable pods within their memory request, and Guaranteed pods
   Ranked by: pod priority (lowest first)
   Evicted last, only after every pod in tier 1 is gone
Within each tier, Kubernetes considers PriorityClass. A Burstable pod with priority 1000 survives longer than a Burstable pod with priority 0, even if the high-priority pod uses more memory. The content platform assigns explicit priority classes:
# Priority class for latency-sensitive services
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: latency-critical
value: 1000000
globalDefault: false
description: "For services where eviction causes user-visible degradation"
---
# Priority class for batch workloads
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: batch-processing
value: 100000
globalDefault: false
description: "For workloads that tolerate restarts"
---
# Default priority for everything else
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: default
value: 0
globalDefault: true
description: "Default priority"
# Article service with priority class
apiVersion: apps/v1
kind: Deployment
metadata:
name: article-service
spec:
template:
spec:
priorityClassName: latency-critical
containers:
- name: article-service
resources:
requests:
cpu: "2"
memory: "4Gi"
limits:
cpu: "2"
memory: "4Gi"
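Priority class interacts with QoS in ways that can surprise. The kubelet does not read the QoS class itself when ranking pods for eviction: it first separates pods whose usage exceeds their requests from pods within their requests, then orders by priority, then by usage above request. QoS matters because it determines which group a pod lands in. A BestEffort pod with priority 1000000 is evicted before a Burstable pod with priority 0 only while the Burstable pod stays within its memory request. Once the Burstable pod exceeds its request, both are in the same group and the lower priority goes first.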
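The ranking can be expressed as a three-part sort key. This sketch follows the ordering given in the Kubernetes node-pressure eviction documentation (exceeds request, then priority, then usage above request); the `PodUsage` type and the numbers in any example are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PodUsage:
    name: str
    priority: int
    memory_request: int  # bytes; 0 for BestEffort pods
    memory_usage: int    # bytes currently in use


def eviction_order(pods):
    """Return pods sorted from first evicted to last evicted."""
    def key(p):
        exceeds = p.memory_usage > p.memory_request
        overage = p.memory_usage - p.memory_request
        # False sorts before True, so pods exceeding their request come first,
        # then lower priority, then the largest usage above request.
        return (not exceeds, p.priority, -overage)
    return sorted(pods, key=key)
```

A Guaranteed pod cannot use more than its request without hitting its own limit, so it always falls into the last group regardless of priority.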
OOM Kill Mechanics
When the Linux kernel’s OOM killer activates, it selects a process based on the oom_score computed from two factors: the process’s memory consumption and the oom_score_adj set by the kubelet.
OOM score computation:
Base score: Proportional to process RSS as fraction of total memory
(0 = no memory, 1000 = all memory)
Adjustment: oom_score_adj added to base score
Range: -1000 to 1000
Final score: base_score + oom_score_adj
Process with highest final score gets killed
Kubelet settings by QoS:
Guaranteed: oom_score_adj = -997 (effectively immune)
Burstable: oom_score_adj = min(max(2, 1000 - 1000 * memRequest / nodeMem), 999)
BestEffort: oom_score_adj = 1000 (always first to die)
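The per-class settings can be sketched as a single function. It mirrors the formula above with integer division and the 999 cap that the Kubernetes documentation gives for Burstable pods; the function name and the 64Gi default are assumptions, and the exact clamp constants in a given kubelet version may differ slightly.

```python
GI = 1024 ** 3


def oom_score_adj(qos, memory_request_bytes=0, node_memory_bytes=64 * GI):
    """Approximate the oom_score_adj the kubelet assigns to a container."""
    if qos == "Guaranteed":
        return -997
    if qos == "BestEffort":
        return 1000
    # Burstable: the larger the request relative to the node, the lower the score.
    raw = 1000 - (1000 * memory_request_bytes) // node_memory_bytes
    # Keep Burstable strictly between Guaranteed and BestEffort.
    return min(max(2, raw), 999)
```

A Burstable container requesting 4Gi on a 64Gi node lands in the high 900s, far closer to BestEffort than to Guaranteed, which is why a small request buys little OOM protection.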
The OOM killer operates independently from the kubelet eviction manager. The kubelet tries to prevent OOM kills by evicting pods before the kernel reaches critical memory pressure. When the kubelet is too slow (a sudden memory spike), the kernel OOM killer takes over and uses the oom_score_adj values.
There is a race condition here. The kubelet eviction loop runs on a polling interval (default 10 seconds). A container that allocates 8GB in 2 seconds can trigger the kernel OOM killer before the kubelet notices the pressure. In this case, the kernel kills based on OOM score, which might not match the kubelet’s eviction ordering if priority classes would have changed the order.
Timeline of a memory pressure event:
T+0.0s: Node memory: 58GB / 64GB used (6GB available)
T+0.2s: Analytics job starts processing large dataset
T+0.5s: Node memory: 62GB / 64GB used (2GB available)
T+0.8s: Node memory: 63.5GB / 64GB used (500MB available)
Kubelet eviction threshold: 100Mi NOT YET crossed
T+1.0s: Node memory: 63.95GB / 64GB used (50MB available)
Kernel OOM killer activates
Selects: log-collector (BestEffort, oom_score_adj=1000)
Kills: log-collector
T+1.1s: Node memory: 63.7GB (250MB freed, about 300MB available)
Back above the 100Mi hard threshold; kubelet has not polled yet
T+2.0s: Memory continues rising
T+2.1s: Kernel OOM killer activates again
Selects: analytics-pipeline (Burstable, oom_score_adj=937)
Kills: analytics-pipeline
T+10.0s: Kubelet eviction loop runs
Notices memory.available < 100Mi
Would have evicted analytics-pipeline first
But kernel already killed it
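The race reduces to comparing two durations: how long the remaining memory lasts at the current growth rate, and how long until the kubelet next looks. A rough sketch, assuming linear growth and the worst case where pressure begins just after a poll; `kubelet_reacts_in_time` is a hypothetical helper.

```python
def kubelet_reacts_in_time(available_bytes, growth_bytes_per_s, poll_interval_s=10.0):
    """Worst case: pressure begins right after a poll, so a full interval passes first."""
    if growth_bytes_per_s <= 0:
        return True  # memory is not growing, nothing to race against
    seconds_to_exhaustion = available_bytes / growth_bytes_per_s
    return seconds_to_exhaustion > poll_interval_s
```

In the timeline above, roughly 6GB disappears in about one second, so the function returns `False` and the kernel acts first. At 100MB per second the same headroom lasts a minute and the kubelet has several polls to respond.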
For the content platform, the article service runs as Guaranteed with oom_score_adj=-997. Even in the worst case, the kernel OOM killer skips it in favor of any BestEffort or Burstable process. The service has never been OOM-killed in production. The analytics pipeline has been OOM-killed 14 times in the past quarter, each time recovering automatically because the batch job is idempotent.
Scheduling Behavior by QoS Class
QoS class does not directly affect scheduling. The scheduler uses requests, not QoS class, to make placement decisions. But the relationship between requests and limits (which determines QoS) has indirect scheduling consequences.
Guaranteed Pods and Resource Accounting
A Guaranteed pod with requests.cpu=2, limits.cpu=2 reserves exactly 2 cores on the node. The scheduler deducts 2 cores from the node’s allocatable capacity. No other pod can claim those 2 cores.
Node capacity: 16 cores, 64Gi memory
System reserved: 1 core, 2Gi
Kube reserved: 1 core, 2Gi
Allocatable: 14 cores, 60Gi
After scheduling Guaranteed pods:
article-service: 2 cores, 4Gi (Guaranteed)
search-api: 2 cores, 2Gi (Guaranteed)
cdn-origin: 2 cores, 4Gi (Guaranteed)
Remaining: 8 cores, 50Gi
The 8 remaining cores are available for Burstable and BestEffort pods.
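The accounting above is plain subtraction, which the following sketch makes explicit. `remaining_capacity` is a hypothetical helper, and it ignores the hard eviction threshold that the kubelet also deducts from allocatable memory.

```python
def remaining_capacity(capacity, reserved, pod_requests):
    """Subtract reservations, then scheduled pod requests, from node capacity."""
    allocatable = {
        res: capacity[res] - sum(r[res] for r in reserved) for res in capacity
    }
    return {
        res: allocatable[res] - sum(p[res] for p in pod_requests)
        for res in allocatable
    }


node = {"cpu": 16, "memory_gi": 64}
reserved = [{"cpu": 1, "memory_gi": 2}, {"cpu": 1, "memory_gi": 2}]  # system, kube
guaranteed = [
    {"cpu": 2, "memory_gi": 4},  # article-service
    {"cpu": 2, "memory_gi": 2},  # search-api
    {"cpu": 2, "memory_gi": 4},  # cdn-origin
]
left = remaining_capacity(node, reserved, guaranteed)
```

Only requests enter the calculation. Limits never appear, which is why the scheduler cannot see overcommit.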
Burstable Pods and Overcommit
Burstable pods with requests.cpu=500m, limits.cpu=2 reserve only 500m but can burst to 2 cores. This enables overcommitment: the sum of all limits can exceed node capacity, relying on the fact that not all pods burst simultaneously.
# SLOW: overcommitting without understanding burst patterns
# Total requests: 6 cores (fits on 14-core allocatable)
# Total limits: 32 cores (4x overcommit of the 8-core shared pool)
analytics-1: requests.cpu=500m, limits.cpu=4 (Burstable)
analytics-2: requests.cpu=500m, limits.cpu=4 (Burstable)
analytics-3: requests.cpu=500m, limits.cpu=4 (Burstable)
analytics-4: requests.cpu=500m, limits.cpu=4 (Burstable)
indexer-1: requests.cpu=1, limits.cpu=4 (Burstable)
indexer-2: requests.cpu=1, limits.cpu=4 (Burstable)
indexer-3: requests.cpu=1, limits.cpu=4 (Burstable)
indexer-4: requests.cpu=1, limits.cpu=4 (Burstable)
When all 8 pods burst simultaneously: 32 cores demanded, 8 available
Result: severe CFS throttling, all pods degraded
# FAST: controlled overcommit with burst budget
# Total requests: 8 cores
# Total limits: 16 cores (2x overcommit, within node capacity)
analytics-1: requests.cpu=1, limits.cpu=2 (Burstable)
analytics-2: requests.cpu=1, limits.cpu=2 (Burstable)
analytics-3: requests.cpu=1, limits.cpu=2 (Burstable)
analytics-4: requests.cpu=1, limits.cpu=2 (Burstable)
indexer-1: requests.cpu=1, limits.cpu=2 (Burstable)
indexer-2: requests.cpu=1, limits.cpu=2 (Burstable)
indexer-3: requests.cpu=1, limits.cpu=2 (Burstable)
indexer-4: requests.cpu=1, limits.cpu=2 (Burstable)
When all 8 pods burst: 16 cores demanded, 8 available on shared pool
Result: CFS throttling limited to 50% reduction, acceptable for batch
The content platform’s rule: overcommit ratio for CPU never exceeds 2x on any node. Memory overcommit never exceeds 1.2x because memory pressure triggers OOM kills, while CPU pressure only causes throttling. Throttling degrades latency. OOM kills cause restarts and data loss.
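The overcommit ratio is easy to check mechanically before a layout ships. A minimal sketch, assuming the ratio is measured against the shared pool left after Guaranteed pods; `overcommit_summary` is a hypothetical helper, and the served fraction is an average that ignores how CFS weights pods by their requests.

```python
def overcommit_summary(pods, shared_pool_cores):
    """pods is a list of (cpu_request, cpu_limit) pairs, in cores."""
    total_requests = sum(req for req, _ in pods)
    total_limits = sum(lim for _, lim in pods)
    ratio = total_limits / shared_pool_cores
    # Average share of burst demand the pool can serve if every pod bursts at once.
    served_fraction = min(1.0, shared_pool_cores / total_limits)
    return total_requests, total_limits, ratio, served_fraction


slow = [(0.5, 4)] * 4 + [(1, 4)] * 4  # analytics-1..4, indexer-1..4
fast = [(1, 2)] * 8
slow_summary = overcommit_summary(slow, shared_pool_cores=8)
fast_summary = overcommit_summary(fast, shared_pool_cores=8)
```

A check like this could gate deployments on the 2x CPU rule. The same function applies to memory with a 1.2x ceiling, where the served fraction is less meaningful because the shortfall ends in OOM kills rather than throttling.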
BestEffort Pods and Scavenging
BestEffort pods request nothing and get whatever is left. The scheduler places them on any node with available capacity, but since they have no requests, they do not reserve any resources. They run on spare cycles.
Use cases for BestEffort in the content platform:
- Log shipping agents (tolerate delays)
- Debug sidecars (temporary, disposable)
- Smoke test runners (fail fast, retry cheap)
NOT appropriate for BestEffort:
- Any service handling user traffic
- Any job with state that cannot be reconstructed
- Any process whose restart has a cold-start penalty
Practical Configuration Matrix
The content platform’s resource configuration follows this decision matrix:
Workload Assessment:
                     Latency-    Restart   Burst     Recommended
Service              Sensitive   Cost      Pattern   Config
─────────────────────────────────────────────────────────────────────────
article-service      Yes         High      Steady    Guaranteed + Priority 1M
search-api           Yes         Medium    Spiky     Guaranteed + Priority 1M
cdn-origin           Yes         Low       Steady    Guaranteed + Priority 500K
analytics-pipeline   No          Low       Bursty    Burstable + Priority 100K
content-generator    No          Low       Bursty    Burstable + Priority 100K
search-indexer       No          Medium    Bursty    Burstable + Priority 200K
log-collector        No          None      Steady    BestEffort + Priority 0
debug-tools          No          None      N/A       BestEffort + Priority 0
Every service that handles user requests runs as Guaranteed. Every batch job runs as Burstable with explicit limits. Nothing user-facing runs as BestEffort.
The eviction hierarchy that results from this configuration means the platform degrades gracefully: logs stop shipping before analytics stops processing, analytics stops before search stops indexing, and search indexing stops before any user-facing service is affected. Each tier absorbs the impact of resource pressure before it reaches the tier above.
QoS classes are the first line of defense. They control what gets killed and when. But they do not control what gets degraded by interference while running. That is the domain of CPU pinning and topology management, covered in the next section.