Inside the Kubernetes Scheduling Tournament: How Pods Win Node Placement

These articles are AI-generated summaries. Please check the original sources for full details.

🧠 The Hidden Brain of Kubernetes: How Pod Scheduling Really Works (And Why It’s Smarter Than You Think)

Every unscheduled pod enters a ruthless multi-round elimination tournament known as the scheduling cycle. Candidate nodes must survive a round of filters and then a round of scoring, and the pod is bound to the node that comes out on top.

Why This Matters

While the idealized model suggests pods simply land on available nodes, the technical reality is a complex plugin-based architecture designed to manage resource pressure and fault tolerance. Misconfiguring priority classes or topology constraints can lead to pods being stuck in a permanent ‘Pending’ state, directly impacting production Service Level Objectives (SLOs).

Key Insights

  • When no node can fit a high-priority pod, the scheduler can trigger preemption, evicting lower-priority pods to reclaim node resources.
  • The Filter Phase acts as an elimination round where nodes are disqualified by plugins like NodeResourcesFit or TaintToleration.
  • Scoring plugins rank surviving nodes from 0 to 100, with ImageLocality providing bonus points for cached container images.
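  • The Permit phase can hold a pod in a waiting state, which makes Gang Scheduling possible: distributed ML training jobs wait until all required pods can be scheduled together. This is typically implemented by an add-on such as the coscheduling plugin from the scheduler-plugins project.
  • Topology-aware scheduling via topologySpreadConstraints enforces zone fault tolerance, so a service can survive the loss of a single availability zone.
  • kube-scheduler is built on a pluggable scheduling framework, and a cluster can run additional schedulers alongside it, allowing organizations like NVIDIA to run custom schedulers for specialized hardware like GPUs.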
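
To make the filter round above more concrete, here is a minimal Pod sketch showing the fields two of the named plugins inspect. The pod name, image, and the dedicated=batch taint are hypothetical; the assumption is that some nodes carry the taint dedicated=batch:NoSchedule.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: filter-demo                          # hypothetical name
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0    # hypothetical image
      resources:
        requests:                            # NodeResourcesFit: a node is eliminated if it
          cpu: "500m"                        # cannot fit these requests in its remaining
          memory: 256Mi                      # allocatable capacity
  tolerations:                               # TaintToleration: without this entry, nodes
    - key: dedicated                         # tainted dedicated=batch:NoSchedule would be
      operator: Equal                        # filtered out for this pod
      value: batch
      effect: NoSchedule
```

Note that a toleration only permits placement on tainted nodes; it does not steer the pod toward them.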

Working Examples

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production-critical
value: 1000000
globalDefault: false
description: "For production workloads. Will preempt lower-priority pods."
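
A PriorityClass has no effect until a pod references it. As a minimal sketch (the pod name and image are hypothetical), a workload opts in through spec.priorityClassName:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: checkout-api                              # hypothetical name
spec:
  priorityClassName: production-critical          # must match the PriorityClass metadata.name
  containers:
    - name: app
      image: registry.example.com/checkout:1.0    # hypothetical image
```

If the referenced class does not exist, the pod is rejected at admission. A class that should rank high in the queue without evicting anything can set preemptionPolicy: Never.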

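This pod-spec fragment enforces zone fault tolerance by limiting pod count skew between zones: with maxSkew: 1, the number of matching pods in any two zones may differ by at most one, and DoNotSchedule leaves a pod Pending rather than exceed that limit.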

topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: api-server
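
The fragment above lives under the pod template's spec. As a sketch of where it sits in a complete object (the replica count and image are assumptions chosen for illustration):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 6                                  # assumed value
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server                        # must match the constraint's labelSelector
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: api-server
      containers:
        - name: api-server
          image: registry.example.com/api-server:1.0   # hypothetical image
```

Assuming three zones with schedulable nodes, a fresh rollout of six replicas can only settle at two pods per zone, since any other split would exceed a skew of one.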

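Prometheus query to monitor P99 scheduling latency, with a target of staying below 100 ms. Summing the bucket rates by le yields one cluster-wide quantile instead of one per label combination.

histogram_quantile(0.99, sum by (le) (rate(scheduler_scheduling_attempt_duration_seconds_bucket[5m])))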
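
The 100 ms target can also be expressed as a Prometheus alerting rule. This is a sketch only: the group name, alert name, severity label, and the 10-minute hold time are assumptions, and the threshold is 0.1 because the metric is recorded in seconds.

```yaml
groups:
  - name: scheduler-latency                  # hypothetical group name
    rules:
      - alert: SchedulerP99LatencyHigh       # hypothetical alert name
        expr: |
          histogram_quantile(
            0.99,
            sum by (le) (rate(scheduler_scheduling_attempt_duration_seconds_bucket[5m]))
          ) > 0.1
        for: 10m                             # assumed hold time to avoid flapping
        labels:
          severity: warning
        annotations:
          summary: "P99 scheduling attempt latency above 100 ms"
```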

Practical Applications

  • Google SRE Priority Tiers: Implementing critical, high, and batch tiers to manage SLOs. Pitfall: Batch jobs starving user-facing services due to lack of preemption logic.
  • Zone Fault Tolerance: Using topologyKey to spread workloads across availability zones. Pitfall: Pods clustering in a single zone and failing simultaneously during a zone outage.
  • Custom Scheduler Implementation: Using schedulerName to route pods to a scheduler that handles specialized GPU placement. Pitfall: The default scheduler causing suboptimal placement on specialized hardware nodes.
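
As a sketch of the custom-scheduler case, a pod selects its scheduler with spec.schedulerName. The scheduler name gpu-scheduler, the pod name, and the image are hypothetical, and the nvidia.com/gpu resource assumes the NVIDIA device plugin is installed on the GPU nodes.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-worker                           # hypothetical name
spec:
  schedulerName: gpu-scheduler                    # hypothetical second scheduler in the cluster
  containers:
    - name: trainer
      image: registry.example.com/trainer:1.0     # hypothetical image
      resources:
        limits:
          nvidia.com/gpu: 1                       # extended resource; assumes the NVIDIA device plugin
```

If no scheduler with that name is running, nothing picks the pod up and it stays Pending, so the name is worth validating in CI or with an admission policy.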
