Taints, Tolerations, and Topology Constraints

Taints: Nodes That Repel Pods

The scheduling mechanisms covered in the previous section — nodeSelector, nodeAffinity, podAffinity — all work from the Pod’s perspective. The Pod declares where it wants to go. Taints flip this relationship: the node declares what it doesn’t want.

A taint is a property applied to a node that repels Pods unless those Pods explicitly tolerate the taint. This is how Kubernetes keeps regular workloads off control plane nodes, reserves GPU machines for ML jobs, or dedicates nodes to a specific team.

Applying a Taint

kubectl taint nodes worker1 dedicated=gpu:NoSchedule

This adds a taint to worker1 with:

Key: dedicated
Value: gpu
Effect: NoSchedule

Any Pod that doesn’t tolerate this taint will not be scheduled on worker1. Pods that are already running on the node are not affected by NoSchedule — it only applies to future scheduling decisions.

Viewing Taints

kubectl describe node worker1 | grep -A 5 Taints

Taints:  dedicated=gpu:NoSchedule

Removing a Taint

Append a minus sign to the taint specification:

kubectl taint nodes worker1 dedicated=gpu:NoSchedule-

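The trailing - removes the taint. The value can be omitted: kubectl taint nodes worker1 dedicated:NoSchedule- removes the taint with that key and effect, and kubectl taint nodes worker1 dedicated- removes every taint with the key dedicated.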

Taint Effects

There are three effects, each with different behavior:

NoSchedule

New Pods that don’t tolerate the taint will not be scheduled on the node. Existing Pods are unaffected — they continue running even if they don’t have a matching toleration.

kubectl taint nodes worker1 maintenance=true:NoSchedule

Use case: Preparing a node for maintenance without evicting current workloads.

PreferNoSchedule

A soft version of NoSchedule. The scheduler tries to avoid placing non-tolerating Pods on the node, but will do so if there are no other options. This is a scoring penalty, not a hard filter.

kubectl taint nodes worker2 preferred=lowpriority:PreferNoSchedule

Use case: Discouraging general workloads from landing on a node without making it impossible.

NoExecute

The strictest effect. New Pods that don’t tolerate the taint are not scheduled, and existing Pods that don’t tolerate the taint are evicted. This is the only effect that impacts already-running Pods.

kubectl taint nodes worker1 critical=outage:NoExecute

When this taint is applied:

All Pods on worker1 that lack a matching toleration are evicted immediately.
No new Pods without the toleration are scheduled.

Use case: A node is experiencing hardware issues and all non-essential workloads must leave.

Effect	New Pods without toleration	Existing Pods without toleration
`NoSchedule`	Not scheduled	Not affected
`PreferNoSchedule`	Scheduler tries to avoid	Not affected
`NoExecute`	Not scheduled	Evicted

Tolerations: Pods That Accept Taints

A toleration is declared in the Pod’s spec and tells the scheduler “this Pod can run on a node with this taint.” A toleration does not request placement on a tainted node — it permits it. The Pod might still land on any eligible node, tainted or not.

Basic Toleration Syntax

apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
  containers:
    - name: ml-trainer
      image: ml-trainer:3.0

This toleration matches the taint dedicated=gpu:NoSchedule applied earlier. The Pod can be scheduled on worker1.

The Operator Field

Tolerations support two operators:

Equal (default): The toleration matches when the key, value, and effect all match the taint.

tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"

Exists: The toleration matches when the node has a taint with that key, regardless of the value. The value field must be left out.

tolerations:
  - key: "dedicated"
    operator: "Exists"
    effect: "NoSchedule"

This matches any taint with key dedicated and effect NoSchedule, regardless of whether the value is gpu, ml, batch, or anything else.

Wildcard Tolerations

An empty key with operator Exists matches every taint:

tolerations:
  - operator: "Exists"

This Pod tolerates all taints on all nodes. Use this sparingly — it defeats the purpose of tainting. DaemonSets often use wildcard tolerations because their Pods need to run on every node, even tainted ones.
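As an illustration of the DaemonSet case, here is a minimal sketch of a node-level agent that tolerates every taint. The name node-agent, the labels, and the busybox placeholder image are assumptions made for the example, not part of any real agent:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-agent               # hypothetical name
spec:
  selector:
    matchLabels:
      app: node-agent
  template:
    metadata:
      labels:
        app: node-agent
    spec:
      tolerations:
        - operator: "Exists"     # empty key + Exists: tolerate every taint
      containers:
        - name: agent
          image: busybox         # placeholder image
          command: ["sh", "-c", "while true; do sleep 3600; done"]
```

Because the toleration also covers NoExecute taints, these Pods stay on a node even when other workloads are being evicted from it.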

Omitting the effect field matches all effects for the given key:

tolerations:
  - key: "dedicated"
    operator: "Exists"

This matches dedicated=gpu:NoSchedule, dedicated=ml:NoExecute, and any other taint with key dedicated.

tolerationSeconds: Delayed Eviction

When a NoExecute taint is applied to a node, Pods without a matching toleration are evicted immediately. But tolerations can include a tolerationSeconds field that delays the eviction:

tolerations:
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 300

This toleration says: “If the node becomes unreachable (which applies a NoExecute taint automatically), keep this Pod running for 300 seconds before evicting it.” This gives the node time to recover from transient issues — a network blip, a brief resource spike — without immediately killing workloads.

After tolerationSeconds expires, the Pod is evicted. If the node recovers before the timer runs out, the Pod stays.
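To contrast bounded and unbounded tolerations, here is a sketch of a Pod that leaves an unreachable node after 30 seconds but tolerates the critical=outage:NoExecute taint from earlier indefinitely, since a NoExecute toleration without tolerationSeconds never expires. The Pod name and the 30-second value are illustrative assumptions:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: failover-demo              # hypothetical name
spec:
  tolerations:
    # Bounded: evicted 30 seconds after the unreachable taint is added.
    - key: "node.kubernetes.io/unreachable"
      operator: "Exists"
      effect: "NoExecute"
      tolerationSeconds: 30
    # Unbounded: no tolerationSeconds, so this taint never evicts the Pod.
    - key: "critical"
      operator: "Equal"
      value: "outage"
      effect: "NoExecute"
  containers:
    - name: app
      image: nginx:1.25
```

A short tolerationSeconds suits workloads that should fail over quickly. Leaving it out suits workloads that must stay with the node no matter what.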

Kubernetes automatically taints nodes with these built-in taints when problems occur:

Taint	Trigger
`node.kubernetes.io/not-ready`	Node condition becomes NotReady
`node.kubernetes.io/unreachable`	Node is unreachable by the controller
`node.kubernetes.io/memory-pressure`	Node is under memory pressure
`node.kubernetes.io/disk-pressure`	Node disk usage is high
`node.kubernetes.io/pid-pressure`	Node is running too many processes
`node.kubernetes.io/unschedulable`	Node is cordoned

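Of these, only not-ready and unreachable are applied with the NoExecute effect; the pressure and unschedulable taints use NoSchedule, so on their own they do not evict running Pods. For Pods that don’t declare their own tolerations for not-ready and unreachable, the DefaultTolerationSeconds admission controller adds tolerations with tolerationSeconds: 300 by default.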
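The injected tolerations take roughly this shape on an admitted Pod that declared none of its own, assuming the cluster keeps the 300-second default:

```yaml
# Fragment of a Pod spec after admission (only the added tolerations shown).
tolerations:
  - key: "node.kubernetes.io/not-ready"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 300
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 300
```

Declaring either toleration explicitly in the Pod spec, with a different tolerationSeconds, takes the place of the default for that taint.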

Combining Taints with nodeSelector

Taints and tolerations are often paired with nodeSelector or nodeAffinity. Tolerations alone don’t attract Pods to tainted nodes — they only allow it. Without a selector, a tolerating Pod might still land on any non-tainted node.

To ensure Pods run on the dedicated GPU nodes:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-exclusive
spec:
  nodeSelector:
    dedicated: gpu
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
  containers:
    - name: trainer
      image: ml-trainer:3.0

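The nodeSelector attracts the Pod to GPU nodes, and the toleration allows it past the taint. Note that nodeSelector matches node labels, not taints, so the node also needs the label (for example, kubectl label nodes worker1 dedicated=gpu). With both in place, the Pod runs only on a GPU node, and the taint keeps non-tolerating Pods off it.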
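Seen from the node side, the label and the taint are two separate pieces of state. A trimmed sketch of a Node object carrying both (other fields omitted; worker1 is the node name used earlier):

```yaml
# Trimmed Node object: only the fields relevant to this pairing are shown.
apiVersion: v1
kind: Node
metadata:
  name: worker1
  labels:
    dedicated: gpu          # matched by the Pod's nodeSelector
spec:
  taints:
    - key: dedicated
      value: gpu
      effect: NoSchedule    # repels Pods that lack the toleration
```

If the label is missing, the gpu-exclusive Pod stays Pending even though it tolerates the taint.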

Topology Spread Constraints

While podAntiAffinity prevents matching Pods from sharing a topology domain, it doesn’t guarantee even distribution. If you have three zones and six replicas, anti-affinity keyed on kubernetes.io/hostname ensures no two replicas share a node, but you might still get four replicas in zone A and one each in zones B and C.

Topology spread constraints solve this by specifying a maximum allowed imbalance:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: web
      containers:
        - name: web
          image: nginx:1.25

maxSkew

The maxSkew field defines the maximum difference in Pod count between any two topology domains. With maxSkew: 1 across three zones, the scheduler distributes Pods as evenly as possible — 2-2-2 rather than 4-1-1.

The skew is calculated as: $$\text{skew} = \max(\text{domain counts}) - \min(\text{domain counts})$$

If placing a new Pod in zone A would create a skew greater than maxSkew, the scheduler picks a different zone.
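As a worked example under assumed counts, suppose zones A, B, and C currently hold 2, 2, and 1 matching Pods and maxSkew is 1. Applying the formula above to each candidate zone for the next Pod:

$$
\begin{aligned}
\text{place in A: } (3, 2, 1) &\Rightarrow \text{skew} = 3 - 1 = 2 > 1 \quad \text{(rejected)} \\
\text{place in B: } (2, 3, 1) &\Rightarrow \text{skew} = 3 - 1 = 2 > 1 \quad \text{(rejected)} \\
\text{place in C: } (2, 2, 2) &\Rightarrow \text{skew} = 2 - 2 = 0 \le 1 \quad \text{(allowed)}
\end{aligned}
$$

Only zone C keeps the skew within maxSkew, so the Pod is placed there.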

whenUnsatisfiable

Controls what happens when the constraint cannot be met:

DoNotSchedule: The Pod stays Pending until placement is possible without violating the skew. This is a hard constraint.
ScheduleAnyway: The scheduler places the Pod but prioritizes domains that minimize skew. This is a soft constraint.

Complete Example with Node-Level Spreading

Spread replicas across individual nodes:

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: web

With ScheduleAnyway, the scheduler tries to spread Pods evenly across nodes but won’t leave Pods Pending if perfect balance isn’t achievable.

Multiple Constraints

Multiple topology spread constraints are evaluated together. A node must satisfy every DoNotSchedule constraint to be eligible, while ScheduleAnyway constraints only influence scoring:

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: web
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: web

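This says: “hard requirement: even zone distribution; soft preference: even distribution across individual nodes.” The Pod won’t schedule if zone balance would be violated, but it will accept uneven node balance if necessary. Note that the hostname constraint counts Pods per node across the whole cluster, not separately within each zone.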

Putting It All Together

A realistic scheduling configuration often combines multiple mechanisms:

apiVersion: v1
kind: Pod
metadata:
  name: production-api
  labels:
    app: api
    tier: backend
spec:
  nodeSelector:
    env: production
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 70
          preference:
            matchExpressions:
              - key: disk
                operator: In
                values: ["ssd"]
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values: ["api"]
          topologyKey: kubernetes.io/hostname
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "production"
      effect: "NoSchedule"
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          app: api
  containers:
    - name: api
      image: api-server:2.0

This Pod:

Must run on a node labeled env=production (nodeSelector — filter)
Prefers SSD nodes (nodeAffinity preferred — scoring)
Must not share a node with another app=api Pod (podAntiAffinity — filter)
Tolerates the dedicated=production:NoSchedule taint (toleration — filter bypass)
Tries to spread evenly across zones (topology spread — scoring)

Each mechanism targets a different concern, and they compose cleanly.

Exercises (Chapters 6–9)

Practice these exercises to reinforce multi-container Pods, batch workloads, update strategies, and scheduling concepts. Each exercise is self-contained and can be completed in a few minutes.

Exercise 1: Sidecar Logging Container with Shared Volume

Deploy a multi-container Pod where the main application writes logs and a sidecar container reads them.

Requirements:

Pod name: app-with-logger
Main container: app using image busybox, running sh -c "while true; do echo $(date) Application running >> /var/log/app.log; sleep 5; done"
Sidecar container: log-reader using image busybox, running sh -c "tail -f /var/log/app.log"
Both containers share a volume named log-volume of type emptyDir
The main container mounts the volume at /var/log
The sidecar mounts the volume at /var/log

Verification:

kubectl logs app-with-logger -c log-reader

You should see timestamped log lines from the main application, streamed live by the sidecar.

Exercise 2: Rolling Update with Zero Downtime

Perform a rolling update and verify it proceeds without downtime.

Requirements:

Create a Deployment named web-server with 4 replicas running nginx:1.25
Set the rolling update strategy: maxUnavailable: 1, maxSurge: 1
After the Deployment is running, update the image to nginx:1.26
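Monitor the rollout and confirm that at least 3 of the 4 replicas stay available at every point (maxUnavailable: 1 permits one replica to be down during the update, so the service keeps responding throughout)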

Commands to use:

kubectl create deployment web-server --image=nginx:1.25 --replicas=4
kubectl set image deployment/web-server nginx=nginx:1.26
kubectl rollout status deployment/web-server

Verification:

kubectl rollout history deployment/web-server

You should see two revisions in the history. Confirm the current image:

kubectl describe deployment web-server | grep Image

Exercise 3: Parallel Job Processing

Create a Job that processes multiple items in parallel.

Requirements:

Job name: batch-processor
Image: busybox
Command: sh -c "echo Processing item on $HOSTNAME && sleep 10"
Completions: 5
Parallelism: 2
backoffLimit: 3
restartPolicy: Never

Verification:

kubectl get jobs batch-processor -w

Watch the completions column progress from 0/5 to 5/5. Confirm that at most 2 Pods run concurrently:

kubectl get pods -l job-name=batch-processor

Exercise 4: Schedule a Pod on Labeled Nodes

Use nodeSelector to constrain Pod placement.

Requirements:

Label a node with disk=ssd (pick any worker node with kubectl get nodes)
Create a Pod named ssd-pod with image nginx:1.25
Use nodeSelector to ensure the Pod runs only on the labeled node

Commands:

kubectl label node <node-name> disk=ssd

Verification:

kubectl get pod ssd-pod -o wide

The NODE column should show the labeled node. Remove the label and create a second Pod with the same nodeSelector. It should stay Pending because no node matches, while ssd-pod keeps running:

kubectl label node <node-name> disk-

Exercise 5: Taints and Tolerations

Add a taint to a node and create a Pod that tolerates it.

Requirements:

Taint a worker node: kubectl taint nodes <node-name> dedicated=testing:NoSchedule
Create a Pod named test-pod with image nginx:1.25 that does NOT tolerate the taint — verify it is not scheduled on the tainted node
Create a Pod named tolerant-pod with image nginx:1.25 that includes a toleration for dedicated=testing:NoSchedule

Verification:

# test-pod should not be on the tainted node
kubectl get pod test-pod -o wide

# tolerant-pod can be on the tainted node
kubectl get pod tolerant-pod -o wide

Cleanup:

kubectl taint nodes <node-name> dedicated=testing:NoSchedule-
kubectl delete pod test-pod tolerant-pod

Solutions for these exercises are provided in the next chapter.