Taints, Tolerations, and Topology Constraints
Summary
Covers taints and their three effects (NoSchedule, PreferNoSchedule, NoExecute), toleration syntax with Equal and Exists operators, tolerationSeconds for delayed eviction, topology spread constraints with maxSkew and whenUnsatisfiable, and practice exercises integrating concepts from Chapters 6-9.
Taints: Nodes That Repel Pods
The scheduling mechanisms covered in the previous section — nodeSelector, nodeAffinity, podAffinity — all work from the Pod’s perspective. The Pod declares where it wants to go. Taints flip this relationship: the node declares what it doesn’t want.
A taint is a property applied to a node that repels Pods unless those Pods explicitly tolerate the taint. This is how Kubernetes keeps regular workloads off control plane nodes, reserves GPU machines for ML jobs, or dedicates nodes to a specific team.
Applying a Taint
kubectl taint nodes worker1 dedicated=gpu:NoSchedule
This adds a taint to worker1 with:
- Key: dedicated
- Value: gpu
- Effect: NoSchedule
Any Pod that doesn’t tolerate this taint will not be scheduled on worker1. Pods that are already running on the node are not affected by NoSchedule — it only applies to future scheduling decisions.
Viewing Taints
kubectl describe node worker1 | grep -A 5 Taints
Taints: dedicated=gpu:NoSchedule
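For orientation, the same taint is stored under spec.taints on the Node object. The fragment below is a trimmed sketch of what kubectl get node worker1 -o yaml would contain for the taint applied above; every unrelated field is omitted.

```yaml
# Trimmed Node object: only the fields relevant to the taint are shown.
apiVersion: v1
kind: Node
metadata:
  name: worker1
spec:
  taints:
  - key: dedicated      # taint key
    value: gpu          # taint value
    effect: NoSchedule  # one of NoSchedule, PreferNoSchedule, NoExecute
```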
Removing a Taint
Append a minus sign to the taint specification:
kubectl taint nodes worker1 dedicated=gpu:NoSchedule-
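The trailing - removes the taint. The key and effect identify which taint to remove. Shorter forms also work: dedicated:NoSchedule- removes the taint with that key and effect, and dedicated- removes every taint with key dedicated.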
Taint Effects
There are three effects, each with different behavior:
NoSchedule
New Pods that don’t tolerate the taint will not be scheduled on the node. Existing Pods are unaffected — they continue running even if they don’t have a matching toleration.
kubectl taint nodes worker1 maintenance=true:NoSchedule
Use case: Preparing a node for maintenance without evicting current workloads.
PreferNoSchedule
A soft version of NoSchedule. The scheduler tries to avoid placing non-tolerating Pods on the node, but will do so if there are no other options. This is a scoring penalty, not a hard filter.
kubectl taint nodes worker2 preferred=lowpriority:PreferNoSchedule
Use case: Discouraging general workloads from landing on a node without making it impossible.
NoExecute
The strictest effect. New Pods that don’t tolerate the taint are not scheduled, and existing Pods that don’t tolerate the taint are evicted. This is the only effect that impacts already-running Pods.
kubectl taint nodes worker1 critical=outage:NoExecute
When this taint is applied:
- All Pods on worker1 that lack a matching toleration are evicted immediately.
- No new Pods without the toleration are scheduled.
Use case: A node is experiencing hardware issues and all non-essential workloads must leave.
| Effect | New Pods without toleration | Existing Pods without toleration |
|---|---|---|
| NoSchedule | Not scheduled | Not affected |
| PreferNoSchedule | Scheduler tries to avoid | Not affected |
| NoExecute | Not scheduled | Evicted |
Tolerations: Pods That Accept Taints
A toleration is declared in the Pod’s spec and tells the scheduler “this Pod can run on a node with this taint.” A toleration does not request placement on a tainted node — it permits it. The Pod might still land on any eligible node, tainted or not.
Basic Toleration Syntax
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  containers:
  - name: ml-trainer
    image: ml-trainer:3.0
This toleration matches the taint dedicated=gpu:NoSchedule applied earlier. The Pod can be scheduled on worker1.
The Operator Field
Tolerations support two operators:
Equal (default): The toleration matches when the key, value, and effect all match the taint.
tolerations:
- key: "dedicated"
  operator: "Equal"
  value: "gpu"
  effect: "NoSchedule"
Exists: The toleration matches when a taint with that key exists on the node, regardless of the taint's value. The value field must be left out.
tolerations:
- key: "dedicated"
  operator: "Exists"
  effect: "NoSchedule"
This matches any taint with key dedicated and effect NoSchedule, regardless of whether the value is gpu, ml, batch, or anything else.
Wildcard Tolerations
An empty key with operator Exists matches every taint:
tolerations:
- operator: "Exists"
This Pod tolerates all taints on all nodes. Use this sparingly — it defeats the purpose of tainting. DaemonSets often use wildcard tolerations because their Pods need to run on every node, even tainted ones.
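As a sketch of the DaemonSet case, the manifest below puts the wildcard toleration into a DaemonSet's Pod template. The name node-agent, its labels, and the busybox image with a sleep loop are placeholders chosen for illustration, not part of the original example.

```yaml
# Hypothetical DaemonSet; name, labels, and image are placeholders.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-agent
spec:
  selector:
    matchLabels:
      app: node-agent
  template:
    metadata:
      labels:
        app: node-agent
    spec:
      tolerations:
      - operator: "Exists"   # empty key + Exists: tolerate every taint
      containers:
      - name: agent
        image: busybox:1.36
        command: ["sh", "-c", "while true; do sleep 3600; done"]
```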
Omitting the effect field matches all effects for the given key:
tolerations:
- key: "dedicated"
  operator: "Exists"
This matches dedicated=gpu:NoSchedule, dedicated=ml:NoExecute, and any other taint with key dedicated.
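A node can carry several taints at once, and a Pod is only admitted if every NoSchedule or NoExecute taint on the node is matched by one of its tolerations. As a sketch, assume worker1 carried a second, hypothetical taint team=ml:NoSchedule next to the one from the earlier example; the Pod would then need a toleration for each.

```yaml
# Assumes worker1 has two taints:
#   dedicated=gpu:NoSchedule   (from the earlier example)
#   team=ml:NoSchedule         (hypothetical, for illustration only)
tolerations:
- key: "dedicated"
  operator: "Equal"
  value: "gpu"
  effect: "NoSchedule"
- key: "team"
  operator: "Exists"         # any value for key team
  effect: "NoSchedule"
```

If either entry were missing, the remaining untolerated taint would still keep the Pod off the node.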
tolerationSeconds: Delayed Eviction
When a NoExecute taint is applied to a node, Pods without a matching toleration are evicted immediately. But tolerations can include a tolerationSeconds field that delays the eviction:
tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300
This toleration says: “If the node becomes unreachable (which applies a NoExecute taint automatically), keep this Pod running for 300 seconds before evicting it.” This gives the node time to recover from transient issues — a network blip, a brief resource spike — without immediately killing workloads.
After tolerationSeconds expires, the Pod is evicted. If the node recovers before the timer runs out, the Pod stays.
Kubernetes automatically taints nodes with these built-in taints when problems occur:
| Taint | Trigger |
|---|---|
| node.kubernetes.io/not-ready | Node condition becomes NotReady |
| node.kubernetes.io/unreachable | Node is unreachable by the controller |
| node.kubernetes.io/memory-pressure | Node is under memory pressure |
| node.kubernetes.io/disk-pressure | Node disk usage is high |
| node.kubernetes.io/pid-pressure | Node is running too many processes |
| node.kubernetes.io/unschedulable | Node is cordoned |
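Pods without explicit tolerations for node.kubernetes.io/not-ready and node.kubernetes.io/unreachable are given default NoExecute tolerations of 300 seconds for those two taints by the DefaultTolerationSeconds admission controller. The pressure and unschedulable taints use the NoSchedule effect, so they block new Pods rather than evicting running ones.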
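To make the defaults concrete, the fragment below sketches the two tolerations that the DefaultTolerationSeconds admission controller adds to a Pod that declares none of its own for these keys. It assumes the cluster keeps the default of 300 seconds; you would see something like it in the output of kubectl get pod <pod-name> -o yaml.

```yaml
# Injected into the Pod spec when the Pod has no toleration for these keys.
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
```

Declaring your own toleration for either key, for example with tolerationSeconds: 30 to fail over faster, takes the place of the injected default. A NoExecute toleration with no tolerationSeconds at all keeps the Pod bound to the node indefinitely.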
Combining Taints with nodeSelector
Taints and tolerations are often paired with nodeSelector or nodeAffinity. Tolerations alone don’t attract Pods to tainted nodes — they only allow it. Without a selector, a tolerating Pod might still land on any non-tainted node.
To ensure Pods run on the dedicated GPU nodes:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-exclusive
spec:
  nodeSelector:
    dedicated: gpu
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  containers:
  - name: trainer
    image: ml-trainer:3.0
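The nodeSelector attracts the Pod to GPU nodes. The toleration allows it past the taint. Together they guarantee the Pod runs on a GPU node and nowhere else. This assumes the GPU nodes also carry the label dedicated=gpu: nodeSelector matches node labels, not taints, so tainting a node does not label it.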
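The same pairing can be expressed with nodeAffinity instead of nodeSelector. The sketch below assumes, as the example above does, that GPU nodes carry the label dedicated=gpu in addition to the taint; the Pod name gpu-exclusive-affinity is made up for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-exclusive-affinity   # hypothetical name
spec:
  affinity:
    nodeAffinity:
      # Hard requirement: only nodes labeled dedicated=gpu are eligible.
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: dedicated
            operator: In
            values: ["gpu"]
  tolerations:
  # Lets the Pod past the dedicated=gpu:NoSchedule taint.
  - key: "dedicated"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  containers:
  - name: trainer
    image: ml-trainer:3.0
```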
Topology Spread Constraints
While podAntiAffinity prevents Pods from sharing the same domain, it doesn’t guarantee even distribution. If you have three zones and six replicas, anti-affinity with topologyKey kubernetes.io/hostname ensures no two replicas share a node, but you might still get four replicas in zone A and one each in zones B and C.
Topology spread constraints solve this by specifying a maximum allowed imbalance:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: web
      containers:
      - name: web
        image: nginx:1.25
maxSkew
The maxSkew field defines the maximum difference in Pod count between any two topology domains. With maxSkew: 1 across three zones, the scheduler distributes Pods as evenly as possible — 2-2-2 rather than 4-1-1.
The skew is calculated as: $$\text{skew} = \max(\text{domain counts}) - \min(\text{domain counts})$$
If placing a new Pod in zone A would create a skew greater than maxSkew, the scheduler picks a different zone.
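As a worked example of the formula above, assume five matching Pods are already placed 2-2-1 across zones A, B, and C, with maxSkew: 1 (the counts are made up for illustration). Placing the sixth Pod in zone A would give

$$\text{skew} = \max(3, 2, 1) - \min(3, 2, 1) = 2 > 1$$

so zone A is rejected, and zone B fails the same way. Placing it in zone C gives

$$\text{skew} = \max(2, 2, 2) - \min(2, 2, 2) = 0 \le 1$$

so zone C is the only eligible zone, and the result is the 2-2-2 layout.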
whenUnsatisfiable
Controls what happens when the constraint cannot be met:
- DoNotSchedule: The Pod stays Pending until placement is possible without violating the skew. This is a hard constraint.
- ScheduleAnyway: The scheduler places the Pod but prioritizes domains that minimize skew. This is a soft constraint.
Complete Example with Node-Level Spreading
Spread replicas across individual nodes:
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: ScheduleAnyway
  labelSelector:
    matchLabels:
      app: web
With ScheduleAnyway, the scheduler tries to spread Pods evenly across nodes but won’t leave Pods Pending if perfect balance isn’t achievable.
Multiple Constraints
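Multiple topology spread constraints are evaluated together. Every DoNotSchedule constraint must be satisfied for a node to be eligible; ScheduleAnyway constraints only influence how the remaining nodes are ranked: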
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: web
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: ScheduleAnyway
  labelSelector:
    matchLabels:
      app: web
This says: “hard requirement — even zone distribution; soft preference — even node distribution within each zone.” The Pod won’t schedule if zone balance is violated, but will accept uneven node balance if necessary.
Putting It All Together
A realistic scheduling configuration often combines multiple mechanisms:
apiVersion: v1
kind: Pod
metadata:
  name: production-api
  labels:
    app: api
    tier: backend
spec:
  nodeSelector:
    env: production
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 70
        preference:
          matchExpressions:
          - key: disk
            operator: In
            values: ["ssd"]
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values: ["api"]
        topologyKey: kubernetes.io/hostname
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "production"
    effect: "NoSchedule"
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: api
  containers:
  - name: api
    image: api-server:2.0
This Pod:
- Must run on a node labeled env=production (nodeSelector: filter)
- Prefers SSD nodes (nodeAffinity preferred: scoring)
- Must not share a node with another app=api Pod (podAntiAffinity: filter)
- Tolerates the dedicated=production:NoSchedule taint (toleration: filter bypass)
- Tries to spread evenly across zones (topology spread: scoring)
Each mechanism targets a different concern, and they compose cleanly.
Exercises (Chapters 6–9)
Practice these exercises to reinforce batch workloads, update strategies, and scheduling concepts. Each exercise is self-contained and can be completed in a few minutes.
Exercise 1: Sidecar Logging Container with Shared Volume
Deploy a multi-container Pod where the main application writes logs and a sidecar container reads them.
Requirements:
- Pod name: app-with-logger
- Main container: app using image busybox, running sh -c "while true; do echo $(date) Application running >> /var/log/app.log; sleep 5; done"
- Sidecar container: log-reader using image busybox, running sh -c "tail -f /var/log/app.log"
- Both containers share a volume named log-volume of type emptyDir
- The main container mounts the volume at /var/log
- The sidecar mounts the volume at /var/log
Verification:
kubectl logs app-with-logger -c log-reader
You should see timestamped log lines from the main application, streamed live by the sidecar.
Exercise 2: Rolling Update with Zero Downtime
Perform a rolling update and verify it proceeds without downtime.
Requirements:
- Create a Deployment named web-server with 4 replicas running nginx:1.25
- Set the rolling update strategy: maxUnavailable: 1, maxSurge: 1
- After the Deployment is running, update the image to nginx:1.26
- Monitor the rollout and confirm that at least 3 replicas stay available throughout (maxUnavailable: 1 permits one replica to be down at a time)
Commands to use:
kubectl create deployment web-server --image=nginx:1.25 --replicas=4
kubectl set image deployment/web-server nginx=nginx:1.26
kubectl rollout status deployment/web-server
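The commands above create and update the Deployment but do not set the strategy. One way to do that, sketched here, is to run kubectl edit deployment web-server before changing the image and make sure the spec contains the stanza below (only the relevant fields are shown).

```yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # at most one replica below the desired count
      maxSurge: 1         # at most one extra replica above the desired count
```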
Verification:
kubectl rollout history deployment/web-server
You should see two revisions in the history. Confirm the current image:
kubectl describe deployment web-server | grep Image
Exercise 3: Parallel Job Processing
Create a Job that processes multiple items in parallel.
Requirements:
- Job name: batch-processor
- Image: busybox
- Command: sh -c "echo Processing item on $HOSTNAME && sleep 10"
- Completions: 5
- Parallelism: 2
- backoffLimit: 3
- restartPolicy: Never
Verification:
kubectl get jobs batch-processor -w
Watch the completions column progress from 0/5 to 5/5. Confirm that at most 2 Pods run concurrently:
kubectl get pods -l job-name=batch-processor
Exercise 4: Schedule a Pod on Labeled Nodes
Use nodeSelector to constrain Pod placement.
Requirements:
- Label a node with disk=ssd (pick any worker node with kubectl get nodes)
- Create a Pod named ssd-pod with image nginx:1.25
- Use nodeSelector to ensure the Pod runs only on the labeled node
Commands:
kubectl label node <node-name> disk=ssd
Verification:
kubectl get pod ssd-pod -o wide
The NODE column should show the labeled node. Remove the label and create a second Pod with the same nodeSelector; it should stay Pending because no node matches:
kubectl label node <node-name> disk-
Exercise 5: Taints and Tolerations
Add a taint to a node and create a Pod that tolerates it.
Requirements:
- Taint a worker node: kubectl taint nodes <node-name> dedicated=testing:NoSchedule
- Create a Pod named test-pod with image nginx:1.25 that does NOT tolerate the taint, and verify it is not scheduled on the tainted node
- Create a Pod named tolerant-pod with image nginx:1.25 that includes a toleration for dedicated=testing:NoSchedule
Verification:
# test-pod should not be on the tainted node
kubectl get pod test-pod -o wide
# tolerant-pod can be on the tainted node
kubectl get pod tolerant-pod -o wide
Cleanup:
kubectl taint nodes <node-name> dedicated=testing:NoSchedule-
kubectl delete pod test-pod tolerant-pod
Solutions for these exercises are provided in the next chapter.