Optimizing GKE Node Upgrades: Lessons from a 45-Minute Production Outage

The GKE Upgrade That Took Down Our Production Pods for 45 Minutes

A routine GKE node pool upgrade caused a 45-minute outage of critical customer-facing API services. The incident occurred even though the node pool used GKE's default surge upgrade strategy, which adds a surge node and then drains existing nodes one at a time to preserve availability.

Why This Matters

Managed services like GKE provide powerful automation, but relying on defaults without understanding pod topology and lifecycle management can lead to catastrophic availability loss. This case shows that, without PodDisruptionBudgets, surge upgrades combined with a small replica count (two replicas) can defeat high-availability assumptions: draining one node can remove 50% of capacity, and draining the second before a replacement pod is ready can take the service down entirely.

Key Insights

GKE surge upgrades with maxSurge set to 1 can cause 50% capacity loss for dual-replica deployments if the two nodes hosting the replicas are drained back to back before replacement pods pass their health checks (Charlotte, 2026).
Missing PodDisruptionBudgets (PDBs) allow the Kubernetes control plane to evict pods freely, ignoring the availability needs of specific workloads during maintenance.
Inaccurate readiness probes, such as a static 10-second delay for a service that needs time to warm its cache, cause traffic to be routed to pods that are not yet ready.
Policy enforcement via Kyverno can mandate the existence of PDBs for all production-namespace deployments to prevent configuration drift.
Maintenance windows must be calibrated against real traffic patterns; a Tuesday 9am window coincided with critical weekly batch jobs, exacerbating the outage.
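
The dual-replica insight above depends on where the two replicas are scheduled: if both land on the same node, a single drain removes them together. The fragment below is a minimal sketch of one way to spread replicas across nodes. It is an assumption for illustration, not something the original write-up describes, and it reuses the app: session-validator label from the examples in this article.

```yaml
# Fragment of a Deployment's pod template (spec.template.spec).
# Illustrative only; the incident report does not say whether spread
# constraints were in use.
topologySpreadConstraints:
- maxSkew: 1                           # replica counts per node may differ by at most 1
  topologyKey: kubernetes.io/hostname  # treat each node as its own topology domain
  whenUnsatisfiable: DoNotSchedule     # leave the pod Pending rather than co-locate replicas
  labelSelector:
    matchLabels:
      app: session-validator
```

Spreading only limits how much a single node drain can remove; it does not replace a PodDisruptionBudget, which governs how many pods may be evicted at once.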

Working Examples

A PodDisruptionBudget that keeps at least one session-validator pod available, so a node drain cannot evict the last healthy replica before another pod is ready.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: session-validator-pdb
  namespace: production
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: session-validator
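
With exactly two replicas, minAvailable: 1 permits one voluntary eviction at a time. If the replica count grows later, expressing the budget as maxUnavailable keeps the same "one at a time" behavior without revisiting the number. The variant below is a sketch under that assumption and is not taken from the original write-up; a PDB may set either minAvailable or maxUnavailable, not both.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: session-validator-pdb
  namespace: production
spec:
  # Allow at most one matching pod to be voluntarily evicted at a time,
  # regardless of how many replicas the Deployment runs.
  maxUnavailable: 1
  selector:
    matchLabels:
      app: session-validator
```

GKE's documentation describes a bounded period (about one hour at the time of writing) during which node upgrades honor a PDB, so a budget delays eviction rather than blocking it indefinitely. Confirm the current limit for your cluster version.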

A tightened readiness probe that polls a dedicated readiness endpoint, so pods receive traffic only when they are genuinely ready to handle load.

readinessProbe:
  httpGet:
    path: /healthz/ready
    port: 8080
  initialDelaySeconds: 20
  periodSeconds: 5
  failureThreshold: 3
  successThreshold: 1
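
For services whose warm-up time varies, a fixed initialDelaySeconds is still a guess. One possible refinement, sketched below, is a startupProbe that holds off the readiness probe until warm-up completes. The thresholds are hypothetical, and the sketch assumes /healthz/ready returns success only after the cache is warm, which the original article does not confirm.

```yaml
# Container-spec fragment; values are illustrative, not from the incident.
startupProbe:
  httpGet:
    path: /healthz/ready   # assumed to report success only after cache warm-up
    port: 8080
  periodSeconds: 5
  failureThreshold: 24     # allows up to 24 * 5s = 120s for startup
readinessProbe:
  httpGet:
    path: /healthz/ready
    port: 8080
  periodSeconds: 5         # readiness checks begin only after the startup probe succeeds
  failureThreshold: 3
  successThreshold: 1
```

If the startup probe exhausts its failureThreshold, the kubelet restarts the container, so the budget should comfortably exceed the slowest observed warm-up.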

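A Kyverno policy that enforces the PDB requirement in production. As written, it checks a pdb-configured annotation on each Deployment rather than looking up PodDisruptionBudget objects in the cluster.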

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-pdb
spec:
  validationFailureAction: Enforce
  rules:
  - name: check-pdb-exists
    match:
      any:
      - resources:
          kinds:
          - Deployment
          namespaces:
          - production
    validate:
      message: "A PodDisruptionBudget is required for all production deployments."
      deny:
        conditions:
          all:
          - key: "{{ request.object.metadata.name }}"
            operator: NotIn
            value: "{{ request.object.metadata.annotations.\"pdb-configured\" || '' }}"
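
The deny condition above compares the Deployment's name with the value of its pdb-configured annotation. The manifest below is a minimal sketch of a Deployment intended to satisfy that check. The annotation convention (value equals the Deployment name) is inferred from the policy, the image is hypothetical, and how NotIn evaluates a plain string value should be verified against the Kyverno version in use.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: session-validator
  namespace: production
  annotations:
    # Inferred convention: the value names the Deployment that has a PDB.
    pdb-configured: "session-validator"
spec:
  replicas: 2
  selector:
    matchLabels:
      app: session-validator
  template:
    metadata:
      labels:
        app: session-validator
    spec:
      containers:
      - name: session-validator
        image: registry.example.com/session-validator:1.0.0  # hypothetical image
        ports:
        - containerPort: 8080
```

Because the annotation is self-declared, this policy catches forgotten PDBs but cannot prove that a matching PodDisruptionBudget actually exists.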

Practical Applications

Use Case: Implementing PodDisruptionBudgets with minAvailable: 1 for critical services to force GKE node drains to wait for healthy pod replacements.
Pitfall: Using default readiness probes that only check if a process has started rather than if internal caches or connections are established.
Use Case: Moving GKE maintenance windows to verified low-traffic periods, such as Saturday nights, to minimize impact of unforeseen upgrade issues.
Pitfall: Relying on fully automatic upgrades for production environments without pre-upgrade checks for workload disruption constraints.

References

Charlotte (2026). The GKE Upgrade That Took Down Our Production Pods for 45 Minutes. https://dev.to/charlotte05478/the-gke-upgrade-that-took-down-our-production-pods-for-45-minutes-om9
