Optimizing GKE Node Upgrades: Lessons from a 45-Minute Production Outage


These articles are AI-generated summaries. Please check the original sources for full details.

The GKE Upgrade That Took Down Our Production Pods for 45 Minutes

A standard GKE node pool upgrade triggered a 45-minute outage for critical customer-facing API services. The incident occurred despite using Google’s default surge upgrade strategy, which adds a surge node and then drains existing nodes one at a time to maintain availability.

Why This Matters

Managed services like GKE provide powerful automation, but relying on defaults without understanding pod topology and lifecycle management can cause severe availability loss. This case shows that, without PodDisruptionBudgets, surge upgrades combined with small replica counts (n=2) can defeat high-availability assumptions: node rotation can leave a service with 50% of its capacity, or with none at all if the second replica is evicted before the first replacement is ready.

Key Insights

  • GKE surge upgrades with a max-surge value of one can cause 50% capacity loss for dual-replica deployments, and a full outage if both nodes are cycled consecutively before replacement pods pass their health checks (Charlotte, 2026).
  • Missing PodDisruptionBudgets (PDBs) allow the Kubernetes control plane to evict pods freely, ignoring the availability needs of specific workloads during maintenance.
  • Inaccurate readiness probes, such as static 10-second delays for services requiring cache warming, lead to traffic routing to unready pods.
  • Policy enforcement via Kyverno can mandate the existence of PDBs for all production-namespace deployments to prevent configuration drift.
  • Maintenance windows must be calibrated against real traffic patterns; a Tuesday 9am window coincided with critical weekly batch jobs, exacerbating the outage.

Working Examples

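A PodDisruptionBudget that blocks voluntary evictions, including GKE node drains, whenever they would leave fewer than one healthy session-validator pod. It only helps when the Deployment runs at least two replicas; with a single replica, minAvailable: 1 blocks the drain outright. GKE honors PDBs during node upgrades for up to one hour, after which the drain proceeds anyway.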

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: session-validator-pdb
  namespace: production
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: session-validator
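
A PDB limits how many pods can be evicted at once, but it does not control where the replicas run. The following is a minimal sketch, not the configuration from the incident, of a Deployment that spreads session-validator pods across nodes so that draining one node cannot remove every replica. The replica count of 3, the container name, and the image reference are assumptions made for illustration; the source does not say how the original replicas were placed.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: session-validator
  namespace: production
spec:
  replicas: 3                      # assumption: one more than the two replicas in the incident
  selector:
    matchLabels:
      app: session-validator       # must match the PDB selector
  template:
    metadata:
      labels:
        app: session-validator
    spec:
      topologySpreadConstraints:
      - maxSkew: 1                 # per-node pod counts may differ by at most one
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: session-validator
      containers:
      - name: session-validator    # hypothetical container name
        image: registry.example.com/session-validator:1.0.0   # hypothetical image
        ports:
        - containerPort: 8080
```

With whenUnsatisfiable: DoNotSchedule, a pod stays Pending rather than landing on a node that would violate the spread, so the node pool needs enough schedulable nodes; ScheduleAnyway relaxes this to a preference.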

Tightened readiness probe to ensure services are genuinely ready to handle load.

readinessProbe:
  httpGet:
    path: /healthz/ready
    port: 8080
  initialDelaySeconds: 20
  periodSeconds: 5
  failureThreshold: 3
  successThreshold: 1
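
A fixed initialDelaySeconds is still a guess about how long cache warming takes. One alternative, sketched here under the assumption that /healthz/ready only returns success once the cache is warm, is a startupProbe: Kubernetes holds back the readiness and liveness probes until the startup probe succeeds. The 120-second warm-up budget is an assumed figure, not one from the source.

```yaml
startupProbe:
  httpGet:
    path: /healthz/ready     # assumption: succeeds only after cache warming completes
    port: 8080
  periodSeconds: 5
  failureThreshold: 24       # 24 checks x 5s = up to 120s allowed for warm-up (assumed budget)
readinessProbe:
  httpGet:
    path: /healthz/ready
    port: 8080
  periodSeconds: 5           # runs only after the startup probe has succeeded
  failureThreshold: 3
  successThreshold: 1
```

If the startup probe exhausts its failureThreshold, the kubelet restarts the container, so the budget should sit comfortably above the slowest observed warm-up.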

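Kyverno policy to enforce PDB requirements in production. As written, it checks a pdb-configured annotation on the Deployment rather than looking up an actual PodDisruptionBudget, so it relies on that annotation being kept accurate.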

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-pdb
spec:
  validationFailureAction: Enforce
  rules:
  - name: check-pdb-exists
    match:
      any:
      - resources:
          kinds:
          - Deployment
        namespaces:
          - production
    validate:
      message: "A PodDisruptionBudget is required for all production deployments."
      deny:
        conditions:
          all:
          # Deny when the Deployment's name is not listed in its
          # "pdb-configured" annotation (a missing annotation becomes '').
          - key: "{{ request.object.metadata.name }}"
            operator: NotIn
            value: "{{ request.object.metadata.annotations.\"pdb-configured\" || '' }}"
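
The annotation check above can pass even when no PDB exists. A stricter variant, sketched here on the pattern of Kyverno's published sample policies, queries the API server for PDBs whose selector matches the Deployment's pod labels. The policy and rule names are hypothetical, and the apiCall context and label_match function should be verified against the Kyverno version in use.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-pdb-lookup          # hypothetical name
spec:
  validationFailureAction: Enforce
  background: false                 # the rule reads admission request data
  rules:
  - name: check-matching-pdb        # hypothetical name
    match:
      any:
      - resources:
          kinds:
          - Deployment
          namespaces:
          - production
    context:
    # Count PDBs in the same namespace whose matchLabels select this Deployment's pods.
    - name: pdb_count
      apiCall:
        urlPath: "/apis/policy/v1/namespaces/{{request.namespace}}/poddisruptionbudgets"
        jmesPath: "items[?label_match(spec.selector.matchLabels, `{{request.object.spec.template.metadata.labels}}`)] | length(@)"
    validate:
      message: "No PodDisruptionBudget selects the pods of this Deployment."
      deny:
        conditions:
          any:
          - key: "{{ pdb_count }}"
            operator: LessThan
            value: 1
```

Because the lookup runs at admission time, the PDB has to be applied before the Deployment it protects.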

Practical Applications

  • Use Case: Implementing PodDisruptionBudgets with minAvailable: 1 for critical services to force GKE node drains to wait for healthy pod replacements.
  • Pitfall: Using default readiness probes that only check if a process has started rather than if internal caches or connections are established.
  • Use Case: Moving GKE maintenance windows to verified low-traffic periods, such as Saturday nights, to minimize impact of unforeseen upgrade issues.
  • Pitfall: Relying on fully automatic upgrades for production environments without pre-upgrade checks for workload disruption constraints.
