Init container cascade when every kubectl patch reverts in 10 seconds

A fanout service in a platform namespace became wedged in an Init:1/3 state with two replicas stuck and seven changes queued. The on-call engineer attempted to patch the deployment four times in twenty minutes, only to see every change revert within ten seconds.

Why This Matters

This incident highlights the danger of node-side enforcers acting as undocumented sources of truth that override standard Kubernetes API interactions. When infrastructure drift is managed by hidden scripts, supervisord programs, or systemd timers rather than integrated GitOps controllers like ArgoCD or Flux, every manual fix is silently reverted, and the resulting loop can mislead engineers into suspecting etcd corruption or API server failures. The cost of such architectural gaps is measured in prolonged MTTR: engineers chase symptoms across protocol layers (TCP timeouts, NXDOMAIN errors, and AMQP access refusals) while the actual culprit remains invisible to standard cluster auditing tools.

Key Insights

Node-side admission scripts can silently override Kubernetes API changes every 10 seconds, as seen in the /var/lib/apex/admission.sh incident (2026).
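Hardcoded ClusterIPs are brittle because a Service that is deleted and recreated usually receives a new address; the Redis init container failed with a TCP timeout because the service IP had changed from 10.43.181.44 to 10.43.218.92. Referencing the Service by DNS name avoids this.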
DNS naming errors in ConfigMaps, such as using ‘mongo’ instead of ‘mongodb’, result in immediate NXDOMAIN failures for init containers.
AMQP ACCESS_REFUSED errors often indicate a missing vhost rather than a credential failure, requiring management API enumeration via tools like rabbitmqadmin.
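Without a deadline, a transient network failure can hang a Pod in Init indefinitely: the kubelet only restarts an init container after it exits, so one that blocks forever is never retried. Note that activeDeadlineSeconds is a Pod-level and Job-level field, not a per-container one.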
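
The ClusterIP and deadline insights above can be combined in the init container itself. The following is a minimal sketch, not the service's actual init script: it probes a dependency by DNS name with a hard timeout per attempt and a fixed retry budget, so the container exits instead of blocking. The function names probe_tcp and wait_for are hypothetical, and the sketch assumes bash with /dev/tcp support plus GNU coreutils timeout.

```shell
#!/usr/bin/env bash
# Sketch: bounded dependency probe for an init container (hypothetical helpers).

# probe_tcp HOST PORT [TIMEOUT_SECONDS]
# Returns 0 if a TCP connection opens within the timeout, non-zero otherwise.
probe_tcp() {
  local host=$1 port=$2 limit=${3:-5}
  # bash's /dev/tcp pseudo-device opens a TCP socket; timeout(1) bounds the
  # attempt so an unreachable address cannot hang the container.
  timeout "$limit" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# wait_for HOST PORT ATTEMPTS [TIMEOUT_SECONDS]
# Retries the probe a fixed number of times, then returns 1 so the init
# container exits and the kubelet can restart it.
wait_for() {
  local host=$1 port=$2 attempts=$3 limit=${4:-5} i
  for ((i = 1; i <= attempts; i++)); do
    probe_tcp "$host" "$port" "$limit" && return 0
    ((i < attempts)) && sleep 1
  done
  echo "giving up on $host:$port after $attempts attempts" >&2
  return 1
}
```

An init container could then run something like wait_for redis.platform.svc.cluster.local 6379 20, where the Service name and namespace are assumptions for illustration. Because the probe resolves a name on every attempt, a recreated Service with a new ClusterIP is picked up without editing the manifest.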

Working Examples

Discovery of the supervisord-managed script reverting manual patches on the node.

$ ssh node-01 'ps -ef | grep admission'
root 1842 ... /usr/bin/supervisord -c /etc/supervisor/conf.d/admission.conf
root 2104 ... /bin/bash /var/lib/apex/admission.sh
$ ssh node-01 'cat /etc/supervisor/conf.d/admission.conf'
[program:admission]
command=/var/lib/apex/admission.sh
autorestart=true
startsecs=5
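
Before logging in to nodes, the revert itself can be confirmed from the API side. The helper below is a hedged sketch: holds_value is a hypothetical name, and it simply re-runs a read command several times and reports the first moment the observed value stops matching what was just patched.

```shell
#!/usr/bin/env bash
# Sketch: detect that a patched field is being reverted (hypothetical helper).

# holds_value EXPECTED CHECKS INTERVAL CMD [ARGS...]
# Runs CMD every INTERVAL seconds, CHECKS times. Returns 0 if its output
# equals EXPECTED every time, 1 as soon as it differs.
holds_value() {
  local expected=$1 checks=$2 interval=$3 i actual
  shift 3
  for ((i = 1; i <= checks; i++)); do
    actual=$("$@")
    if [[ $actual != "$expected" ]]; then
      echo "check $i: expected '$expected', saw '$actual'" >&2
      return 1
    fi
    ((i < checks)) && sleep "$interval"
  done
  return 0
}

# Example (deployment name, namespace, and NEW_IMAGE are assumptions):
#   holds_value "$NEW_IMAGE" 4 5 \
#     kubectl -n platform get deployment fanout \
#     -o jsonpath='{.spec.template.spec.containers[0].image}'
```

Four checks five seconds apart span more than one 10-second revert tick. If the value flips back, kubectl get with --show-managed-fields shows which field manager last wrote the object and when, which can point at a writer outside the usual controllers.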

A pre-deployment validation Job to reconcile RabbitMQ topology and ensure auditability.

apiVersion: batch/v1
kind: Job
metadata:
  name: topology-reconcile-2026-05-15
  labels:
    validation: predeploy
spec:
  activeDeadlineSeconds: 120
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: reconcile
        image: rabbitmq:3.13-management
        command: ["/bin/bash", "-c"]
        args:
        - |
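          set -euo pipefail
          # Assumes the image provides yq, jq, and curl, that /config/topology.yaml
          # is mounted, and that USER and PASS are injected (for example from a
          # Secret); none of that wiring is shown in this manifest.
          EXPECTED=$(yq '.bindings | length' /config/topology.yaml)
          # Read one compact JSON binding per line so values with spaces survive.
          while read -r b; do
            EX=$(jq -r '.exchange' <<<"$b")
            QU=$(jq -r '.queue' <<<"$b")
            # Quote the filter: unquoted, the shell strips the inner quotes and
            # jq parses .routing-key as a subtraction.
            RK=$(jq -r '."routing-key"' <<<"$b")
            # Target the broker Service; rabbitmqadmin defaults to localhost.
            rabbitmqadmin --host=rabbitmq --username="$USER" --password="$PASS" \
              declare binding source="$EX" destination="$QU" routing_key="$RK"
          done < <(yq -o=json '.bindings[]' /config/topology.yaml | jq -c .)
          # Ignore implicit default-exchange bindings (source "") when counting.
          ACTUAL=$(curl -sf -u "$USER:$PASS" http://rabbitmq:15672/api/bindings \
            | jq '[.[] | select(.source != "")] | length')
          [ "$ACTUAL" -ge "$EXPECTED" ] || exit 1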
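
Since ACCESS_REFUSED can mean a missing vhost rather than bad credentials, a reconcile step may want to confirm the vhost exists before declaring anything. The sketch below uses the RabbitMQ management HTTP API, where a vhost is addressed as /api/vhosts/NAME and the default vhost "/" must be percent-encoded as %2F. The helper names vhost_api_path and vhost_exists are hypothetical, and vhost names are assumed to be ASCII.

```shell
#!/usr/bin/env bash
# Sketch: check that a RabbitMQ vhost exists (hypothetical helpers).

# vhost_api_path NAME
# Prints the management-API path for a vhost, percent-encoding every
# character outside the URL-unreserved set (ASCII names assumed).
vhost_api_path() {
  local name=$1 out= c i
  local LC_ALL=C
  for ((i = 0; i < ${#name}; i++)); do
    c=${name:i:1}
    case $c in
      [a-zA-Z0-9._~-]) out+=$c ;;
      *) printf -v c '%%%02X' "'$c"; out+=$c ;;
    esac
  done
  printf '/api/vhosts/%s\n' "$out"
}

# vhost_exists BASE_URL USER PASS NAME
# Returns 0 when the API answers 200 for the vhost, 1 otherwise. A 404 means
# the vhost is missing; a 401 means the credentials were rejected.
vhost_exists() {
  local base=$1 user=$2 pass=$3 code
  code=$(curl -s -o /dev/null -w '%{http_code}' -u "$user:$pass" \
    "$base$(vhost_api_path "$4")")
  [[ $code == 200 ]]
}
```

A call such as vhost_exists http://rabbitmq:15672 "$USER" "$PASS" / separates the two failure modes: a missing vhost and a rejected login produce different status codes, whereas the AMQP client reports both as a refused connection.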

Practical Applications

Use a topology-reconcile Job (RabbitMQ) to ensure bindings match a YAML source of truth; avoids the pitfall of using kubectl exec, which lacks audit records.
Set activeDeadlineSeconds (120s for init, 600s for Pods) to fail fast; avoids the pitfall of indefinite Pod hangs during transient DNS hiccups.
Require two consecutive green health checks 20 seconds apart before declaring a rollout finished; avoids the pitfall of catching a Pod in a false green state between revert ticks.
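
The two-consecutive-green rule above can be sketched as a small gate. The name stable_green is hypothetical, and the health check is whatever command the rollout already trusts; the helper only adds the requirement that it pass twice in a row with a gap between runs.

```shell
#!/usr/bin/env bash
# Sketch: require two consecutive passing health checks (hypothetical helper).

# stable_green GAP_SECONDS MAX_CHECKS CMD [ARGS...]
# Runs CMD up to MAX_CHECKS times, GAP_SECONDS apart. Returns 0 once two
# checks in a row succeed, 1 if that never happens.
stable_green() {
  local gap=$1 max=$2 streak=0 i
  shift 2
  for ((i = 1; i <= max; i++)); do
    if "$@" >/dev/null 2>&1; then
      streak=$((streak + 1))
    else
      streak=0  # a red check resets the streak
    fi
    ((streak >= 2)) && return 0
    ((i < max)) && sleep "$gap"
  done
  return 1
}

# Example (deployment name and namespace are assumptions):
#   stable_green 20 6 kubectl -n platform rollout status \
#     deployment/fanout --timeout=5s
```

With a 20-second gap, the two checks straddle at least one 10-second revert tick, so a Pod that is only briefly green between reverts resets the streak instead of passing the gate.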
