Init container cascade when every kubectl patch reverts in 10 seconds
These articles are AI-generated summaries. Please check the original sources for full details.
A fanout service in a platform namespace became wedged in an Init:1/3 state with two replicas stuck and seven changes queued. The on-call engineer attempted to patch the deployment four times in twenty minutes, only to see every change revert within ten seconds.
Why This Matters
This incident highlights the danger of node-side enforcers that act as undocumented sources of truth and override changes made through the Kubernetes API. When drift is corrected by hidden scripts or systemd timers rather than by cluster-integrated GitOps controllers such as ArgoCD or Flux, the result is a revert loop that can mislead engineers into suspecting etcd corruption or API server failures. The cost of such architectural gaps shows up as prolonged MTTR: engineers chase symptoms across protocol layers (TCP timeouts, NXDOMAIN errors, AMQP access refusals) while the actual culprit stays invisible to standard cluster auditing tools.
Key Insights
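- Node-side scripts can silently override Kubernetes API changes on a roughly 10-second cycle, as seen in the /var/lib/apex/admission.sh incident (2026), where the "admission" logic was a bash script run by supervisord rather than a Kubernetes admission controller.
- Hardcoded ClusterIPs break when a Service is deleted and recreated: the Redis init container failed with a TCP timeout because the Service IP had changed from 10.43.181.44 to 10.43.218.92. Referencing the Service by its DNS name avoids this.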
- DNS naming errors in ConfigMaps, such as using ‘mongo’ instead of ‘mongodb’, result in immediate NXDOMAIN failures for init containers.
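- An AMQP ACCESS_REFUSED error can indicate a missing vhost rather than bad credentials, so confirm the vhost exists by enumerating vhosts through the management API, for example with rabbitmqadmin.
- Without a deadline, an init container blocked on a transient network failure can hang a Pod indefinitely: the process never exits, so the kubelet never sees a failure it could retry. activeDeadlineSeconds is a Pod- and Job-level field, not a per-container one, so a per-init bound has to live in the init container's own command (for example, a timeout around the wait).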
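The ClusterIP, DNS, and deadline points above combine naturally in an init-container wait step. What follows is a minimal bash sketch, not the incident team's actual script: it targets a Service by DNS name instead of a pinned IP, caps each connection attempt with coreutils timeout, and gives up after an overall budget so the container exits non-zero and the kubelet can restart it. The function name wait_for_tcp and the Service name redis.platform.svc.cluster.local are hypothetical; the 120-second budget mirrors the init bound quoted in this article.

```shell
#!/usr/bin/env bash
# wait_for_tcp HOST PORT DEADLINE_SECONDS  (hypothetical helper)
# Waits until HOST:PORT accepts a TCP connection, giving up after roughly
# DEADLINE_SECONDS. Uses bash's /dev/tcp and coreutils `timeout`.
wait_for_tcp() {
  local host="$1" port="$2" deadline="$3"
  local start=$SECONDS
  while [ $((SECONDS - start)) -lt "$deadline" ]; do
    # Cap each attempt at 2s so a blackholed IP cannot stall the loop.
    # HOST is resolved on every attempt, so a changed ClusterIP is picked up.
    if timeout 2 bash -c ": </dev/tcp/${host}/${port}" 2>/dev/null; then
      echo "reachable: ${host}:${port}"
      return 0
    fi
    sleep 1
  done
  echo "gave up on ${host}:${port} after ~${deadline}s" >&2
  return 1
}

# Example init-container entrypoint (Service name is a placeholder):
#   wait_for_tcp redis.platform.svc.cluster.local 6379 120 || exit 1
```

A non-zero exit is the point of the design: it turns a silent hang into a visible Init:Error or CrashLoopBackOff that the kubelet retries.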
Working Examples
Discovery of the supervisord-managed script reverting manual patches on the node.
$ ssh node-01 'ps -ef | grep admission'
root 1842 ... /usr/bin/supervisord -c /etc/supervisor/conf.d/admission.conf
root 2104 ... /bin/bash /var/lib/apex/admission.sh
$ ssh node-01 'cat /etc/supervisor/conf.d/admission.conf'
[program:admission]
command=/var/lib/apex/admission.sh
autorestart=true
startsecs=5
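Because the program is configured with autorestart=true, killing PID 2104 by hand would most likely just respawn it; supervisorctl stop admission is the route that keeps it down. To confirm from the API side that a node-side loop (and not the API server) is undoing patches, a small poller can watch a patched field for longer than the revert period. The helper below (value_holds) is a hypothetical bash sketch, assuming the field can be read with a single command; the namespace and deployment names in the usage comment are placeholders.

```shell
#!/usr/bin/env bash
# value_holds EXPECTED WINDOW_SECONDS POLL_SECONDS GETTER [ARGS...]
# Runs GETTER every POLL_SECONDS for WINDOW_SECONDS and compares its output
# with EXPECTED. Returns 0 if the value held throughout, 1 on the first drift,
# 2 on a bad poll interval.
value_holds() {
  local expected="$1" window="$2" poll="$3"
  shift 3
  if [ "$poll" -le 0 ]; then
    echo "poll interval must be > 0" >&2
    return 2
  fi
  local elapsed=0 observed
  while :; do
    observed="$("$@")"
    if [ "$observed" != "$expected" ]; then
      echo "drift after ~${elapsed}s: got '${observed}', want '${expected}'" >&2
      return 1
    fi
    [ "$elapsed" -ge "$window" ] && return 0
    sleep "$poll"
    elapsed=$((elapsed + poll))
  done
}

# Example after a patch (names are placeholders); a 30s window spans
# several 10s revert ticks:
#   value_holds 4 30 2 kubectl -n platform get deploy fanout \
#     -o jsonpath='{.spec.replicas}'
```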
A pre-deployment validation Job to reconcile RabbitMQ topology and ensure auditability.
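apiVersion: batch/v1
kind: Job
metadata:
  name: topology-reconcile-2026-05-15
  labels:
    validation: predeploy
spec:
  activeDeadlineSeconds: 120
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: reconcile
          # Assumes an image that ships rabbitmqadmin, yq (mikefarah v4), jq and curl;
          # the stock rabbitmq:3.13-management image may not include all of them.
          image: rabbitmq:3.13-management
          env:
            # Hypothetical Secret holding management-API credentials.
            - name: RABBITMQ_USER
              valueFrom:
                secretKeyRef: {name: rabbitmq-admin, key: username}
            - name: RABBITMQ_PASS
              valueFrom:
                secretKeyRef: {name: rabbitmq-admin, key: password}
            - name: RABBITMQ_VHOST
              value: "/"
          volumeMounts:
            - name: topology
              mountPath: /config
          command: ["/bin/bash", "-c"]
          args:
            - |
              set -euo pipefail
              ADMIN=(rabbitmqadmin --host rabbitmq --username "$RABBITMQ_USER" --password "$RABBITMQ_PASS" --vhost "$RABBITMQ_VHOST")
              EXPECTED=$(yq '.bindings | length' /config/topology.yaml)
              # One compact JSON object per line; read line by line to avoid word splitting.
              while IFS= read -r b; do
                EX=$(jq -r '.exchange' <<<"$b")
                QU=$(jq -r '.queue' <<<"$b")
                RK=$(jq -r '.["routing-key"]' <<<"$b")  # quoted: a bare ."routing-key" is mangled by the shell
                "${ADMIN[@]}" declare binding source="$EX" destination="$QU" routing_key="$RK"
              done < <(yq -o=json '.bindings[]' /config/topology.yaml | jq -c .)
              # Count explicit bindings in this vhost only, skipping default-exchange ones.
              VH=$(jq -rn --arg v "$RABBITMQ_VHOST" '$v|@uri')
              ACTUAL=$(curl -fsS -u "$RABBITMQ_USER:$RABBITMQ_PASS" "http://rabbitmq:15672/api/bindings/$VH" | jq '[.[] | select(.source != "")] | length')
              [ "$ACTUAL" -ge "$EXPECTED" ] || exit 1
      volumes:
        - name: topology
          configMap:
            name: rabbitmq-topology  # hypothetical ConfigMap containing topology.yaml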
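Since an ACCESS_REFUSED can point at a missing vhost, a pre-check before declaring any bindings makes that failure explicit. The sketch below is a hypothetical bash helper, assuming the management plugin is reachable over HTTP: it calls GET /api/vhosts/<name> on the RabbitMQ management API, which answers 200 for an existing vhost and 404 otherwise. The vhost name must be percent-encoded in the path, so the default vhost / becomes %2F. The host, credentials, and the vhost name fanout in the usage comment are placeholders.

```shell
#!/usr/bin/env bash
# urlencode STRING: percent-encode every character outside the RFC 3986
# unreserved set. ASCII-only sketch; multibyte characters are not handled.
urlencode() {
  local s="$1" out="" c i
  for (( i = 0; i < ${#s}; i++ )); do
    c="${s:i:1}"
    case "$c" in
      [A-Za-z0-9._~-]) out+="$c" ;;
      *) out+=$(printf '%%%02X' "'$c") ;;  # "'c" yields the character code
    esac
  done
  printf '%s\n' "$out"
}

# vhost_exists BASE_URL USER PASS VHOST (hypothetical helper)
# Returns 0 only on HTTP 200; 404, 401 or a connection failure return 1.
vhost_exists() {
  local base_url="$1" user="$2" pass="$3" vhost="$4" code
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 \
    -u "${user}:${pass}" "${base_url}/api/vhosts/$(urlencode "$vhost")") || true
  [ "$code" = "200" ]
}

# Example pre-check (placeholders):
#   vhost_exists http://rabbitmq:15672 "$RABBITMQ_USER" "$RABBITMQ_PASS" fanout \
#     || { echo "vhost 'fanout' is missing" >&2; exit 1; }
```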
Practical Applications
- Use a topology-reconcile Job to keep RabbitMQ bindings in line with a YAML source of truth; avoids the pitfall of ad-hoc kubectl exec fixes, which leave no reviewable record of what was changed.
- Bound every wait to fail fast: 120s for each init step (enforced in the init command) and activeDeadlineSeconds of 600s at the Pod level; avoids the pitfall of indefinite Pod hangs during transient DNS hiccups.
- Require two consecutive green health checks 20 seconds apart before declaring a rollout finished; avoids the pitfall of catching a Pod in a false green state between revert ticks.
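The two-green rule can be expressed as a small retry loop. This is a minimal bash sketch with a hypothetical function name (two_greens) and a placeholder health endpoint: it runs a check command repeatedly, INTERVAL seconds apart, resets its streak on any failure, and succeeds only after two consecutive passes. With a 20-second interval and a 10-second revert cycle, the two passes always straddle at least one revert tick.

```shell
#!/usr/bin/env bash
# two_greens INTERVAL MAX_ATTEMPTS CHECK [ARGS...]
# Succeeds only after CHECK passes twice in a row; any failure resets the
# streak. Gives up (returns 1) after MAX_ATTEMPTS runs of CHECK.
two_greens() {
  local interval="$1" max="$2"
  shift 2
  local streak=0 attempt
  for (( attempt = 1; attempt <= max; attempt++ )); do
    if "$@"; then
      streak=$((streak + 1))
      if [ "$streak" -ge 2 ]; then
        echo "green twice in a row (attempt ${attempt})"
        return 0
      fi
    else
      streak=0
    fi
    # No sleep after the final attempt.
    [ "$attempt" -lt "$max" ] && sleep "$interval"
  done
  echo "no two consecutive greens in ${max} attempts" >&2
  return 1
}

# Example gate for a rollout (endpoint is a placeholder):
#   two_greens 20 10 curl -fsS -o /dev/null \
#     http://fanout.platform.svc.cluster.local:8080/healthz
```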