ArgoCD Application Definitions, Sync Policies, and Health Checks
The Failure
The team enabled automated.prune: true on all applications without understanding the implications. A developer accidentally deleted a Service manifest from the infra repo. ArgoCD pruned the live Service. The checkout service became unreachable for 4 minutes until the deletion was reverted.
prune: true is correct for production. But it requires that the infra repo has branch protection, required reviews, and CI validation. Pruning without those safeguards turns a Git mistake into a production outage.
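Repository safeguards reduce the chance of an accidental deletion reaching main; a per-resource guard can limit the damage if one does. As a minimal sketch, ArgoCD's Prune=false sync option can be set as an annotation on resources that should never be deleted automatically. The Service name, selector, and ports below are hypothetical and not taken from the actual checkout manifests.

```yaml
# Hypothetical Service manifest; name, labels, and ports are assumptions.
apiVersion: v1
kind: Service
metadata:
  name: checkout-service
  namespace: production
  annotations:
    # ArgoCD skips this resource when pruning, even with automated.prune: true
    argocd.argoproj.io/sync-options: Prune=false
spec:
  selector:
    app.kubernetes.io/name: checkout-service
  ports:
    - name: http
      port: 80
      targetPort: 8080
```

With this annotation, a deleted manifest leaves the application OutOfSync instead of removing the live Service, so the mistake is visible without taking the service down.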
The Mechanism
Sync Policy Matrix
| Setting | Effect | When to Use |
|---|---|---|
| automated: {} | Sync on Git change, no prune, no self-heal | Dev environments |
| automated.prune: true | Delete resources removed from Git | All environments with branch protection |
| automated.selfHeal: true | Revert manual cluster changes | All environments (strongly recommended) |
| syncOptions.ApplyOutOfSyncOnly | Only apply changed resources | Large applications with many resources |
| syncOptions.ServerSideApply | Use server-side apply for conflict resolution | When multiple controllers manage the same resources |
| retry.limit: 5 | Retry failed syncs | Always (transient failures are common) |
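As a rough illustration of the first and last rows, a dev Application might carry a syncPolicy like the fragment below. This is a sketch of one field of the Application spec, not a complete manifest.

```yaml
# Fragment of an Application spec for a dev environment (sketch)
syncPolicy:
  automated: {}   # sync on Git change; prune and selfHeal stay off
  retry:
    limit: 5      # retry transient sync failures
```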
Sync Windows
Production should not sync during peak traffic hours. Sync windows restrict when ArgoCD can perform automated syncs.
The Implementation
Sync Window Configuration
# HARDENED: Restrict automated production syncs to business hours
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: ecommerce
  namespace: argocd
spec:
  syncWindows:
    # Allow syncs Monday-Friday 09:00-17:00 UTC
    - kind: allow
      schedule: "0 9 * * 1-5"
      duration: 8h
      applications: ["*-production"]
      namespaces: ["production"]
      manualSync: true  # Manual syncs remain possible outside this window
    # Deny syncs on weekends: starts Saturday 00:00 UTC, ends Monday 00:00 UTC.
    # Scheduling on "0,6" with 48h would start a second window on Sunday
    # that runs through Monday and blocks Monday's allow window.
    - kind: deny
      schedule: "0 0 * * 6"
      duration: 48h
      applications: ["*-production"]
      namespaces: ["production"]
      manualSync: true  # Manual syncs remain possible during this window
    # Emergency override: manualSync: true on the windows above permits
    # manual syncs at any time. Do not use an always-active allow window
    # (schedule "* * * * *") for this; it would also permit automated syncs.
Custom Health Check for External Resources
# ArgoCD ConfigMap: custom health checks
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  resource.customizations.health.argoproj.io_Rollout: |
    hs = {}
    if obj.status ~= nil then
      if obj.status.conditions ~= nil then
        for _, condition in ipairs(obj.status.conditions) do
          -- A paused rollout is waiting on a manual or timed step
          if condition.type == "Paused" and condition.status == "True" then
            hs.status = "Suspended"
            hs.message = condition.message
            return hs
          end
          -- An invalid spec cannot make progress without a fix in Git
          if condition.type == "InvalidSpec" then
            hs.status = "Degraded"
            hs.message = condition.message
            return hs
          end
        end
      end
      if obj.status.phase == "Healthy" then
        hs.status = "Healthy"
      elseif obj.status.phase == "Degraded" then
        hs.status = "Degraded"
      else
        hs.status = "Progressing"
      end
    else
      -- No status yet: the controller has not reconciled the resource
      hs.status = "Progressing"
    end
    return hs
  resource.customizations.health.bitnami.com_SealedSecret: |
    hs = {}
    if obj.status ~= nil then
      if obj.status.conditions ~= nil then
        for _, condition in ipairs(obj.status.conditions) do
          -- Synced=True means the controller has unsealed the Secret
          if condition.type == "Synced" and condition.status == "True" then
            hs.status = "Healthy"
            return hs
          end
        end
      end
    end
    hs.status = "Progressing"
    return hs
Notification Configuration
# HARDENED: ArgoCD notifications for Slack
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-notifications-cm
  namespace: argocd
data:
  trigger.on-sync-succeeded: |
    - when: app.status.operationState.phase in ['Succeeded']
      send: [sync-succeeded]
  trigger.on-sync-failed: |
    - when: app.status.operationState.phase in ['Error', 'Failed']
      send: [sync-failed]
  trigger.on-health-degraded: |
    - when: app.status.health.status == 'Degraded'
      send: [health-degraded]
  template.sync-succeeded: |
    message: |
      ✅ {{.app.metadata.name}} synced successfully
      Revision: {{.app.status.sync.revision | trunc 7}}
      Environment: {{.app.spec.destination.namespace}}
  template.sync-failed: |
    message: |
      ❌ {{.app.metadata.name}} sync failed
      Revision: {{.app.status.sync.revision | trunc 7}}
      Error: {{.app.status.operationState.message}}
  template.health-degraded: |
    message: |
      ⚠️ {{.app.metadata.name}} is degraded
      Namespace: {{.app.spec.destination.namespace}}
      Health: {{.app.status.health.message}}
  service.slack: |
    token: $slack-token
    signingSecret: $slack-signing-secret
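The $slack-token and $slack-signing-secret references resolve against keys in the argocd-notifications-secret Secret. A minimal sketch of that Secret follows. The values are placeholders, and in practice this Secret would likely be managed through a secrets tool (for example a SealedSecret) and not committed in plain text.

```yaml
# Sketch only: values are placeholders, not real credentials.
apiVersion: v1
kind: Secret
metadata:
  name: argocd-notifications-secret
  namespace: argocd
type: Opaque
stringData:
  slack-token: "<slack-bot-token>"
  slack-signing-secret: "<slack-signing-secret>"
```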
Application with All Settings
# HARDENED: Production application with all settings
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: checkout-service-production
  namespace: argocd
  annotations:
    notifications.argoproj.io/subscribe.on-sync-succeeded.slack: deploys
    notifications.argoproj.io/subscribe.on-sync-failed.slack: ci-alerts
    notifications.argoproj.io/subscribe.on-health-degraded.slack: ci-alerts
  labels:
    app.kubernetes.io/part-of: ecommerce
    environment: production
    team: checkout
spec:
  project: ecommerce
  source:
    repoURL: https://github.com/acme/ecommerce-infra.git
    targetRevision: main
    path: apps/checkout-service/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true
      - ApplyOutOfSyncOnly=true
      - PruneLast=true
      - RespectIgnoreDifferences=true
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas
    - group: autoscaling
      kind: HorizontalPodAutoscaler
      jqPathExpressions:
        - .status
The Gate
ArgoCD itself is the gate. A sync that results in unhealthy pods is reported as Degraded. The notification system alerts the team. The retry configuration handles transient failures (API server timeouts, etcd latency).
The PruneLast=true sync option ensures resources are deleted only after all other resources in the sync have been applied and are healthy. This prevents deleting a Service before the replacement Deployment is ready.
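PruneLast can also be scoped to individual resources through an annotation, without enabling it for the whole application. A minimal sketch, assuming a hypothetical ConfigMap that should only be removed at the end of a sync:

```yaml
# Hypothetical resource; the name and data are assumptions for illustration.
apiVersion: v1
kind: ConfigMap
metadata:
  name: checkout-legacy-config
  namespace: production
  annotations:
    # Prune this resource only in the final step of the sync
    argocd.argoproj.io/sync-options: PruneLast=true
data:
  placeholder: "true"
```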
The Recovery
Sync fails repeatedly: Check argocd app get checkout-service-production for the error. Common causes: invalid manifests, permission errors (RBAC), resource quota exceeded.
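Application stuck in Progressing: Pods are not becoming ready (CrashLoopBackOff, failing readiness probes, or unschedulable pods) and the Deployment's progressDeadlineSeconds has not yet elapsed; once it does, the application is reported as Degraded. Check pod events and logs.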
Need to sync outside the sync window: Use manual sync (argocd app sync checkout-service-production). Sync windows that set manualSync: true allow manual syncs at any time while still blocking automated syncs.