ArgoCD Application Definitions, Sync Policies, and Health Checks

The Failure

The team enabled automated.prune: true on all applications without understanding the implications. A developer accidentally deleted a Service manifest from the infra repo. ArgoCD pruned the live Service. The checkout service became unreachable for 4 minutes until the deletion was reverted.

prune: true is the right setting for production, but only when the infra repo has branch protection, required reviews, and CI validation. Without those safeguards, pruning turns a Git mistake into a production outage.
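
As an extra guard alongside those repository controls, ArgoCD supports a resource-level sync option that exempts a single object from pruning. The sketch below is illustrative only: the Service name, selector, and ports are assumptions, not details from the incident.

```yaml
# Sketch: opt one critical resource out of automated pruning.
# Name, selector, and ports are hypothetical.
apiVersion: v1
kind: Service
metadata:
  name: checkout-service
  namespace: production
  annotations:
    # Resource-level sync option: ArgoCD skips pruning this object
    # even when the Application sets automated.prune: true.
    argocd.argoproj.io/sync-options: Prune=false
spec:
  selector:
    app.kubernetes.io/name: checkout-service
  ports:
    - port: 80
      targetPort: 8080
```

With this annotation in place, removing the manifest from Git leaves the live Service running, and the Application reports OutOfSync until someone deletes the object deliberately. The trade-off is that intentional removals now need a manual step.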

The Mechanism

Sync Policy Matrix

Setting	Effect	When to Use
`automated: {}`	Sync on Git change, no prune, no self-heal	Dev environments
`automated.prune: true`	Delete resources removed from Git	All environments with branch protection
`automated.selfHeal: true`	Revert manual cluster changes	All environments (strongly recommended)
`syncOptions.ApplyOutOfSyncOnly`	Only apply changed resources	Large applications with many resources
`syncOptions.ServerSideApply`	Use server-side apply for conflict resolution	When multiple controllers manage the same resources
`retry.limit: 5`	Retry failed syncs	Always (transient failures are common)
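
The dotted names in the table are shorthand. As a rough sketch of how the dev-environment row might look in an Application spec (a fragment only, and the choice of options is an assumption for illustration), sync options are written as Name=value strings under syncOptions:

```yaml
# Sketch: syncPolicy fragment for a dev Application (goes under spec).
syncPolicy:
  automated: {}              # sync on Git change; prune and selfHeal stay off
  syncOptions:
    - ApplyOutOfSyncOnly=true  # only apply resources that differ from Git
  retry:
    limit: 5                 # retry failed syncs up to five times
```

Adding prune: true or selfHeal: true under automated moves the same fragment toward the production rows of the table.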

Sync Windows

Production should not sync during peak traffic hours. Sync windows restrict when ArgoCD can perform automated syncs.

The Implementation

Sync Window Configuration

# HARDENED: Restrict production syncs to business hours
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: ecommerce
  namespace: argocd
spec:
  syncWindows:
    # Allow syncs Monday-Friday 09:00-17:00 UTC.
    # manualSync: true permits manual syncs while this window is inactive.
    - kind: allow
      schedule: "0 9 * * 1-5"
      duration: 8h
      applications: ["*-production"]
      namespaces: ["production"]
      manualSync: true

    # Deny syncs on weekends: starts Saturday 00:00 UTC and lasts 48h.
    - kind: deny
      schedule: "0 0 * * 6"
      duration: 48h
      applications: ["*-production"]
      namespaces: ["production"]
      manualSync: true # Emergency override: manual syncs are still allowed

Custom Health Check for External Resources

# ArgoCD ConfigMap: custom health checks
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  resource.customizations.health.argoproj.io_Rollout: |
    local hs = {}
    -- Default while the controller has not populated status yet.
    hs.status = "Progressing"
    if obj.status ~= nil then
      if obj.status.conditions ~= nil then
        for _, condition in ipairs(obj.status.conditions) do
          if condition.type == "Paused" and condition.status == "True" then
            hs.status = "Suspended"
            hs.message = condition.message
            return hs
          end
          if condition.type == "InvalidSpec" then
            hs.status = "Degraded"
            hs.message = condition.message
            return hs
          end
        end
      end
      if obj.status.phase == "Healthy" then
        hs.status = "Healthy"
      elseif obj.status.phase == "Degraded" then
        hs.status = "Degraded"
      else
        hs.status = "Progressing"
      end
    end
    return hs

  resource.customizations.health.bitnami.com_SealedSecret: |
    hs = {}
    if obj.status ~= nil then
      if obj.status.conditions ~= nil then
        for _, condition in ipairs(obj.status.conditions) do
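          if condition.type == "Synced" and condition.status == "True" then
            hs.status = "Healthy"
            return hs
          end
          -- A failed unseal will not resolve on its own; report it as Degraded.
          if condition.type == "Synced" and condition.status == "False" then
            hs.status = "Degraded"
            hs.message = condition.message
            return hs
          end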
        end
      end
    end
    hs.status = "Progressing"
    return hs

Notification Configuration

# HARDENED: ArgoCD notifications for Slack
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-notifications-cm
  namespace: argocd
data:
  trigger.on-sync-succeeded: |
    - when: app.status.operationState.phase in ['Succeeded']
      send: [sync-succeeded]
  trigger.on-sync-failed: |
    - when: app.status.operationState.phase in ['Error', 'Failed']
      send: [sync-failed]
  trigger.on-health-degraded: |
    - when: app.status.health.status == 'Degraded'
      send: [health-degraded]

  template.sync-succeeded: |
    message: |
      ✅ {{.app.metadata.name}} synced successfully
      Revision: {{.app.status.sync.revision | trunc 7}}
      Environment: {{.app.spec.destination.namespace}}

  template.sync-failed: |
    message: |
      ❌ {{.app.metadata.name}} sync failed
      Revision: {{.app.status.sync.revision | trunc 7}}
      Error: {{.app.status.operationState.message}}

  template.health-degraded: |
    message: |
      ⚠️ {{.app.metadata.name}} is degraded
      Namespace: {{.app.spec.destination.namespace}}
      Health: {{.app.status.health.message}}

  service.slack: |
    token: $slack-token
    signingSecret: $slack-signing-secret

Application with All Settings

# HARDENED: Production application with all settings
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: checkout-service-production
  namespace: argocd
  annotations:
    notifications.argoproj.io/subscribe.on-sync-succeeded.slack: deploys
    notifications.argoproj.io/subscribe.on-sync-failed.slack: ci-alerts
    notifications.argoproj.io/subscribe.on-health-degraded.slack: ci-alerts
  labels:
    app.kubernetes.io/part-of: ecommerce
    environment: production
    team: checkout
spec:
  project: ecommerce
  source:
    repoURL: https://github.com/acme/ecommerce-infra.git
    targetRevision: main
    path: apps/checkout-service/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true
      - ApplyOutOfSyncOnly=true
      - PruneLast=true
      - RespectIgnoreDifferences=true
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas
    - group: autoscaling
      kind: HorizontalPodAutoscaler
      jqPathExpressions:
        - .status

The Gate

ArgoCD itself is the gate. A sync that results in unhealthy pods is reported as Degraded. The notification system alerts the team. The retry configuration handles transient failures (API server timeouts, etcd latency).

PruneLast=true makes pruning the final step of a sync: resources are deleted only after everything else in the sync has been applied and is healthy. This prevents deleting a Service before the replacement Deployment is ready.
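
PruneLast can also be scoped to a single resource through an annotation instead of the whole Application. A minimal sketch, assuming a hypothetical ConfigMap that is scheduled for removal:

```yaml
# Sketch: resource-level PruneLast. The ConfigMap name and data are hypothetical.
apiVersion: v1
kind: ConfigMap
metadata:
  name: checkout-legacy-config
  namespace: production
  annotations:
    # When this object is removed from Git, prune it only after the
    # rest of the sync has been applied and is healthy.
    argocd.argoproj.io/sync-options: PruneLast=true
data:
  FEATURE_FLAGS: "legacy"
```

The annotation form is useful when only a few objects need the ordering guarantee and the Application-wide option is not set.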

The Recovery

Sync fails repeatedly: Check argocd app get checkout-service-production for the error. Common causes: invalid manifests, permission errors (RBAC), resource quota exceeded.

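Application stuck in Progressing: Pods are not becoming ready (CrashLooping, failing readiness probes, or unschedulable) and the Deployment's progressDeadlineSeconds has not yet elapsed. Once the deadline passes, the Deployment is marked as failed and ArgoCD reports Degraded. Check pod events and logs.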
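
For reference, progressDeadlineSeconds lives on the Deployment spec and defaults to 600 seconds in Kubernetes. The sketch below shows where it sits; the image, labels, probe path, and the 300-second value are all assumptions chosen for illustration.

```yaml
# Sketch: Deployment with an explicit progress deadline.
# Image, labels, and probe settings are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-service
  namespace: production
spec:
  # How long a rollout may make no progress before Kubernetes
  # marks it as failed.
  progressDeadlineSeconds: 300
  selector:
    matchLabels:
      app.kubernetes.io/name: checkout-service
  template:
    metadata:
      labels:
        app.kubernetes.io/name: checkout-service
    spec:
      containers:
        - name: checkout
          image: registry.example.com/checkout-service:1.0.0
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
```

A shorter deadline surfaces a broken rollout sooner, while a value that is too short can flag slow-starting services as failed, so it is worth tuning per service.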

Need to sync outside the sync window: Run a manual sync (argocd app sync checkout-service-production). Windows that set manualSync: true permit manual syncs at any time while still blocking automated ones.