Canary Deployments with Argo Rollouts and Locust Validation

The Failure

The checkout team deployed a new version using a canary strategy with manual observation. They set the canary to 10% traffic, watched Grafana dashboards for 5 minutes, saw no obvious errors, and promoted to 100%. The new version had a memory leak that only manifested under sustained load. At 10% traffic, the leak was invisible. At 100%, the service was OOM-killed after 45 minutes.

Automated analysis with Locust would have caught this. A 5-minute Locust run against the canary at realistic load would have shown memory consumption growing linearly. The analysis template would have detected the anomaly and aborted the rollout.

The Mechanism

Canary with Automated Analysis

Argo Rollouts manages the canary lifecycle:

1. Create canary pods with the new image
2. Route a percentage of traffic to canary pods
3. Run analysis templates at each step
4. If analysis passes, increase traffic percentage
5. If analysis fails, abort and route all traffic back to stable
6. Repeat until 100% or failure
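
The loop above can be sketched in a few lines of Python. This is an illustrative model only, not how Argo Rollouts is implemented: `run_canary`, the weight list, and the `analyze` callback are hypothetical names chosen for the example.

```python
from typing import Callable, List, Tuple

def run_canary(weights: List[int],
               analyze: Callable[[int], bool]) -> Tuple[str, int]:
    """Walk through traffic weights; abort on the first failed analysis.

    Returns (outcome, final_canary_weight).
    """
    for weight in weights:
        # Route `weight` percent of traffic to the canary, then analyze.
        if not analyze(weight):
            # Analysis failed: all traffic goes back to stable.
            return ("aborted", 0)
    # Every step passed: the canary becomes the new stable version.
    return ("promoted", 100)

# Hypothetical analysis that only fails under heavier load.
print(run_canary([5, 20, 50], lambda w: w < 50))  # ('aborted', 0)
print(run_canary([5, 20, 50], lambda w: True))    # ('promoted', 100)
```

The first call models the failure described earlier: the problem is invisible at low weights and only surfaces once the canary carries more traffic.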

Locust as a Canary Validator

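Locust generates realistic load against the canary during the analysis phase. The Locust results (response times, error rates, throughput) are pushed to the Prometheus Pushgateway, which Prometheus scrapes. Pass/fail comes from two places: the Job-based AnalysisTemplate succeeds or fails with the Locust Job itself, and the Prometheus-based AnalysisTemplates query service metrics while the load is running.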

The Locust run is not a full performance test. It is a targeted validation: send realistic traffic patterns to the canary version for the duration of the analysis window and verify that the canary’s behavior matches the stable version’s baseline.
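
As a rough illustration of what "matches the stable version's baseline" could mean in practice, the sketch below compares two sets of load statistics. The `LoadStats` type, the function name, and both tolerance values are assumptions made for this example, not values taken from the pipeline described here.

```python
from dataclasses import dataclass

@dataclass
class LoadStats:
    p95_ms: float      # 95th percentile response time in milliseconds
    error_rate: float  # failed requests / total requests

def matches_baseline(canary: LoadStats, stable: LoadStats,
                     latency_tolerance: float = 1.2,
                     max_error_delta: float = 0.005) -> bool:
    """True if the canary stays within the assumed tolerances of stable."""
    latency_ok = canary.p95_ms <= stable.p95_ms * latency_tolerance
    errors_ok = canary.error_rate - stable.error_rate <= max_error_delta
    return latency_ok and errors_ok

stable = LoadStats(p95_ms=180.0, error_rate=0.002)
print(matches_baseline(LoadStats(200.0, 0.003), stable))  # True
print(matches_baseline(LoadStats(260.0, 0.003), stable))  # False
```

A relative comparison like this avoids hard-coding absolute latency numbers, at the cost of needing a reliable stable baseline for the same time window.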

The Implementation

Argo Rollout with Locust Analysis Steps

# HARDENED: Canary rollout with Locust validation at each stage
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout-service
  namespace: production
spec:
  replicas: 5
  strategy:
    canary:
      canaryService: checkout-canary
      stableService: checkout-stable
      trafficRouting:
        nginx:
          stableIngress: checkout-ingress
      steps:
        # Stage 1: 5% traffic, basic health
        - setWeight: 5
        - pause: { duration: 1m }
        - analysis:
            templates:
              - templateName: canary-health
            args:
              - name: canary-service
                value: checkout-canary

        # Stage 2: 20% traffic, Locust validation
        - setWeight: 20
        - pause: { duration: 2m }
        - analysis:
            templates:
              - templateName: locust-canary-validation
              - templateName: canary-error-rate
            args:
              - name: canary-service
                value: checkout-canary
              - name: stable-service
                value: checkout-stable

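        # Stage 3: 50% traffic, full analysis
        - setWeight: 50
        - pause: { duration: 3m }
        - analysis:
            templates:
              - templateName: locust-canary-validation
              - templateName: canary-error-rate
              - templateName: canary-latency
              - templateName: canary-memory
            # The templates declare these args without defaults, so they
            # must be supplied here just as in stage 2
            args:
              - name: canary-service
                value: checkout-canary
              - name: stable-service
                value: checkout-stable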

        # Stage 4: Full promotion
        - setWeight: 100
  # Spec-level settings: fast rollback window and AnalysisRun retention
  rollbackWindow:
    revisions: 3
  analysis:
    successfulRunHistoryLimit: 3
    unsuccessfulRunHistoryLimit: 3
  selector:
    matchLabels:
      app: checkout-service
  template:
    metadata:
      labels:
        app: checkout-service
    spec:
      containers:
        - name: checkout
          image: ghcr.io/acme/checkout-service:NEW_SHA
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 1Gi

Locust Analysis Template

# HARDENED: Locust-based canary analysis
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: locust-canary-validation
spec:
  args:
    - name: canary-service
    - name: stable-service
  metrics:
    - name: locust-validation
      interval: 60s
      count: 3
      failureLimit: 1
      provider:
        job:
          spec:
            backoffLimit: 0
            template:
              spec:
                restartPolicy: Never
                containers:
                  - name: locust
                    image: ghcr.io/acme/locust-suite:latest
                    env:
                      - name: TARGET_HOST
                        value: "http://{{args.canary-service}}.production.svc.cluster.local"
                      - name: USERS
                        value: "20"
                      - name: SPAWN_RATE
                        value: "5"
                      - name: RUN_TIME
                        value: "45s"
                      - name: LOCUST_FILE
                        value: "checkout_flow.py"
                      - name: PUSHGATEWAY_URL
                        value: "http://prometheus-pushgateway.monitoring:9091"
                    command:
                      - /bin/sh
                      - -c
                      - |
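                        # Run Locust and remember its exit code; without this,
                        # the Job would report the exit code of the push step
                        locust -f "$LOCUST_FILE" \
                          --headless \
                          --host="$TARGET_HOST" \
                          --users="$USERS" \
                          --spawn-rate="$SPAWN_RATE" \
                          --run-time="$RUN_TIME" \
                          --csv=/tmp/results \
                          --exit-code-on-error 1
                        LOCUST_EXIT=$?

                        # Push results to Prometheus even when Locust failed
                        python3 push_metrics.py \
                          --csv=/tmp/results \
                          --pushgateway="$PUSHGATEWAY_URL" \
                          --labels="service={{args.canary-service}},type=canary"

                        # Fail the Job (and the analysis) if Locust failed
                        exit $LOCUST_EXIT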
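
The template calls push_metrics.py, which is not shown here. As a rough idea of what the conversion step of such a script might look like, the sketch below turns the aggregated row of a Locust stats CSV into Prometheus text exposition format. Everything in it is an assumption: the function name and metric names are hypothetical, and it relies on the stats CSV having "Name", "Request Count", and "Failure Count" columns with a final row named "Aggregated".

```python
import csv
import io

def stats_to_exposition(stats_csv: str) -> str:
    """Convert the 'Aggregated' row of Locust stats CSV text into
    Prometheus text exposition format (hypothetical metric names)."""
    for row in csv.DictReader(io.StringIO(stats_csv)):
        if row["Name"] == "Aggregated":
            requests = int(row["Request Count"])
            failures = int(row["Failure Count"])
            # Guard against division by zero when no requests were sent.
            ratio = failures / requests if requests else 0.0
            return (
                f"locust_requests_total {requests}\n"
                f"locust_failures_total {failures}\n"
                f"locust_failure_ratio {ratio}\n"
            )
    raise ValueError("no Aggregated row found in stats CSV")

# Trimmed sample with only the columns this sketch reads.
sample = (
    "Type,Name,Request Count,Failure Count\n"
    "GET,/api/products/:id,90,0\n"
    ",Aggregated,100,2\n"
)
print(stats_to_exposition(sample))
```

The resulting text could then be sent to the Pushgateway with an HTTP POST or PUT to a path of the form /metrics/job/<job_name>, with the labels from the --labels flag appended as additional path segments.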

Locust Checkout Flow for Canary

# locust-suite/checkout_flow.py
# HARDENED: Realistic checkout flow for canary validation
from locust import HttpUser, task, between

class CheckoutUser(HttpUser):
    wait_time = between(1, 3)

    @task(5)
    def browse_product(self):
        self.client.get("/api/products/42",
                       name="/api/products/:id")

    @task(3)
    def search_products(self):
        self.client.get("/api/products?q=wireless&limit=20",
                       name="/api/products?q=...")

    @task(1)
    def checkout_flow(self):
        # Add to cart
        self.client.post("/api/cart/items",
                        json={"productId": 42, "quantity": 1},
                        name="/api/cart/items")

        # Create checkout
        response = self.client.post("/api/checkout",
                                   json={"paymentMethod": "card_test"},
                                   name="/api/checkout")

        if response.status_code == 201:
            order_id = response.json().get("orderId")
            # Check order status only if the response carried an order ID
            if order_id is not None:
                self.client.get(f"/api/orders/{order_id}",
                                name="/api/orders/:id")

Error Rate Analysis Template

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: canary-error-rate
spec:
  args:
    - name: canary-service
  metrics:
    - name: error-rate
      interval: 30s
      count: 5
      successCondition: result[0] < 0.01
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
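          # "or vector(0)" keeps the ratio defined when the canary has
          # served no 5xx responses (the numerator would otherwise be empty)
          query: |
            (
              sum(rate(http_requests_total{
                service="{{args.canary-service}}",
                code=~"5.."}[2m]))
              or vector(0)
            )
            /
            sum(rate(http_requests_total{
              service="{{args.canary-service}}"}[2m]))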

Memory Analysis Template

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: canary-memory
spec:
  args:
    - name: canary-service
  metrics:
    - name: memory-growth
      interval: 60s
      count: 3
      successCondition: result[0] < 0.8
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
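          # Assumes canary pod names start with the canary-service value.
          # Pods created by a Rollout are named <rollout-name>-<hash>-<id>
          # (here checkout-service-...), so adjust the arg or the regex
          # if this selector matches no pods.
          query: |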
            max(container_memory_working_set_bytes{
              pod=~"{{args.canary-service}}.*",
              namespace="production"})
            /
            max(kube_pod_container_resource_limits{
              pod=~"{{args.canary-service}}.*",
              namespace="production",
              resource="memory"})
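
The query above checks a utilization snapshot against 80% of the limit. Because the original failure was a leak that grew steadily, one complementary check is to look at the trend rather than the level. The sketch below fits a least-squares slope to evenly spaced memory samples; the function name, sample data, and any threshold applied to the result are hypothetical. In Prometheus itself, functions such as `deriv()` or `predict_linear()` serve a similar purpose.

```python
from typing import Sequence

def growth_per_minute(samples_mib: Sequence[float],
                      interval_s: float = 60.0) -> float:
    """Least-squares slope of evenly spaced memory samples, in MiB/minute."""
    n = len(samples_mib)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples_mib) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples_mib))
    var = sum((i - mean_x) ** 2 for i in range(n))
    slope_per_sample = cov / var
    # Convert from "per sample" to "per minute".
    return slope_per_sample * 60.0 / interval_s

leaking = [400, 420, 440, 460, 480]  # +20 MiB per 60 s sample
steady = [400, 402, 399, 401, 400]   # noise around a flat level
print(growth_per_minute(leaking))  # 20.0
print(growth_per_minute(steady))
```

A leak can sit well below 80% of the limit during a short analysis window while still showing a clearly positive slope, which is why a trend check can catch what a level check misses.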

The Gate

The canary promotion proceeds only if all analysis templates pass at each step. The Locust validation generates realistic load against the canary. If the canary’s error rate exceeds 1%, latency degrades, or memory usage exceeds 80% of limits, the analysis fails.

At stage 2 (20% traffic), both Locust validation and error rate analysis must pass. At stage 3 (50% traffic), Locust, error rate, latency, and memory analysis all must pass. This progressive gating catches issues that only manifest under higher load.

failureLimit is the number of failed measurements an analysis tolerates before the whole analysis fails. Error rate tolerates two failures out of five measurements (failureLimit: 2) to account for transient spikes. Memory tolerates only one failure out of three (failureLimit: 1): memory leaks are progressive, so a second high reading is a strong signal rather than noise.
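
The failureLimit behavior described above can be modeled in a few lines. This is a simplified sketch with a hypothetical function name; it ignores inconclusive and errored measurements, which Argo Rollouts tracks separately.

```python
from typing import Iterable

def analysis_passes(measurements: Iterable[bool], failure_limit: int) -> bool:
    """Model of failureLimit: the run fails once failures exceed the limit."""
    failures = 0
    for ok in measurements:
        if not ok:
            failures += 1
            if failures > failure_limit:
                return False
    return True

# Error rate: count=5, failureLimit=2 -> two bad samples are tolerated.
print(analysis_passes([True, False, True, False, True], failure_limit=2))  # True
# Memory: count=3, failureLimit=1 -> the second bad sample fails the run.
print(analysis_passes([True, False, False], failure_limit=1))              # False
```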

The Recovery

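Canary analysis fails at 5%: Argo Rollouts scales canary to zero and routes all traffic to stable. Inspect the AnalysisRun to see which metric failed, for example with kubectl argo rollouts get rollout checkout-service -n production, or in the Argo CD UI if the Rollout is deployed through Argo CD. Fix and redeploy.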

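Canary passes all stages but issues appear after 100%: Use kubectl argo rollouts undo checkout-service -n production to revert to the previous revision. Argo Rollouts retains old ReplicaSets (up to revisionHistoryLimit), and rollbackWindow.revisions: 3 lets a rollback to any of the last three revisions skip the canary steps and analysis, so the revert takes effect quickly.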

Locust validation is flaky: The Locust scenario is too aggressive or the canary has insufficient resources. Reduce the user count, lower the spawn rate, or increase canary pod resources. Flaky validation is worse than no validation because it trains the team to ignore failures.