Canary Deployments with Argo Rollouts and Locust Validation
The Failure
The checkout team deployed a new version using a canary strategy with manual observation. They set the canary to 10% traffic, watched Grafana dashboards for 5 minutes, saw no obvious errors, and promoted to 100%. The new version had a memory leak that only manifested under sustained load. At 10% traffic, the leak was invisible. At 100%, the service OOM-killed after 45 minutes.
Automated analysis with Locust would have caught this. A 5-minute Locust run against the canary at realistic load would have shown memory consumption growing linearly. The analysis template would have detected the anomaly and aborted the rollout.
The Mechanism
Canary with Automated Analysis
Argo Rollouts manages the canary lifecycle:
- Create canary pods with the new image
- Route a percentage of traffic to canary pods
- Run analysis templates at each step
- If analysis passes, increase traffic percentage
- If analysis fails, abort and route all traffic back to stable
- Repeat until 100% or failure
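The lifecycle above can be sketched as a small loop. This is an illustrative model only, not how the Argo Rollouts controller is implemented: the Step class, the run_canary function, and the leaky analysis callback are hypothetical names, and the step weights and template names are taken from the Rollout shown later in this article.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    weight: int                 # percentage of traffic sent to the canary
    templates: Tuple[str, ...]  # analysis templates that must pass at this weight


def run_canary(steps: List[Step],
               analyze: Callable[[str, int], bool]) -> Tuple[str, int]:
    """Walk the steps in order and abort on the first failed analysis.

    Returns (outcome, final canary weight).
    """
    for step in steps:
        # Traffic is shifted to step.weight, then every template must pass.
        for template in step.templates:
            if not analyze(template, step.weight):
                # Abort: all traffic is routed back to stable.
                return ("aborted", 0)
    return ("promoted", 100)


steps = [
    Step(5, ("canary-health",)),
    Step(20, ("locust-canary-validation", "canary-error-rate")),
    Step(50, ("locust-canary-validation", "canary-error-rate",
              "canary-latency", "canary-memory")),
]


def leaky(template: str, weight: int) -> bool:
    # Hypothetical outcome: the memory check fails once it runs at 50%.
    return not (template == "canary-memory" and weight >= 50)


print(run_canary(steps, leaky))               # ('aborted', 0)
print(run_canary(steps, lambda t, w: True))   # ('promoted', 100)
```

The point of the model is that promotion is the default only when every analysis at every weight returns a pass; a single failed template at any step ends the rollout.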
Locust as a Canary Validator
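Locust generates realistic load against the canary during the analysis phase. The Locust results (response times, error rates, throughput) are pushed to the Prometheus Pushgateway, which Prometheus scrapes. The Prometheus-backed AnalysisTemplates query those metrics to determine pass/fail, while the Locust Job itself passes or fails on its exit code.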
The Locust run is not a full performance test. It is a targeted validation: send realistic traffic patterns to the canary version for the duration of the analysis window and verify that the canary’s behavior matches the stable version’s baseline.
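A minimal sketch of what "matches the stable version's baseline" could mean in practice. The LoadStats class, the baseline_violations function, the tolerance values, and all numbers are assumptions made for illustration; the article does not define specific baseline tolerances.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LoadStats:
    p95_ms: float      # 95th percentile response time in milliseconds
    error_rate: float  # failed requests divided by total requests


def baseline_violations(canary: LoadStats, stable: LoadStats,
                        latency_tolerance: float = 0.20,
                        error_margin: float = 0.005) -> List[str]:
    """Return the reasons the canary deviates from stable (empty list = pass)."""
    violations = []
    if canary.p95_ms > stable.p95_ms * (1 + latency_tolerance):
        violations.append("p95 latency above baseline tolerance")
    if canary.error_rate > stable.error_rate + error_margin:
        violations.append("error rate above baseline margin")
    return violations


# Hypothetical numbers for illustration only.
stable = LoadStats(p95_ms=180.0, error_rate=0.002)
healthy_canary = LoadStats(p95_ms=190.0, error_rate=0.003)
slow_canary = LoadStats(p95_ms=260.0, error_rate=0.003)

print(baseline_violations(healthy_canary, stable))  # []
print(baseline_violations(slow_canary, stable))     # ['p95 latency above baseline tolerance']
```

Comparing against the stable baseline rather than a fixed number keeps the check meaningful when overall latency shifts for reasons unrelated to the release.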
The Implementation
Argo Rollout with Locust Analysis Steps
# HARDENED: Canary rollout with Locust validation at each stage
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout-service
  namespace: production
spec:
  replicas: 5
  strategy:
    canary:
      canaryService: checkout-canary
      stableService: checkout-stable
      trafficRouting:
        nginx:
          stableIngress: checkout-ingress
      steps:
        # Stage 1: 5% traffic, basic health
        - setWeight: 5
        - pause: { duration: 1m }
        - analysis:
            templates:
              - templateName: canary-health
            args:
              - name: canary-service
                value: checkout-canary
        # Stage 2: 20% traffic, Locust validation
        - setWeight: 20
        - pause: { duration: 2m }
        - analysis:
            templates:
              - templateName: locust-canary-validation
              - templateName: canary-error-rate
            args:
              - name: canary-service
                value: checkout-canary
              - name: stable-service
                value: checkout-stable
        # Stage 3: 50% traffic, full analysis
        - setWeight: 50
        - pause: { duration: 3m }
        - analysis:
            templates:
              - templateName: locust-canary-validation
              - templateName: canary-error-rate
              - templateName: canary-latency
              - templateName: canary-memory
            # The templates declare these args without defaults,
            # so they must be supplied here as well
            args:
              - name: canary-service
                value: checkout-canary
              - name: stable-service
                value: checkout-stable
        # Stage 4: Full promotion
        - setWeight: 100
  rollbackWindow:
    revisions: 3
  analysis:
    successfulRunHistoryLimit: 3
    unsuccessfulRunHistoryLimit: 3
  selector:
    matchLabels:
      app: checkout-service
  template:
    metadata:
      labels:
        app: checkout-service
    spec:
      containers:
        - name: checkout
          image: ghcr.io/acme/checkout-service:NEW_SHA
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 1Gi
Locust Analysis Template
# HARDENED: Locust-based canary analysis
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: locust-canary-validation
spec:
  args:
    - name: canary-service
    - name: stable-service
  metrics:
    - name: locust-validation
      interval: 60s
      count: 3
      failureLimit: 1
      provider:
        job:
          spec:
            backoffLimit: 0
            template:
              spec:
                restartPolicy: Never
                containers:
                  - name: locust
                    image: ghcr.io/acme/locust-suite:latest
                    env:
                      - name: TARGET_HOST
                        value: "http://{{args.canary-service}}.production.svc.cluster.local"
                      - name: USERS
                        value: "20"
                      - name: SPAWN_RATE
                        value: "5"
                      - name: RUN_TIME
                        value: "45s"
                      - name: LOCUST_FILE
                        value: "checkout_flow.py"
                      - name: PUSHGATEWAY_URL
                        value: "http://prometheus-pushgateway.monitoring:9091"
                    command:
                      - /bin/sh
                      - -c
                      - |
                        locust -f "$LOCUST_FILE" \
                          --headless \
                          --host="$TARGET_HOST" \
                          --users="$USERS" \
                          --spawn-rate="$SPAWN_RATE" \
                          --run-time="$RUN_TIME" \
                          --csv=/tmp/results \
                          --exit-code-on-error 1
                        # Remember Locust's result before running anything else
                        locust_exit=$?
                        # Push results to Prometheus via the Pushgateway
                        python3 push_metrics.py \
                          --csv=/tmp/results \
                          --pushgateway="$PUSHGATEWAY_URL" \
                          --labels="service={{args.canary-service}},type=canary"
                        # Fail the Job (and the measurement) if Locust reported errors;
                        # otherwise the push step's exit code would mask them
                        exit "$locust_exit"
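The push_metrics.py script is not shown in this article. The sketch below is one way such a step might work, not the author's script. It assumes the Locust stats CSV has Name, Request Count, Failure Count, and 95% columns with an Aggregated summary row; verify this against your Locust version. It relies on the Pushgateway's /metrics/job/<job>/<label>/<value> push path. The function names, metric names, job name, and sample CSV are hypothetical, and the sketch only builds the payload and URL without sending anything.

```python
import csv
import io


def stats_to_exposition(stats_csv: str) -> str:
    """Convert the 'Aggregated' row of a Locust stats CSV to Prometheus text format."""
    for row in csv.DictReader(io.StringIO(stats_csv)):
        if row["Name"] != "Aggregated":
            continue
        requests = int(row["Request Count"])
        failures = int(row["Failure Count"])
        ratio = failures / requests if requests else 0.0
        lines = [
            f"locust_request_count {requests}",
            f"locust_failure_count {failures}",
            f"locust_failure_ratio {ratio}",
            f"locust_p95_ms {float(row['95%'])}",
        ]
        # The text exposition format expects a trailing newline.
        return "\n".join(lines) + "\n"
    raise ValueError("no Aggregated row found in Locust stats CSV")


def pushgateway_url(base: str, job: str, grouping: dict) -> str:
    """Build the push URL; grouping labels are encoded in the path."""
    path = "".join(f"/{key}/{value}" for key, value in sorted(grouping.items()))
    return f"{base.rstrip('/')}/metrics/job/{job}{path}"


# Hypothetical, trimmed sample; a real stats file has more columns.
SAMPLE_CSV = """Type,Name,Request Count,Failure Count,95%
GET,/api/products/:id,150,1,120
POST,/api/checkout,50,2,310
,Aggregated,200,3,280
"""

payload = stats_to_exposition(SAMPLE_CSV)
url = pushgateway_url("http://prometheus-pushgateway.monitoring:9091",
                      "locust_canary",
                      {"service": "checkout-canary", "type": "canary"})
print(payload)
print(url)
```

Sending the payload would be an HTTP PUT or POST of the text body to that URL. Putting the service and type labels in the URL path rather than in each metric line keeps one group of metrics per canary service, so a later run replaces the earlier values for the same group.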
Locust Checkout Flow for Canary
# locust-suite/checkout_flow.py
# HARDENED: Realistic checkout flow for canary validation
from locust import HttpUser, task, between


class CheckoutUser(HttpUser):
    wait_time = between(1, 3)

    @task(5)
    def browse_product(self):
        self.client.get("/api/products/42",
                        name="/api/products/:id")

    @task(3)
    def search_products(self):
        self.client.get("/api/products?q=wireless&limit=20",
                        name="/api/products?q=...")

    @task(1)
    def checkout_flow(self):
        # Add to cart
        self.client.post("/api/cart/items",
                         json={"productId": 42, "quantity": 1},
                         name="/api/cart/items")
        # Create checkout
        response = self.client.post("/api/checkout",
                                    json={"paymentMethod": "card_test"},
                                    name="/api/checkout")
        if response.status_code == 201:
            order_id = response.json().get("orderId")
            # Check order status
            self.client.get(f"/api/orders/{order_id}",
                            name="/api/orders/:id")
Error Rate Analysis Template
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: canary-error-rate
spec:
  args:
    - name: canary-service
  metrics:
    - name: error-rate
      interval: 30s
      count: 5
      successCondition: result[0] < 0.01
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{
              service="{{args.canary-service}}",
              code=~"5.."}[2m]))
            /
            sum(rate(http_requests_total{
              service="{{args.canary-service}}"}[2m]))
Memory Analysis Template
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: canary-memory
spec:
  args:
    - name: canary-service
  metrics:
    - name: memory-growth
      interval: 60s
      count: 3
      successCondition: result[0] < 0.8
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          # NOTE: the pod matcher assumes canary pod names start with the
          # canary-service value. Pods created by a Rollout are normally named
          # after the Rollout, so adjust the matcher to your pod naming.
          query: |
            max(container_memory_working_set_bytes{
              pod=~"{{args.canary-service}}.*",
              namespace="production"})
            /
            max(kube_pod_container_resource_limits{
              pod=~"{{args.canary-service}}.*",
              namespace="production",
              resource="memory"})
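The query above gates on working-set memory as a fraction of the limit. The sketch below reproduces that ratio and adds a linear trend projection alongside it. The projection is not part of the template in this article; it is an illustrative addition showing why a leak can pass a threshold check early in a rollout. The function names and sample values are hypothetical, and the 1Gi limit comes from the Rollout's container limits.

```python
from typing import List, Optional, Tuple

MIB = 1024 * 1024


def usage_ratio(working_set_bytes: float, limit_bytes: float) -> float:
    """Same quantity the template compares against 0.8."""
    return working_set_bytes / limit_bytes


def growth_per_second(samples: List[Tuple[float, float]]) -> float:
    """Least-squares slope of (timestamp_seconds, bytes) samples."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_b = sum(b for _, b in samples) / n
    num = sum((t - mean_t) * (b - mean_b) for t, b in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den


def seconds_until_limit(samples: List[Tuple[float, float]],
                        limit_bytes: float) -> Optional[float]:
    """Project when usage reaches the limit; None if memory is not growing."""
    slope = growth_per_second(samples)
    if slope <= 0:
        return None
    return (limit_bytes - samples[-1][1]) / slope


# Hypothetical readings taken 60 seconds apart.
samples = [(0.0, 400 * MIB), (60.0, 430 * MIB), (120.0, 460 * MIB)]
limit = 1024 * MIB  # 1Gi

print(usage_ratio(samples[-1][1], limit) < 0.8)  # True: the threshold check passes
print(growth_per_second(samples) / MIB)          # 0.5 (MiB per second)
print(seconds_until_limit(samples, limit))       # 1128.0
```

In this made-up case the canary is at about 45% of its limit and passes the 0.8 threshold, yet steady growth of 0.5 MiB per second would reach the limit in under 19 minutes.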
The Gate
The canary promotion proceeds only if all analysis templates pass at each step. The Locust validation generates realistic load against the canary. If the canary’s error rate exceeds 1%, latency degrades, or memory usage exceeds 80% of limits, the analysis fails.
At stage 2 (20% traffic), both Locust validation and error rate analysis must pass. At stage 3 (50% traffic), Locust, error rate, latency, and memory analysis all must pass. This progressive gating catches issues that only manifest under higher load.
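The error-rate metric tolerates up to two failed measurements out of five (failureLimit: 2) to account for transient spikes; a third failure fails the analysis. The memory metric tolerates only one failed measurement out of three (failureLimit: 1): memory leaks are progressive, so a second reading above the threshold is a strong signal rather than noise. To fail on the very first high reading, set failureLimit: 0.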
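A simplified model of how failureLimit interacts with a success condition: the metric fails once the number of failed measurements exceeds the limit. This is a sketch of the documented behavior as I understand it, not the controller's code; it ignores inconclusive and errored measurements, and the evaluate_metric function and sample readings are hypothetical.

```python
from typing import Callable, Iterable


def evaluate_metric(results: Iterable[float],
                    success: Callable[[float], bool],
                    failure_limit: int) -> str:
    """Return 'Failed' once failed measurements exceed failure_limit."""
    failures = 0
    for value in results:
        if not success(value):
            failures += 1
            if failures > failure_limit:
                return "Failed"  # no further measurements are needed
    return "Successful"


def error_ok(rate: float) -> bool:
    return rate < 0.01   # mirrors result[0] < 0.01


def memory_ok(ratio: float) -> bool:
    return ratio < 0.8   # mirrors result[0] < 0.8


# Hypothetical measurements.
print(evaluate_metric([0.002, 0.03, 0.004, 0.02, 0.003], error_ok, 2))  # Successful
print(evaluate_metric([0.002, 0.03, 0.02, 0.05, 0.003], error_ok, 2))   # Failed
print(evaluate_metric([0.55, 0.82, 0.70], memory_ok, 1))                # Successful
print(evaluate_metric([0.55, 0.82, 0.85], memory_ok, 1))                # Failed
```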
The Recovery
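Canary analysis fails at 5%: Argo Rollouts aborts the rollout, routes all traffic back to stable, and scales the canary down. Inspect the failed AnalysisRun (for example with kubectl argo rollouts get rollout checkout-service, or in Argo CD) to see which metric failed. Fix and redeploy.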
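Canary passes all stages but issues appear after 100%: Use kubectl argo rollouts undo checkout-service to revert to the previous revision. With rollbackWindow.revisions: 3, a rollback to one of the last three revisions skips the canary steps and analysis, so the revert is not delayed by a full canary cycle. How long old ReplicaSets are retained is governed by revisionHistoryLimit, not by rollbackWindow.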
Locust validation is flaky: The Locust scenario is too aggressive or the canary has insufficient resources. Reduce the user count, lower the spawn rate, or increase canary pod resources. Flaky validation is worse than no validation because it trains the team to ignore failures.