Traffic Splitting with Istio/Nginx and Automated Rollback Triggers
The Failure
The team configured traffic splitting with Nginx ingress annotations and set the canary weight to 10%. They then noticed that some users hit the canary on about half of their requests while others never hit it at all. The cause is that Nginx's canary routing is probabilistic per request, not per user: a user making 10 requests might get 1 canary response or 5. For checkout flows that span multiple requests (add to cart → checkout → payment), a user could start on stable and finish on canary, or vice versa.
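Session affinity during a canary rollout keeps a user on the same version for the duration of their session. Nginx supports this through the canary-by-cookie annotation: a request whose cookie value is always goes to the canary, never goes to stable, and any other value falls through to the weight. The ingress controller does not set that cookie itself, so the application or an edge layer has to. Istio offers cookie-based consistent hashing in a DestinationRule, which keeps a user on the same pod within a destination. Because the weighted split in a VirtualService is still evaluated per request, pinning a user to one version also needs a cookie or header match rule ahead of the weighted route.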
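To put a rough number on the mixed-version risk, here is a small worked example. It assumes each request is routed independently with canary probability p, which is a simplification. The values p = 0.10 and n = 3 requests come from the scenario above.

```latex
\begin{align*}
P(\text{consistent}) &= p^{n} + (1-p)^{n} = 0.1^{3} + 0.9^{3} = 0.001 + 0.729 = 0.730 \\
P(\text{mixed}) &= 1 - P(\text{consistent}) = 0.270
\end{align*}
```

Under this model, roughly one checkout in four crosses versions at least once, even though only 10% of requests reach the canary.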
The Mechanism
Traffic Routing Options
| Router | Mechanism | Session Affinity | Weighted Routing | Header Routing |
|---|---|---|---|---|
| Nginx Ingress | Canary annotations | Cookie-based | Yes (weight annotation) | Yes (header annotation) |
| Istio VirtualService | Traffic rules | Consistent hash | Yes (weight field) | Yes (match rules) |
| Traefik | Weighted services | Cookie-based | Yes (weight) | Yes (headers) |
| AWS ALB | Weighted target groups | Cookie-based (target group stickiness) | Yes (weight) | Yes (listener rule conditions) |
The Implementation
Nginx Traffic Splitting
# HARDENED: Nginx ingress with canary routing
# Stable ingress
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: checkout-ingress
  namespace: production
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "30"
spec:
  ingressClassName: nginx
  rules:
    - host: api.acme.com
      http:
        paths:
          - path: /api/checkout
            pathType: Prefix
            backend:
              service:
                name: checkout-stable
                port:
                  number: 80
---
# Canary ingress (managed by Argo Rollouts)
# Host, path, and ingress class must match the stable ingress
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: checkout-ingress-canary
  namespace: production
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"
    # Header-based override for testing (takes precedence over cookie and weight)
    nginx.ingress.kubernetes.io/canary-by-header: "X-Canary"
    nginx.ingress.kubernetes.io/canary-by-header-value: "true"
    # Cookie pinning: value "always" routes to canary, "never" to stable.
    # Nginx does not set this cookie; the application has to.
    nginx.ingress.kubernetes.io/canary-by-cookie: "canary-session"
spec:
  ingressClassName: nginx
  rules:
    - host: api.acme.com
      http:
        paths:
          - path: /api/checkout
            pathType: Prefix
            backend:
              service:
                name: checkout-canary
                port:
                  number: 80
Istio VirtualService Traffic Splitting
# HARDENED: Istio VirtualService for weighted routing
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: checkout-service
  namespace: production
spec:
  hosts:
    - checkout-service.production.svc.cluster.local
  http:
    # Header-based routing for canary testing
    # (Istio expects lowercase header keys; matching is case-insensitive on the wire)
    - match:
        - headers:
            x-canary:
              exact: "true"
      route:
        - destination:
            host: checkout-canary
            port:
              number: 80
    # Weighted routing for canary traffic
    # The route is named so that Argo Rollouts can reference and update it
    - name: primary
      route:
        - destination:
            host: checkout-stable
            port:
              number: 80
          weight: 90
        - destination:
            host: checkout-canary
            port:
              number: 80
          weight: 10
---
# Session affinity via consistent hashing
# Hashing picks a pod within this host; it does not pin a user across the 90/10 split above
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: checkout-service
  namespace: production
spec:
  host: checkout-service.production.svc.cluster.local
  trafficPolicy:
    loadBalancer:
      consistentHash:
        httpCookie:
          name: canary-session
          ttl: 3600s
Argo Rollouts with Istio
# Rollout fragment: replicas, selector, and template are omitted for brevity
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout-service
  namespace: production
spec:
  strategy:
    canary:
      canaryService: checkout-canary
      stableService: checkout-stable
      trafficRouting:
        istio:
          virtualServices:
            - name: checkout-service
              routes:
                - primary   # must match a named http route in the VirtualService
          # Host-level split: weights shift between the two Services above.
          # A destinationRule with canary/stable subsets applies only to
          # subset-level splitting and requires those subsets to be defined.
      steps:
        - setWeight: 5
        - pause: { duration: 2m }
        - analysis:
            templates:
              - templateName: error-rate
        - setWeight: 20
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: error-rate
              - templateName: latency-p99
        - setWeight: 50
        - pause: { duration: 10m }
        - setWeight: 100
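The Rollout above references an error-rate template that is not shown. A minimal sketch of what it might look like follows. The Prometheus address, the use of Istio's istio_requests_total metric, the 2m window, and the 1% threshold are all assumptions for illustration, not values from the original setup.

```yaml
# Sketch only: Prometheus address, metric labels, and threshold are assumptions
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate
  namespace: production
spec:
  args:
    # A default value lets the Rollout reference the template without passing args
    - name: service-name
      value: checkout-canary
  metrics:
    - name: error-rate
      interval: 30s
      count: 4
      # Prometheus results arrive as a vector. An empty vector (no matching series)
      # is treated as a pass; otherwise the 5xx share must stay at or below 1%.
      successCondition: len(result) == 0 || result[0] <= 0.01
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc.cluster.local:9090
          query: |
            sum(rate(istio_requests_total{destination_service_name="{{args.service-name}}",response_code=~"5.."}[2m]))
            /
            sum(rate(istio_requests_total{destination_service_name="{{args.service-name}}"}[2m]))
```

Treating an empty result as a pass is a design choice. At very low canary weights it avoids failing on missing data, but it also means a canary that receives no traffic is never really tested.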
Custom Rollback Triggers
Beyond Prometheus metrics, trigger rollback from external systems:
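# Fail the analysis (and abort the rollout) while the service has an open PagerDuty incident
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: no-active-incidents
spec:
  args:
    - name: pd-service-id
    # Assumption: the API token is stored in a Secret named pagerduty-api (key: token)
    - name: pd-token
      valueFrom:
        secretKeyRef:
          name: pagerduty-api
          key: token
  metrics:
    - name: pagerduty-check
      interval: 60s
      count: 5
      # total is numeric, so compare against a number rather than the string "0"
      successCondition: result == 0
      failureLimit: 0
      provider:
        web:
          # total=true asks PagerDuty to include the count of matching incidents
          url: "https://api.pagerduty.com/incidents?statuses[]=triggered&statuses[]=acknowledged&service_ids[]={{args.pd-service-id}}&total=true"
          headers:
            - key: Authorization
              value: "Token token={{args.pd-token}}"
          jsonPath: "{$.total}"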
# Abort the in-progress rollout: traffic shifts back to the stable version
kubectl argo rollouts abort checkout-service -n production
# Roll back to the previous revision
kubectl argo rollouts undo checkout-service -n production
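The no-active-incidents template only takes effect once a Rollout references it. One option is a background analysis that runs alongside the canary steps, sketched below. The service ID is a hypothetical placeholder, and the choice of startingStep is an assumption.

```yaml
# Fragment of the Rollout spec; the PagerDuty service ID is a hypothetical placeholder
spec:
  strategy:
    canary:
      # Background analysis: runs alongside the steps and aborts the rollout on failure
      analysis:
        templates:
          - templateName: no-active-incidents
        # Steps are zero-indexed: start once the first pause (index 1) is reached
        startingStep: 1
        args:
          - name: pd-service-id
            value: PABC123
          # pd-token is deliberately not passed here; it is better resolved
          # inside the template from a Secret than written into the Rollout
```

Running the check in the background means an incident opened at any point during the rollout triggers an abort, not only one that happens to coincide with an inline analysis step.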
The Gate
Traffic splitting is the mechanism that enables progressive gating. At each stage, a larger percentage of users validate the new version. The combination of automated analysis and traffic splitting creates a multi-layered gate:
- Technical gate: Metrics (error rate, latency, memory)
- Operational gate: No active incidents (PagerDuty check)
- User experience gate: Session-based routing ensures users have consistent experiences
The Recovery
Traffic split not working (all traffic goes to stable): Check the ingress annotations or the VirtualService configuration. A common cause with Nginx is that the canary ingress does not share the ingress class, host, and path of the stable ingress, so the controller never pairs the two. Verify with kubectl get ingress -n production and compare the CLASS and HOSTS columns.
Session affinity causes uneven distribution: Cookie-based affinity means returning users always hit the same version. If most traffic is from returning users, the canary gets less traffic than the weight suggests. Increase the canary weight or use a shorter cookie TTL.
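Here is a back-of-the-envelope model of that shortfall. It assumes a fraction r of traffic comes from users whose cookie already pins them to stable, and that only the remaining sessions are subject to the weight w. The value r = 0.6 is illustrative, not a measurement.

```latex
\[
\text{effective canary share} = (1 - r)\,w = (1 - 0.6) \times 0.10 = 0.04
\]
```

Under these assumptions a configured 10% canary sees about 4% of traffic, so analysis windows may need to be longer to collect a comparable sample.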
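Canary receives traffic before it is ready: Add a setHeaderRoute step to the Rollout so that only test traffic (identified by a header) reaches the canary before weighted routing is enabled. The route name used by setHeaderRoute must also be listed under trafficRouting.managedRoutes. Validate with test traffic first, then open the canary to real users.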
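A sketch of that header-first sequence with Istio is shown below. The route name canary-header-route and the single warm-up replica are illustrative choices, not part of the original configuration.

```yaml
# Fragment of the Rollout spec; route name and canary scale are illustrative
spec:
  strategy:
    canary:
      canaryService: checkout-canary
      stableService: checkout-stable
      trafficRouting:
        # Routes that Argo Rollouts is allowed to create and remove
        managedRoutes:
          - name: canary-header-route
        istio:
          virtualServices:
            - name: checkout-service
              routes:
                - primary
      steps:
        # Start a canary pod without sending it weighted traffic
        - setCanaryScale:
            replicas: 1
        # Only requests carrying X-Canary: true reach the canary
        - setHeaderRoute:
            name: canary-header-route
            match:
              - headerName: X-Canary
                headerValue:
                  exact: "true"
        # Wait for manual promotion once test traffic looks healthy
        - pause: {}
        # A setHeaderRoute without match removes the managed route
        - setHeaderRoute:
            name: canary-header-route
        # Let canary scale follow the traffic weight again
        - setCanaryScale:
            matchTrafficWeight: true
        - setWeight: 5
```

The indefinite pause keeps a human in the loop for the test-traffic phase. It could be replaced with an analysis step if the test traffic is generated automatically.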