Pass/Fail Criteria as Code
The Failure
The team’s performance budget was documented in Confluence. The CI script checked p99 < 500ms for all endpoints. The checkout endpoint was naturally slower (p99 of 400ms) while the catalog endpoint was fast (p99 of 50ms). A regression doubled catalog latency to 100ms. Still under 500ms. The gate passed. Users noticed the catalog was slower. The global budget was too coarse to catch per-endpoint regressions.
Per-endpoint budgets catch regressions that global budgets miss.
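The incident can be replayed in a few lines. This is a minimal sketch using the latencies from the story and the per-endpoint limits introduced later in this section; the function names are illustrative, not part of the team's CI script.

```python
GLOBAL_P99_MS = 500
# Per-endpoint limits taken from the budget hierarchy in this section
ENDPOINT_P99_MS = {"/api/catalog/products": 80, "/api/checkout": 400}


def passes_global(p99_by_endpoint):
    """Global-only gate: every endpoint is compared with the same limit."""
    return all(p99 <= GLOBAL_P99_MS for p99 in p99_by_endpoint.values())


def passes_per_endpoint(p99_by_endpoint):
    """Per-endpoint gate: each endpoint is compared with its own limit."""
    return all(
        p99 <= ENDPOINT_P99_MS.get(name, GLOBAL_P99_MS)
        for name, p99 in p99_by_endpoint.items()
    )


# Catalog doubled from 50ms to 100ms; checkout is unchanged at 400ms
after_regression = {"/api/catalog/products": 100, "/api/checkout": 400}

print(passes_global(after_regression))        # True: 100ms is far below 500ms
print(passes_per_endpoint(after_regression))  # False: 100ms exceeds the 80ms catalog limit
```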
The Mechanism
Budget Hierarchy
Global budget:
  p99 < 500ms
  error_rate < 0.5%
Per-endpoint budgets:
  /api/catalog/products:
    p99 < 80ms
  /api/checkout:
    p99 < 400ms
  /api/payments:
    p99 < 600ms
The checker validates both levels: each endpoint's statistics must stay within that endpoint's budget, and the aggregated statistics for the whole run must stay within the global budget.
The Implementation
Budget Configuration File
# tests/performance/budgets.yaml
# HARDENED: Performance budgets as code
global:
  p50_ms: 100
  p99_ms: 500
  error_rate_percent: 0.5
  min_rps: 50
endpoints:
  "/api/catalog/products":
    p50_ms: 30
    p99_ms: 80
  "/api/catalog/products/[id]":
    p50_ms: 20
    p99_ms: 50
  "/api/cart/items":
    p50_ms: 50
    p99_ms: 200
  "/api/checkout":
    p50_ms: 200
    p99_ms: 400
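The checker looks up budgets by exact name, so requests to /api/catalog/products/123 only match the "/api/catalog/products/[id]" entry if the load test reports them under that grouped name. If the results come from Locust (the CSV column names suggest so, though this section does not say), the usual way to do that is the name= argument on the request call. Where grouping has to happen after the fact, a hypothetical helper along these lines could map concrete paths to budget keys. It assumes IDs are purely numeric path segments, which may not hold for every API.

```python
import re

# Hypothetical helper, not part of check_budget.py.
# Assumption: resource IDs are purely numeric path segments.
_NUMERIC_SEGMENT = re.compile(r"/\d+(?=/|$)")


def budget_key(path):
    """Collapse numeric segments so a concrete path matches its budget entry."""
    return _NUMERIC_SEGMENT.sub("/[id]", path)


print(budget_key("/api/catalog/products/123"))  # /api/catalog/products/[id]
print(budget_key("/api/cart/items"))            # /api/cart/items
```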
Enhanced Budget Checker
# tests/performance/check_budget.py
# HARDENED: Per-endpoint budget checking with baseline comparison
import csv
import json
import sys
from pathlib import Path

import yaml  # PyYAML


def load_budgets(path="tests/performance/budgets.yaml"):
    with open(path) as f:
        return yaml.safe_load(f)


def load_baseline(path=".performance-baseline.json"):
    if Path(path).exists():
        with open(path) as f:
            return json.load(f)
    return None


def check(csv_path):
    budgets = load_budgets()
    baseline = load_baseline()
    failures = []
    results = {}
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        for row in reader:
            name = row["Name"]
            p50 = float(row["50%"])
            p99 = float(row["99%"])
            total = int(row["Request Count"])
            fail_count = int(row["Failure Count"])
            error_rate = (fail_count / total * 100) if total > 0 else 0
            results[name] = {"p50": p50, "p99": p99, "error_rate": error_rate}

            # Check endpoint-specific budget
            if name in budgets.get("endpoints", {}):
                ep = budgets["endpoints"][name]
                if "p50_ms" in ep and p50 > ep["p50_ms"]:
                    failures.append(f"{name}: p50 {p50}ms > {ep['p50_ms']}ms")
                if "p99_ms" in ep and p99 > ep["p99_ms"]:
                    failures.append(f"{name}: p99 {p99}ms > {ep['p99_ms']}ms")

            # Check global budget against the aggregated row
            if name == "Aggregated":
                g = budgets["global"]
                if p50 > g["p50_ms"]:
                    failures.append(f"Global p50 {p50}ms > {g['p50_ms']}ms")
                if p99 > g["p99_ms"]:
                    failures.append(f"Global p99 {p99}ms > {g['p99_ms']}ms")
                if error_rate > g["error_rate_percent"]:
                    failures.append(
                        f"Error rate {error_rate:.2f}% > {g['error_rate_percent']}%"
                    )

            # Check regression from baseline
            if baseline and name in baseline:
                prev_p99 = baseline[name]["p99"]
                regression_pct = (
                    ((p99 - prev_p99) / prev_p99 * 100) if prev_p99 > 0 else 0
                )
                if regression_pct > 20:
                    failures.append(
                        f"{name}: p99 regressed {regression_pct:.0f}% "
                        f"({prev_p99}ms → {p99}ms)"
                    )

    # Save current results as new baseline candidate
    with open(".performance-baseline-candidate.json", "w") as f:
        json.dump(results, f, indent=2)

    if failures:
        print("Performance budget EXCEEDED:")
        for f_msg in failures:
            print(f"  ✗ {f_msg}")
        sys.exit(1)
    else:
        print("Performance budget OK")
        # Promote candidate to baseline. replace() overwrites an existing
        # baseline on every platform; rename() raises on Windows if it exists.
        Path(".performance-baseline-candidate.json").replace(
            ".performance-baseline.json"
        )


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: check_budget.py <stats.csv>")
    check(sys.argv[1])
Trend Detection
# tests/performance/trend.py
# HARDENED: Detect gradual performance degradation
import json
from pathlib import Path


def check_trend(history_dir=".performance-history"):
    """Warn if aggregated p99 rose on every step across the last 5 runs."""
    history_path = Path(history_dir)
    if not history_path.exists():
        return
    files = sorted(history_path.glob("*.json"))[-10:]  # Last 10 runs
    if len(files) < 5:
        return
    aggregated_p99 = []
    for f in files:
        data = json.loads(f.read_text())
        if "Aggregated" in data:
            aggregated_p99.append(data["Aggregated"]["p99"])
    if len(aggregated_p99) < 5:
        return
    # Use only the 5 most recent runs, which gives 4 run-to-run comparisons.
    # Counting over all 10 runs would let ordinary noise reach 4 increases.
    recent = aggregated_p99[-5:]
    increases = sum(
        1 for i in range(1, len(recent))
        if recent[i] > recent[i - 1]
    )
    if increases >= 4:
        first = recent[0]
        last = recent[-1]
        pct = ((last - first) / first * 100) if first > 0 else 0
        print(f"⚠ Performance trend warning: p99 increased {pct:.0f}% "
              f"over last {len(recent)} runs ({first}ms → {last}ms)")


if __name__ == "__main__":
    check_trend()
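The trend check reads one JSON file per run from .performance-history, but nothing shown here writes those files. A minimal sketch of a writer follows, assuming the same results dictionary that check_budget.py builds. The record_run name, the timestamp-based filenames, and the keep limit are assumptions for illustration. Zero-padded timestamps keep the lexicographic order that sorted() produces identical to chronological order.

```python
import json
import time
from pathlib import Path


def record_run(results, history_dir=".performance-history", keep=50):
    """Write one run's results to a timestamped JSON file and prune old runs."""
    history_path = Path(history_dir)
    history_path.mkdir(parents=True, exist_ok=True)
    # Zero-padded nanosecond timestamp: sorting by name sorts by time
    out = history_path / f"{time.time_ns():020d}.json"
    out.write_text(json.dumps(results, indent=2))
    # Keep only the newest `keep` files
    files = sorted(history_path.glob("*.json"))
    for old in files[:-keep]:
        old.unlink()
    return out
```

Calling record_run(results) at the end of a passing run on the main branch, rather than on every PR run, would keep abandoned branches out of the history.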
The Gate
The budget checker is a two-layer gate:
- Absolute budget: Each endpoint must be under its defined maximum
- Relative regression: No endpoint can regress more than 20% from baseline
Both must pass for the PR to merge. Trend warnings are advisory: they surface gradual degradation that is too small in any single PR to fail either layer.
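The two layers catch different problems, which a small worked example makes concrete. This sketch mirrors the comparisons in check_budget.py for a single endpoint; the gate function is illustrative, and the 80ms budget is the catalog limit from the configuration above.

```python
def gate(p99_ms, budget_ms, baseline_ms, max_regression_pct=20):
    """Return the names of the layers that fail for one endpoint."""
    failed = []
    if p99_ms > budget_ms:
        failed.append("absolute")
    if baseline_ms > 0:
        regression_pct = (p99_ms - baseline_ms) / baseline_ms * 100
        if regression_pct > max_regression_pct:
            failed.append("relative")
    return failed


print(gate(70, 80, 50))  # ['relative']: under budget, but 40% slower than baseline
print(gate(85, 80, 80))  # ['absolute']: only 6.25% slower, but over budget
print(gate(55, 80, 50))  # []: 10% slower and under budget
```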
The Recovery
Budget is too tight for CI environment: CI runners are slower than production hardware. Set CI budgets at 2x production budgets. The goal is to catch regressions, not validate absolute performance.
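One way to apply the 2x rule without maintaining a second budget file is to scale the latency limits when the checker runs in CI. This is a sketch under that assumption; scale_for_ci is a hypothetical helper, and it deliberately leaves error_rate_percent and min_rps alone, since slower hardware is not a reason to tolerate more errors.

```python
import copy

LATENCY_KEYS = ("p50_ms", "p99_ms")


def scale_for_ci(budgets, factor=2.0):
    """Return a copy of the budgets with latency limits multiplied by factor."""
    scaled = copy.deepcopy(budgets)
    sections = [scaled.get("global", {})]
    sections.extend(scaled.get("endpoints", {}).values())
    for section in sections:
        for key in LATENCY_KEYS:
            if key in section:
                section[key] = section[key] * factor
    return scaled


production = {
    "global": {"p99_ms": 500, "error_rate_percent": 0.5},
    "endpoints": {"/api/checkout": {"p50_ms": 200, "p99_ms": 400}},
}
ci = scale_for_ci(production)
print(ci["endpoints"]["/api/checkout"]["p99_ms"])  # 800.0
print(ci["global"]["error_rate_percent"])          # 0.5
```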
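Baseline keeps ratcheting upward: If every PR is slightly slower and the baseline updates after each passing run, the baseline drifts. The 20% regression check does not stop this on its own, because each individual step stays well under 20%. The absolute per-endpoint budgets cap how far the drift can go, and the trend check surfaces it earlier. Promoting the baseline only from main-branch runs, rather than from every PR run, also slows the ratchet.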
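The size of the ratchet is easy to underestimate. A minimal simulation, assuming each PR adds 5% to p99 and the baseline is promoted after every passing run (the 5% step and the function name are illustrative):

```python
def drift_with_moving_baseline(start_ms, step_pct, runs, max_regression_pct=20):
    """Simulate repeated small regressions against a baseline that follows them.

    Returns (final_baseline_ms, all_runs_passed).
    """
    baseline = start_ms
    for _ in range(runs):
        current = baseline * (1 + step_pct / 100)
        step_regression = (current - baseline) / baseline * 100
        if step_regression > max_regression_pct:
            return baseline, False
        baseline = current  # a passing run becomes the new baseline
    return baseline, True


final_ms, all_passed = drift_with_moving_baseline(start_ms=50, step_pct=5, runs=10)
total_pct = (final_ms - 50) / 50 * 100
print(all_passed)                                # True
print(f"{final_ms:.1f}ms, +{total_pct:.0f}%")    # 81.4ms, +63%
```

Ten such PRs take p99 from 50ms to about 81ms with no single run tripping the relative check. With an 80ms absolute budget, the tenth run is the first one the gate would reject.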
Different results on each run: Performance tests have variance. Run the test 3 times and take the median. Or increase test duration to reduce variance.
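Taking the median of three runs can be sketched as follows, assuming each run has already been parsed into the same results shape that check_budget.py builds. The median_p99 helper is hypothetical, and it only reports endpoints that appear in every run.

```python
from statistics import median


def median_p99(runs):
    """runs: list of {endpoint: {"p99": ms}} dicts, one per test run."""
    # Only compare endpoints that were measured in every run
    names = set(runs[0]).intersection(*runs[1:])
    return {name: median(run[name]["p99"] for run in runs) for name in sorted(names)}


runs = [
    {"/api/catalog/products": {"p99": 70}},
    {"/api/catalog/products": {"p99": 95}},  # one noisy run
    {"/api/catalog/products": {"p99": 72}},
]
print(median_p99(runs))  # {'/api/catalog/products': 72}
```

The single 95ms outlier would have failed an 80ms budget on its own; the median of the three runs does not.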