Locust in CI: Automated Performance Gates
The main chapter defined the Locust test, the baseline file, and the comparison script. This section covers the engineering problems that appear when you actually run this in CI: flaky infrastructure, test data management, parallelism with functional tests, and the workflow mechanics that make the gate reliable.
Service Startup Ordering
The GitHub Actions workflow from the main chapter has a race condition. The application container starts, but the for loop that polls /health does not guarantee that the database migrations have run or that the connection pools are initialized. A health check that returns 200 means the HTTP server is listening. It does not mean the application is ready to serve traffic at representative latency.
# SLOW: Health check passes but app is not warm
# Health endpoint that lies about readiness
@app.get("/health")
def health():
    return {"status": "ok"}
# First requests hit cold connection pool (3-5s to establish)
# First queries hit empty prepared statement cache
# First responses include class loading / JIT overhead
# CI test measures startup cost, not steady-state performance
# FAST: Readiness check that validates dependencies
import asyncpg
import redis.asyncio as redis
# Assumes a FastAPI app, as the @app.get decorator suggests
from fastapi.responses import JSONResponse

@app.get("/health/ready")
async def readiness():
    checks = {}
    # Verify database connection pool has active connections
    try:
        async with app.state.db_pool.acquire() as conn:
            await conn.fetchval("SELECT 1")
        checks["database"] = "ok"
    except Exception as e:
        checks["database"] = str(e)
        return JSONResponse(
            status_code=503,
            content={"status": "not_ready", "checks": checks},
        )
    # Verify Redis is responding
    try:
        await app.state.redis.ping()
        checks["redis"] = "ok"
    except Exception as e:
        checks["redis"] = str(e)
        return JSONResponse(
            status_code=503,
            content={"status": "not_ready", "checks": checks},
        )
    # Verify cache is populated (warmup complete)
    trending_cache = await app.state.redis.get("trending:articles")
    if trending_cache is None:
        checks["cache_warm"] = "not_populated"
        return JSONResponse(
            status_code=503,
            content={"status": "warming_up", "checks": checks},
        )
    checks["cache_warm"] = "ok"
    return {"status": "ready", "checks": checks}
The CI workflow uses the readiness endpoint instead of the basic health check:
- name: Wait for application readiness
  run: |
    for i in $(seq 1 60); do
      # No -f here: curl -f discards the 503 body, and the body is what gets logged
      response=$(curl -s http://localhost:8000/health/ready 2>&1) || true
      if echo "$response" | grep -q '"status":"ready"'; then
        echo "Application is ready"
        echo "$response" | python3 -m json.tool
        break
      fi
      echo "Waiting for readiness... ($i/60)"
      echo "Current status: $response"
      sleep 3
    done
    # Final check: fail if still not ready
    curl -sf http://localhost:8000/health/ready | grep -q '"status":"ready"' || {
      echo "Application failed to become ready within 3 minutes"
      docker logs app
      exit 1
    }
The timeout is 3 minutes (60 iterations, 3 seconds each). If the application is not ready in 3 minutes, the step fails and prints the container logs. No silent hangs. No mystery failures.
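The same bounded wait can be written in Python, which helps when the polling logic itself needs a unit test. The following is a minimal sketch, not the workflow's actual implementation: `wait_until_ready`, the `probe` callable, and the injectable `sleep` are hypothetical names introduced for illustration.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 60,
    interval_s: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it returns True or the attempts run out.

    `probe` is a hypothetical callable that would wrap the HTTP call to
    /health/ready and return True once the body reports status "ready".
    """
    for attempt in range(1, attempts + 1):
        if probe():
            return True
        # Sleep between attempts, but not after the final one
        if attempt < attempts:
            sleep(interval_s)
    return False


if __name__ == "__main__":
    # Simulated probe: not ready twice, then ready
    states = iter([False, False, True])
    print(wait_until_ready(lambda: next(states), attempts=5, sleep=lambda s: None))
```

Passing `sleep` as a parameter keeps the test instant; a CI script would use the default and exit non-zero when the function returns False.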
Test Data Seeding
Performance tests need data. Queries against an empty database return in microseconds. Queries against a database with 50,000 articles, 500,000 tags, and 2 million analytics events return in representative time.
The seed script must be deterministic. Running it twice produces the same dataset. Running it on different CI runners produces the same dataset. Non-deterministic test data means non-deterministic test results.
# seed_ci_data.py: Deterministic test data for performance testing
import random
from datetime import datetime, timedelta

# Fixed seed so any random.* call stays reproducible
random.seed(42)

ARTICLE_COUNT = 10_000
TAGS_PER_ARTICLE = 3
CATEGORIES = [
    "performance", "architecture", "databases",
    "networking", "security", "devops",
    "frontend", "backend", "infrastructure",
]

def generate_article(index: int) -> dict:
    # Deterministic slug from index
    slug = f"article-{index:05d}"
    # Deterministic content length (500 to 4,999 words)
    content_length = 500 + (index * 37 % 4500)
    # Deterministic publish date (spread over 2 years from 2024-01-01)
    day_offset = index % 730
    publish_date = datetime(2024, 1, 1) + timedelta(days=day_offset)
    return {
        "slug": slug,
        "title": f"Performance Engineering Article {index}",
        "content": "x " * content_length,
        "category": CATEGORIES[index % len(CATEGORIES)],
        "tags": [
            f"tag-{(index * 7 + i) % 200}"
            for i in range(TAGS_PER_ARTICLE)
        ],
        "published_at": publish_date,
        "view_count": index * 13 % 50_000,
    }
def seed_database(conn):
    # conn is assumed to be a psycopg-style DB-API connection;
    # executemany is a cursor method, not a connection method
    articles = [generate_article(i) for i in range(ARTICLE_COUNT)]
    tag_rows = []
    for article in articles:
        for tag in article["tags"]:
            tag_rows.append({"slug": article["slug"], "tag": tag})
    with conn.cursor() as cur:
        # Batch insert for speed
        cur.executemany(
            """
            INSERT INTO articles (slug, title, content, category, published_at, view_count)
            VALUES (%(slug)s, %(title)s, %(content)s, %(category)s, %(published_at)s, %(view_count)s)
            ON CONFLICT (slug) DO NOTHING
            """,
            articles,
        )
        # Insert tags
        cur.executemany(
            """
            INSERT INTO article_tags (article_slug, tag)
            VALUES (%(slug)s, %(tag)s)
            ON CONFLICT DO NOTHING
            """,
            tag_rows,
        )
    conn.commit()
    print(f"Seeded {len(articles)} articles with {len(tag_rows)} tags")
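The seed script calls random.seed(42), but the generator shown here never draws a random number: every field is computed from the article index, so the output is identical on every run and every runner. The fixed seed only matters if random values are added later. The content is synthetic (repeated “x ” strings) because the performance test measures query and serialization time, not content rendering. Real content would make the seed script slower without improving the signal.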
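One way to check the determinism claim rather than trust it is to fingerprint the generated rows and compare the hash across runs or runners. The sketch below is illustrative only: `generate_row` is a hypothetical, trimmed stand-in for the article generator, and `dataset_fingerprint` is not part of the seed script.

```python
import hashlib
import json


def generate_row(index: int) -> dict:
    # Hypothetical stand-in for the article generator:
    # every field is a pure function of the index.
    return {
        "slug": f"article-{index:05d}",
        "category_id": index % 9,
        "view_count": index * 13 % 50_000,
    }


def dataset_fingerprint(count: int) -> str:
    """Hash the generated rows in order so two runs can be compared."""
    digest = hashlib.sha256()
    for i in range(count):
        # sort_keys gives each row a stable byte representation
        row_bytes = json.dumps(generate_row(i), sort_keys=True).encode("utf-8")
        digest.update(row_bytes)
    return digest.hexdigest()


if __name__ == "__main__":
    # Two runs over the same count must produce the same hash
    print(dataset_fingerprint(1_000) == dataset_fingerprint(1_000))
```

Logging the fingerprint in CI makes a changed dataset visible in the job output before it shows up as an unexplained latency shift.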
Parallelizing Performance and Functional Tests
Performance tests take 90-120 seconds plus setup time. Running them sequentially after functional tests adds 5-10 minutes to every PR. Running them in parallel keeps CI fast.
jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements-test.txt
      - run: pytest tests/unit/ -x --tb=short
  integration-tests:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:16
        # ... service config
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements-test.txt
      - run: pytest tests/integration/ -x --tb=short
  performance-tests:
    runs-on: ubuntu-latest
    needs: [unit-tests]  # Only run if unit tests pass
    services:
      postgres:
        image: postgres:16
        # ... service config
      redis:
        image: redis:7
        # ... service config
    steps:
      - uses: actions/checkout@v4
      - name: Build and start application
        run: |
          docker build -t content-platform:ci .
          docker run -d --name app --network host \
            --cpus 2 --memory 4g \
            -e DATABASE_URL=postgresql://app:ci_pass@localhost:5432/content_platform \
            -e REDIS_URL=redis://localhost:6379 \
            content-platform:ci
      - name: Seed and warm up
        run: |
          docker exec app python scripts/seed_ci_data.py
          # Warmup: hit every endpoint once to initialize caches
          curl -s http://localhost:8000/api/articles/article-00001 > /dev/null
          # Quoted so the shell does not treat "?" as a glob character
          curl -s "http://localhost:8000/api/search?q=performance" > /dev/null
          curl -s http://localhost:8000/api/trending > /dev/null
          sleep 5  # Let connection pools stabilize
      - name: Run Locust
        run: |
          pip install locust
          locust -f tests/perf/ci_locustfile.py \
            --headless \
            --users 20 \
            --spawn-rate 5 \
            --run-time 90s \
            --host http://localhost:8000
      - name: Check regression
        run: python tests/perf/compare_perf.py
      - name: Store results artifact
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: perf-${{ github.sha }}
          path: perf-results.json
          retention-days: 90
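The needs: [unit-tests] dependency means performance tests only run if unit tests pass. No point benchmarking broken code. Integration tests start immediately, so they overlap with the performance job once unit tests finish. Each job gets its own runner and its own service containers, so the two share no state.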
Handling Flaky Performance Results
A CI performance test that fails randomly is worse than no performance test at all. Teams learn to ignore it, re-run the workflow, and merge anyway. Three strategies reduce flakiness.
Strategy 1: Multiple runs with statistical aggregation.
# run_perf_stable.py: Run Locust 3 times and take the median
import subprocess
import json
import statistics
from pathlib import Path

def run_locust() -> dict:
    subprocess.run(
        [
            "locust", "-f", "tests/perf/ci_locustfile.py",
            "--headless", "--users", "20",
            "--spawn-rate", "5", "--run-time", "60s",
            "--host", "http://localhost:8000",
        ],
        check=True,
    )
    return json.loads(Path("perf-results.json").read_text())

def median_results(runs: list[dict]) -> dict:
    merged = {"endpoints": {}}
    all_endpoints = set()
    for run in runs:
        all_endpoints.update(run["endpoints"].keys())
    for endpoint in all_endpoints:
        endpoint_runs = [
            r["endpoints"][endpoint]
            for r in runs
            if endpoint in r["endpoints"]
        ]
        if not endpoint_runs:
            continue
        merged["endpoints"][endpoint] = {
            "count": int(statistics.median(r["count"] for r in endpoint_runs)),
            "p50": statistics.median(r["p50"] for r in endpoint_runs),
            "p95": statistics.median(r["p95"] for r in endpoint_runs),
            "p99": statistics.median(r["p99"] for r in endpoint_runs),
            "mean": statistics.median(r["mean"] for r in endpoint_runs),
            "failures": max(r["failures"] for r in endpoint_runs),
        }
    return merged

runs = [run_locust() for _ in range(3)]
stable_results = median_results(runs)
Path("perf-results.json").write_text(json.dumps(stable_results, indent=2))
Three runs at 60 seconds each take 3 minutes. The median filters out a single run that landed during a CPU spike on the shared runner.
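A short worked example shows why the median, and not the mean, is the aggregate of choice with three runs. The p95 values below are made up for illustration; the third run stands in for one that hit a noisy runner.

```python
import statistics

# Hypothetical p95 latencies (ms) from three runs; the third is the outlier
p95_runs = [212.0, 208.0, 390.0]

median_p95 = statistics.median(p95_runs)  # middle value after sorting
mean_p95 = statistics.mean(p95_runs)      # pulled upward by the outlier

print(f"median={median_p95} mean={mean_p95}")
```

The median stays at 212 ms while the mean rises to 270 ms, so a 10% regression threshold would trip on the mean even though two of the three runs agree closely.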
Strategy 2: Coefficient of variation check. If the p95 values from the three runs have a coefficient of variation (standard deviation divided by mean) above 15%, the environment is too noisy. Warn instead of fail.
def check_stability(runs: list[dict], max_cv: float = 0.15) -> bool:
    for endpoint in runs[0]["endpoints"]:
        p95_values = [
            r["endpoints"][endpoint]["p95"]
            for r in runs
            if endpoint in r["endpoints"]
        ]
        if len(p95_values) < 2:
            continue
        mean = statistics.mean(p95_values)
        stdev = statistics.stdev(p95_values)
        cv = stdev / mean if mean > 0 else 0
        if cv > max_cv:
            print(
                f"WARNING: {endpoint} p95 has CV={cv:.2f} "
                f"(values: {p95_values}). Environment too noisy for "
                f"reliable comparison."
            )
            return False
    return True
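The stability check only says whether the runs agree; something still has to turn that into "warn instead of fail". The sketch below shows one possible wiring for a single endpoint. `coefficient_of_variation` and `gate_outcome` are hypothetical helpers, the sample latencies are invented, and the three outcome strings are an assumption about how the gate would report.

```python
import statistics


def coefficient_of_variation(values: list[float]) -> float:
    """Sample standard deviation divided by the mean (0.0 if the mean is 0)."""
    mean = statistics.mean(values)
    return statistics.stdev(values) / mean if mean > 0 else 0.0


def gate_outcome(p95_runs: list[float], regressed: bool, max_cv: float = 0.15) -> str:
    """Return "warn" when runs are too noisy to trust, else "fail" or "pass"."""
    if coefficient_of_variation(p95_runs) > max_cv:
        return "warn"  # noisy environment: report, but do not block the merge
    return "fail" if regressed else "pass"


if __name__ == "__main__":
    quiet = [200.0, 210.0, 205.0]  # hypothetical p95 values (ms), CV about 0.02
    noisy = [200.0, 210.0, 320.0]  # one outlier pushes CV to about 0.27
    print(gate_outcome(quiet, regressed=True))
    print(gate_outcome(noisy, regressed=True))
```

Mapping "warn" to exit code 0 with a visible annotation keeps the signal in the PR without training the team to re-run the workflow until it passes.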
Strategy 3: Relative comparison within the same CI run. Instead of comparing against a stored baseline, run the test twice in the same job: once with the base branch code and once with the PR code. Same hardware, same moment in time. The comparison is relative, not absolute.
# Requires fetch-depth: 0 on actions/checkout; the default shallow clone
# does not contain the base commit
- name: Test base branch performance
  run: |
    git stash
    git checkout ${{ github.event.pull_request.base.sha }}
    docker build -t content-platform:base .
    # ... run locust, save to perf-results-base.json
- name: Test PR branch performance
  run: |
    git checkout ${{ github.sha }}
    docker build -t content-platform:pr .
    # ... run locust, save to perf-results-pr.json
- name: Compare
  run: python tests/perf/compare_relative.py perf-results-base.json perf-results-pr.json
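Because both builds run on the same runner minutes apart, differences between runners drop out of the comparison. Short-lived noise within the job can still appear, so the earlier strategies remain useful. The cost is doubling CI time. Use this approach for release branches where accuracy matters more than speed.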
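The workflow calls compare_relative.py without showing it. A minimal sketch of its core might look like the following, assuming both result files share the structure used elsewhere in this section (an "endpoints" mapping with a "p95" value per endpoint). The function name `find_regressions` and the 10% default threshold are assumptions, not the chapter's actual script.

```python
def find_regressions(base: dict, pr: dict, max_increase: float = 0.10) -> list[str]:
    """Return one message per endpoint whose p95 grew by more than max_increase.

    max_increase is a fraction of the base p95 (0.10 means 10%); the default
    is an illustrative value, not a recommendation.
    """
    problems = []
    for endpoint, base_metrics in base["endpoints"].items():
        pr_metrics = pr["endpoints"].get(endpoint)
        # Skip endpoints missing from the PR run or without a usable base value
        if pr_metrics is None or base_metrics["p95"] <= 0:
            continue
        change = (pr_metrics["p95"] - base_metrics["p95"]) / base_metrics["p95"]
        if change > max_increase:
            problems.append(
                f"{endpoint}: p95 {base_metrics['p95']:.0f}ms -> "
                f"{pr_metrics['p95']:.0f}ms ({change:+.0%})"
            )
    return problems


if __name__ == "__main__":
    # Hypothetical results: one endpoint regresses by 25%, one by 2.5%
    base = {"endpoints": {"/api/trending": {"p95": 100.0}, "/api/search": {"p95": 200.0}}}
    pr = {"endpoints": {"/api/trending": {"p95": 125.0}, "/api/search": {"p95": 205.0}}}
    for line in find_regressions(base, pr):
        print(f"REGRESSION {line}")
```

A script wrapper would load the two JSON files named on the command line and exit non-zero when the returned list is not empty.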
Artifact Retention and Result History
CI artifacts expire. GitHub Actions defaults to 90 days. After 90 days, the performance results for a specific commit are gone.
For long-term tracking, push results to external storage after each run:
- name: Archive to S3
  if: github.ref == 'refs/heads/main'
  run: |
    TIMESTAMP=$(date +%Y%m%d-%H%M%S)
    aws s3 cp perf-results.json \
      "s3://perf-results-bucket/content-platform/${TIMESTAMP}-${GITHUB_SHA:0:8}.json"
Only archive results from the main branch. PR results are useful for regression detection during review, not for historical analysis. The main branch results form the trend line.
A simple script queries S3 to build the trend:
# trend.py: Build performance trend from archived results
import boto3
import json

s3 = boto3.client("s3")
bucket = "perf-results-bucket"
prefix = "content-platform/"

# list_objects_v2 returns at most 1,000 keys per call, so paginate.
# page.get() also covers an empty bucket, where "Contents" is absent.
paginator = s3.get_paginator("list_objects_v2")
objects = []
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    objects.extend(page.get("Contents", []))

results = []
for obj in sorted(objects, key=lambda x: x["Key"]):
    body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
    data = json.loads(body)
    # Key format: content-platform/<YYYYMMDD>-<HHMMSS>-<sha>.json
    timestamp = obj["Key"].split("/")[1].split("-")[0]
    for endpoint, metrics in data["endpoints"].items():
        results.append({
            "date": timestamp,
            "endpoint": endpoint,
            "p50": metrics["p50"],
            "p95": metrics["p95"],
            "p99": metrics["p99"],
        })

# Output CSV for Grafana or spreadsheet import
print("date,endpoint,p50,p95,p99")
for r in results:
    print(f"{r['date']},{r['endpoint']},{r['p50']},{r['p95']},{r['p99']}")
This creates the raw data for the drift analysis covered in Section 2.
When the Gate Fails on Purpose
Sometimes a PR is intentionally slow. Adding full-text search to the article API requires a database query that is inherently slower than a primary key lookup. The performance gate will fail. This is correct behavior.
The developer has three options:
- Update the baseline. If the new latency is acceptable, update perf-baseline.json in the same PR. The reviewer sees both the code change and the new performance expectation.
- Add a skip annotation. For experimental features behind a flag, skip the performance check for specific endpoints:
{
  "thresholds": {
    "/api/search": {
      "p95_ms": 200,
      "skip_until": "2025-02-01",
      "skip_reason": "Full-text search migration in progress, #1234"
    }
  }
}
- Optimize until it fits. The budget forces the developer to think about performance during development, not after deployment. Maybe the full-text search needs an index. Maybe the query needs a LIMIT. The gate creates the feedback loop.
Option 3 is the whole point. The performance gate does not exist to block merges. It exists to create a moment where the developer asks “can I make this faster?” before the code reaches production.
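The skip annotation from the second option only helps if the comparison script honors it and lets it expire. The sketch below is one possible interpretation, assuming skip_until is an ISO date and the skip ends on that date. `is_skipped` is a hypothetical helper, not part of the comparison script from the main chapter.

```python
from datetime import date


def is_skipped(threshold: dict, today: date) -> bool:
    """True while the threshold's skip_until date has not yet been reached.

    Assumption: the check is enforced again on the skip_until date itself.
    """
    skip_until = threshold.get("skip_until")
    if skip_until is None:
        return False  # no annotation: always enforce the threshold
    return today < date.fromisoformat(skip_until)


if __name__ == "__main__":
    threshold = {
        "p95_ms": 200,
        "skip_until": "2025-02-01",
        "skip_reason": "Full-text search migration in progress, #1234",
    }
    print(is_skipped(threshold, date(2025, 1, 15)))  # before the expiry date
    print(is_skipped(threshold, date(2025, 2, 1)))   # on the expiry date
```

Because the skip expires by date, a forgotten annotation turns back into a failing gate without anyone having to remember to remove it.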