Locust in CI: Automated Performance Gates

The main chapter defined the Locust test, the baseline file, and the comparison script. This section covers the engineering problems that appear when you actually run this in CI: flaky infrastructure, test data management, parallelism with functional tests, and the workflow mechanics that make the gate reliable.

Service Startup Ordering

The GitHub Actions workflow from the main chapter has a race condition. The application container starts, but the for loop that polls /health does not guarantee that the database migrations have run or that the connection pools are initialized. A health check that returns 200 means the HTTP server is listening. It does not mean the application is ready to serve traffic at representative latency.

# SLOW: Health check passes but app is not warm

# Health endpoint that lies about readiness
@app.get("/health")
def health():
    return {"status": "ok"}

# First requests hit cold connection pool (3-5s to establish)
# First queries hit empty prepared statement cache
# First responses include lazy-import and other one-time initialization overhead
# CI test measures startup cost, not steady-state performance

# FAST: Readiness check that validates dependencies

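import asyncpg                 # driver behind app.state.db_pool
import redis.asyncio as redis  # client behind app.state.redis
from fastapi.responses import JSONResponse  # assumes `app` is a FastAPI instance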

@app.get("/health/ready")
async def readiness():
    checks = {}

    # Verify database connection pool has active connections
    try:
        async with app.state.db_pool.acquire() as conn:
            await conn.fetchval("SELECT 1")
        checks["database"] = "ok"
    except Exception as e:
        checks["database"] = str(e)
        return JSONResponse(
            status_code=503,
            content={"status": "not_ready", "checks": checks},
        )

    # Verify Redis is responding
    try:
        await app.state.redis.ping()
        checks["redis"] = "ok"
    except Exception as e:
        checks["redis"] = str(e)
        return JSONResponse(
            status_code=503,
            content={"status": "not_ready", "checks": checks},
        )

    # Verify cache is populated (warmup complete)
    trending_cache = await app.state.redis.get("trending:articles")
    if trending_cache is None:
        checks["cache_warm"] = "not_populated"
        return JSONResponse(
            status_code=503,
            content={"status": "warming_up", "checks": checks},
        )

    checks["cache_warm"] = "ok"
    return {"status": "ready", "checks": checks}

The CI workflow uses the readiness endpoint instead of the basic health check:

- name: Wait for application readiness
  run: |
    for i in $(seq 1 60); do
      # No -f here: with -f, curl discards the 503 body that carries the per-dependency checks
      response=$(curl -s http://localhost:8000/health/ready 2>&1) || true
      if echo "$response" | grep -q '"status":"ready"'; then
        echo "Application is ready"
        echo "$response" | python3 -m json.tool
        break
      fi
      echo "Waiting for readiness... ($i/60)"
      echo "Current status: $response"
      sleep 3
    done
    
    # Final check: fail if still not ready
    curl -sf http://localhost:8000/health/ready | grep -q '"status":"ready"' || {
      echo "Application failed to become ready within 3 minutes"
      docker logs app
      exit 1
    }

The timeout is 3 minutes (60 iterations, 3 seconds each). If the application is not ready in 3 minutes, the step fails and prints the container logs. No silent hangs. No mystery failures.
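The same bounded polling can also be written as a small Python helper, which is easier to unit test than a shell loop. This is a minimal sketch, not the chapter's implementation: the names wait_until_ready and http_probe are hypothetical, and the URL simply mirrors the readiness endpoint shown above.

```python
import json
import time
import urllib.error
import urllib.request
from typing import Callable, Optional


def http_probe(url: str = "http://localhost:8000/health/ready") -> Optional[dict]:
    """Fetch the readiness body; return None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return json.loads(resp.read())
    except urllib.error.HTTPError as err:
        # A 503 still carries the JSON body with the per-dependency checks
        try:
            return json.loads(err.read())
        except ValueError:
            return None
    except (urllib.error.URLError, OSError, ValueError):
        return None


def wait_until_ready(
    probe: Callable[[], Optional[dict]],
    attempts: int = 60,
    delay_s: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it reports status == "ready" or attempts run out."""
    for attempt in range(1, attempts + 1):
        body = probe()
        if body is not None and body.get("status") == "ready":
            return True
        print(f"Waiting for readiness... ({attempt}/{attempts}): {body}")
        if attempt < attempts:
            sleep(delay_s)
    return False
```

A CI step could call wait_until_ready(http_probe) and exit non-zero when it returns False. Passing the probe and the sleep function as arguments keeps the timeout logic testable without a running server.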

Test Data Seeding

Performance tests need data. Queries against an empty database return almost instantly and say nothing about production behavior. A database with 50,000 articles, 500,000 tags, and 2 million analytics events returns in representative time.

The seed script must be deterministic. Running it twice produces the same dataset. Running it on different CI runners produces the same dataset. Non-deterministic test data means non-deterministic test results.

# seed_ci_data.py: Deterministic test data for performance testing

import random
from datetime import datetime, timedelta

# Fixed seed for reproducibility
random.seed(42)

ARTICLE_COUNT = 10_000
TAGS_PER_ARTICLE = 3
CATEGORIES = [
    "performance", "architecture", "databases",
    "networking", "security", "devops",
    "frontend", "backend", "infrastructure",
]


def generate_article(index: int) -> dict:
    # Deterministic slug from index
    slug = f"article-{index:05d}"
    # Deterministic content length (varies 500-5000 words)
    content_length = 500 + (index * 37 % 4500)
    # Deterministic publish date (spread over 2 years, starting 2024-01-01)
    day_offset = index % 730
    publish_date = datetime(2024, 1, 1) + timedelta(days=day_offset)

    return {
        "slug": slug,
        "title": f"Performance Engineering Article {index}",
        "content": "x " * content_length,
        "category": CATEGORIES[index % len(CATEGORIES)],
        "tags": [
            f"tag-{(index * 7 + i) % 200}"
            for i in range(TAGS_PER_ARTICLE)
        ],
        "published_at": publish_date,
        "view_count": index * 13 % 50_000,
    }


def seed_database(conn):
    # Assumes a psycopg-style DB-API connection (%(name)s placeholders)
    articles = [generate_article(i) for i in range(ARTICLE_COUNT)]

    # One row per (article, tag) pair
    tag_rows = [
        {"slug": article["slug"], "tag": tag}
        for article in articles
        for tag in article["tags"]
    ]

    # executemany is a cursor method, not a connection method
    with conn.cursor() as cur:
        cur.executemany(
            """
            INSERT INTO articles (slug, title, content, category, published_at, view_count)
            VALUES (%(slug)s, %(title)s, %(content)s, %(category)s, %(published_at)s, %(view_count)s)
            ON CONFLICT (slug) DO NOTHING
            """,
            articles,
        )
        cur.executemany(
            """
            INSERT INTO article_tags (article_slug, tag)
            VALUES (%(slug)s, %(tag)s)
            ON CONFLICT DO NOTHING
            """,
            tag_rows,
        )

    conn.commit()
    print(f"Seeded {len(articles)} articles with {len(tag_rows)} tags")

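The seed script calls random.seed(42), but its determinism does not actually depend on the seed: every field is computed arithmetically from the article index, so the same index always produces the same row. The fixed seed only matters if calls to random are added later. The content is synthetic (repeated “x ” strings) because the performance test measures query and serialization time, not content rendering. Real content would make the seed script slower without improving the signal.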
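Determinism is easy to claim and easy to break, so it can be worth checking mechanically. The sketch below hashes a canonical encoding of the generated rows and compares two passes. The names sample_row and dataset_fingerprint are hypothetical, and sample_row is a cut-down stand-in for generate_article so the example runs on its own.

```python
import hashlib
import json
from datetime import datetime, timedelta


def sample_row(index: int) -> dict:
    # Stand-in for generate_article: every field is a pure function of index
    return {
        "slug": f"article-{index:05d}",
        "published_at": datetime(2024, 1, 1) + timedelta(days=index % 730),
        "view_count": index * 13 % 50_000,
    }


def dataset_fingerprint(rows) -> str:
    """SHA-256 over a canonical JSON encoding of the rows, in order."""
    digest = hashlib.sha256()
    for row in rows:
        # sort_keys fixes key order; default=str makes datetimes serializable
        encoded = json.dumps(row, sort_keys=True, default=str)
        digest.update(encoded.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


first = dataset_fingerprint(sample_row(i) for i in range(1_000))
second = dataset_fingerprint(sample_row(i) for i in range(1_000))
print(first == second)  # True as long as the generator stays deterministic
```

Logging the fingerprint at the end of the seed step gives a single value to compare across runners. If it differs between two CI runs, the dataset changed and latency differences between those runs are not comparable.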

Parallelizing Performance and Functional Tests

Performance tests take 90-120 seconds plus setup time. Running them sequentially after functional tests adds 5-10 minutes to every PR. Running them in parallel keeps CI fast.

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements-test.txt
      - run: pytest tests/unit/ -x --tb=short

  integration-tests:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:16
        # ... service config
    steps:
      - uses: actions/checkout@v4
      - run: pytest tests/integration/ -x --tb=short

  performance-tests:
    runs-on: ubuntu-latest
    needs: [unit-tests]  # Only run if unit tests pass
    services:
      postgres:
        image: postgres:16
        # ... service config
      redis:
        image: redis:7
        # ... service config
    steps:
      - uses: actions/checkout@v4
      - name: Build and start application
        run: |
          docker build -t content-platform:ci .
          docker run -d --name app --network host \
            --cpus 2 --memory 4g \
            -e DATABASE_URL=postgresql://app:ci_pass@localhost:5432/content_platform \
            -e REDIS_URL=redis://localhost:6379 \
            content-platform:ci
      - name: Seed and warm up
        run: |
          docker exec app python scripts/seed_ci_data.py
          # Warmup: hit every endpoint once to initialize caches
          curl -s http://localhost:8000/api/articles/article-00001 > /dev/null
          curl -s "http://localhost:8000/api/search?q=performance" > /dev/null
          curl -s http://localhost:8000/api/trending > /dev/null
          sleep 5  # Let connection pools stabilize

      - name: Run Locust
        run: |
          pip install locust
          locust -f tests/perf/ci_locustfile.py \
            --headless \
            --users 20 \
            --spawn-rate 5 \
            --run-time 90s \
            --host http://localhost:8000

      - name: Check regression
        run: python tests/perf/compare_perf.py

      - name: Store results artifact
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: perf-${{ github.sha }}
          path: perf-results.json
          retention-days: 90

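The needs: [unit-tests] dependency means performance tests only run if unit tests pass. No point benchmarking broken code. Integration tests have no such dependency, so they start immediately and overlap with the performance job. Each job gets its own runner and its own service containers, so the two share no state.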

Handling Flaky Performance Results

A CI performance test that fails randomly is worse than no performance test at all. Teams learn to ignore it, re-run the workflow, and merge anyway. Three strategies reduce flakiness.

Strategy 1: Multiple runs with statistical aggregation.

# run_perf_stable.py: Run Locust 3 times and take the median

import subprocess
import json
import statistics
from pathlib import Path


def run_locust() -> dict:
    subprocess.run(
        [
            "locust", "-f", "tests/perf/ci_locustfile.py",
            "--headless", "--users", "20",
            "--spawn-rate", "5", "--run-time", "60s",
            "--host", "http://localhost:8000",
        ],
        check=True,
    )
    return json.loads(Path("perf-results.json").read_text())


def median_results(runs: list[dict]) -> dict:
    merged = {"endpoints": {}}
    all_endpoints = set()
    for run in runs:
        all_endpoints.update(run["endpoints"].keys())

    for endpoint in all_endpoints:
        endpoint_runs = [
            r["endpoints"][endpoint]
            for r in runs
            if endpoint in r["endpoints"]
        ]
        if not endpoint_runs:
            continue
        merged["endpoints"][endpoint] = {
            "count": int(statistics.median(r["count"] for r in endpoint_runs)),
            "p50": statistics.median(r["p50"] for r in endpoint_runs),
            "p95": statistics.median(r["p95"] for r in endpoint_runs),
            "p99": statistics.median(r["p99"] for r in endpoint_runs),
            "mean": statistics.median(r["mean"] for r in endpoint_runs),
            "failures": max(r["failures"] for r in endpoint_runs),
        }
    return merged


runs = [run_locust() for _ in range(3)]
stable_results = median_results(runs)
Path("perf-results.json").write_text(json.dumps(stable_results, indent=2))

Three runs at 60 seconds each take 3 minutes, plus startup overhead. The median filters out a single run that happened during a CPU spike on the shared runner.
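A small worked example shows why the median is used rather than the mean. The latencies below are made-up illustrative values, not measurements from the chapter.

```python
import statistics

# Hypothetical p95 latencies (ms) for one endpoint across three runs;
# the third run coincided with a noisy neighbour on the runner.
p95_runs = [182.0, 187.0, 410.0]

mean_p95 = statistics.mean(p95_runs)      # dragged upward by the outlier
median_p95 = statistics.median(p95_runs)  # 187.0, the middle value

print(f"mean={mean_p95:.1f}ms median={median_p95:.1f}ms")
```

With three runs the median tolerates exactly one bad run. If two of the three runs are disturbed, the median is disturbed as well, which is the case the stability check in Strategy 2 is meant to catch.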

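Strategy 2: Coefficient of variation check. Compute the coefficient of variation (standard deviation divided by mean) of each endpoint's p95 across the three runs. If it exceeds 15%, the environment is too noisy to trust the result. Warn instead of fail.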

def check_stability(runs: list[dict], max_cv: float = 0.15) -> bool:
    for endpoint in runs[0]["endpoints"]:
        p95_values = [
            r["endpoints"][endpoint]["p95"]
            for r in runs
            if endpoint in r["endpoints"]
        ]
        if len(p95_values) < 2:
            continue
        mean = statistics.mean(p95_values)
        stdev = statistics.stdev(p95_values)
        cv = stdev / mean if mean > 0 else 0
        if cv > max_cv:
            print(
                f"WARNING: {endpoint} p95 has CV={cv:.2f} "
                f"(values: {p95_values}). Environment too noisy for "
                f"reliable comparison."
            )
            return False
    return True
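How the stability check feeds into the gate is a design choice the text leaves open. One possible wiring, sketched here under assumptions, returns a three-way verdict per endpoint. The function names and the 10% regression tolerance are hypothetical; only the 15% CV limit comes from the text.

```python
import statistics


def coefficient_of_variation(values: list[float]) -> float:
    """Sample standard deviation divided by the mean (0.0 if undefined)."""
    if len(values) < 2:
        return 0.0
    mean = statistics.mean(values)
    return statistics.stdev(values) / mean if mean > 0 else 0.0


def gate_decision(
    p95_runs: list[float],
    baseline_p95: float,
    tolerance: float = 0.10,  # assumed regression budget, not from the chapter
    max_cv: float = 0.15,
) -> str:
    """Return "pass", "fail", or "warn" for one endpoint."""
    if coefficient_of_variation(p95_runs) > max_cv:
        return "warn"  # too noisy to trust either a pass or a fail
    observed = statistics.median(p95_runs)
    return "fail" if observed > baseline_p95 * (1 + tolerance) else "pass"


print(gate_decision([180, 185, 190], baseline_p95=180))  # stable, within budget
print(gate_decision([240, 245, 250], baseline_p95=180))  # stable, regressed
print(gate_decision([180, 190, 400], baseline_p95=180))  # one outlier, too noisy
```

Only "fail" would map to a non-zero exit code. A "warn" would be surfaced in the job summary so that a noisy runner does not block the merge, while still leaving a trace that the measurement was unreliable.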

Strategy 3: Relative comparison within the same CI run. Instead of comparing against a stored baseline, run the test twice in the same job: once with the base branch code and once with the PR code. Same hardware, same moment in time. The comparison is relative, not absolute.

- name: Test base branch performance
  run: |
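    git stash
    # actions/checkout makes a shallow clone by default, so fetch the base commit explicitly
    git fetch --depth=1 origin ${{ github.event.pull_request.base.sha }}
    git checkout ${{ github.event.pull_request.base.sha }}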
    docker build -t content-platform:base .
    # ... run locust, save to perf-results-base.json
    
- name: Test PR branch performance
  run: |
    git checkout ${{ github.sha }}
    docker build -t content-platform:pr .
    # ... run locust, save to perf-results-pr.json

- name: Compare
  run: python tests/perf/compare_relative.py perf-results-base.json perf-results-pr.json

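This removes runner-to-runner hardware variance, because both builds are measured on the same machine minutes apart. Noise within the job can still affect one run more than the other, so the result is more trustworthy, not noise-free. The cost is roughly double the CI time. Use this approach for release branches where accuracy matters more than speed.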
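The workflow calls compare_relative.py, which is not shown in this section. A minimal sketch of what such a script might contain follows. It assumes both files use the same {"endpoints": {...}} layout as perf-results.json, and the 10% slowdown limit is a hypothetical default.

```python
import json
from pathlib import Path


def compare_relative(base: dict, pr: dict, max_slowdown: float = 0.10) -> list[str]:
    """Return one message per endpoint whose p95 regressed beyond max_slowdown."""
    regressions = []
    for endpoint, pr_metrics in pr["endpoints"].items():
        base_metrics = base["endpoints"].get(endpoint)
        if base_metrics is None or base_metrics["p95"] <= 0:
            continue  # new endpoint or unusable base value: nothing to compare
        ratio = pr_metrics["p95"] / base_metrics["p95"]
        if ratio > 1 + max_slowdown:
            regressions.append(
                f"{endpoint}: p95 {base_metrics['p95']:.0f}ms -> "
                f"{pr_metrics['p95']:.0f}ms ({(ratio - 1) * 100:+.0f}%)"
            )
    return regressions


def main(base_path: str, pr_path: str) -> int:
    """Load both result files and return a process exit code."""
    base = json.loads(Path(base_path).read_text())
    pr = json.loads(Path(pr_path).read_text())
    problems = compare_relative(base, pr)
    for line in problems:
        print(f"REGRESSION {line}")
    return 1 if problems else 0
```

Wiring main to the two command-line arguments and passing its return value to sys.exit would make the Compare step fail on regression. Endpoints that exist only in the PR are skipped rather than failed, since there is no base measurement to compare against.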

Artifact Retention and Result History

CI artifacts expire. GitHub Actions defaults to 90 days. After 90 days, the performance results for a specific commit are gone.

For long-term tracking, push results to external storage after each run:

- name: Archive to S3
  if: github.ref == 'refs/heads/main'
  run: |
    TIMESTAMP=$(date +%Y%m%d-%H%M%S)
    aws s3 cp perf-results.json \
      "s3://perf-results-bucket/content-platform/${TIMESTAMP}-${GITHUB_SHA:0:8}.json"

Only archive results from the main branch. PR results are useful for regression detection during review, not for historical analysis. The main branch results form the trend line.

A simple script queries S3 to build the trend:

# trend.py: Build performance trend from archived results

import boto3
import json

s3 = boto3.client("s3")
bucket = "perf-results-bucket"
prefix = "content-platform/"

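# list_objects_v2 returns at most 1,000 keys per call, so paginate
paginator = s3.get_paginator("list_objects_v2")
keys = [
    obj["Key"]
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix)
    for obj in page.get("Contents", [])  # "Contents" is absent when nothing matches
]
results = []

for key in sorted(keys):
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    data = json.loads(body)
    # Key format: content-platform/<YYYYMMDD>-<HHMMSS>-<sha>.json
    timestamp = key.split("/")[1].split("-")[0]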
    for endpoint, metrics in data["endpoints"].items():
        results.append({
            "date": timestamp,
            "endpoint": endpoint,
            "p50": metrics["p50"],
            "p95": metrics["p95"],
            "p99": metrics["p99"],
        })

# Output CSV for Grafana or spreadsheet import
print("date,endpoint,p50,p95,p99")
for r in results:
    print(f"{r['date']},{r['endpoint']},{r['p50']},{r['p95']},{r['p99']}")

This creates the raw data for the drift analysis covered in Section 2.

When the Gate Fails on Purpose

Sometimes a PR is intentionally slow. Adding full-text search to the article API requires a database query that is inherently slower than a primary key lookup. The performance gate will fail. This is correct behavior.

The developer has three options:

1. Update the baseline. If the new latency is acceptable, update perf-baseline.json in the same PR. The reviewer sees both the code change and the new performance expectation.
2. Add a skip annotation. For experimental features behind a flag, skip the performance check for specific endpoints:

{
  "thresholds": {
    "/api/search": {
      "p95_ms": 200,
      "skip_until": "2025-02-01",
      "skip_reason": "Full-text search migration in progress, #1234"
    }
  }
}

3. Optimize until it fits. The budget forces the developer to think about performance during development, not after deployment. Maybe the full-text search needs an index. Maybe the query needs a LIMIT. The gate creates the feedback loop.

Option 3 is the whole point. The performance gate does not exist to block merges. It exists to create a moment where the developer asks “can I make this faster?” before the code reaches production.
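For the skip annotation to be safe, the comparison script has to honor its expiry date. A minimal sketch of that check is below. The function name is_skipped is hypothetical, and treating skip_until as the first day the check is enforced again is an assumption about the intended semantics.

```python
from datetime import date


def is_skipped(threshold: dict, today: date) -> bool:
    """True while the threshold's skip_until date has not yet been reached."""
    skip_until = threshold.get("skip_until")
    if skip_until is None:
        return False  # no annotation: always enforce
    return today < date.fromisoformat(skip_until)


threshold = {
    "p95_ms": 200,
    "skip_until": "2025-02-01",
    "skip_reason": "Full-text search migration in progress, #1234",
}

print(is_skipped(threshold, date(2025, 1, 31)))  # still inside the skip window
print(is_skipped(threshold, date(2025, 2, 1)))   # enforced again from this day
```

Because the skip expires on its own, a temporary exemption cannot silently become permanent. Once the date passes, the endpoint either meets its budget or the gate fails again.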