fast by design

Continuous Performance Testing: Locust in CI, Regression Detection, and the Baseline That Drifts

11 min read Chapter 88 of 90

The content platform passes all functional tests. CI is green. A developer merges a PR that adds a new field to the article API response. The field requires a JOIN against the tags table. Nobody notices the latency change because nobody measured it.

Two weeks later, p99 latency on the article endpoint has crept from 180ms to 340ms. The on-call engineer looks at the commit history and sees 47 merged PRs. Which one caused the regression? Maybe it was one PR. Maybe it was three PRs that each added 50ms. The engineer spends two days bisecting commits to find the problem.

This is the cost of not running performance tests in CI. Functional tests answer “does it work?” Performance tests answer “does it work fast enough?” Both questions matter. Most teams only ask the first one.

The Problem with Manual Performance Testing

Manual performance testing follows a predictable pattern:

  1. A performance engineer runs Locust against a staging environment before a release
  2. They compare the results against their memory of what the numbers were last time
  3. They write a report with screenshots of Grafana dashboards
  4. Nobody reads the report until production is slow

This process has three failure modes.

Inconsistent environments. The staging server that ran last month’s test had 16GB of RAM. The current staging server has 8GB because someone resized it for cost savings. The test results are not comparable, but nobody knows that.

Human baseline comparison. The engineer remembers that p95 was “around 200ms.” It was actually 165ms. They see 210ms and call it acceptable. A 27% regression ships to production.

Infrequent execution. Manual tests run before major releases, maybe quarterly. Three months of merged PRs means hundreds of potential causes for any observed regression.

# SLOW: Manual performance test with no baseline comparison
# Run by a human, compared against memory, results in a PDF

from locust import HttpUser, task, between

class ManualArticleTest(HttpUser):
    wait_time = between(1, 3)

    @task
    def get_article(self):
        self.client.get("/api/articles/distributed-tracing-guide")

    @task
    def search_articles(self):
        self.client.get("/api/search?q=performance+optimization")

# Run: locust -f manual_test.py --headless -u 100 -r 10 --run-time 60s
# Results: printed to terminal, copy-pasted to Slack, lost forever

The fix is to run performance tests on every PR, compare results against a stored baseline, and fail the build when latency exceeds the budget. No human judgment required. No reports to ignore.

Designing the CI Performance Test

A CI performance test is not a production load test. Production load tests simulate thousands of users over extended periods to find capacity limits. CI performance tests simulate a small, consistent load over a short period to detect relative changes.

The design constraints for CI performance tests:

Constraint   | Production Load Test  | CI Performance Test
-------------|-----------------------|------------------------
Duration     | 30-60 minutes         | 60-120 seconds
Users        | Hundreds to thousands | 10-50
Environment  | Production-like       | Minimal but consistent
Goal         | Find capacity limits  | Detect regressions
Frequency    | Monthly/quarterly     | Every PR
Pass/fail    | Human judgment        | Automated threshold

The CI test does not need to prove the system can handle 10,000 concurrent users. It needs to prove this PR did not make things slower.

# FAST: CI-oriented Locust test with structured output for automated comparison

import json
import time
from pathlib import Path
from locust import HttpUser, task, between, events

RESULTS = {
    "endpoints": {},
    "start_time": None,
    "end_time": None,
    "total_requests": 0,
    "total_failures": 0,
}


class CIArticlePlatformUser(HttpUser):
    wait_time = between(0.5, 1.5)
    host = "http://localhost:8000"

    @task(5)
    def get_article(self):
        self.client.get(
            "/api/articles/distributed-tracing-guide",
            name="/api/articles/[slug]",
        )

    @task(3)
    def search_articles(self):
        self.client.get(
            "/api/search?q=performance",
            name="/api/search",
        )

    @task(2)
    def get_recommendations(self):
        self.client.get(
            "/api/articles/distributed-tracing-guide/recommendations",
            name="/api/articles/[slug]/recommendations",
        )

    @task(1)
    def get_trending(self):
        self.client.get(
            "/api/trending",
            name="/api/trending",
        )


@events.request.add_listener
def on_request(request_type, name, response_time, response_length, exception, **kwargs):
    if name not in RESULTS["endpoints"]:
        RESULTS["endpoints"][name] = {
            "response_times": [],
            "failures": 0,
        }
    endpoint = RESULTS["endpoints"][name]
    if exception:
        endpoint["failures"] += 1
        RESULTS["total_failures"] += 1
    else:
        endpoint["response_times"].append(response_time)
    RESULTS["total_requests"] += 1


@events.quitting.add_listener
def on_quitting(environment, **kwargs):
    RESULTS["end_time"] = time.time()
    output = compute_summary(RESULTS)
    output_path = Path("perf-results.json")
    output_path.write_text(json.dumps(output, indent=2))
    print(f"\nResults written to {output_path}")


def compute_summary(results):
    summary = {
        "duration_seconds": results["end_time"] - results["start_time"],
        "total_requests": results["total_requests"],
        "total_failures": results["total_failures"],
        "error_rate": results["total_failures"] / max(results["total_requests"], 1),
        "endpoints": {},
    }
    for name, data in results["endpoints"].items():
        times = sorted(data["response_times"])
        if not times:
            continue
        summary["endpoints"][name] = {
            "count": len(times),
            "failures": data["failures"],
            "p50": times[len(times) // 2],
            "p95": times[int(len(times) * 0.95)],
            "p99": times[int(len(times) * 0.99)],
            "mean": sum(times) / len(times),
            "min": times[0],
            "max": times[-1],
        }
    return summary


@events.init.add_listener
def on_init(environment, **kwargs):
    RESULTS["start_time"] = time.time()

This test writes structured JSON output. Every field is machine-readable. No parsing terminal output with regex. No scraping HTML reports.
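
The percentile lookups in compute_summary index straight into the sorted list, and with the small sample counts of a 90-second run the exact index convention can shift the reported value. As a point of comparison, here is a minimal sketch of one common convention, nearest-rank, written as a standalone helper. The name nearest_rank_percentile and the sample values are hypothetical; neither comes from Locust or from the test above.

```python
import math


def nearest_rank_percentile(sorted_times, pct):
    """Return the nearest-rank percentile of an ascending list.

    Nearest-rank picks the value at 1-based rank ceil(pct * n / 100),
    so the result is always one of the observed response times.
    """
    if not sorted_times:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    rank = math.ceil(pct * len(sorted_times) / 100)
    return sorted_times[rank - 1]


# Hypothetical response times in milliseconds, for illustration only.
times = sorted([12.0, 15.0, 11.0, 40.0, 13.0, 14.0, 90.0, 16.0, 17.0, 18.0])
p50 = nearest_rank_percentile(times, 50)
p95 = nearest_rank_percentile(times, 95)
```

With only ten samples, p95 and p99 both land on the slowest request. That is a reminder that tail percentiles from very short runs rest on a handful of data points.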

The Baseline File

The baseline is a JSON file checked into the repository. It contains the expected performance numbers for each endpoint.

{
  "version": 2,
  "environment": {
    "cpu_cores": 2,
    "memory_mb": 4096,
    "container_image": "content-platform:ci"
  },
  "thresholds": {
    "/api/articles/[slug]": {
      "p50_ms": 45,
      "p95_ms": 120,
      "p99_ms": 180,
      "error_rate": 0.001
    },
    "/api/search": {
      "p50_ms": 80,
      "p95_ms": 200,
      "p99_ms": 350,
      "error_rate": 0.005
    },
    "/api/articles/[slug]/recommendations": {
      "p50_ms": 60,
      "p95_ms": 150,
      "p99_ms": 250,
      "error_rate": 0.002
    },
    "/api/trending": {
      "p50_ms": 30,
      "p95_ms": 80,
      "p99_ms": 120,
      "error_rate": 0.001
    }
  },
  "regression_tolerance": 0.10,
  "block_on_regression": true
}

The regression_tolerance field is important. It means a result can be 10% worse than the baseline without failing the build. Performance tests have inherent variance. A test that runs on a shared CI runner will see different numbers depending on what else is running on the host. Without tolerance, the build flaps.

The Comparison Script

The comparison script reads the baseline and the test results, then decides pass or fail.

# compare_perf.py: Automated regression detection

import json
import sys
from pathlib import Path


def load_json(path: str) -> dict:
    return json.loads(Path(path).read_text())


def compare(baseline: dict, results: dict) -> tuple[bool, list[str]]:
    tolerance = baseline.get("regression_tolerance", 0.10)
    passed = True
    messages = []

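    for endpoint, thresholds in baseline["thresholds"].items():
        actual = results["endpoints"].get(endpoint)
        if not actual:
            # No successful samples (endpoint down or never exercised):
            # treat as a failure so a dead endpoint cannot pass the gate.
            passed = False
            messages.append(f"FAIL {endpoint}: no data in results")
            continue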

        for metric in ["p50", "p95", "p99"]:
            threshold_key = f"{metric}_ms"
            if threshold_key not in thresholds:
                continue

            threshold_value = thresholds[threshold_key]
            actual_value = actual.get(metric, 0)
            max_allowed = threshold_value * (1 + tolerance)

            if actual_value > max_allowed:
                passed = False
                pct_over = ((actual_value - threshold_value) / threshold_value) * 100
                messages.append(
                    f"FAIL {endpoint} {metric}: "
                    f"{actual_value:.1f}ms > {max_allowed:.1f}ms "
                    f"(baseline {threshold_value}ms, +{pct_over:.1f}%)"
                )
            else:
                messages.append(
                    f"PASS {endpoint} {metric}: "
                    f"{actual_value:.1f}ms <= {max_allowed:.1f}ms"
                )

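        if "error_rate" in thresholds:
            # "count" holds successful requests only, so add the failures
            # back to get the total number of attempts.
            failures = actual.get("failures", 0)
            total = actual.get("count", 0) + failures
            actual_error_rate = failures / max(total, 1)
            if actual_error_rate > thresholds["error_rate"]:
                passed = False
                messages.append(
                    f"FAIL {endpoint} error_rate: "
                    f"{actual_error_rate:.4f} > {thresholds['error_rate']}"
                )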

    return passed, messages


def main():
    baseline = load_json("perf-baseline.json")
    results = load_json("perf-results.json")

    passed, messages = compare(baseline, results)

    print("=" * 60)
    print("PERFORMANCE REGRESSION CHECK")
    print("=" * 60)
    for msg in messages:
        print(f"  {msg}")
    print("=" * 60)

    if not passed:
        print("RESULT: FAILED - Performance regression detected")
        if baseline.get("block_on_regression", True):
            sys.exit(1)
        else:
            print("WARNING: block_on_regression is false, not failing build")
            sys.exit(0)
    else:
        print("RESULT: PASSED - No performance regression detected")
        sys.exit(0)


if __name__ == "__main__":
    main()

The output is explicit. Every latency metric gets a PASS or FAIL line, so the CI log shows exactly what regressed and by how much. An abridged example:

============================================================
PERFORMANCE REGRESSION CHECK
============================================================
  PASS /api/articles/[slug] p50: 42.0ms <= 49.5ms
  PASS /api/articles/[slug] p95: 115.0ms <= 132.0ms
  PASS /api/articles/[slug] p99: 172.0ms <= 198.0ms
  FAIL /api/search p95: 285.0ms > 220.0ms (baseline 200ms, +42.5%)
  PASS /api/search p99: 340.0ms <= 385.0ms
  PASS /api/trending p50: 28.0ms <= 33.0ms
============================================================
RESULT: FAILED - Performance regression detected

The GitHub Actions Workflow

The full CI pipeline starts the application in a container, runs the Locust test, compares results against the baseline, and stores results as artifacts.

name: Performance Gate
on:
  pull_request:
    paths:
      - 'src/**'
      - 'requirements.txt'
      - 'Dockerfile'

jobs:
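  performance-test:
    runs-on: ubuntu-latest
    timeout-minutes: 15
    # Needed by the "Comment on PR" step when the default token is read-only
    permissions:
      contents: read
      pull-requests: write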

    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_DB: content_platform
          POSTGRES_USER: app
          POSTGRES_PASSWORD: ci_test_password
        ports:
          - 5432:5432
        options: >-
          --health-cmd="pg_isready"
          --health-interval=5s
          --health-timeout=3s
          --health-retries=5

      redis:
        image: redis:7
        ports:
          - 6379:6379
        options: >-
          --health-cmd="redis-cli ping"
          --health-interval=5s
          --health-timeout=3s
          --health-retries=5

    steps:
      - uses: actions/checkout@v4

      - name: Build application
        run: docker build -t content-platform:ci .

      - name: Start application
        run: |
          docker run -d --name app \
            --network host \
            -e DATABASE_URL=postgresql://app:ci_test_password@localhost:5432/content_platform \
            -e REDIS_URL=redis://localhost:6379 \
            -e ENVIRONMENT=ci \
            content-platform:ci
          
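          # Wait for application to be ready; fail the step if it never is
          ready=false
          for i in $(seq 1 30); do
            if curl -sf http://localhost:8000/health > /dev/null; then
              echo "Application is ready"
              ready=true
              break
            fi
            echo "Waiting for application... ($i/30)"
            sleep 2
          done
          if [ "$ready" != "true" ]; then
            echo "Application did not become ready in time"
            docker logs app
            exit 1
          fi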

      - name: Seed test data
        run: |
          docker exec app python scripts/seed_ci_data.py

      - name: Install Locust
        run: pip install locust

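      - name: Run performance test
        run: |
          # Locust exits non-zero if any request failed. Exit 0 here so the
          # comparison step can apply the per-endpoint error_rate thresholds.
          locust -f tests/perf/ci_locustfile.py \
            --headless \
            --users 20 \
            --spawn-rate 5 \
            --run-time 90s \
            --host http://localhost:8000 \
            --csv perf-results \
            --html perf-report.html \
            --exit-code-on-error 0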

      - name: Compare against baseline
        run: python tests/perf/compare_perf.py

      - name: Upload results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: perf-results-${{ github.sha }}
          path: |
            perf-results.json
            perf-report.html
            perf-results_stats.csv
          retention-days: 90

      - name: Comment on PR
        if: failure()
        uses: actions/github-script@v7
        with:
          script: |
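            const fs = require('fs');
            // The job can fail before Locust runs; skip the comment if there are no results.
            if (!fs.existsSync('perf-results.json')) return;
            const results = JSON.parse(fs.readFileSync('perf-results.json', 'utf8'));
            const baseline = JSON.parse(fs.readFileSync('perf-baseline.json', 'utf8'));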
            
            let body = '## Performance Regression Detected\n\n';
            body += '| Endpoint | Metric | Baseline | Actual | Status |\n';
            body += '|---|---|---|---|---|\n';
            
            for (const [endpoint, thresholds] of Object.entries(baseline.thresholds)) {
              const actual = results.endpoints[endpoint];
              if (!actual) continue;
              for (const metric of ['p50', 'p95', 'p99']) {
                const key = `${metric}_ms`;
                if (!thresholds[key]) continue;
                const maxAllowed = thresholds[key] * (1 + baseline.regression_tolerance);
                const status = actual[metric] > maxAllowed ? '**FAIL**' : 'PASS';
                body += `| ${endpoint} | ${metric} | ${thresholds[key]}ms | ${actual[metric].toFixed(1)}ms | ${status} |\n`;
              }
            }
            
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: body
            });

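This workflow runs only when source code, dependencies, or the Dockerfile change. Documentation changes do not trigger the job, which timeout-minutes caps at 15 minutes. The paths filter keeps the CI bill under control. As written, the filter does not match tests/perf/** or perf-baseline.json, so a PR that edits only the load test or the baseline will not run the gate; add those paths if it should.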

Performance Budgets: Block vs Warn

Not every regression warrants a blocked merge. A 5% increase in p50 latency might be acceptable if the PR adds a feature that users have been requesting. A 50% increase in p99 is never acceptable.

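The baseline file can mark each metric as blocking or advisory. The comparison script shown earlier does not read these fields and would need to be extended to honor them: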

{
  "thresholds": {
    "/api/articles/[slug]": {
      "p50_ms": 45,
      "p95_ms": 120,
      "p99_ms": 180,
      "p50_action": "warn",
      "p95_action": "block",
      "p99_action": "block"
    }
  }
}

With this configuration, a p50 regression posts a warning comment on the PR but does not block the merge. A p95 or p99 regression blocks the merge. The team decides which metrics are blockers and which are advisory.

This creates a tiered response:

  • p50 regression: The median user experience got slower. Worth investigating but might be an acceptable trade-off for new functionality.
  • p95 regression: One in twenty users is hitting a slow path. This is a real problem.
  • p99 regression: The tail is growing. Under load, this will become the p95. Block the merge.
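
One way the tiered response could be wired into the comparison step is sketched below. It assumes the "<metric>_action" keys from the baseline fragment above, treats a missing action as "block", and uses a hypothetical function name (evaluate_endpoint) and hypothetical measurements. It is an illustration of the idea, not the chapter's comparison script.

```python
def evaluate_endpoint(endpoint, thresholds, actual, tolerance=0.10):
    """Split threshold violations into blocking and advisory lists.

    Assumes the baseline stores an optional "<metric>_action" key next to
    each "<metric>_ms" threshold; metrics without one default to "block".
    """
    blocking, warnings = [], []
    for metric in ("p50", "p95", "p99"):
        limit = thresholds.get(f"{metric}_ms")
        value = actual.get(metric)
        if limit is None or value is None:
            continue  # nothing to compare for this metric
        max_allowed = limit * (1 + tolerance)
        if value <= max_allowed:
            continue  # within budget
        message = f"{endpoint} {metric}: {value:.1f}ms > {max_allowed:.1f}ms"
        action = thresholds.get(f"{metric}_action", "block")
        if action == "warn":
            warnings.append(message)
        else:
            blocking.append(message)
    return blocking, warnings


thresholds = {
    "p50_ms": 45, "p95_ms": 120, "p99_ms": 180,
    "p50_action": "warn", "p95_action": "block", "p99_action": "block",
}
# Hypothetical measurements: p50 and p99 over budget, p95 within it.
actual = {"p50": 55.0, "p95": 125.0, "p99": 210.0}
blocking, warnings = evaluate_endpoint("/api/articles/[slug]", thresholds, actual)

# Only blocking violations should fail the build; warnings are reported.
exit_code = 1 if blocking else 0
```

Defaulting to "block" when no action is set keeps the gate strict for endpoints nobody has classified yet. A team that prefers the opposite default can change that one line.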

Reducing Variance in CI

CI runners are shared infrastructure. The same test on the same code will produce different numbers depending on CPU contention, disk I/O from other jobs, network latency to the database container, and memory pressure.

Techniques that reduce variance:

Pin CPU and memory for the application container. Docker resource limits create a consistent ceiling.

docker run -d --name app \
  --cpus 2 \
  --memory 4g \
  --network host \
  content-platform:ci

Warm up before measuring. The first 15 seconds of a Locust test hit cold caches, uninitialized connection pools, and, on runtimes that have one, JIT compilation. Exclude the warmup period from results.

# In the Locust test: skip the first 15 seconds of data
@events.request.add_listener
def on_request(request_type, name, response_time, exception, **kwargs):
    elapsed = time.time() - RESULTS["start_time"]
    if elapsed < 15:
        return  # skip warmup requests
    # ... record the request

Run multiple iterations and take the median. A single 90-second run can be skewed by a transient anomaly on the runner. Three 90-second runs, with the median taken per metric, are more stable.
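
A minimal sketch of the median step follows. It assumes each run produced a summary shaped like perf-results.json. The function name median_of_runs and the sample numbers are hypothetical.

```python
import statistics


def median_of_runs(runs):
    """Merge several run summaries by taking the per-metric median.

    Each run is assumed to look like perf-results.json:
    {"endpoints": {name: {"p50": ..., "p95": ..., "p99": ...}}}.
    Endpoints missing from any run are left out of the merged result.
    """
    if not runs:
        raise ValueError("need at least one run")
    common = set(runs[0]["endpoints"])
    for run in runs[1:]:
        common &= set(run["endpoints"])
    merged = {"endpoints": {}}
    for name in sorted(common):
        merged["endpoints"][name] = {
            metric: statistics.median(
                run["endpoints"][name][metric] for run in runs
            )
            for metric in ("p50", "p95", "p99")
        }
    return merged


# Hypothetical summaries from three runs; the second hit a noisy neighbor.
runs = [
    {"endpoints": {"/api/search": {"p50": 78.0, "p95": 118.0, "p99": 300.0}}},
    {"endpoints": {"/api/search": {"p50": 95.0, "p95": 190.0, "p99": 410.0}}},
    {"endpoints": {"/api/search": {"p50": 80.0, "p95": 121.0, "p99": 310.0}}},
]
merged = median_of_runs(runs)
```

The median of per-run percentiles is not the same as a percentile over the pooled samples, but it is simple and discards a single outlier run. The merged dictionary can be written out and passed to the comparison step in place of a single run's results.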

Use dedicated CI runners. If performance testing is critical, use self-hosted runners with consistent hardware. No neighbor noise. No CPU stealing. The cost is worth the signal quality.

Trade-offs

Decision                   | Benefit                        | Cost
---------------------------|--------------------------------|---------------------------------------------
Run perf tests on every PR | Catch regressions immediately  | CI time increases 5-10 min per PR
Store baseline in repo     | Version-controlled, reviewable | Manual updates when intentional changes ship
10% tolerance              | Absorbs CI variance            | Misses regressions under 10%
Block on p95/p99 only      | Does not block feature work    | p50 regressions accumulate silently
Container resource limits  | Consistent results             | Does not reflect production capacity
Warmup exclusion           | Removes cold-start noise       | Misses cold-start regressions

The hardest trade-off is tolerance. Set it too low and the build flaps on every PR. Set it too high and real regressions slip through. Start at 10%, track how often the gate flaps (fails then passes on retry without code changes), and adjust. If it flaps more than once a week, increase tolerance. If regressions are landing in production, decrease it.
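
The adjustment rule above can be written down as a small helper so the decision is recorded rather than made ad hoc. This is only a sketch: the function name suggest_tolerance, the step size, and the floor and ceiling are illustrative assumptions. Letting escaped regressions take priority over flapping is also a choice the text does not make.

```python
def suggest_tolerance(current, flaps_per_week, escaped_regressions,
                      step=0.02, floor=0.02, ceiling=0.30):
    """Nudge the regression tolerance one step, following the rule of thumb.

    step, floor, and ceiling are illustrative values, not recommendations.
    Escaped regressions take priority over flapping (an assumption).
    """
    if escaped_regressions > 0:
        proposed = current - step   # regressions reached production: tighten
    elif flaps_per_week > 1:
        proposed = current + step   # gate flaps too often: loosen
    else:
        proposed = current          # leave it alone
    return round(min(max(proposed, floor), ceiling), 4)


loosened = suggest_tolerance(0.10, flaps_per_week=3, escaped_regressions=0)
tightened = suggest_tolerance(0.10, flaps_per_week=0, escaped_regressions=2)
unchanged = suggest_tolerance(0.10, flaps_per_week=1, escaped_regressions=0)
```

If the gate both flaps and lets regressions through, no tolerance value fixes it. That points back to reducing variance on the runner.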

The next two sections cover the full GitHub Actions integration in detail (Section 1) and long-term baseline drift management with Prometheus and Grafana (Section 2).
