CI/CD as a Safety Gate: Performance Regression Testing in the Pipeline

The Symptom

A pull request passes code review. Two senior engineers approve it. The change adds audit logging to the fare estimation endpoint. The diff is clean: a new service call that writes an audit record to PostgreSQL for every fare estimate request.

The PR merges Monday at 2 PM. Tuesday’s standup mentions nothing unusual. Wednesday morning, a product manager reports that fare estimates feel slower. The on-call engineer checks the dashboard. The fare-estimate endpoint’s p99 has climbed from 120ms to 680ms. A 5.7x regression.

The git bisect takes 40 minutes. The audit logging PR is the culprit. The audit write is synchronous. Inside a reactive WebFlux chain, a blocking jdbcTemplate.update() call runs on the Netty event loop thread. Every fare estimate request blocks an event loop thread for 15-20ms while the audit record writes to disk. At 2,000 RPS, that is 30-40 thread-seconds of blocking I/O demanded every second, and 4 event loop threads can supply only 4. Requests queue. Latency climbs.
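The saturation arithmetic can be checked with a few lines. This is a minimal sketch using the figures quoted in this section (2,000 RPS, 15-20ms per blocking write, 4 event loop threads); the function name blocking_demand is illustrative and not part of any codebase described here.

```python
def blocking_demand(rps: float, block_ms: float, threads: int) -> tuple[float, float]:
    """Return (thread-seconds of blocking demanded per second, demand / capacity)."""
    demand = rps * block_ms / 1000.0   # each request holds a thread for block_ms
    return demand, demand / threads    # each thread supplies 1 thread-second per second


for block_ms in (15, 20):
    demand, ratio = blocking_demand(rps=2000, block_ms=block_ms, threads=4)
    print(f"{block_ms} ms block: {demand:.0f} thread-seconds/s demanded, "
          f"{ratio:.1f}x what 4 threads can supply")
```

Any ratio above 1.0 means work arrives faster than the event loop can drain it, so the queue and the latency keep growing until load drops.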

The fix is a 3-line change: wrap the audit call in Mono.fromCallable().subscribeOn(Schedulers.boundedElastic()) and switch the enclosing map to flatMap. The audit logging PR took 4 minutes to write and 15 minutes to review. The regression it introduced took 40 hours to detect in production.

If the CI pipeline had run Locust against the PR branch, the regression would have been caught in 3 minutes.

The Cause

Code review catches logic errors, security issues, and style problems. It does not catch performance regressions. A blocking call inside a reactive chain looks correct. The types align. The tests pass. The audit record is written. No reviewer is going to mentally model event loop thread utilization under 2,000 RPS concurrent load.

Performance regression testing requires running the application under load and measuring the results. This is a CI problem, not a review problem. The pipeline should:

1. Build the application from the PR branch
2. Deploy it in a controlled environment (Docker Compose)
3. Run Locust with a defined load profile
4. Compare results against a baseline threshold
5. Fail the build if any threshold is breached
6. Post a comparison table on the PR

The test does not need production-scale load. 50 concurrent users for 60 seconds is enough to detect a 5x regression. The goal is not to find the exact breaking point. The goal is to catch obvious regressions before they reach production.

Figure: CI/CD pipeline with a performance gate. The decision point is the p99 latency check, which determines whether a deploy proceeds to production or gets blocked.

The diagram above shows how a performance gate fits into the CI/CD pipeline as a hard decision point. After the Locust load test runs, the pipeline checks p99 latency against the baseline threshold. If the PR’s p99 is within 10% of the baseline, the deploy proceeds through staging and canary to production. If it exceeds the threshold, the deploy is blocked, the team is alerted, and a regression report is posted on the PR. This automated gate catches regressions like the audit logging incident in 3 minutes instead of 40 hours.

The Baseline

Current CI pipeline:

Step                     Duration    Catches
Compile                  45s         Syntax errors
Unit tests               90s         Logic errors
Integration tests        120s        API contract violations
Static analysis          30s         Code style, security
Container build          60s         Dockerfile issues
Total                    ~6 min      Everything except performance

Missing step: performance regression test.

Target pipeline with performance gate:

Step                     Duration    Catches
Compile                  45s         Syntax errors
Unit tests               90s         Logic errors
Integration tests        120s        API contract violations
Static analysis          30s         Code style, security
Container build          60s         Dockerfile issues
Performance test         180s        Latency and throughput regressions
Total                    ~9 min      Complete coverage

3 minutes added to the pipeline. Catches regressions that take 40 hours to detect in production.

Performance thresholds for the rider API:

{
  "endpoints": {
    "GET /api/rides/fare-estimate": {
      "p99_ms": 200,
      "p95_ms": 150,
      "p50_ms": 80,
      "error_rate_pct": 0.1,
      "max_regression_pct": 10
    },
    "GET /api/drivers/nearby": {
      "p99_ms": 300,
      "p95_ms": 200,
      "p50_ms": 100,
      "error_rate_pct": 0.1,
      "max_regression_pct": 10
    },
    "POST /api/rides/request": {
      "p99_ms": 500,
      "p95_ms": 350,
      "p50_ms": 200,
      "error_rate_pct": 0.1,
      "max_regression_pct": 10
    }
  }
}

max_regression_pct: 10 means the PR’s p99 must not exceed the baseline by more than 10%. A baseline of 120ms allows up to 132ms. The audit logging PR’s 680ms is 467% above the baseline, far outside that limit.
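The relative check can be sketched as follows. The comparison script later in this section checks only the absolute thresholds, so this is an illustration of how max_regression_pct could be applied, not the author's implementation. It assumes a stored baseline value is available; the function name regression_check is hypothetical.

```python
def regression_check(baseline_ms: float, actual_ms: float, max_regression_pct: float):
    """Return (allowed_ms, regression_pct, passed) for one latency metric."""
    # Largest value the metric may reach before the gate fails
    allowed_ms = baseline_ms * (100 + max_regression_pct) / 100
    # How far the measured value sits above (or below) the baseline
    regression_pct = (actual_ms - baseline_ms) / baseline_ms * 100
    return allowed_ms, regression_pct, actual_ms <= allowed_ms


# 128 ms and 680 ms are the p99 values this section reports for the fixed
# and the original audit logging PR; 120 ms is the baseline p99.
for actual in (128, 680):
    allowed, pct, ok = regression_check(120, actual, 10)
    print(f"p99 {actual} ms: limit {allowed:.0f} ms, {pct:+.0f}% vs baseline, "
          f"{'PASS' if ok else 'FAIL'}")
```

With a 120ms baseline the limit is 132ms, so 128ms passes and 680ms fails.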

The Fix

Docker Compose for CI

# SCALED: docker-compose.ci.yml
version: "3.9"
services:
  rider-api:
    build:
      context: .
      dockerfile: Dockerfile
    environment:
      SPRING_PROFILES_ACTIVE: ci
      SPRING_DATASOURCE_URL: jdbc:postgresql://postgres:5432/ridehailing
      SPRING_REDIS_HOST: redis
    ports:
      - "8080:8080"
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    deploy:
      resources:
        limits:
          cpus: "2.0"
          memory: "1G"

  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: ridehailing
      POSTGRES_USER: app
      POSTGRES_PASSWORD: ci-test-only
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U app -d ridehailing"]
      interval: 5s
      timeout: 3s
      retries: 10
    deploy:
      resources:
        limits:
          cpus: "1.0"
          memory: "512M"

  redis:
    image: redis:7-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 10
    deploy:
      resources:
        limits:
          cpus: "0.5"
          memory: "256M"

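Resource limits are fixed. Every CI run gets the same CPU and memory ceiling. Without fixed limits, a container on an idle runner can use more CPU than one on a busy runner, so results vary from run to run, and that variance can look like a regression or hide one. Fixed limits reduce the variance. They do not remove it, because a busy runner can still deliver less than the limit.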

Locust configuration for CI

# SCALED: Locust test for CI performance gate
from locust import HttpUser, task, between

class CIPerformanceUser(HttpUser):
    wait_time = between(0.1, 0.5)

    @task(5)
    def fare_estimate(self):
        self.client.get("/api/rides/fare-estimate",
            params={
                "pickup_lat": 40.7128, "pickup_lng": -74.0060,
                "dropoff_lat": 40.7589, "dropoff_lng": -73.9851
            },
            name="/api/rides/fare-estimate"
        )

    @task(3)
    def nearby_drivers(self):
        self.client.get("/api/drivers/nearby",
            params={"lat": 40.7128, "lng": -74.0060, "radius_km": 2},
            name="/api/drivers/nearby"
        )

    @task(1)
    def request_ride(self):
        self.client.post("/api/rides/request",
            json={
                "rider_id": "ci-test-rider",
                "pickup_lat": 40.7128, "pickup_lng": -74.0060,
                "dropoff_lat": 40.7589, "dropoff_lng": -73.9851,
                "ride_type": "standard"
            },
            name="/api/rides/request"
        )

50 users, 60 seconds, headless:

locust -f locust_ci.py \
  --host=http://localhost:8080 \
  --users 50 \
  --spawn-rate 10 \
  --run-time 60s \
  --headless \
  --csv=results/perf \
  --only-summary

50 users instead of the 10,000 in staging. The goal is not to find the capacity limit. The goal is to detect latency changes. A blocking call that adds 15ms to every request is visible at 50 users. It pushes p99 from the 120ms baseline well past the 200ms threshold (682ms in the CI result shown later in this section), because event loop thread contention scales with concurrency, not just with total load.

GitHub Actions workflow

# SCALED: .github/workflows/performance-gate.yml
name: Performance Regression Gate

on:
  pull_request:
    paths:
      - "src/**"
      - "build.gradle"
      - "Dockerfile"

jobs:
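  performance-test:
    runs-on: ubuntu-latest
    timeout-minutes: 15
    # Lets the "Post PR comment" step write to the pull request
    permissions:
      contents: read
      pull-requests: write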

    steps:
      - uses: actions/checkout@v4

      - name: Start services
        run: docker compose -f docker-compose.ci.yml up -d --build

      - name: Wait for health
        run: |
          echo "Waiting for rider-api to be healthy..."
          for i in $(seq 1 60); do
            if curl -sf http://localhost:8080/health/ready > /dev/null 2>&1; then
              echo "Service is ready"
              break
            fi
            if [ $i -eq 60 ]; then
              echo "Service failed to start"
              docker compose -f docker-compose.ci.yml logs rider-api
              exit 1
            fi
            sleep 2
          done

      - name: Run Locust
        run: |
          pip install locust
          mkdir -p results
          locust -f locust_ci.py \
            --host=http://localhost:8080 \
            --users 50 \
            --spawn-rate 10 \
            --run-time 60s \
            --headless \
            --csv=results/perf \
            --only-summary

      - name: Compare against thresholds
        id: compare
        run: |
          python scripts/compare_perf.py \
            --results results/perf_stats.csv \
            --thresholds perf_thresholds.json \
            --output results/comparison.md

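      - name: Post PR comment
        if: always()
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const path = 'results/comparison.md';
            // The report is missing if an earlier step failed before the comparison ran
            const comparison = fs.existsSync(path)
              ? fs.readFileSync(path, 'utf8')
              : 'Performance gate did not produce a report. Check the workflow logs.';
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: comparison
            });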

      - name: Fail if regression detected
        run: |
          if grep -q "FAIL" results/comparison.md; then
            echo "Performance regression detected"
            exit 1
          fi

      - name: Cleanup
        if: always()
        run: docker compose -f docker-compose.ci.yml down -v

Comparison script

# SCALED: scripts/compare_perf.py
import csv
import json
import sys
import argparse

def parse_locust_stats(csv_path):
    results = {}
    with open(csv_path) as f:
        reader = csv.DictReader(f)
        for row in reader:
            name = row.get("Name", "")
            if name == "Aggregated" or not name:
                continue
            results[name] = {
                "p50": float(row.get("50%", 0)),
                "p95": float(row.get("95%", 0)),
                "p99": float(row.get("99%", 0)),
                "avg": float(row.get("Average Response Time", 0)),
                "error_rate": (
                    float(row.get("Failure Count", 0))
                    / max(float(row.get("Request Count", 1)), 1)
                    * 100
                ),
                "rps": float(row.get("Requests/s", 0))
            }
    return results

def compare(results, thresholds):
    output_lines = []
    overall_pass = True

    output_lines.append("## Performance Regression Report\n")
    output_lines.append(
        "| Endpoint | Metric | Threshold | Actual | Status |"
    )
    output_lines.append(
        "|----------|--------|-----------|--------|--------|"
    )

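    for endpoint, limits in thresholds.get("endpoints", {}).items():
        # Threshold keys are "METHOD /path". Locust's Name column holds only
        # the path (set via name=...), so match on the path portion.
        path = endpoint.split(" ", 1)[-1]
        matching_key = path if path in results else None

        if not matching_key:
            # No data for a gated endpoint means nothing was measured: fail it.
            overall_pass = False
            output_lines.append(
                f"| {endpoint} | - | - | NOT FOUND | FAIL |"
            )
            continue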

        actual = results[matching_key]

        checks = [
            ("p99", limits.get("p99_ms"), actual["p99"]),
            ("p95", limits.get("p95_ms"), actual["p95"]),
            ("p50", limits.get("p50_ms"), actual["p50"]),
            ("error_rate", limits.get("error_rate_pct"), actual["error_rate"]),
        ]

        for metric, threshold_val, actual_val in checks:
            if threshold_val is None:
                continue
            passed = actual_val <= threshold_val
            status = "PASS" if passed else "FAIL"
            if not passed:
                overall_pass = False
            output_lines.append(
                f"| {endpoint} | {metric} | "
                f"{threshold_val} | {actual_val:.1f} | {status} |"
            )

    verdict = "PASS" if overall_pass else "FAIL"
    output_lines.append(f"\n**Overall: {verdict}**\n")

    return "\n".join(output_lines)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--results", required=True)
    parser.add_argument("--thresholds", required=True)
    parser.add_argument("--output", required=True)
    args = parser.parse_args()

    results = parse_locust_stats(args.results)
    with open(args.thresholds) as f:
        thresholds = json.load(f)

    report = compare(results, thresholds)

    with open(args.output, "w") as f:
        f.write(report)

    print(report)
    if "FAIL" in report:
        sys.exit(1)
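The parsing logic can be sanity-checked without running Locust. The sketch below feeds a hand-written sample in the shape of Locust's stats CSV (a subset of its columns) through csv.DictReader. The numbers are invented for illustration and are not measurements. It keys each row by method and path so the keys line up with threshold entries such as "GET /api/rides/fare-estimate". The name summarize is hypothetical.

```python
import csv
import io

# Hand-written sample shaped like Locust's *_stats.csv (subset of columns).
# All values are made up for illustration only.
SAMPLE_STATS = """\
Type,Name,Request Count,Failure Count,Average Response Time,Requests/s,50%,95%,99%
GET,/api/rides/fare-estimate,4000,2,61.5,66.7,52,95,128
POST,/api/rides/request,800,0,180.2,13.3,170,310,450
,Aggregated,4800,2,81.3,80.0,60,200,400
"""


def summarize(stats_csv: str) -> dict:
    """Map 'METHOD /path' to p99 and error rate, skipping the Aggregated row."""
    summary = {}
    for row in csv.DictReader(io.StringIO(stats_csv)):
        if not row["Type"] or row["Name"] == "Aggregated":
            continue
        requests = max(int(row["Request Count"]), 1)  # guard against divide-by-zero
        summary[f'{row["Type"]} {row["Name"]}'] = {
            "p99": float(row["99%"]),
            "error_rate_pct": int(row["Failure Count"]) / requests * 100,
        }
    return summary


for endpoint, stats in summarize(SAMPLE_STATS).items():
    print(endpoint, stats)
```

Keying on method plus path also keeps two routes that share a path but differ by HTTP method from overwriting each other.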

The PR that introduced the regression

The audit logging change that would have been caught:

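// BOTTLENECK: Blocking call in reactive chain
import java.sql.Timestamp;
import java.time.Instant;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Service;
import reactor.core.publisher.Mono;

@Service
public class FareEstimationService {

    private final JdbcTemplate jdbcTemplate;

    public FareEstimationService(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // calculateFare(...) is defined elsewhere in this class (not shown)
    public Mono<FareEstimate> estimate(FareRequest request) {
        return calculateFare(request)
            .map(fare -> {
                // This blocks the Netty event loop thread
                jdbcTemplate.update(
                    "INSERT INTO audit_log (endpoint, request_id, timestamp) VALUES (?, ?, ?)",
                    "/api/rides/fare-estimate",
                    request.requestId(),
                    // java.sql.Timestamp binds reliably across JDBC drivers
                    Timestamp.from(Instant.now())
                );
                return fare;
            });
    }
}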

The CI performance test result for this PR:

| Endpoint                  | Metric | Threshold | Actual  | Status |
|---------------------------|--------|-----------|---------|--------|
| /api/rides/fare-estimate  | p99    | 200       | 682.0   | FAIL   |
| /api/rides/fare-estimate  | p95    | 150       | 510.0   | FAIL   |
| /api/rides/fare-estimate  | p50    | 80        | 245.0   | FAIL   |

**Overall: FAIL**

The PR would be blocked. The fix:

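// SCALED: Non-blocking audit in reactive chain
import java.sql.Timestamp;
import java.time.Instant;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Service;
import reactor.core.publisher.Mono;
import reactor.core.scheduler.Schedulers;

@Service
public class FareEstimationService {

    private final JdbcTemplate jdbcTemplate;

    public FareEstimationService(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // calculateFare(...) is defined elsewhere in this class (not shown)
    public Mono<FareEstimate> estimate(FareRequest request) {
        return calculateFare(request)
            .flatMap(fare ->
                // Run the blocking JDBC write on a worker thread, off the event loop
                Mono.fromCallable(() -> {
                    jdbcTemplate.update(
                        "INSERT INTO audit_log (endpoint, request_id, timestamp) VALUES (?, ?, ?)",
                        "/api/rides/fare-estimate",
                        request.requestId(),
                        // java.sql.Timestamp binds reliably across JDBC drivers
                        Timestamp.from(Instant.now())
                    );
                    return fare;
                }).subscribeOn(Schedulers.boundedElastic())
            );
    }
}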

The updated PR’s CI result:

| Endpoint                  | Metric | Threshold | Actual | Status |
|---------------------------|--------|-----------|--------|--------|
| /api/rides/fare-estimate  | p99    | 200       | 128.0  | PASS   |
| /api/rides/fare-estimate  | p95    | 150       | 95.0   | PASS   |
| /api/rides/fare-estimate  | p50    | 80        | 52.0   | PASS   |

**Overall: PASS**

The Proof

After adding the performance gate to CI:

Metric                              Before CI Gate    After CI Gate    Delta
Performance regressions in prod     2.3/month         0.1/month        -96%
Mean time to detect regression      40 hours          3 minutes        -99.9%
CI pipeline duration                6 min             9 min            +50%
PRs blocked by perf gate (6 months) N/A               14               N/A
False positives (6 months)          N/A               2                N/A

14 PRs blocked in 6 months. 12 were real regressions (blocking calls, missing indexes, excessive serialization). 2 were false positives caused by CI runner resource contention. The false positive rate of 14% is acceptable because the developer can re-run the pipeline to confirm. A real regression fails consistently; a runner contention issue is intermittent.
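The re-run rule can be written down explicitly. This is a minimal sketch of one way to encode it, assuming the pipeline records a pass/fail outcome for each re-run of a blocked PR; the function name and labels are hypothetical.

```python
def classify_gate_failure(rerun_passed: list[bool]) -> str:
    """Classify a blocked PR from its re-run outcomes (True = re-run passed)."""
    if not rerun_passed:
        return "unconfirmed"      # no re-run yet, cannot tell
    if any(rerun_passed):
        return "intermittent"     # at least one pass: likely runner contention
    return "consistent"           # failed every time: treat as a real regression


# Figures from this section: 14 PRs blocked, 2 of them false positives
blocked, false_positives = 14, 2
print(f"false positive rate: {false_positives / blocked:.0%}")
print(classify_gate_failure([False, False]))
print(classify_gate_failure([False, True]))
```

A single re-run is usually enough to separate the two cases. Requiring two keeps a lucky pass on a borderline regression from slipping through, at the cost of more pipeline time.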

The 3-minute addition to pipeline time is invisible to developers. The 40-hour mean-time-to-detection was not.

CH15-S1 covers the Docker Compose setup, comparison script, and GitHub Actions workflow in detail. CH15-S2 covers trend tracking with SQLite, gradual drift detection, and the GitLab CI equivalent.