Test Stage Placement and Parallelization
The Failure
The inventory service team added 340 integration tests over 18 months. Every test runs in the same job, sequentially. The test job takes 11 minutes. Developers push code, wait 11 minutes, see a failure, fix it, push again, wait 11 minutes. The feedback loop is 22 minutes for a single iteration.
The team split the suites into a matrix and set fail-fast: true, so the first suite to fail cancelled the others. A developer then pushed a change that broke both unit and integration tests. The unit failure cancelled the integration job before it could report anything. The developer fixed the unit test, pushed, waited 4 minutes for the integration tests to start, and only then saw the integration failure. Each problem surfaced on its own push, so reaching a green build took three iterations: up to 33 minutes at 11 minutes per run.
With fail-fast: false, every suite runs to completion and a single push reports all failures. One iteration, bounded by the longest suite (11 minutes), shows both problems instead of spreading them across three iterations and 33 minutes.
The Mechanism
Matrix Design for Test Suites
A matrix strategy creates one job per combination of values. For test suites, the matrix is not a cross-product but a list of explicit configurations:
# HARDENED: Explicit matrix with per-suite configuration
strategy:
  fail-fast: false
  matrix:
    include:
      - suite: unit
        compose: false
        timeout: 3
        retries: 0
      - suite: integration
        compose: true
        timeout: 10
        retries: 1
      - suite: contract
        compose: false
        timeout: 5
        retries: 0
Each entry gets its own runner. The include directive creates exactly three jobs instead of the cartesian product of separate arrays.
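For contrast, here is a sketch of what separate arrays would produce if the same settings were declared without include. The values are illustrative only; this is not a layout the pipeline uses.

```yaml
# Illustrative only: separate arrays expand to every combination
strategy:
  fail-fast: false
  matrix:
    suite: [unit, integration, contract]
    compose: [true, false]
# Result: 3 suites x 2 compose values = 6 jobs, including
# combinations that make no sense (for example, unit with compose: true).
```

Listing each configuration under include keeps the job count equal to the number of suites and lets each suite carry its own timeout and retry settings.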
Timing Budgets
Set timing budgets per suite. If a suite consistently exceeds its budget, something is wrong: tests are too slow, the test database is not being cleaned up, or the Docker Compose startup is taking too long.
| Suite | Target Duration | Hard Timeout | Action on Breach |
|---|---|---|---|
| Unit | < 2 min | 3 min | Investigate: parallelism, test count, slow assertions |
| Integration | < 6 min | 10 min | Investigate: compose startup, test isolation, DB cleanup |
| Contract | < 3 min | 5 min | Investigate: Pact broker latency, mock startup |
Track test duration as a metric. Export it from CI to Prometheus (CH18). Alert when median duration increases by more than 20% over a rolling 7-day window.
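One way to express the 20% alert is a Prometheus alerting rule. The sketch below is an assumption-heavy illustration: the metric name ci_test_duration_seconds and its suite label are hypothetical, and it presumes the CI job publishes the duration as a gauge (for example through a Pushgateway). It reads "rolling 7-day window" as comparing the median of the last 7 days with the median of the 7 days before that, which is only one possible interpretation.

```yaml
# Sketch of a Prometheus alerting rule; metric and label names are hypothetical
groups:
  - name: ci-test-duration
    rules:
      - alert: TestSuiteDurationRegression
        # Median of the last 7 days vs. median of the preceding 7 days
        expr: |
          quantile_over_time(0.5, ci_test_duration_seconds[7d])
            > 1.2 * quantile_over_time(0.5, ci_test_duration_seconds[7d] offset 7d)
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Median duration of {{ $labels.suite }} tests is up more than 20% week over week"
```

Because a pushed gauge is sampled at scrape time, the median here is weighted by how long each value was current rather than by run count, so treat it as an approximation of the per-run median.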
The Implementation
Full Test Matrix with Retries and Timing
# HARDENED: Test matrix with retry logic for integration tests
jobs:
  test:
    runs-on: ubuntu-latest
    needs: [build]
    strategy:
      fail-fast: false
      matrix:
        include:
          - suite: unit
            compose: false
            timeout: 3
            retries: 0
          - suite: integration
            compose: true
            timeout: 10
            retries: 1
          - suite: contract
            compose: false
            timeout: 5
            retries: 0
    steps:
      - uses: actions/checkout@v4
      - name: Start dependencies
        if: matrix.compose
        run: |
          docker compose -f docker-compose.test.yml up -d --wait --wait-timeout 60
          docker compose -f docker-compose.test.yml ps
      - name: Run ${{ matrix.suite }} tests
        id: run-tests
        timeout-minutes: ${{ matrix.timeout }}
        # Let the retry step decide the outcome for suites that allow retries
        continue-on-error: ${{ matrix.retries > 0 }}
        run: |
          START=$(date +%s)
          STATUS=0
          # Capture the exit code so the duration is recorded even when tests fail
          docker run --rm \
            ${{ matrix.compose && '--network=host' || '' }} \
            -v "$PWD/test-results:/app/test-results" \
            -e TEST_SUITE=${{ matrix.suite }} \
            ${{ env.IMAGE }}@${{ needs.build.outputs.image-digest }} \
            ./run-tests.sh ${{ matrix.suite }} || STATUS=$?
          END=$(date +%s)
          echo "duration=$((END - START))" >> "$GITHUB_OUTPUT"
          exit "$STATUS"
      - name: Retry ${{ matrix.suite }} tests
        if: matrix.retries > 0 && steps.run-tests.outcome == 'failure'
        timeout-minutes: ${{ matrix.timeout }}
        run: |
          echo "::warning::Retrying ${{ matrix.suite }} tests after failure"
          docker run --rm \
            ${{ matrix.compose && '--network=host' || '' }} \
            -v "$PWD/test-results:/app/test-results" \
            -e TEST_SUITE=${{ matrix.suite }} \
            ${{ env.IMAGE }}@${{ needs.build.outputs.image-digest }} \
            ./run-tests.sh ${{ matrix.suite }}
      - name: Upload test results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: test-results-${{ matrix.suite }}
          path: test-results/
          retention-days: 14
      - name: Report duration
        if: always()
        run: |
          echo "### Test Duration: ${{ matrix.suite }}" >> "$GITHUB_STEP_SUMMARY"
          echo "Duration: ${{ steps.run-tests.outputs.duration }}s" >> "$GITHUB_STEP_SUMMARY"
      - name: Cleanup
        if: matrix.compose && always()
        run: docker compose -f docker-compose.test.yml down -v
Flaky Test Quarantine
When a test fails intermittently (passes on retry, fails on different runs with no code change), quarantine it:
# HARDENED: Quarantined tests run separately, do not block promotion
quarantined-tests:
  runs-on: ubuntu-latest
  needs: [build]
  continue-on-error: true # Never blocks the pipeline
  steps:
    - uses: actions/checkout@v4
    - name: Run quarantined tests
      run: |
        docker run --rm \
          -e TEST_SUITE=quarantined \
          ${{ env.IMAGE }}@${{ needs.build.outputs.image-digest }} \
          ./run-tests.sh quarantined
    - name: Report quarantine results
      if: always()
      run: |
        echo "### Quarantined Test Results" >> "$GITHUB_STEP_SUMMARY"
        echo "These tests are flaky and under investigation." >> "$GITHUB_STEP_SUMMARY"
        echo "They do not block promotion." >> "$GITHUB_STEP_SUMMARY"
The quarantine job uses continue-on-error: true. The promote job does not depend on it. Quarantined tests still run to provide visibility, but failures do not block deployment.
Track quarantined tests in a file in the repository:
{
  "quarantined": [
    {
      "test": "InventoryReservationTest.testConcurrentReservation",
      "quarantinedDate": "2024-03-15",
      "reason": "Race condition in test setup, not in production code",
      "owner": "inventory-team",
      "deadline": "2024-04-01"
    }
  ]
}
Review the quarantine list at every sprint planning session. If a test has been quarantined for more than two sprints, either fix it or delete it. Quarantine is not a parking lot.
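The deadline field makes this review partly automatable. The job below is a minimal sketch, not part of the pipeline described here: the job name quarantine-audit and the file path quarantine.json are hypothetical, and it assumes jq is available on the runner, as it normally is on GitHub-hosted Ubuntu images.

```yaml
# Sketch: fail a check when a quarantine deadline has passed
# (job name and file path are hypothetical)
quarantine-audit:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - name: List overdue quarantined tests
      run: |
        TODAY=$(date -u +%F)
        # ISO 8601 dates (YYYY-MM-DD) compare correctly as plain strings
        OVERDUE=$(jq -r --arg today "$TODAY" \
          '.quarantined[] | select(.deadline < $today) | "\(.test) (owner: \(.owner), deadline: \(.deadline))"' \
          quarantine.json)
        if [ -n "$OVERDUE" ]; then
          echo "### Overdue quarantined tests" >> "$GITHUB_STEP_SUMMARY"
          echo "$OVERDUE" >> "$GITHUB_STEP_SUMMARY"
          exit 1
        fi
```

Like the quarantine job itself, a check of this kind would stay outside the promote job's needs list so that an overdue entry prompts a conversation rather than blocking a deployment.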
The Gate
The promote job depends on the three test matrix jobs (unit, integration, contract) and the scan job. All must pass. The quarantine job is excluded from the dependency graph.
promote:
  runs-on: ubuntu-latest
  needs: [test, scan]
  # quarantined-tests is NOT in needs, so it cannot block promotion
  if: github.ref == 'refs/heads/main' && !failure() && !cancelled()
  steps:
    - run: echo "All gates passed, promoting image"
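The condition !failure() && !cancelled() blocks promotion when any required job fails or the run is cancelled. If any matrix variant fails, the entire test job is marked as failed, and promotion is blocked. The condition is still true when a required job is skipped, so it is equivalent to "all required jobs succeeded" only as long as neither test nor scan can be skipped.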
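A stricter variant of the gate checks each required job's result explicitly, which also rules out the case where a required job was skipped. This is a sketch of that alternative, not the configuration the chapter uses; it relies only on the needs.<job>.result context, whose values are success, failure, cancelled, or skipped.

```yaml
# Sketch: require explicit success from each gate
promote:
  runs-on: ubuntu-latest
  needs: [test, scan]
  # always() forces evaluation; the result checks then decide
  if: >-
    always() &&
    github.ref == 'refs/heads/main' &&
    needs.test.result == 'success' &&
    needs.scan.result == 'success'
  steps:
    - run: echo "All gates passed, promoting image"
```

For a matrix job, needs.test.result reflects the whole matrix, so a single failed suite is enough to make the comparison false.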
The Recovery
All unit tests fail after a dependency update: The lock file changed and a transitive dependency introduced a breaking change. Pin the dependency, revert the lock file, and investigate.
Integration tests pass locally but fail in CI: The Docker Compose environment in CI differs from the local environment. Common causes: resource limits on GitHub-hosted runners (for example, 2 cores and 7 GB RAM on some standard Linux runners; the exact figures vary by plan and change over time), network timing differences, and port conflicts with other services. Add health checks with retries to docker-compose.test.yml:
services:
  postgres:
    image: postgres:16
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U test"]
      interval: 2s
      timeout: 5s
      retries: 10
Tests are slow and getting slower: Profile the test suite. The usual suspects: database state not cleaned between tests (truncate tables in @BeforeEach, not @AfterEach), unnecessary sleep() calls in async test waits (use polling with timeout instead), tests that create real HTTP connections instead of using WireMock.