Test Stage Placement and Parallelization

The Failure

The inventory service team added 340 integration tests over 18 months. Every test runs in the same job, sequentially. The test job takes 11 minutes. Developers push code, wait 11 minutes, see a failure, fix it, push again, wait 11 minutes. A single fix-and-verify iteration costs 22 minutes of waiting.

The team tried adding fail-fast: true to a matrix strategy (this is also the default when the key is omitted). The first suite to fail cancelled the others. A developer pushed a change that broke both unit and integration tests. The matrix cancelled the integration tests when a unit test failed, so the integration failure stayed hidden. The developer fixed the unit test, pushed, waited 4 minutes for integration tests to start, then saw the integration failure. Three iterations: 33 minutes to see both problems.

fail-fast: false shows all failures in one push. One iteration: 11 minutes (the longest suite) instead of 33 minutes across three iterations.

The Mechanism

Matrix Design for Test Suites

A matrix strategy creates one job per combination of values. For test suites, the matrix is not a cross-product but a list of explicit configurations:

# HARDENED: Explicit matrix with per-suite configuration
strategy:
  fail-fast: false
  matrix:
    include:
      - suite: unit
        compose: false
        timeout: 3
        retries: 0
      - suite: integration
        compose: true
        timeout: 10
        retries: 1
      - suite: contract
        compose: false
        timeout: 5
        retries: 0

Each entry gets its own runner. The include directive creates exactly three jobs instead of the cartesian product of separate arrays.
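For contrast, here is a minimal sketch of the cross-product form that include avoids. It reuses the suite and compose keys from the listing above, but declares them as separate arrays. It is shown only to illustrate the job count and is not a configuration to adopt.

```yaml
# Sketch for contrast only: separate arrays expand to every combination.
strategy:
  fail-fast: false
  matrix:
    suite: [unit, integration, contract]
    compose: [true, false]
# Result: 3 x 2 = 6 jobs, including pairings nobody wants
# (for example unit tests with compose: true).
```

With include as the only key under matrix, each list entry becomes one job, so per-suite settings such as timeout and retries stay attached to the suite they belong to.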

Timing Budgets

Set timing budgets per suite. If a suite consistently exceeds its budget, something is wrong: tests are too slow, the test database is not being cleaned up, or the Docker Compose startup is taking too long.

Suite	Target Duration	Hard Timeout	Action on Breach
Unit	< 2 min	3 min	Investigate: parallelism, test count, slow assertions
Integration	< 6 min	10 min	Investigate: compose startup, test isolation, DB cleanup
Contract	< 3 min	5 min	Investigate: Pact broker latency, mock startup

Track test duration as a metric. Export it from CI to Prometheus (CH18). Alert when median duration increases by more than 20% over a rolling 7-day window.
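The alert could be expressed as a Prometheus alerting rule along the following lines. This is a sketch under stated assumptions: the metric name ci_test_duration_seconds and its suite label are hypothetical, and "increase over a rolling 7-day window" is interpreted here as the median of the last 7 days compared with the median of the 7 days before that. The for duration and severity label are placeholders.

```yaml
# Sketch only: metric and label names are assumptions, not part of the pipeline.
groups:
  - name: ci-test-duration
    rules:
      - alert: TestSuiteDurationRegression
        # Median of the last 7 days vs. median of the preceding 7 days.
        expr: |
          quantile_over_time(0.5, ci_test_duration_seconds[7d])
            > 1.2 * quantile_over_time(0.5, ci_test_duration_seconds[7d] offset 7d)
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Median duration of the {{ $labels.suite }} suite is up more than 20%"
```

Note that quantile_over_time works on scraped samples, not on individual CI runs. If durations arrive through a push-style exporter, a value that stays in place between runs is counted once per scrape, so the result is closer to a time-weighted median.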

The Implementation

Full Test Matrix with Retries and Timing

# HARDENED: Test matrix with retry logic for integration tests
jobs:
  test:
    runs-on: ubuntu-latest
    needs: [build]
    strategy:
      fail-fast: false
      matrix:
        include:
          - suite: unit
            compose: false
            timeout: 3
            retries: 0
          - suite: integration
            compose: true
            timeout: 10
            retries: 1
          - suite: contract
            compose: false
            timeout: 5
            retries: 0
    steps:
      - uses: actions/checkout@v4

      - name: Start dependencies
        if: matrix.compose
        run: |
          docker compose -f docker-compose.test.yml up -d --wait --wait-timeout 60
          docker compose -f docker-compose.test.yml ps

      - name: Run ${{ matrix.suite }} tests
        id: run-tests
        timeout-minutes: ${{ matrix.timeout }}
        continue-on-error: ${{ matrix.retries > 0 }}
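        run: |
          START=$(date +%s)
          STATUS=0
          # Capture the exit code so the duration is recorded even on failure
          # (the default shell runs with -e and would stop at the failing command).
          docker run --rm \
            ${{ matrix.compose && '--network=host' || '' }} \
            -v $PWD/test-results:/app/test-results \
            -e TEST_SUITE=${{ matrix.suite }} \
            ${{ env.IMAGE }}@${{ needs.build.outputs.image-digest }} \
            ./run-tests.sh ${{ matrix.suite }} || STATUS=$?
          END=$(date +%s)
          echo "duration=$((END - START))" >> "$GITHUB_OUTPUT"
          exit "$STATUS"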

      - name: Retry ${{ matrix.suite }} tests
        if: matrix.retries > 0 && steps.run-tests.outcome == 'failure'
        timeout-minutes: ${{ matrix.timeout }}
        run: |
          echo "::warning::Retrying ${{ matrix.suite }} tests after failure"
          docker run --rm \
            ${{ matrix.compose && '--network=host' || '' }} \
            -v $PWD/test-results:/app/test-results \
            -e TEST_SUITE=${{ matrix.suite }} \
            ${{ env.IMAGE }}@${{ needs.build.outputs.image-digest }} \
            ./run-tests.sh ${{ matrix.suite }}

      - name: Upload test results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: test-results-${{ matrix.suite }}
          path: test-results/
          retention-days: 14

      - name: Report duration
        if: always()
        run: |
          echo "### Test Duration: ${{ matrix.suite }}" >> "$GITHUB_STEP_SUMMARY"
          echo "Duration: ${{ steps.run-tests.outputs.duration }}s" >> "$GITHUB_STEP_SUMMARY"

      - name: Cleanup
        if: matrix.compose && always()
        run: docker compose -f docker-compose.test.yml down -v
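The Report duration step only writes to the job summary. To track duration as a metric, one option is an extra step in the same job that pushes the value to a Prometheus Pushgateway. The sketch below assumes a Pushgateway is reachable from the runner and that its base URL is stored in a repository variable named PUSHGATEWAY_URL. That variable, the metric name ci_test_duration_seconds, and the ci_tests job label are all hypothetical.

```yaml
      # Sketch only: PUSHGATEWAY_URL and the metric name are assumptions.
      - name: Push duration to Pushgateway
        # Skip when no duration was recorded (for example after a step timeout).
        if: always() && steps.run-tests.outputs.duration != ''
        env:
          PUSHGATEWAY_URL: ${{ vars.PUSHGATEWAY_URL }}
        run: |
          # Grouping key: one series per suite under the ci_tests job.
          URL="$PUSHGATEWAY_URL/metrics/job/ci_tests/suite/${{ matrix.suite }}"
          curl --fail --silent --show-error --data-binary @- "$URL" <<EOF
          # TYPE ci_test_duration_seconds gauge
          ci_test_duration_seconds ${{ steps.run-tests.outputs.duration }}
          EOF
```

The recorded duration covers the first attempt only. If retried runs should be measured too, the retry step would need its own timing.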

Flaky Test Quarantine

When a test fails intermittently (passes on retry, fails on different runs with no code change), quarantine it:

# HARDENED: Quarantined tests run separately, do not block promotion
quarantined-tests:
  runs-on: ubuntu-latest
  needs: [build]
  continue-on-error: true # Never blocks the pipeline
  steps:
    - uses: actions/checkout@v4
    - name: Run quarantined tests
      run: |
        docker run --rm \
          -e TEST_SUITE=quarantined \
          ${{ env.IMAGE }}@${{ needs.build.outputs.image-digest }} \
          ./run-tests.sh quarantined
    - name: Report quarantine results
      if: always()
      run: |
        echo "### Quarantined Test Results" >> "$GITHUB_STEP_SUMMARY"
        echo "These tests are flaky and under investigation." >> "$GITHUB_STEP_SUMMARY"
        echo "They do not block promotion." >> "$GITHUB_STEP_SUMMARY"

The quarantine job uses continue-on-error: true. The promote job does not depend on it. Quarantined tests still run to provide visibility, but failures do not block deployment.

Track quarantined tests in a file in the repository:

{
  "quarantined": [
    {
      "test": "InventoryReservationTest.testConcurrentReservation",
      "quarantinedDate": "2024-03-15",
      "reason": "Race condition in test setup, not in production code",
      "owner": "inventory-team",
      "deadline": "2024-04-01"
    }
  ]
}

Review the quarantine list at every sprint planning session. If a test has been quarantined for more than two sprints, either fix it or delete it. Quarantine is not a parking lot.
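The deadline field can also be checked automatically, so that overdue entries surface without waiting for a review. The job below is a sketch that assumes the file shown above is stored at the repository root as quarantine.json (a hypothetical path) and that jq is available on the runner. Dates in YYYY-MM-DD form compare correctly as plain strings.

```yaml
# Sketch only: the file path and job name are assumptions.
quarantine-deadlines:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - name: Flag overdue quarantined tests
      run: |
        TODAY=$(date -u +%F)
        # Select entries whose deadline is earlier than today.
        OVERDUE=$(jq -r --arg today "$TODAY" \
          '.quarantined[] | select(.deadline < $today) | "\(.test) (owner: \(.owner), deadline: \(.deadline))"' \
          quarantine.json)
        if [ -n "$OVERDUE" ]; then
          echo "::error::Quarantined tests are past their deadline"
          echo "$OVERDUE"
          exit 1
        fi
```

Whether this job should block promotion is a team decision. Leaving it out of the promote job's needs keeps it advisory, in line with the quarantine job itself.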

The Gate

The promote job depends on the three test matrix jobs (unit, integration, contract) and the scan job. All must pass. The quarantine job is excluded from the dependency graph.

promote:
  runs-on: ubuntu-latest
  needs: [test, scan]
  # quarantined-tests is NOT in needs — it cannot block promotion
  if: github.ref == 'refs/heads/main' && !failure() && !cancelled()
  steps:
    - run: echo "All gates passed — promoting image"

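The condition !failure() && !cancelled() blocks promotion when any job listed in needs has failed or the run was cancelled. If any matrix variant fails, the test job as a whole is marked as failed, and promotion is blocked. A skipped dependency does not count as a failure, so this condition alone does not guarantee that test and scan actually ran and succeeded.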
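Where a skipped test or scan job must also block promotion, one option is to compare each result explicitly through the needs context. This is a sketch of that stricter variant, not something the gate above requires:

```yaml
# Sketch only: stricter gate that requires an explicit success from each dependency.
promote:
  runs-on: ubuntu-latest
  needs: [test, scan]
  if: >-
    github.ref == 'refs/heads/main' &&
    needs.test.result == 'success' &&
    needs.scan.result == 'success'
  steps:
    - run: echo "All gates passed, promoting image"
```

For a matrix job, needs.test.result reflects the job as a whole, so a single failed suite yields failure here as well.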

The Recovery

All unit tests fail after a dependency update: The lock file changed and a transitive dependency introduced a breaking change. Pin the dependency, revert the lock file, and investigate.

Integration tests pass locally but fail in CI: The Docker Compose environment in CI differs from the local environment. Common causes: resource limits on GitHub-hosted runners (2 cores and 7 GB RAM on standard Linux runners for private repositories at the time of writing; check the current specifications for your plan), network timing differences, port conflicts with other services. Add health checks with retries to docker-compose.test.yml:

services:
  postgres:
    image: postgres:16
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U test"]
      interval: 2s
      timeout: 5s
      retries: 10

Tests are slow and getting slower: Profile the test suite. The usual suspects: database state not cleaned between tests (truncate tables in @BeforeEach, not @AfterEach), unnecessary sleep() calls in async test waits (use polling with timeout instead), tests that create real HTTP connections instead of using WireMock.