Job Dependencies and the Critical Path
The Failure
The inventory service pipeline has seven jobs: one build job and six jobs that fan out from it. The team parallelized aggressively after reading about pipeline optimization. Each check runs in its own job: unit tests, integration tests, API contract tests, linting, type checking, and security scanning. Each of the six depends only on the build job. Maximum parallelism.
The pipeline takes 9 minutes. The build takes 3 minutes. Each parallel job does between 20 seconds (type checking) and 4 minutes (integration tests) of actual work. But each job also takes 45 seconds to provision a runner and download the image artifact before its first useful step. Six parallel jobs times 45 seconds of overhead is 4.5 minutes of runner time spent on setup alone.
The linting job takes 30 seconds to run and 45 seconds to start. It would have been faster as a step in the build job.
The Mechanism
On GitHub-hosted runners, every GitHub Actions job starts on a fresh machine. The runner must be provisioned (queued, assigned, booted), the repository must be checked out, and any artifacts from upstream jobs must be downloaded. This overhead is typically 30-90 seconds depending on runner availability and artifact size.
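To make that overhead concrete, here is a minimal sketch of two jobs handing a build output from one to the other. The workflow name, the verify job, the dist artifact, and the build script are hypothetical and not part of the inventory service pipeline. The point is only that the downstream job repeats checkout and adds a download before it does any real work.

```yaml
# Sketch only: job names, artifact name, and paths are hypothetical.
name: artifact-handoff
on: [push]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Assumes a package.json with a "build" script that writes to dist/
      - run: npm ci && npm run build
      - uses: actions/upload-artifact@v4
        with:
          name: dist
          path: dist/

  verify:
    needs: [build]            # waits for build, then starts on a new runner
    runs-on: ubuntu-latest
    steps:
      # Repeated setup: the new runner starts with an empty workspace
      - uses: actions/checkout@v4
      - uses: actions/download-artifact@v4
        with:
          name: dist
          path: dist/
      - run: ls -R dist/      # the first step that does anything useful
```

Every job added behind a needs edge pays this setup cost again, which is why a task measured in seconds rarely justifies its own job.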
Splitting work into separate jobs is valuable when:
- The tasks can run in parallel and the time saved by running them concurrently exceeds the per-job startup overhead
- The tasks need different runner types (e.g., Linux vs macOS for cross-platform testing)
- The tasks should gate independently (a scan failure should not block test results from being visible)
Splitting is counterproductive when:
- The task duration is less than the runner provisioning overhead
- The tasks share expensive setup (database migrations, dependency installation) that would need to be repeated in each job
The decision rule: if the task takes less than 2 minutes and does not need a different runner or independent gating, keep it as a step in an existing job.
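A case where the rule points toward splitting is a task that needs a different runner type. The sketch below assumes the unit tests should run on both Linux and macOS. The OS list, the Node version, and the presence of a committed lockfile are assumptions for illustration, not details from the inventory service pipeline.

```yaml
# Sketch only: OS list and Node version are assumptions.
name: cross-platform
on: [push]

jobs:
  unit-test:
    runs-on: ${{ matrix.os }}   # one runner per operating system
    strategy:
      fail-fast: false          # let both platforms report
      matrix:
        os: [ubuntu-latest, macos-latest]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm            # requires a committed lockfile
      - run: npm ci
      - run: npm run test:unit
```

Here the per-job overhead is unavoidable, because a single runner cannot be both operating systems. A 30-second lint step has no such constraint and belongs inside an existing job.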
The Implementation
# FRAGILE: Over-parallelized pipeline with excessive runner overhead
name: ci
on: [push, pull_request]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          tags: ghcr.io/acme/inventory-service:${{ github.sha }}

  lint:
    needs: [build]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm run lint             # 30 seconds of work, 45 seconds of startup

  typecheck:
    needs: [build]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm run typecheck        # 20 seconds of work, 45 seconds of startup

  unit-test:
    needs: [build]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm run test:unit        # 90 seconds

  integration-test:
    needs: [build]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker compose up -d --wait
      - run: npm run test:integration # 4 minutes

  contract-test:
    needs: [build]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm run test:contract    # 2 minutes

  scan:
    needs: [build]
    runs-on: ubuntu-latest
    steps:
      - uses: aquasecurity/trivy-action@master
        with:
          image-ref: ghcr.io/acme/inventory-service:${{ github.sha }}
          exit-code: 1

# HARDENED: Right-sized parallelism, fast checks in build job
name: ci
on: [push, pull_request]

env:
  IMAGE: ghcr.io/acme/inventory-service

jobs:
  build-and-check:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    outputs:
      image-digest: ${{ steps.build.outputs.digest }}
    steps:
      - uses: actions/checkout@v4

      # Lint and typecheck need devDependencies on the fresh runner.
      # Assumes a committed package-lock.json.
      - name: Install dependencies
        run: npm ci

      # Fast checks run before the build, fail fast
      - name: Lint
        run: npm run lint
      - name: Type check
        run: npm run typecheck

      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Build and push
        id: build
        uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          tags: ${{ env.IMAGE }}:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  test:
    runs-on: ubuntu-latest
    needs: [build-and-check]
    strategy:
      fail-fast: false
      matrix:
        suite: [unit, integration, contract]
    steps:
      - uses: actions/checkout@v4
      - name: Start dependencies
        if: matrix.suite == 'integration'
        run: docker compose -f docker-compose.test.yml up -d --wait
      # Tests run inside the exact image that was built, pinned by digest
      - name: Run ${{ matrix.suite }} tests
        run: |
          docker run --rm \
            ${{ matrix.suite == 'integration' && '--network=host' || '' }} \
            ${{ env.IMAGE }}@${{ needs.build-and-check.outputs.image-digest }} \
            ./run-${{ matrix.suite }}-tests.sh
      - name: Stop dependencies
        if: matrix.suite == 'integration' && always()
        run: docker compose -f docker-compose.test.yml down

  scan:
    runs-on: ubuntu-latest
    needs: [build-and-check]
    steps:
      - uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ env.IMAGE }}@${{ needs.build-and-check.outputs.image-digest }}
          exit-code: 1
          severity: CRITICAL,HIGH

  promote:
    runs-on: ubuntu-latest
    needs: [test, scan]
    if: github.ref == 'refs/heads/main'
    steps:
      - run: echo "All gates passed. Ready for infra repo update."
The restructured pipeline merges lint and typecheck into the build job (they take 50 seconds combined and share the same checkout). The three test suites run as a matrix strategy, which provisions three runners but shares the job definition. The scan runs in parallel with all tests.
The Gate
The promote job depends on both test (all matrix variants) and scan. A matrix strategy with fail-fast: false ensures all test suites run to completion even if one fails. This means the developer sees all failures at once instead of fixing one, re-running, and discovering the next.
When fail-fast is true (the default), the first matrix failure cancels every matrix job that is still queued or in progress. Use fail-fast: false for test suites where seeing all failures is more valuable than saving runner minutes.
The Recovery
When a specific matrix variant fails consistently (e.g., integration tests are flaky), the temptation is to add continue-on-error: true to that variant. Do not. Instead, fix the flaky test, or move it to a separate non-blocking job with a clear label: integration-test-flaky. The job is non-blocking because promote does not list it in needs, so it emits a warning but does not block promotion. The team tracks the flaky test and fixes it on a defined timeline. Chapter 18 covers flaky test detection and management.
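One way to sketch the quarantine job is shown below, as a fragment to place under jobs: in the hardened workflow. The script name run-quarantined-tests.sh and the warning text are hypothetical. The sketch also assumes the quarantined tests can run directly from a checkout, which may not match how the real suites are packaged.

```yaml
jobs:
  # ... build-and-check, test, and scan stay as they are ...

  integration-test-flaky:
    runs-on: ubuntu-latest
    needs: [build-and-check]
    steps:
      - uses: actions/checkout@v4
      - name: Run quarantined tests
        run: |
          # Hypothetical script containing only the quarantined tests.
          # A failure becomes a warning annotation instead of a failed job.
          if ! ./run-quarantined-tests.sh; then
            echo "::warning title=Quarantined test failed::Flaky integration test failed, see the tracking issue"
          fi

  promote:
    runs-on: ubuntu-latest
    needs: [test, scan]       # deliberately omits integration-test-flaky
    if: github.ref == 'refs/heads/main'
    steps:
      - run: echo "All gates passed. Ready for infra repo update."
```

Unlike continue-on-error on a matrix variant, the quarantine is visible in the job name and in the annotation, and the rest of the integration suite still gates promotion.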