Pipeline Observability: Metrics, Flaky Tests, and Dashboards
The pipeline is infrastructure, and it needs monitoring like any other infrastructure. When a build that used to take 8 minutes takes 15, someone should be alerted, not surprised.
The Failure
The team’s CI pipeline averaged 8 minutes. Over three months, it crept to 22 minutes. No one noticed because no one tracked it, until a developer complained during a retro. The team investigated and found three causes: Docker layer caching had broken two months earlier (adding 6 minutes), a flaky test was being retried 3 times on every run (adding 4 minutes), and a dependency mirror was slow (adding 4 minutes). Three independent issues, each small enough to ignore, combined to nearly triple build time.
Pipeline metrics would have caught each regression within days.
The Mechanism
Key Pipeline Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| Build duration | Total wall-clock time | > 2x baseline |
| Queue time | Time waiting for a runner | > 5 minutes |
| Success rate | % of builds that pass | < 90% |
| Flaky rate | % of tests that fail, then pass on retry | > 5% |
| Cache hit rate | % of steps using cached results | < 80% |
| MTTR | Mean time to fix a broken build | > 2 hours |
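As a rough illustration of how the thresholds in the table could be evaluated, here is a minimal sketch. The `Build` record, the `check_thresholds` function, and the sample numbers are hypothetical, and using the median over recent builds is an assumption (the table does not say how to aggregate). Only three of the six metrics are covered.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Build:
    duration_s: float  # total wall-clock time, in seconds
    queue_s: float     # time spent waiting for a runner, in seconds
    passed: bool


def check_thresholds(builds, baseline_s):
    """Return the names of metrics that breach the alert thresholds."""
    alerts = []
    # Build duration: alert when the median exceeds 2x baseline
    if median(b.duration_s for b in builds) > 2 * baseline_s:
        alerts.append("build_duration")
    # Queue time: alert when the median exceeds 5 minutes
    if median(b.queue_s for b in builds) > 5 * 60:
        alerts.append("queue_time")
    # Success rate: alert when fewer than 90% of builds pass
    success_rate = sum(b.passed for b in builds) / len(builds)
    if success_rate < 0.90:
        alerts.append("success_rate")
    return alerts


# Hypothetical sample: an 8-minute baseline and three recent ~22-minute builds
recent = [
    Build(duration_s=1320, queue_s=30, passed=True),
    Build(duration_s=1290, queue_s=45, passed=False),
    Build(duration_s=1350, queue_s=20, passed=True),
]
print(check_thresholds(recent, baseline_s=480))
# → ['build_duration', 'success_rate']
```

In practice these checks would more likely live in alerting rules on the metrics backend; the sketch only makes the arithmetic behind each threshold explicit.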
Flaky Test Detection
A flaky test is one that passes and fails on the same code. Detecting them requires tracking test results over time:
- Test fails on PR → developer retries → test passes → PR merges
- Record both results: the test is marked as flaky
- After 3 flaky occurrences in 30 days → quarantine the test
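The definition above (passing and failing on the same code) can be sketched as a small classifier. The `find_flaky` function and the tuple layout are hypothetical; the sketch assumes each test result is recorded together with the commit SHA it ran against.

```python
from collections import defaultdict


def find_flaky(results):
    """results: iterable of (test_name, commit_sha, passed) tuples.

    A test counts as flaky if it both passed and failed on the same commit.
    """
    outcomes = defaultdict(set)
    for test_name, sha, passed in results:
        outcomes[(test_name, sha)].add(passed)
    return sorted(
        {name for (name, _sha), seen in outcomes.items() if seen == {True, False}}
    )


# Hypothetical result history for one commit
history = [
    ("test_checkout", "abc123", False),  # fails on the PR
    ("test_checkout", "abc123", True),   # passes on retry, same commit
    ("test_login", "abc123", True),      # stable
    ("test_search", "abc123", False),    # fails every time: broken, not flaky
    ("test_search", "abc123", False),
]
print(find_flaky(history))
# → ['test_checkout']
```

Keying on the commit SHA is what separates a flaky test from one that a code change actually broke or fixed.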
The Implementation
Pipeline Metrics Collection
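# .github/workflows/metrics.yml
# HARDENED: Collect pipeline metrics
name: Pipeline Metrics
on:
  workflow_run:
    workflows: ["CI"]
    types: [completed]
jobs:
  collect-metrics:
    runs-on: ubuntu-latest
    steps:
      - name: Collect workflow metrics
        uses: actions/github-script@v7
        with:
          script: |
            const run = context.payload.workflow_run;
            // Total wall-clock time, from creation to the last update
            const duration = (new Date(run.updated_at) - new Date(run.created_at)) / 1000;
            // Run-level approximation; per-job runner wait requires the workflow jobs API
            const queueTime = (new Date(run.run_started_at) - new Date(run.created_at)) / 1000;
            const metrics = {
              workflow: run.name,
              conclusion: run.conclusion,
              duration_seconds: duration,
              queue_seconds: queueTime,
              branch: run.head_branch,
              sha: run.head_sha,
              timestamp: run.created_at,
            };
            // Log the full record (including branch and SHA) for later correlation
            console.log(JSON.stringify(metrics));
            // Push to Prometheus Pushgateway.
            // The Pushgateway keeps the last pushed value per group, so
            // ci_build_total is a per-run marker, not a running counter.
            const body = [
              `ci_build_duration_seconds{workflow="${run.name}",conclusion="${run.conclusion}"} ${duration}`,
              `ci_queue_duration_seconds{workflow="${run.name}"} ${queueTime}`,
              `ci_build_total{workflow="${run.name}",conclusion="${run.conclusion}"} 1`,
            ].join('\n') + '\n'; // the text format requires a trailing newline
            // Encode the workflow name so spaces or slashes do not break the URL path
            const instance = encodeURIComponent(run.name);
            const response = await fetch(`${process.env.PUSHGATEWAY_URL}/metrics/job/ci/instance/${instance}`, {
              method: 'POST',
              body: body,
            });
            if (!response.ok) {
              core.setFailed(`Pushgateway returned HTTP ${response.status}`);
            }
        env:
          PUSHGATEWAY_URL: ${{ secrets.PUSHGATEWAY_URL }}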
Flaky Test Tracker
# scripts/flaky-tracker.py
# HARDENED: Track and quarantine flaky tests
import json
import sys
from pathlib import Path
from datetime import datetime, timedelta

FLAKY_DB = ".flaky-tests.json"
QUARANTINE_THRESHOLD = 3
WINDOW_DAYS = 30


def load_db():
    if Path(FLAKY_DB).exists():
        return json.loads(Path(FLAKY_DB).read_text())
    return {"tests": {}}


def save_db(db):
    Path(FLAKY_DB).write_text(json.dumps(db, indent=2))


def record_flaky(test_name):
    db = load_db()
    now = datetime.now()
    cutoff = (now - timedelta(days=WINDOW_DAYS)).isoformat()
    if test_name not in db["tests"]:
        db["tests"][test_name] = {"occurrences": [], "quarantined": False}
    entry = db["tests"][test_name]
    entry["occurrences"].append(now.isoformat())
    # Remove occurrences older than the window (ISO-8601 strings sort chronologically)
    entry["occurrences"] = [o for o in entry["occurrences"] if o > cutoff]
    if len(entry["occurrences"]) >= QUARANTINE_THRESHOLD:
        entry["quarantined"] = True
        print(f"⚠ QUARANTINED: {test_name} "
              f"({len(entry['occurrences'])} flaky in {WINDOW_DAYS} days)")
    save_db(db)


def get_quarantined():
    db = load_db()
    return [name for name, data in db["tests"].items() if data.get("quarantined")]


if __name__ == "__main__":
    # Guard against missing arguments instead of raising IndexError
    command = sys.argv[1] if len(sys.argv) > 1 else ""
    if command == "record" and len(sys.argv) > 2:
        record_flaky(sys.argv[2])
    elif command == "list-quarantined":
        for t in get_quarantined():
            print(t)
    else:
        sys.exit("usage: flaky-tracker.py record <test-name> | list-quarantined")
JUnit XML Parser for Retry Detection
# In CI workflow
# Assumes a retry plugin that provides --retries (for example pytest-retry)
- name: Run tests with retry
  run: |
    pytest --junitxml=results.xml --retries=2
- name: Detect flaky tests
  if: always()
  run: |
    python scripts/detect-flaky.py results.xml
# scripts/detect-flaky.py
# HARDENED: Detect tests that passed on retry
import xml.etree.ElementTree as ET
import subprocess
import sys


def detect_flaky(junit_xml):
    tree = ET.parse(junit_xml)
    for testcase in tree.iter("testcase"):
        # If the test has a rerun element, it was retried.
        # Assumption: the retry plugin writes one <rerun> child per retried
        # attempt; check your plugin's report format before relying on this.
        reruns = testcase.findall("rerun")
        if reruns:
            name = f"{testcase.get('classname')}.{testcase.get('name')}"
            print(f"Flaky: {name} (retried {len(reruns)} times)")
            # Use the current interpreter and fail loudly if the tracker errors
            subprocess.run(
                [sys.executable, "scripts/flaky-tracker.py", "record", name],
                check=True,
            )


if __name__ == "__main__":
    detect_flaky(sys.argv[1])
The Gate
Pipeline health is not a PR gate; it is a team gate. When the pipeline success rate drops below 90% or build duration exceeds 2x baseline, the team pauses feature work to fix the pipeline. This is a process gate, enforced by the dashboard and alerts, not by branch protection.
The Recovery
Metrics collection adds overhead to CI: The metrics job runs as a separate workflow triggered by workflow_run. It does not add time to the main pipeline.
Flaky test database conflicts: Store the flaky test database in a separate branch or use an external datastore (SQLite in an artifact, or a real database). Multiple concurrent PRs writing to the same file will conflict.
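One way the SQLite option might look is sketched below. It is an assumption-laden illustration, not a drop-in replacement for the tracker: the table name `flaky`, its columns, and the function signatures are hypothetical, and how the database file is shared between runs (artifact, volume, or server) is left out. Each occurrence is an appended row, so writers never rewrite a shared document.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

QUARANTINE_THRESHOLD = 3
WINDOW_DAYS = 30


def open_db(path):
    """Open (or create) the occurrence table. The schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS flaky (test TEXT NOT NULL, seen_at TEXT NOT NULL)"
    )
    return conn


def record_flaky(conn, test_name, now=None):
    """Insert one occurrence; return True if the test should be quarantined."""
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(days=WINDOW_DAYS)).isoformat()
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO flaky VALUES (?, ?)", (test_name, now.isoformat()))
    # Count occurrences inside the window (ISO-8601 strings sort chronologically)
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM flaky WHERE test = ? AND seen_at > ?",
        (test_name, cutoff),
    ).fetchone()
    return count >= QUARANTINE_THRESHOLD


# In-memory database for demonstration only
conn = open_db(":memory:")
for _ in range(3):
    quarantined = record_flaky(conn, "tests.test_checkout.test_total")
print(quarantined)
# → True
```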
Dashboards show spikes but no root cause: Correlate build duration spikes with the commit history. Record the commit SHA alongside each measurement. When duration spikes, run git log over the SHA range between the last normal build and the first slow one to find the change that caused it.
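The correlation step can be sketched as follows, assuming build records are available as (SHA, duration) pairs ordered oldest first. The `find_spike_range` function and the sample SHAs are hypothetical, and the 2x factor mirrors the alert threshold used earlier.

```python
def find_spike_range(runs, baseline_s, factor=2.0):
    """runs: list of (sha, duration_seconds), oldest first.

    Returns (last_good_sha, first_slow_sha) for the first run slower than
    factor * baseline_s, or None if no run is slow or no earlier run exists.
    """
    threshold = factor * baseline_s
    for i, (sha, duration) in enumerate(runs):
        if duration > threshold:
            if i == 0:
                return None  # no earlier good build to compare against
            return runs[i - 1][0], sha
    return None


# Hypothetical build history with an 8-minute (480 s) baseline
runs = [("a1b2c3d", 470), ("d4e5f6a", 495), ("0f9e8d7", 1010), ("7c6b5a4", 1030)]
spike = find_spike_range(runs, baseline_s=480)
if spike:
    print(f"git log --oneline {spike[0]}..{spike[1]}")
# → git log --oneline d4e5f6a..0f9e8d7
```

A single-threshold scan like this catches sudden jumps; gradual creep would need a comparison against a rolling baseline instead.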