GitHub Actions Metrics and Flaky Test Detection
GitHub Actions Metrics and Flaky Test Detection
The Failure
The team had 1,200 tests. Twelve were flaky. Each flaky test had a 5% chance of failing on any given run. With 12 flaky tests at 5% each, the probability of at least one failure per build was 46%. Nearly half of all builds failed due to flaky tests. Developers reflexively re-ran failed builds. The team lost an average of 15 minutes per developer per day to flaky test retries.
Quarantining the 12 flaky tests and fixing them in a dedicated sprint eliminated 46% of build failures overnight.
The Mechanism
Flaky Test Lifecycle
- Detection: Test fails then passes on retry → marked as potentially flaky
- Confirmation: Same test is flaky 3+ times in 30 days → confirmed flaky
- Quarantine: Test is moved to a non-blocking test suite → issue created
- Fix: Developer fixes the root cause (timing, state, ordering)
- Reinstatement: Fixed test moves back to the blocking suite
GitHub Actions API Endpoints
| Endpoint | Data |
|---|---|
GET /repos/{owner}/{repo}/actions/runs | Workflow runs with timing |
GET /repos/{owner}/{repo}/actions/runs/{id}/jobs | Individual job timing |
GET /repos/{owner}/{repo}/actions/runs/{id}/artifacts | Test result artifacts |
The Implementation
Metrics Extraction Script
# scripts/ci-metrics.py
# HARDENED: Extract CI metrics from GitHub Actions API
import os
import requests
from datetime import datetime, timedelta
GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]
REPO = os.environ.get("GITHUB_REPOSITORY", "acme/checkout-service")
API_BASE = "https://api.github.com"
headers = {
"Authorization": f"Bearer {GITHUB_TOKEN}",
"Accept": "application/vnd.github.v3+json"
}
def get_workflow_runs(days=7):
since = (datetime.now() - timedelta(days=days)).isoformat()
runs = []
page = 1
while True:
resp = requests.get(
f"{API_BASE}/repos/{REPO}/actions/runs",
headers=headers,
params={
"created": f">={since}",
"per_page": 100,
"page": page
}
)
resp.raise_for_status()
data = resp.json()
runs.extend(data["workflow_runs"])
if len(data["workflow_runs"]) < 100:
break
page += 1
return runs
def compute_metrics(runs):
total = len(runs)
success = sum(1 for r in runs if r["conclusion"] == "success")
failure = sum(1 for r in runs if r["conclusion"] == "failure")
durations = []
for r in runs:
if r["updated_at"] and r["run_started_at"]:
start = datetime.fromisoformat(r["run_started_at"].replace("Z", "+00:00"))
end = datetime.fromisoformat(r["updated_at"].replace("Z", "+00:00"))
durations.append((end - start).total_seconds())
durations.sort()
p50 = durations[len(durations) // 2] if durations else 0
p90 = durations[int(len(durations) * 0.9)] if durations else 0
return {
"total_runs": total,
"success_rate": (success / total * 100) if total else 0,
"failure_rate": (failure / total * 100) if total else 0,
"duration_p50_s": round(p50),
"duration_p90_s": round(p90),
}
if __name__ == "__main__":
runs = get_workflow_runs(days=7)
metrics = compute_metrics(runs)
for k, v in metrics.items():
print(f"{k}: {v}")
Quarantine Workflow
# .github/workflows/quarantine.yml
# HARDENED: Run quarantined tests separately (non-blocking)
name: Quarantined Tests
on:
pull_request:
branches: [main]
jobs:
quarantined:
runs-on: ubuntu-latest
continue-on-error: true # Non-blocking
steps:
- uses: actions/checkout@v4
- name: Run quarantined tests
run: |
pytest -m "quarantine" \
--junitxml=quarantine-results.xml \
|| true
- name: Report quarantine status
if: always()
uses: actions/github-script@v7
with:
script: |
const fs = require('fs');
const results = fs.readFileSync('quarantine-results.xml', 'utf8');
const failCount = (results.match(/failures="(\d+)"/)?.[1]) || 0;
const totalCount = (results.match(/tests="(\d+)"/)?.[1]) || 0;
await github.rest.issues.createComment({
owner: context.repo.owner,
repo: context.repo.repo,
issue_number: context.issue.number,
body: `🔬 Quarantined tests: ${totalCount - failCount}/${totalCount} passed`
});
Pytest Quarantine Marker
# conftest.py
import pytest
import json
from pathlib import Path
def pytest_collection_modifyitems(config, items):
"""Skip quarantined tests unless explicitly requested."""
quarantine_file = Path(".flaky-tests.json")
if not quarantine_file.exists():
return
db = json.loads(quarantine_file.read_text())
quarantined = {
name for name, data in db.get("tests", {}).items()
if data.get("quarantined")
}
run_quarantine = config.getoption("-m", "") == "quarantine"
for item in items:
full_name = f"{item.module.__name__}.{item.name}"
is_quarantined = full_name in quarantined
if is_quarantined and not run_quarantine:
item.add_marker(pytest.mark.skip(reason="Quarantined (flaky)"))
elif not is_quarantined and run_quarantine:
item.add_marker(pytest.mark.skip(reason="Not quarantined"))
Automated Issue Creation
# scripts/create-flaky-issues.py
# HARDENED: Create GitHub issues for newly quarantined tests
import json
import os
import requests
from pathlib import Path
GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]
REPO = os.environ["GITHUB_REPOSITORY"]
db = json.loads(Path(".flaky-tests.json").read_text())
for name, data in db["tests"].items():
if data.get("quarantined") and not data.get("issue_created"):
resp = requests.post(
f"https://api.github.com/repos/{REPO}/issues",
headers={
"Authorization": f"Bearer {GITHUB_TOKEN}",
"Accept": "application/vnd.github.v3+json"
},
json={
"title": f"Fix flaky test: {name}",
"body": (
f"Test `{name}` has been quarantined after "
f"{len(data['occurrences'])} flaky occurrences.\n\n"
f"Last flaky: {data['occurrences'][-1]}\n\n"
f"This test is currently skipped in CI."
),
"labels": ["flaky-test", "tech-debt"]
}
)
if resp.status_code == 201:
data["issue_created"] = True
data["issue_url"] = resp.json()["html_url"]
Path(".flaky-tests.json").write_text(json.dumps(db, indent=2))
The Gate
The main test suite is the gate. Quarantined tests run in a separate non-blocking job. The team’s definition of “green build” excludes quarantined tests but includes a visibility comment on each PR showing quarantine status.
The Recovery
Too many tests are quarantined: Set a maximum quarantine size (e.g., 20 tests). If the quarantine is full, the team must fix existing flaky tests before quarantining new ones.
Fixed test becomes flaky again: Reset its occurrence count but increase its stability threshold: require 5 consecutive stable runs before reinstatement.
Metrics collection fails silently: Add a weekly scheduled job that checks if metrics have been collected in the last 7 days. If not, alert via Slack webhook.