GitHub Actions Metrics and Flaky Test Detection

The Failure

The team had 1,200 tests. Twelve were flaky, each with a 5% chance of failing on any given run. Assuming the failures were independent, the probability of at least one flaky failure per build was 1 - 0.95^12 ≈ 46%. Nearly half of all builds failed because of flaky tests, and developers reflexively re-ran them. The team lost an average of 15 minutes per developer per day to flaky test retries.
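
The 46% figure comes from treating each flaky test as an independent 5% failure. A minimal sketch of that arithmetic, assuming independence (the function name `p_build_hit` is illustrative, not from the team's tooling):

```python
def p_build_hit(flaky_tests: int, fail_rate: float) -> float:
    """Probability that at least one flaky test fails in a single build."""
    # Every flaky test passes with probability (1 - fail_rate) ** flaky_tests,
    # so the build is hit whenever that does not happen.
    return 1 - (1 - fail_rate) ** flaky_tests


if __name__ == "__main__":
    for n in (1, 6, 12, 24):
        print(f"{n:>2} flaky tests -> {p_build_hit(n, 0.05):.0%} of builds affected")
```

The growth is faster than intuition suggests: a handful of rarely failing tests is enough to make red builds routine.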

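Quarantining the 12 flaky tests took them out of the blocking suite overnight, so the spurious failures that had been hitting roughly 46% of builds stopped blocking merges. The team then fixed the tests in a dedicated sprint.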

The Mechanism

Flaky Test Lifecycle

Detection: Test fails then passes on retry → marked as potentially flaky
Confirmation: Same test is flaky 3+ times in 30 days → confirmed flaky
Quarantine: Test is moved to a non-blocking test suite → issue created
Fix: Developer fixes the root cause (timing, state, ordering)
Reinstatement: Fixed test moves back to the blocking suite
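
The detection, confirmation, and quarantine steps above can be sketched as a few functions over the `.flaky-tests.json` structure. This is an illustration under assumptions: occurrences are taken to be ISO 8601 timestamps, and the function names are hypothetical.

```python
from datetime import datetime, timedelta


def record_flaky(db: dict, test_name: str, when: datetime) -> None:
    """Detection: log one fail-then-pass-on-retry event for a test."""
    entry = db.setdefault("tests", {}).setdefault(
        test_name, {"occurrences": [], "quarantined": False}
    )
    entry["occurrences"].append(when.isoformat())


def is_confirmed_flaky(entry: dict, now: datetime,
                       threshold: int = 3, window_days: int = 30) -> bool:
    """Confirmation: at least `threshold` occurrences inside the window."""
    cutoff = now - timedelta(days=window_days)
    recent = [ts for ts in entry["occurrences"]
              if datetime.fromisoformat(ts) >= cutoff]
    return len(recent) >= threshold


def quarantine_confirmed(db: dict, now: datetime) -> list:
    """Quarantine: flag confirmed tests and return the newly flagged names."""
    newly_flagged = []
    for name, entry in db.get("tests", {}).items():
        if not entry.get("quarantined") and is_confirmed_flaky(entry, now):
            entry["quarantined"] = True
            newly_flagged.append(name)
    return newly_flagged
```

Returning only the newly flagged names makes it easy to open exactly one issue per quarantined test.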

GitHub Actions API Endpoints

Endpoint	Data
`GET /repos/{owner}/{repo}/actions/runs`	Workflow runs with timing
`GET /repos/{owner}/{repo}/actions/runs/{id}/jobs`	Individual job timing
`GET /repos/{owner}/{repo}/actions/runs/{id}/artifacts`	Test result artifacts
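
As an illustration of what the jobs endpoint provides, the sketch below turns an already parsed jobs response into per-job durations. It assumes the response carries a `jobs` list whose items have `name`, `started_at`, and `completed_at`; the helper names are hypothetical.

```python
from datetime import datetime


def parse_ts(value: str) -> datetime:
    """Parse a GitHub timestamp such as 2024-05-01T12:00:00Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def job_durations(payload: dict) -> dict:
    """Map job name to duration in seconds for every finished job."""
    durations = {}
    for job in payload.get("jobs", []):
        started, completed = job.get("started_at"), job.get("completed_at")
        if started and completed:  # jobs still running have no completed_at
            elapsed = parse_ts(completed) - parse_ts(started)
            durations[job["name"]] = elapsed.total_seconds()
    return durations


if __name__ == "__main__":
    # Hand-written sample payload, not real API output.
    sample = {"jobs": [
        {"name": "unit", "started_at": "2024-05-01T12:00:00Z",
         "completed_at": "2024-05-01T12:04:30Z"},
        {"name": "e2e", "started_at": "2024-05-01T12:00:00Z",
         "completed_at": None},
    ]}
    print(job_durations(sample))
```

Job-level timing shows which job dominates a slow run, which the run-level timestamps alone cannot reveal.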

The Implementation

Metrics Extraction Script

# scripts/ci-metrics.py
# HARDENED: Extract CI metrics from GitHub Actions API
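import os
import requests
from datetime import datetime, timedelta, timezone

GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]
REPO = os.environ.get("GITHUB_REPOSITORY", "acme/checkout-service")
API_BASE = "https://api.github.com"

headers = {
    "Authorization": f"Bearer {GITHUB_TOKEN}",
    "Accept": "application/vnd.github.v3+json"
}


def get_workflow_runs(days=7):
    """Fetch completed workflow runs created in the last `days` days."""
    # Use an explicit UTC timestamp without microseconds; a naive local
    # isoformat() would shift the window by the machine's UTC offset.
    since = (datetime.now(timezone.utc) - timedelta(days=days)).strftime("%Y-%m-%dT%H:%M:%SZ")
    runs = []
    page = 1
    while True:
        resp = requests.get(
            f"{API_BASE}/repos/{REPO}/actions/runs",
            headers=headers,
            params={
                "created": f">={since}",
                "status": "completed",  # skip queued and in-progress runs
                "per_page": 100,
                "page": page
            },
            timeout=30,  # fail fast instead of hanging the CI job
        )
        resp.raise_for_status()
        data = resp.json()
        runs.extend(data["workflow_runs"])
        # A short page means there are no further pages to fetch.
        if len(data["workflow_runs"]) < 100:
            break
        page += 1
    return runs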


def compute_metrics(runs):
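    # Runs without a conclusion are still queued or in progress; counting
    # them in the total would understate both rates.
    completed = [r for r in runs if r.get("conclusion")]
    total = len(completed)
    success = sum(1 for r in completed if r["conclusion"] == "success")
    failure = sum(1 for r in completed if r["conclusion"] == "failure")

    durations = []
    for r in completed:
        # updated_at serves as an approximation of when the run finished.
        if r.get("updated_at") and r.get("run_started_at"):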
            start = datetime.fromisoformat(r["run_started_at"].replace("Z", "+00:00"))
            end = datetime.fromisoformat(r["updated_at"].replace("Z", "+00:00"))
            durations.append((end - start).total_seconds())

    durations.sort()
    p50 = durations[len(durations) // 2] if durations else 0
    p90 = durations[int(len(durations) * 0.9)] if durations else 0

    return {
        "total_runs": total,
        "success_rate": (success / total * 100) if total else 0,
        "failure_rate": (failure / total * 100) if total else 0,
        "duration_p50_s": round(p50),
        "duration_p90_s": round(p90),
    }


if __name__ == "__main__":
    runs = get_workflow_runs(days=7)
    metrics = compute_metrics(runs)
    for k, v in metrics.items():
        print(f"{k}: {v}")

Quarantine Workflow

# .github/workflows/quarantine.yml
# HARDENED: Run quarantined tests separately (non-blocking)
name: Quarantined Tests
on:
  pull_request:
    branches: [main]

jobs:
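  quarantined:
    runs-on: ubuntu-latest
    continue-on-error: true # Non-blocking
    permissions:
      contents: read        # needed by actions/checkout
      pull-requests: write  # needed to post the status comment
    steps: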
      - uses: actions/checkout@v4

      - name: Run quarantined tests
        run: |
          pytest -m "quarantine" \
            --junitxml=quarantine-results.xml \
            || true

      - name: Report quarantine status
        if: always()
        uses: actions/github-script@v7
        with:
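          script: |
            const fs = require('fs');
            const path = 'quarantine-results.xml';
            // pytest may exit before writing the report (e.g. on a collection error).
            if (!fs.existsSync(path)) {
              core.warning(`${path} not found; skipping quarantine comment`);
              return;
            }
            const results = fs.readFileSync(path, 'utf8');
            // Read a numeric attribute from the first element that carries it.
            const attr = (name) => Number(results.match(new RegExp(`${name}="(\\d+)"`))?.[1] ?? 0);
            // Skipped tests did not run, and errors are not passes.
            const total = attr('tests') - attr('skipped');
            const passed = total - attr('failures') - attr('errors');

            await github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body: `🔬 Quarantined tests: ${passed}/${total} passed`
            });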

Pytest Quarantine Marker

# conftest.py
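import pytest
import json
from pathlib import Path


def pytest_configure(config):
    # Register the marker so --strict-markers does not reject it.
    config.addinivalue_line(
        "markers", "quarantine: known-flaky test that runs in the non-blocking job"
    )


# tryfirst: attach markers before pytest applies the -m filter.
@pytest.hookimpl(tryfirst=True)
def pytest_collection_modifyitems(config, items):
    """Mark quarantined tests and skip them unless explicitly requested."""
    quarantine_file = Path(".flaky-tests.json")
    if not quarantine_file.exists():
        return

    db = json.loads(quarantine_file.read_text())
    quarantined = {
        name for name, data in db.get("tests", {}).items()
        if data.get("quarantined")
    }

    # "markexpr" is the option name behind -m.
    run_quarantine = config.getoption("markexpr", "") == "quarantine"

    for item in items:
        full_name = f"{item.module.__name__}.{item.name}"
        is_quarantined = full_name in quarantined

        if is_quarantined:
            # `pytest -m quarantine` only selects items that carry this marker.
            item.add_marker(pytest.mark.quarantine)
            if not run_quarantine:
                item.add_marker(pytest.mark.skip(reason="Quarantined (flaky)"))
        elif run_quarantine:
            item.add_marker(pytest.mark.skip(reason="Not quarantined"))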

Automated Issue Creation

# scripts/create-flaky-issues.py
# HARDENED: Create GitHub issues for newly quarantined tests
import json
import os
import requests
from pathlib import Path

GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]
REPO = os.environ["GITHUB_REPOSITORY"]

db = json.loads(Path(".flaky-tests.json").read_text())

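for name, data in db["tests"].items():
    if data.get("quarantined") and not data.get("issue_created"):
        # Tolerate entries that have no recorded occurrences yet.
        occurrences = data.get("occurrences", [])
        last_flaky = occurrences[-1] if occurrences else "unknown"
        resp = requests.post(
            f"https://api.github.com/repos/{REPO}/issues",
            headers={
                "Authorization": f"Bearer {GITHUB_TOKEN}",
                "Accept": "application/vnd.github.v3+json"
            },
            json={
                "title": f"Fix flaky test: {name}",
                "body": (
                    f"Test `{name}` has been quarantined after "
                    f"{len(occurrences)} flaky occurrences.\n\n"
                    f"Last flaky: {last_flaky}\n\n"
                    f"This test now runs only in the non-blocking quarantine job."
                ),
                "labels": ["flaky-test", "tech-debt"]
            },
            timeout=30,
        )
        if resp.status_code == 201:
            data["issue_created"] = True
            data["issue_url"] = resp.json()["html_url"]
        else:
            # Surface the failure; the entry stays unflagged and is retried next run.
            print(f"Could not create issue for {name}: HTTP {resp.status_code}")

# Persist the issue_created flags so the next run does not open duplicates.
Path(".flaky-tests.json").write_text(json.dumps(db, indent=2))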

The Gate

The main test suite is the gate. Quarantined tests run in a separate non-blocking job. The team’s definition of “green build” covers the main suite only: quarantined results never block a merge, but every PR gets a comment showing quarantine status so the failures stay visible.

The Recovery

Too many tests are quarantined: Set a maximum quarantine size (e.g., 20 tests). If the quarantine is full, the team must fix existing flaky tests before quarantining new ones.
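
One way to enforce such a cap is to check it at the moment a test is about to be quarantined. A minimal sketch, assuming the same `.flaky-tests.json` structure as above; `try_quarantine` and `MAX_QUARANTINE` are hypothetical names.

```python
MAX_QUARANTINE = 20  # example cap from the text


def quarantine_size(db: dict) -> int:
    """Count the tests currently flagged as quarantined."""
    return sum(1 for e in db.get("tests", {}).values() if e.get("quarantined"))


def try_quarantine(db: dict, test_name: str, max_size: int = MAX_QUARANTINE) -> bool:
    """Quarantine a test only if the cap allows it; return whether it is quarantined."""
    entry = db.setdefault("tests", {}).setdefault(test_name, {"occurrences": []})
    if entry.get("quarantined"):
        return True  # already quarantined, nothing to do
    if quarantine_size(db) >= max_size:
        return False  # quarantine is full: fix an existing test first
    entry["quarantined"] = True
    return True
```

A `False` return is the signal to fail the quarantine automation loudly rather than silently growing the list.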

Fixed test becomes flaky again: Reset its occurrence count but increase its stability threshold: require 5 consecutive stable runs before reinstatement.
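
The stricter reinstatement rule can be sketched as a streak counter stored next to each test. The field names `stable_streak` and `required_stable_runs` and the baseline of 3 runs are assumptions for illustration; only the threshold of 5 comes from the text.

```python
DEFAULT_STABLE_RUNS = 3    # hypothetical baseline; the text does not specify one
REGRESSED_STABLE_RUNS = 5  # threshold from the text for tests that regressed


def mark_regressed(entry: dict) -> None:
    """A reinstated test flaked again: re-quarantine it with a stricter bar."""
    entry["occurrences"] = []
    entry["quarantined"] = True
    entry["stable_streak"] = 0
    entry["required_stable_runs"] = REGRESSED_STABLE_RUNS


def record_quarantine_run(entry: dict, passed: bool) -> bool:
    """Update the streak after a quarantine-job run; True means reinstate."""
    # Any failure resets the streak, so only consecutive passes count.
    entry["stable_streak"] = (entry.get("stable_streak", 0) + 1) if passed else 0
    required = entry.get("required_stable_runs", DEFAULT_STABLE_RUNS)
    if entry["stable_streak"] >= required:
        entry["quarantined"] = False
        return True
    return False
```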

Metrics collection fails silently: Add a weekly scheduled job that checks if metrics have been collected in the last 7 days. If not, alert via Slack webhook.
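
A minimal sketch of such a staleness check follows. It assumes the time of the last successful collection is available from somewhere (for example a timestamp the metrics script writes), and that the webhook is a Slack incoming webhook accepting a JSON body with a `text` field; the function names are hypothetical.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def metrics_are_stale(last_collected: datetime, now: datetime,
                      max_age_days: int = 7) -> bool:
    """True if the last successful collection is older than the allowed age."""
    return now - last_collected > timedelta(days=max_age_days)


def build_alert(last_collected: datetime) -> dict:
    """Build the JSON body for the webhook."""
    return {"text": f"CI metrics have not been collected since {last_collected:%Y-%m-%d}."}


def send_alert(webhook_url: str, payload: dict) -> None:
    """POST the alert to the webhook URL (an assumed secret in the scheduled job)."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        resp.read()
```

Keeping the staleness decision separate from the network call lets the check be unit-tested without a real webhook.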