ship it and sleep

Hotfix Pipelines and Post-Incident Pipeline Improvements

Chapter 63 of 66

The Failure

The production incident required a one-line fix. The normal pipeline took 22 minutes: lint, unit tests, integration tests, security scan, performance test, staging deploy, staging smoke test, production deploy. The service was down for the entire 22 minutes because the team did not have a fast path.

A hotfix pipeline cuts the feedback loop to under 5 minutes by running only the gates that matter during an incident.

The Mechanism

Gate Reduction Rules

| Gate | Normal Pipeline | Hotfix Pipeline | Rationale |
|---|---|---|---|
| Lint | ✓ | ✗ | Style does not cause outages |
| Unit tests | ✓ | ✓ (short mode) | Catch regressions |
| Integration tests | ✓ | ✗ | Too slow for incidents |
| Security scan (critical) | ✓ | ✓ | Non-negotiable |
| Security scan (high) | ✓ | ✗ | Can wait |
| Performance test | ✓ | ✗ | Not relevant to hotfix |
| Staging deploy | ✓ | ✗ | Skip to production |
| Production deploy | ✓ | ✓ | The point |
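The gate reduction can also be written down as data, which makes the difference between the two pipelines easy to review. The following is a minimal sketch, not the chapter's implementation: the function name `gates_for_mode` and the gate identifiers are hypothetical, and the hotfix list is inferred from the rationale column above.

```shell
#!/usr/bin/env bash
# Sketch: list the gates that run in each pipeline mode.
# Gate identifiers are illustrative names, not real job IDs.

gates_for_mode() {
  case "$1" in
    hotfix)
      # Only the gates the rationale column treats as essential during an incident
      echo "unit-tests-short security-scan-critical production-deploy"
      ;;
    full)
      echo "lint unit-tests integration-tests security-scan-critical security-scan-high performance-test staging-deploy production-deploy"
      ;;
    *)
      echo "unknown mode: $1" >&2
      return 1
      ;;
  esac
}

# Print the gates a hotfix would run, one per line
for gate in $(gates_for_mode hotfix); do
  echo "run: $gate"
done
```

Keeping the list in one place means a post-incident review can change a single line when a gate moves in or out of the hotfix path.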

Hotfix Branch Convention

hotfix/INCIDENT-123-fix-payment-timeout

The branch name prefix hotfix/ triggers the reduced pipeline. Every hotfix must then be merged to main within 24 hours through a normal PR that runs the full pipeline.
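A naming convention only helps if it is checked. The sketch below validates a branch name before it is pushed. It assumes incident IDs have the form INCIDENT-<digits> followed by a lowercase, hyphen-separated slug, as in the example above; the function name `is_valid_hotfix_branch` is hypothetical, and the pattern should be adjusted to match your tracker.

```shell
#!/usr/bin/env bash
# Sketch: check a branch name against the hotfix naming convention.
# Assumption: hotfix/INCIDENT-<digits>-<lowercase-slug>

is_valid_hotfix_branch() {
  # grep -E applies the extended regular expression; -q suppresses output
  printf '%s\n' "$1" | grep -Eq '^hotfix/INCIDENT-[0-9]+-[a-z0-9]+(-[a-z0-9]+)*$'
}

if is_valid_hotfix_branch "hotfix/INCIDENT-123-fix-payment-timeout"; then
  echo "ok: branch follows the hotfix convention"
fi
```

A check like this could run in a pre-push hook so that a misnamed branch is caught before it silently falls through to the full pipeline.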

The Implementation

Hotfix Branch Rules

# checkout-service/.github/workflows/ci.yml
# HARDENED: Route to appropriate pipeline based on branch
name: CI
on:
  push:
    branches:
      - main
      - "hotfix/**"
  pull_request:
    branches: [main]

jobs:
  determine-pipeline:
    runs-on: ubuntu-latest
    outputs:
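      # Match the full ref prefix so a branch such as feature/hotfix/x is not treated as a hotfix
      is-hotfix: ${{ startsWith(github.ref, 'refs/heads/hotfix/') }}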
    steps:
      - run: echo "Branch type determined"

  full-pipeline:
    needs: determine-pipeline
    if: needs.determine-pipeline.outputs.is-hotfix != 'true'
    uses: ./.github/workflows/full-ci.yml

  hotfix-pipeline:
    needs: determine-pipeline
    if: needs.determine-pipeline.outputs.is-hotfix == 'true'
    uses: ./.github/workflows/hotfix-ci.yml
    secrets: inherit # The called workflow needs INFRA_REPO_TOKEN and SLACK_WEBHOOK
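The routing decision is small enough to reproduce locally, which helps when checking edge cases before trusting it in CI. This is a sketch of the same decision in shell, with a hypothetical function name `pipeline_for_ref`. It matches on the full ref prefix: a ref such as refs/heads/feature/hotfix/cleanup contains hotfix/ but is not a hotfix branch, and a substring match would route it to the reduced pipeline.

```shell
#!/usr/bin/env bash
# Sketch: decide which pipeline a Git ref should run.
# Input is a full ref as GitHub exposes it in github.ref.

pipeline_for_ref() {
  case "$1" in
    refs/heads/hotfix/*) echo "hotfix" ;;  # branch pushed under hotfix/
    *)                   echo "full" ;;    # main, pull requests, everything else
  esac
}

for ref in refs/heads/main refs/heads/hotfix/INCIDENT-123-fix-payment-timeout refs/pull/42/merge; do
  echo "$ref -> $(pipeline_for_ref "$ref")"
done
```

Pull request refs take the form refs/pull/<number>/merge, so the PR that brings a hotfix back to main runs the full pipeline.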

Hotfix CI Workflow

# checkout-service/.github/workflows/hotfix-ci.yml
# HARDENED: Reduced pipeline for emergency fixes
name: Hotfix CI
on:
  workflow_call:

jobs:
  hotfix:
    runs-on: ubuntu-latest
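    environment: production # Requires environment approval
    permissions:
      contents: read
      packages: write # Needed to push the image to ghcr.io
    steps:
      - uses: actions/checkout@v4

      - name: Unit tests (short)
        run: go test -short -count=1 ./...
        timeout-minutes: 3

      - name: Critical security scan
        # Pin to a release tag or commit SHA in real use; @master is a moving target
        uses: aquasecurity/trivy-action@master
        with:
          scan-type: "fs"
          severity: "CRITICAL"
          exit-code: "1"
        timeout-minutes: 2

      # The gha cache backend needs a Buildx builder
      - uses: docker/setup-buildx-action@v3

      # push: true fails without registry credentials
      - name: Log in to GHCR
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          push: true
          tags: ghcr.io/acme/checkout-service:hotfix-${{ github.sha }}
          cache-from: type=gha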

      - name: Deploy to production
        uses: peter-evans/repository-dispatch@v3
        with:
          token: ${{ secrets.INFRA_REPO_TOKEN }}
          repository: acme/ecommerce-infra
          event-type: deploy-service
          client-payload: |
            {
              "service": "checkout-service",
              "tag": "hotfix-${{ github.sha }}",
              "environment": "production",
              "hotfix": true
            }

      - name: Notify team
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": ":ambulance: Hotfix deployed: checkout-service hotfix-${{ github.sha }}"
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}

Post-Incident Pipeline Audit

# .github/ISSUE_TEMPLATE/post-incident-pipeline.yml
# HARDENED: Structured pipeline improvement after incidents
name: Post-Incident Pipeline Review
description: Review and improve the pipeline after an incident
body:
  - type: input
    id: incident
    attributes:
      label: Incident ID
    validations:
      required: true

  - type: textarea
    id: timeline
    attributes:
      label: Deployment Timeline
      value: |
        - Deploy started:
        - Issue detected:
        - Rollback initiated:
        - Service recovered:
        - Root cause identified:
        - Hotfix deployed:

  - type: checkboxes
    id: pipeline-gaps
    attributes:
      label: Pipeline Gaps
      options:
        - label: Could a test have caught this?
        - label: Could canary analysis have caught this?
        - label: Could a security scan have caught this?
        - label: Was rollback fast enough?
        - label: Did monitoring detect the issue?

  - type: textarea
    id: improvements
    attributes:
      label: Pipeline Improvements
      description: Specific, actionable changes to the pipeline
      value: |
        1. Add test:
        2. Tighten canary threshold:
        3. Add monitoring:
        4. Update runbook:

Runbook as Code

#!/bin/bash
# runbooks/payments-outage.sh
# HARDENED: Automated runbook for payments service outage
set -euo pipefail

echo "=== Payments Service Outage Runbook ==="
echo ""

echo "Step 1: Check service health"
argocd app get payments-production -o json | jq '.status.health'

echo ""
echo "Step 2: Check recent deploys"
git -C /tmp/ecommerce-infra log --oneline --grep="payments" -5

echo ""
read -r -p "Rollback to previous version? (y/n): " ROLLBACK
if [[ "$ROLLBACK" == "y" ]]; then
  bash scripts/argocd-safe-rollback.sh payments-production
fi

echo ""
echo "Step 3: Check pod logs"
kubectl logs -n production -l app=payments-service --tail=50

echo ""
echo "Step 4: Check dependencies"
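# -f makes curl fail on HTTP errors; --max-time keeps a hung dependency from stalling the runbook
kubectl exec -n production deploy/payments-service -- \
  curl -fsS --max-time 5 http://checkout-service.production.svc.cluster.local/health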

The Gate

The hotfix pipeline uses GitHub Environments with required reviewers. The production environment requires approval from at least one person on the on-call rotation. This prevents accidental hotfix deploys while keeping the process fast enough for emergencies.

The Recovery

Hotfix pipeline is abused for normal deploys: Monitor how often hotfix branches are created. If there are more than two hotfixes per month, either the normal pipeline is too slow or the team is taking shortcuts. Fix the root cause.
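The two-per-month threshold is easy to check mechanically. The sketch below counts hotfix branches for a given month from lines of the form "YYYY-MM-DD branch-name" on standard input. The function names are hypothetical and the sample data is made up for illustration. One way to produce the input, assuming hotfix branches are not deleted after merge, is `git for-each-ref --format='%(committerdate:short) %(refname:short)' refs/remotes/origin/hotfix`.

```shell
#!/usr/bin/env bash
# Sketch: flag months with too many hotfixes.
# Input on stdin: one "YYYY-MM-DD <branch>" line per hotfix.

count_hotfixes_in_month() {
  # $1 = month as YYYY-MM. grep -c exits non-zero on zero matches, hence || true.
  grep -c "^$1-" || true
}

hotfix_budget_exceeded() {
  # $1 = month as YYYY-MM, $2 = allowed hotfixes per month (default 2)
  count=$(count_hotfixes_in_month "$1")
  [ "$count" -gt "${2:-2}" ]
}

# Made-up sample data for illustration only
sample='2024-05-02 hotfix/INCIDENT-101-fix-timeout
2024-05-10 hotfix/INCIDENT-104-fix-retry
2024-05-20 hotfix/INCIDENT-109-fix-cache'

if printf '%s\n' "$sample" | hotfix_budget_exceeded 2024-05; then
  echo "more than 2 hotfixes in 2024-05: review the normal pipeline"
fi
```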

Hotfix breaks something else: The reduced test suite missed a regression. After the incident is resolved, backfill: create a PR from the hotfix branch to main and run the full pipeline. Any failures must be fixed before merge.
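The backfill is only reliable if someone notices when it has not happened. The following sketch flags a hotfix that has been in production for more than 24 hours, the merge window given in the branch convention. The function name `hotfix_merge_overdue` is hypothetical, and both timestamps are epoch seconds supplied by the caller.

```shell
#!/usr/bin/env bash
# Sketch: detect a hotfix that has outlived the 24-hour merge window.

hotfix_merge_overdue() {
  # $1 = time the hotfix was deployed (epoch seconds)
  # $2 = current time (epoch seconds); defaults to now
  deployed="$1"
  now="${2:-$(date +%s)}"
  [ $(( now - deployed )) -gt 86400 ]  # 86400 seconds = 24 hours
}

# Pretend the hotfix shipped one hour ago
deployed_at=$(( $(date +%s) - 3600 ))
if hotfix_merge_overdue "$deployed_at"; then
  echo "overdue: open the backfill PR now"
else
  echo "within the 24-hour window"
fi
```

To tell whether a hotfix commit has already reached main, `git merge-base --is-ancestor <sha> origin/main` exits 0 when it has. Note that this check does not recognise squash merges, which create a new commit.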

Runbook is outdated: Version runbooks alongside code. Review them during post-incident reviews. If the runbook did not help, update it. If no runbook existed, create one.
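Versioning runbooks alongside code also means they can be checked like code. As a minimal sketch, the function below runs `bash -n` (parse without executing) over every script in a runbook directory, so a runbook with a syntax error is caught in CI and not during an outage. The function name `check_runbooks` and the default directory `runbooks` are assumptions based on the path used earlier in this chapter.

```shell
#!/usr/bin/env bash
# Sketch: syntax-check every runbook script without running it.

check_runbooks() {
  # $1 = directory holding runbook scripts (default: runbooks)
  dir="${1:-runbooks}"
  status=0
  for f in "$dir"/*.sh; do
    [ -e "$f" ] || continue          # skip the unexpanded glob when the directory is empty
    if ! bash -n "$f" 2>/dev/null; then
      echo "syntax error: $f"
      status=1
    fi
  done
  return "$status"
}

if check_runbooks runbooks; then
  echo "all runbooks parse cleanly"
fi
```

A syntax check does not prove that a runbook is still accurate; it only guarantees that the script will start when it is needed.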