ship before you scale

Grafana Dashboards and Alerting

5 min read Chapter 36 of 42

The Feature

A single Grafana dashboard shows everything the developer needs to assess the health of Marketflow:

  1. Request rate: How many requests per minute the API is handling
  2. Error rate: What percentage of requests return 5xx status codes
  3. Response time: P50, P95, and P99 latency
  4. System resources: CPU, memory, and disk usage on the VPS
  5. Business metrics: Active vendors, applications submitted, payments processed

Alerts fire when the error rate exceeds 5%, P95 response time exceeds one second, disk usage exceeds 80%, or memory usage exceeds 90%.

The Decision

One dashboard. Not five. Not one per service. One dashboard with five rows of panels that answers the question “is Marketflow healthy right now?” in under 10 seconds. When something is wrong, the dashboard narrows the scope: is it the API (high error rate), the database (slow response times), or the infrastructure (high CPU or disk)?

The Implementation

Dashboard Layout (PromQL Queries)

Row 1: Traffic Overview

# Request rate (requests per minute)
sum(rate(http_requests_total[5m])) * 60

# Error rate (percentage of 5xx responses)
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100

# Success rate (for the stat panel, shows green number)
100 - (
  sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100
)
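To make the arithmetic behind these three queries concrete, here is a minimal Python sketch, assuming two hypothetical readings of the counters taken five minutes apart. The function names and sample numbers are illustrative and not part of Marketflow, and the sketch ignores the counter-reset handling and extrapolation that PromQL's `rate()` performs.

```python
def requests_per_minute(count_start: float, count_end: float, window_seconds: float) -> float:
    """Approximate sum(rate(http_requests_total[5m])) * 60 from two counter readings."""
    per_second = (count_end - count_start) / window_seconds
    return per_second * 60


def error_percentage(errors_start: float, errors_end: float,
                     total_start: float, total_end: float) -> float:
    """Approximate the 5xx error-rate query over the same window."""
    total_delta = total_end - total_start
    if total_delta == 0:
        # PromQL would produce NaN (no data point) here; 0.0 is a simplification.
        return 0.0
    return (errors_end - errors_start) / total_delta * 100


# Hypothetical readings: 3,000 requests in 5 minutes, 30 of them 5xx.
rpm = requests_per_minute(12_000, 15_000, 300)
err = error_percentage(100, 130, 12_000, 15_000)
print(f"{rpm:.0f} req/min, {err:.1f}% errors, {100 - err:.1f}% success")
```

The success-rate panel is simply `100 - err`, which is why it needs no separate data source.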

Row 2: Latency

# P50 response time (buckets aggregated across all endpoints and methods)
histogram_quantile(0.50, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# P95 response time
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# P99 response time
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
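`histogram_quantile` does not see individual request durations; it estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket where the target rank falls. The following Python sketch mirrors that idea under simplifying assumptions. The bucket boundaries and counts are hypothetical, and the function is an illustration rather than Prometheus's actual implementation.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Hypothetical cumulative counts for buckets le=0.1, 0.25, 0.5, 1.0, +Inf (seconds).
sample = [(0.1, 40), (0.25, 70), (0.5, 90), (1.0, 98), (math.inf, 100)]
p50 = histogram_quantile(0.50, sample)
p95 = histogram_quantile(0.95, sample)
p99 = histogram_quantile(0.99, sample)
print(f"P50={p50:.3f}s P95={p95:.3f}s P99={p99:.3f}s")
```

With these numbers P99 lands in the `+Inf` bucket and is reported as the highest finite boundary, which shows why the largest bucket boundary should sit above the latencies worth alerting on.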

Row 3: Endpoint Breakdown

# Slowest endpoints (P95 by endpoint)
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m]))
  by (le, endpoint)
)

# Most errored endpoints
sum(rate(http_requests_total{status=~"5.."}[5m])) by (endpoint)

Row 4: System Resources (from node_exporter)

# CPU usage percentage
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage percentage
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Disk usage percentage
(1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100
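When a panel looks wrong, it can help to cross-check the number directly on the VPS. A small Python sketch for the disk query, assuming a Linux host where `shutil.disk_usage` reports `free` as the space available to unprivileged users (the same notion as `node_filesystem_avail_bytes`); the function name is illustrative.

```python
import shutil


def disk_usage_percent(path: str = "/") -> float:
    """Percentage of the filesystem at `path` that is not available for new data."""
    usage = shutil.disk_usage(path)
    return (1 - usage.free / usage.total) * 100


print(f"Disk usage on /: {disk_usage_percent('/'):.1f}%")
```

The result should be close to the dashboard value; a large gap usually points at a different mountpoint or a stale scrape.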

Row 5: Business Metrics

# Active vendors (gauge)
marketflow_active_vendors

# Applications submitted in the last 24 hours
increase(marketflow_applications_total[24h])

# Payments processed in the last 24 hours
increase(marketflow_payments_total{status="succeeded"}[24h])
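Counters such as `marketflow_applications_total` restart from zero whenever the API process restarts, for example on a deploy. `increase()` compensates by treating any drop in the value as a reset. The Python sketch below illustrates that idea on a hypothetical list of scraped values; it leaves out the extrapolation to the window edges that Prometheus also applies.

```python
def increase(samples: list[float]) -> float:
    """Total growth of a counter across samples, treating any drop as a reset to zero."""
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            # The counter restarted from 0 and has counted up to `current` since.
            total += current
    return total


# Hypothetical scrapes: the process restarted between the values 40 and 3.
scrapes = [10, 25, 40, 3, 12]
print(increase(scrapes))
```

Subtracting the first value from the last would give 2 here, while the reset-aware sum gives 42, which is why the raw counter value should not be graphed directly.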

System Metrics Collection

# /etc/grafana-agent.yaml (expanded)
server:
  log_level: warn

metrics:
  configs:
    - name: marketflow
      scrape_configs:
        - job_name: marketflow-api
          scrape_interval: 60s
          static_configs:
            - targets: ["localhost:8000"]

        - job_name: node
          scrape_interval: 60s
          static_configs:
            - targets: ["localhost:9100"]

      remote_write:
        - url: https://prometheus-prod-xx.grafana.net/api/prom/push
          basic_auth:
            username: "<GRAFANA_CLOUD_USER_ID>"
            password: "<GRAFANA_CLOUD_API_KEY>"

Install node_exporter on the VPS for system metrics:

# On the Hetzner VPS
sudo apt-get install prometheus-node-exporter
sudo systemctl enable prometheus-node-exporter
sudo systemctl start prometheus-node-exporter

Alert Rules

Configure in Grafana Cloud (Alerting > Alert rules):

# Alert: High Error Rate
- alert: HighErrorRate
  expr: >
    sum(rate(http_requests_total{status=~"5.."}[5m]))
    / sum(rate(http_requests_total[5m])) > 0.05
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Error rate above 5% for 2 minutes"

# Alert: Slow Response Times
- alert: SlowResponses
  expr: >
    histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "P95 response time above 1 second for 5 minutes"

# Alert: High Disk Usage
- alert: HighDiskUsage
  expr: >
    (1 - node_filesystem_avail_bytes{mountpoint="/"}
    / node_filesystem_size_bytes{mountpoint="/"}) > 0.80
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Disk usage above 80%"

# Alert: High Memory Usage
- alert: HighMemoryUsage
  expr: >
    (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.90
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Memory usage above 90%"

Alert Notification Channel

Configure a notification channel in Grafana Cloud. Options include email (free), Slack webhook, or a webhook to a custom endpoint. For a solo developer, email notifications are sufficient.

# Grafana Cloud > Alerting > Contact points
- name: developer-email
  type: email
  settings:
    addresses: "<DEVELOPER_EMAIL>"
    singleEmail: true

The Trap

# TRAP: Alerting on every metric anomaly
- alert: HighCPU
  expr: 1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.50 # 50% CPU
  for: 1m
  # Fires during every deployment, every database migration, every
  # image optimization. Alert fatigue sets in within a week.
  # The developer starts ignoring all alerts.

# SAFE: Alert only on conditions that require action
- alert: HighCPU
  expr: 1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.90 # 90% CPU
  for: 10m # Sustained for 10 minutes
  # This means something is genuinely wrong, not a temporary spike

Alert fatigue kills observability. Every false positive trains the developer to ignore alerts. Set thresholds high enough that firing always means a real problem. A 50% CPU alert fires during normal operations. A 90% sustained CPU alert fires when the server is overwhelmed. Only the second one requires action.
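The difference between the two rules can be illustrated with a rough Python sketch. It assumes one CPU sample per minute and treats `for: Nm` as N consecutive samples above the threshold, which is a simplification of how rule evaluation works. The trace is invented for illustration: three short deploy spikes and one sustained overload.

```python
def count_firings(samples: list[float], threshold: float, for_samples: int) -> int:
    """Count how often an alert would fire on a series of samples.

    The alert fires once each time the value stays above `threshold`
    for `for_samples` consecutive samples.
    """
    firings = 0
    streak = 0
    for value in samples:
        if value > threshold:
            streak += 1
            if streak == for_samples:
                firings += 1  # pending -> firing transition
        else:
            streak = 0  # condition cleared, so the pending state resets
    return firings


idle = [0.20] * 5
deploy_spike = [0.70] * 3   # hypothetical 3-minute spike during a deploy
overload = [0.95] * 12      # hypothetical 12-minute sustained overload
trace = idle + deploy_spike + idle + deploy_spike + idle + deploy_spike + idle + overload + idle

noisy = count_firings(trace, threshold=0.50, for_samples=1)
strict = count_firings(trace, threshold=0.90, for_samples=10)
print(f"50%/1m rule fired {noisy} times, 90%/10m rule fired {strict} time(s)")
```

On this trace the 50% rule fires on every deploy as well as on the overload, while the 90% rule fires only for the overload, the one event that needs attention.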

The Cost

| Component | Free Tier |
| --- | --- |
| Grafana Cloud | 10,000 active series, 14-day retention |
| Grafana Agent | $0 (open source) |
| node_exporter | $0 (open source) |
| prometheus_client | $0 (Python library) |
| Email alerts | Included in Grafana Cloud free tier |

The entire observability stack costs $0. Grafana Cloud’s free tier provides 10,000 active metric series. Marketflow generates approximately 500 series for its request counters (20 endpoints x 5 HTTP methods x 5 status codes), plus the system metrics from node_exporter. The 14-day retention is sufficient for debugging recent issues.
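The series estimate is plain multiplication of label values, and it is worth redoing whenever a label is added. A minimal Python sketch using only the figures quoted above; the function name is illustrative, and the latency histogram and node_exporter series are deliberately not counted because the text gives no numbers for them.

```python
FREE_TIER_ACTIVE_SERIES = 10_000  # Grafana Cloud free tier limit quoted above


def counter_series(endpoints: int, methods: int, status_codes: int) -> int:
    """Worst-case series count for one counter labelled by endpoint, method, and status."""
    return endpoints * methods * status_codes


http_counter = counter_series(endpoints=20, methods=5, status_codes=5)
headroom = FREE_TIER_ACTIVE_SERIES - http_counter
print(f"{http_counter} counter series, {headroom} series of headroom")
```

Because the count is a product, adding one high-cardinality label (a vendor ID, for instance) multiplies the total rather than adding to it.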