Observability on the Cheap: Sentry for Errors, Grafana Cloud for Metrics, and Knowing What Broke Before Your Users Tell You
A production application without observability is a box with no windows. Requests go in, responses come out, and when something breaks, the only signal is a customer complaint. Observability on a budget means three things: knowing when errors happen (Sentry), knowing how the system is performing (Grafana Cloud), and knowing what the application did (structured logs).
The free tiers of Sentry and Grafana Cloud provide more than enough visibility for Marketflow’s first hundred customers. Sentry catches exceptions, groups them, and provides stack traces with local variables. Grafana Cloud collects system and application metrics, displays them on dashboards, and sends alerts.
The Feature
When a vendor’s application submission fails, the developer receives a Sentry alert within 60 seconds. The alert includes the full stack trace, the request context, the authenticated user’s ID, and the database query that failed. The Grafana dashboard shows request rate, error rate, response time percentiles, and system resource usage. If the error rate exceeds 5% or response times exceed one second, the developer receives a notification.
The Decision
Sentry for errors, not logs. Sentry excels at error tracking: deduplication, grouping, stack traces with context, release tracking. Using Sentry for general logging wastes the event quota and makes real errors harder to find. Errors go to Sentry. Application logs stay in Docker container logs, queryable with docker compose logs.
Grafana Cloud for metrics, not self-hosted Grafana. Self-hosted Grafana requires a Prometheus instance, persistent storage for metrics data, and ongoing maintenance. Grafana Cloud’s free tier includes 10,000 active metric series, 50 GB of logs, and 50 GB of traces. At Marketflow’s scale, this is more than sufficient.
The Implementation
Sentry Setup
# Install the Sentry SDK
cd backend && uv add "sentry-sdk[fastapi]"
# backend/app/main.py
import sentry_sdk
from sentry_sdk.integrations.fastapi import FastApiIntegration
from sentry_sdk.integrations.sqlalchemy import SqlalchemyIntegration
from app.config import settings
if settings.sentry_dsn:
    sentry_sdk.init(
        dsn=settings.sentry_dsn,
        integrations=[
            FastApiIntegration(transaction_style="endpoint"),
            SqlalchemyIntegration(),
        ],
        traces_sample_rate=0.1,  # Sample 10% of transactions for performance
        profiles_sample_rate=0.1,
        environment=settings.environment,
        release=settings.app_version,
        # Don't send PII (emails, IPs) to Sentry
        send_default_pii=False,
    )
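The init call reads sentry_dsn, environment, and app_version from app.config, which is not shown here. A minimal standard-library sketch of what that module could look like follows; the environment variable names (SENTRY_DSN, ENVIRONMENT, APP_VERSION), the defaults, and the load_settings helper are assumptions for illustration, not Marketflow's actual configuration.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    sentry_dsn: Optional[str]  # None means Sentry stays disabled
    environment: str
    app_version: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables (variable names are assumptions)."""
    return Settings(
        sentry_dsn=env.get("SENTRY_DSN") or None,  # treat an empty string as unset
        environment=env.get("ENVIRONMENT", "development"),
        app_version=env.get("APP_VERSION", "unknown"),
    )


settings = load_settings()
```

Leaving the DSN unset in local development keeps the `if settings.sentry_dsn:` guard false, so no events are sent from a laptop.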
Sentry Context Enrichment
# backend/app/middleware/sentry_context.py
import sentry_sdk
from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request
class SentryContextMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        # Add request context without PII
        sentry_sdk.set_context("request_info", {
            "path": request.url.path,
            "method": request.method,
            "query_params": dict(request.query_params),
        })
        # Add user context if authenticated (ID only, no email)
        if hasattr(request.state, "user") and request.state.user:
            sentry_sdk.set_user({"id": str(request.state.user.id)})
        response = await call_next(request)
        return response
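The middleware forwards every query parameter, and query strings sometimes carry emails or tokens. One way to keep the "without PII" promise is to redact known-sensitive keys before calling set_context. The sketch below is an assumption about how that could be done; SENSITIVE_KEYS and scrub_query_params are hypothetical names, and the deny-list would need to match the application's real parameters.

```python
from typing import Mapping

# Hypothetical deny-list; extend it to match the application's parameters.
SENSITIVE_KEYS = frozenset({"email", "token", "password", "api_key"})


def scrub_query_params(params: Mapping[str, str]) -> dict[str, str]:
    """Return a copy of params with the values of sensitive keys redacted."""
    return {
        key: "[redacted]" if key.lower() in SENSITIVE_KEYS else value
        for key, value in params.items()
    }
```

In the middleware, this would replace dict(request.query_params) with scrub_query_params(request.query_params).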
Structured Logging
# backend/app/logging_config.py
import logging
import json
from datetime import datetime, timezone


class JSONFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        log_entry = {
            # Timezone-aware UTC time at which the record was created
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info and record.exc_info[0]:
            log_entry["exception"] = self.formatException(record.exc_info)
        # Add extra fields if present
        for key in ("request_id", "user_id", "market_id", "duration_ms"):
            if hasattr(record, key):
                log_entry[key] = getattr(record, key)
        return json.dumps(log_entry)


def configure_logging() -> None:
    handler = logging.StreamHandler()
    handler.setFormatter(JSONFormatter())
    root_logger = logging.getLogger()
    root_logger.handlers = [handler]
    root_logger.setLevel(logging.INFO)
    # Reduce noise from libraries
    logging.getLogger("uvicorn.access").setLevel(logging.WARNING)
    logging.getLogger("sqlalchemy.engine").setLevel(logging.WARNING)
JSON-formatted logs are searchable with docker compose logs --no-log-prefix | jq. The --no-log-prefix flag matters: by default Compose prepends the container name to every line, and jq cannot parse the result. Plain text logs require grep and regex. The structured format makes it possible to filter by request ID, user ID, or any other field.
# Search for errors in the last hour
docker compose logs backend --no-log-prefix --since 1h | jq 'select(.level == "ERROR")'
# Find all requests for a specific user
docker compose logs backend --no-log-prefix --since 1h | jq 'select(.user_id == "550e8400-...")'
# Find slow requests (duration_ms above 200)
docker compose logs backend --no-log-prefix --since 1h | jq 'select(.duration_ms > 200)'
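The formatter only emits request_id, user_id, market_id, and duration_ms when they exist as attributes on the log record, and the standard way to put them there is the extra argument of the logging call. The sketch below shows that mechanism with a cut-down stand-in for JSONFormatter so that it runs on its own; ExtraFieldsFormatter, the logger name, and the field values are illustrative assumptions.

```python
import io
import json
import logging

EXTRA_KEYS = ("request_id", "user_id", "market_id", "duration_ms")


class ExtraFieldsFormatter(logging.Formatter):
    """Cut-down stand-in for JSONFormatter: level, message, known extra fields."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {"level": record.levelname, "message": record.getMessage()}
        entry.update(
            {key: getattr(record, key) for key in EXTRA_KEYS if hasattr(record, key)}
        )
        return json.dumps(entry)


stream = io.StringIO()  # stands in for stdout so the output can be inspected
handler = logging.StreamHandler(stream)
handler.setFormatter(ExtraFieldsFormatter())

logger = logging.getLogger("marketflow.example")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the example isolated from the root logger

# Keys passed via `extra` become attributes on the LogRecord.
logger.info(
    "application submitted",
    extra={"request_id": "req-123", "duration_ms": 42},
)

logged = json.loads(stream.getvalue())
```

Fields that are not passed in extra are simply absent from the JSON line, which is why the jq filters above select on a field's value rather than assuming it exists.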
Grafana Cloud Metrics
# backend/app/metrics.py
from prometheus_client import Counter, Histogram, Gauge
# Request metrics
REQUEST_COUNT = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status"],
)
REQUEST_DURATION = Histogram(
    "http_request_duration_seconds",
    "HTTP request duration in seconds",
    ["method", "endpoint"],
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5],
)

# Business metrics
ACTIVE_VENDORS = Gauge(
    "marketflow_active_vendors",
    "Number of active vendors",
)
APPLICATIONS_SUBMITTED = Counter(
    "marketflow_applications_total",
    "Total vendor applications submitted",
    ["market_id"],
)
PAYMENTS_PROCESSED = Counter(
    "marketflow_payments_total",
    "Total payments processed",
    ["status"],
)
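The bucket list on REQUEST_DURATION decides how precisely latency can be read back later. Prometheus histogram buckets are cumulative: an observation increments every bucket whose upper bound is greater than or equal to it, plus the implicit +Inf bucket. The following standalone sketch illustrates that rule using the same boundaries; buckets_incremented is a hypothetical helper for illustration and is not part of prometheus_client.

```python
import math
from bisect import bisect_left

# Same boundaries as REQUEST_DURATION above
BUCKETS = [0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5]


def buckets_incremented(duration_seconds: float) -> list[float]:
    """Upper bounds of every cumulative bucket that counts this observation."""
    bounds = BUCKETS + [math.inf]
    first = bisect_left(bounds, duration_seconds)  # first bound >= duration
    return bounds[first:]


fast = buckets_incremented(0.3)  # counted in 0.5, 1.0, 2.5 and +Inf
slow = buckets_incremented(4.0)  # counted only in +Inf
```

Two consequences follow. The 1.0 boundary lines up with the one-second alert threshold, so "slower than one second" can be read directly from the buckets. Anything slower than 2.5 seconds lands only in +Inf, so percentile estimates derived from these buckets cannot resolve latencies beyond that point.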
Metrics Middleware
# backend/app/middleware/metrics.py
import re
import time

from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request

from app.metrics import REQUEST_COUNT, REQUEST_DURATION

# Compiled once at import time instead of on every request
UUID_PATTERN = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)


class MetricsMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        start = time.perf_counter()
        response = await call_next(request)
        duration = time.perf_counter() - start
        # Normalize path to avoid cardinality explosion:
        # /markets/550e8400-... becomes /markets/{id}
        path = UUID_PATTERN.sub("{id}", request.url.path)
        REQUEST_COUNT.labels(
            method=request.method,
            endpoint=path,
            status=response.status_code,
        ).inc()
        REQUEST_DURATION.labels(
            method=request.method,
            endpoint=path,
        ).observe(duration)
        return response
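The UUID pattern in the middleware only matches lowercase hex, and it leaves numeric IDs untouched. If the API ever exposes uppercase UUIDs or integer IDs in paths, those would reintroduce the cardinality problem. A possible extension, shown as a standalone sketch: normalize_path and the numeric-segment rule are assumptions added for illustration and are not part of the original middleware.

```python
import re

UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)
# A path segment made up entirely of digits, e.g. the 42 in /vendors/42
NUMERIC_SEGMENT_RE = re.compile(r"(?<=/)\d+(?=/|$)")


def normalize_path(path: str) -> str:
    """Replace UUIDs and all-digit path segments with a fixed placeholder."""
    path = UUID_RE.sub("{id}", path)
    return NUMERIC_SEGMENT_RE.sub("{id}", path)


example = normalize_path("/markets/550e8400-e29b-41d4-a716-446655440000/vendors")
```

Segments that merely contain digits, such as a version prefix like /api/v2, are left alone because the numeric rule only matches segments that are digits from slash to slash.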
Prometheus Metrics Endpoint
# backend/app/routers/metrics.py
from fastapi import APIRouter
from prometheus_client import generate_latest, CONTENT_TYPE_LATEST
from starlette.responses import Response
router = APIRouter()
@router.get("/metrics")
async def metrics():
    """Prometheus metrics endpoint.

    Scraped by Grafana Cloud Agent every 60 seconds.
    """
    return Response(
        content=generate_latest(),
        media_type=CONTENT_TYPE_LATEST,
    )
Grafana Cloud Agent Configuration
# /etc/grafana-agent.yaml (on the Hetzner VPS)
server:
  log_level: warn
metrics:
  configs:
    - name: marketflow
      scrape_configs:
        - job_name: marketflow-api
          scrape_interval: 60s
          static_configs:
            - targets: ["localhost:8000"]
      remote_write:
        - url: https://prometheus-prod-xx.grafana.net/api/prom/push
          basic_auth:
            username: "<GRAFANA_CLOUD_USER_ID>"
            password: "<GRAFANA_CLOUD_API_KEY>"
Health Check
# backend/app/routers/health.py
from fastapi import APIRouter, Depends
from sqlalchemy import text
from sqlalchemy.ext.asyncio import AsyncSession
from app.database import get_db
from app.services.cache import cache_get
router = APIRouter()
@router.get("/health")
async def health_check(db: AsyncSession = Depends(get_db)):
    checks = {}
    # Database connectivity
    try:
        await db.execute(text("SELECT 1"))
        checks["database"] = "healthy"
    except Exception as e:
        checks["database"] = f"unhealthy: {e}"
    # Redis connectivity
    try:
        await cache_get("health_check_ping")
        checks["redis"] = "healthy"
    except Exception as e:
        checks["redis"] = f"unhealthy: {e}"
    status = "healthy" if all(
        v == "healthy" for v in checks.values()
    ) else "degraded"
    return {"status": status, "checks": checks}
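As written, the endpoint returns its result with the default 200 status code whether the checks pass or not, so an uptime monitor that only looks at the status code would not notice a degraded dependency. One possible refinement, assuming a 503 is the desired signal, is to derive the status code from the checks. The helper below is a hypothetical sketch and is not part of the original router.

```python
def summarize_health(checks: dict[str, str]) -> tuple[int, dict]:
    """Map per-dependency results to an HTTP status code and a response body."""
    healthy = all(value == "healthy" for value in checks.values())
    body = {"status": "healthy" if healthy else "degraded", "checks": checks}
    return (200 if healthy else 503), body


code, body = summarize_health(
    {"database": "healthy", "redis": "unhealthy: timeout"}
)
```

In the route, the pair could be returned as JSONResponse(status_code=code, content=body) from fastapi.responses.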
The Trap
# TRAP: High-cardinality labels in Prometheus metrics
REQUEST_COUNT.labels(
    method=request.method,
    endpoint=request.url.path,  # Includes UUIDs: /markets/550e8400-...
    status=response.status_code,
    user_id=str(user.id),  # Unique per user
).inc()
# 1000 unique paths x 100 users = 100,000 time series
# Grafana Cloud free tier allows 10,000 active series

# SAFE: Normalize paths, avoid per-user labels
REQUEST_COUNT.labels(
    method=request.method,
    endpoint="/markets/{id}",  # Normalized
    status=response.status_code,
).inc()
# ~20 endpoints x 5 methods x 5 status codes = 500 time series
High-cardinality labels generate thousands of unique time series. Each unique combination of label values creates a separate time series in Prometheus. The Grafana Cloud free tier limits you to 10,000 active series. Normalizing paths and avoiding per-user labels keeps the cardinality under control.
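The arithmetic is worth running before adding a label. A counter can produce up to one series per combination of label values; a histogram produces one series per bucket plus +Inf, _sum, and _count for each combination. The sketch below applies that to the figures used in this chapter. The function names are hypothetical, and the results are rough worst-case estimates: not every combination occurs in practice, and client libraries may export additional bookkeeping series.

```python
from math import prod


def counter_series(label_cardinalities: dict[str, int]) -> int:
    """Worst case: one series per combination of label values."""
    return prod(label_cardinalities.values())


def histogram_series(label_cardinalities: dict[str, int], buckets: int) -> int:
    """Per combination: one series per bucket, plus +Inf, _sum and _count."""
    return prod(label_cardinalities.values()) * (buckets + 3)


# Normalized paths, no per-user label
safe_counter = counter_series({"method": 5, "endpoint": 20, "status": 5})
safe_histogram = histogram_series({"method": 5, "endpoint": 20}, buckets=8)

# Raw paths plus a user_id label
trap_counter = counter_series({"endpoint": 1000, "user_id": 100})
```

Under these assumptions the normalized request metrics total roughly 1,600 series (500 for the counter, 1,100 for the histogram), comfortably inside the limit, while the trap version reaches 100,000 from the counter alone.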
The Cost
| Component | Free Tier |
|---|---|
| Sentry | 5,000 errors/month, 10,000 transactions |
| Grafana Cloud | 10,000 active metric series, 50 GB logs, 50 GB traces |
| Prometheus client library | $0 |
| Grafana Agent | $0 |
At 50 customers, Marketflow generates approximately 100-500 errors per month (mostly transient network issues) and 50,000-200,000 requests. Sentry’s 5,000 error quota is ample. Grafana Cloud’s 10,000 metric series accommodate all of Marketflow’s application and system metrics.
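The transaction quota deserves the same check as the error quota. With traces_sample_rate=0.1, the number of sampled transactions is roughly one tenth of the request count. The sketch below runs that arithmetic with the figures from this chapter; it assumes every request yields at most one transaction and that the 10,000-transaction allowance in the table is monthly. The function names are hypothetical.

```python
def sampled_transactions(requests_per_month: int, sample_rate: float) -> int:
    """Expected number of transactions sent at a given sample rate."""
    return round(requests_per_month * sample_rate)


def max_sample_rate(requests_per_month: int, quota: int) -> float:
    """Highest sample rate that keeps expected transactions within the quota."""
    return min(1.0, quota / requests_per_month)


low = sampled_transactions(50_000, 0.1)
high = sampled_transactions(200_000, 0.1)
ceiling = max_sample_rate(200_000, 10_000)
```

At the low end of the traffic range the 10% sample rate uses about half of the allowance; at the high end it would roughly double it, so under these assumptions a rate closer to 0.05 keeps the busiest month within the free tier.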