Building the Observability Dashboard
The Symptom
The team has Prometheus metrics for MongoDB (CH13, CH14, CH19), JVM metrics (CH3), connection pool metrics (CH4), and OpenTelemetry traces (CH24). But diagnosing an incident requires opening 6 dashboards and mentally correlating timestamps. The mean time to diagnose a performance issue is 25 minutes.
The Cause
Each monitoring tool answers a narrow question:
- Prometheus: “What are the current values of these counters?”
- Traces: “How long did this specific request take?”
- Logs: “What error messages were emitted?”
- MongoDB profiler: “Which queries were slow on the server?”
Without a unified dashboard, the engineer must manually correlate: “The HTTP p99 spiked at 14:32. Was there a checkpoint at 14:32? Was replication lagging? Was the connection pool exhausted?”
The Benchmark
| Diagnosis approach | Mean time to diagnose | Requires | Accuracy |
|---|---|---|---|
| Manual (6 dashboards) | 25 minutes | Expert knowledge of all tools | High (if patient) |
| Unified Grafana dashboard | 5 minutes | Dashboard exists and is maintained | High |
| Automated runbook (alert -> diagnosis) | 2 minutes | Runbook is written and tested | Medium (pre-defined patterns) |
The Fix
The Telemetry Platform Observability Dashboard.
Organize the dashboard into 5 rows, each answering a specific question:
Row 1: Is the system healthy? (SLI/SLO)
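// Key SLI metrics to export
import com.mongodb.client.MongoClient;
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import java.util.concurrent.atomic.AtomicLong;
import org.bson.Document;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class SliMetricsExporter {
    private final MongoClient client;
    private final Counter pingSuccess;
    private final Counter pingFailure;
    // Micrometer references gauge state weakly, so keep a strong reference and update it in place
    private final AtomicLong pingDurationMs;

    public SliMetricsExporter(MeterRegistry registry, MongoClient client) {
        this.client = client;
        this.pingSuccess = registry.counter("mongodb.ping.success");
        this.pingFailure = registry.counter("mongodb.ping.failure");
        this.pingDurationMs = registry.gauge("mongodb.ping.duration_ms", new AtomicLong());
    }

    // SLI 1: HTTP latency (from Spring Boot Actuator)
    //   Already exported as http_server_requests_seconds
    // SLI 2: Error rate
    //   Already exported as http_server_requests_seconds with status tag

    // SLI 3: MongoDB availability, probed every 10 seconds
    @Scheduled(fixedRate = 10000)
    public void checkMongoAvailability() {
        try {
            long start = System.nanoTime();
            client.getDatabase("admin").runCommand(new Document("ping", 1));
            pingDurationMs.set((System.nanoTime() - start) / 1_000_000);
            pingSuccess.increment();
        } catch (Exception e) {
            pingFailure.increment();
        }
    }
}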
Grafana panels for Row 1:
| Panel | Query | Threshold |
|---|---|---|
| HTTP p99 latency | histogram_quantile(0.99, sum by (le) (rate(http_server_requests_seconds_bucket[5m]))) | < 200ms green, < 500ms yellow, > 500ms red |
| Error rate | sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) / sum(rate(http_server_requests_seconds_count[5m])) | < 0.1% green, < 1% yellow, > 1% red |
| MongoDB availability | rate(mongodb_ping_success_total[5m]) / (rate(mongodb_ping_success_total[5m]) + rate(mongodb_ping_failure_total[5m])) | > 99.9% green |
Row 2: Where is latency coming from?
| Panel | Metric | Purpose |
|---|---|---|
| Latency breakdown (stacked) | Pool checkout + query + mapping | Shows which layer dominates |
| MongoDB op latency | rate(mongodb_latency_writes_micros[5m]) / rate(mongodb_latency_writes_ops[5m]) | Server-side latency (average microseconds per write over the window) |
| Connection pool queue time | Pool checkout duration histogram | Driver-side wait time |
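The stacked latency panel depends on the application recording how much of each request was spent in each layer. The following is a minimal sketch of that bookkeeping in plain Java; `Layer` and `LatencyBreakdown` are hypothetical names used for illustration, and a real service would more likely record these durations as timers in its metrics registry.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical sketch: the three layers match the stacked panel (pool checkout, query, mapping).
enum Layer { POOL_CHECKOUT, QUERY, MAPPING }

final class LatencyBreakdown {
    private final Map<Layer, Long> nanos = new EnumMap<>(Layer.class);

    // Accumulate time spent in one layer for the current request.
    void record(Layer layer, long elapsedNanos) {
        nanos.merge(layer, elapsedNanos, Long::sum);
    }

    long totalNanos() {
        return nanos.values().stream().mapToLong(Long::longValue).sum();
    }

    // Share of the total request time spent in the given layer, from 0.0 to 1.0.
    double share(Layer layer) {
        long total = totalNanos();
        return total == 0 ? 0.0 : (double) nanos.getOrDefault(layer, 0L) / total;
    }

    // The layer that accounts for the most time, or null if nothing was recorded.
    Layer dominant() {
        return nanos.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(null);
    }
}
```

A request that spends 40 ms waiting for a connection, 8 ms in the query, and 2 ms in mapping reports `POOL_CHECKOUT` as dominant with an 80% share, which points at the pool rather than at MongoDB.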
Row 3: Is MongoDB under pressure?
| Panel | Metric | Alert threshold |
|---|---|---|
| WiredTiger dirty cache ratio | wt_cache_dirty_bytes / wt_cache_bytes_max | > 20% dirty |
| Concurrency tickets available | wt_tickets_available_read, wt_tickets_available_write | < 20 available |
| Replication lag | mongodb_replication_lag_seconds | > 30s |
| Oplog window | mongodb_oplog_window_seconds | < 2 hours |
| Checkpoint duration | wt_checkpoint_duration_ms | > 30s |
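The Row 3 thresholds can be read as a single pressure check over one snapshot of the metrics. The sketch below is an illustration only: `MongoPressureCheck` and its parameter names are hypothetical, and in practice these conditions would normally live in alerting rules rather than in application code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: evaluates one snapshot of the Row 3 metrics against the panel thresholds.
final class MongoPressureCheck {
    private MongoPressureCheck() {}

    // Returns the names of the thresholds that the snapshot breaches, in table order.
    static List<String> breaches(long dirtyBytes, long cacheMaxBytes,
                                 int readTickets, int writeTickets,
                                 double replicationLagSeconds,
                                 double oplogWindowSeconds,
                                 double checkpointDurationMs) {
        List<String> out = new ArrayList<>();
        if ((double) dirtyBytes / cacheMaxBytes > 0.20) out.add("cache-dirty");   // > 20% dirty
        if (readTickets < 20 || writeTickets < 20) out.add("tickets");            // < 20 available
        if (replicationLagSeconds > 30) out.add("replication-lag");               // > 30s
        if (oplogWindowSeconds < 2 * 3600) out.add("oplog-window");               // < 2 hours
        if (checkpointDurationMs > 30_000) out.add("checkpoint");                 // > 30s
        return out;
    }
}
```

An empty list means MongoDB itself is not under pressure, so the investigation can move on to the infrastructure row.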
Row 4: Is the infrastructure healthy?
| Panel | Metric | Alert threshold |
|---|---|---|
| CPU utilization | rate(container_cpu_usage_seconds_total[5m]) | > 80% of the CPU limit, sustained |
| CFS throttle rate | rate(container_cpu_cfs_throttled_periods_total[5m]) / rate(container_cpu_cfs_periods_total[5m]) | > 5% |
| Memory utilization | container_memory_usage_bytes / container_spec_memory_limit_bytes | > 85% |
| Disk I/O utilization | rate(node_disk_io_time_seconds_total[5m]) | > 70% |
| Disk IOPS | rate(node_disk_reads_completed_total[5m]) + rate(node_disk_writes_completed_total[5m]) | Approaching provisioned limit |
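The `_total` metrics in this row are monotonically increasing counters, so a panel value comes from the difference between two scrapes rather than from the raw number. The sketch below shows that arithmetic under simplifying assumptions: `CounterMath` is a hypothetical helper, and it ignores the counter resets that Prometheus handles inside `rate()`.

```java
// Hypothetical sketch of the arithmetic behind rate()-based panels; not how Prometheus is implemented.
final class CounterMath {
    private CounterMath() {}

    // Per-second increase of a counter between two samples taken intervalSeconds apart.
    static double ratePerSecond(double earlier, double later, double intervalSeconds) {
        return (later - earlier) / intervalSeconds;
    }

    // Fraction of CFS periods that were throttled during the interval:
    // delta(throttled periods) / delta(total periods).
    static double throttleRatio(long throttledEarlier, long throttledLater,
                                long periodsEarlier, long periodsLater) {
        long periods = periodsLater - periodsEarlier;
        return periods == 0 ? 0.0 : (double) (throttledLater - throttledEarlier) / periods;
    }
}
```

With 150 newly throttled periods out of 3,000 elapsed periods, `throttleRatio(100, 250, 10_000, 13_000)` is 0.05, which sits exactly at the 5% threshold. Dividing the raw counters instead (250 / 13,000, about 1.9%) yields a lifetime average that hides a recent burst of throttling.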
Row 5: What changed? (Correlation annotations)
Add Grafana annotations for:
- Deployments (from CI/CD pipeline)
- MongoDB configuration changes (from audit log)
- Index builds (from currentOp monitoring)
- Balancer migrations (from config.changelog)
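// FAST: Push Grafana annotations for MongoDB events
// Imports: java.net.URI, java.net.http.HttpClient, java.net.http.HttpRequest, java.net.http.HttpResponse
private final HttpClient httpClient = HttpClient.newHttpClient(); // shared; do not create one per call

public void annotateGrafana(String event, String description) {
    // Escape the values so quotes or backslashes cannot break the JSON body
    String body = String.format(
        "{\"text\":\"%s: %s\",\"tags\":[\"mongodb\",\"%s\"]}",
        jsonEscape(event), jsonEscape(description),
        jsonEscape(event.toLowerCase().replace(" ", "-")));
    // POST to Grafana annotations API
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://grafana:3000/api/annotations"))
        .header("Authorization", "Bearer " + grafanaApiKey)
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();
    // Fire and forget so the caller is never blocked; a failed annotation is ignored
    httpClient.sendAsync(request, HttpResponse.BodyHandlers.discarding());
}

// Minimal JSON string escaping: backslash first, then quotes and common control characters
private static String jsonEscape(String s) {
    return s.replace("\\", "\\\\").replace("\"", "\\\"")
            .replace("\n", "\\n").replace("\r", "\\r").replace("\t", "\\t");
}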
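Annotations answer "what changed?" by placing events on the same time axis as the metrics. The lookup an engineer would otherwise do by hand (which events happened close to the spike?) can be sketched as follows. `ChangeEvent` and `SpikeCorrelator` are hypothetical names for illustration; Grafana performs this correlation visually, not through code like this.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: one recorded change, such as a deployment or an index build.
final class ChangeEvent {
    final String kind;   // e.g. "deployment", "index-build", "balancer-migration"
    final Instant at;

    ChangeEvent(String kind, Instant at) {
        this.kind = kind;
        this.at = at;
    }
}

final class SpikeCorrelator {
    private SpikeCorrelator() {}

    // Returns the events within +/- window of the spike, in their original order.
    static List<ChangeEvent> near(Instant spike, Duration window, List<ChangeEvent> events) {
        List<ChangeEvent> hits = new ArrayList<>();
        for (ChangeEvent e : events) {
            if (Duration.between(e.at, spike).abs().compareTo(window) <= 0) {
                hits.add(e);
            }
        }
        return hits;
    }
}
```

For a spike at 14:32 and a two-minute window, a deployment at 14:30:30 is returned while an index build from 13:00 is not.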
The Proof
Alert hierarchy for the telemetry platform:
| Priority | Alert | Condition | Action |
|---|---|---|---|
| P1 (page) | MongoDB unavailable | Ping failures > 3 consecutive | Check pod status, check elections |
| P1 (page) | Error rate > 5% | 5xx responses > 5% for 2 min | Check MongoDB connectivity |
| P2 (urgent) | HTTP p99 > 500ms | Sustained for 5 min | Check dashboard rows 2-4 |
| P2 (urgent) | Connection pool exhausted | Available connections = 0 for 1 min | Increase pool size or reduce concurrency |
| P3 (warning) | Replication lag > 30s | Sustained for 5 min | Check secondary I/O, index builds |
| P3 (warning) | WiredTiger cache > 20% dirty | Sustained for 10 min | Check checkpoint, eviction threads |
| P4 (ticket) | Disk utilization > 70% | Sustained for 1 hour | Plan disk expansion or compaction |
| P4 (ticket) | Oplog window < 4 hours | Any occurrence | Resize oplog |
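Most conditions in the table fire only when a breach is sustained, which keeps brief spikes from paging anyone. The sketch below illustrates that behaviour in plain Java. `SustainedCondition` is a hypothetical class, similar in spirit to the `for:` clause of a Prometheus alerting rule, and is not a replacement for the alerting system.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of a "sustained for N minutes" alert condition.
final class SustainedCondition {
    private final Duration holdFor;
    private Instant breachedSince; // null while the condition is not breached

    SustainedCondition(Duration holdFor) {
        this.holdFor = holdFor;
    }

    // Feed one evaluation; returns true once the breach has lasted at least holdFor.
    boolean evaluate(boolean breached, Instant now) {
        if (!breached) {
            breachedSince = null; // any healthy sample resets the timer
            return false;
        }
        if (breachedSince == null) {
            breachedSince = now;
        }
        return Duration.between(breachedSince, now).compareTo(holdFor) >= 0;
    }
}
```

For the "HTTP p99 > 500ms, sustained for 5 min" alert, a condition built with `Duration.ofMinutes(5)` stays silent for the first five minutes of a breach and resets as soon as one evaluation comes back healthy.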
After implementing the unified dashboard and alert hierarchy:
| Metric | Before | After |
|---|---|---|
| Mean time to diagnose (MTTD) | 25 min | 5 min |
| Mean time to resolve (MTTR) | 45 min | 15 min |
| False positive alerts/week | 12 | 2 |
| Incidents escalated to MongoDB expert | 80% | 25% |
The Trade-off
Building and maintaining the dashboard is an ongoing investment. Every new metric added to the application must be reflected in the dashboard. Dashboard rot (stale panels, broken queries, outdated thresholds) is common. Assign ownership: one engineer maintains the dashboard and updates it during each performance review.
The alert hierarchy requires tuning. Initial thresholds are educated guesses. After 2-4 weeks of production data, adjust thresholds to eliminate false positives without missing real incidents. Too many alerts cause alert fatigue (the team ignores all alerts). Too few alerts cause missed incidents.
Trace sampling (10% in production) means that most requests, including most slow ones, have no trace. When investigating a specific user’s slow request, the trace may simply not exist. Implement on-demand trace capture: when a user reports a slow request, temporarily increase sampling for their session or enable debug-level tracing for their sensorId.
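One way to picture on-demand capture is a sampling decision that keeps the base ratio but always samples requests for sensors under investigation. The sketch below covers only that decision. `OnDemandSampler` is a hypothetical class, and a real implementation would plug equivalent logic into the tracing SDK's sampler hook instead of calling it directly.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the sampling decision only; not an OpenTelemetry API.
final class OnDemandSampler {
    private final double baseRatio; // e.g. 0.10 for 10% sampling
    private final Set<String> forcedSensorIds = ConcurrentHashMap.newKeySet();

    OnDemandSampler(double baseRatio) {
        this.baseRatio = baseRatio;
    }

    // Turn full tracing on or off for one sensor while an investigation is in progress.
    void enableFor(String sensorId)  { forcedSensorIds.add(sensorId); }
    void disableFor(String sensorId) { forcedSensorIds.remove(sensorId); }

    // traceIdLow64 is assumed to be uniformly random (for example the low 64 bits of the
    // trace ID), so the decision is deterministic for a given trace.
    boolean shouldSample(String sensorId, long traceIdLow64) {
        if (sensorId != null && forcedSensorIds.contains(sensorId)) {
            return true;
        }
        // Map the ID onto [0, 1) and keep the trace if it falls below the base ratio.
        double position = (traceIdLow64 >>> 11) / (double) (1L << 53);
        return position < baseRatio;
    }
}
```

Remember to call `disableFor` once the investigation ends; otherwise the forced sensors keep producing traces at 100% and inflate storage costs.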
The observability stack itself consumes resources. Prometheus, Grafana, and the OTel collector need CPU, memory, and storage. Budget 5-10% of the cluster’s resources for observability. This is not overhead; it is the cost of operating a production system reliably.