Building the Observability Dashboard

The Symptom

The team has Prometheus metrics for MongoDB (CH13, CH14, CH19), JVM metrics (CH3), connection pool metrics (CH4), and OpenTelemetry traces (CH24). But diagnosing an incident requires opening 6 dashboards and mentally correlating timestamps. The mean time to diagnose a performance issue is 25 minutes.

The Cause

Each monitoring tool answers a narrow question:

Prometheus: “What are the current values of these counters?”
Traces: “How long did this specific request take?”
Logs: “What error messages were emitted?”
MongoDB profiler: “Which queries were slow on the server?”

Without a unified dashboard, the engineer must manually correlate: “The HTTP p99 spiked at 14:32. Was there a checkpoint at 14:32? Was replication lagging? Was the connection pool exhausted?”

The Benchmark

Diagnosis approach	Mean time to diagnose	Requires	Accuracy
Manual (6 dashboards)	25 minutes	Expert knowledge of all tools	High (if patient)
Unified Grafana dashboard	5 minutes	Dashboard exists and is maintained	High
Automated runbook (alert -> diagnosis)	2 minutes	Runbook is written and tested	Medium (pre-defined patterns)

The Fix

The Telemetry Platform Observability Dashboard.

Organize the dashboard into 5 rows, each answering a specific question:

Row 1: Is the system healthy? (SLI/SLO)

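// Key SLI metrics to export
import com.mongodb.client.MongoClient;
import io.micrometer.core.instrument.MeterRegistry;
import java.util.concurrent.atomic.AtomicLong;
import org.bson.Document;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class SliMetricsExporter {

    private final MongoClient client;
    private final MeterRegistry registry;
    // Micrometer holds gauge state weakly, so keep a strong reference here
    private final AtomicLong lastPingMs;

    public SliMetricsExporter(MongoClient client, MeterRegistry registry) {
        this.client = client;
        this.registry = registry;
        this.lastPingMs = registry.gauge("mongodb.ping.duration_ms", new AtomicLong(0));
    }

    // SLI 1: HTTP latency (from Spring Boot Actuator)
    // Already exported as http_server_requests_seconds

    // SLI 2: Error rate
    // Already exported as http_server_requests_seconds with status tag

    // SLI 3: MongoDB availability (scheduling must be enabled with @EnableScheduling)
    @Scheduled(fixedRate = 10000)
    public void checkMongoAvailability() {
        try {
            long start = System.nanoTime();
            client.getDatabase("admin").runCommand(new Document("ping", 1));
            lastPingMs.set((System.nanoTime() - start) / 1_000_000);
            // Exported to Prometheus as mongodb_ping_success_total
            registry.counter("mongodb.ping.success").increment();
        } catch (Exception e) {
            // Exported to Prometheus as mongodb_ping_failure_total
            registry.counter("mongodb.ping.failure").increment();
        }
    }
}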

Grafana panels for Row 1:

Panel	Query	Threshold
HTTP p99 latency	`histogram_quantile(0.99, sum by (le) (rate(http_server_requests_seconds_bucket[5m])))`	< 200ms green, < 500ms yellow, >= 500ms red
Error rate	`sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) / sum(rate(http_server_requests_seconds_count[5m]))`	< 0.1% green, < 1% yellow, >= 1% red
MongoDB availability	`sum(rate(mongodb_ping_success_total[5m])) / (sum(rate(mongodb_ping_success_total[5m])) + sum(rate(mongodb_ping_failure_total[5m])))`	> 99.9% green

Row 2: Where is latency coming from?

Panel	Metric	Purpose
Latency breakdown (stacked)	Pool checkout + query + mapping	Shows which layer dominates
MongoDB op latency	`rate(mongodb_latency_writes_micros[5m]) / rate(mongodb_latency_writes_ops[5m])`	Server-side latency (average microseconds per write over the last 5 minutes)
Connection pool queue time	Pool checkout duration histogram	Driver-side wait time
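
The stacked latency panel only helps if each layer is timed separately per request. The sketch below shows one way to hold the three timings and report which layer dominates. The class `LatencyBreakdown` and the layer names are hypothetical, and it assumes the application already measures pool checkout, query, and mapping time in microseconds.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: holds one request's latency split into the three
// layers of the stacked panel and reports which layer dominates.
final class LatencyBreakdown {
    private final Map<String, Long> layerMicros = new LinkedHashMap<>();

    LatencyBreakdown(long poolCheckoutMicros, long queryMicros, long mappingMicros) {
        layerMicros.put("pool_checkout", poolCheckoutMicros);
        layerMicros.put("query", queryMicros);
        layerMicros.put("mapping", mappingMicros);
    }

    // Sum of all layers in microseconds.
    long totalMicros() {
        long total = 0;
        for (long value : layerMicros.values()) {
            total += value;
        }
        return total;
    }

    // Fraction of the total spent in one layer; 0.0 for an unknown layer
    // or when nothing was measured.
    double share(String layer) {
        long total = totalMicros();
        Long value = layerMicros.get(layer);
        if (total == 0 || value == null) {
            return 0.0;
        }
        return (double) value / total;
    }

    // Name of the layer with the largest time; the first layer wins ties.
    String dominantLayer() {
        String best = null;
        long bestValue = -1;
        for (Map.Entry<String, Long> entry : layerMicros.entrySet()) {
            if (entry.getValue() > bestValue) {
                best = entry.getKey();
                bestValue = entry.getValue();
            }
        }
        return best;
    }
}
```

For a request that spent 12 ms waiting for a connection, 3 ms in the query, and 1 ms in mapping, `dominantLayer()` returns `pool_checkout` with a share of 0.75, which points at the pool rather than at MongoDB.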

Row 3: Is MongoDB under pressure?

Panel	Metric	Alert threshold
WiredTiger cache dirty ratio	`wt_cache_dirty_bytes / wt_cache_bytes_max`	> 20% dirty
Concurrency tickets available	`wt_tickets_available_read`, `wt_tickets_available_write`	< 20 available
Replication lag	`mongodb_replication_lag_seconds`	> 30s
Oplog window	`mongodb_oplog_window_seconds`	< 2 hours
Checkpoint duration	`wt_checkpoint_duration_ms`	> 30s

Row 4: Is the infrastructure healthy?

Panel	Metric	Alert threshold
CPU utilization	`rate(container_cpu_usage_seconds_total[5m])` (cores used, compared against the CPU limit)	> 80% sustained
CFS throttle rate	`rate(container_cpu_cfs_throttled_periods_total[5m]) / rate(container_cpu_cfs_periods_total[5m])`	> 5%
Memory utilization	`container_memory_usage_bytes / container_spec_memory_limit_bytes`	> 85%
Disk I/O utilization	`rate(node_disk_io_time_seconds_total[5m])`	> 70%
Disk IOPS	`rate(node_disk_reads_completed_total[5m]) + rate(node_disk_writes_completed_total[5m])`	Approaching provisioned limit
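
Several Row 4 metrics are cumulative counters, so a panel has to compare the increase over a window, not the raw values. The sketch below works through that arithmetic for the CFS throttle rate. `CounterRatio` is a hypothetical helper, and its reset handling is a simplification of what Prometheus does inside `rate()`.

```java
// Hypothetical helper showing the arithmetic behind a ratio of two
// counter increases, as used for the CFS throttle rate panel.
final class CounterRatio {

    // Increase of a cumulative counter between two samples. A drop is
    // treated as a restart from zero, so the current value is the increase.
    static double increase(double previous, double current) {
        return current >= previous ? current - previous : current;
    }

    // Ratio of the numerator's increase to the denominator's increase
    // over the same window; 0.0 when the denominator did not move.
    static double ratio(double numPrev, double numCur, double denPrev, double denCur) {
        double denominator = increase(denPrev, denCur);
        if (denominator == 0) {
            return 0.0;
        }
        return increase(numPrev, numCur) / denominator;
    }
}
```

If throttled periods rise from 100 to 130 while total periods rise from 1000 to 1500, the throttle rate for that window is 30 / 500 = 6%, which is above the 5% threshold. Dividing the raw counters (130 / 1500) would report about 8.7% and would mostly reflect history since the container started.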

Row 5: What changed? (Correlation annotations)

Add Grafana annotations for:

Deployments (from CI/CD pipeline)
MongoDB configuration changes (from audit log)
Index builds (from currentOp monitoring)
Balancer migrations (from config.changelog)

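// FAST: Push Grafana annotations for MongoDB events
// grafanaApiKey and log (an SLF4J logger) are assumed to be fields of the enclosing class
// Reuse one HttpClient; creating a client per call wastes threads and connections
private static final HttpClient HTTP_CLIENT = HttpClient.newHttpClient();

public void annotateGrafana(String event, String description) {
    // Escape backslashes and quotes so the JSON body stays valid
    String text = (event + ": " + description).replace("\\", "\\\\").replace("\"", "\\\"");
    String tag = event.toLowerCase().replace(" ", "-").replace("\\", "").replace("\"", "");
    String body = String.format(
        "{\"text\":\"%s\",\"tags\":[\"mongodb\",\"%s\"]}", text, tag);

    // POST to Grafana annotations API
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://grafana:3000/api/annotations"))
        .header("Authorization", "Bearer " + grafanaApiKey)
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();

    // Fire and forget, but log failures instead of silently dropping them
    HTTP_CLIENT.sendAsync(request, HttpResponse.BodyHandlers.discarding())
        .whenComplete((response, error) -> {
            if (error != null || response.statusCode() >= 300) {
                log.warn("Grafana annotation failed for event {}", event, error);
            }
        });
}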
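
Index builds and balancer migrations have a duration, so a region annotation with a start and an end is often more useful than a single point in time. The sketch below builds such a payload using the `time` and `timeEnd` fields (epoch milliseconds) of the Grafana annotations API. `AnnotationPayload` is a hypothetical helper, and the hand-rolled escaping is only there to keep the example free of dependencies; a JSON library would normally do this.

```java
// Hypothetical helper: builds the JSON body for a Grafana region
// annotation (an annotation with both a start and an end time).
final class AnnotationPayload {

    // Escapes a string for use inside a JSON string literal.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 8);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters use the 4-digit hex form
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Builds a region annotation body; times are epoch milliseconds.
    static String region(String text, long startMillis, long endMillis, String... tags) {
        StringBuilder sb = new StringBuilder();
        sb.append("{\"time\":").append(startMillis)
          .append(",\"timeEnd\":").append(endMillis)
          .append(",\"tags\":[");
        for (int i = 0; i < tags.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append('"').append(escape(tags[i])).append('"');
        }
        sb.append("],\"text\":\"").append(escape(text)).append("\"}");
        return sb.toString();
    }
}
```

The resulting string can be sent with the same POST request as a point annotation. On the dashboard the region appears as a shaded band, which makes it easy to see whether a latency spike falls inside an index build or only next to it.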

The Proof

Alert hierarchy for the telemetry platform:

Priority	Alert	Condition	Action
P1 (page)	MongoDB unavailable	Ping failures > 3 consecutive	Check pod status, check elections
P1 (page)	Error rate > 5%	5xx responses > 5% for 2 min	Check MongoDB connectivity
P2 (urgent)	HTTP p99 > 500ms	Sustained for 5 min	Check dashboard rows 2-4
P2 (urgent)	Connection pool exhausted	Available connections = 0 for 1 min	Increase pool size or reduce concurrency
P3 (warning)	Replication lag > 30s	Sustained for 5 min	Check secondary I/O, index builds
P3 (warning)	WiredTiger cache > 20% dirty	Sustained for 10 min	Check checkpoint, eviction threads
P4 (ticket)	Disk utilization > 70%	Sustained for 1 hour	Plan disk expansion or compaction
P4 (ticket)	Oplog window < 4 hours	Any occurrence	Resize oplog
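
Most conditions in the table are of the form "sustained for N minutes": the alert fires only when the breach persists, and any healthy evaluation resets the clock. The sketch below shows that logic in isolation. `SustainedCondition` is a hypothetical class; in practice the alerting system (for example the `for:` clause of a Prometheus alerting rule) would normally do this work.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of a "sustained for N minutes" alert condition.
final class SustainedCondition {
    private final Duration holdFor;
    private Instant breachStart; // null while the condition is healthy

    SustainedCondition(Duration holdFor) {
        this.holdFor = holdFor;
    }

    // Feeds one evaluation result. Returns true once the breach has
    // lasted at least holdFor without a healthy evaluation in between.
    boolean observe(boolean breached, Instant now) {
        if (!breached) {
            breachStart = null; // any healthy sample resets the clock
            return false;
        }
        if (breachStart == null) {
            breachStart = now;
        }
        return Duration.between(breachStart, now).compareTo(holdFor) >= 0;
    }
}
```

With a five-minute hold, a p99 breach first seen at 14:32 fires at 14:37 if every evaluation in between is also in breach. A single healthy evaluation at 14:35 pushes the earliest firing time to five minutes after the next breach, which is what filters out short spikes.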

After implementing the unified dashboard and alert hierarchy:

Metric	Before	After
Mean time to diagnose (MTTD)	25 min	5 min
Mean time to resolve (MTTR)	45 min	15 min
False positive alerts/week	12	2
Incidents escalated to MongoDB expert	80%	25%

The Trade-off

Building and maintaining the dashboard is an ongoing investment. Every new metric added to the application must be reflected in the dashboard. Dashboard rot (stale panels, broken queries, outdated thresholds) is common. Assign ownership: one engineer maintains the dashboard and updates it during each performance review.

The alert hierarchy requires tuning. Initial thresholds are educated guesses. After 2-4 weeks of production data, adjust thresholds to eliminate false positives without missing real incidents. Too many alerts cause alert fatigue (the team ignores all alerts). Too few alerts cause missed incidents.
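
One way to tune thresholds is to replay the collected history against candidate values and count how many alert episodes each would have produced. The sketch below assumes evenly spaced samples (for example one p99 value per evaluation interval); `ThresholdReplay` is a hypothetical helper, not part of any monitoring tool.

```java
// Hypothetical helper: replays historical samples against a candidate
// threshold and counts how many separate alert episodes would have fired.
final class ThresholdReplay {

    // An episode is a run of at least minConsecutive samples strictly
    // above the threshold; each run is counted once.
    static int countEpisodes(double[] samples, double threshold, int minConsecutive) {
        int episodes = 0;
        int run = 0;
        for (double sample : samples) {
            if (sample > threshold) {
                run++;
                if (run == minConsecutive) {
                    episodes++; // count the run when it first qualifies
                }
            } else {
                run = 0;
            }
        }
        return episodes;
    }
}
```

Comparing the counts with the incidents that actually happened in the same period shows whether a candidate is too noisy or too quiet. For the samples `{100, 600, 120, 550, 560, 570, 130, 510, 520}` in milliseconds, a 500 ms threshold fires three times when one sample is enough and twice when two consecutive samples are required.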

Trace sampling (10% in production) means that not every slow request has a trace, so when investigating a specific user’s slow request, the trace may not exist. Implement on-demand trace capture: when a user reports a slow request, temporarily increase sampling for their session or enable debug-level tracing for their sensorId.
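
The on-demand capture described above can be sketched as a sampling decision with temporary per-sensor overrides. `OnDemandSampler` is a hypothetical class that shows only the decision logic; wiring it into the tracing SDK's sampler hook is not shown, and the random value is passed in by the caller so the behaviour is deterministic to test.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: samples a base fraction of requests, but always
// samples requests for sensors with an unexpired override.
final class OnDemandSampler {
    private final double baseRate;
    private final Map<String, Long> forcedUntilMillis = new ConcurrentHashMap<>();

    OnDemandSampler(double baseRate) {
        this.baseRate = baseRate;
    }

    // Forces sampling for one sensor until nowMillis + ttlMillis.
    void forceSampling(String sensorId, long nowMillis, long ttlMillis) {
        forcedUntilMillis.put(sensorId, nowMillis + ttlMillis);
    }

    // randomValue is expected to be uniform in [0, 1), supplied by the caller.
    boolean shouldSample(String sensorId, long nowMillis, double randomValue) {
        Long until = forcedUntilMillis.get(sensorId);
        if (until != null) {
            if (nowMillis < until) {
                return true;
            }
            // Override expired: drop it and fall back to the base rate
            forcedUntilMillis.remove(sensorId, until);
        }
        return randomValue < baseRate;
    }
}
```

An operator would call `forceSampling("sensor-42", now, 15 * 60 * 1000)` when a report comes in and ask the user to retry. The expiry matters: an override that is never removed quietly turns into 100% sampling for that sensor.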

The observability stack itself consumes resources. Prometheus, Grafana, and the OTel collector need CPU, memory, and storage. Budget 5-10% of the cluster’s resources for observability. This is not overhead; it is the cost of operating a production system reliably.