unbound mongodb at scale

Building the Observability Dashboard

Chapter 72 of 72

The Symptom

The team has Prometheus metrics for MongoDB (CH13, CH14, CH19), JVM metrics (CH3), connection pool metrics (CH4), and OpenTelemetry traces (CH24). But diagnosing an incident requires opening 6 dashboards and mentally correlating timestamps. The mean time to diagnose a performance issue is 25 minutes.

The Cause

Each monitoring tool answers a narrow question:

  • Prometheus: “What are the current values of these counters?”
  • Traces: “How long did this specific request take?”
  • Logs: “What error messages were emitted?”
  • MongoDB profiler: “Which queries were slow on the server?”

Without a unified dashboard, the engineer must manually correlate: “The HTTP p99 spiked at 14:32. Was there a checkpoint at 14:32? Was replication lagging? Was the connection pool exhausted?”

The Benchmark

| Diagnosis approach | Mean time to diagnose | Requires | Accuracy |
|---|---|---|---|
| Manual (6 dashboards) | 25 minutes | Expert knowledge of all tools | High (if patient) |
| Unified Grafana dashboard | 5 minutes | Dashboard exists and is maintained | High |
| Automated runbook (alert -> diagnosis) | 2 minutes | Runbook is written and tested | Medium (pre-defined patterns) |
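
The "automated runbook" row can be pictured as a lookup from an alert to its first diagnostic step. The following is a minimal sketch, not a definitive implementation: the class name `Runbook` and the method `firstStep` are hypothetical, and the entries are copied from the alert hierarchy later in this chapter. An alert with no entry returns empty, which is why accuracy is limited to pre-defined patterns.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical alert-to-first-step lookup; entries mirror this chapter's alert table. */
final class Runbook {
    private static final Map<String, String> FIRST_STEP = new LinkedHashMap<>();
    static {
        FIRST_STEP.put("MongoDB unavailable", "Check pod status, check elections");
        FIRST_STEP.put("Error rate > 5%", "Check MongoDB connectivity");
        FIRST_STEP.put("HTTP p99 > 500ms", "Check dashboard rows 2-4");
        FIRST_STEP.put("Connection pool exhausted", "Increase pool size or reduce concurrency");
        FIRST_STEP.put("Replication lag > 30s", "Check secondary I/O, index builds");
    }

    /** Returns the first diagnostic step, or empty when no pattern exists for the alert. */
    static Optional<String> firstStep(String alertName) {
        return Optional.ofNullable(FIRST_STEP.get(alertName));
    }
}
```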

The Fix

The Telemetry Platform Observability Dashboard.

Organize the dashboard into 5 rows, each answering a specific question:

Row 1: Is the system healthy? (SLI/SLO)

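// Key SLI metrics to export
import com.mongodb.client.MongoClient;
import io.micrometer.core.instrument.MeterRegistry;
import java.util.concurrent.atomic.AtomicLong;
import org.bson.Document;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class SliMetricsExporter {

    private final MeterRegistry registry;
    // Assumes the synchronous MongoDB driver, where runCommand blocks until the reply arrives
    private final MongoClient client;
    // Micrometer gauges only weakly reference the observed object, so hold it
    // in a field and update it in place instead of registering a new value per run.
    private final AtomicLong pingDurationMs;

    public SliMetricsExporter(MeterRegistry registry, MongoClient client) {
        this.registry = registry;
        this.client = client;
        this.pingDurationMs = registry.gauge("mongodb.ping.duration_ms", new AtomicLong());
    }

    // SLI 1: HTTP latency (from Spring Boot Actuator)
    // Already exported as http_server_requests_seconds

    // SLI 2: Error rate
    // Already exported as http_server_requests_seconds with status tag

    // SLI 3: MongoDB availability
    // Runs every 10 seconds; scheduling must be enabled with @EnableScheduling.
    @Scheduled(fixedRate = 10000)
    public void checkMongoAvailability() {
        try {
            long start = System.nanoTime();
            client.getDatabase("admin").runCommand(new Document("ping", 1));
            pingDurationMs.set((System.nanoTime() - start) / 1_000_000);
            registry.counter("mongodb.ping.success").increment();
        } catch (Exception e) {
            registry.counter("mongodb.ping.failure").increment();
        }
    }
}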

Grafana panels for Row 1:

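| Panel | Query | Threshold |
|---|---|---|
| HTTP p99 latency | histogram_quantile(0.99, sum by (le) (rate(http_server_requests_seconds_bucket[5m]))) | < 200ms green, < 500ms yellow, > 500ms red |
| Error rate | sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) / sum(rate(http_server_requests_seconds_count[5m])) | < 0.1% green, < 1% yellow, > 1% red |
| MongoDB availability | rate(mongodb_ping_success_total[5m]) / (rate(mongodb_ping_success_total[5m]) + rate(mongodb_ping_failure_total[5m])) | > 99.9% green |

The queries aggregate with sum so that the per-URI and per-status series collapse into one service-wide value; without it, the error-rate division only matches series with identical labels. The ping counters carry the _total suffix that Micrometer's Prometheus registry appends to counter names.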
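
The Row 1 thresholds can also be expressed in code, for example to unit-test the color boundaries before configuring them in Grafana. This is a minimal sketch under stated assumptions: `SliStatus` and `Health` are hypothetical names, and treating a value exactly on a boundary (such as 500 ms) as the worse color is an assumption, since the table leaves that case open.

```java
/** Hypothetical helper mirroring the Row 1 thresholds; not part of any library. */
final class SliStatus {
    enum Health { GREEN, YELLOW, RED }

    /** Classifies HTTP p99 latency given in milliseconds. */
    static Health latency(double p99Millis) {
        if (p99Millis < 200) return Health.GREEN;
        if (p99Millis < 500) return Health.YELLOW;
        return Health.RED;
    }

    /** Classifies the error rate given as a fraction (0.01 means 1%). */
    static Health errorRate(double fraction) {
        if (fraction < 0.001) return Health.GREEN;
        if (fraction < 0.01) return Health.YELLOW;
        return Health.RED;
    }

    /** Availability from ping counter increases over the same window. */
    static double availability(long successes, long failures) {
        long total = successes + failures;
        return total == 0 ? Double.NaN : (double) successes / total;
    }
}
```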

Row 2: Where is latency coming from?

| Panel | Metric | Purpose |
|---|---|---|
| Latency breakdown (stacked) | Pool checkout + query + mapping | Shows which layer dominates |
| MongoDB op latency | mongodb_latency_writes_micros / mongodb_latency_writes_ops | Server-side latency |
| Connection pool queue time | Pool checkout duration histogram | Driver-side wait time |
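
To make the stacked latency breakdown concrete, the sketch below accumulates time per layer and reports which one dominates. It is illustrative only: `LatencyBreakdown` and its `Phase` values are hypothetical names, and a real application would more likely record these durations as metrics timers than in a map.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical accumulator for the three layers shown in the stacked panel. */
final class LatencyBreakdown {
    enum Phase { POOL_CHECKOUT, QUERY, MAPPING }

    private final EnumMap<Phase, Long> nanos = new EnumMap<>(Phase.class);

    /** Adds one measured duration to the running total of a phase. */
    void record(Phase phase, long durationNanos) {
        nanos.merge(phase, durationNanos, Long::sum);
    }

    long total() {
        return nanos.values().stream().mapToLong(Long::longValue).sum();
    }

    /** Fraction of total time spent in the given phase (0.0 when nothing was recorded). */
    double share(Phase phase) {
        long total = total();
        return total == 0 ? 0.0 : (double) nanos.getOrDefault(phase, 0L) / total;
    }

    /** The phase with the largest accumulated time, or null when nothing was recorded. */
    Phase dominant() {
        return nanos.entrySet().stream()
            .max(Map.Entry.comparingByValue())
            .map(Map.Entry::getKey)
            .orElse(null);
    }
}
```

A request that spends 2 ms in pool checkout, 7 ms in the query, and 1 ms in mapping would report QUERY as dominant with a 70% share, which points the investigation at Row 3 rather than at the connection pool.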

Row 3: Is MongoDB under pressure?

| Panel | Metric | Alert threshold |
|---|---|---|
| WiredTiger cache utilization | wt_cache_dirty_bytes / wt_cache_bytes_max | > 20% dirty |
| Concurrency tickets available | wt_tickets_available_read, wt_tickets_available_write | < 20 available |
| Replication lag | mongodb_replication_lag_seconds | > 30s |
| Oplog window | mongodb_oplog_window_seconds | < 2 hours |
| Checkpoint duration | wt_checkpoint_duration_ms | > 30s |
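
The Row 3 thresholds can be checked together against one snapshot of the metrics. The sketch below is a hedged illustration: `MongoPressureCheck`, its parameter names, and the short breach labels are all hypothetical, and the inputs are assumed to be already scraped values in the units the table uses.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical evaluator for the Row 3 alert thresholds. */
final class MongoPressureCheck {
    static List<String> breaches(long dirtyBytes, long cacheMaxBytes,
                                 int readTickets, int writeTickets,
                                 double replicationLagSeconds,
                                 double oplogWindowSeconds,
                                 double checkpointMillis) {
        List<String> out = new ArrayList<>();
        // Dirty ratio above 20% of the configured cache size
        if (cacheMaxBytes > 0 && (double) dirtyBytes / cacheMaxBytes > 0.20) out.add("cache-dirty");
        // Fewer than 20 read or write tickets available
        if (readTickets < 20 || writeTickets < 20) out.add("tickets");
        if (replicationLagSeconds > 30) out.add("replication-lag");
        // Oplog window shorter than 2 hours
        if (oplogWindowSeconds < 2 * 3600) out.add("oplog-window");
        // Checkpoint longer than 30 seconds
        if (checkpointMillis > 30_000) out.add("checkpoint");
        return out;
    }
}
```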

Row 4: Is the infrastructure healthy?

| Panel | Metric | Alert threshold |
|---|---|---|
| CPU utilization | container_cpu_usage_seconds_total | > 80% sustained |
| CFS throttle rate | container_cpu_cfs_throttled_periods_total / container_cpu_cfs_periods_total | > 5% |
| Memory utilization | container_memory_usage_bytes / container_spec_memory_limit_bytes | > 85% |
| Disk I/O utilization | node_disk_io_time_seconds_total | > 70% |
| Disk IOPS | rate(node_disk_reads_completed_total[5m]) + rate(node_disk_writes_completed_total[5m]) | Approaching provisioned limit |
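
Several Row 4 metrics are cumulative counters, so a ratio such as the CFS throttle rate is only meaningful over the increase between two scrapes, not over the raw totals. The sketch below shows that arithmetic under stated assumptions: `CounterRatio` is a hypothetical helper, and treating a counter reset as zero increase is a simplification. In PromQL the same idea is expressed by dividing two `rate()` expressions.

```java
/** Hypothetical helper: ratio of the increases of two monotonically increasing counters. */
final class CounterRatio {
    static double ratioOfIncrease(long numPrev, long numNow, long denPrev, long denNow) {
        long den = denNow - denPrev;
        if (den <= 0) return 0.0;                  // no periods elapsed, or the counter reset
        long num = Math.max(0, numNow - numPrev);  // treat a reset as zero increase
        return (double) num / den;
    }
}
```

For example, if throttled periods grow from 100 to 130 while total periods grow from 1000 to 1500, the throttle rate over that interval is 30 / 500 = 6%, which is above the 5% threshold.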

Row 5: What changed? (Correlation annotations)

Add Grafana annotations for:

  • Deployments (from CI/CD pipeline)
  • MongoDB configuration changes (from audit log)
  • Index builds (from currentOp monitoring)
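  • Balancer migrations (from config.changelog)

// FAST: Push Grafana annotations for MongoDB events
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Reuse one client; building a new HttpClient per call wastes threads and connections.
// grafanaApiKey is assumed to be an injected field holding a Grafana API token.
private final HttpClient httpClient = HttpClient.newHttpClient();

public void annotateGrafana(String event, String description) {
    // Escape backslashes and quotes so the values cannot break the JSON body
    String text = (event + ": " + description).replace("\\", "\\\\").replace("\"", "\\\"");
    String tag = event.toLowerCase().replace(" ", "-").replace("\\", "\\\\").replace("\"", "\\\"");
    String body = String.format(
        "{\"text\":\"%s\",\"tags\":[\"mongodb\",\"%s\"]}", text, tag);

    // POST to Grafana annotations API
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://grafana:3000/api/annotations"))
        .header("Authorization", "Bearer " + grafanaApiKey)
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();

    // Fire and forget, but surface failures instead of dropping them silently
    httpClient.sendAsync(request, HttpResponse.BodyHandlers.discarding())
        .exceptionally(ex -> {
            System.err.println("Grafana annotation failed: " + ex);
            return null;
        });
}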
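
Index builds and balancer migrations have a duration, and Grafana's annotation API accepts `time` and `timeEnd` fields (epoch milliseconds) so that such events appear as a shaded region rather than a single line. The sketch below builds that request body; `AnnotationBody` and its methods are hypothetical names, and the hand-rolled escaping covers only quotes and backslashes, so a JSON library would be the safer choice for arbitrary text.

```java
/** Hypothetical builder for a Grafana region-annotation request body. */
final class AnnotationBody {
    /** Escapes backslashes and double quotes only; control characters are not handled. */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** Builds the JSON body for an annotation spanning startEpochMs to endEpochMs. */
    static String region(String text, String tag, long startEpochMs, long endEpochMs) {
        return String.format(
            "{\"time\":%d,\"timeEnd\":%d,\"tags\":[\"mongodb\",\"%s\"],\"text\":\"%s\"}",
            startEpochMs, endEpochMs, escape(tag), escape(text));
    }
}
```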

The Proof

Alert hierarchy for the telemetry platform:

| Priority | Alert | Condition | Action |
|---|---|---|---|
| P1 (page) | MongoDB unavailable | Ping failures > 3 consecutive | Check pod status, check elections |
| P1 (page) | Error rate > 5% | 5xx responses > 5% for 2 min | Check MongoDB connectivity |
| P2 (urgent) | HTTP p99 > 500ms | Sustained for 5 min | Check dashboard rows 2-4 |
| P2 (urgent) | Connection pool exhausted | Available connections = 0 for 1 min | Increase pool size or reduce concurrency |
| P3 (warning) | Replication lag > 30s | Sustained for 5 min | Check secondary I/O, index builds |
| P3 (warning) | WiredTiger cache > 20% dirty | Sustained for 10 min | Check checkpoint, eviction threads |
| P4 (ticket) | Disk utilization > 70% | Sustained for 1 hour | Plan disk expansion or compaction |
| P4 (ticket) | Oplog window < 4 hours | Any occurrence | Resize oplog |
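
Two condition shapes recur in the table: "more than N consecutive failures" and "sustained for N minutes". The sketch below shows one way to evaluate each; `AlertConditions`, `ConsecutiveFailures`, and `Sustained` are hypothetical names, and in practice the sustained case is usually delegated to the alerting system (for example, the `for` clause of a Prometheus alerting rule) rather than written by hand.

```java
/** Hypothetical evaluators for the two condition shapes in the alert table. */
final class AlertConditions {

    /** Fires once more than `limit` failures have been observed in a row. */
    static final class ConsecutiveFailures {
        private final int limit;
        private int streak;

        ConsecutiveFailures(int limit) { this.limit = limit; }

        boolean observe(boolean success) {
            streak = success ? 0 : streak + 1;
            return streak > limit;
        }
    }

    /** Fires once a breach has held continuously for at least `holdMillis`. */
    static final class Sustained {
        private final long holdMillis;
        private long breachedSince = -1;  // -1 means "not currently breached"

        Sustained(long holdMillis) { this.holdMillis = holdMillis; }

        boolean observe(boolean breached, long nowMillis) {
            if (!breached) {
                breachedSince = -1;       // any recovery resets the clock
                return false;
            }
            if (breachedSince < 0) breachedSince = nowMillis;
            return nowMillis - breachedSince >= holdMillis;
        }
    }
}
```

Requiring the breach to hold continuously is what keeps a single slow scrape from paging anyone: a brief recovery resets the timer, so only persistent problems escalate.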

After implementing the unified dashboard and alert hierarchy:

| Metric | Before | After |
|---|---|---|
| Mean time to diagnose (MTTD) | 25 min | 5 min |
| Mean time to resolve (MTTR) | 45 min | 15 min |
| False positive alerts/week | 12 | 2 |
| Incidents escalated to MongoDB expert | 80% | 25% |

The Trade-off

Building and maintaining the dashboard is an ongoing investment. Every new metric added to the application must be reflected in the dashboard. Dashboard rot (stale panels, broken queries, outdated thresholds) is common. Assign ownership: one engineer maintains the dashboard and updates it during each performance review.

The alert hierarchy requires tuning. Initial thresholds are educated guesses. After 2-4 weeks of production data, adjust thresholds to eliminate false positives without missing real incidents. Too many alerts cause alert fatigue (the team ignores all alerts). Too few alerts cause missed incidents.
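
One way to ground the tuning step is to replay the collected samples against a candidate threshold and count how often it would have fired. The sketch below is an assumption about how that could be done, not the method used in this chapter: `ThresholdTuning` is a hypothetical helper, and starting from a high percentile of normal-operation data is only a heuristic that still needs review against real incidents.

```java
import java.util.Arrays;

/** Hypothetical helper for replaying observed samples against candidate thresholds. */
final class ThresholdTuning {

    /** Nearest-rank percentile: the smallest sample with at least p percent of samples at or below it. */
    static double percentile(double[] samples, double p) {
        if (samples.length == 0) throw new IllegalArgumentException("no samples");
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    /** Counts how many samples would have tripped a "greater than" threshold. */
    static long wouldFire(double[] samples, double threshold) {
        return Arrays.stream(samples).filter(v -> v > threshold).count();
    }
}
```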

Because traces are sampled (10% in production), not every slow request has a trace. When investigating a specific user's slow request, the trace may not exist. Implement on-demand trace capture: when a user reports a slow request, temporarily increase sampling for their session or enable debug-level tracing for their sensorId.
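
The on-demand capture idea can be sketched as a sampling decision that keeps the base rate but always samples a sensorId during a temporary window. This is a minimal sketch under stated assumptions: `OnDemandSampler` and its methods are hypothetical, the random roll is passed in to keep the decision testable, and wiring the decision into the OpenTelemetry SDK would require a custom sampler that is not shown here.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sampling decision: base rate plus temporary per-sensor overrides. */
final class OnDemandSampler {
    private final double baseRate;
    private final Map<String, Long> forcedUntil = new ConcurrentHashMap<>();

    OnDemandSampler(double baseRate) { this.baseRate = baseRate; }

    /** Always sample this sensorId until nowMillis + durationMillis. */
    void forceFor(String sensorId, long nowMillis, long durationMillis) {
        forcedUntil.put(sensorId, nowMillis + durationMillis);
    }

    /** roll is a uniform random value in [0, 1). */
    boolean shouldSample(String sensorId, long nowMillis, double roll) {
        Long until = forcedUntil.get(sensorId);
        if (until != null) {
            if (nowMillis < until) return true;
            forcedUntil.remove(sensorId, until);  // window expired, fall back to the base rate
        }
        return roll < baseRate;
    }
}
```

Letting the override expire on its own matters: a forgotten debug flag would otherwise keep full-rate tracing on indefinitely and add to the resource cost of the observability stack.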

The observability stack itself consumes resources. Prometheus, Grafana, and the OTel collector need CPU, memory, and storage. Budget 5-10% of the cluster’s resources for observability. This is not overhead; it is the cost of operating a production system reliably.