Skip to main content
unbound mongodb at scale

Tracing Connection Pool and Query Performance

5 min read Chapter 71 of 72

Tracing Connection Pool and Query Performance

The Symptom

The HTTP API’s p99 latency is 450ms. The MongoDB find command takes 12ms according to the slow query log. Where are the other 438ms?

The Cause

Without tracing, the gap between the HTTP latency (450ms) and the MongoDB command latency (12ms) is invisible. The trace reveals the breakdown:

HTTP GET /api/readings?sensorId=sensor-00042&hours=24    [450ms]
├── TelemetryService.getReadings()                        [440ms]
│   ├── ConnectionPool.checkout()                         [380ms]  <- BOTTLENECK
│   ├── MongoCollection.find()                            [12ms]
│   ├── Cursor.toList()                                   [45ms]
│   └── TelemetryMapper.toDtoList()                       [3ms]
└── ResponseSerialization                                 [10ms]

380ms is spent waiting for a connection from the pool. The pool is exhausted (CH4). The MongoDB command itself is fast (12ms), but the request is queued waiting for an available connection.

The Benchmark

Trace patternRoot causeFix reference
Pool checkout > 100msConnection pool exhaustionCH4 (pool sizing)
MongoDB command > 100msSlow query, missing indexCH2, CH10
Cursor iteration > 50msLarge result set, no projectionCH6
Service method > MongoDB totalApplication-layer overheadCH5 (mapping)
Multiple MongoDB spans > 2N+1 query patternCH9 (denormalization)

The Fix

Step 1: Add connection pool checkout span.

// FAST: Instrument connection pool with OpenTelemetry
public class MongoOtelPoolListener implements ConnectionPoolListener {

    private final Tracer tracer;
    private final ConcurrentHashMap<Integer, Span> checkoutSpans = new ConcurrentHashMap<>();
    private final ConcurrentHashMap<Integer, Long> checkoutStarts = new ConcurrentHashMap<>();

    public MongoOtelPoolListener(OpenTelemetry otel) {
        this.tracer = otel.getTracer("mongodb-pool");
    }

    @Override
    public void connectionCheckOutStarted(ConnectionCheckOutStartedEvent event) {
        Span span = tracer.spanBuilder("mongodb.pool.checkout")
            .setSpanKind(SpanKind.CLIENT)
            .setAttribute("db.system", "mongodb")
            .setAttribute("net.peer.name",
                event.getServerId().getAddress().getHost())
            .startSpan();

        checkoutSpans.put(event.getServerId().hashCode(), span);
        checkoutStarts.put(event.getServerId().hashCode(), System.nanoTime());
    }

    @Override
    public void connectionCheckedOut(ConnectionCheckedOutEvent event) {
        int key = event.getConnectionId().getServerId().hashCode();
        Span span = checkoutSpans.remove(key);
        Long startNanos = checkoutStarts.remove(key);

        if (span != null) {
            long durationMs = (System.nanoTime() - startNanos) / 1_000_000;
            span.setAttribute("pool.checkout_duration_ms", durationMs);

            if (durationMs > 100) {
                span.setAttribute("pool.slow_checkout", true);
                span.addEvent("Slow connection checkout",
                    Attributes.of(AttributeKey.longKey("duration_ms"), durationMs));
            }
            span.end();
        }
    }

    @Override
    public void connectionCheckOutFailed(ConnectionCheckOutFailedEvent event) {
        int key = event.getServerId().hashCode();
        Span span = checkoutSpans.remove(key);
        checkoutStarts.remove(key);

        if (span != null) {
            span.setStatus(StatusCode.ERROR, "Connection checkout failed: " +
                event.getReason().name());
            span.end();
        }
    }
}

Step 2: Add slow query detail to spans.

// FAST: Enrich MongoDB command spans with query details for slow operations
public class EnrichedMongoCommandListener implements CommandListener {

    private final Tracer tracer;
    private final ConcurrentHashMap<Integer, Span> activeSpans = new ConcurrentHashMap<>();

    @Override
    public void commandStarted(CommandStartedEvent event) {
        Span span = tracer.spanBuilder("mongodb." + event.getCommandName())
            .setSpanKind(SpanKind.CLIENT)
            .setAttribute("db.system", "mongodb")
            .setAttribute("db.name", event.getDatabaseName())
            .setAttribute("db.operation", event.getCommandName())
            .startSpan();

        activeSpans.put(event.getRequestId(), span);
    }

    @Override
    public void commandSucceeded(CommandSucceededEvent event) {
        Span span = activeSpans.remove(event.getRequestId());
        if (span != null) {
            long elapsedMs = event.getElapsedTime(TimeUnit.MILLISECONDS);
            span.setAttribute("db.response_time_ms", elapsedMs);

            // For slow commands, add query plan details
            if (elapsedMs > 50 && isQueryCommand(event.getCommandName())) {
                span.addEvent("slow_query",
                    Attributes.of(
                        AttributeKey.longKey("elapsed_ms"), elapsedMs,
                        AttributeKey.stringKey("command_name"), event.getCommandName()
                    ));
            }

            span.end();
        }
    }

    private boolean isQueryCommand(String name) {
        return "find".equals(name) || "aggregate".equals(name) ||
               "count".equals(name) || "distinct".equals(name);
    }
}

Step 3: Set up trace-based alerts.

// Alert criteria for MongoDB operations
// 1. Any trace where pool.checkout > 200ms -> connection pool alert
// 2. Any trace where mongodb.find > 100ms -> slow query alert
// 3. Any trace with > 5 MongoDB spans -> N+1 query alert
// 4. Any trace where total_mongodb_time / http_time < 0.5 -> application overhead alert

The Proof

After adding pool checkout spans and trace analysis:

DiscoveryTraces analyzedFindingFix
Pool checkout bottleneck10,00023% of requests wait > 100msIncreased pool size (CH4)
Slow find on anomalies10,0003% of find commands > 100msAdded index (CH10)
N+1 pattern in dashboard50015 MongoDB commands per requestDenormalized (CH9)
Mapping overhead10,0008% of request time in mappingSwitched to raw driver (CH5)

After fixing the pool checkout issue (increasing maxSize from 100 to 200):

MetricBeforeAfter
HTTP p99450ms65ms
Pool checkout p99380ms8ms
MongoDB find p9912ms12ms (unchanged)

The 450ms p99 dropped to 65ms. The MongoDB query was never the problem; the connection pool was.

The Trade-off

Tracing adds overhead to every request. Each span creation involves memory allocation, timestamp capture, and attribute storage. At 10% sampling rate, 1 in 10 requests is traced. The overhead for sampled requests is approximately 0.5ms per span. With 3-5 spans per request, the tracing overhead is 1.5-2.5ms per traced request. For non-sampled requests (90%), the overhead is near zero (just a sampling decision check).

The 10% sampling rate means that 90% of slow requests are not traced. For debugging intermittent issues, increase the sampling rate temporarily. For production monitoring, 10% provides sufficient statistical coverage for aggregate analysis (p99 calculations, bottleneck identification) without significant overhead.

Storing traces is expensive. At 50,000 requests/second with 10% sampling, the trace backend receives 5,000 traces/second. Each trace with 5 spans is approximately 2 KB. Total: 10 MB/second = 864 GB/day. Configure trace retention (typically 7-14 days) and use tail-based sampling to keep only interesting traces (slow, error, or specific conditions).