Tracing Connection Pool and Query Performance

The Symptom

The HTTP API’s p99 latency is 450ms. The MongoDB find command takes 12ms according to the slow query log. Where are the other 438ms?

The Cause

Without tracing, the gap between the HTTP latency (450ms) and the MongoDB command latency (12ms) is invisible. The trace reveals the breakdown:

HTTP GET /api/readings?sensorId=sensor-00042&hours=24    [450ms]
├── TelemetryService.getReadings()                        [440ms]
│   ├── ConnectionPool.checkout()                         [380ms]  <- BOTTLENECK
│   ├── MongoCollection.find()                            [12ms]
│   ├── Cursor.toList()                                   [45ms]
│   └── TelemetryMapper.toDtoList()                       [3ms]
└── ResponseSerialization                                 [10ms]

380ms is spent waiting for a connection from the pool. The pool is exhausted (CH4). The MongoDB command itself is fast (12ms), but the request is queued waiting for an available connection.

The Benchmark

Trace pattern	Root cause	Fix reference
Pool checkout > 100ms	Connection pool exhaustion	CH4 (pool sizing)
MongoDB command > 100ms	Slow query, missing index	CH2, CH10
Cursor iteration > 50ms	Large result set, no projection	CH6
Service method > MongoDB total	Application-layer overhead	CH5 (mapping)
Multiple MongoDB spans > 2	N+1 query pattern	CH9 (denormalization)

The Fix

Step 1: Add connection pool checkout span.

// FAST: Instrument connection pool with OpenTelemetry
public class MongoOtelPoolListener implements ConnectionPoolListener {

    private final Tracer tracer;
    private final ConcurrentHashMap<Integer, Span> checkoutSpans = new ConcurrentHashMap<>();
    private final ConcurrentHashMap<Integer, Long> checkoutStarts = new ConcurrentHashMap<>();

    public MongoOtelPoolListener(OpenTelemetry otel) {
        this.tracer = otel.getTracer("mongodb-pool");
    }

    @Override
    public void connectionCheckOutStarted(ConnectionCheckOutStartedEvent event) {
        Span span = tracer.spanBuilder("mongodb.pool.checkout")
            .setSpanKind(SpanKind.CLIENT)
            .setAttribute("db.system", "mongodb")
            .setAttribute("net.peer.name",
                event.getServerId().getAddress().getHost())
            .startSpan();

        checkoutSpans.put(event.getServerId().hashCode(), span);
        checkoutStarts.put(event.getServerId().hashCode(), System.nanoTime());
    }

    @Override
    public void connectionCheckedOut(ConnectionCheckedOutEvent event) {
        int key = event.getConnectionId().getServerId().hashCode();
        Span span = checkoutSpans.remove(key);
        Long startNanos = checkoutStarts.remove(key);

        if (span != null) {
            long durationMs = (System.nanoTime() - startNanos) / 1_000_000;
            span.setAttribute("pool.checkout_duration_ms", durationMs);

            if (durationMs > 100) {
                span.setAttribute("pool.slow_checkout", true);
                span.addEvent("Slow connection checkout",
                    Attributes.of(AttributeKey.longKey("duration_ms"), durationMs));
            }
            span.end();
        }
    }

    @Override
    public void connectionCheckOutFailed(ConnectionCheckOutFailedEvent event) {
        int key = event.getServerId().hashCode();
        Span span = checkoutSpans.remove(key);
        checkoutStarts.remove(key);

        if (span != null) {
            span.setStatus(StatusCode.ERROR, "Connection checkout failed: " +
                event.getReason().name());
            span.end();
        }
    }
}

Step 2: Add slow query detail to spans.

// FAST: Enrich MongoDB command spans with query details for slow operations
public class EnrichedMongoCommandListener implements CommandListener {

    private final Tracer tracer;
    private final ConcurrentHashMap<Integer, Span> activeSpans = new ConcurrentHashMap<>();

    @Override
    public void commandStarted(CommandStartedEvent event) {
        Span span = tracer.spanBuilder("mongodb." + event.getCommandName())
            .setSpanKind(SpanKind.CLIENT)
            .setAttribute("db.system", "mongodb")
            .setAttribute("db.name", event.getDatabaseName())
            .setAttribute("db.operation", event.getCommandName())
            .startSpan();

        activeSpans.put(event.getRequestId(), span);
    }

    @Override
    public void commandSucceeded(CommandSucceededEvent event) {
        Span span = activeSpans.remove(event.getRequestId());
        if (span != null) {
            long elapsedMs = event.getElapsedTime(TimeUnit.MILLISECONDS);
            span.setAttribute("db.response_time_ms", elapsedMs);

            // For slow commands, add query plan details
            if (elapsedMs > 50 && isQueryCommand(event.getCommandName())) {
                span.addEvent("slow_query",
                    Attributes.of(
                        AttributeKey.longKey("elapsed_ms"), elapsedMs,
                        AttributeKey.stringKey("command_name"), event.getCommandName()
                    ));
            }

            span.end();
        }
    }

    private boolean isQueryCommand(String name) {
        return "find".equals(name) || "aggregate".equals(name) ||
               "count".equals(name) || "distinct".equals(name);
    }
}

Step 3: Set up trace-based alerts.

// Alert criteria for MongoDB operations
// 1. Any trace where pool.checkout > 200ms -> connection pool alert
// 2. Any trace where mongodb.find > 100ms -> slow query alert
// 3. Any trace with > 5 MongoDB spans -> N+1 query alert
// 4. Any trace where total_mongodb_time / http_time < 0.5 -> application overhead alert

The Proof

After adding pool checkout spans and trace analysis:

Discovery	Traces analyzed	Finding	Fix
Pool checkout bottleneck	10,000	23% of requests wait > 100ms	Increased pool size (CH4)
Slow find on anomalies	10,000	3% of find commands > 100ms	Added index (CH10)
N+1 pattern in dashboard	500	15 MongoDB commands per request	Denormalized (CH9)
Mapping overhead	10,000	8% of request time in mapping	Switched to raw driver (CH5)

After fixing the pool checkout issue (increasing maxSize from 100 to 200):

Metric	Before	After
HTTP p99	450ms	65ms
Pool checkout p99	380ms	8ms
MongoDB find p99	12ms	12ms (unchanged)

The 450ms p99 dropped to 65ms. The MongoDB query was never the problem; the connection pool was.

The Trade-off

Tracing adds overhead to every request. Each span creation involves memory allocation, timestamp capture, and attribute storage. At 10% sampling rate, 1 in 10 requests is traced. The overhead for sampled requests is approximately 0.5ms per span. With 3-5 spans per request, the tracing overhead is 1.5-2.5ms per traced request. For non-sampled requests (90%), the overhead is near zero (just a sampling decision check).

The 10% sampling rate means that 90% of slow requests are not traced. For debugging intermittent issues, increase the sampling rate temporarily. For production monitoring, 10% provides sufficient statistical coverage for aggregate analysis (p99 calculations, bottleneck identification) without significant overhead.

Storing traces is expensive. At 50,000 requests/second with 10% sampling, the trace backend receives 5,000 traces/second. Each trace with 5 spans is approximately 2 KB. Total: 10 MB/second = 864 GB/day. Configure trace retention (typically 7-14 days) and use tail-based sampling to keep only interesting traces (slow, error, or specific conditions).