Tracing Connection Pool and Query Performance
Tracing Connection Pool and Query Performance
The Symptom
The HTTP API’s p99 latency is 450ms. The MongoDB find command takes 12ms according to the slow query log. Where are the other 438ms?
The Cause
Without tracing, the gap between the HTTP latency (450ms) and the MongoDB command latency (12ms) is invisible. The trace reveals the breakdown:
HTTP GET /api/readings?sensorId=sensor-00042&hours=24 [450ms]
├── TelemetryService.getReadings() [440ms]
│ ├── ConnectionPool.checkout() [380ms] <- BOTTLENECK
│ ├── MongoCollection.find() [12ms]
│ ├── Cursor.toList() [45ms]
│ └── TelemetryMapper.toDtoList() [3ms]
└── ResponseSerialization [10ms]
380ms is spent waiting for a connection from the pool. The pool is exhausted (CH4). The MongoDB command itself is fast (12ms), but the request is queued waiting for an available connection.
The Benchmark
| Trace pattern | Root cause | Fix reference |
|---|---|---|
| Pool checkout > 100ms | Connection pool exhaustion | CH4 (pool sizing) |
| MongoDB command > 100ms | Slow query, missing index | CH2, CH10 |
| Cursor iteration > 50ms | Large result set, no projection | CH6 |
| Service method > MongoDB total | Application-layer overhead | CH5 (mapping) |
| Multiple MongoDB spans > 2 | N+1 query pattern | CH9 (denormalization) |
The Fix
Step 1: Add connection pool checkout span.
// FAST: Instrument connection pool with OpenTelemetry
public class MongoOtelPoolListener implements ConnectionPoolListener {
private final Tracer tracer;
private final ConcurrentHashMap<Integer, Span> checkoutSpans = new ConcurrentHashMap<>();
private final ConcurrentHashMap<Integer, Long> checkoutStarts = new ConcurrentHashMap<>();
public MongoOtelPoolListener(OpenTelemetry otel) {
this.tracer = otel.getTracer("mongodb-pool");
}
@Override
public void connectionCheckOutStarted(ConnectionCheckOutStartedEvent event) {
Span span = tracer.spanBuilder("mongodb.pool.checkout")
.setSpanKind(SpanKind.CLIENT)
.setAttribute("db.system", "mongodb")
.setAttribute("net.peer.name",
event.getServerId().getAddress().getHost())
.startSpan();
checkoutSpans.put(event.getServerId().hashCode(), span);
checkoutStarts.put(event.getServerId().hashCode(), System.nanoTime());
}
@Override
public void connectionCheckedOut(ConnectionCheckedOutEvent event) {
int key = event.getConnectionId().getServerId().hashCode();
Span span = checkoutSpans.remove(key);
Long startNanos = checkoutStarts.remove(key);
if (span != null) {
long durationMs = (System.nanoTime() - startNanos) / 1_000_000;
span.setAttribute("pool.checkout_duration_ms", durationMs);
if (durationMs > 100) {
span.setAttribute("pool.slow_checkout", true);
span.addEvent("Slow connection checkout",
Attributes.of(AttributeKey.longKey("duration_ms"), durationMs));
}
span.end();
}
}
@Override
public void connectionCheckOutFailed(ConnectionCheckOutFailedEvent event) {
int key = event.getServerId().hashCode();
Span span = checkoutSpans.remove(key);
checkoutStarts.remove(key);
if (span != null) {
span.setStatus(StatusCode.ERROR, "Connection checkout failed: " +
event.getReason().name());
span.end();
}
}
}
Step 2: Add slow query detail to spans.
// FAST: Enrich MongoDB command spans with query details for slow operations
public class EnrichedMongoCommandListener implements CommandListener {
private final Tracer tracer;
private final ConcurrentHashMap<Integer, Span> activeSpans = new ConcurrentHashMap<>();
@Override
public void commandStarted(CommandStartedEvent event) {
Span span = tracer.spanBuilder("mongodb." + event.getCommandName())
.setSpanKind(SpanKind.CLIENT)
.setAttribute("db.system", "mongodb")
.setAttribute("db.name", event.getDatabaseName())
.setAttribute("db.operation", event.getCommandName())
.startSpan();
activeSpans.put(event.getRequestId(), span);
}
@Override
public void commandSucceeded(CommandSucceededEvent event) {
Span span = activeSpans.remove(event.getRequestId());
if (span != null) {
long elapsedMs = event.getElapsedTime(TimeUnit.MILLISECONDS);
span.setAttribute("db.response_time_ms", elapsedMs);
// For slow commands, add query plan details
if (elapsedMs > 50 && isQueryCommand(event.getCommandName())) {
span.addEvent("slow_query",
Attributes.of(
AttributeKey.longKey("elapsed_ms"), elapsedMs,
AttributeKey.stringKey("command_name"), event.getCommandName()
));
}
span.end();
}
}
private boolean isQueryCommand(String name) {
return "find".equals(name) || "aggregate".equals(name) ||
"count".equals(name) || "distinct".equals(name);
}
}
Step 3: Set up trace-based alerts.
// Alert criteria for MongoDB operations
// 1. Any trace where pool.checkout > 200ms -> connection pool alert
// 2. Any trace where mongodb.find > 100ms -> slow query alert
// 3. Any trace with > 5 MongoDB spans -> N+1 query alert
// 4. Any trace where total_mongodb_time / http_time < 0.5 -> application overhead alert
The Proof
After adding pool checkout spans and trace analysis:
| Discovery | Traces analyzed | Finding | Fix |
|---|---|---|---|
| Pool checkout bottleneck | 10,000 | 23% of requests wait > 100ms | Increased pool size (CH4) |
| Slow find on anomalies | 10,000 | 3% of find commands > 100ms | Added index (CH10) |
| N+1 pattern in dashboard | 500 | 15 MongoDB commands per request | Denormalized (CH9) |
| Mapping overhead | 10,000 | 8% of request time in mapping | Switched to raw driver (CH5) |
After fixing the pool checkout issue (increasing maxSize from 100 to 200):
| Metric | Before | After |
|---|---|---|
| HTTP p99 | 450ms | 65ms |
| Pool checkout p99 | 380ms | 8ms |
| MongoDB find p99 | 12ms | 12ms (unchanged) |
The 450ms p99 dropped to 65ms. The MongoDB query was never the problem; the connection pool was.
The Trade-off
Tracing adds overhead to every request. Each span creation involves memory allocation, timestamp capture, and attribute storage. At 10% sampling rate, 1 in 10 requests is traced. The overhead for sampled requests is approximately 0.5ms per span. With 3-5 spans per request, the tracing overhead is 1.5-2.5ms per traced request. For non-sampled requests (90%), the overhead is near zero (just a sampling decision check).
The 10% sampling rate means that 90% of slow requests are not traced. For debugging intermittent issues, increase the sampling rate temporarily. For production monitoring, 10% provides sufficient statistical coverage for aggregate analysis (p99 calculations, bottleneck identification) without significant overhead.
Storing traces is expensive. At 50,000 requests/second with 10% sampling, the trace backend receives 5,000 traces/second. Each trace with 5 spans is approximately 2 KB. Total: 10 MB/second = 864 GB/day. Configure trace retention (typically 7-14 days) and use tail-based sampling to keep only interesting traces (slow, error, or specific conditions).