$lookup Performance: Measuring the Join Tax

The Symptom

The dashboard endpoint that shows sensor readings with metadata has a p95 latency of 280ms. The MongoDB explain output shows the $lookup stage consuming 65% of the total execution time. The sensor metadata collection has 10,000 documents with an index on _id. The lookup should be fast. Why is it not?

The Cause

MongoDB $lookup behavior depends on the result set size. For each document in the pipeline, the $lookup stage performs a lookup against the foreign collection. With 500 documents in the pipeline, that is 500 lookups. MongoDB 5.0 introduced hash join optimization for equality-based lookups, which builds a hash table from the foreign collection in memory. But the hash table is rebuilt for each aggregation execution, not cached across queries.

// explain output for the $lookup aggregation
{
  "stages": [
    {
      "$cursor": {
        "executionStats": {
          "nReturned": 500,
          "executionTimeMillisEstimate": 15
        }
      }
    },
    {
      "$lookup": {
        "executionStats": {
          "nReturned": 500,
          "executionTimeMillisEstimate": 180
        }
      }
    },
    {
      "$unwind": {
        "executionStats": {
          "executionTimeMillisEstimate": 5
        }
      }
    }
  ]
}

The $cursor stage (finding the readings) takes 15ms. The $lookup stage takes 180ms. 92% of the query time is the join.

The Benchmark

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 2, time = 10)
@Measurement(iterations = 3, time = 15)
@Fork(1)
@State(Scope.Benchmark)
public class LookupVsDenormalizedBenchmark {

    private MongoCollection<Document> normalizedReadings;
    private MongoCollection<Document> denormalizedReadings;

    @Param({"10", "50", "100", "500", "1000"})
    private int resultSize;

    @Setup
    public void setup() {
        MongoClient client = MongoClients.create("mongodb://localhost:27017");
        var db = client.getDatabase("telemetry");
        normalizedReadings = db.getCollection("readings_normalized");
        denormalizedReadings = db.getCollection("readings_denormalized");
    }

    @Benchmark
    public List<Document> withLookup() {
        return normalizedReadings.aggregate(List.of(
            Aggregates.match(Filters.eq("sensorId", "sensor-00042")),
            Aggregates.sort(Sorts.descending("ts")),
            Aggregates.limit(resultSize),
            Aggregates.lookup("sensors", "sensorId", "_id", "sensor"),
            Aggregates.unwind("$sensor"),
            Aggregates.project(Projections.fields(
                Projections.include("sensorId", "ts", "temperature"),
                Projections.computed("building", "$sensor.building"),
                Projections.computed("floor", "$sensor.floor")
            ))
        )).into(new ArrayList<>());
    }

    @Benchmark
    public List<Document> denormalized() {
        return denormalizedReadings.find(
            Filters.eq("sensorId", "sensor-00042")
        ).sort(Sorts.descending("ts"))
         .limit(resultSize)
         .projection(Projections.include("sensorId", "ts", "temperature",
             "sensor.building", "sensor.floor"))
         .into(new ArrayList<>());
    }
}

Results:

Benchmark                                  (resultSize)  Mode  Cnt    Score    Error  Units
LookupVsDenormalizedBenchmark.withLookup           10  avgt    3   12.000 ±  2.000  ms/op
LookupVsDenormalizedBenchmark.withLookup           50  avgt    3   35.000 ±  4.000  ms/op
LookupVsDenormalizedBenchmark.withLookup          100  avgt    3   68.000 ±  8.000  ms/op
LookupVsDenormalizedBenchmark.withLookup          500  avgt    3  280.000 ± 20.000  ms/op
LookupVsDenormalizedBenchmark.withLookup         1000  avgt    3  520.000 ± 35.000  ms/op
LookupVsDenormalizedBenchmark.denormalized         10  avgt    3    2.000 ±  0.300  ms/op
LookupVsDenormalizedBenchmark.denormalized         50  avgt    3    4.000 ±  0.500  ms/op
LookupVsDenormalizedBenchmark.denormalized        100  avgt    3    6.000 ±  0.800  ms/op
LookupVsDenormalizedBenchmark.denormalized        500  avgt    3   15.000 ±  2.000  ms/op
LookupVsDenormalizedBenchmark.denormalized       1000  avgt    3   28.000 ±  3.000  ms/op

At 500 results, $lookup is 18.7x slower. The ratio increases with result size because each additional document adds a lookup operation. The denormalized approach scales linearly with I/O.

The Fix

Denormalize the sensor metadata into reading documents at write time:

// FAST: Embed sensor metadata at write time
public void ingestReading(SensorReading reading) {
    // Cache sensor metadata (refreshed every 5 minutes)
    Document sensorMeta = sensorMetadataCache.get(reading.getSensorId());

    Document doc = new Document()
        .append("sensorId", reading.getSensorId())
        .append("ts", Date.from(reading.getTimestamp()))
        .append("temperature", reading.getTemperature())
        .append("humidity", reading.getHumidity())
        .append("pressure", reading.getPressure())
        .append("sensor", new Document()
            .append("building", sensorMeta.getString("building"))
            .append("floor", sensorMeta.getInteger("floor"))
            .append("zone", sensorMeta.getString("zone"))
            .append("type", sensorMeta.getString("type"))
        );

    collection.insertOne(doc);
}

The Proof

After denormalization:

Metric	$lookup	Denormalized
Dashboard p50 (500 docs)	145ms	10ms
Dashboard p95 (500 docs)	280ms	18ms
Storage per reading	340 bytes	420 bytes
Storage overhead	Baseline	+23.5%
Sensor metadata update cost	Zero (single source)	Bulk update all readings

The Trade-off

The 23.5% storage increase from embedding 80 bytes of sensor metadata in every reading adds up. For 200 million readings per month, that is 16 GB of additional storage. For the telemetry platform with a 2 TB storage budget, 16 GB is negligible.

The real cost is consistency. If sensor-00042 moves from Floor 3 to Floor 5, historical readings still show Floor 3. This may be correct (the reading was taken when the sensor was on Floor 3) or incorrect (the dashboard should show current location). If historical accuracy matters, denormalization is the correct choice. If current-state accuracy matters, use $lookup or maintain a version field in the embedded metadata and update asynchronously.