Metric Aggregations and the HyperLogLog Cardinality Estimator
The Symptom
The product manager asks “how many unique users searched this week?” The developer writes a terms aggregation on user_id with size: 1000000. The query takes 45 seconds and the coordinating node runs out of heap. The answer is 237,000 unique users, but the query nearly crashed the cluster to compute it.
The Internals
The cardinality aggregation uses HyperLogLog++ (HLL++), a probabilistic algorithm that estimates the number of distinct values in a field without storing all values in memory. HLL++ uses a fixed amount of memory (determined by the precision_threshold parameter) regardless of the actual cardinality.
The precision_threshold controls the trade-off between accuracy and memory:
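| precision_threshold | Memory per aggregation (approximate) | Typical error (approximate) |
|---|---|---|
| 100 | ~0.8KB | ~6% |
| 1,000 | ~8KB | ~2% |
| 10,000 | ~80KB | ~0.5% |
| 40,000 (max) | ~320KB | ~0.25% |
Memory is roughly 8 bytes per unit of precision_threshold. For cardinalities below the precision_threshold, the count is expected to be close to exact. Above it, the error rates shown are typical values, not guaranteed bounds, because HLL++ is a probabilistic estimator.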
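To make the fixed-memory idea concrete, here is a minimal sketch of a plain HyperLogLog counter. It is not Elasticsearch's HLL++ implementation (it has no sparse encoding and no empirical bias correction), the class name HyperLogLogSketch is hypothetical, and the mix64 function is a stand-in for the real hash. It only illustrates that the register array never grows, no matter how many values are added.

```java
final class HyperLogLogSketch {
    private final int p;            // number of hash bits used to pick a register
    private final int m;            // number of registers, 2^p
    private final byte[] registers; // fixed memory: one byte per register

    HyperLogLogSketch(int p) {
        if (p < 7 || p > 18) {
            throw new IllegalArgumentException("p must be in [7, 18]");
        }
        this.p = p;
        this.m = 1 << p;
        this.registers = new byte[m];
    }

    // Stand-in 64-bit mixer (SplitMix64 finalizer), assumed here for illustration.
    static long mix64(long z) {
        z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
        z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
        return z ^ (z >>> 31);
    }

    void add(long value) {
        long h = mix64(value + 0x9e3779b97f4a7c15L);
        int idx = (int) (h >>> (64 - p)); // top p bits select the register
        long rest = h << p;               // remaining bits, left-aligned
        // Rank = position of the first 1 bit in the remaining bits.
        int rank = (rest == 0) ? (64 - p + 1) : Long.numberOfLeadingZeros(rest) + 1;
        if (rank > registers[idx]) {
            registers[idx] = (byte) rank; // keep only the maximum rank seen
        }
    }

    double estimate() {
        double sum = 0.0;
        int zeros = 0;
        for (byte r : registers) {
            sum += 1.0 / (1L << r);
            if (r == 0) {
                zeros++;
            }
        }
        double alpha = 0.7213 / (1.0 + 1.079 / m); // constant valid for m >= 128
        double raw = alpha * m * (double) m / sum;
        if (raw <= 2.5 * m && zeros > 0) {
            // Small-range correction: fall back to linear counting.
            return m * Math.log((double) m / zeros);
        }
        return raw;
    }

    int registerCount() {
        return m;
    }

    public static void main(String[] args) {
        HyperLogLogSketch sketch = new HyperLogLogSketch(14);
        for (long userId = 0; userId < 237_000; userId++) {
            sketch.add(userId);
        }
        System.out.println("registers: " + sketch.registerCount()
                + ", estimated distinct values: " + Math.round(sketch.estimate()));
    }
}
```

Adding the same value twice leaves the registers unchanged, which is why repeated searches by one user do not inflate the estimate. The sketch takes numeric IDs only; hashing strings would need a proper 64-bit string hash.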
The Implementation
import co.elastic.clients.elasticsearch.core.SearchRequest;
import co.elastic.clients.json.JsonData;

// HARDENED: Search analytics aggregation for the documentation platform
// Computes unique users, unique queries, and search volume in one request
SearchRequest analyticsQuery = SearchRequest.of(s -> s
    .index("search-logs")
    .size(0) // No hits needed, only aggregations
    .query(q -> q
        .bool(b -> b
            .filter(f -> f.range(r -> r
                .field("timestamp")
                .gte(JsonData.of("now-7d"))
            ))
            .filter(f -> f.term(t -> t
                .field("tenant_id").value(tenantId)))
        )
    )
    .aggregations("unique_users", a -> a
        .cardinality(c -> c
            .field("user_id")
            .precisionThreshold(10000)
        )
    )
    .aggregations("unique_queries", a -> a
        .cardinality(c -> c
            .field("query_text.raw")
            .precisionThreshold(10000)
        )
    )
    .aggregations("total_searches", a -> a
        // Count a field every matching log entry has. Aggregating on _id
        // is disabled by default in recent Elasticsearch versions.
        .valueCount(vc -> vc.field("timestamp"))
    )
    .aggregations("result_count_stats", a -> a
        .stats(st -> st.field("result_count"))
    )
    .aggregations("latency_percentiles", a -> a
        .percentiles(p -> p
            .field("latency_ms")
            .percents(50.0, 90.0, 95.0, 99.0)
        )
    )
    // The filter aggregation returns a document count, not a rate.
    // Divide it by total_searches to get the zero-result rate.
    .aggregations("zero_result_rate", a -> a
        .filter(f -> f.term(t -> t.field("result_count").value(0)))
    )
);
Zero-Result Query Analysis
// Queries that return zero results indicate search quality gaps
SearchRequest zeroResultQueries = SearchRequest.of(s -> s
    .index("search-logs")
    .size(0)
    .query(q -> q
        .bool(b -> b
            .filter(f -> f.term(t -> t.field("result_count").value(0)))
            .filter(f -> f.range(r -> r.field("timestamp").gte(JsonData.of("now-7d"))))
            .filter(f -> f.term(t -> t.field("tenant_id").value(tenantId)))
        )
    )
    .aggregations("top_zero_result_queries", a -> a
        .terms(t -> t
            .field("query_text.raw")
            .size(20)
            .minDocCount(3) // Only show queries that failed multiple times
        )
    )
);
Zero-result queries are the most actionable search analytics signal. Each represents a user need that the documentation does not satisfy (missing content), a vocabulary mismatch (the content exists but uses different terms), or a search configuration problem (the analyzer or query structure prevents matching).
The Measurement
Weekly search analytics for the documentation platform:
| Metric | Value | Health Indicator |
|---|---|---|
| Unique users | 23,400 | Baseline: growing |
| Unique queries | 8,900 | Query diversity |
| Total searches | 145,000 | Volume |
| Avg results per query | 12.3 | Recall |
| Zero-result rate | 8.2% | < 5% is good, > 15% is problematic |
| p50 latency | 18ms | Good |
| p99 latency | 85ms | Acceptable |
A zero-result rate above 15% indicates systemic search quality problems. Between 5% and 15% is normal for documentation search (some queries are genuinely not covered). Below 5% is good, but a very low rate can also mean the analyzer is too aggressive with fuzzy matching, returning marginally relevant results instead of admitting no good match exists.
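The thresholds above can be turned into a small health check. The following is a minimal sketch, assuming the zero-result count and the total search count have already been read from the aggregation response; the class and method names are hypothetical.

```java
final class ZeroResultHealth {
    enum Band { GOOD, NORMAL, PROBLEMATIC }

    // Rate = zero-result searches divided by all searches in the window.
    static double rate(long zeroResultSearches, long totalSearches) {
        if (totalSearches <= 0) {
            return 0.0; // no traffic in the window, nothing to report
        }
        return (double) zeroResultSearches / totalSearches;
    }

    // Bands follow the text: above 15% is problematic, 5% to 15% is normal,
    // below 5% is good (though worth a check for over-eager fuzzy matching).
    static Band classify(double rate) {
        if (rate > 0.15) {
            return Band.PROBLEMATIC;
        }
        if (rate < 0.05) {
            return Band.GOOD;
        }
        return Band.NORMAL;
    }

    public static void main(String[] args) {
        // 11,890 zero-result searches out of 145,000 is about 8.2%.
        double weekly = rate(11_890, 145_000);
        System.out.println(classify(weekly));
    }
}
```

Keeping the rate calculation separate from the classification makes it easy to alert on the band while still charting the raw rate over time.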
The Decision Rule
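Use the cardinality aggregation with precision_threshold: 10000 for any “count unique” analytics query. Never use a terms aggregation with a large size to count unique values. The memory difference can be several orders of magnitude: the cardinality sketch stays at a fixed size, while a terms aggregation must build a bucket for every distinct value and ship those buckets to the coordinating node.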
Track zero-result rate as the primary search quality metric in production. It is cheaper to compute than NDCG (no relevance judgments required) and directly actionable (each zero-result query is a specific improvement opportunity).
Log every search query with its result count, latency, and user ID. This log becomes the source data for search analytics, zero-result analysis, and query test set expansion (add frequently-failing queries to the test set as new entries).
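A log entry for this purpose could look like the sketch below. The field names mirror those used in the queries above (timestamp, tenant_id, user_id, query_text, result_count, latency_ms); the class itself and its hand-rolled JSON serialization are assumptions for illustration, and a real service would more likely use its existing JSON library.

```java
import java.time.Instant;

final class SearchLogEntry {
    final Instant timestamp;
    final String tenantId;
    final String userId;
    final String queryText;
    final long resultCount;
    final long latencyMs;

    SearchLogEntry(Instant timestamp, String tenantId, String userId,
                   String queryText, long resultCount, long latencyMs) {
        this.timestamp = timestamp;
        this.tenantId = tenantId;
        this.userId = userId;
        this.queryText = queryText;
        this.resultCount = resultCount;
        this.latencyMs = latencyMs;
    }

    // Serializes the entry as one JSON document, ready to index into search-logs.
    String toJson() {
        return "{"
            + "\"timestamp\":\"" + timestamp + "\","
            + "\"tenant_id\":\"" + escape(tenantId) + "\","
            + "\"user_id\":\"" + escape(userId) + "\","
            + "\"query_text\":\"" + escape(queryText) + "\","
            + "\"result_count\":" + resultCount + ","
            + "\"latency_ms\":" + latencyMs
            + "}";
    }

    // Minimal JSON string escaping: quotes, backslashes, and control characters.
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length() + 8);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '"' || c == '\\') {
                out.append('\\').append(c);
            } else if (c < 0x20) {
                out.append(String.format("\\u%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        SearchLogEntry entry = new SearchLogEntry(
            Instant.now(), "tenant-a", "user-42", "reindex api", 0, 18);
        System.out.println(entry.toJson());
    }
}
```

Recording result_count on every entry is what makes the zero-result filter and the top zero-result queries aggregation possible without re-running the searches.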