Building a Search Quality Dashboard
The Symptom
The team deploys a synonym list update on Tuesday. Search relevance for technical queries improves by 0.04 NDCG. On Thursday, a teammate deploys a mapping change that accidentally removes the code_snippets field from the multi_match query. Relevance for code-related queries drops by 0.12 NDCG. Nobody notices because the only relevance metric is a monthly manual evaluation.
The Internals
Search quality is a time-series metric, not a one-time evaluation. Every change to the mapping, analyzer, query template, or synonym list potentially affects relevance. Without continuous measurement, regressions hide behind feature launches.
The search quality pipeline:
- Query test set. A fixed set of queries with graded relevance judgments (from Chapter 8).
- Automated evaluation. Run the test set against the current index and compute NDCG@5 per category.
- Historical storage. Store each evaluation result with a timestamp and the deployment version.
- Regression detection. Compare the current per-category NDCG@5 with the snapshot from the previous deployment. Alert on drops greater than 0.02.
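As a reminder of what the evaluation step measures, here is a minimal sketch of NDCG@k over graded judgments. It is not the Chapter 8 evaluator: the class name `NdcgSketch` is hypothetical, the gain is assumed to be linear (grade divided by log2 of rank + 1), and documents without a judgment are assumed to have grade 0. An evaluator using exponential gain would produce different absolute values.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for an NDCG evaluator; assumes linear gain. */
final class NdcgSketch {

    /** DCG over the first k positions: sum of grade / log2(rank + 1). */
    static double dcg(List<Integer> grades, int k) {
        double sum = 0.0;
        for (int i = 0; i < Math.min(k, grades.size()); i++) {
            // i is zero-based, so rank = i + 1 and the discount is log2(i + 2)
            sum += grades.get(i) / (Math.log(i + 2) / Math.log(2));
        }
        return sum;
    }

    /** NDCG@k: DCG of the returned order divided by DCG of the ideal order. */
    static double ndcgAtK(List<String> returnedIds,
                          Map<String, Integer> judgments,
                          int k) {
        // Grades in the order the engine returned them; unjudged documents count as 0
        List<Integer> actual = returnedIds.stream()
                .map(id -> judgments.getOrDefault(id, 0))
                .toList();
        // Ideal ordering: every judged grade, sorted from best to worst
        List<Integer> ideal = judgments.values().stream()
                .sorted(Comparator.reverseOrder())
                .toList();
        double idealDcg = dcg(ideal, k);
        // With no relevant documents judged, the score is defined here as 0
        return idealDcg == 0.0 ? 0.0 : dcg(actual, k) / idealDcg;
    }
}
```

Because the score is normalized by the ideal ordering, a perfect ranking scores 1.0 regardless of how many relevant documents a query has, which is what makes per-query scores comparable enough to average within a category.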
The Implementation
Automated NDCG Tracker
import java.io.IOException;
import java.time.Instant;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

import org.opensearch.client.opensearch.OpenSearchClient;

// SearchService, RelevanceEvaluator, QueryTestCase, and Hit are assumed to be
// defined elsewhere in the project.
public class NdcgTracker {

    private final SearchService searchService;
    private final RelevanceEvaluator evaluator;
    private final OpenSearchClient client;

    public NdcgTracker(SearchService searchService,
                       RelevanceEvaluator evaluator,
                       OpenSearchClient client) {
        this.searchService = searchService;
        this.evaluator = evaluator;
        this.client = client;
    }

    public record NdcgSnapshot(
        Instant timestamp,
        String deploymentVersion,
        double overallNdcg,
        Map<String, Double> categoryNdcg,
        int queryCount,
        int failedQueries
    ) {}

    public NdcgSnapshot evaluate(String deploymentVersion,
                                 List<QueryTestCase> testSet) throws IOException {
        Map<String, List<Double>> categoryScores = new LinkedHashMap<>();
        int failedQueries = 0;

        for (QueryTestCase testCase : testSet) {
            try {
                var results = searchService.search(
                    testCase.tenantId(), testCase.query(), 0);
                List<String> returnedSlugs = results.hits().stream()
                    .map(Hit::id)
                    .toList();
                double ndcg = evaluator.computeNdcg(
                    returnedSlugs, testCase.judgments(), 5);
                categoryScores
                    .computeIfAbsent(testCase.category(), k -> new ArrayList<>())
                    .add(ndcg);
            } catch (Exception e) {
                // A failed query is counted but contributes no score, so a high
                // failedQueries value means the averages below are unreliable.
                failedQueries++;
            }
        }

        // Mean NDCG@5 per category; LinkedHashMap keeps the category order stable
        Map<String, Double> categoryAverages = categoryScores.entrySet().stream()
            .collect(Collectors.toMap(
                Map.Entry::getKey,
                e -> e.getValue().stream()
                    .mapToDouble(Double::doubleValue).average().orElse(0),
                (a, b) -> a,
                LinkedHashMap::new
            ));

        // Overall score is the unweighted mean of the category averages, so every
        // category counts equally regardless of how many queries it contains.
        double overallNdcg = categoryAverages.values().stream()
            .mapToDouble(Double::doubleValue).average().orElse(0);

        NdcgSnapshot snapshot = new NdcgSnapshot(
            Instant.now(),
            deploymentVersion,
            overallNdcg,
            categoryAverages,
            testSet.size(),
            failedQueries
        );

        // Store in the search-quality index
        storeSnapshot(snapshot);
        return snapshot;
    }

    // Assumes the client's JSON mapper can serialize records and Instant values.
    private void storeSnapshot(NdcgSnapshot snapshot) throws IOException {
        client.index(i -> i
            .index("search-quality-metrics")
            .document(snapshot)
        );
    }
}
Regression Detector
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.opensearch.client.json.JsonData;
import org.opensearch.client.opensearch.OpenSearchClient;
import org.opensearch.client.opensearch._types.SortOrder;

public class RegressionDetector {

    private static final double REGRESSION_THRESHOLD = 0.02;

    private final OpenSearchClient client;

    public RegressionDetector(OpenSearchClient client) {
        this.client = client;
    }

    public record RegressionAlert(
        String category,
        double previousNdcg,
        double currentNdcg,
        double delta,
        String previousVersion,
        String currentVersion
    ) {}

    public List<RegressionAlert> detectRegressions(
            NdcgTracker.NdcgSnapshot current) throws IOException {
        // Fetch the most recent snapshot stored before the current one
        var response = client.search(s -> s
            .index("search-quality-metrics")
            .query(q -> q.range(r -> r
                .field("timestamp")
                .lt(JsonData.of(current.timestamp().toString()))
            ))
            .sort(so -> so.field(f -> f
                .field("timestamp")
                .order(SortOrder.Desc)
            ))
            .size(1),
            NdcgTracker.NdcgSnapshot.class
        );

        if (response.hits().hits().isEmpty()) {
            return List.of(); // No previous snapshot to compare
        }

        var previous = response.hits().hits().get(0).source();
        if (previous == null) {
            return List.of(); // Hit returned without a source document
        }

        List<RegressionAlert> alerts = new ArrayList<>();
        for (var entry : current.categoryNdcg().entrySet()) {
            String category = entry.getKey();
            double currentNdcg = entry.getValue();
            // A category with no previous score defaults to 0.0, so a newly
            // added category can never be reported as a regression.
            double previousNdcg = previous.categoryNdcg()
                .getOrDefault(category, 0.0);
            double delta = currentNdcg - previousNdcg;

            if (delta < -REGRESSION_THRESHOLD) {
                alerts.add(new RegressionAlert(
                    category, previousNdcg, currentNdcg, delta,
                    previous.deploymentVersion(),
                    current.deploymentVersion()
                ));
            }
        }
        // Note: categories present only in the previous snapshot are not checked here.
        return alerts;
    }
}
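The detector iterates over the categories in the current snapshot, so a category that disappears entirely (for example, because every query in it failed) produces no alert. The sketch below shows one way to cover that case by walking the previous snapshot instead. It is an illustration, not part of the detector above: `CategoryComparison`, `Finding`, and the "missing" reason are hypothetical names, and the comparison works on plain maps so it can be tested without a cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical pure-function comparison of two per-category NDCG maps. */
final class CategoryComparison {

    record Finding(String category, double previous, double current, String reason) {}

    static List<Finding> compare(Map<String, Double> previous,
                                 Map<String, Double> current,
                                 double threshold) {
        List<Finding> findings = new ArrayList<>();
        // Walk the previous snapshot (sorted for stable output) so that
        // categories which vanished from the current run are still visited
        for (String category : new TreeSet<>(previous.keySet())) {
            double before = previous.get(category);
            Double after = current.get(category);
            if (after == null) {
                // No score at all in the current run: report it rather than skip it
                findings.add(new Finding(category, before, Double.NaN, "missing"));
            } else if (before - after > threshold) {
                findings.add(new Finding(category, before, after, "regression"));
            }
        }
        return findings;
    }
}
```

Keeping the comparison free of client calls also makes the threshold logic easy to unit test; the search for the previous snapshot can stay in `RegressionDetector`.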
The Measurement
Search quality tracking over 30 days:
| Week | Overall NDCG | Method Name | Concept | Error Message | Event |
|---|---|---|---|---|---|
| 1 | 0.77 | 0.89 | 0.71 | 0.72 | Baseline |
| 2 | 0.79 | 0.89 | 0.75 | 0.72 | Synonym update |
| 3 | 0.82 | 0.89 | 0.78 | 0.72 | Hybrid search launch |
| 4 | 0.78 | 0.76 | 0.78 | 0.72 | Mapping change (regression) |
The week 4 regression affected only the “method name” category, which dropped from 0.89 to 0.76. Overall NDCG fell by just 0.04, from 0.82 to 0.78. Without per-category tracking, most of this regression is averaged away: an overall drop of 0.04 is easy to dismiss as noise and could slip under a coarser alert threshold, but the category-specific drop of 0.13 is clearly a problem.
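How much of a category drop survives in the overall score depends on how many categories share the average. The sketch below works through that arithmetic under one assumption: the overall score is the unweighted mean of the category averages, as in `NdcgTracker`, and only one category changes. The class name `DilutionSketch` and the ten-category case are hypothetical.

```java
/** Hypothetical arithmetic helper: dilution of a single-category drop. */
final class DilutionSketch {

    /** Shift in an unweighted mean when one of categoryCount categories drops. */
    static double overallDrop(double categoryDrop, int categoryCount) {
        return categoryDrop / categoryCount;
    }

    /** Whether an alert on the overall score alone would fire. */
    static boolean overallAlertFires(double categoryDrop,
                                     int categoryCount,
                                     double threshold) {
        return overallDrop(categoryDrop, categoryCount) > threshold;
    }
}
```

With three categories, a 0.13 drop in one of them moves the mean by roughly 0.043, which is in line with the 0.04 overall change in the table. With ten categories, the same drop moves the mean by 0.013, below the 0.02 threshold, so an alert on overall NDCG alone would stay silent.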
The Decision Rule
Run NDCG evaluation on every deployment that changes mappings, analyzers, query templates, or synonym lists. Store results in a time-series index for historical comparison.
Alert on per-category regression, not just overall NDCG. A mapping change that improves concept queries (+0.02) while destroying method name queries (-0.10) has a net negative impact on user experience despite a modest overall NDCG change.
Include NDCG evaluation in the CI pipeline as a deployment gate. A deployment that reduces any category’s NDCG by more than 0.02 should require explicit approval before proceeding.
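A deployment gate of this kind can be reduced to an exit code that the CI system interprets. The following is a minimal sketch under stated assumptions: `DeploymentGate` and its `Regression` record are hypothetical, and the `approved` flag stands in for whatever explicit-approval mechanism the pipeline provides (a manual approval step, a signed-off label, or similar).

```java
import java.util.List;

/** Hypothetical CI gate: turns detected regressions into a process exit code. */
final class DeploymentGate {

    record Regression(String category, double delta) {}

    /** Returns 0 to let the pipeline continue, 1 to block the deployment. */
    static int exitCode(List<Regression> regressions, boolean approved) {
        if (regressions.isEmpty()) {
            return 0;
        }
        // Always report what regressed, even when the deployment is approved
        for (Regression r : regressions) {
            System.err.printf("NDCG regression in %s: %+.2f%n",
                    r.category(), r.delta());
        }
        if (approved) {
            System.err.println("Proceeding: regression explicitly approved");
            return 0;
        }
        return 1;
    }
}
```

A CI step would run the evaluation, pass the alerts to `exitCode`, and call `System.exit` with the result. Logging the regressions even on approval keeps a record of what was knowingly shipped.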