Reading and Interpreting Explain Output

The Symptom

A product manager reports that the documentation page “Quick Start Guide for Authentication” ranks above “OAuth2 Token Refresh Policy” when a user searches for “token refresh.” The product manager expects the OAuth2 page to rank first. A developer responds by adding "boost": 5 to the title field, which fixes this one query but breaks ten others. Understanding why the Quick Start Guide scored higher requires reading the explain output, not guessing at boost values.

The Internals

The explain API returns a tree. Each node describes one component of the score calculation, and each level breaks its parent's value into parts. The root node is the final score. For the query examined here, the root's children are per-term groups, and each group's children are the per-field BM25 weights for that term.

For a bool query with should clauses across multiple fields, the tree looks like:

sum of:
  max of:
    weight(title:token) [BM25]
    weight(body:token) [BM25]
  max of:
    weight(title:refresh) [BM25]
    weight(body:refresh) [BM25]

The multi_match query with best_fields type takes the maximum score across fields for each term, then sums the per-term maxima. This means a document that mentions “token” heavily in the title and “refresh” heavily in the body can outscore a document that mentions both terms moderately in the body only.
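
The combination rule in the tree above can be recomputed by hand. The following is a minimal sketch, not the client's API: ExplainNode and ExplainTreeSketch are hypothetical stand-ins for the real explanation objects, and the leaf values are copied from the sample explain output shown later in this section. It only mirrors the tree as drawn here and says nothing about how the engine computes scores internally.

```java
import java.util.List;

/** Hypothetical stand-in for one explain node; not the client's ExplanationDetail. */
final class ExplainNode {
    final String description;        // "sum of:", "max of:", or a leaf such as "weight(title:token) [BM25]"
    final double leafValue;          // only meaningful when children is empty
    final List<ExplainNode> children;

    private ExplainNode(String description, double leafValue, List<ExplainNode> children) {
        this.description = description;
        this.leafValue = leafValue;
        this.children = children;
    }

    static ExplainNode leaf(String description, double value) {
        return new ExplainNode(description, value, List.of());
    }

    static ExplainNode sumOf(ExplainNode... children) {
        return new ExplainNode("sum of:", 0.0, List.of(children));
    }

    static ExplainNode maxOf(ExplainNode... children) {
        return new ExplainNode("max of:", 0.0, List.of(children));
    }

    /** Recomputes this node's value bottom-up from its children. */
    double score() {
        if (children.isEmpty()) {
            return leafValue;
        }
        if (description.startsWith("max of")) {
            return children.stream().mapToDouble(ExplainNode::score).max().orElse(0.0);
        }
        // Assumption for this sketch: every other inner node is a sum.
        return children.stream().mapToDouble(ExplainNode::score).sum();
    }
}

final class ExplainTreeSketch {
    /** Quick Start Guide tree, using the leaf values from this section's sample output. */
    static ExplainNode quickStartTree() {
        return ExplainNode.sumOf(
            ExplainNode.maxOf(
                ExplainNode.leaf("weight(title:token) [BM25]", 3.4102),
                ExplainNode.leaf("weight(body:token) [BM25]", 0.8921)),
            ExplainNode.maxOf(
                ExplainNode.leaf("weight(title:refresh) [BM25]", 0.0),
                ExplainNode.leaf("weight(body:refresh) [BM25]", 1.8239)));
    }

    public static void main(String[] args) {
        System.out.println(quickStartTree().score());
    }
}
```

The recomputed root comes out at approximately 5.2341, the Quick Start Guide's total in the sample output: the title weight wins the “token” group, the body weight wins the “refresh” group, and the two winners are added.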

The Implementation

A batch explain utility for comparing scores across candidate documents:

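// Package names below assume the opensearch-java client.
import java.io.IOException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

import org.opensearch.client.json.JsonData;
import org.opensearch.client.opensearch.OpenSearchClient;
import org.opensearch.client.opensearch._types.query_dsl.Query;
import org.opensearch.client.opensearch.core.ExplainResponse;
import org.opensearch.client.opensearch.core.explain.ExplanationDetail;

public class RelevanceInspector {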

    private final OpenSearchClient client;

    public RelevanceInspector(OpenSearchClient client) {
        this.client = client;
    }

    public record ScoreBreakdown(
        String documentId,
        float totalScore,
        boolean matched,
        String explanationTree
    ) {}

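    /**
     * Explain scores for a query against a list of document IDs.
     * Useful for understanding why document A ranks above document B.
     * Issues one explain request per document ID, so keep the list short.
     * Results are returned sorted by descending score.
     */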
    public List<ScoreBreakdown> batchExplain(
            String index, Query query, List<String> documentIds)
            throws IOException {

        List<ScoreBreakdown> results = new ArrayList<>();

        for (String docId : documentIds) {
            ExplainResponse<JsonData> response = client.explain(e -> e
                    .index(index)
                    .id(docId)
                    .query(query),
                JsonData.class
            );

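            // Guard against a response that carries no explanation tree.
            ExplanationDetail explanation = response.explanation();
            String tree = explanation == null ? "" : formatExplanation(explanation, 0);

            results.add(new ScoreBreakdown(
                docId,
                explanation == null ? 0f : (float) explanation.value(),
                response.matched(),
                tree
            ));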
        }

        results.sort(Comparator.comparingDouble(ScoreBreakdown::totalScore).reversed());
        return results;
    }

    private String formatExplanation(ExplanationDetail detail, int depth) {
        StringBuilder sb = new StringBuilder();
        sb.append("  ".repeat(depth))
          .append(String.format("%.4f", detail.value()))
          .append(" ")
          .append(detail.description())
          .append("\n");

        for (ExplanationDetail child : detail.details()) {
            sb.append(formatExplanation(child, depth + 1));
        }
        return sb.toString();
    }
}

Using it to diagnose the ranking problem:

Query query = Query.of(q -> q
    .multiMatch(mm -> mm
        .query("token refresh")
        .fields("title^2", "body")
        .type(TextQueryType.BestFields)
    )
);

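// Assumes an already configured OpenSearchClient named client.
RelevanceInspector inspector = new RelevanceInspector(client);

// ScoreBreakdown is nested in RelevanceInspector, so qualify it here.
List<RelevanceInspector.ScoreBreakdown> breakdowns = inspector.batchExplain(
    "docs-v1",
    query,
    List.of(
        "tenant-acme:quick-start-auth",
        "tenant-acme:oauth2-token-refresh"
    )
);

for (RelevanceInspector.ScoreBreakdown bd : breakdowns) {
    System.out.println("=== " + bd.documentId() + " (score: " + bd.totalScore() + ") ===");
    System.out.println(bd.explanationTree());
}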

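The output, trimmed here to the top levels of each tree, reveals the cause. The Quick Start Guide has “token” in its title (boosted 2x) and “refresh” in its body. The OAuth2 page has both terms in its body but neither in its title. On “refresh” the two pages score almost the same (1.8239 versus 1.9678), so the ranking is decided by “token”: the boosted title match (3.4102) beats the OAuth2 page's body match (2.9034) by more than the OAuth2 page gains on “refresh.”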

=== tenant-acme:quick-start-auth (score: 5.2341) ===
5.2341 sum of:
  3.4102 max of:
    3.4102 weight(title:token) [BM25]  <-- title match, boosted 2x
    0.8921 weight(body:token) [BM25]
  1.8239 max of:
    0.0000 weight(title:refresh) [BM25]  <-- no title match
    1.8239 weight(body:refresh) [BM25]

=== tenant-acme:oauth2-token-refresh (score: 4.8712) ===
4.8712 sum of:
  2.9034 max of:
    0.0000 weight(title:token) [BM25]
    2.9034 weight(body:token) [BM25]
  1.9678 max of:
    0.0000 weight(title:refresh) [BM25]
    1.9678 weight(body:refresh) [BM25]

The fix is not to increase the boost. The fix is to recognize that the best_fields multi-match type rewards having any term in the title, even when the document is not fundamentally about that term. For this query pattern, the cross_fields type would treat the title and body as a single combined field, reducing the accidental title-match advantage.
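
The per-term comparison behind this diagnosis can be made explicit. The sketch below is illustrative only: TermGapSketch and perTermGap are hypothetical names, and the inputs are the per-term maxima read off the two breakdowns above. It subtracts one document's per-term scores from the other's to show which term carries the ranking difference.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class TermGapSketch {
    /**
     * Per-term score difference (first minus second). Each input maps a query
     * term to that document's per-term maximum, read off the explain tree.
     */
    static Map<String, Double> perTermGap(Map<String, Double> first, Map<String, Double> second) {
        Map<String, Double> gap = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : first.entrySet()) {
            gap.put(e.getKey(), e.getValue() - second.getOrDefault(e.getKey(), 0.0));
        }
        // Terms that only the second document matched count fully against the first.
        for (Map.Entry<String, Double> e : second.entrySet()) {
            gap.putIfAbsent(e.getKey(), -e.getValue());
        }
        return gap;
    }

    public static void main(String[] args) {
        // Per-term maxima copied from the two breakdowns above.
        Map<String, Double> quickStart = new LinkedHashMap<>();
        quickStart.put("token", 3.4102);
        quickStart.put("refresh", 1.8239);
        Map<String, Double> oauth2 = new LinkedHashMap<>();
        oauth2.put("token", 2.9034);
        oauth2.put("refresh", 1.9678);

        perTermGap(quickStart, oauth2).forEach(
            (term, delta) -> System.out.printf("%-8s %+.4f%n", term, delta));
    }
}
```

With these inputs the gap is roughly +0.5068 on “token” and -0.1439 on “refresh.” The two add up to about 0.3629, the difference between the totals 5.2341 and 4.8712, so the entire ranking advantage comes from the boosted title match on a single term.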

The Measurement

Track explain output as part of the relevance evaluation pipeline. For the top 10 results of each test query, store the score breakdown. When a relevance change is deployed, compare the before and after breakdowns to understand not just whether NDCG changed, but why specific documents moved up or down.
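
One way to compare stored breakdowns is to diff the per-document scores of a test query before and after a change. The sketch below assumes the pipeline has already reduced each run to a map from document ID to total score. BreakdownDiffSketch, Movement, rank, and diff are hypothetical names, and the "after" scores in main are made up purely to show the call.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class BreakdownDiffSketch {
    /** One document's movement between two runs of the same test query. */
    record Movement(String documentId, int rankBefore, int rankAfter, double scoreDelta) {}

    /** Orders document IDs by descending score; ties are broken by ID. */
    static List<String> rank(Map<String, Double> scores) {
        List<String> ids = new ArrayList<>(scores.keySet());
        ids.sort(Comparator.comparingDouble((String id) -> scores.get(id)).reversed()
                .thenComparing(Comparator.<String>naturalOrder()));
        return ids;
    }

    /** Reports rank and score changes for documents present in both runs, in "after" order. */
    static List<Movement> diff(Map<String, Double> before, Map<String, Double> after) {
        List<String> rankedBefore = rank(before);
        List<String> rankedAfter = rank(after);
        List<Movement> moves = new ArrayList<>();
        for (String id : rankedAfter) {
            if (!before.containsKey(id)) {
                continue; // newly retrieved documents have no "before" to compare against
            }
            moves.add(new Movement(
                id,
                rankedBefore.indexOf(id) + 1,   // ranks are 1-based
                rankedAfter.indexOf(id) + 1,
                after.get(id) - before.get(id)));
        }
        return moves;
    }

    public static void main(String[] args) {
        Map<String, Double> before = new LinkedHashMap<>();
        before.put("tenant-acme:quick-start-auth", 5.2341);      // from the breakdown in this section
        before.put("tenant-acme:oauth2-token-refresh", 4.8712);  // from the breakdown in this section

        Map<String, Double> after = new LinkedHashMap<>();
        after.put("tenant-acme:quick-start-auth", 4.0);          // hypothetical value, illustration only
        after.put("tenant-acme:oauth2-token-refresh", 5.0);      // hypothetical value, illustration only

        diff(before, after).forEach(System.out::println);
    }
}
```

A rank change points at which documents to inspect. Storing the full explanation tree alongside each score then shows which node's value actually changed.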

The Decision Rule

Use the explain API for diagnostic purposes when a specific ranking is wrong. Do not use it as the primary tool for relevance tuning. The explain API answers “why did this document get this score?” The evaluation framework (chapter 9) answers “is the overall ranking getting better or worse?”

When explain output shows that a document ranks unexpectedly because of a single field match that outweighs more relevant content matches, the fix is in the query structure (multi-match type, field grouping), not in boost values. Boost values are a scalpel. Query structure is the operating table.