Score Normalization Across Shards and Indices

The Symptom

The documentation search platform runs a shared index with documents from multiple tenants. A search for “authentication” scoped to tenant A returns a top result with score 4.2. The same search scoped to tenant B returns a top result with score 7.8. The higher score does not mean tenant B’s documentation is better. It reflects the term statistics behind each score: tenant B has fewer total documents and “authentication” appears in a smaller share of them, so the term is rarer (higher IDF) in their slice of the data.

A different problem: the platform splits each tenant’s documentation by version into separate indices (docs-acme-v3, docs-acme-v4). A cross-version search for “connection pool” returns version 3 results above version 4 results because each index is scored against its own statistics, and the smaller version 3 index produces higher IDF values for the query terms.

The Internals

BM25 scores are not absolute values. They are relative to the corpus statistics (document count, average document length, term frequency distribution) of the index and shard where they were computed. Two consequences:

Scores are not comparable across indices. An index with 1,000 documents produces different IDF values than an index with 100,000 documents for the same term. A score of 5.0 in a small index and a score of 5.0 in a large index represent completely different levels of relevance.

Scores can differ across shards within the same index. By default, OpenSearch computes BM25 using shard-local statistics. If documents are unevenly distributed across shards (which happens with custom routing, for example routing by tenant, or simply by chance when shards are small), the same query term has different IDF values on different shards.

The IDF component drives the variance. Consider “authentication” across two shards:

Metric	Shard 0	Shard 1
Total docs	45,000	55,000
Docs containing “authentication”	800	1,200
IDF	$\ln(1 + \frac{45000 - 800 + 0.5}{800 + 0.5}) = 4.03$	$\ln(1 + \frac{55000 - 1200 + 0.5}{1200 + 0.5}) = 3.82$

The same term, the same index, different scores. The IDF difference of about 5% seems small, but it compounds across multi-term queries and can swap the ordering of closely-scored documents.
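The IDF for each shard can be computed directly from the counts in the table. The following is a minimal sketch that assumes the BM25 IDF formula with the natural logarithm; the class and method names (ShardIdf, idf) are illustrative and not part of any OpenSearch API.

```java
// Illustrative sketch: computes per-shard BM25 IDF from the counts in the table.
// ShardIdf and idf are hypothetical names, not part of any OpenSearch API.
final class ShardIdf {

    // BM25 IDF: ln(1 + (N - n + 0.5) / (n + 0.5)), where N is the shard's
    // document count and n is the number of documents on that shard
    // that contain the term.
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    public static void main(String[] args) {
        double shard0 = idf(45_000, 800);    // about 4.03
        double shard1 = idf(55_000, 1_200);  // about 3.82
        double gapPercent = 100.0 * (shard0 - shard1) / shard1;

        System.out.printf("shard 0 IDF = %.2f%n", shard0);
        System.out.printf("shard 1 IDF = %.2f%n", shard1);
        System.out.printf("relative gap = %.1f%%%n", gapPercent);
    }
}
```

Because IDF multiplies into the score of every matching query term, the gap between shards carries over to every document scored on them.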

The Implementation

DFS_QUERY_THEN_FETCH

The direct solution for shard-level scoring inconsistency is DFS_QUERY_THEN_FETCH. This search type adds a preliminary phase: before executing the query, OpenSearch collects term statistics from all shards, computes global IDF values, and distributes them back to each shard for scoring.

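// HARDENED: Global scoring for consistent cross-shard results
// Use when per-shard document counts are below 50,000 or highly skewed
// Imports assume the opensearch-java client.
import org.opensearch.client.opensearch._types.FieldValue;
import org.opensearch.client.opensearch._types.SearchType;
import org.opensearch.client.opensearch.core.SearchRequest;

SearchRequest request = SearchRequest.of(s -> s
    .index("docs-v1")
    // Collect term statistics from every shard before scoring
    .searchType(SearchType.DfsQueryThenFetch)
    .query(q -> q
        .bool(b -> b
            // The filter restricts results to one tenant without contributing to the score
            .filter(f -> f.term(t -> t.field("tenant_id").value(FieldValue.of(tenantId))))
            .must(mu -> mu.match(m -> m.field("body").query(FieldValue.of(userQuery))))
        )
    )
);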

// FRAGILE: Default QUERY_THEN_FETCH on a 3-shard index with 10,000 docs
// Per-shard IDF is computed on ~3,300 docs per shard, producing
// noticeable scoring variance for medium-frequency terms.

SearchRequest request = SearchRequest.of(s -> s
    .index("docs-v1")
    .query(q -> q
        .bool(b -> b
            .filter(f -> f.term(t -> t.field("tenant_id").value(FieldValue.of(tenantId))))
            .must(mu -> mu.match(m -> m.field("body").query(FieldValue.of(userQuery))))
        )
    )
);
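Conceptually, the DFS phase replaces each shard's local counts with the totals across all shards before scoring. The sketch below models only that arithmetic, using the two-shard counts from the earlier table. It is not how OpenSearch implements the phase internally, and the names GlobalIdf, idf, and globalIdf are illustrative.

```java
// Conceptual model of the statistics that a DFS pre-phase aggregates.
// This is NOT OpenSearch's internal code; all names here are hypothetical.
final class GlobalIdf {

    // BM25 IDF with the natural logarithm: ln(1 + (N - n + 0.5) / (n + 0.5)).
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Sums per-shard document counts and document frequencies, then computes
    // a single IDF from the totals. Every shard would score with this value.
    static double globalIdf(long[] shardDocCounts, long[] shardDocFreqs) {
        if (shardDocCounts.length != shardDocFreqs.length) {
            throw new IllegalArgumentException("one count and one frequency per shard");
        }
        long totalDocs = 0;
        long totalDocFreq = 0;
        for (int i = 0; i < shardDocCounts.length; i++) {
            totalDocs += shardDocCounts[i];
            totalDocFreq += shardDocFreqs[i];
        }
        return idf(totalDocs, totalDocFreq);
    }

    public static void main(String[] args) {
        long[] docCounts = {45_000, 55_000};
        long[] docFreqs = {800, 1_200};
        // About 3.91, which lies between the two shard-local values.
        System.out.printf("global IDF = %.2f%n", globalIdf(docCounts, docFreqs));
    }
}
```

With one shared IDF, two documents that match the term equally well receive the same IDF contribution regardless of which shard holds them.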

Cross-Index Score Normalization

When searching across multiple indices (e.g., documentation versions), scores from different indices are fundamentally incomparable. Three strategies:

Strategy 1: Single index with version as a field. Store all versions in one index with a version keyword field. BM25 statistics are computed across all versions, producing consistent scores. Filter by version in a bool filter clause (which does not affect scoring).

// HARDENED: Single index, version as filter
SearchRequest request = SearchRequest.of(s -> s
    .index("docs-v1")
    .query(q -> q
        .bool(b -> b
            .filter(f -> f.term(t -> t.field("version").value(FieldValue.of("4.0"))))
            .must(mu -> mu.match(m -> m.field("body").query(FieldValue.of(userQuery))))
        )
    )
);

Strategy 2: Rank-based merging. When separate indices are required (for operational reasons like independent lifecycle management), merge results by rank position rather than score. The first result from each index is rank 1, regardless of score. Interleave by rank.
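Rank-based merging can be sketched as a round-robin interleave over the per-index result lists. The helper below is a minimal illustration that works on plain lists of document IDs; RankMerge and interleave are hypothetical names, and the IDs would in practice be read from each index's search response.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of rank-based merging; RankMerge is a hypothetical helper.
final class RankMerge {

    // Round-robin interleave: takes rank 1 from every list, then rank 2, and so on.
    // Each inner list holds document IDs already sorted by score within one index.
    // Scores are deliberately ignored because they are not comparable across indices.
    static List<String> interleave(List<List<String>> rankedPerIndex, int limit) {
        List<String> merged = new ArrayList<>();
        int maxDepth = 0;
        for (List<String> ranked : rankedPerIndex) {
            maxDepth = Math.max(maxDepth, ranked.size());
        }
        for (int rank = 0; rank < maxDepth && merged.size() < limit; rank++) {
            for (List<String> ranked : rankedPerIndex) {
                if (rank < ranked.size() && merged.size() < limit) {
                    merged.add(ranked.get(rank));
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> v4 = List.of("v4-pooling", "v4-timeouts", "v4-tls");
        List<String> v3 = List.of("v3-pooling", "v3-timeouts");
        System.out.println(interleave(List.of(v4, v3), 4));
        // prints [v4-pooling, v3-pooling, v4-timeouts, v3-timeouts]
    }
}
```

The order of the input lists decides which index wins at each rank, so in this sketch the preferred index (for example, the newest version) should be passed first.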

Strategy 3: Score normalization by percentile. Compute the score distribution within each index’s results and normalize to a common scale. This is complex, brittle, and rarely worth the implementation cost.
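One simple form of percentile normalization maps each score to the fraction of scores in the same result list that are at or below it. The sketch below assumes that definition, which is only one of several possible ones; PercentileNormalizer and toPercentiles are hypothetical names.

```java
import java.util.Arrays;

// Illustrative sketch of per-index percentile normalization.
// PercentileNormalizer is a hypothetical helper, not a library class.
final class PercentileNormalizer {

    // Maps each score to the fraction of scores in the same list that are
    // less than or equal to it, producing values in (0, 1].
    static double[] toPercentiles(double[] scores) {
        double[] percentiles = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            int atOrBelow = 0;
            for (double other : scores) {
                if (other <= scores[i]) {
                    atOrBelow++;
                }
            }
            percentiles[i] = (double) atOrBelow / scores.length;
        }
        return percentiles;
    }

    public static void main(String[] args) {
        double[] smallIndex = {7.8, 6.1};
        double[] largeIndex = {4.2, 4.0, 3.9, 1.1};
        System.out.println(Arrays.toString(toPercentiles(smallIndex)));
        // prints [1.0, 0.5]
        System.out.println(Arrays.toString(toPercentiles(largeIndex)));
        // prints [1.0, 0.75, 0.5, 0.25]
    }
}
```

Both top hits map to 1.0 even though their raw scores were 7.8 and 4.2, and every normalized value shifts when the number of hits fetched per index changes. That sensitivity is part of what makes this strategy brittle.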

The Measurement

Measure scoring variance with a diagnostic query:

// Execute the same query with and without DFS to measure score variance.
// explain(true) attaches the scoring breakdown to each hit for closer inspection.
SearchResponse<DocPage> defaultScoring = client.search(s -> s
        .index("docs-v1")
        .searchType(SearchType.QueryThenFetch)
        .explain(true)
        .query(q -> q.match(m -> m.field("body").query(FieldValue.of("authentication"))))
        .size(10),
    DocPage.class
);

SearchResponse<DocPage> dfsScoring = client.search(s -> s
        .index("docs-v1")
        .searchType(SearchType.DfsQueryThenFetch)
        .explain(true)
        .query(q -> q.match(m -> m.field("body").query(FieldValue.of("authentication"))))
        .size(10),
    DocPage.class
);

// Compare scores and rank order between the two approaches.
// Bound the loop by the shorter list so a mismatch in hit counts cannot throw.
int compared = Math.min(defaultScoring.hits().hits().size(), dfsScoring.hits().hits().size());
for (int i = 0; i < compared; i++) {
    var defaultHit = defaultScoring.hits().hits().get(i);
    var dfsHit = dfsScoring.hits().hits().get(i);
    System.out.printf("Rank %d: default=%s (%.4f), dfs=%s (%.4f)%n",
        i + 1,
        defaultHit.id(), defaultHit.score(),
        dfsHit.id(), dfsHit.score()
    );
}

If the rank order changes between the two modes, the per-shard statistics are causing scoring variance that affects user-visible results.
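The rank-order check can be reduced to a single number for logging or alerting. The helper below is a small sketch that counts the positions at which two ranked ID lists disagree; RankDiff and changedPositions are hypothetical names, and the ID lists are assumed to have been collected from the two responses in rank order.

```java
import java.util.List;

// Illustrative sketch: quantifies how much the rank order differs between
// two result lists. RankDiff is a hypothetical helper.
final class RankDiff {

    // Counts positions where the two lists hold different document IDs.
    // Positions that exist in only one list also count as changed.
    static int changedPositions(List<String> defaultIds, List<String> dfsIds) {
        int shared = Math.min(defaultIds.size(), dfsIds.size());
        int changed = Math.abs(defaultIds.size() - dfsIds.size());
        for (int i = 0; i < shared; i++) {
            if (!defaultIds.get(i).equals(dfsIds.get(i))) {
                changed++;
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        List<String> defaultIds = List.of("doc-1", "doc-2", "doc-3", "doc-4");
        List<String> dfsIds = List.of("doc-1", "doc-3", "doc-2", "doc-4");
        System.out.println(changedPositions(defaultIds, dfsIds));
        // prints 2
    }
}
```

A result of 0 across a representative set of queries suggests that the default search type is already producing the same visible ranking as DFS.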

The Decision Rule

Use DFS_QUERY_THEN_FETCH when the index has 3 or more shards and fewer than 50,000 documents per shard, or when tenant-filtered queries reduce the effective corpus to fewer than 10,000 documents per shard. The extra round trip adds 2-5ms per query.

Use the default QUERY_THEN_FETCH when each shard has more than 50,000 documents and the data distribution across shards is roughly uniform. At this scale, per-shard IDF typically converges to global IDF within a margin that does not affect visible ranking.

Prefer a single-index-with-filter approach over multi-index search when cross-collection scoring consistency matters. The operational overhead of managing one larger index is lower than the relevance engineering overhead of normalizing scores across multiple indices.
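The two search-type rules above can be encoded as a small helper. This is a sketch under stated assumptions: the thresholds come from the rules above, but the SearchTypeRule class, the boolean roughlyUniform input, and the MEASURE_FIRST outcome for cases that neither rule covers are additions made here for illustration.

```java
// Illustrative encoding of the decision rule. SearchTypeRule is hypothetical.
final class SearchTypeRule {

    enum Choice { DFS_QUERY_THEN_FETCH, QUERY_THEN_FETCH, MEASURE_FIRST }

    // filteredDocsPerShard is the per-shard document count left after the
    // tenant filter; pass docsPerShard when the query has no such filter.
    static Choice choose(int shardCount, long docsPerShard,
                         long filteredDocsPerShard, boolean roughlyUniform) {
        boolean smallShards = shardCount >= 3 && docsPerShard < 50_000;
        boolean smallEffectiveCorpus = filteredDocsPerShard < 10_000;
        if (smallShards || smallEffectiveCorpus) {
            return Choice.DFS_QUERY_THEN_FETCH;
        }
        if (docsPerShard > 50_000 && roughlyUniform) {
            return Choice.QUERY_THEN_FETCH;
        }
        // Assumption: neither rule applies, so run the diagnostic comparison
        // from the measurement section before choosing.
        return Choice.MEASURE_FIRST;
    }

    public static void main(String[] args) {
        System.out.println(choose(3, 20_000, 20_000, true));
        System.out.println(choose(5, 200_000, 150_000, true));
        System.out.println(choose(5, 200_000, 150_000, false));
    }
}
```

Returning a third outcome keeps the helper from guessing in cases the rule does not address, such as large shards with a skewed distribution.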