Systematic Field Weight Optimization

The Symptom

The team has tried title^3, title^5, and title^10. Each time, someone reports that results “feel” better or worse. There is no systematic exploration of the weight space. The weights are chosen by intuition, tested by anecdote, and deployed by committee.

The Internals

Field weights in a multi_match or bool query with per-field boosts define a multi-dimensional space. For the documentation platform with four searchable fields (title, body, code_snippets, api_method), the weight space has four dimensions. Manual exploration of this space is infeasible. A systematic approach evaluates a grid of weight combinations against the query test set and identifies the configuration that maximizes NDCG@5.
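
To make the idea of a "point in the weight space" concrete, the sketch below renders one set of weights as a multi_match request body. The class and method names (MultiMatchBodySketch, toQueryJson) are hypothetical and not part of the platform's code, and the JSON is assembled by hand only to keep the example free of dependencies; a real implementation would more likely go through the client's query builders.

```java
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, for illustration only.
final class MultiMatchBodySketch {

    // Renders one weight configuration as a multi_match request body.
    // Each map entry becomes a boosted field such as "title^3.0".
    static String toQueryJson(String userQuery, Map<String, Float> weights) {
        String fields = weights.entrySet().stream()
            .map(e -> "\"" + e.getKey() + "^" + e.getValue() + "\"")
            .collect(Collectors.joining(", "));
        return "{\"query\": {\"multi_match\": {"
            + "\"query\": \"" + escape(userQuery) + "\", "
            + "\"fields\": [" + fields + "]}}}";
    }

    // Minimal escaping of backslashes and double quotes. This is an
    // assumption made for the sketch; use a JSON library in real code.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

Passing an insertion-ordered map (for example a LinkedHashMap) keeps the field order stable, which makes generated queries easier to compare between runs.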

This is not machine learning. It is a parameter sweep with a clear objective function (NDCG@5) and a small parameter space. The sweep takes minutes, not hours.
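
For readers who want the objective function spelled out, here is a minimal sketch of NDCG@k for a single query. It assumes graded integer relevance judgments and the linear-gain form of DCG (rel / log2(rank + 1)); the text does not say which gain formulation the RelevanceEvaluator uses, so treat that choice, and the class name NdcgSketch, as assumptions.

```java
import java.util.Arrays;

// Hypothetical helper, for illustration only.
final class NdcgSketch {

    // DCG@k with linear gain: sum of rel_i / log2(i + 1) over ranks i = 1..k.
    static double dcgAtK(int[] relevance, int k) {
        double dcg = 0.0;
        for (int i = 0; i < Math.min(k, relevance.length); i++) {
            double log2OfRankPlusOne = Math.log(i + 2) / Math.log(2);
            dcg += relevance[i] / log2OfRankPlusOne;
        }
        return dcg;
    }

    // NDCG@k: DCG of the returned ranking divided by the DCG of the best
    // possible ordering of all judged documents for the query.
    static double ndcgAtK(int[] rankedRelevance, int[] allJudgedRelevance, int k) {
        int[] ideal = allJudgedRelevance.clone();
        Arrays.sort(ideal);                       // ascending
        for (int i = 0, j = ideal.length - 1; i < j; i++, j--) {
            int tmp = ideal[i];                   // reverse to descending
            ideal[i] = ideal[j];
            ideal[j] = tmp;
        }
        double idealDcg = dcgAtK(ideal, k);
        // A query with no relevant judgments contributes 0 rather than NaN.
        return idealDcg == 0.0 ? 0.0 : dcgAtK(rankedRelevance, k) / idealDcg;
    }
}
```

The overall score for a configuration would then be the mean of this value over all test-set queries, with k = 5.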

The Implementation

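import java.io.IOException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// OpenSearchClient and the project classes (RelevanceEvaluator,
// QueryTestSetLoader, EvaluationResult) are imported from elsewhere.
public class FieldWeightOptimizer {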

    private final OpenSearchClient client;
    private final RelevanceEvaluator evaluator;
    private final List<QueryTestSetLoader.RelevanceJudgment> testSet;

    public FieldWeightOptimizer(OpenSearchClient client,
            RelevanceEvaluator evaluator,
            List<QueryTestSetLoader.RelevanceJudgment> testSet) {
        this.client = client;
        this.evaluator = evaluator;
        this.testSet = testSet;
    }

    public record WeightConfig(
        float titleWeight,
        float bodyWeight,
        float codeWeight,
        float apiMethodWeight
    ) {
        public List<String> toFieldList() {
            return List.of(
                "title^" + titleWeight,
                "body^" + bodyWeight,
                "code_snippets^" + codeWeight,
                "api_method^" + apiMethodWeight
            );
        }
    }

    public record OptimizationResult(
        WeightConfig weights,
        double ndcg,
        Map<String, Double> categoryNdcg
    ) {}

    public List<OptimizationResult> gridSearch(String index) throws IOException {
        List<OptimizationResult> results = new ArrayList<>();

        // Define the search grid
        float[] titleWeights = {1.0f, 2.0f, 3.0f, 5.0f, 8.0f};
        float[] bodyWeights = {1.0f};  // Hold body at 1.0 as baseline
        float[] codeWeights = {0.2f, 0.5f, 1.0f, 2.0f};
        float[] apiWeights = {2.0f, 5.0f, 10.0f, 20.0f};

        for (float tw : titleWeights) {
            for (float bw : bodyWeights) {
                for (float cw : codeWeights) {
                    for (float aw : apiWeights) {
                        WeightConfig config = new WeightConfig(tw, bw, cw, aw);
                        EvaluationResult evalResult = evaluateWithWeights(
                            index, config);

                        results.add(new OptimizationResult(
                            config,
                            evalResult.overallNdcg(),
                            groupByCategory(evalResult)
                        ));
                    }
                }
            }
        }

        results.sort(Comparator.comparingDouble(
            OptimizationResult::ndcg).reversed());

        return results;
    }

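    private EvaluationResult evaluateWithWeights(String index,
            WeightConfig config) throws IOException {
        // The weights must reach the evaluator; passing null here would
        // score every configuration identically and make the sweep useless.
        // Assumption: the third argument of evaluate() accepts the boosted
        // field list used to build each multi_match query. Adjust this call
        // to match the actual RelevanceEvaluator signature.
        return evaluator.evaluate(index, testSet, config.toFieldList());
    }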

    public void printTopConfigurations(List<OptimizationResult> results, int n) {
        System.out.println("Top " + n + " field weight configurations:");
        System.out.println("=".repeat(80));

        for (int i = 0; i < Math.min(n, results.size()); i++) {
            OptimizationResult r = results.get(i);
            System.out.printf(
                "#%d NDCG@5=%.4f | title^%.0f body^%.0f code^%.1f api^%.0f%n",
                i + 1, r.ndcg(),
                r.weights().titleWeight(), r.weights().bodyWeight(),
                r.weights().codeWeight(), r.weights().apiMethodWeight()
            );

            for (var cat : r.categoryNdcg().entrySet()) {
                System.out.printf("    %-15s: %.4f%n", cat.getKey(), cat.getValue());
            }
            System.out.println();  // blank line between configurations
        }
    }
}

Example output (first three configurations shown):

Top 5 field weight configurations:
================================================================================
#1 NDCG@5=0.7890 | title^3 body^1 code^0.5 api^10
    method_name    : 0.8900
    concept        : 0.7600
    error_message  : 0.7200
    config_key     : 0.8100
    how_to         : 0.7650

#2 NDCG@5=0.7845 | title^3 body^1 code^1.0 api^10
    method_name    : 0.8850
    concept        : 0.7550
    error_message  : 0.7300
    config_key     : 0.8050
    how_to         : 0.7500

#3 NDCG@5=0.7810 | title^5 body^1 code^0.5 api^10
    method_name    : 0.9100
    concept        : 0.7200
    error_message  : 0.7150
    config_key     : 0.8200
    how_to         : 0.7400

The Measurement

The grid search explores 80 weight combinations (5 x 1 x 4 x 4). Each combination runs all 50 queries in the test set, for a total of 4,000 queries against a Testcontainers instance. At 15 ms per query, the sweep completes in about 60 seconds.
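
The same arithmetic can be kept next to the grid definition so the cost of widening the grid is visible before running it. This is a small sketch with hypothetical names (SweepCostSketch and its methods); the 15 ms latency is the figure quoted above, not a measured constant.

```java
// Hypothetical helper, for illustration only.
final class SweepCostSketch {

    // Number of grid points: the product of the per-field grid sizes.
    static int combinations(int... gridSizes) {
        int total = 1;
        for (int size : gridSizes) {
            total *= size;
        }
        return total;
    }

    // Estimated wall-clock time, assuming queries run sequentially
    // at a fixed average latency.
    static double estimatedSeconds(int combinations, int queriesPerCombination,
                                   double msPerQuery) {
        return combinations * (long) queriesPerCombination * msPerQuery / 1000.0;
    }
}
```

With grid sizes 5, 1, 4, and 4, 50 queries per combination, and 15 ms per query, this reproduces the figures above: 80 combinations and 60 seconds. Adding a second body weight would double both.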

The top-5 configurations cluster around title^3, api_method^10, and code_snippets^0.5. This makes intuitive sense: API method matches are the highest-signal results for the documentation platform, and code snippet matches are noisy (many documents contain similar code patterns).

The Decision Rule

Run the grid search when initially configuring field weights or after a significant change to the analysis pipeline. The sweep is cheap (60 seconds) and replaces hours of manual experimentation.

Choose the configuration that maximizes overall NDCG@5 while maintaining acceptable per-category scores. If the top configuration has a category score below the CI threshold, choose the next configuration that satisfies all constraints.
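
One way to express this rule in code is sketched below. Candidate is a hypothetical stand-in for OptimizationResult, reduced to the fields the rule needs, and the threshold value is whatever the CI pipeline enforces; neither name comes from the platform's code.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, for illustration only.
final class ConstrainedSelectionSketch {

    // Reduced stand-in for OptimizationResult.
    record Candidate(String label, double ndcg, Map<String, Double> categoryNdcg) {}

    // Expects candidates sorted by overall NDCG, best first, which is the
    // order gridSearch returns. Picks the first candidate whose every
    // category score meets the threshold; empty if none qualifies.
    static Optional<Candidate> selectBest(List<Candidate> sortedByNdcgDesc,
                                          double categoryThreshold) {
        return sortedByNdcgDesc.stream()
            .filter(c -> c.categoryNdcg().values().stream()
                .allMatch(score -> score >= categoryThreshold))
            .findFirst();
    }
}
```

An empty result means no point on the grid satisfies every category constraint, which suggests revisiting the grid or the thresholds rather than deploying the unconstrained winner.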

Beware overfitting to the test set. The test set has 50 queries; the production query distribution has thousands of patterns. If the top configuration scores 0.79 and the second-best scores 0.78, a gap that small on a sample that size is within noise and is unlikely to generalize. In that case, choose the simpler configuration (the one closer to the current weights) unless the improvement is consistent across categories.
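
A rough way to check whether a gap between two configurations is more than noise is a paired bootstrap over the test queries. This technique is not part of the workflow described above; it is offered as a sketch, and it assumes per-query NDCG scores for both configurations are available as arrays in the same query order. The class and method names are hypothetical.

```java
import java.util.Random;

// Hypothetical helper, for illustration only.
final class PairedBootstrapSketch {

    // Resamples the queries with replacement and returns the fraction of
    // resamples in which configuration A's total NDCG strictly exceeds B's.
    // Ties count as non-wins. A fraction close to 1.0 suggests A is reliably
    // better on this test set; a fraction well below that suggests the
    // observed gap could be an artifact of which queries were sampled.
    static double fractionAWins(double[] ndcgA, double[] ndcgB,
                                int resamples, long seed) {
        if (ndcgA.length != ndcgB.length || ndcgA.length == 0) {
            throw new IllegalArgumentException(
                "Score arrays must be non-empty and of equal length");
        }
        Random random = new Random(seed);  // seeded for repeatable runs
        int n = ndcgA.length;
        int wins = 0;
        for (int r = 0; r < resamples; r++) {
            double diffSum = 0.0;
            for (int i = 0; i < n; i++) {
                int q = random.nextInt(n);  // pick a query with replacement
                diffSum += ndcgA[q] - ndcgB[q];
            }
            if (diffSum > 0) {
                wins++;
            }
        }
        return (double) wins / resamples;
    }
}
```

Even a high fraction only says the difference is stable on these 50 queries; it says nothing about query patterns the test set does not cover.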