search at depth

Building the Query Test Set for Relevance Evaluation

Chapter 23 of 60

The Symptom

A developer changes the title field boost from 3 to 5. The search results “look better” for the three queries they tested manually. The change is deployed. Support tickets arrive: method name searches that used to return the API reference now return a conceptual guide with the method name in the title. The three manually tested queries improved. Forty other query patterns regressed.

Manual spot-checking is not relevance evaluation. It is confirmation bias with a browser tab.

The Internals

A query test set (also called a judgment list or relevance assessment) is a dataset mapping queries to their expected results, graded by relevance. It serves the same purpose as a unit test suite: it codifies expected behavior and detects regressions when the system changes.

The test set must cover the distribution of actual user queries. For the documentation platform, analysis of search logs reveals five query categories:

  1. Exact method names (25% of queries): getConnection, HttpClient.Builder, setRetryPolicy
  2. Concept searches (30%): “connection pooling,” “retry policy,” “SSL configuration”
  3. Error messages (15%): “Connection refused,” “NullPointerException in UserService”
  4. Configuration keys (15%): “spring.datasource.url,” “server.port,” “logging.level”
  5. How-to questions (15%): “how to configure connection timeout,” “debug slow queries”

Each category exercises different parts of the analysis and scoring pipeline. Method name queries depend on the code analyzer. Concept searches depend on BM25 and field boosting. Error message queries depend on phrase matching. Configuration key queries depend on the whitespace analyzer. How-to queries depend on natural language analysis.
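To estimate the category distribution from raw search logs, a rough rule-based classifier is often enough. The sketch below is illustrative only: the class name `QueryCategorizer`, the regular expressions, and the rule order are all assumptions, not the platform's actual log analysis. The category labels match the ones used in the test set file.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Rule-based bucketing of logged queries; every rule here is an illustrative assumption. */
final class QueryCategorizer {

    // Lowercase dotted keys such as server.port or spring.datasource.url.
    private static final Pattern CONFIG_KEY =
        Pattern.compile("^[a-z][a-z0-9_-]*(\\.[a-z][a-z0-9_-]*)+$");

    // A single token made of identifier segments, optionally dot-separated.
    private static final Pattern IDENTIFIER =
        Pattern.compile("^[A-Za-z_][A-Za-z0-9_]*(\\.[A-Za-z_][A-Za-z0-9_]*)*$");

    // A lowercase letter followed by an uppercase one, as in getConnection.
    private static final Pattern CAMEL_HUMP = Pattern.compile("[a-z][A-Z]");

    // Words that hint at a pasted error message (hypothetical list).
    private static final Pattern ERROR_WORDS =
        Pattern.compile("exception|error|refused|failed", Pattern.CASE_INSENSITIVE);

    // Query openers that hint at a task-oriented question (hypothetical list).
    private static final Pattern HOW_TO =
        Pattern.compile("^(how (to|do|can)|debug|troubleshoot)\\b", Pattern.CASE_INSENSITIVE);

    /** Returns one of the five category labels; rules are checked from most to least specific. */
    static String categorize(String rawQuery) {
        String q = rawQuery.trim();
        if (CONFIG_KEY.matcher(q).matches()) return "config_key";
        boolean looksLikeCode = q.contains(".") || CAMEL_HUMP.matcher(q).find();
        if (IDENTIFIER.matcher(q).matches() && looksLikeCode) return "method_name";
        if (ERROR_WORDS.matcher(q).find()) return "error_message";
        if (HOW_TO.matcher(q).find()) return "how_to";
        return "concept"; // fallback bucket for everything else
    }

    /** Counts queries per category; TreeMap keeps the output order stable. */
    static Map<String, Long> countByCategory(List<String> loggedQueries) {
        return loggedQueries.stream().collect(
            Collectors.groupingBy(QueryCategorizer::categorize, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
            "getConnection", "HttpClient.Builder", "connection pooling",
            "Connection refused", "server.port", "how to configure connection timeout");
        System.out.println(countByCategory(sample));
    }
}
```

Heuristics like these misclassify edge cases (a concept query such as "error handling" lands in the error bucket), so the counts are a starting point for sizing each category, not a substitute for reading the logs.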

The Implementation

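import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.IOException;
import java.io.InputStream;
import java.util.List;
import java.util.Map;

/**
 * Query test set fixture for the documentation platform.
 * Stored in src/test/resources/relevance/query-test-set.json
 * and loaded in integration tests.
 */
public class QueryTestSetLoader {

    private static final String RESOURCE = "/relevance/query-test-set.json";

    // Deserializing records requires Jackson 2.12 or later.
    private final ObjectMapper objectMapper = new ObjectMapper();

    /** One query with its filters and the graded documents for that query. */
    public record RelevanceJudgment(
        String queryId,
        String query,
        String category,
        Map<String, String> filters,
        List<JudgedDocument> judgments
    ) {}

    public record JudgedDocument(
        String documentId,
        int relevanceGrade  // 0=irrelevant, 1=marginal, 2=relevant, 3=perfect
    ) {}

    public List<RelevanceJudgment> loadTestSet() throws IOException {
        try (InputStream stream = getClass().getResourceAsStream(RESOURCE)) {
            // getResourceAsStream returns null when the resource is missing.
            if (stream == null) {
                throw new IOException("Test set not found on classpath: " + RESOURCE);
            }
            return objectMapper.readValue(stream,
                new TypeReference<List<RelevanceJudgment>>() {});
        }
    }
}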

The test set file:

[
  {
    "queryId": "Q001",
    "query": "getConnection",
    "category": "method_name",
    "filters": { "tenant_id": "acme" },
    "judgments": [
      { "documentId": "acme:api-ref-jdbc-connection", "relevanceGrade": 3 },
      { "documentId": "acme:guide-connection-pooling", "relevanceGrade": 2 },
      { "documentId": "acme:api-ref-datasource", "relevanceGrade": 1 },
      { "documentId": "acme:changelog-v3.2", "relevanceGrade": 0 }
    ]
  },
  {
    "queryId": "Q002",
    "query": "retry policy configuration",
    "category": "concept",
    "filters": { "tenant_id": "acme" },
    "judgments": [
      { "documentId": "acme:guide-retry-policies", "relevanceGrade": 3 },
      { "documentId": "acme:api-ref-http-client", "relevanceGrade": 2 },
      { "documentId": "acme:guide-error-handling", "relevanceGrade": 1 },
      { "documentId": "acme:guide-authentication", "relevanceGrade": 0 }
    ]
  },
  {
    "queryId": "Q003",
    "query": "Connection refused port 5432",
    "category": "error_message",
    "filters": { "tenant_id": "acme" },
    "judgments": [
      {
        "documentId": "acme:troubleshooting-db-connection",
        "relevanceGrade": 3
      },
      { "documentId": "acme:guide-postgres-setup", "relevanceGrade": 2 },
      { "documentId": "acme:api-ref-datasource", "relevanceGrade": 1 }
    ]
  },
  {
    "queryId": "Q004",
    "query": "spring.datasource.hikari.maximum-pool-size",
    "category": "config_key",
    "filters": { "tenant_id": "acme" },
    "judgments": [
      { "documentId": "acme:ref-config-properties", "relevanceGrade": 3 },
      { "documentId": "acme:guide-connection-pooling", "relevanceGrade": 2 }
    ]
  },
  {
    "queryId": "Q005",
    "query": "how to configure connection timeout",
    "category": "how_to",
    "filters": { "tenant_id": "acme" },
    "judgments": [
      { "documentId": "acme:guide-connection-timeout", "relevanceGrade": 3 },
      { "documentId": "acme:guide-connection-pooling", "relevanceGrade": 2 },
      { "documentId": "acme:api-ref-http-client", "relevanceGrade": 1 }
    ]
  }
]
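Because the test set is edited by hand, a small structural check can catch typos before they skew an evaluation run. The following is a minimal sketch, not part of the platform's code: the class name `TestSetValidator` and each rule are assumptions. In particular, the rule that a document ID starts with the tenant ID is only inferred from the sample entries above. The records are redeclared so the example compiles without Jackson.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Structural checks for a hand-edited test set; all rules are illustrative assumptions. */
final class TestSetValidator {

    record JudgedDocument(String documentId, int relevanceGrade) {}

    record RelevanceJudgment(String queryId, String query, String category,
                             Map<String, String> filters, List<JudgedDocument> judgments) {}

    static final Set<String> CATEGORIES =
        Set.of("method_name", "concept", "error_message", "config_key", "how_to");

    /** Returns a human-readable list of problems; an empty list means the set passed. */
    static List<String> validate(List<RelevanceJudgment> testSet) {
        List<String> problems = new ArrayList<>();
        Set<String> seenQueryIds = new HashSet<>();
        for (RelevanceJudgment rj : testSet) {
            String id = rj.queryId();
            if (!seenQueryIds.add(id)) problems.add(id + ": duplicate queryId");
            if (rj.query() == null || rj.query().isBlank()) problems.add(id + ": empty query");
            // Set.of(...) rejects null lookups, so test for null first.
            if (rj.category() == null || !CATEGORIES.contains(rj.category())) {
                problems.add(id + ": unknown category " + rj.category());
            }
            if (rj.judgments() == null || rj.judgments().isEmpty()) {
                problems.add(id + ": no judgments");
                continue;
            }
            String tenant = rj.filters() == null ? null : rj.filters().get("tenant_id");
            Set<String> seenDocs = new HashSet<>();
            for (JudgedDocument d : rj.judgments()) {
                String docId = d.documentId();
                if (docId == null) {
                    problems.add(id + ": missing documentId");
                    continue;
                }
                if (!seenDocs.add(docId)) problems.add(id + ": document judged twice: " + docId);
                if (d.relevanceGrade() < 0 || d.relevanceGrade() > 3) {
                    problems.add(id + ": grade out of range for " + docId);
                }
                // Assumed convention: document IDs are prefixed with "<tenant_id>:".
                if (tenant != null && !docId.startsWith(tenant + ":")) {
                    problems.add(id + ": " + docId + " does not belong to tenant " + tenant);
                }
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        var q1 = new RelevanceJudgment("Q001", "getConnection", "method_name",
            Map.of("tenant_id", "acme"),
            List.of(new JudgedDocument("acme:api-ref-jdbc-connection", 3),
                    new JudgedDocument("acme:changelog-v3.2", 0)));
        System.out.println(validate(List.of(q1)));
    }
}
```

Running a check like this in the same integration test that loads the fixture means a malformed entry fails the build instead of silently distorting the scores.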

Determining Ground Truth

Relevance grades are assigned by people who understand the documentation domain, not by the search system. For the documentation platform, this means:

  1. A developer familiar with the documentation corpus reviews each query.
  2. For each query, the reviewer identifies the 3-10 most relevant documents and assigns each a grade.

The grades follow a four-point scale:

  Grade 3 (perfect): the document directly answers the query
  Grade 2 (relevant): the document contains useful information for the query
  Grade 1 (marginal): the document is tangentially related
  Grade 0 (irrelevant): the document should not appear in results

The initial test set requires a one-time investment of 2-4 hours for 50 queries. It is updated when new document types are added or when user search patterns shift.

The Measurement

Track test set coverage by query category:

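| Category | Queries in Test Set | % of User Traffic | Coverage |
| --- | --- | --- | --- |
| Method name | 12 | 25% | Adequate |
| Concept | 15 | 30% | Adequate |
| Error message | 8 | 15% | Adequate |
| Config key | 8 | 15% | Adequate |
| How-to | 7 | 15% | Adequate |
| Total | 50 | 100% | |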

A test set with fewer than 30 queries is too small to detect regressions in minority query categories. A test set with more than 100 queries becomes burdensome to maintain and grade. A set of about 50 queries, distributed across categories in proportion to user traffic, is large enough to catch regressions in every category and small enough to regrade by hand.
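The proportionality check can be automated by comparing each category's share of the test set with its share of user traffic. The sketch below uses the counts and percentages from the table. The class name `CoverageCheck` and the 5-percentage-point tolerance are assumptions chosen for illustration; the chapter does not define a numeric threshold for "Adequate".

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Compares test-set share with traffic share per category (illustrative sketch). */
final class CoverageCheck {

    /** Returns category -> (test-set share minus traffic share), in percentage points. */
    static Map<String, Double> shareGap(Map<String, Integer> testSetCounts,
                                        Map<String, Double> trafficPercent) {
        int total = testSetCounts.values().stream().mapToInt(Integer::intValue).sum();
        Map<String, Double> gap = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : trafficPercent.entrySet()) {
            int count = testSetCounts.getOrDefault(e.getKey(), 0);
            double share = total == 0 ? 0.0 : 100.0 * count / total;
            gap.put(e.getKey(), share - e.getValue());
        }
        return gap;
    }

    /** A category is adequate if it is not under-represented by more than the tolerance. */
    static boolean isAdequate(double gapPoints, double tolerancePoints) {
        return gapPoints >= -tolerancePoints;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("method_name", 12);
        counts.put("concept", 15);
        counts.put("error_message", 8);
        counts.put("config_key", 8);
        counts.put("how_to", 7);

        Map<String, Double> traffic = new LinkedHashMap<>();
        traffic.put("method_name", 25.0);
        traffic.put("concept", 30.0);
        traffic.put("error_message", 15.0);
        traffic.put("config_key", 15.0);
        traffic.put("how_to", 15.0);

        double tolerance = 5.0; // hypothetical threshold, in percentage points
        shareGap(counts, traffic).forEach((category, gap) ->
            System.out.printf("%-14s gap=%+.1f adequate=%b%n",
                category, gap, isAdequate(gap, tolerance)));
    }
}
```

With the table's numbers, every category sits within one percentage point of its traffic share, which is consistent with the "Adequate" column.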

The Decision Rule

Create the query test set before making any relevance changes. The test set establishes the baseline against which all changes are measured. Without it, relevance tuning is guessing.
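To make "baseline" concrete, each ranking the search system returns has to be reduced to a number that can be compared before and after a change. The chapter does not name a metric at this point. The sketch below assumes nDCG@k, one common choice for graded judgments, with gain 2^grade - 1 and a log2 discount. The class name `NdcgSketch` is hypothetical, and treating unjudged documents as grade 0 is also an assumption.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** nDCG@k over graded judgments; the metric choice is an assumption for illustration. */
final class NdcgSketch {

    /** DCG@k with gain 2^grade - 1 and discount log2(rank + 1), rank starting at 1. */
    static double dcg(List<Integer> gradesInRankOrder, int k) {
        double sum = 0.0;
        int limit = Math.min(k, gradesInRankOrder.size());
        for (int i = 0; i < limit; i++) {
            double gain = Math.pow(2, gradesInRankOrder.get(i)) - 1;
            double discount = Math.log(i + 2) / Math.log(2);
            sum += gain / discount;
        }
        return sum;
    }

    /** Normalizes DCG by the best achievable ordering of the judged documents. */
    static double ndcg(List<String> rankedDocIds, Map<String, Integer> judgments, int k) {
        // Assumption: documents without a judgment count as grade 0.
        List<Integer> actual = rankedDocIds.stream()
            .map(id -> judgments.getOrDefault(id, 0))
            .toList();
        List<Integer> ideal = judgments.values().stream()
            .sorted(Comparator.reverseOrder())
            .toList();
        double idealDcg = dcg(ideal, k);
        return idealDcg == 0.0 ? 0.0 : dcg(actual, k) / idealDcg;
    }

    public static void main(String[] args) {
        // Judgments for Q001 ("getConnection") from the test set file.
        Map<String, Integer> q001 = Map.of(
            "acme:api-ref-jdbc-connection", 3,
            "acme:guide-connection-pooling", 2,
            "acme:api-ref-datasource", 1,
            "acme:changelog-v3.2", 0);

        List<String> baseline = List.of(
            "acme:api-ref-jdbc-connection", "acme:guide-connection-pooling",
            "acme:api-ref-datasource");
        // Hypothetical ranking after a boost change: the guide outranks the API reference.
        List<String> candidate = List.of(
            "acme:guide-connection-pooling", "acme:api-ref-jdbc-connection",
            "acme:api-ref-datasource");

        System.out.printf("baseline=%.3f candidate=%.3f%n",
            ndcg(baseline, q001, 3), ndcg(candidate, q001, 3));
    }
}
```

Recording this score for every query before a change, then again after it, turns the regression described in the Symptom section into a measurable drop instead of a support ticket.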

Grade relevance with domain experts, not with the search system’s output. If the test set is built by running queries and accepting the current top results as “correct,” the test set codifies the current behavior rather than the desired behavior.

Update the test set when new document types are added to the platform, when user search patterns change, or when a specific relevance failure surfaces a query pattern that the existing test set does not cover.