search at depth

Debugging Analysis with the _analyze API

Chapter 5 of 60

The Symptom

A search for “Spring Boot configuration” returns results. A search for “SpringBoot configuration” returns nothing. Both queries are searching the same field. Both contain the same words. The user is confused. The developer is confused. The inverted index is doing exactly what it was told.

The Internals

The _analyze API is the single most useful debugging tool for search relevance problems. It shows exactly what tokens an analyzer produces from a given input. When a query returns zero results, the first question is always: what tokens does the analyzer produce for the query terms, and do those tokens exist in the index?

Analysis happens twice for every search operation:

  1. Index-time analysis: when a document is indexed, its text fields are analyzed and the resulting tokens are stored in the inverted index.
  2. Query-time analysis: when a search query is executed, the query text is analyzed, by default with the same analyzer the field used at index time (unless the mapping sets a separate search_analyzer), and the resulting tokens are used for lookup.

A match occurs when a query-time token equals an index-time token. If the analyzers produce different tokens, no match is possible.
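The two analysis passes can be sketched without a cluster. The class below is a toy model, not OpenSearch code: `ToyInvertedIndex` and its `analyze` method are hypothetical names, the analyzer only lowercases and splits on non-alphanumeric characters (the real standard analyzer follows Unicode word-boundary rules), and the search assumes every query token must match (AND semantics).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class ToyInvertedIndex {

    // token -> ids of the documents that contain the token
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Stand-in analyzer (assumption): lowercase, then split on anything
    // that is not a letter or a digit.
    public static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Index-time analysis: store every token the analyzer emits.
    public void index(int docId, String text) {
        for (String token : analyze(text)) {
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
    }

    // Query-time analysis: analyze the query, then look each token up.
    // A document is returned only if it contains all query tokens.
    public Set<Integer> searchAll(String query) {
        Set<Integer> result = null;
        for (String token : analyze(query)) {
            Set<Integer> docs = postings.getOrDefault(token, Set.of());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? Set.of() : result;
    }

    public static void main(String[] args) {
        ToyInvertedIndex idx = new ToyInvertedIndex();
        idx.index(1, "Spring Boot configuration");

        System.out.println(analyze("SpringBoot configuration"));          // [springboot, configuration]
        System.out.println(idx.searchAll("Spring Boot configuration"));   // [1]
        System.out.println(idx.searchAll("SpringBoot configuration"));    // []
    }
}
```

The index holds spring, boot and configuration; the second query asks for springboot, which was never stored, so the lookup comes back empty even though the text looks equivalent to a human reader.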

The _analyze API can be called three ways:

# Using a built-in analyzer
POST _analyze
{
  "analyzer": "standard",
  "text": "SpringBoot configuration"
}

# Using a specific index's analyzer for a field
POST docs-v1/_analyze
{
  "field": "title",
  "text": "SpringBoot configuration"
}

# Building an analyzer inline for experimentation
POST _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "SpringBoot configuration"
}

The third form is the most powerful for debugging. It lets you isolate the effect of each component.
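To illustrate what isolating each component means, here is a minimal sketch that runs a toy pipeline one stage at a time and records the tokens after each stage. Everything in it is an assumption made for illustration: `PipelineTrace`, the whitespace tokenizer and the `camelCaseSplit` filter are hypothetical stand-ins, not the OpenSearch implementations and not the code_analyzer from chapter 2.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

public class PipelineTrace {

    // Toy tokenizer (assumption): split on whitespace only.
    public static List<String> whitespaceTokenizer(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Toy filter (assumption): split at each lowercase-to-uppercase
    // transition and keep the original token in front of its parts.
    public static List<String> camelCaseSplit(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token);
            String[] parts = token.split("(?<=\\p{Ll})(?=\\p{Lu})");
            if (parts.length > 1) {
                out.addAll(List.of(parts));
            }
        }
        return out;
    }

    // Toy filter: lowercase every token.
    public static List<String> lowercase(List<String> tokens) {
        return tokens.stream().map(t -> t.toLowerCase(Locale.ROOT)).toList();
    }

    // Run the tokenizer, then each filter in order, and snapshot the
    // token list after every stage.
    public static Map<String, List<String>> trace(
            String text, Map<String, UnaryOperator<List<String>>> filters) {
        Map<String, List<String>> snapshots = new LinkedHashMap<>();
        List<String> tokens = whitespaceTokenizer(text);
        snapshots.put("tokenizer", tokens);
        for (Map.Entry<String, UnaryOperator<List<String>>> stage : filters.entrySet()) {
            tokens = stage.getValue().apply(tokens);
            snapshots.put(stage.getKey(), tokens);
        }
        return snapshots;
    }

    public static void main(String[] args) {
        Map<String, UnaryOperator<List<String>>> filters = new LinkedHashMap<>();
        filters.put("camel_case_split", PipelineTrace::camelCaseSplit);
        filters.put("lowercase", PipelineTrace::lowercase);

        // tokenizer -> [SpringBoot, configuration]
        // camel_case_split -> [SpringBoot, Spring, Boot, configuration]
        // lowercase -> [springboot, spring, boot, configuration]
        trace("SpringBoot configuration", filters)
            .forEach((stage, tokens) -> System.out.println(stage + " -> " + tokens));
    }
}
```

The trace also makes an ordering issue visible: if lowercase ran before the split, the case transitions would already be gone and the split would find nothing. The real API offers a comparable view: adding "explain": true to an _analyze request returns the token stream after the tokenizer and after each token filter.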

The Implementation

A Java utility class for analyzer debugging during development and testing:

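import java.io.IOException;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.opensearch.client.opensearch.OpenSearchClient;
import org.opensearch.client.opensearch.indices.AnalyzeRequest;
import org.opensearch.client.opensearch.indices.AnalyzeResponse;
import org.opensearch.client.opensearch.indices.analyze.AnalyzeToken;

public class AnalyzerDebugger {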

    private final OpenSearchClient client;

    public AnalyzerDebugger(OpenSearchClient client) {
        this.client = client;
    }

    /**
     * Analyze text using a specific index and field analyzer,
     * returning the list of tokens for inspection.
     */
    public List<String> analyzeForField(String index, String field, String text)
            throws IOException {

        AnalyzeRequest request = AnalyzeRequest.of(a -> a
            .index(index)
            .field(field)
            .text(text)
        );

        AnalyzeResponse response = client.indices().analyze(request);

        return response.tokens().stream()
            .map(AnalyzeToken::token)
            .toList();
    }

    /**
     * Compare tokens produced by two different analyzers on the same text.
     */
    public record AnalyzerComparison(
        List<String> analyzerATokens,
        List<String> analyzerBTokens,
        List<String> onlyInA,
        List<String> onlyInB,
        List<String> common
    ) {}

    public AnalyzerComparison compareAnalyzers(
            String index, String fieldA, String fieldB, String text)
            throws IOException {

        List<String> tokensA = analyzeForField(index, fieldA, text);
        List<String> tokensB = analyzeForField(index, fieldB, text);

        Set<String> setA = new HashSet<>(tokensA);
        Set<String> setB = new HashSet<>(tokensB);

        List<String> common = tokensA.stream().filter(setB::contains).distinct().toList();
        List<String> onlyInA = tokensA.stream().filter(t -> !setB.contains(t)).distinct().toList();
        List<String> onlyInB = tokensB.stream().filter(t -> !setA.contains(t)).distinct().toList();

        return new AnalyzerComparison(tokensA, tokensB, onlyInA, onlyInB, common);
    }
}

Using this in a test to verify analyzer behavior:

@Test
void codeAnalyzerDecomposesCamelCase() throws Exception {
    // Create the index with code_analyzer (as defined in chapter 2)
    createDocsIndex(client);

    var debugger = new AnalyzerDebugger(client);

    List<String> tokens = debugger.analyzeForField("docs-v1", "title", "HttpClientFactory");

    assertThat(tokens).contains("httpclientfactory");  // preserved original
    assertThat(tokens).contains("http");                // camelCase split
    assertThat(tokens).contains("client");
    assertThat(tokens).contains("factory");
}

@Test
void standardAnalyzerFailsOnCamelCase() throws Exception {
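    // Assumes docs-v1 already exists (for example, created in test setup)
    // with a title.standard subfield that uses the standard analyzer.
    var debugger = new AnalyzerDebugger(client);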

    List<String> standard = debugger.analyzeForField("docs-v1", "title.standard",
        "HttpClientFactory");

    // Standard analyzer does NOT decompose camelCase
    assertThat(standard).contains("httpclientfactory");
    assertThat(standard).doesNotContain("http");
    assertThat(standard).doesNotContain("client");
}

The Measurement

Build a diagnostic report comparing analyzer behavior across representative queries from the documentation platform:

| Query | Standard Tokens | Code Analyzer Tokens | Match Difference |
|---|---|---|---|
| getConnection | getconnection | getconnection, get, connection | +2 tokens, broader match |
| java.sql.Connection | java.sql.connection | java, sql, connection | Dot-split enables component match |
| max_pool_size | max_pool_size | max, pool, size, max_pool_size | Underscore-split enables partial match |
| Spring Boot | spring, boot | spring, boot | Identical for natural language |

The measurement reveals the trade-off: the code analyzer produces more tokens, which means broader recall (more documents match) but potentially lower precision (documents match that should not). This trade-off is managed through field boosting in the query, covered in chapter 9.

The Decision Rule

Use the _analyze API as the first debugging step when search returns unexpected results. Before modifying boost weights, before adding synonyms, before restructuring the query DSL, verify that the analyzer produces tokens that make the match possible. If the tokens do not match, no amount of query tuning will fix the problem.

Use the analyzer comparison utility during development to validate that changes to the analysis pipeline do not break existing search behavior. Run it against the query test set (built in chapter 8) as a regression check.
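One possible shape for such a regression check, as a minimal sketch: keep the expected tokens for each test query as a baseline, re-analyze every query, and report any query whose tokens changed. `TokenRegressionCheck`, `Regression` and `findRegressions` are hypothetical names, and the analyzer is passed in as a plain function so the sketch runs without a cluster.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

public class TokenRegressionCheck {

    // One query whose tokens no longer match the recorded baseline.
    public record Regression(String query, List<String> expected, List<String> actual) {}

    // Re-analyze every baseline query and collect those whose token
    // list differs from the recorded one (order-sensitive comparison).
    public static List<Regression> findRegressions(
            Map<String, List<String>> baseline,
            Function<String, List<String>> analyzer) {
        List<Regression> regressions = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : baseline.entrySet()) {
            List<String> actual = analyzer.apply(entry.getKey());
            if (!actual.equals(entry.getValue())) {
                regressions.add(new Regression(entry.getKey(), entry.getValue(), actual));
            }
        }
        return regressions;
    }

    public static void main(String[] args) {
        // Baseline tokens taken from the measurement table above.
        Map<String, List<String>> baseline = new LinkedHashMap<>();
        baseline.put("Spring Boot", List.of("spring", "boot"));
        baseline.put("getConnection", List.of("getconnection", "get", "connection"));

        // Stand-in for a changed pipeline (assumption): lowercase and
        // whitespace split only, so camelCase is no longer decomposed.
        Function<String, List<String>> changed =
            q -> List.of(q.toLowerCase(Locale.ROOT).split("\\s+"));

        // Reports getConnection; Spring Boot is unchanged.
        findRegressions(baseline, changed).forEach(System.out::println);
    }
}
```

Against a live index, the function could delegate to AnalyzerDebugger.analyzeForField; because that method throws IOException, the lambda would need to catch it and rethrow it, for example as an UncheckedIOException.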