search at depth

Synonyms, Stop Words, and Stemming for Technical Domains

5 min read Chapter 6 of 60

The Symptom

A developer searches for “k8s deployment” and gets zero results. The documentation contains extensive Kubernetes deployment guides, but the word “k8s” does not appear in any document because every technical writer spelled out “Kubernetes.” A search for “DB connection” misses every page that says “database connection.” These are not obscure edge cases. They are among the first searches a new user tries.

Separately, a search for “A record DNS” returns nothing because the English stop word list removed “A,” even though the document’s title was “Configuring A Records in Route53.”

The Internals

Synonyms map terms at analysis time so that a search for one term matches documents containing another. OpenSearch supports two synonym approaches:

  1. Synonym token filter: applied during analysis, either at index time or query time. Index-time synonyms expand the stored tokens. Query-time synonyms expand the search tokens.
  2. Synonym graph token filter: handles multi-word synonyms correctly by producing a token graph that preserves phrase positions.
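The difference between the two expansion points can be sketched with a toy model. This is plain Java, not OpenSearch code: the class `SynonymExpansionSketch` and its methods are hypothetical names, tokenization is a whitespace split, and only single-word synonyms are handled. It assumes the comma-separated rule format used later in this chapter.

```java
import java.util.*;

/** Toy model of index-time vs. query-time synonym expansion (not OpenSearch code). */
final class SynonymExpansionSketch {

    // Parses rules such as "k8s, kubernetes" so that every term in a rule
    // maps to the full group of equivalent terms.
    static Map<String, Set<String>> parseRules(List<String> rules) {
        Map<String, Set<String>> lookup = new HashMap<>();
        for (String rule : rules) {
            Set<String> group = new TreeSet<>();
            for (String term : rule.split(",")) {
                group.add(term.trim().toLowerCase(Locale.ROOT));
            }
            for (String term : group) {
                lookup.computeIfAbsent(term, k -> new TreeSet<>()).addAll(group);
            }
        }
        return lookup;
    }

    // Lowercases and splits on whitespace.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    }

    // Index-time analysis: the stored token set includes any synonyms.
    static Set<String> analyze(String text, Map<String, Set<String>> synonyms) {
        Set<String> tokens = new TreeSet<>();
        for (String raw : tokenize(text)) {
            tokens.add(raw);
            tokens.addAll(synonyms.getOrDefault(raw, Collections.emptySet()));
        }
        return tokens;
    }

    // Every query word must match the document, either directly or through
    // one of its query-time synonyms.
    static boolean matches(String doc, String query,
                           Map<String, Set<String>> indexSynonyms,
                           Map<String, Set<String>> querySynonyms) {
        Set<String> indexed = analyze(doc, indexSynonyms);
        for (String word : tokenize(query)) {
            Set<String> alternatives =
                new TreeSet<>(querySynonyms.getOrDefault(word, Collections.emptySet()));
            alternatives.add(word);
            if (Collections.disjoint(indexed, alternatives)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> none = Collections.emptyMap();
        Map<String, Set<String>> rules = parseRules(Arrays.asList("k8s, kubernetes"));
        String doc = "Kubernetes deployment guide";
        System.out.println("no synonyms:  " + matches(doc, "k8s deployment", none, none));
        System.out.println("query-time:   " + matches(doc, "k8s deployment", none, rules));
        System.out.println("index-time:   " + matches(doc, "k8s deployment", rules, none));
    }
}
```

Both expansion points recover the match. What differs is where the extra tokens live: in the stored document for index-time expansion, or only in the transient query for query-time expansion.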

Stop words are high-frequency terms removed during analysis to reduce index size and avoid matching on non-meaningful words. The default English list includes “a,” “an,” “the,” “is,” “at,” “it,” and a little under 30 others. This list was designed for natural language prose, not for technical content where “A” in “A record” is a domain-specific term.
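A minimal sketch of what stop word removal does to the title from the symptom above. The class `StopWordSketch` is a hypothetical stand-in for an analyzer, the stop word set is a small illustrative subset rather than the full default list, and tokenization is a plain whitespace split.

```java
import java.util.*;

/** Toy stop word filter (not OpenSearch code). */
final class StopWordSketch {

    // Illustrative subset of an English stop word list, not the full list.
    static final Set<String> ENGLISH_SUBSET =
        new HashSet<>(Arrays.asList("a", "an", "the", "is", "at", "it"));

    // Lowercases, splits on whitespace, and drops any token in stopWords.
    static List<String> analyze(String text, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!stopWords.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String title = "Configuring A Records in Route53";
        System.out.println(analyze(title, ENGLISH_SUBSET));
        System.out.println(analyze(title, Collections.emptySet()));
    }
}
```

With the subset applied, the token “a” never reaches the index, so nothing distinguishes an “A record” from any other record. Passing an empty set keeps it.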

Stemming reduces words to their root form so that “configuring,” “configured,” and “configuration” all match. The common stemmers (the rule-based Porter and Snowball algorithms and the dictionary-based Hunspell) are designed for natural language and produce unexpected results on technical terms.

The Implementation

Synonym Configuration for Technical Documentation

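// HARDENED: Query-time synonyms for technical terms
// Inlined here for brevity; moving the list to a file referenced by
// synonyms_path keeps updates out of application code. Neither form needs a reindex.

// Imports assume the opensearch-java client.
import java.util.List;
import org.opensearch.client.opensearch.indices.CreateIndexRequest;

CreateIndexRequest request = CreateIndexRequest.of(idx -> idx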
    .index("docs-v1")
    .settings(s -> s
        .analysis(a -> a
            .filter("tech_synonyms", f -> f
                .definition(d -> d
                    .synonymGraph(sg -> sg
                        .synonyms(List.of(
                            "k8s, kubernetes",
                            "db, database",
                            "js, javascript",
                            "ts, typescript",
                            "k/v, key-value, key value",
                            "api, application programming interface",
                            "ci/cd, continuous integration continuous delivery",
                            "oom, out of memory",
                            "gc, garbage collection"
                        ))
                    )
                )
            )
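            // Index-time analyzer (the field's "analyzer"): no synonym filter,
            // so stored tokens are never expanded.
            .analyzer("doc_search_analyzer", an -> an
                .custom(c -> c
                    .tokenizer("standard")
                    .filter("lowercase")
                )
            )
            // Query-time analyzer (the field's "searchAnalyzer"): expands
            // synonyms in the query only.
            .analyzer("doc_query_analyzer", an -> an
                .custom(c -> c
                    .tokenizer("standard")
                    .filter("lowercase", "tech_synonyms")
                )
            )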
        )
    )
    .mappings(m -> m
        .properties("body", p -> p.text(t -> t
            .analyzer("doc_search_analyzer")
            .searchAnalyzer("doc_query_analyzer")
        ))
    )
);

// FRAGILE: Index-time synonyms
// Every synonym change requires a full reindex of all documents.
// On a 10-million-document index, this means hours of downtime
// or a complex reindex-behind-alias operation.

.analyzer("doc_analyzer", an -> an
    .custom(c -> c
        .tokenizer("standard")
        .filter("lowercase", "tech_synonyms")  // synonyms at index time
    )
)

Query-time synonym expansion searches the index using both the original term and its synonyms. Adding a new synonym pair requires only closing the index, updating the filter settings, and reopening it, not a full reindex. The cost is a slightly more complex query (more terms to match at search time), which is negligible compared to the operational cost of reindexing millions of documents.

Stop Word Configuration

// HARDENED: No stop words for technical documentation
// The cost of indexing high-frequency words is lower than
// the cost of missing "A record", "IT department", "Go language"

.analyzer("doc_analyzer", an -> an
    .custom(c -> c
        .tokenizer("standard")
        .filter("lowercase")
        // No stop word filter. Intentional.
    )
)

Default stop word removal is wrong for technical documentation. “A” in “A record” and “IT” as information technology both lowercase to entries in the default English list (“a” and “it”) and are dropped. Short names such as “Go,” “C,” and “R” are not in that default list, but they are exposed as soon as someone swaps in a longer stop word list, and they are just as domain-specific. The marginal index size saving from stop word removal does not justify the search quality degradation.

Stemming with Guardrails

// HARDENED: Light stemming with a protected words list
// Prevents the stemmer from destroying technical terms

.filter("protected_words", f -> f
    .definition(d -> d
        .keywordMarker(km -> km
            .keywords(List.of(
                "kubernetes", "docker", "kafka", "redis",
                "nginx", "grpc", "graphql", "oauth",
                "jdbc", "jndi", "jmx", "cors",
                "csrf", "xss", "sql", "nosql"
            ))
        )
    )
)
.filter("light_stemmer", f -> f
    .definition(d -> d
        .stemmer(st -> st.language("light_english"))
    )
)
.analyzer("doc_analyzer", an -> an
    .custom(c -> c
        .tokenizer("standard")
        .filter("lowercase", "protected_words", "light_stemmer")
    )
)

The keyword marker filter prevents the stemmer from modifying protected terms. Without it, the Porter stemmer reduces “kubernetes” to “kubernet” and “redis” to “redi,” and leaves “docker” intact only by coincidence. The light_english stemmer is less aggressive than the default english stemmer and produces fewer destructive reductions on technical vocabulary.
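The guard itself is simple to picture. The sketch below is a toy, not the Porter or light_english algorithm: `toyStem` is a hypothetical stand-in that only strips a trailing “s,” and `ProtectedStemSketch` is an illustrative name. It shows the order of operations in the chain above: lowercase first, then check the protected list, then stem whatever is not protected.

```java
import java.util.*;

/** Toy keyword-marker guard in front of a toy stemmer (not OpenSearch code). */
final class ProtectedStemSketch {

    // Stand-in stemmer: strips one trailing "s" from longer tokens.
    // Real stemmers apply many more rules; this one exists only to show the guard.
    static String toyStem(String token) {
        if (token.length() > 3 && token.endsWith("s") && !token.endsWith("ss")) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    // Lowercases, then stems every token that is not in protectedWords.
    // The protected list must be lowercase because the check runs after lowercasing.
    static List<String> analyze(String text, Set<String> protectedWords) {
        List<String> out = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            out.add(protectedWords.contains(raw) ? raw : toyStem(raw));
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> guarded = new HashSet<>(Arrays.asList("redis", "kubernetes"));
        System.out.println(analyze("Redis clusters", Collections.emptySet()));
        System.out.println(analyze("Redis clusters", guarded));
    }
}
```

Ordinary words such as “clusters” are still stemmed in both runs. Only the protected product name changes behavior.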

The Measurement

Verify synonym and stemming behavior with the _analyze API:

POST docs-v1/_analyze
{
  "analyzer": "doc_query_analyzer",
  "text": "k8s deployment configuration"
}

Expected output from the query-time analyzer (abridged to token and position):

{
  "tokens": [
    { "token": "k8s", "position": 0 },
    { "token": "kubernetes", "position": 0 },
    { "token": "deployment", "position": 1 },
    { "token": "configuration", "position": 2 }
  ]
}

The synonym token kubernetes appears at the same position as k8s, which means a phrase query for “k8s deployment” will also match “kubernetes deployment.”
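Why shared positions matter for phrase queries can be shown with a small positional matcher. This is a sketch under simplifying assumptions: `PhrasePositionSketch` is a hypothetical class, each synonym is a single token, and the document is a flat token list. Multi-word synonyms need a real token graph, which this toy does not model.

```java
import java.util.*;

/** Toy phrase matcher over positions that may hold several alternative tokens. */
final class PhrasePositionSketch {

    // Turns query tokens into one set of acceptable tokens per position,
    // placing a synonym at the same position as the original term.
    static List<Set<String>> expand(List<String> queryTokens, Map<String, String> synonyms) {
        List<Set<String>> positions = new ArrayList<>();
        for (String token : queryTokens) {
            Set<String> alternatives = new TreeSet<>();
            alternatives.add(token);
            if (synonyms.containsKey(token)) {
                alternatives.add(synonyms.get(token));
            }
            positions.add(alternatives);
        }
        return positions;
    }

    // True when some window of consecutive document tokens satisfies
    // every query position in order.
    static boolean phraseMatches(List<String> docTokens, List<Set<String>> queryPositions) {
        for (int start = 0; start + queryPositions.size() <= docTokens.size(); start++) {
            boolean ok = true;
            for (int i = 0; i < queryPositions.size() && ok; i++) {
                ok = queryPositions.get(i).contains(docTokens.get(start + i));
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("kubernetes", "deployment", "guide");
        Map<String, String> synonyms = Collections.singletonMap("k8s", "kubernetes");
        List<Set<String>> query = expand(Arrays.asList("k8s", "deployment"), synonyms);
        System.out.println(phraseMatches(doc, query));
    }
}
```

Because “kubernetes” occupies the same slot as “k8s,” the word that follows it still lines up with “deployment,” which is what lets the phrase match.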

The Decision Rule

Use query-time synonyms when the synonym list changes frequently or the index is large enough that reindexing is operationally expensive. Use index-time synonyms only when query-time expansion measurably degrades query latency, which happens at very high query volumes with very large synonym lists.

Disable stop words for technical documentation search. The index size overhead is small. The relevance cost of removing domain-specific terms is large.

Use the light_english stemmer with a keyword marker list for technical content. Use the english stemmer only for natural-language prose fields where aggressive stemming improves recall without losing technical precision. Always protect product names, protocol names, and acronyms from stemming.