The Inverted Index: From Raw Text to Relevance in One Data Structure
Every search result your users see begins with a data structure built during indexing, not during query time. The inverted index maps terms to the documents that contain them. When a user types “retry policy” into the documentation search bar, OpenSearch does not scan every document looking for those words. It looks up “retry” in the index, finds the list of documents containing it, looks up “policy,” finds its list, and intersects them. The speed of search is determined by the speed of that lookup, and the quality of search is determined by what was stored in the index when the document was written.
This book uses one domain throughout: a multi-tenant technical documentation search engine. Multiple clients, each a separate software company, store their versioned documentation, API references, code snippets, and changelogs. Developers search across their tenant’s content expecting exact matches on method names like getConnection(), fuzzy matches on concepts like “connection pooling,” code-aware results that understand that HttpClient and http_client refer to the same thing, and filtering by documentation version. Every index design decision, query construction, relevance tuning, and failure scenario in this book happens inside this system.
Four opinions drive every chapter.
OpenSearch is the default. Both OpenSearch and Elasticsearch descend from the same codebase, and for the majority of search engineering work the concepts transfer directly. This book uses OpenSearch 2.x for every code example, API call, and cluster configuration. Where the two products diverge (security plugins, licensing, ML features), the divergence is stated. The licensing split happened. OpenSearch is Apache 2.0 licensed. Elasticsearch is not. That is the last time this book discusses it.
Index design is the decision you cannot undo. A wrong field type, a missing multi-field, a shard count chosen based on a blog post rather than data volume: these cannot be fixed with a configuration change. They require a full reindex. Every chapter that touches mappings treats them as permanent architectural decisions.
Relevance is an engineering problem, not a configuration problem. Adding "boost": 3 to a field and hoping results improve is not relevance tuning. It is guessing. Real relevance engineering requires a test set of queries with expected results, a scoring metric, and a repeatable evaluation pipeline. This discipline is built in chapters 8 through 10.
Semantic search complements lexical search; it does not replace it. Used alone on technical documentation, kNN vector search can return confidently wrong results. A user searching for “how to configure SSL” is not helped by a document whose embedding is semantically close to the query, such as one about “setting up TLS certificates,” if the parameter they need is ssl.keystore.path and that string never appears in the vector-matched document. The correct architecture combines BM25 lexical scoring with dense vector scoring. The book demonstrates this with NDCG numbers.
The Data Structure
An inverted index stores three things per term:
- The term itself, normalized by the analysis pipeline (for example lowercased, stemmed, or decomposed)
- A postings list: the set of document IDs containing that term, with per-document term frequency
- Positional data (optional): the exact positions where the term appears, enabling phrase queries
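To make the role of positional data concrete, here is a minimal Java sketch of phrase matching over a hand-built postings map. The class name PhraseMatchSketch, the document IDs doc-a and doc-b, and the positions are all invented for illustration; Lucene’s actual postings format and phrase scoring are considerably more involved.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class PhraseMatchSketch {

    // term -> (docId -> positions of the term in that document).
    // Hypothetical hand-built postings; a real index derives these at index time.
    static final Map<String, Map<String, List<Integer>>> POSTINGS = Map.of(
        "retry",  Map.of("doc-a", List.of(2), "doc-b", List.of(0, 5)),
        "policy", Map.of("doc-a", List.of(3), "doc-b", List.of(9)));

    /** Returns IDs of documents in which the phrase terms occur at consecutive positions. */
    static Set<String> phraseMatch(List<String> phrase) {
        Set<String> hits = new TreeSet<>();
        Map<String, List<Integer>> first = POSTINGS.getOrDefault(phrase.get(0), Map.of());
        for (Map.Entry<String, List<Integer>> entry : first.entrySet()) {
            String docId = entry.getKey();
            for (int start : entry.getValue()) {
                if (followsFrom(phrase, docId, start)) {
                    hits.add(docId);
                    break;
                }
            }
        }
        return hits;
    }

    // True when term i of the phrase appears at position start + i for every i >= 1.
    private static boolean followsFrom(List<String> phrase, String docId, int start) {
        for (int i = 1; i < phrase.size(); i++) {
            List<Integer> positions = POSTINGS
                .getOrDefault(phrase.get(i), Map.of())
                .getOrDefault(docId, List.of());
            if (!positions.contains(start + i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Both documents contain both terms, but only doc-a has them side by side.
        System.out.println(phraseMatch(List.of("retry", "policy"))); // [doc-a]
    }
}
```

Without positions, the index could only report that both documents contain both terms. The position lists are what let the phrase query reject doc-b.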
Consider a documentation search engine indexing two documents:
Document 1 (ID: doc-1): “Configure the retry policy for HTTP connections”
Document 2 (ID: doc-2): “The default retry count is three for failed connections”
After analysis (standard tokenization, lowercasing, and English stop-word removal; a removed stop word still consumes its position), the inverted index contains:
| Term | Postings List |
|---|---|
| configure | doc-1 (tf=1, pos=0) |
| retry | doc-1 (tf=1, pos=2), doc-2 (tf=1, pos=2) |
| policy | doc-1 (tf=1, pos=3) |
| http | doc-1 (tf=1, pos=5) |
| connections | doc-1 (tf=1, pos=6), doc-2 (tf=1, pos=8) |
| default | doc-2 (tf=1, pos=1) |
| count | doc-2 (tf=1, pos=3) |
| three | doc-2 (tf=1, pos=5) |
| failed | doc-2 (tf=1, pos=7) |
A query for “retry connections” resolves to: look up “retry” (doc-1, doc-2), look up “connections” (doc-1, doc-2), intersect. Both documents match. Ranking decides which appears first, and ranking is the subject of chapter 3.
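The index construction and the two-term lookup can be reproduced in a few lines of Java. This is a sketch under stated assumptions, not how Lucene implements it: InvertedIndexSketch is a hypothetical class, tokenization is a plain whitespace split, and the stop-word list holds only the three words needed for these two documents.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class InvertedIndexSketch {

    // A tiny illustrative stop-word list, not Lucene's actual English set.
    static final Set<String> STOP_WORDS = Set.of("the", "for", "is");

    // term -> (docId -> positions); term frequency is the size of the position list.
    final Map<String, Map<String, List<Integer>>> postings = new TreeMap<>();

    void add(String docId, String text) {
        String[] tokens = text.toLowerCase(Locale.ROOT).split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            if (STOP_WORDS.contains(tokens[pos])) {
                continue; // the stop word is dropped, but its position is still consumed
            }
            postings.computeIfAbsent(tokens[pos], t -> new LinkedHashMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(pos);
        }
    }

    /** Documents containing every non-stop-word query term (AND semantics). */
    Set<String> searchAll(String query) {
        Set<String> result = null;
        for (String term : query.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (STOP_WORDS.contains(term)) {
                continue;
            }
            Set<String> docs = postings.getOrDefault(term, Map.of()).keySet();
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs); // intersect with the postings of the next term
            }
        }
        return result == null ? Set.of() : result;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add("doc-1", "Configure the retry policy for HTTP connections");
        index.add("doc-2", "The default retry count is three for failed connections");

        System.out.println(index.searchAll("retry connections")); // [doc-1, doc-2]
        System.out.println(index.searchAll("retry policy"));      // [doc-1]
    }
}
```

The intersection answers only the question of which documents match. Nothing in this structure orders the two hits, which is why ranking needs the term frequencies stored alongside each posting.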
The inverted index is not a database index. A B-tree index in PostgreSQL points from a key to a row. An inverted index points from a term to every document containing that term, with frequency and position metadata. This is what makes full-text search fundamentally different from database lookups: the index is the primary data structure, not an optimization on top of a table scan.
Three Storage Mechanisms
OpenSearch stores data in three distinct structures, and confusing them is one of the most common mapping mistakes.
The inverted index stores terms for search. For text fields, those terms are the analyzer’s output, which is what makes full-text search possible (keyword fields are indexed too, each value as a single unanalyzed term). You search it. You cannot sort by it. You cannot aggregate on it.
Doc values store column-oriented data for sorting, aggregations, and scripting. By default they are built from keyword, numeric, date, boolean, and geo_point fields. You sort by them. You aggregate on them. You do not full-text search them.
Stored fields hold original values for retrieval in search results. By default, OpenSearch keeps the entire original JSON document in the _source field. Individual fields can be mapped with store: true to be retrievable independently of _source.
The mapping mistake that surfaces in every documentation search platform eventually:
// FRAGILE: text field used for both search AND aggregation
// Aggregating on a text field is rejected unless fielddata is enabled on it,
// and enabling fielddata loads the field's terms into heap memory.
// On a 10-million-document index, that can trip the circuit breaker or exhaust the heap.
import org.opensearch.client.opensearch.indices.CreateIndexRequest;
CreateIndexRequest request = CreateIndexRequest.of(idx -> idx
    .index("docs-v1")
    .mappings(m -> m
        .properties("category", p -> p
            .text(t -> t
                .analyzer("standard")
            )
        )
    )
);
// HARDENED: text field with keyword sub-field
// Search against "category" (analyzed text), aggregate on "category.raw" (keyword).
import org.opensearch.client.opensearch.indices.CreateIndexRequest;
CreateIndexRequest request = CreateIndexRequest.of(idx -> idx
    .index("docs-v1")
    .mappings(m -> m
        .properties("category", p -> p
            .text(t -> t
                .analyzer("standard")
                .fields("raw", f -> f
                    .keyword(k -> k
                        .ignoreAbove(256)
                    )
                )
            )
        )
    )
);
The text field feeds the inverted index. The keyword sub-field builds doc values. Searching uses the analyzed text. Aggregations use the keyword. No fielddata in heap. No circuit breaker trips.
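Heap pressure is one reason to aggregate on the keyword sub-field; the shape of the buckets is another. The following sketch uses plain Java collections rather than OpenSearch. It contrasts counting whole values (what a keyword field’s doc values hold) with counting analyzed tokens (what a text field with fielddata enabled would expose). The class name, the four category values, and the whitespace-and-lowercase analysis are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

final class TextVersusKeywordSketch {

    // Hypothetical "category" values for four documents, in document order.
    static final List<String> CATEGORY = List.of(
        "API Reference", "Getting Started", "API Reference", "Release Notes");

    /** Document counts per untouched value: the view a keyword sub-field provides. */
    static Map<String, Integer> bucketsFromKeyword(List<String> column) {
        Map<String, Integer> buckets = new TreeMap<>();
        for (String value : column) {
            buckets.merge(value, 1, Integer::sum);
        }
        return buckets;
    }

    /** Document counts per analyzed token: the view the text field itself provides. */
    static Map<String, Integer> bucketsFromAnalyzedText(List<String> column) {
        Map<String, Integer> buckets = new TreeMap<>();
        for (String value : column) {
            String[] tokens = value.toLowerCase(Locale.ROOT).split("\\s+");
            // Count each token once per document, as a document count would.
            for (String token : new TreeSet<>(Arrays.asList(tokens))) {
                buckets.merge(token, 1, Integer::sum);
            }
        }
        return buckets;
    }

    public static void main(String[] args) {
        System.out.println(bucketsFromKeyword(CATEGORY));
        // {API Reference=2, Getting Started=1, Release Notes=1}
        System.out.println(bucketsFromAnalyzedText(CATEGORY));
        // {api=2, getting=1, notes=1, reference=2, release=1, started=1}
    }
}
```

A facet labelled “reference” with a count of 2 is rarely what a documentation UI wants to display, so the keyword sub-field is the right aggregation target even when memory is not a concern.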
Segments: Where the Index Lives on Disk
The inverted index does not exist as a single file. Lucene, the library underneath OpenSearch, writes data in immutable segments. Each segment is a self-contained inverted index with its own term dictionary, postings lists, doc values, and stored fields.
When you index a document, it goes into an in-memory buffer. On a refresh (triggered by the refresh interval or requested explicitly), the buffer is written out as a new segment and its documents become searchable; a later flush commits segments durably to disk. Segments are immutable. Once written, they are never modified. Deleting a document does not remove it from the segment; it marks it as deleted in a bitset. The deleted document is physically removed only when segments merge.
This immutability has consequences:
- Updates are delete-plus-reindex. Updating a single field in a document marks the old version as deleted and writes the entire document into a new segment. There is no in-place update.
- Segment count affects query speed. A query must search every segment and merge the results. More segments means more work at query time.
- Merge policy determines write amplification. Lucene’s tiered merge policy periodically combines small segments into larger ones, reducing segment count but consuming CPU and I/O.
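These three consequences can be observed in a toy model. The sketch below is a deliberately simplified stand-in for a shard, not Lucene’s implementation: SegmentSketch and its methods are hypothetical names, documents are plain (id, body) pairs, and a merge always combines every segment rather than following a tiered policy.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Map;

final class SegmentSketch {

    /** A fixed list of documents plus a bitset recording which ones are deleted. */
    static final class Segment {
        final List<Map.Entry<String, String>> docs; // (id, body) pairs, fixed at creation
        final BitSet deleted = new BitSet();        // deletions are recorded here instead

        Segment(List<Map.Entry<String, String>> docs) {
            this.docs = List.copyOf(docs);
        }
    }

    final List<Segment> segments = new ArrayList<>();
    private final List<Map.Entry<String, String>> buffer = new ArrayList<>();

    /** Index or update: mark any earlier copy deleted, then buffer the whole new document. */
    void index(String id, String body) {
        for (Segment segment : segments) {
            for (int i = 0; i < segment.docs.size(); i++) {
                if (segment.docs.get(i).getKey().equals(id)) {
                    segment.deleted.set(i);
                }
            }
        }
        buffer.removeIf(doc -> doc.getKey().equals(id));
        buffer.add(Map.entry(id, body));
    }

    /** Turn the in-memory buffer into a new segment. */
    void refresh() {
        if (!buffer.isEmpty()) {
            segments.add(new Segment(buffer));
            buffer.clear();
        }
    }

    /** Merge every segment into one, physically dropping deleted documents. */
    void mergeAll() {
        List<Map.Entry<String, String>> live = new ArrayList<>();
        for (Segment segment : segments) {
            for (int i = 0; i < segment.docs.size(); i++) {
                if (!segment.deleted.get(i)) {
                    live.add(segment.docs.get(i));
                }
            }
        }
        segments.clear();
        if (!live.isEmpty()) {
            segments.add(new Segment(live));
        }
    }

    /** Document slots occupying space in segments, whether live or deleted. */
    int storedDocCount() {
        int count = 0;
        for (Segment segment : segments) {
            count += segment.docs.size();
        }
        return count;
    }

    /** Documents in segments that are not marked deleted. */
    int liveDocCount() {
        int count = 0;
        for (Segment segment : segments) {
            count += segment.docs.size() - segment.deleted.cardinality();
        }
        return count;
    }

    public static void main(String[] args) {
        SegmentSketch shard = new SegmentSketch();
        shard.index("doc-1", "retry policy v1");
        shard.index("doc-2", "connection pooling");
        shard.refresh();
        shard.index("doc-1", "retry policy v2"); // update = delete old copy + write new one
        shard.refresh();
        System.out.println(shard.segments.size() + " segments, "
            + shard.storedDocCount() + " stored, " + shard.liveDocCount() + " live");
        shard.mergeAll();
        System.out.println(shard.segments.size() + " segments, "
            + shard.storedDocCount() + " stored, " + shard.liveDocCount() + " live");
    }
}
```

Before the merge, the updated document occupies two slots across two segments and only one of them is live. After the merge, the stale copy is gone and a query has a single segment to visit.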
The relationship between refresh interval, segment count, and query latency is one of the most important tuning decisions in OpenSearch, and it is covered in depth in chapter 6.
The Documentation Search Platform
The system this book builds is a multi-tenant documentation search engine with the following index structure:
// HARDENED: Production mapping for the documentation search platform
import org.opensearch.client.opensearch.indices.CreateIndexRequest;
CreateIndexRequest request = CreateIndexRequest.of(idx -> idx
    .index("docs-v1")
    .settings(s -> s
        .numberOfShards("3")
        .numberOfReplicas("1")
        .refreshInterval(t -> t.time("5s"))
    )
    .mappings(m -> m
        .properties("tenant_id", p -> p.keyword(k -> k))
        .properties("title", p -> p.text(t -> t
            .analyzer("standard")
            .fields("exact", f -> f.keyword(k -> k.ignoreAbove(512)))
        ))
        .properties("body", p -> p.text(t -> t
            .analyzer("standard")
        ))
        .properties("code_snippets", p -> p.text(t -> t
            .analyzer("whitespace")
        ))
        .properties("api_method", p -> p.keyword(k -> k))
        .properties("version", p -> p.keyword(k -> k))
        .properties("content_type", p -> p.keyword(k -> k))
        .properties("created_at", p -> p.date(d -> d
            .format("strict_date_optional_time||epoch_millis")
        ))
    )
);
/*
Equivalent JSON mapping:
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "index.refresh_interval": "5s"
  },
  "mappings": {
    "properties": {
      "tenant_id": { "type": "keyword" },
      "title": {
        "type": "text",
        "analyzer": "standard",
        "fields": { "exact": { "type": "keyword", "ignore_above": 512 } }
      },
      "body": { "type": "text", "analyzer": "standard" },
      "code_snippets": { "type": "text", "analyzer": "whitespace" },
      "api_method": { "type": "keyword" },
      "version": { "type": "keyword" },
      "content_type": { "type": "keyword" },
      "created_at": { "type": "date", "format": "strict_date_optional_time||epoch_millis" }
    }
  }
}
*/
Every chapter refers to this mapping. Chapter 2 examines why code_snippets uses a whitespace analyzer instead of standard. Chapter 5 explains why tenant_id is a keyword, not a text field with a keyword sub-field. Chapter 8 shows why title needs different boost weights than body for the documentation search use case.
The diagram above shows the complete path from a raw documentation page to a searchable entry in the inverted index. The document enters the analysis pipeline, which tokenizes and normalizes the text. Each resulting term is added to the term dictionary, which maps to a postings list containing document IDs, term frequencies, and positions. This entire structure lives inside an immutable Lucene segment. When you understand this path, you understand why changing an analyzer requires a full reindex: the terms stored in existing segments were produced by the old analyzer and cannot be re-analyzed in place.
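The reindex requirement can be demonstrated without a cluster. The sketch below uses two toy analyzers as stand-ins (a whitespace split, and a lowercase split on non-alphanumeric characters); neither reproduces Lucene’s whitespace or standard analyzers exactly, and AnalyzerChangeSketch is a hypothetical class. Terms written by the first analyzer stay in the index as written, so a query analyzed by the second one looks up terms that were never stored.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Function;

final class AnalyzerChangeSketch {

    // Toy analyzer 1: split on whitespace, keep case and punctuation.
    static final Function<String, List<String>> WHITESPACE =
        text -> List.of(text.trim().split("\\s+"));

    // Toy analyzer 2: lowercase, then split on anything that is not a letter or digit.
    static final Function<String, List<String>> LOWERCASE_ALNUM = text -> {
        List<String> terms = new ArrayList<>();
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    };

    // term -> IDs of documents containing it, exactly as produced at index time.
    final Map<String, Set<String>> postings = new HashMap<>();

    void index(String docId, String text, Function<String, List<String>> analyzer) {
        for (String term : analyzer.apply(text)) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    /** Documents containing any term the analyzer produces from the query. */
    Set<String> search(String query, Function<String, List<String>> analyzer) {
        Set<String> hits = new TreeSet<>();
        for (String term : analyzer.apply(query)) {
            hits.addAll(postings.getOrDefault(term, Set.of()));
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "Use HttpClient to open connections";

        AnalyzerChangeSketch original = new AnalyzerChangeSketch();
        original.index("doc-1", text, WHITESPACE);
        System.out.println(original.search("HttpClient", WHITESPACE));      // [doc-1]
        // Switching only the query-side analyzer: the stored term is still "HttpClient".
        System.out.println(original.search("HttpClient", LOWERCASE_ALNUM)); // []

        // Reindexing rebuilds the terms with the new analyzer.
        AnalyzerChangeSketch reindexed = new AnalyzerChangeSketch();
        reindexed.index("doc-1", text, LOWERCASE_ALNUM);
        System.out.println(reindexed.search("HttpClient", LOWERCASE_ALNUM)); // [doc-1]
    }
}
```

The middle search returns nothing even though the document has not changed. The only way to get the stored terms to agree with the new analyzer is to run every document through it again.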