How Documents Become Searchable
The Symptom
A documentation platform tenant reports that newly published API reference pages do not appear in search results for up to 30 seconds after publishing. The developer who built the indexing integration insists the documents are being indexed because the API returns 201 Created. The documents are indexed. They are not yet searchable. These are different things.
The Internals
When a document is submitted to OpenSearch, it does not immediately become part of the searchable inverted index. The document passes through a pipeline with distinct phases, each with different durability and visibility guarantees.
Phase 1: Coordinate and Route. The coordinating node receives the index request, determines which shard owns the document (hash(routing_value) % number_of_primary_shards, where the routing value defaults to the document ID), and forwards the request to the node holding that primary shard.
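The routing step can be sketched as hashing the routing value and taking the remainder by the number of primary shards. The class below is a minimal illustration, not OpenSearch code: `ShardRouter` is a hypothetical name, and CRC32 stands in for the Murmur3 hash that OpenSearch actually applies.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Toy model of shard routing: shard = hash(routing) mod number_of_primary_shards. */
final class ShardRouter {
    private final int primaryShards;

    ShardRouter(int primaryShards) {
        if (primaryShards <= 0) {
            throw new IllegalArgumentException("primaryShards must be positive");
        }
        this.primaryShards = primaryShards;
    }

    /** Stand-in hash (CRC32, an assumption); OpenSearch uses Murmur3 on the routing value. */
    int shardFor(String routingValue) {
        CRC32 crc = new CRC32();
        crc.update(routingValue.getBytes(StandardCharsets.UTF_8));
        // getValue() is a non-negative long, so the remainder is in [0, primaryShards).
        return (int) (crc.getValue() % primaryShards);
    }
}
```

Because the result depends only on the routing value and the shard count, every document routed by the same tenant ID lands on the same shard. It is also why the number of primary shards cannot be changed in place without reindexing.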
Phase 2: Write to Translog. The primary shard appends the document to the transaction log (translog) on disk. This is a sequential write, append-only, and fast. The translog provides durability: if the node crashes before the next segment flush, the translog is replayed on recovery. The document is now durable but not searchable.
Phase 3: Write to In-Memory Buffer. The document is analyzed (tokenized, normalized, filtered) and the resulting terms are added to an in-memory indexing buffer. This buffer is a partial Lucene segment that exists only in heap memory. The document is now durable and in memory, but still not searchable.
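Phase 4: Refresh. The refresh operation writes the in-memory buffer out as a new Lucene segment on the filesystem (it lands in the OS page cache and is not necessarily fsync’d to disk). Once the segment is written, a new searcher is opened and the document becomes visible to queries. By default, this happens every 1 second (index.refresh_interval). One caveat: when refresh_interval has not been set explicitly, a shard that has received no search requests for 30 seconds (index.search.idle.after) skips scheduled refreshes until the next search arrives.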
Phase 5: Flush. The flush operation calls fsync on all unflushed segments, ensuring they are written to durable storage, and then truncates the translog. This is the point at which the translog is no longer needed for recovery.
The 30-second search delay the tenant reported was caused by a refresh_interval of 30s, set during a bulk import and never reverted.
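The gap between "indexed" and "searchable" described by the phases above can be modeled in a few lines. This is a toy sketch under heavy simplification: `ToyShard` and its methods are hypothetical names, lists stand in for the translog, the indexing buffer, and Lucene segments, and nothing here reflects the real engine's data structures.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of one shard's write path; illustrative only. */
final class ToyShard {
    private final List<String> translog = new ArrayList<>();       // durable, replayable
    private final List<String> buffer = new ArrayList<>();         // in heap, not searchable
    private final List<List<String>> segments = new ArrayList<>(); // searchable after refresh

    /** Phases 2 and 3: the write is acknowledged once it is in the translog and the buffer. */
    void index(String docId) {
        translog.add(docId);
        buffer.add(docId);
    }

    /** Phase 4: turn the buffer into a new segment and expose it to searches. */
    void refresh() {
        if (buffer.isEmpty()) {
            return; // nothing new to expose, so no new segment
        }
        segments.add(new ArrayList<>(buffer));
        buffer.clear();
    }

    /** Phase 5: make segments durable (modeled as a refresh), then drop translog entries. */
    void flush() {
        refresh();
        translog.clear();
    }

    boolean isSearchable(String docId) {
        return segments.stream().anyMatch(segment -> segment.contains(docId));
    }

    int translogOperations() {
        return translog.size();
    }

    int segmentCount() {
        return segments.size();
    }
}
```

In this model, a call to `index` followed immediately by `isSearchable` returns false, which is the tenant's symptom: the write succeeded and is recoverable, but no refresh has run yet.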
The Implementation
The OpenSearch Java client provides two paths for indexing documents: single-document and bulk. For the documentation platform, single-document indexing is used for real-time page updates, and bulk indexing is used for initial tenant onboarding.
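    // HARDENED: Single document indexing with explicit refresh control
    import java.io.IOException;
    import java.util.List;

    import org.opensearch.client.opensearch.OpenSearchClient;
    import org.opensearch.client.opensearch._types.Refresh;
    import org.opensearch.client.opensearch._types.Result;
    import org.opensearch.client.opensearch.core.IndexRequest;
    import org.opensearch.client.opensearch.core.IndexResponse;
    import org.springframework.stereotype.Repository;

    @Repository
    public class DocumentSearchRepository {

        private final OpenSearchClient client;

        public DocumentSearchRepository(OpenSearchClient client) {
            this.client = client;
        }

        public void indexDocument(DocPage page) throws IOException {
            IndexRequest<DocPage> request = IndexRequest.of(r -> r
                .index("docs-v1")
                // Tenant-scoped ID: re-publishing the same slug updates the document in place
                .id(page.tenantId() + ":" + page.slug())
                // Route by tenant so all of a tenant's pages live on one shard
                .routing(page.tenantId())
                .document(page)
                .refresh(Refresh.False) // Do not force refresh on every write
            );

            IndexResponse response = client.index(request);

            // Created = new document, Updated = existing ID overwritten; anything else is a failure
            if (response.result() != Result.Created && response.result() != Result.Updated) {
                // IndexingException is an application-defined unchecked exception
                throw new IndexingException(
                    "Unexpected index result: " + response.result() +
                    " for document " + page.slug()
                );
            }
        }

        public record DocPage(
            String tenantId,
            String title,
            String body,
            String slug,
            String apiMethod,
            String version,
            String contentType,
            List<String> codeSnippets
        ) {}
    }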
    // FRAGILE: Forcing refresh on every single write
    // This creates a new segment per document, destroying query performance
    // under any meaningful write load.
    IndexRequest<DocPage> request = IndexRequest.of(r -> r
        .index("docs-v1")
        .id(page.tenantId() + ":" + page.slug())
        .routing(page.tenantId())
        .document(page)
        .refresh(Refresh.True) // New segment on every write
    );
Using Refresh.True on every index operation forces a refresh, and therefore a new Lucene segment, after each document. On the documentation platform, a tenant publishing 500 API reference pages in a batch creates 500 small segments, all on the same shard because documents are routed by tenant ID. Every subsequent search on that shard must visit all 500 segments and merge the results. Query latency degrades from 15ms to 400ms until the merge policy consolidates the segments, which itself consumes CPU and I/O.
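The segment arithmetic behind that comparison is simple enough to write down. The sketch below is a back-of-the-envelope model, not a measurement: `SegmentCountSketch` is a hypothetical name, it assumes documents arrive evenly spaced on a single shard, and it ignores background merges.

```java
/** Toy count of segments created while indexing a batch; ignores background merges. */
final class SegmentCountSketch {

    /** Refresh.True on every write: one refresh, and so one new segment, per document. */
    static int segmentsWithRefreshPerWrite(int documents) {
        return documents;
    }

    /**
     * Interval-based refresh: at most one new segment per refresh tick that saw writes.
     * Assumes one document arrives every millisPerDocument milliseconds.
     */
    static int segmentsWithIntervalRefresh(int documents, long millisPerDocument,
                                           long refreshIntervalMillis) {
        if (documents <= 0) {
            return 0;
        }
        long batchDurationMillis = documents * millisPerDocument;
        // Ceiling division: a partially filled final interval still produces a segment.
        long ticks = (batchDurationMillis + refreshIntervalMillis - 1) / refreshIntervalMillis;
        // A tick without writes creates nothing, so the count never exceeds the documents.
        return (int) Math.min(ticks, documents);
    }
}
```

Under these assumptions, 500 pages arriving 10 ms apart span 5 seconds, so a 1-second refresh interval produces 5 segments where per-write refresh produces 500.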
The correct approach: set refresh_interval to an appropriate value for your read latency requirements (1 second for near-real-time, 5-30 seconds for write-heavy workloads) and let OpenSearch batch refreshes.
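Restoring the interval after a bulk import is a single call to the index settings endpoint (`PUT /<index>/_settings`). The sketch below only builds the HTTP request with the JDK's `java.net.http` types so it stays self-contained. `RefreshIntervalRequest` is a hypothetical helper, the base URL is an assumption, and a real deployment would add authentication and TLS.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Builds (but does not send) the settings update that sets index.refresh_interval. */
final class RefreshIntervalRequest {

    static HttpRequest build(String baseUrl, String index, String interval) {
        // Example body: {"index":{"refresh_interval":"1s"}}
        String body = "{\"index\":{\"refresh_interval\":\"" + interval + "\"}}";
        return HttpRequest.newBuilder(URI.create(baseUrl + "/" + index + "/_settings"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }
}
```

A bulk import job could call `build("http://localhost:9200", "docs-v1", "30s")` before loading and `build(..., "1s")` in a finally block afterwards, sending each request with `HttpClient.newHttpClient().send(...)`. Putting the revert in a finally block is what guards against the interval being left at 30s, as happened in the incident above.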
The Measurement
The indexing path is observable through the _nodes/stats API:
    GET _nodes/stats/indices/indexing,refresh,flush,translog
Key metrics to export to Prometheus:
| Metric | What it tells you |
|---|---|
| indices.indexing.index_total | Total documents indexed |
| indices.indexing.index_time_in_millis | Time spent in analysis and indexing |
| indices.refresh.total | Number of refresh operations |
| indices.refresh.total_time_in_millis | Time spent creating new segments |
| indices.translog.operations | Documents in translog not yet flushed |
| indices.translog.size_in_bytes | Translog size on disk |
A growing translog size with stable flush.total indicates flushes are not keeping up with writes. A high refresh.total_time_in_millis relative to refresh.total indicates segments are large and refresh is expensive.
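Both signals are derived from the difference between two samples of the counters, not from their absolute values. The following is a minimal sketch of that arithmetic. `IndexingStats` and `IndexingHealth` are hypothetical names, and the record's fields are assumed to be populated from the corresponding node stats counters.

```java
/** One reading of the node stats counters; field names are illustrative. */
record IndexingStats(long refreshTotal, long refreshTimeMillis, long flushTotal, long translogBytes) {}

final class IndexingHealth {

    /** Average cost of one refresh between two samples, in milliseconds. */
    static double avgRefreshMillis(IndexingStats before, IndexingStats after) {
        long refreshes = after.refreshTotal() - before.refreshTotal();
        if (refreshes <= 0) {
            return 0.0; // no refresh ran in this window
        }
        return (double) (after.refreshTimeMillis() - before.refreshTimeMillis()) / refreshes;
    }

    /** True when the translog grew while no flush completed in the same window. */
    static boolean flushLagging(IndexingStats before, IndexingStats after) {
        return after.translogBytes() > before.translogBytes()
                && after.flushTotal() == before.flushTotal();
    }
}
```

In Prometheus the same calculation would normally be expressed as a ratio of two rates over the exported counters. The Java version is only meant to make the arithmetic explicit.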
The Decision Rule
Use Refresh.False (the default) when the application can tolerate the configured refresh_interval delay between indexing and search visibility. This covers the majority of documentation platform operations: page updates, new version publishes, bulk imports.
Use Refresh.WaitFor when the application must confirm that an indexed document is searchable before returning a response to the user, but you do not want to force a refresh that affects all other pending documents. The request blocks until the next scheduled refresh makes the document visible, so its latency is bounded by refresh_interval and is not immediate. This is appropriate for a documentation platform’s “publish and verify” workflow, where the publisher needs the page to appear in search results as soon as the publish call returns.
Never use Refresh.True in a loop or in bulk operations. The segment-per-document cost makes it unsuitable for any indexing path that processes more than a handful of documents per minute.
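The rule reduces to one question per write path: does the caller need the document to be searchable before the call returns? A minimal sketch of that decision follows. `RefreshPolicy` is a local stand-in for the client's Refresh type so the example compiles on its own, and `RefreshDecision` is a hypothetical helper.

```java
/** Local stand-in for the client's Refresh type, so the sketch compiles on its own. */
enum RefreshPolicy { FALSE, WAIT_FOR, TRUE }

final class RefreshDecision {

    /**
     * @param callerNeedsSearchVisibility true when the response must not return
     *                                    until the document can be found by a search
     */
    static RefreshPolicy choose(boolean callerNeedsSearchVisibility) {
        // TRUE is deliberately never returned: forcing a refresh per write
        // creates a segment per document on any busy indexing path.
        return callerNeedsSearchVisibility ? RefreshPolicy.WAIT_FOR : RefreshPolicy.FALSE;
    }
}
```

Centralizing the choice in one place like this makes an accidental forced refresh in a loop easier to spot in code review.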