Translog Durability and the Indexing Durability vs Speed Trade-off
Translog Durability and the Indexing Durability vs Speed Trade-off
The Symptom
The documentation platform’s bulk indexing throughput is 2,000 docs/second. The hardware should support 8,000 docs/second based on CPU, memory, and network capacity. Profiling shows that 60% of the indexing time is spent in fsync on the translog. Each bulk request waits for the translog to be durably written to disk before acknowledging the response.
The Internals
The translog (transaction log) is OpenSearch’s write-ahead log. Every document write is appended to the translog before being processed into the in-memory indexing buffer. The translog serves one purpose: recovery after an unclean shutdown.
If a node crashes after indexing a document but before the next flush (which fsyncs segments to disk), the document exists only in the in-memory buffer and the translog. On restart, OpenSearch replays the translog to recover documents that were indexed but not yet flushed to segments.
OpenSearch offers two translog durability modes:
request (default): every index, delete, update, or bulk operation waits for the translog to be fsync’d to disk before returning a response. The document is durable the moment the client receives the acknowledgment. The cost is one fsync per operation (or per bulk request).
async: the translog is fsync’d in the background based on sync_interval (default: 5 seconds). Index operations return immediately after appending to the in-memory translog buffer, without waiting for fsync. If the node crashes between fsync intervals, documents written since the last fsync are lost.
// HARDENED: Async translog for bulk import with acceptable data loss window
// Documents are replicated to replica shards before acknowledgment,
// so data loss requires simultaneous failure of primary and replica nodes
// within the sync_interval window.
client.indices().putSettings(ps -> ps
.index("docs-v1")
.settings(s -> s
.translog(t -> t
.durability("async")
.syncInterval(Time.of(ti -> ti.time("5s")))
.flushThresholdSize("1gb")
)
)
);
// After bulk import: restore default durability
client.indices().putSettings(ps -> ps
.index("docs-v1")
.settings(s -> s
.translog(t -> t
.durability("request")
)
)
);
The risk of async translog durability is bounded. OpenSearch replicates each indexing operation to replica shards before acknowledging the write (with wait_for_active_shards at its default). For data to be lost, both the primary and replica nodes must fail within the sync_interval window, and neither node’s translog must be recoverable from disk. This is a narrow failure mode.
The Implementation
Translog Size Management
The translog grows until a flush operation truncates it. Flush is triggered by:
- Translog size threshold (
index.translog.flush_threshold_size, default: 512MB). When the translog exceeds this size, a flush is triggered. - Flush interval (
index.translog.flush_threshold_period, removed in recent versions; flush is triggered by size threshold and the global flush interval).
A large translog affects recovery time. If a node crashes, the recovery process must replay every operation in the translog. A 2GB translog takes significantly longer to replay than a 256MB translog.
// Monitor translog size and uncommitted operations
public record TranslogHealth(
long operations,
long sizeBytes,
long uncommittedOperations,
long uncommittedSizeBytes
) {
public boolean needsAttention() {
return uncommittedSizeBytes > 1_000_000_000L; // > 1GB uncommitted
}
}
public TranslogHealth getTranslogHealth(String index) throws IOException {
IndicesStatsResponse stats = client.indices().stats(s -> s.index(index));
var translog = stats.indices().get(index).primaries().translog();
return new TranslogHealth(
translog.operations(),
translog.size() != null ? Long.parseLong(translog.size()) : 0,
translog.uncommittedOperations(),
translog.uncommittedSize() != null ?
Long.parseLong(translog.uncommittedSize()) : 0
);
}
The Measurement
Indexing throughput with different translog durability settings:
| Setting | Throughput (docs/s) | Bulk Latency (p50) | Bulk Latency (p99) | Data Loss Window |
|---|---|---|---|---|
request | 2,100 | 180ms | 450ms | 0 (fsync per request) |
async (5s interval) | 5,800 | 65ms | 180ms | Up to 5 seconds |
async (30s interval) | 6,200 | 55ms | 160ms | Up to 30 seconds |
The throughput improvement from request to async is 2.8x. The improvement from 5s to 30s async is marginal (7%), because the bottleneck shifts from fsync overhead to analysis and indexing buffer management.
Recovery time as a function of translog size:
| Translog Size | Uncommitted Ops | Recovery Time |
|---|---|---|
| 128MB | 50,000 | 8 seconds |
| 512MB | 200,000 | 35 seconds |
| 2GB | 800,000 | 2.5 minutes |
| 5GB | 2,000,000 | 7 minutes |
The Decision Rule
Use async translog durability during bulk imports when the data source is replayable (i.e., the import can be restarted from the source if a node failure causes data loss during the import). The 2-3x throughput improvement justifies the narrow data loss risk.
Use request translog durability (the default) for normal operation when individual document writes must be acknowledged as durable. The fsync overhead is acceptable at typical documentation platform write rates (tens to hundreds of writes per second, not thousands).
Set flush_threshold_size to 1GB for write-heavy workloads to reduce flush frequency. Set it to 256MB for latency-sensitive clusters where recovery time must stay under 30 seconds.