Merge Policy Tuning and Write Amplification

The Symptom

The documentation platform’s indexing throughput drops by 60% every 15 minutes for 2-3 minutes at a time. During these drops, I/O utilization on data nodes spikes to 100%. The write thread pool queue grows. Bulk requests that normally complete in 200ms take 1,500ms. The pattern repeats like clockwork.

The cause is segment merges. The tiered merge policy has accumulated enough small segments to trigger a large merge. The merge consumes disk I/O bandwidth, starving concurrent indexing and search operations.

The Internals

Write amplification is the ratio of total bytes written to disk versus the logical bytes of new data. In a Lucene index, every document is written at least twice: once to the translog, once to a segment. When that segment is later merged into a larger segment, the document’s data is rewritten. If the merged segment is itself merged later, the data is written again.

For the tiered merge policy with segments_per_tier=10 and max_merge_at_once=10, the theoretical write amplification is:

$$\text{write_amplification} = 1 + \log_{m}(N/s)$$

Where $m$ is max_merge_at_once, $N$ is total index size, and $s$ is the initial segment size. For a 50GB index with 50MB initial segments and merge factor 10:

$$1 + \log_{10}(50000/50) = 1 + 3 = 4\times$$

Every byte of source data is written to disk approximately 4 times over the index’s lifetime.

Merge Throttling

OpenSearch throttles merge I/O to prevent merges from consuming all disk bandwidth. The default throttle is 20MB/s per node. This throttle protects search latency but extends merge duration. During write-heavy phases, the throttle can become the bottleneck: segments accumulate faster than merges can consolidate them.

// HARDENED: Increase merge throttle during bulk import
// Reset after import to protect search latency during normal operation

client.cluster().putSettings(ps -> ps
    .transient_(t -> t
        .putAll(Map.of(
            "indices.store.throttle.max_bytes_per_sec", JsonData.of("100mb")
        ))
    )
);

// After bulk import: reset to default
client.cluster().putSettings(ps -> ps
    .transient_(t -> t
        .putAll(Map.of(
            "indices.store.throttle.max_bytes_per_sec", JsonData.of("20mb")
        ))
    )
);

Force Merge

_forcemerge triggers an immediate merge of all segments into a specified number of target segments. It is commonly used after a bulk import to consolidate the segment count and improve query performance.

// HARDENED: Force merge after bulk import to a read-only index
// Reduces segment count from hundreds to 1 per shard.
// Only safe on indices that will no longer receive writes.

client.indices().forcemerge(fm -> fm
    .index("docs-v1")
    .maxNumSegments(1)
);

// FRAGILE: Force merge on an actively-written index
// The merge completes. A new document is indexed. A new segment is created.
// Now there are 2 segments: the giant merged one and a tiny new one.
// The next merge must rewrite the entire giant segment to incorporate
// the tiny one. Write amplification becomes catastrophic.

// DON'T: force merge while writes are ongoing
client.indices().forcemerge(fm -> fm
    .index("docs-write-active")
    .maxNumSegments(1)
);

Force merge to 1 segment on a 50GB shard means rewriting 50GB of data in a single merge operation. During this merge, the shard’s disk usage temporarily doubles (old segments + new merged segment). If the disk does not have 50GB of free space, the merge fails.

The Implementation

A merge monitoring utility:

public class MergeMonitor {

    private final OpenSearchClient client;

    public MergeMonitor(OpenSearchClient client) {
        this.client = client;
    }

    public record MergeStats(
        long totalMerges,
        long totalMergeTimeMillis,
        long currentMerges,
        long totalMergeSizeBytes,
        long segmentCount
    ) {}

    public MergeStats getMergeStats(String index) throws IOException {
        IndicesStatsResponse stats = client.indices().stats(s -> s.index(index));

        var mergeStats = stats.indices().get(index).primaries().merges();
        var segStats = stats.indices().get(index).primaries().segments();

        return new MergeStats(
            mergeStats.total(),
            mergeStats.totalTimeInMillis(),
            mergeStats.current(),
            mergeStats.totalSizeInBytes(),
            segStats.count()
        );
    }
}

The Measurement

Monitor merge activity during a bulk import of 1 million documents:

Phase	Segment Count	Merge I/O (MB/s)	Index Throughput (docs/s)	Avg Bulk Latency
Start (refresh=1s)	0	0	3,200	180ms
Mid-import (refresh=1s)	340	18	1,800	450ms
Start (refresh=30s)	0	0	4,500	120ms
Mid-import (refresh=30s)	28	8	4,200	140ms
Post-forcemerge	1	0	N/A	N/A

With refresh_interval=1s, segment count reaches 340 before merges can consolidate them. With refresh_interval=30s, the peak segment count stays at 28. Indexing throughput is 2.3x higher with the longer refresh interval.

The Decision Rule

Increase merge throttle to 100MB/s during bulk imports and reset to 20MB/s for normal operation. The default throttle protects search latency at the cost of indexing throughput during write-heavy phases.

Force merge only indices that are read-only (no further writes). Force merging an actively-written index creates a worst-case merge pattern on the next write.

Monitor segment count per shard as a leading indicator of merge pressure. If segment count consistently exceeds 50 during normal operation (not bulk imports), the merge policy is not keeping up with the write rate. Increase the refresh interval or add data nodes to distribute the write load.