Shard Allocation and Rebalancing Mechanics
Shard Allocation and Rebalancing Mechanics
The Symptom
A three-node cluster in a two-availability-zone deployment loses one zone. The cluster goes red. Investigation reveals that both the primary and replica of shard 2 were on nodes in the same zone. The replica that was supposed to provide fault tolerance was co-located with the primary it was protecting.
The Internals
OpenSearch’s shard allocation is governed by a set of deciders that evaluate where each shard can be placed. The allocation process runs when:
- An index is created (initial allocation)
- A node joins or leaves the cluster (rebalancing)
- A shard fails (reallocation to a surviving node)
- Allocation settings are changed (forced reallocation)
The deciders evaluate in sequence. Each decider can allow, deny, or throttle a shard allocation. The key deciders:
SameShardAllocationDecider: prevents a primary and its replica from being placed on the same node. This is always enabled and cannot be disabled.
AwarenessAllocationDecider: distributes replicas across different awareness values (typically availability zones or racks). When configured with cluster.routing.allocation.awareness.attributes: zone, OpenSearch ensures that if a primary is in zone A, its replica is placed in zone B.
DiskThresholdDecider: prevents allocation to nodes with low disk space. The default thresholds are 85% (warning, no new shards allocated) and 90% (critical, shards relocated away).
FilterAllocationDecider: respects per-index allocation filtering rules (index.routing.allocation.include, exclude, require).
The Implementation
Zone-Aware Allocation
# opensearch.yml for nodes in zone-a
cluster.routing.allocation.awareness.attributes: zone
node.attr.zone: zone-a
# opensearch.yml for nodes in zone-b
cluster.routing.allocation.awareness.attributes: zone
node.attr.zone: zone-b
# HARDENED: Force awareness to prevent all replicas landing in one zone
# when zone-b nodes are temporarily unavailable
cluster.routing.allocation.awareness.force.zone.values: zone-a,zone-b
Without forced awareness, if all zone-b nodes go down, OpenSearch allocates the unassigned replicas to zone-a nodes. The cluster turns green. When zone-b returns, those replicas do not automatically move back. The cluster runs with all copies in zone-a, providing zero fault tolerance against a zone-a failure.
With forced awareness, OpenSearch refuses to allocate replicas that would violate the zone distribution requirement. The cluster stays yellow (unassigned replicas) but maintains the invariant that zone-a and zone-b failures are survivable once both zones are healthy.
Monitoring Shard Distribution
// Check shard allocation across nodes and zones
public Map<String, List<String>> getShardDistribution(String index)
throws IOException {
CatShardsResponse response = client.cat().shards(s -> s
.index(index)
.headers("index", "shard", "prirep", "state", "node")
);
Map<String, List<String>> nodeShards = new HashMap<>();
for (ShardsRecord record : response.valueBody()) {
String nodeId = record.node();
String shardInfo = String.format("%s-%s(%s)",
record.index(), record.shard(), record.prirep());
nodeShards.computeIfAbsent(nodeId, k -> new ArrayList<>()).add(shardInfo);
}
return nodeShards;
}
The Measurement
Key metrics to export to Prometheus:
| Metric | Source | Alert Threshold |
|---|---|---|
| Unassigned shards | _cluster/health | > 0 for more than 5 minutes |
| Relocating shards | _cluster/health | > 0 sustained (indicates rebalancing storm) |
| Disk usage per node | _cat/nodes?h=disk.used_percent | > 80% |
| Shards per node | _cat/nodes?h=node.name,shards | > 1000 per node |
An unassigned shard count that stays above zero indicates an allocation problem. The _cluster/allocation/explain API reveals exactly why a shard cannot be allocated:
GET _cluster/allocation/explain
{
"index": "docs-v1",
"shard": 2,
"primary": false
}
The Decision Rule
Configure zone awareness in any cluster spanning multiple availability zones. Use forced awareness when the cluster has exactly 2 zones, preventing a single-zone failure from creating a false-green cluster status.
Set disk watermarks lower than the defaults for clusters with large shards: 75% warning, 85% critical. The time to relocate a 50GB shard at 100MB/s network throughput is 8 minutes. If the disk fills during relocation, the node becomes unresponsive.
Monitor unassigned shard count as the primary cluster health indicator. Zero unassigned shards means every shard has both a primary and the configured number of replicas allocated to nodes. Any other number requires investigation.