Shard Allocation and Rebalancing Mechanics

The Symptom

A three-node cluster in a two-availability-zone deployment loses one zone. The cluster goes red. Investigation reveals that both the primary and replica of shard 2 were on nodes in the same zone. The replica that was supposed to provide fault tolerance was co-located with the primary it was protecting.

The Internals

OpenSearch’s shard allocation is governed by a set of deciders that evaluate where each shard can be placed. The allocation process runs when:

An index is created (initial allocation)
A node joins or leaves the cluster (rebalancing)
A shard fails (reallocation to a surviving node)
Allocation settings are changed (forced reallocation)

The deciders evaluate in sequence. Each decider can allow, deny, or throttle a shard allocation. The key deciders:

SameShardAllocationDecider: prevents a primary and its replica from being placed on the same node. This is always enabled and cannot be disabled.

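AwarenessAllocationDecider: distributes replicas across different awareness values (typically availability zones or racks). When configured with cluster.routing.allocation.awareness.attributes: zone, OpenSearch tries to place the copies of a shard in different zones, so a primary in zone A gets its replica in zone B whenever a zone B node is available. Without forced awareness this is best effort: if only one zone has live nodes, every copy can land there.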

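DiskThresholdDecider: prevents allocation to nodes with low disk space. The defaults are a low watermark of 85% (no new shards are allocated to the node) and a high watermark of 90% (OpenSearch starts relocating shards away from the node). A third threshold, the flood-stage watermark at 95%, puts a read-only block on every index that has a shard on the node.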

FilterAllocationDecider: respects per-index allocation filtering rules (index.routing.allocation.include, exclude, require).
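
How the individual verdicts combine can be sketched in a few lines. The model below is a simplification for illustration only: the types (Candidate, Decider, Decision) are hypothetical stand-ins rather than the OpenSearch classes, and the limit of 2 concurrent recoveries per node is an assumed value. It shows the rule that a single deny vetoes the placement and a throttle delays it when nothing denies.

```java
import java.util.List;

// Simplified model of a decider chain. These types are illustrative
// stand-ins, not the OpenSearch classes.
final class DeciderChainSketch {

    enum Decision { YES, THROTTLE, NO }

    // Hypothetical snapshot of one "put this shard copy on that node" attempt.
    static final class Candidate {
        final boolean nodeHoldsCopyOfShard;
        final double nodeDiskUsedPercent;
        final int recoveriesInFlightOnNode;

        Candidate(boolean nodeHoldsCopyOfShard, double nodeDiskUsedPercent,
                  int recoveriesInFlightOnNode) {
            this.nodeHoldsCopyOfShard = nodeHoldsCopyOfShard;
            this.nodeDiskUsedPercent = nodeDiskUsedPercent;
            this.recoveriesInFlightOnNode = recoveriesInFlightOnNode;
        }
    }

    interface Decider {
        Decision canAllocate(Candidate c);
    }

    // Same-shard rule: never two copies of one shard on the same node.
    static final Decider SAME_SHARD =
        c -> c.nodeHoldsCopyOfShard ? Decision.NO : Decision.YES;

    // Low-watermark rule, using the 85% default quoted in the text.
    static final Decider DISK_THRESHOLD =
        c -> c.nodeDiskUsedPercent > 85.0 ? Decision.NO : Decision.YES;

    // Throttle when the node already runs 2 recoveries (assumed limit).
    static final Decider RECOVERY_THROTTLE =
        c -> c.recoveriesInFlightOnNode >= 2 ? Decision.THROTTLE : Decision.YES;

    // Any NO wins immediately; otherwise one THROTTLE is enough to delay.
    static Decision decide(List<Decider> deciders, Candidate c) {
        Decision result = Decision.YES;
        for (Decider d : deciders) {
            Decision verdict = d.canAllocate(c);
            if (verdict == Decision.NO) {
                return Decision.NO;
            }
            if (verdict == Decision.THROTTLE) {
                result = Decision.THROTTLE;
            }
        }
        return result;
    }
}
```

A node at 91% disk with three recoveries in flight yields NO rather than THROTTLE: waiting does not help when a hard rule already rejects the node.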

The Implementation

Zone-Aware Allocation

# opensearch.yml for nodes in zone-a
cluster.routing.allocation.awareness.attributes: zone
node.attr.zone: zone-a

# opensearch.yml for nodes in zone-b
cluster.routing.allocation.awareness.attributes: zone
node.attr.zone: zone-b

# HARDENED (set on every node, in both zones): force awareness so that
# replicas are not all allocated to the surviving zone while the other
# zone is unavailable
cluster.routing.allocation.awareness.force.zone.values: zone-a,zone-b

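Without forced awareness, if all zone-b nodes go down, OpenSearch allocates the unassigned replicas to zone-a nodes. The cluster turns green, but every copy now sits in zone-a: the surviving zone has to absorb the extra disk and recovery load, and the green status hides the fact that there is zero fault tolerance against a zone-a failure. When zone-b returns, OpenSearch relocates copies back to restore the spread, which costs a second round of recovery traffic.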

With forced awareness, OpenSearch refuses to allocate replicas that would violate the zone distribution requirement. The cluster stays yellow (unassigned replicas) while a zone is down, but the replicas are held back for the missing zone, so once both zones are healthy again a failure of either zone is survivable.
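
The difference between the two modes comes down to how many zones the decider divides the shard copies by. The sketch below is a simplified model of that capacity check, not the OpenSearch implementation; the class and parameter names are hypothetical. It assumes the per-zone limit is the copy count divided by the number of known zones, rounded up, where forced values count as known zones even when they have no live nodes.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified model of the awareness capacity check (illustrative only).
final class AwarenessSketch {

    /**
     * @param targetZone            zone of the node being considered
     * @param zonesWithNodes        zones that currently have a live node
     * @param forcedZones           awareness.force.zone.values, empty if unset
     * @param copiesPerShard        primary plus replicas
     * @param zonesOfAssignedCopies zone of each copy that is already assigned
     */
    static boolean canAllocate(String targetZone,
                               Set<String> zonesWithNodes,
                               Set<String> forcedZones,
                               int copiesPerShard,
                               List<String> zonesOfAssignedCopies) {
        // Forced values count as zones even when none of their nodes is up.
        Set<String> knownZones = new HashSet<>(zonesWithNodes);
        knownZones.addAll(forcedZones);
        if (knownZones.isEmpty()) {
            throw new IllegalArgumentException("no zones known");
        }

        // Ceiling division: the most copies any single zone may hold.
        int maxPerZone =
            (copiesPerShard + knownZones.size() - 1) / knownZones.size();

        // Copies already in the target zone, plus the one being placed.
        long copiesInTarget = zonesOfAssignedCopies.stream()
            .filter(targetZone::equals)
            .count() + 1;

        return copiesInTarget <= maxPerZone;
    }
}
```

With zone-b down and the primary in zone-a, the unforced check divides 2 copies by 1 zone and lets the replica into zone-a. The forced check divides by 2 zones, caps each zone at 1 copy, and leaves the replica unassigned.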

Monitoring Shard Distribution

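import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.opensearch.client.opensearch.cat.ShardsResponse;
import org.opensearch.client.opensearch.cat.shards.ShardsRecord;

// Group the shard copies of an index by the node that holds them.
// Assumes `client` is an OpenSearchClient field (opensearch-java).
public Map<String, List<String>> getShardDistribution(String index)
        throws IOException {

    ShardsResponse response = client.cat().shards(s -> s
        .index(index)
        .headers("index", "shard", "prirep", "state", "node")
    );

    Map<String, List<String>> nodeShards = new HashMap<>();

    for (ShardsRecord record : response.valueBody()) {
        // node() is the node name; it is null for unassigned shards
        String node = record.node() != null ? record.node() : "UNASSIGNED";
        String shardInfo = String.format("%s-%s(%s)",
            record.index(), record.shard(), record.prirep());
        nodeShards.computeIfAbsent(node, k -> new ArrayList<>()).add(shardInfo);
    }

    return nodeShards;
}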
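
Grouping by node does not by itself reveal the failure from The Symptom; that requires mapping each node to its zone. The following is a minimal sketch of that check under stated assumptions: the Copy type and the nodeToZone map are hypothetical, and in practice the map would be built from the node.attr.zone value of each node.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Flags shards whose assigned copies all sit in a single zone.
final class ZoneSpreadCheck {

    // Hypothetical flattened view of one assigned shard copy.
    static final class Copy {
        final String index;
        final int shard;
        final String node;

        Copy(String index, int shard, String node) {
            this.index = index;
            this.shard = shard;
            this.node = node;
        }
    }

    // Returns "index-shard" keys, sorted, for shards that have more than
    // one assigned copy but only one distinct zone among those copies.
    static List<String> singleZoneShards(List<Copy> copies,
                                         Map<String, String> nodeToZone) {
        Map<String, Set<String>> zonesByShard = new TreeMap<>();
        Map<String, Integer> copyCount = new TreeMap<>();

        for (Copy c : copies) {
            String key = c.index + "-" + c.shard;
            // Nodes without a known zone share one bucket, so they are
            // treated conservatively as co-located.
            String zone = nodeToZone.getOrDefault(c.node, "unknown");
            zonesByShard.computeIfAbsent(key, k -> new HashSet<>()).add(zone);
            copyCount.merge(key, 1, Integer::sum);
        }

        List<String> atRisk = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : zonesByShard.entrySet()) {
            if (copyCount.get(e.getKey()) > 1 && e.getValue().size() == 1) {
                atRisk.add(e.getKey());
            }
        }
        return atRisk;
    }
}
```

For the three-node cluster in the opening scenario, with node-1 and node-2 in zone-a and both copies of shard 2 on those nodes, the check returns that shard even though cluster health reports green.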

The Measurement

Key metrics to export to Prometheus:

Metric	Source	Alert Threshold
Unassigned shards	`_cluster/health`	> 0 for more than 5 minutes
Relocating shards	`_cluster/health`	> 0 sustained (indicates rebalancing storm)
Disk usage per node	`_cat/nodes?h=disk.used_percent`	> 80%
Shards per node	`_cat/allocation?h=node,shards`	> 1000 per node
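
The "> 0 for more than 5 minutes" condition is normally expressed in Prometheus itself with the for clause of an alerting rule. Where the check has to live in application code instead, the logic can be sketched as below. The class name is hypothetical, and the caller is assumed to feed in the unassigned shard count from each health poll.

```java
import java.time.Duration;
import java.time.Instant;

// Fires only after a value has stayed above zero for longer than a grace
// period, so short-lived unassigned shards (for example during a node
// restart) do not page anyone.
final class SustainedThresholdAlert {

    private final Duration grace;
    private Instant breachStart; // null while the value is at zero

    SustainedThresholdAlert(Duration grace) {
        this.grace = grace;
    }

    // Record one sample; returns true when the alert should fire.
    boolean observe(int unassignedShards, Instant now) {
        if (unassignedShards <= 0) {
            breachStart = null; // healthy sample resets the timer
            return false;
        }
        if (breachStart == null) {
            breachStart = now;  // first sample of a new breach
        }
        return Duration.between(breachStart, now).compareTo(grace) > 0;
    }
}
```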

An unassigned shard count that stays above zero indicates an allocation problem. The _cluster/allocation/explain API reveals exactly why a shard cannot be allocated:

GET _cluster/allocation/explain
{
  "index": "docs-v1",
  "shard": 2,
  "primary": false
}

The Decision Rule

Configure zone awareness in any cluster spanning multiple availability zones. Use forced awareness when the cluster has exactly 2 zones, preventing a single-zone failure from creating a false-green cluster status.

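Set disk watermarks lower than the defaults for clusters with large shards: 75% for the low watermark, 85% for the high watermark. Relocating a 50GB shard at 100MB/s of network throughput takes about 8 minutes (500 seconds), and about 21 minutes at the default recovery throttle of 40MB/s (indices.recovery.max_bytes_per_sec). The node keeps accepting writes during that window. If disk usage crosses the flood-stage watermark (95% by default), OpenSearch blocks writes to the affected indices, and a completely full disk can leave the node unresponsive.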
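
The headroom argument is simple arithmetic, sketched here so it can be rerun for other shard sizes. The class and method names are hypothetical, sizes use decimal units (1GB = 10^9 bytes), and the ingest rate in the second method is an assumed input, not a measured value.

```java
// Back-of-the-envelope estimates for sizing disk watermark headroom.
final class RelocationEstimate {

    // Seconds needed to copy a shard at a steady transfer rate, rounded up.
    static long seconds(long shardBytes, long bytesPerSecond) {
        if (bytesPerSecond <= 0) {
            throw new IllegalArgumentException("rate must be positive");
        }
        return (shardBytes + bytesPerSecond - 1) / bytesPerSecond;
    }

    // Disk usage (percent) after the node keeps ingesting at a steady
    // rate for the given number of seconds.
    static double projectedUsedPercent(long usedBytes, long totalBytes,
                                       long ingestBytesPerSecond,
                                       long seconds) {
        long projected = usedBytes + ingestBytesPerSecond * seconds;
        return 100.0 * projected / totalBytes;
    }
}
```

For example, seconds(50_000_000_000L, 100_000_000L) gives 500. A 1TB node at 85% that keeps ingesting 100MB/s for those 500 seconds ends at 90%, which is why the high watermark needs to leave room for at least one relocation window.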

Monitor unassigned shard count as the primary cluster health indicator. Zero unassigned shards means every shard has both a primary and the configured number of replicas allocated to nodes. Any other number requires investigation.