Cluster Architecture: Shards, Replicas, and the Routing Decision That Determines Your Latency
Cluster Architecture
A three-node OpenSearch cluster runs the documentation search platform. Response times are under 50ms. The team adds a fourth node to “improve performance.” Response times do not change. The team increases the shard count from 3 to 12 to “distribute load across more shards.” Response times increase to 120ms. Adding hardware made search slower.
The problem is that the team treated shard count as a performance dial. It is not. Shard count determines the minimum number of Lucene index structures that must be searched and merged for every query. More shards means more fan-out, more network round trips to the coordinating node, and more merge work. Adding shards without adding proportional data volume adds overhead without benefit.
Shards and Replicas
An OpenSearch index is a logical grouping. The data physically lives in shards. Each shard is a complete Lucene index with its own inverted index, doc values, and stored fields.
Primary shards own the original data. When a document is indexed, it is routed to one primary shard based on a hash of the routing value (by default, the document ID). The number of primary shards is set at index creation and cannot be changed without reindexing.
Replica shards are copies of primary shards. They serve two purposes: fault tolerance (if a node hosting a primary shard fails, a replica is promoted) and read throughput (search queries can execute on any replica).
The routing formula:
$$\text{shard} = \text{hash}(\text{routing_value}) \mod \text{number_of_primary_shards}$$
This formula is why the primary shard count is immutable. Changing it changes the modulus, which changes the shard assignment for every document. Existing documents would not be findable on their new calculated shards because they still physically reside on their old shards.
Custom Routing for Multi-Tenant Search
The documentation platform serves multiple tenants. Each tenant’s search queries should only touch shards containing that tenant’s data. Without custom routing, a tenant with 1,000 documents in a 5-shard index has documents distributed across all 5 shards. Every search query fans out to all 5 shards, even though only a fraction of each shard’s data is relevant.
Custom routing directs all documents for a tenant to the same shard:
// HARDENED: Custom routing ensures tenant data co-location
// All documents for tenant-acme route to the same shard.
// A filtered query for tenant-acme searches only one shard.
IndexRequest<DocPage> request = IndexRequest.of(r -> r
.index("docs-v1")
.id(page.tenantId() + ":" + page.slug())
.routing(page.tenantId())
.document(page)
);
// HARDENED: Search with routing to limit shard fan-out
// Only the shard containing tenant-acme's documents is searched.
SearchRequest request = SearchRequest.of(s -> s
.index("docs-v1")
.routing(tenantId)
.query(q -> q
.bool(b -> b
.filter(f -> f.term(t -> t.field("tenant_id").value(tenantId)))
.must(mu -> mu.match(m -> m.field("body").query(userQuery)))
)
)
);
// FRAGILE: No routing, no filter
// Searches all shards for all tenants, returns results from other tenants.
// This is a data isolation failure, not just a performance problem.
SearchRequest request = SearchRequest.of(s -> s
.index("docs-v1")
.query(q -> q.match(m -> m.field("body").query(userQuery)))
);
The tenant_id filter in the bool query is necessary even with routing. Routing controls which shard is searched, not which documents within that shard are returned. Two tenants can hash to the same shard. Without the filter, tenant A sees tenant B’s documents.
Shard Sizing
The common guideline is 30-50GB per shard. This guideline exists because:
- Heap overhead per shard. Each shard consumes heap memory for metadata, field data caches, and segment file handles. At 1,000 shards per node, the per-shard overhead alone can consume multiple gigabytes of heap.
- Recovery time. A failed shard must be copied to another node. A 50GB shard takes minutes to recover. A 200GB shard takes an hour.
- Merge efficiency. Lucene’s tiered merge policy works well when segments within a shard are reasonably sized. Very large shards (hundreds of GB) produce very large segments that are expensive to merge.
For the documentation platform:
| Tenant Size | Documents | Estimated Index Size | Recommended Shards |
|---|---|---|---|
| Small (< 10K docs) | 10,000 | ~500MB | 1 primary |
| Medium (10K-100K) | 100,000 | ~5GB | 1 primary |
| Large (100K-1M) | 1,000,000 | ~50GB | 2-3 primaries |
| Very large (> 1M) | 5,000,000 | ~250GB | 5-8 primaries |
Over-sharding is the more common mistake. A 500MB index split across 5 shards means each shard is 100MB. The query fan-out overhead of searching 5 shards and merging results exceeds the benefit of parallelism at this data volume. One shard serves the same query faster.
Node Roles
OpenSearch nodes can serve multiple roles, and understanding the role separation is critical for cluster sizing:
Cluster manager nodes maintain cluster state: shard allocation, index metadata, node membership. They should not serve data or handle search queries. Three dedicated cluster manager nodes provide quorum-based leader election.
Data nodes store shard data and execute search and indexing operations. This is where CPU, memory, and disk I/O matter.
Coordinating nodes (also called client nodes) receive search requests, fan them out to data nodes, merge the results, and return the response. For high-query-volume clusters, dedicated coordinating nodes prevent query merge work from competing with indexing on data nodes.
Ingest nodes run ingest pipelines (document enrichment before indexing). For the documentation platform’s Kafka-based indexing pipeline, ingest is handled by the application layer, not by OpenSearch ingest nodes.
# Kubernetes StatefulSet for a production documentation search cluster
# 3 cluster manager nodes, 6 data nodes, 2 coordinating nodes
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: opensearch-data
spec:
replicas: 6
template:
spec:
containers:
- name: opensearch
image: opensearchproject/opensearch:2.12.0
env:
- name: node.roles
value: "data"
- name: OPENSEARCH_JAVA_OPTS
value: "-Xms16g -Xmx16g"
resources:
requests:
memory: "32Gi"
cpu: "8"
limits:
memory: "32Gi"
The diagram shows a production cluster layout for the documentation search platform. The coordinating node receives the search request, determines which shards need to be queried (using routing if specified), fans the query out to the data nodes hosting those shards, collects partial results from each shard, merges them into a final ranked list, and returns the response. Each data node hosts multiple shards from different indices, and replicas are distributed so that no single node failure loses all copies of a shard.
The Decision Rule
Start with 1 primary shard and 1 replica for any index under 30GB. Add primary shards only when a single shard’s query latency degrades measurably (not speculatively) and the degradation is caused by data volume, not by an unoptimized query or a missing filter.
Use custom routing for multi-tenant indices to eliminate cross-tenant shard fan-out. Accept the trade-off: if one tenant has disproportionately more data, their shard becomes a hot spot. Monitor per-shard sizes and split tenants into their own index when a single shard exceeds 50GB.
Deploy dedicated cluster manager nodes in any cluster with more than 5 data nodes. The cost is three small VMs. The benefit is isolation of cluster state management from search and indexing load.