Resilience: Split Brain, Snapshots, and Recovery
Resilience: Split Brain, Snapshots, and Recovery
The documentation search platform serves 50 tenants. A network partition isolates two of seven nodes. Without proper quorum configuration, both sides of the partition elect a cluster manager and begin accepting writes. When the partition heals, two divergent copies of the cluster state exist. One must be discarded, and any writes to the discarded side are lost.
Split Brain Prevention
Split brain occurs when a cluster manager quorum cannot be maintained during a network partition. OpenSearch prevents split brain through the discovery.seed_hosts and cluster.initial_cluster_manager_nodes settings.
The quorum formula: minimum_master_nodes = (number_of_manager_eligible_nodes / 2) + 1
For a 7-node cluster with 3 dedicated cluster manager nodes, the quorum is 2. A partition that isolates 1 manager node from the other 2 leaves the majority side with quorum. The minority side cannot elect a new manager and goes read-only.
# opensearch.yml for cluster manager nodes
# Dedicated cluster manager (no data, no ingest)
node.roles:
- cluster_manager
# Initial bootstrap (only used on first cluster formation)
cluster.initial_cluster_manager_nodes:
- manager-1
- manager-2
- manager-3
# Discovery (all manager-eligible nodes)
discovery.seed_hosts:
- manager-1:9300
- manager-2:9300
- manager-3:9300
Snapshot Configuration
// HARDENED: S3 snapshot repository with server-side encryption
public class SnapshotManager {
private final OpenSearchClient client;
public SnapshotManager(OpenSearchClient client) {
this.client = client;
}
public void createS3Repository(String repoName, String bucket,
String region) throws IOException {
client.snapshot().createRepository(cr -> cr
.name(repoName)
.type("s3")
.settings(s -> s
.putAll(Map.of(
"bucket", JsonData.of(bucket),
"region", JsonData.of(region),
"server_side_encryption", JsonData.of(true),
"max_snapshot_bytes_per_sec", JsonData.of("200mb"),
"max_restore_bytes_per_sec", JsonData.of("200mb"),
"compress", JsonData.of(true)
))
)
);
}
public void takeSnapshot(String repoName, String snapshotName,
List<String> indices) throws IOException {
client.snapshot().create(cs -> cs
.repository(repoName)
.snapshot(snapshotName)
.indices(indices)
.includeGlobalState(false) // Don't snapshot cluster settings
.waitForCompletion(false) // Run as background task
);
}
public void scheduleSnapshots(String repoName) throws IOException {
// ISM policy for automated daily snapshots
Request request = new Request("PUT",
"/_plugins/_sm/policies/daily-snapshot");
request.setJsonEntity("""
{
"description": "Daily snapshot of all documentation indices",
"creation": {
"schedule": {
"cron": {
"expression": "0 2 * * *",
"timezone": "UTC"
}
}
},
"deletion": {
"schedule": {
"cron": {
"expression": "0 3 * * *",
"timezone": "UTC"
}
},
"condition": {
"max_age": "30d",
"max_count": 30
}
},
"snapshot_config": {
"repository": "%s",
"indices": "docs-*",
"include_global_state": false
}
}
""".formatted(repoName));
restClient.performRequest(request);
}
}
Recovery from Snapshot
// Full restore: recover all indices from a snapshot
public void fullRestore(String repoName, String snapshotName)
throws IOException {
client.snapshot().restore(rs -> rs
.repository(repoName)
.snapshot(snapshotName)
.includeGlobalState(false)
.waitForCompletion(false)
);
}
// Partial restore: recover a single tenant's index
public void restoreTenantIndex(String repoName, String snapshotName,
String tenantId) throws IOException {
String indexPattern = "docs-" + tenantId + "-*";
client.snapshot().restore(rs -> rs
.repository(repoName)
.snapshot(snapshotName)
.indices(indexPattern)
.includeGlobalState(false)
.renamePattern("(.+)")
.renameReplacement("restored_$1")
.waitForCompletion(false)
);
}
The diagram illustrates the three-layer resilience architecture: quorum-based split-brain prevention (layer 1), daily snapshots to S3 (layer 2), and recovery paths—shard recovery from peer nodes (fast, minutes) and snapshot restore (slower, proportional to data size).
RTO and RPO Design
| Recovery Scenario | RPO | RTO | Mechanism |
|---|---|---|---|
| Single node failure | 0 (replicas) | 2-10 min | Shard reallocation |
| Two node failure (with replicas) | 0 | 5-20 min | Shard reallocation |
| Availability zone failure | 0 (zone-aware replicas) | 10-30 min | Cross-zone reallocation |
| Full cluster loss | Up to 24h | 1-4 hrs | Snapshot restore from S3 |
| Accidental index deletion | Up to 24h | 20-60 min | Snapshot restore |
| Data corruption | Up to 24h | 1-4 hrs | Snapshot restore |
The RPO of “up to 24h” for snapshot-based recovery assumes daily snapshots. If the business requires a lower RPO, increase snapshot frequency to every 4 or 6 hours, reducing RPO to 4-6 hours at the cost of additional snapshot storage and cluster overhead.
The Decision Rule
Deploy 3 dedicated cluster manager nodes in production. Never run manager-eligible and data roles on the same node in a cluster with more than 5 nodes. A data node under memory or I/O pressure can delay cluster manager heartbeats, triggering unnecessary manager elections.
Take daily snapshots to an off-cluster repository (S3, GCS, or Azure Blob). Verify snapshot integrity monthly by restoring to a test cluster. A snapshot that has never been restored is an untested assumption.
Design for the most likely failure mode (single node failure, RTO: minutes) rather than the worst case (full cluster loss, RTO: hours). Zone-aware replica allocation handles the most likely failure with zero data loss and minimal recovery time.