Resilience: Split Brain, Snapshots, and Recovery

The documentation search platform serves 50 tenants. A network partition isolates two of seven nodes. Without proper quorum configuration, both sides of the partition elect a cluster manager and begin accepting writes. When the partition heals, two divergent copies of the cluster state exist. One must be discarded, and any writes to the discarded side are lost.

Split Brain Prevention

Split brain occurs when a cluster manager quorum cannot be maintained during a network partition. OpenSearch prevents split brain through the discovery.seed_hosts and cluster.initial_cluster_manager_nodes settings.

The quorum formula: minimum_master_nodes = (number_of_manager_eligible_nodes / 2) + 1

For a 7-node cluster with 3 dedicated cluster manager nodes, the quorum is 2. A partition that isolates 1 manager node from the other 2 leaves the majority side with quorum. The minority side cannot elect a new manager and goes read-only.

# opensearch.yml for cluster manager nodes

# Dedicated cluster manager (no data, no ingest)
node.roles:
  - cluster_manager

# Initial bootstrap (only used on first cluster formation)
cluster.initial_cluster_manager_nodes:
  - manager-1
  - manager-2
  - manager-3

# Discovery (all manager-eligible nodes)
discovery.seed_hosts:
  - manager-1:9300
  - manager-2:9300
  - manager-3:9300

Snapshot Configuration

// HARDENED: S3 snapshot repository with server-side encryption

public class SnapshotManager {

    private final OpenSearchClient client;

    public SnapshotManager(OpenSearchClient client) {
        this.client = client;
    }

    public void createS3Repository(String repoName, String bucket,
            String region) throws IOException {
        client.snapshot().createRepository(cr -> cr
            .name(repoName)
            .type("s3")
            .settings(s -> s
                .putAll(Map.of(
                    "bucket", JsonData.of(bucket),
                    "region", JsonData.of(region),
                    "server_side_encryption", JsonData.of(true),
                    "max_snapshot_bytes_per_sec", JsonData.of("200mb"),
                    "max_restore_bytes_per_sec", JsonData.of("200mb"),
                    "compress", JsonData.of(true)
                ))
            )
        );
    }

    public void takeSnapshot(String repoName, String snapshotName,
            List<String> indices) throws IOException {
        client.snapshot().create(cs -> cs
            .repository(repoName)
            .snapshot(snapshotName)
            .indices(indices)
            .includeGlobalState(false)  // Don't snapshot cluster settings
            .waitForCompletion(false)   // Run as background task
        );
    }

    public void scheduleSnapshots(String repoName) throws IOException {
        // ISM policy for automated daily snapshots
        Request request = new Request("PUT",
            "/_plugins/_sm/policies/daily-snapshot");
        request.setJsonEntity("""
            {
              "description": "Daily snapshot of all documentation indices",
              "creation": {
                "schedule": {
                  "cron": {
                    "expression": "0 2 * * *",
                    "timezone": "UTC"
                  }
                }
              },
              "deletion": {
                "schedule": {
                  "cron": {
                    "expression": "0 3 * * *",
                    "timezone": "UTC"
                  }
                },
                "condition": {
                  "max_age": "30d",
                  "max_count": 30
                }
              },
              "snapshot_config": {
                "repository": "%s",
                "indices": "docs-*",
                "include_global_state": false
              }
            }
            """.formatted(repoName));

        restClient.performRequest(request);
    }
}

Recovery from Snapshot

// Full restore: recover all indices from a snapshot

public void fullRestore(String repoName, String snapshotName)
        throws IOException {
    client.snapshot().restore(rs -> rs
        .repository(repoName)
        .snapshot(snapshotName)
        .includeGlobalState(false)
        .waitForCompletion(false)
    );
}

// Partial restore: recover a single tenant's index

public void restoreTenantIndex(String repoName, String snapshotName,
        String tenantId) throws IOException {

    String indexPattern = "docs-" + tenantId + "-*";

    client.snapshot().restore(rs -> rs
        .repository(repoName)
        .snapshot(snapshotName)
        .indices(indexPattern)
        .includeGlobalState(false)
        .renamePattern("(.+)")
        .renameReplacement("restored_$1")
        .waitForCompletion(false)
    );
}

Cluster resilience architecture showing quorum, snapshot lifecycle, and recovery paths

The diagram illustrates the three-layer resilience architecture: quorum-based split-brain prevention (layer 1), daily snapshots to S3 (layer 2), and recovery paths—shard recovery from peer nodes (fast, minutes) and snapshot restore (slower, proportional to data size).

RTO and RPO Design

Recovery Scenario	RPO	RTO	Mechanism
Single node failure	0 (replicas)	2-10 min	Shard reallocation
Two node failure (with replicas)	0	5-20 min	Shard reallocation
Availability zone failure	0 (zone-aware replicas)	10-30 min	Cross-zone reallocation
Full cluster loss	Up to 24h	1-4 hrs	Snapshot restore from S3
Accidental index deletion	Up to 24h	20-60 min	Snapshot restore
Data corruption	Up to 24h	1-4 hrs	Snapshot restore

The RPO of “up to 24h” for snapshot-based recovery assumes daily snapshots. If the business requires a lower RPO, increase snapshot frequency to every 4 or 6 hours, reducing RPO to 4-6 hours at the cost of additional snapshot storage and cluster overhead.

The Decision Rule

Deploy 3 dedicated cluster manager nodes in production. Never run manager-eligible and data roles on the same node in a cluster with more than 5 nodes. A data node under memory or I/O pressure can delay cluster manager heartbeats, triggering unnecessary manager elections.

Take daily snapshots to an off-cluster repository (S3, GCS, or Azure Blob). Verify snapshot integrity monthly by restoring to a test cluster. A snapshot that has never been restored is an untested assumption.

Design for the most likely failure mode (single node failure, RTO: minutes) rather than the worst case (full cluster loss, RTO: hours). Zone-aware replica allocation handles the most likely failure with zero data loss and minimal recovery time.