Disaster Recovery Testing and Runbook Design

The Symptom

The team has daily snapshots, zone-aware replicas, and three dedicated cluster manager nodes. The SLA promises 99.9% availability. A developer accidentally deletes a production index with DELETE /docs-acme-v3. During recovery, the team discovers that the most recent snapshot is 22 hours old and cannot be restored: an S3 bucket lifecycle policy deleted segment files older than 7 days while the snapshot metadata still referenced them. Because snapshots are incremental, even the newest snapshot depends on segment files written days or weeks earlier.

The Internals

A disaster recovery plan that has never been tested is only a hypothesis. The snapshot repository, the restore process, the alias reconfiguration, and the data verification must all be validated against actual cluster behavior, not against documentation.

The most common failure modes for OpenSearch clusters:

Accidental deletion. A human or automated process deletes an index. The delete API has no confirmation prompt.
Mapping explosion. Dynamic mapping creates thousands of fields, consuming heap and destabilizing the cluster.
Disk exhaustion. A node fills its disk and crosses the flood-stage watermark, which places a write block on every index that has a shard on that node.
Snapshot corruption. S3 lifecycle policies, permission changes, or network errors during snapshot creation produce incomplete snapshots.
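A cheap first line of defense against the last failure mode is to inspect snapshot metadata before relying on it. The sketch below is illustrative only: the class SnapshotHealthCheck, the method looksUsable, and the 24-hour freshness budget are assumptions, not part of the original platform. It treats a snapshot as a candidate only when its reported state is SUCCESS, no shards failed, and it is recent enough. Note that a metadata check like this cannot detect blobs removed from the bucket by a lifecycle policy; only an actual restore proves that.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper: decides whether snapshot metadata gives any reason for distrust.
final class SnapshotHealthCheck {

    /**
     * @param state        snapshot state as reported by the snapshot API (e.g. SUCCESS, PARTIAL, FAILED)
     * @param failedShards number of shards the snapshot reports as failed
     * @param endTime      when the snapshot finished
     * @param now          the current time (passed in to keep the check deterministic)
     * @param maxAge       assumed freshness budget, e.g. 24 hours for daily snapshots
     */
    static boolean looksUsable(String state, int failedShards,
                               Instant endTime, Instant now, Duration maxAge) {
        boolean complete = "SUCCESS".equals(state) && failedShards == 0;
        boolean fresh = Duration.between(endTime, now).compareTo(maxAge) <= 0;
        return complete && fresh;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2024-01-02T00:00:00Z");
        Instant twentyTwoHoursAgo = now.minus(Duration.ofHours(22));

        // A complete snapshot inside the freshness budget is a restore candidate.
        System.out.println(looksUsable("SUCCESS", 0, twentyTwoHoursAgo, now, Duration.ofHours(24)));
        // A partial snapshot is rejected regardless of age.
        System.out.println(looksUsable("PARTIAL", 3, twentyTwoHoursAgo, now, Duration.ofHours(24)));
    }
}
```

A check like this is best used to raise an alert early; the restore tests in the next section remain the only real proof that a snapshot is usable.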

The Implementation

DR Test Suite

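// HARDENED: Automated DR test that validates snapshot/restore end-to-end

import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertFalse;

import java.util.Map;
import org.junit.jupiter.api.Test;
import org.opensearch.client.json.JsonData;
import org.opensearch.client.opensearch.OpenSearchClient;
import org.testcontainers.containers.GenericContainer;
import org.testcontainers.containers.wait.strategy.Wait;
import org.testcontainers.junit.jupiter.Container;
import org.testcontainers.junit.jupiter.Testcontainers;

@Testcontainers
public class DisasterRecoveryTest {

    @Container
    private static final GenericContainer<?> opensearch =
        new GenericContainer<>("opensearchproject/opensearch:2.12.0")
            .withExposedPorts(9200)
            .withEnv("discovery.type", "single-node")
            .withEnv("plugins.security.disabled", "true")
            // 2.12+ images abort at startup if the demo security config is
            // installed without an initial admin password, so skip the installer
            .withEnv("DISABLE_INSTALL_DEMO_CONFIG", "true")
            .withEnv("path.repo", "/snapshots")
            // /snapshots does not exist in the image; a tmpfs mount gives the
            // non-root opensearch user a writable repository location
            .withTmpFs(Map.of("/snapshots", "rw"))
            // Wait for the HTTP API, not just for the port to open
            .waitingFor(Wait.forHttp("/_cluster/health").forPort(9200));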

    @Test
    void snapshotRestorePreservesAllDocuments() throws Exception {
        OpenSearchClient client = createClient();

        // Create index and index test documents
        String indexName = "docs-test-v1";
        indexTestDocuments(client, indexName, 1000);

        // Create snapshot repository (filesystem for tests)
        client.snapshot().createRepository(cr -> cr
            .name("test-repo")
            .type("fs")
            .settings(s -> s
                .putAll(Map.of("location", JsonData.of("/snapshots/test")))
            )
        );

        // Take snapshot
        client.snapshot().create(cs -> cs
            .repository("test-repo")
            .snapshot("test-snap-1")
            .indices(indexName)
            .waitForCompletion(true)
        );

        // Delete the index (simulate disaster)
        client.indices().delete(d -> d.index(indexName));

        // Restore from snapshot
        client.snapshot().restore(rs -> rs
            .repository("test-repo")
            .snapshot("test-snap-1")
            .indices(indexName)
            .waitForCompletion(true)
        );

        // Verify all documents restored
        client.indices().refresh(r -> r.index(indexName));
        long restoredCount = client.count(c -> c.index(indexName)).count();

        assertEquals(1000, restoredCount,
            "Restored document count must match original");

        // Verify the restored index is searchable
        var searchResult = client.search(s -> s
            .index(indexName)
            .query(q -> q.matchAll(m -> m))
            .size(1),
            DocPage.class
        );

        assertFalse(searchResult.hits().hits().isEmpty(),
            "Restored index must be searchable");
    }

    @Test
    void partialRestoreRecoversSingleTenant() throws Exception {
        OpenSearchClient client = createClient();

        // Create indices for multiple tenants
        indexTestDocuments(client, "docs-tenant-a-v1", 500);
        indexTestDocuments(client, "docs-tenant-b-v1", 500);

        // Snapshot both
        client.snapshot().createRepository(cr -> cr
            .name("test-repo")
            .type("fs")
            .settings(s -> s
                .putAll(Map.of("location", JsonData.of("/snapshots/test")))
            )
        );

        client.snapshot().create(cs -> cs
            .repository("test-repo")
            .snapshot("multi-tenant-snap")
            .indices("docs-*")
            .waitForCompletion(true)
        );

        // Delete only tenant A's index
        client.indices().delete(d -> d.index("docs-tenant-a-v1"));

        // Restore only tenant A
        client.snapshot().restore(rs -> rs
            .repository("test-repo")
            .snapshot("multi-tenant-snap")
            .indices("docs-tenant-a-v1")
            .waitForCompletion(true)
        );

        client.indices().refresh(r -> r.index("docs-tenant-a-v1"));

        // Verify tenant A restored, tenant B unchanged
        assertEquals(500,
            client.count(c -> c.index("docs-tenant-a-v1")).count());
        assertEquals(500,
            client.count(c -> c.index("docs-tenant-b-v1")).count());
    }
}
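Matching document counts show that nothing is missing, but not that the restored content is identical. One way to strengthen the data verification step is to compare an order-independent fingerprint of the documents taken before the snapshot and after the restore. The sketch below is a minimal illustration: the class IndexFingerprint and the idea of feeding it a map of document ID to raw _source JSON are assumptions, and fetching those documents (for example with a scroll or search_after loop) is left out.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: fingerprints a set of documents independent of fetch order.
final class IndexFingerprint {

    /** Returns a hex SHA-256 digest over all (id, source) pairs, sorted by id. */
    static String fingerprint(Map<String, String> sourceById) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            // TreeMap sorts by id, so the digest does not depend on iteration order.
            for (Map.Entry<String, String> e : new TreeMap<>(sourceById).entrySet()) {
                sha.update(e.getKey().getBytes(StandardCharsets.UTF_8));
                sha.update((byte) 0); // separator so "ab"+"c" differs from "a"+"bc"
                sha.update(e.getValue().getBytes(StandardCharsets.UTF_8));
                sha.update((byte) 0);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : sha.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException ex) {
            // SHA-256 is mandatory on every Java platform, so this should not happen.
            throw new IllegalStateException(ex);
        }
    }
}
```

In a test, the fingerprint computed before the snapshot would be compared with assertEquals against the one computed after the restore. For large indices a sampled subset of documents keeps the check cheap.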

Failure Scenario Runbook

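// Runbook for accidental index deletion

import java.util.Comparator;
import org.opensearch.client.opensearch.OpenSearchClient;

public class IndexDeletionRunbook {

    private final OpenSearchClient client;

    public IndexDeletionRunbook(OpenSearchClient client) {
        this.client = client;
    }

    // Restores deletedIndex from the newest snapshot in snapshotRepo that contains it
    public void execute(String deletedIndex, String snapshotRepo)
            throws Exception {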

        // Step 1: Identify the most recent snapshot containing the index
        var snapshots = client.snapshot().get(gs -> gs
            .repository(snapshotRepo)
            .snapshot("*")
        );

        String latestSnapshot = snapshots.snapshots().stream()
            .filter(snap -> snap.indices().contains(deletedIndex))
            .max(Comparator.comparing(snap -> snap.startTimeInMillis()))
            .map(snap -> snap.snapshot())
            .orElseThrow(() -> new RuntimeException(
                "No snapshot contains index " + deletedIndex));

        // Step 2: Restore the index from the snapshot
        client.snapshot().restore(rs -> rs
            .repository(snapshotRepo)
            .snapshot(latestSnapshot)
            .indices(deletedIndex)
            .waitForCompletion(true)
        );

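        // Step 3: Verify document count
        // (Compare against the last known count from monitoring if one is
        // available; an empty index is treated as a failed restore here.)
        client.indices().refresh(r -> r.index(deletedIndex));
        long count = client.count(c -> c.index(deletedIndex)).count();
        if (count == 0) {
            throw new IllegalStateException(
                "Restored index " + deletedIndex + " contains no documents");
        }

        // Step 4: Re-attach aliases
        // (A restore brings back the aliases captured in the snapshot by
        // default; re-adding them also covers aliases created or changed
        // after the snapshot was taken.)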
        String tenantId = extractTenantId(deletedIndex);
        client.indices().updateAliases(ua -> ua
            .actions(a -> a.add(ad -> ad
                .index(deletedIndex)
                .alias("docs-" + tenantId + "-read")
            ))
            .actions(a -> a.add(ad -> ad
                .index(deletedIndex)
                .alias("docs-" + tenantId + "-write")
                .isWriteIndex(true)
            ))
        );
    }
}
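The runbook calls extractTenantId without showing it. A minimal sketch follows, assuming index names follow the docs-<tenant>-v<N> pattern seen in the examples above (docs-acme-v3, docs-tenant-a-v1). The class name TenantIndexNames and the exact pattern are assumptions; adjust them to the real naming scheme.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: derives the tenant id from an index name.
final class TenantIndexNames {

    // Assumed convention: "docs-" + tenant id + "-v" + numeric version.
    private static final Pattern INDEX_NAME = Pattern.compile("^docs-(.+)-v(\\d+)$");

    static String extractTenantId(String indexName) {
        Matcher m = INDEX_NAME.matcher(indexName);
        if (!m.matches()) {
            // Fail loudly rather than attach aliases to the wrong tenant.
            throw new IllegalArgumentException(
                "Index name does not match docs-<tenant>-v<N>: " + indexName);
        }
        return m.group(1);
    }

    public static void main(String[] args) {
        System.out.println(extractTenantId("docs-acme-v3"));     // acme
        System.out.println(extractTenantId("docs-tenant-a-v1")); // tenant-a
    }
}
```

Rejecting names that do not match keeps the runbook from building alias names out of a malformed tenant id during an incident.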

The Measurement

DR test results for the documentation platform:

Scenario	Expected RTO	Measured RTO	Status
Single index restore (5GB)	10 min	8 min	Pass
Full cluster restore (500GB)	4 hrs	3.5 hrs	Pass
Node failure (shard reallocation)	10 min	7 min	Pass
Accidental deletion + alias restore	15 min	12 min	Pass
Corrupted snapshot fallback	30 min	45 min	Fail (needs secondary snapshot repository)

The corrupted snapshot scenario failed because no secondary snapshot repository existed. After adding a cross-region S3 backup, the fallback path meets the 30-minute RTO.
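The pass/fail column is simply a comparison of measured against expected recovery time. The sketch below shows one way a DR test harness might compute it, using the figures from the table. The class RtoCheck and its layout are illustrative assumptions, not the platform's actual reporting code.

```java
import java.time.Duration;

// Hypothetical helper: evaluates measured recovery times against RTO targets.
final class RtoCheck {

    /** A scenario passes when the measured recovery time does not exceed the target. */
    static boolean meetsRto(Duration expected, Duration measured) {
        return measured.compareTo(expected) <= 0;
    }

    public static void main(String[] args) {
        String[] scenarios = {
            "Single index restore (5GB)",
            "Full cluster restore (500GB)",
            "Node failure (shard reallocation)",
            "Accidental deletion + alias restore",
            "Corrupted snapshot fallback"
        };
        Duration[] expected = {
            Duration.ofMinutes(10), Duration.ofHours(4), Duration.ofMinutes(10),
            Duration.ofMinutes(15), Duration.ofMinutes(30)
        };
        Duration[] measured = {
            Duration.ofMinutes(8), Duration.ofMinutes(210), Duration.ofMinutes(7),
            Duration.ofMinutes(12), Duration.ofMinutes(45)
        };

        // Print one line per scenario with its status.
        for (int i = 0; i < scenarios.length; i++) {
            String status = meetsRto(expected[i], measured[i]) ? "Pass" : "Fail";
            System.out.printf("%-38s %s%n", scenarios[i], status);
        }
    }
}
```

Failing the build when any scenario reports Fail turns the table into a regression gate rather than a report.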

The Decision Rule

Run the DR test suite monthly. A quarterly cadence is too infrequent to catch regressions introduced by infrastructure changes (S3 policy updates, node role changes, security configuration changes).

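Maintain two snapshot repositories: a primary in the same region for fast restore, and a secondary in a different region for disaster scenarios. S3 Cross-Region Replication can automate the copy, but it is asynchronous, so the secondary repository must pass the same restore tests as the primary.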

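Protect production indices from accidental bulk deletion by setting the action.destructive_requires_name cluster setting to true. This rejects wildcard deletes such as DELETE /* or DELETE /docs-*, so every deletion must name its indices exactly. It does not stop an explicit request such as DELETE /docs-acme-v3, which is why tested snapshots remain the last line of defense.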
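Tooling that issues deletes can apply the same rule before a request ever reaches the cluster. The sketch below is a client-side approximation, not the server's implementation: the class DeleteGuard and the method isExplicitIndexName are hypothetical, and the cluster setting remains the authoritative control.

```java
// Hypothetical client-side guard that mirrors the intent of
// action.destructive_requires_name for tooling that issues deletes.
final class DeleteGuard {

    /**
     * Returns true only when every comma-separated part of the expression is a
     * concrete index name: no wildcard, no _all, and not empty.
     */
    static boolean isExplicitIndexName(String expression) {
        if (expression == null || expression.isBlank()) {
            return false;
        }
        // -1 keeps trailing empty parts, so "docs-a," is rejected too.
        for (String part : expression.split(",", -1)) {
            String name = part.trim();
            if (name.isEmpty() || name.equals("_all") || name.contains("*")) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isExplicitIndexName("docs-acme-v3")); // true
        System.out.println(isExplicitIndexName("docs-*"));       // false
        System.out.println(isExplicitIndexName("_all"));         // false
    }
}
```

A deployment script would call isExplicitIndexName before any delete and refuse to proceed on false. As with the cluster setting, an exact name still passes, so this narrows the blast radius of a mistake without replacing backups.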