Disaster Recovery Testing and Runbook Design
The Symptom
The team has daily snapshots, zone-aware replicas, and 3 dedicated cluster managers. The SLA promises 99.9% availability. A developer accidentally deletes a production index with DELETE /docs-acme-v3. The team then discovers that the most recent snapshot is 22 hours old and corrupted: the S3 bucket lifecycle policy deleted segment files older than 7 days while the snapshot metadata still referenced them.
The Internals
A disaster recovery plan that has never been tested is only a hypothesis. The snapshot repository, the restore process, the alias reconfiguration, and the data verification must all be validated against actual cluster behavior, not against documentation.
The most common failure modes for OpenSearch clusters:
- Accidental deletion. A human or automated process deletes an index. The delete index API has no confirmation step.
- Mapping explosion. Dynamic mapping creates thousands of fields, consuming heap and destabilizing the cluster.
- Disk exhaustion. A node fills its disk and crosses the flood-stage watermark, which places a read-only block on every index that has a shard on that node.
- Snapshot corruption. S3 lifecycle policies, permission changes, or network errors during snapshot creation produce incomplete snapshots.
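Snapshots are incremental, so a recent snapshot can still reference segment files written weeks earlier, which is how an age-based lifecycle rule ends up breaking it. The following is a minimal sketch of an audit for that condition. It assumes you can obtain two inputs by some other means: the blob names each snapshot references and the blob names actually present in the bucket. The class name, method name, and blob names are hypothetical, and this is not an OpenSearch API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical audit: which snapshots reference blobs that are no longer in the store? */
final class SnapshotBlobAudit {

    /**
     * @param referencedBlobs snapshot name -> blob names its metadata references (assumed input)
     * @param presentBlobs    blob names actually listed in the bucket (assumed input)
     * @return snapshot name -> sorted missing blobs, only for snapshots with at least one gap
     */
    static Map<String, Set<String>> findBrokenSnapshots(
            Map<String, Set<String>> referencedBlobs, Set<String> presentBlobs) {
        Map<String, Set<String>> broken = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> entry : referencedBlobs.entrySet()) {
            Set<String> missing = new TreeSet<>(entry.getValue());
            missing.removeAll(presentBlobs);
            if (!missing.isEmpty()) {
                broken.put(entry.getKey(), missing);
            }
        }
        return broken;
    }

    public static void main(String[] args) {
        // Two snapshots share the old segment blob "seg-a" (illustrative names).
        Map<String, Set<String>> referenced = new LinkedHashMap<>();
        referenced.put("snap-day-1", Set.of("seg-a", "seg-b"));
        referenced.put("snap-day-9", Set.of("seg-a", "seg-c"));

        // An age-based lifecycle rule has expired "seg-a".
        Set<String> present = Set.of("seg-b", "seg-c");

        System.out.println(findBrokenSnapshots(referenced, present));
    }
}
```

In this toy input both snapshots are reported as broken, including the newest one, because they share the expired blob. An audit like this complements a real restore test but does not replace it.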
The Implementation
DR Test Suite
// HARDENED: Automated DR test that validates snapshot/restore end-to-end.
// createClient(), indexTestDocuments(...) and DocPage are project helpers not shown here.
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertFalse;

import java.util.Map;
import org.junit.jupiter.api.Test;
import org.opensearch.client.json.JsonData;
import org.opensearch.client.opensearch.OpenSearchClient;
import org.testcontainers.containers.GenericContainer;
import org.testcontainers.junit.jupiter.Container;
import org.testcontainers.junit.jupiter.Testcontainers;

@Testcontainers
public class DisasterRecoveryTest {

    // Static container: one single-node cluster is shared by both tests
    @Container
    private static final GenericContainer<?> opensearch =
        new GenericContainer<>("opensearchproject/opensearch:2.12.0")
            .withExposedPorts(9200)
            .withEnv("discovery.type", "single-node")
            .withEnv("plugins.security.disabled", "true")
            .withEnv("path.repo", "/snapshots");

    @Test
    void snapshotRestorePreservesAllDocuments() throws Exception {
        OpenSearchClient client = createClient();

        // Create index and index test documents
        String indexName = "docs-test-v1";
        indexTestDocuments(client, indexName, 1000);

        // Create snapshot repository (filesystem for tests)
        client.snapshot().createRepository(cr -> cr
            .name("test-repo")
            .type("fs")
            .settings(s -> s
                .putAll(Map.of("location", JsonData.of("/snapshots/test")))
            )
        );

        // Take snapshot
        client.snapshot().create(cs -> cs
            .repository("test-repo")
            .snapshot("test-snap-1")
            .indices(indexName)
            .waitForCompletion(true)
        );

        // Delete the index (simulate disaster)
        client.indices().delete(d -> d.index(indexName));

        // Restore from snapshot
        client.snapshot().restore(rs -> rs
            .repository("test-repo")
            .snapshot("test-snap-1")
            .indices(indexName)
            .waitForCompletion(true)
        );

        // Verify all documents restored
        client.indices().refresh(r -> r.index(indexName));
        long restoredCount = client.count(c -> c.index(indexName)).count();
        assertEquals(1000, restoredCount,
            "Restored document count must match original");

        // Verify the restored index answers a basic search
        var searchResult = client.search(s -> s
                .index(indexName)
                .query(q -> q.matchAll(m -> m))
                .size(1),
            DocPage.class
        );
        assertFalse(searchResult.hits().hits().isEmpty(),
            "Restored index must be searchable");
    }

    @Test
    void partialRestoreRecoversSingleTenant() throws Exception {
        OpenSearchClient client = createClient();

        // Create indices for multiple tenants
        indexTestDocuments(client, "docs-tenant-a-v1", 500);
        indexTestDocuments(client, "docs-tenant-b-v1", 500);

        // Snapshot both
        client.snapshot().createRepository(cr -> cr
            .name("test-repo")
            .type("fs")
            .settings(s -> s
                .putAll(Map.of("location", JsonData.of("/snapshots/test")))
            )
        );
        client.snapshot().create(cs -> cs
            .repository("test-repo")
            .snapshot("multi-tenant-snap")
            .indices("docs-*")
            .waitForCompletion(true)
        );

        // Delete only tenant A's index
        client.indices().delete(d -> d.index("docs-tenant-a-v1"));

        // Restore only tenant A
        client.snapshot().restore(rs -> rs
            .repository("test-repo")
            .snapshot("multi-tenant-snap")
            .indices("docs-tenant-a-v1")
            .waitForCompletion(true)
        );
        client.indices().refresh(r -> r.index("docs-tenant-a-v1"));

        // Verify tenant A restored, tenant B unchanged
        assertEquals(500,
            client.count(c -> c.index("docs-tenant-a-v1")).count());
        assertEquals(500,
            client.count(c -> c.index("docs-tenant-b-v1")).count());
    }
}
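The tests above compare document counts, which would not notice a restored index whose documents have the right count but the wrong content. One possible extension is to compare a content fingerprint taken before the snapshot and after the restore. The sketch below is an assumption rather than part of the suite: it presumes the documents can be read back as an id-to-source map (for example through a scroll), and the class and method names are hypothetical. It XORs per-document SHA-256 digests so that the result does not depend on iteration order.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;

/** Hypothetical helper: order-independent fingerprint of a set of documents. */
final class RestoreFingerprint {

    /**
     * @param docsById document id -> serialized source (assumed to be fetched by the caller)
     * @return hex string that is equal for two maps with the same ids and contents
     */
    static String fingerprint(Map<String, String> docsById) {
        byte[] accumulator = new byte[32]; // SHA-256 digest length
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            for (Map.Entry<String, String> doc : docsById.entrySet()) {
                sha256.update(doc.getKey().getBytes(StandardCharsets.UTF_8));
                sha256.update((byte) 0); // separator between id and source
                sha256.update(doc.getValue().getBytes(StandardCharsets.UTF_8));
                byte[] digest = sha256.digest(); // digest() also resets the instance
                for (int i = 0; i < accumulator.length; i++) {
                    accumulator[i] ^= digest[i]; // XOR is commutative, so order is irrelevant
                }
            }
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is required on every Java platform", e);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : accumulator) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        String before = fingerprint(Map.of("1", "{\"title\":\"a\"}", "2", "{\"title\":\"b\"}"));
        String after = fingerprint(Map.of("2", "{\"title\":\"b\"}", "1", "{\"title\":\"a\"}"));
        System.out.println(before.equals(after));
    }
}
```

A test could record the fingerprint before taking the snapshot and assert equality after the restore. The comparison is only meaningful if the source is serialized the same way both times.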
Failure Scenario Runbook
// Runbook for accidental index deletion.
// extractTenantId(...) is a project helper not shown here.
import java.util.Comparator;
import org.opensearch.client.opensearch.OpenSearchClient;

public class IndexDeletionRunbook {

    private final OpenSearchClient client;

    public IndexDeletionRunbook(OpenSearchClient client) {
        this.client = client;
    }

    public void execute(String deletedIndex, String snapshotRepo)
            throws Exception {
        // Step 1: Identify the most recent snapshot containing the index
        var snapshots = client.snapshot().get(gs -> gs
            .repository(snapshotRepo)
            .snapshot("*")
        );
        String latestSnapshot = snapshots.snapshots().stream()
            .filter(snap -> snap.indices().contains(deletedIndex))
            .max(Comparator.comparing(snap -> snap.startTimeInMillis()))
            .map(snap -> snap.snapshot())
            .orElseThrow(() -> new RuntimeException(
                "No snapshot contains index " + deletedIndex));

        // Step 2: Restore the index from the snapshot
        client.snapshot().restore(rs -> rs
            .repository(snapshotRepo)
            .snapshot(latestSnapshot)
            .indices(deletedIndex)
            .waitForCompletion(true)
        );

        // Step 3: Verify document count (an empty index means the restore did not help)
        client.indices().refresh(r -> r.index(deletedIndex));
        long count = client.count(c -> c.index(deletedIndex)).count();
        if (count == 0) {
            throw new IllegalStateException(
                "Restored index " + deletedIndex + " contains no documents");
        }

        // Step 4: Re-attach aliases
        // (Aliases are index metadata and are restored by default unless
        // include_aliases is false. Re-adding them is harmless and covers
        // aliases attached after the snapshot was taken.)
        String tenantId = extractTenantId(deletedIndex);
        client.indices().updateAliases(ua -> ua
            .actions(a -> a.add(ad -> ad
                .index(deletedIndex)
                .alias("docs-" + tenantId + "-read")
            ))
            .actions(a -> a.add(ad -> ad
                .index(deletedIndex)
                .alias("docs-" + tenantId + "-write")
                .isWriteIndex(true)
            ))
        );
    }
}
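Step 1 of the runbook selects the newest snapshot that lists the index, whatever state that snapshot ended in. Given the corrupted-snapshot incident described earlier, it may be worth restricting the choice to snapshots whose state is SUCCESS. The sketch below illustrates that selection rule on plain data. The Snapshot record is a hypothetical stand-in for whatever the client library returns, not a library type, and the snapshot names are invented.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Hypothetical selection rule: newest SUCCESS snapshot that contains the index. */
final class SnapshotSelector {

    /** Stand-in for snapshot metadata; field names are assumptions. */
    record Snapshot(String name, String state, long startMillis, Set<String> indices) {}

    static Optional<Snapshot> latestUsable(List<Snapshot> snapshots, String index) {
        return snapshots.stream()
            .filter(s -> "SUCCESS".equals(s.state()))       // skip PARTIAL, FAILED, IN_PROGRESS
            .filter(s -> s.indices().contains(index))        // must contain the deleted index
            .max(Comparator.comparingLong(Snapshot::startMillis));
    }

    public static void main(String[] args) {
        List<Snapshot> snapshots = List.of(
            new Snapshot("nightly-1", "SUCCESS", 1_000L, Set.of("docs-acme-v3")),
            new Snapshot("nightly-2", "SUCCESS", 2_000L, Set.of("docs-acme-v3")),
            new Snapshot("nightly-3", "PARTIAL", 3_000L, Set.of("docs-acme-v3")));

        // The newest snapshot is PARTIAL, so the rule falls back to nightly-2.
        System.out.println(latestUsable(snapshots, "docs-acme-v3")
            .map(Snapshot::name)
            .orElse("no usable snapshot"));
    }
}
```

An empty result is the signal to move on to the next repository rather than to fail the runbook outright.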
The Measurement
DR test results for the documentation platform:
| Scenario | Expected RTO | Measured RTO | Status |
|---|---|---|---|
| Single index restore (5GB) | 10 min | 8 min | Pass |
| Full cluster restore (500GB) | 4 hrs | 3.5 hrs | Pass |
| Node failure (shard reallocation) | 10 min | 7 min | Pass |
| Accidental deletion + alias restore | 15 min | 12 min | Pass |
| Corrupted snapshot fallback | 30 min | 45 min | Fail (needs secondary snapshot) |
The corrupted snapshot scenario failed because no secondary snapshot repository existed. After adding a cross-region S3 backup, the fallback path meets the 30-minute RTO.
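The pass/fail column can be derived mechanically, and the same numbers can be set against the downtime the 99.9% SLA allows. The sketch below does both, using the figures from the table above. It assumes a 30-day month for the SLA window, which the text does not specify, and the class, record, and method names are hypothetical.

```java
import java.util.List;

/** Hypothetical report: compare measured RTOs with targets and with the SLA downtime budget. */
final class RtoReport {

    record Scenario(String name, double targetMinutes, double measuredMinutes) {
        boolean meetsTarget() {
            return measuredMinutes <= targetMinutes;
        }
    }

    /** Minutes of downtime permitted in one month at the given availability percentage. */
    static double monthlyDowntimeBudgetMinutes(double availabilityPercent, int daysInMonth) {
        double minutesInMonth = daysInMonth * 24 * 60.0;
        return minutesInMonth * (100.0 - availabilityPercent) / 100.0;
    }

    public static void main(String[] args) {
        // Values copied from the table above, converted to minutes.
        List<Scenario> results = List.of(
            new Scenario("Single index restore (5GB)", 10, 8),
            new Scenario("Full cluster restore (500GB)", 240, 210),
            new Scenario("Node failure (shard reallocation)", 10, 7),
            new Scenario("Accidental deletion + alias restore", 15, 12),
            new Scenario("Corrupted snapshot fallback", 30, 45));

        double budget = monthlyDowntimeBudgetMinutes(99.9, 30); // 30-day month is an assumption
        System.out.printf("Monthly downtime budget: %.1f min%n", budget);
        for (Scenario s : results) {
            System.out.printf("%-38s %-4s withinBudget=%b%n",
                s.name(), s.meetsTarget() ? "Pass" : "Fail", s.measuredMinutes() <= budget);
        }
    }
}
```

Under the 30-day assumption the budget is about 43.2 minutes, so an RTO target can pass on its own terms while a single incident of that length still consumes the whole month's allowance.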
The Decision Rule
Run the DR test suite monthly. Quarterly is too infrequent to catch regressions from infrastructure changes (S3 policy updates, node role changes, security configuration changes).
Maintain two snapshot repositories: a primary in the same region for fast restore, and a secondary in a different region for disaster scenarios. S3 Cross-Region Replication can automate this.
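Protect production indices from accidental deletion by setting the action.destructive_requires_name cluster setting to true. This rejects wildcard and _all deletions such as DELETE /* or DELETE /docs-*, so a delete must name the exact index. It does not stop a delete that names the index exactly, as in the DELETE /docs-acme-v3 incident above, so tested snapshots remain the recovery path for that case.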
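The setting can be applied through the cluster settings API. The sketch below only builds the request with the JDK HTTP client and does not send it. The base URL is a placeholder, authentication is omitted, and the class name is hypothetical.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical builder for the cluster settings request that enables the guard. */
final class DestructiveGuardRequest {

    // Persistent settings survive a full cluster restart.
    static final String SETTINGS_BODY =
        "{\"persistent\":{\"action.destructive_requires_name\":true}}";

    static HttpRequest build(String baseUrl) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/_cluster/settings"))
            .header("Content-Type", "application/json")
            .PUT(HttpRequest.BodyPublishers.ofString(SETTINGS_BODY))
            .build();
    }

    public static void main(String[] args) {
        // Placeholder address; a real cluster would also need credentials and TLS.
        HttpRequest request = build("http://localhost:9200");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(SETTINGS_BODY);
    }
}
```

Sending the request with java.net.http.HttpClient and checking the response for acknowledgement is left to the caller.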