Failure Modes for Data Systems

Every mechanism covered in this book, the WAL, the B-Tree, the replication protocol, the consumer group, has a failure mode. The WAL can fill the disk. The B-Tree can fragment. The replication can lag. The consumer group can rebalance indefinitely. This chapter catalogs the failure modes that recur across data systems and provides the observable symptoms, root causes, and recovery patterns for each.

Split Brains: Two Leaders, Two Truths

A split brain occurs when two nodes in a replicated system both believe they are the leader. Each accepts writes. The writes diverge. When the partition heals, the two datasets are irreconcilable without data loss or manual intervention.

What Causes Split Brains

Split brains happen when the failure detection mechanism incorrectly concludes that the leader has failed, promotes a replica, and the old leader is still alive.

The scenario:

The PostgreSQL primary is running normally.
A network partition isolates the primary from the Patroni/etcd cluster.
etcd’s Raft consensus (Chapter 4-S2) detects that the primary’s lease has expired.
Patroni promotes a replica to primary.
The old primary is still running, still accepting writes from clients on its side of the partition.
Two primaries. Two sets of writes. Split brain.

-- Concept: detecting split brain via timeline ID
-- PostgreSQL increments the timeline ID on every promotion.
-- If two nodes have different timeline IDs and both accept writes, split brain.

SELECT pg_is_in_recovery(), pg_control_system().timeline_id;
-- Node A: false (primary), timeline_id: 3
-- Node B: false (primary), timeline_id: 4
-- Both report as primary. Timeline IDs differ. Split brain confirmed.

Prevention

The standard prevention is fencing: when a replica is promoted, the old primary must be forcefully shut down (STONITH: Shoot The Other Node In The Head).

# Concept: Patroni fencing configuration
# The promoted replica executes a fencing command to kill the old primary

postgresql:
  callbacks:
    on_role_change: /usr/local/bin/fence-old-primary.sh

# fence-old-primary.sh:
# 1. Send SIGTERM to the old primary's PostgreSQL process
# 2. If that fails, revoke the old primary's network access via firewall rule
# 3. If that fails, power-cycle the old primary via IPMI/BMC
# The goal is to ensure the old primary stops accepting writes before the
# new primary starts. The fencing must succeed before promotion completes.

Network Partitions: The CAP Theorem in Practice

The CAP theorem states that during a network partition, a distributed system must choose between consistency (all nodes see the same data) and availability (all nodes respond to requests). You cannot have both during a partition.

In practice, this is not a global system-wide choice. It is a per-operation decision.

# Concept: observing a network partition in Kafka
# Broker 3 is partitioned from brokers 1 and 2.

kafka-topics.sh --describe --topic package-events

# Partition 0: Leader: 1, Replicas: 1,2,3, ISR: 1,2
# Partition 1: Leader: 2, Replicas: 2,3,1, ISR: 2,1
# Partition 2: Leader: 3, Replicas: 3,1,2, ISR: 3

# Partition 2's ISR is {3} only. Broker 3 is the leader but has no in-sync replicas.
# With min.insync.replicas=2 and acks=all:
# Producers writing to partition 2 receive NotEnoughReplicasException.
# Kafka chose consistency over availability for partition 2.

# With min.insync.replicas=1 and acks=all:
# Producers can write to partition 2 (only broker 3 acknowledges).
# If broker 3 fails during the partition, those writes are lost.
# Kafka chose availability over consistency.

The logistics platform’s configuration:

package-events topic: min.insync.replicas=2, acks=all. During a partition, writes to affected partitions fail. Package events are buffered by the producer and retried. Consistency is chosen over availability because lost package events cause incorrect dashboard state.
package-tracking-metrics topic: min.insync.replicas=1, acks=1. During a partition, writes continue with reduced durability. Metrics data is ephemeral and can tolerate loss. Availability is chosen over consistency.

Clock Skew: Why Timestamps Lie

The diagram shows three nodes with drifting wall clocks. Node A’s clock is 150ms ahead. Node B’s clock is accurate. Node C’s clock is 80ms behind. An event that occurs on Node A at physical time T is timestamped T+150ms. An event that occurs later on Node C is timestamped with a value earlier than Node A’s event. Sorting by timestamp produces a history that violates causal ordering. This affects any system that uses wall clock timestamps for ordering: Kafka message timestamps, database audit trails, and distributed tracing.

System clocks on different machines are not synchronized precisely. NTP keeps clocks within a few milliseconds on a good day, within hundreds of milliseconds on a bad day. During NTP corrections, the clock can jump backward.

-- Concept: clock skew affecting audit trail ordering
-- Two services write to the audit_history table in PostgreSQL.
-- Service A's clock is 200ms ahead of Service B's clock.

-- Service A writes at its local time 14:22:00.200:
INSERT INTO audit_history (event_id, timestamp, event)
VALUES ('evt-1', '2024-11-15T14:22:00.200Z', 'Package scanned at WH-042');

-- Service B writes at its local time 14:22:00.050 (200ms behind A):
INSERT INTO audit_history (event_id, timestamp, event)
VALUES ('evt-2', '2024-11-15T14:22:00.050Z', 'Package loaded on truck');

-- Query ordered by timestamp:
SELECT * FROM audit_history ORDER BY timestamp;
-- evt-2: 14:22:00.050 Package loaded on truck
-- evt-1: 14:22:00.200 Package scanned at WH-042

-- The loading happened AFTER the scan, but the timestamps say otherwise.
-- The audit trail is wrong.

The fix: use logical timestamps (Lamport clocks, vector clocks) or hybrid logical clocks (HLC) instead of wall clock time for ordering. PostgreSQL’s LSN (Log Sequence Number) is a logical timestamp: it reflects the order in which changes were applied to the WAL, regardless of wall clock time.

Cascading Failures

A cascading failure starts with one component failing slowly (not crashing, but responding slowly), which causes its callers to queue requests, exhaust threads or connections, and fail themselves, propagating the failure upstream.

The sequence in the logistics platform:

PostgreSQL’s checkpoint runs during peak traffic. Query latency spikes from 2ms to 200ms.
The package service’s connection pool fills because connections are held longer.
The route optimizer, waiting for the package service, times out after 5 seconds. But it retries immediately.
The retries double the load on the package service, which is already overloaded.
The route optimizer’s thread pool fills. It stops responding to the dispatch service.
The dispatch service times out and alerts on-call.

Three failures in 30 seconds, caused by a checkpoint.

Breaking the Cascade

// Concept: circuit breaker to prevent cascading failure
// When the package service is slow, stop calling it instead of retrying.

// Circuit states:
// CLOSED: normal operation, requests pass through
// OPEN: package service is failing, requests are rejected immediately
// HALF_OPEN: test one request to see if the service has recovered

// Simplified circuit breaker (production: use resilience4j)
class CircuitBreaker {
    private int failureCount = 0;
    private int failureThreshold = 5;
    private long openUntil = 0;

    boolean allowRequest() {
        if (System.currentTimeMillis() < openUntil) {
            return false;  // OPEN: reject immediately
        }
        return true;
    }

    void recordFailure() {
        failureCount++;
        if (failureCount >= failureThreshold) {
            openUntil = System.currentTimeMillis() + 30_000;  // Open for 30 seconds
            failureCount = 0;
        }
    }

    void recordSuccess() {
        failureCount = 0;
        openUntil = 0;
    }
}

The circuit breaker prevents the route optimizer from hammering a failing package service. When 5 consecutive requests fail, the circuit opens for 30 seconds. During that window, requests to the package service are rejected immediately without consuming threads, connections, or network bandwidth. After 30 seconds, one test request is sent. If it succeeds, the circuit closes and normal traffic resumes.

The Recovery Playbook

When a data system failure occurs, the recovery sequence matters. Wrong ordering causes secondary failures.

Restart Order

Storage layer first: PostgreSQL, RocksDB. Wait for crash recovery (WAL replay) to complete before accepting connections.
Message layer second: Kafka brokers. Wait for ISR to re-form and under-replicated partitions to sync.
Application layer third: Services that read from the recovered storage and message layers.
Derived stores last: Redis caches, ClickHouse analytics. Rebuild from the source of truth or replay from Kafka.

Post-Incident Verification

-- Concept: verifying data integrity after a failure
-- Check for gaps in the package event sequence

SELECT
    package_id,
    status,
    timestamp,
    LAG(timestamp) OVER (PARTITION BY package_id ORDER BY timestamp) as prev_timestamp,
    timestamp - LAG(timestamp) OVER (PARTITION BY package_id ORDER BY timestamp) as gap
FROM package_events
WHERE timestamp > now() - interval '24 hours'
ORDER BY gap DESC NULLS LAST
LIMIT 20;

-- Large gaps may indicate lost events during the failure.
-- Cross-reference with Kafka consumer lag metrics at the time of failure.

The Decision Rule

Implement fencing for every leader-election system. A promotion without fencing is a split brain waiting to happen.

Choose consistency or availability per data entity, not globally. Critical business data (inventory, assignments) gets acks=all and min.insync.replicas=2. Ephemeral data (metrics, logs) gets acks=1.

Use logical ordering (LSN, offsets, sequence numbers) instead of wall clock timestamps for any ordering that must be correct across machines. Wall clock timestamps are useful for display. They are not reliable for ordering.

Implement circuit breakers on every inter-service call. A service that retries a failing dependency without a circuit breaker converts a single-service failure into a system-wide outage.

Recovery is not “restart everything.” It is a sequenced process: storage first, messaging second, application third, derived stores last. Test the recovery sequence before you need it.