Consensus, Leader Election, and Why Your Database Goes Read-Only
Consensus, Leader Election, and Why Your Database Goes Read-Only
The Black Box
The primary database crashes at 02:14:07 AM. The application begins returning 500 errors. At 02:14:19 AM, twelve seconds later, writes resume. The postmortem says “automatic failover to the replica.” The twelve seconds are not explained. They are the leader election.
The Mechanism
Raft consensus guarantees that a group of nodes agrees on a single leader. The relevant mechanics for understanding failover duration:
Terms. Time is divided into terms, each identified by a monotonically increasing integer. Each term has at most one leader. When a leader fails, a new term begins with an election.
Heartbeats. The leader sends heartbeats to all followers at a regular interval (typically 100-500ms). Each heartbeat resets the follower’s election timer.
Election timeout. If a follower’s election timer expires (no heartbeat received within the timeout), the follower transitions to the candidate state, increments the term, votes for itself, and sends vote requests to all other nodes.
Vote rules. A node votes for a candidate only if (1) the candidate’s term is higher than the node’s current term, and (2) the candidate’s log is at least as up-to-date as the voter’s log. A candidate that receives votes from a majority of nodes becomes the leader.
The Failover Timeline
The twelve seconds from the example break down:
| Phase | Duration | Cause |
|---|---|---|
| Detection | 1-5s | Election timeout waiting for missed heartbeats |
| Election | 0.5-2s | Vote request round trips, possible split vote retry |
| Log recovery | 0-5s | New leader replaying uncommitted entries |
| Connection drain | 1-3s | Application connections to old primary timing out |
The election timeout is the dominant factor. A short timeout (500ms) detects failures fast but risks false elections during network blips. A long timeout (5s) avoids false elections but delays failover.
The Observable Consequence
PostgreSQL with Patroni
Patroni is a PostgreSQL high-availability controller that uses etcd (which uses Raft) for leader election.
# Concept: Patroni configuration controlling failover timing
# /etc/patroni/patroni.yml
bootstrap:
dcs:
ttl: 30 # Leader lease duration (seconds)
loop_wait: 10 # Health check interval (seconds)
retry_timeout: 10 # Timeout for DCS operations
maximum_lag_on_failover: 1048576 # Max replication lag (bytes) for promotion
# Failover timeline with these settings:
# 1. Primary crashes at T=0
# 2. Patroni on primary misses health check. etcd lease expires after TTL (30s worst case)
# 3. Patroni on replica detects leader key is gone (next loop_wait, up to 10s)
# 4. Patroni promotes replica to primary (1-3s for pg_ctl promote)
# 5. Total: 30 + 10 + 3 = 43 seconds worst case
# 6. With TTL=10, loop_wait=3: 10 + 3 + 3 = 16 seconds worst case
The maximum_lag_on_failover parameter prevents promoting a replica that is too far behind. If the only available replica has 100MB of replication lag, Patroni will not promote it, and the cluster remains read-only until the replica catches up or an operator intervenes. This is the scenario that produces a postmortem reading “failover took 8 minutes.”
Kafka KRaft
Kafka’s KRaft mode replaces ZooKeeper with a built-in Raft implementation for controller election:
# Concept: KRaft election timing
# server.properties
controller.quorum.election.timeout.ms=1000
controller.quorum.fetch.timeout.ms=2000
# A follower controller waits 2 seconds for a fetch response.
# If no response, it starts an election after 1 second (randomized).
# Total failover: 2 + 1 + election round trip = 3-5 seconds.
#
# During this window:
# - Existing partition leaders continue serving reads and writes
# - New partition leader elections are blocked
# - Topic creation/deletion is blocked
# - Consumer group rebalancing is blocked
The critical difference: Kafka’s data path (partition leaders) is independent of the controller. A controller election blocks metadata operations but not data read/write on existing partitions. PostgreSQL’s failover blocks all writes until the new primary is promoted.
The Code
Detecting failover from the application side:
// Concept: detecting and handling PostgreSQL failover in application code
// The connection to the old primary fails. The application must reconnect to the new primary.
// JDBC connection string with multiple hosts and target_session_attrs
String url = "jdbc:postgresql://primary:5432,replica1:5432,replica2:5432/logistics"
+ "?targetServerType=primary"
+ "&connectTimeout=5"
+ "&socketTimeout=10";
// targetServerType=primary: JDBC driver connects to the first host
// that reports itself as a primary (not in recovery mode).
// During failover:
// 1. Connection to old primary fails (socketTimeout = 10s)
// 2. Driver tries replica1. If Patroni promoted replica1, it is now primary. Connect.
// 3. If replica1 is still a replica, try replica2.
// 4. If no primary found, throw exception. Application retries.
The Decision Rule
Set the election timeout based on your availability budget. If 30 seconds of downtime per failover is acceptable, use Patroni’s defaults. If you need sub-10-second failover, reduce ttl to 10 and loop_wait to 3, and accept that network jitter may trigger unnecessary failovers.
Use synchronous replication (synchronous_commit = remote_apply) if losing committed transactions during failover is unacceptable. This guarantees the promoted replica has every committed transaction. Without synchronous replication, the promoted replica may be missing the most recent transactions that were committed on the old primary but not yet replicated.
The Raft election described in Chapter 4 is the reason the failover window exists at all. Understanding the election mechanics, the timeout, the vote round trip, the log recovery, allows you to estimate failover duration from configuration parameters rather than waiting for a production incident to measure it.