Replication Topologies: Leaders and Followers
SummaryThis section introduces replication topologies as fundamental mechanisms...
This section introduces replication topologies as fundamental mechanisms...
This section introduces replication topologies as fundamental mechanisms for achieving high availability and fault tolerance in distributed database systems. It contrasts single-leader and multi-leader replication strategies, highlighting their trade-offs between consistency, availability, and performance. Single-leader replication provides strong consistency by funneling all writes through a designated leader node, creating a potential write bottleneck but ensuring a single source of truth. Multi-leader replication improves write availability by allowing multiple nodes to accept writes independently, but introduces complexity through required conflict resolution mechanisms like last-write-wins (LWW) and application-defined merge logic (CRDTs). The section explores practical challenges including replication lag, stale reads, and failure handling through automatic failover. Key terminology introduced includes: single-leader replication (one designated node accepts all writes), multi-leader replication (multiple nodes accept writes requiring conflict resolution), synchronous replication (waits for follower acknowledgments), asynchronous replication (confirms writes before replication), and replication lag (delay between leader commit and follower apply). The included comparison table systematically evaluates consistency models, write availability, read scalability, and failure handling across different replication types. Code examples demonstrate synchronous vs. asynchronous write paths and conflict resolution strategies.
Replication Topologies: Enforcing Consistency Through Invariants
Replication is not a feature—it is a necessary consequence of distributed systems’ exposure to failure. The only question is how consistency is enforced when data exists in multiple places simultaneously. Every replication topology makes immutable tradeoffs between consistency, availability, and operational complexity. These tradeoffs are not negotiable; they are dictated by the laws of distributed computing. This section defines replication topologies by their consistency invariants, the mechanisms that enforce them, and the failure recovery protocols that dominate real-world operation.
Single-Leader Replication: Strong Consistency at the Cost of Write Bottleneck
Invariant: All clients observe a total order of writes. Reads reflect the latest committed write.
This guarantee is enforced by routing all writes through a single leader. The leader serializes writes using a durable log, ensuring global consistency. Followers apply the log in the same order, converging to the same state. This model provides strong consistency but introduces a write bottleneck: throughput is limited by the capacity of one node.
Failure is the default state. If the leader fails, the system cannot accept writes until a new leader is elected. Automatic failover is required, but it risks split-brain if not coordinated with quorum enforcement.
from typing import List, Dict, Optional
from dataclasses import dataclass
from enum import Enum
import time
class ReplicationMode(Enum):
SYNC = "sync"
ASYNC = "async"
@dataclass
class WALRecord:
key: str
value: str
term: int # Leader's epoch
index: int # Log position
timestamp: float = time.time()
class FollowerNode:
def __init__(self, node_id: str):
self.node_id = node_id
self.last_applied_index: int = -1
self.log: Dict[int, WALRecord] = {}
def replicate(self, record: WALRecord) -> bool:
"""Append to local log and return success."""
self.log[record.index] = record
return True
def apply_up_to(self, index: int) -> None:
"""Apply log entries in order up to index."""
while self.last_applied_index + 1 <= index:
next_index = self.last_applied_index + 1
if next_index in self.log:
record = self.log[next_index]
# Apply to state machine (e.g., key-value store)
self.apply_record(record)
self.last_applied_index = next_index
def apply_record(self, record: WALRecord) -> None:
# Simulate state update
pass
class LeaderNode:
def __init__(self):
self.current_term: int = 0
self.log: List[WALRecord] = []
self.commit_index: int = -1
self.wal_offset: int = 0
def appendToWAL(self, key: str, value: str) -> WALRecord:
record = WALRecord(
key=key,
value=value,
term=self.current_term,
index=len(self.log)
)
self.log.append(record)
return record
def apply(self, record: WALRecord) -> None:
# Apply to local state machine
pass
class ReplicatedDatabase:
def __init__(self, mode: ReplicationMode = ReplicationMode.SYNC):
self.leader = LeaderNode()
self.followers: List[FollowerNode] = [FollowerNode(f"f{i}") for i in range(3)]
self.mode = mode
def write(self, key: str, value: str) -> bool:
# 1. Write to leader's WAL
record = self.leader.appendToWAL(key, value)
# 2. Replicate based on mode
if self.mode == ReplicationMode.SYNC:
# Wait for all followers to acknowledge
for follower in self.followers:
if not follower.replicate(record):
return False # Fail fast on any failure
else: # ASYNC
for follower in self.followers:
# Fire and forget
follower.replicate(record)
# 3. Apply to leader's state
self.leader.apply(record)
self.leader.commit_index = record.index
# 4. Apply to followers asynchronously in background
if self.mode == ReplicationMode.ASYNC:
self._replicate_async(record)
return True
def _replicate_async(self, record: WALRecord) -> None:
"""Background task to push to followers."""
for follower in self.followers:
try:
follower.replicate(record)
# Background apply
follower.apply_up_to(record.index)
except Exception:
# Retry logic omitted
pass
def read(self, key: str, consistent: bool = False) -> Optional[str]:
"""If consistent, read from leader. Otherwise, read from any follower."""
if consistent:
# Simulate read from leader's state
return self._read_from_leader(key)
else:
# Read from any follower (eventual consistency)
return self.followers[0].log.get(self.followers[0].last_applied_index, None) # Simplified
def _read_from_leader(self, key: str) -> Optional[str]:
# Simulate key-value lookup
for record in reversed(self.leader.log):
if record.key == key:
return record.value
return None
Split-brain occurs when two nodes believe they are the leader. Recovery requires fencing: the old leader must be isolated before the new one takes over. This is enforced via distributed locks (e.g., using Raft or Paxos) or lease-based coordination. Without fencing, concurrent writes violate the total order invariant and cause irreversible divergence.
Multi-Leader Replication: Write Availability at the Cost of Conflict Complexity
Invariant: Writes succeed on any leader during network partition. Consistency is eventual and application-defined.
This model allows multiple leaders to accept writes independently. It maximizes write availability but sacrifices strong consistency. Concurrent writes to the same key on different leaders create conflicting versions. The system cannot resolve these automatically without policy.
Conflict resolution is not optional—it is the core design constraint. The choice of strategy determines correctness.
from typing import List, Dict, Tuple
from dataclasses import dataclass
import time
@dataclass
class VersionedValue:
value: str
timestamp: float = time.time()
leader_id: str = ""
version_vector: Dict[str, int] = None # Per-leader counter
def __post_init__(self):
if self.version_vector is None:
self.version_vector = {self.leader_id: 1}
def increment_version(self) -> None:
self.version_vector[self.leader_id] = self.version_vector.get(self.leader_id, 0) + 1
self.timestamp = time.time()
def resolve_conflict_lww(conflicting: List[VersionedValue]) -> VersionedValue:
"""Last-write-wins by timestamp. Loses data on clock skew."""
return max(conflicting, key=lambda v: v.timestamp)
def resolve_conflict_version_vector(conflicting: List[VersionedValue]) -> List[VersionedValue]:
"""Detect concurrent writes. Return all conflicting versions for application resolution."""
# Simplified: compare version vectors for causality
# If A > B: A overwrites B
# If A || B: concurrent, return both
survivors = []
for v in conflicting:
is_overwritten = False
for other in conflicting:
if other != v and _causally_dominates(other.version_vector, v.version_vector):
is_overwritten = True
break
if not is_overwritten:
survivors.append(v)
return survivors
def _causally_dominates(a: Dict[str, int], b: Dict[str, int]) -> bool:
"""True if a >= b for all keys and a > b for at least one."""
if set(a.keys()) < set(b.keys()):
return False
greater_or_equal = all(a.get(k, 0) >= b[k] for k in b)
strictly_greater = any(a.get(k, 0) > b[k] for k in b)
return greater_or_equal and strictly_greater
def merge_shopping_carts(cart_a: Dict[str, int], cart_b: Dict[str, int]) -> Dict[str, int]:
"""CRDT-style merge: sum quantities per item."""
result = cart_a.copy()
for item, qty in cart_b.items():
result[item] = result.get(item, 0) + qty
return result
Split-brain is not a failure mode—it is a normal operating condition in multi-leader systems. Leaders operate independently during partitions. Recovery requires merging divergent histories using version vectors or application-specific logic. Read repair detects inconsistencies during reads and triggers reconciliation.
Replication Strategy Tradeoffs
| Replication Type | Consistency Guarantee | Write Availability | Read Scalability | Failure Recovery | Conflict Resolution |
|---|---|---|---|---|---|
| Single-Leader (Sync) | Strong (total order) | Low (single writer) | High (read from followers) | Requires leader election + fencing | None (single source of truth) |
| Single-Leader (Async) | Eventual (stale reads possible) | Moderate | High | Risk of data loss; requires WAL replay | None |
| Multi-Leader | Eventual (with conflicts) | High | High | Automatic during partition; merge on heal | Required (LWW, version vectors, CRDTs) |
| Leaderless (Quorum) | Configurable (R + W > N) | High | High | Read repair, anti-entropy | Version vectors, reconciliation |
Design Implications
- Strong consistency requires a single serialization point. This is enforced by a leader, a lock service, or a consensus algorithm. There is no alternative.
- Write availability requires relinquishing immediate consistency. Multi-leader and leaderless systems achieve this by allowing concurrent writes, but they shift conflict resolution to the application.
- Replication lag is not a performance issue—it is a consistency boundary. Asynchronous replication decouples durability from latency but exposes stale reads. Read-after-write consistency requires routing recent reads to the leader or tracking replication watermarks.
- Split-brain recovery is a mandatory design requirement. Systems must detect and resolve divergent states. Fencing (via leases or consensus) prevents dual leadership in single-leader systems. Multi-leader systems must merge concurrent writes using causal metadata.
- The choice of replication topology is determined by the application’s tolerance for inconsistency. Financial ledgers require strong consistency. Collaborative editors tolerate eventual consistency with merge logic. Choose accordingly.