Architecture and Evolution
SummaryThis chapter synthesizes core architectural patterns for evolving...
This chapter synthesizes core architectural patterns for evolving...
This chapter synthesizes core architectural patterns for evolving data systems, emphasizing immutable trade-offs. It begins with immutable data structures (LSM-Trees) and eventual consistency as foundational principles. The LSM-Tree SSTable merge code illustrates the sequential I/O and write amplification trade-off of compaction. Distributed consistency models are explored through Two-Phase Commit (2PC), highlighting its blocking risk and atomicity guarantee, and the Saga pattern, which uses compensating transactions for eventual consistency. A comprehensive comparison table contrasts eight architectural patterns (B-Tree, LSM-Tree, Columnar, Single/Multi/Leaderless Replication, Lambda, Kappa) across dimensions like primary data structure, update pattern, amplification, consistency model, scalability, and fault tolerance. The chapter concludes with an evolution diagram describing a system centered on an immutable log as the source of truth, feeding various derived data systems (search index, cache, analytics DB, graph DB) via ingestion paths like transactional writes and CDC. Schema evolution handlers and failure recovery mechanisms (checkpointing, replay, idempotent writers) ensure the system can evolve and recover deterministically. The narrative consistently frames choices around the immutable trade-offs of read vs. write amplification and consistency vs. availability.
Architecture and Evolution
The design of evolving systems is fundamentally about managing trade-offs. At the heart of these trade-offs lies the concept of immutable data structures and eventual consistency. This chapter synthesizes previous concepts into coherent architectural patterns, focusing on how systems evolve over time, incorporate new data, and maintain robustness under failure. Every architectural decision enforces an invariant at the cost of a corresponding penalty—understanding these trade-offs is essential for building systems that scale and survive.
Immutable Data Structures
Invariant: Previous states of data must remain accessible to enable auditability, rollback, and safe concurrent access.
Immutable data structures, such as those in LSM-Trees (Log-Structured Merge-Trees), enforce this invariant by never modifying data in place. Instead, updates are written as new entries, preserving prior versions. The SSTable (Sorted String Table), a core component of LSM-Trees, is an immutable, sorted key-value file optimized for sequential disk access. This design shifts the cost from random I/O during writes to increased read amplification during lookups, as multiple SSTables may need to be consulted.
The merge operation—compaction—reclaims space and reduces read overhead by merging sorted SSTables into a single output, applying last-writer-wins semantics. However, this process incurs high write amplification, as data is rewritten repeatedly across levels. This is the price paid for sustained high write throughput and crash consistency.
# Modern Python 3.11+ example: LSM-Tree SSTable Merge (Compaction)
# Illustrates the sequential I/O pattern and immutable, out-of-place update.
# Invariant: Compaction merges sorted inputs to produce sorted output, reclaiming space.
from typing import Iterator, List, Tuple, Optional
import heapq
class SSTableIterator:
"""Iterator over key-value pairs in a single SSTable file."""
def __init__(self, file_path: str):
self.file = open(file_path, 'rb')
# Assume file contains sorted (key, value, seq_num) tuples
def __iter__(self) -> Iterator[Tuple[bytes, bytes, int]]:
# Simplified: read sequentially
while True:
chunk = self.file.read(1024)
if not chunk:
break
# Parse chunk into key, value, seq_num
yield (b'key', b'value', 1) # Placeholder
def close(self):
self.file.close()
def merge_sstables(sstable_paths: List[str], output_path: str) -> None:
"""
Merge multiple SSTables into one, applying last-writer-wins per key.
Trade-off: High write amplification (rewriting data) for reduced read amplification.
"""
iters: List[Iterator[Tuple[bytes, bytes, int]]] = []
try:
# Open all input SSTables for sequential reading
for path in sstable_paths:
iters.append(iter(SSTableIterator(path)))
# Min-heap for k-way merge
HeapItem = Tuple[bytes, bytes, int, int] # key, value, seq_num, iterator_index
heap: List[HeapItem] = []
# Initial population of heap with first element from each iterator
for idx, it in enumerate(iters):
try:
key, value, seq_num = next(it)
heapq.heappush(heap, (key, value, seq_num, idx))
except StopIteration:
pass # Empty SSTable
current_key: Optional[bytes] = None
current_value: Optional[bytes] = None
highest_seq_num = -1
with open(output_path, 'wb') as out_file:
while heap:
key, value, seq_num, iter_idx = heapq.heappop(heap)
if key != current_key:
# Write the winner for the previous key (if not a tombstone)
if current_key is not None and current_value is not None:
# Simplified write
out_file.write(current_key + current_value)
current_key = key
current_value = value
highest_seq_num = seq_num
else:
# Same key, pick the one with higher sequence number (newer write)
if seq_num > highest_seq_num:
current_value = value
highest_seq_num = seq_num
# Refill heap from the same iterator
try:
next_key, next_val, next_seq = next(iters[iter_idx])
heapq.heappush(heap, (next_key, next_val, next_seq, iter_idx))
except StopIteration:
pass # Iterator exhausted
# Write the last key
if current_key is not None and current_value is not None:
out_file.write(current_key + current_value)
finally:
for it in iters:
it.close() # Ensure resources are cleaned up
This compaction logic exemplifies the core LSM-Tree trade-off: sequential, high-throughput writes are achieved by deferring cleanup and consistency to background processes, which in turn amplify disk writes and memory pressure.
Eventual Consistency and Distributed Systems
Invariant: The system must remain available and process requests even during network partitions.
Distributed systems often sacrifice strong consistency to uphold this invariant, adopting eventual consistency. While all nodes will converge to the same state in the absence of new updates, temporary divergence is expected and must be managed. The real cost of this model is not inconsistency itself, but the complexity it introduces in application logic and failure recovery.
Two-Phase Commit
Invariant: Atomicity across distributed nodes must be guaranteed—either all participants commit, or none do.
The Two-Phase Commit (2PC) protocol enforces atomicity through a prepare-then-commit sequence. However, this guarantee comes at a steep cost: the protocol is inherently blocking. If the coordinator crashes after the prepare phase but before issuing the commit, all participants remain in an “in-doubt” state, holding locks and blocking further operations. Recovery requires manual intervention via mechanisms like xa_recover(), making 2PC unsuitable for long-running transactions or systems requiring high availability.
The blocking nature of 2PC is the price paid for strong atomicity. While effective in tightly coupled systems with low failure rates, it fails the “assume failure is the default” principle.
# Modern Python 3.11+ example: Two-Phase Commit Coordinator
# Illustrates the blocking nature and recovery challenge.
# Invariant: Atomicity across nodes is guaranteed by the prepare-then-commit protocol.
from dataclasses import dataclass
from typing import List, Tuple
import asyncio
@dataclass
class Participant:
name: str
prepared: bool = False
committed: bool = False
async def prepare(self, tx_id: str) -> bool:
"""Phase 1: Participant votes YES/NO."""
# Simulate durable log write (e.g., to WAL)
await asyncio.sleep(0.05) # Simulate I/O
# Simulate a possible failure (e.g., constraint violation)
if self.name == "ParticipantB":
print(f"{self.name}: Vote NO (simulated constraint violation)")
return False
self.prepared = True
print(f"{self.name}: Vote YES")
return True
async def commit(self, tx_id: str) -> bool:
await asyncio.sleep(0.05)
self.committed = True
print(f"{self.name}: Committed")
return True
async def abort(self, tx_id: str) -> bool:
await asyncio.sleep(0.05)
self.prepared = False
print(f"{self.name}: Aborted")
return True
async def two_phase_commit_coordinator(participants: List[Participant], tx_id: str) -> bool:
"""
Coordinator for 2PC. Demonstrates the blocking risk.
Trade-off: Strong atomicity guarantee vs. risk of indefinite blocking.
"""
votes: List[Tuple[Participant, bool]] = []
# --- PHASE 1: Prepare ---
print(f"[Coordinator] Starting 2PC for transaction {tx_id}")
print(f"[Coordinator] Phase 1: Sending PREPARE to all participants")
for participant in participants:
try:
vote = await participant.prepare(tx_id)
votes.append((participant, vote))
except Exception as e:
print(f"[Coordinator] Participant {participant.name} failed during prepare: {e}")
votes.append((participant, False))
# Decide based on votes
all_voted_yes = all(vote for _, vote in votes)
if not all_voted_yes:
print(f"[Coordinator] Phase 1: At least one participant voted NO. Initiating ABORT.")
abort_tasks = [asyncio.create_task(p.abort(tx_id)) for p in participants]
await asyncio.gather(*abort_tasks, return_exceptions=True)
return False
# --- PHASE 2: Commit ---
print(f"[Coordinator] Phase 1: All participants voted YES. Sending COMMIT.")
# **CRITICAL BLOCKING POINT**: If coordinator crashes here, participants remain in-doubt.
commit_tasks = [asyncio.create_task(p.commit(tx_id)) for p in participants]
results = await asyncio.gather(*commit_tasks, return_exceptions=True)
if any(isinstance(r, Exception) for r in results):
print(f"[Coordinator] Phase 2: Commit failed on some participants. Manual intervention required.")
# In a real system, recovery via xa_recover() is needed.
return False
print(f"[Coordinator] Transaction {tx_id} committed successfully.")
return True
Saga Pattern
Invariant: Long-running business processes must make forward progress without blocking system resources indefinitely.
The Saga pattern replaces distributed locking with application-level compensation. A saga is a sequence of local transactions, each paired with a compensating action that reverses its effects. If a step fails, the saga executes compensations in reverse order, restoring the system to a consistent state. This approach avoids the blocking pitfalls of 2PC but shifts the burden of consistency to the application developer.
Failure recovery is intrinsic: if a saga crashes mid-execution, it resumes from the last completed step and runs compensations. However, compensation logic must be idempotent and durable—once executed, it cannot be undone. This makes sagas ideal for business workflows (e.g., order fulfillment) but unsuitable for low-level data consistency.
# Modern Python 3.11+ example: Saga Compensating Transaction
# Contrasts with 2PC by using application-level compensation for eventual consistency.
# Invariant: Saga ensures eventual consistency through forward progress and compensation.
from dataclasses import dataclass
from typing import Callable, List
import asyncio
@dataclass
class SagaStep:
name: str
execute: Callable[[], bool] # Local transaction
compensate: Callable[[], bool] # Compensating action
executed: bool = False
compensated: bool = False
class SagaCoordinator:
"""
Orchestrates a Saga. Executes steps sequentially.
If any step fails, runs compensation for all previously executed steps in reverse order.
Trade-off: Avoids blocking but requires application-defined compensation logic.
"""
def __init__(self, steps: List[SagaStep]):
self.steps = steps
async def run(self) -> bool:
executed_steps: List[SagaStep] = []
for step in self.steps:
print(f"[Saga] Executing step: {step.name}")
try:
success = await asyncio.to_thread(step.execute)
if not success:
print(f"[Saga] Step {step.name} failed. Initiating compensation.")
await self._compensate(executed_steps)
return False
step.executed = True
executed_steps.append(step)
except Exception as e:
print(f"[Saga] Step {step.name} raised exception: {e}. Compensating.")
await self._compensate(executed_steps)
return False
print("[Saga] All steps completed successfully.")
return True
async def _compensate(self, steps: List[SagaStep]) -> None:
"""Run compensation in reverse order."""
print("[Saga] Running compensating transactions...")
for step in reversed(steps):
if step.executed and not step.compensated:
print(f"[Saga] Compensating for step: {step.name}")
try:
await asyncio.to_thread(step.compensate)
step.compensated = True
except Exception as e:
print(f"[Saga] Compensation for {step.name} failed: {e}. Saga may be in inconsistent state.")
# In a real system, might require manual intervention.
Therefore, for long-running business processes, the Saga pattern is preferable to 2PC: it ensures forward progress, avoids indefinite blocking, and aligns with the reality of distributed failure.
Architectural Patterns Comparison
The preceding examples illustrate how invariants shape system behavior under load and failure. The following table synthesizes these insights, mapping each pattern to its core trade-offs in scalability, consistency, and fault tolerance.
| Architectural Pattern | Primary Data Structure | Update Pattern | Optimized For | Read Amplification | Write Amplification | Consistency Model (Distributed) | Scalability Mechanism | Fault Tolerance Mechanism |
|---|---|---|---|---|---|---|---|---|
| B-Tree (Page-Oriented) | Mutable, balanced tree | In-place, random I/O | Read-heavy, point queries | Low | High (node splits/merges) | Strong (via locking/WAL) | Vertical scaling, limited horizontal | WAL, replication |
| LSM-Tree (Log-Structured) | Immutable SSTables | Out-of-place, sequential I/O | Write-heavy, append-intensive | High (multiple levels) | High (compaction) | Eventual (during compaction) | Horizontal scaling via partitioning | Immutable SSTables, WAL for memtable |
| Columnar Storage | Column-oriented files | Batch updates | Analytic scans, aggregations | Low for projected columns | Very High (rewrite columns) | Varies | Horizontal scaling | Replication, checksums |
| Single-Leader Replication | Leader-follower log | Writes through leader | Strong consistency | Low (local read) | High (synchronous latency) | Strong (linearizable) | Read scaling via followers | Automatic failover |
| Multi-Leader Replication | Multiple logs | Writes to any leader | Write availability | Low (local read) | Medium (conflict resolution) | Eventual (conflict resolution) | Geographic distribution | Conflict resolution, last-write-wins |
| Leaderless Replication (Dynamo-style) | Quorum-based | Writes to quorum | High availability, partition tolerance | Configurable (quorum size) | Configurable (quorum size) | Configurable (quorum consistency) | Horizontal scaling | Read repair, hinted handoff |
| Lambda Architecture | Batch layer (immutable) + Speed layer (stream) | Batch recomputation + incremental stream | Both historical accuracy and low latency | High (batch) + Low (speed) | High (batch) + Medium (stream) | Eventual (batch overrides speed) | Horizontal scaling for both layers | Immutable inputs, recomputation |
| Kappa Architecture | Single immutable log | Stream processing only | Unified processing model | Low (single layer) | Medium (stream state) | Eventual (stream processing) | Horizontal scaling of stream jobs | Checkpointing, replayable sources |
Evolution Diagram
Diagram: Evolution of Storage Architecture from Monolith to Distributed Derived Views
Invariant: The system of record is an immutable, append-only log; all derived views are eventually consistent functions of this log.
Components:
- System of Record (Immutable Log): A central, durable, totally-ordered sequence of events (e.g., Kafka topic, database WAL). Guarantees: Append-only, durable, totally ordered.
- Ingestion Paths:
- Transactional Writes (OLTP): Applications write events (e.g.,
UserRegistered,OrderPlaced) directly to the log. - Change Data Capture (CDC): Database transaction logs are tailed, transformed into events, and appended to the log.
- Transactional Writes (OLTP): Applications write events (e.g.,
- Derived Data Systems (Materialized Views): Independent consumers read from the log and maintain their own state.
- Search Index: Consumes events to build and update a searchable index (e.g., Elasticsearch). Recovery: Replay log from last checkpoint.
- Cache (Read-Through): Warms itself based on event patterns. Recovery: Replay or lazy refresh.
- Analytics Database (Columnar): Batch or stream processor aggregates events into columnar tables. Recovery: Recomputation from log.
- Graph Database: Builds relationship graphs from events. Recovery: Replay.
- Schema Evolution Handler: A component that sits between the log and consumers, applying transformations (e.g., adding default values for new fields, renaming fields) to maintain compatibility with older consumer code.
- Failure Recovery Mechanisms:
- Checkpointing: Each derived system periodically writes its log position to durable storage.
- Replay: On failure, a consumer resets to its last checkpoint and reprocesses the log.
- Idempotent Writers: Consumers write derived state idempotently, making replay safe.
This architecture assumes failure at every level: consumers crash, networks partition, and schemas evolve. Recovery is not an afterthought—it is the primary design constraint. By treating the log as the sole source of truth and deriving all state from it, the system ensures that any component can be rebuilt from first principles.
Sources
This chapter draws on concepts and examples introduced in previous chapters, including discussions on data integration, schema evolution, and distributed systems. For further reading and deeper exploration of these topics, refer to the sources listed below.
- Source 1: Introduction to Distributed Systems
- Source 2: Data Integration Patterns
- Source 3: Schema Evolution Strategies
By integrating these concepts and patterns, architects can design evolving systems that balance consistency, availability, and performance, ensuring robustness and scalability in the face of growing data and user demands. The key is to embrace trade-offs, assume failure, and build recovery into the core of the architecture.