Skip to main content
data systems mechanics invariants in distributed architectures

Architecture and Evolution

12 min read Chapter 25 of 28
Summary

This chapter synthesizes core architectural patterns for evolving...

This chapter synthesizes core architectural patterns for evolving data systems, emphasizing immutable trade-offs. It begins with immutable data structures (LSM-Trees) and eventual consistency as foundational principles. The LSM-Tree SSTable merge code illustrates the sequential I/O and write amplification trade-off of compaction. Distributed consistency models are explored through Two-Phase Commit (2PC), highlighting its blocking risk and atomicity guarantee, and the Saga pattern, which uses compensating transactions for eventual consistency. A comprehensive comparison table contrasts eight architectural patterns (B-Tree, LSM-Tree, Columnar, Single/Multi/Leaderless Replication, Lambda, Kappa) across dimensions like primary data structure, update pattern, amplification, consistency model, scalability, and fault tolerance. The chapter concludes with an evolution diagram describing a system centered on an immutable log as the source of truth, feeding various derived data systems (search index, cache, analytics DB, graph DB) via ingestion paths like transactional writes and CDC. Schema evolution handlers and failure recovery mechanisms (checkpointing, replay, idempotent writers) ensure the system can evolve and recover deterministically. The narrative consistently frames choices around the immutable trade-offs of read vs. write amplification and consistency vs. availability.

Architecture and Evolution

The design of evolving systems is fundamentally about managing trade-offs. At the heart of these trade-offs lies the concept of immutable data structures and eventual consistency. This chapter synthesizes previous concepts into coherent architectural patterns, focusing on how systems evolve over time, incorporate new data, and maintain robustness under failure. Every architectural decision enforces an invariant at the cost of a corresponding penalty—understanding these trade-offs is essential for building systems that scale and survive.

Immutable Data Structures

Invariant: Previous states of data must remain accessible to enable auditability, rollback, and safe concurrent access.

Immutable data structures, such as those in LSM-Trees (Log-Structured Merge-Trees), enforce this invariant by never modifying data in place. Instead, updates are written as new entries, preserving prior versions. The SSTable (Sorted String Table), a core component of LSM-Trees, is an immutable, sorted key-value file optimized for sequential disk access. This design shifts the cost from random I/O during writes to increased read amplification during lookups, as multiple SSTables may need to be consulted.

The merge operation—compaction—reclaims space and reduces read overhead by merging sorted SSTables into a single output, applying last-writer-wins semantics. However, this process incurs high write amplification, as data is rewritten repeatedly across levels. This is the price paid for sustained high write throughput and crash consistency.

# Modern Python 3.11+ example: LSM-Tree SSTable Merge (Compaction)
# Illustrates the sequential I/O pattern and immutable, out-of-place update.
# Invariant: Compaction merges sorted inputs to produce sorted output, reclaiming space.

from typing import Iterator, List, Tuple, Optional
import heapq

class SSTableIterator:
    """Iterator over key-value pairs in a single SSTable file."""
    def __init__(self, file_path: str):
        self.file = open(file_path, 'rb')
        # Assume file contains sorted (key, value, seq_num) tuples
    def __iter__(self) -> Iterator[Tuple[bytes, bytes, int]]:
        # Simplified: read sequentially
        while True:
            chunk = self.file.read(1024)
            if not chunk:
                break
            # Parse chunk into key, value, seq_num
            yield (b'key', b'value', 1)  # Placeholder
    def close(self):
        self.file.close()

def merge_sstables(sstable_paths: List[str], output_path: str) -> None:
    """
    Merge multiple SSTables into one, applying last-writer-wins per key.
    Trade-off: High write amplification (rewriting data) for reduced read amplification.
    """
    iters: List[Iterator[Tuple[bytes, bytes, int]]] = []
    try:
        # Open all input SSTables for sequential reading
        for path in sstable_paths:
            iters.append(iter(SSTableIterator(path)))
        
        # Min-heap for k-way merge
        HeapItem = Tuple[bytes, bytes, int, int]  # key, value, seq_num, iterator_index
        heap: List[HeapItem] = []
        
        # Initial population of heap with first element from each iterator
        for idx, it in enumerate(iters):
            try:
                key, value, seq_num = next(it)
                heapq.heappush(heap, (key, value, seq_num, idx))
            except StopIteration:
                pass  # Empty SSTable
        
        current_key: Optional[bytes] = None
        current_value: Optional[bytes] = None
        highest_seq_num = -1
        
        with open(output_path, 'wb') as out_file:
            while heap:
                key, value, seq_num, iter_idx = heapq.heappop(heap)
                
                if key != current_key:
                    # Write the winner for the previous key (if not a tombstone)
                    if current_key is not None and current_value is not None:
                        # Simplified write
                        out_file.write(current_key + current_value)
                    current_key = key
                    current_value = value
                    highest_seq_num = seq_num
                else:
                    # Same key, pick the one with higher sequence number (newer write)
                    if seq_num > highest_seq_num:
                        current_value = value
                        highest_seq_num = seq_num
                
                # Refill heap from the same iterator
                try:
                    next_key, next_val, next_seq = next(iters[iter_idx])
                    heapq.heappush(heap, (next_key, next_val, next_seq, iter_idx))
                except StopIteration:
                    pass  # Iterator exhausted
            
            # Write the last key
            if current_key is not None and current_value is not None:
                out_file.write(current_key + current_value)
    finally:
        for it in iters:
            it.close()  # Ensure resources are cleaned up

This compaction logic exemplifies the core LSM-Tree trade-off: sequential, high-throughput writes are achieved by deferring cleanup and consistency to background processes, which in turn amplify disk writes and memory pressure.

Eventual Consistency and Distributed Systems

Invariant: The system must remain available and process requests even during network partitions.

Distributed systems often sacrifice strong consistency to uphold this invariant, adopting eventual consistency. While all nodes will converge to the same state in the absence of new updates, temporary divergence is expected and must be managed. The real cost of this model is not inconsistency itself, but the complexity it introduces in application logic and failure recovery.

Two-Phase Commit

Invariant: Atomicity across distributed nodes must be guaranteed—either all participants commit, or none do.

The Two-Phase Commit (2PC) protocol enforces atomicity through a prepare-then-commit sequence. However, this guarantee comes at a steep cost: the protocol is inherently blocking. If the coordinator crashes after the prepare phase but before issuing the commit, all participants remain in an “in-doubt” state, holding locks and blocking further operations. Recovery requires manual intervention via mechanisms like xa_recover(), making 2PC unsuitable for long-running transactions or systems requiring high availability.

The blocking nature of 2PC is the price paid for strong atomicity. While effective in tightly coupled systems with low failure rates, it fails the “assume failure is the default” principle.

# Modern Python 3.11+ example: Two-Phase Commit Coordinator
# Illustrates the blocking nature and recovery challenge.
# Invariant: Atomicity across nodes is guaranteed by the prepare-then-commit protocol.

from dataclasses import dataclass
from typing import List, Tuple
import asyncio

@dataclass
class Participant:
    name: str
    prepared: bool = False
    committed: bool = False
    
    async def prepare(self, tx_id: str) -> bool:
        """Phase 1: Participant votes YES/NO."""
        # Simulate durable log write (e.g., to WAL)
        await asyncio.sleep(0.05)  # Simulate I/O
        # Simulate a possible failure (e.g., constraint violation)
        if self.name == "ParticipantB":
            print(f"{self.name}: Vote NO (simulated constraint violation)")
            return False
        self.prepared = True
        print(f"{self.name}: Vote YES")
        return True
    
    async def commit(self, tx_id: str) -> bool:
        await asyncio.sleep(0.05)
        self.committed = True
        print(f"{self.name}: Committed")
        return True
    
    async def abort(self, tx_id: str) -> bool:
        await asyncio.sleep(0.05)
        self.prepared = False
        print(f"{self.name}: Aborted")
        return True

async def two_phase_commit_coordinator(participants: List[Participant], tx_id: str) -> bool:
    """
    Coordinator for 2PC. Demonstrates the blocking risk.
    Trade-off: Strong atomicity guarantee vs. risk of indefinite blocking.
    """
    votes: List[Tuple[Participant, bool]] = []
    
    # --- PHASE 1: Prepare ---
    print(f"[Coordinator] Starting 2PC for transaction {tx_id}")
    print(f"[Coordinator] Phase 1: Sending PREPARE to all participants")
    for participant in participants:
        try:
            vote = await participant.prepare(tx_id)
            votes.append((participant, vote))
        except Exception as e:
            print(f"[Coordinator] Participant {participant.name} failed during prepare: {e}")
            votes.append((participant, False))
    
    # Decide based on votes
    all_voted_yes = all(vote for _, vote in votes)
    
    if not all_voted_yes:
        print(f"[Coordinator] Phase 1: At least one participant voted NO. Initiating ABORT.")
        abort_tasks = [asyncio.create_task(p.abort(tx_id)) for p in participants]
        await asyncio.gather(*abort_tasks, return_exceptions=True)
        return False
    
    # --- PHASE 2: Commit ---
    print(f"[Coordinator] Phase 1: All participants voted YES. Sending COMMIT.")
    # **CRITICAL BLOCKING POINT**: If coordinator crashes here, participants remain in-doubt.
    commit_tasks = [asyncio.create_task(p.commit(tx_id)) for p in participants]
    results = await asyncio.gather(*commit_tasks, return_exceptions=True)
    
    if any(isinstance(r, Exception) for r in results):
        print(f"[Coordinator] Phase 2: Commit failed on some participants. Manual intervention required.")
        # In a real system, recovery via xa_recover() is needed.
        return False
    
    print(f"[Coordinator] Transaction {tx_id} committed successfully.")
    return True

Saga Pattern

Invariant: Long-running business processes must make forward progress without blocking system resources indefinitely.

The Saga pattern replaces distributed locking with application-level compensation. A saga is a sequence of local transactions, each paired with a compensating action that reverses its effects. If a step fails, the saga executes compensations in reverse order, restoring the system to a consistent state. This approach avoids the blocking pitfalls of 2PC but shifts the burden of consistency to the application developer.

Failure recovery is intrinsic: if a saga crashes mid-execution, it resumes from the last completed step and runs compensations. However, compensation logic must be idempotent and durable—once executed, it cannot be undone. This makes sagas ideal for business workflows (e.g., order fulfillment) but unsuitable for low-level data consistency.

# Modern Python 3.11+ example: Saga Compensating Transaction
# Contrasts with 2PC by using application-level compensation for eventual consistency.
# Invariant: Saga ensures eventual consistency through forward progress and compensation.

from dataclasses import dataclass
from typing import Callable, List
import asyncio

@dataclass
class SagaStep:
    name: str
    execute: Callable[[], bool]           # Local transaction
    compensate: Callable[[], bool]        # Compensating action
    executed: bool = False
    compensated: bool = False

class SagaCoordinator:
    """
    Orchestrates a Saga. Executes steps sequentially.
    If any step fails, runs compensation for all previously executed steps in reverse order.
    Trade-off: Avoids blocking but requires application-defined compensation logic.
    """
    def __init__(self, steps: List[SagaStep]):
        self.steps = steps
    
    async def run(self) -> bool:
        executed_steps: List[SagaStep] = []
        
        for step in self.steps:
            print(f"[Saga] Executing step: {step.name}")
            try:
                success = await asyncio.to_thread(step.execute)
                if not success:
                    print(f"[Saga] Step {step.name} failed. Initiating compensation.")
                    await self._compensate(executed_steps)
                    return False
                step.executed = True
                executed_steps.append(step)
            except Exception as e:
                print(f"[Saga] Step {step.name} raised exception: {e}. Compensating.")
                await self._compensate(executed_steps)
                return False
        
        print("[Saga] All steps completed successfully.")
        return True
    
    async def _compensate(self, steps: List[SagaStep]) -> None:
        """Run compensation in reverse order."""
        print("[Saga] Running compensating transactions...")
        for step in reversed(steps):
            if step.executed and not step.compensated:
                print(f"[Saga] Compensating for step: {step.name}")
                try:
                    await asyncio.to_thread(step.compensate)
                    step.compensated = True
                except Exception as e:
                    print(f"[Saga] Compensation for {step.name} failed: {e}. Saga may be in inconsistent state.")
                    # In a real system, might require manual intervention.

Therefore, for long-running business processes, the Saga pattern is preferable to 2PC: it ensures forward progress, avoids indefinite blocking, and aligns with the reality of distributed failure.

Architectural Patterns Comparison

The preceding examples illustrate how invariants shape system behavior under load and failure. The following table synthesizes these insights, mapping each pattern to its core trade-offs in scalability, consistency, and fault tolerance.

Architectural PatternPrimary Data StructureUpdate PatternOptimized ForRead AmplificationWrite AmplificationConsistency Model (Distributed)Scalability MechanismFault Tolerance Mechanism
B-Tree (Page-Oriented)Mutable, balanced treeIn-place, random I/ORead-heavy, point queriesLowHigh (node splits/merges)Strong (via locking/WAL)Vertical scaling, limited horizontalWAL, replication
LSM-Tree (Log-Structured)Immutable SSTablesOut-of-place, sequential I/OWrite-heavy, append-intensiveHigh (multiple levels)High (compaction)Eventual (during compaction)Horizontal scaling via partitioningImmutable SSTables, WAL for memtable
Columnar StorageColumn-oriented filesBatch updatesAnalytic scans, aggregationsLow for projected columnsVery High (rewrite columns)VariesHorizontal scalingReplication, checksums
Single-Leader ReplicationLeader-follower logWrites through leaderStrong consistencyLow (local read)High (synchronous latency)Strong (linearizable)Read scaling via followersAutomatic failover
Multi-Leader ReplicationMultiple logsWrites to any leaderWrite availabilityLow (local read)Medium (conflict resolution)Eventual (conflict resolution)Geographic distributionConflict resolution, last-write-wins
Leaderless Replication (Dynamo-style)Quorum-basedWrites to quorumHigh availability, partition toleranceConfigurable (quorum size)Configurable (quorum size)Configurable (quorum consistency)Horizontal scalingRead repair, hinted handoff
Lambda ArchitectureBatch layer (immutable) + Speed layer (stream)Batch recomputation + incremental streamBoth historical accuracy and low latencyHigh (batch) + Low (speed)High (batch) + Medium (stream)Eventual (batch overrides speed)Horizontal scaling for both layersImmutable inputs, recomputation
Kappa ArchitectureSingle immutable logStream processing onlyUnified processing modelLow (single layer)Medium (stream state)Eventual (stream processing)Horizontal scaling of stream jobsCheckpointing, replayable sources

Evolution Diagram

Diagram: Evolution of Storage Architecture from Monolith to Distributed Derived Views

Invariant: The system of record is an immutable, append-only log; all derived views are eventually consistent functions of this log.

Components:

  1. System of Record (Immutable Log): A central, durable, totally-ordered sequence of events (e.g., Kafka topic, database WAL). Guarantees: Append-only, durable, totally ordered.
  2. Ingestion Paths:
    • Transactional Writes (OLTP): Applications write events (e.g., UserRegistered, OrderPlaced) directly to the log.
    • Change Data Capture (CDC): Database transaction logs are tailed, transformed into events, and appended to the log.
  3. Derived Data Systems (Materialized Views): Independent consumers read from the log and maintain their own state.
    • Search Index: Consumes events to build and update a searchable index (e.g., Elasticsearch). Recovery: Replay log from last checkpoint.
    • Cache (Read-Through): Warms itself based on event patterns. Recovery: Replay or lazy refresh.
    • Analytics Database (Columnar): Batch or stream processor aggregates events into columnar tables. Recovery: Recomputation from log.
    • Graph Database: Builds relationship graphs from events. Recovery: Replay.
  4. Schema Evolution Handler: A component that sits between the log and consumers, applying transformations (e.g., adding default values for new fields, renaming fields) to maintain compatibility with older consumer code.
  5. Failure Recovery Mechanisms:
    • Checkpointing: Each derived system periodically writes its log position to durable storage.
    • Replay: On failure, a consumer resets to its last checkpoint and reprocesses the log.
    • Idempotent Writers: Consumers write derived state idempotently, making replay safe.

This architecture assumes failure at every level: consumers crash, networks partition, and schemas evolve. Recovery is not an afterthought—it is the primary design constraint. By treating the log as the sole source of truth and deriving all state from it, the system ensures that any component can be rebuilt from first principles.

Sources

This chapter draws on concepts and examples introduced in previous chapters, including discussions on data integration, schema evolution, and distributed systems. For further reading and deeper exploration of these topics, refer to the sources listed below.

By integrating these concepts and patterns, architects can design evolving systems that balance consistency, availability, and performance, ensuring robustness and scalability in the face of growing data and user demands. The key is to embrace trade-offs, assume failure, and build recovery into the core of the architecture.