The Physics of Storage: Persisting State
SummaryThis section establishes the fundamental constraints of disk-based...
This section establishes the fundamental constraints of disk-based...
This section establishes the fundamental constraints of disk-based storage, focusing on the trade-offs between read and write amplification. It introduces two core data structures: mutable B-Trees, which perform in-place updates and offer low read amplification but can have high write amplification for random writes; and immutable LSM-Trees (Log-Structured Merge-Trees), which perform out-of-place updates, batching writes in memory and flushing them as sorted string tables (SSTables). LSM-Trees optimize for write throughput by converting random writes to sequential disk writes, but incur higher read amplification due to multi-level lookups and write amplification during compaction. The section defines three key amplification metrics: read amplification (I/O ops per logical read), write amplification (physical writes per logical write), and storage amplification (physical storage per logical data). It presents comparative tables and code examples illustrating a B-Tree node structure and LSM-Tree operations (flush and compaction). The core narrative is that the choice between B-Trees and LSM-Trees represents a fundamental trade-off: B-Trees favor read-heavy workloads, while LSM-Trees favor write-heavy workloads, with amplification factors as critical design considerations.
The Physics of Storage: Persisting State
Storage systems are governed by immutable physical and algorithmic constraints that dictate how data is persisted across reboots and failures. At the core of these systems lie two dominant data structures—B-Trees and Log-Structured Merge-Trees (LSM-Trees)—each enforcing distinct trade-offs between read, write, and storage amplification. These trade-offs are not artifacts of implementation but direct consequences of the underlying mechanics of disk I/O, memory hierarchy, and data mutability. This chapter establishes the invariants of disk-based storage and analyzes how B-Trees and LSM-Trees satisfy them under divergent design principles.
Invariants of Disk-Based Storage
The performance and efficiency of any storage system are bound by three amplification factors: read, write, and storage amplification. These are not tunable parameters but emergent properties of the data structure’s design.
- Read amplification is the number of I/O operations required to retrieve a single logical record. It is minimized by direct indexing and maximized by multi-level indirection.
- Write amplification is the ratio of physical bytes written to storage versus logical bytes written by the application. It increases with in-place updates, node splits, and compaction.
- Storage amplification is the ratio of physical storage consumed to logical data size. It is inflated by replication, file system overhead, and data redundancy during compaction.
These factors form a constrained optimization surface: improving one necessarily degrades at least one other.
B-Trees: In-Place Updates and Read Efficiency
B-Trees enforce low read amplification by maintaining a balanced, searchable tree where each key resides in exactly one node. The structure guarantees O(log N) search time with a single disk seek per level, typically resulting in 3–5 I/Os for large datasets.
Updates occur in place. When a node becomes full, it splits, and the parent is updated—potentially triggering a cascade of splits up to the root. This mechanism ensures structural balance but incurs high write amplification under random writes: a single 4KB page update may propagate through multiple levels, rewriting tens of kilobytes physically.
The B-Tree’s invariants:
- All leaves are at the same depth.
- Each node contains between t−1 and 2t−1 keys (for order t).
- Search, insertion, and deletion are O(log N).
These guarantees are maintained through split and merge operations that preserve balance, but at the cost of random write throughput.
LSM-Trees: Sequential Writes and Compaction Overhead
LSM-Trees eliminate random writes by batching mutations in a mutable in-memory structure (MemTable), then flushing them sequentially to disk as immutable Sorted String Tables (SSTables). This design converts random writes into sequential I/O, maximizing throughput on spinning disks and reducing flash wear.
However, reads must consult multiple SSTable levels—L0 (recent flushes), L1, L2, etc.—each potentially containing overlapping key ranges. Without global merging, a single read may require I/O from every level, increasing read amplification.
Compaction enforces structural integrity by merging SSTables across levels, discarding obsolete entries, and reducing read overhead. But this process rewrites data multiple times, leading to high write amplification—often 10x to 50x the logical write volume.
The LSM-Tree’s invariants:
- All on-disk data is immutable.
- SSTables within a level are sorted by key.
- Each level is a multiple in size of the previous (geometric growth).
- Compaction ensures eventual consistency and key unification.
These guarantees enable high write throughput but at the expense of background I/O and memory for merge operations.
Amplification Trade-Offs
| Amplification Type | Definition | Typical Range | Primary Influences |
|---|---|---|---|
| Read Amplification | I/O operations per logical read | 1–10+ | Data structure depth, Bloom filters, caching |
| Write Amplification | Physical writes per logical write | 1–50+ | Compaction strategy, update frequency, tree order |
| Storage Amplification | Physical storage per logical data | 1–5+ | Replication, compaction overhead, file alignment |
| Data Structure | Mutability | Write Pattern | Read Amplification | Write Amplification |
|---|---|---|---|---|
| B-Tree | Mutable | Random | Low (1–3 I/Os) | High (splits, cascading updates) |
| LSM-Tree | Immutable | Sequential | High (multi-level search) | High (compaction merges) |
Code Examples
// B-Tree node structure (in-memory)
typedef struct BTreeNode {
int *keys;
struct BTreeNode **children;
int num_keys;
int leaf;
} BTreeNode;
// Flush MemTable to immutable SSTable (LSM-Tree)
void flush_memtable_to_sstable(MemTable *mem, const char *filename) {
// Iterate sorted in-memory entries
// Write key-value pairs sequentially to disk
// File becomes immutable SSTable
}
// Compact two sorted SSTables into one
void compact_sstables(const char *in1, const char *in2, const char *out) {
FILE *f1 = fopen(in1, "r");
FILE *f2 = fopen(in2, "r");
FILE *outf = fopen(out, "w");
// Merge-sort key streams
// Omit duplicate or deleted keys
// Write merged output
fclose(f1); fclose(f2); fclose(outf);
}
These implementations reflect the core divergence: B-Trees modify existing structures, LSM-Trees generate new ones.
Diagram Descriptions
-
Diagram 1: B-Tree Structure
- A balanced tree with internal nodes containing keys and child pointers.
- Node capacity bounded by order t; splits occur at overflow.
- In-place updates modify leaf or internal nodes directly.
- All paths from root to leaf have equal length.
-
Diagram 2: LSM-Tree Levels
- MemTable: mutable, in-memory sorted structure (e.g., skip list).
- L0: multiple small SSTables from flushes; overlapping keys allowed.
- L1 and beyond: larger, non-overlapping SSTables after compaction.
- Arrows indicate compaction merging smaller files into larger, sorted ones.
-
Diagram 3: Write Amplification in Compaction
- Four SSTables (total 400 MB) merged into one 400 MB output.
- Despite no net data growth, 1.6 GB of I/O occurs (4x write amplification).
- Old files deleted post-merge; space reclaimed asynchronously.
-
Diagram 4: Read Path in LSM-Tree
- Read operation checks MemTable → L0 → L1 → L2 in sequence.
- Bloom filters attached to each SSTable allow skipping non-existent keys.
- Worst case: one I/O per level, leading to high read latency.
Decision Rule: Choose Based on Workload Dominance
The selection between B-Trees and LSM-Trees is dictated by workload characteristics:
- Use B-Trees when read amplification must be minimized. Applications with frequent point queries, low write concurrency, or strict latency requirements benefit from direct access and predictable I/O.
- Use LSM-Trees when write amplification can be tolerated for throughput. High-ingest systems (e.g., time-series databases, write-ahead logs) leverage sequential writes and batched compaction to sustain massive write loads.
There is no convergence between these designs. The trade-off is absolute: optimizing for sequential writes inherently increases read and compaction costs. Systems must be architected around this invariant.