The Physics of Storage: Persisting State

Storage systems are governed by immutable physical and algorithmic constraints that dictate how data is persisted across reboots and failures. At the core of these systems lie two dominant data structures—B-Trees and Log-Structured Merge-Trees (LSM-Trees)—each enforcing distinct trade-offs between read, write, and storage amplification. These trade-offs are not artifacts of implementation but direct consequences of the underlying mechanics of disk I/O, memory hierarchy, and data mutability. This chapter establishes the invariants of disk-based storage and analyzes how B-Trees and LSM-Trees satisfy them under divergent design principles.

Invariants of Disk-Based Storage

The performance and efficiency of any storage system are bound by three amplification factors: read, write, and storage amplification. These are not tunable parameters but emergent properties of the data structure’s design.

Read amplification is the number of I/O operations required to retrieve a single logical record. It is minimized by direct indexing and maximized by multi-level indirection.
Write amplification is the ratio of physical bytes written to storage versus logical bytes written by the application. It increases with in-place updates, node splits, and compaction.
Storage amplification is the ratio of physical storage consumed to logical data size. It is inflated by replication, file system overhead, and data redundancy during compaction.

These factors form a constrained optimization surface: improving one necessarily degrades at least one other.

B-Trees: In-Place Updates and Read Efficiency

B-Trees enforce low read amplification by maintaining a balanced, searchable tree where each key resides in exactly one node. The structure guarantees O(log N) search time with a single disk seek per level, typically resulting in 3–5 I/Os for large datasets.

Updates occur in place. When a node becomes full, it splits, and the parent is updated—potentially triggering a cascade of splits up to the root. This mechanism ensures structural balance but incurs high write amplification under random writes: a single 4KB page update may propagate through multiple levels, rewriting tens of kilobytes physically.

The B-Tree’s invariants:

All leaves are at the same depth.
Each node contains between t−1 and 2t−1 keys (for order t).
Search, insertion, and deletion are O(log N).

These guarantees are maintained through split and merge operations that preserve balance, but at the cost of random write throughput.

LSM-Trees: Sequential Writes and Compaction Overhead

LSM-Trees eliminate random writes by batching mutations in a mutable in-memory structure (MemTable), then flushing them sequentially to disk as immutable Sorted String Tables (SSTables). This design converts random writes into sequential I/O, maximizing throughput on spinning disks and reducing flash wear.

However, reads must consult multiple SSTable levels—L0 (recent flushes), L1, L2, etc.—each potentially containing overlapping key ranges. Without global merging, a single read may require I/O from every level, increasing read amplification.

Compaction enforces structural integrity by merging SSTables across levels, discarding obsolete entries, and reducing read overhead. But this process rewrites data multiple times, leading to high write amplification—often 10x to 50x the logical write volume.

The LSM-Tree’s invariants:

All on-disk data is immutable.
SSTables within a level are sorted by key.
Each level is a multiple in size of the previous (geometric growth).
Compaction ensures eventual consistency and key unification.

These guarantees enable high write throughput but at the expense of background I/O and memory for merge operations.

Amplification Trade-Offs

Amplification Type	Definition	Typical Range	Primary Influences
Read Amplification	I/O operations per logical read	1–10+	Data structure depth, Bloom filters, caching
Write Amplification	Physical writes per logical write	1–50+	Compaction strategy, update frequency, tree order
Storage Amplification	Physical storage per logical data	1–5+	Replication, compaction overhead, file alignment

Data Structure	Mutability	Write Pattern	Read Amplification	Write Amplification
B-Tree	Mutable	Random	Low (1–3 I/Os)	High (splits, cascading updates)
LSM-Tree	Immutable	Sequential	High (multi-level search)	High (compaction merges)

Code Examples

// B-Tree node structure (in-memory)
typedef struct BTreeNode {
    int *keys;
    struct BTreeNode **children;
    int num_keys;
    int leaf;
} BTreeNode;

// Flush MemTable to immutable SSTable (LSM-Tree)
void flush_memtable_to_sstable(MemTable *mem, const char *filename) {
    // Iterate sorted in-memory entries
    // Write key-value pairs sequentially to disk
    // File becomes immutable SSTable
}

// Compact two sorted SSTables into one
void compact_sstables(const char *in1, const char *in2, const char *out) {
    FILE *f1 = fopen(in1, "r");
    FILE *f2 = fopen(in2, "r");
    FILE *outf = fopen(out, "w");
    // Merge-sort key streams
    // Omit duplicate or deleted keys
    // Write merged output
    fclose(f1); fclose(f2); fclose(outf);
}

These implementations reflect the core divergence: B-Trees modify existing structures, LSM-Trees generate new ones.

Diagram Descriptions

Diagram 1: B-Tree Structure
- A balanced tree with internal nodes containing keys and child pointers.
- Node capacity bounded by order t; splits occur at overflow.
- In-place updates modify leaf or internal nodes directly.
- All paths from root to leaf have equal length.
Diagram 2: LSM-Tree Levels
- MemTable: mutable, in-memory sorted structure (e.g., skip list).
- L0: multiple small SSTables from flushes; overlapping keys allowed.
- L1 and beyond: larger, non-overlapping SSTables after compaction.
- Arrows indicate compaction merging smaller files into larger, sorted ones.
Diagram 3: Write Amplification in Compaction
- Four SSTables (total 400 MB) merged into one 400 MB output.
- Despite no net data growth, 1.6 GB of I/O occurs (4x write amplification).
- Old files deleted post-merge; space reclaimed asynchronously.
Diagram 4: Read Path in LSM-Tree
- Read operation checks MemTable → L0 → L1 → L2 in sequence.
- Bloom filters attached to each SSTable allow skipping non-existent keys.
- Worst case: one I/O per level, leading to high read latency.

Decision Rule: Choose Based on Workload Dominance

The selection between B-Trees and LSM-Trees is dictated by workload characteristics:

Use B-Trees when read amplification must be minimized. Applications with frequent point queries, low write concurrency, or strict latency requirements benefit from direct access and predictable I/O.
Use LSM-Trees when write amplification can be tolerated for throughput. High-ingest systems (e.g., time-series databases, write-ahead logs) leverage sequential writes and batched compaction to sustain massive write loads.

There is no convergence between these designs. The trade-off is absolute: optimizing for sequential writes inherently increases read and compaction costs. Systems must be architected around this invariant.