B-Trees: The Read-Optimized Mutable Standard
SummaryThis section establishes the fundamental dichotomy between mutable,...
This section establishes the fundamental dichotomy between mutable,...
This section establishes the fundamental dichotomy between mutable, page-oriented storage (B-Trees) and immutable, log-structured storage (LSM-Trees). It explains the core mechanics of update-in-place versus out-of-place appends, directly linking these to their resulting I/O patterns: random writes for B-Trees and sequential writes for LSM-Trees. The analysis details the fragmentation trade-off—internal and external for B-Trees versus storage amplification from compaction for LSM-Trees—and underscores the critical role of the Write-Ahead Log (WAL) in enabling durable in-place updates. A comparative table synthesizes these characteristics, and contrasting pseudocode illustrates the procedural differences. The narrative frames these as immutable trade-offs, optimized for read-heavy versus write-heavy workloads respectively.
B-Trees: The Read-Optimized Mutable Standard
B-Trees are the canonical example of a mutable, page-oriented storage structure. They maintain a balanced, searchable tree where each key resides in exactly one node, ensuring low read amplification. Updates occur in-place, with the database managing a buffer pool of fixed-size pages cached in RAM. A B-Tree update involves reading the target page into the buffer pool, modifying it in memory, and eventually writing the dirty page back to its exact same location on disk.
The update-in-place mechanism in B-Trees creates random write I/O patterns, as the modified page can be anywhere on disk. Random writes are significantly slower than sequential writes on magnetic hard disk drives (HDDs), making B-Trees less optimal for write-heavy workloads. However, for read-heavy workloads, B-Trees often have lower latency because the key is found in a single, predictable location.
Fragmentation in B-Trees
Fragmentation in B-Trees occurs internally (within partially filled pages) and externally (free space gaps from deleted pages). Internal fragmentation wastes space inside pages, while external fragmentation makes it difficult to allocate large contiguous regions, though this is less critical for SSDs. The use of a Write-Ahead Log (WAL) is critical for B-Trees to guarantee durability despite update-in-place mechanics.
Log-Structured Storage: A Contrasting Approach
Log-structured storage, as used in LSM-Trees, avoids update-in-place by appending all new data (including updates and deletes) to the end of a log file. This approach converts random application writes into sequential disk writes during memtable flushes and compaction processes. LSM-Trees do not suffer from update-in-place fragmentation but have storage amplification due to data duplication across SSTable levels during compaction.
Comparative Analysis
The following table contrasts the core mechanics, I/O patterns, and trade-offs between page-oriented (B-Tree) and log-structured (LSM-Tree) storage models.
| Characteristic | Page-Oriented Storage (B-Tree) | Log-Structured Storage (LSM-Tree) |
|---|---|---|
| Update Model | In-place update | Out-of-place append (immutable) |
| Primary Write I/O Pattern | Random (modify page at location X) | Sequential (append to log/SSTable) |
| Primary Read I/O Pattern | Random (seek to page location) | Random (seek within SSTable files) |
| Fragmentation | Internal (partially full pages) & External (free space gaps) | None from updates; space overhead from duplicates during compaction |
| Crash Recovery Mechanism | Write-Ahead Log (WAL) mandatory | Separate WAL for memtable only; SSTables are immutable once written |
| Amplification | Write Amplification: Moderate-High (page splits/merges). Read Amplification: Low. Storage Amplification: Low. | Write Amplification: High (compaction). Read Amplification: Moderate-High (multiple levels). Storage Amplification: Moderate-High (temporary duplicates). |
| Optimized Workload | Read-heavy, point reads, workloads needing low latency | Write-heavy, bulk inserts, append-intensive workloads |
| Example Systems | Traditional RDBMS (PostgreSQL, MySQL/InnoDB), B-Tree file systems | Key-Value Stores (RocksDB, LevelDB, Cassandra, ScyllaDB) |
Code Illustrations
The following pseudocode illustrates the B-Tree insert operation with Write-Ahead Logging, contrasting with the LSM-Tree write path.
function btree_insert_with_wal(BTree* tree, Key key, Value value, WAL* log) {
// Find the target leaf page
Page* leaf_page = find_leaf_page(tree, key);
// Prepare log record
WALRecord rec = create_wal_record(INSERT_OP, leaf_page->page_id, key, value);
// Write log record to stable storage
log_append_and_flush(log, rec);
// Apply change in memory
page_insert_key_value(leaf_page, key, value);
// Handle page overflow
if (page_is_overfull(leaf_page)) {
btree_split_leaf_with_wal(tree, leaf_page, log);
}
}
function lsm_write(LSMTree* tree, Key key, Value value) {
// Optional: Write to a separate WAL for memtable durability
// Update the in-memory memtable
memtable_insert(tree->active_memtable, key, value);
// Flush memtable to SSTable when full
if (memtable_size(tree->active_memtable) > threshold) {
flush_memtable_to_sstable(tree->active_memtable);
}
}
The provided code snippets demonstrate the fundamental difference in update mechanics between B-Trees and LSM-Trees, highlighting the trade-offs in I/O patterns, fragmentation, and amplification factors.
Sources
This section has referenced various concepts and technologies related to B-Trees and LSM-Trees. For further reading and implementation details, please refer to the cited sources and supplementary materials.