Log-Structured Engines: Optimizing for Write Throughput

Log-Structured Merge-Trees (LSM-Trees) are designed to optimize write throughput by leveraging sequential disk I/O. This approach fundamentally differs from B-Trees, which focus on minimizing read amplification. In LSM-Trees, writes are first buffered in an in-memory data structure called a memtable, and then periodically flushed to disk as immutable Sorted String Tables (SSTables). This process allows for high write throughput but introduces additional complexity in read operations and space management.

Memtable, WAL, and SSTable: The Core Components

Memtable: A mutable, in-memory data structure that holds recently written key-value pairs in sorted order. When the memtable reaches a size threshold, it is frozen and a new memtable is created for incoming writes. The frozen memtable is then flushed to disk as an SSTable.
WAL (Write-Ahead Log): A sequential, append-only log file on disk where all write operations are recorded immediately before being applied to the memtable. This ensures durability by allowing recovery of unflushed data after a crash.
SSTable (Sorted String Table): An immutable, on-disk file containing a sequence of key-value pairs sorted by key. SSTables are the building blocks of LSM-Trees, providing efficient storage and retrieval of data.

The Write Path: From Memtable to SSTable

The write path in an LSM-Tree involves the following steps:

Append to WAL: Each write operation is first recorded in the WAL to ensure durability.
Insert into Memtable: The write is then inserted into the memtable, which maintains the data in sorted order.
Memtable Flush: When the memtable reaches its size limit, it is frozen, and its contents are flushed to disk as a new SSTable. This process involves sorting the memtable contents and writing them sequentially to disk.

The Read Path: Finding Data Across Memtables and SSTables

Reading data from an LSM-Tree involves checking multiple components in sequence:

Active Memtable: The search starts in the active memtable, which contains the most recent writes.
Immutable Memtables: If the key is not found in the active memtable, the search proceeds to immutable memtables that are pending flush.
SSTables: Finally, the search checks the SSTables on disk, starting from the most recent (Level 0) to the oldest. To optimize this process, LSM-Trees employ Bloom filters associated with each SSTable to quickly determine if a key is present, thus avoiding unnecessary disk I/O.

Compaction: Managing SSTables and Reducing Amplification

As SSTables accumulate, LSM-Trees use a process called compaction to merge multiple SSTables into fewer, larger ones. Compaction serves several purposes:

Reduces Read Amplification: By merging SSTables, compaction decreases the number of files that need to be checked during a read operation, thus reducing read amplification.
Reclaims Space: Compaction removes deleted or overwritten keys, reclaiming space and reducing storage amplification.
Manages Write Amplification: Although compaction itself contributes to write amplification, it is essential for maintaining the overall health and efficiency of the LSM-Tree by preventing an unbounded growth of SSTables.

Leveling Compaction Strategy

One common compaction strategy is leveling, where SSTables are organized into levels, each with a maximum size limit. When a level reaches its limit, SSTables from that level are compacted with overlapping SSTables from the next level, producing new SSTables for the next level. This strategy ensures that each level has a bounded number of SSTables, preventing exponential growth and maintaining predictable performance.

Tiering Compaction Strategy

Another strategy is tiering, where SSTables within a level can have overlapping key ranges. Compaction occurs when a level reaches a capacity threshold, merging all SSTables in that level into new SSTables for the next level. Tiering allows for more flexibility in managing SSTable sizes and key ranges but requires careful tuning to avoid increased write amplification.

Bloom Filters: Optimizing Read Performance

To further optimize read performance, LSM-Trees utilize Bloom filters, probabilistic data structures that can quickly determine whether a key is present in an SSTable. Before performing disk I/O to read an SSTable, the Bloom filter is consulted. If the filter indicates the key is not present, the SSTable can be skipped, reducing unnecessary disk accesses and improving read efficiency.

Code Examples: Illustrating LSM-Tree Mechanics

Generic LSM-Tree Compaction Process

function compact(level_i_sstables, level_i_plus_1_sstables) {
    merged_kv_pairs = []
    iterators = create_sorted_iterators(level_i_sstables, level_i_plus_1_sstables)
    heap = MinHeap(iterators) // compares by current key
    current_key = null
    current_value = null
    highest_seq_num = -1
    while not heap.empty() {
        (key, value, seq_num, iterator) = heap.pop_min()
        if key != current_key {
            if current_key != null and not is_tombstone(current_value) {
                merged_kv_pairs.append((current_key, current_value))
            }
            current_key = key
            current_value = value
            highest_seq_num = seq_num
        } else {
            if seq_num > highest_seq_num {
                current_value = value
                highest_seq_num = seq_num
            }
        }
        if iterator.has_next() {
            heap.push(iterator.next())
        }
    }
    if current_key != null and not is_tombstone(current_value) {
        merged_kv_pairs.append((current_key, current_value))
    }
    new_sstables = split_and_write_sstables(merged_kv_pairs, level_i_plus_1_target_size)
    delete_files(level_i_sstables, level_i_plus_1_sstables)
    register_sstables(new_sstables, level=i+1)
    return new_sstables
}

Bloom Filter Usage for SSTable Lookup Optimization

class SSTable {
    BloomFilter filter;
    FileHandle data_file;
    IndexBlock index;
    function might_contain(key) {
        return filter.test(key)
    }
    function get(key) {
        if not might_contain(key) {
            return NOT_FOUND
        }
        block_offset = index.lookup(key)
        data_block = data_file.read_block(block_offset)
        return data_block.search(key)
    }
}

function lsm_read(key) {
    result = active_memtable.get(key)
    if result != NOT_FOUND: return result
    for level in sorted_levels_new_to_old {
        for sstable in level.sstables {
            if sstable.might_contain(key) {
                value = sstable.get(key)
                if value != NOT_FOUND {
                    return value
                }
            }
        }
    }
    return NOT_FOUND
}

Tables: Summarizing Key Components and Amplification Factors

Component	Location	Mutability	Purpose	I/O Pattern
WAL (Write-Ahead Log)	Disk	Append-only	Durability: Persist writes before memtable. Crash recovery.	Sequential Writes
Active Memtable	Memory (RAM)	Mutable	Buffer incoming writes in sorted order for fast in-memory ops.	In-memory (No disk I/O for inserts)
Immutable Memtable	Memory (RAM)	Immutable	Frozen memtable waiting to be flushed. Serves reads.	In-memory (No disk I/O)
SSTable (Flush from Memtable)	Disk	Immutable	Long-term storage of sorted key-value pairs.	Sequential Write (during flush)
SSTable (During Compaction)	Disk	Immutable (new files)	Merged, deduplicated data from older SSTables.	Sequential Read (old files), Sequential Write (new files)
Bloom Filter	Memory (RAM)	Immutable (per SSTable)	Probabilistic membership test to avoid disk I/O for non-existent keys.	In-memory (No disk I/O)
Block Cache	Memory (RAM)	Mutable (LRU)	Cache frequently read SSTable data blocks.	Reduces Random Read I/O

Diagram Descriptions: Visualizing LSM-Tree Mechanics

Diagram: LSM-Tree Write Path

Client Write Request: Key-Value pair arrives.
Step 1: Log to WAL: The pair is appended as a record to the end of the Write-Ahead Log file on disk (sequential write).
Step 2: Insert into Memtable: The pair is inserted into the in-memory sorted structure (e.g., skip list) of the active memtable.
Memtable Full?: When the active memtable size exceeds a threshold, it becomes immutable, and a new active memtable is created.
Flush to SSTable: A background thread takes the immutable memtable, sorts its entries (already sorted in most designs), and writes them sequentially to a new SSTable file on disk (L0).
Cycle Repeats: The process continues. The WAL for the old memtable can be archived or a new WAL is started.

Diagram: LSM-Tree Read Path

Client Read Request: Key lookup arrives.
Search Active Memtable: Check the in-memory, mutable active memtable.
Search Immutable Memtables: Check any frozen memtables pending flush.
Search SSTable Levels (L0 → Ln): For each level, from newest (L0) to oldest (Ln): a. For each SSTable in level: Consult its in-memory Bloom filter. b. Bloom Filter Says ‘No’: Skip this SSTable entirely (no I/O). c. Bloom Filter Says ‘Maybe’: Perform disk I/O to read the relevant index and data blocks from the SSTable.
Return Value or Not Found: The first valid value found (with highest sequence number) is returned. If not found in any component, return ‘key not found’.

Diagram: Compaction Process (Leveling)

[Level L] —> Contains multiple SSTables (e.g., 4 files). [Level L+1] —> Contains larger, non-overlapping SSTables. Trigger: Level L exceeds its size limit (e.g., 4 files). Process:

Select all SSTables in Level L.
Identify key range covered by selected SSTables.
Find all SSTables in Level L+1 whose key ranges overlap with the range from step 2.
Perform a multi-way merge-sort on the data from the selected SSTables (Level L + overlapping Level L+1).
During the merge, discard older versions of keys (lower sequence number) and tombstones whose keys have no older data.
Write the merged, sorted output as new, larger SSTables into Level L+1.
Delete the input SSTables from Levels L and L+1. Result: Reduced number of files in Level L, maintained sorted, non-overlapping structure in Level L+1, reclaimed space from deletions.

Comparative Analysis: B-Trees vs. LSM-Trees

Amplification Type	B-Tree (In-place)	LSM-Tree (Out-of-place)	Primary Cause
Write Amplification	Moderate to High (e.g., 4-50)	High (e.g., 2-30)	B-Tree: Node splits/merges, journaling. LSM: Compaction rewrites data multiple times.
Read Amplification	Low (Often 1)	Moderate to High	B-Tree: Tree traversal (log N). LSM: Checking memtable + multiple SSTable levels.
Storage Amplification	Low (∼1)	Moderate ( >1 during compaction)	B-Tree: In-place updates reuse space. LSM: Transient duplicates exist before compaction reclaims space.
I/O Pattern (Writes)	Random (in-place update of leaf/page)	Sequential (memtable flush, compaction)	Fundamental architectural difference.
I/O Pattern (Reads)	Random (index traversal)	Random (if data not in cache/Bloom filter negative)	B-Tree always random. LSM random if Bloom filter passes or cache miss.
Optimization For	Read Latency, Point Lookups	Write Throughput, Bulk Ingestion	Design trade-off.

In conclusion, LSM-Trees offer a compelling solution for write-heavy workloads by optimizing for sequential disk I/O and leveraging memtables, WALs, and SSTables. While they introduce complexity in read operations and space management, careful design and tuning can mitigate these challenges. The choice between B-Trees and LSM-Trees represents a fundamental trade-off between read and write amplification, with each suited to different workload profiles.

Sources

This chapter has drawn on internal knowledge and research to provide a comprehensive overview of LSM-Trees and their mechanics. For further reading and deeper exploration of storage systems, readers are encouraged to consult the references provided.