WAL, Compaction, and Durability Guarantees
WAL, Compaction, and Durability Guarantees
Chapter 1 showed that appending to a file is the only sane default for writes: sequential I/O, crash-safe, simple. Chapter 2 added indexes, B-Trees and LSM-trees, to make reads fast. But indexes are derived data structures. A B-Tree page can be corrupted by a crash mid-write. An LSM-tree memtable lives in volatile memory. If either is lost, the data is gone.
The Write-Ahead Log solves this. Before modifying any data structure, write a description of the change to an append-only log file. If the system crashes, replay the log to reconstruct the state that was lost. The log is the source of truth. Everything else is a materialized view of the log.
This is the same append-only file from Chapter 1. The pattern persists because the properties that make it a bad database (every read requires a full scan) make it a perfect recovery mechanism (every write is durable, and the log can be replayed from any point).
PostgreSQL’s WAL
PostgreSQL’s WAL is a sequence of 16MB segment files in pg_wal/. Every modification to a data page, whether a heap insert, an index page split, or a TOAST update, is first recorded as a WAL record before the data page is modified.
The Write Path
INSERT INTO package_events ...arrives.- PostgreSQL constructs a WAL record describing the change: which page, which offset, what data.
- The WAL record is written to the WAL buffer in shared memory.
- At commit time, PostgreSQL flushes the WAL buffer to disk (
fsyncon the WAL segment file). The commit does not return until the WAL is on persistent media. - The actual data page (heap tuple, index entry) is modified in the buffer pool. This page is “dirty” but has not been written to disk yet.
- The background writer or checkpointer will eventually write the dirty page to disk.
The critical property: the WAL is on disk before the commit returns to the client. If the server crashes after the commit, the WAL contains the record. During recovery, PostgreSQL replays the WAL from the last checkpoint, reapplying any changes to data pages that were not yet written to disk.
-- Concept: WAL generation rate and its relationship to write throughput
-- Monitor WAL during a bulk insert of 1 million package events
SELECT pg_current_wal_lsn() as before_lsn;
INSERT INTO package_events (package_id, status, timestamp, warehouse_id)
SELECT
'PKG-' || lpad(generate_series::text, 8, '0'),
'SCANNED',
now() - (random() * interval '7 days'),
'WH-' || (random() * 50)::int
FROM generate_series(1, 1000000);
SELECT pg_current_wal_lsn() as after_lsn;
-- before_lsn: 0/A4000000
-- after_lsn: 0/D8500000
-- Difference: ~848 MB of WAL for 1 million rows
-- Average WAL per row: ~848 bytes
-- The table row is approximately 80 bytes.
-- The WAL record is ~848 bytes because it includes:
-- heap insert record, index insert records (3 indexes),
-- page headers, checksums, and alignment padding.
-- Write amplification from WAL: ~10.6x
Checkpoints
WAL segments accumulate. Without cleanup, they fill the disk. Checkpoints solve this.
A checkpoint writes all dirty pages from the buffer pool to disk. After a checkpoint, the WAL records before that point are no longer needed for crash recovery, because the data pages on disk already reflect those changes. PostgreSQL can then recycle those WAL segments.
-- Concept: checkpoint behavior and its I/O cost
SHOW checkpoint_timeout; -- Default: 5min
SHOW max_wal_size; -- Default: 1GB
SHOW checkpoint_completion_target; -- Default: 0.9
-- PostgreSQL triggers a checkpoint when:
-- 1. checkpoint_timeout (5 minutes) has elapsed since the last checkpoint, OR
-- 2. WAL usage exceeds max_wal_size (1GB)
-- checkpoint_completion_target = 0.9 means PostgreSQL spreads
-- the dirty page writes over 90% of the checkpoint interval
-- to avoid a burst of I/O that would spike query latency.
The diagram shows the WAL flush cycle. Writes append to the WAL and modify pages in the buffer pool. The checkpoint process periodically flushes dirty pages to disk and advances the recovery point. WAL segments before the checkpoint are eligible for recycling. The buffer pool acts as a cache of data pages, and the WAL is the guarantee that uncommitted dirty pages can be recovered.
Kafka’s Partition Log
Kafka’s storage is the append-only log from Chapter 1, elevated to a distributed system. Each partition is a directory of segment files. Each segment is an append-only file of messages.
# Concept: Kafka partition log segments on disk
ls -la /var/kafka-logs/package-events-0/
# -rw-r--r-- 1 kafka kafka 1073741824 Nov 15 14:22 00000000000000000000.log
# -rw-r--r-- 1 kafka kafka 10485760 Nov 15 14:22 00000000000000000000.index
# -rw-r--r-- 1 kafka kafka 10485760 Nov 15 14:22 00000000000000000000.timeindex
# -rw-r--r-- 1 kafka kafka 536870912 Nov 15 14:23 00000000000001048576.log
# -rw-r--r-- 1 kafka kafka 10485760 Nov 15 14:23 00000000000001048576.index
# The .log file is the append-only data file (same pattern as CH1)
# The .index file maps message offsets to byte positions (same idea as CH2's hash index)
# The .timeindex maps timestamps to offsets (for time-based lookups)
# File names encode the starting offset of the first message in the segment
Kafka’s durability configuration controls when the OS flushes the page cache to disk:
log.flush.interval.messages: flush after this many messages (default: Long.MAX_VALUE, effectively never)log.flush.interval.ms: flush after this many milliseconds (default: null, rely on OS)
By default, Kafka does not call fsync after every write. It relies on the OS page cache and Linux’s dirty_writeback_centisecs (default 5 seconds) to flush data to disk. Durability comes from replication: if the leader crashes before flushing, the replicas have the data.
This is a fundamentally different durability model than PostgreSQL’s. PostgreSQL guarantees that a committed transaction is on disk before returning to the client. Kafka guarantees that a committed message is replicated to min.insync.replicas brokers, but those brokers may not have flushed to disk yet.
RocksDB’s WAL
RocksDB’s WAL serves the same purpose as PostgreSQL’s: protecting the memtable against crashes. If RocksDB crashes after writing to the memtable but before flushing it to an SSTable, the WAL is replayed to reconstruct the memtable.
The lifecycle:
- Write arrives. A WAL record is appended to the current WAL file.
- The write is inserted into the active memtable.
- When the memtable is full, it is flushed to an SSTable. The WAL records corresponding to that memtable are no longer needed.
- The WAL file is archived or deleted (configurable via
WAL_ttl_seconds).
// Concept: RocksDB WAL and sync configuration
// These options control durability vs write throughput
Options options = new Options();
options.setWalDir("/fast-nvme/rocksdb-wal"); // WAL on fast device
options.setDbLogDir("/sata-ssd/rocksdb-data"); // SSTables on cheaper device
// Sync WAL on every write (maximum durability, lower throughput)
WriteOptions syncWrite = new WriteOptions();
syncWrite.setSync(true);
// Throughput: ~50,000 writes/sec on NVMe
// Batch writes and sync periodically (higher throughput, window of data loss)
WriteOptions asyncWrite = new WriteOptions();
asyncWrite.setSync(false);
asyncWrite.setDisableWAL(false);
// Throughput: ~500,000 writes/sec on NVMe
// Risk: up to WAL_ttl_seconds of data loss on crash
The Durability Spectrum
All three systems use the same fundamental pattern: append to a log before modifying the primary data structure. They differ in when the log reaches persistent media.
| System | Default fsync behavior | Data loss window on crash |
|---|---|---|
| PostgreSQL | fsync at every commit | Zero (with synchronous_commit=on) |
| Kafka | No fsync, relies on replication | Up to OS flush interval (5s) |
| RocksDB | Configurable per write | Zero (sync=true) to seconds (sync=false) |
The decision is always the same tradeoff: latency vs durability. Fsyncing every write guarantees no data loss but caps throughput at the fsync rate of the device. Batching fsyncs or relying on replication increases throughput but opens a window during which a crash loses data.
The Decision Rule
Use synchronous WAL writes (PostgreSQL’s default, RocksDB with sync=true) when losing even one committed transaction is unacceptable. Financial transactions, audit logs, inventory decrements. The latency cost is real but bounded by NVMe fsync performance (50,000-80,000 operations per second).
Use asynchronous WAL writes with replication (Kafka’s default, RocksDB with sync=false plus replication) when throughput matters more than per-write durability. Event streams, metrics, clickstreams. The replication factor determines how many simultaneous hardware failures the system can tolerate.
For the logistics platform: package tracking events use Kafka with acks=all and min.insync.replicas=2. The WAL is not fsynced on every write, but the event is replicated to two brokers before the producer receives an acknowledgment. A single broker crash loses no data. A correlated double failure (both replicas crash before flushing) can lose events in the 5-second flush window. This risk is accepted because the source of truth for delivery status is the scanner device, which retries on failure.