Crash Recovery
Crash Recovery
When the database starts, it must determine whether the previous session ended cleanly. If a WAL file exists and contains records, the previous session may have crashed before completing a checkpoint. Recovery replays committed records from the WAL into the database file, bringing it to a consistent state.
Detecting a Dirty Shutdown
The detection is straightforward:
def needs_recovery(wal_path: str) -> bool:
"""Return True if a WAL file exists and is non-empty."""
if not os.path.exists(wal_path):
return False
return os.path.getsize(wal_path) > 0
An empty WAL (or no WAL at all) means the previous session checkpointed successfully. A non-empty WAL means we must inspect its contents.
Reading WAL Records
The recovery procedure reads the WAL sequentially, extracting records:
from dataclasses import dataclass
@dataclass
class WALRecord:
record_type: int
page_num: int
checksum: int
page_data: bytes | None # None for COMMIT records
def read_wal_records(wal_path: str) -> list[WALRecord]:
"""Read all valid records from the WAL file.
Stops at the first corrupted or incomplete record.
"""
records: list[WALRecord] = []
fd = os.open(wal_path, os.O_RDONLY)
try:
while True:
# Read header
header_bytes = os.read(fd, WAL_RECORD_HEADER_SIZE)
if len(header_bytes) < WAL_RECORD_HEADER_SIZE:
break # Incomplete header — end of valid records
record_type, page_num, checksum = struct.unpack(
WAL_RECORD_HEADER_FORMAT, header_bytes
)
if record_type == RECORD_COMMIT:
records.append(WALRecord(record_type, page_num, checksum, None))
continue
if record_type == RECORD_PAGE_WRITE:
page_data = os.read(fd, PAGE_SIZE)
if len(page_data) < PAGE_SIZE:
break # Incomplete page data — torn record
# Verify checksum
actual_checksum = zlib.crc32(page_data) & 0xFFFFFFFF
if actual_checksum != checksum:
break # Corrupted record — stop here
records.append(
WALRecord(record_type, page_num, checksum, page_data)
)
else:
break # Unknown record type — corruption
finally:
os.close(fd)
return records
Three stopping conditions:
- Incomplete header or page data — the crash happened while writing this record. Discard it.
- Checksum mismatch — the page data is corrupted (torn write). Discard this and all subsequent records.
- Unknown record type — corruption.
In all cases, we stop reading and only use records collected so far.
The Replay Algorithm
Not all records in the WAL should be replayed. Only records that belong to a committed transaction are valid. A committed transaction is a sequence of PAGE_WRITE records followed by a COMMIT record.
def recover(db_path: str) -> int:
"""Replay committed WAL records into the database file.
Returns the number of pages restored.
"""
wal_path = db_path + ".wal"
if not needs_recovery(wal_path):
return 0
records = read_wal_records(wal_path)
# Find the last COMMIT record
last_commit_index = -1
for i, record in enumerate(records):
if record.record_type == RECORD_COMMIT:
last_commit_index = i
if last_commit_index == -1:
# No COMMIT found — all records are from an uncommitted transaction
# Discard the WAL
_truncate_wal(wal_path)
return 0
# Collect PAGE_WRITE records up to and including the last COMMIT
# Only keep the LAST write for each page (later writes override earlier ones)
page_writes: dict[int, bytes] = {}
for i in range(last_commit_index + 1):
record = records[i]
if record.record_type == RECORD_PAGE_WRITE and record.page_data:
page_writes[record.page_num] = record.page_data
# Apply to the database file
fd = os.open(db_path, os.O_RDWR | os.O_CREAT, 0o644)
try:
for page_num, page_data in sorted(page_writes.items()):
os.lseek(fd, page_num * PAGE_SIZE, os.SEEK_SET)
os.write(fd, page_data)
os.fsync(fd) # Ensure all replayed pages are durable
finally:
os.close(fd)
# Truncate the WAL
_truncate_wal(wal_path)
return len(page_writes)
def _truncate_wal(wal_path: str) -> None:
"""Zero out the WAL file."""
fd = os.open(wal_path, os.O_WRONLY | os.O_TRUNC)
os.fsync(fd)
os.close(fd)
Key design choices:
- We keep only the last write for each page within the committed transaction. If page 2 was written three times, only the final version matters.
- We apply pages in sorted order by page number. This is not strictly necessary for correctness (each write is independent), but sequential writes are faster on spinning disks.
- We
fsyncthe database file before truncating the WAL. This ensures the database is consistent even if we crash during WAL truncation.
Integrating Recovery into Database Startup
Recovery runs automatically when the Pager is created:
class Pager:
def __init__(self, db_path: str) -> None:
# Run recovery BEFORE opening for normal operation
pages_recovered = recover(db_path)
if pages_recovered > 0:
print(f"Recovery: replayed {pages_recovered} pages from WAL.")
self.db_path = db_path
self.fd = os.open(db_path, os.O_RDWR | os.O_CREAT, 0o644)
file_length = os.lseek(self.fd, 0, os.SEEK_END)
self.num_pages = file_length // PAGE_SIZE
self.pages: dict[int, memoryview] = {}
self.wal = WAL(db_path + ".wal")
Simulating a Crash
To verify recovery, we simulate a crash by writing data and then not calling close() or checkpoint():
def test_crash_recovery():
"""Simulate a crash and verify recovery restores committed data."""
import os
db_path = "test_recovery.db"
for f in [db_path, db_path + ".wal"]:
if os.path.exists(f):
os.remove(f)
# --- Session 1: Insert 50 rows and commit ---
pager1 = Pager(db_path)
root = pager1.get_page(0)
initialize_leaf_node(root, is_root=True)
pager1.num_pages = 1
for k in range(50):
row = Row(id=k, username=f"user_{k:03d}")
tree_insert(pager1, 0, k, row.serialize())
pager1.commit() # WAL now has committed records
# --- Simulate crash: do NOT call checkpoint or close ---
# The database file may have the data, but the WAL is not truncated.
# Force-close file descriptors without cleanup.
os.close(pager1.fd)
pager1.wal.close()
# --- Session 2: Open the database — recovery should run ---
pager2 = Pager(db_path) # recovery happens in __init__
# Verify all 50 rows are present
for k in range(50):
cursor = Cursor.find(pager2, 0, k)
row = Row.deserialize(bytes(cursor.value()))
assert row.id == k, f"Missing row {k} after recovery"
pager2.close()
# --- Verify WAL is now empty (post-recovery + checkpoint) ---
assert os.path.getsize(db_path + ".wal") == 0
# Cleanup
os.remove(db_path)
os.remove(db_path + ".wal")
print("Crash recovery test passed.")
test_crash_recovery()
Testing Uncommitted Data Rejection
def test_uncommitted_data_discarded():
"""Verify that uncommitted writes are lost after crash."""
db_path = "test_uncommitted.db"
for f in [db_path, db_path + ".wal"]:
if os.path.exists(f):
os.remove(f)
# --- Session 1: Insert rows but never commit ---
pager1 = Pager(db_path)
root = pager1.get_page(0)
initialize_leaf_node(root, is_root=True)
pager1.num_pages = 1
for k in range(10):
row = Row(id=k, username=f"user_{k}")
tree_insert(pager1, 0, k, row.serialize())
# Flush pages to WAL but do NOT commit
for page_num in list(pager1.pages.keys()):
pager1.flush(page_num)
# No commit marker in WAL
os.close(pager1.fd)
pager1.wal.close()
# --- Session 2: Recovery should discard uncommitted WAL ---
pager2 = Pager(db_path)
# The root page should be empty (no committed data)
root = pager2.get_page(0)
# Depending on whether the database file was written,
# the page might have data but the WAL was discarded.
# The important thing is the WAL is now empty.
assert os.path.getsize(db_path + ".wal") == 0
pager2.close()
os.remove(db_path)
os.remove(db_path + ".wal")
print("Uncommitted data rejection test passed.")
test_uncommitted_data_discarded()
The Recovery Guarantee
| Crash scenario | WAL state | Recovery action | Data outcome |
|---|---|---|---|
| Before any flush | Empty | None | No data written |
| After flush, before commit | PAGE_WRITE records, no COMMIT | Discard WAL | Data lost (correct: never committed) |
| After commit, before checkpoint | PAGE_WRITE + COMMIT records | Replay to DB, truncate WAL | All committed data restored |
| During checkpoint | COMMIT present, DB may be partial | Replay committed records (idempotent) | Data restored |
| After checkpoint | Empty WAL | None | Data already in DB |
The WAL guarantees: if commit() returned successfully, the data will survive any subsequent crash. This is durability.
Performance Impact
The WAL adds overhead:
- Disk writes doubled. Every page is written twice: once to the WAL, once to the database file.
- fsync cost. Each commit requires at least one fsync (~2–10ms on SSD).
The trade-off is worthwhile. Without the WAL, a crash corrupts the B-Tree — potentially losing all data, not just the last transaction. With the WAL, the worst case is losing uncommitted changes from the last incomplete transaction.
| Metric | Without WAL | With WAL |
|---|---|---|
| Write throughput | Higher | ~50% lower |
| Crash safety | None | Full durability |
| Recovery time | N/A (data lost) | Proportional to WAL size |
With durability solved, the final chapter turns to the interface layer. Our engine accepts binary operations through Python function calls, but a real database speaks SQL. Chapter 7 builds a tokenizer and parser that translate text commands into the operations our engine already supports.