Skip to main content
python database internals building a persistent engine from scratch

Crash Recovery

7 min read Chapter 18 of 21

Crash Recovery

When the database starts, it must determine whether the previous session ended cleanly. If a WAL file exists and contains records, the previous session may have crashed before completing a checkpoint. Recovery replays committed records from the WAL into the database file, bringing it to a consistent state.

Detecting a Dirty Shutdown

The detection is straightforward:

def needs_recovery(wal_path: str) -> bool:
    """Return True if a WAL file exists and is non-empty."""
    if not os.path.exists(wal_path):
        return False
    return os.path.getsize(wal_path) > 0

An empty WAL (or no WAL at all) means the previous session checkpointed successfully. A non-empty WAL means we must inspect its contents.

Reading WAL Records

The recovery procedure reads the WAL sequentially, extracting records:

from dataclasses import dataclass


@dataclass
class WALRecord:
    record_type: int
    page_num: int
    checksum: int
    page_data: bytes | None  # None for COMMIT records


def read_wal_records(wal_path: str) -> list[WALRecord]:
    """Read all valid records from the WAL file.

    Stops at the first corrupted or incomplete record.
    """
    records: list[WALRecord] = []
    fd = os.open(wal_path, os.O_RDONLY)

    try:
        while True:
            # Read header
            header_bytes = os.read(fd, WAL_RECORD_HEADER_SIZE)
            if len(header_bytes) < WAL_RECORD_HEADER_SIZE:
                break  # Incomplete header — end of valid records

            record_type, page_num, checksum = struct.unpack(
                WAL_RECORD_HEADER_FORMAT, header_bytes
            )

            if record_type == RECORD_COMMIT:
                records.append(WALRecord(record_type, page_num, checksum, None))
                continue

            if record_type == RECORD_PAGE_WRITE:
                page_data = os.read(fd, PAGE_SIZE)
                if len(page_data) < PAGE_SIZE:
                    break  # Incomplete page data — torn record

                # Verify checksum
                actual_checksum = zlib.crc32(page_data) & 0xFFFFFFFF
                if actual_checksum != checksum:
                    break  # Corrupted record — stop here

                records.append(
                    WALRecord(record_type, page_num, checksum, page_data)
                )
            else:
                break  # Unknown record type — corruption
    finally:
        os.close(fd)

    return records

Three stopping conditions:

  1. Incomplete header or page data — the crash happened while writing this record. Discard it.
  2. Checksum mismatch — the page data is corrupted (torn write). Discard this and all subsequent records.
  3. Unknown record type — corruption.

In all cases, we stop reading and only use records collected so far.

The Replay Algorithm

Not all records in the WAL should be replayed. Only records that belong to a committed transaction are valid. A committed transaction is a sequence of PAGE_WRITE records followed by a COMMIT record.

def recover(db_path: str) -> int:
    """Replay committed WAL records into the database file.

    Returns the number of pages restored.
    """
    wal_path = db_path + ".wal"
    if not needs_recovery(wal_path):
        return 0

    records = read_wal_records(wal_path)

    # Find the last COMMIT record
    last_commit_index = -1
    for i, record in enumerate(records):
        if record.record_type == RECORD_COMMIT:
            last_commit_index = i

    if last_commit_index == -1:
        # No COMMIT found — all records are from an uncommitted transaction
        # Discard the WAL
        _truncate_wal(wal_path)
        return 0

    # Collect PAGE_WRITE records up to and including the last COMMIT
    # Only keep the LAST write for each page (later writes override earlier ones)
    page_writes: dict[int, bytes] = {}
    for i in range(last_commit_index + 1):
        record = records[i]
        if record.record_type == RECORD_PAGE_WRITE and record.page_data:
            page_writes[record.page_num] = record.page_data

    # Apply to the database file
    fd = os.open(db_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        for page_num, page_data in sorted(page_writes.items()):
            os.lseek(fd, page_num * PAGE_SIZE, os.SEEK_SET)
            os.write(fd, page_data)
        os.fsync(fd)  # Ensure all replayed pages are durable
    finally:
        os.close(fd)

    # Truncate the WAL
    _truncate_wal(wal_path)

    return len(page_writes)


def _truncate_wal(wal_path: str) -> None:
    """Zero out the WAL file."""
    fd = os.open(wal_path, os.O_WRONLY | os.O_TRUNC)
    os.fsync(fd)
    os.close(fd)

Key design choices:

  • We keep only the last write for each page within the committed transaction. If page 2 was written three times, only the final version matters.
  • We apply pages in sorted order by page number. This is not strictly necessary for correctness (each write is independent), but sequential writes are faster on spinning disks.
  • We fsync the database file before truncating the WAL. This ensures the database is consistent even if we crash during WAL truncation.

Integrating Recovery into Database Startup

Recovery runs automatically when the Pager is created:

class Pager:
    def __init__(self, db_path: str) -> None:
        # Run recovery BEFORE opening for normal operation
        pages_recovered = recover(db_path)
        if pages_recovered > 0:
            print(f"Recovery: replayed {pages_recovered} pages from WAL.")

        self.db_path = db_path
        self.fd = os.open(db_path, os.O_RDWR | os.O_CREAT, 0o644)
        file_length = os.lseek(self.fd, 0, os.SEEK_END)
        self.num_pages = file_length // PAGE_SIZE
        self.pages: dict[int, memoryview] = {}
        self.wal = WAL(db_path + ".wal")

Simulating a Crash

To verify recovery, we simulate a crash by writing data and then not calling close() or checkpoint():

def test_crash_recovery():
    """Simulate a crash and verify recovery restores committed data."""
    import os

    db_path = "test_recovery.db"
    for f in [db_path, db_path + ".wal"]:
        if os.path.exists(f):
            os.remove(f)

    # --- Session 1: Insert 50 rows and commit ---
    pager1 = Pager(db_path)
    root = pager1.get_page(0)
    initialize_leaf_node(root, is_root=True)
    pager1.num_pages = 1

    for k in range(50):
        row = Row(id=k, username=f"user_{k:03d}")
        tree_insert(pager1, 0, k, row.serialize())

    pager1.commit()  # WAL now has committed records

    # --- Simulate crash: do NOT call checkpoint or close ---
    # The database file may have the data, but the WAL is not truncated.
    # Force-close file descriptors without cleanup.
    os.close(pager1.fd)
    pager1.wal.close()

    # --- Session 2: Open the database — recovery should run ---
    pager2 = Pager(db_path)  # recovery happens in __init__

    # Verify all 50 rows are present
    for k in range(50):
        cursor = Cursor.find(pager2, 0, k)
        row = Row.deserialize(bytes(cursor.value()))
        assert row.id == k, f"Missing row {k} after recovery"

    pager2.close()

    # --- Verify WAL is now empty (post-recovery + checkpoint) ---
    assert os.path.getsize(db_path + ".wal") == 0

    # Cleanup
    os.remove(db_path)
    os.remove(db_path + ".wal")
    print("Crash recovery test passed.")


test_crash_recovery()

Testing Uncommitted Data Rejection

def test_uncommitted_data_discarded():
    """Verify that uncommitted writes are lost after crash."""
    db_path = "test_uncommitted.db"
    for f in [db_path, db_path + ".wal"]:
        if os.path.exists(f):
            os.remove(f)

    # --- Session 1: Insert rows but never commit ---
    pager1 = Pager(db_path)
    root = pager1.get_page(0)
    initialize_leaf_node(root, is_root=True)
    pager1.num_pages = 1

    for k in range(10):
        row = Row(id=k, username=f"user_{k}")
        tree_insert(pager1, 0, k, row.serialize())

    # Flush pages to WAL but do NOT commit
    for page_num in list(pager1.pages.keys()):
        pager1.flush(page_num)
    # No commit marker in WAL

    os.close(pager1.fd)
    pager1.wal.close()

    # --- Session 2: Recovery should discard uncommitted WAL ---
    pager2 = Pager(db_path)

    # The root page should be empty (no committed data)
    root = pager2.get_page(0)
    # Depending on whether the database file was written,
    # the page might have data but the WAL was discarded.
    # The important thing is the WAL is now empty.
    assert os.path.getsize(db_path + ".wal") == 0

    pager2.close()
    os.remove(db_path)
    os.remove(db_path + ".wal")
    print("Uncommitted data rejection test passed.")


test_uncommitted_data_discarded()

The Recovery Guarantee

Crash scenarioWAL stateRecovery actionData outcome
Before any flushEmptyNoneNo data written
After flush, before commitPAGE_WRITE records, no COMMITDiscard WALData lost (correct: never committed)
After commit, before checkpointPAGE_WRITE + COMMIT recordsReplay to DB, truncate WALAll committed data restored
During checkpointCOMMIT present, DB may be partialReplay committed records (idempotent)Data restored
After checkpointEmpty WALNoneData already in DB

The WAL guarantees: if commit() returned successfully, the data will survive any subsequent crash. This is durability.

Performance Impact

The WAL adds overhead:

  • Disk writes doubled. Every page is written twice: once to the WAL, once to the database file.
  • fsync cost. Each commit requires at least one fsync (~2–10ms on SSD).

The trade-off is worthwhile. Without the WAL, a crash corrupts the B-Tree — potentially losing all data, not just the last transaction. With the WAL, the worst case is losing uncommitted changes from the last incomplete transaction.

MetricWithout WALWith WAL
Write throughputHigher~50% lower
Crash safetyNoneFull durability
Recovery timeN/A (data lost)Proportional to WAL size

With durability solved, the final chapter turns to the interface layer. Our engine accepts binary operations through Python function calls, but a real database speaks SQL. Chapter 7 builds a tokenizer and parser that translate text commands into the operations our engine already supports.