The Architecture of Persistence

Every database, from SQLite to PostgreSQL, solves the same fundamental problem: how do you take transient data living in RAM and make it survive a power failure? The answer is deceptively complex. You need to manage binary data formats, shuttle fixed-size pages between memory and disk, organize keys for fast lookup, and guarantee that a crash mid-write won’t corrupt everything.

We are going to build all of this from scratch in Python. No ORMs, no third-party storage libraries — just struct, memoryview, and raw file I/O. By the end, you will have a working database engine that parses SQL, searches a B-Tree, writes ahead of commits, and recovers from crashes.

This chapter establishes the architecture we will follow and sets up the project skeleton.

The Layered Architecture

Our database is organized as a stack of six layers. Each layer has a single responsibility and communicates only with its immediate neighbors:

┌─────────────────────────┐
│   REPL (User Interface)  │  Reads input, prints results
├─────────────────────────┤
│   Parser / Tokenizer     │  SQL text → parse tree
├─────────────────────────┤
│   Virtual Machine (VM)   │  Executes operations on the table
├─────────────────────────┤
│   B-Tree                 │  Sorted key-value index structure
├─────────────────────────┤
│   Pager                  │  Fixed-size page I/O with caching
├─────────────────────────┤
│   OS Interface           │  open(), read(), write(), fsync()
└─────────────────────────┘

Why layers? Isolation. When we replace linear scan with B-Tree lookup in Chapter 4, the VM does not change. When we add write-ahead logging in Chapter 6, the B-Tree does not change. Each layer can evolve independently because it exposes a narrow interface to the layer above.

The data flow works like this: the user types INSERT 1 alice into the REPL. The Parser tokenizes it and produces a statement object. The VM calls execute_insert, which positions a Cursor on the B-Tree. The B-Tree asks the Pager for the correct page, which reads it from disk through the OS Interface if it is not already cached.

Setting Up the Project Skeleton

Create the following directory structure:

pydb/
├── constants.py      # Page sizes, format strings, limits
├── row.py            # Binary row serialization
├── pager.py          # Disk I/O and page caching
├── table.py          # Row-to-page mapping
├── cursor.py         # Iteration abstraction
├── btree.py          # B-Tree node operations
├── vm.py             # Virtual Machine (execution engine)
├── parser.py         # SQL tokenizer and parser
├── wal.py            # Write-Ahead Log
├── repl.py           # Read-Eval-Print Loop
└── main.py           # Entry point

We use Python 3.12+ with type hints throughout. Every binary operation uses struct for packing and memoryview for zero-copy access — no pickling, no JSON, no text-mode file I/O.

Start with the constants module. Every other module imports from here:

# constants.py
import struct
from typing import Final

PAGE_SIZE: Final[int] = 4096             # 4 KB — matches common OS page size
MAX_PAGES: Final[int] = 4096            # Upper bound: 4096 pages × 4 KB = 16 MB
PAGE_HEADER_FORMAT: Final[str] = "<I I"  # Little-endian: page_number (4B), data_length (4B)
PAGE_HEADER_SIZE: Final[int] = struct.calcsize(PAGE_HEADER_FORMAT)  # 8 bytes

Two design choices deserve explanation:

PAGE_SIZE = 4096. This matches the virtual memory page size on most Linux and macOS systems. When our pages align with OS pages, read() and write() syscalls transfer data without extra copying in the kernel buffer cache. SQLite defaults to 4096 for the same reason.
Little-endian byte order (<). x86 and ARM (in its default mode) are little-endian. Using native byte order avoids conversion overhead. We make this explicit in every format string rather than relying on = (native) so the database file is portable across machines with the same endianness.

Page Size Trade-Offs

The 4 KB choice is not universal. Here is how different page sizes affect performance:

Page Size	I/O Efficiency	Memory Overhead	Best For
2 KB	More syscalls per scan	Lower per-page RAM	Embedded devices with < 1 MB RAM
4 KB	One syscall = one OS page	Moderate	General-purpose (our choice)
8 KB	Fewer syscalls for sequential reads	Higher per-page RAM	Analytics workloads with large rows
64 KB	Excellent throughput for bulk scans	Very high; wastes space for small rows	Column-store engines

We lock in 4 KB now. Every page we write — whether it holds raw rows, B-Tree nodes, or WAL records — will be exactly PAGE_SIZE bytes. This invariant simplifies the Pager enormously: it never needs to handle variable-length I/O.

With the project structure in place and constants defined, we are ready to confront the first hard question: why can’t we just let the operating system manage our data files?