The Architecture of Persistence
SummaryThis section introduces the high-level architecture of a...
This section introduces the high-level architecture of a...
This section introduces the high-level architecture of a database management system, organized as a layered structure: REPL for user interaction, Parser for SQL interpretation, Virtual Machine for bytecode execution, B-Tree for indexing, Pager for page-based storage management, and OS Interface for system abstraction. It emphasizes modularity and data flow between layers. The project skeleton is set up using Python 3.12+ with type hints, defining a Constants module with PAGE_SIZE=4096 bytes, HEADER_LIMIT=100, and PAGE_HEADER_FORMAT='<I I' for binary headers. A stub Pager class uses struct and memoryview for low-level binary manipulation. Key design trade-offs are discussed, comparing page sizes (4KB, 8KB, 2KB) in terms of I/O performance, memory usage, and overhead. This foundation enables further development with a focus on efficiency and scalability.
The Architecture of Persistence
Every database, from SQLite to PostgreSQL, solves the same fundamental problem: how do you take transient data living in RAM and make it survive a power failure? The answer is deceptively complex. You need to manage binary data formats, shuttle fixed-size pages between memory and disk, organize keys for fast lookup, and guarantee that a crash mid-write won’t corrupt everything.
We are going to build all of this from scratch in Python. No ORMs, no third-party storage libraries — just struct, memoryview, and raw file I/O. By the end, you will have a working database engine that parses SQL, searches a B-Tree, writes ahead of commits, and recovers from crashes.
This chapter establishes the architecture we will follow and sets up the project skeleton.
The Layered Architecture
Our database is organized as a stack of six layers. Each layer has a single responsibility and communicates only with its immediate neighbors:
┌─────────────────────────┐
│ REPL (User Interface) │ Reads input, prints results
├─────────────────────────┤
│ Parser / Tokenizer │ SQL text → parse tree
├─────────────────────────┤
│ Virtual Machine (VM) │ Executes operations on the table
├─────────────────────────┤
│ B-Tree │ Sorted key-value index structure
├─────────────────────────┤
│ Pager │ Fixed-size page I/O with caching
├─────────────────────────┤
│ OS Interface │ open(), read(), write(), fsync()
└─────────────────────────┘
Why layers? Isolation. When we replace linear scan with B-Tree lookup in Chapter 4, the VM does not change. When we add write-ahead logging in Chapter 6, the B-Tree does not change. Each layer can evolve independently because it exposes a narrow interface to the layer above.
The data flow works like this: the user types INSERT 1 alice into the REPL. The Parser tokenizes it and produces a statement object. The VM calls execute_insert, which positions a Cursor on the B-Tree. The B-Tree asks the Pager for the correct page, which reads it from disk through the OS Interface if it is not already cached.
Setting Up the Project Skeleton
Create the following directory structure:
pydb/
├── constants.py # Page sizes, format strings, limits
├── row.py # Binary row serialization
├── pager.py # Disk I/O and page caching
├── table.py # Row-to-page mapping
├── cursor.py # Iteration abstraction
├── btree.py # B-Tree node operations
├── vm.py # Virtual Machine (execution engine)
├── parser.py # SQL tokenizer and parser
├── wal.py # Write-Ahead Log
├── repl.py # Read-Eval-Print Loop
└── main.py # Entry point
We use Python 3.12+ with type hints throughout. Every binary operation uses struct for packing and memoryview for zero-copy access — no pickling, no JSON, no text-mode file I/O.
Start with the constants module. Every other module imports from here:
# constants.py
import struct
from typing import Final
PAGE_SIZE: Final[int] = 4096 # 4 KB — matches common OS page size
MAX_PAGES: Final[int] = 4096 # Upper bound: 4096 pages × 4 KB = 16 MB
PAGE_HEADER_FORMAT: Final[str] = "<I I" # Little-endian: page_number (4B), data_length (4B)
PAGE_HEADER_SIZE: Final[int] = struct.calcsize(PAGE_HEADER_FORMAT) # 8 bytes
Two design choices deserve explanation:
-
PAGE_SIZE = 4096. This matches the virtual memory page size on most Linux and macOS systems. When our pages align with OS pages,
read()andwrite()syscalls transfer data without extra copying in the kernel buffer cache. SQLite defaults to 4096 for the same reason. -
Little-endian byte order (
<). x86 and ARM (in its default mode) are little-endian. Using native byte order avoids conversion overhead. We make this explicit in every format string rather than relying on=(native) so the database file is portable across machines with the same endianness.
Page Size Trade-Offs
The 4 KB choice is not universal. Here is how different page sizes affect performance:
| Page Size | I/O Efficiency | Memory Overhead | Best For |
|---|---|---|---|
| 2 KB | More syscalls per scan | Lower per-page RAM | Embedded devices with < 1 MB RAM |
| 4 KB | One syscall = one OS page | Moderate | General-purpose (our choice) |
| 8 KB | Fewer syscalls for sequential reads | Higher per-page RAM | Analytics workloads with large rows |
| 64 KB | Excellent throughput for bulk scans | Very high; wastes space for small rows | Column-store engines |
We lock in 4 KB now. Every page we write — whether it holds raw rows, B-Tree nodes, or WAL records — will be exactly PAGE_SIZE bytes. This invariant simplifies the Pager enormously: it never needs to handle variable-length I/O.
With the project structure in place and constants defined, we are ready to confront the first hard question: why can’t we just let the operating system manage our data files?