Skip to main content
python database internals building a persistent engine from scratch

Defining the Row Format

4 min read Chapter 3 of 21
Summary

This section defines the binary serialization format for...

This section defines the binary serialization format for a database row, essential for high-performance database systems. Key concepts include binary serialization using Python's struct module for packing integers and fixed-length strings, and memoryview for efficient memory access. The Row class is implemented with serialize() and deserialize() methods, calculating offset constants such as ID_OFFSET=0 and USERNAME_OFFSET=4 for direct field access. Terminology introduced includes: Row (a database record), Offset (byte position within binary data), Fixed-length String (padded to a set size), and Little-endian (byte order for integer packing). The binary layout consists of a 4-byte ID field and a 32-byte username field, totaling 36 bytes per row. The design emphasizes low-level logic with struct and memoryview, avoiding high-level abstractions to ensure performance and clarity. No external sources are cited, focusing on internal implementation details.

Defining the Row Format

Before we can store anything on disk, we need to decide exactly how a single database row looks in binary. This is not an abstract exercise — the byte layout you choose here ripples through every layer above. The Pager moves pages, the Table maps row indices to byte offsets, and the B-Tree stores serialized rows as values. If the row format is sloppy, everything built on top inherits the sloppiness.

Our table has two columns: an integer id and a fixed-length string username. We serialize each row into a flat, fixed-size binary buffer with no delimiters and no length prefixes — just raw bytes at known offsets.

The Binary Layout

Every row occupies exactly 36 bytes:

Byte offset:  0         4                                    36
              ┌─────────┬─────────────────────────────────────┐
              │  id (4B) │        username (32B)                │
              │  <I      │  UTF-8, null-padded                 │
              └─────────┴─────────────────────────────────────┘
FieldFormatSize (bytes)OffsetDescription
id<I (little-endian unsigned 32-bit int)40Unique row identifier, 0 to 2³²−1
usernameUTF-8, right-padded with \x00324Fixed-width string field
Total36ROW_SIZE constant used everywhere

Why fixed-size? Because fixed offsets mean we can jump directly to any field using memoryview slicing. No scanning, no parsing, no allocations. Given a buffer buf that contains a serialized row, buf[0:4] is always the id and buf[4:36] is always the username. This is the same principle that C structs use, and it is why databases prefer fixed-width storage for performance-critical paths.

The Row Class

# row.py
import struct
from typing import ClassVar

class Row:
    """Binary-serializable database row with fixed-size fields."""

    ID_SIZE: ClassVar[int] = 4
    USERNAME_SIZE: ClassVar[int] = 32
    ID_OFFSET: ClassVar[int] = 0
    USERNAME_OFFSET: ClassVar[int] = ID_SIZE           # 4
    ROW_SIZE: ClassVar[int] = ID_SIZE + USERNAME_SIZE  # 36

    ID_FORMAT: ClassVar[str] = "<I"  # Little-endian unsigned 32-bit integer

    def __init__(self, id_val: int, username: str) -> None:
        self.id_val: int = id_val
        self.username: str = username

    def serialize(self) -> bytes:
        """Pack this row into a fixed-size bytes object."""
        encoded = self.username.encode("utf-8")
        if len(encoded) > self.USERNAME_SIZE:
            raise ValueError(
                f"Username exceeds {self.USERNAME_SIZE} bytes: "
                f"got {len(encoded)}"
            )
        packed_id = struct.pack(self.ID_FORMAT, self.id_val)
        padded_name = encoded.ljust(self.USERNAME_SIZE, b"\x00")
        return packed_id + padded_name

    @classmethod
    def deserialize(cls, buf: memoryview) -> "Row":
        """Unpack a row from a memoryview without copying the underlying buffer."""
        id_val: int = struct.unpack(
            cls.ID_FORMAT,
            buf[cls.ID_OFFSET : cls.ID_OFFSET + cls.ID_SIZE],
        )[0]
        raw_name: bytes = bytes(
            buf[cls.USERNAME_OFFSET : cls.USERNAME_OFFSET + cls.USERNAME_SIZE]
        )
        username: str = raw_name.rstrip(b"\x00").decode("utf-8")
        return cls(id_val, username)

    def __repr__(self) -> str:
        return f"Row(id={self.id_val}, username={self.username!r})"

Three design decisions worth noting:

  1. serialize returns bytes, deserialize accepts memoryview. Serialization allocates a new buffer because we are building a value to hand off. Deserialization takes a view into an existing page buffer — we read fields in place with zero-copy slicing, then decode only the username bytes we need.

  2. Validation lives in serialize. If someone passes a username longer than 32 bytes (after UTF-8 encoding), we raise immediately. A corrupt row that silently overflows into the next slot would cause cascading data corruption.

  3. ClassVar offsets are computed at class definition time. ROW_SIZE, ID_OFFSET, and USERNAME_OFFSET are constants — no per-instance overhead, and other modules (Table, B-Tree) can import them directly.

Verifying the Format

A quick round-trip test confirms the layout:

# Verify serialization round-trip
original = Row(42, "alice")
blob: bytes = original.serialize()
assert len(blob) == Row.ROW_SIZE  # 36

restored = Row.deserialize(memoryview(blob))
assert restored.id_val == 42
assert restored.username == "alice"

The serialized blob is just 36 bytes of raw data — no headers, no type tags, no alignment padding beyond what we explicitly defined. This is exactly the payload that will be stored inside pages by the Pager and inside B-Tree cells by the index layer.

With the row format locked down, we have the atomic unit of data for our database. The next step is building the Pager — the component that reads and writes pages full of these rows to and from disk.