Defining the Row Format

Before we can store anything on disk, we need to decide exactly how a single database row looks in binary. This is not an abstract exercise — the byte layout you choose here ripples through every layer above. The Pager moves pages, the Table maps row indices to byte offsets, and the B-Tree stores serialized rows as values. If the row format is sloppy, everything built on top inherits the sloppiness.

Our table has two columns: an integer id and a fixed-length string username. We serialize each row into a flat, fixed-size binary buffer with no delimiters and no length prefixes — just raw bytes at known offsets.

The Binary Layout

Every row occupies exactly 36 bytes:

Byte offset:  0          4                                     36
              ┌──────────┬─────────────────────────────────────┐
              │ id (4B)  │ username (32B)                      │
              │ <I       │ UTF-8, null-padded                  │
              └──────────┴─────────────────────────────────────┘

Field	Format	Size (bytes)	Offset	Description
`id`	`<I` (little-endian unsigned 32-bit int)	4	0	Unique row identifier, 0 to 2³²−1
`username`	UTF-8, right-padded with `\x00`	32	4	Fixed-width string field
Total	—	36	—	`ROW_SIZE` constant used everywhere
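
As a cross-check on the table, the same layout can be written as a single struct format string. This is only a sketch: the combined format "<I32s" and the name ROW_STRUCT are illustrative assumptions, not part of the Row class defined later.

```python
import struct

# Assumed combined format for illustration:
#   "<"   little-endian, standard sizes, no alignment padding
#   "I"   unsigned 32-bit integer (4 bytes)
#   "32s" 32-byte string field, NUL-padded by struct when packing
ROW_STRUCT = struct.Struct("<I32s")

blob = ROW_STRUCT.pack(42, "alice".encode("utf-8"))

print(ROW_STRUCT.size)   # 36
print(blob[:4].hex())    # 2a000000  (42 stored least-significant byte first)
print(blob[4:12])        # b'alice\x00\x00\x00'
```

One caveat about this shortcut: the "32s" code silently truncates input longer than 32 bytes, so an explicit length check is still needed before packing.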

Why fixed-size? Because fixed offsets let us jump directly to any field by slicing a memoryview: no scanning, no parsing, and no copying of the underlying bytes. Given a buffer buf that contains a serialized row, buf[0:4] is always the id and buf[4:36] is always the username. C structs work on the same principle, and it is why fixed-width fields are a common choice when a storage engine needs predictable, constant-time access to any row.
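
To make the offset arithmetic concrete, here is a minimal sketch that reads fields out of a buffer holding several rows back to back. The helpers pack_row, read_id, and read_username are hypothetical names used only for this illustration.

```python
import struct

ID_SIZE = 4
USERNAME_SIZE = 32
ROW_SIZE = ID_SIZE + USERNAME_SIZE  # 36

def pack_row(id_val: int, name: str) -> bytes:
    """Stand-in packer for this sketch: 4-byte id, then NUL-padded name."""
    return struct.pack("<I", id_val) + name.encode("utf-8").ljust(USERNAME_SIZE, b"\x00")

# Three rows laid out contiguously, the way they might sit inside a page.
buf = memoryview(pack_row(1, "alice") + pack_row(2, "bob") + pack_row(3, "carol"))

def read_id(buf: memoryview, row_index: int) -> int:
    """Row i starts at i * ROW_SIZE; the id is its first 4 bytes."""
    return struct.unpack_from("<I", buf, row_index * ROW_SIZE)[0]

def read_username(buf: memoryview, row_index: int) -> str:
    """The username occupies the 32 bytes after the id."""
    start = row_index * ROW_SIZE + ID_SIZE
    raw = bytes(buf[start : start + USERNAME_SIZE])  # copies 32 bytes only
    return raw.rstrip(b"\x00").decode("utf-8")

print(read_id(buf, 2))        # 3
print(read_username(buf, 1))  # bob
```

Neither helper looks at any row other than the one requested, which is the practical payoff of a fixed-size layout.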

The Row Class

# row.py
import struct
from typing import ClassVar

class Row:
    """Binary-serializable database row with fixed-size fields."""

    ID_SIZE: ClassVar[int] = 4
    USERNAME_SIZE: ClassVar[int] = 32
    ID_OFFSET: ClassVar[int] = 0
    USERNAME_OFFSET: ClassVar[int] = ID_SIZE           # 4
    ROW_SIZE: ClassVar[int] = ID_SIZE + USERNAME_SIZE  # 36

    ID_FORMAT: ClassVar[str] = "<I"  # Little-endian unsigned 32-bit integer

    def __init__(self, id_val: int, username: str) -> None:
        self.id_val: int = id_val
        self.username: str = username

    def serialize(self) -> bytes:
        """Pack this row into a fixed-size bytes object."""
        encoded = self.username.encode("utf-8")
        if len(encoded) > self.USERNAME_SIZE:
            raise ValueError(
                f"Username exceeds {self.USERNAME_SIZE} bytes: "
                f"got {len(encoded)}"
            )
        packed_id = struct.pack(self.ID_FORMAT, self.id_val)
        padded_name = encoded.ljust(self.USERNAME_SIZE, b"\x00")
        return packed_id + padded_name

    @classmethod
    def deserialize(cls, buf: memoryview) -> "Row":
        """Unpack a row from a memoryview; only the username bytes are copied."""
        id_val: int = struct.unpack(
            cls.ID_FORMAT,
            buf[cls.ID_OFFSET : cls.ID_OFFSET + cls.ID_SIZE],
        )[0]
        raw_name: bytes = bytes(
            buf[cls.USERNAME_OFFSET : cls.USERNAME_OFFSET + cls.USERNAME_SIZE]
        )
        username: str = raw_name.rstrip(b"\x00").decode("utf-8")
        return cls(id_val, username)

    def __repr__(self) -> str:
        return f"Row(id={self.id_val}, username={self.username!r})"

Three design decisions worth noting:

serialize returns bytes, deserialize accepts memoryview. Serialization allocates a new buffer because we are building a value to hand off. Deserialization takes a view into an existing page buffer: slicing the view copies nothing, and the only copy made is the 32 username bytes we materialize in order to decode them.
Validation lives in serialize. If someone passes a username longer than 32 bytes (after UTF-8 encoding), we raise ValueError immediately, before any bytes are produced. Without that check, an oversized row would spill into the next slot and silently corrupt its neighbor. An id outside 0 to 2³²−1 is rejected as well, because struct.pack raises struct.error for values that do not fit the <I format.
ClassVar sizes and offsets are computed once, at class definition time. ROW_SIZE, ID_OFFSET, and USERNAME_OFFSET are class-level constants, so there is no per-instance overhead, and other modules (Table, B-Tree) can read them as Row.ROW_SIZE and so on after importing Row.
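
The validation point is easy to underestimate, because the limit is 32 bytes, not 32 characters. The sketch below uses a hypothetical standalone helper, encode_username, that mirrors the length check in Row.serialize, and also shows how struct rejects an id that does not fit in 32 bits.

```python
import struct

USERNAME_SIZE = 32  # bytes, matching the username field width

def encode_username(username: str) -> bytes:
    """Hypothetical helper mirroring the length check in Row.serialize."""
    encoded = username.encode("utf-8")
    if len(encoded) > USERNAME_SIZE:
        raise ValueError(
            f"Username exceeds {USERNAME_SIZE} bytes: got {len(encoded)}"
        )
    return encoded.ljust(USERNAME_SIZE, b"\x00")

accented = "\u00e9" * 20  # 20 characters, but each takes 2 bytes in UTF-8
print(len(accented), len(accented.encode("utf-8")))  # 20 40

try:
    encode_username(accented)
except ValueError as exc:
    print("rejected:", exc)

try:
    struct.pack("<I", 2**32)  # one past the largest unsigned 32-bit value
except struct.error as exc:
    print("rejected:", exc)
```

A check on len(username) alone would have accepted the accented string and let 40 bytes through.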

Verifying the Format

A quick round-trip test confirms the layout:

# Verify serialization round-trip
original = Row(42, "alice")
blob: bytes = original.serialize()
assert len(blob) == Row.ROW_SIZE  # 36

restored = Row.deserialize(memoryview(blob))
assert restored.id_val == 42
assert restored.username == "alice"

The serialized blob is just 36 bytes of raw data: no headers, no type tags, and no alignment padding. The only filler is the NUL bytes that pad username out to 32 bytes. This is exactly the payload that will be stored inside pages by the Pager and inside B-Tree cells by the index layer.
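
To preview how these payloads might sit inside a page, here is a rough sketch that writes rows into slots of a preallocated buffer. The 4096-byte PAGE_SIZE, the slot arithmetic, and the write_row helper are all assumptions made for illustration; the real Pager may choose differently.

```python
import struct

PAGE_SIZE = 4096                       # assumed page size for this sketch
ROW_SIZE = 36                          # 4-byte id + 32-byte username
ROWS_PER_PAGE = PAGE_SIZE // ROW_SIZE  # 113 rows, 28 bytes left unused

page = bytearray(PAGE_SIZE)

def write_row(page: bytearray, slot: int, id_val: int, username: str) -> None:
    """Hypothetical helper: pack one row in place at its slot offset."""
    if not 0 <= slot < ROWS_PER_PAGE:
        raise IndexError(f"slot {slot} out of range")
    encoded = username.encode("utf-8")
    if len(encoded) > 32:
        raise ValueError("username too long")
    # pack_into writes directly into the page; "32s" NUL-pads short names.
    struct.pack_into("<I32s", page, slot * ROW_SIZE, id_val, encoded)

write_row(page, 0, 42, "alice")
write_row(page, ROWS_PER_PAGE - 1, 7, "bob")

print(ROWS_PER_PAGE)     # 113
print(page[0:4].hex())   # 2a000000
```

Because every row is the same size, a slot's byte offset is a single multiplication and no per-page directory of row positions is needed.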

With the row format locked down, we have the atomic unit of data for our database. The next step is building the Pager — the component that reads and writes pages full of these rows to and from disk.