Skip to main content
python database internals building a persistent engine from scratch

Traversing the Tree

8 min read Chapter 15 of 21

Traversing the Tree

With splitting implemented, our B-Tree can grow beyond a single leaf. But the cursor from Chapter 4 only knows how to iterate within a single leaf node. To perform a full table scan — or to search for a key in a multi-level tree — the cursor must descend from the root through internal nodes to the correct leaf, and then advance across leaf boundaries.

This section updates the Cursor for multi-level traversal and benchmarks the result.

Descending to a Leaf

We already wrote tree_find in the previous section as a recursive function. Let’s refactor it into the Cursor class:

@dataclass
class Cursor:
    """Points to a specific cell within a leaf node of a B-Tree."""
    pager: Pager
    page_num: int       # always a leaf node page
    cell_num: int       # cell index within that leaf
    end_of_table: bool

    @classmethod
    def table_start(cls, pager: Pager, root_page_num: int) -> "Cursor":
        """Position the cursor at the first cell (smallest key) in the tree."""
        page_num = root_page_num
        page = pager.get_page(page_num)
        node_type = struct.unpack_from("B", page, 0)[0]

        # Descend leftmost path until we reach a leaf
        while node_type == INTERNAL_NODE_TYPE:
            # The leftmost child is cell 0's child pointer
            page_num = internal_node_child(page, 0)
            page = pager.get_page(page_num)
            node_type = struct.unpack_from("B", page, 0)[0]

        num_cells = leaf_node_num_cells(page)
        return cls(
            pager=pager,
            page_num=page_num,
            cell_num=0,
            end_of_table=(num_cells == 0),
        )

    @classmethod
    def find(cls, pager: Pager, root_page_num: int, key: int) -> "Cursor":
        """Position the cursor at the cell containing `key`, or where it would be."""
        page_num = root_page_num
        page = pager.get_page(page_num)
        node_type = struct.unpack_from("B", page, 0)[0]

        # Descend through internal nodes
        while node_type == INTERNAL_NODE_TYPE:
            page_num = internal_node_find_child(page, key)
            page = pager.get_page(page_num)
            node_type = struct.unpack_from("B", page, 0)[0]

        # Now at a leaf — binary search for the key
        num_cells = leaf_node_num_cells(page)
        index = leaf_node_find(page, key)
        return cls(
            pager=pager,
            page_num=page_num,
            cell_num=index,
            end_of_table=(index >= num_cells),
        )

Both table_start and find use iterative descent instead of recursion. This avoids stack depth issues for deep trees (though depth rarely exceeds 3-4 in practice).

Advancing Across Leaf Boundaries

The advance method from Chapter 4 stops at the end of a single leaf. For a multi-leaf tree, advancing past the last cell of one leaf should move to the first cell of the next leaf in key order.

This requires sibling pointers: each leaf node stores the page number of the next leaf. We add a 4-byte next_leaf field after the header:

Leaf Node Layout (updated):
┌─────────────────────────────────────────────────────────────┐
│ Node Type (1B) │ Is Root (1B) │ Num Cells (4B) │ Next Leaf (4B) │
├─────────────────────────────────────────────────────────────┤
│ Cell 0: [Key (4B)] [Value (36B)]                            │
│ ...                                                          │
└─────────────────────────────────────────────────────────────┘
# Updated constants
LEAF_NODE_NEXT_LEAF_OFFSET: Final[int] = LEAF_NODE_HEADER_SIZE  # 6
LEAF_NODE_NEXT_LEAF_SIZE: Final[int] = 4
LEAF_NODE_CELLS_OFFSET: Final[int] = (
    LEAF_NODE_HEADER_SIZE + LEAF_NODE_NEXT_LEAF_SIZE
)  # 10

# Recalculate max cells with the new offset
LEAF_NODE_MAX_CELLS: Final[int] = (
    (PAGE_SIZE - LEAF_NODE_CELLS_OFFSET) // LEAF_NODE_CELL_SIZE
)  # (4096 - 10) // 40 = 102 (still 102 — we lost 4 bytes to the pointer,
   # but 4086 // 40 = 102 with 6 bytes wasted instead of 10)

NO_SIBLING: Final[int] = 0  # Value meaning "no next leaf"


def leaf_node_next_leaf(page: memoryview) -> int:
    """Read the next-leaf page number. Returns 0 if there is no sibling."""
    return struct.unpack_from("<I", page, LEAF_NODE_NEXT_LEAF_OFFSET)[0]


def leaf_node_set_next_leaf(page: memoryview, next_page_num: int) -> None:
    """Write the next-leaf page number."""
    struct.pack_into("<I", page, LEAF_NODE_NEXT_LEAF_OFFSET, next_page_num)

Important: The cell offset functions must now use LEAF_NODE_CELLS_OFFSET (10) instead of the old LEAF_NODE_HEADER_SIZE (6). Update leaf_node_cell_offset:

def leaf_node_cell_offset(cell_num: int) -> int:
    return LEAF_NODE_CELLS_OFFSET + cell_num * LEAF_NODE_CELL_SIZE

And the initialize_leaf_node function sets next_leaf to 0:

def initialize_leaf_node(page: memoryview, is_root: bool = False) -> None:
    struct.pack_into(LEAF_NODE_HEADER_FORMAT, page, 0,
                     LEAF_NODE_TYPE, 1 if is_root else 0, 0)
    leaf_node_set_next_leaf(page, NO_SIBLING)
    page[LEAF_NODE_CELLS_OFFSET:] = b"\x00" * (PAGE_SIZE - LEAF_NODE_CELLS_OFFSET)

During leaf_node_split_and_insert, we must update the sibling chain:

# Inside leaf_node_split_and_insert, after creating the new right sibling:
leaf_node_set_next_leaf(new_page, leaf_node_next_leaf(old_page))
leaf_node_set_next_leaf(old_page, new_page_num)

This threads all leaves into a singly linked list in key order.

Updated Cursor.advance

def advance(self) -> None:
    """Move the cursor to the next cell, crossing leaf boundaries if needed."""
    self.cell_num += 1
    page = self.pager.get_page(self.page_num)

    if self.cell_num >= leaf_node_num_cells(page):
        # Past the end of this leaf — check for a sibling
        next_page_num = leaf_node_next_leaf(page)
        if next_page_num == NO_SIBLING:
            self.end_of_table = True
        else:
            self.page_num = next_page_num
            self.cell_num = 0
            # Check if the sibling is also empty (shouldn't be, but defensive)
            next_page = self.pager.get_page(next_page_num)
            if leaf_node_num_cells(next_page) == 0:
                self.end_of_table = True

With this change, a full table scan starting from Cursor.table_start will follow the leaf chain from the leftmost leaf to the rightmost, producing all rows in ascending key order.

Depth Tracking

For debugging and performance analysis, it helps to know the tree’s depth. A simple traversal from root to the leftmost leaf:

def tree_depth(pager: Pager, root_page_num: int) -> int:
    """Return the depth of the B-Tree (1 = single leaf, 2 = root + leaves, etc.)."""
    depth = 1
    page_num = root_page_num
    page = pager.get_page(page_num)
    node_type = struct.unpack_from("B", page, 0)[0]

    while node_type == INTERNAL_NODE_TYPE:
        page_num = internal_node_child(page, 0)  # follow leftmost child
        page = pager.get_page(page_num)
        node_type = struct.unpack_from("B", page, 0)[0]
        depth += 1

    return depth

Tree Visualization for Debugging

A print_tree function helps verify tree structure during development:

def print_tree(pager: Pager, page_num: int, indent: int = 0) -> None:
    """Recursively print the tree structure for debugging."""
    page = pager.get_page(page_num)
    node_type = struct.unpack_from("B", page, 0)[0]
    prefix = "  " * indent

    if node_type == LEAF_NODE_TYPE:
        num_cells = leaf_node_num_cells(page)
        print(f"{prefix}leaf (page {page_num}): {num_cells} cells", end="")
        if num_cells > 0:
            first_key = leaf_node_key(page, 0)
            last_key = leaf_node_key(page, num_cells - 1)
            print(f" keys [{first_key}..{last_key}]")
        else:
            print()
    else:
        num_keys = internal_node_num_keys(page)
        right_child = internal_node_right_child(page)
        print(f"{prefix}internal (page {page_num}): {num_keys} keys")
        for i in range(num_keys):
            child = internal_node_child(page, i)
            key = internal_node_key(page, i)
            print_tree(pager, child, indent + 1)
            print(f"{prefix}  key: {key}")
        print_tree(pager, right_child, indent + 1)

Example output for a tree with 300 rows:

internal (page 0): 2 keys
  leaf (page 1): 102 cells keys [0..101]
  key: 102
  leaf (page 2): 102 cells keys [102..203]
  key: 204
  leaf (page 3): 96 cells keys [204..299]

Benchmarking: B-Tree vs. Linear Scan

With the tree operational, we can compare search performance against the linear scan from Chapter 3:

import time

def benchmark_btree_search(num_rows: int) -> dict:
    """Benchmark B-Tree point lookups and full scan."""
    import os

    db_path = "bench_btree.db"
    if os.path.exists(db_path):
        os.remove(db_path)

    pager = Pager(db_path)
    root = pager.get_page(0)
    initialize_leaf_node(root, is_root=True)
    pager.num_pages = 1

    # Insert N rows
    for k in range(num_rows):
        row = Row(id=k, username=f"user_{k:06d}")
        tree_insert(pager, 0, k, row.serialize())

    depth = tree_depth(pager, 0)

    # Benchmark: single point lookup (worst case: last key)
    start = time.perf_counter_ns()
    for _ in range(1000):
        cursor = Cursor.find(pager, 0, num_rows - 1)
        _ = cursor.value()
    lookup_ns = (time.perf_counter_ns() - start) // 1000

    # Benchmark: full scan
    start = time.perf_counter_ns()
    cursor = Cursor.table_start(pager, 0)
    count = 0
    while not cursor.end_of_table:
        _ = cursor.value()
        cursor.advance()
        count += 1
    scan_ns = time.perf_counter_ns() - start

    pager.close()
    os.remove(db_path)

    return {
        "rows": num_rows,
        "depth": depth,
        "lookup_ns": lookup_ns,
        "scan_ns": scan_ns,
        "scan_count": count,
    }


def run_benchmarks():
    sizes = [100, 1_000, 10_000, 100_000]
    print(f"{'Rows':>10} {'Depth':>6} {'Lookup (ns)':>12} {'Scan (ns)':>12}")
    print("-" * 44)
    for n in sizes:
        r = benchmark_btree_search(n)
        print(f"{r['rows']:>10} {r['depth']:>6} {r['lookup_ns']:>12,} {r['scan_ns']:>12,}")


run_benchmarks()

Expected results:

RowsTree DepthLookup (ns)Full Scan (ns)
1001~500~10,000
1,0002~800~100,000
10,0002~900~1,000,000
100,0003~1,200~10,000,000

Key observation: Lookup time barely increases as rows grow from 100 to 100,000. That is $O(\log N)$ in action — specifically $O(d \cdot \log_2 b)$ where $d$ is the tree depth and $b$ is the branching factor. A depth-3 tree requires at most 3 page reads plus a binary search within the final leaf (7 comparisons). This is 3 pages touched, regardless of whether the table has 10,000 or 26 million rows.

Full scan time grows linearly because we visit every cell. There is no shortcut for “read everything.”

Complexity Summary

OperationLinear Scan (CH3)B-Tree (CH5)
Point lookup$O(N)$$O(\log N)$
Insert$O(1)$ amortized$O(\log N)$
Full scan$O(N)$$O(N)$
Space overheadNoneInternal nodes (~1% of data)

The insert cost increased from $O(1)$ to $O(\log N)$ because we must descend to the correct leaf, and occasionally split. This is an acceptable trade-off: the database spends most of its time reading, not writing.

What We Built

At the end of Chapter 5, we have a complete B-Tree implementation:

  • Leaf nodes store sorted key-value pairs (102 per node).
  • Internal nodes route searches via binary search over keys (510 keys, 511 children per node).
  • Splitting handles overflow by dividing nodes and promoting keys.
  • The cursor descends from root to leaf and follows sibling pointers for scans.
  • Tree depth grows logarithmically: depth 3 handles 26+ million rows.

What we have not addressed is durability. If the program crashes during a split — after writing the new sibling but before updating the parent — the tree is corrupted. Pages written to disk may be partial. The Pager calls os.write but never os.fsync, so the OS may reorder writes. Chapter 6 introduces the Write-Ahead Log (WAL), the mechanism that makes our storage engine crash-safe.