Query Parsing and Interface

Until now, every operation has been a direct Python function call: tree_insert(pager, 0, key, value) or Cursor.find(pager, 0, key). This works for testing but fails the fundamental contract of a database: accepting text-based queries and executing them against the storage engine.

This chapter builds a minimal SQL front-end. We implement three components:

Tokenizer (lexer) — breaks raw input strings into tokens.
Parser — organizes tokens into a structured representation (an AST).
Compiler — maps the AST to Virtual Machine operations.

Together, they form a pipeline:

"INSERT 1 alice"  →  [INSERT, 1, alice]  →  InsertStatement(1, "alice")  →  execute_insert
"SELECT"          →  [SELECT]            →  SelectStatement()            →  execute_select

The Language We Support

Our SQL subset supports exactly two statements:

INSERT <id> <username>
SELECT

No WHERE, no CREATE TABLE, no JOIN. The table schema is fixed (id integer, username text). This simplification lets us focus on the parsing machinery without drowning in SQL grammar complexity.

We also support meta-commands prefixed with a dot:

.exit       — close the database and exit
.btree      — print the B-Tree structure
.constants  — print internal constants

The Pipeline Architecture

User input (string)
       │
       ▼
┌─────────────┐
│  Tokenizer  │  Breaks text into tokens: keywords, literals, identifiers
└──────┬──────┘
       │  list[Token]
       ▼
┌─────────────┐
│   Parser    │  Validates grammar, produces Statement objects
└──────┬──────┘
       │  Statement (AST)
       ▼
┌─────────────┐
│  Compiler   │  Maps Statement → VirtualMachine method call
└──────┬──────┘
       │  ExecuteResult
       ▼
   Output to user

Each component has a single responsibility, making them independently testable. The Tokenizer knows nothing about SQL grammar. The Parser knows nothing about page layouts. The Compiler knows nothing about text processing.

Token Types

A token is a categorized piece of text. Our language needs these types:

from enum import Enum, auto


class TokenType(Enum):
    # Keywords
    KEYWORD_INSERT = auto()
    KEYWORD_SELECT = auto()

    # Literals
    INTEGER = auto()       # e.g., 42
    STRING = auto()        # e.g., alice (unquoted identifier treated as string)

    # Meta
    META_COMMAND = auto()  # e.g., .exit, .btree

    # Structural
    EOF = auto()           # end of input

Why no STRING_LITERAL with quotes? For simplicity. Our INSERT syntax is INSERT <id> <username>, where <username> is a bare word. Production SQL uses quoted strings ('alice'), but that requires escape handling. We take the pragmatic path.

Statement Types

The Parser produces one of these:

from dataclasses import dataclass


@dataclass
class InsertStatement:
    id: int
    username: str


@dataclass
class SelectStatement:
    pass  # No fields — selects all rows


@dataclass
class MetaCommand:
    command: str  # ".exit", ".btree", ".constants"

These are the Abstract Syntax Tree (AST) nodes. In a full SQL parser, the AST would be a deep tree with joins, subqueries, and expressions. Ours has three flat types.

Error Handling Strategy

Parsing can fail at two levels:

Tokenization errors — unrecognized characters.
Parse errors — valid tokens in an invalid sequence (e.g., INSERT alice with no id).

We use exceptions for both:

class TokenError(Exception):
    """Raised when the tokenizer encounters invalid input."""
    pass


class ParseError(Exception):
    """Raised when the parser encounters an invalid token sequence."""
    pass

The REPL catches these and prints error messages without crashing.

The next section implements the Tokenizer that breaks input strings into the token types defined above.