Query Parsing and Interface
Query Parsing and Interface
Until now, every operation has been a direct Python function call: tree_insert(pager, 0, key, value) or Cursor.find(pager, 0, key). This works for testing but fails the fundamental contract of a database: accepting text-based queries and executing them against the storage engine.
This chapter builds a minimal SQL front-end. We implement three components:
- Tokenizer (lexer) — breaks raw input strings into tokens.
- Parser — organizes tokens into a structured representation (an AST).
- Compiler — maps the AST to Virtual Machine operations.
Together, they form a pipeline:
"INSERT 1 alice" → [INSERT, 1, alice] → InsertStatement(1, "alice") → execute_insert
"SELECT" → [SELECT] → SelectStatement() → execute_select
The Language We Support
Our SQL subset supports exactly two statements:
INSERT <id> <username>
SELECT
No WHERE, no CREATE TABLE, no JOIN. The table schema is fixed (id integer, username text). This simplification lets us focus on the parsing machinery without drowning in SQL grammar complexity.
We also support meta-commands prefixed with a dot:
.exit — close the database and exit
.btree — print the B-Tree structure
.constants — print internal constants
The Pipeline Architecture
User input (string)
│
▼
┌─────────────┐
│ Tokenizer │ Breaks text into tokens: keywords, literals, identifiers
└──────┬──────┘
│ list[Token]
▼
┌─────────────┐
│ Parser │ Validates grammar, produces Statement objects
└──────┬──────┘
│ Statement (AST)
▼
┌─────────────┐
│ Compiler │ Maps Statement → VirtualMachine method call
└──────┬──────┘
│ ExecuteResult
▼
Output to user
Each component has a single responsibility, making them independently testable. The Tokenizer knows nothing about SQL grammar. The Parser knows nothing about page layouts. The Compiler knows nothing about text processing.
Token Types
A token is a categorized piece of text. Our language needs these types:
from enum import Enum, auto
class TokenType(Enum):
# Keywords
KEYWORD_INSERT = auto()
KEYWORD_SELECT = auto()
# Literals
INTEGER = auto() # e.g., 42
STRING = auto() # e.g., alice (unquoted identifier treated as string)
# Meta
META_COMMAND = auto() # e.g., .exit, .btree
# Structural
EOF = auto() # end of input
Why no STRING_LITERAL with quotes? For simplicity. Our INSERT syntax is INSERT <id> <username>, where <username> is a bare word. Production SQL uses quoted strings ('alice'), but that requires escape handling. We take the pragmatic path.
Statement Types
The Parser produces one of these:
from dataclasses import dataclass
@dataclass
class InsertStatement:
id: int
username: str
@dataclass
class SelectStatement:
pass # No fields — selects all rows
@dataclass
class MetaCommand:
command: str # ".exit", ".btree", ".constants"
These are the Abstract Syntax Tree (AST) nodes. In a full SQL parser, the AST would be a deep tree with joins, subqueries, and expressions. Ours has three flat types.
Error Handling Strategy
Parsing can fail at two levels:
- Tokenization errors — unrecognized characters.
- Parse errors — valid tokens in an invalid sequence (e.g.,
INSERT alicewith no id).
We use exceptions for both:
class TokenError(Exception):
"""Raised when the tokenizer encounters invalid input."""
pass
class ParseError(Exception):
"""Raised when the parser encounters an invalid token sequence."""
pass
The REPL catches these and prints error messages without crashing.
The next section implements the Tokenizer that breaks input strings into the token types defined above.