Building a Character-Level Tokenizer for MicroGPT

Chapter 3: The Tokenizer - Text to Numbers and Back

Author Gary Jackson details the implementation of a character-level tokenizer for the MicroGPT architecture. The system converts individual characters into integer IDs and introduces a synthetic Beginning of Sequence token to delimit training sequences.

Why This Matters

Production-grade large language models use subword tokenizers such as Byte Pair Encoding (BPE), with vocabularies of roughly 100,000 tokens. Character-level tokenization is a simpler starting point for understanding how text becomes discrete symbols. For a dataset of lowercase names, the vocabulary shrinks to 27 symbols (26 letters plus the BOS token), which keeps the model small while still showing how it learns sequence boundaries.

Key Insights

Production LLMs use subword tokenizers like BPE to merge frequent sequences into single tokens, resulting in ~100K token vocabularies.
MicroGPT implements the simplest possible tokenizer: one token per unique character. For lowercase names this gives a vocabulary size of 27 (26 letters plus the BOS token).
The Beginning of Sequence (BOS) token acts as a synthetic delimiter to signal the start and end of a document, such as [BOS, e, m, m, a, BOS].
Token IDs are arbitrary integer assignments: values like 0 or 25 are conventions, not mathematical requirements. Any one-to-one mapping works as long as encoding and decoding use the same one.
The Decode function renders synthetic tokens in readable form, for example showing the BOS token as a dot.
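
The point about arbitrary IDs can be checked directly. The following is a minimal sketch, not the author's code: the three-name corpus and the RoundTrip helper are hypothetical, introduced only to show that two different ID assignments both encode and decode the same text correctly.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample corpus; the real dataset is not shown in this summary.
var docs = new List<string> { "emma", "ava", "liam" };

// Assignment 1: ascending character order, as the chapter's tokenizer does.
List<char> sorted = string.Join("", docs).Distinct().OrderBy(c => c).ToList();

// Assignment 2: descending order, an equally valid if unconventional choice.
List<char> reversed = string.Join("", docs).Distinct().OrderByDescending(c => c).ToList();

// Hypothetical helper: encode each character to its index, then decode back.
string RoundTrip(List<char> vocab, string text)
{
    List<int> ids = text.Select(ch => vocab.IndexOf(ch)).ToList();
    return new string(ids.Select(id => vocab[id]).ToArray());
}

// The IDs differ between the two assignments...
Console.WriteLine(string.Join(", ", "emma".Select(ch => sorted.IndexOf(ch))));
Console.WriteLine(string.Join(", ", "emma".Select(ch => reversed.IndexOf(ch))));

// ...but both recover the original text.
Console.WriteLine(RoundTrip(sorted, "emma") == RoundTrip(reversed, "emma"));
```

With this corpus the two assignments print 1, 4, 4, 0 and 4, 1, 1, 5 for "emma", and the final comparison prints True. What matters is consistency between encoding and decoding, not the particular numbers.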

Working Examples

The core Tokenizer class implementation in C#.

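    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    namespace MicroGPT;

    public class Tokenizer
    {
        // Every distinct character in the corpus, sorted; a character's index is its token ID.
        private readonly List<char> _allChars;

        // ID of the synthetic BOS token: one past the last character ID.
        public int Bos { get; }

        // Number of distinct characters plus one for BOS.
        public int VocabSize { get; }

        public Tokenizer(List<string> docs)
        {
            _allChars = [.. string.Join("", docs).Distinct().OrderBy(c => c)];
            Bos = _allChars.Count;
            VocabSize = _allChars.Count + 1;
        }

        // Maps a character to its token ID.
        public int Encode(char c) => _allChars.IndexOf(c);

        // Maps a token ID back to its character; BOS is rendered as '.'.
        public char Decode(int i) => i == Bos ? '.' : _allChars[i];

        // Reads one document per line, trims whitespace, drops empty lines, and shuffles the rest.
        public static List<string> LoadDocs(string path, Random random)
        {
            return
            [
                .. File.ReadAllLines(path)
                    .Select(l => l.Trim())
                    .Where(l => !string.IsNullOrEmpty(l))
                    .OrderBy(_ => random.Next()),
            ];
        }
    }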
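
One behavior of Encode is worth knowing before feeding it new text: it relies on List<T>.IndexOf, which returns -1 for an item that is not in the list. The sketch below reproduces that lookup with local functions rather than the class itself. The two-name corpus is hypothetical, and TryEncode is a hypothetical guard that is not part of the source.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical corpus standing in for the names dataset.
var docs = new List<string> { "emma", "ava" };
List<char> allChars = string.Join("", docs).Distinct().OrderBy(c => c).ToList();

// Same lookup as the Tokenizer's Encode: -1 means "not in the vocabulary".
int Encode(char ch) => allChars.IndexOf(ch);

// Hypothetical guard (an assumption, not in the source): report failure explicitly.
bool TryEncode(char ch, out int tokenId)
{
    tokenId = allChars.IndexOf(ch);
    return tokenId >= 0;
}

Console.WriteLine(Encode('e'));            // 'e' appears in the corpus
Console.WriteLine(Encode('z'));            // 'z' does not
Console.WriteLine(TryEncode('z', out _));  // the guard makes the failure explicit
```

Here the vocabulary is a, e, m, v, so the three lines print 1, -1, and False. Passing -1 on to a list indexer such as the one in Decode would throw, so text containing characters outside the training corpus needs a check like this or a corpus that covers them.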

Example of wrapping a sequence with synthetic BOS tokens for training.

    var tokens = new List<int> { tokenizer.Bos };
    tokens.AddRange(name.Select(tokenizer.Encode));
    tokens.Add(tokenizer.Bos);
    Console.WriteLine($"With BOS: [{string.Join(", ", tokens)}]");
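
To see why the closing BOS matters, it helps to look at the wrapped sequence as next-token prediction pairs, which is how GPT-style models are trained. This is a sketch under assumptions: the one-name vocabulary and the pairing loop are illustrative and are not taken from the source.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical vocabulary: only the characters of a single name, sorted.
string name = "emma";
List<char> allChars = name.Distinct().OrderBy(c => c).ToList();
int bos = allChars.Count;

// BOS has no character of its own, so it is shown as '.'.
char Decode(int tokenId) => tokenId == bos ? '.' : allChars[tokenId];

// Wrap the name in BOS on both sides, as in the example above.
var tokens = new List<int> { bos };
tokens.AddRange(name.Select(ch => allChars.IndexOf(ch)));
tokens.Add(bos);

// Pair each token with its successor: (input, target).
var pairs = new List<(int Input, int Target)>();
for (int i = 0; i + 1 < tokens.Count; i++)
{
    pairs.Add((tokens[i], tokens[i + 1]));
}

foreach (var (input, target) in pairs)
{
    Console.WriteLine($"{Decode(input)} -> {Decode(target)}");
}
```

For "emma" this yields five pairs: ". -> e", "e -> m", "m -> m", "m -> a", and "a -> .". The first pair teaches the model how names start, and the last pair is the only one that teaches it to stop. Without the trailing BOS that pair would not exist.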

Practical Applications

MicroGPT sequence generation: Using BOS tokens to signal document boundaries. Pitfall: Omitting delimiters prevents the model from learning when to terminate generation.
Dataset Preprocessing: Loading and shuffling text documents for training. Pitfall: Untrimmed lines add whitespace characters to the vocabulary, and unfiltered empty lines become empty training documents.
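
The preprocessing pitfall can be illustrated with a few messy lines. This is a minimal sketch: the rawLines array and the BuildVocab helper are hypothetical, while the Trim and empty-line filter mirror what LoadDocs does.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical raw lines, as File.ReadAllLines might return them from a messy file.
string[] rawLines = { "emma ", "", "\tava", "   " };

// Hypothetical helper: distinct characters in sorted order, as in the Tokenizer constructor.
List<char> BuildVocab(IEnumerable<string> docs) =>
    string.Join("", docs).Distinct().OrderBy(c => c).ToList();

// Without cleaning: space and tab become tokens, and blank lines stay in as documents.
List<char> noisyVocab = BuildVocab(rawLines);

// With the Trim + empty-line filter used by LoadDocs.
List<string> cleanDocs = rawLines
    .Select(l => l.Trim())
    .Where(l => !string.IsNullOrEmpty(l))
    .ToList();
List<char> cleanVocab = BuildVocab(cleanDocs);

Console.WriteLine($"Raw:   {rawLines.Length} docs, {noisyVocab.Count} characters");
Console.WriteLine($"Clean: {cleanDocs.Count} docs, {cleanVocab.Count} characters");
```

The raw input produces 4 documents and 6 characters (including space and tab), while the cleaned input produces 2 documents and 4 characters. Every stray character also shifts the sorted order, so the same letter can end up with a different ID from one run to the next.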

References:

https://dev.to/garyljackson/chapter-3-the-tokenizer-text-to-numbers-and-back-3b5
