Skip to main content

On This Page

Building a Character-Level Tokenizer for MicroGPT

2 min read
Share

These articles are AI-generated summaries. Please check the original sources for full details.

Chapter 3: The Tokenizer - Text to Numbers and Back

Author Gary Jackson details the implementation of a character-level tokenizer for the MicroGPT architecture. The system converts individual characters into integer IDs and introduces a synthetic Beginning of Sequence token to delimit training sequences.

Why This Matters

While production-grade Large Language Models utilize subword tokenizers like Byte Pair Encoding (BPE) to manage vocabularies of 100,000 tokens, character-level tokenization provides a high-signal entry point for understanding discrete symbol processing. This architecture reduces a name-based dataset to a vocabulary of 27 symbols, simplifying the computational overhead while demonstrating how models learn sequence boundaries.

Key Insights

  • Production LLMs use subword tokenizers like BPE to merge frequent sequences into single tokens, resulting in ~100K token vocabularies.
  • MicroGPT implements the simplest possible tokenizer by mapping one token per unique character, resulting in a vocabulary size of 27 for lowercase letters.
  • The Beginning of Sequence (BOS) token acts as a synthetic delimiter to signal the start and end of a document, such as [BOS, e, m, m, a, BOS].
  • Token IDs are arbitrary integer assignments where values like 0 or 25 are conventions rather than mathematical requirements.
  • The Decode function allows for visual representation of synthetic tokens, such as rendering the BOS token as a dot for human readability.

Working Examples

The core Tokenizer class implementation in C#.

namespace MicroGPT; public class Tokenizer { private readonly List<char> _allChars; public int Bos { get; } public int VocabSize { get; } public Tokenizer(List<string> docs) { _allChars = [.. string.Join("", docs).Distinct().OrderBy(c => c)]; Bos = _allChars.Count; VocabSize = _allChars.Count + 1; } public int Encode(char c) => _allChars.IndexOf(c); public char Decode(int i) => i == Bos ? '.' : _allChars[i]; public static List<string> LoadDocs(string path, Random random) { return [ .. File.ReadAllLines(path).Select(l => l.Trim()).Where(l => !string.IsNullOrEmpty(l)).OrderBy(_ => random.Next()), ]; } }

Example of wrapping a sequence with synthetic BOS tokens for training.

var tokens = new List<int> { tokenizer.Bos }; tokens.AddRange(name.Select(tokenizer.Encode)); tokens.Add(tokenizer.Bos); Console.WriteLine($"With BOS: [{string.Join(", ", tokens)}]");

Practical Applications

  • MicroGPT sequence generation: Using BOS tokens to signal document boundaries. Pitfall: Omitting delimiters prevents the model from learning when to terminate generation.
  • Dataset Preprocessing: Loading and shuffling text documents for training. Pitfall: Failing to trim or filter empty lines results in noisy vocabulary and inconsistent token mapping.

References:

Continue reading

Next article

Building Secure PHP and MySQL Authentication Systems: A Step-by-Step Engineering Guide

Related Content