Skip to main content
building large language models from scratch a beginners guide with python and pytorch

Text Processing — Turning Words into Numbers

30 min read Chapter 3 of 11
Summary

This chapter addresses the fundamental challenge of converting...

This chapter addresses the fundamental challenge of converting human language into numerical representations that neural networks can process. Starting with character-level tokenization, it progresses through word-level tokenization to Byte Pair Encoding (BPE) — the subword algorithm used by GPT models. A complete BPE tokenizer is built from scratch with encode/decode functionality. Special tokens are explained with their roles in sequence processing.

Text Processing — Turning Words into Numbers

In the previous chapter, we set up our Python environment and got comfortable with PyTorch tensors. Now we face the very first real challenge of building a language model: how do we feed text into a neural network?

Neural networks are mathematical machines. They multiply matrices, add vectors, and apply functions—all operations on numbers. They have no concept of the letter “A,” the word “cat,” or the sentence “The weather is nice today.” Before a single neuron fires, we need to convert every piece of text into a sequence of numbers.

This conversion process is called tokenization, and it is the foundation upon which every language model is built.

1. The Fundamental Problem

Imagine you receive a book written entirely in a language you don’t speak—say, Mandarin Chinese. You can’t read it, can’t understand it, and certainly can’t summarize it. But someone hands you a dictionary that maps every Chinese character to a unique number, and another dictionary that maps those numbers to English words. Suddenly, you have a system: Chinese → numbers → English. You can process the book.

A neural network faces the same problem. It receives text like:

"The cat sat on the mat."

And it needs to see something like:

[87, 14, 203, 55, 87, 412]

The dictionary that performs this translation is called a vocabulary, and the process of splitting text into pieces and assigning numbers is called tokenization. The individual pieces are called tokens.

There’s a critical design question here: what should a token be? Should each character be a token? Each word? Something in between? This choice has enormous consequences for how well the model learns, how much memory it uses, and how fast it runs.

Let’s explore the options, starting from the simplest.

2. Character-Level Tokenization

The simplest possible approach: treat every single character as a token. The letter “h” gets one number, “e” gets another, a space gets another, and so on.

Building a Character Tokenizer from Scratch

class CharTokenizer:
    """A tokenizer that treats each character as a separate token."""

    def __init__(self):
        self.char_to_id = {}   # Maps characters to integers
        self.id_to_char = {}   # Maps integers back to characters

    def build_vocab(self, text):
        """Scan the text and assign a unique number to each character."""
        # Get all unique characters, sorted for consistency
        unique_chars = sorted(set(text))

        # Assign an integer ID to each character
        for idx, char in enumerate(unique_chars):
            self.char_to_id[char] = idx
            self.id_to_char[idx] = char

        print(f"Vocabulary size: {len(unique_chars)} characters")
        print(f"Characters: {unique_chars}")

    def encode(self, text):
        """Convert a string into a list of integers."""
        return [self.char_to_id[ch] for ch in text]

    def decode(self, ids):
        """Convert a list of integers back into a string."""
        return "".join(self.id_to_char[i] for i in ids)

Let’s use it:

# Create the tokenizer and build a vocabulary from sample text
tokenizer = CharTokenizer()
tokenizer.build_vocab("hello world")

# Encode a string into numbers
encoded = tokenizer.encode("hello")
print(f"'hello' -> {encoded}")
# Output: 'hello' -> [2, 1, 4, 4, 5]

# Decode back to text
decoded = tokenizer.decode(encoded)
print(f"{encoded} -> '{decoded}'")
# Output: [2, 1, 4, 4, 5] -> 'hello'

# Verify round-trip works
assert decoded == "hello"
print("Round-trip successful!")

Let’s inspect the vocabulary:

# See the full mapping
print("\nCharacter -> ID mapping:")
for char, idx in tokenizer.char_to_id.items():
    # Show spaces visually
    display = "SPACE" if char == " " else char
    print(f"  '{display}' -> {idx}")

# Output:
# Character -> ID mapping:
#   'SPACE' -> 0
#   'd'     -> 1  (wait—this depends on sorted order)
#   'e'     -> 2
#   'h'     -> 3
#   'l'     -> 4
#   'o'     -> 5
#   'r'     -> 6
#   'w'     -> 7

Note: The exact numbers depend on the sorted order of unique characters. The important thing is that each character gets a unique number and we can convert back and forth losslessly.

Encoding a Longer String

sentence = "hello world"
encoded = tokenizer.encode(sentence)
print(f"Encoded: {encoded}")
# Output: Encoded: [3, 2, 4, 4, 5, 0, 7, 5, 6, 4, 1]

print(f"Length of original: {len(sentence)} characters")
print(f"Length of encoded:  {len(encoded)} tokens")
# Output:
# Length of original: 11 characters
# Length of encoded:  11 tokens

Notice something: the encoded sequence is exactly as long as the original text. Every character, including the space, becomes one token.

Pros and Cons

Pros of character-level tokenization:

  • Tiny vocabulary. English text uses roughly 100 distinct characters (letters, digits, punctuation, spaces). Even with Unicode, you can cover most text with a few hundred entries.
  • No unknown tokens. If a character exists in your vocabulary, you can tokenize any word that uses it—including words you’ve never seen before, typos, and made-up words.
  • Simple to implement. As we just saw, it’s about 20 lines of code.

Cons of character-level tokenization:

  • Loses word meaning. The model sees [3, 2, 4, 4, 5] for “hello” and [3, 2, 4, 4] for “hell.” These look almost identical numerically, but they mean very different things. The model has to learn which character sequences form meaningful words—an extra burden.
  • Very long sequences. A 500-word article might be 3,000 characters. The model must process all 3,000 tokens. This is slow and memory-intensive because (as we’ll see in Chapter 5) the computational cost of the attention mechanism grows quadratically with sequence length.

Because of these limitations, modern language models don’t use character-level tokenization. But understanding it is essential—it’s the starting point from which better methods evolved.

3. Word-Level Tokenization

The opposite extreme: treat each word as a token. “The” gets one number, “cat” gets another, “sat” gets another.

Building a Word Tokenizer

class WordTokenizer:
    """A tokenizer that treats each word as a separate token."""

    def __init__(self):
        self.word_to_id = {}
        self.id_to_word = {}

    def build_vocab(self, text):
        """Split text into words and assign IDs."""
        # Simple split on whitespace (a real tokenizer would handle
        # punctuation, contractions, etc.)
        words = text.lower().split()

        # Get unique words, sorted for consistency
        unique_words = sorted(set(words))

        for idx, word in enumerate(unique_words):
            self.word_to_id[word] = idx
            self.id_to_word[idx] = word

        print(f"Vocabulary size: {len(unique_words)} words")

    def encode(self, text):
        """Convert text to a list of word IDs."""
        words = text.lower().split()
        return [self.word_to_id[w] for w in words]

    def decode(self, ids):
        """Convert word IDs back to text."""
        return " ".join(self.id_to_word[i] for i in ids)

Let’s use it:

tokenizer = WordTokenizer()
corpus = "the cat sat on the mat the dog sat on the rug"
tokenizer.build_vocab(corpus)

# Output: Vocabulary size: 7 words

encoded = tokenizer.encode("the cat sat on the mat")
print(f"Encoded: {encoded}")
# Output: Encoded: [5, 0, 4, 3, 5, 2]

decoded = tokenizer.decode(encoded)
print(f"Decoded: '{decoded}'")
# Output: Decoded: 'the cat sat on the mat'

Notice how much shorter the encoded sequence is compared to character-level: 6 tokens instead of 22 characters. That’s a big advantage for the model’s processing speed.

The Vocabulary Explosion Problem

But there’s a serious issue. Let’s see what happens with richer text:

text = """
Machine learning is a subset of artificial intelligence.
Deep learning is a subset of machine learning.
Natural language processing uses deep learning techniques.
Large language models are trained on massive text corpora.
"""

tokenizer = WordTokenizer()
tokenizer.build_vocab(text)

# Already 22 unique words from just 4 sentences!
# Real-world text has hundreds of thousands of unique words.

The English language contains roughly 170,000 words in current use. Add technical terms, names, slang, multiple languages, and you easily reach millions. A vocabulary of millions means:

  • The model needs millions of parameters just for the first layer
  • Most words appear rarely, so the model can’t learn good representations for them
  • Memory usage explodes

The Out-of-Vocabulary (OOV) Problem

An even worse problem: what happens when the model encounters a word it has never seen?

tokenizer = WordTokenizer()
tokenizer.build_vocab("the cat sat on the mat")

try:
    encoded = tokenizer.encode("the dog sat on the mat")
except KeyError as e:
    print(f"Error! Unknown word: {e}")
    # Output: Error! Unknown word: 'dog'

The word “dog” wasn’t in our training text, so the tokenizer crashes. In practice, you’d replace unknown words with a special [UNK] token, but that throws away information. If the model sees “I adopted a [UNK] from the shelter,” it has no idea whether the unknown word was “dog,” “cat,” or “child.”

Word-level tokenization was used in early NLP systems, but these two problems—vocabulary explosion and OOV—make it impractical for modern language models.

We need something in between: smaller than words, more meaningful than characters.

4. Byte Pair Encoding (BPE)

Byte Pair Encoding is the tokenization algorithm used by GPT-2, GPT-3, GPT-4, and many other large language models. It’s a subword tokenization method—tokens can be whole words, parts of words, or even single characters.

The core idea is elegant: start with individual characters, then repeatedly merge the most frequent pair of adjacent tokens into a new token. Common words end up as single tokens; rare words get split into smaller, recognizable pieces.

Step-by-Step Example

Let’s walk through BPE on a tiny corpus. Suppose our training text contains these words with their frequencies:

"low"    — appears 5 times
"lower"  — appears 2 times
"lowest" — appears 1 time

Step 0: Start with characters. We split every word into individual characters and add a special end-of-word marker _:

l o w _       (frequency: 5)
l o w e r _   (frequency: 2)
l o w e s t _ (frequency: 1)

Our initial vocabulary is: {_, e, l, o, r, s, t, w}

Step 1: Count all adjacent pairs.

PairTotal Frequency
(l, o)5 + 2 + 1 = 8
(o, w)5 + 2 + 1 = 8
(w, _)5
(w, e)2 + 1 = 3
(e, r)2
(r, _)2
(e, s)1
(s, t)1
(t, _)1

Step 2: Merge the most frequent pair. Tie between (l, o) and (o, w), both with frequency 8. We pick (l, o) and merge them into a new token lo:

lo w _       (frequency: 5)
lo w e r _   (frequency: 2)
lo w e s t _ (frequency: 1)

Vocabulary: {_, e, l, lo, o, r, s, t, w}

Step 3: Count pairs again and merge.

PairTotal Frequency
(lo, w)5 + 2 + 1 = 8
(w, _)5
(w, e)2 + 1 = 3
(e, r)2
(r, _)2
(e, s)1
(s, t)1
(t, _)1

Most frequent: (lo, w) with 8. Merge into low:

low _       (frequency: 5)
low e r _   (frequency: 2)
low e s t _ (frequency: 1)

Vocabulary: {_, e, l, lo, low, o, r, s, t, w}

Step 4: Next merge. Most frequent pair is (low, _) with 5. Merge into low_:

low_        (frequency: 5)
low e r _   (frequency: 2)
low e s t _ (frequency: 1)

Step 5: Next most frequent is (low, e) with 3. Merge into lowe:

low_         (frequency: 5)
lowe r _     (frequency: 2)
lowe s t _   (frequency: 1)

We could keep going, but let’s stop here. After just a few merges, “low” is already a single token, and “lower” is just two tokens: lowe + r_, rather than six individual characters.

The key insight: common substrings become single tokens automatically. The algorithm discovers word structure without being taught any linguistics.

Implementing BPE from Scratch

from collections import Counter

class SimpleBPE:
    """A simplified Byte Pair Encoding tokenizer."""

    def __init__(self, num_merges=20):
        self.num_merges = num_merges  # How many merge operations to learn
        self.merges = []               # List of (pair, merged_token) in order
        self.vocab = set()             # Set of all tokens

    def _get_pairs(self, token_sequences):
        """Count all adjacent token pairs across all sequences."""
        pair_counts = Counter()
        for seq, freq in token_sequences.items():
            for i in range(len(seq) - 1):
                pair = (seq[i], seq[i + 1])
                pair_counts[pair] += freq
        return pair_counts

    def _merge_pair(self, token_sequences, pair):
        """Replace all occurrences of a pair with the merged token."""
        merged = pair[0] + pair[1]
        new_sequences = {}

        for seq, freq in token_sequences.items():
            new_seq = []
            i = 0
            while i < len(seq):
                # If we found the pair, merge it
                if i < len(seq) - 1 and seq[i] == pair[0] and seq[i + 1] == pair[1]:
                    new_seq.append(merged)
                    i += 2  # Skip both elements of the pair
                else:
                    new_seq.append(seq[i])
                    i += 1
            new_sequences[tuple(new_seq)] = freq

        return new_sequences

    def train(self, text):
        """Learn BPE merges from training text."""
        # Split text into words and count frequencies
        word_counts = Counter(text.split())

        # Convert each word into a tuple of characters + end marker
        token_sequences = {}
        for word, count in word_counts.items():
            chars = tuple(list(word) + ["_"])
            token_sequences[chars] = count

        # Initialize vocabulary with all individual characters
        self.vocab = set()
        for seq in token_sequences:
            self.vocab.update(seq)

        print(f"Initial vocabulary ({len(self.vocab)} tokens): {sorted(self.vocab)}")
        print()

        # Perform merges
        for i in range(self.num_merges):
            pairs = self._get_pairs(token_sequences)
            if not pairs:
                print(f"No more pairs to merge after {i} steps.")
                break

            # Find the most frequent pair
            best_pair = max(pairs, key=pairs.get)
            best_count = pairs[best_pair]

            # Stop if no pair appears more than once
            if best_count < 2:
                print(f"Stopping: no pair appears more than once.")
                break

            # Merge the pair
            merged_token = best_pair[0] + best_pair[1]
            self.merges.append(best_pair)
            self.vocab.add(merged_token)
            token_sequences = self._merge_pair(token_sequences, best_pair)

            print(f"Merge {i + 1}: '{best_pair[0]}' + '{best_pair[1]}' -> "
                  f"'{merged_token}' (appeared {best_count} times)")

        print(f"\nFinal vocabulary ({len(self.vocab)} tokens): {sorted(self.vocab)}")
        print(f"Learned {len(self.merges)} merge rules.")

    def encode(self, word):
        """Encode a single word using learned merge rules."""
        # Start with character-level tokens
        tokens = list(word) + ["_"]

        # Apply each merge rule in order
        for pair in self.merges:
            merged = pair[0] + pair[1]
            new_tokens = []
            i = 0
            while i < len(tokens):
                if (i < len(tokens) - 1 and
                    tokens[i] == pair[0] and
                    tokens[i + 1] == pair[1]):
                    new_tokens.append(merged)
                    i += 2
                else:
                    new_tokens.append(tokens[i])
                    i += 1
            tokens = new_tokens

        return tokens

    def decode(self, tokens):
        """Decode a list of BPE tokens back to a string."""
        # Join tokens and remove end-of-word markers
        text = "".join(tokens)
        text = text.replace("_", "")
        return text

Let’s train and test it:

# Training corpus
corpus = ("low " * 5 + "lower " * 2 + "lowest " * 1 +
          "new " * 4 + "newer " * 3 + "newest " * 2).strip()

print(f"Training corpus: '{corpus}'")
print(f"{'=' * 60}\n")

bpe = SimpleBPE(num_merges=15)
bpe.train(corpus)

# Output:
# Initial vocabulary (12 tokens): ['_', 'e', 'l', 'n', 'o', 'r', 's', 't', 'w', ...]
#
# Merge 1: 'e' + 'w' -> 'ew' (appeared 12 times)
# Merge 2: 'n' + 'ew' -> 'new' (appeared 9 times)
# Merge 3: 'l' + 'o' -> 'lo' (appeared 8 times)
# Merge 4: 'lo' + 'w' -> 'low' (appeared 8 times)
# ... (more merges)

Now let’s encode some words:

print("\n--- Encoding examples ---\n")

test_words = ["low", "lower", "lowest", "new", "newer", "newest"]
for word in test_words:
    tokens = bpe.encode(word)
    print(f"  '{word}' -> {tokens}")

# Example output:
#   'low'    -> ['low', '_']
#   'lower'  -> ['low', 'er', '_']    (or similar, depending on merges)
#   'lowest' -> ['low', 'e', 's', 't', '_']
#   'new'    -> ['new', '_']
#   'newer'  -> ['new', 'er', '_']
#   'newest' -> ['new', 'e', 's', 't', '_']

Notice the beauty of BPE: “lower” and “newer” share the subword “er” token. The model can learn that “-er” means a comparative form, and apply that knowledge across both words—even if it has never seen “newer” during training, it knows the pieces “new” and “er.”

# Verify round-trip (encode then decode)
for word in test_words:
    tokens = bpe.encode(word)
    recovered = bpe.decode(tokens)
    status = "OK" if recovered == word else "FAIL"
    print(f"  [{status}] '{word}' -> {tokens} -> '{recovered}'")

Why BPE Works Well

BPE hits a sweet spot between character-level and word-level tokenization:

PropertyCharactersBPEWords
Vocabulary size~10030,000–50,000100,000+
Sequence lengthVery longModerateShort
Handles rare wordsYesYes (splits into subwords)No (OOV)
Captures meaningPoorlyWellBest
Used by GPT modelsNoYesNo

GPT-2 uses a BPE vocabulary of about 50,257 tokens. GPT-4 uses roughly 100,000 tokens. These vocabularies are trained on billions of words of text, so common words like “the,” “is,” and “and” become single tokens, while rare words like “tokenization” might split into “token” + “ization.”

5. Building a Complete Tokenizer Class

Now let’s build a more complete tokenizer that combines everything we’ve learned. It will:

  • Train a BPE vocabulary from a corpus
  • Assign integer IDs to every token
  • Support encode() (text → list of integers) and decode() (list of integers → text)
  • Handle special tokens
from collections import Counter

class BPETokenizer:
    """
    A complete BPE tokenizer with integer encoding/decoding.
    This is a simplified but functional version of what GPT models use.
    """

    def __init__(self, vocab_size=256):
        # Target vocabulary size (including special tokens)
        self.vocab_size = vocab_size

        # Merge rules learned during training
        self.merges = []

        # Token <-> ID mappings
        self.token_to_id = {}
        self.id_to_token = {}

        # Special tokens
        self.special_tokens = {
            "[PAD]": 0,   # Padding token
            "[UNK]": 1,   # Unknown token
            "[BOS]": 2,   # Beginning of sequence
            "[EOS]": 3,   # End of sequence
        }

    def _get_pairs(self, token_sequences):
        """Count adjacent token pairs across all sequences."""
        pair_counts = Counter()
        for seq, freq in token_sequences.items():
            for i in range(len(seq) - 1):
                pair_counts[(seq[i], seq[i + 1])] += freq
        return pair_counts

    def _merge_pair(self, token_sequences, pair):
        """Merge all occurrences of a pair in all sequences."""
        merged = pair[0] + pair[1]
        new_sequences = {}
        for seq, freq in token_sequences.items():
            new_seq = []
            i = 0
            while i < len(seq):
                if (i < len(seq) - 1 and
                    seq[i] == pair[0] and seq[i + 1] == pair[1]):
                    new_seq.append(merged)
                    i += 2
                else:
                    new_seq.append(seq[i])
                    i += 1
            new_sequences[tuple(new_seq)] = freq
        return new_sequences

    def train(self, text):
        """Train the tokenizer on a text corpus."""
        print("Training BPE tokenizer...")
        print(f"Corpus length: {len(text)} characters")

        # Step 1: Count word frequencies
        word_counts = Counter(text.split())
        print(f"Unique words: {len(word_counts)}")

        # Step 2: Split words into character sequences with end marker
        token_sequences = {}
        for word, count in word_counts.items():
            chars = tuple(list(word) + ["_"])
            token_sequences[chars] = count

        # Step 3: Collect initial character vocabulary
        all_tokens = set()
        for seq in token_sequences:
            all_tokens.update(seq)

        # Reserve space for special tokens
        num_special = len(self.special_tokens)
        max_merges = self.vocab_size - len(all_tokens) - num_special

        print(f"Initial character tokens: {len(all_tokens)}")
        print(f"Max merges to learn: {max_merges}")
        print()

        # Step 4: Learn merge rules
        for i in range(max_merges):
            pairs = self._get_pairs(token_sequences)
            if not pairs:
                break

            best_pair = max(pairs, key=pairs.get)
            if pairs[best_pair] < 2:
                break

            merged_token = best_pair[0] + best_pair[1]
            self.merges.append(best_pair)
            all_tokens.add(merged_token)
            token_sequences = self._merge_pair(token_sequences, best_pair)

            if (i + 1) % 10 == 0 or i < 5:
                print(f"  Merge {i + 1}: '{best_pair[0]}' + '{best_pair[1]}' "
                      f"-> '{merged_token}'")

        print(f"\nLearned {len(self.merges)} merge rules.")

        # Step 5: Build token-to-ID mapping
        # Special tokens get the first IDs
        self.token_to_id = dict(self.special_tokens)
        self.id_to_token = {v: k for k, v in self.special_tokens.items()}

        # Assign IDs to all learned tokens
        next_id = len(self.special_tokens)
        for token in sorted(all_tokens):
            if token not in self.token_to_id:
                self.token_to_id[token] = next_id
                self.id_to_token[next_id] = token
                next_id += 1

        print(f"Total vocabulary size: {len(self.token_to_id)} tokens "
              f"({len(self.special_tokens)} special + "
              f"{len(self.token_to_id) - len(self.special_tokens)} learned)")

    def _tokenize_word(self, word):
        """Apply learned merges to tokenize a single word."""
        tokens = list(word) + ["_"]

        for pair in self.merges:
            merged = pair[0] + pair[1]
            new_tokens = []
            i = 0
            while i < len(tokens):
                if (i < len(tokens) - 1 and
                    tokens[i] == pair[0] and tokens[i + 1] == pair[1]):
                    new_tokens.append(merged)
                    i += 2
                else:
                    new_tokens.append(tokens[i])
                    i += 1
            tokens = new_tokens

        return tokens

    def encode(self, text, add_special_tokens=True):
        """
        Convert text to a list of integer token IDs.

        Args:
            text: The input string to encode.
            add_special_tokens: If True, add [BOS] at the start
                                and [EOS] at the end.

        Returns:
            A list of integer IDs.
        """
        ids = []

        # Optionally add beginning-of-sequence token
        if add_special_tokens:
            ids.append(self.special_tokens["[BOS]"])

        # Tokenize each word
        words = text.split()
        for word in words:
            tokens = self._tokenize_word(word)
            for token in tokens:
                # Look up the token ID, use [UNK] if not found
                token_id = self.token_to_id.get(
                    token, self.special_tokens["[UNK]"]
                )
                ids.append(token_id)

        # Optionally add end-of-sequence token
        if add_special_tokens:
            ids.append(self.special_tokens["[EOS]"])

        return ids

    def decode(self, ids, skip_special_tokens=True):
        """
        Convert a list of integer token IDs back to text.

        Args:
            ids: List of integer token IDs.
            skip_special_tokens: If True, remove special tokens
                                 from the output.

        Returns:
            The decoded string.
        """
        tokens = []
        special_ids = set(self.special_tokens.values())

        for token_id in ids:
            if skip_special_tokens and token_id in special_ids:
                continue
            token = self.id_to_token.get(token_id, "[UNK]")
            tokens.append(token)

        # Join tokens and clean up end-of-word markers
        text = "".join(tokens)
        # Replace end-of-word markers with spaces
        text = text.replace("_", " ")
        # Clean up extra spaces
        text = text.strip()

        return text

    def vocab_info(self):
        """Print summary information about the vocabulary."""
        print(f"Vocabulary size: {len(self.token_to_id)}")
        print(f"Merge rules: {len(self.merges)}")
        print(f"Special tokens: {list(self.special_tokens.keys())}")

Training and Using the Complete Tokenizer

# A small training corpus
corpus = """
the cat sat on the mat and the dog sat on the rug
the cat chased the dog around the garden
the dog barked at the cat loudly
learning about language models is fun
natural language processing is fascinating
deep learning models process text efficiently
"""

# Create and train the tokenizer
tokenizer = BPETokenizer(vocab_size=100)
tokenizer.train(corpus)

# Output:
# Training BPE tokenizer...
# Corpus length: 261 characters
# Unique words: 23
# Initial character tokens: 22
# Max merges to learn: 74
#   Merge 1: 'h' + 'e' -> 'he'
#   Merge 2: 't' + 'he' -> 'the'
#   Merge 3: 'the' + '_' -> 'the_'
#   ...
# Learned XX merge rules.
# Total vocabulary size: XX tokens (4 special + XX learned)

Now let’s encode and decode:

# Encode a sentence
text = "the cat sat on the mat"
encoded = tokenizer.encode(text)
print(f"\nOriginal:  '{text}'")
print(f"Encoded:   {encoded}")
print(f"Tokens:    {len(encoded)} (including special tokens)")

# Decode back
decoded = tokenizer.decode(encoded)
print(f"Decoded:   '{decoded}'")

# Show token-by-token breakdown
print(f"\nToken breakdown:")
for token_id in encoded:
    token = tokenizer.id_to_token[token_id]
    print(f"  ID {token_id:3d} -> '{token}'")

# Output (example):
# Original:  'the cat sat on the mat'
# Encoded:   [2, 15, 8, 12, 14, 15, 11, 3]
# Tokens:    8 (including special tokens)
# Decoded:   'the cat sat on the mat'
#
# Token breakdown:
#   ID   2 -> '[BOS]'
#   ID  15 -> 'the_'
#   ID   8 -> 'cat_'
#   ID  12 -> 'sat_'
#   ID  14 -> 'on_'
#   ID  15 -> 'the_'
#   ID  11 -> 'mat_'
#   ID   3 -> '[EOS]'

Handling Unknown Words

# Try encoding a word the tokenizer has never seen
text = "the bird flew over the garden"
encoded = tokenizer.encode(text)
decoded = tokenizer.decode(encoded)

print(f"Original: '{text}'")
print(f"Decoded:  '{decoded}'")

# "bird" and "flew" and "over" might be split into subwords
# or use [UNK] for completely unknown characters
print(f"\nToken breakdown:")
for token_id in encoded:
    token = tokenizer.id_to_token.get(token_id, "[UNK]")
    print(f"  ID {token_id:3d} -> '{token}'")

Because BPE falls back to character-level tokenization for unknown subwords, it can handle words it has never seen—it just splits them into smaller pieces.

6. Special Tokens

You’ve already seen [BOS] and [EOS] in our tokenizer. Let’s explain all the common special tokens and why they matter.

[PAD] — Padding Token

Neural networks process data in batches (groups of sequences processed simultaneously for efficiency). But sequences in a batch may have different lengths:

Sequence 1: [12, 45, 78, 23, 56]       (5 tokens)
Sequence 2: [34, 67, 89]               (3 tokens)
Sequence 3: [11, 22, 33, 44, 55, 66]   (6 tokens)

To process them as a batch (a single matrix), they must all have the same length. We pad shorter sequences with the [PAD] token:

# Padding example
pad_id = 0  # [PAD] token ID

sequences = [
    [12, 45, 78, 23, 56],
    [34, 67, 89],
    [11, 22, 33, 44, 55, 66],
]

# Find the longest sequence
max_len = max(len(seq) for seq in sequences)
print(f"Max sequence length: {max_len}")

# Pad all sequences to the same length
padded = []
for seq in sequences:
    padding_needed = max_len - len(seq)
    padded_seq = seq + [pad_id] * padding_needed
    padded.append(padded_seq)
    print(f"  {seq} -> {padded_seq}")

# Output:
# Max sequence length: 6
#   [12, 45, 78, 23, 56]         -> [12, 45, 78, 23, 56, 0]
#   [34, 67, 89]                 -> [34, 67, 89, 0, 0, 0]
#   [11, 22, 33, 44, 55, 66]     -> [11, 22, 33, 44, 55, 66]

The model learns to ignore padding tokens. During training, we use an “attention mask” to tell the model which tokens are real and which are padding.

[UNK] — Unknown Token

Used when the tokenizer encounters a token that isn’t in its vocabulary. This is a last resort—BPE tokenizers rarely need it because they can always fall back to individual characters or bytes.

# With a word-level tokenizer, [UNK] is common:
# "I love quantum computing" -> ["I", "love", "[UNK]", "[UNK]"]
#
# With BPE, it's rare:
# "I love quantum computing" -> ["I", "love", "qu", "ant", "um", "comput", "ing"]

[BOS] — Beginning of Sequence

Placed at the very start of a sequence. It signals to the model: “A new piece of text starts here.” This is especially important during text generation—the model needs a starting token to generate the first real word.

# Without BOS: the model doesn't know "the" is the first word
# [45, 12, 78, 23]
#
# With BOS: the model knows this is the start of a sequence
# [2, 45, 12, 78, 23]
#  ^-- [BOS] tells the model: "start generating from here"

Some models (like GPT) use <|endoftext|> as their BOS token, doubling as both a start and end marker.

[EOS] — End of Sequence

Placed at the end of a sequence. It tells the model: “This text is complete.” During generation, when the model outputs [EOS], we know it has finished generating and we should stop.

# During text generation:
# Input:  [BOS] "Once upon a"
# Model generates: "time" "there" "was" "a" "dragon" [EOS]
#                                                      ^ Stop here!

Putting It All Together

# A fully tokenized sequence with special tokens:
#
# Original text: "The cat sat."
#
# Tokenized with all special tokens:
# [BOS] the_ cat_ sat_ . _ [EOS] [PAD] [PAD] [PAD]
#   2    15    8    12  9 3   3     0     0     0
#
# Attention mask (1 = real token, 0 = padding):
#   1     1    1    1   1 1   1     0     0     0

Special tokens are part of the vocabulary and get their own integer IDs, just like regular tokens. The model learns what they mean through training.

7. Exercises

Exercise 1: Extend the Character Tokenizer

Modify the CharTokenizer class to handle unknown characters gracefully. If encode() encounters a character not in the vocabulary, it should use a special [UNK] ID instead of crashing.

Hint: Add [UNK] to your vocabulary during build_vocab(), and use its ID as a fallback in encode().

Solution
class CharTokenizerWithUNK:
    """Character tokenizer that handles unknown characters."""

    def __init__(self):
        self.char_to_id = {}
        self.id_to_char = {}
        self.unk_id = 0  # Reserve ID 0 for unknown characters

    def build_vocab(self, text):
        """Build vocabulary with [UNK] token at position 0."""
        # Reserve position 0 for [UNK]
        self.char_to_id["[UNK]"] = 0
        self.id_to_char[0] = "[UNK]"

        # Assign IDs starting from 1
        unique_chars = sorted(set(text))
        for idx, char in enumerate(unique_chars, start=1):
            self.char_to_id[char] = idx
            self.id_to_char[idx] = char

        print(f"Vocabulary size: {len(self.char_to_id)} "
              f"(1 special + {len(unique_chars)} characters)")

    def encode(self, text):
        """Encode text, using [UNK] for unknown characters."""
        encoded = []
        for ch in text:
            if ch in self.char_to_id:
                encoded.append(self.char_to_id[ch])
            else:
                encoded.append(self.unk_id)  # Use [UNK] ID
                print(f"  Warning: unknown character '{ch}' "
                      f"replaced with [UNK]")
        return encoded

    def decode(self, ids):
        """Decode IDs back to text."""
        return "".join(self.id_to_char.get(i, "?") for i in ids)


# Test it
tokenizer = CharTokenizerWithUNK()
tokenizer.build_vocab("hello world")

# Encode text with unknown characters
encoded = tokenizer.encode("hello! world?")
print(f"Encoded: {encoded}")
# '!' and '?' are not in vocab, so they become [UNK] (ID 0)

decoded = tokenizer.decode(encoded)
print(f"Decoded: '{decoded}'")
# Output: 'hello[UNK] world[UNK]'

Exercise 2: BPE Merge Visualization

Write a function that takes a word and the list of BPE merge rules, and prints each step of how the word gets tokenized—showing which merge rule is applied at each step.

Hint: Start with the character-level representation and apply merges one at a time, printing the state after each merge that actually changes something.

Solution
def visualize_bpe_encoding(word, merges):
    """
    Show step-by-step how a word gets tokenized by BPE merges.

    Args:
        word: The word to tokenize (string).
        merges: List of (token1, token2) merge rules.
    """
    tokens = list(word) + ["_"]
    print(f"Word: '{word}'")
    print(f"Step 0 (characters): {tokens}")

    step = 0
    for pair in merges:
        merged = pair[0] + pair[1]
        new_tokens = []
        i = 0
        changed = False

        while i < len(tokens):
            if (i < len(tokens) - 1 and
                tokens[i] == pair[0] and tokens[i + 1] == pair[1]):
                new_tokens.append(merged)
                i += 2
                changed = True
            else:
                new_tokens.append(tokens[i])
                i += 1

        if changed:
            step += 1
            tokens = new_tokens
            print(f"Step {step} (merge '{pair[0]}'+'{pair[1]}' "
                  f"-> '{merged}'): {tokens}")

    print(f"\nFinal tokens: {tokens}")
    print(f"Number of tokens: {len(tokens)}")


# Example usage with some merge rules
merges = [
    ("l", "o"),      # l + o -> lo
    ("lo", "w"),     # lo + w -> low
    ("e", "r"),      # e + r -> er
    ("low", "er"),   # low + er -> lower
]

visualize_bpe_encoding("lower", merges)
# Output:
# Word: 'lower'
# Step 0 (characters): ['l', 'o', 'w', 'e', 'r', '_']
# Step 1 (merge 'l'+'o' -> 'lo'):    ['lo', 'w', 'e', 'r', '_']
# Step 2 (merge 'lo'+'w' -> 'low'):  ['low', 'e', 'r', '_']
# Step 3 (merge 'e'+'r' -> 'er'):    ['low', 'er', '_']
# Step 4 (merge 'low'+'er' -> 'lower'): ['lower', '_']
#
# Final tokens: ['lower', '_']
# Number of tokens: 2

print()
visualize_bpe_encoding("lowest", merges)
# "lowest" only benefits from the first two merges (l+o, lo+w)
# Output:
# Word: 'lowest'
# Step 0 (characters): ['l', 'o', 'w', 'e', 's', 't', '_']
# Step 1 (merge 'l'+'o' -> 'lo'):   ['lo', 'w', 'e', 's', 't', '_']
# Step 2 (merge 'lo'+'w' -> 'low'): ['low', 'e', 's', 't', '_']
#
# Final tokens: ['low', 'e', 's', 't', '_']
# Number of tokens: 5

Exercise 3: Vocabulary Size Experiment

Using the BPETokenizer class from Section 5, train tokenizers with different vocabulary sizes (50, 100, 200) on the same corpus. For each, encode the same test sentence and compare:

  • How many tokens the sentence gets split into
  • Whether common words are single tokens or multiple tokens

What trend do you observe?

Solution
corpus = """
the cat sat on the mat and the dog sat on the rug
the cat chased the dog around the garden
natural language processing is fascinating
deep learning models can understand text
building language models from scratch is rewarding
the model learns patterns from training data
tokens are the building blocks of language models
""" * 3  # Repeat to give more training data

test_sentence = "the cat sat on the mat"

print("=" * 60)
print(f"Test sentence: '{test_sentence}'")
print("=" * 60)

for vocab_size in [50, 100, 200]:
    print(f"\n--- Vocabulary size: {vocab_size} ---")
    tok = BPETokenizer(vocab_size=vocab_size)
    tok.train(corpus)

    encoded = tok.encode(test_sentence, add_special_tokens=False)
    print(f"Number of tokens: {len(encoded)}")
    print(f"Encoded: {encoded}")

    # Show the actual tokens
    token_strings = []
    for tid in encoded:
        token_strings.append(tok.id_to_token.get(tid, "[UNK]"))
    print(f"Tokens: {token_strings}")

# Expected observation:
# - Larger vocabulary -> fewer tokens per sentence
# - vocab_size=50:  "the" might still be 3+ tokens
# - vocab_size=100: "the" becomes a single token, "cat" might too
# - vocab_size=200: most common words are single tokens
#
# This is the fundamental tradeoff:
# Bigger vocabulary = shorter sequences = faster processing
# But also = more parameters = more memory

Exercise 4: Implement Batch Padding

Write a function pad_batch() that takes a list of encoded sequences (lists of integers) and a pad token ID, and returns a padded batch where all sequences have the same length. Also return an attention mask.

Solution
def pad_batch(sequences, pad_id=0):
    """
    Pad a batch of sequences to equal length.

    Args:
        sequences: List of lists of integer token IDs.
        pad_id: The ID of the padding token.

    Returns:
        padded: List of lists, all the same length.
        attention_mask: List of lists (1 for real tokens, 0 for padding).
    """
    max_len = max(len(seq) for seq in sequences)

    padded = []
    attention_mask = []

    for seq in sequences:
        # Calculate how much padding is needed
        pad_length = max_len - len(seq)

        # Pad the sequence
        padded_seq = seq + [pad_id] * pad_length
        padded.append(padded_seq)

        # Create attention mask (1 for real, 0 for padding)
        mask = [1] * len(seq) + [0] * pad_length
        attention_mask.append(mask)

    return padded, attention_mask


# Test it
sequences = [
    [2, 15, 8, 12, 3],      # "the cat sat" with BOS/EOS
    [2, 15, 8, 3],           # "the cat" with BOS/EOS
    [2, 15, 8, 12, 14, 3],  # "the cat sat on" with BOS/EOS
]

padded, masks = pad_batch(sequences, pad_id=0)

print("Padded sequences:")
for i, (seq, mask) in enumerate(zip(padded, masks)):
    print(f"  Seq {i}: {seq}")
    print(f"  Mask:  {mask}")
    print()

# Output:
# Padded sequences:
#   Seq 0: [2, 15, 8, 12, 3, 0]
#   Mask:  [1, 1, 1, 1, 1, 0]
#
#   Seq 1: [2, 15, 8, 3, 0, 0]
#   Mask:  [1, 1, 1, 1, 0, 0]
#
#   Seq 2: [2, 15, 8, 12, 14, 3]
#   Mask:  [1, 1, 1, 1, 1, 1]

# Convert to PyTorch tensors (preview of what we'll do in CH4)
import torch

padded_tensor = torch.tensor(padded)
mask_tensor = torch.tensor(masks)

print(f"Batch tensor shape: {padded_tensor.shape}")
# Output: Batch tensor shape: torch.Size([3, 6])
# 3 sequences, each 6 tokens long

print(f"Mask tensor shape:  {mask_tensor.shape}")
# Output: Mask tensor shape:  torch.Size([3, 6])

Summary

In this chapter, we solved the fundamental problem of getting text into a neural network:

  1. Character-level tokenization is the simplest approach—tiny vocabulary, handles any text—but produces very long sequences and loses word-level meaning.

  2. Word-level tokenization preserves meaning and creates short sequences, but suffers from vocabulary explosion and can’t handle unknown words.

  3. Byte Pair Encoding (BPE) finds the sweet spot: it starts with characters and iteratively merges the most frequent pairs. Common words become single tokens; rare words split into recognizable subwords. This is what GPT models use.

  4. Special tokens like [PAD], [UNK], [BOS], and [EOS] serve structural roles—marking sequence boundaries, handling unknowns, and enabling batched processing.

We built a complete tokenizer from scratch that can train on a corpus, encode text to integers, and decode integers back to text. This is the exact pipeline that feeds text into every language model.

In the next chapter, we’ll take these integer token IDs and convert them into dense vector representations called embeddings—the actual numerical form that neural networks work with internally.