Training a Tokenizer for BERT Models

Overview

BERT, an early transformer model, is small enough to train on a home computer but requires a tokenizer to convert text into integer tokens. This article demonstrates training a WordPiece tokenizer, following BERT’s original design, using the tokenizers library.

Training a custom tokenizer allows for optimization to a specific dataset, improving model performance beyond pre-trained options, but necessitates careful configuration of normalization and special tokens. A poorly trained tokenizer can significantly degrade model accuracy and increase training time.

Key Insights

WikiText dataset: A popular preprocessed dataset for NLP experiments available through Hugging Face.
WordPiece vs. BPE: BERT utilizes WordPiece tokenization, while modern LLMs often employ Byte-Pair Encoding (BPE).
tokenizers library: Simplifies tokenizer implementation, saving development effort and providing configuration options (Hugging Face).

Working Example

import tokenizers
from datasets import load_dataset

path, name = "wikitext", "wikitext-103-raw-v1"
vocab_size = 30522
dataset = load_dataset(path, name, split="train")

texts = []
for line in dataset["text"]:
    line = line.strip()
    if line and not line.startswith("="):
        texts.append(line)

tokenizer = tokenizers.Tokenizer(tokenizers.models.WordPiece())
tokenizer.pre_tokenizer = tokenizers.pre_tokenizers.Whitespace()
tokenizer.decoder = tokenizers.decoders.WordPiece()
tokenizer.normalizer = tokenizers.normalizers.NFKC()
tokenizer.trainer = tokenizers.trainers.WordPieceTrainer(
    vocab_size=vocab_size,
    special_tokens=["[PAD]", "[CLS]", "[SEP]", "[MASK]"]
)

tokenizer.train_from_iterator(texts, trainer=tokenizer.trainer)
tokenizer.enable_padding(pad_id=tokenizer.token_to_id("[PAD]"), pad_token="[PAD]")
tokenizer_path = f"{dataset_name}_wordpiece.json"
tokenizer.save(tokenizer_path, pretty=True)

tokenizer = tokenizers.Tokenizer.from_file(tokenizer_path)
print(tokenizer.encode("Hello, world!").tokens)
print(tokenizer.decode(tokenizer.encode("Hello, world!").ids))

Practical Applications

Fine-tuning BERT: Train a tokenizer on domain-specific text data to improve BERT’s performance on specialized tasks (e.g., medical text analysis).
Pitfall: Using a vocabulary size that is too small can lead to many unknown tokens, hindering model performance; conversely, a size that is too large can increase computational cost.

References:

On This Page

Overview

Key Insights

Working Example

Practical Applications

Continue reading

Related Content

BERT Models and Variants: A Technical Overview

Fine-Tuning BERT for NLP Tasks: GLUE and SQuAD Code Examples

Tokenization in Transformers v5: Simpler, Clearer, and More Modular