Practical Considerations — Debugging, Ethics, and What's Next
This final chapter addresses practical challenges in LLM development. Training debugging covers loss curve interpretation, gradient norm monitoring, and systematic diagnosis of common issues (NaN loss, plateaus, repetitive generation). Overfitting and underfitting are explained with mitigation strategies including dropout and weight decay. Data quality guidelines emphasize cleaning, deduplication, and filtering. Ethical considerations address bias in training data, responsible development practices, and safety measures. The chapter concludes with a roadmap for continued learning including Hugging Face, research papers, and open-source models, plus a final project challenge.
Over the past ten chapters, you built a large language model from scratch. You tokenized text, constructed embeddings, implemented multi-head attention, assembled Transformer blocks, trained on real data, and fine-tuned the result into a model that follows instructions. That’s an extraordinary amount of ground to cover.
But if you’ve actually run the code—and I hope you have—you’ve probably encountered moments where things went wrong. The loss spiked to infinity. The model generated the same word over and over. Training seemed to stall for no reason. These problems aren’t signs of failure; they’re the normal experience of training neural networks. Every researcher and engineer who has ever trained a model has stared at a loss curve wondering what went wrong.
This final chapter is about the practical reality of working with LLMs. We’ll cover how to debug training, how to think about data quality, the ethical considerations that come with building systems that generate human-like text, and where to go from here. Think of it as the chapter that bridges the gap between “I understand how LLMs work” and “I can work with LLMs effectively.”
1. Debugging Training: Reading Loss Curves
The loss curve is your primary diagnostic tool during training. It tells you, at a glance, whether your model is learning, struggling, or broken. Let’s build a function to plot it, then learn to read the shapes.
Plotting Training and Validation Loss
import matplotlib.pyplot as plt
def plot_training_curves(train_losses, val_losses=None,
                         title="Training Progress",
                         save_path=None):
    """
    Plot training and validation loss curves.
    Args:
        train_losses: List of training loss values per epoch/step.
        val_losses: Optional list of validation loss values.
        title: Plot title.
        save_path: If provided, save the plot to this file path.
    """
    plt.figure(figsize=(10, 6))
    steps = range(1, len(train_losses) + 1)
    plt.plot(steps, train_losses, label="Training Loss",
             color="blue", linewidth=2)
    if val_losses is not None:
        val_steps = range(1, len(val_losses) + 1)
        plt.plot(val_steps, val_losses, label="Validation Loss",
                 color="orange", linewidth=2, linestyle="--")
    plt.xlabel("Step", fontsize=12)
    plt.ylabel("Loss", fontsize=12)
    plt.title(title, fontsize=14)
    plt.legend(fontsize=11)
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    if save_path:
        plt.savefig(save_path, dpi=150)
        print(f"Plot saved to {save_path}")
    plt.show()
# Example: a healthy training run
import math
healthy_train = [4.2, 3.8, 3.3, 2.9, 2.5, 2.2, 1.9, 1.7, 1.6, 1.5]
healthy_val = [4.3, 3.9, 3.5, 3.1, 2.7, 2.5, 2.3, 2.2, 2.1, 2.1]
plot_training_curves(healthy_train, healthy_val,
                     title="Healthy Training Run")
What Does a Good Training Curve Look Like?
A healthy training curve has three characteristics:
- Steady downward trend. The loss decreases consistently, especially in the early steps when the model is learning the most basic patterns.
- Training and validation loss move together. They don’t have to be identical—training loss is usually slightly lower—but they should follow the same general trajectory.
- Gradual flattening. As the model learns, the easy improvements are captured first. The curve flattens as the remaining gains become smaller. This is normal, not a problem.
What Does Overfitting Look Like?
# Overfitting example
overfit_train = [4.2, 3.5, 2.8, 2.1, 1.5, 1.0, 0.6, 0.3, 0.15, 0.05]
overfit_val = [4.3, 3.6, 3.0, 2.7, 2.6, 2.7, 2.9, 3.2, 3.5, 3.9]
plot_training_curves(overfit_train, overfit_val,
title="Overfitting — Training vs Validation Diverge")
The signature of overfitting is unmistakable: training loss keeps dropping while validation loss reverses direction and starts climbing. The model is memorizing the training data rather than learning general patterns. It’s getting better at predicting the exact sequences it has already seen, but worse at predicting anything new.
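The turning point can also be located programmatically. The following is a minimal sketch rather than part of the book's code: the helper name `find_overfitting_onset` and its `tolerance` parameter are illustrative assumptions. It reports the step where validation loss was lowest, but only if the loss later rises above that minimum.

```python
def find_overfitting_onset(val_losses, tolerance=0.0):
    """
    Return the 1-based step with the lowest validation loss, or None if
    validation loss never rises more than `tolerance` above that minimum.
    (Hypothetical helper for illustration.)
    """
    if not val_losses:
        return None
    # Index of the first occurrence of the minimum validation loss
    best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
    later = val_losses[best_index + 1:]
    if later and max(later) > val_losses[best_index] + tolerance:
        return best_index + 1
    return None


overfit_val = [4.3, 3.6, 3.0, 2.7, 2.6, 2.7, 2.9, 3.2, 3.5, 3.9]
healthy_val = [4.3, 3.9, 3.5, 3.1, 2.7, 2.5, 2.3, 2.2, 2.1, 2.1]

print(find_overfitting_onset(overfit_val))  # 5: validation loss bottoms out at step 5
print(find_overfitting_onset(healthy_val))  # None: validation loss never turns upward
```

The step it returns is a reasonable candidate for the checkpoint to keep, since later checkpoints fit the training set better but generalize worse.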
What Does a Learning Rate That’s Too High Look Like?
# Learning rate too high
import math
lr_high_train = [4.2, 5.1, 3.8, 6.2, 4.5, 7.8, 5.2, float('inf'),
float('nan'), float('nan')]
# For plotting, replace inf/nan with large values
lr_high_plot = [v if math.isfinite(v) else 10.0
for v in lr_high_train]
plot_training_curves(lr_high_plot, title="Learning Rate Too High — Loss Spikes")
When the learning rate is too high, the loss doesn’t decrease smoothly—it oscillates wildly, spiking up and down before often exploding to infinity or becoming NaN. The model’s weight updates are so large that they overshoot the optimal values, bouncing around the loss landscape like a ball thrown too hard.
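Both symptoms, oscillation and eventual blow-up, can be checked from a logged list of losses. The sketch below is illustrative only; `first_bad_step` and `fraction_increasing` are hypothetical helper names, and how large a fraction of upward steps counts as "oscillating" is a judgment call.

```python
import math


def first_bad_step(losses):
    """Return the 1-based step of the first inf/NaN loss, or None if all are finite."""
    for step, loss in enumerate(losses, start=1):
        if not math.isfinite(loss):
            return step
    return None


def fraction_increasing(losses):
    """Fraction of consecutive finite loss pairs where the loss went UP."""
    pairs = [(a, b) for a, b in zip(losses, losses[1:])
             if math.isfinite(a) and math.isfinite(b)]
    if not pairs:
        return 0.0
    return sum(b > a for a, b in pairs) / len(pairs)


lr_high_train = [4.2, 5.1, 3.8, 6.2, 4.5, 7.8, 5.2, float("inf"),
                 float("nan"), float("nan")]
healthy_train = [4.2, 3.8, 3.3, 2.9, 2.5, 2.2, 1.9, 1.7, 1.6, 1.5]

print(first_bad_step(lr_high_train))       # 8
print(fraction_increasing(lr_high_train))  # 0.5: half the steps made things worse
print(fraction_increasing(healthy_train))  # 0.0
```

Calling a check like `first_bad_step` inside the training loop lets you stop a run as soon as the loss becomes non-finite instead of wasting compute on NaN updates.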
Monitoring Gradient Norms
Loss curves tell you what happened. Gradient norms tell you why. By tracking the magnitude of gradients during training, you can detect two deadly problems before they crash your run:
- Exploding gradients: Gradient norms grow exponentially, leading to NaN loss.
- Vanishing gradients: Gradient norms shrink to near zero, causing the model to stop learning.
import torch
def compute_gradient_norm(model):
    """
    Compute the total L2 norm of all gradients in the model.
    Call this AFTER loss.backward() but BEFORE optimizer.step().
    """
    total_norm = 0.0
    for param in model.parameters():
        if param.grad is not None:
            total_norm += param.grad.data.norm(2).item() ** 2
    total_norm = total_norm ** 0.5
    return total_norm
def training_step_with_monitoring(model, batch, optimizer,
                                  criterion, grad_norms_log):
    """
    A single training step that also monitors gradient norms.
    """
    optimizer.zero_grad()
    outputs = model(batch["input_ids"])
    loss = criterion(
        outputs.view(-1, outputs.size(-1)),
        batch["labels"].view(-1)
    )
    loss.backward()
    # Monitor gradient norm BEFORE clipping
    grad_norm = compute_gradient_norm(model)
    grad_norms_log.append(grad_norm)
    # Clip gradients to prevent explosions
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()
# After training, you can plot gradient norms:
# plot_training_curves(grad_norms_log,
#                      title="Gradient Norms During Training")
#
# Healthy: norms stay in a consistent range (e.g., 0.5-5.0)
# Exploding: norms grow rapidly (10 → 100 → 1000 → ...)
# Vanishing: norms shrink to near zero (0.1 → 0.01 → 0.001)
A practical rule of thumb: if gradient norms are consistently above 10, consider lowering your learning rate. If they’re consistently below 0.01, your model is barely learning—check your architecture and learning rate.
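This rule of thumb is easy to automate over a logged list of norms. The sketch below is one possible reading of "consistently": every value in a recent window must cross the threshold. The function name, the `window` size, and the returned labels are assumptions made for illustration.

```python
import math


def diagnose_grad_norms(grad_norms, high=10.0, low=0.01, window=50):
    """
    Classify the most recent gradient norms using the thresholds above.
    Returns one of: "no data", "non-finite", "too large", "too small", "ok".
    """
    recent = grad_norms[-window:]
    if not recent:
        return "no data"
    if any(not math.isfinite(n) for n in recent):
        return "non-finite"   # inf/NaN gradients: the run has already diverged
    if all(n > high for n in recent):
        return "too large"    # consider lowering the learning rate
    if all(n < low for n in recent):
        return "too small"    # check architecture and learning rate
    return "ok"


print(diagnose_grad_norms([0.8, 1.2, 2.0, 1.5]))   # ok
print(diagnose_grad_norms([12.0, 40.0, 150.0]))    # too large
print(diagnose_grad_norms([0.005, 0.002, 0.001]))  # too small
```

Requiring the whole window to cross a threshold avoids reacting to a single noisy batch; a stricter or looser criterion may suit your run better.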
2. Overfitting vs. Underfitting
These are the two fundamental failure modes of any machine learning model, and understanding them deeply is essential for training LLMs.
The Analogy
Imagine two students preparing for an exam.
The Overfitter memorizes every question and answer from last year’s practice exams, word for word. When the actual exam contains exactly those same questions, they score 100%. But when the exam has new questions—even on the same topics—they fail miserably. They learned the answers but not the subject.
The Underfitter glances at the textbook the night before and calls it done. They haven’t absorbed enough of the material to answer any questions well, whether they’re from the practice exams or new ones. They fail on everything.
The ideal student studies enough to understand the underlying patterns. They can answer both familiar and unfamiliar questions because they’ve learned the principles, not just the examples.
A language model behaves the same way. An overfit model produces perfect output for its training data but generates nonsense for anything else. An underfit model generates poor output for everything—it hasn’t trained long enough or doesn’t have enough capacity to capture the patterns in language.
Detecting Overfitting in Practice
You detect overfitting by comparing training loss to validation loss, as we saw in the plots above. But there’s also a behavioral test: generate text with your model and see if it’s reproducing chunks of training data verbatim. If your model completes “To be or not to be” with the exact next paragraph from your Shakespeare training file, it may be memorizing rather than generalizing.
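The behavioral test can be approximated by measuring how many word n-grams in a generated sample also appear in the training text. This is a rough sketch under simplifying assumptions: whitespace tokenization, exact matching, and a hypothetical function name `memorization_rate`. The value of `n` and what counts as "too much" overlap are choices you would tune.

```python
def ngram_set(tokens, n):
    """All contiguous n-grams in a token list, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def memorization_rate(generated, training_text, n=8):
    """
    Fraction of distinct word n-grams in `generated` that also occur
    verbatim in `training_text`. 1.0 means everything was copied.
    """
    gen = ngram_set(generated.split(), n)
    if not gen:
        return 0.0  # sample shorter than n words
    train = ngram_set(training_text.split(), n)
    return len(gen & train) / len(gen)


training_text = ("to be or not to be that is the question "
                 "whether tis nobler in the mind to suffer")
copied = "to be or not to be that is the question"
novel = "to be or not to think about the answer at all"

print(memorization_rate(copied, training_text, n=4))  # 1.0
print(memorization_rate(novel, training_text, n=4))   # 0.25
```

A short shared prefix is expected when you prompt with training text; the warning sign is a high rate over long samples and larger values of `n`.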
Techniques to Combat Overfitting
Dropout — Randomly “turn off” neurons during training, forcing the model to not rely on any single pathway.
import torch
import torch.nn as nn
class TransformerBlockWithDropout(nn.Module):
    """
    A simplified Transformer block showing where dropout is applied.
    """
    def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
        super().__init__()
        # batch_first=True so inputs are shaped (batch, seq_len, d_model)
        self.attention = nn.MultiheadAttention(d_model, n_heads,
                                               dropout=dropout,
                                               batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Dropout(dropout),  # Dropout in feed-forward
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout),  # Dropout after feed-forward
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)  # Dropout after attention
    def forward(self, x, mask=None):
        # Self-attention with dropout
        attn_out, _ = self.attention(x, x, x, attn_mask=mask)
        x = self.norm1(x + self.dropout(attn_out))
        # Feed-forward with dropout (built into self.ff)
        ff_out = self.ff(x)
        x = self.norm2(x + ff_out)
        return x
# During training, dropout is active:
block = TransformerBlockWithDropout(d_model=128, n_heads=4,
                                    d_ff=512, dropout=0.1)
block.train()  # Dropout is ON
# During evaluation, dropout is disabled:
block.eval()  # Dropout is OFF: all neurons participate
Typical dropout rates for LLMs range from 0.0 to 0.2. GPT-2 used 0.1. Larger models often use lower dropout or none at all because they have so much data that overfitting is less of a concern.
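To see what the train/eval switch actually changes, here is a pure-Python sketch of "inverted" dropout. It mirrors the behavior documented for PyTorch's `nn.Dropout` (survivors are scaled by 1/(1-p) during training, and the layer is a no-op in evaluation), but it is an illustration written for this section, not PyTorch's implementation.

```python
import random


def dropout(values, p=0.1, training=True, rng=random):
    """
    Inverted dropout on a list of floats.
    Training: zero each value with probability p, scale survivors by 1/(1-p).
    Evaluation: return the values unchanged.
    """
    if not training or p == 0.0:
        return list(values)
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
activations = [1.0] * 10000

train_out = dropout(activations, p=0.2, training=True, rng=rng)
eval_out = dropout(activations, p=0.2, training=False)

print(sum(train_out) / len(train_out))  # close to 1.0: the scaling preserves the mean
print(eval_out == activations)          # True: nothing is dropped at evaluation time
```

The rescaling is why no adjustment is needed at inference: the expected activation is the same in both modes.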
Weight Decay — Shrinks every weight slightly toward zero at each update, which discourages large weights and keeps the model simpler.
# Weight decay is applied through the optimizer
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    weight_decay=0.01  # Decoupled weight decay coefficient
)
# AdamW applies weight decay directly to the weights, decoupled
# from the gradient. Regular Adam with weight_decay instead adds an
# L2 penalty to the gradient, which behaves slightly differently
# under adaptive scaling. Use AdamW for LLMs.
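The difference between the two can be written out in one line each. The notation here is an assumption for illustration: $\theta$ are the weights, $\eta$ the learning rate, $\lambda$ the weight decay coefficient, and $\hat m_t$, $\hat v_t$ Adam's bias-corrected first and second moment estimates of $g_t$.

$$
\text{Adam with L2:}\quad g_t = \nabla L(\theta_t) + \lambda\,\theta_t, \qquad \theta_{t+1} = \theta_t - \eta\,\frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}
$$

$$
\text{AdamW:}\quad g_t = \nabla L(\theta_t), \qquad \theta_{t+1} = \theta_t - \eta\left(\frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon} + \lambda\,\theta_t\right)
$$

In the first form the penalty term passes through the adaptive denominator, so weights with large gradient histories are decayed less. In the second form every weight is shrunk by the same factor $\eta\lambda$, independent of its gradient statistics.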
More Data — The most reliable cure for overfitting. If your model has memorized all the patterns in your dataset, give it more patterns to learn from. This is why LLM pretraining uses trillions of tokens—it’s very hard to overfit a 7-billion-parameter model when it sees each training example only once.
Detecting Underfitting
Underfitting is simpler to diagnose: both training and validation loss remain high. The model simply isn’t learning. Causes include:
- Model is too small (not enough parameters to capture the patterns).
- Learning rate is too low (updates are too small to make progress).
- Training time is too short (the model needs more epochs).
- Data is too noisy (there are no consistent patterns to learn).
The fix is usually the opposite of the overfitting fixes: bigger model, higher learning rate, longer training, or cleaner data.
3. Common Training Issues and Fixes
After training dozens of models, certain failure patterns become recognizable. Here’s a diagnostic table for the most common problems:
| Problem | Symptoms | Likely Cause | Fix |
|---|---|---|---|
| Loss not decreasing | Loss stays flat from the start | Learning rate too low, bug in data pipeline, or frozen weights | Increase learning rate by 10x, verify data loading, check that requires_grad=True |
| Loss is NaN | Loss becomes nan after a few steps | Learning rate too high, numerical overflow, or division by zero | Reduce learning rate, add gradient clipping, check for log(0) in loss |
| Loss spikes then recovers | Occasional sharp increases | Bad batches in data, learning rate too high | Add gradient clipping, inspect data for outliers, reduce learning rate |
| Loss plateaus | Loss decreases, then stops improving | Model capacity reached, learning rate too high for fine details | Try learning rate warmup + cosine decay, increase model size, check data diversity |
| Model outputs garbage | Generated text is random characters | Early training (normal), or broken tokenizer | Wait longer, verify tokenizer encode/decode roundtrip |
| Model is repetitive | Generates the same phrase over and over | Overfitting, or generation parameters too greedy | Raise temperature or use top-k/top-p sampling, add repetition penalty, check training data for duplicates |
| Model copies training data | Generated text matches training data verbatim | Severe overfitting, small dataset | Add dropout, increase dataset size, reduce training epochs |
| Training is very slow | Each step takes minutes | Batch size too large for GPU, model too big | Reduce batch size, use gradient accumulation, use mixed precision |
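The "warmup + cosine decay" fix in the table refers to a learning rate schedule that ramps up linearly and then decays along a cosine curve. The following is a minimal sketch of one common formulation; the function name and all default values are illustrative assumptions, not settings taken from earlier chapters.

```python
import math


def lr_at_step(step, max_lr=3e-4, min_lr=3e-5,
               warmup_steps=100, total_steps=1000):
    """Learning rate for a 0-based step under linear warmup + cosine decay."""
    if step < warmup_steps:
        # Linear ramp from max_lr / warmup_steps up to max_lr
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return min_lr
    # progress goes from 0.0 (end of warmup) to 1.0 (end of training)
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (max_lr - min_lr) * cosine


for step in (0, 50, 99, 100, 550, 999, 1000):
    print(f"step {step:4d}: lr = {lr_at_step(step):.2e}")
```

The large early steps help escape the initial plateau, while the small late steps let the model settle into fine details, which is why this schedule is a common first thing to try when loss stalls.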
A Debugging Checklist
When something goes wrong, work through this list systematically:
def debug_checklist():
    """
    A mental model for debugging training issues.
    Print this and tape it to your monitor.
    """
    checklist = """
TRAINING DEBUG CHECKLIST
========================
1. DATA
  [ ] Can you load a single batch without errors?
  [ ] Does tokenizer.decode(tokenizer.encode(text)) == text?
  [ ] Are labels shifted correctly (predict next token)?
  [ ] Is the data shuffled?
  [ ] Are there any empty or corrupted examples?
2. MODEL
  [ ] Does a forward pass produce output of the right shape?
  [ ] Are all parameters on the same device (CPU/GPU)?
  [ ] Is model.train() called before training?
  [ ] Is model.eval() called before validation?
3. OPTIMIZATION
  [ ] Is the learning rate reasonable? (1e-4 to 3e-4 for Adam)
  [ ] Is gradient clipping enabled? (max_norm=1.0)
  [ ] Is the loss function correct for the task?
  [ ] Are you calling optimizer.zero_grad() each step?
  [ ] Are you calling loss.backward() then optimizer.step()?
4. MONITORING
  [ ] Are you logging training loss every N steps?
  [ ] Are you computing validation loss periodically?
  [ ] Are you saving checkpoints?
  [ ] Are you generating sample text to inspect quality?
"""
    print(checklist)
debug_checklist()
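Two of the DATA items can be verified in a few lines. The sketch below uses a toy character-level tokenizer as a stand-in (an assumption; substitute your own `encode` and `decode`) and shows what "labels shifted correctly" means: the label at position i is the input token at position i + 1.

```python
def make_next_token_pairs(token_ids):
    """Split a token sequence into (inputs, labels) for next-token prediction."""
    if len(token_ids) < 2:
        raise ValueError("need at least two tokens to form an input/label pair")
    return token_ids[:-1], token_ids[1:]


def roundtrip_ok(encode, decode, text):
    """True if decoding the encoded text reproduces it exactly."""
    return decode(encode(text)) == text


# Toy character-level tokenizer, used only for this illustration
def encode(text):
    return [ord(ch) for ch in text]


def decode(ids):
    return "".join(chr(i) for i in ids)


text = "hello"
inputs, labels = make_next_token_pairs(encode(text))

print(decode(inputs))   # hell
print(decode(labels))   # ello
print(roundtrip_ok(encode, decode, "Tokenizers should round-trip."))  # True
```

An off-by-one error here is easy to miss: if inputs and labels are identical, the model learns to copy its input and the loss drops suspiciously fast.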
4. Data Quality Matters
There’s a saying in machine learning that’s so old it’s practically a proverb: garbage in, garbage out. For LLMs, this is more true than for any other kind of model. Your model will faithfully learn whatever patterns exist in your data—including the patterns you don’t want.
Why Data Quality is Critical
A language model doesn’t know what “good” text is. It simply learns statistical patterns. If your training data is 30% spam emails, your model will learn to generate spam. If your data contains factual errors, your model will confidently repeat those errors. If your data has a lot of duplicate content, your model will memorize those duplicates instead of learning diverse language patterns.
The difference between a good LLM and a mediocre one is often not the architecture—it’s the data.
Practical Data Cleaning Guidelines
Deduplication — Remove duplicate or near-duplicate documents. The internet is full of scraped content that appears on dozens of sites. If your model sees the same article 50 times, it will memorize it rather than learn from it.
import hashlib
def deduplicate_documents(documents):
    """
    Remove exact duplicate documents using content hashing.
    Args:
        documents: List of document strings.
    Returns:
        List of unique documents.
    """
    seen_hashes = set()
    unique_docs = []
    for doc in documents:
        # Normalize whitespace before hashing
        normalized = " ".join(doc.split())
        doc_hash = hashlib.sha256(normalized.encode()).hexdigest()
        if doc_hash not in seen_hashes:
            seen_hashes.add(doc_hash)
            unique_docs.append(doc)
    removed = len(documents) - len(unique_docs)
    # Guard against division by zero when the input list is empty
    percent = removed / len(documents) * 100 if documents else 0.0
    print(f"Removed {removed} duplicates ({percent:.1f}%)")
    return unique_docs
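Hashing only catches exact duplicates (after whitespace normalization). Near-duplicates, such as the same article with a changed byline, need a similarity measure. One simple option is Jaccard similarity over word shingles, sketched below. The function names, the shingle size, and the 0.8 threshold are illustrative assumptions, and the pairwise loop is quadratic, so large corpora typically use approximate methods such as MinHash with locality-sensitive hashing instead.

```python
def shingles(text, n=3):
    """Set of lowercase word n-grams ("shingles") in a document."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicate_pairs(documents, threshold=0.8, n=3):
    """Return index pairs (i, j), i < j, whose shingle similarity >= threshold."""
    sets = [shingles(doc, n) for doc in documents]
    pairs = []
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            if jaccard(sets[i], sets[j]) >= threshold:
                pairs.append((i, j))
    return pairs


docs = [
    "the quick brown fox jumps over the lazy dog near the river bank today",
    "the quick brown fox jumps over the lazy dog near the river bank tonight",
    "language models learn statistical patterns from large text corpora",
]
print(near_duplicate_pairs(docs))  # [(0, 1)]
```

The first two documents differ by a single word, so 11 of their 13 distinct shingles are shared and the pair is flagged.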
Quality Filtering — Not all text is equally useful for training. Short fragments, boilerplate text, and machine-generated noise hurt model quality.
def quality_filter(documents, min_length=100,
                   max_length=100000,
                   min_unique_words=20):
    """
    Filter documents by basic quality heuristics.
    Args:
        documents: List of document strings.
        min_length: Minimum character count.
        max_length: Maximum character count.
        min_unique_words: Minimum number of distinct words.
    Returns:
        List of documents that pass all filters.
    """
    filtered = []
    reasons = {"too_short": 0, "too_long": 0,
               "low_diversity": 0, "passed": 0}
    for doc in documents:
        if len(doc) < min_length:
            reasons["too_short"] += 1
            continue
        if len(doc) > max_length:
            reasons["too_long"] += 1
            continue
        words = doc.lower().split()
        unique = set(words)
        if len(unique) < min_unique_words:
            reasons["low_diversity"] += 1
            continue
        reasons["passed"] += 1
        filtered.append(doc)
    print("Quality filter results:")
    for reason, count in reasons.items():
        print(f"  {reason}: {count}")
    return filtered
Language Filtering — If you want an English model, you need to filter out non-English content. Web scrapes often contain a mix of languages.
Sensitive Content Removal — Remove personal information (names, addresses, phone numbers), toxic content, and other material you don’t want your model to learn. This is both an ethical and a practical concern—a model trained on toxic content will generate toxic output.
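As a starting point for personal-information removal, pattern-based redaction can catch the most regular cases. The sketch below is deliberately crude: the two regular expressions are assumptions that cover common email and North American phone formats only, and they will miss names, addresses, and many international formats. Treat it as a first pass, not a complete solution.

```python
import re

# Illustrative patterns only: real PII detection needs far broader coverage
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")


def redact_pii(text):
    """Replace email addresses and phone numbers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


sample = "Contact jane.doe@example.com or call 555-123-4567."
print(redact_pii(sample))
# Contact [EMAIL] or call [PHONE].
```

Replacing matches with placeholder tokens rather than deleting them keeps sentences grammatical, so the model still sees natural text structure.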
The Scale of the Problem
Modern LLMs are trained on datasets measured in terabytes. At that scale, manual inspection is impossible. The teams behind models like LLaMA and GPT-4 spend enormous effort on automated data pipelines that clean, filter, deduplicate, and categorize data before a single training step occurs. For your own projects at smaller scale, even simple filtering like the functions above can make a meaningful difference.
5. Ethical Considerations and Biases
Building technology that generates human-like text comes with responsibilities. This section isn’t about telling you what to do—it’s about making sure you understand what can go wrong and how to think about it.
LLMs Learn Biases from Data
A language model learns the statistical patterns in its training data. If that data reflects societal biases—and it does, because all large text corpora contain biased content—the model will learn those biases. This isn’t a bug in the algorithm; it’s a direct consequence of how the model works.
For example, if a model trained on internet text is asked to complete the sentence “The doctor walked into the room and…” it might assign “he” a higher probability than “she,” simply because historical text more often describes doctors as male. The model isn’t being intentionally sexist; it’s reflecting the statistical patterns it learned.
This becomes a real problem when LLMs are deployed in applications that affect people’s lives: hiring tools, content moderation, customer service, educational systems, and more.
Examples of How Bias Manifests
- Stereotyping: The model associates certain professions, traits, or behaviors with specific genders, races, or nationalities.
- Underrepresentation: The model generates less coherent or relevant text about topics, communities, or languages that were poorly represented in its training data.
- Toxicity amplification: If toxic content exists in the training data, the model can generate similar toxic content, sometimes in contexts where it’s particularly harmful.
- Cultural assumptions: The model may treat one culture’s norms as universal, ignoring the diversity of human experience.
What Responsible Development Looks Like
There is no silver bullet for bias in LLMs, but there are concrete steps you can take:
- Audit your training data. Know what’s in it. If you’re scraping the web, understand that the web is not a neutral or representative sample of human knowledge.
- Test your model with diverse inputs. Don’t just test whether it generates grammatically correct text. Test whether it generates fair, balanced text across different demographics and topics.
- Be transparent about limitations. If you build an application using an LLM, communicate clearly that the outputs may contain errors or biases. Don’t present model-generated text as objective truth.
- Include safety measures. Content filtering, moderation layers, and human review processes can catch harmful outputs before they reach users.
- Consider who benefits and who is harmed. Every technology decision has tradeoffs. Before deploying an LLM application, think about whether it could be used in ways you didn’t intend and what the consequences might be.
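Testing with diverse inputs can be made systematic by filling one prompt template with different group terms and comparing a score across the variants. The sketch below shows only the scaffolding. The `score_fn` argument is hypothetical: in practice it would run your model on the prompt and measure something meaningful, such as the sentiment or toxicity of the completion. The `toy_score` stand-in used here just counts words so the example runs on its own.

```python
def score_by_group(template, groups, score_fn):
    """
    Fill `template` (which must contain a {group} slot) with each group term,
    score every resulting prompt, and report the spread between groups.
    """
    if not groups:
        raise ValueError("groups must not be empty")
    scores = {group: score_fn(template.format(group=group)) for group in groups}
    spread = max(scores.values()) - min(scores.values())
    return scores, spread


def toy_score(prompt):
    # Placeholder metric for illustration only: counts words in the prompt.
    # Replace with a function that scores your model's actual output.
    return len(prompt.split())


scores, spread = score_by_group(
    "The {group} engineer explained the design.",
    ["young", "older", "female", "male"],
    toy_score,
)
print(scores)
print(spread)  # 0 here, because the toy metric ignores the group term
```

A large spread on a real metric does not prove harmful bias by itself, but it tells you where to look more closely.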
A Note on Perspective
The ethical questions surrounding LLMs don’t have easy answers, and reasonable people disagree about the right approaches. What matters is that you think about these questions as part of your development process, not as an afterthought. The fact that you’ve built a model from scratch—that you understand how it works at every level—puts you in a better position than most to reason about what it can and cannot do, and where its outputs should and shouldn’t be trusted.
6. The Road Ahead
You’ve built a complete LLM from scratch. That’s genuinely impressive, but it’s also just the beginning. The field of language modeling is evolving at a pace unlike anything in the history of computer science. Here’s a map of what to explore next.
Hugging Face Transformers
The Hugging Face transformers library is the standard tool for working with pretrained models. It provides:
- Thousands of pretrained models you can download and use immediately.
- A consistent API for tokenization, inference, and fine-tuning.
- Integration with datasets, evaluation metrics, and training utilities.
# What working with Hugging Face looks like
# (requires: pip install transformers torch)
# from transformers import AutoTokenizer, AutoModelForCausalLM
#
# # Load a pretrained model in two lines
# tokenizer = AutoTokenizer.from_pretrained("gpt2")
# model = AutoModelForCausalLM.from_pretrained("gpt2")
#
# # Generate text
# inputs = tokenizer("The future of AI is", return_tensors="pt")
# outputs = model.generate(**inputs, max_new_tokens=50)
# print(tokenizer.decode(outputs[0]))
Because you’ve built everything from scratch, you’ll understand what these library calls are doing under the hood. When a pretrained model’s attention layer runs, you know it’s doing the same QKV projections and softmax you implemented in Chapter 5. That understanding is invaluable for debugging and customization.
Reading Research Papers
Research papers are how new ideas enter the field. They can feel intimidating at first, but they become accessible with practice. A strategy for approaching them:
- Read the abstract and introduction. These tell you what problem the paper solves and why it matters. If it’s not relevant to you, move on.
- Look at the figures and tables. These often communicate the key results more clearly than the text.
- Read the method section. This is the technical core. With the knowledge from this book, you can follow the math and architecture descriptions.
- Skip the proofs (at first). Mathematical proofs are important but rarely needed for practical understanding.
- Compare to what you know. “This is like our attention mechanism, but they add X.” Anchoring new ideas to things you’ve already built makes them easier to absorb.
Key papers worth reading:
- “Attention Is All You Need” (Vaswani et al., 2017) — The original Transformer paper. You’ve already implemented everything in it.
- “Language Models are Few-Shot Learners” (Brown et al., 2020) — The GPT-3 paper. Describes scaling and in-context learning.
- “LoRA: Low-Rank Adaptation of Large Language Models” (Hu et al., 2021) — The LoRA paper from Chapter 10.
- “Training language models to follow instructions” (Ouyang et al., 2022) — The InstructGPT/RLHF paper.
Open-Source Models
The open-source LLM ecosystem has exploded. Models you can download and experiment with:
- LLaMA / Llama 2 / Llama 3 (Meta) — High-quality open-weight models in various sizes (7B to 70B+ parameters).
- Mistral / Mixtral (Mistral AI) — Efficient models with innovations like sliding window attention and mixture of experts.
- Phi (Microsoft) — Smaller models (1.3B-14B) that punch above their weight through careful data curation.
- Gemma (Google) — Open models built on Gemini research.
- Qwen (Alibaba) — Strong multilingual models.
All of these use the same fundamental architecture you built in this book. The differences are in scale, training data, and specific design choices—but the core Transformer remains.
Tools and Frameworks
- PyTorch — You already know it. Remains the most popular framework for research.
- Hugging Face ecosystem — transformers, datasets, accelerate, peft (for LoRA and related techniques).
- vLLM — High-performance inference serving for LLMs.
- Weights & Biases / MLflow — Experiment tracking and visualization.
- LangChain / LlamaIndex — Frameworks for building applications on top of LLMs (retrieval-augmented generation, agents, etc.).
The Evolving Landscape
The field moves fast. Techniques that are state-of-the-art today may be superseded in months. But the fundamentals you’ve learned—attention, tokenization, embeddings, training loops, fine-tuning—remain stable. New architectures are variations on these themes, not replacements for them. Your understanding of the Transformer from first principles is a foundation that won’t become obsolete.
7. Book Summary: The Journey from Zero to LLM
Let’s take a moment to appreciate the complete path we’ve traveled:
Chapter 1: Why Build from Scratch? — We explored what language models actually are, why understanding them matters, and set expectations for the journey ahead.
Chapter 2: Setting Up — Python and PyTorch Fundamentals — We established our development environment, learned tensor operations, and built the fluency with PyTorch needed for everything that followed.
Chapter 3: Text Processing — Turning Words into Numbers — We tackled the fundamental problem of converting text to numbers. We built character-level and word-level tokenizers, then implemented Byte Pair Encoding (BPE)—the same algorithm used by GPT.
Chapter 4: Word Embeddings — We gave meaning to those numbers by mapping tokens into a continuous vector space where similar words live near each other. We implemented learnable embedding layers and positional encodings.
Chapter 5: The Attention Mechanism — The heart of the Transformer. We built self-attention from scratch, understanding how models learn which words to “pay attention to” when processing a sequence. Multi-head attention allowed the model to attend to multiple patterns simultaneously.
Chapter 6: Building the Transformer — We assembled all the pieces—attention, feed-forward networks, layer normalization, residual connections—into a complete Transformer architecture.
Chapter 7: Training Your First Language Model — We wrote the training loop, handled batching and data loading, implemented learning rate scheduling, and watched the model learn to generate text.
Chapter 8: Generating Text — We explored how to actually produce text from a trained model, implementing greedy decoding, temperature sampling, top-k, and top-p (nucleus) sampling.
Chapter 9: Scaling Up — We addressed the practical challenges of building larger models: multi-GPU training, mixed precision, gradient accumulation, and efficient attention mechanisms.
Chapter 10: Fine-tuning and Applications — We transformed a base model into a useful one through fine-tuning, implemented LoRA for efficient adaptation, and explored instruction tuning and RLHF.
Chapter 11: Practical Considerations — This chapter, tying everything together with debugging skills, data quality practices, ethical awareness, and a roadmap for continued growth.
You started with print("Hello, world!") and ended with a system that can generate coherent paragraphs of text, follow instructions, and answer questions. Every layer of abstraction between “raw text” and “generated language” passed through your hands. Very few people in the world—including many who work with LLMs daily—have that depth of understanding.
8. Final Project: Train Your Own LLM
Here is your capstone challenge. Use everything you’ve learned to build a complete system from end to end.
The Challenge
Train a small language model on a dataset of your choosing, then fine-tune it for a specific task.
Guidelines
Dataset Selection — Choose a domain that interests you. Options include:
- A collection of recipes (model learns to generate cooking instructions).
- A corpus of poetry (model learns poetic structure and language).
- Technical documentation for a programming language.
- Dialogue from a TV show or movie script.
- Scientific paper abstracts from a specific field.
Aim for at least 1-5 MB of clean text. More is better, but quality matters more than quantity.
Model Size — Keep it small enough to train on your available hardware:
- Minimal: 4 layers, 4 heads, d_model=128, d_ff=512 (~2M parameters). Trains in minutes on a CPU.
- Small: 6 layers, 8 heads, d_model=256, d_ff=1024 (~15M parameters). Needs a GPU for reasonable training time.
- Medium: 8 layers, 8 heads, d_model=512, d_ff=2048 (~50M parameters). Several hours on a consumer GPU.
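Before committing to a configuration, you can estimate its parameter count. The sketch below counts only the large weight matrices and ignores biases and layer norm parameters. The vocabulary sizes and `context_len` are assumptions (the list above does not specify them), and because the embedding table scales with vocabulary size, the totals only roughly track the figures quoted above.

```python
def estimate_params(n_layers, d_model, d_ff, vocab_size,
                    context_len=256, tied_embeddings=True):
    """Rough parameter count for a GPT-style decoder (weights only)."""
    attention = 4 * d_model * d_model     # Q, K, V and output projections
    feed_forward = 2 * d_model * d_ff     # two linear layers
    per_layer = attention + feed_forward
    # Token embeddings plus learned positional embeddings
    embeddings = vocab_size * d_model + context_len * d_model
    # If the output head shares weights with the token embeddings, it adds nothing
    output_head = 0 if tied_embeddings else vocab_size * d_model
    return n_layers * per_layer + embeddings + output_head


# Assumed vocabulary sizes: 10,000 for a custom BPE tokenizer,
# 50,257 when reusing GPT-2's tokenizer
print(f"{estimate_params(4, 128, 512, vocab_size=10_000):,}")    # 2,099,200
print(f"{estimate_params(6, 256, 1024, vocab_size=50_257):,}")   # 17,649,920
print(f"{estimate_params(8, 512, 2048, vocab_size=50_257):,}")   # 51,028,480
```

Note how much of a small model's budget goes to embeddings: with a 50,257-token vocabulary, the embedding table alone is larger than all the Transformer layers in the two smaller configurations.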
Training Plan:
- Clean and tokenize your dataset.
- Train a BPE tokenizer on your data (or reuse GPT-2’s tokenizer).
- Pretrain the model for at least 5-10 epochs.
- Monitor training and validation loss. Plot the curves.
- Generate sample text and inspect quality.
- Create a small fine-tuning dataset (50-200 examples in instruction-response format).
- Fine-tune the model, optionally using LoRA.
- Evaluate: does the fine-tuned model follow instructions better than the base model?
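Monitoring validation loss requires holding out part of the data before training starts. A minimal sketch of a contiguous split is shown below; the function name and the 10% default are assumptions. Taking the validation set from the end of the token stream, rather than sampling random positions, is one simple way to keep overlapping context windows from appearing in both sets.

```python
def split_train_val(token_ids, val_fraction=0.1):
    """Split a token sequence into contiguous training and validation parts."""
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be between 0 and 1")
    # Always hold out at least one token so the validation set is never empty
    n_val = max(1, int(len(token_ids) * val_fraction))
    split = len(token_ids) - n_val
    return token_ids[:split], token_ids[split:]


tokens = list(range(100))  # stand-in for a tokenized dataset
train_ids, val_ids = split_train_val(tokens)

print(len(train_ids), len(val_ids))  # 90 10
print(val_ids[0])                    # 90: validation starts where training ends
```

If your dataset is a collection of separate documents, splitting by document instead of by token position serves the same purpose.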
Evaluation Criteria — Ask yourself:
- Does the generated text look like it belongs to my chosen domain?
- Can the model complete domain-specific prompts coherently?
- Does the fine-tuned model follow the instruction format?
- What are the model’s most obvious failure modes?
There is no “correct” result. The goal is to experience the full pipeline—data preparation, pretraining, fine-tuning, evaluation—and to learn from whatever happens. The debugging skills from this chapter will almost certainly come in handy.
9. Exercises
Exercise 1: Diagnose the Training Run
You observe the following training log. Identify the problem and suggest a fix.
Step 1: loss = 8.52
Step 10: loss = 8.51
Step 100: loss = 8.49
Step 500: loss = 8.47
Step 1000: loss = 8.45
Step 5000: loss = 8.40
Solution
The loss is barely decreasing—it dropped only 0.12 over 5000 steps. This is underfitting, almost certainly caused by a learning rate that is far too low.
With a loss around 8.5, the model is performing close to random chance (for a vocabulary of ~5000 tokens, random guessing gives a loss of about $\ln(5000) \approx 8.52$). The weight updates are so tiny that the model is barely learning.
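The random-guessing baseline is worth computing for your own vocabulary before you start training, because it tells you what "has learned nothing" looks like on the loss axis. A small sketch (the helper names are illustrative):

```python
import math


def random_baseline_loss(vocab_size):
    """Cross-entropy of a uniform guess over the vocabulary: ln(vocab_size)."""
    return math.log(vocab_size)


def perplexity(loss):
    """Perplexity is the exponential of the (natural-log) cross-entropy loss."""
    return math.exp(loss)


print(round(random_baseline_loss(5000), 2))  # 8.52
print(round(perplexity(8.40)))               # roughly 4447 equally likely choices
```

Read through perplexity, a loss of 8.40 after 5000 steps means the model is still about as uncertain as if it were choosing among several thousand tokens at every position, which confirms that almost nothing has been learned.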
Fix: Increase the learning rate by a factor of 10-100. If you’re using 1e-6, try 1e-4 or 3e-4. Also verify that:
- Gradients are flowing (check gradient norms—they should be nonzero).
- All parameters have requires_grad=True.
- You’re calling loss.backward() and optimizer.step() in every iteration.
# Before (too low):
optimizer = torch.optim.Adam(model.parameters(), lr=1e-6)
# After (reasonable starting point):
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
Exercise 2: Implement Early Stopping
Write a class that monitors validation loss and stops training when the model starts overfitting (validation loss hasn’t improved for a specified number of evaluations).
Solution
class EarlyStopping:
    """
    Stop training when validation loss stops improving.

    Args:
        patience: Number of evaluations to wait for improvement
            before stopping.
        min_delta: Minimum decrease in validation loss to count
            as an improvement.
    """

    def __init__(self, patience=5, min_delta=0.01):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float('inf')
        self.counter = 0
        self.should_stop = False

    def check(self, val_loss):
        """
        Call after each validation evaluation.

        Args:
            val_loss: Current validation loss.

        Returns:
            True if training should stop, False otherwise.
        """
        if val_loss < self.best_loss - self.min_delta:
            # Improvement found: record it and reset the counter
            self.best_loss = val_loss
            self.counter = 0
            print(f" ✓ New best validation loss: {val_loss:.4f}")
        else:
            # No improvement (or improvement smaller than min_delta)
            self.counter += 1
            print(f" ✗ No improvement for {self.counter}/{self.patience} "
                  f"evaluations (best: {self.best_loss:.4f})")
            if self.counter >= self.patience:
                self.should_stop = True
                print(" ⚠ Early stopping triggered!")
        return self.should_stop


# Usage in a training loop:
early_stop = EarlyStopping(patience=5, min_delta=0.01)
# for epoch in range(num_epochs):
#     train_loss = train_one_epoch(model, train_loader)
#     val_loss = evaluate(model, val_loader)
#
#     print(f"Epoch {epoch}: train={train_loss:.4f}, val={val_loss:.4f}")
#
#     if early_stop.check(val_loss):
#         print("Stopping early to prevent overfitting.")
#         break

# Example run:
print("Simulating early stopping:\n")
fake_val_losses = [3.2, 2.8, 2.5, 2.4, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9]
stopper = EarlyStopping(patience=3, min_delta=0.05)
for epoch, val_loss in enumerate(fake_val_losses):
    print(f"Epoch {epoch}: val_loss = {val_loss:.2f}")
    if stopper.check(val_loss):
        print(f"Stopped at epoch {epoch}.")
        break
Output:
Simulating early stopping:
Epoch 0: val_loss = 3.20
✓ New best validation loss: 3.2000
Epoch 1: val_loss = 2.80
✓ New best validation loss: 2.8000
Epoch 2: val_loss = 2.50
✓ New best validation loss: 2.5000
Epoch 3: val_loss = 2.40
✓ New best validation loss: 2.4000
Epoch 4: val_loss = 2.40
✗ No improvement for 1/3 evaluations (best: 2.4000)
Epoch 5: val_loss = 2.50
✗ No improvement for 2/3 evaluations (best: 2.4000)
Epoch 6: val_loss = 2.60
✗ No improvement for 3/3 evaluations (best: 2.4000)
⚠ Early stopping triggered!
Stopped at epoch 6.
The early stopper correctly identified that validation loss stopped improving after epoch 3 and terminated training after 3 consecutive non-improvements, preventing the model from overfitting further.
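One limitation worth noting: when the stopper fires at epoch 6, the model's weights are those from epoch 6, not from the best epoch (3). A common companion is to snapshot the best weights as you go. The sketch below is a minimal, framework-free illustration; BestCheckpoint is a hypothetical helper, and the plain dictionary stands in for whatever your model exposes (for example, a PyTorch state_dict).

```python
import copy


class BestCheckpoint:
    """Remember a copy of the weights from the best validation loss seen."""

    def __init__(self):
        self.best_loss = float("inf")
        self.best_step = None
        self.best_state = None

    def update(self, step, val_loss, state):
        """Store a deep copy of `state` if `val_loss` is a new best.

        Returns True when a new snapshot was taken.
        """
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_step = step
            # Deep copy so later training updates don't alter the snapshot.
            self.best_state = copy.deepcopy(state)
            return True
        return False


# Toy stand-in for model weights; a real loop would pass the model's state.
weights = {"w": [0.0]}
tracker = BestCheckpoint()
fake_val_losses = [3.2, 2.8, 2.5, 2.4, 2.4, 2.5, 2.6]

for epoch, val_loss in enumerate(fake_val_losses):
    weights["w"][0] += 1.0  # pretend a training epoch changed the weights
    tracker.update(epoch, val_loss, weights)

print(f"Best epoch: {tracker.best_step}, loss: {tracker.best_loss:.2f}")
print(f"Snapshot weights: {tracker.best_state}, current weights: {weights}")
```

After training stops, you would load the snapshot back into the model so that evaluation and generation use the best weights rather than the last ones.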
Exercise 3: Data Quality Audit
Write a function that analyzes a text dataset and produces a quality report: number of documents, average length, duplicate count, language distribution estimate, and potential issues.
Solution
import hashlib
from collections import Counter


def audit_dataset(documents):
    """
    Produce a quality report for a text dataset.

    Args:
        documents: List of document strings.

    Returns:
        Dictionary with quality metrics.
    """
    if not documents:
        print("Empty dataset!")
        return {}

    # Basic statistics
    lengths = [len(doc) for doc in documents]
    word_counts = [len(doc.split()) for doc in documents]

    # Duplicate detection: hash each document after collapsing whitespace
    hashes = []
    for doc in documents:
        normalized = " ".join(doc.split())
        h = hashlib.sha256(normalized.encode()).hexdigest()
        hashes.append(h)
    hash_counts = Counter(hashes)
    n_duplicates = sum(c - 1 for c in hash_counts.values() if c > 1)

    # Simple language heuristic: a document counts as "likely English"
    # if its first 100 words contain at least 3 of these common words
    english_words = {"the", "is", "and", "to", "of", "a", "in",
                     "that", "it", "was"}
    likely_english = 0
    for doc in documents:
        words = set(doc.lower().split()[:100])  # Check first 100 words
        overlap = words & english_words
        if len(overlap) >= 3:
            likely_english += 1

    # Potential issues
    issues = []
    short_docs = sum(1 for length in lengths if length < 50)
    if short_docs > len(documents) * 0.1:
        issues.append(f"{short_docs} documents are very short (<50 chars)")
    if n_duplicates > len(documents) * 0.05:
        issues.append(f"{n_duplicates} duplicate documents detected")
    empty_docs = sum(1 for doc in documents if len(doc.strip()) == 0)
    if empty_docs > 0:
        issues.append(f"{empty_docs} empty documents found")

    # Report
    report = {
        "total_documents": len(documents),
        "total_characters": sum(lengths),
        "total_words": sum(word_counts),
        "avg_length_chars": sum(lengths) / len(lengths),
        "avg_length_words": sum(word_counts) / len(word_counts),
        "min_length_chars": min(lengths),
        "max_length_chars": max(lengths),
        "duplicates": n_duplicates,
        "likely_english_pct": likely_english / len(documents) * 100,
        "issues": issues,
    }

    # Print report
    print("=" * 50)
    print("DATASET QUALITY REPORT")
    print("=" * 50)
    print(f"Documents: {report['total_documents']:,}")
    print(f"Total words: {report['total_words']:,}")
    print(f"Total characters: {report['total_characters']:,}")
    print(f"Avg length: {report['avg_length_words']:.0f} words "
          f"({report['avg_length_chars']:.0f} chars)")
    print(f"Min length: {report['min_length_chars']:,} chars")
    print(f"Max length: {report['max_length_chars']:,} chars")
    print(f"Duplicates: {report['duplicates']}")
    print(f"Likely English: {report['likely_english_pct']:.1f}%")
    if issues:
        print("\n⚠ ISSUES FOUND:")
        for issue in issues:
            print(f" - {issue}")
    else:
        print("\n✓ No major issues detected.")
    print("=" * 50)
    return report


# Example usage:
sample_docs = [
    "The quick brown fox jumps over the lazy dog. " * 10,
    "Machine learning is a subset of artificial intelligence "
    "that enables systems to learn from data.",
    "The quick brown fox jumps over the lazy dog. " * 10,  # Duplicate!
    "",  # Empty!
    "Hi",  # Too short!
    "Natural language processing deals with the interaction "
    "between computers and human language. It is a fascinating "
    "field that combines linguistics and computer science.",
]
report = audit_dataset(sample_docs)
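Output:
==================================================
DATASET QUALITY REPORT
==================================================
Documents: 6
Total words: 219
Total characters: 1,164
Avg length: 36 words (194 chars)
Min length: 0 chars
Max length: 450 chars
Duplicates: 1
Likely English: 33.3%
⚠ ISSUES FOUND:
- 2 documents are very short (<50 chars)
- 1 duplicate documents detected
- 1 empty documents found
==================================================
The audit flagged the duplicate document, the empty document, and the two too-short documents. Note the language estimate: the two “quick brown fox” documents contain only one of the ten common words (“the”), so the heuristic does not count them as English, and the empty and two-character documents cannot pass either. That leaves 2 of 6 documents (33.3%), which shows how crude this check is on short or repetitive text. In a real project, you would run this on your full dataset before training and address the flagged issues.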
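Addressing the flagged issues can be as simple as a filtering pass. The sketch below drops empty, very short, and exact-duplicate documents, using the same whitespace normalization and hashing idea as the audit. The function name clean_dataset and the 50-character threshold are illustrative assumptions, and exact-match hashing will not catch near-duplicates.

```python
import hashlib


def clean_dataset(documents, min_chars=50):
    """Drop empty, very short, and exact-duplicate documents.

    Returns the kept documents (in original order) and a count of
    how many were removed for each reason.
    """
    seen_hashes = set()
    kept = []
    removed = {"too_short": 0, "duplicate": 0}

    for doc in documents:
        # Collapse runs of whitespace so trivial spacing differences
        # don't hide duplicates.
        normalized = " ".join(doc.split())
        if len(normalized) < min_chars:
            removed["too_short"] += 1  # also covers empty documents
            continue
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            removed["duplicate"] += 1
            continue
        seen_hashes.add(digest)
        kept.append(doc)

    return kept, removed


docs = [
    "The quick brown fox jumps over the lazy dog. " * 10,
    "The quick  brown fox jumps over the lazy dog. " * 10,  # same after normalizing
    "",
    "Hi",
    "Machine learning is a subset of artificial intelligence "
    "that enables systems to learn from data.",
]
kept, removed = clean_dataset(docs)
print(f"Kept {len(kept)} of {len(docs)} documents; removed: {removed}")
```

Returning the removal counts alongside the cleaned list makes it easy to log how much data each filter discards, which helps you notice a threshold that is too aggressive.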
Summary
This chapter covered the practical skills that turn theoretical knowledge into working systems:
- Debugging training starts with reading loss curves. A healthy curve drops steadily; overfitting shows a growing gap between training and validation loss; an excessive learning rate creates spikes or NaN values. Gradient norm monitoring catches exploding and vanishing gradients before they crash your run.
- Overfitting and underfitting are the two fundamental failure modes. Dropout, weight decay, and more data combat overfitting. Bigger models, higher learning rates, and longer training address underfitting. Early stopping provides an automated safety net.
- Common training issues can be systematically diagnosed using a checklist approach. Most problems trace back to data pipeline bugs, learning rate misconfiguration, or incorrect gradient handling.
- Data quality determines the ceiling of what your model can achieve. Deduplication, filtering, and cleaning are not optional; they’re essential. Garbage in, garbage out.
- Ethical considerations are part of the engineering process, not separate from it. LLMs learn biases from data, and responsible development means auditing data, testing for fairness, being transparent about limitations, and thinking about consequences.
- The road ahead is vast and exciting. Hugging Face, research papers, open-source models, and a rapidly growing ecosystem of tools await you. The foundations you’ve built in this book, understanding every layer from tokenization to generation, give you the knowledge to engage with all of it deeply.
You started this book wondering how language models work. You leave it having built one from scratch, trained it, fine-tuned it, and learned to debug it. That’s not a small thing. Welcome to the field.