Fine-tuning and Applications — Making Your Model Useful
This chapter covers making a pretrained LLM useful through fine-tuning. Transfer learning leverages pretrained weights instead of training from scratch. Fine-tuning datasets are prepared in instruction-response format. A complete fine-tuning loop trains the model on new data with lower learning rates. LoRA (Low-Rank Adaptation) enables efficient fine-tuning by adding small trainable matrices to frozen weights, dramatically reducing trainable parameters. Instruction tuning teaches models to follow prompts. RLHF (Reinforcement Learning from Human Feedback) is explained conceptually as the process behind ChatGPT-like behavior. Model evaluation uses perplexity and human assessment.
In the previous chapters, we built a Transformer language model from scratch, trained it on raw text, and watched it learn to generate coherent English. That’s a remarkable achievement. But if you actually try to use your base model, you’ll notice something frustrating: it can finish sentences, but it can’t answer questions. It can generate paragraphs of text, but it won’t follow instructions. It speaks English, but it doesn’t speak helpful.
This is the gap between a base model and a useful model. Every LLM you’ve ever interacted with—ChatGPT, Claude, Llama—went through an additional training phase after pretraining. That phase is called fine-tuning, and it’s the subject of this chapter.
1. What is Fine-tuning?
The Analogy
Think of pretraining like raising a child. Over years, the child learns English—grammar, vocabulary, how sentences work, facts about the world. By age ten, the child can speak fluently. But can the child write a professional email? Diagnose a medical symptom? Explain quantum physics to a five-year-old?
No. The child knows English, but hasn’t been trained for any specific task.
Fine-tuning is what happens next. You send the child to school, where a teacher says: “When someone asks you a question, answer it clearly.” Or: “When given a topic, write a structured essay.” The child doesn’t need to relearn English—they already know it. They just need to learn how to apply their knowledge in a specific way.
That’s exactly what fine-tuning does to a language model. The base model already understands language—word meanings, grammar, facts. Fine-tuning teaches it behavior: how to respond to instructions, how to answer questions, how to be helpful rather than just predictive.
Transfer Learning
This principle is called transfer learning: take knowledge learned in one context (pretraining on general text) and transfer it to a new context (following instructions, answering questions, writing code).
Why not just train a model from scratch on instruction-response data? Because pretraining is enormously expensive. Pretraining a model on the scale of GPT-3 reportedly takes millions of dollars and months of compute time. Fine-tuning, by contrast, can often be done in hours on a single GPU with a small dataset, at least for small models or with the parameter-efficient methods covered later in this chapter. You’re leveraging all the knowledge already baked into the model’s weights.
Pretraining: Months + Millions of $ → Understands language
Fine-tuning: Hours + Hundreds of $ → Follows instructions
The ratio of effort is staggering. Pretraining builds the foundation; fine-tuning adds the finishing touches. But those finishing touches are what make the difference between a model that rambles aimlessly and one that says: “Here’s a clear answer to your question.”
What Changes During Fine-tuning?
Mechanically, fine-tuning is just more training. You take the pretrained model’s weights—the same attention matrices, feed-forward layers, and embeddings we built in earlier chapters—and continue training them on a new, curated dataset. The differences are:
- The dataset is different. Instead of raw web text, you use carefully formatted instruction-response pairs.
- The learning rate is lower. You don’t want to overwrite what the model already knows—you want to gently nudge it.
- The training is shorter. A few hundred to a few thousand steps, not millions.
- Sometimes you freeze some layers. You might only update the last few layers, keeping earlier layers (which capture general language patterns) unchanged.
2. Preparing a Fine-tuning Dataset
The quality of your fine-tuning dataset matters more than its size. A base model trained on billions of tokens can be meaningfully fine-tuned with just a few thousand high-quality examples.
Data Formats
Fine-tuning data comes in several formats, depending on what behavior you want to teach:
Instruction-Response Pairs — The most common format. Each example has an instruction and the desired response:
{"instruction": "Summarize the following text in one sentence.",
"input": "The Eiffel Tower was built in 1889 for the World's Fair...",
"response": "The Eiffel Tower, built in 1889 for the World's Fair, is a famous Parisian landmark."}
Question-Answer Pairs — A simpler variant:
{"question": "What is the capital of France?",
"answer": "The capital of France is Paris."}
Conversation Format — For chat models, data looks like a multi-turn conversation:
{"messages": [
{"role": "user", "content": "What is photosynthesis?"},
{"role": "assistant", "content": "Photosynthesis is the process by which plants convert sunlight into energy..."}
]}
For our implementation, we’ll use the instruction-response format, since it is simple and widely used.
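Because the three formats carry the same information, data in one format can be mapped to another. The sketch below is one possible way to do that, assuming the key names shown in the examples above; the helper name `to_instruction_response` is hypothetical, and it only handles single-turn conversations.

```python
def to_instruction_response(example):
    """Map any of the three formats above to {'instruction', 'response'}.

    Hypothetical helper for illustration. It assumes the key names used
    in the examples above and only handles single-turn conversations.
    """
    if "instruction" in example:
        instruction = example["instruction"]
        # Fold the optional 'input' field into the instruction text
        if example.get("input"):
            instruction = f"{instruction}\n\n{example['input']}"
        return {"instruction": instruction, "response": example["response"]}
    if "question" in example:
        return {"instruction": example["question"], "response": example["answer"]}
    if "messages" in example:
        messages = example["messages"]
        user = next(m["content"] for m in messages if m["role"] == "user")
        assistant = next(m["content"] for m in messages if m["role"] == "assistant")
        return {"instruction": user, "response": assistant}
    raise ValueError("Unrecognized example format")


qa = {"question": "What is the capital of France?",
      "answer": "The capital of France is Paris."}
chat = {"messages": [
    {"role": "user", "content": "What is photosynthesis?"},
    {"role": "assistant", "content": "It is how plants turn sunlight into energy."},
]}

converted_qa = to_instruction_response(qa)
converted_chat = to_instruction_response(chat)
print(converted_qa)
print(converted_chat)
```

Normalizing everything to one format up front keeps the rest of the pipeline (formatting, tokenizing, training) unaware of where each example came from.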
Creating a Simple Instruction Dataset
Let’s build a small instruction dataset in code. In practice, you’d load this from a file, but creating it explicitly helps us understand the format:
# A tiny instruction-tuning dataset
instruction_data = [
{
"instruction": "What is the capital of France?",
"response": "The capital of France is Paris."
},
{
"instruction": "Explain gravity in simple terms.",
"response": "Gravity is the force that pulls objects toward each other. The Earth's gravity keeps us on the ground and the Moon in orbit."
},
{
"instruction": "Write a haiku about programming.",
"response": "Code flows line by line\nBugs hide in the smallest place\nTests reveal the truth"
},
{
"instruction": "What is 15 multiplied by 7?",
"response": "15 multiplied by 7 is 105."
},
{
"instruction": "Translate 'hello' to Spanish.",
"response": "The Spanish translation of 'hello' is 'hola'."
},
{
"instruction": "List three prime numbers.",
"response": "Three prime numbers are 2, 3, and 5."
},
{
"instruction": "What does CPU stand for?",
"response": "CPU stands for Central Processing Unit."
},
{
"instruction": "Summarize what machine learning is in one sentence.",
"response": "Machine learning is a field of computer science where algorithms learn patterns from data to make predictions or decisions without being explicitly programmed."
},
]
print(f"Dataset size: {len(instruction_data)} examples")
# Output: Dataset size: 8 examples
Formatting for the Model
The model needs to see the instruction and response as a single sequence of text, with clear markers separating the two parts. A common pattern uses special tokens or textual delimiters:
def format_example(example):
    """
    Format an instruction-response pair into a single training string.

    The model learns to generate everything after '### Response:',
    given everything before it as context.
    """
    text = (
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['response']}"
    )
    return text

# See what a formatted example looks like
formatted = format_example(instruction_data[1])
print(formatted)
# Output:
# ### Instruction:
# Explain gravity in simple terms.
#
# ### Response:
# Gravity is the force that pulls objects toward each other. The Earth's
# gravity keeps us on the ground and the Moon in orbit.
The model will be trained to predict the entire sequence, but during inference, we only provide the instruction part and let the model generate the response.
Implementing a Fine-tuning Dataset Class
Now let’s build a proper PyTorch Dataset that tokenizes and prepares these examples for training:
import torch
from torch.utils.data import Dataset


class InstructionDataset(Dataset):
    """
    A PyTorch Dataset for instruction-tuning.

    Each example is formatted as:
        ### Instruction:
        <the instruction>

        ### Response:
        <the response>

    The entire sequence is tokenized and used as both input and target
    (shifted by one position, as in standard language modeling).
    """

    def __init__(self, data, tokenizer, max_length=128):
        """
        Args:
            data: List of dicts with 'instruction' and 'response' keys.
            tokenizer: A tokenizer with encode() method.
            max_length: Maximum sequence length (truncate or pad to this).
        """
        self.examples = []
        for item in data:
            # Format the instruction-response pair
            text = format_example(item)
            # Tokenize
            token_ids = list(tokenizer.encode(text))
            # Append the end-of-sequence token when the tokenizer defines one,
            # so the model can learn where a response ends
            eos_id = getattr(tokenizer, 'eos_token_id', None)
            if eos_id is not None:
                token_ids.append(eos_id)
            # Truncate if too long
            token_ids = token_ids[:max_length]
            # Pad if too short (0 = padding token; this assumes id 0 is
            # reserved for padding in your tokenizer's vocabulary)
            token_ids += [0] * (max_length - len(token_ids))
            self.examples.append(token_ids)

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        token_ids = self.examples[idx]
        # Input is everything except the last token
        x = torch.tensor(token_ids[:-1], dtype=torch.long)
        # Target is everything except the first token (shifted by 1)
        y = torch.tensor(token_ids[1:], dtype=torch.long)
        return x, y

This is the same autoregressive setup we used in pretraining: the input is the sequence without its last token, and the target is the same sequence without its first token, so the target at each position is the token that follows the input at that position. The model learns to predict the next token at every position. What changes is the data: it now contains instruction-response pairs, so the model learns the pattern: “after an instruction, produce a helpful response.”
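To make the one-position shift concrete, here is a minimal sketch with a handful of made-up token ids standing in for a tokenized instruction-response string. The ids are arbitrary; only the slicing matters.

```python
import torch

# Toy token ids standing in for a tokenized instruction-response string
token_ids = [5, 9, 2, 7, 3]

x = torch.tensor(token_ids[:-1], dtype=torch.long)  # input:  all but the last token
y = torch.tensor(token_ids[1:], dtype=torch.long)   # target: all but the first token

for position, (inp, tgt) in enumerate(zip(x.tolist(), y.tolist())):
    print(f"position {position}: sees token {inp}, must predict token {tgt}")
```

Each target is simply the next token in the original sequence, which is why no separate labels are needed for this kind of training.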
3. Fine-tuning Implementation
Let’s implement the full fine-tuning pipeline. We’ll assume we have a pretrained GPT-like model from earlier chapters.
Setting Up the Pretrained Model
First, we load our pretrained model and prepare it for fine-tuning:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader


def setup_fine_tuning(model, learning_rate=1e-5):
    """
    Prepare a pretrained model for fine-tuning.

    Key differences from pretraining:
    - Much lower learning rate (1e-5 vs 1e-3)
    - Optionally freeze early layers
    - Shorter training schedule

    Args:
        model: A pretrained language model.
        learning_rate: Learning rate for fine-tuning (should be small).

    Returns:
        optimizer: Configured optimizer.
    """
    # Use a MUCH lower learning rate than pretraining
    # Pretraining might use 1e-3; fine-tuning uses 1e-5 or smaller
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=learning_rate,
        weight_decay=0.01
    )
    return optimizer


def count_parameters(model):
    """Count total and trainable parameters."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Total parameters: {total:,}")
    print(f"Trainable parameters: {trainable:,}")
    return total, trainable

Optionally Freezing Layers
Sometimes you want to freeze the earlier layers of the model—the ones that capture general language patterns—and only fine-tune the later layers:
def freeze_early_layers(model, num_layers_to_freeze):
    """
    Freeze the first N transformer layers.

    Frozen layers won't be updated during fine-tuning.
    This preserves general language knowledge while allowing
    the later layers to adapt to the new task.

    Args:
        model: The transformer model.
        num_layers_to_freeze: How many layers to freeze (from the bottom).
    """
    # Freeze the embedding layer (it rarely needs updating)
    for param in model.embedding.parameters():
        param.requires_grad = False

    # Freeze the first N transformer blocks
    for i in range(num_layers_to_freeze):
        for param in model.layers[i].parameters():
            param.requires_grad = False

    # Print what's frozen vs trainable
    frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Frozen parameters: {frozen:,}")
    print(f"Trainable parameters: {trainable:,}")
    print(f"Percentage trainable: {100 * trainable / (frozen + trainable):.1f}%")

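To see which parameters end up frozen, the sketch below applies the same freezing steps to a toy model. `ToyModel` is a hypothetical stand-in, not the Transformer from earlier chapters: it only mimics the `embedding` and `layers` attribute names that the function above assumes, and uses plain linear layers in place of transformer blocks.

```python
import torch.nn as nn


class ToyModel(nn.Module):
    """Hypothetical stand-in with the attribute names used above."""

    def __init__(self, vocab_size=100, d_model=16, num_layers=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.layers = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(num_layers)]
        )
        self.head = nn.Linear(d_model, vocab_size)


toy = ToyModel()

# Freeze the embedding and the first two blocks
for param in toy.embedding.parameters():
    param.requires_grad = False
for block in toy.layers[:2]:
    for param in block.parameters():
        param.requires_grad = False

frozen_names = [n for n, p in toy.named_parameters() if not p.requires_grad]
trainable_names = [n for n, p in toy.named_parameters() if p.requires_grad]
print("Frozen:   ", frozen_names)
print("Trainable:", trainable_names)
```

Listing parameter names like this is a quick sanity check before a long training run: anything that should stay fixed but shows up under "Trainable" will be updated by the optimizer.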
The Fine-tuning Loop
The training loop itself is almost identical to pretraining—the key difference is the data and the hyperparameters:
def fine_tune(model, dataset, num_epochs=3, batch_size=4,
              learning_rate=1e-5, device='cpu'):
    """
    Fine-tune a pretrained model on an instruction dataset.

    Args:
        model: Pretrained language model.
        dataset: InstructionDataset instance.
        num_epochs: Number of passes through the data.
        batch_size: Examples per batch.
        learning_rate: Small learning rate for fine-tuning.
        device: 'cpu' or 'cuda'.

    Returns:
        losses: List of loss values for monitoring.
    """
    model = model.to(device)
    model.train()
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=learning_rate,
        weight_decay=0.01
    )
    loss_fn = nn.CrossEntropyLoss(ignore_index=0)  # Ignore padding tokens
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    losses = []

    for epoch in range(num_epochs):
        epoch_loss = 0.0
        num_batches = 0
        for batch_x, batch_y in dataloader:
            batch_x = batch_x.to(device)  # [batch_size, seq_len]
            batch_y = batch_y.to(device)  # [batch_size, seq_len]

            # Forward pass
            logits = model(batch_x)  # [batch_size, seq_len, vocab_size]

            # Reshape for loss computation
            # CrossEntropyLoss expects [N, C] and [N]
            B, T, C = logits.shape
            logits = logits.view(B * T, C)
            targets = batch_y.view(B * T)

            # Compute loss
            loss = loss_fn(logits, targets)

            # Backward pass
            optimizer.zero_grad()
            loss.backward()

            # Gradient clipping (important for stability during fine-tuning)
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()

            epoch_loss += loss.item()
            num_batches += 1

        avg_loss = epoch_loss / num_batches
        losses.append(avg_loss)
        print(f"Epoch {epoch + 1}/{num_epochs}, Loss: {avg_loss:.4f}")

    return losses

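The `ignore_index=0` argument in the loop above is what keeps padding from influencing training. The sketch below, using random logits and made-up targets, checks that the masked loss equals the average loss over the non-padding positions only. It assumes, as the dataset class does, that token id 0 is used exclusively for padding.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size = 10
logits = torch.randn(6, vocab_size)          # 6 positions, already flattened to [B*T, C]
targets = torch.tensor([4, 7, 1, 0, 0, 0])   # last three positions are padding (id 0)

masked_loss = nn.CrossEntropyLoss(ignore_index=0)(logits, targets)

# Same quantity computed by hand: average over the non-padding positions only
per_token = nn.CrossEntropyLoss(reduction="none")(logits, targets)
manual_loss = per_token[targets != 0].mean()

print(f"masked loss: {masked_loss.item():.4f}")
print(f"manual loss: {manual_loss.item():.4f}")
```

If the tokenizer assigns id 0 to a real token, those positions would be silently skipped too, so it is worth confirming which id the padding actually uses.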
Observing the Effect of Fine-tuning
Let’s write a function to see how the model’s behavior changes before and after fine-tuning:
def generate_response(model, tokenizer, instruction, max_tokens=100,
                      temperature=0.7, device='cpu'):
    """
    Generate a response to an instruction using the fine-tuned model.

    Args:
        model: The language model.
        tokenizer: Tokenizer with encode/decode methods.
        instruction: The instruction text.
        max_tokens: Maximum tokens to generate.
        temperature: Sampling temperature (lower = more deterministic).
        device: 'cpu' or 'cuda'.

    Returns:
        response: The generated response text.
    """
    model.eval()

    # Format the prompt the same way as training data
    prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    token_ids = tokenizer.encode(prompt)
    input_ids = torch.tensor([token_ids], dtype=torch.long).to(device)

    with torch.no_grad():
        for _ in range(max_tokens):
            # Get predictions
            logits = model(input_ids)

            # Take the last token's predictions
            next_token_logits = logits[:, -1, :] / temperature

            # Sample from the distribution
            probs = torch.softmax(next_token_logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)

            # Append to sequence
            input_ids = torch.cat([input_ids, next_token], dim=1)

            # Stop if we generate an end token
            if next_token.item() == tokenizer.eos_token_id:
                break

    # Decode only the generated part (after the prompt)
    generated_ids = input_ids[0, len(token_ids):].tolist()
    response = tokenizer.decode(generated_ids)
    return response

# Example usage (conceptual — requires a trained model):
#
# BEFORE fine-tuning:
# Instruction: "What is the capital of France?"
# Model output: "the capital of France the country in Europe which has many..."
# (The base model just continues the text aimlessly)
#
# AFTER fine-tuning:
# Instruction: "What is the capital of France?"
# Model output: "The capital of France is Paris."
# (The fine-tuned model gives a direct, helpful answer)
The transformation is dramatic. The base model treats the instruction as just more text to continue. The fine-tuned model recognizes the pattern: instruction comes in, helpful response goes out.
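The `temperature` argument in `generate_response` controls how sharply the model commits to its top choice. A small sketch with made-up logits for three candidate tokens shows the effect of the same divide-then-softmax step:

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.1])  # made-up scores for three candidate tokens


def probs_at(temperature):
    # Same scaling as in generate_response: divide the logits, then softmax
    return torch.softmax(logits / temperature, dim=-1)


for t in (0.5, 1.0, 2.0):
    rounded = [round(p, 3) for p in probs_at(t).tolist()]
    print(f"temperature={t}: {rounded}")
```

Lower temperatures concentrate probability on the highest-scoring token, while higher temperatures flatten the distribution and make sampling more varied. Note that a temperature of exactly 0 would divide by zero in this formulation.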
4. LoRA — Efficient Fine-tuning
The Problem with Full Fine-tuning
Full fine-tuning updates every parameter in the model. For a small model like ours, that’s fine. But for a model like LLaMA-70B with 70 billion parameters, updating all of them requires:
- Storing all 70 billion parameters in memory
- Storing gradients for all 70 billion parameters
- Storing optimizer states (AdamW keeps two extra copies per parameter)
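Counting the weights, their gradients, and the two optimizer states, that’s roughly 280 billion floating-point numbers in memory. At 4 bytes each in 32-bit precision, that comes to over a terabyte of GPU RAM, before counting activations. Most people don’t have that.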
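The arithmetic behind that estimate is short enough to write out. This is a back-of-the-envelope sketch that assumes every value is stored as a 32-bit float and ignores activations; real training setups with mixed precision will land on different numbers.

```python
num_params = 70e9        # 70 billion parameters
copies = 4               # weights + gradients + two AdamW states per parameter
bytes_per_value = 4      # assumption: 32-bit floats

total_values = num_params * copies
total_bytes = total_values * bytes_per_value

print(f"Values in memory: {total_values / 1e9:.0f} billion")
print(f"Memory needed:    {total_bytes / 1e12:.2f} TB")
# Output:
# Values in memory: 280 billion
# Memory needed:    1.12 TB
```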
The Key Insight
In 2021, researchers at Microsoft published a paper called “LoRA: Low-Rank Adaptation of Large Language Models.” Their key insight was elegant: when you fine-tune a model, the weight changes tend to have low intrinsic rank.
What does that mean? Remember that our model’s weights are large matrices. During fine-tuning, these matrices change slightly; call the change ΔW. The researchers hypothesized, and showed empirically, that ΔW can be approximated well without using all of its dimensions. You can represent it as the product of two much smaller matrices:
$$\Delta W = A \times B$$
Where:
- $\Delta W$ has the same shape as the original weight matrix $W$ ($d \times d$)
- $A$ is a small matrix of shape ($d \times r$)
- $B$ is a small matrix of shape ($r \times d$)
- $r$ is the rank, typically 4, 8, or 16—much smaller than $d$
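A quick numerical check of this decomposition, using random matrices rather than anything learned: the product of a $d \times r$ matrix and an $r \times d$ matrix has the full $d \times d$ shape, but its rank can never exceed $r$. The sizes below are arbitrary choices for illustration.

```python
import torch

torch.manual_seed(0)
d, r = 64, 4

A = torch.randn(d, r, dtype=torch.float64)   # shape (d x r)
B = torch.randn(r, d, dtype=torch.float64)   # shape (r x d)
delta_W = A @ B                              # shape (d x d), same as W

rank = torch.linalg.matrix_rank(delta_W).item()
print(f"delta_W shape: {tuple(delta_W.shape)}")
print(f"rank of delta_W: {rank} (at most r = {r})")
print(f"values stored: {A.numel() + B.numel()} instead of {delta_W.numel()}")
```

This is the trade LoRA makes: the update is restricted to a low-rank subspace, and in exchange only the two thin factors need to be stored and trained.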
The Analogy
Imagine you have an enormous painting—a masterpiece that took years to create. You want to modify its mood slightly, making it warmer. Full fine-tuning would mean repainting the entire canvas. LoRA’s approach is different: you place a small, transparent overlay on top of the painting and paint only on the overlay. The original masterpiece stays untouched. The final result is the original plus the overlay.
The overlay (ΔW, stored as the two thin matrices A and B) takes far less paint than the full painting (W), but it’s enough to change the overall effect.
Parameter Savings
Let’s do the math. Suppose a weight matrix W is 4096 × 4096:
- Full fine-tuning: 4096 × 4096 = 16,777,216 parameters
- LoRA with rank 8: (4096 × 8) + (8 × 4096) = 65,536 parameters
That’s a 256× reduction in trainable parameters. For a 70-billion-parameter model, LoRA might fine-tune only 10–50 million parameters—less than 0.1% of the total.
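The same arithmetic in general form, for a square $d \times d$ matrix and rank $r$ (a short derivation using only the shapes defined above):

$$\frac{\text{full parameters}}{\text{LoRA parameters}} = \frac{d^2}{dr + rd} = \frac{d}{2r}, \qquad \text{so for } d = 4096,\ r = 8: \quad \frac{4096}{2 \times 8} = 256$$

The savings grow with the width of the layer and shrink as the rank increases: doubling $r$ halves the reduction factor.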
Implementing LoRA from Scratch
Let’s build a LoRA layer:
import torch
import torch.nn as nn
import math


class LoRALayer(nn.Module):
    """
    A LoRA (Low-Rank Adaptation) layer.

    Instead of updating the full weight matrix W during fine-tuning,
    we freeze W and learn a low-rank update: delta_W = A @ B

    The output becomes: y = x @ W + x @ A @ B
                          = x @ (W + A @ B)

    Only A and B are trainable, dramatically reducing parameters.
    """

    def __init__(self, original_layer, rank=8, alpha=16):
        """
        Args:
            original_layer: The nn.Linear layer to apply LoRA to.
            rank: The rank of the low-rank matrices (smaller = fewer params).
            alpha: Scaling factor for the LoRA update.
        """
        super().__init__()
        self.original_layer = original_layer
        self.rank = rank
        self.alpha = alpha

        in_features = original_layer.in_features
        out_features = original_layer.out_features

        # Freeze the original weights (they don't change during fine-tuning)
        for param in self.original_layer.parameters():
            param.requires_grad = False

        # Create the low-rank matrices A and B
        # A: projects input down to low rank
        # B: projects back up to output dimension
        self.lora_A = nn.Parameter(torch.zeros(in_features, rank))
        self.lora_B = nn.Parameter(torch.zeros(rank, out_features))

        # Initialize A with small random values, B with zeros
        # This means the LoRA update starts at zero (no change to the model)
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        # B stays at zero, so initially delta_W = A @ B = 0

        # Scaling factor
        self.scaling = alpha / rank

    def forward(self, x):
        """
        Forward pass: original output + low-rank update.

        Args:
            x: Input tensor of shape [batch, seq_len, in_features]

        Returns:
            Output tensor of shape [batch, seq_len, out_features]
        """
        # Original frozen computation
        original_output = self.original_layer(x)
        # Low-rank update: x @ A @ B, scaled
        lora_update = x @ self.lora_A @ self.lora_B * self.scaling
        return original_output + lora_update

Let’s verify the parameter savings:
# Demonstrate parameter savings
d_model = 512 # A modest model dimension
# Original linear layer
original = nn.Linear(d_model, d_model, bias=False)
original_params = sum(p.numel() for p in original.parameters())
# LoRA version with rank 8
lora = LoRALayer(original, rank=8, alpha=16)
lora_trainable = sum(p.numel() for p in lora.parameters() if p.requires_grad)
print(f"Original layer parameters: {original_params:,}")
print(f"LoRA trainable parameters: {lora_trainable:,}")
print(f"Reduction factor: {original_params / lora_trainable:.1f}x")
print(f"Percentage of original: {100 * lora_trainable / original_params:.2f}%")
# Output:
# Original layer parameters: 262,144
# LoRA trainable parameters: 8,192
# Reduction factor: 32.0x
# Percentage of original: 3.12%
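Two properties of this design are worth checking numerically. First, because B starts at zero, the wrapped layer initially behaves exactly like the original. Second, since the output equals `x @ (W + A @ B)` up to the scaling factor, a trained update can be folded back into a single weight matrix. The sketch below re-creates the same computation with plain tensors rather than the class above; the sizes and the random stand-in for a "trained" B are assumptions for illustration.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
d_in, d_out, rank, alpha = 32, 32, 4, 8
scaling = alpha / rank

base = nn.Linear(d_in, d_out, bias=False)
lora_A = torch.empty(d_in, rank)
nn.init.kaiming_uniform_(lora_A, a=math.sqrt(5))
lora_B = torch.zeros(rank, d_out)


def lora_forward(x):
    # Frozen output plus the scaled low-rank update, as in LoRALayer.forward
    return base(x) + (x @ lora_A @ lora_B) * scaling


x = torch.randn(2, 5, d_in)

# 1) With B = 0 the update vanishes, so the output matches the base layer
starts_as_noop = torch.allclose(lora_forward(x), base(x))

# 2) Pretend training moved B away from zero, then fold the update into W.
#    nn.Linear stores its weight as [out, in], hence the transpose.
lora_B = torch.randn(rank, d_out) * 0.01
merged = nn.Linear(d_in, d_out, bias=False)
with torch.no_grad():
    merged.weight.copy_(base.weight + scaling * (lora_A @ lora_B).T)
merge_matches = torch.allclose(merged(x), lora_forward(x), atol=1e-5)

print(f"Starts as a no-op: {starts_as_noop}")
print(f"Merged layer matches: {merge_matches}")
```

Merging means the extra matrix multiplications can be removed entirely once training is done, if desired.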
Applying LoRA to a Model
In practice, you apply LoRA to specific layers in the model—typically the attention projection matrices (Q, K, V, and output):
def apply_lora_to_model(model, rank=8, alpha=16):
    """
    Replace attention linear layers with LoRA-wrapped versions.

    Only the attention Q, K, V, and output projection layers
    are wrapped. Feed-forward layers and embeddings stay frozen.

    Args:
        model: The pretrained transformer model.
        rank: LoRA rank.
        alpha: LoRA scaling factor.

    Returns:
        model: The modified model (in-place).
    """
    # Freeze every existing parameter first. LoRALayer only freezes the
    # layer it wraps, so without this step the feed-forward layers and
    # embeddings would still be trainable.
    for param in model.parameters():
        param.requires_grad = False

    lora_layers_added = 0
    for layer in model.layers:
        # Wrap attention projections with LoRA
        if hasattr(layer, 'attention'):
            attn = layer.attention
            for name in ('W_q', 'W_k', 'W_v', 'W_o'):
                if hasattr(attn, name):
                    wrapped = LoRALayer(getattr(attn, name), rank=rank, alpha=alpha)
                    setattr(attn, name, wrapped)
                    lora_layers_added += 1

    print(f"Applied LoRA to {lora_layers_added} layers (rank={rank})")

    # Show the savings
    total_params = sum(p.numel() for p in model.parameters())
    trainable_params = sum(p.numel() for p in model.parameters()
                           if p.requires_grad)
    print(f"Total parameters: {total_params:,}")
    print(f"Trainable (LoRA): {trainable_params:,}")
    print(f"Frozen: {total_params - trainable_params:,}")
    print(f"Trainable percentage: {100 * trainable_params / total_params:.2f}%")
    return model

Why LoRA Works
It might seem surprising that such a tiny number of parameters can meaningfully alter a model’s behavior. The intuition is this: pretraining already puts the model’s weights in a very good neighborhood of parameter space. Fine-tuning only needs to make small adjustments. Those small adjustments live in a low-dimensional subspace—you don’t need to explore all possible directions, just a few important ones.
Think of it as the difference between building a house from scratch (high-dimensional, many choices) and rearranging the furniture (low-dimensional, few choices, but the effect on livability is huge).
5. Instruction Tuning
Teaching a Model to Follow Instructions
Instruction tuning is a specific type of fine-tuning where the goal is to teach the model to follow instructions. This is what transforms a base model (which just predicts the next word) into a helpful assistant (which does what you ask).
The key idea is simple: show the model thousands of examples where an instruction is followed by a good response. After enough examples, the model learns the meta-pattern: “when I see an instruction, I should produce a helpful, relevant response.”
The Prompt-Completion Format
During instruction tuning, training examples follow a consistent template:
<|start|>
### Instruction:
{the user's request}
### Response:
{the ideal response}
<|end|>
The exact template varies between models (Llama 2’s chat models use [INST]...[/INST], ChatML uses <|im_start|>user), but the principle is the same: a clear division between what the user says and what the model should say.
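As a rough illustration of how the same content looks under two templates, the sketch below renders one exchange in this chapter's format and in a ChatML-style format. Both function names are hypothetical, and the ChatML rendering is simplified; in practice the exact template ships with each model's tokenizer and should be taken from there.

```python
def render_chapter_template(instruction, response=""):
    """The '### Instruction / ### Response' template used in this chapter."""
    return f"### Instruction:\n{instruction}\n\n### Response:\n{response}"


def render_chatml_style(messages):
    """A simplified ChatML-style rendering: each turn is wrapped in role markers."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )


instruction = "List three colors."
response = "Red, blue, and green."

chapter_text = render_chapter_template(instruction, response)
chatml_text = render_chatml_style([
    {"role": "user", "content": instruction},
    {"role": "assistant", "content": response},
])

print(chapter_text)
print("-" * 20)
print(chatml_text)
```

Whichever template is chosen, the prompt used at inference time has to match the one used during training character for character, or the model will not recognize where its response should begin.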
A Tiny Instruction-Tuned Model
Let’s trace through what instruction tuning looks like concretely. Imagine we have a tiny model that has been pretrained on general text. Before instruction tuning, if we feed it:
### Instruction:
List three colors.
### Response:
The base model might generate: “The response to this question depends on several factors including the cultural context in which colors are perceived…”
It’s continuing the text in a way that sounds like a Wikipedia article. It doesn’t understand that it should answer the question.
After instruction tuning on examples like:
Instruction: "Name two planets." → Response: "Mars and Jupiter."
Instruction: "What is 2 + 2?" → Response: "4."
Instruction: "Say hello in French." → Response: "Bonjour."
The model learns the pattern. Now when it sees the instruction about colors, it generates: “Red, blue, and green.”
The transformation seems magical, but it’s just pattern matching at scale. The model has seen enough instruction-response pairs that it has internalized: “after ### Response:\n, I should provide a direct, helpful answer to whatever was in ### Instruction:.”
Why ChatGPT-like Behavior Requires This Step
Base GPT models are remarkable—they know facts, can write prose, and understand grammar. But they have no concept of being helpful. They were trained to predict the next token in web pages, books, and articles. None of those texts are formatted as “instruction → response.”
Instruction tuning bridges this gap. Without it, you’d have a model that knows everything but can’t do anything useful with its knowledge. It’s the difference between a library (contains information, but you have to find it yourself) and a librarian (understands your question and gives you exactly what you need).
The Scale of Instruction Tuning
Real instruction-tuning datasets contain tens of thousands to millions of examples:
- Alpaca (Stanford): 52,000 instruction-response pairs generated by OpenAI’s text-davinci-003
- Dolly (Databricks): 15,000 human-written instruction-response pairs
- FLAN (Google): Millions of examples across hundreds of task types
- OpenAssistant: 160,000+ human-written messages organized into conversation trees
Even with our tiny 8-example dataset, the principle is the same: show the model the pattern, and it learns the behavior.
6. RLHF — Reinforcement Learning from Human Feedback
Instruction tuning gets the model to follow instructions, but the responses might be verbose, incorrect, or unhelpful in subtle ways. RLHF (Reinforcement Learning from Human Feedback) is the next step: teaching the model not just to respond, but to respond well.
The Problem
After instruction tuning, a model might generate two responses to “Explain gravity”:
Response A: “Gravity is a fundamental force of nature that attracts objects with mass toward each other. The more massive an object, the stronger its gravitational pull.”
Response B: “Gravity is described by Einstein’s general theory of relativity as a curvature in spacetime caused by mass and energy. The Einstein field equations, given by $G_{\mu\nu} + \Lambda g_{\mu\nu} = \frac{8\pi G}{c^4} T_{\mu\nu}$, describe this curvature…”
Both are technically correct. But for a general audience, Response A is clearly better—it’s clear, concise, and accessible. How do we teach the model to prefer Response A?
We can’t encode this preference as a simple rule. “Be clear” is subjective. “Be helpful” depends on context. These preferences are subtle and nuanced—exactly the kind of thing humans are good at judging but hard to write algorithms for.
The Three-Step Process
RLHF works in three stages:
Step 1: Supervised Fine-tuning (SFT)
This is the instruction tuning we just covered. Train the model on instruction-response pairs so it knows the basic format.
Input: "What is the capital of Japan?"
Output: "The capital of Japan is Tokyo."
Step 2: Train a Reward Model
Hire human annotators. Show them a prompt and two (or more) model-generated responses. Ask them: “Which response is better?”
Prompt: "Explain gravity simply."
Response A: "Gravity pulls things together. The bigger something is,
the stronger it pulls." → Human picks: ✓ Better
Response B: "Gravitational acceleration on Earth is approximately
9.81 m/s²..." → Human picks: ✗ Worse
Collect thousands of these human preference judgments. Then train a separate neural network—the reward model—to predict which response a human would prefer. The reward model takes a prompt and response as input and outputs a score:
reward_model("Explain gravity", response_A) → 0.85 (high = good)
reward_model("Explain gravity", response_B) → 0.32 (low = worse)
The reward model learns the patterns in human preferences: prefer clear explanations, avoid jargon, be concise, be accurate.
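The text above does not specify how the reward model is trained on these comparisons. One commonly used formulation, shown here as an assumption rather than as the method of any particular system, is a pairwise loss that pushes the preferred response's score above the rejected one's: $-\log \sigma(s_{\text{chosen}} - s_{\text{rejected}})$. The scores below are the made-up values from the example, and the function name is hypothetical.

```python
import torch
import torch.nn.functional as F


def pairwise_preference_loss(score_chosen, score_rejected):
    """-log sigmoid(score_chosen - score_rejected), averaged over pairs."""
    return -F.logsigmoid(score_chosen - score_rejected).mean()


# Made-up reward-model scores for one (preferred, rejected) pair
agrees = pairwise_preference_loss(torch.tensor([0.85]), torch.tensor([0.32]))
disagrees = pairwise_preference_loss(torch.tensor([0.32]), torch.tensor([0.85]))

print(f"loss when the model ranks the pair like the human: {agrees.item():.4f}")
print(f"loss when the model ranks the pair the other way:  {disagrees.item():.4f}")
```

Only the difference between the two scores matters for this loss, which fits the data: annotators say which response is better, not how good each one is on an absolute scale.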
Step 3: Optimize with Reinforcement Learning
Now use the reward model to further train the language model. The process uses an algorithm called PPO (Proximal Policy Optimization):
- The model generates a response to a prompt
- The reward model scores the response
- If the score is high → update the model to produce more responses like this
- If the score is low → update the model to avoid responses like this
- Repeat thousands of times
Analogy: Teaching a dog new tricks.
- The dog (LLM) performs an action (generates a response)
- The trainer (reward model) gives a treat or withholds it (high/low score)
- Over time, the dog learns which behaviors earn treats
- The dog doesn't understand WHY treats come — it just learns the pattern
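The feedback loop can be illustrated with a deliberately tiny stand-in. The sketch below is not PPO: it has no sampling, no clipping, and no penalty for drifting away from the original model. It is the simplest policy-gradient-style update, where a "policy" chooses between two canned responses with fixed, made-up reward scores and is nudged toward the higher-scoring one.

```python
import torch

# Two canned responses with fixed, made-up reward-model scores
rewards = torch.tensor([0.85, 0.32])           # response A, response B
logits = torch.zeros(2, requires_grad=True)    # the "policy": starts at 50/50
optimizer = torch.optim.SGD([logits], lr=1.0)

initial_prob_a = torch.softmax(logits, dim=-1)[0].item()

for step in range(100):
    probs = torch.softmax(logits, dim=-1)
    # Maximize the expected reward by minimizing its negative
    loss = -(probs * rewards).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

final_prob_a = torch.softmax(logits, dim=-1)[0].item()
print(f"P(response A) before: {initial_prob_a:.2f}, after: {final_prob_a:.2f}")
```

A real RLHF setup replaces the two canned responses with free-form generations and the fixed scores with a learned reward model, but the direction of the update is the same: more probability for what scores well.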
Why RLHF Matters
The jump from instruction tuning to RLHF is what made ChatGPT feel different from earlier language models. GPT-3 (base model) could generate text but was hard to use. InstructGPT added supervised instruction tuning followed by RLHF and could follow instructions far more reliably. ChatGPT applied a similar recipe to dialogue and felt helpful, harmless, and honest.
RLHF aligns the model’s goals with human preferences. Without it, the model optimizes for “predict the next token accurately.” With it, the model optimizes for “produce responses that humans rate as helpful.”
Limitations of RLHF
RLHF isn’t perfect:
- Expensive: Collecting human preferences requires many paid annotators
- Reward hacking: The model may learn to game the reward model rather than be genuinely helpful (for example, producing responses that sound confident but are wrong)
- Annotation bias: Human preferences are subjective and vary between annotators
- Alignment tax: RLHF can reduce the model’s raw capability slightly while improving its helpfulness
Researchers are actively exploring alternatives like DPO (Direct Preference Optimization), which skips the separate reward model entirely and optimizes directly from human preference data. But RLHF remains the foundational technique that launched the era of aligned AI assistants.
7. Evaluation
You’ve fine-tuned your model. But how do you know it’s actually good? Evaluating language models is notoriously difficult because “good” is subjective—but there are some established approaches.
Perplexity
Perplexity is the most common automatic metric for language models. It measures how “surprised” the model is by the test data. Lower perplexity = better.
The math: perplexity is the exponentiation of the average cross-entropy loss:
$$\text{Perplexity} = e^{\text{loss}}$$
If your model has a loss of 3.0 on some test text, its perplexity is $e^{3.0} \approx 20.1$. Intuitively, this means the model is, on average, as uncertain as if it were choosing uniformly among ~20 options at each step. A perfect model that always predicts the right next word has a perplexity of 1.
import math
import torch
import torch.nn as nn
from torch.utils.data import DataLoader


def compute_perplexity(model, dataset, device='cpu'):
    """
    Compute perplexity of a model on a dataset.

    Perplexity = exp(average cross-entropy loss per token)
    Lower is better. A perplexity of 1 would mean the model
    perfectly predicts every next token.

    Args:
        model: The language model.
        dataset: A dataset yielding (input, target) pairs.
        device: 'cpu' or 'cuda'.

    Returns:
        perplexity: The computed perplexity score.
    """
    model.eval()
    model = model.to(device)
    # Sum the per-token losses so every non-padding token counts equally,
    # even when batches contain different amounts of padding
    loss_fn = nn.CrossEntropyLoss(ignore_index=0, reduction='sum')
    total_loss = 0.0
    total_tokens = 0
    dataloader = DataLoader(dataset, batch_size=4, shuffle=False)

    with torch.no_grad():
        for batch_x, batch_y in dataloader:
            batch_x = batch_x.to(device)
            batch_y = batch_y.to(device)
            logits = model(batch_x)
            B, T, C = logits.shape
            loss = loss_fn(logits.view(B * T, C), batch_y.view(B * T))
            total_loss += loss.item()
            total_tokens += (batch_y != 0).sum().item()

    avg_loss = total_loss / total_tokens
    perplexity = math.exp(avg_loss)
    print(f"Average loss: {avg_loss:.4f}")
    print(f"Perplexity: {perplexity:.2f}")
    return perplexity

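# Rough, illustrative perplexity ranges. They depend heavily on the tokenizer,
# vocabulary size, and test set, so only compare models evaluated the same way:
#   Random model (untrained): ~vocab_size (e.g., 50,000)
#   After pretraining:        20-50 (for small models)
#   After fine-tuning:        5-20 (on in-domain data)
#   GPT-4 level:              < 10 (estimated)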
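The link between loss and perplexity can be checked without a model. The sketch below is not part of the chapter's code: the function names and the hand-written loss values are made up for illustration. It works on per-token losses with padding already removed, and it shows both the "uniform choice among 20 options" intuition and why averaging per-batch means can differ from averaging over tokens when batches hold different numbers of real tokens.

```python
import math

def perplexity_token_weighted(batches):
    """exp of the mean loss over all tokens.

    batches: list of lists of per-token cross-entropy losses (in nats),
             with padding positions already removed. (Hypothetical input format.)
    """
    total = sum(sum(b) for b in batches)
    count = sum(len(b) for b in batches)
    if count == 0:
        raise ValueError("no tokens to evaluate")
    return math.exp(total / count)

def perplexity_batch_mean(batches):
    """exp of the mean of per-batch mean losses (each batch weighted equally)."""
    means = [sum(b) / len(b) for b in batches if b]
    if not means:
        raise ValueError("no tokens to evaluate")
    return math.exp(sum(means) / len(means))

# A model that spreads probability uniformly over 20 options pays ln(20) per token.
uniform = [[math.log(20)] * 5, [math.log(20)] * 3]
ppl_uniform = perplexity_token_weighted(uniform)
print(f"Uniform over 20 options: {ppl_uniform:.2f}")

# Made-up losses: one batch with 9 easy tokens, one batch with a single hard token.
uneven = [[1.0] * 9, [5.0]]
ppl_tokens = perplexity_token_weighted(uneven)   # exp(14 / 10)
ppl_batches = perplexity_batch_mean(uneven)      # exp((1 + 5) / 2)
print(f"Token-weighted: {ppl_tokens:.2f}")
print(f"Batch-mean:     {ppl_batches:.2f}")
```

When every batch contains roughly the same number of non-padding tokens the two numbers nearly coincide. With heavily padded batches, the token-weighted form is the one that matches the definition of perplexity.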
Human Evaluation
Perplexity tells you how well the model predicts held-out text, not whether the responses it generates are any good. A model can score a respectable perplexity and still answer every question with an evasive “I don’t know,” which is useless in practice. Human evaluation remains the gold standard.
A simple human evaluation protocol:
def human_eval_template(model, tokenizer, test_prompts, device='cpu'):
    """
    Generate responses for human evaluation.
    Humans rate each response on:
    - Helpfulness (1-5): Does it answer the question?
    - Accuracy (1-5): Is the information correct?
    - Clarity (1-5): Is the response well-written?
    Args:
        model: The fine-tuned model.
        tokenizer: The tokenizer.
        test_prompts: List of instruction strings to test.
        device: 'cpu' or 'cuda'.
    """
    print("=" * 60)
    print("HUMAN EVALUATION")
    print("Rate each response: Helpfulness / Accuracy / Clarity (1-5)")
    print("=" * 60)
    for i, prompt in enumerate(test_prompts):
        response = generate_response(
            model, tokenizer, prompt,
            max_tokens=100, temperature=0.7, device=device
        )
        print(f"\n--- Example {i + 1} ---")
        print(f"Instruction: {prompt}")
        print(f"Response: {response}")
        print("Helpfulness: __ / 5")
        print("Accuracy: __ / 5")
        print("Clarity: __ / 5")

# Example test prompts
test_prompts = [
    "What is the boiling point of water?",
    "Explain why the sky is blue in simple terms.",
    "Write a short poem about rain.",
    "What are three benefits of exercise?",
]
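Once annotators have filled in the sheets, the scores need to be combined. One minimal way to do that is sketched below. The `aggregate_ratings` helper, the dictionary layout, and the ratings themselves are assumptions invented for this example, not something the chapter defines.

```python
from statistics import mean

CRITERIA = ("helpfulness", "accuracy", "clarity")

def aggregate_ratings(ratings):
    """Average the 1-5 ratings for each criterion.

    ratings: list of dicts, one per rated response, each holding a score
             for every name in CRITERIA. (Hypothetical format.)
    """
    if not ratings:
        raise ValueError("no ratings to aggregate")
    for rating in ratings:
        for criterion in CRITERIA:
            if not 1 <= rating[criterion] <= 5:
                raise ValueError(f"{criterion} must be between 1 and 5")
    return {c: mean(r[c] for r in ratings) for c in CRITERIA}

# Placeholder ratings, invented purely to show the mechanics
example_ratings = [
    {"helpfulness": 4, "accuracy": 5, "clarity": 4},
    {"helpfulness": 3, "accuracy": 5, "clarity": 4},
    {"helpfulness": 5, "accuracy": 4, "clarity": 3},
]

summary = aggregate_ratings(example_ratings)
for criterion, score in summary.items():
    print(f"{criterion:<12} {score:.2f} / 5")
```

Averages hide disagreement between annotators, so it is worth also looking at the individual scores for any response where the ratings diverge widely.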
Simple Benchmark Setup
For reproducible evaluation, create a benchmark—a fixed set of test examples with known correct answers:
def run_benchmark(model, tokenizer, benchmark, device='cpu'):
    """
    Run a simple benchmark to evaluate model quality.
    Args:
        model: The fine-tuned model.
        tokenizer: The tokenizer.
        benchmark: List of dicts with 'instruction' and 'expected' keys.
        device: 'cpu' or 'cuda'.
    Returns:
        score: Percentage of responses that contain the expected answer.
    """
    if not benchmark:
        raise ValueError("benchmark must contain at least one item")
    correct = 0
    for item in benchmark:
        response = generate_response(
            model, tokenizer, item['instruction'],
            max_tokens=50, temperature=0.1,  # Low temp for consistency
            device=device
        )
        # Simple check: does the response contain the expected answer?
        # This is a case-insensitive substring match, so "4" also matches "14".
        if item['expected'].lower() in response.lower():
            correct += 1
            status = "✓"
        else:
            status = "✗"
        print(f"{status} Instruction: {item['instruction']}")
        print(f"  Expected: {item['expected']}")
        print(f"  Got: {response[:80]}...")
        print()
    score = 100 * correct / len(benchmark)
    print(f"Score: {correct}/{len(benchmark)} ({score:.1f}%)")
    return score

# Example benchmark
simple_benchmark = [
    {"instruction": "What is 2 + 2?", "expected": "4"},
    {"instruction": "What is the capital of France?", "expected": "Paris"},
    {"instruction": "What color is the sky?", "expected": "blue"},
    {"instruction": "How many days are in a week?", "expected": "7"},
    {"instruction": "What is H2O?", "expected": "water"},
]
This is a very simple benchmark—production models are evaluated on much more sophisticated ones like MMLU, HellaSwag, or HumanEval. But the principle is identical: fixed test set, clear success criteria, reproducible results.
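"Clear success criteria" deserves a closer look, because a plain substring check is forgiving: the expected answer "4" is also found inside "14". A slightly stricter check is sketched here with a hypothetical helper named `contains_answer`, which only accepts the expected answer when it is not glued to other letters or digits. This is one possible rule, not the one used by the benchmarks named above.

```python
import re

def contains_answer(response, expected):
    """Return True if `expected` appears in `response` as a standalone token.

    Case-insensitive. The answer must not be directly preceded or followed
    by a letter, digit, or underscore.
    """
    pattern = r"(?<!\w)" + re.escape(expected) + r"(?!\w)"
    return re.search(pattern, response, flags=re.IGNORECASE) is not None

print(contains_answer("The answer is 4.", "4"))       # standalone "4"
print(contains_answer("It is 14.", "4"))              # "4" is part of "14"
print(contains_answer("paris, of course", "Paris"))   # case is ignored
print("4" in "It is 14.")                             # plain substring check
```

Even this rule has blind spots (it would accept "4" inside "4.5", and it rejects "seven" when "7" is expected), which is one reason larger benchmarks often use multiple-choice formats or executable tests.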
8. Exercises
Exercise 1: Extend the Instruction Dataset
Create a fine-tuning dataset with at least 20 instruction-response pairs covering three categories: factual questions, creative writing, and math problems. Implement your InstructionDataset class and verify it produces correct input-target pairs.
Solution
import torch
from torch.utils.data import Dataset
# Extended instruction dataset with three categories
instruction_data_extended = [
# === Factual Questions ===
{"instruction": "What is the capital of Japan?",
"response": "The capital of Japan is Tokyo."},
{"instruction": "Who wrote Romeo and Juliet?",
"response": "Romeo and Juliet was written by William Shakespeare."},
{"instruction": "What is the largest planet in our solar system?",
"response": "The largest planet in our solar system is Jupiter."},
{"instruction": "What year did World War II end?",
"response": "World War II ended in 1945."},
{"instruction": "What is the chemical symbol for gold?",
"response": "The chemical symbol for gold is Au."},
{"instruction": "How many continents are there?",
"response": "There are seven continents on Earth."},
{"instruction": "What is the speed of light?",
"response": "The speed of light in a vacuum is exactly 299,792,458 meters per second."},
# === Creative Writing ===
{"instruction": "Write a one-sentence story about a robot.",
"response": "The last robot on Earth spent its days painting sunsets it had never been programmed to appreciate."},
{"instruction": "Write a haiku about the ocean.",
"response": "Waves crash on the shore\nSalt and foam dance in the wind\nThe tide pulls back home"},
{"instruction": "Describe a forest in two sentences.",
"response": "Towering oaks stretched toward the sky, their branches interlocking like fingers. Below, ferns carpeted the forest floor in a thousand shades of green."},
{"instruction": "Write a short limerick.",
"response": "A coder who worked through the night\nFound a bug that just wouldn't go right\nShe traced every line\nThe fix was just fine\nA semicolon missing from sight."},
{"instruction": "Write a metaphor about time.",
"response": "Time is a river that flows in only one direction, carrying us forward whether we paddle or not."},
{"instruction": "Describe rain using only sounds.",
"response": "Pitter-patter, drip-drop, whoosh, splash, drumming, tinkling, rushing, gurgling."},
# === Math Problems ===
{"instruction": "What is 25 times 4?",
"response": "25 times 4 is 100."},
{"instruction": "What is the square root of 144?",
"response": "The square root of 144 is 12."},
{"instruction": "If a train travels 60 mph for 3 hours, how far does it go?",
"response": "The train travels 180 miles (60 mph × 3 hours = 180 miles)."},
{"instruction": "What is 15% of 200?",
"response": "15% of 200 is 30."},
{"instruction": "What is 2 to the power of 10?",
"response": "2 to the power of 10 is 1024."},
{"instruction": "What is the area of a rectangle with width 5 and height 8?",
"response": "The area is 40 square units (5 × 8 = 40)."},
{"instruction": "Solve: 3x + 6 = 21. What is x?",
"response": "x = 5. Subtract 6 from both sides: 3x = 15. Divide by 3: x = 5."},
]
def format_example(example):
    """Format an instruction-response pair into a training string."""
    return (
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['response']}"
    )

class InstructionDataset(Dataset):
    def __init__(self, data, tokenizer, max_length=128):
        self.examples = []
        for item in data:
            text = format_example(item)
            token_ids = tokenizer.encode(text)
            # Truncate long examples, then pad short ones with token id 0
            if len(token_ids) > max_length:
                token_ids = token_ids[:max_length]
            while len(token_ids) < max_length:
                token_ids.append(0)
            self.examples.append(token_ids)

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        token_ids = self.examples[idx]
        # Input is every token but the last; target is the same sequence shifted by one
        x = torch.tensor(token_ids[:-1], dtype=torch.long)
        y = torch.tensor(token_ids[1:], dtype=torch.long)
        return x, y

# Verify the dataset
print(f"Total examples: {len(instruction_data_extended)}")
print("Categories: factual (7), creative (6), math (7)")
print()
# Show a formatted example from each category
for category_start in [0, 7, 13]:
    example = instruction_data_extended[category_start]
    print("--- Example ---")
    print(format_example(example))
    print()
# Output:
# Total examples: 20
# Categories: factual (7), creative (6), math (7)
#
# --- Example ---
# ### Instruction:
# What is the capital of Japan?
#
# ### Response:
# The capital of Japan is Tokyo.
#
# --- Example ---
# ### Instruction:
# Write a one-sentence story about a robot.
#
# ### Response:
# The last robot on Earth spent its days painting sunsets it had
# never been programmed to appreciate.
#
# --- Example ---
# ### Instruction:
# What is 25 times 4?
#
# ### Response:
# 25 times 4 is 100.
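The exercise also asks you to verify the input-target pairs. One way to check the shift-by-one logic without a trained tokenizer is sketched below in plain Python. `CharTokenizer`, `pad_or_truncate`, and `make_pair` are hypothetical stand-ins written for this check: they mirror the padding and slicing steps of `InstructionDataset` but are not the chapter's classes.

```python
class CharTokenizer:
    """Hypothetical stand-in tokenizer: one id per character."""
    def encode(self, text):
        # +1 keeps id 0 free, so 0 can only ever mean "padding"
        return [ord(ch) + 1 for ch in text]

def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut to max_length, then right-pad with pad_id."""
    ids = token_ids[:max_length]
    return ids + [pad_id] * (max_length - len(ids))

def make_pair(token_ids):
    """Input is all tokens but the last; target is the sequence shifted by one."""
    return token_ids[:-1], token_ids[1:]

text = (
    "### Instruction:\nWhat is 25 times 4?\n\n"
    "### Response:\n25 times 4 is 100."
)
ids = pad_or_truncate(CharTokenizer().encode(text), max_length=128)
x, y = make_pair(ids)

# Each target token must be the input token one position later
assert len(x) == len(y) == 127
assert all(y[i] == x[i + 1] for i in range(len(x) - 1))
# Everything after the real text must be padding
assert all(t == 0 for t in ids[len(text):])
print(f"{len(text)} real tokens, {ids.count(0)} padding tokens, shift check passed")
```

The same three assertions can be applied to the tensors returned by `InstructionDataset.__getitem__` once a real tokenizer is plugged in.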
Exercise 2: Implement LoRA with Different Ranks
Implement LoRA layers with ranks 2, 8, and 32. Compare the number of trainable parameters for each rank, applied to a linear layer with dimensions 1024 × 1024. Discuss the trade-off between parameter count and expressiveness.
Solution
import torch
import torch.nn as nn
import math

class LoRALayer(nn.Module):
    """LoRA layer implementation (same as in the chapter)."""
    def __init__(self, original_layer, rank=8, alpha=16):
        super().__init__()
        self.original_layer = original_layer
        self.rank = rank
        self.alpha = alpha
        in_features = original_layer.in_features
        out_features = original_layer.out_features
        # Freeze the pretrained weights; only the LoRA matrices are trained
        for param in self.original_layer.parameters():
            param.requires_grad = False
        self.lora_A = nn.Parameter(torch.zeros(in_features, rank))
        self.lora_B = nn.Parameter(torch.zeros(rank, out_features))
        # A is randomly initialized and B stays zero, so the update starts at zero
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        self.scaling = alpha / rank

    def forward(self, x):
        original_output = self.original_layer(x)
        lora_update = x @ self.lora_A @ self.lora_B * self.scaling
        return original_output + lora_update

# Compare different ranks
d = 1024
ranks = [2, 8, 32]
print(f"Original layer: {d} × {d} = {d * d:,} parameters\n")
print(f"{'Rank':<8} {'LoRA Params':<15} {'Reduction':<12} {'% of Original':<15}")
print("-" * 50)
for rank in ranks:
    original = nn.Linear(d, d, bias=False)
    lora = LoRALayer(original, rank=rank, alpha=rank * 2)
    original_params = d * d
    lora_params = sum(p.numel() for p in lora.parameters() if p.requires_grad)
    reduction = original_params / lora_params
    percentage = 100 * lora_params / original_params
    # Build "256.0x" first so the column padding comes after the "x"
    reduction_str = f"{reduction:.1f}x"
    print(f"{rank:<8} {lora_params:<15,} {reduction_str:<12} {percentage:.2f}%")

# Verify outputs match shapes
print("\nShape verification:")
x = torch.randn(2, 10, d)  # batch=2, seq_len=10, dim=1024
for rank in ranks:
    original = nn.Linear(d, d, bias=False)
    lora = LoRALayer(original, rank=rank)
    output = lora(x)
    print(f"  Rank {rank:>2}: input {tuple(x.shape)} → output {tuple(output.shape)}")

# Output:
# Original layer: 1024 × 1024 = 1,048,576 parameters
#
# Rank     LoRA Params     Reduction    % of Original
# --------------------------------------------------
# 2        4,096           256.0x       0.39%
# 8        16,384          64.0x        1.56%
# 32       65,536          16.0x        6.25%
#
# Shape verification:
#   Rank  2: input (2, 10, 1024) → output (2, 10, 1024)
#   Rank  8: input (2, 10, 1024) → output (2, 10, 1024)
#   Rank 32: input (2, 10, 1024) → output (2, 10, 1024)
# Discussion:
# - Rank 2: Extreme compression (0.39%). Very few trainable parameters.
# Good for simple adaptations (e.g., style transfer).
# May not capture complex behavioral changes.
#
# - Rank 8: The typical default. Good balance between parameter efficiency
# and expressiveness. Works well for most fine-tuning tasks.
#
# - Rank 32: More expressive but less efficient. Useful when fine-tuning
# requires significant behavioral changes. Still 16x fewer
# parameters than full fine-tuning.
#
# The trade-off: lower rank = fewer parameters = faster training = less
# memory, but potentially less capacity to learn complex new behaviors.
# In practice, rank 8-16 works well for most applications.
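The numbers in the table follow from a simple count: matrix A has `in_features × rank` entries and matrix B has `rank × out_features`, so one adapter trains `rank × (in_features + out_features)` parameters. The helper names below are made up for illustration, and the count ignores any bias term in the original layer (the exercise uses `bias=False`). The 1024 → 4096 layer is an assumed extra example to show that the formula also covers non-square layers.

```python
def lora_param_count(in_features, out_features, rank):
    """Trainable parameters in one LoRA adapter: A is (in, r), B is (r, out)."""
    return rank * (in_features + out_features)

def lora_fraction(in_features, out_features, rank):
    """LoRA parameters as a fraction of the original weight matrix (bias ignored)."""
    return lora_param_count(in_features, out_features, rank) / (in_features * out_features)

# Square 1024 × 1024 layer, as in the exercise
for rank in (2, 8, 32):
    n = lora_param_count(1024, 1024, rank)
    pct = 100 * lora_fraction(1024, 1024, rank)
    print(f"rank {rank:>2}: {n:>6,} trainable parameters ({pct:.2f}% of the layer)")

# Assumed example of a non-square layer (1024 inputs, 4096 outputs)
n_rect = lora_param_count(1024, 4096, 8)
pct_rect = 100 * lora_fraction(1024, 4096, 8)
print(f"1024 → 4096, rank 8: {n_rect:,} trainable parameters ({pct_rect:.2f}% of the layer)")
```

Because the count grows linearly with rank while the full matrix grows with the product of its dimensions, the relative savings get larger as layers get wider.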
Exercise 3: Build a Perplexity Comparison
Write code that computes perplexity on a held-out test set before and after fine-tuning. Show that fine-tuning reduces perplexity on in-domain data. Explain what it means if perplexity increases on out-of-domain data.
Solution
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
import math
import copy
def compute_perplexity(model, dataset, device='cpu'):
    """Compute perplexity of a model on a dataset."""
    model.eval()
    model = model.to(device)
    loss_fn = nn.CrossEntropyLoss(ignore_index=0)  # skip padding (token id 0)
    total_loss = 0.0
    total_batches = 0
    dataloader = DataLoader(dataset, batch_size=4, shuffle=False)
    with torch.no_grad():
        for batch_x, batch_y in dataloader:
            batch_x = batch_x.to(device)
            batch_y = batch_y.to(device)
            logits = model(batch_x)
            B, T, C = logits.shape
            loss = loss_fn(logits.view(B * T, C), batch_y.view(B * T))
            total_loss += loss.item()
            total_batches += 1
    if total_batches == 0:
        return float('inf')
    avg_loss = total_loss / total_batches
    perplexity = math.exp(min(avg_loss, 100))  # Cap to avoid overflow
    return perplexity

def compare_perplexity(model_before, model_after, in_domain_data,
                       out_domain_data, device='cpu'):
    """
    Compare perplexity before and after fine-tuning on two datasets.
    Args:
        model_before: Model weights before fine-tuning.
        model_after: Model weights after fine-tuning.
        in_domain_data: Test data similar to fine-tuning data.
        out_domain_data: Test data different from fine-tuning data.
        device: 'cpu' or 'cuda'.
    """
    print("=" * 55)
    print("PERPLEXITY COMPARISON: Before vs After Fine-tuning")
    print("=" * 55)
    # In-domain evaluation
    ppl_before_in = compute_perplexity(model_before, in_domain_data, device)
    ppl_after_in = compute_perplexity(model_after, in_domain_data, device)
    print("\nIn-domain data (similar to fine-tuning data):")
    print(f"  Before fine-tuning: {ppl_before_in:.2f}")
    print(f"  After fine-tuning: {ppl_after_in:.2f}")
    print(f"  Change: {ppl_after_in - ppl_before_in:+.2f} ", end="")
    if ppl_after_in < ppl_before_in:
        print("(improved ✓)")
    else:
        print("(worsened ✗)")
    # Out-of-domain evaluation
    ppl_before_out = compute_perplexity(model_before, out_domain_data, device)
    ppl_after_out = compute_perplexity(model_after, out_domain_data, device)
    print("\nOut-of-domain data (different from fine-tuning data):")
    print(f"  Before fine-tuning: {ppl_before_out:.2f}")
    print(f"  After fine-tuning: {ppl_after_out:.2f}")
    print(f"  Change: {ppl_after_out - ppl_before_out:+.2f} ", end="")
    if ppl_after_out < ppl_before_out:
        print("(improved ✓)")
    else:
        print("(worsened ✗)")
    print("\n--- Interpretation ---")
    print("In-domain perplexity should DECREASE after fine-tuning.")
    print("The model becomes more confident on data similar to what")
    print("it was fine-tuned on.\n")
    print("Out-of-domain perplexity may INCREASE after fine-tuning.")
    print("This is called 'catastrophic forgetting': the model")
    print("becomes so specialized on the fine-tuning data that it")
    print("gets worse on other types of text. This is why:")
    print("  1. We use low learning rates during fine-tuning")
    print("  2. We train for few epochs (to avoid overfitting)")
    print("  3. LoRA helps: by only changing a small number of")
    print("     parameters, it preserves most of the original")
    print("     model's general knowledge")
# Usage (conceptual — requires trained models):
#
# # Save model before fine-tuning
# model_before = copy.deepcopy(model)
#
# # Fine-tune the model
# fine_tune(model, train_dataset, num_epochs=3)
#
# # Compare perplexity
# compare_perplexity(
# model_before, model,
# in_domain_test, out_domain_test
# )
#
# Illustrative output (example numbers only; your values will differ):
# =====================================================
# PERPLEXITY COMPARISON: Before vs After Fine-tuning
# =====================================================
#
# In-domain data (similar to fine-tuning data):
# Before fine-tuning: 45.32
# After fine-tuning: 12.18
# Change: -33.14 (improved ✓)
#
# Out-of-domain data (different from fine-tuning data):
# Before fine-tuning: 38.76
# After fine-tuning: 42.91
# Change: +4.15 (worsened ✗)
Summary
In this chapter, we transformed a base language model from a text-completion engine into a useful tool:
- Fine-tuning takes a pretrained model and continues training it on curated data. It’s dramatically cheaper than pretraining because you’re leveraging existing knowledge, which is the essence of transfer learning.
- Instruction datasets teach the model new behavior through example. By training on thousands of instruction-response pairs, the model learns the meta-pattern: “when given an instruction, produce a helpful response.”
- The fine-tuning loop is almost identical to pretraining, but with a lower learning rate, fewer epochs, and carefully formatted data. Gradient clipping guards against unstable updates, and a padding-aware loss keeps padding tokens from contributing to training.
- LoRA makes fine-tuning accessible by adding tiny trainable matrices alongside frozen pretrained weights. A rank-8 LoRA adapter on a 1024 × 1024 layer trains about 1.6% as many parameters as the layer itself, yet it can capture the adaptations most fine-tuning tasks need.
- Instruction tuning is the specific application of fine-tuning that produces ChatGPT-like behavior. It teaches models to follow prompts rather than just complete text.
- RLHF goes further, using human preferences to teach the model not just to respond, but to respond well. Its three-step process (supervised fine-tuning, reward model training, and reinforcement learning) is what makes modern AI assistants feel helpful and aligned.
- Evaluation combines automatic metrics (perplexity) with human judgment. No single metric captures everything, so practical evaluation uses both.
You’ve now seen the complete journey from raw text to a model that can follow instructions and answer questions. The remaining frontier—larger scales, better alignment techniques, and real-world deployment—builds on everything we’ve covered in this book.