Building Large Language Models from Scratch: A Beginner’s Guide with Python and PyTorch

Fine-tuning and Applications — Making Your Model Useful

35 min read Chapter 10 of 11
Summary

This chapter covers making a pretrained LLM useful through fine-tuning. Transfer learning leverages pretrained weights instead of training from scratch. Fine-tuning datasets are prepared in instruction-response format. A complete fine-tuning loop trains the model on new data with lower learning rates. LoRA (Low-Rank Adaptation) enables efficient fine-tuning by adding small trainable matrices to frozen weights, dramatically reducing trainable parameters. Instruction tuning teaches models to follow prompts. RLHF (Reinforcement Learning from Human Feedback) is explained conceptually as the process behind ChatGPT-like behavior. Model evaluation uses perplexity and human assessment.

In the previous chapters, we built a Transformer language model from scratch, trained it on raw text, and watched it learn to generate coherent English. That’s a remarkable achievement. But if you actually try to use your base model, you’ll notice something frustrating: it can finish sentences, but it can’t answer questions. It can generate paragraphs of text, but it won’t follow instructions. It speaks English, but it doesn’t speak helpful.

This is the gap between a base model and a useful model. Every LLM you’ve ever interacted with—ChatGPT, Claude, Llama—went through an additional training phase after pretraining. That phase is called fine-tuning, and it’s the subject of this chapter.

1. What is Fine-tuning?

The Analogy

Think of pretraining like raising a child. Over years, the child learns English—grammar, vocabulary, how sentences work, facts about the world. By age ten, the child can speak fluently. But can the child write a professional email? Diagnose a medical symptom? Explain quantum physics to a five-year-old?

No. The child knows English, but hasn’t been trained for any specific task.

Fine-tuning is what happens next. You send the child to school, where a teacher says: “When someone asks you a question, answer it clearly.” Or: “When given a topic, write a structured essay.” The child doesn’t need to relearn English—they already know it. They just need to learn how to apply their knowledge in a specific way.

That’s exactly what fine-tuning does to a language model. The base model already understands language—word meanings, grammar, facts. Fine-tuning teaches it behavior: how to respond to instructions, how to answer questions, how to be helpful rather than just predictive.

Transfer Learning

This principle is called transfer learning: take knowledge learned in one context (pretraining on general text) and transfer it to a new context (following instructions, answering questions, writing code).

Why not just train a model from scratch on instruction-response data? Because pretraining is enormously expensive. GPT-3 cost millions of dollars and months of compute time to pretrain. Fine-tuning, by contrast, can be done in hours on a single GPU with a small dataset. You’re leveraging all the knowledge already baked into the model’s weights.

Pretraining:  Months + Millions of $  →  Understands language
Fine-tuning:  Hours  + Hundreds of $  →  Follows instructions

The ratio of effort is staggering. Pretraining builds the foundation; fine-tuning adds the finishing touches. But those finishing touches are what make the difference between a model that rambles aimlessly and one that says: “Here’s a clear answer to your question.”

What Changes During Fine-tuning?

Mechanically, fine-tuning is just more training. You take the pretrained model’s weights—the same attention matrices, feed-forward layers, and embeddings we built in earlier chapters—and continue training them on a new, curated dataset. The differences are:

  1. The dataset is different. Instead of raw web text, you use carefully formatted instruction-response pairs.
  2. The learning rate is lower. You don’t want to overwrite what the model already knows—you want to gently nudge it.
  3. The training is shorter. A few hundred to a few thousand steps, not millions.
  4. Sometimes you freeze some layers. You might only update the last few layers, keeping earlier layers (which capture general language patterns) unchanged.

2. Preparing a Fine-tuning Dataset

The quality of your fine-tuning dataset matters more than its size. A base model trained on billions of tokens can be meaningfully fine-tuned with just a few thousand high-quality examples.

Data Formats

Fine-tuning data comes in several formats, depending on what behavior you want to teach:

Instruction-Response Pairs — The most common format. Each example has an instruction and the desired response:

{"instruction": "Summarize the following text in one sentence.",
 "input": "The Eiffel Tower was built in 1889 for the World's Fair...",
 "response": "The Eiffel Tower, built in 1889 for the World's Fair, is a famous Parisian landmark."}

Question-Answer Pairs — A simpler variant:

{"question": "What is the capital of France?",
 "answer": "The capital of France is Paris."}

Conversation Format — For chat models, data looks like a multi-turn conversation:

{"messages": [
    {"role": "user", "content": "What is photosynthesis?"},
    {"role": "assistant", "content": "Photosynthesis is the process by which plants convert sunlight into energy..."}
]}

For our implementation, we’ll use the instruction-response format since it’s the simplest and most widely used.
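Data collected in the other two formats can still be reused. The sketch below maps any of the three record shapes shown above onto an instruction-response pair. The helper name `to_instruction_pair`, the single-turn simplification for conversations, and the way the optional `input` field is appended to the instruction are all illustrative assumptions, not a standard.

```python
def to_instruction_pair(record):
    """Map one record (any of the three formats) to an instruction-response dict.

    Hypothetical helper for illustration only.
    """
    if "messages" in record:
        # Conversation format: keep only the first user turn and the
        # first assistant turn (a single-turn simplification).
        user = next(m["content"] for m in record["messages"] if m["role"] == "user")
        assistant = next(m["content"] for m in record["messages"] if m["role"] == "assistant")
        return {"instruction": user, "response": assistant}

    if "question" in record:
        # Question-answer format: a direct rename of the two fields.
        return {"instruction": record["question"], "response": record["answer"]}

    # Instruction-response format, with an optional "input" field that is
    # appended to the instruction so the model sees both.
    instruction = record["instruction"]
    if record.get("input"):
        instruction = f"{instruction}\n\n{record['input']}"
    return {"instruction": instruction, "response": record["response"]}


qa = {"question": "What is the capital of France?",
      "answer": "The capital of France is Paris."}
chat = {"messages": [
    {"role": "user", "content": "What is photosynthesis?"},
    {"role": "assistant", "content": "Photosynthesis is how plants turn sunlight into energy."},
]}
print(to_instruction_pair(qa))
print(to_instruction_pair(chat))
```

Normalizing everything to one shape up front keeps the rest of the pipeline (formatting, tokenizing, batching) unaware of where each example came from.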

Creating a Simple Instruction Dataset

Let’s build a small instruction dataset in code. In practice, you’d load this from a file, but creating it explicitly helps us understand the format:

# A tiny instruction-tuning dataset
instruction_data = [
    {
        "instruction": "What is the capital of France?",
        "response": "The capital of France is Paris."
    },
    {
        "instruction": "Explain gravity in simple terms.",
        "response": "Gravity is the force that pulls objects toward each other. The Earth's gravity keeps us on the ground and the Moon in orbit."
    },
    {
        "instruction": "Write a haiku about programming.",
        "response": "Code flows line by line\nBugs hide in the smallest place\nTests reveal the truth"
    },
    {
        "instruction": "What is 15 multiplied by 7?",
        "response": "15 multiplied by 7 is 105."
    },
    {
        "instruction": "Translate 'hello' to Spanish.",
        "response": "The Spanish translation of 'hello' is 'hola'."
    },
    {
        "instruction": "List three prime numbers.",
        "response": "Three prime numbers are 2, 3, and 5."
    },
    {
        "instruction": "What does CPU stand for?",
        "response": "CPU stands for Central Processing Unit."
    },
    {
        "instruction": "Summarize what machine learning is in one sentence.",
        "response": "Machine learning is a field of computer science where algorithms learn patterns from data to make predictions or decisions without being explicitly programmed."
    },
]

print(f"Dataset size: {len(instruction_data)} examples")
# Output: Dataset size: 8 examples
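Since real datasets usually live in files, here is one possible way to store and load examples, assuming a JSON Lines layout (one JSON object per line). The function names `save_jsonl` and `load_jsonl` and the file name are hypothetical; only the standard-library `json` module is used.

```python
import json
import os
import tempfile


def save_jsonl(examples, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for example in examples:
            f.write(json.dumps(example, ensure_ascii=False) + "\n")


def load_jsonl(path):
    """Read instruction-response examples, checking the required keys."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            example = json.loads(line)
            missing = {"instruction", "response"} - example.keys()
            if missing:
                raise ValueError(f"line {line_number}: missing keys {sorted(missing)}")
            examples.append(example)
    return examples


sample = [
    {"instruction": "What does CPU stand for?",
     "response": "CPU stands for Central Processing Unit."},
    {"instruction": "Write a haiku about programming.",
     "response": "Code flows line by line\nBugs hide in the smallest place\nTests reveal the truth"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "instructions.jsonl")  # hypothetical file name
    save_jsonl(sample, path)
    loaded = load_jsonl(path)

print(f"Loaded {len(loaded)} examples")
```

Newlines inside a response (as in the haiku) are escaped by `json.dumps`, so each record still occupies exactly one line of the file.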

Formatting for the Model

The model needs to see the instruction and response as a single sequence of text, with clear markers separating the two parts. A common pattern uses special tokens or textual delimiters:

def format_example(example):
    """
    Format an instruction-response pair into a single training string.

    The model learns to generate everything after '### Response:',
    given everything before it as context.
    """
    text = (
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['response']}"
    )
    return text


# See what a formatted example looks like
formatted = format_example(instruction_data[1])
print(formatted)
# Output:
# ### Instruction:
# Explain gravity in simple terms.
#
# ### Response:
# Gravity is the force that pulls objects toward each other. The Earth's
# gravity keeps us on the ground and the Moon in orbit.

During training, the loss covers the entire sequence, instruction and response alike. During inference, we provide only the instruction part, ending with the “### Response:” marker, and let the model generate the response.

Implementing a Fine-tuning Dataset Class

Now let’s build a proper PyTorch Dataset that tokenizes and prepares these examples for training:

import torch
from torch.utils.data import Dataset


class InstructionDataset(Dataset):
    """
    A PyTorch Dataset for instruction-tuning.

    Each example is formatted as:
        ### Instruction:
        <the instruction>

        ### Response:
        <the response>

    The entire sequence is tokenized and used as both input and target
    (shifted by one position, as in standard language modeling).
    """

    def __init__(self, data, tokenizer, max_length=128):
        """
        Args:
            data: List of dicts with 'instruction' and 'response' keys.
            tokenizer: A tokenizer with encode() method.
            max_length: Maximum sequence length (truncate or pad to this).
        """
        self.examples = []

        for item in data:
            # Format the instruction-response pair
            text = format_example(item)

            # Tokenize
            token_ids = tokenizer.encode(text)

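            # Truncate if too long (a no-op for shorter sequences)
            token_ids = list(token_ids[:max_length])

            # Pad if too short. Id 0 is assumed to be a dedicated padding
            # token; if your tokenizer assigns 0 to a real token, use a
            # different pad id here and in the loss's ignore_index.
            token_ids += [0] * (max_length - len(token_ids))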

            self.examples.append(token_ids)

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        token_ids = self.examples[idx]
        # Input is everything except the last token
        x = torch.tensor(token_ids[:-1], dtype=torch.long)
        # Target is everything except the first token (shifted by 1)
        y = torch.tensor(token_ids[1:], dtype=torch.long)
        return x, y

This is the same autoregressive setup we used in pretraining: the input is the sequence without its last token, and the target is the same sequence without its first token, so the target at each position is the token that comes next. The model learns to predict the next token at every position. The difference is that the data now contains instruction-response pairs, so the model learns the pattern: “after an instruction, produce a helpful response.”
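To make the one-position shift concrete, here is a small sketch using plain Python lists in place of tensors. The token ids are made up for illustration; the slicing is the same as in `__getitem__` above.

```python
# Made-up token ids standing in for one tokenized training example
token_ids = [11, 12, 13, 14, 15]

x = token_ids[:-1]  # input:  everything except the last token
y = token_ids[1:]   # target: everything except the first token

for position, target in enumerate(y):
    # With causal attention, position i can see inputs 0..i
    visible = x[: position + 1]
    print(f"position {position}: sees {visible} -> should predict {target}")
```

Each position is its own training signal, so a single sequence of length n yields n - 1 next-token predictions.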

3. Fine-tuning Implementation

Let’s implement the full fine-tuning pipeline. We’ll assume we have a pretrained GPT-like model from earlier chapters.

Setting Up the Pretrained Model

First, we load our pretrained model and prepare it for fine-tuning:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader


def setup_fine_tuning(model, learning_rate=1e-5):
    """
    Prepare a pretrained model for fine-tuning.

    Key differences from pretraining:
    - Much lower learning rate (1e-5 vs 1e-3)
    - Optionally freeze early layers
    - Shorter training schedule

    Args:
        model: A pretrained language model.
        learning_rate: Learning rate for fine-tuning (should be small).

    Returns:
        optimizer: Configured optimizer.
    """
    # Use a MUCH lower learning rate than pretraining
    # Pretraining might use 1e-3; fine-tuning uses 1e-5 or smaller
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=learning_rate,
        weight_decay=0.01
    )

    return optimizer


def count_parameters(model):
    """Count trainable parameters."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Total parameters:     {total:,}")
    print(f"Trainable parameters: {trainable:,}")
    return total, trainable

Optionally Freezing Layers

Sometimes you want to freeze the earlier layers of the model—the ones that capture general language patterns—and only fine-tune the later layers:

def freeze_early_layers(model, num_layers_to_freeze):
    """
    Freeze the first N transformer layers.

    Frozen layers won't be updated during fine-tuning.
    This preserves general language knowledge while allowing
    the later layers to adapt to the new task.

    Args:
        model: The transformer model.
        num_layers_to_freeze: How many layers to freeze (from the bottom).
    """
    # Freeze the embedding layer (it rarely needs updating)
    for param in model.embedding.parameters():
        param.requires_grad = False

    # Freeze the first N transformer blocks
    for i in range(num_layers_to_freeze):
        for param in model.layers[i].parameters():
            param.requires_grad = False

    # Print what's frozen vs trainable
    frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Frozen parameters:    {frozen:,}")
    print(f"Trainable parameters: {trainable:,}")
    print(f"Percentage trainable: {100 * trainable / (frozen + trainable):.1f}%")

The Fine-tuning Loop

The training loop itself is almost identical to pretraining—the key difference is the data and the hyperparameters:

def fine_tune(model, dataset, num_epochs=3, batch_size=4,
              learning_rate=1e-5, device='cpu'):
    """
    Fine-tune a pretrained model on an instruction dataset.

    Args:
        model: Pretrained language model.
        dataset: InstructionDataset instance.
        num_epochs: Number of passes through the data.
        batch_size: Examples per batch.
        learning_rate: Small learning rate for fine-tuning.
        device: 'cpu' or 'cuda'.

    Returns:
        losses: List of loss values for monitoring.
    """
    model = model.to(device)
    model.train()

    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=learning_rate,
        weight_decay=0.01
    )
    loss_fn = nn.CrossEntropyLoss(ignore_index=0)  # Ignore padding tokens

    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

    losses = []

    for epoch in range(num_epochs):
        epoch_loss = 0.0
        num_batches = 0

        for batch_x, batch_y in dataloader:
            batch_x = batch_x.to(device)  # [batch_size, seq_len]
            batch_y = batch_y.to(device)  # [batch_size, seq_len]

            # Forward pass
            logits = model(batch_x)       # [batch_size, seq_len, vocab_size]

            # Reshape for loss computation
            # CrossEntropyLoss expects [N, C] and [N]
            B, T, C = logits.shape
            logits = logits.view(B * T, C)
            targets = batch_y.view(B * T)

            # Compute loss
            loss = loss_fn(logits, targets)

            # Backward pass
            optimizer.zero_grad()
            loss.backward()

            # Gradient clipping (important for stability during fine-tuning)
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

            optimizer.step()

            epoch_loss += loss.item()
            num_batches += 1

        avg_loss = epoch_loss / num_batches
        losses.append(avg_loss)
        print(f"Epoch {epoch + 1}/{num_epochs}, Loss: {avg_loss:.4f}")

    return losses

Observing the Effect of Fine-tuning

Let’s write a function to see how the model’s behavior changes before and after fine-tuning:

def generate_response(model, tokenizer, instruction, max_tokens=100,
                      temperature=0.7, device='cpu'):
    """
    Generate a response to an instruction using the fine-tuned model.

    Args:
        model: The language model.
        tokenizer: Tokenizer with encode/decode methods.
        instruction: The instruction text.
        max_tokens: Maximum tokens to generate.
        temperature: Sampling temperature (lower = more deterministic).
        device: 'cpu' or 'cuda'.

    Returns:
        response: The generated response text.
    """
    model.eval()

    # Format the prompt the same way as training data
    prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    token_ids = tokenizer.encode(prompt)
    input_ids = torch.tensor([token_ids], dtype=torch.long).to(device)

    with torch.no_grad():
        for _ in range(max_tokens):
            # Get predictions
            logits = model(input_ids)
            # Take the last token's predictions
            next_token_logits = logits[:, -1, :] / temperature

            # Sample from the distribution
            probs = torch.softmax(next_token_logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)

            # Append to sequence
            input_ids = torch.cat([input_ids, next_token], dim=1)

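            # Stop if we generate an end-of-sequence token.
            # Not every tokenizer exposes eos_token_id, so look it up
            # defensively and skip the check when it is missing.
            eos_id = getattr(tokenizer, "eos_token_id", None)
            if eos_id is not None and next_token.item() == eos_id:
                break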

    # Decode only the generated part (after the prompt)
    generated_ids = input_ids[0, len(token_ids):].tolist()
    response = tokenizer.decode(generated_ids)

    return response


# Example usage (conceptual — requires a trained model):
#
# BEFORE fine-tuning:
#   Instruction: "What is the capital of France?"
#   Model output: "the capital of France the country in Europe which has many..."
#   (The base model just continues the text aimlessly)
#
# AFTER fine-tuning:
#   Instruction: "What is the capital of France?"
#   Model output: "The capital of France is Paris."
#   (The fine-tuned model gives a direct, helpful answer)

The transformation is dramatic. The base model treats the instruction as just more text to continue. The fine-tuned model recognizes the pattern: instruction comes in, helpful response goes out.
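The `temperature` argument of `generate_response` also shapes what you see when comparing outputs. The sketch below applies the same step (divide the logits by the temperature, then softmax) to three made-up logits using only the standard library, so the effect can be inspected without a trained model. The helper name is hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [logit / temperature for logit in logits]
    largest = max(scaled)  # subtract the max to avoid overflow in exp()
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for temperature in (0.2, 0.7, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])

cold = softmax_with_temperature(logits, 0.2)
hot = softmax_with_temperature(logits, 2.0)
```

Low temperatures concentrate almost all probability on the top-scoring token, while high temperatures flatten the distribution and make sampling more varied.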

4. LoRA — Efficient Fine-tuning

The Problem with Full Fine-tuning

Full fine-tuning updates every parameter in the model. For a small model like ours, that’s fine. But for a model with 70 billion parameters, such as Llama 2 70B, updating all of them requires:

  • Storing all 70 billion parameters in memory
  • Storing gradients for all 70 billion parameters
  • Storing optimizer states (AdamW keeps two extra copies per parameter)

That’s roughly 280 billion floating-point numbers in memory. At 4 bytes each (32-bit floats), that comes to over a terabyte of GPU RAM before counting activations. Most people don’t have that.
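The arithmetic behind that estimate can be written out as a short sketch. It assumes every value is stored as a 32-bit float (4 bytes) and ignores activation memory; the function name is hypothetical.

```python
def full_finetune_memory(num_params, bytes_per_value=4):
    """Count the values held in memory for full fine-tuning with AdamW."""
    weights = num_params
    gradients = num_params
    adam_states = 2 * num_params  # two optimizer values per parameter
    total_values = weights + gradients + adam_states
    total_gigabytes = total_values * bytes_per_value / 1e9
    return total_values, total_gigabytes


values, gigabytes = full_finetune_memory(70_000_000_000)
print(f"{values:,} values, about {gigabytes:,.0f} GB")
```

Mixed-precision training changes the bytes-per-value figure, but the four-copies-per-parameter structure is what makes full fine-tuning of large models so demanding.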

The Key Insight

In 2021, researchers at Microsoft published a paper called “LoRA: Low-Rank Adaptation of Large Language Models.” Their key insight was elegant: when you fine-tune a model, the weight changes can be approximated well by a low-rank matrix.

What does that mean? Remember that our model’s weights are large matrices. During fine-tuning, these matrices change slightly; call the change ΔW. The researchers hypothesized, and then showed empirically, that ΔW doesn’t need all its dimensions to be expressive. You can represent it as the product of two much smaller matrices:

$$\Delta W = A \times B$$

Where:

  • $W$ is the original weight matrix of shape ($d \times d$)
  • $A$ is a small matrix of shape ($d \times r$)
  • $B$ is a small matrix of shape ($r \times d$)
  • $r$ is the rank, typically 4, 8, or 16—much smaller than $d$
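With the shapes listed above, and treating the input $x$ as a row vector of length $d$ (the convention the code in this chapter uses), the adapted layer can be written as:

$$h = xW + x\,\Delta W = xW + (xA)B$$

Computing $(xA)B$ never builds the full $d \times d$ update, because $xA$ has only $r$ entries. The product $AB$ has rank at most $r$, and $A$ and $B$ together hold $2dr$ values instead of $d^2$. The implementation later in this chapter additionally multiplies the update by a constant scaling factor $\alpha / r$.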

The Analogy

Imagine you have an enormous painting—a masterpiece that took years to create. You want to modify its mood slightly, making it warmer. Full fine-tuning would mean repainting the entire canvas. LoRA’s approach is different: you place a small, transparent overlay on top of the painting and paint only on the overlay. The original masterpiece stays untouched. The final result is the original plus the overlay.

The overlay (ΔW) is much smaller than the full painting (W), but it’s enough to change the overall effect.

Parameter Savings

Let’s do the math. Suppose a weight matrix W is 4096 × 4096:

  • Full fine-tuning: 4096 × 4096 = 16,777,216 parameters
  • LoRA with rank 8: (4096 × 8) + (8 × 4096) = 65,536 parameters

That’s a 256× reduction in trainable parameters. For a 70-billion-parameter model, LoRA might fine-tune only 10–50 million parameters—less than 0.1% of the total.
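The same calculation as a small sketch, followed by a rough whole-model estimate. The model shape used for the estimate (80 layers, hidden size 8192, four square attention projections per layer) is a hypothetical configuration chosen for illustration, not the specification of any particular model.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in A (d_in x rank) plus B (rank x d_out)."""
    return d_in * rank + rank * d_out


# Single 4096 x 4096 matrix, rank 8 (the numbers from the text)
full = 4096 * 4096
lora = lora_trainable_params(4096, 4096, rank=8)
reduction = full // lora
print(f"full: {full:,}  lora: {lora:,}  reduction: {reduction}x")

# Hypothetical large model: LoRA on Q, K, V, and output projections only
n_layers, d_model, projections_per_layer = 80, 8192, 4
total_lora = n_layers * projections_per_layer * lora_trainable_params(d_model, d_model, rank=8)
share = 100 * total_lora / 70e9
print(f"LoRA parameters: {total_lora:,} ({share:.3f}% of 70 billion)")
```

Under these assumptions the total lands at roughly 42 million trainable parameters, inside the 10 to 50 million range quoted above.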

Implementing LoRA from Scratch

Let’s build a LoRA layer:

import torch
import torch.nn as nn
import math


class LoRALayer(nn.Module):
    """
    A LoRA (Low-Rank Adaptation) layer.

    Instead of updating the full weight matrix W during fine-tuning,
    we freeze W and learn a low-rank update: delta_W = A @ B

    The output becomes: y = x @ W + x @ A @ B
                           = x @ (W + A @ B)

    Only A and B are trainable, dramatically reducing parameters.
    """

    def __init__(self, original_layer, rank=8, alpha=16):
        """
        Args:
            original_layer: The nn.Linear layer to apply LoRA to.
            rank: The rank of the low-rank matrices (smaller = fewer params).
            alpha: Scaling factor for the LoRA update.
        """
        super().__init__()

        self.original_layer = original_layer
        self.rank = rank
        self.alpha = alpha

        in_features = original_layer.in_features
        out_features = original_layer.out_features

        # Freeze the original weights — they don't change during fine-tuning
        for param in self.original_layer.parameters():
            param.requires_grad = False

        # Create the low-rank matrices A and B
        # A: projects input down to low rank
        # B: projects back up to output dimension
        self.lora_A = nn.Parameter(torch.zeros(in_features, rank))
        self.lora_B = nn.Parameter(torch.zeros(rank, out_features))

        # Initialize A with small random values, B with zeros
        # This means the LoRA update starts at zero (no change to the model)
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        # B stays at zero — so initially delta_W = A @ B = 0

        # Scaling factor
        self.scaling = alpha / rank

    def forward(self, x):
        """
        Forward pass: original output + low-rank update.

        Args:
            x: Input tensor of shape [batch, seq_len, in_features]

        Returns:
            Output tensor of shape [batch, seq_len, out_features]
        """
        # Original frozen computation
        original_output = self.original_layer(x)

        # Low-rank update: x @ A @ B, scaled
        lora_update = x @ self.lora_A @ self.lora_B * self.scaling

        return original_output + lora_update

Let’s verify the parameter savings:

# Demonstrate parameter savings
d_model = 512  # A modest model dimension

# Original linear layer
original = nn.Linear(d_model, d_model, bias=False)
original_params = sum(p.numel() for p in original.parameters())

# LoRA version with rank 8
lora = LoRALayer(original, rank=8, alpha=16)
lora_trainable = sum(p.numel() for p in lora.parameters() if p.requires_grad)

print(f"Original layer parameters:  {original_params:,}")
print(f"LoRA trainable parameters:  {lora_trainable:,}")
print(f"Reduction factor:           {original_params / lora_trainable:.1f}x")
print(f"Percentage of original:     {100 * lora_trainable / original_params:.2f}%")

# Output:
# Original layer parameters:  262,144
# LoRA trainable parameters:  8,192
# Reduction factor:           32.0x
# Percentage of original:     3.12%
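The identity in the `LoRALayer` docstring, y = x @ (W + A @ B), suggests that a trained update can be folded back into the frozen weight so inference uses a single matrix. The sketch below checks this numerically with standalone tensors rather than the class above. The random values for A and B stand in for trained weights, and the one detail to watch is that `nn.Linear` stores its weight as [out_features, in_features], so the update must be transposed.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_in, d_out, rank, alpha = 16, 12, 4, 8
scaling = alpha / rank

base = nn.Linear(d_in, d_out, bias=False)

# Stand-ins for trained LoRA matrices (B is nonzero, as it would be after training)
lora_A = torch.randn(d_in, rank) * 0.1
lora_B = torch.randn(rank, d_out) * 0.1

x = torch.randn(2, 5, d_in)  # [batch, seq_len, in_features]

with torch.no_grad():
    # Unmerged: frozen layer output plus the scaled low-rank path
    unmerged_out = base(x) + (x @ lora_A @ lora_B) * scaling

    # Merged: nn.Linear computes x @ weight.T, so add the transposed update
    merged = nn.Linear(d_in, d_out, bias=False)
    merged.weight.copy_(base.weight + scaling * (lora_A @ lora_B).T)
    merged_out = merged(x)

outputs_match = torch.allclose(unmerged_out, merged_out, atol=1e-4)
print("Merged and unmerged outputs match:", outputs_match)
```

Keeping A and B separate is convenient during training and for swapping adapters; merging is an optional step for deployment.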

Applying LoRA to a Model

In practice, you apply LoRA to specific layers in the model—typically the attention projection matrices (Q, K, V, and output):

def apply_lora_to_model(model, rank=8, alpha=16):
    """
    Replace attention linear layers with LoRA-wrapped versions.

    Only the attention Q, K, V, and output projection layers
    are wrapped. Feed-forward layers and embeddings stay frozen.

    Args:
        model: The pretrained transformer model.
        rank: LoRA rank.
        alpha: LoRA scaling factor.

    Returns:
        model: The modified model (in-place).
    """
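    # Freeze every existing parameter first. LoRALayer only freezes the
    # layer it wraps, so without this step the feed-forward layers,
    # embeddings, and layer norms would remain trainable.
    for param in model.parameters():
        param.requires_grad = False

    lora_layers_added = 0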

    for layer in model.layers:
        # Wrap attention projections with LoRA
        if hasattr(layer, 'attention'):
            attn = layer.attention

            if hasattr(attn, 'W_q'):
                attn.W_q = LoRALayer(attn.W_q, rank=rank, alpha=alpha)
                lora_layers_added += 1

            if hasattr(attn, 'W_k'):
                attn.W_k = LoRALayer(attn.W_k, rank=rank, alpha=alpha)
                lora_layers_added += 1

            if hasattr(attn, 'W_v'):
                attn.W_v = LoRALayer(attn.W_v, rank=rank, alpha=alpha)
                lora_layers_added += 1

            if hasattr(attn, 'W_o'):
                attn.W_o = LoRALayer(attn.W_o, rank=rank, alpha=alpha)
                lora_layers_added += 1

    print(f"Applied LoRA to {lora_layers_added} layers (rank={rank})")

    # Show the savings
    total_params = sum(p.numel() for p in model.parameters())
    trainable_params = sum(p.numel() for p in model.parameters()
                          if p.requires_grad)
    print(f"Total parameters:     {total_params:,}")
    print(f"Trainable (LoRA):     {trainable_params:,}")
    print(f"Frozen:               {total_params - trainable_params:,}")
    print(f"Trainable percentage: {100 * trainable_params / total_params:.2f}%")

    return model

Why LoRA Works

It might seem surprising that such a tiny number of parameters can meaningfully alter a model’s behavior. The intuition is this: pretraining already puts the model’s weights in a very good neighborhood of parameter space. Fine-tuning only needs to make small adjustments. Those small adjustments live in a low-dimensional subspace—you don’t need to explore all possible directions, just a few important ones.

Think of it as the difference between building a house from scratch (high-dimensional, many choices) and rearranging the furniture (low-dimensional, few choices, but the effect on livability is huge).

5. Instruction Tuning

Teaching a Model to Follow Instructions

Instruction tuning is a specific type of fine-tuning where the goal is to teach the model to follow instructions. This is what transforms a base model (which just predicts the next word) into a helpful assistant (which does what you ask).

The key idea is simple: show the model thousands of examples where an instruction is followed by a good response. After enough examples, the model learns the meta-pattern: “when I see an instruction, I should produce a helpful, relevant response.”

The Prompt-Completion Format

During instruction tuning, training examples follow a consistent template:

<|start|>
### Instruction:
{the user's request}

### Response:
{the ideal response}
<|end|>

The exact template varies between models (Llama 2’s chat models use [INST]...[/INST], ChatML uses <|im_start|>user), but the principle is the same: a clear division between what the user says and what the model should say.
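As a rough illustration of how the same instruction looks under different templates, the sketch below builds a single-turn prompt in three styles. These are simplified versions: real chat templates also handle system prompts, multiple turns, and beginning and end-of-sequence tokens. The dictionary keys and the function name `build_prompt` are hypothetical.

```python
# Simplified single-turn prompt templates (illustrative, not complete)
TEMPLATES = {
    "chapter": "<|start|>\n### Instruction:\n{instruction}\n\n### Response:\n",
    "llama2_inst": "[INST] {instruction} [/INST]",
    "chatml": "<|im_start|>user\n{instruction}<|im_end|>\n<|im_start|>assistant\n",
}


def build_prompt(instruction, style="chapter"):
    """Return the text the model should see before it starts generating."""
    return TEMPLATES[style].format(instruction=instruction)


for style in TEMPLATES:
    print(f"--- {style} ---")
    print(build_prompt("List three colors.", style))
```

Whichever template is chosen, the prompt used at inference time has to match the one used during fine-tuning character for character, because the model has only learned to respond after that exact pattern.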

A Tiny Instruction-Tuned Model

Let’s trace through what instruction tuning looks like concretely. Imagine we have a tiny model that has been pretrained on general text. Before instruction tuning, if we feed it:

### Instruction:
List three colors.

### Response:

The base model might generate: “The response to this question depends on several factors including the cultural context in which colors are perceived…”

It’s continuing the text in a way that sounds like a Wikipedia article. It doesn’t understand that it should answer the question.

After instruction tuning on examples like:

Instruction: "Name two planets."     → Response: "Mars and Jupiter."
Instruction: "What is 2 + 2?"       → Response: "4."
Instruction: "Say hello in French."  → Response: "Bonjour."

The model learns the pattern. Now when it sees the instruction about colors, it generates: “Red, blue, and green.”

The transformation seems magical, but it’s just pattern matching at scale. The model has seen enough instruction-response pairs that it has internalized: “after ### Response:\n, I should provide a direct, helpful answer to whatever was in ### Instruction:.”

Why ChatGPT-like Behavior Requires This Step

Base GPT models are remarkable—they know facts, can write prose, and understand grammar. But they have no concept of being helpful. They were trained to predict the next token in web pages, books, and articles. None of those texts are formatted as “instruction → response.”

Instruction tuning bridges this gap. Without it, you’d have a model that knows everything but can’t do anything useful with its knowledge. It’s the difference between a library (contains information, but you have to find it yourself) and a librarian (understands your question and gives you exactly what you need).

The Scale of Instruction Tuning

Real instruction-tuning datasets contain tens of thousands to millions of examples:

  • Alpaca (Stanford): 52,000 instruction-response pairs generated by OpenAI’s text-davinci-003
  • Dolly (Databricks): 15,000 human-written instruction-response pairs
  • FLAN (Google): Millions of examples across hundreds of task types
  • OpenAssistant: 160,000+ human-written messages organized into conversation trees

Even with our tiny 8-example dataset, the principle is the same: show the model the pattern, and it learns the behavior.

6. RLHF — Reinforcement Learning from Human Feedback

Instruction tuning gets the model to follow instructions, but the responses might be verbose, incorrect, or unhelpful in subtle ways. RLHF (Reinforcement Learning from Human Feedback) is the next step: teaching the model not just to respond, but to respond well.

The Problem

After instruction tuning, a model might generate two responses to “Explain gravity”:

Response A: “Gravity is a fundamental force of nature that attracts objects with mass toward each other. The more massive an object, the stronger its gravitational pull.”

Response B: “Gravity is described by Einstein’s general theory of relativity as a curvature in spacetime caused by mass and energy. The Einstein field equations, given by $G_{\mu\nu} + \Lambda g_{\mu\nu} = \frac{8\pi G}{c^4} T_{\mu\nu}$, describe this curvature…”

Both are technically correct. But for a general audience, Response A is clearly better—it’s clear, concise, and accessible. How do we teach the model to prefer Response A?

We can’t encode this preference as a simple rule. “Be clear” is subjective. “Be helpful” depends on context. These preferences are subtle and nuanced—exactly the kind of thing humans are good at judging but hard to write algorithms for.

The Three-Step Process

RLHF works in three stages:

Step 1: Supervised Fine-tuning (SFT)

This is the instruction tuning we just covered. Train the model on instruction-response pairs so it knows the basic format.

Input:  "What is the capital of Japan?"
Output: "The capital of Japan is Tokyo."

Step 2: Train a Reward Model

Hire human annotators. Show them a prompt and two (or more) model-generated responses. Ask them: “Which response is better?”

Prompt: "Explain gravity simply."

Response A: "Gravity pulls things together. The bigger something is,
             the stronger it pulls."           → Human picks: ✓ Better

Response B: "Gravitational acceleration on Earth is approximately
             9.81 m/s²..."                     → Human picks: ✗ Worse

Collect thousands of these human preference judgments. Then train a separate neural network—the reward model—to predict which response a human would prefer. The reward model takes a prompt and response as input and outputs a score:

reward_model("Explain gravity", response_A) → 0.85  (high = good)
reward_model("Explain gravity", response_B) → 0.32  (low = worse)

The reward model learns the patterns in human preferences: prefer clear explanations, avoid jargon, be concise, be accurate.
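The chapter does not specify how the reward model is trained, so the following is an assumption: a common choice is a pairwise loss that pushes the score of the preferred response above the score of the rejected one, namely the negative log of the sigmoid of the score difference. The sketch uses only the standard library, and the scores are the illustrative values from above.

```python
import math


def pairwise_preference_loss(score_chosen, score_rejected):
    """-log(sigmoid(score_chosen - score_rejected)), computed stably."""
    margin = score_chosen - score_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    # For negative margins, rewrite to avoid exp() of a large positive number
    return -margin + math.log1p(math.exp(margin))


# Reward model agrees with the human: preferred response scored higher
agree = pairwise_preference_loss(0.85, 0.32)

# Reward model disagrees with the human: preferred response scored lower
disagree = pairwise_preference_loss(0.32, 0.85)

print(f"agrees with annotator:    loss = {agree:.3f}")
print(f"disagrees with annotator: loss = {disagree:.3f}")
```

Only the difference between the two scores matters, which is why reward-model outputs are meaningful as rankings rather than as absolute grades.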

Step 3: Optimize with Reinforcement Learning

Now use the reward model to further train the language model. The process uses an algorithm called PPO (Proximal Policy Optimization):

  1. The model generates a response to a prompt
  2. The reward model scores the response
  3. If the score is high → update the model to produce more responses like this
  4. If the score is low → update the model to avoid responses like this
  5. Repeat thousands of times
Analogy: Teaching a dog new tricks.

- The dog (LLM) performs an action (generates a response)
- The trainer (reward model) gives a treat or witholds it (high/low score)
- Over time, the dog learns which behaviors earn treats
- The dog doesn't understand WHY treats come — it just learns the pattern
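The score-then-update loop can be sketched on a toy problem. The example below is not PPO: it is a plain gradient step on expected reward with a KL penalty toward the starting policy, and it omits PPO's sampling, clipping, and value function. The "policy" is just three logits over three canned responses, and the rewards and `beta` are made-up values.

```python
import torch
import torch.nn.functional as F

# Toy "policy": logits over three canned responses to a single prompt.
logits = torch.zeros(3, requires_grad=True)

# Frozen copy of the starting policy, used for the KL penalty.
ref_log_probs = F.log_softmax(torch.zeros(3), dim=0)

# Made-up reward-model scores for the three responses.
rewards = torch.tensor([0.85, 0.32, 0.10])
beta = 0.1  # strength of the KL penalty (assumed value)

optimizer = torch.optim.SGD([logits], lr=0.5)

for step in range(100):
    log_probs = F.log_softmax(logits, dim=0)
    probs = log_probs.exp()

    # Average reward the current policy would earn.
    expected_reward = (probs * rewards).sum()
    # How far the policy has moved from where it started.
    kl = (probs * (log_probs - ref_log_probs)).sum()

    # Maximize reward minus the penalty, so minimize the negative.
    loss = -(expected_reward - beta * kl)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

final_probs = F.softmax(logits.detach(), dim=0)
print(final_probs)
```

After training, most of the probability mass sits on the highest-scoring response, which is the "dog learns which behaviors earn treats" effect in miniature. A larger `beta` keeps the policy closer to its starting point.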

Why RLHF Matters

The jump from plain instruction tuning to RLHF is what made ChatGPT feel different from earlier language models. GPT-3 (base model) could generate text but was hard to steer. InstructGPT added supervised instruction tuning and RLHF on top of GPT-3 and followed instructions far more reliably. ChatGPT applied the same recipe to dialogue and, to many users, felt noticeably more helpful, harmless, and honest.

RLHF aligns the model’s goals with human preferences. Without it, the model optimizes for “predict the next token accurately.” With it, the model optimizes for “produce responses that humans rate as helpful.”

Limitations of RLHF

RLHF isn’t perfect:

  • Expensive: Collecting human preferences requires many paid annotators
  • Reward hacking: The model may learn to game the reward model rather than be genuinely helpful (for example, producing responses that sound confident but are wrong)
  • Annotation bias: Human preferences are subjective and vary between annotators
  • Alignment tax: RLHF can reduce the model’s raw capability slightly while improving its helpfulness

Researchers are actively exploring alternatives like DPO (Direct Preference Optimization), which skips the separate reward model entirely and optimizes directly from human preference data. But RLHF remains the foundational technique that launched the era of aligned AI assistants.
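To make the DPO idea concrete, here is a minimal sketch of its loss. It assumes you already have the summed log-probability of each full response under the model being trained (the policy) and under a frozen reference copy; the function name, argument names, and numbers are illustrative.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss from per-response log-probabilities.

    Each argument is a 1-D tensor with one entry per preference pair.
    The loss rewards the policy for raising the chosen response's
    log-probability (relative to the reference) more than the rejected one's.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()


# Made-up log-probabilities for two preference pairs.
ref_chosen = torch.tensor([-12.0, -20.0])
ref_rejected = torch.tensor([-11.0, -18.0])

# Before any training the policy equals the reference, so the loss is ln(2).
start = dpo_loss(ref_chosen, ref_rejected, ref_chosen, ref_rejected)

# A policy that has shifted probability toward the chosen responses.
better = dpo_loss(ref_chosen + 2.0, ref_rejected - 2.0,
                  ref_chosen, ref_rejected)

print(f"start:  {start.item():.4f}")
print(f"better: {better.item():.4f}")
```

No reward model appears anywhere: the preference pairs act on the language model's own log-probabilities, which is what "skips the separate reward model" means in practice.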

7. Evaluation

You’ve fine-tuned your model. But how do you know it’s actually good? Evaluating language models is notoriously difficult because “good” is subjective—but there are some established approaches.

Perplexity

Perplexity is the most common automatic metric for language models. It measures how “surprised” the model is by the test data. Lower perplexity = better.

The math: perplexity is the exponentiation of the average cross-entropy loss:

$$\text{Perplexity} = e^{\text{loss}}$$

If your model has a loss of 3.0 on some test text, its perplexity is $e^{3.0} \approx 20.1$. Intuitively, this means the model is, on average, as uncertain as if it were choosing uniformly among ~20 options at each step. A perfect model that always predicts the right next word has a perplexity of 1.

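import math

import torch
import torch.nn as nn
from torch.utils.data import DataLoader


def compute_perplexity(model, dataset, device='cpu', pad_id=0):
    """
    Compute perplexity of a model on a dataset.

    Perplexity = exp(average cross-entropy loss per non-padding token)

    Lower is better. A perplexity of 1 would mean the model
    perfectly predicts every next token.

    Args:
        model: The language model.
        dataset: A dataset yielding (input, target) pairs.
        device: 'cpu' or 'cuda'.
        pad_id: Token id used for padding (0 in this chapter's datasets).

    Returns:
        perplexity: The computed perplexity score.
    """
    model.eval()
    model = model.to(device)
    # Sum the loss so it can be averaged over tokens, not over batches.
    # Averaging per-batch means would over-weight short or heavily padded batches.
    loss_fn = nn.CrossEntropyLoss(ignore_index=pad_id, reduction='sum')

    total_loss = 0.0
    total_tokens = 0

    dataloader = DataLoader(dataset, batch_size=4, shuffle=False)

    with torch.no_grad():
        for batch_x, batch_y in dataloader:
            batch_x = batch_x.to(device)
            batch_y = batch_y.to(device)

            logits = model(batch_x)

            B, T, C = logits.shape
            loss = loss_fn(logits.reshape(B * T, C), batch_y.reshape(B * T))

            total_loss += loss.item()
            # Count only the target tokens that contributed to the loss.
            total_tokens += (batch_y != pad_id).sum().item()

    if total_tokens == 0:
        raise ValueError("Dataset contains no non-padding target tokens.")

    avg_loss = total_loss / total_tokens
    perplexity = math.exp(avg_loss)

    print(f"Average loss: {avg_loss:.4f}")
    print(f"Perplexity:   {perplexity:.2f}")

    return perplexity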


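# Rough, illustrative perplexity ranges. They depend heavily on the
# tokenizer and test set, so only compare models that share both.
# Random model (untrained):  ~vocab_size (e.g., 50,000)
# After pretraining:         20-50 (for small models)
# After fine-tuning:         5-20 (on in-domain data)
# GPT-4 level:               < 10 (rough estimate, no official figure)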
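The "choosing uniformly among N options" reading of perplexity can be checked directly. In this small sketch, all-zero logits stand in for a model that considers every token equally likely; the vocabulary size of 20 and the target ids are arbitrary.

```python
import math

import torch
import torch.nn.functional as F

vocab_size = 20

# Five positions, all-zero logits: every token is equally likely.
logits = torch.zeros(5, vocab_size)
# With a uniform prediction, any targets give the same loss.
targets = torch.tensor([3, 0, 7, 19, 11])

loss = F.cross_entropy(logits, targets)
perplexity = math.exp(loss.item())

print(f"loss = {loss.item():.4f}, perplexity = {perplexity:.2f}")
```

The loss equals ln(20) and the perplexity comes out at the vocabulary size. This is also why an untrained model with near-uniform predictions starts with a perplexity close to its vocabulary size.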

Human Evaluation

Perplexity tells you how well the model predicts held-out text, not whether its generated answers are any good. A model can score a respectable perplexity and still answer every question with an evasive “I don’t know.” Human evaluation remains the gold standard.

A simple human evaluation protocol:

def human_eval_template(model, tokenizer, test_prompts, device='cpu'):
    """
    Generate responses for human evaluation.

    Humans rate each response on:
    - Helpfulness (1-5): Does it answer the question?
    - Accuracy (1-5): Is the information correct?
    - Clarity (1-5): Is the response well-written?

    Args:
        model: The fine-tuned model.
        tokenizer: The tokenizer.
        test_prompts: List of instruction strings to test.
        device: 'cpu' or 'cuda'.
    """
    print("=" * 60)
    print("HUMAN EVALUATION")
    print("Rate each response: Helpfulness / Accuracy / Clarity (1-5)")
    print("=" * 60)

    for i, prompt in enumerate(test_prompts):
        response = generate_response(
            model, tokenizer, prompt,
            max_tokens=100, temperature=0.7, device=device
        )

        print(f"\n--- Example {i + 1} ---")
        print(f"Instruction: {prompt}")
        print(f"Response:    {response}")
        print(f"Helpfulness: __ / 5")
        print(f"Accuracy:    __ / 5")
        print(f"Clarity:     __ / 5")


# Example test prompts
test_prompts = [
    "What is the boiling point of water?",
    "Explain why the sky is blue in simple terms.",
    "Write a short poem about rain.",
    "What are three benefits of exercise?",
]

Simple Benchmark Setup

For reproducible evaluation, create a benchmark—a fixed set of test examples with known correct answers:

def run_benchmark(model, tokenizer, benchmark, device='cpu'):
    """
    Run a simple benchmark to evaluate model quality.

    Args:
        model: The fine-tuned model.
        tokenizer: The tokenizer.
        benchmark: List of dicts with 'instruction' and 'expected' keys.
        device: 'cpu' or 'cuda'.

    Returns:
        score: Percentage of responses that contain the expected answer.
    """
    correct = 0

    for item in benchmark:
        response = generate_response(
            model, tokenizer, item['instruction'],
            max_tokens=50, temperature=0.1,  # Low temp for consistency
            device=device
        )

        # Simple check: does the response contain the expected answer?
        if item['expected'].lower() in response.lower():
            correct += 1
            status = "✓"
        else:
            status = "✗"

        print(f"{status} Instruction: {item['instruction']}")
        print(f"  Expected: {item['expected']}")
        print(f"  Got:      {response[:80]}...")
        print()

    score = 100 * correct / len(benchmark)
    print(f"Score: {correct}/{len(benchmark)} ({score:.1f}%)")
    return score


# Example benchmark
simple_benchmark = [
    {"instruction": "What is 2 + 2?", "expected": "4"},
    {"instruction": "What is the capital of France?", "expected": "Paris"},
    {"instruction": "What color is the sky?", "expected": "blue"},
    {"instruction": "How many days are in a week?", "expected": "7"},
    {"instruction": "What is H2O?", "expected": "water"},
]

This is a very simple benchmark—production models are evaluated on much more sophisticated ones like MMLU, HellaSwag, or HumanEval. But the principle is identical: fixed test set, clear success criteria, reproducible results.
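One weakness of the substring check in `run_benchmark` is that short answers match by accident: "4" is found inside "14", and "7" is missed when the model writes "seven". The helper below is a rough sketch of a stricter check. `contains_answer` and `NUMBER_WORDS` are hypothetical names, not part of this chapter's code, and the number-word mapping only covers zero to ten.

```python
import re

# Spelled-out numbers mapped to digits (small, illustrative set).
NUMBER_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
    "ten": "10",
}


def contains_answer(response, expected):
    """Return True if `expected` appears in `response` as a whole word or number."""
    text = response.lower()

    # Normalize spelled-out numbers so "seven" can match "7".
    for word, digit in NUMBER_WORDS.items():
        text = re.sub(rf"\b{word}\b", digit, text)

    # Reject matches glued to other letters or digits ("14" for "4"),
    # and matches that are one side of a decimal ("4.5" or "3.4" for "4").
    pattern = rf"(?<![\w.]){re.escape(expected.lower())}(?!\w|\.\d)"
    return re.search(pattern, text) is not None


print(contains_answer("2 + 2 = 4.", "4"))
print(contains_answer("The answer is 14.", "4"))
print(contains_answer("There are seven days in a week.", "7"))
```

To try it, replace the `in` test inside `run_benchmark` with a call to `contains_answer(response, item['expected'])`. It is still a heuristic: it cannot tell "not Paris" from "Paris", which is one reason production benchmarks favor multiple-choice or executable checks.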

8. Exercises

Exercise 1: Extend the Instruction Dataset

Create a fine-tuning dataset with at least 20 instruction-response pairs covering three categories: factual questions, creative writing, and math problems. Implement your InstructionDataset class and verify it produces correct input-target pairs.

Solution
import torch
from torch.utils.data import Dataset


# Extended instruction dataset with three categories
instruction_data_extended = [
    # === Factual Questions ===
    {"instruction": "What is the capital of Japan?",
     "response": "The capital of Japan is Tokyo."},
    {"instruction": "Who wrote Romeo and Juliet?",
     "response": "Romeo and Juliet was written by William Shakespeare."},
    {"instruction": "What is the largest planet in our solar system?",
     "response": "The largest planet in our solar system is Jupiter."},
    {"instruction": "What year did World War II end?",
     "response": "World War II ended in 1945."},
    {"instruction": "What is the chemical symbol for gold?",
     "response": "The chemical symbol for gold is Au."},
    {"instruction": "How many continents are there?",
     "response": "There are seven continents on Earth."},
    {"instruction": "What is the speed of light?",
     "response": "The speed of light is approximately 299,792,458 meters per second."},

    # === Creative Writing ===
    {"instruction": "Write a one-sentence story about a robot.",
     "response": "The last robot on Earth spent its days painting sunsets it had never been programmed to appreciate."},
    {"instruction": "Write a haiku about the ocean.",
     "response": "Waves crash on the shore\nSalt and foam dance in the wind\nThe tide pulls back home"},
    {"instruction": "Describe a forest in two sentences.",
     "response": "Towering oaks stretched toward the sky, their branches interlocking like fingers. Below, ferns carpeted the forest floor in a thousand shades of green."},
    {"instruction": "Write a short limerick.",
     "response": "A coder who worked through the night\nFound a bug that just wouldn't go right\nShe traced every line\nThe fix was just fine\nA semicolon missing from sight."},
    {"instruction": "Write a metaphor about time.",
     "response": "Time is a river that flows in only one direction, carrying us forward whether we paddle or not."},
    {"instruction": "Describe rain using only sounds.",
     "response": "Pitter-patter, drip-drop, whoosh, splash, drumming, tinkling, rushing, gurgling."},

    # === Math Problems ===
    {"instruction": "What is 25 times 4?",
     "response": "25 times 4 is 100."},
    {"instruction": "What is the square root of 144?",
     "response": "The square root of 144 is 12."},
    {"instruction": "If a train travels 60 mph for 3 hours, how far does it go?",
     "response": "The train travels 180 miles (60 mph × 3 hours = 180 miles)."},
    {"instruction": "What is 15% of 200?",
     "response": "15% of 200 is 30."},
    {"instruction": "What is 2 to the power of 10?",
     "response": "2 to the power of 10 is 1024."},
    {"instruction": "What is the area of a rectangle with width 5 and height 8?",
     "response": "The area is 40 square units (5 × 8 = 40)."},
    {"instruction": "Solve: 3x + 6 = 21. What is x?",
     "response": "x = 5. Subtract 6 from both sides: 3x = 15. Divide by 3: x = 5."},
]


def format_example(example):
    """Format an instruction-response pair into a training string."""
    return (
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['response']}"
    )


class InstructionDataset(Dataset):
    def __init__(self, data, tokenizer, max_length=128):
        self.examples = []
        for item in data:
            text = format_example(item)
            token_ids = tokenizer.encode(text)
            if len(token_ids) > max_length:
                token_ids = token_ids[:max_length]
            while len(token_ids) < max_length:
                token_ids.append(0)
            self.examples.append(token_ids)

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        token_ids = self.examples[idx]
        x = torch.tensor(token_ids[:-1], dtype=torch.long)
        y = torch.tensor(token_ids[1:], dtype=torch.long)
        return x, y


# Verify the dataset
print(f"Total examples: {len(instruction_data_extended)}")
print(f"Categories: factual (7), creative (6), math (7)")
print()

# Show a formatted example from each category
for category_start in [0, 7, 13]:
    example = instruction_data_extended[category_start]
    print(f"--- Example ---")
    print(format_example(example))
    print()

# Output:
# Total examples: 20
# Categories: factual (7), creative (6), math (7)
#
# --- Example ---
# ### Instruction:
# What is the capital of Japan?
#
# ### Response:
# The capital of Japan is Tokyo.
#
# --- Example ---
# ### Instruction:
# Write a one-sentence story about a robot.
#
# ### Response:
# The last robot on Earth spent its days painting sunsets it had
# never been programmed to appreciate.
#
# --- Example ---
# ### Instruction:
# What is 25 times 4?
#
# ### Response:
# 25 times 4 is 100.
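The exercise also asks you to verify the input-target pairs, which the printout above does not do. A minimal sketch of such a check follows. It uses a hypothetical character-level `CharTokenizer` as a stand-in for your real tokenizer, and a compact `TinyInstructionDataset` that repeats the pad, truncate, and shift logic of `InstructionDataset` on already-formatted strings.

```python
import torch
from torch.utils.data import Dataset


class CharTokenizer:
    """Hypothetical stand-in: one id per character, id 0 reserved for padding."""

    def encode(self, text):
        return [min(ord(ch), 254) + 1 for ch in text]


class TinyInstructionDataset(Dataset):
    """Same pad/truncate/shift logic as InstructionDataset, on raw strings."""

    def __init__(self, texts, tokenizer, max_length=64):
        self.examples = []
        for text in texts:
            ids = tokenizer.encode(text)[:max_length]        # truncate
            ids = ids + [0] * (max_length - len(ids))        # pad with 0
            self.examples.append(ids)

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        ids = self.examples[idx]
        return torch.tensor(ids[:-1]), torch.tensor(ids[1:])


def check_pairs(dataset, max_length):
    """Check that every target sequence is its input shifted left by one token."""
    for i in range(len(dataset)):
        x, y = dataset[i]
        assert x.shape == y.shape == (max_length - 1,)
        assert torch.equal(x[1:], y[:-1])
    return True


texts = [
    "### Instruction:\nWhat is 25 times 4?\n\n### Response:\n25 times 4 is 100.",
    "### Instruction:\nHi\n\n### Response:\nHello.",
]
dataset = TinyInstructionDataset(texts, CharTokenizer(), max_length=64)
print(check_pairs(dataset, max_length=64))
```

The first string is longer than 64 characters and gets truncated, while the second is shorter and gets padded, so both branches are exercised. The same `check_pairs` function should work unchanged on the real `InstructionDataset`.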

Exercise 2: Implement LoRA with Different Ranks

Implement LoRA layers with ranks 2, 8, and 32. Compare the number of trainable parameters for each rank, applied to a linear layer with dimensions 1024 × 1024. Discuss the trade-off between parameter count and expressiveness.

Solution
import torch
import torch.nn as nn
import math


class LoRALayer(nn.Module):
    """LoRA layer implementation (same as in the chapter)."""

    def __init__(self, original_layer, rank=8, alpha=16):
        super().__init__()
        self.original_layer = original_layer
        self.rank = rank
        self.alpha = alpha

        in_features = original_layer.in_features
        out_features = original_layer.out_features

        for param in self.original_layer.parameters():
            param.requires_grad = False

        self.lora_A = nn.Parameter(torch.zeros(in_features, rank))
        self.lora_B = nn.Parameter(torch.zeros(rank, out_features))

        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))

        self.scaling = alpha / rank

    def forward(self, x):
        original_output = self.original_layer(x)
        lora_update = x @ self.lora_A @ self.lora_B * self.scaling
        return original_output + lora_update


# Compare different ranks
d = 1024
ranks = [2, 8, 32]

print(f"Original layer: {d} × {d} = {d * d:,} parameters\n")
print(f"{'Rank':<8} {'LoRA Params':<15} {'Reduction':<12} {'% of Original':<15}")
print("-" * 50)

for rank in ranks:
    original = nn.Linear(d, d, bias=False)
    lora = LoRALayer(original, rank=rank, alpha=rank * 2)

    original_params = d * d
    lora_params = sum(p.numel() for p in lora.parameters() if p.requires_grad)

    reduction = original_params / lora_params
    percentage = 100 * lora_params / original_params

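    # Build the "256.0x" and "0.39%" strings first so the padding is applied
    # after the unit, matching the column layout shown in the output below.
    reduction_str = f"{reduction:.1f}x"
    percentage_str = f"{percentage:.2f}%"
    print(f"{rank:<8} {lora_params:<15,} {reduction_str:<12} {percentage_str:<15}")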

# Verify outputs match shapes
print("\nShape verification:")
x = torch.randn(2, 10, d)  # batch=2, seq_len=10, dim=1024

for rank in ranks:
    original = nn.Linear(d, d, bias=False)
    lora = LoRALayer(original, rank=rank)
    output = lora(x)
    print(f"  Rank {rank:>2}: input {tuple(x.shape)} → output {tuple(output.shape)}")

# Output:
# Original layer: 1024 × 1024 = 1,048,576 parameters
#
# Rank     LoRA Params     Reduction    % of Original
# --------------------------------------------------
# 2        4,096           256.0x       0.39%
# 8        16,384          64.0x        1.56%
# 32       65,536          16.0x        6.25%
#
# Shape verification:
#   Rank  2: input (2, 10, 1024) → output (2, 10, 1024)
#   Rank  8: input (2, 10, 1024) → output (2, 10, 1024)
#   Rank 32: input (2, 10, 1024) → output (2, 10, 1024)

# Discussion:
# - Rank 2:  Extreme compression (0.39%). Very few trainable parameters.
#            Good for simple adaptations (e.g., style transfer).
#            May not capture complex behavioral changes.
#
# - Rank 8:  A common default. Good balance between parameter efficiency
#            and expressiveness. Works well for most fine-tuning tasks.
#
# - Rank 32: More expressive but less efficient. Useful when fine-tuning
#            requires significant behavioral changes. Still 16x fewer
#            parameters than full fine-tuning.
#
# The trade-off: lower rank = fewer parameters = faster training = less
# memory, but potentially less capacity to learn complex new behaviors.
# In practice, rank 8-16 works well for most applications.
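The numbers in the table follow from a short calculation. For a linear layer with $d_{\text{in}}$ inputs and $d_{\text{out}}$ outputs, the two LoRA matrices have shapes $d_{\text{in}} \times r$ and $r \times d_{\text{out}}$, so:

$$\text{LoRA parameters} = r\,d_{\text{in}} + r\,d_{\text{out}} = r\,(d_{\text{in}} + d_{\text{out}})$$

$$\frac{\text{LoRA parameters}}{\text{original parameters}} = \frac{r\,(d_{\text{in}} + d_{\text{out}})}{d_{\text{in}}\,d_{\text{out}}} = \frac{2r}{d} \quad \text{when } d_{\text{in}} = d_{\text{out}} = d$$

With $d = 1024$ and $r = 8$, this gives $16 / 1024 \approx 1.56\%$, matching the table. The ratio shrinks as the layer gets wider, so the same rank is proportionally cheaper on larger models.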

Exercise 3: Build a Perplexity Comparison

Write code that computes perplexity on a held-out test set before and after fine-tuning. Show that fine-tuning reduces perplexity on in-domain data. Explain what it means if perplexity increases on out-of-domain data.

Solution
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
import math
import copy


def compute_perplexity(model, dataset, device='cpu', pad_id=0):
    """Compute perplexity (exp of the mean loss per non-padding token)."""
    model.eval()
    model = model.to(device)
    # Sum the loss so it can be averaged over tokens, not over batches.
    loss_fn = nn.CrossEntropyLoss(ignore_index=pad_id, reduction='sum')

    total_loss = 0.0
    total_tokens = 0

    dataloader = DataLoader(dataset, batch_size=4, shuffle=False)

    with torch.no_grad():
        for batch_x, batch_y in dataloader:
            batch_x = batch_x.to(device)
            batch_y = batch_y.to(device)

            logits = model(batch_x)

            B, T, C = logits.shape
            loss = loss_fn(logits.reshape(B * T, C), batch_y.reshape(B * T))

            total_loss += loss.item()
            # Count only the target tokens that contributed to the loss.
            total_tokens += (batch_y != pad_id).sum().item()

    if total_tokens == 0:
        return float('inf')

    avg_loss = total_loss / total_tokens
    perplexity = math.exp(min(avg_loss, 100))  # Cap to avoid overflow
    return perplexity


def compare_perplexity(model_before, model_after, in_domain_data,
                       out_domain_data, device='cpu'):
    """
    Compare perplexity before and after fine-tuning on two datasets.

    Args:
        model_before: Model weights before fine-tuning.
        model_after: Model weights after fine-tuning.
        in_domain_data: Test data similar to fine-tuning data.
        out_domain_data: Test data different from fine-tuning data.
        device: 'cpu' or 'cuda'.
    """
    print("=" * 55)
    print("PERPLEXITY COMPARISON: Before vs After Fine-tuning")
    print("=" * 55)

    # In-domain evaluation
    ppl_before_in = compute_perplexity(model_before, in_domain_data, device)
    ppl_after_in = compute_perplexity(model_after, in_domain_data, device)

    print(f"\nIn-domain data (similar to fine-tuning data):")
    print(f"  Before fine-tuning: {ppl_before_in:.2f}")
    print(f"  After fine-tuning:  {ppl_after_in:.2f}")
    print(f"  Change:             {ppl_after_in - ppl_before_in:+.2f} ", end="")
    if ppl_after_in < ppl_before_in:
        print("(improved ✓)")
    else:
        print("(worsened ✗)")

    # Out-of-domain evaluation
    ppl_before_out = compute_perplexity(model_before, out_domain_data, device)
    ppl_after_out = compute_perplexity(model_after, out_domain_data, device)

    print(f"\nOut-of-domain data (different from fine-tuning data):")
    print(f"  Before fine-tuning: {ppl_before_out:.2f}")
    print(f"  After fine-tuning:  {ppl_after_out:.2f}")
    print(f"  Change:             {ppl_after_out - ppl_before_out:+.2f} ", end="")
    if ppl_after_out < ppl_before_out:
        print("(improved ✓)")
    else:
        print("(worsened ✗)")

    print(f"\n--- Interpretation ---")
    print(f"In-domain perplexity should DECREASE after fine-tuning.")
    print(f"The model becomes more confident on data similar to what")
    print(f"it was fine-tuned on.\n")
    print(f"Out-of-domain perplexity may INCREASE after fine-tuning.")
    print(f"This is called 'catastrophic forgetting' — the model")
    print(f"becomes so specialized on the fine-tuning data that it")
    print(f"gets worse on other types of text. This is why:")
    print(f"  1. We use low learning rates during fine-tuning")
    print(f"  2. We train for few epochs (to avoid overfitting)")
    print(f"  3. LoRA helps — by only changing a small number of")
    print(f"     parameters, it preserves most of the original")
    print(f"     model's general knowledge")


# Usage (conceptual — requires trained models):
#
# # Save model before fine-tuning
# model_before = copy.deepcopy(model)
#
# # Fine-tune the model
# fine_tune(model, train_dataset, num_epochs=3)
#
# # Compare perplexity
# compare_perplexity(
#     model_before, model,
#     in_domain_test, out_domain_test
# )
#
# Illustrative output (exact numbers will vary with model and data):
# =====================================================
# PERPLEXITY COMPARISON: Before vs After Fine-tuning
# =====================================================
#
# In-domain data (similar to fine-tuning data):
#   Before fine-tuning: 45.32
#   After fine-tuning:  12.18
#   Change:             -33.14 (improved ✓)
#
# Out-of-domain data (different from fine-tuning data):
#   Before fine-tuning: 38.76
#   After fine-tuning:  42.91
#   Change:             +4.15 (worsened ✗)

Summary

In this chapter, we transformed a base language model from a text-completion engine into a useful tool:

  1. Fine-tuning takes a pretrained model and continues training it on curated data. It’s dramatically cheaper than pretraining because you’re leveraging existing knowledge—the essence of transfer learning.

  2. Instruction datasets teach the model new behavior through example. By training on thousands of instruction-response pairs, the model learns the meta-pattern: “when given an instruction, produce a helpful response.”

  3. The fine-tuning loop is almost identical to pretraining, but with a lower learning rate, fewer epochs, and carefully formatted data. Gradient clipping guards against unstable updates, and a padding-aware loss keeps padding tokens from distorting the objective.

  4. LoRA makes fine-tuning accessible by adding tiny trainable matrices alongside frozen pretrained weights. A rank-8 LoRA layer uses as little as 1-3% of the parameters of the original layer, yet is often enough to capture the adaptations a task needs.

  5. Instruction tuning is the specific application of fine-tuning that produces ChatGPT-like behavior. It teaches models to follow prompts rather than just complete text.

  6. RLHF goes further, using human preferences to teach the model not just to respond, but to respond well. The three-step process—supervised fine-tuning, reward model training, and reinforcement learning—is what makes modern AI assistants feel helpful and aligned.

  7. Evaluation combines automatic metrics (perplexity) with human judgment. No single metric captures everything, so practical evaluation uses both.

You’ve now seen the complete journey from raw text to a model that can follow instructions and answer questions. The remaining frontier—larger scales, better alignment techniques, and real-world deployment—builds on everything we’ve covered in this book.