Building Large Language Models from Scratch: A Beginner’s Guide with Python and PyTorch

Introduction and Setup — Why Build an LLM?

Chapter 1 of 11 · 25 min read
Summary

This chapter introduces Large Language Models through accessible analogies, outlines the complete book roadmap across 11 chapters, and walks the reader through setting up a Python virtual environment with PyTorch. By the end, readers have a verified development environment and understand the journey ahead — from tensors and tokenization to training a working language model.

Welcome. You’re about to do something that most people assume requires a PhD, a cluster of GPUs, and a team of researchers: you’re going to build a Large Language Model from scratch.

Not by calling an API. Not by fine-tuning someone else’s model. You’re going to write the code yourself — line by line, layer by layer — starting from nothing more than Python and PyTorch.

Will your model rival GPT-4 or Claude? No. But by the end of this book, you’ll understand exactly how those models work under the hood, because you’ll have built a smaller version with your own hands. And that understanding is worth more than any tutorial or blog post can give you.

Let’s start at the beginning.


What Is a Large Language Model?

You’ve almost certainly used one. When you ask ChatGPT to write an email, when your phone predicts your next word, when a coding assistant completes your function — there’s an LLM behind the curtain.

But what is it, really?

The Autocomplete Analogy

Think about the autocomplete feature on your phone keyboard. You type “I’m running” and it suggests “late.” It learned this from seeing millions of text messages where those words appeared together.

An LLM is that same idea, scaled up to an almost incomprehensible degree:

  • Your phone’s autocomplete looks at the last few words.
  • An LLM looks at thousands of words of context at once.
  • Your phone learned from a modest dataset of common phrases.
  • An LLM learned from billions of pages of text — books, websites, code, conversations.
  • Your phone picks from a handful of likely next words.
  • An LLM can generate entire paragraphs, essays, or programs.

At its core, though, the principle is identical: given the words so far, predict what comes next.

That’s it. That’s the secret. Every impressive thing an LLM does — writing poetry, answering questions, translating languages, writing code — emerges from this single, deceptively simple objective: next-word prediction.
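
To make the objective concrete, here is a minimal sketch of next-word prediction built from nothing but word-pair counts. The tiny corpus and the function names (`train_bigram_counts`, `predict_next`) are invented for illustration. A real LLM replaces the counting table with a neural network, but the question it answers is the same: given what came before, what is likely to come next?

```python
from collections import Counter, defaultdict

# Toy corpus, invented for illustration only.
corpus = [
    "i am running late",
    "i am running late again",
    "i am running fast",
    "she is running late",
]

def train_bigram_counts(sentences):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for current_word, next_word in zip(words, words[1:]):
            counts[current_word][next_word] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = counts.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

counts = train_bigram_counts(corpus)
print(predict_next(counts, "running"))  # late
print(predict_next(counts, "again"))    # None (never seen with a follower)
```

This version only looks one word back, which is roughly the phone-keyboard end of the analogy. The rest of the book is about replacing the lookup table with a model that can use far more context.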

The Library Analogy

Here’s another way to think about it. Imagine a person who has read every book in a vast library. They haven’t memorized every word (that would just be a database), but they’ve absorbed the patterns — how sentences flow, how arguments are structured, how Python functions are formatted, how stories build tension.

If you give this person the beginning of a sentence, they can continue it in a way that sounds right, because they’ve internalized the statistical patterns of language. They know that “The capital of France is” is almost always followed by “Paris,” and that a Python function definition is usually followed by a docstring or a return statement.

An LLM is a mathematical approximation of that well-read person. It’s a giant mathematical function — with billions of numerical parameters — that takes text as input and produces a probability distribution over what word should come next.
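
As a rough sketch of what "a probability distribution over the next word" means, the snippet below turns raw scores into probabilities with the softmax function, which is the standard way to do this in neural networks. The candidate words and their scores are made up for illustration; a trained model would produce one score per entry in its vocabulary.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing and does not
    # change the result.
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores a model might assign after "The capital of France is".
candidates = ["Paris", "London", "banana"]
scores = [5.0, 2.0, -1.0]

probabilities = softmax(scores)
for word, p in zip(candidates, probabilities):
    print(f"{word:>7}: {p:.3f}")
```

Higher scores become higher probabilities, and every candidate keeps a small nonzero chance, which is why generated text can vary from run to run.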

Why “Large” Matters

You might wonder: we’ve had language models for decades. What changed?

Three things converged around 2017–2020:

  1. The Transformer architecture (2017): A new way to process sequences that can look at all words simultaneously rather than one at a time. We’ll build one in Chapters 7–8.

  2. Scale: Researchers found that making models bigger (more parameters) and feeding them more data didn’t just make them slightly better. It changed what they could do. A model with around 100 million parameters can complete sentences. A model with around 100 billion parameters can write working code, answer multi-step questions, and carry on long conversations.

  3. Compute: GPUs became powerful enough (and cloud computing cheap enough) to actually train these massive models.

The “Large” in LLM isn’t just marketing. Size is genuinely a key ingredient. But here’s the encouraging news: you can understand all the principles by building a small model. The architecture is the same whether you have 1 million parameters or 175 billion. Only the scale differs.
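
To see how the same architecture spans such different sizes, here is a back-of-the-envelope parameter count. It relies on a common approximation (about 12 × d_model² weights per Transformer block) rather than an exact formula, and the small configuration is a hypothetical one chosen to land near this book's scale. The GPT-3 numbers are its published configuration.

```python
def approx_gpt_params(n_layers, d_model, vocab_size):
    """Rough parameter count for a GPT-style model.

    Assumes each block has about 12 * d_model**2 weights (4 * d_model**2
    for attention plus 8 * d_model**2 for a feed-forward layer with 4x
    expansion) and ignores biases, layer norms, and positional embeddings.
    """
    block_params = 12 * d_model ** 2
    embedding_params = vocab_size * d_model
    return n_layers * block_params + embedding_params

# A hypothetical small configuration, similar in scale to this book's model.
small = approx_gpt_params(n_layers=6, d_model=384, vocab_size=8000)

# The published GPT-3 configuration: 96 layers, width 12288, vocab 50257.
gpt3 = approx_gpt_params(n_layers=96, d_model=12288, vocab_size=50257)

print(f"small model: {small / 1e6:.1f}M parameters")  # 13.7M
print(f"GPT-3:       {gpt3 / 1e9:.1f}B parameters")   # 174.6B
```

The function is identical for both calls. Only the three numbers passed in change, which is the sense in which "only the scale differs."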


Why Build One from Scratch?

Fair question. There are excellent pre-trained models available for free. Why go through the trouble of building one yourself?

Understanding vs. Using

There’s a difference between knowing how to drive a car and understanding how an engine works. Both are useful, but they serve different purposes.

If you only use LLMs through APIs, you’re a driver. You can get where you need to go, but when something breaks — when the model hallucinates, when it refuses a reasonable request, when it generates subtly wrong code — you’re stuck. You can’t diagnose the problem because the engine is a black box.

If you build an LLM, even a small one, you become a mechanic. You understand:

  • Why models hallucinate — because they’re pattern-matching, not reasoning, and sometimes the patterns lead somewhere plausible but wrong.
  • Why context length matters — because the attention mechanism has a fixed window, and you’ll implement that window yourself.
  • Why training data matters — because you’ll see firsthand how garbage in produces garbage out.
  • Why fine-tuning works — because you’ll understand the parameters that get adjusted and why.

Career Value

Whether you’re a software engineer, data scientist, researcher, or student, deep understanding of LLMs is becoming one of the most valuable skills in technology. Companies don’t just need people who can call openai.chat.completions.create(). They need people who can:

  • Debug model behavior
  • Choose the right architecture for a problem
  • Understand trade-offs between model size, speed, and quality
  • Build custom solutions when off-the-shelf models don’t fit

This book gives you that foundation.

It’s More Accessible Than You Think

You don’t need:

  • A PhD in machine learning
  • A $10,000 GPU
  • Years of experience with neural networks

You do need:

  • Basic Python skills (functions, loops, classes, lists)
  • Curiosity and patience
  • A computer with at least 8 GB of RAM (yes, really — we’ll train on CPU for most of the book)

If you can write a Python class and use a for loop, you have enough to start. We’ll build everything else together.


What You’ll Build by the End

Let’s make this concrete. By the final chapter of this book, you will have built:

  1. A complete tokenizer — the component that converts raw text into numbers that a neural network can process.

  2. A Transformer model, the same architecture (at a smaller scale) that powers GPT, Claude, LLaMA, and most other modern LLMs.

  3. A training pipeline — the code that feeds data into your model and adjusts its parameters so it learns.

  4. A text generator — a system that takes a prompt and generates coherent continuations, word by word.

Your model will be small — perhaps 10–50 million parameters, trained on a few hundred megabytes of text. It won’t write award-winning essays, but it will generate grammatically correct, topically coherent text. More importantly, every single component will be code you wrote and understand.


Book Roadmap: The Journey Ahead

Here’s where we’re going across all 11 chapters. Think of this as a map — you don’t need to understand every destination yet, just get a feel for the terrain.

Part I: Foundations (Chapters 1–4)

Chapter 1: Introduction and Setup — Why Build an LLM? (You are here) Set up your development environment and understand what we’re building and why.

Chapter 2: Tensors and Neural Network Fundamentals Learn the building block of all deep learning: tensors (fancy multi-dimensional arrays). We’ll cover how neural networks learn through forward passes, loss functions, and backpropagation — all with code you can run.

Chapter 3: Text Tokenization — Teaching Machines to Read Computers don’t understand words. We’ll build a tokenizer that converts text into numbers, starting with simple character-level encoding and progressing to the Byte Pair Encoding (BPE) algorithm used by real LLMs.

Chapter 4: Word Embeddings — Giving Words Meaning A token ID like 4521 doesn’t carry meaning. We’ll build embedding layers that map each token to a rich vector of numbers where similar words end up near each other in vector space.
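
As a small preview of Chapter 3's starting point, here is one possible character-level tokenizer. The function names are placeholders chosen for this sketch and may not match the ones used later in the book.

```python
def build_vocab(text):
    """Map each distinct character to an integer ID (sorted for stability)."""
    chars = sorted(set(text))
    char_to_id = {ch: i for i, ch in enumerate(chars)}
    id_to_char = {i: ch for ch, i in char_to_id.items()}
    return char_to_id, id_to_char

def encode(text, char_to_id):
    """Turn a string into a list of integer token IDs."""
    return [char_to_id[ch] for ch in text]

def decode(ids, id_to_char):
    """Turn a list of token IDs back into a string."""
    return "".join(id_to_char[i] for i in ids)

text = "hello world"
char_to_id, id_to_char = build_vocab(text)

ids = encode("hello", char_to_id)
print(ids)                      # [3, 2, 4, 4, 5]
print(decode(ids, id_to_char))  # hello
```

Encoding followed by decoding returns the original text, which is the basic property any tokenizer needs. BPE keeps that property while using longer, more useful chunks than single characters.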

Part II: The Transformer (Chapters 5–8)

Chapter 5: Attention Mechanism — The Heart of the Transformer The single most important concept in modern NLP. We’ll build self-attention from scratch, understanding how a model learns which words in a sentence are relevant to each other.

Chapter 6: Positional Encoding — Teaching Order to a Parallel System Transformers process all words simultaneously, so they have no inherent sense of word order. We’ll implement positional encodings that tell the model where each word sits in the sequence.

Chapter 7: Building the Transformer Block We’ll assemble attention, embeddings, and feed-forward layers into a complete Transformer block — the repeating unit that makes up the entire model.

Chapter 8: Assembling the Full GPT-Style Model Stack multiple Transformer blocks, add input/output layers, and create the complete model architecture. By the end of this chapter, you’ll have a model that can accept text and produce predictions — it just won’t be trained yet.
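
For a first look at the idea behind Chapter 5, the sketch below computes scaled dot-product self-attention with plain Python lists so the arithmetic stays visible. It makes two simplifying assumptions: the same vectors serve as queries, keys, and values (real models use learned projections), and there is no causal mask (GPT-style models add one so each position only attends to earlier positions).

```python
import math

def softmax(xs):
    """Convert raw scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this query against every key, scaled by sqrt(d_k).
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)  # One weight per position, summing to 1
        # The output is the weighted average of the value vectors.
        out = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three token vectors, invented for illustration.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(x, x, x)

for position, row in enumerate(weights):
    print(position, [round(w, 3) for w in row])
```

Each row of weights says how much one position draws from every other position. In Chapter 5 the same computation is done with tensors, which handles whole batches at once.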

Part III: Training and Generation (Chapters 9–11)

Chapter 9: Training Your Language Model The big one. We’ll prepare a dataset, write the training loop, implement the optimizer, and watch as our model goes from outputting random nonsense to generating coherent text. We’ll cover learning rates, batch sizes, loss curves, and overfitting.

Chapter 10: Text Generation and Sampling Strategies A trained model outputs probabilities. How do we turn those into actual text? We’ll implement greedy decoding, temperature scaling, top-k sampling, and nucleus (top-p) sampling — and see how each one affects output quality.

Chapter 11: Next Steps — Where to Go from Here Fine-tuning, RLHF, scaling laws, efficient inference, and the broader LLM ecosystem. This chapter connects what you’ve built to the cutting-edge research and production systems you’ll encounter in the real world.
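
To give a flavor of Chapter 10, here is a minimal sketch of greedy decoding, temperature scaling, and top-k sampling over a handful of raw scores. The function names and the example scores are illustrative assumptions, and nucleus (top-p) sampling is left out to keep the example short.

```python
import math
import random

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(scores):
    """Always pick the index with the highest score."""
    return max(range(len(scores)), key=lambda i: scores[i])

def sample_next(scores, temperature=1.0, top_k=None, rng=random):
    """Pick an index from raw scores using temperature and optional top-k."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Lower temperature sharpens the distribution; higher flattens it.
    scaled = [s / temperature for s in scores]
    indices = list(range(len(scaled)))
    if top_k is not None:
        # Keep only the k highest-scoring candidates.
        indices = sorted(indices, key=lambda i: scaled[i], reverse=True)[:top_k]
    probs = softmax([scaled[i] for i in indices])
    return rng.choices(indices, weights=probs, k=1)[0]

# Hypothetical scores for a three-word vocabulary.
scores = [1.0, 3.0, 2.0]
print(greedy(scores))                                  # 1
print(sample_next(scores, top_k=1))                    # 1 (only one candidate survives)
print(sample_next(scores, temperature=0.7, top_k=2))   # 1 or 2, chosen at random
```

Greedy decoding is deterministic, while sampling trades some predictability for variety. Temperature and top-k are the two knobs that control how much.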


Setting Up Your Development Environment

Enough motivation. Let’s get your hands dirty.

We need two things installed: Python and PyTorch. If you’re already comfortable with Python virtual environments, feel free to skim ahead. If not, follow every step — getting the environment right now saves hours of debugging later.

Step 1: Verify Your Python Installation

Open a terminal (Command Prompt on Windows, Terminal on macOS/Linux) and check your Python version. On Windows, use python in place of python3 if the python3 command is not recognized:

python3 --version

You should see something like:

Python 3.10.12

You need Python 3.10 or newer. If you have an older version, download the latest from python.org.

Windows Users: Make sure you check “Add Python to PATH” during installation. This is the single most common setup issue for Windows users.

macOS Users: The python3 command is what you want. On older macOS versions, the plain python command points to a legacy Python 2 installation that came with the system; on recent versions (12.3 and later) it may not exist at all.

Step 2: Create a Virtual Environment

A virtual environment is an isolated Python installation just for this project. It prevents conflicts between packages needed by different projects.

Think of it like a clean desk: instead of piling every tool you own onto one surface, you set up a dedicated workspace with just the tools you need.

# Navigate to where you want your project to live
mkdir llm-from-scratch
cd llm-from-scratch

# Create the virtual environment
python3 -m venv venv

# Activate it
# On macOS/Linux:
source venv/bin/activate

# On Windows:
# venv\Scripts\activate

After activation, your terminal prompt should change to show (venv) at the beginning. This tells you the virtual environment is active and any packages you install will go into this isolated environment, not your system Python.

(venv) ~/llm-from-scratch $

Why does this matter? Without a virtual environment, installing packages can break other Python projects on your system. The virtual environment is your safety net. Always activate it before working on this project.
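
If you are ever unsure whether the environment is active, you can ask Python itself. The check below is a small sketch that relies on how the standard venv module works: inside a virtual environment, sys.prefix differs from sys.base_prefix. The helper name is made up for this example.

```python
import sys

def in_virtual_environment():
    """Return True when Python is running inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the Python installation it came from.
    return sys.prefix != sys.base_prefix

if in_virtual_environment():
    print(f"Virtual environment active: {sys.prefix}")
else:
    print("No virtual environment detected; activate it before installing packages.")
```

Run it with the same python3 command you use for the rest of the project, since a different interpreter may report a different result.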

Step 3: Install PyTorch

PyTorch is the deep learning framework we’ll use throughout this book. It provides the building blocks — tensors, automatic differentiation, neural network layers — that we’ll combine to build our LLM.

For learning purposes, the CPU-only version is all you need. It’s smaller to download and simpler to install:

# Install PyTorch (CPU version)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

This download is roughly 200–300 MB. Give it a minute.

Have an NVIDIA GPU? If you have a compatible NVIDIA GPU and want faster training later, you can install the CUDA-enabled version instead. Visit pytorch.org/get-started and select your CUDA version. But the CPU version works for everything in this book — GPU just makes training faster, not different.
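
One common way to write code that runs on either kind of machine is to pick the device once and create tensors on it. This is a sketch of that pattern, not necessarily the exact convention later chapters follow.

```python
import torch

# Pick the GPU when PyTorch can see one, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create a tensor directly on the chosen device.
x = torch.ones(2, 3, device=device)
y = x * 2  # Runs on whichever device holds x

print(f"Using device: {device}")
print(f"Result lives on: {y.device.type}")
```

The arithmetic is the same on both devices, so the script needs no other changes when you move between a laptop and a GPU machine.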

Step 4: Install Additional Packages

We need a few more packages for data handling and visualization throughout the book:

pip install numpy matplotlib tqdm

  • numpy: Numerical computing (we’ll use PyTorch tensors more, but numpy shows up in data preprocessing)
  • matplotlib: Plotting training loss curves and visualizations
  • tqdm: Progress bars for training loops (because watching a blank terminal during training is agonizing)

Step 5: Verify Everything Works

Create a file called verify_setup.py in your project directory:

"""
Verification script for LLM-from-scratch development environment.
Run this to confirm everything is installed correctly.
"""

import sys

# ---- Check Python version ----
print("=" * 50)
print("ENVIRONMENT VERIFICATION")
print("=" * 50)

# sys.version_info is a named tuple, e.g., (3, 10, 12, 'final', 0); it compares like a regular tuple
python_version = sys.version_info

print(f"\n1. Python version: {sys.version}")

# We need at least Python 3.10
if python_version >= (3, 10):
    print("   ✓ Python version is 3.10+ — OK")
else:
    print("   ✗ Python version is too old. Please install Python 3.10 or newer.")
    sys.exit(1)  # Stop here if Python is too old

# ---- Check PyTorch ----
try:
    import torch  # Import the PyTorch library

    print(f"\n2. PyTorch version: {torch.__version__}")
    print("   ✓ PyTorch is installed — OK")

    # Check if CUDA (GPU support) is available
    if torch.cuda.is_available():
        print(f"   GPU available: {torch.cuda.get_device_name(0)}")
    else:
        print("   Running on CPU (this is fine for learning!)")

except ImportError:
    print("\n2. ✗ PyTorch is NOT installed.")
    print("   Run: pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu")
    sys.exit(1)

# ---- Check numpy ----
try:
    import numpy as np  # Import numpy for numerical operations

    print(f"\n3. NumPy version: {np.__version__}")
    print("   ✓ NumPy is installed — OK")
except ImportError:
    print("\n3. ✗ NumPy is NOT installed. Run: pip install numpy")
    sys.exit(1)

# ---- Check matplotlib ----
try:
    import matplotlib  # Import matplotlib for plotting

    print(f"\n4. Matplotlib version: {matplotlib.__version__}")
    print("   ✓ Matplotlib is installed — OK")
except ImportError:
    print("\n4. ✗ Matplotlib is NOT installed. Run: pip install matplotlib")
    sys.exit(1)

# ---- Check tqdm ----
try:
    import tqdm  # Import tqdm for progress bars

    print(f"\n5. tqdm version: {tqdm.__version__}")
    print("   ✓ tqdm is installed — OK")
except ImportError:
    print("\n5. ✗ tqdm is NOT installed. Run: pip install tqdm")
    sys.exit(1)

# ---- Quick PyTorch smoke test ----
print("\n" + "=" * 50)
print("PYTORCH SMOKE TEST")
print("=" * 50)

# Create a simple tensor (a multi-dimensional array of numbers)
x = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0])
print(f"\n6. Created tensor: {x}")
print(f"   Shape: {x.shape}")       # How many elements: torch.Size([5])
print(f"   Data type: {x.dtype}")   # The number type: torch.float32

# Do some basic math
y = x * 2 + 1  # Multiply each element by 2 and add 1
print(f"\n7. x * 2 + 1 = {y}")

# Matrix multiplication (the core operation in neural networks)
# Create a 2x3 matrix and a 3x2 matrix, then multiply them
a = torch.randn(2, 3)  # 2 rows, 3 columns, filled with random numbers
b = torch.randn(3, 2)  # 3 rows, 2 columns, filled with random numbers
c = torch.matmul(a, b)  # Matrix multiplication: (2x3) @ (3x2) = (2x2)

print("\n8. Matrix multiplication:")
print(f"   A shape: {a.shape}")  # torch.Size([2, 3])
print(f"   B shape: {b.shape}")  # torch.Size([3, 2])
print(f"   A @ B shape: {c.shape}")  # torch.Size([2, 2])

# Automatic differentiation (how neural networks learn)
# This is a preview — we'll cover this in detail in Chapter 2
w = torch.tensor([2.0], requires_grad=True)  # A "learnable" parameter
loss = (w * 3 - 6) ** 2  # A simple loss function: (2*3 - 6)^2 = 0
loss.backward()  # Compute the gradient (derivative) automatically

print("\n9. Autograd test:")
print(f"   w = {w.item()}")            # 2.0
print(f"   loss = (w*3 - 6)^2 = {loss.item()}")  # 0.0
print(f"   d(loss)/dw = {w.grad.item()}")  # 0.0 (because loss is already minimized)

print("\n" + "=" * 50)
print("ALL CHECKS PASSED — You're ready to build an LLM!")
print("=" * 50)

Run it:

python3 verify_setup.py

If everything is installed correctly, you’ll see:

==================================================
ENVIRONMENT VERIFICATION
==================================================

1. Python version: 3.10.12 (main, ...)
   ✓ Python version is 3.10+ — OK

2. PyTorch version: 2.1.0+cpu
   ✓ PyTorch is installed — OK
   Running on CPU (this is fine for learning!)

3. NumPy version: 1.24.0
   ✓ NumPy is installed — OK

4. Matplotlib version: 3.8.0
   ✓ Matplotlib is installed — OK

5. tqdm version: 4.66.0
   ✓ tqdm is installed — OK

==================================================
PYTORCH SMOKE TEST
==================================================

6. Created tensor: tensor([1., 2., 3., 4., 5.])
   Shape: torch.Size([5])
   Data type: torch.float32

7. x * 2 + 1 = tensor([ 3.,  5.,  7.,  9., 11.])

8. Matrix multiplication:
   A shape: torch.Size([2, 3])
   B shape: torch.Size([3, 2])
   A @ B shape: torch.Size([2, 2])

9. Autograd test:
   w = 2.0
   loss = (w*3 - 6)^2 = 0.0
   d(loss)/dw = 0.0

==================================================
ALL CHECKS PASSED — You're ready to build an LLM!
==================================================

If any check fails, the script tells you exactly what to install. Fix it before moving on.
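
The autograd check in verify_setup.py starts exactly at the minimum, so its gradient is zero. As an optional extra, the sketch below starts away from the minimum to show a nonzero gradient and a single gradient-descent step. The learning rate of 0.05 is an arbitrary value picked for illustration.

```python
import torch

w = torch.tensor([1.0], requires_grad=True)  # Start away from the minimum at w = 2
loss = (w * 3 - 6) ** 2                      # (1*3 - 6)^2 = 9
loss.backward()                              # d(loss)/dw = 2 * (3w - 6) * 3 = -18

print(f"loss = {loss.item()}")        # 9.0
print(f"gradient = {w.grad.item()}")  # -18.0

# One gradient-descent step with a hypothetical learning rate of 0.05.
learning_rate = 0.05
with torch.no_grad():
    w -= learning_rate * w.grad  # 1.0 - 0.05 * (-18.0) = 1.9

new_loss = (w * 3 - 6) ** 2
print(f"w after one step = {w.item():.2f}")        # 1.90
print(f"loss after step = {new_loss.item():.2f}")  # 0.09
```

The negative gradient says that increasing w lowers the loss, and one step moves w most of the way to 2. Chapter 2 covers why this works.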


Hardware: What Do You Actually Need?

One of the biggest misconceptions about deep learning is that you need expensive hardware. For learning, you don’t.

The Minimum (What This Book Assumes)

| Component | Minimum | Recommended |
|-----------|---------|-------------|
| CPU | Any modern processor (Intel i5 / AMD Ryzen 5 or equivalent) | Multi-core processor (i7 / Ryzen 7) |
| RAM | 8 GB | 16 GB |
| Disk | 5 GB free space | 20 GB free space |
| GPU | Not required | NVIDIA GPU with 4+ GB VRAM (optional) |
| OS | Windows 10+, macOS 11+, or Linux | Any of these |

Can You Really Train on CPU?

Yes — with caveats.

The models we build in this book are deliberately small. A model with 10 million parameters (which is enough to generate interesting text) trains in roughly one to three hours on a modern CPU. For comparison, GPT-3 has 175 billion parameters and required thousands of GPUs training for weeks.

We’re building a bicycle, not a jumbo jet. But the engineering principles are the same.
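
A quick estimate shows why 8 GB of RAM is plausible for models this size. The sketch below assumes 32-bit floats and an Adam-style optimizer, which stores two extra values per parameter on top of the weights and gradients. Both are assumptions about the training setup, and activation memory is ignored, so read the numbers as a lower bound.

```python
def training_memory_mb(n_params, bytes_per_value=4, copies=4):
    """Rough memory for weights, gradients, and Adam's two moment buffers.

    Assumes float32 values (4 bytes each) and four values stored per
    parameter. Activations are ignored, so this is a lower bound.
    """
    return n_params * bytes_per_value * copies / 1e6

for n_params in (1_000_000, 10_000_000, 50_000_000):
    megabytes = training_memory_mb(n_params)
    print(f"{n_params / 1e6:>4.0f}M parameters: ~{megabytes:,.0f} MB")
```

Even the 50-million-parameter case comes to about 800 MB by this estimate, which leaves room for activations and the dataset on an 8 GB machine.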

Here’s a rough guide to what you can expect:

| Model Size | CPU Training Time | GPU Training Time |
|------------|-------------------|-------------------|
| 1M parameters | Minutes | Seconds |
| 10M parameters | 1–3 hours | 10–30 minutes |
| 50M parameters | 6–12 hours | 1–2 hours |

If training feels slow, that’s actually a feature for learning. You’ll have time to think about what the model is doing while it trains, check the loss curve, and form hypotheses about what’s working and what isn’t.

Free GPU Options (If You Want Speed)

If you do want GPU acceleration later (especially for Chapter 9 onward), these free options exist:

  • Google Colab: The free tier typically offers a T4 GPU, subject to availability. Upload your notebook and train in the cloud.
  • Kaggle Notebooks: Similar to Colab, with a weekly free GPU quota (about 30 hours at the time of writing).
  • Lightning AI: Free tier with GPU access for PyTorch projects.

But I encourage you to start on CPU. The slight slowness forces you to think more carefully about your code and experiments, which leads to deeper understanding.


Your First PyTorch Program: Hello, Tensors

Before we close this chapter, let’s write one more program. This isn’t just a verification — it’s your first taste of the building blocks we’ll use throughout the book.

Create a file called hello_pytorch.py:

"""
Hello PyTorch — your first step toward building an LLM.

This script demonstrates the absolute basics of tensors,
which are the fundamental data structure in deep learning.
Think of tensors as multi-dimensional arrays on steroids.
"""

import torch

# ---- What is a tensor? ----
# A tensor is just an array of numbers. The "dimension" or "rank"
# tells you how many axes it has:
#   - 0D tensor (scalar): a single number
#   - 1D tensor (vector): a list of numbers
#   - 2D tensor (matrix): a grid of numbers (rows and columns)
#   - 3D tensor: a cube of numbers
#   - nD tensor: you get the idea

# Scalar (0D) — just a number
scalar = torch.tensor(42.0)
print(f"Scalar: {scalar}")
print(f"  Shape: {scalar.shape}")  # torch.Size([]) — no dimensions
print(f"  Rank: {scalar.dim()}")   # 0
print()

# Vector (1D) — a list of numbers
vector = torch.tensor([1.0, 2.0, 3.0, 4.0])
print(f"Vector: {vector}")
print(f"  Shape: {vector.shape}")  # torch.Size([4]) — 4 elements
print(f"  Rank: {vector.dim()}")   # 1
print()

# Matrix (2D) — a grid of numbers
# Think of it as a spreadsheet: 2 rows, 3 columns
matrix = torch.tensor([
    [1.0, 2.0, 3.0],  # Row 0
    [4.0, 5.0, 6.0],  # Row 1
])
print(f"Matrix:\n{matrix}")
print(f"  Shape: {matrix.shape}")  # torch.Size([2, 3]) — 2 rows, 3 columns
print(f"  Rank: {matrix.dim()}")   # 2
print()

# 3D Tensor — a stack of matrices
# In NLP, this often represents: (batch_size, sequence_length, features)
# For example: 2 sentences, each 4 words long, each word represented by 3 numbers
tensor_3d = torch.randn(2, 4, 3)  # Random numbers, shape (2, 4, 3)
print(f"3D Tensor shape: {tensor_3d.shape}")  # torch.Size([2, 4, 3])
print(f"  Rank: {tensor_3d.dim()}")            # 3
print(f"  Total elements: {tensor_3d.numel()}")  # 2 * 4 * 3 = 24
print()

# ---- Why do shapes matter? ----
# In deep learning, most bugs are shape mismatches.
# If you try to multiply a (2, 3) matrix with a (5, 7) matrix, it won't work.
# The inner dimensions must match: (2, 3) @ (3, 4) -> (2, 4)

a = torch.randn(2, 3)  # 2 rows, 3 columns
b = torch.randn(3, 4)  # 3 rows, 4 columns
result = a @ b          # Matrix multiply: (2,3) @ (3,4) -> (2,4)

print("Matrix multiplication:")
print(f"  {a.shape} @ {b.shape} = {result.shape}")  # [2,3] @ [3,4] = [2,4]
print()

# ---- A preview of what's coming ----
# In an LLM, we'll work with tensors like this:
#   - Input token IDs:  shape (batch_size, sequence_length)
#   - Embeddings:       shape (batch_size, sequence_length, embedding_dim)
#   - Attention scores: shape (batch_size, num_heads, seq_len, seq_len)
#   - Output logits:    shape (batch_size, sequence_length, vocab_size)
#
# Don't worry if these don't mean anything yet. By Chapter 8,
# you'll be manipulating all of them with confidence.

# Simulate a tiny "vocabulary" of 10 words, with 4-dimensional embeddings
vocab_size = 10       # Our toy vocabulary has 10 words
embedding_dim = 4     # Each word is represented by 4 numbers
batch_size = 1        # 1 sentence at a time (for reference; the lookup below omits the batch dimension)
seq_length = 5        # Each sentence has 5 words (matches the 5 token IDs below)

# Create a random embedding table (we'll learn real ones in Chapter 4)
embedding_table = torch.randn(vocab_size, embedding_dim)
print(f"Embedding table shape: {embedding_table.shape}")
# Each row is the vector for one word: torch.Size([10, 4])

# Simulate a sentence: [word3, word7, word1, word0, word5]
token_ids = torch.tensor([3, 7, 1, 0, 5])
print(f"Token IDs: {token_ids}")

# Look up the embedding for each token
# This is literally what an embedding layer does!
embeddings = embedding_table[token_ids]  # Index into the table
print(f"Embeddings shape: {embeddings.shape}")  # torch.Size([5, 4])
# Each of our 5 tokens now has a 4-dimensional representation

print(f"\nWord 0 (token 3) embedding: {embeddings[0]}")
print(f"Word 1 (token 7) embedding: {embeddings[1]}")

print("\n✓ If you can see this, you're ready for Chapter 2!")

Run it:

python3 hello_pytorch.py

Study the output. Pay attention to the shapes. Every shape printed here will come back in later chapters. You don’t need to memorize them — just notice the pattern: tensors have shapes, shapes tell you the structure of your data, and shapes must be compatible for operations like matrix multiplication.
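
The lookup in hello_pytorch.py works on a single sentence with no batch dimension. As a sketch of how the same idea looks with PyTorch's built-in embedding layer and an explicit batch dimension, consider the following. Chapter 4 may structure this differently.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # Make the random table reproducible

vocab_size, embedding_dim = 10, 4
embedding = nn.Embedding(vocab_size, embedding_dim)

# A batch of one sentence: shape (batch_size=1, sequence_length=5).
token_ids = torch.tensor([[3, 7, 1, 0, 5]])

embedded = embedding(token_ids)
print(embedded.shape)  # torch.Size([1, 5, 4])

# The layer returns the same rows as indexing its weight table directly.
manual = embedding.weight[token_ids]
print(torch.equal(embedded, manual))  # True
```

The output shape is (batch_size, sequence_length, embedding_dim), the same three-axis layout listed for embeddings in the script's preview comments.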


Common Setup Issues and Fixes

Before we wrap up, here are the most common problems people run into, along with fixes:

“python3: command not found”

  • Windows: Use python instead of python3, or reinstall Python with “Add to PATH” checked.
  • macOS: Install Python 3 via python.org or Homebrew (brew install python3).
  • Linux: Install via your package manager (sudo apt install python3 on Ubuntu/Debian).

“No module named torch”

You probably forgot to activate the virtual environment. Run:

source venv/bin/activate   # macOS/Linux
# or
venv\Scripts\activate      # Windows

Then try the import again.

“pip install torch” downloads a huge file (1+ GB)

You’re probably downloading the CUDA version. Use the CPU-only URL:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

“Permission denied” errors

Don’t use sudo pip install. If you’re getting permission errors, make sure your virtual environment is activated. The (venv) prefix should be visible in your terminal prompt.

PyTorch version mismatch warnings

As long as import torch works and the version is 2.0 or newer, you’re fine. Minor version differences (2.0 vs 2.1 vs 2.2) won’t affect anything in this book.


Chapter Summary

Here’s what we covered:

  • LLMs are next-word predictors — they learn statistical patterns from massive amounts of text and use those patterns to generate new text.
  • Building from scratch teaches understanding — you’ll know not just how to use LLMs, but why they work (and why they sometimes don’t).
  • The book covers 11 chapters — from tensors and tokenization through the full Transformer architecture to training and text generation.
  • Your environment is set up — Python 3.10+, PyTorch, numpy, matplotlib, and tqdm are installed in a clean virtual environment.
  • CPU training is sufficient for learning — our models are deliberately small enough to train without expensive GPUs.
  • Tensors are the building block — multi-dimensional arrays of numbers, and their shapes are the most important thing to track.

Exercises

Exercise 1: Verify Your Installation

Task: Run the verify_setup.py script from this chapter and confirm all checks pass. If any check fails, fix the issue before moving on.

Solution: Follow the installation steps above. The most common issues are:

  • Python version too old → install Python 3.10+
  • Virtual environment not activated → run source venv/bin/activate
  • PyTorch not installed → run the pip install torch command with the CPU URL

Exercise 2: Tensor Exploration

Task: Create a Python script that does the following:

  1. Creates a tensor of shape (3, 4) filled with zeros
  2. Creates a tensor of shape (3, 4) filled with ones
  3. Creates a tensor of shape (3, 4) filled with random numbers between 0 and 1
  4. Adds the ones tensor and the random tensor together, and prints the result and its shape
  5. Computes the mean of the random tensor and prints it

Solution:

import torch

# 1. Zeros
zeros = torch.zeros(3, 4)
print(f"Zeros:\n{zeros}")
print(f"Shape: {zeros.shape}\n")  # torch.Size([3, 4])

# 2. Ones
ones = torch.ones(3, 4)
print(f"Ones:\n{ones}")
print(f"Shape: {ones.shape}\n")  # torch.Size([3, 4])

# 3. Random
random_t = torch.rand(3, 4)  # Uniform random between 0 and 1
print(f"Random:\n{random_t}")
print(f"Shape: {random_t.shape}\n")  # torch.Size([3, 4])

# 4. Addition
added = ones + random_t  # Element-wise addition
print(f"Ones + Random:\n{added}")
print(f"Shape: {added.shape}\n")  # torch.Size([3, 4])

# 5. Mean
mean_val = random_t.mean()  # Average of all 12 elements
print(f"Mean of random tensor: {mean_val.item():.4f}")

Exercise 3: Shape Detective

Task: Without running the code, predict the output shape of each operation. Then run the code to check your answers.

import torch

a = torch.randn(5, 3)
b = torch.randn(3, 2)
c = torch.randn(5, 1)

# What shape is each result?
r1 = a @ b        # ?
r2 = a + c        # ? (hint: broadcasting)
r3 = a.T          # ? (transpose)
r4 = a.reshape(15)  # ?
r5 = a.unsqueeze(0)  # ?

Solution:

import torch

a = torch.randn(5, 3)  # 5 rows, 3 columns
b = torch.randn(3, 2)  # 3 rows, 2 columns
c = torch.randn(5, 1)  # 5 rows, 1 column

r1 = a @ b           # (5,3) @ (3,2) -> (5, 2)
print(f"r1 shape: {r1.shape}")  # torch.Size([5, 2])

r2 = a + c           # (5,3) + (5,1) -> (5, 3) via broadcasting
print(f"r2 shape: {r2.shape}")  # torch.Size([5, 3])
# Broadcasting: c's single column is "broadcast" across all 3 columns of a

r3 = a.T             # Transpose: rows become columns
print(f"r3 shape: {r3.shape}")  # torch.Size([3, 5])

r4 = a.reshape(15)   # Flatten all 15 elements into a 1D tensor
print(f"r4 shape: {r4.shape}")  # torch.Size([15])

r5 = a.unsqueeze(0)  # Add a new dimension at position 0
print(f"r5 shape: {r5.shape}")  # torch.Size([1, 5, 3])
# This is commonly used to add a "batch" dimension
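
If the broadcasting result in r2 felt surprising, the rule behind it can be written in a few lines of plain Python: compare shapes from the right, and treat a missing dimension or a dimension of 1 as stretchable. The helper below is an illustrative sketch of that rule, not a PyTorch function.

```python
from itertools import zip_longest

def broadcast_shape(shape_a, shape_b):
    """Return the broadcast result shape, or raise ValueError if incompatible."""
    result = []
    # Compare dimensions from the right; a missing dimension counts as 1.
    pairs = zip_longest(reversed(shape_a), reversed(shape_b), fillvalue=1)
    for dim_a, dim_b in pairs:
        if dim_a == dim_b or dim_b == 1:
            result.append(dim_a)
        elif dim_a == 1:
            result.append(dim_b)
        else:
            raise ValueError(f"cannot broadcast {shape_a} with {shape_b}")
    return tuple(reversed(result))

print(broadcast_shape((5, 3), (5, 1)))     # (5, 3)
print(broadcast_shape((5, 3), (3,)))       # (5, 3)
print(broadcast_shape((1, 5, 3), (5, 1)))  # (1, 5, 3)
```

Calling it with (5, 3) and (5,) raises an error, because the rightmost dimensions 3 and 5 differ and neither is 1. The same pair of tensors would fail to add in PyTorch for the same reason.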

What’s Next

In Chapter 2, we’ll dive into how neural networks actually learn. We’ll cover:

  • What a neuron really is (spoiler: it’s just multiplication and addition)
  • Forward passes: how data flows through a network
  • Loss functions: how the network knows it’s wrong
  • Backpropagation: how the network adjusts itself to be less wrong
  • All implemented from scratch in PyTorch
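
As a teaser for the first item in that list, here is a single neuron in plain Python. The input values, weights, and bias are invented, and the ReLU activation is one common choice included for illustration; Chapter 2 may introduce a different one.

```python
def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias: multiplication and addition."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias

def relu(value):
    """A common activation function: negative values become zero."""
    return max(0.0, value)

# Numbers invented for illustration.
inputs = [1.0, 2.0, 3.0]
weights = [0.5, -1.0, 2.0]
bias = 0.5

raw = neuron(inputs, weights, bias)  # 0.5 - 2.0 + 6.0 + 0.5 = 5.0
print(raw)         # 5.0
print(relu(raw))   # 5.0
print(relu(-3.0))  # 0.0
```
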

You have your tools. You have your environment. Let’s build.