Foundations — Tensors, Gradients, and Neural Networks
This chapter builds the mathematical and conceptual foundation for deep learning. Starting with tensors as multi-dimensional containers for numbers, it progresses through tensor operations (addition, multiplication, matrix multiplication with shape tracking), gradients as slopes that guide optimization, and backpropagation as the mechanism for learning. A complete neural network is built from scratch to predict house prices, demonstrating forward passes, loss computation, and weight updates. Activation functions (ReLU, Sigmoid) provide non-linearity, and loss functions (MSE, cross-entropy) measure prediction quality. Every concept uses real-world analogies and includes runnable PyTorch code with shape annotations.
Before we can build a language model, we need to speak the language that neural networks speak: numbers arranged in structured containers, flowing through mathematical operations, adjusting themselves to get better at a task. That’s the essence of deep learning, and this chapter will give you a solid, hands-on understanding of every piece.
By the end of this chapter, you will:
- Understand what tensors are and how to manipulate them
- Know what gradients are and why they matter for learning
- Build a complete neural network from scratch in PyTorch
- Understand activation functions and loss functions intuitively
Let’s begin.
1. What Is a Tensor?
The Container Analogy
Imagine you have different kinds of containers for holding numbers:
- A single box that holds one number, say the temperature today: 72. This is a scalar.
- A row of boxes holding a list of numbers, say temperatures for the week: [72, 68, 75, 71, 69, 74, 73]. This is a vector.
- A grid of boxes (rows and columns), say temperatures for 4 weeks across 7 days. That’s a table with 4 rows and 7 columns. This is a matrix.
- A cube of boxes, say temperatures for 4 weeks, 7 days, measured at 3 different times each day. Now you have a 3D block. This is a 3D tensor.
A tensor is just a generalization of all of these. It’s a container that can hold numbers in any number of dimensions. The number of dimensions is called the tensor’s rank (or ndim).
| Name | Rank | Shape Example | Real-World Analogy |
|---|---|---|---|
| Scalar | 0 | () | A single temperature reading |
| Vector | 1 | (7,) | Temperatures for a week |
| Matrix | 2 | (4, 7) | Temperatures for 4 weeks × 7 days |
| 3D | 3 | (4, 7, 3) | Weeks × Days × Times of day |
| 4D | 4 | (12, 4, 7, 3) | Months × Weeks × Days × Times |
In deep learning, tensors are everywhere. Your data is a tensor. Your model’s weights are tensors. The output is a tensor. Understanding them is non-negotiable.
Creating Tensors in PyTorch
Let’s make this concrete with code. If you haven’t installed PyTorch yet, run pip install torch in your terminal.
import torch
# Scalar (rank 0) — a single number
scalar = torch.tensor(42.0)
print(f"Scalar: {scalar}")
print(f" Shape: {scalar.shape}") # torch.Size([])
print(f" Dimensions: {scalar.ndim}") # 0
print()
# Vector (rank 1) — a list of numbers
vector = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0])
print(f"Vector: {vector}")
print(f" Shape: {vector.shape}") # torch.Size([5])
print(f" Dimensions: {vector.ndim}") # 1
print()
# Matrix (rank 2) — a grid of numbers
matrix = torch.tensor([
[1.0, 2.0, 3.0],
[4.0, 5.0, 6.0]
])
print(f"Matrix:\n{matrix}")
print(f" Shape: {matrix.shape}") # torch.Size([2, 3])
print(f" Dimensions: {matrix.ndim}") # 2
print()
# 3D Tensor (rank 3) — a cube of numbers
tensor_3d = torch.tensor([
[[1, 2], [3, 4], [5, 6]],
[[7, 8], [9, 10], [11, 12]]
])
print(f"3D Tensor:\n{tensor_3d}")
print(f" Shape: {tensor_3d.shape}") # torch.Size([2, 3, 2])
print(f" Dimensions: {tensor_3d.ndim}") # 3
Output:
Scalar: tensor(42.)
Shape: torch.Size([])
Dimensions: 0
Vector: tensor([1., 2., 3., 4., 5.])
Shape: torch.Size([5])
Dimensions: 1
Matrix:
tensor([[1., 2., 3.],
[4., 5., 6.]])
Shape: torch.Size([2, 3])
Dimensions: 2
3D Tensor:
tensor([[[ 1, 2],
[ 3, 4],
[ 5, 6]],
[[ 7, 8],
[ 9, 10],
[11, 12]]])
Shape: torch.Size([2, 3, 2])
Dimensions: 3
Reading Tensor Shapes
The shape of a tensor tells you how many elements exist along each dimension. Learning to read shapes is one of the most important skills in deep learning.
For a tensor with shape (2, 3, 2):
- The first dimension has size 2 (think: 2 “slabs”)
- The second dimension has size 3 (think: 3 rows in each slab)
- The third dimension has size 2 (think: 2 columns in each row)
Total number of elements: 2 × 3 × 2 = 12.
# Useful ways to create tensors
zeros = torch.zeros(3, 4) # 3×4 matrix of zeros
ones = torch.ones(2, 5) # 2×5 matrix of ones
random = torch.randn(3, 3) # 3×3 matrix of random numbers (normal distribution)
sequence = torch.arange(0, 10) # [0, 1, 2, ..., 9]
print(f"Zeros shape: {zeros.shape}") # torch.Size([3, 4])
print(f"Ones shape: {ones.shape}") # torch.Size([2, 5])
print(f"Random shape: {random.shape}") # torch.Size([3, 3])
print(f"Sequence shape: {sequence.shape}") # torch.Size([10])
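The element-count rule from this section (multiply the sizes of all dimensions) can be sketched in plain Python without PyTorch. The helper name `num_elements` is made up for illustration and is not part of any library.

```python
import math

def num_elements(shape):
    """Return how many numbers a tensor of the given shape holds."""
    # The element count is the product of all dimension sizes.
    # math.prod of an empty tuple is 1, which matches a scalar with shape ().
    return math.prod(shape)

for shape in [(), (7,), (4, 7), (2, 3, 2), (12, 4, 7, 3)]:
    print(shape, "->", num_elements(shape), "elements")
```

For a real tensor, `tensor.numel()` reports the same count.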
Reshaping Tensors
You’ll frequently need to change the shape of a tensor without changing its data. Think of it like rearranging the same 12 eggs from a single row into a 3×4 carton.
# Start with a flat vector of 12 numbers
flat = torch.arange(1, 13, dtype=torch.float32)
print(f"Flat: {flat}")
print(f" Shape: {flat.shape}") # torch.Size([12])
# Reshape to 3 rows × 4 columns
grid = flat.reshape(3, 4)
print(f"\nReshaped to (3, 4):\n{grid}")
print(f" Shape: {grid.shape}") # torch.Size([3, 4])
# Reshape to 2 × 2 × 3
cube = flat.reshape(2, 2, 3)
print(f"\nReshaped to (2, 2, 3):\n{cube}")
print(f" Shape: {cube.shape}") # torch.Size([2, 2, 3])
# Using -1 to let PyTorch figure out one dimension
auto = flat.reshape(4, -1) # "Make it 4 rows, figure out the columns"
print(f"\nReshaped to (4, -1):\n{auto}")
print(f" Shape: {auto.shape}") # torch.Size([4, 3])
Key rule: The total number of elements must stay the same. You can reshape a (12,) tensor into (3, 4), (4, 3), (2, 6), (2, 2, 3), etc. — but not into (3, 5) because 3 × 5 = 15 ≠ 12.
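To make the key rule concrete, here is a rough sketch of how a `-1` dimension could be resolved. It is a simplified model of the idea, not PyTorch's actual implementation, and `resolve_shape` is a hypothetical helper name.

```python
import math

def resolve_shape(total, shape):
    """Fill in a single -1 entry so the shape holds exactly `total` elements."""
    if shape.count(-1) > 1:
        raise ValueError("only one dimension can be inferred")
    # Product of the dimensions that were given explicitly
    known = math.prod(d for d in shape if d != -1)
    if -1 in shape:
        if known == 0 or total % known != 0:
            raise ValueError(f"cannot reshape {total} elements into {shape}")
        # The missing dimension is whatever makes the product come out right
        return tuple(total // known if d == -1 else d for d in shape)
    if known != total:
        raise ValueError(f"cannot reshape {total} elements into {shape}")
    return tuple(shape)

print(resolve_shape(12, (4, -1)))    # (4, 3)
print(resolve_shape(12, (2, 2, 3)))  # (2, 2, 3)
```

Calling `resolve_shape(12, (3, 5))` raises an error, mirroring the rule that 3 × 5 = 15 elements cannot come from 12.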
2. Tensor Operations
Now that we know what tensors are, let’s learn how to do math with them.
Element-wise Operations
The simplest operations work element by element — each number in one tensor pairs with the corresponding number in the other tensor.
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])
# Addition: each element adds to its partner
c = a + b
print(f"a + b = {c}") # tensor([5., 7., 9.])
# Subtraction
print(f"a - b = {a - b}") # tensor([-3., -3., -3.])
# Element-wise multiplication (NOT matrix multiplication!)
print(f"a * b = {a * b}") # tensor([4., 10., 18.])
# Element-wise division
print(f"a / b = {a / b}") # tensor([0.2500, 0.4000, 0.5000])
# Squaring each element
print(f"a ** 2 = {a ** 2}") # tensor([1., 4., 9.])
Notice that for element-wise operations, both tensors must have the same shape (with some exceptions — see Broadcasting below).
# Works with matrices too
A = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
B = torch.tensor([[5.0, 6.0], [7.0, 8.0]])
print(f"A:\n{A}")
print(f"B:\n{B}")
print(f"A + B:\n{A + B}")
# tensor([[ 6., 8.],
# [10., 12.]])
Matrix Multiplication — The Most Important Operation
Matrix multiplication is the core operation in neural networks. Let’s build intuition step by step.
The rule: To multiply matrix A (shape m × n) by matrix B (shape n × p), the number of columns in A must equal the number of rows in B. The result has shape m × p.
A (m × n) @ B (n × p) = C (m × p)
↑
These must match!
Visual walkthrough with concrete numbers:
# A is 2×3 (2 rows, 3 columns)
A = torch.tensor([
[1.0, 2.0, 3.0],
[4.0, 5.0, 6.0]
])
# B is 3×2 (3 rows, 2 columns)
B = torch.tensor([
[7.0, 8.0],
[9.0, 10.0],
[11.0, 12.0]
])
print(f"A shape: {A.shape}") # torch.Size([2, 3])
print(f"B shape: {B.shape}") # torch.Size([3, 2])
# Matrix multiplication: A (2×3) @ B (3×2) = C (2×2)
C = A @ B # The @ operator does matrix multiplication
print(f"\nA @ B:\n{C}")
print(f"Result shape: {C.shape}") # torch.Size([2, 2])
How is each element computed? Each element C[i][j] is the dot product of row i of A with column j of B:
C[0][0] = (1×7) + (2×9) + (3×11) = 7 + 18 + 33 = 58
C[0][1] = (1×8) + (2×10) + (3×12) = 8 + 20 + 36 = 64
C[1][0] = (4×7) + (5×9) + (6×11) = 28 + 45 + 66 = 139
C[1][1] = (4×8) + (5×10) + (6×12) = 32 + 50 + 72 = 154
Let’s verify:
# Manual verification
print(f"C[0,0] = 1*7 + 2*9 + 3*11 = {1*7 + 2*9 + 3*11}") # 58
print(f"C[0,1] = 1*8 + 2*10 + 3*12 = {1*8 + 2*10 + 3*12}") # 64
print(f"C[1,0] = 4*7 + 5*9 + 6*11 = {4*7 + 5*9 + 6*11}") # 139
print(f"C[1,1] = 4*8 + 5*10 + 6*12 = {4*8 + 5*10 + 6*12}") # 154
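The same dot-product rule can be written as three nested loops over plain Python lists. This is only a teaching sketch (real libraries use heavily optimized kernels), and the function name `matmul` is chosen here for illustration.

```python
def matmul(A, B):
    """Multiply an m×n matrix by an n×p matrix, both given as nested lists."""
    m, n, p = len(A), len(B), len(B[0])
    if any(len(row) != n for row in A):
        raise ValueError("columns of A must equal rows of B")
    C = [[0.0] * p for _ in range(m)]
    for i in range(m):          # each row of A
        for j in range(p):      # each column of B
            for k in range(n):  # walk along the shared dimension
                C[i][j] += A[i][k] * B[k][j]
    return C

A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
B = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
C = matmul(A, B)
print(C)  # [[58.0, 64.0], [139.0, 154.0]]
```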
Why does this matter for neural networks? In a neural network, the input data is a matrix and the weights are another matrix. The forward pass is essentially a series of matrix multiplications. When we say “a layer with 128 neurons takes a 64-dimensional input,” we mean a (batch_size × 64) tensor is multiplied by a (64 × 128) weight matrix to produce a (batch_size × 128) output.
# Simulating a neural network layer:
batch_size = 4
input_features = 3
output_features = 5
# Random input data: 4 samples, each with 3 features
inputs = torch.randn(batch_size, input_features)
print(f"Input shape: {inputs.shape}") # torch.Size([4, 3])
# Weight matrix: transforms 3 features → 5 features
weights = torch.randn(input_features, output_features)
print(f"Weights shape: {weights.shape}") # torch.Size([3, 5])
# Forward pass = matrix multiplication
output = inputs @ weights
print(f"Output shape: {output.shape}") # torch.Size([4, 5])
# 4 samples in, 4 samples out. Each now has 5 features instead of 3.
Broadcasting — When Shapes Don’t Quite Match
Sometimes you want to add a single number to every element of a tensor, or add a vector to every row of a matrix. PyTorch handles this automatically through broadcasting.
The idea: when two tensors have different shapes, PyTorch “stretches” the smaller tensor to match the larger one, if certain rules are met.
# Scalar + Vector: the scalar is "broadcast" across all elements
a = torch.tensor([1.0, 2.0, 3.0])
result = a + 10
print(f"[1, 2, 3] + 10 = {result}") # tensor([11., 12., 13.])
# Vector + Matrix: the vector is broadcast across all rows
matrix = torch.tensor([
[1.0, 2.0, 3.0],
[4.0, 5.0, 6.0]
])
bias = torch.tensor([10.0, 20.0, 30.0])
result = matrix + bias
print(f"\nMatrix + bias vector:")
print(result)
# tensor([[11., 22., 33.],
# [14., 25., 36.]])
# The bias [10, 20, 30] was added to EACH row
Broadcasting rules (simplified):
- Compare shapes from right to left
- Dimensions either match, or one of them is 1 (which gets stretched)
- If one tensor has fewer dimensions, it’s padded with 1s on the left
# Shape tracking examples
A = torch.randn(4, 3) # Shape: (4, 3)
b = torch.randn(3) # Shape: (3,) → broadcast to (4, 3)
C = A + b # Result: (4, 3) ✓
print(f"A {A.shape} + b {b.shape} = C {C.shape}")
# Another example
X = torch.randn(2, 3, 4) # Shape: (2, 3, 4)
Y = torch.randn(1, 1, 4) # Shape: (1, 1, 4) → broadcast to (2, 3, 4)
Z = X + Y # Result: (2, 3, 4) ✓
print(f"X {X.shape} + Y {Y.shape} = Z {Z.shape}")
# This would FAIL:
# A = torch.randn(4, 3)
# b = torch.randn(4)
# C = A + b # ERROR! 3 ≠ 4 when comparing from the right
Broadcasting is crucial because neural networks constantly add bias vectors to the output of matrix multiplications. The bias has shape (output_features,) and the matrix multiplication output has shape (batch_size, output_features). Broadcasting makes this work seamlessly.
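The simplified broadcasting rules from this section can be turned into a small shape calculator. This sketch follows only those three rules, and `broadcast_shape` is a hypothetical helper written for this example.

```python
def broadcast_shape(a, b):
    """Return the result shape of broadcasting shapes a and b, or raise ValueError."""
    ndim = max(len(a), len(b))
    # Rule 3: pad the shorter shape with 1s on the left
    a = (1,) * (ndim - len(a)) + tuple(a)
    b = (1,) * (ndim - len(b)) + tuple(b)
    result = []
    # Rules 1 and 2: after padding, each aligned pair of dimensions
    # must either match or contain a 1 (which gets stretched)
    for da, db in zip(a, b):
        if da == db or db == 1:
            result.append(da)
        elif da == 1:
            result.append(db)
        else:
            raise ValueError(f"shapes {a} and {b} are not broadcastable")
    return tuple(result)

print(broadcast_shape((4, 3), (3,)))          # (4, 3)
print(broadcast_shape((2, 3, 4), (1, 1, 4)))  # (2, 3, 4)
```

Passing `(4, 3)` and `(4,)` raises an error, which matches the failing example in this section.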
3. What Is a Gradient?
The Hill-Rolling Analogy
Imagine you’re blindfolded, standing on a hilly landscape, and your goal is to reach the lowest valley. You can’t see, but you can feel the slope under your feet. If the ground tilts to the left, you step left. If it tilts forward, you step forward. You always move in the direction of the steepest descent.
This is gradient descent — the core algorithm behind training neural networks.
- The landscape is the loss function (how wrong your predictions are)
- Your position is the current values of the weights
- The slope under your feet is the gradient
- Taking a step downhill is updating the weights to reduce the loss
The gradient tells you two things:
- Which direction to move each weight (increase it or decrease it?)
- How much to move (steep slope = big step, gentle slope = small step)
Derivatives — The Slope of a Curve
If you remember one thing from calculus, let it be this: a derivative is the slope of a curve at a specific point.
Consider the function $f(x) = x^2$:
- At $x = 3$: the slope (derivative) is $2 \times 3 = 6$. The curve is climbing steeply upward.
- At $x = 1$: the slope is $2 \times 1 = 2$. Still climbing, but more gently.
- At $x = 0$: the slope is $2 \times 0 = 0$. Flat! This is the bottom of the valley — the minimum.
- At $x = -2$: the slope is $2 \times (-2) = -4$. A negative slope means the curve is falling as $x$ increases, so the valley lies to the right.
If we want to minimize $f(x) = x^2$, we move in the opposite direction of the gradient:
- At $x = 3$, gradient is 6 (positive), so we decrease $x$.
- At $x = -2$, gradient is -4 (negative), so we increase $x$.
- Either way, we move toward $x = 0$, the minimum.
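One way to check these slopes without any calculus is to measure them numerically: nudge $x$ a little in both directions and see how much $f$ changes. The sketch below uses a central difference. The helper name `slope` and the step size `h` are choices made for this example.

```python
def slope(f, x, h=1e-5):
    """Estimate the derivative of f at x using a central difference."""
    return (f(x + h) - f(x - h)) / (2 * h)

def square(x):
    return x ** 2

for x in [3.0, 1.0, 0.0, -2.0]:
    print(f"x = {x:5.1f}  slope ≈ {slope(square, x):8.4f}")
```

The estimates should land very close to $2x$ at each point: about 6, 2, 0, and -4.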
PyTorch Autograd — Automatic Gradient Computation
PyTorch can compute gradients automatically. You just need to tell it which tensors to track.
import torch
# Create a tensor and tell PyTorch to track gradients
x = torch.tensor(3.0, requires_grad=True)
print(f"x = {x}")
# Define a function: f(x) = x²
f = x ** 2
print(f"f(x) = x² = {f}")
# Compute the gradient (derivative) of f with respect to x
f.backward()
# The gradient is stored in x.grad
print(f"df/dx at x=3: {x.grad}") # Should be 2*3 = 6.0
Output:
x = tensor(3., requires_grad=True)
f(x) = x² = tensor(9., grad_fn=<PowBackward0>)
df/dx at x=3: tensor(6.)
Let’s try a more complex function: $f(x) = 3x^3 + 2x^2 - 5x + 7$
The derivative is: $f'(x) = 9x^2 + 4x - 5$
x = torch.tensor(2.0, requires_grad=True)
f = 3 * x**3 + 2 * x**2 - 5 * x + 7
print(f"f(2) = {f.item()}") # 3*8 + 2*4 - 10 + 7 = 24 + 8 - 10 + 7 = 29
f.backward()
print(f"f'(2) = {x.grad.item()}") # 9*4 + 4*2 - 5 = 36 + 8 - 5 = 39
Now with multiple variables — this is what happens in real neural networks (many weights, one loss):
# Two parameters
w = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)
# A simple function of both
# f(w, b) = 3w² + 2wb + b²
f = 3 * w**2 + 2 * w * b + b**2
f.backward()
# Partial derivatives:
# ∂f/∂w = 6w + 2b = 6*2 + 2*1 = 14
# ∂f/∂b = 2w + 2b = 2*2 + 2*1 = 6
print(f"∂f/∂w = {w.grad.item()}") # 14.0
print(f"∂f/∂b = {b.grad.item()}") # 6.0
The gradient with respect to w tells us: “If you increase w slightly, f will increase by about 14 times that amount.” Since we want to decrease f (minimize loss), we should decrease w.
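The "about 14 times that amount" claim can be checked directly by nudging `w` by a small amount and comparing the actual change in `f` with the gradient's prediction. The nudge size `eps` is an arbitrary value picked for this sketch.

```python
def f(w, b):
    return 3 * w**2 + 2 * w * b + b**2

w, b = 2.0, 1.0
grad_w = 6 * w + 2 * b  # analytic partial derivative: 14.0

eps = 0.001  # a small nudge to w
actual_change = f(w + eps, b) - f(w, b)
predicted_change = grad_w * eps

print(f"actual change:    {actual_change:.6f}")
print(f"predicted change: {predicted_change:.6f}")

# Stepping the other way (decreasing w) lowers f, as the text suggests
print(f"f after decreasing w: {f(w - eps, b):.6f} (was {f(w, b):.6f})")
```

The two changes agree closely but not exactly: the gradient is a local, straight-line approximation, and the gap shrinks as `eps` gets smaller.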
Gradient Descent in Action
Let’s find the minimum of $f(x) = (x - 5)^2$. We know the answer is $x = 5$, but let’s watch gradient descent discover it:
# Start at a random position
x = torch.tensor(0.0, requires_grad=True)
learning_rate = 0.1
print("Step | x | f(x) | gradient")
print("-----|---------|---------|--------")
for step in range(20):
    # Forward: compute f(x)
    f = (x - 5) ** 2
    # Backward: compute gradient
    f.backward()
    # Print current state
    if step < 10 or step % 5 == 0:
        print(f" {step:2d} | {x.item():7.4f} | {f.item():7.4f} | {x.grad.item():7.4f}")
    # Update x: move in the opposite direction of the gradient
    with torch.no_grad(): # Don't track this operation
        x -= learning_rate * x.grad
    # Reset gradient for next iteration
    x.grad.zero_()
print(f"\nFinal x = {x.item():.4f} (target: 5.0)")
Output:
Step | x | f(x) | gradient
-----|---------|---------|--------
0 | 0.0000 | 25.0000 | -10.0000
1 | 1.0000 | 16.0000 | -8.0000
2 | 1.8000 | 10.2400 | -6.4000
3 | 2.4400 | 6.5536 | -5.1200
4 | 2.9520 | 4.1943 | -4.0960
5 | 3.3616 | 2.6844 | -3.2768
6 | 3.6893 | 1.7180 | -2.6214
7 | 3.9514 | 1.0995 | -2.0972
8 | 4.1612 | 0.7037 | -1.6777
9 | 4.3289 | 0.4504 | -1.3422
10 | 4.4631 | 0.2882 | -1.0737
15 | 4.8241 | 0.0309 | -0.3518
Final x = 4.9424 (target: 5.0)
Watch how:
- The gradient starts large (-10) when we’re far from the minimum
- It shrinks as we approach the target
- x converges toward 5.0, the minimum of $(x-5)^2$
This is exactly how neural networks learn! Replace x with millions of weights, and $(x-5)^2$ with a complex loss function, and you have the same process.
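For this particular function the shrinking pattern has a simple explanation. The update is $x \leftarrow x - 0.1 \cdot 2(x - 5)$, which rearranges to $x_{\text{new}} - 5 = 0.8\,(x - 5)$: every step removes 20% of the remaining distance to the minimum. A plain-Python sketch (no autograd, with the gradient written out by hand) can confirm that the loop matches this closed form. The function name `gradient_descent` is chosen for this example.

```python
def gradient_descent(x0, learning_rate, steps):
    """Minimize f(x) = (x - 5)^2 using the hand-derived gradient 2(x - 5)."""
    x = x0
    for _ in range(steps):
        grad = 2 * (x - 5)
        x -= learning_rate * grad
    return x

steps = 20
x_final = gradient_descent(0.0, 0.1, steps)

# Closed form: the distance to 5 shrinks by a factor of 0.8 per step
closed_form = 5 - 5 * 0.8**steps

print(f"loop result: {x_final:.6f}")
print(f"closed form: {closed_form:.6f}")
```

The factor is $1 - 2 \cdot \text{learning\_rate}$, so a learning rate above 1.0 would make its magnitude exceed 1 and the iterates would move away from the minimum instead of toward it.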
4. Backpropagation
The Chain of Dominoes
Imagine a chain of dominoes. You push the first one, it hits the second, which hits the third, and so on. Each domino’s fall is caused by the one before it.
Backpropagation works the same way, but in reverse. We start at the end (the loss), and trace backward through every computation to figure out how each weight contributed to the error.
Mathematically, this is the chain rule from calculus. If $z$ depends on $y$, and $y$ depends on $x$, then:
$$\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}$$
In words: “How much does $z$ change when $x$ changes?” equals “How much does $z$ change when $y$ changes?” times “How much does $y$ change when $x$ changes?”
A Computation Graph Example
Let’s trace through a concrete example. Consider this computation:
x = 2
w = 3
b = 1
y = w * x (y = 6)
z = y + b (z = 7)
L = z² (L = 49) ← This is our "loss"
We want to find how L changes with respect to w (so we can update w to reduce L).
Forward pass (compute left to right):
x=2, w=3, b=1 → y = w*x = 6 → z = y+b = 7 → L = z² = 49
Backward pass (compute right to left using chain rule):
dL/dz = 2z = 2*7 = 14 (How does L change with z?)
dz/dy = 1 (How does z change with y?)
dy/dw = x = 2 (How does y change with w?)
dL/dw = dL/dz × dz/dy × dy/dw (Chain rule!)
= 14 × 1 × 2
= 28
Let’s verify with PyTorch:
x = torch.tensor(2.0)
w = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)
# Forward pass
y = w * x # y = 6
z = y + b # z = 7
L = z ** 2 # L = 49
print(f"Forward pass: y={y.item()}, z={z.item()}, L={L.item()}")
# Backward pass
L.backward()
print(f"dL/dw = {w.grad.item()}") # 28.0
print(f"dL/db = {b.grad.item()}") # 14.0
# Manual verification:
# dL/db = dL/dz × dz/db = 14 × 1 = 14 ✓
# dL/dw = dL/dz × dz/dy × dy/dw = 14 × 1 × 2 = 28 ✓
Output:
Forward pass: y=6.0, z=7.0, L=49.0
dL/dw = 28.0
dL/db = 14.0
Why This Matters
In a real neural network with millions of parameters, we can’t compute gradients by hand. PyTorch builds a computation graph during the forward pass, recording every operation. Then .backward() traverses this graph in reverse, applying the chain rule at every step to compute every gradient automatically.
This is what makes deep learning practical — you define the forward computation, and PyTorch handles the backward computation for free.
# PyTorch tracks the computation graph
x = torch.tensor(3.0, requires_grad=True)
# Each operation creates a node in the graph
a = x * 2 # MulBackward
b = a + 5 # AddBackward
c = b ** 3 # PowBackward
d = torch.sin(c) # SinBackward
print(f"d.grad_fn = {d.grad_fn}") # Shows the last operation
# You can walk back through the graph:
print(f" → {d.grad_fn.next_functions[0][0]}") # PowBackward
print(f" → {d.grad_fn.next_functions[0][0].next_functions[0][0]}") # AddBackward
d.backward()
print(f"\nGradient of d with respect to x: {x.grad.item():.4f}")
The graph is built dynamically: every time you run your code, a fresh graph is created. This means you can use if statements, loops, and any Python logic in your forward pass, and PyTorch will still compute correct gradients. This design is known as a dynamic computation graph, and it is one of PyTorch’s strengths.
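To demystify what `.backward()` is doing, here is a deliberately tiny scalar-only sketch of the same idea: each value remembers its parents and the local derivative with respect to each one, and the backward pass walks the graph in reverse applying the chain rule. The `Value` class is a hypothetical teaching aid written for this chapter's example. It is not how PyTorch is implemented internally.

```python
class Value:
    """A scalar that remembers how it was computed (toy autograd, not PyTorch)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # Values this one was computed from
        self._local_grads = local_grads  # d(self)/d(parent), one per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def square(self):
        return Value(self.data ** 2, (self,), (2 * self.data,))

    def backward(self):
        # Order the nodes so that each one comes after everything it depends on
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0  # dL/dL = 1
        # Walk from the output back to the inputs, applying the chain rule
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += node.grad * local

# The same computation as the worked example: L = (w*x + b)^2
x, w, b = Value(2.0), Value(3.0), Value(1.0)
y = w * x
z = y + b
L = z.square()
L.backward()
print(w.grad, b.grad)  # 28.0 14.0
```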
5. Your First Neural Network
Now we’ll combine everything to build a real (tiny) neural network. We’ll predict house prices based on two features: size (in hundreds of square feet) and number of bedrooms.
What Is a Neuron?
A single neuron does three things:
- Multiply each input by a weight (importance factor)
- Sum all the weighted inputs plus a bias
- Apply an activation function (introduces non-linearity)
inputs: [x₁, x₂]
weights: [w₁, w₂]
bias: b
output = activation(w₁·x₁ + w₂·x₂ + b)
Think of it like a tiny decision-maker. The weights control how much it cares about each input. The bias shifts its baseline. The activation function adds flexibility.
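The three steps can be sketched as a few lines of plain Python. The input, weight, and bias values below are made up purely for illustration, and ReLU is used as the activation only as an example choice.

```python
def relu(v):
    return max(0.0, v)

def neuron(inputs, weights, bias, activation=relu):
    """One neuron: weighted sum of the inputs, plus bias, then an activation."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)

# 0.5*10 + (-1.0)*3 + 2.0 = 4.0, and ReLU leaves positive values unchanged
out = neuron(inputs=[10.0, 3.0], weights=[0.5, -1.0], bias=2.0)
print(out)  # 4.0
```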
Building It From Scratch (No PyTorch nn)
Let’s build a network to predict house prices. Two inputs (size, bedrooms) → one output (price in thousands of dollars).
import torch
# ─── Training Data ───
# [size (hundreds of sq ft), bedrooms] → price ($1000s)
X = torch.tensor([
[10.0, 3.0], # 1000 sqft, 3 bedrooms → $200k
[15.0, 4.0], # 1500 sqft, 4 bedrooms → $300k
[20.0, 5.0], # 2000 sqft, 5 bedrooms → $400k
[8.0, 2.0], # 800 sqft, 2 bedrooms → $150k
[12.0, 3.0], # 1200 sqft, 3 bedrooms → $250k
], dtype=torch.float32)
y_true = torch.tensor([
[200.0],
[300.0],
[400.0],
[150.0],
[250.0],
], dtype=torch.float32)
print(f"Input shape: {X.shape}") # (5, 2) — 5 houses, 2 features each
print(f"Target shape: {y_true.shape}") # (5, 1) — 5 prices
Now let’s define our weights and perform a forward pass manually:
# Initialize weights randomly
torch.manual_seed(42) # For reproducibility
# Layer: 2 inputs → 1 output
# Weight matrix: (2, 1) — one weight per input feature
# Bias: (1,) — one bias for the output
w = torch.randn(2, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
print(f"Initial weights:\n{w}")
print(f"Initial bias: {b}")
print(f"Weight shape: {w.shape}") # (2, 1)
print(f"Bias shape: {b.shape}") # (1,)
The Forward Pass
# Forward pass: predictions = X @ w + b
# X shape: (5, 2)
# w shape: (2, 1)
# X @ w shape: (5, 1) — matrix multiplication
# b shape: (1,) — broadcast to (5, 1)
# predictions shape: (5, 1)
predictions = X @ w + b
print(f"\nPredictions shape: {predictions.shape}")
print(f"Predictions:\n{predictions.data}")
print(f"True values:\n{y_true}")
The predictions are random garbage right now — that’s expected! The weights are random.
Loss Calculation
We need a number that tells us how wrong our predictions are. We’ll use Mean Squared Error (MSE): the average of the squared differences between predictions and true values.
$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$
# Calculate loss (Mean Squared Error)
errors = predictions - y_true # Shape: (5, 1)
squared_errors = errors ** 2 # Shape: (5, 1)
loss = squared_errors.mean() # Shape: scalar
print(f"\nErrors:\n{errors.data}")
print(f"Squared errors:\n{squared_errors.data}")
print(f"Loss (MSE): {loss.item():.2f}")
Backward Pass and Weight Update
# Backward pass: compute gradients
loss.backward()
print(f"\nGradients:")
print(f" dL/dw = {w.grad.data}")
print(f" dL/db = {b.grad.data}")
# Update weights: move in the opposite direction of the gradient
learning_rate = 0.0001 # Small steps — house prices are large numbers
with torch.no_grad():
    w -= learning_rate * w.grad
    b -= learning_rate * b.grad
# Clear gradients for next iteration
w.grad.zero_()
b.grad.zero_()
# Check new predictions
new_predictions = X @ w + b
new_loss = ((new_predictions - y_true) ** 2).mean()
print(f"\nLoss before: {loss.item():.2f}")
print(f"Loss after: {new_loss.item():.2f}")
print(f"Improvement: {loss.item() - new_loss.item():.2f}")
The loss should decrease! Now let’s run this for many iterations:
Training Loop
# Full training loop
torch.manual_seed(42)
w = torch.randn(2, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
learning_rate = 0.0001
num_epochs = 1000
print("Epoch | Loss")
print("------|--------")
for epoch in range(num_epochs):
    # Forward pass
    predictions = X @ w + b
    loss = ((predictions - y_true) ** 2).mean()
    # Backward pass
    loss.backward()
    # Update weights
    with torch.no_grad():
        w -= learning_rate * w.grad
        b -= learning_rate * b.grad
    # Clear gradients
    w.grad.zero_()
    b.grad.zero_()
    # Print every 100 epochs
    if epoch % 100 == 0 or epoch == num_epochs - 1:
        print(f" {epoch:4d} | {loss.item():.4f}")
print(f"\nFinal weights: {w.data.flatten().tolist()}")
print(f"Final bias: {b.item():.4f}")
# Test predictions
final_predictions = X @ w + b
print(f"\nPredictions vs True values:")
for i in range(len(y_true)):
    print(f" House {i+1}: predicted ${final_predictions[i].item():.1f}k, "
          f"actual ${y_true[i].item():.1f}k")
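For a linear model with MSE loss, the gradients that `loss.backward()` produces also have a closed form: $\partial L / \partial w_j = \frac{2}{n} \sum_i (\hat{y}_i - y_i)\, x_{ij}$ and $\partial L / \partial b = \frac{2}{n} \sum_i (\hat{y}_i - y_i)$. The sketch below evaluates these formulas in plain Python on the same five houses and cross-checks one of them with a finite difference. The starting weights are arbitrary values chosen for the example, not the ones produced by `torch.manual_seed(42)`.

```python
X = [[10.0, 3.0], [15.0, 4.0], [20.0, 5.0], [8.0, 2.0], [12.0, 3.0]]
y = [200.0, 300.0, 400.0, 150.0, 250.0]

def predict(w, b, row):
    return w[0] * row[0] + w[1] * row[1] + b

def mse(w, b):
    return sum((predict(w, b, r) - t) ** 2 for r, t in zip(X, y)) / len(X)

def mse_gradients(w, b):
    """Closed-form gradients of the MSE loss for a linear model."""
    n = len(X)
    errors = [predict(w, b, r) - t for r, t in zip(X, y)]
    grad_w = [2 / n * sum(e * r[j] for e, r in zip(errors, X)) for j in range(2)]
    grad_b = 2 / n * sum(errors)
    return grad_w, grad_b

w, b = [0.5, -0.2], 0.0  # arbitrary starting point (assumption for this sketch)
grad_w, grad_b = mse_gradients(w, b)

# Cross-check dL/dw_0 numerically with a central difference
h = 1e-4
numeric_w0 = (mse([w[0] + h, w[1]], b) - mse([w[0] - h, w[1]], b)) / (2 * h)

print(f"analytic dL/dw0: {grad_w[0]:.4f}")
print(f"numeric  dL/dw0: {numeric_w0:.4f}")
```

Both gradients come out negative at this starting point because the predictions are far too low, so gradient descent will increase the weights.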
Using PyTorch nn.Module (The Real Way)
In practice, nobody writes weight updates manually. PyTorch’s nn.Module handles the bookkeeping. Here’s the same network, the proper way:
import torch
import torch.nn as nn
import torch.optim as optim
# Define the model as a class
class HousePriceModel(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.Linear handles weights + bias automatically
        # Input: 2 features, Output: 1 prediction
        self.linear = nn.Linear(2, 1)
    def forward(self, x):
        return self.linear(x)
# Create model, loss function, and optimizer
model = HousePriceModel()
criterion = nn.MSELoss() # Mean Squared Error
optimizer = optim.SGD(model.parameters(), lr=0.0001) # Stochastic Gradient Descent
# Print model architecture
print(model)
print(f"\nModel parameters:")
for name, param in model.named_parameters():
    print(f" {name}: shape={param.shape}, values={param.data}")
Output:
HousePriceModel(
(linear): Linear(in_features=2, out_features=1, bias=True)
)
Model parameters:
linear.weight: shape=torch.Size([1, 2]), values=tensor([...])
linear.bias: shape=torch.Size([1]), values=tensor([...])
# Training loop with nn.Module
print("Epoch | Loss")
print("------|--------")
for epoch in range(1000):
    # Forward pass
    predictions = model(X) # Calls model.forward(X)
    loss = criterion(predictions, y_true)
    # Backward pass
    optimizer.zero_grad() # Clear old gradients
    loss.backward() # Compute new gradients
    optimizer.step() # Update weights
    if epoch % 100 == 0 or epoch == 999:
        print(f" {epoch:4d} | {loss.item():.4f}")
# Final predictions
model.eval() # Switch to evaluation mode
with torch.no_grad():
    final_preds = model(X)
print(f"\nFinal predictions:")
for i in range(len(y_true)):
    print(f" House {i+1}: ${final_preds[i].item():.1f}k "
          f"(actual: ${y_true[i].item():.1f}k)")
Notice how much cleaner this is:
- nn.Linear(2, 1) creates the weight matrix and bias automatically
- optimizer.zero_grad() clears gradients
- loss.backward() computes all gradients
- optimizer.step() updates all weights
This pattern — forward → loss → backward → step — is the heartbeat of all neural network training.
Adding a Hidden Layer
A single linear layer can only learn linear relationships (straight lines). To learn complex patterns, we stack multiple layers with activation functions in between:
class HousePriceModelV2(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(2, 8) # 2 inputs → 8 hidden neurons
        self.relu = nn.ReLU() # Activation function
        self.layer2 = nn.Linear(8, 1) # 8 hidden → 1 output
    def forward(self, x):
        x = self.layer1(x) # Shape: (batch, 2) → (batch, 8)
        x = self.relu(x) # Shape: (batch, 8), same shape, values changed
        x = self.layer2(x) # Shape: (batch, 8) → (batch, 1)
        return x
model_v2 = HousePriceModelV2()
print(model_v2)
# Count parameters
total_params = sum(p.numel() for p in model_v2.parameters())
print(f"\nTotal parameters: {total_params}")
# layer1: 2*8 weights + 8 biases = 24
# layer2: 8*1 weights + 1 bias = 9
# Total: 33 parameters
# Train the deeper model
optimizer_v2 = optim.SGD(model_v2.parameters(), lr=0.0001)
criterion = nn.MSELoss()
for epoch in range(2000):
    predictions = model_v2(X)
    loss = criterion(predictions, y_true)
    optimizer_v2.zero_grad()
    loss.backward()
    optimizer_v2.step()
    if epoch % 500 == 0 or epoch == 1999:
        print(f"Epoch {epoch:4d} | Loss: {loss.item():.4f}")
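The parameter arithmetic in the comments generalizes to any stack of fully connected layers: each layer contributes `inputs × outputs` weights plus one bias per output. A small sketch, where the helper name `count_parameters` is made up for this example:

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the width of each layer, input first, e.g. [2, 8, 1].
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # weight matrix
        total += n_out         # bias vector
    return total

print(count_parameters([2, 8, 1]))  # 33
print(count_parameters([2, 1]))     # 3
```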
6. Activation Functions
Why Do We Need Them?
Without activation functions, stacking layers is pointless. Here’s why:
A linear layer computes $y = Wx + b$. If you stack two linear layers:
$$\text{Layer 1: } h = W_1 x + b_1$$
$$\text{Layer 2: } y = W_2 h + b_2 = W_2(W_1 x + b_1) + b_2 = (W_2 W_1)x + (W_2 b_1 + b_2)$$
This is still just $y = W'x + b'$, a single linear transformation, with $W' = W_2 W_1$ and $b' = W_2 b_1 + b_2$. No matter how many linear layers you stack, the result is always linear. You might as well have one layer.
Activation functions break this linearity. They introduce curves, bends, and non-linear patterns that let neural networks learn complex relationships.
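The collapse of stacked linear layers can be checked numerically. The sketch below uses small hand-picked matrices (arbitrary values, chosen only for the demonstration) and plain Python lists, then compares the two-layer output with the single combined layer.

```python
def matvec(W, x):
    """Multiply matrix W (nested lists) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply two matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]  # layer 1: 2 inputs → 3 hidden
b1 = [0.5, -0.5, 1.0]
W2 = [[2.0, 0.0, -1.0]]                     # layer 2: 3 hidden → 1 output
b2 = [0.25]
x = [1.5, -2.0]

# Two linear layers applied one after the other, no activation in between
stacked = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# One equivalent layer: W' = W2 W1 and b' = W2 b1 + b2
W_combined = matmul(W2, W1)
b_combined = add(matvec(W2, b1), b2)
collapsed = add(matvec(W_combined, x), b_combined)

print(stacked, collapsed)  # [-7.25] [-7.25]
```

Inserting a non-linear function such as ReLU between the two layers breaks this equivalence, which is the whole point of activation functions.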
ReLU (Rectified Linear Unit)
What it does: If the input is positive, keep it. If negative, set it to zero.
$$\text{ReLU}(x) = \max(0, x)$$
Visual description:
Input: -3 -1 0 1 3 5
Output: 0 0 0 1 3 5
Think of it as a flood gate: positive values pass through unchanged, while negative values are blocked (set to 0).
import torch
import torch.nn.functional as F
x = torch.tensor([-3.0, -1.0, 0.0, 1.0, 3.0, 5.0])
relu_output = F.relu(x)
print(f"Input: {x.tolist()}")
print(f"ReLU: {relu_output.tolist()}")
# Input: [-3.0, -1.0, 0.0, 1.0, 3.0, 5.0]
# ReLU: [0.0, 0.0, 0.0, 1.0, 3.0, 5.0]
Why is ReLU popular?
- Dead simple to compute (just a comparison)
- Doesn’t saturate for positive values: the gradient there is exactly 1, which helps avoid vanishing gradients
- Empirically works very well in practice
- Most neural networks use ReLU or its variants (GELU, which we’ll see in transformers)
Sigmoid
What it does: Squashes any input into the range (0, 1).
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
Visual description: an S-shaped curve that rises smoothly from 0.0 on the far left, through 0.5 at the center, to 1.0 on the far right.
- Large negative → near 0
- Zero → exactly 0.5
- Large positive → near 1
x = torch.tensor([-5.0, -2.0, 0.0, 2.0, 5.0])
sigmoid_output = torch.sigmoid(x)
print(f"Input: {x.tolist()}")
print(f"Sigmoid: {[f'{v:.4f}' for v in sigmoid_output.tolist()]}")
# Input: [-5.0, -2.0, 0.0, 2.0, 5.0]
# Sigmoid: ['0.0067', '0.1192', '0.5000', '0.8808', '0.9933']
When to use sigmoid:
- Output gates (when you need a value between 0 and 1)
- Binary classification (is this spam or not spam?)
- Probability outputs
Why not everywhere? Sigmoid has the “vanishing gradient” problem: for large positive or large negative inputs, the curve flattens out and the gradient becomes near-zero, so learning stalls. That’s why ReLU is preferred for hidden layers.
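The vanishing gradient is easy to see from the sigmoid's derivative, $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$, which peaks at 0.25 and decays quickly on both sides. A short plain-Python sketch comparing it with ReLU's gradient (the helper names are chosen for this example):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:5.1f}  sigmoid grad = {sigmoid_grad(x):.6f}  relu grad = {relu_grad(x):.1f}")
```

Because backpropagation multiplies these local gradients together layer by layer, several saturated sigmoids in a row can shrink the signal to almost nothing.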
Comparison in a Network
# Same network, different activation functions
x = torch.randn(5, 3) # 5 samples, 3 features
linear = nn.Linear(3, 4)
output_raw = linear(x)
print(f"Raw output (no activation):\n{output_raw.data}\n")
print(f"After ReLU:\n{F.relu(output_raw).data}\n")
print(f"After Sigmoid:\n{torch.sigmoid(output_raw).data}")
Note how:
- Raw output can be any real number (positive or negative)
- After ReLU: all negatives become 0, positives unchanged
- After Sigmoid: everything squeezed between 0 and 1
GELU — A Preview
When we build our LLM later, we’ll use GELU (Gaussian Error Linear Unit) instead of ReLU. GELU is smoother — instead of a hard cutoff at zero, it has a soft curve. Think of it as a “gentler” ReLU. For now, just know it exists.
x = torch.tensor([-3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 3.0])
print(f"Input: {x.tolist()}")
print(f"ReLU: {F.relu(x).tolist()}")
print(f"GELU: {[f'{v:.3f}' for v in F.gelu(x).tolist()]}")
# Notice GELU allows small negative values through, unlike ReLU
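For the curious, the exact GELU is defined as $\text{GELU}(x) = x \cdot \Phi(x)$, where $\Phi$ is the cumulative distribution function of the standard normal distribution. The sketch below writes that formula in plain Python using `math.erf`. As far as I know this matches the default (non-approximate) form of `F.gelu`, but treat it as an illustration rather than a reference implementation.

```python
import math

def gelu(x):
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def relu(x):
    return max(0.0, x)

for x in [-3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 3.0]:
    print(f"x = {x:5.1f}  relu = {relu(x):6.3f}  gelu = {gelu(x):6.3f}")
```

For large positive inputs GELU is nearly identical to ReLU. For negative inputs it dips slightly below zero before flattening out.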
7. Loss Functions
A loss function measures how wrong your model’s predictions are. It produces a single number: lower = better. Training is all about minimizing this number.
MSE (Mean Squared Error) — For Regression
When you’re predicting a continuous number (price, temperature, age), MSE is the go-to loss:
$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$
Where $\hat{y}_i$ is the prediction and $y_i$ is the true value.
Why squared? Two reasons:
- It makes all errors positive (a prediction that’s too high by 5 and one that’s too low by 5 both contribute 25)
- It penalizes large errors much more than small ones ($10^2 = 100$, but $2^2 = 4$)
import torch
import torch.nn as nn
# Predictions vs true values
predictions = torch.tensor([2.5, 0.0, 2.1, 7.8])
targets = torch.tensor([3.0, -0.5, 2.0, 8.0])
# Manual MSE calculation
errors = predictions - targets
print(f"Errors: {errors.tolist()}") # ≈ [-0.5, 0.5, 0.1, -0.2] (float32 rounding shows in the last digits)
print(f"Squared: {(errors**2).tolist()}") # ≈ [0.25, 0.25, 0.01, 0.04]
print(f"Mean: {(errors**2).mean().item():.4f}") # 0.1375
# PyTorch's built-in MSE
criterion = nn.MSELoss()
loss = criterion(predictions, targets)
print(f"nn.MSELoss: {loss.item():.4f}") # 0.1375 — same!
Cross-Entropy — For Classification
When you’re predicting a category (which word comes next, is this a cat or dog, what digit is this), cross-entropy is the standard loss.
The idea: your model outputs a probability distribution over possible classes (e.g., “60% cat, 30% dog, 10% bird”). Cross-entropy measures how different this distribution is from reality (which is “100% cat, 0% everything else”).
$$\text{Cross-Entropy} = -\sum_{c=1}^{C} y_c \log(\hat{y}_c)$$
Where $C$ is the number of classes, $y_c$ is 1 for the correct class and 0 for others, and $\hat{y}_c$ is the predicted probability for class $c$.
# Classification example: 3 classes (cat, dog, bird)
# Model outputs raw scores (logits), not probabilities
logits = torch.tensor([[2.0, 1.0, 0.1]]) # Raw model output
target = torch.tensor([0]) # True class: cat (index 0)
# Cross-entropy loss (handles softmax internally)
criterion = nn.CrossEntropyLoss()
loss = criterion(logits, target)
print(f"Cross-entropy loss: {loss.item():.4f}")
# Let's see what happens under the hood:
probabilities = torch.softmax(logits, dim=1)
print(f"Probabilities: {probabilities.data}")
# The model gives ~66% to cat, ~24% to dog, ~10% to bird
# Since the true label is cat, this isn't bad! Loss should be low.
# Now with a wrong prediction:
logits_wrong = torch.tensor([[0.1, 2.0, 1.0]]) # Model thinks it's a dog
loss_wrong = criterion(logits_wrong, target)
print(f"\nWrong prediction loss: {loss_wrong.item():.4f}")
probs_wrong = torch.softmax(logits_wrong, dim=1)
print(f"Wrong probabilities: {probs_wrong.data}")
# Higher loss when the model is confident about the wrong class
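To connect the code back to the formula: with a one-hot target, every term in the sum is zero except the one for the correct class, so the loss reduces to minus the log of the probability assigned to that class. A minimal sketch, reusing the same logits as above, that computes this by hand and compares it with nn.CrossEntropyLoss:

```python
import torch
import torch.nn as nn

logits = torch.tensor([[2.0, 1.0, 0.1]])  # same raw scores as above
target = torch.tensor([0])                # true class: cat (index 0)

# Step 1: softmax turns raw scores into probabilities
probs = torch.softmax(logits, dim=1)

# Step 2: with a one-hot target, the sum collapses to -log(p_correct)
manual_loss = -torch.log(probs[0, target[0]])

builtin_loss = nn.CrossEntropyLoss()(logits, target)
print(f"Manual:   {manual_loss.item():.4f}")
print(f"Built-in: {builtin_loss.item():.4f}")
```

Both numbers should agree up to floating-point rounding, which is why you pass raw logits to nn.CrossEntropyLoss rather than applying softmax yourself.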
Why cross-entropy for LLMs? When a language model predicts the next word, it’s choosing from a vocabulary of thousands of words — that’s a classification problem with thousands of classes. Cross-entropy loss penalizes the model when it assigns low probability to the correct next word.
# Simulating LLM next-word prediction
# Vocabulary: ["the", "cat", "sat", "on", "mat"]
vocab_size = 5
# Model's raw predictions for the next word
logits = torch.tensor([[1.5, 3.2, 0.5, 0.1, 0.8]]) # thinks "cat" is most likely
true_next_word = torch.tensor([1]) # correct answer is "cat" (index 1)
loss = nn.CrossEntropyLoss()(logits, true_next_word)
probs = torch.softmax(logits, dim=1)
print("Word probabilities:")
vocab = ["the", "cat", "sat", "on", "mat"]
for i, word in enumerate(vocab):
    marker = " ← correct" if i == 1 else ""
    print(f" {word}: {probs[0][i].item():.4f}{marker}")
print(f"\nLoss: {loss.item():.4f}")
Choosing the Right Loss Function
| Task | Loss Function | Output Activation |
|---|---|---|
| Predict a number (regression) | MSE | None (linear) |
| Binary yes/no | Binary Cross-Entropy | Sigmoid |
| Pick one of N classes | Cross-Entropy | Softmax (applied inside nn.CrossEntropyLoss) |
| Next word prediction (LLM) | Cross-Entropy | Softmax (applied inside nn.CrossEntropyLoss) |
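The binary row is the only one in the table without an example so far. Here is a small sketch with made-up numbers (the "spam" framing and the values are purely illustrative). It shows two equivalent ways to compute binary cross-entropy in PyTorch: applying sigmoid yourself and using nn.BCELoss, or passing raw scores to nn.BCEWithLogitsLoss, which applies the sigmoid internally.

```python
import torch
import torch.nn as nn

# Hypothetical raw scores for 4 emails; label 1.0 = spam, 0.0 = not spam
logits = torch.tensor([2.0, -1.0, 0.5, -3.0])
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])

# Option A: apply sigmoid yourself, then BCELoss on probabilities
probs = torch.sigmoid(logits)
loss_a = nn.BCELoss()(probs, labels)

# Option B: BCEWithLogitsLoss takes raw scores and applies sigmoid internally
loss_b = nn.BCEWithLogitsLoss()(logits, labels)

print(f"Sigmoid + BCELoss: {loss_a.item():.4f}")
print(f"BCEWithLogitsLoss: {loss_b.item():.4f}")
```

The two values should match up to rounding. Option B is usually the safer choice in practice because it is more numerically stable for very large or very small scores.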
8. Putting It All Together — A Complete Example
Let’s combine everything into one clean, well-commented example. We’ll build a two-layer neural network for house price prediction.
import torch
import torch.nn as nn
import torch.optim as optim
# ─── Step 1: Prepare Data ───
# Features: [size (100s sqft), bedrooms, bathrooms, age (decades)]
X_train = torch.tensor([
[10.0, 3.0, 2.0, 2.0], # 1000sqft, 3bed, 2bath, 20yr old
[15.0, 4.0, 2.5, 1.0], # 1500sqft, 4bed, 2.5bath, 10yr old
[20.0, 5.0, 3.0, 0.5], # 2000sqft, 5bed, 3bath, 5yr old
[8.0, 2.0, 1.0, 5.0], # 800sqft, 2bed, 1bath, 50yr old
[12.0, 3.0, 2.0, 3.0], # 1200sqft, 3bed, 2bath, 30yr old
[25.0, 6.0, 4.0, 0.0], # 2500sqft, 6bed, 4bath, new
[9.0, 2.0, 1.0, 4.0], # 900sqft, 2bed, 1bath, 40yr old
[18.0, 4.0, 3.0, 1.5], # 1800sqft, 4bed, 3bath, 15yr old
], dtype=torch.float32)
y_train = torch.tensor([
[200.0], [320.0], [450.0], [120.0],
[240.0], [550.0], [140.0], [380.0]
], dtype=torch.float32)
print(f"Training data: {X_train.shape[0]} houses, {X_train.shape[1]} features")
print(f"Target shape: {y_train.shape}")
# ─── Step 2: Define Model ───
class HousePriceNet(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.layer1 = nn.Linear(input_size, hidden_size)
        self.activation = nn.ReLU()
        self.layer2 = nn.Linear(hidden_size, output_size)
    def forward(self, x):
        # x shape: (batch_size, 4)
        x = self.layer1(x) # → (batch_size, hidden_size)
        x = self.activation(x) # → (batch_size, hidden_size)
        x = self.layer2(x) # → (batch_size, 1)
        return x
model = HousePriceNet(input_size=4, hidden_size=16, output_size=1)
# Count parameters
num_params = sum(p.numel() for p in model.parameters())
print(f"\nModel architecture:\n{model}")
print(f"Total parameters: {num_params}")
# layer1: 4*16 weights + 16 biases = 80
# layer2: 16*1 weights + 1 bias = 17
# Total: 97
# ─── Step 3: Setup Training ───
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.01) # Adam adapts the step size per parameter and often needs less tuning than plain SGD
# ─── Step 4: Training Loop ───
num_epochs = 2000
losses = []
print("\nTraining...")
print(f"{'Epoch':>6} | {'Loss':>10} | {'Avg Error ($k)':>14}")
print("-" * 36)
for epoch in range(num_epochs):
    # Forward pass
    predictions = model(X_train)
    loss = criterion(predictions, y_train)
    losses.append(loss.item())
    # Backward pass + update
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Logging
    if epoch % 400 == 0 or epoch == num_epochs - 1:
        avg_error = torch.abs(predictions - y_train).mean().item()
        print(f"{epoch:6d} | {loss.item():10.2f} | ${avg_error:12.2f}k")
# ─── Step 5: Evaluate ───
model.eval()
with torch.no_grad():
    final_preds = model(X_train)
print(f"\n{'House':>5} | {'Predicted':>10} | {'Actual':>8} | {'Error':>8}")
print("-" * 42)
for i in range(len(y_train)):
    pred = final_preds[i].item()
    actual = y_train[i].item()
    error = abs(pred - actual)
    print(f"{i+1:5d} | ${pred:8.1f}k | ${actual:6.1f}k | ${error:6.1f}k")
This complete example demonstrates every concept from this chapter:
- Tensors hold our data and model parameters
- Matrix multiplication (inside nn.Linear) transforms inputs
- ReLU activation adds non-linearity between layers
- MSE loss measures prediction quality
- Backpropagation (loss.backward()) computes all gradients
- Gradient descent (optimizer.step(), here in its Adam variant) updates weights to improve predictions
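To make the last bullet concrete, here is a minimal sketch of what a single optimizer.step() does in the simplest case. It uses plain optim.SGD rather than the Adam optimizer from the example, because Adam additionally rescales each step. The tiny model and random data are stand-ins chosen only for illustration.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
model = nn.Linear(3, 1)   # stand-in model
x = torch.randn(4, 3)     # stand-in inputs
y = torch.randn(4, 1)     # stand-in targets
lr = 0.1

loss = nn.MSELoss()(model(x), y)
loss.backward()           # fills p.grad for every parameter

# The plain gradient descent rule, computed by hand: w_new = w - lr * grad
expected = [p.detach() - lr * p.grad for p in model.parameters()]

optimizer = optim.SGD(model.parameters(), lr=lr)
optimizer.step()          # applies the update in place

same = all(torch.allclose(p.detach(), e)
           for p, e in zip(model.parameters(), expected))
print(f"optimizer.step() matches w - lr * grad: {same}")
```

More elaborate optimizers keep the same interface and change only how the step is computed from the gradients.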
9. Exercises
Test your understanding with these exercises. Try them on your own before looking at the solutions.
Exercise 1: Tensor Shapes
Predict the output shape for each operation, then verify with PyTorch:
A = torch.randn(3, 4)
B = torch.randn(4, 5)
C = torch.randn(3, 4)
v = torch.randn(4)
# What are the shapes of:
# a) A @ B
# b) A + C
# c) A * C
# d) A + v (broadcasting)
# e) A.reshape(6, 2)
# f) A.reshape(-1)
Exercise 2: Gradient Computation
Use PyTorch autograd to find the gradient of $f(x, y) = x^2 y + 3xy^2 - 2x + 5$ at the point $(x=2, y=3)$.
Compute it by hand first:
- $\frac{\partial f}{\partial x} = 2xy + 3y^2 - 2$
- $\frac{\partial f}{\partial y} = x^2 + 6xy$
Exercise 3: Manual Forward Pass
Given this network:
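- Input: [2.0, 3.0]
- Layer 1 weights: [[0.5, -0.3], [0.2, 0.8]] (shape 2×2)
- Layer 1 bias: [0.1, -0.1]
- Activation: ReLU
- Layer 2 weights: [[0.4], [0.6]] (shape 2×1)
- Layer 2 bias: [0.0]
Compute the output step by step on paper, then verify with PyTorch. Each row of the Layer 1 matrix holds the weights of one hidden neuron (the layout nn.Linear uses), while the Layer 2 matrix is a column that maps the two hidden values to a single output.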
Exercise 4: Build a Temperature Converter
Build a neural network that learns to convert Celsius to Fahrenheit ($F = 1.8C + 32$). Create training data, train the model, and show it can convert temperatures it hasn’t seen.
Exercise 5: Loss Function Comparison
Create a scenario where the model makes the same predictions, but compare MSE loss vs Mean Absolute Error (MAE = nn.L1Loss()). Which one penalizes outliers more?
Solutions
Solution 1: Tensor Shapes
A = torch.randn(3, 4)
B = torch.randn(4, 5)
C = torch.randn(3, 4)
v = torch.randn(4)
# a) A @ B → (3, 4) @ (4, 5) = (3, 5)
print(f"a) A @ B: {(A @ B).shape}") # torch.Size([3, 5])
# b) A + C → (3, 4) + (3, 4) = (3, 4) — same shapes, element-wise
print(f"b) A + C: {(A + C).shape}") # torch.Size([3, 4])
# c) A * C → (3, 4) * (3, 4) = (3, 4) — element-wise multiplication
print(f"c) A * C: {(A * C).shape}") # torch.Size([3, 4])
# d) A + v → (3, 4) + (4,) = (3, 4) — v broadcast across rows
print(f"d) A + v: {(A + v).shape}") # torch.Size([3, 4])
# e) A.reshape(6, 2) → 3*4 = 12 = 6*2 ✓
print(f"e) reshape(6, 2): {A.reshape(6, 2).shape}") # torch.Size([6, 2])
# f) A.reshape(-1) → flattened to (12,)
print(f"f) reshape(-1): {A.reshape(-1).shape}") # torch.Size([12])
Solution 2: Gradient Computation
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
f = x**2 * y + 3 * x * y**2 - 2 * x + 5
f.backward()
print(f"f(2, 3) = {f.item()}")
print(f"∂f/∂x = {x.grad.item()}") # 2*2*3 + 3*9 - 2 = 12 + 27 - 2 = 37
print(f"∂f/∂y = {y.grad.item()}") # 4 + 6*2*3 = 4 + 36 = 40
# Manual check:
# ∂f/∂x = 2xy + 3y² - 2 = 2(2)(3) + 3(9) - 2 = 12 + 27 - 2 = 37 ✓
# ∂f/∂y = x² + 6xy = 4 + 6(2)(3) = 4 + 36 = 40 ✓
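A third way to check a gradient, useful when you trust neither your algebra nor your autograd code, is to nudge each input slightly and measure how the output changes. The sketch below uses central differences in plain Python. The step size h is an arbitrary small value chosen for illustration.

```python
def f(x, y):
    return x**2 * y + 3 * x * y**2 - 2 * x + 5

x0, y0 = 2.0, 3.0
h = 1e-5  # small nudge; an illustrative choice

# Central differences: vary one variable at a time, hold the other fixed
df_dx = (f(x0 + h, y0) - f(x0 - h, y0)) / (2 * h)
df_dy = (f(x0, y0 + h) - f(x0, y0 - h)) / (2 * h)

print(f"Numerical ∂f/∂x ≈ {df_dx:.4f}")
print(f"Numerical ∂f/∂y ≈ {df_dy:.4f}")
```

The results should land very close to 37 and 40, matching both the hand calculation and autograd. This is the "slope" picture of a gradient made literal.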
Solution 3: Manual Forward Pass
import torch
import torch.nn.functional as F
# Input
x = torch.tensor([[2.0, 3.0]])
# Layer 1
W1 = torch.tensor([[0.5, -0.3], [0.2, 0.8]])
b1 = torch.tensor([0.1, -0.1])
# Step 1: Linear transform
z1 = x @ W1.T + b1
# [2.0, 3.0] @ [[0.5, 0.2], [-0.3, 0.8]] + [0.1, -0.1]
# = [2*0.5 + 3*(-0.3), 2*0.2 + 3*0.8] + [0.1, -0.1]
# = [1.0 - 0.9, 0.4 + 2.4] + [0.1, -0.1]
# = [0.1, 2.8] + [0.1, -0.1]
# = [0.2, 2.7]
print(f"After Layer 1 (linear): {z1.data}")
# Step 2: ReLU
a1 = F.relu(z1)
# [max(0, 0.2), max(0, 2.7)] = [0.2, 2.7]
print(f"After ReLU: {a1.data}")
# Layer 2
W2 = torch.tensor([[0.4], [0.6]])
b2 = torch.tensor([0.0])
# Step 3: Linear transform
z2 = a1 @ W2 + b2
# = [0.2*0.4 + 2.7*0.6] + 0.0
# = [0.08 + 1.62]
# = [1.70]
print(f"Output: {z2.data}")
# Verify with nn.Module
model = torch.nn.Sequential(
    torch.nn.Linear(2, 2),
    torch.nn.ReLU(),
    torch.nn.Linear(2, 1)
)
# Set weights manually
with torch.no_grad():
    model[0].weight.copy_(W1)
    model[0].bias.copy_(b1)
    model[2].weight.copy_(W2.T)  # nn.Linear stores weights as (out_features, in_features)
    model[2].bias.copy_(b2)
output = model(x)
print(f"nn.Module output: {output.data}") # Should match: 1.70
Solution 4: Temperature Converter
import torch
import torch.nn as nn
import torch.optim as optim
# F = 1.8 * C + 32
# Generate training data
torch.manual_seed(42)
C_train = torch.linspace(-40, 100, 50).reshape(-1, 1)
F_train = 1.8 * C_train + 32
# Simple linear model (this IS a linear relationship)
model = nn.Linear(1, 1)
criterion = nn.MSELoss()
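# The inputs are unscaled (-40 to 100), so plain SGD at a learning rate small enough
# to stay stable moves the bias toward 32 very slowly. Adam typically gets much closer
# in the same number of epochs.
optimizer = optim.Adam(model.parameters(), lr=0.1)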
# Train
for epoch in range(5000):
    pred = model(C_train)
    loss = criterion(pred, F_train)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if epoch % 1000 == 0:
        print(f"Epoch {epoch}: Loss = {loss.item():.6f}")
# Check learned parameters
w = model.weight.item()
b = model.bias.item()
print(f"\nLearned: F = {w:.4f} * C + {b:.4f}")
print(f"Actual: F = 1.8000 * C + 32.0000")
# Test on unseen temperatures
test_temps = torch.tensor([[0.0], [100.0], [37.0], [-10.0]])
with torch.no_grad():
    predictions = model(test_temps)
print(f"\nTest predictions:")
for i, c in enumerate(test_temps):
    actual = 1.8 * c.item() + 32
    print(f" {c.item():.0f}°C → predicted {predictions[i].item():.2f}°F "
          f"(actual: {actual:.1f}°F)")
Solution 5: Loss Function Comparison
import torch
import torch.nn as nn
predictions = torch.tensor([2.0, 8.0, 5.0, 10.0])
targets = torch.tensor([3.0, 7.0, 5.0, 3.0]) # Last one is an outlier!
errors = predictions - targets
print(f"Errors: {errors.tolist()}") # [-1.0, 1.0, 0.0, 7.0]
# MSE: squares the errors → outlier dominates
mse = nn.MSELoss()(predictions, targets)
print(f"MSE Loss: {mse.item():.2f}")
# = mean([1, 1, 0, 49]) = 51/4 = 12.75
# MAE: absolute values → outlier has proportional impact
mae = nn.L1Loss()(predictions, targets)
print(f"MAE Loss: {mae.item():.2f}")
# = mean([1, 1, 0, 7]) = 9/4 = 2.25
print(f"\nMSE/MAE ratio: {mse.item()/mae.item():.2f}")
print("MSE penalizes the outlier (error=7) MUCH more because 7²=49")
print("MAE treats it proportionally: error=7 contributes 7, not 49")
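The same effect shows up in the gradients, which are what actually drive training. A small sketch reusing the numbers above (the helper grad_wrt_predictions is ours, for illustration only): under MSE the gradient for each prediction grows with the size of its error, while under MAE every non-zero error receives a gradient of the same magnitude.

```python
import torch
import torch.nn as nn

targets = torch.tensor([3.0, 7.0, 5.0, 3.0])

def grad_wrt_predictions(criterion):
    # Fresh leaf tensor on each call so gradients do not accumulate
    preds = torch.tensor([2.0, 8.0, 5.0, 10.0], requires_grad=True)
    criterion(preds, targets).backward()
    return preds.grad

mse_grad = grad_wrt_predictions(nn.MSELoss())
mae_grad = grad_wrt_predictions(nn.L1Loss())

print(f"MSE gradient: {mse_grad.tolist()}")
print(f"MAE gradient: {mae_grad.tolist()}")
```

With MSE, the outlier's entry is several times larger than the others, so a single bad data point can pull the weights hard. With MAE, it pulls no harder than any other point with a non-zero error.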
Summary
In this chapter, you learned the building blocks of deep learning:
| Concept | What it is | Why it matters |
|---|---|---|
| Tensor | Multi-dimensional number container | All data and parameters in neural networks are tensors |
| Tensor Operations | Math on tensors (add, multiply, matmul) | Matrix multiplication is the core computation in neural networks |
| Gradient | Slope/direction of steepest change | Tells us how to adjust weights to reduce error |
| Backpropagation | Chain rule applied through the computation graph | Computes gradients for all parameters automatically |
| Neural Network | Layers of weighted sums + activations | Learns complex patterns from data |
| Activation Function | Non-linear function between layers | Enables learning non-linear relationships |
| Loss Function | Measures prediction quality | Provides the signal that drives learning |
The training loop you learned (forward → loss → backward → step) is, at its core, the same loop used to train GPT, BERT, and other large language models. The networks are far bigger and the data is different, but the core process is the same.
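As a compact reminder, the four steps can be wrapped in one reusable function. This is a minimal sketch, not a production training loop: the function name train_step and the toy data (learning y = 2x) are our own choices for illustration.

```python
import torch
import torch.nn as nn
import torch.optim as optim

def train_step(model, inputs, targets, criterion, optimizer):
    """One iteration: forward → loss → backward → step."""
    predictions = model(inputs)             # forward
    loss = criterion(predictions, targets)  # loss
    optimizer.zero_grad()                   # clear gradients from the last step
    loss.backward()                         # backward
    optimizer.step()                        # step
    return loss.item()

# Toy demo on made-up data: learn y = 2x
torch.manual_seed(0)
model = nn.Linear(1, 1)
inputs = torch.tensor([[1.0], [2.0], [3.0], [4.0]])
targets = 2.0 * inputs
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

history = [train_step(model, inputs, targets, criterion, optimizer)
           for _ in range(200)]
print(f"Loss: {history[0]:.4f} → {history[-1]:.4f}")
```

Swapping in a larger model, a different loss, or another optimizer changes the arguments but not the body of train_step.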
In the next chapter, we’ll apply these foundations to text — learning how to convert words into numbers that neural networks can process.