Foundations — Tensors, Gradients, and Neural Networks

Before we can build a language model, we need to speak the language that neural networks speak: numbers arranged in structured containers, flowing through mathematical operations, adjusting themselves to get better at a task. That’s the essence of deep learning, and this chapter will give you a solid, hands-on understanding of every piece.

By the end of this chapter, you will:

Understand what tensors are and how to manipulate them
Know what gradients are and why they matter for learning
Build a complete neural network from scratch in PyTorch
Understand activation functions and loss functions intuitively

Let’s begin.

1. What Is a Tensor?

The Container Analogy

Imagine you have different kinds of containers for holding numbers:

A single box that holds one number — say, the temperature today: 72. This is a scalar.
A row of boxes holding a list of numbers — say, temperatures for the week: [72, 68, 75, 71, 69, 74, 73]. This is a vector.
A grid of boxes (rows and columns) — say, temperatures for 4 weeks across 7 days. That’s a table with 4 rows and 7 columns. This is a matrix.
A cube of boxes — say, temperatures for 4 weeks, 7 days, measured at 3 different times each day. Now you have a 3D block. This is a 3D tensor.

A tensor is just a generalization of all of these. It’s a container that can hold numbers in any number of dimensions. The number of dimensions is called the tensor’s rank; PyTorch exposes it as the ndim attribute.

Name	Rank	Shape Example	Real-World Analogy
Scalar	0	`()`	A single temperature reading
Vector	1	`(7,)`	Temperatures for a week
Matrix	2	`(4, 7)`	Temperatures for 4 weeks × 7 days
3D	3	`(4, 7, 3)`	Weeks × Days × Times of day
4D	4	`(12, 4, 7, 3)`	Months × Weeks × Days × Times

In deep learning, tensors are everywhere. Your data is a tensor. Your model’s weights are tensors. The output is a tensor. Understanding them is non-negotiable.

Creating Tensors in PyTorch

Let’s make this concrete with code. If you haven’t installed PyTorch yet, run pip install torch in your terminal.

import torch

# Scalar (rank 0) — a single number
scalar = torch.tensor(42.0)
print(f"Scalar: {scalar}")
print(f"  Shape: {scalar.shape}")      # torch.Size([])
print(f"  Dimensions: {scalar.ndim}")  # 0
print()

# Vector (rank 1) — a list of numbers
vector = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0])
print(f"Vector: {vector}")
print(f"  Shape: {vector.shape}")      # torch.Size([5])
print(f"  Dimensions: {vector.ndim}")  # 1
print()

# Matrix (rank 2) — a grid of numbers
matrix = torch.tensor([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0]
])
print(f"Matrix:\n{matrix}")
print(f"  Shape: {matrix.shape}")      # torch.Size([2, 3])
print(f"  Dimensions: {matrix.ndim}")  # 2
print()

# 3D Tensor (rank 3) — a cube of numbers
tensor_3d = torch.tensor([
    [[1, 2], [3, 4], [5, 6]],
    [[7, 8], [9, 10], [11, 12]]
])
print(f"3D Tensor:\n{tensor_3d}")
print(f"  Shape: {tensor_3d.shape}")      # torch.Size([2, 3, 2])
print(f"  Dimensions: {tensor_3d.ndim}")  # 3

Output:

Scalar: tensor(42.)
  Shape: torch.Size([])
  Dimensions: 0

Vector: tensor([1., 2., 3., 4., 5.])
  Shape: torch.Size([5])
  Dimensions: 1

Matrix:
tensor([[1., 2., 3.],
        [4., 5., 6.]])
  Shape: torch.Size([2, 3])
  Dimensions: 2

3D Tensor:
tensor([[[ 1,  2],
         [ 3,  4],
         [ 5,  6]],

        [[ 7,  8],
         [ 9, 10],
         [11, 12]]])
  Shape: torch.Size([2, 3, 2])
  Dimensions: 3

Reading Tensor Shapes

The shape of a tensor tells you how many elements exist along each dimension. Learning to read shapes is one of the most important skills in deep learning.

For a tensor with shape (2, 3, 2):

The first dimension has size 2 (think: 2 “slabs”)
The second dimension has size 3 (think: 3 rows in each slab)
The third dimension has size 2 (think: 2 columns in each row)

Total number of elements: 2 × 3 × 2 = 12.
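As a quick check of the shape arithmetic above, the sketch below builds a (2, 3, 2) tensor, counts its elements, and picks out one value by index. The variable names are illustrative only.

```python
import torch

# Build a tensor of shape (2, 3, 2) holding the values 0..11
t = torch.arange(12).reshape(2, 3, 2)

# numel() returns the total number of elements: 2 * 3 * 2
total = t.numel()
print(f"Total elements: {total}")

# Index one element: slab 1, row 2, column 0
value = t[1, 2, 0].item()
print(f"t[1, 2, 0] = {value}")

# Taking a single slab leaves a (3, 2) matrix
slab = t[0]
print(f"Slab shape: {tuple(slab.shape)}")
```

Reading the index from left to right mirrors reading the shape: first choose a slab, then a row inside it, then a column inside that row.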

# Useful ways to create tensors
zeros = torch.zeros(3, 4)       # 3×4 matrix of zeros
ones = torch.ones(2, 5)         # 2×5 matrix of ones
random = torch.randn(3, 3)     # 3×3 matrix of random numbers (normal distribution)
sequence = torch.arange(0, 10)  # [0, 1, 2, ..., 9]

print(f"Zeros shape: {zeros.shape}")       # torch.Size([3, 4])
print(f"Ones shape: {ones.shape}")         # torch.Size([2, 5])
print(f"Random shape: {random.shape}")     # torch.Size([3, 3])
print(f"Sequence shape: {sequence.shape}") # torch.Size([10])

Reshaping Tensors

You’ll frequently need to change the shape of a tensor without changing its data. Think of it like rearranging the same 12 eggs from a single row into a 3×4 carton.

# Start with a flat vector of 12 numbers
flat = torch.arange(1, 13, dtype=torch.float32)
print(f"Flat: {flat}")
print(f"  Shape: {flat.shape}")  # torch.Size([12])

# Reshape to 3 rows × 4 columns
grid = flat.reshape(3, 4)
print(f"\nReshaped to (3, 4):\n{grid}")
print(f"  Shape: {grid.shape}")  # torch.Size([3, 4])

# Reshape to 2 × 2 × 3
cube = flat.reshape(2, 2, 3)
print(f"\nReshaped to (2, 2, 3):\n{cube}")
print(f"  Shape: {cube.shape}")  # torch.Size([2, 2, 3])

# Using -1 to let PyTorch figure out one dimension
auto = flat.reshape(4, -1)  # "Make it 4 rows, figure out the columns"
print(f"\nReshaped to (4, -1):\n{auto}")
print(f"  Shape: {auto.shape}")  # torch.Size([4, 3])

Key rule: The total number of elements must stay the same. You can reshape a (12,) tensor into (3, 4), (4, 3), (2, 6), (2, 2, 3), etc. — but not into (3, 5) because 3 × 5 = 15 ≠ 12.
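To see the rule enforced, here is a small sketch that tries one valid reshape, one inferred dimension, and one invalid target shape. It assumes only that PyTorch raises a RuntimeError when the element counts do not match.

```python
import torch

flat = torch.arange(1, 13, dtype=torch.float32)  # 12 elements

# Valid: 2 * 6 = 12
ok = flat.reshape(2, 6)
print(f"OK shape: {tuple(ok.shape)}")

# -1 is inferred as 12 / 3 = 4
inferred = flat.reshape(3, -1)
print(f"Inferred shape: {tuple(inferred.shape)}")

# Invalid: 3 * 5 = 15 elements would be needed, but only 12 exist
try:
    flat.reshape(3, 5)
    failed = False
except RuntimeError as err:
    failed = True
    print(f"Reshape failed: {err}")
```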

2. Tensor Operations

Now that we know what tensors are, let’s learn how to do math with them.

Element-wise Operations

The simplest operations work element by element — each number in one tensor pairs with the corresponding number in the other tensor.

a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])

# Addition: each element adds to its partner
c = a + b
print(f"a + b = {c}")  # tensor([5., 7., 9.])

# Subtraction
print(f"a - b = {a - b}")  # tensor([-3., -3., -3.])

# Element-wise multiplication (NOT matrix multiplication!)
print(f"a * b = {a * b}")  # tensor([ 4., 10., 18.])

# Element-wise division
print(f"a / b = {a / b}")  # tensor([0.2500, 0.4000, 0.5000])

# Squaring each element
print(f"a ** 2 = {a ** 2}")  # tensor([1., 4., 9.])

Notice that for element-wise operations, both tensors must have the same shape (with some exceptions — see Broadcasting below).

# Works with matrices too
A = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
B = torch.tensor([[5.0, 6.0], [7.0, 8.0]])

print(f"A:\n{A}")
print(f"B:\n{B}")
print(f"A + B:\n{A + B}")
# tensor([[ 6.,  8.],
#         [10., 12.]])

Matrix Multiplication — The Most Important Operation

Matrix multiplication is the core operation in neural networks. Let’s build intuition step by step.

The rule: To multiply matrix A (shape m × n) by matrix B (shape n × p), the number of columns in A must equal the number of rows in B. The result has shape m × p.

A (m × n)  @  B (n × p)  =  C (m × p)
              ↑
    These must match!

Visual walkthrough with concrete numbers:

# A is 2×3 (2 rows, 3 columns)
A = torch.tensor([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0]
])

# B is 3×2 (3 rows, 2 columns)
B = torch.tensor([
    [7.0,  8.0],
    [9.0,  10.0],
    [11.0, 12.0]
])

print(f"A shape: {A.shape}")  # torch.Size([2, 3])
print(f"B shape: {B.shape}")  # torch.Size([3, 2])

# Matrix multiplication: A (2×3) @ B (3×2) = C (2×2)
C = A @ B  # The @ operator does matrix multiplication
print(f"\nA @ B:\n{C}")
print(f"Result shape: {C.shape}")  # torch.Size([2, 2])

How is each element computed? Each element C[i][j] is the dot product of row i of A with column j of B:

C[0][0] = (1×7) + (2×9) + (3×11) = 7 + 18 + 33 = 58
C[0][1] = (1×8) + (2×10) + (3×12) = 8 + 20 + 36 = 64
C[1][0] = (4×7) + (5×9) + (6×11) = 28 + 45 + 66 = 139
C[1][1] = (4×8) + (5×10) + (6×12) = 32 + 50 + 72 = 154

Let’s verify:

# Manual verification
print(f"C[0,0] = 1*7 + 2*9 + 3*11 = {1*7 + 2*9 + 3*11}")   # 58
print(f"C[0,1] = 1*8 + 2*10 + 3*12 = {1*8 + 2*10 + 3*12}")  # 64
print(f"C[1,0] = 4*7 + 5*9 + 6*11 = {4*7 + 5*9 + 6*11}")    # 139
print(f"C[1,1] = 4*8 + 5*10 + 6*12 = {4*8 + 5*10 + 6*12}")  # 154
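The row-times-column rule can also be written out as three nested loops. The following is a plain-Python sketch for illustration (the matmul helper is hypothetical and far slower than the @ operator); it uses the same numbers as the walkthrough above.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    m, n = len(A), len(A[0])
    n_b, p = len(B), len(B[0])
    if n != n_b:
        raise ValueError(f"Inner dimensions differ: {n} vs {n_b}")

    C = [[0.0] * p for _ in range(m)]
    for i in range(m):          # each row of A
        for j in range(p):      # each column of B
            # Dot product of row i of A with column j of B
            C[i][j] = sum(A[i][k] * B[k][j] for k in range(n))
    return C

A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
B = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
C = matmul(A, B)
print(C)  # [[58.0, 64.0], [139.0, 154.0]]
```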

Why does this matter for neural networks? In a neural network, the input data is a matrix and the weights are another matrix. The forward pass is essentially a series of matrix multiplications. When we say “a layer with 128 neurons takes a 64-dimensional input,” we mean a (batch_size × 64) tensor is multiplied by a (64 × 128) weight matrix to produce a (batch_size × 128) output.

# Simulating a neural network layer:
batch_size = 4
input_features = 3
output_features = 5

# Random input data: 4 samples, each with 3 features
inputs = torch.randn(batch_size, input_features)
print(f"Input shape: {inputs.shape}")  # torch.Size([4, 3])

# Weight matrix: transforms 3 features → 5 features
weights = torch.randn(input_features, output_features)
print(f"Weights shape: {weights.shape}")  # torch.Size([3, 5])

# Forward pass = matrix multiplication
output = inputs @ weights
print(f"Output shape: {output.shape}")  # torch.Size([4, 5])
# 4 samples in, 4 samples out. Each now has 5 features instead of 3.

Broadcasting — When Shapes Don’t Quite Match

Sometimes you want to add a single number to every element of a tensor, or add a vector to every row of a matrix. PyTorch handles this automatically through broadcasting.

The idea: when two tensors have different shapes, PyTorch “stretches” the smaller tensor to match the larger one, if certain rules are met.

# Scalar + Vector: the scalar is "broadcast" across all elements
a = torch.tensor([1.0, 2.0, 3.0])
result = a + 10
print(f"[1, 2, 3] + 10 = {result}")  # tensor([11., 12., 13.])

# Vector + Matrix: the vector is broadcast across all rows
matrix = torch.tensor([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0]
])
bias = torch.tensor([10.0, 20.0, 30.0])

result = matrix + bias
print(f"\nMatrix + bias vector:")
print(result)
# tensor([[11., 22., 33.],
#         [14., 25., 36.]])
# The bias [10, 20, 30] was added to EACH row

Broadcasting rules (simplified):

Compare shapes from right to left
Dimensions either match, or one of them is 1 (which gets stretched)
If one tensor has fewer dimensions, it’s padded with 1s on the left

# Shape tracking examples
A = torch.randn(4, 3)    # Shape: (4, 3)
b = torch.randn(3)       # Shape:    (3)  →  broadcast to (4, 3)
C = A + b                 # Result: (4, 3) ✓

print(f"A {A.shape} + b {b.shape} = C {C.shape}")

# Another example
X = torch.randn(2, 3, 4)  # Shape: (2, 3, 4)
Y = torch.randn(1, 1, 4)  # Shape: (1, 1, 4) → broadcast to (2, 3, 4)
Z = X + Y                  # Result: (2, 3, 4) ✓

print(f"X {X.shape} + Y {Y.shape} = Z {Z.shape}")

# This would FAIL:
# A = torch.randn(4, 3)
# b = torch.randn(4)
# C = A + b  # ERROR! 3 ≠ 4 when comparing from the right
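The simplified rules can be turned into a few lines of plain Python. This is a sketch of the rules as stated here, not PyTorch's actual implementation, and broadcast_shape is a hypothetical helper name.

```python
def broadcast_shape(shape_a, shape_b):
    """Return the broadcast result shape, or raise ValueError if incompatible."""
    # Pad the shorter shape with 1s on the left so both have the same length
    ndim = max(len(shape_a), len(shape_b))
    a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)

    result = []
    # After padding, the dimensions line up position by position
    for dim_a, dim_b in zip(a, b):
        if dim_a == dim_b or dim_b == 1:
            result.append(dim_a)
        elif dim_a == 1:
            result.append(dim_b)   # the size-1 dimension gets stretched
        else:
            raise ValueError(f"Cannot broadcast {shape_a} with {shape_b}")
    return tuple(result)

print(broadcast_shape((4, 3), (3,)))          # (4, 3)
print(broadcast_shape((2, 3, 4), (1, 1, 4)))  # (2, 3, 4)
```

Calling broadcast_shape((4, 3), (4,)) raises a ValueError, matching the failing case in the commented-out example.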

Broadcasting is crucial because neural networks constantly add bias vectors to the output of matrix multiplications. The bias has shape (output_features,) and the matrix multiplication output has shape (batch_size, output_features). Broadcasting makes this work seamlessly.

3. What Is a Gradient?

The Hill-Rolling Analogy

Imagine you’re blindfolded, standing on a hilly landscape, and your goal is to reach the lowest valley. You can’t see, but you can feel the slope under your feet. If the ground tilts to the left, you step left. If it tilts forward, you step forward. You always move in the direction of the steepest descent.

This is gradient descent — the core algorithm behind training neural networks.

The landscape is the loss function (how wrong your predictions are)
Your position is the current values of the weights
The slope under your feet is the gradient
Taking a step downhill is updating the weights to reduce the loss

The gradient tells you two things:

Which direction to move each weight (increase it or decrease it?)
How much to move (steep slope = big step, gentle slope = small step)

Derivatives — The Slope of a Curve

If you remember one thing from calculus, let it be this: a derivative is the slope of a curve at a specific point.

Consider the function $f(x) = x^2$:

At $x = 3$: the slope (derivative) is $2 \times 3 = 6$. The curve is climbing steeply upward.
At $x = 1$: the slope is $2 \times 1 = 2$. Still climbing, but more gently.
At $x = 0$: the slope is $2 \times 0 = 0$. Flat! This is the bottom of the valley — the minimum.
At $x = -2$: the slope is $2 \times (-2) = -4$. A negative slope means the curve is falling as $x$ increases (moving to the right).

If we want to minimize $f(x) = x^2$, we move in the opposite direction of the gradient:

At $x = 3$, gradient is 6 (positive), so we decrease $x$.
At $x = -2$, gradient is -4 (negative), so we increase $x$.
Either way, we move toward $x = 0$, the minimum.
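If the rule "the derivative of $x^2$ is $2x$" feels abstract, the slope can be estimated numerically by nudging $x$ a little in both directions. The sketch below uses a central difference; the helper names and the step size h are illustrative choices.

```python
def numerical_derivative(f, x, h=1e-5):
    """Approximate f'(x) with a central difference."""
    return (f(x + h) - f(x - h)) / (2 * h)

def square(x):
    return x ** 2

for x in [3.0, 1.0, 0.0, -2.0]:
    slope = numerical_derivative(square, x)
    print(f"x = {x:5.1f}  slope ≈ {slope:.4f}  (exact: {2 * x:.1f})")
```

The estimates agree with the values listed above: about 6, 2, 0, and -4.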

PyTorch Autograd — Automatic Gradient Computation

PyTorch can compute gradients automatically. You just need to tell it which tensors to track.

import torch

# Create a tensor and tell PyTorch to track gradients
x = torch.tensor(3.0, requires_grad=True)
print(f"x = {x}")

# Define a function: f(x) = x²
f = x ** 2
print(f"f(x) = x² = {f}")

# Compute the gradient (derivative) of f with respect to x
f.backward()

# The gradient is stored in x.grad
print(f"df/dx at x=3: {x.grad}")  # Should be 2*3 = 6.0

Output:

x = tensor(3., requires_grad=True)
f(x) = x² = tensor(9., grad_fn=<PowBackward0>)
df/dx at x=3: tensor(6.)

Let’s try a more complex function: $f(x) = 3x^3 + 2x^2 - 5x + 7$

The derivative is: $f'(x) = 9x^2 + 4x - 5$

x = torch.tensor(2.0, requires_grad=True)

f = 3 * x**3 + 2 * x**2 - 5 * x + 7
print(f"f(2) = {f.item()}")  # 3*8 + 2*4 - 10 + 7 = 24 + 8 - 10 + 7 = 29

f.backward()
print(f"f'(2) = {x.grad.item()}")  # 9*4 + 4*2 - 5 = 36 + 8 - 5 = 39

Now with multiple variables — this is what happens in real neural networks (many weights, one loss):

# Two parameters
w = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

# A simple function of both
# f(w, b) = 3w² + 2wb + b²
f = 3 * w**2 + 2 * w * b + b**2

f.backward()

# Partial derivatives:
# ∂f/∂w = 6w + 2b = 6*2 + 2*1 = 14
# ∂f/∂b = 2w + 2b = 2*2 + 2*1 = 6
print(f"∂f/∂w = {w.grad.item()}")  # 14.0
print(f"∂f/∂b = {b.grad.item()}")  # 6.0

The gradient with respect to w tells us: “If you increase w slightly, f will increase by about 14 times that amount.” Since we want to decrease f (minimize loss), we should decrease w.
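That reading of the gradient can be checked without autograd by nudging one parameter at a time and watching how much f moves. This is a plain-Python sketch; the nudge size eps is an arbitrary small number chosen for illustration.

```python
def f(w, b):
    return 3 * w**2 + 2 * w * b + b**2

w, b = 2.0, 1.0
eps = 1e-3  # a small nudge

# Change in f when only w moves, and when only b moves
change_w = f(w + eps, b) - f(w, b)
change_b = f(w, b + eps) - f(w, b)

ratio_w = change_w / eps   # close to the partial derivative 14
ratio_b = change_b / eps   # close to the partial derivative 6
print(f"Δf/Δw ≈ {ratio_w:.3f}")
print(f"Δf/Δb ≈ {ratio_b:.3f}")
```

The ratios are not exactly 14 and 6 because the nudge has a finite size; they approach the partial derivatives as eps shrinks.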

Gradient Descent in Action

Let’s find the minimum of $f(x) = (x - 5)^2$. We know the answer is $x = 5$, but let’s watch gradient descent discover it:

# Start at x = 0, well away from the minimum at x = 5
x = torch.tensor(0.0, requires_grad=True)
learning_rate = 0.1

print("Step | x       | f(x)    | gradient")
print("-----|---------|---------|--------")

for step in range(20):
    # Forward: compute f(x)
    f = (x - 5) ** 2

    # Backward: compute gradient
    f.backward()

    # Print current state
    if step < 10 or step % 5 == 0:
        print(f"  {step:2d} | {x.item():7.4f} | {f.item():7.4f} | {x.grad.item():7.4f}")

    # Update x: move in the opposite direction of the gradient
    with torch.no_grad():  # Don't track this operation
        x -= learning_rate * x.grad

    # Reset gradient for next iteration
    x.grad.zero_()

print(f"\nFinal x = {x.item():.4f} (target: 5.0)")

Output:

Step | x       | f(x)    | gradient
-----|---------|---------|--------
   0 |  0.0000 | 25.0000 | -10.0000
   1 |  1.0000 | 16.0000 |  -8.0000
   2 |  1.8000 | 10.2400 |  -6.4000
   3 |  2.4400 |  6.5536 |  -5.1200
   4 |  2.9520 |  4.1943 |  -4.0960
   5 |  3.3616 |  2.6844 |  -3.2768
   6 |  3.6893 |  1.7180 |  -2.6214
   7 |  3.9514 |  1.0995 |  -2.0972
   8 |  4.1612 |  0.7037 |  -1.6777
   9 |  4.3289 |  0.4504 |  -1.3422
  10 |  4.4631 |  0.2882 |  -1.0737
  15 |  4.8241 |  0.0309 |  -0.3518

Final x = 4.9424 (target: 5.0)

Watch how:

The gradient starts large (-10) when we’re far from the minimum
It shrinks as we approach the target
x converges toward 5.0 — the minimum of $(x-5)^2$

This is exactly how neural networks learn! Replace x with millions of weights, and $(x-5)^2$ with a complex loss function, and you have the same process.
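The same loop can be written without PyTorch, which makes it easy to experiment with the learning rate. In the sketch below the gradient $2(x - 5)$ is coded by hand, and descend is a hypothetical helper. With a learning rate of 0.1 each step multiplies the remaining distance to 5 by 0.8, so after 20 steps x should equal $5 - 5 \times 0.8^{20}$.

```python
def descend(learning_rate, steps, start=0.0):
    """Run gradient descent on f(x) = (x - 5)^2 and return the final x."""
    x = start
    for _ in range(steps):
        gradient = 2 * (x - 5)            # derivative of (x - 5)^2
        x = x - learning_rate * gradient  # step against the gradient
    return x

final = descend(learning_rate=0.1, steps=20)
closed_form = 5 - 5 * 0.8 ** 20  # distance to 5 shrinks by a factor of 0.8 per step
print(f"After 20 steps: {final:.4f} (closed form: {closed_form:.4f})")

# A step size that is too large overshoots further on every update
diverged = descend(learning_rate=1.1, steps=20)
print(f"With learning_rate=1.1: {diverged:.1f}")
```

The second call shows why the learning rate matters: at 1.1 every update lands farther from 5 than the previous one, so the process diverges instead of converging.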

4. Backpropagation

The Chain of Dominoes

Imagine a chain of dominoes. You push the first one, it hits the second, which hits the third, and so on. Each domino’s fall is caused by the one before it.

Backpropagation works the same way, but in reverse. We start at the end (the loss), and trace backward through every computation to figure out how each weight contributed to the error.

Mathematically, this is the chain rule from calculus. If $z$ depends on $y$, and $y$ depends on $x$, then:

$$\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}$$

In words: “How much does $z$ change when $x$ changes?” equals “How much does $z$ change when $y$ changes?” times “How much does $y$ change when $x$ changes?”

A Computation Graph Example

Let’s trace through a concrete example. Consider this computation:

x = 2
w = 3
b = 1

y = w * x      (y = 6)
z = y + b      (z = 7)
L = z²         (L = 49)     ← This is our "loss"

We want to find how L changes with respect to w (so we can update w to reduce L).

Forward pass (compute left to right):

x=2, w=3, b=1  →  y = w*x = 6  →  z = y+b = 7  →  L = z² = 49

Backward pass (compute right to left using chain rule):

dL/dz = 2z = 2*7 = 14           (How does L change with z?)
dz/dy = 1                        (How does z change with y?)
dy/dw = x = 2                    (How does y change with w?)

dL/dw = dL/dz × dz/dy × dy/dw   (Chain rule!)
      = 14 × 1 × 2
      = 28

Let’s verify with PyTorch:

x = torch.tensor(2.0)
w = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

# Forward pass
y = w * x       # y = 6
z = y + b       # z = 7
L = z ** 2      # L = 49

print(f"Forward pass: y={y.item()}, z={z.item()}, L={L.item()}")

# Backward pass
L.backward()

print(f"dL/dw = {w.grad.item()}")  # 28.0
print(f"dL/db = {b.grad.item()}")  # 14.0

# Manual verification:
# dL/db = dL/dz × dz/db = 14 × 1 = 14 ✓
# dL/dw = dL/dz × dz/dy × dy/dw = 14 × 1 × 2 = 28 ✓

Output:

Forward pass: y=6.0, z=7.0, L=49.0
dL/dw = 28.0
dL/db = 14.0
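For comparison, here is the same backward pass written by hand in plain Python, followed by a finite-difference check that does not rely on the chain rule. It is a sketch using the numbers from this example; the helper loss_for and the step eps are illustrative.

```python
def loss_for(w, x=2.0, b=1.0):
    """Forward pass only: L = (w * x + b)^2."""
    return (w * x + b) ** 2

x, w, b = 2.0, 3.0, 1.0

# Forward pass, keeping the intermediate values
y = w * x        # 6
z = y + b        # 7
L = z ** 2       # 49

# Backward pass: local derivatives, multiplied along the chain
dL_dz = 2 * z    # 14
dz_dy = 1.0
dz_db = 1.0
dy_dw = x        # 2

dL_dw = dL_dz * dz_dy * dy_dw   # 28
dL_db = dL_dz * dz_db           # 14

# Independent check: nudge w and measure how L responds
eps = 1e-6
numeric_dL_dw = (loss_for(w + eps) - loss_for(w - eps)) / (2 * eps)

print(f"Chain rule: dL/dw = {dL_dw}, dL/db = {dL_db}")
print(f"Numerical:  dL/dw ≈ {numeric_dL_dw:.4f}")
```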

Why This Matters

In a real neural network with millions of parameters, we can’t compute gradients by hand. PyTorch builds a computation graph during the forward pass, recording every operation. Then .backward() traverses this graph in reverse, applying the chain rule at every step to compute every gradient automatically.

This is what makes deep learning practical — you define the forward computation, and PyTorch handles the backward computation for free.

# PyTorch tracks the computation graph
x = torch.tensor(3.0, requires_grad=True)

# Each operation creates a node in the graph
a = x * 2         # MulBackward
b = a + 5          # AddBackward
c = b ** 3         # PowBackward
d = torch.sin(c)   # SinBackward

print(f"d.grad_fn = {d.grad_fn}")  # Shows the last operation
# You can walk back through the graph:
print(f"  → {d.grad_fn.next_functions[0][0]}")  # PowBackward
print(f"    → {d.grad_fn.next_functions[0][0].next_functions[0][0]}")  # AddBackward

d.backward()
print(f"\nGradient of d with respect to x: {x.grad.item():.4f}")

The graph is built dynamically: every time you run your code, a fresh graph is created. This means you can use if statements, loops, and any Python logic in your forward pass, and PyTorch will still compute correct gradients. This design is known as a dynamic computation graph, and it is one of PyTorch’s strengths.
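A small sketch of what that flexibility looks like: the hypothetical function below takes a different branch depending on the input value, and autograd differentiates whichever branch actually ran.

```python
import torch

def piecewise(x):
    # Ordinary Python control flow decides which operations get recorded
    if x.item() > 0:
        return x ** 2      # derivative: 2x
    return -3 * x          # derivative: -3

x_pos = torch.tensor(2.0, requires_grad=True)
piecewise(x_pos).backward()

x_neg = torch.tensor(-1.0, requires_grad=True)
piecewise(x_neg).backward()

grad_pos = x_pos.grad.item()
grad_neg = x_neg.grad.item()
print(f"Gradient at x =  2: {grad_pos}")  # 4.0, from the x**2 branch
print(f"Gradient at x = -1: {grad_neg}")  # -3.0, from the -3*x branch
```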

5. Your First Neural Network

Now we’ll combine everything to build a real (tiny) neural network. We’ll predict house prices based on two features: size (in hundreds of square feet) and number of bedrooms.

What Is a Neuron?

A single neuron does three things:

Multiply each input by a weight (importance factor)
Sum all the weighted inputs plus a bias
Apply an activation function (introduces non-linearity)

inputs: [x₁, x₂]
weights: [w₁, w₂]
bias: b

output = activation(w₁·x₁ + w₂·x₂ + b)

Think of it like a tiny decision-maker. The weights control how much it cares about each input. The bias shifts its baseline. The activation function adds flexibility.
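The three steps fit in a few lines of plain Python. This sketch assumes ReLU (covered in Section 6) as the activation, and the input, weight, and bias values are made up for illustration.

```python
def relu(value):
    """Keep positive values, replace negative values with zero."""
    return max(0.0, value)

def neuron(inputs, weights, bias, activation=relu):
    # Steps 1 and 2: multiply each input by its weight, sum, and add the bias
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Step 3: apply the activation function
    return activation(weighted_sum)

inputs = [1.0, 2.0]

# 0.5*1.0 + 1.0*2.0 + 0.25 = 2.75, which ReLU passes through
out_a = neuron(inputs, weights=[0.5, 1.0], bias=0.25)

# 0.5*1.0 - 1.0*2.0 + 0.25 = -1.25, which ReLU clips to 0
out_b = neuron(inputs, weights=[0.5, -1.0], bias=0.25)

print(out_a, out_b)  # 2.75 0.0
```

Changing a single weight from 1.0 to -1.0 flips the neuron from active to silent, which is what "how much it cares about each input" means in practice.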

Building It From Scratch (No PyTorch nn)

Let’s build a network to predict house prices. Two inputs (size, bedrooms) → one output (price in thousands of dollars).

import torch

# ─── Training Data ───
# [size (hundreds of sq ft), bedrooms] → price ($1000s)
X = torch.tensor([
    [10.0, 3.0],   # 1000 sqft, 3 bedrooms → $200k
    [15.0, 4.0],   # 1500 sqft, 4 bedrooms → $300k
    [20.0, 5.0],   # 2000 sqft, 5 bedrooms → $400k
    [8.0,  2.0],   # 800 sqft, 2 bedrooms  → $150k
    [12.0, 3.0],   # 1200 sqft, 3 bedrooms → $250k
], dtype=torch.float32)

y_true = torch.tensor([
    [200.0],
    [300.0],
    [400.0],
    [150.0],
    [250.0],
], dtype=torch.float32)

print(f"Input shape: {X.shape}")    # (5, 2) — 5 houses, 2 features each
print(f"Target shape: {y_true.shape}")  # (5, 1) — 5 prices

Now let’s define our weights and perform a forward pass manually:

# Initialize weights randomly
torch.manual_seed(42)  # For reproducibility

# Layer: 2 inputs → 1 output
# Weight matrix: (2, 1) — one weight per input feature
# Bias: (1,) — one bias for the output
w = torch.randn(2, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

print(f"Initial weights:\n{w}")
print(f"Initial bias: {b}")
print(f"Weight shape: {w.shape}")  # (2, 1)
print(f"Bias shape: {b.shape}")    # (1,)

The Forward Pass

# Forward pass: predictions = X @ w + b
# X shape: (5, 2)
# w shape: (2, 1)
# X @ w shape: (5, 1) — matrix multiplication
# b shape: (1,) — broadcast to (5, 1)
# predictions shape: (5, 1)

predictions = X @ w + b
print(f"\nPredictions shape: {predictions.shape}")
print(f"Predictions:\n{predictions.data}")
print(f"True values:\n{y_true}")

The predictions are random garbage right now — that’s expected! The weights are random.

Loss Calculation

We need a number that tells us how wrong our predictions are. We’ll use Mean Squared Error (MSE): the average of the squared differences between predictions and true values.

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$

# Calculate loss (Mean Squared Error)
errors = predictions - y_true       # Shape: (5, 1)
squared_errors = errors ** 2        # Shape: (5, 1)
loss = squared_errors.mean()        # Shape: scalar

print(f"\nErrors:\n{errors.data}")
print(f"Squared errors:\n{squared_errors.data}")
print(f"Loss (MSE): {loss.item():.2f}")

Backward Pass and Weight Update

# Backward pass: compute gradients
loss.backward()

print(f"\nGradients:")
print(f"  dL/dw = {w.grad.data}")
print(f"  dL/db = {b.grad.data}")

# Update weights: move in the opposite direction of the gradient
learning_rate = 0.0001  # Small steps — house prices are large numbers

with torch.no_grad():
    w -= learning_rate * w.grad
    b -= learning_rate * b.grad

# Clear gradients for next iteration
w.grad.zero_()
b.grad.zero_()

# Check new predictions
new_predictions = X @ w + b
new_loss = ((new_predictions - y_true) ** 2).mean()
print(f"\nLoss before: {loss.item():.2f}")
print(f"Loss after:  {new_loss.item():.2f}")
print(f"Improvement: {loss.item() - new_loss.item():.2f}")

The loss should decrease! Now let’s run this for many iterations:

Training Loop

# Full training loop
torch.manual_seed(42)
w = torch.randn(2, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
learning_rate = 0.0001
num_epochs = 1000

print("Epoch | Loss")
print("------|--------")

for epoch in range(num_epochs):
    # Forward pass
    predictions = X @ w + b
    loss = ((predictions - y_true) ** 2).mean()

    # Backward pass
    loss.backward()

    # Update weights
    with torch.no_grad():
        w -= learning_rate * w.grad
        b -= learning_rate * b.grad

    # Clear gradients
    w.grad.zero_()
    b.grad.zero_()

    # Print every 100 epochs
    if epoch % 100 == 0 or epoch == num_epochs - 1:
        print(f"  {epoch:4d} | {loss.item():.4f}")

print(f"\nFinal weights: {w.data.flatten().tolist()}")
print(f"Final bias: {b.item():.4f}")

# Test predictions
final_predictions = X @ w + b
print(f"\nPredictions vs True values:")
for i in range(len(y_true)):
    print(f"  House {i+1}: predicted ${final_predictions[i].item():.1f}k, "
          f"actual ${y_true[i].item():.1f}k")

Using PyTorch nn.Module (The Real Way)

In practice, you rarely write weight updates by hand. nn.Module keeps track of the parameters, and an optimizer from torch.optim applies the updates. Here’s the same network, the proper way:

import torch
import torch.nn as nn
import torch.optim as optim

# Define the model as a class
class HousePriceModel(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.Linear handles weights + bias automatically
        # Input: 2 features, Output: 1 prediction
        self.linear = nn.Linear(2, 1)

    def forward(self, x):
        return self.linear(x)

# Create model, loss function, and optimizer
model = HousePriceModel()
criterion = nn.MSELoss()                 # Mean Squared Error
optimizer = optim.SGD(model.parameters(), lr=0.0001)  # Stochastic Gradient Descent

# Print model architecture
print(model)
print(f"\nModel parameters:")
for name, param in model.named_parameters():
    print(f"  {name}: shape={param.shape}, values={param.data}")

Output:

HousePriceModel(
  (linear): Linear(in_features=2, out_features=1, bias=True)
)

Model parameters:
  linear.weight: shape=torch.Size([1, 2]), values=tensor([...])
  linear.bias: shape=torch.Size([1]), values=tensor([...])

# Training loop with nn.Module
print("Epoch | Loss")
print("------|--------")

for epoch in range(1000):
    # Forward pass
    predictions = model(X)            # Calls model.forward(X)
    loss = criterion(predictions, y_true)

    # Backward pass
    optimizer.zero_grad()              # Clear old gradients
    loss.backward()                    # Compute new gradients
    optimizer.step()                   # Update weights

    if epoch % 100 == 0 or epoch == 999:
        print(f"  {epoch:4d} | {loss.item():.4f}")

# Final predictions
model.eval()  # Switch to evaluation mode
with torch.no_grad():
    final_preds = model(X)
    print(f"\nFinal predictions:")
    for i in range(len(y_true)):
        print(f"  House {i+1}: ${final_preds[i].item():.1f}k "
              f"(actual: ${y_true[i].item():.1f}k)")

Notice how much cleaner this is:

nn.Linear(2, 1) creates the weight matrix and bias automatically
optimizer.zero_grad() clears gradients
loss.backward() computes all gradients
optimizer.step() updates all weights

This pattern — forward → loss → backward → step — is the heartbeat of all neural network training.

Adding a Hidden Layer

A single linear layer can only learn linear relationships (straight lines, or flat planes when there are several inputs). To learn complex patterns, we stack multiple layers with activation functions in between:

class HousePriceModelV2(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(2, 8)    # 2 inputs → 8 hidden neurons
        self.relu = nn.ReLU()             # Activation function
        self.layer2 = nn.Linear(8, 1)    # 8 hidden → 1 output

    def forward(self, x):
        x = self.layer1(x)   # Shape: (batch, 2) → (batch, 8)
        x = self.relu(x)     # Shape: (batch, 8) — same shape, values changed
        x = self.layer2(x)   # Shape: (batch, 8) → (batch, 1)
        return x

model_v2 = HousePriceModelV2()
print(model_v2)

# Count parameters
total_params = sum(p.numel() for p in model_v2.parameters())
print(f"\nTotal parameters: {total_params}")
# layer1: 2*8 weights + 8 biases = 24
# layer2: 8*1 weights + 1 bias = 9
# Total: 33 parameters

# Train the deeper model
optimizer_v2 = optim.SGD(model_v2.parameters(), lr=0.0001)
criterion = nn.MSELoss()

for epoch in range(2000):
    predictions = model_v2(X)
    loss = criterion(predictions, y_true)

    optimizer_v2.zero_grad()
    loss.backward()
    optimizer_v2.step()

    if epoch % 500 == 0 or epoch == 1999:
        print(f"Epoch {epoch:4d} | Loss: {loss.item():.4f}")
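For a simple stack like this one, where each layer feeds straight into the next, PyTorch's nn.Sequential container offers a shorter way to express the same architecture. The sketch below is an alternative formulation, not a replacement for the class above; model_seq and sample are illustrative names.

```python
import torch
import torch.nn as nn

# Same architecture as HousePriceModelV2: 2 inputs, 8 hidden units, 1 output
model_seq = nn.Sequential(
    nn.Linear(2, 8),
    nn.ReLU(),
    nn.Linear(8, 1),
)

seq_params = sum(p.numel() for p in model_seq.parameters())
print(f"Total parameters: {seq_params}")  # 24 + 9 = 33

# Run two example houses through the untrained model to check shapes
sample = torch.tensor([[10.0, 3.0], [15.0, 4.0]])
with torch.no_grad():
    out = model_seq(sample)
print(f"Output shape: {tuple(out.shape)}")  # (2, 1)
```

Writing a full nn.Module subclass becomes worthwhile once the forward pass needs branching, skip connections, or multiple inputs, which nn.Sequential cannot express.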

6. Activation Functions

Why Do We Need Them?

Without activation functions, stacking layers is pointless. Here’s why:

A linear layer computes $y = Wx + b$. If you stack two linear layers:

$$\text{Layer 1: } h = W_1 x + b_1$$

$$\text{Layer 2: } y = W_2 h + b_2 = W_2(W_1 x + b_1) + b_2 = (W_2 W_1)x + (W_2 b_1 + b_2)$$

This is still just $y = W'x + b'$, with $W' = W_2 W_1$ and $b' = W_2 b_1 + b_2$: a single linear transformation! No matter how many linear layers you stack, the result is always linear. You might as well have one layer.

Activation functions break this linearity. They introduce curves, bends, and non-linear patterns that let neural networks learn complex relationships.
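The collapse of two linear layers into one can be confirmed numerically. The sketch below uses random matrices with arbitrary sizes (3 inputs, 4 hidden units, 2 outputs) and float64 so that rounding error stays negligible; all names are illustrative.

```python
import torch

torch.manual_seed(0)
dtype = torch.float64

x = torch.randn(3, dtype=dtype)
W1, b1 = torch.randn(4, 3, dtype=dtype), torch.randn(4, dtype=dtype)
W2, b2 = torch.randn(2, 4, dtype=dtype), torch.randn(2, dtype=dtype)

# Two stacked linear layers with no activation in between
h = W1 @ x + b1
y_stacked = W2 @ h + b2

# One equivalent layer: W' = W2 W1 and b' = W2 b1 + b2
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
y_single = W_combined @ x + b_combined

same = torch.allclose(y_stacked, y_single)
print(f"Identical outputs: {same}")

# With a ReLU between the layers, the single-layer shortcut usually no longer matches
y_relu = W2 @ torch.relu(h) + b2
print(f"Stacked with ReLU: {y_relu}")
print(f"Single layer:      {y_single}")
```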

ReLU (Rectified Linear Unit)

What it does: If the input is positive, keep it. If negative, set it to zero.

$$\text{ReLU}(x) = \max(0, x)$$

Visual description:

Input:  -3  -1   0   1   3   5
Output:  0   0   0   1   3   5

Think of it as a flood gate:
  ─────╱         Positive values pass through unchanged
      ╱
─────            Negative values are blocked (set to 0)

import torch
import torch.nn.functional as F

x = torch.tensor([-3.0, -1.0, 0.0, 1.0, 3.0, 5.0])
relu_output = F.relu(x)

print(f"Input:  {x.tolist()}")
print(f"ReLU:   {relu_output.tolist()}")
# Input:  [-3.0, -1.0, 0.0, 1.0, 3.0, 5.0]
# ReLU:   [0.0, 0.0, 0.0, 1.0, 3.0, 5.0]

Why is ReLU popular?

Dead simple to compute (just a comparison)
Doesn’t saturate for positive values, so the gradient stays at 1 there, which helps against vanishing gradients
Works very well in practice
Most neural networks use ReLU or its variants (GELU, which we’ll see in transformers)

Sigmoid

What it does: Squashes any input into the range (0, 1).

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Visual description:

         _______________  1.0
        ╱
       ╱
      ╱                   0.5
     ╱
____╱                     0.0

Large negative → near 0
Zero → exactly 0.5
Large positive → near 1

x = torch.tensor([-5.0, -2.0, 0.0, 2.0, 5.0])
sigmoid_output = torch.sigmoid(x)

print(f"Input:   {x.tolist()}")
print(f"Sigmoid: {[f'{v:.4f}' for v in sigmoid_output.tolist()]}")
# Input:   [-5.0, -2.0, 0.0, 2.0, 5.0]
# Sigmoid: ['0.0067', '0.1192', '0.5000', '0.8808', '0.9933']

When to use sigmoid:

Output gates (when you need a value between 0 and 1)
Binary classification (is this spam or not spam?)
Probability outputs

Why not everywhere? Sigmoid suffers from the “vanishing gradient” problem: for large positive or large negative inputs the curve flattens out, the gradient approaches zero, and learning stalls. That’s why ReLU is preferred for hidden layers.
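To put numbers on this, the sketch below evaluates the sigmoid's slope using the standard identity $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ and compares it with ReLU's slope. It uses plain Python, and the helper names are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Slope of the sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    """Slope of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:4.1f}  sigmoid slope = {sigmoid_grad(x):.6f}  ReLU slope = {relu_grad(x):.1f}")
```

The sigmoid's slope peaks at 0.25 when x = 0 and drops below 0.0001 by x = 10, whereas ReLU's slope stays at 1 for every positive input.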

Comparison in a Network

# Same network, different activation functions
x = torch.randn(5, 3)  # 5 samples, 3 features

linear = nn.Linear(3, 4)
output_raw = linear(x)

print(f"Raw output (no activation):\n{output_raw.data}\n")
print(f"After ReLU:\n{F.relu(output_raw).data}\n")
print(f"After Sigmoid:\n{torch.sigmoid(output_raw).data}")

Note how:

Raw output can be any real number (positive or negative)
After ReLU: all negatives become 0, positives unchanged
After Sigmoid: everything squeezed between 0 and 1

GELU — A Preview

When we build our LLM later, we’ll use GELU (Gaussian Error Linear Unit) instead of ReLU. GELU is smoother — instead of a hard cutoff at zero, it has a soft curve. Think of it as a “gentler” ReLU. For now, just know it exists.

x = torch.tensor([-3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 3.0])
print(f"Input: {x.tolist()}")
print(f"ReLU:  {F.relu(x).tolist()}")
print(f"GELU:  {[f'{v:.3f}' for v in F.gelu(x).tolist()]}")
# Notice GELU allows small negative values through, unlike ReLU

7. Loss Functions

A loss function measures how wrong your model’s predictions are. It produces a single number: lower = better. Training is all about minimizing this number.

MSE (Mean Squared Error) — For Regression

When you’re predicting a continuous number (price, temperature, age), MSE is the go-to loss:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$

Where $\hat{y}_i$ is the prediction and $y_i$ is the true value.

Why squared? Two reasons:

It makes all errors positive (a prediction that’s too high by 5 and one that’s too low by 5 both contribute 25)
It penalizes large errors much more than small ones ($10^2 = 100$, but $2^2 = 4$)

import torch
import torch.nn as nn

# Predictions vs true values
predictions = torch.tensor([2.5, 0.0, 2.1, 7.8])
targets     = torch.tensor([3.0, -0.5, 2.0, 8.0])

# Manual MSE calculation
errors = predictions - targets
print(f"Errors: {errors.tolist()}")               # ≈ [-0.5, 0.5, 0.1, -0.2] (float32 rounding)
print(f"Squared: {(errors**2).tolist()}")          # ≈ [0.25, 0.25, 0.01, 0.04]
print(f"Mean: {(errors**2).mean().item():.4f}")    # 0.1375

# PyTorch's built-in MSE
criterion = nn.MSELoss()
loss = criterion(predictions, targets)
print(f"nn.MSELoss: {loss.item():.4f}")            # 0.1375 — same!
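Squaring also shapes the learning signal. Differentiating the MSE formula with respect to one prediction gives $\frac{2}{n}(\hat{y}_i - y_i)$, so the gradient grows in proportion to the error. A small sketch, reusing the same four numbers, checks this against autograd:

```python
import torch
import torch.nn as nn

predictions = torch.tensor([2.5, 0.0, 2.1, 7.8], requires_grad=True)
targets = torch.tensor([3.0, -0.5, 2.0, 8.0])

loss = nn.MSELoss()(predictions, targets)
loss.backward()

# Analytic gradient of the mean of squared errors: 2 * (prediction - target) / n
manual_grad = 2 * (predictions.detach() - targets) / predictions.numel()

print(f"Autograd gradient: {predictions.grad.tolist()}")
print(f"Manual gradient:   {manual_grad.tolist()}")
```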

Cross-Entropy — For Classification

When you’re predicting a category (which word comes next, is this a cat or dog, what digit is this), cross-entropy is the standard loss.

The idea: your model outputs a probability distribution over possible classes (e.g., “60% cat, 30% dog, 10% bird”). Cross-entropy measures how different this distribution is from reality (which is “100% cat, 0% everything else”).

$$\text{Cross-Entropy} = -\sum_{c=1}^{C} y_c \log(\hat{y}_c)$$

Where $C$ is the number of classes, $y_c$ is 1 for the correct class and 0 for others, and $\hat{y}_c$ is the predicted probability for class $c$.
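As a quick worked instance of the formula, take the example distribution above (60% cat, 30% dog, 10% bird) with cat as the true class. Assuming the natural logarithm, which is what PyTorch's cross-entropy uses, only the correct-class term survives:

$$\text{Cross-Entropy} = -(1 \cdot \log 0.6 + 0 \cdot \log 0.3 + 0 \cdot \log 0.1) = -\log 0.6 \approx 0.51$$

Had the model given cat only 10%, the loss would rise to $-\log 0.1 \approx 2.30$.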

# Classification example: 3 classes (cat, dog, bird)
# Model outputs raw scores (logits), not probabilities
logits = torch.tensor([[2.0, 1.0, 0.1]])   # Raw model output
target = torch.tensor([0])                    # True class: cat (index 0)

# Cross-entropy loss (handles softmax internally)
criterion = nn.CrossEntropyLoss()
loss = criterion(logits, target)
print(f"Cross-entropy loss: {loss.item():.4f}")

# Let's see what happens under the hood:
probabilities = torch.softmax(logits, dim=1)
print(f"Probabilities: {probabilities.data}")
# The model gives ~66% to cat, ~24% to dog, ~10% to bird
# Since the true label is cat, this isn't bad! Loss should be low.

# Now with a wrong prediction:
logits_wrong = torch.tensor([[0.1, 2.0, 1.0]])  # Model thinks it's a dog
loss_wrong = criterion(logits_wrong, target)
print(f"\nWrong prediction loss: {loss_wrong.item():.4f}")
probs_wrong = torch.softmax(logits_wrong, dim=1)
print(f"Wrong probabilities: {probs_wrong.data}")
# Higher loss when the model is confident about the wrong class
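The "handles softmax internally" step can be unpacked by hand. The following sketch reproduces the loss for the first example in two steps and compares it with nn.CrossEntropyLoss. It is an illustration of the computation, not a description of PyTorch's actual internals.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.tensor([[2.0, 1.0, 0.1]])
target = torch.tensor([0])

# Step 1: log-softmax turns raw scores into log-probabilities
log_probs = F.log_softmax(logits, dim=1)

# Step 2: pick the log-probability of the true class and negate it
manual_loss = -log_probs[0, target.item()]

builtin_loss = nn.CrossEntropyLoss()(logits, target)
print(f"Manual:   {manual_loss.item():.4f}")
print(f"Built-in: {builtin_loss.item():.4f}")
```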

Why cross-entropy for LLMs? When a language model predicts the next word, it’s choosing from a vocabulary of thousands of words — that’s a classification problem with thousands of classes. Cross-entropy loss penalizes the model when it assigns low probability to the correct next word.

# Simulating LLM next-word prediction
# Vocabulary: ["the", "cat", "sat", "on", "mat"]
vocab_size = 5

# Model's raw predictions for the next word
logits = torch.tensor([[1.5, 3.2, 0.5, 0.1, 0.8]])  # thinks "cat" is most likely
true_next_word = torch.tensor([1])  # correct answer is "cat" (index 1)

loss = nn.CrossEntropyLoss()(logits, true_next_word)
probs = torch.softmax(logits, dim=1)

print("Word probabilities:")
vocab = ["the", "cat", "sat", "on", "mat"]
for i, word in enumerate(vocab):
    marker = " ← correct" if i == true_next_word.item() else ""
    print(f"  {word}: {probs[0][i].item():.4f}{marker}")
print(f"\nLoss: {loss.item():.4f}")
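A useful reference point when reading such losses: a model that knows nothing and spreads probability evenly over the vocabulary gets a loss of $\ln(\text{vocab\_size})$. A minimal sketch with the same 5-word vocabulary, where the all-zero logits are an assumption standing in for "no knowledge":

```python
import math
import torch
import torch.nn as nn

vocab_size = 5
true_next_word = torch.tensor([1])

# Identical logits give a uniform distribution: every word gets 1/vocab_size
uniform_logits = torch.zeros(1, vocab_size)
uniform_loss = nn.CrossEntropyLoss()(uniform_logits, true_next_word)

print(f"Uniform-guess loss: {uniform_loss.item():.4f}")
print(f"ln(vocab_size):     {math.log(vocab_size):.4f}")
```

A loss well below this baseline means the model has learned something about which word comes next.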

Choosing the Right Loss Function

Task	Loss Function	Output Activation
Predict a number (regression)	MSE	None (linear)
Binary yes/no	Binary Cross-Entropy	Sigmoid
Pick one of N classes	Cross-Entropy	Softmax (built-in)
Next word prediction (LLM)	Cross-Entropy	Softmax (built-in)
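The binary row is the only one not demonstrated so far. Here is a small sketch with a hypothetical spam classifier (the scores and labels are made up) showing two equivalent ways to compute binary cross-entropy in PyTorch:

```python
import torch
import torch.nn as nn

# Hypothetical spam classifier: one raw score (logit) per email
logits = torch.tensor([1.2, -0.8, 2.5])
labels = torch.tensor([1.0, 0.0, 1.0])   # 1 = spam, 0 = not spam

# Option A: apply sigmoid yourself, then nn.BCELoss on the probabilities
loss_a = nn.BCELoss()(torch.sigmoid(logits), labels)

# Option B: nn.BCEWithLogitsLoss applies the sigmoid internally
loss_b = nn.BCEWithLogitsLoss()(logits, labels)

print(f"Sigmoid + BCELoss: {loss_a.item():.4f}")
print(f"BCEWithLogitsLoss: {loss_b.item():.4f}")
```

Option B is generally the more numerically stable choice, in the same way that nn.CrossEntropyLoss takes raw logits rather than softmax outputs.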

8. Putting It All Together — A Complete Example

Let’s combine everything into one clean, well-commented example. We’ll build a two-layer neural network for house price prediction.

import torch
import torch.nn as nn
import torch.optim as optim

# ─── Step 1: Prepare Data ───
# Features: [size (100s sqft), bedrooms, bathrooms, age (decades)]
X_train = torch.tensor([
    [10.0, 3.0, 2.0, 2.0],   # 1000sqft, 3bed, 2bath, 20yr old
    [15.0, 4.0, 2.5, 1.0],   # 1500sqft, 4bed, 2.5bath, 10yr old
    [20.0, 5.0, 3.0, 0.5],   # 2000sqft, 5bed, 3bath, 5yr old
    [8.0,  2.0, 1.0, 5.0],   # 800sqft, 2bed, 1bath, 50yr old
    [12.0, 3.0, 2.0, 3.0],   # 1200sqft, 3bed, 2bath, 30yr old
    [25.0, 6.0, 4.0, 0.0],   # 2500sqft, 6bed, 4bath, new
    [9.0,  2.0, 1.0, 4.0],   # 900sqft, 2bed, 1bath, 40yr old
    [18.0, 4.0, 3.0, 1.5],   # 1800sqft, 4bed, 3bath, 15yr old
], dtype=torch.float32)

y_train = torch.tensor([
    [200.0], [320.0], [450.0], [120.0],
    [240.0], [550.0], [140.0], [380.0]
], dtype=torch.float32)

print(f"Training data: {X_train.shape[0]} houses, {X_train.shape[1]} features")
print(f"Target shape: {y_train.shape}")

# ─── Step 2: Define Model ───
class HousePriceNet(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.layer1 = nn.Linear(input_size, hidden_size)
        self.activation = nn.ReLU()
        self.layer2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # x shape: (batch_size, 4)
        x = self.layer1(x)       # → (batch_size, hidden_size)
        x = self.activation(x)   # → (batch_size, hidden_size)
        x = self.layer2(x)       # → (batch_size, 1)
        return x

model = HousePriceNet(input_size=4, hidden_size=16, output_size=1)

# Count parameters
num_params = sum(p.numel() for p in model.parameters())
print(f"\nModel architecture:\n{model}")
print(f"Total parameters: {num_params}")
# layer1: 4*16 weights + 16 biases = 80
# layer2: 16*1 weights + 1 bias = 17
# Total: 97

# ─── Step 3: Setup Training ───
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)  # Adam adapts the step size per parameter; often converges faster than plain SGD

# ─── Step 4: Training Loop ───
num_epochs = 2000
losses = []

print("\nTraining...")
print(f"{'Epoch':>6} | {'Loss':>10} | {'Avg Error ($k)':>14}")
print("-" * 36)

for epoch in range(num_epochs):
    # Forward pass
    predictions = model(X_train)
    loss = criterion(predictions, y_train)
    losses.append(loss.item())

    # Backward pass + update
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Logging
    if epoch % 400 == 0 or epoch == num_epochs - 1:
        avg_error = torch.abs(predictions - y_train).mean().item()
        print(f"{epoch:6d} | {loss.item():10.2f} | ${avg_error:12.2f}k")

# ─── Step 5: Evaluate ───
model.eval()
with torch.no_grad():
    final_preds = model(X_train)

print(f"\n{'House':>5} | {'Predicted':>10} | {'Actual':>8} | {'Error':>8}")
print("-" * 42)
for i in range(len(y_train)):
    pred = final_preds[i].item()
    actual = y_train[i].item()
    error = abs(pred - actual)
    print(f"{i+1:5d} | ${pred:8.1f}k | ${actual:6.1f}k | ${error:6.1f}k")

This complete example demonstrates every concept from this chapter:

Tensors hold our data and model parameters
Matrix multiplication (inside nn.Linear) transforms inputs
ReLU activation adds non-linearity between layers
MSE loss measures prediction quality
Backpropagation (loss.backward()) computes all gradients
Gradient-based optimization (optimizer.step(), here using Adam) updates weights to improve predictions
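To see what an optimizer step amounts to, the sketch below applies one update two ways: through optim.SGD and by hand. It deliberately uses plain SGD rather than the Adam optimizer from the example, because SGD's rule is a single line. The random data and the tiny model are placeholders.

```python
import copy
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
lr = 0.01
x = torch.randn(8, 4)
y = torch.randn(8, 1)

model_a = nn.Linear(4, 1)
model_b = copy.deepcopy(model_a)   # identical starting weights

# Model A: let the optimizer apply the update
optimizer = optim.SGD(model_a.parameters(), lr=lr)
optimizer.zero_grad()
nn.MSELoss()(model_a(x), y).backward()
optimizer.step()

# Model B: apply the same rule by hand, parameter -= lr * gradient
nn.MSELoss()(model_b(x), y).backward()
with torch.no_grad():
    for p in model_b.parameters():
        p -= lr * p.grad

print(f"Weights match: {torch.allclose(model_a.weight, model_b.weight)}")
print(f"Biases match:  {torch.allclose(model_a.bias, model_b.bias)}")
```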

9. Exercises

Test your understanding with these exercises. Try them on your own before looking at the solutions.

Exercise 1: Tensor Shapes

Predict the output shape for each operation, then verify with PyTorch:

A = torch.randn(3, 4)
B = torch.randn(4, 5)
C = torch.randn(3, 4)
v = torch.randn(4)

# What are the shapes of:
# a) A @ B
# b) A + C
# c) A * C
# d) A + v  (broadcasting)
# e) A.reshape(6, 2)
# f) A.reshape(-1)

Exercise 2: Gradient Computation

Use PyTorch autograd to find the gradient of $f(x, y) = x^2 y + 3xy^2 - 2x + 5$ at the point $(x=2, y=3)$.

Compute it by hand first:

$\frac{\partial f}{\partial x} = 2xy + 3y^2 - 2$
$\frac{\partial f}{\partial y} = x^2 + 6xy$

Exercise 3: Manual Forward Pass

Given this network:

Input: [2.0, 3.0]
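Layer 1 weights: [[0.5, -0.3], [0.2, 0.8]] (shape 2×2; each row holds one neuron’s weights, the same layout as nn.Linear’s weight)
Layer 1 bias: [0.1, -0.1]
Activation: ReLU
Layer 2 weights: [[0.4], [0.6]] (shape 2×1; one weight per hidden unit)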
Layer 2 bias: [0.0]

Compute the output step by step on paper, then verify with PyTorch.

Exercise 4: Build a Temperature Converter

Build a neural network that learns to convert Celsius to Fahrenheit ($F = 1.8C + 32$). Create training data, train the model, and show it can convert temperatures it hasn’t seen.

Exercise 5: Loss Function Comparison

Take one fixed set of predictions and targets, including at least one large error, and compute both MSE loss and Mean Absolute Error (MAE = nn.L1Loss()) on it. Which one penalizes outliers more?

Solutions

Solution 1: Tensor Shapes

A = torch.randn(3, 4)
B = torch.randn(4, 5)
C = torch.randn(3, 4)
v = torch.randn(4)

# a) A @ B → (3, 4) @ (4, 5) = (3, 5)
print(f"a) A @ B: {(A @ B).shape}")  # torch.Size([3, 5])

# b) A + C → (3, 4) + (3, 4) = (3, 4) — same shapes, element-wise
print(f"b) A + C: {(A + C).shape}")  # torch.Size([3, 4])

# c) A * C → (3, 4) * (3, 4) = (3, 4) — element-wise multiplication
print(f"c) A * C: {(A * C).shape}")  # torch.Size([3, 4])

# d) A + v → (3, 4) + (4,) = (3, 4) — v broadcast across rows
print(f"d) A + v: {(A + v).shape}")  # torch.Size([3, 4])

# e) A.reshape(6, 2) → 3*4 = 12 = 6*2 ✓
print(f"e) reshape(6, 2): {A.reshape(6, 2).shape}")  # torch.Size([6, 2])

# f) A.reshape(-1) → flattened to (12,)
print(f"f) reshape(-1): {A.reshape(-1).shape}")  # torch.Size([12])
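Part (d) works because broadcasting aligns shapes from the right. The sketch below shows the flip side with a hypothetical per-row vector w (not part of the exercise): a shape of (3,) does not line up with (3, 4) until a trailing axis is added.

```python
import torch

A = torch.randn(3, 4)
w = torch.randn(3)   # hypothetical per-row vector, one value per row of A

# Shapes are aligned from the right: (3, 4) vs (3,) compares 4 with 3, which fails
try:
    A + w
    broadcast_failed = False
except RuntimeError as err:
    broadcast_failed = True
    print(f"A + w failed: {err}")

# Adding a trailing axis gives (3, 1), which broadcasts across the 4 columns
result = A + w.unsqueeze(1)
print(f"A + w.unsqueeze(1): {result.shape}")
```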

Solution 2: Gradient Computation

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)

f = x**2 * y + 3 * x * y**2 - 2 * x + 5
f.backward()

print(f"f(2, 3) = {f.item()}")
print(f"∂f/∂x = {x.grad.item()}")  # 2*2*3 + 3*9 - 2 = 12 + 27 - 2 = 37
print(f"∂f/∂y = {y.grad.item()}")  # 4 + 6*2*3 = 4 + 36 = 40

# Manual check:
# ∂f/∂x = 2xy + 3y² - 2 = 2(2)(3) + 3(9) - 2 = 12 + 27 - 2 = 37 ✓
# ∂f/∂y = x² + 6xy = 4 + 6(2)(3) = 4 + 36 = 40 ✓
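A third way to check a gradient, independent of both calculus and autograd, is a finite-difference estimate: nudge one input slightly and see how much the output moves. A minimal sketch in plain Python, where the step size h is an arbitrary small value:

```python
def f(x, y):
    return x**2 * y + 3 * x * y**2 - 2 * x + 5

x0, y0, h = 2.0, 3.0, 1e-5

# Central differences: nudge one input slightly in both directions
df_dx = (f(x0 + h, y0) - f(x0 - h, y0)) / (2 * h)
df_dy = (f(x0, y0 + h) - f(x0, y0 - h)) / (2 * h)

print(f"Numerical ∂f/∂x ≈ {df_dx:.4f}")
print(f"Numerical ∂f/∂y ≈ {df_dy:.4f}")
```

The estimates should land very close to the analytic values of 37 and 40.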

Solution 3: Manual Forward Pass

import torch
import torch.nn.functional as F

# Input
x = torch.tensor([[2.0, 3.0]])

# Layer 1
W1 = torch.tensor([[0.5, -0.3], [0.2, 0.8]])
b1 = torch.tensor([0.1, -0.1])

# Step 1: Linear transform
z1 = x @ W1.T + b1
# [2.0, 3.0] @ [[0.5, 0.2], [-0.3, 0.8]] + [0.1, -0.1]
# = [2*0.5 + 3*(-0.3), 2*0.2 + 3*0.8] + [0.1, -0.1]
# = [1.0 - 0.9, 0.4 + 2.4] + [0.1, -0.1]
# = [0.1, 2.8] + [0.1, -0.1]
# = [0.2, 2.7]
print(f"After Layer 1 (linear): {z1.data}")

# Step 2: ReLU
a1 = F.relu(z1)
# [max(0, 0.2), max(0, 2.7)] = [0.2, 2.7]
print(f"After ReLU: {a1.data}")

# Layer 2
W2 = torch.tensor([[0.4], [0.6]])
b2 = torch.tensor([0.0])

# Step 3: Linear transform
z2 = a1 @ W2 + b2
# = [0.2*0.4 + 2.7*0.6] + 0.0
# = [0.08 + 1.62]
# = [1.70]
print(f"Output: {z2.data}")

# Verify with nn.Module
model = torch.nn.Sequential(
    torch.nn.Linear(2, 2),
    torch.nn.ReLU(),
    torch.nn.Linear(2, 1)
)
# Set weights manually
with torch.no_grad():
    model[0].weight.copy_(W1)
    model[0].bias.copy_(b1)
    model[2].weight.copy_(W2.T)
    model[2].bias.copy_(b2)

output = model(x)
print(f"nn.Module output: {output.data}")  # Should match: 1.70

Solution 4: Temperature Converter

import torch
import torch.nn as nn
import torch.optim as optim

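# F = 1.8 * C + 32
# Generate training data
torch.manual_seed(42)
C_train = torch.linspace(-40, 100, 50).reshape(-1, 1)
F_train = 1.8 * C_train + 32

# Standardize the inputs to zero mean and unit variance. On raw Celsius values
# (-40 to 100), SGD needs a tiny learning rate to stay stable, and the bias
# then moves too slowly to get near 32 within 5000 steps.
C_mean, C_std = C_train.mean(), C_train.std()
C_norm = (C_train - C_mean) / C_std

# Simple linear model (this IS a linear relationship)
model = nn.Linear(1, 1)
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Train
for epoch in range(5000):
    pred = model(C_norm)
    loss = criterion(pred, F_train)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if epoch % 1000 == 0:
        print(f"Epoch {epoch}: Loss = {loss.item():.6f}")

# Check learned parameters, converted back to raw Celsius units:
# model(z) = w_z * z + b_z with z = (C - mean) / std
w = (model.weight / C_std).item()
b = (model.bias - model.weight * C_mean / C_std).item()
print(f"\nLearned: F = {w:.4f} * C + {b:.4f}")
print(f"Actual:  F = 1.8000 * C + 32.0000")

# Test on a few temperatures (37 and -10 are not in the training set)
test_temps = torch.tensor([[0.0], [100.0], [37.0], [-10.0]])
with torch.no_grad():
    predictions = model((test_temps - C_mean) / C_std)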

print(f"\nTest predictions:")
for i, c in enumerate(test_temps):
    actual = 1.8 * c.item() + 32
    print(f"  {c.item():.0f}°C → predicted {predictions[i].item():.2f}°F "
          f"(actual: {actual:.1f}°F)")
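Because the relationship is exactly linear, the answer can also be cross-checked without any training loop. The sketch below fits the same line in closed form with torch.linalg.lstsq. The variable names are illustrative, and float64 is used only to keep rounding error small.

```python
import torch

C = torch.linspace(-40, 100, 50, dtype=torch.float64).reshape(-1, 1)
F_true = 1.8 * C + 32

# Design matrix with a column of ones so the solver can fit the intercept
A = torch.cat([C, torch.ones_like(C)], dim=1)   # shape (50, 2)

# Least-squares solution of A @ [slope, intercept] = F_true
solution = torch.linalg.lstsq(A, F_true).solution   # shape (2, 1)
slope, intercept = solution[0].item(), solution[1].item()

print(f"Closed-form fit: F = {slope:.4f} * C + {intercept:.4f}")
```

Gradient descent is overkill for a problem this small, but it is the method that still works when the model has millions of parameters and no closed-form solution.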

Solution 5: Loss Function Comparison

import torch
import torch.nn as nn

predictions = torch.tensor([2.0, 8.0, 5.0, 10.0])
targets     = torch.tensor([3.0, 7.0, 5.0, 3.0])  # Last one is an outlier!

errors = predictions - targets
print(f"Errors: {errors.tolist()}")  # [-1.0, 1.0, 0.0, 7.0]

# MSE: squares the errors → outlier dominates
mse = nn.MSELoss()(predictions, targets)
print(f"MSE Loss: {mse.item():.2f}")
# = mean([1, 1, 0, 49]) = 51/4 = 12.75

# MAE: absolute values → outlier has proportional impact
mae = nn.L1Loss()(predictions, targets)
print(f"MAE Loss: {mae.item():.2f}")
# = mean([1, 1, 0, 7]) = 9/4 = 2.25

print(f"\nMSE/MAE ratio: {mse.item()/mae.item():.2f}")
print("MSE penalizes the outlier (error=7) MUCH more because 7²=49")
print("MAE treats it proportionally: error=7 contributes 7, not 49")
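The difference also shows up in the gradients each loss sends back to the model. This sketch reuses the same four predictions and targets and asks autograd for the gradient with respect to each prediction. The helper function is illustrative, not part of the exercise.

```python
import torch
import torch.nn as nn

targets = torch.tensor([3.0, 7.0, 5.0, 3.0])

def grad_wrt_predictions(criterion):
    # Fresh leaf tensor each time so gradients do not accumulate across calls
    predictions = torch.tensor([2.0, 8.0, 5.0, 10.0], requires_grad=True)
    criterion(predictions, targets).backward()
    return predictions.grad

mse_grad = grad_wrt_predictions(nn.MSELoss())
mae_grad = grad_wrt_predictions(nn.L1Loss())

print(f"MSE gradients: {mse_grad.tolist()}")
print(f"MAE gradients: {mae_grad.tolist()}")
```

With MSE, the outlier's gradient is seven times larger than that of the small errors, so it dominates the update. With MAE, every nonzero error pulls with the same strength.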

Summary

In this chapter, you learned the building blocks of deep learning:

Concept	What it is	Why it matters
Tensor	Multi-dimensional number container	All data and parameters in neural networks are tensors
Tensor Operations	Math on tensors (add, multiply, matmul)	Matrix multiplication is the core computation in neural networks
Gradient	Direction and rate of steepest increase	Stepping the opposite way tells us how to adjust weights to reduce error
Backpropagation	Chain rule applied through the computation graph	Computes gradients for all parameters automatically
Neural Network	Layers of weighted sums + activations	Learns complex patterns from data
Activation Function	Non-linear function between layers	Enables learning non-linear relationships
Loss Function	Measures prediction quality	Provides the signal that drives learning

The training loop you learned — forward → loss → backward → step — is the same loop used to train GPT, BERT, and every other large language model. The networks are bigger and the data is different, but the core process is identical.

In the next chapter, we’ll apply these foundations to text — learning how to convert words into numbers that neural networks can process.