Backpropagation
How Neural Networks Learn — Step-by-Step Derivation
Last Updated: March 2026
📌 Key Takeaways
- Definition: Backpropagation computes the gradient of the loss function with respect to every weight in the network using the chain rule.
- Foundation: The chain rule of calculus — derivatives of composite functions.
- Training step: Forward pass (compute output & loss) → Backward pass (compute gradients) → Weight update. The first two are the passes; the update then applies the gradients.
- Gradient flow: Gradients flow backward from output layer to input layer.
- Vanishing gradients: Gradients become extremely small in deep networks with sigmoid/tanh — use ReLU to mitigate.
1. Intuition — Why Do We Need Backpropagation?
A neural network has thousands — sometimes billions — of weights. To train it, we need to know: for each weight, how much should we increase or decrease it to reduce the loss?
This requires computing ∂L/∂w for each weight. For the final layer, this is straightforward. But for earlier layers, the weight connects to the loss through many intermediate computations.
Backpropagation solves this by using the chain rule to efficiently compute all gradients in a single backward pass. It starts at the output layer and propagates gradients backward — each layer’s gradient is computed using the gradient already computed for the next layer.
2. The Chain Rule of Calculus
If y = f(g(x)), then: dy/dx = (dy/dg) × (dg/dx)
In neural networks, to find ∂L/∂w:
∂L/∂w = (∂L/∂a) × (∂a/∂z) × (∂z/∂w)
Where z = w·x + b (pre-activation), a = activation(z), L = loss(a). Backpropagation chains these together for every weight in every layer.
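The chain of three factors can be checked numerically. The sketch below (pure Python; a squared-error loss is assumed purely for illustration) computes ∂L/∂w by the chain rule and compares it against a finite-difference estimate:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One neuron: z = w*x + b, a = sigmoid(z), L = (a - y)^2 (loss chosen for illustration)
x, b, y = 2.0, 0.0, 1.0
w = 0.5

def loss(w):
    a = sigmoid(w * x + b)
    return (a - y) ** 2

# Chain rule: dL/dw = (dL/da) * (da/dz) * (dz/dw)
z = w * x + b
a = sigmoid(z)
dL_da = 2 * (a - y)
da_dz = a * (1 - a)      # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
dz_dw = x
grad_chain = dL_da * da_dz * dz_dw

# Central finite-difference check
eps = 1e-6
grad_fd = (loss(w + eps) - loss(w - eps)) / (2 * eps)

print(grad_chain, grad_fd)   # the two values agree
```

The agreement between the analytic and numerical gradient is the standard "gradient check" used to debug hand-written backpropagation code.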
3. The Backpropagation Algorithm
Step 1 — Forward Pass: Compute and store Z⁽ˡ⁾ = W⁽ˡ⁾A⁽ˡ⁻¹⁾ + b⁽ˡ⁾ and A⁽ˡ⁾ = activation(Z⁽ˡ⁾) for each layer l. Store all values — they are needed in the backward pass.
Step 2 — Compute Loss: L = loss(A⁽ᴸ⁾, y)
Step 3 — Backward Pass (output layer):
δ⁽ᴸ⁾ = ∂L/∂A⁽ᴸ⁾ ⊙ activation′(Z⁽ᴸ⁾)   (⊙ = element-wise product)
∂L/∂W⁽ᴸ⁾ = δ⁽ᴸ⁾ × (A⁽ᴸ⁻¹⁾)ᵀ,   ∂L/∂b⁽ᴸ⁾ = δ⁽ᴸ⁾
Step 4 — Backward Pass (hidden layers):
δ⁽ˡ⁾ = ((W⁽ˡ⁺¹⁾)ᵀ × δ⁽ˡ⁺¹⁾) ⊙ activation′(Z⁽ˡ⁾)
∂L/∂W⁽ˡ⁾ = δ⁽ˡ⁾ × (A⁽ˡ⁻¹⁾)ᵀ,   ∂L/∂b⁽ˡ⁾ = δ⁽ˡ⁾
Step 5 — Weight Update:
W⁽ˡ⁾ := W⁽ˡ⁾ − α × ∂L/∂W⁽ˡ⁾   and   b⁽ˡ⁾ := b⁽ˡ⁾ − α × ∂L/∂b⁽ˡ⁾,   where α is the learning rate.
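The five steps can be sketched in pure Python. To keep every quantity a scalar, the network below has one neuron per layer, and binary cross-entropy with true label y = 1 is assumed as the loss; running it shows the loss shrinking step by step:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, y, alpha = 2.0, 1.0, 0.1
W1, b1 = 0.5, 0.0
W2, b2 = 0.8, 0.0

losses = []
for step in range(5):
    # Step 1: forward pass, storing every Z and A for the backward pass
    Z1 = W1 * x + b1;  A1 = sigmoid(Z1)
    Z2 = W2 * A1 + b2; A2 = sigmoid(Z2)
    # Step 2: loss (binary cross-entropy with y = 1)
    losses.append(-math.log(A2))
    # Step 3: output layer (delta2 = dL/dZ2, which equals A2 - y for this loss/activation pair)
    delta2 = A2 - y
    dW2, db2 = delta2 * A1, delta2
    # Step 4: hidden layer, pulling delta2 back through W2, then through sigmoid'(Z1)
    delta1 = W2 * delta2 * A1 * (1 - A1)
    dW1, db1 = delta1 * x, delta1
    # Step 5: gradient-descent update (delta1 was computed with the old W2)
    W2, b2 = W2 - alpha * dW2, b2 - alpha * db2
    W1, b1 = W1 - alpha * dW1, b1 - alpha * db1

print([round(L, 4) for L in losses])   # strictly decreasing
```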
4. Complete Worked Example
Network: 1 input (x=2), 1 hidden neuron with sigmoid, 1 output with sigmoid. Weights: w₁=0.5 (input→hidden), w₂=0.8 (hidden→output). Biases = 0. True label y=1. Learning rate α=0.1.
Forward Pass:
z₁ = 0.5 × 2 = 1.0 → a₁ = sigmoid(1.0) ≈ 0.731
z₂ = 0.8 × 0.731 = 0.585 → ŷ = sigmoid(0.585) ≈ 0.642
Loss (binary cross-entropy with y = 1) = −log(ŷ) = −log(0.642) ≈ 0.443
Backward Pass:
δ₂ = ∂L/∂z₂ = ŷ − y ≈ 0.642 − 1 = −0.358 (the sigmoid-output + cross-entropy combination simplifies to ŷ − y) → ∂L/∂w₂ = δ₂ × a₁ ≈ −0.262
δ₁ = δ₂ × w₂ × sigmoid'(z₁) ≈ −0.056 → ∂L/∂w₁ = δ₁ × x ≈ −0.112
Weight Updates:
w₂ := 0.8 − 0.1 × (−0.262) ≈ 0.826 | w₁ := 0.5 − 0.1 × (−0.112) ≈ 0.511
Both weights increased — pushing prediction closer to 1 (the true label), reducing the loss.
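The arithmetic above can be reproduced in a few lines of pure Python; rounding recovers exactly the values quoted in the example:

```python
import math

sig = lambda z: 1.0 / (1.0 + math.exp(-z))

x, y, alpha = 2.0, 1.0, 0.1
w1, w2 = 0.5, 0.8

# Forward pass
z1 = w1 * x;   a1 = sig(z1)       # 1.0, ~0.731
z2 = w2 * a1;  y_hat = sig(z2)    # ~0.585, ~0.642
loss = -math.log(y_hat)           # ~0.443

# Backward pass
d2 = y_hat - y                    # ~-0.358
dw2 = d2 * a1                     # ~-0.262
d1 = d2 * w2 * a1 * (1 - a1)      # ~-0.056
dw1 = d1 * x                      # ~-0.112

# Updates
w2_new = w2 - alpha * dw2         # ~0.826
w1_new = w1 - alpha * dw1         # ~0.511
print(round(w1_new, 3), round(w2_new, 3))   # 0.511 0.826
```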
5. Vanishing & Exploding Gradients
Vanishing Gradients
The sigmoid derivative is at most 0.25 (tanh's is at most 1, but it also shrinks rapidly away from zero). Multiplying ten such factors gives 0.25¹⁰ ≈ 0.000001, essentially zero, so the early layers learn almost nothing.
Solutions: Use ReLU. Use Batch Normalisation. Use residual connections (ResNets). Use Xavier/He initialisation.
Exploding Gradients
With large weights, gradients multiply to very large values, causing weight updates to overshoot. Common in RNNs.
Solutions: Gradient clipping. Weight regularisation. Better initialisation.
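Both failure modes come down to repeated multiplication. The sketch below multiplies the best-case sigmoid derivative across ten layers, then shows a minimal clip-by-norm helper (`clip_by_norm` is a hypothetical name for illustration, not a framework API):

```python
import math

# Vanishing: each backward step multiplies the gradient by w * activation'(z).
# With w = 1 and the *best case* sigmoid'(0) = 0.25, ten layers leave almost nothing.
grad = 1.0
for layer in range(10):
    grad *= 0.25              # upper bound of sigmoid'(z)
print(grad)                   # 0.25**10 ≈ 9.5e-07
# ReLU's derivative is exactly 1 for active units, so the same product stays 1.0.

# Exploding: the fix mentioned above, sketched as gradient clipping by norm.
def clip_by_norm(grads, max_norm=1.0):
    """Rescale the gradient vector so its L2 norm never exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        grads = [g * max_norm / norm for g in grads]
    return grads

print(clip_by_norm([30.0, 40.0], max_norm=5.0))   # [3.0, 4.0]
```

Clipping leaves the gradient's direction unchanged and only shrinks its length, which is why it is the standard remedy for exploding gradients in RNNs.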
6. Optimisers — How Gradients Are Used
| Optimiser | How It Works | When to Use |
|---|---|---|
| SGD | Simple gradient descent with fixed learning rate | Simple problems, full control |
| SGD + Momentum | Accelerates in consistent gradient directions | Large-scale training, computer vision |
| RMSprop | Adapts learning rate using recent gradient magnitudes | RNNs, non-stationary problems |
| Adam | Combines momentum and RMSprop | Default choice — works well for most problems |
| AdamW | Adam with decoupled weight decay | Transformers, large language models |
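The momentum and Adam rows of the table can be sketched as scalar update rules (hyperparameter values are the common defaults, an assumption here; the quadratic test loss is just an illustration):

```python
import math

def sgd_momentum(w, v, grad, lr=0.01, beta=0.9):
    # Velocity accumulates in consistent gradient directions, accelerating progress
    v = beta * v + grad
    return w - lr * v, v

def adam(w, m, v, grad, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad          # momentum term (first moment)
    v = b2 * v + (1 - b2) * grad ** 2     # RMSprop term (second moment)
    m_hat = m / (1 - b1 ** t)             # bias correction for early steps
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# Drive both on the same toy loss L = (w - 3)^2, so dL/dw = 2*(w - 3)
w_sgd, vel = 0.0, 0.0
w_adam, m, v = 0.0, 0.0, 0.0
for t in range(1, 201):
    w_sgd, vel = sgd_momentum(w_sgd, vel, 2 * (w_sgd - 3))
    w_adam, m, v = adam(w_adam, m, v, 2 * (w_adam - 3), t, lr=0.1)
print(round(w_sgd, 3), round(w_adam, 3))   # both approach the minimum at w = 3
```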
7. Common Mistakes Students Make
- Forgetting to zero gradients between batches (PyTorch): Gradients accumulate by default. Always call optimizer.zero_grad() before each forward pass.
- Confusing backpropagation with gradient descent: Backpropagation computes gradients. Gradient descent uses them to update weights. Two separate steps.
- Using sigmoid in deep hidden layers: Causes vanishing gradients. Use ReLU in hidden layers, sigmoid only at the binary output.
- Not monitoring training and validation loss: Always plot both loss curves. They tell you whether the model is converging, diverging, or overfitting.
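The first mistake can be modelled without PyTorch itself. The toy class below imitates the accumulate-by-default behaviour of `tensor.grad` (it is not the real API):

```python
# Each backward pass ADDS into .grad, so without a reset the second batch's
# update silently uses stale gradient left over from the first batch.
class Param:
    def __init__(self, value):
        self.value = value
        self.grad = 0.0          # accumulates, like tensor.grad in PyTorch

def backward(param, grad_from_batch):
    param.grad += grad_from_batch    # note: +=, not =

w = Param(1.0)
backward(w, 0.5)         # batch 1
backward(w, 0.5)         # batch 2, but we forgot to zero the gradient
accumulated = w.grad
print(accumulated)       # 1.0: double the intended gradient

w.grad = 0.0             # the fix, the analogue of optimizer.zero_grad()
backward(w, 0.5)
print(w.grad)            # 0.5
```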
8. Frequently Asked Questions
Does backpropagation always find the global minimum?
No. Gradient descent converges to a local minimum (or a saddle point), not necessarily the global minimum: the loss surface of a deep network is non-convex. In practice this matters less than it sounds. Empirically, most local minima reached when training deep networks have loss values close to the global minimum, and in high dimensions saddle points are a more common obstacle than bad local minima.
What is the difference between batch, mini-batch, and stochastic gradient descent?
Batch GD uses the entire training set per update — stable but slow. SGD uses one example — fast but very noisy. Mini-batch GD uses a small batch (32–256 examples) — the standard. Most “SGD” implementations in frameworks actually use mini-batch GD.
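The three granularities differ only in how the data is sliced per update, e.g. on a toy dataset of ten examples:

```python
import random

data = list(range(10))   # ten toy training examples
batch_size = 4

# Batch GD: one update per epoch, over all ten examples
full_batches = [data]

# SGD (strict): ten updates per epoch, one example each
sgd_batches = [[example] for example in data]

# Mini-batch GD: shuffle each epoch, then slice into chunks of batch_size
random.shuffle(data)
mini_batches = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]

print(len(full_batches), len(sgd_batches), len(mini_batches))   # 1 10 3
```

Note the last mini-batch is smaller (two examples here); frameworks either keep it or drop it via an option such as `drop_last`.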