Backpropagation

How Neural Networks Learn — Step-by-Step Derivation

Last Updated: March 2026

📌 Key Takeaways

  • Definition: Backpropagation computes the gradient of the loss function with respect to every weight in the network using the chain rule.
  • Foundation: The chain rule of calculus — derivatives of composite functions.
  • Each training step: a forward pass (compute output & loss), a backward pass (compute gradients), then a weight update.
  • Gradient flow: Gradients flow backward from output layer to input layer.
  • Vanishing gradients: Gradients become extremely small in deep networks with sigmoid/tanh — use ReLU to mitigate.

1. Intuition — Why Do We Need Backpropagation?

A neural network has thousands — sometimes billions — of weights. To train it, we need to know: for each weight, how much should we increase or decrease it to reduce the loss?

This requires computing ∂L/∂w for each weight. For the final layer, this is straightforward. But for earlier layers, the weight connects to the loss through many intermediate computations.

Backpropagation solves this by using the chain rule to efficiently compute all gradients in a single backward pass. It starts at the output layer and propagates gradients backward — each layer’s gradient is computed using the gradient already computed for the next layer.

2. The Chain Rule of Calculus

If y = f(g(x)), then: dy/dx = (dy/dg) × (dg/dx)

In neural networks, to find ∂L/∂w:

∂L/∂w = (∂L/∂a) × (∂a/∂z) × (∂z/∂w)

Where z = w·x + b (pre-activation), a = activation(z), L = loss(a). Backpropagation chains these together for every weight in every layer.
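This decomposition can be checked numerically. A minimal sketch, assuming squared-error loss and a sigmoid activation (chosen only for illustration), compares the chained analytic gradient against a finite-difference estimate:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Single neuron: z = w*x + b, a = sigmoid(z), L = (a - y)^2
def loss(w, x=2.0, b=0.0, y=1.0):
    a = sigmoid(w * x + b)
    return (a - y) ** 2

w, x, b, y = 0.5, 2.0, 0.0, 1.0
z = w * x + b
a = sigmoid(z)

# Chain rule: dL/dw = dL/da * da/dz * dz/dw
dL_da = 2 * (a - y)          # derivative of squared error w.r.t. a
da_dz = a * (1 - a)          # sigmoid'(z)
dz_dw = x                    # derivative of w*x + b w.r.t. w
grad_chain = dL_da * da_dz * dz_dw

# Numerical check with a central finite difference
eps = 1e-6
grad_numeric = (loss(w + eps) - loss(w - eps)) / (2 * eps)

print(grad_chain, grad_numeric)  # the two should agree closely
```

The finite-difference estimate knows nothing about the chain rule, so agreement between the two numbers confirms the decomposition.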

3. The Backpropagation Algorithm

Step 1 — Forward Pass: Compute and store Z⁽ˡ⁾ = W⁽ˡ⁾A⁽ˡ⁻¹⁾ + b⁽ˡ⁾ and A⁽ˡ⁾ = activation(Z⁽ˡ⁾) for each layer l. Store all values — they are needed in the backward pass.

Step 2 — Compute Loss: L = loss(A⁽ᴸ⁾, y)

Step 3 — Backward Pass (output layer):

δ⁽ᴸ⁾ = ∂L/∂A⁽ᴸ⁾ ⊙ activation'(Z⁽ᴸ⁾)   (⊙ denotes the element-wise product)

∂L/∂W⁽ᴸ⁾ = δ⁽ᴸ⁾ × (A⁽ᴸ⁻¹⁾)ᵀ

Step 4 — Backward Pass (hidden layers):

δ⁽ˡ⁾ = ((W⁽ˡ⁺¹⁾)ᵀ δ⁽ˡ⁺¹⁾) ⊙ activation'(Z⁽ˡ⁾)

∂L/∂W⁽ˡ⁾ = δ⁽ˡ⁾ × (A⁽ˡ⁻¹⁾)ᵀ,   ∂L/∂b⁽ˡ⁾ = δ⁽ˡ⁾ (summed over the batch)

Step 5 — Weight Update:

W⁽ˡ⁾ := W⁽ˡ⁾ − α × ∂L/∂W⁽ˡ⁾
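The five steps can be sketched end-to-end in NumPy. This is a minimal illustration, not a production implementation; the layer sizes, random data, and sigmoid/cross-entropy pairing are assumptions made for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny network: 3 inputs -> 4 hidden (sigmoid) -> 1 output (sigmoid)
W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))
alpha = 0.1

X = rng.normal(size=(3, 5))                          # 5 examples as columns
y = rng.integers(0, 2, size=(1, 5)).astype(float)    # random binary labels

# Step 1 — forward pass (store Z and A for the backward pass)
Z1 = W1 @ X + b1;  A1 = sigmoid(Z1)
Z2 = W2 @ A1 + b2; A2 = sigmoid(Z2)

# Step 2 — loss (binary cross-entropy, averaged over the batch)
m = X.shape[1]
L = -np.mean(y * np.log(A2) + (1 - y) * np.log(1 - A2))

# Step 3 — output layer: sigmoid + cross-entropy simplifies delta to A2 - y
d2 = (A2 - y) / m
dW2 = d2 @ A1.T
db2 = d2.sum(axis=1, keepdims=True)

# Step 4 — hidden layer: propagate delta backward through W2
d1 = (W2.T @ d2) * A1 * (1 - A1)   # element-wise product with sigmoid'(Z1)
dW1 = d1 @ X.T
db1 = d1.sum(axis=1, keepdims=True)

# Step 5 — gradient-descent update
W2 -= alpha * dW2; b2 -= alpha * db2
W1 -= alpha * dW1; b1 -= alpha * db1

# Recompute the loss: one small step should not increase it
A2_new = sigmoid(W2 @ sigmoid(W1 @ X + b1) + b2)
L_after = -np.mean(y * np.log(A2_new) + (1 - y) * np.log(1 - A2_new))
print(L, L_after)
```

Note how Step 4 reuses d2, computed for the layer above: that reuse is what makes backpropagation efficient.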

4. Complete Worked Example

Network: 1 input (x=2), 1 hidden neuron with sigmoid, 1 output with sigmoid. Weights: w₁=0.5 (input→hidden), w₂=0.8 (hidden→output). Biases = 0. True label y=1. Learning rate α=0.1.

Forward Pass:

z₁ = 0.5 × 2 = 1.0 → a₁ = sigmoid(1.0) ≈ 0.731

z₂ = 0.8 × 0.731 = 0.585 → ŷ = sigmoid(0.585) ≈ 0.642

Loss (binary cross-entropy with y = 1, natural log) = −ln(0.642) ≈ 0.443

Backward Pass:

δ₂ = ∂L/∂z₂ = ŷ − y = 0.642 − 1 ≈ −0.358 (sigmoid + cross-entropy simplifies to this) → ∂L/∂w₂ = δ₂ × a₁ ≈ −0.262

δ₁ = δ₂ × w₂ × sigmoid'(z₁) ≈ −0.056 → ∂L/∂w₁ = δ₁ × x ≈ −0.113

Weight Updates:

w₂ := 0.8 + 0.026 = 0.826  |  w₁ := 0.5 + 0.011 = 0.511

Both weights increased — pushing prediction closer to 1 (the true label), reducing the loss.
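The worked example is easy to reproduce in a few lines of Python, which also confirms the rounded values above (the code assumes natural log and the sigmoid + cross-entropy simplification δ = ŷ − y):

```python
import math

sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))

x, y = 2.0, 1.0
w1, w2, alpha = 0.5, 0.8, 0.1

# Forward pass
z1 = w1 * x;   a1 = sigmoid(z1)
z2 = w2 * a1;  y_hat = sigmoid(z2)
loss = -math.log(y_hat)                 # cross-entropy with y = 1

# Backward pass (sigmoid + cross-entropy: dL/dz = y_hat - y)
d2 = y_hat - y
dw2 = d2 * a1
d1 = d2 * w2 * a1 * (1 - a1)            # sigmoid'(z1) = a1 * (1 - a1)
dw1 = d1 * x

# Weight updates
w2_new = w2 - alpha * dw2
w1_new = w1 - alpha * dw1

print(round(a1, 3), round(y_hat, 3), round(loss, 3))   # 0.731 0.642 0.443
print(round(dw2, 3), round(dw1, 3))                    # -0.262 -0.113
print(round(w2_new, 3), round(w1_new, 3))              # 0.826 0.511
```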

5. Vanishing & Exploding Gradients

Vanishing Gradients

With sigmoid, the derivative is at most 0.25 (tanh's is at most 1, and usually far smaller). After 10 sigmoid layers the gradient is scaled by at most 0.25¹⁰ ≈ 0.000001 — essentially zero. Early layers learn nothing.
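A quick check of the multiplication effect, assuming the best possible case where every layer sits at z = 0 (where sigmoid' peaks at 0.25):

```python
import math

sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
dsigmoid = lambda z: sigmoid(z) * (1 - sigmoid(z))   # maximum value 0.25, at z = 0

# Gradient scale accumulated across 10 sigmoid layers, best case
grad_scale = 1.0
for layer in range(10):
    grad_scale *= dsigmoid(0.0)

print(grad_scale)   # 0.25**10 ≈ 9.5e-07
```

In a real network z is rarely exactly 0, so the actual product is even smaller.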

Solutions: Use ReLU. Use Batch Normalisation. Use residual connections (ResNets). Use Xavier/He initialisation.

Exploding Gradients

With large weights, gradients multiply to very large values, causing weight updates to overshoot. Common in RNNs.

Solutions: Gradient clipping. Weight regularisation. Better initialisation.
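Gradient clipping by global norm can be sketched in a few lines (the gradient values here are made up for illustration):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    # Scale all gradients down together if their combined norm exceeds max_norm;
    # scaling jointly preserves the direction of the overall update
    total = np.sqrt(sum(np.sum(g**2) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads]

grads = [np.array([3.0, 4.0]), np.array([12.0])]     # global norm = 13
clipped = clip_by_global_norm(grads, max_norm=1.0)
print(np.sqrt(sum(np.sum(g**2) for g in clipped)))   # ≈ 1.0
```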

6. Optimisers — How Gradients Are Used

  • SGD: simple gradient descent with a fixed learning rate. When to use: simple problems, full control.
  • SGD + Momentum: accelerates updates in consistent gradient directions. When to use: large-scale training, computer vision.
  • RMSprop: adapts the learning rate using recent gradient magnitudes. When to use: RNNs, non-stationary problems.
  • Adam: combines momentum and RMSprop. When to use: the default choice; works well for most problems.
  • AdamW: Adam with decoupled weight decay. When to use: Transformers, large language models.
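The update rules behind two of these optimisers can be sketched in NumPy on a toy quadratic loss (the loss function, step counts, and hyperparameters are illustrative assumptions, not recommendations):

```python
import numpy as np

# Toy 2-D quadratic loss L(w) = 0.5 * w^T A w, with gradient A @ w
A = np.array([[3.0, 0.0], [0.0, 1.0]])
grad = lambda w: A @ w

def sgd_momentum(w, steps=200, lr=0.1, beta=0.9):
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v + grad(w)       # velocity builds up in consistent directions
        w = w - lr * v
    return w

def adam(w, steps=200, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    m = np.zeros_like(w); v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w)
        m = b1 * m + (1 - b1) * g        # momentum term (first moment)
        v = b2 * v + (1 - b2) * g**2     # RMSprop term (second moment)
        m_hat = m / (1 - b1**t)          # bias correction for zero init
        v_hat = v / (1 - b2**t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

w0 = np.array([2.0, 2.0])
print(sgd_momentum(w0.copy()), adam(w0.copy()))  # both approach the minimum at (0, 0)
```

Note that Adam's per-coordinate division by √v̂ is exactly the "combines momentum and RMSprop" line in the table above.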

7. Common Mistakes Students Make

  • Forgetting to zero gradients between batches (PyTorch): Gradients accumulate by default. Always call optimizer.zero_grad() before each forward pass.
  • Confusing backpropagation with gradient descent: Backpropagation computes gradients. Gradient descent uses them to update weights. Two separate steps.
  • Using sigmoid in deep hidden layers: Causes vanishing gradients. Use ReLU in hidden layers, sigmoid only at the binary output.
  • Not monitoring training and validation loss: Always plot both loss curves. They tell you whether the model is converging, diverging, or overfitting.

8. Frequently Asked Questions

Does backpropagation always find the global minimum?

No — it finds a local minimum, not necessarily the global minimum. For deep neural networks, the loss surface is non-convex with many local minima. In practice, most local minima in deep networks have similar loss values to the global minimum, so this is less of a problem than it theoretically seems.

What is the difference between batch, mini-batch, and stochastic gradient descent?

Batch GD uses the entire training set per update — stable but slow. SGD uses one example — fast but very noisy. Mini-batch GD uses a small batch (32–256 examples) — the standard. Most “SGD” implementations in frameworks actually use mini-batch GD.
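The three variants differ only in how many examples feed each update. A sketch on a hypothetical one-parameter linear regression (the data, learning rate, and epoch count are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: y = 3x + small noise; loss = mean squared error
X = rng.normal(size=200)
y = 3.0 * X + 0.1 * rng.normal(size=200)

def train(batch_size, epochs=20, lr=0.1):
    w = 0.0
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)                  # shuffle each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            xb, yb = X[idx], y[idx]
            g = np.mean(2 * (w * xb - yb) * xb)     # dMSE/dw on this batch
            w -= lr * g
    return w

w_batch = train(batch_size=200)   # batch GD: 1 update per epoch
w_mini  = train(batch_size=32)    # mini-batch GD: the standard
w_sgd   = train(batch_size=1)     # stochastic GD: 1 example per update
print(w_batch, w_mini, w_sgd)     # all land near the true slope, 3.0
```

The only change between the three runs is `batch_size`, which is why frameworks expose a single optimizer covering all of them.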

Next Steps