Backpropagation

How Neural Networks Learn — Step-by-Step Derivation

Last Updated: March 2026

📌 Key Takeaways

  • Definition: Backpropagation computes the gradient of the loss function with respect to every weight in the network using the chain rule.
  • Foundation: The chain rule of calculus — derivatives of composite functions.
  • Each training step: a forward pass (compute output & loss), a backward pass (compute gradients), then a weight update.
  • Gradient flow: Gradients flow backward from output layer to input layer.
  • Vanishing gradients: Gradients become extremely small in deep networks with sigmoid/tanh — use ReLU to mitigate.

1. Intuition — Why Do We Need Backpropagation?

A neural network has thousands — sometimes billions — of weights. To train it, we need to know: for each weight, how much should we increase or decrease it to reduce the loss?

This requires computing ∂L/∂w for each weight. For the final layer, this is straightforward. But for earlier layers, the weight connects to the loss through many intermediate computations.

Backpropagation solves this by using the chain rule to efficiently compute all gradients in a single backward pass. It starts at the output layer and propagates gradients backward — each layer’s gradient is computed using the gradient already computed for the next layer.

2. The Chain Rule of Calculus

If y = f(g(x)), then: dy/dx = (dy/dg) × (dg/dx)

In neural networks, to find ∂L/∂w:

∂L/∂w = (∂L/∂a) × (∂a/∂z) × (∂z/∂w)

Where z = w·x + b (pre-activation), a = activation(z), L = loss(a). Backpropagation chains these together for every weight in every layer.
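This decomposition can be checked numerically. A minimal sketch, assuming squared-error loss and a sigmoid activation (chosen only for illustration), compares the chained analytic gradient against a finite-difference estimate:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Single neuron: z = w*x + b, a = sigmoid(z), L = (a - y)^2
def loss(w, x=2.0, b=0.0, y=1.0):
    a = sigmoid(w * x + b)
    return (a - y) ** 2

w, x, b, y = 0.5, 2.0, 0.0, 1.0
z = w * x + b
a = sigmoid(z)

# Chain rule: dL/dw = dL/da * da/dz * dz/dw
dL_da = 2 * (a - y)          # derivative of squared error w.r.t. a
da_dz = a * (1 - a)          # sigmoid'(z)
dz_dw = x                    # derivative of w*x + b w.r.t. w
grad_chain = dL_da * da_dz * dz_dw

# Numerical check with a central finite difference
eps = 1e-6
grad_numeric = (loss(w + eps) - loss(w - eps)) / (2 * eps)

print(grad_chain, grad_numeric)  # the two should agree closely
```

The finite-difference estimate knows nothing about the chain rule, so agreement between the two numbers confirms the decomposition.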

3. The Backpropagation Algorithm

Step 1 — Forward Pass: Compute and store Z⁽ˡ⁾ = W⁽ˡ⁾A⁽ˡ⁻¹⁾ + b⁽ˡ⁾ and A⁽ˡ⁾ = activation(Z⁽ˡ⁾) for each layer l. Store all values — they are needed in the backward pass.

Step 2 — Compute Loss: L = loss(A⁽ᴸ⁾, y)

Step 3 — Backward Pass (output layer):

δ⁽ᴸ⁾ = ∂L/∂A⁽ᴸ⁾ ⊙ activation'(Z⁽ᴸ⁾)   (⊙ denotes the element-wise product)

∂L/∂W⁽ᴸ⁾ = δ⁽ᴸ⁾ × (A⁽ᴸ⁻¹⁾)ᵀ

Step 4 — Backward Pass (hidden layers):

δ⁽ˡ⁾ = ((W⁽ˡ⁺¹⁾)ᵀ δ⁽ˡ⁺¹⁾) ⊙ activation'(Z⁽ˡ⁾)

∂L/∂W⁽ˡ⁾ = δ⁽ˡ⁾ × (A⁽ˡ⁻¹⁾)ᵀ,   ∂L/∂b⁽ˡ⁾ = δ⁽ˡ⁾ (summed over the batch)

Step 5 — Weight Update:

W⁽ˡ⁾ := W⁽ˡ⁾ − α × ∂L/∂W⁽ˡ⁾
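The five steps can be sketched end-to-end in NumPy. This is a minimal illustration, not a production implementation; the layer sizes, random data, and sigmoid/cross-entropy pairing are assumptions made for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny network: 3 inputs -> 4 hidden (sigmoid) -> 1 output (sigmoid)
W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))
alpha = 0.1

X = rng.normal(size=(3, 5))                          # 5 examples as columns
y = rng.integers(0, 2, size=(1, 5)).astype(float)    # random binary labels

# Step 1 — forward pass (store Z and A for the backward pass)
Z1 = W1 @ X + b1;  A1 = sigmoid(Z1)
Z2 = W2 @ A1 + b2; A2 = sigmoid(Z2)

# Step 2 — loss (binary cross-entropy, averaged over the batch)
m = X.shape[1]
L = -np.mean(y * np.log(A2) + (1 - y) * np.log(1 - A2))

# Step 3 — output layer: sigmoid + cross-entropy simplifies delta to A2 - y
d2 = (A2 - y) / m
dW2 = d2 @ A1.T
db2 = d2.sum(axis=1, keepdims=True)

# Step 4 — hidden layer: propagate delta backward through W2
d1 = (W2.T @ d2) * A1 * (1 - A1)   # element-wise product with sigmoid'(Z1)
dW1 = d1 @ X.T
db1 = d1.sum(axis=1, keepdims=True)

# Step 5 — gradient-descent update
W2 -= alpha * dW2; b2 -= alpha * db2
W1 -= alpha * dW1; b1 -= alpha * db1

# Recompute the loss: one small step should not increase it
A2_new = sigmoid(W2 @ sigmoid(W1 @ X + b1) + b2)
L_after = -np.mean(y * np.log(A2_new) + (1 - y) * np.log(1 - A2_new))
print(L, L_after)
```

Note how Step 4 reuses d2, computed for the layer above: that reuse is what makes backpropagation efficient.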

4. Complete Worked Example

Network: 1 input (x=2), 1 hidden neuron with sigmoid, 1 output with sigmoid. Weights: w₁=0.5 (input→hidden), w₂=0.8 (hidden→output). Biases = 0. True label y=1. Learning rate α=0.1.

Forward Pass:

z₁ = 0.5 × 2 = 1.0 → a₁ = sigmoid(1.0) ≈ 0.731

z₂ = 0.8 × 0.731 = 0.585 → ŷ = sigmoid(0.585) ≈ 0.642

Loss (binary cross-entropy with y = 1, natural log) = −ln(0.642) ≈ 0.443

Backward Pass:

δ₂ = ∂L/∂z₂ = ŷ − y = 0.642 − 1 ≈ −0.358 (sigmoid + cross-entropy simplifies to this) → ∂L/∂w₂ = δ₂ × a₁ ≈ −0.262

δ₁ = δ₂ × w₂ × sigmoid'(z₁) ≈ −0.056 → ∂L/∂w₁ = δ₁ × x ≈ −0.113

Weight Updates:

w₂ := 0.8 + 0.026 = 0.826  |  w₁ := 0.5 + 0.011 = 0.511

Both weights increased — pushing prediction closer to 1 (the true label), reducing the loss.
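The worked example is easy to reproduce in a few lines of Python, which also confirms the rounded values above (the code assumes natural log and the sigmoid + cross-entropy simplification δ = ŷ − y):

```python
import math

sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))

x, y = 2.0, 1.0
w1, w2, alpha = 0.5, 0.8, 0.1

# Forward pass
z1 = w1 * x;   a1 = sigmoid(z1)
z2 = w2 * a1;  y_hat = sigmoid(z2)
loss = -math.log(y_hat)                 # cross-entropy with y = 1

# Backward pass (sigmoid + cross-entropy: dL/dz = y_hat - y)
d2 = y_hat - y
dw2 = d2 * a1
d1 = d2 * w2 * a1 * (1 - a1)            # sigmoid'(z1) = a1 * (1 - a1)
dw1 = d1 * x

# Weight updates
w2_new = w2 - alpha * dw2
w1_new = w1 - alpha * dw1

print(round(a1, 3), round(y_hat, 3), round(loss, 3))   # 0.731 0.642 0.443
print(round(dw2, 3), round(dw1, 3))                    # -0.262 -0.113
print(round(w2_new, 3), round(w1_new, 3))              # 0.826 0.511
```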

5. Vanishing & Exploding Gradients

Vanishing Gradients

With sigmoid, the derivative is at most 0.25 (tanh's is at most 1, and usually far smaller). After 10 sigmoid layers the gradient is scaled by at most 0.25¹⁰ ≈ 0.000001 — essentially zero. Early layers learn nothing.
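A quick check of the multiplication effect, assuming the best possible case where every layer sits at z = 0 (where sigmoid' peaks at 0.25):

```python
import math

sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
dsigmoid = lambda z: sigmoid(z) * (1 - sigmoid(z))   # maximum value 0.25, at z = 0

# Gradient scale accumulated across 10 sigmoid layers, best case
grad_scale = 1.0
for layer in range(10):
    grad_scale *= dsigmoid(0.0)

print(grad_scale)   # 0.25**10 ≈ 9.5e-07
```

In a real network z is rarely exactly 0, so the actual product is even smaller.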

Solutions: Use ReLU. Use Batch Normalisation. Use residual connections (ResNets). Use Xavier/He initialisation.

Exploding Gradients

With large weights, gradients multiply to very large values, causing weight updates to overshoot. Common in RNNs.

Solutions: Gradient clipping. Weight regularisation. Better initialisation.
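Gradient clipping by global norm can be sketched in a few lines (the gradient values here are made up for illustration):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    # Scale all gradients down together if their combined norm exceeds max_norm;
    # scaling jointly preserves the direction of the overall update
    total = np.sqrt(sum(np.sum(g**2) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads]

grads = [np.array([3.0, 4.0]), np.array([12.0])]     # global norm = 13
clipped = clip_by_global_norm(grads, max_norm=1.0)
print(np.sqrt(sum(np.sum(g**2) for g in clipped)))   # ≈ 1.0
```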

6. Optimisers — How Gradients Are Used

  • SGD: simple gradient descent with a fixed learning rate. When to use: simple problems, full control.
  • SGD + Momentum: accelerates updates in consistent gradient directions. When to use: large-scale training, computer vision.
  • RMSprop: adapts the learning rate using recent gradient magnitudes. When to use: RNNs, non-stationary problems.
  • Adam: combines momentum and RMSprop. When to use: the default choice; works well for most problems.
  • AdamW: Adam with decoupled weight decay. When to use: Transformers, large language models.
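The update rules behind two of these optimisers can be sketched in NumPy on a toy quadratic loss (the loss function, step counts, and hyperparameters are illustrative assumptions, not recommendations):

```python
import numpy as np

# Toy 2-D quadratic loss L(w) = 0.5 * w^T A w, with gradient A @ w
A = np.array([[3.0, 0.0], [0.0, 1.0]])
grad = lambda w: A @ w

def sgd_momentum(w, steps=200, lr=0.1, beta=0.9):
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v + grad(w)       # velocity builds up in consistent directions
        w = w - lr * v
    return w

def adam(w, steps=200, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    m = np.zeros_like(w); v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w)
        m = b1 * m + (1 - b1) * g        # momentum term (first moment)
        v = b2 * v + (1 - b2) * g**2     # RMSprop term (second moment)
        m_hat = m / (1 - b1**t)          # bias correction for zero init
        v_hat = v / (1 - b2**t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

w0 = np.array([2.0, 2.0])
print(sgd_momentum(w0.copy()), adam(w0.copy()))  # both approach the minimum at (0, 0)
```

Note that Adam's per-coordinate division by √v̂ is exactly the "combines momentum and RMSprop" line in the table above.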

7. Common Mistakes Students Make

  • Forgetting to zero gradients between batches (PyTorch): Gradients accumulate by default. Always call optimizer.zero_grad() before each forward pass.
  • Confusing backpropagation with gradient descent: Backpropagation computes gradients. Gradient descent uses them to update weights. Two separate steps.
  • Using sigmoid in deep hidden layers: Causes vanishing gradients. Use ReLU in hidden layers, sigmoid only at the binary output.
  • Not monitoring training and validation loss: Always plot both loss curves. They tell you whether the model is converging, diverging, or overfitting.

8. Frequently Asked Questions

Does backpropagation always find the global minimum?

No — it finds a local minimum, not necessarily the global minimum. For deep neural networks, the loss surface is non-convex with many local minima. In practice, most local minima in deep networks have similar loss values to the global minimum, so this is less of a problem than it theoretically seems.

What is the difference between batch, mini-batch, and stochastic gradient descent?

Batch GD uses the entire training set per update — stable but slow. SGD uses one example — fast but very noisy. Mini-batch GD uses a small batch (32–256 examples) — the standard. Most “SGD” implementations in frameworks actually use mini-batch GD.
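The three variants differ only in how many examples feed each update. A sketch on a hypothetical one-parameter linear regression (the data, learning rate, and epoch count are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: y = 3x + small noise; loss = mean squared error
X = rng.normal(size=200)
y = 3.0 * X + 0.1 * rng.normal(size=200)

def train(batch_size, epochs=20, lr=0.1):
    w = 0.0
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)                  # shuffle each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            xb, yb = X[idx], y[idx]
            g = np.mean(2 * (w * xb - yb) * xb)     # dMSE/dw on this batch
            w -= lr * g
    return w

w_batch = train(batch_size=200)   # batch GD: 1 update per epoch
w_mini  = train(batch_size=32)    # mini-batch GD: the standard
w_sgd   = train(batch_size=1)     # stochastic GD: 1 example per update
print(w_batch, w_mini, w_sgd)     # all land near the true slope, 3.0
```

The only change between the three runs is `batch_size`, which is why frameworks expose a single optimizer covering all of them.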

Next Steps