Batch Normalisation & Dropout
Regularisation Techniques in Deep Learning — For Engineering Students
Last Updated: March 2026
📌 Key Takeaways
- Batch Normalisation (BN): Normalises layer inputs within each mini-batch. Reduces internal covariate shift, speeds up training, and allows higher learning rates.
- Dropout: Randomly deactivates neurons during training. Prevents co-adaptation, reduces overfitting — acts like training an ensemble of networks.
- Placement: BN goes between Linear/Conv layer and activation. Dropout goes after activation (usually in fully connected layers).
- Dropout rate: 0.2–0.5 for dense layers; 0.1–0.2 for convolutional layers.
- At inference time: Dropout is turned OFF. BN uses running statistics from training.
- Use both together for most deep networks — they solve different problems.
1. The Problem — Training Deep Networks is Hard
Training deep neural networks (10+ layers) faces two major obstacles:
- Internal Covariate Shift: As training progresses, the distribution of each layer’s inputs changes because the weights of the previous layers change. This makes each layer constantly readjust to a new input distribution — slowing learning. Batch Normalisation directly addresses this.
- Overfitting: Deep networks have millions of parameters and can memorise training data. The model learns spurious correlations and performs poorly on new data. Dropout addresses this by making the network more robust.
2. Batch Normalisation — What & Why
Batch Normalisation (Ioffe & Szegedy, 2015) normalises the activations of each layer within a mini-batch so they have approximately zero mean and unit variance, then applies learnable scale (γ) and shift (β) parameters.
Benefits:
- Faster training: Higher learning rates can be used safely — gradients are more stable.
- Reduces sensitivity to initialisation: Poor weight initialisation is less catastrophic with BN.
- Mild regularisation: The mini-batch statistics introduce noise that acts like regularisation — often reduces the need for dropout.
- Enables deeper networks: BN made training very deep networks (ResNet with 152 layers) practical.
3. Batch Norm Formula
For a mini-batch B = {x₁, x₂, …, xₘ} of activations:
Step 1 — Mini-batch mean: μ_B = (1/m) Σxᵢ
Step 2 — Mini-batch variance: σ²_B = (1/m) Σ(xᵢ − μ_B)²
Step 3 — Normalise: x̂ᵢ = (xᵢ − μ_B) / √(σ²_B + ε)
Step 4 — Scale and shift: yᵢ = γ × x̂ᵢ + β
| Symbol | Meaning |
|---|---|
| μ_B | Mean of the mini-batch |
| σ²_B | Variance of the mini-batch |
| ε | Small constant (~1e-5) to prevent division by zero |
| x̂ᵢ | Normalised activation (zero mean, unit variance) |
| γ (gamma) | Learnable scale parameter — allows network to undo normalisation |
| β (beta) | Learnable shift parameter — allows network to shift output |
The γ and β parameters are learned during backpropagation. This means the network can learn to undo batch normalisation if that is beneficial — giving it the flexibility to represent the identity function if needed.
At inference time: Instead of mini-batch statistics, BN uses running averages of μ and σ² accumulated during training. This makes inference deterministic regardless of batch size.
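The four steps above can be checked numerically. A minimal NumPy sketch (the batch shape and values are illustrative, with γ and β at their initial values of 1 and 0):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(32, 4))  # mini-batch: m=32 examples, 4 features
eps = 1e-5

# Steps 1-2: mini-batch mean and variance, per feature
mu = x.mean(axis=0)
var = x.var(axis=0)

# Step 3: normalise
x_hat = (x - mu) / np.sqrt(var + eps)

# Step 4: scale and shift (gamma=1, beta=0 — their values at initialisation)
gamma, beta = np.ones(4), np.zeros(4)
y = gamma * x_hat + beta

print(y.mean(axis=0))  # ~0 for every feature
print(y.std(axis=0))   # ~1 for every feature
```

Once γ and β are learned away from (1, 0), the output can take on any mean and variance — including the original ones, which is how BN can represent the identity.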
4. Where to Place Batch Norm
The original paper recommends placing BN before the activation function:
Linear/Conv → Batch Norm → Activation (ReLU) → Next Layer
However, some practitioners place it after the activation and report similar or better results; neither placement has proven universally superior, and the original pre-activation order remains the more common convention. Either placement works — be consistent within your architecture.
In CNNs: BN is applied per channel — for a feature map of size (H × W × C), statistics are computed over H × W × batch_size for each of C channels.
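A short PyTorch sketch of the per-channel behaviour (note that PyTorch uses channels-first (N, C, H, W) rather than the (H × W × C) layout written above; the sizes here are illustrative):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(num_features=3)      # one (gamma, beta) pair per channel
x = torch.randn(8, 3, 28, 28) * 4 + 2    # (batch, C, H, W) — channels-first

bn.train()
y = bn(x)

# Statistics are pooled over batch * H * W = 8 * 28 * 28 values per channel
print(y.mean(dim=(0, 2, 3)))             # ~0 for each of the 3 channels
print(bn.weight.shape, bn.bias.shape)    # gamma and beta: torch.Size([3]) each
```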
5. Dropout — What & Why
Dropout (Srivastava et al., 2014) is a regularisation technique where, during each training step, each neuron is independently set to zero with probability p (the dropout rate).
The intuition behind dropout is the ensemble interpretation: by randomly dropping neurons, each training step trains a different sub-network. At inference, you effectively average predictions over an exponential number of different sub-networks — this ensemble averaging reduces variance and overfitting, just like Random Forest averages many trees.
6. How Dropout Works
During training: For each neuron in a dropout layer, sample a Bernoulli random variable with probability (1−p) of being 1. Multiply the neuron’s output by this value — neurons that get 0 are “dropped” for this step.
r_j ~ Bernoulli(1−p)
ã_j = r_j × a_j
During inference: All neurons are active and no mask is applied. To keep the expected activation magnitude consistent between training and inference, the standard implementation in PyTorch and TensorFlow is inverted dropout: the surviving activations are scaled up by 1/(1−p) during training, so no scaling at all is needed at inference time. (The original formulation instead left training outputs unscaled and multiplied by (1−p) at inference; the two are equivalent in expectation.)
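A minimal sketch of inverted dropout as PyTorch implements it — kept activations are scaled by 1/(1−p) during training, so the expected training output already matches the untouched inference output (values illustrative):

```python
import torch

torch.manual_seed(0)
p = 0.5
a = torch.ones(10000)

# Inverted dropout: drop with probability p, scale survivors by 1/(1-p)
# during training, so no adjustment is needed at inference.
mask = (torch.rand_like(a) > p).float()
a_train = mask * a / (1 - p)
print(a_train.mean())           # ~1.0 — expectation matches the inference output

# nn.Dropout applies the same scaling internally
drop = torch.nn.Dropout(p=p)
drop.train()
print(drop(a).mean())           # ~1.0 in train mode as well
drop.eval()
print(torch.equal(drop(a), a))  # True — identity at inference
```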
Choosing Dropout Rate p
| Layer Type | Typical Dropout Rate | Notes |
|---|---|---|
| Fully connected (dense) | 0.3 – 0.5 | Higher rates for large layers |
| Convolutional | 0.1 – 0.25 | Lower rates — conv layers share weights and are less prone to overfitting |
| Recurrent (LSTM/GRU) | 0.2 – 0.3 | Apply to inputs and outputs; for recurrent connections, use variational dropout (same mask at every time step), not naive per-step dropout |
| Transformers | 0.1 | Applied to attention weights and feed-forward layers |
7. Batch Norm vs Dropout — Comparison
| Feature | Batch Normalisation | Dropout |
|---|---|---|
| Primary purpose | Training stability, speed | Regularisation, prevent overfitting |
| How it works | Normalises activations to zero mean/unit variance | Randomly zeros out neurons |
| Learnable params | Yes — γ and β per feature | No |
| At inference | Uses running statistics (deterministic) | Disabled (all neurons active) |
| Placement | Between linear layer and activation | After activation, in dense layers |
| Works with small batches? | No — needs reasonable batch size (≥16) | Yes |
| Use together? | Yes — they solve different problems | Yes — use both in most deep networks |
8. Python Code
import torch
import torch.nn as nn
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
# ============================================================
# PyTorch -- Batch Norm + Dropout in a CNN
# ============================================================
class ConvNetWithBNDropout(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            # Conv Block 1
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),    # BN after conv, before activation
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
            # Conv Block 2
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512),
            nn.BatchNorm1d(512),
            nn.ReLU(),
            nn.Dropout(p=0.5),     # Dropout after activation in dense layers
            nn.Linear(512, 128),
            nn.ReLU(),
            nn.Dropout(p=0.3),
            nn.Linear(128, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x)

model_pt = ConvNetWithBNDropout()
print(model_pt)

# KEY: Set train/eval mode for dropout and BN to work correctly
model_pt.train()   # Enables dropout + uses mini-batch stats for BN
model_pt.eval()    # Disables dropout + uses running stats for BN
# ============================================================
# TensorFlow/Keras -- Batch Norm + Dropout
# ============================================================
def build_model_keras():
    return keras.Sequential([
        # Conv Block
        layers.Conv2D(32, (3, 3), padding='same', input_shape=(28, 28, 1)),
        layers.BatchNormalization(),
        layers.Activation('relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), padding='same'),
        layers.BatchNormalization(),
        layers.Activation('relu'),
        layers.MaxPooling2D((2, 2)),
        # Dense Block
        layers.Flatten(),
        layers.Dense(256),
        layers.BatchNormalization(),
        layers.Activation('relu'),
        layers.Dropout(0.5),       # active only when training=True
        layers.Dense(10, activation='softmax')
    ])

keras_model = build_model_keras()
keras_model.compile(optimizer='adam',
                    loss='sparse_categorical_crossentropy',
                    metrics=['accuracy'])
keras_model.summary()

# Note: In Keras, dropout and BN are automatically controlled by the
# 'training' argument -- fit() sets it to True, evaluate()/predict() to False
9. Common Mistakes Students Make
- Forgetting to set model.eval() during inference in PyTorch: In PyTorch, dropout and batch norm behave differently in train vs eval mode. Always call model.eval() before inference and model.train() before training. Forgetting this causes inconsistent predictions.
- Using dropout with batch normalisation carelessly: Dropout after BN can interact poorly — dropout changes the variance of activations, which conflicts with BN’s normalisation. A common practice: use BN instead of dropout in convolutional layers; use dropout only in the final fully connected layers.
- Using batch norm with very small batch sizes: BN computes statistics over the batch. With fewer than 8–16 samples per batch, the batch statistics are too noisy to be useful. Use Layer Normalisation or Group Normalisation instead for small batches.
- Applying dropout during inference: Dropout must be disabled at inference time. In Keras this is handled automatically. In PyTorch, you must explicitly call model.eval().
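The small-batch pitfall can be seen directly: Layer Norm's per-example statistics make its output independent of batch size, while Batch Norm's per-batch statistics do not. A PyTorch sketch with illustrative sizes:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
big = torch.randn(64, 16)
small = big[:2]   # the same first two examples, but in a batch of 2

# LayerNorm: per-example statistics -> output independent of batch size
ln = nn.LayerNorm(16)
same = torch.allclose(ln(big)[:2], ln(small))
print(same)   # True

# BatchNorm (train mode): per-batch statistics -> the same two examples
# normalise differently depending on who else is in the batch
bn = nn.BatchNorm1d(16).train()
diff = not torch.allclose(bn(big)[:2], bn(small))
print(diff)   # True
```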
10. Frequently Asked Questions
Should I use batch norm before or after the activation function?
The original paper places BN before activation (Conv → BN → ReLU). Some practitioners place it after activation (Conv → ReLU → BN) and report similar or better results, but there is no definitive consensus. For most practical purposes, either placement works — just be consistent throughout your architecture. For residual networks (ResNet), BN before activation is the standard order.
What is Layer Normalisation and how is it different from Batch Normalisation?
Batch Normalisation normalises across the batch dimension — statistics are computed over all examples in the mini-batch for each feature. Layer Normalisation normalises across the feature dimension — statistics are computed over all features for each individual example. Layer Norm does not depend on batch size and works well for small batches and recurrent networks. It is the standard normalisation technique in Transformers (BERT, GPT use Layer Norm, not Batch Norm).
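The axis difference can be written out by hand. A sketch that computes both normalisations manually on a (batch, features) tensor and checks the Layer Norm version against PyTorch's built-in (γ=1, β=0 at initialisation; sizes illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 8)   # (batch, features)
eps = 1e-5

# Batch Norm: statistics over the batch dimension (dim=0), one pair per feature
bn_mu = x.mean(dim=0)
bn_var = x.var(dim=0, unbiased=False)
x_bn = (x - bn_mu) / torch.sqrt(bn_var + eps)

# Layer Norm: statistics over the feature dimension (dim=1), one pair per example
ln_mu = x.mean(dim=1, keepdim=True)
ln_var = x.var(dim=1, unbiased=False, keepdim=True)
x_ln = (x - ln_mu) / torch.sqrt(ln_var + eps)

# Matches PyTorch's built-in LayerNorm at initialisation
print(torch.allclose(x_ln, nn.LayerNorm(8)(x), atol=1e-5))  # True
```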