Batch Normalisation & Dropout
Regularisation Techniques in Deep Learning — For Engineering Students
Last Updated: March 2026
📌 Key Takeaways
- Batch Normalisation (BN): Normalises layer inputs within each mini-batch. Reduces internal covariate shift, speeds up training, and allows higher learning rates.
- Dropout: Randomly deactivates neurons during training. Prevents co-adaptation, reduces overfitting — acts like training an ensemble of networks.
- Placement: BN goes between Linear/Conv layer and activation. Dropout goes after activation (usually in fully connected layers).
- Dropout rate: 0.2–0.5 for dense layers; 0.1–0.2 for convolutional layers.
- At inference time: Dropout is turned OFF. BN uses running statistics from training.
- Use both together for most deep networks — they solve different problems.
1. The Problem — Training Deep Networks is Hard
Training deep neural networks (10+ layers) faces two major obstacles:
- Internal Covariate Shift: As training progresses, the distribution of each layer’s inputs changes because the weights of the previous layers change. This makes each layer constantly readjust to a new input distribution — slowing learning. Batch Normalisation directly addresses this.
- Overfitting: Deep networks have millions of parameters and can memorise training data. The model learns spurious correlations and performs poorly on new data. Dropout addresses this by making the network more robust.
2. Batch Normalisation — What & Why
Batch Normalisation (Ioffe & Szegedy, 2015) normalises the activations of each layer within a mini-batch so they have approximately zero mean and unit variance, then applies learnable scale (γ) and shift (β) parameters.
Benefits:
- Faster training: Higher learning rates can be used safely — gradients are more stable.
- Reduces sensitivity to initialisation: Poor weight initialisation is less catastrophic with BN.
- Mild regularisation: The mini-batch statistics introduce noise that acts like regularisation — often reduces the need for dropout.
- Enables deeper networks: BN made training very deep networks (ResNet with 152 layers) practical.
3. Batch Norm Formula
For a mini-batch B = {x₁, x₂, …, xₘ} of activations:
Step 1 — Mini-batch mean: μ_B = (1/m) Σxᵢ
Step 2 — Mini-batch variance: σ²_B = (1/m) Σ(xᵢ − μ_B)²
Step 3 — Normalise: x̂ᵢ = (xᵢ − μ_B) / √(σ²_B + ε)
Step 4 — Scale and shift: yᵢ = γ × x̂ᵢ + β
| Symbol | Meaning |
|---|---|
| μ_B | Mean of the mini-batch |
| σ²_B | Variance of the mini-batch |
| ε | Small constant (~1e-5) to prevent division by zero |
| x̂ᵢ | Normalised activation (zero mean, unit variance) |
| γ (gamma) | Learnable scale parameter — allows network to undo normalisation |
| β (beta) | Learnable shift parameter — allows network to shift output |
The γ and β parameters are learned during backpropagation. This means the network can learn to undo batch normalisation if that is beneficial — giving it the flexibility to represent the identity function if needed.
At inference time: Instead of mini-batch statistics, BN uses running averages of μ and σ² accumulated during training. This makes inference deterministic regardless of batch size.
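The four steps above can be checked numerically. A minimal NumPy sketch (the batch shape and values are illustrative, with γ and β at their initial values of 1 and 0):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(32, 4))  # mini-batch: m=32 examples, 4 features
eps = 1e-5

# Steps 1-2: mini-batch mean and variance, per feature
mu = x.mean(axis=0)
var = x.var(axis=0)

# Step 3: normalise
x_hat = (x - mu) / np.sqrt(var + eps)

# Step 4: scale and shift (gamma=1, beta=0 — their values at initialisation)
gamma, beta = np.ones(4), np.zeros(4)
y = gamma * x_hat + beta

print(y.mean(axis=0))  # ~0 for every feature
print(y.std(axis=0))   # ~1 for every feature
```

Once γ and β are learned away from (1, 0), the output can take on any mean and variance — including the original ones, which is how BN can represent the identity.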
4. Where to Place Batch Norm
The original paper recommends placing BN before the activation function:
Linear/Conv → Batch Norm → Activation (ReLU) → Next Layer
However, some practitioners place it after the activation and report similar or better results; neither placement has proven universally superior, and the original pre-activation order remains the more common convention. Either placement works — be consistent within your architecture.
In CNNs: BN is applied per channel — for a feature map of size (H × W × C), statistics are computed over H × W × batch_size for each of C channels.
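A short PyTorch sketch of the per-channel behaviour (note that PyTorch uses channels-first (N, C, H, W) rather than the (H × W × C) layout written above; the sizes here are illustrative):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(num_features=3)      # one (gamma, beta) pair per channel
x = torch.randn(8, 3, 28, 28) * 4 + 2    # (batch, C, H, W) — channels-first

bn.train()
y = bn(x)

# Statistics are pooled over batch * H * W = 8 * 28 * 28 values per channel
print(y.mean(dim=(0, 2, 3)))             # ~0 for each of the 3 channels
print(bn.weight.shape, bn.bias.shape)    # gamma and beta: torch.Size([3]) each
```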
5. Dropout — What & Why
Dropout (Srivastava et al., 2014) is a regularisation technique where, during each training step, each neuron is independently set to zero with probability p (the dropout rate).
The intuition behind dropout is the ensemble interpretation: by randomly dropping neurons, each training step trains a different sub-network. At inference, you effectively average predictions over an exponential number of different sub-networks — this ensemble averaging reduces variance and overfitting, just like Random Forest averages many trees.
6. How Dropout Works
During training: For each neuron in a dropout layer, sample a Bernoulli random variable with probability (1−p) of being 1. Multiply the neuron’s output by this value — neurons that get 0 are “dropped” for this step.
r_j ~ Bernoulli(1−p)
ã_j = r_j × a_j
During inference: All neurons are active and no mask is applied. To keep the expected activation magnitude consistent between training and inference, the standard implementation in PyTorch and TensorFlow is inverted dropout: the surviving activations are scaled up by 1/(1−p) during training, so no scaling at all is needed at inference time. (The original formulation instead left training outputs unscaled and multiplied by (1−p) at inference; the two are equivalent in expectation.)
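A minimal sketch of inverted dropout as PyTorch implements it — kept activations are scaled by 1/(1−p) during training, so the expected training output already matches the untouched inference output (values illustrative):

```python
import torch

torch.manual_seed(0)
p = 0.5
a = torch.ones(10000)

# Inverted dropout: drop with probability p, scale survivors by 1/(1-p)
# during training, so no adjustment is needed at inference.
mask = (torch.rand_like(a) > p).float()
a_train = mask * a / (1 - p)
print(a_train.mean())           # ~1.0 — expectation matches the inference output

# nn.Dropout applies the same scaling internally
drop = torch.nn.Dropout(p=p)
drop.train()
print(drop(a).mean())           # ~1.0 in train mode as well
drop.eval()
print(torch.equal(drop(a), a))  # True — identity at inference
```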
Choosing Dropout Rate p
| Layer Type | Typical Dropout Rate | Notes |
|---|---|---|
| Fully connected (dense) | 0.3 – 0.5 | Higher rates for large layers |
| Convolutional | 0.1 – 0.25 | Lower rates — conv layers share weights and are less prone to overfitting |
| Recurrent (LSTM/GRU) | 0.2 – 0.3 | Apply to inputs and outputs; for recurrent connections, use variational dropout (same mask at every time step), not naive per-step dropout |
| Transformers | 0.1 | Applied to attention weights and feed-forward layers |
7. Batch Norm vs Dropout — Comparison
| Feature | Batch Normalisation | Dropout |
|---|---|---|
| Primary purpose | Training stability, speed | Regularisation, prevent overfitting |
| How it works | Normalises activations to zero mean/unit variance | Randomly zeros out neurons |
| Learnable params | Yes — γ and β per feature | No |
| At inference | Uses running statistics (deterministic) | Disabled (all neurons active) |
| Placement | Between linear layer and activation | After activation, in dense layers |
| Works with small batches? | No — needs reasonable batch size (≥16) | Yes |
| Use together? | Yes — they solve different problems | Yes — use both in most deep networks |
8. Python Code
import torch
import torch.nn as nn
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
# ============================================================
# PyTorch -- Batch Norm + Dropout in a CNN
# ============================================================
class ConvNetWithBNDropout(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            # Conv Block 1
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),    # BN after conv, before activation
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
            # Conv Block 2
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512),
            nn.BatchNorm1d(512),
            nn.ReLU(),
            nn.Dropout(p=0.5),     # Dropout after activation in dense layers
            nn.Linear(512, 128),
            nn.ReLU(),
            nn.Dropout(p=0.3),
            nn.Linear(128, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x)

model_pt = ConvNetWithBNDropout()
print(model_pt)

# KEY: Set train/eval mode for dropout and BN to work correctly
model_pt.train()   # Enables dropout + uses mini-batch stats for BN
model_pt.eval()    # Disables dropout + uses running stats for BN
# ============================================================
# TensorFlow/Keras -- Batch Norm + Dropout
# ============================================================
def build_model_keras():
    return keras.Sequential([
        # Conv Block
        layers.Conv2D(32, (3, 3), padding='same', input_shape=(28, 28, 1)),
        layers.BatchNormalization(),
        layers.Activation('relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), padding='same'),
        layers.BatchNormalization(),
        layers.Activation('relu'),
        layers.MaxPooling2D((2, 2)),
        # Dense Block
        layers.Flatten(),
        layers.Dense(256),
        layers.BatchNormalization(),
        layers.Activation('relu'),
        layers.Dropout(0.5),       # active only when training=True
        layers.Dense(10, activation='softmax')
    ])

keras_model = build_model_keras()
keras_model.compile(optimizer='adam',
                    loss='sparse_categorical_crossentropy',
                    metrics=['accuracy'])
keras_model.summary()

# Note: In Keras, dropout and BN are automatically controlled by the
# 'training' argument -- fit() sets it to True, evaluate()/predict() to False
9. Common Mistakes Students Make
- Forgetting to set model.eval() during inference in PyTorch: In PyTorch, dropout and batch norm behave differently in train vs eval mode. Always call model.eval() before inference and model.train() before training. Forgetting this causes inconsistent predictions.
- Using dropout with batch normalisation carelessly: Dropout after BN can interact poorly — dropout changes the variance of activations, which conflicts with BN’s normalisation. A common practice: use BN instead of dropout in convolutional layers; use dropout only in the final fully connected layers.
- Using batch norm with very small batch sizes: BN computes statistics over the batch. With fewer than 8–16 samples per batch, the batch statistics are too noisy to be useful. Use Layer Normalisation or Group Normalisation instead for small batches.
- Applying dropout during inference: Dropout must be disabled at inference time. In Keras this is handled automatically. In PyTorch, you must explicitly call model.eval().
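The small-batch pitfall can be seen directly: Layer Norm's per-example statistics make its output independent of batch size, while Batch Norm's per-batch statistics do not. A PyTorch sketch with illustrative sizes:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
big = torch.randn(64, 16)
small = big[:2]   # the same first two examples, but in a batch of 2

# LayerNorm: per-example statistics -> output independent of batch size
ln = nn.LayerNorm(16)
same = torch.allclose(ln(big)[:2], ln(small))
print(same)   # True

# BatchNorm (train mode): per-batch statistics -> the same two examples
# normalise differently depending on who else is in the batch
bn = nn.BatchNorm1d(16).train()
diff = not torch.allclose(bn(big)[:2], bn(small))
print(diff)   # True
```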
10. Frequently Asked Questions
Should I use batch norm before or after the activation function?
The original paper places BN before activation (Conv → BN → ReLU). Some practitioners place it after activation (Conv → ReLU → BN) and report similar or better results, but there is no definitive consensus. For most practical purposes, either placement works — just be consistent throughout your architecture. For residual networks (ResNet), BN before activation is the standard order.
What is Layer Normalisation and how is it different from Batch Normalisation?
Batch Normalisation normalises across the batch dimension — statistics are computed over all examples in the mini-batch for each feature. Layer Normalisation normalises across the feature dimension — statistics are computed over all features for each individual example. Layer Norm does not depend on batch size and works well for small batches and recurrent networks. It is the standard normalisation technique in Transformers (BERT, GPT use Layer Norm, not Batch Norm).
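The axis difference can be written out by hand. A sketch that computes both normalisations manually on a (batch, features) tensor and checks the Layer Norm version against PyTorch's built-in (γ=1, β=0 at initialisation; sizes illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 8)   # (batch, features)
eps = 1e-5

# Batch Norm: statistics over the batch dimension (dim=0), one pair per feature
bn_mu = x.mean(dim=0)
bn_var = x.var(dim=0, unbiased=False)
x_bn = (x - bn_mu) / torch.sqrt(bn_var + eps)

# Layer Norm: statistics over the feature dimension (dim=1), one pair per example
ln_mu = x.mean(dim=1, keepdim=True)
ln_var = x.var(dim=1, unbiased=False, keepdim=True)
x_ln = (x - ln_mu) / torch.sqrt(ln_var + eps)

# Matches PyTorch's built-in LayerNorm at initialisation
print(torch.allclose(x_ln, nn.LayerNorm(8)(x), atol=1e-5))  # True
```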