Convolutional Neural Networks (CNNs)
Image Recognition & Computer Vision — Explained for Engineering Students
Last Updated: March 2026
📌 Key Takeaways
- Definition: CNNs are neural networks designed for spatial data (images, video) using convolutional layers that detect local patterns with learnable filters.
- Key layers: Conv2D (detect features) → Activation (ReLU) → Pooling (reduce size) → Flatten → Dense (classify).
- Filters (kernels): Small matrices (e.g., 3×3) that slide over the input, detecting edges, textures, and shapes.
- Feature maps: Output of applying one filter to the input — one per filter per layer.
- Why CNNs for images: Parameter sharing (same filter reused across all positions) and local connectivity make CNNs far more efficient than fully connected networks for images.
- Famous architectures: LeNet, AlexNet, VGG, ResNet, EfficientNet.
1. Why Not Use a Standard Neural Network for Images?
A standard fully connected neural network treats every pixel as a separate input with no spatial awareness. For a 224×224×3 colour image, this means 150,528 input neurons. Even a single hidden layer with 1,000 neurons requires 150 million parameters — computationally infeasible and extremely prone to overfitting.
More fundamentally, a fully connected layer has no awareness that pixels near each other are more related than distant ones. It treats the pixel at position (10, 10) and the pixel at position (200, 200) as equally related — clearly wrong for images.
CNNs solve both problems: they use local connectivity (each neuron connects only to a small spatial region) and parameter sharing (the same filter is used across the entire image), dramatically reducing parameters while encoding spatial structure.
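The parameter counts above can be checked with quick arithmetic. This is an illustrative back-of-the-envelope calculation, not library code:

```python
# Fully connected: every input value connects to every hidden neuron.
inputs = 224 * 224 * 3          # 150,528 input values for a 224x224 RGB image
hidden = 1_000
dense_params = inputs * hidden  # weights only, ignoring biases

# Convolutional: one 3x3 filter is shared across every spatial position.
conv_params_per_filter = 3 * 3 * 3  # 3x3 kernel over 3 input channels

print(dense_params)             # 150,528,000 (~150 million)
print(conv_params_per_filter)   # 27
```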
2. The Convolutional Layer
A convolutional layer applies a set of learnable filters to the input by sliding each filter across the height and width of the input, computing a dot product at each position. This operation is called convolution (technically cross-correlation in most implementations).
For an input image of size H × W × C (height × width × channels) and a filter of size F × F × C:
Output size = ((H − F + 2P) / S + 1) × ((W − F + 2P) / S + 1) × K
Where P = padding, S = stride, K = number of filters. Each filter produces one feature map; K filters produce K feature maps.
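The output-size formula can be wrapped in a small helper (the function name `conv_output_size` is just for illustration):

```python
def conv_output_size(h, w, f, p, s, k):
    """Spatial output of a conv layer: ((H - F + 2P) / S + 1) per dimension."""
    out_h = (h - f + 2 * p) // s + 1
    out_w = (w - f + 2 * p) // s + 1
    return out_h, out_w, k

# 32x32 RGB input, 3x3 filter, 'same' padding (P=1), stride 1, 64 filters:
print(conv_output_size(32, 32, 3, 1, 1, 64))  # (32, 32, 64)

# The same filter with no padding (P=0) shrinks the output:
print(conv_output_size(32, 32, 3, 0, 1, 64))  # (30, 30, 64)
```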
3. Filters, Kernels & Feature Maps
A filter (or kernel) is a small matrix of learnable weights — typically 3×3 or 5×5. When a filter slides over an image and computes dot products, it detects the pattern encoded in its weights wherever that pattern appears in the image.
- Early layers learn simple features: horizontal edges, vertical edges, colour gradients.
- Middle layers combine simple features into more complex patterns: curves, corners, textures.
- Deep layers detect high-level features: wheels, eyes, faces, entire objects.
The weights in these filters are learned automatically during backpropagation — you do not hand-design them. This is the power of CNNs: they learn which features to detect.
Feature map: The result of applying one filter to the full input. If you have 32 filters in a layer, the layer outputs 32 feature maps — 32 different aspects of the input at that level of abstraction.
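The slide-and-dot-product operation can be sketched in a few lines of NumPy. Here a hand-written vertical-edge filter (a Sobel kernel) is applied to a toy image; in a real CNN these weights would be learned, and the helper `cross_correlate` is just for illustration:

```python
import numpy as np

def cross_correlate(image, kernel):
    """Slide the kernel over the image (stride 1, no padding), taking dot products."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# A 6x6 image: dark left half (0), bright right half (1) -> a vertical edge.
img = np.zeros((6, 6))
img[:, 3:] = 1.0

# A hand-designed vertical-edge filter (a CNN learns such weights itself).
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

fmap = cross_correlate(img, sobel_x)   # this is one feature map
print(fmap.shape)  # (4, 4)
print(fmap[0])     # [0. 4. 4. 0.] -- strong response where the edge sits
```

The feature map is near zero over the flat regions and peaks at the edge, which is exactly the "detect the pattern wherever it appears" behaviour described above.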
Parameter Sharing
The same filter weights are used at every spatial position. A 3×3 filter has only 3×3×C weights (plus one bias) regardless of the input image size. This is in stark contrast to fully connected layers, where each connection has a separate weight. Parameter sharing embodies the assumption that if a pattern (like a horizontal edge) is useful in one part of the image, it is useful everywhere.
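As a concrete sketch of this contrast (the helper name `conv_layer_params` is illustrative):

```python
def conv_layer_params(f, c, k):
    """Parameters in a conv layer: F*F*C weights per filter, plus one bias each."""
    return (f * f * c + 1) * k

# 3x3 filters over an RGB input, 32 filters -- independent of image size:
print(conv_layer_params(3, 3, 32))  # 896

# A dense layer mapping the same 32x32x3 image to 32 units, for contrast:
print((32 * 32 * 3 + 1) * 32)       # 98,336
```

The 896 figure matches what `model.summary()` reports for a `Conv2D(32, (3, 3))` layer on an RGB input.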
4. Padding & Stride
| Concept | Definition | Effect | When to Use |
|---|---|---|---|
| Valid Padding (no padding) | No padding — filter stays within input boundaries | Output smaller than input: (H−F+1) × (W−F+1) | When you want to reduce spatial size |
| Same Padding | Zero-pad input so output has same H×W as input | Output same size as input | When you want to preserve spatial dimensions |
| Stride = 1 | Filter moves one pixel at a time | Dense, overlapping detections — larger output | Default for most conv layers |
| Stride = 2 | Filter moves two pixels at a time | Output halved — can replace pooling | Downsampling alternative to pooling |
5. Pooling Layers
Pooling layers reduce the spatial dimensions of feature maps, reducing computation and providing some spatial invariance — making the network less sensitive to the exact position of features.
| Type | Operation | Use |
|---|---|---|
| Max Pooling | Take the maximum value in each window | Most common — preserves the strongest activation |
| Average Pooling | Take the average value in each window | Used in some architectures (GoogLeNet) |
| Global Average Pooling | Average each feature map to a single value | Replaces flatten + dense — reduces overfitting (used in ResNet, MobileNet) |
A 2×2 max pool with stride 2 reduces a 28×28 feature map to 14×14, cutting the number of values by 75% while retaining the most prominent features.
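A minimal NumPy sketch of 2×2 max pooling with stride 2 (the helper `max_pool_2x2` is illustrative and assumes even spatial dimensions):

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 (assumes even height and width)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.arange(16).reshape(4, 4)   # a toy 4x4 feature map
pooled = max_pool_2x2(fmap)
print(pooled)
# [[ 5  7]
#  [13 15]]
```

Each 2×2 window contributes only its largest value: 16 values become 4, the same 75% reduction described above.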
6. Complete CNN Architecture
A typical CNN for image classification follows this pattern:
- Input: Image tensor — e.g., 32×32×3 (32×32 pixels, RGB)
- Conv Block 1: Conv2D(32 filters, 3×3, same) → ReLU → MaxPool(2×2) → Output: 16×16×32
- Conv Block 2: Conv2D(64 filters, 3×3, same) → ReLU → MaxPool(2×2) → Output: 8×8×64
- Conv Block 3: Conv2D(128 filters, 3×3, same) → ReLU → MaxPool(2×2) → Output: 4×4×128
- Flatten: Reshape 4×4×128 = 2,048 values into a 1D vector
- Dense(256) → ReLU → Dropout(0.5)
- Dense(num_classes) → Softmax — final class probabilities
Early conv blocks detect simple features. Deeper blocks detect complex features. Pooling progressively reduces spatial size. Flattening connects to fully connected layers for the final classification decision.
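The shape progression above can be traced with simple arithmetic: 'same' convolution preserves height and width, and each 2×2 pool halves them. The helper `trace_shapes` is illustrative:

```python
def trace_shapes(h=32, w=32, c=3, filters=(32, 64, 128)):
    """Follow the tensor shape through conv('same') + 2x2 max-pool blocks."""
    shapes = []
    for k in filters:
        # 'same' conv keeps H x W and sets the channel count to k;
        # the 2x2 max pool then halves each spatial dimension.
        h, w, c = h // 2, w // 2, k
        shapes.append((h, w, c))
    flat = h * w * c  # values entering the Flatten layer
    return shapes, flat

print(trace_shapes())  # ([(16, 16, 32), (8, 8, 64), (4, 4, 128)], 2048)
```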
7. Famous CNN Architectures
| Architecture | Year | Key Innovation | Parameters |
|---|---|---|---|
| LeNet-5 | 1998 | First practical CNN — digit recognition | ~60K |
| AlexNet | 2012 | Deep CNN, ReLU, Dropout, GPU training — ImageNet breakthrough | ~60M |
| VGG-16 | 2014 | Very deep network with simple uniform 3×3 filters | ~138M |
| ResNet-50 | 2015 | Residual connections — solved vanishing gradients in very deep networks | ~25M |
| EfficientNet | 2019 | Compound scaling — state-of-the-art accuracy per parameter | 5M–66M |
| Vision Transformer (ViT) | 2020 | Applies transformer attention to image patches — no convolutions | 86M+ |
8. Transfer Learning
Training a large CNN from scratch requires millions of labelled images and significant compute. Transfer learning reuses a CNN pre-trained on a large dataset (e.g., ImageNet with 1.2M images) for a new task:
- Take a pre-trained model (e.g., ResNet-50 trained on ImageNet)
- Remove the final classification layer
- Add new classification layers for your task
- Fine-tune: either train only new layers (feature extraction) or train the whole network with a low learning rate
Transfer learning is standard practice — even with only a few hundred images per class, fine-tuning a pre-trained CNN typically outperforms training from scratch with thousands of images. It is the recommended approach for most real-world image classification tasks.
9. Python Code
```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# --- Build a CNN for image classification ---
def build_cnn(input_shape=(32, 32, 3), num_classes=10):
    model = keras.Sequential([
        # Block 1
        layers.Conv2D(32, (3, 3), activation='relu', padding='same',
                      input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),
        # Block 2
        layers.Conv2D(64, (3, 3), activation='relu', padding='same'),
        layers.MaxPooling2D((2, 2)),
        # Block 3
        layers.Conv2D(128, (3, 3), activation='relu', padding='same'),
        layers.MaxPooling2D((2, 2)),
        # Classifier
        layers.Flatten(),
        layers.Dense(256, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation='softmax')
    ])
    return model

model = build_cnn()
model.summary()  # Shows layer shapes and parameter counts

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# --- Train on the CIFAR-10 dataset ---
(X_train, y_train), (X_test, y_test) = keras.datasets.cifar10.load_data()
X_train = X_train / 255.0  # Normalise pixel values to [0, 1]
X_test = X_test / 255.0

early_stop = keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)
history = model.fit(X_train, y_train,
                    epochs=30,
                    batch_size=64,
                    validation_split=0.1,
                    callbacks=[early_stop])

loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test Accuracy: {accuracy:.3f}")

# --- Transfer learning with ResNet50 ---
base_model = keras.applications.ResNet50(
    weights='imagenet',
    include_top=False,
    input_shape=(224, 224, 3)
)
base_model.trainable = False  # Freeze base model weights

transfer_model = keras.Sequential([
    base_model,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation='relu'),
    layers.Dropout(0.3),
    layers.Dense(10, activation='softmax')  # 10 classes
])
transfer_model.compile(optimizer='adam',
                       loss='sparse_categorical_crossentropy',
                       metrics=['accuracy'])
```
10. Frequently Asked Questions
Can CNNs be used for data other than images?
Yes. CNNs work well on any data with local spatial or temporal structure. 1D CNNs are used for time series, audio, and text. 3D CNNs process video or volumetric medical scans (CT/MRI). The key requirement is that nearby values in the data are more related than distant ones — the assumption of local correlation that makes convolutional filters effective.
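The 1D case follows the same principle: a filter slides along one axis and responds to a local temporal pattern. A minimal sketch (the helper `conv1d` and the hand-picked "rise detector" filter are illustrative):

```python
import numpy as np

def conv1d(signal, kernel):
    """Valid 1D cross-correlation: detect a local pattern along a sequence."""
    n = len(signal) - len(kernel) + 1
    return np.array([np.dot(signal[i:i+len(kernel)], kernel) for i in range(n)])

# A step change in a time series, and a filter that responds to upward jumps.
series = np.array([0., 0., 0., 1., 1., 1.])
rise_detector = np.array([-1., 0., 1.])

print(conv1d(series, rise_detector))  # [0. 1. 1. 0.] -- peaks at the step
```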
How many filters should each convolutional layer have?
A common pattern is to start with 32 filters in the first layer and double the number with each subsequent layer (32, 64, 128, 256…). This mirrors the increasing complexity of learned features — early layers detect few simple features, later layers detect many complex ones. Adjust based on your dataset size and computational budget.
What is the difference between CNN and Vision Transformer (ViT)?
CNNs use convolutional filters with local receptive fields — they build up global understanding from local patterns. Vision Transformers divide the image into patches and apply self-attention globally — each patch attends to every other patch from the start. ViTs require much more training data to outperform CNNs, but with sufficient data and scale, they match or exceed CNN performance. Hybrid architectures combining both are now common.