Convolutional Neural Networks (CNNs)
Image Recognition & Computer Vision — Explained for Engineering Students
Last Updated: March 2026
📌 Key Takeaways
- Definition: CNNs are neural networks designed for spatial data (images, video) using convolutional layers that detect local patterns with learnable filters.
- Key layers: Conv2D (detect features) → Activation (ReLU) → Pooling (reduce size) → Flatten → Dense (classify).
- Filters (kernels): Small matrices (e.g., 3×3) that slide over the input, detecting edges, textures, and shapes.
- Feature maps: Output of applying one filter to the input — one per filter per layer.
- Why CNNs for images: Parameter sharing (same filter reused across all positions) and local connectivity make CNNs far more efficient than fully connected networks for images.
- Famous architectures: LeNet, AlexNet, VGG, ResNet, EfficientNet.
1. Why Not Use a Standard Neural Network for Images?
A standard fully connected neural network treats every pixel as a separate input with no spatial awareness. For a 224×224×3 colour image, this means 150,528 input neurons. Even a single hidden layer with 1,000 neurons requires 150 million parameters — computationally infeasible and extremely prone to overfitting.
More fundamentally, a fully connected layer has no awareness that pixels near each other are more related than distant ones. It treats the pixel at position (10, 10) and the pixel at position (200, 200) as equally related — clearly wrong for images.
CNNs solve both problems: they use local connectivity (each neuron connects only to a small spatial region) and parameter sharing (the same filter is used across the entire image), dramatically reducing parameters while encoding spatial structure.
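The parameter counts above can be checked with quick arithmetic. This is an illustrative back-of-the-envelope calculation, not library code:

```python
# Fully connected: every input value connects to every hidden neuron.
inputs = 224 * 224 * 3          # 150,528 input values for a 224x224 RGB image
hidden = 1_000
dense_params = inputs * hidden  # weights only, ignoring biases

# Convolutional: one 3x3 filter is shared across every spatial position.
conv_params_per_filter = 3 * 3 * 3  # 3x3 kernel over 3 input channels

print(dense_params)             # 150,528,000 (~150 million)
print(conv_params_per_filter)   # 27
```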
2. The Convolutional Layer
A convolutional layer applies a set of learnable filters to the input by sliding each filter across the height and width of the input, computing a dot product at each position. This operation is called convolution (technically cross-correlation in most implementations).
For an input image of size H × W × C (height × width × channels) and a filter of size F × F × C:
Output size = ((H − F + 2P) / S + 1) × ((W − F + 2P) / S + 1) × K
Where P = padding, S = stride, K = number of filters. Each filter produces one feature map; K filters produce K feature maps.
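The output-size formula can be wrapped in a small helper (the function name `conv_output_size` is just for illustration):

```python
def conv_output_size(h, w, f, p, s, k):
    """Spatial output of a conv layer: ((H - F + 2P) / S + 1) per dimension."""
    out_h = (h - f + 2 * p) // s + 1
    out_w = (w - f + 2 * p) // s + 1
    return out_h, out_w, k

# 32x32 RGB input, 3x3 filter, 'same' padding (P=1), stride 1, 64 filters:
print(conv_output_size(32, 32, 3, 1, 1, 64))  # (32, 32, 64)

# The same filter with no padding (P=0) shrinks the output:
print(conv_output_size(32, 32, 3, 0, 1, 64))  # (30, 30, 64)
```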
3. Filters, Kernels & Feature Maps
A filter (or kernel) is a small matrix of learnable weights — typically 3×3 or 5×5. When a filter slides over an image and computes dot products, it detects the pattern encoded in its weights wherever that pattern appears in the image.
- Early layers learn simple features: horizontal edges, vertical edges, colour gradients.
- Middle layers combine simple features into more complex patterns: curves, corners, textures.
- Deep layers detect high-level features: wheels, eyes, faces, entire objects.
The weights in these filters are learned automatically during backpropagation — you do not hand-design them. This is the power of CNNs: they learn which features to detect.
Feature map: The result of applying one filter to the full input. If you have 32 filters in a layer, the layer outputs 32 feature maps — 32 different aspects of the input at that level of abstraction.
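The slide-and-dot-product operation can be sketched in a few lines of NumPy. Here a hand-written vertical-edge filter (a Sobel kernel) is applied to a toy image; in a real CNN these weights would be learned, and the helper `cross_correlate` is just for illustration:

```python
import numpy as np

def cross_correlate(image, kernel):
    """Slide the kernel over the image (stride 1, no padding), taking dot products."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# A 6x6 image: dark left half (0), bright right half (1) -> a vertical edge.
img = np.zeros((6, 6))
img[:, 3:] = 1.0

# A hand-designed vertical-edge filter (a CNN learns such weights itself).
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

fmap = cross_correlate(img, sobel_x)   # this is one feature map
print(fmap.shape)  # (4, 4)
print(fmap[0])     # [0. 4. 4. 0.] -- strong response where the edge sits
```

The feature map is near zero over the flat regions and peaks at the edge, which is exactly the "detect the pattern wherever it appears" behaviour described above.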
Parameter Sharing
The same filter weights are used at every spatial position. A 3×3 filter has only 3×3×C weights (plus one bias) regardless of the input image size. This is in stark contrast to fully connected layers, where each connection has a separate weight. Parameter sharing embodies the assumption that if a pattern (like a horizontal edge) is useful in one part of the image, it is useful everywhere.
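As a concrete sketch of this contrast (the helper name `conv_layer_params` is illustrative):

```python
def conv_layer_params(f, c, k):
    """Parameters in a conv layer: F*F*C weights per filter, plus one bias each."""
    return (f * f * c + 1) * k

# 3x3 filters over an RGB input, 32 filters -- independent of image size:
print(conv_layer_params(3, 3, 32))  # 896

# A dense layer mapping the same 32x32x3 image to 32 units, for contrast:
print((32 * 32 * 3 + 1) * 32)       # 98,336
```

The 896 figure matches what `model.summary()` reports for a `Conv2D(32, (3, 3))` layer on an RGB input.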
4. Padding & Stride
| Concept | Definition | Effect | When to Use |
|---|---|---|---|
| Valid Padding (no padding) | No padding — filter stays within input boundaries | Output smaller than input: (H−F+1) × (W−F+1) | When you want to reduce spatial size |
| Same Padding | Zero-pad input so output has same H×W as input | Output same size as input | When you want to preserve spatial dimensions |
| Stride = 1 | Filter moves one pixel at a time | Dense, overlapping detections — larger output | Default for most conv layers |
| Stride = 2 | Filter moves two pixels at a time | Output halved — can replace pooling | Downsampling alternative to pooling |
5. Pooling Layers
Pooling layers reduce the spatial dimensions of feature maps, reducing computation and providing some spatial invariance — making the network less sensitive to the exact position of features.
| Type | Operation | Use |
|---|---|---|
| Max Pooling | Take the maximum value in each window | Most common — preserves the strongest activation |
| Average Pooling | Take the average value in each window | Used in some architectures (GoogLeNet) |
| Global Average Pooling | Average each feature map to a single value | Replaces flatten + dense — reduces overfitting (used in ResNet, MobileNet) |
A 2×2 max pool with stride 2 reduces a 28×28 feature map to 14×14, cutting the number of values by 75% while retaining the most prominent features.
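A minimal NumPy sketch of 2×2 max pooling with stride 2 (the helper `max_pool_2x2` is illustrative and assumes even spatial dimensions):

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 (assumes even height and width)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.arange(16).reshape(4, 4)   # a toy 4x4 feature map
pooled = max_pool_2x2(fmap)
print(pooled)
# [[ 5  7]
#  [13 15]]
```

Each 2×2 window contributes only its largest value: 16 values become 4, the same 75% reduction described above.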
6. Complete CNN Architecture
A typical CNN for image classification follows this pattern:
- Input: Image tensor — e.g., 32×32×3 (32×32 pixels, RGB)
- Conv Block 1: Conv2D(32 filters, 3×3, same) → ReLU → MaxPool(2×2) → Output: 16×16×32
- Conv Block 2: Conv2D(64 filters, 3×3, same) → ReLU → MaxPool(2×2) → Output: 8×8×64
- Conv Block 3: Conv2D(128 filters, 3×3, same) → ReLU → MaxPool(2×2) → Output: 4×4×128
- Flatten: Reshape 4×4×128 = 2,048 values into a 1D vector
- Dense(256) → ReLU → Dropout(0.5)
- Dense(num_classes) → Softmax — final class probabilities
Early conv blocks detect simple features. Deeper blocks detect complex features. Pooling progressively reduces spatial size. Flattening connects to fully connected layers for the final classification decision.
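The shape progression above can be traced with simple arithmetic: 'same' convolution preserves height and width, and each 2×2 pool halves them. The helper `trace_shapes` is illustrative:

```python
def trace_shapes(h=32, w=32, c=3, filters=(32, 64, 128)):
    """Follow the tensor shape through conv('same') + 2x2 max-pool blocks."""
    shapes = []
    for k in filters:
        # 'same' conv keeps H x W and sets the channel count to k;
        # the 2x2 max pool then halves each spatial dimension.
        h, w, c = h // 2, w // 2, k
        shapes.append((h, w, c))
    flat = h * w * c  # values entering the Flatten layer
    return shapes, flat

print(trace_shapes())  # ([(16, 16, 32), (8, 8, 64), (4, 4, 128)], 2048)
```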
7. Famous CNN Architectures
| Architecture | Year | Key Innovation | Parameters |
|---|---|---|---|
| LeNet-5 | 1998 | First practical CNN — digit recognition | ~60K |
| AlexNet | 2012 | Deep CNN, ReLU, Dropout, GPU training — ImageNet breakthrough | ~60M |
| VGG-16 | 2014 | Very deep network with simple uniform 3×3 filters | ~138M |
| ResNet-50 | 2015 | Residual connections — solved vanishing gradients in very deep networks | ~25M |
| EfficientNet | 2019 | Compound scaling — state-of-the-art accuracy per parameter | 5M–66M |
| Vision Transformer (ViT) | 2020 | Applies transformer attention to image patches — no convolutions | 86M+ |
8. Transfer Learning
Training a large CNN from scratch requires millions of labelled images and significant compute. Transfer learning reuses a CNN pre-trained on a large dataset (e.g., ImageNet with 1.2M images) for a new task:
- Take a pre-trained model (e.g., ResNet-50 trained on ImageNet)
- Remove the final classification layer
- Add new classification layers for your task
- Fine-tune: either train only new layers (feature extraction) or train the whole network with a low learning rate
Transfer learning is standard practice — even with only a few hundred images per class, fine-tuning a pre-trained CNN typically outperforms training from scratch with thousands of images. It is the recommended approach for most real-world image classification tasks.
9. Python Code
```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# --- Build a CNN for image classification ---
def build_cnn(input_shape=(32, 32, 3), num_classes=10):
    model = keras.Sequential([
        # Block 1
        layers.Conv2D(32, (3, 3), activation='relu', padding='same',
                      input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),
        # Block 2
        layers.Conv2D(64, (3, 3), activation='relu', padding='same'),
        layers.MaxPooling2D((2, 2)),
        # Block 3
        layers.Conv2D(128, (3, 3), activation='relu', padding='same'),
        layers.MaxPooling2D((2, 2)),
        # Classifier
        layers.Flatten(),
        layers.Dense(256, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation='softmax')
    ])
    return model

model = build_cnn()
model.summary()  # Shows layer shapes and parameter counts

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# --- Train on the CIFAR-10 dataset ---
(X_train, y_train), (X_test, y_test) = keras.datasets.cifar10.load_data()
X_train = X_train / 255.0  # Normalise pixel values to [0, 1]
X_test = X_test / 255.0

early_stop = keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)
history = model.fit(X_train, y_train,
                    epochs=30,
                    batch_size=64,
                    validation_split=0.1,
                    callbacks=[early_stop])

loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test Accuracy: {accuracy:.3f}")

# --- Transfer learning with ResNet50 ---
base_model = keras.applications.ResNet50(
    weights='imagenet',
    include_top=False,
    input_shape=(224, 224, 3)
)
base_model.trainable = False  # Freeze base model weights

transfer_model = keras.Sequential([
    base_model,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation='relu'),
    layers.Dropout(0.3),
    layers.Dense(10, activation='softmax')  # 10 classes
])
transfer_model.compile(optimizer='adam',
                       loss='sparse_categorical_crossentropy',
                       metrics=['accuracy'])
```
10. Frequently Asked Questions
Can CNNs be used for data other than images?
Yes. CNNs work well on any data with local spatial or temporal structure. 1D CNNs are used for time series, audio, and text. 3D CNNs process video or volumetric medical scans (CT/MRI). The key requirement is that nearby values in the data are more related than distant ones — the assumption of local correlation that makes convolutional filters effective.
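The 1D case follows the same principle: a filter slides along one axis and responds to a local temporal pattern. A minimal sketch (the helper `conv1d` and the hand-picked "rise detector" filter are illustrative):

```python
import numpy as np

def conv1d(signal, kernel):
    """Valid 1D cross-correlation: detect a local pattern along a sequence."""
    n = len(signal) - len(kernel) + 1
    return np.array([np.dot(signal[i:i+len(kernel)], kernel) for i in range(n)])

# A step change in a time series, and a filter that responds to upward jumps.
series = np.array([0., 0., 0., 1., 1., 1.])
rise_detector = np.array([-1., 0., 1.])

print(conv1d(series, rise_detector))  # [0. 1. 1. 0.] -- peaks at the step
```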
How many filters should each convolutional layer have?
A common pattern is to start with 32 filters in the first layer and double the number with each subsequent layer (32, 64, 128, 256…). This mirrors the increasing complexity of learned features — early layers detect few simple features, later layers detect many complex ones. Adjust based on your dataset size and computational budget.
What is the difference between CNN and Vision Transformer (ViT)?
CNNs use convolutional filters with local receptive fields — they build up global understanding from local patterns. Vision Transformers divide the image into patches and apply self-attention globally — each patch attends to every other patch from the start. ViTs require much more training data to outperform CNNs, but with sufficient data and scale, they match or exceed CNN performance. Hybrid architectures combining both are now common.