Convolutional Neural Networks (CNNs)

Image Recognition & Computer Vision — Explained for Engineering Students

Last Updated: March 2026

📌 Key Takeaways

  • Definition: CNNs are neural networks designed for spatial data (images, video) using convolutional layers that detect local patterns with learnable filters.
  • Key layers: Conv2D (detect features) → Activation (ReLU) → Pooling (reduce size) → Flatten → Dense (classify).
  • Filters (kernels): Small matrices (e.g., 3×3) that slide over the input, detecting edges, textures, and shapes.
  • Feature maps: Output of applying one filter to the input — one per filter per layer.
  • Why CNNs for images: Parameter sharing (same filter reused across all positions) and local connectivity make CNNs far more efficient than fully connected networks for images.
  • Famous architectures: LeNet, AlexNet, VGG, ResNet, EfficientNet.

1. Why Not Use a Standard Neural Network for Images?

A standard fully connected neural network treats every pixel as a separate input with no spatial awareness. For a 224×224×3 colour image, this means 150,528 input neurons. Even a single hidden layer with 1,000 neurons requires 150 million parameters — computationally infeasible and extremely prone to overfitting.

More fundamentally, a fully connected layer has no awareness that pixels near each other are more related than distant ones. It treats the pixel at position (10, 10) and the pixel at position (200, 200) as equally related — clearly wrong for images.

CNNs solve both problems: they use local connectivity (each neuron connects only to a small spatial region) and parameter sharing (the same filter is used across the entire image), dramatically reducing parameters while encoding spatial structure.
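The parameter savings are easy to quantify. A minimal sketch in plain Python, using the numbers from the example above (the 64-filter count is an illustrative assumption, not from the text):

```python
# Fully connected: every input value connects to every hidden neuron.
H, W, C = 224, 224, 3
inputs = H * W * C                       # 150,528 input values
dense_params = inputs * 1000             # weights for 1,000 hidden neurons
print(f"Dense layer: {dense_params:,} weights")

# Convolutional: one small filter shared across every spatial position.
F, K = 3, 64                             # 64 filters of size 3x3
conv_params = (F * F * C + 1) * K        # +1 bias per filter
print(f"Conv layer:  {conv_params:,} weights")
```

The conv layer needs under 2,000 weights versus roughly 150 million for the dense layer, regardless of image size.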

2. The Convolutional Layer

A convolutional layer applies a set of learnable filters to the input by sliding each filter across the height and width of the input, computing a dot product at each position. This operation is called convolution (technically cross-correlation in most implementations).

For an input image of size H × W × C (height × width × channels) and a filter of size F × F × C:

Output size = ((H − F + 2P) / S + 1) × ((W − F + 2P) / S + 1) × K

Where P = padding, S = stride, K = number of filters. Each filter produces one feature map; K filters produce K feature maps.
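The output-size formula can be checked with a small helper (a sketch; integer division assumes the filter fits evenly given the stride):

```python
def conv_output_size(h, w, f, k, padding=0, stride=1):
    """Spatial output of a conv layer: ((H - F + 2P) / S + 1) per dimension."""
    out_h = (h - f + 2 * padding) // stride + 1
    out_w = (w - f + 2 * padding) // stride + 1
    return out_h, out_w, k

# 32x32 input, 3x3 filter, 'same' padding (P=1), stride 1, 64 filters:
print(conv_output_size(32, 32, 3, 64, padding=1))            # (32, 32, 64)

# No padding ('valid'), stride 2 — spatial size roughly halves:
print(conv_output_size(32, 32, 3, 64, padding=0, stride=2))  # (15, 15, 64)
```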

3. Filters, Kernels & Feature Maps

A filter (or kernel) is a small matrix of learnable weights — typically 3×3 or 5×5. When a filter slides over an image and computes dot products, it detects the pattern encoded in its weights wherever that pattern appears in the image.

  • Early layers learn simple features: horizontal edges, vertical edges, colour gradients.
  • Middle layers combine simple features into more complex patterns: curves, corners, textures.
  • Deep layers detect high-level features: wheels, eyes, faces, entire objects.

The weights in these filters are learned automatically during backpropagation — you do not hand-design them. This is the power of CNNs: they learn which features to detect.

Feature map: The result of applying one filter to the full input. If you have 32 filters in a layer, the layer outputs 32 feature maps — 32 different aspects of the input at that level of abstraction.

Parameter Sharing

The same filter weights are used at every spatial position. A 3×3 filter for a layer has only 3×3×C parameters regardless of the input image size. This is in stark contrast to fully connected layers where each connection has a separate weight. Parameter sharing embodies the assumption that if a pattern (like a horizontal edge) is useful in one part of the image, it is useful everywhere.
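To make the sliding dot product concrete, here is a hand-written vertical-edge filter applied with plain NumPy. In a real CNN these weights would be learned, not hand-set; this is only an illustration of the mechanics:

```python
import numpy as np

# A tiny 6x6 "image": dark left half (0), bright right half (1).
img = np.zeros((6, 6))
img[:, 3:] = 1.0

# A 3x3 vertical-edge filter: responds where left and right columns differ.
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)

# Valid convolution (cross-correlation, as in most CNN frameworks):
# slide the filter over every position and take the dot product.
F = kernel.shape[0]
H, W = img.shape
fmap = np.zeros((H - F + 1, W - F + 1))
for i in range(fmap.shape[0]):
    for j in range(fmap.shape[1]):
        fmap[i, j] = np.sum(img[i:i+F, j:j+F] * kernel)

print(fmap)  # strong responses only where the window straddles the edge
```

The same nine weights are reused at all sixteen output positions: that is parameter sharing in miniature.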

4. Padding & Stride

  • Valid padding (no padding): the filter stays within the input boundaries, so the output shrinks to (H−F+1) × (W−F+1). Use when you want to reduce spatial size.
  • Same padding: zero-pad the input so the output keeps the same H × W as the input. Use when you want to preserve spatial dimensions.
  • Stride = 1: the filter moves one pixel at a time, giving dense, overlapping detections and a larger output. The default for most conv layers.
  • Stride = 2: the filter moves two pixels at a time, halving the output. A downsampling alternative to pooling.

5. Pooling Layers

Pooling layers reduce the spatial dimensions of feature maps, reducing computation and providing some spatial invariance — making the network less sensitive to the exact position of features.

  • Max pooling: takes the maximum value in each window. The most common choice; preserves the strongest activation.
  • Average pooling: takes the average value in each window. Used in some architectures (GoogLeNet).
  • Global average pooling: averages each feature map down to a single value. Replaces flatten + dense and reduces overfitting (used in ResNet, MobileNet).

A 2×2 max pool with stride 2 reduces a 28×28 feature map to 14×14, cutting the number of values by 75% while retaining the most prominent features.
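A 2×2 max pool with stride 2 can be sketched in a few lines of NumPy (the input values here are made up for illustration):

```python
import numpy as np

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 5],
                 [0, 1, 3, 2],
                 [2, 6, 1, 1]], dtype=float)

# 2x2 max pool, stride 2: group into non-overlapping 2x2 blocks,
# then keep only each block's maximum.
h, w = fmap.shape
pooled = fmap.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
print(pooled)
# [[4. 5.]
#  [6. 3.]]
```

Sixteen values become four: a 75% reduction, with the strongest activation in each region preserved.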

6. Complete CNN Architecture

A typical CNN for image classification follows this pattern:

  1. Input: Image tensor — e.g., 32×32×3 (32×32 pixels, RGB)
  2. Conv Block 1: Conv2D(32 filters, 3×3, same) → ReLU → MaxPool(2×2) → Output: 16×16×32
  3. Conv Block 2: Conv2D(64 filters, 3×3, same) → ReLU → MaxPool(2×2) → Output: 8×8×64
  4. Conv Block 3: Conv2D(128 filters, 3×3, same) → ReLU → MaxPool(2×2) → Output: 4×4×128
  5. Flatten: Reshape 4×4×128 = 2,048 values into a 1D vector
  6. Dense(256) → ReLU → Dropout(0.5)
  7. Dense(num_classes) → Softmax — final class probabilities

Early conv blocks detect simple features. Deeper blocks detect complex features. Pooling progressively reduces spatial size. Flattening connects to fully connected layers for the final classification decision.
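The shape arithmetic in the architecture above can be traced step by step in plain Python, using the same-padding and 2×2-pooling rules from the earlier sections:

```python
h = w = 32      # input: 32x32x3
channels = 3
for filters in (32, 64, 128):
    channels = filters      # Conv2D with 'same' padding keeps H, W; depth = filters
    h, w = h // 2, w // 2   # MaxPool(2x2, stride 2) halves H and W
    print(f"after block: {h}x{w}x{channels}")

flat = h * w * channels
print(f"flattened vector length: {flat}")  # 4*4*128 = 2048
```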

7. Famous CNN Architectures

  • LeNet-5 (1998): first practical CNN, used for digit recognition. ~60K parameters.
  • AlexNet (2012): deep CNN with ReLU, Dropout, and GPU training; the ImageNet breakthrough. ~60M parameters.
  • VGG-16 (2014): very deep network built from simple, uniform 3×3 filters. ~138M parameters.
  • ResNet-50 (2015): residual connections solved vanishing gradients in very deep networks. ~25M parameters.
  • EfficientNet (2019): compound scaling for state-of-the-art accuracy per parameter. 5M–66M parameters.
  • Vision Transformer (ViT) (2020): applies transformer attention to image patches, with no convolutions. 86M+ parameters.

8. Transfer Learning

Training a large CNN from scratch requires millions of labelled images and significant compute. Transfer learning reuses a CNN pre-trained on a large dataset (e.g., ImageNet with 1.2M images) for a new task:

  1. Take a pre-trained model (e.g., ResNet-50 trained on ImageNet)
  2. Remove the final classification layer
  3. Add new classification layers for your task
  4. Fine-tune: either train only new layers (feature extraction) or train the whole network with a low learning rate

Transfer learning is standard practice — even with only a few hundred images per class, fine-tuning a pre-trained CNN typically outperforms training from scratch with thousands of images. It is the recommended approach for most real-world image classification tasks.

9. Python Code


import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# --- Build a CNN for image classification ---
def build_cnn(input_shape=(32, 32, 3), num_classes=10):
    model = keras.Sequential([
        # Block 1
        layers.Conv2D(32, (3, 3), activation='relu', padding='same', input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),

        # Block 2
        layers.Conv2D(64, (3, 3), activation='relu', padding='same'),
        layers.MaxPooling2D((2, 2)),

        # Block 3
        layers.Conv2D(128, (3, 3), activation='relu', padding='same'),
        layers.MaxPooling2D((2, 2)),

        # Classifier
        layers.Flatten(),
        layers.Dense(256, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation='softmax')
    ])
    return model

model = build_cnn()
model.summary()  # Shows layer shapes and parameter counts

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# --- Train on CIFAR-10 dataset ---
(X_train, y_train), (X_test, y_test) = keras.datasets.cifar10.load_data()
X_train = X_train / 255.0   # Normalise pixel values to [0, 1]
X_test  = X_test  / 255.0

early_stop = keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)
history = model.fit(X_train, y_train,
                    epochs=30,
                    batch_size=64,
                    validation_split=0.1,
                    callbacks=[early_stop])

loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test Accuracy: {accuracy:.3f}")

# --- Transfer Learning with ResNet50 ---
# Note: ResNet50 expects 224x224 inputs, so the 32x32 CIFAR-10 images above
# would need resizing (e.g., tf.image.resize) before training this model.
base_model = keras.applications.ResNet50(
    weights='imagenet',
    include_top=False,
    input_shape=(224, 224, 3)
)
base_model.trainable = False  # Freeze base model weights

transfer_model = keras.Sequential([
    base_model,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation='relu'),
    layers.Dropout(0.3),
    layers.Dense(10, activation='softmax')  # 10 classes
])
transfer_model.compile(optimizer='adam',
                       loss='sparse_categorical_crossentropy',
                       metrics=['accuracy'])

10. Frequently Asked Questions

Can CNNs be used for data other than images?

Yes. CNNs work well on any data with local spatial or temporal structure. 1D CNNs are used for time series, audio, and text. 3D CNNs process video or volumetric medical scans (CT/MRI). The key requirement is that nearby values in the data are more related than distant ones — the assumption of local correlation that makes convolutional filters effective.
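As a sketch of the 1D case, the same sliding dot product applied to a toy time series (plain Python; a real model would use learned filters via e.g. Conv1D layers):

```python
# A toy time series with a sudden level shift at index 5.
series = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# A 1D difference filter: responds where the signal changes level.
kernel = [-1, 0, 1]

F = len(kernel)
feature_map = [
    sum(series[i + k] * kernel[k] for k in range(F))
    for i in range(len(series) - F + 1)
]
print(feature_map)  # non-zero only around the change point
```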

How many filters should each convolutional layer have?

A common pattern is to start with 32 filters in the first layer and double the number with each subsequent layer (32, 64, 128, 256…). This mirrors the increasing complexity of learned features — early layers detect few simple features, later layers detect many complex ones. Adjust based on your dataset size and computational budget.

What is the difference between CNN and Vision Transformer (ViT)?

CNNs use convolutional filters with local receptive fields — they build up global understanding from local patterns. Vision Transformers divide the image into patches and apply self-attention globally — each patch attends to every other patch from the start. ViTs require much more training data to outperform CNNs, but with sufficient data and scale, they match or exceed CNN performance. Hybrid architectures combining both are now common.
