Transfer Learning

Fine-Tuning Pre-Trained Models — Explained for Engineering Students

Last Updated: March 2026

📌 Key Takeaways

  • Definition: Transfer learning reuses a model trained on one task as the starting point for a different but related task.
  • Why it works: Early layers of deep networks learn general features (edges, textures, grammar) that are useful across many tasks.
  • Feature extraction: Freeze pre-trained weights, train only new top layers. Fast, needs little data.
  • Fine-tuning: Unfreeze some layers, continue training with low learning rate. More accurate, needs more data.
  • Key benefit: Achieve state-of-the-art results with hundreds of images instead of millions.
  • Popular models: ResNet, VGG, EfficientNet (images) | BERT, RoBERTa (text) | Whisper (audio).

1. What is Transfer Learning?

Transfer learning is a technique where knowledge gained from training a model on one task is transferred and reused to improve learning on a different but related task.

Instead of training a deep neural network from scratch — which requires massive datasets (ImageNet has 1.2 million images) and weeks of GPU training — you start with a model that has already learned useful representations and adapt it to your specific problem.

Analogy — Learning a Second Language

If you already speak Hindi fluently and want to learn Marathi, you do not start from scratch learning what a noun or verb is. Your knowledge of Hindi grammar, sentence structure, and many shared vocabulary roots transfers directly. You only need to learn what is new and different. Transfer learning works the same way — a model trained on millions of images already knows what edges, textures, and shapes look like. You only need to teach it the difference between your specific categories.

2. Why Transfer Learning Works

Deep neural networks learn hierarchical representations:

  • Early layers learn simple, general features — edges, corners, colour gradients, simple textures. These are useful for virtually any image task.
  • Middle layers learn more complex patterns — textures, object parts, common shapes. Still fairly general.
  • Late layers learn highly task-specific features — “this looks like a cat” or “this is a car wheel”. These are specific to the original training task.

When you transfer a model, you keep the general early/middle layer knowledge and replace or retrain the task-specific late layers. The general features learned from millions of examples give your model a massive head start — even on tasks the original model never saw.

3. Two Approaches — Feature Extraction vs Fine-Tuning

Approach 1 — Feature Extraction

Freeze all pre-trained model weights (make them non-trainable). Remove the original output layer. Add your own new classification layers on top. Train only the new layers on your dataset.

The pre-trained model acts as a fixed feature extractor — it converts raw images into rich feature vectors, and your small classifier learns to map those features to your classes.

Best when: You have very little data (under 1,000 images per class); your dataset is similar to the original training data; you need fast training and quick results.
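The mechanics of feature extraction can be sketched without any deep learning framework. In the toy NumPy example below, a fixed random projection with a ReLU stands in for the frozen pre-trained backbone, and only a small logistic-regression head is trained on top of the extracted features. The dataset, dimensions, and learning rate are all illustrative assumptions, not real model values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen pre-trained backbone: a fixed, non-trainable
# mapping from raw inputs (toy 8-dim "images") to 16-dim feature vectors.
W_frozen = rng.normal(size=(8, 16))

def extract_features(x):
    return np.maximum(x @ W_frozen, 0.0)  # frozen weights + ReLU, never updated

# Toy binary dataset: the class depends on the first input dimension
X = rng.normal(size=(200, 8))
y = (X[:, 0] > 0).astype(float)

# Trainable "top layers": logistic regression on the frozen features
feats = extract_features(X)
w = np.zeros(16)
b = 0.0
lr = 0.5
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))  # sigmoid
    w -= lr * feats.T @ (p - y) / len(y)        # only the head is updated;
    b -= lr * np.mean(p - y)                    # W_frozen stays fixed

p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))
acc = np.mean((p > 0.5) == (y == 1))
print(f"Head-only training accuracy: {acc:.2f}")
```

Even though the backbone here is random rather than pre-trained, the structure is the same: the feature extractor is a fixed function, and gradient descent touches only the new classifier's parameters.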

Approach 2 — Fine-Tuning

Start with feature extraction. Then unfreeze some (or all) of the pre-trained layers. Continue training with a very small learning rate (10–100x smaller than normal). The pre-trained weights are gently adjusted to better fit your specific data.

Key rule: Always use a much lower learning rate for fine-tuning (e.g., 1e-5 instead of 1e-3). Large learning rates will destroy the carefully learned pre-trained representations.

Best when: You have moderate data (1,000–10,000 images); your dataset differs somewhat from the original (different colours, styles, or domain); you need maximum accuracy.
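Why does a large learning rate "destroy" pre-trained weights? A one-dimensional sketch makes it concrete: model the pre-trained weight as sitting near the bottom of a sharp quadratic loss bowl (the curvature value below is an arbitrary assumption for illustration). Gradient descent diverges when the step size is too large for the curvature, and the carefully learned weight is flung away from the optimum:

```python
# Model the loss around a pre-trained weight as a sharp quadratic bowl:
# L(w) = 0.5 * c * (w - w_opt)^2, with large curvature c (illustrative).
c = 5000.0
w_opt = 1.0
w_start = 1.01              # pre-trained weight, already close to optimal

def sgd_distance(w, lr, steps=10):
    for _ in range(steps):
        w -= lr * c * (w - w_opt)   # gradient step on the quadratic loss
    return abs(w - w_opt)           # distance from the optimum afterwards

dist_high = sgd_distance(w_start, lr=1e-3)  # "normal" learning rate
dist_low = sgd_distance(w_start, lr=1e-5)   # fine-tuning learning rate

print(f"lr=1e-3: distance from optimum after 10 steps = {dist_high:.1f}")
print(f"lr=1e-5: distance from optimum after 10 steps = {dist_low:.6f}")
```

With lr=1e-3 the update factor per step is 1 - lr*c = -4, so the weight overshoots and the error explodes; with lr=1e-5 the factor is 0.95 and the weight settles gently toward the optimum. Real loss surfaces are not quadratic, but the intuition carries over.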

| Factor | Feature Extraction | Fine-Tuning |
|---|---|---|
| Data needed | Very little (100–1,000/class) | Moderate (1,000–10,000/class) |
| Training speed | Very fast | Slower |
| Accuracy | Good | Better |
| Risk of overfitting | Low | Higher (use regularisation) |
| Learning rate | Normal (1e-3) | Very low (1e-5 to 1e-4) |

4. When to Use Which Approach — Decision Guide

| Your Dataset Size | Similarity to Source | Recommended Approach |
|---|---|---|
| Small (<1,000/class) | Similar | Feature extraction only |
| Small (<1,000/class) | Different | Feature extraction + tune top layers |
| Medium (1,000–10,000/class) | Similar | Fine-tune last few layers |
| Medium (1,000–10,000/class) | Different | Fine-tune more layers |
| Large (>10,000/class) | Any | Full fine-tuning or train from scratch |
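The decision guide above translates directly into code. The helper below is a hypothetical convenience function (the name and thresholds simply mirror the table, not any library API):

```python
def recommend_approach(images_per_class: int, similar_to_source: bool) -> str:
    """Map dataset size and domain similarity to a transfer learning strategy."""
    if images_per_class < 1000:
        # Small datasets: never unfreeze the whole backbone
        return ("feature extraction only" if similar_to_source
                else "feature extraction + tune top layers")
    if images_per_class <= 10000:
        return ("fine-tune last few layers" if similar_to_source
                else "fine-tune more layers")
    # Large datasets: full fine-tuning (or even training from scratch) is viable
    return "full fine-tuning or train from scratch"

print(recommend_approach(500, similar_to_source=True))
print(recommend_approach(5000, similar_to_source=False))
```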

5. Popular Pre-Trained Models

Computer Vision

| Model | Parameters | Top-5 Accuracy (ImageNet) | Best For |
|---|---|---|---|
| VGG-16 | 138M | 92.7% | Simple baseline, easy to understand |
| ResNet-50 | 25M | 93.9% | General purpose, excellent balance |
| EfficientNet-B0 | 5.3M | 93.3% | Mobile/edge devices, small but accurate |
| EfficientNet-B7 | 66M | 97.1% | Maximum accuracy when compute allows |
| MobileNetV3 | 5.4M | 92.5% | Real-time on mobile devices |

Natural Language Processing

| Model | Parameters | Best For |
|---|---|---|
| BERT-base | 110M | Classification, NER, Q&A |
| RoBERTa | 125M | Improved BERT pre-training |
| DistilBERT | 66M | Faster BERT: 40% smaller, ~97% of the performance |
| GPT-2 | 117M–1.5B | Text generation, open-source |
| mBERT | 179M | Multilingual: 104 languages including Hindi |

6. Python Code


import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.applications import ResNet50, EfficientNetB0

# ============================================================
# APPROACH 1: FEATURE EXTRACTION
# ============================================================
def build_feature_extraction_model(num_classes, input_shape=(224, 224, 3)):
    # Load pre-trained ResNet50 WITHOUT the top classification layer
    base_model = ResNet50(
        weights='imagenet',
        include_top=False,
        input_shape=input_shape
    )
    base_model.trainable = False  # Freeze ALL pre-trained weights

    # Build new model on top. Note: inputs should first be preprocessed
    # with keras.applications.resnet50.preprocess_input (see Section 7).
    model = keras.Sequential([
        base_model,
        layers.GlobalAveragePooling2D(),
        layers.BatchNormalization(),
        layers.Dense(256, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation='softmax')
    ])

    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=1e-3),
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    return model

# ============================================================
# APPROACH 2: FINE-TUNING
# ============================================================
def build_finetune_model(num_classes, input_shape=(224, 224, 3)):
    base_model = EfficientNetB0(
        weights='imagenet',
        include_top=False,
        input_shape=input_shape
    )

    # Step 1: Feature extraction phase
    base_model.trainable = False
    inputs = keras.Input(shape=input_shape)
    x = base_model(inputs, training=False)
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dense(256, activation='relu')(x)
    x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(num_classes, activation='softmax')(x)
    model = keras.Model(inputs, outputs)

    model.compile(optimizer=keras.optimizers.Adam(1e-3),
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

    return model, base_model

# Training workflow
model, base_model = build_finetune_model(num_classes=5)

# Phase 1: Train top layers only (5-10 epochs)
# model.fit(train_data, epochs=10, validation_data=val_data)

# Phase 2: Unfreeze and fine-tune with VERY low learning rate
base_model.trainable = True
# Optionally freeze early layers -- keep first 100 layers frozen
for layer in base_model.layers[:100]:
    layer.trainable = False

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-5),  # 100x smaller!
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)
# model.fit(train_data, epochs=20, validation_data=val_data)

print(f"Total layers: {len(base_model.layers)}")
print(f"Trainable layers: {sum(1 for l in base_model.layers if l.trainable)}")
print(f"Frozen layers: {sum(1 for l in base_model.layers if not l.trainable)}")

# ============================================================
# NLP TRANSFER LEARNING WITH HUGGINGFACE
# ============================================================
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)
import torch

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2  # Binary classification
)

# Tokenise text
texts = ["This product is great!", "Terrible quality."]
inputs = tokenizer(texts, padding=True, truncation=True,
                   max_length=128, return_tensors="pt")

# Forward pass. The new classification head is randomly initialised,
# so predictions are only meaningful after fine-tuning on labelled data.
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=1)
    print("Predictions:", predictions.numpy())
    

7. Common Mistakes Students Make

  • Using too high a learning rate for fine-tuning: A normal learning rate (1e-3) will destroy pre-trained weights in the first few steps. Always use 1e-5 to 1e-4 when fine-tuning.
  • Not preprocessing images to match the pre-trained model’s format: ResNet expects images normalised to ImageNet mean/std. EfficientNet expects pixel values in [0, 255]. Always use the model’s built-in preprocessing function.
  • Fine-tuning when you have very little data: With fewer than 500 images per class, fine-tuning will overfit badly. Use feature extraction only and add strong data augmentation.
  • Forgetting to set base_model.trainable = False before the first training phase: Without freezing, all layers train simultaneously from the start, leading to poor results.
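To make the preprocessing mistake concrete, here is what torchvision-style ImageNet normalisation looks like in plain NumPy: scale pixels to [0, 1], then standardise each RGB channel with the ImageNet mean and standard deviation. (Note that each Keras `applications` model ships its own `preprocess_input`, and some, like Keras's ResNet50, use a different convention, which is exactly why you should call the model's own function rather than hand-rolling this.)

```python
import numpy as np

# ImageNet per-channel statistics (RGB), used by torchvision-style pipelines
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def normalise_imagenet(image_uint8: np.ndarray) -> np.ndarray:
    """Scale [0, 255] pixels to [0, 1], then standardise per channel."""
    x = image_uint8.astype(np.float32) / 255.0
    return (x - IMAGENET_MEAN) / IMAGENET_STD

img = np.full((224, 224, 3), 128, dtype=np.uint8)  # dummy mid-grey image
out = normalise_imagenet(img)
print(out[0, 0])  # standardised (R, G, B) values for one pixel
```

Feeding raw [0, 255] pixels into a backbone that expects standardised inputs (or vice versa) will not crash; it will just silently degrade accuracy, which makes this mistake hard to spot.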

8. Frequently Asked Questions

Do I need a GPU for transfer learning?

For feature extraction, a CPU is often sufficient — you are only training a small classifier on top. For fine-tuning large models, a GPU is strongly recommended. Google Colab provides free GPU access — use it for transfer learning experiments. Even a single T4 GPU makes fine-tuning 10–50x faster than CPU.

Can transfer learning work across very different domains?

Yes, but effectiveness decreases as domains diverge. A model trained on natural photos transfers well to medical X-rays (both are images with spatial structure). A model trained on English text transfers reasonably to other languages. A model trained on images cannot be directly transferred to text — these require separate pre-trained models. The more similar the source and target domains, the more effective the transfer.
