Transfer Learning

Fine-Tuning Pre-Trained Models — Explained for Engineering Students

Last Updated: March 2026

📌 Key Takeaways

  • Definition: Transfer learning reuses a model trained on one task as the starting point for a different but related task.
  • Why it works: Early layers of deep networks learn general features (edges, textures, grammar) that are useful across many tasks.
  • Feature extraction: Freeze pre-trained weights, train only new top layers. Fast, needs little data.
  • Fine-tuning: Unfreeze some layers, continue training with low learning rate. More accurate, needs more data.
  • Key benefit: Achieve state-of-the-art results with hundreds of images instead of millions.
  • Popular models: ResNet, VGG, EfficientNet (images) | BERT, RoBERTa (text) | Whisper (audio).

1. What is Transfer Learning?

Transfer learning is a technique where knowledge gained from training a model on one task is transferred and reused to improve learning on a different but related task.

Instead of training a deep neural network from scratch — which requires massive datasets (ImageNet has 1.2 million images) and weeks of GPU training — you start with a model that has already learned useful representations and adapt it to your specific problem.

Analogy — Learning a Second Language

If you already speak Hindi fluently and want to learn Marathi, you do not start from scratch learning what a noun or verb is. Your knowledge of Hindi grammar, sentence structure, and many shared vocabulary roots transfers directly. You only need to learn what is new and different. Transfer learning works the same way — a model trained on millions of images already knows what edges, textures, and shapes look like. You only need to teach it the difference between your specific categories.

2. Why Transfer Learning Works

Deep neural networks learn hierarchical representations:

  • Early layers learn simple, general features — edges, corners, colour gradients, simple textures. These are useful for virtually any image task.
  • Middle layers learn more complex patterns — textures, object parts, common shapes. Still fairly general.
  • Late layers learn highly task-specific features — “this looks like a cat” or “this is a car wheel”. These are specific to the original training task.

When you transfer a model, you keep the general early/middle layer knowledge and replace or retrain the task-specific late layers. The general features learned from millions of examples give your model a massive head start — even on tasks the original model never saw.

3. Two Approaches — Feature Extraction vs Fine-Tuning

Approach 1 — Feature Extraction

Freeze all pre-trained model weights (make them non-trainable). Remove the original output layer. Add your own new classification layers on top. Train only the new layers on your dataset.

The pre-trained model acts as a fixed feature extractor — it converts raw images into rich feature vectors, and your small classifier learns to map those features to your classes.

Best when: You have very little data (under 1,000 images per class); your dataset is similar to the original training data; you need fast training and quick results.
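The mechanics of feature extraction can be sketched without any deep learning framework. In the toy NumPy example below, a fixed random projection with a ReLU stands in for the frozen pre-trained backbone, and only a small logistic-regression head is trained on top of the extracted features. The dataset, dimensions, and learning rate are all illustrative assumptions, not real model values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen pre-trained backbone: a fixed, non-trainable
# mapping from raw inputs (toy 8-dim "images") to 16-dim feature vectors.
W_frozen = rng.normal(size=(8, 16))

def extract_features(x):
    return np.maximum(x @ W_frozen, 0.0)  # frozen weights + ReLU, never updated

# Toy binary dataset: the class depends on the first input dimension
X = rng.normal(size=(200, 8))
y = (X[:, 0] > 0).astype(float)

# Trainable "top layers": logistic regression on the frozen features
feats = extract_features(X)
w = np.zeros(16)
b = 0.0
lr = 0.5
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))  # sigmoid
    w -= lr * feats.T @ (p - y) / len(y)        # only the head is updated;
    b -= lr * np.mean(p - y)                    # W_frozen stays fixed

p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))
acc = np.mean((p > 0.5) == (y == 1))
print(f"Head-only training accuracy: {acc:.2f}")
```

Even though the backbone here is random rather than pre-trained, the structure is the same: the feature extractor is a fixed function, and gradient descent touches only the new classifier's parameters.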

Approach 2 — Fine-Tuning

Start with feature extraction. Then unfreeze some (or all) of the pre-trained layers. Continue training with a very small learning rate (10–100x smaller than normal). The pre-trained weights are gently adjusted to better fit your specific data.

Key rule: Always use a much lower learning rate for fine-tuning (e.g., 1e-5 instead of 1e-3). Large learning rates will destroy the carefully learned pre-trained representations.

Best when: You have moderate data (1,000–10,000 images); your dataset differs somewhat from the original (different colours, styles, or domain); you need maximum accuracy.
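Why does a large learning rate "destroy" pre-trained weights? A one-dimensional sketch makes it concrete: model the pre-trained weight as sitting near the bottom of a sharp quadratic loss bowl (the curvature value below is an arbitrary assumption for illustration). Gradient descent diverges when the step size is too large for the curvature, and the carefully learned weight is flung away from the optimum:

```python
# Model the loss around a pre-trained weight as a sharp quadratic bowl:
# L(w) = 0.5 * c * (w - w_opt)^2, with large curvature c (illustrative).
c = 5000.0
w_opt = 1.0
w_start = 1.01              # pre-trained weight, already close to optimal

def sgd_distance(w, lr, steps=10):
    for _ in range(steps):
        w -= lr * c * (w - w_opt)   # gradient step on the quadratic loss
    return abs(w - w_opt)           # distance from the optimum afterwards

dist_high = sgd_distance(w_start, lr=1e-3)  # "normal" learning rate
dist_low = sgd_distance(w_start, lr=1e-5)   # fine-tuning learning rate

print(f"lr=1e-3: distance from optimum after 10 steps = {dist_high:.1f}")
print(f"lr=1e-5: distance from optimum after 10 steps = {dist_low:.6f}")
```

With lr=1e-3 the update factor per step is 1 - lr*c = -4, so the weight overshoots and the error explodes; with lr=1e-5 the factor is 0.95 and the weight settles gently toward the optimum. Real loss surfaces are not quadratic, but the intuition carries over.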

| Factor | Feature Extraction | Fine-Tuning |
|---|---|---|
| Data needed | Very little (100–1,000/class) | Moderate (1,000–10,000/class) |
| Training speed | Very fast | Slower |
| Accuracy | Good | Better |
| Risk of overfitting | Low | Higher (use regularisation) |
| Learning rate | Normal (1e-3) | Very low (1e-5 to 1e-4) |

4. When to Use Which Approach — Decision Guide

| Your Dataset Size | Similarity to Source | Recommended Approach |
|---|---|---|
| Small (<1,000/class) | Similar | Feature extraction only |
| Small (<1,000/class) | Different | Feature extraction + tune top layers |
| Medium (1,000–10,000/class) | Similar | Fine-tune last few layers |
| Medium (1,000–10,000/class) | Different | Fine-tune more layers |
| Large (>10,000/class) | Any | Full fine-tuning or train from scratch |
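The decision guide above translates directly into code. The helper below is a hypothetical convenience function (the name and thresholds simply mirror the table, not any library API):

```python
def recommend_approach(images_per_class: int, similar_to_source: bool) -> str:
    """Map dataset size and domain similarity to a transfer learning strategy."""
    if images_per_class < 1000:
        # Small datasets: never unfreeze the whole backbone
        return ("feature extraction only" if similar_to_source
                else "feature extraction + tune top layers")
    if images_per_class <= 10000:
        return ("fine-tune last few layers" if similar_to_source
                else "fine-tune more layers")
    # Large datasets: full fine-tuning (or even training from scratch) is viable
    return "full fine-tuning or train from scratch"

print(recommend_approach(500, similar_to_source=True))
print(recommend_approach(5000, similar_to_source=False))
```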

5. Popular Pre-Trained Models

Computer Vision

| Model | Parameters | Top-5 Accuracy (ImageNet) | Best For |
|---|---|---|---|
| VGG-16 | 138M | 92.7% | Simple baseline, easy to understand |
| ResNet-50 | 25M | 93.9% | General purpose, excellent balance |
| EfficientNet-B0 | 5.3M | 93.3% | Mobile/edge devices, small but accurate |
| EfficientNet-B7 | 66M | 97.1% | Maximum accuracy when compute allows |
| MobileNetV3 | 5.4M | 92.5% | Real-time on mobile devices |

Natural Language Processing

| Model | Parameters | Best For |
|---|---|---|
| BERT-base | 110M | Classification, NER, Q&A |
| RoBERTa | 125M | Improved BERT pre-training |
| DistilBERT | 66M | Faster BERT: 40% smaller, ~97% of the performance |
| GPT-2 | 117M–1.5B | Text generation, open-source |
| mBERT | 179M | Multilingual: 104 languages including Hindi |

6. Python Code


import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.applications import ResNet50, EfficientNetB0

# ============================================================
# APPROACH 1: FEATURE EXTRACTION
# ============================================================
def build_feature_extraction_model(num_classes, input_shape=(224, 224, 3)):
    # Load pre-trained ResNet50 WITHOUT the top classification layer
    base_model = ResNet50(
        weights='imagenet',
        include_top=False,
        input_shape=input_shape
    )
    base_model.trainable = False  # Freeze ALL pre-trained weights

    # Build new model on top. Note: inputs should first be preprocessed
    # with keras.applications.resnet50.preprocess_input (see Section 7).
    model = keras.Sequential([
        base_model,
        layers.GlobalAveragePooling2D(),
        layers.BatchNormalization(),
        layers.Dense(256, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation='softmax')
    ])

    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=1e-3),
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    return model

# ============================================================
# APPROACH 2: FINE-TUNING
# ============================================================
def build_finetune_model(num_classes, input_shape=(224, 224, 3)):
    base_model = EfficientNetB0(
        weights='imagenet',
        include_top=False,
        input_shape=input_shape
    )

    # Step 1: Feature extraction phase
    base_model.trainable = False
    inputs = keras.Input(shape=input_shape)
    x = base_model(inputs, training=False)
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dense(256, activation='relu')(x)
    x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(num_classes, activation='softmax')(x)
    model = keras.Model(inputs, outputs)

    model.compile(optimizer=keras.optimizers.Adam(1e-3),
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

    return model, base_model

# Training workflow
model, base_model = build_finetune_model(num_classes=5)

# Phase 1: Train top layers only (5-10 epochs)
# model.fit(train_data, epochs=10, validation_data=val_data)

# Phase 2: Unfreeze and fine-tune with VERY low learning rate
base_model.trainable = True
# Optionally freeze early layers -- keep first 100 layers frozen
for layer in base_model.layers[:100]:
    layer.trainable = False

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-5),  # 100x smaller!
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)
# model.fit(train_data, epochs=20, validation_data=val_data)

print(f"Total layers: {len(base_model.layers)}")
print(f"Trainable layers: {sum(1 for l in base_model.layers if l.trainable)}")
print(f"Frozen layers: {sum(1 for l in base_model.layers if not l.trainable)}")

# ============================================================
# NLP TRANSFER LEARNING WITH HUGGINGFACE
# ============================================================
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)
import torch

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2  # Binary classification
)

# Tokenise text
texts = ["This product is great!", "Terrible quality."]
inputs = tokenizer(texts, padding=True, truncation=True,
                   max_length=128, return_tensors="pt")

# Forward pass. The new classification head is randomly initialised,
# so predictions are only meaningful after fine-tuning on labelled data.
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=1)
    print("Predictions:", predictions.numpy())
    

7. Common Mistakes Students Make

  • Using too high a learning rate for fine-tuning: A normal learning rate (1e-3) will destroy pre-trained weights in the first few steps. Always use 1e-5 to 1e-4 when fine-tuning.
  • Not preprocessing images to match the pre-trained model’s format: ResNet expects images normalised to ImageNet mean/std. EfficientNet expects pixel values in [0, 255]. Always use the model’s built-in preprocessing function.
  • Fine-tuning when you have very little data: With fewer than 500 images per class, fine-tuning will overfit badly. Use feature extraction only and add strong data augmentation.
  • Forgetting to set base_model.trainable = False before the first training phase: Without freezing, all layers train simultaneously from the start, leading to poor results.
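To make the preprocessing mistake concrete, here is what torchvision-style ImageNet normalisation looks like in plain NumPy: scale pixels to [0, 1], then standardise each RGB channel with the ImageNet mean and standard deviation. (Note that each Keras `applications` model ships its own `preprocess_input`, and some, like Keras's ResNet50, use a different convention, which is exactly why you should call the model's own function rather than hand-rolling this.)

```python
import numpy as np

# ImageNet per-channel statistics (RGB), used by torchvision-style pipelines
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def normalise_imagenet(image_uint8: np.ndarray) -> np.ndarray:
    """Scale [0, 255] pixels to [0, 1], then standardise per channel."""
    x = image_uint8.astype(np.float32) / 255.0
    return (x - IMAGENET_MEAN) / IMAGENET_STD

img = np.full((224, 224, 3), 128, dtype=np.uint8)  # dummy mid-grey image
out = normalise_imagenet(img)
print(out[0, 0])  # standardised (R, G, B) values for one pixel
```

Feeding raw [0, 255] pixels into a backbone that expects standardised inputs (or vice versa) will not crash; it will just silently degrade accuracy, which makes this mistake hard to spot.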

8. Frequently Asked Questions

Do I need a GPU for transfer learning?

For feature extraction, a CPU is often sufficient — you are only training a small classifier on top. For fine-tuning large models, a GPU is strongly recommended. Google Colab provides free GPU access — use it for transfer learning experiments. Even a single T4 GPU makes fine-tuning 10–50x faster than CPU.

Can transfer learning work across very different domains?

Yes, but effectiveness decreases as domains diverge. A model trained on natural photos transfers well to medical X-rays (both are images with spatial structure). A model trained on English text transfers reasonably to other languages. A model trained on images cannot be directly transferred to text — these require separate pre-trained models. The more similar the source and target domains, the more effective the transfer.
