Transformers
The Attention Mechanism Simplified for Engineering Students
Last Updated: March 2026
📌 Key Takeaways
- Origin: Introduced in “Attention Is All You Need” (Vaswani et al., 2017).
- Core mechanism: Self-attention — every token attends to every other token in parallel, capturing long-range dependencies.
- Key advantage over RNNs: Fully parallelisable — much faster to train on GPUs.
- Components: Multi-head self-attention + Feed-forward network + Layer normalisation + Positional encoding.
- Famous models: BERT (encoder), GPT series (decoder), T5 (encoder-decoder), ViT (vision).
- Foundation of LLMs: ChatGPT, Claude, Gemini, Llama are all transformer-based.
1. Why Transformers? — Limitations of RNNs
RNNs had two critical limitations:
- Sequential processing: Step 5 cannot start until step 4 finishes — prevents parallelisation, making training slow.
- Long-range dependencies: Information from early tokens degrades over long sequences even with LSTMs.
Transformers solve both by replacing recurrence with self-attention — processing all tokens simultaneously, any token directly attending to any other regardless of distance.
2. The Attention Mechanism — Intuition
Consider: “The animal didn’t cross the street because it was too tired.”
What does “it” refer to? The animal, not the street. Attention allows the model to assign high weight to “animal” when processing “it”, regardless of their distance in the sentence.
Analogy — The Research Assistant
Imagine writing a report and looking through a stack of reference cards. For each sentence, you decide which cards are most relevant and weight them accordingly. Attention does the same — for each position, it decides which other positions are most relevant and combines their information.
3. Self-Attention — Query, Key, Value
Attention(Q, K, V) = softmax(Q × Kᵀ / √dₖ) × V
| Component | Analogy | Role |
|---|---|---|
| Query (Q) | A search query | “What information am I looking for?” |
| Key (K) | An index card label | “What information do I contain?” |
| Value (V) | The actual content | “Here is my information” |
| Q × Kᵀ | Relevance score | How well does query match each key? |
| / √dₖ | Scaling | Prevents softmax saturation for large dimensions |
| softmax | Normalised weights | Converts scores to weights summing to 1 |
| × V | Weighted combination | Weighted sum of values — the output |
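A minimal NumPy sketch of this formula (the function name, toy shapes, and random inputs are illustrative, not from the paper):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q Kᵀ / √d_k) V for a single attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # relevance of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax: each row sums to 1
    return weights @ V                                   # weighted sum of values

# Toy example: 4 tokens, d_k = 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8): one context-mixed vector per token
```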
4. Multi-Head Attention
MultiHead(Q, K, V) = Concat(head₁, …, headₕ) × W^O
headᵢ = Attention(Q × Wᵢ^Q, K × Wᵢ^K, V × Wᵢ^V)
Different heads learn to attend to different types of relationships simultaneously — syntactic, semantic, positional. The outputs are concatenated and linearly projected for a richer representation.
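A NumPy sketch of this, reusing the scaled_dot_product_attention function from the Section 3 sketch. Slicing one d_model × d_model projection per head is equivalent to the per-head Wᵢ matrices in the formula; the weight names and shapes are illustrative:

```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Project, split into heads, attend per head, concatenate, project back.

    X: (seq_len, d_model); W_q, W_k, W_v, W_o: (d_model, d_model).
    """
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = []
    for h in range(num_heads):
        cols = slice(h * d_head, (h + 1) * d_head)       # this head's slice of the model dimension
        heads.append(scaled_dot_product_attention(Q[:, cols], K[:, cols], V[:, cols]))
    return np.concatenate(heads, axis=-1) @ W_o          # Concat(head_1, ..., head_h) × W^O
```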
5. Positional Encoding
Self-attention is permutation-invariant — it has no inherent sense of order. Positional encodings are added to token embeddings before the first layer to inject position information. The original paper used fixed sinusoidal encodings; BERT and the early GPT models use learned positional embeddings, and more recent LLMs such as Llama use rotary position embeddings (RoPE).
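A NumPy sketch of the sinusoidal encodings used in the original paper (assumes an even d_model):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...) (even d_model assumed)."""
    positions = np.arange(seq_len)[:, None]           # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]          # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions
    return pe

# Added to the token embeddings before the first layer, e.g.:
# x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```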
6. Full Transformer Architecture
Encoder (N identical layers, typically N=6 or 12):
Multi-Head Self-Attention → Add & Norm → Feed-Forward Network → Add & Norm
- Self-attention: each token gathers context from all other tokens
- Feed-forward: two linear layers with ReLU — processes each position independently
- Add & Norm: residual connection + layer normalisation — stabilises training (see the sketch below)
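A minimal sketch of one encoder layer in NumPy, reusing multi_head_attention from the Section 4 sketch; the layer norm omits the learned scale and shift, and all weight names are illustrative:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise each position's features to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def feed_forward(x, W1, b1, W2, b2):
    """Two linear layers with ReLU, applied to each position independently."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def encoder_layer(x, attn_weights, ffn_weights, num_heads):
    """Self-attention and feed-forward sub-layers, each wrapped in Add & Norm."""
    x = layer_norm(x + multi_head_attention(x, *attn_weights, num_heads=num_heads))  # Add & Norm
    x = layer_norm(x + feed_forward(x, *ffn_weights))                                # Add & Norm
    return x
```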
Decoder (N identical layers):
Masked Self-Attention → Add & Norm → Cross-Attention (attends to encoder output) → Add & Norm → Feed-Forward → Add & Norm
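Masked self-attention is ordinary scaled dot-product attention with a causal mask: future positions are set to minus infinity before the softmax, so their weights become zero. A self-contained NumPy sketch:

```python
import numpy as np

def causal_mask(seq_len):
    """Strictly upper-triangular -inf mask: position i may only attend to positions <= i."""
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

def masked_self_attention(Q, K, V):
    """Scaled dot-product attention with future positions masked out."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + causal_mask(Q.shape[0])   # -inf scores become weight 0 after softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```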
7. BERT vs GPT — Encoder vs Decoder
| Feature | BERT | GPT Series |
|---|---|---|
| Architecture | Encoder only | Decoder only |
| Attention direction | Bidirectional | Unidirectional (causal, left-to-right) |
| Pre-training task | Masked Language Modelling | Causal Language Modelling |
| Best for | Understanding: classification, NER, Q&A | Generation: text completion, dialogue, code |
| Examples | BERT, RoBERTa, DistilBERT | GPT-4, Claude, Gemini, Llama (all decoder-only) |
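The two pre-training objectives can be compared side by side with the Hugging Face transformers library; a sketch assuming transformers and torch are installed, using the standard public checkpoints bert-base-uncased and gpt2:

```python
from transformers import pipeline

# BERT-style: masked language modelling, predicting a blanked-out token using context from both sides
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The capital of France is [MASK].")[0]["token_str"])   # e.g. "paris"

# GPT-style: causal language modelling, continuing the text left to right
generate = pipeline("text-generation", model="gpt2")
print(generate("The capital of France is", max_new_tokens=5)[0]["generated_text"])
```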
8. Applications
| Application | Model Family |
|---|---|
| Text classification, sentiment analysis | BERT, RoBERTa |
| Machine translation | T5, BART, mBART |
| Text generation, chatbots | GPT-4, Claude, Gemini, Llama |
| Code generation | GitHub Copilot, CodeLlama |
| Image recognition | Vision Transformer (ViT) |
| Protein structure prediction | AlphaFold 2 |
9. Common Mistakes Students Make
- Thinking attention replaces all architectures: CNNs are still widely used for images; RNNs/LSTMs for streaming data and edge devices.
- Confusing self-attention with cross-attention: Self-attention attends within the same sequence. Cross-attention (decoder) attends to a different sequence (encoder output).
- Underestimating computational cost: Self-attention has O(n²) complexity with sequence length n. Doubling the sequence length quadruples computation.
10. Frequently Asked Questions
Do I need to understand transformers to use ChatGPT or Claude?
No — you can use LLMs effectively with just prompt engineering knowledge. But understanding transformers helps you understand why these models have certain strengths and limitations: context window limits, why they can hallucinate, and why in-context learning works.
What is a Large Language Model (LLM)?
An LLM is a transformer-based model trained on massive text datasets to predict the next token. Scale gives them emergent abilities: reasoning, code generation, instruction following. GPT-4, Claude, Gemini, and Llama are all LLMs.
What is prompt engineering?
Prompt engineering is designing input text to elicit specific, high-quality outputs from LLMs. Techniques include: chain-of-thought prompting, few-shot prompting, role assignment, and structured output requests.
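For example, a minimal few-shot prompt might look like the sketch below; the task and wording are illustrative, not taken from any particular guide:

```python
# A hypothetical few-shot prompt: two labelled examples, then the case to classify
prompt = """Classify the sentiment of each review as Positive or Negative.

Review: "The battery lasts all day." -> Positive
Review: "The screen cracked within a week." -> Negative
Review: "Setup took five minutes and everything just worked." ->"""
# The model is expected to continue the pattern, here with " Positive".
```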