Transformers
The Attention Mechanism Simplified for Engineering Students
Last Updated: March 2026
📌 Key Takeaways
- Origin: Introduced in “Attention Is All You Need” (Vaswani et al., 2017).
- Core mechanism: Self-attention — every token attends to every other token in parallel, capturing long-range dependencies.
- Key advantage over RNNs: Fully parallelisable — much faster to train on GPUs.
- Components: Multi-head self-attention + Feed-forward network + Layer normalisation + Positional encoding.
- Famous models: BERT (encoder), GPT series (decoder), T5 (encoder-decoder), ViT (vision).
- Foundation of LLMs: ChatGPT, Claude, Gemini, Llama are all transformer-based.
1. Why Transformers? — Limitations of RNNs
RNNs had two critical limitations:
- Sequential processing: Step 5 cannot start until step 4 finishes — prevents parallelisation, making training slow.
- Long-range dependencies: Information from early tokens degrades over long sequences even with LSTMs.
Transformers solve both by replacing recurrence with self-attention — processing all tokens simultaneously, any token directly attending to any other regardless of distance.
2. The Attention Mechanism — Intuition
Consider: “The animal didn’t cross the street because it was too tired.”
What does “it” refer to? The animal, not the street. Attention allows the model to assign high weight to “animal” when processing “it”, regardless of their distance in the sentence.
Analogy — The Research Assistant
Imagine writing a report and looking through a stack of reference cards. For each sentence, you decide which cards are most relevant and weight them accordingly. Attention does the same — for each position, it decides which other positions are most relevant and combines their information.
3. Self-Attention — Query, Key, Value
Attention(Q, K, V) = softmax(Q × Kᵀ / √dₖ) × V
| Component | Analogy | Role |
|---|---|---|
| Query (Q) | A search query | “What information am I looking for?” |
| Key (K) | An index card label | “What information do I contain?” |
| Value (V) | The actual content | “Here is my information” |
| Q × Kᵀ | Relevance score | How well does query match each key? |
| / √dₖ | Scaling | Prevents softmax saturation for large dimensions |
| softmax | Normalised weights | Converts scores to weights summing to 1 |
| × V | Weighted combination | Weighted sum of values — the output |
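A minimal NumPy sketch of this formula (the function name, toy shapes, and random inputs are illustrative, not from the paper):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q Kᵀ / √d_k) V for a single attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # relevance of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax: each row sums to 1
    return weights @ V                                   # weighted sum of values

# Toy example: 4 tokens, d_k = 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8): one context-mixed vector per token
```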
4. Multi-Head Attention
MultiHead(Q, K, V) = Concat(head₁, …, headₕ) × W^O
headᵢ = Attention(Q × Wᵢ^Q, K × Wᵢ^K, V × Wᵢ^V)
Different heads learn to attend to different types of relationships simultaneously — syntactic, semantic, positional. The outputs are concatenated and linearly projected for a richer representation.
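A NumPy sketch of this, reusing the scaled_dot_product_attention function from the Section 3 sketch. Slicing one d_model × d_model projection per head is equivalent to the per-head Wᵢ matrices in the formula; the weight names and shapes are illustrative:

```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Project, split into heads, attend per head, concatenate, project back.

    X: (seq_len, d_model); W_q, W_k, W_v, W_o: (d_model, d_model).
    """
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = []
    for h in range(num_heads):
        cols = slice(h * d_head, (h + 1) * d_head)       # this head's slice of the model dimension
        heads.append(scaled_dot_product_attention(Q[:, cols], K[:, cols], V[:, cols]))
    return np.concatenate(heads, axis=-1) @ W_o          # Concat(head_1, ..., head_h) × W^O
```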
5. Positional Encoding
Self-attention is permutation-invariant — it has no inherent sense of order. Positional encodings are added to token embeddings before the first layer to inject position information. The original paper used fixed sinusoidal encodings; BERT and the early GPT models use learned positional embeddings, and more recent LLMs such as Llama use rotary position embeddings (RoPE).
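A NumPy sketch of the sinusoidal encodings used in the original paper (assumes an even d_model):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...) (even d_model assumed)."""
    positions = np.arange(seq_len)[:, None]           # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]          # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions
    return pe

# Added to the token embeddings before the first layer, e.g.:
# x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```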
6. Full Transformer Architecture
Encoder (N identical layers, typically N=6 or 12):
Multi-Head Self-Attention → Add & Norm → Feed-Forward Network → Add & Norm
- Self-attention: each token gathers context from all other tokens
- Feed-forward: two linear layers with ReLU — processes each position independently
- Add & Norm: residual connection + layer normalisation — stabilises training (see the sketch below)
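A minimal sketch of one encoder layer in NumPy, reusing multi_head_attention from the Section 4 sketch; the layer norm omits the learned scale and shift, and all weight names are illustrative:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise each position's features to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def feed_forward(x, W1, b1, W2, b2):
    """Two linear layers with ReLU, applied to each position independently."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def encoder_layer(x, attn_weights, ffn_weights, num_heads):
    """Self-attention and feed-forward sub-layers, each wrapped in Add & Norm."""
    x = layer_norm(x + multi_head_attention(x, *attn_weights, num_heads=num_heads))  # Add & Norm
    x = layer_norm(x + feed_forward(x, *ffn_weights))                                # Add & Norm
    return x
```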
Decoder (N identical layers):
Masked Self-Attention → Add & Norm → Cross-Attention (attends to encoder output) → Add & Norm → Feed-Forward → Add & Norm
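Masked self-attention is ordinary scaled dot-product attention with a causal mask: future positions are set to minus infinity before the softmax, so their weights become zero. A self-contained NumPy sketch:

```python
import numpy as np

def causal_mask(seq_len):
    """Strictly upper-triangular -inf mask: position i may only attend to positions <= i."""
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

def masked_self_attention(Q, K, V):
    """Scaled dot-product attention with future positions masked out."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + causal_mask(Q.shape[0])   # -inf scores become weight 0 after softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```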
7. BERT vs GPT — Encoder vs Decoder
| Feature | BERT | GPT Series |
|---|---|---|
| Architecture | Encoder only | Decoder only |
| Attention direction | Bidirectional | Unidirectional (causal, left-to-right) |
| Pre-training task | Masked Language Modelling | Causal Language Modelling |
| Best for | Understanding: classification, NER, Q&A | Generation: text completion, dialogue, code |
| Examples | BERT, RoBERTa, DistilBERT | GPT-4, Claude, Gemini, Llama (all decoder-only) |
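The two pre-training objectives can be compared side by side with the Hugging Face transformers library; a sketch assuming transformers and torch are installed, using the standard public checkpoints bert-base-uncased and gpt2:

```python
from transformers import pipeline

# BERT-style: masked language modelling, predicting a blanked-out token using context from both sides
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The capital of France is [MASK].")[0]["token_str"])   # e.g. "paris"

# GPT-style: causal language modelling, continuing the text left to right
generate = pipeline("text-generation", model="gpt2")
print(generate("The capital of France is", max_new_tokens=5)[0]["generated_text"])
```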
8. Applications
| Application | Model Family |
|---|---|
| Text classification, sentiment analysis | BERT, RoBERTa |
| Machine translation | T5, BART, mBART |
| Text generation, chatbots | GPT-4, Claude, Gemini, Llama |
| Code generation | GitHub Copilot, CodeLlama |
| Image recognition | Vision Transformer (ViT) |
| Protein structure prediction | AlphaFold 2 |
9. Common Mistakes Students Make
- Thinking attention replaces all architectures: CNNs are still widely used for images; RNNs/LSTMs for streaming data and edge devices.
- Confusing self-attention with cross-attention: Self-attention attends within the same sequence. Cross-attention (decoder) attends to a different sequence (encoder output).
- Underestimating computational cost: Self-attention has O(n²) complexity with sequence length n. Doubling the sequence length quadruples computation.
10. Frequently Asked Questions
Do I need to understand transformers to use ChatGPT or Claude?
No — you can use LLMs effectively with just prompt engineering knowledge. But understanding transformers helps you understand why these models have certain strengths and limitations: context window limits, why they can hallucinate, and why in-context learning works.
What is a Large Language Model (LLM)?
An LLM is a transformer-based model trained on massive text datasets to predict the next token. Scale gives them emergent abilities: reasoning, code generation, instruction following. GPT-4, Claude, Gemini, and Llama are all LLMs.
What is prompt engineering?
Prompt engineering is designing input text to elicit specific, high-quality outputs from LLMs. Techniques include: chain-of-thought prompting, few-shot prompting, role assignment, and structured output requests.
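For example, a minimal few-shot prompt might look like the sketch below; the task and wording are illustrative, not taken from any particular guide:

```python
# A hypothetical few-shot prompt: two labelled examples, then the case to classify
prompt = """Classify the sentiment of each review as Positive or Negative.

Review: "The battery lasts all day." -> Positive
Review: "The screen cracked within a week." -> Negative
Review: "Setup took five minutes and everything just worked." ->"""
# The model is expected to continue the pattern, here with " Positive".
```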