Transformers

The Attention Mechanism Simplified for Engineering Students

Last Updated: March 2026

📌 Key Takeaways

  • Origin: Introduced in “Attention Is All You Need” (Vaswani et al., 2017).
  • Core mechanism: Self-attention — every token attends to every other token in parallel, capturing long-range dependencies.
  • Key advantage over RNNs: Fully parallelisable — much faster to train on GPUs.
  • Components: Multi-head self-attention + Feed-forward network + Layer normalisation + Positional encoding.
  • Famous models: BERT (encoder), GPT series (decoder), T5 (encoder-decoder), ViT (vision).
  • Foundation of LLMs: ChatGPT, Claude, Gemini, Llama are all transformer-based.

1. Why Transformers? — Limitations of RNNs

RNNs had two critical limitations:

  1. Sequential processing: Step 5 cannot start until step 4 finishes — prevents parallelisation, making training slow.
  2. Long-range dependencies: Information from early tokens degrades over long sequences even with LSTMs.

Transformers solve both by replacing recurrence with self-attention: all tokens are processed simultaneously, and any token can attend directly to any other, regardless of distance.

2. The Attention Mechanism — Intuition

Consider: “The animal didn’t cross the street because it was too tired.”

What does “it” refer to? The animal, not the street. Attention allows the model to assign high weight to “animal” when processing “it”, regardless of their distance in the sentence.

Analogy — The Research Assistant

Imagine writing a report and looking through a stack of reference cards. For each sentence, you decide which cards are most relevant and weight them accordingly. Attention does the same — for each position, it decides which other positions are most relevant and combines their information.

3. Self-Attention — Query, Key, Value

Attention(Q, K, V) = softmax(Q × Kᵀ / √dₖ) × V

| Component | Analogy | Role |
|---|---|---|
| Query (Q) | A search query | “What information am I looking for?” |
| Key (K) | An index card label | “What information do I contain?” |
| Value (V) | The actual content | “Here is my information” |
| Q × Kᵀ | Relevance score | How well does query match each key? |
| / √dₖ | Scaling | Prevents softmax saturation for large dimensions |
| softmax | Normalised weights | Converts scores to weights summing to 1 |
| × V | Weighted combination | Weighted sum of values — the output |
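The formula above can be sketched directly in NumPy. This is a minimal illustration with random matrices, not a trained model; the shapes and variable names are chosen purely for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # Q × Kᵀ / √dₖ : relevance scores
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights         # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))  # 4 positions, dₖ = 8
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
output, weights = attention(Q, K, V)
```

Each row of `weights` tells you how strongly that position draws on every other position.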

4. Multi-Head Attention

MultiHead(Q, K, V) = Concat(head₁, …, headₕ) × W^O

headᵢ = Attention(Q × Wᵢ^Q, K × Wᵢ^K, V × Wᵢ^V)

Different heads learn to attend to different types of relationships simultaneously — syntactic, semantic, positional. The outputs are concatenated and linearly projected for a richer representation.
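The concat-and-project pattern can be sketched as follows. The weight matrices here are random stand-ins for what would be learned parameters, so this only demonstrates the shapes and data flow:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head(X, n_heads=4, d_model=16, seed=1):
    d_head = d_model // n_heads  # each head works in a smaller subspace
    rng = np.random.default_rng(seed)
    heads = []
    for _ in range(n_heads):
        # Per-head projections Wᵢ^Q, Wᵢ^K, Wᵢ^V (random here, learned in practice)
        Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        heads.append(attention(X @ Wq, X @ Wk, X @ Wv))
    W_o = rng.normal(size=(d_model, d_model))
    return np.concatenate(heads, axis=-1) @ W_o  # Concat(head₁…headₕ) × W^O

X = np.random.default_rng(0).normal(size=(5, 16))  # 5 tokens, d_model = 16
Y = multi_head(X)
```

Note that splitting d_model across heads keeps the total cost roughly the same as one full-width head.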

5. Positional Encoding

Self-attention is permutation-invariant — it has no inherent sense of order. Positional encodings are added to token embeddings before the first layer to inject position information. The original paper used sinusoidal encodings; modern models (BERT, GPT) use learned positional embeddings.
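A sketch of the original sinusoidal scheme (sine on even dimensions, cosine on odd ones), assuming an even d_model for simplicity:

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    # d_model is assumed even: half the dims get sin, half get cos
    pos = np.arange(seq_len)[:, None]       # (seq_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]   # even dimension indices
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)            # even dims: sine
    pe[:, 1::2] = np.cos(angles)            # odd dims: cosine
    return pe

pe = sinusoidal_pe(10, 16)  # added to the 10 token embeddings elementwise
```

Each position gets a unique pattern across dimensions, and nearby positions get similar patterns, which is what lets attention recover order.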

6. Full Transformer Architecture

Encoder (N identical layers, typically N=6 or 12):

Multi-Head Self-Attention → Add & Norm → Feed-Forward Network → Add & Norm

  • Self-attention: each token gathers context from all other tokens
  • Feed-forward: two linear layers with ReLU — processes each position independently
  • Add & Norm: residual connection + layer normalisation — stabilises training
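The Add & Norm and feed-forward steps above can be sketched as follows; the layer shapes are illustrative, and the weights would be learned in a real model:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise each position's vector to zero mean, unit variance
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def add_and_norm(x, sublayer_out):
    # Residual connection, then layer normalisation
    return layer_norm(x + sublayer_out)

def feed_forward(x, W1, b1, W2, b2):
    # Two linear layers with ReLU, applied to each position independently
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))                      # 5 tokens, d_model = 16
W1, b1 = rng.normal(size=(16, 64)), np.zeros(64)  # expand to d_ff = 64
W2, b2 = rng.normal(size=(64, 16)), np.zeros(16)  # project back to d_model
out = add_and_norm(x, feed_forward(x, W1, b1, W2, b2))
```

The residual path means each sublayer only has to learn a refinement of its input, which is a large part of why deep stacks of these layers train stably.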

Decoder (N identical layers):

Masked Self-Attention → Add & Norm → Cross-Attention (attends to encoder output) → Add & Norm → Feed-Forward → Add & Norm
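Masked self-attention simply adds a causal mask to the scores before the softmax, so a position cannot see later positions. A minimal sketch:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_mask(n):
    # -inf above the diagonal: position i may not attend to positions j > i
    return np.triu(np.full((n, n), -np.inf), k=1)

scores = np.zeros((4, 4))  # pretend all raw scores are equal
weights = softmax(scores + causal_mask(4))
# Row 0 attends only to itself; row 3 attends uniformly to all four positions
```

Since exp(-inf) is 0, masked positions receive exactly zero weight after the softmax.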

7. BERT vs GPT — Encoder vs Decoder

| Feature | BERT | GPT Series |
|---|---|---|
| Architecture | Encoder only | Decoder only |
| Attention direction | Bidirectional | Unidirectional (left context only) |
| Pre-training task | Masked Language Modelling | Causal Language Modelling |
| Best for | Understanding: classification, NER, Q&A | Generation: text completion, dialogue, code |
| Examples | BERT, RoBERTa, DistilBERT | GPT-4, Claude, Gemini, Llama |

8. Applications

| Application | Model Family |
|---|---|
| Text classification, sentiment analysis | BERT, RoBERTa |
| Machine translation | T5, BART, mBART |
| Text generation, chatbots | GPT-4, Claude, Gemini, Llama |
| Code generation | GitHub Copilot, CodeLlama |
| Image recognition | Vision Transformer (ViT) |
| Protein structure prediction | AlphaFold 2 |

9. Common Mistakes Students Make

  • Thinking attention replaces all architectures: CNNs are still widely used for images; RNNs/LSTMs for streaming data and edge devices.
  • Confusing self-attention with cross-attention: Self-attention attends within the same sequence. Cross-attention (decoder) attends to a different sequence (encoder output).
  • Underestimating computational cost: Self-attention has O(n²) complexity with sequence length n. Doubling the sequence length quadruples computation.
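The quadratic cost in the last point is easy to check: the attention score matrix has n × n entries, so doubling n quadruples the work (and the memory for the score matrix, here assuming float32):

```python
# n × n score matrix entries per head for a few sequence lengths
for n in (1024, 2048, 4096):
    entries = n * n
    mib = entries * 4 / 2**20  # 4 bytes per float32 score
    print(f"n={n}: {entries:,} scores ≈ {mib:.0f} MiB per head")
```

This is why long-context models rely on tricks beyond vanilla attention.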

10. Frequently Asked Questions

Do I need to understand transformers to use ChatGPT or Claude?

No — you can use LLMs effectively with just prompt engineering knowledge. But understanding transformers helps you understand why these models have certain strengths and limitations: context window limits, why they can hallucinate, and why in-context learning works.

What is a Large Language Model (LLM)?

An LLM is a transformer-based model trained on massive text datasets to predict the next token. Scale gives them emergent abilities: reasoning, code generation, instruction following. GPT-4, Claude, Gemini, and Llama are all LLMs.

What is prompt engineering?

Prompt engineering is designing input text to elicit specific, high-quality outputs from LLMs. Techniques include: chain-of-thought prompting, few-shot prompting, role assignment, and structured output requests.

Next Steps