Large Language Models (LLMs)
GPT, BERT & Claude Simplified for Engineering Students
Last Updated: March 2026
📌 Key Takeaways
- Definition: LLMs are transformer-based models trained on massive text corpora — billions of parameters, trillions of tokens — to predict and generate text.
- Training: Pre-training on next-token prediction → fine-tuning with human feedback (RLHF) for helpful, harmless, honest behaviour.
- Emergent abilities: At sufficient scale, LLMs develop capabilities not explicitly trained — reasoning, code generation, translation, summarisation.
- GPT (decoder): Best for generation — chatbots, code, creative writing.
- BERT (encoder): Best for understanding — classification, NER, Q&A.
- Prompt engineering: How you phrase your input dramatically affects output quality.
1. What are Large Language Models?
A Large Language Model is a transformer-based neural network trained on massive amounts of text data to learn the statistical patterns of language — enabling it to understand, generate, and reason about text at a human-competitive level.
The word “large” refers to two dimensions of scale: model size (billions to hundreds of billions of parameters) and training data (hundreds of billions to trillions of tokens — tokens are roughly words or word pieces).
The core task during pre-training is simple: given a sequence of text, predict the next token. But at sufficient scale, this simple objective leads to a model that develops a surprisingly rich internal representation of language, facts, reasoning patterns, and even code — all learned implicitly from the statistics of human-written text.
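The next-token objective can be sketched with a toy bigram model in plain Python: count which token follows which, then predict the most frequent follower. (This is only an illustration of the objective; real LLMs learn these statistics with a neural network over subword tokens, not a lookup table, and the corpus here is made up.)

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each token follows each preceding token.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(token):
    """Return the most frequent next token seen after `token`."""
    counts = following[token]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the"))  # "cat" follows "the" twice; "mat" and "fish" once each
```

An LLM does the same thing in spirit, but over a vocabulary of ~100,000 subword tokens, conditioning on thousands of preceding tokens rather than one, with probabilities computed by a transformer instead of a count table.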
2. How LLMs are Trained
Modern LLMs like GPT-4 and Claude are trained in three stages:
Stage 1 — Pre-training
The model trains on a massive corpus of text (web pages, books, code, scientific papers) by predicting the next token. This takes weeks to months on thousands of GPUs and requires enormous compute (GPT-4 cost estimates: $50–100 million in compute). The model develops broad knowledge of language and the world.
Stage 2 — Supervised Fine-tuning (SFT)
Human annotators create high-quality examples of desired behaviour — helpful answers to questions, step-by-step explanations, polite refusals to harmful requests. The model is fine-tuned on these examples to follow instructions.
Stage 3 — Reinforcement Learning from Human Feedback (RLHF)
Human raters compare pairs of model outputs and choose the better one. A reward model is trained on these preferences. The LLM is then fine-tuned using reinforcement learning to maximise the reward model’s score — producing outputs that humans prefer: more helpful, accurate, and less harmful.
RLHF is the key step that transforms a raw language model into a useful assistant. Without it, pre-trained models generate plausible text but are not necessarily helpful, harmless, or honest.
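The reward-model step in Stage 3 can be sketched as pairwise preference learning (a Bradley-Terry-style logistic objective, which is the standard formulation). Here the "responses" are hypothetical two-number feature vectors rather than real text, and the reward is a simple linear score, purely to show the training signal:

```python
import math

# Hypothetical feature vectors for candidate responses and human
# preference pairs: (index of preferred response, index of rejected one).
responses = [[1.0, 0.2], [0.1, 0.9], [0.8, 0.1], [0.2, 0.8]]
preferences = [(0, 1), (2, 3), (0, 3)]

w = [0.0, 0.0]  # reward-model weights

def reward(x):
    return sum(wi * xi for wi, xi in zip(w, x))

# Maximise log sigmoid(r_preferred - r_rejected) by gradient ascent.
for _ in range(200):
    for win, lose in preferences:
        diff = reward(responses[win]) - reward(responses[lose])
        grad_scale = 1.0 / (1.0 + math.exp(diff))  # sigmoid(-diff)
        for j in range(2):
            w[j] += 0.1 * grad_scale * (responses[win][j] - responses[lose][j])

print(reward(responses[0]) > reward(responses[1]))  # True: preferred scores higher
```

In real RLHF the reward model is itself a large transformer, and the LLM is then updated (e.g. with PPO) to produce outputs that this learned reward scores highly.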
3. Emergent Abilities
One of the most surprising findings in LLM research is emergence — capabilities that appear suddenly at scale and are not present in smaller models. These abilities were not explicitly trained; they arise from the model learning increasingly complex patterns in the training data.
- Multi-step reasoning: Solving complex maths or logic problems by working through intermediate steps.
- In-context learning: Learning a new task from a few examples provided in the prompt — without gradient updates.
- Chain-of-thought reasoning: When prompted to “think step by step”, large models produce dramatically better answers to reasoning questions.
- Code generation: Writing syntactically correct and functionally sound code in multiple programming languages.
- Multilingual translation: Even without explicit translation training, models trained on multilingual text can translate between languages.
4. Major LLM Families
| Model | Organisation | Architecture | Notable For |
|---|---|---|---|
| GPT-4 / GPT-4o | OpenAI | Decoder | ChatGPT, code generation, multimodal |
| Claude 3.5/4 | Anthropic | Decoder | Long context, safety, reasoning |
| Gemini Ultra/Pro | Google DeepMind | Decoder | Multimodal, integration with Google |
| Llama 3 | Meta | Decoder | Open weights, fine-tunable, on-device |
| BERT / RoBERTa | Google / Meta | Encoder | Classification, NER, embeddings |
| T5 / FLAN-T5 | Google | Encoder-Decoder | Summarisation, translation, Q&A |
| Mistral / Mixtral | Mistral AI | Decoder (MoE) | Efficient, strong open-source |
5. GPT vs BERT vs T5 vs Claude
| Feature | GPT (Decoder) | BERT (Encoder) | T5 (Enc-Dec) | Claude (Decoder) |
|---|---|---|---|---|
| Best for | Generation, chat, code | Classification, NER, Q&A | Translation, summarisation | Reasoning, long documents, safety |
| Attention direction | Left-to-right only | Bidirectional | Bidirectional (encoder), left-to-right (decoder) | Left-to-right |
| Output type | Open-ended text | Embeddings/labels | Structured text | Open-ended text |
| Open source? | No (API only) | Yes | Yes | No (API only) |
| Context window | 128K tokens (GPT-4) | 512 tokens | ~512 tokens | 200K tokens |
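The attention-direction row can be made concrete. A causal (GPT-style) mask lets position i attend only to positions 0..i, while a bidirectional (BERT-style) mask lets every position attend everywhere; sketched as 0/1 matrices in plain Python:

```python
def causal_mask(n):
    """GPT-style: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """BERT-style: every position attends to every position."""
    return [[1] * n for _ in range(n)]

for row in causal_mask(4):
    print(row)
# Lower-triangular: token 0 sees only itself; token 3 sees all four.
```

This single structural difference is why GPT-family models generate text left to right while BERT-family models excel at whole-sentence understanding tasks.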
6. Prompt Engineering Basics
The quality of an LLM’s output depends heavily on how the input (prompt) is written. Prompt engineering is the practice of designing inputs to maximise output quality.
Key Techniques:
| Technique | Description | Example |
|---|---|---|
| Zero-shot | Direct instruction, no examples | “Summarise this text in 3 bullet points: [text]” |
| Few-shot | Provide 2–5 examples of desired format | “Input: ‘great!’ → Positive. Input: ‘awful’ → Negative. Input: ‘okay’ → ?” |
| Chain-of-thought | Ask model to reason step by step | “Solve this problem step by step: …” |
| Role prompting | Assign a persona | “You are an expert mechanical engineer. Explain…” |
| Output format | Specify the desired structure | “Respond in JSON format with keys ‘answer’ and ‘confidence’.” |
| Self-consistency | Generate multiple outputs, take majority | Generate 5 solutions, pick the most common answer |
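Two of these techniques reduce to simple string and voting logic. Below is a minimal sketch of a few-shot prompt builder (using the sentiment example from the table) and a self-consistency majority vote over sampled answers; the answers list is made up, since no real model is being called:

```python
from collections import Counter

def few_shot_prompt(examples, query):
    """Assemble a few-shot prompt from (input, label) example pairs."""
    lines = [f"Input: {text} -> {label}" for text, label in examples]
    lines.append(f"Input: {query} -> ")
    return "\n".join(lines)

prompt = few_shot_prompt([("great!", "Positive"), ("awful", "Negative")], "okay")
print(prompt)

def self_consistency(answers):
    """Majority vote over several independently sampled model answers."""
    return Counter(answers).most_common(1)[0][0]

print(self_consistency(["42", "42", "41", "42", "40"]))  # "42"
```

In practice the prompt string is sent to a model API, and self-consistency means sampling that same prompt several times at non-zero temperature before voting.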
7. Context Window & Limitations
The context window is the maximum amount of text (in tokens) an LLM can process at once. Everything outside the context window is invisible to the model — it has no memory of it.
- GPT-4: up to 128,000 tokens (~100,000 words)
- Claude 3.5: up to 200,000 tokens (~150,000 words)
- Gemini 1.5 Pro: up to 1,000,000 tokens
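A practical consequence of the fixed window is that chat applications must trim history to fit. A minimal sketch, assuming a rough heuristic of ~4/3 tokens per English word (real systems count tokens exactly with the model's tokenizer):

```python
def estimate_tokens(text):
    """Rough heuristic: about 4/3 tokens per English word (assumption)."""
    return max(1, round(len(text.split()) * 4 / 3))

def fit_context(messages, budget):
    """Keep the most recent messages that fit within the token budget."""
    kept, used = [], 0
    for msg in reversed(messages):  # walk backwards from the newest
        cost = estimate_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))

history = ["old question", "old answer", "recent question about beams"]
print(fit_context(history, budget=8))  # oldest message dropped
```

Dropping the oldest messages first is the simplest policy; production systems often summarise the dropped history instead of discarding it outright.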
Key limitations of LLMs to understand:
- Hallucination: LLMs generate plausible-sounding but factually incorrect information. They are pattern matchers, not databases — always verify important facts.
- Knowledge cutoff: Pre-training ends at a specific date — LLMs do not know about events after their cutoff without retrieval augmentation (RAG).
- No persistent memory: Each conversation starts fresh — the model does not remember previous conversations unless explicitly provided.
- Reasoning limits: Despite impressive performance, LLMs still make systematic errors on certain types of logical reasoning and mathematical computation.
8. Applications for Engineers
| Domain | Application | Model Type |
|---|---|---|
| Software Engineering | Code generation, debugging, documentation (GitHub Copilot) | GPT-4, Claude |
| Data Analysis | Natural language queries on databases, automated report generation | GPT-4, Claude |
| Technical Writing | Drafting specifications, summarising research papers, translating documents | Claude, GPT-4 |
| Customer Support | Intelligent chatbots that understand context and domain-specific language | Fine-tuned Llama, GPT |
| Education | Personalised tutoring, concept explanation, problem generation | GPT-4, Claude |
| Mechanical/Civil | Interpreting standards (IS codes, ASTM), design checklist generation | Fine-tuned domain LLMs |
9. Common Misconceptions
- “LLMs understand language like humans do”: LLMs are statistical pattern matchers trained on human text. They produce outputs that look like understanding because they are trained on the outputs of human understanding. Whether they truly “understand” is a deep philosophical question — what is clear is that they produce outputs that are often indistinguishable from human-level performance on many tasks.
- “LLMs always tell the truth”: LLMs generate the most statistically likely next token — not necessarily the most accurate. They hallucinate confidently. Always verify important claims, especially specific numbers, dates, and citations.
- “Bigger is always better”: Smaller, well-trained models (Llama 3 8B, Mistral 7B) often outperform larger, older models on specific tasks. The right model depends on your task, latency requirements, and budget.
- “Prompt engineering is just writing instructions”: Effective prompt engineering is a skill that combines understanding of how LLMs work, the specific model’s training, and the target task. Small changes in wording can cause large changes in output quality.
10. Frequently Asked Questions
How many parameters does a large language model have?
Model sizes vary enormously. BERT-base has 110 million parameters. GPT-3 had 175 billion. GPT-4’s exact size is not published but is estimated at over 1 trillion parameters (possibly a mixture of experts). Llama 3 comes in 8B, 70B, and 405B parameter versions. The number of parameters is not the only factor in performance — training data quality, architecture, and RLHF training matter too.
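A common back-of-envelope formula (roughly 12 x n_layers x d_model^2 weights for the attention and feed-forward blocks, ignoring embeddings and biases) recovers GPT-3's published size from its published hyperparameters of 96 layers and hidden size 12,288:

```python
def approx_params(n_layers, d_model):
    """Rough transformer parameter count: ~12 * d_model^2 weights per layer
    (attention projections + feed-forward), ignoring embeddings and biases."""
    return 12 * n_layers * d_model ** 2

# GPT-3's published hyperparameters: 96 layers, hidden size 12288.
print(f"{approx_params(96, 12288) / 1e9:.0f}B")  # 174B, close to the reported 175B
```

The same formula explains why width matters more than depth for parameter count: doubling d_model quadruples the estimate, while doubling the layer count only doubles it.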
Can I run an LLM on my laptop?
Yes — small models like Llama 3 8B, Phi-3, and Mistral 7B can run on a laptop with 8–16GB RAM using tools like Ollama or LM Studio. They are not as capable as GPT-4 or Claude but are impressive for their size and run completely offline. For most production applications, API access to cloud-hosted models is more practical.
What is RAG (Retrieval-Augmented Generation)?
RAG combines LLMs with a retrieval system — a database of documents. When a question is asked, the system first retrieves relevant documents, then provides them to the LLM as context for generating the answer. This gives LLMs access to up-to-date, domain-specific information without retraining. RAG is the standard approach for building LLM-powered applications over private or frequently changing data.
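The retrieve-then-generate loop can be sketched in miniature, using bag-of-words cosine similarity in place of a real embedding model (the documents and query here are invented for illustration):

```python
import math
from collections import Counter

documents = [
    "The yield strength of mild steel is about 250 MPa.",
    "Transformers use self-attention to process sequences.",
    "Concrete gains strength over 28 days of curing.",
]

def vectorise(text):
    """Bag-of-words term counts (a stand-in for a learned embedding)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, k=1):
    """Return the k documents most similar to the query."""
    q = vectorise(query)
    ranked = sorted(documents, key=lambda d: cosine(q, vectorise(d)), reverse=True)
    return ranked[:k]

context = retrieve("what is the strength of steel?")[0]
prompt = f"Answer using this context:\n{context}\n\nQuestion: what is the strength of steel?"
print(context)
```

A production RAG system replaces the word-count vectors with dense embeddings from an embedding model, stores them in a vector database, and sends the assembled prompt to the LLM; the retrieve-then-prompt structure is the same.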