Large Language Models (LLMs)

GPT, BERT & Claude Simplified for Engineering Students

Last Updated: March 2026

📌 Key Takeaways

  • Definition: LLMs are transformer-based models trained on massive text corpora — billions of parameters, trillions of tokens — to predict and generate text.
  • Training: Pre-training on next-token prediction → fine-tuning with human feedback (RLHF) for helpful, harmless, honest behaviour.
  • Emergent abilities: At sufficient scale, LLMs develop capabilities not explicitly trained — reasoning, code generation, translation, summarisation.
  • GPT (decoder): Best for generation — chatbots, code, creative writing.
  • BERT (encoder): Best for understanding — classification, NER, Q&A.
  • Prompt engineering: How you phrase your input dramatically affects output quality.

1. What are Large Language Models?

A Large Language Model is a transformer-based neural network trained on massive amounts of text data to learn the statistical patterns of language — enabling it to understand, generate, and reason about text at a human-competitive level.

The word “large” refers to two dimensions of scale: model size (billions to hundreds of billions of parameters) and training data (hundreds of billions to trillions of tokens — tokens are roughly words or word pieces).

The core task during pre-training is simple: given a sequence of text, predict the next token. But at sufficient scale, this simple objective leads to a model that develops a surprisingly rich internal representation of language, facts, reasoning patterns, and even code — all learned implicitly from the statistics of human-written text.
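The next-token objective can be illustrated with the simplest possible learner: a bigram model that counts which token follows which in a toy corpus. This is a sketch of the *objective* only; real LLMs replace the count table with a transformer over billions of parameters.

```python
from collections import Counter, defaultdict

# Toy corpus: the "model" learns next-token statistics from this text.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count how often each token follows each other token: a bigram model,
# the simplest possible instance of "predict the next token".
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(token):
    """Return the most frequent successor of `token` in the corpus."""
    return follows[token].most_common(1)[0][0]

print(predict_next("sat"))  # -> "on"
print(predict_next("on"))   # -> "the"
```

Scaling this idea up, from counting pairs of tokens to a deep network conditioning on thousands of previous tokens, is what produces the rich representations described above.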

2. How LLMs are Trained

Modern LLMs like GPT-4 and Claude are trained in three stages:

Stage 1 — Pre-training

The model trains on a massive corpus of text (web pages, books, code, scientific papers) by predicting the next token. This takes weeks to months on thousands of GPUs and requires enormous compute (GPT-4 cost estimates: $50–100 million in compute). The model develops broad knowledge of language and the world.

Stage 2 — Supervised Fine-tuning (SFT)

Human annotators create high-quality examples of desired behaviour — helpful answers to questions, step-by-step explanations, polite refusals to harmful requests. The model is fine-tuned on these examples to follow instructions.

Stage 3 — Reinforcement Learning from Human Feedback (RLHF)

Human raters compare pairs of model outputs and choose the better one. A reward model is trained on these preferences. The LLM is then fine-tuned using reinforcement learning to maximise the reward model’s score — producing outputs that humans prefer: more helpful, accurate, and less harmful.

RLHF is the key step that transforms a raw language model into a useful assistant. Without it, pre-trained models generate plausible text but are not necessarily helpful, harmless, or honest.
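The reward-model step can be sketched with the standard pairwise (Bradley-Terry) preference loss: the loss is low when the reward model scores the human-preferred output higher than the rejected one. This is a toy illustration in plain Python, not any lab's actual training code.

```python
import math

def preference_loss(r_chosen, r_rejected):
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected).
    Small when the reward model agrees with the human rater, large
    when it prefers the rejected output."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Reward model agrees with the human rater: low loss.
print(preference_loss(2.0, -1.0))
# Reward model disagrees: high loss, so training pushes the scores apart.
print(preference_loss(-1.0, 2.0))
```

Minimising this loss over many human comparisons yields a reward model whose score the LLM is then optimised against with reinforcement learning.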

3. Emergent Abilities

One of the most surprising findings in LLM research is emergence — capabilities that appear suddenly at scale and are not present in smaller models. These abilities were not explicitly trained; they arise from the model learning increasingly complex patterns in the training data.

  • Multi-step reasoning: Solving complex maths or logic problems by working through intermediate steps.
  • In-context learning: Learning a new task from a few examples provided in the prompt — without gradient updates.
  • Chain-of-thought reasoning: When prompted to “think step by step”, large models produce dramatically better answers to reasoning questions.
  • Code generation: Writing syntactically correct and functionally sound code in multiple programming languages.
  • Multilingual translation: Even without explicit translation training, models trained on multilingual text can translate between languages.
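In-context learning amounts to assembling labelled examples into the prompt itself; the model completes the pattern with no gradient updates. A minimal sketch of building such a few-shot prompt (the sentiment task and labels are illustrative):

```python
# Few-shot prompt construction: the "training data" lives in the prompt.
examples = [
    ("great!", "Positive"),
    ("awful", "Negative"),
    ("loved every minute", "Positive"),
]

def build_few_shot_prompt(examples, new_input):
    """Format labelled examples plus one unlabelled input; a large
    model is expected to continue the pattern with the right label."""
    lines = [f"Input: {text} -> {label}" for text, label in examples]
    lines.append(f"Input: {new_input} ->")
    return "\n".join(lines)

print(build_few_shot_prompt(examples, "okay"))
```

The resulting string is sent to the model as-is; the final incomplete line invites the model to supply the label.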

4. Major LLM Families

| Model | Organisation | Architecture | Notable For |
| --- | --- | --- | --- |
| GPT-4 / GPT-4o | OpenAI | Decoder | ChatGPT, code generation, multimodal |
| Claude 3.5/4 | Anthropic | Decoder | Long context, safety, reasoning |
| Gemini Ultra/Pro | Google DeepMind | Decoder | Multimodal, integration with Google |
| Llama 3 | Meta | Decoder | Open weights, fine-tunable, on-device |
| BERT / RoBERTa | Google / Meta | Encoder | Classification, NER, embeddings |
| T5 / FLAN-T5 | Google | Encoder-Decoder | Summarisation, translation, Q&A |
| Mistral / Mixtral | Mistral AI | Decoder (MoE) | Efficient, strong open-source |

5. GPT vs BERT vs T5 vs Claude

| Feature | GPT (Decoder) | BERT (Encoder) | T5 (Enc-Dec) | Claude (Decoder) |
| --- | --- | --- | --- | --- |
| Best for | Generation, chat, code | Classification, NER, Q&A | Translation, summarisation | Reasoning, long documents, safety |
| Attention direction | Left-to-right only | Bidirectional | Bidirectional encoder, left-to-right decoder | Left-to-right |
| Output type | Open-ended text | Embeddings/labels | Structured text | Open-ended text |
| Open source? | No (API only) | Yes | Yes | No (API only) |
| Context window | 128K tokens (GPT-4) | 512 tokens | ~512 tokens | 200K tokens |

6. Prompt Engineering Basics

The quality of an LLM’s output depends heavily on how the input (prompt) is written. Prompt engineering is the practice of designing inputs to maximise output quality.

Key Techniques:

| Technique | Description | Example |
| --- | --- | --- |
| Zero-shot | Direct instruction, no examples | “Summarise this text in 3 bullet points: [text]” |
| Few-shot | Provide 2–5 examples of desired format | “Input: ‘great!’ → Positive. Input: ‘awful’ → Negative. Input: ‘okay’ → ?” |
| Chain-of-thought | Ask model to reason step by step | “Solve this problem step by step: …” |
| Role prompting | Assign a persona | “You are an expert mechanical engineer. Explain…” |
| Output format | Specify the desired structure | “Respond in JSON format with keys ‘answer’ and ‘confidence’.” |
| Self-consistency | Generate multiple outputs, take majority | Generate 5 solutions, pick the most common answer |
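Of these techniques, self-consistency is the most mechanical and is easy to sketch in code: sample the model several times (with temperature above zero) and take a majority vote over the answers. Here `sample_fn` is a hypothetical stand-in for a real LLM API call.

```python
from collections import Counter

def self_consistency(sample_fn, n=5):
    """Query the model n times and return the most common answer.
    `sample_fn` stands in for a stochastic LLM call (temperature > 0)."""
    answers = [sample_fn() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Simulated model: answers "42" most of the time, but occasionally
# slips to a wrong answer on individual samples.
samples = iter(["42", "42", "41", "42", "40"])
print(self_consistency(lambda: next(samples)))  # -> "42"
```

The majority vote filters out occasional reasoning slips, which is why self-consistency reliably improves accuracy on maths and logic benchmarks at the cost of extra API calls.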

7. Context Window & Limitations

The context window is the maximum amount of text (in tokens) an LLM can process at once. Everything outside the context window is invisible to the model — it has no memory of it.

  • GPT-4: up to 128,000 tokens (~100,000 words)
  • Claude 3.5: up to 200,000 tokens (~150,000 words)
  • Gemini 1.5 Pro: up to 1,000,000 tokens
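The word counts above use a common rule of thumb: for English text, one token is roughly 4 characters, or about 0.75 words. Real tokenisers (e.g. BPE variants) differ by model, so the sketch below is an estimate, not an exact count.

```python
def estimate_tokens(text):
    """Rough token estimate for English text using the ~4 characters
    per token rule of thumb. Real tokenisers vary by model, so treat
    this as a sanity check, not an exact count."""
    return max(1, len(text) // 4)

doc = "word " * 100_000          # roughly 100,000 words of text
print(estimate_tokens(doc))      # -> 125000, near GPT-4's 128K limit
```

A check like this is useful before an API call: if the estimate is anywhere near the model's context window, the input should be truncated, chunked, or routed through retrieval instead.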

Key limitations of LLMs to understand:

  • Hallucination: LLMs generate plausible-sounding but factually incorrect information. They are pattern matchers, not databases — always verify important facts.
  • Knowledge cutoff: Pre-training ends at a specific date — LLMs do not know about events after their cutoff without retrieval augmentation (RAG).
  • No persistent memory: Each conversation starts fresh — the model does not remember previous conversations unless explicitly provided.
  • Reasoning limits: Despite impressive performance, LLMs still make systematic errors on certain types of logical reasoning and mathematical computation.

8. Applications for Engineers

| Domain | Application | Model Type |
| --- | --- | --- |
| Software Engineering | Code generation, debugging, documentation (GitHub Copilot) | GPT-4, Claude |
| Data Analysis | Natural language queries on databases, automated report generation | GPT-4, Claude |
| Technical Writing | Drafting specifications, summarising research papers, translating documents | Claude, GPT-4 |
| Customer Support | Intelligent chatbots that understand context and domain-specific language | Fine-tuned Llama, GPT |
| Education | Personalised tutoring, concept explanation, problem generation | GPT-4, Claude |
| Mechanical/Civil | Interpreting standards (IS codes, ASTM), design checklist generation | Fine-tuned domain LLMs |

9. Common Misconceptions

  • “LLMs understand language like humans do”: LLMs are statistical pattern matchers trained on human text. They produce outputs that look like understanding because they are trained on the outputs of human understanding. Whether they truly “understand” is a deep philosophical question — what is clear is that they produce outputs that are often indistinguishable from human-level performance on many tasks.
  • “LLMs always tell the truth”: LLMs generate the most statistically likely next token — not necessarily the most accurate. They hallucinate confidently. Always verify important claims, especially specific numbers, dates, and citations.
  • “Bigger is always better”: Smaller, well-trained models (Llama 3 8B, Mistral 7B) often outperform larger, older models on specific tasks. The right model depends on your task, latency requirements, and budget.
  • “Prompt engineering is just writing instructions”: Effective prompt engineering is a skill that combines understanding of how LLMs work, the specific model’s training, and the target task. Small changes in wording can cause large changes in output quality.

10. Frequently Asked Questions

How many parameters does a large language model have?

Model sizes vary enormously. BERT-base has 110 million parameters. GPT-3 had 175 billion. GPT-4’s exact size is not published but is estimated at over 1 trillion parameters (possibly a mixture of experts). Llama 3 comes in 8B, 70B, and 405B parameter versions. The number of parameters is not the only factor in performance — training data quality, architecture, and RLHF training matter too.

Can I run an LLM on my laptop?

Yes — small models like Llama 3 8B, Phi-3, and Mistral 7B can run on a laptop with 8–16GB RAM using tools like Ollama or LM Studio. They are not as capable as GPT-4 or Claude but are impressive for their size and run completely offline. For most production applications, API access to cloud-hosted models is more practical.

What is RAG (Retrieval-Augmented Generation)?

RAG combines LLMs with a retrieval system — a database of documents. When a question is asked, the system first retrieves relevant documents, then provides them to the LLM as context for generating the answer. This gives LLMs access to up-to-date, domain-specific information without retraining. RAG is the standard approach for building LLM-powered applications over private or frequently changing data.
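The retrieve-then-generate pipeline can be sketched in a few lines. This toy version scores documents by word overlap with the question; a production system would use embedding similarity and a vector database, and the final prompt would go to a real LLM API. The documents and helper names here are illustrative.

```python
# Minimal RAG sketch: retrieve the most relevant documents, then pack
# them into the prompt as context for the LLM to answer from.
docs = [
    "The pump must be inspected every 6 months per maintenance policy.",
    "Employees may work remotely up to two days per week.",
    "Server backups run nightly at 02:00 UTC.",
]

def retrieve(question, docs, k=1):
    """Rank documents by shared words with the question (a stand-in
    for embedding similarity) and return the top k."""
    q_words = set(question.lower().split())
    scored = sorted(
        docs,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_rag_prompt(question, docs):
    """Combine retrieved context and the question into one prompt."""
    context = "\n".join(retrieve(question, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_rag_prompt("How often must the pump be inspected?", docs))
```

The key design point is visible even in this sketch: the model's knowledge comes from the retrieved context at query time, so updating the document store updates the system's answers with no retraining.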

Next Steps