Large Language Models (LLMs)
GPT, BERT & Claude Simplified for Engineering Students
Last Updated: March 2026
📌 Key Takeaways
- Definition: LLMs are transformer-based models trained on massive text corpora — billions of parameters, trillions of tokens — to predict and generate text.
- Training: Pre-training on next-token prediction → fine-tuning with human feedback (RLHF) for helpful, harmless, honest behaviour.
- Emergent abilities: At sufficient scale, LLMs develop capabilities not explicitly trained — reasoning, code generation, translation, summarisation.
- GPT (decoder): Best for generation — chatbots, code, creative writing.
- BERT (encoder): Best for understanding — classification, NER, Q&A.
- Prompt engineering: How you phrase your input dramatically affects output quality.
1. What are Large Language Models?
A Large Language Model is a transformer-based neural network trained on massive amounts of text data to learn the statistical patterns of language — enabling it to understand, generate, and reason about text at a human-competitive level.
The word “large” refers to two dimensions of scale: model size (billions to hundreds of billions of parameters) and training data (hundreds of billions to trillions of tokens — tokens are roughly words or word pieces).
The core task during pre-training is simple: given a sequence of text, predict the next token. But at sufficient scale, this simple objective leads to a model that develops a surprisingly rich internal representation of language, facts, reasoning patterns, and even code — all learned implicitly from the statistics of human-written text.
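The next-token objective can be sketched with a toy bigram model in plain Python: count which token follows which, then predict the most frequent follower. (This is only an illustration of the objective; real LLMs learn these statistics with a neural network over subword tokens, not a lookup table, and the corpus here is made up.)

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each token follows each preceding token.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(token):
    """Return the most frequent next token seen after `token`."""
    counts = following[token]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the"))  # "cat" follows "the" twice; "mat" and "fish" once each
```

An LLM does the same thing in spirit, but over a vocabulary of ~100,000 subword tokens, conditioning on thousands of preceding tokens rather than one, with probabilities computed by a transformer instead of a count table.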
2. How LLMs are Trained
Modern LLMs like GPT-4 and Claude are trained in three stages:
Stage 1 — Pre-training
The model trains on a massive corpus of text (web pages, books, code, scientific papers) by predicting the next token. This takes weeks to months on thousands of GPUs and requires enormous compute (GPT-4 cost estimates: $50–100 million in compute). The model develops broad knowledge of language and the world.
Stage 2 — Supervised Fine-tuning (SFT)
Human annotators create high-quality examples of desired behaviour — helpful answers to questions, step-by-step explanations, polite refusals to harmful requests. The model is fine-tuned on these examples to follow instructions.
Stage 3 — Reinforcement Learning from Human Feedback (RLHF)
Human raters compare pairs of model outputs and choose the better one. A reward model is trained on these preferences. The LLM is then fine-tuned using reinforcement learning to maximise the reward model’s score — producing outputs that humans prefer: more helpful, accurate, and less harmful.
RLHF is the key step that transforms a raw language model into a useful assistant. Without it, pre-trained models generate plausible text but are not necessarily helpful, harmless, or honest.
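The reward-model step in Stage 3 can be sketched as pairwise preference learning (a Bradley-Terry-style logistic objective, which is the standard formulation). Here the "responses" are hypothetical two-number feature vectors rather than real text, and the reward is a simple linear score, purely to show the training signal:

```python
import math

# Hypothetical feature vectors for candidate responses and human
# preference pairs: (index of preferred response, index of rejected one).
responses = [[1.0, 0.2], [0.1, 0.9], [0.8, 0.1], [0.2, 0.8]]
preferences = [(0, 1), (2, 3), (0, 3)]

w = [0.0, 0.0]  # reward-model weights

def reward(x):
    return sum(wi * xi for wi, xi in zip(w, x))

# Maximise log sigmoid(r_preferred - r_rejected) by gradient ascent.
for _ in range(200):
    for win, lose in preferences:
        diff = reward(responses[win]) - reward(responses[lose])
        grad_scale = 1.0 / (1.0 + math.exp(diff))  # sigmoid(-diff)
        for j in range(2):
            w[j] += 0.1 * grad_scale * (responses[win][j] - responses[lose][j])

print(reward(responses[0]) > reward(responses[1]))  # True: preferred scores higher
```

In real RLHF the reward model is itself a large transformer, and the LLM is then updated (e.g. with PPO) to produce outputs that this learned reward scores highly.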
3. Emergent Abilities
One of the most surprising findings in LLM research is emergence — capabilities that appear suddenly at scale and are not present in smaller models. These abilities were not explicitly trained; they arise from the model learning increasingly complex patterns in the training data.
- Multi-step reasoning: Solving complex maths or logic problems by working through intermediate steps.
- In-context learning: Learning a new task from a few examples provided in the prompt — without gradient updates.
- Chain-of-thought reasoning: When prompted to “think step by step”, large models produce dramatically better answers to reasoning questions.
- Code generation: Writing syntactically correct and functionally sound code in multiple programming languages.
- Multilingual translation: Even without explicit translation training, models trained on multilingual text can translate between languages.
4. Major LLM Families
| Model | Organisation | Architecture | Notable For |
|---|---|---|---|
| GPT-4 / GPT-4o | OpenAI | Decoder | ChatGPT, code generation, multimodal |
| Claude 3.5/4 | Anthropic | Decoder | Long context, safety, reasoning |
| Gemini Ultra/Pro | Google DeepMind | Decoder | Multimodal, integration with Google |
| Llama 3 | Meta | Decoder | Open weights, fine-tunable, on-device |
| BERT / RoBERTa | Google / Meta | Encoder | Classification, NER, embeddings |
| T5 / FLAN-T5 | Google | Encoder-Decoder | Summarisation, translation, Q&A |
| Mistral / Mixtral | Mistral AI | Decoder (MoE) | Efficient, strong open-source |
5. GPT vs BERT vs T5 vs Claude
| Feature | GPT (Decoder) | BERT (Encoder) | T5 (Enc-Dec) | Claude (Decoder) |
|---|---|---|---|---|
| Best for | Generation, chat, code | Classification, NER, Q&A | Translation, summarisation | Reasoning, long documents, safety |
| Attention direction | Left-to-right only | Bidirectional | Bidirectional (encoder), left-to-right (decoder) | Left-to-right |
| Output type | Open-ended text | Embeddings/labels | Structured text | Open-ended text |
| Open source? | No (API only) | Yes | Yes | No (API only) |
| Context window | 128K tokens (GPT-4) | 512 tokens | ~512 tokens | 200K tokens |
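The attention-direction row can be made concrete. A causal (GPT-style) mask lets position i attend only to positions 0..i, while a bidirectional (BERT-style) mask lets every position attend everywhere; sketched as 0/1 matrices in plain Python:

```python
def causal_mask(n):
    """GPT-style: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """BERT-style: every position attends to every position."""
    return [[1] * n for _ in range(n)]

for row in causal_mask(4):
    print(row)
# Lower-triangular: token 0 sees only itself; token 3 sees all four.
```

This single structural difference is why GPT-family models generate text left to right while BERT-family models excel at whole-sentence understanding tasks.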
6. Prompt Engineering Basics
The quality of an LLM’s output depends heavily on how the input (prompt) is written. Prompt engineering is the practice of designing inputs to maximise output quality.
Key Techniques:
| Technique | Description | Example |
|---|---|---|
| Zero-shot | Direct instruction, no examples | “Summarise this text in 3 bullet points: [text]” |
| Few-shot | Provide 2–5 examples of desired format | “Input: ‘great!’ → Positive. Input: ‘awful’ → Negative. Input: ‘okay’ → ?” |
| Chain-of-thought | Ask model to reason step by step | “Solve this problem step by step: …” |
| Role prompting | Assign a persona | “You are an expert mechanical engineer. Explain…” |
| Output format | Specify the desired structure | “Respond in JSON format with keys ‘answer’ and ‘confidence’.” |
| Self-consistency | Generate multiple outputs, take majority | Generate 5 solutions, pick the most common answer |
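Two of these techniques reduce to simple string and voting logic. Below is a minimal sketch of a few-shot prompt builder (using the sentiment example from the table) and a self-consistency majority vote over sampled answers; the answers list is made up, since no real model is being called:

```python
from collections import Counter

def few_shot_prompt(examples, query):
    """Assemble a few-shot prompt from (input, label) example pairs."""
    lines = [f"Input: {text} -> {label}" for text, label in examples]
    lines.append(f"Input: {query} -> ")
    return "\n".join(lines)

prompt = few_shot_prompt([("great!", "Positive"), ("awful", "Negative")], "okay")
print(prompt)

def self_consistency(answers):
    """Majority vote over several independently sampled model answers."""
    return Counter(answers).most_common(1)[0][0]

print(self_consistency(["42", "42", "41", "42", "40"]))  # "42"
```

In practice the prompt string is sent to a model API, and self-consistency means sampling that same prompt several times at non-zero temperature before voting.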
7. Context Window & Limitations
The context window is the maximum amount of text (in tokens) an LLM can process at once. Everything outside the context window is invisible to the model — it has no memory of it.
- GPT-4: up to 128,000 tokens (~100,000 words)
- Claude 3.5: up to 200,000 tokens (~150,000 words)
- Gemini 1.5 Pro: up to 1,000,000 tokens
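A practical consequence of the fixed window is that chat applications must trim history to fit. A minimal sketch, assuming a rough heuristic of ~4/3 tokens per English word (real systems count tokens exactly with the model's tokenizer):

```python
def estimate_tokens(text):
    """Rough heuristic: about 4/3 tokens per English word (assumption)."""
    return max(1, round(len(text.split()) * 4 / 3))

def fit_context(messages, budget):
    """Keep the most recent messages that fit within the token budget."""
    kept, used = [], 0
    for msg in reversed(messages):  # walk backwards from the newest
        cost = estimate_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))

history = ["old question", "old answer", "recent question about beams"]
print(fit_context(history, budget=8))  # oldest message dropped
```

Dropping the oldest messages first is the simplest policy; production systems often summarise the dropped history instead of discarding it outright.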
Key limitations of LLMs to understand:
- Hallucination: LLMs generate plausible-sounding but factually incorrect information. They are pattern matchers, not databases — always verify important facts.
- Knowledge cutoff: Pre-training ends at a specific date — LLMs do not know about events after their cutoff without retrieval augmentation (RAG).
- No persistent memory: Each conversation starts fresh — the model does not remember previous conversations unless explicitly provided.
- Reasoning limits: Despite impressive performance, LLMs still make systematic errors on certain types of logical reasoning and mathematical computation.
8. Applications for Engineers
| Domain | Application | Model Type |
|---|---|---|
| Software Engineering | Code generation, debugging, documentation (GitHub Copilot) | GPT-4, Claude |
| Data Analysis | Natural language queries on databases, automated report generation | GPT-4, Claude |
| Technical Writing | Drafting specifications, summarising research papers, translating documents | Claude, GPT-4 |
| Customer Support | Intelligent chatbots that understand context and domain-specific language | Fine-tuned Llama, GPT |
| Education | Personalised tutoring, concept explanation, problem generation | GPT-4, Claude |
| Mechanical/Civil | Interpreting standards (IS codes, ASTM), design checklist generation | Fine-tuned domain LLMs |
9. Common Misconceptions
- “LLMs understand language like humans do”: LLMs are statistical pattern matchers trained on human text. They produce outputs that look like understanding because they are trained on the outputs of human understanding. Whether they truly “understand” is a deep philosophical question — what is clear is that they produce outputs that are often indistinguishable from human-level performance on many tasks.
- “LLMs always tell the truth”: LLMs generate the most statistically likely next token — not necessarily the most accurate. They hallucinate confidently. Always verify important claims, especially specific numbers, dates, and citations.
- “Bigger is always better”: Smaller, well-trained models (Llama 3 8B, Mistral 7B) often outperform larger, older models on specific tasks. The right model depends on your task, latency requirements, and budget.
- “Prompt engineering is just writing instructions”: Effective prompt engineering is a skill that combines understanding of how LLMs work, the specific model’s training, and the target task. Small changes in wording can cause large changes in output quality.
10. Frequently Asked Questions
How many parameters does a large language model have?
Model sizes vary enormously. BERT-base has 110 million parameters. GPT-3 had 175 billion. GPT-4’s exact size is not published but is estimated at over 1 trillion parameters (possibly a mixture of experts). Llama 3 comes in 8B, 70B, and 405B parameter versions. The number of parameters is not the only factor in performance — training data quality, architecture, and RLHF training matter too.
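A common back-of-envelope formula (roughly 12 x n_layers x d_model^2 weights for the attention and feed-forward blocks, ignoring embeddings and biases) recovers GPT-3's published size from its published hyperparameters of 96 layers and hidden size 12,288:

```python
def approx_params(n_layers, d_model):
    """Rough transformer parameter count: ~12 * d_model^2 weights per layer
    (attention projections + feed-forward), ignoring embeddings and biases."""
    return 12 * n_layers * d_model ** 2

# GPT-3's published hyperparameters: 96 layers, hidden size 12288.
print(f"{approx_params(96, 12288) / 1e9:.0f}B")  # 174B, close to the reported 175B
```

The same formula explains why width matters more than depth for parameter count: doubling d_model quadruples the estimate, while doubling the layer count only doubles it.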
Can I run an LLM on my laptop?
Yes — small models like Llama 3 8B, Phi-3, and Mistral 7B can run on a laptop with 8–16GB RAM using tools like Ollama or LM Studio. They are not as capable as GPT-4 or Claude but are impressive for their size and run completely offline. For most production applications, API access to cloud-hosted models is more practical.
What is RAG (Retrieval-Augmented Generation)?
RAG combines LLMs with a retrieval system — a database of documents. When a question is asked, the system first retrieves relevant documents, then provides them to the LLM as context for generating the answer. This gives LLMs access to up-to-date, domain-specific information without retraining. RAG is the standard approach for building LLM-powered applications over private or frequently changing data.
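The retrieve-then-generate loop can be sketched in miniature, using bag-of-words cosine similarity in place of a real embedding model (the documents and query here are invented for illustration):

```python
import math
from collections import Counter

documents = [
    "The yield strength of mild steel is about 250 MPa.",
    "Transformers use self-attention to process sequences.",
    "Concrete gains strength over 28 days of curing.",
]

def vectorise(text):
    """Bag-of-words term counts (a stand-in for a learned embedding)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, k=1):
    """Return the k documents most similar to the query."""
    q = vectorise(query)
    ranked = sorted(documents, key=lambda d: cosine(q, vectorise(d)), reverse=True)
    return ranked[:k]

context = retrieve("what is the strength of steel?")[0]
prompt = f"Answer using this context:\n{context}\n\nQuestion: what is the strength of steel?"
print(context)
```

A production RAG system replaces the word-count vectors with dense embeddings from an embedding model, stores them in a vector database, and sends the assembled prompt to the LLM; the retrieve-then-prompt structure is the same.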