Text Preprocessing in NLP
Tokenisation, Stemming, Lemmatisation & TF-IDF — For Engineering Students
Last Updated: March 2026
📌 Key Takeaways
- Definition: Text preprocessing converts raw text into clean numerical features that ML models can process.
- Pipeline: Raw text → Lowercase → Remove punctuation/numbers → Tokenise → Remove stop words → Stem/Lemmatise → Vectorise
- Stemming: Rule-based word truncation. Fast but may produce non-words. (“studies” → “studi”)
- Lemmatisation: Dictionary-based reduction to base form. Slower but accurate. (“studies” → “study”)
- Bag of Words: Simple word count matrix — loses word order but works well for many tasks.
- TF-IDF: Weighs words by importance to a document, penalising common words across all documents.
1. Why Text Preprocessing?
Machine learning algorithms operate on numbers — they cannot directly process raw text like “The quick brown fox jumped over the lazy dog.” Text preprocessing converts this unstructured text into structured numerical representations that algorithms can learn from.
Without preprocessing, a model would treat “Running”, “running”, and “RUNNING” as three completely different words. It would give equal weight to the word “the” (which appears in almost every document and carries no meaning) and the word “neural” (which is highly specific and informative). Preprocessing fixes these issues before training even begins.
The quality of text preprocessing directly impacts model performance — poor preprocessing leads to high-dimensional, noisy feature spaces that make learning harder.
2. The Standard Preprocessing Pipeline
A typical NLP preprocessing pipeline follows these steps in order:
- Raw text input — e.g., “The Students are Running to their Classes!!!”
- Lowercase → “the students are running to their classes!!!”
- Remove punctuation/special characters → “the students are running to their classes”
- Tokenise → [“the”, “students”, “are”, “running”, “to”, “their”, “classes”]
- Remove stop words → [“students”, “running”, “classes”]
- Stem or lemmatise → [“student”, “run”, “class”]
- Vectorise → numerical representation (Bag of Words or TF-IDF)
Not every step is needed for every task — the pipeline should be tailored to the problem. For sentiment analysis, stop word removal may hurt performance (negations like “not” are stop words but critical for sentiment). For topic modelling, stop words should definitely be removed.
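The pipeline above can be sketched as a single pure-Python function. The stop word list and suffix rules here are toy illustrations (a real list such as NLTK's has roughly 180 entries, and a real stemmer has far more rules), but the flow of steps matches the list above:

```python
import re

# Toy stop word list -- illustrative only; NLTK's real list is much larger.
STOP_WORDS = {"the", "a", "an", "are", "is", "to", "their", "on", "and"}

def simple_stem(word):
    """Crude rule-based stemming: strip a few common suffixes."""
    if word.endswith("ing") and len(word) > 5:
        stem = word[:-3]
        if len(stem) >= 2 and stem[-1] == stem[-2]:  # "runn" -> "run"
            stem = stem[:-1]
        return stem
    if word.endswith("es") and len(word) > 4:
        return word[:-2]
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word

def preprocess(text):
    text = text.lower()                                  # 1. lowercase
    text = re.sub(r"[^a-z\s]", " ", text)                # 2. strip punctuation/numbers
    tokens = text.split()                                # 3. tokenise
    tokens = [t for t in tokens if t not in STOP_WORDS]  # 4. remove stop words
    return [simple_stem(t) for t in tokens]              # 5. stem

print(preprocess("The Students are Running to their Classes!!!"))
# -> ['student', 'run', 'class']
```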
3. Tokenisation
Tokenisation is the process of splitting text into individual units called tokens. Tokens are usually words, but can also be characters, subwords, or sentences depending on the approach.
| Type | Example Input | Tokens | Use Case |
|---|---|---|---|
| Word tokenisation | “I love NLP” | [“I”, “love”, “NLP”] | Most NLP tasks |
| Sentence tokenisation | “I love NLP. It is fun.” | [“I love NLP.”, “It is fun.”] | Summarisation, translation |
| Character tokenisation | “NLP” | [“N”, “L”, “P”] | Spelling correction, rare words |
| Subword tokenisation | “unhappy” | [“un”, “happy”] | BERT, GPT — handles unknown words |
Modern large language models (BERT, GPT) use subword tokenisation algorithms like BPE (Byte-Pair Encoding) or WordPiece. These split rare words into known subword units — “unhappiness” → [“un”, “happy”, “ness”] — allowing the model to handle unseen words gracefully.
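A greedy longest-match split over a fixed vocabulary captures the core idea behind WordPiece-style tokenisation. The vocabulary below is hand-picked for the example (real vocabularies are learned from corpus statistics, and real tokenisers mark word-internal pieces with a "##" prefix, which this sketch omits):

```python
def greedy_subword_tokenize(word, vocab):
    """Split a word into subwords by repeatedly taking the longest prefix in vocab."""
    tokens, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):  # try longest match first
            if word[start:end] in vocab:
                tokens.append(word[start:end])
                start = end
                break
        else:
            return ["[UNK]"]  # no subword matches this position
    return tokens

vocab = {"un", "happy", "happi", "ness"}
print(greedy_subword_tokenize("unhappy", vocab))      # ['un', 'happy']
print(greedy_subword_tokenize("unhappiness", vocab))  # ['un', 'happi', 'ness']
```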
4. Normalisation
Normalisation reduces variability in text that is not semantically meaningful:
- Lowercasing: “Apple” and “apple” refer to the same word. Convert all text to lowercase unless capitalisation is meaningful (e.g., named entity recognition).
- Remove punctuation: Commas, periods, exclamation marks rarely add meaning for bag-of-words models. Remove them. Exception: sentiment analysis (exclamation marks can signal emotion).
- Remove numbers: For topic classification, standalone numbers often add noise. For financial text analysis, numbers may be critical.
- Remove extra whitespace: Normalise multiple spaces, tabs, and newlines to single spaces.
- Expand contractions: “don’t” → “do not”, “I’m” → “I am”. Important for consistent tokenisation.
- Remove HTML tags: When processing scraped web content — strip <p>, <div>, etc.
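The normalisation steps above can be chained with the standard-library `re` module. The contraction map here is a tiny illustrative sample; note the ordering matters, since contractions must be expanded before punctuation removal strips the apostrophes:

```python
import re

# Partial contraction map -- a real one (e.g. the `contractions` package) is far larger.
CONTRACTIONS = {"don't": "do not", "i'm": "i am", "can't": "cannot"}

def normalise(text):
    text = re.sub(r"<[^>]+>", " ", text)      # strip HTML tags
    text = text.lower()                       # lowercase
    for contraction, expansion in CONTRACTIONS.items():
        text = text.replace(contraction, expansion)  # expand before removing "'"
    text = re.sub(r"[0-9]+", " ", text)       # remove numbers
    text = re.sub(r"[^\w\s]", " ", text)      # remove punctuation
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return text

print(normalise("<p>I'm  NOT   buying 3 apples, don't you know?!</p>"))
# -> i am not buying apples do not you know
```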
5. Stop Word Removal
Stop words are extremely common words that carry little semantic meaning and appear in almost every document — words like “the”, “a”, “is”, “in”, “to”, “and”. Including them inflates the feature space without adding discriminating information.
Standard English stop word lists (NLTK, spaCy) contain 100–400 words. Removing them reduces vocabulary size significantly and speeds up training.
When NOT to remove stop words:
- Sentiment analysis: “not good” → removing “not” gives “good” — the opposite meaning.
- Question answering: “what”, “who”, “where” are stop words but are the actual questions.
- Text generation and language modelling: all words are needed for coherent text.
- Subword/transformer models: these handle common words internally — stop word removal is unnecessary.
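For sentiment-style tasks, one common compromise is to filter stop words but whitelist negations. A minimal sketch with a toy stop word list:

```python
STOP_WORDS = {"the", "is", "a", "not", "was", "to"}  # tiny illustrative list
NEGATIONS = {"not", "no", "never"}

def remove_stop_words(tokens, keep_negations=False):
    keep = NEGATIONS if keep_negations else set()
    return [t for t in tokens if t not in STOP_WORDS or t in keep]

tokens = ["the", "movie", "was", "not", "good"]
print(remove_stop_words(tokens))                       # ['movie', 'good'] -- sentiment flipped!
print(remove_stop_words(tokens, keep_negations=True))  # ['movie', 'not', 'good']
```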
6. Stemming
Stemming reduces words to their base or root form by removing suffixes and prefixes using rule-based algorithms. The result (the stem) may not be a real word.
| Original Word | Porter Stem | Is it a real word? |
|---|---|---|
| running | run | Yes |
| studies | studi | No |
| happiness | happi | No |
| beautiful | beauti | No |
| engineering | engin | No |
Common stemming algorithms: Porter Stemmer (most common for English), Lancaster Stemmer (more aggressive), Snowball Stemmer (multilingual). The Porter Stemmer is the standard choice for most English NLP tasks.
Pros: Very fast; simple to implement; reduces vocabulary size. Cons: May produce non-words; over-stemming (different words get same stem) and under-stemming (same word gets different stems) are common issues.
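Both failure modes are easy to reproduce with a toy suffix stripper (the suffix list below is made up for illustration; the Porter Stemmer behaves similarly on these words):

```python
def toy_stem(word):
    """Crude suffix stripping -- mimics how rule-based stemmers misfire."""
    for suffix in ("ational", "ation", "ity", "ing", "es", "ed", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Over-stemming: unrelated words collapse to the same stem.
print(toy_stem("university"), toy_stem("universe"))  # univers univers

# Under-stemming: related forms fail to collapse to one stem.
print(toy_stem("ran"), toy_stem("running"))          # ran runn
```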
7. Lemmatisation
Lemmatisation reduces words to their canonical dictionary form (lemma) using vocabulary and morphological analysis. Unlike stemming, the output is always a real, meaningful word.
| Original Word | Lemma | Part of Speech |
|---|---|---|
| running | run | Verb |
| studies | study | Verb/Noun |
| better | good | Adjective |
| was | be | Verb |
| mice | mouse | Noun |
Lemmatisation requires knowing the part of speech (POS) of each word. For “studies” the noun and verb lemmas happen to coincide (both “study”), but for “better” the adjective lemma is “good” while the verb lemma (as in “to better oneself”) is “better”.
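The POS dependence can be made concrete with a tiny hand-written lemma table keyed by (word, POS). A real lemmatiser (WordNet, spaCy) uses a full dictionary plus morphological rules rather than a lookup like this:

```python
# Toy lemma table -- illustrative only, keyed by (word, part of speech).
LEMMAS = {
    ("studies", "noun"): "study",
    ("studies", "verb"): "study",
    ("better", "adj"): "good",
    ("better", "verb"): "better",
    ("mice", "noun"): "mouse",
    ("was", "verb"): "be",
}

def lemmatise(word, pos):
    """Return the lemma for (word, pos), or the word unchanged if unknown."""
    return LEMMAS.get((word, pos), word)

print(lemmatise("better", "adj"))   # good
print(lemmatise("better", "verb"))  # better
```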
When to use lemmatisation over stemming: When text quality and word meaning matter — information retrieval, sentiment analysis, question answering. For high-volume pipelines where speed is critical and accuracy is secondary, stemming is acceptable.
8. Bag of Words (BoW)
Bag of Words represents text as a vector of word counts, ignoring grammar and word order. The “bag” metaphor reflects that all words are thrown in together with no structure.
Example: Vocabulary = [“cat”, “dog”, “sat”, “mat”]
- “The cat sat on the mat” → [1, 0, 1, 1] (cat=1, dog=0, sat=1, mat=1)
- “The dog sat on the mat” → [0, 1, 1, 1]
Pros: Simple, fast, works well for document classification and spam detection. Cons: Ignores word order (semantic meaning lost); creates high-dimensional sparse vectors for large vocabularies; common words dominate counts.
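Bag of Words is simple enough to implement from scratch. The sketch below assumes stop words were already removed and sorts the vocabulary alphabetically (as scikit-learn's `CountVectorizer` also does):

```python
def bag_of_words(documents):
    """Build a sorted vocabulary and one count vector per document."""
    vocab = sorted({word for doc in documents for word in doc.split()})
    vectors = [[doc.split().count(word) for word in vocab] for doc in documents]
    return vocab, vectors

docs = ["cat sat mat", "dog sat mat"]  # stop words already removed
vocab, vectors = bag_of_words(docs)
print(vocab)    # ['cat', 'dog', 'mat', 'sat']
print(vectors)  # [[1, 0, 1, 1], [0, 1, 1, 1]]
```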
9. TF-IDF — Term Frequency-Inverse Document Frequency
TF-IDF improves on Bag of Words by down-weighting words that are common across all documents (and therefore less discriminating) and up-weighting words that are specific to individual documents.
TF(t, d) = count of term t in document d / total terms in d
IDF(t) = log(N / df(t))
TF-IDF(t, d) = TF(t, d) × IDF(t)
| Symbol | Meaning |
|---|---|
| TF(t, d) | How often term t appears in document d (normalised) |
| IDF(t) | How rare term t is across all N documents |
| N | Total number of documents in the corpus |
| df(t) | Number of documents containing term t |
Intuition: “the” appears in every document → IDF ≈ 0 → TF-IDF ≈ 0 (filtered out). “neural” appears in 5 out of 1000 documents → high IDF → high TF-IDF in documents that contain it.
TF-IDF is the standard text representation for classical ML models (Naive Bayes, Logistic Regression, SVM) applied to text, and it typically outperforms raw Bag of Words on classification tasks.
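The formulas above translate directly into a few lines of Python. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalises each row, so its numbers differ from this textbook version:

```python
import math

def tf_idf(documents):
    """Compute TF-IDF weights from the textbook formulas."""
    n = len(documents)                                   # N: corpus size
    tokenised = [doc.split() for doc in documents]
    vocab = sorted({w for doc in tokenised for w in doc})
    df = {w: sum(1 for doc in tokenised if w in doc) for w in vocab}  # df(t)
    scores = []
    for doc in tokenised:
        row = {}
        for w in vocab:
            tf = doc.count(w) / len(doc)   # TF(t, d)
            idf = math.log(n / df[w])      # IDF(t)
            row[w] = tf * idf              # TF-IDF(t, d)
        scores.append(row)
    return scores

docs = ["the cat sat", "the dog sat", "the cat ran"]
scores = tf_idf(docs)
print(round(scores[0]["the"], 3))  # 0.0 -- "the" appears in every document
print(round(scores[0]["cat"], 3))  # 0.135 -- "cat" appears in 2 of 3 documents
```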
10. Python Code
```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Download required NLTK data
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # required by newer NLTK versions
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)

text = "The engineering students are running experiments and studying machine learning algorithms."

# --- 1. Tokenisation ---
tokens = word_tokenize(text)
print("Tokens:", tokens)

# --- 2. Lowercase ---
tokens_lower = [t.lower() for t in tokens]

# --- 3. Remove punctuation ---
tokens_clean = [t for t in tokens_lower if t.isalpha()]

# --- 4. Remove stop words ---
stop_words = set(stopwords.words('english'))
tokens_no_stop = [t for t in tokens_clean if t not in stop_words]
print("After stop word removal:", tokens_no_stop)

# --- 5. Stemming ---
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens_no_stop]
print("Stems:", stems)

# --- 6. Lemmatisation ---
# pos='v' treats every token as a verb -- a simplification; a real
# pipeline would POS-tag each token first.
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t, pos='v') for t in tokens_no_stop]
print("Lemmas:", lemmas)

# --- 7. Bag of Words ---
corpus = [
    "machine learning algorithms are powerful",
    "deep learning uses neural networks",
    "machine learning and deep learning are related"
]
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(corpus)
print("\nBag of Words vocabulary:", vectorizer.get_feature_names_out())
print("BoW matrix:\n", bow_matrix.toarray())

# --- 8. TF-IDF ---
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(corpus)
print("\nTF-IDF matrix:\n", tfidf_matrix.toarray().round(3))
```
11. Frequently Asked Questions
Do I need to preprocess text for transformer models like BERT?
Not in the traditional sense. Transformers use their own built-in tokenisers (BPE, WordPiece) that handle lowercasing and subword splitting internally. You do not need to manually stem, lemmatise, or remove stop words before feeding text to BERT or GPT. However, basic cleaning (removing HTML, fixing encoding issues, removing irrelevant boilerplate) is still beneficial.
What is the difference between Bag of Words and TF-IDF?
Both represent documents as word vectors, but BoW uses raw counts while TF-IDF weighs each count by how rare the word is across all documents. TF-IDF gives higher scores to words that are specific to a document and lower scores to words that appear everywhere. For most classification tasks, TF-IDF outperforms raw BoW.
Which Python library is best for NLP preprocessing?
NLTK is the classic choice — comprehensive, well-documented, and great for learning. spaCy is better for production pipelines — faster, more accurate lemmatisation, and includes POS tagging and named entity recognition out of the box. For transformer-based models, use HuggingFace Tokenizers. Start with NLTK for learning; switch to spaCy for projects.