Text Preprocessing in NLP
Tokenisation, Stemming, Lemmatisation & TF-IDF — For Engineering Students
Last Updated: March 2026
📌 Key Takeaways
- Definition: Text preprocessing converts raw text into clean numerical features that ML models can process.
- Pipeline: Raw text → Lowercase → Remove punctuation/numbers → Tokenise → Remove stop words → Stem/Lemmatise → Vectorise
- Stemming: Rule-based word truncation. Fast but may produce non-words. (“studies” → “studi”)
- Lemmatisation: Dictionary-based reduction to base form. Slower but accurate. (“studies” → “study”)
- Bag of Words: Simple word count matrix — loses word order but works well for many tasks.
- TF-IDF: Weighs words by importance to a document, penalising common words across all documents.
1. Why Text Preprocessing?
Machine learning algorithms operate on numbers — they cannot directly process raw text like “The quick brown fox jumped over the lazy dog.” Text preprocessing converts this unstructured text into structured numerical representations that algorithms can learn from.
Without preprocessing, a model would treat “Running”, “running”, and “RUNNING” as three completely different words. It would give equal weight to the word “the” (which appears in almost every document and carries no meaning) and the word “neural” (which is highly specific and informative). Preprocessing fixes these issues before training even begins.
The quality of text preprocessing directly impacts model performance — poor preprocessing leads to high-dimensional, noisy feature spaces that make learning harder.
2. The Standard Preprocessing Pipeline
A typical NLP preprocessing pipeline follows these steps in order:
- Raw text input — e.g., “The Students are Running to their Classes!!!”
- Lowercase → “the students are running to their classes!!!”
- Remove punctuation/special characters → “the students are running to their classes”
- Tokenise → [“the”, “students”, “are”, “running”, “to”, “their”, “classes”]
- Remove stop words → [“students”, “running”, “classes”]
- Stem or lemmatise → [“student”, “run”, “class”]
- Vectorise → numerical representation (Bag of Words or TF-IDF)
Not every step is needed for every task — the pipeline should be tailored to the problem. For sentiment analysis, stop word removal may hurt performance (negations like “not” are stop words but critical for sentiment). For topic modelling, stop words should definitely be removed.
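The pipeline above can be sketched as a single pure-Python function. The stop word list and suffix rules here are toy illustrations (a real list such as NLTK's has roughly 180 entries, and a real stemmer has far more rules), but the flow of steps matches the list above:

```python
import re

# Toy stop word list -- illustrative only; NLTK's real list is much larger.
STOP_WORDS = {"the", "a", "an", "are", "is", "to", "their", "on", "and"}

def simple_stem(word):
    """Crude rule-based stemming: strip a few common suffixes."""
    if word.endswith("ing") and len(word) > 5:
        stem = word[:-3]
        if len(stem) >= 2 and stem[-1] == stem[-2]:  # "runn" -> "run"
            stem = stem[:-1]
        return stem
    if word.endswith("es") and len(word) > 4:
        return word[:-2]
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word

def preprocess(text):
    text = text.lower()                                  # 1. lowercase
    text = re.sub(r"[^a-z\s]", " ", text)                # 2. strip punctuation/numbers
    tokens = text.split()                                # 3. tokenise
    tokens = [t for t in tokens if t not in STOP_WORDS]  # 4. remove stop words
    return [simple_stem(t) for t in tokens]              # 5. stem

print(preprocess("The Students are Running to their Classes!!!"))
# -> ['student', 'run', 'class']
```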
3. Tokenisation
Tokenisation is the process of splitting text into individual units called tokens. Tokens are usually words, but can also be characters, subwords, or sentences depending on the approach.
| Type | Example Input | Tokens | Use Case |
|---|---|---|---|
| Word tokenisation | “I love NLP” | [“I”, “love”, “NLP”] | Most NLP tasks |
| Sentence tokenisation | “I love NLP. It is fun.” | [“I love NLP.”, “It is fun.”] | Summarisation, translation |
| Character tokenisation | “NLP” | [“N”, “L”, “P”] | Spelling correction, rare words |
| Subword tokenisation | “unhappy” | [“un”, “happy”] | BERT, GPT — handles unknown words |
Modern large language models (BERT, GPT) use subword tokenisation algorithms like BPE (Byte-Pair Encoding) or WordPiece. These split rare words into known subword units — “unhappiness” → [“un”, “happy”, “ness”] — allowing the model to handle unseen words gracefully.
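A greedy longest-match split over a fixed vocabulary captures the core idea behind WordPiece-style tokenisation. The vocabulary below is hand-picked for the example (real vocabularies are learned from corpus statistics, and real tokenisers mark word-internal pieces with a "##" prefix, which this sketch omits):

```python
def greedy_subword_tokenize(word, vocab):
    """Split a word into subwords by repeatedly taking the longest prefix in vocab."""
    tokens, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):  # try longest match first
            if word[start:end] in vocab:
                tokens.append(word[start:end])
                start = end
                break
        else:
            return ["[UNK]"]  # no subword matches this position
    return tokens

vocab = {"un", "happy", "happi", "ness"}
print(greedy_subword_tokenize("unhappy", vocab))      # ['un', 'happy']
print(greedy_subword_tokenize("unhappiness", vocab))  # ['un', 'happi', 'ness']
```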
4. Normalisation
Normalisation reduces variability in text that is not semantically meaningful:
- Lowercasing: “Apple” and “apple” refer to the same word. Convert all text to lowercase unless capitalisation is meaningful (e.g., named entity recognition).
- Remove punctuation: Commas, periods, exclamation marks rarely add meaning for bag-of-words models. Remove them. Exception: sentiment analysis (exclamation marks can signal emotion).
- Remove numbers: For topic classification, standalone numbers often add noise. For financial text analysis, numbers may be critical.
- Remove extra whitespace: Normalise multiple spaces, tabs, and newlines to single spaces.
- Expand contractions: “don’t” → “do not”, “I’m” → “I am”. Important for consistent tokenisation.
- Remove HTML tags: When processing scraped web content — strip <p>, <div>, etc.
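The normalisation steps above can be chained with the standard-library `re` module. The contraction map here is a tiny illustrative sample; note the ordering matters, since contractions must be expanded before punctuation removal strips the apostrophes:

```python
import re

# Partial contraction map -- a real one (e.g. the `contractions` package) is far larger.
CONTRACTIONS = {"don't": "do not", "i'm": "i am", "can't": "cannot"}

def normalise(text):
    text = re.sub(r"<[^>]+>", " ", text)      # strip HTML tags
    text = text.lower()                       # lowercase
    for contraction, expansion in CONTRACTIONS.items():
        text = text.replace(contraction, expansion)  # expand before removing "'"
    text = re.sub(r"[0-9]+", " ", text)       # remove numbers
    text = re.sub(r"[^\w\s]", " ", text)      # remove punctuation
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return text

print(normalise("<p>I'm  NOT   buying 3 apples, don't you know?!</p>"))
# -> i am not buying apples do not you know
```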
5. Stop Word Removal
Stop words are extremely common words that carry little semantic meaning and appear in almost every document — words like “the”, “a”, “is”, “in”, “to”, “and”. Including them inflates the feature space without adding discriminating information.
Standard English stop word lists (NLTK, spaCy) contain 100–400 words. Removing them reduces vocabulary size significantly and speeds up training.
When NOT to remove stop words:
- Sentiment analysis: “not good” → removing “not” gives “good” — the opposite meaning.
- Question answering: “what”, “who”, “where” are stop words but are the actual questions.
- Text generation and language modelling: all words are needed for coherent text.
- Subword/transformer models: these handle common words internally — stop word removal is unnecessary.
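For sentiment-style tasks, one common compromise is to filter stop words but whitelist negations. A minimal sketch with a toy stop word list:

```python
STOP_WORDS = {"the", "is", "a", "not", "was", "to"}  # tiny illustrative list
NEGATIONS = {"not", "no", "never"}

def remove_stop_words(tokens, keep_negations=False):
    keep = NEGATIONS if keep_negations else set()
    return [t for t in tokens if t not in STOP_WORDS or t in keep]

tokens = ["the", "movie", "was", "not", "good"]
print(remove_stop_words(tokens))                       # ['movie', 'good'] -- sentiment flipped!
print(remove_stop_words(tokens, keep_negations=True))  # ['movie', 'not', 'good']
```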
6. Stemming
Stemming reduces words to their base or root form by removing suffixes and prefixes using rule-based algorithms. The result (the stem) may not be a real word.
| Original Word | Porter Stem | Is it a real word? |
|---|---|---|
| running | run | Yes |
| studies | studi | No |
| happiness | happi | No |
| beautiful | beauti | No |
| engineering | engin | No |
Common stemming algorithms: Porter Stemmer (most common for English), Lancaster Stemmer (more aggressive), Snowball Stemmer (multilingual). The Porter Stemmer is the standard choice for most English NLP tasks.
Pros: Very fast; simple to implement; reduces vocabulary size. Cons: May produce non-words; over-stemming (different words get same stem) and under-stemming (same word gets different stems) are common issues.
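Both failure modes are easy to reproduce with a toy suffix stripper (the suffix list below is made up for illustration; the Porter Stemmer behaves similarly on these words):

```python
def toy_stem(word):
    """Crude suffix stripping -- mimics how rule-based stemmers misfire."""
    for suffix in ("ational", "ation", "ity", "ing", "es", "ed", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Over-stemming: unrelated words collapse to the same stem.
print(toy_stem("university"), toy_stem("universe"))  # univers univers

# Under-stemming: related forms fail to collapse to one stem.
print(toy_stem("ran"), toy_stem("running"))          # ran runn
```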
7. Lemmatisation
Lemmatisation reduces words to their canonical dictionary form (lemma) using vocabulary and morphological analysis. Unlike stemming, the output is always a real, meaningful word.
| Original Word | Lemma | Part of Speech |
|---|---|---|
| running | run | Verb |
| studies | study | Verb/Noun |
| better | good | Adjective |
| was | be | Verb |
| mice | mouse | Noun |
Lemmatisation requires knowing the part of speech (POS) of each word. For “studies” the noun and verb lemmas happen to coincide (both “study”), but for “better” the adjective lemma is “good” while the verb lemma (as in “to better oneself”) is “better”.
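The POS dependence can be made concrete with a tiny hand-written lemma table keyed by (word, POS). A real lemmatiser (WordNet, spaCy) uses a full dictionary plus morphological rules rather than a lookup like this:

```python
# Toy lemma table -- illustrative only, keyed by (word, part of speech).
LEMMAS = {
    ("studies", "noun"): "study",
    ("studies", "verb"): "study",
    ("better", "adj"): "good",
    ("better", "verb"): "better",
    ("mice", "noun"): "mouse",
    ("was", "verb"): "be",
}

def lemmatise(word, pos):
    """Return the lemma for (word, pos), or the word unchanged if unknown."""
    return LEMMAS.get((word, pos), word)

print(lemmatise("better", "adj"))   # good
print(lemmatise("better", "verb"))  # better
```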
When to use lemmatisation over stemming: When text quality and word meaning matter — information retrieval, sentiment analysis, question answering. For high-volume pipelines where speed is critical and accuracy is secondary, stemming is acceptable.
8. Bag of Words (BoW)
Bag of Words represents text as a vector of word counts, ignoring grammar and word order. The “bag” metaphor reflects that all words are thrown in together with no structure.
Example: Vocabulary = [“cat”, “dog”, “sat”, “mat”]
- “The cat sat on the mat” → [1, 0, 1, 1] (cat=1, dog=0, sat=1, mat=1)
- “The dog sat on the mat” → [0, 1, 1, 1]
Pros: Simple, fast, works well for document classification and spam detection. Cons: Ignores word order (semantic meaning lost); creates high-dimensional sparse vectors for large vocabularies; common words dominate counts.
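Bag of Words is simple enough to implement from scratch. The sketch below assumes stop words were already removed and sorts the vocabulary alphabetically (as scikit-learn's `CountVectorizer` also does):

```python
def bag_of_words(documents):
    """Build a sorted vocabulary and one count vector per document."""
    vocab = sorted({word for doc in documents for word in doc.split()})
    vectors = [[doc.split().count(word) for word in vocab] for doc in documents]
    return vocab, vectors

docs = ["cat sat mat", "dog sat mat"]  # stop words already removed
vocab, vectors = bag_of_words(docs)
print(vocab)    # ['cat', 'dog', 'mat', 'sat']
print(vectors)  # [[1, 0, 1, 1], [0, 1, 1, 1]]
```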
9. TF-IDF — Term Frequency-Inverse Document Frequency
TF-IDF improves on Bag of Words by down-weighting words that are common across all documents (and therefore less discriminating) and up-weighting words that are specific to individual documents.
TF(t, d) = count of term t in document d / total terms in d
IDF(t) = log(N / df(t))
TF-IDF(t, d) = TF(t, d) × IDF(t)
| Symbol | Meaning |
|---|---|
| TF(t, d) | How often term t appears in document d (normalised) |
| IDF(t) | How rare term t is across all N documents |
| N | Total number of documents in the corpus |
| df(t) | Number of documents containing term t |
Intuition: “the” appears in every document → IDF ≈ 0 → TF-IDF ≈ 0 (filtered out). “neural” appears in 5 out of 1000 documents → high IDF → high TF-IDF in documents that contain it.
TF-IDF is the standard text representation for classical ML models (Naive Bayes, Logistic Regression, SVM) applied to text, and it typically outperforms raw Bag of Words on classification tasks.
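The formulas above translate directly into a few lines of Python. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalises each row, so its numbers differ from this textbook version:

```python
import math

def tf_idf(documents):
    """Compute TF-IDF weights from the textbook formulas."""
    n = len(documents)                                   # N: corpus size
    tokenised = [doc.split() for doc in documents]
    vocab = sorted({w for doc in tokenised for w in doc})
    df = {w: sum(1 for doc in tokenised if w in doc) for w in vocab}  # df(t)
    scores = []
    for doc in tokenised:
        row = {}
        for w in vocab:
            tf = doc.count(w) / len(doc)   # TF(t, d)
            idf = math.log(n / df[w])      # IDF(t)
            row[w] = tf * idf              # TF-IDF(t, d)
        scores.append(row)
    return scores

docs = ["the cat sat", "the dog sat", "the cat ran"]
scores = tf_idf(docs)
print(round(scores[0]["the"], 3))  # 0.0 -- "the" appears in every document
print(round(scores[0]["cat"], 3))  # 0.135 -- "cat" appears in 2 of 3 documents
```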
10. Python Code
```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Download required NLTK data
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # required by newer NLTK versions
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)

text = "The engineering students are running experiments and studying machine learning algorithms."

# --- 1. Tokenisation ---
tokens = word_tokenize(text)
print("Tokens:", tokens)

# --- 2. Lowercase ---
tokens_lower = [t.lower() for t in tokens]

# --- 3. Remove punctuation ---
tokens_clean = [t for t in tokens_lower if t.isalpha()]

# --- 4. Remove stop words ---
stop_words = set(stopwords.words('english'))
tokens_no_stop = [t for t in tokens_clean if t not in stop_words]
print("After stop word removal:", tokens_no_stop)

# --- 5. Stemming ---
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens_no_stop]
print("Stems:", stems)

# --- 6. Lemmatisation ---
# pos='v' treats every token as a verb -- a simplification; a real
# pipeline would POS-tag each token first.
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t, pos='v') for t in tokens_no_stop]
print("Lemmas:", lemmas)

# --- 7. Bag of Words ---
corpus = [
    "machine learning algorithms are powerful",
    "deep learning uses neural networks",
    "machine learning and deep learning are related"
]
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(corpus)
print("\nBag of Words vocabulary:", vectorizer.get_feature_names_out())
print("BoW matrix:\n", bow_matrix.toarray())

# --- 8. TF-IDF ---
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(corpus)
print("\nTF-IDF matrix:\n", tfidf_matrix.toarray().round(3))
```
11. Frequently Asked Questions
Do I need to preprocess text for transformer models like BERT?
Not in the traditional sense. Transformers use their own built-in tokenisers (BPE, WordPiece) that handle lowercasing and subword splitting internally. You do not need to manually stem, lemmatise, or remove stop words before feeding text to BERT or GPT. However, basic cleaning (removing HTML, fixing encoding issues, removing irrelevant boilerplate) is still beneficial.
What is the difference between Bag of Words and TF-IDF?
Both represent documents as word vectors, but BoW uses raw counts while TF-IDF weighs each count by how rare the word is across all documents. TF-IDF gives higher scores to words that are specific to a document and lower scores to words that appear everywhere. For most classification tasks, TF-IDF outperforms raw BoW.
Which Python library is best for NLP preprocessing?
NLTK is the classic choice — comprehensive, well-documented, and great for learning. spaCy is better for production pipelines — faster, more accurate lemmatisation, and includes POS tagging and named entity recognition out of the box. For transformer-based models, use HuggingFace Tokenizers. Start with NLTK for learning; switch to spaCy for projects.