Naive Bayes Classifier
Probabilistic Classification Explained for Engineering Students
Last Updated: March 2026
📌 Key Takeaways
- Definition: A probabilistic classifier based on Bayes’ theorem with a conditional independence assumption between features.
- Core formula: P(class | features) ∝ P(features | class) × P(class)
- “Naive” assumption: All features are treated as independent given the class — rarely true but surprisingly effective.
- Three variants: Gaussian NB (continuous features), Multinomial NB (word counts/text), Bernoulli NB (binary features).
- Strengths: Extremely fast, works well with small datasets and high-dimensional text data.
- Best for: Spam detection, sentiment analysis, document classification, real-time classification.
1. Bayes’ Theorem — The Foundation
Naive Bayes is built on Bayes’ theorem:
P(A | B) = P(B | A) × P(A) / P(B)
| Term | Name | Meaning in ML |
|---|---|---|
| P(A|B) | Posterior | Probability of class A given observed features B — what we want |
| P(B|A) | Likelihood | Probability of observing features B given class A |
| P(A) | Prior | Probability of class A before seeing any features |
| P(B) | Evidence | Probability of features B (same for all classes — acts as normaliser) |
In classification, P(B) is the same for all classes, so: P(class | features) ∝ P(features | class) × P(class)
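To make the proportionality concrete, here is a minimal plain-Python sketch. The prior and likelihood values are made-up numbers for illustration, not estimates from any real dataset:

```python
# Bayes' theorem with illustrative numbers:
# P(spam | "free") = P("free" | spam) * P(spam) / P("free")
p_spam = 0.5             # prior P(A)
p_free_given_spam = 0.8  # likelihood P(B|A)
p_free_given_ham = 0.1   # likelihood under the other class

# Evidence P(B) via the law of total probability
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

posterior = p_free_given_spam * p_spam / p_free
print(f"P(spam | 'free') = {posterior:.3f}")  # 0.889
```

Note that the evidence term only rescales the scores; dropping it changes the numbers but never the winning class.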
2. What is Naive Bayes?
Naive Bayes is a supervised probabilistic classifier that uses Bayes’ theorem with the “naive” assumption that all input features are conditionally independent given the class label.
It is called naive because real features are almost never truly independent — in a spam email, the words “free” and “money” are correlated, not independent. Yet despite this violated assumption, Naive Bayes performs remarkably well in practice, especially for text classification.
3. The Naive Bayes Formula
Given features x = (x₁, x₂, …, xₙ), the predicted class is:
ŷ = argmax_c [ P(c) × Π P(xᵢ | c) ]
In practice, use log probabilities to avoid numerical underflow:
ŷ = argmax_c [ log P(c) + Σ log P(xᵢ | c) ]
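The log-space version can be sketched in a few lines; the per-class priors and per-feature likelihoods below are hypothetical values chosen for illustration:

```python
import math

# Hypothetical per-class priors and per-feature likelihoods P(x_i | c)
params = {
    "spam":     {"prior": 0.4, "likelihoods": [0.75, 0.75]},
    "not_spam": {"prior": 0.6, "likelihoods": [0.40, 0.20]},
}

def log_score(p):
    # log P(c) + sum_i log P(x_i | c): sums replace products, so a long
    # chain of small probabilities no longer underflows to 0.0
    return math.log(p["prior"]) + sum(math.log(q) for q in p["likelihoods"])

best = max(params, key=lambda c: log_score(params[c]))
print(best)  # spam
```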
4. Three Variants of Naive Bayes
| Variant | Feature Type | Likelihood Model | Best For |
|---|---|---|---|
| Gaussian NB | Continuous numerical | Gaussian distribution with mean and std estimated from training data | Sensor data, medical measurements, continuous features |
| Multinomial NB | Discrete counts | Proportional to frequency of feature in class | Text classification with word counts |
| Bernoulli NB | Binary (0 or 1) | Bernoulli distribution — probability of feature being present/absent | Short texts, binary word presence |
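A quick way to see the difference between the two discrete variants, using scikit-learn on a made-up word-count matrix: Multinomial NB models how often each word occurs, while Bernoulli NB only models whether it occurs at all (it binarizes the input internally).

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

# Made-up count matrix: 4 documents, 2 words
X_counts = np.array([[3, 0], [2, 0], [0, 2], [0, 3]])
y = [1, 1, 0, 0]

mnb = MultinomialNB().fit(X_counts, y)  # models the counts themselves
bnb = BernoulliNB().fit(X_counts, y)    # sees only presence/absence

print(mnb.predict([[1, 0]]), bnb.predict([[1, 0]]))  # both predict class 1
```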
5. Worked Example — Spam Detection
| Email | Contains “free” | Contains “meeting” | Class |
|---|---|---|---|
| 1 | Yes | No | Spam |
| 2 | Yes | No | Spam |
| 3 | No | Yes | Not Spam |
| 4 | No | Yes | Not Spam |
| 5 | Yes | Yes | Not Spam |
Priors: P(Spam) = 2/5 = 0.4, P(Not Spam) = 3/5 = 0.6
Classify: “free”=Yes, “meeting”=No. Because “meeting”=No never occurs with Not Spam, we apply Laplace smoothing (α = 1, two possible values per feature):
- Score(Spam) = 0.4 × (2+1)/(2+2) × (2+1)/(2+2) = 0.4 × 0.75 × 0.75 = 0.225
- Score(Not Spam) = 0.6 × (1+1)/(3+2) × (0+1)/(3+2) = 0.6 × 0.4 × 0.2 = 0.048
Prediction: Spam (0.225 > 0.048)
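The worked example can be recomputed in plain Python with Laplace smoothing (α = 1, two possible values per feature):

```python
# Smoothed likelihood: P(x_i = v | c) = (count + alpha) / (class_total + alpha * |V|)
def smoothed(count, class_total, alpha=1, n_values=2):
    return (count + alpha) / (class_total + alpha * n_values)

# Spam class: 2 emails, both with free=Yes and meeting=No
score_spam = 0.4 * smoothed(2, 2) * smoothed(2, 2)      # 0.4 * 0.75 * 0.75 = 0.225
# Not Spam class: 3 emails; 1 with free=Yes, 0 with meeting=No
score_not_spam = 0.6 * smoothed(1, 3) * smoothed(0, 3)  # 0.6 * 0.4 * 0.2 = 0.048
print("Spam" if score_spam > score_not_spam else "Not Spam")  # Spam
```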
6. Laplace Smoothing
If a feature value never appears with a class in training, its likelihood is 0 — zeroing out the entire product. Laplace Smoothing fixes this:
P(xᵢ = v | c) = (count(xᵢ=v, c) + α) / (count(c) + α × |V|)
Where α = 1 gives classic Laplace smoothing and |V| is the number of possible values of the feature. In scikit-learn, MultinomialNB and BernoulliNB apply it via the alpha parameter (default 1.0); for GaussianNB, the var_smoothing parameter plays an analogous stabilising role for the estimated variances. Always use smoothing (α > 0).
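The formula in action, on the one likelihood in the toy spam dataset that would otherwise be zero ("meeting"=No never occurs with Not Spam):

```python
# count = 0 occurrences of meeting=No among 3 Not Spam emails
count, class_total, alpha, n_values = 0, 3, 1, 2  # n_values: {Yes, No}
p_unsmoothed = count / class_total                               # 0.0, kills the product
p_smoothed = (count + alpha) / (class_total + alpha * n_values)  # 0.2
print(p_unsmoothed, p_smoothed)
```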
7. Advantages & Limitations
| Advantages | Limitations |
|---|---|
| Extremely fast to train and predict | Naive independence assumption rarely holds |
| Works well with small training datasets | Poor probability estimates (often over-confident) |
| Excellent for high-dimensional text data | Cannot learn feature interactions |
| Handles multi-class naturally | Continuous features require distributional assumption |
| Robust to irrelevant features | Sensitive to correlated features |
8. Python Code
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
# --- Gaussian NB for continuous features ---
X, y = load_iris(return_X_y=True)
gnb = GaussianNB()
scores = cross_val_score(gnb, X, y, cv=10, scoring='accuracy')
print(f"Gaussian NB Accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
# --- Multinomial NB for text classification ---
emails = [
    "free money win prize", "free offer click here",
    "meeting tomorrow office", "project deadline schedule",
    "free click win money", "team meeting agenda"
]
labels = [1, 1, 0, 0, 1, 0]  # 1 = Spam, 0 = Not Spam
text_pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB(alpha=1.0))  # alpha = Laplace smoothing
])
text_pipeline.fit(emails, labels)
new_email = ["free money click"]
prediction = text_pipeline.predict(new_email)
probability = text_pipeline.predict_proba(new_email)
print(f"Prediction: {'Spam' if prediction[0]==1 else 'Not Spam'}")
print(f"Spam probability: {probability[0][1]:.3f}")
9. Common Mistakes Students Make
- Not applying Laplace smoothing: Without smoothing, a single unseen feature value zeros out the entire prediction. Always use smoothing (alpha > 0).
- Using Multinomial NB with negative values: Multinomial NB (and Complement NB) require non-negative feature values such as counts or TF-IDF weights. For negative or general continuous features, use Gaussian NB, or transform the features to be non-negative.
- Expecting calibrated probabilities: Naive Bayes probability estimates are often extreme. Good for ranking and classification, not for calibrated confidence.
- Using NB when features are highly correlated: Correlated features are counted multiple times, distorting the posterior.
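The correlated-features problem from the last bullet is easy to demonstrate: duplicating the same column makes Naive Bayes count the same evidence several times over, pushing its probability estimates toward the extremes. A sketch on synthetic data (the generating process and seed are arbitrary):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 1))                 # one binary feature
y = (X[:, 0] ^ (rng.random(200) < 0.2)).astype(int)   # label = feature with 20% noise

p1 = BernoulliNB().fit(X, y).predict_proba([[1]])[0, 1]
# Duplicate the feature 5 times: perfectly correlated columns
X_dup = np.repeat(X, 5, axis=1)
p5 = BernoulliNB().fit(X_dup, y).predict_proba([[1] * 5])[0, 1]
print(f"{p1:.3f} -> {p5:.3f}")  # the duplicated model is far more confident
```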
10. Frequently Asked Questions
Why does Naive Bayes work despite the naive assumption?
Even though the independence assumption is violated, Naive Bayes only needs to identify the correct class — not estimate accurate probabilities. The ranking of classes by posterior probability is often correct even when individual probabilities are wrong.
Which Naive Bayes variant should I use for text classification?
For long documents with word frequency counts, use Multinomial NB. For short texts or binary word presence/absence, use Bernoulli NB. Complement NB often outperforms both on imbalanced datasets — available in Scikit-learn as ComplementNB.
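ComplementNB is a drop-in replacement in a scikit-learn pipeline; a sketch reusing the toy emails from Section 8:

```python
from sklearn.naive_bayes import ComplementNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

emails = ["free money win prize", "free offer click here",
          "meeting tomorrow office", "project deadline schedule",
          "free click win money", "team meeting agenda"]
labels = [1, 1, 0, 0, 1, 0]  # 1 = Spam, 0 = Not Spam

pipe = Pipeline([("vectorizer", CountVectorizer()),
                 ("classifier", ComplementNB(alpha=1.0))])
pipe.fit(emails, labels)
print(pipe.predict(["free money click"]))  # [1]
```

ComplementNB estimates each class's parameters from the *other* classes' counts, which reduces the bias toward majority classes on imbalanced data.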