Naive Bayes Classifier

Probabilistic Classification Explained for Engineering Students

Last Updated: March 2026

📌 Key Takeaways

  • Definition: A probabilistic classifier based on Bayes’ theorem with a conditional independence assumption between features.
  • Core formula: P(class | features) ∝ P(features | class) × P(class)
  • “Naive” assumption: All features are treated as independent given the class — rarely true but surprisingly effective.
  • Three variants: Gaussian NB (continuous features), Multinomial NB (word counts/text), Bernoulli NB (binary features).
  • Strengths: Extremely fast, works well with small datasets and high-dimensional text data.
  • Best for: Spam detection, sentiment analysis, document classification, real-time classification.

1. Bayes’ Theorem — The Foundation

Naive Bayes is built on Bayes’ theorem:

P(A | B) = P(B | A) × P(A) / P(B)

  • P(A|B) — Posterior: the probability of class A given observed features B — what we want.
  • P(B|A) — Likelihood: the probability of observing features B given class A.
  • P(A) — Prior: the probability of class A before seeing any features.
  • P(B) — Evidence: the probability of features B; it is the same for all classes, so it acts only as a normaliser.

In classification, P(B) is the same for all classes, so: P(class | features) ∝ P(features | class) × P(class)
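The theorem can be sanity-checked with a few lines of arithmetic. All numbers below are invented for illustration: a 40% spam base rate and assumed rates for the word “free” in each class.

```python
# Illustrative numbers only: 40% of mail is spam,
# "free" appears in 80% of spam and 10% of non-spam.
p_spam = 0.4                 # prior P(A)
p_free_given_spam = 0.8      # likelihood P(B|A)
p_free_given_ham = 0.1

# Evidence P(B): total probability of seeing "free" at all
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

# Posterior P(A|B) via Bayes' theorem
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(f"P(spam | 'free') = {p_spam_given_free:.3f}")
```

Note how a word that merely *tends* to appear in spam pushes the posterior from the 0.4 prior up toward certainty.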

2. What is Naive Bayes?

Naive Bayes is a supervised probabilistic classifier that uses Bayes’ theorem with the “naive” assumption that all input features are conditionally independent given the class label.

It is called naive because real features are almost never truly independent — in a spam email, the words “free” and “money” are correlated, not independent. Yet despite this violated assumption, Naive Bayes performs remarkably well in practice, especially for text classification.

3. The Naive Bayes Formula

Given features x = (x₁, x₂, …, xₙ), the predicted class is:

ŷ = argmax_c [ P(c) × Π P(xᵢ | c) ]

In practice, use log probabilities to avoid numerical underflow:

ŷ = argmax_c [ log P(c) + Σ log P(xᵢ | c) ]
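The log-space rule can be sketched directly. The priors and per-feature likelihoods below are hypothetical fitted values, not the output of any real training run:

```python
import math

# Hypothetical fitted parameters for a two-class problem:
# each class has a prior P(c) and per-feature likelihoods P(x_i | c).
params = {
    "spam":     {"prior": 0.4, "likelihoods": [0.75, 0.75]},
    "not_spam": {"prior": 0.6, "likelihoods": [0.40, 0.20]},
}

def predict(params):
    # Sum of logs replaces the product, so hundreds of small
    # probabilities never underflow to 0.0.
    scores = {
        c: math.log(p["prior"]) + sum(math.log(l) for l in p["likelihoods"])
        for c, p in params.items()
    }
    return max(scores, key=scores.get), scores

label, scores = predict(params)
print(label, scores)
```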

4. Three Variants of Naive Bayes

| Variant | Feature Type | Likelihood Model | Best For |
|---|---|---|---|
| Gaussian NB | Continuous numerical | Gaussian distribution with mean and std estimated from training data | Sensor data, medical measurements, continuous features |
| Multinomial NB | Discrete counts | Proportional to frequency of feature in class | Text classification with word counts |
| Bernoulli NB | Binary (0 or 1) | Bernoulli distribution — probability of feature being present/absent | Short texts, binary word presence |
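Choosing a variant is largely choosing a feature representation. A minimal sketch contrasting the two text variants on a toy corpus (texts and labels invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

texts = ["free money win", "meeting tomorrow", "free win prize", "project meeting notes"]
labels = [1, 0, 1, 0]  # toy labels: 1=Spam, 0=Not Spam

# Multinomial NB consumes raw word counts
count_vec = CountVectorizer()
mnb = MultinomialNB().fit(count_vec.fit_transform(texts), labels)

# Bernoulli NB consumes presence/absence; binary=True caps counts at 1
bin_vec = CountVectorizer(binary=True)
bnb = BernoulliNB().fit(bin_vec.fit_transform(texts), labels)

query = ["free money"]
print(mnb.predict(count_vec.transform(query)))
print(bnb.predict(bin_vec.transform(query)))
```

Bernoulli NB also penalises *absent* words, which is why it tends to suit short texts where absence is informative.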

5. Worked Example — Spam Detection

| Email | Contains “free” | Contains “meeting” | Class |
|---|---|---|---|
| 1 | Yes | No | Spam |
| 2 | Yes | No | Spam |
| 3 | No | Yes | Not Spam |
| 4 | No | Yes | Not Spam |
| 5 | Yes | Yes | Not Spam |

Priors: P(Spam) = 2/5 = 0.4, P(Not Spam) = 3/5 = 0.6

Classify: “free”=Yes, “meeting”=No — using Laplace smoothing (α = 1; each feature takes 2 values):

  • Score(Spam) = 0.4 × (2+1)/(2+2) × (2+1)/(2+2) = 0.4 × 0.75 × 0.75 = 0.225
  • Score(Not Spam) = 0.6 × (1+1)/(3+2) × (0+1)/(3+2) = 0.6 × 0.4 × 0.2 = 0.048

Prediction: Spam
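The hand calculation can be cross-checked with scikit-learn's BernoulliNB, whose alpha=1.0 corresponds to Laplace smoothing:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Columns: contains "free", contains "meeting" (1=Yes, 0=No)
X = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 1]])
y = np.array([1, 1, 0, 0, 0])  # 1=Spam, 0=Not Spam

model = BernoulliNB(alpha=1.0)  # alpha=1.0 is Laplace smoothing
model.fit(X, y)

# Classify free=Yes, meeting=No
print(model.predict([[1, 0]]))        # predicted class
print(model.predict_proba([[1, 0]]))  # normalised posterior
```

The posterior it reports is simply each class score divided by their sum, since the evidence term cancels.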

6. Laplace Smoothing

If a feature value never appears with a class in training, its likelihood is 0 — zeroing out the entire product. Laplace Smoothing fixes this:

P(xᵢ = v | c) = (count(xᵢ=v, c) + α) / (count(c) + α × |V|)

Where α = 1 gives classic Laplace smoothing and |V| is the number of possible values of xᵢ. In scikit-learn, MultinomialNB and BernoulliNB expose α as the alpha parameter (default 1.0); GaussianNB does not use Laplace smoothing — its var_smoothing parameter instead stabilises the variance estimates for continuous features. Always use smoothing (α > 0).
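A few lines make the formula concrete. In the worked spam table, meeting=No never co-occurs with Not Spam, so the unsmoothed likelihood is exactly zero:

```python
def smoothed_likelihood(count_xv_c, count_c, n_values, alpha=1.0):
    """P(x_i = v | c) with additive (Laplace) smoothing."""
    return (count_xv_c + alpha) / (count_c + alpha * n_values)

# meeting=No occurs in 0 of the 3 Not Spam emails;
# the feature takes 2 values (Yes/No).
print(smoothed_likelihood(0, 3, 2, alpha=0))  # unsmoothed: 0.0 kills the whole product
print(smoothed_likelihood(0, 3, 2, alpha=1))  # smoothed: 0.2
```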

7. Advantages & Limitations

| Advantages | Limitations |
|---|---|
| Extremely fast to train and predict | Naive independence assumption rarely holds |
| Works well with small training datasets | Poor probability estimates (often over-confident) |
| Excellent for high-dimensional text data | Cannot learn feature interactions |
| Handles multi-class naturally | Continuous features require distributional assumption |
| Robust to irrelevant features | Sensitive to correlated features |

8. Python Code


from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris

# --- Gaussian NB for continuous features ---
X, y = load_iris(return_X_y=True)
gnb = GaussianNB()
scores = cross_val_score(gnb, X, y, cv=10, scoring='accuracy')
print(f"Gaussian NB Accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# --- Multinomial NB for text classification ---
emails = [
    "free money win prize", "free offer click here",
    "meeting tomorrow office", "project deadline schedule",
    "free click win money", "team meeting agenda"
]
labels = [1, 1, 0, 0, 1, 0]  # 1=Spam, 0=Not Spam

text_pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB(alpha=1.0))  # alpha=Laplace smoothing
])
text_pipeline.fit(emails, labels)
new_email = ["free money click"]
prediction = text_pipeline.predict(new_email)
probability = text_pipeline.predict_proba(new_email)
print(f"Prediction: {'Spam' if prediction[0]==1 else 'Not Spam'}")
print(f"Spam probability: {probability[0][1]:.3f}")
    

9. Common Mistakes Students Make

  • Not applying Laplace smoothing: Without smoothing, a single unseen feature value zeros out the entire prediction. Always use smoothing (alpha > 0).
  • Using Multinomial NB with negative values: Multinomial NB requires non-negative feature values. Use Gaussian NB or Complement NB for negative features.
  • Expecting calibrated probabilities: Naive Bayes probability estimates are often extreme. Good for ranking and classification, not for calibrated confidence.
  • Using NB when features are highly correlated: Correlated features are counted multiple times, distorting the posterior.
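On the calibration point: when trustworthy probabilities matter, scikit-learn's CalibratedClassifierCV can wrap a Naive Bayes model. A minimal sketch on synthetic data (dataset parameters chosen arbitrarily):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

raw = GaussianNB().fit(X_tr, y_tr)

# The calibrator refits NB on cross-validation folds and learns a
# sigmoid mapping from raw scores to better probabilities.
calibrated = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=5)
calibrated.fit(X_tr, y_tr)

print(raw.predict_proba(X_te)[:3].round(3))         # often near 0 or 1
print(calibrated.predict_proba(X_te)[:3].round(3))  # typically more moderate
```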

10. Frequently Asked Questions

Why does Naive Bayes work despite the naive assumption?

Even though the independence assumption is violated, Naive Bayes only needs to identify the correct class — not estimate accurate probabilities. The ranking of classes by posterior probability is often correct even when individual probabilities are wrong.

Which Naive Bayes variant should I use for text classification?

For long documents with word frequency counts, use Multinomial NB. For short texts or binary word presence/absence, use Bernoulli NB. Complement NB often outperforms both on imbalanced datasets — available in Scikit-learn as ComplementNB.
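ComplementNB is a drop-in replacement in a text pipeline. A minimal sketch on an invented, heavily imbalanced toy corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import Pipeline

# Imbalanced toy corpus: one spam email against five ham emails
emails = ["free money win", "meeting today", "lunch tomorrow",
          "project update", "agenda attached", "status report"]
labels = [1, 0, 0, 0, 0, 0]  # 1=Spam, 0=Not Spam

pipe = Pipeline([
    ("vec", CountVectorizer()),
    ("clf", ComplementNB(alpha=1.0)),  # estimates parameters from each class's complement
])
pipe.fit(emails, labels)
print(pipe.predict(["free money"]))
```

Because each class's parameters are estimated from the *other* classes' documents, the tiny spam class is less starved of evidence than it would be under plain Multinomial NB.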

Next Steps