Naive Bayes Classifier
Probabilistic Classification Explained for Engineering Students
Last Updated: March 2026
📌 Key Takeaways
- Definition: A probabilistic classifier based on Bayes’ theorem with a conditional independence assumption between features.
- Core formula: P(class | features) ∝ P(features | class) × P(class)
- “Naive” assumption: All features are treated as independent given the class — rarely true but surprisingly effective.
- Three variants: Gaussian NB (continuous features), Multinomial NB (word counts/text), Bernoulli NB (binary features).
- Strengths: Extremely fast, works well with small datasets and high-dimensional text data.
- Best for: Spam detection, sentiment analysis, document classification, real-time classification.
1. Bayes’ Theorem — The Foundation
Naive Bayes is built on Bayes’ theorem:
P(A | B) = P(B | A) × P(A) / P(B)
| Term | Name | Meaning in ML |
|---|---|---|
| P(A|B) | Posterior | Probability of class A given observed features B — what we want |
| P(B|A) | Likelihood | Probability of observing features B given class A |
| P(A) | Prior | Probability of class A before seeing any features |
| P(B) | Evidence | Probability of features B (same for all classes — acts as normaliser) |
In classification, P(B) is the same for all classes, so: P(class | features) ∝ P(features | class) × P(class)
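To make the proportionality concrete, here is a minimal plain-Python sketch. The prior and likelihood values are made-up numbers for illustration, not estimates from any real dataset:

```python
# Bayes' theorem with illustrative numbers:
# P(spam | "free") = P("free" | spam) * P(spam) / P("free")
p_spam = 0.5             # prior P(A)
p_free_given_spam = 0.8  # likelihood P(B|A)
p_free_given_ham = 0.1   # likelihood under the other class

# Evidence P(B) via the law of total probability
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

posterior = p_free_given_spam * p_spam / p_free
print(f"P(spam | 'free') = {posterior:.3f}")  # 0.889
```

Note that the evidence term only rescales the scores; dropping it changes the numbers but never the winning class.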
2. What is Naive Bayes?
Naive Bayes is a supervised probabilistic classifier that uses Bayes’ theorem with the “naive” assumption that all input features are conditionally independent given the class label.
It is called naive because real features are almost never truly independent — in a spam email, the words “free” and “money” are correlated, not independent. Yet despite this violated assumption, Naive Bayes performs remarkably well in practice, especially for text classification.
3. The Naive Bayes Formula
Given features x = (x₁, x₂, …, xₙ), the predicted class is:
ŷ = argmax_c [ P(c) × Π P(xᵢ | c) ]
In practice, use log probabilities to avoid numerical underflow:
ŷ = argmax_c [ log P(c) + Σ log P(xᵢ | c) ]
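The log-space version can be sketched in a few lines; the per-class priors and per-feature likelihoods below are hypothetical values chosen for illustration:

```python
import math

# Hypothetical per-class priors and per-feature likelihoods P(x_i | c)
params = {
    "spam":     {"prior": 0.4, "likelihoods": [0.75, 0.75]},
    "not_spam": {"prior": 0.6, "likelihoods": [0.40, 0.20]},
}

def log_score(p):
    # log P(c) + sum_i log P(x_i | c): sums replace products, so a long
    # chain of small probabilities no longer underflows to 0.0
    return math.log(p["prior"]) + sum(math.log(q) for q in p["likelihoods"])

best = max(params, key=lambda c: log_score(params[c]))
print(best)  # spam
```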
4. Three Variants of Naive Bayes
| Variant | Feature Type | Likelihood Model | Best For |
|---|---|---|---|
| Gaussian NB | Continuous numerical | Gaussian distribution with mean and std estimated from training data | Sensor data, medical measurements, continuous features |
| Multinomial NB | Discrete counts | Proportional to frequency of feature in class | Text classification with word counts |
| Bernoulli NB | Binary (0 or 1) | Bernoulli distribution — probability of feature being present/absent | Short texts, binary word presence |
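A quick way to see the difference between the two discrete variants, using scikit-learn on a made-up word-count matrix: Multinomial NB models how often each word occurs, while Bernoulli NB only models whether it occurs at all (it binarizes the input internally).

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

# Made-up count matrix: 4 documents, 2 words
X_counts = np.array([[3, 0], [2, 0], [0, 2], [0, 3]])
y = [1, 1, 0, 0]

mnb = MultinomialNB().fit(X_counts, y)  # models the counts themselves
bnb = BernoulliNB().fit(X_counts, y)    # sees only presence/absence

print(mnb.predict([[1, 0]]), bnb.predict([[1, 0]]))  # both predict class 1
```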
5. Worked Example — Spam Detection
| Email | Contains “free” | Contains “meeting” | Class |
|---|---|---|---|
| 1 | Yes | No | Spam |
| 2 | Yes | No | Spam |
| 3 | No | Yes | Not Spam |
| 4 | No | Yes | Not Spam |
| 5 | Yes | Yes | Not Spam |
Priors: P(Spam) = 2/5 = 0.4, P(Not Spam) = 3/5 = 0.6
Classify: “free”=Yes, “meeting”=No. Because “meeting”=No never occurs with Not Spam, we apply Laplace smoothing (α = 1, two possible values per feature):
- Score(Spam) = 0.4 × (2+1)/(2+2) × (2+1)/(2+2) = 0.4 × 0.75 × 0.75 = 0.225
- Score(Not Spam) = 0.6 × (1+1)/(3+2) × (0+1)/(3+2) = 0.6 × 0.4 × 0.2 = 0.048
Prediction: Spam (0.225 > 0.048)
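The worked example can be recomputed in plain Python with Laplace smoothing (α = 1, two possible values per feature):

```python
# Smoothed likelihood: P(x_i = v | c) = (count + alpha) / (class_total + alpha * |V|)
def smoothed(count, class_total, alpha=1, n_values=2):
    return (count + alpha) / (class_total + alpha * n_values)

# Spam class: 2 emails, both with free=Yes and meeting=No
score_spam = 0.4 * smoothed(2, 2) * smoothed(2, 2)      # 0.4 * 0.75 * 0.75 = 0.225
# Not Spam class: 3 emails; 1 with free=Yes, 0 with meeting=No
score_not_spam = 0.6 * smoothed(1, 3) * smoothed(0, 3)  # 0.6 * 0.4 * 0.2 = 0.048
print("Spam" if score_spam > score_not_spam else "Not Spam")  # Spam
```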
6. Laplace Smoothing
If a feature value never appears with a class in training, its likelihood is 0 — zeroing out the entire product. Laplace Smoothing fixes this:
P(xᵢ = v | c) = (count(xᵢ=v, c) + α) / (count(c) + α × |V|)
Where α = 1 gives classic Laplace smoothing and |V| is the number of possible values of the feature. In scikit-learn, MultinomialNB and BernoulliNB apply it via the alpha parameter (default 1.0); for GaussianNB, the var_smoothing parameter plays an analogous stabilising role for the estimated variances. Always use smoothing (α > 0).
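The formula in action, on the one likelihood in the toy spam dataset that would otherwise be zero ("meeting"=No never occurs with Not Spam):

```python
# count = 0 occurrences of meeting=No among 3 Not Spam emails
count, class_total, alpha, n_values = 0, 3, 1, 2  # n_values: {Yes, No}
p_unsmoothed = count / class_total                               # 0.0, kills the product
p_smoothed = (count + alpha) / (class_total + alpha * n_values)  # 0.2
print(p_unsmoothed, p_smoothed)
```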
7. Advantages & Limitations
| Advantages | Limitations |
|---|---|
| Extremely fast to train and predict | Naive independence assumption rarely holds |
| Works well with small training datasets | Poor probability estimates (often over-confident) |
| Excellent for high-dimensional text data | Cannot learn feature interactions |
| Handles multi-class naturally | Continuous features require distributional assumption |
| Robust to irrelevant features | Sensitive to correlated features |
8. Python Code
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
# --- Gaussian NB for continuous features ---
X, y = load_iris(return_X_y=True)
gnb = GaussianNB()
scores = cross_val_score(gnb, X, y, cv=10, scoring='accuracy')
print(f"Gaussian NB Accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
# --- Multinomial NB for text classification ---
emails = [
    "free money win prize", "free offer click here",
    "meeting tomorrow office", "project deadline schedule",
    "free click win money", "team meeting agenda"
]
labels = [1, 1, 0, 0, 1, 0]  # 1 = Spam, 0 = Not Spam
text_pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB(alpha=1.0))  # alpha = Laplace smoothing
])
text_pipeline.fit(emails, labels)
new_email = ["free money click"]
prediction = text_pipeline.predict(new_email)
probability = text_pipeline.predict_proba(new_email)
print(f"Prediction: {'Spam' if prediction[0]==1 else 'Not Spam'}")
print(f"Spam probability: {probability[0][1]:.3f}")
9. Common Mistakes Students Make
- Not applying Laplace smoothing: Without smoothing, a single unseen feature value zeros out the entire prediction. Always use smoothing (alpha > 0).
- Using Multinomial NB with negative values: Multinomial NB (and Complement NB) require non-negative feature values such as counts or TF-IDF weights. For negative or general continuous features, use Gaussian NB, or transform the features to be non-negative.
- Expecting calibrated probabilities: Naive Bayes probability estimates are often extreme. Good for ranking and classification, not for calibrated confidence.
- Using NB when features are highly correlated: Correlated features are counted multiple times, distorting the posterior.
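The correlated-features problem from the last bullet is easy to demonstrate: duplicating the same column makes Naive Bayes count the same evidence several times over, pushing its probability estimates toward the extremes. A sketch on synthetic data (the generating process and seed are arbitrary):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 1))                 # one binary feature
y = (X[:, 0] ^ (rng.random(200) < 0.2)).astype(int)   # label = feature with 20% noise

p1 = BernoulliNB().fit(X, y).predict_proba([[1]])[0, 1]
# Duplicate the feature 5 times: perfectly correlated columns
X_dup = np.repeat(X, 5, axis=1)
p5 = BernoulliNB().fit(X_dup, y).predict_proba([[1] * 5])[0, 1]
print(f"{p1:.3f} -> {p5:.3f}")  # the duplicated model is far more confident
```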
10. Frequently Asked Questions
Why does Naive Bayes work despite the naive assumption?
Even though the independence assumption is violated, Naive Bayes only needs to identify the correct class — not estimate accurate probabilities. The ranking of classes by posterior probability is often correct even when individual probabilities are wrong.
Which Naive Bayes variant should I use for text classification?
For long documents with word frequency counts, use Multinomial NB. For short texts or binary word presence/absence, use Bernoulli NB. Complement NB often outperforms both on imbalanced datasets — available in Scikit-learn as ComplementNB.
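ComplementNB is a drop-in replacement in a scikit-learn pipeline; a sketch reusing the toy emails from Section 8:

```python
from sklearn.naive_bayes import ComplementNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

emails = ["free money win prize", "free offer click here",
          "meeting tomorrow office", "project deadline schedule",
          "free click win money", "team meeting agenda"]
labels = [1, 1, 0, 0, 1, 0]  # 1 = Spam, 0 = Not Spam

pipe = Pipeline([("vectorizer", CountVectorizer()),
                 ("classifier", ComplementNB(alpha=1.0))])
pipe.fit(emails, labels)
print(pipe.predict(["free money click"]))  # [1]
```

ComplementNB estimates each class's parameters from the *other* classes' counts, which reduces the bias toward majority classes on imbalanced data.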