Decision Trees
How They Work & When to Use Them — For Engineering Students
Last Updated: March 2026
📌 Key Takeaways
- Definition: A decision tree makes predictions through a sequence of if-then-else questions learned from data.
- Structure: Root node → Internal nodes (decision points) → Leaf nodes (final predictions).
- Splitting criteria: Gini Impurity (CART algorithm) or Information Gain / Entropy (ID3, C4.5).
- Strength: Highly interpretable — you can read the decision rules. Handles both numerical and categorical features.
- Weakness: Prone to overfitting. Small changes in data can drastically change the tree structure.
- Fix for overfitting: Limit max depth, set minimum samples per leaf, or use Random Forest (ensemble of trees).
1. What is a Decision Tree?
A decision tree is a supervised ML algorithm that makes predictions by learning a hierarchy of if-then-else decision rules from training data. At each step, the algorithm asks a question about a feature (e.g., “Is the patient’s age > 50?”), splits the data based on the answer, and repeats until it reaches a prediction.
Decision trees can be used for both classification (predicting a category) and regression (predicting a continuous value). When used for regression, they are called regression trees.
Analogy — The Medical Triage System
Imagine a doctor triaging patients in an emergency room. They ask: “Is the patient conscious?” If no → immediate care. If yes → “Is blood pressure above 180?” If yes → urgent. If no → “Is there chest pain?” and so on. This cascade of questions based on observable features is exactly how a decision tree works — learned automatically from historical patient data.
2. Tree Structure — Nodes, Branches, Leaves
| Component | Description | Example |
|---|---|---|
| Root Node | The topmost node — the first and most important split. Splits the entire dataset. | “Age > 30?” |
| Internal Node | A decision point within the tree. Each internal node splits data based on one feature. | “Income > 50,000?” |
| Branch | The outcome of a decision — Yes or No, True or False, or a value range. | Yes / No |
| Leaf Node | Terminal node — contains the final prediction (class label or value). No further splitting. | “Approve Loan” / “Reject Loan” |
The depth of a tree is the length of the longest path from the root to a leaf, counted in splits (edges). A tree of depth 1 — a single split, i.e. a root with two leaves — is called a decision stump. Deeper trees can model more complex patterns but are more prone to overfitting.
3. How Splitting Works — Gini Impurity & Information Gain
At each node, the algorithm tries every possible feature and every possible split value, and selects the one that creates the most “pure” child nodes — where one class dominates as much as possible. Two main criteria are used:
3.1 Gini Impurity (used in CART — the most common algorithm)
Gini = 1 − Σ pᵢ²
Where pᵢ is the proportion of class i in the node. Gini = 0 means perfectly pure (all one class). Gini = 0.5 is maximum impurity for binary classification (50/50 split).
The algorithm selects the split that minimises the weighted Gini impurity of the resulting child nodes:
Weighted Gini = (n_left/n) × Gini_left + (n_right/n) × Gini_right
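The two formulas above translate directly into a few lines of code. This is a minimal sketch using only the standard library — the function names `gini` and `weighted_gini` are illustrative, not from any library:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(left, right):
    """Weighted Gini impurity of a binary split's two child nodes."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

print(gini(["Yes", "Yes", "No", "No"]))  # 0.5 — maximum impurity (50/50)
print(gini(["Yes", "Yes", "Yes"]))       # 0.0 — perfectly pure
```

A splitting algorithm would call `weighted_gini` for every candidate split and keep the one with the lowest value.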
3.2 Information Gain / Entropy (used in ID3, C4.5)
Entropy = −Σ pᵢ log₂(pᵢ)
Information Gain = Entropy(parent) − Weighted Entropy(children)
The algorithm selects the split that maximises information gain — the reduction in uncertainty after the split. Entropy = 0 means pure; Entropy = 1 means maximum disorder (binary case).
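The entropy and information-gain formulas can be sketched the same way (again standard library only; the function names are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy: -sum(p_i * log2(p_i)) over class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Reduction in entropy achieved by splitting parent into left/right."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = ["Yes"] * 3 + ["No"] * 3
print(entropy(parent))                                    # 1.0 — 50/50 node, maximum disorder
print(information_gain(parent, ["Yes"] * 3, ["No"] * 3))  # 1.0 — a perfect split removes all uncertainty
```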
Gini vs Entropy — Which to Use?
In practice, both criteria produce very similar trees. Gini is slightly faster to compute (no logarithm) and is the default in Scikit-learn's DecisionTreeClassifier; entropy / Information Gain is more common in textbooks and academic literature.
4. Worked Example — Predicting Loan Approval
Dataset: 6 loan applicants with two features: Credit Score (Good/Bad) and Employment (Stable/Unstable), and a label: Approved (Yes/No).
| Applicant | Credit Score | Employment | Approved? |
|---|---|---|---|
| 1 | Good | Stable | Yes |
| 2 | Good | Unstable | Yes |
| 3 | Bad | Stable | No |
| 4 | Bad | Unstable | No |
| 5 | Good | Stable | Yes |
| 6 | Bad | Stable | No |
Root split — try “Credit Score”:
- Credit = Good: 3 Yes, 0 No → Gini = 1 − (1² + 0²) = 0.0 (perfectly pure!)
- Credit = Bad: 0 Yes, 3 No → Gini = 1 − (0² + 1²) = 0.0 (perfectly pure!)
- Weighted Gini = (3/6)(0) + (3/6)(0) = 0.0
Credit Score creates a perfect split — Gini = 0. The resulting tree has depth 1:
- Credit Score = Good → Approve
- Credit Score = Bad → Reject
In this simple example, one feature is sufficient. Real datasets require deeper trees with multiple splits.
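The worked example above can be verified by brute force: compute the weighted Gini for a split on each of the two features and compare. This sketch reuses a small `gini` helper (an illustrative name, not a library function):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

# The six applicants from the table: (credit score, employment, approved?)
data = [
    ("Good", "Stable",   "Yes"), ("Good", "Unstable", "Yes"),
    ("Bad",  "Stable",   "No"),  ("Bad",  "Unstable", "No"),
    ("Good", "Stable",   "Yes"), ("Bad",  "Stable",   "No"),
]

for idx, name in [(0, "Credit Score"), (1, "Employment")]:
    for value in sorted({row[idx] for row in data}):
        left = [row[2] for row in data if row[idx] == value]
        right = [row[2] for row in data if row[idx] != value]
        n = len(data)
        wg = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
        print(f"Split on {name} == {value}: weighted Gini = {wg:.3f}")
```

Splitting on Credit Score gives a weighted Gini of 0.000, while splitting on Employment leaves it at 0.500 — confirming that Credit Score is the correct root split.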
5. Overfitting & Pruning
Decision trees are highly prone to overfitting. Given enough depth, a tree will create a separate leaf for every training example — achieving 100% training accuracy but failing completely on new data.
Prevention Techniques (Pre-pruning):
- max_depth: Limit the maximum depth of the tree (e.g., max_depth=5).
- min_samples_split: Minimum number of samples required to split a node.
- min_samples_leaf: Minimum samples required at a leaf node.
- max_features: Limit the number of features considered at each split.
Post-Pruning:
Grow the full tree, then remove branches that provide little predictive power. Cost Complexity Pruning (also called Weakest Link Pruning) is implemented in Scikit-learn as the ccp_alpha parameter.
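Cost complexity pruning can be explored directly in Scikit-learn: `cost_complexity_pruning_path` returns the sequence of effective alphas at which subtrees get pruned away, and refitting with each `ccp_alpha` shows the tree shrinking. A minimal sketch on the iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Effective alphas at which subtrees are pruned; larger alpha → smaller tree
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)

for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)
    print(f"alpha={alpha:.4f}  leaves={tree.get_n_leaves()}  "
          f"test acc={tree.score(X_test, y_test):.2f}")
```

The largest alpha prunes the tree all the way down to a single root node; in practice you would pick the alpha with the best validation accuracy.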
Best Solution — Use Random Forest:
Instead of relying on one tree, Random Forest trains hundreds of trees on random subsets of the data and features, then averages their predictions. This dramatically reduces variance while keeping low bias.
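A quick way to see the variance reduction is to cross-validate a single tree against a forest on the same data. A sketch on the iris dataset (scores will vary by dataset):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation accuracy for one tree vs. 200 averaged trees
single = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=5)
forest = cross_val_score(RandomForestClassifier(n_estimators=200, random_state=42), X, y, cv=5)

print(f"Single tree:   {single.mean():.3f} ± {single.std():.3f}")
print(f"Random forest: {forest.mean():.3f} ± {forest.std():.3f}")
```

On larger, noisier datasets the gap in both mean accuracy and fold-to-fold spread is typically much more pronounced than on a toy dataset like iris.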
6. Advantages & Limitations
| Advantages | Limitations |
|---|---|
| Highly interpretable — decision rules can be visualised and explained | Prone to overfitting without pruning or depth limits |
| Handles both numerical and categorical features | Unstable — small changes in data can produce very different trees |
| No feature scaling required | Biased towards features with more levels in categorical data |
| Can model non-linear relationships | Cannot extrapolate beyond the range of training data (for regression) |
| Fast to train and predict | Single trees rarely achieve top accuracy vs ensemble methods |
| Handles missing values well (with extensions) | Greedy algorithm — does not guarantee globally optimal tree |
7. Python Code
```python
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train decision tree (limit depth to prevent overfitting)
model = DecisionTreeClassifier(max_depth=3, criterion='gini', random_state=42)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")

# Print the tree rules in text format
tree_rules = export_text(model, feature_names=load_iris().feature_names)
print(tree_rules)
```
8. Common Mistakes Students Make
- Not setting max_depth: Always set a maximum depth when training decision trees. Without it, the tree will overfit the training data completely.
- Using decision trees for high-dimensional data: With many features, decision trees become very complex and hard to interpret. Use Random Forest or Gradient Boosting instead.
- Expecting stability: Decision trees are highly sensitive to training data. Adding or removing a few examples can produce a completely different tree. This is why ensembles (Random Forest) are preferred in practice.
- Forgetting that CART is greedy: The algorithm selects the best split at each node locally, without considering the globally optimal tree. The resulting tree is good but not necessarily the best possible tree.
9. Frequently Asked Questions
What is the difference between CART, ID3, and C4.5?
These are three algorithms for building decision trees. CART (Classification and Regression Trees) uses Gini impurity and can handle regression; it is the algorithm used by Scikit-learn. ID3 uses Information Gain but only handles categorical features. C4.5 is an improved version of ID3 that handles continuous features and missing values. CART is the most widely used today.
Can decision trees handle regression problems?
Yes. When used for regression, the leaf nodes output the mean value of training examples in that leaf rather than a class label. The splitting criterion changes from Gini/Entropy to Mean Squared Error (MSE) — the split that minimises the weighted MSE of child nodes is selected.
How do I choose the right max_depth for a decision tree?
Use cross-validation. Train trees with different max_depth values (e.g., 2 to 20) and evaluate each on a validation set. Plot training vs validation accuracy against depth — the optimal depth is just before the validation accuracy starts decreasing while training accuracy continues to increase (the overfitting point).
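The depth search described above is a short loop over `cross_val_score`. A sketch on the iris dataset (the depth range 2–20 matches the example in the answer):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Mean 5-fold cross-validation accuracy for each candidate depth
scores = {}
for depth in range(2, 21):
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    scores[depth] = cross_val_score(model, X, y, cv=5).mean()

best_depth = max(scores, key=scores.get)
print(f"Best max_depth: {best_depth} (CV accuracy {scores[best_depth]:.3f})")
```

For a more thorough search over several hyperparameters at once (depth, min_samples_leaf, ccp_alpha), Scikit-learn's GridSearchCV automates the same loop.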