Decision Trees
How They Work & When to Use Them — For Engineering Students
Last Updated: March 2026
📌 Key Takeaways
- Definition: A decision tree makes predictions through a sequence of if-then-else questions learned from data.
- Structure: Root node → Internal nodes (decision points) → Leaf nodes (final predictions).
- Splitting criteria: Gini Impurity (CART algorithm) or Information Gain / Entropy (ID3, C4.5).
- Strength: Highly interpretable — you can read the decision rules. Handles both numerical and categorical features.
- Weakness: Prone to overfitting. Small changes in data can drastically change the tree structure.
- Fix for overfitting: Limit max depth, set minimum samples per leaf, or use Random Forest (ensemble of trees).
1. What is a Decision Tree?
A decision tree is a supervised ML algorithm that makes predictions by learning a hierarchy of if-then-else decision rules from training data. At each step, the algorithm asks a question about a feature (e.g., “Is the patient’s age > 50?”), splits the data based on the answer, and repeats until it reaches a prediction.
Decision trees can be used for both classification (predicting a category) and regression (predicting a continuous value). When used for regression, they are called regression trees.
Analogy — The Medical Triage System
Imagine a doctor triaging patients in an emergency room. They ask: “Is the patient conscious?” If no → immediate care. If yes → “Is blood pressure above 180?” If yes → urgent. If no → “Is there chest pain?” and so on. This cascade of questions based on observable features is exactly how a decision tree works — learned automatically from historical patient data.
2. Tree Structure — Nodes, Branches, Leaves
| Component | Description | Example |
|---|---|---|
| Root Node | The topmost node — the first and most important split. Splits the entire dataset. | “Age > 30?” |
| Internal Node | A decision point within the tree. Each internal node splits data based on one feature. | “Income > 50,000?” |
| Branch | The outcome of a decision — Yes or No, True or False, or a value range. | Yes / No |
| Leaf Node | Terminal node — contains the final prediction (class label or value). No further splitting. | “Approve Loan” / “Reject Loan” |
The depth of a tree is the length of the longest path from the root to a leaf, counted in splits (edges). A tree of depth 1 — a single split, i.e. a root with two leaves — is called a decision stump. Deeper trees can model more complex patterns but are more prone to overfitting.
3. How Splitting Works — Gini Impurity & Information Gain
At each node, the algorithm tries every possible feature and every possible split value, and selects the one that creates the most “pure” child nodes — where one class dominates as much as possible. Two main criteria are used:
3.1 Gini Impurity (used in CART — the most common algorithm)
Gini = 1 − Σ pᵢ²
Where pᵢ is the proportion of class i in the node. Gini = 0 means perfectly pure (all one class). Gini = 0.5 is maximum impurity for binary classification (50/50 split).
The algorithm selects the split that minimises the weighted Gini impurity of the resulting child nodes:
Weighted Gini = (n_left/n) × Gini_left + (n_right/n) × Gini_right
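The two formulas above translate directly into a few lines of code. This is a minimal sketch using only the standard library — the function names `gini` and `weighted_gini` are illustrative, not from any library:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(left, right):
    """Weighted Gini impurity of a binary split's two child nodes."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

print(gini(["Yes", "Yes", "No", "No"]))  # 0.5 — maximum impurity (50/50)
print(gini(["Yes", "Yes", "Yes"]))       # 0.0 — perfectly pure
```

A splitting algorithm would call `weighted_gini` for every candidate split and keep the one with the lowest value.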
3.2 Information Gain / Entropy (used in ID3, C4.5)
Entropy = −Σ pᵢ log₂(pᵢ)
Information Gain = Entropy(parent) − Weighted Entropy(children)
The algorithm selects the split that maximises information gain — the reduction in uncertainty after the split. Entropy = 0 means pure; Entropy = 1 means maximum disorder (binary case).
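The entropy and information-gain formulas can be sketched the same way (again standard library only; the function names are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy: -sum(p_i * log2(p_i)) over class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Reduction in entropy achieved by splitting parent into left/right."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = ["Yes"] * 3 + ["No"] * 3
print(entropy(parent))                                    # 1.0 — 50/50 node, maximum disorder
print(information_gain(parent, ["Yes"] * 3, ["No"] * 3))  # 1.0 — a perfect split removes all uncertainty
```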
Gini vs Entropy — Which to Use?
In practice, both criteria produce very similar trees. Gini is slightly faster to compute (no logarithm) and is the default in Scikit-learn's DecisionTreeClassifier; entropy / Information Gain is more common in textbooks and academic literature.
4. Worked Example — Predicting Loan Approval
Dataset: 6 loan applicants with two features: Credit Score (Good/Bad) and Employment (Stable/Unstable), and a label: Approved (Yes/No).
| Applicant | Credit Score | Employment | Approved? |
|---|---|---|---|
| 1 | Good | Stable | Yes |
| 2 | Good | Unstable | Yes |
| 3 | Bad | Stable | No |
| 4 | Bad | Unstable | No |
| 5 | Good | Stable | Yes |
| 6 | Bad | Stable | No |
Root split — try “Credit Score”:
- Credit = Good: 3 Yes, 0 No → Gini = 1 − (1² + 0²) = 0.0 (perfectly pure!)
- Credit = Bad: 0 Yes, 3 No → Gini = 1 − (0² + 1²) = 0.0 (perfectly pure!)
- Weighted Gini = (3/6)(0) + (3/6)(0) = 0.0
Credit Score creates a perfect split — Gini = 0. The resulting tree has depth 1:
- Credit Score = Good → Approve
- Credit Score = Bad → Reject
In this simple example, one feature is sufficient. Real datasets require deeper trees with multiple splits.
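The worked example above can be verified by brute force: compute the weighted Gini for a split on each of the two features and compare. This sketch reuses a small `gini` helper (an illustrative name, not a library function):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

# The six applicants from the table: (credit score, employment, approved?)
data = [
    ("Good", "Stable",   "Yes"), ("Good", "Unstable", "Yes"),
    ("Bad",  "Stable",   "No"),  ("Bad",  "Unstable", "No"),
    ("Good", "Stable",   "Yes"), ("Bad",  "Stable",   "No"),
]

for idx, name in [(0, "Credit Score"), (1, "Employment")]:
    for value in sorted({row[idx] for row in data}):
        left = [row[2] for row in data if row[idx] == value]
        right = [row[2] for row in data if row[idx] != value]
        n = len(data)
        wg = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
        print(f"Split on {name} == {value}: weighted Gini = {wg:.3f}")
```

Splitting on Credit Score gives a weighted Gini of 0.000, while splitting on Employment leaves it at 0.500 — confirming that Credit Score is the correct root split.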
5. Overfitting & Pruning
Decision trees are highly prone to overfitting. Given enough depth, a tree will create a separate leaf for every training example — achieving 100% training accuracy but failing completely on new data.
Prevention Techniques (Pre-pruning):
- max_depth: Limit the maximum depth of the tree (e.g., max_depth=5).
- min_samples_split: Minimum number of samples required to split a node.
- min_samples_leaf: Minimum samples required at a leaf node.
- max_features: Limit the number of features considered at each split.
Post-Pruning:
Grow the full tree, then remove branches that provide little predictive power. Cost Complexity Pruning (also called Weakest Link Pruning) is implemented in Scikit-learn as the ccp_alpha parameter.
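Cost complexity pruning can be explored directly in Scikit-learn: `cost_complexity_pruning_path` returns the sequence of effective alphas at which subtrees get pruned away, and refitting with each `ccp_alpha` shows the tree shrinking. A minimal sketch on the iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Effective alphas at which subtrees are pruned; larger alpha → smaller tree
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)

for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)
    print(f"alpha={alpha:.4f}  leaves={tree.get_n_leaves()}  "
          f"test acc={tree.score(X_test, y_test):.2f}")
```

The largest alpha prunes the tree all the way down to a single root node; in practice you would pick the alpha with the best validation accuracy.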
Best Solution — Use Random Forest:
Instead of relying on one tree, Random Forest trains hundreds of trees on random subsets of the data and features, then averages their predictions. This dramatically reduces variance while keeping low bias.
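A quick way to see the variance reduction is to cross-validate a single tree against a forest on the same data. A sketch on the iris dataset (scores will vary by dataset):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation accuracy for one tree vs. 200 averaged trees
single = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=5)
forest = cross_val_score(RandomForestClassifier(n_estimators=200, random_state=42), X, y, cv=5)

print(f"Single tree:   {single.mean():.3f} ± {single.std():.3f}")
print(f"Random forest: {forest.mean():.3f} ± {forest.std():.3f}")
```

On larger, noisier datasets the gap in both mean accuracy and fold-to-fold spread is typically much more pronounced than on a toy dataset like iris.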
6. Advantages & Limitations
| Advantages | Limitations |
|---|---|
| Highly interpretable — decision rules can be visualised and explained | Prone to overfitting without pruning or depth limits |
| Handles both numerical and categorical features | Unstable — small changes in data can produce very different trees |
| No feature scaling required | Biased towards features with more levels in categorical data |
| Can model non-linear relationships | Cannot extrapolate beyond the range of training data (for regression) |
| Fast to train and predict | Single trees rarely achieve top accuracy vs ensemble methods |
| Handles missing values well (with extensions) | Greedy algorithm — does not guarantee globally optimal tree |
7. Python Code
```python
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train decision tree (limit depth to prevent overfitting)
model = DecisionTreeClassifier(max_depth=3, criterion='gini', random_state=42)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")

# Print the tree rules in text format
tree_rules = export_text(model, feature_names=load_iris().feature_names)
print(tree_rules)
```
8. Common Mistakes Students Make
- Not setting max_depth: Always set a maximum depth when training decision trees. Without it, the tree will overfit the training data completely.
- Using decision trees for high-dimensional data: With many features, decision trees become very complex and hard to interpret. Use Random Forest or Gradient Boosting instead.
- Expecting stability: Decision trees are highly sensitive to training data. Adding or removing a few examples can produce a completely different tree. This is why ensembles (Random Forest) are preferred in practice.
- Forgetting that CART is greedy: The algorithm selects the best split at each node locally, without considering the globally optimal tree. The resulting tree is good but not necessarily the best possible tree.
9. Frequently Asked Questions
What is the difference between CART, ID3, and C4.5?
These are three algorithms for building decision trees. CART (Classification and Regression Trees) uses Gini impurity and can handle regression; it is the algorithm used by Scikit-learn. ID3 uses Information Gain but only handles categorical features. C4.5 is an improved version of ID3 that handles continuous features and missing values. CART is the most widely used today.
Can decision trees handle regression problems?
Yes. When used for regression, the leaf nodes output the mean value of training examples in that leaf rather than a class label. The splitting criterion changes from Gini/Entropy to Mean Squared Error (MSE) — the split that minimises the weighted MSE of child nodes is selected.
How do I choose the right max_depth for a decision tree?
Use cross-validation. Train trees with different max_depth values (e.g., 2 to 20) and evaluate each on a validation set. Plot training vs validation accuracy against depth — the optimal depth is just before the validation accuracy starts decreasing while training accuracy continues to increase (the overfitting point).
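The depth search described above is a short loop over `cross_val_score`. A sketch on the iris dataset (the depth range 2–20 matches the example in the answer):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Mean 5-fold cross-validation accuracy for each candidate depth
scores = {}
for depth in range(2, 21):
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    scores[depth] = cross_val_score(model, X, y, cv=5).mean()

best_depth = max(scores, key=scores.get)
print(f"Best max_depth: {best_depth} (CV accuracy {scores[best_depth]:.3f})")
```

For a more thorough search over several hyperparameters at once (depth, min_samples_leaf, ccp_alpha), Scikit-learn's GridSearchCV automates the same loop.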