ML Evaluation Metrics — Accuracy, Precision, Recall, F1, AUC Explained




Accuracy, Precision, Recall, F1, AUC — Complete Reference for Engineering Students

Last Updated: March 2026

📌 Key Takeaways

  • Classification metrics: Accuracy, Precision, Recall, F1 Score, AUC-ROC — each measures a different aspect of performance.
  • Regression metrics: MAE, MSE, RMSE measure how far predictions are from true values; R² measures how much variance the model explains.
  • Accuracy is misleading on imbalanced datasets. Use Precision, Recall, and F1 instead.
  • Confusion matrix is the foundation — all classification metrics are derived from it.
  • AUC-ROC is the go-to metric for comparing classifiers regardless of threshold.
  • Rule of thumb: Use F1 for imbalanced data, AUC-ROC for ranking/probability models, R² for regression explanation.

1. The Confusion Matrix

The confusion matrix is the foundation of all classification evaluation. It shows the count of correct and incorrect predictions broken down by class. For a binary classifier:

                      Predicted Positive     Predicted Negative
Actual Positive       True Positive (TP)     False Negative (FN)
Actual Negative       False Positive (FP)    True Negative (TN)

Term                  Meaning                                                Also Called
True Positive (TP)    Model predicted positive AND it actually is positive   Hit
True Negative (TN)    Model predicted negative AND it actually is negative   Correct rejection
False Positive (FP)   Model predicted positive BUT it is actually negative   Type I Error, False Alarm
False Negative (FN)   Model predicted negative BUT it is actually positive   Type II Error, Miss

Every classification metric is derived from these four values. Understanding the confusion matrix is essential before computing any other metric.
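The four cells can be tallied directly by comparing predictions to labels. A minimal sketch, using illustrative label lists rather than real data:

```python
# Derive TP, TN, FP, FN by comparing each prediction to its true label.
y_true = [1, 1, 0, 1, 0, 0, 1, 0]   # illustrative ground-truth labels
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]   # illustrative model predictions

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

print(tp, tn, fp, fn)  # 3 3 1 1
```

In practice `sklearn.metrics.confusion_matrix` (shown in the Python section below) does the same tally for any number of classes.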

2. Accuracy

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Accuracy measures the overall fraction of correct predictions. It is the most intuitive metric but the most easily misused.

When accuracy is misleading: On an imbalanced dataset where 99% of examples are class 0, a model that always predicts class 0 achieves 99% accuracy — but it has zero ability to detect class 1. Always check the class distribution before trusting accuracy.

When accuracy is appropriate: When classes are roughly balanced and the cost of false positives and false negatives is similar.
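The 99%-accuracy failure mode is easy to reproduce. A sketch with a synthetic 99:1 class split and a degenerate model that always predicts the majority class:

```python
# 99 negatives, 1 positive — a heavily imbalanced toy dataset.
y_true = [0] * 99 + [1]
y_pred = [0] * 100   # degenerate model: always predicts class 0

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.99, yet the model detects zero positives
```

Recall on class 1 here is 0, which is why the imbalanced-data metrics below exist.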

3. Precision

Precision = TP / (TP + FP)

Precision answers: “Of all the examples the model predicted as positive, what fraction were actually positive?”

High precision means few false positives — the model is careful about flagging positives. Low precision means many false alarms.

When precision matters most: When the cost of a False Positive is high. Example: Spam detection — you do not want to mark legitimate emails as spam. Or search engines — you want every result shown to be relevant.

4. Recall (Sensitivity / True Positive Rate)

Recall = TP / (TP + FN)

Recall answers: “Of all the examples that were actually positive, what fraction did the model correctly identify?”

High recall means few false negatives — the model catches most of the actual positives. Low recall means many real positives are missed.

When recall matters most: When the cost of a False Negative is high. Example: Cancer screening — missing a positive diagnosis is far more dangerous than a false alarm. Or fraud detection — missing a fraudulent transaction is worse than investigating a legitimate one.

The Precision-Recall Tradeoff

Raising the decision threshold generally increases precision but lowers recall; lowering the threshold does the opposite. The right balance depends on the application’s cost structure. Plotting the Precision-Recall curve across all thresholds helps choose the operating point.
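The tradeoff can be inspected directly with scikit-learn’s threshold sweep. A sketch with illustrative scores (in practice these would come from `model.predict_proba(X)[:, 1]`):

```python
from sklearn.metrics import precision_recall_curve

y_true  = [0, 0, 1, 0, 1, 1, 0, 1]                      # illustrative labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9, 0.2, 0.7]   # illustrative scores

# One (precision, recall) pair per candidate threshold.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
for p, r, t in zip(precision, recall, thresholds):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```

As the threshold rises, recall falls monotonically while precision tends upward, tracing the curve you would plot to pick an operating point.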

5. F1 Score

F1 = 2 × (Precision × Recall) / (Precision + Recall)

F1 is the harmonic mean of precision and recall. It gives a single number that balances both metrics. The harmonic mean penalises extreme values — a model with precision = 1.0 and recall = 0.0 gets F1 = 0, not 0.5.

F1 ranges from 0 (worst) to 1 (perfect). It is the standard metric for imbalanced classification problems.

F-beta Score

When precision and recall deserve unequal weight (e.g., recall matters more in medical diagnosis), use the F-beta score:

Fβ = (1 + β²) × (Precision × Recall) / (β² × Precision + Recall)

β = 1 → F1 (equal weight). β = 2 → F2 (recall twice as important). β = 0.5 → F0.5 (precision twice as important).
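scikit-learn exposes this as `fbeta_score`. A sketch on illustrative labels, checking that β = 1 reproduces F1 and that β shifts the weight between recall and precision:

```python
from sklearn.metrics import f1_score, fbeta_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]   # illustrative labels
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]   # precision = 2/3, recall = 1/2

f1     = f1_score(y_true, y_pred)               # equal weight
f2     = fbeta_score(y_true, y_pred, beta=2)    # recall-weighted
f_half = fbeta_score(y_true, y_pred, beta=0.5)  # precision-weighted

print(f1, f2, f_half)
```

Because recall (0.5) is worse than precision (0.667) here, the recall-weighted F2 comes out lowest and the precision-weighted F0.5 highest.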

6. AUC-ROC

The ROC curve (Receiver Operating Characteristic) plots True Positive Rate (Recall) on the y-axis against False Positive Rate on the x-axis, at every possible decision threshold from 0 to 1.

True Positive Rate (TPR) = TP / (TP + FN) [same as Recall]

False Positive Rate (FPR) = FP / (FP + TN)

The AUC (Area Under the ROC Curve) summarises the entire ROC curve as a single number:

AUC Value     Interpretation
1.0           Perfect classifier — separates all positives and negatives perfectly
0.9 – 0.99    Excellent
0.8 – 0.89    Good
0.7 – 0.79    Fair
0.6 – 0.69    Poor
0.5           No discriminating power — equivalent to random guessing
< 0.5         Worse than random — predictions are systematically inverted

AUC is threshold-independent — it evaluates the model’s ranking ability, not just performance at a specific threshold. It works well for imbalanced datasets and is the standard metric for ranking and probability estimation tasks.
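The ranking interpretation can be verified numerically: AUC equals the probability that a randomly chosen positive is scored above a randomly chosen negative (assuming no tied scores). A sketch with illustrative scores:

```python
from sklearn.metrics import roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1]                    # illustrative labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]      # illustrative scores

auc = roc_auc_score(y_true, y_score)

# Pairwise ranking check: fraction of (positive, negative) pairs where
# the positive example received the higher score.
pos = [s for t, s in zip(y_true, y_score) if t == 1]
neg = [s for t, s in zip(y_true, y_score) if t == 0]
pairwise = sum(p > n for p in pos for n in neg) / (len(pos) * len(neg))

print(auc, pairwise)  # both 8/9 ≈ 0.889
```

The two numbers agree, which is why AUC is described as measuring ranking ability rather than performance at one threshold.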

7. Regression Metrics

For regression problems (predicting continuous values), different metrics are used:

Metric                      Formula                Units            Best value
MAE (Mean Absolute Error)   (1/m) × Σ|yᵢ − ŷᵢ|     Same as target   0
MSE (Mean Squared Error)    (1/m) × Σ(yᵢ − ŷᵢ)²    Target squared   0
RMSE (Root MSE)             √MSE                   Same as target   0
R² (R-Squared)              1 − SS_res/SS_tot      Dimensionless    1

MAE vs MSE vs RMSE

  • MAE: Robust to outliers — all errors contribute equally. Easy to interpret (average error in original units).
  • MSE: Penalises large errors more heavily (squaring). Sensitive to outliers. Differentiable — preferred for optimisation.
  • RMSE: Same scale as the target variable. Most commonly reported regression metric. More sensitive to large errors than MAE.
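The outlier sensitivity is easy to see numerically. A sketch with illustrative residuals, four small errors and one large one:

```python
import numpy as np

errors = np.array([1.0, 1.0, 1.0, 1.0, 10.0])  # illustrative residuals

mae  = np.mean(np.abs(errors))                 # all errors weighted equally
rmse = np.sqrt(np.mean(errors ** 2))           # squaring amplifies the 10.0

print(mae, rmse)  # 2.8 vs ≈ 4.56
```

One outlier leaves MAE at 2.8 but pushes RMSE above 4.5, which is why MAE is preferred when outliers are expected.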

R² (Coefficient of Determination)

R² = 1 − (SS_residual / SS_total)

SS_residual = Σ(yᵢ − ŷᵢ)²

SS_total = Σ(yᵢ − ȳ)²

R² measures what proportion of variance in y is explained by the model. R² = 1 → perfect fit. R² = 0 → model explains nothing (equivalent to predicting the mean). R² can be negative for very poor models. R² = 0.85 means the model explains 85% of the variance in the target variable.
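The definition can be computed by hand from the two sums of squares. A sketch reusing the same toy regression values as the Python section below:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.0])   # illustrative targets
y_pred = np.array([2.8, 5.2, 2.0, 6.5, 4.3])   # illustrative predictions

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2) # total sum of squares
r2 = 1 - ss_res / ss_tot

print(r2)  # ≈ 0.948
```

Note that the baseline in SS_total is the mean ȳ, so R² = 0 corresponds exactly to a model that always predicts the mean.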

8. Which Metric to Use — Decision Guide

Situation                                          Recommended Metric
Balanced binary classification                     Accuracy, F1
Imbalanced binary classification                   F1, AUC-ROC, Precision-Recall AUC
False positives are costly (spam, fraud alerts)    Precision
False negatives are costly (disease detection)     Recall, F2 Score
Comparing classifiers regardless of threshold      AUC-ROC
Multi-class classification                         Macro/Weighted F1, Confusion Matrix
Regression — interpretable error                   MAE, RMSE
Regression — explained variance                    R²
Regression with outliers                           MAE (more robust than RMSE)

9. Worked Example

Scenario: A cancer screening model tested on 100 patients. Confusion matrix results:

                                 Predicted Cancer    Predicted No Cancer
Actual Cancer (40 patients)      TP = 35             FN = 5
Actual No Cancer (60 patients)   FP = 10             TN = 50

Calculations:

  • Accuracy = (35 + 50) / 100 = 85%
  • Precision = 35 / (35 + 10) = 35/45 = 77.8%
  • Recall = 35 / (35 + 5) = 35/40 = 87.5%
  • F1 = 2 × (0.778 × 0.875) / (0.778 + 0.875) = 82.4%

Analysis: The model has 85% accuracy. But the recall of 87.5% tells us the more important story — it correctly identifies 87.5% of actual cancer cases. The 5 false negatives (missed cancers) are the critical concern in this domain. Depending on acceptable risk, the decision threshold should be lowered to increase recall, even at the cost of more false positives (lower precision).
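The calculations above can be checked in a few lines directly from the four confusion-matrix counts:

```python
# Counts from the worked example's confusion matrix.
tp, fn, fp, tn = 35, 5, 10, 50

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)  # 0.85, ≈0.778, 0.875, ≈0.824
```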

10. Python Code


from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, confusion_matrix, classification_report,
    mean_absolute_error, mean_squared_error, r2_score
)
import numpy as np

# --- Classification Metrics ---
y_true_cls = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred_cls = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
y_prob     = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.6, 0.3, 0.85, 0.15]

print("=== Classification Metrics ===")
print(f"Accuracy:  {accuracy_score(y_true_cls, y_pred_cls):.3f}")
print(f"Precision: {precision_score(y_true_cls, y_pred_cls):.3f}")
print(f"Recall:    {recall_score(y_true_cls, y_pred_cls):.3f}")
print(f"F1 Score:  {f1_score(y_true_cls, y_pred_cls):.3f}")
print(f"AUC-ROC:   {roc_auc_score(y_true_cls, y_prob):.3f}")
print("\nConfusion Matrix:")
print(confusion_matrix(y_true_cls, y_pred_cls))
print("\nClassification Report:")
print(classification_report(y_true_cls, y_pred_cls))

# --- Regression Metrics ---
y_true_reg = [3.0, 5.0, 2.5, 7.0, 4.0]
y_pred_reg = [2.8, 5.2, 2.0, 6.5, 4.3]

print("\n=== Regression Metrics ===")
mae  = mean_absolute_error(y_true_reg, y_pred_reg)
mse  = mean_squared_error(y_true_reg, y_pred_reg)
rmse = np.sqrt(mse)
r2   = r2_score(y_true_reg, y_pred_reg)

print(f"MAE:  {mae:.4f}")
print(f"MSE:  {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"R²:   {r2:.4f}")
    

11. Frequently Asked Questions

What is the difference between macro and weighted F1?

Macro F1 calculates F1 for each class independently and takes the unweighted average — treating all classes equally regardless of size. Weighted F1 takes the average weighted by the number of examples in each class — giving more weight to larger classes. For imbalanced datasets, use weighted F1 to reflect real-world performance. Use macro F1 when all classes are equally important regardless of frequency.
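The gap between the two averages shows up whenever classes are imbalanced. A sketch on illustrative three-class labels where class 0 dominates:

```python
from sklearn.metrics import f1_score

# Illustrative imbalanced labels: 6 examples of class 0, 2 each of 1 and 2.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0, 2, 1]

macro    = f1_score(y_true, y_pred, average="macro")     # unweighted mean
weighted = f1_score(y_true, y_pred, average="weighted")  # support-weighted

print(macro, weighted)
```

Here the dominant class is predicted best, so the weighted average exceeds the macro average; if the minority classes were scored best, the relation would flip.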

What does an AUC of 0.5 mean?

An AUC of 0.5 means the model has no discriminating power — it performs no better than random guessing. A random classifier, which assigns class labels randomly, produces an ROC curve along the diagonal (from (0,0) to (1,1)), giving AUC = 0.5. Your model should always significantly exceed 0.5 to be useful.

Can R² be negative?

Yes. A negative R² means the model performs worse than simply predicting the mean of y for every input. This happens when the model’s predictions are so far off that the residual sum of squares exceeds the total sum of squares. It indicates a fundamentally wrong model or serious issues in training.
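A minimal sketch of the failure case, using deliberately anti-correlated predictions so that SS_residual exceeds SS_total:

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([4.0, 3.0, 2.0, 1.0])   # predictions move opposite to y

ss_res = np.sum((y_true - y_pred) ** 2)           # 20.0
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # 5.0
r2 = 1 - ss_res / ss_tot

print(r2)  # -3.0 — far worse than predicting the mean
```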
