Cross Validation — K-Fold and Leave-One-Out Explained




K-Fold & Leave-One-Out — The Right Way to Evaluate ML Models

Last Updated: March 2026

📌 Key Takeaways

  • Definition: Cross validation evaluates model performance by training and testing on multiple different splits of the data — giving a more reliable generalisation estimate than a single split.
  • Standard method: K-Fold CV (K=5 or K=10). Split data into K folds; train on K-1, validate on 1; repeat K times; average results.
  • Stratified K-Fold: Preserves class proportions in each fold — use for classification, especially on imbalanced datasets.
  • Leave-One-Out (LOOCV): K = n. Nearly unbiased but very slow — only for tiny datasets.
  • Critical rule: Never use the test set during cross-validation. CV is for model selection; test set is for final evaluation only.

1. Why Cross Validation?

The simplest way to evaluate a model is a single train/test split — train on 80% of data, test on 20%. But this has a serious problem: the result depends heavily on which 20% happened to end up in the test set. A lucky or unlucky split can give a misleading picture of how the model will actually perform on new data.

Cross validation solves this by averaging performance across multiple train/test splits, giving a much more reliable estimate of generalisation error with a measure of variability (standard deviation across folds).

Cross validation is essential for: model selection (choosing between algorithms), hyperparameter tuning (choosing the best settings), and assessing whether a model is overfitting or underfitting. The test set should be completely untouched during these processes — it is used only once, for the final performance report.
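To see the split-dependence problem concretely, here is a minimal sketch (using scikit-learn's bundled breast-cancer dataset, an arbitrary illustrative choice) that compares accuracy across ten different random 80/20 splits against a single 10-fold CV estimate:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Ten different random 80/20 splits: the score depends on which 20% is held out
split_scores = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    split_scores.append(model.fit(X_tr, y_tr).score(X_te, y_te))
print(f"Single-split accuracy range: {min(split_scores):.3f} to {max(split_scores):.3f}")

# 10-fold CV averages over splits, giving one stable estimate plus its spread
cv_scores = cross_val_score(model, X, y, cv=10)
print(f"10-fold CV accuracy: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
```

The spread between the best and worst single split illustrates why one hold-out score can mislead, while the CV mean and standard deviation summarise performance across all splits at once.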

2. K-Fold Cross Validation

K-Fold CV is the standard cross-validation method. The dataset is randomly divided into K equal-sized subsets (folds). The procedure runs K times:

  1. In iteration i, fold i is the validation set; the remaining K-1 folds form the training set.
  2. Train the model on the training set.
  3. Evaluate on the validation set. Record the performance score.
  4. After K iterations, average the K scores → final cross-validation score.

Each example appears in the validation set exactly once and in the training set K-1 times. This makes efficient use of all available data.
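The four-step procedure above can be written out explicitly. This is a sketch of a manual K-fold loop (K=5, on scikit-learn's breast-cancer dataset, chosen only for illustration); in practice cross_val_score does the same thing in one call:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X, y), start=1):
    model.fit(X[train_idx], y[train_idx])        # step 2: train on K-1 folds
    score = model.score(X[val_idx], y[val_idx])  # step 3: evaluate on held-out fold
    scores.append(score)
    print(f"Fold {fold}: accuracy = {score:.3f}")

# Step 4: average the K scores
print(f"CV score: {np.mean(scores):.3f} ± {np.std(scores):.3f}")
```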

Choosing K

| K value | Bias | Variance | Computation | Recommendation |
|---|---|---|---|---|
| K = 5 | Slightly higher | Lower | Fast | Good default for large datasets |
| K = 10 | Low | Moderate | Moderate | Standard recommendation |
| K = n (LOOCV) | Very low | High | Very slow | Only for tiny datasets (<50 examples) |

K=10 is the most widely used value: it balances the reduced bias of more folds (each training set is a larger fraction of the data) against the increased computational cost of more training runs.

3. Stratified K-Fold

Stratified K-Fold ensures that each fold contains approximately the same proportion of each class as the full dataset. This is critical for classification problems, especially on imbalanced datasets.

Without stratification, random chance might create a fold where the minority class is entirely absent — making the validation score for that fold meaningless. Stratified K-Fold prevents this by guaranteeing class balance in every fold.

Always use Stratified K-Fold for classification tasks. Scikit-learn’s cross_val_score uses it by default when the estimator is a classifier.
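The failure mode is easy to demonstrate. The following sketch uses a toy imbalanced label vector (90 negatives, 10 positives, sorted by class, an artificial worst case) and counts the positives in each validation fold under plain KFold versus StratifiedKFold:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy imbalanced labels: 90 negatives followed by 10 positives
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # features are irrelevant to how the splitters assign folds

for name, splitter in [("KFold", KFold(n_splits=5)),
                       ("StratifiedKFold", StratifiedKFold(n_splits=5))]:
    positives = [int(y[val].sum()) for _, val in splitter.split(X, y)]
    print(f"{name:16s} positives per validation fold = {positives}")
# KFold            positives per validation fold = [0, 0, 0, 0, 10]
# StratifiedKFold  positives per validation fold = [2, 2, 2, 2, 2]
```

Without shuffling, plain KFold puts all ten positives into one fold and none into the other four, so four of the five validation scores say nothing about the minority class; StratifiedKFold distributes them evenly.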

4. Leave-One-Out Cross Validation (LOOCV)

LOOCV is K-Fold where K = n (the number of training examples). In each iteration:

  • The model trains on n-1 examples.
  • It is validated on the single remaining example.
  • This repeats n times (once per example).

Advantages: Near-zero bias — the training set in each fold is almost as large as the full dataset. Deterministic — no randomness, same result every run.

Disadvantages: Computationally expensive — requires n model training runs (can be thousands). High variance in the estimate — the n validation scores are highly correlated (each training set differs by only one example). Not recommended for datasets with more than a few hundred examples unless the model trains very quickly.
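A minimal LOOCV sketch, using scikit-learn's LeaveOneOut splitter on the small iris dataset (150 examples, chosen because 150 training runs are still cheap here). Note that each fold's score is either 0 or 1, since the validation set is a single example:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# LOOCV = K-Fold with K = n, so this trains the model 150 times
scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print(f"LOOCV accuracy: {scores.mean():.3f} over {len(scores)} folds")
```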

5. Comparison Table

| Method | Splits | Training Size | Bias | Variance | Speed | Best For |
|---|---|---|---|---|---|---|
| Hold-out (single split) | 1 | ~80% | High | High | Very fast | Very large datasets only |
| 5-Fold CV | 5 | 80% | Moderate | Low | Fast | Large datasets, quick iteration |
| 10-Fold CV | 10 | 90% | Low | Moderate | Moderate | Standard; most datasets |
| Stratified K-Fold | K | (K-1)/K | Low | Low | Moderate | Classification, imbalanced data |
| LOOCV | n | n-1 | Very low | High | Very slow | Tiny datasets (<50 examples) |

6. Nested Cross Validation — For Hyperparameter Tuning

When you use cross-validation to both tune hyperparameters AND evaluate the model, you introduce optimism bias — you will overestimate performance because the validation set influenced model selection. The solution is nested cross-validation:

  • Outer loop: K-Fold CV for performance estimation.
  • Inner loop: K-Fold CV for hyperparameter selection (within each outer training fold).

This gives an unbiased estimate of the generalisation error of the model selection procedure — not just one specific model. It is more computationally expensive but is the statistically correct approach when both tuning and evaluating on the same dataset.

7. Common Pitfalls

  • Data leakage from preprocessing: If you scale features or compute statistics using the full dataset (including validation folds) before cross-validation, the validation scores will be overly optimistic. Always fit preprocessing inside the cross-validation loop — use Scikit-learn Pipelines to handle this automatically.
  • Using the test set during CV: The test set must never be seen until final model evaluation. Cross-validation is done on training data only. Looking at test set performance to guide model selection is data leakage.
  • Not using stratification for classification: Always use StratifiedKFold for classification tasks. Standard KFold can create unbalanced folds, especially on imbalanced datasets.
  • Choosing the model with the best CV mean without considering variance: A model with mean CV score 0.85 ± 0.01 is much more reliable than one with 0.87 ± 0.12. Always report both mean and standard deviation of cross-validation scores.

8. Python Code


import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    cross_val_score, cross_validate, StratifiedKFold, GridSearchCV
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Load dataset
X, y = load_breast_cancer(return_X_y=True)

# --- Stratified K-Fold (K=10) ---
model = RandomForestClassifier(n_estimators=100, random_state=42)
kf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kf, scoring='f1')
print(f"10-Fold CV F1:  {scores.mean():.3f} ± {scores.std():.3f}")

# --- Multiple metrics at once ---
results = cross_validate(model, X, y, cv=kf,
                         scoring=['accuracy', 'f1', 'roc_auc'])
print(f"Accuracy: {results['test_accuracy'].mean():.3f}")
print(f"F1 Score: {results['test_f1'].mean():.3f}")
print(f"AUC-ROC:  {results['test_roc_auc'].mean():.3f}")

# --- Pipeline to prevent data leakage ---
pipeline = Pipeline([
    ('scaler', StandardScaler()),      # Fitted inside CV loop — no leakage
    ('clf', LogisticRegression(max_iter=1000))
])
pipeline_scores = cross_val_score(pipeline, X, y, cv=kf, scoring='f1')
print(f"\nPipeline CV F1: {pipeline_scores.mean():.3f} ± {pipeline_scores.std():.3f}")

# --- Nested CV for hyperparameter tuning ---
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [None, 5, 10]}
grid_search = GridSearchCV(RandomForestClassifier(random_state=42),
                           param_grid, cv=inner_cv, scoring='f1')

nested_scores = cross_val_score(grid_search, X, y, cv=outer_cv, scoring='f1')
print(f"\nNested CV F1: {nested_scores.mean():.3f} ± {nested_scores.std():.3f}")
    

9. Frequently Asked Questions

Should I use cross-validation or a single train/test split?

Use cross-validation for model selection and hyperparameter tuning — it gives a more reliable estimate of generalisation performance. Use a single held-out test set for the final performance report. For very large datasets (millions of examples) where training is expensive, a single split may be more practical — the larger dataset makes a single split more reliable anyway.

Can cross-validation prevent overfitting?

Cross-validation detects overfitting but does not prevent it. If you observe a large gap between training scores and cross-validation scores, your model is overfitting. Cross-validation gives you an honest estimate of how the model will perform on new data. To actually prevent overfitting, you need techniques like regularisation, simpler models, or more data.
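One way to see that gap directly is cross_validate with return_train_score=True. The sketch below (breast-cancer dataset and an unconstrained random forest, both arbitrary illustrative choices) reports training and validation scores side by side:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=42)

# return_train_score exposes the per-fold training accuracy alongside validation accuracy
res = cross_validate(model, X, y, cv=5, return_train_score=True)
train, val = res['train_score'].mean(), res['test_score'].mean()
print(f"Train: {train:.3f}  CV: {val:.3f}  gap: {train - val:.3f}")
```

A large gap (training score near 1.0, CV score noticeably lower) is the signature of overfitting that cross-validation can reveal but not fix.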

Why do I need a separate test set if I’m using cross-validation?

Cross-validation scores are used to guide model selection and tuning decisions. Every time you look at a CV score and adjust your model based on it, the CV set (in aggregate) influences your model. A completely separate test set — never seen during development — gives you an unbiased final estimate of real-world performance. Without it, you cannot know if your reported performance is genuine or the result of subtle overfitting to the CV process.
