Cross Validation — K-Fold and Leave-One-Out Explained




K-Fold & Leave-One-Out — The Right Way to Evaluate ML Models

Last Updated: March 2026

📌 Key Takeaways

  • Definition: Cross validation evaluates model performance by training and testing on multiple different splits of the data — giving a more reliable generalisation estimate than a single split.
  • Standard method: K-Fold CV (K=5 or K=10). Split data into K folds; train on K-1, validate on 1; repeat K times; average results.
  • Stratified K-Fold: Preserves class proportions in each fold — use for classification, especially on imbalanced datasets.
  • Leave-One-Out (LOOCV): K = n. Nearly unbiased but very slow — only for tiny datasets.
  • Critical rule: Never use the test set during cross-validation. CV is for model selection; test set is for final evaluation only.

1. Why Cross Validation?

The simplest way to evaluate a model is a single train/test split — train on 80% of data, test on 20%. But this has a serious problem: the result depends heavily on which 20% happened to end up in the test set. A lucky or unlucky split can give a misleading picture of how the model will actually perform on new data.

Cross validation solves this by averaging performance across multiple train/test splits, giving a much more reliable estimate of generalisation error with a measure of variability (standard deviation across folds).

Cross validation is essential for: model selection (choosing between algorithms), hyperparameter tuning (choosing the best settings), and assessing whether a model is overfitting or underfitting. The test set should be completely untouched during these processes — it is used only once, for the final performance report.
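To see the split-dependence problem concretely, here is a minimal sketch (using scikit-learn's bundled breast-cancer dataset, an arbitrary illustrative choice) that compares accuracy across ten different random 80/20 splits against a single 10-fold CV estimate:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Ten different random 80/20 splits: the score depends on which 20% is held out
split_scores = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    split_scores.append(model.fit(X_tr, y_tr).score(X_te, y_te))
print(f"Single-split accuracy range: {min(split_scores):.3f} to {max(split_scores):.3f}")

# 10-fold CV averages over splits, giving one stable estimate plus its spread
cv_scores = cross_val_score(model, X, y, cv=10)
print(f"10-fold CV accuracy: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
```

The spread between the best and worst single split illustrates why one hold-out score can mislead, while the CV mean and standard deviation summarise performance across all splits at once.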

2. K-Fold Cross Validation

K-Fold CV is the standard cross-validation method. The dataset is randomly divided into K equal-sized subsets (folds). The procedure runs K times:

  1. In iteration i, fold i is the validation set; the remaining K-1 folds form the training set.
  2. Train the model on the training set.
  3. Evaluate on the validation set. Record the performance score.
  4. After K iterations, average the K scores → final cross-validation score.

Each example appears in the validation set exactly once and in the training set K-1 times. This makes efficient use of all available data.
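The four-step procedure above can be written out explicitly. This is a sketch of a manual K-fold loop (K=5, on scikit-learn's breast-cancer dataset, chosen only for illustration); in practice cross_val_score does the same thing in one call:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X, y), start=1):
    model.fit(X[train_idx], y[train_idx])        # step 2: train on K-1 folds
    score = model.score(X[val_idx], y[val_idx])  # step 3: evaluate on held-out fold
    scores.append(score)
    print(f"Fold {fold}: accuracy = {score:.3f}")

# Step 4: average the K scores
print(f"CV score: {np.mean(scores):.3f} ± {np.std(scores):.3f}")
```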

Choosing K

| K value | Bias | Variance | Computation | Recommendation |
|---|---|---|---|---|
| K = 5 | Slightly higher | Lower | Fast | Good default for large datasets |
| K = 10 | Low | Moderate | Moderate | Standard recommendation |
| K = n (LOOCV) | Very low | High | Very slow | Only for tiny datasets (<50 examples) |

K=10 is the most widely used value: it balances the reduced bias of more folds (each training set is a larger fraction of the data) against the increased computational cost of more training runs.

3. Stratified K-Fold

Stratified K-Fold ensures that each fold contains approximately the same proportion of each class as the full dataset. This is critical for classification problems, especially on imbalanced datasets.

Without stratification, random chance might create a fold where the minority class is entirely absent — making the validation score for that fold meaningless. Stratified K-Fold prevents this by guaranteeing class balance in every fold.

Always use Stratified K-Fold for classification tasks. Scikit-learn’s cross_val_score uses it by default when the estimator is a classifier.
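The failure mode is easy to demonstrate. The following sketch uses a toy imbalanced label vector (90 negatives, 10 positives, sorted by class, an artificial worst case) and counts the positives in each validation fold under plain KFold versus StratifiedKFold:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy imbalanced labels: 90 negatives followed by 10 positives
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # features are irrelevant to how the splitters assign folds

for name, splitter in [("KFold", KFold(n_splits=5)),
                       ("StratifiedKFold", StratifiedKFold(n_splits=5))]:
    positives = [int(y[val].sum()) for _, val in splitter.split(X, y)]
    print(f"{name:16s} positives per validation fold = {positives}")
# KFold            positives per validation fold = [0, 0, 0, 0, 10]
# StratifiedKFold  positives per validation fold = [2, 2, 2, 2, 2]
```

Without shuffling, plain KFold puts all ten positives into one fold and none into the other four, so four of the five validation scores say nothing about the minority class; StratifiedKFold distributes them evenly.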

4. Leave-One-Out Cross Validation (LOOCV)

LOOCV is K-Fold where K = n (the number of training examples). In each iteration:

  • The model trains on n-1 examples.
  • It is validated on the single remaining example.
  • This repeats n times (once per example).

Advantages: Near-zero bias — the training set in each fold is almost as large as the full dataset. Deterministic — no randomness, same result every run.

Disadvantages: Computationally expensive — requires n model training runs (can be thousands). High variance in the estimate — the n validation scores are highly correlated (each training set differs by only one example). Not recommended for datasets with more than a few hundred examples unless the model trains very quickly.
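A minimal LOOCV sketch, using scikit-learn's LeaveOneOut splitter on the small iris dataset (150 examples, chosen because 150 training runs are still cheap here). Note that each fold's score is either 0 or 1, since the validation set is a single example:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# LOOCV = K-Fold with K = n, so this trains the model 150 times
scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print(f"LOOCV accuracy: {scores.mean():.3f} over {len(scores)} folds")
```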

5. Comparison Table

| Method | Splits | Training Size | Bias | Variance | Speed | Best For |
|---|---|---|---|---|---|---|
| Hold-out (single split) | 1 | ~80% | High | High | Very fast | Very large datasets only |
| 5-Fold CV | 5 | 80% | Moderate | Low | Fast | Large datasets, quick iteration |
| 10-Fold CV | 10 | 90% | Low | Moderate | Moderate | Standard; most datasets |
| Stratified K-Fold | K | (K-1)/K | Low | Low | Moderate | Classification, imbalanced data |
| LOOCV | n | n-1 | Very low | High | Very slow | Tiny datasets (<50 examples) |

6. Nested Cross Validation — For Hyperparameter Tuning

When you use cross-validation to both tune hyperparameters AND evaluate the model, you introduce optimism bias — you will overestimate performance because the validation set influenced model selection. The solution is nested cross-validation:

  • Outer loop: K-Fold CV for performance estimation.
  • Inner loop: K-Fold CV for hyperparameter selection (within each outer training fold).

This gives an unbiased estimate of the generalisation error of the model selection procedure — not just one specific model. It is more computationally expensive but is the statistically correct approach when both tuning and evaluating on the same dataset.

7. Common Pitfalls

  • Data leakage from preprocessing: If you scale features or compute statistics using the full dataset (including validation folds) before cross-validation, the validation scores will be overly optimistic. Always fit preprocessing inside the cross-validation loop — use Scikit-learn Pipelines to handle this automatically.
  • Using the test set during CV: The test set must never be seen until final model evaluation. Cross-validation is done on training data only. Looking at test set performance to guide model selection is data leakage.
  • Not using stratification for classification: Always use StratifiedKFold for classification tasks. Standard KFold can create unbalanced folds, especially on imbalanced datasets.
  • Choosing the model with the best CV mean without considering variance: A model with mean CV score 0.85 ± 0.01 is much more reliable than one with 0.87 ± 0.12. Always report both mean and standard deviation of cross-validation scores.

8. Python Code


import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    cross_val_score, cross_validate, StratifiedKFold, GridSearchCV
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Load dataset
X, y = load_breast_cancer(return_X_y=True)

# --- Stratified K-Fold (K=10) ---
model = RandomForestClassifier(n_estimators=100, random_state=42)
kf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kf, scoring='f1')
print(f"10-Fold CV F1:  {scores.mean():.3f} ± {scores.std():.3f}")

# --- Multiple metrics at once ---
results = cross_validate(model, X, y, cv=kf,
                         scoring=['accuracy', 'f1', 'roc_auc'])
print(f"Accuracy: {results['test_accuracy'].mean():.3f}")
print(f"F1 Score: {results['test_f1'].mean():.3f}")
print(f"AUC-ROC:  {results['test_roc_auc'].mean():.3f}")

# --- Pipeline to prevent data leakage ---
pipeline = Pipeline([
    ('scaler', StandardScaler()),      # Fitted inside CV loop — no leakage
    ('clf', LogisticRegression(max_iter=1000))
])
pipeline_scores = cross_val_score(pipeline, X, y, cv=kf, scoring='f1')
print(f"\nPipeline CV F1: {pipeline_scores.mean():.3f} ± {pipeline_scores.std():.3f}")

# --- Nested CV for hyperparameter tuning ---
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [None, 5, 10]}
grid_search = GridSearchCV(RandomForestClassifier(random_state=42),
                           param_grid, cv=inner_cv, scoring='f1')

nested_scores = cross_val_score(grid_search, X, y, cv=outer_cv, scoring='f1')
print(f"\nNested CV F1: {nested_scores.mean():.3f} ± {nested_scores.std():.3f}")
    

9. Frequently Asked Questions

Should I use cross-validation or a single train/test split?

Use cross-validation for model selection and hyperparameter tuning — it gives a more reliable estimate of generalisation performance. Use a single held-out test set for the final performance report. For very large datasets (millions of examples) where training is expensive, a single split may be more practical — the larger dataset makes a single split more reliable anyway.

Can cross-validation prevent overfitting?

Cross-validation detects overfitting but does not prevent it. If you observe a large gap between training scores and cross-validation scores, your model is overfitting. Cross-validation gives you an honest estimate of how the model will perform on new data. To actually prevent overfitting, you need techniques like regularisation, simpler models, or more data.
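One way to see that gap directly is cross_validate with return_train_score=True. The sketch below (breast-cancer dataset and an unconstrained random forest, both arbitrary illustrative choices) reports training and validation scores side by side:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=42)

# return_train_score exposes the per-fold training accuracy alongside validation accuracy
res = cross_validate(model, X, y, cv=5, return_train_score=True)
train, val = res['train_score'].mean(), res['test_score'].mean()
print(f"Train: {train:.3f}  CV: {val:.3f}  gap: {train - val:.3f}")
```

A large gap (training score near 1.0, CV score noticeably lower) is the signature of overfitting that cross-validation can reveal but not fix.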

Why do I need a separate test set if I’m using cross-validation?

Cross-validation scores are used to guide model selection and tuning decisions. Every time you look at a CV score and adjust your model based on it, the CV set (in aggregate) influences your model. A completely separate test set — never seen during development — gives you an unbiased final estimate of real-world performance. Without it, you cannot know if your reported performance is genuine or the result of subtle overfitting to the CV process.
