K-Means Clustering

Steps, Elbow Method & Examples — For Engineering Students

Last Updated: March 2026

📌 Key Takeaways

  • Type: Unsupervised learning — no labels required.
  • Goal: Partition data into K clusters such that points in the same cluster are as similar as possible.
  • Algorithm: Iteratively assigns points to nearest centroid, then recalculates centroids. Repeats until convergence.
  • Choosing K: Use the Elbow Method — plot WCSS vs K and find the “elbow” point.
  • Distance metric: Euclidean distance (default). Always scale features before applying K-Means.
  • Limitation: Assumes spherical clusters; sensitive to outliers; result depends on initial centroid placement.

1. What is K-Means Clustering?

K-Means is an unsupervised machine learning algorithm that partitions a dataset into K distinct, non-overlapping clusters based on feature similarity. Each data point belongs to exactly one cluster — the one whose centroid (centre point) is closest to it in feature space.

It is called K-Means because it represents each cluster by the mean (average) of all points assigned to it — the centroid — and K is the number of clusters you specify.

Analogy — Sorting Exam Papers by Topic

Imagine you have 500 engineering exam papers and want to sort them into 4 topic groups — without reading each one. You pick 4 random papers as “seeds” and assign each remaining paper to the seed it most resembles. Then you recalculate the “average paper” for each group and repeat. After several rounds, the groups stabilise into 4 coherent topic clusters. K-Means works exactly this way — with data points replacing papers and centroids replacing average papers.

2. The K-Means Algorithm — Step by Step

  1. Choose K — decide how many clusters you want. K is a hyperparameter that must be set before training. (See Elbow Method below for how to choose K.)
  2. Initialise K centroids — randomly place K centroids in the feature space. Better: use K-Means++ initialisation for smarter placement (reduces bad convergence).
  3. Assignment Step — assign each data point to the nearest centroid using Euclidean distance:

    distance(x, μₖ) = √Σⱼ (xⱼ − μₖⱼ)²

    Each point goes to the cluster whose centroid minimises this distance.

  4. Update Step — recalculate each centroid as the mean of all points assigned to that cluster:

    μₖ = (1/|Cₖ|) × Σ xᵢ (for all xᵢ in cluster Cₖ)

  5. Repeat — repeat steps 3 and 4 until centroids no longer move significantly (convergence) or a maximum number of iterations is reached.
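The five steps above can be sketched directly in NumPy — a minimal illustration of the assignment/update loop, not an optimised implementation (no empty-cluster handling; the toy data and the explicit init seeds are invented for the example):

```python
import numpy as np

def kmeans(X, k, init=None, n_iters=100, seed=0):
    """Minimal K-Means following steps 2-5 (no empty-cluster handling)."""
    rng = np.random.default_rng(seed)
    # Step 2: initial centroids — k random data points unless given explicitly
    centroids = X[rng.choice(len(X), size=k, replace=False)] if init is None else init.astype(float)
    for _ in range(n_iters):
        # Step 3 (assignment): index of the nearest centroid, Euclidean distance
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4 (update): each centroid becomes the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5 (convergence): stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two tight 2-D blobs; seed one centroid in each blob
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.2]])
labels, centroids = kmeans(X, k=2, init=X[[0, 3]])
```

With these seeds the loop converges after one update: the first three points form one cluster, the last three the other, and each centroid ends at its cluster mean.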

Objective Function — WCSS

K-Means minimises the Within-Cluster Sum of Squares (WCSS) — also called inertia:

WCSS = Σₖ Σ(xᵢ ∈ Cₖ) ||xᵢ − μₖ||²

This is the sum of squared distances from each point to its cluster centroid, summed over all clusters. Lower WCSS means tighter, better-defined clusters. K-Means is guaranteed to converge (WCSS never increases), but may converge to a local minimum — not the global minimum.
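The formula can be checked by hand on a tiny 1-D dataset (the same six points used in the worked example in Section 3, with the final clusters and centroids taken as given):

```python
import numpy as np

# Six 1-D points as column vectors, split into two clusters
X = np.array([[2.0], [3.0], [10.0], [11.0], [20.0], [22.0]])
labels = np.array([0, 0, 1, 1, 1, 1])
centroids = np.array([[2.5], [15.75]])

# WCSS: squared distances to each point's cluster centroid, summed over clusters
wcss = sum(((X[labels == k] - centroids[k]) ** 2).sum() for k in range(2))
print(wcss)  # 0.5 + 112.75 = 113.25
```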

3. Worked Numerical Example

Dataset: 6 data points in 1D: {2, 3, 10, 11, 20, 22}. Find K=2 clusters.

Iteration 1:

Initialise centroids randomly: μ₁ = 2, μ₂ = 10

Assignment:

  • Point 2: distance to μ₁=|2−2|=0, to μ₂=|2−10|=8 → Cluster 1
  • Point 3: distance to μ₁=1, to μ₂=7 → Cluster 1
  • Point 10: distance to μ₁=8, to μ₂=0 → Cluster 2
  • Point 11: distance to μ₁=9, to μ₂=1 → Cluster 2
  • Point 20: distance to μ₁=18, to μ₂=10 → Cluster 2
  • Point 22: distance to μ₁=20, to μ₂=12 → Cluster 2

Update centroids:

  • μ₁ = mean(2, 3) = 2.5
  • μ₂ = mean(10, 11, 20, 22) = 63/4 = 15.75

Iteration 2:

Assignment with new centroids μ₁=2.5, μ₂=15.75:

  • Point 2 → Cluster 1 (|2−2.5|=0.5 < |2−15.75|=13.75)
  • Point 3 → Cluster 1
  • Point 10 → Cluster 2 (|10−15.75|=5.75 < |10−2.5|=7.5)
  • Point 11 → Cluster 2 (|11−15.75|=4.75 < |11−2.5|=8.5)
  • Point 20 → Cluster 2
  • Point 22 → Cluster 2

Update centroids:

  • μ₁ = mean(2, 3) = 2.5
  • μ₂ = mean(10, 11, 20, 22) = 15.75

Centroids unchanged → Convergence!

Final clusters: Cluster 1 = {2, 3} | Cluster 2 = {10, 11, 20, 22}
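Note that the initialisation μ₁=2, μ₂=10 lands in a local minimum: the split {2, 3} | {10, 11, 20, 22} has WCSS = 113.25, whereas {2, 3, 10, 11} | {20, 22} (means 6.5 and 21) has WCSS = 67. Scikit-learn with multiple k-means++ restarts finds the better split (a quick check; n_init=10 is set explicitly):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([2, 3, 10, 11, 20, 22], dtype=float).reshape(-1, 1)

km = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=42).fit(X)
print("Inertia (WCSS):", km.inertia_)           # lower than the 113.25 above
print("Centroids:", km.cluster_centers_.ravel())
```

This is exactly the local-minimum behaviour mentioned in the objective-function discussion: the single-initialisation run above converged, but not to the best possible clustering.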

4. Choosing K — The Elbow Method

K is the most important hyperparameter in K-Means. Choosing it incorrectly leads to meaningless clusters. The Elbow Method is the standard approach:

  1. Run K-Means for K = 1, 2, 3, 4, …, 10 (or more)
  2. For each K, compute the WCSS (inertia)
  3. Plot K on the x-axis and WCSS on the y-axis
  4. Look for the “elbow” — the point where WCSS starts decreasing more slowly (diminishing returns)
  5. The elbow point is the optimal K

Intuition: Adding more clusters always reduces WCSS — with K=n (one cluster per point), WCSS = 0. The elbow identifies the point after which additional clusters add complexity without meaningful improvement in cluster quality.

Silhouette Score is a complementary method — it measures how similar a point is to its own cluster versus other clusters. Scores range from −1 to +1; higher is better. The K with the highest silhouette score is usually a sound choice, especially when the elbow is ambiguous.
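A sketch of silhouette-based selection (blob data generated here for illustration; the make_blobs parameters mirror the code in Section 8):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Sample data with 3 natural clusters
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

scores = {}
for k in range(2, 7):  # silhouette is only defined for K >= 2
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print("Silhouette scores:", scores)
print("Best K:", best_k)
```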

5. Centroid Initialisation — K-Means++

Standard K-Means initialises centroids randomly, which can lead to poor convergence (getting stuck in local minima). K-Means++ (the default in Scikit-learn) uses a smarter initialisation:

  1. Choose the first centroid uniformly at random from the data points.
  2. For each subsequent centroid, choose a data point with probability proportional to its squared distance from the nearest already-chosen centroid.
  3. Repeat until K centroids are chosen.

This ensures initial centroids are spread out, dramatically reducing the chance of poor convergence and typically leading to better final clusters with fewer iterations.
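The seeding procedure can be sketched as follows — an illustrative implementation of D²-weighted sampling, not Scikit-learn's internal version:

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """K-Means++ seeding: sample centroids with probability proportional to D^2."""
    # 1. First centroid: a uniformly random data point
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest already-chosen centroid
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        # 2. Next centroid: sampled with probability proportional to d2,
        #    so far-away points are strongly favoured
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

# Two tight, well-separated pairs of points
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
seeds = kmeans_pp_init(X, k=2, rng=np.random.default_rng(0))
```

On this data the second seed almost certainly comes from the opposite pair, since its squared distance (about 50) dwarfs that of the first seed's neighbour (about 0.01).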

6. Limitations of K-Means

  • Must specify K in advance: K-Means cannot discover the number of clusters automatically. Use the Elbow Method or Silhouette Score to guide the choice, but domain knowledge is always valuable.
  • Assumes spherical, equal-sized clusters: K-Means uses Euclidean distance, which assumes clusters are roughly spherical and similarly sized. It fails on elongated, irregularly shaped, or very differently sized clusters. Use DBSCAN or Gaussian Mixture Models for such cases.
  • Sensitive to outliers: Outliers pull centroids towards them, distorting cluster shapes. Use K-Medoids (which uses actual data points as cluster centres) for robustness to outliers.
  • Local minimum problem: K-Means converges to a local minimum — the result depends on initialisation. Run multiple times with different seeds (the n_init parameter) and keep the best result. Set n_init=10 explicitly: recent Scikit-learn versions default to n_init='auto', which performs only a single run when init='k-means++'.
  • Requires feature scaling: Features with larger ranges dominate the distance calculation. Always normalise or standardise features before K-Means.
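The feature-scaling point is easy to demonstrate numerically (toy salary/experience data, invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Feature 0: salary (tens of thousands); feature 1: years of experience
X = np.array([[30000.0, 1.0], [31000.0, 9.0], [90000.0, 2.0], [91000.0, 8.0]])

# Unscaled: the salary column dominates, the 8-year experience gap is invisible
d_unscaled = np.linalg.norm(X[0] - X[1])

# Scaled: both features contribute on comparable terms
X_scaled = StandardScaler().fit_transform(X)
d_scaled = np.linalg.norm(X_scaled[0] - X_scaled[1])

print(d_unscaled)  # ~1000, driven almost entirely by salary
print(d_scaled)    # dominated by the experience difference after scaling
```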

7. Real-World Applications

| Domain | Application |
| --- | --- |
| Marketing | Customer segmentation — group customers by purchasing behaviour for targeted campaigns |
| Image Processing | Image compression — reduce colours by clustering pixels into K colour groups |
| Document Analysis | Topic clustering — group news articles or research papers by subject |
| Medicine | Patient grouping — cluster patients by symptoms for personalised treatment |
| Engineering | Anomaly detection — identify unusual sensor readings in manufacturing or infrastructure |
| Recommendation Systems | User clustering — group users with similar preferences for collaborative filtering |
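As a concrete instance of the image-compression application, pixels can be quantised to K representative colours (a sketch on a small synthetic two-colour "image"):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 10x10 RGB image: 50 reddish and 50 bluish pixels
rng = np.random.default_rng(0)
img = np.vstack([
    rng.normal([200, 30, 30], 10, size=(50, 3)),   # reddish pixels
    rng.normal([30, 30, 200], 10, size=(50, 3)),   # bluish pixels
]).reshape(10, 10, 3)

# Flatten to a (pixels, RGB) matrix and cluster the colours
pixels = img.reshape(-1, 3)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pixels)

# Replace every pixel by its cluster's mean colour → a 2-colour image
compressed = km.cluster_centers_[km.labels_].reshape(img.shape)
```

After quantisation the image contains only K distinct colours, so it can be stored as K colour entries plus one small index per pixel.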

8. Python Code


import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs

# Generate sample data (3 natural clusters)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Scale features (essential for K-Means)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# --- Elbow Method to find optimal K ---
wcss = []
K_range = range(1, 11)
for k in K_range:
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    km.fit(X_scaled)
    wcss.append(km.inertia_)

plt.figure(figsize=(8, 4))
plt.plot(K_range, wcss, 'bo-')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('WCSS (Inertia)')
plt.title('Elbow Method for Optimal K')
plt.xticks(K_range)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# --- Train Final Model with K=3 ---
kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)

# Evaluate
sil_score = silhouette_score(X_scaled, labels)
print(f"Silhouette Score (K=3): {sil_score:.3f}")
print(f"WCSS (Inertia):         {kmeans.inertia_:.2f}")
print(f"Cluster sizes: {np.bincount(labels)}")
print(f"Centroids (scaled):\n{kmeans.cluster_centers_}")

9. Common Mistakes Students Make

  • Not scaling features: This is the most common mistake. K-Means uses Euclidean distance, so a feature in thousands will dominate one in single digits. Always use StandardScaler or MinMaxScaler first.
  • Running K-Means only once: Different random initialisations can give different results. Always set n_init=10 or higher so the algorithm runs multiple times and keeps the best result, rather than relying on the library default.
  • Choosing K based on WCSS alone: WCSS always decreases as K increases — so K=n (one point per cluster) always gives WCSS=0. Use the elbow method AND the silhouette score together to make a more reliable choice.
  • Applying K-Means to non-spherical clusters: K-Means fails badly on elongated, ring-shaped, or irregularly distributed clusters. Always visualise your data first (use PCA for dimensionality reduction if needed). If clusters are non-spherical, use DBSCAN instead.

10. Frequently Asked Questions

What is the difference between K-Means and K-Medoids?

K-Means uses the mean of assigned points as the cluster centre (centroid), which may not be an actual data point. K-Medoids uses an actual data point as the cluster centre (medoid). K-Medoids is more robust to outliers because outliers cannot drag the medoid to an extreme position. However, K-Medoids is computationally more expensive.

Does K-Means always converge?

Yes — K-Means always converges because each iteration either reduces WCSS or leaves it unchanged. However, it may converge to a local minimum rather than the global minimum. This is why running multiple initialisations (n_init) is important — to find the best of several local minima.

When should I use DBSCAN instead of K-Means?

Use DBSCAN when: (1) you do not know the number of clusters in advance, (2) clusters have irregular or non-spherical shapes, (3) there are significant outliers in the data (DBSCAN identifies them as noise), or (4) cluster density varies significantly. K-Means is better for well-separated, roughly spherical clusters of similar size.
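The contrast is easy to see on interleaved half-moons, which defeat K-Means but suit DBSCAN (the eps and min_samples values below are illustrative and data-dependent):

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved, non-spherical clusters
X, y = make_moons(n_samples=300, noise=0.05, random_state=42)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Adjusted Rand Index vs the true moons: 1.0 = perfect recovery
print("K-Means ARI:", adjusted_rand_score(y, km_labels))
print("DBSCAN ARI: ", adjusted_rand_score(y, db_labels))
```

K-Means cuts the moons roughly in half with a straight boundary, while DBSCAN follows the dense curved shapes and recovers both moons almost exactly.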
