k-Means Clustering

Classical ML

An unsupervised algorithm that partitions n points into k clusters by iteratively assigning points to the nearest centroid and moving centroids to the mean of their cluster.


In one line

Pick k starting centroids, assign each point to its nearest centroid, move each centroid to the mean of its assigned points, repeat until nothing moves.

What it actually means

k-Means minimizes the sum of squared distances from points to their assigned centroid. It’s fast, it scales, and it’s the first clustering algorithm everyone learns. The catch: you have to pick k ahead of time, it assumes roughly spherical clusters of similar size, it’s sensitive to initialization (use k-means++), and Euclidean distance has to actually mean something for your features — so scale them first. For non-spherical or density-based structure, DBSCAN or HDBSCAN work better.
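The assign-then-update loop (Lloyd's algorithm) is short enough to write out. This is a minimal sketch in NumPy with plain random initialization rather than k-means++, and it assumes no cluster ever goes empty:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    # Minimal Lloyd's algorithm sketch: random init (not k-means++),
    # assumes well-behaved data where no cluster goes empty.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # nothing moved: converged
            break
        centroids = new_centroids
    return centroids, labels
```

The objective it is greedily minimizing is the within-cluster sum of squared distances; both steps can only decrease it, which is why the loop terminates.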

Why it matters

k-Means is the default for quick-and-dirty segmentation, coarse vector quantization, cohort discovery, and as a preprocessing step before something fancier. It’s also one of the few algorithms you can reason about on a whiteboard, which is why it keeps showing up in interviews.

Example

from sklearn.cluster import KMeans

# X_scaled: feature matrix with standardized columns (e.g. via StandardScaler),
# since k-means relies on Euclidean distance being meaningful.
km = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=0)
labels = km.fit_predict(X_scaled)

You’ll hear it when

  • Segmenting users, products, or documents.
  • Quantizing embeddings for approximate nearest neighbor search.
  • Building a color palette from an image.
  • Explaining the elbow method or silhouette score.
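The silhouette check from the last bullet can be sketched in a few lines: fit k-means for a range of k and keep the k with the highest score. Synthetic blobs stand in for real data here, and the make_blobs parameters are illustrative:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Illustrative data: four well-separated blobs.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.5, random_state=0)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters.
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
```

The elbow method works the same way but plots `km.inertia_` (the within-cluster sum of squares) against k and looks for the bend instead of a maximum.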
