k-Means Clustering
An unsupervised algorithm that partitions n points into k clusters by iteratively assigning points to the nearest centroid and moving centroids to the mean of their cluster.
In one line
Pick k starting centroids, assign each point to its nearest centroid, move each centroid to the mean of its assigned points, and repeat until the assignments stop changing.
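That loop (Lloyd's algorithm) can be sketched in a few lines of NumPy. This is an illustrative toy with random initialization, not a replacement for a library implementation; the function name and parameters are my own:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Toy Lloyd's algorithm: random init, assign, update, repeat."""
    rng = np.random.default_rng(seed)
    # naive init: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: label each point with its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: move each centroid to the mean of its points
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):  # nothing moved: converged
            break
        centroids = new
    return labels, centroids
```

A production version would use k-means++ seeding and multiple restarts, which is exactly what sklearn's KMeans does for you.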
What it actually means
k-Means minimizes the sum of squared distances from points to their assigned centroid. It’s fast, it scales, and it’s the first clustering algorithm everyone learns. The catch: you have to pick k ahead of time, it assumes roughly spherical clusters of similar size, it’s sensitive to initialization (use k-means++), and Euclidean distance has to actually mean something for your features — so scale them first. For non-spherical or density-based structure, DBSCAN or HDBSCAN work better.
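The scaling caveat is easy to see on synthetic data. Here is a small sketch (feature names and values are made up for illustration) where one raw feature spans dollars and the other spans years, so unscaled Euclidean distance is dominated by the dollar axis:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical customer data: income in dollars dwarfs age in years,
# so unscaled Euclidean distance is effectively income-only.
rng = np.random.default_rng(0)
age = rng.uniform(20, 70, 200)
income = rng.uniform(20_000, 200_000, 200)
X = np.column_stack([age, income])

raw_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
scaled_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(X))
# raw_labels end up as income bands; scaled_labels can use both features
```

On unscaled data the clusters are essentially income intervals, with age contributing nothing; after standardization both features get a vote.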
Why it matters
k-Means is the default for quick-and-dirty segmentation, coarse vector quantization, cohort discovery, and as a preprocessing step before something fancier. It’s also one of the few algorithms you can reason about on a whiteboard, which is why it keeps showing up in interviews.
Example
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)  # X: your (n_samples, n_features) matrix
km = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=0)
labels = km.fit_predict(X_scaled)
You’ll hear it when
- Segmenting users, products, or documents.
- Quantizing embeddings for approximate nearest neighbor search.
- Building a color palette from an image.
- Explaining the elbow method or silhouette score.
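The last bullet is worth a concrete sketch. On synthetic blobs (centers and spread chosen here for illustration), inertia feeds the elbow method and silhouette score peaks near the true k:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# four well-separated synthetic blobs, so the "right" k is 4
X, _ = make_blobs(n_samples=400, centers=[[0, 0], [6, 0], [0, 6], [6, 6]],
                  cluster_std=0.7, random_state=0)

scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ = within-cluster sum of squares (always falls as k grows;
    # look for the "elbow"); silhouette rewards tight, well-separated
    # clusters, so it can actually peak at a good k
    scores[k] = silhouette_score(X, km.labels_)
    print(k, round(km.inertia_, 1), round(scores[k], 3))
```

Inertia alone always improves with more clusters, which is why the elbow is eyeballed rather than maximized; silhouette gives a number you can argmax.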