22. Introduction to Clustering#
Clustering is an unsupervised learning task: we group observations so that items in the same group are more similar to each other than to items in other groups. Unlike classification, we are not given labels. Instead, we discover structure in the data.
22.1. Why cluster?#
Exploration: find hidden patterns or natural groupings.
Compression: summarize a large dataset by representative clusters.
Feature engineering: use cluster membership as a new feature for downstream models.
22.2. Core ideas#
Similarity is the key ingredient (often a distance metric like Euclidean or cosine).
Scale matters: features should usually be standardized before clustering, because distance computations are otherwise dominated by the features with the largest ranges.
No single best answer: different algorithms and metrics can yield different groupings.
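To make the scaling point concrete, here is a minimal NumPy sketch (the feature names and values are illustrative, not from any real dataset). It shows how one large-range feature dominates Euclidean distances, and how z-score standardization puts features on a comparable footing:

```python
import numpy as np

# Two features on very different scales: annual income (tens of
# thousands) and age (tens). Distances on the raw values are
# dominated by income.
X = np.array([
    [50_000.0, 25.0],
    [51_000.0, 60.0],
    [90_000.0, 26.0],
])

# Raw Euclidean distances: points 0 and 1 look close (income differs
# by only 1000) even though their ages differ by 35 years.
raw_01 = np.linalg.norm(X[0] - X[1])
raw_02 = np.linalg.norm(X[0] - X[2])
print(raw_01 < raw_02)  # True: income swamps the age difference

# Standardize each feature to zero mean and unit variance (z-scores),
# so both features contribute comparably to distances.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.allclose(Z.mean(axis=0), 0.0))  # True
print(np.allclose(Z.std(axis=0), 1.0))   # True
```

The same transformation is available as `StandardScaler` in scikit-learn if you prefer a fitted, reusable object.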
22.3. Common clustering methods#
K-means: partitions data into k clusters by minimizing within-cluster variance; it works best when clusters are roughly spherical and similar in size.
Hierarchical: builds a tree of clusters, either agglomerative (merging from the bottom up) or divisive (splitting from the top down).
DBSCAN: a density-based method that finds arbitrarily shaped clusters and labels sparse points as noise.
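The methods above can be contrasted on toy data. This sketch assumes scikit-learn is available and uses two synthetic, well-separated groups (the group locations and parameters are illustrative choices, not prescribed settings):

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

# Toy data: two well-separated groups of 2-D points.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=[0, 0], scale=0.3, size=(20, 2))
group_b = rng.normal(loc=[5, 5], scale=0.3, size=(20, 2))
X = np.vstack([group_a, group_b])

# K-means must be told k in advance; it minimizes within-cluster variance.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(len(set(km_labels)))  # 2

# DBSCAN needs no k: it grows clusters from dense regions and marks
# low-density outliers with the label -1. eps is the neighborhood radius.
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(len(set(db_labels) - {-1}))  # number of dense clusters found
```

On messier data (e.g. crescent shapes or clusters of different densities) the two algorithms will disagree, which is the point of having more than one method.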
22.4. Evaluating clusters#
Internal metrics (e.g., the silhouette score, which ranges from −1 to 1) measure how compact and well separated the clusters are.
Domain checks matter most: do the clusters make sense for the problem?
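One common way to use an internal metric is to sweep over candidate values of k and compare silhouette scores. A minimal sketch, assuming scikit-learn and the same kind of synthetic two-group data as above:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data generated with exactly two groups.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.3, size=(30, 2)),
    rng.normal(loc=[4, 4], scale=0.3, size=(30, 2)),
])

# Fit k-means for several k and record the silhouette score of each
# labeling; higher means tighter, better-separated clusters.
scores = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # 2, matching how the data were generated
```

A high silhouette score does not guarantee the clusters are meaningful for your problem; it only says they are geometrically tidy, which is why the domain check above remains the final arbiter.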