K Means Clustering with Python

22.5. K Means Clustering with Python

This notebook is just a code reference for the video lecture and reading.

22.5.1. Method Used

K-means clustering is an unsupervised learning algorithm that groups observations by similarity. Unsupervised learning means that there is no outcome to be predicted; the algorithm only looks for patterns in the data. In k-means clustering, we have to specify the number of clusters we want the data to be grouped into. The algorithm randomly assigns each observation to a cluster and finds the centroid of each cluster. It then iterates through two steps:

Reassign each data point to the cluster whose centroid is closest.
Calculate the new centroid of each cluster.

These two steps are repeated until the within-cluster variation cannot be reduced any further. The within-cluster variation is calculated as the sum of the squared Euclidean distances between the data points and their respective cluster centroids.
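
For reference, the quantity k-means minimizes can be written out as a sum of squared Euclidean distances. The notation is introduced here and is not from the original notebook: $C_j$ is the set of points assigned to cluster $j$, and $\mu_j$ is that cluster's centroid.

```latex
W = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

The reassignment step lowers $W$ for fixed centroids, and the centroid step lowers $W$ for fixed assignments, which is why the iteration cannot increase the within-cluster variation.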

22.5.2. Learning goals

Explain the intuition behind k-means.
Run k-means on toy data and interpret the output.
Recognize common pitfalls (scaling, choice of k, randomness).

22.5.3. K-means in a nutshell

K-means partitions the data into “k” groups by minimizing the within-cluster sum of squares. It alternates between two steps until convergence:

Assign each point to the nearest cluster center.
Update each center to be the mean of its assigned points.

This process is fast and simple, but it assumes clusters are roughly spherical and similar in size.
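
The two alternating steps can be sketched in a few lines of NumPy. This is a minimal illustration, not scikit-learn's implementation: the function names `assign` and `lloyd_kmeans`, the choice to initialize centers from randomly picked data points, and the toy data are all assumptions made for the example.

```python
import numpy as np


def assign(X, centers):
    """Assignment step: index of the nearest center for every point."""
    # Squared Euclidean distance from each point to each center.
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)


def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means sketch; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Assumed initialization: k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(X, centers)
        # Update step: move each center to the mean of its points.
        # An empty cluster keeps its previous center.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving: converged
        centers = new_centers
    return centers, assign(X, centers)


# Toy data: two well-separated groups of three points each.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centers, labels = lloyd_kmeans(X, k=2)
print(centers)
print(labels)
```

On this toy data the two groups end up in different clusters, but which group is called 0 and which is called 1 depends on the random initialization.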

22.5.4. Import Libraries

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

22.5.5. Create some Data

from sklearn.datasets import make_blobs

make_blobs generates synthetic Gaussian clusters for quick experiments. It returns the feature matrix and the cluster labels used to create the points.
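
As a quick check of what comes back, the return value can be unpacked into two arrays. The names `X` and `y` are chosen for this sketch; the notebook itself keeps the tuple as `data`, so `data[0]` plays the role of `X` and `data[1]` the role of `y`.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Same settings as the notebook's data, unpacked into two names.
X, y = make_blobs(n_samples=200, n_features=2,
                  centers=4, cluster_std=1.8, random_state=101)

print(X.shape)       # (200, 2): one row per sample, one column per feature
print(y.shape)       # (200,): one generating-cluster label per sample
print(np.unique(y))  # [0 1 2 3]
```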

# Create Data: data[0] is the feature matrix, data[1] the cluster labels
data = make_blobs(n_samples=200, n_features=2,
                  centers=4, cluster_std=1.8, random_state=101)

22.5.6. Visualize Data

# Scatter the two features, colored by the generating cluster label.
plt.scatter(data[0][:, 0], data[0][:, 1], c=data[1], cmap='rainbow')

22.5.7. Choosing k

K-means needs the number of clusters up front. Two common heuristics:

Elbow method: plot within-cluster sum of squares vs. k and look for the bend.
Silhouette score: ranges from -1 to 1; higher values mean tighter, better-separated clusters.

Always sanity-check with domain knowledge; k is not just a technical choice.
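
Both heuristics can be computed in one loop. The sketch below is one possible way to do it, assuming the same synthetic data as the rest of the notebook; the range of k values, `n_init=10`, and `random_state=0` are arbitrary choices for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=200, n_features=2,
                  centers=4, cluster_std=1.8, random_state=101)

inertias = {}     # within-cluster sum of squares per k (elbow method)
silhouettes = {}  # mean silhouette score per k
for k in range(2, 9):  # the silhouette score needs at least 2 clusters
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_
    silhouettes[k] = silhouette_score(X, model.labels_)

for k in inertias:
    print(f"k={k}  inertia={inertias[k]:.1f}  silhouette={silhouettes[k]:.3f}")

# The k with the highest silhouette score is a candidate, not a verdict.
best_k = max(silhouettes, key=silhouettes.get)
print("highest silhouette at k =", best_k)
```

Plotting `inertias` against k gives the elbow curve. Inertia tends to keep falling as k grows, so look for where the drop flattens rather than for the minimum.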

22.5.8. Creating the Clusters

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=4)

kmeans.fit(data[0])

kmeans.cluster_centers_

kmeans.labels_
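
For readers who want to see what these attributes hold, here is a self-contained sketch that refits the model and inspects it. The `n_init` and `random_state` values and the two `new_points` are assumptions for the example, not settings from the original notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, n_features=2,
                  centers=4, cluster_std=1.8, random_state=101)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_.shape)  # (4, 2): one row per cluster center
print(kmeans.labels_.shape)           # (200,): one cluster index per sample
print(kmeans.inertia_)                # within-cluster sum of squares

# predict assigns unseen points to the nearest fitted center.
new_points = np.array([[0.0, 0.0], [5.0, 5.0]])
predicted = kmeans.predict(new_points)
print(predicted)
```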

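# Left panel: points colored by the k-means labels.
# Right panel: points colored by the labels make_blobs generated them with.
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(10, 6))
ax1.set_title('K Means')
ax1.scatter(data[0][:, 0], data[0][:, 1], c=kmeans.labels_, cmap='rainbow')
ax2.set_title("Original")
ax2.scatter(data[0][:, 0], data[0][:, 1], c=data[1], cmap='rainbow')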

22.5.9. Practical tips

Scale features before clustering when units differ.
Run multiple initializations (n_init) to avoid poor local minima.
Use random_state for reproducible results in demos.
Interpret cluster labels carefully: the numbers are arbitrary.
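
The scaling, n_init, and random_state tips can be combined in a small pipeline. The data below is hypothetical and built for this sketch only: the two groups differ in a small-scale feature, while a large-scale feature carries nothing but noise. The helper `agreement` is likewise an illustration, not a library function.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 100
group = np.repeat([0, 1], n)                           # true group per sample
large_scale = rng.normal(50_000, 10_000, size=2 * n)   # noise, large units
small_scale = rng.normal(group.astype(float), 0.05)    # signal, small units
X = np.column_stack([large_scale, small_scale])


def agreement(labels, truth):
    """Fraction of matching labels, allowing the 0/1 numbering to be swapped."""
    match = np.mean(labels == truth)
    return max(match, 1 - match)


# Without scaling, distances are dominated by the large-scale feature.
raw = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# With scaling, both features contribute on a comparable footing.
scaled_model = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
scaled = scaled_model.fit_predict(X)

print("agreement without scaling:", agreement(raw, group))
print("agreement with scaling:   ", agreement(scaled, group))
```

Standardizing is not always the right choice; if the features already share meaningful units, rescaling can discard information.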

Note that the colors do not correspond between the two plots: k-means numbers its clusters arbitrarily, so the same group may be drawn in a different color in each panel.
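
Because the numbering is arbitrary, comparing k-means labels to the generating labels element by element can be misleading. One option, shown here as a sketch rather than as part of the original notebook, is scikit-learn's adjusted Rand index, which does not depend on how the clusters are numbered. The `n_init` and `random_state` values are assumptions for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, y = make_blobs(n_samples=200, n_features=2,
                  centers=4, cluster_std=1.8, random_state=101)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Raw agreement depends on how the clusters happen to be numbered.
print("raw label agreement:", np.mean(labels == y))

# The adjusted Rand index compares the groupings, not the numbers.
ari = adjusted_rand_score(y, labels)
print("adjusted Rand index:", ari)

# Renumbering the clusters leaves the index unchanged.
renumbered = (labels + 1) % 4
ari_renumbered = adjusted_rand_score(y, renumbered)
print("after renumbering:  ", ari_renumbered)
```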