
3.5 Clustering and the K-Means Algorithm

An example of unsupervised machine learning

Dept. of Electrical and Systems Engineering
University of Pennsylvania


Lecture notes

0.1 Reading

Material related to this page, as well as additional exercises, can be found in VMLS Chapter 4.

0.2 Learning Objectives

By the end of this page, you should know:

  • the clustering problem
  • the centroid, or mean, of a group of vectors
  • the K-means algorithm

1 The Clustering Problem

1.1 An Informal Example of Clustering (VMLS 4.1)

Suppose we have $N$ vectors $\vv{x_1}, \ldots, \vv{x_N} \in V$. The goal of clustering is to group the vectors into $k$ groups, or clusters, of vectors that are close to each other, as measured by the distance between pairs of them.

Normally, the number of groups $k$ is much smaller than the total number of vectors $N$. Typical values in practice for $k$ range from a handful (2-3) to hundreds, and $N$ ranges from hundreds to billions.

The figure below shows a simple example with $N = 300$ vectors in $\mathbb{R}^2$, shown as small circles. As the right panel shows, these vectors can be clustered into $k = 3$ groups in a way that “looks right” (we will quantify this idea soon).

Figure: a collection of $300$ vectors in $\mathbb{R}^2$ grouped into $3$ clusters by eye.

This example is a bit silly; for vectors in $\mathbb{R}^2$, clustering is easy: just make a scatter plot and use your eyes. In almost all applications, vectors live in $\mathbb{R}^n$ with $n$ much bigger than 2. Another silly aspect is how cleanly the points split into clusters; real data is messy, and often many points lie between clusters. Finally, in real examples, it is not always obvious how many clusters $k$ there are.

1.2 Applications of Clustering

Despite all of this, we’ll see that clustering can still be incredibly useful in practice. Before we dive into more details, let’s highlight a few common applications:

  • Topic discovery. Suppose $\vv{x_i}$ are word histograms associated with $N$ documents (a word histogram $\vv x$ has entries $x_i$ which count the number of times word $i$ appears in a document; a small sketch of building one appears after this list). Clustering will partition the $N$ documents into $k$ groups, which can be interpreted as groups of documents with the same or similar topics, genre, or author. This is sometimes called automatic topic discovery.

  • Customer market segmentation. Suppose the vector $\vv{x_i} \in \mathbb{R}^n$ gives the dollar values of $n$ items purchased by customer $i$ in the past year. A clustering algorithm groups the customers into $k$ market segments, which are groups of customers with similar purchasing patterns.

Other examples include patient, zip code, student, and survey response clustering, as well as identifying weather zones, daily energy use patterns, and financial sectors. See pp. 70-71 of VMLS for more details.
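To make the word-histogram representation from the first bullet concrete, here is a minimal sketch; the vocabulary and document below are made up purely for illustration.

```python
from collections import Counter

# A hypothetical, fixed vocabulary shared by every document in the corpus.
vocabulary = ["cluster", "vector", "market", "topic", "data"]

def word_histogram(document, vocab):
    """Return the vector x whose entry x_i counts how often vocab[i] appears in the document."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocab]

doc = "cluster the data each vector is a data point and each cluster has a topic"
print(word_histogram(doc, vocabulary))  # [2, 1, 0, 1, 2]
```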

1.3 A Clustering Objective (VMLS 4.2)

Our goal now is to formalize the ideas described above, and introduce a quantitative measure of how “good” a clustering is.

Specifying Cluster Assignments

We specify a clustering of vectors by assigning each vector to a group. We label the groups $1, \ldots, k$ and assign each of the $N$ vectors $\vv{x_1}, \ldots, \vv{x_N}$ to a group via the vector $\vv c \in \mathbb{R}^N$, with $c_i$ being the group number that $\vv{x_i}$ has been assigned to.

For example, if $N = 5$ and $k = 3$, then

\begin{align*} \vv c = \begin{bmatrix} 3 \\ 1 \\ 1 \\ 1 \\ 2 \end{bmatrix} \end{align*}

assigns $\vv{x_1}$ to group 3; $\vv{x_2}, \vv{x_3}, \vv{x_4}$ to group 1; and $\vv{x_5}$ to group 2.

We will also describe clusters by the sets of indices for each group, with $G_j$ being the set of indices associated with group $j$. For our simple example, we have

\begin{align*} G_1 = \{ 2, 3, 4 \}, \quad G_2 = \{5\}, \quad G_3 = \{1\} \end{align*}

In general, we have that $G_j = \{ i \mid c_i = j \}$.
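The correspondence between the assignment vector $\vv c$ and the index sets $G_j$ is easy to compute. Here is a small Python sketch for the example above (using 1-based vector indices to match the text):

```python
import numpy as np

# Group assignment vector for the N = 5, k = 3 example above.
c = np.array([3, 1, 1, 1, 2])
k = 3

# Recover the index sets G_j = {i : c_i = j}.
groups = {j: [i + 1 for i in range(len(c)) if c[i] == j] for j in range(1, k + 1)}
print(groups)  # {1: [2, 3, 4], 2: [5], 3: [1]}
```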

Group Representatives

Each group is assigned a group representative; together these are $\vv{z_1}, \ldots, \vv{z_k} \in V$. Note that these representatives can be any vector, and need not be one of the given vectors $\vv{x_1}, \ldots, \vv{x_N}$. A good clustering will have each representative close to the vectors in its associated group, i.e.,

\begin{align*} \text{dist}(\vv{x_i}, \vv{z_{c_i}}) = \| \vv{x_i} - \vv{z_{c_i}} \| \end{align*}

is small for all $i = 1, \ldots, N$. Note that according to our notation, $\vv{x_i}$ is in group $c_i$, so $\vv{z_{c_i}}$ is the group representative against which $\vv{x_i}$ should be measured.

A Clustering Objective

We now define a clustering objective that assigns a score, or cost, to a choice of clustering and representatives:

\begin{align*} J_{\text{clust}} = \frac{1}{N} \sum_{i=1}^{N} \| \vv{x_i} - \vv{z_{c_i}} \|^2 \end{align*}

This is the mean squared distance from each vector to its assigned representative: $J_{\text{clust}}$ is small exactly when every vector $\vv{x_i}$ lies close to the representative $\vv{z_{c_i}}$ of its group.
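As a concrete illustration, the short NumPy sketch below evaluates $J_{\text{clust}}$ for a given choice of assignments and representatives; the data, assignments, and representatives are made up purely for illustration, and we use 0-based group labels, as is natural in code.

```python
import numpy as np

def clustering_objective(X, c, Z):
    """J_clust = (1/N) * sum_i ||x_i - z_{c_i}||^2, with 0-based group labels in c."""
    diffs = X - Z[c]               # row i is x_i - z_{c_i}
    return np.sum(diffs ** 2) / X.shape[0]

# Tiny made-up example: N = 5 vectors in R^2, k = 2 groups.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
c = np.array([0, 0, 0, 1, 1])              # group assignments
Z = np.array([[0.5, 0.5], [5.5, 5.0]])     # one representative per group
print(clustering_objective(X, c, Z))       # 0.4
```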

A clustering is said to be optimal if the choice of group assignments $c_1, \ldots, c_N$ and group representatives $\vv{z_1}, \ldots, \vv{z_k}$ leads to the smallest achievable clustering objective $J_{\text{clust}}$; in this case, these choices are said to minimize the objective $J_{\text{clust}}$. Unfortunately, except for very small problems, it is computationally prohibitive to find an optimal clustering.

Fortunately, the k-means algorithm we will introduce next can be run efficiently on very large problems, and often finds very good clusterings that achieve objective values $J_{\text{clust}}$ near the smallest possible value. Because k-means finds suboptimal solutions, we call it a heuristic. Heuristics are often looked down on in more theory-oriented circles because they cannot guarantee the quality of their solutions, but as we will see, they often work incredibly well in practice!

2 The K-Means Algorithm

The idea behind k-means is to break the overall hard problem of simultaneously choosing the best representatives and the best assignments into two subproblems, each of which we can solve easily. While we can’t yet solve the problem of jointly choosing group assignments and group representatives to minimize $J_{\text{clust}}$, we know how to solve for one component when the other is fixed.

2.1 Partitioning Vectors with Representatives Fixed

Pretend for a moment that we have already found group representatives $\vv{z_1}, \ldots, \vv{z_k}$, and our task is to pick the group assignments $c_1, \ldots, c_N$ which lead to the smallest possible $J_{\text{clust}}$. This problem is actually very easy! We will use the idea of nearest neighbors we saw in the previous section.

Notice that the objective $J_{\text{clust}}$ is a sum of $N$ terms, with one term for each vector $\vv{x_i}$. Further, the choice of $c_i$ (i.e., the group to which we assign $\vv{x_i}$) only affects the term

\begin{align*} \frac{1}{N} \| \vv{x_i} - \vv{z_{c_i}} \|^2 \end{align*}

in $J_{\text{clust}}$. This means we should choose $c_i$ to make this term smallest, since it doesn’t affect any other terms in our clustering objective. To minimize $\| \vv{x_i} - \vv{z_{c_i}} \|$, we pick $c_i$ so that

\begin{align*} \| \vv{x_i} - \vv{z_{c_i}} \| \leq \| \vv{x_i} - \vv{z_j} \| \quad \text{for $j = 1, \ldots, k$.} \end{align*}

This should look familiar! Modulo our new notation, we should assign $\vv{x_i}$ to its nearest neighbor among the representatives.
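In code, this step amounts to computing the distance from each $\vv{x_i}$ to every representative and taking the closest. A minimal NumPy sketch (the function and variable names are our own):

```python
import numpy as np

def assign_to_nearest(X, Z):
    """For each row x_i of X, return the index of the nearest representative (row of Z)."""
    # dists[i, j] = ||x_i - z_j||, computed for all pairs at once via broadcasting.
    dists = np.linalg.norm(X[:, None, :] - Z[None, :, :], axis=2)
    return np.argmin(dists, axis=1)
```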

2.2 Optimizing the Group Representatives with Assignments Fixed

Now we flip things around, and assume each vector $\vv{x_i}$ has been assigned to a group $c_i$. How should we pick group representatives $\vv{z_1}, \ldots, \vv{z_k}$ to minimize $J_{\text{clust}}$? We start by rearranging our objective into $k$ sums, one for each group:

\begin{align*} J_{\text{clust}} = J_1 + J_2 + \cdots + J_k \end{align*}

where $J_j := \frac{1}{N} \sum_{i \in G_j} \| \vv{x_i} - \vv{z_j} \|^2$ is the contribution to $J_{\text{clust}}$ from the vectors in group $j$. The sum notation here means we should include the term $\| \vv{x_i} - \vv{z_j} \|^2$ in our sum if $i \in G_j$ (i.e., if $\vv{x_i}$ has been assigned to group $j$).
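A quick numerical check of this decomposition, on made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))              # N = 6 made-up vectors in R^2
c = np.array([0, 0, 1, 1, 2, 2])         # assignments into k = 3 groups (0-based labels)
Z = rng.normal(size=(3, 2))              # arbitrary representatives

N = X.shape[0]
J_clust = np.sum((X - Z[c]) ** 2) / N
# Per-group contributions J_j = (1/N) * sum_{i in G_j} ||x_i - z_j||^2.
J_per_group = [np.sum((X[c == j] - Z[j]) ** 2) / N for j in range(3)]
print(np.isclose(J_clust, sum(J_per_group)))   # True: J_clust = J_1 + J_2 + J_3
```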

The choice of representative $\vv{z_j}$ only affects the term $J_j$, so we can choose $\vv{z_j}$ to minimize $J_j$. You can check, e.g., using vector calculus, that the best choice is to pick $\vv{z_j}$ to be the average (or centroid) of the vectors in group $j$:

\begin{align*} \vv{z_j} = \frac{1}{|G_j|} \sum_{i \in G_j} \vv{x_i} \end{align*}

Here $|G_j|$ is the cardinality of the set $G_j$, i.e., the number of elements in $G_j$, which is the size of group $j$.
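In code, the update step is just a per-group mean. A minimal sketch (assuming no group is empty):

```python
import numpy as np

def update_representatives(X, c, k):
    """Set each z_j to the centroid of the vectors currently assigned to group j (0-based labels)."""
    return np.array([X[c == j].mean(axis=0) for j in range(k)])
```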

2.3 Pseudocode for the k-means Algorithm

While we can’t solve the problem of jointly choosing group assignments and group representatives to minimize $J_{\text{clust}}$, we know how to solve for one component when the other is fixed.

The k-means algorithm produces an approximate solution to the clustering problem by iterating between these two subroutines. A key feature of this approach is that $J_{\text{clust}}$ decreases or stays the same with each iteration, meaning the algorithm is guaranteed to converge to a (likely suboptimal) solution.

We next state the k-means algorithm. Given initial group representatives $\vv{z_1}, \ldots, \vv{z_k}$, repeat the following two steps until the group assignments stop changing:

  • Step 1 (partition the vectors): assign each $\vv{x_i}$ to the group whose representative $\vv{z_j}$ is closest to it.
  • Step 2 (update the representatives): set each $\vv{z_j}$ to the centroid of the vectors currently assigned to group $j$.
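A minimal self-contained Python sketch of this iteration (the random initialization and the exact stopping rule below are our own choices, not the only possibilities):

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Alternate between assigning vectors to the nearest representative and
    moving each representative to the centroid of its group."""
    rng = np.random.default_rng(seed)
    Z = X[rng.choice(X.shape[0], size=k, replace=False)]  # initialize with k random data vectors
    for _ in range(max_iter):
        # Step 1: partition the vectors with the representatives fixed.
        dists = np.linalg.norm(X[:, None, :] - Z[None, :, :], axis=2)
        c = np.argmin(dists, axis=1)
        # Step 2: update the representatives with the assignments fixed (assumes no empty groups).
        new_Z = np.array([X[c == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_Z, Z):   # representatives stopped moving, so J_clust has stopped decreasing
            break
        Z = new_Z
    # Final assignments consistent with the final representatives.
    c = np.argmin(np.linalg.norm(X[:, None, :] - Z[None, :, :], axis=2), axis=1)
    J = np.sum((X - Z[c]) ** 2) / X.shape[0]
    return c, Z, J
```

Different initializations may end at different (suboptimal) clusterings, which is why practical implementations, such as scikit-learn’s, typically run k-means several times from different starting points and keep the best result.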

One iteration of the k-means algorithm is illustrated below:


Left: vectors $\vv{x_1}, \ldots, \vv{x_N}$ are assigned to the nearest representative $\vv{z_j}$.

Right: the representatives are updated to the centroids of the new groups.

3 Python break: K-Means in scikit-learn!

The scikit-learn library in Python gives an implementation of the k-means algorithm. Documentation is available here.

In the following code, we generate a sample clustering problem with $N = 600$ and $k = 3$. We then run k-means, plot some intermediate clusterings, and plot the final clustering. We also plot the rate of convergence (k-means loss vs. number of iterations).

```python
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

np.random.seed(2030)

# Sample data
sample1 = np.random.multivariate_normal(
    mean=np.array([2, 0]),
    cov=np.array([[2, 0], [0, 2]]),
    size=300
)

sample2 = np.random.multivariate_normal(
    mean=np.array([-1, 4]),
    cov=np.array([[2, 0.5], [0.5, 1]]),
    size=180
)

sample3 = np.random.multivariate_normal(
    mean=np.array([-3, -2]),
    cov=np.array([[1, -0.2], [-0.2, 2]]),
    size=120
)

# Generates a dataset with data sampled from 3 normal distributions.
data = np.concatenate((sample1, sample2, sample3), axis=0)
initial_representatives = [[5, 0], [4.5, 0], [4, 0]]

# Define RGB colors for the clusters.
colors = np.array([
    [1, 0, 0], 
    [0, 1, 0], 
    [0, 0, 1]  
])

fig, axes = plt.subplots(5, 1, figsize=(20, 20))

# For demonstration purposes, we will purposely limit the number of k-means iterations
# and show you the outputs along the way!
num_iterations = [1, 2, 6]
for i in range(len(num_iterations)):
    kmeans = KMeans(n_clusters=3, random_state=0, init=initial_representatives, max_iter=num_iterations[i]).fit(data)

    ax = axes[i]
    ax.scatter(data[:, 0], data[:, 1], c=colors[kmeans.labels_], s=5)
    ax.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], marker='x', c='black', s=100)
    ax.set_title(f'K-Means with {num_iterations[i]} Iterations')
    ax.set_aspect('equal')
    ax.set_xticks([])
    ax.set_yticks([])

# Now, we allow the algorithm to run until convergence and plot the clusters.
kmeans = KMeans(n_clusters=3, random_state=0, init=initial_representatives).fit(data)

ax = axes[3]
ax.scatter(data[:, 0], data[:, 1], c=colors[kmeans.labels_], s=5)
ax.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], marker='x', c='black', s=100)
ax.set_title(f'Final Clustering')
ax.set_aspect('equal')
ax.set_xticks([])
ax.set_yticks([])

# We also run the code again for a bunch of different iteration numbers, and plot the 
# rate of convergence to the solution.
costs = []
total_iters = kmeans.n_iter_  # save now, since `kmeans` is re-fit inside the loop below
for i in range(total_iters):
    kmeans = KMeans(n_clusters=3, random_state=0, init=initial_representatives, max_iter=i+1).fit(data)
    costs.append(kmeans.inertia_ / data.shape[0])

ax = axes[4]
ax.plot(range(1, total_iters + 1), costs, marker='o', linestyle='-', color='b')
ax.set_title('Rate of Convergence of K-means')
ax.set_xlabel('Iteration')
ax.set_ylabel('Cost (Mean Squared Distance)')
ax.set_aspect('equal')
ax.set_xticks(range(1, total_iters + 1))
ax.set_yticks([])

plt.tight_layout()
plt.show()
```

4 Python Break: Automated Topic Discovery!

TODO
