12. Clustering and Unsupervised Learning#
In general terms, cluster analysis, or clustering, is the task of grouping a data set into distinct categories based on some measure of similarity between the data points. This measure is often referred to as a metric or similarity measure in the literature (note: sometimes we deal with a dissimilarity measure instead). Usually, these metrics are formulated as some kind of distance function between points in a high-dimensional space. The simplest, and also the most common, is the Euclidean distance.
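To make the metric concrete, here is a minimal numpy snippet (not part of the original notes; the point values are arbitrary) computing the Euclidean distance, and its square, between two points:

```python
import numpy as np

# two arbitrary points in a two-dimensional feature space
x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])

# Euclidean distance and its square (the quantity k-means minimizes)
euclidean = np.sqrt(np.sum((x - y)**2))   # 5.0
squared_euclidean = np.sum((x - y)**2)    # 25.0
print(euclidean, squared_euclidean)
```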
The simplest of all clustering algorithms, and also the most common, is the k-means algorithm, sometimes referred to as Lloyd's algorithm. From its simplicity it obtains both strengths and weaknesses; these will be discussed in more detail later. It is a centroid-based method: each cluster is represented by a centroid, the coordinate mean of the points assigned to it.
Assume we are given a number of clusters $K$ into which we wish to partition our data. In the basic k-means algorithm each point is assigned to one, and only one, cluster (a hard assignment). The algorithm proceeds as follows:
1. We start with guesses / random initializations of our cluster centers, or centroids.
2. For each centroid we identify the points that are most similar to it.
3. We then move / replace each centroid with the coordinate average of all the points that were assigned to that centroid.
4. Iterate steps 2-3 until the centroids no longer move (to some tolerance).
We assume we have $n$ data points $x_1, \dots, x_n$, each with $p$ features, which we wish to group into $K$ clusters, where $K$ is specified in advance.
We define the so-called within-cluster point scatter, which gives us a measure of how close the data points assigned to the same cluster tend to be to all the others,

$$
W(C) = \frac{1}{2}\sum_{k=1}^{K}\sum_{C(i)=k}\,\sum_{C(i')=k} d(x_i, x_{i'}),
$$

where $C(i) = k$ denotes that observation $i$ is assigned to cluster $k$ and $d(x_i, x_{i'})$ is the chosen (dis)similarity measure, here the squared Euclidean distance. We have, in addition, the total point scatter, which separates into the within-cluster and the between-cluster scatter, $T = W(C) + B(C)$. This is a quantity that is conserved throughout the minimization, since $T$ does not depend on the cluster assignment; minimizing $W(C)$ is therefore equivalent to maximizing $B(C)$. Given a cluster mean $m_k$, the total cluster variance reads

$$
\sum_{k=1}^{K} N_k \sum_{C(i)=k} \lVert x_i - m_k \rVert^2,
$$

where $N_k$ is the number of points assigned to cluster $k$.
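As a small illustration, a helper along the following lines (our own sketch; the function name `within_cluster_scatter` is not used elsewhere in these notes) evaluates the total cluster variance defined above for a given assignment:

```python
import numpy as np

def within_cluster_scatter(data, labels, centroids):
    """
    Sum over clusters of N_k * sum_{C(i)=k} ||x_i - m_k||^2,
    i.e. the total cluster variance written above.
    """
    W = 0.0
    for k, m_k in enumerate(centroids):
        members = data[labels == k]          # points assigned to cluster k
        W += len(members) * np.sum((members - m_k)**2)
    return W
```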
Now we have all the pieces necessary to formally revisit the k-means algorithm. It can be formulated as an iterative descent procedure:

1. For a given cluster assignment $C$ and cluster means $m_1, \dots, m_K$, we minimize the total cluster variance with respect to the cluster means, yielding the means of the currently assigned clusters, $m_k = \bar{x}_k$.
2. Given a current set of $K$ means, the total cluster variance is minimized by assigning each observation to the closest (current) cluster mean. That is, $C(i) = \mathrm{arg\,min}_{1 \le k \le K} \lVert x_i - m_k \rVert^2$.
3. Steps 1 and 2 are repeated until the assignments do not change.
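A compact, vectorized sketch of these two steps might look as follows, assuming `data` is an `(n, p)` numpy array and `centroids` a `(K, p)` array; this is only an illustration, not the loop-based implementation developed below:

```python
import numpy as np

def assign_to_closest(data, centroids):
    # step 2: squared Euclidean distance from every point to every centroid,
    # followed by an argmin over the centroid axis
    distances = np.sum((data[:, None, :] - centroids[None, :, :])**2, axis=-1)
    return np.argmin(distances, axis=1)

def update_means(data, labels, n_clusters):
    # step 1: each centroid becomes the mean of its assigned points
    # (assumes no cluster ends up empty)
    return np.array([data[labels == k].mean(axis=0) for k in range(n_clusters)])
```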
12.1. Codes and Approaches#
1. Before we start we specify a number $K$, which is the number of clusters we want to try to separate our data into.
2. We initially choose $K$ random data points in our data as our initial centroids, or means (this is where the name comes from).
3. Assign each data point to its closest centroid, based on the squared Euclidean distance.
4. For each of the $K$ clusters we update the centroid by calculating new mean values for all the data points in the cluster.
5. Iteratively minimize the within-cluster scatter by performing steps (3, 4) until the new assignments stop changing (possibly to some tolerance) or until a maximum number of iterations has passed.
Let us now program the most basic version of the algorithm using nothing but Python with numpy arrays. This code is kept intentionally simple so that we can build up our understanding gradually. There is no vectorization of any kind, and we make little use of helper functions.

We first need a data set to do our cluster analysis on. In our case this is a plain vanilla data set of random numbers drawn from Gaussian distributions.
%matplotlib inline
import time
import numpy as np
import tensorflow as tf
from matplotlib import image
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from IPython.display import display
np.random.seed(2021)
Next we define functions, for ease of use later, to generate Gaussians and to set up our toy data set.
def gaussian_points(dim=2, n_points=1000, mean_vector=np.array([0, 0]),
sample_variance=1):
"""
Very simple custom function to generate gaussian distributed point clusters
with variable dimension, number of points, means in each direction
(must match dim) and sample variance.
Inputs:
dim (int)
n_points (int)
mean_vector (np.array) (where index 0 is x, index 1 is y etc.)
sample_variance (float)
Returns:
data (np.array): with dimensions (n_points x dim)
"""
mean_matrix = np.zeros(dim) + mean_vector
covariance_matrix = np.eye(dim) * sample_variance
data = np.random.multivariate_normal(mean_matrix, covariance_matrix,
n_points)
return data
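As a quick, hypothetical check (not part of the original notes) of what `gaussian_points` returns, we can draw a handful of points and inspect their shape and sample mean:

```python
example = gaussian_points(dim=2, n_points=5, mean_vector=np.array([5, 5]))
print(example.shape)         # (5, 2): one row per point, one column per feature
print(example.mean(axis=0))  # close to [5, 5]
```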
def generate_simple_clustering_dataset(dim=2, n_points=1000, plotting=True,
return_data=True):
"""
Toy model to illustrate k-means clustering
"""
data1 = gaussian_points(mean_vector=np.array([5, 5]))
data2 = gaussian_points()
data3 = gaussian_points(mean_vector=np.array([1, 4.5]))
data4 = gaussian_points(mean_vector=np.array([5, 1]))
data = np.concatenate((data1, data2, data3, data4), axis=0)
if plotting:
fig, ax = plt.subplots()
ax.scatter(data[:, 0], data[:, 1], alpha=0.2)
ax.set_title('Toy Model Dataset')
plt.show()
if return_data:
return data
data = generate_simple_clustering_dataset()

With the above data set we can start implementing the k-means algorithm.
n_samples, dimensions = data.shape
n_clusters = 4
# we randomly initialize our centroids
np.random.seed(2021)
centroids = data[np.random.choice(n_samples, n_clusters, replace=False), :]
distances = np.zeros((n_samples, n_clusters))
# first we need to calculate the distance to each centroid from our data
for k in range(n_clusters):
for n in range(n_samples):
dist = 0
for d in range(dimensions):
dist += np.abs(data[n, d] - centroids[k, d])**2
distances[n, k] = dist
# we initialize an array to keep track of to which cluster each point belongs
# the way we set it up here the index tracks which point and the value which
# cluster the point belongs to
cluster_labels = np.zeros(n_samples, dtype='int')
# next we loop through our samples and for every point assign it to the cluster
# to which it has the smallest distance
for n in range(n_samples):
# tracking variables (all of this is basically just an argmin)
smallest = 1e10
smallest_row_index = 1e10
for k in range(n_clusters):
if distances[n, k] < smallest:
smallest = distances[n, k]
smallest_row_index = k
cluster_labels[n] = smallest_row_index
fig = plt.figure()
ax = fig.add_subplot()
unique_cluster_labels = np.unique(cluster_labels)
for i in unique_cluster_labels:
ax.scatter(data[cluster_labels == i, 0],
data[cluster_labels == i, 1],
label = i,
alpha = 0.2)
ax.scatter(centroids[:, 0], centroids[:, 1], c='black')
ax.set_title("First Grouping of Points to Centroids")
plt.show()

So what do we have so far? We have ‘picked’ $K$ random data points from our data set as initial centroids and assigned every point to its closest centroid. What remains is to iterate: we recompute each centroid as the mean of the points assigned to it, redo the assignment, and repeat until the centroids no longer move.
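As a quick sanity check (our own addition, using numpy's `bincount`), we can count how many points were assigned to each centroid in this first pass:

```python
# number of points assigned to each of the K centroids after the first assignment
print(np.bincount(cluster_labels, minlength=n_clusters))
```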
max_iterations = 100
tolerance = 1e-8
for iteration in range(max_iterations):
prev_centroids = centroids.copy()
for k in range(n_clusters):
# this array will be used to update our centroid positions
vector_mean = np.zeros(dimensions)
mean_divisor = 0
for n in range(n_samples):
if cluster_labels[n] == k:
vector_mean += data[n, :]
mean_divisor += 1
# update according to the k means
centroids[k, :] = vector_mean / mean_divisor
# we find the dissimilarity
for k in range(n_clusters):
for n in range(n_samples):
dist = 0
for d in range(dimensions):
dist += np.abs(data[n, d] - centroids[k, d])**2
distances[n, k] = dist
# assign each point
for n in range(n_samples):
smallest = 1e10
smallest_row_index = 1e10
for k in range(n_clusters):
if distances[n, k] < smallest:
smallest = distances[n, k]
smallest_row_index = k
cluster_labels[n] = smallest_row_index
# convergence criteria
centroid_difference = np.sum(np.abs(centroids - prev_centroids))
if centroid_difference < tolerance:
print(f'Converged at iteration {iteration}')
break
    elif iteration == max_iterations - 1:
        print(f'Did not converge in {max_iterations} iterations')
Converged at iteration 5
We now have a simple, un-optimized k-means implementation. Let us plot the final result, and then collect the code above into a single function that we can reuse and time.
fig = plt.figure()
ax = fig.add_subplot()
unique_cluster_labels = np.unique(cluster_labels)
for i in unique_cluster_labels:
ax.scatter(data[cluster_labels == i, 0],
data[cluster_labels == i, 1],
label = i,
alpha = 0.2)
ax.scatter(centroids[:, 0], centroids[:, 1], c='black')
ax.set_title("Final Result of K-means Clustering")
plt.show()

def naive_kmeans(data, n_clusters=4, max_iterations=100, tolerance=1e-8):
    """
    Naive, loop-based k-means clustering of data (n_samples x dimensions).
    Returns the final cluster labels and centroids.
    """
    start_time = time.time()
n_samples, dimensions = data.shape
#np.random.seed(2021)
centroids = data[np.random.choice(n_samples, n_clusters, replace=False), :]
distances = np.zeros((n_samples, n_clusters))
for k in range(n_clusters):
for n in range(n_samples):
dist = 0
for d in range(dimensions):
dist += np.abs(data[n, d] - centroids[k, d])**2
distances[n, k] = dist
cluster_labels = np.zeros(n_samples, dtype='int')
for n in range(n_samples):
smallest = 1e10
smallest_row_index = 1e10
for k in range(n_clusters):
if distances[n, k] < smallest:
smallest = distances[n, k]
smallest_row_index = k
cluster_labels[n] = smallest_row_index
for iteration in range(max_iterations):
prev_centroids = centroids.copy()
for k in range(n_clusters):
vector_mean = np.zeros(dimensions)
mean_divisor = 0
for n in range(n_samples):
if cluster_labels[n] == k:
vector_mean += data[n, :]
mean_divisor += 1
centroids[k, :] = vector_mean / mean_divisor
for k in range(n_clusters):
for n in range(n_samples):
dist = 0
for d in range(dimensions):
dist += np.abs(data[n, d] - centroids[k, d])**2
distances[n, k] = dist
for n in range(n_samples):
smallest = 1e10
smallest_row_index = 1e10
for k in range(n_clusters):
if distances[n, k] < smallest:
smallest = distances[n, k]
smallest_row_index = k
cluster_labels[n] = smallest_row_index
centroid_difference = np.sum(np.abs(centroids - prev_centroids))
if centroid_difference < tolerance:
print(f'Converged at iteration {iteration}')
print(f'Runtime: {time.time() - start_time} seconds')
return cluster_labels, centroids
print(f'Did not converge in {max_iterations} iterations')
print(f'Runtime: {time.time() - start_time} seconds')
return cluster_labels, centroids
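To use the function we simply pass it our data set. As a rough check we can also run scikit-learn's `KMeans` (imported above) on the same data; the cluster labels may come out in a different order, and the runtimes will vary from machine to machine:

```python
labels, centroids = naive_kmeans(data, n_clusters=4)

start_time = time.time()
kmeans = KMeans(n_clusters=4, n_init=10).fit(data)
print(f'scikit-learn runtime: {time.time() - start_time} seconds')

# the two sets of centroids should agree closely, up to a permutation of the labels
print(centroids)
print(kmeans.cluster_centers_)
```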