What is cluster analysis?

Cluster analysis is a family of unsupervised learning methods used to group observations into clusters.

The goal is to obtain groups such that:

  • observations in the same cluster are as similar as possible;
  • observations in different clusters are as different as possible.

Note

In clustering, the groups are not known in advance.
They are constructed from the observed data.

The key question

Before clustering, we need to decide:

Important

What does it mean for two observations to be similar or different?

Most clustering methods depend, explicitly or implicitly, on a notion of:

  • distance;
  • dissimilarity;
  • similarity;
  • affinity.

From data to pairwise dissimilarities

Suppose we observe three units described by four variables.

Code
toy_data <- tibble(
  X1 = c(5, 3, 7),     
  X2 = c(1, 4, 2),     
  X3 = factor(c("purple", "blue", "gold")), 
  X4 = factor(c("A", "B", "C")) 
)

The clustering algorithm does not only look at the rows separately.
It compares the observations pairwise:

\[ d(A,B), \quad d(A,C), \quad d(B,C). \]

These pairwise dissimilarities are the basic input for many clustering methods.

building pairwise dissimilarities: intuition

2 continuous variables: add up by-variable (absolute value or squared) differences

Code
toy_data |> 
  ggplot(aes(x=X1,y=X2,label=X4)) +
  geom_point(size=8) + geom_text(color="white") +
  geom_point(inherit.aes = FALSE,aes(x=X1,y=0),color="#2A9D8F",size=4,alpha=.001)+
  geom_point(inherit.aes = FALSE,aes(x=0,y=X2),color="forestgreen",size=4,alpha=.001)+theme_void()

building pairwise dissimilarities: intuition

2 continuous variables: add up by-variable (absolute value or squared) differences

Code
toy_data |> 
  ggplot(aes(x=X1,y=X2,label=X4)) +
  geom_point(size=8) + geom_text(color="white") +
  geom_point(inherit.aes = FALSE,aes(x=X1,y=0),color="#2A9D8F",size=4)+
  geom_point(inherit.aes = FALSE,aes(x=0,y=X2),color="forestgreen",size=4)+theme_void()

building pairwise dissimilarities: intuition

2 continuous variables: add up by-variable (absolute value or squared) differences

Code
toy_data |> 
  ggplot(aes(x=X1,y=X2,label=X4)) +
  geom_point(size=8) + geom_text(color="white") +
  # geom_point(inherit.aes = FALSE,aes(x=X1,y=0),color="#2A9D8F",size=4)+
  # geom_point(inherit.aes = FALSE,aes(x=0,y=X2),color="forestgreen",size=4)+
  geom_line(data=toy_data |> dplyr::filter(X4 %in%c("A","B")), 
               inherit.aes = FALSE,aes(x=X1,y=0),color="#2A9D8F",size=2)+
  geom_line(data=toy_data |> dplyr::filter(X4 %in%c("A","B")), 
               inherit.aes = FALSE,aes(y=X2,x=0),color="forestgreen",size=2)+
  annotate("label",x=2, y=2,label="d(A,B)",size=10)+theme_void()

building pairwise dissimilarities: intuition

2 continuous variables: add up by-variable (absolute value or squared) differences

Code
toy_data |> 
  ggplot(aes(x=X1,y=X2,label=X4)) +
  geom_point(size=8) + geom_text(color="white") +
  # geom_point(inherit.aes = FALSE,aes(x=X1,y=0),color="#2A9D8F",size=4)+
  # geom_point(inherit.aes = FALSE,aes(x=0,y=X2),color="forestgreen",size=4)+
  geom_line(data=toy_data |> dplyr::filter(X4 %in%c("A","C")), 
               inherit.aes = FALSE,aes(x=X1,y=0),color="#2A9D8F",size=2)+
  geom_line(data=toy_data |> dplyr::filter(X4 %in%c("A","C")), 
               inherit.aes = FALSE,aes(y=X2,x=0),color="forestgreen",size=2)+
  annotate("label",x=2, y=2,label="d(A,C)",size=10)+theme_void()

building pairwise dissimilarities: intuition

2 continuous variables: add up by-variable (absolute value or squared) differences

Code
toy_data |> 
  ggplot(aes(x=X1,y=X2,label=X4)) +
  geom_point(size=8) + geom_text(color="white") +
  # geom_point(inherit.aes = FALSE,aes(x=X1,y=0),color="#2A9D8F",size=4)+
  # geom_point(inherit.aes = FALSE,aes(x=0,y=X2),color="forestgreen",size=4)+
  geom_line(data=toy_data |> dplyr::filter(X4 %in%c("B","C")), 
               inherit.aes = FALSE,aes(x=X1,y=0),color="#2A9D8F",size=2)+
  geom_line(data=toy_data |> dplyr::filter(X4 %in%c("B","C")), 
               inherit.aes = FALSE,aes(y=X2,x=0),color="forestgreen",size=2)+
  annotate("label",x=2, y=2,label="d(B,C)",size=10)+theme_void()

building pairwise dissimilarities: intuition

2 continuous and 1 categorical variables

Code
toy_data |> 
  ggplot(aes(x=X1,y=X2,label=X4,color=as.character(X3))) +
  geom_point(size=8) + geom_text(color="white") +
  geom_point(inherit.aes = FALSE,aes(x=X1,y=0),color="#2A9D8F",size=4,alpha=.001)+
  geom_point(inherit.aes = FALSE,aes(x=0,y=X2),color="forestgreen",size=4,alpha=.001)+
  scale_color_identity()+theme_void()+
  theme(legend.position="none")

building pairwise dissimilarities: intuition

one might consider purple and blue closer than e.g. purple and yellow

Code
toy_data |> 
  ggplot(aes(x=X1,y=X2,label=X4,color=as.character(X3))) +
  geom_point(size=8) + geom_text(color="white") +
  geom_point(inherit.aes = FALSE,aes(x=X1,y=0),color="#2A9D8F",size=4,alpha=.001)+
  geom_point(inherit.aes = FALSE,aes(x=0,y=X2),color="forestgreen",size=4,alpha=.001)+
  scale_color_identity()+theme_void()+
  theme(legend.position="none")

Dissimilarity or distance?

A dissimilarity is a numerical index measuring how different two observations are.

A distance is a dissimilarity that satisfies additional mathematical properties.

Let \(d\(x\,y\)\) be the dissimilarity between two observations \(x\) and \(y\).

A distance usually satisfies:

\[ d(x,y) \geq 0 \]

\[ d(x,y) = 0 \iff x = y \]

\[ d(x,y) = d(y,x) \]

\[ d(x,z) \leq d(x,y) + d(y,z) \]

Properties of a distance

The main properties are:

Property Meaning
Non-negativity distances cannot be negative
Identity of indiscernibles distance is zero only for identical objects
Symmetry the distance from \(x\) to \(y\) equals the distance from \(y\) to \(x\)
Triangle inequality going through a third point cannot make the path shorter

Note

Every distance is a dissimilarity, but not every dissimilarity is a distance.

Dissimilarities depend on the type of data

The way we compute dissimilarities depends on the nature of the variables.

Type of data Common dissimilarity
Continuous variables Euclidean or Manhattan distance
Categorical variables Matching dissimilarity
Mixed variables Gower dissimilarity

Continuous variables: Euclidean and Manhattan distances

For continuous variables, dissimilarities are based on numerical differences.

Given two observations \(A\) and \(B\), described by two continuous variables:

\[ A = (x_{A1}, x_{A2}), \qquad B = (x_{B1}, x_{B2}). \]

Two common choices are:

\[ d_{\text{Manhattan}}(A,B) = |x_{A1}-x_{B1}| + |x_{A2}-x_{B2}| \]

and

\[ d_{\text{Euclidean}}(A,B) = \sqrt{(x_{A1}-x_{B1})^2 + (x_{A2}-x_{B2})^2}. \]

Continuous variables: Euclidean and Manhattan distances

For continuous variables, dissimilarities are based on numerical differences.

Given two observations \(A\) and \(B\), described by two continuous variables:

\[ A = (x_{A1}, x_{A2}), \qquad B = (x_{B1}, x_{B2}). \]

Two common choices are:

\[ d_{\text{Manhattan}}(A,B) = |x_{A1}-x_{B1}| + |x_{A2}-x_{B2}| \]

and

\[ d_{\text{Euclidean}}(A,B) = \sqrt{(x_{A1}-x_{B1})^2 + (x_{A2}-x_{B2})^2}. \]

Categorical variables: matching dissimilarity

For categorical variables, a simple choice is the matching dissimilarity.

For one categorical variable:

\[ d(x_i,x_j) = \begin{cases} 0 & \text{if } x_i = x_j \\ 1 & \text{if } x_i \neq x_j \end{cases} \]

For several categorical variables, we count the proportion of variables on which two observations differ.

Mixed variables: Gower dissimilarity

When variables have different types, Gower dissimilarity compares observations variable by variable and then averages the results.

For two observations \(i\) and \(j\):

\[ d_G(i,j) = \frac{ \sum_{\ell=1}^{p} w_{ij\ell} \, d_{ij\ell} }{ \sum_{\ell=1}^{p} w_{ij\ell} }. \]

Here:

  • \(d\_\{ij\\ell\}\) is the dissimilarity between observations \(i\) and \(j\) on variable \(\\ell\);
  • \(w\_\{ij\\ell\}\) is a weight, often equal to 1 if the comparison is available and 0 if it is missing;
  • each variable-specific dissimilarity is scaled between 0 and 1.

Note

Gower dissimilarity combines different types of variables by putting their contributions on a common scale.

Gower dissimilarity: variable-specific components

For a continuous variable \(X\_\\ell\), Gower usually uses a range-scaled absolute difference:

\[ d_{ij\ell} = \frac{|x_{i\ell} - x_{j\ell}|}{\max(X_\ell) - \min(X_\ell)}. \]

For a categorical variable \(X\_\ell\), Gower uses matching dissimilarity:

\[ d_{ij\ell} = \begin{cases} 0 & \text{if } x_{i\ell} = x_{j\ell}, \\ 1 & \text{if } x_{i\ell} \neq x_{j\ell}. \end{cases} \]

The final dissimilarity is the average of these variable-wise contributions.

From pairwise dissimilarities to a distance matrix

Once we have defined how to compare two observations, we can compute all pairwise dissimilarities.

Code
toy_data |> 
  select(X1, X2) |> 
  dist(method = "euclidean") |> 
  as.matrix() |> 
  round(2)
     1    2    3
1 0.00 3.61 2.24
2 3.61 0.00 4.47
3 2.24 4.47 0.00

The result is a distance matrix:

\[ D = \begin{pmatrix} 0 & d(A,B) & d(A,C) \\ d(B,A) & 0 & d(B,C) \\ d(C,A) & d(C,B) & 0 \end{pmatrix}. \]

Many clustering algorithms start from this matrix, directly or indirectly.

Different clustering methods use distances differently

The same data can be clustered in different ways.

Some methods use the distance matrix directly:

  • hierarchical clustering;
  • k-medoids;
  • DBSCAN.

Other methods use distances implicitly:

  • k-means;
  • spectral clustering.

Note

Changing the distance, scaling, or representation can change the clustering solution.

Hierarchical clustering

Hierarchical clustering builds a nested sequence of clusters.

It starts from a dissimilarity matrix and produces a hierarchy that can be represented as a dendrogram.

Note

The dendrogram shows how observations are progressively grouped together as the allowed dissimilarity increases.

Agglomerative hierarchical clustering

The most common approach is agglomerative clustering.

The algorithm works bottom-up:

  1. start with each observation as a separate cluster;
  2. find the two closest clusters;
  3. merge them;
  4. update the distances between clusters;
  5. repeat until all observations belong to a single cluster.

The result is a hierarchy of increasingly larger groups.

The key choice: linkage

At the beginning, distances are computed between individual observations.

After the first merge, we need to compute distances between:

  • one observation and one cluster;
  • two clusters.

This is done through a linkage criterion.

Important

Different linkage criteria can produce different dendrograms, even with the same dissimilarity matrix.

Linkage criteria: formulas

Let \(d\(i\,j\)\) be the dissimilarity between observations \(i\) and \(j\).

For two clusters \(A\) and \(B\):

\[ d_{\text{single}}(A,B) = \min_{i \in A, j \in B} d(i,j) \]

\[ d_{\text{complete}}(A,B) = \max_{i \in A, j \in B} d(i,j) \]

\[ d_{\text{average}}(A,B) = \frac{1}{|A||B|} \sum_{i \in A} \sum_{j \in B} d(i,j) \]

Toy data for hierarchical clustering

We create a small dataset with 10 observations arranged around 4 visible groups.

Code
toy_hc <- tibble(
  id = LETTERS[1:16],
  x = c(
    0.3, 2.2, 1.1, 2.8,      # A-D: lower-left group
    6.2, 8.4, 7.1, 9.0,      # E-H: central group
    12.2, 14.8, 13.1, 15.2,  # I-L: lower-right group
    3.2, 5.7, 4.4, 6.3       # M-P: upper group
  ),
  y = c(
    0.4, 1.9, 2.8, 0.8,      # A-D
    5.8, 7.9, 8.8, 6.4,      # E-H
    0.6, 1.4, 3.0, 2.5,      # I-L
    11.4, 13.2, 14.1, 11.8   # M-P
  )
)

toy_mat <- toy_hc |> 
  select(x, y) |> 
  as.matrix()

rownames(toy_mat) <- toy_hc$id

hc_toy <- hclust(dist(toy_mat), method = "single")
hc_toy_cmp <- hclust(dist(toy_mat), method = "complete")
hc_toy_avg <- hclust(dist(toy_mat), method = "average")

Agglomerative clustering with single linkage

Single linkage defines the distance between two clusters as the distance between their closest pair of points.

For two clusters \(A\) and \(B\):

\[ d_{\text{single}}(A,B) = \min_{i \in A, j \in B} d(i,j). \]

At each step, the algorithm merges the two clusters with the smallest single-linkage distance.

Single-linkage clustering: step 1

Code
plot_hc_step(1,hc = hc_toy_cmp, data = toy_hc)

Single-linkage clustering: step 2

Code
plot_hc_step(2)

Single-linkage clustering: step 3

Code
plot_hc_step(3)

Single-linkage clustering: step 4

Code
plot_hc_step(4)

Single-linkage clustering: step 6

Code
plot_hc_step(6)

Single-linkage clustering: step 8

Code
plot_hc_step(8)

Single-linkage clustering: step 10

Code
plot_hc_step(10)

Single-linkage clustering: step 12

Code
plot_hc_step(12)

Single-linkage clustering: step 15

Code
plot_hc_step(15)

Single-linkage clustering: step 1

Code
plot_hc_step(1)

Single-linkage clustering: step 2

Code
plot_hc_step(2)

Single-linkage clustering: step 3

Code
plot_hc_step(3)

Single-linkage clustering: step 4

Code
plot_hc_step(4)

Single-linkage clustering: step 6

Code
plot_hc_step(6)

Single-linkage clustering: step 8

Code
plot_hc_step(8)

Single-linkage clustering: step 10

Code
plot_hc_step(10)

Single-linkage clustering: step 12

Code
plot_hc_step(12)

Single-linkage clustering: step 15

Code
plot_hc_step(15)

Complete-linkage clustering: step 1

Code
plot_hc_step(1, hc = hc_toy_cmp)

Complete-linkage clustering: step 2

Code
plot_hc_step(2, hc = hc_toy_cmp)

Complete-linkage clustering: step 3

Code
plot_hc_step(3, hc = hc_toy_cmp)

Complete-linkage clustering: step 4

Code
plot_hc_step(4, hc = hc_toy_cmp)

Complete-linkage clustering: step 6

Code
plot_hc_step(6, hc = hc_toy_cmp)

Complete-linkage clustering: step 8

Code
plot_hc_step(8, hc = hc_toy_cmp)

Complete-linkage clustering: step 10

Code
plot_hc_step(10, hc = hc_toy_cmp)

Complete-linkage clustering: step 12

Code
plot_hc_step(12, hc = hc_toy_cmp)

Complete-linkage clustering: step 15

Code
plot_hc_step(15, hc = hc_toy_cmp)

Average-linkage clustering: step 1

Code
plot_hc_step(1, hc = hc_toy_avg)

Average-linkage clustering: step 2

Code
plot_hc_step(2, hc = hc_toy_avg)

Average-linkage clustering: step 3

Code
plot_hc_step(3, hc = hc_toy_avg)

Average-linkage clustering: step 4

Code
plot_hc_step(4, hc = hc_toy_avg)

Average-linkage clustering: step 6

Code
plot_hc_step(6, hc = hc_toy_avg)

Average-linkage clustering: step 8

Code
plot_hc_step(8, hc = hc_toy_avg)

Average-linkage clustering: step 10

Code
plot_hc_step(10, hc = hc_toy_avg)

Average-linkage clustering: step 12

Code
plot_hc_step(12, hc = hc_toy_avg)

Average-linkage clustering: step 15

Code
plot_hc_step(15, hc = hc_toy_avg)

Same data, different linkage criteria

Code
plot_full_dendro(hc_toy, "Single linkage") +
  plot_full_dendro(hc_toy_cmp, "Complete linkage") +
  plot_full_dendro(hc_toy_avg, "Average linkage") +
  patchwork::plot_layout(ncol = 3)

Important

The distance matrix is the same, but the dendrogram changes because the linkage criterion changes.

Distances, dendrograms, and ultrametrics

Hierarchical clustering starts from a matrix of pairwise dissimilarities.

It then produces a dendrogram, which represents the sequence of aggregations.

Important

The dendrogram does not simply reproduce the original distances.
It defines a new set of pairwise values: the aggregation distances.

From original distances to aggregation distances

Hierarchical clustering starts from an original dissimilarity matrix:

\[ D = \{d(i,j)\}. \]

The dendrogram induces a second matrix:

\[ D_U = \{d_U(i,j)\}. \]

For two observations \(i\) and \(j\), \(d_U(i,j)\) is the height at which they first become part of the same cluster.

Note

We call \(d_U(i,j)\) the aggregation distance between \(i\) and \(j\).

Original distance vs aggregation distance

The original distance \(d(i,j)\) measures how far two observations are in the original data space.

The aggregation distance \(d_U(i,j)\) measures how far apart they are in the hierarchy.

Quantity Meaning
\(d(i,j)\) How different are \(i\) and \(j\) before clustering?
\(d_U(i,j)\) At what dendrogram height do \(i\) and \(j\) first merge?

Important

The dendrogram transforms the original dissimilarities into a hierarchical representation.

Aggregation distances form an ultrametric

The matrix \(D_U\) induced by a dendrogram is an ultrametric.

An ultrametric satisfies a stronger version of the triangle inequality.

For any three observations \(i\), \(j\), and \(k\):

\[ d_U(i,j) \leq \max \{ d_U(i,k), d_U(j,k) \}. \]

This is called the ultrametric inequality.

A small example

Suppose the dendrogram says:

  • \(A\) and \(B\) merge at height 2;
  • \(C\) joins the cluster \(\{A,B\}\) at height 5.

Then:

\[ d_U(A,B) = 2, \]

but

\[ d_U(A,C) = 5, \qquad d_U(B,C) = 5. \]

So the three aggregation distances are:

\[ 2, 5, 5. \]

The two largest distances are equal, as required by an ultrametric.

Original distances vs aggregation distances

To keep the table readable, we show a subset of observations.

Original distances vs aggregation distances

To keep the table readable, we show a subset of observations.

Original Euclidean distances
A B C D E F G H
A 0.00 2.42 2.53 2.53 8.00 11.04 10.81 10.57
B 2.42 0.00 1.42 1.25 5.59 8.63 8.46 8.15
C 2.53 1.42 0.00 2.62 5.92 8.91 8.49 8.68
D 2.53 1.25 2.62 0.00 6.05 9.04 9.08 8.35
E 8.00 5.59 5.92 6.05 0.00 3.04 3.13 2.86
F 11.04 8.63 8.91 9.04 3.04 0.00 1.58 1.62
G 10.81 8.46 8.49 9.08 3.13 1.58 0.00 3.06
H 10.57 8.15 8.68 8.35 2.86 1.62 3.06 0.00
Aggregation distances: single linkage
A B C D E F G H
A 0.00 2.42 2.42 2.42 5.59 5.59 5.59 5.59
B 2.42 0.00 1.42 1.25 5.59 5.59 5.59 5.59
C 2.42 1.42 0.00 1.42 5.59 5.59 5.59 5.59
D 2.42 1.25 1.42 0.00 5.59 5.59 5.59 5.59
E 5.59 5.59 5.59 5.59 0.00 2.86 2.86 2.86
F 5.59 5.59 5.59 5.59 2.86 0.00 1.58 1.62
G 5.59 5.59 5.59 5.59 2.86 1.58 0.00 1.62
H 5.59 5.59 5.59 5.59 2.86 1.62 1.62 0.00

Three dendrograms, three ultrametrics

Single linkage
A B C D E F
A 0.00 2.42 2.42 2.42 5.59 5.59
B 2.42 0.00 1.42 1.25 5.59 5.59
C 2.42 1.42 0.00 1.42 5.59 5.59
D 2.42 1.25 1.42 0.00 5.59 5.59
E 5.59 5.59 5.59 5.59 0.00 2.86
F 5.59 5.59 5.59 5.59 2.86 0.00
Complete linkage
A B C D E F
A 0.00 2.62 2.53 2.62 14.30 14.30
B 2.62 0.00 2.62 1.25 14.30 14.30
C 2.53 2.62 0.00 2.62 14.30 14.30
D 2.62 1.25 2.62 0.00 14.30 14.30
E 14.30 14.30 14.30 14.30 0.00 3.13
F 14.30 14.30 14.30 14.30 3.13 0.00
Average linkage
A B C D E F
A 0.00 2.49 2.49 2.49 10.11 10.11
B 2.49 0.00 2.02 1.25 10.11 10.11
C 2.49 2.02 0.00 2.02 10.11 10.11
D 2.49 1.25 2.02 0.00 10.11 10.11
E 10.11 10.11 10.11 10.11 0.00 3.01
F 10.11 10.11 10.11 10.11 3.01 0.00

The matrices may differ because the dendrograms differ.

Quality of the hierarchical representation

A dendrogram transforms the original pairwise distances into aggregation distances.

Quantity Meaning
\(d(i,j)\) original distance between observations \(i\) and \(j\)
\(d_U(i,j)\) aggregation / cophenetic distance: dendrogram height at which \(i\) and \(j\) first belong to the same cluster

To evaluate the quality of the hierarchy, we compare the two sets of pairwise values:

\[ \{d(i,j): i<j\} \qquad \text{and} \qquad \{d_U(i,j): i<j\}. \]

Note

A good dendrogram should preserve the main structure of the original dissimilarities.

Comparing original and aggregation distances

The most common index is the cophenetic correlation coefficient:

\[ r_c = \mathrm{cor} \left( \{d(i,j): i<j\}, \{d_U(i,j): i<j\} \right). \]

Other deformation indexes measure the discrepancy directly:

\[ \Delta_1 = \frac{2}{n(n-1)} \sum_{i<j} |d(i,j)-d_U(i,j)| \]

\[ RMSE_U = \sqrt{ \frac{2}{n(n-1)} \sum_{i<j} (d(i,j)-d_U(i,j))^2 }. \]

Index Better value
Cophenetic correlation high
Mean absolute deformation low
RMSE deformation low

Computing them in R

Code
D_original <- dist(toy_mat)

D_coph_single <- cophenetic(hc_toy)

# Cophenetic correlation
cor(D_original, D_coph_single)
[1] 0.8208033
Code
D_mat <- as.matrix(D_original)
U_mat <- as.matrix(D_coph_single)

idx <- lower.tri(D_mat)

tibble(
  cophenetic_correlation = cor(D_original, D_coph_single),
  mean_absolute_deformation = mean(abs(D_mat[idx] - U_mat[idx])),
  rmse_deformation = sqrt(mean((D_mat[idx] - U_mat[idx])^2))
)
# A tibble: 1 × 3
  cophenetic_correlation mean_absolute_deformation rmse_deformation
                   <dbl>                     <dbl>            <dbl>
1                  0.821                      4.15             5.17

From hierarchical clustering to k-means

Hierarchical clustering builds a full dendrogram.

K-means takes a different approach:

Important

It directly searches for a partition of the data into a pre-specified number of clusters, \(k\).

So, before running k-means, we must choose \(k\).

K-means: step 0

K-means: step 1

K-means: step 2

K-means: step 3

K-means: step 4

K-means: step 5

K-means: algorithm

Given a chosen number of clusters \(k\):

  1. initialize \(k\) centroids;
  2. assign each point to the nearest centroid;
  3. recompute the centroids from the current clusters;
  4. repeat assignment and update until convergence.

Note

Unlike hierarchical clustering, k-means does not produce a dendrogram. It produces one partition.

K-means on the toy data

Code
plot_kmeans_solution(k = 4, seed = 123)

With \(k\=4\), k-means directly returns a partition into four clusters.

K-means requires choosing \(k\)

Code
plot_kmeans_solution(k = 3, seed = 123) +
  plot_kmeans_solution(k = 4, seed = 123) +
  plot_kmeans_solution(k = 5, seed = 123) +
  patchwork::plot_layout(ncol = 3)

The same data can produce different partitions depending on the chosen value of \(k\).

K-means: strengths and limits

Strengths

  • Simple and fast.
  • Works well for compact, roughly spherical clusters.
  • Scales better than hierarchical clustering.

Limits

  • Requires choosing \(k\) in advance.
  • Sensitive to scaling.
  • Sensitive to initialization.
  • Uses centroids, so it is mainly suited to continuous variables.
  • Tends to favor convex, similarly sized clusters.

Beyond standard k-means

K-means is based on two key ideas:

  • each cluster is represented by a centroid;
  • each observation belongs to exactly one cluster.

Two useful variants modify these assumptions:

Method Main idea
k-medoids represent each cluster by an observed unit
fuzzy c-means allow partial membership in several clusters

Note

Both methods still require choosing the number of clusters in advance.

k-medoids

K-medoids is similar to k-means, but each cluster is represented by a medoid.

A medoid is an actual observation in the dataset.

Important

A centroid is an average point.
A medoid is a real observed point.

This makes k-medoids useful when:

  • averages are not meaningful;
  • outliers are present;
  • we want to work directly from a dissimilarity matrix.

k-means vs k-medoids

Feature k-means k-medoids
Cluster representative centroid medoid
Representative is observed? no yes
Input usually continuous variables data or dissimilarity matrix
Robustness to outliers lower higher
Typical objective minimize squared distances to centroids minimize dissimilarities to medoids

Note

k-medoids is a natural choice when clustering is based on a custom dissimilarity matrix.

k-medoids on the toy data

k-medoids can use a dissimilarity matrix

Unlike k-means, k-medoids can be applied directly to a dissimilarity matrix.

[1]  2  6 10 14

This is useful when distances are not Euclidean, for example with:

  • Gower dissimilarity for mixed data;
  • matching dissimilarity for categorical data;
  • custom domain-specific dissimilarities.

Fuzzy c-means

Standard k-means gives a hard partition:

\[ u_{ig} \in \{0,1\}. \]

Each observation belongs to one cluster only.

Fuzzy c-means gives a soft partition:

\[ 0 \leq u_{ig} \leq 1, \qquad \sum_{g=1}^{k} u_{ig} = 1. \]

Here, \(u_{ig}\) is the degree of membership of observation \(i\) in cluster \(g\).

Fuzzy c-means: objective

Fuzzy c-means minimizes:

\[ \sum_{i=1}^{n} \sum_{g=1}^{k} u_{ig}^m \|x_i - c_g\|^2. \]

where:

  • \(u_{ig}\) is the membership degree of observation \(i\) in cluster \(g\);
  • \(c_g\) is the centroid of cluster \(g\);
  • \(m > 1\) controls the degree of fuzziness.

Note

When memberships are close to 0 or 1, the solution is nearly hard.
When memberships are more balanced, cluster boundaries are more uncertain.

Fuzzy c-means on the toy data

Interpreting fuzzy memberships

Fuzzy c-means gives more information than a hard partition.

For each observation, we can inspect:

  • the cluster with the largest membership;
  • the maximum membership degree;
  • whether the observation lies near a cluster boundary.
Situation Interpretation
one membership close to 1 clear cluster assignment
two or more similar memberships uncertain / boundary observation
all memberships similar weak cluster structure

Note

Fuzzy clustering is useful when cluster boundaries are gradual rather than sharp.

k-means, k-medoids, and fuzzy c-means

Method Representative Assignment Best suited for
k-means centroid hard continuous data, compact clusters
k-medoids observed medoid hard arbitrary dissimilarities, robustness
fuzzy c-means centroid soft gradual or uncertain cluster boundaries

Important

The method should match the structure of the data and the interpretation you need.

Spectral clustering

Spectral clustering starts from a different idea.

Instead of looking for clusters directly in the original data space, we build a graph:

  • observations are nodes;
  • edges connect similar observations;
  • edge weights measure how strongly two observations are connected.

Note

The clustering problem becomes a graph partitioning problem.

Spectral clustering: graph representation

A graph representation of the data:

  • each node is an observation;
  • each edge has a weight \(w_{ij}\);
  • high weights connect observations that are very similar.
Code
library(network)
library(sna)
library(GGally)

The affinity matrix

The graph can be represented by an affinity matrix \(A\).

The element \(w_{ij}\) is:

  • high when observations \(i\) and \(j\) are similar;
  • low when observations \(i\) and \(j\) are dissimilar;
  • zero if no edge is present.
. a b c d
a 0 0 w_ac 0
b 0 0 w_cb w_bd
c w_ca w_cb 0 w_cd
d 0 w_db w_dc 0

From distances to affinities

Spectral clustering usually starts from pairwise distances.

The distance matrix \(D\) is transformed into an affinity matrix \(A\).

A common choice is the Gaussian affinity:

\[ a_{ij} = \exp\left( -\frac{d(i,j)^2}{2\sigma^2} \right), \qquad a_{ii}=0. \]

The parameter \(\sigma\) controls the neighbourhood scale.

Note

Small distances become large affinities.
Large distances become small affinities.

Spectral clustering: the main idea

Once the affinity matrix has been built, spectral clustering works in three steps.

  1. Build a graph from the affinity matrix \(A\).
  2. Compute a spectral embedding from the graph.
  3. Apply k-means to the embedded observations.

Important

The final k-means step is not applied to the original variables, but to the spectral coordinates.

Normalized graph Laplacian

Let \(r_i\) be the total affinity of observation \(i\):

\[ r_i = \sum_{j=1}^{n} a_{ij}. \]

Let \(D_r\) be the diagonal matrix containing these row sums.

The normalized affinity matrix is:

\[ L = D_r^{-1/2} A D_r^{-1/2}. \]

We then compute its eigen-decomposition:

\[ L = Q \Lambda Q^\top. \]

Spectral embedding

The first \(K\) eigenvectors of \(L\) define a new representation of the observations.

Let \(\tilde Q\) be the matrix containing the selected eigenvectors.

Each row of \(\tilde Q\) is the spectral embedding of one observation.

The final clustering is obtained by applying k-means to the rows of \(\tilde Q\).

Note

Spectral clustering turns a graph partitioning problem into a clustering problem in a new coordinate system.

Why spectral clustering?

Spectral clustering is useful when clusters are not well represented by centroids in the original space.

It can work well with:

  • non-convex clusters;
  • graph-like structure;
  • local connectivity;
  • custom affinities built from domain-specific distances.

Tip

This is why spectral clustering is often useful when distances encode more than simple Euclidean proximity.

Spectral clustering and non-convex structure

Spectral clustering can separate groups that are connected locally, even when they are not convex in the original space.

DBSCAN: density-based clustering

DBSCAN is a density-based clustering method.

Instead of searching for spherical groups or a hierarchy, it looks for regions where observations are densely concentrated.

Note

Clusters are defined as dense regions separated by sparse regions.

DBSCAN can also identify observations that do not clearly belong to any cluster: noise points.

DBSCAN: core, border, and noise points

DBSCAN depends on two parameters:

  • \(\varepsilon\): neighbourhood radius;
  • minPts: minimum number of neighbours required to define a dense region.

Each observation is classified as:

Type of point Meaning
Core point has at least minPts observations within distance \(\varepsilon\)
Border point is close to a core point, but is not itself core
Noise point is neither core nor border

Important

DBSCAN does not require choosing \(k\), but it does require choosing \(\varepsilon\) and minPts.

DBSCAN: strengths and limits

Strengths

  • Can detect non-convex clusters.
  • Can identify noise or outlying observations.
  • Does not require choosing the number of clusters in advance.

Limits

  • Sensitive to the choice of \(\varepsilon\) and minPts.
  • Struggles when clusters have very different densities.
  • Still depends on the chosen distance and scaling.

Note

DBSCAN is useful when clusters are better described as dense regions than as groups around centroids.

DBSCAN on a non-convex example

Cluster evaluation

After obtaining a clustering solution, we need to ask:

Important

Is this clustering useful, stable, and interpretable?

Cluster evaluation can have different goals:

Goal Question
Assess compactness Are observations close to others in the same cluster?
Assess separation Are clusters well separated from each other?
Compare solutions Which method or number of clusters works better?
Compare with labels Does the clustering agree with known groups?
Check robustness Does the solution persist under small changes?

Internal vs external evaluation

Cluster evaluation measures can be divided into two main families.

Type Uses true labels? Main idea
Internal evaluation No evaluate the structure of the clustering using the data only
External evaluation Yes compare the clustering with known reference labels

Note

Most real clustering problems rely mainly on internal evaluation and substantive interpretation.

Internal vs external evaluation

Cluster evaluation measures can be divided into two main families.

Type Uses true labels? Main idea
Internal evaluation No evaluate the structure of the clustering using the data only
External evaluation Yes compare the clustering with known reference labels

Note

Most real clustering problems rely mainly on internal evaluation and substantive interpretation.

Internal evaluation measures

Internal measures assess cluster quality using only:

  • the data;
  • the dissimilarity matrix;
  • the cluster assignments.

They usually combine two ideas:

  1. Within-cluster compactness
    Observations in the same cluster should be close.

  2. Between-cluster separation
    Observations in different clusters should be far apart.

Within-cluster sum of squares

For k-means, a common criterion is the within-cluster sum of squares:

\[ W_k = \sum_{g=1}^{k} \sum_{i \in C_g} \|x_i - \bar{x}_g\|^2. \]

As \(k\) increases, \(W_k\) always decreases.

Warning

The goal is not simply to minimize \(W_k\), because the minimum is obtained when each observation is its own cluster.

Elbow method

The elbow method evaluates \(W_k\) for different values of \(k\).

We look for a value of \(k\) after which the improvement becomes much smaller.

Note

The elbow is a heuristic, not a formal rule.

Silhouette coefficient

For each observation \(i\), define:

  • \(a(i)\): average dissimilarity between \(i\) and observations in its own cluster;
  • \(b(i)\): smallest average dissimilarity between \(i\) and observations in another cluster.

The silhouette value is:

\[ s(i) = \frac{b(i)-a(i)} {\max\{a(i), b(i)\}}. \]

Values close to 1 indicate well-clustered observations.
Values close to 0 indicate boundary observations.
Negative values suggest possible misclassification.

Average silhouette width

The average silhouette width summarizes the silhouette values over all observations.

Note

A larger average silhouette suggests a clearer clustering structure.

Calinski-Harabasz index

The Calinski-Harabasz index compares between-cluster variability with within-cluster variability:

\[ CH(k) = \frac{B_k/(k-1)} {W_k/(n-k)}. \]

where:

  • \(B_k\) measures between-cluster variation;
  • \(W_k\) measures within-cluster variation;
  • \(n\) is the number of observations.

Note

Larger values suggest more compact and better separated clusters.

Davies-Bouldin index

The Davies-Bouldin index compares within-cluster scatter with between-cluster separation.

It is based on quantities of the form:

\[ R_{gh} = \frac{S_g + S_h} {M_{gh}}, \]

where:

  • \(S_g\) and \(S_h\) measure the scatter of clusters \(g\) and \(h\);
  • \(M_{gh}\) is the distance between their centroids.

Note

Smaller values indicate better clustering.

Internal evaluation: summary

Measure Main idea Better value
Within-cluster sum of squares compactness around centroids low
Elbow method diminishing returns in WSS visible elbow
Average silhouette compactness and separation high
Calinski-Harabasz between / within variation high
Davies-Bouldin scatter / separation low

Warning

Internal measures are helpful, but they do not replace interpretation.

External evaluation

Sometimes we have reference labels.

For example:

  • known species;
  • diagnostic groups;
  • manually coded categories;
  • benchmark classes.

Then we can compare the clustering solution with the known labels.

Important

External labels are not used to build the clustering.
They are used only to evaluate it.

External evaluation: contingency table

The first step is often a contingency table.

Rows are clusters.
Columns are reference classes.

cluster Group 1 Group 2 Group 3 Group 4
1 0 4 0 0
2 0 0 0 4
3 4 0 0 0
4 0 0 4 0

Note

Cluster labels are arbitrary: cluster 1 is not necessarily reference group 1.

Rand index and Adjusted Rand Index

The Rand index compares pairs of observations.

For each pair, it checks whether the two partitions agree:

  • same cluster in both partitions;
  • different clusters in both partitions.

The Adjusted Rand Index corrects the Rand index for chance agreement.

ARI value Interpretation
1 perfect agreement
around 0 agreement close to random
below 0 worse than random agreement

Note

ARI is one of the most common external validation measures for clustering.

Mutual-information based measures

Another family of measures is based on the information shared by:

  • the clustering solution;
  • the reference labels.

Common indexes include:

  • Mutual Information;
  • Normalized Mutual Information;
  • Adjusted Mutual Information.

Note

These measures evaluate how much knowing the cluster assignment reduces uncertainty about the reference label.

External evaluation: summary

Measure Main idea Better value
Contingency table direct comparison of clusters and labels diagonal pattern, after label matching
Rand index pairwise agreement high
Adjusted Rand Index pairwise agreement corrected for chance high
Normalized Mutual Information shared information between partitions high
Purity dominant reference class inside clusters high, but can be misleading

Warning

External measures are useful when labels exist, but clustering is often exploratory: labels may not represent the target structure.

Choosing the number of clusters

There is rarely a single correct number of clusters.

Useful criteria include:

  • internal evaluation measures;
  • stability under resampling;
  • interpretability of the solution;
  • usefulness for the scientific or practical question.

Important

Choosing \(k\) is both a statistical and substantive decision.

Practical workflow for choosing \(k\)

A practical workflow is:

  1. try a reasonable range of values for \(k\);
  2. compute several internal evaluation measures;
  3. inspect visualizations and cluster profiles;
  4. check whether the solution is stable;
  5. choose a solution that is interpretable and useful.

Note

Do not rely on a single index mechanically. Different indexes may prefer different values of \(k\).

Cluster evaluation: final take-away

Cluster evaluation should combine three perspectives.

Perspective Question
Geometry Are clusters compact and separated?
Stability Does the solution persist under small changes?
Interpretation Do the clusters make substantive sense?

Important

A good clustering solution is not only one with a good index.
It should also be stable, interpretable, and relevant to the problem.