Class 3 — clustering workflows
2026-05-25
Cluster analysis is a family of unsupervised learning methods used to group observations into clusters.
The goal is to obtain groups such that:
Note
In clustering, the groups are not known in advance.
They are constructed from the observed data.
Before clustering, we need to decide:
Important
What does it mean for two observations to be similar or different?
Most clustering methods depend, explicitly or implicitly, on a notion of:
Suppose we observe three units described by four variables.
The clustering algorithm does not only look at the rows separately.
It compares the observations pairwise:
\[ d(A,B), \quad d(A,C), \quad d(B,C). \]
These pairwise dissimilarities are the basic input for many clustering methods.
2 continuous variables: add up by-variable (absolute value or squared) differences
2 continuous variables: add up by-variable (absolute value or squared) differences
2 continuous variables: add up by-variable (absolute value or squared) differences
toy_data |>
ggplot(aes(x=X1,y=X2,label=X4)) +
geom_point(size=8) + geom_text(color="white") +
# geom_point(inherit.aes = FALSE,aes(x=X1,y=0),color="#2A9D8F",size=4)+
# geom_point(inherit.aes = FALSE,aes(x=0,y=X2),color="forestgreen",size=4)+
geom_line(data=toy_data |> dplyr::filter(X4 %in%c("A","B")),
inherit.aes = FALSE,aes(x=X1,y=0),color="#2A9D8F",size=2)+
geom_line(data=toy_data |> dplyr::filter(X4 %in%c("A","B")),
inherit.aes = FALSE,aes(y=X2,x=0),color="forestgreen",size=2)+
annotate("label",x=2, y=2,label="d(A,B)",size=10)+theme_void()2 continuous variables: add up by-variable (absolute value or squared) differences
toy_data |>
ggplot(aes(x=X1,y=X2,label=X4)) +
geom_point(size=8) + geom_text(color="white") +
# geom_point(inherit.aes = FALSE,aes(x=X1,y=0),color="#2A9D8F",size=4)+
# geom_point(inherit.aes = FALSE,aes(x=0,y=X2),color="forestgreen",size=4)+
geom_line(data=toy_data |> dplyr::filter(X4 %in%c("A","C")),
inherit.aes = FALSE,aes(x=X1,y=0),color="#2A9D8F",size=2)+
geom_line(data=toy_data |> dplyr::filter(X4 %in%c("A","C")),
inherit.aes = FALSE,aes(y=X2,x=0),color="forestgreen",size=2)+
annotate("label",x=2, y=2,label="d(A,C)",size=10)+theme_void()2 continuous variables: add up by-variable (absolute value or squared) differences
toy_data |>
ggplot(aes(x=X1,y=X2,label=X4)) +
geom_point(size=8) + geom_text(color="white") +
# geom_point(inherit.aes = FALSE,aes(x=X1,y=0),color="#2A9D8F",size=4)+
# geom_point(inherit.aes = FALSE,aes(x=0,y=X2),color="forestgreen",size=4)+
geom_line(data=toy_data |> dplyr::filter(X4 %in%c("B","C")),
inherit.aes = FALSE,aes(x=X1,y=0),color="#2A9D8F",size=2)+
geom_line(data=toy_data |> dplyr::filter(X4 %in%c("B","C")),
inherit.aes = FALSE,aes(y=X2,x=0),color="forestgreen",size=2)+
annotate("label",x=2, y=2,label="d(B,C)",size=10)+theme_void()2 continuous and 1 categorical variables
toy_data |>
ggplot(aes(x=X1,y=X2,label=X4,color=as.character(X3))) +
geom_point(size=8) + geom_text(color="white") +
geom_point(inherit.aes = FALSE,aes(x=X1,y=0),color="#2A9D8F",size=4,alpha=.001)+
geom_point(inherit.aes = FALSE,aes(x=0,y=X2),color="forestgreen",size=4,alpha=.001)+
scale_color_identity()+theme_void()+
theme(legend.position="none")one might consider purple and blue closer than e.g. purple and yellow
toy_data |>
ggplot(aes(x=X1,y=X2,label=X4,color=as.character(X3))) +
geom_point(size=8) + geom_text(color="white") +
geom_point(inherit.aes = FALSE,aes(x=X1,y=0),color="#2A9D8F",size=4,alpha=.001)+
geom_point(inherit.aes = FALSE,aes(x=0,y=X2),color="forestgreen",size=4,alpha=.001)+
scale_color_identity()+theme_void()+
theme(legend.position="none")A dissimilarity is a numerical index measuring how different two observations are.
A distance is a dissimilarity that satisfies additional mathematical properties.
Let \(d\(x\,y\)\) be the dissimilarity between two observations \(x\) and \(y\).
A distance usually satisfies:
\[ d(x,y) \geq 0 \]
\[ d(x,y) = 0 \iff x = y \]
\[ d(x,y) = d(y,x) \]
\[ d(x,z) \leq d(x,y) + d(y,z) \]
The main properties are:
| Property | Meaning |
|---|---|
| Non-negativity | distances cannot be negative |
| Identity of indiscernibles | distance is zero only for identical objects |
| Symmetry | the distance from \(x\) to \(y\) equals the distance from \(y\) to \(x\) |
| Triangle inequality | going through a third point cannot make the path shorter |
Note
Every distance is a dissimilarity, but not every dissimilarity is a distance.
The way we compute dissimilarities depends on the nature of the variables.
| Type of data | Common dissimilarity |
|---|---|
| Continuous variables | Euclidean or Manhattan distance |
| Categorical variables | Matching dissimilarity |
| Mixed variables | Gower dissimilarity |
For continuous variables, dissimilarities are based on numerical differences.
Given two observations \(A\) and \(B\), described by two continuous variables:
\[ A = (x_{A1}, x_{A2}), \qquad B = (x_{B1}, x_{B2}). \]
Two common choices are:
\[ d_{\text{Manhattan}}(A,B) = |x_{A1}-x_{B1}| + |x_{A2}-x_{B2}| \]
and
\[ d_{\text{Euclidean}}(A,B) = \sqrt{(x_{A1}-x_{B1})^2 + (x_{A2}-x_{B2})^2}. \]
For continuous variables, dissimilarities are based on numerical differences.
Given two observations \(A\) and \(B\), described by two continuous variables:
\[ A = (x_{A1}, x_{A2}), \qquad B = (x_{B1}, x_{B2}). \]
Two common choices are:
\[ d_{\text{Manhattan}}(A,B) = |x_{A1}-x_{B1}| + |x_{A2}-x_{B2}| \]
and
\[ d_{\text{Euclidean}}(A,B) = \sqrt{(x_{A1}-x_{B1})^2 + (x_{A2}-x_{B2})^2}. \]
For categorical variables, a simple choice is the matching dissimilarity.
For one categorical variable:
\[ d(x_i,x_j) = \begin{cases} 0 & \text{if } x_i = x_j \\ 1 & \text{if } x_i \neq x_j \end{cases} \]
For several categorical variables, we count the proportion of variables on which two observations differ.
When variables have different types, Gower dissimilarity compares observations variable by variable and then averages the results.
For two observations \(i\) and \(j\):
\[ d_G(i,j) = \frac{ \sum_{\ell=1}^{p} w_{ij\ell} \, d_{ij\ell} }{ \sum_{\ell=1}^{p} w_{ij\ell} }. \]
Here:
Note
Gower dissimilarity combines different types of variables by putting their contributions on a common scale.
For a continuous variable \(X\_\\ell\), Gower usually uses a range-scaled absolute difference:
\[ d_{ij\ell} = \frac{|x_{i\ell} - x_{j\ell}|}{\max(X_\ell) - \min(X_\ell)}. \]
For a categorical variable \(X\_\ell\), Gower uses matching dissimilarity:
\[ d_{ij\ell} = \begin{cases} 0 & \text{if } x_{i\ell} = x_{j\ell}, \\ 1 & \text{if } x_{i\ell} \neq x_{j\ell}. \end{cases} \]
The final dissimilarity is the average of these variable-wise contributions.
Once we have defined how to compare two observations, we can compute all pairwise dissimilarities.
1 2 3
1 0.00 3.61 2.24
2 3.61 0.00 4.47
3 2.24 4.47 0.00
The result is a distance matrix:
\[ D = \begin{pmatrix} 0 & d(A,B) & d(A,C) \\ d(B,A) & 0 & d(B,C) \\ d(C,A) & d(C,B) & 0 \end{pmatrix}. \]
Many clustering algorithms start from this matrix, directly or indirectly.
The same data can be clustered in different ways.
Some methods use the distance matrix directly:
Other methods use distances implicitly:
Note
Changing the distance, scaling, or representation can change the clustering solution.
Hierarchical clustering builds a nested sequence of clusters.
It starts from a dissimilarity matrix and produces a hierarchy that can be represented as a dendrogram.
Note
The dendrogram shows how observations are progressively grouped together as the allowed dissimilarity increases.
The most common approach is agglomerative clustering.
The algorithm works bottom-up:
The result is a hierarchy of increasingly larger groups.
At the beginning, distances are computed between individual observations.
After the first merge, we need to compute distances between:
This is done through a linkage criterion.
Important
Different linkage criteria can produce different dendrograms, even with the same dissimilarity matrix.
Let \(d\(i\,j\)\) be the dissimilarity between observations \(i\) and \(j\).
For two clusters \(A\) and \(B\):
\[ d_{\text{single}}(A,B) = \min_{i \in A, j \in B} d(i,j) \]
\[ d_{\text{complete}}(A,B) = \max_{i \in A, j \in B} d(i,j) \]
\[ d_{\text{average}}(A,B) = \frac{1}{|A||B|} \sum_{i \in A} \sum_{j \in B} d(i,j) \]
We create a small dataset with 10 observations arranged around 4 visible groups.
toy_hc <- tibble(
id = LETTERS[1:16],
x = c(
0.3, 2.2, 1.1, 2.8, # A-D: lower-left group
6.2, 8.4, 7.1, 9.0, # E-H: central group
12.2, 14.8, 13.1, 15.2, # I-L: lower-right group
3.2, 5.7, 4.4, 6.3 # M-P: upper group
),
y = c(
0.4, 1.9, 2.8, 0.8, # A-D
5.8, 7.9, 8.8, 6.4, # E-H
0.6, 1.4, 3.0, 2.5, # I-L
11.4, 13.2, 14.1, 11.8 # M-P
)
)
toy_mat <- toy_hc |>
select(x, y) |>
as.matrix()
rownames(toy_mat) <- toy_hc$id
hc_toy <- hclust(dist(toy_mat), method = "single")
hc_toy_cmp <- hclust(dist(toy_mat), method = "complete")
hc_toy_avg <- hclust(dist(toy_mat), method = "average")Single linkage defines the distance between two clusters as the distance between their closest pair of points.
For two clusters \(A\) and \(B\):
\[ d_{\text{single}}(A,B) = \min_{i \in A, j \in B} d(i,j). \]
At each step, the algorithm merges the two clusters with the smallest single-linkage distance.
Important
The distance matrix is the same, but the dendrogram changes because the linkage criterion changes.
Hierarchical clustering starts from a matrix of pairwise dissimilarities.
It then produces a dendrogram, which represents the sequence of aggregations.
Important
The dendrogram does not simply reproduce the original distances.
It defines a new set of pairwise values: the aggregation distances.
Hierarchical clustering starts from an original dissimilarity matrix:
\[ D = \{d(i,j)\}. \]
The dendrogram induces a second matrix:
\[ D_U = \{d_U(i,j)\}. \]
For two observations \(i\) and \(j\), \(d_U(i,j)\) is the height at which they first become part of the same cluster.
Note
We call \(d_U(i,j)\) the aggregation distance between \(i\) and \(j\).
The original distance \(d(i,j)\) measures how far two observations are in the original data space.
The aggregation distance \(d_U(i,j)\) measures how far apart they are in the hierarchy.
| Quantity | Meaning |
|---|---|
| \(d(i,j)\) | How different are \(i\) and \(j\) before clustering? |
| \(d_U(i,j)\) | At what dendrogram height do \(i\) and \(j\) first merge? |
Important
The dendrogram transforms the original dissimilarities into a hierarchical representation.
The matrix \(D_U\) induced by a dendrogram is an ultrametric.
An ultrametric satisfies a stronger version of the triangle inequality.
For any three observations \(i\), \(j\), and \(k\):
\[ d_U(i,j) \leq \max \{ d_U(i,k), d_U(j,k) \}. \]
This is called the ultrametric inequality.
Suppose the dendrogram says:
Then:
\[ d_U(A,B) = 2, \]
but
\[ d_U(A,C) = 5, \qquad d_U(B,C) = 5. \]
So the three aggregation distances are:
\[ 2, 5, 5. \]
The two largest distances are equal, as required by an ultrametric.
To keep the table readable, we show a subset of observations.
To keep the table readable, we show a subset of observations.
| A | B | C | D | E | F | G | H | |
|---|---|---|---|---|---|---|---|---|
| A | 0.00 | 2.42 | 2.53 | 2.53 | 8.00 | 11.04 | 10.81 | 10.57 |
| B | 2.42 | 0.00 | 1.42 | 1.25 | 5.59 | 8.63 | 8.46 | 8.15 |
| C | 2.53 | 1.42 | 0.00 | 2.62 | 5.92 | 8.91 | 8.49 | 8.68 |
| D | 2.53 | 1.25 | 2.62 | 0.00 | 6.05 | 9.04 | 9.08 | 8.35 |
| E | 8.00 | 5.59 | 5.92 | 6.05 | 0.00 | 3.04 | 3.13 | 2.86 |
| F | 11.04 | 8.63 | 8.91 | 9.04 | 3.04 | 0.00 | 1.58 | 1.62 |
| G | 10.81 | 8.46 | 8.49 | 9.08 | 3.13 | 1.58 | 0.00 | 3.06 |
| H | 10.57 | 8.15 | 8.68 | 8.35 | 2.86 | 1.62 | 3.06 | 0.00 |
| A | B | C | D | E | F | G | H | |
|---|---|---|---|---|---|---|---|---|
| A | 0.00 | 2.42 | 2.42 | 2.42 | 5.59 | 5.59 | 5.59 | 5.59 |
| B | 2.42 | 0.00 | 1.42 | 1.25 | 5.59 | 5.59 | 5.59 | 5.59 |
| C | 2.42 | 1.42 | 0.00 | 1.42 | 5.59 | 5.59 | 5.59 | 5.59 |
| D | 2.42 | 1.25 | 1.42 | 0.00 | 5.59 | 5.59 | 5.59 | 5.59 |
| E | 5.59 | 5.59 | 5.59 | 5.59 | 0.00 | 2.86 | 2.86 | 2.86 |
| F | 5.59 | 5.59 | 5.59 | 5.59 | 2.86 | 0.00 | 1.58 | 1.62 |
| G | 5.59 | 5.59 | 5.59 | 5.59 | 2.86 | 1.58 | 0.00 | 1.62 |
| H | 5.59 | 5.59 | 5.59 | 5.59 | 2.86 | 1.62 | 1.62 | 0.00 |
| A | B | C | D | E | F | |
|---|---|---|---|---|---|---|
| A | 0.00 | 2.42 | 2.42 | 2.42 | 5.59 | 5.59 |
| B | 2.42 | 0.00 | 1.42 | 1.25 | 5.59 | 5.59 |
| C | 2.42 | 1.42 | 0.00 | 1.42 | 5.59 | 5.59 |
| D | 2.42 | 1.25 | 1.42 | 0.00 | 5.59 | 5.59 |
| E | 5.59 | 5.59 | 5.59 | 5.59 | 0.00 | 2.86 |
| F | 5.59 | 5.59 | 5.59 | 5.59 | 2.86 | 0.00 |
| A | B | C | D | E | F | |
|---|---|---|---|---|---|---|
| A | 0.00 | 2.62 | 2.53 | 2.62 | 14.30 | 14.30 |
| B | 2.62 | 0.00 | 2.62 | 1.25 | 14.30 | 14.30 |
| C | 2.53 | 2.62 | 0.00 | 2.62 | 14.30 | 14.30 |
| D | 2.62 | 1.25 | 2.62 | 0.00 | 14.30 | 14.30 |
| E | 14.30 | 14.30 | 14.30 | 14.30 | 0.00 | 3.13 |
| F | 14.30 | 14.30 | 14.30 | 14.30 | 3.13 | 0.00 |
| A | B | C | D | E | F | |
|---|---|---|---|---|---|---|
| A | 0.00 | 2.49 | 2.49 | 2.49 | 10.11 | 10.11 |
| B | 2.49 | 0.00 | 2.02 | 1.25 | 10.11 | 10.11 |
| C | 2.49 | 2.02 | 0.00 | 2.02 | 10.11 | 10.11 |
| D | 2.49 | 1.25 | 2.02 | 0.00 | 10.11 | 10.11 |
| E | 10.11 | 10.11 | 10.11 | 10.11 | 0.00 | 3.01 |
| F | 10.11 | 10.11 | 10.11 | 10.11 | 3.01 | 0.00 |
The matrices may differ because the dendrograms differ.
A dendrogram transforms the original pairwise distances into aggregation distances.
| Quantity | Meaning |
|---|---|
| \(d(i,j)\) | original distance between observations \(i\) and \(j\) |
| \(d_U(i,j)\) | aggregation / cophenetic distance: dendrogram height at which \(i\) and \(j\) first belong to the same cluster |
To evaluate the quality of the hierarchy, we compare the two sets of pairwise values:
\[ \{d(i,j): i<j\} \qquad \text{and} \qquad \{d_U(i,j): i<j\}. \]
Note
A good dendrogram should preserve the main structure of the original dissimilarities.
The most common index is the cophenetic correlation coefficient:
\[ r_c = \mathrm{cor} \left( \{d(i,j): i<j\}, \{d_U(i,j): i<j\} \right). \]
Other deformation indexes measure the discrepancy directly:
\[ \Delta_1 = \frac{2}{n(n-1)} \sum_{i<j} |d(i,j)-d_U(i,j)| \]
\[ RMSE_U = \sqrt{ \frac{2}{n(n-1)} \sum_{i<j} (d(i,j)-d_U(i,j))^2 }. \]
| Index | Better value |
|---|---|
| Cophenetic correlation | high |
| Mean absolute deformation | low |
| RMSE deformation | low |
[1] 0.8208033
# A tibble: 1 × 3
cophenetic_correlation mean_absolute_deformation rmse_deformation
<dbl> <dbl> <dbl>
1 0.821 4.15 5.17
Hierarchical clustering builds a full dendrogram.
K-means takes a different approach:
Important
It directly searches for a partition of the data into a pre-specified number of clusters, \(k\).
So, before running k-means, we must choose \(k\).
Given a chosen number of clusters \(k\):
Note
Unlike hierarchical clustering, k-means does not produce a dendrogram. It produces one partition.
With \(k\=4\), k-means directly returns a partition into four clusters.
The same data can produce different partitions depending on the chosen value of \(k\).
Strengths
Limits
K-means is based on two key ideas:
Two useful variants modify these assumptions:
| Method | Main idea |
|---|---|
| k-medoids | represent each cluster by an observed unit |
| fuzzy c-means | allow partial membership in several clusters |
Note
Both methods still require choosing the number of clusters in advance.
K-medoids is similar to k-means, but each cluster is represented by a medoid.
A medoid is an actual observation in the dataset.
Important
A centroid is an average point.
A medoid is a real observed point.
This makes k-medoids useful when:
| Feature | k-means | k-medoids |
|---|---|---|
| Cluster representative | centroid | medoid |
| Representative is observed? | no | yes |
| Input | usually continuous variables | data or dissimilarity matrix |
| Robustness to outliers | lower | higher |
| Typical objective | minimize squared distances to centroids | minimize dissimilarities to medoids |
Note
k-medoids is a natural choice when clustering is based on a custom dissimilarity matrix.
Unlike k-means, k-medoids can be applied directly to a dissimilarity matrix.
[1] 2 6 10 14
This is useful when distances are not Euclidean, for example with:
Standard k-means gives a hard partition:
\[ u_{ig} \in \{0,1\}. \]
Each observation belongs to one cluster only.
Fuzzy c-means gives a soft partition:
\[ 0 \leq u_{ig} \leq 1, \qquad \sum_{g=1}^{k} u_{ig} = 1. \]
Here, \(u_{ig}\) is the degree of membership of observation \(i\) in cluster \(g\).
Fuzzy c-means minimizes:
\[ \sum_{i=1}^{n} \sum_{g=1}^{k} u_{ig}^m \|x_i - c_g\|^2. \]
where:
Note
When memberships are close to 0 or 1, the solution is nearly hard.
When memberships are more balanced, cluster boundaries are more uncertain.
Fuzzy c-means gives more information than a hard partition.
For each observation, we can inspect:
| Situation | Interpretation |
|---|---|
| one membership close to 1 | clear cluster assignment |
| two or more similar memberships | uncertain / boundary observation |
| all memberships similar | weak cluster structure |
Note
Fuzzy clustering is useful when cluster boundaries are gradual rather than sharp.
| Method | Representative | Assignment | Best suited for |
|---|---|---|---|
| k-means | centroid | hard | continuous data, compact clusters |
| k-medoids | observed medoid | hard | arbitrary dissimilarities, robustness |
| fuzzy c-means | centroid | soft | gradual or uncertain cluster boundaries |
Important
The method should match the structure of the data and the interpretation you need.
Spectral clustering starts from a different idea.
Instead of looking for clusters directly in the original data space, we build a graph:
Note
The clustering problem becomes a graph partitioning problem.
A graph representation of the data:
The graph can be represented by an affinity matrix \(A\).
The element \(w_{ij}\) is:
| . | a | b | c | d |
|---|---|---|---|---|
| a | 0 | 0 | w_ac | 0 |
| b | 0 | 0 | w_cb | w_bd |
| c | w_ca | w_cb | 0 | w_cd |
| d | 0 | w_db | w_dc | 0 |
Spectral clustering usually starts from pairwise distances.
The distance matrix \(D\) is transformed into an affinity matrix \(A\).
A common choice is the Gaussian affinity:
\[ a_{ij} = \exp\left( -\frac{d(i,j)^2}{2\sigma^2} \right), \qquad a_{ii}=0. \]
The parameter \(\sigma\) controls the neighbourhood scale.
Note
Small distances become large affinities.
Large distances become small affinities.
Once the affinity matrix has been built, spectral clustering works in three steps.
Important
The final k-means step is not applied to the original variables, but to the spectral coordinates.
Let \(r_i\) be the total affinity of observation \(i\):
\[ r_i = \sum_{j=1}^{n} a_{ij}. \]
Let \(D_r\) be the diagonal matrix containing these row sums.
The normalized affinity matrix is:
\[ L = D_r^{-1/2} A D_r^{-1/2}. \]
We then compute its eigen-decomposition:
\[ L = Q \Lambda Q^\top. \]
The first \(K\) eigenvectors of \(L\) define a new representation of the observations.
Let \(\tilde Q\) be the matrix containing the selected eigenvectors.
Each row of \(\tilde Q\) is the spectral embedding of one observation.
The final clustering is obtained by applying k-means to the rows of \(\tilde Q\).
Note
Spectral clustering turns a graph partitioning problem into a clustering problem in a new coordinate system.
Spectral clustering is useful when clusters are not well represented by centroids in the original space.
It can work well with:
Tip
This is why spectral clustering is often useful when distances encode more than simple Euclidean proximity.
Spectral clustering can separate groups that are connected locally, even when they are not convex in the original space.
DBSCAN is a density-based clustering method.
Instead of searching for spherical groups or a hierarchy, it looks for regions where observations are densely concentrated.
Note
Clusters are defined as dense regions separated by sparse regions.
DBSCAN can also identify observations that do not clearly belong to any cluster: noise points.
DBSCAN depends on two parameters:
minPts: minimum number of neighbours required to define a dense region.Each observation is classified as:
| Type of point | Meaning |
|---|---|
| Core point | has at least minPts observations within distance \(\varepsilon\) |
| Border point | is close to a core point, but is not itself core |
| Noise point | is neither core nor border |
Important
DBSCAN does not require choosing \(k\), but it does require choosing \(\varepsilon\) and minPts.
Strengths
Limits
minPts.Note
DBSCAN is useful when clusters are better described as dense regions than as groups around centroids.
After obtaining a clustering solution, we need to ask:
Important
Is this clustering useful, stable, and interpretable?
Cluster evaluation can have different goals:
| Goal | Question |
|---|---|
| Assess compactness | Are observations close to others in the same cluster? |
| Assess separation | Are clusters well separated from each other? |
| Compare solutions | Which method or number of clusters works better? |
| Compare with labels | Does the clustering agree with known groups? |
| Check robustness | Does the solution persist under small changes? |
Cluster evaluation measures can be divided into two main families.
| Type | Uses true labels? | Main idea |
|---|---|---|
| Internal evaluation | No | evaluate the structure of the clustering using the data only |
| External evaluation | Yes | compare the clustering with known reference labels |
Note
Most real clustering problems rely mainly on internal evaluation and substantive interpretation.
Cluster evaluation measures can be divided into two main families.
| Type | Uses true labels? | Main idea |
|---|---|---|
| Internal evaluation | No | evaluate the structure of the clustering using the data only |
| External evaluation | Yes | compare the clustering with known reference labels |
Note
Most real clustering problems rely mainly on internal evaluation and substantive interpretation.
Internal measures assess cluster quality using only:
They usually combine two ideas:
Within-cluster compactness
Observations in the same cluster should be close.
Between-cluster separation
Observations in different clusters should be far apart.
For k-means, a common criterion is the within-cluster sum of squares:
\[ W_k = \sum_{g=1}^{k} \sum_{i \in C_g} \|x_i - \bar{x}_g\|^2. \]
As \(k\) increases, \(W_k\) always decreases.
Warning
The goal is not simply to minimize \(W_k\), because the minimum is obtained when each observation is its own cluster.
The elbow method evaluates \(W_k\) for different values of \(k\).
We look for a value of \(k\) after which the improvement becomes much smaller.
Note
The elbow is a heuristic, not a formal rule.
For each observation \(i\), define:
The silhouette value is:
\[ s(i) = \frac{b(i)-a(i)} {\max\{a(i), b(i)\}}. \]
Values close to 1 indicate well-clustered observations.
Values close to 0 indicate boundary observations.
Negative values suggest possible misclassification.
The average silhouette width summarizes the silhouette values over all observations.
Note
A larger average silhouette suggests a clearer clustering structure.
The Calinski-Harabasz index compares between-cluster variability with within-cluster variability:
\[ CH(k) = \frac{B_k/(k-1)} {W_k/(n-k)}. \]
where:
Note
Larger values suggest more compact and better separated clusters.
The Davies-Bouldin index compares within-cluster scatter with between-cluster separation.
It is based on quantities of the form:
\[ R_{gh} = \frac{S_g + S_h} {M_{gh}}, \]
where:
Note
Smaller values indicate better clustering.
| Measure | Main idea | Better value |
|---|---|---|
| Within-cluster sum of squares | compactness around centroids | low |
| Elbow method | diminishing returns in WSS | visible elbow |
| Average silhouette | compactness and separation | high |
| Calinski-Harabasz | between / within variation | high |
| Davies-Bouldin | scatter / separation | low |
Warning
Internal measures are helpful, but they do not replace interpretation.
Sometimes we have reference labels.
For example:
Then we can compare the clustering solution with the known labels.
Important
External labels are not used to build the clustering.
They are used only to evaluate it.
The first step is often a contingency table.
Rows are clusters.
Columns are reference classes.
| cluster | Group 1 | Group 2 | Group 3 | Group 4 |
|---|---|---|---|---|
| 1 | 0 | 4 | 0 | 0 |
| 2 | 0 | 0 | 0 | 4 |
| 3 | 4 | 0 | 0 | 0 |
| 4 | 0 | 0 | 4 | 0 |
Note
Cluster labels are arbitrary: cluster 1 is not necessarily reference group 1.
The Rand index compares pairs of observations.
For each pair, it checks whether the two partitions agree:
The Adjusted Rand Index corrects the Rand index for chance agreement.
| ARI value | Interpretation |
|---|---|
| 1 | perfect agreement |
| around 0 | agreement close to random |
| below 0 | worse than random agreement |
Note
ARI is one of the most common external validation measures for clustering.
Another family of measures is based on the information shared by:
Common indexes include:
Note
These measures evaluate how much knowing the cluster assignment reduces uncertainty about the reference label.
| Measure | Main idea | Better value |
|---|---|---|
| Contingency table | direct comparison of clusters and labels | diagonal pattern, after label matching |
| Rand index | pairwise agreement | high |
| Adjusted Rand Index | pairwise agreement corrected for chance | high |
| Normalized Mutual Information | shared information between partitions | high |
| Purity | dominant reference class inside clusters | high, but can be misleading |
Warning
External measures are useful when labels exist, but clustering is often exploratory: labels may not represent the target structure.
There is rarely a single correct number of clusters.
Useful criteria include:
Important
Choosing \(k\) is both a statistical and substantive decision.
A practical workflow is:
Note
Do not rely on a single index mechanically. Different indexes may prefer different values of \(k\).
Cluster evaluation should combine three perspectives.
| Perspective | Question |
|---|---|
| Geometry | Are clusters compact and separated? |
| Stability | Does the solution persist under small changes? |
| Interpretation | Do the clusters make substantive sense? |
Important
A good clustering solution is not only one with a good index.
It should also be stable, interpretable, and relevant to the problem.