outline
cluster analysis
the general aim is to find homogeneous groups of observations
observations from the same group are similar to each other
observations from different groups are not similar to each other
multiple clustering approaches exist
clustering approaches
partitioning
agglomerative
model-based
density-based
spectral
If a cluster structure does underlie the observations at hand, any approach will work
OR NOT?
clustering approaches
illustrations: partitioning (K-means), agglomerative, model-based (GMM), density-based (DBSCAN), spectral
Spectral clustering: a graph partitioning problem
Graph representation
a graph representation of the data matrix \({\bf X}\): the aim is then to cut it into \(K\) groups (clusters)
the affinity matrix \({\bf A}\)
the elements \(w_{ij}\) of \({\bf A}\) are high (low) if \(i\) and \(j\) are in the same (different) group
|   | a | b | c | d |
|---|---|---|---|---|
| a | 0 | 0 | w_ac | 0 |
| b | 0 | 0 | w_bc | w_bd |
| c | w_ca | w_cb | 0 | w_cd |
| d | 0 | w_db | w_dc | 0 |
Spectral clustering: making the graph easy to cut
An approximate solution to the graph partitioning problem:
\[{\bf L} = {\bf D}_{r}^{-1/2}\underset{\text{affinity matrix } {\bf A}}{\exp\left(-{\bf D}^{2}(2\sigma^{2})^{-1}\right)}{\bf D}_{r}^{-1/2} ={\bf Q}{\bf \Lambda}{\bf Q}^{\sf T}\]
let \({\bf D}\) be the \(n\times n\) matrix of pairwise Euclidean distances
the \(\sigma\) parameter dictates the number of neighbors each observation is linked to (rule of thumb: median distance to the 20th nearest neighbor)
diagonal terms of \(\bf A\) are set to zero: \(a_{ii}=0\) , \(i=1,\ldots,n\)
\({\bf D}_{r}=\mathrm{diag}({\bf r})\), where \({\bf r}={\bf A}{\bf 1}\) and \({\bf 1}\) is an \(n\)-dimensional vector of ones
Spectral clustering: solution and performance
the solution: embed the observations via the \(K\) leading eigenvectors in \({\bf Q}\) (rows normalized to unit length) and apply K-means to the embedding
Recent benchmarking studies (Murugesan et al., 2021): SC works well, particularly in the case of non-convex and overlapping clusters
SC for mixed data?
the distance measure definition for mixed data is key
intuition
2 continuous variables: add up by-variable (absolute value or squared) differences
2 continuous and 1 categorical variable: one might consider purple and blue closer than, e.g., purple and yellow
desirable properties
Multivariate Additivity
Let \(\mathbf{x}_i=\left(x_{i1}, \dots, x_{iQ}\right)\) denote a \(Q-\)dimensional vector. A distance function \(d\left(\mathbf{x}_i,\mathbf{x}_\ell\right)\) between observations \(i\) and \(\ell\) is multivariate additive if
\[ d\left(\mathbf{x}_i,\mathbf{x}_\ell\right)=\sum_{j=1}^{Q} d_j\left(\mathbf{x}_i,\mathbf{x}_\ell\right), \]
where \(d_j\left(\mathbf{x}_i,\mathbf{x}_\ell\right)\) denotes the \(j-\)th variable specific distance.
If additivity holds, by-variable distances are added together: they should be on equivalent scales
Commensurability
Let \({\boldsymbol X}_i =\left(X_{i1}, \dots, X_{iQ}\right)\) denote a \(Q-\)dimensional random variable corresponding to an observation \(i\). Furthermore, let \(d_{j}\) denote the distance function corresponding to the \(j-\)th variable. We have commensurability if, for all \(j\), and \(i \neq \ell\),
\[ E[d_{j}({ X}_{ij}, {X}_{\ell j})] = c, \]
where \(c\) is some constant.
If the multivariate distance function \(d(\cdot,\cdot)\) satisfies additivity and commensurability, then
ad hoc distance functions can be used for each variable and then aggregated:
one can pick the appropriate \(d_{j}(\cdot,\cdot)\), given the nature of \(X_{j}\)
- well suited in the mixed data case
mixed-data setup
a mixed data set
\(I\) observations described by \(Q\) variables, \(Q_{n}\) numerical and \(Q_{c}\) categorical
the \(I\times Q\) data matrix \({\bf X}=\left[{\bf X}_{n},{\bf X}_{c}\right]\) is column-wise partitioned
A formulation for mixed distance between observations \(i\) and \(\ell\):
\[\begin{eqnarray}\label{genmixeddist_formula} d\left(\mathbf{x}_i,\mathbf{x}_\ell\right)&=& \sum_{j_n=1}^{Q_n} d_{j_n}\left(\mathbf{x}^n_i,\mathbf{x}^n_\ell\right)+ \sum_{j_c=1}^{Q_c} d_{j_c}\left(\mathbf{x}^c_i,\mathbf{x}^c_\ell\right)\\ &=& \sum_{j_n=1}^{Q_n} w_{j_n} \delta^n_{j_n}\left(\mathbf{x}^n_i,\mathbf{x}^n_\ell\right)+ \sum_{j_c=1}^{Q_c} w_{j_c}\delta^c_{j_c}\left(\mathbf{x}^c_i,\mathbf{x}^c_\ell\right) \end{eqnarray}\]
numeric case
\(\delta^n_{j_n}\) is a function quantifying the dissimilarity between observations on the \(j_n-\)th numerical variable
\(w_{j_n}\) is a weight for the \(j_n-\)th variable.
categorical case
\(\delta^c_{j_c}\) quantifies the dissimilarity between the categories chosen by observations \(i\) and \(\ell\) on the \(j_c-\)th categorical variable
\(w_{j_c}\) is a weight for the \(j_c-\)th variable.
association-based (AB) distances?
not all differences weigh the same
distances for categorical data
the general (delta) framework (van de Velden et al., 2024)
Let \({\bf Z}=\left[{\bf Z}_{1},{\bf Z}_{2},\ldots,{\bf Z}_{Q_c}\right]\) be the one-hot encoding of \({\bf X}_{c}\)
The pair-wise distances between categorical observations are given by
\[{\bf D}_{c}={\bf Z}{\bf \Delta}{\bf Z}^{\sf T}= \left[\begin{array}{ccc} {\bf Z}_{1} & \dots & {\bf Z}_{Q_{c}} \end{array} \right]\left[\begin{array}{ccc} {\bf\Delta}_1 & & \\ & \ddots &\\ & & {\bf\Delta}_{Q_{c}} \end{array} \right] \left[ \begin{array}{c} {\bf Z}_{1}^{\sf T}\\ \vdots \\ {\bf Z}_{Q_{c}}^{\sf T} \end{array} \right]\]
the definition of \({\bf \Delta}\) determines the distance in use
if the off-diagonal terms of the \(\Delta_{j}\)’s only depend on the \(j^{th}\) variable, then \({\bf D}_{c}\) is independence-based
if the off-diagonal terms of the \(\Delta_{j}\)’s depend on the \(j^{th}\) variable AND on the relation of variable \(j\) with the other \(Q_{c}-1\) variables, then \({\bf D}_{c}\) is association-based
\(\Delta_{j}\)’s for association-based distances
the matrix of co-occurrence proportions is
\[ {\bf P} =\frac{1}{I} \begin{bmatrix} {\bf Z}_{1}^{\sf T}{\bf Z}_{1} & {\bf Z}_{1}^{\sf T}{\bf Z}_{2}&\ldots &{\bf Z}_{1}^{\sf T}{\bf Z}_{Q_{c}}\\ \vdots & \vdots &\ddots & \vdots \\ {\bf Z}_{Q_{c}}^{\sf T}{\bf Z}_{1} & {\bf Z}_{Q_{c}}^{\sf T}{\bf Z}_{2}&\ldots &{\bf Z}_{Q_{c}}^{\sf T}{\bf Z}_{Q_{c}} \end{bmatrix} \]
\({\bf R} = {\bf P}_{d}^{-1}\left({\bf P}-{\bf P}_{d}\right)\), with \({\bf P}_{d}=diag({\bf P})\), is a block matrix such that
the general off-diagonal block is \({\bf R}_{ij}\) ( \(q_{i}\times q_{j}\) )
the \(a^{th}\) row of \({\bf R}_{ij}\), \({\bf r}^{ij}_{a}\), is the conditional distribution of the \(j^{th}\) variable, given the \(a^{th}\) category of the \(i^{th}\) variable
Total variation distance (TVD): pair-wise dissimilarity between categories
consider any pair of categories \(a\) and \(b\) from the \(i^{th}\) categorical variable; their overall dissimilarity is \[\delta^{i}(a,b)=\sum_{j\neq i}^{Q_{c}}w_{ij}\,\Phi^{ij}({\bf r}^{ij}_{a},{\bf r}^{ij}_{b})\] upon defining \[\Phi^{ij}({\bf r}^{ij}_{a},{\bf r}^{ij}_{b})=\frac{1}{2}\sum_{\ell=1}^{q_{j}}|{r}^{ij}_{\ell a}-{r}^{ij}_{\ell b}|\] \(\Phi^{ij}(\cdot,\cdot)\) then corresponds to the total variation distance between two discrete probability distributions
independence- vs association-based
independence-based pairwise distance
No inter-variable relations are considered
in the continuous case: Euclidean or Manhattan distances
in the categorical case: Hamming (matching) distance (among MANY others)
in the mixed data case: Gower dissimilarity
association-based pairwise distance
inter-variable relations contribute to the pairwise distances (e.g. the TVD-based distance described above)
AB spectral clustering for mixed data
naive spectral clustering for mixed data
the SC for mixed data proposed by Mbuga and Tortora (2021) is the NJW procedure (Ng, Jordan, and Weiss, 2001) for SC applied to the convex combination of the numerical and categorical distance matrices:
\[{\bf D}^{*} =\alpha{\bf {D}}_{n}^{*}+(1-\alpha){\bf {D}}^{*}_{c}\] where
for continuous variables: \({\bf D}^{*}_{n}\) is the matrix of Euclidean distances
for categorical variables: \({\bf D}^{*}_{c}\) is the matrix of Hamming distances
\(\alpha\) can be tuned; its default is \(\alpha=0.5\)
from SC for mixed data to AB SC for mixed data
building upon the proposal by Mbuga and Tortora (2021), we replace the independence-based \({\bf D}^{*}_{n}\) and \({\bf D}^{*}_{c}\) with association-based versions. In particular:
for continuous variables: \({\bf D}_{n}\) is the commensurable whitened Manhattan distance (the L1 analogue of the Mahalanobis distance)
for categorical variables: \({\bf D}_{c}\) is the commensurable total variation distance.
Since additivity and commensurability hold, the considered general distance becomes:
\[{\bf D} ={\bf {D}}_{n}+{\bf {D}}_{c}\]
interactions?
both \({\bf D}_{n}\) and \({\bf D}_{c}\) take into account the inter-variable relations within the two blocks of variables
the interaction between the two blocks of variables is not taken into account
Aim
define \(\Delta_{int}\) so that it accounts for the interaction
a straightforward option is to compare the means of the continuous variables, conditional on \(a\) and \(b\)
NN-based computation of \(\delta_{int}^{j}(a,b)\)
\(\Delta_{int}^{j}\) computation
consider \({\bf D}_{n}\) and sort it row-wise (or column-wise) to identify the neighbors of each observation
set a proportion of neighbors to consider, say \(\pi_{nn}=0.1\)
for each pair of categories \((a,b)\), \(a,b=1,\ldots,q_{j}\), \(a\neq b\) of the \(j^{th}\) categorical variable:
classify the observations using the prior-corrected decision rule, where \(\hat{\pi}_{nn}(a)\) is the proportion of \(i\)'s nearest neighbors with category \(a\) and \(\hat{\pi}(a)\) is the overall proportion of category \(a\):
\[ \text{if $i$ is such that }\ \ \ \frac{\hat{\pi}_{nn}(a)}{\hat{\pi}(a)}\geq\frac{\hat{\pi}_{nn}(b)}{\hat{\pi}(b)} \ \ \ \text{ then assign $i$ to class $a$, else to class $b$} \]
compute \(\delta_{int}^{j}(a,b)\) as the balanced accuracy (average of class-wise sensitivities) \[ \delta_{int}^{j}(a,b)=\frac{1}{2}\left(\frac{\texttt{true } a}{\texttt{true } a + \texttt{false }a}+\frac{\texttt{true } b}{\texttt{true } b + \texttt{false }b}\right) \]
finally the pairwise dissimilarity matrix is \({\bf {D}}_{int}={\bf {Z}}\Delta_{int}{\bf {Z}}^{\sf T}\) and is added to \({\bf {D}}_{n}\) and \({\bf {D}}_{c}\)
computing \(\Delta_{int}\) (illustration)
toy experiment
is \(\delta_{int}^{j}(a,b)\) of help?
a toy scenario
a latent cluster membership
1 signal categorical variable, partially associated with the latent clusters
4 categorical variables: 1 signal, 3 noise, mutually independent
2 continuous variables, unrelated to the latent clusters, but with discriminant power for some categories of the signal categorical variable
compared methods
Gower dissimilarity: a straightforward option
naive SC for mixed-type data (Mbuga and Tortora, 2021)
GUDMM (Mousavi and Sehhati, 2023): an entropy-based approach that takes into account inter- and intra-variable relations
unbiased association-based approach (van de Velden et al., 2025): association-based, but no interactions considered
Final considerations and future work
association-based measures aim to go beyond match/mismatch of categories
when the signal is limited to a few variables, retrieving information from the continuous/categorical interactions may be useful
measuring interactions via a non-parametric, NN-based approach is suitable for non-convex/oddly shaped clusters
commensurability evens out the by-variable contributions to the distance
computationally demanding (but it can be made bearable)
\(\pi_{nn}\) tuning and regularization of \(\delta_{int}\)’s to reduce variability
main references
Le, S. Q. and T. B. Ho (2005). “An association-based dissimilarity measure for categorical data”. In: Pattern Recognition Letters 26.16, pp. 2549-2557.
Mbuga, F. and C. Tortora (2021). “Spectral Clustering of Mixed-Type Data”. In: Stats 5.1, pp. 1-11.
Mousavi, E. and M. Sehhati (2023). “A Generalized Multi-Aspect Distance Metric for Mixed-Type Data Clustering”. In: Pattern Recognition, p. 109353.
Murugesan, N., I. Cho, and C. Tortora (2021). “Benchmarking in cluster analysis: a study on spectral clustering, DBSCAN, and K-Means”. In: Conference of the International Federation of Classification Societies. Springer, pp. 175-185.
Ng, A., M. Jordan, and Y. Weiss (2001). “On spectral clustering: analysis and an algorithm”. In: Advances in Neural Information Processing Systems 14, p. 849.
Velden, M. van de, A. Iodice D’Enza, A. Markos, et al. (2024). “A general framework for implementing distances for categorical variables”. In: Pattern Recognition 153, p. 110547.
Velden, M. van de, A. Iodice D’Enza, A. Markos, et al. (2025). “A general framework for unbiased mixed-variables distances”. In: Journal of Computational and Graphical Statistics (second-round review).
CLADAG-VOS joint meeting, September 8 - 10, 2025