Intermediate Workshop of the PRIN 2022 project, 5-6 February 2026, Bologna
outline
some unsupervised learning methods take as input a dissimilarity matrix
intuition
2 continuous variables: add up by-variable (absolute value or squared) differences
intuition
2 continuous and 1 categorical variables
intuition
one might consider purple and blue closer than e.g. purple and yellow
The most commonly used dissimilarity (or distance) measures are based on by-variable differences that are then added together
When variables are correlated or associated, shared information is effectively counted multiple times
inflated dissimilarities may distort downstream unsupervised learning tasks.
The Euclidean distance \(\longrightarrow\) shared information is over-counted
The Mahalanobis distance \(\longrightarrow\) shared information is not over-counted
association-based pairwise distance
Association-based for continuous: Mahalanobis distance
Let \({\bf X}_{con}\) be an \(n\times Q_{d}\) data matrix of \(n\) observations described by \(Q_{d}\) continuous variables, and let \(\bf S\) be the sample covariance matrix; the Mahalanobis distance matrix is
\[ {\bf D}_{mah} = \left[\operatorname{diag}({\bf G})\,{\bf 1}_{n}^{\sf T} + {\bf 1}_{n}\,\operatorname{diag}({\bf G})^{\sf T} - 2{\bf G}\right]^{\odot 1/2} \] where
\([\cdot]^{\odot 1/2}\) denotes the element-wise square root
\({\bf G}=({\bf C}{\bf X}_{con}){\bf S}^{-1}({\bf C}{\bf X}_{con})^{\sf T}\) is the Mahalanobis Gram matrix
\({\bf C}={\bf I}_{n}-\tfrac{1}{n}{\bf 1}_{n}{\bf 1}_{n}^{\sf T}\) is the centering operator
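A minimal NumPy sketch of this construction, assuming a nonsingular sample covariance matrix; the function name is illustrative, not taken from the paper.

```python
import numpy as np

def mahalanobis_distance_matrix(X_con):
    """Pairwise Mahalanobis distances for an n x Q_d continuous data matrix."""
    Xc = X_con - X_con.mean(axis=0)            # centred data, C X_con
    S = np.cov(X_con, rowvar=False)            # sample covariance matrix S
    G = Xc @ np.linalg.inv(S) @ Xc.T           # Mahalanobis Gram matrix G
    g = np.diag(G)
    D2 = g[:, None] + g[None, :] - 2.0 * G     # diag(G) 1' + 1 diag(G)' - 2G
    return np.sqrt(np.clip(D2, 0.0, None))     # element-wise square root
```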
association-based pairwise distance
Association-based for categorical: total variation distance (TVD)1
The distance matrix \({\bf D}_{tvd}\) is defined using the so-called delta framework2, a general way to define categorical data distances.
Let \({\bf X}_{cat}\) be an \(n\times Q_{c}\) data matrix of \(n\) observations described by \(Q_{c}\) categorical variables.
\[ {\bf D} = {\bf Z}{\Delta}{\bf Z}^{\sf T} = \left[\begin{array}{ccc} {\bf Z}_{1} & \dots & {\bf Z}_{Q_{c}} \end{array} \right]\left[\begin{array}{ccc} {\bf\Delta}_1 & & \\ & \ddots &\\ & & {\bf\Delta}_{Q_{c}} \end{array} \right] \left[ \begin{array}{c} {\bf Z}_{1}^{\sf T}\\ \vdots \\ {\bf Z}_{Q_{c}}^{\sf T} \end{array} \right] \]
where \({\bf Z}=[{\bf Z}_1,\ldots,{\bf Z}_{Q_c}]\) is the \(n\times Q^{*}\) super-indicator matrix, with \(Q^{*}=\sum_{j=1}^{Q_c} q_j\)
\({\Delta}_j\) is the category dissimilarity matrix for variable \(j\), i.e., the \(j\)th diagonal block of the block-diagonal matrix \({\Delta}\).
setting \({\Delta}_j\) determines the categorical distance measure of choice (independent- or association-based)
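A short sketch of the delta framework under these definitions: stack the one-hot indicator blocks into the super-indicator matrix, build the block-diagonal \(\Delta\) from the per-variable category dissimilarity matrices, and form \({\bf Z}\Delta{\bf Z}^{\sf T}\). Helper and argument names are illustrative.

```python
import numpy as np
from scipy.linalg import block_diag

def delta_framework_distance(Z_blocks, Delta_blocks):
    """Z_blocks: list of n x q_j indicator matrices; Delta_blocks: list of q_j x q_j matrices."""
    Z = np.hstack(Z_blocks)                    # n x Q* super-indicator matrix
    Delta = block_diag(*Delta_blocks)          # Q* x Q* block-diagonal Delta
    return Z @ Delta @ Z.T                     # n x n categorical distance matrix

# e.g., an independence-based choice: simple matching, Delta_j = 1 - I_{q_j}
# Delta_blocks = [np.ones((q, q)) - np.eye(q) for q in (2, 3, 4)]
```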
association-based pairwise distance
Association-based for categorical: total variation distance (TVD)1 (2)
Consider the empirical joint probability distributions stored in the off-diagonal blocks of \({\bf P}\):
\[ {\bf P} = \frac{1}{n} \begin{bmatrix} {\bf Z}_1^{\sf T}{\bf Z}_1 & {\bf Z}_1^{\sf T}{\bf Z}_2 & \cdots & {\bf Z}_1^{\sf T}{\bf Z}_{Q_c} \\ \vdots & \vdots & \ddots & \vdots \\ {\bf Z}_{Q_c}^{\sf T}{\bf Z}_1 & {\bf Z}_{Q_c}^{\sf T}{\bf Z}_2 & \cdots & {\bf Z}_{Q_c}^{\sf T}{\bf Z}_{Q_c} \end{bmatrix}. \]
For each pair of variables, the conditional probability distributions of variable \(i\) given the categories of variable \(j\) (\(i,j=1,\ldots,Q_c\), \(i\neq j\)) are stored in the block matrix
\[ {\bf R} = {\bf P}_z^{-1}({\bf P} - {\bf P}_z), \]
where \({\bf P}_z = {\bf P} \odot {\bf I}_{Q^*}\), and \({\bf I}_{Q^*}\) is the \(Q^*\times Q^*\) identity matrix.
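A sketch of how \({\bf P}\), \({\bf P}_z\) and \({\bf R}\) could be computed from the super-indicator matrix \({\bf Z}\), assuming every category is observed so that the diagonal of \({\bf P}\) is invertible.

```python
import numpy as np

def joint_and_conditional_blocks(Z):
    """Z: n x Q* super-indicator matrix."""
    n = Z.shape[0]
    P = (Z.T @ Z) / n                          # Q* x Q* joint/marginal proportions
    P_z = np.diag(np.diag(P))                  # P ⊙ I: marginal proportions on the diagonal
    R = np.linalg.inv(P_z) @ (P - P_z)         # block (j, i): rows are distributions of
    return P, R                                #   variable i's categories given variable j's
```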
association-based pairwise distance
Association-based for categorical: total variation distance (TVD)1 (3)
Let \({\bf r}^{ji}_a\) and \({\bf r}^{ji}_b\) be rows \(a\) and \(b\) of \({\bf R}_{ji}\), the \((j,i)\)th off-diagonal block of \({\bf R}\). The category dissimilarity between \(a\) and \(b\) for variable \(j\) based on the total variation distance (TVD) is defined as
\[ \delta^{j}_{tvd}(a,b) = \sum_{i\neq j}^{Q_c} w_{ji} \Phi^{ji}({\bf r}^{ji}_{a},{\bf r}^{ji}_{b}) = \sum_{i\neq j}^{Q_c} w_{ji} \left[\frac{1}{2}\sum_{\ell=1}^{q_i} |{\bf r}^{ji}_{a\ell}-{\bf r}^{ji}_{b\ell}|\right], \label{ab_delta} \]
where \(w_{ji}=1/(Q_c-1)\) for equal weighting (can be user-defined).
The TVD-based dissimilarity matrix is, therefore,
\[ {\bf D}_{tvd}= {\bf Z}{\Delta}^{(tvd)}{\bf Z}^{\sf T}. \]
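A sketch of the TVD category dissimilarities \(\delta^{j}_{tvd}(a,b)\) for one variable \(j\), looping over the blocks \({\bf R}_{ji}\) with equal weights \(w_{ji}=1/(Q_c-1)\); the offset bookkeeping (`q` as the list of category counts) is illustrative.

```python
import numpy as np

def delta_tvd(R, q, j):
    """R: Q* x Q* conditional block matrix; q: list of q_i's; j: index of the target variable."""
    offsets = np.concatenate(([0], np.cumsum(q)))
    Qc = len(q)
    w = 1.0 / (Qc - 1)
    delta_j = np.zeros((q[j], q[j]))
    for i in range(Qc):
        if i == j:
            continue
        R_ji = R[offsets[j]:offsets[j + 1], offsets[i]:offsets[i + 1]]
        for a in range(q[j]):
            for b in range(q[j]):
                # total variation distance between the two conditional distributions
                delta_j[a, b] += w * 0.5 * np.abs(R_ji[a] - R_ji[b]).sum()
    return delta_j
```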
A straightforward AB-distance for mixed data is given by the convex combination of Mahalanobis and TVD distances:
\[ {\bf D}_{mix} =\frac{Q_{d}}{Q}\,{\bf D}_{mah} +\left(1-\frac{Q_{d}}{Q}\right){\bf D}_{tvd}. \]
this distance only accounts for correlations or associations among variables of the same type
no continuous–categorical interactions are considered.
define \(\Delta^{int}\), which accounts for these interactions, and use it to augment \(\Delta^{tvd}\)
\[ {\bf D}_{mix}^{(int)} = {\bf D}_{mah} + {\bf D}_{cat}^{(int)}. \]
where
\[ {\bf D}_{cat}^{(int)}={\bf Z}\tilde{\Delta}{\bf Z}^\top \] and
\[ \tilde{\Delta} = (1-\alpha)\Delta^{tvd} + \alpha \Delta^{int} \] where \(\alpha=\frac{1}{Q_{c}}\).
What is \(\Delta^{int}\)?
consider \({\bf D}_{mah}\) and sort it to identify the neighbors for each observation.
set a proportion of neighbors to consider, say \(\hat{\pi}_{nn}=0.1\)
for each pair of categories \((a,b)\), \(a,b=1,\ldots,q_{j}\), \(a\neq b\) of the \(j^{th}\) categorical variable:
classify the observations using the prior-corrected1 decision rule
\[ \text{if $i$ is such that }\ \ \ \frac{\hat{\pi}_{nn}(a)}{\hat{\pi}(a)}\geq\frac{\hat{\pi}_{nn}(b)}{\hat{\pi}(b)} \ \ \ \text{ then assign $i$ to class $a$ else to class $b$} \]
compute \(\delta_{int}^{j}(a,b)\) as the balanced accuracy2 (average of class-wise sensitivities) \[ \Phi_{int}^{j}(a,b)=\frac{1}{2}\left(\frac{\texttt{true } a}{\texttt{true } a + \texttt{false }a}+\frac{\texttt{true } b}{\texttt{true } b + \texttt{false }b}\right) \]
the balanced accuracy measures whether categories \(a\) and \(b\) are well separated or not in the neighborhood structure induced by \({\bf D}_{mah}\)
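A minimal sketch of \(\delta_{int}^{j}(a,b)\) for one pair of categories: classify each observation labelled \(a\) or \(b\) from the category mix of its nearest neighbours in \({\bf D}_{mah}\) (prior-corrected rule), then score the pair by balanced accuracy. Restricting attention to observations labelled \(a\) or \(b\), and taking \(k=\hat{\pi}_{nn}\) times their number, are assumptions of this sketch.

```python
import numpy as np

def delta_int_pair(D_mah, labels, a, b, pi_nn=0.1):
    """D_mah: n x n distance matrix; labels: length-n categories of one variable."""
    idx = np.where(np.isin(labels, [a, b]))[0]
    k = max(1, int(round(pi_nn * len(idx))))
    prior_a = np.mean(labels[idx] == a)               # prior proportion of a
    prior_b = np.mean(labels[idx] == b)               # prior proportion of b
    pred = np.empty(len(idx), dtype=object)
    for out, i in enumerate(idx):
        order = idx[np.argsort(D_mah[i, idx])]        # neighbours of i among {a, b}-labelled obs
        nn = order[order != i][:k]                    # k nearest neighbours, excluding i itself
        pi_nn_a = np.mean(labels[nn] == a)
        pi_nn_b = np.mean(labels[nn] == b)
        # prior-corrected decision rule
        pred[out] = a if pi_nn_a / prior_a >= pi_nn_b / prior_b else b
    true = labels[idx]
    sens_a = np.mean(pred[true == a] == a)            # class-wise sensitivity for a
    sens_b = np.mean(pred[true == b] == b)            # class-wise sensitivity for b
    return 0.5 * (sens_a + sens_b)                    # balanced accuracy
```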
for a generic categorical variable \(j\) with \(q_{j}\) categories, these quantities are computed for all \(\frac{q_j(q_j -1)}{2}\) pairs; e.g., for \(q_j=4\):
\[ \Delta_{int} = \begin{pmatrix} 0 & 0.94 & 0.4 & 0.39 \\ 0.94 & 0 & 0.54 & 0.55 \\ 0.4 & 0.54 & 0 & 0 \\ 0.39 & 0.55 & 0 & 0 \end{pmatrix} \]
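Putting the pieces together, a sketch of \({\bf D}_{mix}^{(int)}\): blend the per-variable TVD and interaction category dissimilarities with \(\alpha=1/Q_c\), apply the delta framework, and add the Mahalanobis part. Argument names (lists of per-variable blocks) are illustrative.

```python
import numpy as np
from scipy.linalg import block_diag

def mixed_interaction_distance(D_mah, Z_blocks, Delta_tvd_blocks, Delta_int_blocks):
    Qc = len(Z_blocks)
    alpha = 1.0 / Qc
    Delta_tilde = block_diag(*[(1 - alpha) * Dt + alpha * Di
                               for Dt, Di in zip(Delta_tvd_blocks, Delta_int_blocks)])
    Z = np.hstack(Z_blocks)                    # n x Q* super-indicator matrix
    D_cat_int = Z @ Delta_tilde @ Z.T          # interaction-augmented categorical part
    return D_mah + D_cat_int                   # D_mix^(int) = D_mah + D_cat^(int)
```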
Let \({\bf X}=\left[{\bf X}_{con}\;{\bf X}_{cat} \right]\) be a mixed data matrix with \(n\) observations described by \(Q_{d}\) continuous and \(Q_{c}\) categorical variables, respectively, and let \({\bf x}_{i}=\left[{\bf x}_{i_{con}}\;{\bf x}_{i_{cat}}\right]\) be the \(i^{th}\) observation
We build upon the following results
first result
The distribution of \({\bf x}_{i}\) can be factorized as
\[ f({\bf x}_{i_{con}},{\bf x}_{i_{cat}})=f({\bf x}_{i_{con}})\,f({\bf x}_{i_{cat}}\mid{\bf x}_{i_{con}}) \]
second result
starting from any distance from a cluster center \(d({\bf x}_{i},{\bf c}_k)\), it is possible to construct a probabilistic clustering model1 2
\[ f({\bf x}_{i};{\bf c}_k, {\bf S}_k)=g({\bf c}_k, {\bf S}_k)\exp\left(-d({\bf x}_{i},{\bf c}_k)\right) \] where \(g({\bf c}_k, {\bf S}_k)\) is a normalizing constant that does not depend on \({\bf x}_{i}\)
third result
the dissimilarity between \({\bf x}_{i_{con}}\) and a generic cluster \(k\) with center \({\bf c}_k\) and covariance matrix \({\bf S}_k\) is1
\[ d({\bf x}_{i_{con}},{\bf c}_k)=\log(M_{k} f({\bf x}_{i_{con}};{\bf c}_k, {\bf S}_k)^{-1}) \]
\(f(\cdot)\) is a symmetric probability density function.
\(M_k\) is the maximum of the density function
that is, if \({\bf x}_{i_{con}}={\bf c}_k\) then \(d({\bf x}_{i_{con}},{\bf c}_k)=0\).
replace \({\bf c}_{k}\) with a generic observation \(i'\)
\[ d({\bf x}_{i_{con}},{\bf x}_{i'_{con}})=\log(M\, f({\bf x}_{i_{con}};{\bf x}_{i'_{con}}, {\bf S})^{-1}) \]
\(f(\cdot)\) is a symmetric probability density function.
\(M\) is the maximum of the density function
proposition 1
If \(f({\bf x}_{i_{con}}\mid{\bf x}_{i'_{con}}, {\bf S})\) is the multivariate normal density, then
\[ d({\bf x}_{i_{con}},{\bf x}_{i'_{con}})=\frac{1}{2}({\bf x}_{i_{con}}-{\bf x}_{i'_{con}}) {\bf S}^{-1}({\bf x}_{i_{con}}-{\bf x}_{i'_{con}})^{\sf T} \]
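As a quick sanity check, with the Gaussian density the maximum is \(M=(2\pi)^{-Q_{d}/2}|{\bf S}|^{-1/2}\), attained at \({\bf x}_{i_{con}}={\bf x}_{i'_{con}}\), so
\[ \log\frac{M}{f({\bf x}_{i_{con}};{\bf x}_{i'_{con}},{\bf S})} = \log\frac{(2\pi)^{-Q_{d}/2}|{\bf S}|^{-1/2}}{(2\pi)^{-Q_{d}/2}|{\bf S}|^{-1/2}\exp\left(-\tfrac{1}{2}({\bf x}_{i_{con}}-{\bf x}_{i'_{con}}){\bf S}^{-1}({\bf x}_{i_{con}}-{\bf x}_{i'_{con}})^{\sf T}\right)} = \tfrac{1}{2}({\bf x}_{i_{con}}-{\bf x}_{i'_{con}}){\bf S}^{-1}({\bf x}_{i_{con}}-{\bf x}_{i'_{con}})^{\sf T}. \]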
categorical analogue
consider the \(Q_{c}\)-dimensional categorical vector \({\bf x}_{i_{cat}}\); for its \(j^{th}\) element it holds that
\[ p(x_{ij_{cat}}=a| x_{i'j_{cat}}=b) = \left[\delta^{j}(a,b)\right]^{-1} \]
where \(a\) and \(b\) are two general categories of the \(j^{th}\) variable
For the whole vector, assuming independence across the categorical variables, it results that \[ p({\bf x}_{i_{cat}}\mid {\bf x}_{i'_{cat}}) = \prod_{j=1}^{Q_c} p(x_{ij_{cat}}=a_{j}\mid x_{i'j_{cat}}=b_{j})=\prod_{j=1}^{Q_c}\left[\delta^{j}(a_{j},b_{j})\right]^{-1} \]
proposition 2
using the previous results, the dissimilarity between two mixed-data observations \({\bf x}_i\) and \({\bf x}_{i'}\) can be measured as
\[ d({\bf x}_{i}, {\bf x}_{i'})=\frac{1}{2}({\bf x}_{i_{con}}-{\bf x}_{i'_{con}}) {\bf S}^{-1}({\bf x}_{i_{con}}-{\bf x}_{i'_{con}})^{\sf T}-\log(p({\bf x}_{i_{cat}}|{\bf x}_{i_{con}}, {\bf x}_{i'_{con}}, {\bf x}_{i'_{cat}} )). \]
Spectral clustering: a graph partitioning problem
Graph representation
a graph representation of the data matrix \(\bf X\): the aim is then to cut it into \(K\) groups (clusters)
the affinity matrix \({\bf A}\)
the elements \(w_{ij}\) of \(\bf A\) are high (low) if \(i\) and \(j\) are in the same (different) groups
|   | a | b | c | d |
|---|---|---|---|---|
| a | 0 | 0 | \(w_{ac}\) | 0 |
| b | 0 | 0 | \(w_{bc}\) | \(w_{bd}\) |
| c | \(w_{ca}\) | \(w_{cb}\) | 0 | \(w_{cd}\) |
| d | 0 | \(w_{db}\) | \(w_{dc}\) | 0 |
Spectral clustering: making the graph easy to cut
An approximated solution to the graph partitioning problem:
\[\color{dodgerblue}{\bf{L}} = {\bf D}_{r}^{-1/2}\underset{\color{grey}{\text{affinity matrix } {\bf A}}}{\color{dodgerblue}{\exp(-{\bf D}^{2}(2\sigma^{2})^{-1})}}{\bf D}_{r}^{-1/2} =\color{dodgerblue}{{\bf Q}{\Lambda}{\bf Q}^{\sf T}}\]
\({\bf D}\) is the \(n\times n\) matrix of pairwise Euclidean distances
the \(\sigma\) parameter dictates the number of neighbors each observation is linked to (rule of thumb: median distance to the 20th nearest neighbor)
diagonal terms of \(\bf A\) are set to zero: \(a_{ii}=0\) , \(i=1,\ldots,n\)
\({\bf D}_{r}=\operatorname{diag}({\bf r})\), \({\bf r}={\bf A}{\bf 1}\), and \({\bf 1}\) is an \(n\)-dimensional vector of 1’s
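A minimal sketch of this recipe (Ng–Jordan–Weiss style), using NumPy and scikit-learn's `KMeans` for the final step; the row-normalisation of the eigenvector matrix is a standard step assumed here rather than stated on the slide.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(D, K, sigma=None):
    """D: n x n pairwise distance matrix; K: number of clusters."""
    n = D.shape[0]
    if sigma is None:
        # rule of thumb from the slide: median distance to the 20th nearest neighbour
        sigma = np.median(np.sort(D, axis=1)[:, min(20, n - 1)])
    A = np.exp(-D**2 / (2.0 * sigma**2))      # affinity matrix
    np.fill_diagonal(A, 0.0)                  # a_ii = 0
    r = A.sum(axis=1)                         # r = A 1
    Dr_inv_sqrt = np.diag(1.0 / np.sqrt(r))   # D_r^{-1/2}
    L = Dr_inv_sqrt @ A @ Dr_inv_sqrt         # normalised matrix L = Q Lambda Q^T
    _, vecs = np.linalg.eigh(L)
    Q = vecs[:, -K:]                          # eigenvectors of the K largest eigenvalues
    Q = Q / np.linalg.norm(Q, axis=1, keepdims=True)   # row-normalise the embedding
    return KMeans(n_clusters=K, n_init=10).fit_predict(Q)
```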
Spectral clustering: solution and performance
Recent benchmarking studies1: SC works well, particularly in the case of non-convex and overlapping clusters
toy experiment: data generation
is \(\delta_{int}^{j}(a,b)\) of help?
three continuous signal variables, three continuous Gaussian-noise variables, and three categorical variables (one of which is noise)
different dependence structures and location shifts in the signal continuous variables across the groups defined by the signal categorical variables.
the continuous signal variables are generated conditionally on the signal categorical variables
\[ X_j = \beta_{0,nm} + \sum_{k>j}^{6} \beta_{1,nm} X_k, \qquad j = 1,2,3 \] where the coefficients \(\beta_{0,nm}\) and \(\beta_{1,nm}\) take different values depending on \(m\) and \(n\), the observed categories of the two signal categorical variables.
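A simplified sketch of this generation scheme, with two signal categorical variables and placeholder sizes and coefficient values; only the signal continuous variables enter the sums here. It is meant to illustrate the conditional dependence on the observed categories \((n,m)\), not to reproduce the experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs = 300
cat1 = rng.integers(0, 2, size=n_obs)          # first signal categorical variable (m)
cat2 = rng.integers(0, 3, size=n_obs)          # second signal categorical variable (n)
noise_con = rng.normal(size=(n_obs, 3))        # three continuous Gaussian-noise variables

b0 = rng.normal(size=(2, 3))                   # category-dependent intercepts beta_{0,nm}
b1 = rng.normal(scale=0.5, size=(2, 3))        # category-dependent slopes beta_{1,nm}

X = np.zeros((n_obs, 3))
X[:, 2] = rng.normal(size=n_obs)               # last signal variable: baseline draw
for j in (1, 0):                               # each X_j depends on the X_k with k > j
    X[:, j] = b0[cat1, cat2] + b1[cat1, cat2] * X[:, j + 1:].sum(axis=1) + rng.normal(size=n_obs)
```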
toy experiment
toy experiment: compared methods
Gower dissimilarity: a straightforward option
naive SC for mixed-type data1
modified Gower2: an entropy-based approach that takes into account inter- and intra-variable relations
association-based approach3: with and without interaction
toy experiment
Final considerations and future work
association-based measures aim to go beyond match/mismatch of categories
when the signal is limited to a few variables, retrieving information from continuous/categorical interactions may be useful
measuring interactions via a non-parametric, NN-based approach is suitable for non-convex/oddly shaped clusters
computationally demanding (but it can be made bearable)
\(\pi_{nn}\) tuning and regularization of \(\delta_{int}\)’s to reduce variability
main references
Le, S. Q. and T. B. Ho (2005a). “An association-based dissimilarity measure for categorical data”. In: Pattern Recognition Letters 26.16, pp. 2549-2557.
Mbuga, F. and C. Tortora (2021a). “Spectral Clustering of Mixed-Type Data”. In: Stats 5.1, pp. 1-11.
Murugesan, N., I. Cho, and C. Tortora (2021). “Benchmarking in cluster analysis: a study on spectral clustering, DBSCAN, and K-Means”. In: Conference of the International Federation of Classification Societies. Springer, pp. 175-185.
Velden, M. van de, A. Iodice D’Enza, A. Markos, et al. (2024). “A general framework for implementing distances for categorical variables”. In: Pattern Recognition 153, p. 110547.
Velden, M. van de, A. Iodice D’Enza, A. Markos, et al. (2025a). “A general framework for unbiased mixed-variables distances”. Under second-round review at Journal of Computational and Graphical Statistics.