outline

 

dissimilarity- and distance-based unsupervised learning

 

independence-based vs association-based

 

taking into account continuous/categorical interactions

 

example and future work

 

dissimilarity- and distance-based unsupervised learning

learning from dissimilarities

some unsupervised learning methods take as input a dissimilarity matrix

 

dimension reduction: multidimensional scaling (MDS)1

 

clustering methods: hierarchical (HC) and partitioning around medoids (PAM)2

 

the dissimilarity measure of choice is key, obviously

intuition

2 continuous variables: add up by-variable (absolute value or squared) differences

intuition

2 continuous and 1 categorical variable

intuition

one might consider purple and blue closer than e.g. purple and yellow

independence-based

Most commonly used dissimilarity (or distance) measures are based on by-variable differences that are then added together

 

  • in the continuous case: Euclidean or Manhattan distances

 

  • in the categorical case: Hamming (matching) distance (among MANY others)

 

  • in the mixed data case: Gower dissimilarity index

 

no inter-variable relations are considered \(\rightarrow\) independence-based

independence-based

  • When variables are correlated or associated, shared information is effectively counted multiple times

  • inflated dissimilarities may distort downstream unsupervised learning tasks.

independence-based

The Euclidean distance \(\longrightarrow\) shared information is over-counted

association-based

The Mahalanobis distance \(\longrightarrow\) shared information is not over-counted

this is an association-based distance for continuous data

association-based pairwise distance

  • differences in line with the inter-variable association/correlation are down-weighted

Association-based for continuous: Mahalanobis distance

Let \({\bf X}_{con}\) be an \(n\times Q_{d}\) data matrix of \(n\) observations described by \(Q_{d}\) continuous variables, and let \(\bf S\) be the sample covariance matrix; the Mahalanobis distance matrix is

\[ {\bf D}_{mah} = \left[\operatorname{diag}({\bf G})\,{\bf 1}_{n}^{\sf T} + {\bf 1}_{n}\,\operatorname{diag}({\bf G})^{\sf T} - 2{\bf G}\right]^{\odot 1/2} \] where

  • \([\cdot]^{\odot 1/2}\) denotes the element-wise square root

  • \({\bf G}=({\bf C}{\bf X}_{con}){\bf S}^{-1}({\bf C}{\bf X}_{con})^{\sf T}\) is the Mahalanobis Gram matrix

  • \({\bf C}={\bf I}_{n}-\tfrac{1}{n}{\bf 1}_{n}{\bf 1}_{n}^{\sf T}\) is the centering operator
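a minimal base-R sketch of this computation, following the Gram-matrix formula above (illustrative only, not the manydist implementation; `X_con` is a toy numeric matrix):

```r
# minimal sketch: Mahalanobis distance matrix via the Gram matrix G
mahalanobis_dist <- function(X_con) {
  n  <- nrow(X_con)
  C  <- diag(n) - matrix(1 / n, n, n)     # centering operator C
  Xc <- C %*% X_con                       # centered data
  S  <- cov(X_con)                        # sample covariance matrix
  G  <- Xc %*% solve(S) %*% t(Xc)         # Mahalanobis Gram matrix
  g  <- diag(G)
  D2 <- outer(g, rep(1, n)) + outer(rep(1, n), g) - 2 * G
  sqrt(pmax(D2, 0))                       # element-wise square root (guard against rounding)
}

# usage on toy continuous data
set.seed(1)
X_con <- matrix(rnorm(20 * 3), 20, 3)
D_mah <- mahalanobis_dist(X_con)
```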

Association-based for categorical: total variation distance (TVD)1

The distance matrix \({\bf D}_{tvd}\) is defined using the so-called delta framework2, a general way to define categorical data distances.

Let \({\bf X}_{cat}\) be an \(n\times Q_{c}\) data matrix of \(n\) observations described by \(Q_{c}\) categorical variables.

\[ {\bf D} = {\bf Z}{\Delta}{\bf Z}^{\sf T} = \left[\begin{array}{ccc} {\bf Z}_{1} & \dots & {\bf Z}_{Q_{c}} \end{array} \right]\left[\begin{array}{ccc} {\bf\Delta}_1 & & \\ & \ddots &\\ & & {\bf\Delta}_{Q_{c}} \end{array} \right] \left[ \begin{array}{c} {\bf Z}_{1}^{\sf T}\\ \vdots \\ {\bf Z}_{Q_{c}}^{\sf T} \end{array} \right] \]

  • \({\bf Z}=[{\bf Z}_1,\ldots,{\bf Z}_{Q_c}]\) is the \(n\times Q^{*}\) super-indicator matrix, with \(Q^{*}=\sum_{j=1}^{Q_c} q_j\) and \(q_j\) the number of categories of variable \(j\)

  • \({\Delta}_j\) is the category dissimilarity matrix for variable \(j\), i.e., the \(j\)th diagonal block of the block-diagonal matrix \({\Delta}\).

  • setting \({\Delta}_j\) determines the categorical distance measure of choice (independence- or association-based)

Association-based for categorical: total variation distance (TVD)1 (2)

Consider the empirical joint probability distributions stored in the off-diagonal blocks of \({\bf P}\):

\[ {\bf P} = \frac{1}{n} \begin{bmatrix} {\bf Z}_1^{\sf T}{\bf Z}_1 & {\bf Z}_1^{\sf T}{\bf Z}_2 & \cdots & {\bf Z}_1^{\sf T}{\bf Z}_{Q_c} \\ \vdots & \vdots & \ddots & \vdots \\ {\bf Z}_{Q_c}^{\sf T}{\bf Z}_1 & {\bf Z}_{Q_c}^{\sf T}{\bf Z}_2 & \cdots & {\bf Z}_{Q_c}^{\sf T}{\bf Z}_{Q_c} \end{bmatrix}. \]

We refer to the conditional probability distributions for each variable \(j\) given each variable \(i\) (\(i,j=1,\ldots,Q_c\), \(i\neq j\)), stored in the block matrix

\[ {\bf R} = {\bf P}_z^{-1}({\bf P} - {\bf P}_z). \]

where \({\bf P}_z = {\bf P} \odot {\bf I}_{Q^*}\), and \({\bf I}_{Q^*}\) is the \(Q^*\times Q^*\) identity matrix.
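a hedged base-R sketch of these building blocks, assuming `X_cat` is a data frame of factors (`build_Z` is a helper introduced here for illustration, on toy data):

```r
# sketch: super-indicator matrix Z and the blocks P and R
build_Z <- function(X_cat) {
  # one indicator (dummy) block per categorical variable, column-bound
  do.call(cbind, lapply(X_cat, function(x) model.matrix(~ x - 1)))
}

X_cat <- data.frame(v1 = factor(c("a", "b", "a", "c")),
                    v2 = factor(c("x", "x", "y", "y")))
Z  <- build_Z(X_cat)
n  <- nrow(Z)

P  <- crossprod(Z) / n       # empirical joint probabilities, Z'Z / n
Pz <- diag(diag(P))          # element-wise product of P with the identity: marginals
R  <- solve(Pz) %*% (P - Pz) # conditional probabilities in the off-diagonal blocks
```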

Association-based for categorical: total variation distance (TVD)1 (3)

Let \({\bf r}^{ji}_a\) and \({\bf r}^{ji}_b\) be the rows of \({\bf R}_{ji}\), the \((j,i)\)th off-diagonal block of \({\bf R}\), corresponding to categories \(a\) and \(b\) of variable \(j\). The category dissimilarity between \(a\) and \(b\) for variable \(j\) based on the total variation distance (TVD) is defined as

\[ \delta^{j}_{tvd}(a,b) = \sum_{i\neq j}^{Q_c} w_{ji} \Phi^{ji}({\bf r}^{ji}_{a},{\bf r}^{ji}_{b}) = \sum_{i\neq j}^{Q_c} w_{ji} \left[\frac{1}{2}\sum_{\ell=1}^{q_i} |{\bf r}^{ji}_{a\ell}-{\bf r}^{ji}_{b\ell}|\right], \label{ab_delta} \]

where \(w_{ji}=1/(Q_c-1)\) for equal weighting (can be user-defined).

The TVD-based dissimilarity matrix is, therefore,

\[ {\bf D}_{tvd}= {\bf Z}{\Delta}^{(tvd)}{\bf Z}^{\sf T}. \]
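continuing the sketch, a minimal (unoptimized) construction of \(\Delta^{(tvd)}\) and \({\bf D}_{tvd}\) with equal weights \(w_{ji}=1/(Q_c-1)\), reusing `Z` and `R` from above:

```r
# sketch: TVD category dissimilarities and the resulting dissimilarity matrix
q      <- sapply(X_cat, nlevels)                         # categories per variable
blocks <- split(seq_len(sum(q)), rep(seq_along(q), q))   # column indices of each Z_j
Qc     <- length(q)

delta_blocks <- lapply(seq_len(Qc), function(j) {
  Dj <- matrix(0, q[j], q[j])
  for (i in setdiff(seq_len(Qc), j)) {
    Rji <- R[blocks[[j]], blocks[[i]], drop = FALSE]     # row a: P(variable i | variable j = a)
    for (a in seq_len(q[j])) for (b in seq_len(q[j]))
      Dj[a, b] <- Dj[a, b] + 0.5 * sum(abs(Rji[a, ] - Rji[b, ])) / (Qc - 1)
  }
  Dj
})

# block-diagonal Delta^(tvd) and the n x n TVD dissimilarity matrix
Delta_tvd <- matrix(0, sum(q), sum(q))
for (j in seq_len(Qc)) Delta_tvd[blocks[[j]], blocks[[j]]] <- delta_blocks[[j]]
D_tvd <- Z %*% Delta_tvd %*% t(Z)
```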

AB for mixed?

association-based for mixed

A straightforward AB-distance for mixed data is given by the convex combination of Mahalanobis and TVD distances:

\[ {\bf D}_{mix} =\frac{Q_{d}}{Q}\,{\bf D}_{mah} +\left(1-\frac{Q_{d}}{Q}\right){\bf D}_{tvd}. \]

  • this distance only accounts for correlations or associations among variables of the same type

  • no continuous–categorical interactions are considered.
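a tiny sketch of the combination step (`mix_distance` is an illustrative helper; `D_mah` and `D_tvd` must be computed on the same \(n\) observations):

```r
# sketch: mixed-data distance as a convex combination of the two blocks
mix_distance <- function(D_mah, D_tvd, Qd, Qc) {
  Q <- Qd + Qc
  (Qd / Q) * D_mah + (1 - Qd / Q) * D_tvd
}
```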

 

how to measure interactions?

how to measure interactions

define \(\Delta^{int}\), which accounts for the interactions, and use it to augment \(\Delta^{(tvd)}\)

  • the dissimilarity measure becomes

\[ {\bf D}_{mix}^{(int)} = {\bf D}_{mah} + {\bf D}_{cat}^{(int)}. \]

where

\[ {\bf D}_{cat}^{(int)}={\bf Z}\tilde{\Delta}{\bf Z}^{\sf T} \] and

\[ \tilde{\Delta} = (1-\alpha)\Delta^{tvd} + \alpha \Delta^{int} \] where \(\alpha=\frac{1}{Q_{c}}\).
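a hedged sketch of the blending step, with `Delta_tvd` the block-diagonal TVD matrix from before and `Delta_int` its interaction-based counterpart, built pair-by-pair as described next (helper name is illustrative):

```r
# sketch: blend the TVD and interaction-based category dissimilarities
blend_delta <- function(Delta_tvd, Delta_int, Qc) {
  alpha <- 1 / Qc
  (1 - alpha) * Delta_tvd + alpha * Delta_int
}

# D_cat_int <- Z %*% blend_delta(Delta_tvd, Delta_int, Qc) %*% t(Z)
# D_mix_int <- D_mah + D_cat_int
```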

how to measure interactions

What is \(\Delta^{int}\)?

  • the general entry of the \(j^{th}\) diagonal block, \(\delta_{int}^{j}(a,b)\), accounts for the interaction by measuring how well the continuous variables discriminate between the observations with category \(a\) and those with category \(b\) on the \(j^{th}\) categorical variable
  • consider the computation of \(\delta_{int}^{j}(a,b)\) as a two-class (\(a/b\)) classification problem, with the continuous variables as predictors
    • use a distance-based classifier: nearest neighbors

\(\Delta^{int}_{j}\) computation

  • consider \({\bf D}_{mah}\) and sort it to identify the neighbors for each observation.

  • set a proportion of neighbors to consider, say \(\hat{\pi}_{nn}=0.1\)

  • for each pair of categories \((a,b)\), \(a,b=1,\ldots,q_{j}\), \(a\neq b\) of the \(j^{th}\) categorical variable:

  • classify the observations using the prior corrected1 decision rule

    \[ \text{if $i$ is such that }\ \ \ \frac{\hat{\pi}_{nn}(a)}{\hat{\pi}(a)}\geq\frac{\hat{\pi}_{nn}(b)}{\hat{\pi}(b)} \ \ \ \text{ then assign $i$ to class $a$ else to class $b$} \]

  • compute \(\delta_{int}^{j}(a,b)\) as the balanced accuracy2 (average of class-wise sensitivities) \[ \Phi_{int}^{j}(a,b)=\frac{1}{2}\left(\frac{\texttt{true } a}{\texttt{true } a + \texttt{false }a}+\frac{\texttt{true } b}{\texttt{true } b + \texttt{false }b}\right) \]
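a hedged base-R sketch of this procedure for a single category pair \((a,b)\) of variable \(j\); the full \(\Delta^{int}_{j}\) loops this over all \(q_j(q_j-1)/2\) pairs (function and argument names are ours):

```r
# sketch: interaction-based category dissimilarity for one pair (a, b) of variable j
# D_mah : n x n distance matrix on the continuous variables
# xj    : factor holding the categories of the j-th categorical variable
# a, b  : category labels (character)
delta_int_pair <- function(D_mah, xj, a, b, pi_nn = 0.1) {
  keep  <- which(xj %in% c(a, b))                           # two-class (a/b) subproblem
  Dk    <- D_mah[keep, keep, drop = FALSE]
  yk    <- factor(as.character(xj[keep]), levels = c(a, b))
  k     <- min(max(1, ceiling(pi_nn * length(keep))), length(keep) - 1)
  prior <- prop.table(table(yk))                            # overall priors pi(a), pi(b)

  pred <- sapply(seq_along(keep), function(i) {
    nn   <- order(Dk[i, -i])[1:k]                           # k nearest neighbors of i
    freq <- prop.table(table(factor(yk[-i][nn], levels = c(a, b))))  # pi_nn(a), pi_nn(b)
    if (freq[a] / prior[a] >= freq[b] / prior[b]) a else b  # prior-corrected rule
  })

  # balanced accuracy: average of the two class-wise sensitivities
  0.5 * (mean(pred[yk == a] == a) + mean(pred[yk == b] == b))
}
```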

well separated or not

Building \(\Delta^{int}_{j}\)

for the general categorical variable \(j\) with \(q_{j}\) categories, you compute \(\delta_{int}^{j}(a,b)\) for each of the \(\frac{q_j(q_j -1)}{2}\) category pairs, filling the off-diagonal entries one pair at a time; e.g., for \(q_j=4\):

\[ \Delta_{int} = \begin{pmatrix} 0 & 0.94 & 0.4 & 0.39 \\ 0.94 & 0 & 0.54 & 0.55 \\ 0.4 & 0.54 & 0 & 0 \\ 0.39 & 0.55 & 0 & 0 \end{pmatrix} \]

Just one-way interaction?

Let \({\bf X}=\left[{\bf X}_{con}\ {\bf X}_{cat} \right]\) be a mixed data matrix with \(n\) observations described by \(Q_{d}\) continuous and \(Q_{c}\) categorical variables, respectively, and let \({\bf x}_{i}=\left[{\bf x}_{i_{con}}\ {\bf x}_{i_{cat}}\right]\) be the \(i^{th}\) observation

We build upon the following results

first result

The distribution of \({\bf x}_{i}\) can be written as

\[ f({\bf x}_{i_{con}},{\bf x}_{i_{cat}})=f({\bf x}_{i_{con}})\,f({\bf x}_{i_{cat}}\mid{\bf x}_{i_{con}}) \]

second result

starting from any distance from a cluster center \(d({\bf x}_{i},{\bf c}_k)\), it is possible to construct a probabilistic clustering model1 2

\[ f({\bf x}_{i};{\bf c}_k, {\bf S}_k)=g({\bf c}_k, {\bf S}_k)\exp\left(-d({\bf x}_{i},{\bf c}_k)\right) \] where \(g({\bf c}_k, {\bf S}_k)\) is a normalizing constant

Just one-way interaction?

third result

the dissimilarity between \({\bf x}_{i_{con}}\) and a generic cluster \(k\) with center \({\bf c}_k\) and covariance matrix \({\bf S}_k\) is1

\[ d({\bf x}_{i_{con}},{\bf c}_k)=\log(M_{k} f({\bf x}_{i_{con}};{\bf c}_k, {\bf S}_k)^{-1}) \]

  • \(f(\cdot)\) a symmetric probability density function.

  • \(M_k\) is the maximum of the density function

  • that is, if \({\bf x}_{i_{con}}={\bf c}_k\) then \(d({\bf x}_{i_{con}},{\bf c}_k)=0\).

replace \({\bf c}_{k}\) with a generic observation \(i'\)

\[ d({\bf x}_{i_{con}},{\bf x}_{i'_{con}})=\log(M\, f({\bf x}_{i_{con}};{\bf x}_{i'_{con}}, {\bf S})^{-1}) \]

  • \(f(\cdot)\) a symmetric probability density function.

  • \(M\) is the maximum of the density function

Just one-way interaction!

proposition 1

If \(f({\bf x}_{i_{con}}\mid{\bf x}_{i'_{con}}, {\bf S})\) is the multivariate normal density, it follows that

\[ d({\bf x}_{i_{con}},{\bf x}_{i'_{con}})=\frac{1}{2}({\bf x}_{i_{con}}-{\bf x}_{i'_{con}}) {\bf S}^{-1}({\bf x}_{i_{con}}-{\bf x}_{i'_{con}})^{\sf T} \]
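spelling out the intermediate step: the multivariate normal density attains its maximum at \({\bf x}_{i_{con}}={\bf x}_{i'_{con}}\), so that

\[ M = (2\pi)^{-Q_d/2}|{\bf S}|^{-1/2}, \qquad d({\bf x}_{i_{con}},{\bf x}_{i'_{con}}) = \log M - \log f({\bf x}_{i_{con}};{\bf x}_{i'_{con}},{\bf S}) = \frac{1}{2}({\bf x}_{i_{con}}-{\bf x}_{i'_{con}}){\bf S}^{-1}({\bf x}_{i_{con}}-{\bf x}_{i'_{con}})^{\sf T} \]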

categorical analogue

consider the \(Q_{c}\)-dimensional categorical vector \({\bf x}_{i_{cat}}\); for its \(j^{th}\) element, it follows that

\[ p(x_{ij_{cat}}=a\mid x_{i'j_{cat}}=b) = \left[\delta^{j}(a,b)\right]^{-1} \]

where \(a\) and \(b\) are two general categories of the \(j^{th}\) variable

For the whole vector, it follows that \[ p({\bf x}_{i_{cat}}\mid {\bf x}_{i'_{cat}}) = \prod_{j=1}^{Q_c} p(x_{ij_{cat}}=a_{j}\mid x_{i'j_{cat}}=b_{j})=\prod_{j=1}^{Q_c}\left[\delta^{j}(a_{j},b_{j})\right]^{-1} \]

Just one-way interaction! (wrap-up)

proposition 2

using the previous results, the dissimilarity between two mixed-data observations \({\bf x}_i\) and \({\bf x}_{i'}\) can be measured as

\[ d({\bf x}_{i}, {\bf x}_{i'})=\frac{1}{2}({\bf x}_{i_{con}}-{\bf x}_{i'_{con}}) {\bf S}^{-1}({\bf x}_{i_{con}}-{\bf x}_{i'_{con}})^{\sf T}-\log(p({\bf x}_{i_{cat}}|{\bf x}_{i_{con}}, {\bf x}_{i'_{con}}, {\bf x}_{i'_{cat}} )). \]

spectral clustering in a nutshell

Spectral clustering: a graph partitioning problem

Graph representation

a graph representation of the data matrix \(\bf X\): the aim is then to cut it into K groups (clusters)

the affinity matrix \({\bf A}\)

the elements \(\bf w_{ij}\) of \(\bf A\) are high (low) if \(i\) and \(j\) are in the same (different) groups

\[ {\bf A} = \begin{array}{c|cccc} & a & b & c & d \\ \hline a & 0 & 0 & w_{ac} & 0 \\ b & 0 & 0 & w_{bc} & w_{bd} \\ c & w_{ac} & w_{bc} & 0 & w_{cd} \\ d & 0 & w_{bd} & w_{cd} & 0 \end{array} \]

Spectral clustering: making the graph easy to cut

An approximate solution to the graph partitioning problem:

  • spectral decomposition of the graph Laplacian matrix, a normalized version of the affinity matrix \({\bf A}\):

\[\color{dodgerblue}{\bf{L}} = {\bf D}_{r}^{-1/2}\underset{\color{grey}{\text{affinity matrix } {\bf A}}}{\color{dodgerblue}{exp(-{\bf D}^{2}(2\sigma^{2})^{-1})}}{\bf D}_{r}^{-1/2} =\color{dodgerblue}{{\bf Q}{\Lambda}{\bf Q}^{\sf T}}\]

  • \(\bf D\) is the \(n\times n\) matrix of pairwise Euclidean distances

  • the \(\sigma\) parameter dictates the number of neighbors each observation is linked to (rule of thumb: median distance to the 20th nearest neighbor)

  • diagonal terms of \(\bf A\) are set to zero: \(a_{ii}=0\) , \(i=1,\ldots,n\)

  • \({\bf D}_{r}=diag({\bf r})\) , \({\bf r}={\bf A}{\bf 1}\) and \({\bf 1}\) is an \(n\)-dimensional vector of 1’s

  • the spectral clustering of the \(n\) original objects is a \(K\)-means applied on the rows of the matrix \({\bf{\tilde Q}}\), containing the first \(K\) columns of \(\bf Q\)
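a minimal base-R sketch of this pipeline (the row normalization of \(\tilde{\bf Q}\) before \(K\)-means follows the common Ng–Jordan–Weiss variant and is an assumption here; `D` can be any pairwise distance matrix):

```r
# sketch: spectral clustering from a pairwise distance matrix D
spectral_clust <- function(D, K, sigma = NULL) {
  if (is.null(sigma))  # rule of thumb: median distance to the 20th nearest neighbor
    sigma <- median(apply(D, 1, function(d) sort(d)[min(21, length(d))]))
  A <- exp(-D^2 / (2 * sigma^2))                  # affinity matrix
  diag(A) <- 0                                    # a_ii = 0
  Dr_is <- diag(1 / sqrt(rowSums(A)))             # D_r^{-1/2}, with r = A 1
  L <- Dr_is %*% A %*% Dr_is                      # normalized affinity ("Laplacian")
  Q <- eigen(L, symmetric = TRUE)$vectors[, 1:K]  # first K eigenvectors (Q tilde)
  Q <- Q / sqrt(rowSums(Q^2))                     # row normalization (assumption)
  kmeans(Q, centers = K, nstart = 25)$cluster     # K-means on the rows of Q tilde
}
```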

Spectral clustering: solution and performance

Recent benchmarking studies1: SC works well, particularly in the case of non-convex and overlapping clusters

experiments

toy experiment: data generation

is \(\delta_{int}^{j}(a,b)\) of help?

  • three continuous signal variables, three continuous Gaussian noise variables, and three categorical variables (one of which is noise)

  • different dependence structures and location shifts in the signal continuous variables across the groups defined by the signal categorical variables.

  • the continuous signal variables are generated conditionally on the signal categorical variables

\[ X_j = \beta_{0,nm} + \sum_{k>j}^{6} \beta_{1,nm} X_k, \qquad j = 1,2,3 \] where \(\beta_{1,nm}\) takes different values depending on \(m\) and \(n\), the observed categories of the two signal categorical variables, respectively.
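a purely illustrative sketch of a generator in this spirit (all coefficient values and group codings are hypothetical, not those of the actual experiment):

```r
# illustrative sketch of the mixed-data generating scheme described above
# (the beta values below are hypothetical placeholders)
set.seed(42)
n  <- 300
c1 <- factor(sample(1:2, n, replace = TRUE))   # signal categorical variable (categories n)
c2 <- factor(sample(1:2, n, replace = TRUE))   # signal categorical variable (categories m)
c3 <- factor(sample(1:3, n, replace = TRUE))   # noise categorical variable

grp <- as.integer(interaction(c1, c2))         # group index for each (n, m) combination
b0  <- c(-2, 0, 2, 4)[grp]                     # beta_0,nm: group-dependent location shift
b1  <- c(0.2, 0.4, 0.6, 0.8)[grp]              # beta_1,nm: group-dependent dependence

X <- matrix(rnorm(n * 6), n, 6)                # X4..X6 remain pure Gaussian noise
for (j in 3:1)                                 # signal X_j built from the X_k with k > j
  X[, j] <- b0 + b1 * rowSums(X[, (j + 1):6, drop = FALSE])

X_con <- X
X_cat <- data.frame(c1, c2, c3)
```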

toy experiment

toy experiment: compared methods

  • Gower dissimilarity: a straightforward option

  • naive SC for mixed-type data1

  • modified Gower2: an entropy-based approach that takes into account inter- and intra-variable relations

  • association-based approach3: with and without interaction

toy experiment

Final considerations and future work

  • association-based measures aim to go beyond match/mismatch of categories

  • when the signal is limited to a few variables, retrieving information from continuous/categorical interactions may be useful

  • measuring interactions via a non-parametric, NN-based approach is suitable for non-convex/oddly shaped clusters

  • computationally demanding (but it can be made bearable)

  • \(\pi_{nn}\) tuning and regularization of \(\delta_{int}\)’s to reduce variability

an R package to compute distances: anydist?

an R package to compute distances: manydist! (it’s on CRAN!)

main references

Le, S. Q. and T. B. Ho (2005a). “An association-based dissimilarity measure for categorical data”. In: Pattern Recognition Letters 26.16, pp. 2549-2557.

Mbuga, F. and C. Tortora (2021a). “Spectral Clustering of Mixed-Type Data”. In: Stats 5.1, pp. 1-11.

Murugesan, N., I. Cho, and C. Tortora (2021). “Benchmarking in cluster analysis: a study on spectral clustering, DBSCAN, and K-Means”. In: Conference of the International Federation of Classification Societies. Springer, pp. 175-185.

Velden, M. van de, A. Iodice D’Enza, A. Markos, et al. (2024). “A general framework for implementing distances for categorical variables”. In: Pattern Recognition 153, p. 110547.

Velden, M. van de, A. Iodice D’Enza, A. Markos, et al. (2025a). “A general framework for unbiased mixed-variables distances”. In: Journal of Computational and Graphical Statistics (second-round review).