outline

 

spectral clustering

 

association-based distances for mixed data

 

taking into account continuous/categorical interactions

 

example and future work

 

cluster analysis

the general aim is to find homogeneous groups of observations

  • observations from the same group are similar to each other

  • observations from different groups are not similar to each other

  • multiple clustering approaches exist

clustering approaches

  • If a cluster structure does underlie the observations at hand, any approach will work

  • partitioning

  • agglomerative

  • model-based

  • density-based

  • spectral

clustering approaches

If a cluster structure does underlie the observations at hand, any approach will work

 

 

 

 

OR NOT?

clustering approaches

  • partitioning (K-means)

  • agglomerative

  • model-based (GMM)

  • density-based (DBSCAN)

  • spectral

spectral clustering in a nutshell

Spectral clustering: a graph partitioning problem

Graph representation

a graph representation of the data matrix \(\bf X\): the aim is then to cut it into K groups (clusters)

the affinity matrix \({\bf A}\)

the elements \(\bf w_{ij}\) of \(\bf A\) are high (low) if \(i\) and \(j\) are in the same (different) groups

\[{\bf A}=\begin{array}{c|cccc} & a & b & c & d\\ \hline a & 0 & 0 & w_{ac} & 0\\ b & 0 & 0 & w_{bc} & w_{bd}\\ c & w_{ac} & w_{bc} & 0 & w_{cd}\\ d & 0 & w_{bd} & w_{cd} & 0 \end{array}\]

Spectral clustering: making the graph easy to cut

An approximated solution to the graph partitioning problem:

  • spectral decomposition of the graph Laplacian matrix, that is a normalized version of the affinity matrix \({\bf A}\):

\[\color{dodgerblue}{\bf{L}} = {\bf D}_{r}^{-1/2}\underset{\color{grey}{\text{affinity matrix } {\bf A}}}{\color{dodgerblue}{exp(-{\bf D}^{2}(2\sigma^{2})^{-1})}}{\bf D}_{r}^{-1/2} =\color{dodgerblue}{{\bf Q}{\Lambda}{\bf Q}^{\sf T}}\]

  • \(\bf D\) is the \(n\times n\) matrix of pairwise Euclidean distances

  • the \(\sigma\) parameter dictates the number of neighbors each observation is linked to (rule of thumb: median distance to the 20th nearest neighbor)

  • diagonal terms of \(\bf A\) are set to zero: \(a_{ii}=0\) , \(i=1,\ldots,n\)

  • \({\bf D}_{r}=diag({\bf r})\) , \({\bf r}={\bf A}{\bf 1}\) and \({\bf 1}\) is an \(n\)-dimensional vector of 1’s

  • the spectral clustering of the \(n\) original objects is a \(K\)-means applied on the rows of the matrix \({\bf{\tilde Q}}\), containing the first \(K\) columns of \(\bf Q\)
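A minimal base-R sketch of these steps (the function name and the `nstart` value are illustrative; `X` is a numeric data matrix and `K` the number of clusters):

```r
spectral_clustering <- function(X, K, nn = 20) {
  D <- as.matrix(dist(X))                                   # pairwise Euclidean distances
  sigma <- median(apply(D, 1, function(d) sort(d)[nn + 1])) # rule of thumb: median distance to the 20th NN
  A <- exp(-D^2 / (2 * sigma^2))                            # affinity matrix
  diag(A) <- 0                                              # a_ii = 0
  r <- rowSums(A)                                           # r = A 1
  L <- diag(1 / sqrt(r)) %*% A %*% diag(1 / sqrt(r))        # L = D_r^{-1/2} A D_r^{-1/2}
  Q <- eigen(L, symmetric = TRUE)$vectors[, 1:K]            # first K eigenvectors (Q tilde)
  Q <- Q / sqrt(rowSums(Q^2))                               # row normalization as in Ng, Jordan and Weiss (2001)
  kmeans(Q, centers = K, nstart = 25)$cluster               # K-means on the rows of Q tilde
}
```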

Spectral clustering: solution and performance

Recent benchmarking studies (Murugesan, Cho, and Tortora, 2021): SC works well, particularly in the case of non-convex and overlapping clusters

SC for mixed data?

the distance measure definition for mixed data is key

intuition

2 continuous variables: add up by-variable (absolute value or squared) differences


intuition

2 continuous and 1 categorical variables

intuition

one might consider purple and blue closer than, e.g., purple and yellow

desirable properties

Multivariate Additivity

Let \(\mathbf{x}_i=\left(x_{i1}, \dots, x_{iQ}\right)\) denote a \(Q-\)dimensional vector. A distance function \(d\left(\mathbf{x}_i,\mathbf{x}_\ell\right)\) between observations \(i\) and \(\ell\) is multivariate additive if

\[ d\left(\mathbf{x}_i,\mathbf{x}_\ell\right)=\sum_{j=1}^{Q} d_j\left(\mathbf{x}_i,\mathbf{x}_\ell\right), \]

where \(d_j\left(\mathbf{x}_i,\mathbf{x}_\ell\right)\) denotes the \(j-\)th variable-specific distance.

  • Manhattan distance satisfies the additivity property; the Euclidean distance does not

If additivity holds, by-variable distances are added together: they should be on equivalent scales

Commensurability

Let \({\boldsymbol X}_i =\left(X_{i1}, \dots, X_{iQ}\right)\) denote a \(Q-\)dimensional random variable corresponding to an observation \(i\). Furthermore, let \(d_{j}\) denote the distance function corresponding to the \(j-\)th variable. We have commensurability if, for all \(j\), and \(i \neq \ell\),

\[ E[d_{j}({ X}_{ij}, {X}_{\ell j})] = c, \]

where \(c\) is some constant.

If the multivariate distance function \(d(\cdot,\cdot)\) satisfies additivity and commensurability,
ad hoc distance functions can be used for each variable and then aggregated:

 

then

one can pick the appropriate \(d_{j}(\cdot,\cdot)\), given the nature of \(X_{j}\)

  • well suited in the mixed data case

mixed-data setup

a mixed data set

  • \(I\) observations described by \(Q\) variables, \(Q_{n}\) numerical and \(Q_{c}\) categorical

  • the \(I\times Q\) data matrix \({\bf X}=\left[{\bf X}_{n},{\bf X}_{c}\right]\) is column-wise partitioned

A formulation for mixed distance between observations \(i\) and \(\ell\):

\[\begin{eqnarray}\label{genmixeddist_formula} d\left(\mathbf{x}_i,\mathbf{x}_\ell\right)&=& \sum_{j_n=1}^{Q_n} d_{j_n}\left(\mathbf{x}^n_i,\mathbf{x}^n_\ell\right)+ \sum_{j_c=1}^{Q_c} d_{j_c}\left(\mathbf{x}^c_i,\mathbf{x}^c_\ell\right)\\ &=& \sum_{j_n=1}^{Q_n} w_{j_n} \delta^n_{j_n}\left(\mathbf{x}^n_i,\mathbf{x}^n_\ell\right)+ \sum_{j_c=1}^{Q_c} w_{j_c}\delta^c_{j_c}\left(\mathbf{x}^c_i,\mathbf{x}^c_\ell\right) \end{eqnarray}\]

numeric case

  • \(\delta^n_{j_n}\) is a function quantifying the dissimilarity between observations on the \(j_n-\)th numerical variable

  • \(w_{j_n}\) is a weight for the \(j_n-\)th variable.

categorical case

  • \(\delta^c_{j_c}\) is a function quantifying the dissimilarity between the categories chosen by subjects \(i\) and \(\ell\) for the \(j_c-\)th categorical variable

  • \(w_{j_c}\) is a weight for the \(j_c-\)th variable
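A minimal base-R sketch of this additive construction (the function name is mine; per-variable Manhattan and matching dissimilarities stand in for whichever \(\delta\)'s one chooses, and the commensurable option simply equalizes each variable's average contribution):

```r
mixed_distance <- function(df, commensurable = TRUE) {
  n <- nrow(df)
  D <- matrix(0, n, n)
  for (j in seq_along(df)) {
    x <- df[[j]]
    if (is.numeric(x)) {
      Dj <- abs(outer(x, x, "-"))                  # numeric variable: L1 differences
    } else {
      xc <- as.character(x)
      Dj <- 1 * outer(xc, xc, "!=")                # categorical variable: 0/1 mismatch
    }
    w_j <- if (commensurable) 1 / mean(Dj[upper.tri(Dj)]) else 1  # equalize the average contribution
    D <- D + w_j * Dj                              # additive aggregation with weight w_j
  }
  D
}
```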

association-based (AB) distances?

not all differences weigh the same

distances for categorical data

the general (delta) framework (van de Velden et al., 2024)

Let \({\bf Z}=\left[{\bf Z}_{1},{\bf Z}_{2},\ldots,{\bf Z}_{Q_c}\right]\) be the one-hot encoding of \({\bf X}_{c}\)

The pair-wise distances between categorical observations are given by

\[{\bf D}_{c}={\bf Z}{\bf \Delta}{\bf Z}^{\sf T}= \left[\begin{array}{ccc} {\bf Z}_{1} & \dots & {\bf Z}_{Q_{c}} \end{array} \right]\left[\begin{array}{ccc} {\bf\Delta}_1 & & \\ & \ddots &\\ & & {\bf\Delta}_{Q_{c}} \end{array} \right] \left[ \begin{array}{c} {\bf Z}_{1}^{\sf T}\\ \vdots \\ {\bf Z}_{Q_{c}}^{\sf T} \end{array} \right]\]

  • the definition of \({\bf \Delta}\) determines the distance in use

  • if the off-diagonal terms of the \(\Delta_{j}\)’s only depend on the \(j^{th}\) variable, then \({\bf D}_{c}\) is independence-based

  • if the off-diagonal terms of the \(\Delta_{j}\)’s depend on the \(j^{th}\) variable AND on the relation of variable \(j\) with the other \(Q_{c}-1\) variables, then \({\bf D}_{c}\) is association-based
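As an illustration of the mechanics only, a base-R sketch that builds \({\bf Z}\), a block-diagonal \({\bf \Delta}\), and \({\bf D}_{c}\); the matching (independence-based) \(\Delta_{j}\)'s and the function name are just placeholders:

```r
# D_c = Z %*% Delta %*% t(Z), with Delta block-diagonal;
# here each Delta_j is the simple matching dissimilarity between categories
categorical_distance <- function(Xc) {
  Z <- do.call(cbind, lapply(Xc, function(x) {
    Zj <- model.matrix(~ x - 1)            # indicator (one-hot) columns for one variable
    colnames(Zj) <- levels(x)
    Zj
  }))
  q   <- vapply(Xc, nlevels, integer(1))   # number of categories per variable
  off <- cumsum(c(0, q))
  Delta <- matrix(0, sum(q), sum(q))       # block-diagonal Delta
  for (j in seq_along(q)) {
    idx <- (off[j] + 1):(off[j] + q[j])
    Delta[idx, idx] <- 1 - diag(q[j])      # Delta_j: 0 on the diagonal, 1 off
  }
  Z %*% Delta %*% t(Z)                     # pairwise categorical distances D_c
}
```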

\(\Delta_{j}\)’s for association-based distances

the matrix of co-occurrence proportions is

\[ {\bf P} =\frac{1}{I} \begin{bmatrix} {\bf Z}_{1}^{\sf T}{\bf Z}_{1} & {\bf Z}_{1}^{\sf T}{\bf Z}_{2}&\ldots &{\bf Z}_{1}^{\sf T}{\bf Z}_{Q_{c}}\\ \vdots & \ddots & & \vdots \\ {\bf Z}_{Q_{c}}^{\sf T}{\bf Z}_{1} & {\bf Z}_{Q_{c}}^{\sf T}{\bf Z}_{2}&\ldots &{\bf Z}_{Q_{c}}^{\sf T}{\bf Z}_{Q_{c}} \end{bmatrix} \]

  • \({\bf R} = {\bf P}_{d}^{-1}\left({\bf P}-{\bf P}_{d}\right)\), with \({\bf P}_{d}=diag({\bf P})\), is a block matrix such that

  • the general off-diagonal block is \({\bf R}_{ij}\) (\(q_{i}\times q_{j}\))

  • the \(a^{th}\) row of \({\bf R}_{ij}\), \({\bf r}^{ij}_{a}\), is the conditional distribution of the \(j^{th}\) variable, given the \(a^{th}\) category of the \(i^{th}\) variable
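A base-R sketch of one off-diagonal block \({\bf R}_{ij}\) (the helper name is mine; `xi` and `xj` are two factor variables):

```r
# R_ij: row a is the conditional distribution of variable j given category a of variable i
conditional_profiles <- function(xi, xj) {
  P_ij <- table(xi, xj) / length(xi)       # block of co-occurrence proportions
  sweep(P_ij, 1, rowSums(P_ij), "/")       # row-normalize by the marginals of variable i
}
```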

Total variation distance (TVD): pair-wise dissimilarity between categories (Le and Ho, 2005)

consider any pair of categories \(a\) and \(b\) of the \(i^{th}\) categorical variable; their overall dissimilarity is \(\delta^{i}(a,b)\), that is \[\delta^{i}(a,b)=\sum_{j\neq i}w_{ij}\Phi^{ij}({\bf r}^{ij}_{a},{\bf r}^{ij}_{b})\] upon defining \[\Phi^{ij}({\bf r}^{ij}_{a},{\bf r}^{ij}_{b})=\frac{1}{2}\sum_{\ell=1}^{q_{j}}|{r}^{ij}_{\ell a}-{r}^{ij}_{\ell b}|;\] \(\Phi^{ij}(\cdot,\cdot)\) corresponds to the total variation distance between two discrete probability distributions
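Building on the sketch above, the TVD between two category profiles and its aggregation over the remaining categorical variables (equal weights here, for simplicity; the helper names are mine) could look like this:

```r
# total variation distance between the conditional profiles of categories a and b
tvd <- function(r_a, r_b) 0.5 * sum(abs(r_a - r_b))

# delta^i(a, b): aggregate over the remaining categorical variables (equal weights)
delta_i <- function(Xc, i, a, b) {
  others <- setdiff(seq_along(Xc), i)
  mean(sapply(others, function(j) {
    R_ij <- conditional_profiles(Xc[[i]], Xc[[j]])
    tvd(R_ij[a, ], R_ij[b, ])
  }))
}
```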

independence- vs association-based

independence-based pairwise distance

No inter-variable relations are considered

  • in the continuous case: Euclidean or Manhattan distances

  • in the categorical case: Hamming (matching) distance (among MANY others)

  • in the mixed data case: Gower dissimilarity

 

 

association-based pairwise distance

  • differences in line with the inter-variable association/correlation are downweighted

AB spectral clustering for mixed data

naive spectral clustering for mixed data

the SC for mixed data proposed by Mbuga and Tortora (2021) is the NJW procedure for SC (Ng, Jordan, and Weiss, 2001) applied to the convex combination of the numerical and categorical distances:

\[{\bf D}^{*} =\alpha{\bf {D}}_{n}^{*}+(1-\alpha){\bf {D}}^{*}_{c}\] where

  • for continuous variables: \({\bf D}^{*}_{n}\) is the Euclidean distance

  • for categorical variables: \({\bf D}^{*}_{c}\) is the Hamming distance

  • \(\alpha\) can be tuned; its default is \(\alpha=0.5\)
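A minimal base-R sketch of this convex combination (the function name is mine; `Xn` is the numeric block, `Xc` a data frame of factors):

```r
# D* = alpha * Dn* + (1 - alpha) * Dc*
naive_mixed_distance <- function(Xn, Xc, alpha = 0.5) {
  Dn <- as.matrix(dist(Xn))                      # Euclidean distances on the continuous block
  Dc <- Reduce(`+`, lapply(Xc, function(x) {     # Hamming: count of category mismatches
    xc <- as.character(x)
    1 * outer(xc, xc, "!=")
  }))
  alpha * Dn + (1 - alpha) * Dc
}
```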

from SC for mixed data to AB SC for mixed data

building upon the proposal by Mbuga and Tortora (2021), we replace the independence-based \({\bf D}^{*}_{n}\) and \({\bf D}^{*}_{c}\) with an association-based version. In particular:

  • for continuous variables: \({\bf D}_{n}\) is the commensurable whitened Manhattan distance (the L1 equivalent of the Mahalanobis distance)

  • for categorical variables: \({\bf D}_{c}\) is the commensurable total variation distance

Since additivity and commensurability hold, the considered general distance becomes:

\[{\bf D} ={\bf {D}}_{n}+{\bf {D}}_{c}\]
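A sketch of the continuous-block distance; the specific whitening used here (inverse Cholesky factor of the covariance) is one common choice and is my assumption, as is the helper name:

```r
# "whitened" Manhattan distances: decorrelate the continuous variables,
# then take L1 distances (an L1 analogue of the Mahalanobis idea)
whitened_manhattan <- function(Xn) {
  Xn <- scale(Xn, center = TRUE, scale = FALSE)  # center the columns
  W  <- solve(chol(cov(Xn)))                     # whitening matrix
  as.matrix(dist(Xn %*% W, method = "manhattan"))
}

# the combined association-based distance is then simply additive:
# D <- whitened_manhattan(Xn) + Dc    # with Dc a commensurable TVD-based distance
```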

interactions?

  • both \({\bf D}_{n}\) and \({\bf D}_{c}\) take into account the inter-variable relations within the two blocks of variables

  • the interaction between the two blocks of variables is not taken into account

Aim

define \(\Delta_{int}\) so that it accounts for the interaction

  • the general entry of the \(j^{th}\) diagonal block, \(\delta_{int}^{j}(a,b)\), accounts for the interaction by measuring how well the continuous variables discriminate between the observations choosing category \(a\) and those choosing category \(b\) of the \(j^{th}\) categorical variable
  • a straightforward option is to compare the continuous variable means, conditional on \(a\) and \(b\)

    • not suited for non-spherical clusters
  • consider the computation of \(\delta_{int}^{j}(a,b)\) as a two-class (\(a/b\)) classification problem, with the continuous variables as predictors
    • use a distance-based classifier: nearest neighbors

NN based computation of \(\delta_{int}^{j}(a,b)\)

\(\Delta_{int}^{j}\) computation

  • consider \({\bf D}_{n}\) and sort it row-wise (or column-wise) to identify the neighbors of each observation

  • set a proportion of neighbors to consider, say \(\pi_{nn}=0.1\)

  • for each pair of categories \((a,b)\), \(a,b=1,\ldots,q_{j}\), \(a\neq b\) of the \(j^{th}\) categorical variable:

  • classify the observations using the prior-corrected decision rule

    \[ \text{if $i$ is such that }\ \ \ \frac{\hat{\pi}_{nn}(a)}{\hat{\pi}(a)}\geq\frac{\hat{\pi}_{nn}(b)}{\hat{\pi}(b)} \ \ \ \text{ then assign $i$ to class $a$ else to class $b$} \]

  • compute \(\delta_{int}^{j}(a,b)\) as the balanced accuracy (average of class-wise sensitivities) \[ \delta_{int}^{j}(a,b)=\frac{1}{2}\left(\frac{\texttt{true } a}{\texttt{true } a + \texttt{false }a}+\frac{\texttt{true } b}{\texttt{true } b + \texttt{false }b}\right) \]

finally, the pairwise dissimilarity matrix \({\bf {D}}_{int}={\bf {Z}}\Delta_{int}{\bf {Z}}^{\sf T}\) is added to \({\bf {D}}_{n}\) and \({\bf {D}}_{c}\)
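A minimal base-R sketch of this procedure for a single pair of categories (the function name, and the choice to look for neighbors only among the \(a/b\) observations, are my assumptions):

```r
# NN-based dissimilarity between categories a and b of one categorical variable xj,
# using the continuous-variable distance matrix Dn
delta_int_pair <- function(Dn, xj, a, b, pi_nn = 0.1) {
  idx   <- which(xj %in% c(a, b))                     # observations choosing a or b
  y     <- as.character(xj[idx])
  prior <- table(y) / length(y)                       # pi_hat(a), pi_hat(b)
  k     <- max(1, min(round(pi_nn * length(idx)), length(idx) - 1))
  pred  <- character(length(idx))
  for (m in seq_along(idx)) {
    d   <- Dn[idx[m], idx]                            # continuous distances to the other a/b obs.
    nb  <- order(d)[2:(k + 1)]                        # k nearest neighbors (position 1 is the obs. itself)
    pnn <- table(factor(y[nb], levels = c(a, b))) / k # pi_hat_nn(a), pi_hat_nn(b)
    # prior-corrected decision rule
    pred[m] <- if (pnn[a] / prior[a] >= pnn[b] / prior[b]) a else b
  }
  sens_a <- mean(pred[y == a] == a)                   # class-wise sensitivities
  sens_b <- mean(pred[y == b] == b)
  0.5 * (sens_a + sens_b)                             # balanced accuracy
}

# looping delta_int_pair() over all category pairs of each categorical variable fills the
# diagonal blocks of Delta_int; D_int = Z %*% Delta_int %*% t(Z) is then added to Dn and Dc
```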

computing \(\Delta_{int}\)


toy experiment

is \(\delta_{int}^{j}(a,b)\) of help?

a toy scenario

  • a latent cluster membership

  • 1 signal categorical variable partially associated with the latent clusters

  • 4 categorical variables: 1 signal, 3 noise, mutually independent

  • 2 continuous variables, unrelated to the latent clusters, but with discriminant power for some categories of the signal categorical variable

compared methods

  • Gower dissimilarity: a straightforward option

  • naive SC for mixed-type data (Mbuga and Tortora, 2021)

  • GUDMM (Mousavi and Sehhati, 2023): an entropy-based approach that takes into account inter- and intra-variable relations

  • unbiased association-based approach (van de Velden et al., 2025): association-based, but no interactions considered

toy experiment

Final considerations and future work

  • association-based measures aim to go beyond match/mismatch of categories

  • when the signal is limited to a few variables, retrieving information from the continuous/categorical interaction may be useful

  • measuring interactions via a non-parametric, NN-based approach is suitable for non-convex/oddly shaped clusters

  • commensurability makes the by-variable contributions to distance even

  • computationally demanding (but it can be made bearable)

  • \(\pi_{nn}\) tuning and regularization of \(\delta_{int}\)’s to reduce variability

an R package to compute distances: anydist?

an R package to compute distances: manydist! (it’s on CRAN!)

main references

Le, S. Q. and T. B. Ho (2005). “An association-based dissimilarity measure for categorical data”. In: Pattern Recognition Letters 26.16, pp. 2549-2557.

Mbuga, F. and C. Tortora (2021). “Spectral Clustering of Mixed-Type Data”. In: Stats 5.1, pp. 1-11.

Mousavi, E. and M. Sehhati (2023). “A Generalized Multi-Aspect Distance Metric for Mixed-Type Data Clustering”. In: Pattern Recognition, p. 109353.

Murugesan, N., I. Cho, and C. Tortora (2021). “Benchmarking in cluster analysis: a study on spectral clustering, DBSCAN, and K-Means”. In: Conference of the International Federation of Classification Societies. Springer, pp. 175-185.

Ng, A., M. Jordan, and Y. Weiss (2001). “On spectral clustering: analysis and an algorithm”. In: Advances in Neural Information Processing Systems 14, p. 849.

Velden, M. van de, A. Iodice D’Enza, A. Markos, et al. (2024). “A general framework for implementing distances for categorical variables”. In: Pattern Recognition 153, p. 110547.

Velden, M. van de, A. Iodice D’Enza, A. Markos, et al. (2025). “A general framework for unbiased mixed-variables distances”. In: Journal of Computational and Graphical Statistics (second-round review).