Spectral clustering of mixed data via association-based distance

Alfonso Iodice D'Enza, Cristina Tortora and Francesco Palumbo

The framework

Spectral clustering (Ng, Jordan, and Weiss, 2001): cutting the graph

Graph representation

a graph representation of the data matrix \(\bf X\): the aim is then to cut it into K groups (clusters)

the affinity matrix \(\bf A\)

the elements \(w_{ij}\) of \(\bf A\) are high (low) if \(i\) and \(j\) belong to the same (different) group

          a        b        c        d
a         0        0        w_ac     0
b         0        0        w_bc     w_bd
c         w_ca     w_cb     0        w_cd
d         0        w_db     w_dc     0

Spectral clustering: making the graph easy to cut

Normalized affinity matrix

An approximated solution to the graph partitioning problem boils down to the spectral decomposition of the graph Laplacian matrix

\[\color{indianred}{\bf{L}} = {\bf I}-{\bf D}_{r}^{-1/2}\underset{\color{grey}{\text{affinity matrix } {\bf A}}}{\color{indianred}{\exp(-{\bf D}^{2}(2\sigma^{2})^{-1})}}{\bf D}_{r}^{-1/2} =\color{indianred}{{\bf Q}{\Lambda}{\bf Q}^{\sf T}}\]

  • \(\bf D\) is the \(n\times n\) matrix of pairwise Euclidean distances

  • the \(\sigma\) parameter dictates the number of neighbors each observation is linked to (rule of thumb: median distance to the 20th nearest neighbor)

  • diagonal terms of \(\bf A\) are set to zero: \(a_{ii}=0\) , \(i=1,\ldots,n\)

  • \({\bf D}_{r}=diag({\bf r})\) , \({\bf r}={\bf A}{\bf 1}\) and \({\bf 1}\) is an \(n\)-dimensional vector of 1’s

  • the spectral clustering of the \(n\) original objects is a \(K\)-means applied on the rows of the matrix \({\bf{\tilde Q}}\), containing the first \(K\) columns of \(\bf Q\)
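The pipeline above can be sketched in NumPy (a minimal illustration under stated assumptions, not the authors' implementation; the function name, the fixed \(\sigma\), and the row-normalization step before \(K\)-means, as in Ng, Jordan, and Weiss (2001), are my choices):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_clustering(X, K, sigma):
    n = X.shape[0]
    # pairwise Euclidean distances D
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(-1))
    # affinity A = exp(-D^2 / (2 sigma^2)), with zero diagonal (a_ii = 0)
    A = np.exp(-D ** 2 / (2 * sigma ** 2))
    np.fill_diagonal(A, 0.0)
    # r = A 1, D_r = diag(r); normalized Laplacian L = I - D_r^{-1/2} A D_r^{-1/2}
    r = A.sum(axis=1)
    d = 1.0 / np.sqrt(r)
    L = np.eye(n) - d[:, None] * A * d[None, :]
    # spectral decomposition L = Q Lambda Q^T (eigh returns ascending eigenvalues)
    _, vecs = np.linalg.eigh(L)
    Q = vecs[:, :K]                 # first K columns of Q (smallest eigenvalues)
    Q = Q / np.linalg.norm(Q, axis=1, keepdims=True)   # normalize rows
    # K-means on the rows of the reduced eigenvector matrix
    _, labels = kmeans2(Q, K, minit='++', seed=0)
    return labels
```

With two well-separated Gaussian blobs and a reasonable \(\sigma\), this recovers the two groups exactly.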

Spectral clustering: for non-convex clusters, too

(see e.g. Murugesan, Cho, and Tortora, 2021 for a comparative review)

Description of the research problem

The aim: extend SC to mixed data via association-based distances

  • Mbuga and Tortora (2021) proposed SC for mixed data by defining the distance matrix \(\bf D\) as a convex combination of Hamming and squared Euclidean distances
    • the present proposal is to define \(\bf D\) as a convex combination of association-based and Mahalanobis distances
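As a sketch, the proposed mixed distance could be assembled as follows (the function names and the default \(\alpha\) are illustrative assumptions; the categorical block would come from the association-based construction described next):

```python
import numpy as np

def mahalanobis_matrix(X):
    """Pairwise Mahalanobis distances for the continuous attributes."""
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X[:, None, :] - X[None, :, :]
    # quadratic form diff_ij^T S^{-1} diff_ij for every pair (i, j)
    return np.sqrt(np.einsum('ijk,kl,ijl->ij', diff, S_inv, diff))

def mixed_distance(D_cat, D_cont, alpha=0.5):
    """Convex combination of an association-based categorical distance
    and a Mahalanobis distance on the continuous attributes;
    alpha is a hyperparameter (an assumption here, see Conclusions)."""
    return alpha * D_cat + (1 - alpha) * D_cont
```

The resulting matrix plugs directly into the spectral clustering pipeline in place of the Euclidean \(\bf D\).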

distances for categorical data: a framework (van de Velden, Iodice D’Enza, Markos, and Cavicchia, 2023)

The pair-wise distances between categorical observations are given by

\[{\bf D}_{cat}={\bf Z}{\bf \Delta}{\bf Z}^{\sf T}= \left[\begin{array}{ccc} {\bf Z}_{1} & \dots & {\bf Z}_{Q} \end{array} \right]\left[\begin{array}{ccc} {\bf\Delta}_{1} & & \\ & \ddots &\\ & & {\bf\Delta}_{Q} \end{array} \right] \left[ \begin{array}{c} {\bf Z}_{1}^{\sf T}\\ \vdots \\ {\bf Z}_{Q}^{\sf T} \end{array} \right]\]

  • \({\bf \Delta}\) is a block-wise diagonal matrix and \(\bf Z\) is the one-hot-encoding of the Q categorical attributes in \(\bf X\)

  • different \({\bf \Delta}\)’s correspond to different distances
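A toy instance of the framework: choosing \({\bf \Delta}_q = {\bf 1}-{\bf I}\) for each variable makes \({\bf Z}{\bf \Delta}{\bf Z}^{\sf T}\) the Hamming (mismatch-count) distance; other choices of \({\bf \Delta}\) plug in other distances. The encoding helper below is an illustration, not from the paper:

```python
import numpy as np
from scipy.linalg import block_diag

# tiny example: Q = 2 categorical variables, n = 3 observations
# variable 1 has 2 categories, variable 2 has 3 (integer-coded)
X_cat = np.array([[0, 1],
                  [1, 1],
                  [0, 2]])

def one_hot(col, q):
    # one-hot encoding of an integer-coded column with q categories
    return np.eye(q)[col]

# Z = [Z_1 Z_2], the one-hot encodings stacked side by side
Z = np.hstack([one_hot(X_cat[:, 0], 2), one_hot(X_cat[:, 1], 3)])
# Delta_q = 1 - I: a mismatch between categories costs 1, a match costs 0
Delta = block_diag(1 - np.eye(2), 1 - np.eye(3))
D_cat = Z @ Delta @ Z.T   # pairwise mismatch counts (Hamming distances)
```

Here observations 1 and 2 differ on one variable and observations 2 and 3 on both, so `D_cat[0, 1] == 1` and `D_cat[1, 2] == 2`.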

The aim: extend SC to mixed data via association-based distances

association-based distance for categorical data: total variation distance (TVD) (see, e.g., Le and Ho, 2005)

the weight assigned to the difference between a pair of categories \(a\) and \(b\) depends on the distributions of the other attributes conditional on \(a\) and \(b\): the more similar these conditional distributions, the lower the weight

\[\Phi^{ij}({\bf r}^{ij}_{a},{\bf r}^{ij}_{b})=\frac{1}{2}\sum_{\ell=1}^{q_{j}}|{\bf r}^{ij}_{\ell a}-{\bf r}^{ij}_{\ell b}|\]

  • \({\bf r}^{ij}_{a}\) is the \(a^{th}\) row of \({\bf R}_{ij}\), the \(q_{i}\times q_{j}\) general off-diagonal block of \({\bf R}\)

  • \({\bf r}^{ij}_{a}\) is the conditional distribution of the \(j^{th}\) variable, given the \(a^{th}\) category of the \(i^{th}\) variable

  • \({\bf R} = {\bf P}_{d}^{-1}\left({\bf P}-{\bf P}_{d}\right)\)

    • \({\bf P}_{d}=diag({\bf P})\)
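In code, the TVD weight for a pair of categories is computed from the row-profiles of the off-diagonal block \({\bf R}_{ij}\); the toy contingency counts below are invented for illustration:

```python
import numpy as np

def tvd_weight(r_a, r_b):
    """Phi^{ij}(r_a, r_b) = 0.5 * sum_l |r_{la} - r_{lb}|:
    total variation distance between the two conditional distributions."""
    return 0.5 * np.abs(r_a - r_b).sum()

# toy counts of variable j's categories, conditional on variable i
cross = np.array([[30, 10, 10],   # given variable i = a
                  [ 5, 15, 30]])  # given variable i = b
# row-normalizing gives the conditional distributions (rows of R_ij)
R_ij = cross / cross.sum(axis=1, keepdims=True)
phi = tvd_weight(R_ij[0], R_ij[1])
```

Categories with similar conditional profiles get a weight near 0 (identical profiles give exactly 0), while strongly associated categories get a weight near 1.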

Methodology and main results

Association-based for mixed?

association-based for mixed data

  • TVD on categorical + Mahalanobis on continuous

  • Consider the continuous attributes in the TVD computation (that is, within the \(\bf \Delta\))

  • \(\Phi^{ij}\left({ab}\right)\) is the impact of continuous attribute(s) on the general pair of categories \(a\) and \(b\)
    • consider a two-class classification problem, with the continuous variables as predictors
    • use a distance-based classifier: nearest-neighbors

\(\Phi^{ij}\left({ab}\right)\): improvement of the NN classifier over pure chance \((.5)\)

\[ \Phi^{ij}_{ab} = \frac{1}{k{\pi}_{a}}\sum_{s\in\aleph_{a}}I({\hat \pi}_{a(s)}>.5)-.5\]

\[ \Phi^{ij}_{ba} = \frac{1}{k{\pi}_{b}}\sum_{s\in \aleph_{b}}I({\hat \pi}_{b(s)}>.5)-.5\]

\[\Phi^{ij}\left({ab}\right)=\Phi^{ij}_{ab}+\Phi^{ij}_{ba}\]

continuous attributes in the TVD computation: NN-based
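One possible reading of the NN-based weight, sketched with NumPy: here I assume \(\aleph_a\) collects the observations in category \(a\), \({\hat \pi}_{a(s)}\) is the own-category share among the \(k\) nearest neighbors of \(s\) (leave-one-out), and the normalizer averages the indicator over \(\aleph_a\). The function name and these interpretations are assumptions, not the authors' code:

```python
import numpy as np

def nn_phi(X_a, X_b, k=5):
    """Improvement of a k-NN classifier over pure chance (.5) when
    separating categories a and b using only the continuous attributes."""
    X = np.vstack([X_a, X_b])
    y = np.array([0] * len(X_a) + [1] * len(X_b))
    # leave-one-out pairwise distances (self excluded via +inf diagonal)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    nn = np.argsort(D, axis=1)[:, :k]          # k nearest neighbors of each point
    own = (y[nn] == y[:, None]).mean(axis=1)   # own-category share among neighbors
    phi_ab = (own[y == 0] > .5).mean() - .5    # improvement for category a
    phi_ba = (own[y == 1] > .5).mean() - .5    # improvement for category b
    return phi_ab + phi_ba
```

Well-separated categories give \(\Phi^{ij}(ab)\) close to 1; overlapping ones give values close to 0, so the continuous attributes then add little weight to that categorical pair.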

SC on synthetic mixed data

  • structured categorical data generated as in van de Velden, Iodice D’Enza, and Palumbo (2017);
  • continuous variables generated accordingly

Conclusions

Final considerations and future work

  • association-based measures aim to go beyond match/mismatch of categories

  • a non-parametric approach is suitable for non-convex/oddly shaped clusters

  • computationally demanding (but it can be made bearable)

  • convex combination of the continuous/categorical distances

    • the weight as a hyperparameter
    • feature engineering

Main references

Le, S. Q. and T. B. Ho (2005). “An association-based dissimilarity measure for categorical data”. In: Pattern Recognition Letters 26.16, pp. 2549-2557.

Mbuga, F. and C. Tortora (2021). “Spectral Clustering of Mixed-Type Data”. In: Stats 5.1, pp. 1-11.

Murugesan, N., I. Cho, and C. Tortora (2021). “Benchmarking in cluster analysis: a study on spectral clustering, DBSCAN, and K-Means”. In: Conference of the International Federation of Classification Societies. Springer, pp. 175-185.

Ng, A., M. Jordan, and Y. Weiss (2001). “On spectral clustering: analysis and an algorithm”. In: Advances in Neural Information Processing Systems 14, p. 849.

van de Velden, M., A. Iodice D’Enza, A. Markos, and C. Cavicchia (2023). “A general framework for implementing distances for categorical variables”. In: submitted to Pattern Recognition, pp. 1-21.

van de Velden, M., A. Iodice D’Enza, and F. Palumbo (2017). “Cluster correspondence analysis”. In: Psychometrika 82.1, pp. 158-185.

slides available at https://alfonsoiodicede.github.io/