Graph representation
a graph representation of the data matrix \(\bf X\): the aim is then to cut it into \(K\) groups (clusters)
the affinity matrix \(\bf A\)
the elements \(w_{ij}\) of \(\bf A\) are high (low) if \(i\) and \(j\) are in the same (different) groups
. | a | b | c | d |
---|---|---|---|---|
a | 0 | 0 | \(w_{ac}\) | 0 |
b | 0 | 0 | \(w_{bc}\) | \(w_{bd}\) |
c | \(w_{ca}\) | \(w_{cb}\) | 0 | \(w_{cd}\) |
d | 0 | \(w_{db}\) | \(w_{dc}\) | 0 |
Normalized affinity matrix
An approximate solution to the graph partitioning problem boils down to the spectral decomposition of the graph Laplacian matrix (a code sketch follows below)
\[\color{indianred}{\bf{L}} = {\bf I}-{\bf D}_{r}^{-1/2}\underset{\color{grey}{\text{affinity matrix } {\bf A}}}{\color{indianred}{\exp(-{\bf D}^{2}(2\sigma^{2})^{-1})}}{\bf D}_{r}^{-1/2} =\color{indianred}{{\bf Q}{\Lambda}{\bf Q}^{\sf T}}\]
\(\bf D\) is the \(n\times n\) matrix of pairwise Euclidean distances
the \(\sigma\) parameter dictates the number of neighbors each observation is linked to (rule of thumb: median distance to the 20th nearest neighbor)
diagonal terms of \(\bf A\) are set to zero: \(a_{ii}=0\), \(i=1,\ldots,n\)
\({\bf D}_{r}=\text{diag}({\bf r})\), where \({\bf r}={\bf A}{\bf 1}\) and \({\bf 1}\) is an \(n\)-dimensional vector of 1’s
(see e.g. Murugesan, Cho, and Tortora, 2021 for a comparative review)
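A minimal numpy/scipy sketch of the pipeline above, from the Gaussian affinity to k-means on the leading eigenvectors; the row-normalization of the embedding and the k-means step follow the Ng, Jordan, and Weiss (2001) recipe, while the function name and defaults (`nn=20`, `seed`) are illustrative assumptions.

```python
# Minimal sketch of the spectral clustering pipeline sketched on the slide above.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.vq import kmeans2

def spectral_clusters(X, K, nn=20, seed=1):
    D = squareform(pdist(X))                          # n x n pairwise Euclidean distances
    sigma = np.median(np.sort(D, axis=1)[:, nn])      # rule of thumb: median distance to the nn-th neighbor
    A = np.exp(-D**2 / (2 * sigma**2))                # Gaussian affinity matrix
    np.fill_diagonal(A, 0)                            # a_ii = 0
    r = A.sum(axis=1)                                 # r = A 1
    Dr = np.diag(1 / np.sqrt(r))                      # D_r^{-1/2}
    L = np.eye(X.shape[0]) - Dr @ A @ Dr              # normalized graph Laplacian
    _, Q = np.linalg.eigh(L)                          # spectral decomposition, eigenvalues ascending
    U = Q[:, :K]                                      # eigenvectors of the K smallest eigenvalues
    U = U / np.linalg.norm(U, axis=1, keepdims=True)  # row-normalization (Ng-Jordan-Weiss)
    _, labels = kmeans2(U, K, minit='++', seed=seed)  # k-means on the spectral embedding
    return labels
```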
distances for categorical data: a framework (van de Velden, Iodice D’Enza, Markos, and Cavicchia, 2023)
The pairwise distances between categorical observations are given by
\[{\bf D}_{cat}={\bf Z}{\bf \Delta}{\bf Z}^{\sf T}= \left[\begin{array}{ccc} {\bf Z}_{1} & \dots & {\bf Z}_Q \end{array} \right]\left[\begin{array}{ccc} {\bf\Delta}_1 & & \\ & \ddots &\\ & & {\bf\Delta}_Q \end{array} \right] \left[ \begin{array}{c} {\bf Z}_{1}^{\sf T}\\ \vdots \\ {\bf Z}_{Q}^{\sf T} \end{array} \right]\]
\({\bf \Delta}\) is a block-diagonal matrix and \(\bf Z\) is the one-hot encoding of the \(Q\) categorical attributes in \(\bf X\) (a construction sketch follows below)
different \({\bf \Delta}\)’s correspond to different distances
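A sketch of how the block structure can be assembled; the simple-matching \({\bf \Delta}_q\) used here (1 for any pair of distinct categories of the same attribute, 0 on the diagonal) is only a placeholder for whichever \({\bf \Delta}_q\) the chosen categorical distance prescribes.

```python
# Sketch: D_cat = Z Delta Z^T with one-hot Z and block-diagonal Delta.
# Assumption: a simple-matching Delta_q stands in for the distance-specific blocks.
import numpy as np
import pandas as pd
from scipy.linalg import block_diag

def categorical_distances(df_cat):
    Z_blocks, Delta_blocks = [], []
    for col in df_cat.columns:
        Zq = pd.get_dummies(df_cat[col]).to_numpy(dtype=float)  # one-hot coding Z_q
        q = Zq.shape[1]
        Delta_blocks.append(np.ones((q, q)) - np.eye(q))        # simple-matching block
        Z_blocks.append(Zq)
    Z = np.hstack(Z_blocks)                                      # Z = [Z_1 ... Z_Q]
    Delta = block_diag(*Delta_blocks)                            # block-diagonal Delta
    return Z @ Delta @ Z.T                                       # D_cat

# usage: categorical_distances(pd.DataFrame({"x1": list("aabb"), "x2": list("cdcd")}))
```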
association-based for categorical data: total variation distance (TVD) (see, e.g., Le and Ho, 2005)
the weight assigned to the difference between a pair of categories \(a\) and \(b\) depends on the distributions of the other attributes conditional on \(a\) and \(b\): the higher the similarity, the lower the weight (sketched in code below)
\[\Phi^{ij}({\bf r}^{ij}_{a},{\bf r}^{ij}_{b})=\frac{1}{2}\sum_{\ell=1}^{q_{j}}|{\bf r}^{ij}_{\ell a}-{\bf r}^{ij}_{\ell b}|\]
\({\bf r}^{ij}_{a}\) is the \(a^{th}\) row of \({\bf R}_{ij}\), the \(q_{i}\times q_{j}\) off-diagonal block of \({\bf R}\)
\({\bf r}^{ij}_{a}\) is the conditional distribution of the \(j^{th}\) variable, given the \(a^{th}\) category of the \(i^{th}\) variable
\({\bf R} = {\bf P}_{d}^{-1}\left({\bf P}-{\bf P}_{d}\right)\)
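A sketch of the TVD term above for a pair of categories \(a\), \(b\) of attribute \(i\), computed from its cross-tabulation with another attribute \(j\); combining the \(\Phi^{ij}\) over the remaining attributes by a plain average is an assumption here, the slide only defines the pairwise term.

```python
# Sketch: TVD-based weight between two categories a, b of attribute i.
# R_ij is built directly as the row-conditional profile of the (i, j) cross-tab,
# which matches the off-diagonal blocks of R = P_d^{-1}(P - P_d).
# Averaging Phi^{ij} over the other attributes j is an assumption, not on the slide.
import numpy as np
import pandas as pd

def tvd_weight(df_cat, i, a, b):
    phis = []
    for j in df_cat.columns:
        if j == i:
            continue
        P_ij = pd.crosstab(df_cat[i], df_cat[j])
        R_ij = P_ij.div(P_ij.sum(axis=1), axis=0)                   # distribution of j given each category of i
        phis.append(0.5 * (R_ij.loc[a] - R_ij.loc[b]).abs().sum())  # Phi^{ij}(r_a, r_b)
    return float(np.mean(phis))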
association-based for mixed data
TVD on categorical + Mahalanobis on continuous
include the continuous attributes in the TVD computation (that is, within \(\bf \Delta\))
\(\Phi^{ij}\left({ab}\right)\): improvement of the \(k\)-NN classifier over pure chance \((.5)\) (one possible reading is sketched in code below)
\[ \Phi^{ij}_{ab} = \frac{1}{k{\pi}_{a}}\sum_{s\in\aleph_{a}}I({\hat \pi}_{a(s)}>.5)-.5\]
\[ \Phi^{ij}_{ba} = \frac{1}{k{\pi}_{b}}\sum_{s\in \aleph_{b}}I({\hat \pi}_{b(s)}>.5)-.5\]
\[\Phi^{ij}\left({ab}\right)=\Phi^{ij}_{ab}+\Phi^{ij}_{ba}\]
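One possible reading of the \(k\)-NN construction above, sketched for a single continuous attribute \(j\) and a pair of categories \(a\), \(b\) of a categorical attribute \(i\): restricting the neighbor pool to categories \(a\) and \(b\), reading \(k\pi_a\) as the number of category-\(a\) observations, and using a single attribute as the neighbor space are all assumptions, not statements from the slide.

```python
# Sketch (one possible reading, see assumptions above): Phi^{ij}(ab) as the improvement
# over pure chance of a k-NN majority vote recovering category a (resp. b) of attribute i
# from the continuous attribute j.
import numpy as np

def phi_knn(x_j, cat_i, a, b, k=5):
    mask = np.isin(cat_i, [a, b])                  # assumption: neighbors searched among a, b only
    x, y = np.asarray(x_j)[mask], np.asarray(cat_i)[mask]
    def one_side(c):
        hits, idx = 0, np.where(y == c)[0]
        for s in idx:
            d = np.abs(x - x[s])
            d[s] = np.inf                          # exclude the observation itself
            nn = np.argsort(d)[:k]                 # k nearest neighbors on attribute j
            hits += np.mean(y[nn] == c) > 0.5      # I(pi_hat_{c(s)} > .5)
        return hits / len(idx) - 0.5               # improvement over pure chance
    return one_side(a) + one_side(b)               # Phi^{ij}(ab) = Phi^{ij}_{ab} + Phi^{ij}_{ba}
```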
association-based measures aim to go beyond match/mismatch of categories
a non-parametric approach is suitable for non-convex/oddly shaped clusters
computationally demanding (but it can be made bearable)
convex combination of the continuous and categorical distances (sketched below)
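Putting the pieces together, a sketch of the convex combination: Mahalanobis distances on the continuous attributes, the categorical distances from the earlier sketch, and a mixing weight \(\alpha\); rescaling each matrix by its maximum before mixing, and the default \(\alpha\), are assumptions rather than part of the slide. The resulting matrix can replace the Euclidean \(\bf D\) in the spectral pipeline sketched earlier.

```python
# Sketch: convex combination of continuous (Mahalanobis) and categorical distances,
# to be plugged into the spectral clustering pipeline in place of the Euclidean D.
# The max-rescaling of each block and the default alpha are assumptions.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def mixed_distances(X_cont, D_cat, alpha=0.5):
    VI = np.linalg.pinv(np.cov(X_cont, rowvar=False))                # inverse covariance
    D_cont = squareform(pdist(X_cont, metric='mahalanobis', VI=VI))  # Mahalanobis on continuous
    D_cont, D_cat = D_cont / D_cont.max(), D_cat / D_cat.max()       # put the two on a comparable scale
    return alpha * D_cont + (1 - alpha) * D_cat                      # convex combination
```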
Le, S. Q. and T. B. Ho (2005). “An association-based dissimilarity measure for categorical data”. In: Pattern Recognition Letters 26.16, pp. 2549-2557.
Mbuga, F. and C. Tortora (2021). “Spectral Clustering of Mixed-Type Data”. In: Stats 5.1, pp. 1-11.
Murugesan, N., I. Cho, and C. Tortora (2021). “Benchmarking in cluster analysis: a study on spectral clustering, DBSCAN, and K-Means”. In: Conference of the International Federation of Classification Societies. Springer, pp. 175-185.
Ng, A., M. Jordan, and Y. Weiss (2001). “On spectral clustering: analysis and an algorithm”. In: Advances in Neural Information Processing Systems. Vol. 14, p. 849.
van de Velden, M., A. Iodice D’Enza, A. Markos, and C. Cavicchia (2023). “A general framework for implementing distances for categorical variables”. In: submitted to Pattern Recognition, pp. 1-21.
van de Velden, M., A. Iodice D’Enza, and F. Palumbo (2017). “Cluster correspondence analysis”. In: Psychometrika 82.1, pp. 158-185.
slides available at https://alfonsoiodicede.github.io/
SIS 2023 - Statistical Learning, Sustainability and Impact Evaluation