outline
independent vs association-based distances
a mixed data set
\(n\) observations described by \(Q_{c}\) categorical and \(Q_{d}\) continuous variables (\(Q=Q_{c}+Q_{d}\) in total)
the \(n\times Q\) data matrix \({\bf X}=\left[{\bf X}_{cat},{\bf X}_{con}\right]\) is column-wise partitioned
independence-based pairwise distance
distances are computed variable by variable and then added together:
e.g. Euclidean or Manhattan distances for the continuous case
e.g. Hamming (matching) distance for the categorical case
in the mixed data case, combine dissimilarity measures for each type of variable, as in the Gower index
\[\begin{equation} d(x_i,x_{i'}) =\alpha \sum_{j=1}^{Q_{c}}I(x_{ij}\neq x_{i'j})+ (1-\alpha)\sum_{j=Q_{c}+1}^{Q}\frac{|x_{ij} - x_{{i'j}}|}{range(X_{j})} \label{GowerDissim} \end{equation}\]where \(I(\cdot)=1\) if the expression in input is true, \(\alpha=\frac{Q_{c}}{Q}\)
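a minimal NumPy sketch of the index above (function and argument names are illustrative; `ranges` holds \(range(X_{j})\) for each continuous variable):

```python
import numpy as np

def gower(X_cat, X_con, ranges):
    """Pairwise Gower-type dissimilarities, as in the formula above.
    X_cat: (n, Q_c) categorical codes; X_con: (n, Q_d) continuous values;
    ranges: (Q_d,) values of range(X_j) for the continuous variables."""
    Q_c, Q_d = X_cat.shape[1], X_con.shape[1]
    alpha = Q_c / (Q_c + Q_d)
    # categorical block: number of mismatching variables per pair (i, i')
    d_cat = (X_cat[:, None, :] != X_cat[None, :, :]).sum(axis=2)
    # continuous block: range-normalized Manhattan distances
    d_con = (np.abs(X_con[:, None, :] - X_con[None, :, :]) / ranges).sum(axis=2)
    return alpha * d_cat + (1 - alpha) * d_con
```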
independent vs association-based distances
an alternative approach: association-based
The rationale of association-based distances is that not all the observed differences weigh the same:
differences in line with the association/correlation between variables are less informative and should be downweighted
e.g., the Mahalanobis distance \(d(x_i,x_{i'})=\sqrt{(x_{i}-x_{i'})^{\sf T}{\bf S}^{-1}(x_{i}-x_{i'})}\), where \(\mathbf{S}\) is the sample covariance matrix
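a short sketch of this association-based distance for the continuous block (SciPy's `cdist` supports the Mahalanobis metric directly):

```python
import numpy as np
from scipy.spatial.distance import cdist

def mahalanobis_pairwise(X_con):
    """Pairwise Mahalanobis distances; S is the sample covariance matrix."""
    VI = np.linalg.inv(np.cov(X_con, rowvar=False))  # S^{-1}, as cdist expects
    return cdist(X_con, X_con, metric="mahalanobis", VI=VI)
```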
independent vs association-based distances
distances for categorical data: a framework
Let \({\bf Z}=\left[{\bf Z}_{1},{\bf Z}_{2},\ldots,{\bf Z}_{Q_c}\right]\) be the one-hot encoding of \({\bf X}_{cat}\)
The pairwise distances between categorical observations are given by
\[{\bf D}_{cat}={\bf Z}{\bf \Delta}{\bf Z}^{\sf T}= \left[\begin{array}{ccc} {\bf Z}_{1} & \dots & {\bf Z}_{Q_{c}} \end{array} \right]\left[\begin{array}{ccc} {\bf\Delta}_1 & & \\ & \ddots &\\ & & {\bf\Delta}_{Q_{c}} \end{array} \right] \left[ \begin{array}{c} {\bf Z}_{1}^{\sf T}\\ \vdots \\ {\bf Z}_{Q_{c}}^{\sf T} \end{array} \right]\]
the definition of \({\bf \Delta}\) determines the distance in use
if \(\Delta_{j}\)’s are diagonal, then \({\bf D}_{cat}\) is independence-based
if \(\Delta_{j}\)’s have non-zero off-diagonal terms, then \({\bf D}_{cat}\) is association-based
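a sketch of the framework, assuming the one-hot blocks \({\bf Z}_{j}\) are given; e.g., \(\Delta_{j}\) with zero diagonal and ones elsewhere makes the entries of \({\bf D}_{cat}\) the matching (Hamming) counts:

```python
import numpy as np
from scipy.linalg import block_diag

def categorical_distances(Z_blocks, Delta_blocks):
    """D_cat = Z Delta Z^T for one-hot blocks Z_j and per-variable Delta_j."""
    Z = np.hstack(Z_blocks)            # n x sum(q_j) one-hot encoding
    Delta = block_diag(*Delta_blocks)  # block-diagonal Delta
    return Z @ Delta @ Z.T
```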
\(\Delta_{j}\)’s for AB distances
the matrix of co-occurrence proportions is
\[ {\bf P} =\frac{1}{n} \begin{bmatrix} {\bf Z}_{1}^{\sf T}{\bf Z}_{1} & {\bf Z}_{1}^{\sf T}{\bf Z}_{2}&\ldots &{\bf Z}_{1}^{\sf T}{\bf Z}_{Q_{c}}\\ \vdots & \ddots & &\vdots \\ {\bf Z}_{Q_{c}}^{\sf T}{\bf Z}_{1} & {\bf Z}_{Q_{c}}^{\sf T}{\bf Z}_{2}&\ldots &{\bf Z}_{Q_{c}}^{\sf T}{\bf Z}_{Q_{c}}\\ \end{bmatrix} \]
\({\bf R} = {\bf P}_{d}^{-1}\left({\bf P}-{\bf P}_{d}\right)\), with \({\bf P}_{d}=diag({\bf P})\), is a block matrix such that
the general off-diagonal block is \({\bf R}_{ji}\) (\(q_{j}\times q_{i}\))
the \(a^{th}\) row of \({\bf R}_{ji}\), \({\bf r}^{ji}_{a}\), is the conditional distribution of the \(i^{th}\) variable, given the \(a^{th}\) category of the \(j^{th}\) variable
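a compact sketch of the construction, assuming `Z` stacks the one-hot blocks column-wise (the diagonal blocks of \({\bf P}\) are themselves diagonal, so \({\bf P}_{d}\) collapses to the vector of category proportions):

```python
import numpy as np

def conditional_profiles(Z):
    """R = P_d^{-1} (P - P_d): each row of an off-diagonal block R_ji is the
    conditional distribution of variable i given a category of variable j."""
    n = Z.shape[0]
    P = (Z.T @ Z) / n    # co-occurrence proportions
    p = np.diag(P)       # P_d = diag(P): the category proportions
    return (P - np.diag(p)) / p[:, None]
```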
\(\Delta_{j}\)’s for AB distances
the \((a,b)^{th}\) entry of \(\Delta_{j}\) is
\[ \delta^{j}(a,b)=\sum_{i\neq j}^{Q_{c}}w_{ji}\Phi^{ji}({\bf r}^{ji}_{a},{\bf r}^{ji}_{b}) \]
defining \(\Phi^{ji}({\bf r}^{ji}_{a},{\bf r}^{ji}_{b})=\frac{1}{2}\sum_{\ell=1}^{q_{i}}|r^{ji}_{a \ell}-r^{ji}_{b \ell}|\) leads to the total variation distance (TVD) (Le and Ho, 2005)
\(w_{ji}\) weights the contribution of each of the \(Q_{c}-1\) variables other than the \(j^{th}\);
\(w_{ji}=1/(Q_{c}-1)\) for equal contribution
alternative weights are defined, e.g., in (Mousavi and Sehhati, 2023)
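a sketch of \(\Delta_{j}\) under TVD with equal weights, assuming `R` is the block matrix from the previous slide and `sizes` \(=(q_{1},\ldots,q_{Q_{c}})\):

```python
import numpy as np

def tvd_delta(R, sizes, j):
    """Delta_j: delta^j(a,b) = sum over i != j of w_ji * TVD(r^ji_a, r^ji_b),
    with equal weights w_ji = 1 / (Q_c - 1)."""
    Qc = len(sizes)
    starts = np.concatenate(([0], np.cumsum(sizes)))
    Delta_j = np.zeros((sizes[j], sizes[j]))
    for i in range(Qc):
        if i == j:
            continue
        R_ji = R[starts[j]:starts[j + 1], starts[i]:starts[i + 1]]  # q_j x q_i
        # total variation distance between every pair of rows of R_ji
        tvd = 0.5 * np.abs(R_ji[:, None, :] - R_ji[None, :, :]).sum(axis=2)
        Delta_j += tvd / (Qc - 1)
    return Delta_j
```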
association-based for mixed?
assume \({\bf D}_{cat}\) is the TVD and \({\bf D}_{con}\) the Mahalanobis distance, both association-based
their convex combination \({\bf D}\), however, does not take into account the interaction between the two blocks of variables
Aim: define \(\delta_{con}(a,b)\) so that it accounts for the categorical/continuous interactions
How to define \(\delta_{con}(a,b)\)?
the categorical/continuous interaction is proportional to the discriminant power of the continuous variables for each category pair \((a,b)\) of the \(j^{th}\) categorical variable, \(j=1,\ldots,Q_{c}\)
it is assessed via nearest neighbors (NN) averaging
if \(x_{\ell j}=a\), the proportion of nearest neighbors of the \(\ell^{th}\) observation labeled \(a\) is
\[ {\hat\pi}_{a\ell}=\frac{1}{n^{j}_{a}\pi_{nn}} \sum_{m\in \mathcal{N}^{a}_{\ell}}I(x_{mj}=a) \]
if \(x_{\ell j}=b\), the proportion of nearest neighbors of the \(\ell^{th}\) observation labeled \(b\) is
\[ {\hat\pi}_{b\ell}=\frac{1}{n^{j}_{b}\pi_{nn}} \sum_{m\in \mathcal{N}^{b}_{\ell}}I(x_{mj}=b) \]
\(n^{j}_{a}\) and \(n^{j}_{b}\) are absolute frequencies of categories \(a\) and \(b\)
\(\pi_{nn}\) is the user-defined proportion of nearest neighbors
\(\mathcal{N}^{a}_{\ell}\) (\(\mathcal{N}^{b}_{\ell}\)) is the set of nearest neighbors of the \(\ell^{th}\) observation when \(x_{\ell j}=a\) (\(x_{\ell j}=b\)); a code sketch follows
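a sketch of the NN averaging for one category, assuming `X_con` and `labels` are restricted to the observations with \(x_{\ell j}\in\{a,b\}\) (names are illustrative):

```python
import numpy as np
from scipy.spatial.distance import cdist

def nn_proportions(X_con, labels, a, pi_nn=0.1):
    """For each observation labelled a, the proportion of its nearest
    neighbours (in the continuous space) that are also labelled a;
    the neighbourhood size is n_a * pi_nn, as in the formula above."""
    idx_a = np.where(labels == a)[0]
    k = max(1, int(round(len(idx_a) * pi_nn)))
    D = cdist(X_con[idx_a], X_con)
    D[np.arange(len(idx_a)), idx_a] = np.inf  # exclude each point itself
    nn = np.argsort(D, axis=1)[:, :k]         # k nearest neighbours per point
    return (labels[nn] == a).mean(axis=1)     # pi_hat for each point
```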
How to define \(\delta_{con}(a,b)\)?
We consider the improvement over chance obtained when the continuous variables are used to correctly classify the observations:
for category \(a\),
\[ \delta^{j}_{con}(a)=\left[\frac{1}{n_{a}^{j}}\sum_{\ell=1}^{n_{a}^{j}} I(\hat{\pi}_{a\ell}\geq .5)\right]-.5 \]
and for category \(b\),
\[ \delta^{j}_{con}(b)=\left[\frac{1}{n_{b}^{j}}\sum_{\ell=1}^{n_{b}^{j}} I(\hat{\pi}_{b\ell}\geq .5)\right]-.5 \]
finally, the \((a,b)^{th}\) entry of \(\Delta^{j}_{con}\) is given by
\[ \delta^{j}_{con}(a,b) = \delta^{j}_{con}(a) + \delta^{j}_{con}(b). \]
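putting the pieces together, a sketch that reuses `nn_proportions` from the previous slide (`x_j` holds the categories of the \(j^{th}\) variable):

```python
import numpy as np

def delta_con_pair(X_con, x_j, a, b, pi_nn=0.1):
    """delta^j_con(a, b) = delta^j_con(a) + delta^j_con(b): improvement over
    chance when the continuous variables classify categories a and b."""
    keep = np.isin(x_j, [a, b])               # restrict to the (a, b) pair
    X_ab, lab = X_con[keep], x_j[keep]
    delta = 0.0
    for cat in (a, b):
        pi_hat = nn_proportions(X_ab, lab, cat, pi_nn)
        delta += (pi_hat >= 0.5).mean() - 0.5  # improvement over chance
    return delta
```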
continuous attributes in the TVD computation: NN-based
Spectral clustering (Ng, Jordan, and Weiss, 2001): a graph partitioning problem
Graph representation
a graph representation of the data matrix \(\bf X\): the aim is then to cut it into K groups (clusters)
the affinity matrix \({\bf A}\)
the elements \(w_{ij}\) of \(\bf A\) are high (low) if \(i\) and \(j\) are in the same (different) groups

|   | a | b | c | d |
|---|---|---|---|---|
| a | 0 | 0 | \(w_{ac}\) | 0 |
| b | 0 | 0 | \(w_{bc}\) | \(w_{bd}\) |
| c | \(w_{ca}\) | \(w_{cb}\) | 0 | \(w_{cd}\) |
| d | 0 | \(w_{db}\) | \(w_{dc}\) | 0 |
Spectral clustering: making the graph easy to cut
An approximated solution to the graph partitioning problem:
\[\color{dodgerblue}{{\bf L}} = {\bf D}_{r}^{-1/2}\underset{\color{grey}{\text{affinity matrix } {\bf A}}}{\color{dodgerblue}{\exp\left(-{\bf D}^{2}(2\sigma^{2})^{-1}\right)}}{\bf D}_{r}^{-1/2} =\color{dodgerblue}{{\bf Q}{\bf \Lambda}{\bf Q}^{\sf T}}\]
let \(\bf D\) be the \(n\times n\) matrix of pairwise Euclidean distances
the \(\sigma\) parameter dictates the number of neighbors each observation is linked to (rule of thumb: median distance to the 20th nearest neighbor)
diagonal terms of \(\bf A\) are set to zero: \(a_{ii}=0\), \(i=1,\ldots,n\)
\({\bf D}_{r}=diag({\bf r})\), \({\bf r}={\bf A}{\bf 1}\) and \({\bf 1}\) is an \(n\)-dimensional vector of 1’s
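a direct translation of the construction above (a sketch; names are illustrative):

```python
import numpy as np

def normalized_laplacian(D, sigma):
    """L = D_r^{-1/2} A D_r^{-1/2}, with A = exp(-D^2 / (2 sigma^2))."""
    A = np.exp(-D**2 / (2 * sigma**2))
    np.fill_diagonal(A, 0.0)                     # a_ii = 0
    r_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))    # D_r^{-1/2}, with r = A 1
    return A * np.outer(r_inv_sqrt, r_inv_sqrt)
```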
Spectral clustering: the NJW procedure
step 1: compute the pairwise distance matrix \(\bf D\)
step 2: switch to the affinity matrix \(\bf A\)
step 3: normalize the affinity matrix to obtain the Laplacian matrix \(\bf L\)
step 4: decompose \(\bf L\) and obtain \({\bf {\tilde Q}}\), the matrix of the \(K\)-dimensional spectral scores (top-\(K\) eigenvectors of \(\bf L\), with rows scaled to unit length)
step 5: apply \(K\)-means to \({\bf {\tilde Q}}\) to obtain the cluster allocation vector
scatterplot matrix of the spectral scores
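a sketch of the five steps, reusing `normalized_laplacian` from the previous slide (the row normalization in step 4 follows Ng, Jordan, and Weiss, 2001):

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def njw_spectral(X, K):
    D = cdist(X, X)                               # step 1: pairwise distances
    sigma = np.median(np.sort(D, axis=1)[:, 20])  # rule of thumb: 20th NN
    L = normalized_laplacian(D, sigma)            # steps 2-3
    _, vecs = np.linalg.eigh(L)                   # step 4: decomposition
    Q = vecs[:, -K:]                              # top-K eigenvectors
    Q_tilde = Q / np.linalg.norm(Q, axis=1, keepdims=True)  # spectral scores
    return KMeans(n_clusters=K, n_init=10).fit_predict(Q_tilde)  # step 5
```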
Spectral clustering: solution and performance
Recent benchmarking studies (e.g., Murugesan, Cho, and Tortora, 2021) show that SC works well, particularly with non-convex and overlapping clusters
Spectral clustering extensions: non-continuous data
the definition of the pairwise distance matrix \(\bf D\) is key to extending SC to mixed data
Ad hoc SC procedures for mixed data have been proposed in the literature
SpectralCAT (David and Averbuch, 2012): in case of mixed variables, the continuous ones are suitably categorized
SC for mixed-type (Mbuga and Tortora, 2021): a convex combination of Hamming and Euclidean distances to deal with categorical and continuous variables, respectively
SC on synthetic mixed data
supervised AB-distance
the class labels are categories of a further variable \(y\) (the response)
a supervised variant of the AB-distance can be defined that takes into account the association between \(y\) and each of the other variables
let \({\bf Z}_{y}\) be the one-hot encoding of the response; then the matrix \({\bf R}\) becomes
\[ {\bf R}_{s} = {\bf P}_{z}^{-1}\left( {\bf Z}^{\sf T}{\bf Z}_{y} \right)= {\bf P}_{z}^{-1} \begin{bmatrix} {\bf Z}_{1}^{\sf T}{\bf Z}_{y}\\ \vdots \\ {\bf Z}_{Q_{c}}^{\sf T}{\bf Z}_{y} \end{bmatrix} \]
where \({\bf P}_{z}=diag\left({\bf Z}^{\sf T}{\bf Z}\right)\), so that the rows of \({\bf R}_{s}\) are the conditional distributions of \(y\) given each category
the \((a,b)^{th}\) general entry of \(\Delta^{j}_{s}\) is given by
\[ \delta_{s}^{j}(a,b)= \frac{1}{2}\sum_{\ell=1}^{q_{y}}|r^{j}_{a \ell}-r^{j}_{b \ell}| \]
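a sketch for one variable, assuming `Z_j` and `Z_y` are the one-hot blocks of the \(j^{th}\) variable and of the response:

```python
import numpy as np

def supervised_delta(Z_j, Z_y):
    """Delta_j^s: TVD between the conditional distributions of the response
    given each pair of categories of the j-th variable."""
    counts = Z_j.sum(axis=0)               # diagonal of P_z for this block
    R_j = (Z_j.T @ Z_y) / counts[:, None]  # rows: P(y | category of var j)
    return 0.5 * np.abs(R_j[:, None, :] - R_j[None, :, :]).sum(axis=2)
```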
non-lazy KNN for categorical/mixed data
the KNN classifier is referred to as lazy (no model is fitted on the training set; all computation happens when assessing the test set)
using the supervised AB-distance, \(\Delta_{s}\) is fitted on the training set
the distance-to-neighbors is computed as
\[\begin{equation} {\bf D}_{s} = {\bf Z}_{test}{\Delta}_{s}{\bf Z}_{train}^{\sf T} \end{equation}\]
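a sketch of the resulting classifier; `Delta_s_blocks` are the fitted \(\Delta^{j}_{s}\) matrices, and `y_train` is assumed to hold integer class labels:

```python
import numpy as np
from scipy.linalg import block_diag

def knn_supervised_ab(Z_train, Z_test, y_train, Delta_s_blocks, k=5):
    """Non-lazy KNN: Delta_s is fitted on the training set only, then
    D_s = Z_test Delta_s Z_train^T gives the test-to-train distances."""
    D_s = Z_test @ block_diag(*Delta_s_blocks) @ Z_train.T
    nn = np.argsort(D_s, axis=1)[:, :k]    # k nearest training points
    # majority vote among the neighbours' labels
    return np.array([np.bincount(y_train[row]).argmax() for row in nn])
```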
KNN and supervised AB-distance
Final considerations and future work
association-based measures aim to go beyond match/mismatch of categories
a non-parametric approach to measure interactions is suitable for non-convex/oddly shaped clusters/classes
in supervised settings, AB distances make it possible to take the response into account in the pairwise computations
\(\delta^{j}_{con}(a,b)\) is computationally demanding (but it can be made bearable)
convex combination of the continuous/categorical distances
main references
David, G. and A. Averbuch (2012). “SpectralCAT: categorical spectral clustering of numerical and nominal data”. In: Pattern Recognition 45.1, pp. 416-433.
Le, S. Q. and T. B. Ho (2005). “An association-based dissimilarity measure for categorical data”. In: Pattern Recognition Letters 26.16, pp. 2549-2557.
Mbuga, F. and C. Tortora (2021). “Spectral Clustering of Mixed-Type Data”. In: Stats 5.1, pp. 1-11.
Mousavi, E. and M. Sehhati (2023). “A Generalized Multi-Aspect Distance Metric for Mixed-Type Data Clustering”. In: Pattern Recognition, p. 109353.
Murugesan, N., I. Cho, and C. Tortora (2021). “Benchmarking in cluster analysis: a study on spectral clustering, DBSCAN, and K-Means”. In: Conference of the International Federation of Classification Societies. Springer, pp. 175-185.
Ng, A., M. Jordan, and Y. Weiss (2001). “On spectral clustering: analysis and an algorithm”. In: Advances in Neural Information Processing Systems. Vol. 14, p. 849.
van de Velden, M., A. Iodice D’Enza, A. Markos, et al. (2023). “A general framework for implementing distances for categorical variables”. In: submitted to Pattern Recognition, pp. 1-21.
van de Velden, M., A. Iodice D’Enza, and F. Palumbo (2017). “Cluster correspondence analysis”. In: Psychometrika 82.1, pp. 158-185.