Association-based distances for categorical and mixed-type data

Alfonso Iodice D’Enza, Michel van de Velden, Angelos Markos and Carlo Cavicchia

 

 

 

 

outline

 

independent vs association-based (AB) distances

 

AB distances for non-continuous data

  • categorical/continuous interaction?

 

AB learning

  • unsupervised: spectral clustering

  • supervised: KNN

 

perspectives

 

AB distances: an intuition

independent vs association-based distances

a mixed data set

  • \(n\) observations described by \(Q_{c}\) categorical and \(Q_{d}\) continuous variables

  • the \(n\times Q\) data matrix \({\bf X}=\left[{\bf X}_{cat},{\bf X}_{con}\right]\), with \(Q=Q_{c}+Q_{d}\), is column-wise partitioned

independence-based pairwise distance

by-variable distances are computed and then added together:

  • e.g. Euclidean or Manhattan distances for the continuous case

  • e.g. Hamming (matching) distance for the categorical case

in the mixed data case, combine dissimilarity measures for each type of variable, as in the Gower index

\[\begin{equation} d(x_i,x_{i'}) =\alpha \sum_{j=1}^{Q_{c}}I(x_{ij}\neq x_{i'j})+ (1-\alpha)\sum_{j=Q_{c}+1}^{Q}\frac{|x_{ij} - x_{i'j}|}{\text{range}(X_{j})} \label{GowerDissim} \end{equation}\]

where \(I(\cdot)=1\) if its argument is true (0 otherwise), and \(\alpha=\frac{Q_{c}}{Q}\)
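a minimal numpy sketch of the Gower-type combination above; the array names and the toy data are illustrative:

```python
import numpy as np

def gower_pair(x_cat_i, x_cat_j, x_con_i, x_con_j, ranges, alpha):
    """Gower-type dissimilarity between two mixed-type observations.

    x_cat_*: categorical codes; x_con_*: continuous values;
    ranges: per-variable ranges of the continuous variables;
    alpha: weight of the categorical part (e.g. Q_c / Q).
    """
    d_cat = np.sum(x_cat_i != x_cat_j)                   # matching part
    d_con = np.sum(np.abs(x_con_i - x_con_j) / ranges)   # range-scaled L1 part
    return alpha * d_cat + (1 - alpha) * d_con

# toy example: Q_c = 2 categorical and 2 continuous variables
x_cat = np.array([["a", "x"], ["b", "x"]])
x_con = np.array([[1.0, 10.0], [3.0, 14.0]])
ranges = x_con.max(axis=0) - x_con.min(axis=0)
alpha = 2 / 4                                            # Q_c / Q
print(gower_pair(x_cat[0], x_cat[1], x_con[0], x_con[1], ranges, alpha))
```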

independent vs association-based distances

an alternative approach: association-based

The rationale of association-based distances is that not all observed differences carry the same weight:

  • differences in line with the association/correlation between variables are less informative and should be downweighted

    • for continuous data, an example is the Mahalanobis distance
    \[\begin{equation} d(\mathbf{x}_i, \mathbf{x}_{i'}) = \sqrt{(\mathbf{x}_i - \mathbf{x}_{i'})^{\sf T} \mathbf{S}^{-1} (\mathbf{x}_i - \mathbf{x}_{i'})} \label{MahalanobisDist} \end{equation}\]

    where \(\mathbf{S}\) is the sample covariance matrix (a small sketch follows this list)

    • for categorical data, the choice is neither obvious nor trivial
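a minimal numpy sketch of the Mahalanobis distance above; a fixed covariance matrix stands in for the sample covariance \(\mathbf{S}\) and illustrates how differences in line with the correlation are downweighted:

```python
import numpy as np

def mahalanobis(x_i, x_j, S):
    """Mahalanobis distance between two observations, given a covariance matrix S."""
    d = x_i - x_j
    return np.sqrt(d @ np.linalg.inv(S) @ d)

# two positively correlated variables (in practice, S is the sample covariance)
S = np.array([[1.0, 0.8],
              [0.8, 1.0]])

# the same Euclidean gap counts less when it runs along the correlation ...
print(mahalanobis(np.array([0.0, 0.0]), np.array([1.0, 1.0]), S))   # ~1.05
# ... and more when it runs against it
print(mahalanobis(np.array([0.0, 0.0]), np.array([1.0, -1.0]), S))  # ~3.16
```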

independent vs association-based distances

distances for categorical data: a framework

Let \({\bf Z}=\left[{\bf Z}_{1},{\bf Z}_{2},\ldots,{\bf Z}_{Q_c}\right]\) be the one-hot encoding of \({\bf X}_{cat}\)

The pair-wise distances between categorical observations are given by

\[{\bf D}_{cat}={\bf Z}{\bf \Delta}{\bf Z}^{\sf T}= \left[\begin{array}{ccc} {\bf Z}_{1} & \dots & {\bf Z}_{Q_{c}} \end{array} \right]\left[\begin{array}{ccc} {\bf\Delta}_1 & & \\ & \ddots &\\ & & {\bf\Delta}_{Q_{c}} \end{array} \right] \left[ \begin{array}{c} {\bf Z}_{1}^{\sf T}\\ \vdots \\ {\bf Z}_{Q_{c}}^{\sf T} \end{array} \right]\]

  • the definition of \({\bf \Delta}\) determines the distance in use

  • if \(\Delta_{j}\)’s are diagonal, then \({\bf D}_{cat}\) is independence-based

  • if \(\Delta_{j}\)’s have non-zero off-diagonal terms, then \({\bf D}_{cat}\) is association-based

\(\Delta_{j}\)’s for AB distances

the matrix of co-occurrence proportions is

\[ {\bf P} =\frac{1}{n} \begin{bmatrix} {\bf Z}_{1}^{\sf T}{\bf Z}_{1} & {\bf Z}_{1}^{\sf T}{\bf Z}_{2} & \ldots & {\bf Z}_{1}^{\sf T}{\bf Z}_{Q_{c}}\\ {\bf Z}_{2}^{\sf T}{\bf Z}_{1} & {\bf Z}_{2}^{\sf T}{\bf Z}_{2} & \ldots & {\bf Z}_{2}^{\sf T}{\bf Z}_{Q_{c}}\\ \vdots & \vdots & \ddots & \vdots \\ {\bf Z}_{Q_{c}}^{\sf T}{\bf Z}_{1} & {\bf Z}_{Q_{c}}^{\sf T}{\bf Z}_{2} & \ldots & {\bf Z}_{Q_{c}}^{\sf T}{\bf Z}_{Q_{c}} \end{bmatrix} \]

 

\({\bf R} = {\bf P}_{d}^{-1}\left({\bf P}-{\bf P}_{d}\right)\), with \({\bf P}_{d}=diag({\bf P})\), is a block matrix such that

  • the general off-diagonal block \({\bf R}_{ji}\) is of size \(q_{j}\times q_{i}\)

  • the \(a^{th}\) row of \({\bf R}_{ji}\), \({\bf r}^{ji}_{a}\), is the conditional distribution of the \(i^{th}\) variable, given the \(a^{th}\) category of the \(j^{th}\) variable
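a minimal numpy sketch of \({\bf P}\) and \({\bf R}\) from the one-hot encoded data; names and the toy data are illustrative:

```python
import numpy as np

def one_hot(X_cat):
    """Column-wise one-hot encoding of a categorical data matrix."""
    blocks = []
    for col in X_cat.T:
        levels = np.unique(col)
        blocks.append((col[:, None] == levels[None, :]).astype(float))
    return np.hstack(blocks)

# toy data: Q_c = 2 categorical variables, n = 5 observations
X_cat = np.array([["a", "x"], ["b", "y"], ["a", "y"], ["b", "x"], ["a", "x"]])
Z = one_hot(X_cat)
n = Z.shape[0]

P = (Z.T @ Z) / n                   # co-occurrence proportions
P_d = np.diag(np.diag(P))           # marginal proportions on the diagonal
R = np.linalg.inv(P_d) @ (P - P_d)  # row a of block R_ji = distribution of variable i given category a of variable j
print(R.round(2))
```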

\(\Delta_{j}\)’s for AB distances

the \((a,b)^{th}\) entry of \(\Delta_{j}\) is

\[ \delta^{j}(a,b)=\sum_{i\neq j}^{Q_{c}}w_{ji}\Phi^{ji}({\bf r}^{ji}_{a},{\bf r}^{ji}_{b}) \]

defining \(\Phi^{ji}({\bf r}^{ji}_{a},{\bf r}^{ji}_{b})=\frac{1}{2}\sum_{\ell=1}^{q_{i}}|{\bf r}^{ji}_{a \ell}-{\bf r}^{ji}_{b \ell}|\) leads to the total variation distance (TVD) (Le and Ho, 2005)

 

\(w_{ji}\) weights the contribution of each of the \(Q_{c}-1\) variables other than the \(j^{th}\):

  • \(w_{ji}=1/(Q_{c}-1)\) for equal contribution

  • alternative weighting schemes are defined, e.g., in Mousavi and Sehhati (2023)
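a sketch of the TVD-based \(\Delta_{j}\)'s with equal weights \(w_{ji}=1/(Q_{c}-1)\); helper names and the toy data are illustrative:

```python
import numpy as np

def tvd_deltas(X_cat):
    """Per-variable category-dissimilarity matrices Delta_j based on the total
    variation distance between conditional distributions (equal weights)."""
    n, Qc = X_cat.shape
    levels = [np.unique(X_cat[:, j]) for j in range(Qc)]
    Zs = [(X_cat[:, j][:, None] == levels[j][None, :]).astype(float) for j in range(Qc)]
    deltas = []
    for j in range(Qc):
        qj = len(levels[j])
        Dj = np.zeros((qj, qj))
        nj = Zs[j].sum(axis=0)                      # category counts of variable j
        for i in range(Qc):
            if i == j:
                continue
            # rows: conditional distributions of variable i given each category of j
            R_ji = (Zs[j].T @ Zs[i]) / nj[:, None]
            for a in range(qj):
                for b in range(qj):
                    Dj[a, b] += 0.5 * np.abs(R_ji[a] - R_ji[b]).sum() / (Qc - 1)
        deltas.append(Dj)
    return deltas

X_cat = np.array([["a", "x"], ["b", "y"], ["a", "y"], ["b", "x"], ["a", "x"]])
for j, Dj in enumerate(tvd_deltas(X_cat)):
    print(f"Delta_{j+1}:\n", Dj.round(2))
```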

association-based distances for mixed data?

  • assume \({\bf D}_{cat}\) is the TVD and \({\bf D}_{con}\) the Mahalanobis distance, both association-based

  • their convex combination \({\bf D}\), however, does not take into account the interaction between the two blocks of variables

Aim: define \(\delta_{con}(a,b)\) so that it accounts for the categorical/continuous interactions

How to define \(\delta_{con}(a,b)\)?

the categorical/continuous interaction is proportional to the discriminant power of the continuous variables for each category pair \((a,b)\) of the \(j^{th}\) categorical variable, \(j=1,\ldots,Q_{c}\)

it is assessed via nearest neighbors (NN) averaging

if \(x_{\ell j}=a\), the proportion of nearest neighbors of the \(\ell^{th}\) observation that are also labeled \(a\) is

\[ {\hat\pi}_{a\ell}=\frac{1}{n^{j}_{a}\pi_{nn}} \sum_{m\in \mathcal{N}^{a}_{\ell}}I(x_{mj}=a) \]

if \(x_{\ell j}=b\), the proportion of nearest neighbors of the \(\ell^{th}\) observation that are also labeled \(b\) is

\[ {\hat\pi}_{b\ell}=\frac{1}{n^{j}_{b}\pi_{nn}} \sum_{m\in \mathcal{N}^{b}_{\ell}}I(x_{mj}=b) \]

  • \(n^{j}_{a}\) and \(n^{j}_{b}\) are the absolute frequencies of categories \(a\) and \(b\) of the \(j^{th}\) variable

  • \(\pi_{nn}\) is the user-defined proportion of nearest neighbors

  • \(\mathcal{N}^{a}_{\ell}\) (\(\mathcal{N}^{b}_{\ell}\)) is the set of nearest neighbors of the \(\ell^{th}\) observation when \(x_{\ell j}=a\) (\(x_{\ell j}=b\))

How to define \(\delta_{con}(a,b)\)?

We consider the improvement over chance that is obtained when the continuous variables are used to correctly classify the observations:


\[ \delta^{j}_{con}(a)=\left[\frac{1}{n_{a}^{j}}\sum_{\ell=1}^{n_{a}^{j}} I(\hat{\pi}_{a\ell}\geq .5)\right]-.5 \]


\[ \delta^{j}_{con}(b)=\left[\frac{1}{n_{b}^{j}}\sum_{\ell=1}^{n_{b}^{j}} I(\hat{\pi}_{b\ell}\geq .5)\right]-.5 \]

finally, the \((a,b)^{th}\) entry of \(\Delta^{j}_{con}\) is given by

\[ \delta^{j}_{con}(a,b) = \delta^{j}_{con}(a) + \delta^{j}_{con}(b). \]
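a sketch of \(\delta^{j}_{con}(a,b)\) for a single categorical variable, assuming the neighborhood size is \(\lceil \pi_{nn}\, n^{j}_{a}\rceil\) and that neighbors are searched among all observations with plain Euclidean distances; all names and these choices are illustrative:

```python
import numpy as np

def delta_con_pair(X_con, labels, a, b, pi_nn=0.2):
    """NN-based delta_con(a, b) for one categorical variable: for each observation
    labelled a (resp. b), the proportion of its nearest neighbours in the continuous
    space sharing the same label, turned into an improvement over chance."""
    def delta_one(cat):
        idx = np.where(labels == cat)[0]
        k = max(1, int(np.ceil(pi_nn * len(idx))))   # assumed neighborhood size
        hits = []
        for l in idx:
            d = np.linalg.norm(X_con - X_con[l], axis=1)
            d[l] = np.inf                            # exclude the point itself
            nn = np.argsort(d)[:k]
            hits.append(np.mean(labels[nn] == cat) >= 0.5)
        return np.mean(hits) - 0.5
    return delta_one(a) + delta_one(b)

# toy data: the continuous variables separate the two categories well
rng = np.random.default_rng(0)
X_con = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
labels = np.array(["a"] * 20 + ["b"] * 20)
print(delta_con_pair(X_con, labels, "a", "b", pi_nn=0.25))
```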

continuous attributes in the TVD computation: NN-based

unsupervised distance learning: spectral clustering

Spectral clustering (Ng, Jordan, and Weiss, 2001): a graph partitioning problem

Graph representation

a graph representation of the data matrix \(\bf X\): the aim is then to cut the graph into \(K\) groups (clusters)

the affinity matrix \({\bf A}\)

the elements \(w_{ij}\) of \(\bf A\) are high (low) if \(i\) and \(j\) are in the same (different) groups

      a      b      c      d
 a    0      0      w_ac   0
 b    0      0      w_bc   w_bd
 c    w_ca   w_cb   0      w_cd
 d    0      w_db   w_dc   0

Spectral clustering: making the graph easy to cut

An approximated solution to the graph partitioning problem:

  • spectral decomposition of the graph Laplacian matrix, which is a normalized version of the affinity matrix \({\bf A}\):

\[ {\bf L} = {\bf D}_{r}^{-1/2}\underbrace{\exp\!\left(-{\bf D}^{2}(2\sigma^{2})^{-1}\right)}_{\text{affinity matrix }{\bf A}}{\bf D}_{r}^{-1/2} = {\bf Q}{\bf \Lambda}{\bf Q}^{\sf T} \]

  • \(\bf D\) is the \(n\times n\) matrix of pairwise Euclidean distances

  • the \(\sigma\) parameter dictates the number of neighbors each observation is linked to (rule of thumb: median distance to the 20th nearest neighbor)

  • diagonal terms of \(\bf A\) are set to zero: \(a_{ii}=0\), \(i=1,\ldots,n\)

  • \({\bf D}_{r}=diag({\bf r})\), \({\bf r}={\bf A}{\bf 1}\) and \({\bf 1}\) is an \(n\)-dimensional vector of 1’s

  • the spectral clustering of the \(n\) original objects is obtained by applying \(K\)-means to the rows of the matrix \({\bf{\tilde Q}}\), containing the first \(K\) columns of \(\bf Q\)

Spectral clustering: the NJW procedure

  • step 1: compute the pairwise distances matrix \(\bf D\)

  • step 2: switch to the affinity matrix \(\bf A\)

  • step 3: normalize the affinity matrix to obtain the Laplacian matrix \(\bf L\)

  • step 4: decompose \(\bf L\) and obtain \({\bf {\tilde Q}}\), matrix of the \(K\)-dimensional spectral scores

  • step 5: apply the \(K\)-means on \({\bf {\tilde Q}}\) to obtain the cluster allocation vector
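a compact numpy/scikit-learn sketch of steps 1-5 above on continuous data; the row normalization of \({\bf{\tilde Q}}\) follows the original NJW paper and is not listed in the steps above:

```python
import numpy as np
from sklearn.cluster import KMeans

def njw_spectral_clustering(X, K, seed=0):
    """Steps 1-5 of the NJW procedure (Ng, Jordan, and Weiss, 2001)."""
    # step 1: pairwise Euclidean distance matrix D
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # rule of thumb for sigma: median distance to the 20th nearest neighbour
    k20 = min(20, X.shape[0] - 1)
    sigma = np.median(np.sort(D, axis=1)[:, k20])
    # step 2: affinity matrix A, with zero diagonal
    A = np.exp(-D**2 / (2 * sigma**2))
    np.fill_diagonal(A, 0.0)
    # step 3: normalized Laplacian L = D_r^{-1/2} A D_r^{-1/2}
    r = A.sum(axis=1)
    Dr_inv_sqrt = np.diag(1.0 / np.sqrt(r))
    L = Dr_inv_sqrt @ A @ Dr_inv_sqrt
    # step 4: spectral scores = eigenvectors of the K largest eigenvalues of L
    eigval, eigvec = np.linalg.eigh(L)
    Q_tilde = eigvec[:, np.argsort(eigval)[::-1][:K]]
    # NJW additionally renormalize the rows of Q_tilde to unit length
    Q_tilde = Q_tilde / np.linalg.norm(Q_tilde, axis=1, keepdims=True)
    # step 5: K-means on the spectral scores
    return KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(Q_tilde)

# toy example with two well-separated groups
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(2, 0.3, (30, 2))])
print(njw_spectral_clustering(X, K=2))
```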

scatterplot matrix of the spectral scores

Spectral clustering: solution and performance

Recent benchmarking studies (e.g., Murugesan, Cho, and Tortora, 2021): SC works well, particularly in the case of non-convex and overlapping clusters

Spectral clustering extensions: non-continuous data

the definition of the pairwise distance matrix \(\bf D\) is key to extending SC to mixed data

  • Gower dissimilarity is a straightforward option

Ad hoc SC procedures for mixed data have been proposed in the literature

  • SpectralCAT (David and Averbuch, 2012): in the case of mixed variables, the continuous ones are suitably categorized

  • SC for mixed-type (Mbuga and Tortora, 2021): a convex combination of Hamming and Euclidean distances to deal with categorical and continuous variables, respectively

SC on synthetic mixed data

  • structured categorical data generated as in van de Velden, Iodice D’Enza, and Palumbo (2017);
  • continuous variables generated accordingly

supervised distance learning: KNN

supervised AB-distance

  • the class labels are categories of a further variable \(y\) (the response)

  • a supervised variant of the AB-distance can be defined that takes into account the association between \(y\) and each of the other variables.

let \({\bf Z}_{y}\) be the one-hot encoding of the response; the matrix \({\bf R}\) then becomes

\[ {\bf R}_{s} = {\bf P}_{z}^{-1}\left( {\bf Z}^{\sf T}{\bf Z}_{y} \right)= {\bf P}_{z}^{-1} \begin{bmatrix} {\bf Z}_{1}^{\sf T}{\bf Z}_{y}\\ \vdots \\ {\bf Z}_{Q_{c}}^{\sf T}{\bf Z}_{y} \end{bmatrix} \]

the \((a,b)^{th}\) general entry of \(\Delta^{j}_{s}\) is given by

\[ \delta_{s}^{j}(a,b)= w_{j}\left[\frac{1}{2}\sum_{\ell=1}^{q_{y}}|{\bf r}^{j}_{a \ell}-{\bf r}^{j}_{b \ell}|\right] \]

non-lazy KNN for categorical/mixed data

  • the KNN classifier is referred to as lazy (there is no fitting step on the training set; all computation happens at assessment on the test set)

  • using the supervised AB-distance, \(\Delta_{s}\) is fitted on the training set

  • the distance-to-neighbors is computed as

\[\begin{equation} {\bf D}_{s} = {\bf Z}_{test}{\Delta}_{s}{\bf Z}_{train}^{\sf T} \end{equation}\]

  • the test set accuracy is used to tune the hyperparameter \(K\)
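a sketch of the non-lazy KNN on purely categorical predictors under the supervised AB-distance above; the equal weight \(w_{j}=1/Q_{c}\) and all helper names are illustrative choices:

```python
import numpy as np
from scipy.linalg import block_diag

def fit_delta_s(X_train_cat, y_train):
    """Estimate the supervised Delta_s blocks on the training set:
    delta_s^j(a, b) = w_j * TVD between the distributions of y given a and given b."""
    Qc = X_train_cat.shape[1]
    levels = [np.unique(X_train_cat[:, j]) for j in range(Qc)]
    y_levels = np.unique(y_train)
    Zy = (y_train[:, None] == y_levels[None, :]).astype(float)
    deltas = []
    for j in range(Qc):
        Zj = (X_train_cat[:, j][:, None] == levels[j][None, :]).astype(float)
        Rj = (Zj.T @ Zy) / Zj.sum(axis=0)[:, None]   # rows: P(y | category of variable j)
        tvd = 0.5 * np.abs(Rj[:, None, :] - Rj[None, :, :]).sum(axis=-1)
        deltas.append(tvd / Qc)                      # illustrative equal weight w_j = 1/Q_c
    return levels, deltas

def knn_predict(X_test_cat, X_train_cat, y_train, levels, deltas, K=5):
    """Non-lazy KNN: D_s = Z_test Delta_s Z_train^T, then a majority vote among the K nearest."""
    encode = lambda X: np.hstack([(X[:, j][:, None] == levels[j][None, :]).astype(float)
                                  for j in range(X.shape[1])])
    D_s = encode(X_test_cat) @ block_diag(*deltas) @ encode(X_train_cat).T
    preds = []
    for row in D_s:
        nn = np.argsort(row)[:K]
        vals, counts = np.unique(y_train[nn], return_counts=True)
        preds.append(vals[np.argmax(counts)])
    return np.array(preds)

# toy example
X_train = np.array([["a", "x"], ["a", "y"], ["b", "x"], ["b", "y"], ["a", "x"], ["b", "y"]])
y_train = np.array(["pos", "pos", "neg", "neg", "pos", "neg"])
X_test = np.array([["a", "y"], ["b", "x"]])
levels, deltas = fit_delta_s(X_train, y_train)
print(knn_predict(X_test, X_train, y_train, levels, deltas, K=3))
```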

KNN and supervised AB-distance

  • structured categorical data generated as in van de Velden, Iodice D’Enza, and Palumbo (2017);

Final considerations and future work

  • association-based measures aim to go beyond match/mismatch of categories

  • a non-parametric approach to measure interactions is suitable for non-convex/oddly shaped clusters/classes

  • in supervised settings, AB distances make it possible to take the response into account in the pairwise computations

  • \(\delta^{j}_{con}(a,b)\) is computationally demanding (but it can be made manageable)

  • convex combination of the continuous/categorical distances

    • \(\alpha\) as a hyperparameter
    • \(\alpha=0.5\) and feature engineering (to achieve commensurability)

main references

David, G. and A. Averbuch (2012). “SpectralCAT: categorical spectral clustering of numerical and nominal data”. In: Pattern Recognition 45.1, pp. 416-433.

Le, S. Q. and T. B. Ho (2005). “An association-based dissimilarity measure for categorical data”. In: Pattern Recognition Letters 26.16, pp. 2549-2557.

Mbuga, F. and C. Tortora (2021). “Spectral Clustering of Mixed-Type Data”. In: Stats 5.1, pp. 1-11.

Mousavi, E. and M. Sehhati (2023). “A Generalized Multi-Aspect Distance Metric for Mixed-Type Data Clustering”. In: Pattern Recognition, p. 109353.

Murugesan, N., I. Cho, and C. Tortora (2021). “Benchmarking in cluster analysis: a study on spectral clustering, DBSCAN, and K-Means”. In: Conference of the International Federation of Classification Societies. Springer, pp. 175-185.

Ng, A., M. Jordan, and Y. Weiss (2001). “On spectral clustering: analysis and an algorithm”. In: Advances in Neural Information Processing Systems. Vol. 14, p. 849.

van de Velden, M., A. Iodice D’Enza, A. Markos, et al. (2023). “A general framework for implementing distances for categorical variables”. In: submitted to Pattern Recognition, pp. 1-21.

van de Velden, M., A. Iodice D’Enza, and F. Palumbo (2017). “Cluster correspondence analysis”. In: Psychometrika 82.1, pp. 158-185.