Association-based distances for categorical and mixed-type data

Alfonso Iodice D’Enza, Michel van de Velden, Angelos Markos and Carlo Cavicchia

 

 

 

 

outline

 

independent vs association-based (AB) distances

 

AB distances for non-continuous data

  • categorical/continuous interaction?

 

AB learning

  • unsupervised: spectral clustering

  • supervised: KNN

 

perspectives

 

AB distances: an intuition

independent vs association-based distances

a mixed data set

  • \(n\) observations described by \(Q_{c}\) categorical and \(Q_{d}\) continuous variables

  • the \(n\times Q\) data matrix \({\bf X}=\left[{\bf X}_{cat},{\bf X}_{con}\right]\), with \(Q=Q_{c}+Q_{d}\), is column-wise partitioned

independence-based pairwise distance

by-variable distances are computed and then added together:

  • e.g. Euclidean or Manhattan distances for the continuous case

  • e.g. Hamming (matching) distance for the categorical case

in the mixed data case, combine dissimilarity measures for each type of variable, as in the Gower index

\[\begin{equation} d(x_i,x_{i'}) =\alpha \sum_{j=1}^{Q_{c}}I(x_{ij}\neq x_{i'j})+ (1-\alpha)\sum_{j=Q_{c}+1}^{Q}\frac{|x_{ij} - x_{i'j}|}{\text{range}(X_{j})} \label{GowerDissim} \end{equation}\]

where \(I(\cdot)=1\) if its argument is true (0 otherwise), and \(\alpha=\frac{Q_{c}}{Q}\)
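a minimal numpy sketch of the Gower-type combination above; the array names and the toy data are illustrative:

```python
import numpy as np

def gower_pair(x_cat_i, x_cat_j, x_con_i, x_con_j, ranges, alpha):
    """Gower-type dissimilarity between two mixed-type observations.

    x_cat_*: categorical codes; x_con_*: continuous values;
    ranges: per-variable ranges of the continuous variables;
    alpha: weight of the categorical part (e.g. Q_c / Q).
    """
    d_cat = np.sum(x_cat_i != x_cat_j)                   # matching part
    d_con = np.sum(np.abs(x_con_i - x_con_j) / ranges)   # range-scaled L1 part
    return alpha * d_cat + (1 - alpha) * d_con

# toy example: Q_c = 2 categorical and 2 continuous variables
x_cat = np.array([["a", "x"], ["b", "x"]])
x_con = np.array([[1.0, 10.0], [3.0, 14.0]])
ranges = x_con.max(axis=0) - x_con.min(axis=0)
alpha = 2 / 4                                            # Q_c / Q
print(gower_pair(x_cat[0], x_cat[1], x_con[0], x_con[1], ranges, alpha))
```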

independent vs association-based distances

an alternative approach: association-based

The rationale of association-based distances is that not all observed differences carry the same weight:

  • differences in line with the association/correlation between variables are less informative and should be downweighted

    • for continuous data, an example is the Mahalanobis distance
    \[\begin{equation} d(\mathbf{x}_i, \mathbf{x}_{i'}) = \sqrt{(\mathbf{x}_i - \mathbf{x}_{i'})^{\sf T} \mathbf{S}^{-1} (\mathbf{x}_i - \mathbf{x}_{i'})} \label{MahalanobisDist} \end{equation}\]

    where \(\mathbf{S}\) is the sample covariance matrix (a small sketch follows this list)

    • for categorical data, the choice is neither obvious nor trivial
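a minimal numpy sketch of the Mahalanobis distance above; a fixed covariance matrix stands in for the sample covariance \(\mathbf{S}\) and illustrates how differences in line with the correlation are downweighted:

```python
import numpy as np

def mahalanobis(x_i, x_j, S):
    """Mahalanobis distance between two observations, given a covariance matrix S."""
    d = x_i - x_j
    return np.sqrt(d @ np.linalg.inv(S) @ d)

# two positively correlated variables (in practice, S is the sample covariance)
S = np.array([[1.0, 0.8],
              [0.8, 1.0]])

# the same Euclidean gap counts less when it runs along the correlation ...
print(mahalanobis(np.array([0.0, 0.0]), np.array([1.0, 1.0]), S))   # ~1.05
# ... and more when it runs against it
print(mahalanobis(np.array([0.0, 0.0]), np.array([1.0, -1.0]), S))  # ~3.16
```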

independent vs association-based distances

distances for categorical data: a framework

Let \({\bf Z}=\left[{\bf Z}_{1},{\bf Z}_{2},\ldots,{\bf Z}_{Q_c}\right]\) be the one-hot encoding of \({\bf X}_{cat}\)

The pair-wise distances between categorical observations are given by

\[{\bf D}_{cat}={\bf Z}{\bf \Delta}{\bf Z}^{\sf T}= \left[\begin{array}{ccc} {\bf Z}_{1} & \dots & {\bf Z}_{Q_{c}} \end{array} \right]\left[\begin{array}{ccc} {\bf\Delta}_1 & & \\ & \ddots &\\ & & {\bf\Delta}_{Q_{c}} \end{array} \right] \left[ \begin{array}{c} {\bf Z}_{1}^{\sf T}\\ \vdots \\ {\bf Z}_{Q_{c}}^{\sf T} \end{array} \right]\]

  • the definition of \({\bf \Delta}\) determines the distance in use

  • if \(\Delta_{j}\)’s are diagonal, then \({\bf D}_{cat}\) is independence-based

  • if \(\Delta_{j}\)’s have non-zero off-diagonal terms, then \({\bf D}_{cat}\) is association-based

\(\Delta_{j}\)’s for AB distances

the matrix of co-occurrence proportions is

\[ {\bf P} =\frac{1}{n} \begin{bmatrix} {\bf Z}_{1}^{\sf T}{\bf Z}_{1} & {\bf Z}_{1}^{\sf T}{\bf Z}_{2} & \ldots & {\bf Z}_{1}^{\sf T}{\bf Z}_{Q_{c}}\\ {\bf Z}_{2}^{\sf T}{\bf Z}_{1} & {\bf Z}_{2}^{\sf T}{\bf Z}_{2} & \ldots & {\bf Z}_{2}^{\sf T}{\bf Z}_{Q_{c}}\\ \vdots & \vdots & \ddots & \vdots \\ {\bf Z}_{Q_{c}}^{\sf T}{\bf Z}_{1} & {\bf Z}_{Q_{c}}^{\sf T}{\bf Z}_{2} & \ldots & {\bf Z}_{Q_{c}}^{\sf T}{\bf Z}_{Q_{c}} \end{bmatrix} \]

 

\({\bf R} = {\bf P}_{d}^{-1}\left({\bf P}-{\bf P}_{d}\right)\), with \({\bf P}_{d}=diag({\bf P})\), is a block matrix such that

  • the general off-diagonal block \({\bf R}_{ji}\) is of size \(q_{j}\times q_{i}\)

  • the \(a^{th}\) row of \({\bf R}_{ji}\), \({\bf r}^{ji}_{a}\), is the conditional distribution of the \(i^{th}\) variable, given the \(a^{th}\) category of the \(j^{th}\) variable
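a minimal numpy sketch of \({\bf P}\) and \({\bf R}\) from the one-hot encoded data; names and the toy data are illustrative:

```python
import numpy as np

def one_hot(X_cat):
    """Column-wise one-hot encoding of a categorical data matrix."""
    blocks = []
    for col in X_cat.T:
        levels = np.unique(col)
        blocks.append((col[:, None] == levels[None, :]).astype(float))
    return np.hstack(blocks)

# toy data: Q_c = 2 categorical variables, n = 5 observations
X_cat = np.array([["a", "x"], ["b", "y"], ["a", "y"], ["b", "x"], ["a", "x"]])
Z = one_hot(X_cat)
n = Z.shape[0]

P = (Z.T @ Z) / n                   # co-occurrence proportions
P_d = np.diag(np.diag(P))           # marginal proportions on the diagonal
R = np.linalg.inv(P_d) @ (P - P_d)  # row a of block R_ji = distribution of variable i given category a of variable j
print(R.round(2))
```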

\(\Delta_{j}\)’s for AB distances

the \((a,b)^{th}\) entry of \(\Delta_{j}\) is

\[ \delta^{j}(a,b)=\sum_{i\neq j}^{Q_{c}}w_{ji}\Phi^{ji}({\bf r}^{ji}_{a},{\bf r}^{ji}_{b}) \]

defining \(\Phi^{ji}({\bf r}^{ji}_{a},{\bf r}^{ji}_{b})=\frac{1}{2}\sum_{\ell=1}^{q_{i}}|{\bf r}^{ji}_{a \ell}-{\bf r}^{ji}_{b \ell}|\) leads to the total variation distance (TVD) (Le and Ho, 2005)

 

\(w_{ji}\) weights the contribution of each of the \(Q_{c}-1\) variables other than the \(j^{th}\):

  • \(w_{ji}=1/(Q_{c}-1)\) for equal contribution

  • alternative weighting schemes are defined, e.g., in Mousavi and Sehhati (2023)
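a sketch of the TVD-based \(\Delta_{j}\)'s with equal weights \(w_{ji}=1/(Q_{c}-1)\); helper names and the toy data are illustrative:

```python
import numpy as np

def tvd_deltas(X_cat):
    """Per-variable category-dissimilarity matrices Delta_j based on the total
    variation distance between conditional distributions (equal weights)."""
    n, Qc = X_cat.shape
    levels = [np.unique(X_cat[:, j]) for j in range(Qc)]
    Zs = [(X_cat[:, j][:, None] == levels[j][None, :]).astype(float) for j in range(Qc)]
    deltas = []
    for j in range(Qc):
        qj = len(levels[j])
        Dj = np.zeros((qj, qj))
        nj = Zs[j].sum(axis=0)                      # category counts of variable j
        for i in range(Qc):
            if i == j:
                continue
            # rows: conditional distributions of variable i given each category of j
            R_ji = (Zs[j].T @ Zs[i]) / nj[:, None]
            for a in range(qj):
                for b in range(qj):
                    Dj[a, b] += 0.5 * np.abs(R_ji[a] - R_ji[b]).sum() / (Qc - 1)
        deltas.append(Dj)
    return deltas

X_cat = np.array([["a", "x"], ["b", "y"], ["a", "y"], ["b", "x"], ["a", "x"]])
for j, Dj in enumerate(tvd_deltas(X_cat)):
    print(f"Delta_{j+1}:\n", Dj.round(2))
```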

association-based distances for mixed data?

  • assume \({\bf D}_{cat}\) is the TVD and \({\bf D}_{con}\) the Mahalanobis distance, both association-based

  • their convex combination \({\bf D}\), however, does not take into account the interaction between the two blocks of variables

Aim: define \(\delta_{con}(a,b)\) so that it accounts for the categorical/continuous interactions

How to define \(\delta_{con}(a,b)\)?

the categorical/continuous interaction is proportional to the discriminant power of the continuous variables for each category pair \((a,b)\) of the \(j^{th}\) categorical variable, \(j=1,\ldots,Q_{c}\)

it is assessed via nearest neighbors (NN) averaging

if \(x_{\ell j}=a\), the proportion of nearest neighbors of the \(\ell^{th}\) observation that are also labeled \(a\) is

\[ {\hat\pi}_{a\ell}=\frac{1}{n^{j}_{a}\pi_{nn}} \sum_{m\in \mathcal{N}^{a}_{\ell}}I(x_{mj}=a) \]

if \(x_{\ell j}=b\), the proportion of nearest neighbors of the \(\ell^{th}\) observation that are also labeled \(b\) is

\[ {\hat\pi}_{b\ell}=\frac{1}{n^{j}_{b}\pi_{nn}} \sum_{m\in \mathcal{N}^{b}_{\ell}}I(x_{mj}=b) \]

  • \(n^{j}_{a}\) and \(n^{j}_{b}\) are the absolute frequencies of categories \(a\) and \(b\) of the \(j^{th}\) variable

  • \(\pi_{nn}\) is the user-defined proportion of nearest neighbors

  • \(\mathcal{N}^{a}_{\ell}\) (\(\mathcal{N}^{b}_{\ell}\)) is the set of nearest neighbors of the \(\ell^{th}\) observation when \(x_{\ell j}=a\) (\(x_{\ell j}=b\))

How to define \(\delta_{con}(a,b)\)?

We consider the improvement over chance that is obtained when the continuous variables are used to correctly classify the observations:


\[ \delta^{j}_{con}(a)=\left[\frac{1}{n_{a}^{j}}\sum_{\ell=1}^{n_{a}^{j}} I(\hat{\pi}_{a\ell}\geq .5)\right]-.5 \]


\[ \delta^{j}_{con}(b)=\left[\frac{1}{n_{b}^{j}}\sum_{\ell=1}^{n_{b}^{j}} I(\hat{\pi}_{b\ell}\geq .5)\right]-.5 \]

finally, the \((a,b)^{th}\) entry of \(\Delta^{j}_{con}\) is given by

\[ \delta^{j}_{con}(a,b) = \delta^{j}_{con}(a) + \delta^{j}_{con}(b). \]
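a sketch of \(\delta^{j}_{con}(a,b)\) for a single categorical variable, assuming the neighborhood size is \(\lceil \pi_{nn}\, n^{j}_{a}\rceil\) and that neighbors are searched among all observations with plain Euclidean distances; all names and these choices are illustrative:

```python
import numpy as np

def delta_con_pair(X_con, labels, a, b, pi_nn=0.2):
    """NN-based delta_con(a, b) for one categorical variable: for each observation
    labelled a (resp. b), the proportion of its nearest neighbours in the continuous
    space sharing the same label, turned into an improvement over chance."""
    def delta_one(cat):
        idx = np.where(labels == cat)[0]
        k = max(1, int(np.ceil(pi_nn * len(idx))))   # assumed neighborhood size
        hits = []
        for l in idx:
            d = np.linalg.norm(X_con - X_con[l], axis=1)
            d[l] = np.inf                            # exclude the point itself
            nn = np.argsort(d)[:k]
            hits.append(np.mean(labels[nn] == cat) >= 0.5)
        return np.mean(hits) - 0.5
    return delta_one(a) + delta_one(b)

# toy data: the continuous variables separate the two categories well
rng = np.random.default_rng(0)
X_con = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
labels = np.array(["a"] * 20 + ["b"] * 20)
print(delta_con_pair(X_con, labels, "a", "b", pi_nn=0.25))
```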

continuous attributes in the TVD computation: NN-based

unsupervised distance learning: spectral clustering

Spectral clustering (Ng, Jordan, and Weiss, 2001): a graph partitioning problem

Graph representation

a graph representation of the data matrix \(\bf X\): the aim is then to cut the graph into \(K\) groups (clusters)

the affinity matrix \({\bf A}\)

the elements \(w_{ij}\) of \(\bf A\) are high (low) if \(i\) and \(j\) are in the same (different) groups

      a      b      c      d
 a    0      0      w_ac   0
 b    0      0      w_bc   w_bd
 c    w_ca   w_cb   0      w_cd
 d    0      w_db   w_dc   0

Spectral clustering: making the graph easy to cut

An approximated solution to the graph partitioning problem:

  • spectral decomposition of the graph Laplacian matrix, which is a normalized version of the affinity matrix \({\bf A}\):

\[ {\bf L} = {\bf D}_{r}^{-1/2}\underbrace{\exp\!\left(-{\bf D}^{2}(2\sigma^{2})^{-1}\right)}_{\text{affinity matrix }{\bf A}}{\bf D}_{r}^{-1/2} = {\bf Q}{\bf \Lambda}{\bf Q}^{\sf T} \]

  • \(\bf D\) is the \(n\times n\) matrix of pairwise Euclidean distances

  • the \(\sigma\) parameter dictates the number of neighbors each observation is linked to (rule of thumb: median distance to the 20th nearest neighbor)

  • diagonal terms of \(\bf A\) are set to zero: \(a_{ii}=0\), \(i=1,\ldots,n\)

  • \({\bf D}_{r}=diag({\bf r})\), \({\bf r}={\bf A}{\bf 1}\) and \({\bf 1}\) is an \(n\)-dimensional vector of 1’s

  • the spectral clustering of the \(n\) original objects is obtained by applying \(K\)-means to the rows of the matrix \({\bf{\tilde Q}}\), containing the first \(K\) columns of \(\bf Q\)

Spectral clustering: the NJW procedure

  • step 1: compute the pairwise distances matrix \(\bf D\)

  • step 2: switch to the affinity matrix \(\bf A\)

  • step 3: normalize the affinity matrix to obtain the Laplacian matrix \(\bf L\)

  • step 4: decompose \(\bf L\) and obtain \({\bf {\tilde Q}}\), matrix of the \(K\)-dimensional spectral scores

  • step 5: apply the \(K\)-means on \({\bf {\tilde Q}}\) to obtain the cluster allocation vector
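a compact numpy/scikit-learn sketch of steps 1-5 above on continuous data; the row normalization of \({\bf{\tilde Q}}\) follows the original NJW paper and is not listed in the steps above:

```python
import numpy as np
from sklearn.cluster import KMeans

def njw_spectral_clustering(X, K, seed=0):
    """Steps 1-5 of the NJW procedure (Ng, Jordan, and Weiss, 2001)."""
    # step 1: pairwise Euclidean distance matrix D
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # rule of thumb for sigma: median distance to the 20th nearest neighbour
    k20 = min(20, X.shape[0] - 1)
    sigma = np.median(np.sort(D, axis=1)[:, k20])
    # step 2: affinity matrix A, with zero diagonal
    A = np.exp(-D**2 / (2 * sigma**2))
    np.fill_diagonal(A, 0.0)
    # step 3: normalized Laplacian L = D_r^{-1/2} A D_r^{-1/2}
    r = A.sum(axis=1)
    Dr_inv_sqrt = np.diag(1.0 / np.sqrt(r))
    L = Dr_inv_sqrt @ A @ Dr_inv_sqrt
    # step 4: spectral scores = eigenvectors of the K largest eigenvalues of L
    eigval, eigvec = np.linalg.eigh(L)
    Q_tilde = eigvec[:, np.argsort(eigval)[::-1][:K]]
    # NJW additionally renormalize the rows of Q_tilde to unit length
    Q_tilde = Q_tilde / np.linalg.norm(Q_tilde, axis=1, keepdims=True)
    # step 5: K-means on the spectral scores
    return KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(Q_tilde)

# toy example with two well-separated groups
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(2, 0.3, (30, 2))])
print(njw_spectral_clustering(X, K=2))
```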

scatterplot matrix of the spectral scores

Spectral clustering: solution and performance

Recent benchmarking studies (e.g., Murugesan, Cho, and Tortora, 2021): SC works well, particularly in the case of non-convex and overlapping clusters

Spectral clustering extensions: non-continuous data

the definition of the pairwise distance matrix \(\bf D\) is key to extending SC to mixed data

  • Gower dissimilarity is a straightforward option

Ad hoc SC procedures for mixed data have been proposed in the literature

  • SpectralCAT (David and Averbuch, 2012): in the case of mixed variables, the continuous ones are suitably categorized

  • SC for mixed-type (Mbuga and Tortora, 2021): a convex combination of Hamming and Euclidean distances to deal with categorical and continuous variables, respectively

SC on synthetic mixed data

  • structured categorical data generated as in van de Velden, Iodice D’Enza, and Palumbo (2017);
  • continuous variables generated accordingly

supervised distance learning: KNN

supervised AB-distance

  • the class labels are categories of a further variable \(y\) (the response)

  • a supervised variant of the AB-distance can be defined that takes into account the association between \(y\) and each of the other variables.

let \({\bf Z}_{y}\) be the one-hot encoding of the response; the matrix \({\bf R}\) then becomes

\[ {\bf R}_{s} = {\bf P}_{z}^{-1}\left( {\bf Z}^{\sf T}{\bf Z}_{y} \right)= {\bf P}_{z}^{-1} \begin{bmatrix} {\bf Z}_{1}^{\sf T}{\bf Z}_{y}\\ \vdots \\ {\bf Z}_{Q_{c}}^{\sf T}{\bf Z}_{y} \end{bmatrix} \]

the \((a,b)^{th}\) general entry of \(\Delta^{j}_{s}\) is given by

\[ \delta_{s}^{j}(a,b)= w_{j}\left[\frac{1}{2}\sum_{\ell=1}^{q_{y}}|{\bf r}^{j}_{a \ell}-{\bf r}^{j}_{b \ell}|\right] \]

non-lazy KNN for categorical/mixed data

  • the KNN classifier is referred to as lazy (there is no fitting step on the training set; all computation happens at assessment on the test set)

  • using the supervised AB-distance, \(\Delta_{s}\) is fitted on the training set

  • the distance-to-neighbors is computed as

\[\begin{equation} {\bf D}_{s} = {\bf Z}_{test}{\Delta}_{s}{\bf Z}_{train}^{\sf T} \end{equation}\]

  • the test set accuracy is used to tune the hyperparameter \(K\)
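a sketch of the non-lazy KNN on purely categorical predictors under the supervised AB-distance above; the equal weight \(w_{j}=1/Q_{c}\) and all helper names are illustrative choices:

```python
import numpy as np
from scipy.linalg import block_diag

def fit_delta_s(X_train_cat, y_train):
    """Estimate the supervised Delta_s blocks on the training set:
    delta_s^j(a, b) = w_j * TVD between the distributions of y given a and given b."""
    Qc = X_train_cat.shape[1]
    levels = [np.unique(X_train_cat[:, j]) for j in range(Qc)]
    y_levels = np.unique(y_train)
    Zy = (y_train[:, None] == y_levels[None, :]).astype(float)
    deltas = []
    for j in range(Qc):
        Zj = (X_train_cat[:, j][:, None] == levels[j][None, :]).astype(float)
        Rj = (Zj.T @ Zy) / Zj.sum(axis=0)[:, None]   # rows: P(y | category of variable j)
        tvd = 0.5 * np.abs(Rj[:, None, :] - Rj[None, :, :]).sum(axis=-1)
        deltas.append(tvd / Qc)                      # illustrative equal weight w_j = 1/Q_c
    return levels, deltas

def knn_predict(X_test_cat, X_train_cat, y_train, levels, deltas, K=5):
    """Non-lazy KNN: D_s = Z_test Delta_s Z_train^T, then a majority vote among the K nearest."""
    encode = lambda X: np.hstack([(X[:, j][:, None] == levels[j][None, :]).astype(float)
                                  for j in range(X.shape[1])])
    D_s = encode(X_test_cat) @ block_diag(*deltas) @ encode(X_train_cat).T
    preds = []
    for row in D_s:
        nn = np.argsort(row)[:K]
        vals, counts = np.unique(y_train[nn], return_counts=True)
        preds.append(vals[np.argmax(counts)])
    return np.array(preds)

# toy example
X_train = np.array([["a", "x"], ["a", "y"], ["b", "x"], ["b", "y"], ["a", "x"], ["b", "y"]])
y_train = np.array(["pos", "pos", "neg", "neg", "pos", "neg"])
X_test = np.array([["a", "y"], ["b", "x"]])
levels, deltas = fit_delta_s(X_train, y_train)
print(knn_predict(X_test, X_train, y_train, levels, deltas, K=3))
```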

KNN and supervised AB-distance

  • structured categorical data generated as in van de Velden, Iodice D’Enza, and Palumbo (2017);

Final considerations and future work

  • association-based measures aim to go beyond match/mismatch of categories

  • a non-parametric approach to measure interactions is suitable for non-convex/oddly shaped clusters/classes

  • in supervised settings, AB distances make it possible to take the response into account in the pairwise computations

  • \(\delta^{j}_{con}(a,b)\) is computationally demanding (but it can be made manageable)

  • convex combination of the continuous/categorical distances

    • \(\alpha\) as a hyperparameter
    • \(\alpha=0.5\) and feature engineering (to achieve commensurability)

main references

David, G. and A. Averbuch (2012). “SpectralCAT: categorical spectral clustering of numerical and nominal data”. In: Pattern Recognition 45.1, pp. 416-433.

Le, S. Q. and T. B. Ho (2005). “An association-based dissimilarity measure for categorical data”. In: Pattern Recognition Letters 26.16, pp. 2549-2557.

Mbuga, F. and C. Tortora (2021). “Spectral Clustering of Mixed-Type Data”. In: Stats 5.1, pp. 1-11.

Mousavi, E. and M. Sehhati (2023). “A Generalized Multi-Aspect Distance Metric for Mixed-Type Data Clustering”. In: Pattern Recognition, p. 109353.

Murugesan, N., I. Cho, and C. Tortora (2021). “Benchmarking in cluster analysis: a study on spectral clustering, DBSCAN, and K-Means”. In: Conference of the International Federation of Classification Societies. Springer, pp. 175-185.

Ng, A., M. Jordan, and Y. Weiss (2001). “On spectral clustering: analysis and an algorithm”. In: Advances in Neural Information Processing Systems. Vol. 14, p. 849.

van de Velden, M., A. Iodice D’Enza, A. Markos, et al. (2023). “A general framework for implementing distances for categorical variables”. In: submitted to Pattern Recognition, pp. 1-21.

van de Velden, M., A. Iodice D’Enza, and F. Palumbo (2017). “Cluster correspondence analysis”. In: Psychometrika 82.1, pp. 158-185.