Association-based learning for mixed data

Alfonso Iodice D’Enza, Angelos Markos, Michel van de Velden and Carlo Cavicchia

 

 

 

 

outline

independence- vs association-based (AB) distances

  • AB distances for continuous data

  • KNN learning from continuous data

AB distances for categorical data

  • the \(\Delta\) framework

  • entropy-based

  • supervised (should you have labeled data)

  • KNN learning from categorical data

AB distances for mixed data

  • categorical/continuous interaction?

  • KNN learning from mixed data

  • a touch of unsupervised learning: spectral clustering

perspectives

independence- vs association-based distances1

a mixed data set

  • \(n\) observations described by \(Q_{c}\) categorical and \(Q_{d}\) continuous variables, with \(Q=Q_{c}+Q_{d}\)

  • the \(n\times Q\) data matrix \({\bf X}=\left[{\bf X}_{cat},{\bf X}_{con}\right]\) is column-wise partitioned

independence-based pairwise distance

by-variable distances are computed, and then added together:

  • in the continuous case: Euclidean or Manhattan distances

  • in the categorical case: Hamming (matching) distance (among MANY others)

  • in the mixed data case, combine dissimilarity measures for each type of variable, as in the Gower index

association-based distances: continuous data

The rationale of association-based distances is that not all the observed differences weigh the same:

  • differences in line with the association/correlation between variables are less informative and should be downweighted

the continuous data case

  • Mahalanobis distance

\[ d(\mathbf{x}_i, \mathbf{x}_{i'}) = \sqrt{(\mathbf{x}_i - \mathbf{x}_{i'})^{\sf T} \mathbf{S}^{-1} (\mathbf{x}_i - \mathbf{x}_{i'})} \label{MahalanobisDist} \]

where \(\mathbf{S}\) is the sample covariance matrix
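a minimal numpy/scipy sketch of the pairwise Mahalanobis computation (data and names are illustrative, not from the slides):

```python
# pairwise Mahalanobis distances: d(x_i, x_i') = sqrt((x_i - x_i')^T S^{-1} (x_i - x_i'))
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X_con = rng.normal(size=(100, 5))               # n x Q_d continuous data (toy example)

S = np.cov(X_con, rowvar=False)                 # sample covariance matrix S
D_con = squareform(pdist(X_con, metric="mahalanobis", VI=np.linalg.inv(S)))
```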

association-based distances: continuous data

modified Mahalanobis distance1

\[ d(\mathbf{x}_i, \mathbf{x}_{i'}) = \sqrt{(\mathbf{x}_i - \mathbf{x}_{i'})^{\sf T} \mathbf{S}_{w}^{-1} (\mathbf{x}_i - \mathbf{x}_{i'})} \]

where

  • \(\mathbf{S}^{-1}_{w}=\mathbf{S}^{-1}\odot \mathbf{W}\) (Hadamard product)

  • \(\mathbf{W}\) is the matrix of pairwise mutual information between variables, with general element

\[ w_{jj'} = \frac{1}{n \cdot \ln 2} \left( \psi(n) + \psi(k) - \frac{1}{n} \sum_{\ell=1}^{n} \left[ \psi(n_{\ell}^j + 1) + \psi(n_{\ell}^{j'} + 1) \right] \right) \] where

  • \(\psi\) denotes the digamma function 2

  • \(k\) is the user-defined number of neighbors

  • consider, as a threshold, the max distance of the \(\ell^{th}\) observation to its \(k^{th}\) neighbor, computed with respect to the \(j\) and \(j'\) variables

  • \(n_{\ell}^{j}\) and \(n_{\ell}^{j'}\) are the number of observations within the threshold
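a sketch of the modified distance under the definition \(\mathbf{S}^{-1}_{w}=\mathbf{S}^{-1}\odot \mathbf{W}\); as an assumption, the digamma-based estimator above is replaced by scikit-learn's k-NN mutual information estimator:

```python
# modified Mahalanobis: weight the inverse covariance by pairwise mutual information
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def mi_matrix(X, k=5):
    """Rough stand-in for the digamma-based MI estimates (k-NN estimator via sklearn)."""
    Q = X.shape[1]
    W = np.zeros((Q, Q))
    for j in range(Q):
        W[:, j] = mutual_info_regression(X, X[:, j], n_neighbors=k)
    return (W + W.T) / 2                          # symmetrise the estimates

def modified_mahalanobis(X, k=5):
    S_w_inv = np.linalg.inv(np.cov(X, rowvar=False)) * mi_matrix(X, k)  # Hadamard product
    diff = X[:, None, :] - X[None, :, :]          # n x n x Q pairwise differences
    quad = np.einsum("abi,ij,abj->ab", diff, S_w_inv, diff)
    return np.sqrt(np.clip(quad, 0, None))        # clip guards against tiny negative values
```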

KNN learning: synthetic continuous data

setup

  • generated \({\bf X}_{con}\) \((1000\times16)\) as a sample from a finite mixture model with Gaussian components via the MixSim package 1

  • 4 classes, same size

  • low/high level of overlap

  • 25 replicates

  • distance methods: Mahalanobis, modified Mahalanobis, Gower (range normalized Manhattan)

  • evaluation: accuracy
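since all the distances above yield full pairwise matrices, the KNN step can run on a precomputed metric; a minimal sketch (scikit-learn is an assumed choice, not specified on the slides):

```python
# k-NN classification from a precomputed pairwise distance matrix D
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

def knn_accuracy(D, y, k=5, test_size=0.3, seed=0):
    idx = np.arange(len(y))
    tr, te = train_test_split(idx, test_size=test_size, random_state=seed, stratify=y)
    knn = KNeighborsClassifier(n_neighbors=k, metric="precomputed")
    knn.fit(D[np.ix_(tr, tr)], y[tr])                       # train-vs-train distances
    return accuracy_score(y[te], knn.predict(D[np.ix_(te, tr)]))  # test-vs-train distances
```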

KNN learning: synthetic continuous data

the categorical data case: the general (delta) framework1

Let \({\bf Z}=\left[{\bf Z}_{1},{\bf Z}_{2},\ldots,{\bf Z}_{Q_c}\right]\) be the one-hot encoding of \({\bf X}_{cat}\)

The pair-wise distances between categorical observations are given by

\[{\bf D}_{cat}={\bf Z}{\bf \Delta}{\bf Z}^{\sf T}= \left[\begin{array}{ccc} {\bf Z}_{1} & \dots & {\bf Z}_{Q_{c}} \end{array} \right]\left[\begin{array}{ccc} {\bf\Delta}_1 & & \\ & \ddots &\\ & & {\bf\Delta}_{Q_{c}} \end{array} \right] \left[ \begin{array}{c} {\bf Z}_{1}^{\sf T}\\ \vdots \\ {\bf Z}_{Q_{c}}^{\sf T} \end{array} \right]\]

  • the definition of \({\bf \Delta}\) determines the distance in use

  • if \(\Delta_{j}\)’s are diagonal, then \({\bf D}_{cat}\) is independence-based

  • if \(\Delta_{j}\)’s have non-zero off-diagonal terms, then \({\bf D}_{cat}\) is association-based
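a literal sketch of the product \({\bf D}_{cat}={\bf Z}{\bf \Delta}{\bf Z}^{\sf T}\), assuming the \(\Delta_{j}\)'s have already been computed (the next slides show how):

```python
# D_cat = Z Δ Z^T with block-diagonal Δ = diag(Δ_1, ..., Δ_{Q_c})
import numpy as np
import pandas as pd
from scipy.linalg import block_diag

def delta_distance(X_cat: pd.DataFrame, Deltas: list) -> np.ndarray:
    """X_cat: categorical data frame; Deltas: one q_j x q_j matrix per variable."""
    Z = np.hstack([pd.get_dummies(X_cat[c]).to_numpy(float) for c in X_cat.columns])
    return Z @ block_diag(*Deltas) @ Z.T
```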

the categorical data case: the general (delta) framework1

non-diagonal \(\Delta_{j}\)

Let \(a\) and \(b\) be two categories of the categorical variable \(j\); the corresponding \((a,b)^{th}\) entry of \(\Delta_{j}\) is

\[ \delta^{j}(a,b)=\sum_{i\neq j}^{Q_{c}}w_{ji}\Phi^{ji}(\xi^{ji}_{a},\xi^{ji}_{b}) \]

where \(\xi^{ji}_{a}\) and \(\xi^{ji}_{b}\) are defined from either

  • the joint (empirical) distributions of the categories of the variable \(i\) with \(a\) and \(b\), respectively

  • the conditional (empirical) distributions of the categories of the variable \(i\) given \(a\) and \(b\), respectively

joint distribution-based \(\Delta_{j}\)’s for association-based distances

the matrix of co-occurrence proportions is

\[ {\bf P} =\frac{1}{n} \begin{bmatrix} {\bf Z}_{1}^{\sf T}{\bf Z}_{1} & {\bf Z}_{1}^{\sf T}{\bf Z}_{2}&\ldots &{\bf Z}_{1}^{\sf T}{\bf Z}_{Q_{c}}\\ \vdots & \vdots &\ddots & \vdots \\ {\bf Z}_{Q_{c}}^{\sf T}{\bf Z}_{1} & {\bf Z}_{Q_{c}}^{\sf T}{\bf Z}_{2}&\ldots &{\bf Z}_{Q_{c}}^{\sf T}{\bf Z}_{Q_{c}} \end{bmatrix} \]

  • let \({\bf p}^{ji}_{a}\) and \({\bf p}^{ji}_{b}\) be the \(a^{th}\) and \(b^{th}\) rows of \({\bf P}^{ji}\), the \((j,i)^{th}\) off-diagonal block of \(\bf P\)

joint distribution-based \(\Delta_{j}\)’s for association-based distances1

setting \({\xi}^{ji}_{a}={\bf p}^{ji}_{a}\) and \({\xi}^{ji}_{b}={\bf p}^{ji}_{b}\), the general formula for the \((a,b)^{th}\) entry of \(\Delta_{j}\) is

\[ \delta^{j}(a,b)=\sum_{i\neq j}^{Q_{c}}w_{ji}\Phi^{ji}({\bf p}^{ji}_{a},{\bf p}^{ji}_{b}) \]

  • by defining \(\Phi^{ji}({\bf p}^{ji}_{a},{\bf p}^{ji}_{b})\) in terms of normalized entropy, the above becomes

\[ \delta^{j}(a,b)=\sum_{i\neq j}^{Q_{c}}w_{ji}\left[\frac{\sum_{\ell=1}^{q_{i}}({\bf p}^{ji}_{a\ell}+{\bf p}^{ji}_{b\ell})\log_{2}({\bf p}^{ji}_{a\ell}+{\bf p}^{ji}_{b\ell})}{\log_{2}(q_{i})} \right] \]

  • the weights \(w_{ji}\) are based on the mutual information between the variables \(j\) and \(i\)

\[ w_{ji}= \sum_{\upsilon=1}^{q_{j}}\sum_{\ell=1}^{q_{i}} {\bf p}^{ji}_{\upsilon \ell}\log_{2}\left(\frac{{\bf p}^{ji}_{\upsilon \ell}}{{\bf p}^{ji}_{\upsilon.}{\bf p}^{ji}_{.\ell}}\right) \]

where \({\bf p}^{ji}_{\upsilon.}\) and \({\bf p}^{ji}_{.\ell}\) indicate the \(\upsilon^{th}\) row margin and the \(\ell^{th}\) column margin of \({\bf P}^{ji}\), respectively
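a sketch of the entropy-based \(\delta^{j}(a,b)\) that follows the two formulas above literally, with \({\bf P}^{ji}\) obtained as a normalized cross-tabulation (a hedged reading of the notation):

```python
# entropy-based delta for a pair of categories (a, b) of categorical variable j
import numpy as np
import pandas as pd

def mi_weight(P_ji):
    """w_ji: mutual information (bits) from the joint proportions P^{ji}."""
    rows = P_ji.sum(axis=1, keepdims=True)
    cols = P_ji.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = P_ji * np.log2(P_ji / (rows @ cols))
    return np.nansum(terms)

def entropy_delta(X_cat, j, a, b):
    delta = 0.0
    for i in X_cat.columns:
        if i == j:
            continue
        P = pd.crosstab(X_cat[j], X_cat[i], normalize=True)   # P^{ji}: joint proportions
        p = P.loc[a].to_numpy() + P.loc[b].to_numpy()          # p^{ji}_a + p^{ji}_b
        with np.errstate(divide="ignore", invalid="ignore"):
            phi = np.nansum(p * np.log2(p)) / np.log2(P.shape[1])
        delta += mi_weight(P.to_numpy()) * phi
    return delta
```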

conditional distribution-based \(\Delta_{j}\)’s for association-based distances

 

\({\bf R} = {\bf P}_{d}^{-1}\left({\bf P}-{\bf P}_{d}\right)\), with \({\bf P}_{d}=diag({\bf P})\), is a block matrix such that

  • the general off-diagonal block is \({\bf R}_{ji}\) ( \(q_{j}\times q_{i}\) )

  • the \(a^{th}\) row of \({\bf R}_{ji}\), \({\bf r}^{ji}_{a}\), is the conditional distribution of the \(i^{th}\) variable, given the \(a^{th}\) category of the \(j^{th}\) variable

conditional distribution-based \(\Delta_{j}\)’s for association-based distances

setting \({\xi}^{ji}_{a}={\bf r}^{ji}_{a}\) and \({\xi}^{ji}_{b}={\bf r}^{ji}_{b}\), the general formula for the \((a,b)^{th}\) entry of \(\Delta_{j}\) is

\[ \delta^{j}(a,b)=\sum_{i\neq j}^{Q_{c}}w_{ji}\Phi^{ji}({\bf r}^{ji}_{a},{\bf r}^{ji}_{b}) \]

  • by defining \(\Phi^{ji}({\bf r}^{ji}_{a},{\bf r}^{ji}_{b})\) in terms of L1 distance the above becomes

\[\Phi^{ji}({\bf r}^{ji}_{a},{\bf r}^{ji}_{b})=\frac{1}{2}\sum_{\ell=1}^{q_{i}}|{\bf r}^{ji}_{a \ell}-{\bf r}^{ji}_{b \ell}|\] that corresponds to the total variation distance (TVD) 1

  • \(w_{ji}=1/(Q_{c}-1)\) for equal contribution
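a corresponding sketch for the TVD-based \(\delta^{j}(a,b)\), with the conditional profiles \({\bf r}^{ji}_{a}\) taken as rows of a row-normalized cross-tabulation:

```python
# TVD-based delta: compare conditional distributions of variable i given categories a, b of j
import numpy as np
import pandas as pd

def tvd_delta(X_cat, j, a, b):
    others = [i for i in X_cat.columns if i != j]
    w = 1.0 / len(others)                                       # w_ji = 1 / (Q_c - 1)
    delta = 0.0
    for i in others:
        R = pd.crosstab(X_cat[j], X_cat[i], normalize="index")  # rows are r^{ji}_a
        delta += w * 0.5 * np.abs(R.loc[a] - R.loc[b]).sum()
    return delta
```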

supervised AB-distance

  • the class labels are categories of a further variable \(y\) (the response)

  • a supervised variant of the AB-distance can be defined that takes into account the association between \(y\) and each of the other variables.

Let \({\bf Z}_{y}\) be the one-hot encoding of the response; then the matrix \({\bf R}\) becomes

\[ {\bf R}_{s} = {\bf P}_{z}^{-1}\left( {\bf Z}^{\sf T}{\bf Z}_{y} \right)= {\bf P}_{z}^{-1} \begin{bmatrix} {\bf Z}_{1}^{\sf T}{\bf Z}_{y}\\ \vdots \\ {\bf Z}_{Q_{c}}^{\sf T}{\bf Z}_{y} \end{bmatrix} \]

the \((a,b)^{th}\) general entry of \(\Delta^{j}_{s}\) is given by

\[ \delta_{s}^{j}(a,b)= w_{j}\left[\frac{1}{2}\sum_{\ell=1}^{q_{y}}|{\bf r}^{j}_{a \ell}-{\bf r}^{j}_{b \ell}|\right] \]
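a sketch of the supervised entry, under the reading that for variable \(j\) only the conditional distributions of the response given \(a\) and given \(b\) are compared (the weight \(w_{j}\) is left as a parameter):

```python
# supervised TVD-based delta: compare P(y | x_j = a) with P(y | x_j = b)
import numpy as np
import pandas as pd

def supervised_tvd_delta(x_j: pd.Series, y: pd.Series, a, b, w_j: float = 1.0):
    R = pd.crosstab(x_j, y, normalize="index")    # rows are r^j_a = P(y | x_j = a)
    return w_j * 0.5 * np.abs(R.loc[a] - R.loc[b]).sum()
```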

KNN learning: synthetic categorical data

setup

  • generated \({\bf X}_{cat}\) \((1000\times16)\), 8 of which are associated with the response 1

  • 4 classes, same size

  • low/high level of overlap (association to the response)

  • 25 replicates

  • distance methods: supervised TVD, Entropy-based, Gower (matching-based)

  • evaluation: accuracy

KNN learning: synthetic categorical data

association-based for mixed?

a straightforward way to generalise association-based distances to mixed data is to combine the categorical and continuous distance matrices

\[{\bf D}_{mix}=\alpha {\bf D}_{cat}+(1-\alpha){\bf D}_{con}\]

  • \({\bf D}_{cat}\) is one of the previously defined (TVD/entropy-based)

  • \({\bf D}_{con}\) is the Mahalanobis (or modified Mahalanobis) distance

  • However, no categorical/continuous interaction is taken into account

Aim: define \(\delta_{int}(a,b)\) so that it accounts for the categorical/continuous interactions

  • two alternative approaches are evaluated

How to define \(\delta_{int}(a,b)\): JS-based

Let \(a\) and \(b\) be two categories of the variable \(j\) and let \(X_{i}\) be continuous

\[ \delta_{int}^{j}(a,b)=\sum_{i=Q_{c}+1}^{Q}w_{ji}\Phi_{JS}^{ji}\left({f}_{a}(X_{i}),{f}_{b}(X_{i})\right) \] where \(f_{a}(X_{i})\) and \(f_{b}(X_{i})\) are the distributions of \(X_{i}\) conditional on \(a\) and \(b\), respectively

The two distributions are compared via the Kullback-Leibler divergence

\[ \Phi^{ji}_{KL}(f_{a}(X_{i}),f_{b}(X_{i}))=\int f_{a}(x)\log_{2} \frac{f_{a}(x)}{f_{b}(x)}dx \]

How to define \(\delta_{int}(a,b)\): JS-based

Since \(\Phi^{ji}_{KL}(f_{a}(X_{i}),f_{b}(X_{i}))\neq\Phi^{ji}_{KL}(f_{b}(X_{i}),f_{a}(X_{i}))\), it is rendered symmetric using the Jensen-Shannon distance

\[ \Phi^{ji}_{JS}(f_{a}(X_{i}),f_{b}(X_{i}))=\frac{1}{4}\sqrt{ \Phi^{ji}_{KL}\left(f_{a}(X_{i}), f_{ab}(X_{i})\right)+ \Phi^{ji}_{KL}\left( f_{b}(X_{i}),f_{ab}(X_{i})\right)} \] where \(f_{ab}(X_{i})=\left(f_{a}(X_{i})+f_{b}(X_{i})\right)/2\)

How to define \(\delta_{int}(a,b)\): JS-based

The \((a,b)^{th}\) entry of the \(\Delta^{int}_{j}\) is, therefore,

\[ \delta_{int}^{j}(a,b)=\sum_{i=Q_{c}+1}^{Q}w_{ji}\Phi_{JS}^{ji}\left({f}_{a}(X_{i}),{f}_{b}(X_{i})\right) \]

the weights \(w_{ji}\) are once again based on the mutual information between \(X_{i}\) (continuous) and \(X_{j}\) (categorical) variable 1

\[w_{ji} = \frac{1}{n} \sum_{\ell=1}^{n} \left(\psi(n) -\psi(n_{\ell}^j) + \psi(k) - \psi(m_{\ell}) \right)\]

  • \(x_{j\ell}\) is the category of \(X_j\) for the \(\ell^{th}\) observation, and \(n^{j}_{\ell}\) its frequency

  • \(d_\ell^i\) is the distance of observation \(\ell\) to its \(k^{th}\) nearest neighbor

  • \(m_\ell\) is the number of observations within distance \(d_\ell^i\) of the \(\ell^{th}\) observation
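a sketch of \(\Phi^{ji}_{JS}\), with the conditional densities approximated by histograms on a shared grid (the binning is an assumption made here, not part of the slides):

```python
# JS-based comparison of f_a(X_i) and f_b(X_i), estimated as histograms on common bin edges
import numpy as np

def kl_bits(p, q):
    """KL(p || q) in bits over histogram bins; 0*log(0) terms are dropped."""
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

def js_phi(x_i, x_j, a, b, bins=20):
    """x_i: continuous values; x_j: category labels of the same observations (arrays)."""
    edges = np.histogram_bin_edges(x_i, bins=bins)
    f_a, _ = np.histogram(x_i[x_j == a], bins=edges)
    f_b, _ = np.histogram(x_i[x_j == b], bins=edges)
    f_a, f_b = f_a / f_a.sum(), f_b / f_b.sum()
    f_ab = (f_a + f_b) / 2                          # mixture density
    return 0.25 * np.sqrt(kl_bits(f_a, f_ab) + kl_bits(f_b, f_ab))
```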

How to define \(\delta_{int}(a,b)\): NN-based

the categorical/continuous interaction is proportional to the discriminant power of the continuous variables for each category pair \((a,b)\) of the \(j^{th}\) variable, \(j=1,\ldots,Q_{c}\)

it is assessed via nearest neighbors (NN) averaging

if \(x_{\ell j}=a\), the proportion of nearest neighbors of the \(\ell^{th}\) observation labeled \(a\) is

\[ {\hat\pi}_{a\ell}=\frac{1}{n^{j}_{a}\pi_{nn}} \sum_{m\in \cal{N}^{a}_{\ell}}I(x_{jm}=a) \]

if \(x_{\ell j}=b\), the proportion of nearest neighbors of the \(\ell^{th}\) observation labeled \(b\) is

\[ {\hat\pi}_{b\ell}=\frac{1}{n^{j}_{b}\pi_{nn}} \sum_{m\in \cal{N}^{b}_{\ell}}I(x_{jm}=b) \]

  • \(n^{j}_{a}\) and \(n^{j}_{b}\) are absolute frequencies of categories \(a\) and \(b\)

  • \(\pi_{nn}\) is the user-defined proportion of nearest neighbors

  • \(\mathcal{N}^{a}_{\ell}\) (\(\mathcal{N}^{b}_{\ell}\)) is the set of nearest neighbors of the \(\ell^{th}\) observation when \(x_{\ell j}=a\) (\(x_{\ell j}=b\))

How to define \(\delta_{int}(a,b)\)?

We consider the improvement over chance obtained when the continuous variables are used to correctly classify the observations:


\[ \delta^{j}_{int}(a)=\left[\frac{1}{n_{a}^{j}}\sum_{\ell=1}^{n_{a}^{j}} I(\hat{\pi}_{a\ell}\geq .5)\right]-.5 \]


\[ \delta^{j}_{int}(b)=\left[\frac{1}{n_{b}^{j}}\sum_{\ell=1}^{n_{b}^{j}} I(\hat{\pi}_{b\ell}\geq .5)\right]-.5 \]

finally, the \((a,b)^{th}\) entry of \(\Delta^{int}_{j}\) is given by

\[ \delta^{j}_{int}(a,b) = \delta^{j}_{int}(a) + \delta^{j}_{int}(b). \]
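a sketch of the NN-based entries; the neighborhood size \(\mathrm{round}(\pi_{nn}\, n^{j}_{a})\) is one reading of the notation (an assumption), and the neighbor search in the continuous space is done with scikit-learn:

```python
# NN-based interaction: improvement over chance when the continuous variables
# are used to recover the categories of x_j (x_j as a numpy array of labels)
import numpy as np
from sklearn.neighbors import NearestNeighbors

def improvement_over_chance(X_con, x_j, cat, pi_nn=0.1):
    idx = np.where(x_j == cat)[0]                 # observations with x_j == cat
    k = max(1, round(pi_nn * len(idx)))           # assumed neighborhood size
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_con)
    _, neigh = nn.kneighbors(X_con[idx])          # first neighbor is the point itself
    pi_hat = (x_j[neigh[:, 1:]] == cat).mean(axis=1)
    return np.mean(pi_hat >= 0.5) - 0.5           # delta^j_int(cat)

def nn_delta_int(X_con, x_j, a, b, pi_nn=0.1):
    return improvement_over_chance(X_con, x_j, a, pi_nn) + \
           improvement_over_chance(X_con, x_j, b, pi_nn)
```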

continuous attributes in the TVD computation: NN-based

KNN learning: synthetic mixed data

setup

  • \({\bf X}=\left[{\bf X}_{cat},{\bf X}_{con}\right]\)

  • 4 classes, same size

  • low/high level of overlap (association to the response)

  • 25 replicates

  • distance methods:

    • association_based: Mahalanobis, supervised TVD, NN-based interaction

    • gudmm: modified Mahalanobis, entropy-based, JS-based

    • gower

  • evaluation: accuracy

KNN learning: synthetic mixed data

A touch of unsupervised learning: spectral clustering


SC on synthetic mixed data
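as an illustration of the precomputed-distance route, a hedged sketch that converts a mixed-data distance matrix into an affinity (the Gaussian conversion and bandwidth heuristic are choices made here, not taken from the slides) and runs scikit-learn's spectral clustering:

```python
# spectral clustering from a precomputed mixed-data distance matrix D_mix
import numpy as np
from sklearn.cluster import SpectralClustering

def spectral_from_distances(D_mix, n_clusters=4, gamma=None):
    if gamma is None:
        gamma = 1.0 / np.median(D_mix[D_mix > 0])          # heuristic bandwidth
    A = np.exp(-gamma * D_mix ** 2)                        # distances -> affinities
    sc = SpectralClustering(n_clusters=n_clusters, affinity="precomputed", random_state=0)
    return sc.fit_predict(A)
```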

Final considerations and future work

  • association-based measures aim to go beyond match/mismatch of categories
  • in supervised settings, AB distances allow the response to be taken into account in the pair-wise computations

    • non-lazy KNN
  • \(\delta^{j}_{int}(a,b)\) is computationally demanding (but it can be made bearable)
  • measuring cont/cat interactions via NN is suitable for non-convex/oddly shaped classes
  • convex combination of the cont/cat distances

    • \(\alpha\) as a hyperparameter
    • \(\alpha=0.5\) and feature engineering (to achieve commensurability)

main references

Kraskov, A., H. Stögbauer, and P. Grassberger (2004). “Estimating mutual information”. In: Physical review E 69.6, p. 066138.

Le, S. Q. and T. B. Ho (2005). “An association-based dissimilarity measure for categorical data”. In: Pattern Recognition Letters 26.16, pp. 2549-2557.

Melnykov, V., W. Chen, and R. Maitra (2012). “MixSim: An R package for simulating data to study performance of clustering algorithms”. In: Journal of Statistical Software 51, pp. 1-25.

Mousavi, E. and M. Sehhati (2023). “A Generalized Multi-Aspect Distance Metric for Mixed-Type Data Clustering”. In: Pattern Recognition, p. 109353.

Ross, B. C. (2014). “Mutual information between discrete and continuous data sets”. In: PloS one 9.2, p. e87357.

van de Velden, M., A. Iodice D’Enza, A. Markos, et al. (2023). “A general framework for implementing distances for categorical variables”. In: submitted to Pattern Recognition, pp. 1-21.

van de Velden, M., A. Iodice D’Enza, and F. Palumbo (2017). “Cluster correspondence analysis”. In: Psychometrika 82.1, pp. 158-185.