outline
independence- vs association-based (AB) distances
AB distances for categorical data
AB distances for mixed data
perspectives
independence- vs association-based distances
a mixed data set
\(n\) observations described by \(Q_{c}\) categorical and \(Q_{d}\) continuous variables, with \(Q=Q_{c}+Q_{d}\)
the \(n\times Q\) data matrix \({\bf X}=\left[{\bf X}_{cat},{\bf X}_{con}\right]\) is column-wise partitioned
independence-based pairwise distance
by-variable distances are computed, and then added together:
in the continuous case: Euclidean or Manhattan distances
in the categorical case: Hamming (matching) distance (among MANY others)
in the mixed data case, combine dissimilarity measures for each type of variable, as in the Gower index
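As a concrete illustration, a minimal sketch (not a reference implementation) of such an independence-based mixed-data distance in the spirit of the Gower index: range-normalized Manhattan for the continuous columns, simple matching for the categorical ones, each variable contributing additively; all names are illustrative.

```python
import numpy as np

def gower_like(X_con, X_cat):
    """X_con: (n, Q_d) continuous array; X_cat: (n, Q_c) array of category labels."""
    n = X_con.shape[0]
    rng = X_con.max(axis=0) - X_con.min(axis=0)
    rng[rng == 0] = 1.0                                   # guard against constant columns
    D = np.zeros((n, n))
    for j in range(X_con.shape[1]):                       # range-normalized Manhattan
        D += np.abs(X_con[:, None, j] - X_con[None, :, j]) / rng[j]
    for j in range(X_cat.shape[1]):                       # simple matching
        D += (X_cat[:, None, j] != X_cat[None, :, j]).astype(float)
    return D / (X_con.shape[1] + X_cat.shape[1])          # average over all variables
```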
association-based distances: continuous data
The rationale of association-based distances is that not all the observed differences weigh the same:
the continuous data case
\[ d(\mathbf{x}_i, \mathbf{x}_{i'}) = \sqrt{(\mathbf{x}_i - \mathbf{x}_{i'})^{\sf T} \mathbf{S}^{-1} (\mathbf{x}_i - \mathbf{x}_{i'})} \label{MahalanobisDist} \]
where \(\mathbf{S}\) is the sample covariance matrix
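For concreteness, a short sketch of the pairwise Mahalanobis distances using scipy; the toy data matrix is made up for the example.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X_con = rng.normal(size=(100, 4))                      # toy continuous data
S_inv = np.linalg.inv(np.cov(X_con, rowvar=False))     # inverse sample covariance
D_con = squareform(pdist(X_con, metric="mahalanobis", VI=S_inv))
```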
association-based distances: continuous data
modified Mahalanobis distance (Mousavi and Sehhati, 2023)
\[ d(\mathbf{x}_i, \mathbf{x}_{i'}) = \sqrt{(\mathbf{x}_i - \mathbf{x}_{i'})^{\sf T} \mathbf{S}_{w}^{-1} (\mathbf{x}_i - \mathbf{x}_{i'})} \]
where
\(\mathbf{S}^{-1}_{w}=\mathbf{S}^{-1}\odot \mathbf{W}\) (Hadamard product)
\(\mathbf{W}\) is the matrix of pairwise mutual information between variables, with general element
\[ w_{jj'} = \frac{1}{\ln 2} \left( \psi(n) + \psi(k) - \frac{1}{n} \sum_{\ell=1}^{n} \left[ \psi(n_{\ell}^j + 1) + \psi(n_{\ell}^{j'} + 1) \right] \right) \] where
\(\psi\) denotes the digamma function (Kraskov et al., 2004)
\(k\) is the user-defined number of neighbors
consider, as a threshold, the (max-norm) distance of the \(\ell^{th}\) observation to its \(k^{th}\) nearest neighbor, computed with respect to the \(j\) and \(j'\) variables
\(n_{\ell}^{j}\) and \(n_{\ell}^{j'}\) are the numbers of observations falling within that threshold along \(j\) and \(j'\), respectively
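A sketch of the resulting modified Mahalanobis distance. The slides use the Kraskov et al. (2004) estimator for the mutual-information weights; here sklearn's related k-NN estimator is used as a stand-in, and setting the diagonal of \(\mathbf{W}\) to one is a simplifying choice of this sketch, not part of the original proposal.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def modified_mahalanobis(X, k=5):
    """Pairwise distances with S_w^{-1} = S^{-1} (Hadamard product) W, W = pairwise MI."""
    n, q = X.shape
    W = np.zeros((q, q))
    for j in range(q):
        # k-NN mutual information (in nats) of variable j with every variable, in bits
        W[j] = mutual_info_regression(X, X[:, j], n_neighbors=k) / np.log(2)
    W = (W + W.T) / 2                                  # enforce symmetry
    np.fill_diagonal(W, 1.0)                           # simplifying choice for the diagonal
    S_inv_w = np.linalg.inv(np.cov(X, rowvar=False)) * W
    diff = X[:, None, :] - X[None, :, :]
    d2 = np.einsum("abi,ij,abj->ab", diff, S_inv_w, diff)
    return np.sqrt(np.clip(d2, 0.0, None))             # clip guards against non-PSD S_w^{-1}
```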
KNN learning: synthetic continuous data
setup
generated \({\bf X}_{con}\) \((1000\times16)\) as a sample from a finite mixture model with Gaussian components via the MixSim package (Melnykov et al., 2012)
4 classes, same size
low/high level of overlap
25 replicates
distance methods: Mahalanobis, modified Mahalanobis, Gower (range normalized Manhattan)
evaluation: accuracy
KNN learning: synthetic continuous data
the categorical data case: the general (delta) framework (van de Velden et al., 2023)
Let \({\bf Z}=\left[{\bf Z}_{1},{\bf Z}_{2},\ldots,{\bf Z}_{Q_c}\right]\) be the one-hot encoding of \({\bf X}_{cat}\)
The pair-wise distances between categorical observations are given by
\[{\bf D}_{cat}={\bf Z}{\bf \Delta}{\bf Z}^{\sf T}= \left[\begin{array}{ccc} {\bf Z}_{1} & \dots & {\bf Z}_{Q_{c}} \end{array} \right]\left[\begin{array}{ccc} {\bf\Delta}_1 & & \\ & \ddots &\\ & & {\bf\Delta}_{Q_{c}} \end{array} \right] \left[ \begin{array}{c} {\bf Z}_{1}^{\sf T}\\ \vdots \\ {\bf Z}_{Q_{c}}^{\sf T} \end{array} \right]\]
the definition of \({\bf \Delta}\) determines the distance in use
if \(\Delta_{j}\)’s are diagonal, then \({\bf D}_{cat}\) is independence-based
if \(\Delta_{j}\)’s have non-zero off-diagonal terms, then \({\bf D}_{cat}\) is association-based
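A compact sketch of the framework under this notation: the one-hot blocks are stacked into \({\bf Z}\), the \(\Delta_{j}\)'s are arranged block-diagonally, and \({\bf D}_{cat}\) follows from one matrix product. The \(\Delta_{j}\) blocks below are placeholders; in practice they would be filled by one of the specifications discussed next.

```python
import numpy as np
from scipy.linalg import block_diag

def one_hot_blocks(X_cat):
    """X_cat: (n, Q_c) integer-coded categories -> list of one-hot blocks Z_j."""
    return [np.eye(X_cat[:, j].max() + 1)[X_cat[:, j]] for j in range(X_cat.shape[1])]

def delta_framework_distances(Z_blocks, deltas):
    """D_cat = Z Delta Z' with Delta the block-diagonal of the per-variable Delta_j."""
    Z = np.hstack(Z_blocks)
    return Z @ block_diag(*deltas) @ Z.T

# toy usage: two categorical variables, placeholder Delta_j blocks
X_cat = np.array([[0, 1], [0, 0], [1, 2]])
Z_blocks = one_hot_blocks(X_cat)
deltas = [np.array([[0.0, 1.0], [1.0, 0.0]]),
          np.array([[0.0, 0.5, 1.0], [0.5, 0.0, 0.7], [1.0, 0.7, 0.0]])]
D_cat = delta_framework_distances(Z_blocks, deltas)
```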
the categorical data case: the general (delta) framework
non-diagonal \(\Delta_{j}\)
Let \(a\) and \(b\) be two categories of the categorical variable \(j\); the corresponding \((a,b)^{th}\) entry of \(\Delta_{j}\) is
\[ \delta^{j}(a,b)=\sum_{i\neq j}^{Q_{c}}w_{ji}\Phi^{ji}(\xi^{ji}_{a},\xi^{ji}_{b}) \]
where \(\xi^{ji}_{a}\) and \(\xi^{ji}_{b}\) can be defined from either
the joint (empirical) distributions of the categories of the variable \(i\) with \(a\) and \(b\), respectively
the conditional (empirical) distributions of the categories of the variable \(i\) given \(a\) and \(b\), respectively
joint distribution-based \(\Delta_{j}\)’s for association-based distances
the matrix of co-occurrence proportions is
\[ {\bf P} =\frac{1}{n} \begin{bmatrix} {\bf Z}_{1}^{\sf T}{\bf Z}_{1} & {\bf Z}_{1}^{\sf T}{\bf Z}_{2}&\ldots &{\bf Z}_{1}^{\sf T}{\bf Z}_{Q_{c}}\\ \vdots & \ddots & &\vdots \\ {\bf Z}_{Q_{c}}^{\sf T}{\bf Z}_{1} & {\bf Z}_{Q_{c}}^{\sf T}{\bf Z}_{2}&\ldots &{\bf Z}_{Q_{c}}^{\sf T}{\bf Z}_{Q_{c}}\\ \end{bmatrix} \]
joint distribution-based \(\Delta_{j}\)’s for association-based distances
setting \({\xi}^{ji}_{a}={\bf p}^{ji}_{a}\) and \({\xi}^{ji}_{b}={\bf p}^{ji}_{b}\), where \({\bf p}^{ji}_{a}\) and \({\bf p}^{ji}_{b}\) are the \(a^{th}\) and \(b^{th}\) rows of the block \({\bf P}^{ji}=\frac{1}{n}{\bf Z}_{j}^{\sf T}{\bf Z}_{i}\) of \({\bf P}\), the general formula for the \((a,b)^{th}\) entry of \(\Delta_{j}\) becomes
\[ \delta^{j}(a,b)=\sum_{i\neq j}^{Q_{c}}w_{ji}\Phi^{ji}({\bf p}^{ji}_{a},{\bf p}^{ji}_{b}) \]
e.g., with the entropy-based specification of \(\Phi^{ji}\) (Le and Ho, 2005)
\[ \delta^{j}(a,b)=\sum_{i\neq j}^{Q_{c}}w_{ji}\left[\frac{\sum_{\ell=1}^{q_{i}}\left({\bf p}^{ji}_{a\ell}+{\bf p}^{ji}_{b\ell}\right)\log_{2}\left({\bf p}^{ji}_{a\ell}+{\bf p}^{ji}_{b\ell}\right)}{\log_{2}(q_{i})} \right] \]
with weights given by the mutual information between variables \(j\) and \(i\)
\[ w_{ji}= \sum_{\upsilon=1}^{q_{j}}\sum_{\ell=1}^{q_{i}} {\bf p}^{ji}_{\upsilon \ell}\log_{2}\left(\frac{{\bf p}^{ji}_{\upsilon \ell}}{{\bf p}^{ji}_{\upsilon.}{\bf p}^{ji}_{.\ell}}\right) \]
where \({\bf p}^{ji}_{\upsilon.}\) and \({\bf p}^{ji}_{.\ell}\) indicate the \(\upsilon^{th}\) row margin and the \(\ell^{th}\) column margin of \({\bf P}^{ji}\), respectively
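A sketch of a single entry \(\delta^{j}(a,b)\) under this joint-distribution specification, following the expressions above directly (co-occurrence block, mutual-information weight, entropy-based \(\Phi\)); Z_blocks is the list of one-hot blocks from the earlier sketch, and zero cells follow the usual \(0\log 0=0\) convention.

```python
import numpy as np

def mi_bits(P_ji):
    """Mutual information (in bits) of a block of co-occurrence proportions."""
    pr = P_ji.sum(axis=1, keepdims=True)
    pc = P_ji.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = P_ji * np.log2(P_ji / (pr * pc))
    return np.nansum(terms)                            # 0 log 0 treated as 0

def phi_entropy(p_a, p_b, q_i):
    """Entropy-based comparison of two rows of P^{ji}, as in the formula above."""
    s = p_a + p_b
    s = s[s > 0]
    return np.sum(s * np.log2(s)) / np.log2(q_i)

def delta_entropy(Z_blocks, j, a, b):
    n = Z_blocks[0].shape[0]
    out = 0.0
    for i, Z_i in enumerate(Z_blocks):
        if i == j:
            continue
        P_ji = Z_blocks[j].T @ Z_i / n                 # (q_j x q_i) co-occurrence block
        out += mi_bits(P_ji) * phi_entropy(P_ji[a], P_ji[b], Z_i.shape[1])
    return out
```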
conditional distribution-based \(\Delta_{j}\)’s for association-based distances
\({\bf R} = {\bf P}_{d}^{-1}\left({\bf P}-{\bf P}_{d}\right)\), with \({\bf P}_{d}=diag({\bf P})\), is a block matrix such that
the general off-diagonal block is \({\bf R}_{ji}\) ( \(q_{j}\times q_{i}\) )
the \(a^{th}\) row of \({\bf R}_{ji}\), \({\bf r}^{ji}_{a}\), is the conditional distribution of the \(i^{th}\) variable, given the \(a^{th}\) category of the \(j^{th}\) variable
conditional distribution-based \(\Delta_{j}\)’s for association-based distances
setting \({\xi}^{ji}_{a}={\bf r}^{ji}_{a}\) and \({\xi}^{ji}_{b}={\bf r}^{ji}_{b}\), the general formula for the \((a,b)^{th}\) entry of \(\Delta_{j}\) becomes
\[ \delta^{j}(a,b)=\sum_{i\neq j}^{Q_{c}}w_{ji}\Phi^{ji}({\bf r}^{ji}_{a},{\bf r}^{ji}_{b}) \]
\[\Phi^{ji}({\bf r}^{ji}_{a},{\bf r}^{ji}_{b})=\frac{1}{2}\sum_{\ell=1}^{q_{i}}|{\bf r}^{ji}_{a \ell}-{\bf r}^{ji}_{b \ell}|\] which corresponds to the total variation distance (TVD)
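Analogously, a sketch of the conditional-distribution (TVD) specification: the rows of \({\bf R}_{ji}\) are the distributions of variable \(i\) conditional on the categories of variable \(j\), compared through the total variation distance; mi_bits and Z_blocks are as in the previous sketches.

```python
import numpy as np

def delta_tvd(Z_blocks, j, a, b):
    n = Z_blocks[0].shape[0]
    out = 0.0
    for i, Z_i in enumerate(Z_blocks):
        if i == j:
            continue
        C = Z_blocks[j].T @ Z_i                        # (q_j x q_i) contingency table
        R_ji = C / C.sum(axis=1, keepdims=True)        # rows: conditional distributions
        tvd = 0.5 * np.abs(R_ji[a] - R_ji[b]).sum()    # total variation distance
        out += mi_bits(C / n) * tvd                    # weight by mutual information
    return out
```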
supervised AB-distance
the class labels are categories of a further variable \(y\) (the response)
a supervised variant of the AB-distance can be defined that takes into account the association between \(y\) and each of the other variables.
let \({\bf Z}_{y}\) be the one-hot encoding of the response and \({\bf P}_{z}={\rm diag}({\bf Z}^{\sf T}{\bf Z})\) the diagonal matrix of category counts; then the matrix \({\bf R}\) becomes
\[ {\bf R}_{s} = {\bf P}_{z}^{-1}\left( {\bf Z}^{\sf T}{\bf Z}_{y} \right)= {\bf P}_{z}^{-1} \begin{bmatrix} {\bf Z}_{1}^{\sf T}{\bf Z}_{y}\\ \vdots \\ {\bf Z}_{Q_{c}}^{\sf T}{\bf Z}_{y} \end{bmatrix} \]
the \((a,b)^{th}\) general entry of \(\Delta^{j}_{s}\) is given by
\[ \delta_{s}^{j}(a,b)= w_{j}\left[\frac{1}{2}\sum_{\ell=1}^{q_{y}}|{\bf r}^{j}_{a \ell}-{\bf r}^{j}_{b \ell}|\right] \]
where \(w_{j}\) weights the association between the \(j^{th}\) variable and the response
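A sketch of the supervised entry \(\delta^{j}_{s}(a,b)\): the rows compared through the TVD are now the distributions of the response conditional on the categories of variable \(j\). Using the mutual information between \(X_{j}\) and \(y\) as the weight \(w_{j}\) is an assumption of this sketch, since the slides do not spell \(w_{j}\) out.

```python
import numpy as np

def delta_supervised(x_j, y, a, b):
    """x_j, y: length-n integer-coded arrays (x_j: categorical predictor, y: response)."""
    n = len(y)
    C = np.zeros((x_j.max() + 1, y.max() + 1))
    np.add.at(C, (x_j, y), 1)                          # contingency table of x_j vs y
    R = C / C.sum(axis=1, keepdims=True)               # rows: distributions of y given x_j
    P = C / n
    pr = P.sum(axis=1, keepdims=True)
    pc = P.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        w_j = np.nansum(P * np.log2(P / (pr * pc)))    # assumed weight: MI(x_j, y) in bits
    return w_j * 0.5 * np.abs(R[a] - R[b]).sum()       # weighted TVD of the two rows
```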
KNN learning: synthetic categorical data
setup
generated \({\bf X}_{cat}\) \((1000\times16)\), 8 of which are associated with the response
4 classes, same size
low/high level of overlap (association to the response)
25 replicates
distance methods: supervised TVD, Entropy-based, Gower (matching-based)
evaluation: accuracy
KNN learning: synthetic categorical data
association-based distances for mixed data?
a straightforward way to generalise association-based distances to mixed data is to combine the categorical and continuous distances
\[{\bf D}_{mix}=\alpha {\bf D}_{cat}+(1-\alpha){\bf D}_{con}\]
\({\bf D}_{cat}\) is one of the previously defined (TVD/entropy-based)
\({\bf D}_{con}\) is the Mahalanobis (or modified Mahalanobis) distance
However, no categorical/continuous interaction is taken into account
Aim: define \(\delta_{int}(a,b)\) so that it accounts for the categorical/continuous interactions
How to define \(\delta_{int}(a,b)\): JS-based
Let \(a\) and \(b\) be two categories of the variable \(j\) and let \(X_{i}\) be continuous
\[ \delta_{int}^{j}(a,b)=\sum_{i=Q_{c}+1}^{Q}w_{ji}\Phi_{JS}^{ji}\left({f}_{a}(X_{i}),{f}_{b}(X_{i})\right) \] where \(f_{a}(X_{i})\) and \(f_{b}(X_{i})\) are the distributions of \(X_{i}\) conditional on \(a\) and \(b\), respectively
The two distributions are compared via the Kullback-Leibler divergence
\[ \Phi^{ji}_{KL}(f_{a}(X_{i}),f_{b}(X_{i}))=\int f_{a}(x)\log_{2} \frac{f_{a}(x)}{f_{b}(x)}\,dx \]
How to define \(\delta_{int}(a,b)\): JS-based
Since \(\Phi^{ji}_{KL}(f_{a}(X_{i}),f_{b}(X_{i}))\neq\Phi^{ji}_{KL}(f_{b}(X_{i}),f_{a}(X_{i}))\), it is rendered symmetric using the Jensen–Shannon distance
\[ \Phi^{ji}_{JS}(f_{a}(X_{i}),f_{b}(X_{i}))=\frac{1}{4}\sqrt{ \Phi^{ji}_{KL}\left(f_{a}(X_{i}), f_{ab}(X_{i})\right)+ \Phi^{ji}_{KL}\left( f_{b}(X_{i}),f_{ab}(X_{i})\right)} \] where \(f_{ab}(X_{i})=\left(f_{a}(X_{i})+f_{b}(X_{i})\right)/2\)
How to define \(\delta_{int}(a,b)\): JS-based
The \((a,b)^{th}\) entry of the \(\Delta^{int}_{j}\) is, therefore,
\[ \delta_{int}^{j}(a,b)=\sum_{i=Q_{c}+1}^{Q}w_{ji}\Phi_{JS}^{ji}\left({f}_{a}(X_{i}),{f}_{b}(X_{i})\right) \]
the weights \(w_{ji}\) are once again based on the mutual information between the continuous variable \(X_{i}\) and the categorical variable \(X_{j}\) (Ross, 2014)
\[w_{ji} = \frac{1}{n} \sum_{\ell=1}^{n} \left(\psi(n) -\psi(n_{\ell}^j) + \psi(k) - \psi(m_{\ell}) \right)\]
\(x_{j\ell}\) is the category of \(X_j\) taken by the \(\ell^{th}\) observation, and \(n^{j}_{\ell}\) its frequency
\(d_\ell^i\) is the distance, computed on \(X_{i}\), of the \(\ell^{th}\) observation to its \(k^{th}\) nearest neighbor among the observations sharing the category \(x_{j\ell}\)
\(m_\ell\) is the number of observations within a \(d_\ell^i\) distance from the \(\ell^{th}\) observation
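A sketch of the JS-based interaction restricted to a single continuous variable \(X_{i}\) (the full entry sums this over all continuous columns): the conditional densities are estimated by a Gaussian KDE and compared through the symmetrised expression above, while sklearn's k-NN estimator of the mutual information between a continuous and a discrete variable (in the spirit of Ross, 2014) stands in for the weight \(w_{ji}\). The grid size and the KDE choice are assumptions of this sketch.

```python
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.feature_selection import mutual_info_classif

def kl_bits(f, g, grid):
    """Numerical KL divergence (in bits) between two densities evaluated on a grid."""
    eps = 1e-12
    return np.sum(f * np.log2((f + eps) / (g + eps))) * (grid[1] - grid[0])

def delta_js(x_i, x_j, a, b, k=5):
    """x_i: continuous column; x_j: integer-coded categorical column."""
    grid = np.linspace(x_i.min(), x_i.max(), 512)
    f_a = gaussian_kde(x_i[x_j == a])(grid)            # density of x_i given category a
    f_b = gaussian_kde(x_i[x_j == b])(grid)            # density of x_i given category b
    f_ab = (f_a + f_b) / 2
    phi_js = 0.25 * np.sqrt(kl_bits(f_a, f_ab, grid) + kl_bits(f_b, f_ab, grid))
    # k-NN mutual information between continuous x_i and categorical x_j (nats -> bits)
    w_ji = mutual_info_classif(x_i.reshape(-1, 1), x_j, n_neighbors=k)[0] / np.log(2)
    return w_ji * phi_js
```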
How to define \(\delta_{int}(a,b)\): NN-based
the categorical/continuous interaction is proportional to the discriminant power of the continuous variables for each category pair \((a,b)\) of the \(j^{th}\) categorical variable, \(j=1,\ldots,Q_{c}\)
it is assessed via nearest neighbors (NN) averaging
if \(x_{\ell j}=a\), the proportion of nearest neighbors of the \(\ell^{th}\) observation that are labeled \(a\) is
\[ {\hat\pi}_{a\ell}=\frac{1}{n^{j}_{a}\pi_{nn}} \sum_{m\in \cal{N}^{a}_{\ell}}I(x_{jm}=a) \]
if \(x_{\ell j}=b\), the proportion of nearest neighbors of the \(\ell^{th}\) observation that are labeled \(b\) is
\[ {\hat\pi}_{b\ell}=\frac{1}{n^{j}_{b}\pi_{nn}} \sum_{m\in \cal{N}^{b}_{\ell}}I(x_{jm}=b) \]
\(n^{j}_{a}\) and \(n^{j}_{b}\) are absolute frequencies of categories \(a\) and \(b\)
\(\pi_{nn}\) is the user-defined proportion of nearest neighbors
\(\mathcal{N}^{a}_{\ell}\) (\(\mathcal{N}^{b}_{\ell}\)) is the set of nearest neighbors of the \(\ell^{th}\) observation when \(x_{\ell j}=a\) (\(x_{\ell j}=b\))
How to define \(\delta_{int}(a,b)\): NN-based
We consider the improvement over chance that is obtained when using the continuous variables to correctly classify the observations:
for category \(a\)
\[ \delta^{j}_{int}(a)=\left[\frac{1}{n_{a}^{j}}\sum_{\ell=1}^{n_{a}^{j}} I(\hat{\pi}_{a\ell}\geq .5)\right]-.5 \]
for category \(b\)
\[ \delta^{j}_{int}(b)=\left[\frac{1}{n_{b}^{j}}\sum_{\ell=1}^{n_{b}^{j}} I(\hat{\pi}_{b\ell}\geq .5)\right]-.5 \]
finally, the \((a,b)^{th}\) entry of \(\Delta^{int}_{j}\) is given by
\[ \delta^{j}_{int}(a,b) = \delta^{j}_{int}(a) + \delta^{j}_{int}(b). \]
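A sketch of the NN-based entry: neighbors are searched in the space of the continuous variables, and reading \(\pi_{nn}\) as the fraction of same-category observations that defines the neighborhood size is one interpretation adopted by this sketch.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def delta_int_nn(X_con, x_j, a, b, pi_nn=0.1):
    """X_con: (n, Q_d) continuous data; x_j: integer-coded categorical column."""
    def improvement(cat):
        idx = np.where(x_j == cat)[0]
        k = max(1, int(round(pi_nn * idx.size)))       # neighborhood size for this category
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X_con)
        # drop the first neighbor of each query point (the point itself)
        neigh = nn.kneighbors(X_con[idx], return_distance=False)[:, 1:]
        pi_hat = (x_j[neigh] == cat).mean(axis=1)      # share of same-category neighbors
        return np.mean(pi_hat >= 0.5) - 0.5            # improvement over chance
    return improvement(a) + improvement(b)
```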
continuous attributes in the TVD computation: NN-based
KNN learning: synthetic mixed data
setup
\({\bf X}=\left[{\bf X}_{cat},{\bf X}_{con}\right]\)
4 classes, same size
low/high level of overlap (association to the response)
25 replicates
distance methods:
association-based: Mahalanobis, supervised TVD, NN-based interaction
GUDMM: modified Mahalanobis, entropy-based, JS-based
Gower
evaluation: accuracy
KNN learning: synthetic mixed data
SC on synthetic mixed data
Final considerations and future work
in supervised settings, AB distances make it possible to take the response into account in the pairwise distance computations
convex combination of the continuous/categorical distances
main references
Kraskov, A., H. Stögbauer, and P. Grassberger (2004). “Estimating mutual information”. In: Physical review E 69.6, p. 066138.
Le, S. Q. and T. B. Ho (2005). “An association-based dissimilarity measure for categorical data”. In: Pattern Recognition Letters 26.16, pp. 2549-2557.
Melnykov, V., W. Chen, and R. Maitra (2012). “MixSim: An R package for simulating data to study performance of clustering algorithms”. In: Journal of Statistical Software 51, pp. 1-25.
Mousavi, E. and M. Sehhati (2023). “A Generalized Multi-Aspect Distance Metric for Mixed-Type Data Clustering”. In: Pattern Recognition, p. 109353.
Ross, B. C. (2014). “Mutual information between discrete and continuous data sets”. In: PloS one 9.2, p. e87357.
van de Velden, M., A. Iodice D’Enza, A. Markos, et al. (2023). “A general framework for implementing distances for categorical variables”. In: submitted to Pattern Recognition, pp. 1-21.
van de Velden, M., A. Iodice D’Enza, and F. Palumbo (2017). “Cluster correspondence analysis”. In: Psychometrika 82.1, pp. 158-185.