outline
independence- vs association-based (AB) distances
AB distances for categorical data
AB distances for mixed data
perspectives
independence- vs association-based distances
a mixed data set
\(n\) observations described by \(Q_{c}\) categorical and \(Q_{d}\) continuous variables, with \(Q=Q_{c}+Q_{d}\)
the \(n\times Q\) data matrix \({\bf X}=\left[{\bf X}_{cat},{\bf X}_{con}\right]\) is column-wise partitioned
independence-based pairwise distance
by-variable distances are computed, and then added together:
in the continuous case: Euclidean or Manhattan distances
in the categorical case: Hamming (matching) distance (among MANY others)
in the mixed data case, combine dissimilarity measures for each type of variable, as in the Gower index
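As a concrete illustration, a minimal sketch (not a reference implementation) of such an independence-based mixed-data distance in the spirit of the Gower index: range-normalized Manhattan for the continuous columns, simple matching for the categorical ones, each variable contributing additively; all names are illustrative.

```python
import numpy as np

def gower_like(X_con, X_cat):
    """X_con: (n, Q_d) continuous array; X_cat: (n, Q_c) array of category labels."""
    n = X_con.shape[0]
    rng = X_con.max(axis=0) - X_con.min(axis=0)
    rng[rng == 0] = 1.0                                   # guard against constant columns
    D = np.zeros((n, n))
    for j in range(X_con.shape[1]):                       # range-normalized Manhattan
        D += np.abs(X_con[:, None, j] - X_con[None, :, j]) / rng[j]
    for j in range(X_cat.shape[1]):                       # simple matching
        D += (X_cat[:, None, j] != X_cat[None, :, j]).astype(float)
    return D / (X_con.shape[1] + X_cat.shape[1])          # average over all variables
```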
association-based distances: continuous data
The rationale of association-based distances is that not all the observed differences weigh the same:
the continuous data case
\[ d(\mathbf{x}_i, \mathbf{x}_{i'}) = \sqrt{(\mathbf{x}_i - \mathbf{x}_{i'})^{\sf T} \mathbf{S}^{-1} (\mathbf{x}_i - \mathbf{x}_{i'})} \label{MahalanobisDist} \]
where \(\mathbf{S}\) is the sample covariance matrix
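For concreteness, a short sketch of the pairwise Mahalanobis distances using scipy; the toy data matrix is made up for the example.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X_con = rng.normal(size=(100, 4))                      # toy continuous data
S_inv = np.linalg.inv(np.cov(X_con, rowvar=False))     # inverse sample covariance
D_con = squareform(pdist(X_con, metric="mahalanobis", VI=S_inv))
```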
association-based distances: continuous data
modified Mahalanobis distance (Mousavi and Sehhati, 2023)
\[ d(\mathbf{x}_i, \mathbf{x}_{i'}) = \sqrt{(\mathbf{x}_i - \mathbf{x}_{i'})^{\sf T} \mathbf{S}_{w}^{-1} (\mathbf{x}_i - \mathbf{x}_{i'})} \]
where
\(\mathbf{S}^{-1}_{w}=\mathbf{S}^{-1}\odot \mathbf{W}\) (Hadamard product)
\(\mathbf{W}\) is the matrix of pairwise mutual information between variables, with general element
\[ w_{jj'} = \frac{1}{\ln 2} \left( \psi(n) + \psi(k) - \frac{1}{n} \sum_{\ell=1}^{n} \left[ \psi(n_{\ell}^j + 1) + \psi(n_{\ell}^{j'} + 1) \right] \right) \] where
\(\psi\) denotes the digamma function (Kraskov et al., 2004)
\(k\) is the user-defined number of neighbors
consider, as a threshold, the (max-norm) distance of the \(\ell^{th}\) observation to its \(k^{th}\) nearest neighbor, computed with respect to the \(j\) and \(j'\) variables
\(n_{\ell}^{j}\) and \(n_{\ell}^{j'}\) are the numbers of observations falling within that threshold along \(j\) and \(j'\), respectively
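A sketch of the resulting modified Mahalanobis distance. The slides use the Kraskov et al. (2004) estimator for the mutual-information weights; here sklearn's related k-NN estimator is used as a stand-in, and setting the diagonal of \(\mathbf{W}\) to one is a simplifying choice of this sketch, not part of the original proposal.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def modified_mahalanobis(X, k=5):
    """Pairwise distances with S_w^{-1} = S^{-1} (Hadamard product) W, W = pairwise MI."""
    n, q = X.shape
    W = np.zeros((q, q))
    for j in range(q):
        # k-NN mutual information (in nats) of variable j with every variable, in bits
        W[j] = mutual_info_regression(X, X[:, j], n_neighbors=k) / np.log(2)
    W = (W + W.T) / 2                                  # enforce symmetry
    np.fill_diagonal(W, 1.0)                           # simplifying choice for the diagonal
    S_inv_w = np.linalg.inv(np.cov(X, rowvar=False)) * W
    diff = X[:, None, :] - X[None, :, :]
    d2 = np.einsum("abi,ij,abj->ab", diff, S_inv_w, diff)
    return np.sqrt(np.clip(d2, 0.0, None))             # clip guards against non-PSD S_w^{-1}
```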
KNN learning: synthetic continuous data
setup
generated \({\bf X}_{con}\) \((1000\times16)\) as a sample from a finite mixture model with Gaussian components via the MixSim package (Melnykov et al., 2012)
4 classes, same size
low/high level of overlap
25 replicates
distance methods: Mahalanobis, modified Mahalanobis, Gower (range normalized Manhattan)
evaluation: accuracy
KNN learning: synthetic continuous data
the categorical data case: the general (delta) framework (van de Velden et al., 2023)
Let \({\bf Z}=\left[{\bf Z}_{1},{\bf Z}_{2},\ldots,{\bf Z}_{Q_c}\right]\) be the one-hot encoding of \({\bf X}_{cat}\)
The pair-wise distances between categorical observations are given by
\[{\bf D}_{cat}={\bf Z}{\bf \Delta}{\bf Z}^{\sf T}= \left[\begin{array}{ccc} {\bf Z}_{1} & \dots & {\bf Z}_{Q_{c}} \end{array} \right]\left[\begin{array}{ccc} {\bf\Delta}_1 & & \\ & \ddots &\\ & & {\bf\Delta}_{Q_{c}} \end{array} \right] \left[ \begin{array}{c} {\bf Z}_{1}^{\sf T}\\ \vdots \\ {\bf Z}_{Q_{c}}^{\sf T} \end{array} \right]\]
the definition of \({\bf \Delta}\) determines the distance in use
if \(\Delta_{j}\)’s are diagonal, then \({\bf D}_{cat}\) is independence-based
if \(\Delta_{j}\)’s have non-zero off-diagonal terms, then \({\bf D}_{cat}\) is association-based
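A compact sketch of the framework under this notation: the one-hot blocks are stacked into \({\bf Z}\), the \(\Delta_{j}\)'s are arranged block-diagonally, and \({\bf D}_{cat}\) follows from one matrix product. The \(\Delta_{j}\) blocks below are placeholders; in practice they would be filled by one of the specifications discussed next.

```python
import numpy as np
from scipy.linalg import block_diag

def one_hot_blocks(X_cat):
    """X_cat: (n, Q_c) integer-coded categories -> list of one-hot blocks Z_j."""
    return [np.eye(X_cat[:, j].max() + 1)[X_cat[:, j]] for j in range(X_cat.shape[1])]

def delta_framework_distances(Z_blocks, deltas):
    """D_cat = Z Delta Z' with Delta the block-diagonal of the per-variable Delta_j."""
    Z = np.hstack(Z_blocks)
    return Z @ block_diag(*deltas) @ Z.T

# toy usage: two categorical variables, placeholder Delta_j blocks
X_cat = np.array([[0, 1], [0, 0], [1, 2]])
Z_blocks = one_hot_blocks(X_cat)
deltas = [np.array([[0.0, 1.0], [1.0, 0.0]]),
          np.array([[0.0, 0.5, 1.0], [0.5, 0.0, 0.7], [1.0, 0.7, 0.0]])]
D_cat = delta_framework_distances(Z_blocks, deltas)
```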
the categorical data case: the general (delta) framework
non-diagonal \(\Delta_{j}\)
Let \(a\) and \(b\) be two categories of the categorical variable \(j\); the corresponding \((a,b)^{th}\) entry of \(\Delta_{j}\) is
\[ \delta^{j}(a,b)=\sum_{i\neq j}^{Q_{c}}w_{ji}\Phi^{ji}(\xi^{ji}_{a},\xi^{ji}_{b}) \]
where \(\xi^{ji}_{a}\) and \(\xi^{ji}_{b}\) can be defined from either
the joint (empirical) distributions of the categories of the variable \(i\) with \(a\) and \(b\), respectively
the conditional (empirical) distributions of the categories of the variable \(i\) given \(a\) and \(b\), respectively
joint distribution-based \(\Delta_{j}\)’s for association-based distances
the matrix of co-occurrence proportions is
\[ {\bf P} =\frac{1}{n} \begin{bmatrix} {\bf Z}_{1}^{\sf T}{\bf Z}_{1} & {\bf Z}_{1}^{\sf T}{\bf Z}_{2}&\ldots &{\bf Z}_{1}^{\sf T}{\bf Z}_{Q_{c}}\\ \vdots & \ddots & &\vdots \\ {\bf Z}_{Q_{c}}^{\sf T}{\bf Z}_{1} & {\bf Z}_{Q_{c}}^{\sf T}{\bf Z}_{2}&\ldots &{\bf Z}_{Q_{c}}^{\sf T}{\bf Z}_{Q_{c}}\\ \end{bmatrix} \]
joint distribution-based \(\Delta_{j}\)’s for association-based distances
setting \({\xi}^{ji}_{a}={\bf p}^{ji}_{a}\) and \({\xi}^{ji}_{b}={\bf p}^{ji}_{b}\), where \({\bf p}^{ji}_{a}\) and \({\bf p}^{ji}_{b}\) are the \(a^{th}\) and \(b^{th}\) rows of the block \({\bf P}^{ji}=\frac{1}{n}{\bf Z}_{j}^{\sf T}{\bf Z}_{i}\) of \({\bf P}\), the general formula for the \((a,b)^{th}\) entry of \(\Delta_{j}\) becomes
\[ \delta^{j}(a,b)=\sum_{i\neq j}^{Q_{c}}w_{ji}\Phi^{ji}({\bf p}^{ji}_{a},{\bf p}^{ji}_{b}) \]
e.g., with the entropy-based specification of \(\Phi^{ji}\) (Le and Ho, 2005)
\[ \delta^{j}(a,b)=\sum_{i\neq j}^{Q_{c}}w_{ji}\left[\frac{\sum_{\ell=1}^{q_{i}}\left({\bf p}^{ji}_{a\ell}+{\bf p}^{ji}_{b\ell}\right)\log_{2}\left({\bf p}^{ji}_{a\ell}+{\bf p}^{ji}_{b\ell}\right)}{\log_{2}(q_{i})} \right] \]
with weights given by the mutual information between variables \(j\) and \(i\)
\[ w_{ji}= \sum_{\upsilon=1}^{q_{j}}\sum_{\ell=1}^{q_{i}} {\bf p}^{ji}_{\upsilon \ell}\log_{2}\left(\frac{{\bf p}^{ji}_{\upsilon \ell}}{{\bf p}^{ji}_{\upsilon.}{\bf p}^{ji}_{.\ell}}\right) \]
where \({\bf p}^{ji}_{\upsilon.}\) and \({\bf p}^{ji}_{.\ell}\) indicate the \(\upsilon^{th}\) row margin and the \(\ell^{th}\) column margin of \({\bf P}^{ji}\), respectively
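A sketch of a single entry \(\delta^{j}(a,b)\) under this joint-distribution specification, following the expressions above directly (co-occurrence block, mutual-information weight, entropy-based \(\Phi\)); Z_blocks is the list of one-hot blocks from the earlier sketch, and zero cells follow the usual \(0\log 0=0\) convention.

```python
import numpy as np

def mi_bits(P_ji):
    """Mutual information (in bits) of a block of co-occurrence proportions."""
    pr = P_ji.sum(axis=1, keepdims=True)
    pc = P_ji.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = P_ji * np.log2(P_ji / (pr * pc))
    return np.nansum(terms)                            # 0 log 0 treated as 0

def phi_entropy(p_a, p_b, q_i):
    """Entropy-based comparison of two rows of P^{ji}, as in the formula above."""
    s = p_a + p_b
    s = s[s > 0]
    return np.sum(s * np.log2(s)) / np.log2(q_i)

def delta_entropy(Z_blocks, j, a, b):
    n = Z_blocks[0].shape[0]
    out = 0.0
    for i, Z_i in enumerate(Z_blocks):
        if i == j:
            continue
        P_ji = Z_blocks[j].T @ Z_i / n                 # (q_j x q_i) co-occurrence block
        out += mi_bits(P_ji) * phi_entropy(P_ji[a], P_ji[b], Z_i.shape[1])
    return out
```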
conditional distribution-based \(\Delta_{j}\)’s for association-based distances
\({\bf R} = {\bf P}_{d}^{-1}\left({\bf P}-{\bf P}_{d}\right)\), with \({\bf P}_{d}=diag({\bf P})\), is a block matrix such that
the general off-diagonal block is \({\bf R}_{ji}\) ( \(q_{j}\times q_{i}\) )
the \(a^{th}\) row of \({\bf R}_{ji}\), \({\bf r}^{ji}_{a}\), is the conditional distribution of the \(i^{th}\) variable, given the \(a^{th}\) category of the \(j^{th}\) variable
conditional distribution-based \(\Delta_{j}\)’s for association-based distances
setting \({\xi}^{ji}_{a}={\bf r}^{ji}_{a}\) and \({\xi}^{ji}_{b}={\bf r}^{ji}_{b}\), the general formula for the \((a,b)^{th}\) entry of \(\Delta_{j}\) becomes
\[ \delta^{j}(a,b)=\sum_{i\neq j}^{Q_{c}}w_{ji}\Phi^{ji}({\bf r}^{ji}_{a},{\bf r}^{ji}_{b}) \]
\[\Phi^{ji}({\bf r}^{ji}_{a},{\bf r}^{ji}_{b})=\frac{1}{2}\sum_{\ell=1}^{q_{i}}|{\bf r}^{ji}_{a \ell}-{\bf r}^{ji}_{b \ell}|\] which corresponds to the total variation distance (TVD)
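Analogously, a sketch of the conditional-distribution (TVD) specification: the rows of \({\bf R}_{ji}\) are the distributions of variable \(i\) conditional on the categories of variable \(j\), compared through the total variation distance; mi_bits and Z_blocks are as in the previous sketches.

```python
import numpy as np

def delta_tvd(Z_blocks, j, a, b):
    n = Z_blocks[0].shape[0]
    out = 0.0
    for i, Z_i in enumerate(Z_blocks):
        if i == j:
            continue
        C = Z_blocks[j].T @ Z_i                        # (q_j x q_i) contingency table
        R_ji = C / C.sum(axis=1, keepdims=True)        # rows: conditional distributions
        tvd = 0.5 * np.abs(R_ji[a] - R_ji[b]).sum()    # total variation distance
        out += mi_bits(C / n) * tvd                    # weight by mutual information
    return out
```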
supervised AB-distance
the class labels are categories of a further variable \(y\) (the response)
a supervised variant of the AB-distance can be defined that takes into account the association between \(y\) and each of the other variables.
let \({\bf Z}_{y}\) be the one-hot encoding of the response and \({\bf P}_{z}={\rm diag}({\bf Z}^{\sf T}{\bf Z})\) the diagonal matrix of category counts; then the matrix \({\bf R}\) becomes
\[ {\bf R}_{s} = {\bf P}_{z}^{-1}\left( {\bf Z}^{\sf T}{\bf Z}_{y} \right)= {\bf P}_{z}^{-1} \begin{bmatrix} {\bf Z}_{1}^{\sf T}{\bf Z}_{y}\\ \vdots \\ {\bf Z}_{Q_{c}}^{\sf T}{\bf Z}_{y} \end{bmatrix} \]
the \((a,b)^{th}\) general entry of \(\Delta^{j}_{s}\) is given by
\[ \delta_{s}^{j}(a,b)= w_{j}\left[\frac{1}{2}\sum_{\ell=1}^{q_{y}}|{\bf r}^{j}_{a \ell}-{\bf r}^{j}_{b \ell}|\right] \]
where \(w_{j}\) weights the association between the \(j^{th}\) variable and the response
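A sketch of the supervised entry \(\delta^{j}_{s}(a,b)\): the rows compared through the TVD are now the distributions of the response conditional on the categories of variable \(j\). Using the mutual information between \(X_{j}\) and \(y\) as the weight \(w_{j}\) is an assumption of this sketch, since the slides do not spell \(w_{j}\) out.

```python
import numpy as np

def delta_supervised(x_j, y, a, b):
    """x_j, y: length-n integer-coded arrays (x_j: categorical predictor, y: response)."""
    n = len(y)
    C = np.zeros((x_j.max() + 1, y.max() + 1))
    np.add.at(C, (x_j, y), 1)                          # contingency table of x_j vs y
    R = C / C.sum(axis=1, keepdims=True)               # rows: distributions of y given x_j
    P = C / n
    pr = P.sum(axis=1, keepdims=True)
    pc = P.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        w_j = np.nansum(P * np.log2(P / (pr * pc)))    # assumed weight: MI(x_j, y) in bits
    return w_j * 0.5 * np.abs(R[a] - R[b]).sum()       # weighted TVD of the two rows
```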
KNN learning: synthetic categorical data
setup
generated \({\bf X}_{cat}\) \((1000\times16)\), 8 of which are associated with the response
4 classes, same size
low/high level of overlap (association to the response)
25 replicates
distance methods: supervised TVD, Entropy-based, Gower (matching-based)
evaluation: accuracy
KNN learning: synthetic categorical data
association-based distances for mixed data?
a straightforward way to generalise association-based distances to mixed data is to combine the categorical and continuous distances
\[{\bf D}_{mix}=\alpha {\bf D}_{cat}+(1-\alpha){\bf D}_{con}\]
\({\bf D}_{cat}\) is one of the previously defined (TVD/entropy-based)
\({\bf D}_{con}\) is the Mahalanobis (or modified Mahalanobis) distance
However, no categorical/continuous interaction is taken into account
Aim: define \(\delta_{int}(a,b)\) so that it accounts for the categorical/continuous interactions
How to define \(\delta_{int}(a,b)\): JS-based
Let \(a\) and \(b\) be two categories of the variable \(j\) and let \(X_{i}\) be continuous
\[ \delta_{int}^{j}(a,b)=\sum_{i=Q_{c}+1}^{Q}w_{ji}\Phi_{JS}^{ji}\left({f}_{a}(X_{i}),{f}_{b}(X_{i})\right) \] where \(f_{a}(X_{i})\) and \(f_{b}(X_{i})\) are the distributions of \(X_{i}\) conditional on \(a\) and \(b\), respectively
The two distributions are compared via the Kullback-Leibler divergence
\[ \Phi^{ji}_{KL}(f_{a}(X_{i}),f_{b}(X_{i}))=\int f_{a}(x)\log_{2} \frac{f_{a}(x)}{f_{b}(x)}\,dx \]
How to define \(\delta_{int}(a,b)\): JS-based
Since \(\Phi^{ji}_{KL}(f_{a}(X_{i}),f_{b}(X_{i}))\neq\Phi^{ji}_{KL}(f_{b}(X_{i}),f_{a}(X_{i}))\), it is rendered symmetric using the Jensen–Shannon distance
\[ \Phi^{ji}_{JS}(f_{a}(X_{i}),f_{b}(X_{i}))=\frac{1}{4}\sqrt{ \Phi^{ji}_{KL}\left(f_{a}(X_{i}), f_{ab}(X_{i})\right)+ \Phi^{ji}_{KL}\left( f_{b}(X_{i}),f_{ab}(X_{i})\right)} \] where \(f_{ab}(X_{i})=\left(f_{a}(X_{i})+f_{b}(X_{i})\right)/2\)
How to define \(\delta_{int}(a,b)\): JS-based
The \((a,b)^{th}\) entry of the \(\Delta^{int}_{j}\) is, therefore,
\[ \delta_{int}^{j}(a,b)=\sum_{i=Q_{c}+1}^{Q}w_{ji}\Phi_{JS}^{ji}\left({f}_{a}(X_{i}),{f}_{b}(X_{i})\right) \]
the weights \(w_{ji}\) are once again based on the mutual information between the continuous variable \(X_{i}\) and the categorical variable \(X_{j}\) (Ross, 2014)
\[w_{ji} = \frac{1}{n} \sum_{\ell=1}^{n} \left(\psi(n) -\psi(n_{\ell}^j) + \psi(k) - \psi(m_{\ell}) \right)\]
\(x_{j\ell}\) is the category of \(X_j\) taken by the \(\ell^{th}\) observation, and \(n^{j}_{\ell}\) its frequency
\(d_\ell^i\) is the distance, computed on \(X_{i}\), of the \(\ell^{th}\) observation to its \(k^{th}\) nearest neighbor among the observations sharing the category \(x_{j\ell}\)
\(m_\ell\) is the number of observations within a \(d_\ell^i\) distance from the \(\ell^{th}\) observation
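A sketch of the JS-based interaction restricted to a single continuous variable \(X_{i}\) (the full entry sums this over all continuous columns): the conditional densities are estimated by a Gaussian KDE and compared through the symmetrised expression above, while sklearn's k-NN estimator of the mutual information between a continuous and a discrete variable (in the spirit of Ross, 2014) stands in for the weight \(w_{ji}\). The grid size and the KDE choice are assumptions of this sketch.

```python
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.feature_selection import mutual_info_classif

def kl_bits(f, g, grid):
    """Numerical KL divergence (in bits) between two densities evaluated on a grid."""
    eps = 1e-12
    return np.sum(f * np.log2((f + eps) / (g + eps))) * (grid[1] - grid[0])

def delta_js(x_i, x_j, a, b, k=5):
    """x_i: continuous column; x_j: integer-coded categorical column."""
    grid = np.linspace(x_i.min(), x_i.max(), 512)
    f_a = gaussian_kde(x_i[x_j == a])(grid)            # density of x_i given category a
    f_b = gaussian_kde(x_i[x_j == b])(grid)            # density of x_i given category b
    f_ab = (f_a + f_b) / 2
    phi_js = 0.25 * np.sqrt(kl_bits(f_a, f_ab, grid) + kl_bits(f_b, f_ab, grid))
    # k-NN mutual information between continuous x_i and categorical x_j (nats -> bits)
    w_ji = mutual_info_classif(x_i.reshape(-1, 1), x_j, n_neighbors=k)[0] / np.log(2)
    return w_ji * phi_js
```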
How to define \(\delta_{int}(a,b)\): NN-based
the categorical/continuous interaction is proportional to the discriminant power of the continuous variables for each category pair \((a,b)\) of the \(j^{th}\) categorical variable, \(j=1,\ldots,Q_{c}\)
it is assessed via nearest neighbors (NN) averaging
if \(x_{\ell j}=a\), the proportion of nearest neighbors of the \(\ell^{th}\) observation that are labeled \(a\) is
\[ {\hat\pi}_{a\ell}=\frac{1}{n^{j}_{a}\pi_{nn}} \sum_{m\in \cal{N}^{a}_{\ell}}I(x_{jm}=a) \]
if \(x_{\ell j}=b\), the proportion of nearest neighbors of the \(\ell^{th}\) observation that are labeled \(b\) is
\[ {\hat\pi}_{b\ell}=\frac{1}{n^{j}_{b}\pi_{nn}} \sum_{m\in \cal{N}^{b}_{\ell}}I(x_{jm}=b) \]
\(n^{j}_{a}\) and \(n^{j}_{b}\) are absolute frequencies of categories \(a\) and \(b\)
\(\pi_{nn}\) is the user-defined proportion of nearest neighbors
\(\mathcal{N}^{a}_{\ell}\) (\(\mathcal{N}^{b}_{\ell}\)) is the set of nearest neighbors of the \(\ell^{th}\) observation when \(x_{\ell j}=a\) (\(x_{\ell j}=b\))
How to define \(\delta_{int}(a,b)\): NN-based
We consider the improvement over chance that is obtained when using the continuous variables to correctly classify the observations:
for category \(a\)
\[ \delta^{j}_{int}(a)=\left[\frac{1}{n_{a}^{j}}\sum_{\ell=1}^{n_{a}^{j}} I(\hat{\pi}_{a\ell}\geq .5)\right]-.5 \]
for category \(b\)
\[ \delta^{j}_{int}(b)=\left[\frac{1}{n_{b}^{j}}\sum_{\ell=1}^{n_{b}^{j}} I(\hat{\pi}_{b\ell}\geq .5)\right]-.5 \]
finally, the \((a,b)^{th}\) entry of \(\Delta^{int}_{j}\) is given by
\[ \delta^{j}_{int}(a,b) = \delta^{j}_{int}(a) + \delta^{j}_{int}(b). \]
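A sketch of the NN-based entry: neighbors are searched in the space of the continuous variables, and reading \(\pi_{nn}\) as the fraction of same-category observations that defines the neighborhood size is one interpretation adopted by this sketch.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def delta_int_nn(X_con, x_j, a, b, pi_nn=0.1):
    """X_con: (n, Q_d) continuous data; x_j: integer-coded categorical column."""
    def improvement(cat):
        idx = np.where(x_j == cat)[0]
        k = max(1, int(round(pi_nn * idx.size)))       # neighborhood size for this category
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X_con)
        # drop the first neighbor of each query point (the point itself)
        neigh = nn.kneighbors(X_con[idx], return_distance=False)[:, 1:]
        pi_hat = (x_j[neigh] == cat).mean(axis=1)      # share of same-category neighbors
        return np.mean(pi_hat >= 0.5) - 0.5            # improvement over chance
    return improvement(a) + improvement(b)
```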
continuous attributes in the TVD computation: NN-based
KNN learning: synthetic mixed data
setup
\({\bf X}=\left[{\bf X}_{cat},{\bf X}_{con}\right]\)
4 classes, same size
low/high level of overlap (association to the response)
25 replicates
distance methods:
association-based: Mahalanobis, supervised TVD, NN-based interaction
GUDMM: modified Mahalanobis, entropy-based, JS-based
Gower
evaluation: accuracy
KNN learning: synthetic mixed data
SC on synthetic mixed data
Final considerations and future work
in supervised settings, AB distances make it possible to take the response into account in the pairwise distance computations
convex combination of the continuous/categorical distances
main references
Kraskov, A., H. Stögbauer, and P. Grassberger (2004). “Estimating mutual information”. In: Physical review E 69.6, p. 066138.
Le, S. Q. and T. B. Ho (2005). “An association-based dissimilarity measure for categorical data”. In: Pattern Recognition Letters 26.16, pp. 2549-2557.
Melnykov, V., W. Chen, and R. Maitra (2012). “MixSim: An R package for simulating data to study performance of clustering algorithms”. In: Journal of Statistical Software 51, pp. 1-25.
Mousavi, E. and M. Sehhati (2023). “A Generalized Multi-Aspect Distance Metric for Mixed-Type Data Clustering”. In: Pattern Recognition, p. 109353.
Ross, B. C. (2014). “Mutual information between discrete and continuous data sets”. In: PloS one 9.2, p. e87357.
van de Velden, M., A. Iodice D’Enza, A. Markos, et al. (2023). “A general framework for implementing distances for categorical variables”. In: submitted to Pattern Recognition, pp. 1-21.
van de Velden, M., A. Iodice D’Enza, and F. Palumbo (2017). “Cluster correspondence analysis”. In: Psychometrika 82.1, pp. 158-185.