outline

distance-based learning in the mixed-data case

  • variable-specific biases

the \(\Delta\) framework

  • (supervised) association-based (AB) distances

non-lazy KNN for mixed data

  • categorical/numerical interaction?

what’s next

  • advertising R package (almost there!)

(un)biased distances for mixed data

distance-based learning

unsupervised

  • clustering

    • K-means, partitioning around medoids

    • spectral clustering, DB-scan

  • dimensionality reduction

    • multidimensional scaling
    • t-SNE

supervised

  • nearest neighbors averaging

  • support vector machines with radial basis functions

intuition

2 continuous variables: add up by-variable (absolute value or squared) differences

intuition

2 continuous and 1 categorical variables

one might consider purple and blue closer than e.g. purple and yellow

desirable properties

Multivariate Additivity

Let \(\mathbf{x}_i=\left(x_{i1}, \dots, x_{iQ}\right)\) denote a \(Q-\)dimensional vector. A distance function \(d\left(\mathbf{x}_i,\mathbf{x}_\ell\right)\) between observations \(i\) and \(\ell\) is multivariate additive if

\[ d\left(\mathbf{x}_i,\mathbf{x}_\ell\right)=\sum_{j=1}^{Q} d_j\left(\mathbf{x}_i,\mathbf{x}_\ell\right), \]

where \(d_j\left(\mathbf{x}_i,\mathbf{x}_\ell\right)\) denotes the \(j-\)th variable specific distance.

  • Manhattan distance satisfies the additivity property; the Euclidean distance does not

desirable properties

If additivity holds, by-variable distances are added together: they should be on equivalent scales

Commensurability

Let \({\boldsymbol X}_i =\left(X_{i1}, \dots, X_{iQ}\right)\) denote a \(Q-\)dimensional random variable corresponding to an observation \(i\). Furthermore, let \(d_{j}\) denote the distance function corresponding to the \(j-\)th variable. We have commensurability if, for all \(j\), and \(i \neq \ell\),

\[ E[d_{j}({ X}_{ij}, {X}_{\ell j})] = c, \]

where \(c\) is some constant.

desirable properties

If the multivariate distance function \(d(\cdot,\cdot)\) satisfies additivity and commensurability, then ad hoc distance functions can be used for each variable and then aggregated.

 

then

one can pick the appropriate \(d_{j}(\cdot,\cdot)\), given the nature of \(X_{j}\)

well suited in the mixed data case

mixed-data setup

a mixed data set

  • \(I\) observations described by \(Q\) variables, \(Q_{n}\) numerical and \(Q_{c}\) categorical

  • the \(I\times Q\) data matrix \({\bf X}=\left[{\bf X}_{n},{\bf X}_{c}\right]\) is column-wise partitioned

A formulation for mixed distance between observations \(i\) and \(\ell\):

\[\begin{eqnarray}\label{genmixeddist_formula} d\left(\mathbf{x}_i,\mathbf{x}_\ell\right)&=& \sum_{j_n=1}^{Q_n} d_{j_n}\left(\mathbf{x}^n_i,\mathbf{x}^n_\ell\right)+ \sum_{j_c=1}^{Q_c} d_{j_c}\left(\mathbf{x}^c_i,\mathbf{x}^c_\ell\right)\\ &=& \sum_{j_n=1}^{Q_n} w_{j_n} \delta^n_{j_n}\left(\mathbf{x}^n_i,\mathbf{x}^n_\ell\right)+ \sum_{j_c=1}^{Q_c} w_{j_c}\delta^c_{j_c}\left(\mathbf{x}^c_i,\mathbf{x}^c_\ell\right) \end{eqnarray}\]

numeric case

  • \(\delta^n_{j_n}\) is a function quantifying the dissimilarity between observations on the \(j_n-\)th numerical variable

  • \(w_{j_n}\) is a weight for the \(j_n-\)th variable.

categorical case

  • \(\delta^c_{j_c}\) quantifies the dissimilarity between the categories chosen by subjects \(i\) and \(\ell\) for categorical variable \(j_c\)

  • \(w_{j_c}\) is a weight for the \(j_c-\)th variable

distributions, scaling and bias: the numeric case

synthetic data

  • \(I=500\) observations from normal, uniform, skewed and bimodal distributions

  • skewed refers to a \(\chi^2_{1/2}\) distribution

  • bimodal: half of the draws from \(\chi^2_{1/2}\) (censored at \(10\)), the other half from \(10-\chi^2_{1/2}\) (censored at \(0\))

  • as long as the variables have the same underlying distribution and scaling, commensurability holds

  • standard deviation scaling is the least affected by the variables' distributions

  • the contribution of a variable to the overall distance may be biased
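
To make the bias concrete, here is a minimal sketch (not the original simulation code) that mimics the setup above: it draws the four variables, applies standard-deviation scaling, and compares the mean by-variable Manhattan distances, i.e. the empirical \(E[d_{j}]\) appearing in the commensurability definition.

```r
# Sketch of the commensurability check (not the original simulation code):
# mean pairwise L1 distance per variable, with and without sd scaling.
set.seed(1)
I <- 500
x <- data.frame(
  normal  = rnorm(I),
  uniform = runif(I),
  skewed  = rchisq(I, df = 1/2),
  bimodal = c(pmin(rchisq(I / 2, df = 1/2), 10),
              pmax(10 - rchisq(I / 2, df = 1/2), 0))
)
mean_l1 <- function(v) mean(as.numeric(dist(v, method = "manhattan")))
sapply(x, mean_l1)                          # very different: contributions are biased
sapply(x, function(v) mean_l1(v / sd(v)))   # closer, but still not identical, under sd scaling
```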

distributions, scaling and bias: the categorical case

the general (delta) framework

Let \({\bf Z}=\left[{\bf Z}_{1},{\bf Z}_{2},\ldots,{\bf Z}_{Q_c}\right]\) be the one-hot encoding of \({\bf X}_{c}\)

The pair-wise distances between categorical observations are given by

\[{\bf D}_{c}={\bf Z}{\bf \Delta}{\bf Z}^{\sf T}= \left[\begin{array}{ccc} {\bf Z}_{1} & \dots & {\bf Z}_{Q_{c}} \end{array} \right]\left[\begin{array}{ccc} {\bf\Delta}_1 & & \\ & \ddots &\\ & & {\bf\Delta}_{Q_{c}} \end{array} \right] \left[ \begin{array}{c} {\bf Z}_{1}^{\sf T}\\ \vdots \\ {\bf Z}_{Q_{c}}^{\sf T} \end{array} \right]\]

  • the definition of \({\bf \Delta}\) determines the distance in use

  • if \(\Delta_{j}\)’s are diagonal, then \({\bf D}_{c}\) is independence-based

  • if \(\Delta_{j}\)’s have non-zero off-diagonal terms, then \({\bf D}_{c}\) is association-based
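
As a minimal illustration of the framework, the sketch below builds \({\bf Z}\) for a single categorical variable and uses the simplest choice of \({\bf\Delta}\): zeros on the diagonal and ones elsewhere (the matching distance introduced next). The product \({\bf Z}{\bf\Delta}{\bf Z}^{\sf T}\) then reproduces the 0/1 mismatch matrix.

```r
# One categorical variable: with Delta having 0 on the diagonal and 1 elsewhere
# (matching distance), Z Delta Z' returns the pairwise mismatch matrix.
x <- factor(c("a", "b", "a", "c"))
Z <- model.matrix(~ x - 1)             # one-hot encoding, I x q
q <- nlevels(x)
Delta <- matrix(1, q, q) - diag(q)     # all category pairs equally dissimilar
D_c <- Z %*% Delta %*% t(Z)            # 0 if same category, 1 otherwise
D_c
```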

distributions, scaling and bias: the categorical case

independence-based pairwise distance

No inter-variable relations are considered

  • in the continuous case: Euclidean or Manhattan distances

  • in the categorical case: Hamming (matching) distance (among MANY others)

  • in the mixed data case: Gower index

association-based pairwise distance

The rationale is that not all the observed differences weigh the same:

  • differences in line with the inter-variable association/correlation are downweighted

distributions, scaling and bias: the categorical case

flat frequency distribution

| Distance | Cat. dissimilarity | \(E[d(X_i, X_{\ell})]\) | \(q=2\) | \(q=5\) |
|---|---|---|---|---|
| Matching | \(\boldsymbol{\Delta}_m = \mathbf{1} \mathbf{1}^{\top} - \mathbf{I}\) | \(\frac{q-1}{q}\) | 0.5 | 0.8 |
| Eskin | \(\boldsymbol{\Delta}_e = \frac{2}{q^2}\boldsymbol{\Delta}_m\) | \(\frac{2(q-1)}{q^3}\) | 0.250 | 0.064 |
| Occurrence frequency (OF) | \(\boldsymbol{\Delta}_{OF} = \log^2(q)\boldsymbol{\Delta}_m\) | \(\log^2(q)\frac{q-1}{q}\) | 0.240 | 2.072 |
| Inverse OF | \(\boldsymbol{\Delta}_{IOF} = \log^2\left(\frac{I}{q}\right) \boldsymbol{\Delta}_m\) | \(\log^2\left(\frac{I}{q}\right)\frac{q-1}{q}\) | 9.601 | 9.610 |

skewed frequency distribution

  • \(q\in \{2,3,5,10\}\)
  • \(p_1 \in \{0.05,0.1,0.2,0.33, 0.5,0.66, 0.8,0.9,0.95\}\)
  • \(p_j = (1-p_1)/(q-1)\), with \(j=2,\dots,q\)

The expected distance increases with the heterogeneity of the distribution and with the number of categories
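
For the matching distance this has a closed form: two independent draws differ with probability \(1-\sum_{j} p_{j}^{2}\). A short sketch over the grid above (an illustration, not the original code) reproduces the described pattern.

```r
# Expected matching distance: P(two independent draws differ) = 1 - sum(p^2).
exp_match <- function(q, p1) {
  p <- c(p1, rep((1 - p1) / (q - 1), q - 1))
  1 - sum(p^2)
}
grid <- expand.grid(q  = c(2, 3, 5, 10),
                    p1 = c(0.05, 0.1, 0.2, 0.33, 0.5, 0.66, 0.8, 0.9, 0.95))
grid$E_d <- mapply(exp_match, grid$q, grid$p1)
head(grid)
# E_d is largest for flat distributions (p1 = 1/q) and grows with q,
# matching the (q-1)/q column of the table above.
```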

distributions, scaling and bias: the categorical case

flat frequency distribution

| Distance | Cat. dissimilarity | \(E[d(X_i, X_{\ell})]\) | \(q=2\) | \(q=5\) |
|---|---|---|---|---|
| No scaling | \(\boldsymbol{\Delta}_{d}=2\boldsymbol{\Delta}_m\) | \(\frac{2(q-1)}{q}\) | 1 | 1.6 |
| Hennig-Liao scaling | \(\boldsymbol{\Delta}_{HL} = \sqrt{\frac{2q}{q-1}}\boldsymbol{\Delta}_m\) | \(\sqrt{\frac{2\left(q-1\right)}{q}}\) | 1 | 1.265 |
| St. dev. scaling | \(\boldsymbol{\Delta}_{s}=2\sqrt{\frac{q}{q-1}}\boldsymbol{\Delta}_m\) | \(2\sqrt{\frac{(q-1)}{q}}\) | 1.414 | 1.789 |
| Cat. dissim. scaling | \(\boldsymbol{\Delta}_{cds}=\frac{q}{q-1}\boldsymbol{\Delta}_m\) | \(1\) | 1 | 1 |

skewed frequency distribution

  • \(q\in \{2,3,5,10\}\)
  • \(p_1 \in \{0.05,0.1,0.2,0.33, 0.5,0.66, 0.8,0.9,0.95\}\)
  • \(p_j = (1-p_1)/(q-1)\), with \(j=2,\dots,q\)

while the pattern is similar, different scalings can smooth out the effects of heterogeneity and of the number of categories

distributions, scaling and bias: the categorical case

flat frequency distribution

| Distance | Cat. dissimilarity | \(E[d(X_i, X_{\ell})]\) | \(q=2\) | \(q=5\) |
|---|---|---|---|---|
| Total variation | \(\boldsymbol{\Delta}_{tvd} = \boldsymbol{\Delta}_m\) | \(\frac{q-1}{q}\) | 0.5 | 0.8 |
| Kullback-Leibler (Le & Ho) | \(\boldsymbol{\Delta}_{KL} = \kappa\boldsymbol{\Delta}_m\) | \(\kappa\frac{q-1}{q}\) | 8.305 | 13.288 |

where \(\kappa=5\log_{2}(10)\)

skewed frequency distribution

  • \(q\in \{2,3,5,10\}\)
  • \(p_1 \in \{0.05,0.1,0.2,0.33, 0.5,0.66, 0.8,0.9,0.95\}\)
  • \(p_j = (1-p_1)/(q-1)\), with \(j=2,\dots,q\)
  • distances are computed based on the association with a target variable having the same marginal distribution as the considered variable

  • the magnitude of the distances differs across methods, yet the patterns are the same

variable importance

data generating process

  • \(I=100\) observations and \(Q=6\) variables

  • \(\bf Y\) is an \(I\times 2\) orthogonal basis constructed from \(2I\) values drawn from \(U(-2,2)\)

  • \(\bf N\) is a \(2\times Q\) random matrix with \(2Q\) values drawn from \(U(-2,2)\)

  • \({\bf X}_{o}={\bf Y}{\bf N}\) is the \(Q\)-dimensional observed matrix with low-dimensional configuration \(\bf Y\)

  • Gaussian noise added (\(\sigma=0.03\), half the standard deviation of the generated data)

  • variables 1 and 2 are numerical; variables 3 to 6 are rendered categorical, with \(\{2,3,5,9\}\) categories, respectively
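
A compact sketch of this data generating process is given below; the orthogonalisation (via QR) and the discretisation rule (equal-width binning with cut) are assumptions, since the slides do not specify them.

```r
# Sketch of the DGP (QR orthogonalisation and equal-width binning are assumptions)
set.seed(1)
I <- 100; Q <- 6
Y <- qr.Q(qr(matrix(runif(2 * I, -2, 2), I, 2)))      # I x 2 orthogonal basis
N <- matrix(runif(2 * Q, -2, 2), 2, Q)                # 2 x Q random loadings
X <- Y %*% N + matrix(rnorm(I * Q, sd = 0.03), I, Q)  # observed data + Gaussian noise
X_df <- as.data.frame(X)
n_cat <- c(2, 3, 5, 9)
for (j in 3:6)                                        # variables 3..6 become categorical
  X_df[[j]] <- factor(cut(X[, j], breaks = n_cat[j - 2], labels = FALSE))
str(X_df)                                             # 2 numeric + 4 categorical variables
```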

variants

  • Num: all numeric data (Manhattan) distance

  • Naive: Euclidean distance on numeric and one-hot encoded categorical variables

  • HL: Euclidean distance on (standardized) numeric variables and one-hot encoded categorical variables with the Hennig-Liao scaling factor

  • HLa: same as HL, but Manhattan distance is used instead (additive)

  • G: Gower, range-normalized numerical and simple matching

  • Uind: commensurable distance using simple matching for the categorical variables

  • Ustd: commensurable distance using category dissimilarity scaling for the categorical variables

  • Udep: commensurable association-based mixed distance using PCA scaling of the numerical, and total variation distance for the categorical

variable importance

Leave-one-variable-out: contribution to distance

  • Naive (and, to some extent, Gower) emphasize the categorical variables

  • the Hennig-Liao scaling leads to the opposite effect (emphasis on the numeric variables)

  • unbiased distances lead to a relative contribution close to \(1/6 \approx 0.167\)

variable importance

full vs LOO multidimensional scaling (MDS) configuration

  • the effect of the number of categories is reversed here: variables with fewer categories have a larger impact on the MDS configuration

    • few categories \(\longrightarrow\) less room for differentiation among observations

variable importance

Retrieve \(\bf Y\) via MDS: alienation coefficient distribution over 100 instances

  • in each scenario, all categorical variables have the same number of categories (2, 3, 5, or 9)

variable importance

FIFA data: Dutch league

variable importance

FIFA data: Dutch league

  • just like before, the Naive and Gower variants favour the categorical variables

  • in contrast, Hennig-Liao scaling with Euclidean distance over-corrects, making the numerical variables overly dominant

  • the drop in mean values for the last two numerical variables is due to the skewness of the corresponding distributions

variable importance

FIFA data: Dutch league

on the unbiased distances

  • the mean distances per variable are equivalent

  • there is variability in the impact of the variables on MDS

  • commensurability does not mean that the variables play the same role in determining a subsequent solution

more on association-based (AB) distances

the delta matrix: categories dissimilarities

recall \(\bf \Delta\)

The pair-wise distances between categorical observations are given by

\[{\bf D}_{c}={\bf Z}{\bf \Delta}{\bf Z}^{\sf T}= \left[\begin{array}{ccc} {\bf Z}_{1} & \dots & {\bf Z}_{Q_{c}} \end{array} \right]\left[\begin{array}{ccc} {\bf\Delta}_1 & & \\ & \ddots &\\ & & {\bf\Delta}_{Q_{c}} \end{array} \right] \left[ \begin{array}{c} {\bf Z}_{1}^{\sf T}\\ \vdots \\ {\bf Z}_{Q_{c}}^{\sf T} \end{array} \right]=\sum_{j=1}^{Q_{c}}{{\bf Z}_{j}{\bf \Delta}_{j}}{\bf Z}^{\sf T}_{j}\]

in association-based distances, \(\Delta_{j}\) is non-diagonal and its elements depend on the other variables, too

non-diagonal \(\Delta_{j}\)

Let \(a\) and \(b\) be two categories of the categorical variable \(j\), the corresponding \((a,b)^{th}\) entry of \(\Delta_{j}\) is

\[ \delta^{j}(a,b)=\sum_{i\neq j}^{Q_{c}}w_{ji}\Phi^{ji}(\xi^{ji}_{a},\xi^{ji}_{b}) \]

where \(\xi^{ji}_{a}\) and \(\xi^{ji}_{b}\) are defined from

  • the joint (empirical) distributions of the categories of the variable \(i\) with \(a\) and \(b\), respectively

  • the conditional (empirical) distributions of the categories of the variable \(i\) given \(a\) and \(b\), respectively

joint distribution-based \(\Delta_{j}\)’s for association-based distances

the matrix of co-occurrence proportions is

\[ {\bf P} =\frac{1}{I} \begin{bmatrix} {\bf Z}_{1}^{\sf T}{\bf Z}_{1} & {\bf Z}_{1}^{\sf T}{\bf Z}_{2}&\ldots &{\bf Z}_{1}^{\sf T}{\bf Z}_{Q_{c}}\\ \vdots & \vdots &\ddots & \vdots \\ {\bf Z}_{Q_{c}}^{\sf T}{\bf Z}_{1} & {\bf Z}_{Q_{c}}^{\sf T}{\bf Z}_{2}&\ldots &{\bf Z}_{Q_{c}}^{\sf T}{\bf Z}_{Q_{c}} \end{bmatrix} \]

  • let \({\bf p}^{ji}_{a}\) and \({\bf p}^{ji}_{b}\) be rows of \({\bf P}^{ji}\), the off-diagonal block of \(\bf P\) relating variables \(j\) and \(i\)

joint distribution-based \(\Delta_{j}\)’s for association-based distances

entropy-based

setting \({\xi}^{ji}_{a}={\bf p}^{ji}_{a}\) and \({\xi}^{ji}_{b}={\bf p}^{ji}_{b}\), the general formula for the \(ab^{th}\) entry of \(\Delta_{j}\)

\[ \delta^{j}(a,b)=\sum_{i\neq j}^{Q_{c}}w_{ji}\Phi^{ji}({\bf p}^{ji}_{a},{\bf p}^{ji}_{b}) \]

by defining \(\Phi^{ji}({\bf p}^{ji}_{a},{\bf p}^{ji}_{b})\) in terms of normalized entropy the above becomes

\[ \delta^{j}(a,b)=\sum_{i\neq j}^{Q_{c}}w_{ji}\left[\frac{\sum_{\ell=1}^{q_{i}}\left({\bf p}^{ji}_{a\ell}+{\bf p}^{ji}_{b\ell}\right)\log_{2}\left({\bf p}^{ji}_{a\ell}+{\bf p}^{ji}_{b\ell}\right)}{\log_{2}(q_{i})} \right] \]

the weights \(w_{ji}\) are based on the mutual information between the variables \(j\) and \(i\)

\[ w_{ji}= \sum_{\upsilon=1}^{q_{j}}\sum_{\ell=1}^{q_{i}} {\bf p}^{ji}_{\upsilon \ell}\log_{2}\left(\frac{{\bf p}^{ji}_{\upsilon \ell}}{{\bf p}^{ji}_{\upsilon.}{\bf p}^{ji}_{.\ell}}\right) \]

where \({\bf p}^{ji}_{\upsilon.}\) and \({\bf p}^{ji}_{.\ell}\) indicate the \(\upsilon^{th}\) row margin and the \(\ell^{th}\) column margin of \({\bf P}^{ji}\), respectively
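
The mutual-information weight has a direct empirical counterpart; the sketch below (a plain translation of the formula above, with \(0\log 0\) set to \(0\), not the package code) computes \(w_{ji}\) from two observed categorical variables.

```r
# Mutual information weight w_ji from the co-occurrence proportions of
# variables j and i (empty cells contribute 0).
mi_weight <- function(xj, xi) {
  P  <- table(xj, xi) / length(xj)           # joint proportions p^{ji}
  pr <- rowSums(P); pc <- colSums(P)         # row and column margins
  terms <- P * log2(P / outer(pr, pc))       # p * log2(p / (p_row * p_col))
  sum(terms[P > 0])                          # drop empty cells
}
xj <- factor(sample(letters[1:3], 200, replace = TRUE))
xi <- factor(ifelse(xj == "a", sample(c("u", "v"), 200, TRUE, c(.8, .2)),
                               sample(c("u", "v"), 200, TRUE, c(.3, .7))))
mi_weight(xj, xi)   # larger when variables j and i are more strongly associated
```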

conditional distribution-based \(\Delta_{j}\)’s for association-based distances

\({\bf R} = {\bf P}_{d}^{-1}\left({\bf P}-{\bf P}_{d}\right)\), with \({\bf P}_{d}=diag({\bf P})\), is a block matrix such that

  • the general off-diagonal block is \({\bf R}_{ji}\) ( \(q_{j}\times q_{i}\) )

  • the \(a^{th}\) row of \({\bf R}_{ji}\), \({\bf r}^{ji}_{a}\), is the conditional distribution of the \(i^{th}\) variable, given the \(a^{th}\) category of the \(j^{th}\) variable

conditional distribution-based \(\Delta_{j}\)’s for association-based distances

total variation distance (TVD)

setting \({\xi}^{ji}_{a}={\bf r}^{ji}_{a}\) and \({\xi}^{ji}_{b}={\bf r}^{ji}_{b}\), the general formula for the \(ab^{th}\) entry of \(\Delta_{j}\)

\[ \delta^{j}(a,b)=\sum_{i\neq j}^{Q_{c}}w_{ji}\Phi^{ji}({\bf r}^{ji}_{a},{\bf r}^{ji}_{b}) \]

by defining \(\Phi^{ji}({\bf r}^{ji}_{a},{\bf r}^{ji}_{b})\) in terms of L1 distance the above becomes

\[\Phi^{ji}({\bf r}^{ji}_{a},{\bf r}^{ji}_{b})=\frac{1}{2}\sum_{\ell=1}^{q_{i}}|{\bf r}^{ji}_{a \ell}-{\bf r}^{ji}_{b \ell}|\] which corresponds to the total variation distance (TVD)

the weights can be \(w_{ji}=1/(Q_{c}-1)\), or suitably defined to achieve commensurability
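
A minimal sketch of one term of the sum: the conditional profiles \({\bf r}^{ji}_{a}\) and \({\bf r}^{ji}_{b}\) are rows of the row-normalised cross-tabulation of variables \(j\) and \(i\), and \(\Phi^{ji}\) is half their \(L_1\) distance.

```r
# TVD-based dissimilarity between categories a and b of variable j,
# measured through their conditional profiles on another variable i.
tvd_delta <- function(xj, xi, a, b) {
  R <- prop.table(table(xj, xi), margin = 1)   # rows: P(X_i = . | X_j = category)
  0.5 * sum(abs(R[a, ] - R[b, ]))              # total variation distance
}
xj <- factor(sample(c("a", "b", "c"), 300, replace = TRUE))
xi <- factor(ifelse(xj == "a", sample(c("u", "v"), 300, TRUE, c(.9, .1)),
                               sample(c("u", "v"), 300, TRUE, c(.4, .6))))
tvd_delta(xj, xi, "a", "b")   # close to 0 when a and b behave alike w.r.t. X_i
# the full delta^j(a,b) averages such terms over all i != j,
# e.g. with weights w_ji = 1/(Q_c - 1)
```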

supervised AB-distance

supervised TVD

  • the class labels are categories of a further variable \(y\) (the response)

  • a supervised variant of the AB-distance can be defined that takes into account the association between \(y\) and each of the other variables

Let \({\bf Z}_{y}\) be the one-hot encoding of the response; then the matrix \({\bf R}\) becomes

\[ {\bf R}_{s} = {\bf P}_{z}^{-1}\left( {\bf Z}^{\sf T}{\bf Z}_{y} \right)= {\bf P}_{z}^{-1} \begin{bmatrix} {\bf Z}_{1}^{\sf T}{\bf Z}_{y}\\ \vdots \\ {\bf Z}_{Q_{c}}^{\sf T}{\bf Z}_{y} \end{bmatrix} \]

the \((a,b)^{th}\) general entry of \(\Delta^{j}_{s}\) is given by

\[ \delta_{s}^{j}(a,b)= w_{j}\left[\frac{1}{2}\sum_{\ell=1}^{q_{y}}|{\bf r}^{j}_{a \ell}-{\bf r}^{j}_{b \ell}|\right] \]

synthetic categorical data

setup

  • generated \({\bf X}_{c}\) \((1000\times16)\), 8 of which are associated with the response

  • 4 classes, same size

  • low/high level of overlap (association with the response)

  • 25 replicates

  • distance methods: supervised TVD, Entropy-based, Gower (matching-based)

  • evaluation: accuracy

non-lazy KNN for categorical data

non-lazy KNN for mixed?

association-based for mixed

a straightforward way to generalise association-based distances to mixed data is to combine a categorical and a numerical distance matrix

\[{\bf D}=\alpha {\bf D}_{c}+(1-\alpha){\bf D}_{n}\]

  • \({\bf D}_{c}\) is one of the previously defined AB-distances (TVD- or entropy-based)

  • \({\bf D}_{n}\), the numeric counterpart, is the Mahalanobis (or modified Mahalanobis) distance

  • However, no categorical/continuous interaction is taken into account

Aim: define \(\Delta^{int}_{j}\) so that it accounts for the categorical/continuous interactions

  • two alternative approaches are evaluated

How to define \(\delta_{int}(a,b)\), general element of \(\Delta^{int}_{j}\): JS-based

Let \(a\) and \(b\) be two categories of the variable \(j\) and let \(X_{i}\) be continuous

\[ \delta_{int}^{j}(a,b)=\sum_{i=Q_{c}+1}^{Q}w_{ji}\Phi_{JS}^{ji}\left({f}_{a}(X_{i}),{f}_{b}(X_{i})\right) \] where \(f_{a}(X_{i})\) and \(f_{b}(X_{i})\) are the distributions of \(X_{i}\) conditional on \(a\) and \(b\), respectively

The two distributions are compared via the Kullback-Leibler divergence

\[ \Phi^{ji}_{KL}(f_{a}(X_{i}),f_{b}(X_{i}))=\int f_{a}(x)log_{2} \frac{f_{a}(x)}{f_{b}(x)}dx \]

How to define \(\delta_{int}(a,b)\): JS-based

Since \(\Phi^{ji}_{KL}(f_{a}(X_{i}),f_{b}(X_{i}))\neq\Phi^{ji}_{KL}(f_{b}(X_{i}),f_{a}(X_{i}))\), it is rendered symmetric using the Jensen-Shannon distance

\[ \Phi^{ji}_{JS}(f_{a}(X_{i}),f_{b}(X_{i}))=\frac{1}{4}\sqrt{ \Phi^{ji}_{KL}\left(f_{a}(X_{i}), f_{ab}(X_{i})\right)+ \Phi^{ji}_{KL}\left( f_{b}(X_{i}),f_{ab}(X_{i})\right)} \] where \(f_{ab}(X_{i})=\left(f_{a}(X_{i})+f_{b}(X_{i})\right)/2\)

How to define \(\delta_{int}(a,b)\): JS-based

The \((a,b)^{th}\) entry of the \(\Delta^{int}_{j}\) is, therefore,

\[ \delta_{int}^{j}(a,b)=\sum_{i=Q_{c}+1}^{Q}w_{ji}\Phi_{JS}^{ji}\left({f}_{a}(X_{i}),{f}_{b}(X_{i})\right) \]

the weights \(w_{ji}\) are once again based on the mutual information between the \(X_{i}\) (continuous) and \(X_{j}\) (categorical) variables (Ross, 2014)
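
The slides do not specify how the conditional densities \(f_{a}(X_{i})\) and \(f_{b}(X_{i})\) are estimated; the sketch below assumes Gaussian kernel density estimates on a common grid and approximates the KL integrals by Riemann sums, as an illustration rather than the authors' implementation.

```r
# Sketch of the JS-based comparison of f_a(X_i) and f_b(X_i);
# the KDE-on-a-grid estimator is an assumption, not the authors' code.
kl_div <- function(fa, fb, dx) sum(fa * log2(pmax(fa, 1e-12) / pmax(fb, 1e-12))) * dx
phi_js <- function(x_a, x_b, n_grid = 512) {
  rng <- range(c(x_a, x_b))
  da <- density(x_a, from = rng[1], to = rng[2], n = n_grid)
  db <- density(x_b, from = rng[1], to = rng[2], n = n_grid)
  dx <- diff(da$x[1:2])
  fa <- da$y / sum(da$y * dx); fb <- db$y / sum(db$y * dx)  # renormalise on the grid
  fab <- (fa + fb) / 2                                      # mixture density
  0.25 * sqrt(kl_div(fa, fab, dx) + kl_div(fb, fab, dx))    # JS combination as above
}
phi_js(rnorm(200, 0), rnorm(200, 2))   # larger when the two conditionals differ
```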

How to define \(\delta_{int}(a,b)\): NN-based

the categorical/continuous interaction is proportional to the discriminant power of the continuous variables for each category pair \((a,b)\) of the \(j^{th}\) categorical variable, \(j=1,\ldots,Q_{c}\)

it is assessed via nearest neighbors (NN) averaging

if \(x_{\ell j}=a\), the proportion of nearest neighbors of observation \(\ell\) labeled \(a\) is

\[ {\hat\pi}_{a\ell}=\frac{1}{n^{j}_{a}\pi_{nn}} \sum_{m\in \mathcal{N}^{a}_{\ell}}I(x_{m j}=a) \]

if \(x_{\ell j}=b\), the proportion of nearest neighbors of observation \(\ell\) labeled \(b\) is

\[ {\hat\pi}_{b\ell}=\frac{1}{n^{j}_{b}\pi_{nn}} \sum_{m\in \mathcal{N}^{b}_{\ell}}I(x_{m j}=b) \]

  • \(n^{j}_{a}\) and \(n^{j}_{b}\) are absolute frequencies of categories \(a\) and \(b\)

  • \(\pi_{nn}\) is the user-defined proportion of nearest neighbors

  • \(\mathcal{N}^{a}_{\ell}\) (\(\mathcal{N}^{b}_{\ell}\)) is the set of nearest neighbors of the \(\ell^{th}\) observation when \(x_{\ell j}=a\) (\(x_{\ell j}=b\))

How to define \(\delta_{int}(a,b)\): NN-based

We consider the improvement over chance obtained when using the continuous variables to correctly classify the observations:

category \(a\)

\[ \delta^{j}_{int}(a)=\left[\frac{1}{n_{a}^{j}}\sum_{\ell=1}^{n_{a}^{j}} I(\hat{\pi}_{a\ell}\geq .5)\right]-.5 \]

category \(b\)

\[ \delta^{j}_{int}(b)=\left[\frac{1}{n_{b}^{j}}\sum_{\ell=1}^{n_{b}^{j}} I(\hat{\pi}_{b\ell}\geq .5)\right]-.5 \]

finally, the \((a,b)^{th}\) entry of the \(\Delta_{j_{int}}\) is given by

\[ \delta^{j}_{int}(a,b) = \delta^{j}_{int}(a) + \delta^{j}_{int}(b). \]
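
A sketch of the NN-based ingredient for one category follows; the neighbor pool (all other observations) and the distance on the continuous variables (Euclidean after scaling) are assumptions, since the slides leave them unspecified.

```r
# Sketch of delta^j_int for one category (neighbor pool and metric are assumptions)
delta_int_cat <- function(x_cat, X_con, cat, pi_nn = 0.1) {
  D   <- as.matrix(dist(scale(X_con)))         # Euclidean on scaled continuous vars
  idx <- which(x_cat == cat)                   # observations in the category
  k   <- max(1, round(pi_nn * length(idx)))    # neighborhood size n_a^j * pi_nn
  hit <- sapply(idx, function(l) {
    nn <- setdiff(order(D[l, ]), l)[1:k]       # k nearest neighbors (excluding l)
    mean(x_cat[nn] == cat) >= 0.5              # I(hat_pi_{a,l} >= .5)
  })
  mean(hit) - 0.5                              # improvement over chance
}
# delta^j_int(a, b) would then be
# delta_int_cat(x_j, X_con, "a") + delta_int_cat(x_j, X_con, "b")
```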

continuous variables in the TVD computation: NN-based

KNN learning: synthetic mixed data

setup

  • \({\bf X}=\left[{\bf X}_{cat},{\bf X}_{con}\right]\)

  • 4 classes, same size

  • low/high level of overlap (association with the response)

  • 25 replicates

  • distance methods:

    • association_based: Mahalanobis, supervised TVD, NN-based interaction

    • gudmm: modified Mahalanobis, entropy-based, JS-based

    • gower

  • evaluation: accuracy

KNN learning: synthetic mixed data

no gains from interaction, but this is expected: the two blocks of variables were generated independently

an R package to compute distances: anydist?

an R package to compute distances: manydist!

the manydist package: main functions

ndist: computing distances for numerical variables

Arguments

  • x: tibble/df with numeric training observations

  • validate_x: (optional) tibble/df with numeric test observations

  • commensurable : T/F argument

  • method : c("manhattan","euclidean")

  • scaling : c("none","std","range","robust","pc_scores")

  • sig : (optional) a "middle" matrix for association-based distances (e.g. if sig = cov(x) and method = "euclidean", you get the Mahalanobis distance)

Value

a nrow(x) by nrow(x) distance matrix if validate_x=NULL; a nrow(validate_x) by nrow(x) distance matrix otherwise
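
A hypothetical ndist call, using only the arguments listed above (the package is not yet released, so names and defaults may still change); the toy data frame is a placeholder.

```r
# Hypothetical usage, based on the argument list above (subject to change)
library(manydist)                                            # not yet on CRAN
X_num <- data.frame(height = rnorm(10, 180, 10), weight = rnorm(10, 75, 8))
d_all  <- ndist(x = X_num, commensurable = TRUE,
                method = "manhattan", scaling = "std")       # 10 x 10 matrix
d_maha <- ndist(x = X_num, method = "euclidean",
                sig = cov(X_num))                            # Mahalanobis, as noted above
```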

cdist: computing distances for categorical variables

mdist: computing distances for mixed variables

the manydist package: main functions

ndist: computing distances for numerical variables

cdist: computing distances for categorical variables

Arguments

  • x: tibble/df with categorical training observations

  • validate_x: (optional) tibble/df with categorical test observations

  • commensurable : T/F argument

  • method : several independence- and association-based methods implemented. A string vector of method names can be supplied for by-variable specification

Value

a nrow(x) by nrow(x) distance matrix if validate_x=NULL; a nrow(validate_x) by nrow(x) distance matrix otherwise
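
A hypothetical cdist call with a per-variable method vector; the specific method labels are assumptions about the implemented names, and the toy data frame is a placeholder.

```r
# Hypothetical usage: one method per categorical variable (labels are assumed)
X_cat <- data.frame(pos  = factor(sample(c("GK", "DF", "MF", "FW"), 30, TRUE)),
                    foot = factor(sample(c("left", "right"), 30, TRUE)))
d_cat <- cdist(x = X_cat, commensurable = TRUE,
               method = c("tot_var_dist", "matching"))   # placeholder method names
```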

mdist: computing distances for mixed variables

the manydist package: main functions

ndist: computing distances for numerical variables

cdist: computing distances for categorical variables

mdist: computing distances for mixed variables

Arguments

wrapper function combining ndist and cdist

  • x: tibble/df with mixed training observations

  • validate_x: (optional) tibble/df with mixed test observations

  • commensurable : T/F argument

  • distance_cont and distance_cat : equivalent to the method argument in ndist and cdist, respectively

  • interaction : T/F argument; if TRUE, the NN-based interaction is used

Value

a nrow(x) by nrow(x) distance matrix if validate_x=NULL; a nrow(validate_x) by nrow(x) distance matrix otherwise
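
Putting it together, a hypothetical mdist call for the non-lazy KNN workflow; the method labels are assumptions and the toy mixed data frame is a placeholder.

```r
# Hypothetical usage of the wrapper (argument names as listed above)
X_mix <- data.frame(height = rnorm(10, 180, 10),
                    pos    = factor(sample(c("DF", "MF", "FW"), 10, TRUE)))
D <- mdist(x = X_mix[1:7, ], validate_x = X_mix[8:10, ],
           commensurable = TRUE,
           distance_cont = "manhattan", distance_cat = "tot_var_dist",
           interaction = TRUE)    # TRUE adds the NN-based cat/cont interaction
dim(D)                            # 3 x 7: test-by-train, ready for (non-lazy) KNN
```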

Final considerations and future work

  • the idea of an unbiased distance is that variable types, scales or measurement levels should not trivially impact the distance

    • not mandatory, but desirable, especially in unsupervised settings
  • association-based measures aim to go beyond match/mismatch of categories

    • in supervised settings, AB distances allow the response to be taken into account in the pair-wise computations

    • non lazy KNN

  • NN-based interactions are computationally demanding (though this can be made bearable)

    • measuring cont/cat interactions via NN is suitable for non-convex/oddly shaped classes
  • extend discriminant adaptive nearest neighbor classification (Hastie and Tibshirani, 1995) to categorical and mixed data
  • finalize and release the manydist package (GitHub first, then CRAN)

main references

Hastie, T. and R. Tibshirani (1995). “Discriminant adaptive nearest neighbor classification and regression”. In: Advances in neural information processing systems 8.

Le, S. Q. and T. B. Ho (2005). “An association-based dissimilarity measure for categorical data”. In: Pattern Recognition Letters 26.16, pp. 2549-2557.

Mousavi, E. and M. Sehhati (2023). “A Generalized Multi-Aspect Distance Metric for Mixed-Type Data Clustering”. In: Pattern Recognition, p. 109353.

Ross, B. C. (2014). “Mutual information between discrete and continuous data sets”. In: PloS one 9.2, p. e87357.

Velden, M. van de, A. Iodice D’Enza, A. Markos, et al. (2024). “A general framework for implementing distances for categorical variables”. In: Pattern Recognition 153, p. 110547.

Velden, M. van de, A. Iodice D’Enza, A. Markos, et al. (2025). “Unbiased mixed variables distance”. In: arXiv preprint arXiv:2411.00429, under review at JCGS.

Velden, M. van de, A. Iodice D’Enza, and F. Palumbo (2017). “Cluster correspondence analysis”. In: Psychometrika 82.1, pp. 158-185.