outline

distance-based learning in the mixed-data case

  • variable-specific biases

the \(\Delta\) framework

  • (supervised) association-based (AB) distances

non-lazy KNN for mixed data

  • categorical/numerical interaction?

what’s next

  • advertising R package (almost there!)

(un)biased distances for mixed data

distance-based learning

unsupervised

  • clustering

    • K-means, partitioning around medoids

    • spectral clustering, DB-scan

  • dimensionality reduction

    • multidimensional scaling
    • t-SNE

supervised

  • nearest neighbors averaging

  • support vector machines with radial basis functions

intuition

2 continuous variables: add up by-variable (absolute value or squared) differences

intuition

2 continuous and 1 categorical variables

one might consider purple and blue closer than e.g. purple and yellow

desirable properties

Multivariate Additivity

Let \(\mathbf{x}_i=\left(x_{i1}, \dots, x_{iQ}\right)\) denote a \(Q-\)dimensional vector. A distance function \(d\left(\mathbf{x}_i,\mathbf{x}_\ell\right)\) between observations \(i\) and \(\ell\) is multivariate additive if

\[ d\left(\mathbf{x}_i,\mathbf{x}_\ell\right)=\sum_{j=1}^{Q} d_j\left(\mathbf{x}_i,\mathbf{x}_\ell\right), \]

where \(d_j\left(\mathbf{x}_i,\mathbf{x}_\ell\right)\) denotes the \(j-\)th variable specific distance.

  • Manhattan distance satisfies the additivity property; the Euclidean distance does not

desirable properties

If additivity holds, by-variable distances are added together: they should be on equivalent scales

Commensurability

Let \({\boldsymbol X}_i =\left(X_{i1}, \dots, X_{iQ}\right)\) denote a \(Q-\)dimensional random variable corresponding to an observation \(i\). Furthermore, let \(d_{j}\) denote the distance function corresponding to the \(j-\)th variable. We have commensurability if, for all \(j\), and \(i \neq \ell\),

\[ E[d_{j}({ X}_{ij}, {X}_{\ell j})] = c, \]

where \(c\) is some constant.

desirable properties

If the multivariate distance function \(d(\cdot,\cdot)\) satisfies additivity and commensurability, then ad hoc distance functions can be used for each variable and then aggregated.

 

then

one can pick the appropriate \(d_{j}(\cdot,\cdot)\), given the nature of \(X_{j}\)

well suited in the mixed data case

mixed-data setup

a mixed data set

  • \(I\) observations described by \(Q\) variables, \(Q_{n}\) numerical and \(Q_{c}\) categorical

  • the \(I\times Q\) data matrix \({\bf X}=\left[{\bf X}_{n},{\bf X}_{c}\right]\) is column-wise partitioned

A formulation for mixed distance between observations \(i\) and \(\ell\):

\[\begin{eqnarray}\label{genmixeddist_formula} d\left(\mathbf{x}_i,\mathbf{x}_\ell\right)&=& \sum_{j_n=1}^{Q_n} d_{j_n}\left(\mathbf{x}^n_i,\mathbf{x}^n_\ell\right)+ \sum_{j_c=1}^{Q_c} d_{j_c}\left(\mathbf{x}^c_i,\mathbf{x}^c_\ell\right)\\ &=& \sum_{j_n=1}^{Q_n} w_{j_n} \delta^n_{j_n}\left(\mathbf{x}^n_i,\mathbf{x}^n_\ell\right)+ \sum_{j_c=1}^{Q_c} w_{j_c}\delta^c_{j_c}\left(\mathbf{x}^c_i,\mathbf{x}^c_\ell\right) \end{eqnarray}\]

numeric case

  • \(\delta^n_{j_n}\) is a function quantifying the dissimilarity between observations on the \(j_n-\)th numerical variable

  • \(w_{j_n}\) is a weight for the \(j_n-\)th variable.

categorical case

  • \(\delta^c_{j_c}\) quantifies the dissimilarity between the categories chosen by subjects \(i\) and \(\ell\) for categorical variable \(j_c\)

  • \(w_{j_c}\) is a weight for the \(j_c-\)th variable

distributions, scaling and bias: the numeric case

synthetic data

  • \(I=500\) observations from normal, uniform, skewed and bimodal distributions

  • skewed refers to a \(\chi^2_{1/2}\) distribution

  • bimodal: half of the draws from \(\chi^2_{1/2}\) (censored at \(10\)), the other half from \(10-\chi^2_{1/2}\) (censored at \(0\))

  • as long as the variables have the same underlying distribution and scaling, commensurability holds

  • standard deviation scaling is the least affected by the variables' distributions

  • the contribution of a variable to the overall distance may be biased
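
To make the bias concrete, here is a minimal sketch (not the original simulation code) that mimics the setup above: it draws the four variables, applies standard-deviation scaling, and compares the mean by-variable Manhattan distances, i.e. the empirical \(E[d_{j}]\) appearing in the commensurability definition.

```r
# Sketch of the commensurability check (not the original simulation code):
# mean pairwise L1 distance per variable, with and without sd scaling.
set.seed(1)
I <- 500
x <- data.frame(
  normal  = rnorm(I),
  uniform = runif(I),
  skewed  = rchisq(I, df = 1/2),
  bimodal = c(pmin(rchisq(I / 2, df = 1/2), 10),
              pmax(10 - rchisq(I / 2, df = 1/2), 0))
)
mean_l1 <- function(v) mean(as.numeric(dist(v, method = "manhattan")))
sapply(x, mean_l1)                          # very different: contributions are biased
sapply(x, function(v) mean_l1(v / sd(v)))   # closer, but still not identical, under sd scaling
```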

distributions, scaling and bias: the categorical case

the general (delta) framework

Let \({\bf Z}=\left[{\bf Z}_{1},{\bf Z}_{2},\ldots,{\bf Z}_{Q_c}\right]\) be the one-hot encoding of \({\bf X}_{c}\)

The pair-wise distances between categorical observations are given by

\[{\bf D}_{c}={\bf Z}{\bf \Delta}{\bf Z}^{\sf T}= \left[\begin{array}{ccc} {\bf Z}_{1} & \dots & {\bf Z}_{Q_{c}} \end{array} \right]\left[\begin{array}{ccc} {\bf\Delta}_1 & & \\ & \ddots &\\ & & {\bf\Delta}_{Q_{c}} \end{array} \right] \left[ \begin{array}{c} {\bf Z}_{1}^{\sf T}\\ \vdots \\ {\bf Z}_{Q_{c}}^{\sf T} \end{array} \right]\]

  • the definition of \({\bf \Delta}\) determines the distance in use

  • if \(\Delta_{j}\)’s are diagonal, then \({\bf D}_{c}\) is independence-based

  • if \(\Delta_{j}\)’s have non-zero off-diagonal terms, then \({\bf D}_{c}\) is association-based
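
As a minimal illustration of the framework, the sketch below builds \({\bf Z}\) for a single categorical variable and uses the simplest choice of \({\bf\Delta}\): zeros on the diagonal and ones elsewhere (the matching distance introduced next). The product \({\bf Z}{\bf\Delta}{\bf Z}^{\sf T}\) then reproduces the 0/1 mismatch matrix.

```r
# One categorical variable: with Delta having 0 on the diagonal and 1 elsewhere
# (matching distance), Z Delta Z' returns the pairwise mismatch matrix.
x <- factor(c("a", "b", "a", "c"))
Z <- model.matrix(~ x - 1)             # one-hot encoding, I x q
q <- nlevels(x)
Delta <- matrix(1, q, q) - diag(q)     # all category pairs equally dissimilar
D_c <- Z %*% Delta %*% t(Z)            # 0 if same category, 1 otherwise
D_c
```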

distributions, scaling and bias: the categorical case

independence-based pairwise distance

No inter-variable relations are considered

  • in the continuous case: Euclidean or Manhattan distances

  • in the categorical case: Hamming (matching) distance (among MANY others)

  • in the mixed data case: Gower index

association-based pairwise distance

The rationale is that not all the observed differences weigh the same:

  • differences in line with the inter-variable association/correlation are downweighted

distributions, scaling and bias: the categorical case

flat frequency distribution

| Distance | Cat. dissimilarity | \(E[d(X_i, X_{\ell})]\) | \(q=2\) | \(q=5\) |
|---|---|---|---|---|
| Matching | \(\boldsymbol{\Delta}_m = \mathbf{1} \mathbf{1}^{\top} - \mathbf{I}\) | \(\frac{q-1}{q}\) | 0.5 | 0.8 |
| Eskin | \(\boldsymbol{\Delta}_e = \frac{2}{q^2}\boldsymbol{\Delta}_m\) | \(\frac{2(q-1)}{q^3}\) | 0.250 | 0.064 |
| Occurrence frequency (OF) | \(\boldsymbol{\Delta}_{OF} = \log^2(q)\boldsymbol{\Delta}_m\) | \(\log^2(q)\frac{q-1}{q}\) | 0.240 | 2.072 |
| Inverse OF | \(\boldsymbol{\Delta}_{IOF} = \log^2\left(\frac{I}{q}\right) \boldsymbol{\Delta}_m\) | \(\log^2\left(\frac{I}{q}\right)\frac{q-1}{q}\) | 9.601 | 9.610 |

skewed frequency distribution

  • \(q\in \{2,3,5,10\}\)
  • \(p_1 \in \{0.05,0.1,0.2,0.33, 0.5,0.66, 0.8,0.9,0.95\}\)
  • \(p_j = (1-p_1)/(q-1)\), with \(j=2,\dots,q\)

The expected distance increases with the heterogeneity of the distribution and with the number of categories
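
For the matching distance this has a closed form: two independent draws differ with probability \(1-\sum_{j} p_{j}^{2}\). A short sketch over the grid above (an illustration, not the original code) reproduces the described pattern.

```r
# Expected matching distance: P(two independent draws differ) = 1 - sum(p^2).
exp_match <- function(q, p1) {
  p <- c(p1, rep((1 - p1) / (q - 1), q - 1))
  1 - sum(p^2)
}
grid <- expand.grid(q  = c(2, 3, 5, 10),
                    p1 = c(0.05, 0.1, 0.2, 0.33, 0.5, 0.66, 0.8, 0.9, 0.95))
grid$E_d <- mapply(exp_match, grid$q, grid$p1)
head(grid)
# E_d is largest for flat distributions (p1 = 1/q) and grows with q,
# matching the (q-1)/q column of the table above.
```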

distributions, scaling and bias: the categorical case

flat frequency distribution

| Distance | Cat. dissimilarity | \(E[d(X_i, X_{\ell})]\) | \(q=2\) | \(q=5\) |
|---|---|---|---|---|
| No scaling | \(\boldsymbol{\Delta}_{d}=2\boldsymbol{\Delta}_m\) | \(\frac{2(q-1)}{q}\) | 1 | 1.6 |
| Hennig-Liao scaling | \(\boldsymbol{\Delta}_{HL} = \sqrt{\frac{2q}{q-1}}\boldsymbol{\Delta}_m\) | \(\sqrt{\frac{2\left(q-1\right)}{q}}\) | 1 | 1.265 |
| St. dev. scaling | \(\boldsymbol{\Delta}_{s}=2\sqrt{\frac{q}{q-1}}\boldsymbol{\Delta}_m\) | \(2\sqrt{\frac{(q-1)}{q}}\) | 1.414 | 1.789 |
| Cat. dissim. scaling | \(\boldsymbol{\Delta}_{cds}=\frac{q}{q-1}\boldsymbol{\Delta}_m\) | \(1\) | 1 | 1 |

skewed frequency distribution

  • \(q\in \{2,3,5,10\}\)
  • \(p_1 \in \{0.05,0.1,0.2,0.33, 0.5,0.66, 0.8,0.9,0.95\}\)
  • \(p_j = (1-p_1)/(q-1)\), with \(j=2,\dots,q\)

while the pattern is similar, different scalings can smooth out the effects of heterogeneity and of the number of categories

distributions, scaling and bias: the categorical case

flat frequency distribution

| Distance | Cat. dissimilarity | \(E[d(X_i, X_{\ell})]\) | \(q=2\) | \(q=5\) |
|---|---|---|---|---|
| Total variation | \(\boldsymbol{\Delta}_{tvd} = \boldsymbol{\Delta}_m\) | \(\frac{q-1}{q}\) | 0.5 | 0.8 |
| Kullback-Leibler (Le & Ho) | \(\boldsymbol{\Delta}_{KL} = \kappa\boldsymbol{\Delta}_m\) | \(\kappa\frac{q-1}{q}\) | 8.305 | 13.288 |

where \(\kappa=5\log_{2}(10)\)

skewed frequency distribution

  • \(q\in \{2,3,5,10\}\)
  • \(p_1 \in \{0.05,0.1,0.2,0.33, 0.5,0.66, 0.8,0.9,0.95\}\)
  • \(p_j = (1-p_1)/(q-1)\), with \(j=2,\dots,q\)
  • distances are computed based on the association with a target variable having the same marginal distribution as the considered variable

  • the magnitude of the distances differs across methods, yet the patterns are the same

variable importance

data generating process

  • \(I=100\) observations and \(Q=6\) variables

  • \(\bf Y\) is an \(I\times 2\) orthogonal basis constructed from \(2I\) values drawn from \(U(-2,2)\)

  • \(\bf N\) is a \(2\times Q\) random matrix with \(2Q\) values drawn from \(U(-2,2)\)

  • \({\bf X}_{o}={\bf Y}{\bf N}\) is the \(Q\)-dimensional observed matrix with low-dimensional configuration \(\bf Y\)

  • Gaussian noise added (\(\sigma=0.03\), half the standard deviation of the generated data)

  • variables 1 and 2 are numerical; variables 3 to 6 are rendered categorical, with \(\{2,3,5,9\}\) categories, respectively
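
A compact sketch of this data generating process is given below; the orthogonalisation (via QR) and the discretisation rule (equal-width binning with cut) are assumptions, since the slides do not specify them.

```r
# Sketch of the DGP (QR orthogonalisation and equal-width binning are assumptions)
set.seed(1)
I <- 100; Q <- 6
Y <- qr.Q(qr(matrix(runif(2 * I, -2, 2), I, 2)))      # I x 2 orthogonal basis
N <- matrix(runif(2 * Q, -2, 2), 2, Q)                # 2 x Q random loadings
X <- Y %*% N + matrix(rnorm(I * Q, sd = 0.03), I, Q)  # observed data + Gaussian noise
X_df <- as.data.frame(X)
n_cat <- c(2, 3, 5, 9)
for (j in 3:6)                                        # variables 3..6 become categorical
  X_df[[j]] <- factor(cut(X[, j], breaks = n_cat[j - 2], labels = FALSE))
str(X_df)                                             # 2 numeric + 4 categorical variables
```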

variants

  • Num: all numeric data (Manhattan) distance

  • Naive: Euclidean distance on numeric and one-hot encoded categorical variables

  • HL: Euclidean distance on (standardized) numeric variables and one-hot encoded categorical variables with the Hennig-Liao scaling factor

  • HLa: same as HL, but Manhattan distance is used instead (additive)

  • G: Gower, range-normalized numerical and simple matching

  • Uind: commensurable distance using simple matching for the categorical variables

  • Ustd: commensurable distance using category dissimilarity scaling for the categorical variables

  • Udep: commensurable association-based mixed distance using PCA scaling of the numerical, and total variation distance for the categorical

variable importance

Leave-one-variable-out: contribution to distance

  • Naive (and, to some extent, Gower) emphasize the categorical variables

  • the Hennig-Liao scaling leads to the opposite effect (emphasis on the numeric variables)

  • unbiased distances lead to a relative contribution close to \(1/6 \approx 0.167\)

variable importance

full vs LOO multidimensional scaling (MDS) configuration

  • the effect of the number of categories is reversed here: variables with fewer categories have a larger impact on the MDS configuration

    • few categories \(\longrightarrow\) less room for differentiation among observations

variable importance

Retrieve \(\bf Y\) via MDS: alienation coefficient distribution over 100 instances

  • in each scenario, all categorical variables have the same number of categories (2, 3, 5, or 9)

variable importance

FIFA data: Dutch league

variable importance

FIFA data: Dutch league

  • just like before, the Naive and Gower variants favour the categorical variables

  • in contrast, Hennig-Liao scaling with Euclidean distance over-corrects, making the numerical variables overly dominant

  • the drop in mean values for the last two numerical variables is due to the skewness of the corresponding distributions

variable importance

FIFA data: Dutch league

on the unbiased distances

  • the mean distances per variable are equivalent

  • there is variability in the impact of the variables on MDS

  • commensurability does not mean that the variables play the same role in determining a subsequent solution

more on association-based (AB) distances

the delta matrix: categories dissimilarities

recall \(\bf \Delta\)

The pair-wise distances between categorical observations are given by

\[{\bf D}_{c}={\bf Z}{\bf \Delta}{\bf Z}^{\sf T}= \left[\begin{array}{ccc} {\bf Z}_{1} & \dots & {\bf Z}_{Q_{c}} \end{array} \right]\left[\begin{array}{ccc} {\bf\Delta}_1 & & \\ & \ddots &\\ & & {\bf\Delta}_{Q_{c}} \end{array} \right] \left[ \begin{array}{c} {\bf Z}_{1}^{\sf T}\\ \vdots \\ {\bf Z}_{Q_{c}}^{\sf T} \end{array} \right]=\sum_{j=1}^{Q_{c}}{{\bf Z}_{j}{\bf \Delta}_{j}}{\bf Z}^{\sf T}_{j}\]

in association-based distances, \(\Delta_{j}\) is non-diagonal and its elements depend on the other variables, too

non-diagonal \(\Delta_{j}\)

Let \(a\) and \(b\) be two categories of the categorical variable \(j\), the corresponding \((a,b)^{th}\) entry of \(\Delta_{j}\) is

\[ \delta^{j}(a,b)=\sum_{i\neq j}^{Q_{c}}w_{ji}\Phi^{ji}(\xi^{ji}_{a},\xi^{ji}_{b}) \]

where \(\xi^{ji}_{a}\) and \(\xi^{ji}_{b}\) are defined from

  • the joint (empirical) distributions of the categories of the variable \(i\) with \(a\) and \(b\), respectively

  • the conditional (empirical) distributions of the categories of the variable \(i\) given \(a\) and \(b\), respectively

joint distribution-based \(\Delta_{j}\)’s for association-based distances

the matrix of co-occurrence proportions is

\[ {\bf P} =\frac{1}{I} \begin{bmatrix} {\bf Z}_{1}^{\sf T}{\bf Z}_{1} & {\bf Z}_{1}^{\sf T}{\bf Z}_{2}&\ldots &{\bf Z}_{1}^{\sf T}{\bf Z}_{Q_{c}}\\ \vdots & \vdots &\ddots & \vdots \\ {\bf Z}_{Q_{c}}^{\sf T}{\bf Z}_{1} & {\bf Z}_{Q_{c}}^{\sf T}{\bf Z}_{2}&\ldots &{\bf Z}_{Q_{c}}^{\sf T}{\bf Z}_{Q_{c}} \end{bmatrix} \]

  • let \({\bf p}^{ji}_{a}\) and \({\bf p}^{ji}_{b}\) be rows of \({\bf P}^{ji}\), the off-diagonal block of \(\bf P\) relating variables \(j\) and \(i\)

joint distribution-based \(\Delta_{j}\)’s for association-based distances

entropy-based

setting \({\xi}^{ji}_{a}={\bf p}^{ji}_{a}\) and \({\xi}^{ji}_{b}={\bf p}^{ji}_{b}\), the general formula for the \(ab^{th}\) entry of \(\Delta_{j}\)

\[ \delta^{j}(a,b)=\sum_{i\neq j}^{Q_{c}}w_{ji}\Phi^{ji}({\bf p}^{ji}_{a},{\bf p}^{ji}_{b}) \]

by defining \(\Phi^{ji}({\bf p}^{ji}_{a},{\bf p}^{ji}_{b})\) in terms of normalized entropy the above becomes

\[ \delta^{j}(a,b)=\sum_{i\neq j}^{Q_{c}}w_{ji}\left[\frac{\sum_{\ell=1}^{q_{i}}\left({\bf p}^{ji}_{a\ell}+{\bf p}^{ji}_{b\ell}\right)\log_{2}\left({\bf p}^{ji}_{a\ell}+{\bf p}^{ji}_{b\ell}\right)}{\log_{2}(q_{i})} \right] \]

the weights \(w_{ji}\) are based on the mutual information between the variables \(j\) and \(i\)

\[ w_{ji}= \sum_{\upsilon=1}^{q_{j}}\sum_{\ell=1}^{q_{i}} {\bf p}^{ji}_{\upsilon \ell}\log_{2}\left(\frac{{\bf p}^{ji}_{\upsilon \ell}}{{\bf p}^{ji}_{\upsilon.}{\bf p}^{ji}_{.\ell}}\right) \]

where \({\bf p}^{ji}_{\upsilon.}\) and \({\bf p}^{ji}_{.\ell}\) indicate the \(\upsilon^{th}\) row margin and the \(\ell^{th}\) column margin of \({\bf P}^{ji}\), respectively
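
The mutual-information weight has a direct empirical counterpart; the sketch below (a plain translation of the formula above, with \(0\log 0\) set to \(0\), not the package code) computes \(w_{ji}\) from two observed categorical variables.

```r
# Mutual information weight w_ji from the co-occurrence proportions of
# variables j and i (empty cells contribute 0).
mi_weight <- function(xj, xi) {
  P  <- table(xj, xi) / length(xj)           # joint proportions p^{ji}
  pr <- rowSums(P); pc <- colSums(P)         # row and column margins
  terms <- P * log2(P / outer(pr, pc))       # p * log2(p / (p_row * p_col))
  sum(terms[P > 0])                          # drop empty cells
}
xj <- factor(sample(letters[1:3], 200, replace = TRUE))
xi <- factor(ifelse(xj == "a", sample(c("u", "v"), 200, TRUE, c(.8, .2)),
                               sample(c("u", "v"), 200, TRUE, c(.3, .7))))
mi_weight(xj, xi)   # larger when variables j and i are more strongly associated
```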

conditional distribution-based \(\Delta_{j}\)’s for association-based distances

\({\bf R} = {\bf P}_{d}^{-1}\left({\bf P}-{\bf P}_{d}\right)\), with \({\bf P}_{d}=diag({\bf P})\), is a block matrix such that

  • the general off-diagonal block is \({\bf R}_{ji}\) ( \(q_{j}\times q_{i}\) )

  • the \(a^{th}\) row of \({\bf R}_{ji}\), \({\bf r}^{ji}_{a}\), is the conditional distribution of the \(i^{th}\) variable, given the \(a^{th}\) category of the \(j^{th}\) variable

conditional distribution-based \(\Delta_{j}\)’s for association-based distances

total variation distance (TVD)

setting \({\xi}^{ji}_{a}={\bf r}^{ji}_{a}\) and \({\xi}^{ji}_{b}={\bf r}^{ji}_{b}\), the general formula for the \(ab^{th}\) entry of \(\Delta_{j}\)

\[ \delta^{j}(a,b)=\sum_{i\neq j}^{Q_{c}}w_{ji}\Phi^{ji}({\bf r}^{ji}_{a},{\bf r}^{ji}_{b}) \]

by defining \(\Phi^{ji}({\bf r}^{ji}_{a},{\bf r}^{ji}_{b})\) in terms of L1 distance the above becomes

\[\Phi^{ji}({\bf r}^{ji}_{a},{\bf r}^{ji}_{b})=\frac{1}{2}\sum_{\ell=1}^{q_{i}}|{\bf r}^{ji}_{a \ell}-{\bf r}^{ji}_{b \ell}|\] which corresponds to the total variation distance (TVD)

the weights can be \(w_{ji}=1/(Q_{c}-1)\), or suitably defined to achieve commensurability
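
A minimal sketch of one term of the sum: the conditional profiles \({\bf r}^{ji}_{a}\) and \({\bf r}^{ji}_{b}\) are rows of the row-normalised cross-tabulation of variables \(j\) and \(i\), and \(\Phi^{ji}\) is half their \(L_1\) distance.

```r
# TVD-based dissimilarity between categories a and b of variable j,
# measured through their conditional profiles on another variable i.
tvd_delta <- function(xj, xi, a, b) {
  R <- prop.table(table(xj, xi), margin = 1)   # rows: P(X_i = . | X_j = category)
  0.5 * sum(abs(R[a, ] - R[b, ]))              # total variation distance
}
xj <- factor(sample(c("a", "b", "c"), 300, replace = TRUE))
xi <- factor(ifelse(xj == "a", sample(c("u", "v"), 300, TRUE, c(.9, .1)),
                               sample(c("u", "v"), 300, TRUE, c(.4, .6))))
tvd_delta(xj, xi, "a", "b")   # close to 0 when a and b behave alike w.r.t. X_i
# the full delta^j(a,b) averages such terms over all i != j,
# e.g. with weights w_ji = 1/(Q_c - 1)
```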

supervised AB-distance

supervised TVD

  • the class labels are categories of a further variable \(y\) (the response)

  • a supervised variant of the AB-distance can be defined that takes into account the association between \(y\) and each of the other variables

Let \({\bf Z}_{y}\) be the one-hot encoding of the response; then the matrix \({\bf R}\) becomes

\[ {\bf R}_{s} = {\bf P}_{z}^{-1}\left( {\bf Z}^{\sf T}{\bf Z}_{y} \right)= {\bf P}_{z}^{-1} \begin{bmatrix} {\bf Z}_{1}^{\sf T}{\bf Z}_{y}\\ \vdots \\ {\bf Z}_{Q_{c}}^{\sf T}{\bf Z}_{y} \end{bmatrix} \]

the \((a,b)^{th}\) general entry of \(\Delta^{j}_{s}\) is given by

\[ \delta_{s}^{j}(a,b)= w_{j}\left[\frac{1}{2}\sum_{\ell=1}^{q_{y}}|{\bf r}^{j}_{a \ell}-{\bf r}^{j}_{b \ell}|\right] \]

synthetic categorical data

setup

  • generated \({\bf X}_{c}\) \((1000\times16)\), 8 of which are associated with the response

  • 4 classes, same size

  • low/high level of overlap (association with the response)

  • 25 replicates

  • distance methods: supervised TVD, Entropy-based, Gower (matching-based)

  • evaluation: accuracy

non-lazy KNN for categorical data

non-lazy KNN for mixed?

association-based for mixed

a straightforward way to generalise association-based distances to mixed data is to combine a categorical and a numerical distance matrix

\[{\bf D}=\alpha {\bf D}_{c}+(1-\alpha){\bf D}_{n}\]

  • \({\bf D}_{c}\) is one of the previously defined AB-distances (TVD- or entropy-based)

  • \({\bf D}_{n}\), the numeric counterpart, is the Mahalanobis (or modified Mahalanobis) distance

  • However, no categorical/continuous interaction is taken into account

Aim: define \(\Delta^{int}_{j}\) so that it accounts for the categorical/continuous interactions

  • two alternative approaches are evaluated

How to define \(\delta_{int}(a,b)\), general element of \(\Delta^{int}_{j}\): JS-based

Let \(a\) and \(b\) be two categories of the variable \(j\) and let \(X_{i}\) be continuous

\[ \delta_{int}^{j}(a,b)=\sum_{i=Q_{c}+1}^{Q}w_{ji}\Phi_{JS}^{ji}\left({f}_{a}(X_{i}),{f}_{b}(X_{i})\right) \] where \(f_{a}(X_{i})\) and \(f_{b}(X_{i})\) are the distributions of \(X_{i}\) conditional on \(a\) and \(b\), respectively

The two distributions are compared via the Kullback-Leibler divergence

\[ \Phi^{ji}_{KL}(f_{a}(X_{i}),f_{b}(X_{i}))=\int f_{a}(x)log_{2} \frac{f_{a}(x)}{f_{b}(x)}dx \]

How to define \(\delta_{int}(a,b)\): JS-based

Since \(\Phi^{ji}_{KL}(f_{a}(X_{i}),f_{b}(X_{i}))\neq\Phi^{ji}_{KL}(f_{b}(X_{i}),f_{a}(X_{i}))\), it is rendered symmetric using the Jensen-Shannon distance

\[ \Phi^{ji}_{JS}(f_{a}(X_{i}),f_{b}(X_{i}))=\frac{1}{4}\sqrt{ \Phi^{ji}_{KL}\left(f_{a}(X_{i}), f_{ab}(X_{i})\right)+ \Phi^{ji}_{KL}\left( f_{b}(X_{i}),f_{ab}(X_{i})\right)} \] where \(f_{ab}(X_{i})=\left(f_{a}(X_{i})+f_{b}(X_{i})\right)/2\)

How to define \(\delta_{int}(a,b)\): JS-based

The \((a,b)^{th}\) entry of the \(\Delta^{int}_{j}\) is, therefore,

\[ \delta_{int}^{j}(a,b)=\sum_{i=Q_{c}+1}^{Q}w_{ji}\Phi_{JS}^{ji}\left({f}_{a}(X_{i}),{f}_{b}(X_{i})\right) \]

the weights \(w_{ji}\) are once again based on the mutual information between the \(X_{i}\) (continuous) and \(X_{j}\) (categorical) variables (Ross, 2014)
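
The slides do not specify how the conditional densities \(f_{a}(X_{i})\) and \(f_{b}(X_{i})\) are estimated; the sketch below assumes Gaussian kernel density estimates on a common grid and approximates the KL integrals by Riemann sums, as an illustration rather than the authors' implementation.

```r
# Sketch of the JS-based comparison of f_a(X_i) and f_b(X_i);
# the KDE-on-a-grid estimator is an assumption, not the authors' code.
kl_div <- function(fa, fb, dx) sum(fa * log2(pmax(fa, 1e-12) / pmax(fb, 1e-12))) * dx
phi_js <- function(x_a, x_b, n_grid = 512) {
  rng <- range(c(x_a, x_b))
  da <- density(x_a, from = rng[1], to = rng[2], n = n_grid)
  db <- density(x_b, from = rng[1], to = rng[2], n = n_grid)
  dx <- diff(da$x[1:2])
  fa <- da$y / sum(da$y * dx); fb <- db$y / sum(db$y * dx)  # renormalise on the grid
  fab <- (fa + fb) / 2                                      # mixture density
  0.25 * sqrt(kl_div(fa, fab, dx) + kl_div(fb, fab, dx))    # JS combination as above
}
phi_js(rnorm(200, 0), rnorm(200, 2))   # larger when the two conditionals differ
```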

How to define \(\delta_{int}(a,b)\): NN-based

the categorical/continuous interaction is proportional to the discriminant power of the continuous variables for each category pair \((a,b)\) of the \(j^{th}\) categorical variable, \(j=1,\ldots,Q_{c}\)

it is assessed via nearest neighbors (NN) averaging

if \(x_{\ell j}=a\), the proportion of nearest neighbors of observation \(\ell\) labeled \(a\) is

\[ {\hat\pi}_{a\ell}=\frac{1}{n^{j}_{a}\pi_{nn}} \sum_{m\in \mathcal{N}^{a}_{\ell}}I(x_{m j}=a) \]

if \(x_{\ell j}=b\), the proportion of nearest neighbors of observation \(\ell\) labeled \(b\) is

\[ {\hat\pi}_{b\ell}=\frac{1}{n^{j}_{b}\pi_{nn}} \sum_{m\in \mathcal{N}^{b}_{\ell}}I(x_{m j}=b) \]

  • \(n^{j}_{a}\) and \(n^{j}_{b}\) are absolute frequencies of categories \(a\) and \(b\)

  • \(\pi_{nn}\) is the user-defined proportion of nearest neighbors

  • \(\mathcal{N}^{a}_{\ell}\) (\(\mathcal{N}^{b}_{\ell}\)) is the set of nearest neighbors of the \(\ell^{th}\) observation when \(x_{\ell j}=a\) (\(x_{\ell j}=b\))

How to define \(\delta_{int}(a,b)\): NN-based

We consider the improvement over chance obtained when using the continuous variables to correctly classify the observations:

category \(a\)

\[ \delta^{j}_{int}(a)=\left[\frac{1}{n_{a}^{j}}\sum_{\ell=1}^{n_{a}^{j}} I(\hat{\pi}_{a\ell}\geq .5)\right]-.5 \]

category \(b\)

\[ \delta^{j}_{int}(b)=\left[\frac{1}{n_{b}^{j}}\sum_{\ell=1}^{n_{b}^{j}} I(\hat{\pi}_{b\ell}\geq .5)\right]-.5 \]

finally, the \((a,b)^{th}\) entry of the \(\Delta_{j_{int}}\) is given by

\[ \delta^{j}_{int}(a,b) = \delta^{j}_{int}(a) + \delta^{j}_{int}(b). \]
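
A sketch of the NN-based ingredient for one category follows; the neighbor pool (all other observations) and the distance on the continuous variables (Euclidean after scaling) are assumptions, since the slides leave them unspecified.

```r
# Sketch of delta^j_int for one category (neighbor pool and metric are assumptions)
delta_int_cat <- function(x_cat, X_con, cat, pi_nn = 0.1) {
  D   <- as.matrix(dist(scale(X_con)))         # Euclidean on scaled continuous vars
  idx <- which(x_cat == cat)                   # observations in the category
  k   <- max(1, round(pi_nn * length(idx)))    # neighborhood size n_a^j * pi_nn
  hit <- sapply(idx, function(l) {
    nn <- setdiff(order(D[l, ]), l)[1:k]       # k nearest neighbors (excluding l)
    mean(x_cat[nn] == cat) >= 0.5              # I(hat_pi_{a,l} >= .5)
  })
  mean(hit) - 0.5                              # improvement over chance
}
# delta^j_int(a, b) would then be
# delta_int_cat(x_j, X_con, "a") + delta_int_cat(x_j, X_con, "b")
```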

continuous variables in the TVD computation: NN-based

KNN learning: synthetic mixed data

setup

  • \({\bf X}=\left[{\bf X}_{cat},{\bf X}_{con}\right]\)

  • 4 classes, same size

  • low/high level of overlap (association with the response)

  • 25 replicates

  • distance methods:

    • association_based: Mahalanobis, supervised TVD, NN-based interaction

    • gudmm: modified Mahalanobis, entropy-based, JS-based

    • gower

  • evaluation: accuracy

KNN learning: synthetic mixed data

no gains from interaction, but this is expected: the two blocks of variables were generated independently

an R package to compute distances: anydist?

an R package to compute distances: manydist!

the manydist package: main functions

ndist: computing distances for numerical variables

Arguments

  • x: tibble/df with numeric training observations

  • validate_x: (optional) tibble/df with numeric test observations

  • commensurable : T/F argument

  • method : c("manhattan","euclidean")

  • scaling : c("none","std","range","robust","pc_scores")

  • sig : (optional) a "middle" matrix for association-based distances (e.g. if sig = cov(x) and method = "euclidean", you get the Mahalanobis distance)

Value

a nrow(x) by nrow(x) distance matrix if validate_x=NULL; a nrow(validate_x) by nrow(x) distance matrix otherwise
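
A hypothetical ndist call, using only the arguments listed above (the package is not yet released, so names and defaults may still change); the toy data frame is a placeholder.

```r
# Hypothetical usage, based on the argument list above (subject to change)
library(manydist)                                            # not yet on CRAN
X_num <- data.frame(height = rnorm(10, 180, 10), weight = rnorm(10, 75, 8))
d_all  <- ndist(x = X_num, commensurable = TRUE,
                method = "manhattan", scaling = "std")       # 10 x 10 matrix
d_maha <- ndist(x = X_num, method = "euclidean",
                sig = cov(X_num))                            # Mahalanobis, as noted above
```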

cdist: computing distances for categorical variables

mdist: computing distances for mixed variables

the manydist package: main functions

ndist: computing distances for numerical variables

cdist: computing distances for categorical variables

Arguments

  • x: tibble/df with categorical training observations

  • validate_x: (optional) tibble/df with categorical test observations

  • commensurable : T/F argument

  • method : several independence- and association-based methods implemented. A string vector of method names can be supplied for by-variable specification

Value

a nrow(x) by nrow(x) distance matrix if validate_x=NULL; a nrow(validate_x) by nrow(x) distance matrix otherwise
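
A hypothetical cdist call with a per-variable method vector; the specific method labels are assumptions about the implemented names, and the toy data frame is a placeholder.

```r
# Hypothetical usage: one method per categorical variable (labels are assumed)
X_cat <- data.frame(pos  = factor(sample(c("GK", "DF", "MF", "FW"), 30, TRUE)),
                    foot = factor(sample(c("left", "right"), 30, TRUE)))
d_cat <- cdist(x = X_cat, commensurable = TRUE,
               method = c("tot_var_dist", "matching"))   # placeholder method names
```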

mdist: computing distances for mixed variables

the manydist package: main functions

ndist: computing distances for numerical variables

cdist: computing distances for categorical variables

mdist: computing distances for mixed variables

Arguments

wrapper function combining ndist and cdist

  • x: tibble/df with mixed training observations

  • validate_x: (optional) tibble/df with mixed test observations

  • commensurable : T/F argument

  • distance_cont and distance_cat : equivalent to the method argument in ndist and cdist, respectively

  • interaction : T/F argument; if TRUE, the NN-based interaction is used

Value

a nrow(x) by nrow(x) distance matrix if validate_x=NULL; a nrow(validate_x) by nrow(x) distance matrix otherwise
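
Putting it together, a hypothetical mdist call for the non-lazy KNN workflow; the method labels are assumptions and the toy mixed data frame is a placeholder.

```r
# Hypothetical usage of the wrapper (argument names as listed above)
X_mix <- data.frame(height = rnorm(10, 180, 10),
                    pos    = factor(sample(c("DF", "MF", "FW"), 10, TRUE)))
D <- mdist(x = X_mix[1:7, ], validate_x = X_mix[8:10, ],
           commensurable = TRUE,
           distance_cont = "manhattan", distance_cat = "tot_var_dist",
           interaction = TRUE)    # TRUE adds the NN-based cat/cont interaction
dim(D)                            # 3 x 7: test-by-train, ready for (non-lazy) KNN
```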

Final considerations and future work

  • the idea of an unbiased distance is that variable types, scales or measurement levels should not trivially impact the distance

    • not mandatory, but desirable, especially in unsupervised settings
  • association-based measures aim to go beyond match/mismatch of categories

    • in supervised settings, AB distances allow the response to be taken into account in the pair-wise computations

    • non lazy KNN

  • NN-based interactions are computationally demanding (though this can be made bearable)

    • measuring cont/cat interactions via NN is suitable for non-convex/oddly shaped classes
  • extend discriminant adaptive nearest neighbor classification (Hastie and Tibshirani, 1995) to categorical and mixed data
  • finalize and release the manydist package (GitHub first, then CRAN)

main references

Hastie, T. and R. Tibshirani (1995). “Discriminant adaptive nearest neighbor classification and regression”. In: Advances in neural information processing systems 8.

Le, S. Q. and T. B. Ho (2005). “An association-based dissimilarity measure for categorical data”. In: Pattern Recognition Letters 26.16, pp. 2549-2557.

Mousavi, E. and M. Sehhati (2023). “A Generalized Multi-Aspect Distance Metric for Mixed-Type Data Clustering”. In: Pattern Recognition, p. 109353.

Ross, B. C. (2014). “Mutual information between discrete and continuous data sets”. In: PloS one 9.2, p. e87357.

Velden, M. van de, A. Iodice D’Enza, A. Markos, et al. (2024). “A general framework for implementing distances for categorical variables”. In: Pattern Recognition 153, p. 110547.

Velden, M. van de, A. Iodice D’Enza, A. Markos, et al. (2025). “Unbiased mixed variables distance”. In: arXiv preprint arXiv:2411.00429, under review at JCGS.

Velden, M. van de, A. Iodice D’Enza, and F. Palumbo (2017). “Cluster correspondence analysis”. In: Psychometrika 82.1, pp. 158-185.