2025-01-17
outline
distance-based learning in the mixed-data case
the \(\Delta\) framework
non-lazy KNN for mixed data
what’s next
distance-based learning
unsupervised
clustering
K-means, partitioning around medoids
spectral clustering, DBSCAN
dimensionality reduction
supervised
nearest neighbors averaging
support vector machines with radial basis functions
intuition
2 continuous variables: add up by-variable (absolute value or squared) differences
intuition
2 continuous and 1 categorical variable
one might consider purple and blue closer than e.g. purple and yellow
desirable properties
Multivariate Additivity
Let \(\mathbf{x}_i=\left(x_{i1}, \dots, x_{iQ}\right)\) denote a \(Q-\)dimensional vector. A distance function \(d\left(\mathbf{x}_i,\mathbf{x}_\ell\right)\) between observations \(i\) and \(\ell\) is multivariate additive if
\[ d\left(\mathbf{x}_i,\mathbf{x}_\ell\right)=\sum_{j=1}^{Q} d_j\left(\mathbf{x}_i,\mathbf{x}_\ell\right), \]
where \(d_j\left(\mathbf{x}_i,\mathbf{x}_\ell\right)\) denotes the \(j-\)th variable specific distance.
desirable properties
If additivity holds, the by-variable distances are simply added together, so they should be on equivalent scales
Commensurability
Let \({\boldsymbol X}_i =\left(X_{i1}, \dots, X_{iQ}\right)\) denote a \(Q-\)dimensional random variable corresponding to an observation \(i\). Furthermore, let \(d_{j}\) denote the distance function corresponding to the \(j-\)th variable. We have commensurability if, for all \(j\), and \(i \neq \ell\),
\[ E[d_{j}({ X}_{ij}, {X}_{\ell j})] = c, \]
where \(c\) is some constant.
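a quick empirical check of commensurability (an illustrative R sketch, not from the slides): rescale each variable by its own mean pairwise distance so that every \(d_j\) has the same expected value; the toy variables and the Manhattan distance are assumptions

```r
# Illustrative sketch: compare the mean pairwise distance produced by each
# variable, before and after rescaling.
set.seed(1)
x1 <- rnorm(200)           # variable 1
x2 <- runif(200, -2, 2)    # variable 2: different distribution and scale

mean_pairwise <- function(x) mean(as.matrix(dist(x, method = "manhattan")))

c(raw_x1 = mean_pairwise(x1), raw_x2 = mean_pairwise(x2))   # not commensurable
# rescaling each variable by its own mean pairwise distance makes
# E[d_j] the same constant (here exactly 1) for both variables
c(scaled_x1 = mean_pairwise(x1 / mean_pairwise(x1)),
  scaled_x2 = mean_pairwise(x2 / mean_pairwise(x2)))
```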
desirable properties
If the multivariate distance function \(d(\cdot,\cdot)\) satisfies additivity and commensurability, then ad hoc distance functions can be used for each variable and then aggregated.
then
one can pick the appropriate \(d_{j}(\cdot,\cdot)\), given the nature of \(X_{j}\)
well suited to the mixed-data case
mixed-data setup
a mixed data set
\(I\) observations described by \(Q\) variables, \(Q_{n}\) numerical and \(Q_{c}\) categorical
the \(I\times Q\) data matrix \({\bf X}=\left[{\bf X}_{n},{\bf X}_{c}\right]\) is column-wise partitioned
A formulation for mixed distance between observations \(i\) and \(\ell\):
\[\begin{aligned} d\left(\mathbf{x}_i,\mathbf{x}_\ell\right)&= \sum_{j_n=1}^{Q_n} d_{j_n}\left(\mathbf{x}^n_i,\mathbf{x}^n_\ell\right)+ \sum_{j_c=1}^{Q_c} d_{j_c}\left(\mathbf{x}^c_i,\mathbf{x}^c_\ell\right)\\ &= \sum_{j_n=1}^{Q_n} w_{j_n} \delta^n_{j_n}\left(\mathbf{x}^n_i,\mathbf{x}^n_\ell\right)+ \sum_{j_c=1}^{Q_c} w_{j_c}\delta^c_{j_c}\left(\mathbf{x}^c_i,\mathbf{x}^c_\ell\right) \end{aligned}\]
numeric case
\(\delta^n_{j_n}\) is a function quantifying the dissimilarity between observations on the \(j_n-\)th numerical variable
\(w_{j_n}\) is a weight for the \(j_n-\)th variable.
categorical case
\(\delta^c_{j_c}\) quantifies the dissimilarity between the categories chosen by subjects \(i\) and \(\ell\) for categorical variable \(j_c\)
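a minimal R sketch of the additive formulation above, with ad hoc choices that are assumptions rather than the slides' defaults: Manhattan for the numeric variables, simple matching for the categorical one, and unit weights

```r
# Illustrative additive mixed distance: sum of per-variable distance matrices.
mixed_dist <- function(df, w = rep(1, ncol(df))) {
  n <- nrow(df)
  D <- matrix(0, n, n)
  for (j in seq_along(df)) {
    xj <- df[[j]]
    Dj <- if (is.numeric(xj)) {
      as.matrix(dist(xj, method = "manhattan"))             # delta^n_{j_n}
    } else {
      1 * outer(as.character(xj), as.character(xj), "!=")   # simple matching delta^c_{j_c}
    }
    D <- D + w[j] * Dj                                      # additive aggregation
  }
  D
}

df <- data.frame(height = rnorm(6, 170, 10),
                 weight = rnorm(6, 70, 8),
                 colour = factor(sample(c("red", "blue", "yellow"), 6, replace = TRUE)))
round(mixed_dist(df), 2)
```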
distributions, scaling and bias: the numeric case
synthetic data
\(I=500\) observations from normal, uniform, skewed and bimodal distributions
skewed
refers to a \(\chi^2_{1/2}\) distribution
bimodal
we considered \(I/2\) draws from \(\chi^2_{1/2}\) (censored at \(10\)), and \(I/2\) draws from \(10-\chi^2_{1/2}\) (censored at \(0\))
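a sketch of how such variables can be generated in R; the seed and the standard-deviation scaling shown at the end are assumptions

```r
# Generate the four synthetic numeric variables described above.
set.seed(123)
I <- 500
norm_x    <- rnorm(I)
unif_x    <- runif(I)
skewed_x  <- rchisq(I, df = 0.5)                          # chi-square with 1/2 df
bimodal_x <- c(pmin(rchisq(I / 2, df = 0.5), 10),         # I/2 draws, censored at 10
               pmax(10 - rchisq(I / 2, df = 0.5), 0))     # I/2 draws, censored at 0
# standard-deviation scaling, one of the scalings compared in the slides
scaled <- sapply(list(norm_x, unif_x, skewed_x, bimodal_x), function(x) x / sd(x))
```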
as long as the variables have the same underlying distribution and scaling, commensurability holds
standard deviation scaling is the least affected by the variables' distributions
otherwise, the contribution of a variable to the overall distance may be biased
distributions, scaling and bias: the categorical case
the general (delta) framework
Let \({\bf Z}=\left[{\bf Z}_{1},{\bf Z}_{2},\ldots,{\bf Z}_{Q_c}\right]\) be the one-hot encoding of \({\bf X}_{c}\)
The pair-wise distances between categorical observations are given by
\[{\bf D}_{c}={\bf Z}{\bf \Delta}{\bf Z}^{\sf T}= \left[\begin{array}{ccc} {\bf Z}_{1} & \dots & {\bf Z}_{Q_{c}} \end{array} \right]\left[\begin{array}{ccc} {\bf\Delta}_1 & & \\ & \ddots &\\ & & {\bf\Delta}_{Q_{c}} \end{array} \right] \left[ \begin{array}{c} {\bf Z}_{1}^{\sf T}\\ \vdots \\ {\bf Z}_{Q_{c}}^{\sf T} \end{array} \right]\]
the definition of \({\bf \Delta}\) determines the distance in use
if \(\Delta_{j}\)’s are diagonal, then \({\bf D}_{c}\) is independence-based
if \(\Delta_{j}\)’s have non-zero off-diagonal terms, then \({\bf D}_{c}\) is association-based
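a small R sketch of \({\bf D}_{c}={\bf Z}{\bf \Delta}{\bf Z}^{\sf T}\) for a single categorical variable, using the (independence-based) matching dissimilarity; the toy data are an assumption

```r
# One-hot encode a single categorical variable and recover the matching
# (Hamming) distance as Z %*% Delta_m %*% t(Z); with several variables,
# Delta is block-diagonal and the per-variable contributions are summed.
x <- factor(c("a", "b", "a", "c"))
Z <- model.matrix(~ x - 1)                  # one-hot (indicator) coding
q <- nlevels(x)
Delta_m <- matrix(1, q, q) - diag(q)        # 0 on the diagonal, 1 elsewhere
D_c <- Z %*% Delta_m %*% t(Z)               # pairwise matching distances
D_c
```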
distributions, scaling and bias: the categorical case
independence-based pairwise distance
No inter-variable relations are considered
in the continuous case: Euclidean or Manhattan distances
in the categorical case: Hamming (matching) distance (among MANY others)
in the mixed data case: Gower index
association-based pairwise distance
The rationale is that not all observed differences weigh the same: how much a mismatch between two categories counts depends on how those categories relate to the other variables
distributions, scaling and bias: the categorical case
flat frequency distribution
Distance | Cat. dissimilarity | \(E[d(X_i, X_{\ell})]\) | \(q=2\) | \(q=5\) |
---|---|---|---|---|
Matching | \(\boldsymbol{\Delta}_m = \mathbf{1} \mathbf{1}^{\top} - \mathbf{I}\) | \(\frac{q-1}{q}\) | 0.5 | 0.8 |
Eskin | \(\boldsymbol{\Delta}_e = \frac{2}{q^2}\boldsymbol{\Delta}_m\) | \(\frac{2(q-1)}{q^3}\) | 0.250 | 0.064 |
Occurrence frequency (OF) | \(\boldsymbol{\Delta}_{OF} = \log^2(q)\boldsymbol{\Delta}_m\) | \(\log^2(q)\frac{q-1}{q}\) | 0.240 | 2.072 |
Inverse OF | \(\boldsymbol{\Delta}_{IOF} = \log^2\left(\frac{I}{q}\right) \boldsymbol{\Delta}_m\) | \(\log^2\left(\frac{I}{q}\right)\frac{q-1}{q}\) | 9.601 | 9.610 |
skewed frequency distribution
The expected distance increases with the heterogeneity of the distribution and with the number of categories
distributions, scaling and bias: the categorical case
flat frequency distribution
Distance | Cat. dissimilarity | \(E[d(X_i, X_{\ell})]\) | \(q=2\) | \(q=5\) |
---|---|---|---|---|
no scaling | \(\boldsymbol{\Delta}_{d}=2\boldsymbol{\Delta}_m\) | \(\frac{2(q-1)}{q}\) | 1 | 1.6 |
Hennig-Liao scaling | \(\boldsymbol{\Delta}_{HL} = \sqrt{\frac{2q}{q-1}}\boldsymbol{\Delta}_m\) | \(\sqrt{\frac{2\left(q-1\right)}{q}}\) | 1 | 1.265 |
St. dev. scaling | \(\boldsymbol{\Delta}_{s}=2\sqrt{\frac{q}{q-1}}\boldsymbol{\Delta}_m\) | \(2\sqrt{\frac{(q-1)}{q}}\) | 1.414 | 1.789 |
Cat. dissim. scaling | \(\boldsymbol{\Delta}_{cds}=\frac{q}{q-1}\boldsymbol{\Delta}_m\) | \(1\) | 1 | 1 |
skewed frequency distribution
while the pattern is similar, different scalings can smooth out the effects of heterogeneity and number of categories
distributions, scaling and bias: the categorical case
flat frequency distribution
Distance | Cat. dissimilarity | \(E[d(X_i, X_{\ell})]\) | \(q=2\) | \(q=5\) |
---|---|---|---|---|
Total variation distance (TVD) | \(\boldsymbol{\Delta}_{tvd} = \boldsymbol{\Delta}_m\) | \(\frac{q-1}{q}\) | 0.5 | 0.8 |
Kullback-Leibler (Le & Ho) | \(\boldsymbol{\Delta}_{KL} = \kappa\boldsymbol{\Delta}_m\) | \(\kappa\frac{q-1}{q}\) | 8.305 | 13.288 |
where \(\kappa=5\log_{2}(10)\)
skewed frequency distribution
distances are computed based on the association with a target variable that has the same marginal distribution as the considered variable
the magnitude of the distances differs with the method of choice; yet the patterns are the same
data generating process
\(I=100\) observations and \(Q=6\) variables
\(\bf Y\) is an \(I\times 2\) orthogonal basis constructed from \(2I\) values drawn from \(U(-2,2)\)
\(\bf N\) is a \(2\times Q\) random matrix with \(2Q\) values drawn from \(U(-2,2)\)
\({\bf X}_{o}={\bf Y}{\bf N}\) is the \(I\times Q\) observed matrix with low-dimensional configuration \({\bf Y}\)
gaussian noise added (\(\sigma=0.03\), half the standard deviation of the generated data)
the first two variables are numerical; variables \(3\) to \(6\) are rendered categorical, with \(2, 3, 5\) and \(9\) categories, respectively.
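a sketch of this data generating process in R; the orthogonalisation step (QR) and the discretisation rule (quantile cuts) are assumptions, as they are not specified above

```r
# Low-rank numeric data, Gaussian noise, then discretise variables 3-6.
set.seed(42)
I <- 100; Q <- 6
Y  <- qr.Q(qr(matrix(runif(2 * I, -2, 2), I, 2)))      # I x 2 orthogonal basis
N  <- matrix(runif(2 * Q, -2, 2), 2, Q)                # 2 x Q random loadings
Xo <- Y %*% N + matrix(rnorm(I * Q, sd = 0.03), I, Q)  # observed matrix + noise

n_cats <- c(2, 3, 5, 9)
X <- as.data.frame(Xo)
for (k in seq_along(n_cats)) {                         # variables 3..6 -> categorical
  j <- k + 2
  brk <- quantile(Xo[, j], probs = seq(0, 1, length.out = n_cats[k] + 1))
  X[[j]] <- cut(Xo[, j], breaks = brk, include.lowest = TRUE)
}
```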
variants
Num
: Manhattan distance on the original all-numeric data
Naive
: Euclidean distance on numeric and one-hot encoded categorical variables
HL
: Euclidean distance on (standardized) numeric variables and one-hot encoded categorical variables with the scaling factor proposed by Hennig-Liao
HLa
: same as HL, but Manhattan distance is used instead (additive)
G
: Gower, range-normalized numerical and simple matching
Uind
: commensurable distance using simple matching for the categorical variables
Ustd
: commensurable distance using category dissimilarity scaling for the categorical variables
Udep
: commensurable association-based mixed distance using PCA scaling for the numerical variables, and total variation distance for the categorical ones
variable importance
Leave-one-variable-out: contribution to distance
Naive (and, to some extent, Gower) emphasize the categorical variables
the Hennig-Liao scaling leads to the opposite effect (emphasis on the numeric variables)
unbiased distances lead to a relative contribution close to \(1/6\approx 0.167\)
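one possible way to operationalise the leave-one-variable-out contribution for an additive distance (a hedged sketch reusing the illustrative mixed_dist() above, not necessarily the exact computation behind the slides)

```r
# For an additive distance, dropping variable j removes exactly D_j, so the
# relative contribution of variable j can be read off as mean(D_j)/mean(D_full);
# unbiased distances should give values close to 1/Q.
loo_contribution <- function(df) {
  D_full <- mixed_dist(df)                   # illustrative mixed_dist() from earlier
  sapply(seq_along(df), function(j) {
    D_loo <- mixed_dist(df[, -j, drop = FALSE])
    mean(D_full - D_loo) / mean(D_full)      # share of total distance due to variable j
  })
}
```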
variable importance
full vs LOO multidimensional scaling (MDS) configuration
the effect of the number of categories is reversed here: variables with fewer categories have a larger impact on the MDS configuration
variable importance
Retrieve \(\bf Y\) via MDS: alienation coefficient distribution over 100 instances
variable importance
FIFA data: Dutch league
variable importance
FIFA data: Dutch league
just like before, the Naive and Gower variants favour the categorical variables
in contrast, Hennig-Liao scaling with Euclidean distance over-corrects, making numerical variables overly dominant
the drop in mean values for the last two numerical variables is due to the skewness of the corresponding distributions
variable importance
FIFA data: Dutch league
for the unbiased distances
the mean distances per variable are equivalent
there is variability in the impact of the variables on MDS
commensurability does not mean that the variables play the same role in determining a subsequent solution
the delta matrix: categories dissimilarities
recall \(\bf \Delta\)
The pair-wise distances between categorical observations are given by
\[{\bf D}_{c}={\bf Z}{\bf \Delta}{\bf Z}^{\sf T}= \left[\begin{array}{ccc} {\bf Z}_{1} & \dots & {\bf Z}_{Q_{c}} \end{array} \right]\left[\begin{array}{ccc} {\bf\Delta}_1 & & \\ & \ddots &\\ & & {\bf\Delta}_{Q_{c}} \end{array} \right] \left[ \begin{array}{c} {\bf Z}_{1}^{\sf T}\\ \vdots \\ {\bf Z}_{Q_{c}}^{\sf T} \end{array} \right]=\sum_{j=1}^{Q_{c}}{{\bf Z}_{j}{\bf \Delta}_{j}}{\bf Z}^{\sf T}_{j}\]
in association-based distances, \(\Delta_{j}\) is non-diagonal and its elements depend on the other variables, too
non-diagonal \(\Delta_{j}\)
Let \(a\) and \(b\) be two categories of the categorical variable \(j\); the corresponding \((a,b)^{th}\) entry of \(\Delta_{j}\) is
\[ \delta^{j}(a,b)=\sum_{i\neq j}^{Q_{c}}w_{ji}\Phi^{ji}(\xi^{ji}_{a},\xi^{ji}_{b}) \]
where \(\xi^{ji}_{a}\) and \(\xi^{ji}_{b}\) are defined from
the joint (empirical) distributions of the categories of the variable \(i\) with \(a\) and \(b\), respectively
the conditional (empirical) distributions of the categories of the variable \(i\) given \(a\) and \(b\), respectively
joint distribution-based \(\Delta_{j}\)’s for association-based distances
the matrix of co-occurrence proportions is
\[ {\bf P} =\frac{1}{I} \begin{bmatrix} {\bf Z}_{1}^{\sf T}{\bf Z}_{1} & {\bf Z}_{1}^{\sf T}{\bf Z}_{2}&\ldots &{\bf Z}_{1}^{\sf T}{\bf Z}_{Q_{c}}\\ \vdots & \ddots & &\vdots \\ {\bf Z}_{Q_{c}}^{\sf T}{\bf Z}_{1} & {\bf Z}_{Q_{c}}^{\sf T}{\bf Z}_{2}&\ldots &{\bf Z}_{Q_{c}}^{\sf T}{\bf Z}_{Q_{c}} \end{bmatrix} \]
joint distribution-based \(\Delta_{j}\)’s for association-based distances
entropy-based
setting \({\xi}^{ji}_{a}={\bf p}^{ji}_{a}\) and \({\xi}^{ji}_{b}={\bf p}^{ji}_{b}\), the general formula for the \(ab^{th}\) entry of \(\Delta_{j}\)
\[ \delta^{j}(a,b)=\sum_{i\neq j}^{Q_{c}}w_{ji}\Phi^{ji}({\bf p}^{ji}_{a},{\bf p}^{ji}_{b}) \]
by defining \(\Phi^{ji}({\bf p}^{ji}_{a},{\bf p}^{ji}_{b})\) in terms of normalized entropy the above becomes
\[ \delta^{j}(a,b)=\sum_{i\neq j}^{Q_{c}}w_{ji}\left[\frac{\sum_{\ell=1}^{q_{i}}\left({\bf p}^{ji}_{a\ell}+{\bf p}^{ji}_{b\ell}\right)\log_{2}\left({\bf p}^{ji}_{a\ell}+{\bf p}^{ji}_{b\ell}\right)}{\log_{2}(q_{i})} \right] \]
the weights \(w_{ji}\) are based on the mutual information between the variables \(j\) and \(i\)
\[ w_{ji}= \sum_{\upsilon=1}^{q_{j}}\sum_{\ell=1}^{q_{i}} {\bf p}^{ji}_{\upsilon \ell}\log_{2}\left(\frac{{\bf p}^{ji}_{\upsilon \ell}}{{\bf p}^{ji}_{\upsilon.}{\bf p}^{ji}_{.\ell}}\right) \]
where \({\bf p}^{ji}_{\upsilon.}\) and \({\bf p}^{ji}_{.\ell}\) indicate the \(\upsilon^{th}\) row margin and the \(\ell^{th}\) column margin of \({\bf P}^{ji}\), respectively
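a small R sketch of the mutual-information weight \(w_{ji}\), computed from the empirical joint proportions of two categorical variables (illustrative, base-2 logs as above)

```r
# Mutual information between two categorical variables from their joint
# proportion table P^{ji}; zero cells are treated as contributing 0.
mi_weight <- function(xj, xi) {
  P  <- table(xj, xi) / length(xj)          # joint proportions p_{vl}
  pr <- rowSums(P); pc <- colSums(P)        # row and column margins
  terms <- P * log2(P / outer(pr, pc))      # p_{vl} * log2(p_{vl} / (p_{v.} p_{.l}))
  sum(terms[P > 0])
}
```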
conditional distribution-based \(\Delta_{j}\)’s for association-based distances
\({\bf R} = {\bf P}_{d}^{-1}\left({\bf P}-{\bf P}_{d}\right)\), with \({\bf P}_{d}=diag({\bf P})\), is a block matrix such that
the general off-diagonal block is \({\bf R}_{ji}\) ( \(q_{j}\times q_{i}\) )
the \(a^{th}\) row of \({\bf R}_{ji}\), \({\bf r}^{ji}_{a}\), is the conditional distribution of the \(i^{th}\) variable, given the \(a^{th}\) category of the \(j^{th}\) variable
conditional distribution-based \(\Delta_{j}\)’s for association-based distances
total variation distance (TVD)
setting \({\xi}^{ji}_{a}={\bf r}^{ji}_{a}\) and \({\xi}^{ji}_{b}={\bf r}^{ji}_{b}\), the general formula for the \(ab^{th}\) entry of \(\Delta_{j}\)
\[ \delta^{j}(a,b)=\sum_{i\neq j}^{Q_{c}}w_{ji}\Phi^{ji}({\bf r}^{ji}_{a},{\bf r}^{ji}_{b}) \]
by defining \(\Phi^{ji}({\bf r}^{ji}_{a},{\bf r}^{ji}_{b})\) in terms of L1 distance the above becomes
\[\Phi^{ji}({\bf r}^{ji}_{a},{\bf r}^{ji}_{b})=\frac{1}{2}\sum_{\ell=1}^{q_{i}}|{\bf r}^{ji}_{a \ell}-{\bf r}^{ji}_{b \ell}|\] which corresponds to the total variation distance (TVD)
the weights can be \(w_{ji}=1/(Q_{c}-1)\), or suitably defined to achieve commensurability
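a hedged R sketch of the TVD-based entry \(\delta^{j}(a,b)\) with equal weights \(w_{ji}=1/(Q_{c}-1)\); it assumes all columns of the data frame are categorical and that both categories \(a\) and \(b\) occur in the data

```r
# delta^j(a,b): average, over the other categorical variables, of the total
# variation distance between the conditional distributions given a and given b.
delta_tvd <- function(df, j, a, b) {
  df[] <- lapply(df, as.factor)                        # align category levels
  others <- setdiff(seq_along(df), j)
  tvd <- sapply(others, function(i) {
    r_a <- prop.table(table(df[[i]][df[[j]] == a]))    # r^{ji}_a
    r_b <- prop.table(table(df[[i]][df[[j]] == b]))    # r^{ji}_b
    0.5 * sum(abs(r_a - r_b))                          # total variation distance
  })
  mean(tvd)                                            # w_{ji} = 1/(Q_c - 1)
}
```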
supervised AB-distance
supervised TVD
the class labels are categories of a further variable \(y\) (the response)
a supervised variant of the AB-distance can be defined that takes into account the association between \(y\) and each of the other variables.
Let \({\bf Z}_{y}\) be the one-hot encoding of the response; then the matrix \({\bf R}\) becomes
\[ {\bf R}_{s} = {\bf P}_{z}^{-1}\left( {\bf Z}^{\sf T}{\bf Z}_{y} \right)= {\bf P}_{z}^{-1} \begin{bmatrix} {\bf Z}_{1}^{\sf T}{\bf Z}_{y}\\ \vdots \\ {\bf Z}_{Q_{c}}^{\sf T}{\bf Z}_{y} \end{bmatrix} \]
the \((a,b)^{th}\) general entry of \(\Delta^{j}_{s}\) is given by
\[ \delta_{s}^{j}(a,b)= w_{j}\left[\frac{1}{2}\sum_{\ell=1}^{q_{y}}|{\bf r}^{j}_{a \ell}-{\bf r}^{j}_{b \ell}|\right] \]
synthetic categorical data
setup
generated \({\bf X}_{c}\) \((1000\times 16)\); 8 of the 16 variables are associated with the response
4 classes, same size
low/high level of overlap (association to the response)
25 replicates
distance methods: supervised TVD, Entropy-based, Gower (matching-based)
evaluation: accuracy
non-lazy KNN for categorical data
association-based for mixed
a straightforward way to generalise association-based distances to mixed data is to combine a categorical and a numeric distance matrix
\[{\bf D}=\alpha {\bf D}_{c}+(1-\alpha){\bf D}_{n}\]
\({\bf D}_{c}\) is one of the previously defined AB-distances (TVD- or entropy-based)
\({\bf D}_{n}\), its numeric counterpart, is the Mahalanobis (or modified Mahalanobis) distance
However, no categorical/continuous interaction is taken into account
Aim: define \(\Delta^{int}_{j}\) so that it accounts for the categorical/continuous interactions
How to define \(\delta_{int}(a,b)\), general element of \(\Delta^{int}_{j}\): JS-based
Let \(a\) and \(b\) be two categories of the variable \(j\) and let \(X_{i}\) be continuous
\[ \delta_{int}^{j}(a,b)=\sum_{i=Q_{c}+1}^{Q}w_{ji}\Phi_{JS}^{ji}\left({f}_{a}(X_{i}),{f}_{b}(X_{i})\right) \] where \(f_{a}(X_{i})\) and \(f_{b}(X_{i})\) are the distributions of \(X_{i}\) conditional on \(a\) and \(b\), respectively
The two distributions are compared via the Kullback-Leibler divergence
\[ \Phi^{ji}_{KL}(f_{a}(X_{i}),f_{b}(X_{i}))=\int f_{a}(x)\log_{2} \frac{f_{a}(x)}{f_{b}(x)}\,dx \]
How to define \(\delta_{int}(a,b)\): JS-based
Since \(\Phi^{ji}_{KL}(f_{a}(X_{i}),f_{b}(X_{i}))\neq\Phi^{ji}_{KL}(f_{b}(X_{i}),f_{a}(X_{i}))\), it is rendered symmetric using the Jensen-Shannon distance
\[ \Phi^{ji}_{JS}(f_{a}(X_{i}),f_{b}(X_{i}))=\frac{1}{4}\sqrt{ \Phi^{ji}_{KL}\left(f_{a}(X_{i}), f_{ab}(X_{i})\right)+ \Phi^{ji}_{KL}\left( f_{ab}(X_{i}),f_{b}(X_{i})\right)} \] where \(f_{ab}(X_{i})=\left(f_{a}(X_{i})+f_{b}(X_{i})\right)/2\)
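a hedged R sketch of comparing the two conditional densities, estimated by kernel density estimates on a common grid; note that the standard symmetric form \(\tfrac{1}{2}\Phi_{KL}(f_a,f_{ab})+\tfrac{1}{2}\Phi_{KL}(f_b,f_{ab})\) is used here, so the scaling may differ from the formula above

```r
# Jensen-Shannon-type comparison of f_a(X_i) and f_b(X_i) for a continuous X_i,
# given the subsets x_a and x_b of X_i values observed with categories a and b.
phi_js <- function(x_a, x_b, n_grid = 512) {
  rng <- range(c(x_a, x_b))
  f_a <- density(x_a, from = rng[1], to = rng[2], n = n_grid)$y
  f_b <- density(x_b, from = rng[1], to = rng[2], n = n_grid)$y
  f_a <- f_a / sum(f_a); f_b <- f_b / sum(f_b)   # normalise on the common grid
  f_ab <- (f_a + f_b) / 2
  kl <- function(p, q) sum(ifelse(p > 0, p * log2(p / q), 0))
  0.5 * kl(f_a, f_ab) + 0.5 * kl(f_b, f_ab)      # Jensen-Shannon divergence (base 2)
}
```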
How to define \(\delta_{int}(a,b)\): JS-based
The \((a,b)^{th}\) entry of the \(\Delta^{int}_{j}\) is, therefore,
\[ \delta_{int}^{j}(a,b)=\sum_{i=Q_{c}+1}^{Q}w_{ji}\Phi_{JS}^{ji}\left({f}_{a}(X_{i}),{f}_{b}(X_{i})\right) \]
the weights \(w_{ji}\) are once again based on the mutual information between the continuous variable \(X_{i}\) and the categorical variable \(X_{j}\)
How to define \(\delta_{int}(a,b)\): NN-based
the categorical/continuous interaction is proportional to the discriminant power of the continuous variables for each category pair \((a,b)\) of the \(j^{th}\) categorical variable, \(j=1,\ldots,Q_{c}\)
it is assessed via nearest neighbors (NN) averaging
if \(x_{\ell j}=a\), the proportion of nearest neighbors of observation \(\ell\) labeled \(a\) is
\[ {\hat\pi}_{a\ell}=\frac{1}{n^{j}_{a}\pi_{nn}} \sum_{m\in \cal{N}^{a}_{\ell}}I(x_{mj}=a) \]
if \(x_{\ell j}=b\), the proportion of nearest neighbors of observation \(\ell\) labeled \(b\) is
\[ {\hat\pi}_{b\ell}=\frac{1}{n^{j}_{b}\pi_{nn}} \sum_{m\in \cal{N}^{b}_{\ell}}I(x_{mj}=b) \]
\(n^{j}_{a}\) and \(n^{j}_{b}\) are absolute frequencies of categories \(a\) and \(b\)
\(\pi_{nn}\) is the user-defined proportion of nearest neighbors
\(\mathcal{N}^{a}_{\ell}\) (\(\mathcal{N}^{b}_{\ell}\)) is the set of nearest neighbors of the \(\ell^{th}\) observation when \(x_{\ell j}=a\) (\(x_{\ell j}=b\))
How to define \(\delta_{int}(a,b)\): NN-based
We consider the improvement over chance that is obtained using the continuous variables to correctly classify the observations,
category \(a\)
\[ \delta^{j}_{int}(a)=\left[\frac{1}{n_{a}^{j}}\sum_{\ell=1}^{n_{a}^{j}} I(\hat{\pi}_{a\ell}\geq .5)\right]-.5 \]
category \(b\)
\[ \delta^{j}_{int}(b)=\left[\frac{1}{n_{b}^{j}}\sum_{\ell=1}^{n_{b}^{j}} I(\hat{\pi}_{b\ell}\geq .5)\right]-.5 \]
finally, the \((a,b)^{th}\) entry of \(\Delta^{int}_{j}\) is given by
\[ \delta^{j}_{int}(a,b) = \delta^{j}_{int}(a) + \delta^{j}_{int}(b). \]
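a hedged R sketch of the NN-based entry; the slides do not state the metric or the candidate set for the neighborhoods, so Euclidean distance on the scaled continuous variables and neighborhoods restricted to observations in categories \(a\) or \(b\) are assumptions

```r
# delta^j_int(a,b): improvement over chance obtained when the continuous
# variables are used to separate categories a and b of the j-th variable.
delta_int_nn <- function(x_cat, X_con, a, b, pi_nn = 0.1) {
  keep  <- x_cat %in% c(a, b)                          # assumption: a-vs-b comparison only
  x_cat <- x_cat[keep]
  D <- as.matrix(dist(scale(as.matrix(X_con)[keep, , drop = FALSE])))
  diag(D) <- Inf                                       # exclude self from neighborhoods
  side <- function(cat) {
    idx <- which(x_cat == cat)
    k   <- max(1, round(pi_nn * length(idx)))          # |N_l| = pi_nn * n^j_cat
    pi_hat <- sapply(idx, function(l) mean(x_cat[order(D[l, ])[1:k]] == cat))
    mean(pi_hat >= 0.5) - 0.5                          # improvement over chance
  }
  side(a) + side(b)
}
```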
continuous variables in the TVD computation: NN-based
KNN learning: synthetic mixed data
setup
\({\bf X}=\left[{\bf X}_{cat},{\bf X}_{con}\right]\)
4 classes, same size
low/high level of overlap (association to the response)
25 replicates
distance methods:
association_based: Mahalanobis, supervised TVD, NN-based interaction
gudmm: modified Mahalanobis, entropy-based, JS-based
gower
evaluation: accuracy
KNN learning: synthetic mixed data
no gains from interaction, but this is expected: the two blocks of variables were generated independently
the manydist package: main functions
ndist
: computing distances for numerical variables
Arguments
x
: tibble/df with numeric training observations
validate_x
: (optional) tibble/df with numeric test observations
commensurable
: T/F argument
method
: c("manhattan","euclidean")
scaling
: c("none","std","range","robust","pc_scores")
sig
: (optional) middle matrix for association-based distances (e.g. if sig=cov(x) and method="euclidean", then you get the Mahalanobis distance)
Value
a nrow(x) by nrow(x) distance matrix if validate_x=NULL; a nrow(validate_x) by nrow(x) distance matrix otherwise (see the usage sketch below)
cdist
: computing distances for categorical variables
mdist
: computing distances for mixed variables
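a hedged usage sketch for ndist, based only on the arguments listed above; the data and argument values are illustrative, and exact defaults should be checked against the package documentation

```r
# Illustrative ndist() calls (argument names as listed above).
library(manydist)
num_train <- mtcars[1:25, c("mpg", "hp", "wt")]
num_test  <- mtcars[26:32, c("mpg", "hp", "wt")]

D_train <- ndist(x = num_train, method = "manhattan",
                 scaling = "std", commensurable = TRUE)   # 25 x 25 matrix
D_test  <- ndist(x = num_train, validate_x = num_test,
                 method = "manhattan", scaling = "std")   # 7 x 25 matrix
```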
the manydist package: main functions
ndist
: computing distances for numerical variables
cdist
: computing distances for categorical variables
Arguments
x
: tibble/df with categorical training observations
validate_x
: (optional) tibble/df with categorical test observations
commensurable
: T/F argument
method
: several independence- and association-based methods implemented; a string vector of method names can be supplied for by-variable specification
Value
a nrow(x) by nrow(x) distance matrix if validate_x=NULL; a nrow(validate_x) by nrow(x) distance matrix otherwise (see the usage sketch below)
mdist
: computing distances for mixed variables
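a hedged usage sketch for cdist; the slide does not list the available method names, so the default method is used and only the documented arguments appear

```r
# Illustrative cdist() call on a few categorical variables.
library(manydist)
cat_vars <- data.frame(cyl  = factor(mtcars$cyl),
                       gear = factor(mtcars$gear),
                       am   = factor(mtcars$am))
D_cat <- cdist(x = cat_vars, commensurable = TRUE)   # 32 x 32 distance matrix
```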
the manydist package: main functions
ndist
: computing distances for numerical variables
cdist
: computing distances for categorical variables
mdist
: computing distances for mixed variables
wrapper function combining ndist and cdist
Arguments
x
: tibble/df with mixed training observations
validate_x
: (optional) tibble/df with mixed test observations
commensurable
: T/F argument
distance_cont and distance_cat
: equivalent of the method argument in ndist and cdist, respectively
interaction
: T/F argument; if TRUE, the NN-based interaction is implemented
Value
a nrow(x) by nrow(x) distance matrix if validate_x=NULL; a nrow(validate_x) by nrow(x) distance matrix otherwise (see the usage sketch below)
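a hedged usage sketch for mdist on a small mixed data set; distance_cont and distance_cat are left at their (assumed) defaults

```r
# Illustrative mdist() call with the NN-based interaction switched on.
library(manydist)
mixed <- data.frame(mpg = mtcars$mpg, wt = mtcars$wt,
                    cyl = factor(mtcars$cyl), am = factor(mtcars$am))
D_mix <- mdist(x = mixed, commensurable = TRUE, interaction = TRUE)
```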
Final considerations and future work
the idea of an unbiased distance is that variable types, scales or measurement levels should not trivially impact the distance
association-based measures aim to go beyond match/mismatch of categories
in supervised settings, AB-distances allow the response to be taken into account in the pair-wise computations
non-lazy KNN
NN-based interactions are computationally demanding (though this can be made bearable)
manydist
package (GitHub first, then CRAN)
main references
Hastie, T. and R. Tibshirani (1995). “Discriminant adaptive nearest neighbor classification and regression”. In: Advances in neural information processing systems 8.
Le, S. Q. and T. B. Ho (2005). “An association-based dissimilarity measure for categorical data”. In: Pattern Recognition Letters 26.16, pp. 2549-2557.
Mousavi, E. and M. Sehhati (2023). “A Generalized Multi-Aspect Distance Metric for Mixed-Type Data Clustering”. In: Pattern Recognition, p. 109353.
Ross, B. C. (2014). “Mutual information between discrete and continuous data sets”. In: PloS one 9.2, p. e87357.
Velden, M. van de, A. Iodice D’Enza, A. Markos, et al. (2024). “A general framework for implementing distances for categorical variables”. In: Pattern Recognition 153, p. 110547.
Velden, M. van de, A. Iodice D’Enza, A. Markos, et al. (2025). “Unbiased mixed variables distance”. In: arXiv preprint arXiv:2411.00429, under review at JCGS.
Velden, M. van de, A. Iodice D’Enza, and F. Palumbo (2017). “Cluster correspondence analysis”. In: Psychometrika 82.1, pp. 158-185.