From unbiased dissimilarities to response- and interaction-aware learning workflows
Alfonso Iodice D’Enza
Modelling and Forecasting of Socio-Economic Phenomena
May 12, 2026, Zakopane, Poland
1. Distance-based learning with mixed data
Increasing awareness in distance construction
2. Scale/type-aware distances
Additivity, commensurability, and bias in mixed data
3. Association-aware distances
Redundancy, correlations, and categorical associations
4. Interaction- and response-aware extensions
Continuous–categorical relationships and supervised neighbourhoods
5. Building distance-based pipelines: manydist
Distance construction, diagnostics, and learning workflows
Most real data applications do not contain only numerical variables.
Tabular data
In socio-economic applications, observations are often described by a mixture of
measurements, counts, categories, and ordered scales.
Numerical variables
Categorical variables
Many statistical learning methods rely, explicitly or implicitly, on comparing observations.
Scale
How do we compare variables measured in different units?
Type
How do we combine numerical differences with category mismatches?
Structure
How do we avoid counting associated information multiple times?
Terminology alert!
I will often use distance as shorthand for pairwise dissimilarity.
Some measures discussed here are not metrics in the strict mathematical sense,
but they all quantify how different two observations are.
Many learning methods can operate directly on a dissimilarity matrix.
Dimension reduction
Multidimensional scaling (MDS)
Clustering methods
Hierarchical clustering (HC), PAM, and spectral clustering
Nearest-neighbour prediction
Nearest-neighbour classification and nearest-neighbour averaging for regression
2 continuous variables: add up by-variable (absolute value or squared) differences
2 continuous and 1 categorical variables
one might consider purple and blue closer than e.g. purple and yellow
Multivariate Additivity
Let \(\mathbf{x}_i=\left(x_{i1}, \dots, x_{iQ}\right)\) denote a \(Q\)-dimensional vector. A distance function \(d\left(\mathbf{x}_i,\mathbf{x}_\ell\right)\) between observations \(i\) and \(\ell\) is multivariate additive if
\[ d\left(\mathbf{x}_i,\mathbf{x}_\ell\right)=\sum_{j=1}^{Q} d_j\left(\mathbf{x}_i,\mathbf{x}_\ell\right), \]
where \(d_j\left(\mathbf{x}_i,\mathbf{x}_\ell\right)\) denotes the \(j\)-th variable-specific distance.
If additivity holds, by-variable distances are added together: they should be on equivalent scales
Commensurability
Let \({\boldsymbol X}_i =\left(X_{i1}, \dots, X_{iQ}\right)\) denote a \(Q\)-dimensional random variable corresponding to an observation \(i\). Furthermore, let \(d_{j}\) denote the distance function corresponding to the \(j\)-th variable.
We have commensurability if, for all \(j\), and \(i \neq \ell\),
\[ E[d_{j}({ X}_{ij}, {X}_{\ell j})] = c, \]
where \(c\) is some constant.
If the multivariate distance function \(d(\cdot,\cdot)\) satisfies additivity and commensurability, ad hoc distance functions can be used for each variable and then aggregated.
One can then pick the appropriate \(d_{j}(\cdot,\cdot)\) given the nature of \(X_{j}\), which is well suited to the mixed-data case.
a mixed data set
\(I\) observations described by \(Q\) variables, \(Q_{n}\) numerical and \(Q_{c}\) categorical
the \(I\times Q\) data matrix \({\bf X}=\left[{\bf X}_{n},{\bf X}_{c}\right]\) is column-wise partitioned
A formulation for mixed distance between observations \(i\) and \(\ell\):
\[\begin{eqnarray}\label{genmixeddist_formula} d\left(\mathbf{x}_i,\mathbf{x}_\ell\right)&=& \sum_{j_n=1}^{Q_n} d_{j_n}\left(\mathbf{x}^n_i,\mathbf{x}^n_\ell\right)+ \sum_{j_c=1}^{Q_c} d_{j_c}\left(\mathbf{x}^c_i,\mathbf{x}^c_\ell\right)=\\ &=& \sum_{j_n=1}^{Q_n} w_{j_n} \delta^n_{j_n}\left(\mathbf{x}^n_i,\mathbf{x}^n_\ell\right)+ \sum_{j_c=1}^{Q_c} w_{j_c}\delta^c_{j_c}\left(\mathbf{x}^c_i,\mathbf{x}^c_\ell\right) \end{eqnarray}\]
numeric case
\(\delta^n_{j_n}\) is a function quantifying the dissimilarity between observations on the \(j_n\)-th numerical variable
\(w_{j_n}\) is a weight for the \(j_n\)-th variable.
categorical case
\(\delta^c_{j_c}\) quantifies the dissimilarity between the categories chosen by subjects \(i\) and \(\ell\) for categorical variable \(j_c\)
\(w_{j_c}\) is a weight for the \(j_c\)-th variable.
Synthetic data
\(I=500\) observations from normal, uniform, skewed, and bimodal distributions
skewed refers to a \(\chi^2_{1/2}\) distribution
bimodal: \(I/2\) draws from \(\chi^2_{1/2}\), censored at \(10\), and \(I/2\) draws from \(10-\chi^2_{1/2}\), censored at \(0\)
as long as variables have the same underlying distribution and scaling, commensurability holds
skewed variables may be under- or over-contributing to the distance, depending on the scaling (range and robust, respectively)
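The commensurability check above can be explored with a small simulation. The following is a minimal sketch, not the authors' experiment: it assumes that the "skewed" case is a chi-square with 1/2 degrees of freedom, uses the IQR as one possible "robust" scaling, and introduces a hypothetical helper `mean_pairwise()`. The last scaling (dividing each variable by its own mean pairwise difference) is only an illustration of how equal average contributions can be forced.

```r
set.seed(1)
I <- 500  # sample size used in the synthetic example

# Three of the shapes mentioned above; df = 0.5 is an assumed reading of "skewed".
X <- data.frame(
  normal  = rnorm(I),
  uniform = runif(I),
  skewed  = rchisq(I, df = 0.5)
)

# Mean absolute difference over all pairs i != l for a single variable.
mean_pairwise <- function(x) mean(as.vector(dist(x, method = "manhattan")))

# Average by-variable contribution under range and IQR scaling.
by_range <- sapply(X, function(x) mean_pairwise(x / diff(range(x))))
by_iqr   <- sapply(X, function(x) mean_pairwise(x / IQR(x)))

# Dividing by the variable's own mean pairwise difference equalises contributions.
by_self  <- sapply(X, function(x) mean_pairwise(x / mean_pairwise(x)))

round(rbind(by_range, by_iqr, by_self), 3)
```

Rows of the printed matrix that differ across columns indicate a scaling under which the variables are not commensurable; the `by_self` row is constant by construction.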
From categories to distances
Let \({\bf Z}=[{\bf Z}_1,\ldots,{\bf Z}_{Q_c}]\) be the one-hot encoding of the categorical variables.
The pairwise categorical distance matrix can be written as
\[{\bf D}_{c}={\bf Z}{\bf \Delta}{\bf Z}^{\sf T}= \left[\begin{array}{ccc} {\bf Z}_{1} & \dots & {\bf Z}_{Q_{c}} \end{array} \right]\left[\begin{array}{ccc} {\bf\Delta}_1 & & \\ & \ddots &\\ & & {\bf\Delta}_{Q_{c}} \end{array} \right] \left[ \begin{array}{c} {\bf Z}_{1}^{\sf T}\\ \vdots \\ {\bf Z}_{Q_{c}}^{\sf T} \end{array} \right]\]
each \({\bf \Delta}_j\) defines the dissimilarity between the categories of variable \(j\)
different choices of \({\bf \Delta}_j\) imply different categorical distance measures
therefore, categorical distances can also suffer from scale and frequency-driven bias
flat frequency distribution
| Distance | Cat. dissimilarity | \(E[d(X_i, X_{\ell})]\) | \(q=2\) | \(q=5\) |
|---|---|---|---|---|
| Matching | \(\boldsymbol{\Delta}_m = \mathbf{1} \mathbf{1}^{\top} - \mathbf{I}\) | \(\frac{q-1}{q}\) | 0.5 | 0.8 |
| Eskin | \(\boldsymbol{\Delta}_e = \frac{2}{q^2}\boldsymbol{\Delta}_m\) | \(\frac{2(q-1)}{q^3}\) | 0.250 | 0.064 |
| Occurrence frequency (OF) | \(\boldsymbol{\Delta}_{OF} = \log^2(q)\boldsymbol{\Delta}_m\) | \(\log^2(q)\frac{q-1}{q}\) | 0.240 | 2.072 |
| Inverse OF | \(\boldsymbol{\Delta}_{IOF} = \log^2\left(\frac{I}{q}\right) \boldsymbol{\Delta}_m\) | \(\log^2\left(\frac{I}{q}\right)\frac{q-1}{q}\) | 9.601 | 9.610 |
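The matching, Eskin, and OF rows of the table can be checked numerically. This is a sketch under the assumption that two observations are drawn independently from a flat distribution over \(q\) categories, so that the expected distance is \({\bf p}^{\sf T}{\bf \Delta}{\bf p}\); the function name `expected_distance()` is hypothetical. The inverse OF row is left out because it depends on the sample size \(I\), which the table does not state.

```r
# Expected category dissimilarity under a flat distribution over q categories.
expected_distance <- function(q) {
  p <- rep(1 / q, q)                      # flat category frequencies
  delta_m <- matrix(1, q, q) - diag(q)    # matching: 1 off the diagonal, 0 on it
  deltas <- list(
    matching = delta_m,
    eskin    = (2 / q^2) * delta_m,
    of       = log(q)^2 * delta_m
  )
  # E[d] = p' Delta p for two independent draws.
  sapply(deltas, function(D) as.numeric(t(p) %*% D %*% p))
}

round(expected_distance(2), 3)
round(expected_distance(5), 3)
```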
skewed frequency distribution
The expected distance increases with the heterogeneity of the distribution and with the number of categories
Independence-based pairwise distance
No inter-variable relations are considered.
in the continuous case: Euclidean or Manhattan distances
in the categorical case: Hamming / matching distance, among many others
in the mixed-data case: Gower dissimilarity index
variable contributions may be balanced, but still treated as separate sources of information
Beyond commensurability
commensurability makes variable contributions comparable across scales and data types.
When variables are correlated or associated, shared information is effectively counted multiple times
inflated dissimilarities may distort downstream unsupervised learning tasks.
The Euclidean distance \(\longrightarrow\) shared information is over-counted
The Mahalanobis distance \(\longrightarrow\) shared information is not over-counted
this is an association-based distance for continuous data
Association-based for continuous: Mahalanobis distance
Let \({\bf X}_{con}\) be an \(n\times Q_{d}\) data matrix of \(n\) observations described by \(Q_{d}\) continuous variables, and let \({\bf S}\) be the sample covariance matrix. The Mahalanobis distance matrix is
\[ {\bf D}_{mah} = \left[\operatorname{diag}({\bf G})\,{\bf 1}_{n}^{\sf T} + {\bf 1}_{n}\,\operatorname{diag}({\bf G})^{\sf T} - 2{\bf G}\right]^{\odot 1/2} \] where
\([\cdot]^{\odot 1/2}\) denotes the element-wise square root
\({\bf G}=({\bf C}{\bf X}_{con}){\bf S}^{-1}({\bf C}{\bf X}_{con})^{\sf T}\) is the Mahalanobis Gram matrix
\({\bf C}={\bf I}_{n}-\tfrac{1}{n}{\bf 1}_{n}{\bf 1}_{n}^{\sf T}\) is the centering operator
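The Gram-matrix formulation above can be written out directly. This is a minimal sketch on simulated data (the data and object names are illustrative, not from the slides); it also compares the result with Euclidean distances on Cholesky-whitened data, which should coincide with the Mahalanobis distances.

```r
set.seed(1)
n <- 6
X <- matrix(rnorm(n * 3), n, 3)            # toy continuous data

C  <- diag(n) - matrix(1 / n, n, n)        # centering operator
S  <- cov(X)                               # sample covariance matrix
Xc <- C %*% X
G  <- Xc %*% solve(S) %*% t(Xc)            # Mahalanobis Gram matrix

g  <- diag(G)
D2 <- outer(g, rep(1, n)) + outer(rep(1, n), g) - 2 * G
D_mah <- sqrt(pmax(D2, 0))                 # element-wise root, guarding tiny negatives

# Cross-check: Euclidean distance after whitening with the Cholesky factor of S.
D_check <- as.matrix(dist(X %*% solve(chol(S))))
max(abs(D_mah - D_check))
```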
Association-based for categorical: total variation distance (TVD) (Le & Ho, 2005)
The distance matrix \({\bf D}_{tvd}\) can be defined via the delta framework upon properly defining the block-diagonal matrix \({\bf \Delta}\)
Let \({\bf X}_{cat}\) be an \(n\times Q_{c}\) data matrix of \(n\) observations described by \(Q_{c}\) categorical variables.
\[ {\bf D} = {\bf Z}{\Delta}{\bf Z}^{\sf T} = \left[\begin{array}{ccc} {\bf Z}_{1} & \dots & {\bf Z}_{Q_{c}} \end{array} \right]\left[\begin{array}{ccc} {\bf\Delta}_1 & & \\ & \ddots &\\ & & {\bf\Delta}_{Q_{c}} \end{array} \right] \left[ \begin{array}{c} {\bf Z}_{1}^{\sf T}\\ \vdots \\ {\bf Z}_{Q_{c}}^{\sf T} \end{array} \right] \]
Association-based for categorical: total variation distance (TVD) (Le & Ho, 2005) (2)
Consider the empirical joint probability distributions stored in the off-diagonal blocks of \({\bf P}\):
\[ {\bf P} = \frac{1}{n} \begin{bmatrix} {\bf Z}_1^{\sf T}{\bf Z}_1 & {\bf Z}_1^{\sf T}{\bf Z}_2 & \cdots & {\bf Z}_1^{\sf T}{\bf Z}_{Q_c} \\ \vdots & \vdots & \ddots & \vdots \\ {\bf Z}_{Q_c}^{\sf T}{\bf Z}_1 & {\bf Z}_{Q_c}^{\sf T}{\bf Z}_2 & \cdots & {\bf Z}_{Q_c}^{\sf T}{\bf Z}_{Q_c} \end{bmatrix}. \]
The conditional probability distributions of each variable \(j\) given each variable \(i\) (\(i,j=1,\ldots,Q_c\), \(i\neq j\)) are stored in the block matrix
\[ {\bf R} = {\bf P}_z^{-1}({\bf P} - {\bf P}_z), \]
where \({\bf P}_z = {\bf P} \odot {\bf I}_{Q^*}\), \({\bf I}_{Q^*}\) is the \(Q^*\times Q^*\) identity matrix, and \(Q^*\) is the total number of categories over all \(Q_c\) variables.
Association-based for categorical: total variation distance (TVD) (Le & Ho, 2005) (3)
Let \({\bf r}^{ji}_a\) and \({\bf r}^{ji}_b\) be the rows of \({\bf R}_{ji}\), the \((j,i)\)th off-diagonal block of \({\bf R}\).
The category dissimilarity between \(a\) and \(b\) for variable \(j\) based on the total variation distance (TVD) is defined as
\[ \delta^{j}_{tvd}(a,b) = \sum_{i\neq j}^{Q_c} w_{ji} \Phi^{ji}({\bf r}^{ji}_{a},{\bf r}^{ji}_{b}) = \sum_{i\neq j}^{Q_c} w_{ji} \left[\frac{1}{2}\sum_{\ell=1}^{q_i} |{\bf r}^{ji}_{a\ell}-{\bf r}^{ji}_{b\ell}|\right], \label{ab_delta} \]
where \(w_{ji}=1/(Q_c-1)\) for equal weighting (can be user-defined).
The TVD-based dissimilarity matrix is, therefore,
\[ {\bf D}_{tvd}= {\bf Z}{\Delta}^{(tvd)}{\bf Z}^{\sf T}. \]
Data
Consider two categorical variables:
| id | X1 | X2 |
|---|---|---|
| 1 | A | u |
| 2 | A | u |
| 3 | A | v |
| 4 | B | u |
| 5 | B | u |
| 6 | B | v |
| 7 | C | u |
| 8 | C | v |
| 9 | C | v |
| 10 | C | v |
Indicator matrices
\[ {\bf Z}_1 = \begin{pmatrix} 1&0&0\\ 1&0&0\\ 1&0&0\\ 0&1&0\\ 0&1&0\\ 0&1&0\\ 0&0&1\\ 0&0&1\\ 0&0&1\\ 0&0&1 \end{pmatrix}, \qquad {\bf Z}_2 = \begin{pmatrix} 1&0\\ 1&0\\ 0&1\\ 1&0\\ 1&0\\ 0&1\\ 1&0\\ 0&1\\ 0&1\\ 0&1 \end{pmatrix}. \]
Let
\[ {\bf Z} = [{\bf Z}_1,{\bf Z}_2]. \]
The empirical co-occurrence matrix is
\[ {\bf P} = \frac{1}{10}{\bf Z}^{\sf T}{\bf Z}. \]
For this example,
\[ {\bf P} = \begin{pmatrix} \color{#2A9D8F}{0.30} & \color{#2A9D8F}{0} & \color{#2A9D8F}{0} & \color{#E76F51}{0.20} & \color{#E76F51}{0.10}\\ \color{#2A9D8F}{0} & \color{#2A9D8F}{0.30} & \color{#2A9D8F}{0} & \color{#E76F51}{0.20} & \color{#E76F51}{0.10}\\ \color{#2A9D8F}{0} & \color{#2A9D8F}{0} & \color{#2A9D8F}{0.40} & \color{#E76F51}{0.10} & \color{#E76F51}{0.30}\\ \color{#E76F51}{0.20} & \color{#E76F51}{0.20} & \color{#E76F51}{0.10} & \color{#2A9D8F}{0.50} & \color{#2A9D8F}{0}\\ \color{#E76F51}{0.10} & \color{#E76F51}{0.10} & \color{#E76F51}{0.30} & \color{#2A9D8F}{0} & \color{#2A9D8F}{0.50} \end{pmatrix}. \]
The diagonal part of \({\bf P}\) is
\[ {\bf P}_z = {\bf P} \odot {\bf I}_{Q^*} = \operatorname{diag}(0.30,0.30,0.40,0.50,0.50). \]
The block matrix of conditional profiles is
\[ {\bf R} = {\bf P}_z^{-1}({\bf P}-{\bf P}_z). \]
For this example,
\[ {\bf R} = \begin{pmatrix} \color{#2A9D8F}{0} & \color{#2A9D8F}{0} & \color{#2A9D8F}{0} & \color{#E76F51}{0.67} & \color{#E76F51}{0.33}\\ \color{#2A9D8F}{0} & \color{#2A9D8F}{0} & \color{#2A9D8F}{0} & \color{#E76F51}{0.67} & \color{#E76F51}{0.33}\\ \color{#2A9D8F}{0} & \color{#2A9D8F}{0} & \color{#2A9D8F}{0} & \color{#E76F51}{0.25} & \color{#E76F51}{0.75}\\ \color{#E76F51}{0.40} & \color{#E76F51}{0.40} & \color{#E76F51}{0.20} & \color{#2A9D8F}{0} & \color{#2A9D8F}{0}\\ \color{#E76F51}{0.20} & \color{#E76F51}{0.20} & \color{#E76F51}{0.60} & \color{#2A9D8F}{0} & \color{#2A9D8F}{0} \end{pmatrix}. \]
For the categories of \(X_1\), the relevant block is
\[ {\bf R}_{12} = \begin{pmatrix} 0.67 & 0.33\\ 0.67 & 0.33\\ 0.25 & 0.75 \end{pmatrix}. \]
Interpretation
Rows of \({\bf R}_{12}\) describe the distribution of \(X_2\) within each category of \(X_1\):
Compare the rows of \({\bf R}_{12}\) using TVD.
\[ \delta^{1}_{tvd}(A,B) = \frac{1}{2} \left( |0.67-0.67| + |0.33-0.33| \right) = 0. \]
\[ \delta^{1}_{tvd}(A,C) = \frac{1}{2} \left( |0.67-0.25| + |0.33-0.75| \right) = 0.42. \]
\[ \delta^{1}_{tvd}(B,C) = 0.42. \]
Therefore,
\[ \Delta^{(tvd)}_1 = \begin{pmatrix} 0 & 0 & 0.42\\ 0 & 0 & 0.42\\ 0.42 & 0.42 & 0 \end{pmatrix}. \]
For the categories of \(X_2\), the relevant block is
\[ {\bf R}_{21} = \begin{pmatrix} 0.40 & 0.40 & 0.20\\ 0.20 & 0.20 & 0.60 \end{pmatrix}. \]
Interpretation
Rows of \({\bf R}_{21}\) describe the distribution of \(X_1\) within each category of \(X_2\):
Compare the rows of \({\bf R}_{21}\) using TVD.
\[ \delta^{2}_{tvd}(u,v) = \frac{1}{2} \left( |0.40-0.20| + |0.40-0.20| + |0.20-0.60| \right) = 0.40. \]
Therefore,
\[ \Delta^{(tvd)}_2 = \begin{pmatrix} 0 & 0.40\\ 0.40 & 0 \end{pmatrix}. \]
We collect the category dissimilarity matrices in a block-diagonal matrix:
\[ \Delta^{(tvd)} = \begin{pmatrix} \color{#2A9D8F}{\Delta^{(tvd)}_1} & \color{#E76F51}{0}\\ \color{#E76F51}{0} & \color{#2A9D8F}{\Delta^{(tvd)}_2} \end{pmatrix}. \]
The observation-level categorical distance matrix is then
\[ {\bf D}_{tvd} = {\bf Z}\Delta^{(tvd)}{\bf Z}^{\sf T} = \begin{bmatrix} {\bf Z}_1 & {\bf Z}_2 \end{bmatrix} \begin{pmatrix} \Delta^{(tvd)}_1 & 0\\ 0 & \Delta^{(tvd)}_2 \end{pmatrix} \begin{bmatrix} {\bf Z}_1^{\sf T}\\ {\bf Z}_2^{\sf T} \end{bmatrix}. \]
Equivalently,
\[ {\bf D}_{tvd} = {\bf Z}_1\Delta^{(tvd)}_1{\bf Z}_1^{\sf T} + {\bf Z}_2\Delta^{(tvd)}_2{\bf Z}_2^{\sf T}. \]
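The worked example can be reproduced in a few lines of base R. This is a sketch that follows the matrices defined above; the helper `tvd_block()` is a hypothetical name, and with only two categorical variables the weight \(w_{ji}=1/(Q_c-1)\) equals 1.

```r
X1 <- factor(c("A", "A", "A", "B", "B", "B", "C", "C", "C", "C"))
X2 <- factor(c("u", "u", "v", "u", "u", "v", "u", "v", "v", "v"))

# Indicator (one-hot) matrices, one column per category.
Z1 <- model.matrix(~ X1 - 1)
Z2 <- model.matrix(~ X2 - 1)
Z  <- cbind(Z1, Z2)

P  <- crossprod(Z) / nrow(Z)     # empirical co-occurrence matrix
Pz <- diag(diag(P))              # diagonal part of P
R  <- solve(Pz) %*% (P - Pz)     # conditional profiles

R12 <- R[1:3, 4:5]               # profiles of X2 within categories of X1
R21 <- R[4:5, 1:3]               # profiles of X1 within categories of X2

# Total variation distance between every pair of rows of a profile block.
tvd_block <- function(Rb) {
  q <- nrow(Rb)
  out <- matrix(0, q, q)
  for (a in seq_len(q)) for (b in seq_len(q)) {
    out[a, b] <- 0.5 * sum(abs(Rb[a, ] - Rb[b, ]))
  }
  out
}

Delta1 <- tvd_block(R12)
Delta2 <- tvd_block(R21)

# Observation-level distance: sum of the two variable-specific contributions.
D_tvd <- Z1 %*% Delta1 %*% t(Z1) + Z2 %*% Delta2 %*% t(Z2)
round(Delta1, 2)
round(Delta2, 2)
```

Observations 1 and 4 differ only in \(X_1\) (A versus B), whose categories have identical profiles, so their distance is zero even though a matching distance would count a mismatch.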
Different distance definitions induce different distance-based representations of the same data.
Same data, different representation
Changing the distance changes the global dissimilarity structure on which downstream learning methods rely.
Leave-one-variable-out diagnostics
How can we measure the contribution of each variable to this structure?
The benchmark compares distance definitions that differ in how they treat scale, type, additivity, and association.
Additive distances
gower: classical Gower dissimilarity
mod_gower: modified Gower coefficients (Liu et al., 2024)
hl_add: additive version of Hennig–Liao scaling (Hennig & Liao, 2013)
u_ind: unbiased independence-based distance
u_dep: unbiased association-based distance
u_mix: unbiased Manhattan and TVD
Non-additive distances
naive: Euclidean distance on scaled numerical variables and one-hot-encoded categorical variables
hl: Hennig–Liao scaling with Euclidean distance
gudmm: generalized multi-aspect distance metric for mixed-type data (Mousavi & Sehhati, 2023)
dkps: distance using kernel product similarity (Ghashti & Thompson, 2025)
For each distance and each variable \(X_j\), we compare the full-data representation \({\bf D}\) with the representation obtained after removing \(X_j\), that is \({\bf D}_{-j}\).
1. Distance level
Numeric comparison between \({\bf D}\) and \({\bf D}_{-j}\).
2. MDS level
Compute MDS from \({\bf D}\) and from \({\bf D}_{-j}\), then compare the resulting configurations.
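The two comparison levels can be sketched with base R. The slides do not say which comparison measure is used, so this example makes several assumptions: it uses the numeric columns of the built-in `iris` data, a Manhattan distance on standardized variables, and one minus the Pearson correlation as a stand-in for both the distance-level and the MDS-level comparison. It does not use the package's own diagnostic functions.

```r
X <- scale(iris[, 1:4])                        # standardized numeric variables
D_full   <- dist(X, method = "manhattan")      # full-data representation D
mds_full <- cmdscale(D_full, k = 2)            # MDS configuration from D

lovo <- sapply(colnames(X), function(v) {
  # Representation obtained after removing variable v.
  D_minus   <- dist(X[, colnames(X) != v, drop = FALSE], method = "manhattan")
  mds_minus <- cmdscale(D_minus, k = 2)
  c(
    # 1. Distance level: disagreement between D and D_{-j}.
    distance_level = 1 - cor(as.vector(D_full), as.vector(D_minus)),
    # 2. MDS level: disagreement between the two configurations,
    #    compared through their inter-point distances.
    mds_level = 1 - cor(as.vector(dist(mds_full)), as.vector(dist(mds_minus)))
  )
})

round(lovo, 4)
```

Larger values flag variables whose removal changes the representation more; comparing configurations through their inter-point distances avoids having to align them by rotation.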
LOVO diagnostics show how variables affect the distance matrix and the MDS representation.
But we also want to know whether distance biases affect a downstream learning task.
Unsupervised classification experiment
Use each distance matrix as input to PAM and evaluate how well the resulting partition recovers the known cluster structure.
Data generation
genRandomClustEvaluation
For each mixed-data distance, PAM is applied to the dissimilarity matrix with \(K = 4\).
Recovery of the true cluster labels is measured using the adjusted Rand index.
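The evaluation loop for a single distance can be sketched as follows. This is an illustration under stated assumptions: the data are four well-separated Gaussian groups rather than the generator used in the experiment, the distance is plain Manhattan, PAM comes from the recommended `cluster` package, and `adjusted_rand()` is a hypothetical base-R helper implementing the usual Hubert-Arabie formula.

```r
library(cluster)  # pam(); recommended package shipped with standard R installs

# Adjusted Rand index from the contingency table of two partitions.
adjusted_rand <- function(a, b) {
  tab <- table(a, b)
  comb2 <- function(x) sum(choose(x, 2))
  n_pairs   <- choose(sum(tab), 2)
  index     <- comb2(tab)
  expected  <- comb2(rowSums(tab)) * comb2(colSums(tab)) / n_pairs
  max_index <- 0.5 * (comb2(rowSums(tab)) + comb2(colSums(tab)))
  (index - expected) / (max_index - expected)
}

set.seed(1)
truth   <- rep(1:4, each = 25)                        # known cluster labels
centers <- rbind(c(0, 0), c(6, 0), c(0, 6), c(6, 6))  # toy cluster centres
X <- centers[truth, ] + matrix(rnorm(200, sd = 0.5), 100, 2)

D   <- dist(X, method = "manhattan")   # any dissimilarity matrix can be plugged in
fit <- pam(D, k = 4, diss = TRUE)      # PAM on the dissimilarities
ari <- adjusted_rand(truth, fit$clustering)
ari
```

Swapping `D` for another dissimilarity matrix on the same observations gives the corresponding recovery score for that distance.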
hl performs well when categorical variables are noise, but poorly when numerical variables are noise
gower tends to show the opposite pattern
u_mix and u_dep are comparatively stable in the mixed signal/noise scenarios
Association-aware distances account for relations within variable blocks:
Cross-type structure
In mixed data, categorical differences may be meaningful because they are reflected in the continuous variables.
Define \(\Delta^{int}\) to account for continuous–categorical interactions and use it to augment \(\Delta^{tvd}\).
The mixed dissimilarity becomes
\[ {\bf D}_{mix}^{(int)} = {\bf D}_{mah} + {\bf D}_{cat}^{(int)}. \]
where
\[ {\bf D}_{cat}^{(int)}={\bf Z}\tilde{\Delta}{\bf Z}^\top \]
and
\[ \tilde{\Delta} = (1-\alpha)\Delta^{tvd} + \alpha \Delta^{int}, \qquad \alpha=\frac{1}{Q_c}. \]
The entry \(\delta_{int}^{j}(a,b)\) measures how much the continuous variables help discriminate between observations choosing category \(a\) and those choosing category \(b\) for categorical variable \(j\).
Category-pair classification problem
For each categorical variable \(j\) and each pair of categories \((a,b)\):
\[ \delta_{int}^{j}(a,b) = \frac{1}{2} \left( \frac{\texttt{true } a}{\texttt{true } a + \texttt{false } a} + \frac{\texttt{true } b}{\texttt{true } b + \texttt{false } b} \right). \]
For categorical variable \(j\) with \(q_j\) categories, compute
\(\frac{q_j(q_j -1)}{2}\) category-pair quantities.
\[ \Delta_{int} = \begin{pmatrix} 0 & 0.94 & 0.40 & 0.39 \\ 0.94 & 0 & 0.54 & 0.55 \\ 0.40 & 0.54 & 0 & \color{#E76F51}{0} \\ 0.39 & 0.55 & \color{#E76F51}{0} & 0 \end{pmatrix} \]
We model the joint structure as
\[ f({\bf x}_{con},{\bf x}_{cat}) = f({\bf x}_{con}) f({\bf x}_{cat}\mid {\bf x}_{con}). \]
The interaction term asks how categorical distinctions are reflected in the continuous geometry.
Graph representation
A graph representation of the data matrix \({\bf X}\): the aim is to cut it into \(K\) groups, or clusters.
The affinity matrix \({\bf A}\)
The elements \({\bf w}_{ij}\) of \({\bf A}\) are high when observations \(i\) and \(j\) are likely to belong to the same group, and low otherwise.
| . | a | b | c | d |
|---|---|---|---|---|
| a | 0 | 0 | w_ac | 0 |
| b | 0 | 0 | w_bc | w_bd |
| c | w_ca | w_cb | 0 | w_cd |
| d | 0 | w_db | w_dc | 0 |
An approximate solution to the graph partitioning problem:
From distances to affinities
Start from the pairwise distance matrix \({\bf D}\) and build the affinity matrix
\[ {\bf A} = \exp\left(-\frac{{\bf D}^{2}}{2\sigma^{2}}\right), \qquad a_{ii}=0. \]
The parameter \(\sigma\) controls the neighbourhood scale.
Normalized graph Laplacian
The normalized affinity matrix is
\[ {\bf L} = {\bf D}_{r}^{-1/2} {\bf A} {\bf D}_{r}^{-1/2} = {\bf Q}{\Lambda}{\bf Q}^{\sf T}, \]
where \({\bf D}_{r}=\operatorname{diag}({\bf r})\), \({\bf r}={\bf A}{\bf 1}\), \({\bf 1}\) is an \(n\)-dimensional vector of ones.
Spectral embedding
The spectral clustering solution is obtained by applying \(K\)-means to the rows of
\({\bf \tilde Q}\), the matrix containing the first \(K\) eigenvectors of \({\bf L}\).
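The steps from distances to a spectral clustering solution can be sketched in base R. The toy data, the value \(\sigma = 1\), and \(K = 2\) are illustrative assumptions; the code follows the formulas above as written and omits refinements (such as row normalization of the eigenvector matrix) that some spectral clustering variants add.

```r
set.seed(1)
# Two well-separated toy groups of 20 points each.
X <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), 20, 2),
  matrix(rnorm(40, mean = 3, sd = 0.3), 20, 2)
)
K     <- 2
sigma <- 1                                  # neighbourhood scale (assumed value)

D <- as.matrix(dist(X))                     # pairwise distance matrix
A <- exp(-D^2 / (2 * sigma^2))              # Gaussian affinities
diag(A) <- 0                                # a_ii = 0

r <- rowSums(A)                             # r = A 1
L <- diag(1 / sqrt(r)) %*% A %*% diag(1 / sqrt(r))   # normalized affinity matrix

eig     <- eigen(L, symmetric = TRUE)       # eigenvalues in decreasing order
Q_tilde <- eig$vectors[, 1:K]               # first K eigenvectors

cl <- kmeans(Q_tilde, centers = K, nstart = 20)$cluster
table(cl, rep(1:2, each = 20))
```

Because only \({\bf D}\) enters the computation, the same steps apply to any of the mixed-data dissimilarities discussed earlier.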
Interaction-aware distances can encode local connectivity and non-convex structure.
Design
Main feature
The clusters are not defined by continuous variables alone or categorical variables alone,
but by their cross-type interaction.
ab_dis_int clearly outperforms all competitors
ab_dis without interactions, Gower, modified Gower, and the naive distance remain close to chance-level separation
KNN is usually described as a lazy learner:
Reframing KNN
The distance is not just a preprocessing choice.
It determines the neighbourhoods used for classification or regression.
For mixed-type predictors, use a supervised distance with two components:
\[ D_{il} = D^n({\bf x}^n_i,{\bf x}^n_l) + D^c({\bf x}^c_i,{\bf x}^c_l). \]
Numerical part
Use discriminant information from \(y\)
to weight numerical differences.
Categorical part
Use the association between categories and \(y\)
to define category dissimilarities.
For continuous predictors, use the response to weight directions or variables.
Single-variable discriminant weighting
For numerical variable \(j\), define the Fisher score
\[ \sigma_j = \frac{B_j}{W_j}, \]
where \(B_j\) and \(W_j\) are the between- and within-group variances.
Then a supervised Manhattan-type distance is
\[ D^n({\bf x}^n_i,{\bf x}^n_l) = \sum_{j=1}^{Q_n} \sqrt{\sigma_j} \left|x^n_{ij}-x^n_{lj}\right|. \]
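A minimal sketch of the Fisher-weighted Manhattan distance, using the built-in `iris` data as a stand-in. The normalization of \(B_j\) and \(W_j\) is an assumption: here they are the between- and within-group mean squares, so the score coincides with the one-way ANOVA F statistic. `fisher_score()` is a hypothetical helper name.

```r
# Ratio of between-group to within-group variance for one numeric variable.
fisher_score <- function(x, y) {
  n_g <- tapply(x, y, length)               # group sizes
  m_g <- tapply(x, y, mean)                 # group means
  G   <- length(n_g)
  B <- sum(n_g * (m_g - mean(x))^2) / (G - 1)
  W <- sum(tapply(x, y, function(v) sum((v - mean(v))^2))) / (length(x) - G)
  B / W
}

X <- as.matrix(iris[, 1:4])
y <- iris$Species

fisher <- apply(X, 2, fisher_score, y = y)  # one score per numerical variable

# Weighted Manhattan distance: scale each column by sqrt(score), then sum |differences|.
D_n <- dist(sweep(X, 2, sqrt(fisher), `*`), method = "manhattan")

round(fisher, 1)
```

Variables that separate the response groups well receive larger weights and therefore dominate the neighbourhoods.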
For categorical predictors, compare categories through their response profiles.
Let \({\bf Z}_y\) be the indicator matrix of the response.
The supervised profile matrix is
\[ {\bf R}_s = {\bf P}_d^{-1} {\bf Z}^{\sf T}{\bf Z}_y. \]
The supervised category dissimilarity is
\[ \delta_s^j(a,b) = \frac{1}{2} \sum_{\ell=1}^{q_y} \left| {\bf r}_{a\ell}^{j y} - {\bf r}_{b\ell}^{j y} \right|. \]
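The supervised category dissimilarity can be illustrated on the ten-observation toy data, treating \(X_2\) as if it were the response. This is a sketch that reads the profile matrix as row profiles (the distribution of \(y\) within each category); `supervised_delta()` is a hypothetical helper name.

```r
x <- factor(c("A", "A", "A", "B", "B", "B", "C", "C", "C", "C"))  # predictor
y <- factor(c("u", "u", "v", "u", "u", "v", "u", "v", "v", "v"))  # response

# Response profile of each category: distribution of y within the category.
R_s <- prop.table(table(x, y), margin = 1)

# Total variation distance between every pair of response profiles.
supervised_delta <- function(profiles) {
  q <- nrow(profiles)
  delta <- matrix(0, q, q,
                  dimnames = list(rownames(profiles), rownames(profiles)))
  for (a in seq_len(q)) for (b in seq_len(q)) {
    delta[a, b] <- 0.5 * sum(abs(profiles[a, ] - profiles[b, ]))
  }
  delta
}

delta_s <- supervised_delta(R_s)
round(delta_s, 2)
```

Categories A and B have the same response profile, so they are treated as identical for prediction purposes, while C is separated from both.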
Data
The Carseats data are used to predict whether sales are high.
Compared distances
gower: robust Manhattan + matching
naive: Euclidean on scaled numerical variables and dummies
sup: supervised numerical weighting + supervised TVD
sup_add: additive supervised version
supf: full supervised version
sup, sup_add, and supf are clearly above gower and naive
manydist
A package to construct, diagnose, and use distances for continuous, categorical, and mixed-type data.
Distance construction
mdist()
step_mdist()
tidymodels workflows
Diagnostics
lovo_mdist()
compare_lovo_mdist()
benchmark_mdist()
Learning: model specs
Unsupervised learning
pam_dist()
spectral_dist()
Supervised learning
nearest_neighbor_dist()
manydist: socio-economic country profiles
Data
A 2022 World Bank / WDI snapshot of country-level socio-economic indicators.
Use manydist to build a mixed-type distance between countries and diagnose which variables shape the resulting dissimilarity structure.
| Country | Region | Income group | World Bank lending category | GDP per capita (k USD) | Life expectancy (years) | Unemployment (%) | Urban population (% total) | Population growth (%) |
|---|---|---|---|---|---|---|---|---|
| Tajikistan | Europe & Central Asia | Lower middle income | IDA | 1.1 | 71.6 | 7.1 | 26.2 | 2.14 |
| West Bank and Gaza | Middle East, North Africa, Afghanistan & Pakistan | Lower middle income | Not classified | 3.8 | 76.7 | 24.4 | 86.6 | 2.43 |
| Belarus | Europe & Central Asia | Upper middle income | IBRD | 8.0 | 74.1 | 3.6 | 78.5 | -0.80 |
| United Arab Emirates | Middle East, North Africa, Afghanistan & Pakistan | High income | Not classified | 50.8 | 80.5 | 2.9 | 85.5 | 5.09 |
| El Salvador | Latin America & Caribbean | Upper middle income | IBRD | 5.1 | 72.0 | 3.0 | 74.1 | 0.39 |
| New Zealand | East Asia & Pacific | High income | Not classified | 49.1 | 82.0 | 3.3 | 83.9 | -0.06 |
| Cyprus | Europe & Central Asia | High income | Not classified | 33.2 | 80.4 | 6.8 | 66.7 | 1.06 |
| Zambia | Sub-Saharan Africa | Lower middle income | IDA | 1.4 | 65.3 | 6.0 | 44.6 | 2.76 |
step_mdist() embeds the same distance specification into a modelling workflow
lovo_mdist_compare()
The same leave-one-variable-out diagnostic can be computed for several distance definitions and compared in one display.
# Attach the modelling framework and the distance tools used below.
library(tidymodels)
library(manydist)

set.seed(123)

# Data: prepare the classification task.
wdi_region <- wdi_data |>
  dplyr::filter(region != "North America") |>
  dplyr::mutate(
    region = droplevels(region)
  )

# Split: create training and test sets.
wdi_split <- initial_split(
  wdi_region,
  strata = region
)
wdi_train <- training(wdi_split)
wdi_test <- testing(wdi_split)

# Recipe: use step_mdist() to construct the distance representation.
wdi_rec <- recipe(region ~ ., data = wdi_train) |>
  update_role(country, new_role = "id") |>
  step_mdist(
    all_predictors(),
    preset = "u_dep"
  )

# Model: specify a distance-based KNN classifier.
knn_spec <- nearest_neighbor_dist(
  mode = "classification",
  neighbors = tune()
)

# Workflow: combine preprocessing and model specification.
wdi_wf <- workflow() |>
  add_recipe(wdi_rec) |>
  add_model(knn_spec)
set.seed(123)

# Resample: create cross-validation folds on the training set.
wdi_folds <- vfold_cv(
  wdi_train,
  v = 5,
  strata = region
)

# Grid: define candidate values for the number of neighbours.
knn_grid <- tibble(
  neighbors = c(1, 3, 5, 7, 9, 11, 15)
)

# Tune: evaluate each candidate value by cross-validation.
knn_tuned <- tune_grid(
  wdi_wf,
  resamples = wdi_folds,
  grid = knn_grid,
  metrics = metric_set(accuracy)
)

# Select: choose the best-performing number of neighbours.
best_k <- select_best(
  knn_tuned,
  metric = "accuracy"
)

# Finalize: insert the selected value into the workflow.
final_wf <- finalize_workflow(
  wdi_wf,
  best_k
)

# Test: fit the finalized workflow on the training set and evaluate it on the test set.
final_res <- last_fit(
  final_wf,
  split = wdi_split,
  metrics = metric_set(accuracy)
)
Test-set performance
| Metric | Estimate |
|---|---|
| accuracy | 0.681 |
Workflow
last_fit() fits the finalized workflow on the full training set and evaluates it once on the held-out test set.
Preprocessing is modelling
Distance-based learning is no exception.
Distance choices are contextual
Some choices are domain-driven; others depend on the data structure and the downstream task.
Towards a common ground
Similar distance-based ideas often appear under different names across statistics, machine learning, econometrics, psychometrics, and operational research.
A package ecosystem such as manydist can make these choices easier to compare and reuse.