Learning from Mixed-Type Data with Association-Aware Distances

From unbiased dissimilarities to response- and interaction-aware learning workflows

Alfonso Iodice D’Enza
Modelling and Forecasting of Socio-Economic Phenomena
May 12, 2026, Zakopane, Poland

A collaborative project

Outline

1. Distance-based learning with mixed data
Increasing awareness in distance construction

2. Scale/type-aware distances
Additivity, commensurability, and bias in mixed data

3. Association-aware distances
Redundancy, correlations, and categorical associations

4. Interaction- and response-aware extensions
Continuous–categorical relationships and supervised neighbourhoods

5. Building distance-based pipelines: manydist
Distance construction, diagnostics, and learning workflows

1. Distance-based learning with mixed data

Mixed-type data

Most real data applications do not contain only numerical variables.

Tabular data

In socio-economic applications, observations are often described by a mixture of
measurements, counts, categories, and ordered scales.

Numerical variables

income
age
prices
growth rates
emissions
employment rates

Categorical variables

region
education level
occupation
sector
household type
policy regime

Learning from mixed-type data

Many statistical learning methods rely, explicitly or implicitly, on comparing observations.

If the data are mixed-type, then comparison is not straightforward:

Scale

How do we compare variables measured in different units?

Type

How do we combine numerical differences with category mismatches?

Structure

How do we avoid counting associated information multiple times?

the aim is to compare observations while accounting for scale, type, and structure

Learning from distances

Terminology alert!

I will often use distance as shorthand for pairwise dissimilarity.

Some measures discussed here are not metrics in the strict mathematical sense,
but they all quantify how different two observations are.

Many learning methods can operate directly on a dissimilarity matrix.

Dimension reduction

Multidimensional scaling (MDS)¹

Clustering methods

Hierarchical clustering (HC), PAM², and spectral clustering³

Nearest-neighbour prediction

Nearest-neighbour classification and nearest-neighbour averaging for regression

choosing a different distance can lead to a different downstream analysis result

building pairwise dissimilarities: intuition

2 continuous variables: add up by-variable (absolute value or squared) differences

building pairwise dissimilarities: intuition

2 continuous and 1 categorical variables

building pairwise dissimilarities: intuition

one might consider purple and blue closer than e.g. purple and yellow

2. Scale/type-aware distances

Desirable properties¹

Multivariate Additivity

Let \(\mathbf{x}_i=\left(x_{i1}, \dots, x_{iQ}\right)\) denote a \(Q-\)dimensional vector. A distance function \(d\left(\mathbf{x}_i,\mathbf{x}_\ell\right)\) between observations \(i\) and \(\ell\) is multivariate additive if

\[ d\left(\mathbf{x}_i,\mathbf{x}_\ell\right)=\sum_{j=1}^{Q} d_j\left(\mathbf{x}_i,\mathbf{x}_\ell\right), \]

where \(d_j\left(\mathbf{x}_i,\mathbf{x}_\ell\right)\) denotes the \(j-\)th variable specific distance.

Manhattan distance satisfies the additivity property; the Euclidean distance does not
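
As a minimal sketch of the additivity property, the R lines below compare two hypothetical observations on three numerical variables. The values are made up for illustration only.

```r
# Two hypothetical observations on Q = 3 numerical variables
x_i <- c(1, 2, 3)
x_l <- c(4, 0, 3)

# Variable-specific distances d_j: absolute differences
d_j <- abs(x_i - x_l)

# Manhattan distance is the plain sum of the d_j, hence additive
d_manhattan <- sum(d_j)

# Euclidean distance sums squared differences and then takes a root,
# so it is not the sum of the variable-specific distances
d_euclidean <- sqrt(sum(d_j^2))

c(sum_of_dj = sum(d_j), manhattan = d_manhattan, euclidean = d_euclidean)
```

Whenever more than one variable differs, the Euclidean value is strictly smaller than the sum of the by-variable absolute differences.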

Desirable properties¹

If additivity holds, by-variable distances are added together: they should be on equivalent scales

Commensurability

Let \({\boldsymbol X}_i =\left(X_{i1}, \dots, X_{iQ}\right)\) denote a \(Q-\)dimensional random variable corresponding to an observation \(i\). Furthermore, let \(d_{j}\) denote the distance function corresponding to the \(j-\)th variable.

We have commensurability if, for all \(j\), and \(i \neq \ell\),

\[ E[d_{j}({ X}_{ij}, {X}_{\ell j})] = c, \]

where \(c\) is some constant.

Desirable properties¹

If the multivariate distance function \(d(\cdot,\cdot)\) satisfies additivity and commensurability, then ad hoc distance functions can be used for each variable and then aggregated.

then

one can pick the appropriate \(d_{j}(\cdot,\cdot)\), given the nature of \(X_{j}\)

well suited in the mixed data case

Mixed-data setup

a mixed data set

\(I\) observations described by \(Q\) variables, \(Q_{n}\) numerical and \(Q_{c}\) categorical
the \(I\times Q\) data matrix \({\bf X}=\left[{\bf X}_{n},{\bf X}_{c}\right]\) is column-wise partitioned

A formulation for mixed distance between observations \(i\) and \(\ell\):

\[\begin{eqnarray}\label{genmixeddist_formula} d\left(\mathbf{x}_i,\mathbf{x}_\ell\right)&=& \sum_{j_n=1}^{Q_n} d_{j_n}\left(\mathbf{x}^n_i,\mathbf{x}^n_\ell\right)+ \sum_{j_c=1}^{Q_c} d_{j_c}\left(\mathbf{x}^c_i,\mathbf{x}^c_\ell\right)=\\ &=& \sum_{j_n=1}^{Q_n} w_{j_n} \delta^n_{j_n}\left(\mathbf{x}^n_i,\mathbf{x}^n_\ell\right)+ \sum_{j_c=1}^{Q_c} w_{j_c}\delta^c_{j_c}\left(\mathbf{x}^c_i,\mathbf{x}^c_\ell\right) \end{eqnarray}\]

numeric case

\(\delta^n_{j_n}\) is a function quantifying the dissimilarity between observations on the \(j_n-\)th numerical variable
\(w_{j_n}\) is a weight for the \(j_n-\)th variable.

categorical case

\(\delta^c_{j_c}\) quantifies the dissimilarity between the categories chosen by subjects \(i\) and \(\ell\) for categorical variable \(j_c\)

\(w_{j_c}\) is a weight for the \(j_c-\)th variable

Distributions, scaling and bias: the numeric case

Synthetic data

\(I=500\) observations from normal, uniform, skewed, and bimodal distributions
skewed refers to a \(\chi^2_{1/2}\) distribution
bimodal: \(I/2\) draws from \(\chi^2_{1/2}\), censored at \(10\), and \(I/2\) draws from \(10-\chi^2_{1/2}\), censored at \(0\)

as long as variables have the same underlying distribution and scaling, commensurability holds

skewed variables may be under- or over-contributing to the distance, depending on the scaling (range and robust, respectively)
- the contribution of a variable to the overall distance may be biased
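
The scaling effect can be illustrated with a small simulation. This is only a sketch under stated assumptions: "range" scaling is taken to mean division by the sample range, "robust" scaling is taken to mean division by the interquartile range, and the helper names (`mean_pairwise`, `range_scale`, `iqr_scale`) are hypothetical. The scalings used in the original experiment may differ.

```r
set.seed(1)
I <- 500
x_norm <- rnorm(I)             # symmetric variable
x_skew <- rchisq(I, df = 0.5)  # heavily right-skewed variable

# Average by-variable distance over all pairs (absolute differences)
mean_pairwise <- function(x) mean(as.vector(dist(x)))

# Two assumed scalings: by the range and by the interquartile range
range_scale <- function(x) x / diff(range(x))
iqr_scale   <- function(x) x / IQR(x)

contrib <- rbind(
  range  = c(normal = mean_pairwise(range_scale(x_norm)),
             skewed = mean_pairwise(range_scale(x_skew))),
  robust = c(normal = mean_pairwise(iqr_scale(x_norm)),
             skewed = mean_pairwise(iqr_scale(x_skew)))
)
round(contrib, 3)
```

If the two columns of `contrib` differ within a row, the two variables are not commensurable under that scaling, even though both were "scaled".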

Categorical distances: the delta framework¹

From categories to distances

Let \({\bf Z}=[{\bf Z}_1,\ldots,{\bf Z}_{Q_c}]\) be the one-hot encoding of the categorical variables.

The pairwise categorical distance matrix can be written as

\[{\bf D}_{c}={\bf Z}{\bf \Delta}{\bf Z}^{\sf T}= \left[\begin{array}{ccc} {\bf Z}_{1} & \dots & {\bf Z}_{Q_{c}} \end{array} \right]\left[\begin{array}{ccc} {\bf\Delta}_1 & & \\ & \ddots &\\ & & {\bf\Delta}_{Q_{c}} \end{array} \right] \left[ \begin{array}{c} {\bf Z}_{1}^{\sf T}\\ \vdots \\ {\bf Z}_{Q_{c}}^{\sf T} \end{array} \right]\]

each \({\bf \Delta}_j\) defines the dissimilarity between the categories of variable \(j\)
different choices of \({\bf \Delta}_j\) imply different categorical distance measures
therefore, categorical distances can also suffer from scale and frequency-driven bias

distributions, scaling and bias: the categorical case

flat frequency distribution

Distance	Cat. dissimilarity	\(E[d(X_i, X_{\ell})]\)	\(q=2\)	\(q=5\)
Matching	\(\boldsymbol{\Delta}_m = \mathbf{1} \mathbf{1}^{\top} - \mathbf{I}\)	\(\frac{q-1}{q}\)	0.5	0.8
Eskin	\(\boldsymbol{\Delta}_e = \frac{2}{q^2}\boldsymbol{\Delta}_m\)	\(\frac{2(q-1)}{q^3}\)	0.250	0.064
Occurrence frequency (OF)	\(\boldsymbol{\Delta}_{OF} = \log^2(q)\boldsymbol{\Delta}_m\)	\(\log^2(q)\frac{q-1}{q}\)	0.240	2.072
Inverse OF	\(\boldsymbol{\Delta}_{IOF} = \log^2\left(\frac{I}{q}\right) \boldsymbol{\Delta}_m\)	\(\log^2\left(\frac{I}{q}\right)\frac{q-1}{q}\)	9.601	9.610
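
The first three rows of the table can be checked numerically. The sketch below assumes two independent draws from a flat distribution over \(q\) categories, so that the expected distance is \(\sum_{a}\sum_{b} p_a p_b \Delta_{ab}\), and it assumes natural logarithms. The inverse OF row is left out because it also depends on the sample size \(I\), which the table does not state. All function names are hypothetical.

```r
# Expected category dissimilarity for two independent draws with
# category probabilities p and category dissimilarity matrix Delta
expected_distance <- function(p, Delta) as.numeric(t(p) %*% Delta %*% p)

delta_matching <- function(q) matrix(1, q, q) - diag(q)
delta_eskin    <- function(q) (2 / q^2) * delta_matching(q)
delta_of       <- function(q) log(q)^2 * delta_matching(q)

res <- sapply(c(2, 5), function(q) {
  p <- rep(1 / q, q)  # flat frequency distribution
  c(matching = expected_distance(p, delta_matching(q)),
    eskin    = expected_distance(p, delta_eskin(q)),
    of       = expected_distance(p, delta_of(q)))
})
colnames(res) <- c("q=2", "q=5")
round(res, 3)
```

Even with flat frequencies, the expected contribution of a categorical variable depends on both the chosen measure and the number of categories.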

skewed frequency distribution

\(q\in \{2,3,5,10\}\)
\(p_1 \in \{0.05,0.1,0.2,0.33, 0.5,0.66, 0.8,0.9,0.95\}\)
\(p_j = (1-p_1)/(q-1)\), with \(j=2,\dots,q\),

The expected distance increases with the heterogeneity of the distribution and with the number of categories

Independence-based distances

Independence-based pairwise distance

No inter-variable relations are considered.

in the continuous case: Euclidean or Manhattan distances
in the categorical case: Hamming / matching distance, among many others
in the mixed-data case: Gower dissimilarity index

variable contributions may be balanced, but still treated as separate sources of information

Beyond commensurability

commensurability makes variable contributions comparable across scales and data types.

If variables are correlated or associated, the same information may contribute repeatedly to the distance: redundancy

the next step is to account for the structure among variables

3. Association-aware distances

by variable differences: independence-based

When variables are correlated or associated, shared information is effectively counted multiple times
Inflated dissimilarities can distort downstream unsupervised learning tasks.

by variable differences: independence-based

The Euclidean distance \(\longrightarrow\) shared information is over-counted

accounting for inter-variable relations: association-based

The Mahalanobis distance \(\longrightarrow\) shared information is not over-counted

this is an association-based distance for continuous data

association-based distance

Association-based for continuous: Mahalanobis distance

Let \({\bf X}_{con}\) be an \(n\times Q_{d}\) data matrix of \(n\) observations described by \(Q_{d}\) continuous variables, and let \(\bf S\) be the sample covariance matrix. The Mahalanobis distance matrix is

\[ {\bf D}_{mah} = \left[\operatorname{diag}({\bf G})\,{\bf 1}_{n}^{\sf T} + {\bf 1}_{n}\,\operatorname{diag}({\bf G})^{\sf T} - 2{\bf G}\right]^{\odot 1/2} \] where

\([\cdot]^{\odot 1/2}\) denotes the element-wise square root
\({\bf G}=({\bf C}{\bf X}_{con}){\bf S}^{-1}({\bf C}{\bf X}_{con})^{\sf T}\) is the Mahalanobis Gram matrix
\({\bf C}={\bf I}_{n}-\tfrac{1}{n}{\bf 1}_{n}{\bf 1}_{n}^{\sf T}\) is the centering operator
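
A minimal base-R sketch of the Gram-matrix formula, on simulated data, is given below. The check against `stats::mahalanobis()` (which returns squared distances to a given centre) is added here for illustration and is not part of the original material.

```r
set.seed(1)
n <- 6
X <- matrix(rnorm(n * 3), nrow = n, ncol = 3)  # simulated continuous data
S <- cov(X)                                    # sample covariance matrix

# Centering operator and Mahalanobis Gram matrix
C  <- diag(n) - matrix(1 / n, n, n)
Xc <- C %*% X
G  <- Xc %*% solve(S) %*% t(Xc)

# diag(G) 1' + 1 diag(G)' - 2 G, followed by an element-wise square root
g  <- diag(G)
D2 <- outer(g, rep(1, n)) + outer(rep(1, n), g) - 2 * G
D_mah <- sqrt(pmax(D2, 0))  # pmax() guards against tiny negative round-off

# Reference: pairwise distances from stats::mahalanobis()
D_ref <- sqrt(sapply(seq_len(n), function(i) mahalanobis(X, X[i, ], S)))
all.equal(D_mah, D_ref)
```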

association-based distance

Association-based for categorical: total variation distance (TVD) (Le & Ho, 2005)

The distance matrix \({\bf D}_{tvd}\) can be defined via the delta framework upon properly defining the block-diagonal matrix \({\bf \Delta}\)

Let \({\bf X}_{cat}\) be an \(n\times Q_{c}\) data matrix of \(n\) observations described by \(Q_{c}\) categorical variables.

\[ {\bf D} = {\bf Z}{\Delta}{\bf Z}^{\sf T} = \left[\begin{array}{ccc} {\bf Z}_{1} & \dots & {\bf Z}_{Q_{c}} \end{array} \right]\left[\begin{array}{ccc} {\bf\Delta}_1 & & \\ & \ddots &\\ & & {\bf\Delta}_{Q_{c}} \end{array} \right] \left[ \begin{array}{c} {\bf Z}_{1}^{\sf T}\\ \vdots \\ {\bf Z}_{Q_{c}}^{\sf T} \end{array} \right] \]

in the framework, the choice of \({\Delta}_j\) determines the categorical distance measure (independence- or association-based)

association-based distance

Association-based for categorical: total variation distance (TVD) (Le & Ho, 2005) (2)

Consider the empirical joint probability distributions stored in the off-diagonal blocks of \({\bf P}\):

\[ {\bf P} = \frac{1}{n} \begin{bmatrix} {\bf Z}_1^{\sf T}{\bf Z}_1 & {\bf Z}_1^{\sf T}{\bf Z}_2 & \cdots & {\bf Z}_1^{\sf T}{\bf Z}_{Q_c} \\ \vdots & \ddots & \vdots & \vdots \\ {\bf Z}_{Q_c}^{\sf T}{\bf Z}_1 & {\bf Z}_{Q_c}^{\sf T}{\bf Z}_2 & \cdots & {\bf Z}_{Q_c}^{\sf T}{\bf Z}_{Q_c} \end{bmatrix}. \]

The conditional probability distributions of each variable \(j\) given each variable \(i\) (\(i,j=1,\ldots,Q_c\), \(i\neq j\)) are stored in the block matrix

\[ {\bf R} = {\bf P}_z^{-1}({\bf P} - {\bf P}_z). \]

where \({\bf P}_z = {\bf P} \odot {\bf I}_{Q^*}\), \({\bf I}_{Q^*}\) is the \(Q^*\times Q^*\) identity matrix, and \(Q^*\) is the total number of categories (the number of columns of \({\bf Z}\)).

association-based distance

Association-based for categorical: total variation distance (TVD) (Le & Ho, 2005) (3)

Let \({\bf r}^{ji}_a\) and \({\bf r}^{ji}_b\) be the rows of \({\bf R}_{ji}\), the \((j,i)\)th off-diagonal block of \({\bf R}\).

The category dissimilarity between \(a\) and \(b\) for variable \(j\) based on the total variation distance (TVD) is defined as

\[ \delta^{j}_{tvd}(a,b) = \sum_{i\neq j}^{Q_c} w_{ji} \Phi^{ji}({\bf r}^{ji}_{a},{\bf r}^{ji}_{b}) = \sum_{i\neq j}^{Q_c} w_{ji} \left[\frac{1}{2}\sum_{\ell=1}^{q_i} |{\bf r}^{ji}_{a\ell}-{\bf r}^{ji}_{b\ell}|\right], \label{ab_delta} \]

where \(w_{ji}=1/(Q_c-1)\) for equal weighting (can be user-defined).

The TVD-based dissimilarity matrix is therefore

\[ {\bf D}_{tvd}= {\bf Z}{\Delta}^{(tvd)}{\bf Z}^{\sf T}. \]

association-based distance: a small example

Data

Consider two categorical variables:

\(X_1\) with categories \(A,B,C\)
\(X_2\) with categories \(u,v\)

# A tibble: 10 × 3
      id X1    X2   
   <int> <fct> <fct>
 1     1 A     u    
 2     2 A     u    
 3     3 A     v    
 4     4 B     u    
 5     5 B     u    
 6     6 B     v    
 7     7 C     u    
 8     8 C     v    
 9     9 C     v    
10    10 C     v

Indicator matrices

\[ {\bf Z}_1 = \begin{pmatrix} 1&0&0\\ 1&0&0\\ 1&0&0\\ 0&1&0\\ 0&1&0\\ 0&1&0\\ 0&0&1\\ 0&0&1\\ 0&0&1\\ 0&0&1 \end{pmatrix}, \qquad {\bf Z}_2 = \begin{pmatrix} 1&0\\ 1&0\\ 0&1\\ 1&0\\ 1&0\\ 0&1\\ 1&0\\ 0&1\\ 0&1\\ 0&1 \end{pmatrix}. \]

association-based distance: from \({\bf Z}\) to \({\bf P}\)

Let

\[ {\bf Z} = [{\bf Z}_1,{\bf Z}_2]. \]

The empirical co-occurrence matrix is

\[ {\bf P} = \frac{1}{10}{\bf Z}^{\sf T}{\bf Z}. \]

For this example,

\[ {\bf P} = \begin{pmatrix} \color{#2A9D8F}{0.30} & \color{#2A9D8F}{0} & \color{#2A9D8F}{0} & \color{#E76F51}{0.20} & \color{#E76F51}{0.10}\\ \color{#2A9D8F}{0} & \color{#2A9D8F}{0.30} & \color{#2A9D8F}{0} & \color{#E76F51}{0.20} & \color{#E76F51}{0.10}\\ \color{#2A9D8F}{0} & \color{#2A9D8F}{0} & \color{#2A9D8F}{0.40} & \color{#E76F51}{0.10} & \color{#E76F51}{0.30}\\ \color{#E76F51}{0.20} & \color{#E76F51}{0.20} & \color{#E76F51}{0.10} & \color{#2A9D8F}{0.50} & \color{#2A9D8F}{0}\\ \color{#E76F51}{0.10} & \color{#E76F51}{0.10} & \color{#E76F51}{0.30} & \color{#2A9D8F}{0} & \color{#2A9D8F}{0.50} \end{pmatrix}. \]

diagonal blocks contain marginal information; off-diagonal blocks contain joint proportions

association-based distance: from \({\bf P}\) to \({\bf R}\)

The diagonal part of \({\bf P}\) is

\[ {\bf P}_z = {\bf P} \odot {\bf I}_{Q^*} = \operatorname{diag}(0.30,0.30,0.40,0.50,0.50). \]

The block matrix of conditional profiles is

\[ {\bf R} = {\bf P}_z^{-1}({\bf P}-{\bf P}_z). \]

For this example,

\[ {\bf R} = \begin{pmatrix} \color{#2A9D8F}{0} & \color{#2A9D8F}{0} & \color{#2A9D8F}{0} & \color{#E76F51}{0.67} & \color{#E76F51}{0.33}\\ \color{#2A9D8F}{0} & \color{#2A9D8F}{0} & \color{#2A9D8F}{0} & \color{#E76F51}{0.67} & \color{#E76F51}{0.33}\\ \color{#2A9D8F}{0} & \color{#2A9D8F}{0} & \color{#2A9D8F}{0} & \color{#E76F51}{0.25} & \color{#E76F51}{0.75}\\ \color{#E76F51}{0.40} & \color{#E76F51}{0.40} & \color{#E76F51}{0.20} & \color{#2A9D8F}{0} & \color{#2A9D8F}{0}\\ \color{#E76F51}{0.20} & \color{#E76F51}{0.20} & \color{#E76F51}{0.60} & \color{#2A9D8F}{0} & \color{#2A9D8F}{0} \end{pmatrix}. \]

each off-diagonal block contains conditional profiles across variables

association-based distance: reading \({\bf R}_{12}\)

For the categories of \(X_1\), the relevant block is

\[ {\bf R}_{12} = \begin{pmatrix} 0.67 & 0.33\\ 0.67 & 0.33\\ 0.25 & 0.75 \end{pmatrix}. \]

Interpretation

Rows of \({\bf R}_{12}\) describe the distribution of \(X_2\) within each category of \(X_1\):

category \(A\): \(P(X_2=u \mid X_1=A)=0.67\), \(P(X_2=v \mid X_1=A)=0.33\)
category \(B\): \(P(X_2=u \mid X_1=B)=0.67\), \(P(X_2=v \mid X_1=B)=0.33\)
category \(C\): \(P(X_2=u \mid X_1=C)=0.25\), \(P(X_2=v \mid X_1=C)=0.75\)

categories are compared through their association profiles

association-based distance: from \({\bf R}\) to \(\Delta_1^{(tvd)}\)

Compare the rows of \({\bf R}_{12}\) using TVD.

\[ \delta^{1}_{tvd}(A,B) = \frac{1}{2} \left( |0.67-0.67| + |0.33-0.33| \right) = 0. \]

\[ \delta^{1}_{tvd}(A,C) = \frac{1}{2} \left( |0.67-0.25| + |0.33-0.75| \right) = 0.42. \]

\[ \delta^{1}_{tvd}(B,C) = 0.42. \]

Therefore,

\[ \Delta^{(tvd)}_1 = \begin{pmatrix} 0 & 0 & 0.42\\ 0 & 0 & 0.42\\ 0.42 & 0.42 & 0 \end{pmatrix}. \]

\(A\) and \(B\) are close because they have the same profile with respect to \(X_2\)

association-based distance: reading \({\bf R}_{21}\)

For the categories of \(X_2\), the relevant block is

\[ {\bf R}_{21} = \begin{pmatrix} 0.40 & 0.40 & 0.20\\ 0.20 & 0.20 & 0.60 \end{pmatrix}. \]

Interpretation

Rows of \({\bf R}_{21}\) describe the distribution of \(X_1\) within each category of \(X_2\):

category \(u\): \(P(X_1=A \mid X_2=u)=0.40\), \(P(X_1=B \mid X_2=u)=0.40\), \(P(X_1=C \mid X_2=u)=0.20\)
category \(v\): \(P(X_1=A \mid X_2=v)=0.20\), \(P(X_1=B \mid X_2=v)=0.20\), \(P(X_1=C \mid X_2=v)=0.60\)

categories of \(X_2\) are compared through their association profiles with respect to \(X_1\)

association-based distance: from \({\bf R}_{21}\) to \(\Delta_2^{(tvd)}\)

Compare the rows of \({\bf R}_{21}\) using TVD.

\[ \delta^{2}_{tvd}(u,v) = \frac{1}{2} \left( |0.40-0.20| + |0.40-0.20| + |0.20-0.60| \right) = 0.40. \]

Therefore,

\[ \Delta^{(tvd)}_2 = \begin{pmatrix} 0 & 0.40\\ 0.40 & 0 \end{pmatrix}. \]

\(u\) and \(v\) are different because they imply different profiles over \(X_1\)

association-based distance: from \(\Delta\) to \({\bf D}\)

We collect the category dissimilarity matrices in a block-diagonal matrix:

\[ \Delta^{(tvd)} = \begin{pmatrix} \color{#2A9D8F}{\Delta^{(tvd)}_1} & \color{#E76F51}{0}\\ \color{#E76F51}{0} & \color{#2A9D8F}{\Delta^{(tvd)}_2} \end{pmatrix}. \]

The observation-level categorical distance matrix is then

\[ {\bf D}_{tvd} = {\bf Z}\Delta^{(tvd)}{\bf Z}^{\sf T} = \begin{bmatrix} {\bf Z}_1 & {\bf Z}_2 \end{bmatrix} \begin{pmatrix} \Delta^{(tvd)}_1 & 0\\ 0 & \Delta^{(tvd)}_2 \end{pmatrix} \begin{bmatrix} {\bf Z}_1^{\sf T}\\ {\bf Z}_2^{\sf T} \end{bmatrix}. \]

Equivalently,

\[ {\bf D}_{tvd} = {\bf Z}_1\Delta^{(tvd)}_1{\bf Z}_1^{\sf T} + {\bf Z}_2\Delta^{(tvd)}_2{\bf Z}_2^{\sf T}. \]

category-level dissimilarities are translated into observation-level distances
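
The whole small example, from \({\bf Z}\) to \({\bf D}_{tvd}\), can be reproduced in a few lines of base R. This is a sketch that follows the matrix formulas on these slides, not the package implementation. The helper names `one_hot` and `tvd` are hypothetical. With \(Q_c=2\), the equal weights \(w_{ji}=1/(Q_c-1)\) are all equal to 1.

```r
X1 <- factor(c("A", "A", "A", "B", "B", "B", "C", "C", "C", "C"))
X2 <- factor(c("u", "u", "v", "u", "u", "v", "u", "v", "v", "v"))
n  <- length(X1)

# Indicator (one-hot) matrices, one column per category
one_hot <- function(f) sapply(levels(f), function(l) as.numeric(f == l))
Z1 <- one_hot(X1)
Z2 <- one_hot(X2)
Z  <- cbind(Z1, Z2)

# Co-occurrence proportions, their diagonal part, and conditional profiles
P  <- crossprod(Z) / n
Pz <- diag(diag(P))
R  <- solve(Pz) %*% (P - Pz)

R12 <- R[1:3, 4:5]  # profiles of A, B, C over the categories of X2
R21 <- R[4:5, 1:3]  # profiles of u, v over the categories of X1

# Total variation distance between the rows of a profile block
tvd <- function(Rb) as.matrix(dist(Rb, method = "manhattan")) / 2
Delta1 <- tvd(R12)
Delta2 <- tvd(R21)

# Observation-level distances
D_tvd <- Z1 %*% Delta1 %*% t(Z1) + Z2 %*% Delta2 %*% t(Z2)
round(Delta1, 2)
round(Delta2, 2)
```

Observations 1 and 2, which share both categories, are at distance 0; observations 1 (A, u) and 8 (C, v) accumulate both category dissimilarities.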

From distances to data representation

Different distance definitions induce different distance-based representations of the same data.

Same data, different representation

Changing the distance changes the global dissimilarity structure on which downstream learning methods rely.

Leave-one-variable-out diagnostics

How can we measure the contribution of each variable to this structure?

compare the dissimilarity matrix computed with and without the variable in question

LOVO-based benchmark: evaluated distances

The benchmark compares distance definitions that differ in how they treat scale, type, additivity, and association.

Additive distances

gower: classical Gower dissimilarity
mod_gower: modified Gower coefficients (Liu et al., 2024)
hl_add: additive version of Hennig–Liao scaling (Hennig & Liao, 2013)
u_ind: unbiased independence-based distance
u_dep: unbiased association-based distance
u_mix: unbiased Manhattan and TVD

Non-additive distances

naive: Euclidean distance on scaled numerical variables and one-hot-encoded categorical variables
hl: Hennig–Liao scaling with Euclidean distance
gudmm: generalized multi-aspect distance metric for mixed-type data (Mousavi & Sehhati, 2023)
dkps: distance using kernel product similarity (Ghashti & Thompson, 2025)

LOVO-based diagnostics: what is evaluated?

For each distance and each variable \(X_j\), we compare the full-data representation \({\bf D}\) with the representation obtained after removing \(X_j\), that is \({\bf D}_{-j}\).

1. Distance level

Numeric comparison between \({\bf D}\) and \({\bf D}_{-j}\).

mean absolute difference between distance matrices.

2. MDS level

Compute MDS from \({\bf D}\) and from \({\bf D}_{-j}\), then compare the resulting configurations.

alienation coefficient between MDS representations.

LOVO diagnostics assess how each variable contributes to the dissimilarity structure
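
The distance-level diagnostic can be sketched in base R for the simplest case, an unscaled Manhattan distance on numerical variables. The function `lovo_mad` below is a hypothetical helper written for illustration; it is not the package's `lovo_mdist()`.

```r
# Leave-one-variable-out mean absolute difference between D and D_{-j},
# here for a plain Manhattan distance on a numeric matrix X
lovo_mad <- function(X) {
  D_full <- as.matrix(dist(X, method = "manhattan"))
  off <- upper.tri(D_full)  # each pair of observations counted once
  out <- sapply(seq_len(ncol(X)), function(j) {
    D_minus <- as.matrix(dist(X[, -j, drop = FALSE], method = "manhattan"))
    mean(abs(D_full[off] - D_minus[off]))
  })
  names(out) <- colnames(X)
  out
}

# Toy data: variable b lives on a scale ten times larger than a
X <- cbind(a = c(0, 1, 2), b = c(0, 10, 20))
lovo_mad(X)
```

For an additive distance, removing a variable removes exactly its own by-variable distance, so the diagnostic equals the mean contribution of that variable. Here the unscaled variable `b` dominates.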

LOVO diagnostics: distance-level effect

LOVO diagnostics: MDS-level effect

commensurability balances expected distance contributions, not necessarily the role of variables in every downstream representation

From diagnostics to downstream learning

LOVO diagnostics show how variables affect the distance matrix and the MDS representation.

But we also want to know whether distance biases affect a downstream learning task.

Unsupervised classification experiment

Use each distance matrix as input to PAM and evaluate how well the resulting partition recovers the known cluster structure.

Unsupervised classification experiment

Data generation

\(n = 200\) observations from \(4\) equal-sized clusters
data generated with genRandomClust
each dataset contains \(8\) numerical and \(8\) categorical variables
categorical variables are obtained by discretizing numerical variables into \(9\) categories
scenarios vary the number of signal and noise variables within each type
\(100\) datasets are generated for each scenario

Evaluation

For each mixed-data distance, PAM is applied to the dissimilarity matrix with \(K = 4\).
Recovery of the true cluster labels is measured using the adjusted Rand index.

PAM-based clustering results

hl performs well when categorical variables are noise, but poorly when numerical variables are noise
gower tends to show the opposite pattern
u_mix and u_dep are comparatively stable in the mixed signal/noise scenarios

4. Interaction- and response-aware extensions

Interaction-aware distances

Association-aware distances account for relations within variable blocks:

continuous–continuous relations;
categorical–categorical relations.

Cross-type structure

In mixed data, categorical differences may be meaningful because they are reflected in the continuous variables.

the next step is to make distances interaction-aware

How to measure interactions¹

Define \(\Delta^{int}\) to account for continuous–categorical interactions and use it to augment \(\Delta^{tvd}\).

The mixed dissimilarity becomes

\[ {\bf D}_{mix}^{(int)} = {\bf D}_{mah} + {\bf D}_{cat}^{(int)}. \]

where

\[ {\bf D}_{cat}^{(int)}={\bf Z}\tilde{\Delta}{\bf Z}^\top \]

and

\[ \tilde{\Delta} = (1-\alpha)\Delta^{tvd} + \alpha \Delta^{int}, \qquad \alpha=\frac{1}{Q_c}. \]

What is \(\Delta^{int}\)?

The entry \(\delta_{int}^{j}(a,b)\) measures how much the continuous variables help discriminate between observations choosing category \(a\) and those choosing category \(b\) for categorical variable \(j\).

Category-pair classification problem

For each pair \((a,b)\):

use the continuous variables as predictors;
classify observations belonging to categories \(a\) and \(b\);
use a nearest-neighbour rule in the continuous space.

Computing \(\Delta^{int}_{j}\)

For each categorical variable \(j\) and each pair of categories \((a,b)\):

use \({\bf D}_{mah}\) to identify neighbours in the continuous space;
consider a proportion of neighbours, say \(\hat{\pi}_{nn}=0.1\);
classify observations using a prior-corrected decision rule;
compute balanced accuracy.

\[ \delta_{int}^{j}(a,b) = \frac{1}{2} \left( \frac{\texttt{true } a}{\texttt{true } a + \texttt{false } a} + \frac{\texttt{true } b}{\texttt{true } b + \texttt{false } b} \right). \]
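
The category-pair computation can be sketched as follows. Several details are assumptions made for illustration: the function name `pair_separability` is hypothetical, the "prior-corrected" rule is read as "predict \(a\) when the share of \(a\) among the neighbours exceeds the overall share of \(a\) within the pair", and a Euclidean distance on one toy variable stands in for \({\bf D}_{mah}\).

```r
# Balanced accuracy of a nearest-neighbour rule that separates
# categories a and b, given a distance matrix D on the continuous part
pair_separability <- function(D, labels, a, b, prop_nn = 0.1) {
  keep <- labels %in% c(a, b)
  Dab  <- D[keep, keep, drop = FALSE]
  is_a <- labels[keep] == a
  m <- length(is_a)
  k <- max(1, ceiling(prop_nn * m))  # number of neighbours considered
  prior_a <- mean(is_a)              # share of category a within the pair
  pred_a <- vapply(seq_len(m), function(i) {
    nn <- order(Dab[i, -i])[seq_len(k)]  # nearest neighbours, excluding i
    mean(is_a[-i][nn]) > prior_a         # assumed prior-corrected rule
  }, logical(1))
  recall_a <- mean(pred_a[is_a])
  recall_b <- mean(!pred_a[!is_a])
  (recall_a + recall_b) / 2
}

# Toy data: categories "a" and "b" are far apart on the continuous variable
x <- c(seq(0, 1, length.out = 20), seq(100, 101, length.out = 20),
       seq(0, 1, length.out = 10))
labels <- rep(c("a", "b", "c"), times = c(20, 20, 10))
D <- as.matrix(dist(x))
pair_separability(D, labels, "a", "b")
```

In this toy case the two categories are perfectly separated in the continuous space, so the balanced accuracy reaches its maximum of 1.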

Well separated or not?

high separability \(\Rightarrow\) high interaction dissimilarity

Building \(\Delta^{int}_{j}\)

For categorical variable \(j\) with \(q_j\) categories, compute
\(\frac{q_j(q_j -1)}{2}\) category-pair quantities.

\[ \Delta^{int}_{j} = \begin{pmatrix} 0 & 0.94 & 0.40 & 0.39 \\ 0.94 & 0 & 0.54 & 0.55 \\ 0.40 & 0.54 & 0 & 0 \\ 0.39 & 0.55 & 0 & 0 \end{pmatrix} \]

the interaction matrix summarizes category-pair separability in the continuous space

Why this interaction is asymmetric

We model the joint structure as

\[ f({\bf x}_{con},{\bf x}_{cat}) = f({\bf x}_{con}) f({\bf x}_{cat}\mid {\bf x}_{con}). \]

The interaction term asks how categorical distinctions are reflected in the continuous geometry.

Spectral clustering: a graph partitioning problem

Graph representation

A graph representation of the data matrix \({\bf X}\): the aim is to cut it into \(K\) groups, or clusters.

The affinity matrix \({\bf A}\)

The elements \(w_{ij}\) of \({\bf A}\) are high when observations \(i\) and \(j\) are likely to belong to the same group, and low otherwise.

.	a	b	c	d
a	0	0	w_ac	0
b	0	0	w_bc	w_bd
c	w_ca	w_cb	0	w_cd
d	0	w_db	w_dc	0

Spectral clustering: making the graph easy to cut

An approximate solution to the graph partitioning problem:

From distances to affinities

Start from the pairwise distance matrix \({\bf D}\) and build the affinity matrix

\[ {\bf A} = \exp\left(-\frac{{\bf D}^{2}}{2\sigma^{2}}\right), \qquad a_{ii}=0. \]

The parameter \(\sigma\) controls the neighbourhood scale.

Normalized graph Laplacian

The normalized affinity matrix is

\[ {\bf L} = {\bf D}_{r}^{-1/2} {\bf A} {\bf D}_{r}^{-1/2} = {\bf Q}{\Lambda}{\bf Q}^{\sf T}, \]

where \({\bf D}_{r}=\operatorname{diag}({\bf r})\), \({\bf r}={\bf A}{\bf 1}\), \({\bf 1}\) is an \(n\)-dimensional vector of ones.

Spectral embedding

The spectral clustering solution is obtained by applying \(K\)-means to the rows of
\({\bf \tilde Q}\), the matrix containing the first \(K\) eigenvectors of \({\bf L}\).
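
The three steps above (affinities, normalisation, embedding plus \(K\)-means) can be sketched in base R. The function `spectral_from_dist` is a hypothetical helper, not the package's `spectral_dist()`, and the toy data and the value of \(\sigma\) are chosen only for illustration.

```r
spectral_from_dist <- function(D, K, sigma) {
  # Gaussian affinities with zero diagonal
  A <- exp(-D^2 / (2 * sigma^2))
  diag(A) <- 0
  # Symmetric normalisation D_r^{-1/2} A D_r^{-1/2}
  r <- rowSums(A)
  L <- A / sqrt(outer(r, r))
  # eigen() sorts eigenvalues in decreasing order for symmetric input,
  # so the first K columns are the leading eigenvectors
  Q <- eigen(L, symmetric = TRUE)$vectors[, seq_len(K), drop = FALSE]
  # K-means on the rows of the spectral embedding
  kmeans(Q, centers = K, nstart = 20)$cluster
}

# Toy data: two well-separated groups on one variable
x <- c(seq(0, 1, length.out = 20), seq(10, 11, length.out = 20))
D <- as.matrix(dist(x))
set.seed(1)
cl <- spectral_from_dist(D, K = 2, sigma = 1)
table(cl, rep(c("left", "right"), each = 20))
```

Any pairwise distance matrix can be passed as `D`, which is what makes the method a natural consumer of mixed-type distances.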

Why spectral clustering here?

Interaction-aware distances can encode local connectivity and non-convex structure.

Experiment: interaction-driven clusters

Design

\(n=500\) and \(n=1000\)
six continuous variables: \(V_1,\ldots,V_6\)
three categorical variables: \(C_1,C_2,C_3\)
\(V_4,V_5,V_6\) generated independently from \(N(0,1)\)
\(V_1,V_2,V_3\) generated conditionally on \(C_1\) and \(C_2\)

Main feature

The clusters are not defined by continuous variables alone or categorical variables alone,
but by their cross-type interaction.

Interaction-aware spectral clustering results

when the cluster structure is interaction-driven, ab_dis_int clearly outperforms all competitors
ab_dis without interactions, Gower, modified Gower, and the naive distance remain close to chance-level separation
the result supports the need to explicitly encode continuous–categorical interactions

Response-aware distances for KNN

KNN is usually described as a lazy learner:

store the training data;
compute distances from a new observation to the training observations;
predict from the responses of the nearest neighbours.

Reframing KNN

The distance is not just a preprocessing choice.
It determines the neighbourhoods used for classification or regression.

in supervised learning, the response can help define these neighbourhoods

Response-aware mixed distance

For mixed-type predictors, use a supervised distance with two components:

\[ D_{il} = D^n({\bf x}^n_i,{\bf x}^n_l) + D^c({\bf x}^c_i,{\bf x}^c_l). \]

Numerical part

Use discriminant information from \(y\)
to weight numerical differences.

Categorical part

Use the association between categories and \(y\)
to define category dissimilarities.

Numerical part: discriminant weighting¹

For continuous predictors, use the response to weight directions or variables.

Single-variable discriminant weighting

For numerical variable \(j\), define the Fisher score

\[ \sigma_j = \frac{B_j}{W_j}, \]

where \(B_j\) and \(W_j\) are the between- and within-group variances.

Then a supervised Manhattan-type distance is

\[ D^n({\bf x}^n_i,{\bf x}^n_l) = \sum_{j=1}^{Q_n} \sqrt{\sigma_j} \left|x^n_{ij}-x^n_{lj}\right|. \]
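
A minimal sketch of the single-variable weighting follows. It assumes that \(B_j\) and \(W_j\) are the between- and within-group mean squares of a one-way ANOVA; other normalisations are possible and the original may use a different one. The helper `fisher_score` and the toy data are hypothetical.

```r
# Fisher score of one numerical variable x for a grouping y:
# between-group mean square divided by within-group mean square
fisher_score <- function(x, y) {
  y <- as.factor(y)
  n_g   <- as.vector(table(y))
  means <- as.vector(tapply(x, y, mean))
  B <- sum(n_g * (means - mean(x))^2) / (nlevels(y) - 1)
  W <- sum((x - means[as.integer(y)])^2) / (length(x) - nlevels(y))
  B / W
}

# Toy data: x1 separates the two classes, x2 carries no class information
y <- rep(c("low", "high"), each = 5)
X <- cbind(x1 = c(1:5, 11:15), x2 = rep(1:5, times = 2))

# Weights sqrt(sigma_j), then a weighted Manhattan distance;
# since the weights are non-negative, rescaling the columns is equivalent
w <- sqrt(apply(X, 2, fisher_score, y = y))
D_sup <- as.matrix(dist(sweep(X, 2, w, "*"), method = "manhattan"))
w
```

The uninformative variable `x2` receives weight 0 and therefore drops out of the supervised distance.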

Categorical part: supervised TVD

For categorical predictors, compare categories through their response profiles.

Let \({\bf Z}_y\) be the indicator matrix of the response.
The supervised profile matrix is

\[ {\bf R}_s = {\bf P}_d^{-1} {\bf Z}^{\sf T}{\bf Z}_y. \]

The supervised category dissimilarity is

\[ \delta_s^j(a,b) = \frac{1}{2} \sum_{\ell=1}^{q_y} \left| {\bf r}_{a\ell}^{j y} - {\bf r}_{b\ell}^{j y} \right|. \]

categories are close if they show similar response distributions
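
For a single categorical predictor, the supervised profiles and the resulting category dissimilarities can be sketched with base R tables. Here the rows of \({\bf R}_s\) are taken to be the response distributions within each category, computed with `prop.table()`; the toy predictor and response are hypothetical.

```r
# Toy categorical predictor and binary response
x <- factor(c("A", "A", "B", "B", "C", "C"))
y <- factor(c("high", "low", "high", "low", "high", "high"))

# Response profile of each category: P(y | x = category)
R_s <- unclass(prop.table(table(x, y), margin = 1))

# Supervised TVD between categories: half the L1 distance between profiles
Delta_s <- as.matrix(dist(R_s, method = "manhattan")) / 2
Delta_s
```

Categories A and B have the same response profile and are at dissimilarity 0, while C, whose observations are all "high", is at 0.5 from both.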

Response-aware KNN: Carseats example

Data

The Carseats data are used to predict whether sales are high.

7 numerical predictors
3 categorical predictors
categorical predictors with 2 or 3 categories
binary response: high vs. low sales

Compared distances

gower: robust Manhattan + matching
naive: Euclidean on scaled numerical variables and dummies
sup: supervised numerical weighting + supervised TVD
sup_add: additive supervised version
supf: full supervised version

Response-aware KNN: Carseats results

response-aware distances improve nearest-neighbour classification accuracy
sup, sup_add, and supf are clearly above gower and naive
the response helps define neighbourhoods that better reflect the class structure

5. Building distance-based pipelines: manydist

from distance construction to reproducible learning workflows

Building distance-based pipelines

manydist

A package to construct, diagnose, and use distances for continuous, categorical, and mixed-type data.

Distance construction

mdist()
- presets and custom specifications
- response-aware distances
step_mdist()
- integrates distance construction into tidymodels workflows

Diagnostics

lovo_mdist()
compare_lovo_mdist()
benchmark_mdist()

Learning: model specs

Unsupervised learning

pam_dist()
spectral_dist()

Supervised learning

nearest_neighbor_dist()

`manydist`: socio-economic country profiles

Data

A 2022 World Bank / WDI snapshot of country-level socio-economic indicators.

observations: countries
numerical variables: GDP per capita, life expectancy, unemployment, urban population, population growth
categorical variables: world region and income group

Use manydist to build a mixed-type distance between countries and diagnose which variables shape the resulting dissimilarity structure.

Preparing the WDI data

Country	Region	Income group	World Bank lending category	GDP per capita (k USD)	Life expectancy (years)	Unemployment (%)	Urban population (% total)	Population growth (%)
Tajikistan	Europe & Central Asia	Lower middle income	IDA	1.1	71.6	7.1	26.2	2.14
West Bank and Gaza	Middle East, North Africa, Afghanistan & Pakistan	Lower middle income	Not classified	3.8	76.7	24.4	86.6	2.43
Belarus	Europe & Central Asia	Upper middle income	IBRD	8.0	74.1	3.6	78.5	-0.80
United Arab Emirates	Middle East, North Africa, Afghanistan & Pakistan	High income	Not classified	50.8	80.5	2.9	85.5	5.09
El Salvador	Latin America & Caribbean	Upper middle income	IBRD	5.1	72.0	3.0	74.1	0.39
New Zealand	East Asia & Pacific	High income	Not classified	49.1	82.0	3.3	83.9	-0.06
Cyprus	Europe & Central Asia	High income	Not classified	33.2	80.4	6.8	66.7	1.06
Zambia	Sub-Saharan Africa	Lower middle income	IDA	1.4	65.3	6.0	44.6	2.76

Constructing a mixed-type distance using a preset

Directly with mdist()

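library(manydist)    # mdist(), step_mdist()
library(tidymodels)  # recipe(), prep(), bake(), and the resampling/tuning tools used later

# Drop the country identifier so that only the variables enter the distance
wdi_x <- wdi_data |>
  dplyr::select(-country)

# Mixed-type distance from a predefined specification
d_preset <- mdist(
  wdi_x,
  preset = "u_dep"
)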

Inside a recipe with step_mdist()

rec_preset <- recipe(~ ., data = wdi_x) |>
  step_mdist(
    all_predictors(),
    preset = "u_dep"
  )

d_preset_step <- rec_preset |>
  prep(training = wdi_x) |>
  bake(new_data = NULL)

all.equal(
  unname(as.matrix(d_preset$distance)),
  unname(as.matrix(d_preset_step))
)

[1] TRUE

`step_mdist()` embeds the same distance specification into a modelling workflow

Constructing a custom mixed-type distance

Directly with mdist()

wdi_x <- wdi_data |> 
  dplyr::select(-country)

d_custom <- mdist(
  wdi_x,
  distance_cont = "euclidean",
  distance_cat  = "eskin",
  scaling_cont  = "std",
  commensurable = FALSE
)

Inside a recipe with step_mdist()

rec_custom <- recipe(~ ., data = wdi_x) |>
  step_mdist(
    all_predictors(),
    distance_cont = "euclidean",
    distance_cat  = "eskin",
    scaling_cont  = "std",
    commensurable = FALSE
  )

d_custom_step <- rec_custom |>
  prep(training = wdi_x) |>
  bake(new_data = NULL)

all.equal(
  unname(as.matrix(d_custom$distance)),
  unname(as.matrix(d_custom_step))
)

[1] TRUE
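
To make the custom specification concrete, here is a minimal base-R sketch of an additive mixed-type distance: Euclidean distance on standardized numerical variables plus an Eskin-type mismatch cost for each categorical variable, with no commensurability rescaling. It is an illustration under stated assumptions, not the manydist implementation: the toy data are hypothetical, the helper eskin_one() is made up for this example, and the mismatch cost uses the Eskin similarity q^2 / (q^2 + 2) turned into a dissimilarity as 1 - s, which may differ from the convention used inside mdist().

```r
# Hypothetical toy data (not the WDI snapshot)
toy <- data.frame(
  gdp    = c(1.1, 8.0, 50.8, 49.1),
  life   = c(71.6, 74.1, 80.5, 82.0),
  income = factor(c("low", "mid", "high", "high")),
  region = factor(c("A", "A", "B", "C"))
)

is_num <- vapply(toy, is.numeric, logical(1))

# Continuous part: Euclidean distance on standardized columns
d_cont <- as.matrix(dist(scale(toy[, is_num, drop = FALSE])))

# Categorical part: Eskin-type mismatch cost for one factor.
# Assumption: similarity q^2 / (q^2 + 2) on a mismatch, dissimilarity 1 - s,
# so every mismatch costs 2 / (q^2 + 2), with q the number of categories.
eskin_one <- function(f) {
  q <- nlevels(f)
  mismatch <- outer(as.integer(f), as.integer(f), FUN = "!=")
  mismatch * (2 / (q^2 + 2))
}

# Sum the per-variable categorical dissimilarities
d_cat <- Reduce(`+`, lapply(toy[, !is_num, drop = FALSE], eskin_one))

# Additive combination of the two parts, without commensurability rescaling
d_mixed <- d_cont + d_cat
```

Because the two parts are simply added, their relative weight depends on the number of variables of each type and on the scaling choice, which is the kind of imbalance a commensurable specification is meant to control.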

Comparing distance constructions

lovo_mdist_compare()

The same leave-one-variable-out diagnostic can be computed for several distance definitions and compared in one display.
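
The leave-one-variable-out idea can be sketched in a few lines of base R. The code below is an illustration only and does not use the manydist functions: toy_dist() and toy_lovo() are hypothetical helpers, the toy data are made up, the distance is a simple additive stand-in (scaled absolute differences plus 0/1 mismatches), and the mean absolute change in pairwise dissimilarities is an assumed summary of each variable's contribution.

```r
# Hypothetical toy data
toy <- data.frame(
  x1 = c(0.2, 1.5, 3.1, 4.8, 5.0, 7.3),
  x2 = c(10, 12, 11, 30, 29, 31),
  g  = factor(c("a", "a", "b", "b", "c", "c"))
)

# Stand-in distance: per-variable contributions are summed.
# Numeric: absolute difference divided by scale_fun(v); categorical: 0/1 mismatch.
toy_dist <- function(df, scale_fun = function(v) diff(range(v))) {
  parts <- lapply(df, function(v) {
    if (is.numeric(v)) {
      abs(outer(v, v, "-")) / scale_fun(v)
    } else {
      outer(as.character(v), as.character(v), "!=") * 1
    }
  })
  Reduce(`+`, parts)
}

# LOVO: drop each variable in turn and report the mean absolute change
# in the pairwise dissimilarities (assumed summary measure)
toy_lovo <- function(df, ...) {
  full <- toy_dist(df, ...)
  lt <- lower.tri(full)
  vapply(names(df), function(v) {
    reduced <- toy_dist(df[, setdiff(names(df), v), drop = FALSE], ...)
    mean(abs(full[lt] - reduced[lt]))
  }, numeric(1))
}

# Same diagnostic for two distance definitions, collected in one display
lovo_compare <- rbind(
  range_scaled = toy_lovo(toy),
  sd_scaled    = toy_lovo(toy, scale_fun = sd)
)
lovo_compare
```

Each row of lovo_compare corresponds to one distance definition and each column to one left-out variable, so the rows can be read side by side to see how the scaling choice shifts the weight of each variable.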

Distance-based classification pipeline

set.seed(123)

wdi_region <- wdi_data |>
  dplyr::filter(region != "North America") |>
  dplyr::mutate(
    region = droplevels(region)
  )

wdi_split <- initial_split(
  wdi_region,
  strata = region
)

wdi_train <- training(wdi_split)
wdi_test  <- testing(wdi_split)

wdi_rec <- recipe(region ~ ., data = wdi_train) |>
  update_role(country, new_role = "id") |>
  step_mdist(
    all_predictors(),
    preset = "u_dep"
  )

knn_spec <- nearest_neighbor_dist(
  mode = "classification",
  neighbors = tune()
)

wdi_wf <- workflow() |>
  add_recipe(wdi_rec) |>
  add_model(knn_spec)


Data

Prepare the classification task.

Split

Create training and test sets.

Recipe

Use step_mdist() to construct the distance representation.

Model

Specify a distance-based KNN classifier.

Workflow

Combine preprocessing and model specification.
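
What a distance-based KNN classifier does with the distance representation can be sketched in base R. This is a minimal illustration, not the code behind nearest_neighbor_dist(): knn_from_dist() is a hypothetical helper, the toy data are made up, and it assumes a matrix of dissimilarities with test observations in rows and training observations in columns.

```r
# Majority-vote KNN from a precomputed dissimilarity matrix.
# d_test_train: rows = test observations, columns = training observations.
knn_from_dist <- function(d_test_train, y_train, k) {
  y_train <- factor(y_train)
  pred <- apply(d_test_train, 1, function(d_row) {
    nn <- order(d_row)[seq_len(k)]   # indices of the k nearest training points
    votes <- table(y_train[nn])      # class counts among the neighbours
    names(votes)[which.max(votes)]   # ties go to the first factor level
  })
  factor(pred, levels = levels(y_train))
}

# Hypothetical one-dimensional toy data
x_train <- c(0.0, 0.2, 0.4, 5.0, 5.2, 5.4)
y_train <- c("low", "low", "low", "high", "high", "high")
x_test  <- c(0.1, 5.1)

# Dissimilarities between each test point and each training point
d_test_train <- abs(outer(x_test, x_train, "-"))

knn_pred <- knn_from_dist(d_test_train, y_train, k = 3)
knn_pred
```

The classifier never sees the original variables, only the dissimilarities, which is why the choice of distance in the recipe step is effectively part of the model.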

Tuning a distance-based KNN pipeline

set.seed(123)

wdi_folds <- vfold_cv(
  wdi_train,
  v = 5,
  strata = region
)

knn_grid <- tibble(
  neighbors = c(1, 3, 5, 7, 9, 11, 15)
)

knn_tuned <- tune_grid(
  wdi_wf,
  resamples = wdi_folds,
  grid = knn_grid,
  metrics = metric_set(accuracy)
)

best_k <- select_best(
  knn_tuned,
  metric = "accuracy"
)

final_wf <- finalize_workflow(
  wdi_wf,
  best_k
)

final_res <- last_fit(
  final_wf,
  split = wdi_split,
  metrics = metric_set(accuracy)
)


Resample

Create cross-validation folds on the training set.

Grid

Define candidate values for the number of neighbours.

Tune

Evaluate each candidate value by cross-validation.

Select

Choose the best-performing number of neighbours.

Finalize

Insert the selected value into the workflow.

Test

Fit the finalized workflow on the training set and evaluate it on the test set.
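
The resample, grid, tune, and select steps can also be written out by hand, which makes explicit what the tuning functions automate. The base-R sketch below is an illustration under simplifying assumptions: the toy data are simulated, knn_vote() is a hypothetical helper, the folds are not stratified, and the plain Euclidean distance is computed once because it involves no quantities estimated from the data. A distance whose scaling or association structure is estimated should be recomputed within each fold, as a recipe step does.

```r
set.seed(123)

# Hypothetical toy data: two groups on one numerical variable
x <- c(rnorm(20, mean = 0), rnorm(20, mean = 4))
y <- factor(rep(c("a", "b"), each = 20))
d <- as.matrix(dist(x))

# Majority-vote KNN for one observation, given its distances to the training points
knn_vote <- function(d_row, y_train, k) {
  votes <- table(y_train[order(d_row)[seq_len(k)]])
  names(votes)[which.max(votes)]
}

# Resample: assign observations to 5 folds at random
folds <- sample(rep(1:5, length.out = length(y)))

# Grid: candidate numbers of neighbours
k_grid <- c(1, 3, 5, 7)

# Tune: cross-validated accuracy for each candidate value
cv_acc <- vapply(k_grid, function(k) {
  fold_acc <- vapply(1:5, function(f) {
    test  <- which(folds == f)
    train <- which(folds != f)
    pred <- vapply(test, function(i) knn_vote(d[i, train], y[train], k),
                   character(1))
    mean(pred == y[test])
  }, numeric(1))
  mean(fold_acc)
}, numeric(1))

# Select: the candidate with the highest cross-validated accuracy
k_best <- k_grid[which.max(cv_acc)]
```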

Tuning results

Test-set performance

Metric	Estimate
accuracy	0.681

Workflow

last_fit() fits the finalized workflow on the full training set and evaluates it once on the held-out test set.

Wrap-up

Preprocessing is modelling

Distance-based learning is no exception.

Distance choices are contextual

Some choices are domain-driven; others depend on the data structure and the downstream task.

Towards a common ground

Similar distance-based ideas often appear under different names across statistics, machine learning, econometrics, psychometrics, and operational research.

A package ecosystem such as manydist can make these choices easier to compare and reuse.

Main references

Iodice D’Enza, A., Tortora, C., & Palumbo, F. (2026). Association-based spectral clustering for mixed data with cross-type interactions. Manuscript submitted to Statistics and Computing.

van de Velden, M., Iodice D’Enza, A., Markos, A., & Cavicchia, C. (2026). A general framework for unbiased mixed-variables distances. Manuscript under review at Journal of Computational and Graphical Statistics.

Ghashti, J. S., & Thompson, J. R. J. (2025). Mixed-type distance shrinkage and selection for clustering via kernel metric learning. Journal of Classification, 42(2), 311–334.

Liu, P., Yuan, H., Ning, Y., Chakraborty, B., Liu, N., & Peres, M. A. (2024). A modified and weighted Gower distance-based clustering analysis for mixed type data: A simulation and empirical analyses. BMC Medical Research Methodology, 24(305).

van de Velden, M., Iodice D’Enza, A., Markos, A., & Cavicchia, C. (2024). A general framework for implementing distances for categorical variables. Pattern Recognition, 153, 110547.

Mousavi, E., & Sehhati, M. (2023). A generalized multi-aspect distance metric for mixed-type data clustering. Pattern Recognition, 109353.

Hennig, C., & Liao, T. F. (2013). How to Find an Appropriate Clustering for Mixed-Type Variables with Application to Socio-Economic Stratification. Journal of the Royal Statistical Society Series C: Applied Statistics, 62(3), 309–369.

Kaufman, L., & Rousseeuw, P. J. (2009). Finding groups in data: An introduction to cluster analysis. John Wiley & Sons.

Borg, I., & Groenen, P. J. (2005). Modern multidimensional scaling: Theory and applications. Springer.

Le, S. Q., & Ho, T. B. (2005). An association-based dissimilarity measure for categorical data. Pattern Recognition Letters, 26(16), 2549–2557.

Ng, A. Y., Jordan, M. I., & Weiss, Y. (2002). On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems, 14, 849–856. MIT Press.

Hastie, T., & Tibshirani, R. (1995). Discriminant adaptive nearest neighbor classification and regression. Advances in Neural Information Processing Systems, 8. MIT Press.

Contact and slides

Contact

Alfonso Iodice D’Enza
iodicede@unina.it

GitHub Pages

https://alfonsoiodicede.github.io

scan to open the slides
