A collaborative project

Outline

1. Distance-based learning with mixed data
Increasing awareness in distance construction

2. Scale/type-aware distances
Additivity, commensurability, and bias in mixed data

3. Association-aware distances
Redundancy, correlations, and categorical associations

4. Interaction- and response-aware extensions
Continuous–categorical relationships and supervised neighbourhoods

5. Building distance-based pipelines: manydist
Distance construction, diagnostics, and learning workflows

1. Distance-based learning with mixed data

Mixed-type data

Most real data applications do not contain only numerical variables.

Tabular data

In socio-economic applications, observations are often described by a mixture of measurements, counts, categories, and ordered scales.

Numerical variables

  • income
  • age
  • prices
  • growth rates
  • emissions
  • employment rates

Categorical variables

  • region
  • education level
  • occupation
  • sector
  • household type
  • policy regime

Learning from mixed-type data

Many statistical learning methods rely, explicitly or implicitly, on comparing observations.

  • If the data are mixed-type, then comparison is not straightforward:

Scale

How do we compare variables measured in different units?

Type

How do we combine numerical differences with category mismatches?

Structure

How do we avoid counting associated information multiple times?

the aim is to compare observations while accounting for scale, type, and structure

Learning from distances

Terminology alert!

I will often use distance as shorthand for pairwise dissimilarity.

Some measures discussed here are not metrics in the strict mathematical sense,
but they all quantify how different two observations are.

Many learning methods can operate directly on a dissimilarity matrix.

Dimension reduction

Multidimensional scaling (MDS)

Clustering methods

Hierarchical clustering (HC), PAM, and spectral clustering

Nearest-neighbour prediction

Nearest-neighbour classification and nearest-neighbour averaging for regression

choosing a different distance can lead to a different downstream analysis result
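
As a minimal sketch of this point, the base R snippet below feeds two dissimilarity objects built from the same toy data to MDS and hierarchical clustering. The data, the choice of average linkage, and all object names are illustrative assumptions, not taken from the talk.

```r
# Toy data: two numerical variables on very different scales (illustrative only)
set.seed(1)
x <- cbind(a = c(rnorm(10, 0), rnorm(10, 3)),
           b = 100 * rnorm(20))

# Two candidate dissimilarities on the same data
d_raw    <- dist(x, method = "manhattan")         # unscaled variables
d_scaled <- dist(scale(x), method = "manhattan")  # standardised variables

# MDS and hierarchical clustering both take the dissimilarity object directly
mds_raw    <- cmdscale(d_raw, k = 2)
mds_scaled <- cmdscale(d_scaled, k = 2)
hc_raw     <- cutree(hclust(d_raw, method = "average"), k = 2)
hc_scaled  <- cutree(hclust(d_scaled, method = "average"), k = 2)

# Cross-tabulate the two partitions to see whether they agree
table(raw = hc_raw, scaled = hc_scaled)
```

Only the distance changes between the two runs, so any disagreement in the cross-table is due to the distance alone.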

building pairwise dissimilarities: intuition

2 continuous variables: add up by-variable (absolute value or squared) differences

building pairwise dissimilarities: intuition

2 continuous variables and 1 categorical variable

categories need not be equally dissimilar: one might consider purple and blue closer than, e.g., purple and yellow

2. Scale/type-aware distances

Desirable properties

Multivariate Additivity

Let \(\mathbf{x}_i=\left(x_{i1}, \dots, x_{iQ}\right)\) denote a \(Q-\)dimensional vector. A distance function \(d\left(\mathbf{x}_i,\mathbf{x}_\ell\right)\) between observations \(i\) and \(\ell\) is multivariate additive if

\[ d\left(\mathbf{x}_i,\mathbf{x}_\ell\right)=\sum_{j=1}^{Q} d_j\left(\mathbf{x}_i,\mathbf{x}_\ell\right), \]

where \(d_j\left(\mathbf{x}_i,\mathbf{x}_\ell\right)\) denotes the \(j-\)th variable specific distance.

  • Manhattan distance satisfies the additivity property; the Euclidean distance does not
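
A small numerical check of the additivity property, with made-up values for two observations on three variables. The Manhattan distance is the sum of the by-variable absolute differences; the Euclidean distance is not a sum of by-variable terms (its square is).

```r
# Two observations on Q = 3 variables (illustrative values)
x_i <- c(1, 4, 2)
x_l <- c(3, 1, 2)

# Variable-specific distances d_j: absolute differences
d_j <- abs(x_i - x_l)            # 2 3 0

manhattan <- sum(d_j)            # sum of the d_j: additive
euclidean <- sqrt(sum(d_j^2))    # square root of a sum: not additive

manhattan                        # 5
euclidean                        # sqrt(13), about 3.61
```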

If additivity holds, by-variable distances are added together, so they should be on equivalent scales.

Commensurability

Let \({\boldsymbol X}_i =\left(X_{i1}, \dots, X_{iQ}\right)\) denote a \(Q-\)dimensional random variable corresponding to an observation \(i\). Furthermore, let \(d_{j}\) denote the distance function corresponding to the \(j-\)th variable.

We have commensurability if, for all \(j\), and \(i \neq \ell\),

\[ E[d_{j}({ X}_{ij}, {X}_{\ell j})] = c, \]

where \(c\) is some constant.
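
The definition can be checked empirically by estimating \(E[d_j]\) with the mean pairwise absolute difference of each variable. The sketch below is an illustration under assumed simulated data: it shows that standard-deviation scaling gives similar but not identical expected contributions, and that dividing each variable by its own mean pairwise distance gives \(c=1\) by construction. This is one possible way to enforce commensurability, not necessarily the one used in the talk.

```r
set.seed(42)
n <- 500
x <- cbind(normal  = rnorm(n),
           uniform = runif(n),
           skewed  = rchisq(n, df = 0.5))

# Mean pairwise absolute difference for one variable: an estimate of E[d_j]
mean_pair_dist <- function(v) mean(as.vector(dist(v, method = "manhattan")))

# Standard-deviation scaling: expected contributions differ across variables
x_std <- scale(x)
round(apply(x_std, 2, mean_pair_dist), 3)

# Dividing each variable by its own mean pairwise distance gives c = 1
x_com <- sweep(x, 2, apply(x, 2, mean_pair_dist), "/")
apply(x_com, 2, mean_pair_dist)
```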

If the multivariate distance function \(d(\cdot,\cdot)\) satisfies additivity and commensurability, ad hoc distance functions can be chosen for each variable and then aggregated:

  • one can pick the appropriate \(d_{j}(\cdot,\cdot)\), given the nature of \(X_{j}\)

  • this is well suited to the mixed-data case

Mixed-data setup

a mixed data set

  • \(I\) observations described by \(Q\) variables, \(Q_{n}\) numerical and \(Q_{c}\) categorical

  • the \(I\times Q\) data matrix \({\bf X}=\left[{\bf X}_{n},{\bf X}_{c}\right]\) is column-wise partitioned

A formulation for mixed distance between observations \(i\) and \(\ell\):

\[\begin{eqnarray}\label{genmixeddist_formula} d\left(\mathbf{x}_i,\mathbf{x}_\ell\right)&=& \sum_{j_n=1}^{Q_n} d_{j_n}\left(\mathbf{x}^n_i,\mathbf{x}^n_\ell\right)+ \sum_{j_c=1}^{Q_c} d_{j_c}\left(\mathbf{x}^c_i,\mathbf{x}^c_\ell\right)\\ &=& \sum_{j_n=1}^{Q_n} w_{j_n} \delta^n_{j_n}\left(\mathbf{x}^n_i,\mathbf{x}^n_\ell\right)+ \sum_{j_c=1}^{Q_c} w_{j_c}\delta^c_{j_c}\left(\mathbf{x}^c_i,\mathbf{x}^c_\ell\right) \end{eqnarray}\]

numeric case

  • \(\delta^n_{j_n}\) is a function quantifying the dissimilarity between observations on the \(j_n-\)th numerical variable

  • \(w_{j_n}\) is a weight for the \(j_n-\)th variable.

categorical case

  • \(\delta^c_{j_c}\) quantifies the dissimilarity between the categories chosen by subjects \(i\) and \(\ell\) for categorical variable \(j_c\)

  • \(w_{j_c}\) is a weight for the \(j_c-\)th variable
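
The formulation can be sketched in a few lines of base R. The example assumes range-scaled absolute differences for \(\delta^n\), simple matching for \(\delta^c\), and unit weights; the toy data and these choices are illustrative assumptions, and any other by-variable dissimilarity could be plugged in.

```r
# Toy mixed data (illustrative): 2 numerical and 1 categorical variable
df <- data.frame(
  income = c(20, 35, 50, 80),
  age    = c(25, 40, 33, 60),
  region = c("north", "south", "north", "east")
)
num  <- as.matrix(df[, c("income", "age")])
cats <- df[, "region", drop = FALSE]

# delta^n: absolute difference after range scaling, one matrix per variable
delta_num <- lapply(seq_len(ncol(num)), function(j) {
  v <- num[, j] / diff(range(num[, j]))
  abs(outer(v, v, "-"))
})

# delta^c: simple matching (0 if same category, 1 otherwise)
delta_cat <- lapply(cats, function(f) {
  f <- as.character(f)
  1 * outer(f, f, "!=")
})

# Weights w_j (all equal to 1 here) and additive aggregation
w_num <- rep(1, length(delta_num))
w_cat <- rep(1, length(delta_cat))
D <- Reduce(`+`, Map(`*`, w_num, delta_num)) +
     Reduce(`+`, Map(`*`, w_cat, delta_cat))
round(D, 2)
```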

Distributions, scaling and bias: the numeric case

Synthetic data

  • \(I=500\) observations from normal, uniform, skewed, and bimodal distributions

  • skewed refers to a \(\chi^2_{1/2}\) distribution

  • bimodal: \(I/2\) draws from \(\chi^2_{1/2}\), censored at \(10\), and \(I/2\) draws from \(10-\chi^2_{1/2}\), censored at \(0\)

as long as variables have the same underlying distribution and scaling, commensurability holds

  • skewed variables may be under- or over-contributing to the distance, depending on the scaling (range and robust, respectively)

    • the contribution of a variable to the overall distance may be biased

Categorical distances: the delta framework

From categories to distances

Let \({\bf Z}=[{\bf Z}_1,\ldots,{\bf Z}_{Q_c}]\) be the one-hot encoding of the categorical variables.

The pairwise categorical distance matrix can be written as

\[{\bf D}_{c}={\bf Z}{\bf \Delta}{\bf Z}^{\sf T}= \left[\begin{array}{ccc} {\bf Z}_{1} & \dots & {\bf Z}_{Q_{c}} \end{array} \right]\left[\begin{array}{ccc} {\bf\Delta}_1 & & \\ & \ddots &\\ & & {\bf\Delta}_{Q_{c}} \end{array} \right] \left[ \begin{array}{c} {\bf Z}_{1}^{\sf T}\\ \vdots \\ {\bf Z}_{Q_{c}}^{\sf T} \end{array} \right]\]

  • each \({\bf \Delta}_j\) defines the dissimilarity between the categories of variable \(j\)

  • different choices of \({\bf \Delta}_j\) imply different categorical distance measures

  • therefore, categorical distances can also suffer from scale and frequency-driven bias
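
A minimal sketch of the delta framework with the matching dissimilarity \({\bf \Delta}_j = {\bf 1}{\bf 1}^{\sf T} - {\bf I}\) for every variable. The toy data and object names are assumptions for illustration; with this choice of \({\bf \Delta}\), \({\bf D}_c\) counts the number of variables on which two observations disagree.

```r
# One-hot encode two categorical variables (toy data, illustrative)
df <- data.frame(
  x1 = factor(c("A", "A", "B", "C")),
  x2 = factor(c("u", "v", "v", "u"))
)
Z1 <- model.matrix(~ x1 - 1, df)   # n x q_1 indicator matrix
Z2 <- model.matrix(~ x2 - 1, df)   # n x q_2 indicator matrix
Z  <- cbind(Z1, Z2)

# Matching dissimilarity for one variable: Delta_m = 1 1^T - I
delta_m <- function(q) matrix(1, q, q) - diag(q)

# Block-diagonal Delta
q <- c(ncol(Z1), ncol(Z2))
Delta <- matrix(0, sum(q), sum(q))
Delta[1:q[1], 1:q[1]] <- delta_m(q[1])
Delta[q[1] + 1:q[2], q[1] + 1:q[2]] <- delta_m(q[2])

# Pairwise categorical distances: number of mismatching variables
D_c <- Z %*% Delta %*% t(Z)
unname(D_c)
```

Swapping `delta_m()` for another category dissimilarity changes the distance without touching the rest of the computation.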

distributions, scaling and bias: the categorical case

flat frequency distribution

| Distance | Category dissimilarity | \(E[d(X_i, X_{\ell})]\) | \(q=2\) | \(q=5\) |
|---|---|---|---|---|
| Matching | \(\boldsymbol{\Delta}_m = \mathbf{1} \mathbf{1}^{\top} - \mathbf{I}\) | \(\frac{q-1}{q}\) | 0.5 | 0.8 |
| Eskin | \(\boldsymbol{\Delta}_e = \frac{2}{q^2}\boldsymbol{\Delta}_m\) | \(\frac{2(q-1)}{q^3}\) | 0.250 | 0.064 |
| Occurrence frequency (OF) | \(\boldsymbol{\Delta}_{OF} = \log^2(q)\boldsymbol{\Delta}_m\) | \(\log^2(q)\frac{q-1}{q}\) | 0.240 | 2.072 |
| Inverse OF | \(\boldsymbol{\Delta}_{IOF} = \log^2\left(\frac{I}{q}\right) \boldsymbol{\Delta}_m\) | \(\log^2\left(\frac{I}{q}\right)\frac{q-1}{q}\) | 9.601 | 9.610 |
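
The first three rows can be reproduced from the fact that, for two independent draws with category probabilities \({\bf p}\), the expected distance is \({\bf p}^{\sf T}\boldsymbol{\Delta}{\bf p}\). The sketch below assumes natural logarithms for the OF measure, which is what matches the tabulated values; the inverse OF row depends on the sample size \(I\) and is left out.

```r
# Expected distance between two independent draws from a variable with
# category probabilities p and category dissimilarity matrix Delta
expected_dist <- function(Delta, p) as.numeric(t(p) %*% Delta %*% p)

delta_m <- function(q) matrix(1, q, q) - diag(q)

# Flat frequency distribution: p_a = 1 / q for every category
flat <- sapply(c(2, 5), function(q) {
  p <- rep(1 / q, q)
  c(matching = expected_dist(delta_m(q), p),
    eskin    = expected_dist(2 / q^2 * delta_m(q), p),
    of       = expected_dist(log(q)^2 * delta_m(q), p))
})
colnames(flat) <- c("q=2", "q=5")
round(flat, 3)
```

Even with flat frequencies the three measures give very different expected contributions, and the contribution changes with the number of categories.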

skewed frequency distribution

  • \(q\in \{2,3,5,10\}\)
  • \(p_1 \in \{0.05,0.1,0.2,0.33, 0.5,0.66, 0.8,0.9,0.95\}\)
  • \(p_j = (1-p_1)/(q-1)\), with \(j=2,\dots,q\),

The expected distance increases with the heterogeneity of the distribution and with the number of categories

Independence-based distances

Independence-based pairwise distance

No inter-variable relations are considered.

  • in the continuous case: Euclidean or Manhattan distances

  • in the categorical case: Hamming / matching distance, among many others

  • in the mixed-data case: Gower dissimilarity index

variable contributions may be balanced, but still treated as separate sources of information

Beyond commensurability

commensurability makes variable contributions comparable across scales and data types.

  • If variables are correlated or associated, the same information may contribute repeatedly to the distance: redundancy

the next step is to account for the structure among variables

3. Association-aware distances

by variable differences: independence-based

  • When variables are correlated or associated, shared information is effectively counted multiple times

  • the resulting inflated dissimilarities may distort downstream unsupervised learning tasks.

by variable differences: independence-based

The Euclidean distance \(\longrightarrow\) shared information is over-counted

accounting for inter-variable relations: association-based

The Mahalanobis distance \(\longrightarrow\) shared information is not over-counted

this is an association-based distance for continuous data

association-based distance

Association-based for continuous: Mahalanobis distance

Let \({\bf X}_{con}\) be an \(n\times Q_{d}\) data matrix of \(n\) observations described by \(Q_{d}\) continuous variables, and let \(\bf S\) be the sample covariance matrix. The Mahalanobis distance matrix is

\[ {\bf D}_{mah} = \left[\operatorname{diag}({\bf G})\,{\bf 1}_{n}^{\sf T} + {\bf 1}_{n}\,\operatorname{diag}({\bf G})^{\sf T} - 2{\bf G}\right]^{\odot 1/2} \] where

  • \([\cdot]^{\odot 1/2}\) denotes the element-wise square root

  • \({\bf G}=({\bf C}{\bf X}_{con}){\bf S}^{-1}({\bf C}{\bf X}_{con})^{\sf T}\) is the Mahalanobis Gram matrix

  • \({\bf C}={\bf I}_{n}-\tfrac{1}{n}{\bf 1}_{n}{\bf 1}_{n}^{\sf T}\) is the centering operator
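
The matrix formula can be sketched directly in base R and cross-checked against `stats::mahalanobis()`, which returns squared distances. The simulated data are an assumption for illustration; `pmax(..., 0)` only guards against tiny negative values caused by rounding.

```r
set.seed(7)
n <- 6
X <- cbind(rnorm(n), rnorm(n), rnorm(n))
X[, 2] <- X[, 1] + 0.3 * X[, 2]              # make two variables correlated

C <- diag(n) - matrix(1 / n, n, n)           # centering operator
S <- cov(X)                                  # sample covariance matrix
G <- (C %*% X) %*% solve(S) %*% t(C %*% X)   # Mahalanobis Gram matrix

# diag(G) 1' + 1 diag(G)' - 2 G, followed by an element-wise square root
g <- diag(G)
D_mah <- sqrt(pmax(outer(g, rep(1, n)) + outer(rep(1, n), g) - 2 * G, 0))

# Cross-check one pair against stats::mahalanobis() (squared distance)
check <- sqrt(mahalanobis(X[1, ], center = X[2, ], cov = S))
all.equal(D_mah[1, 2], check)
```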

association-based distance

Association-based for categorical: total variation distance (TVD) (Le & Ho, 2005)

The distance matrix \({\bf D}_{tvd}\) can be defined via the delta framework upon properly defining the block-diagonal matrix \({\bf \Delta}\)

Let \({\bf X}_{cat}\) be an \(n\times Q_{c}\) data matrix of \(n\) observations described by \(Q_{c}\) categorical variables.

\[ {\bf D} = {\bf Z}{\Delta}{\bf Z}^{\sf T} = \left[\begin{array}{ccc} {\bf Z}_{1} & \dots & {\bf Z}_{Q_{c}} \end{array} \right]\left[\begin{array}{ccc} {\bf\Delta}_1 & & \\ & \ddots &\\ & & {\bf\Delta}_{Q_{c}} \end{array} \right] \left[ \begin{array}{c} {\bf Z}_{1}^{\sf T}\\ \vdots \\ {\bf Z}_{Q_{c}}^{\sf T} \end{array} \right] \]

  • in the framework, the choice of \({\Delta}_j\) determines the categorical distance measure (independence- or association-based)

association-based distance

Association-based for categorical: total variation distance (TVD) (Le & Ho, 2005) (2)

Consider the empirical joint probability distributions stored in the off-diagonal blocks of \({\bf P}\):

\[ {\bf P} = \frac{1}{n} \begin{bmatrix} {\bf Z}_1^{\sf T}{\bf Z}_1 & {\bf Z}_1^{\sf T}{\bf Z}_2 & \cdots & {\bf Z}_1^{\sf T}{\bf Z}_{Q_c} \\ \vdots & \vdots & \ddots & \vdots \\ {\bf Z}_{Q_c}^{\sf T}{\bf Z}_1 & {\bf Z}_{Q_c}^{\sf T}{\bf Z}_2 & \cdots & {\bf Z}_{Q_c}^{\sf T}{\bf Z}_{Q_c} \end{bmatrix}. \]

The conditional probability distributions for each variable \(j\) given each variable \(i\) (\(i,j=1,\ldots,Q_c\), \(i\neq j\)) are stored in the block matrix

\[ {\bf R} = {\bf P}_z^{-1}({\bf P} - {\bf P}_z). \]

where \({\bf P}_z = {\bf P} \odot {\bf I}_{Q^*}\), \({\bf I}_{Q^*}\) is the \(Q^*\times Q^*\) identity matrix, and \(Q^*=\sum_{j=1}^{Q_c} q_j\) is the total number of categories.

association-based distance

Association-based for categorical: total variation distance (TVD) (Le & Ho, 2005) (3)

Let \({\bf r}^{ji}_a\) and \({\bf r}^{ji}_b\) be the rows of \({\bf R}_{ji}\), the \((j,i)\)th off-diagonal block of \({\bf R}\), that correspond to categories \(a\) and \(b\) of variable \(j\).

The category dissimilarity between \(a\) and \(b\) for variable \(j\) based on the total variation distance (TVD) is defined as

\[ \delta^{j}_{tvd}(a,b) = \sum_{i\neq j}^{Q_c} w_{ji} \Phi^{ji}({\bf r}^{ji}_{a},{\bf r}^{ji}_{b}) = \sum_{i\neq j}^{Q_c} w_{ji} \left[\frac{1}{2}\sum_{\ell=1}^{q_i} |{\bf r}^{ji}_{a\ell}-{\bf r}^{ji}_{b\ell}|\right], \label{ab_delta} \]

where \(w_{ji}=1/(Q_c-1)\) for equal weighting (can be user-defined).

The TVD-based dissimilarity matrix is therefore

\[ {\bf D}_{tvd}= {\bf Z}{\Delta}^{(tvd)}{\bf Z}^{\sf T}. \]

association-based distance: a small example

Data

Consider two categorical variables:

  • \(X_1\) with categories \(A,B,C\)
  • \(X_2\) with categories \(u,v\)
# A tibble: 10 × 3
      id X1    X2   
   <int> <fct> <fct>
 1     1 A     u    
 2     2 A     u    
 3     3 A     v    
 4     4 B     u    
 5     5 B     u    
 6     6 B     v    
 7     7 C     u    
 8     8 C     v    
 9     9 C     v    
10    10 C     v    

Indicator matrices

\[ {\bf Z}_1 = \begin{pmatrix} 1&0&0\\ 1&0&0\\ 1&0&0\\ 0&1&0\\ 0&1&0\\ 0&1&0\\ 0&0&1\\ 0&0&1\\ 0&0&1\\ 0&0&1 \end{pmatrix}, \qquad {\bf Z}_2 = \begin{pmatrix} 1&0\\ 1&0\\ 0&1\\ 1&0\\ 1&0\\ 0&1\\ 1&0\\ 0&1\\ 0&1\\ 0&1 \end{pmatrix}. \]

association-based distance: from \({\bf Z}\) to \({\bf P}\)

Let

\[ {\bf Z} = [{\bf Z}_1,{\bf Z}_2]. \]

The empirical co-occurrence matrix is

\[ {\bf P} = \frac{1}{10}{\bf Z}^{\sf T}{\bf Z}. \]

For this example,

\[ {\bf P} = \begin{pmatrix} \color{#2A9D8F}{0.30} & \color{#2A9D8F}{0} & \color{#2A9D8F}{0} & \color{#E76F51}{0.20} & \color{#E76F51}{0.10}\\ \color{#2A9D8F}{0} & \color{#2A9D8F}{0.30} & \color{#2A9D8F}{0} & \color{#E76F51}{0.20} & \color{#E76F51}{0.10}\\ \color{#2A9D8F}{0} & \color{#2A9D8F}{0} & \color{#2A9D8F}{0.40} & \color{#E76F51}{0.10} & \color{#E76F51}{0.30}\\ \color{#E76F51}{0.20} & \color{#E76F51}{0.20} & \color{#E76F51}{0.10} & \color{#2A9D8F}{0.50} & \color{#2A9D8F}{0}\\ \color{#E76F51}{0.10} & \color{#E76F51}{0.10} & \color{#E76F51}{0.30} & \color{#2A9D8F}{0} & \color{#2A9D8F}{0.50} \end{pmatrix}. \]

diagonal blocks contain marginal information; off-diagonal blocks contain joint proportions

association-based distance: from \({\bf P}\) to \({\bf R}\)

The diagonal part of \({\bf P}\) is

\[ {\bf P}_z = {\bf P} \odot {\bf I}_{Q^*} = \operatorname{diag}(0.30,0.30,0.40,0.50,0.50). \]

The block matrix of conditional profiles is

\[ {\bf R} = {\bf P}_z^{-1}({\bf P}-{\bf P}_z). \]

For this example,

\[ {\bf R} = \begin{pmatrix} \color{#2A9D8F}{0} & \color{#2A9D8F}{0} & \color{#2A9D8F}{0} & \color{#E76F51}{0.67} & \color{#E76F51}{0.33}\\ \color{#2A9D8F}{0} & \color{#2A9D8F}{0} & \color{#2A9D8F}{0} & \color{#E76F51}{0.67} & \color{#E76F51}{0.33}\\ \color{#2A9D8F}{0} & \color{#2A9D8F}{0} & \color{#2A9D8F}{0} & \color{#E76F51}{0.25} & \color{#E76F51}{0.75}\\ \color{#E76F51}{0.40} & \color{#E76F51}{0.40} & \color{#E76F51}{0.20} & \color{#2A9D8F}{0} & \color{#2A9D8F}{0}\\ \color{#E76F51}{0.20} & \color{#E76F51}{0.20} & \color{#E76F51}{0.60} & \color{#2A9D8F}{0} & \color{#2A9D8F}{0} \end{pmatrix}. \]

each off-diagonal block contains conditional profiles across variables

association-based distance: reading \({\bf R}_{12}\)

For the categories of \(X_1\), the relevant block is

\[ {\bf R}_{12} = \begin{pmatrix} 0.67 & 0.33\\ 0.67 & 0.33\\ 0.25 & 0.75 \end{pmatrix}. \]

Interpretation

Rows of \({\bf R}_{12}\) describe the distribution of \(X_2\) within each category of \(X_1\):

  • category \(A\): \(P(X_2=u \mid X_1=A)=0.67\), \(P(X_2=v \mid X_1=A)=0.33\)
  • category \(B\): \(P(X_2=u \mid X_1=B)=0.67\), \(P(X_2=v \mid X_1=B)=0.33\)
  • category \(C\): \(P(X_2=u \mid X_1=C)=0.25\), \(P(X_2=v \mid X_1=C)=0.75\)

categories are compared through their association profiles

association-based distance: from \({\bf R}\) to \(\Delta_1^{(tvd)}\)

Compare the rows of \({\bf R}_{12}\) using TVD.

\[ \delta^{1}_{tvd}(A,B) = \frac{1}{2} \left( |0.67-0.67| + |0.33-0.33| \right) = 0. \]

\[ \delta^{1}_{tvd}(A,C) = \frac{1}{2} \left( |0.67-0.25| + |0.33-0.75| \right) = 0.42. \]

\[ \delta^{1}_{tvd}(B,C) = 0.42. \]

Therefore,

\[ \Delta^{(tvd)}_1 = \begin{pmatrix} 0 & 0 & 0.42\\ 0 & 0 & 0.42\\ 0.42 & 0.42 & 0 \end{pmatrix}. \]

\(A\) and \(B\) are close because they have the same profile with respect to \(X_2\)

association-based distance: reading \({\bf R}_{21}\)

For the categories of \(X_2\), the relevant block is

\[ {\bf R}_{21} = \begin{pmatrix} 0.40 & 0.40 & 0.20\\ 0.20 & 0.20 & 0.60 \end{pmatrix}. \]

Interpretation

Rows of \({\bf R}_{21}\) describe the distribution of \(X_1\) within each category of \(X_2\):

  • category \(u\): \(P(X_1=A \mid X_2=u)=0.40\), \(P(X_1=B \mid X_2=u)=0.40\), \(P(X_1=C \mid X_2=u)=0.20\)
  • category \(v\): \(P(X_1=A \mid X_2=v)=0.20\), \(P(X_1=B \mid X_2=v)=0.20\), \(P(X_1=C \mid X_2=v)=0.60\)

categories of \(X_2\) are compared through their association profiles with respect to \(X_1\)

association-based distance: from \({\bf R}_{21}\) to \(\Delta_2^{(tvd)}\)

Compare the rows of \({\bf R}_{21}\) using TVD.

\[ \delta^{2}_{tvd}(u,v) = \frac{1}{2} \left( |0.40-0.20| + |0.40-0.20| + |0.20-0.60| \right) = 0.40. \]

Therefore,

\[ \Delta^{(tvd)}_2 = \begin{pmatrix} 0 & 0.40\\ 0.40 & 0 \end{pmatrix}. \]

\(u\) and \(v\) are different because they imply different profiles over \(X_1\)

association-based distance: from \(\Delta\) to \({\bf D}\)

We collect the category dissimilarity matrices in a block-diagonal matrix:

\[ \Delta^{(tvd)} = \begin{pmatrix} \color{#2A9D8F}{\Delta^{(tvd)}_1} & \color{#E76F51}{0}\\ \color{#E76F51}{0} & \color{#2A9D8F}{\Delta^{(tvd)}_2} \end{pmatrix}. \]

The observation-level categorical distance matrix is then

\[ {\bf D}_{tvd} = {\bf Z}\Delta^{(tvd)}{\bf Z}^{\sf T} = \begin{bmatrix} {\bf Z}_1 & {\bf Z}_2 \end{bmatrix} \begin{pmatrix} \Delta^{(tvd)}_1 & 0\\ 0 & \Delta^{(tvd)}_2 \end{pmatrix} \begin{bmatrix} {\bf Z}_1^{\sf T}\\ {\bf Z}_2^{\sf T} \end{bmatrix}. \]

Equivalently,

\[ {\bf D}_{tvd} = {\bf Z}_1\Delta^{(tvd)}_1{\bf Z}_1^{\sf T} + {\bf Z}_2\Delta^{(tvd)}_2{\bf Z}_2^{\sf T}. \]

category-level dissimilarities are translated into observation-level distances
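
The whole chain of the small example, from \({\bf Z}\) to \({\bf D}_{tvd}\), can be reproduced in base R. This is a sketch for the two-variable case only, where \(w_{ji}=1/(Q_c-1)=1\); object names are illustrative.

```r
X1 <- factor(c("A", "A", "A", "B", "B", "B", "C", "C", "C", "C"))
X2 <- factor(c("u", "u", "v", "u", "u", "v", "u", "v", "v", "v"))
n  <- length(X1)

Z1 <- model.matrix(~ X1 - 1)         # indicator matrix of X1
Z2 <- model.matrix(~ X2 - 1)         # indicator matrix of X2
Z  <- cbind(Z1, Z2)

P   <- crossprod(Z) / n              # co-occurrence proportions
P_z <- diag(diag(P))                 # diagonal part (marginals)
R   <- solve(P_z) %*% (P - P_z)      # conditional profiles

R12 <- R[1:3, 4:5]                   # X2 given each category of X1
R21 <- R[4:5, 1:3]                   # X1 given each category of X2

# TVD between all pairs of rows of a profile matrix
tvd <- function(R) as.matrix(dist(R, method = "manhattan")) / 2

Delta1 <- tvd(R12)
Delta2 <- tvd(R21)

# Observation-level distances: Z1 Delta1 Z1' + Z2 Delta2 Z2'
D_tvd <- Z1 %*% Delta1 %*% t(Z1) + Z2 %*% Delta2 %*% t(Z2)
round(Delta1, 2)
round(Delta2, 2)
```

For instance, observations 1 (A, u) and 3 (A, v) differ only on \(X_2\), so their distance equals \(\delta^{2}_{tvd}(u,v)=0.40\).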

From distances to data representation

Different distance definitions induce different distance-based representations of the same data.

Same data, different representation

Changing the distance changes the global dissimilarity structure on which downstream learning methods rely.

Leave-one-variable-out diagnostics

How can we measure the contribution of each variable to this structure?

  • compare the dissimilarity matrix computed with and without the variable in question

LOVO-based benchmark: evaluated distances

The benchmark compares distance definitions that differ in how they treat scale, type, additivity, and association.

Additive distances

  • gower: classical Gower dissimilarity

  • mod_gower: modified Gower coefficients (Liu et al., 2024)

  • hl_add: additive version of Hennig–Liao scaling (Hennig & Liao, 2013)

  • u_ind: unbiased independence-based distance

  • u_dep: unbiased association-based distance

  • u_mix: unbiased Manhattan and TVD

Non-additive distances

  • naive: Euclidean distance on scaled numerical variables and one-hot-encoded categorical variables

  • hl: Hennig–Liao scaling with Euclidean distance

  • gudmm: generalized multi-aspect distance metric for mixed-type data (Mousavi & Sehhati, 2023)

  • dkps: distance using kernel product similarity (Ghashti & Thompson, 2025)

LOVO-based diagnostics: what is evaluated?

For each distance and each variable \(X_j\), we compare the full-data representation \({\bf D}\) with the representation obtained after removing \(X_j\), that is \({\bf D}_{-j}\).

1. Distance level

Numeric comparison between \({\bf D}\) and \({\bf D}_{-j}\).

  • mean absolute difference between distance matrices.

2. MDS level

Compute MDS from \({\bf D}\) and from \({\bf D}_{-j}\), then compare the resulting configurations.

  • alienation coefficient between MDS representations.

LOVO diagnostics assess how each variable contributes to the dissimilarity structure
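
The distance-level diagnostic can be sketched by hand for a simple additive distance. The example below uses an unscaled Manhattan distance on simulated numerical data, with one variable deliberately given a larger spread; it illustrates the idea only and is not the implementation behind lovo_mdist().

```r
set.seed(3)
# v3 has three times the spread of v1 and v2 (illustrative data)
X <- cbind(v1 = rnorm(30), v2 = rnorm(30), v3 = 3 * rnorm(30))

D_full <- as.matrix(dist(X, method = "manhattan"))

# Leave one variable out and compare D with D_{-j}
lovo_mad <- sapply(colnames(X), function(j) {
  D_minus_j <- as.matrix(dist(X[, colnames(X) != j, drop = FALSE],
                              method = "manhattan"))
  # mean absolute difference over the distinct pairs
  mean(abs(D_full - D_minus_j)[upper.tri(D_full)])
})
round(lovo_mad, 3)
```

For an additive distance, the difference between \({\bf D}\) and \({\bf D}_{-j}\) is exactly the contribution of variable \(j\), so the unscaled `v3` shows the largest effect.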

LOVO diagnostics: distance-level effect

LOVO diagnostics: MDS-level effect

  • commensurability balances expected distance contributions, not necessarily the role of variables in every downstream representation

From diagnostics to downstream learning

LOVO diagnostics show how variables affect the distance matrix and the MDS representation.

But we also want to know whether distance biases affect a downstream learning task.

Unsupervised classification experiment

Use each distance matrix as input to PAM and evaluate how well the resulting partition recovers the known cluster structure.

Unsupervised classification experiment

Data generation

  • \(n = 200\) observations from \(4\) equal-sized clusters
  • data generated with genRandomClust
  • each dataset contains \(8\) numerical and \(8\) categorical variables
  • categorical variables are obtained by discretizing numerical variables into \(9\) categories
  • scenarios vary the number of signal and noise variables within each type
  • \(100\) datasets are generated for each scenario

Evaluation

For each mixed-data distance, PAM is applied to the dissimilarity matrix with \(K = 4\).
Recovery of the true cluster labels is measured using the adjusted Rand index.

PAM-based clustering results

  • hl performs well when categorical variables are noise, but poorly when numerical variables are noise
  • gower tends to show the opposite pattern
  • u_mix and u_dep are comparatively stable in the mixed signal/noise scenarios

4. Interaction- and response-aware extensions

Interaction-aware distances

Association-aware distances account for relations within variable blocks:

  • continuous–continuous relations;
  • categorical–categorical relations.

Cross-type structure

In mixed data, categorical differences may be meaningful because they are reflected in the continuous variables.

the next step is to make distances interaction-aware

How to measure interactions

Define \(\Delta^{int}\) to account for continuous–categorical interactions and use it to augment \(\Delta^{tvd}\).

The mixed dissimilarity becomes

\[ {\bf D}_{mix}^{(int)} = {\bf D}_{mah} + {\bf D}_{cat}^{(int)}, \]

where

\[ {\bf D}_{cat}^{(int)}={\bf Z}\tilde{\Delta}{\bf Z}^\top \]

and

\[ \tilde{\Delta} = (1-\alpha)\Delta^{tvd} + \alpha \Delta^{int}, \qquad \alpha=\frac{1}{Q_c}. \]

What is \(\Delta^{int}\)?

The entry \(\delta_{int}^{j}(a,b)\) measures how much the continuous variables help discriminate between observations choosing category \(a\) and those choosing category \(b\) for categorical variable \(j\).

Category-pair classification problem

For each pair \((a,b)\):

  • use the continuous variables as predictors;
  • classify observations belonging to categories \(a\) and \(b\);
  • use a nearest-neighbour rule in the continuous space.

Computing \(\Delta^{int}_{j}\)

For each categorical variable \(j\) and each pair of categories \((a,b)\):

  1. use \({\bf D}_{mah}\) to identify neighbours in the continuous space;
  2. consider a proportion of neighbours, say \(\hat{\pi}_{nn}=0.1\);
  3. classify observations using a prior-corrected decision rule;
  4. compute balanced accuracy.

\[ \delta_{int}^{j}(a,b) = \frac{1}{2} \left( \frac{\texttt{true } a}{\texttt{true } a + \texttt{false } a} + \frac{\texttt{true } b}{\texttt{true } b + \texttt{false } b} \right). \]

Well separated or not?

high separability \(\Rightarrow\) high interaction dissimilarity

Building \(\Delta^{int}_{j}\)

For categorical variable \(j\) with \(q_j\) categories, compute
\(\frac{q_j(q_j -1)}{2}\) category-pair quantities.

\[ \Delta_{int} = \begin{pmatrix} 0 & \cdot & \cdot & \cdot \\ \cdot & 0 & \cdot & \cdot \\ \cdot & \cdot & 0 & \cdot\\ \cdot & \cdot & \cdot & 0 \end{pmatrix} \]

Building \(\Delta^{int}_{j}\)

\[ \Delta_{int} = \begin{pmatrix} 0 & 0.94 & 0.40 & 0.39 \\ 0.94 & 0 & 0.54 & 0.55 \\ 0.40 & 0.54 & 0 & \color{#E76F51}{0} \\ 0.39 & 0.55 & \color{#E76F51}{0} & 0 \end{pmatrix} \]

the interaction matrix summarizes category-pair separability in the continuous space

Why this interaction is asymmetric

We model the joint structure as

\[ f({\bf x}_{con},{\bf x}_{cat}) = f({\bf x}_{con}) f({\bf x}_{cat}\mid {\bf x}_{con}). \]

The interaction term asks how categorical distinctions are reflected in the continuous geometry.

Spectral clustering: a graph partitioning problem

Graph representation

A graph representation of the data matrix \({\bf X}\): the aim is to cut it into \(K\) groups, or clusters.

The affinity matrix \({\bf A}\)

The elements \({\bf w}_{ij}\) of \({\bf A}\) are high when observations \(i\) and \(j\) are likely to belong to the same group, and low otherwise.

|   | a | b | c | d |
|---|---|---|---|---|
| a | 0 | 0 | w_ac | 0 |
| b | 0 | 0 | w_bc | w_bd |
| c | w_ca | w_cb | 0 | w_cd |
| d | 0 | w_db | w_dc | 0 |

Spectral clustering: making the graph easy to cut

An approximate solution to the graph partitioning problem:

From distances to affinities

Start from the pairwise distance matrix \({\bf D}\) and build the affinity matrix

\[ {\bf A} = \exp\left(-\frac{{\bf D}^{2}}{2\sigma^{2}}\right), \qquad a_{ii}=0. \]

The parameter \(\sigma\) controls the neighbourhood scale.

Normalized graph Laplacian

The normalized affinity matrix is

\[ {\bf L} = {\bf D}_{r}^{-1/2} {\bf A} {\bf D}_{r}^{-1/2} = {\bf Q}{\Lambda}{\bf Q}^{\sf T}, \]

where \({\bf D}_{r}=\operatorname{diag}({\bf r})\), \({\bf r}={\bf A}{\bf 1}\), \({\bf 1}\) is an \(n\)-dimensional vector of ones.

Spectral embedding

The spectral clustering solution is obtained by applying \(K\)-means to the rows of
\({\bf \tilde Q}\), the matrix containing the first \(K\) eigenvectors of \({\bf L}\).
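
The steps above can be sketched in base R, starting from any pairwise distance matrix. The toy data, the value of \(\sigma\), and the object names are assumptions for illustration; "first \(K\) eigenvectors" is read here as the eigenvectors of the \(K\) largest eigenvalues of the normalized affinity matrix.

```r
set.seed(11)
# Toy data: two well-separated groups in the plane (illustrative)
X <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2))
D <- as.matrix(dist(X))
K <- 2
sigma <- 1                               # neighbourhood scale (assumed value)

A <- exp(-D^2 / (2 * sigma^2))           # affinities from distances
diag(A) <- 0

r <- rowSums(A)                          # r = A 1
L <- diag(1 / sqrt(r)) %*% A %*% diag(1 / sqrt(r))   # normalized affinity

eig <- eigen(L, symmetric = TRUE)        # eigenvalues in decreasing order
Q_tilde <- eig$vectors[, 1:K]            # eigenvectors of the K largest

cl <- kmeans(Q_tilde, centers = K, nstart = 10)$cluster
tab <- table(cluster = cl, truth = rep(1:2, each = 20))
tab
```

Because only \({\bf D}\) enters the computation, the same code applies unchanged to a mixed-type or interaction-aware distance matrix.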

Why spectral clustering here?

Interaction-aware distances can encode local connectivity and non-convex structure.

Experiment: interaction-driven clusters

Design

  • \(n=500\) and \(n=1000\)
  • six continuous variables: \(V_1,\ldots,V_6\)
  • three categorical variables: \(C_1,C_2,C_3\)
  • \(V_4,V_5,V_6\) generated independently from \(N(0,1)\)
  • \(V_1,V_2,V_3\) generated conditionally on \(C_1\) and \(C_2\)

Main feature

The clusters are not defined by continuous variables alone or categorical variables alone,
but by their cross-type interaction.

Interaction-aware spectral clustering results

  • when the cluster structure is interaction-driven, ab_dis_int clearly outperforms all competitors
  • ab_dis without interactions, Gower, modified Gower, and the naive distance remain close to chance-level separation
  • the result supports the need to explicitly encode continuous–categorical interactions

Response-aware distances for KNN

KNN is usually described as a lazy learner:

  • store the training data;
  • compute distances from a new observation to the training observations;
  • predict from the responses of the nearest neighbours.

Reframing KNN

The distance is not just a preprocessing choice.
It determines the neighbourhoods used for classification or regression.

in supervised learning, the response can help define these neighbourhoods

Response-aware mixed distance

For mixed-type predictors, use a supervised distance with two components:

\[ D_{il} = D^n({\bf x}^n_i,{\bf x}^n_l) + D^c({\bf x}^c_i,{\bf x}^c_l). \]

Numerical part

Use discriminant information from \(y\)
to weight numerical differences.

Categorical part

Use the association between categories and \(y\)
to define category dissimilarities.

Numerical part: discriminant weighting

For continuous predictors, use the response to weight directions or variables.

Single-variable discriminant weighting

For numerical variable \(j\), define the Fisher score

\[ \sigma_j = \frac{B_j}{W_j}, \]

where \(B_j\) and \(W_j\) are the between- and within-group variances.

Then a supervised Manhattan-type distance is

\[ D^n({\bf x}^n_i,{\bf x}^n_l) = \sum_{j=1}^{Q_n} \sqrt{\sigma_j} \left|x^n_{ij}-x^n_{lj}\right|. \]
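
A minimal sketch of the single-variable weighting on simulated data. The exact definitions of \(B_j\) and \(W_j\) are not spelled out above, so the code assumes the usual one-way ANOVA mean squares; `fisher_score()` is a hypothetical helper written for this illustration.

```r
set.seed(5)
y  <- factor(rep(c("low", "high"), each = 25))
x1 <- rnorm(50, mean = ifelse(y == "high", 2, 0))   # informative variable
x2 <- rnorm(50)                                     # noise variable
X  <- cbind(x1, x2)

# Fisher score: between-group over within-group variance (assumed definition)
fisher_score <- function(x, y) {
  n_g   <- tapply(x, y, length)
  means <- tapply(x, y, mean)
  B <- sum(n_g * (means - mean(x))^2) / (nlevels(y) - 1)
  W <- sum((x - means[as.character(y)])^2) / (length(x) - nlevels(y))
  B / W
}
sigma <- apply(X, 2, fisher_score, y = y)

# Supervised Manhattan-type distance: sum_j sqrt(sigma_j) |x_ij - x_lj|
D_n <- as.matrix(dist(sweep(X, 2, sqrt(sigma), "*"), method = "manhattan"))
round(sigma, 2)
```

The informative variable receives a much larger weight, so neighbourhoods are driven mainly by differences on `x1`.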

Categorical part: supervised TVD

For categorical predictors, compare categories through their response profiles.

Let \({\bf Z}_y\) be the indicator matrix of the response.
The supervised profile matrix is

\[ {\bf R}_s = {\bf P}_d^{-1} {\bf Z}^{\sf T}{\bf Z}_y. \]

The supervised category dissimilarity is

\[ \delta_s^j(a,b) = \frac{1}{2} \sum_{\ell=1}^{q_y} \left| {\bf r}_{a\ell}^{j y} - {\bf r}_{b\ell}^{j y} \right|. \]

categories are close if they show similar response distributions
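
The supervised category dissimilarity can be sketched for a single categorical predictor. The toy data are illustrative, and reading \({\bf P}_d^{-1}\) as division by the category counts (so that each row of \({\bf R}_s\) is a conditional distribution of \(y\)) is an assumption.

```r
x <- factor(c("a", "a", "a", "b", "b", "b", "c", "c", "c", "c"))
y <- factor(c("hi", "hi", "lo", "hi", "hi", "lo", "lo", "lo", "lo", "hi"))

Z   <- model.matrix(~ x - 1)     # indicators of the predictor categories
Z_y <- model.matrix(~ y - 1)     # indicators of the response classes

# Response profiles P(y | x = category): cross-table divided by category counts
R_s <- solve(diag(colSums(Z))) %*% crossprod(Z, Z_y)

# Supervised category dissimilarity: TVD between response profiles
delta_s <- as.matrix(dist(R_s, method = "manhattan")) / 2
dimnames(delta_s) <- list(levels(x), levels(x))
round(delta_s, 2)
```

Categories `a` and `b` share the same response profile and get dissimilarity 0, while `c` is at distance 5/12 (about 0.42) from both.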

Response-aware KNN: Carseats example

Data

The Carseats data are used to predict whether sales are high.

  • 7 numerical predictors
  • 3 categorical predictors
  • categorical predictors with 2 or 3 categories
  • binary response: high vs. low sales

Compared distances

  • gower: robust Manhattan + matching
  • naive: Euclidean on scaled numerical variables and dummies
  • sup: supervised numerical weighting + supervised TVD
  • sup_add: additive supervised version
  • supf: full supervised version

Response-aware KNN: Carseats results

  • response-aware distances improve nearest-neighbour classification accuracy
  • sup, sup_add, and supf are clearly above gower and naive
  • the response helps define neighbourhoods that better reflect the class structure

5. Building distance-based pipelines: manydist

from distance construction to reproducible learning workflows

Building distance-based pipelines

manydist

A package to construct, diagnose, and use distances for continuous, categorical, and mixed-type data.

Distance construction

  • mdist()
    • presets and custom specifications
    • response-aware distances
  • step_mdist()
    • integrates distance construction into tidymodels workflows

Diagnostics

  • lovo_mdist()
  • compare_lovo_mdist()
  • benchmark_mdist()

Learning: model specs

Unsupervised learning

  • pam_dist()
  • spectral_dist()

Supervised learning

  • nearest_neighbor_dist()

manydist: socio-economic country profiles

Data

A 2022 World Bank / WDI snapshot of country-level socio-economic indicators.

  • observations: countries
  • numerical variables: GDP per capita, life expectancy, unemployment, urban population, population growth
  • categorical variables: world region and income group

Use manydist to build a mixed-type distance between countries and diagnose which variables shape the resulting dissimilarity structure.

Preparing the WDI data

| Country | Region | Income group | World Bank lending category | GDP per capita (k USD) | Life expectancy (years) | Unemployment (%) | Urban population (% total) | Population growth (%) |
|---|---|---|---|---|---|---|---|---|
| Tajikistan | Europe & Central Asia | Lower middle income | IDA | 1.1 | 71.6 | 7.1 | 26.2 | 2.14 |
| West Bank and Gaza | Middle East, North Africa, Afghanistan & Pakistan | Lower middle income | Not classified | 3.8 | 76.7 | 24.4 | 86.6 | 2.43 |
| Belarus | Europe & Central Asia | Upper middle income | IBRD | 8.0 | 74.1 | 3.6 | 78.5 | -0.80 |
| United Arab Emirates | Middle East, North Africa, Afghanistan & Pakistan | High income | Not classified | 50.8 | 80.5 | 2.9 | 85.5 | 5.09 |
| El Salvador | Latin America & Caribbean | Upper middle income | IBRD | 5.1 | 72.0 | 3.0 | 74.1 | 0.39 |
| New Zealand | East Asia & Pacific | High income | Not classified | 49.1 | 82.0 | 3.3 | 83.9 | -0.06 |
| Cyprus | Europe & Central Asia | High income | Not classified | 33.2 | 80.4 | 6.8 | 66.7 | 1.06 |
| Zambia | Sub-Saharan Africa | Lower middle income | IDA | 1.4 | 65.3 | 6.0 | 44.6 | 2.76 |

Constructing a mixed-type distance using a preset

Directly with mdist()

wdi_x <- wdi_data |> 
  dplyr::select(-country)

d_preset <- mdist(
  wdi_x,
  preset = "u_dep"
)

Inside a recipe with step_mdist()

rec_preset <- recipe(~ ., data = wdi_x) |>
  step_mdist(
    all_predictors(),
    preset = "u_dep"
  )

d_preset_step <- rec_preset |>
  prep(training = wdi_x) |>
  bake(new_data = NULL)

# Check that the recipe step reproduces the distances from mdist()
all.equal(
  unname(as.matrix(d_preset$distance)),
  unname(as.matrix(d_preset_step))
)
#> [1] TRUE

step_mdist() embeds the same distance specification into a modelling workflow

Constructing a custom mixed-type distance

Directly with mdist()

wdi_x <- wdi_data |> 
  dplyr::select(-country)

d_custom <- mdist(
  wdi_x,
  distance_cont = "euclidean",
  distance_cat  = "eskin",
  scaling_cont  = "std",
  commensurable = FALSE
)

Inside a recipe with step_mdist()

rec_custom <- recipe(~ ., data = wdi_x) |>
  step_mdist(
    all_predictors(),
    distance_cont = "euclidean",
    distance_cat  = "eskin",
    scaling_cont  = "std",
    commensurable = FALSE
  )

d_custom_step <- rec_custom |>
  prep(training = wdi_x) |>
  bake(new_data = NULL)

# Check that the recipe step reproduces the distances from mdist()
all.equal(
  unname(as.matrix(d_custom$distance)),
  unname(as.matrix(d_custom_step))
)
#> [1] TRUE

Comparing distance constructions

lovo_mdist_compare()

The same leave-one-variable-out diagnostic can be computed for several distance definitions and compared in one display.
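One possible reading of a leave-one-variable-out diagnostic can be sketched in a few lines. The example below is an assumption-laden illustration, not the manydist implementation: it uses standard Gower from cluster::daisy(), a handful of values from the WDI excerpt with shortened labels, and the mean absolute change in pairwise dissimilarities as the (hypothetical) influence measure.

```r
library(cluster)  # provides daisy()

# Eight illustrative country profiles
toy <- data.frame(
  gdp    = c(1.1, 3.8, 8.0, 50.8, 5.1, 49.1, 33.2, 1.4),
  life   = c(71.6, 76.7, 74.1, 80.5, 72.0, 82.0, 80.4, 65.3),
  income = factor(c("Lower middle", "Lower middle", "Upper middle", "High",
                    "Upper middle", "High", "High", "Lower middle"))
)

# Pairwise dissimilarities using every variable, as a plain numeric vector
d_full <- as.numeric(daisy(toy, metric = "gower"))

# Drop each variable in turn, recompute the dissimilarities, and record
# how far they move on average from the full version
lovo <- sapply(names(toy), function(v) {
  keep    <- setdiff(names(toy), v)
  d_minus <- as.numeric(daisy(toy[, keep, drop = FALSE], metric = "gower"))
  mean(abs(d_full - d_minus))
})

sort(lovo, decreasing = TRUE)
```

Running the same loop with a second distance definition and placing the two lovo vectors side by side gives a simple version of the comparison described above. Because Gower averages over variables, removing one also rescales the contribution of the others, so the numbers are best read as relative rather than absolute.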

Distance-based classification pipeline

set.seed(123)

wdi_region <- wdi_data |>
  dplyr::filter(region != "North America") |>
  dplyr::mutate(
    region = droplevels(region)
  )

wdi_split <- initial_split(
  wdi_region,
  strata = region
)

wdi_train <- training(wdi_split)
wdi_test  <- testing(wdi_split)

wdi_rec <- recipe(region ~ ., data = wdi_train) |>
  update_role(country, new_role = "id") |>
  step_mdist(
    all_predictors(),
    preset = "u_dep"
  )

knn_spec <- nearest_neighbor_dist(
  mode = "classification",
  neighbors = tune()
)

wdi_wf <- workflow() |>
  add_recipe(wdi_rec) |>
  add_model(knn_spec)

Data

Prepare the classification task.

Split

Create training and test sets.

Recipe

Use step_mdist() to construct the distance representation.

Model

Specify a distance-based KNN classifier.

Workflow

Combine preprocessing and model specification.
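For readers who want to see what a distance-based KNN classifier does once the distances exist, here is a minimal base-R sketch. It is not the code behind nearest_neighbor_dist(): the helper knn_from_dist() is hypothetical, and iris with Euclidean distance on scaled columns stands in for the WDI data and the mixed-type distance.

```r
# Majority vote among the k nearest training observations, given a
# (test x train) matrix of dissimilarities
knn_from_dist <- function(d_test_train, y_train, k = 3) {
  apply(d_test_train, 1, function(row) {
    nearest <- order(row)[seq_len(k)]   # indices of the k smallest distances
    votes   <- table(y_train[nearest])  # class counts among those neighbours
    names(votes)[which.max(votes)]      # ties go to the first factor level
  })
}

# Stand-in data and distance
set.seed(1)
test_id <- sample(nrow(iris), 30)
d_all   <- as.matrix(dist(scale(iris[, 1:4])))

# Rows: test observations; columns: training observations
pred <- knn_from_dist(d_all[test_id, -test_id], iris$Species[-test_id], k = 5)

mean(pred == iris$Species[test_id])  # test accuracy
```

The classifier only ever sees the test-to-train block of the distance matrix, which is why the distance definition can be swapped without touching the model.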

Tuning a distance-based KNN pipeline

set.seed(123)

wdi_folds <- vfold_cv(
  wdi_train,
  v = 5,
  strata = region
)

knn_grid <- tibble(
  neighbors = c(1, 3, 5, 7, 9, 11, 15)
)

knn_tuned <- tune_grid(
  wdi_wf,
  resamples = wdi_folds,
  grid = knn_grid,
  metrics = metric_set(accuracy)
)

best_k <- select_best(
  knn_tuned,
  metric = "accuracy"
)

final_wf <- finalize_workflow(
  wdi_wf,
  best_k
)

final_res <- last_fit(
  final_wf,
  split = wdi_split,
  metrics = metric_set(accuracy)
)

Resample

Create cross-validation folds on the training set.

Grid

Define candidate values for the number of neighbours.

Tune

Evaluate each candidate value by cross-validation.

Select

Choose the best-performing number of neighbours.

Finalize

Insert the selected value into the workflow.

Test

Fit the finalized workflow on the training set and evaluate it on the test set.
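The resample, grid, tune, and select steps can also be written out by hand, which may help clarify what tune_grid() automates. The sketch below is a simplified base-R analogue under stated assumptions: iris and Euclidean distance are stand-ins, folds are not stratified, and the helper knn_vote() is hypothetical. Distances are computed once on all rows for brevity; a full pipeline would re-estimate any scaling within each fold.

```r
set.seed(123)

# Stand-in data and distance
d <- as.matrix(dist(scale(iris[, 1:4])))
y <- iris$Species
n <- length(y)

# Majority-vote KNN from a (test x train) dissimilarity matrix
knn_vote <- function(d_tt, y_train, k) {
  apply(d_tt, 1, function(row) {
    votes <- table(y_train[order(row)[seq_len(k)]])
    names(votes)[which.max(votes)]
  })
}

# Resample: assign each observation to one of 5 folds
fold <- sample(rep(1:5, length.out = n))

# Grid: candidate numbers of neighbours
grid <- c(1, 3, 5, 7, 9, 11, 15)

# Tune: mean held-out accuracy across folds for each candidate
cv_acc <- sapply(grid, function(k) {
  mean(sapply(1:5, function(f) {
    test <- fold == f
    pred <- knn_vote(d[test, !test, drop = FALSE], y[!test], k)
    mean(pred == y[test])
  }))
})

# Select: candidate with the highest cross-validated accuracy
best_k <- grid[which.max(cv_acc)]
data.frame(neighbors = grid, accuracy = round(cv_acc, 3))
```

When several values tie, which.max() returns the smallest k in the grid; select_best() may resolve ties differently.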

Tuning results

Test-set performance

| Metric | Value |
|---|---|
| accuracy | 0.681 |

Workflow

last_fit() fits the finalized workflow on the full training set and evaluates it once on the held-out test set.

Wrap-up

Preprocessing is modelling

Distance-based learning is no exception.

 

Distance choices are contextual

Some choices are domain-driven; others depend on the data structure and the downstream task.

 

Towards a common ground

Similar distance-based ideas often appear under different names across statistics, machine learning, econometrics, psychometrics, and operational research.

A package ecosystem such as manydist can make these choices easier to compare and reuse.

Main references

Iodice D’Enza, A., Tortora, C., & Palumbo, F. (2026). Association-based spectral clustering for mixed data with cross-type interactions. Manuscript submitted to Statistics and Computing.
van de Velden, M., Iodice D’Enza, A., Markos, A., & Cavicchia, C. (2026). A general framework for unbiased mixed-variables distances. Under review at Journal of Computational and Graphical Statistics.
Ghashti, J. S., & Thompson, J. R. J. (2025). Mixed-type distance shrinkage and selection for clustering via kernel metric learning. Journal of Classification, 42(2), 311–334.
Liu, P., Yuan, H., Ning, Y., Chakraborty, B., Liu, N., & Peres, M. A. (2024). A modified and weighted Gower distance-based clustering analysis for mixed type data: A simulation and empirical analyses. BMC Medical Research Methodology, 24(305).
van de Velden, M., Iodice D’Enza, A., Markos, A., & Cavicchia, C. (2024). A general framework for implementing distances for categorical variables. Pattern Recognition, 153, 110547.
Mousavi, E., & Sehhati, M. (2023). A generalized multi-aspect distance metric for mixed-type data clustering. Pattern Recognition, 109353.
Hennig, C., & Liao, T. F. (2013). How to find an appropriate clustering for mixed-type variables with application to socio-economic stratification. Journal of the Royal Statistical Society Series C: Applied Statistics, 62(3), 309–369.
Kaufman, L., & Rousseeuw, P. J. (2009). Finding groups in data: An introduction to cluster analysis. John Wiley & Sons.
Borg, I., & Groenen, P. J. (2005). Modern multidimensional scaling: Theory and applications. Springer.
Le, S. Q., & Ho, T. B. (2005). An association-based dissimilarity measure for categorical data. Pattern Recognition Letters, 26(16), 2549–2557.
Ng, A. Y., Jordan, M. I., & Weiss, Y. (2002). On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems, 14, 849–856. MIT Press.
Hastie, T., & Tibshirani, R. (1995). Discriminant adaptive nearest neighbor classification and regression. Advances in Neural Information Processing Systems, 8. MIT Press.

Contact and slides

Contact

Alfonso Iodice D’Enza
iodicede@unina.it

GitHub Pages

https://alfonsoiodicede.github.io
