outline

 

dissimilarity- and distance-based unsupervised learning

 

independence-based vs association-based

 

taking into account continuous/categorical interactions

 

example and future work

 

dissimilarity- and distance-based unsupervised learning

learning from dissimilarities

some unsupervised learning methods take as input a dissimilarity matrix

 

dimension reduction: multidimensional scaling (MDS)1

 

clustering methods: hierarchical (HC) and partitioning around medoids (PAM)2

 

the dissimilarity measure of choice is key, obviously

intuition

2 continuous variables: add up by-variable (absolute value or squared) differences

intuition

2 continuous and 1 categorical variable

intuition

one might consider purple and blue closer than e.g. purple and yellow

independence-based

Most commonly used dissimilarity (or distance) measures are based on by-variable differences that are then added together

 

  • in the continuous case: Euclidean or Manhattan distances

 

  • in the categorical case: Hamming (matching) distance (among MANY others)

 

  • in the mixed data case: Gower dissimilarity index

 

no inter-variable relations are considered \(\rightarrow\) independence-based

independence-based

  • When variables are correlated or associated, shared information is effectively counted multiple times

  • inflated dissimilarities may distort downstream unsupervised learning tasks.

independence-based

The Euclidean distance \(\longrightarrow\) shared information is over-counted

association-based

The Mahalanobis distance \(\longrightarrow\) shared information is not over-counted

this is an association-based distance for continuous data

association-based pairwise distance

  • differences in line with the inter-variable association/correlation are down-weighted

Association-based for continuous: Mahalanobis distance

Let \({\bf X}_{con}\) be an \(n\times Q_{d}\) data matrix of \(n\) observations described by \(Q_{d}\) continuous variables, and let \(\bf S\) be the sample covariance matrix; the Mahalanobis distance matrix is

\[ {\bf D}_{mah} = \left[\operatorname{diag}({\bf G})\,{\bf 1}_{n}^{\sf T} + {\bf 1}_{n}\,\operatorname{diag}({\bf G})^{\sf T} - 2{\bf G}\right]^{\odot 1/2} \] where

  • \([\cdot]^{\odot 1/2}\) denotes the element-wise square root

  • \({\bf G}=({\bf C}{\bf X}_{con}){\bf S}^{-1}({\bf C}{\bf X}_{con})^{\sf T}\) is the Mahalanobis Gram matrix

  • \({\bf C}={\bf I}_{n}-\tfrac{1}{n}{\bf 1}_{n}{\bf 1}_{n}^{\sf T}\) is the centering operator
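a minimal base-R sketch of this computation, following the Gram-matrix formula above (illustrative only, not the manydist implementation; `X_con` is a toy numeric matrix):

```r
# minimal sketch: Mahalanobis distance matrix via the Gram matrix G
mahalanobis_dist <- function(X_con) {
  n  <- nrow(X_con)
  C  <- diag(n) - matrix(1 / n, n, n)     # centering operator C
  Xc <- C %*% X_con                       # centered data
  S  <- cov(X_con)                        # sample covariance matrix
  G  <- Xc %*% solve(S) %*% t(Xc)         # Mahalanobis Gram matrix
  g  <- diag(G)
  D2 <- outer(g, rep(1, n)) + outer(rep(1, n), g) - 2 * G
  sqrt(pmax(D2, 0))                       # element-wise square root (guard against rounding)
}

# usage on toy continuous data
set.seed(1)
X_con <- matrix(rnorm(20 * 3), 20, 3)
D_mah <- mahalanobis_dist(X_con)
```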

Association-based for categorical: total variation distance (TVD)1

The distance matrix \({\bf D}_{tvd}\) is defined using the so-called delta framework2, a general way to define categorical data distances.

Let \({\bf X}_{cat}\) be an \(n\times Q_{c}\) data matrix of \(n\) observations described by \(Q_{c}\) categorical variables.

\[ {\bf D} = {\bf Z}{\Delta}{\bf Z}^{\sf T} = \left[\begin{array}{ccc} {\bf Z}_{1} & \dots & {\bf Z}_{Q_{c}} \end{array} \right]\left[\begin{array}{ccc} {\bf\Delta}_1 & & \\ & \ddots &\\ & & {\bf\Delta}_{Q_{c}} \end{array} \right] \left[ \begin{array}{c} {\bf Z}_{1}^{\sf T}\\ \vdots \\ {\bf Z}_{Q_{c}}^{\sf T} \end{array} \right] \]

  • \({\bf Z}=[{\bf Z}_1,\ldots,{\bf Z}_{Q_c}]\) is the \(n\times Q^{*}\) super-indicator matrix, with \(Q^{*}=\sum_{j=1}^{Q_c} q_j\) and \(q_j\) the number of categories of variable \(j\)

  • \({\Delta}_j\) is the category dissimilarity matrix for variable \(j\), i.e., the \(j\)th diagonal block of the block-diagonal matrix \({\Delta}\).

  • setting \({\Delta}_j\) determines the categorical distance measure of choice (independence- or association-based)

Association-based for categorical: total variation distance (TVD)1 (2)

Consider the empirical joint probability distributions stored in the off-diagonal blocks of \({\bf P}\):

\[ {\bf P} = \frac{1}{n} \begin{bmatrix} {\bf Z}_1^{\sf T}{\bf Z}_1 & {\bf Z}_1^{\sf T}{\bf Z}_2 & \cdots & {\bf Z}_1^{\sf T}{\bf Z}_{Q_c} \\ \vdots & \vdots & \ddots & \vdots \\ {\bf Z}_{Q_c}^{\sf T}{\bf Z}_1 & {\bf Z}_{Q_c}^{\sf T}{\bf Z}_2 & \cdots & {\bf Z}_{Q_c}^{\sf T}{\bf Z}_{Q_c} \end{bmatrix}. \]

We refer to the conditional probability distributions for each variable \(j\) given each variable \(i\) (\(i,j=1,\ldots,Q_c\), \(i\neq j\)), stored in the block matrix

\[ {\bf R} = {\bf P}_z^{-1}({\bf P} - {\bf P}_z). \]

where \({\bf P}_z = {\bf P} \odot {\bf I}_{Q^*}\), and \({\bf I}_{Q^*}\) is the \(Q^*\times Q^*\) identity matrix.
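a hedged base-R sketch of these building blocks, assuming `X_cat` is a data frame of factors (`build_Z` is a helper introduced here for illustration, on toy data):

```r
# sketch: super-indicator matrix Z and the blocks P and R
build_Z <- function(X_cat) {
  # one indicator (dummy) block per categorical variable, column-bound
  do.call(cbind, lapply(X_cat, function(x) model.matrix(~ x - 1)))
}

X_cat <- data.frame(v1 = factor(c("a", "b", "a", "c")),
                    v2 = factor(c("x", "x", "y", "y")))
Z  <- build_Z(X_cat)
n  <- nrow(Z)

P  <- crossprod(Z) / n       # empirical joint probabilities, Z'Z / n
Pz <- diag(diag(P))          # element-wise product of P with the identity: marginals
R  <- solve(Pz) %*% (P - Pz) # conditional probabilities in the off-diagonal blocks
```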

Association-based for categorical: total variation distance (TVD)1 (3)

Let \({\bf r}^{ji}_a\) and \({\bf r}^{ji}_b\) be the rows of \({\bf R}_{ji}\), the \((j,i)\)th off-diagonal block of \({\bf R}\), corresponding to categories \(a\) and \(b\) of variable \(j\). The category dissimilarity between \(a\) and \(b\) for variable \(j\) based on the total variation distance (TVD) is defined as

\[ \delta^{j}_{tvd}(a,b) = \sum_{i\neq j}^{Q_c} w_{ji} \Phi^{ji}({\bf r}^{ji}_{a},{\bf r}^{ji}_{b}) = \sum_{i\neq j}^{Q_c} w_{ji} \left[\frac{1}{2}\sum_{\ell=1}^{q_i} |{\bf r}^{ji}_{a\ell}-{\bf r}^{ji}_{b\ell}|\right], \label{ab_delta} \]

where \(w_{ji}=1/(Q_c-1)\) for equal weighting (can be user-defined).

The TVD-based dissimilarity matrix is, therefore,

\[ {\bf D}_{tvd}= {\bf Z}{\Delta}^{(tvd)}{\bf Z}^{\sf T}. \]
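continuing the sketch, a minimal (unoptimized) construction of \(\Delta^{(tvd)}\) and \({\bf D}_{tvd}\) with equal weights \(w_{ji}=1/(Q_c-1)\), reusing `Z` and `R` from above:

```r
# sketch: TVD category dissimilarities and the resulting dissimilarity matrix
q      <- sapply(X_cat, nlevels)                         # categories per variable
blocks <- split(seq_len(sum(q)), rep(seq_along(q), q))   # column indices of each Z_j
Qc     <- length(q)

delta_blocks <- lapply(seq_len(Qc), function(j) {
  Dj <- matrix(0, q[j], q[j])
  for (i in setdiff(seq_len(Qc), j)) {
    Rji <- R[blocks[[j]], blocks[[i]], drop = FALSE]     # row a: P(variable i | variable j = a)
    for (a in seq_len(q[j])) for (b in seq_len(q[j]))
      Dj[a, b] <- Dj[a, b] + 0.5 * sum(abs(Rji[a, ] - Rji[b, ])) / (Qc - 1)
  }
  Dj
})

# block-diagonal Delta^(tvd) and the n x n TVD dissimilarity matrix
Delta_tvd <- matrix(0, sum(q), sum(q))
for (j in seq_len(Qc)) Delta_tvd[blocks[[j]], blocks[[j]]] <- delta_blocks[[j]]
D_tvd <- Z %*% Delta_tvd %*% t(Z)
```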

AB for mixed?

association-based for mixed

A straightforward AB-distance for mixed data is given by the convex combination of Mahalanobis and TVD distances:

\[ {\bf D}_{mix} =\frac{Q_{d}}{Q}\,{\bf D}_{mah} +\left(1-\frac{Q_{d}}{Q}\right){\bf D}_{tvd}. \]

  • this distance only accounts for correlations or associations among variables of the same type

  • no continuous–categorical interactions are considered.
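a tiny sketch of the combination step (`mix_distance` is an illustrative helper; `D_mah` and `D_tvd` must be computed on the same \(n\) observations):

```r
# sketch: mixed-data distance as a convex combination of the two blocks
mix_distance <- function(D_mah, D_tvd, Qd, Qc) {
  Q <- Qd + Qc
  (Qd / Q) * D_mah + (1 - Qd / Q) * D_tvd
}
```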

 

how to measure interactions?

how to measure interactions

define \(\Delta^{int}\), which accounts for the interactions, and use it to augment \(\Delta^{(tvd)}\)

  • the dissimilarity measure becomes

\[ {\bf D}_{mix}^{(int)} = {\bf D}_{mah} + {\bf D}_{cat}^{(int)}. \]

where

\[ {\bf D}_{cat}^{(int)}={\bf Z}\tilde{\Delta}{\bf Z}^{\sf T} \] and

\[ \tilde{\Delta} = (1-\alpha)\Delta^{tvd} + \alpha \Delta^{int} \] where \(\alpha=\frac{1}{Q_{c}}\).
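a hedged sketch of the blending step, with `Delta_tvd` the block-diagonal TVD matrix from before and `Delta_int` its interaction-based counterpart, built pair-by-pair as described next (helper name is illustrative):

```r
# sketch: blend the TVD and interaction-based category dissimilarities
blend_delta <- function(Delta_tvd, Delta_int, Qc) {
  alpha <- 1 / Qc
  (1 - alpha) * Delta_tvd + alpha * Delta_int
}

# D_cat_int <- Z %*% blend_delta(Delta_tvd, Delta_int, Qc) %*% t(Z)
# D_mix_int <- D_mah + D_cat_int
```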

how to measure interactions

What is \(\Delta^{int}\)?

  • the general entry of the \(j^{th}\) diagonal block, \(\delta_{int}^{j}(a,b)\), accounts for the interaction by measuring how well the continuous variables discriminate between the observations with category \(a\) and those with category \(b\) on the \(j^{th}\) categorical variable
  • consider the computation of \(\delta_{int}^{j}(a,b)\) as a two-class (\(a/b\)) classification problem, with the continuous variables as predictors
    • use a distance-based classifier: nearest neighbors

\(\Delta^{int}_{j}\) computation

  • consider \({\bf D}_{mah}\) and sort it to identify the neighbors for each observation.

  • set a proportion of neighbors to consider, say \(\hat{\pi}_{nn}=0.1\)

  • for each pair of categories \((a,b)\), \(a,b=1,\ldots,q_{j}\), \(a\neq b\) of the \(j^{th}\) categorical variable:

  • classify the observations using the prior corrected1 decision rule

    \[ \text{if $i$ is such that }\ \ \ \frac{\hat{\pi}_{nn}(a)}{\hat{\pi}(a)}\geq\frac{\hat{\pi}_{nn}(b)}{\hat{\pi}(b)} \ \ \ \text{ then assign $i$ to class $a$ else to class $b$} \]

  • compute \(\delta_{int}^{j}(a,b)\) as the balanced accuracy2 (average of class-wise sensitivities) \[ \Phi_{int}^{j}(a,b)=\frac{1}{2}\left(\frac{\texttt{true } a}{\texttt{true } a + \texttt{false }a}+\frac{\texttt{true } b}{\texttt{true } b + \texttt{false }b}\right) \]
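a hedged base-R sketch of this procedure for a single category pair \((a,b)\) of variable \(j\); the full \(\Delta^{int}_{j}\) loops this over all \(q_j(q_j-1)/2\) pairs (function and argument names are ours):

```r
# sketch: interaction-based category dissimilarity for one pair (a, b) of variable j
# D_mah : n x n distance matrix on the continuous variables
# xj    : factor holding the categories of the j-th categorical variable
# a, b  : category labels (character)
delta_int_pair <- function(D_mah, xj, a, b, pi_nn = 0.1) {
  keep  <- which(xj %in% c(a, b))                           # two-class (a/b) subproblem
  Dk    <- D_mah[keep, keep, drop = FALSE]
  yk    <- factor(as.character(xj[keep]), levels = c(a, b))
  k     <- min(max(1, ceiling(pi_nn * length(keep))), length(keep) - 1)
  prior <- prop.table(table(yk))                            # overall priors pi(a), pi(b)

  pred <- sapply(seq_along(keep), function(i) {
    nn   <- order(Dk[i, -i])[1:k]                           # k nearest neighbors of i
    freq <- prop.table(table(factor(yk[-i][nn], levels = c(a, b))))  # pi_nn(a), pi_nn(b)
    if (freq[a] / prior[a] >= freq[b] / prior[b]) a else b  # prior-corrected rule
  })

  # balanced accuracy: average of the two class-wise sensitivities
  0.5 * (mean(pred[yk == a] == a) + mean(pred[yk == b] == b))
}
```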

well separated or not

Building \(\Delta^{int}_{j}\)

for the general categorical variable \(j\) with \(q_{j}\) categories, you compute \(\delta_{int}^{j}(a,b)\) for each of the \(\frac{q_j(q_j -1)}{2}\) category pairs, filling the off-diagonal entries one pair at a time; e.g., for \(q_j=4\):

\[ \Delta_{int} = \begin{pmatrix} 0 & 0.94 & 0.4 & 0.39 \\ 0.94 & 0 & 0.54 & 0.55 \\ 0.4 & 0.54 & 0 & 0 \\ 0.39 & 0.55 & 0 & 0 \end{pmatrix} \]

Just one-way interaction?

Let \({\bf X}=\left[{\bf X}_{con}\ {\bf X}_{cat} \right]\) be a mixed data matrix with \(n\) observations described by \(Q_{d}\) continuous and \(Q_{c}\) categorical variables, respectively, and let \({\bf x}_{i}=\left[{\bf x}_{i_{con}}\ {\bf x}_{i_{cat}}\right]\) be the \(i^{th}\) observation

We build upon the following results

first result

The distribution of \({\bf x}_{i}\) can be written as

\[ f({\bf x}_{i_{con}},{\bf x}_{i_{cat}})=f({\bf x}_{i_{con}})\,f({\bf x}_{i_{cat}}\mid{\bf x}_{i_{con}}) \]

second result

starting from any distance from a cluster center \(d({\bf x}_{i},{\bf c}_k)\), it is possible to construct a probabilistic clustering model1 2

\[ f({\bf x}_{i};{\bf c}_k, {\bf S}_k)=g({\bf c}_k, {\bf S}_k)\exp\left(-d({\bf x}_{i},{\bf c}_k)\right) \] where \(g({\bf c}_k, {\bf S}_k)\) is a normalizing constant

Just one-way interaction?

third result

the dissimilarity between \({\bf x}_{i_{con}}\) and a generic cluster \(k\) with center \({\bf c}_k\) and covariance matrix \({\bf S}_k\) is1

\[ d({\bf x}_{i_{con}},{\bf c}_k)=\log(M_{k} f({\bf x}_{i_{con}};{\bf c}_k, {\bf S}_k)^{-1}) \]

  • \(f(\cdot)\) a symmetric probability density function.

  • \(M_k\) is the maximum of the density function

  • that is, if \({\bf x}_{i_{con}}={\bf c}_k\) then \(d({\bf x}_{i_{con}},{\bf c}_k)=0\).

replace \({\bf c}_{k}\) with a generic observation \(i'\)

\[ d({\bf x}_{i_{con}},{\bf x}_{i'_{con}})=\log(M\, f({\bf x}_{i_{con}};{\bf x}_{i'_{con}}, {\bf S})^{-1}) \]

  • \(f(\cdot)\) a symmetric probability density function.

  • \(M\) is the maximum of the density function

Just one-way interaction!

proposition 1

If \(f({\bf x}_{i_{con}}\mid{\bf x}_{i'_{con}}, {\bf S})\) is the multivariate normal density, it follows that

\[ d({\bf x}_{i_{con}},{\bf x}_{i'_{con}})=\frac{1}{2}({\bf x}_{i_{con}}-{\bf x}_{i'_{con}}) {\bf S}^{-1}({\bf x}_{i_{con}}-{\bf x}_{i'_{con}})^{\sf T} \]
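spelling out the intermediate step: the multivariate normal density attains its maximum at \({\bf x}_{i_{con}}={\bf x}_{i'_{con}}\), so that

\[ M = (2\pi)^{-Q_d/2}|{\bf S}|^{-1/2}, \qquad d({\bf x}_{i_{con}},{\bf x}_{i'_{con}}) = \log M - \log f({\bf x}_{i_{con}};{\bf x}_{i'_{con}},{\bf S}) = \frac{1}{2}({\bf x}_{i_{con}}-{\bf x}_{i'_{con}}){\bf S}^{-1}({\bf x}_{i_{con}}-{\bf x}_{i'_{con}})^{\sf T} \]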

categorical analogue

consider the \(Q_{c}\)-dimensional categorical vector \({\bf x}_{i_{cat}}\); for its \(j^{th}\) element, it follows that

\[ p(x_{ij_{cat}}=a\mid x_{i'j_{cat}}=b) = \left[\delta^{j}(a,b)\right]^{-1} \]

where \(a\) and \(b\) are two general categories of the \(j^{th}\) variable

For the whole vector, it follows that \[ p({\bf x}_{i_{cat}}\mid {\bf x}_{i'_{cat}}) = \prod_{j=1}^{Q_c} p(x_{ij_{cat}}=a_{j}\mid x_{i'j_{cat}}=b_{j})=\prod_{j=1}^{Q_c}\left[\delta^{j}(a_{j},b_{j})\right]^{-1} \]

Just one-way interaction! (wrap-up)

proposition 2

using the previous results, the dissimilarity between two mixed-data observations \({\bf x}_i\) and \({\bf x}_{i'}\) can be measured as

\[ d({\bf x}_{i}, {\bf x}_{i'})=\frac{1}{2}({\bf x}_{i_{con}}-{\bf x}_{i'_{con}}) {\bf S}^{-1}({\bf x}_{i_{con}}-{\bf x}_{i'_{con}})^{\sf T}-\log(p({\bf x}_{i_{cat}}|{\bf x}_{i_{con}}, {\bf x}_{i'_{con}}, {\bf x}_{i'_{cat}} )). \]

spectral clustering in a nutshell

Spectral clustering: a graph partitioning problem

Graph representation

a graph representation of the data matrix \(\bf X\): the aim is then to cut it into K groups (clusters)

the affinity matrix \({\bf A}\)

the elements \(\bf w_{ij}\) of \(\bf A\) are high (low) if \(i\) and \(j\) are in the same (different) groups

\[ {\bf A} = \begin{array}{c|cccc} & a & b & c & d \\ \hline a & 0 & 0 & w_{ac} & 0 \\ b & 0 & 0 & w_{bc} & w_{bd} \\ c & w_{ac} & w_{bc} & 0 & w_{cd} \\ d & 0 & w_{bd} & w_{cd} & 0 \end{array} \]

Spectral clustering: making the graph easy to cut

An approximate solution to the graph partitioning problem:

  • spectral decomposition of the graph Laplacian matrix, a normalized version of the affinity matrix \({\bf A}\):

\[\color{dodgerblue}{\bf{L}} = {\bf D}_{r}^{-1/2}\underset{\color{grey}{\text{affinity matrix } {\bf A}}}{\color{dodgerblue}{exp(-{\bf D}^{2}(2\sigma^{2})^{-1})}}{\bf D}_{r}^{-1/2} =\color{dodgerblue}{{\bf Q}{\Lambda}{\bf Q}^{\sf T}}\]

  • \(\bf D\) is the \(n\times n\) matrix of pairwise Euclidean distances

  • the \(\sigma\) parameter dictates the number of neighbors each observation is linked to (rule of thumb: median distance to the 20th nearest neighbor)

  • diagonal terms of \(\bf A\) are set to zero: \(a_{ii}=0\) , \(i=1,\ldots,n\)

  • \({\bf D}_{r}=diag({\bf r})\) , \({\bf r}={\bf A}{\bf 1}\) and \({\bf 1}\) is an \(n\)-dimensional vector of 1’s

  • the spectral clustering of the \(n\) original objects is a \(K\)-means applied on the rows of the matrix \({\bf{\tilde Q}}\), containing the first \(K\) columns of \(\bf Q\)
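a minimal base-R sketch of this pipeline (the row normalization of \(\tilde{\bf Q}\) before \(K\)-means follows the common Ng–Jordan–Weiss variant and is an assumption here; `D` can be any pairwise distance matrix):

```r
# sketch: spectral clustering from a pairwise distance matrix D
spectral_clust <- function(D, K, sigma = NULL) {
  if (is.null(sigma))  # rule of thumb: median distance to the 20th nearest neighbor
    sigma <- median(apply(D, 1, function(d) sort(d)[min(21, length(d))]))
  A <- exp(-D^2 / (2 * sigma^2))                  # affinity matrix
  diag(A) <- 0                                    # a_ii = 0
  Dr_is <- diag(1 / sqrt(rowSums(A)))             # D_r^{-1/2}, with r = A 1
  L <- Dr_is %*% A %*% Dr_is                      # normalized affinity ("Laplacian")
  Q <- eigen(L, symmetric = TRUE)$vectors[, 1:K]  # first K eigenvectors (Q tilde)
  Q <- Q / sqrt(rowSums(Q^2))                     # row normalization (assumption)
  kmeans(Q, centers = K, nstart = 25)$cluster     # K-means on the rows of Q tilde
}
```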

Spectral clustering: solution and performance

Recent benchmarking studies1: SC works well, particularly in the case of non-convex and overlapping clusters

experiments

toy experiment: data generation

is \(\delta_{int}^{j}(a,b)\) of help?

  • three continuous signal variables, three continuous Gaussian noise variables, and three categorical variables (one of which is noise)

  • different dependence structures and location shifts in the signal continuous variables across the groups defined by the signal categorical variables.

  • the continuous signal variables are generated conditionally on the signal categorical variables

\[ X_j = \beta_{0,nm} + \sum_{k>j}^{6} \beta_{1,nm} X_k, \qquad j = 1,2,3 \] where \(\beta_{1,nm}\) takes different values depending on \(m\) and \(n\), the observed categories of the two signal categorical variables, respectively.
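a purely illustrative sketch of a generator in this spirit (all coefficient values and group codings are hypothetical, not those of the actual experiment):

```r
# illustrative sketch of the mixed-data generating scheme described above
# (the beta values below are hypothetical placeholders)
set.seed(42)
n  <- 300
c1 <- factor(sample(1:2, n, replace = TRUE))   # signal categorical variable (categories n)
c2 <- factor(sample(1:2, n, replace = TRUE))   # signal categorical variable (categories m)
c3 <- factor(sample(1:3, n, replace = TRUE))   # noise categorical variable

grp <- as.integer(interaction(c1, c2))         # group index for each (n, m) combination
b0  <- c(-2, 0, 2, 4)[grp]                     # beta_0,nm: group-dependent location shift
b1  <- c(0.2, 0.4, 0.6, 0.8)[grp]              # beta_1,nm: group-dependent dependence

X <- matrix(rnorm(n * 6), n, 6)                # X4..X6 remain pure Gaussian noise
for (j in 3:1)                                 # signal X_j built from the X_k with k > j
  X[, j] <- b0 + b1 * rowSums(X[, (j + 1):6, drop = FALSE])

X_con <- X
X_cat <- data.frame(c1, c2, c3)
```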

toy experiment

toy experiment: compared methods

  • Gower dissimilarity: a straightforward option

  • naive SC for mixed-type data1

  • modified Gower2: an entropy-based approach that takes into account inter- and intra-variable relations

  • association-based approach3: with and without interaction

toy experiment

Final considerations and future work

  • association-based measures aim to go beyond match/mismatch of categories

  • when the signal is limited to a few variables, retrieving information from continuous/categorical interactions may be useful

  • measuring interactions via a non-parametric, NN-based approach is suitable for non-convex/oddly shaped clusters

  • computationally demanding (but it can be made bearable)

  • \(\pi_{nn}\) tuning and regularization of \(\delta_{int}\)’s to reduce variability

an R package to compute distances: anydist?

an R package to compute distances: manydist! (it’s on CRAN!)

main references

Le, S. Q. and T. B. Ho (2005a). “An association-based dissimilarity measure for categorical data”. In: Pattern Recognition Letters 26.16, pp. 2549-2557.

Mbuga, F. and C. Tortora (2021a). “Spectral Clustering of Mixed-Type Data”. In: Stats 5.1, pp. 1-11.

Murugesan, N., I. Cho, and C. Tortora (2021). “Benchmarking in cluster analysis: a study on spectral clustering, DBSCAN, and K-Means”. In: Conference of the International Federation of Classification Societies. Springer, pp. 175-185.

Velden, M. van de, A. Iodice D’Enza, A. Markos, et al. (2024). “A general framework for implementing distances for categorical variables”. In: Pattern Recognition 153, p. 110547.

Velden, M. van de, A. Iodice D’Enza, A. Markos, et al. (2025a). “A general framework for unbiased mixed-variables distances”. In: Journal of Computational and Graphical Statistics (second-round review).