class: center, middle, inverse, title-slide

# Biplots in dimension reduction and clustering

### Michel van de Velden, Alfonso Iodice D'Enza and Angelos Markos

### CSDA & EcoSta Workshop on Statistical Data Science (SDS 2022)
26 August 2022

---
class: animated fadeIn

### dimension reduction

- <h4 style = "color:#37499c"> variable and sample reduction</h4>
- <h4 style = "color:#37499c"> the tandem approach and the cluster masking problem </h4>

--

### joint DR

- <h4 style = "color:#37499c"> a unified framework for continuous, categorical and mixed data </h4>

--

### biplots in joint DR for cluster characterization

- <h4 style = "color:#37499c"> discriminant analysis biplot, contribution biplot </h4>

---
class: animated fadeIn middle

### dimension reduction

**summarising** a two-way data matrix by **aggregating measurements** (Farcomeni and Greco, 2016)

- column-wise reduction: define a limited number of linear combinations (**dimension reduction**)
- row-wise reduction: define a limited number of observations, each representative of a homogeneous group (**partitioning clustering**)

---
class: animated fadeIn

### continuous data case: principal component analysis (PCA)

`\({\bf X}\)` is the `\(n \times p\)` standardized data matrix. The PCA loss function can be defined as:

** `$$\min_{\mathbf{A,B}}\left\Vert \mathbf{Y}-n^{-1/2}\mathbf{AB}^{\sf T}p^{-1/2}\right\Vert ^{2} \ \ s.t. \ \ \frac{1}{p}\mathbf{B}^{\sf T}\mathbf{B}=\mathbf{I}_{d}$$` **

where `\(\mathbf{Y}=n^{-1/2}\mathbf{X}p^{-1/2}\)`.

--

- the columns of `\(\mathbf{A}=n^{1/2}\mathbf{\tilde{U}{\bf \tilde{D}}_{\alpha}}\)` are the row (principal) coordinates, and they are such that `\(\frac{1}{n}\mathbf{A}^{\sf T}\mathbf{A}={\bf \tilde{D}}_{\alpha}^{2}\)`
- the columns of `\(\mathbf{B}=p^{1/2}{\bf \tilde{V}}\)` are the column (standard) coordinates

--

Since `\({\bf \tilde{U}}{\bf \tilde{D}_{\alpha}}{\bf \tilde{V}^{{\sf T}}}\)` is the `\(d\)`-truncated SVD of `\(\bf{Y}\)`, `\(n^{-1/2}\mathbf{AB}^{\sf T}p^{-1/2}\)` is the best rank-`\(d\)` approximation of `\(\bf{Y}\)` in the least-squares sense.

---
class: animated fadeIn

### categorical data case: correspondence analysis (CA)

`\({\bf P}\)` is a two-way table of relative frequencies, crossing two categorical variables with `\(q_{r}\)` and `\(q_{c}\)` categories, respectively.

The CA loss function is:

** `$$\min_{\mathbf{A,B}}\left\Vert \mathbf{\tilde{P}-D}_{r}^{1/2}\mathbf{AB}^{\sf T}\mathbf{D}_{c}^{1/2}\right\Vert ^{2} \ \ s.t. \ \ \mathbf{B}^{\sf T}\mathbf{D}_{c}\mathbf{B}=\mathbf{I}_{d}$$` **

where `\(\mathbf{\tilde{P}=D}_{r}^{-1/2}\left(\mathbf{P}-\mathbf{rc}^{\sf T}\right)\mathbf{D}_{c}^{-1/2}\)`, `\(\mathbf{r=P1}_{q_c}\)`, `\(\mathbf{c} =\mathbf{P}^{\sf T}\mathbf{1}_{q_r}\)`, `\(\mathbf{D}_{r}=diag(\mathbf{r})\)`, `\(\mathbf{D}_{c} = diag(\mathbf{c})\)`

--

- the columns of `\(\mathbf{A=D}_{r}^{-1/2}\mathbf{\tilde{U}{\bf \tilde{D}}_{\alpha}}\)` are the row (principal) coordinates, and they are such that `\(\mathbf{A}^{\sf T}\mathbf{D}_{r}\mathbf{A}={\bf \tilde{D}}_{\alpha}^{2}\)`
- the columns of `\(\mathbf{B=D}_{c}^{-1/2}\mathbf{\tilde{V}}\)` are the column (standard) coordinates

--

Since `\({\bf \tilde{U}}{\bf \tilde{D}_{\alpha}}{\bf \tilde{V}^{{\sf T}}}\)` is the `\(d\)`-truncated SVD of `\(\bf{\tilde{P}}\)`, `\(\mathbf{D}_{r}^{1/2}\mathbf{AB}^{\sf T}\mathbf{D}_{c}^{1/2}\)` is the best rank-`\(d\)` approximation of `\(\bf{\tilde{P}}\)` in the least-squares sense.

---
class: animated fadeIn

### categorical data case: (multiple) correspondence analysis

MCA generalizes the application of CA to `\(p\)` categorical variables, each with `\(q_{j}\)` categories, `\(j=1,\ldots,p\)`.

- `\({\bf Z}^{\star}_{j}\)` is the one-hot encoding of the `\(j^{th}\)` categorical variable.
- `\({\bf Z}^{\star}=[{\bf Z}^{\star}_{1},\ldots,{\bf Z}^{\star}_{p}]\)` and `\({\bf Z}=\frac{1}{np}{\bf Z}^{\star}\)`, with `\(Q=\sum_{j=1}^{p}{q_{j}}\)` the total number of categories;
- the margins are `\({\bf r}=\frac{1}{n}{\bf 1}_{n}\)` and `\({\bf s}={\bf Z}^{{\sf T}}{\bf 1}_{n}\)`

--

The (M)CA loss function is:

** `$$\min_{\mathbf{A,B}}\left\Vert \mathbf{\tilde{Z}} - \frac{1}{\sqrt{n}}\mathbf{AB}^{\sf T}\mathbf{D}_{s}^{1/2}\right\Vert ^{2} \ \ s.t. \ \ \mathbf{B}^{\sf T}\mathbf{D}_{s}\mathbf{B}=\mathbf{I}_{d}$$` **

where `\(\mathbf{\tilde{Z}}=\sqrt{n}\left(\mathbf{Z}-\frac{1}{n}{\bf 1}_{n}{\bf 1}_{n}^{{\sf T}}{\bf Z}\right)\mathbf{D}_{s}^{-1/2}\)`

---
class: animated fadeIn

### mixed data case: factor analysis of mixed data (FAMD, Escofier, 1979)

Real datasets often contain both continuous and categorical variables. After appropriate pre-processing, dimension reduction is done via PCA.

--

.pull-left[
Let `\(\bf X\)` contain the continuous variables (centered and standardised);
]

.pull-right[
Let `\(\bf Z\)` be centered and standardized as well:

- the centering operator is `\({\bf M} = {\bf I}_{n} - n^{-1}{\bf 1}_{n}{\bf 1}^{{\sf T}}_{n}\)`
- the scaling weights are in `\({\bf D}_{z}=diag({\bf Z}^{\sf T}{\bf Z})\)`
]

--

The PCA of ** `\({\bf X^{\star}} = \left[{\bf X} \ \ {\bf M}{\bf Z}{\bf D}_{z}^{-1/2}\right]\)` ** is FAMD

---
class: animated fadeIn

### partitioning cluster analysis (K-means, MacQueen, 1967)

For continuous data and Euclidean distances, the K-means loss function can be defined as:

** `$$\min_{ {\bf Z}_{K} } \left \Vert \mathbf{X}-{\bf Z}_{K} {\bf G} \right\Vert ^{2}$$` **

- `\({\bf Z}_{K}\)` is an `\(n\times K\)` binary matrix, the dummy coding of the cluster allocation vector
- `\({\bf G} = \left({\bf Z}_{K}^{\sf T}{\bf Z}_{K}\right)^{-1}{\bf Z}^{\sf T}_{K}{\bf X}\)` is the `\(K\times p\)` matrix of cluster means

--

- **non-continuous data**
  - ad-hoc dissimilarities/distances
  - quantification

--

- **high dimensions**
  - the distances between any two points tend to converge to the same quantity: the curse of dimensionality

---
class: animated fadeIn

### Column- and row-wise dimension reduction

Practitioners often apply dimension reduction and then cluster the low-dimensional scores: this approach is referred to as **tandem analysis** (Arabie and Hubert, 1996)

- it mitigates the effects of the curse of dimensionality
- it may improve clustering
- the graphical display of the dataset at hand can help with cluster characterization

--

**but...**

- the dimension reduction step is independent of the clustering step

---
class: animated fadeIn

### cluster masking

Consider a toy example: three very well-separated clusters in two dimensions

<img src="./figures/pca_toy_anim.gif" width="45%" style="display: block; margin: auto;" />

---
class: animated fadeIn

### cluster masking

the PCA scores are easily clustered via, e.g., K-means

<img src="biplots_in_dm_clust_COMPSTAT22_files/figure-html/unnamed-chunk-3-1.png" width="45%" style="display: block; margin: auto;" />

---
class: animated fadeIn

### cluster masking

the toy example: adding 4 noise variables `\(X_{j}\sim N(0,\sigma^{2}_{j})\)`, `\(\sigma_{j}\in \left\{ 6, 9\right\}\)`, `\(j=3,\ldots,6\)`

---
class: animated fadeIn

### cluster masking

the toy example: adding 4 noise variables `\(X_{j}\sim N(0,\sigma^{2}_{j})\)`, `\(\sigma_{j}\in \left\{ 6, 9\right\}\)`, `\(j=3,\ldots,6\)`

<img src="biplots_in_dm_clust_COMPSTAT22_files/figure-html/unnamed-chunk-5-1.png" width="45%" style="display: block; margin: auto;" />

---
class: animated fadeIn
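### cluster masking

the toy example in code: a minimal sketch of the simulation and of the tandem steps in base R; the cluster centers, the points per cluster and the exact noise pattern are illustrative assumptions, not the code behind these figures

```r
# illustrative toy data: three well-separated clusters in two dimensions,
# plus four noise variables X_j ~ N(0, sigma_j^2) with sigma_j in {6, 9}
set.seed(1)
n_k     <- 50                                  # points per cluster (assumption)
centers <- matrix(c(-10, -10, 0, 10, 10, -10), # cluster centers (assumption)
                  ncol = 2, byrow = TRUE)
X12   <- do.call(rbind, lapply(1:3, function(k)
           cbind(rnorm(n_k, centers[k, 1]), rnorm(n_k, centers[k, 2]))))
noise <- sapply(c(6, 9, 6, 9), function(s) rnorm(3 * n_k, sd = s))
X     <- scale(cbind(X12, noise))              # n x 6 standardized matrix

# tandem analysis: PCA first, then K-means on the first two PC scores
scores <- prcomp(X)$x[, 1:2]
km     <- kmeans(scores, centers = 3, nstart = 25)
table(km$cluster, rep(1:3, each = n_k))        # the noise masks the clusters
```

---
class: animated fadeIn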
### cluster masking

The tandem approach: the 2-d PC map

<img src="./figures/tandem_toy_anim.gif" width="45%" style="display: block; margin: auto;" />

---
class: animated fadeIn

### cluster masking

The tandem approach will fail: the 2-d PC map does not display clustered observations

<img src="biplots_in_dm_clust_COMPSTAT22_files/figure-html/unnamed-chunk-7-1.png" width="45%" style="display: block; margin: auto;" />

---
class: animated fadeIn

### cluster masking

The tandem approach will fail: K-means considering an increasing number of PCs

<img src="biplots_in_dm_clust_COMPSTAT22_files/figure-html/unnamed-chunk-8-1.png" width="45%" style="display: block; margin: auto;" />

---
class: animated fadeIn

### cluster masking

The tandem approach will fail: K-means considering an increasing number of PCs

<img src="biplots_in_dm_clust_COMPSTAT22_files/figure-html/unnamed-chunk-9-1.png" width="45%" style="display: block; margin: auto;" />

---
class: animated fadeIn

### cluster masking

The tandem approach will fail: K-means considering an increasing number of PCs

<img src="biplots_in_dm_clust_COMPSTAT22_files/figure-html/unnamed-chunk-10-1.png" width="45%" style="display: block; margin: auto;" />

---
class: animated fadeIn

### cluster masking

The tandem approach will fail: K-means considering an increasing number of PCs

<img src="biplots_in_dm_clust_COMPSTAT22_files/figure-html/unnamed-chunk-11-1.png" width="45%" style="display: block; margin: auto;" />

---
class: animated fadeIn

### Beyond tandem analysis: joint methods

**Methods for continuous data**

- **Reduced K-means** (De Soete and Carroll, 1994): joint PCA + K-means
- **Factorial K-means** (Vichi and Kiers, 2001): joint PCA + K-means

**Methods for categorical data**

- **MCA K-means** (Hwang, Dillon, and Takane, 2006): joint MCA + K-means
- **iFCB** (Iodice D'Enza and Palumbo, 2013): iterative sequential NSCA + K-means
- **Cluster Correspondence Analysis** (van de Velden, Iodice D'Enza, and Palumbo, 2017): joint CA + K-means

**Methods for mixed data**

- **CDR** (clustering and dimension reduction for mixed data, Vichi, Vicari, and Kiers, 2019)
- **Groupals** (Van Buuren and Heiser, 1989)

--

All methods are implemented in the `\(\texttt{CRAN}\)` package `\(\texttt{clustrd}\)` (Markos, Iodice D'Enza, and van de Velden, 2019)

---
class: animated fadeIn

### JDR: a unified framework

The general objective can be formulated as follows (Yamamoto and Hwang, 2014; Vichi, Vicari, and Kiers, 2019):

** $$ \min\phi\left(\mathbf{B},\mathbf{Z}_{K}\right)= \alpha\left\Vert\mathbf{X} - \mathbf{X}\mathbf{B}\mathbf{B}^{\sf T}\right\Vert ^{2} + (1-\alpha)\left\Vert\mathbf{XB} - \mathbf{P}\mathbf{X}\mathbf{B}\right\Vert ^{2} $$ **

- `\(\mathbf{B}\)` is the loadings matrix
- `\(\mathbf{P} = \mathbf{Z}_{K}\left(\mathbf{Z}_{K}^{\sf T}\mathbf{Z}_{K}\right)^{-1}\mathbf{Z}_{K}^{\sf T}\)` is the projector that replaces each observation by its cluster centroid

- `\(\alpha = 1/2\)`, **Reduced K-means**
- `\(\alpha = 0\)`, **Factorial K-means**

Note that `\(\alpha = 1\)` gives the PCA solution.
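---
class: animated fadeIn

### JDR in practice: the `\(\texttt{clustrd}\)` package

A minimal usage sketch, assuming the (standardized) data sit in a matrix `X`; the `\(\texttt{cluspca}\)` interface is the one described in Markos, Iodice D'Enza, and van de Velden (2019).

```r
# joint dimension reduction and clustering for continuous data via cluspca();
# X is assumed to be a standardized numeric data matrix
library(clustrd)

rkm <- cluspca(X, nclus = 3, ndim = 2, method = "RKM")  # alpha = 1/2
fkm <- cluspca(X, nclus = 3, ndim = 2, method = "FKM")  # alpha = 0
tdm <- cluspca(X, nclus = 3, ndim = 2, alpha = 1)       # alpha = 1: tandem (PCA)

summary(rkm)  # cluster sizes, centroids and loadings
plot(rkm)     # low-dimensional map of objects and variables
```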
---
class: animated fadeIn

### JDR: a unified framework

** $$ \min\phi\left(\mathbf{B},\mathbf{Z}_{K}\right)= \alpha\left\Vert\mathbf{X} - \mathbf{X}\mathbf{B}\mathbf{B}^{\sf T}\right\Vert ^{2} + (1-\alpha)\left\Vert\mathbf{XB} - \mathbf{P}\mathbf{X}\mathbf{B}\right\Vert ^{2} $$ **

**Extensions to categorical data**

`$$\mathbf{X}^{cat} = \mathbf{M}\mathbf{Z}\mathbf{D}_{z}^{-1/2}$$`

- `\(\alpha = 1/2\)`, Cluster Correspondence Analysis

**Extensions to mixed-type data**

`$$\mathbf{X}^{mix} = \left[\mathbf{X}^{cnt} \hspace{.35cm} \mathbf{X}^{cat}\right]$$`

- `\(\alpha = 1/2\)`, Mixed Reduced K-means
- `\(\alpha = 0\)`, Mixed Factorial K-means

Note that `\(\alpha = 1\)` gives the PCAMIX/FAMD solution.

---
class: animated fadeIn

### JDR: a unified procedure

For a given `\(\alpha\)`, the following ALS algorithm is used to minimize the loss function:

1. Generate an initial cluster allocation `\(\mathbf{Z}_{K}\)` (e.g., by randomly assigning subjects to clusters).
2. Find the loadings `\(\mathbf{B}\)` as the `\(d\)` leading eigenvectors of `\(\mathbf{X}^{\sf T}\left((1-\alpha)\mathbf{P} - (1-2\alpha)\mathbf{I}\right)\mathbf{X}\)`.
3. Update the cluster allocation `\(\mathbf{Z}_{K}\)` by applying K-means to the reduced-space subject coordinates `\(\mathbf{X}\mathbf{B}\)`.
4. Repeat from step 2 with the updated `\(\mathbf{Z}_{K}\)` until convergence (the cluster allocation no longer changes).

---
class: animated fadeIn

### JDR on the toy data set: no more cluster masking

<img src="biplots_in_dm_clust_COMPSTAT22_files/figure-html/unnamed-chunk-12-1.png" width="45%" style="display: block; margin: auto;" />

---
class: animated fadeIn inverse center middle

## biplots in JDR
### for
### cluster characterization

---
class: animated fadeIn

### The Palmer Penguins data set (not iris)

(Horst, Hill, and Gorman, 2020)

.pull-left[
**342** penguins from three species

**4** variables

- bill length and depth
- body mass
- flipper length
]

.pull-right[
<img src="./figures/penguin_bill.png" width="65%" style="display: block; margin: auto;" />
]

.center[
<table>
<thead>
<tr> <th style="text-align:left;"> species </th> <th style="text-align:right;"> n </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> Adelie </td> <td style="text-align:right;"> 151 </td> </tr>
<tr> <td style="text-align:left;"> Chinstrap </td> <td style="text-align:right;"> 68 </td> </tr>
<tr> <td style="text-align:left;"> Gentoo </td> <td style="text-align:right;"> 123 </td> </tr>
</tbody>
</table>
]

---
class: animated fadeIn

### The Palmer Penguins: tandem

<img src="biplots_in_dm_clust_COMPSTAT22_files/figure-html/unnamed-chunk-15-1.png" width="55%" style="display: block; margin: auto;" />

---
class: animated fadeIn

### The Palmer Penguins: tandem to RKM

<img src="./figures/tandem_to_rkm_anim.gif" width="55%" style="display: block; margin: auto;" />

---
class: animated fadeIn

### The Palmer Penguins: RKM

<img src="biplots_in_dm_clust_COMPSTAT22_files/figure-html/unnamed-chunk-17-1.png" width="55%" style="display: block; margin: auto;" />

---
class: animated fadeIn

### The Palmer Penguins: cluster characterization

<img src="biplots_in_dm_clust_COMPSTAT22_files/figure-html/unnamed-chunk-18-1.png" width="55%" style="display: block; margin: auto;" />

---
class: animated fadeIn

### The Palmer Penguins: results

.pull-left[
**tandem**

<img src="biplots_in_dm_clust_COMPSTAT22_files/figure-html/unnamed-chunk-19-1.png" width="75%" style="display: block; margin: auto;" />
]

.pull-right[
**RKM**

<img src="biplots_in_dm_clust_COMPSTAT22_files/figure-html/unnamed-chunk-20-1.png" width="75%" style="display: block; margin: auto;" />
]

---
class: animated fadeIn middle

### The zoo dataset

This is a dataset from the UCI repo (Newman, Hettich, Blake, and Merz, 1998; Leisch and Dimitriadou, 2021)

- **82** animals of four types: **mammal**, **bird**, **fish** and **insect**
- **16** categorical variables describing the animals' characteristics

---
class: animated fadeIn

### The zoo dataset

<img src="biplots_in_dm_clust_COMPSTAT22_files/figure-html/unnamed-chunk-23-1.png" width="45%" style="display: block; margin: auto;" />

---
class: animated fadeIn

### The zoo: asymmetric map to contribution biplot (Greenacre, 2013)

<img src="./figures/zoo_contr_anim.gif" width="45%" style="display: block; margin: auto;" />

---
class: animated fadeIn

### The zoo: contribution biplot for cluster characterization

<img src="biplots_in_dm_clust_COMPSTAT22_files/figure-html/unnamed-chunk-25-1.png" width="45%" style="display: block; margin: auto;" />

---
class: animated fadeIn

### The zoo: cluster characterization barplot

<img src="biplots_in_dm_clust_COMPSTAT22_files/figure-html/unnamed-chunk-26-1.png" width="65%" style="display: block; margin: auto;" />

---
class: animated fadeIn

### The zoo: species identification

**Zoo species**

<img src="biplots_in_dm_clust_COMPSTAT22_files/figure-html/unnamed-chunk-27-1.png" width="45%" style="display: block; margin: auto;" />

--

.center[
the two errors are **dolphin** and **porpoise**
]

---
class: animated fadeIn middle

## Conclusion

- several clustering evaluation measures exist (internal and external)
- in unsupervised learning, finding the *best* method, or the best setting for a method, is no easy task
- interpretability of a cluster solution is important
- biplots can be a useful tool to interpret clusters in a joint dimension reduction and cluster analysis setting

---
class: animated fadeIn

### References

Arabie, P. and L. Hubert (1996). "Advances in cluster analysis relevant to marketing research". In: _From Data to Knowledge_. Springer, pp. 3-19.

De Soete, G. and J. D. Carroll (1994). "K-means clustering in a low-dimensional Euclidean space". In: _New approaches in classification and data analysis_. Springer, Berlin, Heidelberg, pp. 212-219.

Escofier, B. (1979). "Traitement simultané de variables qualitatives et quantitatives en analyse factorielle". In: _Cahiers de l'Analyse des Données_ 4.2, pp. 137-146.

Farcomeni, A. and L. Greco (2016). _Robust methods for data reduction_. CRC Press.

Greenacre, M. (2013). "Contribution biplots". In: _Journal of Computational and Graphical Statistics_ 22.1, pp. 107-122.

Horst, A. M., A. P. Hill, and K. B. Gorman (2020). _palmerpenguins: Palmer Archipelago (Antarctica) penguin data_. R package version 0.1.0. URL: [https://allisonhorst.github.io/palmerpenguins/](https://allisonhorst.github.io/palmerpenguins/).

Hwang, H., W. R. Dillon, and Y. Takane (2006). "An extension of multiple correspondence analysis for identifying heterogeneous subgroups of respondents". In: _Psychometrika_ 71.1, pp. 161-171.

---
class: animated fadeIn

### References (2)

Iodice D'Enza, A. and F. Palumbo (2013). "Iterative factor clustering of binary data". In: _Computational Statistics_ 28.2, pp. 789-807.

Leisch, F. and E. Dimitriadou (2021). _mlbench: Machine Learning Benchmark Problems_. R package version 2.1-3.

MacQueen, J. (1967). "Some methods for classification and analysis of multivariate observations".
In: _Proceedings of the fifth Berkeley symposium on mathematical statistics and probability_. Vol. 1. Oakland, CA, USA, pp. 281-297.

Markos, A., A. Iodice D'Enza, and M. van de Velden (2019). "Beyond Tandem Analysis: Joint Dimension Reduction and Clustering in R". In: _Journal of Statistical Software_ 91.10, pp. 1-24. DOI: [10.18637/jss.v091.i10](https://doi.org/10.18637%2Fjss.v091.i10).

Newman, D., S. Hettich, C. Blake, and C. Merz (1998). _UCI Repository of machine learning databases_. URL: [http://www.ics.uci.edu/~mlearn/MLRepository.html](http://www.ics.uci.edu/~mlearn/MLRepository.html).

Van Buuren, S. and W. J. Heiser (1989). "Clustering n objects into k groups under optimal scaling of variables". In: _Psychometrika_ 54.4, pp. 699-706.

van de Velden, M., A. Iodice D'Enza, and F. Palumbo (2017). "Cluster correspondence analysis". In: _Psychometrika_ 82.1, pp. 158-185.

---
class: animated fadeIn

### References (3)

Vichi, M. and H. A. Kiers (2001). "Factorial k-means analysis for two-way data". In: _Computational Statistics & Data Analysis_ 37.1, pp. 49-64.

Vichi, M., D. Vicari, and H. A. Kiers (2019). "Clustering and dimension reduction for mixed variables". In: _Behaviormetrika_ 46.2, pp. 243-269.

Yamamoto, M. and H. Hwang (2014). "A general formulation of cluster analysis with dimension reduction and subspace separation". In: _Behaviormetrika_ 41.1, pp. 115-129.

---
class: animated fadeIn center middle inverse

# bonus example: the mixed data case

---
class: animated fadeIn middle

### The Diamond Stone Pricing dataset

This is a dataset from .

- Data on **308** diamond stones sold in Singapore.
- **4** continuous variables: diamond size (**3** weighted binary variables) and price in $
- **3** categorical variables (diamond colour, clarity, certification body)

---
class: animated fadeIn

### Diamond Stone Pricing: Mixed Reduced K-means

.pull-left[
<img src="biplots_in_dm_clust_COMPSTAT22_files/figure-html/unnamed-chunk-32-1.png" width="85%" style="display: block; margin: auto;" />
]

.pull-right[
<img src="biplots_in_dm_clust_COMPSTAT22_files/figure-html/unnamed-chunk-33-1.png" width="85%" style="display: block; margin: auto;" />
]
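---
class: animated fadeIn

### Diamond Stone Pricing: a `\(\texttt{clustrd}\)` sketch

A minimal sketch of how such a mixed analysis can be run; that the data ship with `\(\texttt{clustrd}\)` under the name `diamond`, and the choice of 3 clusters, are assumptions for illustration.

```r
# a minimal sketch for the mixed-data case; the dataset name `diamond`
# and nclus = 3 are assumptions for illustration
library(clustrd)
data("diamond", package = "clustrd")

# mixed reduced K-means (alpha = 1/2) on the continuous + categorical columns
mrkm <- cluspcamix(diamond, nclus = 3, ndim = 2, method = "mixedRKM")

summary(mrkm)
plot(mrkm)  # maps of objects, variables and category points
```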