class: center, middle, inverse, title-slide

# Biplots in dimension reduction and clustering

### Michel van de Velden, Alfonso Iodice D'Enza and Angelos Markos

### CSDA & EcoSta Workshop on Statistical Data Science (SDS 2022)
26 August 2022

---
class: animated fadeIn

### dimension reduction

- <h4 style = "color:#37499c"> variable and sample reduction</h4>
- <h4 style = "color:#37499c"> the tandem approach and the cluster masking problem </h4>

--

### joint DR

- <h4 style = "color:#37499c"> a unified framework for continuous, categorical and mixed data </h4>

--

### biplots in joint DR for cluster characterization

- <h4 style = "color:#37499c"> discriminant analysis biplot, contribution biplot </h4>

---
class: animated fadeIn middle

### dimension reduction

**summarising** a two-way data matrix by **aggregating measurements** (Farcomeni and Greco, 2016)

- column-wise reduction: define a limited number of linear combinations (**dimension reduction**)
- row-wise reduction: define a limited number of observations, each representative of a homogeneous group (**partitioning clustering**)

---
class: animated fadeIn

### continuous data case: principal component analysis (PCA)

`\({\bf X}\)` is the `\(n \times p\)` standardized data matrix. The PCA loss function can be defined as:

** `$$\min_{\mathbf{A,B}}\left\Vert \mathbf{Y}-n^{-1/2}\mathbf{AB}^{\sf T}p^{-1/2}\right\Vert ^{2} \ \ s.t. \ \ \frac{1}{p}\mathbf{B}^{\sf T}\mathbf{B}=\mathbf{I}_{d}$$` **

where `\(\mathbf{Y}=n^{-1/2}\mathbf{X}p^{-1/2}\)`.

--

- the columns of `\(\mathbf{A}=n^{1/2}\mathbf{\tilde{U}{\bf \tilde{D}}_{\alpha}}\)` are the row (principal) coordinates, and they are such that `\(\frac{1}{n}\mathbf{A}^{\sf T}\mathbf{A}={\bf \tilde{D}}_{\alpha}^{2}\)`
- the columns of `\(\mathbf{B}=p^{1/2}{\bf \tilde{V}}\)` are the column (standard) coordinates

--

Since `\({\bf \tilde{U}}{\bf \tilde{D}_{\alpha}}{\bf \tilde{V}^{{\sf T}}}\)` is the `\(d\)`-truncated SVD of `\(\bf{Y}\)`, `\(n^{-1/2}\mathbf{AB}^{\sf T}p^{-1/2}\)` is the best rank-`\(d\)` approximation of `\(\bf{Y}\)` in the least-squares sense.

---
class: animated fadeIn

### categorical data case: correspondence analysis (CA)

`\({\bf P}\)` is a two-way table of relative frequencies, crossing two categorical variables with `\(q_{r}\)` and `\(q_{c}\)` categories, respectively.

The CA loss function is:

** `$$\min_{\mathbf{A,B}}\left\Vert \mathbf{\tilde{P}-D}_{r}^{1/2}\mathbf{AB}^{\sf T}\mathbf{D}_{c}^{1/2}\right\Vert ^{2} \ \ s.t. \ \ \mathbf{B}^{\sf T}\mathbf{D}_{c}\mathbf{B}=\mathbf{I}_{d}$$` **

where `\(\mathbf{\tilde{P}=D}_{r}^{-1/2}\left(\mathbf{P}-\mathbf{rc}^{\sf T}\right)\mathbf{D}_{c}^{-1/2}\)`, `\(\mathbf{r=P1}_{q_c}\)`, `\(\mathbf{c} =\mathbf{P}^{\sf T}\mathbf{1}_{q_r}\)`, `\(\mathbf{D}_{r}=diag(\mathbf{r})\)`, `\(\mathbf{D}_{c} = diag(\mathbf{c})\)`

--

- the columns of `\(\mathbf{A=D}_{r}^{-1/2}\mathbf{\tilde{U}{\bf \tilde{D}}_{\alpha}}\)` are the row (principal) coordinates, and they are such that `\(\mathbf{A}^{\sf T}\mathbf{D}_{r}\mathbf{A}={\bf \tilde{D}}_{\alpha}^{2}\)`
- the columns of `\(\mathbf{B=D}_{c}^{-1/2}\mathbf{\tilde{V}}\)` are the column (standard) coordinates

--

Since `\({\bf \tilde{U}}{\bf \tilde{D}_{\alpha}}{\bf \tilde{V}^{{\sf T}}}\)` is the `\(d\)`-truncated SVD of `\(\bf{\tilde{P}}\)`, `\(\mathbf{D}_{r}^{1/2}\mathbf{AB}^{\sf T}\mathbf{D}_{c}^{1/2}\)` is the best rank-`\(d\)` approximation of `\(\bf{\tilde{P}}\)` in the least-squares sense.

---
class: animated fadeIn

### categorical data case: (multiple) correspondence analysis

MCA generalizes the application of CA to `\(p\)` categorical variables, each with `\(q_{j}\)` categories, `\(j=1,\ldots,p\)`.

- `\({\bf Z}^{\star}_{j}\)` is the one-hot encoding of the `\(j^{th}\)` categorical variable.
- `\({\bf Z}^{\star}=[{\bf Z}^{\star}_{1},\ldots,{\bf Z}^{\star}_{p}]\)` and `\({\bf Z}=\frac{1}{np}{\bf Z}^{\star}\)`, with `\(Q=\sum_{j=1}^{p}{q_{j}}\)` the total number of categories;
- the margins are `\({\bf r}=\frac{1}{n}{\bf 1}_{n}\)` and `\({\bf s}={\bf Z}^{{\sf T}}{\bf 1}_{n}\)`

--

The (M)CA loss function is:

** `$$\min_{\mathbf{A,B}}\left\Vert \mathbf{\tilde{Z}} - \frac{1}{\sqrt{n}}\mathbf{AB}^{\sf T}\mathbf{D}_{s}^{1/2}\right\Vert ^{2} \ \ s.t. \ \ \mathbf{B}^{\sf T}\mathbf{D}_{s}\mathbf{B}=\mathbf{I}_{d}$$` **

where `\(\mathbf{\tilde{Z}}=\sqrt{n}\left(\mathbf{Z}-\frac{1}{n}{\bf 1}_{n}{\bf 1}_{n}^{{\sf T}}{\bf Z}\right)\mathbf{D}_{s}^{-1/2}\)`

---
class: animated fadeIn

### mixed data case: factor analysis of mixed data (FAMD, Escofier, 1979)

Real datasets often contain both continuous and categorical variables. After appropriate pre-processing, dimension reduction is done via PCA.

--

.pull-left[
Let `\(\bf X\)` contain the continuous variables (centered and standardised);
]

.pull-right[
Let `\(\bf Z\)` be centered and standardized as well:

- the centering operator is `\({\bf M} = {\bf I}_{n} - n^{-1}{\bf 1}_{n}{\bf 1}^{{\sf T}}_{n}\)`
- the scaling weights are in `\({\bf D}_{z}=diag({\bf Z}^{\sf T}{\bf Z})\)`
]

--

The PCA of ** `\({\bf X^{\star}} = \left[{\bf X} \ \ {\bf M}{\bf Z}{\bf D}_{z}^{-1/2}\right]\)` ** is FAMD

---
class: animated fadeIn

### partitioning cluster analysis (K-means, MacQueen, 1967)

For continuous data and Euclidean distances, the K-means loss function can be defined as:

** `$$\min_{ {\bf Z}_{K} } \left \Vert \mathbf{X}-{\bf Z}_{K} {\bf G} \right\Vert ^{2}$$` **

- `\({\bf Z}_{K}\)` is an `\(n\times K\)` binary matrix, the dummy coding of the cluster allocation vector
- `\({\bf G} = \left({\bf Z}_{K}^{\sf T}{\bf Z}_{K}\right)^{-1}{\bf Z}^{\sf T}_{K}{\bf X}\)` is the `\(K\times p\)` matrix of cluster means

--

- **non-continuous data**
  - ad-hoc dissimilarities/distances
  - quantification

--

- **high dimensions**
  - the distances between any two points tend to converge to the same quantity: the curse of dimensionality

---
class: animated fadeIn

### Column- and row-wise dimension reduction

Practitioners often apply dimension reduction and then cluster the low-dimensional scores: this approach is referred to as **tandem analysis** (Arabie and Hubert, 1996)

- it mitigates the effects of the curse of dimensionality
- it may improve clustering
- the graphical display of the dataset at hand can help with cluster characterization

--

**but...**

- the dimension reduction step is independent of the clustering step

---
class: animated fadeIn

### cluster masking

Consider a toy example: three very well-separated clusters in two dimensions

<img src="./figures/pca_toy_anim.gif" width="45%" style="display: block; margin: auto;" />

---
class: animated fadeIn

### cluster masking

the PCA scores are easily clustered via, e.g., K-means

<img src="biplots_in_dm_clust_COMPSTAT22_files/figure-html/unnamed-chunk-3-1.png" width="45%" style="display: block; margin: auto;" />

---
class: animated fadeIn

### cluster masking

the toy example: adding 4 noise variables `\(X_{j}\sim N(0,\sigma^{2}_{j})\)`, `\(\sigma_{j}\in \left\{ 6, 9\right\}\)`, `\(j=3,\ldots,6\)`

---
class: animated fadeIn

### cluster masking

the toy example: adding 4 noise variables `\(X_{j}\sim N(0,\sigma^{2}_{j})\)`, `\(\sigma_{j}\in \left\{ 6, 9\right\}\)`, `\(j=3,\ldots,6\)`

<img src="biplots_in_dm_clust_COMPSTAT22_files/figure-html/unnamed-chunk-5-1.png" width="45%" style="display: block; margin: auto;" />

---
class: animated fadeIn
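### cluster masking

the toy example in code: a minimal sketch of the simulation and of the tandem steps in base R; the cluster centers, the points per cluster and the exact noise pattern are illustrative assumptions, not the code behind these figures

```r
# illustrative toy data: three well-separated clusters in two dimensions,
# plus four noise variables X_j ~ N(0, sigma_j^2) with sigma_j in {6, 9}
set.seed(1)
n_k     <- 50                                  # points per cluster (assumption)
centers <- matrix(c(-10, -10, 0, 10, 10, -10), # cluster centers (assumption)
                  ncol = 2, byrow = TRUE)
X12   <- do.call(rbind, lapply(1:3, function(k)
           cbind(rnorm(n_k, centers[k, 1]), rnorm(n_k, centers[k, 2]))))
noise <- sapply(c(6, 9, 6, 9), function(s) rnorm(3 * n_k, sd = s))
X     <- scale(cbind(X12, noise))              # n x 6 standardized matrix

# tandem analysis: PCA first, then K-means on the first two PC scores
scores <- prcomp(X)$x[, 1:2]
km     <- kmeans(scores, centers = 3, nstart = 25)
table(km$cluster, rep(1:3, each = n_k))        # the noise masks the clusters
```

---
class: animated fadeIn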
### cluster masking

The tandem approach: the 2-d PC map

<img src="./figures/tandem_toy_anim.gif" width="45%" style="display: block; margin: auto;" />

---
class: animated fadeIn

### cluster masking

The tandem approach will fail: the 2-d PC map does not display clustered observations

<img src="biplots_in_dm_clust_COMPSTAT22_files/figure-html/unnamed-chunk-7-1.png" width="45%" style="display: block; margin: auto;" />

---
class: animated fadeIn

### cluster masking

The tandem approach will fail: K-means considering an increasing number of PCs

<img src="biplots_in_dm_clust_COMPSTAT22_files/figure-html/unnamed-chunk-8-1.png" width="45%" style="display: block; margin: auto;" />

---
class: animated fadeIn

### cluster masking

The tandem approach will fail: K-means considering an increasing number of PCs

<img src="biplots_in_dm_clust_COMPSTAT22_files/figure-html/unnamed-chunk-9-1.png" width="45%" style="display: block; margin: auto;" />

---
class: animated fadeIn

### cluster masking

The tandem approach will fail: K-means considering an increasing number of PCs

<img src="biplots_in_dm_clust_COMPSTAT22_files/figure-html/unnamed-chunk-10-1.png" width="45%" style="display: block; margin: auto;" />

---
class: animated fadeIn

### cluster masking

The tandem approach will fail: K-means considering an increasing number of PCs

<img src="biplots_in_dm_clust_COMPSTAT22_files/figure-html/unnamed-chunk-11-1.png" width="45%" style="display: block; margin: auto;" />

---
class: animated fadeIn

### Beyond tandem analysis: joint methods

**Methods for continuous data**

- **Reduced K-means** (De Soete and Carroll, 1994): joint PCA + K-means
- **Factorial K-means** (Vichi and Kiers, 2001): joint PCA + K-means

**Methods for categorical data**

- **MCA K-means** (Hwang, Dillon, and Takane, 2006): joint MCA + K-means
- **iFCB** (Iodice D'Enza and Palumbo, 2013): iterative sequential NSCA + K-means
- **Cluster Correspondence Analysis** (van de Velden, Iodice D'Enza, and Palumbo, 2017): joint CA + K-means

**Methods for mixed data**

- **CDR** (clustering and dimension reduction for mixed data, Vichi, Vicari, and Kiers, 2019)
- **Groupals** (Van Buuren and Heiser, 1989)

--

All methods are implemented in the `\(\texttt{CRAN}\)` package `\(\texttt{clustrd}\)` (Markos, Iodice D'Enza, and van de Velden, 2019)

---
class: animated fadeIn

### JDR: a unified framework

The general objective can be formulated as follows (Yamamoto and Hwang, 2014; Vichi, Vicari, and Kiers, 2019):

** $$ \min\phi\left(\mathbf{B},\mathbf{Z}_{K}\right)= \alpha\left\Vert\mathbf{X} - \mathbf{X}\mathbf{B}\mathbf{B}^{\sf T}\right\Vert ^{2} + (1-\alpha)\left\Vert\mathbf{XB} - \mathbf{P}\mathbf{X}\mathbf{B}\right\Vert ^{2} $$ **

- `\(\mathbf{B}\)` is the loadings matrix
- `\(\mathbf{P} = \mathbf{Z}_{K}\left(\mathbf{Z}_{K}^{\sf T}\mathbf{Z}_{K}\right)^{-1}\mathbf{Z}_{K}^{\sf T}\)` is the projector that replaces each observation by its cluster centroid

- `\(\alpha = 1/2\)`, **Reduced K-means**
- `\(\alpha = 0\)`, **Factorial K-means**

Note that `\(\alpha = 1\)` gives the PCA solution.
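---
class: animated fadeIn

### JDR in practice: the `\(\texttt{clustrd}\)` package

A minimal usage sketch, assuming the (standardized) data sit in a matrix `X`; the `\(\texttt{cluspca}\)` interface is the one described in Markos, Iodice D'Enza, and van de Velden (2019).

```r
# joint dimension reduction and clustering for continuous data via cluspca();
# X is assumed to be a standardized numeric data matrix
library(clustrd)

rkm <- cluspca(X, nclus = 3, ndim = 2, method = "RKM")  # alpha = 1/2
fkm <- cluspca(X, nclus = 3, ndim = 2, method = "FKM")  # alpha = 0
tdm <- cluspca(X, nclus = 3, ndim = 2, alpha = 1)       # alpha = 1: tandem (PCA)

summary(rkm)  # cluster sizes, centroids and loadings
plot(rkm)     # low-dimensional map of objects and variables
```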
---
class: animated fadeIn

### JDR: a unified framework

** $$ \min\phi\left(\mathbf{B},\mathbf{Z}_{K}\right)= \alpha\left\Vert\mathbf{X} - \mathbf{X}\mathbf{B}\mathbf{B}^{\sf T}\right\Vert ^{2} + (1-\alpha)\left\Vert\mathbf{XB} - \mathbf{P}\mathbf{X}\mathbf{B}\right\Vert ^{2} $$ **

**Extensions to categorical data**

`$$\mathbf{X}^{cat} = \mathbf{M}\mathbf{Z}\mathbf{D}_{z}^{-1/2}$$`

- `\(\alpha = 1/2\)`, Cluster Correspondence Analysis

**Extensions to mixed-type data**

`$$\mathbf{X}^{mix} = \left[\mathbf{X}^{cnt} \hspace{.35cm} \mathbf{X}^{cat}\right]$$`

- `\(\alpha = 1/2\)`, Mixed Reduced K-means
- `\(\alpha = 0\)`, Mixed Factorial K-means

Note that `\(\alpha = 1\)` gives the PCAMIX/FAMD solution.

---
class: animated fadeIn

### JDR: a unified procedure

For a given `\(\alpha\)`, the following ALS algorithm is used to minimize the loss function:

1. Generate an initial cluster allocation `\(\mathbf{Z}_{K}\)` (e.g., by randomly assigning subjects to clusters).
2. Find the loadings `\(\mathbf{B}\)` as the `\(d\)` leading eigenvectors of `\(\mathbf{X}^{\sf T}\left((1-\alpha)\mathbf{P} - (1-2\alpha)\mathbf{I}\right)\mathbf{X}\)`.
3. Update the cluster allocation `\(\mathbf{Z}_{K}\)` by applying K-means to the reduced-space subject coordinates `\(\mathbf{X}\mathbf{B}\)`.
4. Repeat from step 2 with the updated `\(\mathbf{Z}_{K}\)` until convergence (the cluster allocation no longer changes).

---
class: animated fadeIn

### JDR on the toy data set: no more cluster masking

<img src="biplots_in_dm_clust_COMPSTAT22_files/figure-html/unnamed-chunk-12-1.png" width="45%" style="display: block; margin: auto;" />

---
class: animated fadeIn inverse center middle

## biplots in JDR
### for
### cluster characterization

---
class: animated fadeIn

### The Palmer Penguins data set (not iris)

(Horst, Hill, and Gorman, 2020)

.pull-left[
**342** penguins from three species

**4** variables

- bill length and depth
- body mass
- flipper length
]

.pull-right[
<img src="./figures/penguin_bill.png" width="65%" style="display: block; margin: auto;" />
]

.center[
<table>
<thead>
<tr> <th style="text-align:left;"> species </th> <th style="text-align:right;"> n </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> Adelie </td> <td style="text-align:right;"> 151 </td> </tr>
<tr> <td style="text-align:left;"> Chinstrap </td> <td style="text-align:right;"> 68 </td> </tr>
<tr> <td style="text-align:left;"> Gentoo </td> <td style="text-align:right;"> 123 </td> </tr>
</tbody>
</table>
]

---
class: animated fadeIn

### The Palmer Penguins: tandem

<img src="biplots_in_dm_clust_COMPSTAT22_files/figure-html/unnamed-chunk-15-1.png" width="55%" style="display: block; margin: auto;" />

---
class: animated fadeIn

### The Palmer Penguins: tandem to RKM

<img src="./figures/tandem_to_rkm_anim.gif" width="55%" style="display: block; margin: auto;" />

---
class: animated fadeIn

### The Palmer Penguins: RKM

<img src="biplots_in_dm_clust_COMPSTAT22_files/figure-html/unnamed-chunk-17-1.png" width="55%" style="display: block; margin: auto;" />

---
class: animated fadeIn

### The Palmer Penguins: cluster characterization

<img src="biplots_in_dm_clust_COMPSTAT22_files/figure-html/unnamed-chunk-18-1.png" width="55%" style="display: block; margin: auto;" />

---
class: animated fadeIn

### The Palmer Penguins: results

.pull-left[
**tandem**

<img src="biplots_in_dm_clust_COMPSTAT22_files/figure-html/unnamed-chunk-19-1.png" width="75%" style="display: block; margin: auto;" />
]

.pull-right[
**RKM**

<img src="biplots_in_dm_clust_COMPSTAT22_files/figure-html/unnamed-chunk-20-1.png" width="75%" style="display: block; margin: auto;" />
]

---
class: animated fadeIn middle

### The zoo dataset

This is a dataset from the UCI repo (Newman, Hettich, Blake, and Merz, 1998; Leisch and Dimitriadou, 2021)

- **82** animals of four types: **mammal**, **bird**, **fish** and **insect**
- **16** categorical variables describing the animals' characteristics

---
class: animated fadeIn

### The zoo dataset

<img src="biplots_in_dm_clust_COMPSTAT22_files/figure-html/unnamed-chunk-23-1.png" width="45%" style="display: block; margin: auto;" />

---
class: animated fadeIn

### The zoo: asymmetric map to contribution biplot (Greenacre, 2013)

<img src="./figures/zoo_contr_anim.gif" width="45%" style="display: block; margin: auto;" />

---
class: animated fadeIn

### The zoo: contribution biplot for cluster characterization

<img src="biplots_in_dm_clust_COMPSTAT22_files/figure-html/unnamed-chunk-25-1.png" width="45%" style="display: block; margin: auto;" />

---
class: animated fadeIn

### The zoo: cluster characterization barplot

<img src="biplots_in_dm_clust_COMPSTAT22_files/figure-html/unnamed-chunk-26-1.png" width="65%" style="display: block; margin: auto;" />

---
class: animated fadeIn

### The zoo: species identification

**Zoo species**

<img src="biplots_in_dm_clust_COMPSTAT22_files/figure-html/unnamed-chunk-27-1.png" width="45%" style="display: block; margin: auto;" />

--

.center[
the two errors are **dolphin** and **porpoise**
]

---
class: animated fadeIn middle

## Conclusion

- several clustering evaluation measures exist (internal and external)
- in unsupervised learning, finding the *best* method, or the best setting for a method, is no easy task
- interpretability of a cluster solution is important
- biplots can be a useful tool to interpret clusters in a joint dimension reduction and cluster analysis setting

---
class: animated fadeIn

### References

Arabie, P. and L. Hubert (1996). "Advances in cluster analysis relevant to marketing research". In: _From Data to Knowledge_. Springer, pp. 3-19.

De Soete, G. and J. D. Carroll (1994). "K-means clustering in a low-dimensional Euclidean space". In: _New approaches in classification and data analysis_. Springer, Berlin, Heidelberg, pp. 212-219.

Escofier, B. (1979). "Traitement simultané de variables qualitatives et quantitatives en analyse factorielle". In: _Cahiers de l'Analyse des Données_ 4.2, pp. 137-146.

Farcomeni, A. and L. Greco (2016). _Robust methods for data reduction_. CRC Press.

Greenacre, M. (2013). "Contribution biplots". In: _Journal of Computational and Graphical Statistics_ 22.1, pp. 107-122.

Horst, A. M., A. P. Hill, and K. B. Gorman (2020). _palmerpenguins: Palmer Archipelago (Antarctica) penguin data_. R package version 0.1.0. URL: [https://allisonhorst.github.io/palmerpenguins/](https://allisonhorst.github.io/palmerpenguins/).

Hwang, H., W. R. Dillon, and Y. Takane (2006). "An extension of multiple correspondence analysis for identifying heterogeneous subgroups of respondents". In: _Psychometrika_ 71.1, pp. 161-171.

---
class: animated fadeIn

### References (2)

Iodice D'Enza, A. and F. Palumbo (2013). "Iterative factor clustering of binary data". In: _Computational Statistics_ 28.2, pp. 789-807.

Leisch, F. and E. Dimitriadou (2021). _mlbench: Machine Learning Benchmark Problems_. R package version 2.1-3.

MacQueen, J. (1967). "Some methods for classification and analysis of multivariate observations".
In: _Proceedings of the fifth Berkeley symposium on mathematical statistics and probability_. Vol. 1. Oakland, CA, USA, pp. 281-297.

Markos, A., A. Iodice D'Enza, and M. van de Velden (2019). "Beyond Tandem Analysis: Joint Dimension Reduction and Clustering in R". In: _Journal of Statistical Software_ 91.10, pp. 1-24. DOI: [10.18637/jss.v091.i10](https://doi.org/10.18637%2Fjss.v091.i10).

Newman, D., S. Hettich, C. Blake, and C. Merz (1998). _UCI Repository of machine learning databases_. URL: [http://www.ics.uci.edu/~mlearn/MLRepository.html](http://www.ics.uci.edu/~mlearn/MLRepository.html).

Van Buuren, S. and W. J. Heiser (1989). "Clustering n objects into k groups under optimal scaling of variables". In: _Psychometrika_ 54.4, pp. 699-706.

van de Velden, M., A. Iodice D'Enza, and F. Palumbo (2017). "Cluster correspondence analysis". In: _Psychometrika_ 82.1, pp. 158-185.

---
class: animated fadeIn

### References (3)

Vichi, M. and H. A. Kiers (2001). "Factorial k-means analysis for two-way data". In: _Computational Statistics & Data Analysis_ 37.1, pp. 49-64.

Vichi, M., D. Vicari, and H. A. Kiers (2019). "Clustering and dimension reduction for mixed variables". In: _Behaviormetrika_ 46.2, pp. 243-269.

Yamamoto, M. and H. Hwang (2014). "A general formulation of cluster analysis with dimension reduction and subspace separation". In: _Behaviormetrika_ 41.1, pp. 115-129.

---
class: animated fadeIn center middle inverse

# bonus example: the mixed data case

---
class: animated fadeIn middle

### The Diamond Stone Pricing dataset

This is a dataset from .

- Data on **308** diamond stones sold in Singapore.
- **4** continuous variables: diamond size (**3** weighted binary variables) and price in $
- **3** categorical variables (diamond colour, clarity, certification body)

---
class: animated fadeIn

### Diamond Stone Pricing: Mixed Reduced K-means

.pull-left[
<img src="biplots_in_dm_clust_COMPSTAT22_files/figure-html/unnamed-chunk-32-1.png" width="85%" style="display: block; margin: auto;" />
]

.pull-right[
<img src="biplots_in_dm_clust_COMPSTAT22_files/figure-html/unnamed-chunk-33-1.png" width="85%" style="display: block; margin: auto;" />
]
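---
class: animated fadeIn

### Diamond Stone Pricing: a `\(\texttt{clustrd}\)` sketch

A minimal sketch of how such a mixed analysis can be run; that the data ship with `\(\texttt{clustrd}\)` under the name `diamond`, and the choice of 3 clusters, are assumptions for illustration.

```r
# a minimal sketch for the mixed-data case; the dataset name `diamond`
# and nclus = 3 are assumptions for illustration
library(clustrd)
data("diamond", package = "clustrd")

# mixed reduced K-means (alpha = 1/2) on the continuous + categorical columns
mrkm <- cluspcamix(diamond, nclus = 3, ndim = 2, method = "mixedRKM")

summary(mrkm)
plot(mrkm)  # maps of objects, variables and category points
```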