1. On the equivalence of bottom-up/top-down the hierarchy levels
The Ward.D2 (and, Ward.D) aggregation methods that can be selected in the standard hclust function are based on the Lance-Williams formula (that generalizes at any linkage function). To keep it simple, and to avoid computing the full hierarchy for each permutation, we use the Minimum Increase of Sum of Squares (MISSQ) computed at each split
\[
MISSQ:\frac{n_{L} \cdot n_{R}}{n_{L}+n_{R}}\|\mu_{L}-\mu_{R}\|^{2} = \sum_{{\bf x}_{i}\in (L\cup R)}\|{\bf x}_{i}-\mu_{(L\cup R)} \|^{2} - \sum_{{\bf x}_{i}\in (L)}\|{\bf x}_{i}-\mu_{(L)} \|^{2} - \sum_{{\bf x}_{i}\in (R)}\|{\bf x}_{i}-\mu_{(R)} \|^{2}
\] that is, the increase in the \(SS\) due to the merge.
The quantities in Equation (1) can also be computed via pairwise distances, since, for the general group A, the following equation holds
\[
\sum_{x_{i}\in A}\|x_{i}-\mu_{A} \|^{2} = \frac{1}{n_{A}}\sum_{i=1}^{n_{A}}\sum_{j=1}^{n_{A}}\|x_{i}-x_{j} \|^{2} \ \ \ \ (2)
\] where \(n_{A}\) indicates the group size.
To build the hierarchy, the right-hand side of Equation (2) is used.
When it comes to re-compute the increase of SS of the split for the permutation test, the left-hand side of Equation (1) is useful.
Toy experiment: top-down computations of node heights
It’s easy to compute \(h(L,R)\), \(h(L)\) and \(h(R)\) (the quantity one needs in DESPOTA) in a top down approach. All it takes is to compute the max distance overall, and the max distances in the left and right-hand side groups (\(L\) and \(R\)).
The ward-linkage case is less obvious, but it can be done, we show how in the next toy example.
Generate data and compute the pairwise Euclidean distances
For the Ward linkage, the trick is to compute the quantity in the left-hand side of the MISSQ formula for the top node, \(MISSQ_{L\cup R}\), using the distances as in formula 2 and do the same going down one level to obtain \(MISSQ_{L}\) and \(MISSQ_{R}\)
Here follow the computations by hand, just as an illustrative example
The function node_heights_compute takes as input the distance matrix, the labels for the L/R split, and for the lower level split (l/r for the R set and for the L set, respectively).
We consider the Palmer Penguins dataset, we compute the dendrogram for the ward.D2 linkage (the real Ward… ref murtagh and legendre).
For the first split, we compute the \(rc_k\) test statistic: the \(F\) node is the root node of the dendrogram, the left and right nodes are the third and second nodes of the tree, respectively.
The heights for the top node (father), and for its children nodes can also be otained from the original pairwise distances, provided that the sets of children nodes originating from the L and R split, are respectively identified.
The \(L\) and \(R\) set are obtained via a 2-clusters cut of the dendrogram.
Similarly the sets for \(L_l\), \(L_r\), \(R_l\) and \(R_r\) are obtained via a 4-clusters cut of the dendrogram. We manually re-alligned the clusters so that each subsets is correctly assocciated to its label.
Given the sets, and the original distance matrix, the function node_heights_compute.r to compute the \(F\), \(L\) and \(R\) node heights (the values of the linkage functions computed on the corresponding sets).
The continuous features are linearly combined via principal component Analysis. The first 20 components are considered that replace the starting continuous variables
As an alternative to the continuous-only dimension reduction step, a factor analysis for mixed data is applied on the full set of data. We use the PCAmixdata package, and keep the first 25 components, 20 for the continuous features (as in the PCA-based case), the extra 5 components are for the 5 categorical variables.
# NOTE: before running this chunk be sure to have properly installed reticulate and python in your PC. # It is also important to point to the right active python root via the following command:# Sys.setenv(RETICULATE_PYTHON = "/usr/local/bin/python3")GUDMM <-GUDMM_Run(genom_red_nofct, no_f_cont =20, no_f_ord =5)dist_Gud <-as.dist(GUDMM$dist_mat)names(dist_Gud) <-paste0("id_",1:nrow(dist_Gud))dist_Gow <-daisy(genom_red, metric ="gower")names(dist_Gow) <-paste0("id_",1:nrow(dist_Gow))genom_continuos_reco <-recipe(data=genom_red,formula =~.) |>step_dummy(all_nominal(),one_hot = T) |>prep() |>bake(new_data=NULL)dist_Euc <-daisy(genom_continuos_reco,metric="euclidean")names(dist_Euc) <-paste0("id_",1:nrow(dist_Euc))genom_continuos_pcamix <- genom_mix |> dplyr::select(1:25) dist_EucMix <-daisy(genom_continuos_pcamix,metric="euclidean")names(dist_EucMix) <-paste0("id_",1:nrow(dist_EucMix))
Run the simulations for analysis
# tidymodels_prefer()despota_str_k1 <-expand_grid(distances =list(list(dist_Euc,"euclidean"),list(dist_EucMix,"PCAmix"),list(dist_Gow,"gower"),list(dist_Gud,"gudmm")),aggMethod =c("ward.D2"),M =499, alpha =c(seq(0.01,0.20,by =0.005),seq(0.3,0.5,by =0.1)) ) |>mutate(seed =sample(1000:9999, n(), replace =FALSE),id =1:n(),hclust =pmap(list(a = distances, b = aggMethod, e = id), .f =function(a,b,e){# cat("HCLUST loop",e,"of",n(),"\n") OUT <-hclust(d = a[[1]], method = b) OUT }),despota =pmap(list(a = distances, b = aggMethod, c = hclust, mm = M,d = alpha, e = id), .f =function(a,b,c,mm,d,e){cat("DESPOTA EVO loop ",e," of ",n()," (",a[[2]],", alpha = ",d,") \n", sep ="") start <-Sys.time() mod <-despota_evo(distance_matrix=a[[1]],hclust_solution=c,linkage_method=b,top_down =TRUE, M=mm, alpha=d) OUT =list("model"= mod,"labels"= mod |>unnest(cl_members) |>select(cluster) )cat(" ",format(Sys.time() - start)," --------------------- \n") OUT }),dist =map_chr(distances, ~ .x[[2]]) )despota_str_k1_sel <- despota_str_k1 |>mutate(desp_nclus =map_dbl(despota, ~length(table(.x$labels))) ) |>group_by(dist,aggMethod,M,desp_nclus) |>slice(which.min(alpha)) |>ungroup() |>filter(desp_nclus >1) |>select(-desp_nclus)dim(despota_str_k1_sel);dim(despota_str_k1)
---title: "Automatic dendrogram slicing for mixed-type data clustering"subtitle: "supplementary material"format: htmlcode-fold: truecode-tools: trueembed-resources: truepage-layout: fulltoc: truetoc-title: contentstoc-location: lefttoc-expand: 2theme: mintyknitr: opts_chunk: collapse: true R.options: message: false warning: false code-fold: false---## Initializationlibraries and functions needed```{r}#| code-fold: truelibrary(cluster)library(data.table)library(distances)library(dynamicTreeCut)library(factoextra)library(FactoMineR)library(fastcluster)library(fastDummies)library(fpc)library(ggdendro)library(ggrepel)library(ggsurvfit)library(gower)library(graphics)library(kableExtra)library(janitor)library(kableExtra)library(Matrix)library(mclust)library(NbClust)library(palmerpenguins)library(PCAmixdata)library(rcompanion)library(reticulate)library(survival)library(tidymodels)library(tidyverse)source("R/linkage_maker.R")source("R/my_hclust.R")source("R/single_div_step.R")source("R/node_heights_compute.R")source("R/is_terminal_node.R")source("R/despota_evo.R")source("R/mdist.R")source("despota_GUDMM/GUDMM_Run.R")source("despota_GUDMM/public_data.R")tidymodels_prefer()genom <-readRDS(file ="main_results/OutputCV1_PCAmerged.RDS")genom_mix <-readRDS(file ="main_results/OutputCV1_PCAmixmerged.RDS")```## 1. On the equivalence of bottom-up/top-down the hierarchy levelsThe Ward.D2 (and, Ward.D) aggregation methods that can be selected in the standard **hclust** function are based on the Lance-Williams formula (that generalizes at any linkage function).To keep it simple, and to avoid computing the full hierarchy for each permutation, we use the **Minimum Increase of Sum of Squares (MISSQ)** computed at each split$$MISSQ:\frac{n_{L} \cdot n_{R}}{n_{L}+n_{R}}\|\mu_{L}-\mu_{R}\|^{2} = \sum_{{\bf x}_{i}\in (L\cup R)}\|{\bf x}_{i}-\mu_{(L\cup R)} \|^{2} - \sum_{{\bf x}_{i}\in (L)}\|{\bf x}_{i}-\mu_{(L)} \|^{2} - \sum_{{\bf x}_{i}\in (R)}\|{\bf x}_{i}-\mu_{(R)} \|^{2}$$that is, the increase in the $SS$ due to the merge. The quantities in Equation (1) can also be computed via pairwise distances, since, for the general group A, the following equation holds$$\sum_{x_{i}\in A}\|x_{i}-\mu_{A} \|^{2} = \frac{1}{n_{A}}\sum_{i=1}^{n_{A}}\sum_{j=1}^{n_{A}}\|x_{i}-x_{j} \|^{2} \ \ \ \ (2)$$where $n_{A}$ indicates the group size.- To build the hierarchy, the right-hand side of Equation (2) is used.- When it comes to re-compute the increase of SS of the split for the permutation test, the left-hand side of Equation (1) is useful.### Toy experiment: top-down computations of node heights It's easy to compute $h(L,R)$, $h(L)$ and $h(R)$ (the quantity one needs in DESPOTA) in a top down approach. All it takes is to compute the max distance overall, and the max distances in the left and right-hand side groups ($L$ and $R$).The ward-linkage case is less obvious, but it can be done, we show how in the next toy example.- Generate data and compute the pairwise Euclidean distances```{r}set.seed(123)toy_data=tibble(x1=round(runif(1:10,min=2,max=15)),x2=round(runif(1:10,min=1,max=4)) )toy_dist=dist(toy_data)```- perform standard **hclust** on the toy data with: - ward linkage - complete linkage - single linkage```{r}toy_hc_w =hclust(toy_dist,method="ward.D")toy_hc_sing =hclust(toy_dist,method ="single")toy_hc_comp =hclust(toy_dist,method ="complete")```- take the data off of the hclust object that are needed for the dendrogram ```{r}# wardo_toy_hc_w = toy_hc_wtoy_hc_w =as.dendrogram(toy_hc_w)toy_hc_w_ddata <-dendro_data(toy_hc_w, type ="rectangle")# singleo_toy_hc_sing = toy_hc_singtoy_hc_sing =as.dendrogram(toy_hc_sing)toy_hc_sing_ddata <-dendro_data(toy_hc_sing, type ="rectangle")# completeo_toy_hc_comp = toy_hc_comptoy_hc_comp =as.dendrogram(toy_hc_comp)toy_hc_comp_ddata <-dendro_data(toy_hc_comp, type ="rectangle")```- select the nodes of interest and highlight them in the dendrogram```{r}hc_point_data_w = toy_hc_w_ddata$segments |>filter(y==max(y))custom_hc_plot_w = toy_hc_w_ddata$segments |>ggplot() +geom_segment(aes(x = x, y = y, xend = xend, yend = yend)) +geom_segment(data=hc_point_data_w, aes(x = x, y = y, xend = xend, yend = yend),alpha=.5,color="indianred",linewidth=4,inherit.aes =FALSE) +geom_point(data=hc_point_data_w, aes(x = x, y = yend),color="dodgerblue",size=4,inherit.aes =FALSE)+theme_bw()+ggtitle("ward")ggsave(file="MISSQ_comp.pdf",custom_hc_plot_w,height=7, width =7)hc_point_data_sing = toy_hc_sing_ddata$segments |>filter(y==max(y))custom_hc_plot_sing = toy_hc_sing_ddata$segments |>ggplot() +geom_segment(aes(x = x, y = y, xend = xend, yend = yend)) +geom_segment(data=hc_point_data_sing, aes(x = x, y = y, xend = xend, yend = yend),alpha=.5,color="indianred",linewidth=4,inherit.aes =FALSE) +geom_point(data=hc_point_data_sing, aes(x = x, y = yend),color="dodgerblue",size=4,inherit.aes =FALSE)+theme_bw()+ggtitle("single")hc_point_data_comp = toy_hc_comp_ddata$segments |>filter(y==max(y))custom_hc_plot_comp = toy_hc_comp_ddata$segments |>ggplot() +geom_segment(aes(x = x, y = y, xend = xend, yend = yend)) +geom_segment(data=hc_point_data_comp, aes(x = x, y = y, xend = xend, yend = yend),alpha=.5,color="indianred",linewidth=4,inherit.aes =FALSE) +geom_point(data=hc_point_data_comp, aes(x = x, y = yend),color="dodgerblue",size=4,inherit.aes =FALSE)+theme_bw()+ggtitle("complete")```::::{.columns}:::{.column width="45%"}```{r}library(patchwork)custom_hc_plot_w |custom_hc_plot_sing |custom_hc_plot_comp```::::::{.column width="45%"}```{r}par(mfrow=c(1,3))plot(toy_hc_w);title("ward linkage")plot(toy_hc_sing);title("single linkage")plot(toy_hc_comp);title("complete linkage")```:::::::For the Ward linkage, the trick is to compute the quantity in the left-hand side of the MISSQ formula for the top node,$MISSQ_{L\cup R}$, using the distances as in formula 2 and do the same going down one level to obtain $MISSQ_{L}$ and $MISSQ_{R}$Here follow the computations by hand, just as an illustrative example```{r}L_set =c(5,8,2,4)L_l_set =c(5,8)L_r_set =c(2,4)R_set =c(6,7,9,10,1,3)R_l_set =c(6)R_r_set =c(7,9,10,1,3)within_L_R = ((1/length(L_set))*sum(as.matrix(toy_dist)[L_set, L_set])+(1/length(R_set))*sum(as.matrix(toy_dist)[R_set,R_set]))top_node = (1/nrow(toy_data))*sum(as.matrix(toy_dist))-within_L_Rwithin_L_lr = ((1/length(L_l_set))*sum(as.matrix(toy_dist)[L_l_set,L_l_set])+(1/length(L_r_set))*sum(as.matrix(toy_dist)[L_r_set,L_r_set]))L_node = (1/length(L_set))*sum(as.matrix(toy_dist)[L_set,L_set]) - within_L_lrwithin_R_lr = (2*sum(as.matrix(toy_dist)[R_l_set,R_l_set])+(1/length(R_r_set))*sum(as.matrix(toy_dist)[R_r_set,R_r_set]))R_node = (1/length(R_set))*sum(as.matrix(toy_dist)[R_set,R_set]) - within_R_lrtibble(node_name =c("top_node", "left_node","right_node"), node_height=c(top_node, L_node, R_node), hclust_nodes=o_toy_hc_w$height[c(9,6,8)]) |>kbl()```### Top-down computations: penguins datasetThe function `node_heights_compute` takes as input the distance matrix, the labels for the L/R split, and for the lower level split (l/r for the R set and for the L set, respectively).We consider the [Palmer Penguins](https://allisonhorst.github.io/palmerpenguins/) dataset, we compute the dendrogram for the `ward.D2` linkage (the real Ward... ref murtagh and legendre).```{r}pengs = palmerpenguins::penguins |>na.omit()|> dplyr::select(where(is.numeric),-year) |>mutate(across(.cols=everything(),~scale(.x)))id_pengs=paste0("id_",1:nrow(pengs))pengs_dist =dist(pengs) attr(pengs_dist,"Labels")=id_pengshclust_solution =hclust(pengs_dist,method ="ward.D2")plot(hclust_solution,cex=0.5)rect.hclust(hclust_solution,k=4)```- For the first split, we compute the $rc_k$ test statistic: the $F$ node is the root node of the dendrogram, the left and right nodes are the third and second nodes of the tree, respectively.```{r}top_nd=sort(hclust_solution$height,decreasing=T)[c(1)]L_nd=sort(hclust_solution$height,decreasing=T)[c(3)]R_nd=sort(hclust_solution$height,decreasing=T)[c(2)]observed_rck =abs(L_nd-R_nd)/(top_nd-min(L_nd,R_nd))observed_rck```The heights for the top node (father), and for its children nodes can also be otained from the original pairwise distances, provided that the sets of children nodes originating from the L and R split, are respectively identified.- The $L$ and $R$ set are obtained via a 2-clusters cut of the dendrogram. ```{r}hclust_clusters =cutree(hclust_solution,k=2)LR_set = hclust_clustersL_set=names(LR_set[which(LR_set==2)])R_set=names(LR_set[which(LR_set==1)])```- Similarly the sets for $L_l$, $L_r$, $R_l$ and $R_r$ are obtained via a 4-clusters cut of the dendrogram. We manually re-alligned the clusters so that each subsets is correctly assocciated to its label.```{r}hclust_clusters_sub =cutree(hclust_solution,k=4)lr_set = hclust_clusters_subL_l_set =names(lr_set[which(lr_set==2)])L_r_set =names(lr_set[which(lr_set==3)])R_l_set =names(lr_set[which(lr_set==4)])R_r_set =names(lr_set[which(lr_set==1)])```- Given the sets, and the original distance matrix, the function `node_heights_compute.r` to compute the $F$, $L$ and $R$ node heights (the values of the linkage functions computed on the corresponding sets).```{r}out=node_heights_compute(distance_matrix = pengs_dist, method="ward.D2",L_set = L_set,R_set = R_set,L_l_set = L_l_set,L_r_set = L_r_set,R_l_set = R_l_set,R_r_set = R_r_set)```In fact, the node heights computed above correspond to those returned from `hclust`.```{r}out |>mutate(hclust_height =sort(hclust_solution$height,decreasing=T)[c(1,3,2)] ) ```<!-- ### Permutation-test: distribution under the null hypothesis --><!-- Let $F_{k}$ be the full set of observations, and let $L_{k}$ and $R_{k}$ result from the first split of the dendrogram. --><!-- Create the set of permutations to get $L_{k}^{(m)}$ and $R_{k}^{(m)}$ -->```{r, eval =FALSE, echo=FALSE}set.seed(1234)M=5000source("R/single_step_iPDDP.R")hclust_clusters = LR_set M=10perm_id_list =tibble(permutations_id =1:M |>map(\(i) sample(id_pengs,replace =FALSE,size=length(id_pengs))))perm_structure = perm_id_list |>mutate(permutations =map(.x=permutations_id, ~set_names(hclust_clusters,.x)),distance_matrix =map(.x=permutations_id,.f =~as.matrix(pengs_dist)),Lk_s =map(.x = permutations,~which(.x==1) |>names()),Rk_s =map(.x = permutations,~which(.x==2) |>names()),Lk_distance_matrix=map2(.x=Lk_s, .y = distance_matrix, ~.y[.x,.x]),Rk_distance_matrix=map2(.x=Rk_s, .y = distance_matrix, ~.y[.x,.x]),split_hclu_Lk =map(.x=Lk_distance_matrix, ~.x |>as.dist() |>hclust(method="ward.D2") |>cutree(k=2)),split_hclu_Rk =map(.x=Rk_distance_matrix, ~.x |>as.dist() |>hclust(method="ward.D2") |>cutree(k=2)),split_td_Lk =map(.x=Lk_distance_matrix,~.x |>single_step_iPDDP()),split_td_Rk =map(.x=Rk_distance_matrix,~.x |>single_step_iPDDP()),Lk_l_s =map(.x = split_hclu_Lk, ~which(.x==1) |>names()),Lk_r_s =map(.x = split_hclu_Lk, ~which(.x==2) |>names()),Rk_l_s =map(.x = split_hclu_Rk, ~which(.x==1) |>names()),Rk_r_s =map(.x = split_hclu_Rk, ~which(.x==2) |>names()),Lk_l_s_td =map(.x = split_td_Lk, ~which(.x==1) |>names()),Lk_r_s_td =map(.x = split_td_Lk, ~which(.x==2) |>names()),Rk_l_s_td =map(.x = split_td_Rk, ~which(.x==1) |>names()),Rk_r_s_td =map(.x = split_td_Rk, ~which(.x==2) |>names()),linkage_fun="ward.D2")``````{r, eval =FALSE, echo=FALSE}# # perm_structure=perm_structure |> # mutate(split_pam_Lk = map(.x=Lk_distance_matrix,~.x |> as.dist()|> pam(k = 2,nstart = 20)),# split_pam_Lk = map(.x=split_pam_Lk,~.x$clustering),# split_pam_Rk = map(.x=Rk_distance_matrix,~.x |> as.dist() |> pam(k = 2,nstart = 20)),# split_pam_Rk = map(.x=split_pam_Rk,~.x$clustering))# # perm_structure=perm_structure |> mutate(# Lk_l_s_pam = map(.x = split_pam_Lk, ~which(.x==1) |> names()),# Lk_r_s_pam = map(.x = split_pam_Lk, ~which(.x==2) |> names()),# Rk_l_s_pam = map(.x = split_pam_Rk, ~which(.x==1) |> names()),# Rk_r_s_pam = map(.x = split_pam_Rk, ~which(.x==2) |> names())# )# # # rc_ks = perm_structure |> # mutate(# node_heights = pmap(list(distance_matrix, Lk_s, Rk_s,Lk_l_s, Lk_r_s,Rk_l_s, Rk_r_s,linkage_fun,permutations),# node_heights_compute),# node_heights = map(.x=node_heights, ~.x |> pivot_wider(names_from=node,values_from = height)),# rc_k = map_dbl(.x=node_heights,~abs(.x$L_node-.x$R_node)/(.x$top_node-min(.x$L_node,.x$R_node))),# ## node_heights_td = pmap(list(distance_matrix, Lk_s, Rk_s, Lk_l_s_td, Lk_r_s_td,Rk_l_s_td, Rk_r_s_td,linkage_fun,permutations),# node_heights_compute),# node_heights_td = map(.x=node_heights_td, ~.x |> pivot_wider(names_from=node,values_from = height)),# rc_k_td = map_dbl(.x=node_heights_td,~abs(.x$L_node-.x$R_node)/(.x$top_node-min(.x$L_node,.x$R_node))),# node_heights_pam = pmap(list(distance_matrix, Lk_s, Rk_s,Lk_l_s_pam, Lk_r_s_pam,Rk_l_s_pam, Rk_r_s_pam,linkage_fun,permutations),# node_heights_compute),# node_heights_pam = map(.x=node_heights_pam, ~.x |> pivot_wider(names_from=node,values_from = height)),# rc_k_pam = map_dbl(.x=node_heights_pam,~abs(.x$L_node-.x$R_node)/(.x$top_node-min(.x$L_node,.x$R_node)))# ) |> dplyr::select(starts_with("rc_k"))``````{r eval=FALSE,echo=FALSE}# saveRDS(file="null_distributions_data.rds",rc_ks)``````{r, echo=FALSE}# rc_ks=readRDS(file = "null_distributions_data.rds")``````{r, eval =FALSE, echo=FALSE}# rc_ks=readRDS(file = "null_distributions_data.rds")# null_distrib = rc_ks |> # pivot_longer(cols=1:3,names_to="clustering", values_to = "rc_stat" ) |># filter(clustering %in% c("rc_k","rc_k_td")) |># mutate(clustering=ifelse(clustering=="rc_k","bottom_up",# ifelse(clustering=="rc_k_td","top_down","pam"))) |> # ggplot(aes(x=rc_stat,y=..density..,fill=clustering))+# geom_density(aes(linetype=clustering),alpha=.1)+# xlab(expression(paste(rc[k]," under ", H[0])))+theme_minimal()# # null_distrib``````{r, echo=FALSE}# ggsave(file="rck_null_distr.pdf",null_distrib,height = 4, width = 6)```## 2. ApplicationThe same data preparation and preprocessing pipeline as in Ellen et al. is implemented at [this link](https://www.github.com/AstraZeneca/Multimodal_NSCLC).The resulting dataset has been saved in `.RDS` format. Here follows the dimension reduction step. ### Data re-coding and taming```{r eval=FALSE} genom <-readRDS("data/OutputCV1_train_FULL.RDS")genom0 <- genom |>mutate(clin_gender = gender,clin_pack_years_smoked =ifelse(is.na(pack_years_smoked), median(pack_years_smoked,na.rm =TRUE),pack_years_smoked),clin_prior_malignancy = prior_malignancy,clin_volume =ifelse(is.na(volume), median(volume,na.rm =TRUE),volume),clin_censor_time1 = censor_time1,clin_ajcc_pathologic_stage =ifelse(is.na(ajcc_pathologic_stage), 4, ajcc_pathologic_stage),clin_vital_status = vital_status,clin_volume_low =ifelse(clin_volume <=0.3200, 1, 2), # if volume <= medianclin_pack_years_smoked_low =ifelse(clin_pack_years_smoked <=30, 1, 2) # if packs smoked <= Q1 clin_disease =as.factor(disease),clin_gender =as.factor(clin_gender),clin_prior_malignancy =as.factor(clin_prior_malignancy),clin_volume_low =as.factor(clin_volume_low),clin_pack_years_smoked_low =as.factor(clin_pack_years_smoked_low),clin_ajcc_pathologic_stage =as.factor(clin_ajcc_pathologic_stage),clin_vital_status =as.factor(clin_vital_status) ) |> dplyr::select(-gender,-pack_years_smoked,-prior_malignancy,-volume,-disease,-clin_pack_years_smoked,-clin_volume,-ajcc_pathologic_stage,-censor_time1,-vital_status )clin0 <- genom0 |> dplyr::select(starts_with("clin_"))```### Dimension reduction: principal component analysisThe continuous features are linearly combined via principal component Analysis. The first 20 components are considered that replace the starting continuous variables ```{r eval=FALSE} only_genom <- genom0 |> dplyr::select(-barcode,-starts_with("clin_"))genomPCA <-PCA(only_genom,ncp =20)genomPCA_full <-cbind(genomPCA$ind$coord,clin0)genom_full <- genom0saveRDS(genom_full,file ="/OutputCV1_merged.RDS")saveRDS(genomPCA_full,file ="/OutputCV1_PCAmerged.RDS")```As an alternative to the continuous-only dimension reduction step, a factor analysis for mixed data is applied on the full set of data. We use the `PCAmixdata` package, and keep the first 25 components, 20 for the continuous features (as in the PCA-based case), the extra 5 components are for the 5 categorical variables.```{r eval=FALSE} X <- genom0 |> dplyr::select(-barcode,-clin_vital_status,-clin_censor_time1,-clin_disease) |>splitmix()X1 <- X$X.quantiX2 <- X$X.qualipcamix_mod <-PCAmix(X.quanti=X1, X.quali=X2,rename.level=TRUE,ndim =40,graph=FALSE)genomPCAmix_full <-cbind(pcamix_mod$ind$coord[,1:30],clin0) ``````{r eval=FALSE,echo=FALSE} saveRDS(genomPCAmix_full,file ="data/OutputCV1_PCAmixmerged.RDS")``````{r eval=FALSE,echo=FALSE} readRDS(genomPCAmix_full,file ="data/OutputCV1_PCAmixmerged.RDS")```### Select only the continuous variables (PCs)```{r genom_reduced, eval=TRUE, echo=TRUE}genom_red <- genom |> dplyr::select(Dim.1:Dim.20, clin_gender,clin_prior_malignancy,clin_volume_low,clin_pack_years_smoked_low,clin_ajcc_pathologic_stage) |>as_tibble()```### Full dataset with no factors, to be used to compute the GUDMM distance```{r genom_reduced_nofct, eval=TRUE, echo=TRUE}genom_red_nofct <- genom |>mutate(across(starts_with("clin_"), ~as.numeric(.x) ) ) |> dplyr::select(Dim.1:Dim.20, clin_gender,clin_prior_malignancy,clin_volume_low,clin_pack_years_smoked_low,clin_ajcc_pathologic_stage) |>as_tibble()```## Distance computations```{r dist_comp, warning=FALSE, eval=FALSE, echo=TRUE}# NOTE: before running this chunk be sure to have properly installed reticulate and python in your PC. # It is also important to point to the right active python root via the following command:# Sys.setenv(RETICULATE_PYTHON = "/usr/local/bin/python3")GUDMM <-GUDMM_Run(genom_red_nofct, no_f_cont =20, no_f_ord =5)dist_Gud <-as.dist(GUDMM$dist_mat)names(dist_Gud) <-paste0("id_",1:nrow(dist_Gud))dist_Gow <-daisy(genom_red, metric ="gower")names(dist_Gow) <-paste0("id_",1:nrow(dist_Gow))genom_continuos_reco <-recipe(data=genom_red,formula =~.) |>step_dummy(all_nominal(),one_hot = T) |>prep() |>bake(new_data=NULL)dist_Euc <-daisy(genom_continuos_reco,metric="euclidean")names(dist_Euc) <-paste0("id_",1:nrow(dist_Euc))genom_continuos_pcamix <- genom_mix |> dplyr::select(1:25) dist_EucMix <-daisy(genom_continuos_pcamix,metric="euclidean")names(dist_EucMix) <-paste0("id_",1:nrow(dist_EucMix))``````{r save1, eval=FALSE, echo=FALSE}saveRDS(dist_Gud,"main_results/dist_Gud.RDS")saveRDS(dist_Gow,"main_results/dist_Gow.RDS")saveRDS(dist_Euc,"main_results/dist_Euc.RDS")saveRDS(dist_EucMix,"main_results/dist_EucMix.RDS")```## Run the simulations for analysis```{r read1, eval=TRUE, echo=FALSE}dist_Gud <-readRDS("main_results/dist_Gud.RDS")dist_Gow <-readRDS("main_results/dist_Gow.RDS")dist_Euc <-readRDS("main_results/dist_Euc.RDS")dist_EucMix <-readRDS("main_results/dist_EucMix.RDS")``````{r sim2, eval=FALSE}# tidymodels_prefer()despota_str_k1 <-expand_grid(distances =list(list(dist_Euc,"euclidean"),list(dist_EucMix,"PCAmix"),list(dist_Gow,"gower"),list(dist_Gud,"gudmm")),aggMethod =c("ward.D2"),M =499, alpha =c(seq(0.01,0.20,by =0.005),seq(0.3,0.5,by =0.1)) ) |>mutate(seed =sample(1000:9999, n(), replace =FALSE),id =1:n(),hclust =pmap(list(a = distances, b = aggMethod, e = id), .f =function(a,b,e){# cat("HCLUST loop",e,"of",n(),"\n") OUT <-hclust(d = a[[1]], method = b) OUT }),despota =pmap(list(a = distances, b = aggMethod, c = hclust, mm = M,d = alpha, e = id), .f =function(a,b,c,mm,d,e){cat("DESPOTA EVO loop ",e," of ",n()," (",a[[2]],", alpha = ",d,") \n", sep ="") start <-Sys.time() mod <-despota_evo(distance_matrix=a[[1]],hclust_solution=c,linkage_method=b,top_down =TRUE, M=mm, alpha=d) OUT =list("model"= mod,"labels"= mod |>unnest(cl_members) |>select(cluster) )cat(" ",format(Sys.time() - start)," --------------------- \n") OUT }),dist =map_chr(distances, ~ .x[[2]]) )despota_str_k1_sel <- despota_str_k1 |>mutate(desp_nclus =map_dbl(despota, ~length(table(.x$labels))) ) |>group_by(dist,aggMethod,M,desp_nclus) |>slice(which.min(alpha)) |>ungroup() |>filter(desp_nclus >1) |>select(-desp_nclus)dim(despota_str_k1_sel);dim(despota_str_k1)``````{r save2, eval=FALSE, echo=FALSE}saveRDS(despota_str_k1,"main_results/OUT_all_Sim_paper.Rdata")saveRDS(despota_str_k1_sel,"main_results/OUT_k_Sim_paper.Rdata")```## Comparisons### Simulation scheme for the paper```{r read2, eval=TRUE, echo=FALSE}despota_str_kappa1 <-readRDS("main_results/OUT_k_Sim_paper.Rdata")``````{r comparisons2_a, eval=FALSE}despota_eval1 <- despota_str_kappa1 |>mutate(despota_lab =map(despota, ~ .x$labels |>pull(cluster)),despota_k =map_dbl(despota_lab,~length(table(.x))),hier_lab2 =map2(hclust, despota_k, ~cutree(.x, k = .y)),# ---------------------------------------------CST_d =map2(distances,despota_lab, ~cluster.stats(.x[[1]],.y,)),CST_h2 =map2(distances,hier_lab2, ~cluster.stats(.x[[1]],.y,)),ASW_d =map_dbl(CST_d, ~ .x$avg.silwidth),ASW_h2 =map_dbl(CST_h2, ~ .x$avg.silwidth)) |>select(-distances, -despota, -hclust, -CST_d, -CST_h2)``````{r save3, eval=FALSE, echo=FALSE}saveRDS(despota_eval1,"main_results/OUT_k_FitMeaures_paper.Rdata")``````{r comparisons2_b, eval=TRUE, echo=FALSE}despota_eval_all1 <- despota_str_kappa1 |>mutate(despota_lab =map(despota, ~ .x$labels |>pull(cluster)),despota_k =map_dbl(despota_lab,~length(table(.x))) ) |>filter(despota_k <=30) |>mutate(hier_lab2 =map2(hclust, despota_k, ~cutree(.x, k = .y)),# ---------------------------------------------CST_d =map2(distances,despota_lab, ~cluster.stats(.x[[1]],.y,)),CST_h2 =map2(distances,hier_lab2, ~cluster.stats(.x[[1]],.y,)),ASW_d =map_dbl(CST_d, ~ .x$avg.silwidth),ASW_h2 =map_dbl(CST_h2, ~ .x$avg.silwidth)) |>select(-distances, -despota, -hclust, -CST_d, -CST_h2) |>group_by(dist,aggMethod,M,despota_k,ASW_d,ASW_h2) |>slice(which.min(alpha)) |>ungroup()``````{r save4, eval=FALSE, echo=FALSE}saveRDS(despota_eval_all1,"main_results/OUT_gof_Sim_paper.Rdata")``````{r read4, eval=TRUE, echo=FALSE}despota_eval_all1 <-readRDS("main_results/OUT_gof_Sim_paper.Rdata")``````{r plotgof3}mat_sil <- despota_eval_all1 %>%select(ASW_d,ASW_h2,dist,aggMethod,M,alpha,despota_k) %>%pivot_longer(cols =c(ASW_d,ASW_h2),names_to ="fit_type",values_to ="fit_vals") %>%mutate(method ="Avg. Silhouette",fit_type =recode(fit_type,"ASW_d"="DESPOTA","ASW_h2"="HClust") )plotFIN2 <- mat_sil |>mutate(method =recode_factor(method,"Avg. Silhouette"="Avg. Silhouette","Calinski-Harabasz"="Calinski-Harabasz","DUNN2"="Dunn","DUNN"="DUNN"),dist =recode_factor(dist,"euclidean"="Euclidean","gower"="Gower-based","PCAmix"="FAMD-based","gudmm"="Entropy-based",.ordered =TRUE),fit_type =recode_factor(fit_type,"DESPOTA"="DESPOTA","HClust"="HCl") ) |>filter(despota_k >=2& method %in%c("Avg. Silhouette")) |>#aggMethod == "ward.D") |> # & ggplot(aes(x = despota_k, y = fit_vals, colour = fit_type, group = fit_type, linetype = fit_type)) +geom_point() +geom_line() +facet_grid(. ~ dist) +xlab("k") +ylab("Avg. Silhouette") +theme_bw() +scale_x_continuous(breaks =seq(0, 30, by =5)) +scale_colour_manual(values =c("#F75836","#3639F7")) +theme(legend.position ="bottom", legend.title =element_blank())plotFIN2# ggsave("gof_sil.pdf",# width = 6, height = 4.5)``````{r read5, eval=TRUE, echo=FALSE}surv1 <-readRDS("R/data/OUT_k_FitMeaures_paper.Rdata")``````{r prendisurv}DFsurv1 <- surv1 |>select(id,dist,aggMethod,despota_k,despota_lab,hier_lab2) |>mutate(survival =rep(list(genom$clin_censor_time1),nrow(surv1)),deaths =rep(list(genom$clin_vital_status),nrow(surv1)),deaths =map(deaths, ~recode(.x,"Dead"=1, "Alive"=0 )),disease =rep(list(genom$clin_disease),nrow(surv1)),memb_d = despota_lab,memb_h = surv1$hier_lab2,surv_d =pmap(list(sur = survival, dea = deaths, mem = memb_d), function(sur, dea, mem){ Surv(sur,dea) ~ mem}),surv_h =pmap(list(sur = survival, dea = deaths, mem = memb_h), function(sur, dea, mem){ Surv(sur,dea) ~ mem}),sdiff_d =map(surv_d, ~survdiff(as.vector(.x))),sdiff_h =map(surv_h, ~survdiff(as.vector(.x))),pval_d =map_dbl(sdiff_d, ~ .x$pvalue),pval_h =map_dbl(sdiff_h, ~ .x$pvalue))# Entropy-based with 4 clustersplotsur1 <- DFsurv1 |>filter(dist =="gudmm"& aggMethod =="ward.D2"& despota_k ==4) |>mutate(surv_d2 =pmap(list(sur = survival, dea = deaths, mem = memb_d, dis = disease), function(sur, dea, mem, dis){ Surv(sur,dea) ~ mem + dis}),surv_h2 =pmap(list(sur = survival, dea = deaths, mem = memb_h, dis = disease), function(sur, dea, mem, dis){ Surv(sur,dea) ~ mem + dis}),fit_d =map(surv_d2, ~survfit(.x)),fit_h =map(surv_h2, ~survfit(.x)),gpd =map(surv_d, ~survfit(.x) |>ggsurvfit() +theme(legend.position ="none") ),gph =map(surv_h, ~survfit(.x) |>ggsurvfit() +theme(legend.position ="none") ) ) |>select(gpd,gph,despota_lab,hier_lab2,surv_d,surv_h,sdiff_d,sdiff_h,fit_d,fit_h) table(plotsur1$despota_lab)table(plotsur1$hier_lab2)print(survfit(plotsur1$surv_d[[1]]), rmean=730, scale=365)print(survfit(plotsur1$surv_h[[1]]), rmean=730, scale=365)print(plotsur1$fit_d[[1]], rmean=730)print(plotsur1$fit_h[[1]], rmean=730)summary(plotsur1$fit_d[[1]],time=(0:4)*182.5, scale=365)summary(plotsur1$fit_h[[1]],time=(0:4)*182.5, scale=365)# figure Entropy-basedggpubr::ggarrange( plotsur1$gpd[[1]] +ggtitle("DESPOTA"), plotsur1$gph[[1]] +ggtitle("HCl") +ylab(NULL),nrow =1)# ggsave("survival_gudmm.pdf",# width = 8, height = 4)# EUCLIDEAN and 4 clustersplotsur2 <- DFsurv1 |>filter(dist =="euclidean"& aggMethod =="ward.D2"& despota_k ==4) |>mutate(surv_d2 =pmap(list(sur = survival, dea = deaths, mem = memb_d, dis = disease), function(sur, dea, mem, dis){ Surv(sur,dea) ~ mem + dis}),surv_h2 =pmap(list(sur = survival, dea = deaths, mem = memb_h, dis = disease), function(sur, dea, mem, dis){ Surv(sur,dea) ~ mem + dis}),fit_d =map(surv_d2, ~survfit(.x)),fit_h =map(surv_h2, ~survfit(.x)),gpd =map(surv_d, ~survfit(.x) |>ggsurvfit() +theme(legend.position ="none") ),gph =map(surv_h, ~survfit(.x) |>ggsurvfit() +theme(legend.position ="none") ) ) |>select(gpd,gph,despota_lab,hier_lab2,surv_d,surv_h,sdiff_d,sdiff_h,fit_d,fit_h) table(plotsur2$despota_lab[[1]])print(survfit(plotsur2$surv_d[[1]]), rmean=730, scale=365)print(plotsur2$fit_d[[1]], rmean=730)summary(plotsur2$fit_d[[1]],time=(0:4)*182.5, scale=365)# figure Euclideanggpubr::ggarrange( plotsur2$gpd[[1]] +ggtitle("DESPOTA"), plotsur2$gph[[1]] +ggtitle("HCl") +ylab(NULL),nrow =1)# ggsave("survival_euclidean.pdf",# width = 8, height = 4)```## Comparison between Euclidean and Entropy-based```{r finalcomparisons}fin_clust1d <- genom_red |>select(-starts_with("Dim")) |>mutate(d_lab =as_factor(plotsur1$despota_lab[[1]]) )fin_clust1h <- genom_red |>select(-starts_with("Dim")) |>mutate(d_lab =as_factor(plotsur1$hier_lab2[[1]]) )fin_clust2 <- genom_red |>select(-starts_with("Dim")) |>mutate(d_lab =as_factor(plotsur2$despota_lab[[1]]) )adjustedRandIndex(unlist(plotsur1$despota_lab),unlist(plotsur1$hier_lab2))adjustedRandIndex(unlist(plotsur2$despota_lab),unlist(plotsur2$hier_lab2))adjustedRandIndex(unlist(plotsur1$despota_lab),unlist(plotsur2$despota_lab))``````{r v_tests}# V-teststable(fin_clust1d$d_lab)catdes1d <- fin_clust1d |>mutate(gender =recode_factor(clin_gender,"1"="Female","2"="Male"),"light smoker"=recode_factor(clin_pack_years_smoked_low,"1"="True","2"="False"),"prior malignancy"=recode_factor(clin_prior_malignancy,"1"="No","2"="Yes"),"low cancer volume"=recode_factor(clin_volume_low,"1"="True","2"="False"),disease = genom$clin_disease,stage =recode_factor(clin_ajcc_pathologic_stage,"1"="I","2"="I","3"="I","4"="II","5"="II","6"="II","7"="III","8"="III","9"="III","10"="IV" ) ) |>select(d_lab,gender,"light smoker", "prior malignancy" ,"low cancer volume","disease", "stage") |>FactoMineR::catdes(num.var =1, proba=1)table(fin_clust1h$d_lab)catdes1h <- fin_clust1h |>mutate(gender =recode_factor(clin_gender,"1"="Female","2"="Male"),"light smoker"=recode_factor(clin_pack_years_smoked_low,"1"="True","2"="False"),"prior malignancy"=recode_factor(clin_prior_malignancy,"1"="No","2"="Yes"),"low cancer volume"=recode_factor(clin_volume_low,"1"="True","2"="False"),disease = genom$clin_disease,stage =recode_factor(clin_ajcc_pathologic_stage,"1"="I","2"="I","3"="I","4"="II","5"="II","6"="II","7"="III","8"="III","9"="III","10"="IV" ) ) |>select(d_lab,gender,"light smoker", "prior malignancy" ,"low cancer volume","disease", "stage") |>FactoMineR::catdes(num.var =1, proba=1)table(fin_clust2$d_lab)catdes2 <- fin_clust2 |>mutate(gender =recode_factor(clin_gender,"1"="Female","2"="Male"),"light smoker"=recode_factor(clin_pack_years_smoked_low,"1"="True","2"="False"),"prior malignancy"=recode_factor(clin_prior_malignancy,"1"="No","2"="Yes"),"low cancer volume"=recode_factor(clin_volume_low,"1"="True","2"="False"),disease = genom$clin_disease,stage =recode_factor(clin_ajcc_pathologic_stage,"1"="I","2"="I","3"="I","4"="II","5"="II","6"="II","7"="III","8"="III","9"="III","10"="IV" ) ) |>select(d_lab,gender,"light smoker", "prior malignancy" ,"low cancer volume","disease", "stage") |>FactoMineR::catdes(num.var =1, proba=1)# data in grid versiontable(fin_clust1d$d_lab) # new order: 1-3-2-4table(fin_clust1h$d_lab) # new order: 1-2-4-3table(fin_clust2$d_lab) # new order: 2-3-4-1# final heatmaptibble(rbind( catdes1d$category[[1]] |>as.data.frame() |>rownames_to_column() |>arrange(rowname) |>mutate(cluster =1, method ="Entropy-based\n (DESPOTA)") , catdes1d$category[[2]] |>as.data.frame() |>rownames_to_column() |>arrange(rowname) |>mutate(cluster =3, method ="Entropy-based\n (DESPOTA)") , catdes1d$category[[3]] |>as.data.frame() |>rownames_to_column() |>arrange(rowname) |>mutate(cluster =2, method ="Entropy-based\n (DESPOTA)") , catdes1d$category[[4]] |>as.data.frame() |>rownames_to_column() |>arrange(rowname) |>mutate(cluster =4, method ="Entropy-based\n (DESPOTA)") ,# catdes1h$category[[1]] |>as.data.frame() |>rownames_to_column() |>arrange(rowname) |>mutate(cluster =1, method ="Entropy-based\n (Horizontal)") , catdes1h$category[[2]] |>as.data.frame() |>rownames_to_column() |>arrange(rowname) |>mutate(cluster =2, method ="Entropy-based\n (Horizontal)") , catdes1h$category[[3]] |>as.data.frame() |>rownames_to_column() |>arrange(rowname) |>mutate(cluster =4, method ="Entropy-based\n (Horizontal)") , catdes1h$category[[4]] |>as.data.frame() |>rownames_to_column() |>arrange(rowname) |>mutate(cluster =3, method ="Entropy-based\n (Horizontal)") ,# catdes2$category[[1]] |>as.data.frame() |>rownames_to_column() |>arrange(rowname) |>mutate(cluster =2, method ="Euclidean") , catdes2$category[[2]] |>as.data.frame() |>rownames_to_column() |>arrange(rowname) |>mutate(cluster =3, method ="Euclidean"), catdes2$category[[3]] |>as.data.frame() |>rownames_to_column() |>arrange(rowname) |>mutate(cluster =4, method ="Euclidean"), catdes2$category[[4]] |>as.data.frame() |>rownames_to_column() |>arrange(rowname) |>mutate(cluster =1, method ="Euclidean"))) |>select(-"Cla/Mod",-"Mod/Cla",-"Global") |>filter(rowname %in%c("stage=IV","stage=III","stage=II","stage=I","low.cancer.volume=True","prior.malignancy=Yes","light.smoker=True","gender=Male","gender=Female","disease=LUSC","disease=LUAD")) |>mutate(p.value2 =cut(p.value, breaks=c(0, 0.001, 0.01, 0.05, 0.1, 1),labels=c('***', '**', '*', '.', " ")) ) |>ggplot(aes(x = cluster, y = rowname, label = p.value2, fill = v.test)) +geom_tile() +geom_text() +facet_grid(. ~ method) +scale_fill_gradientn(colours =c("#F62D2D","white","#1034A6")) +ylab(NULL) +theme_minimal() +theme(panel.grid.major =element_blank(), panel.grid.minor =element_blank(), strip.text =element_text(face ="bold"))# ggsave("vtest_fig.pdf",# width = 7, height = 4)```