Unsupervised Learning

  • Unsupervised learning involves analyzing data without labeled outcomes to discover hidden patterns or structures.

  • Unlike supervised learning, there are no pre-defined correct answers (no target labels). The algorithm tries to organize or summarize data based on inherent characteristics.

Common unsupervised learning tasks

  • Clustering: Grouping similar observations into clusters (e.g. segmenting customers by behavior).

  • Dimensionality Reduction: Simplifying data by reducing feature count while retaining most information (e.g. PCA for visualization).

  • (Also association rule mining, anomaly detection, etc., though we focus on clustering and PCA here.)

What is unsupervised learning for?

  • Exploratory Analysis: Helps make sense of unlabeled data by finding natural groupings or important features (e.g., grouping species by their measurements).

  • Preprocessing: Techniques like PCA can compress data, remove noise, or alleviate the “curse of dimensionality” for other learning tasks.

  • No Ground Truth: Since we don’t have a target variable, success is measured by meaningful insights or compact representations rather than accuracy against labels. This requires careful validation (e.g. using domain knowledge or internal metrics like cluster cohesion).

Principal Component Analysis (PCA)

Reducing dimensionality with PCA

PCA: intuition

  • High-dimensional data (many features) can be hard to visualize or model. Redundant features and noise may obscure important patterns (the curse of dimensionality).

  • Goal of PCA: Reduce a dataset with many correlated features to a smaller set of new, orthogonal features (principal components) that capture most of the variance.

  • Core Idea: Find the directions in feature space along which the data varies the most (greatest variance). These directions are the principal components (PCs).

Code
library(palmerpenguins)
library(tidyverse)

# 1. Prepare the data
penguins <- drop_na(palmerpenguins::penguins) |>
  select(bill_length_mm, body_mass_g)

penguins_scaled <- scale(penguins)
penguins_scaled_df <- as_tibble(penguins_scaled, .name_repair = "minimal") |>
  rename(bill_length = 1, body_mass = 2)

# 2. PCA on scaled data
pca <- prcomp(penguins_scaled)

# 3. PC1 unit vector (direction)
pc1_vec <- pca$rotation[, 1]

# 4. Project points onto PC1
proj_lengths <- as.matrix(penguins_scaled) %*% pc1_vec
projections <- proj_lengths %*% t(pc1_vec)

# 5. Add projection coordinates to data frame
penguins_scaled_df <- penguins_scaled_df |>
  mutate(x_proj = projections[,1],
         y_proj = projections[,2])

# 6. Plot in standardized space with PC1 and projections
PC1_pl <- ggplot(penguins_scaled_df, aes(x = bill_length, y = body_mass)) +
  geom_point(color = "indianred") +
  labs(
       x = "Standardized Bill Length",
       y = "Standardized Body Mass") +
  coord_fixed() +  # Ensures angles are not distorted
  theme_minimal()

Reducing dimensionality with PCA

Code
PC1_pl +
  geom_segment(aes(xend = x_proj, yend = y_proj), alpha = 0.5) +
  geom_abline(
    intercept = 0,
    slope = pc1_vec[2] / pc1_vec[1],
    color = "dodgerblue", linetype = "dashed"
  ) 

Reducing dimensionality with PCA

Principal components

  • PC1 is the line (through the data mean) that maximizes variance of projected data points.

  • PC2 is the next orthogonal direction maximizing remaining variance, and so on.

  • Each PC is a linear combination of original features (a weighted sum). The weights (loadings) tell how much each original feature contributes to that component.

  • Variance Explained: PCs are ordered by the amount of variance they capture. Typically, a few PCs explain most variance, enabling dimensionality reduction with minimal information loss.
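
The loadings and the variance explained can be read directly off a prcomp fit. A minimal sketch using the pca object from the two-feature example above:

Code
# Loadings: weight of each original (standardized) feature in each PC
pca$rotation

# Share of total variance captured by each PC
pca$sdev^2 / sum(pca$sdev^2)

# Same information via summary()
summary(pca)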

Penguins PCA

Code
penguins_pca <- drop_na(palmerpenguins::penguins) |>
  select(where(is.numeric), -year, species)
library(FactoMineR)

pca_model <- PCA(penguins_pca, scale.unit = TRUE, graph = FALSE, quali.sup = "species")
summary(pca_model)

Call:
PCA(X = penguins_pca, scale.unit = TRUE, quali.sup = "species",  
     graph = FALSE) 


Eigenvalues
                       Dim.1   Dim.2   Dim.3   Dim.4
Variance               2.745   0.778   0.369   0.108
% of var.             68.634  19.453   9.216   2.697
Cumulative % of var.  68.634  88.087  97.303 100.000

Individuals (the 10 first)
                      Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2  
1                 |  1.942 | -1.854  0.376  0.911 |  0.032  0.000  0.000 |
2                 |  1.446 | -1.316  0.190  0.828 | -0.444  0.076  0.094 |
3                 |  1.495 | -1.377  0.207  0.847 | -0.161  0.010  0.012 |
4                 |  2.043 | -1.885  0.389  0.852 | -0.012  0.000  0.000 |
5                 |  2.210 | -1.920  0.403  0.755 |  0.818  0.258  0.137 |
6                 |  1.880 | -1.773  0.344  0.890 | -0.366  0.052  0.038 |
7                 |  1.681 | -0.818  0.073  0.237 |  0.501  0.097  0.089 |
8                 |  1.933 | -1.799  0.354  0.866 | -0.245  0.023  0.016 |
9                 |  2.439 | -1.956  0.419  0.643 |  0.998  0.385  0.167 |
10                |  2.658 | -1.570  0.269  0.349 |  0.578  0.129  0.047 |
                   Dim.3    ctr   cos2  
1                  0.235  0.045  0.015 |
2                  0.027  0.001  0.000 |
3                 -0.190  0.029  0.016 |
4                  0.629  0.322  0.095 |
5                  0.701  0.400  0.101 |
6                 -0.028  0.001  0.000 |
7                  1.335  1.452  0.631 |
8                 -0.627  0.320  0.105 |
9                  1.041  0.882  0.182 |
10                 2.049  3.421  0.594 |

Variables
                     Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
bill_length_mm    |  0.752 20.589  0.565 |  0.529 36.023  0.280 | -0.390 41.280
bill_depth_mm     | -0.661 15.924  0.437 |  0.702 63.389  0.493 |  0.259 18.131
flipper_length_mm |  0.956 33.273  0.913 |  0.005  0.003  0.000 |  0.143  5.574
body_mass_g       |  0.911 30.214  0.829 |  0.067  0.585  0.005 |  0.359 35.015
                    cos2  
bill_length_mm     0.152 |
bill_depth_mm      0.067 |
flipper_length_mm  0.021 |
body_mass_g        0.129 |

Supplementary categories
                       Dist     Dim.1    cos2  v.test     Dim.2    cos2  v.test
Adelie            |   1.499 |  -1.460   0.948 -14.184 |  -0.142   0.009  -2.583
Chinstrap         |   1.295 |  -0.389   0.090  -2.165 |   0.993   0.589  10.394
Gentoo            |   2.052 |   2.013   0.963  16.507 |  -0.394   0.037  -6.069
                      Dim.3    cos2  v.test  
Adelie            |   0.312   0.043   8.281 |
Chinstrap         |  -0.733   0.321 -11.147 |
Gentoo            |   0.036   0.000   0.803 |

Penguins PCA

Code
pca_model$ind$coord |>
  as_tibble() |>
  mutate(species = penguins_pca$species) |>
  ggplot(aes(x = Dim.1, y = Dim.2, color = species)) +
  geom_point() +
  labs(
    x = "PC1 (73% of variance)",
    y = "PC2 (23% of variance)",
    title = "PCA of Palmer Penguins Data"
  ) +
  theme_minimal()

Penguins PCA

Code
library(factoextra)

fviz_pca_var(pca_model, 
             col.var = "contrib",   # color by contribution to PCs
             gradient.cols = c("blue", "orange", "red"),
             repel = TRUE)

Hierarchical Clustering

What is Hierarchical Clustering?

  • A clustering method that builds a hierarchy (tree) of clusters. The result is typically visualized as a dendrogram, a tree-like diagram illustrating how clusters merge or split.

  • Agglomerative (bottom-up) approach (most common): Start with each data point as its own cluster, then iteratively merge the closest clusters until one overall cluster remains.

  • Divisive (top-down) approach: Start with one cluster containing all points and recursively split clusters. (Less common in practice.)

  • You don’t have to pre-specify the number of clusters k; instead, you cut the dendrogram at a chosen level to obtain the desired number of clusters, or interpret the tree structure directly.

How to Merge Clusters

  • The key choice in agglomerative clustering is the linkage criterion, which defines the distance between two clusters based on pairwise distances of points:

Linkage Criteria

  • Single linkage: Distance between two clusters = minimum distance between any single point in one cluster and any point in the other. Tends to form “chains” (elongated clusters) since one close pair can link two clusters.

  • Complete linkage: Distance = maximum distance between any point in one cluster to any in the other. Tends to produce more compact, spherical clusters, as clusters only merge when all points are relatively close.

  • Average linkage: Distance = average pairwise distance between points across clusters (UPGMA). A compromise between single and complete – merges clusters when the average distance is small.

  • Ward’s method: Merges clusters that result in the smallest increase in total within-cluster variance (i.e., it tries to minimize the sum of squared distances within clusters). Often yields compact, evenly-sized clusters and is a popular default for hierarchical clustering.

  • Distance metric: Often Euclidean distance is used, but hierarchical clustering can work with any distance (e.g. correlation distance, Manhattan distance). Choice of distance and linkage both affect the clustering result (see the comparison sketch below).
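
A rough way to compare the linkage criteria, assuming the scaled matrix penguins_scaled from the PCA example is still available:

Code
# Same Euclidean distance matrix, four different linkage criteria
d_sc <- dist(penguins_scaled, method = "euclidean")

hc_sgl <- hclust(d_sc, method = "single")    # prone to chaining
hc_cmp <- hclust(d_sc, method = "complete")  # compact clusters
hc_avg <- hclust(d_sc, method = "average")   # UPGMA
hc_wrd <- hclust(d_sc, method = "ward.D2")   # Ward's criterion

par(mfrow = c(2, 2))
plot(hc_sgl, labels = FALSE, main = "Single")
plot(hc_cmp, labels = FALSE, main = "Complete")
plot(hc_avg, labels = FALSE, main = "Average")
plot(hc_wrd, labels = FALSE, main = "Ward")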

Dendrogram: interpretation and cut

  • A dendrogram shows the merging process.

  • To decide on clusters, one can “cut” the dendrogram at a chosen height.

  • No single “best” cut: Often you look for a level where clusters are meaningful and not too numerous.

Code
d <- dist(penguins_scaled, method = "euclidean")  # scaled features (see Best Practices)
hc_complete <- hclust(d, method = "complete")
plot(hc_complete)
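
To obtain a flat clustering, cut the tree at a chosen number of clusters (or at a chosen height). A sketch using the hc_complete object above, with k = 3 in line with the three species:

Code
# Cut the dendrogram into 3 clusters and inspect their sizes
clusters_hc <- cutree(hc_complete, k = 3)
table(clusters_hc)

# Alternatively, cut at a specific height instead of a fixed k
# cutree(hc_complete, h = 4)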

K-means Clustering

K-means: Algorithm Overview

  • Goal: Partition the data into K clusters such that points within each cluster are as similar as possible (minimizing within-cluster variance).

  • Centroid: Each cluster is defined by its centroid (the mean of the points in the cluster). K-means alternates between assigning points to clusters and updating centroids.

K-means: Algorithm Steps

  1. Initialize with K initial centroids (chosen randomly or by some heuristic).

  2. Assignment step: Assign each data point to the nearest centroid (usually via Euclidean distance).

  3. Update step: Recompute each centroid as the mean of all points assigned to that cluster.

  4. Repeat steps 2-3 until assignments no longer change (convergence), i.e. the centroids have stabilized (or a maximum number of iterations is reached). A minimal sketch of this loop follows.
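
A bare-bones sketch of the assignment/update loop, assuming the scaled matrix penguins_scaled from the PCA example (in practice, use kmeans() as shown later):

Code
set.seed(1)
X <- penguins_scaled                    # scaled feature matrix from earlier
K <- 3
centroids <- X[sample(nrow(X), K), ]    # step 1: random initial centroids

for (iter in 1:20) {
  # step 2: assign each point to its nearest centroid (squared Euclidean distance)
  d2 <- sapply(1:K, function(k) rowSums(sweep(X, 2, centroids[k, ])^2))
  cluster <- max.col(-d2)
  # step 3: recompute each centroid as the mean of its assigned points
  # (the rare empty-cluster case is ignored here for brevity)
  centroids <- t(sapply(1:K, function(k) colMeans(X[cluster == k, , drop = FALSE])))
}

table(cluster)   # cluster sizes after convergence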

K-means: Algorithm Overview

K-means tries to minimize the within-cluster sum of squares (WCSS), i.e. the sum of squared distances from each point to its cluster centroid. It’s a form of variance minimization within clusters (see the snippet after this list).

  • This procedure always converges, but it can get stuck in a local minimum, so it is common to run several random initializations (e.g. via the nstart argument) and keep the best solution.

  • Must choose K.

  • Fast and scalable.

  • Cluster assignment provides a hard partition: each point belongs to exactly one of the K clusters (no overlaps, no probabilities).
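
To make the objective concrete, WCSS can be computed by hand for any hard assignment. A sketch reusing X and cluster from the loop above (kmeans() reports the same quantity as tot.withinss):

Code
# Total within-cluster sum of squares for a given assignment
wcss <- sum(sapply(unique(cluster), function(k) {
  Xk  <- X[cluster == k, , drop = FALSE]
  ctr <- colMeans(Xk)                  # cluster centroid
  sum(sweep(Xk, 2, ctr)^2)             # squared distances to the centroid
}))
wcss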

Choosing K: The Elbow Method

How to decide the number of clusters K? One popular heuristic is the Elbow Method:

  • Run K-means for different values of K (e.g. 1 through 10) and compute the WCSS for each.

  • As K increases, WCSS always decreases: look for an “elbow”, a point after which the rate of decrease sharply slows (see the code below).
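
A minimal way to produce the elbow plot, assuming the same two scaled penguin features used throughout:

Code
set.seed(42)
ks <- 1:10
wcss_k <- sapply(ks, function(k)
  kmeans(scale(penguins), centers = k, nstart = 10)$tot.withinss)

plot(ks, wcss_k, type = "b",
     xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")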

Many evaluation metrics exist to assess the quality of a clustering solution:

  • Silhouette score: measures how well-separated clusters are.

  • Gap statistic: compares the observed WCSS to what would be expected under a random (unstructured) reference distribution to assess cluster structure (see the sketch below).
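
Both criteria are available through fviz_nbclust() in factoextra (loaded earlier for the PCA variable plot); a sketch on the two scaled penguin features:

Code
# Average silhouette width across candidate values of K
fviz_nbclust(scale(penguins), kmeans, method = "silhouette")

# Gap statistic (bootstrapped reference data sets; slower)
fviz_nbclust(scale(penguins), kmeans, method = "gap_stat", nboot = 50)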

More penguins

Trying K = 2, 3, 4… might show that K = 3 yields a large drop in WCSS (likely corresponding to the three species), after which improvements taper off.

Code
set.seed(42)
peng_num <- penguins |> scale()    # use scaled numeric features
km3 <- kmeans(peng_num, centers = 3, nstart = 10)
# km3$centers   # centroid coordinates for the 3 clusters
# km3$tot.withinss  # total within-cluster sum of squares
library(kableExtra)
table(km3$cluster, palmerpenguins::penguins |> na.omit() |> pull(species)) |> kbl()
     Adelie   Chinstrap   Gentoo
1       139           6        3
2         4          61       14
3         3           1      102

Best Practices in unsupervised learning

  • Feature Scaling: Always scale/standardize features before PCA and distance-based clustering (K-means, K-medoids). If not, variables on larger scales will dominate PCA’s variance and distance calculations in clustering, skewing results.

  • Preprocessing: Handle missing values and outliers prior to analysis. PCA can be thrown off by missing data (consider imputation) and outliers (consider removing or using robust PCA). K-means is very sensitive to outliers; consider removing them or using K-medoids or other robust clustering.

  • Validation: In unsupervised learning, validating results is tricky. When possible, validate clusters against any known information (e.g., cluster assignments vs known categories, even if those weren’t used in training). Or use internal validation (cohesion and separation metrics) and stability checks (does clustering change drastically with slight data perturbations?).
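
A rough illustration of both ideas, assuming the km3 fit and the peng_num matrix from the K-means example (the silhouette comes from the cluster package):

Code
library(cluster)

# External check: compare cluster assignments with the known species labels
species <- palmerpenguins::penguins |> drop_na() |> pull(species)
table(km3$cluster, species)

# Internal check: average silhouette width (closer to 1 = better separated)
sil <- silhouette(km3$cluster, dist(peng_num))
mean(sil[, "sil_width"])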