LDA implementation in package MASS

In ISLR, linear discriminant analysis is implemented in the R lab via the lda function from the package MASS. The function is based on Fisher’s discriminant analysis, as described in Section 12.1 of Modern Applied Statistics with S, by Venables and Ripley. In particular, let the within and between covariance matrices, \(\bf W\) and \(\bf B\), be defined as follows

\[ {\bf W}=\frac{\left({\bf X}-{\bf GM}\right)^{\sf T}\left({\bf X}-{\bf GM}\right)}{n-g} \ \ \ \text{ and } \ \ \ {\bf B}=\frac{\left({\bf GM}-{\bf 1\bar{x}}^{\sf T}\right)^{\sf T}\left({\bf GM}-{\bf 1\bar{x} }^{\sf T}\right)}{g-1} \]

where \({\bf X}\) is the \(n\times p\) data matrix, \({\bf G}\) is the \(n\times g\) indicator matrix assigning each of the \(n\) observations to one of the \(g\) classes, \({\bf M}\) is the \(g\times p\) matrix of class means (so that \({\bf GM}\) has, on each row, the mean vector of the class the observation belongs to), \({\bf \bar{x}}\) is the \(p\)-vector of overall variable means, and \({\bf 1}\) is an \(n\)-vector of ones.

The goal of Fisher’s discriminant analysis is to find one or more sets of weights \({\bf a}\) providing linear combinations of the original variables that maximize the ratio of the between to the within covariance. Formally,

\[\begin{equation} \max_{{\bf a}} {\bf a}^{\sf T}{\bf B}{\bf a}/{\bf a}^{\sf T}{\bf W}{\bf a}. \end{equation}\]

Venables and Ripley suggest that, in order to obtain the solution, one can simply do a PCA on \(\bf B\), provided that the variables in \(\bf X\) have been transformed so that the within-group covariance matrix is the identity (that is, within each class, the variables are uncorrelated with unit variance). This pre-processing is often referred to as sphering or whitening, and it is demonstrated in the last section below. Finally, once \(\bf a\) is obtained, \({\bf X}{\bf a}\) is the linear discriminant. As in PCA, the first few linear discriminants explain most of the between/within variability ratio.

In ISLR, LDA is not presented in terms of Fisher’s DA. Upon assuming the variables in \(X\) to follow a multivariate Gaussian distribution within each class, with class-specific means and a common covariance matrix \(\bf \Sigma\), LDA boils down to estimating, for each class \(k\), the prior \(\pi_{k}\) and the parameters of the multivariate Gaussian. Bayes’ rule is then used to get an estimate of the posterior probability

\[\begin{equation} P(Y=k|X)=\frac{\pi_{k}f_{k}(X)}{\sum_{l=1}^{g}{\pi_{l}f_{l}(X)}} \end{equation}\]

Venables and Ripley refer to this approach as discrimination via probability models, and briefly describe it in the subsection of Section 12.1 titled Discrimination for normal populations.
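
As a sketch of this probability-model route (a direct computation under the Gaussian assumptions above, not the internal code of lda), the posterior probabilities for a single observation can be obtained from the class proportions, the class means and the pooled covariance matrix, and compared with the output of predict():

library(MASS)
data(iris)
X = data.matrix(iris[, 1:4])
y = iris$Species
n = nrow(X); g = nlevels(y)
pi_k = table(y)/n                           # estimated priors
M = rowsum(X, y)/as.vector(table(y))        # class means
# pooled within-class covariance, with denominator n - g
S = Reduce(`+`, lapply(levels(y), function(k)
  crossprod(scale(X[y == k, ], center = TRUE, scale = FALSE))))/(n - g)
Sinv = solve(S)
x = X[1, ]                                  # first observation
# Gaussian densities up to a constant (the normalising constant cancels in the ratio)
f_k = apply(M, 1, function(m) exp(-0.5 * (x - m) %*% Sinv %*% (x - m)))
as.vector(pi_k * f_k/sum(pi_k * f_k))
# cf. predict(lda(Species ~ ., data = iris))$posterior[1, ]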

Reproducing the lda() output

We apply, as a classic example, the lda function to the iris data set.

library(MASS)
data(iris)
lda_fit=lda(formula=Species~.,data=iris)

The \(\pi_{k}\)’s are returned (that is, the proportion of training observations in each class).

lda_fit$prior
##     setosa versicolor  virginica 
##  0.3333333  0.3333333  0.3333333
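
As a quick check, these are just the class proportions in the training data:

table(iris$Species)/nrow(iris)   # 1/3 each, since iris is balanced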

The group means

lda_fit$means
##            Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa            5.006       3.428        1.462       0.246
## versicolor        5.936       2.770        4.260       1.326
## virginica         6.588       2.974        5.552       2.026

The linear discriminant weights

lda_fit$scaling
##                     LD1         LD2
## Sepal.Length  0.8293776  0.02410215
## Sepal.Width   1.5344731  2.16452123
## Petal.Length -2.2012117 -0.93192121
## Petal.Width  -2.8104603  2.83918785

Finally, the singular values, which give the ratios of the between to the within group standard deviations on the linear discriminant variables, are

lda_fit$svd
## [1] 48.642644  4.579983

To compute the quantities above from scratch, first extract the numeric variables as a matrix

library(dplyr)
X=as_tibble(iris) %>% dplyr::select(where(is.numeric)) %>%
  data.matrix
n=X %>% nrow

Compute the overall mean \(\bf{\bar{x}}\)

xbar = as_tibble(X) %>% summarise(across(where(is.numeric), mean)) %>% 
       data.matrix
xbar
##      Sepal.Length Sepal.Width Petal.Length Petal.Width
## [1,]     5.843333    3.057333        3.758    1.199333
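
The same, in base R:

colMeans(X)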

and the group means \(\bf{M}\)

M = as_tibble(X) %>% mutate(Species=iris$Species) %>% group_by(Species) %>% 
  summarise(across(where(is.numeric), mean)) %>% 
  dplyr::select(where(is.numeric)) %>% data.matrix
M
##      Sepal.Length Sepal.Width Petal.Length Petal.Width
## [1,]        5.006       3.428        1.462       0.246
## [2,]        5.936       2.770        4.260       1.326
## [3,]        6.588       2.974        5.552       2.026
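
Equivalently, in base R, rowsum sums each column within groups:

rowsum(X, iris$Species)/as.vector(table(iris$Species))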

And the indicator matrix \(\bf G\), the one-hot encoding of \(Y\)

library(recipes)
G = tibble(Species=iris$Species) %>% recipe(~Species) %>%
  step_dummy(Species, one_hot = TRUE) %>% 
  prep() %>% juice() %>% data.matrix

as_tibble(G) %>% slice(c(1,51,101))
## # A tibble: 3 × 3
##   Species_setosa Species_versicolor Species_virginica
##            <dbl>              <dbl>             <dbl>
## 1              1                  0                 0
## 2              0                  1                 0
## 3              0                  0                 1
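
A base-R alternative builds the same indicator matrix via model.matrix:

G_base = model.matrix(~ Species - 1, data = iris)
all.equal(as.vector(G_base), as.vector(G))   # expect TRUE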

It is now possible to compute the covariance matrices as defined in the book

GM = G%*%M
g = M %>% nrow
one_x = matrix(1,n,1) %*% xbar

W_book = (t(X-GM)%*%(X-GM))/(n-g)
B_book = (t(GM - one_x)%*%(GM - one_x))/(g-1)
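
As a sanity check, \(\bf GM\) places each observation’s class mean on the corresponding row, and W_book matches the pooled within-class covariance computed group by group:

all.equal(unname(GM[c(1, 51, 101), ]), unname(M))   # expect TRUE
W_check = Reduce(`+`, lapply(split(as.data.frame(X), iris$Species),
                             function(d) (nrow(d) - 1)*cov(d)))/(n - g)
all.equal(W_book, W_check)                          # expect TRUE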

All the computations for the lda on the iris data set can be found in the top-voted answer to this stackexchange post. The author of the post works with the scatter matrices for W and B (that is, just the sums of squares, without the degrees-of-freedom denominators)

W_scat = (t(X-GM)%*%(X-GM))
B_scat = (t(GM - one_x)%*%(GM - one_x))
W_scat
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length      38.9562     13.6300      24.6246      5.6450
## Sepal.Width       13.6300     16.9620       8.1208      4.8084
## Petal.Length      24.6246      8.1208      27.2226      6.2718
## Petal.Width        5.6450      4.8084       6.2718      6.1566
B_scat
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length     63.21213   -19.95267     165.2484    71.27933
## Sepal.Width     -19.95267    11.34493     -57.2396   -22.93267
## Petal.Length    165.24840   -57.23960     437.1028   186.77400
## Petal.Width      71.27933   -22.93267     186.7740    80.41333

In principle, one should be able to get to the solution via the EVD of \({\bf W}^{-1}{\bf B}\)

evd_book=eigen(solve(W_book)%*%B_book)
evd_scat=eigen(solve(W_scat)%*%B_scat)

The eigenvectors from the two EVDs will be the same up to sign, and of course the non-zero eigenvalues will differ by the constant factor \((g-1)/(n-g)=\) 0.0136054.

evd_scat$values[1:2]/evd_book$values[1:2]
## [1] 0.01360544 0.01360544
evd_scat$vectors[,1:2]/evd_book$vectors[,1:2]
##      [,1] [,2]
## [1,]   -1    1
## [2,]   -1    1
## [3,]   -1    1
## [4,]   -1    1

In fact, for the square roots of the eigenvalues to match the lda results, one has to either use the book-based matrices or divide the scatter-based eigenvalues by \((g-1)/(n-g)\). The author of the post did realise he was not getting the same singular values as lda; he just did not figure out why (and, in fact, that played no role in computing the discriminants).

((evd_scat$values[1:2])/((g-1)/(n-g)))^.5
## [1] 48.642644  4.579983
(evd_book$values[1:2])^.5
## [1] 48.642644  4.579983
lda_fit$svd
## [1] 48.642644  4.579983
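
Note, in passing, that the squared singular values are exactly the eigenvalues of \({\bf W}^{-1}{\bf B}\):

all.equal(lda_fit$svd^2, evd_book$values[1:2])   # expect TRUE, up to numerical tolerance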

Back to the eigenvectors, the EVD of \({\bf W}^{-1}{\bf B}\) produces

evd_book$vectors[,1:2]
##            [,1]         [,2]
## [1,]  0.2087418 -0.006531964
## [2,]  0.3862037 -0.586610553
## [3,] -0.5540117  0.252561540
## [4,] -0.7073504 -0.769453092

These are the eigenvectors, column-normalised to one (SS = 1). In fact,

colSums(evd_book$vectors[,1:2]^2)
## [1] 1 1

The obtained eigenvectors do not correspond to the linear discriminants, although each column is proportional to the corresponding one.

lda_fit$scaling/evd_book$vectors[,1:2]
##                   LD1       LD2
## Sepal.Length 3.973222 -3.689878
## Sepal.Width  3.973222 -3.689878
## Petal.Length 3.973222 -3.689878
## Petal.Width  3.973222 -3.689878
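
The constant of proportionality is not arbitrary: lda normalises the weights so that the within-group covariance matrix of the discriminant scores is spherical (see the description of the scaling component in ?lda). A quick check, using the W_book matrix computed above:

t(lda_fit$scaling) %*% W_book %*% lda_fit$scaling   # approximately the 2x2 identity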

Alternatives to the EVD\(({\bf W}^{-1}{\bf B})\): Cholesky-based

Some alternatives that provide equivalent results are described here.

The Cholesky decomposition of the within covariance matrix, \({\bf U}=\mathrm{chol}({\bf W})\), where \(\bf U\) is upper triangular and \({\bf W}={\bf U}^{\sf T}{\bf U}\), leads to the matrix \({\bf U}^{-\sf T}{\bf B}{\bf U}^{-1}\), which has the same eigenvalues as \({\bf W}^{-1}{\bf B}\) (the two matrices are similar, since \({\bf W}^{-1}{\bf B}={\bf U}^{-1}({\bf U}^{-\sf T}{\bf B}{\bf U}^{-1}){\bf U}\))

U_book=chol(W_book)
U_scat=chol(W_scat)

UBU_book = t(solve(U_book)) %*% B_book %*% solve(U_book)
UBU_scat = t(solve(U_scat)) %*% B_scat %*% solve(U_scat)
evd_UBU_b = eigen(UBU_book) 
evd_UBU_s = eigen(UBU_scat) 
evd_UBU_b$values[1:2]
## [1] 2366.10680   20.97624
evd_book$values[1:2]
## [1] 2366.10680   20.97624
evd_UBU_s$values[1:2]
## [1] 32.191929  0.285391
evd_scat$values[1:2]
## [1] 32.191929  0.285391

In particular, \({\bf U}^{-1}{\bf E}\), where \({\bf E}\) holds the eigenvectors of \({\bf U}^{-\sf T}{\bf B}{\bf U}^{-1}\), corresponds to lda_fit$scaling (up to sign).

E=evd_UBU_b$vectors[,1:2]
solve(U_book) %*% E
##                    [,1]        [,2]
## Sepal.Length -0.8293776 -0.02410215
## Sepal.Width  -1.5344731 -2.16452123
## Petal.Length  2.2012117  0.93192121
## Petal.Width   2.8104603 -2.83918785
lda_fit$scaling
##                     LD1         LD2
## Sepal.Length  0.8293776  0.02410215
## Sepal.Width   1.5344731  2.16452123
## Petal.Length -2.2012117 -0.93192121
## Petal.Width  -2.8104603  2.83918785

The computation for the scatter matrix version is similar, but it takes a further factor of \(\sqrt{n-g}\) (again, up to sign)

E_s=evd_UBU_s$vectors[,1:2]
solve(U_scat) %*% E_s * sqrt(n-g)
##                    [,1]        [,2]
## Sepal.Length -0.8293776 -0.02410215
## Sepal.Width  -1.5344731 -2.16452123
## Petal.Length  2.2012117  0.93192121
## Petal.Width   2.8104603 -2.83918785
lda_fit$scaling
##                     LD1         LD2
## Sepal.Length  0.8293776  0.02410215
## Sepal.Width   1.5344731  2.16452123
## Petal.Length -2.2012117 -0.93192121
## Petal.Width  -2.8104603  2.83918785

Alternatives to the EVD\(({\bf W}^{-1}{\bf B})\): sphering/whitening

This is the approach that Venables and Ripley seem to suggest, that is, sphering/whitening: \({\bf W}^{-1/2}\) is computed from the EVD of \(\bf W\), the between matrix is transformed so that the within-group covariance becomes the identity, and the resulting eigenvectors are mapped back by \({\bf W}^{-1/2}\).

evdW=eigen(W_book)
inv_Wsq = evdW$vectors %*% diag(1/sqrt(evdW$values)) %*% t(evdW$vectors)
inv_Wsq%*%eigen(inv_Wsq%*%B_book%*%inv_Wsq)$vectors[,1:2]
##            [,1]        [,2]
## [1,] -0.8293776 -0.02410215
## [2,] -1.5344731 -2.16452123
## [3,]  2.2012117  0.93192121
## [4,]  2.8104603 -2.83918785
lda_fit$scaling
##                     LD1         LD2
## Sepal.Length  0.8293776  0.02410215
## Sepal.Width   1.5344731  2.16452123
## Petal.Length -2.2012117 -0.93192121
## Petal.Width  -2.8104603  2.83918785

For the scatter matrices case, the same result applies, up to the \(\sqrt{n-g}\) factor.

evdW=eigen(W_scat)
inv_Wsq = evdW$vectors %*% diag(1/sqrt(evdW$values)) %*% t(evdW$vectors)
inv_Wsq%*%eigen(inv_Wsq%*%B_scat%*%inv_Wsq)$vectors[,1:2]* sqrt(n-g)
##            [,1]        [,2]
## [1,] -0.8293776 -0.02410215
## [2,] -1.5344731 -2.16452123
## [3,]  2.2012117  0.93192121
## [4,]  2.8104603 -2.83918785
lda_fit$scaling
##                     LD1         LD2
## Sepal.Length  0.8293776  0.02410215
## Sepal.Width   1.5344731  2.16452123
## Petal.Length -2.2012117 -0.93192121
## Petal.Width  -2.8104603  2.83918785
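
To close the loop, the discriminant scores returned by predict can be reproduced by centring \(\bf X\) and applying the weights. Note that predict.lda centres at the prior-weighted average of the class means, which for the balanced iris data coincides with \(\bf{\bar{x}}\):

scores = scale(X, center = as.vector(xbar), scale = FALSE) %*% lda_fit$scaling
all.equal(scores, predict(lda_fit)$x, check.attributes = FALSE)   # expect TRUE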