In ISLR, linear discriminant analysis is implemented in the R lab via the lda function from the MASS package. That function is based on Fisher’s discriminant analysis, as described in Section 12.1 of Modern Applied Statistics with S, by Venables and Ripley. In particular, let the within and the between covariance matrices, \(\bf W\) and \(\bf B\), be defined as follows
\[ {\bf W}=\frac{\left({\bf X}-{\bf GM}\right)^{\sf T}\left({\bf X}-{\bf GM}\right)}{n-g} \ \ \ \text{ and } \ \ \ {\bf B}=\frac{\left({\bf GM}-{\bf 1\bar{x}}^{\sf T}\right)^{\sf T}\left({\bf GM}-{\bf 1\bar{x} }^{\sf T}\right)}{g-1} \]
where
\({\bf X}\) is the \(n\times p\) matrix of predictors;
\({\bf G}\) is the \(n\times g\) indicator matrix (one-hot encoding) of the categorical response \(Y\): its general term \(g_{ij}=1\) if the \(i^{th}\) observation belongs to the \(j^{th}\) category of \(Y\), and \(g_{ij}=0\) otherwise;
\({\bf M}\) is the \(g\times p\) matrix of group means;
\({\bf \bar{x}}\) is the \(p\)-dimensional overall mean of the data;
\({\bf 1}\) is an \(n\)-dimensional vector of ones.
The goal of Fisher’s discriminant analysis is to find one or more sets of weights \({\bf a}\) that provide linear combinations of the original variables maximising the ratio of between to within covariance. Formally,
\[\begin{equation} \max_{{\bf a}} {\bf a}^{\sf T}{\bf B}{\bf a}/{\bf a}^{\sf T}{\bf W}{\bf a}. \end{equation}\]
Venables and Ripley suggest that, in order to obtain the solution, one can simply run a PCA on \(\bf B\), provided that the variables in \(\bf X\) have first been transformed so that the within-group covariance matrix is the identity (that is, within each class, the variables are uncorrelated and have unit variance). This pre-processing is often referred to as sphering or whitening. Finally, once \(\bf a\) is obtained, \({\bf X}{\bf a}\) is the linear discriminant. As in PCA, the first few linear discriminants explain most of the between/within variability ratio.
In ISLR, LDA is not presented in terms of Fisher’s DA. Assuming the variables in \(X\) to follow a multivariate Gaussian distribution within each class, with class-specific means and a common covariance matrix \(\bf \Sigma\), LDA boils down to estimating, for each class \(k\), the prior \(\pi_{k}\) and the parameters of the multivariate Gaussian. Then Bayes’ rule is used to obtain an estimate of the posterior probability
\[\begin{equation} P(Y=k|X)=\frac{\pi_{k}f_{k}(X)}{\sum_{l=1}^{g}{\pi_{l}f_{l}(X)}} \end{equation}\]
Venables and Ripley refer to this approach as discrimination via probability models, and briefly describe it in the subsection of Section 12.1 titled Discrimination for normal populations.
As a classic example, we apply the lda function to the iris data set.
library(MASS)
data(iris)
lda_fit = lda(formula = Species ~ ., data = iris)
The \(\pi_{k}'s\) are returned (that is, the proportions of training observations in each class).
lda_fit$prior
## setosa versicolor virginica
## 0.3333333 0.3333333 0.3333333
The group means
lda_fit$means
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa 5.006 3.428 1.462 0.246
## versicolor 5.936 2.770 4.260 1.326
## virginica 6.588 2.974 5.552 2.026
The linear discriminant weights
lda_fit$scaling
## LD1 LD2
## Sepal.Length 0.8293776 0.02410215
## Sepal.Width 1.5344731 2.16452123
## Petal.Length -2.2012117 -0.93192121
## Petal.Width -2.8104603 2.83918785
Finally, the singular values of the between/within problem (as verified below, the square roots of the non-zero eigenvalues of \({\bf W}^{-1}{\bf B}\)) are
lda_fit$svd
## [1] 48.642644 4.579983
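As an aside, the probability-model formulation recalled above can be checked directly. A minimal sketch (not part of the original lab): it plugs the class means, the pooled covariance (divisor \(n-g\)) and the priors into Bayes’ rule; the constants common to all classes cancel in the ratio, so the result should match predict(lda_fit)$posterior up to numerical precision.
Xm = data.matrix(iris[, 1:4])
cls = iris$Species
# pooled within-class covariance, divisor n - g
S_w = Reduce(`+`, lapply(levels(cls), function(k) {
  Xk = Xm[cls == k, , drop = FALSE]
  crossprod(scale(Xk, center = TRUE, scale = FALSE))
})) / (nrow(Xm) - nlevels(cls))
S_inv = solve(S_w)
# log-densities up to an additive constant (the same for all classes)
logf = sapply(levels(cls), function(k) {
  d = sweep(Xm, 2, colMeans(Xm[cls == k, , drop = FALSE]))
  -0.5 * rowSums((d %*% S_inv) * d)
})
post = exp(logf) * rep(lda_fit$prior, each = nrow(Xm))
post = post / rowSums(post)
max(abs(post - predict(lda_fit)$posterior))  # expected to be numerically negligible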
To compute the quantities above from scratch
library(dplyr)
library(tibble)
X = as_tibble(iris) %>% dplyr::select(where(is.numeric)) %>%
  data.matrix
n = X %>% nrow
Compute the overall mean \(\bf{\bar{x}}\)
xbar = as_tibble(X) %>% summarise(across(where(is.numeric), mean)) %>%
data.matrix
xbar
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## [1,] 5.843333 3.057333 3.758 1.199333
and the group means \(\bf{M}\)
M = as_tibble(X) %>% mutate(Species=iris$Species) %>% group_by(Species) %>%
summarise(across(where(is.numeric), mean)) %>%
dplyr::select(where(is.numeric)) %>% data.matrix
M
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## [1,] 5.006 3.428 1.462 0.246
## [2,] 5.936 2.770 4.260 1.326
## [3,] 6.588 2.974 5.552 2.026
And the indicator matrix \(\bf G\), one-hot encoding of \(Y\)
library(recipes)
G = tibble(Species=iris$Species) %>% recipe(~Species) %>%
  step_dummy(Species, one_hot = TRUE) %>%
  prep() %>% juice() %>% data.matrix
as_tibble(G) %>% slice(c(1,51,101))
## # A tibble: 3 × 3
## Species_setosa Species_versicolor Species_virginica
## <dbl> <dbl> <dbl>
## 1 1 0 0
## 2 0 1 0
## 3 0 0 1
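As an aside, the same 0/1 indicator matrix can also be built in base R with model.matrix (just a convenience check, not used in what follows).
G_base = model.matrix(~ Species - 1, data = iris)  # columns in factor-level order
all(G_base == G)  # same entries as G, only the column names differ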
It is now possible to compute the covariance matrices as defined in the book
GM = G %*% M                       # n x p: row i holds the mean vector of observation i's group
g = M %>% nrow                     # number of groups
one_x = matrix(1, n, 1) %*% xbar   # n x p: each row is the overall mean
W_book = (t(X - GM) %*% (X - GM)) / (n - g)
B_book = (t(GM - one_x) %*% (GM - one_x)) / (g - 1)
All the computations for the lda on the iris data set can be found in the most voted answer to this stackexchange post. The author of the post works with the scatter matrices for \(\bf W\) and \(\bf B\) (that is, just the sums of squares, without the degrees-of-freedom denominators)
W_scat = (t(X-GM)%*%(X-GM))
B_scat = (t(GM - one_x)%*%(GM - one_x))
W_scat
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 38.9562 13.6300 24.6246 5.6450
## Sepal.Width 13.6300 16.9620 8.1208 4.8084
## Petal.Length 24.6246 8.1208 27.2226 6.2718
## Petal.Width 5.6450 4.8084 6.2718 6.1566
B_scat
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 63.21213 -19.95267 165.2484 71.27933
## Sepal.Width -19.95267 11.34493 -57.2396 -22.93267
## Petal.Length 165.24840 -57.23960 437.1028 186.77400
## Petal.Width 71.27933 -22.93267 186.7740 80.41333
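As a sanity check, the two scatter matrices add up to the total scatter about the overall mean, the classic total = within + between decomposition.
T_scat = t(X - one_x) %*% (X - one_x)  # total scatter about the overall mean
all.equal(W_scat + B_scat, T_scat)     # should be TRUE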
In principle, one should be able to get to the solution via the EVD of \({\bf W}^{-1}{\bf B}\)
evd_book=eigen(solve(W_book)%*%B_book)
evd_scat=eigen(solve(W_scat)%*%B_scat)
The eigenvectors from the two EVDs are the same (up to sign) and, since \({\bf W}_{scat}=(n-g)\,{\bf W}\) and \({\bf B}_{scat}=(g-1)\,{\bf B}\), the non-zero eigenvalues differ by the constant factor \((g-1)/(n-g)=\) 0.0136054.
evd_scat$values[1:2]/evd_book$values[1:2]
## [1] 0.01360544 0.01360544
evd_scat$vectors[,1:2]/evd_book$vectors[,1:2]
## [,1] [,2]
## [1,] -1 1
## [2,] -1 1
## [3,] -1 1
## [4,] -1 1
In fact, for the square roots of the eigenvalues to match the lda results, one has to use the book-based matrices, or to correct the scatter-matrix eigenvalues by the factor \((g-1)/(n-g)\). The author of the post did realise he was not getting the same singular values as lda, but did not figure out why (in fact, that plays no role in computing the discriminants).
((evd_scat$values[1:2])/((g-1)/(n-g)))^.5
## [1] 48.642644 4.579983
(evd_book$values[1:2])^.5
## [1] 48.642644 4.579983
lda_fit$svd
## [1] 48.642644 4.579983
Back to the eigenvectors, the EVD of \({\bf W}^{-1}{\bf B}\) produces
evd_book$vectors[,1:2]
## [,1] [,2]
## [1,] 0.2087418 -0.006531964
## [2,] 0.3862037 -0.586610553
## [3,] -0.5540117 0.252561540
## [4,] -0.7073504 -0.769453092
These are the eigenvectors, column-normalised to unit length (sum of squares equal to 1). In fact,
colSums(evd_book$vectors[,1:2]^2)
## [1] 1 1
The eigenvectors so obtained do not correspond to the linear discriminants, although they are proportional to them.
lda_fit$scaling/evd_book$vectors[,1:2]
## LD1 LD2
## Sepal.Length 3.973222 -3.689878
## Sepal.Width 3.973222 -3.689878
## Petal.Length 3.973222 -3.689878
## Petal.Width 3.973222 -3.689878
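Consistently with the sphering construction described further below, lda appears to scale the discriminants so that their within-class covariance matrix is the identity, that is \({\bf a}^{\sf T}{\bf W}{\bf a}=1\) for each discriminant. A quick check with the quantities computed above:
round(t(lda_fit$scaling) %*% W_book %*% lda_fit$scaling, 10)  # 2 x 2 identity matrix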
Some alternatives that provide equivalent results are described below.
The Cholesky decomposition of the within covariance matrix, \({\bf U}=\text{chol}({\bf W})\) with \({\bf W}={\bf U}^{\sf T}{\bf U}\), leads to the matrix \({\bf U}^{-\sf T}{\bf B}{\bf U}^{-1}\), which has the same eigenvalues as \({\bf W}^{-1}{\bf B}\) (the two matrices are similar: \({\bf W}^{-1}{\bf B}={\bf U}^{-1}\left({\bf U}^{-\sf T}{\bf B}{\bf U}^{-1}\right){\bf U}\))
U_book=chol(W_book)
U_scat=chol(W_scat)
UBU_book = t(solve(U_book)) %*% B_book %*% solve(U_book)
UBU_scat = t(solve(U_scat)) %*% B_scat %*% solve(U_scat)
evd_UBU_b = eigen(UBU_book)
evd_UBU_s = eigen(UBU_scat)
evd_UBU_b$values[1:2]
## [1] 2366.10680 20.97624
evd_book$values[1:2]
## [1] 2366.10680 20.97624
evd_UBU_s$values[1:2]
## [1] 32.191929 0.285391
evd_scat$values[1:2]
## [1] 32.191929 0.285391
In particular, \({\bf U}^{-1}{\bf E}\), where \({\bf E}\) are the eigenvectors of \({\bf U}^{-\sf T}{\bf B}{\bf U}^{-1}\), corresponds (up to sign) to lda_fit$scaling.
E=evd_UBU_b$vectors[,1:2]
solve(U_book) %*% E
## [,1] [,2]
## Sepal.Length -0.8293776 -0.02410215
## Sepal.Width -1.5344731 -2.16452123
## Petal.Length 2.2012117 0.93192121
## Petal.Width 2.8104603 -2.83918785
lda_fit$scaling
## LD1 LD2
## Sepal.Length 0.8293776 0.02410215
## Sepal.Width 1.5344731 2.16452123
## Petal.Length -2.2012117 -0.93192121
## Petal.Width -2.8104603 2.83918785
The computation for the scatter matrix version is similar, but it takes a further factor \(\sqrt{n-g}\)
E_s=evd_UBU_s$vectors[,1:2]
solve(U_scat) %*% E_s * sqrt(n-g)
## [,1] [,2]
## Sepal.Length -0.8293776 -0.02410215
## Sepal.Width -1.5344731 -2.16452123
## Petal.Length 2.2012117 0.93192121
## Petal.Width 2.8104603 -2.83918785
lda_fit$scaling
## LD1 LD2
## Sepal.Length 0.8293776 0.02410215
## Sepal.Width 1.5344731 2.16452123
## Petal.Length -2.2012117 -0.93192121
## Petal.Width -2.8104603 2.83918785
The following approach, sphering/whitening, seems to be the one actually suggested by Venables and Ripley.
Let the EVD of \({\bf W}\) be \({\bf V}{\bf S}{\bf V}^{\sf T}\); it follows that \[\begin{equation} {\bf W}^{-1/2} = {\bf V}{\bf S}^{-1/2}{\bf V}^{\sf T}. \end{equation}\]
Then \({\bf W}^{-1/2}{\bf B}{\bf W}^{-1/2}\) is the quadratic form of the between-class data whitened by the within-class covariance. Let its EVD be \({\bf \hat{V}}{\bf \hat{S}}{\bf\hat{V}}^{\sf T}\);
finally, the linear discriminants are given (again up to sign) by \({\bf W}^{-1/2}{\bf \hat{V}}\)
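The reason this works is the change of variable \({\bf a}={\bf W}^{-1/2}{\bf b}\), which turns the between/within ratio into a plain Rayleigh quotient
\[\begin{equation} \frac{{\bf a}^{\sf T}{\bf B}{\bf a}}{{\bf a}^{\sf T}{\bf W}{\bf a}} = \frac{{\bf b}^{\sf T}{\bf W}^{-1/2}{\bf B}{\bf W}^{-1/2}{\bf b}}{{\bf b}^{\sf T}{\bf b}}, \end{equation}\]
maximised, as in PCA, by the leading eigenvectors \({\bf \hat{V}}\) of the whitened between matrix; transforming back gives \({\bf a}={\bf W}^{-1/2}{\bf \hat{V}}\).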
evdW=eigen(W_book)
inv_Wsq = evdW$vectors %*% diag(1/sqrt(evdW$values)) %*% t(evdW$vectors)
inv_Wsq%*%eigen(inv_Wsq%*%B_book%*%inv_Wsq)$vectors[,1:2]
## [,1] [,2]
## [1,] -0.8293776 -0.02410215
## [2,] -1.5344731 -2.16452123
## [3,] 2.2012117 0.93192121
## [4,] 2.8104603 -2.83918785
lda_fit$scaling
## LD1 LD2
## Sepal.Length 0.8293776 0.02410215
## Sepal.Width 1.5344731 2.16452123
## Petal.Length -2.2012117 -0.93192121
## Petal.Width -2.8104603 2.83918785
For the scatter-matrix case the same result applies, up to the \(\sqrt{n-g}\) factor.
evdW=eigen(W_scat)
inv_Wsq = evdW$vectors %*% diag(1/sqrt(evdW$values)) %*% t(evdW$vectors)
inv_Wsq%*%eigen(inv_Wsq%*%B_scat%*%inv_Wsq)$vectors[,1:2]* sqrt(n-g)
## [,1] [,2]
## [1,] -0.8293776 -0.02410215
## [2,] -1.5344731 -2.16452123
## [3,] 2.2012117 0.93192121
## [4,] 2.8104603 -2.83918785
lda_fit$scaling
## LD1 LD2
## Sepal.Length 0.8293776 0.02410215
## Sepal.Width 1.5344731 2.16452123
## Petal.Length -2.2012117 -0.93192121
## Petal.Width -2.8104603 2.83918785
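To close the loop, the discriminant scores can also be reproduced. A minimal sketch, assuming (as the MASS code suggests) that predict.lda centres the data at the prior-weighted average of the group means, which for this balanced data set coincides with \(\bf{\bar{x}}\):
scores = scale(X, center = drop(xbar), scale = FALSE) %*% lda_fit$scaling
max(abs(scores - predict(lda_fit)$x))  # expected to be numerically negligible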