Code
library(magick)
library(tidyverse)
library(purrr)
library(tidymodels)
library(discrim)
library(flipbookr)
library(kableExtra)
library(janitor)
library(patchwork)

classification

In a classification problem the response is a categorical variable

rather than predicting the value of \(Y\), one wants to estimate the posterior probability

\[P(Y=k\mid X=x_{i})\]

that is, the probability that the observation \(i\) belongs to the class \(k\), given that the predictor value for \(i\) is \(x_{i}\)

will a credit card owner default?

Code
set.seed(1234)
default=read_csv("./data/Default.csv")
default |> slice_sample(n=6) |> kbl() |> kable_styling(font_size=12)
default student balance income
No Yes 311.32186 22648.76
No Yes 697.13558 18377.15
No Yes 470.10718 16014.11
No No 1200.04162 56081.08
No No 553.64902 47021.49
No No 10.23149 27237.38

Do balance and income bring useful info to identify defaulters?

Code
p1 = default |> ggplot(aes(x = balance, y=income,color=default))+
  geom_point(alpha=.5,size=3)+theme_minimal() 
p1

Do balance and income bring useful info to identify defaulters?

Code
p2 = default |> pivot_longer(names_to = "predictor",values_to="value",cols=c(balance,income)) |> 
  ggplot(aes(x = default, y = value))+geom_boxplot(aes(fill=default))+
  facet_wrap(~predictor,scales="free")+theme_minimal()
p2

balance is useful, income is not

Code
p1 | p2

Note: to arrange multiple plots together, take a look at the patchwork package

linear regression for non-numeric response?

if \(Y\) is categorical

With \(K\) categories, one could code \(Y\) as an integer vector

  • for non-ordinal factors such a coding would not make sense

 

  • even for ordinal factors, one would implicitly assume constant differences between categories

linear regression for non-numeric response?

if \(Y\) is binary

The goal is to estimate \(P(Y=1|X)\), which is, in fact, numeric . . .

  • So one can model \(P(Y=1|X)=\beta_{0}+\beta_{1}X\) . . .

 

  • What can possibly be wrong with this approach?

linear regression for non-numeric response?

\(P(\texttt{default}=\texttt{yes}|\texttt{balance})=\beta_{0}+\beta_{1}\texttt{balance}\)

Code
default |> mutate(def_01 = ifelse(default== "Yes",1,0)) |>  
  ggplot(aes(x=balance, y=def_01)) + geom_point(aes(color = default),size=4, alpha=.5) + 
  geom_smooth(method="lm") + theme_minimal()
\(P(\texttt{default}=\texttt{yes}|\texttt{balance})\) is bounded in \([0, 1]\), therefore the linear fit cannot be used

use the logistic function instead

\(P(\texttt{default}=\texttt{yes}|\texttt{balance})=\frac{e^{\beta_{0}+\beta_{1}\texttt{balance}}}{1+e^{\beta_{0}+\beta_{1}\texttt{balance}}}\)

modeling the posterior \(P(Y=1|X)\) by means of a logistic function is the goal of logistic regression

conditional expectation

just like in linear regression, the fit refers to the conditional expectation of \(Y\) given \(X\); since \(Y\in\{0,1\}\), it follows that \[E[Y|X] \equiv P(Y=1|X)\]

The odds and the logit

\[ \begin{split} p(X)&=\frac{e^{\beta_{0}+\beta_{1}X}}{1+e^{\beta_{0}+\beta_{1}X}}\\ \left(1+e^{\beta_{0}+\beta_{1}X}\right)p(X)&=e^{\beta_{0}+\beta_{1}X}\\ p(X)+e^{\beta_{0}+\beta_{1}X}p(X)&=e^{\beta_{0}+\beta_{1}X}\\ p(X)&=e^{\beta_{0}+\beta_{1}X}-e^{\beta_{0}+\beta_{1}X}p(X)\\ p(X)&=e^{\beta_{0}+\beta_{1}X}\left(1-p(X)\right)\\ \frac{p(X)}{\left(1-p(X)\right)}&=e^{\beta_{0}+\beta_{1}X} \end{split} \]

  • \(\frac{p(X)}{\left(1-p(X)\right)}\) are the odds, which take values in \([0,\infty)\)
  • \(log\left(\frac{p(X)}{\left(1-p(X)\right)}\right)=\beta_{0}+\beta_{1}X\) is the logit: there is a linear relation between the logit and the predictor (see the numeric sketch below)
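
A quick numeric check of the mapping above (a minimal sketch using only base R, with made-up coefficients): \(\texttt{plogis}\) is the logistic function and \(\texttt{qlogis}\) the logit, so the logit of \(p(X)\) is linear in \(X\).

Code
# made-up coefficients, for illustration only
beta0 = -10; beta1 = 0.005
x     = seq(500, 2500, by = 500)
p     = plogis(beta0 + beta1 * x)   # logistic function: p(X) in (0, 1)
p / (1 - p)                         # odds, in [0, Inf)
qlogis(p)                           # logit: equal to beta0 + beta1 * x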

Estimating the posterior probability via the logistic function

a toy sample: fit the logistic function; then, for a new point \(\texttt{balance}=1400\), one can estimate \(P(\texttt{default=Yes}|\texttt{balance}=1400)=.62\)

How to find the logistic function? Estimate its parameters: \(P(Y=1|X) = \frac{e^{\beta_{0}+\beta_{1}X}}{1 + e^{\beta_{0}+\beta_{1}X}}\)

Least squares? One could switch to the logit, which is a linear function: \(logit(p(Y=1|X))=\beta_{0} + \beta_{1} X\)

Least squares? But the logit mapping, for the blue points, is \[log\left(\frac{p(Y=1|X)}{1-p(Y=1|X)}\right) = log\left(\frac{1}{1-1}\right) = log(1) - log(0) = 0 -(-\infty)= +\infty\]

Estimating the logit via the linear function

Least squares? But the logit mapping, for the red points, is \[log\left(\frac{p(Y=1|X)}{1-p(Y=1|X)}\right) = log\left(\frac{0}{1-0}\right) = log(0) = -\infty\]

Estimating the logit via the linear function

logit: mapping

Maximization of the likelihood function

The estimates for \(\beta_{0}\) and \(\beta_{1}\) are such that the following likelihood function is maximised:

\[ \begin{split} \ell\left(\hat{\beta}_{0},\hat{\beta}_{1}\right)=& \color{blue}{\prod_{\forall i}{p(x_{i})}}\times \color{red}{\prod_{\forall i'}{\left(1-p(x_{i'})\right)}}=\\ =&\color{blue}{ \prod_{\forall i} \frac{e^{\hat{\beta}_{0}+\hat{\beta}_{1}x_{i}}}{1+e^{\hat{\beta}_{0}+\hat{\beta}_{1}x_{i}}}} \times \color{red}{ \prod_{\forall i^{\prime}}{\left(1- \frac{e^{\hat{\beta}_{0}+\hat{\beta}_{1}x_{i'}}}{1+e^{\hat{\beta}_{0}+\hat{\beta}_{1}x_{i'}}} \right)} } \end{split} \]

Note: the \(i\) index is for blue points, \(i'\) is for red points
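
A minimal sketch of this maximisation on simulated data (made-up coefficients, not the Default data): write the negative log-likelihood explicitly and minimise it with \(\texttt{optim}\); the result is close to what \(\texttt{glm}\) would return.

Code
set.seed(1)
x = rnorm(200)
y = rbinom(200, size = 1, prob = plogis(-1 + 2 * x))  # true beta0 = -1, beta1 = 2
neg_loglik = function(beta) {
  p = plogis(beta[1] + beta[2] * x)
  -sum(y * log(p) + (1 - y) * log(1 - p))             # minus the log of the likelihood above
}
optim(c(0, 0), neg_loglik)$par                        # close to coef(glm(y ~ x, family = binomial))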

Fitting the logistic regression

pre-process: specify the recipe

Code
def_rec = recipe(default~balance, data=default)

specify the model

Code
def_model_spec = logistic_reg(mode="classification", engine="glm")

put them together in the workflow

Code
def_wflow = workflow() |> add_recipe(def_rec) |> add_model(def_model_spec)

fit the model

Code
def_fit = def_wflow |> fit(data=default) 

Fitting the logistic regression

Look at the results

Code
def_fit |>  tidy() |> kbl() |> kable_styling(font_size=12)
term estimate std.error statistic p.value
(Intercept) -10.6513306 0.3611574 -29.49221 0
balance 0.0054989 0.0002204 24.95309 0
Code
def_fit |>  glance() |> kbl() |> kable_styling(font_size=12)
null.deviance df.null logLik AIC BIC deviance df.residual nobs
2920.65 9999 -798.2258 1600.452 1614.872 1596.452 9998 10000

Fitting the logistic regression: a qualitative predictor

Suppose you want to use \(\texttt{student}\) as the qualitative predictor for your logistic regression. You can update, within the workflow, the recipe only.

update the recipe in the workflow and re-fit

Code
def_wflow2 = def_wflow |> update_recipe(recipe(default~student,data=default))
def_fit2 = def_wflow2 |> fit(data=default) 

look at the results

Code
def_fit2 |>  tidy() |> kbl() |> kable_styling(font_size=12)
term estimate std.error statistic p.value
(Intercept) -3.5041278 0.0707130 -49.554219 0.0000000
studentYes 0.4048871 0.1150188 3.520181 0.0004313

Fitting the logistic regression: a qualitative predictor

It appears that a customer who is a student is more likely to default ( \(\hat{\beta}_{1} = 0.4\) ).

  • For non students \[ \begin{split} log\left(\frac{p(X)}{1-p(X)}\right) = -3.504 &\rightarrow& \ \frac{p(X)}{1-p(X)} = e^{-3.504}\rightarrow\\ p(X) = e^{-3.504}-e^{-3.504}p(X)&\rightarrow& \ p(X) +e^{-3.504}p(X) = e^{-3.504}\rightarrow\\ p(X)(1 +e^{-3.504}) = e^{-3.504} &\rightarrow & \ p(X) = \frac{e^{-3.504}}{1 +e^{-3.504}}=\color{blue}{0.029} \end{split} \]
  • For students \[ p(X) = \frac{e^{-3.504+0.4}}{1 +e^{-3.504+0.4}}=\color{red}{0.042} \]
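
As a quick check, base R's \(\texttt{plogis}\) (the logistic function) reproduces the two posteriors from the rounded coefficients above; small rounding differences with respect to the slide values are expected.

Code
b0 = -3.504; b1 = 0.405   # rounded estimates from the table above
plogis(b0)                # non-students: ~ 0.029
plogis(b0 + b1)           # students:     ~ 0.043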

Multiple logistic regression

In case of multiple predictors

\[log\left(\frac{p(X)}{1-p(X)} \right)=\beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{2}+\ldots+\beta_{p}X_{p}\]

and the following relation holds

\[p(X)=\frac{e^{{\beta}_{0}+{\beta}_{1}X_{1}+{\beta}_{2}X_{2}+\ldots+{\beta}_{p}X_{p}}}{1+e^{{\beta}_{0}+{\beta}_{1}X_{1}+{\beta}_{2}X_{2}+\ldots+{\beta}_{p}X_{p}}}\]

Multiple logistic regression

Let’s consider two predictors, \(\texttt{balance}\) and \(\texttt{student}\); again, we just update the recipe within the workflow

update the recipe in the workflow and re-fit

Code
def_wflow3 = def_wflow |> update_recipe(recipe(default~balance+student,data=default))
def_fit3 = def_wflow3 |> fit(data=default) 

look at the results

Code
def_fit3 |>  tidy() |> kbl() |> kable_styling(font_size=12)
term estimate std.error statistic p.value
(Intercept) -10.7494959 0.3691914 -29.116326 0.0e+00
balance 0.0057381 0.0002318 24.749526 0.0e+00
studentYes -0.7148776 0.1475190 -4.846003 1.3e-06

 

anything weird?

Multiple logistic regression

Code
p_stu = def_fit2 |> augment(new_data=default) |> 
  ggplot(aes(x=balance, y= .pred_Yes,color=student))+geom_line(size=3)+
  theme_minimal() + ylim(c(0,.1))+xlab("balance")+ylab("P(default=yes|student)")

p_stu 

Multiple logistic regression

Code
p_bal_stu = def_fit3 |> augment(new_data=default) |> ggplot(aes(x=balance, y= .pred_Yes,color=student)) +
  geom_line(size=3) + theme_minimal() + xlab("balance") + ylab("P(default=yes|balance,student)")

p_bal_stu

Multiple logistic regression

more than two classes?

 

multinomial logistic regression

multinomial logistic regression

Suppose we have \(K\) classes and let the \(K^{th}\) class be the baseline.

For the other \(K-1\) classes, the logistic model is

\[P(Y=k|X=x)=\frac{e^{{\beta}_{k0}+{\beta}_{k1}X_{1}+{\beta}_{k2}X_{2}+\ldots+{\beta}_{kp}X_{p}}}{1+\sum_{l=1}^{K-1}{e^{{\beta}_{l0}+{\beta}_{l1}X_{1}+{\beta}_{l2}X_{2}+\ldots+{\beta}_{lp}X_{p}}}}\]

For the baseline, \(k=K\), the previous becomes

\[P(Y=K|X=x)=\frac{1}{1+\sum_{l=1}^{K-1}{e^{{\beta}_{l0}+{\beta}_{l1}X_{1}+{\beta}_{l2}X_{2}+\ldots+{\beta}_{lp}X_{p}}}}\]

Note: the models for the general class \(k\) and for the baseline \(K\) have the same denominator.

multinomial logistic regression

The ratio between the posterior of the general class \(k\) and the baseline \(K\) is

\[\frac{P(Y=k|X=x)}{P(Y=K|X=x)} = e^{{\beta}_{k0}+{\beta}_{k1}X_{1}+{\beta}_{k2}X_{2}+\ldots+{\beta}_{kp}X_{p}}\]

that can be re-written as

\[log\left(\frac{P(Y=k|X=x)}{P(Y=K|X=x)}\right) ={\beta}_{k0}+{\beta}_{k1}X_{1}+{\beta}_{k2}X_{2}+\ldots+{\beta}_{kp}X_{p}\]

multinomial logistic regression: softmax coding

When no baseline is considered, the posterior for the general class \(k\) is

\[P(Y=k|X=x)=\frac{e^{{\beta}_{k0}+{\beta}_{k1}X_{1}+{\beta}_{k2}X_{2}+\ldots+{\beta}_{kp}X_{p}}}{\sum_{l=1}^{K}{e^{{\beta}_{l0}+{\beta}_{l1}X_{1}+{\beta}_{l2}X_{2}+\ldots+{\beta}_{lp}X_{p}}}}\]

and the log ratio between the posteriors of any two classes \(k\) and \(k'\) is given by

\[log\left(\frac{P(Y=k|X=x)}{P(Y=k'|X=x)}\right) =({\beta}_{k0}-{\beta}_{k'0})+({\beta}_{k1}-{\beta}_{k'1})X_{1}+ ({\beta}_{k2}-{\beta}_{k'2})X_{2}+\ldots+({\beta}_{kp}-{\beta}_{k'p})X_{p}\]
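
A multinomial logistic regression fits in the same tidymodels workflow used above; here is a minimal sketch on the built-in \(\texttt{iris}\) data (a convenient three-class example, not part of the Default case study), with the \(\texttt{nnet}\) engine.

Code
iris_rec  = recipe(Species ~ ., data = iris)
iris_spec = multinom_reg(mode = "classification", engine = "nnet")
iris_fit  = workflow() |> add_recipe(iris_rec) |> add_model(iris_spec) |> fit(data = iris)
iris_fit |> augment(new_data = iris) |> count(Species, .pred_class)  # confusion counts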

generative models for classification

estimating the posterior: an indirect approach

in a classification problem the goal is to estimate \(\color{blue}{P(Y=k|X)}\), that is, the posterior probability.

the logistic regression seeks to estimate the posterior probability \(\color{red}{\text{directly}}\)

another approach is to model the distributions of the predictors within each class, and then use the Bayes theorem to obtain the posterior: this is what the generative models for classification do.

  • assuming the predictor \(X\) to be continuous, then its probability density function within the class \(k\) is \(f_{k}(X)\) .
  • also, let \(P(Y=k)=\pi_{k}\) (or, the prior) be the proportion of observations that belong to the \(k^{th}\) class.
  • the posterior probability is linked to the above

\[P(Y=k|X)=\frac{\pi_{k}f_{k}(X)}{\sum_{l=1}^{K}{\pi_{l}f_{l}(X)}}\]
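
A numeric sketch of this formula, with made-up priors and normal densities evaluated via base R's \(\texttt{dnorm}\):

Code
pi_k = c(.7, .3)              # made-up priors
mu_k = c(0, 2); sigma = 1     # made-up class means, common sd
x0   = 1.2                    # a new observation
f_k  = dnorm(x0, mean = mu_k, sd = sigma)
pi_k * f_k / sum(pi_k * f_k)  # posterior P(Y = k | X = x0), k = 1, 2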

estimating the posterior: an indirect approach

  • To obtain \(\hat{P}(Y=k|X)\) , one needs to estimate

  • \(\hat{\pi}_{k}\) : this is easily obtained by computing the proportion of training observations within the class \(k\).

  • \(\hat{f}_{k}(X)\) : the probability density is not easily obtained and some assumptions are needed.

  • Depending on the assumptions made on \(f_{k}(X)\) we get
    • linear discriminant analysis (LDA)  
    • quadratic discriminant analysis (QDA)  
    • naive Bayes

linear discriminant analysis (LDA)

linear discriminant analysis LDA: one predictor

In the linear discriminant analysis, the assumption on \(f_{k}(X)\) is that

\[f_{k}(X)\sim N(\mu_{k},\sigma^{2})\]

therefore

\[f_{k}(X)=\frac{1}{\sqrt{2\pi}\sigma}exp\left(-\frac{1}{2\sigma^{2}}\left( x-\mu_{k} \right)^{2}\right)\]

 

  • in each class, the predictor is normally distributed

  • the scale parameter \(\sigma^{2}\) is the same for each class

linear discriminant analysis LDA: one predictor

Plugging in \(f_{k}(X)\) in the Bayes formula

\[p_{k}(x)=\frac{\pi_{k}\times f_{k}(x)}{\sum_{l=1}^{K}{\pi_{l}\times f_{l}(x)}} = \frac{\pi_{k}\times \overbrace{\frac{1}{\sqrt{2\pi}\sigma}exp\left(-\frac{1}{2\sigma^{2}}\left( x-\mu_{k} \right)^{2}\right)}^{\color{red}{{f_{k}(x)}}}} {\sum_{l=1}^{K}{\pi_{l}\times \underbrace{\frac{1}{\sqrt{2\pi}\sigma}exp\left(-\frac{1}{2\sigma^{2}}\left( x-\mu_{l} \right)^{2}\right)}_{\color{red}{{f_{l}(x)}}}}}\]

one needs to estimate the following parameters

  • \(\mu_{1},\mu_{2},\ldots, \mu_{K}\)
  • \(\pi_{1},\pi_{2},\ldots, \pi_{K}\)
  • \(\sigma\)

to get, for each observation \(x\), \(\hat{p}_{1}(x)\) , \(\hat{p}_{2}(x)\) , \(\ldots\) , \(\hat{p}_{K}(x)\) : then the observation is assigned to the class for which \(\hat{p}_{k}(x)\) is max.

\(\color{red}{\text{Note}}:\) not all the quantities involved in the Bayes formula play a role in the classification of an object: in fact, some of them are constant across the classes.

linear discriminant function \(\delta\)

To get rid of the across-classes constant quantities

\[\log\left[p_{k}(x)\right] = \log{\left[ \frac{\pi_{k}\times \frac{1}{\sqrt{2\pi}\sigma}exp\left(-\frac{1}{2\sigma^{2}}\left( x-\mu_{k} \right)^{2}\right)} {\sum_{l=1}^{K}{\pi_{l}\times \frac{1}{\sqrt{2\pi}\sigma}exp\left(-\frac{1}{2\sigma^{2}}\left( x-\mu_{l} \right)^{2}\right)}}\right]}\]

since \(\color{red}{\log(a/b)=\log(a)-\log(b)}\) it follows that

\[\log\left[p_{k}(x)\right]=\log{\left[ \pi_{k}\times \frac{1}{\sqrt{2\pi}\sigma}exp\left(-\frac{1}{2\sigma^{2}}\left( x-\mu_{k} \right)^{2}\right)\right]}- \underbrace{\log{\left[\sum_{l=1}^{K}{\pi_{l}\times \frac{1}{\sqrt{2\pi}\sigma}exp\left(-\frac{1}{2\sigma^{2}}\left( x-\mu_{l} \right)^{2}\right)}\right]}}_{\color{red}{\text{constant}}}\]

linear discriminant function \(\delta\)

\[\begin{split} &\underbrace{\log{\left[ \pi_{k}\times \frac{1}{\sqrt{2\pi}\sigma}exp\left(-\frac{1}{2\sigma^{2}}\left( x-\mu_{k} \right)^{2}\right)\right]}}_{\color{red}{\log(a\times b)=\log(a)+\log(b)}}=\log(\pi_{k})+\underbrace{\log\left( \frac{1}{\sqrt{2\pi}\sigma}\right)}_{\color{red} {\text{constant}}}+\underbrace{\log\left[exp\left(-\frac{1}{2\sigma^{2}}\left( x-\mu_{k} \right)^{2}\right)\right]}_{\color{red}{\log(\exp(a))=a}} =\\ &= \log(\pi_{k}) -\frac{1}{2\sigma^{2}}\left( x-\mu_{k} \right)^{2}=\log(\pi_{k}) - \frac{1}{2\sigma^{2}}\left(x^{2}+\mu_{k}^{2}-2x\mu_{k} \right)=\\ &= \log(\pi_{k}) - \underbrace{\frac{x^{2}}{2\sigma^{2}}}_{\color{red}{\text{const}}}- \frac{\mu_{k}^{2}}{2\sigma^{2}}+ \frac{2x\mu_{k}}{2\sigma^{2}}= \log(\pi_{k}) - \frac{\mu_{k}^{2}}{2\sigma^{2}}+ x\frac{\mu_{k}}{\sigma^{2}}=\color{red}{\delta_{k}(x)}\\ \end{split}\]
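
A minimal sketch of the resulting discriminant with one predictor and made-up parameters: the observation is assigned to the class with the largest \(\delta_{k}(x)\).

Code
delta = function(x, mu, sigma2, pi) log(pi) - mu^2 / (2 * sigma2) + x * mu / sigma2
delta(x = -2, mu = c(-4, 4), sigma2 = 4, pi = c(.5, .5))  # delta_1 > delta_2: class 1 wins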

two classes, same priors.

Consider a single predictor \(X\), normally distributed within the two classes, with parameters \(\mu_{1}\) , \(\mu_{2}\) and \(\sigma^{2}\) .

Also \(\pi_{1}=\pi_{2}\) . Now, the observation \(x\) is assigned to class 1 if

\[\begin{split} \delta_{1}(X) &>&\delta_{2}(X)\\ \color{blue}{\text{that is}} \\ log({\pi_{1}})-\frac{\mu_{1}^{2}}{2\sigma^{2}} + \frac{\mu_{1}}{\sigma^{2}}x &>& log({\pi_{2}})-\frac{\mu_{2}^{2}}{2\sigma^{2}} + \frac{\mu_{2}}{\sigma^{2}}x \\ \color{blue}{\text{ since }\pi_{1}=\pi_{2}}\\ -\frac{\mu_{1}^{2}}{2\sigma^{2}} + \frac{\mu_{1}}{\sigma^{2}}x > -\frac{\mu_{2}^{2}}{2\sigma^{2}} + \frac{\mu_{2}}{\sigma^{2}}x & \ \rightarrow \ & -\frac{\mu_{1}^{2}}{2} + \mu_{1}x > -\frac{\mu_{2}^{2}}{2} + \mu_{2}x\\ (\mu_{1} - \mu_{2})x > \frac{\mu_{1}^{2}-\mu_{2}^{2}}{2} & \ \rightarrow \ & x > \frac{(\mu_{1}+\mu_{2})(\mu_{1}-\mu_{2})}{2(\mu_{1} - \mu_{2})}\\ x &>& \frac{(\mu_{1}+\mu_{2})}{2}\\ \end{split}\]

the Bayes decision boundary, in which \(\delta_{1}=\delta_{2}\), is at \(\color{red}{x=\frac{(\mu_{1}+\mu_{2})}{2}}\)

two classes, same priors.

Code
set.seed(1234)
p_1 = ggplot()+xlim(-10,10)+theme_minimal() +
  stat_function(fun=dnorm,args=list(mean=4,sd=2),geom="area",fill="dodgerblue",alpha=.25)+
  stat_function(fun=dnorm,args=list(mean=-4,sd=2),geom="area",fill="indianred",alpha=.25)+
  geom_vline(xintercept=0,size=2,alpha=.5)+geom_vline(xintercept=-4,color="grey",size=3,alpha=.5)+
  geom_vline(xintercept=4,color="grey",size=3,alpha=.5) + geom_point(aes(x=-2,y=0),inherit.aes = FALSE,size=10,alpha=.5,color="darkgreen")+
  geom_point(aes(x=1,y=0),inherit.aes = FALSE,size=10,alpha=.5,color="magenta") + xlab(0)
p_1

The \(\color{darkgreen}{\text{green point}}\) goes to class 1, the \(\color{magenta}{\text{pink point}}\) goes to class 2

two classes, same priors.

Consider a training set with 100 observations from the two classes (50 each): one needs to estimate \(\mu_1\) and \(\mu_2\) to have the estimated boundary at \(\frac{\hat{\mu}_{1}+\hat{\mu}_{2}}{2}\)

Code
class_12=tibble(class_1=rnorm(50,mean = -4,sd=2),class_2=rnorm(50,mean = 4,sd=2)) |> 
  pivot_longer(names_to="classes",values_to="values",cols = 1:2)
mu_12 = class_12 |> group_by(classes) |> summarise(means=mean(values))
mu_12_mean =  mean(mu_12$means)
p_2=class_12 |> ggplot(aes(x=values,fill=classes)) + theme_minimal() + 
  geom_histogram(aes(y=after_stat(density)),alpha=.5,color="grey") + xlim(-10,10) +
  geom_vline(xintercept=mu_12 |> pull(means),color="grey",size=3,alpha=.75)+
  geom_vline(xintercept = mu_12_mean, size=2, alpha=.75) + theme(legend.position = "none")
p_2

two classes, same priors.

The Bayes boundary is at \(\color{red}{0}\); the estimated boundary is slightly off at -0.31

Code
p_1 / p_2

Fitting the LDA

pre-process: specify the recipe

Code
default=read_csv("./data/Default.csv") |> mutate(default=as.factor(default))

set.seed(1234)
def_split = initial_split(default,prop=3/4,strata=default)
default_test = testing(def_split)
default = training(def_split)
def_rec = recipe(default~balance, data=default)

specify the model

Code
def_lda_spec = discrim_linear(mode="classification", engine="MASS")

put them together in the workflow

Code
def_wflow_lda = workflow() |> add_recipe(def_rec) |> add_model(def_lda_spec)

fit the model

Code
def_fit_lda = def_wflow_lda |> fit(data=default) 

Fitting the LDA

Look at the results (note: neither \(\texttt{tidy}\) nor \(\texttt{glance}\) methods are available for this model specification)

Code
def_fit_lda |>  extract_fit_engine()
Call:
lda(..y ~ ., data = data)

Prior probabilities of groups:
    No    Yes 
0.9667 0.0333 

Group means:
      balance
No   803.9438
Yes 1747.8217

Coefficients of linear discriminants:
                LD1
balance 0.002206916

The function \(\texttt{lda}\) from the \(\texttt{MASS}\) package is used. It implements Fisher's discriminant analysis as described in Section 12.1 of Modern Applied Statistics with S, by Venables and Ripley.

 

Here LDA is presented as in ISLR; Venables and Ripley refer to the ISLR approach as discrimination via probability models, and briefly describe it in the subsection of 12.1 titled Discrimination for normal populations.

LDA with more than one predictor

Two examples of bivariate normal distributions:

  • two independent components \(X_{1}\) and \(X_{2}\), that is \(cor(X_{1},X_{2})=0\), and with the same variance \(var(X_{1})=var(X_{2})\)

  • two non-independent components \(X_{1}\) and \(X_{2}\), that is \(cor(X_{1},X_{2})\neq 0\), and with \(var(X_{1})\neq var(X_{2})\)

LDA with more than one predictor

Let \(X\) be a \(p\)-variate normal distribution, that is \(X\sim N(\mu,\Sigma)\) .

  • \(\mu\) is the p-dimensional vector of means \(\mu=\left[ \mu_{1},\mu_{2},\ldots,\mu_{p}\right]\).
  • \(\bf \Sigma\) is the covariance matrix

\[\bf{\Sigma}=\begin{bmatrix} \color{blue}{\sigma^{2}_{1}}&\color{darkgreen}{\sigma_{12}}&\ldots&\color{darkgreen}{\sigma_{1p}}\\ \color{darkgreen}{\sigma_{21}}&\color{blue}{\sigma^{2}_{2}}&\ldots&\color{darkgreen}{\sigma_{2p}} \\ \ldots&\ldots&\ldots&\ldots \\ \color{darkgreen}{\sigma_{p1}}&\color{darkgreen}{\sigma_{p2}}&\ldots& \color{blue}{\sigma^{2}_{p}} \\ \end{bmatrix}\]

  • diagonal terms are variances of the \(p\) components

  • off-diagonal terms are pairwise covariances between the \(p\) components

\[f(x)= \frac{1}{(2\pi)^{p/2}|\Sigma |^{1/2}}\exp{\left(-\frac{1}{2} \left( x-\mu\right)^{\sf T} \Sigma^{-1}\left( x-\mu\right)\right)}\]

where \(|\Sigma|\) indicates the determinant of \(\Sigma\).

LDA with more than one predictor

The assumption is that, within each class, \(X\sim N({\bf \mu}_{k}, {\bf \Sigma})\), just like in the univariate case

And the linear discriminant function is

\[\delta_{k}(X)=\log{\pi}_{k} - \frac{1}{2}{\mu}_{k}^{\sf T}{\Sigma}^{-1}\mu_{k} + x^{\sf T}\Sigma^{-1}\mu_{k}\]
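
A small sketch evaluating the multivariate \(\delta_{k}(x)\) with \(p=2\) predictors (all values made up for illustration):

Code
Sigma = matrix(c(1, .3, .3, 1), nrow = 2)  # common covariance matrix
mu_k  = c(0, 1); pi_k = .5                 # class-k mean vector and prior
x     = c(.5, .8)                          # a new observation
Sinv  = solve(Sigma)
log(pi_k) - .5 * t(mu_k) %*% Sinv %*% mu_k + t(x) %*% Sinv %*% mu_k  # delta_k(x)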

Class-specific error

  • the Bayes-classifier-like classification rule ensures the maximum overall accuracy
  • in several classification problems, not all classification errors are alike

    • e.g. in the default problem, one wants to avoid classifying as reliable customers who will actually default
  • to control for the type I error, it is possible to shift the threshold away from 0.5 (the threshold used by the Bayes classifier)

Code
lda_pred = def_fit_lda |> augment(new_data = default_test) |>
  dplyr::select(default, .pred_class, .pred_Yes) |> 
  mutate(.pred_class_0_05 = as.factor(ifelse(.pred_Yes>.05,"Yes","No")),
         .pred_class_0_1 = as.factor(ifelse(.pred_Yes>.1,"Yes","No")),
         .pred_class_0_2 = as.factor(ifelse(.pred_Yes>.2,"Yes","No")),
         .pred_class_0_3 = as.factor(ifelse(.pred_Yes>.3,"Yes","No")),
         .pred_class_0_4 = as.factor(ifelse(.pred_Yes>.4,"Yes","No")),
         .pred_class_0_5 = as.factor(ifelse(.pred_Yes>.5,"Yes","No"))
         )

Class-specific error

Code
lda_pred |> 
  pivot_longer(names_to = "threshold",values_to="prediction",cols = 4:9) |> 
  dplyr::select(-.pred_class,-.pred_Yes) |> group_by(threshold) |> 
  summarise(
    accuracy=round(mean(default==prediction),2),
    false_positive_rate = sum((default=="Yes")&(default!=prediction))/sum((default=="Yes"))
    ) |> arrange(desc(accuracy),desc(false_positive_rate)) |> kbl() |> kable_styling(font_size=10) 
threshold accuracy false_positive_rate
.pred_class_0_5 0.97 0.8160920
.pred_class_0_4 0.97 0.7126437
.pred_class_0_3 0.97 0.6206897
.pred_class_0_2 0.96 0.4942529
.pred_class_0_1 0.94 0.2873563
.pred_class_0_05 0.89 0.1609195

 

Since the false positive rate and the overall accuracy move together, lowering the threshold to reduce the false positive rate also causes the overall accuracy of the classifier to drop

Roc curve

Code
  def_fit_lda |> augment(new_data=default_test)  |> 
  mutate(default=factor(default,levels=c("Yes","No"))) |>
  roc_curve(truth = default, .pred_Yes)|>
  ggplot(aes(x=1-specificity,y= sensitivity))+ggtitle("lda roc curve") + 
  geom_path(color="indianred")+geom_abline(lty=3)+coord_equal()+theme_minimal()

quadratic discriminant analysis (QDA)

quadratic discriminant analysis (QDA)

In QDA the assumption of a covariance matrix constant across classes is removed: each class has its own covariance matrix \({\bf \Sigma}_{k}\)

\[\begin{split} \color{red}{\delta_{k}(X)}&= log\left(\frac{1}{(2\pi)^{p/2}|\Sigma_{k}|^{1/2}}\right) -\frac{1}{2} \left( x -\mu_{k}\right)^{\sf T}{\bf \Sigma}_{k}^{-1}\left( x -\mu_{k}\right)+\log(\pi_{k})=\\ &=\log\left(\frac{1}{(2\pi)^{p/2}|\Sigma_{k}|^{1/2}}\right)-\frac{1}{2} \left[\left(x^{\sf T}{\bf \Sigma}_{k}^{-1}-\mu_{k}^{\sf T}{\bf \Sigma}_{k}^{-1}\right)\left( x -\mu_{k}\right)\right]+\log(\pi_{k})=\\ &=\log\left(\frac{1}{(2\pi)^{p/2}|\Sigma_{k}|^{1/2}}\right)-\frac{1}{2} \left[ x^{\sf T}{\bf \Sigma}_{k}^{-1}x-\underbrace{\mu_{k}^{\sf T}{\bf \Sigma}_{k}^{-1}x}_{\text{scalar}}-\underbrace{x^{\sf T}{\bf \Sigma}_{k}^{-1}\mu_{k}}_{\text{scalar}}+\mu_{k}^{\sf T}{\bf \Sigma}_{k}^{-1}\mu_{k}\right]+\\ &+\log(\pi_{k})=\\ &=\log\left(\frac{1}{(2\pi)^{p/2}|\Sigma_{k}|^{1/2}}\right)-\frac{1}{2} x^{\sf T}{\bf \Sigma}_{k}^{-1}x +\frac{1}{{2}}{2}x^{\sf T}{\bf \Sigma}_{k}^{-1}\mu_{k}-\frac{1}{2}\mu_{k}^{\sf T}{\bf \Sigma}_{k}^{-1}\mu_{k}+\log(\pi_{k})=\\ &=\log\left(\frac{1}{(2\pi)^{p/2}|\Sigma_{k}|^{1/2}}\right)\color{red}{-\frac{1}{2} x^{\sf T}{\bf \Sigma}_{k}^{-1}x} +x^{\sf T}{\bf \Sigma}_{k}^{-1}\mu_{k}-\frac{1}{2}\mu_{k}^{\sf T}{\bf \Sigma}_{k}^{-1}\mu_{k}+\log(\pi_{k})\\ \end{split}\]
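
A companion sketch of the quadratic discriminant, with class-specific (made-up) covariance matrices; the \((2\pi)^{p/2}\) term, constant across classes, is dropped.

Code
qdelta = function(x, mu, Sigma, pi) {
  Sinv = solve(Sigma)
  -.5 * log(det(Sigma)) - .5 * t(x - mu) %*% Sinv %*% (x - mu) + log(pi)
}
Sig1 = diag(2); Sig2 = matrix(c(2, .5, .5, 1), nrow = 2)       # class-specific covariances
x = c(.5, .8)
c(qdelta(x, c(0, 1), Sig1, .5), qdelta(x, c(1, 0), Sig2, .5))  # the larger delta wins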

Fitting the QDA

pre-process: specify the recipe

Code
def_rec = recipe(default~balance, data=default)

specify the model

Code
def_qda_spec = discrim_quad(mode="classification", engine="MASS")

put them together in the workflow

Code
def_wflow_qda = workflow() |> add_recipe(def_rec) |> add_model(def_qda_spec)

fit the model

Code
def_fit_qda = def_wflow_qda |> fit(data=default) 

Fitting the QDA

Look at the results (note: neither \(\texttt{tidy}\) nor \(\texttt{glance}\) methods are available for this model specification)

Code
def_fit_qda |> augment(new_data = default_test) |>
  dplyr::select(default, .pred_class, .pred_Yes) |> 
  mutate(default=factor(default,levels=c("Yes","No"))) |>
  roc_curve(truth = default, .pred_Yes)|>
  ggplot(aes(x=1-specificity,y=sensitivity))+ggtitle("qda roc curve") + 
  geom_path(color="dodgerblue")+geom_abline(lty=3)+coord_equal()+theme_minimal()

naive Bayes classifier

  • like LDA and QDA, the goal is to estimate \(f_{k}(X)\)

  • unlike LDA and QDA, in the naive Bayes classifier case, \(f_{k}(X)\) is not assumed to be a multivariate normal density

  • the assumption is that the predictors are independent within each class \(k\)

  • the joint distribution of the \(p\) predictors in class \(k\) is

    \[f_{k}(X)=f_{k1}(X_{1})\times f_{k2}(X_{2})\times \ldots \times f_{kp}(X_{p})\]

  • while the assumption is naive, it simplifies the fit, since the focus is on the marginal distribution of each predictor

  • this is of help in case of few training observations

naive Bayes classifier

  • Since the goal is to fit \(f_{kj}(X_{j})\), \(j=1,\ldots,p\), one can fit an ad hoc function for each predictor

  • for continuous predictors, one can choose \(f_{kj}(X_{j})\sim N(\mu_{kj},\sigma^{2}_{kj})\) or a kernel density estimator

  • for categorical predictors, the relative frequency distribution can be used instead (see the sketch below).
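
A minimal sketch of these two marginal estimates within class \(\texttt{Yes}\) of the Default data (bandwidth and all other settings left at base R defaults):

Code
d_yes        = default |> filter(default == "Yes")
dens_balance = density(d_yes$balance)            # kernel estimate for the continuous predictor
freq_student = prop.table(table(d_yes$student))  # relative frequencies for the categorical one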

naive Bayes classifier

  • suppose we have two classes, and three predictors \(X_{1}\), \(X_{2}\) (quantitative) and \(X_{3}\) (qualitative)

  • assume that \(\hat{\pi}_{1}=\hat{\pi}_{2}\), and that the estimates \(\hat{f}_{kj}(X_{j})\), with \(k=1,2\) and \(j=1,2,3\), are available

  • consider \({\bf x}^{\star {\sf T}}=\left[.4,1.5,1\right]\), then

  • \(\color{red}{\hat{f}_{11}(.4)=.368,\hat{f}_{12}(1.5)=.484,\hat{f}_{13}(1)=.226}\) for class 1

  • \(\color{blue}{\hat{f}_{21}(.4)=.030,\hat{f}_{22}(1.5)=.130,\hat{f}_{23}(1)=.616}\) for class 2

naive Bayes classifier

The posterior for each class \(P(Y=1|X_{1}=.4,X_{2}=1.5,X_{3}=1)\) and \(P(Y=2|X_{1}=.4,X_{2}=1.5,X_{3}=1)\), knowing that \(\hat{\pi}_{1}=\hat{\pi}_{2}=.5\), is given by \[\begin{split} P(Y=1|X_{1}=.4,X_{2}=1.5,X_{3}=1) &= \frac{\color{red}{\hat{\pi}_{1}\times\hat{f}_{11}(.4)\times \hat{f}_{12}(1.5)\times\hat{f}_{13}(1)}}{ \color{red}{\hat{\pi}_{1}\times\hat{f}_{11}(.4)\times \hat{f}_{12}(1.5)\times\hat{f}_{13}(1)}+ \color{blue}{\hat{\pi}_{2}\times\hat{f}_{21}(.4)\times\hat{f}_{22}(1.5)\times\hat{f}_{23}(1)} }=\\ &=\frac{\color{red}{.5 \times .368 \times .484 \times .226}}{ \color{red}{.5 \times .368 \times .484 \times .226}+ \color{blue}{ .5 \times .03 \times .130 \times .616} }=\\ &= \frac{\color{red}{0.0201}}{\color{red}{0.0201}+\color{blue}{0.0012}}=0.944 \end{split}\]

Similarly, and ignoring that \[P(Y=2|X_{1}=.4,X_{2}=1.5,X_{3}=1)=1-P(Y=1|X_{1}=.4,X_{2}=1.5,X_{3}=1)\] the class 2 posterior is

 

\[P(Y=2|X_{1}=.4,X_{2}=1.5,X_{3}=1)=\frac{\color{blue}{0.0012}}{\color{red}{0.0201}+\color{blue} {0.0012}}=0.06\]
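
The same computation, reproduced as a short sketch from the density values given above:

Code
pi_k = c(.5, .5)
f_k  = rbind(c(.368, .484, .226),  # class 1 densities at x* = (.4, 1.5, 1)
             c(.030, .130, .616))  # class 2 densities at x*
num  = pi_k * apply(f_k, 1, prod)  # pi_k * prod_j f_kj(x*_j)
num / sum(num)                     # posteriors: ~ 0.944 and ~ 0.056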

Fitting naive Bayes

pre-process: specify the recipe

Code
def_rec = recipe(default~balance, data=default)

specify the model

Code
def_nb_spec = naive_Bayes(mode="classification", engine="naivebayes")

put them together in the workflow

Code
def_wflow_nb = workflow() |> add_recipe(def_rec) |> add_model(def_nb_spec)

fit the model

Code
def_fit_nb = def_wflow_nb |> fit(data=default) 

Fitting naive Bayes

Code
def_fit_nb |> augment(new_data = default_test) |>
  dplyr::select(default, .pred_class, .pred_Yes) |> 
  mutate(default=factor(default,levels=c("Yes","No"))) |>
  roc_curve(truth = default, .pred_Yes)|>
  ggplot(aes(x=1-specificity,y=sensitivity))+ggtitle("naive Bayes roc curve") + 
  geom_path(color="darkgreen")+geom_abline(lty=3)+coord_equal()+theme_minimal()

Fitting naive Bayes

roc curves

auc

method .estimate
logistic regression 0.0517932
LDA 0.0517932
QDA 0.0517932
naive Bayes 0.0517789

logistic regression vs LDA vs QDA vs naive Bayes

an analytic comparison

logistic regression vs LDA vs QDA vs naive Bayes

Consider the \(K\)-class case and let class \(K\) be the baseline; the predicted class \(k\) will be the one maximizing

\[\begin{split} log \left(\frac{P(Y=k|X=x)}{P(Y=K|X=x)} \right) &= log \left(\frac{\pi_{k}f_{k}(x)}{\pi_{K}f_{K}(x)}\right)=log \left(\frac{\pi_{k}}{\pi_{K}}\right)+ log \left(\frac{f_{k}(x)}{f_{K}(x)}\right)=\\ &=log \left(\frac{\pi_{k}}{\pi_{K}}\right)+ log \left(f_{k}(x)\right)-log \left(f_{K}(x)\right) \end{split}\]

for any of the considered classifiers.

the LDA and logistic regression

the assumption is that within the \(k^{th}\) class, \(X\sim N({\bf \mu}_{k},{\bf \Sigma})\)

\[\begin{split} \color{red}{log \left(f_{k}(x)\right)} &= log\left[\frac{1}{(2\pi)^{p/2}|{\bf \Sigma}|^{1/2} } exp\left(-\frac{1}{2} ({\bf x}-{\bf \mu}_{k})^{\sf T}{\bf \Sigma}^{-1}({\bf x}-{\bf \mu}_{k})\right)\right]=\\ &=\color{red}{log\left(\frac{1}{(2\pi)^{p/2}|{\bf \Sigma}|^{1/2} }\right) -\frac{1}{2} ({\bf x}-{\bf \mu}_{k})^{\sf T}{\bf \Sigma}^{-1}({\bf x}-{\bf \mu}_{k})} \end{split}\]

 

And

\[\begin{split} \color{blue}{{log \left(f_{K}(x)\right)}} = \color{blue}{{log\left(\frac{1}{(2\pi)^{p/2}|{\bf \Sigma}|^{1/2} }\right) -\frac{1}{2} ({\bf x}-{\bf \mu}_{K})^{\sf T}{\bf \Sigma}^{-1}({\bf x}-{\bf \mu}_{K})}} \end{split}\]

the LDA and logistic regression

plugging \(\color{red}{log \left(f_{k}(x)\right)}\) and \(\color{blue}{{log \left(f_{K}(x)\right)}}\) in

\[\begin{split} log \left(\frac{P(Y=k|X=x)}{P(Y=K|X=x)} \right) &=log \left(\frac{\pi_{k}}{\pi_{K}}\right)+ \color{red}{log \left(f_{k}(x)\right)}-\color{blue}{log \left(f_{K}(x)\right)}=\\ &=log \left(\frac{\pi_{k}}{\pi_{K}}\right)+\color{red}{log\left(\frac{1}{(2\pi)^{p/2}|{\bf \Sigma}|^{1/2}}\right) -\frac{1}{2} ({\bf x}-{\bf \mu}_{k})^{\sf T}{\bf \Sigma}^{-1}({\bf x}-{\bf \mu}_{k})}+\\ &-\color{blue}{{log\left(\frac{1}{(2\pi)^{p/2}|{\bf \Sigma}|^{1/2}}\right) +\frac{1}{2} ({\bf x}-{\bf \mu}_{K})^{\sf T}{\bf \Sigma}^{-1}({\bf x}-{\bf \mu}_{K})}} =\\ &=log \left(\frac{\pi_{k}}{\pi_{K}}\right)-\color{red}{\frac{1}{2}({\bf x}-{\bf \mu}_{k})^{\sf T}{\bf \Sigma}^{-1}({\bf x}-{\bf \mu}_{k})} +\color{blue}{{\frac{1}{2}({\bf x}-{\bf \mu}_{K})^{\sf T}{\bf \Sigma}^{-1}({\bf x}-{\bf \mu}_{K})}}=\\ &=log \left(\frac{\pi_{k}}{\pi_{K}}\right)- \color{red}{ \frac{1}{2}\left[{\bf x}^{\sf T}{\bf \Sigma}^{-1}{\bf x}- \underbrace{{\bf x}^{\sf T}{\bf \Sigma}^{-1}{\bf \mu}_{k}-{\bf \mu}_{k}^{\sf T}{\bf \Sigma}^{-1}{\bf x}}_{-2{\bf x}^{\sf T}{\bf \Sigma}^{-1}{\bf \mu}_{k}}+{\bf \mu}_{k}^{\sf T}{\bf \Sigma}^{-1}{\bf \mu}_{k}\right] }+\\ &+\color{blue} {\frac{1}{2}\left[{\bf x}^{\sf T}{\bf \Sigma}^{-1}{\bf x}- \underbrace{{\bf x}^{\sf T}{\bf \Sigma}^{-1}{\bf \mu}_{K}-{\bf \mu}_{K}^{\sf T}{\bf \Sigma}^{-1}{\bf x}}_{-2{\bf x}^{\sf T}{\bf \Sigma}^{-1}{\bf \mu}_{K}}+{\bf \mu}_{K}^{\sf T}{\bf \Sigma}^{-1}{\bf \mu}_{K}\right] } \end{split}\]

the LDA and logistic regression

\[\begin{split} log \left(\frac{P(Y=k|X=x)}{P(Y=K|X=x)} \right) & = log \left(\frac{\pi_{k}}{\pi_{K}}\right)+ \color{red}{{\bf \mu}_{k}^{\sf T}{\bf \Sigma}^{-1}{\bf x}}-\color{blue}{{{\bf \mu}_{K}^{\sf T}{\bf \Sigma}^{-1}{\bf x}}}- \frac{1}{2}\underbrace{\color{red}{ {\bf \mu}_{k}^{\sf T}{\bf \Sigma}^{-1}{\bf \mu}_{k}}}_{a^2}+ \frac{1}{2}\underbrace{\color{blue}{{ {\bf \mu}_{K}^{\sf T}{\bf \Sigma}^{-1}{\bf \mu}_{K}}}}_{b^2} \end{split}\]

Now, since \(-a^{2}+b^{2}=(a+b)(b-a)\), the previous becomes

\[\begin{split} log \left(\frac{P(Y=k|X=x)}{P(Y=K|X=x)} \right) = log \left(\frac{\pi_{k}}{\pi_{K}}\right)+({\bf \mu}_{k}-{\bf \mu}_{K})^{\sf T}{\bf \Sigma}^{-1}{\bf x}- \frac{1}{2}({\bf \mu}_{k}+{\bf \mu}_{K})^{\sf T}{\bf \Sigma}^{-1}({\bf \mu}_{k}-{\bf \mu}_{K}) \end{split}\]

and, setting

  • \(log \left(\frac{\pi_{k}}{\pi_{K}}\right)-\frac{1}{2}({\bf \mu}_{k}+{\bf \mu}_{K})^{\sf T}{\bf \Sigma}^{-1}({\bf \mu}_{k}-{\bf \mu}_{K})=\color{forestgreen}{a_{k}}\)

  • \(({\bf \mu}_{k}-{\bf \mu}_{K})^{\sf T}{\bf \Sigma}^{-1}{\bf x}=\color{forestgreen}{\sum_{j=1}^{p}{b_{kj}x_{j}}}\)

it is clear that in LDA, just like in logistic regression, the log of the odds is a linear function of the predictors

\[\begin{split} log \left(\frac{P(Y=k|X=x)}{P(Y=K|X=x)} \right) = a_{k}+\sum_{j=1}^{p}{b_{kj}x_{j}} \end{split}\]

the QDA case

the assumption is that within the \(k^{th}\) class, \(X\sim N({\bf \mu}_{k},{\bf \Sigma}_{k})\)

\[\begin{split} \color{red}{log \left(f_{k}(x)\right)} &= log\left[\frac{1}{(2\pi)^{p/2}|{\bf \Sigma}_{k}|^{1/2} } exp\left(-\frac{1}{2} ({\bf x}-{\bf \mu}_{k})^{\sf T}{\bf \Sigma}_{k}^{-1}({\bf x}-{\bf \mu}_{k})\right)\right]=\\ &=\color{red}{log\left(\frac{1}{(2\pi)^{p/2}|{\bf \Sigma}_{k}|^{1/2} }\right) -\frac{1}{2} ({\bf x}-{\bf \mu}_{k})^{\sf T}{\bf \Sigma}_{k}^{-1}({\bf x}-{\bf \mu}_{k})} \end{split}\]

 

And

\[\begin{split} \color{blue}{{log \left(f_{K}(x)\right)}} = \color{blue}{{log\left(\frac{1}{(2\pi)^{p/2}|{\bf \Sigma}_{K}|^{1/2} }\right) -\frac{1}{2} ({\bf x}-{\bf \mu}_{K})^{\sf T}{\bf \Sigma}_{K}^{-1}({\bf x}-{\bf \mu}_{K})}} \end{split}\]

the QDA case

again, plugging in \(\color{red}{log \left(f_{k}(x)\right)}\) and \(\color{blue}{{log \left(f_{K}(x)\right)}}\)

\[\begin{split} log \left(\frac{P(Y=k|X=x)}{P(Y=K|X=x)} \right) &=log \left(\frac{\pi_{k}}{\pi_{K}}\right)+ \color{red}{log \left(f_{k}(x)\right)}-\color{blue}{log \left(f_{K}(x)\right)}=\\ &=log \left(\frac{\pi_{k}}{\pi_{K}}\right)+\color{red}{log\left(\frac{1}{(2\pi)^{p/2}|{\bf \Sigma}_{k}|^{1/2}}\right) -\frac{1}{2} ({\bf x}-{\bf \mu}_{k})^{\sf T}{\bf \Sigma}_{k}^{-1}({\bf x}-{\bf \mu}_{k})}+\\ &-\color{blue}{{log\left(\frac{1}{(2\pi)^{p/2}|{\bf \Sigma}_{K}|^{1/2}}\right) +\frac{1}{2} ({\bf x}-{\bf \mu}_{K})^{\sf T}{\bf \Sigma}_{K}^{-1}({\bf x}-{\bf \mu}_{K})}} \end{split}\]

re-writing the following quantities

\[\begin{split} \color{red}{log\left(\frac{1}{(2\pi)^{p/2}|{\bf \Sigma}_{k}|^{1/2}}\right)} &= \color{red}{log(1)-log\left((2\pi)^{p/2}\right)-log\left(|{\bf\Sigma}_{k}|^{1/2}\right)}\\ \color{blue}{{log\left(\frac{1}{(2\pi)^{p/2}|{\bf \Sigma}_{K}|^{1/2}}\right)}} &= \color{blue}{{log(1)-log\left((2\pi)^{p/2}\right)-log\left(|{\bf\Sigma}_{K}|^{1/2}\right)}} \end{split}\]

it results that \[\begin{split} \color{red}{log\left(\frac{1}{(2\pi)^{p/2}|{\bf \Sigma}_{k}|^{1/2}}\right)}-\color{blue}{{log\left(\frac{1}{(2\pi)^{p/2}|{\bf \Sigma}_{K}|^{1/2}}\right)}} &= \color{red}{-log\left((2\pi)^{p/2}\right)-log\left(|{\bf\Sigma}_{k}|^{1/2}\right)}\color{blue}{{+log\left((2\pi)^{p/2}\right)+log\left(|{\bf\Sigma}_{K}|^{1/2}\right)}}= \color{blue}{{log\left(|{\bf\Sigma}_{K}|^{1/2}\right)}}\color{red}{-log\left(|{\bf\Sigma}_{k}|^{1/2}\right)}=log\left(\frac{|{\bf\Sigma}_{K}|^{1/2}}{|{\bf\Sigma}_{k}|^{1/2}}\right) \end{split}\]

the QDA case

the logit can be re-written accordingly \[\begin{split} log \left(\frac{P(Y=k|X=x)}{P(Y=K|X=x)} \right) &=log \left(\frac{\pi_{k}}{\pi_{K}}\right)+log\left(\frac{|{\bf\Sigma}_{K}|^{1/2}}{|{\bf\Sigma}_{k}|^{1/2}}\right)\color{red}{ -\frac{1}{2} ({\bf x}-{\bf \mu}_{k})^{\sf T}{\bf \Sigma}_{k}^{-1}({\bf x}-{\bf \mu}_{k})} \color{blue}{{+\frac{1}{2} ({\bf x}-{\bf \mu}_{K})^{\sf T}{\bf \Sigma}_{K}^{-1}({\bf x}-{\bf \mu}_{K})}}\\ &= log \left(\frac{\pi_{k}}{\pi_{K}}\right)+log\left(\frac{|{\bf\Sigma}_{K}|^{1/2}}{|{\bf\Sigma}_{k}|^{1/2}}\right)-\color{red}{ \frac{1}{2}\left[{\bf x}^{\sf T}{\bf \Sigma}_{k}^{-1}{\bf x}- \underbrace{{\bf x}^{\sf T}{\bf \Sigma}_{k}^{-1}{\bf \mu}_{k}-{\bf \mu}_{k}^{\sf T}{\bf \Sigma}_{k}^{-1}{\bf x}}_{-2 {\bf \mu}_{k}^{\sf T}{\bf \Sigma}_{k}^{-1}{\bf x}}+{\bf \mu}_{k}^{\sf T}{\bf \Sigma}_{k}^{-1}{\bf \mu}_{k}\right] }+ \color{blue} {\frac{1}{2}\left[{\bf x}^{\sf T}{\bf \Sigma}_{K}^{-1}{\bf x}- \underbrace{{\bf x}^{\sf T}{\bf \Sigma}_{K}^{-1}{\bf \mu}_{K}-{\bf \mu}_{K}^{\sf T}{\bf \Sigma}_{K}^{-1}{\bf x}}_{-2 {\bf \mu}_{K}^{\sf T}{\bf \Sigma}_{K}^{-1}{\bf x}}+{\bf \mu}_{K}^{\sf T}{\bf \Sigma}_{K}^{-1}{\bf \mu}_{K}\right]}=\\ &=log \left(\frac{\pi_{k}}{\pi_{K}}\right)+log\left(\frac{|{\bf\Sigma}_{K}|^{1/2}}{|{\bf\Sigma}_{k}|^{1/2}}\right)+ \color{red}{ \left(-\frac{1}{2}{\bf x}^{\sf T}{\bf \Sigma}_{k}^{-1}{\bf x} + {\bf \mu}_{k}^{\sf T}{\bf \Sigma}_{k}^{-1}{\bf x} -\frac{1}{2}{\bf \mu}_{k}^{\sf T}{\bf \Sigma}_{k}^{-1}{\bf \mu}_{k}\right) }+ \color{blue}{ \left(\frac{1}{2}{\bf x}^{\sf T}{\bf \Sigma}_{K}^{-1}{\bf x} - {\bf \mu}_{K}^{\sf T}{\bf \Sigma}_{K}^{-1}{\bf x} +\frac{1}{2}{\bf \mu}_{K}^{\sf T}{\bf \Sigma}_{K}^{-1}{\bf \mu}_{K}\right) }=\\ &=log \left(\frac{\pi_{k}}{\pi_{K}}\right)+log\left(\frac{|{\bf\Sigma}_{K}|^{1/2}}{|{\bf\Sigma}_{k}|^{1/2}}\right)+ \frac{1}{2} {\bf x}^{\sf T}\left({\bf \Sigma}_{K}^{-1} - {\bf \Sigma}_{k}^{-1} \right){\bf x} + \left({\bf \Sigma}_{k}^{-1}{\bf \mu}_{k}- {\bf \Sigma}_{K}^{-1}{\bf \mu}_{K} \right)^{\sf T}{\bf x}-\frac{1}{2}\left({\bf \mu}_{k}^{\sf T}{\bf \Sigma}_{k}^{-1}{\bf \mu}_{k}- {\bf \mu}_{K}^{\sf T}{\bf \Sigma}_{K}^{-1}{\bf \mu}_{K} \right) \end{split}\]

finally \[\begin{split} log \left(\frac{P(Y=k|X=x)}{P(Y=K|X=x)} \right) &= \overbrace{\frac{1}{2} {\bf x}^{\sf T}\left({\bf \Sigma}_{K}^{-1} - {\bf \Sigma}_{k}^{-1} \right){\bf x}}^{\text{second degree term}} + \overbrace{\left({\bf \Sigma}_{k}^{-1}{\bf \mu}_{k}- {\bf \Sigma}_{K}^{-1}{\bf \mu}_{K} \right)^{\sf T}{\bf x}}^{\text{first degree term}}-\frac{1}{2}\left({\bf \mu}_{k}^{\sf T}{\bf \Sigma}_{k}^{-1}{\bf \mu}_{k}- {\bf \mu}_{K}^{\sf T}{\bf \Sigma}_{K}^{-1}{\bf \mu}_{K} \right)+\\ & + log \left(\frac{\pi_{k}}{\pi_{K}}\right)+log\left(\frac{|{\bf\Sigma}_{K}|^{1/2}}{|{\bf\Sigma}_{k}|^{1/2}}\right) \end{split}\]

the QDA case

By defining the coefficients

\[\begin{split} a_{k} &= -\frac{1}{2}\left({\bf \mu}_{k}^{\sf T}{\bf \Sigma}_{k}^{-1}{\bf \mu}_{k}- {\bf \mu}_{K}^{\sf T}{\bf \Sigma}_{K}^{-1}{\bf \mu}_{K} \right)+ log \left(\frac{\pi_{k}}{\pi_{K}}\right)+log\left(\frac{|{\bf\Sigma}_{K}|^{1/2}}{|{\bf\Sigma}_{k}|^{1/2}}\right)\\ b_{kj} &=\left({\bf \Sigma}_{k}^{-1}{\bf \mu}_{k}-{\bf \Sigma}_{K}^{-1}{\bf \mu}_{K} \right)^{\sf T}{\bf x} \\ c_{kjl} &= \frac{1}{2} {\bf x}^{\sf T}\left({\bf \Sigma}_{K}^{-1} - {\bf \Sigma}_{k}^{-1} \right){\bf x} \end{split}\]

the QDA can be written as a second degree function of the \(X\):

\[\begin{split} log \left(\frac{P(Y=k|X=x)}{P(Y=K|X=x)} \right) &= a_{k} + \sum_{j=1}^{p}b_{kj}x_{j}+\sum_{j=1}^{p}\sum_{l=1}^{p}{c_{kjl}x_{j}x_{l}} \end{split}\]

the naive Bayes case

The logit, in this case, is

\[\begin{split} log \left(\frac{P(Y=k|X=x)}{P(Y=K|X=x)} \right) &= log\left(\frac{\pi_{k}f_{k}(x)}{\pi_{K}f_{K}(x)} \right)= log\left(\frac{\pi_{k}}{\pi_{K}} \right)+log\left(\frac{f_{k}(x)}{f_{K}(x)} \right)=log\left(\frac{\pi_{k}}{\pi_{K}} \right)+ log\left(\frac{\prod_{j=1}^{p}f_{kj}(x_{j})}{\prod_{j=1}^{p}f_{Kj}(x_{j})} \right)=\\ &=log\left(\frac{\pi_{k}}{\pi_{K}} \right)+\sum_{j=1}^{p}log\left(\frac{f_{kj}(x_{j})}{f_{Kj}(x_{j})} \right) \end{split}\]

Setting

\[\begin{split} a_{k}&=log\left(\frac{\pi_{k}}{\pi_{K}} \right) \quad \text{ and } \quad g_{kj}&=log\left(\frac{f_{kj}(x_{j})}{f_{Kj}(x_{j})} \right) \end{split}\]

the logit can be re-written as a function of the predictors

\[\begin{split} log \left(\frac{P(Y=k|X=x)}{P(Y=K|X=x)} \right) &= a_{k}+\sum_{j=1}^{p}g_{kj}(x_{j}) \end{split}\]

finding 1: the LDA is a special case of QDA

the LDA is a special case of QDA

Looking at the coefficients of the QDA, it is clear that, when \({\bf\Sigma}_{k}={\bf\Sigma}_{K}={\bf\Sigma}\) , then the QDA is just LDA

\[\begin{split} \color{red}{a_{k}}&=-\frac{1}{2}\left({\bf \mu}_{k}^{\sf T}{\bf \Sigma}_{k}^{-1}{\bf \mu}_{k}- {\bf \mu}_{K}^{\sf T}{\bf \Sigma}_{K}^{-1}{\bf \mu}_{K} \right)+ log \left(\frac{\pi_{k}}{\pi_{K}}\right)+log\left(\frac{|{\bf\Sigma}_{K}|^{1/2}}{|{\bf\Sigma}_{k}|^{1/2}}\right)=\\ &=-\frac{1}{2}\left({\bf \mu}_{k}^{\sf T}{\bf \Sigma}^{-1}{\bf \mu}_{k}- {\bf \mu}_{K}^{\sf T}{\bf \Sigma}^{-1}{\bf \mu}_{K} \right)+ log \left(\frac{\pi_{k}}{\pi_{K}}\right)+log\left(\frac{|{\bf\Sigma}|^{1/2}}{|{\bf\Sigma}|^{1/2}}\right)=\\ &=-\frac{1}{2}\left({\bf \mu}_{k}+{\bf \mu}_{K}\right)^{\sf T}{\bf \Sigma}^{-1}\left({\bf \mu}_{k}-{\bf \mu}_{K}\right)+ log \left(\frac{\pi_{k}}{\pi_{K}}\right)+0 \longrightarrow \color{red}{\text{as for LDA}} \end{split}\]

\[\begin{split} \color{red}{b_{kj}} &={\bf x}^{\sf T}\left({\bf \Sigma}_{k}^{-1}{\bf \mu}_{k}-{\bf \Sigma}_{K}^{-1}{\bf \mu}_{K} \right)= \\ &={\bf x}^{\sf T}\left({\bf \Sigma}^{-1}{\bf \mu}_{k}-{\bf \Sigma}^{-1}{\bf \mu}_{K} \right)= \\ &={\bf x}^{\sf T}{\bf \Sigma}^{-1}\left({\bf \mu}_{k}-{\bf \mu}_{K} \right) \longrightarrow \color{red}{\text{as for LDA}} \end{split}\]

\[\begin{split} \color{red}{c_{kjl}} &= \frac{1}{2} {\bf x}^{\sf T}\left({\bf \Sigma}_{K}^{-1} - {\bf \Sigma}_{k}^{-1} \right){\bf x}= \frac{1}{2} {\bf x}^{\sf T}\left({\bf \Sigma}^{-1} - {\bf \Sigma}^{-1} \right){\bf x}=\color{red}{0} \end{split}\]

finding 2: the naive Bayes as a special case of LDA

the naive Bayes as a special case of LDA

Any classifier with a linear decision boundary can be defined as a naive Bayes such that

\[\begin{split} g_{kj}(x_{j})=b_{kj}x_{j} \end{split}\]

If for naive Bayes, one assumes that \(\color{red}{f_{kj}(x_{j}) \sim N(\mu_{kj},\sigma^{2}_{j})}\) then \(g_{kj}(x_{j})=log\left(\frac{f_{k}(x_{j})}{f_{K}(x_{j})}\right)=b_{kj}x_{j}\) , with \(\color{red}{b_{kj}=(\mu_{kj}-\mu_{Kj})/\sigma^{2}_{j}}\) .

The naive Bayes, in this case, boils down to an LDA with diagonal covariance matrix \({\bf \Sigma}\)

finding 3: the naive Bayes is NOT a special case of QDA

the naive Bayes is NOT a special case of QDA

  • the naive Bayes is additive: for each \(j\) and \(l\) the corresponding \(g_{kj}(x_{j})\) and \(g_{kl}(x_{l})\) are added up.

 

  • in QDA, multiplicative effects are considered, see the coefficient \(c_{kjl}x_{j}x_{l}\) when \(j \neq l\)

 

  • when interactions between the predictors help in discriminating among the classes, QDA tends to outperform naive Bayes.

finding 4: multinomial logistic regression and LDA use the same linear combination of the predictors

the coefficients are estimated in a different way

multinomial logistic regression and LDA

\[log \left(\frac{P(Y=k|X=x)}{P(Y=K|X=x)} \right)=\color{red}{\beta_{k0}+\sum_{j=1}^{p}\beta_{kj}x_{j}}\]

LDA \[log \left(\frac{P(Y=k|X=x)}{P(Y=K|X=x)} \right)=\color{blue}{a_{k}+\sum_{j=1}^{p}b_{kj}x_{j}}\]

  • in LDA the coefficients are estimated assuming a multivariate normal distribution of the predictors within each class

  • whether the LDA outperforms the multinomial logistic regression depends on how well this assumption is supported by the data at hand

logistic regression vs LDA vs QDA vs naive Bayes

an empirical comparison

Two predictors: \(X_{1}\sim N(\mu_{1},\sigma)\) and \(X_{2}\sim N(\mu_{2},\sigma)\). Training set size: \(n=20\)

Two predictors: \(X_{1}\sim N(\mu_{1},\sigma)\) and \(X_{2}\sim N(\mu_{2},\sigma)\); \(cor(X_{1},X_{2})=-0.5\). Training set size: \(n=20\)

Two predictors: \(X_{1}\sim t_{n-1}\) and \(X_{2}\sim t_{n-1}\) (\(t\) distribution with \(n-1\) degrees of freedom). Training set size: \(n=50\)

Two predictors: \(X_{1}\sim N(\mu_{1},\sigma)\) and \(X_{2}\sim N(\mu_{2},\sigma)\); \(cor(X_{1},X_{2})_{class1}=0.5\), \(cor(X_{1},X_{2})_{class2}=-0.5\). Training set size: \(n=50\)

Two predictors: \(X_{1}\sim N(\mu_{1},\sigma)\) and \(X_{2}\sim N(\mu_{2},\sigma)\); \(y\) generated via logistic function with predictors \(X_{1}X_{2}\), \(X^{2}_{1}\) and \(X^{2}_{2}\)

\(X\sim N({\bf\mu},{\bf \Sigma}_{k})\), a bivariate normal distribution where \({\bf \Sigma}_{k}\) is diagonal and changes in each class. Training set size: \(n=6\)