Statistical Learning
In a classification problem the response is a categorical variable.
Rather than predicting the value of \(Y\), one wants to estimate the posterior probability
\[P(Y=k\mid X=x_{i})\]
that is, the probability that observation \(i\) belongs to class \(k\), given that the predictor value for \(i\) is \(x_{i}\).
default | student | balance | income |
---|---|---|---|
No | Yes | 311.32186 | 22648.76 |
No | Yes | 697.13558 | 18377.15 |
No | Yes | 470.10718 | 16014.11 |
No | No | 1200.04162 | 56081.08 |
No | No | 553.64902 | 47021.49 |
No | No | 10.23149 | 27237.38 |
Note: to arrange multiple plots together, have a look at the patchwork package
if \(Y\) is categorical
With \(K\) categories, one could code \(Y\) as an integer vector
if \(Y\) is binary
The goal is to estimate \(P(Y=1|X)\), which is, in fact, numeric . . .
\(P(\texttt{default}=\texttt{yes}|\texttt{balance})=\beta_{0}+\beta_{1}\texttt{balance}\)
\(P(\texttt{default}=\texttt{yes}|\texttt{balance})=\frac{e^{\beta_{0}+\beta_{1}\texttt{balance}}}{1+e^{\beta_{0}+\beta_{1}\texttt{balance}}}\)
modeling the posterior \(P(Y=1|X)\) by means of a logistic function is the goal of logistic regression
conditional expectation
just like in linear regression, the fit refers to the conditional expectation of \(Y\) given \(X\); since \(Y\in\{0,1\}\), it results that \[E[Y|X] \equiv P(Y=1|X)\]
\[ \begin{split} p(X)&=\frac{e^{\beta_{0}+\beta_{1}X}}{1+e^{\beta_{0}+\beta_{1}X}}\\ \left(1+e^{\beta_{0}+\beta_{1}X}\right)p(X)&=e^{\beta_{0}+\beta_{1}X}\\ p(X)+e^{\beta_{0}+\beta_{1}X}p(X)&=e^{\beta_{0}+\beta_{1}X}\\ p(X)&=e^{\beta_{0}+\beta_{1}X}-e^{\beta_{0}+\beta_{1}X}p(X)\\ p(X)&=e^{\beta_{0}+\beta_{1}X}\left(1-p(X)\right)\\ \frac{p(X)}{\left(1-p(X)\right)}&=e^{\beta_{0}+\beta_{1}X} \end{split} \]
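A quick numeric check of this identity in R (the values of \(\beta_{0}\), \(\beta_{1}\) and \(X\) below are purely illustrative):
# check that p(X)/(1 - p(X)) equals exp(b0 + b1*X)
b0 <- -10.65; b1 <- 0.0055; x <- 1400
p  <- exp(b0 + b1 * x) / (1 + exp(b0 + b1 * x))
all.equal(p / (1 - p), exp(b0 + b1 * x))   # TRUE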
a toy sample
a toy sample: fit the logistic function
a toy sample: for a new point \(\texttt{balance}=1400\)
a toy sample: one can estimate \(P(\texttt{default=Yes}|\texttt{balance}=1400)\)
a toy sample: one can estimate \(P(\texttt{default=Yes}|\texttt{balance}=1400)=.62\)
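A minimal base-R sketch of this estimate (assuming the toy sample is stored in a data frame toy_sample with a factor default and a numeric balance; the names are illustrative):
# logistic fit on the toy sample
toy_fit <- glm(default ~ balance, data = toy_sample, family = binomial)
# estimated posterior P(default = Yes | balance = 1400)
predict(toy_fit, newdata = data.frame(balance = 1400), type = "response")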
How to find the logistic function? Estimate its parameters in \(P(Y=1|X) = \frac{e^{\beta_{0}+\beta_{1}X}}{1 + e^{\beta_{0}+\beta_{1}X}}\)
Least squares? One could switch to the logit, which is a linear function \(\text{logit}(P(Y=1|X))=\beta_{0} + \beta_{1} X\)
Least squares? but the logit mapping, for the blue points, is \[\log\left(\frac{P(Y=1|X)}{1-P(Y=1|X)}\right) = \log\left(\frac{1}{1-1}\right) = \log(1) - \log(0) = 0 - (-\infty) = +\infty\]
Least squares? but the logit mapping, for the red points, is \[\log\left(\frac{P(Y=1|X)}{1-P(Y=1|X)}\right) = \log\left(\frac{0}{1-0}\right) = \log(0) = -\infty\]
logit: mapping
The estimates for \(\beta_{0}\) and \(\beta_{1}\) are such that the following likelihood function is maximised:
\[ \begin{split} \ell\left(\hat{\beta}_{0},\hat{\beta}_{1}\right)=& \color{blue}{\prod_{\forall i}{p(x_{i})}}\times \color{red}{\prod_{\forall i'}{\left(1-p(x_{i'})\right)}}=\\ =&\color{blue}{ \prod_{\forall i} \frac{e^{\hat{\beta}_{0}+\hat{\beta}_{1}x_{i}}}{1+e^{\hat{\beta}_{0}+\hat{\beta}_{1}x_{i}}}} \times \color{red}{ \prod_{\forall i^{\prime}}{\left(1- \frac{e^{\hat{\beta}_{0}+\hat{\beta}_{1}x_{i'}}}{1+e^{\hat{\beta}_{0}+\hat{\beta}_{1}x_{i'}}} \right)} } \end{split} \]
Note: the \(i\) index is for blue points, \(i'\) is for red points
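As a sketch of what maximising this likelihood means in practice, one can minimise the negative log-likelihood numerically on simulated data and compare the result with \(\texttt{glm}\) (data and names here are illustrative):
set.seed(1)
sim   <- data.frame(x = rnorm(200))
sim$y <- rbinom(200, 1, plogis(-1 + 2 * sim$x))    # true beta_0 = -1, beta_1 = 2
neg_log_lik <- function(beta) {
  p <- plogis(beta[1] + beta[2] * sim$x)           # p(x_i) for the current beta
  -sum(sim$y * log(p) + (1 - sim$y) * log(1 - p))
}
optim(c(0, 0), neg_log_lik)$par                    # numerical maximum likelihood estimates
coef(glm(y ~ x, data = sim, family = binomial))    # same estimates, up to numerical tolerance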
pre-process: specify the recipe
put them together in the workflow
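A minimal tidymodels sketch of these two steps (assuming a training split named default_train; the object names are illustrative):
library(tidymodels)
def_rec  <- recipe(default ~ balance, data = default_train)    # pre-process: the recipe
def_spec <- logistic_reg() |> set_engine("glm")                # model specification
def_wf   <- workflow() |> add_recipe(def_rec) |> add_model(def_spec)
def_fit  <- def_wf |> fit(data = default_train)
def_fit |> extract_fit_parsnip() |> tidy()                     # coefficients, as in the table below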
Look at the results
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -10.6513306 | 0.3611574 | -29.49221 | 0 |
balance | 0.0054989 | 0.0002204 | 24.95309 | 0 |
Suppose you want to use \(\texttt{student}\) as the qualitative predictor for your logistic regression. You can update, within the workflow, the recipe only.
update the recipe in the workflow and re-fit
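Continuing the sketch above, this can be done as follows:
def_wf_student  <- def_wf |> update_recipe(recipe(default ~ student, data = default_train))
def_fit_student <- def_wf_student |> fit(data = default_train)
def_fit_student |> extract_fit_parsnip() |> tidy()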
It appears that if a customer is a student, they are more likely to default ( \(\hat{\beta}_{1} = 0.4\) ).
In case of multiple predictors
\[log\left(\frac{p(X)}{1-p(X)} \right)=\beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{2}+\ldots+\beta_{p}X_{p}\]
and the following relation holds
\[p(X)=\frac{e^{{\beta}_{0}+{\beta}_{1}X_{1}+{\beta}_{2}X_{2}+\ldots+{\beta}_{p}X_{p}}}{1+e^{{\beta}_{0}+{\beta}_{1}X_{1}+{\beta}_{2}X_{2}+\ldots+{\beta}_{p}X_{p}}}\]
Let’s consider two predictors \(\texttt{balance}\) and \(\texttt{student}\), again we just update the recipe within the workflow
update the recipe in the workflow and re-fit
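Continuing the same sketch, with both predictors in the recipe:
def_wf_2  <- def_wf |> update_recipe(recipe(default ~ balance + student, data = default_train))
def_fit_2 <- def_wf_2 |> fit(data = default_train)
def_fit_2 |> extract_fit_parsnip() |> tidy()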
Look at the results
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -10.7494959 | 0.3691914 | -29.116326 | 0.0e+00 |
balance | 0.0057381 | 0.0002318 | 24.749526 | 0.0e+00 |
studentYes | -0.7148776 | 0.1475190 | -4.846003 | 1.3e-06 |
Suppose there are \(K\) classes and let class \(K\) be the baseline.
For the other \(K-1\) classes, the logistic model is
\[P(Y=k|X=x)=\frac{e^{{\beta}_{k0}+{\beta}_{k1}X_{1}+{\beta}_{k2}X_{2}+\ldots+{\beta}_{kp}X_{p}}}{1+\sum_{l=1}^{K-1}{e^{{\beta}_{l0}+{\beta}_{l1}X_{1}+{\beta}_{l2}X_{2}+\ldots+{\beta}_{lp}X_{p}}}}\]
For the baseline, \(k=K\), the previous becomes
\[P(Y=K|X=x)=\frac{1}{1+\sum_{l=1}^{K-1}{e^{{\beta}_{l0}+{\beta}_{l1}X_{1}+{\beta}_{l2}X_{2}+\ldots+{\beta}_{lp}X_{p}}}}\]
Note: the models for the general class \(k\) and for the baseline \(K\) have the same denominator.
The ratio between the posterior of the general class \(k\) and the baseline \(K\) is
\[\frac{P(Y=k|X=x)}{P(Y=K|X=x)} = e^{{\beta}_{k0}+{\beta}_{k1}X_{1}+{\beta}_{k2}X_{2}+\ldots+{\beta}_{kp}X_{p}}\]
that can be re-written as
\[log\left(\frac{P(Y=k|X=x)}{P(Y=K|X=x)}\right) ={\beta}_{k0}+{\beta}_{k1}X_{1}+{\beta}_{k2}X_{2}+\ldots+{\beta}_{kp}X_{p}\]
When no baseline is used (the softmax coding), the posterior for the general class \(k\) is
\[P(Y=k|X=x)=\frac{e^{{\beta}_{k0}+{\beta}_{k1}X_{1}+{\beta}_{k2}X_{2}+\ldots+{\beta}_{kp}X_{p}}}{\sum_{l=1}^{K}{e^{{\beta}_{l0}+{\beta}_{l1}X_{1}+{\beta}_{l2}X_{2}+\ldots+{\beta}_{lp}X_{p}}}}\]
and the ratio between the posteriors of any two classes \(k\) and \(k'\) is given by
\[log\left(\frac{P(Y=k|X=x)}{P(Y=k'|X=x)}\right) =({\beta}_{k0}-{\beta}_{k'0})+({\beta}_{k1}-{\beta}_{k'1})X_{1}+ ({\beta}_{k2}-{\beta}_{k'2})X_{2}+\ldots+({\beta}_{kp}-{\beta}_{k'p})X_{p}\]
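A minimal sketch of a multinomial logistic fit with tidymodels (illustrative: the iris data, where Species has \(K=3\) classes; the nnet engine is assumed):
library(tidymodels)
multi_fit <- workflow() |>
  add_recipe(recipe(Species ~ ., data = iris)) |>
  add_model(multinom_reg() |> set_engine("nnet")) |>
  fit(data = iris)
multi_fit |> extract_fit_parsnip() |> tidy()   # K - 1 sets of coefficients, one class acts as baseline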
in a classification problem the goal is to estimate \(\color{blue}{P(Y=k|X)}\), that is, the posterior probability.
logistic regression seeks to estimate the posterior probability \(\color{red}{\text{directly}}\)
another approach is to model the distribution of the predictors within each class, and then use Bayes' theorem to obtain the posterior: this is what generative models for classification do.
To obtain \(\hat{P}(Y=k|X)\) , one needs to estimate
\(\hat{\pi}_{k}\) : this is easily obtained by computing the proportion of training observations within class \(k\).
\(\hat{f}_{k}(X)\) : the probability density is not easily obtained and some assumptions are needed.
In the linear discriminant analysis, the assumption on \(f_{k}(X)\) is that
\[f_{k}(X)\sim N(\mu_{k},\sigma^{2})\]
therefore
\[f_{k}(X)=\frac{1}{\sqrt{2\pi}\sigma}exp\left(-\frac{1}{2\sigma^{2}}\left( x-\mu_{k} \right)^{2}\right)\]
in each class, the predictor is normally distributed
the scale parameter \(\sigma^{2}\) is the same for each class
Plugging in \(f_{k}(X)\) in the Bayes formula
\[p_{k}(x)=\frac{\pi_{k}\times f_{k}(x)}{\sum_{l=1}^{K}{\pi_{l}\times f_{l}(x)}} = \frac{\pi_{k}\times \overbrace{\frac{1}{\sqrt{2\pi}\sigma}exp\left(-\frac{1}{2\sigma^{2}}\left( x-\mu_{k} \right)^{2}\right)}^{\color{red}{{f_{k}(x)}}}} {\sum_{l=1}^{K}{\pi_{l}\times \underbrace{\frac{1}{\sqrt{2\pi}\sigma}exp\left(-\frac{1}{2\sigma^{2}}\left( x-\mu_{l} \right)^{2}\right)}_{\color{red}{{f_{l}(x)}}}}}\]
one needs to estimate the parameters \(\hat{\pi}_{k}\), \(\hat{\mu}_{k}\) and \(\hat{\sigma}^{2}\)
to get, for each observation \(x\), \(\hat{p}_{1}(x)\) , \(\hat{p}_{2}(x)\) , \(\ldots\) , \(\hat{p}_{K}(x)\) : the observation is then assigned to the class for which \(\hat{p}_{k}(x)\) is largest.
\(\color{red}{\text{Note}}:\) not all the quantities involved in the Bayes formula play a role in the classification of an object: in fact, some of them are constant across the classes.
To get rid of the across-classes constant quantities
\[\log\left[p_{k}(x)\right] = \log{\left[ \frac{\pi_{k}\times \frac{1}{\sqrt{2\pi}\sigma}exp\left(-\frac{1}{2\sigma^{2}}\left( x-\mu_{k} \right)^{2}\right)} {\sum_{l=1}^{K}{\pi_{l}\times \frac{1}{\sqrt{2\pi}\sigma}exp\left(-\frac{1}{2\sigma^{2}}\left( x-\mu_{l} \right)^{2}\right)}}\right]}\]
since \(\color{red}{\log(a/b)=\log(a)-\log(b)}\) it follows that
\[\log\left[p_{k}(x)\right]=\log{\left[ \pi_{k}\times \frac{1}{\sqrt{2\pi}\sigma}exp\left(-\frac{1}{2\sigma^{2}}\left( x-\mu_{k} \right)^{2}\right)\right]}- \underbrace{\log{\left[\sum_{l=1}^{K}{\pi_{l}\times \frac{1}{\sqrt{2\pi}\sigma}exp\left(-\frac{1}{2\sigma^{2}}\left( x-\mu_{l} \right)^{2}\right)}\right]}}_{\color{red}{\text{constant}}}\]
\[\begin{split} &\underbrace{\log{\left[ \pi_{k}\times \frac{1}{\sqrt{2\pi}\sigma}exp\left(-\frac{1}{2\sigma^{2}}\left( x-\mu_{k} \right)^{2}\right)\right]}}_{\color{red}{\log(a\times b)=\log(a)+\log(b)}}=\log(\pi_{k})+\underbrace{\log\left( \frac{1}{\sqrt{2\pi}\sigma}\right)}_{\color{red} {\text{constant}}}+\underbrace{\log\left[exp\left(-\frac{1}{2\sigma^{2}}\left( x-\mu_{k} \right)^{2}\right)\right]}_{\color{red}{\log(\exp(a))=a}} =\\ &= \log(\pi_{k}) -\frac{1}{2\sigma^{2}}\left( x-\mu_{k} \right)^{2}=\log(\pi_{k}) - \frac{1}{2\sigma^{2}}\left(x^{2}+\mu_{k}^{2}-2x\mu_{k} \right)=\\ &= \log(\pi_{k}) - \underbrace{\frac{x^{2}}{2\sigma^{2}}}_{\color{red}{\text{const}}}- \frac{\mu_{k}^{2}}{2\sigma^{2}}+ \frac{2x\mu_{k}}{2\sigma^{2}}= \log(\pi_{k}) - \frac{\mu_{k}^{2}}{2\sigma^{2}}+ x\frac{\mu_{k}}{\sigma^{2}}=\color{red}{\delta_{k}(x)}\\ \end{split}\]
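A minimal numeric sketch of \(\delta_{k}(x)\) on simulated data (two classes; the parameter values are illustrative):
set.seed(1)
d <- data.frame(class = rep(c(1, 2), each = 50),
                x     = c(rnorm(50, -4, 2), rnorm(50, 4, 2)))
pi_hat <- prop.table(table(d$class))       # estimated priors
mu_hat <- tapply(d$x, d$class, mean)       # estimated class means
# pooled within-class variance estimate
sigma2 <- sum(tapply(d$x, d$class, function(v) sum((v - mean(v))^2))) / (nrow(d) - 2)
delta  <- function(x, k) log(pi_hat[k]) - mu_hat[k]^2 / (2 * sigma2) + x * mu_hat[k] / sigma2
which.max(c(delta(-2, 1), delta(-2, 2)))   # the point x = -2 is assigned to class 1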
Consider a single predictor \(X\), normally distributed within the two classes, with parameters \(\mu_{1}\) , \(\mu_{2}\) and \(\sigma^{2}\) .
Also \(\pi_{1}=\pi_{2}\) and, say, \(\mu_{1}>\mu_{2}\). Now, the observation \(x\) is assigned to class 1 if
\[\begin{split} \delta_{1}(X) &>&\delta_{2}(X)\\ \color{blue}{\text{that is}} \\ log({\pi_{1}})-\frac{\mu_{1}^{2}}{2\sigma^{2}} + \frac{\mu_{1}}{\sigma^{2}}x &>& log({\pi_{2}})-\frac{\mu_{2}^{2}}{2\sigma^{2}} + \frac{\mu_{2}}{\sigma^{2}}x \\ \color{blue}{\text{ since }\pi_{1}=\pi_{2}}\\ -\frac{\mu_{1}^{2}}{2\sigma^{2}} + \frac{\mu_{1}}{\sigma^{2}}x > -\frac{\mu_{2}^{2}}{2\sigma^{2}} + \frac{\mu_{2}}{\sigma^{2}}x & \ \rightarrow \ & -\frac{\mu_{1}^{2}}{2} + \mu_{1}x > -\frac{\mu_{2}^{2}}{2} + \mu_{2}x\\ (\mu_{1} - \mu_{2})x > \frac{\mu_{1}^{2}-\mu_{2}^{2}}{2} & \ \rightarrow \ & x > \frac{(\mu_{1}+\mu_{2})(\mu_{1}-\mu_{2})}{2(\mu_{1} - \mu_{2})}\\ x &>& \frac{(\mu_{1}+\mu_{2})}{2}\\ \end{split}\]
the Bayes decision boundary, in which \(\delta_{1}=\delta_{2}\), is at \(\color{red}{x=\frac{(\mu_{1}+\mu_{2})}{2}}\)
library(ggplot2)
set.seed(1234)
# class-conditional densities N(-4, 2) and N(4, 2), equal priors: the Bayes boundary is at 0
p_1 = ggplot() + xlim(-10, 10) + theme_minimal() +
  stat_function(fun = dnorm, args = list(mean = 4, sd = 2), geom = "area", fill = "dodgerblue", alpha = .25) +
  stat_function(fun = dnorm, args = list(mean = -4, sd = 2), geom = "area", fill = "indianred", alpha = .25) +
  geom_vline(xintercept = 0, size = 2, alpha = .5) +                    # Bayes decision boundary
  geom_vline(xintercept = -4, color = "grey", size = 3, alpha = .5) +   # class means
  geom_vline(xintercept = 4, color = "grey", size = 3, alpha = .5) +
  geom_point(aes(x = -2, y = 0), inherit.aes = FALSE, size = 10, alpha = .5, color = "darkgreen") +
  geom_point(aes(x = 1, y = 0), inherit.aes = FALSE, size = 10, alpha = .5, color = "magenta") +
  xlab(NULL)
p_1
The \(\color{darkgreen}{\text{green point}}\) goes to class 1, the \(\color{magenta}{\text{pink point}}\) goes to class 2
Consider a training set with 100 observations from the two classes (50 each): one needs to estimate \(\mu_1\) and \(\mu_2\) to have the estimated boundary at \(\frac{\hat{\mu}_{1}+\hat{\mu}_{2}}{2}\)
library(tidyverse)
# simulate 50 training observations per class and estimate the class means
class_12 = tibble(class_1 = rnorm(50, mean = -4, sd = 2), class_2 = rnorm(50, mean = 4, sd = 2)) |>
  pivot_longer(names_to = "classes", values_to = "values", cols = 1:2)
mu_12 = class_12 |> group_by(classes) |> summarise(means = mean(values))
mu_12_mean = mean(mu_12$means)   # estimated boundary: (mu_hat_1 + mu_hat_2) / 2
p_2 = class_12 |> ggplot(aes(x = values, fill = classes)) + theme_minimal() +
  geom_histogram(aes(y = after_stat(density)), alpha = .5, color = "grey") + xlim(-10, 10) +
  geom_vline(xintercept = mu_12 |> pull(means), color = "grey", size = 3, alpha = .75) +   # estimated class means
  geom_vline(xintercept = mu_12_mean, size = 2, alpha = .75) +                             # estimated boundary
  theme(legend.position = "none")
p_2
The Bayes boundary is at \(\color{red}{0}\); the estimated boundary is slightly off at -0.31
pre-process: specify the recipe
put them together in the workflow
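A minimal sketch of the LDA workflow (assuming the discrim extension package and a training split default_train; the fitted object is named def_fit_lda, as used later):
library(tidymodels)
library(discrim)                                   # provides discrim_linear()
lda_rec  <- recipe(default ~ balance, data = default_train)
lda_spec <- discrim_linear() |> set_engine("MASS")
def_fit_lda <- workflow() |> add_recipe(lda_rec) |> add_model(lda_spec) |>
  fit(data = default_train)
def_fit_lda |> extract_fit_engine()                # the raw MASS::lda output, shown below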
Look at the results (note: no \(\texttt{tidy}\) nor \(\texttt{glance}\) functions available for this model specification)
Call:
lda(..y ~ ., data = data)
Prior probabilities of groups:
No Yes
0.9667 0.0333
Group means:
balance
No 803.9438
Yes 1747.8217
Coefficients of linear discriminants:
LD1
balance 0.002206916
The function \(\texttt{lda}\) from the \(\texttt{MASS}\) package is used. It implements Fisher's discriminant analysis as described in Section 12.1 of Modern Applied Statistics with S, by Venables and Ripley.
Here LDA is presented as in ISLR; Venables and Ripley refer to the ISLR approach as discrimination via probability models, and briefly describe it in the subsection of Section 12.1 titled Discrimination for normal populations.
Two examples of a bivariate normal: two independent components \(X_{1}\) and \(X_{2}\), that is \(cor(X_{1},X_{2})=0\), with the same variance \(var(X_{1})=var(X_{2})\).
Let \(X\) be a \(p\)-variate normal distribution, that is \(X\sim N(\mu,\Sigma)\) .
\[\bf{\Sigma}=\begin{bmatrix} \color{blue}{\sigma^{2}_{1}}&\color{darkgreen}{\sigma_{12}}&\ldots&\color{darkgreen}{\sigma_{1p}}\\ \color{darkgreen}{\sigma_{21}}&\color{blue}{\sigma^{2}_{2}}&\ldots&\color{darkgreen}{\sigma_{2p}} \\ \ldots&\ldots&\ldots&\ldots \\ \color{darkgreen}{\sigma_{p1}}&\color{darkgreen}{\sigma_{p2}}&\ldots& \color{blue}{\sigma^{2}_{p}} \\ \end{bmatrix}\]
diagonal terms are variances of the \(p\) components
off-diagonal terms are pairwise covariances between the \(p\) components
\[f(x)= \frac{1}{(2\pi)^{p/2}|\Sigma |^{1/2}}\exp{\left(-\frac{1}{2} \left( x-\mu\right)^{\sf T} \Sigma^{-1}\left( x-\mu\right)\right)}\]
where \(|\Sigma|\) indicates the determinant of \(\Sigma\).
The assumption is that, within each class, \(X\sim N({\bf \mu}_{k}, {\bf \Sigma})\), just like in the univariate case
And the linear discriminant function is
\[\delta_{k}(X)=\log{\pi}_{k} - \frac{1}{2}{\mu}_{k}^{\sf T}{\Sigma}^{-1}\mu_{k} + x^{\sf T}\Sigma^{-1}\mu_{k}\]
in several classification problems, not all classification errors are alike
library(tidymodels)   # augment(), dplyr, tidyr
library(kableExtra)   # kbl(), kable_styling()
# classify the test observations at several probability thresholds
lda_pred = def_fit_lda |> augment(new_data = default_test) |>
  dplyr::select(default, .pred_class, .pred_Yes) |>
  mutate(.pred_class_0_05 = as.factor(ifelse(.pred_Yes > .05, "Yes", "No")),
         .pred_class_0_1  = as.factor(ifelse(.pred_Yes > .1,  "Yes", "No")),
         .pred_class_0_2  = as.factor(ifelse(.pred_Yes > .2,  "Yes", "No")),
         .pred_class_0_3  = as.factor(ifelse(.pred_Yes > .3,  "Yes", "No")),
         .pred_class_0_4  = as.factor(ifelse(.pred_Yes > .4,  "Yes", "No")),
         .pred_class_0_5  = as.factor(ifelse(.pred_Yes > .5,  "Yes", "No"))
  )
# for each threshold: overall accuracy and the rate of misclassified defaulters
lda_pred |>
  pivot_longer(names_to = "threshold", values_to = "prediction", cols = 4:9) |>
  dplyr::select(-.pred_class, -.pred_Yes) |> group_by(threshold) |>
  summarise(
    accuracy = round(mean(default == prediction), 2),
    # among actual "Yes" cases, the share predicted "No" (false positives if "No" is taken as the positive class)
    false_positive_rate = sum((default == "Yes") & (default != prediction)) / sum(default == "Yes")
  ) |> arrange(desc(accuracy), desc(false_positive_rate)) |> kbl() |> kable_styling(font_size = 10)
threshold | accuracy | false_positive_rate |
---|---|---|
.pred_class_0_5 | 0.97 | 0.8160920 |
.pred_class_0_4 | 0.97 | 0.7126437 |
.pred_class_0_3 | 0.97 | 0.6206897 |
.pred_class_0_2 | 0.96 | 0.4942529 |
.pred_class_0_1 | 0.94 | 0.2873563 |
.pred_class_0_05 | 0.89 | 0.1609195 |
Since the false positive rate increases along with the overall accuracy, reducing it (by lowering the classification threshold) causes the overall accuracy of the classifier to drop
In QDA the assumption of a covariance matrix that is constant across classes is removed: each class \(k\) has its own covariance matrix \({\bf \Sigma}_{k}\)
\[\begin{split} \color{red}{\delta_{k}(X)}&= log\left(\frac{1}{(2\pi)^{p/2}|\Sigma_{k}|^{1/2}}\right) -\frac{1}{2} \left( x -\mu_{k}\right)^{\sf T}{\bf \Sigma}_{k}^{-1}\left( x -\mu_{k}\right)+\log(\pi_{k})=\\ &=\log\left(\frac{1}{(2\pi)^{p/2}|\Sigma_{k}|^{1/2}}\right)-\frac{1}{2} \left[\left(x^{\sf T}{\bf \Sigma}_{k}^{-1}-\mu_{k}^{\sf T}{\bf \Sigma}_{k}^{-1}\right)\left( x -\mu_{k}\right)\right]+\log(\pi_{k})=\\ &=\log\left(\frac{1}{(2\pi)^{p/2}|\Sigma_{k}|^{1/2}}\right)-\frac{1}{2} \left[ x^{\sf T}{\bf \Sigma}_{k}^{-1}x-\underbrace{\mu_{k}^{\sf T}{\bf \Sigma}_{k}^{-1}x}_{\text{scalar}}-\underbrace{x^{\sf T}{\bf \Sigma}_{k}^{-1}\mu_{k}}_{\text{scalar}}+\mu_{k}^{\sf T}{\bf \Sigma}_{k}^{-1}\mu_{k}\right]+\\ &+\log(\pi_{k})=\\ &=\log\left(\frac{1}{(2\pi)^{p/2}|\Sigma_{k}|^{1/2}}\right)-\frac{1}{2} x^{\sf T}{\bf \Sigma}_{k}^{-1}x +\frac{1}{{2}}{2}x^{\sf T}{\bf \Sigma}_{k}^{-1}\mu_{k}-\frac{1}{2}\mu_{k}^{\sf T}{\bf \Sigma}_{k}^{-1}\mu_{k}+\log(\pi_{k})=\\ &=\log\left(\frac{1}{(2\pi)^{p/2}|\Sigma_{k}|^{1/2}}\right)\color{red}{-\frac{1}{2} x^{\sf T}{\bf \Sigma}_{k}^{-1}x} +x^{\sf T}{\bf \Sigma}_{k}^{-1}\mu_{k}-\frac{1}{2}\mu_{k}^{\sf T}{\bf \Sigma}_{k}^{-1}\mu_{k}+\log(\pi_{k})\\ \end{split}\]
pre-process: specify the recipe
put them together in the workflow
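A minimal sketch for QDA, with the same structure as the LDA workflow (again assuming the discrim package and the training split default_train):
qda_spec <- discrim_quad() |> set_engine("MASS")
def_fit_qda <- workflow() |>
  add_recipe(recipe(default ~ balance, data = default_train)) |>
  add_model(qda_spec) |>
  fit(data = default_train)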
Look at the results (note: no \(\texttt{tidy}\) nor \(\texttt{glance}\) functions available for this model specification)
def_fit_qda |> augment(new_data = default_test) |>
dplyr::select(default, .pred_class, .pred_Yes) |>
mutate(default=factor(default,levels=c("Yes","No"))) |>
roc_curve(truth = default, .pred_Yes)|>
ggplot(aes(x=1-specificity,y=sensitivity))+ggtitle("qda roc curve") +
geom_path(color="dodgerblue")+geom_abline(lty=3)+coord_equal()+theme_minimal()
like LDA and QDA, the goal is to estimate \(f_{k}(X)\)
unlike LDA and QDA, in the naive Bayes classifier \(f_{k}(X)\) is not assumed to be a multivariate normal density
the assumption is that the predictors are independent within each class \(k\)
the joint density of the \(p\) predictors in class \(k\) is then
\[f_{k}(X)=f_{k1}(X_{1})\times f_{k2}(X_{2})\times \ldots \times f_{kp}(X_{p})\]
while the assumption is naive, it simplifies the fit, since the focus is on the marginal distribution of each predictor
this is of help in case of few training observations
Since the goal is to fit \(f_{kj}(X_{j})\), \(j=1,\ldots,p\), one can fit an ad hoc function for each predictor
for continuous predictors, one can choose \(f_{kj}(X_{j})\sim N(\mu_{kj},\sigma^{2}_{kj})\) or a kernel density estimator
for categorical predictors, the relative frequency distribution can be used instead.
Suppose there are two classes and three predictors: \(X_{1}\), \(X_{2}\) (quantitative) and \(X_{3}\) (qualitative)
assume that \(\hat{\pi}_{1}=\hat{\pi}_{2}\), and that \(\hat{f}_{kj}(X_{j})\) with \(k=1,2\) and \(j=1,2,3\) are:
consider \({\bf x}^{\star {\sf T}}=\left[.4,1.5,1\right]\), then
\(\color{red}{\hat{f}_{11}(.4)=.368,\hat{f}_{12}(1.5)=.484,\hat{f}_{13}(1)=.226}\) for class 1
\(\color{blue}{\hat{f}_{21}(.4)=.030,\hat{f}_{22}(1.5)=.130,\hat{f}_{23}(1)=.616}\) for class 2
The posterior for each class \(P(Y=1|X_{1}=.4,X_{2}=1.5,X_{3}=1)\) and \(P(Y=2|X_{1}=.4,X_{2}=1.5,X_{3}=1)\), knowing that \(\hat{\pi}_{1}=\hat{\pi}_{2}=.5\), is given by \[\begin{split} P(Y=1|X_{1}=.4,X_{2}=1.5,X_{3}=1) &= \frac{\color{red}{\hat{\pi}_{1}\times\hat{f}_{11}(.4)\times \hat{f}_{12}(1.5)\times\hat{f}_{13}(1)}}{ \color{red}{\hat{\pi}_{1}\times\hat{f}_{11}(.4)\times \hat{f}_{12}(1.5)\times\hat{f}_{13}(1)}+ \color{blue}{\hat{\pi}_{2}\times\hat{f}_{21}(.4)\times\hat{f}_{22}(1.5)\times\hat{f}_{23}(1)} }=\\ &=\frac{\color{red}{.5 \times .368 \times .484 \times .226}}{ \color{red}{.5 \times .368 \times .484 \times .226}+ \color{blue}{ .5 \times .03 \times .130 \times .616} }=\\ &= \frac{\color{red}{0.0201}}{\color{red}{0.0201}+\color{blue}{0.0012}}=0.944 \end{split}\]
Similarly, ignoring the fact that \[P(Y=2|X_{1}=.4,X_{2}=1.5,X_{3}=1)=1-P(Y=1|X_{1}=.4,X_{2}=1.5,X_{3}=1)\] the class 2 posterior is
\[P(Y=2|X_{1}=.4,X_{2}=1.5,X_{3}=1)=\frac{\color{blue}{0.0012}}{\color{red}{0.0201}+\color{blue}{0.0012}}=0.056\]
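A quick numeric check of these two posteriors:
f1 <- c(.368, .484, .226); f2 <- c(.030, .130, .616)   # estimated marginal densities at x*
pi1 <- pi2 <- .5                                       # estimated priors
num1 <- pi1 * prod(f1); num2 <- pi2 * prod(f2)
round(num1 / (num1 + num2), 3)   # 0.944, posterior of class 1
round(num2 / (num1 + num2), 3)   # 0.056, posterior of class 2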
pre-process: specify the recipe
put them together in the workflow
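A minimal sketch of the naive Bayes workflow (assuming the discrim package with the klaR engine, and the training split default_train; the fitted object is named def_fit_nb, as used below):
nb_spec <- naive_Bayes() |> set_engine("klaR")
def_fit_nb <- workflow() |>
  add_recipe(recipe(default ~ balance, data = default_train)) |>
  add_model(nb_spec) |>
  fit(data = default_train)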
# ROC curve for the naive Bayes fit on the test set
def_fit_nb |> augment(new_data = default_test) |>
  dplyr::select(default, .pred_class, .pred_Yes) |>
  mutate(default = factor(default, levels = c("Yes", "No"))) |>   # "Yes" as the event level
  roc_curve(truth = default, .pred_Yes) |>
  ggplot(aes(x = 1 - specificity, y = sensitivity)) + ggtitle("naive Bayes roc curve") +
  geom_path(color = "darkgreen") + geom_abline(lty = 3) + coord_equal() + theme_minimal()
roc curves
auc
method | .estimate |
---|---|
logistic regression | 0.0517932 |
LDA | 0.0517932 |
QDA | 0.0517932 |
naive Bayes | 0.0517789 |
Consider the case of \(K\) classes, and let class \(K\) be the baseline; the predicted class \(k\) will be the one maximising
\[\begin{split} log \left(\frac{P(Y=k|X=x)}{P(Y=K|X=x)} \right) &= log \left(\frac{\pi_{k}f_{k}(x)}{\pi_{K}f_{K}(x)}\right)=log \left(\frac{\pi_{k}}{\pi_{K}}\right)+ log \left(\frac{f_{k}(x)}{f_{K}(x)}\right)=\\ &=log \left(\frac{\pi_{k}}{\pi_{K}}\right)+ log \left(f_{k}(x)\right)-log \left(f_{K}(x)\right) \end{split}\]
for any of the considered classifiers.
the assumption is that within the \(k^{th}\) class, \(X\sim N({\bf \mu}_{k},{\bf \Sigma})\)
\[\begin{split} \color{red}{log \left(f_{k}(x)\right)} &= log\left[\frac{1}{(2\pi)^{p/2}|{\bf \Sigma}|^{1/2} } exp\left(-\frac{1}{2} ({\bf x}-{\bf \mu}_{k})^{\sf T}{\bf \Sigma}^{-1}({\bf x}-{\bf \mu}_{k})\right)\right]=\\ &=\color{red}{log\left(\frac{1}{(2\pi)^{p/2}|{\bf \Sigma}|^{1/2} }\right) -\frac{1}{2} ({\bf x}-{\bf \mu}_{k})^{\sf T}{\bf \Sigma}^{-1}({\bf x}-{\bf \mu}_{k})} \end{split}\]
And
\[\begin{split} \color{blue}{{log \left(f_{K}(x)\right)}} = \color{blue}{{log\left(\frac{1}{(2\pi)^{p/2}|{\bf \Sigma}|^{1/2} }\right) -\frac{1}{2} ({\bf x}-{\bf \mu}_{K})^{\sf T}{\bf \Sigma}^{-1}({\bf x}-{\bf \mu}_{K})}} \end{split}\]
plugging \(\color{red}{log \left(f_{k}(x)\right)}\) and \(\color{blue}{{log \left(f_{K}(x)\right)}}\) in
\[\begin{split} log \left(\frac{P(Y=k|X=x)}{P(Y=K|X=x)} \right) &=log \left(\frac{\pi_{k}}{\pi_{K}}\right)+ \color{red}{log \left(f_{k}(x)\right)}-\color{blue}{log \left(f_{K}(x)\right)}=\\ &=log \left(\frac{\pi_{k}}{\pi_{K}}\right)+\color{red}{log\left(\frac{1}{(2\pi)^{p/2}|{\bf \Sigma}|^{1/2}}\right) -\frac{1}{2} ({\bf x}-{\bf \mu}_{k})^{\sf T}{\bf \Sigma}^{-1}({\bf x}-{\bf \mu}_{k})}+\\ &-\color{blue}{{log\left(\frac{1}{(2\pi)^{p/2}|{\bf \Sigma}|^{1/2}}\right)}}\color{blue}{{+\frac{1}{2} ({\bf x}-{\bf \mu}_{K})^{\sf T}{\bf \Sigma}^{-1}({\bf x}-{\bf \mu}_{K})}} =\\ &=log \left(\frac{\pi_{k}}{\pi_{K}}\right)-\color{red}{\frac{1}{2}({\bf x}-{\bf \mu}_{k})^{\sf T}{\bf \Sigma}^{-1}({\bf x}-{\bf \mu}_{k})} +\color{blue}{{\frac{1}{2}({\bf x}-{\bf \mu}_{K})^{\sf T}{\bf \Sigma}^{-1}({\bf x}-{\bf \mu}_{K})}}=\\ &=log \left(\frac{\pi_{k}}{\pi_{K}}\right)- \color{red}{ \frac{1}{2}\left[{\bf x}^{\sf T}{\bf \Sigma}^{-1}{\bf x}\underbrace{-{\bf x}^{\sf T}{\bf \Sigma}^{-1}{\bf \mu}_{k}-{\bf \mu}_{k}^{\sf T}{\bf \Sigma}^{-1}{\bf x}}_{-2{\bf x}^{\sf T}{\bf \Sigma}^{-1}{\bf \mu}_{k}}+{\bf \mu}_{k}^{\sf T}{\bf \Sigma}^{-1}{\bf \mu}_{k}\right] }+\\ &+\color{blue} {\frac{1}{2}\left[{\bf x}^{\sf T}{\bf \Sigma}^{-1}{\bf x}\underbrace{-{\bf x}^{\sf T}{\bf \Sigma}^{-1}{\bf \mu}_{K}-{\bf \mu}_{K}^{\sf T}{\bf \Sigma}^{-1}{\bf x}}_{-2{\bf x}^{\sf T}{\bf \Sigma}^{-1}{\bf \mu}_{K}}+{\bf \mu}_{K}^{\sf T}{\bf \Sigma}^{-1}{\bf \mu}_{K}\right] } \end{split}\]
\[\begin{split} log \left(\frac{P(Y=k|X=x)}{P(Y=K|X=x)} \right) & = log \left(\frac{\pi_{k}}{\pi_{K}}\right)+ \color{red}{{\bf \mu}_{k}^{\sf T}{\bf \Sigma}^{-1}{\bf x}}-\color{blue}{{{\bf \mu}_{K}^{\sf T}{\bf \Sigma}^{-1}{\bf x}}}- \frac{1}{2}\underbrace{\color{red}{ {\bf \mu}_{k}^{\sf T}{\bf \Sigma}^{-1}{\bf \mu}_{k}}}_{a^2}+ \frac{1}{2}\underbrace{\color{blue}{{ {\bf \mu}_{K}^{\sf T}{\bf \Sigma}^{-1}{\bf \mu}_{K}}}}_{b^2} \end{split}\]
Now, since \(a^{2}-b^{2}=(a+b)(a-b)\), that is \(-a^{2}+b^{2}=-(a+b)(a-b)\), the previous becomes
\[\begin{split} log \left(\frac{P(Y=k|X=x)}{P(Y=K|X=x)} \right) = log \left(\frac{\pi_{k}}{\pi_{K}}\right)+({\bf \mu}_{k}-{\bf \mu}_{K})^{\sf T}{\bf \Sigma}^{-1}{\bf x}- \frac{1}{2}({\bf \mu}_{k}+{\bf \mu}_{K})^{\sf T}{\bf \Sigma}^{-1}({\bf \mu}_{k}-{\bf \mu}_{K}) \end{split}\]
and, setting
\(log \left(\frac{\pi_{k}}{\pi_{K}}\right)-\frac{1}{2}({\bf \mu}_{k}+{\bf \mu}_{K})^{\sf T}{\bf \Sigma}^{-1}({\bf \mu}_{k}-{\bf \mu}_{K})=\color{forestgreen}{a_{k}}\)
\(({\bf \mu}_{k}-{\bf \mu}_{K})^{\sf T}{\bf \Sigma}^{-1}{\bf x}=\color{forestgreen}{\sum_{j=1}^{p}{b_{kj}x_{j}}}\)
it is clear that in LDA, just like in logistic regression, the log of the odds is a linear function of the predictors
\[\begin{split} log \left(\frac{P(Y=k|X=x)}{P(Y=K|X=x)} \right) = a_{k}+\sum_{j=1}^{p}{b_{kj}x_{j}} \end{split}\]
the assumption is that within the \(k^{th}\) class, \(X\sim N({\bf \mu}_{k},{\bf \Sigma}_{k})\)
\[\begin{split} \color{red}{log \left(f_{k}(x)\right)} &= log\left[\frac{1}{(2\pi)^{p/2}|{\bf \Sigma}_{k}|^{1/2} } exp\left(-\frac{1}{2} ({\bf x}-{\bf \mu}_{k})^{\sf T}{\bf \Sigma}_{k}^{-1}({\bf x}-{\bf \mu}_{k})\right)\right]=\\ &=\color{red}{log\left(\frac{1}{(2\pi)^{p/2}|{\bf \Sigma}_{k}|^{1/2} }\right) -\frac{1}{2} ({\bf x}-{\bf \mu}_{k})^{\sf T}{\bf \Sigma}_{k}^{-1}({\bf x}-{\bf \mu}_{k})} \end{split}\]
And
\[\begin{split} \color{blue}{{log \left(f_{K}(x)\right)}} = \color{blue}{{log\left(\frac{1}{(2\pi)^{p/2}|{\bf \Sigma}_{K}|^{1/2} }\right) -\frac{1}{2} ({\bf x}-{\bf \mu}_{K})^{\sf T}{\bf \Sigma}_{K}^{-1}({\bf x}-{\bf \mu}_{K})}} \end{split}\]
again, plugging in \(\color{red}{log \left(f_{k}(x)\right)}\) and \(\color{blue}{{log \left(f_{K}(x)\right)}}\)
\[\begin{split} log \left(\frac{P(Y=k|X=x)}{P(Y=K|X=x)} \right) &=log \left(\frac{\pi_{k}}{\pi_{K}}\right)+ \color{red}{log \left(f_{k}(x)\right)}-\color{blue}{log \left(f_{K}(x)\right)}=\\ &=log \left(\frac{\pi_{k}}{\pi_{K}}\right)+\color{red}{log\left(\frac{1}{(2\pi)^{p/2}|{\bf \Sigma}_{k}|^{1/2}}\right) -\frac{1}{2} ({\bf x}-{\bf \mu}_{k})^{\sf T}{\bf \Sigma}_{k}^{-1}({\bf x}-{\bf \mu}_{k})}+\\ &-\color{blue}{{log\left(\frac{1}{(2\pi)^{p/2}|{\bf \Sigma}_{K}|^{1/2}}\right)}}\color{blue}{{+\frac{1}{2} ({\bf x}-{\bf \mu}_{K})^{\sf T}{\bf \Sigma}_{K}^{-1}({\bf x}-{\bf \mu}_{K})}} \end{split}\]
re-writing the following quantities
\[\begin{split} \color{red}{log\left(\frac{1}{(2\pi)^{p/2}|{\bf \Sigma}_{k}|^{1/2}}\right)} &= \color{red}{log(1)-log\left((2\pi)^{p/2}\right)-log\left(|{\bf\Sigma}_{k}|^{1/2}\right)}\\ \color{blue}{{log\left(\frac{1}{(2\pi)^{p/2}|{\bf \Sigma}_{K}|^{1/2}}\right)}} &= \color{blue}{{log(1)-log\left((2\pi)^{p/2}\right)-log\left(|{\bf\Sigma}_{K}|^{1/2}\right)}} \end{split}\]it results that \[\begin{split} \color{red}{log\left(\frac{1}{(2\pi)^{p/2}|{\bf \Sigma}_{k}|^{1/2}}\right)}-\color{blue}{{log\left(\frac{1}{(2\pi)^{p/2}|{\bf \Sigma}_{K}|^{1/2}}\right)}} &= \color{red}{-log\left((2\pi)^{p/2}\right)-log\left(|{\bf\Sigma}_{k}|^{1/2}\right)}\color{blue}{{+log\left((2\pi)^{p/2}\right)+log\left(|{\bf\Sigma}_{K}|^{1/2}\right)}}= \color{blue}{{log\left(|{\bf\Sigma}_{K}|^{1/2}\right)}}\color{red}{-log\left(|{\bf\Sigma}_{k}|^{1/2}\right)}=log\left(\frac{|{\bf\Sigma}_{K}|^{1/2}}{|{\bf\Sigma}_{k}|^{1/2}}\right) \end{split}\]
the logit can be re-written accordingly \[\begin{split} log \left(\frac{P(Y=k|X=x)}{P(Y=K|X=x)} \right) &=log \left(\frac{\pi_{k}}{\pi_{K}}\right)+log\left(\frac{|{\bf\Sigma}_{K}|^{1/2}}{|{\bf\Sigma}_{k}|^{1/2}}\right)\color{red}{ -\frac{1}{2} ({\bf x}-{\bf \mu}_{k})^{\sf T}{\bf \Sigma}_{k}^{-1}({\bf x}-{\bf \mu}_{k})} \color{blue}{{+\frac{1}{2} ({\bf x}-{\bf \mu}_{K})^{\sf T}{\bf \Sigma}_{K}^{-1}({\bf x}-{\bf \mu}_{K})}}=\\ &= log \left(\frac{\pi_{k}}{\pi_{K}}\right)+log\left(\frac{|{\bf\Sigma}_{K}|^{1/2}}{|{\bf\Sigma}_{k}|^{1/2}}\right)\color{red}{ -\frac{1}{2}\left[{\bf x}^{\sf T}{\bf \Sigma}_{k}^{-1}{\bf x}\underbrace{-{\bf x}^{\sf T}{\bf \Sigma}_{k}^{-1}{\bf \mu}_{k}-{\bf \mu}_{k}^{\sf T}{\bf \Sigma}_{k}^{-1}{\bf x}}_{-2 {\bf \mu}_{k}^{\sf T}{\bf \Sigma}_{k}^{-1}{\bf x}}+{\bf \mu}_{k}^{\sf T}{\bf \Sigma}_{k}^{-1}{\bf \mu}_{k}\right] }\color{blue} {+\frac{1}{2}\left[{\bf x}^{\sf T}{\bf \Sigma}_{K}^{-1}{\bf x}\underbrace{-{\bf x}^{\sf T}{\bf \Sigma}_{K}^{-1}{\bf \mu}_{K}-{\bf \mu}_{K}^{\sf T}{\bf \Sigma}_{K}^{-1}{\bf x}}_{-2 {\bf \mu}_{K}^{\sf T}{\bf \Sigma}_{K}^{-1}{\bf x}}+{\bf \mu}_{K}^{\sf T}{\bf \Sigma}_{K}^{-1}{\bf \mu}_{K}\right]}=\\ &=log \left(\frac{\pi_{k}}{\pi_{K}}\right)+log\left(\frac{|{\bf\Sigma}_{K}|^{1/2}}{|{\bf\Sigma}_{k}|^{1/2}}\right) \color{red}{ -\frac{1}{2}{\bf x}^{\sf T}{\bf \Sigma}_{k}^{-1}{\bf x} + {\bf \mu}_{k}^{\sf T}{\bf \Sigma}_{k}^{-1}{\bf x} -\frac{1}{2}{\bf \mu}_{k}^{\sf T}{\bf \Sigma}_{k}^{-1}{\bf \mu}_{k} } \color{blue}{ +\frac{1}{2}{\bf x}^{\sf T}{\bf \Sigma}_{K}^{-1}{\bf x} - {\bf \mu}_{K}^{\sf T}{\bf \Sigma}_{K}^{-1}{\bf x} +\frac{1}{2}{\bf \mu}_{K}^{\sf T}{\bf \Sigma}_{K}^{-1}{\bf \mu}_{K} }=\\ &=log \left(\frac{\pi_{k}}{\pi_{K}}\right)+log\left(\frac{|{\bf\Sigma}_{K}|^{1/2}}{|{\bf\Sigma}_{k}|^{1/2}}\right)+ \frac{1}{2} {\bf x}^{\sf T}\left({\bf \Sigma}_{K}^{-1} - {\bf \Sigma}_{k}^{-1} \right){\bf x} + \left({\bf \Sigma}_{k}^{-1}{\bf \mu}_{k}- {\bf \Sigma}_{K}^{-1}{\bf \mu}_{K} \right)^{\sf T}{\bf x}-\frac{1}{2}\left({\bf \mu}_{k}^{\sf T}{\bf \Sigma}_{k}^{-1}{\bf \mu}_{k}- {\bf \mu}_{K}^{\sf T}{\bf \Sigma}_{K}^{-1}{\bf \mu}_{K} \right) \end{split}\]
finally \[\begin{split} log \left(\frac{P(Y=k|X=x)}{P(Y=K|X=x)} \right) &= \overbrace{\frac{1}{2} {\bf x}^{\sf T}\left({\bf \Sigma}_{K}^{-1} - {\bf \Sigma}_{k}^{-1} \right){\bf x}}^{\text{second degree term}} + \overbrace{\left({\bf \Sigma}_{k}^{-1}{\bf \mu}_{k}- {\bf \Sigma}_{K}^{-1}{\bf \mu}_{K} \right)^{\sf T}{\bf x}}^{\text{first degree term}}-\frac{1}{2}\left({\bf \mu}_{k}^{\sf T}{\bf \Sigma}_{k}^{-1}{\bf \mu}_{k}- {\bf \mu}_{K}^{\sf T}{\bf \Sigma}_{K}^{-1}{\bf \mu}_{K} \right)+\\ & + log \left(\frac{\pi_{k}}{\pi_{K}}\right)+log\left(\frac{|{\bf\Sigma}_{K}|^{1/2}}{|{\bf\Sigma}_{k}|^{1/2}}\right) \end{split}\]
By defining the coefficients
\[\begin{split} a_{k} &= -\frac{1}{2}\left({\bf \mu}_{k}^{\sf T}{\bf \Sigma}_{k}^{-1}{\bf \mu}_{k}- {\bf \mu}_{K}^{\sf T}{\bf \Sigma}_{K}^{-1}{\bf \mu}_{K} \right)+ log \left(\frac{\pi_{k}}{\pi_{K}}\right)+log\left(\frac{|{\bf\Sigma}_{K}|^{1/2}}{|{\bf\Sigma}_{k}|^{1/2}}\right)\\ b_{kj} &= \text{the } j\text{-th element of } {\bf b}_{k}={\bf \Sigma}_{k}^{-1}{\bf \mu}_{k}-{\bf \Sigma}_{K}^{-1}{\bf \mu}_{K}, \text{ so that } {\bf b}_{k}^{\sf T}{\bf x}=\sum_{j=1}^{p}b_{kj}x_{j} \\ c_{kjl} &= \text{the } (j,l)\text{-th element of } {\bf C}_{k}=\frac{1}{2} \left({\bf \Sigma}_{K}^{-1} - {\bf \Sigma}_{k}^{-1} \right), \text{ so that } {\bf x}^{\sf T}{\bf C}_{k}{\bf x}=\sum_{j=1}^{p}\sum_{l=1}^{p}c_{kjl}x_{j}x_{l} \end{split}\]
the QDA logit can be written as a second-degree function of \(X\):
\[\begin{split} log \left(\frac{P(Y=k|X=x)}{P(Y=K|X=x)} \right) &= a_{k} + \sum_{j=1}^{p}b_{kj}x_{j}+\sum_{j=1}^{p}\sum_{l=1}^{p}{c_{kjl}x_{j}x_{l}} \end{split}\]
The logit, in this case, is
\[\begin{split} log \left(\frac{P(Y=k|X=x)}{P(Y=K|X=x)} \right) &= log\left(\frac{\pi_{k}f_{k}(x)}{\pi_{K}f_{K}(x)} \right)= log\left(\frac{\pi_{k}}{\pi_{K}} \right)+log\left(\frac{f_{k}(x)}{f_{K}(x)} \right)=log\left(\frac{\pi_{k}}{\pi_{K}} \right)+ log\left(\frac{\prod_{j=1}^{p}f_{kj}(x_{j})}{\prod_{j=1}^{p}f_{Kj}(x_{j})} \right)=\\ &=log\left(\frac{\pi_{k}}{\pi_{K}} \right)+\sum_{j=1}^{p}log\left(\frac{f_{kj}(x_{j})}{f_{Kj}(x_{j})} \right) \end{split}\]
Setting
\[a_{k}=log\left(\frac{\pi_{k}}{\pi_{K}} \right) \quad \text{ and } \quad g_{kj}(x_{j})=log\left(\frac{f_{kj}(x_{j})}{f_{Kj}(x_{j})} \right)\]
the logit can be re-written as a function of the predictors
\[\begin{split} log \left(\frac{P(Y=k|X=x)}{P(Y=K|X=x)} \right) &= a_{k}+\sum_{j=1}^{p}g_{kj}(x_{j}) \end{split}\]
Looking at the coefficients of the QDA, it is clear that, when \({\bf\Sigma}_{k}={\bf\Sigma}_{K}={\bf\Sigma}\) , then the QDA is just LDA
\[\begin{split} \color{red}{a_{k}}&=-\frac{1}{2}\left({\bf \mu}_{k}^{\sf T}{\bf \Sigma}_{k}^{-1}{\bf \mu}_{k}- {\bf \mu}_{K}^{\sf T}{\bf \Sigma}_{K}^{-1}{\bf \mu}_{K} \right)+ log \left(\frac{\pi_{k}}{\pi_{K}}\right)+log\left(\frac{|{\bf\Sigma}_{K}|^{1/2}}{|{\bf\Sigma}_{k}|^{1/2}}\right)=\\ &=-\frac{1}{2}\left({\bf \mu}_{k}^{\sf T}{\bf \Sigma}^{-1}{\bf \mu}_{k}- {\bf \mu}_{K}^{\sf T}{\bf \Sigma}^{-1}{\bf \mu}_{K} \right)+ log \left(\frac{\pi_{k}}{\pi_{K}}\right)+log\left(\frac{|{\bf\Sigma}|^{1/2}}{|{\bf\Sigma}|^{1/2}}\right)=\\ &=log \left(\frac{\pi_{k}}{\pi_{K}}\right)-\frac{1}{2}\left({\bf \mu}_{k}+{\bf \mu}_{K}\right)^{\sf T}{\bf \Sigma}^{-1}\left({\bf \mu}_{k}-{\bf \mu}_{K}\right)+0 \longrightarrow \color{red}{\text{as for LDA}} \end{split}\]
\[\begin{split} \color{red}{{\bf b}_{k}} &={\bf \Sigma}_{k}^{-1}{\bf \mu}_{k}-{\bf \Sigma}_{K}^{-1}{\bf \mu}_{K} = \\ &={\bf \Sigma}^{-1}{\bf \mu}_{k}-{\bf \Sigma}^{-1}{\bf \mu}_{K} = \\ &={\bf \Sigma}^{-1}\left({\bf \mu}_{k}-{\bf \mu}_{K} \right) \longrightarrow \color{red}{\text{as for LDA}} \end{split}\]
\[\begin{split} \color{red}{{\bf C}_{k}} &= \frac{1}{2} \left({\bf \Sigma}_{K}^{-1} - {\bf \Sigma}_{k}^{-1} \right)= \frac{1}{2} \left({\bf \Sigma}^{-1} - {\bf \Sigma}^{-1} \right)=\color{red}{\bf 0} \end{split}\]
Any classifier with a linear decision boundary can be defined as a naive Bayes such that
\[\begin{split} g_{kj}(x_{j})=b_{kj}x_{j} \end{split}\]
If, for naive Bayes, one assumes that \(\color{red}{f_{kj}(x_{j}) \sim N(\mu_{kj},\sigma^{2}_{j})}\) then \(g_{kj}(x_{j})=log\left(\frac{f_{kj}(x_{j})}{f_{Kj}(x_{j})}\right)=b_{kj}x_{j}\) (up to an additive constant that can be absorbed into \(a_{k}\)), with \(\color{red}{b_{kj}=(\mu_{kj}-\mu_{Kj})/\sigma^{2}_{j}}\) .
The naive Bayes, in this case, boils down to an LDA with diagonal covariance matrix \({\bf \Sigma}\)
multinomial logistic regression \[log \left(\frac{P(Y=k|X=x)}{P(Y=K|X=x)} \right)=\color{red}{\beta_{k0}+\sum_{j=1}^{p}\beta_{kj}x_{j}}\]
LDA \[log \left(\frac{P(Y=k|X=x)}{P(Y=K|X=x)} \right)=\color{blue}{a_{k}+\sum_{j=1}^{p}b_{kj}x_{j}}\]
LDA: the coefficients are estimated assuming a multivariate normal distribution of the predictors within each class
whether LDA outperforms the multinomial logistic model depends on how well this assumption is supported by the data at hand
Two predictors: \(X_{1}\sim N(\mu_{1},\sigma)\) and \(X_{2}\sim N(\mu_{2},\sigma)\). Training set size: \(n=20\)
Two predictors: \(X_{1}\sim N(\mu_{1},\sigma)\) and \(X_{2}\sim N(\mu_{2},\sigma)\); \(cor(X_{1},X_{2})=-0.5\). Training set size: \(n=20\)
Two predictors: \(X_{1}\sim t_{n-1}\) and \(X_{2}\sim t_{n-1}\) (a \(t\) distribution with \(n-1\) degrees of freedom). Training set size: \(n=50\)
Two predictors: \(X_{1}\sim N(\mu_{1},\sigma)\) and \(X_{2}\sim N(\mu_{2},\sigma)\); \(cor(X_{1},X_{2})_{class1}=0.5\), \(cor(X_{1},X_{2})_{class2}=-0.5\). Training set size: \(n=50\)
Two predictors: \(X_{1}\sim N(\mu_{1},\sigma)\) and \(X_{2}\sim N(\mu_{2},\sigma)\); \(y\) generated via a logistic function of \(X_{1}X_{2}\), \(X^{2}_{1}\) and \(X^{2}_{2}\)
\(X\sim N({\bf\mu},{\bf \Sigma}_{k})\), a bivariate normal distribution; \({\bf \Sigma}_{k}\) is diagonal and changes in each class. \(n=6\)