\(y=f(X)+\epsilon\)
the goal: finding \(\hat{f}(\cdot)\) boils down to finding \(\hat{\beta}_{0}\) and the \(\hat{\beta}_{j}\)'s
Advertising data
Code
adv_data = read_csv(file = "./data/Advertising.csv") |> select(-1)
adv_data |> slice_sample(n = 8) |>
  kbl() |> kable_styling(font_size = 10)
   TV  radio  newspaper  sales
 76.4    0.8       14.8    9.4
 66.9   11.7       36.8    9.7
265.6   20.0        0.3   17.4
151.5   41.3       58.5   18.5
 23.8   35.1       65.9    9.2
237.4   27.5       11.0   18.9
  7.8   38.9       50.6    6.6
197.6   23.3       14.2   16.6
sales is the response, indicating the level of sales in a specific market
TV, Radio and Newspaper are the predictors, indicating the advertising budget spent on the corresponding medium
Advertising data
Code
adv_data_tidy = adv_data |> pivot_longer(names_to = "medium", values_to = "budget", cols = 1:3)
adv_data_tidy |> slice_sample(n = 8) |>
  kbl() |> kable_styling(font_size = 10)
sales  medium     budget
 14.8  radio        10.1
 19.4  radio        33.5
 10.5  radio        16.0
 11.8  newspaper    23.7
 11.7  newspaper     5.9
 19.2  TV          193.7
 16.9  TV          163.3
 16.6  TV          202.5
Advertising data
Code
adv_data_tidy |>
  ggplot(aes(x = budget, y = sales)) + theme_minimal() + facet_wrap(~ medium, scales = "free") +
  geom_point(color = "indianred", alpha = .5, size = 3)
Note: the budget spent on TV reaches about 300; the budgets for Radio and Newspaper are much smaller
Advertising data: single regressions
We can be naive and regress sales on TV, newspaper and radio, separately.
Advertising data: single regressions
Code
avd_lm_models_nest = adv_data_tidy |>
  group_by(medium) |>
  group_nest(.key = "datasets") |>
  mutate(
    model_output  = map(.x = datasets, ~ lm(sales ~ budget, data = .x)),
    model_params  = map(.x = model_output, ~ tidy(.x)),
    model_metrics = map(.x = model_output, ~ glance(.x)),
    model_fitted  = map(.x = model_output, ~ augment(.x))
  )
avd_lm_models_nest
# A tibble: 3 × 6
medium datasets model_output model_params model_metrics model_fitted
  <chr>     <list<tibble[,2]>>  <list>       <list>       <list>        <list>
1 TV [200 × 2] <lm> <tibble> <tibble> <tibble>
2 newspaper [200 × 2] <lm> <tibble> <tibble> <tibble>
3 radio [200 × 2] <lm> <tibble> <tibble> <tibble>
This is a nested data structure with everything stored. The functions tidy, glance and augment pull all the information off the model output and arrange it in a tidy way.
Advertising data: nested structure
The quantities nested in the tibble can be pulled out, or they can be expanded within the tibble itself, using unnest
Code
avd_lm_models_nest |> unnest(model_params)
# A tibble: 6 × 10
medium datasets model_output term estimate std.error statistic p.value
<chr> <list<tibb> <list> <chr> <dbl> <dbl> <dbl> <dbl>
1 TV [200 × 2] <lm> (Int… 7.03 0.458 15.4 1.41e-35
2 TV [200 × 2] <lm> budg… 0.0475 0.00269 17.7 1.47e-42
3 newspaper [200 × 2] <lm> (Int… 12.4 0.621 19.9 4.71e-49
4 newspaper [200 × 2] <lm> budg… 0.0547 0.0166 3.30 1.15e- 3
5 radio [200 × 2] <lm> (Int… 9.31 0.563 16.5 3.56e-39
6 radio [200 × 2] <lm> budg… 0.202 0.0204 9.92 4.35e-19
# ℹ 2 more variables: model_metrics <list>, model_fitted <list>
Advertising data: nested structure
The quantities nested in the tibble can be pulled out, or they can be expanded within the tibble itself, using unnest
Code
avd_lm_models_nest |> unnest(model_params) |>
  select(medium, term:p.value) |>
  kbl(digits = 4) |> kable_styling(font_size = 10) |>
  row_spec(1:2, background = "#D9FDEC") |>
  row_spec(3:4, background = "#CAF6FC") |>
  row_spec(5:6, background = "#FDB5BA")
medium     term         estimate  std.error  statistic  p.value
TV         (Intercept)    7.0326     0.4578    15.3603   0.0000
TV         budget         0.0475     0.0027    17.6676   0.0000
newspaper  (Intercept)   12.3514     0.6214    19.8761   0.0000
newspaper  budget         0.0547     0.0166     3.2996   0.0011
radio      (Intercept)    9.3116     0.5629    16.5422   0.0000
radio      budget         0.2025     0.0204     9.9208   0.0000
The effect of budget on sales differs significantly from 0, for all the considered media
Ordinary least squares
\[\min_{\hat{\beta}_{0},\hat{\beta}_{1}}: RSS=\sum_{i=1}^{n}{e_{i}^{2}}=\sum_{i=1}^{n}{(y_{i}-\hat{y}_{i})^{2}}=\sum_{i=1}^{n}{(y_{i}-\hat{\beta}_{0}-\hat{\beta}_{1}x_{i})^{2}}\]
OLS estimator of \(\beta_{0}\)
\[\begin{split}&\partial_{\hat{\beta}_{0}}\sum_{i=1}^{n}{(y_{i}-\hat{\beta}_{0}-\hat{\beta}_{1}x_{i})^{2}}=
-2\sum_{i=1}^{n}{(y_{i}-\hat{\beta}_{0}-\hat{\beta}_{1}x_{i})}=0\\
&\Rightarrow\ \sum_{i=1}^{n}{y_{i}}-n\hat{\beta}_{0}-\hat{\beta}_{1}\sum_{i=1}^{n}{x_{i}}=0
\ \Rightarrow\
\hat{\beta}_{0}=\bar{y}-\hat{\beta}_{1}\bar{x}\end{split}\]
Ordinary least squares
\[\min_{\hat{\beta}_{0},\hat{\beta}_{1}}\sum_{i=1}^{n}{e_{i}^{2}}=\sum_{i=1}^{n}{(y_{i}-\hat{y}_{i})^{2}}=\sum_{i=1}^{n}{(y_{i}-\hat{\beta}_{0}-\hat{\beta}_{1}x_{i})^{2}}\]
OLS estimator of \(\beta_{1}\)
\[\begin{split}&\partial_{\hat{\beta}_{1}}\sum_{i=1}^{n}{(y_{i}-\hat{\beta}_{0}-\hat{\beta}_{1}x_{i})^{2}}=
-2\sum_{i=1}^{n}{x_{i}(y_{i}-\hat{\beta}_{0}-\hat{\beta}_{1}x_{i})}=0
\ \Rightarrow\
\sum_{i=1}^{n}{x_{i}y_{i}}-\hat{\beta}_{0}\sum_{i=1}^{n}{x_{i}}-\hat{\beta}_{1}\sum_{i=1}^{n}{x_{i}^{2}}=0\\
&\hat{\beta}_{1}\sum_{i=1}^{n}{x_{i}^{2}}=\sum_{i=1}^{n}{x_{i}y_{i}}-\sum_{i=1}^{n}{x_{i}}\left(\frac{\sum_{i=1}^{n}{y_{i}}}{n}-\hat{\beta}_{1}\frac{\sum_{i=1}^{n}{x_{i}}}{n}\right)\\
&\hat{\beta}_{1}\left(n\sum_{i=1}^{n}{x_{i}^{2}}-\left(\sum_{i=1}^{n}{x_{i}}\right)^{2}\right)=
n\sum_{i=1}^{n}{x_{i}y_{i}}-\sum_{i=1}^{n}{x_{i}}\sum_{i=1}^{n}{y_{i}}\\
&\hat{\beta}_{1}=\frac{n\sum_{i=1}^{n}{x_{i}y_{i}}-\sum_{i=1}^{n}{x_{i}}\sum_{i=1}^{n}{y_{i}}}
{n\sum_{i=1}^{n}{x_{i}^{2}}-\left(\sum_{i=1}^{n}{x_{i}}\right)^{2}}=\frac{\sigma_{xy}}{\sigma^{2}_{x}}
\end{split}\]
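These closed-form expressions can be checked numerically against lm(); a minimal sketch, assuming the adv_data tibble loaded above:
Code
# OLS 'by hand' for sales ~ TV, compared with lm()
x = adv_data$TV
y = adv_data$sales
b1 = cov(x, y) / var(x)                 # beta_1 = s_xy / s_x^2 (the n-1 factors cancel)
b0 = mean(y) - b1 * mean(x)             # beta_0 = ybar - beta_1 * xbar
c(b0, b1)
coef(lm(sales ~ TV, data = adv_data))   # should coincide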
Three regression lines
Code
three_regs_plot = avd_lm_models_nest |> unnest(model_fitted) |>
  select(medium, sales:.fitted) |>
  ggplot(aes(x = budget, y = sales)) + theme_minimal() + facet_grid(~ medium, scales = "free") +
  geom_point(alpha = .5, color = "indianred") +
  geom_smooth(method = "lm") +
  geom_segment(aes(x = budget, xend = budget, y = .fitted, yend = sales), color = "forestgreen", alpha = .25)
three_regs_plot
Advertising data: nested structure
The quantities nested in the tibble can be pulled out, or they can be expanded within the tibble itself, using \(\texttt{unnest}\)
Code
avd_lm_models_nest |> unnest(model_metrics) |>
  select(medium, r.squared:df.residual) |>
  kbl(digits = 4) |> kable_styling(font_size = 10) |>
  row_spec(1, background = "#D9FDEC") |>
  row_spec(2, background = "#CAF6FC") |>
  row_spec(3, background = "#FDB5BA")
medium     r.squared  adj.r.squared   sigma  statistic  p.value  df     logLik       AIC       BIC  deviance  df.residual
TV            0.6119         0.6099  3.2587   312.1450   0.0000   1  -519.0457  1044.091  1053.986  2102.531          198
newspaper     0.0521         0.0473  5.0925    10.8873   0.0011   1  -608.3357  1222.671  1232.566  5134.805          198
radio         0.3320         0.3287  4.2749    98.4216   0.0000   1  -573.3369  1152.674  1162.569  3618.479          198
Linear regression: model assumptions
The linear regression model is
\[y_{i}=\beta_{0}+\beta_{1}x_{i}+\epsilon_{i}\]
where \(\epsilon_{i}\) is a random variable with expected value 0. For inference, more assumptions must be made:
\(\epsilon_{i}\sim N(0,\sigma^{2})\): in particular, the variance of the errors \(\sigma^{2}\) does not depend on \(x_{i}\)
\(Cov(\epsilon_{i},\epsilon_{i'})=0\), for all \(i\neq i'\) and \(i,i'=1,\ldots,n\)
the \(x_{i}\) are non-stochastic
It follows that \(y_{i}\) is a random variable such that \(y_{i}\sim N(\beta_{0}+\beta_{1}x_{i},\sigma^{2})\)
Linear regression: model assumptions
Statistics for Business and Economics (Anderson, Sweeney and Williams, 2011)
Inference on \(\hat{\beta}_{1}\) : hypothesis testing
The null hypothesis is \(\beta_{1}=0\) and the test statistic is
\[\frac{\hat{\beta}_{1}-\beta_{1}}{SE(\hat{\beta}_{1})}\]
which, under the null hypothesis, reduces to \(\frac{\hat{\beta}_{1}}{SE(\hat{\beta}_{1})}\), distributed as a Student's t with \(n-2\) degrees of freedom
\[\text{Reject } H_{0} \text{ if } \left|\frac{\hat{\beta}_{1}}{SE(\hat{\beta}_{1})}\right|>t_{1-\frac{\alpha}{2},n-2}\]
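The test can be reproduced from the regression output; a minimal sketch for the TV slope, assuming adv_data from above:
Code
# t statistic and two-sided p-value for the slope, with n - 2 degrees of freedom
tv_fit = lm(sales ~ TV, data = adv_data)
b1  = coef(summary(tv_fit))["TV", "Estimate"]
se1 = coef(summary(tv_fit))["TV", "Std. Error"]
t_stat = b1 / se1
2 * pt(-abs(t_stat), df = nrow(adv_data) - 2)   # matches the p.value reported by tidy()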
Linear regression: multiple predictors
\(y\) is the numeric response; \(X_{1},\ldots,X_{p}\) are the predictors, the general formula for the linear model is \[y = f(X)+\epsilon=\beta_{0}+\sum_{j=1}^{p}X_{j}\beta_{j}+ \epsilon\]
In algebraic form
\[{\bf y}={\bf X}{\bf \beta}+{\bf \epsilon}\]
\[\begin{bmatrix}
y_{1}\\
y_{2}\\
y_{3}\\
\vdots\\
y_{n}
\end{bmatrix}=\begin{bmatrix}
1& x_{1,1}&\ldots&x_{1,p}\\
1& x_{2,1}&\ldots&x_{2,p}\\
1& x_{3,1}&\ldots&x_{3,p}\\
\vdots&\vdots&\ddots&\vdots\\
1& x_{n,1}&\ldots&x_{n,p}\\
\end{bmatrix}\begin{bmatrix}
\beta_{0}\\
\beta_{1}\\
\vdots\\
\beta_{p}\\
\end{bmatrix}+\begin{bmatrix}
\epsilon_{1}\\
\epsilon_{2}\\
\epsilon_{3}\\
\vdots\\
\epsilon_{n}
\end{bmatrix}\]
Linear regression: ordinary least square (OLS)
OLS target
\[\min_{\hat{\beta}_{0},\hat{\beta}_{1},\ldots,\hat{\beta}_{p}}\sum_{i=1}^{n}\left(y_{i}-\hat{\beta}_{0}-\sum_{j=1}^{p}x_{ij}\hat{\beta}_{j}\right)^{2}=\min_{\hat{\beta}_{0},\hat{\beta}_{1},\ldots,\hat{\beta}_{p}}RSS\]
OLS target (algebraic) \[ \min_{\hat{\beta}} ({\bf y}-{\bf X}{\bf \hat{\beta}})^{\sf T}({\bf y}-{\bf X}{\bf \hat{\beta}})\]
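The algebraic target has the closed-form solution \(\hat{\beta}=({\bf X}^{\sf T}{\bf X})^{-1}{\bf X}^{\sf T}{\bf y}\); a minimal sketch of the normal equations, assuming adv_data from above:
Code
# OLS via the normal equations, compared with lm()
X = model.matrix(~ TV + radio + newspaper, data = adv_data)   # adds the intercept column
y = adv_data$sales
solve(crossprod(X), crossprod(X, y))    # (X'X)^{-1} X'y
coef(lm(sales ~ ., data = adv_data))    # should coincide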
Advertising data: multiple linear regression
At this stage there is no pre-processing and no test set evaluation: just fit the regression model on the whole data set
pre-processing: define a \(\texttt{recipe}\)
Code
adv_recipe = recipe(sales ~ ., data = adv_data)
model specification: define a \(\texttt{parsnip}\) model
Code
adv_model = linear_reg(mode = "regression", engine = "lm")
define the \(\texttt{workflow}\)
Code
adv_wflow = workflow() |>
  add_recipe(recipe = adv_recipe) |>
  add_model(adv_model)
fit the model
Code
adv_fit = adv_wflow |>
  fit(data = adv_data)
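Once fitted, the workflow can be used for prediction; a minimal sketch with a hypothetical new market (budget values chosen for illustration only):
Code
new_market = tibble(TV = 100, radio = 20, newspaper = 10)
predict(adv_fit, new_data = new_market)   # predicted sales for the new budget allocation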
Advertising data: back to single models
medium     term         estimate  std.error  statistic  p.value
TV         (Intercept)    7.0326     0.4578    15.3603   0.0000
TV         budget         0.0475     0.0027    17.6676   0.0000
newspaper  (Intercept)   12.3514     0.6214    19.8761   0.0000
newspaper  budget         0.0547     0.0166     3.2996   0.0011
radio      (Intercept)    9.3116     0.5629    16.5422   0.0000
radio      budget         0.2025     0.0204     9.9208   0.0000
Advertising data: single vs multiple regression
Code
avd_lm_models_nest |> unnest(model_params) |>
  select(medium, term:p.value) |>
  filter(term != "(Intercept)") |> arrange(medium) |>
  kbl(digits = 4) |> kable_styling(font_size = 12) |>
  column_spec(c(3, 6), bold = TRUE) |>
  row_spec(1, background = "#D9FDEC") |>
  row_spec(2, background = "#CAF6FC") |>
  row_spec(3, background = "#FDB5BA")
medium     term    estimate  std.error  statistic  p.value
TV         budget    0.0475     0.0027    17.6676   0.0000
newspaper  budget    0.0547     0.0166     3.2996   0.0011
radio      budget    0.2025     0.0204     9.9208   0.0000
Code
adv_fit |> tidy() |>
  filter(term != "(Intercept)") |> arrange(term) |>
  kbl(digits = 4) |> kable_styling(font_size = 12) |>
  column_spec(c(2, 5), bold = TRUE) |>
  row_spec(1, background = "#D9FDEC") |>
  row_spec(2, background = "#CAF6FC") |>
  row_spec(3, background = "#FDB5BA")
term       estimate  std.error  statistic  p.value
TV           0.0458     0.0014    32.8086   0.0000
newspaper   -0.0010     0.0059    -0.1767   0.8599
radio        0.1885     0.0086    21.8935   0.0000
Why these contradictory results?
Advertising data: single vs multiple regression
Code
library("corrr")
adv_data |> correlate() |> network_plot()
\(\texttt{newspaper}\) is correlated with \(\texttt{radio}\), and the latter has a significant effect on \(\texttt{sales}\) in the multiple regression model
that is why, in the single regression, \(\texttt{newspaper}\) appears to have a significant effect on \(\texttt{sales}\) ( \(\texttt{radio}\) is ignored there)
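The pairwise correlation behind this can also be checked numerically; a minimal sketch:
Code
cor(adv_data$radio, adv_data$newspaper)   # moderate positive correlation between the two budgets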
supervised learning flow
split your observations into training, validation (or use cross-validation) and test sets
transform the predictors properly (feature engineering)
select a reasonable grid of model hyperparameter values to choose from
for each combination of hyperparameters
fit the model on the training observations
compute appropriate metrics on the evaluation observations
pick the best hyperparameter combination
compute the metric for the tuned model on the test set (observations never used yet)
obtain the final fit for the model on all the available observations (see the sketch below)
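A minimal sketch of this flow with the tidymodels packages, on the Advertising data from above (plain linear regression has no hyperparameters to tune, so the tuning steps are skipped):
Code
set.seed(123)
adv_split = initial_split(adv_data, prop = 3/4)                    # train/test split (rsample)
adv_wf = workflow() |>
  add_recipe(recipe(sales ~ ., data = training(adv_split))) |>     # pre-processing (recipes)
  add_model(linear_reg(mode = "regression", engine = "lm"))        # model specification (parsnip)
adv_last = last_fit(adv_wf, split = adv_split)                     # fit on training, evaluate on test (tune)
collect_metrics(adv_last)                                          # test-set metrics (yardstick)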
All the core packages in the tidymodels refer to one step of a supervised learning flow. For all things tidymodels, check tidymodels.org!
the tidymodels core
the rsample package provides tools for data splitting and resampling
the recipes package provides tools for data pre-processing and feature engineering
the parsnip package provides a unified interface to recall several models available in R
the workflows package combines together pre-processing, modeling and post-processing steps
the yardstick package provides several performance metrics
the dials package provides tools to define hyperparameter value grids
the tune package remarkably simplifies the implementation of hyperparameter optimization
the broom package provides utility functions to tidy model output
classification model
In a classification problem the response is a categorical variable
rather than predicting the value of \(Y\) , one wants to estimate the posterior probability
\[P(Y=k\mid X=x_{i})\]
that is, the probability that observation \(i\) belongs to class \(k\), given that the predictor value for \(i\) is \(x_{i}\)
will a credit card owner default?
Code
set.seed(1234)
default = read_csv("./data/Default.csv")
default |> slice_sample(n = 6) |> kbl() |> kable_styling(font_size = 12)
default  student     balance    income
No       Yes       311.32186  22648.76
No       Yes       697.13558  18377.15
No       Yes       470.10718  16014.11
No       No       1200.04162  56081.08
No       No        553.64902  47021.49
No       No         10.23149  27237.38
Do balance and income bring useful info to identify defaulters?
Code
p1 = default |> ggplot(aes(x = balance, y = income, color = default)) +
  geom_point(alpha = .5, size = 3) + theme_minimal()
p1
Do balance and income bring useful info to identify defaulters?
Code
p2 = default |> pivot_longer(names_to = "predictor", values_to = "value", cols = c(balance, income)) |>
  ggplot(aes(x = default, y = value)) + geom_boxplot(aes(fill = default)) +
  facet_wrap(~ predictor, scales = "free") + theme_minimal()
p2
balance is useful, income is not
Note: to arrange multiple plots together, take a look at the patchwork package
linear regression for non-numeric response?
if \(Y\) is categorical
With \(K\) categories, one could code \(Y\) as an integer vector
for non-ordinal factors it would not make sense (the integer coding would impose an arbitrary ordering)
even for ordinal factors, one would implicitly assume constant differences between categories
linear regression for non-numeric response?
if \(Y\) is binary
The goal is to estimate \(P(Y=1|X)\), which is, in fact, numeric . . .
So one can model \(P(Y=1|X)=\beta_{0}+\beta_{1}X\) . . .
What can possibly be wrong with this approach?
linear regression for non-numeric response?
\(P(\texttt{default}=\texttt{yes}|\texttt{balance})=\beta_{0}+\beta_{1}\texttt{balance}\)
Code
default |> mutate(def_01 = ifelse(default == "Yes", 1, 0)) |>
  ggplot(aes(x = balance, y = def_01)) + geom_point(aes(color = default), size = 4, alpha = .5) +
  geom_smooth(method = "lm") + theme_minimal()
\(P(\texttt{default}=\texttt{yes}|\texttt{balance})\) is bounded in \([0, 1]\), while the linear fit is unbounded: the linear fit cannot be used
use the logistic function instead
\(P(\texttt{default}=\texttt{yes}|\texttt{balance})=\frac{e^{\beta_{0}+\beta_{1}\texttt{balance}}}{1+e^{\beta_{0}+\beta_{1}\texttt{balance}}}\)
modeling the posterior \(P(Y=1|X)\) by means of a logistic function is the goal of logistic regression
just like in linear regression, the fit refers to the conditional expectation of \(Y\) given \(X\) ; since \(Y\in\{0,1\}\) , it results that \[E[Y|X] \equiv P(Y=1|X)\]
The odds and the logit
\[
\begin{split}
p(X)&=\frac{e^{\beta_{0}+\beta_{1}X}}{1+e^{\beta_{0}+\beta_{1}X}}\\
\left(1+e^{\beta_{0}+\beta_{1}X}\right)p(X)&=e^{\beta_{0}+\beta_{1}X}\\
p(X)+e^{\beta_{0}+\beta_{1}X}p(X)&=e^{\beta_{0}+\beta_{1}X}\\
p(X)&=e^{\beta_{0}+\beta_{1}X}-e^{\beta_{0}+\beta_{1}X}p(X)\\
p(X)&=e^{\beta_{0}+\beta_{1}X}\left(1-p(X)\right)\\
\frac{p(X)}{\left(1-p(X)\right)}&=e^{\beta_{0}+\beta_{1}X}
\end{split}
\]
\(\frac{p(X)}{\left(1-p(X)\right)}\) are the odds, which take values in \([0,\infty)\)
\(log\left(\frac{p(X)}{\left(1-p(X)\right)}\right)=\beta_{0}+\beta_{1}X\) is the logit: there is a linear relation between the logit and the predictor
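In R, the logistic map and its inverse, the logit, are available as plogis() and qlogis(); a minimal sketch with hypothetical coefficient values:
Code
beta0 = -4; beta1 = 0.005          # hypothetical values, for illustration only
p = plogis(beta0 + beta1 * 1000)   # e^(b0 + b1 x) / (1 + e^(b0 + b1 x)), here x = 1000
qlogis(p)                          # log(p / (1 - p)) = b0 + b1 * x = 1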
Estimating the posterior probability via the logistic function
a toy sample
Estimating the posterior probability via the logistic function
a toy sample: fit the logistic function
Estimating the posterior probability via the logistic function
a toy sample: for a new point \(\texttt{balance}=1400\)
Estimating the posterior probability via the logistic function
a toy sample: one can estimate \(P(\texttt{default=Yes}|\texttt{balance}=1400)\)
Estimating the posterior probability via the logistic function
a toy sample: one can estimate \(P(\texttt{default=Yes}|\texttt{balance}=1400)=0.62\)
Estimating the posterior probability via the logistic function
How to find the logistic function? Estimate its parameters in \(P(Y=1|X) = \frac{e^{\beta_{0}+\beta_{1}X}}{1 + e^{\beta_{0}+\beta_{1}X}}\)
Estimating the posterior probability via the logistic function
Least squares? One could switch to the logit, which is a linear function: \(\mathrm{logit}(p(Y=1|X))=\beta_{0} + \beta_{1} X\)
Estimating the posterior probability via the logistic function
Least squares? But the logit mapping, for the blue points (where \(p(Y=1|X)=1\)), is \[\log\left(\frac{p(Y=1|X)}{1-p(Y=1|X)}\right) =
\log\left(\frac{1}{1-1}\right) = \log(1) - \log(0) = 0 - (-\infty) = +\infty\]
Estimating the logit via the linear function
Least squares? But the logit mapping, for the red points (where \(p(Y=1|X)=0\)), is \[\log\left(\frac{p(Y=1|X)}{1-p(Y=1|X)}\right) =
\log\left(\frac{0}{1-0}\right) = \log(0) = -\infty\]
Estimating the logit via the linear function
logit: mapping
Maximization of the likelihood function
The estimates for \(\beta_{0}\) and \(\beta_{1}\) are such that the following likelihood function is maximised:
\[
\begin{split}
\ell\left(\hat{\beta}_{0},\hat{\beta}_{1}\right)=&
\color{blue}{\prod_{\forall i}{p(x_{i})}}\times \color{red}{\prod_{\forall i'}{\left(1-p(x_{i'})\right)}}=\\
=&\color{blue}{
\prod_{\forall i}
\frac{e^{\hat{\beta}_{0}+\hat{\beta}_{1}x_{i}}}{1+e^{\hat{\beta}_{0}+\hat{\beta}_{1}x_{i}}}}
\times
\color{red}{ \prod_{\forall i^{\prime}}{\left(1-
\frac{e^{\hat{\beta}_{0}+\hat{\beta}_{1}x_{i'}}}{1+e^{\hat{\beta}_{0}+\hat{\beta}_{1}x_{i'}}}
\right)}
}
\end{split}
\]
Note: the \(i\) index is for blue points, \(i'\) is for red points
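A minimal sketch of this maximisation on hypothetical simulated data, comparing a hand-written optimiser with glm():
Code
# maximum likelihood for simple logistic regression, done 'by hand' with optim()
set.seed(1)
x = rnorm(200)
y = rbinom(200, size = 1, prob = plogis(-1 + 2 * x))   # hypothetical data with beta0 = -1, beta1 = 2
negloglik = function(beta) {
  eta = beta[1] + beta[2] * x
  -sum(y * eta - log(1 + exp(eta)))                    # minus the log of the likelihood above
}
optim(c(0, 0), negloglik)$par                          # ML estimates of beta_0, beta_1
coef(glm(y ~ x, family = binomial))                    # should be very close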
Fitting the logistic regression
pre-process: specify the recipe
Code
def_rec = recipe(default ~ balance, data = default)
specify the model
Code
def_model_spec = logistic_reg(mode = "classification", engine = "glm")
put them together in the workflow
Code
def_wflow = workflow() |> add_recipe(def_rec) |> add_model(def_model_spec)
fit the model
Code
def_fit = def_wflow |> fit(data = default)
Fitting the logistic regression
Look at the results
Code
def_fit |> tidy() |> kbl() |> kable_styling(font_size = 12)
term            estimate  std.error  statistic  p.value
(Intercept)  -10.6513306  0.3611574  -29.49221        0
balance        0.0054989  0.0002204   24.95309        0
Code
def_fit |> glance() |> kbl() |> kable_styling(font_size = 12)
null.deviance  df.null     logLik       AIC       BIC  deviance  df.residual   nobs
      2920.65     9999  -798.2258  1600.452  1614.872  1596.452         9998  10000
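With these estimates one can compute a predicted probability directly; a minimal sketch for a balance of 2000 (a value chosen for illustration):
Code
plogis(-10.6513306 + 0.0054989 * 2000)                                # estimated P(default = Yes | balance = 2000), about 0.59
predict(def_fit, new_data = tibble(balance = 2000), type = "prob")   # same, via the fitted workflow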
Fitting the logistic regression: a qualitative predictor
Suppose you want to use \(\texttt{student}\) as a qualitative predictor in your logistic regression. You can update, within the workflow, just the recipe.
update the recipe in the workflow and re-fit
Code
def_wflow2 = def_wflow |> update_recipe(recipe(default ~ student, data = default))
def_fit2 = def_wflow2 |> fit(data = default)
look at the results
Code
def_fit2 |> tidy() |> kbl() |> kable_styling(font_size = 12)
term           estimate  std.error   statistic    p.value
(Intercept)  -3.5041278  0.0707130  -49.554219  0.0000000
studentYes    0.4048871  0.1150188    3.520181  0.0004313
Fitting the logistic regression: a qualitative predictor
It appears that a student customer is more likely to default ( \(\hat{\beta}_{1} = 0.4\) ).
For non students \[
\begin{split}
log\left(\frac{p(X)}{1-p(X)}\right) = -3.504 &\rightarrow& \ \frac{p(X)}{1-p(X)} = e^{-3.504}\rightarrow\\
p(X) = e^{-3.504}-e^{-3.504}p(X)&\rightarrow& \ p(X) +e^{-3.504}p(X) = e^{-3.504}\rightarrow\\
p(X)(1 +e^{-3.504}) = e^{-3.504} &\rightarrow & \ p(X) = \frac{e^{-3.504}}{1 +e^{-3.504}}=\color{blue}{0.029}
\end{split}
\]
For students \[
p(X) = \frac{e^{-3.504+0.4}}{1 +e^{-3.504+0.4}}=\color{red}{0.042}
\]
Multiple logistic regression
In case of multiple predictors
\[log\left(\frac{p(X)}{1-p(X)} \right)=\beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{2}+\ldots+\beta_{p}X_{p}\]
and the following relation holds
\[p(X)=\frac{e^{{\beta}_{0}+{\beta}_{1}X_{1}+{\beta}_{2}X_{2}+\ldots+{\beta}_{p}X_{p}}}{1+e^{{\beta}_{0}+{\beta}_{1}X_{1}+{\beta}_{2}X_{2}+\ldots+{\beta}_{p}X_{p}}}\]
Multiple logistic regression
Let’s consider two predictors \(\texttt{balance}\) and \(\texttt{student}\) , again we just update the recipe within the workflow
update the recipe in the workflow and re-fit
Code
def_wflow3 = def_wflow |> update_recipe(recipe(default ~ balance + student, data = default))
def_fit3 = def_wflow3 |> fit(data = default)
look at the results
Code
def_fit3 |> tidy() |> kbl() |> kable_styling(font_size = 12)
term            estimate  std.error   statistic  p.value
(Intercept)  -10.7494959  0.3691914  -29.116326  0.0e+00
balance        0.0057381  0.0002318   24.749526  0.0e+00
studentYes    -0.7148776  0.1475190   -4.846003  1.3e-06
Anything weird? Note that the \(\texttt{studentYes}\) coefficient is now negative.
Multiple logistic regression
Code
p_stu = def_fit2 |> augment(new_data = default) |>
  ggplot(aes(x = balance, y = .pred_Yes, color = student)) + geom_line(size = 3) +
  theme_minimal() + ylim(c(0, .1)) + xlab("balance") + ylab("P(default=yes|student)")
p_stu
Multiple logistic regression
Code
p_bal_stu = def_fit3 |> augment(new_data = default) |> ggplot(aes(x = balance, y = .pred_Yes, color = student)) +
  geom_line(size = 3) + theme_minimal() + xlab("balance") + ylab("P(default=yes|balance,student)")
p_bal_stu
linear discriminant analysis (LDA)
linear discriminant analysis LDA: one predictor
In linear discriminant analysis, the assumption on the class-conditional distribution is that, within class \(k\),
\[X\sim N(\mu_{k},\sigma^{2})\]
therefore
\[f_{k}(X)=\frac{1}{\sqrt{2\pi}\sigma}exp\left(-\frac{1}{2\sigma^{2}}\left( x-\mu_{k} \right)^{2}\right)\]
in each class, the predictor is normally distributed
the scale parameter \(\sigma^{2}\) is the same for each class
linear discriminant analysis LDA: one predictor
Plugging in \(f_{k}(X)\) in the Bayes formula
\[p_{k}(x)=\frac{\pi_{k}\times f_{k}(x)}{\sum_{l=1}^{K}{\pi_{l}\times f_{l}(x)}} = \frac{\pi_{k}\times \overbrace{\frac{1}{\sqrt{2\pi}\sigma}exp\left(-\frac{1}{2\sigma^{2}}\left( x-\mu_{k} \right)^{2}\right)}^{\color{red}{{f_{k}(x)}}}}
{\sum_{l=1}^{K}{\pi_{l}\times \underbrace{\frac{1}{\sqrt{2\pi}\sigma}exp\left(-\frac{1}{2\sigma^{2}}\left( x-\mu_{l} \right)^{2}\right)}_{\color{red}{{f_{l}(x)}}}}}\]
one needs to estimate the following parameters
\(\mu_{1},\mu_{2},\ldots, \mu_{K}\)
\(\pi_{1},\pi_{2},\ldots, \pi_{K}\)
\(\sigma\)
to get, for each observation \(x\), \(\hat{p}_{1}(x),\hat{p}_{2}(x),\ldots,\hat{p}_{K}(x)\): the observation is then assigned to the class for which \(\hat{p}_{k}(x)\) is largest.
\(\color{red}{\text{Note}}:\) not all the quantities involved in the Bayes formula play a role in the classification of an object: in fact, some of them are constant across the classes.
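A minimal sketch of the plug-in estimates of these parameters for \(\texttt{balance}\), assuming the default tibble loaded above:
Code
n_tot = nrow(default)
default |>
  group_by(default) |>
  summarise(pi_k = n() / n_tot,       # estimated prior of each class
            mu_k = mean(balance),     # estimated class mean
            s2_k = var(balance))      # within-class variance (a pooled estimate would combine these)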
two classes, same priors.
Consider a single predictor \(X\), normally distributed within the two classes, with parameters \(\mu_{1}\), \(\mu_{2}\) and common variance \(\sigma^{2}\).
Also \(\pi_{1}=\pi_{2}\) and, say, \(\mu_{1}>\mu_{2}\) (so that dividing by \(\mu_{1}-\mu_{2}\) below preserves the inequality). The observation \(x\) is assigned to class 1 if
\[\begin{split}
\delta_{1}(X) &>&\delta_{2}(X)\\
\color{blue}{\text{that is}} \\
log({\pi_{1}})-\frac{\mu_{1}^{2}}{2\sigma^{2}} + \frac{\mu_{1}}{\sigma^{2}}x &>&
log({\pi_{2}})-\frac{\mu_{2}^{2}}{2\sigma^{2}} + \frac{\mu_{2}}{\sigma^{2}}x \\
\color{blue}{\text{ since }\pi_{1}=\pi_{2}}\\
-\frac{\mu_{1}^{2}}{2\sigma^{2}} + \frac{\mu_{1}}{\sigma^{2}}x >
-\frac{\mu_{2}^{2}}{2\sigma^{2}} + \frac{\mu_{2}}{\sigma^{2}}x & \ \rightarrow \ &
-\frac{\mu_{1}^{2}}{2} + \mu_{1}x >
-\frac{\mu_{2}^{2}}{2} + \mu_{2}x\\
(\mu_{1} - \mu_{2})x >
\frac{\mu_{1}^{2}-\mu_{2}^{2}}{2} & \ \rightarrow \ &
x >
\frac{(\mu_{1}+\mu_{2})(\mu_{1}-\mu_{2})}{2(\mu_{1} - \mu_{2})}\\
x &>&
\frac{(\mu_{1}+\mu_{2})}{2}\\
\end{split}\]
the Bayes decision boundary, in which \(\delta_{1}=\delta_{2}\) , is at \(\color{red}{x=\frac{(\mu_{1}+\mu_{2})}{2}}\)
two classes, same priors.
Code
set.seed(1234)
p_1 = ggplot() + xlim(-12, 12) + theme_minimal() + xlim(-10, 10) +
  stat_function(fun = dnorm, args = list(mean = 4, sd = 2), geom = "area", fill = "dodgerblue", alpha = .25) +
  stat_function(fun = dnorm, args = list(mean = -4, sd = 2), geom = "area", fill = "indianred", alpha = .25) +
  geom_vline(xintercept = 0, size = 2, alpha = .5) + geom_vline(xintercept = -4, color = "grey", size = 3, alpha = .5) +
  geom_vline(xintercept = 4, color = "grey", size = 3, alpha = .5) +
  geom_point(aes(x = -2, y = 0), inherit.aes = FALSE, size = 10, alpha = .5, color = "darkgreen") +
  geom_point(aes(x = 1, y = 0), inherit.aes = FALSE, size = 10, alpha = .5, color = "magenta") + xlab(0)
p_1
The \(\color{darkgreen}{\text{green point}}\) goes to class 1, the \(\color{magenta}{\text{pink point}}\) goes to class 2
two classes, same priors.
Consider a training set with 100 observations from the two classes (50 each): one needs to estimate \(\mu_1\) and \(\mu_2\) to have the estimated boundary at \(\frac{\hat{\mu}_{1}+\hat{\mu}_{2}}{2}\)
Code
class_12 = tibble(class_1 = rnorm(50, mean = -4, sd = 2), class_2 = rnorm(50, mean = 4, sd = 2)) |>
  pivot_longer(names_to = "classes", values_to = "values", cols = 1:2)
mu_12 = class_12 |> group_by(classes) |> summarise(means = mean(values))
mu_12_mean = mean(mu_12$means)
p_2 = class_12 |> ggplot(aes(x = values, fill = classes)) + theme_minimal() +
  geom_histogram(aes(y = after_stat(density)), alpha = .5, color = "grey") + xlim(-10, 10) +
  geom_vline(xintercept = mu_12 |> pull(means), color = "grey", size = 3, alpha = .75) +
  geom_vline(xintercept = mu_12_mean, size = 2, alpha = .75) + theme(legend.position = "none")
p_2
two classes, same priors.
The Bayes boundary is at \(\color{red}{0}\); the estimated boundary is slightly off, at -0.31
Fitting the LDA
Code
default = read_csv("./data/Default.csv") |> mutate(default = as.factor(default))
set.seed(1234)
def_split = initial_split(default, prop = 3/4, strata = default)
default_test = testing(def_split)
default = training(def_split)
def_rec = recipe(default ~ balance, data = default)
def_lda_spec = discrim_linear(mode = "classification", engine = "MASS")
def_wflow_lda = workflow() |> add_recipe(def_rec) |> add_model(def_lda_spec)
def_fit_lda = def_wflow_lda |> fit(data = default)
def_fit_lda |> extract_fit_engine()
Call:
lda(..y ~ ., data = data)
Prior probabilities of groups:
No Yes
0.9667 0.0333
Group means:
balance
No 803.9438
Yes 1747.8217
Coefficients of linear discriminants:
LD1
balance 0.002206916
Class-specific error
a Bayes-classifier-like rule (assign each observation to its most probable class) maximises the overall accuracy
to reduce the error on a specific class (here, the actual defaulters that go undetected), it is possible to shift the threshold away from 0.5 (the value implied by the Bayes classifier)
Code
lda_pred = def_fit_lda |> augment(new_data = default_test) |>
  dplyr::select(default, .pred_class, .pred_Yes) |>
  mutate(.pred_class_0_05 = as.factor(ifelse(.pred_Yes > .05, "Yes", "No")),
         .pred_class_0_1  = as.factor(ifelse(.pred_Yes > .1,  "Yes", "No")),
         .pred_class_0_2  = as.factor(ifelse(.pred_Yes > .2,  "Yes", "No")),
         .pred_class_0_3  = as.factor(ifelse(.pred_Yes > .3,  "Yes", "No")),
         .pred_class_0_4  = as.factor(ifelse(.pred_Yes > .4,  "Yes", "No")),
         .pred_class_0_5  = as.factor(ifelse(.pred_Yes > .5,  "Yes", "No"))
  )
Class-specific error
Code
lda_pred |>
  pivot_longer(names_to = "threshold", values_to = "prediction", cols = 4:9) |>
  dplyr::select(-.pred_class, -.pred_Yes) |> group_by(threshold) |>
  summarise(
    accuracy = round(mean(default == prediction), 2),
    # error rate among actual defaulters (missed defaulters), i.e. the false negative rate
    false_negative_rate = sum((default == "Yes") & (default != prediction)) / sum(default == "Yes")
  ) |> arrange(desc(accuracy), desc(false_negative_rate)) |> kbl() |> kable_styling(font_size = 10)
threshold         accuracy  false_negative_rate
.pred_class_0_5       0.97            0.8160920
.pred_class_0_4       0.97            0.7126437
.pred_class_0_3       0.97            0.6206897
.pred_class_0_2       0.96            0.4942529
.pred_class_0_1       0.94            0.2873563
.pred_class_0_05      0.89            0.1609195
Since the false negative rate (the error on actual defaulters) increases along with the overall accuracy, reducing it will cause the overall accuracy of the classifier to drop
ROC curve
Code
def_fit_lda |> augment(new_data = default_test) |>
  mutate(default = factor(default, levels = c("Yes", "No"))) |>
  roc_curve(truth = default, .pred_Yes) |>
  ggplot(aes(x = 1 - specificity, y = sensitivity)) + ggtitle("lda roc curve") +
  geom_path(color = "indianred") + geom_abline(lty = 3) + coord_equal() + theme_minimal()
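The area under this curve can be computed with roc_auc() from yardstick; a minimal sketch on the same test-set predictions:
Code
def_fit_lda |> augment(new_data = default_test) |>
  mutate(default = factor(default, levels = c("Yes", "No"))) |>   # make "Yes" the event level
  roc_auc(truth = default, .pred_Yes)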