\(y=f(X)+\epsilon\)

 

this is what it is all about…

\(y=f(X)+\epsilon\)

 

the response one is interested in

\(y=f(X)+\epsilon\)

 

the features that are used to estimate \(y\)

\(y=f(X)+\epsilon\)

 

the function that links the \(X\)’s to \(y\)

\(y=f(X)+\epsilon\)

 

the error: the \(X\)’s cannot explain \(y\) exactly

\(y=f(X)+\epsilon\)

 

the goal: find \(\hat{f}(\cdot)\) to estimate \(f(\cdot)\)
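As a minimal sketch of this setup (all names and values here are illustrative): simulate data from a known \(f\), then estimate it with a linear fit.

```r
# simulate y = f(X) + epsilon with a known f, then estimate f with lm()
set.seed(123)
x <- runif(100, min = 0, max = 10)
f <- function(x) 3 + 0.5 * x          # the (usually unknown) true f
y <- f(x) + rnorm(100, sd = 1)        # epsilon ~ N(0, 1)

f_hat <- lm(y ~ x)                    # a parametric estimate of f
coef(f_hat)                           # should be close to (3, 0.5)
```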

What to do, and how

 

\(y\)

regression: if the response is numeric

classification: if the response is non-numeric (categorical)

 

 

goal

inference: study the effects of \(X\)’s on \(y\)

prediction: estimate \(y\), given new values of the \(X\)’s

 

 

fit strategy

parametric: assume a functional form for \(f(\cdot)\) and fit its parameters

non-parametric: fit \(f(\cdot)\) (almost) point-by-point.

basics: linear regression

guess \(y\)? (no \(X\) available)

Code
set.seed(123)
toy_data = tibble(y = runif(30,min=0,max = 10),x_no=0,x_yes = ((y-3)/3)-rnorm(30,sd=.5)) |> 
  pivot_longer(cols = x_no : x_yes, values_to = "x",names_to="states")

toy_data |> filter(states == "x_no") |> ggplot(aes(y=y,x=x)) + 
  geom_point(color="indianred",size=3,alpha=.75) + 
  expand_limits(y=c(0,10),x=c(-2,3)) + geom_hline(yintercept = mean(toy_data$y),color="forestgreen")

guess \(y\)? (\(X\) available)

Code
toy_data |> ggplot(aes(y=y,x=x))+
  geom_point(color="indianred",size=3,alpha=.75)+
  expand_limits(y=c(0,10),x=c(-2,3)) + 
  geom_hline(yintercept = mean(toy_data$y),color="forestgreen") +
  transition_states(states)

guess \(y\)? (\(X\) available)

Code
toy_data |> filter(states == "x_yes") |> ggplot(aes(y=y,x=x)) + 
  geom_point(color="indianred",size=3,alpha=.75)  + geom_smooth(method = "lm") + 
  geom_hline(yintercept = mean(toy_data$y),color="forestgreen")+
  expand_limits(y=c(0,10),x=c(-2,3)) 

\(y=f(X)+\epsilon\)

 

the response is a numeric (continuous) variable

\(y=f(X)+\epsilon\)

 

the features can be numeric and categorical

\(y=f(X)+\epsilon\)

 

the function is linear: \(y=\beta_{0}+\sum_{j=1}^{p}\beta_{j}X_{j}+\epsilon\)

\(y=f(X)+\epsilon\)

 

the error is assumed to be \(\epsilon_{i}\sim N(0,\sigma^{2})\) with \(Cov(\epsilon_{i},\epsilon_{i'})=0\) for \(i\neq i'\)

\(y=f(X)+\epsilon\)

 

the goal: finding \(\hat{f}(\cdot)\) boils down to finding \(\hat{\beta}_{0},\hat{\beta}_{j}\)’s

Advertising data

Code
adv_data = read_csv(file="./data/Advertising.csv") |> select(-1)
adv_data |> slice_sample(n = 8) |> 
  kbl() |> kable_styling(font_size=10)
TV radio newspaper sales
76.4 0.8 14.8 9.4
66.9 11.7 36.8 9.7
265.6 20.0 0.3 17.4
151.5 41.3 58.5 18.5
23.8 35.1 65.9 9.2
237.4 27.5 11.0 18.9
7.8 38.9 50.6 6.6
197.6 23.3 14.2 16.6
  • sales is the response, indicating the level of sales in a specific market
  • TV , Radio and Newspaper are the predictors, indicating the advertising budget spent on the corresponding media

Advertising data

Code
adv_data_tidy = adv_data |> pivot_longer(names_to="medium",values_to="budget",cols = 1:3) 
adv_data_tidy |> slice_sample(n = 8) |> 
  kbl() |> kable_styling(font_size=10)
sales medium budget
14.8 radio 10.1
19.4 radio 33.5
10.5 radio 16.0
11.8 newspaper 23.7
11.7 newspaper 5.9
19.2 TV 193.7
16.9 TV 163.3
16.6 TV 202.5

Advertising data

Code
adv_data_tidy |> 
  ggplot(aes(x = budget, y = sales)) + theme_minimal() + facet_wrap(~medium,scales = "free") +
  geom_point(color="indianred",alpha=.5,size=3)

Note: the budget spent on TV is up to 300, less so for Radio and Newspaper

Advertising data: single regressions

We can be naive and regress sales on TV, newspaper and radio, separately.

Advertising data: single regressions

Code
avd_lm_models_nest = adv_data_tidy |> 
  group_by(medium) |> 
  group_nest(.key = "datasets") |> 
  mutate(
    model_output  = map(.x = datasets, ~lm(sales ~ budget, data = .x)),
    model_params  = map(.x = model_output, ~tidy(.x)),
    model_metrics = map(.x = model_output, ~glance(.x)),
    model_fitted  = map(.x = model_output, ~augment(.x))
  )
# A tibble: 3 × 6
  medium           datasets model_output model_params model_metrics model_fitted
  <chr>     <list<tibble[,> <list>       <list>       <list>        <list>      
1 TV              [200 × 2] <lm>         <tibble>     <tibble>      <tibble>    
2 newspaper       [200 × 2] <lm>         <tibble>     <tibble>      <tibble>    
3 radio           [200 × 2] <lm>         <tibble>     <tibble>      <tibble>    

This nested data structure stores everything. The functions tidy, glance and augment pull the information from the model output, and arrange it in a tidy way.

Advertising data: nested structure

The quantities nested in the tibble can be pulled out, or they can be expanded within the tibble itself, using unnest

Code
avd_lm_models_nest  |> unnest(model_params) 
# A tibble: 6 × 10
  medium       datasets model_output term  estimate std.error statistic  p.value
  <chr>     <list<tibb> <list>       <chr>    <dbl>     <dbl>     <dbl>    <dbl>
1 TV          [200 × 2] <lm>         (Int…   7.03     0.458       15.4  1.41e-35
2 TV          [200 × 2] <lm>         budg…   0.0475   0.00269     17.7  1.47e-42
3 newspaper   [200 × 2] <lm>         (Int…  12.4      0.621       19.9  4.71e-49
4 newspaper   [200 × 2] <lm>         budg…   0.0547   0.0166       3.30 1.15e- 3
5 radio       [200 × 2] <lm>         (Int…   9.31     0.563       16.5  3.56e-39
6 radio       [200 × 2] <lm>         budg…   0.202    0.0204       9.92 4.35e-19
# ℹ 2 more variables: model_metrics <list>, model_fitted <list>

Advertising data: nested structure

The quantities nested in the tibble can be pulled out, or they can be expanded within the tibble itself, using unnest

Code
avd_lm_models_nest  |> unnest(model_params) |> 
  select(medium,term:p.value) |>
  kbl(digits = 4) |> kable_styling(font_size = 10) |> 
  row_spec(1:2,background = "#D9FDEC") |> 
  row_spec(3:4,background = "#CAF6FC") |> 
  row_spec(5:6,background = "#FDB5BA")
medium term estimate std.error statistic p.value
TV (Intercept) 7.0326 0.4578 15.3603 0.0000
TV budget 0.0475 0.0027 17.6676 0.0000
newspaper (Intercept) 12.3514 0.6214 19.8761 0.0000
newspaper budget 0.0547 0.0166 3.2996 0.0011
radio (Intercept) 9.3116 0.5629 16.5422 0.0000
radio budget 0.2025 0.0204 9.9208 0.0000

   

The effect of budget on sales differs significantly from 0, for all the considered media

Ordinary least squares

\[\min_{\hat{\beta}_{0},\hat{\beta}_{1}}: RSS=\sum_{i=1}^{n}{e_{i}^{2}}=\sum_{i=1}^{n}{(y_{i}-\hat{y}_{i})^{2}}=\sum_{i=1}^{n}{(y_{i}-\hat{\beta}_{0}-\hat{\beta}_{1}x_{i})^{2}}\]

OLS estimator of \(\beta_{0}\)

\[\begin{split}&\partial_{\hat{\beta}_{0}}\sum_{i=1}^{n}{(y_{i}-\hat{\beta}_{0}-\hat{\beta}_{1}x_{i})^{2}}= -2\sum_{i=1}^{n}{(y_{i}-\hat{\beta}_{0}-\hat{\beta}_{1}x_{i})}=\\ &\sum_{i=1}^{n}{y_{i}}-n\hat{\beta}_{0}-\hat{\beta}_{1}\sum_{i=1}^{n}{x_{i}}=0\rightarrow \hat{\beta}_{0}=\bar{y}-\hat{\beta}_{1}\bar{x}\end{split}\]

Ordinary least squares

\[\min_{\hat{\beta}_{0},\hat{\beta}_{1}}\sum_{i=1}^{n}{e_{i}^{2}}=\sum_{i=1}^{n}{(y_{i}-\hat{y}_{i})^{2}}=\sum_{i=1}^{n}{(y_{i}-\hat{\beta}_{0}-\hat{\beta}_{1}x_{i})^{2}}\]

OLS estimator of \(\beta_{1}\)

\[\begin{split}&\partial_{\hat{\beta}_{1}}\sum_{i=1}^{n}{(y_{i}-\hat{\beta}_{0}-\hat{\beta}_{1}x_{i})^{2}}= -2\sum_{i=1}^{n}{x_{i}(y_{i}-\hat{\beta}_{0}-\hat{\beta}_{1}x_{i})}= \sum_{i=1}^{n}{x_{i}y_{i}}-\hat{\beta}_{0}\sum_{i=1}^{n}{x_{i}}-\hat{\beta}_{1}\sum_{i=1}^{n}{x_{i}^{2}}=0\\ &\hat{\beta}_{1}\sum_{i=1}^{n}{x_{i}^{2}}=\sum_{i=1}^{n}{x_{i}y_{i}}-\sum_{i=1}^{n}{x_{i}}\left(\frac{\sum_{i=1}^{n}{y_{i}}}{n}-\hat{\beta}_{1}\frac{\sum_{i=1}^{n}{x_{i}}}{n}\right)\\ &\hat{\beta}_{1}\left(n\sum_{i=1}^{n}{x_{i}^{2}}-(\sum_{i=1}^{n}{x_{i}})^{2} \right)= n\sum_{i=1}^{n}{x_{i}y_{i}}-\sum_{i=1}^{n}{x_{i}}\sum_{i=1}^{n}{y_{i}}\\ &\hat{\beta}_{1}=\frac{n\sum_{i=1}^{n}{x_{i}y_{i}}-\sum_{i=1}^{n}{x_{i}}\sum_{i=1}^{n}{y_{i}}} {n\sum_{i=1}^{n}{x_{i}^{2}}-(\sum_{i=1}^{n}{x_{i}})^{2} }=\frac{\sigma_{xy}}{\sigma^{2}_{x}} \end{split}\]
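The closed-form estimators above can be checked numerically against \(\texttt{lm()}\); a quick sketch on simulated data (values are illustrative):

```r
# verify the OLS closed forms against lm() on simulated data
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50)

b1 <- cov(x, y) / var(x)       # beta_1 = sigma_xy / sigma_x^2
b0 <- mean(y) - b1 * mean(x)   # beta_0 = ybar - beta_1 * xbar

fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(b0, b1))   # the two estimates coincide
```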

Three regression lines

Code
three_regs_plot=avd_lm_models_nest  |> unnest(model_fitted) |> 
  select(medium,sales:.fitted) |> 
  ggplot(aes(x=budget,y=sales))+theme_minimal()+facet_grid(~medium,scales="free")+
  geom_point(alpha=.5,color = "indianred")+
  geom_smooth(method="lm")+
  geom_segment(aes(x=budget,xend=budget,y=.fitted,yend=sales),color="forestgreen",alpha=.25)


Advertising data: nested structure

The quantities nested in the tibble can be pulled out, or they can be expanded within the tibble itself, using \(\texttt{unnest}\)

Code
avd_lm_models_nest  |> unnest(model_metrics) |> 
  select(medium,r.squared:df.residual) |>
  kbl(digits = 4) |> kable_styling(font_size = 10) |> 
  row_spec(1,background = "#D9FDEC") |> 
  row_spec(2,background = "#CAF6FC") |> 
  row_spec(3,background = "#FDB5BA")
medium r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual
TV 0.6119 0.6099 3.2587 312.1450 0.0000 1 -519.0457 1044.091 1053.986 2102.531 198
newspaper 0.0521 0.0473 5.0925 10.8873 0.0011 1 -608.3357 1222.671 1232.566 5134.805 198
radio 0.3320 0.3287 4.2749 98.4216 0.0000 1 -573.3369 1152.674 1162.569 3618.479 198

Linear regression: model assumptions

The linear regression model is

\[y_{i}=\beta_{0}+\beta_{1}x_{i}+\epsilon_{i}\]

where \(\epsilon_{i}\) is a random variable with expected value of 0. For inference, more assumptions must be made:

  • \(\epsilon_{i}\sim N(0,\sigma^{2})\): the variance of the errors \(\sigma^{2}\) does not depend on \(x_{i}\).

  • \(Cov(\epsilon_{i},\epsilon_{i'})=0\) , for all \(i\neq i'\) and \(i,i'=1,\ldots,n\).

  • \(x_{i}\) is non-stochastic

It follows that \(y_{i}\) is a random variable such that

  • \(E[y_{i}]=\beta_{0}+\beta_{1}x_{i}\)

  • \(Var[y_{i}]=\sigma^{2}\)

Linear regression: model assumptions

Statistics for Business and Economics (Anderson, Sweeney and Williams, (2011))

Inference on \(\hat{\beta}_{1}\): hypothesis testing

The null hypothesis is \(\beta_{1}=0\) and the test statistic is

\[\frac{\hat{\beta}_{1}-\beta_{1}}{SE(\hat{\beta}_{1})}\]

which, under the null hypothesis, reduces to \(\frac{\hat{\beta}_{1}}{SE(\hat{\beta}_{1})}\), distributed as a Student’s \(t\) with \(n-2\) degrees of freedom

\[\text{Reject } H_{0} \text{ if } \left|\frac{\hat{\beta}_{1}}{SE(\hat{\beta}_{1})}\right|>t_{1-\frac{\alpha}{2},n-2}\]
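The test statistic can be recomputed by hand, using \(SE(\hat{\beta}_{1})=\sqrt{\hat{\sigma}^{2}/\sum_{i}(x_{i}-\bar{x})^{2}}\) with \(\hat{\sigma}^{2}=RSS/(n-2)\); a sketch on simulated data:

```r
# recompute the t statistic and p-value for beta_1, and compare with summary(lm)
set.seed(1)
n <- 40
x <- runif(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

sigma2 <- sum(residuals(fit)^2) / (n - 2)       # RSS / (n - 2)
se1    <- sqrt(sigma2 / sum((x - mean(x))^2))   # SE(beta_1)
t1     <- unname(coef(fit)[2]) / se1
p1     <- 2 * pt(abs(t1), df = n - 2, lower.tail = FALSE)

all.equal(t1, unname(summary(fit)$coefficients["x", "t value"]))
```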

Linear regression: multiple predictors

  • \(y\) is the numeric response; \(X_{1},\ldots,X_{p}\) are the predictors. The general formula for the linear model is \[y = f(X)+\epsilon=\beta_{0}+\sum_{j=1}^{p}X_{j}\beta_{j}+ \epsilon\]

In algebraic form

\[{\bf y}={\bf X}{\bf \beta}+{\bf \epsilon}\]

\[\begin{bmatrix} y_{1}\\ y_{2}\\ y_{3}\\ \vdots\\ y_{n} \end{bmatrix}=\begin{bmatrix} 1& x_{1,1}&\ldots&x_{1,p}\\ 1& x_{2,1}&\ldots&x_{2,p}\\ 1& x_{3,1}&\ldots&x_{3,p}\\ \vdots&\vdots&\vdots&\\ 1& x_{n,1}&\ldots&x_{n,p}\\ \end{bmatrix}\begin{bmatrix} \beta_{0}\\ \beta_{1}\\ \vdots\\ \beta_{p}\\ \end{bmatrix}+\begin{bmatrix} \epsilon_{1}\\ \epsilon_{2}\\ \epsilon_{3}\\ \vdots\\ \epsilon_{n} \end{bmatrix}\]

Linear regression: ordinary least square (OLS)

OLS target

\[\min_{\hat{\beta}_{0},\hat{\beta}_{1},\ldots,\hat{\beta}_{p}}\sum_{i=1}^{n}\left(y_{i}-\hat{\beta}_{0}-\sum_{j=1}^{p}x_{ij}\hat{\beta}_{j}\right)^{2}=\min_{\hat{\beta}_{0},\hat{\beta}_{1},\ldots,\hat{\beta}_{p}}RSS\]

OLS target (algebraic) \[ \min_{\hat{\beta}} ({\bf y}-{\bf X}{\bf \hat{\beta}})^{\sf T}({\bf y}-{\bf X}{\bf \hat{\beta}})\]
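Minimising the algebraic target leads to the normal equations \(({\bf X}^{\sf T}{\bf X})\hat{\beta}={\bf X}^{\sf T}{\bf y}\); a sketch (simulated data) solving them directly and comparing with \(\texttt{lm()}\):

```r
# solve the normal equations (X'X) beta = X'y and compare with lm()
set.seed(1)
n <- 100
X <- cbind(1, x1 = rnorm(n), x2 = rnorm(n))   # design matrix with intercept column
y <- X %*% c(1, 2, -1) + rnorm(n)

beta_hat <- solve(t(X) %*% X, t(X) %*% y)     # (X'X)^{-1} X'y
fit <- lm(y ~ X[, "x1"] + X[, "x2"])

all.equal(unname(drop(beta_hat)), unname(coef(fit)))
```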

Advertising data: multiple linear regression

At this stage, no pre-processing, no test set evaluation, just fit the regression model on the whole data set

pre-processing: define a \(\texttt{recipe}\)

Code
adv_recipe = recipe(sales~., data=adv_data)

model specification: define a \(\texttt{parsnip}\) model

Code
adv_model = linear_reg(mode="regression", engine="lm")

define the \(\texttt{workflow}\)

Code
adv_wflow = workflow() |> 
  add_recipe(recipe=adv_recipe) |> 
  add_model(adv_model)

fit the model

Code
adv_fit = adv_wflow |> 
  fit(data=adv_data)

Advertising data: back to single models

medium term estimate std.error statistic p.value
TV (Intercept) 7.0326 0.4578 15.3603 0.0000
TV budget 0.0475 0.0027 17.6676 0.0000
newspaper (Intercept) 12.3514 0.6214 19.8761 0.0000
newspaper budget 0.0547 0.0166 3.2996 0.0011
radio (Intercept) 9.3116 0.5629 16.5422 0.0000
radio budget 0.2025 0.0204 9.9208 0.0000

Advertising data: single vs multiple regression

Code
avd_lm_models_nest  |> unnest(model_params) |> 
  select(medium,term:p.value) |>
  filter(term!="(Intercept)") |> arrange(medium) |> 
  kbl(digits = 4) |> kable_styling(font_size = 12) |> 
  column_spec(c(3,6),bold=TRUE) |> 
  row_spec(1,background = "#D9FDEC") |> 
  row_spec(2,background = "#CAF6FC") |> 
  row_spec(3,background = "#FDB5BA") 
medium term estimate std.error statistic p.value
TV budget 0.0475 0.0027 17.6676 0.0000
newspaper budget 0.0547 0.0166 3.2996 0.0011
radio budget 0.2025 0.0204 9.9208 0.0000
Code
adv_fit |> tidy() |> 
  filter(term!="(Intercept)") |> arrange(term) |> 
  kbl(digits = 4) |> kable_styling(font_size = 12) |> 
  column_spec(c(2,5),bold=TRUE) |> 
  row_spec(1,background = "#D9FDEC") |> 
  row_spec(2,background = "#CAF6FC") |> 
  row_spec(3,background = "#FDB5BA") 
term estimate std.error statistic p.value
TV 0.0458 0.0014 32.8086 0.0000
newspaper -0.0010 0.0059 -0.1767 0.8599
radio 0.1885 0.0086 21.8935 0.0000

 

 

why these contradictory results?

Advertising data: single vs multiple regression

Code
library("corrr")
adv_data |> correlate()  |>  network_plot()
  • \(\texttt{newspaper}\) is correlated with \(\texttt{radio}\), and the latter has a significant effect on \(\texttt{sales}\) in the multiple regression model

  • that is why, in the single model, \(\texttt{newspaper}\) appears to have a significant effect on \(\texttt{sales}\) ( \(\texttt{radio}\) is ignored)

supervised learning flow

pre-processing

split your observations in training, validation (or, cross-validate) and test

transform the predictors properly (feature engineering)

model-spec

specify the model to fit

tuning

select a reasonable grid of model hyperparameter(s) values to choose from

for each combination of hyperparameters

  1. fit the model on training observations
  2. compute appropriate metrics on evaluation observations

pick the best hyperparameter combination

final evaluation and fit

compute the metric for the tuned model on the test set (observations never used yet)

obtain the final fit for the model on all the available observations

the tidymodels metapackage

All the core packages in the tidymodels refer to one step of a supervised learning flow

tidymodels logo

For all things tidymodels check tidymodels.org!

the tidymodels core

All the core packages in the tidymodels refer to one step of a supervised learning flow

all

the tidymodels core

the rsample package provides tools for data splitting and resampling

rsample

the tidymodels core

the recipes package provides tools for data pre-processing and feature engineering

recipes

the tidymodels core

the parsnip package provides a unified interface to many models available in R

parsnip

the tidymodels core

the workflows package combines together pre-processing, modeling and post-processing steps

workflows

the tidymodels core

the yardstick package provides several performance metrics

yardstick

the tidymodels core

the dials package provides tools to define hyperparameters value grids

dials

the tidymodels core

the tune package greatly simplifies implementing hyperparameter optimization

tune

the tidymodels core

the broom package provides utility functions to tidify model output

broom

a simple flow

pre-processing: training and test

The idea is to:

  • fit the model on a set of observations (training)

  • assess the model performance on a different set of observations (testing)

Refer to the Advertising data, split the observations in training/testing

adv_data = read_csv(file="./data/Advertising.csv") |> select(-1)
adv_split = initial_split(adv_data,prop=3/4,strata=sales)
adv_train=training(adv_split)
adv_test=testing(adv_split)

training set

Code
adv_train |> slice_sample(n=5) |> kbl()
TV radio newspaper sales
281.4 39.6 55.8 24.4
0.7 39.6 8.7 1.6
220.3 49.0 3.2 24.7
76.3 27.5 16.0 12.0
139.3 14.5 10.2 13.4

test set

Code
adv_test |> slice_sample(n=5) |> kbl()
TV radio newspaper sales
19.6 20.1 17.0 7.6
198.9 49.4 60.0 23.7
17.9 37.6 21.6 8.0
76.4 0.8 14.8 9.4
228.3 16.9 26.2 15.5

pre-processing: specify the recipe (just the formula in this case)

adv_rec = recipe(formula=sales~.,data = adv_train)

 

model specification

adv_model = linear_reg(mode="regression",engine="lm")

 

pair the recipe and the model specification in a workflow, and fit the model on the training set

adv_wf = workflow() |> add_recipe(adv_rec) |> add_model(adv_model)
adv_fit = adv_wf |> fit(data = adv_train)

 

use the fitted model to predict the test observations

adv_pred = adv_fit |> augment(new_data = adv_test)

 

compute the performance metric rmse (root mean squared error)

adv_pred |> rmse(truth = sales, estimate=.pred) |> kbl()
.metric .estimator .estimate
rmse standard 1.753936
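rmse is just the square root of the mean squared prediction error; a minimal base-R sketch with toy numbers (not the Advertising data):

```r
# rmse by hand: sqrt of the mean squared difference between truth and prediction
rmse_manual <- function(truth, estimate) sqrt(mean((truth - estimate)^2))

truth    <- c(10, 12, 9, 14)
estimate <- c(11, 11, 10, 13)
rmse_manual(truth, estimate)   # each error is 1 in absolute value, so rmse = 1
```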
Code
library(magick)
library(tidyverse)
library(purrr)
library(tidymodels)
library(discrim)
library(flipbookr)
library(kableExtra)
library(janitor)
library(patchwork)

classification

classification model

In a classification problem the response is a categorical variable

rather than predicting the value of \(Y\), one wants to estimate the posterior probability

\[P(Y=k\mid X=x_{i})\]

that is, the probability that the observation \(i\) belongs to class \(k\), given that the predictor value for \(i\) is \(x_{i}\)

will a credit card owner default?

Code
set.seed(1234)
default=read_csv("./data/Default.csv")
default |> slice_sample(n=6) |> kbl() |> kable_styling(font_size=12)
default student balance income
No Yes 311.32186 22648.76
No Yes 697.13558 18377.15
No Yes 470.10718 16014.11
No No 1200.04162 56081.08
No No 553.64902 47021.49
No No 10.23149 27237.38

Do balance and income bring useful info to identify defaulters?

Code
p1 = default |> ggplot(aes(x = balance, y=income,color=default))+
  geom_point(alpha=.5,size=3)+theme_minimal() 
p1

Do balance and income bring useful info to identify defaulters?

Code
p2 = default |> pivot_longer(names_to = "predictor",values_to="value",cols=c(balance,income)) |> 
  ggplot(aes(x = default, y = value))+geom_boxplot(aes(fill=default))+
  facet_wrap(~predictor,scales="free")+theme_minimal()
p2

balance is useful, income is not

Code
p1 | p2

Note: to arrange multiple plots together, give a look at the patchwork package

linear regression for non-numeric response?

if \(Y\) is categorical

With \(K\) categories, one could code \(Y\) as an integer vector

  • for non-ordinal factors it would not make sense

 

  • even for ordinal factors, one would implicitly assume constant differences between categories

linear regression for non-numeric response?

if \(Y\) is binary

The goal is to estimate \(P(Y=1|X)\), which is, in fact, numeric . . .

  • So one can model \(P(Y=1|X)=\beta_{0}+\beta_{1}X\) . . .

 

  • What can possibly be wrong with this approach?

linear regression for non-numeric response?

\(P(\texttt{default}=\texttt{yes}|\texttt{balance})=\beta_{0}+\beta_{1}\texttt{balance}\)

Code
default |> mutate(def_01 = ifelse(default== "Yes",1,0)) |>  
  ggplot(aes(x=balance, y=def_01)) + geom_point(aes(color = default),size=4, alpha=.5) + 
  geom_smooth(method="lm") + theme_minimal()
\(P(\texttt{default}=\texttt{yes}|\texttt{balance})\) is bounded in \([0, 1]\), but the linear fit produces values outside that range, so it cannot be used

use the logistic function instead

\(P(\texttt{default}=\texttt{yes}|\texttt{balance})=\frac{e^{\beta_{0}+\beta_{1}\texttt{balance}}}{1+e^{\beta_{0}+\beta_{1}\texttt{balance}}}\)

modeling the posterior \(P(Y=1|X)\) by means of a logistic function is the goal of logistic regression

conditional expectation

just like in linear regression, the fit refers to the conditional expectation of \(Y\) given \(X\); since \(Y\in\{0,1\}\), it results that \[E[Y|X] \equiv P(Y=1|X)\]

The odds and the logit

\[ \begin{split} p(X)&=\frac{e^{\beta_{0}+\beta_{1}X}}{1+e^{\beta_{0}+\beta_{1}X}}\\ \left(1+e^{\beta_{0}+\beta_{1}X}\right)p(X)&=e^{\beta_{0}+\beta_{1}X}\\ p(X)+e^{\beta_{0}+\beta_{1}X}p(X)&=e^{\beta_{0}+\beta_{1}X}\\ p(X)&=e^{\beta_{0}+\beta_{1}X}-e^{\beta_{0}+\beta_{1}X}p(X)\\ p(X)&=e^{\beta_{0}+\beta_{1}X}\left(1-p(X)\right)\\ \frac{p(X)}{\left(1-p(X)\right)}&=e^{\beta_{0}+\beta_{1}X} \end{split} \]

  • \(\frac{p(X)}{\left(1-p(X)\right)}\) are the odds, that take value in \([0,\infty)\)
  • \(log\left(\frac{p(X)}{\left(1-p(X)\right)}\right)=\beta_{0}+\beta_{1}X\) is the logit: there is a linear relation between the logit and the predictor
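In R, the logistic function and its inverse are available as \(\texttt{plogis()}\) and \(\texttt{qlogis()}\); a quick numerical check of the odds/logit relations (coefficient values are illustrative):

```r
# p(X), the odds and the logit, for illustrative coefficients
b0 <- -10.65; b1 <- 0.0055
x  <- 2000

p     <- plogis(b0 + b1 * x)   # p(X) = e^eta / (1 + e^eta)
odds  <- p / (1 - p)           # in [0, Inf)
logit <- log(odds)             # linear in x

all.equal(logit, b0 + b1 * x)  # the logit recovers the linear predictor
```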

Estimating the posterior probability via the logistic function

a toy sample: fit the logistic function; for a new point \(\texttt{balance}=1400\), one can estimate \(P(\texttt{default=Yes}|\texttt{balance}=1400)=.62\)
Estimating the posterior probability via the logistic function

How to find the logistic function? estimate its parameters \(P(Y=1|X) = \frac{e^{\beta_{0}+\beta_{1}X}}{1 + e^{\beta_{0}+\beta_{1}X}}\)

Estimating the posterior probability via the logistic function

Least squares? One could switch to the logit, which is a linear function \(logit(p(Y=1|X))=\beta_{0} + \beta_{1} X\)

Estimating the posterior probability via the logistic function

Least squares? but the logit mapping, for the blue points, is \[\log\left(\frac{p(Y=1|X)}{1-p(Y=1|X)}\right) = \log\left(\frac{1}{1-1}\right) = \log(1) - \log(0) = 0-(-\infty) = +\infty\]

Estimating the logit via the linear function

Least squares? but the logit mapping, for the red points, is \[\log\left(\frac{p(Y=1|X)}{1-p(Y=1|X)}\right) = \log\left(\frac{0}{1-0}\right) = \log(0) = -\infty\]

Estimating the logit via the linear function

logit: mapping


Maximization of the likelihood function

The estimates for \(\beta_{0}\) and \(\beta_{1}\) are such that the following likelihood function is maximised:

\[ \begin{split} \ell\left(\hat{\beta}_{0},\hat{\beta}_{1}\right)=& \color{blue}{\prod_{\forall i}{p(x_{i})}}\times \color{red}{\prod_{\forall i'}{\left(1-p(x_{i'})\right)}}=\\ =&\color{blue}{ \prod_{\forall i} \frac{e^{\hat{\beta}_{0}+\hat{\beta}_{1}x_{i}}}{1+e^{\hat{\beta}_{0}+\hat{\beta}_{1}x_{i}}}} \times \color{red}{ \prod_{\forall i^{\prime}}{\left(1- \frac{e^{\hat{\beta}_{0}+\hat{\beta}_{1}x_{i'}}}{1+e^{\hat{\beta}_{0}+\hat{\beta}_{1}x_{i'}}} \right)} } \end{split} \]

Note: the \(i\) index is for blue points, \(i'\) is for red points
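As a sketch (simulated data, base R only): the same estimates can be obtained by minimising the negative log-likelihood with \(\texttt{optim()}\), and they match \(\texttt{glm()}\) up to numerical tolerance.

```r
# maximum likelihood for logistic regression: optim() vs glm()
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-1 + 2 * x))

negloglik <- function(b) {
  eta <- b[1] + b[2] * x
  -sum(y * eta - log(1 + exp(eta)))   # minus the log of the likelihood above
}

opt <- optim(c(0, 0), negloglik)
fit <- glm(y ~ x, family = binomial)

rbind(optim = opt$par, glm = unname(coef(fit)))
```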

Fitting the logistic regression

pre-process: specify the recipe

Code
def_rec = recipe(default~balance, data=default)

specify the model

Code
def_model_spec = logistic_reg(mode="classification", engine="glm")

put them together in the workflow

Code
def_wflow = workflow() |> add_recipe(def_rec) |> add_model(def_model_spec)

fit the model

Code
def_fit = def_wflow |> fit(data=default) 

Fitting the logistic regression

Look at the results

Code
def_fit |>  tidy() |> kbl() |> kable_styling(font_size=12)
term estimate std.error statistic p.value
(Intercept) -10.6513306 0.3611574 -29.49221 0
balance 0.0054989 0.0002204 24.95309 0
Code
def_fit |>  glance() |> kbl() |> kable_styling(font_size=12)
null.deviance df.null logLik AIC BIC deviance df.residual nobs
2920.65 9999 -798.2258 1600.452 1614.872 1596.452 9998 10000

Fitting the logistic regression: a qualitative predictor

Suppose you want to use \(\texttt{student}\) as the qualitative predictor for your logistic regression. You can update, within the workflow, the recipe only.

update the recipe in the workflow and re-fit

Code
def_wflow2 = def_wflow |> update_recipe(recipe(default~student,data=default))
def_fit2 = def_wflow2 |> fit(data=default) 

look at the results

Code
def_fit2 |>  tidy() |> kbl() |> kable_styling(font_size=12)
term estimate std.error statistic p.value
(Intercept) -3.5041278 0.0707130 -49.554219 0.0000000
studentYes 0.4048871 0.1150188 3.520181 0.0004313

Fitting the logistic regression: a qualitative predictor

It appears that if a customer is a student, they are more likely to default ( \(\hat{\beta}_{1} = 0.4\) ).

  • For non students \[ \begin{split} log\left(\frac{p(X)}{1-p(X)}\right) = -3.504 &\rightarrow& \ \frac{p(X)}{1-p(X)} = e^{-3.504}\rightarrow\\ p(X) = e^{-3.504}-e^{-3.504}p(X)&\rightarrow& \ p(X) +e^{-3.504}p(X) = e^{-3.504}\rightarrow\\ p(X)(1 +e^{-3.504}) = e^{-3.504} &\rightarrow & \ p(X) = \frac{e^{-3.504}}{1 +e^{-3.504}}=\color{blue}{0.029} \end{split} \]
  • For students \[ p(X) = \frac{e^{-3.504+0.405}}{1 +e^{-3.504+0.405}}=\color{red}{0.043} \]
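These probabilities can be recovered with \(\texttt{plogis()}\), the logistic CDF, applied to the fitted linear predictor (using the unrounded estimates from the table above):

```r
# back-transform the logistic coefficients into probabilities
p_non_student <- plogis(-3.5041278)               # intercept only, ~0.029
p_student     <- plogis(-3.5041278 + 0.4048871)   # intercept + studentYes, ~0.043
c(p_non_student, p_student)
```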

Multiple logistic regression

In case of multiple predictors

\[log\left(\frac{p(X)}{1-p(X)} \right)=\beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{2}+\ldots+\beta_{p}X_{p}\]

and the following relation holds

\[p(X)=\frac{e^{{\beta}_{0}+{\beta}_{1}X_{1}+{\beta}_{2}X_{2}+\ldots+{\beta}_{p}X_{p}}}{1+e^{{\beta}_{0}+{\beta}_{1}X_{1}+{\beta}_{2}X_{2}+\ldots+{\beta}_{p}X_{p}}}\]

Multiple logistic regression

Let’s consider two predictors, \(\texttt{balance}\) and \(\texttt{student}\); again, we just update the recipe within the workflow

update the recipe in the workflow and re-fit

Code
def_wflow3 = def_wflow |> update_recipe(recipe(default~balance+student,data=default))
def_fit3 = def_wflow3 |> fit(data=default) 

look at the results

Code
def_fit3 |>  tidy() |> kbl() |> kable_styling(font_size=12)
term estimate std.error statistic p.value
(Intercept) -10.7494959 0.3691914 -29.116326 0.0e+00
balance 0.0057381 0.0002318 24.749526 0.0e+00
studentYes -0.7148776 0.1475190 -4.846003 1.3e-06

 

anything weird?

Multiple logistic regression

Code
p_stu = def_fit2 |> augment(new_data=default) |> 
  ggplot(aes(x=balance, y= .pred_Yes,color=student))+geom_line(size=3)+
  theme_minimal() + ylim(c(0,.1))+xlab("balance")+ylab("P(default=yes|student)")

p_stu 

Multiple logistic regression

Code
p_bal_stu = def_fit3 |> augment(new_data=default) |> ggplot(aes(x=balance, y= .pred_Yes,color=student)) +
  geom_line(size=3) + theme_minimal() + xlab("balance") + ylab("P(default=yes|balance,student)")

p_bal_stu

Multiple logistic regression

Multiple logistic regression

more than two classes?

 

multinomial logistic regression

linear discriminant analysis (LDA)

linear discriminant analysis LDA: one predictor

In the linear discriminant analysis, the assumption is that, within class \(k\), the predictor is normally distributed

\[X\mid Y=k\sim N(\mu_{k},\sigma^{2})\]

therefore the class density is

\[f_{k}(x)=\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{1}{2\sigma^{2}}\left( x-\mu_{k} \right)^{2}\right)\]

 

  • in each class, the predictor is normally distributed

  • the scale parameter \(\sigma^{2}\) is the same for each class

linear discriminant analysis LDA: one predictor

Plugging in \(f_{k}(X)\) in the Bayes formula

\[p_{k}(x)=\frac{\pi_{k}\times f_{k}(x)}{\sum_{l=1}^{K}{\pi_{l}\times f_{l}(x)}} = \frac{\pi_{k}\times \overbrace{\frac{1}{\sqrt{2\pi}\sigma}exp\left(-\frac{1}{2\sigma^{2}}\left( x-\mu_{k} \right)^{2}\right)}^{\color{red}{{f_{k}(x)}}}} {\sum_{l=1}^{K}{\pi_{l}\times \underbrace{\frac{1}{\sqrt{2\pi}\sigma}exp\left(-\frac{1}{2\sigma^{2}}\left( x-\mu_{l} \right)^{2}\right)}_{\color{red}{{f_{l}(x)}}}}}\]

this requires estimating the following parameters

  • \(\mu_{1},\mu_{2},\ldots, \mu_{K}\)
  • \(\pi_{1},\pi_{2},\ldots, \pi_{K}\)
  • \(\sigma\)

to get, for each observation \(x\), \(\hat{p}_{1}(x)\), \(\hat{p}_{2}(x)\), \(\ldots\), \(\hat{p}_{K}(x)\); the observation is then assigned to the class for which \(\hat{p}_{k}(x)\) is largest.
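The plug-in rule takes only a few lines; the parameter values below are made up for illustration (\(K=3\) classes, shared \(\sigma\)):

```r
# plug-in Bayes posteriors, hypothetical estimated parameters
mu    <- c(-4, 0, 4)        # class means mu_k
prior <- c(0.5, 0.3, 0.2)   # priors pi_k
sigma <- 2                  # shared scale

posterior <- function(x) {
  num <- prior * dnorm(x, mean = mu, sd = sigma)
  num / sum(num)            # p_k(x), k = 1..K (sums to 1)
}

which.max(posterior(-1))    # class assigned to x = -1
```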

\(\color{red}{\text{Note}}:\) not all the quantities involved in the Bayes formula play a role in the classification of an object: in fact, some of them are constant across the classes.

two classes, same priors.

Consider a single predictor \(X\), normally distributed within the two classes, with parameters \(\mu_{1}\) , \(\mu_{2}\) and \(\sigma^{2}\) .

Also \(\pi_{1}=\pi_{2}\) . Now, the observation \(x\) is assigned to class 1 if

\[\begin{split} \delta_{1}(X) &>&\delta_{2}(X)\\ \color{blue}{\text{that is}} \\ log({\pi_{1}})-\frac{\mu_{1}^{2}}{2\sigma^{2}} + \frac{\mu_{1}}{\sigma^{2}}x &>& log({\pi_{2}})-\frac{\mu_{2}^{2}}{2\sigma^{2}} + \frac{\mu_{2}}{\sigma^{2}}x \\ \color{blue}{\text{ since }\pi_{1}=\pi_{2}}\\ -\frac{\mu_{1}^{2}}{2\sigma^{2}} + \frac{\mu_{1}}{\sigma^{2}}x > -\frac{\mu_{2}^{2}}{2\sigma^{2}} + \frac{\mu_{2}}{\sigma^{2}}x & \ \rightarrow \ & -\frac{\mu_{1}^{2}}{2} + \mu_{1}x > -\frac{\mu_{2}^{2}}{2} + \mu_{2}x\\ (\mu_{1} - \mu_{2})x > \frac{\mu_{1}^{2}-\mu_{2}^{2}}{2} & \ \rightarrow \ & x > \frac{(\mu_{1}+\mu_{2})(\mu_{1}-\mu_{2})}{2(\mu_{1} - \mu_{2})}\\ x &>& \frac{(\mu_{1}+\mu_{2})}{2}\\ \end{split}\]

the Bayes decision boundary, where \(\delta_{1}=\delta_{2}\), is at \(\color{red}{x=\frac{(\mu_{1}+\mu_{2})}{2}}\) (the division in the last step assumes \(\mu_{1}>\mu_{2}\); with \(\mu_{1}<\mu_{2}\) the inequality flips, but the boundary stays at the midpoint)
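A numerical check of the boundary, taking class 1 as the left-hand class (\(\mu_{1}=-4\), as in the plot that follows) and \(\sigma=2\):

```r
# with equal priors, delta_1 and delta_2 cross at (mu1 + mu2) / 2
mu1 <- -4; mu2 <- 4; sigma <- 2; prior <- 0.5
delta <- function(x, mu) log(prior) - mu^2 / (2 * sigma^2) + mu * x / sigma^2

x_star <- (mu1 + mu2) / 2                          # boundary at 0
all.equal(delta(x_star, mu1), delta(x_star, mu2))  # TRUE: the discriminants tie
delta(-2, mu1) > delta(-2, mu2)                    # TRUE: x = -2 goes to class 1
```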

two classes, same priors.

Code
set.seed(1234)
p_1 = ggplot() + xlim(-10,10) + theme_minimal() +
  stat_function(fun=dnorm,args=list(mean=4,sd=2),geom="area",fill="dodgerblue",alpha=.25)+
  stat_function(fun=dnorm,args=list(mean=-4,sd=2),geom="area",fill="indianred",alpha=.25)+
  geom_vline(xintercept=0,size=2,alpha=.5)+geom_vline(xintercept=-4,color="grey",size=3,alpha=.5)+
  geom_vline(xintercept=4,color="grey",size=3,alpha=.5) + geom_point(aes(x=-2,y=0),inherit.aes = FALSE,size=10,alpha=.5,color="darkgreen")+
  geom_point(aes(x=1,y=0),inherit.aes = FALSE,size=10,alpha=.5,color="magenta") + xlab("x")
p_1

The \(\color{darkgreen}{\text{green point}}\) goes to class 1, the \(\color{magenta}{\text{pink point}}\) goes to class 2

two classes, same priors.

Consider a training set with 100 observations from the two classes (50 each): one needs to estimate \(\mu_1\) and \(\mu_2\) to have the estimated boundary at \(\frac{\hat{\mu}_{1}+\hat{\mu}_{2}}{2}\)

Code
class_12=tibble(class_1=rnorm(50,mean = -4,sd=2),class_2=rnorm(50,mean = 4,sd=2)) |> 
  pivot_longer(names_to="classes",values_to="values",cols = 1:2)
mu_12 = class_12 |> group_by(classes) |> summarise(means=mean(values))
mu_12_mean =  mean(mu_12$means)
p_2=class_12 |> ggplot(aes(x=values,fill=classes)) + theme_minimal() + 
  geom_histogram(aes(y=after_stat(density)),alpha=.5,color="grey") + xlim(-10,10) +
  geom_vline(xintercept=mu_12 |> pull(means),color="grey",size=3,alpha=.75)+
  geom_vline(xintercept = mu_12_mean, size=2, alpha=.75) + theme(legend.position = "none")
p_2

two classes, same priors.

The Bayes boundary is at \(\color{red}{0}\); the estimated boundary is slightly off at -0.31

Code
p_1 / p_2

Fitting the LDA

Code
default=read_csv("./data/Default.csv") |> mutate(default=as.factor(default))

set.seed(1234)
def_split = initial_split(default,prop=3/4,strata=default)
default_test = testing(def_split)
default = training(def_split)
def_rec = recipe(default~balance, data=default)
def_lda_spec = discrim_linear(mode="classification", engine="MASS")
def_wflow_lda = workflow() |> add_recipe(def_rec) |> add_model(def_lda_spec)
def_fit_lda = def_wflow_lda |> fit(data=default) 
def_fit_lda |>  extract_fit_engine()
Call:
lda(..y ~ ., data = data)

Prior probabilities of groups:
    No    Yes 
0.9667 0.0333 

Group means:
      balance
No   803.9438
Yes 1747.8217

Coefficients of linear discriminants:
                LD1
balance 0.002206916
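The decision boundary on `balance` can be recovered by hand from the printed output, solving \(\delta_{No}(x)=\delta_{Yes}(x)\) for \(x\). This is a back-of-the-envelope sketch: it assumes that, with a single predictor, `MASS::lda` scales LD1 to unit pooled within-class variance, so \(\hat{\sigma}\approx 1/0.002206916\).

```r
# hand-computed LDA boundary on balance, from the printed priors and group means
pi_no  <- 0.9667;   pi_yes <- 0.0333
mu_no  <- 803.9438; mu_yes <- 1747.8217
sigma  <- 1 / 0.002206916   # assumed pooled within-class sd (see lead-in)

# solving delta_No(x) = delta_Yes(x) for x:
x_star <- (mu_no + mu_yes) / 2 +
  sigma^2 * log(pi_no / pi_yes) / (mu_yes - mu_no)
x_star   # balances above x_star are classified Yes
```

Note that the unbalanced priors push the boundary well above the midpoint of the two group means: the rare Yes class needs stronger evidence.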

Class-specific error

  • the Bayes-classifier-like rule (assign each observation to its most probable class) maximizes the overall accuracy
  • in several classification problems, however, not all classification errors are alike

    • e.g. in the default problem one wants to avoid treating as reliable customers who will in fact default
  • to control this class-specific error, it is possible to shift the threshold away from the 0.5 used by the Bayes classifier
Code
lda_pred = def_fit_lda |> augment(new_data = default_test) |>
  dplyr::select(default, .pred_class, .pred_Yes) |> 
  mutate(.pred_class_0_05 = as.factor(ifelse(.pred_Yes>.05,"Yes","No")),
         .pred_class_0_1 = as.factor(ifelse(.pred_Yes>.1,"Yes","No")),
         .pred_class_0_2 = as.factor(ifelse(.pred_Yes>.2,"Yes","No")),
         .pred_class_0_3 = as.factor(ifelse(.pred_Yes>.3,"Yes","No")),
         .pred_class_0_4 = as.factor(ifelse(.pred_Yes>.4,"Yes","No")),
         .pred_class_0_5 = as.factor(ifelse(.pred_Yes>.5,"Yes","No"))
         )

Class-specific error

Class-specific error

Code
lda_pred |> 
  pivot_longer(names_to = "threshold",values_to="prediction",cols = 4:9) |> 
  dplyr::select(-.pred_class,-.pred_Yes) |> group_by(threshold) |> 
  summarise(
    accuracy=round(mean(default==prediction),2),
    # share of actual defaulters ("Yes") missed by the rule, i.e. the error on the Yes class
    false_positive_rate = sum((default=="Yes")&(default!=prediction))/sum(default=="Yes")
    ) |> arrange(desc(accuracy),desc(false_positive_rate)) |> kbl() |> kable_styling(font_size=10) 
threshold accuracy false_positive_rate
.pred_class_0_5 0.97 0.8160920
.pred_class_0_4 0.97 0.7126437
.pred_class_0_3 0.97 0.6206897
.pred_class_0_2 0.96 0.4942529
.pred_class_0_1 0.94 0.2873563
.pred_class_0_05 0.89 0.1609195

 

Since the error on the Yes class (the share of actual defaulters classified as No) increases together with the overall accuracy, reducing it will cause the overall performance of the classifier to drop

Roc curve

Code
  def_fit_lda |> augment(new_data=default_test)  |> 
  mutate(default=factor(default,levels=c("Yes","No"))) |>
  roc_curve(truth = default, .pred_Yes)|>
  ggplot(aes(x=1-specificity,y= sensitivity))+ggtitle("lda roc curve") + 
  geom_path(color="indianred")+geom_abline(lty=3)+coord_equal()+theme_minimal()
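The curve can be summarized by its area under the curve (AUC); a sketch reusing the fitted workflow and test set from above (`roc_auc` is yardstick’s AUC helper):

```r
# AUC for the LDA classifier on the test set
def_fit_lda |> augment(new_data = default_test) |>
  mutate(default = factor(default, levels = c("Yes", "No"))) |>
  roc_auc(truth = default, .pred_Yes)
```

An AUC of 0.5 corresponds to the flat-prior diagonal in the plot; the closer to 1, the better the classifier separates the two classes across all thresholds.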