\(y=f(X)+\epsilon\)

 

this is what it is all about…

\(y=f(X)+\epsilon\)

 

the response one is interested in

\(y=f(X)+\epsilon\)

 

the features that are used to estimate \(y\)

\(y=f(X)+\epsilon\)

 

the function that links the \(X\)’s to \(y\)

\(y=f(X)+\epsilon\)

 

the error: the \(X\)’s cannot explain \(y\) exactly

\(y=f(X)+\epsilon\)

 

the goal: find \(\hat{f}(\cdot)\) to estimate \(f(\cdot)\)
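As a minimal sketch of this setup (all names and values here are illustrative): simulate data from a known \(f\), then estimate it with a linear fit.

```r
# simulate y = f(X) + epsilon with a known f, then estimate f with lm()
set.seed(123)
x <- runif(100, min = 0, max = 10)
f <- function(x) 3 + 0.5 * x          # the (usually unknown) true f
y <- f(x) + rnorm(100, sd = 1)        # epsilon ~ N(0, 1)

f_hat <- lm(y ~ x)                    # a parametric estimate of f
coef(f_hat)                           # should be close to (3, 0.5)
```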

What to do, and how

 

\(y\)

regression: if the response is numeric

classification: if the response is non-numeric (categorical)

 

 

goal

inference: study the effects of \(X\)’s on \(y\)

prediction: estimate \(y\), given new values of the \(X\)’s

 

 

fit strategy

parametric: assume a functional form for \(f(\cdot)\) and fit its parameters

non-parametric: fit \(f(\cdot)\) (almost) point-by-point.

basics: linear regression

guess \(y\)? (no \(X\) available)

Code
set.seed(123)
toy_data = tibble(y = runif(30,min=0,max = 10),x_no=0,x_yes = ((y-3)/3)-rnorm(30,sd=.5)) |> 
  pivot_longer(cols = x_no : x_yes, values_to = "x",names_to="states")

toy_data |> filter(states == "x_no") |> ggplot(aes(y=y,x=x)) + 
  geom_point(color="indianred",size=3,alpha=.75) + 
  expand_limits(y=c(0,10),x=c(-2,3)) + geom_hline(yintercept = mean(toy_data$y),color="forestgreen")

guess \(y\)? (\(X\) available)

Code
toy_data |> ggplot(aes(y=y,x=x))+
  geom_point(color="indianred",size=3,alpha=.75)+
  expand_limits(y=c(0,10),x=c(-2,3)) + 
  geom_hline(yintercept = mean(toy_data$y),color="forestgreen") +
  transition_states(states)

guess \(y\)? (\(X\) available)

Code
toy_data |> filter(states == "x_yes") |> ggplot(aes(y=y,x=x)) + 
  geom_point(color="indianred",size=3,alpha=.75)  + geom_smooth(method = "lm") + 
  geom_hline(yintercept = mean(toy_data$y),color="forestgreen")+
  expand_limits(y=c(0,10),x=c(-2,3)) 

\(y=f(X)+\epsilon\)

 

the response is a numeric (continuous) variable

\(y=f(X)+\epsilon\)

 

the features can be numeric and categorical

\(y=f(X)+\epsilon\)

 

the function is linear: \(y=\beta_{0}+\sum_{j=1}^{p}\beta_{j}X_{j}+\epsilon\)

\(y=f(X)+\epsilon\)

 

the error is assumed to be \(\epsilon_{i}\sim N(0,\sigma^{2})\) with \(Cov(\epsilon_{i},\epsilon_{i'})=0\) for \(i\neq i'\)

\(y=f(X)+\epsilon\)

 

the goal: finding \(\hat{f}(\cdot)\) boils down to finding \(\hat{\beta}_{0},\hat{\beta}_{j}\)’s

Advertising data

Code
adv_data = read_csv(file="./data/Advertising.csv") |> select(-1)
adv_data |> slice_sample(n = 8) |> 
  kbl() |> kable_styling(font_size=10)
TV radio newspaper sales
76.4 0.8 14.8 9.4
66.9 11.7 36.8 9.7
265.6 20.0 0.3 17.4
151.5 41.3 58.5 18.5
23.8 35.1 65.9 9.2
237.4 27.5 11.0 18.9
7.8 38.9 50.6 6.6
197.6 23.3 14.2 16.6
  • sales is the response, indicating the level of sales in a specific market
  • TV , Radio and Newspaper are the predictors, indicating the advertising budget spent on the corresponding media

Advertising data

Code
adv_data_tidy = adv_data |> pivot_longer(names_to="medium",values_to="budget",cols = 1:3) 
adv_data_tidy |> slice_sample(n = 8) |> 
  kbl() |> kable_styling(font_size=10)
sales medium budget
14.8 radio 10.1
19.4 radio 33.5
10.5 radio 16.0
11.8 newspaper 23.7
11.7 newspaper 5.9
19.2 TV 193.7
16.9 TV 163.3
16.6 TV 202.5

Advertising data

Code
adv_data_tidy |> 
  ggplot(aes(x = budget, y = sales)) + theme_minimal() + facet_wrap(~medium,scales = "free") +
  geom_point(color="indianred",alpha=.5,size=3)

Note: the budget spent on TV is up to 300, less so for Radio and Newspaper

Advertising data: single regressions

We can be naive and regress sales on TV, newspaper and radio, separately.

Advertising data: single regressions

Code
avd_lm_models_nest = adv_data_tidy |> 
  group_by(medium) |> 
  group_nest(.key = "datasets") |> 
  mutate(
    model_output  = map(.x = datasets, ~lm(sales ~ budget, data = .x)),
    model_params  = map(.x = model_output, ~tidy(.x)),
    model_metrics = map(.x = model_output, ~glance(.x)),
    model_fitted  = map(.x = model_output, ~augment(.x))
  )
# A tibble: 3 × 6
  medium           datasets model_output model_params model_metrics model_fitted
  <chr>     <list<tibble[,> <list>       <list>       <list>        <list>      
1 TV              [200 × 2] <lm>         <tibble>     <tibble>      <tibble>    
2 newspaper       [200 × 2] <lm>         <tibble>     <tibble>      <tibble>    
3 radio           [200 × 2] <lm>         <tibble>     <tibble>      <tibble>    

This nested data structure stores everything. The functions tidy, glance and augment pull the information from the model output, and arrange it in a tidy way.

Advertising data: nested structure

The quantities nested in the tibble can be pulled out, or they can be expanded within the tibble itself, using unnest

Code
avd_lm_models_nest  |> unnest(model_params) 
# A tibble: 6 × 10
  medium       datasets model_output term  estimate std.error statistic  p.value
  <chr>     <list<tibb> <list>       <chr>    <dbl>     <dbl>     <dbl>    <dbl>
1 TV          [200 × 2] <lm>         (Int…   7.03     0.458       15.4  1.41e-35
2 TV          [200 × 2] <lm>         budg…   0.0475   0.00269     17.7  1.47e-42
3 newspaper   [200 × 2] <lm>         (Int…  12.4      0.621       19.9  4.71e-49
4 newspaper   [200 × 2] <lm>         budg…   0.0547   0.0166       3.30 1.15e- 3
5 radio       [200 × 2] <lm>         (Int…   9.31     0.563       16.5  3.56e-39
6 radio       [200 × 2] <lm>         budg…   0.202    0.0204       9.92 4.35e-19
# ℹ 2 more variables: model_metrics <list>, model_fitted <list>

Advertising data: nested structure

The quantities nested in the tibble can be pulled out, or they can be expanded within the tibble itself, using unnest

Code
avd_lm_models_nest  |> unnest(model_params) |> 
  select(medium,term:p.value) |>
  kbl(digits = 4) |> kable_styling(font_size = 10) |> 
  row_spec(1:2,background = "#D9FDEC") |> 
  row_spec(3:4,background = "#CAF6FC") |> 
  row_spec(5:6,background = "#FDB5BA")
medium term estimate std.error statistic p.value
TV (Intercept) 7.0326 0.4578 15.3603 0.0000
TV budget 0.0475 0.0027 17.6676 0.0000
newspaper (Intercept) 12.3514 0.6214 19.8761 0.0000
newspaper budget 0.0547 0.0166 3.2996 0.0011
radio (Intercept) 9.3116 0.5629 16.5422 0.0000
radio budget 0.2025 0.0204 9.9208 0.0000

   

The effect of budget on sales differs significantly from 0, for all the considered media

Ordinary least squares

\[\min_{\hat{\beta}_{0},\hat{\beta}_{1}}: RSS=\sum_{i=1}^{n}{e_{i}^{2}}=\sum_{i=1}^{n}{(y_{i}-\hat{y}_{i})^{2}}=\sum_{i=1}^{n}{(y_{i}-\hat{\beta}_{0}-\hat{\beta}_{1}x_{i})^{2}}\]

OLS estimator of \(\beta_{0}\)

\[\begin{split}&\partial_{\hat{\beta}_{0}}\sum_{i=1}^{n}{(y_{i}-\hat{\beta}_{0}-\hat{\beta}_{1}x_{i})^{2}}= -2\sum_{i=1}^{n}{(y_{i}-\hat{\beta}_{0}-\hat{\beta}_{1}x_{i})}=\\ &\sum_{i=1}^{n}{y_{i}}-n\hat{\beta}_{0}-\hat{\beta}_{1}\sum_{i=1}^{n}{x_{i}}=0\rightarrow \hat{\beta}_{0}=\bar{y}-\hat{\beta}_{1}\bar{x}\end{split}\]

Ordinary least squares

\[\min_{\hat{\beta}_{0},\hat{\beta}_{1}}\sum_{i=1}^{n}{e_{i}^{2}}=\sum_{i=1}^{n}{(y_{i}-\hat{y}_{i})^{2}}=\sum_{i=1}^{n}{(y_{i}-\hat{\beta}_{0}-\hat{\beta}_{1}x_{i})^{2}}\]

OLS estimator of \(\beta_{1}\)

\[\begin{split}&\partial_{\hat{\beta}_{1}}\sum_{i=1}^{n}{(y_{i}-\hat{\beta}_{0}-\hat{\beta}_{1}x_{i})^{2}}= -2\sum_{i=1}^{n}{x_{i}(y_{i}-\hat{\beta}_{0}-\hat{\beta}_{1}x_{i})}= \sum_{i=1}^{n}{x_{i}y_{i}}-\hat{\beta}_{0}\sum_{i=1}^{n}{x_{i}}-\hat{\beta}_{1}\sum_{i=1}^{n}{x_{i}^{2}}=0\\ &\hat{\beta}_{1}\sum_{i=1}^{n}{x_{i}^{2}}=\sum_{i=1}^{n}{x_{i}y_{i}}-\sum_{i=1}^{n}{x_{i}}\left(\frac{\sum_{i=1}^{n}{y_{i}}}{n}-\hat{\beta}_{1}\frac{\sum_{i=1}^{n}{x_{i}}}{n}\right)\\ &\hat{\beta}_{1}\left(n\sum_{i=1}^{n}{x_{i}^{2}}-(\sum_{i=1}^{n}{x_{i}})^{2} \right)= n\sum_{i=1}^{n}{x_{i}y_{i}}-\sum_{i=1}^{n}{x_{i}}\sum_{i=1}^{n}{y_{i}}\\ &\hat{\beta}_{1}=\frac{n\sum_{i=1}^{n}{x_{i}y_{i}}-\sum_{i=1}^{n}{x_{i}}\sum_{i=1}^{n}{y_{i}}} {n\sum_{i=1}^{n}{x_{i}^{2}}-(\sum_{i=1}^{n}{x_{i}})^{2} }=\frac{\sigma_{xy}}{\sigma^{2}_{x}} \end{split}\]
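The closed-form estimators above can be checked numerically against \(\texttt{lm()}\); a quick sketch on simulated data (values are illustrative):

```r
# verify the OLS closed forms against lm() on simulated data
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50)

b1 <- cov(x, y) / var(x)       # beta_1 = sigma_xy / sigma_x^2
b0 <- mean(y) - b1 * mean(x)   # beta_0 = ybar - beta_1 * xbar

fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(b0, b1))   # the two estimates coincide
```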

Three regression lines

Code
three_regs_plot=avd_lm_models_nest  |> unnest(model_fitted) |> 
  select(medium,sales:.fitted) |> 
  ggplot(aes(x=budget,y=sales))+theme_minimal()+facet_grid(~medium,scales="free")+
  geom_point(alpha=.5,color = "indianred")+
  geom_smooth(method="lm")+
  geom_segment(aes(x=budget,xend=budget,y=.fitted,yend=sales),color="forestgreen",alpha=.25)


Advertising data: nested structure

The quantities nested in the tibble can be pulled out, or they can be expanded within the tibble itself, using \(\texttt{unnest}\)

Code
avd_lm_models_nest  |> unnest(model_metrics) |> 
  select(medium,r.squared:df.residual) |>
  kbl(digits = 4) |> kable_styling(font_size = 10) |> 
  row_spec(1,background = "#D9FDEC") |> 
  row_spec(2,background = "#CAF6FC") |> 
  row_spec(3,background = "#FDB5BA")
medium r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual
TV 0.6119 0.6099 3.2587 312.1450 0.0000 1 -519.0457 1044.091 1053.986 2102.531 198
newspaper 0.0521 0.0473 5.0925 10.8873 0.0011 1 -608.3357 1222.671 1232.566 5134.805 198
radio 0.3320 0.3287 4.2749 98.4216 0.0000 1 -573.3369 1152.674 1162.569 3618.479 198

Linear regression: model assumptions

The linear regression model is

\[y_{i}=\beta_{0}+\beta_{1}x_{i}+\epsilon_{i}\]

where \(\epsilon_{i}\) is a random variable with expected value of 0. For inference, more assumptions must be made:

  • \(\epsilon_{i}\sim N(0,\sigma^{2})\): the variance of the errors \(\sigma^{2}\) does not depend on \(x_{i}\).

  • \(Cov(\epsilon_{i},\epsilon_{i'})=0\) , for all \(i\neq i'\) and \(i,i'=1,\ldots,n\).

  • \(x_{i}\) is non-stochastic

It follows that \(y_{i}\) is a random variable such that

  • \(E[y_{i}]=\beta_{0}+\beta_{1}x_{i}\)

  • \(Var[y_{i}]=\sigma^{2}\)

Linear regression: model assumptions

Statistics for Business and Economics (Anderson, Sweeney and Williams, (2011))

Inference on \(\hat{\beta}_{1}\): hypothesis testing

The null hypothesis is \(\beta_{1}=0\) and the test statistic is

\[\frac{\hat{\beta}_{1}-\beta_{1}}{SE(\hat{\beta}_{1})}\]

which, under the null hypothesis, reduces to \(\frac{\hat{\beta}_{1}}{SE(\hat{\beta}_{1})}\), distributed as a Student’s \(t\) with \(n-2\) degrees of freedom

\[\text{Reject } H_{0} \text{ if } \left|\frac{\hat{\beta}_{1}}{SE(\hat{\beta}_{1})}\right|>t_{1-\frac{\alpha}{2},n-2}\]
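The test statistic can be recomputed by hand, using \(SE(\hat{\beta}_{1})=\sqrt{\hat{\sigma}^{2}/\sum_{i}(x_{i}-\bar{x})^{2}}\) with \(\hat{\sigma}^{2}=RSS/(n-2)\); a sketch on simulated data:

```r
# recompute the t statistic and p-value for beta_1, and compare with summary(lm)
set.seed(1)
n <- 40
x <- runif(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

sigma2 <- sum(residuals(fit)^2) / (n - 2)       # RSS / (n - 2)
se1    <- sqrt(sigma2 / sum((x - mean(x))^2))   # SE(beta_1)
t1     <- unname(coef(fit)[2]) / se1
p1     <- 2 * pt(abs(t1), df = n - 2, lower.tail = FALSE)

all.equal(t1, unname(summary(fit)$coefficients["x", "t value"]))
```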

Linear regression: multiple predictors

  • \(y\) is the numeric response; \(X_{1},\ldots,X_{p}\) are the predictors. The general formula for the linear model is \[y = f(X)+\epsilon=\beta_{0}+\sum_{j=1}^{p}X_{j}\beta_{j}+ \epsilon\]

In algebraic form

\[{\bf y}={\bf X}{\bf \beta}+{\bf \epsilon}\]

\[\begin{bmatrix} y_{1}\\ y_{2}\\ y_{3}\\ \vdots\\ y_{n} \end{bmatrix}=\begin{bmatrix} 1& x_{1,1}&\ldots&x_{1,p}\\ 1& x_{2,1}&\ldots&x_{2,p}\\ 1& x_{3,1}&\ldots&x_{3,p}\\ \vdots&\vdots&\vdots&\\ 1& x_{n,1}&\ldots&x_{n,p}\\ \end{bmatrix}\begin{bmatrix} \beta_{0}\\ \beta_{1}\\ \vdots\\ \beta_{p}\\ \end{bmatrix}+\begin{bmatrix} \epsilon_{1}\\ \epsilon_{2}\\ \epsilon_{3}\\ \vdots\\ \epsilon_{n} \end{bmatrix}\]

Linear regression: ordinary least square (OLS)

OLS target

\[\min_{\hat{\beta}_{0},\hat{\beta}_{1},\ldots,\hat{\beta}_{p}}\sum_{i=1}^{n}\left(y_{i}-\hat{\beta}_{0}-\sum_{j=1}^{p}x_{ij}\hat{\beta}_{j}\right)^{2}=\min_{\hat{\beta}_{0},\hat{\beta}_{1},\ldots,\hat{\beta}_{p}}RSS\]

OLS target (algebraic) \[ \min_{\hat{\beta}} ({\bf y}-{\bf X}{\bf \hat{\beta}})^{\sf T}({\bf y}-{\bf X}{\bf \hat{\beta}})\]
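Minimising the algebraic target leads to the normal equations \(({\bf X}^{\sf T}{\bf X})\hat{\beta}={\bf X}^{\sf T}{\bf y}\); a sketch (simulated data) solving them directly and comparing with \(\texttt{lm()}\):

```r
# solve the normal equations (X'X) beta = X'y and compare with lm()
set.seed(1)
n <- 100
X <- cbind(1, x1 = rnorm(n), x2 = rnorm(n))   # design matrix with intercept column
y <- X %*% c(1, 2, -1) + rnorm(n)

beta_hat <- solve(t(X) %*% X, t(X) %*% y)     # (X'X)^{-1} X'y
fit <- lm(y ~ X[, "x1"] + X[, "x2"])

all.equal(unname(drop(beta_hat)), unname(coef(fit)))
```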

Advertising data: multiple linear regression

At this stage, no pre-processing, no test set evaluation, just fit the regression model on the whole data set

pre-processing: define a \(\texttt{recipe}\)

Code
adv_recipe = recipe(sales~., data=adv_data)

model specification: define a \(\texttt{parsnip}\) model

Code
adv_model = linear_reg(mode="regression", engine="lm")

define the \(\texttt{workflow}\)

Code
adv_wflow = workflow() |> 
  add_recipe(recipe=adv_recipe) |> 
  add_model(adv_model)

fit the model

Code
adv_fit = adv_wflow |> 
  fit(data=adv_data)

Advertising data: back to single models

medium term estimate std.error statistic p.value
TV (Intercept) 7.0326 0.4578 15.3603 0.0000
TV budget 0.0475 0.0027 17.6676 0.0000
newspaper (Intercept) 12.3514 0.6214 19.8761 0.0000
newspaper budget 0.0547 0.0166 3.2996 0.0011
radio (Intercept) 9.3116 0.5629 16.5422 0.0000
radio budget 0.2025 0.0204 9.9208 0.0000

Advertising data: single vs multiple regression

Code
avd_lm_models_nest  |> unnest(model_params) |> 
  select(medium,term:p.value) |>
  filter(term!="(Intercept)") |> arrange(medium) |> 
  kbl(digits = 4) |> kable_styling(font_size = 12) |> 
  column_spec(c(3,6),bold=TRUE) |> 
  row_spec(1,background = "#D9FDEC") |> 
  row_spec(2,background = "#CAF6FC") |> 
  row_spec(3,background = "#FDB5BA") 
medium term estimate std.error statistic p.value
TV budget 0.0475 0.0027 17.6676 0.0000
newspaper budget 0.0547 0.0166 3.2996 0.0011
radio budget 0.2025 0.0204 9.9208 0.0000
Code
adv_fit |> tidy() |> 
  filter(term!="(Intercept)") |> arrange(term) |> 
  kbl(digits = 4) |> kable_styling(font_size = 12) |> 
  column_spec(c(2,5),bold=TRUE) |> 
  row_spec(1,background = "#D9FDEC") |> 
  row_spec(2,background = "#CAF6FC") |> 
  row_spec(3,background = "#FDB5BA") 
term estimate std.error statistic p.value
TV 0.0458 0.0014 32.8086 0.0000
newspaper -0.0010 0.0059 -0.1767 0.8599
radio 0.1885 0.0086 21.8935 0.0000

 

 

why these contradictory results?

Advertising data: single vs multiple regression

Code
library("corrr")
adv_data |> correlate()  |>  network_plot()
  • \(\texttt{newspaper}\) is correlated with \(\texttt{radio}\), and the latter has a significant effect on \(\texttt{sales}\) in the multiple regression model

  • that is why, in the single model, \(\texttt{newspaper}\) appears to have a significant effect on \(\texttt{sales}\) ( \(\texttt{radio}\) is ignored)

supervised learning flow

pre-processing

split your observations in training, validation (or, cross-validate) and test

transform the predictors properly (feature engineering)

model-spec

specify the model to fit

tuning

select a reasonable grid of model hyperparameter(s) values to choose from

for each combination of hyperparameters

  1. fit the model on training observations
  2. compute appropriate metrics on evaluation observations

pick the best hyperparameter combination

final evaluation and fit

compute the metric for the tuned model on the test set (observations never used yet)

obtain the final fit for the model on all the available observations

the tidymodels metapackage

All the core packages in the tidymodels refer to one step of a supervised learning flow

tidymodels logo

For all things tidymodels check tidymodels.org!

the tidymodels core

All the core packages in the tidymodels refer to one step of a supervised learning flow

all

the tidymodels core

the rsample package provides tools for data splitting and resampling

rsample

the tidymodels core

the recipes package provides tools for data pre-processing and feature engineering

recipes

the tidymodels core

the parsnip package provides a unified interface to many models available in R

parsnip

the tidymodels core

the workflows package combines together pre-processing, modeling and post-processing steps

workflows

the tidymodels core

the yardstick package provides several performance metrics

yardstick

the tidymodels core

the dials package provides tools to define hyperparameters value grids

dials

the tidymodels core

the tune package greatly simplifies implementing hyperparameter optimization

tune

the tidymodels core

the broom package provides utility functions to tidify model output

broom

a simple flow

pre-processing: training and test

The idea is to:

  • fit the model on a set of observations (training)

  • assess the model performance on a different set of observations (testing)

Refer to the Advertising data, split the observations in training/testing

adv_data = read_csv(file="./data/Advertising.csv") |> select(-1)
adv_split = initial_split(adv_data,prop=3/4,strata=sales)
adv_train=training(adv_split)
adv_test=testing(adv_split)

training set

Code
adv_train |> slice_sample(n=5) |> kbl()
TV radio newspaper sales
281.4 39.6 55.8 24.4
0.7 39.6 8.7 1.6
220.3 49.0 3.2 24.7
76.3 27.5 16.0 12.0
139.3 14.5 10.2 13.4

test set

Code
adv_test |> slice_sample(n=5) |> kbl()
TV radio newspaper sales
19.6 20.1 17.0 7.6
198.9 49.4 60.0 23.7
17.9 37.6 21.6 8.0
76.4 0.8 14.8 9.4
228.3 16.9 26.2 15.5

pre-processing: specify the recipe (just the formula in this case)

adv_rec = recipe(formula=sales~.,data = adv_train)

 

model specification

adv_model = linear_reg(mode="regression",engine="lm")

 

pair the recipe and the model specification in a workflow, and fit the model on the training set

adv_wf = workflow() |> add_recipe(adv_rec) |> add_model(adv_model)
adv_fit = adv_wf |> fit(data = adv_train)

 

use the fitted model to predict the test observations

adv_pred = adv_fit |> augment(new_data = adv_test)

 

compute the performance metric rmse (root mean squared error)

adv_pred |> rmse(truth = sales, estimate=.pred) |> kbl()
.metric .estimator .estimate
rmse standard 1.753936
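rmse is just the square root of the mean squared prediction error; a minimal base-R sketch with toy numbers (not the Advertising data):

```r
# rmse by hand: sqrt of the mean squared difference between truth and prediction
rmse_manual <- function(truth, estimate) sqrt(mean((truth - estimate)^2))

truth    <- c(10, 12, 9, 14)
estimate <- c(11, 11, 10, 13)
rmse_manual(truth, estimate)   # each error is 1 in absolute value, so rmse = 1
```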
Code
library(magick)
library(tidyverse)
library(purrr)
library(tidymodels)
library(discrim)
library(flipbookr)
library(kableExtra)
library(janitor)
library(patchwork)

classification

classification model

In a classification problem the response is a categorical variable

rather than predicting the value of \(Y\), one wants to estimate the posterior probability

\[P(Y=k\mid X=x_{i})\]

that is, the probability that the observation \(i\) belongs to class \(k\), given that the predictor value for \(i\) is \(x_{i}\)

will a credit card owner default?

Code
set.seed(1234)
default=read_csv("./data/Default.csv")
default |> slice_sample(n=6) |> kbl() |> kable_styling(font_size=12)
default student balance income
No Yes 311.32186 22648.76
No Yes 697.13558 18377.15
No Yes 470.10718 16014.11
No No 1200.04162 56081.08
No No 553.64902 47021.49
No No 10.23149 27237.38

Do balance and income bring useful info to identify defaulters?

Code
p1 = default |> ggplot(aes(x = balance, y=income,color=default))+
  geom_point(alpha=.5,size=3)+theme_minimal() 
p1

Do balance and income bring useful info to identify defaulters?

Code
p2 = default |> pivot_longer(names_to = "predictor",values_to="value",cols=c(balance,income)) |> 
  ggplot(aes(x = default, y = value))+geom_boxplot(aes(fill=default))+
  facet_wrap(~predictor,scales="free")+theme_minimal()
p2

balance is useful, income is not

Code
p1 | p2

Note: to arrange multiple plots together, give a look at the patchwork package

linear regression for non-numeric response?

if \(Y\) is categorical

With \(K\) categories, one could code \(Y\) as an integer vector

  • for non-ordinal factors it would not make sense

 

  • even for ordinal factors, one would implicitly assume constant differences between categories

linear regression for non-numeric response?

if \(Y\) is binary

The goal is to estimate \(P(Y=1|X)\), which is, in fact, numeric . . .

  • So one can model \(P(Y=1|X)=\beta_{0}+\beta_{1}X\) . . .

 

  • What can possibly be wrong with this approach?

linear regression for non-numeric response?

\(P(\texttt{default}=\texttt{yes}|\texttt{balance})=\beta_{0}+\beta_{1}\texttt{balance}\)

Code
default |> mutate(def_01 = ifelse(default== "Yes",1,0)) |>  
  ggplot(aes(x=balance, y=def_01)) + geom_point(aes(color = default),size=4, alpha=.5) + 
  geom_smooth(method="lm") + theme_minimal()
\(P(\texttt{default}=\texttt{yes}|\texttt{balance})\) is bounded in \([0, 1]\), but the linear fit produces values outside that range, so it cannot be used

use the logistic function instead

\(P(\texttt{default}=\texttt{yes}|\texttt{balance})=\frac{e^{\beta_{0}+\beta_{1}\texttt{balance}}}{1+e^{\beta_{0}+\beta_{1}\texttt{balance}}}\)

modeling the posterior \(P(Y=1|X)\) by means of a logistic function is the goal of logistic regression

conditional expectation

just like in linear regression, the fit refers to the conditional expectation of \(Y\) given \(X\); since \(Y\in\{0,1\}\), it results that \[E[Y|X] \equiv P(Y=1|X)\]

The odds and the logit

\[ \begin{split} p(X)&=\frac{e^{\beta_{0}+\beta_{1}X}}{1+e^{\beta_{0}+\beta_{1}X}}\\ \left(1+e^{\beta_{0}+\beta_{1}X}\right)p(X)&=e^{\beta_{0}+\beta_{1}X}\\ p(X)+e^{\beta_{0}+\beta_{1}X}p(X)&=e^{\beta_{0}+\beta_{1}X}\\ p(X)&=e^{\beta_{0}+\beta_{1}X}-e^{\beta_{0}+\beta_{1}X}p(X)\\ p(X)&=e^{\beta_{0}+\beta_{1}X}\left(1-p(X)\right)\\ \frac{p(X)}{\left(1-p(X)\right)}&=e^{\beta_{0}+\beta_{1}X} \end{split} \]

  • \(\frac{p(X)}{\left(1-p(X)\right)}\) are the odds, that take value in \([0,\infty)\)
  • \(log\left(\frac{p(X)}{\left(1-p(X)\right)}\right)=\beta_{0}+\beta_{1}X\) is the logit: there is a linear relation between the logit and the predictor
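In R, the logistic function and its inverse are available as \(\texttt{plogis()}\) and \(\texttt{qlogis()}\); a quick numerical check of the odds/logit relations (coefficient values are illustrative):

```r
# p(X), the odds and the logit, for illustrative coefficients
b0 <- -10.65; b1 <- 0.0055
x  <- 2000

p     <- plogis(b0 + b1 * x)   # p(X) = e^eta / (1 + e^eta)
odds  <- p / (1 - p)           # in [0, Inf)
logit <- log(odds)             # linear in x

all.equal(logit, b0 + b1 * x)  # the logit recovers the linear predictor
```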

Estimating the posterior probability via the logistic function

a toy sample: fit the logistic function; for a new point \(\texttt{balance}=1400\), one can estimate \(P(\texttt{default=Yes}|\texttt{balance}=1400)=.62\)
Estimating the posterior probability via the logistic function

How to find the logistic function? estimate its parameters \(P(Y=1|X) = \frac{e^{\beta_{0}+\beta_{1}X}}{1 + e^{\beta_{0}+\beta_{1}X}}\)

Estimating the posterior probability via the logistic function

Least squares? One could switch to the logit, which is a linear function \(logit(p(Y=1|X))=\beta_{0} + \beta_{1} X\)

Estimating the posterior probability via the logistic function

Least squares? but the logit mapping, for the blue points, is \[\log\left(\frac{p(Y=1|X)}{1-p(Y=1|X)}\right) = \log\left(\frac{1}{1-1}\right) = \log(1) - \log(0) = 0-(-\infty) = +\infty\]

Estimating the logit via the linear function

Least squares? but the logit mapping, for the red points, is \[\log\left(\frac{p(Y=1|X)}{1-p(Y=1|X)}\right) = \log\left(\frac{0}{1-0}\right) = \log(0) = -\infty\]

Estimating the logit via the linear function

logit: mapping


Maximization of the likelihood function

The estimates for \(\beta_{0}\) and \(\beta_{1}\) are such that the following likelihood function is maximised:

\[ \begin{split} \ell\left(\hat{\beta}_{0},\hat{\beta}_{1}\right)=& \color{blue}{\prod_{\forall i}{p(x_{i})}}\times \color{red}{\prod_{\forall i'}{\left(1-p(x_{i'})\right)}}=\\ =&\color{blue}{ \prod_{\forall i} \frac{e^{\hat{\beta}_{0}+\hat{\beta}_{1}x_{i}}}{1+e^{\hat{\beta}_{0}+\hat{\beta}_{1}x_{i}}}} \times \color{red}{ \prod_{\forall i^{\prime}}{\left(1- \frac{e^{\hat{\beta}_{0}+\hat{\beta}_{1}x_{i'}}}{1+e^{\hat{\beta}_{0}+\hat{\beta}_{1}x_{i'}}} \right)} } \end{split} \]

Note: the \(i\) index is for blue points, \(i'\) is for red points
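As a sketch (simulated data, base R only): the same estimates can be obtained by minimising the negative log-likelihood with \(\texttt{optim()}\), and they match \(\texttt{glm()}\) up to numerical tolerance.

```r
# maximum likelihood for logistic regression: optim() vs glm()
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-1 + 2 * x))

negloglik <- function(b) {
  eta <- b[1] + b[2] * x
  -sum(y * eta - log(1 + exp(eta)))   # minus the log of the likelihood above
}

opt <- optim(c(0, 0), negloglik)
fit <- glm(y ~ x, family = binomial)

rbind(optim = opt$par, glm = unname(coef(fit)))
```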

Fitting the logistic regression

pre-process: specify the recipe

Code
def_rec = recipe(default~balance, data=default)

specify the model

Code
def_model_spec = logistic_reg(mode="classification", engine="glm")

put them together in the workflow

Code
def_wflow = workflow() |> add_recipe(def_rec) |> add_model(def_model_spec)

fit the model

Code
def_fit = def_wflow |> fit(data=default) 

Fitting the logistic regression

Look at the results

Code
def_fit |>  tidy() |> kbl() |> kable_styling(font_size=12)
term estimate std.error statistic p.value
(Intercept) -10.6513306 0.3611574 -29.49221 0
balance 0.0054989 0.0002204 24.95309 0
Code
def_fit |>  glance() |> kbl() |> kable_styling(font_size=12)
null.deviance df.null logLik AIC BIC deviance df.residual nobs
2920.65 9999 -798.2258 1600.452 1614.872 1596.452 9998 10000

Fitting the logistic regression: a qualitative predictor

Suppose you want to use \(\texttt{student}\) as the qualitative predictor for your logistic regression. You can update, within the workflow, the recipe only.

update the recipe in the workflow and re-fit

Code
def_wflow2 = def_wflow |> update_recipe(recipe(default~student,data=default))
def_fit2 = def_wflow2 |> fit(data=default) 

look at the results

Code
def_fit2 |>  tidy() |> kbl() |> kable_styling(font_size=12)
term estimate std.error statistic p.value
(Intercept) -3.5041278 0.0707130 -49.554219 0.0000000
studentYes 0.4048871 0.1150188 3.520181 0.0004313

Fitting the logistic regression: a qualitative predictor

It appears that if a customer is a student, they are more likely to default ( \(\hat{\beta}_{1} = 0.4\) ).

  • For non students \[ \begin{split} log\left(\frac{p(X)}{1-p(X)}\right) = -3.504 &\rightarrow& \ \frac{p(X)}{1-p(X)} = e^{-3.504}\rightarrow\\ p(X) = e^{-3.504}-e^{-3.504}p(X)&\rightarrow& \ p(X) +e^{-3.504}p(X) = e^{-3.504}\rightarrow\\ p(X)(1 +e^{-3.504}) = e^{-3.504} &\rightarrow & \ p(X) = \frac{e^{-3.504}}{1 +e^{-3.504}}=\color{blue}{0.029} \end{split} \]
  • For students \[ p(X) = \frac{e^{-3.504+0.405}}{1 +e^{-3.504+0.405}}=\color{red}{0.043} \]
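These probabilities can be recovered with \(\texttt{plogis()}\), the logistic CDF, applied to the fitted linear predictor (using the unrounded estimates from the table above):

```r
# back-transform the logistic coefficients into probabilities
p_non_student <- plogis(-3.5041278)               # intercept only, ~0.029
p_student     <- plogis(-3.5041278 + 0.4048871)   # intercept + studentYes, ~0.043
c(p_non_student, p_student)
```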

Multiple logistic regression

In case of multiple predictors

\[log\left(\frac{p(X)}{1-p(X)} \right)=\beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{2}+\ldots+\beta_{p}X_{p}\]

and the following relation holds

\[p(X)=\frac{e^{{\beta}_{0}+{\beta}_{1}X_{1}+{\beta}_{2}X_{2}+\ldots+{\beta}_{p}X_{p}}}{1+e^{{\beta}_{0}+{\beta}_{1}X_{1}+{\beta}_{2}X_{2}+\ldots+{\beta}_{p}X_{p}}}\]

Multiple logistic regression

Let’s consider two predictors, \(\texttt{balance}\) and \(\texttt{student}\); again, we just update the recipe within the workflow

update the recipe in the workflow and re-fit

Code
def_wflow3 = def_wflow |> update_recipe(recipe(default~balance+student,data=default))
def_fit3 = def_wflow3 |> fit(data=default) 

look at the results

Code
def_fit3 |>  tidy() |> kbl() |> kable_styling(font_size=12)
term estimate std.error statistic p.value
(Intercept) -10.7494959 0.3691914 -29.116326 0.0e+00
balance 0.0057381 0.0002318 24.749526 0.0e+00
studentYes -0.7148776 0.1475190 -4.846003 1.3e-06

 

anything weird?

Multiple logistic regression

Code
p_stu = def_fit2 |> augment(new_data=default) |> 
  ggplot(aes(x=balance, y= .pred_Yes,color=student))+geom_line(size=3)+
  theme_minimal() + ylim(c(0,.1))+xlab("balance")+ylab("P(default=yes|student)")

p_stu 

Multiple logistic regression

Code
p_bal_stu = def_fit3 |> augment(new_data=default) |> ggplot(aes(x=balance, y= .pred_Yes,color=student)) +
  geom_line(size=3) + theme_minimal() + xlab("balance") + ylab("P(default=yes|balance,student)")

p_bal_stu

Multiple logistic regression

Multiple logistic regression

more than two classes?

 

multinomial logistic regression

linear discriminant analysis (LDA)

linear discriminant analysis LDA: one predictor

In the linear discriminant analysis, the assumption is that, within class \(k\), the predictor is normally distributed

\[X\mid Y=k\sim N(\mu_{k},\sigma^{2})\]

therefore the class density is

\[f_{k}(x)=\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{1}{2\sigma^{2}}\left( x-\mu_{k} \right)^{2}\right)\]

 

  • in each class, the predictor is normally distributed

  • the scale parameter \(\sigma^{2}\) is the same for each class

linear discriminant analysis LDA: one predictor

Plugging in \(f_{k}(X)\) in the Bayes formula

\[p_{k}(x)=\frac{\pi_{k}\times f_{k}(x)}{\sum_{l=1}^{K}{\pi_{l}\times f_{l}(x)}} = \frac{\pi_{k}\times \overbrace{\frac{1}{\sqrt{2\pi}\sigma}exp\left(-\frac{1}{2\sigma^{2}}\left( x-\mu_{k} \right)^{2}\right)}^{\color{red}{{f_{k}(x)}}}} {\sum_{l=1}^{K}{\pi_{l}\times \underbrace{\frac{1}{\sqrt{2\pi}\sigma}exp\left(-\frac{1}{2\sigma^{2}}\left( x-\mu_{l} \right)^{2}\right)}_{\color{red}{{f_{l}(x)}}}}}\]

this requires estimating the following parameters

  • \(\mu_{1},\mu_{2},\ldots, \mu_{K}\)
  • \(\pi_{1},\pi_{2},\ldots, \pi_{K}\)
  • \(\sigma\)

to get, for each observation \(x\), \(\hat{p}_{1}(x)\), \(\hat{p}_{2}(x)\), \(\ldots\), \(\hat{p}_{K}(x)\); the observation is then assigned to the class for which \(\hat{p}_{k}(x)\) is largest.
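The plug-in rule takes only a few lines; the parameter values below are made up for illustration (\(K=3\) classes, shared \(\sigma\)):

```r
# plug-in Bayes posteriors, hypothetical estimated parameters
mu    <- c(-4, 0, 4)        # class means mu_k
prior <- c(0.5, 0.3, 0.2)   # priors pi_k
sigma <- 2                  # shared scale

posterior <- function(x) {
  num <- prior * dnorm(x, mean = mu, sd = sigma)
  num / sum(num)            # p_k(x), k = 1..K (sums to 1)
}

which.max(posterior(-1))    # class assigned to x = -1
```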

\(\color{red}{\text{Note}}:\) not all the quantities involved in the Bayes formula play a role in the classification of an object: in fact, some of them are constant across the classes.

two classes, same priors.

Consider a single predictor \(X\), normally distributed within the two classes, with parameters \(\mu_{1}\) , \(\mu_{2}\) and \(\sigma^{2}\) .

Also \(\pi_{1}=\pi_{2}\) . Now, the observation \(x\) is assigned to class 1 if

\[\begin{split} \delta_{1}(X) &>&\delta_{2}(X)\\ \color{blue}{\text{that is}} \\ log({\pi_{1}})-\frac{\mu_{1}^{2}}{2\sigma^{2}} + \frac{\mu_{1}}{\sigma^{2}}x &>& log({\pi_{2}})-\frac{\mu_{2}^{2}}{2\sigma^{2}} + \frac{\mu_{2}}{\sigma^{2}}x \\ \color{blue}{\text{ since }\pi_{1}=\pi_{2}}\\ -\frac{\mu_{1}^{2}}{2\sigma^{2}} + \frac{\mu_{1}}{\sigma^{2}}x > -\frac{\mu_{2}^{2}}{2\sigma^{2}} + \frac{\mu_{2}}{\sigma^{2}}x & \ \rightarrow \ & -\frac{\mu_{1}^{2}}{2} + \mu_{1}x > -\frac{\mu_{2}^{2}}{2} + \mu_{2}x\\ (\mu_{1} - \mu_{2})x > \frac{\mu_{1}^{2}-\mu_{2}^{2}}{2} & \ \rightarrow \ & x > \frac{(\mu_{1}+\mu_{2})(\mu_{1}-\mu_{2})}{2(\mu_{1} - \mu_{2})}\\ x &>& \frac{(\mu_{1}+\mu_{2})}{2}\\ \end{split}\]

the Bayes decision boundary, where \(\delta_{1}=\delta_{2}\), is at \(\color{red}{x=\frac{(\mu_{1}+\mu_{2})}{2}}\) (the division in the last step assumes \(\mu_{1}>\mu_{2}\); with \(\mu_{1}<\mu_{2}\) the inequality flips, but the boundary stays at the midpoint)
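A numerical check of the boundary, taking class 1 as the left-hand class (\(\mu_{1}=-4\), as in the plot that follows) and \(\sigma=2\):

```r
# with equal priors, delta_1 and delta_2 cross at (mu1 + mu2) / 2
mu1 <- -4; mu2 <- 4; sigma <- 2; prior <- 0.5
delta <- function(x, mu) log(prior) - mu^2 / (2 * sigma^2) + mu * x / sigma^2

x_star <- (mu1 + mu2) / 2                          # boundary at 0
all.equal(delta(x_star, mu1), delta(x_star, mu2))  # TRUE: the discriminants tie
delta(-2, mu1) > delta(-2, mu2)                    # TRUE: x = -2 goes to class 1
```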

two classes, same priors.

Code
set.seed(1234)
p_1 = ggplot() + xlim(-10,10) + theme_minimal() +
  stat_function(fun=dnorm,args=list(mean=4,sd=2),geom="area",fill="dodgerblue",alpha=.25)+
  stat_function(fun=dnorm,args=list(mean=-4,sd=2),geom="area",fill="indianred",alpha=.25)+
  geom_vline(xintercept=0,size=2,alpha=.5)+geom_vline(xintercept=-4,color="grey",size=3,alpha=.5)+
  geom_vline(xintercept=4,color="grey",size=3,alpha=.5) + geom_point(aes(x=-2,y=0),inherit.aes = FALSE,size=10,alpha=.5,color="darkgreen")+
  geom_point(aes(x=1,y=0),inherit.aes = FALSE,size=10,alpha=.5,color="magenta") + xlab("x")
p_1

The \(\color{darkgreen}{\text{green point}}\) goes to class 1, the \(\color{magenta}{\text{pink point}}\) goes to class 2

two classes, same priors.

Consider a training set with 100 observations from the two classes (50 each): one needs to estimate \(\mu_1\) and \(\mu_2\) to have the estimated boundary at \(\frac{\hat{\mu}_{1}+\hat{\mu}_{2}}{2}\)

Code
class_12=tibble(class_1=rnorm(50,mean = -4,sd=2),class_2=rnorm(50,mean = 4,sd=2)) |> 
  pivot_longer(names_to="classes",values_to="values",cols = 1:2)
mu_12 = class_12 |> group_by(classes) |> summarise(means=mean(values))
mu_12_mean =  mean(mu_12$means)
p_2=class_12 |> ggplot(aes(x=values,fill=classes)) + theme_minimal() + 
  geom_histogram(aes(y=after_stat(density)),alpha=.5,color="grey") + xlim(-10,10) +
  geom_vline(xintercept=mu_12 |> pull(means),color="grey",size=3,alpha=.75)+
  geom_vline(xintercept = mu_12_mean, size=2, alpha=.75) + theme(legend.position = "none")
p_2

two classes, same priors.

The Bayes boundary is at \(\color{red}{0}\); the estimated boundary is slightly off at -0.31

Code
p_1 / p_2

Fitting the LDA

Code
default=read_csv("./data/Default.csv") |> mutate(default=as.factor(default))

set.seed(1234)
def_split = initial_split(default,prop=3/4,strata=default)
default_test = testing(def_split)
default = training(def_split)
def_rec = recipe(default~balance, data=default)
def_lda_spec = discrim_linear(mode="classification", engine="MASS")
def_wflow_lda = workflow() |> add_recipe(def_rec) |> add_model(def_lda_spec)
def_fit_lda = def_wflow_lda |> fit(data=default) 
def_fit_lda |>  extract_fit_engine()
Call:
lda(..y ~ ., data = data)

Prior probabilities of groups:
    No    Yes 
0.9667 0.0333 

Group means:
      balance
No   803.9438
Yes 1747.8217

Coefficients of linear discriminants:
                LD1
balance 0.002206916
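The decision boundary on `balance` can be recovered by hand from the printed output, solving \(\delta_{No}(x)=\delta_{Yes}(x)\) for \(x\). This is a back-of-the-envelope sketch: it assumes that, with a single predictor, `MASS::lda` scales LD1 to unit pooled within-class variance, so \(\hat{\sigma}\approx 1/0.002206916\).

```r
# hand-computed LDA boundary on balance, from the printed priors and group means
pi_no  <- 0.9667;   pi_yes <- 0.0333
mu_no  <- 803.9438; mu_yes <- 1747.8217
sigma  <- 1 / 0.002206916   # assumed pooled within-class sd (see lead-in)

# solving delta_No(x) = delta_Yes(x) for x:
x_star <- (mu_no + mu_yes) / 2 +
  sigma^2 * log(pi_no / pi_yes) / (mu_yes - mu_no)
x_star   # balances above x_star are classified Yes
```

Note that the unbalanced priors push the boundary well above the midpoint of the two group means: the rare Yes class needs stronger evidence.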

Class-specific error

  • the Bayes-classifier-like rule (assign each observation to its most probable class) maximizes the overall accuracy
  • in several classification problems, however, not all classification errors are alike

    • e.g. in the default problem one wants to avoid treating as reliable customers who will in fact default
  • to control this class-specific error, it is possible to shift the threshold away from the 0.5 used by the Bayes classifier
Code
lda_pred = def_fit_lda |> augment(new_data = default_test) |>
  dplyr::select(default, .pred_class, .pred_Yes) |> 
  mutate(.pred_class_0_05 = as.factor(ifelse(.pred_Yes>.05,"Yes","No")),
         .pred_class_0_1 = as.factor(ifelse(.pred_Yes>.1,"Yes","No")),
         .pred_class_0_2 = as.factor(ifelse(.pred_Yes>.2,"Yes","No")),
         .pred_class_0_3 = as.factor(ifelse(.pred_Yes>.3,"Yes","No")),
         .pred_class_0_4 = as.factor(ifelse(.pred_Yes>.4,"Yes","No")),
         .pred_class_0_5 = as.factor(ifelse(.pred_Yes>.5,"Yes","No"))
         )

Class-specific error

Class-specific error

Code
lda_pred |> 
  pivot_longer(names_to = "threshold",values_to="prediction",cols = 4:9) |> 
  dplyr::select(-.pred_class,-.pred_Yes) |> group_by(threshold) |> 
  summarise(
    accuracy=round(mean(default==prediction),2),
    # share of actual defaulters ("Yes") missed by the rule, i.e. the error on the Yes class
    false_positive_rate = sum((default=="Yes")&(default!=prediction))/sum(default=="Yes")
    ) |> arrange(desc(accuracy),desc(false_positive_rate)) |> kbl() |> kable_styling(font_size=10) 
threshold accuracy false_positive_rate
.pred_class_0_5 0.97 0.8160920
.pred_class_0_4 0.97 0.7126437
.pred_class_0_3 0.97 0.6206897
.pred_class_0_2 0.96 0.4942529
.pred_class_0_1 0.94 0.2873563
.pred_class_0_05 0.89 0.1609195

 

Since the error on the Yes class (the share of actual defaulters classified as No) increases together with the overall accuracy, reducing it will cause the overall performance of the classifier to drop

Roc curve

Code
  def_fit_lda |> augment(new_data=default_test)  |> 
  mutate(default=factor(default,levels=c("Yes","No"))) |>
  roc_curve(truth = default, .pred_Yes)|>
  ggplot(aes(x=1-specificity,y= sensitivity))+ggtitle("lda roc curve") + 
  geom_path(color="indianred")+geom_abline(lty=3)+coord_equal()+theme_minimal()
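The curve can be summarized by its area under the curve (AUC); a sketch reusing the fitted workflow and test set from above (`roc_auc` is yardstick’s AUC helper):

```r
# AUC for the LDA classifier on the test set
def_fit_lda |> augment(new_data = default_test) |>
  mutate(default = factor(default, levels = c("Yes", "No"))) |>
  roc_auc(truth = default, .pred_Yes)
```

An AUC of 0.5 corresponds to the flat-prior diagonal in the plot; the closer to 1, the better the classifier separates the two classes across all thresholds.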