\(y=f(X)+\epsilon\)

 

this is what it is all about…

\(y=f(X)+\epsilon\)

 

the response one is interested in

\(y=f(X)+\epsilon\)

 

the features that are used to estimate \(y\)

\(y=f(X)+\epsilon\)

 

the function that links the \(X\)’s to \(y\)

\(y=f(X)+\epsilon\)

 

the error: the \(X\)’s cannot explain \(y\) exactly

\(y=f(X)+\epsilon\)

 

the goal: find \(\hat{f}(\cdot)\) to estimate \(f(\cdot)\)

What to do, and how

 

\(y\)

regression: if the response is numeric

classification: if the response is non-numeric (categorical)

 

 

goal

inference: study the effects of \(X\)’s on \(y\)

prediction: estimate \(y\), given new values of the \(X\)’s

 

 

fit strategy

parametric: assume a functional form for \(f(\cdot)\) and fit its parameters

non-parametric: fit \(f(\cdot)\) (almost) point-by-point (the sketch below contrasts the two strategies).
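
As a minimal sketch of the two strategies (on simulated data; the object names below are illustrative), a straight line can be fit parametrically with \(\texttt{lm()}\), while \(\texttt{loess()}\) estimates the same relationship locally:

Code
set.seed(1)
x = runif(100, 0, 10)
y = sin(x) + x / 2 + rnorm(100, sd = .3)
fit_param    = lm(y ~ x)     # parametric: assume f() is a straight line and estimate its two parameters
fit_nonparam = loess(y ~ x)  # non-parametric: estimate f() locally, (almost) point-by-point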

basics: linear regression

guess \(y\)? (no \(X\) available)

Code
set.seed(123)
toy_data = tibble(y = runif(30,min=0,max = 10),x_no=0,x_yes = ((y-3)/3)-rnorm(30,sd=.5)) |> 
  pivot_longer(cols = x_no : x_yes, values_to = "x",names_to="states")

toy_data |> filter(states == "x_no") |> ggplot(aes(y=y,x=x)) + 
  geom_point(color="indianred",size=3,alpha=.75) + 
  expand_limits(y=c(0,10),x=c(-2,3)) + geom_hline(yintercept = mean(toy_data$y),color="forestgreen")

guess \(y\)? (\(X\) available)

Code
toy_data |> ggplot(aes(y=y,x=x))+
  geom_point(color="indianred",size=3,alpha=.75)+
  expand_limits(y=c(0,10),x=c(-2,3)) + 
  geom_hline(yintercept = mean(toy_data$y),color="forestgreen") +
  transition_states(states)

guess \(y\)? (\(X\) available)

Code
toy_data |> filter(states == "x_yes") |> ggplot(aes(y=y,x=x)) + 
  geom_point(color="indianred",size=3,alpha=.75)  + geom_smooth(method = "lm") + 
  geom_hline(yintercept = mean(toy_data$y),color="forestgreen")+
  expand_limits(y=c(0,10),x=c(-2,3)) 

\(y=f(X)+\epsilon\)

 

the response is a numeric (continuous) variable

\(y=f(X)+\epsilon\)

 

the features can be numeric and categorical

\(y=f(X)+\epsilon\)

 

the function is linear: \(y=\beta_{0}+\sum_{j=1}^{p}\beta_{j}X_{j}+\epsilon\)

\(y=f(X)+\epsilon\)

 

the errors are assumed to be \(\epsilon_{i}\sim N(0,\sigma^{2})\), with \(cov(\epsilon_{i},\epsilon_{i'})=0\) for \(i\neq i'\)

\(y=f(X)+\epsilon\)

 

the goal: finding \(\hat{f}(\cdot)\) boils down to finding \(\hat{\beta}_{0}\) and the \(\hat{\beta}_{j}\)'s

prologue

stuff you might already know

supervised learning flow

the bias variance trade-off

assess model performance

end of prologue

Advertising data

Code
adv_data = read_csv(file="./data/Advertising.csv") |> select(-1)
adv_data |> slice_sample(n = 8) |> 
  kbl() |> kable_styling(font_size=10)
TV radio newspaper sales
76.4 0.8 14.8 9.4
66.9 11.7 36.8 9.7
265.6 20.0 0.3 17.4
151.5 41.3 58.5 18.5
23.8 35.1 65.9 9.2
237.4 27.5 11.0 18.9
7.8 38.9 50.6 6.6
197.6 23.3 14.2 16.6
  • sales is the response, indicating the level of sales in a specific market
  • TV, radio and newspaper are the predictors, indicating the advertising budget spent on the corresponding medium

Advertising data

Code
adv_data_tidy = adv_data |> pivot_longer(names_to="medium",values_to="budget",cols = 1:3) 
adv_data_tidy |> slice_sample(n = 8) |> 
  kbl() |> kable_styling(font_size=10)
sales medium budget
14.8 radio 10.1
19.4 radio 33.5
10.5 radio 16.0
11.8 newspaper 23.7
11.7 newspaper 5.9
19.2 TV 193.7
16.9 TV 163.3
16.6 TV 202.5

Advertising data

Code
adv_data_tidy |> 
  ggplot(aes(x = budget, y = sales)) + theme_minimal() + facet_wrap(~medium,scales = "free") +
  geom_point(color="indianred",alpha=.5,size=3)

Note: the budget spent on TV reaches about 300, while the radio and newspaper budgets are considerably smaller

Advertising data: single regressions

We can be naive and regress sales on TV, newspaper and radio, separately.
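
Before the nested, tidyverse-based version shown next, the same three single regressions can be fit directly with \(\texttt{lm()}\) on the wide \(\texttt{adv_data}\) (a minimal sketch; the object names are illustrative):

Code
# three separate single regressions, one per medium
lm_tv        = lm(sales ~ TV,        data = adv_data)
lm_radio     = lm(sales ~ radio,     data = adv_data)
lm_newspaper = lm(sales ~ newspaper, data = adv_data)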

Advertising data: single regressions

Code
avd_lm_models_nest = adv_data_tidy |> 
                  group_by(medium) |> 
                  group_nest(.key = "datasets") |> 
                  mutate(
                    model_output = map(.x=datasets,~lm(sales~budget,data=.x)),
                         model_params = map(.x=model_output, ~tidy(.x)),
                         model_metrics = map(.x=model_output, ~glance(.x)),
                         model_fitted = map(.x=model_output, ~augment(.x))
                  )
# A tibble: 3 × 2
  medium              datasets
  <chr>     <list<tibble[,2]>>
1 TV                 [200 × 2]
2 newspaper          [200 × 2]
3 radio              [200 × 2]

Advertising data: single regressions

Code
avd_lm_models_nest = adv_data_tidy |> 
                  group_by(medium) |> 
                  group_nest(.key = "datasets") |> 
                  mutate(
                    model_output = map(.x=datasets,~lm(sales~budget,data=.x)),
                         model_params = map(.x=model_output, ~tidy(.x)),
                         model_metrics = map(.x=model_output, ~glance(.x)),
                         model_fitted = map(.x=model_output, ~augment(.x))
                  )
# A tibble: 3 × 3
  medium              datasets model_output
  <chr>     <list<tibble[,2]>> <list>      
1 TV                 [200 × 2] <lm>        
2 newspaper          [200 × 2] <lm>        
3 radio              [200 × 2] <lm>        

Advertising data: single regressions

Code
avd_lm_models_nest = adv_data_tidy |> 
                  group_by(medium) |> 
                  group_nest(.key = "datasets") |> 
                  mutate(
                    model_output = map(.x=datasets,~lm(sales~budget,data=.x)),
                         model_params = map(.x=model_output, ~tidy(.x)),
                         model_metrics = map(.x=model_output, ~glance(.x)),
                         model_fitted = map(.x=model_output, ~augment(.x))
                  )
# A tibble: 3 × 4
  medium              datasets model_output model_params    
  <chr>     <list<tibble[,2]>> <list>       <list>          
1 TV                 [200 × 2] <lm>         <tibble [2 × 5]>
2 newspaper          [200 × 2] <lm>         <tibble [2 × 5]>
3 radio              [200 × 2] <lm>         <tibble [2 × 5]>

Advertising data: single regressions

Code
avd_lm_models_nest = adv_data_tidy |> 
                  group_by(medium) |> 
                  group_nest(.key = "datasets") |> 
                  mutate(
                    model_output = map(.x=datasets,~lm(sales~budget,data=.x)),
                         model_params = map(.x=model_output, ~tidy(.x)),
                         model_metrics = map(.x=model_output, ~glance(.x)),
                         model_fitted = map(.x=model_output, ~augment(.x))
                  )
# A tibble: 3 × 5
  medium              datasets model_output model_params     model_metrics    
  <chr>     <list<tibble[,2]>> <list>       <list>           <list>           
1 TV                 [200 × 2] <lm>         <tibble [2 × 5]> <tibble [1 × 12]>
2 newspaper          [200 × 2] <lm>         <tibble [2 × 5]> <tibble [1 × 12]>
3 radio              [200 × 2] <lm>         <tibble [2 × 5]> <tibble [1 × 12]>

Advertising data: single regressions

Code
avd_lm_models_nest = adv_data_tidy |> 
                  group_by(medium) |> 
                  group_nest(.key = "datasets") |> 
                  mutate(
                    model_output = map(.x=datasets,~lm(sales~budget,data=.x)),
                         model_params = map(.x=model_output, ~tidy(.x)),
                         model_metrics = map(.x=model_output, ~glance(.x)),
                         model_fitted = map(.x=model_output, ~augment(.x))
                  )
# A tibble: 3 × 6
  medium           datasets model_output model_params model_metrics model_fitted
  <chr>     <list<tibble[,> <list>       <list>       <list>        <list>      
1 TV              [200 × 2] <lm>         <tibble>     <tibble>      <tibble>    
2 newspaper       [200 × 2] <lm>         <tibble>     <tibble>      <tibble>    
3 radio           [200 × 2] <lm>         <tibble>     <tibble>      <tibble>    

This is a nested data structure with everything stored in a single tibble. The functions tidy, glance and augment extract the relevant information from the model output and arrange it in a tidy way.
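
To see what each of the three verbs returns, they can be tried on a single fit; a minimal sketch on the TV-only regression (assuming \(\texttt{broom}\) is loaded, e.g. via \(\texttt{tidymodels}\)):

Code
lm_tv = lm(sales ~ TV, data = adv_data)
tidy(lm_tv)     # one row per coefficient: estimate, std.error, statistic, p.value
glance(lm_tv)   # one row per model: r.squared, sigma, AIC, ...
augment(lm_tv)  # one row per observation: .fitted, .resid, ...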

Advertising data: nested structure

The quantities nested in the tibble can be pulled out, or they can be expanded within the tibble itself, using unnest

Code
avd_lm_models_nest  |> unnest(model_params) 
# A tibble: 6 × 10
  medium       datasets model_output term  estimate std.error statistic  p.value
  <chr>     <list<tibb> <list>       <chr>    <dbl>     <dbl>     <dbl>    <dbl>
1 TV          [200 × 2] <lm>         (Int…   7.03     0.458       15.4  1.41e-35
2 TV          [200 × 2] <lm>         budg…   0.0475   0.00269     17.7  1.47e-42
3 newspaper   [200 × 2] <lm>         (Int…  12.4      0.621       19.9  4.71e-49
4 newspaper   [200 × 2] <lm>         budg…   0.0547   0.0166       3.30 1.15e- 3
5 radio       [200 × 2] <lm>         (Int…   9.31     0.563       16.5  3.56e-39
6 radio       [200 × 2] <lm>         budg…   0.202    0.0204       9.92 4.35e-19
# ℹ 2 more variables: model_metrics <list>, model_fitted <list>

Advertising data: nested structure

The quantities nested in the tibble can be pulled out, or they can be expanded within the tibble itself, using unnest

Code
avd_lm_models_nest  |> unnest(model_params) |> 
  select(medium,term:p.value) |>
  kbl(digits = 4) |> kable_styling(font_size = 10) |> 
  row_spec(1:2,background = "#D9FDEC") |> 
  row_spec(3:4,background = "#CAF6FC") |> 
  row_spec(5:6,background = "#FDB5BA")
medium term estimate std.error statistic p.value
TV (Intercept) 7.0326 0.4578 15.3603 0.0000
TV budget 0.0475 0.0027 17.6676 0.0000
newspaper (Intercept) 12.3514 0.6214 19.8761 0.0000
newspaper budget 0.0547 0.0166 3.2996 0.0011
radio (Intercept) 9.3116 0.5629 16.5422 0.0000
radio budget 0.2025 0.0204 9.9208 0.0000

   

The effect of budget on sales differs significantly from 0 for all the media considered

Ordinary least squares

\[\min_{\hat{\beta}_{0},\hat{\beta}_{1}}: RSS=\sum_{i=1}^{n}{e_{i}^{2}}=\sum_{i=1}^{n}{(y_{i}-\hat{y}_{i})^{2}}=\sum_{i=1}^{n}{(y_{i}-\hat{\beta}_{0}-\hat{\beta}_{1}x_{i})^{2}}\]

OLS estimator of \(\beta_{0}\)

\[\begin{split}&\partial_{\hat{\beta}_{0}}\sum_{i=1}^{n}{(y_{i}-\hat{\beta}_{0}-\hat{\beta}_{1}x_{i})^{2}}= -2\sum_{i=1}^{n}{(y_{i}-\hat{\beta}_{0}-\hat{\beta}_{1}x_{i})}=0\rightarrow\\ &\sum_{i=1}^{n}{y_{i}}-n\hat{\beta}_{0}-\hat{\beta}_{1}\sum_{i=1}^{n}{x_{i}}=0\rightarrow \hat{\beta}_{0}=\bar{y}-\hat{\beta}_{1}\bar{x}\end{split}\]

Ordinary least squares

\[\min_{\hat{\beta}_{0},\hat{\beta}_{1}}\sum_{i=1}^{n}{e_{i}^{2}}=\sum_{i=1}^{n}{(y_{i}-\hat{y}_{i})^{2}}=\sum_{i=1}^{n}{(y_{i}-\hat{\beta}_{0}-\hat{\beta}_{1}x_{i})^{2}}\]

OLS estimator of \(\beta_{1}\)

\[\begin{split}&\partial_{\hat{\beta}_{1}}\sum_{i=1}^{n}{(y_{i}-\hat{\beta}_{0}-\hat{\beta}_{1}x_{i})^{2}}= -2\sum_{i=1}^{n}{x_{i}(y_{i}-\hat{\beta}_{0}-\hat{\beta}_{1}x_{i})}=0 \rightarrow \sum_{i=1}^{n}{x_{i}y_{i}}-\hat{\beta}_{0}\sum_{i=1}^{n}{x_{i}}-\hat{\beta}_{1}\sum_{i=1}^{n}{x_{i}^{2}}=0\\ &\hat{\beta}_{1}\sum_{i=1}^{n}{x_{i}^{2}}=\sum_{i=1}^{n}{x_{i}y_{i}}-\sum_{i=1}^{n}{x_{i}}\left(\frac{\sum_{i=1}^{n}{y_{i}}}{n}-\hat{\beta}_{1}\frac{\sum_{i=1}^{n}{x_{i}}}{n}\right)\\ &\hat{\beta}_{1}\left(n\sum_{i=1}^{n}{x_{i}^{2}}-(\sum_{i=1}^{n}{x_{i}})^{2} \right)= n\sum_{i=1}^{n}{x_{i}y_{i}}-\sum_{i=1}^{n}{x_{i}}\sum_{i=1}^{n}{y_{i}}\\ &\hat{\beta}_{1}=\frac{n\sum_{i=1}^{n}{x_{i}y_{i}}-\sum_{i=1}^{n}{x_{i}}\sum_{i=1}^{n}{y_{i}}} {n\sum_{i=1}^{n}{x_{i}^{2}}-(\sum_{i=1}^{n}{x_{i}})^{2} }=\frac{\sigma_{xy}}{\sigma^{2}_{x}} \end{split}\]
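
Both closed-form expressions can be checked numerically against \(\texttt{lm()}\); a minimal sketch using the TV budget as the single predictor (any consistent definition of \(\sigma_{xy}\) and \(\sigma^{2}_{x}\) works, since the \(n-1\) factors cancel):

Code
# check the OLS closed forms against lm(), with x = TV budget and y = sales
x = adv_data$TV
y = adv_data$sales
beta1_hat = cov(x, y) / var(x)             # sigma_xy / sigma^2_x
beta0_hat = mean(y) - beta1_hat * mean(x)  # ybar - beta1_hat * xbar
c(beta0_hat, beta1_hat)
coef(lm(y ~ x))                            # same values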

Three regression lines

Code
three_regs_plot=avd_lm_models_nest  |> unnest(model_fitted) |> 
  select(medium,sales:.fitted) |> 
  ggplot(aes(x=budget,y=sales))+theme_minimal()+facet_grid(~medium,scales="free")+
  geom_point(alpha=.5,color = "indianred")+
  geom_smooth(method="lm")+
  geom_segment(aes(x=budget,xend=budget,y=.fitted,yend=sales),color="forestgreen",alpha=.25)


Advertising data: nested structure

The quantities nested in the tibble can be pulled out, or they can be expanded within the tibble itself, using \(\texttt{unnest}\)

Code
avd_lm_models_nest  |> unnest(model_metrics) |> 
  select(medium,r.squared:df.residual) |>
  kbl(digits = 4) |> kable_styling(font_size = 10) |> 
  row_spec(1,background = "#D9FDEC") |> 
  row_spec(2,background = "#CAF6FC") |> 
  row_spec(3,background = "#FDB5BA")
medium r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual
TV 0.6119 0.6099 3.2587 312.1450 0.0000 1 -519.0457 1044.091 1053.986 2102.531 198
newspaper 0.0521 0.0473 5.0925 10.8873 0.0011 1 -608.3357 1222.671 1232.566 5134.805 198
radio 0.3320 0.3287 4.2749 98.4216 0.0000 1 -573.3369 1152.674 1162.569 3618.479 198

Linear regression: model assumptions

The linear regression model is

\[y_{i}=\beta_{0}+\beta_{1}x_{i}+\epsilon_{i}\]

where \(\epsilon_{i}\) is a random variable with expected value of 0. For inference, more assumptions must be made:

  • \(\epsilon_{i}\sim N(0,\sigma^{2})\); in particular, the variance of the errors \(\sigma^{2}\) does not depend on \(x_{i}\).

  • \(Cov(\epsilon_{i},\epsilon_{i'})=0\) , for all \(i\neq i'\) and \(i,i'=1,\ldots,n\).

  • \(x_{i}\) non stochastic

It follows that \(y_{i}\) is a random variable such that

  • \(E[y_{i}]=\beta_{0}+\beta_{1}x_{i}\)

  • \(Var[y_{i}]=\sigma^{2}\)

Linear regression: model assumptions

[Figure: the linear regression model assumptions, from Statistics for Business and Economics (Anderson, Sweeney and Williams, 2011)]

\(\sigma^{2}\) estimator

The variance \(\sigma^{2}\) is assumed to be constant, but it is unknown, and it has to be estimated:

  • since \(y_{i}\sim N(\beta_{0}+\beta_{1}x_{i},\sigma^{2})\) , it follows that

\[\frac{y_{i}-\beta_{0}-\beta_{1}x_{i}}{\sigma}\sim N(0,1)\]

  • furthermore, recall that \(\sum_{i=1}^{n}\left[N(0,1)\right]^{2} = \chi^{2}_{n}\) , then

\[\sum_{i=1}^{n}\left[N(0,1)\right]^{2}=\sum_{i=1}^{n}\left[\frac{y_{i}-\beta_{0}-\beta_{1}x_{i}}{\sigma}\right]^{2}=\chi^{2}_{n}\]

  • replacing \(\beta_{0}\) and \(\beta_{1}\) with their estimators \(\hat{\beta}_{0}\) and \(\hat{\beta}_{1}\) , the previous becomes

\[\frac{\sum_{i=1}^{n}\left(y_{i}-\hat{\beta}_{0}-\hat{\beta}_{1}x_{i}\right)^{2}}{\sigma^{2}}=\frac{RSS}{\sigma^{2}}=\chi^{2}_{n-2}\] two degrees of freedom are lost, as the two parameters are replaced by their estimators.

\(\sigma^{2}\) estimator

  • Finally, since \(E\left[\chi^{2}_{n-2}\right]=n-2\) then

\[E\left[\chi^{2}_{n-2}\right]=E\left[\frac{RSS}{\sigma^{2}}\right]=n-2\]

  • and because \(\sigma^{2}\) is constant, it can be pulled out of the expectation, and the previous can be rewritten as

\[\frac{E\left[RSS\right]}{\sigma^{2}}=n-2\]

  • it follows that \[\sigma^{2}=E\left[\frac{RSS}{n-2}\right]\] and \(\frac{RSS}{n-2}\) is an unbiased estimator of \(\sigma^{2}\).

  • \(\sqrt{\frac{RSS}{n-2}}=RSE\), the so-called residual standard error (checked numerically in the sketch below)
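
A quick numerical check of the RSE on the TV regression (the same quantity is returned by \(\texttt{sigma()}\) and reported by \(\texttt{glance()}\) as \(\texttt{sigma}\)):

Code
fit = lm(sales ~ TV, data = adv_data)
rss = sum(residuals(fit)^2)   # residual sum of squares
n   = nobs(fit)
sqrt(rss / (n - 2))           # RSE = sqrt(RSS / (n - 2))
sigma(fit)                    # same value, as computed by R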

\(\hat{\beta}_{1}\) as a linear combination of \(y_{i}\)

\[\begin{split} \hat{\beta}_{1}&= \frac{\sum_{i=1}^{n}{\left(y_{i}-\bar{y}\right)\left(x_{i}-\bar{x}\right)} }{\sum_{i=1}^{n}{\left(x_{i}-\bar{x}\right)^{2}}} =\frac{\sum_{i=1}^{n}{\left[y_{i} \left(x_{i}-\bar{x}\right) -\bar{y}\left(x_{i}-\bar{x}\right)\right]} }{\sum_{i=1}^{n}{\left(x_{i}-\bar{x}\right)^{2}}}=\\ &=\frac{\sum_{i=1}^{n}{y_{i} \left(x_{i}-\bar{x}\right)} -\bar{y}\sum_{i=1}^{n}{\left(x_{i}-\bar{x}\right)}}{\sum_{i=1}^{n}{\left(x_{i}-\bar{x}\right)^{2}}} = \sum_{i=1}^{n}{y_{i} \frac{\left(x_{i}-\bar{x}\right)}{\sum_{i=1}^{n}{\left(x_{i}-\bar{x}\right)^{2}}}}= \sum_{i=1}^{n}{w_{i}y_{i}}\end{split}\]

Given a linear combination \(U_{i}=c+d V_{i}\), if \(V_{i}\sim N(\mu_{v},\sigma^{2}_{v})\), then \(U_{i}\sim N(c+d\mu_{v},d^{2}\sigma^{2}_{v})\).

Since \(y_{i}\sim N(\beta_{0}+\beta_{1}x_{i},\sigma^{2})\) and \(\hat{\beta}_{1}=\sum_{i=1}^{n}{w_{i}y_{i}}\), here \(c=0\) and \(d=w_{i}\), so

\[\hat{\beta}_{1}\sim N(\sum_{i=1}^{n}{w_{i}(\beta_{0}+\beta_{1}x_{i})},\sum_{i=1}^{n}{w_{i}^{2}\sigma^{2}})\]

Parameters of the \(\hat{\beta}_{1}\) distribution: \(E\left[\hat{\beta}_{1}\right]=\beta_{1}\)

The expectation of \(\hat{\beta}_{1}\) is

  • \(E\left[\hat{\beta}_{1}\right] = \sum_{i=1}^{n}{w_{i}(\beta_{0}+\beta_{1}x_{i})} = \beta_{0}\underbrace{\sum_{i=1}^{n}{w_{i}}}_{0}+\beta_{1}\sum_{i=1}^{n}{w_{i}x_{i}}=\beta_{1}\sum_{i=1}^{n}{\frac{x_{i}-\bar{x}}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}x_{i}}=\)

    \(=\beta_{1}\sum_{i=1}^{n}{\frac{x_{i}-\bar{x}}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}x_{i}}=\)

Parameters of the \(\hat{\beta}_{1}\) distribution: \(E\left[\hat{\beta}_{1}\right]=\beta_{1}\)

The expectation of \(\hat{\beta}_{1}\) is

  • \(E\left[\hat{\beta}_{1}\right] = \sum_{i=1}^{n}{w_{i}(\beta_{0}+\beta_{1}x_{i})} = \beta_{0}\underbrace{\sum_{i=1}^{n}{w_{i}}}_{0}+\beta_{1}\sum_{i=1}^{n}{w_{i}x_{i}}=\beta_{1}\sum_{i=1}^{n}{\frac{x_{i}-\bar{x}}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}x_{i}}=\)

    \(=\beta_{1}\sum_{i=1}^{n}{\frac{x_{i}-\bar{x}}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}x_{i}}=\beta_{1}\frac{\sum_{i=1}^{n}{(x_{i}-\bar{x})x_{i}}}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}=\)

Parameters of the \(\hat{\beta}_{1}\) distribution: \(E\left[\hat{\beta}_{1}\right]=\beta_{1}\)

The expectation of \(\hat{\beta}_{1}\) is

  • \(E\left[\hat{\beta}_{1}\right] = \sum_{i=1}^{n}{w_{i}(\beta_{0}+\beta_{1}x_{i})} = \beta_{0}\underbrace{\sum_{i=1}^{n}{w_{i}}}_{0}+\beta_{1}\sum_{i=1}^{n}{w_{i}x_{i}}=\beta_{1}\sum_{i=1}^{n}{\frac{x_{i}-\bar{x}}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}x_{i}}=\)

    \(=\beta_{1}\sum_{i=1}^{n}{\frac{x_{i}-\bar{x}}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}x_{i}}=\beta_{1}\frac{\sum_{i=1}^{n}{(x_{i}-\bar{x})x_{i}}}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}=\beta_{1}\frac{\sum_{i=1}^{n}{x_{i}^2}-\bar{x}\sum_{i=1}^{n}{x_{i}}}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}\)

Parameters of the \(\hat{\beta}_{1}\) distribution: \(E\left[\hat{\beta}_{1}\right]=\beta_{1}\)

The expectation of \(\hat{\beta}_{1}\) is

  • \(E\left[\hat{\beta}_{1}\right] = \sum_{i=1}^{n}{w_{i}(\beta_{0}+\beta_{1}x_{i})} = \beta_{0}\underbrace{\sum_{i=1}^{n}{w_{i}}}_{0}+\beta_{1}\sum_{i=1}^{n}{w_{i}x_{i}}=\beta_{1}\sum_{i=1}^{n}{\frac{x_{i}-\bar{x}}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}x_{i}}=\)

    \(=\beta_{1}\sum_{i=1}^{n}{\frac{x_{i}-\bar{x}}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}x_{i}}=\beta_{1}\frac{\sum_{i=1}^{n}{(x_{i}-\bar{x})x_{i}}}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}=\beta_{1}\frac{\sum_{i=1}^{n}{x_{i}^2}-\bar{x}\sum_{i=1}^{n}{x_{i}}}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}=\beta_{1}\frac{1}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}\left(\sum_{i=1}^{n}{x_{i}^2}-\frac{\sum_{i=1}^{n}{x_{i}}}{n}\sum_{i=1}^{n}{x_{i}}\right)=\)

Parameters of the \(\hat{\beta}_{1}\) distribution: \(E\left[\hat{\beta}_{1}\right]=\beta_{1}\)

The expectation of \(\hat{\beta}_{1}\) is

  • \(E\left[\hat{\beta}_{1}\right] = \sum_{i=1}^{n}{w_{i}(\beta_{0}+\beta_{1}x_{i})} = \beta_{0}\underbrace{\sum_{i=1}^{n}{w_{i}}}_{0}+\beta_{1}\sum_{i=1}^{n}{w_{i}x_{i}}=\beta_{1}\sum_{i=1}^{n}{\frac{x_{i}-\bar{x}}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}x_{i}}=\)

    \(=\beta_{1}\sum_{i=1}^{n}{\frac{x_{i}-\bar{x}}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}x_{i}}=\beta_{1}\frac{\sum_{i=1}^{n}{(x_{i}-\bar{x})x_{i}}}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}=\beta_{1}\frac{\sum_{i=1}^{n}{x_{i}^2}-\bar{x}\sum_{i=1}^{n}{x_{i}}}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}=\beta_{1}\frac{1}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}\left(\sum_{i=1}^{n}{x_{i}^2}-\frac{\sum_{i=1}^{n}{x_{i}}}{n}\sum_{i=1}^{n}{x_{i}}\right)=\)

    \(=\beta_{1}\frac{1}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}\frac{n\sum_{i=1}^{n}{x_{i}^2}-\left[\sum_{i=1}^{n}{x_{i}}\right]^2}{n}=\)

Note: \(\frac{n\sum_{i=1}^{n}{x_{i}^2}-\left[\sum_{i=1}^{n}{x_{i}}\right]^2}{n} \frac{1}{n} =\frac{n\sum_{i=1}^{n}{x_{i}^2}}{n^{2}}-\frac{\left[\sum_{i=1}^{n}{x_{i}}\right]^2}{n^{2}}=\frac{\sum_{i=1}^{n}{x_{i}^2}}{n}-\bar{x}^{2}=var(x)\)

therefore \(\frac{n\sum_{i=1}^{n}{x_{i}^2}-\left[\sum_{i=1}^{n}{x_{i}}\right]^2}{n} = var(x)n=\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}\)

\(=\beta_{1}\frac{1}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}\frac{n\sum_{i=1}^{n}{x_{i}^2}-\left[\sum_{i=1}^{n}{x_{i}}\right]^2}{n}=\beta_{1}\frac{1}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}} \sum_{i=1}^{n}(x_{i}-\bar{x})^{2}=\beta_{1}\)

Parameters of the \(\hat{\beta}_{1}\) distribution: \(var\left(\hat{\beta}_{1}\right) = \frac{\sigma^{2}}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}\)

The variance of \(\hat{\beta}_{1}\) is

\(var\left(\hat{\beta}_{1}\right) = \sum_{i=1}^{n}{w^{2}_{i}\sigma^{2}}=\)

Parameters of the \(\hat{\beta}_{1}\) distribution: \(var\left(\hat{\beta}_{1}\right) = \frac{\sigma^{2}}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}\)

The variance of \(\hat{\beta}_{1}\) is

\(var\left(\hat{\beta}_{1}\right) = \sum_{i=1}^{n}{w^{2}_{i}\sigma^{2}} =\sum_{i=1}^{n}{\left[\frac{x_{i}-\bar{x}}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}\right]^{2}}\sigma^{2}\)

Parameters of the \(\hat{\beta}_{1}\) distribution: \(var\left(\hat{\beta}_{1}\right) = \frac{\sigma^{2}}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}\)

The variance of \(\hat{\beta}_{1}\) is

\(var\left(\hat{\beta}_{1}\right) = \sum_{i=1}^{n}{w^{2}_{i}\sigma^{2}} =\sum_{i=1}^{n}{\left[\frac{x_{i}-\bar{x}}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}\right]^{2}}\sigma^{2}=\sum_{i=1}^{n}{\frac{(x_{i}-\bar{x})^{2}}{\left[\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}\right]^{2}}}\sigma^{2}\)

Parameters of the \(\hat{\beta}_{1}\) distribution: \(var\left(\hat{\beta}_{1}\right) = \frac{\sigma^{2}}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}\)

The variance of \(\hat{\beta}_{1}\) is

\(var\left(\hat{\beta}_{1}\right) = \sum_{i=1}^{n}{w^{2}_{i}\sigma^{2}} =\sum_{i=1}^{n}{\left[\frac{x_{i}-\bar{x}}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}\right]^{2}}\sigma^{2}=\sum_{i=1}^{n}{\frac{(x_{i}-\bar{x})^{2}}{\left[\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}\right]^{2}}}\sigma^{2}={\frac{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}{\left[\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}\right]^{2}}}\sigma^{2}\)

Parameters of the \(\hat{\beta}_{1}\) distribution: \(var\left(\hat{\beta}_{1}\right) = \frac{\sigma^{2}}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}\)

The variance of \(\hat{\beta}_{1}\) is

\(var\left(\hat{\beta}_{1}\right) = \sum_{i=1}^{n}{w^{2}_{i}\sigma^{2}} =\sum_{i=1}^{n}{\left[\frac{x_{i}-\bar{x}}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}\right]^{2}}\sigma^{2}=\sum_{i=1}^{n}{\frac{(x_{i}-\bar{x})^{2}}{\left[\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}\right]^{2}}}\sigma^{2}={\frac{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}{\left[\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}\right]^{2}}}\sigma^{2}={\frac{1}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}}\sigma^{2}\)

Parameters of the \(\hat{\beta}_{1}\) distribution: \(var\left(\hat{\beta}_{1}\right) = \frac{\sigma^{2}}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}\)

The variance of \(\hat{\beta}_{1}\) is

\(var\left(\hat{\beta}_{1}\right) = \sum_{i=1}^{n}{w^{2}_{i}\sigma^{2}} =\sum_{i=1}^{n}{\left[\frac{x_{i}-\bar{x}}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}\right]^{2}}\sigma^{2}=\sum_{i=1}^{n}{\frac{(x_{i}-\bar{x})^{2}}{\left[\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}\right]^{2}}}\sigma^{2}={\frac{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}{\left[\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}\right]^{2}}}\sigma^{2}={\frac{1}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}}\sigma^{2}\)

And the standard error of \(\hat{\beta}_{1}\) is

\(SE\left(\hat{\beta}_{1}\right) = \sqrt{\frac{\sigma^{2}}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}}\)

Inference on \(\hat{\beta}_{1}\): hypothesis testing

The null hypothesis is \(\beta_{1}=0\) and the test statistic is

\[\frac{\hat{\beta}_{1}-\beta_{1}}{SE(\hat{\beta}_{1})}\]

which, under the null hypothesis, reduces to \(\frac{\hat{\beta}_{1}}{SE(\hat{\beta}_{1})}\), distributed as a Student's \(t\) with \(n-2\) d.f.

\[\text{Reject } H_{0} \text{ if } \left|\frac{\hat{\beta}_{1}}{SE(\hat{\beta}_{1})}\right|>t_{1-\frac{\alpha}{2},n-2}\]
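
The standard error, the test statistic and its p-value can be reproduced from the formulas above; a minimal sketch on the TV regression (compare with the corresponding row of the single-regression table):

Code
# reproduce SE(beta1_hat), the t statistic and the two-sided p-value for the TV regression
fit   = lm(sales ~ TV, data = adv_data)
x     = adv_data$TV
n     = nobs(fit)
rse   = sqrt(sum(residuals(fit)^2) / (n - 2))  # estimate of sigma
se_b1 = rse / sqrt(sum((x - mean(x))^2))       # SE(beta1_hat)
t_b1  = coef(fit)["TV"] / se_b1                # test statistic under H0: beta1 = 0
p_b1  = 2 * pt(abs(t_b1), df = n - 2, lower.tail = FALSE)
c(se_b1, t_b1, p_b1)                           # compare with summary(fit)$coefficients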

Linear regression: multiple predictors

  • \(y\) is the numeric response; \(X_{1},\ldots,X_{p}\) are the predictors, the general formula for the linear model is \[y = f(X)+\epsilon=\beta_{0}+\sum_{j=1}^{p}X_{j}\beta_{j}+ \epsilon\]

In algebraic form

\[{\bf y}={\bf X}{\bf \beta}+{\bf \epsilon}\]

\[\begin{bmatrix} y_{1}\\ y_{2}\\ y_{3}\\ \vdots\\ y_{n} \end{bmatrix}=\begin{bmatrix} 1& x_{1,1}&\ldots&x_{1,p}\\ 1& x_{2,1}&\ldots&x_{2,p}\\ 1& x_{3,1}&\ldots&x_{3,p}\\ \vdots&\vdots&\vdots&\\ 1& x_{n,1}&\ldots&x_{n,p}\\ \end{bmatrix}\begin{bmatrix} \beta_{0}\\ \beta_{1}\\ \vdots\\ \beta_{p}\\ \end{bmatrix}+\begin{bmatrix} \epsilon_{1}\\ \epsilon_{2}\\ \epsilon_{3}\\ \vdots\\ \epsilon_{n} \end{bmatrix}\]

Linear regression: ordinary least square (OLS)

OLS target

\[\min_{\hat{\beta}_{0},\hat{\beta}_{1},\ldots,\hat{\beta}_{p}}\sum_{i=1}^{n}\left(y_{i}-\hat{\beta}_{0}-\sum_{j=1}^{p}x_{i,j}\hat{\beta}_{j}\right)^{2}=\min_{\hat{\beta}_{0},\hat{\beta}_{1},\ldots,\hat{\beta}_{p}}RSS\]

OLS target (algebraic) \[ \min_{\hat{\beta}} ({\bf y}-{\bf X}{\bf \hat{\beta}})^{\sf T}({\bf y}-{\bf X}{\bf \hat{\beta}})\]

Linear regression: ordinary least square (OLS) solution

OLS target
\[\min_{\hat{\beta}}({\bf y}-{\bf X}{\bf \hat{\beta}})^{\sf T}({\bf y}-{\bf X}{\bf \hat{\beta}})=\min_{\hat{\beta}}\left({\bf y}^{\sf T}{\bf y}-{\bf y}^{\sf T}{\bf X} {\bf \hat{\beta}}-{\bf{\hat{\beta}}}^{\sf T}{\bf X}^{\sf T}{\bf y}+{\bf {\hat{\beta}}}^{\sf T}{\bf X}^{\sf T}{\bf X} {\bf \hat{\beta}}\right)\]

First order conditions

\[\partial_{\hat{\beta}}\left({\bf y}^{\sf T}{\bf y}-{\bf y}^{\sf T}{\bf X} {\bf \hat{\beta}}-{\bf{\hat{\beta}}}^{\sf T}{\bf X}^{\sf T}{\bf y}+{\bf {\hat{\beta}}}^{\sf T}{\bf X}^{\sf T}{\bf X} {\bf \hat{\beta}}\right)=0\]

Since \(\partial_{\bf\hat{\beta}}\left({\bf y}^{\sf T}{\bf X} {\bf \hat{\beta}}\right)=\partial_{\bf\hat{\beta}}\left({ \bf {\hat{\beta}}}^{\sf T}{\bf X}^{\sf T}{\bf y}\right)={\bf X}^{\sf T}{\bf y}\) it follows that

\[\partial_{\hat{\beta}}RSS=-2{\bf X}^{\sf T}{\bf y}+2{\bf X}^{\sf T}{\bf X} {\bf \hat{\beta}}=0\]

And the OLS solution is

\[ \hat{\beta} = ({\bf X}^{\sf T}{\bf X})^{-1}{\bf X}^{\sf T}{\bf y} \]

Distributional assumptions

Assuming \(\epsilon \sim N(0,\sigma^{2}{\bf I})\) and non-stochastic predictors, it follows that

\[E[{\bf y}]= E[{\bf X}\beta+\epsilon]={\bf X}\beta \ \ \text{ and } \ \ var[{\bf y}]= \sigma^{2}{\bf I}\]

Distribution of linear combinations

\[ \text{if } \quad {\bf U}\sim N({\bf \mu},{\bf\Sigma}) \quad \text{ and } \quad {\bf V} = {\bf c}+{\bf D}{\bf U} \quad \text{then} \quad {\bf V}\sim N({\bf c}+{\bf D}\mu,{\bf D}{\bf \Sigma}{\bf D}^{\sf T})\]

Distribution of \(\hat{\beta}\)

\(\hat{\beta} = ({\bf X}^{\sf T}{\bf X})^{-1}{\bf X}^{\sf T}{\bf y}\) is a linear combination of \(\bf y\) , with \({\bf c} = 0\) and \({\bf D} = ({\bf X}^{\sf T}{\bf X})^{-1}{\bf X}^{\sf T}\) , then

  • \(E[\hat{\beta}]={\bf D} E[y] = {\bf D}{\bf X}\beta= ({\bf X}^{\sf T}{\bf X})^{-1}{\bf X}^{\sf T}{\bf X}\beta = \beta\)

  • \(var(\hat{\beta})= {\bf D} \sigma^{2}{\bf D}^{\sf T}=({\bf X}^{\sf T}{\bf X})^{-1}{\bf X}^{\sf T}\sigma^{2}[({\bf X}^{\sf T}{\bf X})^{-1}{\bf X}^{\sf T}]^{\sf T} = \sigma^{2}({\bf X}^{\sf T}{\bf X})^{-1}{\bf X}^{\sf T}{\bf X}({\bf X}^{\sf T}{\bf X})^{-1}=\sigma^{2}({\bf X}^{\sf T}{\bf X})^{-1}\) (both results are checked numerically in the sketch below)
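
Both results can be checked numerically on the Advertising data; a minimal sketch, with \(\sigma^{2}\) replaced by its \(RSS/(n-p-1)\) estimate, as \(\texttt{lm()}\) does:

Code
# matrix OLS on the Advertising data, checked against lm()
X = model.matrix(~ TV + radio + newspaper, data = adv_data)
y = adv_data$sales
beta_hat = solve(t(X) %*% X) %*% t(X) %*% y       # (X'X)^{-1} X'y
fit = lm(sales ~ TV + radio + newspaper, data = adv_data)
cbind(beta_hat, coef(fit))                        # same coefficients
s2 = sum(residuals(fit)^2) / (nrow(X) - ncol(X))  # estimate of sigma^2
vcov_hat = s2 * solve(t(X) %*% X)                 # sigma^2 (X'X)^{-1}
max(abs(vcov_hat - vcov(fit)))                    # ~ 0: same covariance matrix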

Advertising data: multiple linear regression

At this stage, there is no pre-processing and no test-set evaluation: the regression model is simply fit on the whole data set

pre-processing: define a \(\texttt{recipe}\)

Code
adv_recipe = recipe(sales~., data=adv_data)

model specification: define a \(\texttt{parsnip}\) model

Code
adv_model = linear_reg(mode="regression", engine="lm")

define the \(\texttt{workflow}\)

Code
adv_wflow = workflow() |> 
  add_recipe(recipe=adv_recipe) |> 
  add_model(adv_model)

fit the model

Code
adv_fit = adv_wflow |> 
  fit(data=adv_data)

Advertising data: back to single models

medium term estimate std.error statistic p.value
TV (Intercept) 7.0326 0.4578 15.3603 0.0000
TV budget 0.0475 0.0027 17.6676 0.0000
newspaper (Intercept) 12.3514 0.6214 19.8761 0.0000
newspaper budget 0.0547 0.0166 3.2996 0.0011
radio (Intercept) 9.3116 0.5629 16.5422 0.0000
radio budget 0.2025 0.0204 9.9208 0.0000

Advertising data: single vs multiple regression

Code
avd_lm_models_nest  |> unnest(model_params) |> 
  select(medium,term:p.value) |>
  filter(term!="(Intercept)") |> arrange(medium) |> 
  kbl(digits = 4) |> kable_styling(font_size = 12) |> 
  column_spec(c(3,6),bold=TRUE) |> 
  row_spec(1,background = "#D9FDEC") |> 
  row_spec(2,background = "#CAF6FC") |> 
  row_spec(3,background = "#FDB5BA") 
medium term estimate std.error statistic p.value
TV budget 0.0475 0.0027 17.6676 0.0000
newspaper budget 0.0547 0.0166 3.2996 0.0011
radio budget 0.2025 0.0204 9.9208 0.0000
Code
adv_fit |> tidy() |> 
  filter(term!="(Intercept)") |> arrange(term) |> 
  kbl(digits = 4) |> kable_styling(font_size = 12) |> 
  column_spec(c(2,5),bold=TRUE) |> 
  row_spec(1,background = "#D9FDEC") |> 
  row_spec(2,background = "#CAF6FC") |> 
  row_spec(3,background = "#FDB5BA") 
term estimate std.error statistic p.value
TV 0.0458 0.0014 32.8086 0.0000
newspaper -0.0010 0.0059 -0.1767 0.8599
radio 0.1885 0.0086 21.8935 0.0000

 

 

why these contradictory results?

Advertising data: single vs multiple regression

Code
library("corrr")
adv_data |> correlate()  |>  network_plot()
  • \(\texttt{newspaper}\) is correlated with \(\texttt{radio}\), and the latter has a significant effect on \(\texttt{sales}\) in the multiple regression model

  • that is why, in the single model, \(\texttt{newspaper}\) appears to have a significant effect on \(\texttt{sales}\) (\(\texttt{radio}\) is ignored)

regression by successive orthogonalisations

computing multiple regression coefficients by means of single regressions

Algebraic formalization of single regression

Consider the data to be centered, \(\bar{y}=\bar{x}=0\): this means that \(\beta_{0}=0\) (no-intercept model), so \({\bf y}=\beta_{1}{\bf x}+\epsilon\)

The \(\hat{\beta}_{1}\) estimator

\[\hat{\beta}_{1}=\frac{\sum_{i}^{n}{x_{i}y_{i}}}{\sum_{i}^{n}{x_{i}^{2}}}=\frac{{\bf x}^{\sf T}{\bf y}}{{\bf x}^{\sf T}{\bf x}}=\frac{\langle {\bf x},{\bf y}\rangle}{\langle {\bf x},{\bf x}\rangle}\]

\({\langle {\bf a},{\bf b}\rangle}\) is the inner product \({\bf a}^{\sf T}{\bf b}\).

Consider the two predictors model \(y = \beta_{1}X_{1}+\beta_{2}X_{2}+\epsilon\)

Note: if the two predictors are such that \({\bf x}_{1}^{\sf T}{\bf x}_{2} = 0\), then \(\hat{\beta}_{1}\) and \(\hat{\beta}_{2}\) can be computed, equivalently, by

  • fitting the multiple regression \(y = \beta_{1}X_{1}+\beta_{2}X_{2}+\epsilon\)

  • fitting the single regressions \(y = \beta_{1}X_{1}+\epsilon\) and \(y = \beta_{2}X_{2}+\epsilon\)

the orthogonal predictors case

Check the previous claim, given the predictors matrix

\[{\bf X}=\left[{ \begin{bmatrix} \\ \\ {\bf x}_{1} \\ \\ \\ \end{bmatrix} \begin{bmatrix} \\ \\ {\bf x}_{2} \\ \\ \\ \end{bmatrix}} \right]\]

and that

\[{\bf \hat{\beta}}=\left({\bf X}^{\sf T}{\bf X}\right)^{-1}{\bf X}^{\sf T}{\bf y} \rightarrow \left({\bf X}^{\sf T}{\bf X}\right)\hat{\beta} - {\bf X}^{\sf T}{\bf y}={\bf X}^{\sf T}\left({\bf X}\hat{\beta} - {\bf y}\right)=0\]
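
The claim can also be checked numerically; a minimal sketch with two simulated, exactly orthogonal (and centered) predictors, where single and multiple regression estimates coincide:

Code
set.seed(42)
x1 = rnorm(100); x1 = x1 - mean(x1)            # centered predictor
x2 = residuals(lm(rnorm(100) ~ x1))            # orthogonal to x1 (and mean zero) by construction
y  = 2 * x1 - 3 * x2 + rnorm(100, sd = .5)
y  = y - mean(y)                               # centered response, no-intercept models below
coef(lm(y ~ x1 + x2 - 1))                      # multiple regression
c(coef(lm(y ~ x1 - 1)), coef(lm(y ~ x2 - 1)))  # single regressions: same estimates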

the non-orthogonal predictors case

In real data, the predictors are essentially never all pair-wise orthogonal.

  • single and multiple regression coefficients estimates will differ

  • multiple regression coefficient estimates can still be obtained via single regressions, through…

 

 

successive orthogonalizations

Regression via successive orthogonalizations

Consider the four predictors model (with no intercept)

\[y=\beta_{1}X_{1}+\beta_{2}X_{2}+\beta_{3}X_{3}+\beta_{4}X_{4}+\epsilon\]

step 1

  • set \({\bf z}_{1}={\bf x}_{1}\) and fit \({\bf x}_{2}=\beta_{2|1}{\bf z}_{1}+\epsilon\)

  • compute \(\hat{\beta}_{2|1}=\frac{\langle {\bf z}_{1},{\bf x}_{2}\rangle}{\langle {\bf z}_{1},{\bf z}_{1}\rangle}\)

  • compute \({\bf z}_{2}={\bf x}_{2}-\hat{\beta}_{2|1}{\bf z}_{1}\)

Regression via successive orthogonalizations

Consider the four predictors model (with no intercept)

\[y=\beta_{1}X_{1}+\beta_{2}X_{2}+\beta_{3}X_{3}+\beta_{4}X_{4}+\epsilon\]

Compute the value for \(\hat{\beta}_{4}\) using successive orthogonalizations

step 2

  • fit \({\bf x}_{3}=\beta_{3|1}{\bf z}_{1}+\epsilon\) and \({\bf x}_{3}=\beta_{3|2}{\bf z}_{2}+\epsilon\)

  • compute \(\hat{\beta}_{3|1}=\frac{\langle {\bf z}_{1},{\bf x}_{3}\rangle}{\langle {\bf z}_{1},{\bf z}_{1}\rangle}\) and \(\hat{\beta}_{3|2}=\frac{\langle {\bf z}_{2},{\bf x}_{3}\rangle}{\langle {\bf z}_{2},{\bf z}_{2}\rangle}\)

  • compute \({\bf z}_{3}={\bf x}_{3}-\hat{\beta}_{3|1}{\bf z}_{1}-\hat{\beta}_{3|2}{\bf z}_{2}\)

Regression via successive orthogonalizations

Consider the four predictors model (with no intercept)

\[y=\beta_{1}X_{1}+\beta_{2}X_{2}+\beta_{3}X_{3}+\beta_{4}X_{4}+\epsilon\]

Compute the value for \(\hat{\beta}_{4}\) using successive orthogonalizations

step 3

  • fit \({\bf x}_{4}=\beta_{4|1}{\bf z}_{1}+\epsilon\) , \({\bf x}_{4}=\beta_{4|2}{\bf z}_{2}+\epsilon\) and \({\bf x}_{4}=\beta_{4|3}{\bf z}_{3}+\epsilon\)

  • compute \(\hat{\beta}_{4|1}=\frac{\langle {\bf z}_{1},{\bf x}_{4}\rangle}{\langle {\bf z}_{1},{\bf z}_{1}\rangle}\) , \(\hat{\beta}_{4|2}=\frac{\langle {\bf z}_{2},{\bf x}_{4}\rangle}{\langle {\bf z}_{2},{\bf z}_{2}\rangle}\) and \(\hat{\beta}_{4|3}=\frac{\langle {\bf z}_{3},{\bf x}_{4}\rangle}{\langle {\bf z}_{3},{\bf z}_{3}\rangle}\)

  • compute \({\bf z}_{4}={\bf x}_{4}-\hat{\beta}_{4|1}{\bf z}_{1}-\hat{\beta}_{4|2}{\bf z}_{2}-\hat{\beta}_{4|3}{\bf z}_{3}\)

Regression via successive orthogonalizations

Consider the four predictors model (with no intercept)

\[y=\beta_{1}X_{1}+\beta_{2}X_{2}+\beta_{3}X_{3}+\beta_{4}X_{4}+\epsilon\]

Compute the value for \(\hat{\beta}_{4}\) using successive orthogonalizations

step 4

  • fit \({\bf y}=\beta_{4}{\bf z}_{4}+\epsilon\)

  • compute \(\hat{\beta}_{4}=\frac{\langle {\bf z}_{4},{\bf y}\rangle}{\langle {\bf z}_{4},{\bf z}_{4}\rangle}\) (the whole procedure is sketched in code below)
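
The whole procedure is easy to code; a minimal sketch on the (centered) Advertising data, computing the coefficient of the last predictor entered (\(\texttt{newspaper}\)) and checking it against the multiple regression:

Code
# successive orthogonalisations: beta_hat of the last predictor entered matches the multiple regression
ctr  = function(v) v - mean(v)                                       # center a vector
proj = function(z, v) as.numeric(crossprod(z, v) / crossprod(z, z))  # <z, v> / <z, z>
y  = ctr(adv_data$sales)
x1 = ctr(adv_data$TV); x2 = ctr(adv_data$radio); x3 = ctr(adv_data$newspaper)
z1 = x1
z2 = x2 - proj(z1, x2) * z1
z3 = x3 - proj(z1, x3) * z1 - proj(z2, x3) * z2
proj(z3, y)                                                          # beta_hat for newspaper
coef(lm(sales ~ TV + radio + newspaper, data = adv_data))["newspaper"]  # same value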

Regression via successive orthogonalizations: a further look

  • The residuals vector \({\bf e}=\left({\bf y}-{\bf \hat{y}}\right)\) is orthogonal to the predictor \({\bf x}\) , that is \(\left({\bf y}-{\bf \hat{y}}\right)^{\sf T}{\bf x}=0\)

\[\begin{split} ({\bf y}-{\bf \hat{y}})^{\sf T}{\bf x}&={\bf y}^{\sf T}{\bf x}-{\bf \hat{y}}^{\sf T}{\bf x} \\ &={\bf y}^{\sf T}{\bf x} - \underbrace{({\bf x} ({\bf x}^{\sf T}{\bf x})^{-1}{\bf x}^{\sf T}{\bf y})^{\sf T}}_{\bf \hat{y}^{\sf T}}{\bf x}= \\ &={\bf y}^{\sf T}{\bf x}- {\bf y}^{\sf T}{\bf x} ({\bf x}^{\sf T}{\bf x})^{-1}{\bf x}^{\sf T}{\bf x}= {\bf y}^{\sf T}{\bf x}- {\bf y}^{\sf T}{\bf x}=0 \end{split}\]

  • recall that \({\bf z}_{2}={\bf x}_{2}-\hat{\beta}_{2|1}{\bf z}_{1}\); then \({\bf z}_{2}\) is orthogonal to \({\bf z}_{1}\), and since \({\bf z}_{1}={\bf x}_{1}\), \({\bf z}_{2}\) is orthogonal to \({\bf x}_{1}\), too.

  • \({\bf z}_{2}\) is \({\bf x}_{2}\) adjusted to be orthogonal to \({\bf z}_{1}\) ( \({\bf x}_{1}\) ).

  • \({\bf z}_{3}\) is \({\bf x}_{3}\) adjusted to be orthogonal to \({\bf z}_{2}\) and to \({\bf z}_{1}\).

  • \({\bf z}_{4}\) is \({\bf x}_{4}\) adjusted to be orthogonal to \({\bf z}_{3}\), \({\bf z}_{2}\) and to \({\bf z}_{1}\).

the multiple regression \(\hat{\beta}_{j}\) can be obtained via the single regression \({\bf y}=\hat{\beta}_{j}{\bf z}_{j}+\epsilon\), where \({\bf z}_{j}\) is \({\bf x}_{j}\) adjusted with respect to the other predictors.

Regression via successive orthogonalizations: some considerations

the multiple regression-based \(\hat{\beta}_{j}\) can be obtained via the single regression \({\bf y}=\hat{\beta}_{j}{\bf z}_{j}+\epsilon\)

  • If the predictors are pair-wise orthogonal, the single regression coefficients of predictor \(j\) on residual \(i\) are \(\hat{\beta}_{j|i}=0\), so \({\bf z}_{j}={\bf x}_{j}, j= 1,\ldots,p\).

  • So \(\hat{\beta}_{j}=\frac{\langle {\bf z}_{j}, {\bf y}\rangle}{\langle {\bf z}_{j}, {\bf z}_{j}\rangle}=\frac{\langle {\bf x}_{j}, {\bf y}\rangle}{\langle {\bf x}_{j}, {\bf x}_{j}\rangle}\) , just a single regression of \({\bf y}\) on \({\bf x}_{j}\)

  • if \({\bf x}_{j}\) is highly correlated with one or more of the other predictors, the squared norm \(\|{\bf z}_{j}\|^{2}\approx 0\)

    • this makes sense, since the predictor in question does not add any new information about \({\bf y}\)
  • as \(\|{\bf z}_{j}\|^{2} = \langle {\bf z}_{j}, {\bf z}_{j}\rangle\approx 0\), the variability of the OLS estimator explodes, since \(SE(\hat{\beta}_{j})=\frac{\sigma}{\sqrt{\|{\bf z}_{j}\|^{2}}}\) (a small simulation below illustrates this)
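
A small simulation makes the last point concrete; a sketch with two nearly collinear predictors, whose standard errors blow up compared to the independent case:

Code
set.seed(7)
x1        = rnorm(200)
x2_indep  = rnorm(200)                 # unrelated to x1
x2_collin = x1 + rnorm(200, sd = .01)  # almost a copy of x1
y_indep   = 1 + 2 * x1 + x2_indep  + rnorm(200)
y_collin  = 1 + 2 * x1 + x2_collin + rnorm(200)
summary(lm(y_indep  ~ x1 + x2_indep ))$coefficients[, "Std. Error"]
summary(lm(y_collin ~ x1 + x2_collin))$coefficients[, "Std. Error"]  # much larger SEs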

Interactions

Code
library(scatterplot3d)

adv_fit_update = adv_wflow |> update_recipe(adv_recipe|> step_rm("newspaper"))|> 
  fit(data=adv_data)
 
data3d=adv_fit_update |> extract_fit_engine() |> augment(data=adv_data) |> 
  mutate(und_ov_est = ifelse(.std.resid<0,"indianred","blue")) |> select(TV, radio, sales, und_ov_est)

plot_3d = scatterplot3d(x=data3d$TV,y=data3d$radio,z=data3d$sales,xlab="TV",ylab="radio",zlab="sales",color=data3d$und_ov_est,pch=16,box=FALSE,grid=TRUE)
plot_3d$plane3d(adv_fit_update |> extract_fit_engine(),draw_polygon=TRUE,polygon_args = list(col = rgb(.1,.6,.4,.25)))

intermission: more stuff you already know

Goodness-of-fit

Global test statistic F

testing blocks of predictors

autocorrelation

heteroscedasticity

outliers and high leverage

multicollinearity

end of intermission

estimate the model performance (on the test set)

resampling methods

the credit data set

want: predict balance as a response

Code
library(janitor)
data(Credit,package = "ISLR2")
credit=Credit |> clean_names() 
credit |> slice_sample(n=12) |> kbl() |> kable_styling(font_size = 10)
income limit rating cards age education own student married region balance
63.931 5728 435 3 28 14 Yes No Yes East 581
28.144 1567 142 3 51 10 No No Yes South 0
24.230 4756 351 2 64 15 Yes No Yes South 594
43.540 2906 232 4 69 11 No No No South 0
76.782 5977 429 4 44 12 No No Yes West 548
44.473 3500 257 3 81 16 Yes No No East 8
36.362 5183 376 3 49 15 No No Yes East 654
23.989 4523 338 4 31 15 No No No South 601
22.379 3965 292 2 34 14 Yes No Yes West 384
53.217 4943 362 2 46 16 Yes No Yes West 382
10.503 2923 232 3 25 18 Yes No Yes East 191
26.067 3388 266 4 74 17 Yes No Yes East 155

a (random) smaller version of the credit data set

Code
set.seed(1234)
credit_small = credit |> select(balance,sample(names(credit)[-11],3)) 
credit_small |> slice_sample(n=12) |> kbl() |> kable_styling(font_size = 10)
balance region education age
47 West 17 57
0 West 10 24
912 East 13 49
0 West 9 62
155 East 17 74
1587 South 16 56
637 East 19 44
1176 East 9 52
732 South 12 66
391 South 11 45
0 South 15 75
772 South 18 48

validation approaches

train/test


Randomly select a proportion (e.g. .75) of observations for training, the rest is for testing

validation set

Select a proportion (e.g. .8) of observations to train the model, the rest is for validation

cross-validation

  • Leave-one-out cross-validation: all observations but one are for training, the one left is for validation. The procedure is iterated on the observations, until each observation is used for validation.

  • K-fold cross-validation: K-1 folds are for training, the K\(^{th}\) fold is for validation. The procedure is iterated until each fold is used for validation.

  • the CV estimate of the evaluation metric is the average over the \(n\) (or \(K\)) iterations (a hand-rolled sketch follows)
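
To make the averaging explicit, a hand-rolled K-fold sketch on \(\texttt{credit_small}\) (purely illustrative: the \(\texttt{rsample}\)-based workflow used below does the same bookkeeping, with stratification):

Code
# hand-rolled 5-fold CV estimate of the RMSE for balance ~ .
set.seed(123)
k     = 5
folds = sample(rep(1:k, length.out = nrow(credit_small)))  # random fold assignment
fold_rmse = sapply(1:k, function(i) {
  fit   = lm(balance ~ ., data = credit_small[folds != i, ])
  preds = predict(fit, newdata = credit_small[folds == i, ])
  sqrt(mean((credit_small$balance[folds == i] - preds)^2))
})
mean(fold_rmse)   # the CV estimate is the average over the K folds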

Validation-set to assess model performance

train/test split

Code
first_split = initial_split(data = credit_small,prop=.75,strata = balance)
cred_tr = training(first_split)
cred_test = testing(first_split)

analysis/validate split

Code
val_split = validation_split(data = cred_tr, prop=.8,strata = balance)

split and resample objects: printing shows just minimal info

Code
first_split
<Training/Testing/Total>
<299/101/400>
Code
val_split
# Validation Set Split (0.8/0.2)  using stratification 
# A tibble: 1 × 2
  splits           id        
  <list>           <chr>     
1 <split [239/60]> validation
Code
val_split$splits[[1]]
<Training/Validation/Total>
<239/60/299>

Validation-set to assess model performance

split objects: they are lists

Code
glimpse(first_split)
List of 4
 $ data  :'data.frame': 400 obs. of  4 variables:
  ..$ balance  : num [1:400] 333 903 580 964 331 ...
  ..$ region   : Factor w/ 3 levels "East","South",..: 2 3 3 3 2 2 1 3 2 1 ...
  ..$ education: num [1:400] 11 15 11 11 16 10 12 9 13 19 ...
  ..$ age      : num [1:400] 34 82 71 36 68 77 37 87 66 41 ...
 $ in_id : int [1:299] 12 16 17 23 25 32 34 35 41 49 ...
 $ out_id: logi NA
 $ id    : tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
  ..$ id: chr "Resample1"
 - attr(*, "class")= chr [1:3] "initial_split" "mc_split" "rsplit"
Code
str(val_split)
v_splt [1 × 2] (S3: validation_split/rset/tbl_df/tbl/data.frame)
 $ splits:List of 1
  ..$ :List of 4
  .. ..$ data  :'data.frame':   299 obs. of  4 variables:
  .. .. ..$ balance  : num [1:299] 0 0 0 0 0 0 0 0 50 0 ...
  .. .. ..$ region   : Factor w/ 3 levels "East","South",..: 2 1 1 1 2 3 2 3 1 3 ...
  .. .. ..$ education: num [1:299] 16 15 17 10 15 16 10 14 14 15 ...
  .. .. ..$ age      : num [1:299] 64 57 73 61 57 43 30 25 54 72 ...
  .. ..$ in_id : int [1:239] 3 4 5 8 10 11 12 13 14 15 ...
  .. ..$ out_id: logi NA
  .. ..$ id    : tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
  .. .. ..$ id: chr "validation"
  .. ..- attr(*, "class")= chr [1:2] "val_split" "rsplit"
 $ id    : chr "validation"
 - attr(*, "prop")= num 0.8
 - attr(*, "strata")= chr "balance"
 - attr(*, "breaks")= num 4
 - attr(*, "pool")= num 0.1
 - attr(*, "fingerprint")= chr "95463bd733510d7274291db8f6792543"

Validation-set to assess model performance

Note: training()/testing() and analysis()/assessment() are equivalent helper functions to extract the data from the split object…

Code
training(first_split) |> slice(1:6) |> kbl() |> kable_styling(font_size = 8)
balance region education age
0 South 16 64
0 East 15 57
0 East 17 73
0 East 10 61
0 South 15 57
0 West 16 43
Code
first_split$data[first_split$in_id,] |> slice(1:6) |> kbl() |> kable_styling(font_size = 8)
balance region education age
0 South 16 64
0 East 15 57
0 East 17 73
0 East 10 61
0 South 15 57
0 West 16 43
Code
testing(first_split) |> slice(1:6) |> kbl() |> kable_styling(font_size = 8)
balance region education age
333 South 11 34
903 West 15 82
1151 South 10 77
204 West 7 57
1081 South 9 49
891 West 9 28
Code
first_split$data[-first_split$in_id,] |> slice(1:6) |> kbl() |> kable_styling(font_size = 8)
balance region education age
333 South 11 34
903 West 15 82
1151 South 10 77
204 West 7 57
1081 South 9 49
891 West 9 28

Validation-set to assess model performance

Note: training()/testing() and analysis()/assessment() are equivalent helper functions to extract the data from the split object…

Code
analysis(first_split) |> slice(1:6) |> kbl() |> kable_styling(font_size = 8)
balance region education age
0 South 16 64
0 East 15 57
0 East 17 73
0 East 10 61
0 South 15 57
0 West 16 43
Code
first_split$data[first_split$in_id,] |> slice(1:6) |> kbl() |> kable_styling(font_size = 8)
balance region education age
0 South 16 64
0 East 15 57
0 East 17 73
0 East 10 61
0 South 15 57
0 West 16 43
Code
assessment(first_split) |> slice(1:6) |> kbl() |> kable_styling(font_size = 8)
balance region education age
333 South 11 34
903 West 15 82
1151 South 10 77
204 West 7 57
1081 South 9 49
891 West 9 28
Code
first_split$data[-first_split$in_id,] |> slice(1:6) |> kbl() |> kable_styling(font_size = 8)
balance region education age
333 South 11 34
903 West 15 82
1151 South 10 77
204 West 7 57
1081 South 9 49
891 West 9 28

Validation-set to assess model performance

pre-processing

Code
cred_an = analysis(val_split$splits[[1]]) 
cred_prep = recipe(balance~., data = cred_an)

model-spec

Code
cred_mod = linear_reg(mode="regression", engine="lm")

workflow setup

Code
cred_wflow = workflow() |> 
  add_recipe(cred_prep) |> 
  add_model(cred_mod)

Validation-set to assess model performance

model fit

Code
cred_fit = cred_wflow |> 
  fit(cred_an)

assessment set predictions

Code
cred_as = assessment(val_split$splits[[1]])
cred_preds = cred_fit |> augment(cred_as) |> select(balance,.pred)
cred_preds |> slice_sample(n=3) |> kbl() |> kable_styling(font_size=9)
balance .pred
429 547.6457
843 512.9616
1120 501.6879

assessment metric: RMSE

Code
cred_preds |> rmse(truth=balance, estimate=.pred) |> 
  kbl() |> kable_styling(font_size=9)
.metric .estimator .estimate
rmse standard 455.6267
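
The same number follows directly from the definition of the RMSE; a one-line check on the prediction tibble:

Code
# RMSE by hand: square root of the mean squared prediction error on the assessment set
cred_preds |> summarise(rmse_manual = sqrt(mean((balance - .pred)^2)))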

Validation-set to assess model performance: a shortcut

RMSE computation

Code
cred_wflow |> 
  fit_resamples(val_split) |> 
  collect_metrics()
# Resampling results
# Validation Set Split (0.8/0.2)  using stratification 
# A tibble: 1 × 4
  splits           id         .metrics         .notes          
  <list>           <chr>      <list>           <list>          
1 <split [239/60]> validation <tibble [2 × 4]> <tibble [0 × 3]>

Validation-set to assess model performance: shortcut

RMSE computation

Code
cred_wflow |> 
  fit_resamples(val_split) |> 
  collect_metrics()
.metric .estimator mean n std_err .config
rmse standard 455.6267 1 NA Preprocessor1_Model1

v-fold cross validation to assess model performance

data split: folds

Code
cred_folds = vfold_cv(cred_tr,v=5,strata=balance)

pre-processing

Code
cred_prep = recipe(balance~., data = cred_tr)

model-spec

Code
cred_mod = linear_reg(mode="regression", engine="lm")

workflow setup

Code
cred_wflow = workflow() |> 
  add_recipe(cred_prep) |> 
  add_model(cred_mod)

v-fold cross validation to assess model performance

model fit

Code
cred_fit_vfolds = cred_wflow |> 
  fit_resamples(cred_folds)
# Resampling results
# 5-fold cross-validation using stratification 
# A tibble: 5 × 4
  splits           id    .metrics         .notes          
  <list>           <chr> <list>           <list>          
1 <split [239/60]> Fold1 <tibble [2 × 4]> <tibble [0 × 3]>
2 <split [239/60]> Fold2 <tibble [2 × 4]> <tibble [0 × 3]>
3 <split [239/60]> Fold3 <tibble [2 × 4]> <tibble [0 × 3]>
4 <split [239/60]> Fold4 <tibble [2 × 4]> <tibble [0 × 3]>
5 <split [240/59]> Fold5 <tibble [2 × 4]> <tibble [0 × 3]>

cross-validated RMSE

Code
cred_fit_vfolds |> 
  collect_metrics() |> filter(.metric == "rmse") |> 
  kbl() |> kable_styling(font_size = 9)
.metric .estimator mean n std_err .config
rmse standard 456.6134 5 11.20367 Preprocessor1_Model1

Leave One Out cross validation to assess model performance

Note: leave-one-out cross-validation is somewhat deprecated; there is a specific function \(\texttt{loo_cv}\) in \(\texttt{rsample}\), but its support downstream is limited.

LOO-CV splits

Code
cred_loo = loo_cv(cred_tr)

The resulting object is not supported by \(\texttt{fit_resamples}\): a workaround is to still use \(\texttt{vfold_cv}\) and set as many folds as there are rows in the training set

Code
cred_loo = vfold_cv(cred_tr,v=nrow(cred_tr))

Leave One Out cross validation to assess model performance

model fit

Code
cred_fit_loo = cred_wflow |>
  fit_resamples(cred_loo)

cross-validated RMSE

Code
cred_fit_loo |>
  collect_metrics() |> filter(.metric == "rmse") |>
  kbl() |> kable_styling(font_size = 9)
.metric .estimator mean n std_err .config
rmse standard 389.9147 299 14.094 Preprocessor1_Model1

Hyperparameter tuning

tiny example on polynomial regression: balance vs rating

Consider a tiny example of a tuning process:

Code
credit_tiny = credit |> select(rating, balance)
credit_split = initial_split(credit_tiny,prop=3/4)
credit_tiny |> mutate(train_test = replace(rep("test", n()),credit_split$in_id,"train")
                       ) |> 
  ggplot(aes(x = rating,y = balance,color=train_test)) + geom_point() + theme_minimal()

tiny example on polynomial regression: balance vs rating

  • define the CV-folds on the training set
Code
tiny_cred_folds = vfold_cv(training(credit_split),5)
  • the degree of the polynomial is set at the recipe level. Since it is a hyperparameter, we set it to \(\texttt{tune()}\), a placeholder
Code
tiny_rec = recipe(formula = balance~rating, data = training(credit_split)) |> 
  step_poly(rating, degree=tune())

The model specification does not change, so \(\texttt{cred_mod}\) can still be used. The workflow is then

Code
tiny_wflow = workflow() |> add_recipe(tiny_rec) |> add_model(cred_mod)

It remains to specify the hyperparameter grid and to use the function \(\texttt{tune_grid()}\), which does all the work

Code
hparm_grid = tibble(degree=1:10)
tiny_cred_tuning = tiny_wflow |> 
  tune_grid(resamples = tiny_cred_folds,
            grid = hparm_grid,
            control = control_grid(save_pred = TRUE)
  )

tiny example on polynomial regression: balance vs rating

Check the results

Code
tiny_cred_tuning |>  collect_metrics() |> filter(.metric=="rmse") |> slice_min(mean) |> 
  kbl() |> kable_styling(font_size = 10)
degree .metric .estimator mean n std_err .config
7 rmse standard 215.0154 5 8.148388 Preprocessor07_Model1
Code
autoplot(tiny_cred_tuning,metric = "rmse")+theme_minimal()

tiny example on polynomial regression: balance vs rating

Select the best performing model

Code
tiny_cred_final_mod = tiny_cred_tuning |> select_best(metric = "rmse")

Finalize the workflow (meaning: tell the workflow to pick the best model)

Code
final_wflow=tiny_wflow |> finalize_workflow(tiny_cred_final_mod)

Final fit and evaluation of the model: fit on the training set, evaluate on the test set. Use \(\texttt{credit_split}\), the result of the initial split (\(\texttt{initial_split()}\)).

Code
tiny_final_fit = final_wflow |> 
  last_fit(credit_split)

tiny_final_fit |> extract_fit_parsnip()
parsnip model object


Call:
stats::lm(formula = ..y ~ ., data = data)

Coefficients:
  (Intercept)  rating_poly_1  rating_poly_2  rating_poly_3  rating_poly_4  
        537.0         6923.0         -794.7         -115.0         1181.9  
rating_poly_5  rating_poly_6  rating_poly_7  
       -724.6         -415.9          401.8  
Code
tiny_final_fit |> collect_metrics()
# A tibble: 2 × 4
  .metric .estimator .estimate .config             
  <chr>   <chr>          <dbl> <chr>               
1 rmse    standard     229.    Preprocessor1_Model1
2 rsq     standard       0.732 Preprocessor1_Model1

tiny example on polynomial regression: balance vs rating

Code
tiny_final_fit |> extract_fit_parsnip()
parsnip model object


Call:
stats::lm(formula = ..y ~ ., data = data)

Coefficients:
  (Intercept)  rating_poly_1  rating_poly_2  rating_poly_3  rating_poly_4  
        537.0         6923.0         -794.7         -115.0         1181.9  
rating_poly_5  rating_poly_6  rating_poly_7  
       -724.6         -415.9          401.8  
Code
tiny_final_fit |> collect_metrics() |> kbl() |> kable_styling(font_size = 8)
.metric .estimator .estimate .config
rmse standard 229.2952944 Preprocessor1_Model1
rsq standard 0.7324706 Preprocessor1_Model1
