Statistical Learning
The general formula is \(Y=f(X)+\epsilon\)
Estimate \(f(.)\) via \(\hat{f}(.)\)
If \(Y\) is numeric, and \(X\) is a single predictor, assuming
\(f(.)\) is linear, the model is
\[y_{i} = \beta_{0} + \beta_{1} x_{i} + \epsilon_{i}\]
pros
only the two parameters \(\hat{\beta}_{0}\) and \(\hat{\beta}_{1}\) need to be estimated to obtain \(\hat{f}(.)\)
straightforward interpretation, also in the multiple-predictor case.
cons
if the true \(f(.)\) is not linear, the linearity assumption introduces bias
and leads to poor predictions

library(tidyverse) # tibble, dplyr and ggplot2 are used throughout
set.seed(1234)
x = runif(1000,min=0,max = 10)
eps = rnorm(1000,0,1.5)
b_0 = 1
b_1 = .75
y = b_0 + b_1 * x + eps
pop_xy = tibble(y=y, x=x)
sam_xy = pop_xy |>
slice(sample(1:1000,30))
sam_xy |>
ggplot(aes(x = x,y = y)) +
geom_point(size=3,alpha=.5,
color="darkblue")+
xlim(c(0,11))+ylim(c(0,11))+
theme_minimal()+
geom_smooth(method="lm",
color="darkgreen")
set.seed(1234)
x = runif(1000,min=0,max = 10)
eps = rnorm(1000,0,1.5)
b_0 = 1
b_1 = 3
b_2 = -.5
y = b_0 + b_1 * x + b_2 * (x^2) + eps
pop_xy = tibble(y=y, x=x)
sam_xy = pop_xy |>
slice(sample(1:1000,30))
sam_xy |>
ggplot(aes(x = x,y = y)) +
geom_point(size=3,alpha=.5,
color="darkblue")+
theme_minimal() +
geom_smooth(method="lm",
color="red")+
geom_smooth(method = "loess",
color="darkgreen")
Non-linear approaches are often extensions of the linear ones; here they are presented for a single predictor \(X\).
Polynomial regression
it is the simplest way to estimate \(f()\) via a non-linear \(\hat{f}()\): the predictor \(X\) is simply replaced by its powers up to degree \(d\), \(y_{i}=\beta_{0}+\beta_{1}x_{i}+\beta_{2}x_{i}^{2}+\ldots+\beta_{d}x_{i}^{d}+\epsilon_{i}\).
Step functions
the range of \(X\) is split into \(K\) regions \(R_{j}\), \(j=1,\ldots, K\); within each \(R_{j}\), the \(y_{i}\)'s of the observations with \(x_{i}\in R_{j}\) are averaged. The resulting \(\hat{f}()\) is piece-wise constant.
Regression splines
a combination of polynomial regression and step functions: in each of the \(K\) regions \(R_{j}\) a polynomial regression is fit; the result is not just a piece-wise polynomial \(\hat{f}()\), since continuity at the knots must also be ensured.
Smoothing splines
Similar to regression splines, but the model flexibility is controlled through a smoothness penalty.
Local Regression
In this case \(\hat{f}()\) is fitted on non-disjoint regions of \(X\).
Pre-processing (recipe) specification
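A minimal sketch of how this recipe could look, assuming the data are in w_vs_a (with wage and age only); raw = TRUE yields the plain powers shown in the table below.

library(tidymodels) # assumed loaded once for recipes, parsnip and workflows
poly_recipe = w_vs_a |>
  recipe(wage ~ age) |>
  # raw powers of age up to degree 4
  step_poly(age, degree = 4, options = list(raw = TRUE))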
Pre-processed data
wage | age_poly_1 | age_poly_2 | age_poly_3 | age_poly_4 |
---|---|---|---|---|
75.04 | 18 | 324 | 5832 | 104976 |
70.48 | 24 | 576 | 13824 | 331776 |
130.98 | 45 | 2025 | 91125 | 4100625 |
154.69 | 43 | 1849 | 79507 | 3418801 |
75.04 | 50 | 2500 | 125000 | 6250000 |
127.12 | 54 | 2916 | 157464 | 8503056 |
model specification
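A minimal sketch of the model specification; linear_model_spec is the name used by the workflows later in these notes.

linear_model_spec = linear_reg() |>
  set_engine("lm")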
fit the model: indicate the data to apply the workflow to
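A sketch of the corresponding workflow and fit, assuming the recipe and model specification sketched above; poly_fit is the name used below by augment() and predict().

poly_wf = workflow() |>
  add_recipe(poly_recipe) |>
  add_model(linear_model_spec)
poly_fit = poly_wf |>
  fit(data = w_vs_a)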
To obtain point-wise confidence intervals, \(SE\left[\hat{f}(x_{0})\right]\) is needed.
Each value \(x_{0}\) is associated with the vector \({\bf \ell}_{0}^{\sf T}=[1,x_{0},x_{0}^{2},x_{0}^{3},x_{0}^{4}]\), then
\[SE\left[\hat{f}(x_{0})\right] = \left[{\bf \ell}^{\sf T}_{0} {\bf C}{\bf \ell}_{0} \right]^{1/2}\]
where \({\bf C}\) is the covariance matrix of the coefficient estimators
\[ {\bf C}= \begin{bmatrix} SE({\hat{\beta}_{0}})^{2} & cov(\hat{\beta}_{0},\hat{\beta}_{1}) & \ldots & \ldots & cov(\hat{\beta}_{0},\hat{\beta}_{4}) \\ cov(\hat{\beta}_{1},\hat{\beta}_{0}) &SE({\hat{\beta}_{1}})^{2}& \ldots& \ldots & cov(\hat{\beta}_{1},\hat{\beta}_{4}) \\ \vdots &\vdots & SE({\hat{\beta}_{2}})^{2} & \vdots & \vdots \\ \vdots &\vdots & \vdots & \ldots & \vdots \\ cov(\hat{\beta}_{4},\hat{\beta}_{0}) &\ldots & \ldots & \ldots &SE({\hat{\beta}_{4}})^{2}\\ \end{bmatrix} \]
\(\hat{f}(x_{0})\) is a linear combination of the \(\hat{\beta}\)'s, with \({\bf D} = \ell_{0}^{\sf T}\), so
\[SE(\hat{f}(x_{0}))=[var(\hat{f}(x_{0}))]^{1/2}=[{\bf D}\hat{\sigma}^{2}({\bf X}^{\sf T}{\bf X})^{-1}{\bf D}^{\sf T}]^{1/2} = [\ell_{0}^{\sf T}{\bf C}\ell_{0}]^{1/2}\]
Therefore the 95% confidence interval is
\[\hat{f}(x_{0})\pm z_{\alpha/2}SE(\hat{f}(x_{0}))=\hat{f}(x_{0})\pm 1.96SE(\hat{f}(x_{0}))\]
 | (Intercept) | age_poly_1 | age_poly_2 | age_poly_3 | age_poly_4 |
---|---|---|---|---|---|
(Intercept) | 3604.8469 | -351.2230 | 12.1027 | -0.1760 | 9e-04 |
age_poly_1 | -351.2230 | 34.6538 | -1.2072 | 0.0177 | -1e-04 |
age_poly_2 | 12.1027 | -1.2072 | 0.0425 | -0.0006 | 0e+00 |
age_poly_3 | -0.1760 | 0.0177 | -0.0006 | 0.0000 | 0e+00 |
age_poly_4 | 0.0009 | -0.0001 | 0.0000 | 0.0000 | 0e+00 |
Create a grid of values for age
Create a matrix with \(\ell_{0}\)’s on rows
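A sketch of both steps, assuming a grid from the minimum to the maximum observed age with step 0.01 (as in the tables below); ell_0s is an illustrative name.

grid_age = tibble(age = seq(min(w_vs_a$age), max(w_vs_a$age), by = .01))
# one row per grid value: ell_0' = [1, x0, x0^2, x0^3, x0^4]
ell_0s = cbind(int        = 1,
               age_poly_1 = grid_age$age,
               age_poly_2 = grid_age$age^2,
               age_poly_3 = grid_age$age^3,
               age_poly_4 = grid_age$age^4)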
int | age_poly_1 | age_poly_2 | age_poly_3 | age_poly_4 |
---|---|---|---|---|
1 | 18.00 | 324.00 | 5832.00 | 104976.0 |
1 | 18.01 | 324.36 | 5841.73 | 105209.5 |
1 | 18.02 | 324.72 | 5851.46 | 105443.3 |
1 | 18.03 | 325.08 | 5861.21 | 105677.6 |
1 | 18.04 | 325.44 | 5870.97 | 105912.2 |
Computation using \({\bf \ell}_{0}^{\sf T}{\bf C}{\bf \ell}_{0}\)
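A sketch of the computation: C is taken from the lm object inside the fitted workflow, and the manual SEs are compared with the se.fit returned by predict(); x_0s_preds, used in the plot code below, is assumed to be built this way.

C = vcov(poly_fit |> extract_fit_engine()) # coefficient covariance matrix
# [ell_0' C ell_0]^(1/2), one value per grid point
lt_C_l = sqrt(rowSums((ell_0s %*% C) * ell_0s))
# se.fit from the underlying lm, fed with the pre-processed polynomial columns
x_0s_preds = predict(poly_fit |> extract_fit_engine(),
                     newdata = as_tibble(ell_0s[, -1]),
                     se.fit = TRUE)
tibble(lt_C_l = lt_C_l, pred_se = x_0s_preds$se.fit)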
lt_C_l | pred_se |
---|---|
5.30 | 5.30 |
5.29 | 5.29 |
5.28 | 5.28 |
5.27 | 5.27 |
5.26 | 5.26 |
5.25 | 5.25 |
poly_fun = grid_age |>
mutate(preds=x_0s_preds$fit,
low_band =x_0s_preds$fit - 1.96*x_0s_preds$se.fit,
up_band =x_0s_preds$fit + 1.96*x_0s_preds$se.fit
)
w_vs_a |>
ggplot(aes(x=age,y=wage))+
geom_point(alpha=.5,color="orange")+
geom_point(data=poly_fun,aes(x=age,y=preds),size=.05,color="blue")+
geom_ribbon(data=poly_fun,
aes(x=age,ymin=low_band,ymax=up_band)
,fill="green",color="green",alpha=.25,inherit.aes = F)+
theme_minimal()
poly_fun = poly_fit |>
augment(new_data = grid_age) |>
mutate(poly_fit |>
predict(new_data=grid_age,
type="conf_int"))
w_vs_a |>
ggplot(aes(x=age,y=wage))+
geom_point(alpha=.5,color="orange")+
geom_point(data=poly_fun,aes(x=age,y=.pred),size=.05,color="blue")+
geom_ribbon(data=poly_fun,
aes(x=age,ymin=.pred_lower,ymax=.pred_upper)
,fill="blue",color="blue",alpha=.25,inherit.aes = F)+
theme_minimal()
\(P(Y=1|X) = \frac{e^{\beta_{0}+\beta_{1}X+\beta_{2}X^{2}+\beta_{3}X^{3}+\beta_{4}X^{4}}}{1+e^{\beta_{0}+\beta_{1}X+\beta_{2}X^{2}+\beta_{3}X^{3}+\beta_{4}X^{4}}}\)
Same dataset, but turn \(\texttt{wage}\) into a binary variable: if \(\texttt{wage}_{i}>250\) then \(\texttt{high\_wage}_{i}=TRUE\).
Pre-processing: insert the \(\texttt{wage}\) transformation into \(\texttt{high\_wage}\) in the recipe specification. Note: specify \(\texttt{skip=TRUE}\) to ignore this step for new data (which contains \(\texttt{age}\) only).
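A rough sketch of such a recipe; the step names below are an assumption, not necessarily the exact code used here. The outcome high_wage is created only at training time (skip = TRUE), so new data containing age alone can still be processed.

high_wage_recipe = w_vs_a |>
  recipe(wage ~ age) |>
  # create the binary outcome; skipped when baking new data
  step_mutate(high_wage = factor(wage > 250), role = "outcome", skip = TRUE) |>
  # drop the numeric wage so that high_wage is the only outcome
  step_rm(wage, skip = TRUE) |>
  step_poly(age, degree = 4, options = list(raw = TRUE))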
Model specification
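A minimal sketch (logistic_model_spec is an illustrative name).

logistic_model_spec = logistic_reg() |>
  set_engine("glm") |>
  set_mode("classification")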
Create the workflow and fit the model
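A sketch assuming the recipe and model specification sketched above; the tidy() output corresponds to the coefficient table below.

high_wage_wf = workflow() |>
  add_recipe(high_wage_recipe) |>
  add_model(logistic_model_spec)
high_wage_fit = high_wage_wf |>
  fit(data = w_vs_a)
high_wage_fit |>
  extract_fit_parsnip() |>
  tidy()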
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -109.553 | 47.627 | -2.300 | 0.021 |
age_poly_1 | 8.995 | 4.184 | 2.150 | 0.032 |
age_poly_2 | -0.282 | 0.135 | -2.082 | 0.037 |
age_poly_3 | 0.004 | 0.002 | 2.023 | 0.043 |
age_poly_4 | 0.000 | 0.000 | -1.968 | 0.049 |
Compute the fit via \(\texttt{augment()}\), and the confidence bands via \(\texttt{predict()}\).
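A sketch mirroring the regression case above; fit_data is the name used in the plot code below.

fit_data = high_wage_fit |>
  augment(new_data = grid_age) |>
  mutate(high_wage_fit |>
           predict(new_data = grid_age, type = "conf_int"))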
age | .pred_TRUE | .pred_lower_TRUE | .pred_upper_TRUE |
---|---|---|---|
18.00 | 0 | 0 | 0.00166 |
18.01 | 0 | 0 | 0.00167 |
18.02 | 0 | 0 | 0.00167 |
18.03 | 0 | 0 | 0.00168 |
18.04 | 0 | 0 | 0.00168 |
18.05 | 0 | 0 | 0.00169 |
18.06 | 0 | 0 | 0.00169 |
18.07 | 0 | 0 | 0.00170 |
18.08 | 0 | 0 | 0.00170 |
18.09 | 0 | 0 | 0.00171 |
w_vs_a |>
mutate(high_wage=wage>250,
high_wage=as.numeric(high_wage)/4) |>
ggplot(aes(x=age,y=high_wage,
color=wage>250))+
geom_point()+
theme_minimal() +
scale_y_continuous(limits = c(0,.3))+
geom_point(data=fit_data,
aes(x=age,y=.pred_TRUE),
size=.05,color="blue",
inherit.aes = F)+
geom_ribbon(data=fit_data,
aes(x=age,
ymin=.pred_lower_TRUE,
ymax=.pred_upper_TRUE),
fill="green",
color="green",
alpha=.25,
inherit.aes = F)
Global fit
Polynomial regression, like linear regression, fits \(f(.)\) with \(\hat{f}(.)\) using all the observed data at once.
Local fit
Split the \(X\) range into \(K\) intervals, and fit the function within each interval
Given \(K\) breaks \(c_{1},c_{2},\ldots,c_{K}\), one defines the following indicator (step) functions \(C_{j}\), \(j = 0,\ldots,K\). For each observation \(i\), exactly one of the step functions equals \(1\) and all the others equal \(0\), depending on the interval \(x_{i}\) belongs to.
\[\begin{eqnarray} C_{0}(X) &=& I( X<c_{1})\\ C_{1}(X) &=& I( c_{1}\leq X<c_{2})\\ C_{2}(X) &=& I( c_{2}\leq X<c_{3})\\ \ldots \ldots \\ C_{K-1}(X) &=& I( c_{K-1}\leq X<c_{K})\\ C_{K}(X) &=& I( c_{K}\leq X)\\ \end{eqnarray}\]
Pre-processing: split the predictor age into three intervals
step_regions_plot = w_vs_a |>
ggplot(aes(x=age, y=wage))+
theme_minimal()+
annotate(geom="rect",xmin=min(w_vs_a$age),xmax = 35,ymin=0,ymax=max(w_vs_a$wage),fill="indianred",alpha=.25)+
annotate(geom="rect",xmin=35,xmax=50,ymin=0,ymax=max(w_vs_a$wage),fill="lightgreen",alpha=.25)+
annotate(geom="rect",xmin=50,xmax=65,ymin=0,ymax=max(w_vs_a$wage),fill="lightblue",alpha=.25)+
annotate(geom="rect",xmin=65,xmax=80,ymin=0,ymax=max(w_vs_a$wage),fill="indianred",alpha=.25)+
geom_point(alpha=.5)
Pre-processing: split the predictor age into three intervals
Define the workflow and fit the model
Compute preds and confidence band
Update the plot with the step function
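A rough sketch of the whole sequence; the break points (35, 50, 65) are read off the shaded regions of the plot above, and the object names (step_recipe, step_fit, step_fun_data) are illustrative. linear_model_spec, grid_age and step_regions_plot are defined earlier.

step_recipe = w_vs_a |>
  recipe(wage ~ age) |>
  step_cut(age, breaks = c(35, 50, 65))  # piecewise-constant regions
step_fit = workflow() |>
  add_recipe(step_recipe) |>
  add_model(linear_model_spec) |>
  fit(data = w_vs_a)
step_fun_data = step_fit |>
  augment(new_data = grid_age) |>
  mutate(step_fit |> predict(new_data = grid_age, type = "conf_int"))
step_regions_plot +
  geom_point(data = step_fun_data, aes(x = age, y = .pred),
             size = .1, color = "blue", inherit.aes = FALSE) +
  geom_ribbon(data = step_fun_data,
              aes(x = age, ymin = .pred_lower, ymax = .pred_upper),
              fill = "lightblue", alpha = .25, inherit.aes = FALSE)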
Split the predictor \(\texttt{age}\) into three intervals and use it to predict whether the wage is higher than 250
Workflow specification and fit
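A rough sketch combining the step_cut pre-processing with the logistic specification from the previous sections; class_fit_data is the name used in the plot below, and the recipe steps are again an assumption rather than the exact code.

class_step_recipe = w_vs_a |>
  recipe(wage ~ age) |>
  step_mutate(high_wage = factor(wage > 250), role = "outcome", skip = TRUE) |>
  step_rm(wage, skip = TRUE) |>
  step_cut(age, breaks = c(35, 50, 65))
class_step_fit = workflow() |>
  add_recipe(class_step_recipe) |>
  add_model(logistic_model_spec) |>
  fit(data = w_vs_a)
class_fit_data = class_step_fit |>
  augment(new_data = grid_age) |>
  mutate(class_step_fit |> predict(new_data = grid_age, type = "conf_int"))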
final plot
w_vs_a |>
mutate(high_wage=as.factor(wage>250)) |>
ggplot(aes(x=age, y=as.double(wage>250)/4,color=high_wage))+ylab("wage > 250")+
geom_point(alpha=.25,size=5)+theme_minimal()+
geom_point(data = class_fit_data, aes(x=age, y=.pred_TRUE), size=.1,color="blue",inherit.aes = FALSE)+
geom_ribbon(data = class_fit_data, aes(x=age, ymin=.pred_lower_TRUE,ymax=.pred_upper_TRUE),
fill="lightblue",inherit.aes = FALSE,alpha=.25)+
annotate(geom="rect",xmin=min(w_vs_a$age),xmax = 35,ymin=0,ymax=.25,fill="indianred",alpha=.25)+
annotate(geom="rect",xmin=35,xmax=50,ymin=0,ymax=.25,fill="lightgreen",alpha=.25)+
annotate(geom="rect",xmin=50,xmax=65,ymin=0,ymax=.25,fill="lightblue",alpha=.25)+
annotate(geom="rect",xmin=65,xmax=80,ymin=0,ymax=.25,fill="indianred",alpha=.25)
A basis function \(b_{j}(.)\) , \(j=1,\ldots,K\) takes as input the predictor \(X\).
\[y_{i} = \beta_{0}+ \beta_{1}b_{1}(x_{i}) + \beta_{2}b_{2}(x_{i}) + \ldots + \beta_{K}b_{K}(x_{i})+\epsilon_{i}\]
Suppose again you are regressing \(\texttt{wage}\) on \(\texttt{age}\), and that the range of \(\texttt{age}\) is split once, at 50: \(R_{1}:\texttt{age}<50\) and \(R_{2}:\texttt{age}\geq 50\).
Note that the split value \(\texttt{age}=50\) is referred to as a knot.
w_vs_a |>
ggplot(aes(x=age,y=wage)) + theme_minimal()+
geom_point(color="darkgreen")+
geom_vline(aes(xintercept=50),color="orange")+
annotate(geom="rect",xmin=min(w_vs_a$age),xmax=50,ymin=0,ymax=max(w_vs_a$wage),fill="indianred",alpha=.25)+
annotate(geom="rect",xmin=50,xmax=max(w_vs_a$age),ymin=0,ymax=max(w_vs_a$wage),fill="lightblue",alpha=.25)+
annotate(geom="text",x=22,y=220,label="paste(italic(R),1)",parse=TRUE,size=15,alpha=.3)+
annotate(geom="text",x=75,y=220,label="paste(italic(R),2)",parse=TRUE,size=15,alpha=.3)->two_regions_plot
two_regions_plot
Turn the data into a list, according to the knot: then fit a linear model on each element of the list
Compute the predictions on a grid of age values, to draw the regression lines and bands
Within each of the regions defined by the split, fit a degree 3 polynomial regression
\[\texttt{wage}_{i,x_{i}\in R_{1}}= \beta_{0,R_{1}} + \beta_{1,R_{1}} \texttt{age}_{i}+\beta_{2,R_{1}} \texttt{age}^{2}_{i} +\beta_{3,R_{1}} \texttt{age}^{3}_{i}+\epsilon_{i}\]
\[\texttt{wage}_{i,x_{i}\in R_{2}}= \beta_{0,R_{2}} + \beta_{1,R_{2}} \texttt{age}_{i}+\beta_{2,R_{2}} \texttt{age}^{2}_{i} +\beta_{3,R_{2}} \texttt{age}^{3}_{i}+\epsilon_{i}\]
Fit different polynomial (degree 3) regressions within each region, independent from one another
two_poly_recipe = w_vs_a |>
mutate(age=as.double(age),
region=age>50) |>
recipe(wage~age) |>
step_poly(age, degree = 3)
w_vs_a_list_2 = w_vs_a |>
mutate(region=age>50,
age=as.double(age)) |>
group_by(region) |>
group_map(~workflow() |>
# add_recipe(two_poly_recipe) |>
add_formula(wage ~ poly(age,3)) |>
add_model(linear_model_spec) |>
fit(data = .x)
)
Combine predictions and confidence bands for each region
Let \(\hat{f}_{1}(.)\) and \(\hat{f}_{2}(.)\) be the two degree 3 polynomials fitted in regions \(R_{1}\) and \(R_{2}\), and let \(\xi\) be the only knot considered
Single fitting curve \(\hat{f}(.)\): conditions
continuity at the knot: \(\hat{f}_{1}(\xi) = \hat{f}_{2}(\xi)\)
\(\hat{f}(.)\) is supposed to be smooth at the knot, too:
\(\hat{f}^{\prime}_{1}(\xi) = \hat{f}^{\prime}_{2}(\xi)\)
\(\hat{f}^{\prime\prime}_{1}(\xi) = \hat{f}^{\prime\prime}_{2}(\xi)\)
Basis functions are useful to represent regression splines, including piecewise polynomials.
Suppose there are \(K\) knots
cubic spline via basis functions
\[y_{i}= \beta_{0} +\beta_{1} b_{1}(x_{i}) +\beta_{2} b_{2}(x_{i}) +\ldots +\beta_{K} b_{K}(x_{i})+\ldots+\beta_{K+3} b_{K+3}(x_{i}) +\epsilon_{i}\] for a degree 3 (cubic) polynomial and \(K\) knots, \(K+3\) basis functions \(b_{j}(.)\), \(j=1,\ldots,K+3\), are needed.
choice of \(b_{j}(.)\)’s
for each term of the polynomial, the choice is obvious: \(b_{j}(x_{i})=x_{i}^{j}\) , \(j=1,\ldots,d\); for cubic splines \(d=3\)
at each knot \(\xi\), a truncated power basis function \(b_{j}(x_{i})= h(x_{i},\xi)\) is used
\(h(x_{i},\xi) = (x_{i}-\xi)_{+}^{3} = (x_i-\xi)^{3} \text{ if } x_{i} > \xi, \text{ otherwise } (x_{i}-\xi)_{+}^{3} = 0\)
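As a small illustration, the truncated power basis can be written as a one-line R function:

# truncated power basis: (x - xi)^3 if x > xi, 0 otherwise
h = function(x, xi) pmax(x - xi, 0)^3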
Suppose there is a single knot at \(\xi\)
\[f(x_{i})= \beta_{0} +\beta_{1} x_{i} +\beta_{2} x_{i}^{2}+\beta_{3} x_{i}^{3} +\beta_{4} (x_{i}-\xi)^{3}_{+}\]
The goal is to fit \(f_{1}(.)\) and \(f_{2}(.)\), two cubic polynomials in \(R_{1}(x_{i}\leq \xi)\) and \(R_{2}(x_{i}> \xi)\), that is
\[f_{1}(x_{i})= a_{1} +b_{1} x_{i} +c_{1} x_{i}^{2}+d_{1}x_{i}^{3}\] \[f_{2}(x_{i})= a_{2} +b_{2} x_{i} +c_{2} x_{i}^{2}+d_{2}x_{i}^{3}\]
First, re-write the coefficients \(a_{r},b_{r},c_{r},d_{r}, r=1,2\) in terms of \(\beta_{j}\)’s such that
\(f_{1}(x_{i})=f(x_{i})\) and \(f_{2}(x_{i})=f(x_{i})\)
Then, check the following conditions that ensure continuity and smoothness at the knot:
\(f_{1}(\xi)=f_{2}(\xi)\)
\(f^{'}_{1}(\xi)=f^{'}_{2}(\xi)\)
\(f^{''}_{1}(\xi)=f^{''}_{2}(\xi)\)
\(f_{1}(x_{i})=f(x_{i})\)
Define \(a_{1},b_{1},c_{1},d_{1}\) ; note that, in \(R_{1}(x_{i}\leq \xi)\) , \((x_{i}-\xi)^{3}_{+}=0\) , therefore
\[f_{1}(x_{i}) = f(x_{i})\] \[a_{1} + b_{1} x_{i} + c_{1} x_{i}^{2} + d_{1}x_{i}^{3} = \beta_{0} + \beta_{1} x_{i} + \beta_{2} x_{i}^{2}+ \beta_{3} x_{i}^{3}\]
and \(a_{1}=\beta_{0}\) , \(b_{1}=\beta_{1}\) , \(c_{1}=\beta_{2}\) , \(d_{1}=\beta_{3}\) .
\(f_{2}(x_{i})=f(x_{i})\)
Define \(a_{2},b_{2},c_{2},d_{2}\); note that, in \(R_{2}(x_{i}> \xi)\),
\[(x_{i}-\xi)^{3}_{+}=(x_{i}-\xi)^{3}\] \[f_{2}(x_{i}) = \beta_{0} + \beta_{1} x_{i} + \beta_{2} x_{i}^{2}+ \beta_{3} x_{i}^{3}+\beta_{4}(x_{i}-\xi)^{3}\] re-writing \[(x_{i}-\xi)^{3}=x_{i}^{3} + 3x_{i}\xi^{2}-3x_{i}^{2}\xi - \xi^{3}\]
and plugging it in
\[a_{2} + b_{2} x_{i} + c_{2} x_{i}^{2} + d_{2}x_{i}^{3} = \beta_{0} + \beta_{1} x_{i} + \beta_{2} x_{i}^{2}+ \beta_{3} x_{i}^{3}+\beta_{4}x_{i}^{3} + 3\beta_{4}x_{i}\xi^{2}-3\beta_{4}x_{i}^{2}\xi - \beta_{4}\xi^{3}\] then \(a_{2}=\beta_{0} - \beta_{4} \xi^{3}\) , \(b_{2}=\beta_{1}+ 3 \beta_{4} \xi^{2}\) , \(c_{2}=\beta_{2} -3 \beta_{4} \xi\) and \(d_{2}=\beta_{3} + \beta_{4}\)
\[a_{1} +b_{1} \xi +c_{1} \xi^{2}+d_{1} \xi^{3} = a_{2} +b_{2} \xi +c_{2} \xi^{2}+d_{2} \xi^{3}\] recalling the definitions of \(a_{1},\ldots, d_{2}\), the above is re-written as
\[\begin{align} \beta_{0} +\beta_{1} \xi +\beta_{2} \xi^{2}+\beta_{3} \xi^{3} &= \underbrace{\beta_{0} - \beta_{4}\xi^{3}}_{a_{2}} +\underbrace{(\beta_{1}+ 3 \beta_{4} \xi^{2})}_{b_{2}} \xi + \underbrace{(\beta_{2} -3 \beta_{4} \xi)}_{c_{2}} \xi^{2}+\underbrace{(\beta_{3} + \beta_{4})}_{d_{2}} \xi^{3}=\\ &= \beta_{0} - \beta_{4}\xi^{3}+\beta_{1}\xi+ 3 \beta_{4} \xi^{3}+\beta_{2}\xi^{2} -3 \beta_{4} \xi^{3}+ \beta_{3}\xi^{3} + \beta_{4}\xi^{3}=\\ &= \beta_{0} +\beta_{1}\xi+ \beta_{2}\xi^{2} +\beta_{3}\xi^{3} \end{align}\]
\(f^{'}_{1}(\xi) = b_{1} + 2 c_{1}\xi+3 d_{1}\xi^{2}\) and \(f^{'}_{2}(\xi) = b_{2}+ 2 c_{2}\xi+3 d_{2}\xi^{2}\)
recalling the definitions of \(a_{1},\ldots, d_{2}\), the above is re-written as
\[\begin{align} \beta_{1} +2\beta_{2} \xi+3\beta_{3} \xi^{2} &= \underbrace{(\beta_{1}+ 3 \beta_{4} \xi^{2})}_{b_{2}} + 2\underbrace{(\beta_{2} -3 \beta_{4} \xi)}_{c_{2}} \xi+3\underbrace{(\beta_{3} + \beta_{4})}_{d_{2}} \xi^{2}=\\ &= \beta_{1}+ 3 \beta_{4} \xi^{2} + 2\beta_{2}\xi - 6 \beta_{4} \xi^{2}+ 3\beta_{3}\xi^{2}+3\beta_{4}\xi^{2}=\\ &= \beta_{1} +2\beta_{2} \xi+3\beta_{3} \xi^{2} \end{align}\]
\(f^{''}_{1}(\xi) = 2c_{1}+6 d_{1}\xi\) and \(f^{''}_{2}(\xi) = 2c_{2} + 6 d_{2}\xi\)
recalling the definitions of \(a_{1},\ldots, d_{2}\), the above is re-written as
\[\begin{align} 2\beta_{2} +6\beta_{3} \xi &= + 2\underbrace{(\beta_{2} -3 \beta_{4} \xi)}_{c_{2}}+6\underbrace{(\beta_{3} + \beta_{4})}_{d_{2}} \xi=\\ &= 2\beta_{2} - 6 \beta_{4} \xi+ 6\beta_{3}\xi+6\beta_{4}\xi=\\ &= 2\beta_{2} + 6\beta_{3}\xi \end{align}\]
To fit a cubic spline, one has to specify the basis functions, and then fit a linear model
the basis functions \(b_{j}(.)'s\) are defined in the \(\texttt{recipe}\) (pre-processing)
the model specification and fit do not change
cubic spline with a single knot at \(\texttt{age}=50\)
model specification and fit
curve and bands data
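A minimal sketch, mirroring the step_bs() call used in the experiments further below (single knot at age = 50, cubic basis); the object names match those used later in the code.

cubic_spline_recipe = w_vs_a |>
  recipe(wage ~ age) |>
  step_bs(age, degree = 3, options = list(knots = 50))
cubic_spline_wf = workflow() |>
  add_recipe(cubic_spline_recipe) |>
  add_model(linear_model_spec)
cubic_spline_fit = cubic_spline_wf |>
  fit(data = w_vs_a)
cubic_spline_preds = cubic_spline_fit |>
  augment(new_data = grid_age) |>
  mutate(cubic_spline_fit |> predict(new_data = grid_age, type = "conf_int"))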
tibble(sizes=c(50,100,250,500,1000,2000,3000)) |>
mutate(sizes=as.list(sizes),
training_data = sizes |>
map(~return(w_vs_a |>
slice(sample(1:n(),.x))))
) |>
mutate(grid_ages = sizes |> map(~grid_age)
) |>
mutate(cubic_fit = map2(.x=training_data,.y=grid_ages,
.f=function(x=.x,y=.y){
pre_proc_w_vs_a = x |>
recipe(wage~age) |>
step_bs(age,degree = 3,
options= list( knots = 50));
cubic_spline_wf = workflow() |>
add_recipe(pre_proc_w_vs_a) |>
add_model(linear_model_spec);
cubic_spline_fit = cubic_spline_wf |>
fit(data=x)
return(cubic_spline_fit)}
)
) |>
mutate(cubic_preds = map2(.x=cubic_fit,.y=grid_ages,
.f=function(x=.x,y=.y){
cubic_spline_preds = x |>
augment(new_data=y) |>
mutate(x |>
predict(new_data=y,
type="conf_int"));
return(cubic_spline_preds)
}
)
) |>
mutate(training_size=sizes) -> sizes_exp
Pull the training sets from the list, labeled by size
fit_by_sizes_plot = w_vs_a |>
ggplot(aes(x=age,y=wage))+theme_minimal()+geom_point(size=1,alpha=.4,color="lightgrey")+
geom_point(data=training_points,aes(x=age,y=wage),size=1,alpha=.25,color="green",inherit.aes = FALSE)+
geom_line(data=curves_data,aes(x=age,y=.pred), color="red",inherit.aes = FALSE)+
geom_ribbon(data=curves_data,aes(x=age,ymin=.pred_lower,ymax=.pred_upper),fill="cyan",color="blue",alpha=.2,inherit.aes = FALSE)+
facet_wrap(~training_size,scales = "free")
tibble(degrees_freedom=5:10) |>
mutate(degrees_freedom=as.list(degrees_freedom),
training_data = map(.x=500,.f=~return(w_vs_a |>
slice(sample(1:n(),.x))))
) |>
mutate(grid_ages = degrees_freedom |> map(~grid_age)
) |>
mutate(cubic_fit = pmap(.l=list(x=training_data,
y=grid_ages,
z=degrees_freedom),
.f=function(x,y,z){
pre_proc_w_vs_a = x |>
recipe(wage~age) |>
step_bs(age,degree = 3,
deg_free=z);
cubic_spline_wf = workflow() |>
add_recipe(pre_proc_w_vs_a) |>
add_model(linear_model_spec);
cubic_spline_fit = cubic_spline_wf |>
fit(data=x)
return(cubic_spline_fit)}
)
) |>
mutate(cubic_preds = map2(.x=cubic_fit,.y=grid_ages,
.f=function(x=.x,y=.y){
cubic_spline_preds = x |>
augment(new_data=y) |>
mutate(x |>
predict(new_data=y,
type="conf_int"));
return(cubic_spline_preds)
}
)
)-> knots_exp
Compute the number of knots from the degrees of freedom: \(knots= df-d-1\), where \(d\) is the degree of the polynomial
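A small sketch applying the formula above with \(d=3\); it adds the n_knots label used in the facetted plot below (this mirrors, rather than reproduces, the original code).

knots_exp = knots_exp |>
  mutate(n_knots = map_dbl(degrees_freedom, ~ .x - 3 - 1))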
Pull the training sets from the list, labeled by the number of knots
fit_by_knots_plot = w_vs_a |>
ggplot(aes(x=age,y=wage))+theme_minimal()+geom_point(size=1,alpha=.4,color="lightgrey")+
geom_point(data=training_points,aes(x=age,y=wage),size=1,alpha=.25,color="green",inherit.aes = FALSE)+
geom_line(data=curves_data,aes(x=age,y=.pred), color="red",inherit.aes = FALSE)+
geom_ribbon(data=curves_data,aes(x=age,ymin=.pred_lower,ymax=.pred_upper),fill="cyan",color="blue",alpha=.2,inherit.aes = FALSE)+
facet_wrap(~n_knots,scales = "free")
The higher the number of knots, the higher the flexibility of \(\hat{f}(.)\) (and its variability).
For larger training set sizes, a higher number of knots can be selected (via, e.g., cross-validation).
In the outer range of \(X\) there are usually fewer observations, and splines can have high variance
To tackle this drawback, one can constrain the function to be linear in the boundary regions.
A cubic spline with these additional linearity constraints is called a natural cubic spline.
. . .
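A sketch of how natural_spline_preds (used in the comparison below) could be obtained via step_ns(), which fits a natural cubic spline; w_vs_a_sample and deg_free = 3 are assumptions, and the cubic counterpart is built analogously with step_bs().

natural_spline_fit = workflow() |>
  add_recipe(w_vs_a_sample |>
               recipe(wage ~ age) |>
               step_ns(age, deg_free = 3)) |>
  add_model(linear_model_spec) |>
  fit(data = w_vs_a_sample)
natural_spline_preds = natural_spline_fit |>
  augment(new_data = grid_age) |>
  mutate(natural_spline_fit |> predict(new_data = grid_age, type = "conf_int"))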
cub_vs_nat = rbind(cubic_spline_preds |> mutate(method="cubic"),
natural_spline_preds |> mutate(method="natural"))
w_vs_a |>
ggplot(aes(x=age,y=wage))+theme_minimal()+geom_point(size=1,alpha=.2,color="lightgrey")+
geom_point(data=w_vs_a_sample,aes(x=age,y=wage),size=1,color="lightgreen",inherit.aes = FALSE)+
scale_linetype_manual(values = c("dotted","solid"))+
geom_line(data=cub_vs_nat,aes(x=age,y=.pred, colour = method,linetype=method),inherit.aes = FALSE)+
geom_ribbon(data=cub_vs_nat,aes(x=age,ymin=.pred_lower,ymax=.pred_upper,color=method,linetype=method),alpha=.05,inherit.aes = FALSE)
In regression splines:
- pick \(K\) knots
- define a basis function within each of the resulting regions
- estimate the spline coefficients via least squares
In smoothing splines:
- find \(g(\cdot)\) that minimizes the loss function \[\sum_{i=1}^{n}\left(y_{i}-g(x_{i})\right)^{2}\]
- with no restrictions, one could pick \(g(\cdot)\) so that \(\sum_{i=1}^{n}\left(y_{i}-g(x_{i})\right)^{2}=0\)
- the fitted curve would interpolate the training observations (massive overfitting)
. . .
the minimization problem is \(loss+penalty\), that is \[\sum_{i=1}^{n}\left(y_{i}-g(x_{i})\right)^{2}+ \lambda\int g''(t)^{2} dt\]
- where \(g''(t)\) is the second derivative of \(g(\cdot)\) at \(t\), indicating whether the slope of the function is changing (increasing/decreasing) or not
the rougher \(g(\cdot)\), the larger the value of \(\int g''(t)^{2} dt\);
\(\lambda\) is the penalty parameter that dictates how smooth \(g(\cdot)\) will be
A smoothing spline has \(n\) nominal degrees of freedom since it has \(n\) parameters (equivalent to one knot per training observation).
Because of the roughness penalty, however, the \(n\) parameters are heavily shrunk.
To measure the model complexity/flexibility, the effective degrees of freedom \(df_{\lambda}\) are used instead.
Let \({\bf \hat g}_{\lambda}\) be the vector of fitted values (one per \(x_{i}\), \(i = 1,\ldots, n\)) for a specific value of \(\lambda\); it is obtained as \[{\bf \hat g}_{\lambda} = {\bf S}_{\lambda}{\bf y}\] where \({\bf S}_{\lambda}\) is the smoothing matrix, that is, the smoothing-spline analogue of the hat matrix \({\bf H} = {\bf X}\left({\bf X}^{\sf T}{\bf X}\right)^{-1}{\bf X}^{\sf T}\) of the regression case.
The effective degrees of freedom are then given by \(df_{\lambda}=trace\left({\bf S}_{\lambda}\right)\)
The smoothing parameter \(\lambda\) can be selected via LOOCV, which can be computed very efficiently; in fact, the LOOCV estimate of the test RSS is
\[RSS_{cv}(\lambda)=\sum_{i =1}^{n}{\left(y_{i}-\hat{g}^{(-i)}_{\lambda}(x_{i})\right)^{2}}=\sum_{i =1}^{n}{\left[\frac{y_{i}-\hat{g}_{\lambda}(x_{i})}{1-\left\{{\bf S}_{\lambda}\right\}_{ii}} \right]^{2}}\]
where \(\hat{g}^{(-i)}_{\lambda}(\cdot)\) is the function fitted on all but the \(i^{th}\) training observation, and \(\left\{{\bf S}_{\lambda}\right\}_{ii}\) is the \(i^{th}\) diagonal element of the matrix \({\bf S}_{\lambda}\)
wage_smooth_df_32 <- smooth.spline(x = w_vs_a$age, y = w_vs_a$wage, df = 32)
wage_smooth_df_16 <- smooth.spline(x = w_vs_a$age, y = w_vs_a$wage, df = 16)
wage_smooth_cv <- smooth.spline(x = w_vs_a$age, y = w_vs_a$wage, cv = TRUE)
tibble(wage_smooth_df_32$df,wage_smooth_df_16$df,wage_smooth_cv$df)|> kbl()
wage_smooth_df_32$df | wage_smooth_df_16$df | wage_smooth_cv$df |
---|---|---|
31.99591 | 16.00237 | 6.794596 |
age_grid=seq(from=min(w_vs_a$age),to=max(w_vs_a$age),by=.1)
all_preds = bind_rows(
as_tibble(predict(wage_smooth_df_32, age_grid)) |>
mutate(model = "32 degrees of freedom"),
as_tibble(predict(wage_smooth_df_16, age_grid)) |>
mutate(model = "16 degrees of freedom"),
as_tibble(predict(wage_smooth_cv, age_grid)) |>
mutate(model = "LOOCV")
)
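A sketch of a comparison plot, in the same style as the figures above; predict() on a smooth.spline object returns x and y components, which are the columns of all_preds.

w_vs_a |>
  ggplot(aes(x = age, y = wage)) +
  geom_point(alpha = .2, color = "lightgrey") +
  # one smoothing-spline curve per model label
  geom_line(data = all_preds, aes(x = x, y = y, color = model)) +
  theme_minimal()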
The local regression fit \(\hat{f}(x_{0})\) depends on the span, that is the proportion of points to be considered as neighbors of \(x_{0}\).
In particular, a weighted linear regression is fitted by minimizing \[\sum_{i=1}^{n}{K_{i0}}(y_{i}-\beta_{0}-\beta_{1}x_{i})^{2}\]
where \(K_{i0}\) is a weight proportional to the closeness of the \(x_{i}\)’s to \(x_{0}\): for the neighbors \(K_{i0}>0\), for the non-neighbors \(K_{i0}=0\)
the weighted least squares minimizers \(\hat{\beta_{0}}\) and \(\hat{\beta_{1}}\) are then used to get the estimate of the function at \(x_{0}\)
\[\hat{f}(x_{0})=\hat{\beta_{0}}+ \hat{\beta_{1}}x_{0}\]
However, the parameter that affects \(\hat{f}\) the most is the span: a low value corresponds to a smaller number of neighbors, so \(\hat{f}\) will be more flexible.
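As a small illustration (using stats::loess directly, with illustrative names), two fits with different spans can be compared on the same grid.

# smaller span = fewer neighbors = more flexible fit
loess_02 = loess(wage ~ age, data = w_vs_a, span = .2)
loess_07 = loess(wage ~ age, data = w_vs_a, span = .7)
loess_preds = grid_age |>
  mutate(span_02 = predict(loess_02, newdata = grid_age),
         span_07 = predict(loess_07, newdata = grid_age))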
Generalized additive models (GAMs) extend the standard linear model by allowing non-linear functions of each of the predictors
\[y_{i} = \beta_{0}+f_{1}(x_{i1})+f_{2}(x_{i2})+\ldots+f_{p}(x_{ip})+\epsilon_{i}\]
library(gam)
library(splines)
data(Wage,package = "ISLR2")
wage_gam_ns_fit <- gam(wage ~ ns(year, 4) + ns(age, 5) + education, data = Wage)
d <- preplot(wage_gam_ns_fit)
plot_structure=tibble(plot_name=names(d) , data = d |> map(~tibble(x=.$x,y=.$y,se_y=.$se.y) |>
mutate(y_lower=y-se_y,
y_upper=y+se_y)
)
) |> mutate(plots=map2(.x = data,.y=plot_name,
function(x=.x,y=.y){
if(!is.factor(x$x)){
myplot=x |> ggplot(aes(x)) +
geom_line(aes(y = y), color = "red") +
geom_line(aes(y = y_lower), lty = 2, color = "red") +
geom_line(aes(y = y_upper), lty = 2, color = "red") +
ggtitle(y)+ theme(axis.text.x=element_text(angle=90))
}else{
myplot = x |> ggplot(aes(x)) +
geom_errorbar(aes(y = y, ymin = y_lower, ymax = y_upper),color="red") +
ggtitle(y)+ theme(axis.text.x=element_text(angle=90))
}
}
)
)
wage_gam_smooth_fit <-
gam(wage ~ s(year, 4) + s(age, 5) + education,
data = Wage)
d <- preplot(wage_gam_smooth_fit)
plot_structure=tibble(plot_name=names(d) , data = d |> map(~tibble(x=.$x,y=.$y,se_y=.$se.y) |>
mutate(y_lower=y-se_y,
y_upper=y+se_y)
)
) |> mutate(plots=map2(.x = data,.y=plot_name,
function(x=.x,y=.y){
if(!is.factor(x$x)){
myplot=x |> ggplot(aes(x)) +
geom_line(aes(y = y), color = "blue") +
geom_line(aes(y = y_lower), lty = 2, color = "blue") +
geom_line(aes(y = y_upper), lty = 2, color = "blue") +
ggtitle(y)+ theme(axis.text.x=element_text(angle=90))
}else{
myplot = x |> ggplot(aes(x)) +
geom_errorbar(aes(y = y, ymin = y_lower, ymax = y_upper),color="blue") +
ggtitle(y)+ theme(axis.text.x=element_text(angle=90))
}
}
)
)