Statistical Learning
The general formula is \(Y=f(X)+\epsilon\)
Estimate \(f(.)\) via \(\hat{f}(.)\)
If \(Y\) is numeric, and \(X\) is a single predictor, assuming
\(f(.)\) is linear, the model is
\[y_{i} = \beta_{0} + \beta_{1} x_{i} + \epsilon_{i}\]
pros
only the two parameters \(\hat{\beta}_{0}\) and \(\hat{\beta}_{1}\) need to be estimated to obtain \(\hat{f}(.)\)
straightforward interpretation, also in the multiple-predictor case.
cons
if the true \(f(.)\) is not linear, the linearity assumption introduces bias
and leads to poor predictions

library(tidyverse) # tibble, dplyr and ggplot2 are used throughout
set.seed(1234)
x = runif(1000,min=0,max = 10)
eps = rnorm(1000,0,1.5)
b_0 = 1
b_1 = .75
y = b_0 + b_1 * x + eps
pop_xy = tibble(y=y, x=x)
sam_xy = pop_xy |>
slice(sample(1:1000,30))
sam_xy |>
ggplot(aes(x = x,y = y)) +
geom_point(size=3,alpha=.5,
color="darkblue")+
xlim(c(0,11))+ylim(c(0,11))+
theme_minimal()+
geom_smooth(method="lm",
color="darkgreen")
set.seed(1234)
x = runif(1000,min=0,max = 10)
eps = rnorm(1000,0,1.5)
b_0 = 1
b_1 = 3
b_2 = -.5
y = b_0 + b_1 * x + b_2 * (x^2) + eps
pop_xy = tibble(y=y, x=x)
sam_xy = pop_xy |>
slice(sample(1:1000,30))
sam_xy |>
ggplot(aes(x = x,y = y)) +
geom_point(size=3,alpha=.5,
color="darkblue")+
theme_minimal() +
geom_smooth(method="lm",
color="red")+
geom_smooth(method = "loess",
color="darkgreen")
Non-linear approaches are often extensions of the linear ones; here they are presented for a single predictor \(X\).
Polynomial regression
it is the simplest way to estimate \(f()\) via a non-linear \(\hat{f}()\): the predictor \(X\) is simply replaced by its powers up to degree \(d\), \(y_{i}=\beta_{0}+\beta_{1}x_{i}+\beta_{2}x_{i}^{2}+\ldots+\beta_{d}x_{i}^{d}+\epsilon_{i}\).
Step functions
the range of \(X\) is split into \(K\) regions \(R_{j}\), \(j=1,\ldots, K\); within each \(R_{j}\), the \(y_{i}\)'s of the observations with \(x_{i}\in R_{j}\) are averaged. The resulting \(\hat{f}()\) is piece-wise constant.
Regression splines
a combination of polynomial regression and step functions: in each of the \(K\) regions \(R_{j}\) a polynomial regression is fit; the result is not just a piece-wise polynomial \(\hat{f}()\), since continuity at the knots must also be ensured.
Smoothing splines
Similar to regression splines, but the model flexibility is controlled through a smoothness penalty.
Local Regression
In this case \(\hat{f}()\) is fitted on non-disjoint regions of \(X\).
Pre-processing (recipe) specification
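A minimal sketch of how this recipe could look, assuming the data are in w_vs_a (with wage and age only); raw = TRUE yields the plain powers shown in the table below.

library(tidymodels) # assumed loaded once for recipes, parsnip and workflows
poly_recipe = w_vs_a |>
  recipe(wage ~ age) |>
  # raw powers of age up to degree 4
  step_poly(age, degree = 4, options = list(raw = TRUE))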
Pre-processed data
wage | age_poly_1 | age_poly_2 | age_poly_3 | age_poly_4 |
---|---|---|---|---|
75.04 | 18 | 324 | 5832 | 104976 |
70.48 | 24 | 576 | 13824 | 331776 |
130.98 | 45 | 2025 | 91125 | 4100625 |
154.69 | 43 | 1849 | 79507 | 3418801 |
75.04 | 50 | 2500 | 125000 | 6250000 |
127.12 | 54 | 2916 | 157464 | 8503056 |
model specification
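A minimal sketch of the model specification; linear_model_spec is the name used by the workflows later in these notes.

linear_model_spec = linear_reg() |>
  set_engine("lm")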
fit the model: indicate the data to apply the workflow to
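A sketch of the corresponding workflow and fit, assuming the recipe and model specification sketched above; poly_fit is the name used below by augment() and predict().

poly_wf = workflow() |>
  add_recipe(poly_recipe) |>
  add_model(linear_model_spec)
poly_fit = poly_wf |>
  fit(data = w_vs_a)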
To obtain point-wise confidence intervals, \(SE\left[\hat{f}(x_{0})\right]\) is needed.
Each value \(x_{0}\) is associated with the vector \({\bf \ell}_{0}^{\sf T}=[1,x_{0},x_{0}^{2},x_{0}^{3},x_{0}^{4}]\), then
\[SE\left[\hat{f}(x_{0})\right] = \left[{\bf \ell}^{\sf T}_{0} {\bf C}{\bf \ell}_{0} \right]^{1/2}\]
where \({\bf C}\) is the covariance matrix of the coefficient estimators
\[ {\bf C}= \begin{bmatrix} SE({\hat{\beta}_{0}})^{2} & cov(\hat{\beta}_{0},\hat{\beta}_{1}) & \ldots & \ldots & cov(\hat{\beta}_{0},\hat{\beta}_{4}) \\ cov(\hat{\beta}_{1},\hat{\beta}_{0}) &SE({\hat{\beta}_{1}})^{2}& \ldots& \ldots & cov(\hat{\beta}_{1},\hat{\beta}_{4}) \\ \vdots &\vdots & SE({\hat{\beta}_{2}})^{2} & \vdots & \vdots \\ \vdots &\vdots & \vdots & \ldots & \vdots \\ cov(\hat{\beta}_{4},\hat{\beta}_{0}) &\ldots & \ldots & \ldots &SE({\hat{\beta}_{4}})^{2}\\ \end{bmatrix} \]
\(\hat{f}(x_{0})\) is a linear combination of the \(\hat{\beta}\)'s, with \({\bf D} = \ell_{0}^{\sf T}\), so
\[SE(\hat{f}(x_{0}))=[var(\hat{f}(x_{0}))]^{1/2}=[{\bf D}\hat{\sigma}^{2}({\bf X}^{\sf T}{\bf X})^{-1}{\bf D}^{\sf T}]^{1/2} = [\ell_{0}^{\sf T}{\bf C}\ell_{0}]^{1/2}\]
Therefore the 95% confidence interval is
\[\hat{f}(x_{0})\pm z_{\alpha/2}SE(\hat{f}(x_{0}))=\hat{f}(x_{0})\pm 1.96SE(\hat{f}(x_{0}))\]
 | (Intercept) | age_poly_1 | age_poly_2 | age_poly_3 | age_poly_4 |
---|---|---|---|---|---|
(Intercept) | 3604.8469 | -351.2230 | 12.1027 | -0.1760 | 9e-04 |
age_poly_1 | -351.2230 | 34.6538 | -1.2072 | 0.0177 | -1e-04 |
age_poly_2 | 12.1027 | -1.2072 | 0.0425 | -0.0006 | 0e+00 |
age_poly_3 | -0.1760 | 0.0177 | -0.0006 | 0.0000 | 0e+00 |
age_poly_4 | 0.0009 | -0.0001 | 0.0000 | 0.0000 | 0e+00 |
Create a grid of values for age
Create a matrix with \(\ell_{0}\)’s on rows
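A sketch of both steps, assuming a grid from the minimum to the maximum observed age with step 0.01 (as in the tables below); ell_0s is an illustrative name.

grid_age = tibble(age = seq(min(w_vs_a$age), max(w_vs_a$age), by = .01))
# one row per grid value: ell_0' = [1, x0, x0^2, x0^3, x0^4]
ell_0s = cbind(int        = 1,
               age_poly_1 = grid_age$age,
               age_poly_2 = grid_age$age^2,
               age_poly_3 = grid_age$age^3,
               age_poly_4 = grid_age$age^4)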
int | age_poly_1 | age_poly_2 | age_poly_3 | age_poly_4 |
---|---|---|---|---|
1 | 18.00 | 324.00 | 5832.00 | 104976.0 |
1 | 18.01 | 324.36 | 5841.73 | 105209.5 |
1 | 18.02 | 324.72 | 5851.46 | 105443.3 |
1 | 18.03 | 325.08 | 5861.21 | 105677.6 |
1 | 18.04 | 325.44 | 5870.97 | 105912.2 |
Computation using \({\bf \ell}_{0}^{\sf T}{\bf C}{\bf \ell}_{0}\)
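A sketch of the computation: C is taken from the lm object inside the fitted workflow, and the manual SEs are compared with the se.fit returned by predict(); x_0s_preds, used in the plot code below, is assumed to be built this way.

C = vcov(poly_fit |> extract_fit_engine()) # coefficient covariance matrix
# [ell_0' C ell_0]^(1/2), one value per grid point
lt_C_l = sqrt(rowSums((ell_0s %*% C) * ell_0s))
# se.fit from the underlying lm, fed with the pre-processed polynomial columns
x_0s_preds = predict(poly_fit |> extract_fit_engine(),
                     newdata = as_tibble(ell_0s[, -1]),
                     se.fit = TRUE)
tibble(lt_C_l = lt_C_l, pred_se = x_0s_preds$se.fit)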
lt_C_l | pred_se |
---|---|
5.30 | 5.30 |
5.29 | 5.29 |
5.28 | 5.28 |
5.27 | 5.27 |
5.26 | 5.26 |
5.25 | 5.25 |
poly_fun = grid_age |>
mutate(preds=x_0s_preds$fit,
low_band =x_0s_preds$fit - 1.96*x_0s_preds$se.fit,
up_band =x_0s_preds$fit + 1.96*x_0s_preds$se.fit
)
w_vs_a |>
ggplot(aes(x=age,y=wage))+
geom_point(alpha=.5,color="orange")+
geom_point(data=poly_fun,aes(x=age,y=preds),size=.05,color="blue")+
geom_ribbon(data=poly_fun,
aes(x=age,ymin=low_band,ymax=up_band)
,fill="green",color="green",alpha=.25,inherit.aes = F)+
theme_minimal()
poly_fun = poly_fit |>
augment(new_data = grid_age) |>
mutate(poly_fit |>
predict(new_data=grid_age,
type="conf_int"))
w_vs_a |>
ggplot(aes(x=age,y=wage))+
geom_point(alpha=.5,color="orange")+
geom_point(data=poly_fun,aes(x=age,y=.pred),size=.05,color="blue")+
geom_ribbon(data=poly_fun,
aes(x=age,ymin=.pred_lower,ymax=.pred_upper)
,fill="blue",color="blue",alpha=.25,inherit.aes = F)+
theme_minimal()
\(P(Y=1|X) = \frac{e^{\beta_{0}+\beta_{1}X+\beta_{2}X^{2}+\beta_{3}X^{3}+\beta_{4}X^{4}}}{1+e^{\beta_{0}+\beta_{1}X+\beta_{2}X^{2}+\beta_{3}X^{3}+\beta_{4}X^{4}}}\)
Same dataset, but turn \(\texttt{wage}\) into a binary variable: if \(\texttt{wage}_{i}>250\) then \(\texttt{high\_wage}_{i}=TRUE\).
Pre-processing: insert the \(\texttt{wage}\) transformation into \(\texttt{high\_wage}\) in the recipe specification. Note: specify \(\texttt{skip=TRUE}\) to ignore this step for new data (which contains \(\texttt{age}\) only).
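A rough sketch of such a recipe; the step names below are an assumption, not necessarily the exact code used here. The outcome high_wage is created only at training time (skip = TRUE), so new data containing age alone can still be processed.

high_wage_recipe = w_vs_a |>
  recipe(wage ~ age) |>
  # create the binary outcome; skipped when baking new data
  step_mutate(high_wage = factor(wage > 250), role = "outcome", skip = TRUE) |>
  # drop the numeric wage so that high_wage is the only outcome
  step_rm(wage, skip = TRUE) |>
  step_poly(age, degree = 4, options = list(raw = TRUE))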
Model specification
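A minimal sketch (logistic_model_spec is an illustrative name).

logistic_model_spec = logistic_reg() |>
  set_engine("glm") |>
  set_mode("classification")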
Create the workflow and fit the model
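A sketch assuming the recipe and model specification sketched above; the tidy() output corresponds to the coefficient table below.

high_wage_wf = workflow() |>
  add_recipe(high_wage_recipe) |>
  add_model(logistic_model_spec)
high_wage_fit = high_wage_wf |>
  fit(data = w_vs_a)
high_wage_fit |>
  extract_fit_parsnip() |>
  tidy()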
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -109.553 | 47.627 | -2.300 | 0.021 |
age_poly_1 | 8.995 | 4.184 | 2.150 | 0.032 |
age_poly_2 | -0.282 | 0.135 | -2.082 | 0.037 |
age_poly_3 | 0.004 | 0.002 | 2.023 | 0.043 |
age_poly_4 | 0.000 | 0.000 | -1.968 | 0.049 |
Compute the fit via \(\texttt{augment()}\), and the confidence bands via \(\texttt{predict()}\).
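A sketch mirroring the regression case above; fit_data is the name used in the plot code below.

fit_data = high_wage_fit |>
  augment(new_data = grid_age) |>
  mutate(high_wage_fit |>
           predict(new_data = grid_age, type = "conf_int"))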
age | .pred_TRUE | .pred_lower_TRUE | .pred_upper_TRUE |
---|---|---|---|
18.00 | 0 | 0 | 0.00166 |
18.01 | 0 | 0 | 0.00167 |
18.02 | 0 | 0 | 0.00167 |
18.03 | 0 | 0 | 0.00168 |
18.04 | 0 | 0 | 0.00168 |
18.05 | 0 | 0 | 0.00169 |
18.06 | 0 | 0 | 0.00169 |
18.07 | 0 | 0 | 0.00170 |
18.08 | 0 | 0 | 0.00170 |
18.09 | 0 | 0 | 0.00171 |
w_vs_a |>
mutate(high_wage=wage>250,
high_wage=as.numeric(high_wage)/4) |>
ggplot(aes(x=age,y=high_wage,
color=wage>250))+
geom_point()+
theme_minimal() +
scale_y_continuous(limits = c(0,.3))+
geom_point(data=fit_data,
aes(x=age,y=.pred_TRUE),
size=.05,color="blue",
inherit.aes = F)+
geom_ribbon(data=fit_data,
aes(x=age,
ymin=.pred_lower_TRUE,
ymax=.pred_upper_TRUE),
fill="green",
color="green",
alpha=.25,
inherit.aes = F)
Global fit
Polynomial regression, like linear regression, fits \(f(.)\) with \(\hat{f}(.)\) using all the observed data at once.
Local fit
Split the \(X\) range into \(K\) intervals, and fit the function within each interval
Given \(K\) breaks \(c_{1},c_{2},\ldots,c_{K}\), one defines the following indicator (step) functions \(C_{j}\), \(j = 0,\ldots,K\). For each observation \(i\), exactly one of the step functions equals \(1\) and all the others equal \(0\), depending on the interval \(x_{i}\) belongs to.
\[\begin{eqnarray} C_{0}(X) &=& I( X<c_{1})\\ C_{1}(X) &=& I( c_{1}\leq X<c_{2})\\ C_{2}(X) &=& I( c_{2}\leq X<c_{3})\\ \ldots \ldots \\ C_{K-1}(X) &=& I( c_{K-1}\leq X<c_{K})\\ C_{K}(X) &=& I( c_{K}\leq X)\\ \end{eqnarray}\]
Pre-processing: split the predictor age into three intervals
step_regions_plot = w_vs_a |>
ggplot(aes(x=age, y=wage))+
theme_minimal()+
annotate(geom="rect",xmin=min(w_vs_a$age),xmax = 35,ymin=0,ymax=max(w_vs_a$wage),fill="indianred",alpha=.25)+
annotate(geom="rect",xmin=35,xmax=50,ymin=0,ymax=max(w_vs_a$wage),fill="lightgreen",alpha=.25)+
annotate(geom="rect",xmin=50,xmax=65,ymin=0,ymax=max(w_vs_a$wage),fill="lightblue",alpha=.25)+
annotate(geom="rect",xmin=65,xmax=80,ymin=0,ymax=max(w_vs_a$wage),fill="indianred",alpha=.25)+
geom_point(alpha=.5)
Pre-processing: split the predictor age into three intervals
Define the workflow and fit the model
Compute preds and confidence band
Update the plot with the step function
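A rough sketch of the whole sequence; the break points (35, 50, 65) are read off the shaded regions of the plot above, and the object names (step_recipe, step_fit, step_fun_data) are illustrative. linear_model_spec, grid_age and step_regions_plot are defined earlier.

step_recipe = w_vs_a |>
  recipe(wage ~ age) |>
  step_cut(age, breaks = c(35, 50, 65))  # piecewise-constant regions
step_fit = workflow() |>
  add_recipe(step_recipe) |>
  add_model(linear_model_spec) |>
  fit(data = w_vs_a)
step_fun_data = step_fit |>
  augment(new_data = grid_age) |>
  mutate(step_fit |> predict(new_data = grid_age, type = "conf_int"))
step_regions_plot +
  geom_point(data = step_fun_data, aes(x = age, y = .pred),
             size = .1, color = "blue", inherit.aes = FALSE) +
  geom_ribbon(data = step_fun_data,
              aes(x = age, ymin = .pred_lower, ymax = .pred_upper),
              fill = "lightblue", alpha = .25, inherit.aes = FALSE)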
Split the predictor \(\texttt{age}\) into three intervals and use it to predict whether the wage is higher than 250
Workflow specification and fit
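A rough sketch combining the step_cut pre-processing with the logistic specification from the previous sections; class_fit_data is the name used in the plot below, and the recipe steps are again an assumption rather than the exact code.

class_step_recipe = w_vs_a |>
  recipe(wage ~ age) |>
  step_mutate(high_wage = factor(wage > 250), role = "outcome", skip = TRUE) |>
  step_rm(wage, skip = TRUE) |>
  step_cut(age, breaks = c(35, 50, 65))
class_step_fit = workflow() |>
  add_recipe(class_step_recipe) |>
  add_model(logistic_model_spec) |>
  fit(data = w_vs_a)
class_fit_data = class_step_fit |>
  augment(new_data = grid_age) |>
  mutate(class_step_fit |> predict(new_data = grid_age, type = "conf_int"))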
final plot
w_vs_a |>
mutate(high_wage=as.factor(wage>250)) |>
ggplot(aes(x=age, y=as.double(wage>250)/4,color=high_wage))+ylab("wage > 250")+
geom_point(alpha=.25,size=5)+theme_minimal()+
geom_point(data = class_fit_data, aes(x=age, y=.pred_TRUE), size=.1,color="blue",inherit.aes = FALSE)+
geom_ribbon(data = class_fit_data, aes(x=age, ymin=.pred_lower_TRUE,ymax=.pred_upper_TRUE),
fill="lightblue",inherit.aes = FALSE,alpha=.25)+
annotate(geom="rect",xmin=min(w_vs_a$age),xmax = 35,ymin=0,ymax=.25,fill="indianred",alpha=.25)+
annotate(geom="rect",xmin=35,xmax=50,ymin=0,ymax=.25,fill="lightgreen",alpha=.25)+
annotate(geom="rect",xmin=50,xmax=65,ymin=0,ymax=.25,fill="lightblue",alpha=.25)+
annotate(geom="rect",xmin=65,xmax=80,ymin=0,ymax=.25,fill="indianred",alpha=.25)
A basis function \(b_{j}(.)\) , \(j=1,\ldots,K\) takes as input the predictor \(X\).
\[y_{i} = \beta_{0}+ \beta_{1}b_{1}(x_{i}) + \beta_{2}b_{2}(x_{i}) + \ldots + \beta_{K}b_{K}(x_{i})+\epsilon_{i}\]
Suppose again you are regressing \(\texttt{wage}\) on \(\texttt{age}\), and that the range of \(\texttt{age}\) is split once, at 50: \(R_{1}:\texttt{age}<50\) and \(R_{2}:\texttt{age}\geq 50\).
Note that the split value \(\texttt{age}=50\) is referred to as a knot.
w_vs_a |>
ggplot(aes(x=age,y=wage)) + theme_minimal()+
geom_point(color="darkgreen")+
geom_vline(aes(xintercept=50),color="orange")+
annotate(geom="rect",xmin=min(w_vs_a$age),xmax=50,ymin=0,ymax=max(w_vs_a$wage),fill="indianred",alpha=.25)+
annotate(geom="rect",xmin=50,xmax=max(w_vs_a$age),ymin=0,ymax=max(w_vs_a$wage),fill="lightblue",alpha=.25)+
annotate(geom="text",x=22,y=220,label="paste(italic(R),1)",parse=TRUE,size=15,alpha=.3)+
annotate(geom="text",x=75,y=220,label="paste(italic(R),2)",parse=TRUE,size=15,alpha=.3)->two_regions_plot
two_regions_plot
Turn the data into a list, according to the knot: then fit a linear model on each element of the list
Compute the predictions on a grid of age values, to draw the regression lines and bands
Within each of the regions defined by the split, fit a degree 3 polynomial regression
\[\texttt{wage}_{i,x_{i}\in R_{1}}= \beta_{0,R_{1}} + \beta_{1,R_{1}} \texttt{age}_{i}+\beta_{2,R_{1}} \texttt{age}^{2}_{i} +\beta_{3,R_{1}} \texttt{age}^{3}_{i}+\epsilon_{i}\]
\[\texttt{wage}_{i,x_{i}\in R_{2}}= \beta_{0,R_{2}} + \beta_{1,R_{2}} \texttt{age}_{i}+\beta_{2,R_{2}} \texttt{age}^{2}_{i} +\beta_{3,R_{2}} \texttt{age}^{3}_{i}+\epsilon_{i}\]
Fit different polynomial (degree 3) regressions within each region, independent from one another
two_poly_recipe = w_vs_a |>
mutate(age=as.double(age),
region=age>50) |>
recipe(wage~age) |>
step_poly(age, degree = 3)
w_vs_a_list_2 = w_vs_a |>
mutate(region=age>50,
age=as.double(age)) |>
group_by(region) |>
group_map(~workflow() |>
# add_recipe(two_poly_recipe) |>
add_formula(wage ~ poly(age,3)) |>
add_model(linear_model_spec) |>
fit(data = .x)
)
Combine predictions and confidence bands for each region
Let \(\hat{f}_{1}(.)\) and \(\hat{f}_{2}(.)\) be the two degree 3 polynomials fitted in regions \(R_{1}\) and \(R_{2}\), and let \(\xi\) be the only knot considered
Single fitting curve \(\hat{f}(.)\): conditions
continuity at the knot: \(\hat{f}_{1}(\xi) = \hat{f}_{2}(\xi)\)
\(\hat{f}(.)\) is supposed to be smooth at the knot, too:
\(\hat{f}^{\prime}_{1}(\xi) = \hat{f}^{\prime}_{2}(\xi)\)
\(\hat{f}^{\prime\prime}_{1}(\xi) = \hat{f}^{\prime\prime}_{2}(\xi)\)
Basis functions are useful to represent regression splines, including piecewise polynomials.
Suppose there are \(K\) knots
cubic spline via basis functions
\[y_{i}= \beta_{0} +\beta_{1} b_{1}(x_{i}) +\beta_{2} b_{2}(x_{i}) +\ldots +\beta_{K} b_{K}(x_{i})+\ldots+\beta_{K+3} b_{K+3}(x_{i}) +\epsilon_{i}\] for a degree 3 (cubic) polynomial and \(K\) knots, \(K+3\) basis functions \(b_{j}(.)\), \(j=1,\ldots,K+3\), are needed.
choice of \(b_{j}(.)\)’s
for each term of the polynomial, the choice is obvious: \(b_{j}(x_{i})=x_{i}^{j}\) , \(j=1,\ldots,d\); for cubic splines \(d=3\)
at each knot \(\xi\), a truncated power basis function \(b_{j}(x_{i})= h(x_{i},\xi)\) is used
\(h(x_{i},\xi) = (x_{i}-\xi)_{+}^{3} = (x_i-\xi)^{3} \text{ if } x_{i} > \xi, \text{ otherwise } (x_{i}-\xi)_{+}^{3} = 0\)
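As a small illustration, the truncated power basis can be written as a one-line R function:

# truncated power basis: (x - xi)^3 if x > xi, 0 otherwise
h = function(x, xi) pmax(x - xi, 0)^3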
Suppose there is a single knot at \(\xi\)
\[f(x_{i})= \beta_{0} +\beta_{1} x_{i} +\beta_{2} x_{i}^{2}+\beta_{3} x_{i}^{3} +\beta_{4} (x_{i}-\xi)^{3}_{+}\]
The goal is to fit \(f_{1}(.)\) and \(f_{2}(.)\), two cubic polynomials in \(R_{1}(x_{i}\leq \xi)\) and \(R_{2}(x_{i}> \xi)\), that is
\[f_{1}(x_{i})= a_{1} +b_{1} x_{i} +c_{1} x_{i}^{2}+d_{1}x_{i}^{3}\] \[f_{2}(x_{i})= a_{2} +b_{2} x_{i} +c_{2} x_{i}^{2}+d_{2}x_{i}^{3}\]
First, re-write the coefficients \(a_{r},b_{r},c_{r},d_{r}, r=1,2\) in terms of \(\beta_{j}\)’s such that
\(f_{1}(x_{i})=f(x_{i})\) and \(f_{2}(x_{i})=f(x_{i})\)
Then, check the following conditions that ensure continuity and smoothness at the knot:
\(f_{1}(\xi)=f_{2}(\xi)\)
\(f^{'}_{1}(\xi)=f^{'}_{2}(\xi)\)
\(f^{''}_{1}(\xi)=f^{''}_{2}(\xi)\)
\(f_{1}(x_{i})=f(x_{i})\)
Define \(a_{1},b_{1},c_{1},d_{1}\) ; note that, in \(R_{1}(x_{i}\leq \xi)\) , \((x_{i}-\xi)^{3}_{+}=0\) , therefore
\[f_{1}(x_{i}) = f(x_{i})\] \[a_{1} + b_{1} x_{i} + c_{1} x_{i}^{2} + d_{1}x_{i}^{3} = \beta_{0} + \beta_{1} x_{i} + \beta_{2} x_{i}^{2}+ \beta_{3} x_{i}^{3}\]
and \(a_{1}=\beta_{0}\) , \(b_{1}=\beta_{1}\) , \(c_{1}=\beta_{2}\) , \(d_{1}=\beta_{3}\) .
\(f_{2}(x_{i})=f(x_{i})\)
Define \(a_{2},b_{2},c_{2},d_{2}\); note that, in \(R_{2}(x_{i}> \xi)\),
\[(x_{i}-\xi)^{3}_{+}=(x_{i}-\xi)^{3}\] \[f_{2}(x_{i}) = \beta_{0} + \beta_{1} x_{i} + \beta_{2} x_{i}^{2}+ \beta_{3} x_{i}^{3}+\beta_{4}(x_{i}-\xi)^{3}\] re-writing \[(x_{i}-\xi)^{3}=x_{i}^{3} + 3x_{i}\xi^{2}-3x_{i}^{2}\xi - \xi^{3}\]
and plugging it in
\[a_{2} + b_{2} x_{i} + c_{2} x_{i}^{2} + d_{2}x_{i}^{3} = \beta_{0} + \beta_{1} x_{i} + \beta_{2} x_{i}^{2}+ \beta_{3} x_{i}^{3}+\beta_{4}x_{i}^{3} + 3\beta_{4}x_{i}\xi^{2}-3\beta_{4}x_{i}^{2}\xi - \beta_{4}\xi^{3}\] then \(a_{2}=\beta_{0} - \beta_{4} \xi^{3}\) , \(b_{2}=\beta_{1}+ 3 \beta_{4} \xi^{2}\) , \(c_{2}=\beta_{2} -3 \beta_{4} \xi\) and \(d_{2}=\beta_{3} + \beta_{4}\)
\[a_{1} +b_{1} \xi +c_{1} \xi^{2}+d_{1} \xi^{3} = a_{2} +b_{2} \xi +c_{2} \xi^{2}+d_{2} \xi^{3}\] recalling the definitions of \(a_{1},\ldots, d_{2}\), the above is re-written as
\[\begin{align} \beta_{0} +\beta_{1} \xi +\beta_{2} \xi^{2}+\beta_{3} \xi^{3} &= \underbrace{\beta_{0} - \beta_{4}\xi^{3}}_{a_{2}} +\underbrace{(\beta_{1}+ 3 \beta_{4} \xi^{2})}_{b_{2}} \xi + \underbrace{(\beta_{2} -3 \beta_{4} \xi)}_{c_{2}} \xi^{2}+\underbrace{(\beta_{3} + \beta_{4})}_{d_{2}} \xi^{3}=\\ &= \beta_{0} - \beta_{4}\xi^{3}+\beta_{1}\xi+ 3 \beta_{4} \xi^{3}+\beta_{2}\xi^{2} -3 \beta_{4} \xi^{3}+ \beta_{3}\xi^{3} + \beta_{4}\xi^{3}=\\ &= \beta_{0} +\beta_{1}\xi+ \beta_{2}\xi^{2} +\beta_{3}\xi^{3} \end{align}\]
\(f^{'}_{1}(\xi) = b_{1} + 2 c_{1}\xi+3 d_{1}\xi^{2}\) and \(f^{'}_{2}(\xi) = b_{2}+ 2 c_{2}\xi+3 d_{2}\xi^{2}\)
recalling the definitions of \(a_{1},\ldots, d_{2}\), the above is re-written as
\[\begin{align} \beta_{1} +2\beta_{2} \xi+3\beta_{3} \xi^{2} &= \underbrace{(\beta_{1}+ 3 \beta_{4} \xi^{2})}_{b_{2}} + 2\underbrace{(\beta_{2} -3 \beta_{4} \xi)}_{c_{2}} \xi+3\underbrace{(\beta_{3} + \beta_{4})}_{d_{2}} \xi^{2}=\\ &= \beta_{1}+ 3 \beta_{4} \xi^{2} + 2\beta_{2}\xi - 6 \beta_{4} \xi^{2}+ 3\beta_{3}\xi^{2}+3\beta_{4}\xi^{2}=\\ &= \beta_{1} +2\beta_{2} \xi+3\beta_{3} \xi^{2} \end{align}\]
\(f^{''}_{1}(\xi) = 2c_{1}+6 d_{1}\xi\) and \(f^{''}_{2}(\xi) = 2c_{2} + 6 d_{2}\xi\)
recalling the definitions of \(a_{1},\ldots, d_{2}\), the above is re-written as
\[\begin{align} 2\beta_{2} +6\beta_{3} \xi &= + 2\underbrace{(\beta_{2} -3 \beta_{4} \xi)}_{c_{2}}+6\underbrace{(\beta_{3} + \beta_{4})}_{d_{2}} \xi=\\ &= 2\beta_{2} - 6 \beta_{4} \xi+ 6\beta_{3}\xi+6\beta_{4}\xi=\\ &= 2\beta_{2} + 6\beta_{3}\xi \end{align}\]
To fit a cubic spline, one has to specify the basis functions, and then fit a linear model
the basis functions \(b_{j}(.)'s\) are defined in the \(\texttt{recipe}\) (pre-processing)
the model specification and fit do not change
cubic spline with a single knot at \(\texttt{age}=50\)
model specification and fit
curve and bands data
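A minimal sketch, mirroring the step_bs() call used in the experiments further below (single knot at age = 50, cubic basis); the object names match those used later in the code.

cubic_spline_recipe = w_vs_a |>
  recipe(wage ~ age) |>
  step_bs(age, degree = 3, options = list(knots = 50))
cubic_spline_wf = workflow() |>
  add_recipe(cubic_spline_recipe) |>
  add_model(linear_model_spec)
cubic_spline_fit = cubic_spline_wf |>
  fit(data = w_vs_a)
cubic_spline_preds = cubic_spline_fit |>
  augment(new_data = grid_age) |>
  mutate(cubic_spline_fit |> predict(new_data = grid_age, type = "conf_int"))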
tibble(sizes=c(50,100,250,500,1000,2000,3000)) |>
mutate(sizes=as.list(sizes),
training_data = sizes |>
map(~return(w_vs_a |>
slice(sample(1:n(),.x))))
) |>
mutate(grid_ages = sizes |> map(~grid_age)
) |>
mutate(cubic_fit = map2(.x=training_data,.y=grid_ages,
.f=function(x=.x,y=.y){
pre_proc_w_vs_a = x |>
recipe(wage~age) |>
step_bs(age,degree = 3,
options= list( knots = 50));
cubic_spline_wf = workflow() |>
add_recipe(pre_proc_w_vs_a) |>
add_model(linear_model_spec);
cubic_spline_fit = cubic_spline_wf |>
fit(data=x)
return(cubic_spline_fit)}
)
) |>
mutate(cubic_preds = map2(.x=cubic_fit,.y=grid_ages,
.f=function(x=.x,y=.y){
cubic_spline_preds = x |>
augment(new_data=y) |>
mutate(x |>
predict(new_data=y,
type="conf_int"));
return(cubic_spline_preds)
}
)
) |>
mutate(training_size=sizes) -> sizes_exp
Pull the training sets from the list, labeled by size
fit_by_sizes_plot = w_vs_a |>
ggplot(aes(x=age,y=wage))+theme_minimal()+geom_point(size=1,alpha=.4,color="lightgrey")+
geom_point(data=training_points,aes(x=age,y=wage),size=1,alpha=.25,color="green",inherit.aes = FALSE)+
geom_line(data=curves_data,aes(x=age,y=.pred), color="red",inherit.aes = FALSE)+
geom_ribbon(data=curves_data,aes(x=age,ymin=.pred_lower,ymax=.pred_upper),fill="cyan",color="blue",alpha=.2,inherit.aes = FALSE)+
facet_wrap(~training_size,scales = "free")
tibble(degrees_freedom=5:10) |>
mutate(degrees_freedom=as.list(degrees_freedom),
training_data = map(.x=500,.f=~return(w_vs_a |>
slice(sample(1:n(),.x))))
) |>
mutate(grid_ages = degrees_freedom |> map(~grid_age)
) |>
mutate(cubic_fit = pmap(.l=list(x=training_data,
y=grid_ages,
z=degrees_freedom),
.f=function(x,y,z){
pre_proc_w_vs_a = x |>
recipe(wage~age) |>
step_bs(age,degree = 3,
deg_free=z);
cubic_spline_wf = workflow() |>
add_recipe(pre_proc_w_vs_a) |>
add_model(linear_model_spec);
cubic_spline_fit = cubic_spline_wf |>
fit(data=x)
return(cubic_spline_fit)}
)
) |>
mutate(cubic_preds = map2(.x=cubic_fit,.y=grid_ages,
.f=function(x=.x,y=.y){
cubic_spline_preds = x |>
augment(new_data=y) |>
mutate(x |>
predict(new_data=y,
type="conf_int"));
return(cubic_spline_preds)
}
)
)-> knots_exp
Compute the number of knots from the degrees of freedom: \(knots= df-d-1\), where \(d\) is the degree of the polynomial
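A small sketch applying the formula above with \(d=3\); it adds the n_knots label used in the facetted plot below (this mirrors, rather than reproduces, the original code).

knots_exp = knots_exp |>
  mutate(n_knots = map_dbl(degrees_freedom, ~ .x - 3 - 1))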
Pull the training sets from the list, labeled by the number of knots
fit_by_knots_plot = w_vs_a |>
ggplot(aes(x=age,y=wage))+theme_minimal()+geom_point(size=1,alpha=.4,color="lightgrey")+
geom_point(data=training_points,aes(x=age,y=wage),size=1,alpha=.25,color="green",inherit.aes = FALSE)+
geom_line(data=curves_data,aes(x=age,y=.pred), color="red",inherit.aes = FALSE)+
geom_ribbon(data=curves_data,aes(x=age,ymin=.pred_lower,ymax=.pred_upper),fill="cyan",color="blue",alpha=.2,inherit.aes = FALSE)+
facet_wrap(~n_knots,scales = "free")
The higher the number of knots, the higher the flexibility of \(\hat{f}(.)\) (and its variability).
For larger training set sizes, a higher number of knots can be selected (via, e.g., cross-validation).
In the outer range of \(X\) there are usually fewer observations, and splines can have high variance
To tackle this drawback, one can constrain the function to be linear in the boundary regions.
A cubic spline with these additional linearity constraints is called a natural cubic spline.
. . .
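A sketch of how natural_spline_preds (used in the comparison below) could be obtained via step_ns(), which fits a natural cubic spline; w_vs_a_sample and deg_free = 3 are assumptions, and the cubic counterpart is built analogously with step_bs().

natural_spline_fit = workflow() |>
  add_recipe(w_vs_a_sample |>
               recipe(wage ~ age) |>
               step_ns(age, deg_free = 3)) |>
  add_model(linear_model_spec) |>
  fit(data = w_vs_a_sample)
natural_spline_preds = natural_spline_fit |>
  augment(new_data = grid_age) |>
  mutate(natural_spline_fit |> predict(new_data = grid_age, type = "conf_int"))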
cub_vs_nat = rbind(cubic_spline_preds |> mutate(method="cubic"),
natural_spline_preds |> mutate(method="natural"))
w_vs_a |>
ggplot(aes(x=age,y=wage))+theme_minimal()+geom_point(size=1,alpha=.2,color="lightgrey")+
geom_point(data=w_vs_a_sample,aes(x=age,y=wage),size=1,color="lightgreen",inherit.aes = FALSE)+
scale_linetype_manual(values = c("dotted","solid"))+
geom_line(data=cub_vs_nat,aes(x=age,y=.pred, colour = method,linetype=method),inherit.aes = FALSE)+
geom_ribbon(data=cub_vs_nat,aes(x=age,ymin=.pred_lower,ymax=.pred_upper,color=method,linetype=method),alpha=.05,inherit.aes = FALSE)
In regression splines:
- pick \(K\) knots
- define a basis function within each of the resulting regions
- estimate the spline coefficients via least squares
In smoothing splines:
- find \(g(\cdot)\) that minimizes the loss function \[\sum_{i=1}^{n}\left(y_{i}-g(x_{i})\right)^{2}\]
- with no restrictions, one could pick \(g(\cdot)\) so that \(\sum_{i=1}^{n}\left(y_{i}-g(x_{i})\right)^{2}=0\)
- the fitted curve would interpolate the training observations (massive overfitting)
. . .
the minimization problem is \(loss+penalty\), that is \[\sum_{i=1}^{n}\left(y_{i}-g(x_{i})\right)^{2}+ \lambda\int g''(t)^{2} dt\]
- where \(g''(t)\) is the second derivative of \(g(\cdot)\) at \(t\), indicating whether the slope of the function is changing (increasing/decreasing) or not
the rougher \(g(\cdot)\), the larger the value of \(\int g''(t)^{2} dt\);
\(\lambda\) is the penalty parameter that dictates how smooth \(g(\cdot)\) will be
A smoothing spline has \(n\) nominal degrees of freedom since it has \(n\) parameters (equivalent to one knot per training observation).
Because of the roughness penalty, however, the \(n\) parameters are heavily shrunk.
To measure the model complexity/flexibility, the effective degrees of freedom \(df_{\lambda}\) are used instead.
Let \({\bf \hat g}_{\lambda}\) be the vector of fitted values (one per \(x_{i}\), \(i = 1,\ldots, n\)) for a specific value of \(\lambda\); it is obtained as \[{\bf \hat g}_{\lambda} = {\bf S}_{\lambda}{\bf y}\] where \({\bf S}_{\lambda}\) is the smoothing matrix, that is, the smoothing-spline analogue of the hat matrix \({\bf H} = {\bf X}\left({\bf X}^{\sf T}{\bf X}\right)^{-1}{\bf X}^{\sf T}\) of the regression case.
The effective degrees of freedom are then given by \(df_{\lambda}=trace\left({\bf S}_{\lambda}\right)\)
The smoothing parameter \(\lambda\) can be selected via LOOCV, which can be computed very efficiently; in fact, the LOOCV estimate of the test RSS is
\[RSS_{cv}(\lambda)=\sum_{i =1}^{n}{\left(y_{i}-\hat{g}^{(-i)}_{\lambda}(x_{i})\right)^{2}}=\sum_{i =1}^{n}{\left[\frac{y_{i}-\hat{g}_{\lambda}(x_{i})}{1-\left\{{\bf S}_{\lambda}\right\}_{ii}} \right]^{2}}\]
where \(\hat{g}^{(-i)}_{\lambda}(\cdot)\) is the function fitted on all but the \(i^{th}\) training observation, and \(\left\{{\bf S}_{\lambda}\right\}_{ii}\) is the \(i^{th}\) diagonal element of the matrix \({\bf S}_{\lambda}\)
wage_smooth_df_32 <- smooth.spline(x = w_vs_a$age, y = w_vs_a$wage, df = 32)
wage_smooth_df_16 <- smooth.spline(x = w_vs_a$age, y = w_vs_a$wage, df = 16)
wage_smooth_cv <- smooth.spline(x = w_vs_a$age, y = w_vs_a$wage, cv = TRUE)
tibble(wage_smooth_df_32$df,wage_smooth_df_16$df,wage_smooth_cv$df)|> kbl()
wage_smooth_df_32$df | wage_smooth_df_16$df | wage_smooth_cv$df |
---|---|---|
31.99591 | 16.00237 | 6.794596 |
age_grid=seq(from=min(w_vs_a$age),to=max(w_vs_a$age),by=.1)
all_preds = bind_rows(
as_tibble(predict(wage_smooth_df_32, age_grid)) |>
mutate(model = "32 degrees of freedom"),
as_tibble(predict(wage_smooth_df_16, age_grid)) |>
mutate(model = "16 degrees of freedom"),
as_tibble(predict(wage_smooth_cv, age_grid)) |>
mutate(model = "LOOCV")
)
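A sketch of a comparison plot, in the same style as the figures above; predict() on a smooth.spline object returns x and y components, which are the columns of all_preds.

w_vs_a |>
  ggplot(aes(x = age, y = wage)) +
  geom_point(alpha = .2, color = "lightgrey") +
  # one smoothing-spline curve per model label
  geom_line(data = all_preds, aes(x = x, y = y, color = model)) +
  theme_minimal()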
The local regression fit \(\hat{f}(x_{0})\) depends on the span, that is the proportion of points to be considered as neighbors of \(x_{0}\).
In particular, a weighted linear regression is fitted by minimizing \[\sum_{i=1}^{n}{K_{i0}}(y_{i}-\beta_{0}-\beta_{1}x_{i})^{2}\]
where \(K_{i0}\) is a weight proportional to the closeness of the \(x_{i}\)’s to \(x_{0}\): for the neighbors \(K_{i0}>0\), for the non-neighbors \(K_{i0}=0\)
the weighted least squares minimizers \(\hat{\beta_{0}}\) and \(\hat{\beta_{1}}\) are then used to get the estimate of the function at \(x_{0}\)
\[\hat{f}(x_{0})=\hat{\beta_{0}}+ \hat{\beta_{1}}x_{0}\]
However, the parameter that affects \(\hat{f}\) the most is the span: a low value corresponds to a smaller number of neighbors, so \(\hat{f}\) will be more flexible.
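As a small illustration (using stats::loess directly, with illustrative names), two fits with different spans can be compared on the same grid.

# smaller span = fewer neighbors = more flexible fit
loess_02 = loess(wage ~ age, data = w_vs_a, span = .2)
loess_07 = loess(wage ~ age, data = w_vs_a, span = .7)
loess_preds = grid_age |>
  mutate(span_02 = predict(loess_02, newdata = grid_age),
         span_07 = predict(loess_07, newdata = grid_age))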
Generalized additive models (GAMs) extend the standard linear model by allowing non-linear functions of each of the predictors
\[y_{i} = \beta_{0}+f_{1}(x_{i1})+f_{2}(x_{i2})+\ldots+f_{p}(x_{ip})+\epsilon_{i}\]
library(gam)
library(splines)
data(Wage,package = "ISLR2")
wage_gam_ns_fit <- gam(wage ~ ns(year, 4) + ns(age, 5) + education, data = Wage)
d <- preplot(wage_gam_ns_fit)
plot_structure=tibble(plot_name=names(d) , data = d |> map(~tibble(x=.$x,y=.$y,se_y=.$se.y) |>
mutate(y_lower=y-se_y,
y_upper=y+se_y)
)
) |> mutate(plots=map2(.x = data,.y=plot_name,
function(x=.x,y=.y){
if(!is.factor(x$x)){
myplot=x |> ggplot(aes(x)) +
geom_line(aes(y = y), color = "red") +
geom_line(aes(y = y_lower), lty = 2, color = "red") +
geom_line(aes(y = y_upper), lty = 2, color = "red") +
ggtitle(y)+ theme(axis.text.x=element_text(angle=90))
}else{
myplot = x |> ggplot(aes(x)) +
geom_errorbar(aes(y = y, ymin = y_lower, ymax = y_upper),color="red") +
ggtitle(y)+ theme(axis.text.x=element_text(angle=90))
}
}
)
)
wage_gam_smooth_fit <-
gam(wage ~ s(year, 4) + s(age, 5) + education,
data = Wage)
d <- preplot(wage_gam_smooth_fit)
plot_structure=tibble(plot_name=names(d) , data = d |> map(~tibble(x=.$x,y=.$y,se_y=.$se.y) |>
mutate(y_lower=y-se_y,
y_upper=y+se_y)
)
) |> mutate(plots=map2(.x = data,.y=plot_name,
function(x=.x,y=.y){
if(!is.factor(x$x)){
myplot=x |> ggplot(aes(x)) +
geom_line(aes(y = y), color = "blue") +
geom_line(aes(y = y_lower), lty = 2, color = "blue") +
geom_line(aes(y = y_upper), lty = 2, color = "blue") +
ggtitle(y)+ theme(axis.text.x=element_text(angle=90))
}else{
myplot = x |> ggplot(aes(x)) +
geom_errorbar(aes(y = y, ymin = y_lower, ymax = y_upper),color="blue") +
ggtitle(y)+ theme(axis.text.x=element_text(angle=90))
}
}
)
)