Statistical Learning
Improving models
subset selection: reduce the number of predictors to reduce the model variance and improve interpretability.
shrinkage methods: all \(p\) predictors are kept in the model, but the coefficient estimates are shrunk towards 0. This also reduces the variance.
dimension reduction: \(M\) synthetic predictors are defined as linear combinations of the starting \(p\) variables (PCA anyone?); setting \(M<p\) reduces the model complexity
best subset selection
This approach searches the complete space of all \(2^{p}\) possible models that can be built from \(p\) predictors.
For \(k=1,\ldots,p\)
fit \(p\choose k\) possible models on \(k\) predictors
find \(\mathcal{M}_{k}\) , the best model with \(k\) predictors: lowest RSS or largest \(R^{2}\).
From the sequence \(\mathcal{M}_{0},\mathcal{M}_{1},\ldots,\mathcal{M}_{p}\) of best models given the number of predictors the overall best is chosen
due to the different degrees of freedom, a test error estimate is required to make the choice.
or, one can use training error measures adjusted for model complexity (Mallows' \(C_{p}\), AIC, BIC, adjusted \(R^{2}\)).
best subset selection
best subset selection: credit dataset
The response is balance; there are 11 predictor columns, including two dummies for ethnicity.
Cons of best subset selection
computational complexity: \(2^{p}\) models must be fitted; in the previous example 1024 models were fitted. With 20 predictors, one needs to fit more than a million models (1,048,576)!
statistical problem: due to the high number of fitted models, it is likely that one model randomly fits well, or even overfits, the training set
forward stepwise selection
\(\mathcal{M}_{0}\) is the null model (intercept only)
For \(k=0,\ldots,p-1\)
- fit the \(p - k\) possible models that add a further predictor to the model \(\mathcal{M}_{k}\) ;
- find the best there is and define it \(\mathcal{M}_{k+1}\) .
From the sequence \(\mathcal{M}_{0},\mathcal{M}_{1},\ldots,\mathcal{M}_{p}\) of best models given the number of predictors, the overall best is chosen
backward stepwise selection
fit \(\mathcal{M}_{p}\), the full model.
For \(k=p, p-1,\ldots,1\)
- fit all the possible models that contain all the predictors in \(\mathcal{M}_{k}\) but one;
- find the best among the \(k\) models, that becomes \(\mathcal{M}_{k-1}\) .
From the sequence \(\mathcal{M}_{0},\mathcal{M}_{1},\ldots,\mathcal{M}_{p}\) of best models given the number of predictors, the overall best is chosen
the stepwise approach
the number of models to evaluate is \(1+p(p+1)/2\), smaller than \(2^{p}\)
the reduced search space does not guarantee that the best possible model is found
the backward selection cannot be applied when \(p>n\)
selecting the best model irrespective of its size
adjust the training error measure
add a price to pay for each extra predictor in the model
use cross-validation to estimate the test error
estimate the performance of the model on the test set: e.g. via cross-validation
Adjusting the training error
Mallows' \(C_{p} = \frac{1}{n} (RSS+2d\hat{\sigma}^{2})\)
AIC \(= -2\log(L) +2d\)
BIC \(= \frac{1}{n} (RSS+\log(n)d\hat{\sigma}^{2})\)
Adjusted \(R^{2}= 1-\frac{RSS/(n-d-1)}{TSS/(n-1)}\)
Note
for linear models with Gaussian errors, maximizing the likelihood is equivalent to minimizing \(RSS\), thus \(C_{p}\) and AIC are proportional to each other
\(C_{p}\) and \(BIC\) differ in the \(2\) being replaced by \(\log(n)\) in BIC: since \(\log(n)>2\) for \(n>7\), the \(BIC\) tends to select models with fewer predictors;
the adjusted \(R^{2}\) divides \(RSS\) by \(n-d-1\), which decreases as the number of parameters in the model grows, thus penalizing larger models.
Credit data example: adjusting the training error
Cross-validation selection
\(k\) -fold cross-validation can be used to estimate the test error and choose the best model;
an advantage is that \(\hat{\sigma}\) and \(d\) are not required (and they are sometimes not easy to identify);
this approach can be applied to any type of model
Credit data example: BIC vs validation approaches
Subset selection
note that \(\texttt{leaps::regsubsets}\) does not come with a pre-defined \(\texttt{predict}\) method, nor with a \(\texttt{broom::augment}\) method.
selection is done via the provided corrected training error measures: \(C_{p}\) , \(AIC\) , \(BIC\) , \(Adjusted \ R^{2}\)
An ad-hoc procedure is needed for cross-validation based selection.
Subset selection example: hitters dataset
basic cleaning
Remove missings and clean the names
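A minimal sketch of this step, assuming the ISLR \(\texttt{Hitters}\) data and the \(\texttt{janitor}\) package (the name \(\texttt{my_hitters}\) anticipates its later use):
library(ISLR)        # Hitters data
library(tidyverse)
library(janitor)
my_hitters = Hitters %>%
  as_tibble() %>%
  drop_na(Salary) %>%   # remove players with a missing response
  clean_names()         # snake_case names: at_bat, hm_run, c_at_bat, ...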
subset selection example: hitters dataset
data splitting
keep the test observations for evaluation, arrange the training observations in folds, for cross-validation
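A possible setup with \(\texttt{rsample}\) (the names \(\texttt{main_split}\), \(\texttt{hit_train}\), \(\texttt{hit_test}\) and \(\texttt{hit_flds}\) anticipate their later use; seed and proportion are assumptions consistent with the fold sizes shown later):
library(tidymodels)
set.seed(123)
main_split = initial_split(my_hitters, prop = 4/5)   # 80% train / 20% test
hit_train  = training(main_split)
hit_test   = testing(main_split)
hit_flds   = vfold_cv(hit_train, v = 10)             # 10 folds for cross-validation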
subset selection example: hitters dataset
To access the output of \(\texttt{regsubsets}\) one can use
the \(\texttt{summary}\) function
the available \(\texttt{broom::tidy}\) method, which will be used here.
library(leaps)        # for regsubsets()
library(kableExtra)   # for kbl() and kable_styling()
reg_sub_out = regsubsets(salary ~ ., data = hit_train,
                         nvmax = 19, method = "exhaustive")
reg_sub_out %>% tidy() %>% kbl() %>% kable_styling(font_size = 10)
(Intercept) | at_bat | hits | hm_run | runs | rbi | walks | years | c_at_bat | c_hits | c_hm_run | c_runs | crbi | c_walks | leagueN | divisionW | put_outs | assists | errors | new_leagueN | r.squared | adj.r.squared | BIC | mallows_cp |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | 0.3945540 | 0.3916432 | -94.68167 | 82.632753 |
TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | 0.4793254 | 0.4742947 | -121.01098 | 44.219902 |
TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | FALSE | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | 0.5147955 | 0.5077294 | -130.48038 | 29.310338 |
TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | FALSE | TRUE | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | 0.5417171 | 0.5327750 | -137.12086 | 18.476068 |
TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | FALSE | TRUE | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | 0.5600773 | 0.5492949 | -140.36017 | 11.723250 |
TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | FALSE | TRUE | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | TRUE | TRUE | FALSE | FALSE | FALSE | 0.5669372 | 0.5541373 | -138.31349 | 10.452933 |
TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | FALSE | TRUE | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | TRUE | TRUE | TRUE | FALSE | FALSE | 0.5722116 | 0.5573872 | -135.53972 | 9.938504 |
TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | FALSE | TRUE | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | TRUE | TRUE | TRUE | TRUE | FALSE | 0.5777671 | 0.5609618 | -132.93765 | 9.290045 |
TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | FALSE | TRUE | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | TRUE | TRUE | TRUE | TRUE | TRUE | 0.5819960 | 0.5631858 | -129.70443 | 9.274000 |
TRUE | TRUE | TRUE | FALSE | FALSE | FALSE | TRUE | FALSE | TRUE | FALSE | FALSE | TRUE | TRUE | TRUE | FALSE | TRUE | TRUE | TRUE | FALSE | FALSE | 0.5875981 | 0.5668744 | -127.19075 | 8.603345 |
TRUE | TRUE | TRUE | FALSE | FALSE | FALSE | TRUE | FALSE | TRUE | FALSE | FALSE | TRUE | TRUE | TRUE | FALSE | TRUE | TRUE | TRUE | FALSE | TRUE | 0.5928848 | 0.5702673 | -124.55313 | 8.083007 |
TRUE | TRUE | TRUE | FALSE | FALSE | FALSE | TRUE | FALSE | TRUE | FALSE | FALSE | TRUE | TRUE | TRUE | FALSE | TRUE | TRUE | TRUE | TRUE | TRUE | 0.5967606 | 0.5721978 | -121.21484 | 8.235300 |
TRUE | TRUE | TRUE | FALSE | FALSE | FALSE | TRUE | TRUE | TRUE | FALSE | FALSE | TRUE | TRUE | TRUE | FALSE | TRUE | TRUE | TRUE | TRUE | TRUE | 0.5986463 | 0.5720259 | -116.85207 | 9.336337 |
TRUE | TRUE | TRUE | FALSE | TRUE | FALSE | TRUE | TRUE | TRUE | FALSE | FALSE | TRUE | TRUE | TRUE | FALSE | TRUE | TRUE | TRUE | TRUE | TRUE | 0.5998215 | 0.5710908 | -112.12077 | 10.776077 |
TRUE | TRUE | TRUE | FALSE | FALSE | FALSE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | TRUE | TRUE | TRUE | TRUE | TRUE | 0.6005753 | 0.5696920 | -107.16960 | 12.416728 |
TRUE | TRUE | TRUE | FALSE | TRUE | FALSE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | TRUE | TRUE | TRUE | TRUE | TRUE | 0.6009755 | 0.5678958 | -102.03301 | 14.225938 |
TRUE | TRUE | TRUE | TRUE | FALSE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | TRUE | TRUE | TRUE | TRUE | TRUE | 0.6012686 | 0.5659643 | -96.84021 | 16.086206 |
TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | TRUE | TRUE | TRUE | TRUE | TRUE | 0.6014386 | 0.5638778 | -91.58262 | 18.005195 |
TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | 0.6014495 | 0.5615944 | -86.24126 | 20.000000 |
subset selection example: hitters dataset
It is handy to define a model-size variable: this is easily done by counting the boolean variables row-wise
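A hedged sketch of this step (the result is shown below; the name \(\texttt{model_size_vs_perf}\) anticipates its later use, and the intercept column is subtracted so that \(\texttt{n_predictors}\) counts predictors only):
model_size_vs_perf = reg_sub_out %>%
  tidy() %>%
  mutate(n_predictors = rowSums(across(where(is.logical))) - 1) %>%  # -1 drops the intercept
  select(r.squared, adj.r.squared, BIC, mallows_cp, n_predictors)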
r.squared | adj.r.squared | BIC | mallows_cp | n_predictors |
---|---|---|---|---|
0.3945540 | 0.3916432 | -94.68167 | 82.632753 | 1 |
0.4793254 | 0.4742947 | -121.01098 | 44.219902 | 2 |
0.5147955 | 0.5077294 | -130.48038 | 29.310338 | 3 |
0.5417171 | 0.5327750 | -137.12086 | 18.476068 | 4 |
0.5600773 | 0.5492949 | -140.36017 | 11.723250 | 5 |
0.5669372 | 0.5541373 | -138.31349 | 10.452933 | 6 |
0.5722116 | 0.5573872 | -135.53972 | 9.938504 | 7 |
0.5777671 | 0.5609618 | -132.93765 | 9.290045 | 8 |
0.5819960 | 0.5631858 | -129.70443 | 9.274000 | 9 |
0.5875981 | 0.5668744 | -127.19075 | 8.603345 | 10 |
0.5928848 | 0.5702673 | -124.55313 | 8.083007 | 11 |
0.5967606 | 0.5721978 | -121.21484 | 8.235300 | 12 |
0.5986463 | 0.5720259 | -116.85207 | 9.336337 | 13 |
0.5998215 | 0.5710908 | -112.12077 | 10.776077 | 14 |
0.6005753 | 0.5696920 | -107.16960 | 12.416728 | 15 |
0.6009755 | 0.5678958 | -102.03301 | 14.225938 | 16 |
0.6012686 | 0.5659643 | -96.84021 | 16.086206 | 17 |
0.6014386 | 0.5638778 | -91.58262 | 18.005195 | 18 |
0.6014495 | 0.5615944 | -86.24126 | 20.000000 | 19 |
subset selection example: hitters dataset
For each index, create a boolean variable indicating the best value
model_size_vs_perf_w_best = model_size_vs_perf %>%
mutate(
across(.cols=1:2, ~ .x==max(.x),.names="{.col}" ),
across(.cols=3:4, ~ .x==min(.x),.names="{.col}" )
) %>%
select(n_predictors,everything())
model_size_vs_perf_best=model_size_vs_perf_w_best|>
pivot_longer(names_to = "index", values_to = "best",cols="r.squared":"mallows_cp")
model_size_vs_perf = model_size_vs_perf %>%
pivot_longer(names_to = "index", values_to = "value",cols="r.squared":"mallows_cp") %>%
bind_cols(model_size_vs_perf_best %>% select(best))
n_predictors | r.squared | adj.r.squared | BIC | mallows_cp |
---|---|---|---|---|
1 | FALSE | FALSE | FALSE | FALSE |
2 | FALSE | FALSE | FALSE | FALSE |
3 | FALSE | FALSE | FALSE | FALSE |
4 | FALSE | FALSE | FALSE | FALSE |
5 | FALSE | FALSE | TRUE | FALSE |
6 | FALSE | FALSE | FALSE | FALSE |
7 | FALSE | FALSE | FALSE | FALSE |
8 | FALSE | FALSE | FALSE | FALSE |
subset selection example: hitters dataset
convert the table to long format
model_size_vs_perf_w_best = model_size_vs_perf %>%
mutate(
across(.cols=1:2, ~ .x==max(.x),.names="{.col}" ),
across(.cols=3:4, ~ .x==min(.x),.names="{.col}" )
) %>%
select(n_predictors,everything())
model_size_vs_perf_best = model_size_vs_perf_w_best|>
pivot_longer(names_to = "index", values_to = "best",cols="r.squared":"mallows_cp")
model_size_vs_perf = model_size_vs_perf %>%
pivot_longer(names_to = "index", values_to = "value",cols="r.squared":"mallows_cp") %>%
bind_cols(model_size_vs_perf_best %>% select(best))
n_predictors | index | best |
---|---|---|
1 | r.squared | FALSE |
1 | adj.r.squared | FALSE |
1 | BIC | FALSE |
1 | mallows_cp | FALSE |
2 | r.squared | FALSE |
2 | adj.r.squared | FALSE |
2 | BIC | FALSE |
2 | mallows_cp | FALSE |
subset selection example: hitters dataset
create a further long table with all the values for each index and model size
model_size_vs_perf_best = model_size_vs_perf %>%
mutate(
across(.cols=1:2, ~ .x==max(.x),.names="{.col}" ),
across(.cols=3:4, ~ .x==min(.x),.names="{.col}" )
) %>%
select(n_predictors,everything())
model_size_vs_perf_best=model_size_vs_perf_best|>
pivot_longer(names_to = "index", values_to = "best",cols="r.squared":"mallows_cp")
model_size_vs_perf = model_size_vs_perf %>%
pivot_longer(names_to = "index", values_to = "value",cols="r.squared":"mallows_cp") %>%
bind_cols(model_size_vs_perf_best %>% select(best))
subset selection example: hitters dataset
Finally, create a plot to summarize the information at hand
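One possible way to draw it, using the long table built above (panel layout and theme are arbitrary choices):
model_size_vs_perf %>%
  ggplot(aes(x = n_predictors, y = value)) +
  geom_line() +
  geom_point(aes(color = best), size = 2) +                # highlight the best size per index
  facet_wrap(~ index, scales = "free_y") +
  labs(x = "number of predictors", y = "value of the index")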
subset selection example: hitters dataset
List-wise setup
Create a training and a validation set for each combination of folds
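A hedged sketch producing the tibble shown below (the object name \(\texttt{hit_flds_lw}\) and the column names \(\texttt{train}\)/\(\texttt{validate}\) are assumptions consistent with the printed output):
hit_flds_lw = hit_flds %>%
  mutate(train    = map(splits, analysis),     # training part of each fold
         validate = map(splits, assessment))   # held-out part of each fold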
# 10-fold cross-validation
# A tibble: 10 × 4
splits id train validate
<list> <chr> <list> <list>
1 <split [189/21]> Fold01 <tibble [189 × 20]> <tibble [21 × 20]>
2 <split [189/21]> Fold02 <tibble [189 × 20]> <tibble [21 × 20]>
3 <split [189/21]> Fold03 <tibble [189 × 20]> <tibble [21 × 20]>
4 <split [189/21]> Fold04 <tibble [189 × 20]> <tibble [21 × 20]>
5 <split [189/21]> Fold05 <tibble [189 × 20]> <tibble [21 × 20]>
6 <split [189/21]> Fold06 <tibble [189 × 20]> <tibble [21 × 20]>
7 <split [189/21]> Fold07 <tibble [189 × 20]> <tibble [21 × 20]>
8 <split [189/21]> Fold08 <tibble [189 × 20]> <tibble [21 × 20]>
9 <split [189/21]> Fold09 <tibble [189 × 20]> <tibble [21 × 20]>
10 <split [189/21]> Fold10 <tibble [189 × 20]> <tibble [21 × 20]>
subset selection example: hitters dataset
best subsets on the different folds
Apply \(\texttt{regsubsets}\) to each of the training folds previously defined
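A hedged sketch (the list-column name \(\texttt{fits}\) is an assumption):
hit_flds_lw = hit_flds_lw %>%
  mutate(fits = map(train,
                    ~ regsubsets(salary ~ ., data = .x, nvmax = 19, method = "exhaustive")))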
Now the predictors have to be pulled out from the different sized models
subset selection example: hitters dataset
pull information from each model sequence
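A hedged sketch: for every fold, extract the coefficient vector of the best model of each size via \(\texttt{coef()}\) (the name \(\texttt{models_by_fold}\) is an assumption):
models_by_fold = hit_flds_lw %>%
  mutate(coefficients = map(fits, function(fit) map(1:19, ~ coef(fit, id = .x)))) %>%
  select(id, validate, coefficients) %>%
  unnest(coefficients)   # one row per fold and model size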
subset selection example: hitters dataset
obtain the predictions \({\bf \hat{y}}_{test}={\bf X}_{test}{\bf \hat{\beta}}\) and compute the squared residuals
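A hedged sketch of the computation: build the test design matrix from the validation fold, keep only the columns matching the estimated coefficients, and store plain numeric vectors (the printed output below stores them as one-column matrices instead):
predictions_by_fold = models_by_fold %>%
  mutate(
    yhat = map2(validate, coefficients, function(val, beta) {
      X_test = model.matrix(salary ~ ., data = val)[, names(beta), drop = FALSE]
      drop(X_test %*% beta)                       # predictions on the validation fold
    }),
    squared_resids = map2(validate, yhat, ~ (.x$salary - .y)^2)
  ) %>%
  unnest(c(yhat, squared_resids))                 # one row per validation observation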
# A tibble: 3,990 × 5
id validate coefficients yhat[,1] squared_resids[,1]
<chr> <list> <list> <dbl> <dbl>
1 Fold01 <tibble [21 × 20]> <dbl [2]> 539. 83530.
2 Fold01 <tibble [21 × 20]> <dbl [2]> 377. 735.
3 Fold01 <tibble [21 × 20]> <dbl [2]> 441. 6107.
4 Fold01 <tibble [21 × 20]> <dbl [2]> 399. 123458.
5 Fold01 <tibble [21 × 20]> <dbl [2]> 265. 27365.
6 Fold01 <tibble [21 × 20]> <dbl [2]> 701. 112839.
7 Fold01 <tibble [21 × 20]> <dbl [2]> 619. 28541.
8 Fold01 <tibble [21 × 20]> <dbl [2]> 657. 3232.
9 Fold01 <tibble [21 × 20]> <dbl [2]> 1326. 301736.
10 Fold01 <tibble [21 × 20]> <dbl [2]> 260. 36215.
# ℹ 3,980 more rows
subset selection example: hitters dataset
compute the RMSE by fold for each model size, then average it across the folds
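A hedged sketch; one reasonable convention is RMSE within each fold, then averaged across folds, with the model size counting the coefficients, intercept included, to match the table below (names are assumptions):
cv_rmse_by_size = predictions_by_fold %>%
  mutate(model_size = map_int(coefficients, length)) %>%
  group_by(id, model_size) %>%
  summarise(fold_RMSE = sqrt(mean(squared_resids)), .groups = "drop") %>%  # RMSE per fold and size
  group_by(model_size) %>%
  summarise(cv_RMSE = mean(fold_RMSE))                                     # average across folds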
model_size | cv_RMSE |
---|---|
2 | 377.7384 |
3 | 350.9038 |
4 | 353.9554 |
5 | 337.6210 |
6 | 327.0380 |
7 | 335.2342 |
8 | 336.5313 |
9 | 337.8970 |
10 | 344.2199 |
11 | 346.6368 |
12 | 345.1199 |
13 | 337.5947 |
14 | 338.1094 |
15 | 335.1299 |
16 | 336.8434 |
17 | 336.5274 |
18 | 337.5436 |
19 | 336.8629 |
20 | 337.0753 |
subset selection example: hitters dataset
pick up the ‘best model’ according to cross-validation
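A hedged sketch of the selection (the minimum cross-validated RMSE in the table below is attained at size 6):
best_size = cv_rmse_by_size %>%
  slice_min(cv_RMSE, n = 1) %>%
  pull(model_size)
best_size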
model_size | cv_RMSE |
---|---|
2 | 377.7384 |
3 | 350.9038 |
4 | 353.9554 |
5 | 337.6210 |
6 | 327.0380 |
7 | 335.2342 |
8 | 336.5313 |
9 | 337.8970 |
10 | 344.2199 |
11 | 346.6368 |
12 | 345.1199 |
13 | 337.5947 |
14 | 338.1094 |
15 | 335.1299 |
16 | 336.8434 |
17 | 336.5274 |
18 | 337.5436 |
19 | 336.8629 |
20 | 337.0753 |
the optimal model size is 6
subset selection example: hitters dataset
Finally, the coefficients of the optimal model are
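One way to obtain them is to extract the 6-predictor model from the full-training-set fit \(\texttt{reg_sub_out}\) (the use of \(\texttt{id = 6}\) is inferred from the printed coefficients below):
coef(reg_sub_out, id = 6)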
(Intercept) walks c_at_bat c_hits c_hm_run divisionW
130.9726510 5.3668175 -0.5104821 2.0262373 1.6944526 -130.3609776
put_outs
0.1461783
shrinkage methods
all the predictors stay in the model, but the coefficient estimates are constrained
the shrinkage reduces the variability of the estimates
depending on the constraint, some of the coefficients may be shrunk exactly to zero, implicitly excluding the corresponding predictors from the model
the type of constraint defines the shrinkage method
L2 constraint: ridge regression
L1 constraint: lasso regression
ridge regression
A constraint is introduced on the optimization of the least squares function (RSS), in particular:
\[\underbrace{\sum_{i=1}^{n}{\left(y_{i}-\beta_{0}-\sum_{j=1}^{p}{\beta_{j}x_{ij}}\right)^{2}}}_{RSS} + \underbrace{\color{red}{\lambda\sum_{j=1}^{p}{\beta^{2}_{j}}}}_{constraint}\]
\(\lambda\) is the tuning parameter that determines the strength of the constraint on the estimates of \(\beta_{j}\) .
ridge regression
estimates for different values of \(\lambda\); in the right panel, the x axis shows the ratio between \(||\hat{\beta}^{R}_{\lambda}||_{2}\) and \(||\hat{\beta}||_{2}\)
the \(\mathcal{L}_2\) norm is the square root of the sum of squared coefficients \(||\hat{\beta}||_{2}=\sqrt{\sum_{j=1}^{p}{\hat{\beta}^{2}_{j}}}\)
ridge regression
Synthetic data with \(n=50\) and \(p=45\): all of the true \(\beta\)’s differ from 0.
The curves are plotted against \(\lambda\) and against \(||\hat{\beta}^{R}_{\lambda}||_{2}/||\hat{\beta}||_{2}\): the \(\color{black}{\text{squared bias}}\), the \(\color{blue!20!green}{\text{variance}}\) and the \(\color{violet}{\text{test MSE}}\)
the dotted line is the true test MSE
lasso regression
The lasso differs from ridge in the constraint
\[\underbrace{\sum_{i=1}^{n}{\left(y_{i}-\beta_{0}-\sum_{j=1}^{p}{\beta_{j}x_{ij}}\right)^{2}}}_{RSS} + \underbrace{\color{red}{\lambda\sum_{j=1}^{p}{|\beta_{j}|}}}_{constraint}\]
\(\lambda\) is the tuning parameter that determines the strength of the constraint on the estimates of \(\beta_{j}\).
lasso regression
variable selection and the lasso
The Lasso regression can be formalized as a constrained minimisation problem
\[\sum_{i=1}^{n}{\left(y_{i}-\beta_{0}-\sum_{j=1}^{p}{\beta_{j}x_{ij}}\right)^{2}} \ \ \ \text{subject to} \color{red}{\sum_{j=1}^{p}{|\beta_{j}|\leq s}}\]
The Ridge regression can be formalized as a constrained minimisation problem
\[\sum_{i=1}^{n}{\left(y_{i}-\beta_{0}-\sum_{j=1}^{p}{\beta_{j}x_{ij}}\right)^{2}} \ \ \ \text{subject to} \color{red}{\sum_{j=1}^{p}{\beta^{2}_{j}\leq s}}\]
variable selection and the lasso
ridge vs lasso: non null coefficients
ridge vs lasso: mostly null coefficients
tuning the parameter \(\lambda\): credit data
tuning the parameter \(\lambda\): synthetic data
shrinkage methods example: hitters pre-processing
Specify the recipe: when applying ridge or lasso regression, the numerical predictors have to be scaled, and the categorical ones have to be transformed into dummies, since \(\texttt{glmnet}\) only handles numeric predictors.
hit_recipe = recipe(salary~.,data=hit_train) %>%
step_scale(all_numeric_predictors()) %>%
step_zv(all_numeric(), -all_outcomes()) %>%
step_dummy(all_nominal())
hit_recipe %>% prep() %>% juice() %>% slice(1:8) %>%
kable() %>% kable_styling(font_size=12)
at_bat | hits | hm_run | runs | rbi | walks | years | c_at_bat | c_hits | c_hm_run | c_runs | crbi | c_walks | put_outs | assists | errors | salary | league_N | division_W | new_league_N |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2.004225 | 1.7535884 | 0.8086277 | 1.3477453 | 1.4343912 | 0.6870214 | 0.8679981 | 0.7691974 | 0.6801938 | 0.2062682 | 0.6473941 | 0.3887487 | 0.4559646 | 0.7145051 | 0.0203855 | 0.4469852 | 240 | 1 | 1 | 1 |
1.363986 | 0.9921618 | 0.8086277 | 1.1495475 | 1.0467179 | 1.3740429 | 2.8209937 | 1.5117255 | 1.3753920 | 0.4641035 | 1.2293949 | 0.9394760 | 0.9603501 | 0.2815784 | 0.3057825 | 1.1919605 | 240 | 1 | 0 | 1 |
1.503168 | 1.2921177 | 0.4620730 | 0.8720705 | 0.6978119 | 0.6870214 | 2.6039942 | 1.3081970 | 1.1086493 | 0.5543459 | 0.8697315 | 0.9848300 | 0.7989468 | 1.3762144 | 0.2989873 | 0.5959803 | 250 | 0 | 0 | 0 |
1.050826 | 0.9460148 | 0.4620730 | 1.0306287 | 0.8141139 | 0.8702271 | 0.4339990 | 0.1347499 | 0.1133656 | 0.1160259 | 0.1471350 | 0.1263433 | 0.1412280 | 0.0985524 | 0.3805293 | 0.2979901 | 95 | 0 | 1 | 0 |
3.674412 | 2.8149708 | 0.1155182 | 2.6558510 | 1.7445299 | 2.3358728 | 0.8679981 | 0.8028848 | 0.6718581 | 0.1547012 | 0.6898998 | 0.4729776 | 0.6254381 | 0.7356235 | 2.5278020 | 2.5329161 | 350 | 0 | 1 | 0 |
1.482291 | 1.4074854 | 0.4620730 | 0.6738726 | 0.8528813 | 0.1374043 | 3.6889917 | 1.9000672 | 1.9088773 | 1.0700165 | 1.5955976 | 1.5906301 | 0.9845607 | 0.6265119 | 0.3057825 | 0.5959803 | 235 | 0 | 1 | 0 |
3.987572 | 3.3225885 | 1.0396642 | 3.3693632 | 2.3260398 | 3.5725114 | 1.7359961 | 1.4962854 | 1.4287405 | 1.2505012 | 1.5367436 | 1.3606205 | 1.3396481 | 4.6249250 | 0.8901668 | 1.7879408 | 960 | 0 | 0 | 0 |
3.423884 | 3.1380002 | 0.5775912 | 3.0126071 | 1.9383665 | 4.3053343 | 2.6039942 | 2.5784955 | 2.5190512 | 0.5027788 | 2.9328915 | 1.4610472 | 3.5306991 | 1.1016754 | 2.5889585 | 2.9799013 | 875 | 0 | 0 | 0 |
shrinkage methods example: model specification
the model is still linear_reg; the engine to use is glmnet, which requires two parameters
penalty: this is the value of \(\lambda\)
mixture: indicates whether one wants to use ridge ( \(\texttt{mixture=0}\) ) or lasso ( \(\texttt{mixture=1}\) ).
Specify the grid for the hyperparameter
Fix the grid that has to be evaluated: otherwise glmnet will pick a grid internally
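A hedged sketch of the two specifications and of an explicit \(\lambda\) grid (the object names and the grid range are assumptions; \(\texttt{lambda_grid}\) is used in the fitting code below):
ridge_spec = linear_reg(penalty = tune(), mixture = 0) %>% set_engine("glmnet")
lasso_spec = linear_reg(penalty = tune(), mixture = 1) %>% set_engine("glmnet")
# candidate lambda values, equally spaced on the log10 scale
lambda_grid = grid_regular(penalty(range = c(-3, 3)), levels = 50)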
shrinkage methods example: model specification
As usual, put it all together in a workflow
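A possible workflow definition, reusing the recipe \(\texttt{hit_recipe}\) defined above (the spec names come from the sketch on the previous slide):
ridge_wflow = workflow() %>% add_recipe(hit_recipe) %>% add_model(ridge_spec)
lasso_wflow = workflow() %>% add_recipe(hit_recipe) %>% add_model(lasso_spec)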
shrinkage methods example: model fit
ridge_results = ridge_wflow %>%
tune_grid(grid=lambda_grid,
resamples=hit_flds,
control = control_grid(verbose = FALSE, save_pred = TRUE),
metrics = metric_set(rmse))
best_ridge = ridge_results %>% select_best(metric = "rmse")
lasso_results = lasso_wflow %>%
tune_grid(grid=lambda_grid,
resamples=hit_flds,
control = control_grid(verbose = FALSE, save_pred = TRUE),
metrics = metric_set(rmse))
best_lasso = lasso_results %>% select_best(metric = "rmse")
shrinkage methods example: finalization
final_ridge <- finalize_workflow(x = ridge_wflow, parameters = best_ridge) %>%
last_fit(main_split)
final_lasso <- finalize_workflow(x = lasso_wflow, parameters = best_lasso) %>%
last_fit(main_split)
ridge_rmse = final_ridge %>% collect_metrics()|> filter(.metric=="rmse")
lasso_rmse = final_lasso %>% collect_metrics()|> filter(.metric=="rmse")
reducing dimensionality
reduce model complexity by:
reducing the number of predictors (subset selection)
penalizing the model coefficients (shrinkage)
replacing the original predictors with a reduced set of linear combinations (dimension reduction)
principal components regression
One can regress \(y\) on \(D\) orthogonal \(pc\)’s (\(D \ll p\))
\[\hat{y}_{i}=\sum_{d=1}^{D}{\theta}_{d}pc_{id}\]
by plugging in the formula for \(pc_{id}\), the previous becomes
\[\hat{y}_{i}=\sum_{d=1}^{D}\theta_{d}\underbrace{\sum_{j=1}^{p}\phi_{jd}x_{ij}}_{pc_{id}}=\sum_{j=1}^{p}\sum_{d=1}^{D}\theta_{d}\phi_{jd}x_{ij} \]
setting \({\hat\beta}_{j} = \sum_{d=1}^{D}\theta_{d}\phi_{jd}\), the previous expression reduces to the usual linear form \(\hat{y}_{i}=\sum_{j=1}^{p}{\hat\beta}_{j}x_{ij}\)
basic implementation
library("pls")
set.seed(123)
main_split=initial_split(my_hitters, prop=4/5)
pcr_fit = pcr(salary ~ ., data=my_hitters, scale=TRUE, validation="CV", subset=main_split$in_id)
pcr_pred = predict(pcr_fit,type = "response",newdata = my_hitters |> slice(-main_split$in_id), ncomp = 5)
pcr_results = tibble(estimate = pcr_pred,truth = my_hitters |> slice(-main_split$in_id) |> pull(salary))
rmse(pcr_results,truth = truth, estimate = estimate)|> kbl(align="l")
.metric | .estimator | .estimate |
---|---|---|
rmse | standard | 355.538 |
tidymodels implementation
split the data up
define the recipe: standardize the predictors first, and then compute the principal components
model spec and workflow (nothing changes, the number of components is the hyperparameter); a sketch of the three steps follows
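A hedged sketch of these steps, reusing \(\texttt{main_split}\), \(\texttt{hit_train}\) and \(\texttt{hit_flds}\) from above (object names are assumptions):
pcr_recipe = recipe(salary ~ ., data = hit_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%            # standardize before PCA
  step_pca(all_numeric_predictors(), num_comp = tune())   # principal components as predictors
pcr_spec  = linear_reg() %>% set_engine("lm")
pcr_wflow = workflow() %>% add_recipe(pcr_recipe) %>% add_model(pcr_spec)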
tidymodels implementation
set the hyperparameter grid and fit the model
select the tuned pcr fit
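A hedged sketch of the tuning and selection steps (grid range and names are assumptions):
pcr_grid    = grid_regular(num_comp(range = c(1L, 19L)), levels = 19)
pcr_results = pcr_wflow %>%
  tune_grid(resamples = hit_flds, grid = pcr_grid, metrics = metric_set(rmse))
best_pcr  = pcr_results %>% select_best(metric = "rmse")
final_pcr = finalize_workflow(pcr_wflow, best_pcr) %>% last_fit(main_split)
final_pcr %>% collect_metrics()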
partial least squares regression
the PC’s are linear combinations of the \(X_{j}\)’s built to approximate the data correlation structure, irrespective of the response \(Y\)
in PLS regression, instead, linear combinations are defined in a supervised way (that is, taking into account \(Y\))
basic implementation
library("pls")
set.seed(123)
main_split=initial_split(my_hitters, prop=4/5)
plsr_fit = plsr(salary ~ ., data=my_hitters, scale=TRUE, validation="CV", subset=main_split$in_id)
plsr_pred = predict(plsr_fit,type = "response",newdata = my_hitters |> slice(-main_split$in_id), ncomp = 5)
plsr_results = tibble(estimate = plsr_pred,truth = my_hitters |> slice(-main_split$in_id) |> pull(salary))
rmse(plsr_results,truth = truth, estimate = estimate)|> kbl(align="l")
.metric | .estimator | .estimate |
---|---|---|
rmse | standard | 387.554 |
tidymodels implementation
split the data up
define the recipe: standardize the predictors first, and then compute the PLS components
model spec and workflow (nothing changes, the number of components is the hyperparameter); a sketch follows
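A hedged sketch: the only change with respect to PCR is the dimension-reduction step, where \(\texttt{step_pca}\) is replaced by \(\texttt{recipes::step_pls}\) (which needs the \(\texttt{mixOmics}\) package); object names are assumptions:
pls_recipe = recipe(salary ~ ., data = hit_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_pls(all_numeric_predictors(), outcome = "salary", num_comp = tune())
pls_spec  = linear_reg() %>% set_engine("lm")
pls_wflow = workflow() %>% add_recipe(pls_recipe) %>% add_model(pls_spec)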
tidymodels implementation
set the hyperparameter grid and fit the model
select the tuned plsr fit
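The tuning mirrors the PCR case; a hedged sketch:
pls_grid    = grid_regular(num_comp(range = c(1L, 19L)), levels = 19)
pls_results = pls_wflow %>%
  tune_grid(resamples = hit_flds, grid = pls_grid, metrics = metric_set(rmse))
best_pls  = pls_results %>% select_best(metric = "rmse")
final_pls = finalize_workflow(pls_wflow, best_pls) %>% last_fit(main_split)
final_pls %>% collect_metrics()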