Statistical learning: the flow

Alfonso Iodice D’Enza

supervised learning flow

pre-processing

split your observations into training, validation (or cross-validate), and test sets

transform the predictors appropriately (feature engineering)

model-spec

specify the model to fit

tuning

select a reasonable grid of hyperparameter values to choose from

for each combination of hyperparameters

  1. fit the model on training observations
  2. compute appropriate metrics on the validation observations

pick the best hyperparameter combination

final evaluation and fit

compute the metric for the tuned model on the test set (observations not used so far)

obtain the final fit for the model on all the available observations
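The whole flow above can be sketched with tidymodels. The dataset (mtcars), the penalized regression model, and the tuned hyperparameter below are illustrative assumptions, not part of the slides, and the glmnet engine is assumed to be installed:

```r
library(tidymodels)

set.seed(123)
# split into training/testing, plus cross-validation folds on the training set
split = initial_split(mtcars, prop = 3/4)
folds = vfold_cv(training(split), v = 5)

# a model with one tunable hyperparameter (penalty) -- illustrative choice
mod = linear_reg(penalty = tune(), engine = "glmnet")
wf  = workflow() |> add_formula(mpg ~ .) |> add_model(mod)

# fit each grid value on the folds, then pick the best combination
res  = wf |> tune_grid(resamples = folds,
                       grid = grid_regular(penalty(), levels = 10))
best = select_best(res, metric = "rmse")

# final evaluation on the test set, then the final fit
final = wf |> finalize_workflow(best) |> last_fit(split)
collect_metrics(final)
```

Note how `last_fit()` covers the last two steps of the flow at once: it fits the finalized workflow on the training set and evaluates it on the test set.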

the tidymodels metapackage

Each core package in the tidymodels metapackage corresponds to one step of the supervised learning flow


For all things tidymodels check tidymodels.org!

the tidymodels core


the tidymodels core

the rsample package provides tools for data splitting and resampling
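As a quick illustration (the dataset is just a placeholder), rsample can create both a single split and a set of cross-validation folds:

```r
library(rsample)

set.seed(42)
# single training/testing split, stratified on the outcome
split = initial_split(mtcars, prop = 3/4, strata = mpg)
train = training(split)
test  = testing(split)

# 5-fold cross-validation on the training set
folds = vfold_cv(train, v = 5)
```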


the tidymodels core

the recipes package provides tools for data pre-processing and feature engineering
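A sketch of a recipe with a couple of common steps; the formula and dataset are illustrative:

```r
library(recipes)

# a recipe records pre-processing steps without executing them right away
rec = recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors()) |>  # center and scale predictors
  step_dummy(all_nominal_predictors())         # dummy-code factor predictors

# prep() estimates the steps on the data, bake() applies them
prep(rec) |> bake(new_data = NULL)
```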


the tidymodels core

the parsnip package provides a unified interface to the many models available in R
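For instance, the same specification function can dispatch to different underlying engines (the glmnet engine below is an illustrative choice and assumes the package is installed):

```r
library(parsnip)

# the same unified interface wraps different underlying engines
lm_spec    = linear_reg() |> set_engine("lm")
ridge_spec = linear_reg(penalty = 0.1, mixture = 0) |> set_engine("glmnet")

# the fitting syntax is identical regardless of the engine
lm_fit = lm_spec |> fit(mpg ~ wt + hp, data = mtcars)
```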


the tidymodels core

the workflows package combines pre-processing, modeling and post-processing steps into a single object


the tidymodels core

the yardstick package provides several performance metrics
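A minimal example on toy predictions (the numbers are illustrative):

```r
library(yardstick)

# a data frame with a column of true values and a column of predictions
preds = data.frame(truth    = c(3.1, 4.8, 2.2, 5.0),
                   estimate = c(2.9, 5.1, 2.0, 4.6))

preds |> rmse(truth = truth, estimate = estimate)    # root mean squared error
preds |> metrics(truth = truth, estimate = estimate) # rmse, rsq and mae at once
```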


the tidymodels core

the dials package provides tools to define grids of hyperparameter values
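For example (the choice of hyperparameters below is illustrative):

```r
library(dials)

# a regular grid over two hyperparameters: 5 x 5 = 25 combinations
grid_regular(penalty(), mixture(), levels = 5)

# a random grid is often preferable in higher dimensions
grid_random(penalty(), mixture(), size = 10)
```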


the tidymodels core

the tune package greatly simplifies the implementation of hyperparameter optimization


the tidymodels core

the broom package provides utility functions to tidy model output
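Its three main verbs, on a base-R model for illustration:

```r
library(broom)

fit = lm(mpg ~ wt + hp, data = mtcars)

tidy(fit)     # coefficients as a tibble, one row per term
glance(fit)   # one-row, model-level summary (R-squared, AIC, ...)
augment(fit)  # original data plus fitted values and residuals
```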


a simple flow

pre-processing: training and test

The idea is to:

  • fit the model on a set of observations (training)

  • assess the model performance on a different set of observations (testing)

Refer to the Advertising data and split the observations into training/testing

library(tidymodels)
library(readr)

adv_data  = read_csv(file = "./data/Advertising.csv") |> select(-1)
adv_split = initial_split(adv_data, prop = 3/4, strata = sales)
adv_train = training(adv_split)
adv_test  = testing(adv_split)

training set

adv_train |> slice_sample(n = 5) |> kbl()

   TV  radio  newspaper  sales
280.2   10.1       21.4   14.8
240.1   16.7       22.9   15.9
237.4    5.1       23.5   12.5
131.1   42.8       28.9   18.0
 44.5   39.3       45.1   10.4

test set

adv_test |> slice_sample(n = 5) |> kbl()

   TV  radio  newspaper  sales
120.2   19.6       11.6   13.2
228.3   16.9       26.2   15.5
102.7   29.6        8.4   14.0
151.5   41.3       58.5   18.5
217.7   33.5       59.0   19.4

pre-processing: specify the recipe (just the formula in this case)

adv_rec = recipe(formula = sales ~ ., data = adv_train)

 

model specification

adv_model = linear_reg(mode = "regression", engine = "lm")

 

pair the recipe and the model specification in a workflow and fit the model on the training set

adv_wf = workflow() |> add_recipe(adv_rec) |> add_model(adv_model)
adv_fit = adv_wf |> fit(data = adv_train)

 

use the fitted model to predict the test observations

adv_pred = adv_fit |> augment(new_data = adv_test)

 

compute the performance metric RMSE (root mean squared error)

adv_pred |> rmse(truth = sales, estimate = .pred) |> kbl()

.metric  .estimator  .estimate
rmse     standard     2.036518