Statistical learning: the flow

Alfonso Iodice D’Enza

supervised learning flow

pre-processing

split your observations into training, validation (or cross-validate), and test sets

transform the predictors appropriately (feature engineering)

model-spec

specify the model to fit

tuning

select a reasonable grid of hyperparameter values to choose from

for each combination of hyperparameters

  1. fit the model on training observations
  2. compute appropriate metrics on the validation observations

pick the best hyperparameter combination

final evaluation and fit

compute the metric for the tuned model on the test set (observations not used so far)

obtain the final fit for the model on all the available observations
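The whole flow above can be sketched with tidymodels. The dataset (mtcars), the penalized regression model, and the tuned hyperparameter below are illustrative assumptions, not part of the slides, and the glmnet engine is assumed to be installed:

```r
library(tidymodels)

set.seed(123)
# split into training/testing, plus cross-validation folds on the training set
split = initial_split(mtcars, prop = 3/4)
folds = vfold_cv(training(split), v = 5)

# a model with one tunable hyperparameter (penalty) -- illustrative choice
mod = linear_reg(penalty = tune(), engine = "glmnet")
wf  = workflow() |> add_formula(mpg ~ .) |> add_model(mod)

# fit each grid value on the folds, then pick the best combination
res  = wf |> tune_grid(resamples = folds,
                       grid = grid_regular(penalty(), levels = 10))
best = select_best(res, metric = "rmse")

# final evaluation on the test set, then the final fit
final = wf |> finalize_workflow(best) |> last_fit(split)
collect_metrics(final)
```

Note how `last_fit()` covers the last two steps of the flow at once: it fits the finalized workflow on the training set and evaluates it on the test set.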

the tidymodels metapackage

Each core package in the tidymodels metapackage corresponds to one step of the supervised learning flow


For all things tidymodels check tidymodels.org!

the tidymodels core


the tidymodels core

the rsample package provides tools for data splitting and resampling
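As a quick illustration (the dataset is just a placeholder), rsample can create both a single split and a set of cross-validation folds:

```r
library(rsample)

set.seed(42)
# single training/testing split, stratified on the outcome
split = initial_split(mtcars, prop = 3/4, strata = mpg)
train = training(split)
test  = testing(split)

# 5-fold cross-validation on the training set
folds = vfold_cv(train, v = 5)
```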


the tidymodels core

the recipes package provides tools for data pre-processing and feature engineering
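A sketch of a recipe with a couple of common steps; the formula and dataset are illustrative:

```r
library(recipes)

# a recipe records pre-processing steps without executing them right away
rec = recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors()) |>  # center and scale predictors
  step_dummy(all_nominal_predictors())         # dummy-code factor predictors

# prep() estimates the steps on the data, bake() applies them
prep(rec) |> bake(new_data = NULL)
```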


the tidymodels core

the parsnip package provides a unified interface to the many models available in R
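For instance, the same specification function can dispatch to different underlying engines (the glmnet engine below is an illustrative choice and assumes the package is installed):

```r
library(parsnip)

# the same unified interface wraps different underlying engines
lm_spec    = linear_reg() |> set_engine("lm")
ridge_spec = linear_reg(penalty = 0.1, mixture = 0) |> set_engine("glmnet")

# the fitting syntax is identical regardless of the engine
lm_fit = lm_spec |> fit(mpg ~ wt + hp, data = mtcars)
```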


the tidymodels core

the workflows package combines pre-processing, modeling and post-processing steps into a single object


the tidymodels core

the yardstick package provides several performance metrics
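A minimal example on toy predictions (the numbers are illustrative):

```r
library(yardstick)

# a data frame with a column of true values and a column of predictions
preds = data.frame(truth    = c(3.1, 4.8, 2.2, 5.0),
                   estimate = c(2.9, 5.1, 2.0, 4.6))

preds |> rmse(truth = truth, estimate = estimate)    # root mean squared error
preds |> metrics(truth = truth, estimate = estimate) # rmse, rsq and mae at once
```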


the tidymodels core

the dials package provides tools to define grids of hyperparameter values
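For example (the choice of hyperparameters below is illustrative):

```r
library(dials)

# a regular grid over two hyperparameters: 5 x 5 = 25 combinations
grid_regular(penalty(), mixture(), levels = 5)

# a random grid is often preferable in higher dimensions
grid_random(penalty(), mixture(), size = 10)
```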


the tidymodels core

the tune package greatly simplifies the implementation of hyperparameter optimization


the tidymodels core

the broom package provides utility functions to tidy model output
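Its three main verbs, on a base-R model for illustration:

```r
library(broom)

fit = lm(mpg ~ wt + hp, data = mtcars)

tidy(fit)     # coefficients as a tibble, one row per term
glance(fit)   # one-row, model-level summary (R-squared, AIC, ...)
augment(fit)  # original data plus fitted values and residuals
```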


a simple flow

pre-processing: training and test

The idea is to:

  • fit the model on a set of observations (training)

  • assess the model performance on a different set of observations (testing)

Refer to the Advertising data and split the observations into training/testing

library(tidymodels)
library(readr)

adv_data  = read_csv(file = "./data/Advertising.csv") |> select(-1)
adv_split = initial_split(adv_data, prop = 3/4, strata = sales)
adv_train = training(adv_split)
adv_test  = testing(adv_split)

training set

adv_train |> slice_sample(n = 5) |> kbl()

   TV  radio  newspaper  sales
280.2   10.1       21.4   14.8
240.1   16.7       22.9   15.9
237.4    5.1       23.5   12.5
131.1   42.8       28.9   18.0
 44.5   39.3       45.1   10.4

test set

adv_test |> slice_sample(n = 5) |> kbl()

   TV  radio  newspaper  sales
120.2   19.6       11.6   13.2
228.3   16.9       26.2   15.5
102.7   29.6        8.4   14.0
151.5   41.3       58.5   18.5
217.7   33.5       59.0   19.4

pre-processing: specify the recipe (just the formula in this case)

adv_rec = recipe(formula = sales ~ ., data = adv_train)

 

model specification

adv_model = linear_reg(mode = "regression", engine = "lm")

 

pair the recipe and the model specification in a workflow and fit the model on the training set

adv_wf = workflow() |> add_recipe(adv_rec) |> add_model(adv_model)
adv_fit = adv_wf |> fit(data = adv_train)

 

use the fitted model to predict the test observations

adv_pred = adv_fit |> augment(new_data = adv_test)

 

compute the performance metric RMSE (root mean squared error)

adv_pred |> rmse(truth = sales, estimate = .pred) |> kbl()

.metric  .estimator  .estimate
rmse     standard     2.036518