The idea is to use synthetic data to investigate how model flexibility affects bias and variance.
library(tidyverse)
library(tidymodels)
library(gganimate)
library(transformr)
library(kableExtra)
We generate a sample of size \(\texttt{sample\_size}\), with the true \(f(X)=2X + 6X^{2} + 2X^{3}\) and noise \(\epsilon\sim N(0,\sigma^{2})\), where \(\sigma=\texttt{true\_sigma}\).
true_f = function(x){return(2*x + 6*x^2 + 2*x^3)}  # true regression function
true_sigma = 5    # sd of the irreducible noise
xmin = -3
xmax = 2
sample_size = 200
synth_data = tibble(x = runif(sample_size, min = xmin, max = xmax),
                    fx = true_f(x),
                    noise = rnorm(sample_size, mean = 0, sd = true_sigma),
                    y = fx + noise)
synth_plot = synth_data %>%
ggplot(aes(x=x,y=y))+ theme_minimal() + geom_point(alpha=.1)
synth_plot
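As a sanity check, the true regression curve can be overlaid on the scatter. A minimal sketch, assuming the objects defined above are in scope; `geom_function()` evaluates `true_f` over the x-range already shown in `synth_plot`.

```r
# Overlay the noiseless truth f(x) on the noisy observations
synth_plot +
  geom_function(fun = true_f, colour = "indianred")
```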
Even for a given data set, flexible methods may lead to very different \(\hat{f}(X)\), depending on which observations end up in the training set.
n_samples = 99
train_prop = .25
synth_samples = tibble(
  data = map(seq_len(n_samples),
             ~ synth_data %>% slice_sample(prop = train_prop)),
  sample_id = sprintf("sample: %02d", seq_len(n_samples))
)
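Each row of `synth_samples` holds one training sample; with `train_prop = .25` of 200 observations, that is 50 rows per sample. A quick check:

```r
# 99 training samples, each with .25 * 200 = 50 observations
length(synth_samples$data)     # 99
nrow(synth_samples$data[[1]])  # 50
```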
To visualize this, consider 99 random splits, with a small proportion of training observations.
sample_movie = synth_plot +
geom_point(data=synth_samples %>% unnest(data),aes(x,y),colour="dodgerblue",alpha=.5) +
labs(title = '{closest_state}')+
transition_states(sample_id,state_length = 1)
anim_save(filename = "pop_sample_movie.gif",animation = sample_movie,path = "figures/",nframes = 200)
Fitting a linear regression and a polynomial regression of degree 8 on a small training set results in highly variable fitted curves. To visualize this…
linear_reg_movie = synth_plot +
geom_point(data=synth_samples %>% unnest(data),aes(x,y),colour="dodgerblue",alpha=.5) +
stat_smooth(data=synth_samples %>% unnest(data), aes(x,y), method = "lm", se = FALSE, color="indianred", geom='line', linewidth=2) +
labs(title = 'linear reg. {closest_state}',subtitle = paste0("proportion of training obs: ", train_prop))+
transition_states(sample_id, wrap = TRUE,state_length = 1) +
shadow_trail(alpha = 0.35, exclude_layer = 2,color="forestgreen",size=.5)
anim_save(filename = "linear_regression_movie.gif",animation = linear_reg_movie,path = "figures/",nframes = 200)
poly8_movie = synth_plot +
geom_point(data=synth_samples %>% unnest(data),aes(x,y),colour="dodgerblue",alpha=.5) +
stat_smooth(data=synth_samples %>% unnest(data), aes(x,y), method="lm", formula = y~poly(x,8), se = FALSE, color="indianred", geom='line', linewidth=2) +
labs(title = 'polynomial reg. degree 8 {closest_state}',subtitle = paste0("proportion of training obs: ", train_prop))+
transition_states(sample_id, wrap = TRUE,state_length = 1) +
shadow_trail(alpha = 0.35, exclude_layer = 2,color="forestgreen",size=.5)
anim_save(filename = "poly8_movie.gif",animation = poly8_movie,path = "figures/",nframes = 200)
Increasing the training set size reduces the variability of the fitted curves, and the effect is most pronounced for the flexible degree-8 polynomial.
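The variability visible in the animations can also be quantified: refit both models on each stored training sample, predict on a fixed grid of x values, and take the across-sample standard deviation of the predictions at each grid point. This is a rough sketch, assuming `synth_samples`, `xmin`, and `xmax` from above; the helper `pred_sd` is ad hoc, not part of any package.

```r
# Pointwise sd of the fitted curves across the 99 training samples:
# a Monte-Carlo estimate of each model's variance component.
x_grid = tibble(x = seq(xmin, xmax, length.out = 50))

pred_sd = function(form) {
  preds = sapply(synth_samples$data,
                 function(d) predict(lm(form, data = d), newdata = x_grid))
  apply(preds, 1, sd)  # one sd per grid point, taken across samples
}

mean(pred_sd(y ~ x))          # linear fit: relatively stable
mean(pred_sd(y ~ poly(x, 8))) # degree-8 fit: typically much larger
```

Averaging the pointwise sd over the grid gives a single variability number per model; rerunning with a larger `train_prop` should shrink both, with the degree-8 fit shrinking the most.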