Using LASSO to Predict Monthly Brew Materials
Mon, May 23, 2022
5-minute read
In this blog post, I will use a data set about beer brewing materials provided by TidyTuesday to make prediction for the monthly barrels of several beer materials. The tidymodels meta-package will be used, with bootstrap as the resampling technique.
library(tidyverse)
library(tidymodels)
library(lubridate)
theme_set(theme_bw())
brewing_materials_raw <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-31/brewing_materials.csv")
brewing_materials_raw
## # A tibble: 1,440 x 9
## data_type material_type year month type month_current month_prior_year
## <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 Pounds of Mat~ Grain Produc~ 2008 1 Malt~ 374165152 365300134
## 2 Pounds of Mat~ Grain Produc~ 2008 1 Corn~ 57563519 41647092
## 3 Pounds of Mat~ Grain Produc~ 2008 1 Rice~ 72402143 81050102
## 4 Pounds of Mat~ Grain Produc~ 2008 1 Barl~ 3800844 2362162
## 5 Pounds of Mat~ Grain Produc~ 2008 1 Whea~ 1177186 1195381
## 6 Pounds of Mat~ Total Grain ~ 2008 1 Tota~ 509108844 491554871
## 7 Pounds of Mat~ Non-Grain Pr~ 2008 1 Suga~ 78358212 83664091
## 8 Pounds of Mat~ Non-Grain Pr~ 2008 1 Hops~ 4506546 2037754
## 9 Pounds of Mat~ Non-Grain Pr~ 2008 1 Hops~ 621912 411166
## 10 Pounds of Mat~ Non-Grain Pr~ 2008 1 Other 1291615 766735
## # ... with 1,430 more rows, and 2 more variables: ytd_current <dbl>,
## # ytd_prior_year <dbl>
Clean and process brewing_materials_raw
:
brewing_materials <- brewing_materials_raw %>%
filter(type %in% c("Malt and malt products",
"Sugar and syrups",
"Rice and rice products",
"Barley and barley products",
"Wheat and wheat products")) %>%
select(year, month, type, month_prior_year, month_current)
brewing_materials
## # A tibble: 600 x 5
## year month type month_prior_year month_current
## <dbl> <dbl> <chr> <dbl> <dbl>
## 1 2008 1 Malt and malt products 365300134 374165152
## 2 2008 1 Rice and rice products 81050102 72402143
## 3 2008 1 Barley and barley products 2362162 3800844
## 4 2008 1 Wheat and wheat products 1195381 1177186
## 5 2008 1 Sugar and syrups 83664091 78358212
## 6 2008 2 Malt and malt products 350035966 355687578
## 7 2008 2 Rice and rice products 75004651 66061597
## 8 2008 2 Barley and barley products 2573584 3236714
## 9 2008 2 Wheat and wheat products 1074183 1240983
## 10 2008 2 Sugar and syrups 86331629 80188744
## # ... with 590 more rows
EDA:
brewing_materials %>%
mutate(year_month = ym(paste(year, month, sep = "-"))) %>%
ggplot(aes(year_month, month_current, color = type)) +
geom_line() +
geom_point() +
facet_wrap(~type, scales = "free_y", ncol = 1) +
theme(legend.position = "none",
panel.grid = element_blank()) +
labs(x = NULL,
y = "current number of monthly barrels",
title = "Monthly Barrels per Type")
What happened after 2016? Why are the lines close to 0? I assume there is some data entry issue.
For the sake of our predictive modeling, data after year 2016 will be removed.
brew_before_2016 <- brewing_materials %>%
filter(year < 2016)
brew_before_2016
## # A tibble: 480 x 5
## year month type month_prior_year month_current
## <dbl> <dbl> <chr> <dbl> <dbl>
## 1 2008 1 Malt and malt products 365300134 374165152
## 2 2008 1 Rice and rice products 81050102 72402143
## 3 2008 1 Barley and barley products 2362162 3800844
## 4 2008 1 Wheat and wheat products 1195381 1177186
## 5 2008 1 Sugar and syrups 83664091 78358212
## 6 2008 2 Malt and malt products 350035966 355687578
## 7 2008 2 Rice and rice products 75004651 66061597
## 8 2008 2 Barley and barley products 2573584 3236714
## 9 2008 2 Wheat and wheat products 1074183 1240983
## 10 2008 2 Sugar and syrups 86331629 80188744
## # ... with 470 more rows
brew_before_2016 %>%
ggplot(aes(month_prior_year, month_current, color = type)) +
geom_point() +
scale_x_log10() +
scale_y_log10() +
labs(x = "monthly barrels prior year",
y = "monthly barrels current year",
title = "Monthly Barrels between Current and Prior Year")
The relationship between current and prior monthly barrel is almost log-linear.
Below a LASSO model will be built to predict month_current
.
- Split the data:
set.seed(2022)
brew_spl <- brew_before_2016 %>%
initial_split(strata = "month")
brew_train <- training(brew_spl)
brew_test <- testing(brew_spl)
brew_boots <- bootstraps(brew_train, times = 1000, strata = "month")
brew_boots
## # Bootstrap sampling using stratification
## # A tibble: 1,000 x 2
## splits id
## <list> <chr>
## 1 <split [360/135]> Bootstrap0001
## 2 <split [360/132]> Bootstrap0002
## 3 <split [360/128]> Bootstrap0003
## 4 <split [360/128]> Bootstrap0004
## 5 <split [360/135]> Bootstrap0005
## 6 <split [360/144]> Bootstrap0006
## 7 <split [360/135]> Bootstrap0007
## 8 <split [360/139]> Bootstrap0008
## 9 <split [360/131]> Bootstrap0009
## 10 <split [360/132]> Bootstrap0010
## # ... with 990 more rows
- Create the recipe:
brew_rec <- recipe(month_current ~ ., data = brew_train) %>%
step_mutate(type = factor(type)) %>%
step_log(month_prior_year, month_current, base = 10) %>%
step_dummy(all_nominal_predictors())
- Set up the LASSO spec:
brew_spec <- linear_reg(penalty = tune(), mixture = 1) %>%
set_mode("regression") %>%
set_engine("glmnet")
- Set up the workflow:
brew_wf <- workflow() %>%
add_model(brew_spec) %>%
add_recipe(brew_rec)
- Tune the model:
brew_res <- brew_wf %>%
tune_grid(
brew_boots,
grid = crossing(
penalty = 10 ^ seq(-7, -0.5, 0.5)
),
control = control_grid(save_pred = T,
save_workflow = T)
)
autoplot(brew_res)
- Finalize the workflow:
brew_last_fit <- brew_wf %>%
finalize_workflow(select_best(brew_res, "rmse")) %>%
last_fit(brew_spl)
brew_last_fit %>%
collect_metrics()
## # A tibble: 2 x 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 0.142 Preprocessor1_Model1
## 2 rsq standard 0.969 Preprocessor1_Model1
brew_last_fit %>%
collect_predictions() %>%
ggplot(aes(month_current, .pred)) +
geom_point() +
geom_abline(size = 2, color = "grey50", lty = 2)
The LASSO does have a good prediction, but one thing to keep in mind is that the predictions are on log10 scale.