class: title-slide, center <span class="fa-stack fa-4x"> <i class="fa fa-circle fa-stack-2x" style="color: #ffffff;"></i> <strong class="fa-stack-1x" style="color:#E7553C;">5</strong> </span> # Applied Machine Learning ## Regression Models --- # Loading .code90[ ```r library(tidymodels) ``` ``` ## ── Attaching packages ─────────────────────────────────────────────────────────── tidymodels 0.0.4 ── ``` ``` ## ✓ broom 0.5.3 ✓ recipes 0.1.9 ## ✓ dials 0.0.4 ✓ rsample 0.0.5 ## ✓ dplyr 0.8.3 ✓ tibble 2.1.3 ## ✓ ggplot2 3.2.1 ✓ tune 0.0.1 ## ✓ infer 0.5.1 ✓ workflows 0.1.0 ## ✓ parsnip 0.0.5 ✓ yardstick 0.0.5 ## ✓ purrr 0.3.3 ``` ``` ## ── Conflicts ────────────────────────────────────────────────────────────── tidymodels_conflicts() ── ## x dplyr::combine() masks gridExtra::combine() ## x purrr::discard() masks scales::discard() ## x dplyr::filter() masks stats::filter() ## x recipes::fixed() masks stringr::fixed() ## x dplyr::group_rows() masks kableExtra::group_rows() ## x dplyr::lag() masks stats::lag() ## x ggplot2::margin() masks dials::margin() ## x recipes::step() masks stats::step() ## x recipes::yj_trans() masks scales::yj_trans() ``` ] --- # Outline * Example Data * Regularized Linear Models * Multivariate Adaptive Regression Splines * Parallel Processing * Bayesian Optimization --- # Example Data: Train Ridership These data are used in our [Feature Engineering and Selection](https://bookdown.org/max/FES/chicago-intro.html) book. Several years worth of data were assembled to try to predict the daily number of people entering the Clark and Lake elevated ("L") train station in Chicago. For predictors, * the 14-day lagged ridership at this and other stations (units: thousands of rides/day) * weather data * home/away game schedules for Chicago teams * the date The data are in `modeldata`. See `?Chicago`. --- # L Train Locations
--- # Hands-On: Explore the Data Take a look at these data for a few minutes and see if you can find any interesting characteristics in the predictors or the outcome. ```r data("Chicago") ```
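For instance, one might start by looking at the outcome over time and by day of the week (a sketch; any exploration is fine):

```r
# Ridership at Clark & Lake over time
ggplot(Chicago, aes(x = date, y = ridership)) + 
  geom_line(alpha = .3)

# Ridership by day of the week
library(lubridate)
Chicago %>% 
  mutate(day = wday(date, label = TRUE)) %>% 
  ggplot(aes(x = day, y = ridership)) + 
  geom_boxplot()
```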
--- # How Should Features Be Encoded/Engineered? Should the ridership data be transformed? How should we encode the date? --- # A Recipe <img src="images/recipes.png" class="title-hex"> .pull-left[ ```r library(stringr) # define a few holidays *us_hol <- * timeDate::listHolidays() %>% * str_subset("(^US)|(Easter)") chi_rec <- recipe(ridership ~ ., data = Chicago) ``` ] .pull-right[ Define a few holidays from the `timeDate` package to be used later. ] --- # A Recipe <img src="images/recipes.png" class="title-hex"> .pull-left[ ```r library(stringr) # define a few holidays us_hol <- timeDate::listHolidays() %>% str_subset("(^US)|(Easter)") chi_rec <- * recipe(ridership ~ ., data = Chicago) ``` ] .pull-right[ `ridership` at Clark and Lake is the outcome. All other columns are predictors. ] --- # A Recipe <img src="images/recipes.png" class="title-hex"> .pull-left[ ```r library(stringr) # define a few holidays us_hol <- timeDate::listHolidays() %>% str_subset("(^US)|(Easter)") chi_rec <- recipe(ridership ~ ., data = Chicago) %>% * step_holiday(date, holidays = us_hol) ``` ] .pull-right[ Make indicator variables for the 20 US holidays identified in `us_hol`. ] --- # A Recipe <img src="images/recipes.png" class="title-hex"> .pull-left[ ```r library(stringr) # define a few holidays us_hol <- timeDate::listHolidays() %>% str_subset("(^US)|(Easter)") chi_rec <- recipe(ridership ~ ., data = Chicago) %>% step_holiday(date, holidays = us_hol) %>% * step_date(date) ``` ] .pull-right[ Make factor variables from the `date` column, such as `dow`, `month`, and `year`. These are not automatically converted to dummy variables. ] --- # A Recipe <img src="images/recipes.png" class="title-hex"> .pull-left[ ```r library(stringr) # define a few holidays us_hol <- timeDate::listHolidays() %>% str_subset("(^US)|(Easter)") chi_rec <- recipe(ridership ~ ., data = Chicago) %>% step_holiday(date, holidays = us_hol) %>% step_date(date) %>% * step_rm(date) ``` ] .pull-right[ We've made all of our date-based predictors, so remove the `date` column from the data. ] --- # A Recipe <img src="images/recipes.png" class="title-hex"> .pull-left[ ```r library(stringr) # define a few holidays us_hol <- timeDate::listHolidays() %>% str_subset("(^US)|(Easter)") chi_rec <- recipe(ridership ~ ., data = Chicago) %>% step_holiday(date, holidays = us_hol) %>% step_date(date) %>% step_rm(date) %>% * step_dummy(all_nominal()) ``` ] .pull-right[ Make dummy variables out of all of the factor or character columns in the data. ] --- # A Recipe <img src="images/recipes.png" class="title-hex"> .pull-left[ ```r library(stringr) # define a few holidays us_hol <- timeDate::listHolidays() %>% str_subset("(^US)|(Easter)") chi_rec <- recipe(ridership ~ ., data = Chicago) %>% step_holiday(date, holidays = us_hol) %>% step_date(date) %>% step_rm(date) %>% step_dummy(all_nominal()) %>% * step_zv(all_predictors()) ``` ] .pull-right[ In case there are columns with only a single unique value (perhaps due to resampling), remove them. 
]

---

# A Recipe <img src="images/recipes.png" class="title-hex">

.pull-left[

```r
library(stringr)

# define a few holidays 
us_hol <- 
  timeDate::listHolidays() %>% 
  str_subset("(^US)|(Easter)")

chi_rec <- 
  recipe(ridership ~ ., data = Chicago) %>% 
  step_holiday(date, holidays = us_hol) %>% 
  step_date(date) %>% 
  step_rm(date) %>% 
  step_dummy(all_nominal()) %>% 
  step_zv(all_predictors()) 
* # step_normalize(one_of(!!stations)) 
* # step_pca(one_of(!!stations), num_comp = tune()) 
```

]

.pull-right[

The ridership between stations is highly correlated. If we use a model that would be harmed by this, we _could_ extract the principal components for these columns. 

]

---

# Resampling <img src="images/rsample.png" class="title-hex">

If your job were to model these data, you would probably take historical data as your training set and use the most recent data as the test set. 

Our resampling scheme will emulate this using [rolling forecasting origin](https://otexts.com/fpp2/accuracy.html) resampling with

 * Analysis sets of 15 years of data that move forward in 28-day blocks
 * An assessment set of the 28 days of data that follow each analysis set

```r
chi_folds <- rolling_origin(
  Chicago, 
  initial = 364 * 15, 
  assess = 7 * 4, 
  skip = 7 * 4, 
  cumulative = FALSE
)
chi_folds %>% nrow()
```

```
## [1] 8
```

---

# Resampling Graphic

<img src="images/part-5-resample-plot-full-1.png" width="75%" style="display: block; margin: auto;" />

---

# Enhance!

<img src="images/part-5-resample-plot-1.svg" width="75%" style="display: block; margin: auto;" />

---
layout: false
class: inverse, middle, center

# Linear Models

---

# Linear Regression Analysis

We'll start by fitting linear regression models to these data. 

As a reminder, the "linear" part means that the model is linear in the _parameters_; we can add nonlinear terms to the model (e.g. `x^2` or `log(x)`) without causing issues. 

The simplest place to start might be with `lm()` and the formula method. 

```r
lm(ridership ~ . - date, data = Chicago)
```

We know that there are a lot of features that we'd miss out on though (e.g. holidays, day-of-the-week, etc.). 

---

# Potential Issues with Linear Regression

We'll use the L train data to examine a few different models, illustrating more complex models and approaches to optimizing them. We'll start with linear models. 

However, there are some potential issues with linear methods: 

 * They do not automatically do _feature selection_, and including irrelevant predictors may degrade performance. 
 * Linear models are sensitive to situations where the predictors are _highly correlated_ (aka collinearity). This isn't too big of an issue for these data though. 

To mitigate these two scenarios, _regularization_ will be used. This approach adds a penalty to the regression parameters. 

 * In order to have a large slope in the model, the predictor will need to have a large impact on the model. 

There are different types of regularization methods. 

---

# Effect of Collinearity

As an example of collinearity, our data set has two predictors that have a correlation above 0.95: `Irving_Park` and `Belmont`. What happens when we fit models with both predictors versus one-at-a-time? 
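As a rough sketch of how this comparison can be made (not necessarily the exact code behind the table that follows; the `car` package is assumed for the variance inflation factors):

```r
# Fit the station predictors one at a time and then together
lm_belmont <- lm(ridership ~ Belmont, data = Chicago)
lm_irving  <- lm(ridership ~ Irving_Park, data = Chicago)
lm_both    <- lm(ridership ~ Belmont + Irving_Park, data = Chicago)

coef(lm_belmont)
coef(lm_irving)
coef(lm_both)

# Variance inflation factors for the two-predictor model
car::vif(lm_both)
```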
<table class="table" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="border-bottom:hidden" colspan="1"></th>
   <th style="border-bottom:hidden; padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="3"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">Coefficients</div></th>
   <th style="border-bottom:hidden" colspan="1"></th>
  </tr>
  <tr>
   <th style="text-align:left;"> Term </th>
   <th style="text-align:right;"> Belmont Only </th>
   <th style="text-align:right;"> Irving Park Only </th>
   <th style="text-align:right;"> Both Predictors </th>
   <th style="text-align:right;"> Variance Inflation </th>
  </tr>
 </thead>
 <tbody>
  <tr>
   <td style="text-align:left;"> Irving Park </td>
   <td style="text-align:right;"> --- </td>
   <td style="text-align:right;"> 4.974 </td>
   <td style="text-align:right;"> 4.109 </td>
   <td style="text-align:right;"> 26.842 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Belmont </td>
   <td style="text-align:right;"> 4.433 </td>
   <td style="text-align:right;"> --- </td>
   <td style="text-align:right;"> 0.795 </td>
   <td style="text-align:right;"> 25.112 </td>
  </tr>
 </tbody>
</table>

The coefficients can change drastically depending on what else is in the model, their variances can become artificially large, and the signs may even flip. 

---

# Regularized Linear Regression

Now suppose we want to see if _regularizing_ the regression coefficients will result in better fits. 

The [`glmnet`](https://www.jstatsoft.org/article/view/v033i01) model can be used to build a linear model using L<sub>1</sub> or L<sub>2</sub> regularization (or a mixture of the two). 

 * The general formulation minimizes: `\(\sum_{i = 1}^{n} (y_i - \sum_{j = 1}^{p} x_{ij} \beta_j) ^ 2 + penalty\)`. 
 * An L<sub>1</sub> penalty (penalty is `\(\lambda_1\sum|\beta_j|\)`) can have the effect of setting coefficients to zero. 
 * L<sub>2</sub> regularization ( `\(\lambda_2\sum\beta_j^2\)` ) is basically ridge regression where the magnitudes of the coefficients are dampened to avoid overfitting. 

Both methods _shrink_ the model coefficients towards zero at different rates. Important parameters tend to be the furthest away from zero. 

---

# glmnet Regularized Linear Regression

For a `glmnet` model, we need to determine the total amount of regularization (called `lambda`) and the mixture of L<sub>1</sub> and L<sub>2</sub> (called `alpha`). The total penalty then looks like: 

`\(penalty = \lambda\left(\alpha\sum|\beta_j| + (1 - \alpha)\sum\beta_j^2\right)\)`

 * `alpha` = 1 is a _lasso model_ 
 * `alpha` = 0 is _ridge regression_ (aka weight decay). 

Predictors require centering/scaling before being used in a `glmnet`, lasso, or ridge regression model. 

Technical bits can be found in [Statistical Learning with Sparsity](https://web.stanford.edu/~hastie/StatLearnSparsity/). 

---

# Harmonization of Parameter Names

If you are new to these models, `lambda` and `alpha` are pretty arcane and don't tell you anything about what they do. Other packages use different names for these parameters (`reg_param`, `penalty`, `lambda1`, `lambda2`, etc.) so it isn't very friendly. 

The `parsnip` package tries to standardize on names that are less jargony and more self-documenting. We use `penalty` (instead of `lambda`) and `mixture` (instead of `alpha`). These names are used consistently across engines for a given model and across different models too. 

For this problem, we have two tuning parameters: 

 * `mixture` must be between 0 and 1. A small grid is used for this parameter. 
 * `penalty` is not as clear-cut. 
We consider values on the log<sub>10</sub> scale. Usually values less than 1 are sufficient but this is not always true. 

---

# Tuning the Model

Let's once again use grid search with a regular grid to find good values of `penalty` and `mixture`. It turns out that evaluating values of `penalty` is _cheaper_ than evaluating values of `mixture`. 

We'll tune over a grid of 20 penalty values and 6 mixture values that range from ridge regression to the lasso. 

```r
glmn_grid <- expand.grid(
  penalty = 10 ^ seq(-3, -1, length = 20), 
  mixture = (0:5) / 5
)
```

The reason that penalties are cheap is that this model simultaneously computes parameter estimates for _all possible penalty values_ (for a fixed mixture). This is the _sub-model trick_. 

Using the grid above, we evaluate 120 models but only fit six. 

---

# Tuning the Model <img src="images/tune.png" class="title-hex"><img src="images/rsample.png" class="title-hex"><img src="images/recipes.png" class="title-hex">

```r
# We need to normalize the predictors: 
glmn_rec <- chi_rec %>% step_normalize(all_predictors())

glmn_mod <- linear_reg(penalty = tune(), mixture = tune()) %>% set_engine("glmnet")

# Save the assessment set predictions 
ctrl <- control_grid(save_pred = TRUE)

glmn_tune <- tune_grid(
  glmn_rec, 
  model = glmn_mod, 
  resamples = chi_folds, 
  grid = glmn_grid, 
  control = ctrl
)
```

---

# While We Wait, Can I Interest You in Parallelism?

.pull-left[

There is no real barrier to running these in parallel. Can we benefit from splitting the fits up to run on multiple cores? 

These speed-ups can be very model- and data-dependent but this pattern generally holds. Note that there is little incremental benefit to using more workers than physical cores on the computer. Use `parallel::detectCores(logical = FALSE)`. 

(A lot more details can be found in [this blog post](http://appliedpredictivemodeling.com/blog/2018/1/17/parallel-processing)) 

]

.pull-right[

<img src="images/part-5-par-plot-1.svg" width="100%" style="display: block; margin: auto;" />

In these simulations, we estimated the speed-up from using the sub-model trick to be about _25-fold_. 

]

---

# Running in Parallel with {tune}

.pull-left[

To loop through the models and data sets, `tune` uses the [`foreach`](https://www.rdocumentation.org/packages/foreach) package, which can parallelize `for` loops. 

`foreach` has a number of _parallel backends_ which allow various technologies to be used in conjunction with the package. On CRAN, these are the "`do{X}`" packages, such as [`doAzureParallel`](https://github.com/Azure/doAzureParallel), [`doFuture`](https://www.rdocumentation.org/packages/doFuture), [`doMC`](https://www.rdocumentation.org/packages/doMC), [`doMPI`](https://www.rdocumentation.org/packages/doMPI), [`doParallel`](https://www.rdocumentation.org/packages/doParallel), [`doRedis`](https://www.rdocumentation.org/packages/doRedis), and [`doSNOW`](https://www.rdocumentation.org/packages/doSNOW). 

For example, `doMC` uses the `multicore` package, which forks processes to split computations (for unix and OS X). `doParallel` can be used for all operating systems. 

]

.pull-right[

To use parallel processing in `tune`, no changes are needed when calling `tune_*()`. The parallel technology must be _registered_ with `foreach` prior to calling `tune_*()`: 

```r
library(doParallel)
cl <- makeCluster(6)
registerDoParallel(cl)

# run `tune_grid()`...
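# (An aside: the cluster size above is illustrative. As noted on the previous
#  slide, there is little benefit to using more workers than physical cores,
#  so `parallel::detectCores(logical = FALSE)` is a reasonable way to pick it.)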
stopCluster(cl) ``` ] --- # Plotting the Resampling Profile <img src="images/tune.png" class="title-hex"><img src="images/ggplot2.png" class="title-hex"> .pull-left[ ```r rmse_vals <- collect_metrics(glmn_tune) %>% filter(.metric == "rmse") rmse_vals %>% mutate(mixture = format(mixture)) %>% ggplot(aes(x = penalty, y = mean, col = mixture)) + geom_line() + geom_point() + scale_x_log10() # There is `autoplot(glmn_tune)` but the grid # structure works better with the code above. ``` ] .pull-right[ <img src="images/part-5-lr-grid-plot-1.svg" width="95%" style="display: block; margin: auto;" /> ] --- # Capture the Best Values <img src="images/tune.png" class="title-hex"> .code90[ .pull-left[ A pure ridge regression solution (`mixture = 0`) does poorly and the model seems to like a small amount of regularization overall. The numerically best results were: ```r show_best(glmn_tune, metric = "rmse", maximize = FALSE) ``` ``` ## # A tibble: 5 x 7 ## penalty mixture .metric .estimator mean n std_err ## <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> ## 1 0.00162 0.8 rmse standard 1.15 8 0.0688 ## 2 0.00127 0.8 rmse standard 1.15 8 0.0680 ## 3 0.001 0.8 rmse standard 1.15 8 0.0682 ## 4 0.00207 0.8 rmse standard 1.15 8 0.0699 ## 5 0.00264 0.8 rmse standard 1.15 8 0.0712 ``` ] .pull-right[ ```r best_glmn <- select_best(glmn_tune, metric = "rmse", maximize = FALSE) best_glmn ``` ``` ## # A tibble: 1 x 2 ## penalty mixture ## <dbl> <dbl> ## 1 0.00162 0.8 ``` ] ] --- # Residual Analysis <img src="images/tune.png" class="title-hex"> Recall that the `save_pred = TRUE` option was used. That retains the held-out predictions for each resample and sub-model. Those are in a list column called `.predictions`. We can use `tidyr::unnest()` to get the results back or use this convenience function: ```r glmn_pred <- collect_predictions(glmn_tune) glmn_pred ``` ``` ## # A tibble: 26,880 x 6 ## id .pred .row penalty mixture ridership ## <chr> <dbl> <int> <dbl> <dbl> <dbl> ## 1 Slice1 18.8 5461 0.001 0 19.6 ## 2 Slice1 18.8 5461 0.00127 0 19.6 ## 3 Slice1 18.8 5461 0.00162 0 19.6 ## 4 Slice1 18.8 5461 0.00207 0 19.6 ## 5 Slice1 18.8 5461 0.00264 0 19.6 ## 6 Slice1 18.8 5461 0.00336 0 19.6 ## 7 Slice1 18.8 5461 0.00428 0 19.6 ## 8 Slice1 18.8 5461 0.00546 0 19.6 ## 9 Slice1 18.8 5461 0.00695 0 19.6 ## 10 Slice1 18.8 5461 0.00886 0 19.6 ## # … with 26,870 more rows ``` --- # Observed Versus Predicted Plot <img src="images/tune.png" class="title-hex"><img src="images/ggplot2.png" class="title-hex"><img src="images/dplyr.png" class="title-hex"> .pull-left[ ```r # Keep the best model glmn_pred <- glmn_pred %>% inner_join(best_glmn, by = c("penalty", "mixture")) ggplot(glmn_pred, aes(x = .pred, y = ridership)) + geom_abline(col = "green") + geom_point(alpha = .3) + coord_equal() ``` ] .pull-right[ <img src="images/part-5-lr-pred-plot-1.svg" width="75%" style="display: block; margin: auto;" /> ] --- # Which training set points had the worst results? 
<img src="images/ggplot2.png" class="title-hex"><img src="images/dplyr.png" class="title-hex"> .code90[ .pull-left[ ```r large_resid <- glmn_pred %>% mutate(resid = ridership - .pred) %>% arrange(desc(abs(resid))) %>% slice(1:4) library(lubridate) Chicago %>% slice(large_resid$.row) %>% select(date) %>% mutate(day = wday(date, label = TRUE)) %>% bind_cols(large_resid) ``` ``` ## # A tibble: 4 x 9 ## date day id .pred .row penalty mixture ridership resid ## <date> <ord> <chr> <dbl> <int> <dbl> <dbl> <dbl> <dbl> ## 1 2016-07-04 Mon Slice7 11.2 5643 0.00162 0.8 5.92 -5.26 ## 2 2016-03-12 Sat Slice3 7.59 5529 0.00162 0.8 12.4 4.80 ## 3 2016-06-26 Sun Slice7 7.63 5635 0.00162 0.8 5.07 -2.56 ## 4 2016-04-01 Fri Slice4 19.8 5549 0.00162 0.8 22.4 2.56 ``` ] ] .pull-right[ We have a July 4th holiday indicator yet still over-predicted. For this data set, I end up googling to see why my predictions fail. <img src="images/chicago-2016-03-12.png" width="100%" /> ] --- # Creating a Final Model <img src="images/tune.png" class="title-hex"><img src="images/recipes.png" class="title-hex"><img src="images/parsnip.png" class="title-hex"> Let's prep the recipe then fit the final glmnet model with the best parameters: .pull-left[ ```r glmn_rec_final <- prep(glmn_rec) glmn_mod_final <- finalize_model(glmn_mod, best_glmn) glmn_mod_final ``` ``` ## Linear Regression Model Specification (regression) ## ## Main Arguments: ## penalty = 0.00162377673918872 ## mixture = 0.8 ## ## Computational engine: glmnet ``` ```r glmn_fit <- glmn_mod_final %>% fit(ridership ~ ., data = juice(glmn_rec_final)) ``` ] .pull-right[ .code90[ ```r glmn_fit ``` ``` ## parsnip model object ## ## Fit time: 47ms ## ## Call: glmnet::glmnet(x = as.matrix(x), y = y, family = "gaussian", alpha = ~0.8) ## ## Df %Dev Lambda ## 1 0 0.0000 7.2490 ## 2 2 0.1117 6.6050 ## 3 5 0.2175 6.0180 ## 4 5 0.3095 5.4830 ## 5 8 0.3869 4.9960 ## 6 8 0.4523 4.5520 ## 7 9 0.5068 4.1480 ## 8 9 0.5524 3.7800 ## 9 9 0.5904 3.4440 ## 10 9 0.6221 3.1380 ## 11 10 0.6488 2.8590 ## 12 10 0.6711 2.6050 ## 13 10 0.6896 2.3740 ## 14 10 0.7051 2.1630 ## 15 10 0.7180 1.9710 ## 16 9 0.7287 1.7960 ## 17 9 0.7377 1.6360 ## 18 9 0.7452 1.4910 ## 19 8 0.7515 1.3580 ## 20 8 0.7568 1.2380 ## 21 9 0.7622 1.1280 ## 22 9 0.7668 1.0280 ## 23 10 0.7708 0.9362 ## 24 11 0.7763 0.8531 ## 25 14 0.7836 0.7773 ## 26 14 0.7923 0.7082 ## 27 16 0.8002 0.6453 ## 28 16 0.8084 0.5880 ## 29 16 0.8152 0.5357 ## 30 16 0.8209 0.4881 ## 31 16 0.8257 0.4448 ## 32 18 0.8299 0.4053 ## 33 19 0.8341 0.3693 ## 34 20 0.8379 0.3365 ## 35 22 0.8413 0.3066 ## 36 24 0.8451 0.2793 ## 37 25 0.8492 0.2545 ## 38 28 0.8616 0.2319 ## 39 29 0.8740 0.2113 ## 40 30 0.8843 0.1925 ## 41 30 0.8931 0.1754 ## 42 30 0.9008 0.1598 ## 43 31 0.9075 0.1456 ## 44 33 0.9132 0.1327 ## 45 36 0.9180 0.1209 ## 46 37 0.9223 0.1102 ## 47 36 0.9257 0.1004 ## 48 37 0.9288 0.0915 ## 49 37 0.9315 0.0833 ## 50 37 0.9337 0.0759 ## 51 36 0.9354 0.0692 ## 52 37 0.9369 0.0630 ## 53 37 0.9380 0.0574 ## 54 38 0.9391 0.0523 ## 55 37 0.9399 0.0477 ## 56 38 0.9406 0.0435 ## 57 39 0.9414 0.0396 ## 58 38 0.9420 0.0361 ## 59 42 0.9430 0.0329 ## 60 45 0.9438 0.0300 ## 61 46 0.9445 0.0273 ## 62 48 0.9452 0.0249 ## 63 48 0.9457 0.0227 ## 64 54 0.9461 0.0206 ## 65 58 0.9465 0.0188 ## 66 59 0.9470 0.0171 ## 67 60 0.9475 0.0156 ## 68 60 0.9478 0.0142 ## 69 63 0.9481 0.0130 ## 70 64 0.9484 0.0118 ## 71 67 0.9486 0.0108 ## 72 68 0.9488 0.0098 ## 73 68 0.9489 0.0089 ## 74 68 0.9491 0.0081 ## 75 69 0.9492 0.0074 ## 76 71 0.9493 0.0068 ## 77 72 0.9494 0.0062 ## 78 72 
0.9495 0.0056 ## 79 73 0.9496 0.0051 ## 80 73 0.9497 0.0047 ## 81 75 0.9497 0.0042 ## 82 74 0.9498 0.0039 ## 83 75 0.9498 0.0035 ## 84 76 0.9498 0.0032 ## 85 76 0.9499 0.0029 ## 86 77 0.9499 0.0027 ## 87 78 0.9499 0.0024 ## 88 77 0.9499 0.0022 ## 89 78 0.9500 0.0020 ## 90 78 0.9500 0.0018 ## 91 80 0.9500 0.0017 ## 92 82 0.9500 0.0015 ## 93 82 0.9500 0.0014 ## 94 82 0.9500 0.0013 ## 95 83 0.9500 0.0012 ``` ] ] --- # Using the `glmnet` Object .pull-left[ The `parsnip` object saves the optimized model that was fit to the entire training set in the slot `$fit`. This can be used as it normally would. The plot on the right is created using: ```r library(glmnet) plot(glmn_fit$fit, xvar = "lambda") ``` However, **please don't use `predict(object$fit)` **! Use the `predict()` method on the object that is produced by `fit`. ] .pull-right[ <img src="images/part-5-reg-path-1.svg" width="99%" style="display: block; margin: auto;" /> ] Coefs: ``` ## (Intercept) date_dow_Thu date_dow_Tue date_dow_Wed ## 13.619334 4.788952 4.762551 4.761234 ``` --- # A glmnet Coefficient Plot <img src="images/ggplot2.png" class="title-hex"><img src="images/dplyr.png" class="title-hex"><img src="images/broom.png" class="title-hex"> .code90[ ```r # Get the set of coefficients across penalty values tidy_coefs <- broom::tidy(glmn_fit) %>% dplyr::filter(term != "(Intercept)") %>% dplyr::select(-step, -dev.ratio) # Get the lambda closest to tune's optimal choice delta <- abs(tidy_coefs$lambda - best_glmn$penalty) lambda_opt <- tidy_coefs$lambda[which.min(delta)] # Keep the large values label_coefs <- tidy_coefs %>% mutate(abs_estimate = abs(estimate)) %>% dplyr::filter(abs_estimate >= 1.1) %>% distinct(term) %>% inner_join(tidy_coefs, by = "term") %>% dplyr::filter(lambda == lambda_opt) # plot the paths and highlight the large values tidy_coefs %>% ggplot(aes(x = lambda, y = estimate, group = term, col = term, label = term)) + geom_vline(xintercept = lambda_opt, lty = 3) + geom_line(alpha = .4) + theme(legend.position = "none") + scale_x_log10() + ggrepel::geom_text_repel(data = label_coefs, aes(x = .005)) ``` ] --- # A glmnet Coefficient Plot <img src="images/ggrepl.svg" class="title-hex"><img src="images/ggplot2.png" class="title-hex"> <img src="images/part-5-reg-glmnet-path-1.svg" width="90%" style="display: block; margin: auto;" /> --- # glmnet Variable Importance <img src="images/vip.png" class="title-hex"><img src="images/ggplot2.png" class="title-hex"> .pull-left[ _Variable importance scores_ are aggregate metrics that try to measure how much each predictor affected the model results. These methods are specific to each model and not all models have ways to measure importance. For (generalized) linear models, the simplest approach is to look at the absolute value of the regression coefficients (recall that we normalized the predictors). The `caret` and `vip` packages have general interfaces to compute these measures and plot the results. ] .pull-right[ ```r library(vip) vip(glmn_fit, num_features = 20L, # Needs to know which coefficients to use lambda = best_glmn$penalty) ``` <img src="images/part-5-reg-vip-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- # What's Next? The model is pretty simple right now. If this level of performance is acceptable, we'd be done. If not, more thorough residual analysis would be used to determine if more complex features should be used in the analysis. There is the choice of making a simple model more complex or trying out a complex model. 
We'll stop here with linear regression and try something else. --- layout: false class: inverse, middle, center # Multivariate Adaptive Regression Splines --- # Multivariate Adaptive Regression Splines (MARS) MARS is a nonlinear machine learning model that develops sequential sets of artificial features that are used in linear models (similar to the previous spline discussion). The features are "hinge functions" or single knot splines that use the function: ```r h(x) <- function(x) ifelse(x > 0, x, 0) ``` The MARS model does a fast search through every predictor and every value of each predictor to find a suitable "split" point for the predictor that results in the best features. Suppose a value `x0` is found. The MARS model creates two model terms `h(x - x0)` and `h(x0 - x)` that are added to the intercept column. This creates a type of _segmented regression_. These terms are the same as deep learning rectified linear units ([ReLU](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)). Let's look at some example data... --- # Simulated Data: `y = 2 * exp(-6 * (x - 1.3)^2) + e` <img src="images/part-5-mars-sim-1.svg" width="50%" style="display: block; margin: auto;" /> --- # MARS Feature Creation -- Iteration #1 .pull-left[ After searching through these data, the model evaluates all possible values of `x0` to find the best "cut" of the data. It finally chooses a value of 0.5958249. To do this, it creates these two new predictors that isolate different regions of `x`. If we stop there, these two terms would be added into a linear regression model, yielding: ``` ## y = ## 0.111 ## - 0.262 * h(0.595825 - x) ## + 2.41 * h(x - 0.595825) ``` ] .pull-right[ <img src="images/part-5-mars-cuts-1.svg" width="90%" style="display: block; margin: auto;" /> ] --- # Fitted Model with Two Features <img src="images/part-5-mars-sim-fit-1.svg" width="50%" style="display: block; margin: auto;" /> --- # Growing and Pruning Similar to tree-based models, MARS starts off with a "growing" stage where it keeps adding new features until it reaches a pre-defined limit. After the first pair is created, the next cut-point is found using another exhaustive search to see which split of a predictor is best _conditional on the existing features_. Once all the features are created, a _pruning phase_ starts where model selection tools are used to eliminate terms that do not contribute meaningfully to the model. Generalized cross-validation ([GCV](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C7&q=%22generalized+cross+validation%22&btnG=)) is used to efficiently remove model terms while still providing some protection from overfitting. --- # Model Size There are two approaches: 1. Use the internal GCV to prune the model to the best subset size. This is fast but you don't learn much and it may under-select terms. 2. Use the external resampling (10-fold CV here) to tune the model as you would any other. I usually don't start with GCV. Instead use method #2 above to understand the trends. --- # The Final Model .pull-left[ For the simulated data, the mars model only requires 4 features to model the data (via GCV). ``` ## y = ## 0.0599 ## - 0.126 * h(0.595825 - x) ## + 1.61 * h(x - 0.595825) ## + 2.08 * h(x - 0.777082) ## - 5.27 * h(x - 1.24441) ``` The parameters are estimated by added the MARS features into ordinary linear regression models using least squares. ] .pull-right[ <img src="images/part-5-mars-sim-final-1.svg" width="100%" style="display: block; margin: auto;" /> ] ??? 
Talk a bit about the process of creating/choosing these particular features. Mention bagging here maybe (although it is on a future slide) Note the coefs have changed due to refitting: ``` ## h(x-0.595825) ## 2.408466 ``` ``` ## h(x-0.595825) ## 1.609755 ``` --- # Additive MARS is segmented regression <img src="images/part-5-mars-seg-1.svg" width="50%" style="display: block; margin: auto;" /> --- # Aspects of MARS Models * The model also tests to see if a simple linear term is best (i.e. not split). This is also how dummy variables are evaluated. * The model automatically conducts _feature selection_; if a predictor is never used in a split, it is functionally independent of the model. This is really good! * If an additive model is used (as in the previous example), the functional form of each predictor can be determined (and visualized) independently for each predictor. * A _second degree_ MARS model also evaluates interactions of two hinge features (e.g. `h(x0 - x) * h(z - z0)`). This can be useful in isolating regions of bivariate predictor space since it divides two-dimensional space into four quadrants. (see next slide) --- # Second Degree MARS Term Example <img src="images/part-5-mars-2d-1.svg" width="50%" style="display: block; margin: auto;" /> --- # MARS in R The [`mda`](https://cran.r-project.org/package=mda) package has a `mars` function but the [`earth`](https://cran.r-project.org/package=earth) package is far superior. The [`earth()` function](https://www.rdocumentation.org/packages/earth/versions/4.5.1/topics/earth) has both formula and non-formula interfaces. It can also be used with generalized linear models and flexible discriminant analysis. To use the nominal growing and GCV pruning process, the syntax is ```r earth(y ~ ., data) # or earth(x = x, y = y) ``` The feature creation process can be controlled using the `nk`, `nprune`, and `pmethod` parameters although this can be [somewhat complex](http://www.milbo.org/doc/earth-notes.pdf). There is a variable importance method that tracks the changes in the GCV results as features are added to the model. --- # MARS via {parsnip} and {tune} As with other models, a specification is made for the model. For our data: ```r # Let MARS decide the number of terms but tune the term dimensions mars_mod <- mars(prod_degree = tune()) # We'll decide via search: mars_mod <- mars(num_terms = tune("mars terms"), prod_degree = tune(), prune_method = "none") %>% set_engine("earth") %>% set_mode("regression") ``` One issue with MARS is that it is based on linear regression. We know that linear regression doesn't do so well when the predictors are highly correlated. We'll add to our recipe to de-correlate the data using principal component analysis: ```r mars_rec <- chi_rec %>% step_normalize(one_of(!!stations)) %>% step_pca(one_of(!!stations), num_comp = tune("pca comps")) ``` We could use grid search but let's try something else... --- layout: false class: inverse, middle, center # Segue --- Iterative Search Methods --- # Grid Versus Iterative Search .pull-left[ Grid Search: * The candidate values need to be pre-defined and don't learn from previous results. * You don't know the best values until all the computations are finished. * With many parameters, it is difficult to efficiently cover the parameter space. * Easily optimized via parallel processing and other tricks ] .pull-right[ Iterative Search: * _Usually_ builds a probability model to predict better parameters to test based on previous results. 
* More flexibility in how the parameter space is searched. 
* Fewer opportunities for efficiency optimizations. 

]

---

# Iterative Search

_Any_ search procedure could be used. [This repo](https://github.com/topepo/Optimization-Methods-for-Tuning-Predictive-Models) shows examples using genetic algorithms, Nelder-Mead simplex search, and other approaches. 

The most popular method is _Bayesian optimization_. We'll focus on this today. 

---

# Bayesian Optimization

This method takes an initial set of results and uses them to build a model that predicts new tuning parameters to evaluate. 

What makes it Bayesian? 

* A Gaussian Process model, with some distributional assumptions, is usually the "meta-model". This enables both mean and variance predictions to be used (more later). 

---

# How Are the Tuning Data Used

The Gaussian Process uses the previous parameters as _predictors_ and our performance measure as the _outcome_: 

<table class="table table-striped table-hover table-bordered" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="border-bottom:hidden; padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="3"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">Tuning parameters are now predictors</div></th>
   <th style="border-bottom:hidden; padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="1"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">Now the outcome</div></th>
  </tr>
  <tr>
   <th style="text-align:right;"> mars terms </th>
   <th style="text-align:right;"> prod_degree </th>
   <th style="text-align:right;"> pca comps </th>
   <th style="text-align:right;"> mean </th>
  </tr>
 </thead>
 <tbody>
  <tr>
   <td style="text-align:right;"> 71 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 11 </td>
   <td style="text-align:right;"> 1.203 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 70 </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 20 </td>
   <td style="text-align:right;"> 1.825 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 58 </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 7 </td>
   <td style="text-align:right;"> 1.825 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 15 </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 16 </td>
   <td style="text-align:right;"> 2.623 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 88 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 1.134 </td>
  </tr>
 </tbody>
</table>

---

# Gaussian Process Model

This approach is often used in the analysis of spatial data and, as parameterized here, has some connections to kernel methods (such as support vector machines). 

* I recommend Rasmussen and Williams (2006) [(pdf)](http://www.gaussianprocess.org/gpml/chapters/RW.pdf) as a good place to start. 

The model assumes that the model residuals follow a Gaussian distribution and that the model _parameters_ have a joint Gaussian (prior) distribution. Based on these assumptions, the predictive (posterior) distribution is also Gaussian. 

The trick that makes this model so useful here is how the covariance function of the _model inputs_ is defined. 

* A _kernel_ function is often used that is large when the predictor values are close together and decreases as they diverge. 
* The radial basis function is a good example: `\(K(x_a, x_b) = exp(-0.5 (x_a - x_b)^2)\)` --- # Gaussian Process Model Predictions Using a nonlinear kernel function enables the GP to create nonlinear regression functions. Since this is a Bayesian model, both the mean _and variance_ of model performance can be predicted. The predicted variance is usually dominated by the spatial relationships between predictors. * Predictions made far away from the training data used for the GP tend to have very large variances. We'll see examples of this in a minute. To start, suppose we have a single numeric tuning parameter and we are trying to maximize R<sup>2</sup> Three parameter values were sampled in the middle of the parameter's range. --- # Initial Grid Results <img src="images/part-5-gp-init-res-1.svg" width="70%" style="display: block; margin: auto;" /> --- # Initial GP Predictions <img src="images/part-5-gp-init-pred-1.svg" width="70%" style="display: block; margin: auto;" /> --- # Selecting New Tuning Parameter Candidates Using this model, the Bayesian optimization process would search the grid to find the "best" new parameters to evaluate using resampling. * A nonlinear optimization routine or grid search can be used here. Once the resampling results are obtained, another GP is fit and the process repeats. How do we select the best parameter? Bayesian optimization introduced the idea of _acquisition functions_. These functions are used to make trade-offs between exploitation and exploration: * exploration: search new areas of the parameter space that seem promising * exploitation: search in the vicinity of the existing best results. This is usually a trade-off between mean and variance. --- # Acquisition Function Based on Credible Intervals .pull-left[ Since we want to maximize R<sup>2</sup>, this function would seek to maximize the _lower credible bound_. The size of the bound controls the trade-off: `$$bound = \mu(\theta) - C \times \sigma(\theta)$$` where `\(\theta\)` is the vector of tuning parameters. This isn't very helpful since it often mirrors the mean value. ] .pull-right[ <img src="images/part-5-gp-ci-1.svg" width="90%" style="display: block; margin: auto;" /> ] --- # Acquisition Function for Expected Improvement One of the most popular methods is based on the expected improvement for new parameters, which is relative to the current best results. Let's denote the best (mean) performance value was `\(m_{opt}\)` and assume that we are maximizing performance. The [expected improvement]() is calculated using: $$ `\begin{align} EI(\theta; m_{opt}) &= \delta(\theta) \Phi\left(\frac{\delta(\theta)}{\sigma(\theta)}\right) + \sigma(\theta) \phi\left(\frac{\delta(\theta)}{\sigma(\theta)}\right) \notag \\ &\text{where} \notag \\ \delta(\theta) &= \mu(\theta) - m_{opt} \notag \end{align}` $$ The function `\(\Phi(\cdot)\)` is the cumulative standard normal and `\(\phi(\cdot)\)` is the standard normal density. --- # Acquisition Function for Expected Improvement <img src="images/part-5-gp-ei-1.svg" width="50%" style="display: block; margin: auto;" /> --- # Expected Improvement After 20 Iterations <img src="images/part-5-gp-ei-20-1.svg" width="50%" style="display: block; margin: auto;" /> --- # Exploration using Expected Improvement The previous equation has a fixed trade-off between the mean and variance. 
One method to make the search explore more is to define a trade-off value: 

$$ \delta(\theta) = \mu(\theta) - m_{opt} - \tau $$ 

where `\(\tau\)` is the amount of performance that we are willing to sacrifice (in the original units). Larger values of `\(\tau\)` will result in more novel candidate values. 

The value of `\(\tau\)` can change over time so that exploration is the focus initially and exploitation is the goal in later iterations. 

Another approach is to add _uncertainty samples_. This is an idea from the field of [Active Learning](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C7&q=%22active+learning%22+%22uncertainty+sampling%22&btnG=) where we add a new point that is most likely to help the model get better. In this context, we sample a candidate value with very large variance. 

---

# In Summary

Bayesian optimization is an iterative method for searching for reasonable tuning parameters. It requires: 

* The range/list of possible parameters (in the transformed scale). 
* An initial set of resampled parameter results. 
* The resampling scheme. 
* A performance metric to optimize. 
* An acquisition function. 
* The maximum number of iterations. 

We have a function called `tune_bayes()` for this. 

---

# Parameter Ranges <img src="images/recipes.png" class="title-hex"><img src="images/dials.png" class="title-hex">

`tune_bayes()` can access the default ranges defined by the `dials` package. For illustration, we'll change those ranges by adding the recipe and model to a workflow: 

```r
chi_wflow <- 
  workflow() %>% 
  add_recipe(mars_rec) %>% 
  add_model(mars_mod)

chi_set <- 
  parameters(chi_wflow) %>% 
  update(
    `pca comps`  = num_comp(c(0, 20)), # 0 comps => PCA is not used 
    `mars terms` = num_terms(c(2, 100))
  )
```

This is an _optional_ step. 

---

# Running the Optimization <img src="images/tune.png" class="title-hex">

```r
library(doMC)
registerDoMC(cores = 8)

ctrl <- control_bayes(verbose = TRUE, save_pred = TRUE)

# Some defaults: 
# - Uses expected improvement with no trade-off. See ?exp_improve(). 
# - RMSE is minimized set.seed(7891) mars_tune <- tune_bayes( chi_wflow, resamples = chi_folds, iter = 25, param_info = chi_set, metrics = metric_set(rmse), initial = 4, control = ctrl ) ``` --- # Example of Logging .code80[ ``` ❯ Generating a set of 4 initial parameter results ✓ Initialization complete Optimizing rmse using the expected improvement ── Iteration 1 ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────── i Current best: rmse=1.29 (@iter 0) i Gaussian process model ✓ Gaussian process model i Generating 2877 candidates i Predicted candidates i mars terms=8, prod_degree=2, pca comps=0 i Estimating performance ✓ Estimating performance ⓧ Newest results: rmse=2.379 (+/-0.285) ── Iteration 2 ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────── <snip> ── Iteration 4 ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────── i Current best: rmse=1.29 (@iter 0) i Gaussian process model ✓ Gaussian process model i Generating 2915 candidates i Predicted candidates i mars terms=100, prod_degree=1, pca comps=20 i Estimating performance ✓ Estimating performance ♥ Newest results: rmse=1.074 (+/-0.0701) ``` ] --- # Performance over iterations <img src="images/tune.png" class="title-hex"><img src="images/ggplot2.png" class="title-hex"> ```r autoplot(mars_tune, type = "performance") ``` <img src="images/part-5-bo-perf-1.svg" width="65%" style="display: block; margin: auto;" /> --- # Performance versus parameters <img src="images/tune.png" class="title-hex"><img src="images/ggplot2.png" class="title-hex"> ```r autoplot(mars_tune, type = "marginals") ``` <img src="images/part-5-bo-param-1.svg" width="85%" style="display: block; margin: auto;" /> --- # Parameters over iterations <img src="images/tune.png" class="title-hex"><img src="images/ggplot2.png" class="title-hex"> ```r autoplot(mars_tune, type = "parameters") ``` <img src="images/part-5-bo-param-iter-1.svg" width="95%" style="display: block; margin: auto;" /> --- # Results <img src="images/tune.png" class="title-hex"> `collect_metrics()` and `show_best()` work the same here as with grid search: ```r show_best(mars_tune, maximize = FALSE) ``` ``` ## # A tibble: 5 x 9 ## `mars terms` prod_degree `pca comps` .iter .metric .estimator mean n std_err ## <int> <int> <int> <dbl> <chr> <chr> <dbl> <int> <dbl> ## 1 78 1 7 7 rmse standard 1.13 8 0.0940 ## 2 35 1 18 6 rmse standard 1.20 8 0.106 ## 3 38 1 17 14 rmse standard 1.20 8 0.106 ## 4 53 1 19 2 rmse standard 1.20 8 0.106 ## 5 77 1 19 5 rmse standard 1.20 8 0.106 ``` For MARS, it has been my observation that second-degree (i.e. non-additive) models can have better performance but also large variability. These results about the same as the `glmnet` model (RMSE of 1.15 versus 1.131). --- # Performance Compared to Grid Search For illustration, all 4158 possible sub-models for these three tuning parameters were assessed. We can use these results to see if the Bayesian optimization was effective. The initial set of samples (pre-GP model) had a minimum RMSE of 1.825. The final search results yielded a model with RMSE = 1.131 which was better than 89.4% of the exhaustive grid search results. What were the best results? They mostly used a lot of terms, less than 10 PCA components, and were additive (i.e. 
`prod_degree = 1`): .font80[ ``` ## # A tibble: 6 x 5 ## `mars terms` prod_degree `pca comps` RMSE rank ## <int> <int> <int> <dbl> <int> ## 1 24 1 1 1.12 1 ## 2 24 1 2 1.12 2 ## 3 25 1 3 1.12 3 ## 4 29 1 3 1.13 11 ## 5 29 1 4 1.13 12 ## 6 29 1 5 1.13 13 ``` ] --- # Performance Compared to Grid Search <img src="images/part-5-bo-vs-complete-1.svg" width="75%" style="display: block; margin: auto;" /> --- # My Thoughts on Bayesian Optimization It is a reasonable approach to optimizing models. It comes from the deep learning literature. While I believe the literature, I feel like the method is somewhat overfit to their problems. For example, DL models * tend to have far more critical parameters than many other models * can't exploit sub-models and parallel process _single models_ * are created using massive data sets and a single validation set * have well defined ranges of tuning parameters (and non-uniform priors on those) Non DL models tend to have performance profiles that plateau and less _a priori_ knowledge. Also, space-filling designs do a much better job of finding good starting values than regular or random grids. --- # Assessment Set Results (Again) <img src="images/tune.png" class="title-hex"><img src="images/ggplot2.png" class="title-hex"><img src="images/dplyr.png" class="title-hex"> .pull-left[ ```r mars_pred <- mars_tune %>% collect_predictions() %>% inner_join( select_best(mars_tune, maximize = FALSE), by = c("mars terms", "prod_degree", "pca comps") ) ggplot(mars_pred, aes(x = .pred, y = ridership)) + geom_abline(col = "green") + geom_point(alpha = .3) + coord_equal() ``` ] .pull-right[ <img src="images/part-5-mars-pred-plot-1.svg" width="75%" style="display: block; margin: auto;" /> ] --- # Finalizing the recipe and model <img src="images/tune.png" class="title-hex"><img src="images/recipes.png" class="title-hex"><img src="images/parsnip.png" class="title-hex"> ```r best_mars <- select_best(mars_tune, "rmse", maximize = FALSE) best_mars ``` ``` ## # A tibble: 1 x 3 ## `mars terms` prod_degree `pca comps` ## <int> <int> <int> ## 1 78 1 7 ``` ```r final_mars_wfl <- finalize_workflow(chi_wflow, best_mars) # No formula is needed since a recipe is embedded in the workflow final_mars_wfl <- fit(final_mars_wfl, data = Chicago) ``` --- # Variable importance <img src="images/vip.png" class="title-hex"> .pull-left[ MARS models can measure importance in a few different ways: * The number of times that a column is used in a feature. * The reduction in error when each term is added to the model (preferred). The `earth` package has a generalized cross-validation (GCV) estimate for the latter metric. ] .pull-right[ .font90[ ```r final_mars_wfl %>% # Pull out the model pull_workflow_fit() %>% vip(num_features = 20L, type = "gcv") ``` <img src="images/part-5-mars-vip-1.svg" width="100%" style="display: block; margin: auto;" /> ] ]
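Since the recipe and model are bundled in the workflow, predictions on new data only need `predict()`; the preprocessing is applied automatically. A quick sketch (re-predicting the last week of the existing data, just for illustration):

```r
# The recipe inside the workflow is applied to `new_data` before predicting
predict(final_mars_wfl, new_data = tail(Chicago, 7))
```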