class: title-slide, center

<span class="fa-stack fa-4x">
<i class="fa fa-circle fa-stack-2x" style="color: #ffffff;"></i>
<strong class="fa-stack-1x" style="color:#E7553C;">1</strong>
</span>

# Applied Machine Learning

## Introduction

---
class: middle, center

.pull-left[
# <i class="fas fa-wifi"></i>

Wifi network name

rstudio20
]

.pull-left[
# <i class="fas fa-key"></i>

Wifi password

tidyverse20
]

---
class: middle, center

# <i class="fas fa-cloud"></i>

# Go here and log in (free):

[rstd.io/class](rstd.io/class)

Workshop identifier: `applied_ml`

---
layout: true

<div class="my-footer"><span>rstd.io/class</span></div>

---

# Why R for Modeling?

.pull-left[
* _R has cutting edge models_. Machine learning developers in some domains use R as their primary computing environment and their work often results in R packages.

* _R and R packages are built by people who **do** data analysis_.

* _The S language is very mature_.
]

.pull-right[
* _It is easy to port or link to other applications_. R doesn't try to be everything to everyone. If you prefer models implemented in C, C++, `tensorflow`, `keras`, `python`, `stan`, or `Weka`, you can access these applications without leaving R.

* The machine learning environment in R is extremely rich.
]

---

# Downsides to Modeling in R

.pull-left[
* R is a data analysis language and is not C or Java. If a high performance deployment is required, R can be treated like a prototyping language.

* R is mostly memory-bound. There are plenty of exceptions to this though.
]

.pull-right[
The main issue is one of _consistency of interface_. For example:

* There are two methods for specifying what terms are in a model<sup>1</sup>. Not all models have both.

* 99% of model functions auto-generate dummy variables.

* Sparse matrices can be used (unless they can't).
]

.footnote[[1] There are now three but the last one is brand new and will be discussed later.]
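To make the interface issue concrete, here is a minimal sketch that uses only functions shipped with base R. The choice of `mtcars` and of the `lm()` / `lm.fit()` pair is an assumption made for illustration: `lm()` accepts a formula and creates dummy variables itself, while `lm.fit()` accepts a numeric matrix and an outcome vector, so the caller has to build the design matrix.

```r
# Illustrative data: treat the number of cylinders as a factor
cars <- transform(mtcars, cyl = factor(cyl))

# Formula interface: lm() expands the factor into dummy variables itself
fit_formula <- lm(mpg ~ wt + cyl, data = cars)

# Non-formula (x/y) interface: the caller creates the design matrix,
# including the intercept column and the dummy variables
x <- model.matrix(mpg ~ wt + cyl, data = cars)
fit_xy <- lm.fit(x = x, y = cars$mpg)

# Compare the estimates from the two ways of specifying the terms
all.equal(coef(fit_formula), fit_xy$coefficients)
```

Both calls describe the same model, but the objects they return differ in class and in what `predict()` can do with them, which is the kind of inconsistency the bullets describe.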
---

# Syntax for Computing Predicted Class Probabilities

|Function     |Package   |Code                                                 |
|:------------|:---------|:----------------------------------------------------|
|`lda`        |`MASS`    |`predict(obj)`                                       |
|`glm`        |`stats`   |`predict(obj, type = "response")`                    |
|`gbm`        |`gbm`     |`predict(obj, type = "response", n.trees)`           |
|`mda`        |`mda`     |`predict(obj, type = "posterior")`                   |
|`rpart`      |`rpart`   |`predict(obj, type = "prob")`                        |
|`Weka`       |`RWeka`   |`predict(obj, type = "probability")`                 |
|`LogitBoost` |`caTools` |`predict(obj, type = "raw", nIter)`                  |
|`pamr.train` |`pamr`    |`pamr.predict(obj, type = "posterior", threshold)`   |

We'll see a solution for this later in the class.

---

# `tidymodels` Collection of Packages <img src="images/tidymodels.png" class="title-hex">

.code90[

```r
library(tidymodels)
```

```
## Registered S3 method overwritten by 'xts':
##   method     from
##   as.zoo.xts zoo
```

```
## ── Attaching packages ────────────────────────────── tidymodels 0.0.3 ──
```

```
## ✓ broom     0.5.3          ✓ purrr     0.3.3
## ✓ dials     0.0.4.9000     ✓ recipes   0.1.9
## ✓ dplyr     0.8.3          ✓ rsample   0.0.5
## ✓ ggplot2   3.2.1          ✓ tibble    2.1.3
## ✓ infer     0.5.1          ✓ yardstick 0.0.4
## ✓ parsnip   0.0.5
```

```
## ── Conflicts ───────────────────────────────── tidymodels_conflicts() ──
## x purrr::discard()  masks scales::discard()
## x dplyr::filter()   masks stats::filter()
## x dplyr::lag()      masks stats::lag()
## x ggplot2::margin() masks dials::margin()
## x recipes::step()   masks stats::step()
## x recipes::yj_trans() masks scales::yj_trans()
```

]

Plus [`tidypredict`](http://tidypredict.netlify.com/), [`tidyposterior`](https://tidymodels.github.io/tidyposterior/), [`tidytext`](https://github.com/juliasilge/tidytext), and more in development.

---

# Example Data Set - House Prices

For our examples, we will use the Ames IA housing data. There are 2,930 properties in the data.

The Sale Price was recorded along with 81 predictors, including:

* Location (e.g. neighborhood) and lot information.
* House components (garage, fireplace, pool, porch, etc.).
* General assessments such as overall quality and condition.
* Number of bedrooms, baths, and so on.

More details can be found in [De Cock (2011, Journal of Statistics Education)](http://ww2.amstat.org/publications/jse/v19n3/decock.pdf).

The raw data are at [`http://bit.ly/2whgsQM`](http://bit.ly/2whgsQM) but we will use a processed version found in the [`AmesHousing`](https://github.com/topepo/AmesHousing) package.
---

# Tidyverse Syntax <img src="images/dplyr.png" class="title-hex">

Many tidyverse functions have syntax unlike base R code. For example:

.pull-left[
Vectors of variable names are eschewed in favor of _functional programming_.

```r
contains("Sepal")

# instead of

c("Sepal.Width", "Sepal.Length")
```
]

.pull-right[
The _pipe_ operator is preferred.

```r
merged <- inner_join(a, b)

# is equal to

merged <- a %>%
  inner_join(b)
```
]

<br>

Functions are generally more _modular_ than their traditional analogs, i.e. compare `dplyr`'s `filter()` and `select()` with `base::subset()`.

</br>

---

# Some Example Data Manipulation Code <img src="images/readr.png" class="title-hex"><img src="images/dplyr.png" class="title-hex">

.font10[

```r
library(tidyverse)

ames_prices <- "http://bit.ly/2whgsQM" %>%
  read_delim(delim = "\t", guess_max = 2000) %>%
  rename_at(vars(contains(' ')), list(~gsub(' ', '_', .))) %>%
  dplyr::rename(Sale_Price = SalePrice) %>%
  dplyr::filter(!is.na(Electrical)) %>%
  dplyr::select(-Order, -PID, -Garage_Yr_Blt)

ames_prices %>%
  group_by(Alley) %>%
  summarize(
    mean_price = mean(Sale_Price / 1000),
    n = sum(!is.na(Sale_Price))
  )
```

```
## # A tibble: 3 x 3
##   Alley mean_price     n
##   <chr>      <dbl> <int>
## 1 Grvl        124.   120
## 2 Pave        177.    78
## 3 <NA>        183.  2731
```

]

---

# Examples of `purrr::map*` <img src="images/purrr.png" class="title-hex"><img src="images/dplyr.png" class="title-hex">

`purrr` contains functions that _iterate over lists_ without the explicit use of loops. They are similar to the family of apply functions in base R, but are type stable.
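As a small sketch of what "type stable" means here (the toy list `vals` is made up for illustration): the return type of `sapply()` depends on what the function happens to produce, while `map_dbl()` either returns a double vector or stops with an error.

```r
library(purrr)

vals <- list(a = 1:3, b = 4:6)

# sapply() chooses its return type based on the results
sapply(vals, mean)     # a named numeric vector
sapply(vals, range)    # a matrix, since each result has length 2
sapply(list(), mean)   # an empty list

# map_dbl() always returns a double vector...
map_dbl(vals, mean)
map_dbl(list(), mean)  # a zero-length double vector

# ...and fails loudly when it cannot, instead of changing type:
# map_dbl(vals, range) raises an error because each result has length 2
```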
.pull-left[

```r
# purrr loaded with tidyverse or tidymodels package

mini_ames <- ames_prices %>%
  dplyr::select(Alley, Sale_Price, Yr_Sold) %>%
  dplyr::filter(!is.na(Alley))

head(mini_ames, n = 5)
```

```
## # A tibble: 5 x 3
##   Alley Sale_Price Yr_Sold
##   <chr>      <dbl>   <dbl>
## 1 Pave      190000    2010
## 2 Pave      155000    2010
## 3 Pave      151000    2010
## 4 Pave      149500    2010
## 5 Pave      152000    2010
```
]

.pull-right[

```r
by_alley <- split(mini_ames, mini_ames$Alley)

# map(.x, .f, ...)
map(by_alley, head, n = 2)
```

```
## $Grvl
## # A tibble: 2 x 3
##   Alley Sale_Price Yr_Sold
##   <chr>      <dbl>   <dbl>
## 1 Grvl       96500    2010
## 2 Grvl      109500    2010
##
## $Pave
## # A tibble: 2 x 3
##   Alley Sale_Price Yr_Sold
##   <chr>      <dbl>   <dbl>
## 1 Pave      190000    2010
## 2 Pave      155000    2010
```
]

---

# Examples of `purrr::map*` <img src="images/purrr.png" class="title-hex"><img src="images/dplyr.png" class="title-hex">

.pull-left[

```r
map(by_alley, nrow)
```

```
## $Grvl
## [1] 120
##
## $Pave
## [1] 78
```

`map()` always returns a list. Use suffixed versions for simplification of the result.

```r
map_int(by_alley, nrow)
```

```
## Grvl Pave
##  120   78
```
]

.pull-right[
Complex operations can be specified using _formula notation_. Access the current element with `.x`.

```r
map(
  by_alley,
  ~summarise(.x, max_price = max(Sale_Price))
)
```

```
## $Grvl
## # A tibble: 1 x 1
##   max_price
##       <dbl>
## 1    256000
##
## $Pave
## # A tibble: 1 x 1
##   max_price
##       <dbl>
## 1    345000
```
]

---

# `purrr` and list-columns <img src="images/tidyr.png" class="title-hex"><img src="images/purrr.png" class="title-hex"><img src="images/dplyr.png" class="title-hex">

Rather than using `split()`, we can `tidyr::nest()` by `Alley` to get a data frame with a _list-column_. We often use these when working with _multiple models_.
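A minimal sketch of the "multiple models" use case, kept self-contained by using the built-in `mtcars` data instead of the Ames data (the column names `lm_fit` and `slope` are made up for this illustration): one linear model is fit per group and stored in its own list-column.

```r
library(dplyr)
library(tidyr)
library(purrr)

# One row per cylinder group; the remaining columns are nested into `data`
models_by_cyl <- mtcars %>%
  nest(data = -cyl) %>%
  mutate(
    # Fit a separate linear model to each nested data frame
    lm_fit = map(data, ~ lm(mpg ~ wt, data = .x)),
    # Pull a single number out of each fitted model
    slope = map_dbl(lm_fit, ~ coef(.x)[["wt"]])
  )

models_by_cyl
```

Keeping the data, the fitted model, and the derived statistics in the same row makes it hard to mismatch a model with the subset it was trained on.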
.pull-left[

```r
ames_lst_col <- nest(mini_ames, data = -Alley)
ames_lst_col
```

```
## # A tibble: 2 x 2
##   Alley           data
##   <chr> <list<df[,2]>>
## 1 Pave        [78 × 2]
## 2 Grvl       [120 × 2]
```
]

.pull-right[

```r
ames_lst_col %>%
  mutate(
    n_row = map_int(data, nrow),
    max   = map_dbl(data, ~ max(.x$Sale_Price))
  )
```

```
## # A tibble: 2 x 4
##   Alley           data n_row    max
##   <chr> <list<df[,2]>> <int>  <dbl>
## 1 Pave        [78 × 2]    78 345000
## 2 Grvl       [120 × 2]   120 256000
```
]

---

# Quick Data Investigation

To get warmed up, let's load the real Ames data and do some basic investigations into the variables, such as exploratory visualizations or summary statistics. The idea is to get a feel for the data.

Let's take 10 minutes to work on your own or with someone next to you. Collaboration is highly encouraged!

To get the data:

```r
library(AmesHousing)
ames <- make_ames()
```
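One possible starting point, offered only as a sketch: it assumes `dplyr` and `ggplot2` are installed alongside `AmesHousing`, and it uses the `Sale_Price` and `Neighborhood` columns produced by `make_ames()`. The specific summaries are arbitrary choices, not a recommended analysis.

```r
library(AmesHousing)
library(dplyr)
library(ggplot2)

ames <- make_ames()

# Summary statistics for the outcome
summary(ames$Sale_Price)

# Number of sales and median price for each neighborhood
ames %>%
  group_by(Neighborhood) %>%
  summarize(n = n(), median_price = median(Sale_Price)) %>%
  arrange(desc(median_price))

# Distribution of sale prices, drawn on a log scale
ggplot(ames, aes(x = Sale_Price)) +
  geom_histogram(bins = 50) +
  scale_x_log10()
```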
---

# The Modeling _Process_

Common steps during model building are:

* estimating model parameters (i.e. training models)

* determining the values of _tuning parameters_ that cannot be directly calculated from the data

* model selection (within a model type) and model comparison (between types)

* calculating the performance of the final model that will generalize to new data

Many books and courses portray predictive modeling as a short sprint. A better analogy would be a marathon or campaign (depending on how hard the problem is).

---

# What the Modeling Process Usually Looks Like

<img src="images/part-1-mod-process-1.svg" width="95%" style="display: block; margin: auto;" />

---

# What Are We Doing with the Data?

.pull-left[
We often think of the model as the _only_ real data analysis step in this process.

However, there are other procedures that are often applied before or after the model fit that are data-driven and have an impact.
]

.pull-right[
<img src="images/diagram-simple.svg" width="75%" style="display: block; margin: auto;" />
]

If we only think of the model as being important, we might end up accidentally overfitting to the data in-hand. This is very similar to the problems of "the garden of forking paths" and ["p-hacking"](http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf).

---

# Define the Data Analysis Process

.pull-left[
Let's conceptualize a process or _workflow_ that involves all of the steps where the data are analyzed in a significant way. This includes the model but might also include other _estimation_ steps.

Admittedly, there is some grey area here.

This concept will become important when we talk about measuring performance of the modeling process.
]

.pull-right[
<img src="images/diagram-complex.svg" width="95%" style="display: block; margin: auto;" />

* Data preparation steps (e.g.
imputation, encoding, transformations, etc.)
* Selection of which terms go into the model
]

---

# Some naming conventions

There are a few suffixes that we'll use for certain types of objects:

* `_mod` for a `parsnip` model specification

* `_fit` for a fitted model

* `_rec` for a recipe

* `_wfl` for a workflow

* `_tune` for a tuning object

* `_res` for a general result

---

# Resources

* [`http://www.tidyverse.org/`](http://www.tidyverse.org/)

* [R for Data Science](http://r4ds.had.co.nz/)

* Jenny's [`purrr` tutorial](https://jennybc.github.io/purrr-tutorial/) or [Happy R Users Purrr](https://www.rstudio.com/resources/videos/happy-r-users-purrr-tutorial/)

* Programming with `dplyr` [vignette](https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html)

* Selva Prabhakaran's [`ggplot2` tutorial](http://r-statistics.co/Complete-Ggplot2-Tutorial-Part1-With-R-Code.html)

* `caret` package [documentation](https://topepo.github.io/caret/)

* [CRAN Machine Learning Task View](https://cran.r-project.org/web/views/MachineLearning.html)