Applied Machine Learning

class: title-slide, center

# Applied Machine Learning

## Classification Models

---
# Outline

* Performance Measures

* Amazon Review Data

* Classification Trees

* Boosting

* Extra topics as time allows

---

# Load Packages

```r
library(tidymodels)
```

```
## ── Attaching packages ───────────────────────────────────────────── tidymodels 0.0.4 ──
```

```
## ✓ broom     0.5.3     ✓ recipes   0.1.9
## ✓ dials     0.0.4     ✓ rsample   0.0.5
## ✓ dplyr     0.8.3     ✓ tibble    2.1.3
## ✓ infer     0.5.1     ✓ tune      0.0.1
## ✓ parsnip   0.0.5     ✓ workflows 0.1.0
## ✓ purrr     0.3.3     ✓ yardstick 0.0.5
```

```
## ── Conflicts ──────────────────────────────────────────────── tidymodels_conflicts() ──
## x purrr::accumulate()    masks foreach::accumulate()
## x purrr::discard()       masks scales::discard()
## x dplyr::filter()        masks stats::filter()
## x recipes::fixed()       masks stringr::fixed()
## x dplyr::group_rows()    masks kableExtra::group_rows()
## x dplyr::ident()         masks dbplyr::ident()
## x dplyr::lag()           masks stats::lag()
## x purrr::lift()          masks caret::lift()
## x dials::margin()        masks ggplot2::margin()
## x yardstick::precision() masks caret::precision()
## x dials::prune()         masks rpart::prune()
## x yardstick::recall()    masks caret::recall()
## x dplyr::select()        masks MASS::select()
## x dplyr::sql()           masks dbplyr::sql()
## x recipes::step()        masks stats::step()
## x purrr::when()          masks foreach::when()
## x recipes::yj_trans()    masks scales::yj_trans()
```

---
layout: false
class: inverse, middle, center

# Measuring Performance in Classification

---

# Illustrative Example  <img src="images/yardstick.png" class="title-hex">

`yardstick` contains another test set example in a data frame called `two_class_example`:

```r
two_class_example %>% head(4)
```

```
##    truth  Class1 Class2 predicted
## 1 Class2 0.00359  0.996    Class2
## 2 Class1 0.67862  0.321    Class1
## 3 Class2 0.11089  0.889    Class2
## 4 Class1 0.73516  0.265    Class1
```

Both `truth` and `predicted` are factors with the same levels. The other two columns represent _class probabilities_.

This reflects that most classification models can generate "hard" and "soft" predictions for models.

The class predictions are usually created by thresholding some numeric output of the model (e.g. a class probability) or by choosing the largest value.

---

# Class Prediction Metrics <img src="images/yardstick.png" class="title-hex">

.pull-left[

With class predictions, a common summary method is to produce a _confusion matrix_ which is a simple cross-tabulation between the observed and predicted classes:

```r
two_class_example %>% 
	conf_mat(truth = truth, estimate = predicted)
```

```
##           Truth
## Prediction Class1 Class2
##     Class1    227     50
##     Class2     31    192
```

These can be visualized using [mosaic plots](https://en.wikipedia.org/wiki/Mosaic_plot).

]

.pull-right[

Accuracy is the most obvious metric for characterizing the performance of models.

```r
two_class_example %>% 
	accuracy(truth = truth, estimate = predicted)
```

```
## # A tibble: 1 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.838
```

However, it suffers when there is a _class imbalance_; suppose 95% of the data have a specific class. 95% accuracy can be achieved by predicting samples to be the majority class. There are measures that correct for the natural event rate, such as [Cohen's Kappa](https://en.wikipedia.org/wiki/Cohen%27s_kappa).

]

---

# Two Classes <img src="images/yardstick.png" class="title-hex">

There are a number of specialized metrics that can be used when there are two classes. Usually, one of these classes can be considered the _event of interest_ or the _positive class_.

One common way to think about performance is to consider false negatives and false positives.

* The **sensitivity** is the _true positive rate_ (out of all of the actual positives, how many did you get right?).

* The **specificity** is the rate of correctly predicted negatives, or 1 - _false positive rate_ (out of all the actual negatives, how many did you get right?).

From this, assuming that `Class1` is the event of interest:

.pull-left[

```
##           Truth
## Prediction Class1 Class2
##     Class1    227     50
##     Class2     31    192
```
]
.pull-right[
 sensitivity = 227/(227 + 31)  = 0.88

specificity = 192/(192 + 50)  = 0.79
]

---

# Conditional and Unconditional Measures <img src="images/yardstick.png" class="title-hex">

Sensitivity and specificity can be computed from `sens()` and `spec()`, respectively.

It should be noted that these are _conditional measures_ since we need to know the true outcome.

The event rate is the _prevalence_ (or the Bayesian _prior_). Sensitivity and specificity are analogous to the _likelihood values_.

There are _unconditional_ analogs to the _posterior values_ called the positive predictive values and the negative predictive values.

A variety of other measures are available for two class systems, especially for _information retrieval_.

One thing to consider: what happens if our **threshold to call a sample an event is not optimal**?

---

# Changing the Probability Threshold <img src="images/yardstick.png" class="title-hex">

.pull-left[
For two classes, the 50% cutoff is customary; if the probability of class 1 is >= 50%, they would be labelled as `Class1`.

What happens when you change the cutoff?

* Increasing it makes it harder to be called `Class1` `$\Rightarrow$` fewer predicted events, specificity `$\uparrow$`, sensitivity `$\downarrow$`

* Decreasing the cutoff makes it easier to be called `Class1` `$\Rightarrow$` more predicted events, specificity `$\downarrow$`, sensitivity `$\uparrow$`

]
.pull-right[

With two classes, the **Receiver Operating Characteristic (ROC) curve** can be used to estimate performance using a combination of sensitivity and specificity.  
  
To create the curve, many alternative cutoffs are evaluated.

For each cutoff, we calculate the sensitivity and specificity.

The ROC curve plots the sensitivity (eg. true positive rate) versus 1 - specificity (eg. the false positive rate).

]
 
  
---

# The Receiver Operating Characteristic (ROC) Curve  <img src="images/yardstick.png" class="title-hex"><img src="images/ggplot2.png" class="title-hex"><img src="images/dplyr.png" class="title-hex">

.pull-left[

```r
roc_obj <- 
  two_class_example %>% 
  roc_curve(truth, Class1)
```

```r
two_class_example %>% roc_auc(truth, Class1)
```

```
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.939
```

```r
autoplot(roc_obj) + thm
```
]
.pull-right[

<img src="images/part-6-roc-plot-1.svg" width="100%" style="display: block; margin: auto;" />
]

---

# Changing the Threshold

---

# The Receiver Operating Characteristic (ROC) Curve

The ROC curve has some major advantages:

* It can allow models to be optimized for performance before a definitive cutoff is determined.
 
 * It is _robust_ to class imbalances; no matter the event rate, it does a good job at characterizing model performance. 
 
 * The ROC curve can be used to pick an optimal cutoff based on the trade-offs between the types of errors that can occur.

When there are two classes, it is advisable to focus on the area under the ROC curve instead of sensitivity and specificity.

Once an acceptable model is determined, a proper cutoff can be determined.

---
layout: false
class: inverse, middle, center

# Example Data

---

# Amazon Review Data

These data are from Amazon, who describe it as

> "This dataset consists of reviews of fine foods from amazon. The data span a period of more than 10 years, including all ~500,000 reviews up to October 2012. Reviews include product and user information, ratings, and a plaintext review."

We will use the text data to predict whether the review have a five-star result or not.

This will involve some natural language processing methods, which we will walk through.

The data are found in the `modeldata` package

```r
library(modeldata)
data(small_fine_foods)
```

---

# Feature Engineering

Most of the work for the features is to extra information from text.

We will use the basics here but more information can be found in the [_Tidy Text Mining with R_](https://www.tidytextmining.com/).

To do this, we will heavily rely on the `textrecipes` package. 
 
---

# Defining roles <img src="images/recipes.png" class="title-hex">

.pull-left-a-lot[

.font80[

```r
text_rec <-
  recipe(score ~ product + review, data = training_data) %>%
* update_role(product, new_role = "id")
```

]
 
]
.pull-right-a-little[
`product` is used for data splitting (as we'll see in a bit).

Arguably, it is not a predictor (although some might use it that way).

We update the role so that it is retained in the recipe but not used as a predictor. 
]

---

# Copying a column <img src="images/recipes.png" class="title-hex">

.pull-left-a-lot[

.font80[

```r
text_rec <-
  recipe(score ~ product + review, data = training_data) %>%
  update_role(product, new_role = "id") %>%
* step_mutate(review_raw = review)
```

]

]
.pull-right-a-little[
Two of the steps that we'll use will destroy the original predictor.

We'll use a basic `step_mutate()` to make a temporary copy. 
]

---

# Initial feature set <img src="images/recipes.png" class="title-hex">

.pull-left-a-lot[

.font80[

```r
text_rec <-
  recipe(score ~ product + review, data = training_data) %>%
  update_role(product, new_role = "id") %>%
  step_mutate(review_raw = review) %>%
* step_textfeature(review_raw)
```

]

]
.pull-right-a-little[
A set of numeric predictors are derived from the text.

Most are counts of text elements (e.g. words, punctuation, etc) 
]

---

# Tokenize <img src="images/recipes.png" class="title-hex">

.pull-left-a-lot[

.font80[

]

]
.pull-right-a-little[

]

---

# Remove Stop Words <img src="images/recipes.png" class="title-hex">

.pull-left-a-lot[

.font80[

```r
text_rec <-
  recipe(score ~ product + review, data = training_data) %>%
  update_role(product, new_role = "id") %>%
  step_mutate(review_raw = review) %>%
  step_textfeature(review_raw)  %>%               
  step_tokenize(review)   %>%
* step_stopwords(review)
```

]

]
.pull-right-a-little[
Stop words are those that occur commonly in text, such as "the", "a", and so on.

Removing them from text _might_ be a good idea.

This largely depends on what you are doing with the text. 
]

---

# Word stemming <img src="images/recipes.png" class="title-hex">

.pull-left-a-lot[

.font80[

```r
text_rec <-
  recipe(score ~ product + review, data = training_data) %>%
  update_role(product, new_role = "id") %>%
  step_mutate(review_raw = review) %>%
  step_textfeature(review_raw) %>%               
  step_tokenize(review) %>%
  step_stopwords(review)  %>%
* step_stem(review)
```

]

]
.pull-right-a-little[
Stemming is a method that uses a common root of a word instead of the original value.

For example, these 7 words are fairly similar: "teach", "teacher", "teachers", "teaches", "teachable", "teaching", "teachings".

Stemming would reduce these to 3 unique values: "teach", "teacher", "teachabl".

Like stop word removal, this may or may not be a good idea. 
]

---

# Feature hashing <img src="images/recipes.png" class="title-hex">

.pull-left-a-lot[

.font80[

```r
text_rec <-
  recipe(score ~ product + review, data = training_data) %>%
  update_role(product, new_role = "id") %>%
  step_mutate(review_raw = review) %>%
  step_textfeature(review_raw) %>%               
  step_tokenize(review) %>%
  step_stopwords(review)  %>%
  step_stem(review) %>%
* step_texthash(review, signed = FALSE, num_terms = 1024)
```

]

.pull-right-a-little[

Feature hashing creates numeric terms from words in a sentence (or some other token) similar to dummy variables.

However, there are big differences, including:

* There is no look-up table to consult to make the mapping 
* The placement of the non-zero values in meant to emulate randomness. 
* The new features are computed on the actual words.

]

---

# Feature hashing

.pull-left[

For string "On Time and product looked like it", a sketch of the calculations to make 8 hashed values:

<table class="table" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> word </th>
   <th style="text-align:right;"> hashed integer value </th>
   <th style="text-align:right;"> (integer mod 8) + 1 </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> On </td>
   <td style="text-align:right;"> -182693672 </td>
   <td style="text-align:right;"> 4 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Time </td>
   <td style="text-align:right;"> 1593484409 </td>
   <td style="text-align:right;"> 8 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> and </td>
   <td style="text-align:right;"> -1079337235 </td>
   <td style="text-align:right;"> 8 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> product </td>
   <td style="text-align:right;"> -979280496 </td>
   <td style="text-align:right;"> 6 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> looked </td>
   <td style="text-align:right;"> -2120797534 </td>
   <td style="text-align:right;"> 2 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> like </td>
   <td style="text-align:right;"> -592737581 </td>
   <td style="text-align:right;"> 5 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> it </td>
   <td style="text-align:right;"> 1278008556 </td>
   <td style="text-align:right;"> 2 </td>
  </tr>
</tbody>
</table>
]
.pull-right[
Note that multiple words end up in the same feature column. This is _aliasing_ (statistical term) or a _collision_ (comp sci term).

* We wouldn't be able to distinguish the effect of those two words.

We can encode this as a simple zero or, as `textrecipes` does, use the count as the value.

* There are also _signed_ hashes that help avoid collisions.

Note that no words were mapped to feature columns three or seven. 
]

---

# Optional step: convert binary to factors <img src="images/recipes.png" class="title-hex">

.pull-left-a-lot[

.font80[

```r
count_to_binary <- function(x) {
  factor(ifelse(x != 0, "present", "absent"),
         levels = c("present", "absent"))
}
text_rec <-
  recipe(score ~ product + review, data = training_data) %>%
  update_role(product, new_role = "id") %>%
  step_mutate(review_raw = review) %>%
  step_textfeature(review_raw) %>%               
  step_tokenize(review) %>%
  step_stopwords(review)  %>%
  step_stem(review) %>%
  step_texthash(review, signed = FALSE, num_terms = 1024) %>%
* step_mutate_at(starts_with("review_hash"), fn = count_to_binary)
```

]

.pull-right-a-little[

The naive Bayes model will be used on these data.

It computes probability values from each predictor.

- If the predictor is numeric, its statistical density is used.

- If categorical, a contingency table is used.

.font100[

Since the hash values are really about the presence/absence of words, we should convert them to 2-level factor variables to ensure appropriate calculations.

]

---

# Optional step: remove zero-variance predictors <img src="images/recipes.png" class="title-hex">

.pull-left-a-lot[

.font80[

```r
count_to_binary <- function(x) {
  factor(ifelse(x != 0, "present", "absent"),
         levels = c("present", "absent"))
}
text_rec <-
  recipe(score ~ product + review, data = training_data) %>%
  update_role(product, new_role = "id") %>%
  step_mutate(review_raw = review) %>%
  step_textfeature(review_raw) %>%               
  step_tokenize(review) %>%
  step_stopwords(review)  %>%
  step_stem(review) %>%
  step_texthash(review, signed = FALSE, num_terms = 1024) %>%
  step_mutate_at(starts_with("review_hash"), fn = count_to_binary) %>% 
* step_zv(all_predictors())
```

]

]
.pull-right-a-little[
Removing the features that are all zero in the training set increases computational efficiency and may stop model failures. 
]

---

# Resampling and Analysis Strategy <img src="images/rsample.png" class="title-hex">

There are enough data here to do a simple 10-fold cross-validation.

Since there is a class imbalance, we will stratify the splits.

```r
set.seed(8935)
text_folds <- vfold_cv(training_data, strata = "score")
```

---
layout: false
class: inverse, middle, center

# Classification Trees

---

# Tree model structure

A classification tree searches through each predictor to find a value of a single variable that best splits the data into two groups.

For the resulting groups, the process is repeated until a hierarchical structure (a tree) is created.

* In effect, trees partition the `$X$` space into rectangular sections that assign a single value to samples within the rectangle.

The final structure in the tree is the _terminal node_ and each path through the tree is a _rule_.

.pull-left[

```r
# Example tree with three terminal nodes
if (x > 1) {
     if (y < 3) {
       class <- "A"
     } else {
       class <- "B"
     }
} else {
  class <- "A"
}
```
]
.pull-right[

```r
# Same tree, stated as rules
if (x >  1 & y <   3) class <- "A"
if (x >  1 & y >=  3) class <- "B"
if (x <= 1)           class <- "A"
```
]

---

# Species of tree-based models

There are a variety of different methods for creating trees that vary over:

* The search method (e.g., greedy or global).
 
 * The splitting criterion.
 
 * The number of splits. 
 
 * Handling of missing values. 
 
 * Pruning method.

The most popular is the CART methodology, followed by the C5.0 model.

We will focus on CART for single trees.

The [CRAN Machine Learning Task View](https://cran.r-project.org/web/views/MachineLearning.html) has a good summary of the methods available in R.

---

# Growing phase

CART starts by _growing_ the tree.

* More and more splits are conducted until a pre-specified samples size requirement is exceeded (`min_n`).

* The criterion used is the _purity_ of the terminal nodes that are created by each split.

For example, for simulated data with a 50% event rate, which one of these splits is better?

.pull-left[

<img src="images/part-6-good-split-1.svg" width="70%" style="display: block; margin: auto;" />
]
.pull-right[

<img src="images/part-6-bad-split-1.svg" width="70%" style="display: block; margin: auto;" />
]

---

# Pruning phase

The deepest possible tree has a higher likelihood of overfitting the data.

CART conducts cost-complexity pruning to find the "right sized tree".

It basically penalizes the error rate of the tree by the number of terminal nodes by minimizing

`$$Error_{cv} - (c_p \times nodes)$$`

* The `$c_p$` value, usually between 0 and 0.1, controls the depth of the tree.

* CART has an internal 10-fold cross-validation that it uses to estimate the model error.

* If the outcome has a large class imbalance, this method optimizes the tree for the majority class.

For CART, `$c_p$` (aka `cost_complexity`) and the minimum splitting size (`min_n`) are the tuning parameters.

---

# Aspects of single trees

* The class percentages in the terminal node are used to make predictions.

* The number of possible class probabilities is typically low.

* Trees are _theoretically_ interpretable if the number of terminal nodes is low.

* The training time tends to be very fast.

* Trees are _unstable_; if the data are slightly changed, the entire tree structure can change. These are [low-bias/high-variance models](https://bookdown.org/max/FES/important-concepts.html#model-bias-and-variance).

* Very little, if any, data pre-processing is needed. _Dummy variables [are not required](https://bookdown.org/max/FES/categorical-trees.html)_.

* Trees automatically conduct _feature selection_.

---

# Fitting and tuning trees

Like MARS, there are two main ways to tune the CART model:

* Rely on the internal CV procedure to pick the tree depth via purity/error rate:

```r
    decision_tree(min_n = tune()) %>% 
      set_engine("rpart") %>% 
      set_mode("classification")
```

* Manually specify `$c_p$` values and use external resampling with a metric of your choice:

```r
    decision_tree(cost_complexity = tune(), min_n = tune()) %>% 
      set_engine("rpart") %>% 
      set_mode("classification")
```
 
I prefer the latter approach; I believe that the automated choice tends to pick overly simple models.

---

# {recipe} and {parsnip} objects  <img src="images/tune.png" class="title-hex"><img src="images/recipes.png" class="title-hex"><img src="images/parsnip.png" class="title-hex">

.pull-left-a-lot[
.font80[

```r
library(textfeatures)
library(textrecipes)

tree_rec <-
  recipe(score ~ product + review, data = training_data) %>%
  update_role(product, new_role = "id") %>%
  step_mutate(review_raw = review) %>%
  step_textfeature(review_raw) %>%               
  step_tokenize(review) %>%
  step_stopwords(review)  %>%
  step_stem(review) %>%
  step_texthash(review, signed = FALSE, num_terms = tune()) %>%
  step_zv(all_predictors())

# and

cart_mod <- 
  decision_tree(cost_complexity = tune(), min_n = tune()) %>% 
  set_engine("rpart") %>% 
  set_mode("classification")

ctrl <- control_grid(save_pred = TRUE)
```

]

]
.pull-right-a-little[
Note that:

All of the text processing operations are deterministic and not reliant on any other data in the training set.

We could pre-compute the data prior to the text hashing.

Also, if we were not tuning the number of hashing terms, we could pre-compute the whole feature set with the exception of the zero-variance filter. 
]

---

# Model tuning <img src="images/tune.png" class="title-hex">

```r
cart_wfl <- 
  workflow() %>% 
  add_recipe(tree_rec) %>% 
  add_model(cart_mod)

set.seed(2553)

cart_tune <- tune_grid(
  cart_wfl,
  text_folds,
  grid = 10,
  metrics = metric_set(roc_auc),
  control = ctrl
)

show_best(cart_tune, metric = "roc_auc")
```

```
## # A tibble: 5 x 8
##   cost_complexity min_n num_terms .metric .estimator  mean     n std_err
##             <dbl> <int>     <int> <chr>   <chr>      <dbl> <int>   <dbl>
## 1  0.000118          33      3253 roc_auc binary     0.711    10 0.00852
## 2  0.00000393        25      1840 roc_auc binary     0.698    10 0.00520
## 3  0.00000000286     22       816 roc_auc binary     0.684    10 0.00601
## 4  0.000000000303    20      1748 roc_auc binary     0.684    10 0.00486
## 5  0.000526          16      1274 roc_auc binary     0.676    10 0.0103
```

---

# Parameter profiles

```r
autoplot(cart_tune)
```

---

# Plotting ROC curves <img src="images/yardstick.png" class="title-hex"><img src="images/tune.png" class="title-hex"><img src="images/ggplot2.png" class="title-hex">

.font80[

.pull-left[

```r
cart_pred <- collect_predictions(cart_tune)
cart_pred %>% slice(1:5)
```

```
## # A tibble: 5 x 8
##   id     .pred_great .pred_other  .row num_terms cost_complexity min_n score
##   <chr>        <dbl>       <dbl> <int>     <int>           <dbl> <int> <fct>
## 1 Fold01       0          1         24       412     0.000000271     7 other
## 2 Fold01       0.939      0.0606    25       412     0.000000271     7 great
## 3 Fold01       0.734      0.266     26       412     0.000000271     7 other
## 4 Fold01       0.961      0.0390    46       412     0.000000271     7 other
## 5 Fold01       0.939      0.0606    48       412     0.000000271     7 great
```

```r
cart_pred %>% 
  inner_join(
    select_best(cart_tune), 
    by = c("num_terms", "cost_complexity", "min_n")
  ) %>% 
  group_by(id) %>% 
  roc_curve(score, .pred_great) %>% 
  autoplot()
```
]
.pull-right[

```
## Joining, by = c("num_terms", "cost_complexity", "min_n")
```

]

---

# A single (but approximate) ROC curve <img src="images/yardstick.png" class="title-hex"><img src="images/tune.png" class="title-hex"><img src="images/ggplot2.png" class="title-hex"><img src="images/dplyr.png" class="title-hex">

Instead of showing all 10 curves, we could pool the 10 fold's worth of data and make a single curve that might approximate the "average" curve.

.font90[

.pull-left[

```r
# For  tun_* object, get the right predictions 
# and extract the ROC curve data
auc_curve_data <- function(x) {
    collect_predictions(x) %>% 
    inner_join(select_best(x, "roc_auc")) %>% 
    roc_curve(score, .pred_great)
}
```

]
.pull-right[

```r
# Apply the `auc_roc_data()` function across
# models. 
approx_roc_curves <- function(...) {
  curves <- map_dfr(
    list(...), auc_curve_data, .id = "model"
  )
  
  default_cut <- curves %>% 
    group_by(model) %>% 
    arrange(abs(.threshold - .5)) %>% 
    slice(1)
  
  ggplot(curves) +
    aes(y = sensitivity, x = 1 - specificity, 
        col = model) +
    geom_abline(lty = 3) + 
    geom_step(direction = "vh") + 
    geom_point(data = default_cut) + 
    coord_equal()
}
```
]

]

---

For our model:

.font90[

.pull-left[

```r
# Use named arguments for better labels
approx_roc_curves(CART = cart_tune)
```

]
.pull-right[
<img src="images/part-6-cart-roc-approx-1.svg" width="90%" style="display: block; margin: auto;" />
]

]

---

# Hands-On: Down-Sampling

Looking at the ROC curve, the default cutoff may not be optimal if FP and FN errors are about equal.

We could pick a better cutoff or fit another model using _sub-class sampling_.

The latter approach would balance the data prior to model fitting.

* The most common method would be to _down-sample_ the data. 
 
 * This is fairly controversial (at least in statistical circles).

Let's take 20m and refit the model code above with a recipe that includes downsampling.

[link to recipes documentation](https://tidymodels.github.io/recipes/reference/index.html)

---

# Variable importance

Also like MARS, these models judge importance by how much the terminal node purity improved with each split.

These are aggregated over variables.

Note: by default, these measures will contain predictors _not used in the tree_ (due to [surrogate splits](https://www.salford-systems.com/resources/webinars-tutorials/tips-and-tricks/using-surrogates-to-improve-datasets-with-missing-values)).

You can change this using the `rpart.control()` option.

Unfortunately, for these data, the hashed variables are rather opaque. It takes a lot of work to determine which words map to which features.

---
layout: false
class: inverse, middle, center

# Boosting

---

# Original concept of boosting

The original boosting algorithm was created for two-class problems and was designed to _boost_ a weak learner into a strong one.

```
Fit an initial tree where all samples are treated equally

for i = 1 to M boosting iterations {
   Samples predicted   correctly have _increased_ weights
   Samples predicted incorrectly have _decreased_ weights
   
   Fit a new model under the weighting scheme
   
   Quantify the model fit
}
```

Once this sequence of trees were fit, the final prediction was a weighted average of all of the trees

* These weights were created from the performance estimates for each tree.

This led to a dramatic increase in performance and only works because of the instability of tree-based models.

---

# Example tree fit with equal weights

---

# Boosting iteration 2

---

# Boosting iteration 3

---

# Boosting iteration 4

---

# Boosting iteration 5

---

# Initial limitations led to modern boosting

The initial model was only for two-class problems and had some obvious drawbacks.

Once the statisticians took a look at things, they made connections to statistical theory and gradient descent.

* This is why modern boosting is called _stochastic gradient boosting_. 
 
The deep learning people also contributed a lot of interesting work by adding tuning parameters that help optimize these models.

The `xgboost` package is probably the best modern gradient boosting package.

However, we are going to use something less complicated and somewhat old-school but very powerful.

---

# C5.0

A researcher named Ross Quinlan did research on tree- and rule-based models in the same general time frame as the CART folks.

His C4.5 algorithm was somewhat different than CART and, in some ways, much more elegant.

* More than two splits, unbiased selection, better handling of missing values, a different pruning process, etc.
 
Quinlan's modern models, **C5.0** and **Cubist**, are not as well known mostly since he has not published them. C5.0 takes his C4.5 model and enables a classical boosting approach to building a sequence of trees.

Most improvement comes in the first 50 iterations of boosting and the main two tuning parameters are `min_n` and the number of `trees` in the ensemble.

The `C50` package is based on the original C code for this model. The best references for this are [Quinlan's original book](https://www.google.com/search?ei=XXvBXd6XGZC35gKmuLbYDw&q=quinlan+%22C4.5%3A+Programs+for+Machine+Learning%22&oq=quinlan+%22C4.5%3A+Programs+for+Machine+Learning%22) and Chapter 14 of [_APM_](https://www.google.com/search?q=Kuhn+%22applied+predictive+modeling%22&oq=Kuhn+%22applied+predictive+modeling%22).

---

# C5.0 <img src="images/tune.png" class="title-hex"><img src="images/parsnip.png" class="title-hex"><img src="images/dplyr.png" class="title-hex">

This model is available as a `parsnip` engine for `decision_tree()` as well as `boosted_tree()`. We will demonstrate with the latter.

.font90[

.pull-left[

```r
C5_mod <- 
  boost_tree(trees = tune(), min_n = tune()) %>% 
  set_engine("C5.0") %>% 
  set_mode("classification")

C5_wfl <- update_model(cart_wfl, C5_mod)
```
]
.pull-right[

```r
# We will just modify our CART grid and add 
# a new parameter: 
set.seed(5793)
C5_grid <- 
  collect_metrics(cart_tune) %>% 
  dplyr::select(min_n, num_terms) %>% 
  mutate(trees = sample(1:100, 10))

C5_tune <-
  tune_grid(
    C5_wfl,
    text_folds,
    grid = C5_grid,
    metrics = metric_set(roc_auc),
    control = ctrl
  )
```
]

]

---

# Comparing models <img src="images/tune.png" class="title-hex"><img src="images/ggplot2.png" class="title-hex">

.code90[

.pull-left[

```r
approx_roc_curves(CART = cart_tune, C5 = C5_tune)
```

]
.pull-right[

```r
show_best(C5_tune)
```

```
## # A tibble: 5 x 8
##   trees min_n num_terms .metric .estimator  mean     n std_err
##   <int> <int>     <int> <chr>   <chr>      <dbl> <int>   <dbl>
## 1    54    33      3253 roc_auc binary     0.808    10 0.00931
## 2    52    20      1748 roc_auc binary     0.798    10 0.00891
## 3    84    25      1840 roc_auc binary     0.795    10 0.00924
## 4    64    16      1274 roc_auc binary     0.788    10 0.00763
## 5     6    37      2654 roc_auc binary     0.780    10 0.00932
```

```r
autoplot(C5_tune)
```

]
]

---

# Finalizing the recipe and model<img src="images/tune.png" class="title-hex"><img src="images/recipes.png" class="title-hex">

.code90[

.pull-left[

```r
best_C5 <- select_best(C5_tune)
best_C5
```

```
## # A tibble: 1 x 3
##   trees min_n num_terms
##   <int> <int>     <int>
## 1    54    33      3253
```

```r
# no prep-juice calls!
C5_wfl_final <- 
  C5_wfl %>%
  finalize_workflow(best_C5) %>% 
  fit(data = training_data)
```
]
.pull-right[

```r
C5_wfl_final
```

```
## ══ Workflow [trained] ═════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: boost_tree()
## 
## ── Preprocessor ───────────────────────────────────────────────────────────────────────
## 7 Recipe Steps
## 
## ● step_mutate()
## ● step_textfeature()
## ● step_tokenize()
## ● step_stopwords()
## ● step_stem()
## ● step_texthash()
## ● step_zv()
## 
## ── Model ──────────────────────────────────────────────────────────────────────────────
## 
## Call:
## C5.0.default(x = x, y = y, trials = 54L, control = C50::C5.0Control(minCases = 33L, sample = 0))
## 
## Classification Tree
## Number of samples: 4000 
## Number of predictors: 3120 
## 
## Number of boosting iterations: 54 requested;  37 used due to early stopping
## Average tree size: 21.1 
## 
## Non-standard options: attempt to group attributes, minimum number of cases: 33
```
]

]

---

# Predicting the test set <img src="images/yardstick.png" class="title-hex"><img src="images/recipes.png" class="title-hex"><img src="images/parsnip.png" class="title-hex">

.pull-left[

```r
test_probs <- C5_wfl_final %>%
  predict(testing_data, type = "prob") %>% 
  bind_cols(testing_data %>% dplyr::select(score)) %>% 
  bind_cols(predict(C5_wfl_final, testing_data))

roc_auc(test_probs, score, .pred_great)
```

```
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.807
```

```r
conf_mat(test_probs, score, .pred_class)
```

```
##           Truth
## Prediction great other
##      great   580   160
##      other    78   182
```
]
.pull-right[

```r
roc_values <- 
  roc_curve(test_probs, score, .pred_great)

autoplot(roc_values)
```

<img src="images/part-6-test-set-curve-1.svg" width="80%" style="display: block; margin: auto;" />
]

---
layout: false
class: inverse, middle, center

#  Extra Slides (as time allows)

---
layout: false
class: inverse, middle, center

#  Naive Bayes

---

# Naive Bayes Models

This classification model is motivated directly from statistical theory based on Bayes' Rule:

`$$Pr[Class|Predictors] = \frac{Pr[Class]\times Pr[Predictors|Class]}{Pr[Predictors]} 
= \frac{Prior\:\times\:Likelihood}{Evidence}$$`

In English:

> Given our predictor data, what is the probability of each class?

The _prior_ is the prevalence that was mentioned earlier (e.g. the rate of 5-star reviews). This can be estimated or set.

Most of the action is in `Pr[Predictors|Class]`, which is based on the observed training set.

Predictions are based on a blend of the training data and our _prior belief_ about the outcome...

---

# So  Why is it Naive?

Determining `$Pr[Predictors|Class]$` can be very difficult without strong assumptions because it measures the _joint probability_ of all of the predictors.

* For example, what is the correlation between a person's essay length and their religion?

To resolve this, **naive** Bayes assumes that all of the predictors are _independent_ and that their probabilities can be estimated separately.

The joint probability is then the product of all of the individual probabilities (an example follows soon).

This assumption is almost certainly bogus but the model tends to do well despite this.

---

# The Effect of Independence

The probability contours assume multivariate normality with different assumptions.

Suppose the red dot is a new sample.

Probability of the red point: 0.0000066 (accurate) and 0.013 (inaccurate).

---

# Conditional Densities for Each Class

`$Pr[Words|Class]$`

---

# Conditional Values for Numeric Predictors

`$Pr[Words=100|score]$`

---

# Conditional Probabilities for Categorical Predictors

`$Pr[\text{Has Word } X|score]$`

---

# Combining Predictor Scores with the Prior

For a review with 100 words that didn't have keyword `$X$`, their likelihood values were:

* `$Pr[Words = 100| great] \times Pr[X \text{ Absent}| great]$` = 0.734 x 0.699 = 0.513

* `$Pr[Words = 100| other] \times Pr[X \text{ Absent}| other]$` =0.951 x 0.866 = 0.824

However, when these are combined with the _prior probability_ for each class, the _relative probabilities_ show:

* `$Pr[Predictors| great] \times Pr[great]$` = 0.513 x 0.65 = 0.334

* `$Pr[Predictors| other] \times Pr[other]$` = 0.824 x 0.35 = 0.288

We don't need to compute the evidence; we can just normalize these values to add up to 1.

The results is that the _posterior probability_ that this review was 5-star is 53.6%.

---

# Pros and Cons

.pull-left[
Good:

* This model can be very quickly trained (and theoretically in parallel).

* Once trained, the prediction is basically a look-up table (i.e. fast).

* Nonlinear class boundaries can be generated.

]
.pull-right[
Bad:

* Linearly diagonal boundaries can be difficult.

* With many predictors, the class probabilities become poorly calibrated and U-shaped with most values near zero or one. 
]

---

# U-Shaped Class Probability Distributions

.pull-left[
A completely non-informative data set was simulated using the naive assumption.

The training set has 500 data points over two classes and 450 predictors.

When a model is fit with 10 predictors, the distribution of the class probabilities gives us shapes that we would expect.
  
What happens when the number of predictors becomes larger? 
]
.pull-right[

<img src="images/part-6-normal-shape-1.svg" width="90%" style="display: block; margin: auto;" />
]

---

# U-Shaped Class Probability Distributions

---

# Naive Bayes recipe and fit <img src="images/tune.png" class="title-hex"><img src="images/recipes.png" class="title-hex"><img src="images/parsnip.png" class="title-hex">

There is a step specifically designed for converting binary dummy variables into factors.

```r
count_to_binary <- function(x) {
  factor(ifelse(x != 0, "present", "absent"),
         levels = c("present", "absent"))
}

nb_rec <- 
  tree_rec %>%
  step_mutate_at(starts_with("review_hash"), fn = count_to_binary)

library(discrim)

nb_mod <- naive_Bayes() %>% set_engine("klaR")

nb_tune <-
  tune_grid(
    nb_rec,
    nb_mod,
    text_folds,
    grid = tibble(num_terms = floor(2^seq(8, 12, by = 0.5))),
    metrics = metric_set(roc_auc),
    control = ctrl
  )
```

---

# Naive Bayes results

.font80[
There are a number of warnings that look like:

> `Numerical 0 probability for all classes with observation 1`

This is due to the poorly calibrated probabilities although the warning is a bit misleading. This issue does not generally affect performance and can be ignored.

.pull-left[

```r
autoplot(nb_tune) +
  scale_x_continuous(trans = log2_trans())
```

<img src="images/part-6-nb-plot-1.svg" width="80%" style="display: block; margin: auto;" />
]
.pull-right[

```r
approx_roc_curves(CART = cart_tune, C5 = C5_tune, 
                  "Naive Bayes" = nb_tune)
```

]

---
layout: false
class: inverse, middle, center

#  Some Cool Things to Mention!

---

# {tidypredict} and {modeldb}

These are two packages that can use SQL to deploy (and sometimes fit) models.

```r
library(tidypredict)
library(dbplyr)

lin_reg_fit <- lm(Sepal.Width ~ ., data = iris)

# R code
tidypredict_fit(lin_reg_fit)
```

```
## 1.65716389226361 + (Sepal.Length * 0.377773406416986) + (Petal.Length * 
##     -0.187566584248584) + (Petal.Width * 0.625710493432476) + 
##     (ifelse(Species == "versicolor", 1, 0) * -1.16028529814733) + 
##     (ifelse(Species == "virginica", 1, 0) * -1.39825487768477)
```

```r
# SQL code
tidypredict_sql(lin_reg_fit, con = simulate_dbi())
```

```
## <SQL> 1.65716389226361 + (`Sepal.Length` * 0.377773406416986) + (`Petal.Length` * -0.187566584248584) + (`Petal.Width` * 0.625710493432476) + (CASE WHEN (`Species` = 'versicolor') THEN (1.0) WHEN NOT(`Species` = 'versicolor') THEN (0.0) END * -1.16028529814733) + (CASE WHEN (`Species` = 'virginica') THEN (1.0) WHEN NOT(`Species` = 'virginica') THEN (0.0) END * -1.39825487768477)
```

---

# Multiclass Metrics With yardstick <img src="images/yardstick.png" class="title-hex">

Multiclass? This just means your outcome has >2 possibilities (Religion: Catholic, Atheist, Buddhist, etc).

.pull-left[

Consider binary `precision()`:

$$
Pr = \frac{TP}{TP + FP} 
$$

```
## # A tibble: 5 x 2
##   truth estimate
##   <fct> <fct>   
## 1 ✅    ✅      
## 2 😡    😡      
## 3 ✅    ✅      
## 4 😡    ✅      
## 5 😡    😡
```

]

.pull-right[

`$$TP = 2$$`

`$$FP = 1$$`

`$$Pr = \frac{2}{2 + 1} = \frac{2}{3}$$`

```r
precision(prec_example, truth, estimate)
```

```
## # A tibble: 1 x 3
##   .metric   .estimator .estimate
##   <chr>     <chr>          <dbl>
## 1 precision binary         0.667
```

]

---

# Macro Averaging <img src="images/yardstick.png" class="title-hex">

What does this look like in multiclass world?

.pull-left[

```
## # A tibble: 5 x 2
##   truth estimate
##   <fct> <fct>   
## 1 ✅    ✅      
## 2 🤷    😡      
## 3 ✅    ✅      
## 4 😡    🤷      
## 5 😡    😡
```

One technique for dealing with this is _macro averaging_. This reduces the problem to multiple one-vs-all comparisons.

]

.pull-right[

1) Convert `truth`/`estimate` to binary with levels: ✅ and `other`.

2) Compute `precision()` to get `Pr_1`.

3) Repeat 1) and 2) for each level to get `Pr_1`, `Pr_2`, `Pr_3`.

4) Average the results:

$$
Pr_{macro} = \frac{Pr_1 + Pr_2 + Pr_3}{3}
$$

]

---

# Macro Averaging <img src="images/yardstick.png" class="title-hex">

.pull-left[

```r
prec_multi
```

```
## # A tibble: 5 x 2
##   truth estimate
##   <fct> <fct>   
## 1 ✅    ✅      
## 2 🤷    😡      
## 3 ✅    ✅      
## 4 😡    🤷      
## 5 😡    😡
```

$$
`\begin{align}
Pr_1 &= \frac{2}{2 + 0} = 1\\
Pr_2 &= \frac{1}{1 + 1} = 0.5\\
Pr_3 &= \frac{0}{0 + 1} = 0
\end{align}`
$$

]

.pull-right[

`$$Pr_{macro} = \frac{1 + 0.5 + 0}{3} = 0.5$$`

```r
precision(prec_multi, truth, estimate)
```

```
## # A tibble: 1 x 3
##   .metric   .estimator .estimate
##   <chr>     <chr>          <dbl>
## 1 precision macro            0.5
```

]

???

yardstick automatically detects that you have a multiclass outcome.

---

# Caveats <img src="images/yardstick.png" class="title-hex">

Macro averaging gives each class _equal weight_ to the total precision value (`1/3` here). This may not be realistic when a class imbalance is present.

In that case, you can use a _weighted macro average_ which weights by the frequency of that class in the `truth` column.

```r
precision(prec_multi, truth, estimate, estimator = "macro_weighted")
```

```
## # A tibble: 1 x 3
##   .metric   .estimator     .estimate
##   <chr>     <chr>              <dbl>
## 1 precision macro_weighted       0.6
```

There is additionally a _micro average_ that gives each _observation_ equal weight rather than each _class_. This gives classes with more observations more influence.

Find more information at the [yardstick vignette](https://tidymodels.github.io/yardstick/articles/multiclass.html).