---
title: "Case Study 1: Ames -- Regression to predict Ames, IA Home Sales Prices"
output: html_notebook
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
ggplot2::theme_set(ggplot2::theme_minimal())
```

In this case study, our objective is to predict the sales price of a home. This 
is a _regression_ problem since the goal is to predict any real number across
some spectrum (\$119,201, \$168,594, \$301,446, etc). To predict the sales 
price, we will use numeric and categorical features of the home.

Throughout this case study you will learn a few new concepts:

* Vectorization and standardization of tabular features
* Adjusting batch size & epochs for training performance
* What callbacks are and how to start applying them
   - Early stopping
   - Controlling the learning rate
* Knowing when to adjust model capacity

# Package requirements

```{r load-pkgs}
library(keras)     # for deep learning
library(testthat)  # unit testing
library(tidyverse) # for dplyr, ggplot2, etc.
library(rsample)   # for data splitting
library(recipes)   # for feature engineering
```


# The Ames housing dataset

For this case study we will use the [Ames housing dataset](http://jse.amstat.org/v19n3/decock.pdf) 
provided by the __AmesHousing__ package.

```{r get-data}
ames <- AmesHousing::make_ames()
dim(ames)
```

# Understanding our data

This data has been partially cleaned up and has no missing data:

```{r}
sum(is.na(ames))
```

But this tabular data is a combination of numeric and categorical features that
we need to address.

```{r ames-structure}
str(ames)
```

The numeric variables are on different scales. For example:

```{r numeric-ranges}
ames %>%
  select(Lot_Area, Lot_Frontage, Year_Built, Gr_Liv_Area, Garage_Cars, Mo_Sold) %>%
  gather(feature, value) %>%
  ggplot(aes(feature, value)) +
  geom_boxplot() +
  scale_y_log10(labels = scales::comma)
```

There are categorical features that could be ordered:

```{r numeric-categories}
ames %>%
  select(matches("(Qual|Cond|QC|Qu)$")) %>%
  str()
```

And some of the categorical features have many levels:

```{r}
ames %>%
  select_if(~ is.factor(.) & length(levels(.)) > 8) %>%
  str()
```

Consequently, our first challenge is transforming this dataset into numeric
tensors that our model can use.

# Create train & test splits

One of the first things we want to do is create a train and test set; unlike
MNIST, this dataset does not come already split for us. We can use the __rsample__
package to create our train and test datasets.

```{r}
set.seed(123)
ames_split <- initial_split(ames, prop = 0.7)
ames_train <- analysis(ames_split)
ames_test <- assessment(ames_split)

dim(ames_train)
dim(ames_test)
```

# Preparing the data

All inputs and response values in a neural network must be tensors of either 
floating-point or integer data. Moreover, our feature values should not be
relatively large compared to the randomized initial weights _and_ all our 
features should take values in roughly the same range.

Consequently, we need to ___vectorize___ our data into a format conducive to neural 
networks [ℹ️](http://bit.ly/dl-02#3). For this data set, we'll transform our
data by:

- removing any zero-variance (or near-zero-variance) features
- condensing infrequent levels of categorical features to "other"
- ordinal encoding the quality features
- normalizing numeric feature distributions
- standardizing numeric features to mean = 0, std dev = 1
- one-hot encoding the remaining categorical features

```{r}
blueprint <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_nzv(all_nominal()) %>%
  step_other(all_nominal(), threshold = .01, other = "other") %>%
  step_integer(matches("(Qual|Cond|QC|Qu)$")) %>%
  step_YeoJohnson(all_numeric(), -all_outcomes()) %>%
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes()) %>%
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE)

blueprint
```

This next step computes any relevant information (mean and std deviation of
numeric features, names of one-hot encoded features) on the training data so
there is no information leakage from the test data.

```{r}
prepare <- prep(blueprint, training = ames_train)
prepare
```

We can now vectorize our training and test data. If you scroll through the data
you will notice that all features are now numeric and are either 0/1 (one hot
encoded features) or have mean 0 and generally range between -3 and 3.

```{r}
baked_train <- bake(prepare, new_data = ames_train)
baked_test <- bake(prepare, new_data = ames_test)

# unit testing to ensure all columns are numeric
expect_equal(map_lgl(baked_train, ~ !is.numeric(.)) %>% sum(), 0)
expect_equal(map_lgl(baked_test, ~ !is.numeric(.)) %>% sum(), 0)

baked_train
```
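
If you want a quick sanity check (an optional addition, not part of the original
workflow), you can verify that the standardized numeric features now have mean
~0 and standard deviation ~1. This assumes these numeric columns keep their
original names after baking, which they should since only nominal features were
dummy encoded.

```{r check-standardization}
# optional sanity check: standardized features should have mean ~0 and sd ~1
baked_train %>%
  select(Gr_Liv_Area, Year_Built, Lot_Area) %>%
  summarise_all(list(mean = mean, sd = sd)) %>%
  round(2)
```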

Lastly, we need to create the final feature and response objects for the train and 
test data. __keras__ and __tensorflow__ require the features and labels to be 
separate objects, so we split them apart. Our features need to be a 2D tensor, 
which is why we apply `as.matrix`, and our response needs to be a vector, which 
is why we apply `pull`.

```{r}
x_train <- select(baked_train, -Sale_Price) %>% as.matrix()
y_train <- baked_train %>% pull(Sale_Price)

x_test <- select(baked_test, -Sale_Price) %>% as.matrix()
y_test <- baked_test %>% pull(Sale_Price)

# unit testing to ensure x & y tensors have the same number of observations
expect_equal(nrow(x_train), length(y_train))
expect_equal(nrow(x_test), length(y_test))
```

Our final feature set now has 188 input variables:

```{r}
dim(x_train)
dim(x_test)
```

# Initial model

Our initial model looks fairly similar to the MNIST model we applied. However, 
note the following differences:

* Final output layer has `units = 1` and no activation function since this is a 
  regression problem.
* We are using a different loss and metric function. The original Kaggle
  competition had you log the response variable and then use MSE, which is 
  essentially equivalent to using the MSLE loss function. [ℹ️](http://bit.ly/2R4uh2L)
* The batch size is much smaller
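
To make the MSE-on-logged-response vs. MSLE equivalence concrete, here is a small
illustration (added purely for intuition; the sale prices below are made-up
examples). Keras' MSLE is the mean squared error computed on `log(1 + y)`, which
for prices in the hundreds of thousands is essentially MSE on the logged response.

```{r msle-intuition}
# made-up prices purely for illustration
y_true <- c(119201, 168594, 301446)
y_pred <- c(125000, 160000, 310000)

msle    <- mean((log1p(y_pred) - log1p(y_true))^2)  # what loss = "msle" computes
log_mse <- mean((log(y_pred) - log(y_true))^2)      # log the response, then MSE

c(msle = msle, log_mse = log_mse)  # nearly identical at this scale
```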

```{r initial-model}
network <- keras_model_sequential() %>% 
  layer_dense(units = 128, activation = "relu", input_shape = ncol(x_train)) %>% 
  layer_dense(units = 128, activation = "relu") %>%
  layer_dense(units = 1)

network %>% compile(
    optimizer = "rmsprop",
    loss = "msle",
    metrics = c("mae")
  )
``` 

```{r model-summary}
summary(network)
```  

```{r train, results='hide'}
history <- network %>% fit(
  x_train,
  y_train,
  batch_size = 32,
  epochs = 20,
  validation_split = 0.2
)
```

Our results below show two things:

1. Our validation loss score has not reached a minimum.
2. There are no signs of overfitting.

```{r}
plot(history) + scale_y_log10()
```

# Considerations regarding batch sizes and epochs

First, let's discuss batch sizes and epochs as they can have a significant 
impact on how quickly we reach our minimum loss.

Differences in performance for differing batch sizes largely depend on the 
underlying (and typically unknown) real cost function; however, here are some
basic guidelines:

- batch sizes commonly take on values of $2^s$ (e.g. 32, 64, 128, 256, 512):
   - smaller batch sizes...
      - can have more variability; however, this variability can help us stay out
        of local minima 
      - can decrease learning time
   - larger batch sizes...
      - have less variability but 
      - can increase learning time
- optimal batch sizes can be influenced by the size of the data:
   - larger _n_ can afford larger batch sizes (128, 256, 512)
   - smaller _n_ can afford smaller batch sizes (16, 32, 64)
- Which is best?
   - I typically start with 32 or 64
   - Trial and error for your specific problem
- use enough epochs so that the loss reaches a minimum
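
If you want to see these trade-offs for yourself before the exercise below, a
rough sketch like the following (my own illustration; the batch sizes and epoch
count are arbitrary choices) refits the same architecture for a few batch sizes
and records the best validation loss:

```{r batch-size-sketch}
# illustrative only: refit the same network for a few batch sizes and compare
batch_sizes <- c(16, 64, 256)

batch_compare <- purrr::map_dfr(batch_sizes, function(bs) {
  net <- keras_model_sequential() %>%
    layer_dense(units = 128, activation = "relu", input_shape = ncol(x_train)) %>%
    layer_dense(units = 128, activation = "relu") %>%
    layer_dense(units = 1)

  net %>% compile(optimizer = "rmsprop", loss = "msle", metrics = c("mae"))

  hist <- net %>% fit(
    x_train, y_train,
    epochs = 20, batch_size = bs,
    validation_split = 0.2, verbose = 0
  )

  tibble(batch_size = bs, min_val_loss = min(hist$metrics$val_loss))
})

batch_compare
```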

Here are some good articles discussing the impacts of batch size:

   - [Effect of batch size on training dynamics](http://bit.ly/2rQdO7G)
   - [On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima](https://arxiv.org/abs/1609.04836)
   - [Deep Learning book, Section 8.1.3](http://www.deeplearningbook.org/contents/optimization.html)
   - [Impact of Training Set Batch Size on the Performance](http://bit.ly/35VWM7e)

# YOUR TURN! (5 min)

Try different batch sizes and epochs and see how model performance changes. 
Remember, batch sizes are typically powers of 2 (e.g. 16, 32, 64, 128, 256, 512).

```{r your-turn-1}
network <- keras_model_sequential() %>% 
  layer_dense(units = 128, activation = "relu", input_shape = ____) %>% 
  layer_dense(units = 128, activation = "relu") %>%
  layer_dense(units = ____)

network %>% compile(
    optimizer = "rmsprop",
    loss = "msle",
    metrics = c("mae")
  )

history <- network %>% fit(
  x_train,
  y_train,
  epochs = ____,
  batch_size = ____,
  validation_split = 0.2
)
```

# Early stopping

You likely noticed that a large batch size (e.g. 512) took many more epochs to
start reaching a minimum versus a smaller batch size (e.g. 16). But regardless
of batch size, you still had to do a fair amount of trial and error to find the
right number of epochs to reach a minimum loss score.

Let's meet our first ___callback___, which can help in this situation [ℹ️](http://bit.ly/dl-02#14). 
Early stopping allows us to crank up the number of epochs and let training 
stop automatically once there has been no improvement in our loss for 
`patience` consecutive epochs. 
  
```{r initial-model-early-stopping}
network <- keras_model_sequential() %>% 
  layer_dense(units = 128, activation = "relu", input_shape = ncol(x_train)) %>% 
  layer_dense(units = 128, activation = "relu") %>%
  layer_dense(units = 1) 

network %>%
  compile(
    optimizer = "rmsprop",
    loss = "msle",
    metrics = "mae"
  )

history <- network %>% fit(
  x_train,
  y_train,
  epochs = 250,
  batch_size = 32,
  validation_split = 0.2,
  callbacks = callback_early_stopping(patience = 5, restore_best_weights = TRUE)
)
```

```{r early-stopping-model-performance}
history

cat("\nThe minimum loss score is", min(history$metrics$val_loss) %>% round(4),
    "which occurred at epoch", which.min(history$metrics$val_loss))
```


```{r early-stopping-plot}
plot(history) + scale_y_log10()
```

# Adjustable learning rate

One thing you may notice is that there is significant learning happening for the
first 15-20 epochs and then the model slowly chips away at the loss for the next
~100+ epochs.  We can speed up this process with two things:

1. customize our optimizer with a larger learning rate to try to speed up the 
   downhill traversal of the gradient descent
2. add a callback that slowly reduces the learning rate by 20% if we don't
   experience improvement in our loss for `patience` number of epochs.
   
Note how we are now using `optimizer = optimizer_rmsprop(lr = 0.01)` instead of
`optimizer = "rmsprop"` so that we can customize the learning rate.

```{r initial-model-adj-lr}
network <- keras_model_sequential() %>% 
  layer_dense(units = 128, activation = "relu", input_shape = ncol(x_train)) %>% 
  layer_dense(units = 128, activation = "relu") %>%
  layer_dense(units = 1) 

network %>%
  compile(
    optimizer = optimizer_rmsprop(lr = 0.01),
    loss = "msle",
    metrics = c("mae")
  )

history <- network %>% fit(
  x_train,
  y_train,
  epochs = 50,
  batch_size = 32,
  validation_split = 0.2,
  callbacks = list(
        callback_early_stopping(patience = 10, restore_best_weights = TRUE),
        callback_reduce_lr_on_plateau(factor = 0.2, patience = 4)
    )
)
```

Now we see a much faster training process; it only takes roughly 25% of the 
number of epochs and it appears that we actually see an increase in performance!

```{r adj-learning-rate-performance}
history

cat("\nThe minimum loss score is", min(history$metrics$val_loss) %>% round(4),
    "which occurred at epoch", which.min(history$metrics$val_loss))
```


```{r adj-learning-rate-plot}
plot(history) + 
  scale_y_log10() +
  scale_x_continuous(limits = c(0, length(history$metrics$val_loss)))
```
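
It can also help to see exactly when the learning rate reductions kicked in.
When `callback_reduce_lr_on_plateau()` is supplied, Keras records the learning
rate used at each epoch in the history object, so a quick plot of that trace
(an optional addition to the original walkthrough) looks like:

```{r lr-trace}
# learning rate recorded per epoch by the reduce-on-plateau callback
tibble(
  epoch = seq_along(history$metrics$lr),
  lr    = history$metrics$lr
) %>%
  ggplot(aes(epoch, lr)) +
  geom_step() +
  scale_y_log10()
```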

# YOUR TURN! (5 min)

Try different variations of learning rate, patience parameters, and learning 
rate reduction factor.  Here are a couple of things to keep in mind:

- _optimizer learning rate_ ranges worth exploring differ depending on the 
  optimizer, but for RMSProp common values range from about 0.0001 to 0.1.
- _learning rate reduction factors_ typically range from 0.1 to 0.5.  
- the patience value for _early stopping_ should not be shorter than the 
  patience value for _learning rate reduction_, otherwise the learning rate will 
  never have the opportunity to decrease.

```{r your-turn-2}
network <- keras_model_sequential() %>% 
  layer_dense(units = 128, activation = ____, input_shape = ____) %>% 
  layer_dense(units = 128, activation = ____) %>%
  layer_dense(units = ____) 

network %>%
  compile(
    optimizer = optimizer_rmsprop(lr = ____),
    loss = "msle",
    metrics = c("mae")
  )

history <- network %>% fit(
  x_train,
  y_train,
  epochs = 50,
  batch_size = 32,
  validation_split = 0.2,
  callbacks = list(
        callback_early_stopping(patience = ____),
        callback_reduce_lr_on_plateau(factor = ____, patience = ____)
    )
)
```

# Under capacity

Unfortunately, our model is still underfitting when we reach our minimum
validation loss score. This is a classic sign that we are under-capacity. There
are two ways to increase model capacity [ℹ️](http://bit.ly/dl-02#17):

1. Add more units in each hidden layer
2. Add more hidden layers

In the next module, we'll look at how to dynamically assess these inputs but for
now, spend a few minutes adjusting the number of units in each hidden layer
and/or adjusting the number of hidden layers.

## YOUR TURN! (5 min)

Try different numbers of units per hidden layer and/or different numbers of 
hidden layers and see how model performance changes. Layer sizes, like batch 
sizes, are commonly powers of 2 (e.g. 128, 256, 512, 1024).

```{r your-turn-3}
network <- keras_model_sequential() %>% 
  layer_dense(units = ____, activation = "relu", input_shape = ____) %>% 
  layer_dense(units = ____, activation = "relu") %>%
  layer_dense(units = ____)

network %>%
  compile(
    optimizer = optimizer_rmsprop(lr = 0.01),
    loss = "msle",
    metrics = c("mae")
  )

history <- network %>% fit(
  x_train,
  y_train,
  epochs = 50,
  batch_size = 32,
  validation_split = 0.2,
  callbacks = list(
        callback_early_stopping(patience = 10, restore_best_weights = TRUE),
        callback_reduce_lr_on_plateau(factor = 0.2, patience = 4)
    )
)
```

## What to look for

Typically, I add units and layers until I see significant overfitting or start
to see high variability in our loss score or metrics and then constrain the model
from there. 

For example, the following model with four hidden layers of 1,024 units
each overfits at the minimum validation loss score (but not by much) and
shows signs of loss and metric variability. From here, I would start to
regularize the model by removing layers, reducing the number of units in each
layer, or using an alternative regularization method until I find a happy
compromise between model capacity, loss minimization & stability.

```{r overfit-model}
network <- keras_model_sequential() %>% 
  layer_dense(units = 1024, activation = "relu", input_shape = ncol(x_train)) %>%
  layer_dense(units = 1024, activation = "relu") %>%
  layer_dense(units = 1024, activation = "relu") %>%
  layer_dense(units = 1024, activation = "relu") %>%
  layer_dense(units = 1)

network %>%
  compile(
    optimizer = optimizer_rmsprop(lr = 0.01),
    loss = "msle",
    metrics = c("mae")
  )

history <- network %>% fit(
  x_train,
  y_train,
  epochs = 250,
  batch_size = 32,
  validation_split = 0.2,
  callbacks = list(
        callback_early_stopping(patience = 10, restore_best_weights = TRUE),
        callback_reduce_lr_on_plateau(factor = 0.2, patience = 4)
    )
)
```

```{r}
cat("The minimum loss score is", min(history$metrics$val_loss) %>% round(4),
    "which occurred at epoch", which.min(history$metrics$val_loss))
```


```{r, message=FALSE}
plot(history) + 
  scale_y_log10() +
  scale_x_continuous(limits = c(0, length(history$metrics$val_loss)))
```
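
Since we trained with `restore_best_weights = TRUE`, the network currently holds
the weights from its best epoch, so (as an optional extra step, not part of the
original walkthrough) we could also estimate generalization error on the held-out
test data:

```{r test-evaluation}
# evaluate the restored best weights on the untouched test set
network %>% evaluate(x_test, y_test)
```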


# Generalizing to small datasets

Note that finding an optimal model that generalizes well based on our validation
approach may be difficult. This is because our validation data (via
`validation_split = 0.2`) consists of only about 400 samples. Consequently, model
performance will be highly dependent on those ~400 samples.

In general, the fewer observations in our validation set, the greater the
variance in our loss score. As the number of observations in our validation data 
increases, the variance in our loss score decreases. However, we do not always 
have the option to just go out and get more data. So, if we want to gain a more 
accurate understanding of the loss score and its variance, we could perform 
_k-fold cross validation_. 

See the [validation procedures notebook](https://rstudio-conf-2020.github.io/dl-keras-tf/notebooks/validation-procedures.nb.html) 
for an example of performing k-fold cross validation.
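
For a rough idea of what that could look like with the tools we already loaded
(this is a simplified, illustrative sketch, not the code from that notebook; the
fold count, epochs, and callback settings are arbitrary assumptions), each fold
re-preps the recipe on its analysis set and records the best validation loss:

```{r kfold-sketch, eval=FALSE}
# illustrative k-fold CV sketch; expect this to take a while to run
cv_folds <- vfold_cv(ames_train, v = 5)

cv_results <- purrr::map_dbl(cv_folds$splits, function(split) {
  # re-prep the recipe on this fold's analysis set to avoid information leakage
  prep_fold  <- prep(blueprint, training = analysis(split))
  train_fold <- bake(prep_fold, new_data = analysis(split))
  valid_fold <- bake(prep_fold, new_data = assessment(split))

  x_tr <- select(train_fold, -Sale_Price) %>% as.matrix()
  y_tr <- pull(train_fold, Sale_Price)
  x_va <- select(valid_fold, -Sale_Price) %>% as.matrix()
  y_va <- pull(valid_fold, Sale_Price)

  net <- keras_model_sequential() %>%
    layer_dense(units = 128, activation = "relu", input_shape = ncol(x_tr)) %>%
    layer_dense(units = 128, activation = "relu") %>%
    layer_dense(units = 1)

  net %>% compile(
    optimizer = optimizer_rmsprop(lr = 0.01),
    loss = "msle",
    metrics = c("mae")
  )

  hist <- net %>% fit(
    x_tr, y_tr,
    epochs = 50, batch_size = 32,
    validation_data = list(x_va, y_va),
    callbacks = list(
      callback_early_stopping(patience = 10, restore_best_weights = TRUE),
      callback_reduce_lr_on_plateau(factor = 0.2, patience = 4)
    ),
    verbose = 0
  )

  min(hist$metrics$val_loss)
})

# average and spread of the per-fold validation loss
c(mean = mean(cv_results), sd = sd(cv_results))
```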

# Key takeaways

* Preparing tabular data
   - We need to vectorize categorical features
   - We need to standardize and normalize numeric features
* Batch size & epochs
   - Smaller batch sizes tend to perform best (speed and loss minimization)
   - However, this is a hyperparameter you can adjust and evaluate
   - Make sure you include enough epochs to reach a minimum loss
* Learning rate
   - The most important hyperparameter to tune
   - Typically you want to tune the learning rate and batch size together
* Callbacks
   - Use early stopping to automatically stop training once the loss stops
     improving
   - Use `callback_reduce_lr_on_plateau()` for more control over the learning 
     rate.
* Knowing when to adjust model capacity
   - Your minimum validation loss score should not be lower than your training
     loss; if it is, that is a sign you are still under capacity
   - In that case, increase model capacity until you've balanced overfitting,
     minimization of the validation loss, and stability