class: center, middle, inverse, title-slide

# Hello [Deep Learning] World

### Brad Boehmke

### 2020-01-27

---
class: clear, center, middle
background-image: url(images/MnistExamples.png)
background-size: cover

.font1000.bold[MNIST]

---
# Origination

.scrollable90[

.pull-left[

* National Institute of Standards and Technology (NIST) database

* MNIST (Modified NIST)

* 60,000 training images and 10,000 testing images

* normalized to fit into a 28x28 pixel bounding box

]

.pull-right[

<img src="images/nist-sample-form.png" width="679" style="display: block; margin: auto;" />

]
]

---
# Important benchmark

.scrollable90[

.pull-left[

* Served as an important benchmark for image processing from the 1990s to 2012

* 1998: 12% error rate

* 2012: 0.23% error rate

* Website: http://yann.lecun.com/exdb/mnist/

]

.pull-right[

<img src="images/MNIST-benchmarks.png" width="1145" style="display: block; margin: auto;" />

]
]

---
class: clear, center, middle

.font1000.bold[`%<-%`]

.font300[object unpacking]

---
# zeallot
<i class="fas fa-box-open faa-FALSE animated "></i>
Object unpacking mimics tuple unpacking in Python

--

A simple vector

```r
my_name <- c('Brad', 'Boehmke')
```

--

.pull-left[

Traditional assignment unpacking

```r
first <- my_name[1]
last  <- my_name[2]
```

]

.pull-right[

Object unpacking

```r
c(first, last) %<-% my_name
```

]

--

Both result in:

```r
first
## [1] "Brad"

last
## [1] "Boehmke"
```

---
# zeallot
<i class="fas fa-box-open faa-FALSE animated "></i>
```r
mnist <- dataset_mnist()
str(mnist)
## List of 2
##  $ train:List of 2
##   ..$ x: int [1:60000, 1:28, 1:28] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ y: int [1:60000(1d)] 5 0 4 1 9 2 1 3 1 4 ...
##  $ test :List of 2
##   ..$ x: int [1:10000, 1:28, 1:28] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ y: int [1:10000(1d)] 7 2 1 0 4 1 4 9 5 9 ...
```

.pull-left[

Traditional assignment unpacking

```r
mnist <- dataset_mnist()
train_images <- mnist$train$x
train_labels <- mnist$train$y
test_images <- mnist$test$x
test_labels <- mnist$test$y
```

]

.pull-right[

Object unpacking

```r
c(c(train_images, train_labels), c(test_images, test_labels)) %<-% mnist
```

]

---
class: clear, center, middle

.font1000.bold[Tensors]

---
# The .red[tensor] in TensorFlow

.pull-left[

<img src="images/whats_a_tensor.png" width="667" style="display: block; margin: auto;" />

]

--

.pull-right[

<br><br><br>

.center.bold[
_Don't worry, you actually use tensors every day (at least every day you use R!)_
]

]

---
# The .red[tensor] in TensorFlow

.pull-left[

<br><br><br><br>

<img src="images/1D_tensor.png" width="499" style="display: block; margin: auto;" />

]

.pull-right[

<br><br><br>

.center.bold.opacity20[
_Don't worry, you actually use tensors every day (at least every day you use R!)_
]

.center.bold.blue[Vectors are 1D tensors]

]

---
# The .red[tensor] in TensorFlow

.pull-left[

<br><br>

<img src="images/2D_tensor.png" width="536" style="display: block; margin: auto;" />

]

.pull-right[

<br><br><br>

.center.bold.opacity20[
_Don't worry, you actually use tensors every day (at least every day you use R!)_

Vectors are 1D tensors
]

.center.bold.blue[Matrices are 2D tensors]

]

---
# MNIST tensor

* Since the MNIST images are grayscale, the data can be represented as a 2D tensor

* We just need to reshape it so that:
   - each column
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
feature
   - each row
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
observation

.pull-left[

.center[`array_reshape` reshapes 3D array to...]

<img src="images/untidy_matrix.png" width="679" style="display: block; margin: auto;" />

]

.pull-right[

.center[2D tensor]

<img src="images/tidy_matrix.png" width="2185" style="display: block; margin: auto;" />

]

---
# .red[Tensor] benefits

<br>

* .red.bold[Generalization]: tensors generalize vectors and matrices to an arbitrary number of dimensions

* .red.bold[Flexibility]: they can hold a wide range of data dimensions

* .red.bold[Speed]: they support fast, parallel computation

<br><br><br><br><br><br>

--

.center.bold[_They just get a bit complicated when you start working with higher dimensions_]

---
# .red.bold[3D] Tensor

.pull-left[

* Represented as arrays

* Sequence data
   - time series
   - text
   - dim = (observations, seq steps, features)

* Examples
   - 250 days of high, low, and current stock price for 390 minutes of trading in a day; dim = c(250, 390, 3)
   - 1M tweets that can be 140 characters long and include 128 unique characters; dim = c(1M, 140, 128)

]

.pull-right[

<img src="images/3D_tensor.png" width="564" style="display: block; margin: auto;" />

]

---
# .red.bold[4D] Tensor

.pull-left[

* Represented as arrays

* Image data
   - RGB channels
   - dim = (observations, height, width, color_depth)

]

.pull-right[

<img src="images/4D_tensor.png" width="745" style="display: block; margin: auto;" />

]

---
# .red.bold[4D] Tensor

.pull-left[

* Represented as arrays

* Image data
   - RGB channels
   - dim = (observations, height, width, .red[color_depth])

* Technically, we could treat our original MNIST data as a 4D tensor where .red[color_depth = 1]

* We'll see this play out when we start working with CNNs

]

.pull-right[

<br><br>

<img src="images/untidy_matrix.png" width="679" style="display: block; margin: auto;" />

]

---
# .red.bold[5D] Tensor

.pull-left[

* Represented as arrays

* Video data
   - samples: 4 (each video is 1 minute long)
   - frames: 240 (4 frames/second)
   - width: 256 (pixels)
   - height: 144 (pixels)
   - channels: 3 (red, green, blue)

* Tensor shape (4, 240, 256, 144, 3)

]

.pull-right[

<img src="images/5D_tensor.jpg" width="853" style="display: block; margin: auto;" />

]

---
# Now you know what tensors are

.pull-left[

<br>

* Tensors aren't that bad; humans are just really bad at visualizing multiple dimensions!

* Feeling comfortable will come with practice

]

.pull-right[

<br>

<img src="images/tensors_everywhere.jpeg" width="573" style="display: block; margin: auto;" />

]

---
class: clear, center, middle

.font500.bold[Network architecture]

---
# Sequential vs functional

.pull-left[

<img src="images/sequential_model.png" width="607" style="display: block; margin: auto;" />

* Creates a single linear stack of layers

* Most common type of neural network

* Examples:
   - predicting sale price based on tabular data of home characteristics
   - predicting animal based on image
   - predicting author based on text
   - predicting hurricane path based on numeric meteorological data

]

--

.pull-right[

<img src="images/functional_model.png" width="769" style="display: block; margin: auto;" />

* More advanced modeling

* Allows flexible, customizable model structures (see the sketch on the next slide)

* Examples:
   - predicting presence of cancer based on images <u>.bold[_and_]</u> patient transcripts
   - forecasting time <u>.bold[_and_]</u> volume of sales based on tabular transaction data <u>.bold[_and_]</u> customer text

]
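---
# Sequential vs functional

The same contrast in code. A minimal sketch, assuming the keras package is loaded; the layer sizes, input shapes, and variable names here are purely illustrative.

.pull-left[

Sequential API: one linear stack of layers

```r
library(keras)

# input -> hidden -> output, built as one pipe
model <- keras_model_sequential() %>%
  layer_dense(units = 32, activation = "relu", input_shape = 10) %>%
  layer_dense(units = 1)
```

]

.pull-right[

Functional API: layers as composable objects, which allows multiple inputs

```r
# two separate inputs (e.g., image features and text features)
input_a <- layer_input(shape = 10)
input_b <- layer_input(shape = 20)

# merge the two branches, then predict a single output
merged <- layer_concatenate(list(input_a, input_b))
output <- merged %>% layer_dense(units = 1)

model <- keras_model(inputs = list(input_a, input_b), outputs = output)
```

]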
---
# Densely connected layers

.pull-left[

* `layer_dense()` creates what's called a .bold[_fully connected feedforward neural network_]

* Fundamental building block of nearly all deep learning models

]

.pull-right[

<img src="images/basic_mlp.png" width="728" style="display: block; margin: auto;" />

]

---
# Densely connected layers

.pull-left[

* `layer_dense()` creates what's called a .bold[_fully connected feedforward neural network_]

* Fundamental building block of nearly all deep learning models

* So why do we call `layer_dense()` twice? And what about the arguments inside?

```r
network <- keras_model_sequential() %>%
* layer_dense(units = 512, activation = 'relu', input_shape = ncol(train_images)) %>%
* layer_dense(units = 10, activation = 'softmax')
```

]

.pull-right[

<img src="images/basic_mlp.png" width="728" style="display: block; margin: auto;" />

]

---
# Densely connected layers

.pull-left[

```r
network <- keras_model_sequential() %>%
* layer_dense() %>% # hidden layer
* layer_dense()     # output layer
```

* .font100[Each `layer_dense()` represents a hidden layer or the final output layer]

]

.pull-right[

<img src="images/basic_mlp.png" width="728" style="display: block; margin: auto;" />

]

<br>

.center[.content-box-grey[We refer to a neural network with one or more hidden layers as a .blue[_multi-layer perceptron_]]]

---
# Densely connected layers

.pull-left[

```r
network <- keras_model_sequential() %>%
* layer_dense() %>% # hidden layer 1
* layer_dense() %>% # hidden layer 2
* layer_dense() %>% # hidden layer 3
* layer_dense()     # output layer
```

* We can add multiple hidden layers by adding more `layer_dense()` functions

* Technically, .blue[_deep learning_] refers to any neural network with 2 or more hidden layers

* The last `layer_dense()` will always represent the output layer

]

.pull-right[

<img src="images/basic_feedforward.png" width="960" style="display: block; margin: auto;" />

]

---
# Hidden layer

```r
network <- keras_model_sequential() %>%
* layer_dense(units = 512, activation = 'relu', input_shape = ncol(train_images)) # hidden layer
```

--

.pull-left[

* `units = 512`: number of nodes in the given layer

* `input_shape = ncol(train_images)`
   - tells the first hidden layer how many input features there are
   - only required for the first `layer_dense`

]

.pull-right[

<img src="images/hidden_layer.png" width="715" style="display: block; margin: auto;" />

]

---
# Hidden layer

```r
network <- keras_model_sequential() %>%
* layer_dense(units = 512, activation = 'relu', input_shape = ncol(train_images)) # hidden layer
```

.pull-left[

* `units = 512`: number of nodes in the given layer

* `input_shape = ncol(train_images)`
   - tells the first hidden layer how many input features there are
   - only required for the first `layer_dense`

* `activation`:
<img src="https://emojis.slackmojis.com/emojis/images/1499373537/2585/homer_thinking.png?1499373537" style="height:1em; width:auto; "/>
]

.pull-right[

<img src="images/perceptron_zoom.png" width="739" style="display: block; margin: auto;" />

]

---
# Individual perceptron

.font100.pull-left[

* There is a two-step computation process when data goes through a node

]

.pull-right[

<img src="images/perceptron1.png" width="699" style="display: block; margin: auto;" />

]

---
# Individual perceptron

.font100.pull-left[

* There is a two-step computation process when data goes through a node

* Step 1: _linear transformation_
   - `\(z = w_0b_0 + w_1x_1 + w_2x_2 + \cdots + w_nx_n\)`
   - note the extra bias term, which is typically set to 1

]

.pull-right[

<img src="images/perceptron2.png" width="704" style="display: block; margin: auto;" />

]

---
# Individual perceptron

.pull-left[

* There is a two-step computation process when data goes through a node

* .opacity20[Step 1: linear transformation]

* Step 2: _activation function_
   - in hidden layers, the most common activation function is the `\(\text{ReLU} = max(0, z)\)`
   - you will be introduced to other activation functions, but ReLU should nearly always be your default for hidden layers

<img src="02-hello-dl-world_files/figure-html/unnamed-chunk-33-1.png" style="display: block; margin: auto;" />

]

.pull-right[

<img src="images/perceptron3.png" width="701" style="display: block; margin: auto;" />

]

---
# Many ReLU transformations

.pull-left[

<img src="02-hello-dl-world_files/figure-html/unnamed-chunk-35-1.png" style="display: block; margin: auto;" />

How can a simple semi-linear transformation work so well?

]

---
# Many ReLU transformations

.pull-left[

<img src="02-hello-dl-world_files/figure-html/unnamed-chunk-36-1.png" style="display: block; margin: auto;" />

How can a simple semi-linear transformation work so well?

* Many simple geometric transformations can produce very complex patterns (sketched in code on the next slide).

* Other benefits (see [Glorot et al. (2011)](http://proceedings.mlr.press/v15/glorot11a/glorot11a.pdf)):
   - computational simplicity
   - representational sparsity
   - linear behavior

]

.pull-right[

<img src="images/demoui.gif" style="display: block; margin: auto;" />

.right.font50[[http://apps.amandaghassaei.com/OrigamiSimulator/](http://apps.amandaghassaei.com/OrigamiSimulator/)]

]
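---
# Many ReLU transformations

A minimal base R sketch of the idea above: each ReLU layer is piecewise linear, yet stacking a few of them already produces a complex function. The weights here are random and purely illustrative.

```r
set.seed(123)
relu <- function(z) pmax(0, z)  # max(0, z), applied elementwise

x <- seq(-3, 3, length.out = 200)  # a 1D input grid
h <- rbind(x)                      # 1 feature x 200 observations

# stack three hidden "layers": random linear transformation + ReLU
for (i in 1:3) {
  W <- matrix(rnorm(8 * nrow(h)), nrow = 8)  # random weights
  b <- rnorm(8)                              # random biases
  h <- relu(W %*% h + b)                     # linear step, then ReLU
}

# a final linear combination of the hidden units
y <- colSums(h * rnorm(nrow(h)))

# a surprisingly wiggly, non-trivial function of x
plot(x, y, type = "l")
```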
---
# Output layer

```r
network <- keras_model_sequential() %>%
  layer_dense(units = 512, activation = 'relu', input_shape = ncol(train_images)) %>%
* layer_dense(units = 10, activation = 'softmax')
```

<br><br>

.center.blue.font120[Completely dependent on the type of prediction you are making]

---
# Output layer

```r
network <- keras_model_sequential() %>%
  layer_dense(units = 512, activation = 'relu', input_shape = ncol(train_images)) %>%
  layer_dense(units = 10, activation = 'softmax')
```

.font100.pull-left[

Two primary arguments of concern for the final output layer:

1. number of units
   - regression: `units = 1`

]

.pull-right[

<img src="images/output_layer_continuous.png" width="80%" height="80%" style="display: block; margin: auto;" />

]

---
# Output layer

```r
network <- keras_model_sequential() %>%
  layer_dense(units = 512, activation = 'relu', input_shape = ncol(train_images)) %>%
  layer_dense(units = 10, activation = 'softmax')
```

.font100.pull-left[

Two primary arguments of concern for the final output layer:

1. number of units
   - regression: `units = 1`
   - binary classification: `units = 1`

]

.pull-right[

<img src="images/output_layer_binary.png" width="80%" height="80%" style="display: block; margin: auto;" />

]

---
# Output layer

```r
network <- keras_model_sequential() %>%
  layer_dense(units = 512, activation = 'relu', input_shape = ncol(train_images)) %>%
* layer_dense(units = 10, activation = 'softmax')
```

.font100.pull-left[

Two primary arguments of concern for the final output layer:

1. number of units
   - regression: `units = 1`
   - binary classification: `units = 1`
   - multi-class classification: `units = n`

]

.pull-right[

<img src="images/output_layer_multi.png" width="75%" height="75%" style="display: block; margin: auto;" />

]

---
# Output layer

```r
network <- keras_model_sequential() %>%
  layer_dense(units = 512, activation = 'relu', input_shape = ncol(train_images)) %>%
  layer_dense(units = 10, activation = 'softmax')
```

.font100.pull-left[

Two primary arguments of concern for the final output layer:

1. .opacity20[number of units]

2. activation function
   - regression: `activation = NULL` (identity function)

]

.pull-right.center[

<img src="images/activation_identity.png" width="804" style="display: block; margin: auto;" />

`\(y = w_0b_0 + w_1h^1_1 + w_2h^1_2 + \cdots + w_nh^1_n\)`

]

---
# Output layer

```r
network <- keras_model_sequential() %>%
  layer_dense(units = 512, activation = 'relu', input_shape = ncol(train_images)) %>%
  layer_dense(units = 10, activation = 'softmax')
```

.font100.pull-left[

Two primary arguments of concern for the final output layer:

1. .opacity20[number of units]

2. activation function
   - regression: `activation = NULL` (identity function)
   - binary classification: `activation = 'sigmoid'`

]

.pull-right.center[

<img src="02-hello-dl-world_files/figure-html/unnamed-chunk-48-1.png" style="display: block; margin: auto;" />

`\(f(y) = \frac{1}{1 + e^{-y}}\)`

]

---
# Output layer

```r
network <- keras_model_sequential() %>%
  layer_dense(units = 512, activation = 'relu', input_shape = ncol(train_images)) %>%
* layer_dense(units = 10, activation = 'softmax')
```

.font100.pull-left[

Two primary arguments of concern for the final output layer:

1. .opacity20[number of units]

2. activation function
   - regression: `activation = NULL` (identity function)
   - binary classification: `activation = 'sigmoid'`
   - multi-class classification: `activation = 'softmax'`

]

.pull-right.center[

<img src="images/softmax.png" width="800" style="display: block; margin: auto;" />

`\(f(y) = \frac{e^{y_i}}{\sum_je^{y_j}}\)`

]

---
# Output layer

```r
network <- keras_model_sequential() %>%
  layer_dense(units = 512, activation = 'relu', input_shape = ncol(train_images)) %>%
* layer_dense(units = 10, activation = 'softmax')
```

.font100.pull-left[

Two primary arguments of concern for the final output layer:

1. .opacity20[number of units]

2. activation function
   - regression: `activation = NULL` (identity function)
   - binary classification: `activation = 'sigmoid'`
   - multi-class classification: `activation = 'softmax'` (sketched on the next slide)

]

.pull-right.center[

<img src="images/softmax.png" width="800" style="display: block; margin: auto;" />

]

.center[.content-box-grey.font90.blue[_This is why we used `to_categorical()`, so that we can get the probabilities for each output class._]]
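---
# Output layer

A quick base R sketch of the softmax formula above; the raw scores are made up for illustration.

```r
softmax <- function(y) exp(y) / sum(exp(y))

# hypothetical raw scores from the 10 output nodes
scores <- c(2.1, -0.4, 0.3, 1.5, -1.2, 0.8, 0.1, -0.6, 0.9, 0.2)

probs <- softmax(scores)
round(probs, 3)  # one probability per digit class, 0-9
sum(probs)       # always sums to 1
```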
---
# Network architecture summary

.pull-left-30.font80[

1. A sequential, dense, fully connected neural network

2. 784 inputs

3. 1 hidden layer
   a. 512 nodes
   b. ReLU activation function

<br><br>

4. multi-class output layer
   a. 10 nodes (1 for each output class)
   b. Softmax activation function

]

.pull-right-70[

<img src="images/network_architecture_summary.png" width="892" style="display: block; margin: auto;" />

]

---
# Network architecture summary

.pull-left.font80[

1. A sequential, dense, fully connected neural network

2. 784 inputs

3. 1 hidden layer
   a. 512 nodes
   b. ReLU activation function

   `\(params = (784 \times 512) + 512 = 401920\)`

4. multi-class output layer
   a. 10 nodes
   b. Softmax activation function

   `\(params = (512 \times 10) + 10 = 5130\)`

]

.pull-right[

```r
summary(network)
## Model: "sequential"
## ______________________________________________________________________________________
## Layer (type)                          Output Shape                          Param #
## ======================================================================================
## dense (Dense)                         (None, 512)                           401920
## ______________________________________________________________________________________
## dense_1 (Dense)                       (None, 10)                            5130
## ======================================================================================
## Total params: 407,050
## Trainable params: 407,050
## Non-trainable params: 0
## ______________________________________________________________________________________
```

]

---
class: clear, center, middle

.font500.bold[Network compilation]

---
# Forward pass

.pull-left[

]

.pull-right[

<img src="images/forward_pass.png" width="557" style="display: block; margin: auto;" />

]

---
# Forward pass

.pull-left.font120[

<br><br>

* Weights are _initialized_ as very small semi-random values

]

.pull-right[

<img src="images/forward_pass.png" width="557" style="display: block; margin: auto;" />

]

---
# Forward pass

.pull-left.font100[

<br><br><br>

* Weights are _initialized_ as very small semi-random values

* Results in predicted values that are significantly different than the actual targets

]

.pull-right[

<img src="images/forward_pass2.png" width="559" style="display: block; margin: auto;" />

]

---
# Forward pass

.pull-left.code70[

<br><br><br>

* Weights are _initialized_ as very small semi-random values

* Results in predicted values that are significantly different than the actual targets

* We measure this difference with a _loss function_

```r
network %>% compile(
* loss = "categorical_crossentropy",
  optimizer = "rmsprop",
  metrics = c("accuracy")
)
```

]

.pull-right[

<img src="images/forward_pass3.png" width="561" style="display: block; margin: auto;" />

]

---
# Loss function

_Loss function (objective function)_: the quantity that will be minimized during training.

.pull-left[

* Many options

* Should use the one that aligns best to the problem at hand; however,...

]

.pull-right[

* mean squared error (MSE)
* mean absolute error (MAE)
* mean absolute percentage error (MAPE)
* mean squared logarithmic error (MSLE)
* squared hinge
* log-cosh
* binary crossentropy
* categorical hinge
* categorical crossentropy
* sparse categorical crossentropy
* Kullback-Leibler divergence
* Poisson
* cosine proximity
* can even build your own custom loss functions!

]

---
# Loss function

_Loss function (objective function)_: the quantity that will be minimized during training.

.pull-left[

* Many options

* Should use the one that aligns best to the problem at hand; however,...
* general recommendations for common problems include:
   - Regression: MSE
   - Binary classification: binary crossentropy
   - Multi-class classification: categorical crossentropy

]

.pull-right[

* .blue[mean squared error (MSE)]
* mean absolute error (MAE)
* mean absolute percentage error (MAPE)
* mean squared logarithmic error (MSLE)
* squared hinge
* log-cosh
* .blue[binary crossentropy]
* categorical hinge
* .blue[categorical crossentropy]
* sparse categorical crossentropy
* Kullback-Leibler divergence
* Poisson
* cosine proximity
* can even build your own custom loss functions!

]

---
# Loss function

_Loss function (objective function)_: the quantity that will be minimized during training.

.pull-left[

* Many options

* Should use the one that aligns best to the problem at hand; however,...

* general recommendations for common problems include:
   - Regression: MSE
   - Binary classification: binary crossentropy
   - Multi-class classification: categorical crossentropy

.center[.content-box-grey.font80[_All three heavily penalize bad predictions!_]]

]

.pull-right[

* .blue[mean squared error (MSE)]
* mean absolute error (MAE)
* mean absolute percentage error (MAPE)
* mean squared logarithmic error (MSLE)
* squared hinge
* log-cosh
* .blue[binary crossentropy]
* categorical hinge
* .blue[categorical crossentropy]
* sparse categorical crossentropy
* Kullback-Leibler divergence
* Poisson
* cosine proximity
* can even build your own custom loss functions!

]

---
# Loss function

.pull-left[

<br><br><br><br><br>

.center.blue.bold[_Our goal is to find weights that minimize the loss score_]

]

.pull-right[

<img src="images/forward_pass4.png" width="568" style="display: block; margin: auto;" />

]

---
# Backward pass

.pull-left[

<br><br><br><br><br>

.center.blue.bold[_Our goal is to find weights that minimize the loss score_]

The ___backward pass___ is the process of using information from the loss function to update the weights so that we improve the model's performance (sketched in code on the next slide).

]

.pull-right[

<img src="images/backward_pass.png" width="564" style="display: block; margin: auto;" />

]
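---
# Backward pass

A toy numeric sketch of one forward + backward pass, assuming a single weight `w`, a single observation, and squared-error loss (all values made up for illustration):

```r
x <- 2; y <- 10          # one input and its target
w <- 0.5                 # initialized weight
eta <- 0.1               # learning rate

y_hat <- w * x           # forward pass: prediction (1.0)
loss  <- (y - y_hat)^2   # loss: how far off are we? (81)

grad <- -2 * x * (y - y_hat)  # d(loss)/d(w) via calculus (-36)
w    <- w - eta * grad        # backward pass: step against the gradient

w * x  # updated prediction is 8.2, much closer to 10
```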
---
# Backward pass example

.pull-left[

<img src="images/backward_pass_ex1.png" width="747" style="display: block; margin: auto;" />

]

---
# Backward pass example

.pull-left[

<img src="images/backward_pass_ex2.png" width="749" style="display: block; margin: auto;" />

]

---
# Backward pass example

.pull-left[

<img src="images/backward_pass_ex3.png" width="751" style="display: block; margin: auto;" />

]

---
# Backward pass example

.pull-left[

<img src="images/backward_pass_ex4.png" width="751" style="display: block; margin: auto;" />

]

--

.pull-right[

<br><br><br>

<img src="02-hello-dl-world_files/figure-html/unnamed-chunk-66-1.png" style="display: block; margin: auto;" />

]

---
# Derivatives

_Sensitivity_ is just the change in one thing per one-unit change in another.

.pull-left[

.font100[
$$= \frac{\text{Change in temperature error}}{\text{Change in shower handle position}}$$
]

<br>

.font100[
$$= \frac{\Delta \text{temperature error}}{\Delta \text{shower handle position}}$$
]

]

.pull-right[

<br>

<img src="02-hello-dl-world_files/figure-html/unnamed-chunk-67-1.png" style="display: block; margin: auto;" />

]

---
# Derivatives

_Sensitivity_ is just the change in one thing per one-unit change in another.

.pull-left[

.font100[
$$= \frac{\text{Change in temperature error}}{\text{Change in shower handle position}}$$
]

<br>

.font100[
$$= \frac{\Delta \text{temperature error}}{\Delta \text{shower handle position}}$$
]

]

.pull-right[

<br>

<img src="02-hello-dl-world_files/figure-html/unnamed-chunk-68-1.png" style="display: block; margin: auto;" />

]

<br>

.center[.content-box-grey[Sensitivity is not always linear.]]

---
# Derivatives

_Sensitivity_ is just the change in one thing per one-unit change in another.

.pull-left[

.font100[
$$= \frac{\text{Change in temperature}}{\text{Change in shower handle position}}$$
]

<br>

.font100[
$$= \frac{\Delta \text{temperature}}{\Delta \text{shower handle position}}$$
]

<br>

.font100[
$$= \frac{\text{d(temperature)}}{\text{d(shower handle position)}}$$
]

]

.pull-right[

<br>

<img src="02-hello-dl-world_files/figure-html/unnamed-chunk-69-1.png" style="display: block; margin: auto;" />

]

---
# More than one variable?

What if we have more than one variable?

.pull-left[

<img src="images/hot_and_cold.png" width="419" style="display: block; margin: auto;" />

]

.pull-right[

<img src="02-hello-dl-world_files/figure-html/unnamed-chunk-71-1.png" style="display: block; margin: auto;" />

]

---
# Partial derivatives

What if we have more than one variable?

.pull-left[

.font100[
$$= \frac{\text{Change in temperature}}{\text{Change in shower handle position}}$$
]

<br>

.font100[
$$= \frac{\Delta \text{temperature}}{\Delta \text{shower handle position}}$$
]

<br>

.font100[
$$= \frac{\partial \text{ temperature}}{\partial \text{ shower handle position}}$$
]

]

.pull-right[

<img src="02-hello-dl-world_files/figure-html/unnamed-chunk-72-1.png" style="display: block; margin: auto;" />

]
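---
# Partial derivatives

A partial derivative is easy to approximate numerically: nudge one input a tiny amount while holding the others fixed. A minimal base R sketch with a made-up temperature response:

```r
# hypothetical temperature as a function of both handle positions
temperature <- function(hot, cold) 20 + 30 * sin(hot) - 15 * sin(cold)

# partial derivative w.r.t. the hot handle: wiggle hot, hold cold fixed
h <- 1e-6
(temperature(1 + h, 2) - temperature(1, 2)) / h
## close to the analytic value 30 * cos(1) = 16.209...
```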
---
# Gradient descent

.pull-left[

* We can use derivative information to point us in the right direction.

* Stepping in the opposite direction of the partial derivative leads us to the minimum.

* The size of the step is called our ___learning rate ( `\(\eta\)` )___

.white[
- Too small: will take forever to converge
- Too big: risk never finding the minimum]

`$$-\eta \text{ }\frac{\partial \text{ temperature}}{\partial \text{ shower handle position}}$$`

]

.pull-right[

<br>

<img src="02-hello-dl-world_files/figure-html/unnamed-chunk-73-1.png" style="display: block; margin: auto;" />

]

---
# Gradient descent

.pull-left[

* We can use derivative information to point us in the right direction.

* Stepping in the opposite direction of the partial derivative leads us to the minimum.

* The size of the step is called our ___learning rate ( `\(\eta\)` )___
   - Too small: will take forever to converge
   - Too big: risk jumping over the minimum

`$$-\eta \text{ }\frac{\partial \text{ temperature}}{\partial \text{ shower handle position}}$$`

]

.pull-right[

<img src="02-hello-dl-world_files/figure-html/unnamed-chunk-74-1.png" style="display: block; margin: auto;" />

]

---
# Chain rule

.pull-left[

* However, typically our problems have many layers of transformations `\(\rightarrow \hat y = f(g(x))\)`.

* The ___chain rule___ states that we can compute partial derivatives for each layer as `\(\partial \text{layer}_2 \times \partial \text{layer}_1 \Rightarrow \partial f(g(x)) \times \partial g(x)\)` (checked numerically on the next slide)

* Consequently, we can compute the following:
   - `\(\Delta s = -\eta \frac{\partial \hat{y}}{\partial s} \cdot \frac{\partial E}{\partial \hat{y}}\)`
   - `\(\Delta m = -\eta \frac{\partial h}{\partial m} \cdot \frac{\partial \hat{y}}{\partial h} \cdot \frac{\partial E}{\partial \hat{y}}\)`

]

.pull-right[

<img src="images/multi-layer-shower.png" width="809" style="display: block; margin: auto;" />

]

<br><br>

--

.center[.content-box-grey[_m_ and _s_ simply represent the weights of the first and second layers of a hypothetical neural net.]]
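---
# Chain rule

A quick numeric check of the chain rule, using made-up functions for the two "layers":

```r
g <- function(x) x^2        # layer 1
f <- function(u) 3 * u + 1  # layer 2
y <- function(x) f(g(x))    # composed network: y = f(g(x)) = 3x^2 + 1

# chain rule: dy/dx = f'(g(x)) * g'(x) = 3 * 2x
x <- 2
analytic <- 3 * 2 * x

# finite-difference check on the composition
h <- 1e-6
numeric <- (y(x + h) - y(x)) / h

c(analytic, numeric)  # both ~12
```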
<i class="ai ai-google-scholar faa-tada animated-hover "></i>(Sebastian Ruder)
](http://ruder.io/optimizing-gradient-descent/).

* A main differentiating component is ___momentum___.

]

.pull-right[

<img src="02-hello-dl-world_files/figure-html/unnamed-chunk-76-1.gif" style="display: block; margin: auto;" />

]

---
# Backpropagation

.pull-left[

* This updating of our weights is what is called ___backpropagation___.

* This simple example illustrated basic gradient descent backpropagation.

* There have been several algorithms developed that slightly modify and _optimize_ this approach [
<i class="ai ai-google-scholar faa-tada animated-hover "></i>(Sebastian Ruder)
](http://ruder.io/optimizing-gradient-descent/).

* A main differentiating component is ___momentum___.

]

.pull-right[

<img src="images/saddle_point_evaluation_optimizers.gif" style="display: block; margin: auto;" />

.right.font50[[Sebastian Ruder (2016)](https://ruder.io/optimizing-gradient-descent/)]

]

---
# Backpropagation

.pull-left[

* This updating of our weights is what is called ___backpropagation___.

* This simple example illustrated basic gradient descent backpropagation.

* There have been several algorithms developed that slightly modify and _optimize_ this approach [
<i class="ai ai-google-scholar faa-tada animated-hover "></i>(Sebastian Ruder)
](http://ruder.io/optimizing-gradient-descent/).

* A main differentiating component is ___momentum___.

]

.pull-right[

<br><br><br>

```r
network %>% compile(
  loss = "categorical_crossentropy",
* optimizer = "rmsprop",
  metrics = c("accuracy")
)
```

]

.center[.content-box-grey[RMSprop & Adam are two great default optimizers.]]

---
# Tracking additional metrics

In addition to the loss function, there are many other metrics we can track.

.pull-left[

* mean squared error (MSE)
* mean absolute error (MAE)
* mean absolute percentage error (MAPE)
* mean squared logarithmic error (MSLE)
* squared hinge
* log-cosh
* overall accuracy
* binary crossentropy
* categorical hinge
* categorical crossentropy
* sparse categorical crossentropy
* Kullback-Leibler divergence
* Poisson
* cosine proximity
* can even build your own custom metrics to track!

]

---
# Tracking additional metrics

In addition to the loss function, there are many other metrics we can track.

.pull-left[

* mean squared error (MSE)
* mean absolute error (MAE)
* mean absolute percentage error (MAPE)
* mean squared logarithmic error (MSLE)
* squared hinge
* log-cosh
* .blue[overall accuracy]
* binary crossentropy
* categorical hinge
* categorical crossentropy
* sparse categorical crossentropy
* Kullback-Leibler divergence
* Poisson
* cosine proximity
* can even build your own custom metrics to track!

]

.pull-right[

```r
network %>% compile(
  loss = "categorical_crossentropy",
  optimizer = "rmsprop",
* metrics = c("accuracy")
)
```

]

---
# Network compilation summary

.pull-left-narrow[

<br><br><br>

1. Loss function

2. Backpropagation optimizer

3. Additional metrics tracked

]

.pull-right-wide[

<img src="images/compile_summary.png" width="861" style="display: block; margin: auto;" />

]

---
class: clear, center, middle

.font500.bold[Training loop]

---
# Supply data

.pull-left[

* We use `fit()` to start executing model training

* First, we need to supply the features and target response for our training data

]

.pull-right[

```r
history <- network %>%
* fit(train_images,
*     train_labels,
      batch_size = 128,
      epochs = 20,
      validation_split = 0.2)
```

]

---
# 3 variants of gradient descent

.pull-left[

1. .bold[Batch gradient descent]
   - computes the error for each example in the training dataset and updates the weights <u>after all training examples have been evaluated</u>
   - .bold.green[Pros]:
      - Fewer updates to the model can result in computational efficiencies
      - Aggregation of errors leads to smoother gradient descent and, often, quicker convergence
   - .bold.red[Cons]:
      - Scales horribly to "longer" datasets
      - Aggregation of errors often leads to convergence at local minima

]

.pull-right[

<img src="02-hello-dl-world_files/figure-html/unnamed-chunk-82-1.png" style="display: block; margin: auto;" />

]

---
# 3 variants of gradient descent

.pull-left[

1. Batch gradient descent

2. .bold[Stochastic gradient descent]
   - randomly selects an individual observation, computes gradients, and updates model weights <u>after this single observation has been evaluated</u>
   - .bold.green[Pros]:
      - Makes individual weight updates much faster
      - Updating the model on a single, random observation results in a very noisy gradient descent, which helps avoid local minima
   - .bold.red[Cons]:
      - Takes longer to converge, which can be computationally inefficient
      - The noisy learning process can also make it hard for the algorithm to settle on an error minimum for the model.
]

.pull-right[

<img src="02-hello-dl-world_files/figure-html/unnamed-chunk-83-1.gif" style="display: block; margin: auto;" />

]

---
# 3 variants of gradient descent

.pull-left[

1. Batch gradient descent

2. Stochastic gradient descent

3. .bold[Mini-batch gradient descent]
   - randomly selects a subset of observations, computes gradients, and updates model weights <u>after this subset has been evaluated</u>
   - .bold.green[Pros]:
      - Balances the efficiencies of batch vs. stochastic
      - Balances the robust convergence of batch with some stochastic nature to help avoid local minima
   - .bold.red[Cons]:
      - One more hyperparameter to think about
      - Most common: `\(2^s \rightarrow 32, 64, 128, 256, 512\)`

]

.pull-right[

<img src="02-hello-dl-world_files/figure-html/unnamed-chunk-84-1.gif" style="display: block; margin: auto;" />

]

---
# Epochs

.pull-left.font110[

- The number of times the learning algorithm will work through the entire training dataset.

]

.pull-right[

<img src="02-hello-dl-world_files/figure-html/unnamed-chunk-85-1.gif" style="display: block; margin: auto;" />

]

---
# Epochs

.pull-left.font110[

- The number of times the learning algorithm will work through the entire training dataset.

- For example, with `validation_split = 0.2` we train on 48,000 of the 60,000 images, so `batch_size = 128` means 375 weight updates per epoch.

```r
history <- network %>%
  fit(train_images,
      train_labels,
*     batch_size = 128,
*     epochs = 20,
      validation_split = 0.2)
```

]

.pull-right[

<img src="images/training_gif.gif" style="display: block; margin: auto;" />

]

---
# Epochs

.pull-left.font100[

- The number of times the learning algorithm will work through the entire training dataset.

- Run enough epochs that we see the loss converge to a minimum

]

.pull-right[

<img src="02-hello-dl-world_files/figure-html/unnamed-chunk-88-1.png" style="display: block; margin: auto;" />

]

---
# Validation

.pull-left[

* If we train a large enough model, theoretically, we should be able to produce an _identity function_.

* This is not our objective!

]

.pull-right[

<img src="02-hello-dl-world_files/figure-html/unnamed-chunk-89-1.png" style="display: block; margin: auto;" />

]

---
# Validation

.pull-left[

* If we train a large enough model, theoretically, we should be able to produce an _identity function_.

* This is not our objective!

* We want to identify the model and location that minimizes loss on _unseen data_.

* `validation_split` will:
   - select the last XX% of data in our training set,
   - score our model on this validation set at the end of each epoch

]

.pull-right[

<img src="02-hello-dl-world_files/figure-html/unnamed-chunk-90-1.png" style="display: block; margin: auto;" />

]

---
# Training loop summary

.pull-left-narrow[

<br><br><br>

1. Training features and labels

2. Size of mini-batches and number of epochs

3. Data used as "unseen" validation data

]

.pull-right-wide[

<br>

<img src="images/training_summary.png" width="897" style="display: block; margin: auto;" />

]

---
# Now you know!

.pull-left[

```r
network <- keras_model_sequential() %>%
  layer_dense(units = 512, activation = 'relu', input_shape = c(28 * 28)) %>%
  layer_dense(units = 10, activation = 'softmax')

network %>% compile(
  loss = "categorical_crossentropy",
  optimizer = "rmsprop",
  metrics = c("accuracy")
)

history <- network %>%
  fit(train_images,
      train_labels,
      batch_size = 128,
      epochs = 20,
      validation_split = 0.2)
```

]

.pull-right[

<img src="https://media1.tenor.com/images/8e0b403d6b9f899f5a25cb39b476c308/tenor.gif?itemid=10118626" style="display: block; margin: auto;" />

]
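---
# Now you know!

Once trained, a natural next step is to score the network on the held-out test set. A minimal sketch, assuming the keras package is loaded and the test tensors were prepared (reshaped and one-hot encoded) the same way as the training data:

```r
# loss and accuracy on the 10,000 unseen test images
network %>% evaluate(test_images, test_labels)

# class probabilities for the first test image (one per digit, via softmax)
network %>% predict(test_images[1, , drop = FALSE])
```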
<i class="fas fa-home fa-10x faa-FALSE animated "></i>
]](https://github.com/rstudio-conf-2020/dl-keras-tf)

.center[https://github.com/rstudio-conf-2020/dl-keras-tf]