Computer vision & CNNs: MNIST revisted

class: center, middle, inverse, title-slide

# Computer vision & CNNs: MNIST revisted
### Brad Boehmke
### 2020-01-27

---

# Densely connected MLPs find common patterns

.pull-left[

]

.pull-right[

]

---
# Densely connected MLPs find common patterns

.pull-left[

]

.pull-right[

]

---
class: clear

.red.font160[But they also expect the features to be conistently located...]

.pull-left[

]

.pull-right[

]

---
class: clear, center, middle
background-image: url(images/Computer-Vision.png)
background-size: cover

---
# .red[Image variance]

.pull-left[

Computer vision should be robust to ___image variance___

]

.pull-right[

.font50.right[Image: Matt Krause]

]

---
# Convolutional neural networks (CNNs)

---
# CNNs...

.pull-left[

.bold[identify a hierarchy of features]

]

.pull-right[

.bold[provide several features to allow for image variance]

]

---
# Case studies

.pull-left[
.bold.center[MNIST]

]

.pull-right[
.bold.center[Cats vs Dogs]

]

---
# New concepts

.pull-left.font140[
* Image variance

* Spatial hierarchy

* Convolutions

* Filters and kernels

* Feature maps
]

.pull-right.font140[
* Pooling

* Flattening

* Augmentation

* Pre-trained networks
]

---
class: clear, center, middle

.font300.bold[MNIST revisited as a CNN]

.opacity[
<img src="https://upload.wikimedia.org/wikipedia/commons/2/27/MnistExamples.png" width="100%" height="100%" style="display: block; margin: auto;" />
]

---
# The structure of CNN models

.font60.right[Image: [Sumit Saha
](https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53)]

---
# What the convolution operation is doing

.pull-left[

* A convolution layer has an input and a ___filter___ (aka kernel)

* The filter is typically a 3x3 or 5x5 matrix of weights

* Similar to MLP these weights are initially randomized, updated and optimized with backpropagation

]

.pull-right[

.font60.right[Image: [Sumit Saha
](https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53)]

]

---
# What the convolution operation is doing

.pull-left[

* We slide this filter over the inputs and perform a simple computation:

`$$(1 \times 1) + (1 \times 0) + \cdots + (0 \times 0) + (1 \times 1)$$`

* Results in a scalar value that goes into a new matrix called a ___feature map___

]

.pull-right[

.font60.right[Image: [Sumit Saha
](https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53)]

]

---
# What the convolution operation is doing

.pull-left[

* We place this filter over the inputs and perform a simple computation:

`$$(1 \times 1) + (1 \times 0) + \cdots + (0 \times 0) + (1 \times 1)$$`

* Results in a scalar value that goes into a new matrix called a ___feature map___

* We slide the filter and repeat the process to complete our feature map matrix

]

.pull-right[

.font60.right[Image: [Sumit Saha
](https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53)]

]

---
# What the convolution operation is doing

.pull-left.opacity[

* We place this filter over the inputs and perform a simple computation:

`$$(1 \times 1) + (1 \times 0) + \cdots + (0 \times 0) + (1 \times 1)$$`

* Results in a scalar value that goes into a new matrix called a ___feature map___

* We slide the filter and repeat the process to complete our feature map matrix

]

.pull-right.opacity[

]

But what does `filters = 32` mean?

```r
model <- keras_model_sequential() %>%
* layer_conv_2d(filters = 32, ...)
```

---
# Many feature maps

.pull-left[

* We're actually going to do this 32 times

* Each convolution will use a different filter (different weights)

]

.pull-right[

.font60.right[Image: [Rick Scavetta
](http://scavetta.academy/DLwR/Presentation/SCAVETTA%2C%20Rick%20--%20Intro%20to%20Deep%20Learning%20--%20RStudioConf2019.pdf)]

]

---
# Many feature maps

.pull-left[

* We're actually going to do this 32 times

* Each convolution will use a different filter (different weights)

* Each feature map will learn unique features represented in the image

]

.pull-right[

.font60.right[Image: [Rick Scavetta
](http://scavetta.academy/DLwR/Presentation/SCAVETTA%2C%20Rick%20--%20Intro%20to%20Deep%20Learning%20--%20RStudioConf2019.pdf)]

]

---
# Many feature maps

.pull-left[

* We're actually going to do this 32 times

* Each convolution will use a different filter (different weights)

* Each feature map will learn unique features represented in the image

* Consequently, the output of a convolution layer will typically be:
   - smaller in width x height
   - deeper due to the multiple feature maps

]

.pull-right[

```r
summary(model)
Model: "sequential_1"
______________________________________________________________
Layer (type)               Output Shape             Param #   
==============================================================
*conv2d_1 (Conv2D)          (None, 26, 26, 32)       320
______________________________________________________________
max_pooling2d (MaxPooling2 (None, 13, 13, 32)       0         
______________________________________________________________
conv2d_2 (Conv2D)          (None, 11, 11, 64)       18496     
______________________________________________________________
max_pooling2d_1 (MaxPoolin (None, 5, 5, 64)         0         
______________________________________________________________
conv2d_3 (Conv2D)          (None, 3, 3, 64)         36928     
==============================================================
Total params: 55,744
Trainable params: 55,744
Non-trainable params: 0
______________________________________________________________
```

]

---
# .red[Stride] & padding

.pull-left[

* ___Stride___ specifies how much we move the convolution filter at each step
   - most common is 1 (default)
   - larger strides 
      - less feature correlation
      - less memory required
      - minimizes overfitting
   - pooling helps with many of these issues    
]

.pull-right[

.font60.right[Image: [Sumit Saha
](https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53)]

]

---
# Stride & .red[padding]

.pull-left[

* ___Stride___ specifies how much we move the convolution filter at each step

* ___Padding___ zero padding adds a border with values of 0
   - helps keep information at the edges
   - prevents deep models from reducing feature map sizes too quickly
]

.pull-right[

.font60.right[Image: [Sumit Saha
](https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53)]
]

---
# What about images with .red[multiple channels?]

.pull-left.font140[

* Our filter/kernel simply becomes a cube

]

.pull-right[

.font60.right[Image: [Sumit Saha
](https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53)]
]

---
# What about images with .red[multiple channels?]

.pull-left.font120[

* Our filter/kernel simply becomes a cube

* but math doesn't get much more complex

* and the output is still a scalar

]

.pull-right[

.font60.right[Image: [Sumit Saha
](https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53)]

]

---
# ReLU

.pull-left.font110[

* Note that we still use a ReLU activation function

* This applies the ReLU activation across each feature map output from a convolution layer

* Simply puts focus on non-negative values in the feature map

```r
*layer_conv_2d(..., activation = "relu", ...)
```

]

.pull-right[

.center[This is why we say the first convolution layer is an _edge_ detector]

.font60.right[Image: [Ujjwal Karn](https://ujjwalkarn.me/)]

]

---
# Pooling for downsampling

.code125[

```r
model <- keras_model_sequential() %>%
 layer_conv_2d() %>%
* layer_max_pooling_2d(pool_size = c(2, 2)) %>%
 ...
```

]

---
# Pooling for downsampling

.pull-left[

* After one or more convolution operations we usually perform pooling to reduce the dimensionality

* Identifies the most prominent features within each feature map
]

.pull-right[

.font60.right[Image: [Sumit Saha
](https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53)]
]

]

---
# Pooling for downsampling

.pull-left[

* After one or more convolution operations we usually perform pooling to reduce the dimensionality

* Identifies the most prominent features within each feature map

* Pooling reduces the width x height dimensions of each feature map independently, keeping the depth intact

* 2x2 with a stride of 2 is most common
]

.pull-right[

```r
summary(model)
Model: "sequential_1"
______________________________________________________________
Layer (type)               Output Shape             Param #   
==============================================================
conv2d_1 (Conv2D)          (None, 26, 26, 32)       320        
______________________________________________________________
*max_pooling2d (MaxPooling2 (None, 13, 13, 32)       0
______________________________________________________________
conv2d_2 (Conv2D)          (None, 11, 11, 64)       18496     
______________________________________________________________
max_pooling2d_1 (MaxPoolin (None, 5, 5, 64)         0         
______________________________________________________________
conv2d_3 (Conv2D)          (None, 3, 3, 64)         36928     
==============================================================
Total params: 55,744
Trainable params: 55,744
Non-trainable params: 0
______________________________________________________________
```

]

---
# Pooling for downsampling

.bold[Benefits]...

* makes the feature dimensions smaller and more manageable

* improves computation time & controlls overfitting

* makes the network invariant to image variance `$\rightarrow$` a small distortion in input will not change the output of pooling – since we take the maximum / average value in a local neighborhood)

* helps us arrive at an almost scale invariant representation of our image

---
# Multiple convolutions & pooling

.pull-left[

.bold[The idea...]

* More convolution steps results in more complicated features learnt

* Initial layers typically find lower level detail features (i.e. edges)

* Subsequent layers aggregate lower level features into larger ones

* Facial recognition example:
   - Layer 1 detects edges
   - Layer 2 uses edges to identify facial items (i.e. eyes, nose, mouth)
   - Layer 3 puts these features together into faces

]

.pull-right[

.bold[A common misinterpretation...]

.font60.right[Image: [Ujjwal Karn](https://ujjwalkarn.me/)]
]

---
# Multiple convolutions & pooling

In reality, early layers will resemble the initial images the most and subsequent layers create abstract images that only make sense mathematically.

.font60.right[Image: [Adam Harley](http://scs.ryerson.ca/~aharley/vis/conv/flat.html)]
---
# Check this out!

.center[

.font200[Spend a couple minutes playing around with this]

.font150[http://scs.ryerson.ca/~aharley/vis/conv/flat.html]
]

---
# Last step

.pull-left.code60[

___Flattening___ simply takes our last multidimensional convolution output and flattens it into a vector

```r
model %>%
*   layer_flatten() %>%
    layer_dense(units = 64, activation = "relu") %>%
    layer_dense(units = 10, activation = "softmax")

summary(model)
Model: "sequential_1"
______________________________________________________________________________
Layer (type)                       Output Shape              Param #      
==============================================================================
conv2d_1 (Conv2D)                  (None, 26, 26, 32)        320          
______________________________________________________________________________
max_pooling2d (MaxPooling2D)       (None, 13, 13, 32)          0            
______________________________________________________________________________
conv2d_2 (Conv2D)                  (None, 11, 11, 64)      18496        
______________________________________________________________________________
max_pooling2d_1 (MaxPooling2D)     (None, 5, 5, 64)            0            
______________________________________________________________________________
*conv2d_3 (Conv2D)                  (None, 3, 3, 64)        36928
______________________________________________________________________________
*flatten (Flatten)                  (None, 576)                 0
______________________________________________________________________________
dense (Dense)                      (None, 64)               36928        
______________________________________________________________________________
dense_1 (Dense)                    (None, 10)                 650          
==============================================================================
Total params: 93,322
Trainable params: 93,322
Non-trainable params: 0
______________________________________________________________________________
```

]

.pull-right[

Feed into a densely connected MLP for final classification.

.font60.right[[Jiwon Jeong](https://towardsdatascience.com/the-most-intuitive-and-easiest-guide-for-convolutional-neural-network-3607be47480)]

]

---
# Some tips for convolutional layers

* Hyperparameters:
   - filter size: 3x3, 5x5, 7x7
   - stride size
   - number of filters: `$2^p \rightarrow$` 32, 64, 128, ..., 1024
   - number of convolutional layers

* Pooling
   - most common size: 2x2 or 3x3
   - not necessary after every convolutional layer
   - balance speed and efficiency with performance
   
* Since the convolution layer does most of the feature extraction, the capacity of the densely connected portion of the network can often be much smaller and use less epochs

* Larger problem sets will require GPUs!

---
class: yourturn
# Your Turn! (lines of code 131+)

.font150[Spend 5 minutes adjusting various CNN components:]

.font140[
- change the number of filters
- change filter/kernel size
- adjust the stride
- add padding
- add more convolution layers
]

---
class: center
# Where are the images?

.bold[Cats vs Dogs Case Study]

[![cats-dogs](images/woof_meow.jpg)]()

---
class: clear, center, middle

.font300.bold[Cats vs. Dogs]

.opacity[
<img src="https://miro.medium.com/proxy/1*oB3S5yHHhvougJkPXuc8og.gif" width="85%" height="85%" style="display: block; margin: auto;" />
]

---
# Image augmentation

.pull-left[

There are many approaches we can take to augment images such as:

- rotate the image
- shift the image vertically and horizontally
- shear the image
- zoom in and out
- flip the orientation

```r
datagen <- image_data_generator(
 rescale = 1/255,
 rotation_range = 40,
 width_shift_range = 0.2,
 height_shift_range = 0.2,
 shear_range = 0.2,
 zoom_range = 0.2,
 horizontal_flip = TRUE,
 fill_mode = "nearest"
)
```

]

---
# Image augmentation

.pull-left[

There are many approaches we can take to augment images such as:

- .bold[rotate the image]
- shift the image vertically and horizontally
- shear the image
- zoom in and out
- flip the orientation

```r
datagen <- image_data_generator(
 rescale = 1/255,
* rotation_range = 40,
 width_shift_range = 0.2,
 height_shift_range = 0.2,
 shear_range = 0.2,
 zoom_range = 0.2,
 horizontal_flip = TRUE,
 fill_mode = "nearest"
)
```

]

.pull-right[

]

---
# Image augmentation

.pull-left[

There are many approaches we can take to augment images such as:

- rotate the image
- .bold[shift the image vertically and horizontally]
- shear the image
- zoom in and out
- flip the orientation

```r
datagen <- image_data_generator(
 rescale = 1/255,
 rotation_range = 40, 
* width_shift_range = 0.2,
* height_shift_range = 0.2,
 shear_range = 0.2,
 zoom_range = 0.2,
 horizontal_flip = TRUE,
 fill_mode = "nearest"
)
```

]

.pull-right[

]

---
# Image augmentation

.pull-left[

There are many approaches we can take to augment images such as:

- rotate the image
- shift the image vertically and horizontally
- .bold[shear the image]
- zoom in and out
- flip the orientation

```r
datagen <- image_data_generator(
 rescale = 1/255,
 rotation_range = 40, 
 width_shift_range = 0.2, 
 height_shift_range = 0.2,
* shear_range = 0.2,
 zoom_range = 0.2,
 horizontal_flip = TRUE,
 fill_mode = "nearest"
)
```

]

.pull-right[

]

---
# Image augmentation

.pull-left[

There are many approaches we can take to augment images such as:

- rotate the image
- shift the image vertically and horizontally
- shear the image
- .bold[zoom in and out]
- flip the orientation

```r
datagen <- image_data_generator(
 rescale = 1/255,
 rotation_range = 40, 
 width_shift_range = 0.2, 
 height_shift_range = 0.2,
 shear_range = 0.2, 
* zoom_range = 0.2,
 horizontal_flip = TRUE,
 fill_mode = "nearest"
)
```

]

.pull-right[

]

---
# Image augmentation

.pull-left[

There are many approaches we can take to augment images such as:

- rotate the image
- shift the image vertically and horizontally
- shear the image
- zoom in and out
- .bold[flip the orientation]

```r
datagen <- image_data_generator(
 rescale = 1/255,
 rotation_range = 40, 
 width_shift_range = 0.2, 
 height_shift_range = 0.2,
 shear_range = 0.2, 
 zoom_range = 0.2, 
* horizontal_flip = TRUE,
 fill_mode = "nearest"
)
```

]

.pull-right[

]

---
# Image augmentation

.pull-left[

There are many approaches we can take to augment images such as:

- rotate the image
- shift the image vertically and horizontally
- shear the image
- zoom in and out
- flip the orientation

```r
datagen <- image_data_generator(
 rescale = 1/255,
 rotation_range = 40, 
 width_shift_range = 0.2, 
 height_shift_range = 0.2,
 shear_range = 0.2, 
 zoom_range = 0.2, 
 horizontal_flip = TRUE, 
* fill_mode = "nearest"
)
```

]

.pull-right[

]

---
class: clear, center, middle
background-image: url(images/transfer_learning_icon.png)
background-size: cover

---
# Two main approaches

.pull-left[

1. Use the convolutional base to do feature engineering on our images and then 
feed into a new densely connected classifier.

2. Build a full sequential model with the convolutional base and a new 
densely connected classifier and train the entire model with some or all of the 
convolutional base layers _frozen_.

]

.pull-right[

]

---
# Back home

[.center[]](https://github.com/rstudio-conf-2020/dl-keras-tf)

.center[https://github.com/rstudio-conf-2020/dl-keras-tf]