class: center, middle, inverse, title-slide

# Computer vision & CNNs: MNIST revisited
### Brad Boehmke
### 2020-01-27

---
# Densely connected MLPs find common patterns

.pull-left[
<img src="04-computer-vision-cnns_files/figure-html/unnamed-chunk-1-1.png" style="display: block; margin: auto;" />
]

.pull-right[
<img src="04-computer-vision-cnns_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />
]

---
# Densely connected MLPs find common patterns

.pull-left[
<img src="04-computer-vision-cnns_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />
]

.pull-right[
<img src="04-computer-vision-cnns_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />
]

---
class: clear

.red.font160[But they also expect the features to be consistently located...]

.pull-left[
<img src="04-computer-vision-cnns_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />
]

.pull-right[
<img src="04-computer-vision-cnns_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />
]

---
class: clear, center, middle
background-image: url(images/Computer-Vision.png)
background-size: cover

---
# .red[Image variance]

.pull-left[
Computer vision should be robust to ___image variance___
]

.pull-right[
<img src="images/image_variance.png" width="75%" height="75%" style="display: block; margin: auto;" />
.font50.right[Image: Matt Krause]
]

---
# Convolutional neural networks (CNNs)

<br>
<img src="https://iq.opengenus.org/content/images/2019/04/pic01-1.png" style="display: block; margin: auto;" />

---
# CNNs...

.pull-left[
.bold[identify a hierarchy of features]
<br>
<img src="images/spatial_hierarchy.png" width="1001" style="display: block; margin: auto;" />
]

--

.pull-right[
.bold[learn several variants of each feature to allow for image variance]
<img src="images/image_variance.png" width="60%" height="60%" style="display: block; margin: auto;" />
]

---
# Case studies

.pull-left[
.bold.center[MNIST]
<img src="https://upload.wikimedia.org/wikipedia/commons/2/27/MnistExamples.png" style="display: block; margin: auto;" />
]

.pull-right[
.bold.center[Cats vs Dogs]
<img src="images/woof_meow.jpg" width="600" style="display: block; margin: auto;" />
]

---
# New concepts

.pull-left.font140[
* Image variance
* Spatial hierarchy
* Convolutions
* Filters and kernels
* Feature maps
]

.pull-right.font140[
* Pooling
* Flattening
* Augmentation
* Pre-trained networks
]

---
class: clear, center, middle

.font300.bold[MNIST revisited as a CNN]

.opacity[
<img src="https://upload.wikimedia.org/wikipedia/commons/2/27/MnistExamples.png" width="100%" height="100%" style="display: block; margin: auto;" />
]

---
# The structure of CNN models

<br>
<img src="images/CNN-structure.jpeg" width="1673" style="display: block; margin: auto;" />
<br>
.font60.right[Image: [Sumit Saha](https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53)]

---
# What the convolution operation is doing

.pull-left[
* A convolution layer has an input and a ___filter___ (aka kernel)
* The filter is typically a 3x3 or 5x5 matrix of weights
* As with an MLP, these weights are initially randomized, then updated and optimized via backpropagation
]

.pull-right[
<img src="https://miro.medium.com/max/1026/1*cTEp-IvCCUYPTT0QpE3Gjg@2x.png" style="display: block; margin: auto;" />
<br><br><br><br>
.font60.right[Image: [Sumit Saha](https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53)]
]
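---
# The computation in base R

As a preview of the sliding computation stepped through on the next slides, here is a minimal base R sketch of the whole operation (the 5x5 input and 3x3 kernel values mirror the classic worked example in the figures and are for illustration only):

```r
# toy 5x5 binary input and 3x3 kernel of weights
input  <- matrix(c(1,1,1,0,0,
                   0,1,1,1,0,
                   0,0,1,1,1,
                   0,0,1,1,0,
                   0,1,1,0,0), nrow = 5, byrow = TRUE)
kernel <- matrix(c(1,0,1,
                   0,1,0,
                   1,0,1), nrow = 3, byrow = TRUE)

# slide the kernel one pixel at a time (stride = 1, no padding)
out_dim <- nrow(input) - nrow(kernel) + 1        # 5 - 3 + 1 = 3
feature_map <- matrix(0, out_dim, out_dim)
for (i in seq_len(out_dim)) {
  for (j in seq_len(out_dim)) {
    patch <- input[i:(i + 2), j:(j + 2)]         # 3x3 window of the input
    feature_map[i, j] <- sum(patch * kernel)     # elementwise product, then sum
  }
}
feature_map
```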
---
# What the convolution operation is doing

.pull-left[
* We slide this filter over the inputs and perform a simple computation:

`$$(1 \times 1) + (1 \times 0) + \cdots + (0 \times 0) + (1 \times 1)$$`

* Results in a scalar value that goes into a new matrix called a ___feature map___
]

.pull-right[
<img src="https://miro.medium.com/max/1018/1*ghaknijNGolaA3DpjvDxfQ@2x.png" style="display: block; margin: auto;" />
<br><br><br><br>
.font60.right[Image: [Sumit Saha](https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53)]
]

---
# What the convolution operation is doing

.pull-left[
* We place this filter over the inputs and perform a simple computation:

`$$(1 \times 1) + (1 \times 0) + \cdots + (0 \times 0) + (1 \times 1)$$`

* Results in a scalar value that goes into a new matrix called a ___feature map___
* We slide the filter and repeat the process to complete our feature map matrix
]

.pull-right[
<img src="https://cdn-media-1.freecodecamp.org/images/Htskzls1pGp98-X2mHmVy9tCj0cYXkiCrQ4t" width="110%" height="110%" style="display: block; margin: auto;" />
<br><br><br><br>
.font60.right[Image: [Sumit Saha](https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53)]
]

---
# What the convolution operation is doing

.pull-left.opacity[
* We place this filter over the inputs and perform a simple computation:

`$$(1 \times 1) + (1 \times 0) + \cdots + (0 \times 0) + (1 \times 1)$$`

* Results in a scalar value that goes into a new matrix called a ___feature map___
* We slide the filter and repeat the process to complete our feature map matrix
]

.pull-right.opacity[
<img src="https://cdn-media-1.freecodecamp.org/images/Htskzls1pGp98-X2mHmVy9tCj0cYXkiCrQ4t" width="110%" height="110%" style="display: block; margin: auto;" />
]

But what does `filters = 32` mean?

```r
model <- keras_model_sequential() %>%
* layer_conv_2d(filters = 32, ...)
```
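---
# Specifying the first convolution layer

For context, a sketch of a complete first-layer specification for 28x28 grayscale MNIST images (the kernel size and input shape here are assumptions, chosen to be consistent with the model summaries on the following slides):

```r
library(keras)

model <- keras_model_sequential() %>%
  layer_conv_2d(
    filters     = 32,            # learn 32 different filters -> 32 feature maps
    kernel_size = c(3, 3),       # each filter is a 3x3 matrix of weights
    activation  = "relu",
    input_shape = c(28, 28, 1)   # 28x28 pixels, 1 channel (grayscale)
  )
```

Each filter has 3 x 3 x 1 = 9 weights plus a bias, so this layer has (9 + 1) x 32 = 320 parameters, the `320` shown in the summaries that follow.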
---
# Many feature maps

.pull-left[
* We're actually going to do this 32 times
* Each convolution will use a different filter (different weights)
]

.pull-right[
<img src="images/multiple-feature-maps.png" width="1061" style="display: block; margin: auto;" />
<br><br><br><br><br>
.font60.right[Image: [Rick Scavetta](http://scavetta.academy/DLwR/Presentation/SCAVETTA%2C%20Rick%20--%20Intro%20to%20Deep%20Learning%20--%20RStudioConf2019.pdf)]
]

---
# Many feature maps

.pull-left[
* We're actually going to do this 32 times
* Each convolution will use a different filter (different weights)
* Each feature map will learn unique features represented in the image
]

.pull-right[
<img src="images/example-feature-maps.png" width="2097" style="display: block; margin: auto;" />
<br><br><br><br>
.font60.right[Image: [Rick Scavetta](http://scavetta.academy/DLwR/Presentation/SCAVETTA%2C%20Rick%20--%20Intro%20to%20Deep%20Learning%20--%20RStudioConf2019.pdf)]
]

---
# Many feature maps

.pull-left[
* We're actually going to do this 32 times
* Each convolution will use a different filter (different weights)
* Each feature map will learn unique features represented in the image
* Consequently, the output of a convolution layer will typically be:
   - smaller in width x height
   - deeper due to the multiple feature maps
]

.pull-right[
```r
summary(model)
Model: "sequential_1"
______________________________________________________________
Layer (type)               Output Shape             Param #
==============================================================
*conv2d_1 (Conv2D)          (None, 26, 26, 32)       320
______________________________________________________________
max_pooling2d (MaxPooling2 (None, 13, 13, 32)       0
______________________________________________________________
conv2d_2 (Conv2D)          (None, 11, 11, 64)       18496
______________________________________________________________
max_pooling2d_1 (MaxPoolin (None, 5, 5, 64)         0
______________________________________________________________
conv2d_3 (Conv2D)          (None, 3, 3, 64)         36928
==============================================================
Total params: 55,744
Trainable params: 55,744
Non-trainable params: 0
______________________________________________________________
```
]

---
# .red[Stride] & padding

.pull-left[
* ___Stride___ specifies how much we move the convolution filter at each step
   - most common is 1 (default)
   - larger strides:
      - less feature correlation
      - less memory required
      - reduces overfitting
   - pooling helps with many of these issues
]

.pull-right[
<img src="https://miro.medium.com/max/790/1*L4T6IXRalWoseBncjRr4wQ@2x.gif" width="75%" height="75%" style="display: block; margin: auto;" />
<img src="https://miro.medium.com/max/721/1*4wZt9G7W7CchZO-5rVxl5g@2x.gif" width="70%" height="70%" style="display: block; margin: auto;" />
.font60.right[Image: [Sumit Saha](https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53)]
]

---
# Stride & .red[padding]

.pull-left[
* ___Stride___ specifies how much we move the convolution filter at each step
* ___Zero padding___ adds a border of 0s around the input
   - helps keep information at the edges
   - prevents deep models from reducing feature map sizes too quickly
]

.pull-right[
<img src="https://miro.medium.com/max/1063/1*W2D564Gkad9lj3_6t9I2PA@2x.gif" style="display: block; margin: auto;" />
<br><br><br><br>
.font60.right[Image: [Sumit Saha](https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53)]
]
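---
# Stride & padding in keras

A minimal sketch of how these options map onto `layer_conv_2d()` arguments (the filter count and input shape are placeholders):

```r
library(keras)

model <- keras_model_sequential() %>%
  layer_conv_2d(
    filters     = 32,
    kernel_size = c(3, 3),
    strides     = c(1, 1),   # default: move the filter one pixel at a time
    padding     = "same",    # zero-pad so output width x height match the input
    activation  = "relu",
    input_shape = c(28, 28, 1)
  )
```

With the default `padding = "valid"` no border is added and the feature map shrinks (e.g., 28x28 becomes 26x26 with a 3x3 filter).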
---
# What about images with .red[multiple channels?]

.pull-left.font140[
* Our filter/kernel simply becomes a cube
]

.pull-right[
<img src="https://miro.medium.com/max/326/1*NsiYxt8tPDQyjyH3C08PVA@2x.png" style="display: block; margin: auto;" />
.font60.right[Image: [Sumit Saha](https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53)]
]

---
# What about images with .red[multiple channels?]

.pull-left.font120[
* Our filter/kernel simply becomes a cube
* but the math doesn't get much more complex
* and the output is still a scalar
]

.pull-right[
<img src="https://miro.medium.com/max/1280/1*ciDgQEjViWLnCbmX-EeSrA.gif" style="display: block; margin: auto;" />
<br><br><br><br><br>
.font60.right[Image: [Sumit Saha](https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53)]
]

---
# ReLU

.pull-left.font110[
* Note that we still use a ReLU activation function
* This applies the ReLU activation across each feature map output from a convolution layer
* Simply zeroes out the negative values in the feature map

```r
*layer_conv_2d(..., activation = "relu", ...)
```
]

.pull-right[
<br><br>
<img src="https://ujwlkarn.files.wordpress.com/2016/08/screen-shot-2016-08-07-at-6-18-19-pm.png?w=748" style="display: block; margin: auto;" />

.center[This is why we say the first convolution layer is an _edge_ detector]
<br><br><br>
.font60.right[Image: [Ujjwal Karn](https://ujjwalkarn.me/)]
]

---
# Pooling for downsampling

.code125[
```r
model <- keras_model_sequential() %>%
  layer_conv_2d() %>%
* layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  ...
```
]

---
# Pooling for downsampling

.pull-left[
* After one or more convolution operations we usually perform pooling to reduce the dimensionality
* Identifies the most prominent features within each feature map
]

.pull-right[
<img src="https://miro.medium.com/max/596/1*KQIEqhxzICU7thjaQBfPBQ.png" style="display: block; margin: auto;" />
<br><br>
.font60.right[Image: [Sumit Saha](https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53)]
]

---
# Pooling for downsampling

.pull-left[
* After one or more convolution operations we usually perform pooling to reduce the dimensionality
* Identifies the most prominent features within each feature map
* Pooling reduces the width x height dimensions of each feature map independently, keeping the depth intact
* 2x2 with a stride of 2 is most common
]

.pull-right[
```r
summary(model)
Model: "sequential_1"
______________________________________________________________
Layer (type)               Output Shape             Param #
==============================================================
conv2d_1 (Conv2D)          (None, 26, 26, 32)       320
______________________________________________________________
*max_pooling2d (MaxPooling2 (None, 13, 13, 32)       0
______________________________________________________________
conv2d_2 (Conv2D)          (None, 11, 11, 64)       18496
______________________________________________________________
max_pooling2d_1 (MaxPoolin (None, 5, 5, 64)         0
______________________________________________________________
conv2d_3 (Conv2D)          (None, 3, 3, 64)         36928
==============================================================
Total params: 55,744
Trainable params: 55,744
Non-trainable params: 0
______________________________________________________________
```
]
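---
# Max pooling by hand

To make the downsampling concrete, a minimal base R sketch of 2x2 max pooling with a stride of 2 (the feature map values are made up for illustration):

```r
# toy 4x4 feature map (hypothetical values)
fmap <- matrix(c(1, 1, 2, 4,
                 5, 6, 7, 8,
                 3, 2, 1, 0,
                 1, 2, 3, 4), nrow = 4, byrow = TRUE)

# keep the max of each non-overlapping 2x2 block
pooled <- matrix(0, 2, 2)
for (i in 1:2) {
  for (j in 1:2) {
    rows <- (2 * i - 1):(2 * i)
    cols <- (2 * j - 1):(2 * j)
    pooled[i, j] <- max(fmap[rows, cols])
  }
}
pooled   # the 4x4 map is downsampled to 2x2
```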
---
# Pooling for downsampling

.bold[Benefits]...

* makes the feature dimensions smaller and more manageable
* improves computation time & controls overfitting
* makes the network more robust to image variance `\(\rightarrow\)` a small distortion in the input will not change the output of pooling, since we take the maximum/average value in a local neighborhood
* helps us arrive at an almost scale invariant representation of our image

<img src="https://ujwlkarn.files.wordpress.com/2016/08/screen-shot-2016-08-06-at-12-45-35-pm.png?w=748" style="display: block; margin: auto;" />

---
# Multiple convolutions & pooling

.pull-left[
.bold[The idea...]

* More convolution steps result in more complex features learned
* Initial layers typically find lower level detail features (e.g., edges)
* Subsequent layers aggregate lower level features into larger ones
* Facial recognition example:
   - Layer 1 detects edges
   - Layer 2 uses edges to identify facial items (e.g., eyes, nose, mouth)
   - Layer 3 puts these features together into faces
]

.pull-right[
.bold[A common misinterpretation...]

<img src="https://ujwlkarn.files.wordpress.com/2016/08/screen-shot-2016-08-10-at-12-58-30-pm.png?w=242&h=256" width="80%" height="80%" style="display: block; margin: auto;" />
.font60.right[Image: [Ujjwal Karn](https://ujjwalkarn.me/)]
]

---
# Multiple convolutions & pooling

In reality, early layers will resemble the initial images the most and subsequent layers create abstract images that only make sense mathematically.

<img src="images/2layer-CNN.png" width="80%" height="80%" style="display: block; margin: auto;" />
.font60.right[Image: [Adam Harley](http://scs.ryerson.ca/~aharley/vis/conv/flat.html)]

---
# Check this out!

<br><br><br>
.center[
.font200[Spend a couple minutes playing around with this]

.font150[http://scs.ryerson.ca/~aharley/vis/conv/flat.html]
]
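---
# Assembling the convolutional base

Putting the pieces together, a sketch of the conv/pool stack consistent with the model summaries shown earlier (the hyperparameter choices follow those summaries):

```r
library(keras)

model <- keras_model_sequential() %>%
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), activation = "relu",
                input_shape = c(28, 28, 1)) %>%   # -> 26x26x32
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%   # -> 13x13x32
  layer_conv_2d(filters = 64, kernel_size = c(3, 3),
                activation = "relu") %>%          # -> 11x11x64
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%   # -> 5x5x64
  layer_conv_2d(filters = 64, kernel_size = c(3, 3),
                activation = "relu")              # -> 3x3x64
```

The next slide adds the flatten + dense classifier on top of this base.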
<img src="https://miro.medium.com/max/1025/1*IWUxuBpqn2VuV-7Ubr01ng.png" style="display: block; margin: auto;" /> <br><br><br> .font60.right[[Jiwon Jeong](https://towardsdatascience.com/the-most-intuitive-and-easiest-guide-for-convolutional-neural-network-3607be47480)] ] --- # Some tips for convolutional layers * Hyperparameters: - filter size: 3x3, 5x5, 7x7 - stride size - number of filters: `\(2^p \rightarrow\)` 32, 64, 128, ..., 1024 - number of convolutional layers * Pooling - most common size: 2x2 or 3x3 - not necessary after every convolutional layer - balance speed and efficiency with performance * Since the convolution layer does most of the feature extraction, the capacity of the densely connected portion of the network can often be much smaller and use less epochs * Larger problem sets will require GPUs! --- class: yourturn # Your Turn! (lines of code 131+) .font150[Spend 5 minutes adjusting various CNN components:] .font140[ - change the number of filters - change filter/kernel size - adjust the stride - add padding - add more convolution layers ] --- class: center # Where are the images? <br> <br> .bold[Cats vs Dogs Case Study] []() --- class: clear, center, middle .font300.bold[Cats vs. Dogs] .opacity[ <img src="https://miro.medium.com/proxy/1*oB3S5yHHhvougJkPXuc8og.gif" width="85%" height="85%" style="display: block; margin: auto;" /> ] --- # Image augmentation .pull-left[ There are many approaches we can take to augment images such as: - rotate the image - shift the image vertically and horizontally - shear the image - zoom in and out - flip the orientation ```r datagen <- image_data_generator( rescale = 1/255, rotation_range = 40, width_shift_range = 0.2, height_shift_range = 0.2, shear_range = 0.2, zoom_range = 0.2, horizontal_flip = TRUE, fill_mode = "nearest" ) ``` ] --- # Image augmentation .pull-left[ There are many approaches we can take to augment images such as: - .bold[rotate the image] - shift the image vertically and horizontally - shear the image - zoom in and out - flip the orientation ```r datagen <- image_data_generator( rescale = 1/255, * rotation_range = 40, width_shift_range = 0.2, height_shift_range = 0.2, shear_range = 0.2, zoom_range = 0.2, horizontal_flip = TRUE, fill_mode = "nearest" ) ``` ] .pull-right[ <img src="04-computer-vision-cnns_files/figure-html/unnamed-chunk-41-1.png" style="display: block; margin: auto;" /> ] --- # Image augmentation .pull-left[ There are many approaches we can take to augment images such as: - rotate the image - .bold[shift the image vertically and horizontally] - shear the image - zoom in and out - flip the orientation ```r datagen <- image_data_generator( rescale = 1/255, rotation_range = 40, * width_shift_range = 0.2, * height_shift_range = 0.2, shear_range = 0.2, zoom_range = 0.2, horizontal_flip = TRUE, fill_mode = "nearest" ) ``` ] .pull-right[ <img src="04-computer-vision-cnns_files/figure-html/unnamed-chunk-43-1.png" style="display: block; margin: auto;" /> ] --- # Image augmentation .pull-left[ There are many approaches we can take to augment images such as: - rotate the image - shift the image vertically and horizontally - .bold[shear the image] - zoom in and out - flip the orientation ```r datagen <- image_data_generator( rescale = 1/255, rotation_range = 40, width_shift_range = 0.2, height_shift_range = 0.2, * shear_range = 0.2, zoom_range = 0.2, horizontal_flip = TRUE, fill_mode = "nearest" ) ``` ] .pull-right[ <img src="04-computer-vision-cnns_files/figure-html/unnamed-chunk-45-1.png" style="display: block; 
margin: auto;" /> ] --- # Image augmentation .pull-left[ There are many approaches we can take to augment images such as: - rotate the image - shift the image vertically and horizontally - shear the image - .bold[zoom in and out] - flip the orientation ```r datagen <- image_data_generator( rescale = 1/255, rotation_range = 40, width_shift_range = 0.2, height_shift_range = 0.2, shear_range = 0.2, * zoom_range = 0.2, horizontal_flip = TRUE, fill_mode = "nearest" ) ``` ] .pull-right[ <img src="04-computer-vision-cnns_files/figure-html/unnamed-chunk-47-1.png" style="display: block; margin: auto;" /> ] --- # Image augmentation .pull-left[ There are many approaches we can take to augment images such as: - rotate the image - shift the image vertically and horizontally - shear the image - zoom in and out - .bold[flip the orientation] ```r datagen <- image_data_generator( rescale = 1/255, rotation_range = 40, width_shift_range = 0.2, height_shift_range = 0.2, shear_range = 0.2, zoom_range = 0.2, * horizontal_flip = TRUE, fill_mode = "nearest" ) ``` ] .pull-right[ <img src="04-computer-vision-cnns_files/figure-html/unnamed-chunk-49-1.png" style="display: block; margin: auto;" /> ] --- # Image augmentation .pull-left[ There are many approaches we can take to augment images such as: - rotate the image - shift the image vertically and horizontally - shear the image - zoom in and out - flip the orientation ```r datagen <- image_data_generator( rescale = 1/255, rotation_range = 40, width_shift_range = 0.2, height_shift_range = 0.2, shear_range = 0.2, zoom_range = 0.2, horizontal_flip = TRUE, * fill_mode = "nearest" ) ``` ] .pull-right[ <img src="04-computer-vision-cnns_files/figure-html/unnamed-chunk-51-1.png" style="display: block; margin: auto;" /> ] --- class: clear, center, middle background-image: url(images/transfer_learning_icon.png) background-size: cover --- # Two main approaches .pull-left[ 1. Use the convolutional base to do feature engineering on our images and then feed into a new densely connected classifier. 2. Build a full sequential model with the convolutional base and a new densely connected classifier and train the entire model with some or all of the convolutional base layers _frozen_. ] .pull-right[ <img src="https://s3.amazonaws.com/book.keras.io/img/ch5/swapping_fc_classifier.png" style="display: block; margin: auto;" /> ] --- # Back home <br><br><br><br> [.center[
<i class="fas fa-home fa-10x faa-FALSE animated "></i>
]](https://github.com/rstudio-conf-2020/dl-keras-tf) .center[https://github.com/rstudio-conf-2020/dl-keras-tf]