This project is designed to test your current knowledge of applying LSTMs to the Cornell Movie Review dataset provided by Cornell University. This dataset contains movie reviews introduced in Pang & Lee (2004) with 2000 total observations. Detailed information about the data can be found here.

Your goal is to develop and compare the performance of a word embedding deep learning classifier to one that incorporates LSTM sequence embedding. I will guide you along the way, but this project expects you to do most of the work, from importing and preprocessing text to building the models.

Nearly all the code that you need can be found in these notebooks:

  • Intro to word embeddings
  • Intro to LSTMs

Good luck!

Requirements

library(keras)
library(tidyverse)
library(fs)
library(glue)
library(testthat)

Import the data

For those in the workshop, we have already downloaded the movie review data for you into the "/materials/data/cornell_reviews" directory. Outside of the workshop, you can find the download instructions here.

# get path to data
if (stringr::str_detect(here::here(), "conf-2020-user")) {
  movie_dir <- "/home/conf-2020-user/data/cornell_reviews/data"
} else {
  movie_dir <- here::here("materials", "data", "cornell_reviews", "data")
}

fs::dir_tree(movie_dir, recurse = FALSE)
/Users/b294776/Desktop/Workspace/Training/rstudio-conf-2020/dl-keras-tf/materials/data/cornell_reviews/data
├── neg
└── pos

Step 1: You can see the data have already been separated into positive vs negative sets. The actual reviews are contained in individual .txt files. Similar to Intro to word embeddings, let’s go ahead and use this structure to our advantage by iterating over each review and…

  1. creating the path to each individual review file,
  2. creating a label based on the “neg” or “pos” folder the review is in, and
  3. saving the output as a data frame with each review on an individual row.
training_files <- movie_dir %>%
  dir_ls() %>%
  map(dir_ls) %>%
  set_names(basename) %>%
  plyr::ldply(data_frame) %>%
  set_names(c("label", "path"))

# you should have 2000 total observations
expect_equal(nrow(training_files), 2000)

Go ahead and take a look at your data frame.

training_files

Step 2: How many observations are in each response label (i.e. “neg” vs “pos”)?

count(training_files, label)

Step 3: Next, let’s iterate over each row and

  1. save the label in a labels vector,
  2. import the movie review, and
  3. save in a texts vector.
obs <- nrow(training_files)
labels <- vector(mode = "integer", length = obs)
texts <- vector(mode = "character", length = obs)

for (file in seq_len(obs)) {
  label <- training_files[[file, "label"]]
  path <- training_files[[file, "path"]]
  
  labels[file] <- ifelse(label == "neg", 0, 1)
  texts[file] <- readChar(path, nchars = file.size(path)) 
}

The number of observations in your texts vector should equal the number of responses.

expect_equal(length(texts), length(labels))

Go ahead and check out the text of a couple reviews.

texts[1]
[1] "plot : two teen couples go to a church party , drink and then drive . \nthey get into an accident . \none of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . \nwhat's the deal ? \nwatch the movie and \" sorta \" find out . . . \ncritique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . \nwhich is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn't snag this one correctly . \nthey seem to have taken this pretty neat concept , but executed it terribly . \nso what are the problems with the movie ? \nwell , its main problem is that it's simply too jumbled . \nit starts off \" normal \" but then downshifts into this \" fantasy \" world in which you , as an audience member , have no idea what's going on . \nthere are dreams , there are characters coming back from the dead , there are others who look like the dead , there are strange apparitions , there are disappearances , there are a looooot of chase scenes , there are tons of weird things that happen , and most of it is simply not explained . \nnow i personally don't mind trying to unravel a film every now and then , but when all it does is give me the same clue over and over again , i get kind of fed up after a while , which is this film's biggest problem . \nit's obviously got this big secret to hide , but it seems to want to hide it completely until its final five minutes . \nand do they make things entertaining , thrilling or even engaging , in the meantime ? \nnot really . \nthe sad part is that the arrow and i both dig on flicks like this , so we actually figured most of it out by the half-way point , so all of the strangeness after that did start to make a little bit of sense , but it still didn't the make the film all that more entertaining . \ni guess the bottom line with movies like this is that you should always make sure that the audience is \" into it \" even before they are given the secret password to enter your world of understanding . \ni mean , showing melissa sagemiller running away from visions for about 20 minutes throughout the movie is just plain lazy ! ! \nokay , we get it . . . there \nare people chasing her and we don't know who they are . \ndo we really need to see it over and over again ? \nhow about giving us different scenes offering further insight into all of the strangeness going down in the movie ? \napparently , the studio took this film away from its director and chopped it up themselves , and it shows . \nthere might've been a pretty decent teen mind-fuck movie in here somewhere , but i guess \" the suits \" decided that turning it into a music video with little edge , would make more sense . \nthe actors are pretty good for the most part , although wes bentley just seemed to be playing the exact same character that he did in american beauty , only in a new neighborhood . \nbut my biggest kudos go out to sagemiller , who holds her own throughout the entire film , and actually has you feeling her character's unraveling . \noverall , the film doesn't stick because it doesn't entertain , it's confusing , it rarely excites and it feels pretty redundant for most of its runtime , despite a pretty cool ending and explanation to all of the craziness that came before it . 
\noh , and by the way , this is not a horror or teen slasher flick . . . it's \njust packaged to look that way because someone is apparently assuming that the genre is still hot with the kids . \nit also wrapped production two years ago and has been sitting on the shelves ever since . \nwhatever . . . skip \nit ! \nwhere's joblo coming from ? \na nightmare of elm street 3 ( 7/10 ) - blair witch 2 ( 7/10 ) - the crow ( 9/10 ) - the crow : salvation ( 4/10 ) - lost highway ( 10/10 ) - memento ( 10/10 ) - the others ( 9/10 ) - stir of echoes ( 8/10 ) \n"

Data exploration

Step 4: Before preprocessing, let’s get a sense of two attributes that will help us set two of our preprocessing hyperparameters:

  1. How many unique words exist across all our reviews? We’ll use this to determine a good starting point for preprocessing our text.

  2. What is the distribution of word count across all movie reviews (i.e. mean, median)? We’ll use this to determine a good starting point for preprocessing our text.

text_df <- texts %>%
  tibble(.name_repair = ~ "text") %>%
  mutate(text_length = str_count(text, "\\w+"))

unique_words <- text_df %>%
  tidytext::unnest_tokens(word, text) %>%
  pull(word) %>%
  n_distinct()

avg_review_length <- median(text_df$text_length, na.rm = TRUE)
  
ggplot(text_df, aes(text_length)) +
  geom_histogram(bins = 100, fill = "grey70", color = "grey40") +
  geom_vline(xintercept = avg_review_length, color = "red", lty = "dashed") +
  scale_x_log10("# words") +
  ggtitle(glue("Median review length is {avg_review_length} words"),
          subtitle = glue("Total number of unique words is {unique_words}"))

Data preprocessing

Step 5: Now let’s tokenize our text sequences. To do so we:

  1. Specify how many words we want to include. Remember, a good starting point is to use roughly 50% of the number of unique words in the data. This is a hyperparameter that you can always come back to and adjust.
  2. Create a text_tokenizer object which defines how we want to preprocess the text. The defaults are sufficient.
  3. Apply the tokenizer to our text with fit_text_tokenizer().
  4. Extract our vectorized review data with texts_to_sequences().
# 1
top_n_words <- 20000

# 2-3
tokenizer <- text_tokenizer(num_words = top_n_words) %>% 
  fit_text_tokenizer(texts)

# 4
sequences <- texts_to_sequences(tokenizer, texts)

Go ahead and check out the first vectorized sequence. It should look familiar from earlier modules.

# The vectorized first instance:
sequences[[1]]
  [1]    97    76   948  4621   130     5     2  2015   788  3763     3   102  1313
 [14]    36    72    55    24  1340    23     4     1   518  1324    16    13   677
 [27]  1570     5    83    54     7    33    89     3    32  4476   603     1   637
 [40]   241     1    26     3  8853   189    40  2185     2   301  6461    26    12
 [53]     1   948  1457     8  2401    19     2    79   690   361    16  1767     9
 [66]     7     2    79   105  3979    45     6    48   142    14   465    24    56
 [79]  3667    23     5  1174   194    18  1341  8854    95    45   526     5   830
 [92]     1  7097   902    11   169   362     3   117   341  4203  8855    16    52
[105]    20    61     3   105   752     4   242    35  2694     4    95     3   112
[118]  1878    50   248    14    23  6179    36   249     5    31   624    14   271
[131]  3764  1223    16  2794     9  2283    39    48    20     1   469    11     1
[144]    26    77    64   348   353     6     8    37   321    88  7839     9   436
[157]    87  1314    16   102    55    14  1445   145     7    45    34    10    24
[170]   161  1122    31    59   361   603   159    19    52    20  1076    52    20
[183]    81   504   135    28     1   328    52    20   485    27   177    38     1
[196]   328    52    20   769 13435    52    20 17472    52    20     2     4   910
[209]   114    52    20  5277     4  1386   171     8   632     3    63     4     9
[222]     6   321    25  1957   143    18  2926   120   301   254     5  9489     2
[235]    15   150   143     3   102    16    44    35     9    91     6   256   103
[248]     1   164  2851   100     3   100   208    18    72   259     4  4077    42
[261]    82     2    94    45     6    14   247   954   353    37   613   294    14
[274]   133   801     5  3146    16     9   138     5   257     5  3146     9   308
[287]   313    64   369   493   226     3    75    36    86   171   439  2795    46
[300]    56  1747     7     1  5675    25    92     1  1280   204     6     8     1
[313]  4078     3    18   158  5676    19  1711    38    14    39    62   175  4079
[326]    63     4     9    40    21     1   268    84   212    39    35     4     1
[339] 12156    82     8   197   442     5    86     2    98   251     4   261    16
[352]     9   137   248     1    86     1    15    35     8    43   439    18   618
[365]     1  1486   311    11   121    38    14     6     8    34   167   245    86
[378]   276     8     1   161     6    55     9    56   141    36    20   284     1
[391]   801 15147     5  2016   169   145     4  2329    18   570   963  7840 10194
[404]   426   221    28  3863    12    41  1430   226   457     1    26     6    50
[417]  1684  3592  1123    62    72     9    52    20   101  3024    33     3    62
[430]   120   118    27    36    20    75    62    92   437     5    83     9   100
[443]     3   100   208    96    41   665   132   316   114  3288  1044  2440    55
[456]    35     4     1 12156   159   149     7     1    26   682     1   929   874
[469]    14    15   221    28    64   123     3 13436     9    42   491     3     9
[482]   336    52  8856    69     2   271   875   948   301  6461    26     7   128
[495]  1102    16    18   618     1  2852  1325     8  1729     9    55     2   289
[508]   431    11    98  1326    68    86    43   261     1   209    20   271    61
[521]    12     1    63   204   187  2594 10195    50   675     5    22   388     1
[534]  2109   164    74     8    17   197     7   258  1034    57     7     2   113
[547]  5278    16   106   954  5279   130    40     5 10194    27  1543    33   181
[560]   457     1   339    15     3   175    32    34   621    33  1526   898     1
[573]    15   116  1993    85     9   116  2476    37  1471     9  1358     3     9
[586]   648   271  7098    12    63     4    64 10196   400     2   271   690   324
[599]     3  1994     5    35     4     1 17473     8   770   141     9   657     3
[612]    21     1    84    14     6    25     2   295    46   948  1649   731    37
[625]    50 17474     5   177     8    84    85   371     6   682  6792     8     1
[638]   521     6   137  1132    11     1   440     9    71  3210   460    76   174
[651]   714     3    32    69  1487    19     1  7443   193   194  1045  2853     9
[664]  3980  4204   504    28     2  2214     4  7099  1003   640  1211   302  1446
[677]  1315   314  1211   302     1  2518  1853   302     1  2518  7100   749   302
[690]   341  4203   302   302  8855   302   302     1   485  1853   302  4341     4
[703]  3864  1022   302

We can see how our tokenizer converted our original text into a cleaned-up version:

cat(crayon::blue("Original text:\n"))
Original text:

texts[[1]]
[1] "plot : two teen couples go to a church party , drink and then drive . \nthey get into an accident . \none of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . \nwhat's the deal ? \nwatch the movie and \" sorta \" find out . . . \ncritique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . \nwhich is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn't snag this one correctly . \nthey seem to have taken this pretty neat concept , but executed it terribly . \nso what are the problems with the movie ? \nwell , its main problem is that it's simply too jumbled . \nit starts off \" normal \" but then downshifts into this \" fantasy \" world in which you , as an audience member , have no idea what's going on . \nthere are dreams , there are characters coming back from the dead , there are others who look like the dead , there are strange apparitions , there are disappearances , there are a looooot of chase scenes , there are tons of weird things that happen , and most of it is simply not explained . \nnow i personally don't mind trying to unravel a film every now and then , but when all it does is give me the same clue over and over again , i get kind of fed up after a while , which is this film's biggest problem . \nit's obviously got this big secret to hide , but it seems to want to hide it completely until its final five minutes . \nand do they make things entertaining , thrilling or even engaging , in the meantime ? \nnot really . \nthe sad part is that the arrow and i both dig on flicks like this , so we actually figured most of it out by the half-way point , so all of the strangeness after that did start to make a little bit of sense , but it still didn't the make the film all that more entertaining . \ni guess the bottom line with movies like this is that you should always make sure that the audience is \" into it \" even before they are given the secret password to enter your world of understanding . \ni mean , showing melissa sagemiller running away from visions for about 20 minutes throughout the movie is just plain lazy ! ! \nokay , we get it . . . there \nare people chasing her and we don't know who they are . \ndo we really need to see it over and over again ? \nhow about giving us different scenes offering further insight into all of the strangeness going down in the movie ? \napparently , the studio took this film away from its director and chopped it up themselves , and it shows . \nthere might've been a pretty decent teen mind-fuck movie in here somewhere , but i guess \" the suits \" decided that turning it into a music video with little edge , would make more sense . \nthe actors are pretty good for the most part , although wes bentley just seemed to be playing the exact same character that he did in american beauty , only in a new neighborhood . \nbut my biggest kudos go out to sagemiller , who holds her own throughout the entire film , and actually has you feeling her character's unraveling . \noverall , the film doesn't stick because it doesn't entertain , it's confusing , it rarely excites and it feels pretty redundant for most of its runtime , despite a pretty cool ending and explanation to all of the craziness that came before it . 
\noh , and by the way , this is not a horror or teen slasher flick . . . it's \njust packaged to look that way because someone is apparently assuming that the genre is still hot with the kids . \nit also wrapped production two years ago and has been sitting on the shelves ever since . \nwhatever . . . skip \nit ! \nwhere's joblo coming from ? \na nightmare of elm street 3 ( 7/10 ) - blair witch 2 ( 7/10 ) - the crow ( 9/10 ) - the crow : salvation ( 4/10 ) - lost highway ( 10/10 ) - memento ( 10/10 ) - the others ( 9/10 ) - stir of echoes ( 8/10 ) \n"
cat(crayon::blue("\nRevised text:\n"))

Revised text:

paste(unlist(tokenizer$index_word)[sequences[[1]]] , collapse = " ")
[1] "plot two teen couples go to a church party drink and then drive they get into an accident one of the guys dies but his girlfriend continues to see him in her life and has nightmares what's the deal watch the movie and sorta find out critique a mind fuck movie for the teen generation that touches on a very cool idea but presents it in a very bad package which is what makes this review an even harder one to write since i generally applaud films which attempt to break the mold mess with your head and such lost highway memento but there are good and bad ways of making all types of films and these folks just didn't this one correctly they seem to have taken this pretty neat concept but executed it terribly so what are the problems with the movie well its main problem is that it's simply too jumbled it starts off normal but then into this fantasy world in which you as an audience member have no idea what's going on there are dreams there are characters coming back from the dead there are others who look like the dead there are strange apparitions there are disappearances there are a of chase scenes there are tons of weird things that happen and most of it is simply not explained now i personally don't mind trying to unravel a film every now and then but when all it does is give me the same clue over and over again i get kind of fed up after a while which is this film's biggest problem it's obviously got this big secret to hide but it seems to want to hide it completely until its final five minutes and do they make things entertaining thrilling or even engaging in the meantime not really the sad part is that the arrow and i both dig on flicks like this so we actually figured most of it out by the half way point so all of the strangeness after that did start to make a little bit of sense but it still didn't the make the film all that more entertaining i guess the bottom line with movies like this is that you should always make sure that the audience is into it even before they are given the secret password to enter your world of understanding i mean showing melissa sagemiller running away from visions for about 20 minutes throughout the movie is just plain lazy okay we get it there are people chasing her and we don't know who they are do we really need to see it over and over again how about giving us different scenes offering further insight into all of the strangeness going down in the movie apparently the studio took this film away from its director and chopped it up themselves and it shows there might've been a pretty decent teen mind fuck movie in here somewhere but i guess the suits decided that turning it into a music video with little edge would make more sense the actors are pretty good for the most part although wes bentley just seemed to be playing the exact same character that he did in american beauty only in a new neighborhood but my biggest kudos go out to sagemiller who holds her own throughout the entire film and actually has you feeling her character's overall the film doesn't stick because it doesn't entertain it's confusing it rarely and it feels pretty redundant for most of its runtime despite a pretty cool ending and explanation to all of the craziness that came before it oh and by the way this is not a horror or teen slasher flick it's just packaged to look that way because someone is apparently assuming that the genre is still hot with the kids it also wrapped production two years ago and has been sitting on the shelves ever since whatever skip it where's joblo coming 
from a nightmare of elm street 3 7 10 blair witch 2 7 10 the crow 9 10 the crow salvation 4 10 lost highway 10 10 memento 10 10 the others 9 10 stir of echoes 8 10"

Step 6: Next, since each review is a different length, we need to limit ourselves to a certain number of words so that all our text sequences are the same length.

To do so we:

  1. Specify the max length for each sequence. You can start out with 500 and then tune this hyperparameter later.
  2. Use pad_sequences() to truncate or pad reviews to the specified max_len.
max_len <- 500
features <- pad_sequences(sequences, maxlen = max_len)

You now have your preprocessed feature data: a 2D tensor (aka matrix) with 2000 observations (rows) and max_len columns.

dim(features)
[1] 2000  500
expect_equal(class(features), "matrix")
expect_equal(dim(features), c(obs, max_len))

You can see how the final preprocessed sequence looks for the first movie review with the following code:

paste(unlist(tokenizer$index_word)[features[1,]], collapse = " ")
[1] "a of chase scenes there are tons of weird things that happen and most of it is simply not explained now i personally don't mind trying to unravel a film every now and then but when all it does is give me the same clue over and over again i get kind of fed up after a while which is this film's biggest problem it's obviously got this big secret to hide but it seems to want to hide it completely until its final five minutes and do they make things entertaining thrilling or even engaging in the meantime not really the sad part is that the arrow and i both dig on flicks like this so we actually figured most of it out by the half way point so all of the strangeness after that did start to make a little bit of sense but it still didn't the make the film all that more entertaining i guess the bottom line with movies like this is that you should always make sure that the audience is into it even before they are given the secret password to enter your world of understanding i mean showing melissa sagemiller running away from visions for about 20 minutes throughout the movie is just plain lazy okay we get it there are people chasing her and we don't know who they are do we really need to see it over and over again how about giving us different scenes offering further insight into all of the strangeness going down in the movie apparently the studio took this film away from its director and chopped it up themselves and it shows there might've been a pretty decent teen mind fuck movie in here somewhere but i guess the suits decided that turning it into a music video with little edge would make more sense the actors are pretty good for the most part although wes bentley just seemed to be playing the exact same character that he did in american beauty only in a new neighborhood but my biggest kudos go out to sagemiller who holds her own throughout the entire film and actually has you feeling her character's overall the film doesn't stick because it doesn't entertain it's confusing it rarely and it feels pretty redundant for most of its runtime despite a pretty cool ending and explanation to all of the craziness that came before it oh and by the way this is not a horror or teen slasher flick it's just packaged to look that way because someone is apparently assuming that the genre is still hot with the kids it also wrapped production two years ago and has been sitting on the shelves ever since whatever skip it where's joblo coming from a nightmare of elm street 3 7 10 blair witch 2 7 10 the crow 9 10 the crow salvation 4 10 lost highway 10 10 memento 10 10 the others 9 10 stir of echoes 8 10"

Model training

Step 7: To train our model we will use the validation_split procedure within fit(). Remember, this takes the last XX% of our data to be used as our validation set. But if you recall, our data was organized in “neg” and “pos” folders, so we should randomize our data to make sure our validation set doesn’t end up being all positive or all negative reviews!

set.seed(123)
index <- sample(1:nrow(features))

x_train <- features[index, ]
y_train <- labels[index]

# there should be 2 unique values (0 - neg, 1 - pos) in last 30% of data
expect_equal(
  length(unique(y_train[floor(length(y_train) * 0.7):length(y_train)])), 
  2
  )

Word embedding model

Step 8: We’re now ready to start modeling. For our first model, let’s create a model that:

  1. applies a word embedding layer
    • input_dim should equal top_n_words
    • input_length should equal max_len
    • start with output_dim = 16
  2. flattens the embeddings
  3. classifies with a dense layer

You can use early stopping if you’d like but for the first model:

  • use the default learning rate
  • 20 epochs is more than enough
  • use a batch size of 32
  • use a validation split of 30%
model_basic <- keras_model_sequential() %>%
  layer_embedding(
    input_dim = top_n_words,  # number of words we are considering
    input_length = max_len,   # length that we have set each review to
    output_dim = 16           # length of our word embeddings
    ) %>%  
  layer_flatten() %>%
  layer_dense(units = 1, activation = "sigmoid")
  
model_basic %>% compile(
  optimizer = "rmsprop",
  loss = "binary_crossentropy",
  metrics = "accuracy"
)

history_basic <- model_basic %>% 
  fit(
    x_train, y_train,
    epochs = 20,
    batch_size = 32,
    validation_split = 0.3,
    callbacks = list(
      callback_early_stopping(patience = 3, restore_best_weights = TRUE)
      )
    )

Run the following code to check out your optimal loss and corresponding accuracy.

best_epoch <- which.min(history_basic$metrics$val_loss)
best_loss <- history_basic$metrics$val_loss[best_epoch] %>% round(3)
best_acc <- history_basic$metrics$val_accuracy[best_epoch] %>% round(3)

glue("Our optimal loss is {best_loss} with an accuracy of {best_acc}")
Our optimal loss is 0.565 with an accuracy of 0.705

Word embedding + LSTM model

Step 9: Now let’s build on the previous model by adding an LSTM layer after the layer_embedding layer. When feeding an embedding layer into an LSTM layer, you do not need to flatten the layer. Reference the Intro to LSTMs notebook. For this first LSTM model use units = 32.

model_lstm <- keras_model_sequential() %>%
  layer_embedding(
    input_dim = top_n_words,  # number of words we are considering
    input_length = max_len,   # length that we have set each review to
    output_dim = 16            # length of our word embeddings
    ) %>%  
  layer_lstm(units = 32) %>%
  layer_dense(units = 1, activation = "sigmoid") 

model_lstm %>% compile(
  optimizer = "rmsprop",
  loss = "binary_crossentropy",
  metrics = "accuracy"
)

history_lstm <- model_lstm %>% fit(
  x_train, y_train,
  epochs = 20,
  batch_size = 32,
  validation_split = 0.3,
  callbacks = list(
    callback_early_stopping(patience = 3, restore_best_weights = TRUE)
    )
)

Run the following code to check out your optimal loss and corresponding accuracy.

  1. How does it compare to the word embedding only model?
  2. Why do you think there is a difference?
best_epoch <- which.min(history_lstm$metrics$val_loss)
best_loss <- history_lstm$metrics$val_loss[best_epoch] %>% round(3)
best_acc <- history_lstm$metrics$val_accuracy[best_epoch] %>% round(3)

glue("Our optimal loss is {best_loss} with an accuracy of {best_acc}")
Our optimal loss is 0.445 with an accuracy of 0.795

Search for a better model

Step 10: Spend the rest of the time tuning hyperparameters and see if you can find a better model. Things you can try:

  • Preprocessing hyperparameters
    • adjust the number of words to retain in the word index (top_n_words)
    • adjust the size of the sequences (max_len)
  • Word embedding layer
    • adjust the output_dim
  • LSTM layer
    • adjust the number of units
    • add dropout
    • maybe even add a 2nd LSTM layer
  • Other
    • adjust the learning rate (or even the optimizer, e.g., try “adam”)
    • adjust the batch_size
    • add a callback to reduce the learning rate upon plateauing
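
As a concrete starting point, here is one possible tuning sketch (an illustration, not the required solution). It widens the word embeddings, adds dropout to the LSTM layer, switches the optimizer to “adam”, and uses callback_reduce_lr_on_plateau() to lower the learning rate when the validation loss stalls. The object names (model_tuned, history_tuned) and every specific value shown are assumptions to experiment with, not tuned results.

model_tuned <- keras_model_sequential() %>%
  layer_embedding(
    input_dim = top_n_words,  # you can also revisit top_n_words itself
    input_length = max_len,   # and/or max_len
    output_dim = 32           # wider word embeddings than the first models
    ) %>%
  layer_lstm(units = 64, dropout = 0.2, recurrent_dropout = 0.2) %>%
  layer_dense(units = 1, activation = "sigmoid")

model_tuned %>% compile(
  optimizer = "adam",
  loss = "binary_crossentropy",
  metrics = "accuracy"
)

history_tuned <- model_tuned %>% fit(
  x_train, y_train,
  epochs = 20,
  batch_size = 32,
  validation_split = 0.3,
  callbacks = list(
    callback_early_stopping(patience = 5, restore_best_weights = TRUE),
    callback_reduce_lr_on_plateau(factor = 0.5, patience = 2)
    )
)

Compare its best validation loss and accuracy (using the same which.min() approach as above) against model_basic and model_lstm.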

🏠

---
title: "Can You Improve Sentiment Polarity with LSTMs?"
output:
  html_notebook:
    toc: yes
    toc_float: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
```

This project is designed to test your current knowledge of applying LSTMs to the
[Cornell Movie Review dataset](http://www.cs.cornell.edu/people/pabo/movie-review-data/)
provided by Cornell University. This dataset contains movie reviews introduced
in [Pang & Lee (2004)](https://bit.ly/2SWGVBZ) with 2000 total observations.
Detailed information about the data can be found [here](https://bit.ly/2N08o22).

Your goal is to develop and compare the performance of a word embedding deep
learning classifier to one that incorporates LSTM sequence embedding. I will
guide you along the way, but this project expects you to do most of the work,
from importing and preprocessing text to building the models.

Nearly all the code that you need can be found in these notebooks:

* [Intro to word embeddings](http://bit.ly/dl-imdb-embeddings)
* [Intro to LSTMs](http://bit.ly/dl-lstm-intro)

___Good luck!___

# Requirements

```{r}
library(keras)
library(tidyverse)
library(fs)
library(glue)
library(testthat)
```

# Import the data

For those in the workshop, we have already downloaded the movie review data for
you into the `"/materials/data/cornell_reviews"` directory. Outside of the
workshop, you can find the download instructions [here](http://bit.ly/dl-rqmts).

```{r}
# get path to data
if (stringr::str_detect(here::here(), "conf-2020-user")) {
  movie_dir <- "/home/conf-2020-user/data/cornell_reviews/data"
} else {
  movie_dir <- here::here("materials", "data", "cornell_reviews", "data")
}

fs::dir_tree(movie_dir, recurse = FALSE)
```

__Step 1__: You can see the data have already been separated into positive vs
negative sets. The actual reviews are contained in individual .txt files. Similar
to [Intro to word embeddings](http://bit.ly/dl-imdb-embeddings), let's go ahead
and use this structure to our advantage by iterating over each review and...

1. creating the path to each individual review file,
2. creating a label based on the “neg” or “pos” folder the review is in, and
3. saving the output as a data frame with each review on an individual row.

```{r}
training_files <- movie_dir %>%
  dir_ls() %>%
  map(dir_ls) %>%
  set_names(basename) %>%
  plyr::ldply(data_frame) %>%
  set_names(c("label", "path"))

# you should have 2000 total observations
expect_equal(nrow(training_files), 2000)
```

Go ahead and take a look at your data frame.

```{r}
training_files
```

__Step 2__: How many observations are in each response label (i.e. "neg" vs "pos")?

```{r}
count(training_files, label)
```

__Step 3__: Next, let's iterate over each row and

1. save the label in a labels vector,
2. import the movie review, and
3. save in a texts vector.

```{r}
obs <- nrow(training_files)
labels <- vector(mode = "integer", length = obs)
texts <- vector(mode = "character", length = obs)

for (file in seq_len(obs)) {
  label <- training_files[[file, "label"]]
  path <- training_files[[file, "path"]]
  
  labels[file] <- ifelse(label == "neg", 0, 1)
  texts[file] <- readChar(path, nchars = file.size(path)) 
}
```

The number of observations in your texts vector should equal the number of responses.

```{r}
expect_equal(length(texts), length(labels))
```

Go ahead and check out the text of a couple reviews.

```{r}
texts[1]
```

# Data exploration

__Step 4__: Before preprocessing, let's get a sense of two attributes that will
help us set two of our preprocessing hyperparameters:

1. How many unique words exist across all our reviews? We'll use this to determine
a good starting point for preprocessing our text.

2. What is the distribution of word count across all movie reviews (i.e. mean, 
median)? We'll use this to determine a good starting point for preprocessing our
text.

```{r}
text_df <- texts %>%
  tibble(.name_repair = ~ "text") %>%
  mutate(text_length = str_count(text, "\\w+"))

unique_words <- text_df %>%
  tidytext::unnest_tokens(word, text) %>%
  pull(word) %>%
  n_distinct()

avg_review_length <- median(text_df$text_length, na.rm = TRUE)
  
ggplot(text_df, aes(text_length)) +
  geom_histogram(bins = 100, fill = "grey70", color = "grey40") +
  geom_vline(xintercept = avg_review_length, color = "red", lty = "dashed") +
  scale_x_log10("# words") +
  ggtitle(glue("Median review length is {avg_review_length} words"),
          subtitle = glue("Total number of unique words is {unique_words}"))
```

# Data preprocessing

__Step 5__: Now let's tokenize our text sequences. To do so we:

1. Specify how many words we want to include. Remember, a good starting point
   is to use roughly 50% of the number of unique words in the data. This is a
   hyperparameter that you can always come back to and adjust.
2. Create a `text_tokenizer` object which defines how we want to preprocess the
   text. The defaults are sufficient.
3. Apply the tokenizer to our text with `fit_text_tokenizer()`.
4. Extract our vectorized review data with `texts_to_sequences()`.

```{r}
# 1
top_n_words <- 20000

# 2-3
tokenizer <- text_tokenizer(num_words = top_n_words) %>% 
  fit_text_tokenizer(texts)

# 4
sequences <- texts_to_sequences(tokenizer, texts)
```

Go ahead and check out the first vectorized sequence. It should look familiar
from earlier modules.

```{r}
# The vectorized first instance:
sequences[[1]]
```

We can see how our tokenizer converted our original text into a cleaned-up
version:

```{r} 
cat(crayon::blue("Original text:\n"))
texts[[1]]

cat(crayon::blue("\nRevised text:\n"))
paste(unlist(tokenizer$index_word)[sequences[[1]]] , collapse = " ")
```

__Step 6__: Next, since each review is a different length, we need to limit
ourselves to a certain number of words so that all our text sequences are the
same length. 

To do so we:

1. Specify the max length for each sequence. You can start out with 500 and then
tune this hyperparameter later.
2. Use `pad_sequences()` to truncate or pad reviews to the specified `max_len`.

```{r}
max_len <- 500
features <- pad_sequences(sequences, maxlen = max_len)
```

You now have your preprocessed feature data: a 2D tensor (aka matrix) with
2000 observations (rows) and `max_len` columns.

```{r}
dim(features)

expect_equal(class(features), "matrix")
expect_equal(dim(features), c(obs, max_len))
```

You can see how the final preprocessed sequence looks for the first movie review
with the following code:

```{r}
paste(unlist(tokenizer$index_word)[features[1,]], collapse = " ")
```

# Model training

__Step 7__: To train our model we will use the `validation_split` procedure
within `fit()`. Remember, this takes the last XX% of our data to be used as our
validation set. But if you recall, our data was organized in "neg" and "pos"
folders, so we should randomize our data to make sure our validation set doesn't
end up being all positive or all negative reviews!

```{r}
set.seed(123)
index <- sample(1:nrow(features))

x_train <- features[index, ]
y_train <- labels[index]

# there should be 2 unique values (0 - neg, 1 - pos) in last 30% of data
expect_equal(
  length(unique(y_train[floor(length(y_train) * 0.7):length(y_train)])), 
  2
  )
```

## Word embedding model

__Step 8__: We're now ready to start modeling. For our first model, let's create a
model that:

1. applies a word embedding layer
   - `input_dim` should equal `top_n_words`
   - `input_length` should equal `max_len`
   - start with `output_dim` = 16
2. flattens the embeddings
3. classifies with a dense layer

You can use early stopping if you'd like but for the first model:

* use the default learning rate
* 20 epochs is more than enough
* use a batch size of 32
* use a validation split of 30%

```{r}
model_basic <- keras_model_sequential() %>%
  layer_embedding(
    input_dim = top_n_words,  # number of words we are considering
    input_length = max_len,   # length that we have set each review to
    output_dim = 16           # length of our word embeddings
    ) %>%  
  layer_flatten() %>%
  layer_dense(units = 1, activation = "sigmoid")
  
model_basic %>% compile(
  optimizer = "rmsprop",
  loss = "binary_crossentropy",
  metrics = "accuracy"
)

history_basic <- model_basic %>% 
  fit(
    x_train, y_train,
    epochs = 20,
    batch_size = 32,
    validation_split = 0.3,
    callbacks = list(
      callback_early_stopping(patience = 3, restore_best_weights = TRUE)
      )
    )
```

Run the following code to check out your optimal loss and corresponding accuracy.

```{r}
best_epoch <- which.min(history_basic$metrics$val_loss)
best_loss <- history_basic$metrics$val_loss[best_epoch] %>% round(3)
best_acc <- history_basic$metrics$val_accuracy[best_epoch] %>% round(3)

glue("Our optimal loss is {best_loss} with an accuracy of {best_acc}")
```

## Word embedding + LSTM model

__Step 9__: Now let's build on the previous model by adding an LSTM layer
after the `layer_embedding` layer. When feeding an embedding layer into an LSTM
layer, you __do not__ need to flatten the layer. Reference the [Intro to LSTMs notebook](http://bit.ly/dl-lstm-intro#train-an-lstm). For this first LSTM model
use `units = 32`.

```{r}
model_lstm <- keras_model_sequential() %>%
  layer_embedding(
    input_dim = top_n_words,  # number of words we are considering
    input_length = max_len,   # length that we have set each review to
    output_dim = 16            # length of our word embeddings
    ) %>%  
  layer_lstm(units = 32) %>%
  layer_dense(units = 1, activation = "sigmoid") 

model_lstm %>% compile(
  optimizer = "rmsprop",
  loss = "binary_crossentropy",
  metrics = "accuracy"
)

history_lstm <- model_lstm %>% fit(
  x_train, y_train,
  epochs = 20,
  batch_size = 32,
  validation_split = 0.3,
  callbacks = list(
    callback_early_stopping(patience = 3, restore_best_weights = TRUE)
    )
)
```

Run the following code to check out your optimal loss and corresponding accuracy.

1. How does it compare to the word embedding only model?
2. Why do you think there is a difference?

```{r}
best_epoch <- which.min(history_lstm$metrics$val_loss)
best_loss <- history_lstm$metrics$val_loss[best_epoch] %>% round(3)
best_acc <- history_lstm$metrics$val_accuracy[best_epoch] %>% round(3)

glue("Our optimal loss is {best_loss} with an accuracy of {best_acc}")
```

## Search for a better model

__Step 10__: Spend the rest of the time tuning hyperparameters and see if you
can find a better model. Things you can try:

* Preprocessing hyperparameters
   - adjust the number of words to retain in the word index (`top_n_words`)
   - adjust the size of the sequences (`max_len`)
* Word embedding layer
   - adjust the `output_dim`
* LSTM layer
   - adjust the number of `units`
   - add dropout (ref http://bit.ly/dl-lstm-intro#your-turn-5min-1)
   - maybe even add a 2nd LSTM layer
* Other
   - adjust the learning rate (or even the optimizer, e.g., try "adam")
   - adjust the `batch_size`
   - add a callback to reduce the learning rate upon plateauing
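
As a concrete starting point, here is one possible tuning sketch (an illustration,
not the required solution). It widens the word embeddings, adds dropout to the
LSTM layer, switches the optimizer to "adam", and uses `callback_reduce_lr_on_plateau()`
to lower the learning rate when the validation loss stalls. The object names
(`model_tuned`, `history_tuned`) and every specific value shown are assumptions to
experiment with, not tuned results.

```{r}
model_tuned <- keras_model_sequential() %>%
  layer_embedding(
    input_dim = top_n_words,  # you can also revisit top_n_words itself
    input_length = max_len,   # and/or max_len
    output_dim = 32           # wider word embeddings than the first models
    ) %>%
  layer_lstm(units = 64, dropout = 0.2, recurrent_dropout = 0.2) %>%
  layer_dense(units = 1, activation = "sigmoid")

model_tuned %>% compile(
  optimizer = "adam",
  loss = "binary_crossentropy",
  metrics = "accuracy"
)

history_tuned <- model_tuned %>% fit(
  x_train, y_train,
  epochs = 20,
  batch_size = 32,
  validation_split = 0.3,
  callbacks = list(
    callback_early_stopping(patience = 5, restore_best_weights = TRUE),
    callback_reduce_lr_on_plateau(factor = 0.5, patience = 2)
    )
)
```

Compare its best `val_loss` and `val_accuracy` (via the same `which.min()` approach
used above) against `model_basic` and `model_lstm`.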

[🏠](https://github.com/rstudio-conf-2020/dl-keras-tf)