This project is designed to test your current knowledge of applying LSTMs to the Cornell movie review dataset. This dataset, introduced in Pang & Lee (2004), contains 2000 movie reviews. Detailed information about the data can be found here.
Your goal is to develop and compare the performance of a word embedding deep learning classifier against one that incorporates an LSTM sequence embedding. I will guide you along the way, but this project expects you to do most of the work, from importing and preprocessing the text to building the models.
Nearly all the code that you need can be found in these notebooks:
Good luck!
library(keras)      # deep learning models
library(tidyverse)  # data wrangling and visualization
library(fs)         # file system helpers
library(glue)       # string interpolation
library(testthat)   # expectations for checking your work
For those in the workshop, we have already downloaded the movie review data for you into the "/materials/data/cornell_reviews" directory. Outside of the workshop, you can find the download instructions here.
# get path to data
if (stringr::str_detect(here::here(), "conf-2020-user")) {
movie_dir <- "/home/conf-2020-user/data/cornell_reviews/data"
} else {
movie_dir <- here::here("materials", "data", "cornell_reviews", "data")
}
fs::dir_tree(movie_dir, recurse = FALSE)
/Users/b294776/Desktop/Workspace/Training/rstudio-conf-2020/dl-keras-tf/materials/data/cornell_reviews/data
├── neg
└── pos
Step 1: You can see the data have already been separated into positive vs. negative sets. The actual reviews are contained in individual .txt files. Similar to the Intro to word embeddings notebook, let's use this structure to our advantage by iterating over each review and collecting each file's path along with its label into a single data frame.
training_files <- movie_dir %>%
  dir_ls() %>%                                 # the "neg" and "pos" directories
  map(dir_ls) %>%                              # the review .txt files within each
  set_names(basename) %>%                      # name the list elements "neg"/"pos"
  map_dfr(~ tibble(path = .x), .id = "label")  # stack into a label/path tibble
# you should have 2000 total observations
expect_equal(nrow(training_files), 2000)
Go ahead and take a look at your data frame:
training_files
Step 2: How many observations are in each response label (i.e. "neg" vs "pos")?
count(training_files, label)
Step 3: Next, let's iterate over each row and extract the label (encoded as 0 for negative, 1 for positive) and the full review text into their own vectors.
obs <- nrow(training_files)

# preallocate the response and text vectors
labels <- vector(mode = "integer", length = obs)
texts <- vector(mode = "character", length = obs)

# extract the label and the full review text for each file
for (file in seq_len(obs)) {
label <- training_files[[file, "label"]]
path <- training_files[[file, "path"]]
labels[file] <- ifelse(label == "neg", 0, 1)
texts[file] <- readChar(path, nchars = file.size(path))
}
The number of observations in your texts vector should equal the number of responses.
expect_equal(length(texts), length(labels))
Go ahead and check out the text of a couple of reviews.
texts[1]
[1] "plot : two teen couples go to a church party , drink and then drive . \nthey get into an accident . \none of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . \nwhat's the deal ? \nwatch the movie and \" sorta \" find out . . . \ncritique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . \nwhich is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn't snag this one correctly . \nthey seem to have taken this pretty neat concept , but executed it terribly . \nso what are the problems with the movie ? \nwell , its main problem is that it's simply too jumbled . \nit starts off \" normal \" but then downshifts into this \" fantasy \" world in which you , as an audience member , have no idea what's going on . \nthere are dreams , there are characters coming back from the dead , there are others who look like the dead , there are strange apparitions , there are disappearances , there are a looooot of chase scenes , there are tons of weird things that happen , and most of it is simply not explained . \nnow i personally don't mind trying to unravel a film every now and then , but when all it does is give me the same clue over and over again , i get kind of fed up after a while , which is this film's biggest problem . \nit's obviously got this big secret to hide , but it seems to want to hide it completely until its final five minutes . \nand do they make things entertaining , thrilling or even engaging , in the meantime ? \nnot really . \nthe sad part is that the arrow and i both dig on flicks like this , so we actually figured most of it out by the half-way point , so all of the strangeness after that did start to make a little bit of sense , but it still didn't the make the film all that more entertaining . \ni guess the bottom line with movies like this is that you should always make sure that the audience is \" into it \" even before they are given the secret password to enter your world of understanding . \ni mean , showing melissa sagemiller running away from visions for about 20 minutes throughout the movie is just plain lazy ! ! \nokay , we get it . . . there \nare people chasing her and we don't know who they are . \ndo we really need to see it over and over again ? \nhow about giving us different scenes offering further insight into all of the strangeness going down in the movie ? \napparently , the studio took this film away from its director and chopped it up themselves , and it shows . \nthere might've been a pretty decent teen mind-fuck movie in here somewhere , but i guess \" the suits \" decided that turning it into a music video with little edge , would make more sense . \nthe actors are pretty good for the most part , although wes bentley just seemed to be playing the exact same character that he did in american beauty , only in a new neighborhood . \nbut my biggest kudos go out to sagemiller , who holds her own throughout the entire film , and actually has you feeling her character's unraveling . \noverall , the film doesn't stick because it doesn't entertain , it's confusing , it rarely excites and it feels pretty redundant for most of its runtime , despite a pretty cool ending and explanation to all of the craziness that came before it . 
\noh , and by the way , this is not a horror or teen slasher flick . . . it's \njust packaged to look that way because someone is apparently assuming that the genre is still hot with the kids . \nit also wrapped production two years ago and has been sitting on the shelves ever since . \nwhatever . . . skip \nit ! \nwhere's joblo coming from ? \na nightmare of elm street 3 ( 7/10 ) - blair witch 2 ( 7/10 ) - the crow ( 9/10 ) - the crow : salvation ( 4/10 ) - lost highway ( 10/10 ) - memento ( 10/10 ) - the others ( 9/10 ) - stir of echoes ( 8/10 ) \n"
Step 4: Before preprocessing, let's get a sense of two attributes that will help us set two of our preprocessing hyperparameters:
How many unique words exist across all our reviews? We'll use this to choose a sensible vocabulary size when tokenizing.
What is the distribution of word counts across all movie reviews (i.e. mean, median)? We'll use this to choose a sensible maximum sequence length when padding.
text_df <- tibble(text = texts) %>%
  mutate(text_length = str_count(text, "\\w+"))
unique_words <- text_df %>%
tidytext::unnest_tokens(word, text) %>%
pull(word) %>%
n_distinct()
avg_review_length <- median(text_df$text_length, na.rm = TRUE)
ggplot(text_df, aes(text_length)) +
geom_histogram(bins = 100, fill = "grey70", color = "grey40") +
geom_vline(xintercept = avg_review_length, color = "red", lty = "dashed") +
scale_x_log10("# words") +
ggtitle(glue("Median review length is {avg_review_length} words"),
subtitle = glue("Total number of unique words is {unique_words}"))
Step 5: Now let's tokenize our text sequences. To do so we:
1. Specify how many words to consider (here, the 20,000 most frequent words).
2. Create a text_tokenizer object, which defines how we want to preprocess the text (the defaults are sufficient).
3. Apply the tokenizer to our text with fit_text_tokenizer().
4. Extract our vectorized review sequences with texts_to_sequences().
# 1
top_n_words <- 20000
# 2-3
tokenizer <- text_tokenizer(num_words = top_n_words) %>%
fit_text_tokenizer(texts)
# 4
sequences <- texts_to_sequences(tokenizer, texts)
Go ahead and check out the first vectorized sequence. It should look familiar from earlier modules.
# The vectorized first instance:
sequences[[1]]
[1] 97 76 948 4621 130 5 2 2015 788 3763 3 102 1313
[14] 36 72 55 24 1340 23 4 1 518 1324 16 13 677
[27] 1570 5 83 54 7 33 89 3 32 4476 603 1 637
[40] 241 1 26 3 8853 189 40 2185 2 301 6461 26 12
[53] 1 948 1457 8 2401 19 2 79 690 361 16 1767 9
[66] 7 2 79 105 3979 45 6 48 142 14 465 24 56
[79] 3667 23 5 1174 194 18 1341 8854 95 45 526 5 830
[92] 1 7097 902 11 169 362 3 117 341 4203 8855 16 52
[105] 20 61 3 105 752 4 242 35 2694 4 95 3 112
[118] 1878 50 248 14 23 6179 36 249 5 31 624 14 271
[131] 3764 1223 16 2794 9 2283 39 48 20 1 469 11 1
[144] 26 77 64 348 353 6 8 37 321 88 7839 9 436
[157] 87 1314 16 102 55 14 1445 145 7 45 34 10 24
[170] 161 1122 31 59 361 603 159 19 52 20 1076 52 20
[183] 81 504 135 28 1 328 52 20 485 27 177 38 1
[196] 328 52 20 769 13435 52 20 17472 52 20 2 4 910
[209] 114 52 20 5277 4 1386 171 8 632 3 63 4 9
[222] 6 321 25 1957 143 18 2926 120 301 254 5 9489 2
[235] 15 150 143 3 102 16 44 35 9 91 6 256 103
[248] 1 164 2851 100 3 100 208 18 72 259 4 4077 42
[261] 82 2 94 45 6 14 247 954 353 37 613 294 14
[274] 133 801 5 3146 16 9 138 5 257 5 3146 9 308
[287] 313 64 369 493 226 3 75 36 86 171 439 2795 46
[300] 56 1747 7 1 5675 25 92 1 1280 204 6 8 1
[313] 4078 3 18 158 5676 19 1711 38 14 39 62 175 4079
[326] 63 4 9 40 21 1 268 84 212 39 35 4 1
[339] 12156 82 8 197 442 5 86 2 98 251 4 261 16
[352] 9 137 248 1 86 1 15 35 8 43 439 18 618
[365] 1 1486 311 11 121 38 14 6 8 34 167 245 86
[378] 276 8 1 161 6 55 9 56 141 36 20 284 1
[391] 801 15147 5 2016 169 145 4 2329 18 570 963 7840 10194
[404] 426 221 28 3863 12 41 1430 226 457 1 26 6 50
[417] 1684 3592 1123 62 72 9 52 20 101 3024 33 3 62
[430] 120 118 27 36 20 75 62 92 437 5 83 9 100
[443] 3 100 208 96 41 665 132 316 114 3288 1044 2440 55
[456] 35 4 1 12156 159 149 7 1 26 682 1 929 874
[469] 14 15 221 28 64 123 3 13436 9 42 491 3 9
[482] 336 52 8856 69 2 271 875 948 301 6461 26 7 128
[495] 1102 16 18 618 1 2852 1325 8 1729 9 55 2 289
[508] 431 11 98 1326 68 86 43 261 1 209 20 271 61
[521] 12 1 63 204 187 2594 10195 50 675 5 22 388 1
[534] 2109 164 74 8 17 197 7 258 1034 57 7 2 113
[547] 5278 16 106 954 5279 130 40 5 10194 27 1543 33 181
[560] 457 1 339 15 3 175 32 34 621 33 1526 898 1
[573] 15 116 1993 85 9 116 2476 37 1471 9 1358 3 9
[586] 648 271 7098 12 63 4 64 10196 400 2 271 690 324
[599] 3 1994 5 35 4 1 17473 8 770 141 9 657 3
[612] 21 1 84 14 6 25 2 295 46 948 1649 731 37
[625] 50 17474 5 177 8 84 85 371 6 682 6792 8 1
[638] 521 6 137 1132 11 1 440 9 71 3210 460 76 174
[651] 714 3 32 69 1487 19 1 7443 193 194 1045 2853 9
[664] 3980 4204 504 28 2 2214 4 7099 1003 640 1211 302 1446
[677] 1315 314 1211 302 1 2518 1853 302 1 2518 7100 749 302
[690] 341 4203 302 302 8855 302 302 1 485 1853 302 4341 4
[703] 3864 1022 302
We can see how our tokenizer converted our original text to a cleaned-up version:
cat(crayon::blue("Original text:\n"))
Original text:
texts[[1]]
[1] "plot : two teen couples go to a church party , drink and then drive . \nthey get into an accident . \none of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . \nwhat's the deal ? \nwatch the movie and \" sorta \" find out . . . \ncritique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . \nwhich is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn't snag this one correctly . \nthey seem to have taken this pretty neat concept , but executed it terribly . \nso what are the problems with the movie ? \nwell , its main problem is that it's simply too jumbled . \nit starts off \" normal \" but then downshifts into this \" fantasy \" world in which you , as an audience member , have no idea what's going on . \nthere are dreams , there are characters coming back from the dead , there are others who look like the dead , there are strange apparitions , there are disappearances , there are a looooot of chase scenes , there are tons of weird things that happen , and most of it is simply not explained . \nnow i personally don't mind trying to unravel a film every now and then , but when all it does is give me the same clue over and over again , i get kind of fed up after a while , which is this film's biggest problem . \nit's obviously got this big secret to hide , but it seems to want to hide it completely until its final five minutes . \nand do they make things entertaining , thrilling or even engaging , in the meantime ? \nnot really . \nthe sad part is that the arrow and i both dig on flicks like this , so we actually figured most of it out by the half-way point , so all of the strangeness after that did start to make a little bit of sense , but it still didn't the make the film all that more entertaining . \ni guess the bottom line with movies like this is that you should always make sure that the audience is \" into it \" even before they are given the secret password to enter your world of understanding . \ni mean , showing melissa sagemiller running away from visions for about 20 minutes throughout the movie is just plain lazy ! ! \nokay , we get it . . . there \nare people chasing her and we don't know who they are . \ndo we really need to see it over and over again ? \nhow about giving us different scenes offering further insight into all of the strangeness going down in the movie ? \napparently , the studio took this film away from its director and chopped it up themselves , and it shows . \nthere might've been a pretty decent teen mind-fuck movie in here somewhere , but i guess \" the suits \" decided that turning it into a music video with little edge , would make more sense . \nthe actors are pretty good for the most part , although wes bentley just seemed to be playing the exact same character that he did in american beauty , only in a new neighborhood . \nbut my biggest kudos go out to sagemiller , who holds her own throughout the entire film , and actually has you feeling her character's unraveling . \noverall , the film doesn't stick because it doesn't entertain , it's confusing , it rarely excites and it feels pretty redundant for most of its runtime , despite a pretty cool ending and explanation to all of the craziness that came before it . 
\noh , and by the way , this is not a horror or teen slasher flick . . . it's \njust packaged to look that way because someone is apparently assuming that the genre is still hot with the kids . \nit also wrapped production two years ago and has been sitting on the shelves ever since . \nwhatever . . . skip \nit ! \nwhere's joblo coming from ? \na nightmare of elm street 3 ( 7/10 ) - blair witch 2 ( 7/10 ) - the crow ( 9/10 ) - the crow : salvation ( 4/10 ) - lost highway ( 10/10 ) - memento ( 10/10 ) - the others ( 9/10 ) - stir of echoes ( 8/10 ) \n"
cat(crayon::blue("\nRevised text:\n"))
Revised text:
paste(unlist(tokenizer$index_word)[sequences[[1]]], collapse = " ")
[1] "plot two teen couples go to a church party drink and then drive they get into an accident one of the guys dies but his girlfriend continues to see him in her life and has nightmares what's the deal watch the movie and sorta find out critique a mind fuck movie for the teen generation that touches on a very cool idea but presents it in a very bad package which is what makes this review an even harder one to write since i generally applaud films which attempt to break the mold mess with your head and such lost highway memento but there are good and bad ways of making all types of films and these folks just didn't this one correctly they seem to have taken this pretty neat concept but executed it terribly so what are the problems with the movie well its main problem is that it's simply too jumbled it starts off normal but then into this fantasy world in which you as an audience member have no idea what's going on there are dreams there are characters coming back from the dead there are others who look like the dead there are strange apparitions there are disappearances there are a of chase scenes there are tons of weird things that happen and most of it is simply not explained now i personally don't mind trying to unravel a film every now and then but when all it does is give me the same clue over and over again i get kind of fed up after a while which is this film's biggest problem it's obviously got this big secret to hide but it seems to want to hide it completely until its final five minutes and do they make things entertaining thrilling or even engaging in the meantime not really the sad part is that the arrow and i both dig on flicks like this so we actually figured most of it out by the half way point so all of the strangeness after that did start to make a little bit of sense but it still didn't the make the film all that more entertaining i guess the bottom line with movies like this is that you should always make sure that the audience is into it even before they are given the secret password to enter your world of understanding i mean showing melissa sagemiller running away from visions for about 20 minutes throughout the movie is just plain lazy okay we get it there are people chasing her and we don't know who they are do we really need to see it over and over again how about giving us different scenes offering further insight into all of the strangeness going down in the movie apparently the studio took this film away from its director and chopped it up themselves and it shows there might've been a pretty decent teen mind fuck movie in here somewhere but i guess the suits decided that turning it into a music video with little edge would make more sense the actors are pretty good for the most part although wes bentley just seemed to be playing the exact same character that he did in american beauty only in a new neighborhood but my biggest kudos go out to sagemiller who holds her own throughout the entire film and actually has you feeling her character's overall the film doesn't stick because it doesn't entertain it's confusing it rarely and it feels pretty redundant for most of its runtime despite a pretty cool ending and explanation to all of the craziness that came before it oh and by the way this is not a horror or teen slasher flick it's just packaged to look that way because someone is apparently assuming that the genre is still hot with the kids it also wrapped production two years ago and has been sitting on the shelves ever since whatever skip it where's joblo coming 
from a nightmare of elm street 3 7 10 blair witch 2 7 10 the crow 9 10 the crow salvation 4 10 lost highway 10 10 memento 10 10 the others 9 10 stir of echoes 8 10"
Step 6: Next, since each review is a different length, we need to limit ourselves to a certain number of words so that all our text sequences are the same length.
To do so we:
1. Specify the maximum sequence length (max_len = 500).
2. Use pad_sequences() to truncate or pad each review to the specified max_len.
max_len <- 500
features <- pad_sequences(sequences, maxlen = max_len)
You now have your preprocessed feature data: a 2D tensor (aka matrix) with 2000 observations (rows) and max_len columns.
dim(features)
[1] 2000 500
expect_true(is.matrix(features))  # in R >= 4.0, class(matrix) is c("matrix", "array")
expect_equal(dim(features), c(obs, max_len))
You can see how the final preprocessed sequence looks for the first movie review with the following code:
paste(unlist(tokenizer$index_word)[features[1,]], collapse = " ")
[1] "a of chase scenes there are tons of weird things that happen and most of it is simply not explained now i personally don't mind trying to unravel a film every now and then but when all it does is give me the same clue over and over again i get kind of fed up after a while which is this film's biggest problem it's obviously got this big secret to hide but it seems to want to hide it completely until its final five minutes and do they make things entertaining thrilling or even engaging in the meantime not really the sad part is that the arrow and i both dig on flicks like this so we actually figured most of it out by the half way point so all of the strangeness after that did start to make a little bit of sense but it still didn't the make the film all that more entertaining i guess the bottom line with movies like this is that you should always make sure that the audience is into it even before they are given the secret password to enter your world of understanding i mean showing melissa sagemiller running away from visions for about 20 minutes throughout the movie is just plain lazy okay we get it there are people chasing her and we don't know who they are do we really need to see it over and over again how about giving us different scenes offering further insight into all of the strangeness going down in the movie apparently the studio took this film away from its director and chopped it up themselves and it shows there might've been a pretty decent teen mind fuck movie in here somewhere but i guess the suits decided that turning it into a music video with little edge would make more sense the actors are pretty good for the most part although wes bentley just seemed to be playing the exact same character that he did in american beauty only in a new neighborhood but my biggest kudos go out to sagemiller who holds her own throughout the entire film and actually has you feeling her character's overall the film doesn't stick because it doesn't entertain it's confusing it rarely and it feels pretty redundant for most of its runtime despite a pretty cool ending and explanation to all of the craziness that came before it oh and by the way this is not a horror or teen slasher flick it's just packaged to look that way because someone is apparently assuming that the genre is still hot with the kids it also wrapped production two years ago and has been sitting on the shelves ever since whatever skip it where's joblo coming from a nightmare of elm street 3 7 10 blair witch 2 7 10 the crow 9 10 the crow salvation 4 10 lost highway 10 10 memento 10 10 the others 9 10 stir of echoes 8 10"
Step 7: To train our model we will use the validation_split argument within fit(). Remember, this takes the last XX% of our data to be used as our validation set. But recall that our data were organized in "neg" and "pos" folders, so we should shuffle the rows to make sure our validation set doesn't end up being all positive or all negative reviews!
set.seed(123)
index <- sample(1:nrow(features))
x_train <- features[index, ]
y_train <- labels[index]
# there should be 2 unique values (0 - neg, 1 - pos) in last 30% of data
expect_equal(
length(unique(y_train[floor(length(y_train) * 0.7):length(y_train)])),
2
)
Step 8: We're now ready to do some modeling. For our first model, let's create a classifier built on a word embedding layer where:
- input_dim should equal top_n_words
- input_length should equal max_len
- output_dim = 16
You can use early stopping if you'd like, but for the first model:
model_basic <- keras_model_sequential() %>%
layer_embedding(
input_dim = top_n_words, # number of words we are considering
input_length = max_len, # length that we have set each review to
output_dim = 16 # length of our word embeddings
) %>%
layer_flatten() %>%
layer_dense(units = 1, activation = "sigmoid")
model_basic %>% compile(
optimizer = "rmsprop",
loss = "binary_crossentropy",
metrics = "accuracy"
)
history_basic <- model_basic %>%
fit(
x_train, y_train,
epochs = 20,
batch_size = 32,
validation_split = 0.3,
callbacks = list(
callback_early_stopping(patience = 3, restore_best_weights = TRUE)
)
)
Run the following code to check out your optimal loss and corresponding accuracy.
best_epoch <- which.min(history_basic$metrics$val_loss)
best_loss <- history_basic$metrics$val_loss[best_epoch] %>% round(3)
best_acc <- history_basic$metrics$val_accuracy[best_epoch] %>% round(3)
glue("Our optimal loss is {best_loss} with an accuracy of {best_acc}")
Our optimal loss is 0.565 with an accuracy of 0.705
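A single best-epoch number can hide overfitting; to see the full learning curves, you can plot the history object directly (the R keras package provides a plot() method for training histories):
# loss and accuracy across epochs, for both training and validation splits
plot(history_basic)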
Step 9: Now let's build on the previous model by adding an LSTM layer after the layer_embedding layer. When feeding an embedding layer into an LSTM layer, you do not need to flatten it. Reference the Intro to LSTMs notebook. For this first LSTM model, use units = 32.
model_lstm <- keras_model_sequential() %>%
layer_embedding(
input_dim = top_n_words, # number of words we are considering
input_length = max_len, # length that we have set each review to
output_dim = 16 # length of our word embeddings
) %>%
layer_lstm(units = 32) %>%
layer_dense(units = 1, activation = "sigmoid")
model_lstm %>% compile(
optimizer = "rmsprop",
loss = "binary_crossentropy",
metrics = "accuracy"
)
history_lstm <- model_lstm %>% fit(
x_train, y_train,
epochs = 20,
batch_size = 32,
validation_split = 0.3,
callbacks = list(
callback_early_stopping(patience = 3, restore_best_weights = TRUE)
)
)
Run the following code to check out your optimal loss and corresponding accuracy.
best_epoch <- which.min(history_lstm$metrics$val_loss)
best_loss <- history_lstm$metrics$val_loss[best_epoch] %>% round(3)
best_acc <- history_lstm$metrics$val_accuracy[best_epoch] %>% round(3)
glue("Our optimal loss is {best_loss} with an accuracy of {best_acc}")
Our optimal loss is 0.445 with an accuracy of 0.795
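If you'd like to compare the two models epoch by epoch rather than only at their best epochs, you can stack the two histories into one data frame. This is just a sketch; it assumes the as.data.frame() method the R keras package provides for training histories, which returns columns epoch, value, metric, and data:
# combine both training histories and compare validation loss curves
bind_rows(
  as.data.frame(history_basic) %>% mutate(model = "embedding only"),
  as.data.frame(history_lstm) %>% mutate(model = "embedding + LSTM")
) %>%
  filter(metric == "loss", data == "validation") %>%
  ggplot(aes(epoch, value, color = model)) +
  geom_line() +
  ylab("validation loss")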
Step 10: Spend the rest of the time tuning hyperparameters and see if you can find a better model. Things you can try (one possible variation is sketched after this list):
- the vocabulary size (top_n_words)
- the sequence length (max_len)
- the embedding dimension (output_dim)
- the number of LSTM units (units)
- the batch_size
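For example, here is one variation you might sketch out: a wider embedding feeding a bidirectional LSTM with dropout. bidirectional(), dropout, and recurrent_dropout are standard keras options, but the specific values below are arbitrary starting points, not a known-best configuration:
# a hypothetical tuned model: wider embedding + bidirectional LSTM with dropout
model_tuned <- keras_model_sequential() %>%
  layer_embedding(
    input_dim = top_n_words,
    input_length = max_len,
    output_dim = 32                  # larger embedding than before
  ) %>%
  bidirectional(layer_lstm(
    units = 32,
    dropout = 0.2,                   # dropout on the layer inputs
    recurrent_dropout = 0.2          # dropout on the recurrent state
  )) %>%
  layer_dense(units = 1, activation = "sigmoid")

model_tuned %>% compile(
  optimizer = "rmsprop",
  loss = "binary_crossentropy",
  metrics = "accuracy"
)

history_tuned <- model_tuned %>% fit(
  x_train, y_train,
  epochs = 20,
  batch_size = 64,                   # larger batch size than before
  validation_split = 0.3,
  callbacks = list(
    callback_early_stopping(patience = 3, restore_best_weights = TRUE)
  )
)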