This project is designed to test your current knowledge of applying word embeddings to the Amazon Fine Foods reviews dataset available through Stanford. This dataset contains 568,454 reviews on 74,258 products.
Your goal is to develop a word embedding model to accurately predict how helpful a review will be. I supply code to help you get the data imported and prepped so that you can focus on the modeling aspect.
Good luck!
library(keras) # provides deep learning procedures
library(tidyverse) # provides basic data wrangling and visualization
library(glue) # provides efficient print statements
library(testthat) # provides unit testing
The finefoods.txt.gz file has already been downloaded and unzipped for you. All reviews are contained in a single .txt file.
amazon_reviews <- here::here("materials", "data", "amazon-food", "finefoods.txt")
reviews <- read_lines(amazon_reviews)
Each review consists of 8 items and each item is on its own line. The following shows all information collected for the first review.
head(reviews, 8)
[1] "product/productId: B001E4KFG0"
[2] "review/userId: A3SGXH7AUHU8GW"
[3] "review/profileName: delmartian"
[4] "review/helpfulness: 1/1"
[5] "review/score: 5.0"
[6] "review/time: 1303862400"
[7] "review/summary: Good Quality Dog Food"
[8] "review/text: I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than most."
Based on the data’s website, we should have 568,454 reviews, 74,258 products, and 256,059 users. Let’s verify that our imported data matches:
review_text <- reviews[str_detect(reviews, "review/text:")]
products <- reviews[str_detect(reviews, "product/productId:")]
users <- reviews[str_detect(reviews, "review/userId:")]
n_reviews <- length(review_text)
n_products <- n_distinct(products)
n_users <- n_distinct(users)
# Verify our imported data aligns with data codebook
expect_equal(n_reviews, 568454)
expect_equal(n_products, 74258)
expect_equal(n_users, 256059)
There are two main parts of these reviews that we need for our modeling purpose: the review text, which will become our features, and the helpfulness information, which will become our labels.
Let’s extract the text first:
text <- review_text %>%
str_replace("review/text:", "") %>%
iconv(to = "UTF-8") %>%
str_trim()
expect_equal(length(text), n_reviews)
text[1]
[1] "I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than most."
Now let’s extract our helpfulness information. This represents the fraction of users who found a given review helpful.
helpfulness_info <- reviews[str_detect(reviews, "review/helpfulness:")] %>%
str_extract("\\d.*")
expect_equal(length(helpfulness_info), n_reviews)
head(helpfulness_info)
[1] "1/1" "0/0" "1/1" "3/3" "0/0" "0/0"
Let’s separate this information into the total number of users who rated the review’s helpfulness (the denominator, stored as num_reviews) and the number of users who found the review helpful (the numerator).
num_reviews <- str_replace(helpfulness_info, "^.*\\/", "") %>% as.integer()
helpfulness <- str_replace(helpfulness_info, "\\/.*$", "") %>% as.integer()
And we’re only going to keep reviews with 10 or more helpfulness ratings to try to minimize some of the noise.
num_index <- num_reviews >= 10
num_reviews <- num_reviews[num_index]
helpfulness <- helpfulness[num_index]
text <- text[num_index]
# verify that the number of observations in each vector is equal
expect_equal(
map_int(list(num_reviews, helpfulness, text), length) %>% n_distinct(),
1
)
glue("There are {sum(num_index)} observations with 10 or more reviews.")
There are 24982 observations with 10 or more reviews.
Our labels are going to be the helpfulness fraction expressed as a proportion between 0 and 1.
labels <- helpfulness / num_reviews
expect_equal(length(labels), length(text))
range(labels)
[1] 0 1
We can look at a review that is considered very helpful…
first_pos <- first(which(labels == 1))
text[first_pos]
[1] "McCann's Instant Oatmeal is great if you must have your oatmeal but can only scrape together two or three minutes to prepare it. There is no escaping the fact, however, that even the best instant oatmeal is nowhere near as good as even a store brand of oatmeal requiring stovetop preparation. Still, the McCann's is as good as it gets for instant oatmeal. It's even better than the organic, all-natural brands I have tried. All the varieties in the McCann's variety pack taste good. It can be prepared in the microwave or by adding boiling water so it is convenient in the extreme when time is an issue.<br /><br />McCann's use of actual cane sugar instead of high fructose corn syrup helped me decide to buy this product. Real sugar tastes better and is not as harmful as the other stuff. One thing I do not like, though, is McCann's use of thickeners. Oats plus water plus heat should make a creamy, tasty oatmeal without the need for guar gum. But this is a convenience product. Maybe the guar gum is why, after sitting in the bowl a while, the instant McCann's becomes too thick and gluey."
versus a review that is considered very unhelpful.
first_neg <- first(which(labels == 0))
text[first_neg]
[1] "I read the reviews on this and thought id get some for my dog and pup.<br />They will not touch it. Even if I mix it with home cooked food.<br /><br />They like the dollar store dog food better than this.<br /><br />Amazon let me down on even allowing this dog food to be sold under their name."
Let’s get a quick assessment of word usage across the reviews:
text_df <- text %>%
tibble(.name_repair = ~ "text") %>%
mutate(text_length = str_trim(text) %>% str_count("\\w+"))
unique_words <- text_df %>%
tidytext::unnest_tokens(word, text) %>%
pull(word) %>%
n_distinct()
avg_review_length <- median(text_df$text_length, na.rm = TRUE)
ggplot(text_df, aes(text_length)) +
geom_histogram(bins = 100, fill = "grey70", color = "grey40") +
geom_vline(xintercept = avg_review_length, color = "red", lty = "dashed") +
scale_x_log10() +
ggtitle(glue("Median review length is {avg_review_length}"),
subtitle = glue("Total number of unique words is {unique_words}"))
We can explore word embeddings to get some context about the language used in the reviews.
# helper functions we'll use to explore word embeddings
source("helper_functions.R")
# clean up text and compute word embeddings
clean_text <- tolower(text) %>%
str_replace_all(pattern = "[[:punct:] ]+", replacement = " ") %>%
str_trim()
word_embeddings <- get_embeddings(clean_text)
Creating vocabulary...
Creating term-co-occurence matrix...
Computing embeddings based on GloVe algorithm...
INFO [2019-12-10 11:25:29] 2019-12-10 11:25:29 - epoch 1, expected cost 0.1420
INFO [2019-12-10 11:25:29] 2019-12-10 11:25:29 - epoch 2, expected cost 0.0971
INFO [2019-12-10 11:25:29] 2019-12-10 11:25:29 - epoch 3, expected cost 0.0851
INFO [2019-12-10 11:25:30] 2019-12-10 11:25:30 - epoch 4, expected cost 0.0777
INFO [2019-12-10 11:25:30] 2019-12-10 11:25:30 - epoch 5, expected cost 0.0724
INFO [2019-12-10 11:25:30] 2019-12-10 11:25:30 - epoch 6, expected cost 0.0687
INFO [2019-12-10 11:25:31] 2019-12-10 11:25:31 - epoch 7, expected cost 0.0658
INFO [2019-12-10 11:25:31] 2019-12-10 11:25:31 - epoch 8, expected cost 0.0636
INFO [2019-12-10 11:25:32] 2019-12-10 11:25:32 - epoch 9, expected cost 0.0617
INFO [2019-12-10 11:25:32] 2019-12-10 11:25:32 - epoch 10, expected cost 0.0602
INFO [2019-12-10 11:25:32] 2019-12-10 11:25:32 - epoch 11, expected cost 0.0589
INFO [2019-12-10 11:25:33] 2019-12-10 11:25:33 - epoch 12, expected cost 0.0577
INFO [2019-12-10 11:25:33] 2019-12-10 11:25:33 - epoch 13, expected cost 0.0568
INFO [2019-12-10 11:25:33] 2019-12-10 11:25:33 - epoch 14, expected cost 0.0559
INFO [2019-12-10 11:25:34] 2019-12-10 11:25:34 - epoch 15, expected cost 0.0552
INFO [2019-12-10 11:25:34] 2019-12-10 11:25:34 - epoch 16, expected cost 0.0545
INFO [2019-12-10 11:25:35] 2019-12-10 11:25:35 - epoch 17, expected cost 0.0539
INFO [2019-12-10 11:25:35] 2019-12-10 11:25:35 - epoch 18, expected cost 0.0534
INFO [2019-12-10 11:25:35] 2019-12-10 11:25:35 - epoch 19, expected cost 0.0529
INFO [2019-12-10 11:25:35] Success: early stopping. Improvement at iterartion 19 is less then convergence_tol
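In case you’re curious what get_embeddings() is doing behind the scenes, here is a rough sketch of how such a helper could be written with the text2vec package’s GloVe implementation. This is only an illustration; the actual code in helper_functions.R may differ.
# Sketch of a get_embeddings()-style helper using text2vec's GloVe
# (illustrative only -- the real helper_functions.R may differ)
library(text2vec)
get_embeddings_sketch <- function(text, dim = 50, window = 5, n_iter = 20) {
  # tokenize the cleaned reviews and build a pruned vocabulary
  it <- itoken(text, tokenizer = word_tokenizer, progressbar = FALSE)
  vocab <- prune_vocabulary(create_vocabulary(it), term_count_min = 5)
  # term-co-occurrence matrix within a skip-gram window
  vectorizer <- vocab_vectorizer(vocab)
  tcm <- create_tcm(it, vectorizer, skip_grams_window = window)
  # fit GloVe; note that older text2vec versions use word_vectors_size =
  # and a vocabulary = argument instead of rank =
  glove <- GlobalVectors$new(rank = dim, x_max = 10)
  wv_main <- glove$fit_transform(tcm, n_iter = n_iter, convergence_tol = 0.01)
  # combine the main and context vectors into the final word embeddings
  wv_main + t(glove$components)
}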
Explore your own words!
# find words with similar embeddings
get_similar_words("oil", word_embeddings)
oil coconut olive honey vegetable
1.0000000 0.8510423 0.8478722 0.6822951 0.6779796
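Under the hood, get_similar_words() most likely just computes the cosine similarity between the chosen word’s vector and every other word’s vector. A minimal sketch, assuming the embeddings matrix has words as row names (the real helper may differ):
# Possible sketch of get_similar_words(): cosine similarity between one
# word's embedding and all others (hypothetical implementation)
get_similar_words_sketch <- function(word, embeddings, n = 5) {
  target <- embeddings[word, , drop = FALSE]
  sims <- text2vec::sim2(embeddings, target, method = "cosine", norm = "l2")
  head(sort(sims[, 1], decreasing = TRUE), n)
}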
Our labels are already a tensor (vector) so we don’t need to do any additional prep.
str(labels)
num [1:24982] 1 1 1 0.895 0.3 ...
However, we need to preprocess our text. First, let’s decide on two key parameters to use when preprocessing: the number of most frequent words to retain (top_n_words) and the maximum length of each review in words (max_len).
These are hyperparameters you can come back to and tune later.
top_n_words <- 20000
max_len <- 200
Next, you need to create a tokenizer and fit it to the text.
tokenizer <- text_tokenizer(num_words = top_n_words) %>%
fit_text_tokenizer(text)
names(tokenizer)
[1] "char_level" "document_count"
[3] "filters" "fit_on_sequences"
[5] "fit_on_texts" "get_config"
[7] "index_docs" "index_word"
[9] "lower" "num_words"
[11] "oov_token" "sequences_to_matrix"
[13] "sequences_to_texts" "sequences_to_texts_generator"
[15] "split" "texts_to_matrix"
[17] "texts_to_sequences" "texts_to_sequences_generator"
[19] "to_json" "word_counts"
[21] "word_docs" "word_index"
Now, convert your text to numerically encoded sequences.
sequences <- texts_to_sequences(tokenizer, text)
# The vectorized first instance:
sequences[[1]]
[1] 3441 798 716 9 63 29 15 420 20 55 716 17 40
[14] 62 4852 841 110 26 227 364 5 1665 8 72 9 49
[27] 15978 1 351 194 13 83 1 111 798 716 9 2730 1126
[40] 21 41 21 83 4 174 183 7 716 5144 3948 2292 138
[53] 1 3441 9 21 41 21 8 707 12 798 716 43 83
[66] 106 57 1 145 35 158 294 2 20 107 35 1 898
[79] 10 1 3441 427 248 51 41 8 40 27 1450 10 1
[92] 949 26 67 735 1569 82 25 8 9 1034 10 1 3242
[105] 45 89 9 58 875 6 6 3441 76 7 965 1498 103
[118] 328 7 177 767 259 357 1077 42 1692 5 94 11 31
[131] 200 103 168 106 3 9 18 21 2138 21 1 61 173
[144] 33 196 2 78 18 28 250 9 3441 76 7 12410 930
[157] 484 82 484 618 163 90 4 997 407 716 175 1 207
[170] 12 4660 809 17 11 9 4 1434 31 359 1 4660 809
[183] 9 251 85 1001 10 1 513 4 190 1 798 3441 2123
[196] 93 832 3 14552
Run this code chunk to see how your text has been converted:
cat(crayon::blue("Original text:\n"))
Original text:
text[[1]]
[1] "McCann's Instant Oatmeal is great if you must have your oatmeal but can only scrape together two or three minutes to prepare it. There is no escaping the fact, however, that even the best instant oatmeal is nowhere near as good as even a store brand of oatmeal requiring stovetop preparation. Still, the McCann's is as good as it gets for instant oatmeal. It's even better than the organic, all-natural brands I have tried. All the varieties in the McCann's variety pack taste good. It can be prepared in the microwave or by adding boiling water so it is convenient in the extreme when time is an issue.<br /><br />McCann's use of actual cane sugar instead of high fructose corn syrup helped me decide to buy this product. Real sugar tastes better and is not as harmful as the other stuff. One thing I do not like, though, is McCann's use of thickeners. Oats plus water plus heat should make a creamy, tasty oatmeal without the need for guar gum. But this is a convenience product. Maybe the guar gum is why, after sitting in the bowl a while, the instant McCann's becomes too thick and gluey."
cat(crayon::blue("\nRevised text:\n"))
Revised text:
paste(unlist(tokenizer$index_word)[sequences[[1]]] , collapse = " ")
[1] "mccann's instant oatmeal is great if you must have your oatmeal but can only scrape together two or three minutes to prepare it there is no escaping the fact however that even the best instant oatmeal is nowhere near as good as even a store brand of oatmeal requiring stovetop preparation still the mccann's is as good as it gets for instant oatmeal it's even better than the organic all natural brands i have tried all the varieties in the mccann's variety pack taste good it can be prepared in the microwave or by adding boiling water so it is convenient in the extreme when time is an issue br br mccann's use of actual cane sugar instead of high fructose corn syrup helped me decide to buy this product real sugar tastes better and is not as harmful as the other stuff one thing i do not like though is mccann's use of thickeners oats plus water plus heat should make a creamy tasty oatmeal without the need for guar gum but this is a convenience product maybe the guar gum is why after sitting in the bowl a while the instant mccann's becomes too thick and gluey"
Last, we want to make sure our sequences (i.e. each processed review) are all of equal length.
features <- pad_sequences(sequences, maxlen = max_len)
expect_equal(ncol(features), max_len)
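To see exactly what padding does, here is a toy illustration using the keras defaults (padding = "pre", truncating = "pre"): shorter sequences get zeros added to the front and longer sequences are trimmed from the front.
# toy example: pad/truncate two sequences to a length of 4
toy <- list(c(5, 8, 2), c(1, 2, 3, 4, 5, 6))
pad_sequences(toy, maxlen = 4)
#      [,1] [,2] [,3] [,4]
# [1,]    0    5    8    2
# [2,]    3    4    5    6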
Make sure that the number of observations in your features and labels are equal:
expect_equal(nrow(features), length(labels))
Before we train our model, let’s go ahead and randomize our review data so that our training and validation data properly represent a mixture of products and users.
set.seed(123)
index <- sample(1:nrow(features))
split_point <- floor(length(index) * .3)
train_index <- index[1:split_point]
valid_index <- index[(split_point + 1):length(index)]
expect_equal(length(train_index) + length(valid_index), length(index))
x_train <- features[train_index, ]
y_train <- labels[train_index]
x_valid <- features[valid_index, ]
y_valid <- labels[valid_index]
Ok, so before we train our model, let’s get an understanding of a baseline loss score that we want to beat. The easiest baseline is to simply predict the average of the training labels for all future observations.
avg <- mean(y_train)
baseline_mse <- mean((y_valid - avg)^2)
cat("Simply predicting the average helpfulness score of", round(avg, 2),
"for every review would give us a loss score of", round(baseline_mse, 3))
Simply predicting the average helpfulness score of 0.76 for every review would give us a loss score of 0.085
Ok, time to build your model architecture and compile it. This is a regression problem, but since we want to keep our predicted values bounded between 0 and 1, I create a custom “clipped” MSE metric and use it as the loss. This is not something I expected you to do, but I wanted to illustrate it here.
model <- keras_model_sequential() %>%
layer_embedding(input_dim = top_n_words,
output_dim = 32,
input_length = max_len) %>%
layer_flatten() %>%
layer_dense(units = 32, activation = "relu") %>%
layer_dropout(0.5) %>%
layer_dense(units = 1)
# create metric using backend tensor functions
metric_clipped_mse <- custom_metric("metric_capped_mse", function(y_true, y_pred) {
y_pred <- k_clip(y_pred, 0, 1)
k_mean(k_square(y_pred - y_true))
})
model %>% compile(
optimizer = optimizer_rmsprop(lr = 0.0001),
loss = metric_clipped_mse,
metrics = c("mse", "mae")
)
summary(model)
Model: "sequential_1"
____________________________________________________________________________________
Layer (type) Output Shape Param #
====================================================================================
embedding_1 (Embedding) (None, 200, 32) 640000
____________________________________________________________________________________
flatten_1 (Flatten) (None, 6400) 0
____________________________________________________________________________________
dense_2 (Dense) (None, 32) 204832
____________________________________________________________________________________
dropout_1 (Dropout) (None, 32) 0
____________________________________________________________________________________
dense_3 (Dense) (None, 1) 33
====================================================================================
Total params: 844,865
Trainable params: 844,865
Non-trainable params: 0
____________________________________________________________________________________
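As a quick sanity check, here is where those parameter counts come from:
20000 * 32       # embedding: 20,000 words x 32 dimensions = 640,000
6400 * 32 + 32   # dense_2: 6,400 flattened inputs x 32 units + 32 biases = 204,832
32 * 1 + 1       # dense_3: 32 inputs x 1 unit + 1 bias = 33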
Let’s train our model:
history <- model %>% fit(
x_train, y_train,
epochs = 50,
batch_size = 128,
validation_data = list(x_valid, y_valid),
callbacks = list(
callback_reduce_lr_on_plateau(patience = 3),
callback_early_stopping(patience = 10, restore_best_weights = TRUE)
)
)
In this example we see that our model’s optimal validation loss was 0.057, which is 33% lower than the baseline loss.
opt_mse <- min(history$metrics$val_loss)
glue("Baseline loss score: {round(baseline_mse, 3)}")
Baseline loss score: 0.085
glue("Model loss score: {round(opt_mse, 3)}")
Model loss score: 0.057
The following plots the actual “helpfulness” value for each observation (dots) compared to our model’s predicted value (red line) and the average training helpfulness used as our baseline (blue dashed line). We see that our model is picking up some signal and is a better representation than the average value.
tibble(
actual = y_valid,
pred = model %>% predict(x_valid) %>% as.vector()
) %>%
mutate(pred = case_when(
pred < 0 ~ 0,
pred > 1 ~ 1,
TRUE ~ pred
)) %>%
arrange(pred) %>%
mutate(id = row_number()) %>%
ggplot(aes(x = id)) +
geom_point(aes(y = actual), size = 1, alpha = 0.2) +
geom_line(aes(y = pred), color = "red", size = 1) +
geom_hline(yintercept = avg, lty = "dashed", color = "blue")
You may be wondering how I chose my hyperparameters (i.e. embedding layer output dimension, number of hidden layers and units, learning rate). Well, unlike you, I had the luxury of time to run a grid search. You can check it out here.
However, realize this is not necessarily an optimized solution, so you can continue to improve upon it!
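If you want to set up a grid search of your own, a rough sketch could look something like the following. This is purely illustrative (it is not the exact grid I ran): each hyperparameter combination trains a small model and records its best validation loss.
# illustrative grid search over a few hyperparameters (hypothetical grid)
grid <- expand.grid(
  output_dim = c(16, 32, 64),
  units = c(16, 32, 64),
  lr = c(0.001, 0.0001)
)
results <- purrr::pmap_dfr(grid, function(output_dim, units, lr) {
  model <- keras_model_sequential() %>%
    layer_embedding(input_dim = top_n_words, output_dim = output_dim,
                    input_length = max_len) %>%
    layer_flatten() %>%
    layer_dense(units = units, activation = "relu") %>%
    layer_dropout(0.5) %>%
    layer_dense(units = 1)
  model %>% compile(optimizer = optimizer_rmsprop(lr = lr), loss = "mse")
  history <- model %>% fit(
    x_train, y_train,
    epochs = 20, batch_size = 128,
    validation_data = list(x_valid, y_valid),
    callbacks = list(callback_early_stopping(patience = 5)),
    verbose = 0
  )
  tibble(output_dim = output_dim, units = units, lr = lr,
         val_loss = min(history$metrics$val_loss))
})
arrange(results, val_loss)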