This project is designed to test your knowledge of applying word embeddings to the Amazon Fine Foods reviews dataset, available through Stanford. The dataset contains 568,454 reviews of 74,258 products.
Your goal is to develop a word embedding model to accurately predict how helpful a review will be. I supply code to help you get the data imported and prepped so that you can focus on the modeling aspect.
Good luck!
Requirements
library(keras) # provides deep learning procedures
library(tidyverse) # provides basic data wrangling and visualization
library(glue) # provides efficient print statements
library(testthat) # provides unit testing
Data importing
The finefoods.txt.gz file has already been downloaded and unzipped for you. All reviews are contained in a single .txt file.
# get path to data
if (stringr::str_detect(here::here(), "conf-2020-user")) {
  amazon_reviews <- "/home/conf-2020-user/data/amazon-food/finefoods.txt"
} else {
  amazon_reviews <- here::here("materials", "data", "amazon-food", "finefoods.txt")
}
reviews <- read_lines(amazon_reviews)
Each review consists of 8 items and each item is on its own line. The following shows all information collected for the first review.
head(reviews, 8)
Verify that we imported properly
Based on the data’s website, we should have the following:
- Number of reviews: 568,454
- Number of products: 74,258
- Number of users: 256,059
review_text <- reviews[str_detect(reviews, "review/text:")]
products <- reviews[str_detect(reviews, "product/productId:")]
users <- reviews[str_detect(reviews, "review/userId:")]
n_reviews <- length(review_text)
n_products <- n_distinct(products)
n_users <- n_distinct(users)
# Verify our imported data aligns with data codebook
expect_equal(n_reviews, 568454)
expect_equal(n_products, 74258)
expect_equal(n_users, 256059)
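The later sections use two objects, text and labels, that are not created in the code shown so far. The sketch below is one way they could be derived from reviews; it is an assumption, not necessarily the prep code supplied with this project. The helper extract_field() is hypothetical, the demo lines are made up to mimic the "key: value" layout, and defining the label as helpful votes divided by total votes is also an assumed choice.

```r
# Hypothetical helper: keep the lines that start with `key`, then strip the key.
extract_field <- function(lines, key) {
  hits <- lines[startsWith(lines, key)]
  trimws(substring(hits, nchar(key) + 1))
}

# Made-up lines in the same "key: value" layout as finefoods.txt
demo_lines <- c(
  "review/helpfulness: 3/4",
  "review/text: Great taste, would buy again.",
  "review/helpfulness: 0/0",
  "review/text: Arrived stale."
)

demo_text <- extract_field(demo_lines, "review/text:")
votes     <- extract_field(demo_lines, "review/helpfulness:")

# Split "helpful/total" into two integer vectors
helpful <- as.integer(sub("/.*$", "", votes))
total   <- as.integer(sub("^.*/", "", votes))

# Assumed label: share of helpful votes; NA when a review has no votes
demo_labels <- ifelse(total > 0, helpful / total, NA_real_)
demo_labels
```

If you follow this approach on the full data, reviews with an NA label would need to be dropped from both the text and the labels so the two stay aligned.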
Explore GloVe Embeddings
We can explore word embeddings that give us some context of the review language.
# helper functions we'll use to explore word embeddings
source("helper_functions.R")
# clean up text and compute word embeddings
clean_text <- tolower(text) %>%
  str_replace_all(pattern = "[[:punct:] ]+", replacement = " ") %>%
  str_trim()
word_embeddings <- get_embeddings(clean_text)
Explore your own words!
# find words with similar embeddings
get_similar_words("oil", word_embeddings)
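The contents of helper_functions.R are not shown here, so the following is not its implementation. It is only a minimal sketch of a common way to find words with similar embeddings: rank every word by cosine similarity to the target word. The function cosine_neighbors() and the toy matrix are hypothetical, and the numbers are invented for illustration.

```r
# Toy embedding matrix: one row per word, one column per dimension.
# The numbers are invented for illustration only.
toy_embeddings <- rbind(
  oil    = c(0.9, 0.1, 0.0),
  olive  = c(0.8, 0.2, 0.1),
  coffee = c(0.0, 0.9, 0.4)
)

# Hypothetical stand-in for a nearest-word lookup based on cosine similarity
cosine_neighbors <- function(word, embeddings, n = 2) {
  target <- embeddings[word, ]
  # cosine similarity = dot product divided by the product of the vector norms
  sims <- as.vector(embeddings %*% target) /
    (sqrt(rowSums(embeddings^2)) * sqrt(sum(target^2)))
  names(sims) <- rownames(embeddings)
  sims <- sims[names(sims) != word]        # drop the query word itself
  head(sort(sims, decreasing = TRUE), n)   # most similar words first
}

neighbors <- cosine_neighbors("oil", toy_embeddings)
neighbors
```

In this toy matrix "olive" points in nearly the same direction as "oil", so it ranks ahead of "coffee".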
Prepare data
Our labels are already a tensor (vector) so we don’t need to do any additional prep.
str(labels)
Preprocessing hyperparameters
However, we need to preprocess our text. First, let's decide on two key parameters to use when preprocessing our text:
- number of most frequent words used (start with 20000)
- the maximum length of our processed text (start with 200)
These are two hyperparameters you can come back to and tune later.
top_n_words <- ______
max_len <- ______
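One way to sanity-check a choice of max_len is to look at how long the reviews actually are. The sketch below uses a few made-up reviews in place of the real text vector and a rough whitespace split, so treat the numbers as illustrative only; candidate_len is a hypothetical name.

```r
# Made-up reviews standing in for the real `text` vector
demo_text <- c(
  "great coffee",
  "the chips arrived stale and the bag was torn",
  "my dog loves these treats",
  "ok"
)

# Rough word count per review, splitting on runs of whitespace
word_counts <- lengths(strsplit(demo_text, "\\s+"))

# Upper quantiles show how long a sequence must be to keep most reviews whole
quantile(word_counts, probs = c(0.5, 0.9, 0.99))

# Share of reviews that would be truncated at a candidate length
candidate_len <- 5
share_truncated <- mean(word_counts > candidate_len)
share_truncated
```

A larger max_len truncates fewer reviews but makes every padded sequence, and therefore the model input, wider.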
Preprocessing feature text
Next, you need to create and apply a tokenizer to the text.
tokenizer <- text_tokenizer(num_words = top_n_words) %>%
  fit_text_tokenizer(text)
names(tokenizer)
Now, convert your text to a numerically encoded sequence.
sequences <- texts_to_sequences(tokenizer, text)
# The vectorized first instance:
sequences[[1]]
Run this code chunk to see how your text has been converted:
cat(crayon::blue("Original text:\n"))
text[[1]]
cat(crayon::blue("\nRevised text:\n"))
paste(unlist(tokenizer$index_word)[sequences[[1]]] , collapse = " ")
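To make the encoding step less of a black box, here is a simplified base R picture of what a frequency-based tokenizer does: rank words by how often they occur, then replace each word with its rank and drop words outside the cap. This is a sketch under simplifying assumptions, not the Keras implementation; the real text_tokenizer() also filters punctuation and applies its own rule for num_words. The names max_rank and encode() are hypothetical.

```r
# Made-up reviews for illustration
demo_text <- c("good tea good price", "good tea bad")

# Split into lowercase words and count how often each word occurs
words  <- strsplit(tolower(demo_text), "\\s+")
counts <- sort(table(unlist(words)), decreasing = TRUE)

# Rank 1 is the most frequent word; 0 is left free for padding
word_index <- setNames(seq_along(counts), names(counts))

# Encode each review, dropping words whose rank exceeds the cap
max_rank <- 2
encode <- function(w) {
  ids <- unname(word_index[w])
  ids[ids <= max_rank]
}
demo_sequences <- lapply(words, encode)
demo_sequences
```

Here "good" and "tea" are the two most frequent words, so "price" and "bad" disappear from the encoded reviews, just as rare words disappear from the revised text printed above.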
Last, we want to make sure our sequences (i.e., the processed reviews) are all of equal length.
features <- pad_sequences(sequences, maxlen = max_len)
expect_equal(ncol(features), max_len)
Make sure that the number of observations in your features and labels are equal:
expect_equal(nrow(features), length(labels))
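By default pad_sequences() pads and truncates at the front of each sequence (padding = "pre", truncating = "pre"). The base R sketch below mimics that behavior on two made-up sequences so you can see what happens to short and long reviews; pad_left() is a hypothetical helper, not part of keras.

```r
# Hypothetical helper that mimics pre-padding and pre-truncation
pad_left <- function(seq_list, maxlen) {
  t(vapply(seq_list, function(s) {
    s <- tail(s, maxlen)                # keep only the last `maxlen` tokens
    c(rep(0L, maxlen - length(s)), s)   # fill the front with zeros
  }, integer(maxlen)))
}

# Made-up encoded reviews: one shorter and one longer than maxlen
demo_sequences <- list(c(5L, 2L), c(7L, 1L, 4L, 9L, 3L))
padded <- pad_left(demo_sequences, maxlen = 4)
padded
```

The short review gains leading zeros and the long review loses its first token, so with these defaults it is the end of a long review that is kept.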
Model training
Before we train our model, let’s go ahead and randomize our review data so that our training and validation data properly represent a mixture of products and users.
set.seed(123)
index <- sample(1:nrow(features))
# the first 30% of the shuffled rows are used for training, the rest for validation
split_point <- floor(length(index) * .3)
train_index <- index[1:split_point]
valid_index <- index[(split_point + 1):length(index)]
expect_equal(length(train_index) + length(valid_index), length(index))
x_train <- features[train_index, ]
y_train <- labels[train_index]
x_valid <- features[valid_index, ]
y_valid <- labels[valid_index]
Ok, so before we train our model, let’s get an understanding of a baseline loss score that we want to beat. The easiest baseline is to just predict the average of the training labels for all future observations.
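Why the average? Among all constant predictions, the mean is the one that minimizes mean squared error on the data it was computed from. The sketch below checks this numerically on a few made-up scores; mse_for_constant() is a hypothetical helper and the values are not taken from the reviews.

```r
# Made-up helpfulness scores for illustration
y <- c(0.2, 0.5, 0.9, 1.0)

# MSE when every observation receives the same constant prediction
mse_for_constant <- function(guess, y) mean((y - guess)^2)

# Evaluate a grid of constant predictions between 0 and 1
grid   <- seq(0, 1, by = 0.01)
losses <- vapply(grid, mse_for_constant, numeric(1), y = y)

best_guess <- grid[which.min(losses)]
best_guess   # constant with the lowest MSE on the grid
mean(y)      # the sample mean
```

Any model worth keeping should reach a validation loss below what this constant prediction achieves.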
avg <- mean(y_train)
baseline_mse <- mean((y_valid - avg)^2)
cat("Simply predicting the average helpfulness score of", round(avg, 2),
    "for every review would give us a loss score of", round(baseline_mse, 3))
Ok, time to build your model architecture and compile it. Fill in the modeling blanks and consider the following:
- Your word embedding input_dim was already established with top_n_words
  - Ref: the Preprocessing hyperparameters section
  - feel free to change this value and see how it impacts performance
- Your word embedding input_length was already established with max_len
  - Ref: the Preprocessing hyperparameters section
  - feel free to change this value and see how it impacts performance
- Try out different output_dim values for the word embeddings
  - typical values: powers of 2, such as 16, 32, 64, 128, 256
- Feel free to add additional hidden layers and dropout layers to the densely connected classifier.
model <- keras_model_sequential() %>%
  layer_embedding(input_dim = _____,
                  output_dim = _____,
                  input_length = _____) %>%
  layer_flatten() %>%
  layer_dense(units = 1)
model %>% compile(
  optimizer = _____,
  loss = "mse",
  metrics = _____
)
summary(model)
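When reading the summary, it helps to know where the parameter counts come from. The arithmetic below is a sketch for illustrative values only: 20000 and 200 are the suggested starting points from the preprocessing step, while the output_dim of 32 is an arbitrary pick from the typical values, not a recommendation.

```r
# Illustrative values; output_dim = 32 is an arbitrary choice
input_dim    <- 20000
input_length <- 200
output_dim   <- 32

embedding_params <- input_dim * output_dim      # one vector per word in the vocabulary
flattened_width  <- input_length * output_dim   # length of each review after layer_flatten()
dense_params     <- flattened_width * 1 + 1     # one weight per input plus one bias

c(embedding = embedding_params,
  dense     = dense_params,
  total     = embedding_params + dense_params)
```

With these values nearly all of the parameters sit in the embedding layer, which is why top_n_words and output_dim have such a large effect on model size.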
Let’s train our model:
history <- model %>% fit(
  x_train, y_train,
  epochs = _____,
  batch_size = _____,
  validation_data = list(x_valid, y_valid),
  callbacks = list(
    callback_reduce_lr_on_plateau(patience = _____),
    callback_early_stopping(patience = _____, restore_best_weights = TRUE)
  )
)
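To get a feel for how the patience argument behaves before picking a value, the sketch below replays patience-based early stopping on a made-up sequence of validation losses. It is a simplified picture, not the Keras implementation (the real callback has further options such as min_delta), and simulate_early_stopping() is a hypothetical helper.

```r
# Hypothetical helper: returns the epoch at which training would stop and
# the epoch whose weights would be kept when restoring the best weights.
simulate_early_stopping <- function(val_loss, patience) {
  best <- Inf
  best_epoch <- 0
  wait <- 0
  for (epoch in seq_along(val_loss)) {
    if (val_loss[epoch] < best) {
      best <- val_loss[epoch]   # new best loss: remember it and reset the counter
      best_epoch <- epoch
      wait <- 0
    } else {
      wait <- wait + 1          # another epoch without improvement
      if (wait >= patience) break
    }
  }
  list(stopped_at = epoch, best_epoch = best_epoch)
}

# Made-up validation losses
res <- simulate_early_stopping(c(1.00, 0.80, 0.70, 0.75, 0.72, 0.71, 0.90),
                               patience = 2)
res
```

In this example the loss stops improving after epoch 3, so with a patience of 2 training halts at epoch 5 and the epoch 3 weights are the ones kept.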
Let’s compare the optimal loss score versus the baseline loss score.
opt_mse <- min(history$metrics$val_loss)
glue("Baseline loss score: {round(baseline_mse, 3)}")
glue("Model loss score: {round(opt_mse, 3)}")
