In this example, we are going to learn about an alternative method for encoding text data known as word embeddings. This is not a complete treatment of word embeddings, but it will give you a basic understanding of when, why, and how we use them.
Learning objectives:
# Initialize packages
library(keras)
library(fs)
library(tidyverse)
library(glue)
library(progress)
# helper functions we'll use to explore word embeddings
source("helper_functions.R")
Keras provides a built-in IMDB dataset, dataset_imdb(), which contains the text of 25,000 movie reviews that have been classified as net positive or net negative. However, we are going to use the original IMDB movie review files, which can be found at http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz. Working with the raw files tends to help students better understand the full data prep required for text.
You can find the download instructions here. For those in the workshop we have already downloaded this data for you.
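As a point of reference (we won't use it here), the built-in version can be loaded in a couple of lines; note that it arrives already tokenized as integer sequences rather than raw text, which is exactly the prep work this module walks through by hand. The object names below are just illustrative:
# built-in IMDB data: already converted to integer word-index sequences
imdb <- dataset_imdb(num_words = 10000)
c(c(x_train_builtin, y_train_builtin), c(x_test_builtin, y_test_builtin)) %<-% imdb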
if (stringr::str_detect(here::here(), "conf-2020-user")) {
imdb_dir <- "/home/conf-2020-user/data/imdb"
} else {
imdb_dir <- here::here("materials", "data", "imdb")
}
fs::dir_tree(imdb_dir, type = "directory")
/Users/b294776/Desktop/Workspace/Training/rstudio-conf-2020/dl-keras-tf/materials/data/imdb
├── test
│   ├── neg
│   └── pos
└── train
    ├── neg
    └── pos
You can see the data have already been separated into test vs. training sets and positive vs. negative sets. The actual reviews are contained in individual .txt files. We can use this directory structure to our advantage: the code below builds a data frame that records each review's file path along with its label (taken from the folder it sits in).
training_files <- file.path(imdb_dir, "train") %>%
dir_ls(type = "directory") %>%
map(dir_ls) %>%
set_names(basename) %>%
plyr::ldply(data_frame) %>%
set_names(c("label", "path"))
training_files
We can see our response observations are balanced:
count(training_files, label)
We can now iterate over each row, saving the response to a labels vector and the review text to a texts vector.
obs <- nrow(training_files)
labels <- vector(mode = "integer", length = obs)
texts <- vector(mode = "character", length = obs)
# this just allows us to track progress of our loop
pb <- progress_bar$new(total = obs, width = 60)
for (file in seq_len(obs)) {
pb$tick()
label <- training_files[[file, "label"]]
path <- training_files[[file, "path"]]
labels[file] <- ifelse(label == "neg", 0, 1)
texts[file] <- readChar(path, nchars = file.size(path))
}
We now have two vectors, one consisting of the labels…
table(labels)
labels
0 1
12500 12500
and the other holding each review.
texts[1]
[1] "Story of a man who has unnatural feelings for a pig. Starts out with a opening scene that is a terrific example of absurd comedy. A formal orchestra audience is turned into an insane, violent mob by the crazy chantings of it's singers. Unfortunately it stays absurd the WHOLE time with no general narrative eventually making it just too off putting. Even those from the era should be turned off. The cryptic dialogue would make Shakespeare seem easy to a third grader. On a technical level it's better than you might think with some good cinematography by future great Vilmos Zsigmond. Future stars Sally Kirkland and Frederic Forrest can be seen briefly."
A little exploratory analysis will show us the total number of unique words across our corpus and the typical length of each review. It's good to know the word-count distribution of your text because later on we'll need to decide how many words to keep per review.
text_df <- texts %>%
tibble(.name_repair = ~ "text") %>%
mutate(text_length = str_count(text, "\\w+"))
unique_words <- text_df %>%
tidytext::unnest_tokens(word, text) %>%
pull(word) %>%
n_distinct()
avg_review_length <- median(text_df$text_length, na.rm = TRUE)
ggplot(text_df, aes(text_length)) +
geom_histogram(bins = 100, fill = "grey70", color = "grey40") +
geom_vline(xintercept = avg_review_length, color = "red", lty = "dashed") +
scale_x_log10("# words") +
ggtitle(glue("Median review length is {avg_review_length} words"),
subtitle = glue("Total number of unique words is {unique_words}"))
Word embeddings are designed to encode general semantic relationships, which can serve two principal purposes. The first is language modeling, which aims to encode words for the purpose of predicting synonyms, sentence completion, and word relationships. ℹ️
Although we are not focusing on word embeddings for this purpose, I have written a couple of helper functions that train word embeddings for language modeling. See the code behind these helper functions here.
# clean up text and compute word embeddings
clean_text <- tolower(texts) %>%
str_replace_all(pattern = "[[:punct:] ]+", replacement = " ") %>%
str_trim()
word_embeddings <- get_embeddings(clean_text)
Creating vocabulary...
Creating term-co-occurence matrix...
Computing embeddings based on GloVe algorithm...
INFO [2019-12-09 09:14:04] 2019-12-09 09:14:04 - epoch 1, expected cost 0.0821
INFO [2019-12-09 09:14:05] 2019-12-09 09:14:05 - epoch 2, expected cost 0.0555
INFO [2019-12-09 09:14:06] 2019-12-09 09:14:06 - epoch 3, expected cost 0.0485
INFO [2019-12-09 09:14:07] 2019-12-09 09:14:07 - epoch 4, expected cost 0.0443
INFO [2019-12-09 09:14:08] 2019-12-09 09:14:08 - epoch 5, expected cost 0.0415
INFO [2019-12-09 09:14:09] 2019-12-09 09:14:09 - epoch 6, expected cost 0.0395
INFO [2019-12-09 09:14:11] 2019-12-09 09:14:11 - epoch 7, expected cost 0.0379
INFO [2019-12-09 09:14:12] 2019-12-09 09:14:12 - epoch 8, expected cost 0.0367
INFO [2019-12-09 09:14:13] 2019-12-09 09:14:13 - epoch 9, expected cost 0.0357
INFO [2019-12-09 09:14:14] 2019-12-09 09:14:14 - epoch 10, expected cost 0.0348
INFO [2019-12-09 09:14:15] 2019-12-09 09:14:15 - epoch 11, expected cost 0.0341
INFO [2019-12-09 09:14:16] 2019-12-09 09:14:16 - epoch 12, expected cost 0.0335
INFO [2019-12-09 09:14:17] 2019-12-09 09:14:17 - epoch 13, expected cost 0.0330
INFO [2019-12-09 09:14:18] 2019-12-09 09:14:18 - epoch 14, expected cost 0.0326
INFO [2019-12-09 09:14:19] 2019-12-09 09:14:19 - epoch 15, expected cost 0.0322
INFO [2019-12-09 09:14:20] 2019-12-09 09:14:20 - epoch 16, expected cost 0.0318
INFO [2019-12-09 09:14:22] 2019-12-09 09:14:22 - epoch 17, expected cost 0.0315
INFO [2019-12-09 09:14:23] 2019-12-09 09:14:23 - epoch 18, expected cost 0.0312
INFO [2019-12-09 09:14:23] Success: early stopping. Improvement at iterartion 18 is less then convergence_tol
Explore your own words!
# find words with similar embeddings
get_similar_words("horrible", word_embeddings)
horrible terrible awful bad acting
1.0000000 0.9132471 0.8663343 0.8041792 0.7790165
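You might try a few other sentiment-laden words of your own; the exact neighbors will depend on the embeddings you just trained:
get_similar_words("excellent", word_embeddings)
get_similar_words("boring", word_embeddings)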
The other principal purpose for word embeddings is to encode text for classification. In this case, we train the word embeddings to take on weights that optimize the classification loss function. ℹ️
Our response variable labels is already a tensor; however, we still need to preprocess our text features. To do so we:
1. Create a text_tokenizer object, which defines how we want to preprocess the text (i.e. convert to lowercase, remove punctuation, token splitting characters). For the most part, the defaults are sufficient.
2. Apply fit_text_tokenizer to our text. This results in an object with many details of our corpus (i.e. word counts, word index).
top_n_words <- 10000
tokenizer <- text_tokenizer(num_words = top_n_words) %>%
fit_text_tokenizer(texts)
names(tokenizer)
[1] "char_level" "document_count" "filters"
[4] "fit_on_sequences" "fit_on_texts" "get_config"
[7] "index_docs" "index_word" "lower"
[10] "num_words" "oov_token" "sequences_to_matrix"
[13] "sequences_to_texts" "sequences_to_texts_generator" "split"
[16] "texts_to_matrix" "texts_to_sequences" "texts_to_sequences_generator"
[19] "to_json" "word_counts" "word_docs"
[22] "word_index"
We have now tokenized our reviews. We are considering 10,000 of 88,582 total unique words. The most common words include:
head(tokenizer$word_index)
$the
[1] 1
$and
[1] 2
$a
[1] 3
$of
[1] 4
$to
[1] 5
$is
[1] 6
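The 88,582 total comes from the tokenizer's own word index; if you want to verify it (or look at raw frequencies), a couple of quick checks:
# total number of unique words found across the corpus
length(tokenizer$word_index)

# raw counts for a handful of words
head(tokenizer$word_counts)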
Next, we extract our vectorized review data as a list. Each review is encoded as a sequence of word indexes (integers).
sequences <- texts_to_sequences(tokenizer, texts)
# The vectorized first instance:
sequences[[1]]
[1] 62 4 3 129 34 44 7576 1414 15 3 4252 514 43 16 3 633
[17] 133 12 6 3 1301 459 4 1751 209 3 7693 308 6 676 80 32
[33] 2137 1110 3008 31 1 929 4 42 5120 469 9 2665 1751 1 223 55
[49] 16 54 828 1318 847 228 9 40 96 122 1484 57 145 36 1 996
[65] 141 27 676 122 1 411 59 94 2278 303 772 5 3 837 20 3
[81] 1755 646 42 125 71 22 235 101 16 46 49 624 31 702 84 702
[97] 378 3493 2 8422 67 27 107 3348
We can map the integer values back to the word index: each integer corresponds to a word's position in the frequency-ranked word index, and the name of that element is the actual word.
paste(unlist(tokenizer$index_word)[sequences[[1]]] , collapse = " ")
[1] "story of a man who has unnatural feelings for a pig starts out with a opening scene that is a terrific example of absurd comedy a orchestra audience is turned into an insane violent mob by the crazy of it's singers unfortunately it stays absurd the whole time with no general narrative eventually making it just too off putting even those from the era should be turned off the dialogue would make shakespeare seem easy to a third on a technical level it's better than you might think with some good cinematography by future great future stars sally and forrest can be seen briefly"
We can see how our tokenizer converted our original text to a cleaned up version:
cat("Original text:\n")
Original text:
texts[[1]]
[1] "Story of a man who has unnatural feelings for a pig. Starts out with a opening scene that is a terrific example of absurd comedy. A formal orchestra audience is turned into an insane, violent mob by the crazy chantings of it's singers. Unfortunately it stays absurd the WHOLE time with no general narrative eventually making it just too off putting. Even those from the era should be turned off. The cryptic dialogue would make Shakespeare seem easy to a third grader. On a technical level it's better than you might think with some good cinematography by future great Vilmos Zsigmond. Future stars Sally Kirkland and Frederic Forrest can be seen briefly."
cat("\nRevised text:\n")
Revised text:
paste(unlist(tokenizer$index_word)[sequences[[1]]] , collapse = " ")
[1] "story of a man who has unnatural feelings for a pig starts out with a opening scene that is a terrific example of absurd comedy a orchestra audience is turned into an insane violent mob by the crazy of it's singers unfortunately it stays absurd the whole time with no general narrative eventually making it just too off putting even those from the era should be turned off the dialogue would make shakespeare seem easy to a third on a technical level it's better than you might think with some good cinematography by future great future stars sally and forrest can be seen briefly"
Next, since each review is a different length, we need to limit ourselves to a certain number of words so that all our features (reviews) are the same length. This should be viewed as a tuning parameter.
Tip: I typically start with a value around the 50th percentile (median) but then explore values around the 25th and 75th percentiles of the word-count distribution when tuning.
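Those candidate values are easy to pull from the text_length column we computed earlier; a quick sketch:
# 25th, 50th, and 75th percentiles of review length (in words)
quantile(text_df$text_length, probs = c(0.25, 0.50, 0.75), na.rm = TRUE)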
Note (?pad_sequences): by default, sequences longer than maxlen are truncated and shorter sequences are padded with zeros, both at the beginning ("pre") of the sequence; this can be changed with the padding and truncating arguments.
max_len <- 150
features <- pad_sequences(sequences, maxlen = max_len)
Since this review includes fewer than 150 words from our word index of the 10K most frequent words, pad_sequences() pads the front end with zeros (see ?pad_sequences for alternative padding options).
features[1,]
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[17] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[33] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 62 4
[49] 3 129 34 44 7576 1414 15 3 4252 514 43 16 3 633 133 12
[65] 6 3 1301 459 4 1751 209 3 7693 308 6 676 80 32 2137 1110
[81] 3008 31 1 929 4 42 5120 469 9 2665 1751 1 223 55 16 54
[97] 828 1318 847 228 9 40 96 122 1484 57 145 36 1 996 141 27
[113] 676 122 1 411 59 94 2278 303 772 5 3 837 20 3 1755 646
[129] 42 125 71 22 235 101 16 46 49 624 31 702 84 702 378 3493
[145] 2 8422 67 27 107 3348
So, in essence, we have created an input that is a numeric representation of this:
features[1,] %>%
map_chr(~ ifelse(.x == 0, "<pad>", unlist(tokenizer$index_word[.x]))) %>%
cat()
<pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> story of a man who has unnatural feelings for a pig starts out with a opening scene that is a terrific example of absurd comedy a orchestra audience is turned into an insane violent mob by the crazy of it's singers unfortunately it stays absurd the whole time with no general narrative eventually making it just too off putting even those from the era should be turned off the dialogue would make shakespeare seem easy to a third on a technical level it's better than you might think with some good cinematography by future great future stars sally and forrest can be seen briefly
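If you would rather pad and truncate at the end of each review, pad_sequences() accepts padding and truncating arguments; a sketch (we stick with the defaults in this module):
# pad/truncate at the end ("post") instead of the beginning ("pre")
features_post <- pad_sequences(
  sequences,
  maxlen     = max_len,
  padding    = "post",
  truncating = "post"
)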
Check out different reviews and see how we have transformed the data. Remove eval=FALSE
to run.
# use review number (i.e. 2, 10, 150)
which_review <- ____
cat(crayon::blue("Original text:\n"))
texts[[which_review ]]
cat(crayon::blue("\nRevised text:\n"))
paste(unlist(tokenizer$index_word)[features[which_review ,]] , collapse = " ")
cat(crayon::blue("\nEncoded text:\n"))
features[which_review,] %>%
map_chr(~ ifelse(.x == 0, "<pad>", unlist(tokenizer$index_word[.x]))) %>%
cat()
Our data is now preprocessed! We have 25,000 observations and 150 features. Our features data is a matrix where each row is a single observation and each column represents a word position in the review, in the order the words appear.
dim(features)
[1] 25000 150
length(labels)
[1] 25000
To train our model we will use the validation_split argument within fit(). Remember, this takes the last XX% of our data to be used as our validation set. But if you recall, our data was organized in neg and pos folders, so we should randomize (shuffle) the rows to make sure our validation set doesn't end up being all positive or all negative reviews!
set.seed(123)
index <- sample(1:nrow(features))
x_train <- features[index, ]
y_train <- labels[index]
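Because fit() carves the validation set off the end of the data we pass it, a quick sanity check that the shuffle left the tail roughly balanced doesn't hurt; a sketch assuming validation_split = 0.2 as used below:
# proportion of positive reviews in the final 20% (the future validation slice)
val_idx <- (floor(0.8 * length(y_train)) + 1):length(y_train)
mean(y_train[val_idx])   # should land near 0.5 after shuffling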
To create our network architecture that includes word embeddings, we need to include two things:
1. a layer_embedding layer that creates the embeddings, and
2. a layer_flatten layer to flatten our embeddings to a 2D tensor for the densely connected portion of our model.
Note: input_dim & input_length are considered pre-processing hyperparameters; output_dim is our word embeddings hyperparameter.
model <- keras_model_sequential() %>%
layer_embedding(
input_dim = top_n_words, # number of words we are considering
input_length = max_len, # length that we have set each review to
output_dim = 32 # length of our word embeddings
) %>%
layer_flatten() %>%
layer_dense(units = 1, activation = "sigmoid")
summary(model)
Model: "sequential_1"
_______________________________________________________________________________________
Layer (type) Output Shape Param #
=======================================================================================
embedding_1 (Embedding) (None, 150, 32) 320000
_______________________________________________________________________________________
flatten_1 (Flatten) (None, 4800) 0
_______________________________________________________________________________________
dense_1 (Dense) (None, 1) 4801
=======================================================================================
Total params: 324,801
Trainable params: 324,801
Non-trainable params: 0
_______________________________________________________________________________________
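The parameter counts in the summary follow directly from the layer dimensions; a quick sanity check of the arithmetic:
# embedding layer: one 32-dimensional vector per word in the 10K vocabulary
top_n_words * 32       # 320,000
# flatten: 150 word positions x 32 embedding dimensions feed the dense layer
max_len * 32           # 4,800
# dense layer: one weight per flattened input plus a single bias
max_len * 32 + 1       # 4,801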
The rest of our modeling procedure follows the same protocols that you’ve seen in the other modules.
model %>% compile(
optimizer = "rmsprop",
loss = "binary_crossentropy",
metrics = "accuracy"
)
history <- model %>% fit(
x_train, y_train,
epochs = 10,
batch_size = 32,
validation_split = 0.2
)
best_epoch <- which.min(history$metrics$val_loss)
best_loss <- history$metrics$val_loss[best_epoch] %>% round(3)
best_acc <- history$metrics$val_accuracy[best_epoch] %>% round(3)
glue("Our optimal loss is {best_loss} with an accuracy of {best_acc}")
Our optimal loss is 0.301 with an accuracy of 0.876
plot(history)
Spend a few minutes adjusting this model and see how it impacts performance. You may want to test, for example, how the size of the word embeddings (output_dim) impacts performance.
yourturn_model <- keras_model_sequential() %>%
layer_embedding(
input_dim = _____,
input_length = _____,
output_dim = _____
) %>%
layer_flatten() %>%
layer_dense(units = ____, activation = ____) %>%
layer_dense(units = 1, activation = "sigmoid")
yourturn_model %>% compile(
optimizer = _____,
loss = "binary_crossentropy",
metrics = "accuracy"
)
yourturn_results <- yourturn_model %>% fit(
x_train, y_train,
epochs = 10,
batch_size = 32,
validation_split = 0.2
)
Recall that the word embeddings we found for natural language modeling created results like:
# natural language modeling embeddings
get_similar_words("horrible", word_embeddings)
horrible terrible awful bad acting
1.0000000 0.9132471 0.8663343 0.8041792 0.7790165
However, embeddings we find for classification tasks are not always so clean and intuitive. We can get the word embeddings from our classification model with:
wts <- get_weights(model)
embedding_wts <- wts[[1]]
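It's worth confirming the shape is what we expect from the embedding layer we defined (one row per word in the 10K vocabulary, one column per embedding dimension):
dim(embedding_wts)   # 10000 x 32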
The following just does some bookkeeping to extract the applicable words and assign them as row names to the embedding matrix.
words <- tokenizer$word_index %>%
as_tibble() %>%
pivot_longer(everything(), names_to = "word", values_to = "id") %>%
filter(id <= tokenizer$num_words) %>%
arrange(id)
row.names(embedding_wts) <- words$word
The following is one of the custom functions you imported from the helper_functions.R file. You can see that the word embeddings most closely aligned to a given word are not as intuitive as those produced by the natural language model. However, these are the embeddings that were optimized for the classification task at hand.
similar_classification_words("horrible", embedding_wts)
horrible foul source rivals homicide fits
1.0000000 0.7439281 0.7299364 0.7215308 0.7141167 0.7128572
Here’s a handy sequence of code that uses the t-SNE methodology to visualize nearest neighbor word embeddings.
# plotting too many words makes the output hard to read
n_words_to_plot <- 1000
tsne <- Rtsne::Rtsne(
X = embedding_wts[1:n_words_to_plot,],
perplexity = 100,
pca = FALSE
)
p <- tsne$Y %>%
as.data.frame() %>%
mutate(word = row.names(embedding_wts)[1:n_words_to_plot]) %>%
ggplot(aes(x = V1, y = V2, label = word)) +
geom_text(size = 3)
plotly::ggplotly(p)
In summary:
- Use text_tokenizer() to define the text preprocessing we desire (defaults are good).
- Use fit_text_tokenizer() to preprocess the text (i.e. remove punctuation, standardize to lowercase).
- Use texts_to_sequences() to convert the standardized text to a numeric representation of the word index.
- Use pad_sequences() to make all text sequences the same length. This length can be adjusted to improve performance.
- Word embeddings are created with layer_embedding(); we can adjust the output_dimension of the embeddings.
- Use layer_flatten() to convert the embeddings to a 2D tensor and use layer_dense() for classification (there are alternatives which we will cover later).
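Pulling it all together, here is a minimal end-to-end sketch of the workflow covered in this module; it reuses the texts and labels objects built earlier and the same hyperparameter values chosen above:
# 1. tokenize, keeping the 10K most frequent words
tokenizer <- text_tokenizer(num_words = 10000) %>%
  fit_text_tokenizer(texts)

# 2. convert reviews to integer sequences and pad to a common length
features <- texts_to_sequences(tokenizer, texts) %>%
  pad_sequences(maxlen = 150)

# 3. shuffle so the validation split isn't all one class
set.seed(123)
index <- sample(nrow(features))

# 4. classifier that learns 32-dimensional word embeddings
model <- keras_model_sequential() %>%
  layer_embedding(input_dim = 10000, input_length = 150, output_dim = 32) %>%
  layer_flatten() %>%
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(
  optimizer = "rmsprop",
  loss = "binary_crossentropy",
  metrics = "accuracy"
)

history <- model %>% fit(
  features[index, ], labels[index],
  epochs = 10, batch_size = 32,
  validation_split = 0.2
)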