4

Advanced Word Vector Algorithms

In Chapter 3, Word2vec – Learning Word Embeddings, we introduced you to Word2vec, the basics of learning word embeddings, and the two common Word2vec algorithms: skip-gram and CBOW. In this chapter, we will discuss several other word vector algorithms:

  • GloVe – Global Vectors
  • ELMo – Embeddings from Language Models
  • Document classification with ELMo

First, you will learn a word embedding learning technique known as Global Vectors (GloVe) and the specific advantages that GloVe has over skip-gram and CBOW.

You will also look at a recent approach for representing language called Embeddings from Language Models (ELMo). ELMo has an edge over other algorithms as it is able to disambiguate words, as well as capture semantics. Specifically, ELMo generates “contextualized” word representations, by using a given word along with its surrounding words, as opposed to treating word representations independently, as in skip-gram or CBOW.

Finally, we will solve an exciting use case of document classification using our newfound ELMo vectors.

GloVe – Global Vectors representation

One of the main limitations of skip-gram and CBOW algorithms is that they can only capture local contextual information, as they only look at a fixed-length window around a word. There’s an important part of the puzzle missing here as these algorithms do not look at global statistics (by global statistics we mean a way for us to see all the occurrences of words in the context of another word in a text corpus).

However, we have already studied a structure that could contain this information in Chapter 3, Word2vec – Learning Word Embeddings: the co-occurrence matrix. Let’s refresh our memory on the co-occurrence matrix, as GloVe uses the statistics captured in the co-occurrence matrix to compute vectors.

Co-occurrence matrices encode the context information of words, but they require maintaining a V × V matrix, where V is the size of the vocabulary. To understand the co-occurrence matrix, let’s take two example sentences:

  • Jerry and Mary are friends.
  • Jerry buys flowers for Mary.

If we assume a context window of size 1, on each side of a chosen word, the co-occurrence matrix will look like the following (we only show the upper triangle of the matrix, as the matrix is symmetric):

          Jerry  and  Mary  are  friends  buys  flowers  for
Jerry       0     1     0    0      0       1      0      0
and               0     1    0      0       0      0      0
Mary                    0    1      0       0      0      1
are                          0      1       0      0      0
friends                             0       0      0      0
buys                                        0      1      0
flowers                                            0      1
for                                                       0

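As a side note, computing such a matrix is straightforward. The following is a minimal illustrative sketch for the two sentences above; the chapter notebook builds the full matrix for the corpus as a scipy sparse matrix, but the idea is the same:

from collections import defaultdict

import numpy as np

def build_cooccurrence(sentences, window_size=1):
    """ Builds a symmetric co-occurrence matrix for a toy corpus """
    # Build the vocabulary in order of first appearance
    vocab = {}
    tokenized = [s.lower().replace('.', '').split() for s in sentences]
    for tokens in tokenized:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))

    matrix = np.zeros((len(vocab), len(vocab)), dtype=int)
    for tokens in tokenized:
        for pos, tok in enumerate(tokens):
            # Look at the neighbors within the window on either side
            start = max(0, pos - window_size)
            for ctx in tokens[start:pos] + tokens[pos + 1:pos + 1 + window_size]:
                matrix[vocab[tok], vocab[ctx]] += 1
    return vocab, matrix

vocab, matrix = build_cooccurrence(
    ["Jerry and Mary are friends.", "Jerry buys flowers for Mary."]
)
print(vocab)
print(matrix)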
We can see that this matrix shows us how a word in a corpus is related to any other word, hence it contains global statistics about the corpus. That said, what are some of the advantages of having a co-occurrence matrix, as opposed to seeing just the local context?

  • It provides you with additional information about the characteristics of the words. For example, if you consider the sentence “the cat sat on the mat,” it is difficult to say if “the” is a special word that appears in the context of words such as “cat” or “mat.” However, if you have a large-enough corpus and a co-occurrence matrix, it’s very easy to see that “the” is a frequently occurring stop word.
  • The co-occurrence matrix recognizes the repeating usages of contexts or phrases, whereas in the local context this information is ignored. For example, in a large enough corpus, “New York” will be a clear winner, showing that the two words appear in the same context many times.

It is important to keep in mind that Word2vec algorithms use various techniques to approximately inject some word co-occurrence patterns while learning word vectors. For example, the sub-sampling technique we used in the previous chapter (i.e. sampling frequent words less often) helps to avoid over-representing stop words. But such techniques introduce additional hyperparameters and are not as informative as the co-occurrence matrix.

Using global statistics to come up with word representations is not a new concept. An algorithm known as Latent Semantic Analysis (LSA) has been using global statistics in its approach.

LSA is used as a document analysis technique that maps words in the documents to something known as a concept, a common pattern of words that appears in a document. Global matrix factorization-based methods efficiently exploit the global statistics of a corpus (for example, co-occurrence of words in a global scope), but have been shown to perform poorly at word analogy tasks. On the other hand, context window-based methods have been shown to perform well at word analogy tasks, but do not utilize global statistics of the corpus, leaving space for improvement. GloVe attempts to get the best of both worlds—an approach that efficiently leverages global corpus statistics while optimizing the learning model in a context window-based manner similar to skip-gram or CBOW.

GloVe, a new technique for learning word embeddings, was introduced in the paper “GloVe: Global Vectors for Word Representation” by Pennington et al. (https://nlp.stanford.edu/pubs/glove.pdf). GloVe attempts to bridge the gap of missing global co-occurrence information in Word2vec algorithms. The main contribution of GloVe is a new cost function (or objective function) that uses the valuable statistics available in the co-occurrence matrix. Let’s first understand the motivation behind the GloVe method.

Understanding GloVe

Before looking at the implementation details of GloVe, let’s take time to understand the concepts governing the computations in GloVe. To do so, let’s consider an example:

  • Consider word i = “ice” and word j = “steam”
  • Define an arbitrary probe word k
  • Define P_ik to be the probability of words i and k occurring close to each other, and P_jk to be the probability of words j and k occurring close to each other

Now let’s look at how the entity P_ik/P_jk behaves for different values of k.

For k = “solid”, it is highly likely to appear with i; thus, P_ik will be high. However, k would not often appear along with j, causing a low P_jk. Therefore, we get the following expression:

P_ik/P_jk >> 1

Next, for k = “gas”, it is unlikely to appear in close proximity to i and will therefore have a low P_ik; however, since k highly correlates with j, the value of P_jk will be high. This leads to the following:

P_ik/P_jk << 1

Now, for words such as k = “water”, which has a strong relationship with both i and j, or k = “fashion”, which i and j both have minimal relevance to, we get this:

P_ik/P_jk ≈ 1

If you assume we have learned sensible word embeddings for these words, these relationships can be visualized in a vector space to understand why the ratio behaves this way (see Figure 4.1). In the figure below, the solid arrow shows the distance between the words (i, j), whereas the dashed lines express the distance between the words (i, k) and (j, k). These distances can then be associated with the probability values we discussed. For example, when i = “ice” and k = “solid”, we expect their vectors to have a shorter distance between them (i.e. they co-occur more frequently). Therefore, we can associate the distance between (i, k) with the inverse of P_ik (i.e. 1/P_ik) due to the definition of P_ik. This diagram shows how these distances vary as the probe word k changes:

Figure 4.1: How the entities P_ik and P_jk behave as the probe word changes in proximity to the words i and j

It can be seen that the entity P_ik/P_jk, which is calculated by measuring the frequency of two words appearing close to each other, behaves in different ways as the relationship between the three words changes. As a result, it becomes a good candidate for learning word vectors. Therefore, a good starting point for defining the loss function will be as shown here:

F(w_i, w_j, w'_k) = P_ik/P_jk

Here, F is some function, and w and w' are two different embedding spaces we’ll be using. In other words, the words i and j are looked up from one embedding space (w), whereas the probe word k is looked up from the other (w'). From this point, the original paper goes through the derivation meticulously to reach the following loss function:

J = Σ_{i,j=1..V} f(X_ij) (w_i . w'_j + b_i + b'_j − log(X_ij))^2

We will not go through the derivation here, as that’s out of scope for this book. Rather we will use the derived loss function and implement the algorithm with TensorFlow. If you need a less mathematically dense explanation of how we can derive this cost function, please refer to the author-written article at https://towardsdatascience.com/light-on-math-ml-intuitive-guide-to-understanding-glove-embeddings-b13b4f19c010.

Here, f(X_ij) is defined as (X_ij/x_max)^alpha if X_ij < x_max, else 1, where X_ij is the frequency with which the word j appears in the context of the word i, and x_max is a hyperparameter we set. Remember that we defined two embedding spaces, w and w', in our loss function. w_i and b_i represent the word embedding and the bias embedding for the word i obtained from embedding space w, respectively. And w'_j and b'_j represent the word embedding and bias embedding for the word j obtained from embedding space w', respectively. Both sets of embeddings behave similarly, except for the randomization at initialization. At the evaluation phase, the two embeddings are added together, leading to improved performance.
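As a quick sanity check of how this weighting behaves, here is a small standalone sketch using x_max=100 and alpha=0.75, the values we will also use later in the data generator:

import numpy as np

def glove_weight(x_ij, x_max=100.0, alpha=0.75):
    """ Computes f(X_ij): (X_ij/x_max)**alpha if X_ij < x_max, else 1 """
    return np.where(x_ij < x_max, (x_ij / x_max) ** alpha, 1.0)

# Rare co-occurrences get a small weight; frequent ones are capped at 1
print(glove_weight(np.array([1.0, 10.0, 50.0, 100.0, 1000.0])))
# approximately [0.032 0.178 0.595 1.    1.   ]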

Implementing GloVe

In this subsection, we will discuss the steps for implementing GloVe. The full code is available in the ch4_glove.ipynb exercise file located in the ch4 folder.

First, we’ll define the hyperparameters as we did in the previous chapter:

import random

import numpy as np

batch_size = 4096 # Data points in a single batch
embedding_size = 128 # Dimension of the embedding vector
window_size = 1 # We use a window size of 1 on either side of the target word
epochs = 5 # Number of epochs to train for

# We pick a random validation set to sample nearest neighbors
valid_size = 16 # Random set of words to evaluate similarity on
# We sample valid datapoints randomly from a large window without always
# being deterministic
valid_window = 250

# When selecting valid examples, we select some of the most frequent words
# as well as some moderately rare words
np.random.seed(54321)
random.seed(54321)
valid_term_ids = np.array(random.sample(range(valid_window), valid_size))
valid_term_ids = np.append(
    valid_term_ids, random.sample(range(1000, 1000+valid_window), valid_size),
    axis=0
)

The hyperparameters you define here are the same hyperparameters we defined in the previous chapter. We have a batch size, embedding size, window size, the number of epochs, and, finally, a set of held-out validation word IDs that we will print the most similar words to.

We will then define the model. First, we will import a few things we will need down the line:

import tensorflow.keras.backend as K
from tensorflow.keras.layers import Input, Embedding, Dot, Add
from tensorflow.keras.models import Model
K.clear_session()

The model is going to have two input layers: word_i and word_j. They represent a batch of context words and a batch of target words (or a batch of positive skip-grams):

# Define two input layers for context and target words
word_i = Input(shape=())
word_j = Input(shape=())

Note how the shape is defined as an empty tuple. This means the final shape of word_i and word_j will be [None], meaning each input takes a vector with an arbitrary number of elements (here, a batch of word IDs).

Next, we are going to define the embedding layers. There will be four embedding layers:

  • embeddings_i – The context embedding layer
  • embeddings_j – The target embedding layer
  • b_i – The context embedding bias
  • b_j – The target embedding bias

The following code defines these:

# Each context and target has their own embeddings (weights and biases)
# Embedding weights
embeddings_i = Embedding(n_vocab, embedding_size, name='target_embedding')(word_i)
embeddings_j = Embedding(n_vocab, embedding_size, name='context_embedding')(word_j)
# Embedding biases
b_i = Embedding(n_vocab, 1, name='target_embedding_bias')(word_i)
b_j = Embedding(n_vocab, 1, name='context_embedding_bias')(word_j)

Next, we are going to compute the output. The output of this model will be:

w_i . w'_j + b_i + b'_j

As you can see, that’s a portion of our final loss function. We have all the right ingredients to compute this result:

# Compute the dot product between embedding vectors (i.e. w_i.w_j)
ij_dot = Dot(axes=-1)([embeddings_i,embeddings_j])
# Add the biases (i.e. w_i.w_j + b_i + b_j )
pred = Add()([ij_dot, b_i, b_j])

First we will use the tensorflow.keras.layers.Dot layer to compute the batch-wise dot product between the two embedding lookups (embeddings_i and embeddings_j). The two inputs to the Dot layer will be of size [batch size, embedding size]. After the dot product, the output ij_dot will be [batch size, 1], where ij_dot[k] is the dot product between embeddings_i[k, :] and embeddings_j[k, :]. Then we simply add b_i and b_j (which have shape [None, 1]) element-wise to ij_dot.
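If you want to convince yourself of these shapes, a quick standalone check like the following works (the tensors here are random stand-ins for the embedding lookups):

import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Dot, Add

# Two random "embedding lookups" of shape [batch size, embedding size]
a = tf.constant(np.random.normal(size=(4, 128)), dtype=tf.float32)
b = tf.constant(np.random.normal(size=(4, 128)), dtype=tf.float32)
# Two random "bias lookups" of shape [batch size, 1]
b_i = tf.constant(np.random.normal(size=(4, 1)), dtype=tf.float32)
b_j = tf.constant(np.random.normal(size=(4, 1)), dtype=tf.float32)

ij_dot = Dot(axes=-1)([a, b])      # -> shape (4, 1), row-wise dot products
pred = Add()([ij_dot, b_i, b_j])   # -> shape (4, 1)
print(ij_dot.shape, pred.shape)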

Finally, the model is defined as taking word_i and word_j as inputs and outputting pred:

# The final model
glove_model = Model(
    inputs=[word_i, word_j], outputs=pred,
    name='glove_model'
)

Next, we are going to do something quite important.

We have to devise a way to compute the complex loss function defined above using the components and functionality available in a Keras model. First, let’s revisit the loss function:

J = Σ_{i,j=1..V} f(X_ij) (w_i . w'_j + b_i + b'_j − log(X_ij))^2

where

f(X_ij) = (X_ij/x_max)^alpha, if X_ij < x_max, else 1.

Although it looks complex, we can use already existing loss functions and other functionality to implement the GloVe loss. You can abstract this loss function into three components as shown in the image below:

Figure 4.2: The breakdown of the GloVe loss function showing how predictions, targets, and weights interact with each other to compute the final loss

Therefore, if sample weights are denoted by f(X_ij), predictions are denoted by w_i . w'_j + b_i + b'_j, and true targets are denoted by log(X_ij), then we can write the loss as:

J = Σ sample_weight × (prediction − true_target)^2
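To make the connection concrete, here is a tiny sketch with made-up prediction, target, and weight values (the numbers are purely for illustration):

import numpy as np

# Hypothetical per-example predictions (w_i.w_j + b_i + b_j),
# true targets (log(X_ij)) and sample weights (f(X_ij))
predictions = np.array([0.8, 2.1, 1.5])
targets = np.array([1.0, 2.0, 1.2])
weights = np.array([0.3, 1.0, 0.6])

# Weighted mean squared error, which is what Keras computes when the
# data generator yields (inputs, targets, sample_weights)
loss = np.mean(weights * (predictions - targets) ** 2)
print(loss)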

This is simply a weighted mean squared loss. Therefore, we will use "mse" as the loss for our model:

# Glove has a specific loss function with a sound mathematical
# underpinning
# It is a form of mean squared error
glove_model.compile(loss="mse", optimizer = 'adam')

We will later see how we can feed in sample weights to the model to complete the loss function. So far, we have defined different components of the GloVe algorithm and compiled the model. Next, we are going to have a look at how data can be generated to train the GloVe model.

Generating data for GloVe

The dataset we will be using is the same as the dataset from the previous chapter. To recap, we will be using the BBC news articles dataset available at http://mlg.ucd.ie/datasets/bbc.html. It contains 2,225 news articles belonging to 5 topics (business, entertainment, politics, sport, and tech) that were published on the BBC website between 2004 and 2005.

Let’s now generate the data. We will be encapsulating the data generation in a function called glove_data_generator(). As the first step, let us write a function signature:

def glove_data_generator(
    sequences, window_size, batch_size, vocab_size, cooccurrence_matrix,
    x_max=100.0, alpha=0.75, seed=None
):

The function takes several arguments:

  • sequences (List[List[int]]) – A list of lists of word IDs. This is the output generated by the tokenizer’s texts_to_sequences() function.
  • window_size (int) – Window size for the context.
  • batch_size (int) – Batch size.
  • vocab_size (int) – Vocabulary size.
  • cooccurrence_matrix (scipy.sparse.lil_matrix) – A sparse matrix containing the co-occurrences of words.
  • x_max (float) – Hyperparameter used by GloVe to compute sample weights.
  • alpha (float) – Hyperparameter used by GloVe to compute sample weights.
  • seed – The random seed.

It also has several outputs:

  • A batch of (target, context) word ID tuples
  • The corresponding log(X_ij) values for the (target, context) tuples
  • The sample weight (i.e. f(X_ij)) values for the (target, context) tuples

First we will shuffle the order of news articles:

    # Shuffle the data so that, every epoch, the order of data is
    # different
    rand_sequence_ids = np.arange(len(sequences))
    np.random.shuffle(rand_sequence_ids)

Next, we will create the sampling table, so that we can use sub-sampling to avoid over-sampling common words (e.g. stop words):

    sampling_table = tf.keras.preprocessing.sequence.make_sampling_table(
        vocab_size
    )

With that, for every sequence (i.e. list of word IDs) representing an article, we generate positive skip-grams. Note how we are keeping negative_samples=0.0 as, unlike skip-gram or CBOW algorithms, GloVe does not rely on negative candidates:

    # For each story/article
    for si in rand_sequence_ids:
        
        # Generate positive skip-grams while using sub-sampling 
        positive_skip_grams, _ = tf.keras.preprocessing.sequence.skipgrams(
            sequences[si],
            vocabulary_size=vocab_size,
            window_size=window_size,
            negative_samples=0.0,
            shuffle=False,
            sampling_table=sampling_table,
            seed=seed
        )

With that, we first break down the skip-gram tuples into two lists, one containing targets and the other containing context words, and convert them to NumPy arrays subsequently:

        # Take targets and context words separately
        targets, context = zip(*positive_skip_grams)
        targets, context = np.array(targets).ravel(), np.array(context).ravel()

We then index the positions given by the (target, context) word pairs in the co-occurrence matrix to retrieve the corresponding X_ij values, where (i, j) represents a (target, context) pair:

        x_ij = np.array(cooccurrence_matrix[targets, context].toarray()).ravel()

Then we compute the corresponding log(X_ij) (denoted by log_x_ij) and f(X_ij) (denoted by sample_weights):

        # Compute log - Introducing an additive shift to make sure we
        # don't compute log(0)
        log_x_ij = np.log(x_ij + 1)
        
        # Sample weights 
        # if x < x_max => (x/x_max)**alpha / else => 1        
        sample_weights = np.where(x_ij < x_max, (x_ij/x_max)**alpha, 1)

If a seed is not provided, a random seed is generated. Afterward, all of context, targets, log_x_ij, and sample_weights are shuffled while maintaining the correspondence of elements between the arrays:

        # If seed is not provided generate a random one
        if not seed:
            seed = random.randint(0, 10e6)
        
        # Shuffle data
        np.random.seed(seed)
        np.random.shuffle(context)
        np.random.seed(seed)
        np.random.shuffle(targets)
        np.random.seed(seed)
        np.random.shuffle(log_x_ij)
        np.random.seed(seed)
        np.random.shuffle(sample_weights)

Finally, we iterate through batches of the data we created above. Each batch will consist of

  • A batch of (target, context) word ID tuples
  • The corresponding log(X_ij) values for the (target, context) tuples
  • The sample weight (i.e. f(X_ij)) values for the (target, context) tuples

in that order.

        # Generate a batch of data in the format
        # ((target words, context words), log(X_ij) <- true targets,
        # f(X_ij) <- sample weights)
        for eg_id_start in range(0, context.shape[0], batch_size):
            yield (
                (
                    targets[eg_id_start: min(eg_id_start+batch_size, targets.shape[0])],
                    context[eg_id_start: min(eg_id_start+batch_size, context.shape[0])]
                ),
                log_x_ij[eg_id_start: min(eg_id_start+batch_size, log_x_ij.shape[0])],
                sample_weights[eg_id_start: min(eg_id_start+batch_size, sample_weights.shape[0])]
            )

Now that the data is ready to be pumped in, let’s discuss the final piece of the puzzle: training the model.

Training and evaluating GloVe

Training the model is effortless, as we have all the components in place. As the first step, we will reuse the ValidationCallback we created in Chapter 3, Word2vec – Learning Word Embeddings. To recap, ValidationCallback is a Keras callback. Keras callbacks give you a way to execute some important operation(s) at the end of every training iteration, epoch, prediction step, etc. Here we are using the callback to perform a validation step at the end of every epoch. Our callback takes a list of word IDs intended as the validation words (held out in valid_term_ids), the model containing the embedding matrix, and a tokenizer to decode word IDs. It then computes the top-k most similar words for every word in the validation set and prints them as the output.
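The exact ValidationCallback implementation comes from the Chapter 3 exercise code. Purely as a reminder of what such a callback might look like, here is a minimal sketch; it assumes cosine similarity computed over the weights of the layer named target_embedding, so treat it as illustrative rather than as the exercise code:

import numpy as np
import tensorflow as tf

class ValidationCallback(tf.keras.callbacks.Callback):
    """ Prints the top-k most similar words for a set of validation words """
    def __init__(self, valid_term_ids, model_with_embeddings, tokenizer, top_k=5):
        super().__init__()
        self.valid_term_ids = valid_term_ids
        self.model_with_embeddings = model_with_embeddings
        self.tokenizer = tokenizer
        self.top_k = top_k

    def on_epoch_end(self, epoch, logs=None):
        # Get the word embedding matrix from the model
        embeddings = self.model_with_embeddings.get_layer(
            "target_embedding").get_weights()[0]
        # Normalize so that dot products become cosine similarities
        normalized = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        for term_id in self.valid_term_ids:
            # Similarity of this word to every other word in the vocabulary
            similarities = normalized @ normalized[term_id]
            # Exclude the word itself and take the top-k most similar word IDs
            top_ids = np.argsort(similarities)[::-1][1:self.top_k + 1]
            similar_words = [self.tokenizer.index_word.get(i, '') for i in top_ids]
            print("{}: {}".format(
                self.tokenizer.index_word.get(term_id, ''), ', '.join(similar_words)))

With the callback in place, the training loop looks as follows: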

glove_validation_callback = ValidationCallback(valid_term_ids, glove_model, tokenizer)

# Train the model for several epochs
for ei in range(epochs):

    print("Epoch: {}/{} started".format(ei+1, epochs))

    # Note that the generator also needs the co-occurrence matrix computed
    # from the corpus
    news_glove_data_gen = glove_data_generator(
        news_sequences, window_size, batch_size, n_vocab, cooccurrence_matrix
    )

    glove_model.fit(
        news_glove_data_gen, epochs=1,
        callbacks=[glove_validation_callback],
    )

You should get a sensible-looking output once the model has finished training. Here are some of the cherry-picked results:

election: attorney, posters, forthcoming, november's, month's
months: weeks, years, nations, rbs, thirds
you: afford, we, they, goodness, asked
music: cameras, mp3, hp's, refuseniks, divide
best: supporting, category, asante, counterparts, actor
mr: ron, tony, bernie, jack, 63
leave: pay, need, unsubstantiated, suited, return
5bn: 8bn, 2bn, 1bn, 3bn, 7bn
debut: solo, speakerboxxx, youngster, nasty, toshack
images: 117, pattern, recorder, lennon, unexpectedly
champions: premier, celtic, football, representatives, neighbour
individual: extra, attempt, average, improvement, survived
businesses: medium, sell, redder, abusive, handedly
deutsche: central, austria's, donald, ecb, austria
machine: unforced, wireless, rapid, vehicle, workplace

You can see that words like “months,” “weeks,” and “years” are grouped together. Numbers like “5bn,” “8bn,” and “2bn” are grouped together as well. “Deutsche” is surrounded by “Austria’s” and “Austria.” Finally, we will save the embeddings to the disk. We will combine weights and the bias of each context and target vector space to a single array, where the last column of the array will represent the bias and save it to the disk:

import os

import numpy as np
import pandas as pd

def save_embeddings(model, tokenizer, vocab_size, save_dir):

    os.makedirs(save_dir, exist_ok=True)

    _, words_sorted = zip(*sorted(
        list(tokenizer.index_word.items()), key=lambda x: x[0]
    )[:vocab_size-1])

    words_sorted = [None] + list(words_sorted)

    context_embedding_weights = model.get_layer(
        "context_embedding").get_weights()[0]
    context_embedding_bias = model.get_layer(
        "context_embedding_bias").get_weights()[0]
    context_embedding = np.concatenate(
        [context_embedding_weights, context_embedding_bias], axis=1
    )

    target_embedding_weights = model.get_layer(
        "target_embedding").get_weights()[0]
    target_embedding_bias = model.get_layer(
        "target_embedding_bias").get_weights()[0]
    target_embedding = np.concatenate(
        [target_embedding_weights, target_embedding_bias], axis=1
    )

    pd.DataFrame(
        context_embedding,
        index=words_sorted
    ).to_pickle(os.path.join(save_dir, "context_embedding_and_bias.pkl"))

    pd.DataFrame(
        target_embedding,
        index=words_sorted
    ).to_pickle(os.path.join(save_dir, "target_embedding_and_bias.pkl"))

save_embeddings(glove_model, tokenizer, n_vocab, save_dir='glove_embeddings')

We save the embeddings as pandas DataFrames. First we get all the words sorted by their IDs. We take vocab_size-1 entries to account for the reserved word ID 0, which we add manually in the following line (note that word ID 0 does not show up in tokenizer.index_word). Next we get the required layers by name (namely, context_embedding, target_embedding, context_embedding_bias, and target_embedding_bias). Once we have the layers, we can use the get_weights() function to retrieve their weights.
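As noted earlier, the two embedding spaces are added together at evaluation time. A small sketch of how you might load the saved DataFrames back and combine them (the last column holds the bias, which we drop here) could look like this:

import os
import pandas as pd

save_dir = 'glove_embeddings'
context_df = pd.read_pickle(
    os.path.join(save_dir, "context_embedding_and_bias.pkl"))
target_df = pd.read_pickle(
    os.path.join(save_dir, "target_embedding_and_bias.pkl"))

# The last column holds the bias; keep only the embedding dimensions
context_embedding = context_df.iloc[:, :-1]
target_embedding = target_df.iloc[:, :-1]

# Add the two spaces together to get the final GloVe word vectors
glove_word_vectors = context_embedding + target_embedding
print(glove_word_vectors.shape)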

In this section, we looked at GloVe, another word embedding learning technique.

The main advantage of GloVe over the Word2vec techniques discussed in Chapter 3, Word2vec – Learning Word Embeddings, is that it pays attention to both the global and the local statistics of the corpus to learn embeddings. As GloVe is able to capture global information about words, it tends to give better performance, especially as the corpus size increases. Another advantage is that, unlike the Word2vec techniques, GloVe does not approximate the cost function (as Word2vec does, for example, with negative sampling), but calculates the true cost. This leads to better and easier optimization of the loss.

In the next section, we are going to look at one more word vector algorithm known as Embeddings from Language Models (ELMo).

ELMo – Taking ambiguities out of word vectors

So far, we’ve looked at word embedding algorithms that assign a single, fixed representation to each word in the vocabulary, no matter the context in which the word is queried. Why would this be a problem? Consider the following two phrases:

I went to the bank to deposit some money

and

I walked along the river bank

Clearly, the word “bank” is used in two totally different contexts. If you use a vanilla word vector algorithm (e.g. skip-gram), you can only have one representation for the word “bank”, and it is probably going to be muddled between the concept of a financial institution and the concept of a river bank, depending on how the word is used in the corpus the model is trained on. Therefore, it is more sensible to provide embeddings for a word while preserving and leveraging the context around it. This is exactly what ELMo strives for.

Specifically, ELMo takes in a sequence, as opposed to a single token, and provides contextualized representations for each token in the sequence. Figure 4.3 depicts the various components that make up the model. The first thing to understand is that ELMo is a complicated beast! Several neural networks work in concert within ELMo to produce the output. In particular, the model uses:

  • A character embedding layer (an embedding vector for each character).
  • A convolutional neural network (CNN) – a CNN consists of many convolutional layers followed by an optional fully connected classification layer.

A convolution layer takes in a sequence of inputs (e.g. sequence of characters in a word) and moves a window of weights over the input to generate a latent representation. We will discuss CNNs in detail in the coming chapters.

  • Two bi-directional LSTM layers – an LSTM is a type of model that is used to process time-series data. Given a sequence of inputs (e.g. sequence of word vectors), an LSTM goes from one input to the other, on the time dimension, and produces an output at each position. Unlike fully connected networks, LSTMs have memory, meaning the output at the current position will be affected by what the LSTM has seen in the past. We will discuss LSTMs in detail in the coming chapters.

The specifics of these different components are outside the scope of this chapter. They will be discussed in detail in the coming chapters. Therefore, do not worry if you do not understand the exact mechanisms of the sub-components shown here (Figure 4.3).

Figure 4.3: Different components of the ELMo model. Token embeddings are generated using a type of neural network known as a CNN. These token embeddings are fed to an LSTM model (that can process time-series data). The output of the first LSTM model is fed to a second LSTM model to generate a latent contextualized representation for each token

We can download a pretrained ELMo model from TensorFlow Hub (https://tfhub.dev). TF Hub is a repository for various pretrained models.

It hosts models for tasks such as image classification, text classification, text generation, etc. You can go to the site and browse various available models.

Downloading ELMo from TensorFlow Hub

The ELMo model we will be using is found at https://tfhub.dev/google/elmo/3. It has been trained on a very large corpus of text to solve a task known as language modeling. In language modeling, we try to predict the next word given the previous sequence of tokens. We will learn more about language modeling in the coming chapters.

Before downloading the model, let’s set the following environment variables:

# Not allocating full GPU memory upfront
%env TF_FORCE_GPU_ALLOW_GROWTH=true
# Making sure we cache the models so they are not downloaded every time
%env TFHUB_CACHE_DIR=./tfhub_modules

TF_FORCE_GPU_ALLOW_GROWTH allows TensorFlow to allocate GPU memory on-demand as opposed to allocating all GPU memory at once. TFHUB_CACHE_DIR sets the directory where the models will be downloaded. We will first import TensorFlow Hub:

import tensorflow_hub as hub

Next, as usual, we will clear any running TensorFlow sessions by running the following code:

import tensorflow as tf
import tensorflow.keras.backend as K
K.clear_session()

Finally, we will download the ELMo model. You can employ two ways to download pretrained models from TF Hub and use them in our code:

  • hub.load(<url>, **kwargs) – Recommended way for downloading and using TensorFlow 2-compatible models
  • hub.KerasLayer(<url>, **kwargs) – This is a workaround for using TensorFlow 1-based models in TensorFlow 2

Unfortunately, ELMo has not been ported to TensorFlow 2 yet. Therefore, we will use the hub.KerasLayer() as the workaround to load ELMo in TensorFlow 2:

elmo_layer = hub.KerasLayer(
    "https://tfhub.dev/google/elmo/3", 
    signature="tokens",signature_outputs_as_dict=True
)

Note that we are providing two arguments, signature and signature_outputs_as_dict:

  • signature (str) – Can be default or tokens. The default signature accepts a list of strings, where each string is converted to a list of tokens internally. The tokens signature takes its input as a dictionary with two keys: tokens (a list of lists of tokens, where each list of tokens is a single phrase/sentence, padded to bring all of them to a fixed length) and sequence_len (the true length of each list of tokens, used to determine how much of each sequence is padding).
  • signature_outputs_as_dict (bool) – When set to true, it will return all the outputs defined in the provided signature.

Now that we have understood the components of ELMo and downloaded it from TensorFlow Hub, let’s see how we can process input data for ELMo.

Preparing inputs for ELMo

Here we will define a function that will convert a given list of strings to the format ELMo expects the inputs to be in. Remember that we set the signature of ELMo to be tokens. An example input to the signature "tokens" would look as follows.

{
    'tokens': [
        ['the', 'cat', 'sat', 'on', 'the', 'mat'],
        ['the', 'mat', 'sat', '', '', '']
    ], 
    'sequence_len': [6, 3]
}

Let’s take a moment to process what the input comprises. First it has the key tokens, which has a list of tokens. Each list of tokens can be thought of as a sentence. Note how padding is added to the end of the short sentence to match the length. This is important as, otherwise, the model will throw an error as it can’t convert arbitrary-length sequences to a tensor. Next we have sequence_len, which is a list of integers. Each integer specifies the true length of each sequence. Note how the second element says 3, to match the actual tokens present in the second sequence.

Given a list of strings, we can write a function to do this transformation for us. That’s what the format_text_for_elmo() function will do for us. Let’s sink our teeth into the specifics:

def format_text_for_elmo(texts, lower=True, split=" ", max_len=None):
    
    """ Formats a given text for the ELMo model (takes in a list of
    strings) """
        
    token_inputs = [] # Maintains individual tokens
    token_lengths = [] # Maintains the length of each sequence
    
    max_len_inferred = 0 
    # We keep a variable to maintain the max length of the input
    # Go through each text (string)
    for text in texts:    
        
        # Process the text and get a list of tokens
        tokens = tf.keras.preprocessing.text.text_to_word_sequence(text, 
        lower=lower, split=split)
        
        # Add the tokens 
        token_inputs.append(tokens)                   
        
        # Compute the max length for the collection of sequences
        if len(tokens)>max_len_inferred:
            max_len_inferred = len(tokens)
    
    # It's important to make sure the maximum token length is only as
    # large as the longest input in the sequence
    # Here we make sure max_len is only as large as the longest input
    if max_len and max_len_inferred < max_len:
        max_len = max_len_inferred
    if not max_len:
        max_len = max_len_inferred
    
    # Go through each token sequence and modify sequences to have same
    # length
    for i, token_seq in enumerate(token_inputs):
        
        token_lengths.append(min(len(token_seq), max_len))
        
        # If the maximum length is less than input length, truncate
        if max_len < len(token_seq):
            token_seq = token_seq[:max_len]            
        # If the maximum length is greater than or equal to input length,
        # add padding as needed
        else:            
            token_seq = token_seq+[""]*(max_len-len(token_seq))
                
        assert len(token_seq)==max_len
        
        token_inputs[i] = token_seq
    
    # Return the final output
    return {
        "tokens": tf.constant(token_inputs), 
        "sequence_len": tf.constant(token_lengths)
    }

We first create two lists, token_inputs and token_lengths, to hold the individual tokens and their respective lengths. Next we go through each string in texts and get the individual tokens using the tf.keras.preprocessing.text.text_to_word_sequence() function. While doing so, we keep track of the maximum token length observed so far. After iterating through the sequences, we check whether the maximum length inferred from the inputs is smaller than max_len (if max_len was specified). If so, we use max_len_inferred as the maximum length. This is important, because otherwise you may unnecessarily lengthen the inputs by defining a large value for max_len. Not only that, the model will raise an error like the one below if you do so.

    # InvalidArgumentError: Incompatible shapes: [2,6,1] vs. [2,10,1024]
    #   [[node mul (defined at .../python3.6/site-packages/tensorflow_hub/module_v2.py:106)]]
    #   [Op:__inference_pruned_3391]

Once the proper maximum length is found, we will go through the sequences and

  • If it is longer than max_len, truncate the sequence
  • If it is shorter than max_len, pad it with empty-string ("") tokens until it reaches max_len

Finally, we will convert them to tf.Tensor objects using the tf.constant construct. For example, you can call this function with:

print(format_text_for_elmo(["the cat sat on the mat", "the mat sat"], max_len=10))

This will output:

{'tokens': <tf.Tensor: shape=(2, 6), dtype=string, numpy=
array([[b'the', b'cat', b'sat', b'on', b'the', b'mat'],
       [b'the', b'mat', b'sat', b'', b'', b'']], dtype=object)>, 'sequence_len': <tf.Tensor: shape=(2,), dtype=int32, numpy=array([6, 3], dtype=int32)>}

We will now see how ELMo can be used to generate embeddings for the prepared inputs.

Generating embeddings with ELMo

Once the input is prepared, generating embeddings is quite easy. First we will transform the inputs to the stipulated format of the ELMo layer. Here we are using some example titles from the BBC dataset:

# Titles of 001.txt - 005.txt in bbc/business
elmo_inputs = format_text_for_elmo([
    "Ad sales boost Time Warner profit",
    "Dollar gains on Greenspan speech",
    "Yukos unit buyer faces loan claim",
    "High fuel prices hit BA's profits",
    "Pernod takeover talk lifts Domecq"
])

Next, simply pass the elmo_inputs to the elmo_layer as the input and get the result:

# Get the result from ELMo
elmo_result = elmo_layer(elmo_inputs)

Let’s now print the results and their shapes with the following line:

# Print the result
for k,v in elmo_result.items():    
    print("Tensor under key={} is a {} shaped Tensor".format(k, v.shape))

This will print out:

Tensor under key=sequence_len is a (5,) shaped Tensor
Tensor under key=elmo is a (5, 6, 1024) shaped Tensor
Tensor under key=default is a (5, 1024) shaped Tensor
Tensor under key=lstm_outputs1 is a (5, 6, 1024) shaped Tensor
Tensor under key=lstm_outputs2 is a (5, 6, 1024) shaped Tensor
Tensor under key=word_emb is a (5, 6, 512) shaped Tensor

As you can see, the model returns 6 different outputs. Let’s go through them one by one:

  • sequence_len – The same input we provided, containing the lengths of the sequences in the input
  • word_emb – The token embeddings obtained via the CNN layer in the ELMo model. We get a vector of size 512 for every sequence position (i.e. 6) and every row in the batch (i.e. 5).
  • lstm_outputs1 – The contextualized representations of the tokens obtained via the first LSTM layer
  • lstm_outputs2 – The contextualized representations of the tokens obtained via the second LSTM layer
  • default – A fixed mean embedding vector obtained by averaging the contextualized token representations across all the positions of a sequence
  • elmo – The weighted sum of word_emb, lstm_outputs1, and lstm_outputs2, where the weights are a set of task-specific trainable parameters that are trained jointly with the downstream task
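To see the contextualization these outputs provide in action, here is a quick, purely illustrative check using the two “bank” sentences from the beginning of this section. It reuses the elmo_layer and format_text_for_elmo helpers defined earlier; the token positions 4 and 5 are where “bank” ends up after tokenization in the two sentences:

import numpy as np

# Two sentences where "bank" carries different meanings
bank_inputs = format_text_for_elmo([
    "I went to the bank to deposit some money",
    "I walked along the river bank"
])
bank_result = elmo_layer(bank_inputs)

# "bank" is the 5th token (index 4) in the first sentence and the
# 6th token (index 5) in the second sentence after tokenization
elmo_vectors = bank_result["elmo"].numpy()
bank_financial = elmo_vectors[0, 4, :]
bank_river = elmo_vectors[1, 5, :]

cosine = np.dot(bank_financial, bank_river) / (
    np.linalg.norm(bank_financial) * np.linalg.norm(bank_river)
)
print("Cosine similarity between the two 'bank' vectors: {:.3f}".format(cosine))

The two vectors will not be identical (cosine similarity below 1), since ELMo produces a different vector for each occurrence of a word depending on its context.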

What we are interested in here is the default output. That would give us a very good representation of what’s contained in the document.

Other word embedding techniques

Apart from the word embedding techniques we discussed here, there are a few other notable and widely used word embedding techniques. We will briefly discuss some of them in this section.

FastText

FastText (https://fasttext.cc/), introduced in the paper “Enriching Word Vectors with Subword Information” by Bojanowski et al. (https://arxiv.org/pdf/1607.04606.pdf), introduces a technique where word embeddings are computed by considering the sub-components of a word. Specifically, it computes the word embedding as a summation of the embeddings of the word’s character n-grams for several values of n. In the paper, they use 3 <= n <= 6. For example, for the word “banana,” the tri-grams (n=3) would be ['ban', 'ana', 'nan', 'ana']. This leads to robust embeddings that can withstand common problems of text, such as spelling mistakes.
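As an illustration of the idea (ignoring the word-boundary markers the paper also adds), extracting such character n-grams could look like this:

def char_ngrams(word, min_n=3, max_n=6):
    """ Returns all character n-grams of the word for n in [min_n, max_n] """
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(0, len(word) - n + 1):
            ngrams.append(word[start:start + n])
    return ngrams

# Tri-grams only, as in the "banana" example above
print(char_ngrams("banana", min_n=3, max_n=3))  # ['ban', 'ana', 'nan', 'ana']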

Swivel embeddings

Swivel embeddings, introduced in the paper “Swivel: Improving Embeddings by Noticing What’s Missing” by Shazeer et al. (https://arxiv.org/pdf/1602.02215.pdf), try to blend GloVe and skip-gram with negative sampling. One of the critical limitations of GloVe is that it only uses information about positive contexts. Therefore, the method is not penalized for producing similar vectors for words that have never been observed together. The negative sampling used in skip-gram directly tackles this problem. The biggest innovation of Swivel is a loss function that incorporates unobserved word pairs. As an added benefit, it can also be trained in a distributed environment.

Transformer models

Transformers are a family of models that have reimagined the way we think about NLP problems. The Transformer model was initially introduced in the paper “Attention Is All You Need” by Vaswani et al. (https://arxiv.org/pdf/1706.03762.pdf). This model has many different embeddings within it and, like ELMo, can generate an embedding per token by processing a sequence of text. We will talk about Transformer models in detail in later chapters.

We have discussed all the bells and whistles required to confidently use the ELMo model. Next we will classify documents using ELMo, in which ELMo will generate document embeddings as inputs to a classification model.

Document classification with ELMo

Although Word2vec gives a very elegant way of learning numerical representations of words, learning word representations alone is not enough to realize the power of word vectors in real-world applications.

Word embeddings are used as the feature representation of words for many tasks, such as image caption generation and machine translation. However, these tasks involve combining different learning models such as Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) models or two LSTM models (the CNN and LSTM models will be discussed in more detail in later chapters). To understand a real-world usage of word embeddings let’s stick to a simpler task—document classification.

Document classification is one of the most popular tasks in NLP. Document classification is extremely useful for anyone who is handling massive collections of data such as those for news websites, publishers, and universities. Therefore, it is interesting to see how learning word vectors can be adapted to a real-world task such as document classification by means of embedding entire documents instead of words.

This exercise is available in the Ch04-Advance-Word-Vectors folder (ch4_document_classification.ipynb).

Dataset

For this task, we will use an already-organized set of text files. These are news articles from the BBC. Every document in this collection belongs to one of the following categories: Business, Entertainment, Politics, Sports, or Technology.

Here are a couple of brief snippets from the actual data:

Business

Japan narrowly escapes recession

Japan’s economy teetered on the brink of a technical recession in the three months to September, figures show.

Revised figures indicated growth of just 0.1% - and a similar-sized contraction in the previous quarter. On an annual basis, the data suggests annual growth of just 0.2%,...

First, we will download the data and load the data into memory. We will use the same download_data() function to download the data. Then we will slightly modify the read_data() function to not only return a list of articles, where each article is a string, but also to return a list of filenames, where each filename corresponds to the file the article was stored in. The filenames will subsequently help us to create the labels for our classification model.

import os

def read_data(data_dir):

    # This will contain the full list of stories
    news_stories = []
    filenames = []
    print("Reading files")

    i = 0 # Just used for printing progress
    for root, dirs, files in os.walk(data_dir):

        for fi, f in enumerate(files):

            # We don't read the README file
            if 'README' in f:
                continue

            # Printing progress
            i += 1
            print("."*i, f, end='\r')

            # Open the file
            with open(os.path.join(root, f), encoding='latin-1') as text_file:
                story = []
                # Read all the lines
                for row in text_file:
                    story.append(row.strip())

                # Create a single string with all the rows in the doc
                story = ' '.join(story)
                # Add that to the list
                news_stories.append(story)
                filenames.append(os.path.join(root, f))

        print('', end='\r')

    print("\nDetected {} stories".format(len(news_stories)))
    return news_stories, filenames

news_stories, filenames = read_data(os.path.join('data', 'bbc'))

We will then create and fit a tokenizer on the data, as we have done before.

from tensorflow.keras.preprocessing.text import Tokenizer

n_vocab = 15000 + 1
tokenizer = Tokenizer(
    num_words=n_vocab - 1,
    filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
    lower=True, split=' ', oov_token=''
)
tokenizer.fit_on_texts(news_stories)

As the next step, we will create labels. Since we are training a classification model, we need both inputs and labels. Our inputs will be document embeddings (we will see how to compute them soon), and the targets will be a label ID between 0 and 4. Each class we mentioned above (e.g. business, tech, etc.) will be mapped to a separate label ID. Since the filename includes the category as a folder, we can leverage the filename to generate the label ID.

We will use the pandas library to create the labels. First we will convert the list of filenames to a pandas Series object using:

labels_ser = pd.Series(filenames, index=filenames)

An example entry in this series could look like data/bbc/tech/127.txt. Next, we will split each item on the path separator, which will return a list such as ['data', 'bbc', 'tech', '127.txt']. We will also set expand=True, which transforms our Series object into a DataFrame by turning each item in the list of tokens into a separate column. In other words, our pd.Series object will become an [N, 4]-sized pd.DataFrame with one token in each column, where N is the number of files:

labels_ser = labels_ser.str.split(os.path.sep, expand=True)

In the resulting data, we only care about the third column, which has the category of a given article (e.g. tech). Therefore, we will discard the rest of the data and only keep that column:

labels_ser = labels_ser.iloc[:, -2]

Finally, we will map the string label to an integer ID using the pandas map() function as follows:

labels_ser = labels_ser.map({'business': 0, 'entertainment': 1, 'politics': 2, 'sport': 3, 'tech': 4})

This will result in something like:

data/bbc/tech/272.txt    4
data/bbc/tech/127.txt    4
data/bbc/tech/370.txt    4
data/bbc/tech/329.txt    4
data/bbc/tech/240.txt    4
Name: 2, dtype: int64

What we did here can be written as just one line by chaining the sequence of commands to a single line:

labels_ser = pd.Series(filenames, index=filenames).str.split(os.path.sep, expand=True).iloc[:, -2].map(
    {'business': 0, 'entertainment': 1, 'politics': 2, 'sport': 3,
    'tech': 4}
)

With that, we move on to the next important step, i.e. splitting the data into train/test subsets. When training a supervised model, we generally need three datasets:

  • A training set – This is the dataset the model will be trained on.
  • A validation set – This will be used during the training to monitor model performance (e.g. signs of overfitting).
  • A testing set – This will be not exposed to the model at any time during the model training. It will only be used after the model training to evaluate the model on unseen data.

In this exercise, we will only use the training set and the testing set. This will help us to keep our conversation more focused on embeddings and keep the discussion about the downstream classification model simple. Here we will use 67% of the data as training data and use 33% of data as testing data. Data will be split randomly:

from sklearn.model_selection import train_test_split
train_labels, test_labels = train_test_split(labels_ser, test_size=0.33)

Now we have a training dataset to train the model and a test dataset to test it on unseen data. We will now see how we can generate document embeddings from token or word embeddings.

Generating document embeddings

Let’s first remind ourselves how we stored embeddings for skip-gram, CBOW, and GloVe algorithms. Figure 4.4 depicts how these look in a pd.DataFrame object.

Figure 4.4: A snapshot of the context embeddings of the skip-gram algorithm we saved to the disk. You can see below it says that it has 128 columns (i.e. the embedding size)

ELMo embeddings are an exception to this. Since ELMo generates contextualized representations for all tokens in a sequence, we have stored the mean embedding vectors resulting from averaging all the generated vectors:

Figure 4.5: A snapshot of ELMo vectors. ELMo vectors have 1024 elements
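The chapter notebook takes care of generating and saving these ELMo document embeddings. A hedged sketch of how they could be produced, reusing the elmo_layer and format_text_for_elmo helpers from the previous section, might look like the following (generate_elmo_doc_embeddings is a hypothetical helper name, and batching keeps memory usage in check):

import numpy as np
import pandas as pd

def generate_elmo_doc_embeddings(texts, filenames, batch_size=16):
    """ Computes one mean ("default") ELMo vector per document """
    doc_vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start: start+batch_size]
        # Optionally pass max_len to format_text_for_elmo to cap very long articles
        elmo_inputs = format_text_for_elmo(batch)
        # The "default" output is the mean over the contextualized token vectors
        batch_vectors = elmo_layer(elmo_inputs)["default"].numpy()
        doc_vectors.append(batch_vectors)
    return pd.DataFrame(np.concatenate(doc_vectors, axis=0), index=filenames)

# e.g. elmo_doc_embeddings = generate_elmo_doc_embeddings(news_stories, filenames)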

To compute the document embeddings from skip-gram, CBOW, and GloVe embeddings, let us write the following function:

def generate_document_embeddings(texts, filenames, tokenizer, embeddings):
    
    """ This function takes a sequence of tokens and compute the mean
    embedding vector from the word vectors of all the tokens in the
    document """
    
    doc_embedding_df = []
    # Contains document embeddings for all the articles
    assert isinstance(embeddings, pd.DataFrame), 'embeddings must be a pd.DataFrame'
    
    # This is a trick we use to quickly get the text preprocessed by the
    # tokenizer
    # We first convert text to a sequences, and then back to text, which
    # will give the preprocessed tokens
    sequences = tokenizer.texts_to_sequences(texts)    
    preprocessed_texts = tokenizer.sequences_to_texts(sequences)
    
    # For each text,
    for text in preprocessed_texts:
        # Make sure we had matches for the tokens in the embedding matrix
        assert embeddings.loc[text.split(' '), :].shape[0]>0
        # Compute mean of all the embeddings associated with words
        mean_embedding = embeddings.loc[text.split(' '), :].mean(axis=0)
        # Add that to list
        doc_embedding_df.append(mean_embedding)
        
    # Save the doc embeddings in a dataframe
    doc_embedding_df = pd.DataFrame(doc_embedding_df, index=filenames)
    
    return doc_embedding_df

The generate_document_embeddings() function takes the following arguments:

  • texts – A list of strings, where each string represents an article
  • filenames – A list of filenames corresponding to the articles in texts
  • tokenizer – A tokenizer that can process texts
  • embeddings – The embeddings as a pd.DataFrame, where each row represents a word vector, indexed by the corresponding token

The function first preprocesses the texts by converting the strings to sequences and then back to a list of strings. This helps us use the built-in preprocessing functionality of the tokenizer to clean the text. Next, each preprocessed string is split by the space character to return a list of tokens. Then we index all the positions in the embeddings matrix that correspond to the tokens in the text. Finally, the document’s mean vector is computed by taking the mean of all the selected embedding vectors.

With that, we can load the embeddings from different algorithms (skip-gram, CBOW, and GloVe), and compute the document embeddings. Here we will only show the process for the skip-gram algorithm. But you can easily extend it to the other algorithms, as they have similar inputs and outputs:

# Load the skip-gram embeddings context and target
skipgram_context_embeddings = pd.read_pickle(
    os.path.join('../Ch03-Word-Vectors/skipgram_embeddings',
    'context_embedding.pkl')
)
skipgram_target_embeddings = pd.read_pickle(
    os.path.join('../Ch03-Word-Vectors/skipgram_embeddings',
    'target_embedding.pkl')
)
# Compute the mean of context & target embeddings for better embeddings
skipgram_embeddings = (skipgram_context_embeddings + skipgram_target_embeddings)/2
# Generate the document embeddings with the average context target
# embeddings
skipgram_doc_embeddings = generate_document_embeddings(news_stories, filenames, tokenizer, skipgram_embeddings)

Now we will see how we can leverage the generated document embedding to train a classifier.

Classifying documents with document embeddings

We will be training a simple multi-class (or a multinomial) logistic regression classifier on this data. The logistic regression model will look as follows:

Figure 4.6: This diagram depicts the multinomial logistic regression model. The model takes in an embedding vector and outputs a probability distribution over different available classes

It’s a very simple model with a single layer, where the input is the embedding vector (e.g. a 128-element-long vector), and the output is a 5-node softmax layer that will output the likelihood of the input belonging to each category, as a probability distribution.
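For intuition, the model in Figure 4.6 amounts to a single dense layer with a softmax activation. A minimal Keras sketch of the same architecture is shown below, purely for illustration; in the exercise we will actually use scikit-learn, as described next:

import tensorflow as tf

# One Dense layer: a 128-element document embedding goes in, a probability
# distribution over the 5 categories comes out
softmax_classifier = tf.keras.Sequential([
    tf.keras.layers.Dense(5, activation='softmax', input_shape=(128,))
])
softmax_classifier.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
softmax_classifier.summary()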

We will be training several models, as opposed to a single run. This will give us a more consistent result on the performance of the model. To implement the model, we’ll be using a popular general-purpose machine learning library called scikit-learn (https://scikit-learn.org/stable/). In each run, a multi-class logistic regression classifier is created with the sklearn.linear_model.LogisticRegression object. Additionally, in each run:

  1. The model is trained on the training inputs and targets
  2. The model predicts the class (a value from 0 to 4) for each test input, where the class of an input is the one that has the maximum probability from all classes
  3. The model computes the test accuracy using the predicted classes and true classes of the test set

The code looks like the following:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def get_classification_accuracy(doc_embeddings, train_labels, test_labels, n_trials):
    """ Train a multinomial logistic regression model for several trials
    and measure test accuracy """

    accuracies = [] # Store accuracies across trials

    # For each trial
    for trial in range(n_trials):
        # Create a multinomial logistic regression classifier
        lr_classifier = LogisticRegression(
            multi_class='multinomial', max_iter=500
        )

        # Fit the model on training data
        lr_classifier.fit(
            doc_embeddings.loc[train_labels.index], train_labels
        )

        # Get the predictions for test data
        predictions = lr_classifier.predict(
            doc_embeddings.loc[test_labels.index]
        )

        # Compute accuracy
        accuracies.append(accuracy_score(predictions, test_labels))

    return accuracies
# Get classification accuracy for skip-gram models
skipgram_accuracies = get_classification_accuracy(
    skipgram_doc_embeddings, train_labels, test_labels, n_trials=5
)
print("Skip-gram accuracies: {}".format(skipgram_accuracies))

By setting multi_class='multinomial', we are making sure it’s a multi-class logistic regression model (or a softmax classifier). This will output:

Skip-gram accuracies: [0.882…, 0.882…, 0.881…, 0.882…, 0.884…]

When you follow the procedure for all the skip-gram, CBOW, GloVe, and ELMo algorithms, you will see a result similar to the following. This is a box plot diagram. However, as performance is quite similar between trials, you won’t see much variation present in the diagram:

Figure 4.7: Box plot interpreting performance on document classification for different models. We can see that ELMo is a clear-cut winner, where GloVe performs the worst

We can see that skip-gram achieves around 88% accuracy, followed closely by CBOW, which achieves on-par performance. Surprisingly, GloVe falls far below skip-gram and CBOW, at around 66% accuracy.

This could be pointing to a limitation of the GloVe loss function. Unlike skip-gram and CBOW, which consider both positive (observed) and negative (unobserved) target-context pairs, GloVe only focuses on observed pairs.

This could be hurting GloVe’s ability to generate effective representations of words. Finally, ELMo performs the best, at around 98% accuracy. But it is important to keep in mind that ELMo has been trained on a much larger dataset than the BBC dataset, so it is not fair to compare ELMo with the other models on this number alone.

In this section, you learned how word embeddings can be extended to document embeddings and how these can be used in a downstream classifier model to classify documents. First, we computed word embeddings using a selected algorithm (e.g. skip-gram, CBOW, or GloVe). Then we created document embeddings by averaging the word embeddings of all the words found in a document. This was the case for the skip-gram, CBOW, and GloVe algorithms. In the case of the ELMo algorithm, we were able to infer document embeddings straight from the model. Later we used these document embeddings to classify BBC news articles that fall into the categories: entertainment, tech, politics, business, and sport.

Summary

In this chapter, we discussed GloVe—another word embedding learning technique. GloVe takes the current Word2vec algorithms a step further by incorporating global statistics into the optimization, thus increasing the performance.

Next, we learned about a much more advanced algorithm known as ELMo (which stands for Embeddings from Language Models). ELMo provides contextualized representations of words by looking at a word within a sentence or a phrase, not by itself.

Finally, we discussed a real-world application of using word embeddings—document classification. We showed that word embeddings are very powerful and allow us to classify related documents with a simple multi-class logistic regression model reasonably well. ELMo performed the best out of skip-gram, CBOW, and GloVe, due to the vast amount of data it has been trained on.

In the next chapter, we will move on to discussing a different family of deep networks that are more powerful in exploiting spatial information present in data, known as Convolutional Neural Networks (CNNs).

Precisely, we will see how CNNs can be used to exploit the spatial structure of sentences to classify them into different classes.

To access the code files for this book, visit our GitHub page at: https://packt.link/nlpgithub

Join our Discord community to meet like-minded people and learn alongside more than 1000 members at: https://packt.link/nlp
