Loading pretrained word vectors

As I have just mentioned, I'm going to use a Keras embedding layer. For the second version of the model, we will initialize the weights of the embedding layer with the GloVe word vectors we covered previously in the chapter. To do so, we will need to load those vectors from disk and arrange them into a 2D matrix that the layer can use as its weights. We will cover that operation here.

When you download the GloVe vectors, you'll see several text files in the directory where you unzipped the download. Each of these files corresponds to a different vector dimensionality; however, in all cases, the vectors were trained on the same common corpus of 6 billion tokens (hence the 6B in the filenames). I will demonstrate using the glove.6B.100d.txt file. Inside glove.6B.100d.txt, every line is a single word vector. On that line, you will find the word followed by the 100-dimensional vector associated with it. The word and the elements of the vector are stored as text and separated by spaces.
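For illustration, a single line in the file has roughly the following shape (the numbers shown here are made up rather than actual GloVe values):

dog 0.11008 -0.38781 -0.57615 -0.27714 ... 0.30104

There are 100 numbers after the word, one per dimension of the vector.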

To get this data into a usable state, we will start by loading it from disk. We will then split each line into its first component, the word, and the remaining components, which are the elements of the vector. Once we're done with that, we will convert those elements into a NumPy array. Lastly, we will store the array as a value in a dictionary, using the word as the key for that value. The following code illustrates this process:

import os
import numpy as np

def load_word_vectors(glove_dir):
    print('Indexing word vectors.')

    embeddings_index = {}
    f = open(os.path.join(glove_dir, 'glove.6B.100d.txt'),
             encoding='utf8')
    for line in f:
        # each line is the word followed by its vector elements
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
    f.close()

    print('Found %s word vectors.' % len(embeddings_index))
    return embeddings_index
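As a quick sanity check, here's a minimal sketch of how you might call this function and inspect one of the vectors; the directory path and the sample word are placeholders you would swap for your own setup:

embeddings_index = load_word_vectors('./glove.6B')
print(embeddings_index['dog'].shape)  # (100,), one float per dimension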

Once we run this, we will have a dictionary called embeddings_index that contains the GloVe words as keys and their vectors as values. The Keras embedding layer needs its weights as a 2D matrix, however, not a dictionary, so we will need to manipulate our dictionary into a matrix, using the following code:

def embedding_index_to_matrix(embeddings_index, vocab_size, embedding_dim, word_index):
    print('Preparing embedding matrix.')

    # prepare embedding matrix; row 0 stays all-zeros because the Keras
    # Tokenizer reserves index 0 and numbers words starting from 1
    num_words = min(vocab_size, len(word_index) + 1)
    embedding_matrix = np.zeros((num_words, embedding_dim))
    for word, i in word_index.items():
        if i >= num_words:
            continue
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            # words not found in the embedding index will be all-zeros
            embedding_matrix[i] = embedding_vector
    return embedding_matrix
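To make the hand-off between these two functions concrete, here's a minimal sketch of how they might be chained together. The word_index here is assumed to come from a Keras Tokenizer fit on your own training text, and the vocab_size of 20,000 and the texts variable are example placeholders rather than values from this chapter:

from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=20000)
tokenizer.fit_on_texts(texts)  # texts would be your own training documents

embeddings_index = load_word_vectors('./glove.6B')
embedding_matrix = embedding_index_to_matrix(
    embeddings_index=embeddings_index,
    vocab_size=20000,
    embedding_dim=100,
    word_index=tokenizer.word_index)
print(embedding_matrix.shape)  # (num_words, 100)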

I know all this munging might seem terrible, and it is, but the authors of GloVe are quite well-intentioned in how they distribute these word vectors. They hope to make the vectors consumable by anyone, using any programming language, and to that end, the plain-text format is much appreciated. Besides, if you're a practicing data scientist, you will be used to this!

Now that we have our vectors present as a 2D matrix, we're ready to use them in a Keras embedding layer. Our prep work is done, so now let's build the network.
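As a preview of that wiring, a matrix like this is typically handed to the Embedding layer as its initial weights, roughly as in the sketch below; the sequence length and the decision to freeze the weights are assumptions for illustration, not the final network from this chapter:

from keras.layers import Embedding

embedding_layer = Embedding(input_dim=embedding_matrix.shape[0],
                            output_dim=100,
                            weights=[embedding_matrix],
                            input_length=140,  # assumed maximum sequence length
                            trainable=False)  # keep the pretrained GloVe weights frozen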
