GloVe model

The GloVe model, short for Global Vectors, is an unsupervised learning model that can be used to obtain dense word vectors similar to Word2vec. However, the technique is different: training is performed on an aggregated global word-word co-occurrence matrix, giving us a vector space with meaningful substructures. The method was published in the paper GloVe: Global Vectors for Word Representation by Pennington, Socher, and Manning (https://www.aclweb.org/anthology/D14-1162). We have talked about count-based matrix factorization methods, such as latent semantic analysis (LSA), and predictive methods, such as Word2vec. The paper claims that both families suffer from significant drawbacks. Methods like LSA efficiently leverage the statistical information of the corpus, but they do relatively poorly on the word analogy task, which is how we find semantically similar words. Methods like skip-gram may do better on the analogy task, but they make poor use of the global statistics of the corpus.

The basic methodology of the GloVe model is to first create a huge word-context co-occurrence matrix consisting of (word, context) pairs, such that each element in this matrix represents how often a word occurs within the context (which can be a sequence of words). This word-context matrix, WC, is very similar to the term-document matrix popularly used in text analysis for various tasks. Matrix factorization is used to represent WC as the product of two matrices, the Word-Feature (WF) matrix and the Feature-Context (FC) matrix, so that WC = WF x FC. WF and FC are initialized with random weights, and we multiply them to get WC' (an approximation of WC) and measure how close it is to WC. We repeat this multiple times, using Stochastic Gradient Descent (SGD) to minimize the reconstruction error. Finally, the WF matrix gives us the word embeddings, where F can be preset to a specific number of dimensions.

A very important point to remember is that the Word2vec and GloVe models are very similar in how they work. Both of them aim to build a vector space where the position of each word is influenced by its neighboring words, based on their context and semantics. The difference is that Word2vec starts with local, individual examples of word co-occurrence pairs, whereas GloVe starts with global, aggregated co-occurrence statistics across all words in the corpus.
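
To make the factorization idea concrete, the following is a minimal, self-contained sketch that factorizes a small toy co-occurrence matrix WC into WF x FC by reducing the squared reconstruction error. It is not the actual GloVe training code (which optimizes a weighted least-squares objective), and for brevity it uses plain full-batch gradient descent instead of SGD; all sizes and values here are made up for illustration:

import numpy as np

np.random.seed(42)
vocab_size, context_size, n_features = 6, 6, 3

# toy word-context co-occurrence counts
WC = np.random.randint(0, 10, size=(vocab_size, context_size)).astype(float)

WF = np.random.rand(vocab_size, n_features)    # Word-Feature matrix
FC = np.random.rand(n_features, context_size)  # Feature-Context matrix

learning_rate = 0.001
for step in range(5000):
    error = WF @ FC - WC                # WC' - WC
    # gradients of the squared reconstruction error
    grad_WF = error @ FC.T
    grad_FC = WF.T @ error
    WF -= learning_rate * grad_WF
    FC -= learning_rate * grad_FC

print("final reconstruction error:", np.linalg.norm(WF @ FC - WC))
print("word embeddings (rows of WF):")
print(WF)

After training, each row of WF is a dense n_features-dimensional vector for the corresponding word, which is exactly the role the GloVe word vectors play.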

In the following sections, we will be using both Word2vec and GloVe for various classification problems. We have developed some utility code to read GloVe or Word2vec vectors from a file and return an embedding matrix. The expected file format is the standard GloVe file format. The following is an example of five-dimensional embeddings for a few words, with each word followed by its vector, all space separated:

  • Flick 7.068106 -5.410074 1.430083 -4.482612 -1.079401
  • Heart -1.584336 4.421625 -12.552878 4.940779 -5.281123
  • Side 0.461367 4.773087 -0.176744 8.251079 -11.168787
  • Horrible 7.324110 -9.026680 -0.616853 -4.993752 -4.057131
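
Before building the embedding matrix, the GloVe vectors need to be loaded into a dictionary that maps each word to its vector; this is the embeddings_index used by the function below. The following is a minimal sketch of how a file in the preceding format could be parsed (glove.6B.50d.txt is just a placeholder name for whichever GloVe file you use):

import numpy as np

glove_file = 'glove.6B.50d.txt'  # placeholder path to a GloVe-format file

embeddings_index = {}
with open(glove_file, encoding='utf-8') as f:
    for line in f:
        values = line.rstrip().split(' ')
        word = values[0]                                  # first token is the word
        vector = np.asarray(values[1:], dtype='float32')  # the rest is its vector
        embeddings_index[word] = vector

print('Loaded {} word vectors.'.format(len(embeddings_index)))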

The following is the main function that reads the GloVe vectors, given a vocabulary as a Python dictionary whose keys are the words in the vocabulary. This is needed to load only the embeddings required for the words that occur in our training vocabulary. Words in the vocabulary that are not present in the GloVe embeddings are initialized with the mean vector of all embeddings plus some white noise. Rows 0 and 1 are reserved for the empty (padding) token and for out-of-vocabulary (OOV) words, respectively. OOV words are words that are not in the vocabulary but do appear in the corpus, such as very infrequent words or some filtered-out noise. The embedding for the empty token is a zero vector, and the embedding for OOV is the mean vector of all the remaining embeddings:

def _init_embedding_matrix(self, word_index_dict,
                           oov_words_file='OOV-Words.txt'):
    # reserve index 0 for the empty (padding) token and index 1 for OOV words
    self.embedding_matrix = np.zeros((len(word_index_dict) + 2,
                                      self.EMBEDDING_DIM))
    not_found_words = 0
    missing_word_index = []

    with open(oov_words_file, 'w') as f:
        for word, i in word_index_dict.items():
            embedding_vector = self.embeddings_index.get(word)
            if embedding_vector is not None:
                # words found in the embedding index get their GloVe vector;
                # words not found remain all-zeros for now
                self.embedding_matrix[i] = embedding_vector
            else:
                not_found_words += 1
                f.write(word + ',' + str(i) + ' ')
                missing_word_index.append(i)

    # OOV embedding (row 1) is the mean of all loaded vectors
    self.embedding_matrix[1] = np.mean(self.embedding_matrix, axis=0)

    # missing words get the mean vector plus some white noise
    for indx in missing_word_index:
        self.embedding_matrix[indx] = (np.random.rand(self.EMBEDDING_DIM) +
                                       self.embedding_matrix[1])

    print("words not found in embeddings: {}".format(not_found_words))

One more utility function is update_embeddings. This is required for transfer learning. We may want to update the embeddings learned by one model with the embeddings learned by another model:

def update_embeddings(self, word_index_dict, other_embedding,
                      other_word_index):
    num_updated = 0
    for word, i in other_word_index.items():
        if word_index_dict.get(word) is not None:
            # copy the embedding learned by the other model into this
            # model's embedding matrix at the corresponding row
            embedding_vector = other_embedding[i]
            this_vocab_word_indx = word_index_dict.get(word)
            self.embedding_matrix[this_vocab_word_indx] = embedding_vector
            num_updated += 1

    print('{} words are updated out of {}'.format(num_updated,
                                                  len(word_index_dict)))
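
As a hypothetical usage example, embeddings learned on a source corpus could be transferred into a model for a target corpus as follows; source_model, target_model, source_vocab_index, and target_vocab_index are placeholder names for two instances of the embedding wrapper class and their word-to-index dictionaries:

# copy the source model's learned vectors into the target model's matrix
# for every word the two vocabularies have in common
target_model.update_embeddings(
    word_index_dict=target_vocab_index,
    other_embedding=source_model.embedding_matrix,
    other_word_index=source_vocab_index)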