Learning word embeddings with prediction

Word embeddings are calculated by using a neural network built specifically for the task. I'll give an overview of that network here. Once the word embeddings for some corpus are calculated, they can be easily reused for other applications, which makes this technique a candidate for transfer learning, similar to the techniques we looked at in Chapter 8, Transfer Learning with Pretrained CNNs.

When we're done training this word-embedding network, the weights of the single hidden layer of our network will become a lookup table for our word embeddings. For each word in our vocabulary, we will have learned a vector for that word.

This hidden layer will contain fewer neurons than the input space, forcing the network to learn a compressed form of the information present in the input layer. This architecture very much resembles an auto-encoder; however, the technique is wrapped around a task that helps the network learn the semantic values of each word in a vector space.

The task we will use to train our embedding network is predicting the probability of some target word appearing within a window of distance from the input word. For example, if koala were our input word and marsupial were our target word, we'd want to know the probability of these two words being near each other.
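To make the training task concrete, here's a minimal sketch of how (input, target) pairs might be generated from a sentence using a context window. The sentence and the window size of 2 are illustrative assumptions, not values from the text:

```python
# Generate (input, target) word pairs from a sentence using a context window.
# The sentence and window_size are assumptions chosen for illustration.
sentence = "the koala is a marsupial native to australia".split()
window_size = 2

pairs = []
for i, input_word in enumerate(sentence):
    lo = max(0, i - window_size)
    hi = min(len(sentence), i + window_size + 1)
    for j in range(lo, hi):
        if j != i:  # skip the input word itself
            pairs.append((input_word, sentence[j]))

print(pairs[:4])
# [('the', 'koala'), ('the', 'is'), ('koala', 'the'), ('koala', 'is')]
```

Each pair becomes one training example: the network sees the input word and is asked to assign high probability to the target word.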

The input layer for this task will be a one-hot encoded vector of every word in the vocabulary. The output layer will be a softmax layer of the same size, as shown in the following figure:

This network results in a hidden layer with a weight matrix of shape [vocabulary x neurons]. For example, if we had 20,000 unique words in our corpus and 300 neurons in our hidden layer, our hidden layer weight matrix would be 20,000 x 300. Once we save these weights to disk, we have a 300-element vector that we can use to represent each word. These vectors can then be used to represent words when training other models.
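The following is a minimal Keras sketch of this architecture, using the sizes from the example above (20,000 words, 300 neurons); the variable names and training settings are my own assumptions, not the author's implementation:

```python
# A one-hot input over the vocabulary, a 300-unit hidden layer, and a
# softmax output over the same vocabulary. Sizes match the example above.
from keras.models import Sequential
from keras.layers import Dense

vocab_size = 20000
embedding_dim = 300

model = Sequential()
model.add(Dense(embedding_dim, input_dim=vocab_size, use_bias=False))  # hidden layer
model.add(Dense(vocab_size, activation='softmax'))                     # softmax output
model.compile(optimizer='adam', loss='categorical_crossentropy')

# After training, the hidden layer's weight matrix of shape
# (vocab_size, embedding_dim) becomes our lookup table of word vectors.
embeddings = model.layers[0].get_weights()[0]
print(embeddings.shape)  # (20000, 300)
```

Row i of the extracted weight matrix is the learned vector for word i in the vocabulary, which is exactly the lookup table described above.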

There is most certainly more to training word-embedding networks than this, and I'm intentionally oversimplifying in keeping with the quick reference style.

If you'd like to learn more, I recommend starting by reading Distributed Representations of Words and Phrases and their Compositionality by Mikolov et al. (https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf). This paper describes a popular way to create word embeddings called word2vec.