Model objective – simplifying the softmax

Word2vec models aim to predict a single word out of a potentially very large vocabulary. Neural networks often implement the corresponding multiclass objective with the softmax function, which maps any number of real values to an equal number of probabilities. Here, h refers to the embedding, v to the input vectors, and c is the context of word w:

$$
p(w \mid c) = \frac{\exp(h^\top v_w)}{\sum_{w_i \in V} \exp(h^\top v_{w_i})}
$$

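To make the cost of the denominator concrete, the following is a minimal NumPy sketch of the full softmax for one context; the sizes and the names context_embedding and input_vectors are illustrative assumptions for this example, not part of any word2vec library.

```python
import numpy as np

vocab_size, dim = 50_000, 100                 # illustrative sizes
rng = np.random.default_rng(0)

input_vectors = rng.normal(scale=0.1, size=(vocab_size, dim))  # one vector v per vocabulary word
context_embedding = rng.normal(scale=0.1, size=dim)            # h: embedding of the context

# The denominator needs a dot product with *every* word vector: O(vocab_size * dim)
scores = input_vectors @ context_embedding
scores -= scores.max()                        # numerical stability
probs = np.exp(scores) / np.exp(scores).sum()

target_word_id = 123
print(probs[target_word_id])                  # p(w | c) under the full softmax
```

The matrix-vector product over the entire vocabulary in the second-to-last step is exactly the cost that the approaches discussed next avoid.
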
However, the complexity of the softmax scales with the number of classes because the denominator requires computing the dot product for all words in the vocabulary to normalize the probabilities. Word2vec models gain efficiency by using a simplified version of the softmax or sampling-based approximations (see the references for details):

  • The hierarchical softmax organizes the vocabulary as a binary tree with words as leaf nodes. The unique path to each leaf can be used to compute the word's probability, so that evaluating it requires on the order of log2(V) operations rather than V.
  • Noise-contrastive estimation (NCE) samples out-of-context "noise words" and approximates the multiclass task by a binary classification problem. The NCE gradient approaches the softmax gradient as the number of samples increases, and as few as 25 samples can yield convergence comparable to the full softmax while training about 45 times faster.
  • Negative sampling (NEG) simplifies the NCE objective further: it keeps the sampled noise words but drops their probabilities under the noise distribution, directly maximizing the probability of the target word against the sampled negatives (see the sketch after this list). Hence, NEG optimizes the semantic quality of the embedding vectors (similar vectors for similar usage) rather than accuracy on a test set. It may, however, produce poorer representations for infrequent words than the hierarchical softmax objective.
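
As an illustration of the binary classification view that NCE and NEG share, here is a minimal NumPy sketch of the skip-gram negative-sampling loss for one (context, target) pair. All names (neg_sampling_loss, output_vectors, negative_ids, and so on) are assumptions made for this example and do not refer to a specific library's API.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(h, output_vectors, target_id, negative_ids):
    """Skip-gram negative-sampling loss for one (context, target) pair.

    h               : embedding of the context (hidden representation)
    output_vectors  : matrix of output word vectors, one row per vocabulary word
    target_id       : index of the true target word
    negative_ids    : indices of the k sampled "noise" words
    """
    # Positive term: push the true target word's score up
    pos_score = output_vectors[target_id] @ h
    loss = -np.log(sigmoid(pos_score))

    # Negative terms: push the sampled noise words' scores down
    neg_scores = output_vectors[negative_ids] @ h
    loss -= np.log(sigmoid(-neg_scores)).sum()
    return loss

# Toy usage: 10,000-word vocabulary, 100-dimensional vectors, 5 negative samples
rng = np.random.default_rng(1)
output_vectors = rng.normal(scale=0.1, size=(10_000, 100))
h = rng.normal(scale=0.1, size=100)
negatives = rng.integers(0, 10_000, size=5)
print(neg_sampling_loss(h, output_vectors, target_id=42, negative_ids=negatives))
```

Only the target word and the handful of sampled negatives enter each update, which is what removes the dependence on the vocabulary size present in the denominator of the full softmax.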