Understanding the doc2vec API

model = Doc2Vec(dm=0, vector_size=100, negative=5, hs=0, min_count=2, iter=5, workers=cores)

Let's quickly understand the flags we have used in the preceding code:

  • dm ({1,0}, optional): This defines the training algorithm; if dm=1, distributed memory (PV-DM) is used; otherwise, a distributed bag of words (PV-DBOW) is employed
  • vector_size (int, optional): This is the dimensionality of the feature vectors
  • window (int, optional): This represents the maximum distance between the current and predicted word within a sentence
  • negative (int, optional): If > 0, negative sampling is used; the value specifies how many noise words to draw (usually between 5 and 20). If set to 0, no negative sampling is used
  • hs ({1,0}, optional): If 1, hierarchical softmax is used for model training; if 0 and negative is non-zero, negative sampling is used
  • iter (int, optional): This represents the number of iterations (epochs) over the corpus (renamed to epochs in Gensim 4)

The preceding list has been taken directly from the Gensim documentation. With that in mind, we'll now move on and explain some of the new terms introduced here, including negative sampling and hierarchical softmax.
