Model training

The gensim.models.Word2Vec class implements the skip-gram (SG) and continuous bag-of-words (CBOW) architectures introduced previously. The Word2vec notebook contains additional implementation details.

To facilitate memory-efficient text ingestion, the LineSentence class streams sentences from the provided text file rather than loading the whole corpus into memory. It expects one sentence per line, with tokens separated by whitespace:

from pathlib import Path
from gensim.models.word2vec import LineSentence
sentence_path = Path('data', 'ngrams', 'ngrams_2.txt')
sentences = LineSentence(sentence_path)
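
To make the streaming idea concrete, the following is a minimal sketch of a restartable sentence iterator in plain Python. It is not gensim's implementation: the class name StreamingSentences, the demo function, and the toy corpus are hypothetical, and details such as gensim's maximum sentence length are left out. The point it illustrates is that the file is re-opened on every pass, so the corpus can be iterated several times (once per epoch) without ever being held in memory.

```python
import tempfile
from pathlib import Path


class StreamingSentences:
    """Hypothetical stand-in for LineSentence: one sentence per line,
    tokens separated by whitespace."""

    def __init__(self, path):
        self.path = Path(path)

    def __iter__(self):
        # Re-open the file on each pass so the corpus can be iterated repeatedly
        with self.path.open(encoding='utf-8') as f:
            for line in f:
                tokens = line.split()
                if tokens:  # skip blank lines
                    yield tokens


def demo():
    """Write a toy corpus to a temporary file and stream it twice."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / 'toy_corpus.txt'
        path.write_text('the quick brown fox\n\njumps over the lazy dog\n',
                        encoding='utf-8')
        sentences = StreamingSentences(path)
        first_pass = list(sentences)
        second_pass = list(sentences)
    return first_pass, second_pass
```

A plain generator function would be exhausted after the first pass; wrapping the logic in a class with __iter__ is what allows repeated iteration over the corpus.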

The Word2Vec class offers the configuration options previously introduced:

from gensim.models import Word2Vec
model = Word2Vec(sentences,
                 sg=1,             # 1=skip-gram; otherwise CBOW
                 hs=0,             # hierarchical softmax if 1, negative sampling if 0
                 size=300,         # vector dimensionality (renamed vector_size in gensim 4.x)
                 window=3,         # max distance between target and context word
                 min_count=50,     # ignore words with lower frequency
                 negative=10,      # number of noise words for negative sampling
                 workers=8,        # number of worker threads
                 iter=1,           # number of epochs, i.e. passes over the corpus (renamed epochs in gensim 4.x)
                 alpha=0.025,      # initial learning rate
                 min_alpha=0.0001  # final learning rate
                 )
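
The alpha and min_alpha options define a learning rate that drops linearly from the initial to the final value as training progresses. The following sketch shows that schedule as a function of training progress. The function name linear_learning_rate is hypothetical, and the sketch ignores how and when gensim updates the rate internally, so treat it as an approximation of the idea rather than the library's code.

```python
def linear_learning_rate(progress, alpha=0.025, min_alpha=0.0001):
    """Learning rate after a fraction `progress` (0.0 to 1.0) of training,
    assuming a linear decay from alpha to min_alpha."""
    # Clamp progress to [0, 1] so the rate never leaves [min_alpha, alpha]
    progress = min(max(progress, 0.0), 1.0)
    return alpha - (alpha - min_alpha) * progress


if __name__ == '__main__':
    for progress in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f'{progress:4.0%} of training: alpha = {linear_learning_rate(progress):.5f}')
```

With the defaults above, the rate starts at 0.025 and ends at 0.0001, passing through roughly 0.01255 at the halfway point.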

The notebook shows how to persist and reload models to continue training, or how to store the embedding vectors separately, for example, for use in ML models.
