Model training

The gensim.models.Word2Vec class implements the skip-gram (SG) and CBOW architectures introduced previously. The Word2vec notebook contains additional implementation details.

To facilitate memory-efficient text ingestion, the LineSentence class creates an iterable that streams the sentences from the provided text file, one sentence per line, instead of loading the whole corpus into memory:

from pathlib import Path
from gensim.models.word2vec import LineSentence
sentence_path = Path('data', 'ngrams', 'ngrams_2.txt')
sentences = LineSentence(sentence_path)
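
To illustrate the idea behind this kind of streaming, here is a minimal pure-Python sketch. It is not gensim's implementation, and the class name StreamedSentences is hypothetical. It assumes a UTF-8 text file with one whitespace-tokenized sentence per line:

```python
from pathlib import Path


class StreamedSentences:
    """Re-iterable stream of tokenized sentences, one sentence per line."""

    def __init__(self, path):
        self.path = Path(path)

    def __iter__(self):
        # Reopen the file on every pass so the corpus can be read repeatedly
        # without ever holding more than one line in memory.
        with self.path.open(encoding='utf-8') as lines:
            for line in lines:
                tokens = line.split()
                if tokens:  # skip empty lines
                    yield tokens
```

The object can be iterated more than once, which matters because training reads the corpus several times: once to build the vocabulary and then once per epoch. A plain one-shot generator would be exhausted after the first pass.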

The Word2Vec class offers the configuration options previously introduced:

from gensim.models import Word2Vec

model = Word2Vec(sentences,
                 sg=1,             # 1 = skip-gram; otherwise CBOW
                 hs=0,             # 1 = hierarchical softmax; 0 = negative sampling
                 size=300,         # vector dimensionality (vector_size in gensim >= 4.0)
                 window=3,         # max. distance between target and context word
                 min_count=50,     # ignore words with lower frequency
                 negative=10,      # number of noise words for negative sampling
                 workers=8,        # number of worker threads
                 iter=1,           # number of epochs over the corpus (epochs in gensim >= 4.0)
                 alpha=0.025,      # initial learning rate
                 min_alpha=0.0001  # final learning rate
                 )
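
The alpha and min_alpha settings define a schedule: gensim lowers the learning rate linearly from the initial to the final value as training progresses. The following sketch shows that schedule in isolation. The function name linear_learning_rate and the progress argument (the fraction of training completed) are illustrative assumptions, not part of the gensim API:

```python
def linear_learning_rate(progress, alpha=0.025, min_alpha=0.0001):
    """Learning rate after a fraction `progress` (0.0 to 1.0) of training."""
    rate = alpha - (alpha - min_alpha) * progress
    # Never drop below the final learning rate.
    return max(min_alpha, rate)


for progress in (0.0, 0.5, 1.0):
    print(f'{progress:.0%} of training: {linear_learning_rate(progress):.5f}')
```

With a single epoch, as configured above, the entire decay happens within one pass over the corpus.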

The notebook shows how to persist and reload models to continue training, and how to store the embedding vectors separately, for example, for use in downstream ML models.
