Defining Our Word2vec Model

Now let's use gensim to define our word2vec model.  To begin, let's set a few hyperparameters for the model, such as the number of dimensions, which is how many latent features we want each word vector to learn.  Each dimension can end up capturing a concept such as gender, object type, or age.

Computational Linguistics Model Tip #1:  Increasing the number of dimensions can lead to better generalization, but it also adds computational complexity.  The right number is an empirical question for you to determine as an Applied AI Deep Learning Engineer!

Computational Linguistics Model Tip #2: Pay attention to context_size.  It matters because it sets the maximum distance between the current word and the predicted word within a sentence, which determines how much surrounding context the model can use to learn the relationships between a word and its neighbors, as illustrated in the sketch below.
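To make the window idea concrete, here is a small illustration that is independent of gensim; the toy sentence and the window value of 2 are purely for demonstration.  It simply enumerates which (center, context) pairs a skip-gram model would see:

# Toy example: list the (center, context) pairs produced by a context window of 2.
sentence = ["the", "king", "rules", "the", "realm"]
window = 2

for i, center in enumerate(sentence):
    # Context words sit at most `window` positions away from the center word.
    lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
    context = [sentence[j] for j in range(lo, hi) if j != i]
    print(center, "->", context)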

Using gensim, we will now define our model, passing in all of these hyperparameters.

import multiprocessing

# The w2v alias used below refers to gensim's word2vec module.
import gensim.models.word2vec as w2v

# Dimensionality of the word vectors.
num_features = 300

# Minimum word count threshold; rarer words are dropped from the vocabulary.
min_word_count = 3

# Number of threads to run in parallel -- more workers, faster training.
num_workers = multiprocessing.cpu_count()

# Context window length.
context_size = 7

# Downsample setting for frequent words. 0 - 1e-5 is a good range for this.
downsampling = 1e-3

# Seed for the random number generator, for reproducibility.
seed = 1

model2vec = w2v.Word2Vec(
    sg=1,                     # 1 = skip-gram architecture, 0 = CBOW
    seed=seed,
    workers=num_workers,
    size=num_features,        # named vector_size in gensim 4.x and later
    min_count=min_word_count,
    window=context_size,
    sample=downsampling
)

model2vec.build_vocab(sentences)
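With the vocabulary built, the model can be trained on the same corpus and then queried.  The following is only a minimal sketch: it assumes the same sentences iterable, the epoch count of 10 is illustrative, and "king" stands in for any word that actually appears in your corpus.

# Train the skip-gram model; total_examples and epochs are required by newer gensim versions.
model2vec.train(
    sentences,
    total_examples=model2vec.corpus_count,
    epochs=10  # illustrative value; tune for your own corpus
)

# Query the learned embeddings; replace "king" with a word from your vocabulary.
print(model2vec.wv.most_similar("king", topn=5))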