Analysing The Model

Now that we have trained our word2vec model, let's explore what it was able to learn. We will use most_similar() to explore the relations between various words. In the following example, you can see that the model has learned that the word earth is related to crust, globe, and other words. It is interesting that we provided only the raw data, and the model was able to learn all of these relations and concepts automatically!

model2vec.most_similar("earth")

[(u'crust', 0.6946468353271484),
(u'globe', 0.6748907566070557),
(u'inequalities', 0.6181437969207764),
(u'planet', 0.6092090606689453),
(u'orbit', 0.6079996824264526),
(u'laboring', 0.6058655977249146),
(u'sun', 0.5901342630386353),
(u'reduce', 0.5893668532371521),
(u'moon', 0.5724939107894897),
(u'eccentricity', 0.5709577798843384)]
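
If you are running a recent version of gensim (4.x), note that the similarity methods have moved from the model itself to its KeyedVectors object. A minimal, version-dependent sketch of the equivalent query:

# A hedged sketch: in gensim 4.x, Word2Vec.most_similar() was removed and the
# query now lives on the KeyedVectors object exposed as model2vec.wv.
print(model2vec.wv.most_similar("earth", topn=10))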

Let's try to find words related to human and see what the model has learned.

model2vec.most_similar("human")

[(u'art', 0.6744576692581177),
(u'race', 0.6348963975906372),
(u'industry', 0.6203593611717224),
(u'man', 0.6148483753204346),
(u'population', 0.6090731620788574),
(u'mummies', 0.5895125865936279),
(u'gods', 0.5859177112579346),
(u'domesticated', 0.5857442021369934),
(u'lives', 0.5848811864852905),
(u'figures', 0.5809590816497803)]

Critical thinking tip: It's interesting to observe that art, race, and industry are among the most similar outputs. Remember that these similarities are based on the corpus of text that we used for training and should be interpreted in that context. Generalization, and its unwanted sidekick, bias, can come into play when similarities learned from an outdated or dissimilar training corpus are applied to a new set of language data or cultural norms.

Even when we try to derive an analogy by using two positive vectors, earth and moon, and a negative vector, orbit, the model predicts the word sun. This makes sense because there is a semantic relation between the moon orbiting the earth and the earth orbiting the sun.

model2vec.most_similar_cosmul(positive=['earth','moon'], negative=['orbit'])

(u'sun', 0.8161555624008179)
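
As a cross-check, the same analogy can be run with the standard additive most_similar() method (3CosAdd) instead of the multiplicative 3CosMul variant used above. This is only a sketch; the words and scores returned depend on the trained model:

# A hedged cross-check, assuming a gensim 4.x model: the additive analogy
# method (3CosAdd) rather than most_similar_cosmul() (3CosMul).
print(model2vec.wv.most_similar(positive=['earth', 'moon'],
                                negative=['orbit'], topn=1))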

So, we have seen that, using a word2vec model, one can derive valuable information from raw, unlabeled data. This process is crucial for learning the grammar of a language and the semantic correlations between words.

Later, we will learn how to use these word2vec features as input to a classification model, which helps boost the model's accuracy and performance.
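
As a preview, one common way to turn word vectors into classifier features is to average the vectors of a document's in-vocabulary words. This is a sketch under our own assumptions (gensim 4.x API, a hypothetical document_vector() helper), not necessarily the approach used later in the book:

import numpy as np

def document_vector(model, tokens):
    # Hypothetical helper (not from the book): average the vectors of the
    # tokens that appear in the model's vocabulary.
    in_vocab = [model.wv[t] for t in tokens if t in model.wv]
    if not in_vocab:
        return np.zeros(model.wv.vector_size)
    return np.mean(in_vocab, axis=0)

features = document_vector(model2vec, ["the", "moon", "orbits", "the", "earth"])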