Plotting The Word Cluster Using The t-SNE Algorithm

So after our analysis, we know that our word2vec model has learned some concepts from the provided corpus, but how do we visualize them? Because we created a 300-dimensional space to learn the features, it is practically impossible to visualize directly. To make it possible, we will use a dimensionality reduction algorithm called t-SNE, which is well known for reducing a high-dimensional space to a more humanly understandable 2- or 3-dimensional space.

"t-Distributed Stochastic Neighbor Embedding (t-SNE) (https://lvdmaaten.github.io/tsne/) is a (prize-winning) technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets. The technique can be implemented via Barnes-Hut approximations, allowing it to be applied on large real-world datasets. We applied it on data sets with up to 30 million examples."
-- Laurens van der Maaten

To implement this, we will use the sklearn package and set n_components=2, which means we want a 2-dimensional space as the output. Next, we will perform the transformation by feeding the word vectors into the t-SNE object.
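Before running t-SNE on the full vocabulary, it can help to try the call on a small input. The following is a minimal sketch, not the chapter's own code: the random matrix `toy_vectors` is only a stand-in for real word vectors, and the `perplexity=10` setting is an arbitrary assumption chosen because perplexity has to stay below the number of samples.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for the word2vec matrix (assumption): 50 random 300-dimensional vectors
rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(50, 300))

# n_components=2 asks for a 2-dimensional output space.
# perplexity must be smaller than the number of samples; 10 is an arbitrary choice.
toy_tsne = TSNE(n_components=2, perplexity=10, random_state=0)
toy_2d = toy_tsne.fit_transform(toy_vectors)

# One (x, y) pair per input vector
print(toy_2d.shape)  # (50, 2)
```

The output has one row per input vector and two columns, which is the shape we rely on when assigning x and y coordinates to each word.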

After this step, we have a pair of values for each word, which we can use as its x- and y-coordinates to plot it in the 2D plane. Let's prepare a dataframe that stores every word together with its x and y coordinates, as shown in figure 3.2, and take data from there to create a scatter plot:

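import pandas as pd
import seaborn as sns
import sklearn.manifold

# Reduce the 300-dimensional word vectors to 2 dimensions
tsne = sklearn.manifold.TSNE(n_components=2, random_state=0)

# Matrix with one row per word in the vocabulary
all_word_vectors_matrix = model2vec.wv.vectors

all_word_vectors_matrix_2d = tsne.fit_transform(all_word_vectors_matrix)

# Pair every word with its 2D coordinates via its row index in the matrix.
# Note: gensim 4.0+ removed wv.vocab; there, use wv.key_to_index[word]
# in place of wv.vocab[word].index.
points = pd.DataFrame(
    [
        (word, coords[0], coords[1])
        for word, coords in [
            (word, all_word_vectors_matrix_2d[model2vec.wv.vocab[word].index])
            for word in model2vec.wv.vocab
        ]
    ],
    columns=["word", "x", "y"]
)

# Scatter plot of all the words in the 2D plane
sns.set_context("poster")
ax = points.plot.scatter("x", "y", s=10, figsize=(20, 12))
fig = ax.get_figure()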

This is our dataframe, containing the words and their x and y coordinates:

Figure 3.2 Word list with the coordinate values obtained using t-SNE
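To make the structure of this dataframe concrete, here is a small sketch of the same word-to-row lookup on made-up data. The words, the `word_to_index` mapping, and the coordinates in `coords_2d` are all hypothetical stand-ins for the model's vocabulary and the t-SNE output.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: in the chapter these come from the word2vec model and t-SNE
words = ["king", "queen", "apple"]
word_to_index = {word: i for i, word in enumerate(words)}
coords_2d = np.array([[1.0, 2.0], [1.5, 2.5], [-3.0, 0.5]])

# Each word's index selects its row of 2D coordinates
points_toy = pd.DataFrame(
    [(word, coords_2d[i, 0], coords_2d[i, 1]) for word, i in word_to_index.items()],
    columns=["word", "x", "y"],
)

print(points_toy)
```

The key point is that a word's index in the vocabulary must match its row in the coordinate matrix, otherwise words end up paired with the wrong points.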

This is what the entire cluster looks like after plotting 425,633 tokens in the 2D plane. Each point's position reflects the features the model learned and the correlations between nearby words:

Figure 3.3 Scatter plot of all the unique words in 2D plane.
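With this many points, individual words are hard to read off the full plot. One possible way to inspect a neighbourhood is to restrict the dataframe to a bounding box first. The sketch below is an illustration only: `select_region` is a hypothetical helper, and the small dataframe uses made-up words and coordinates in place of the real `points`.

```python
import pandas as pd

# Hypothetical miniature of the points dataframe (made-up words and coordinates)
points_toy = pd.DataFrame(
    {
        "word": ["king", "queen", "apple", "pear"],
        "x": [1.0, 1.5, -3.0, -2.5],
        "y": [2.0, 2.5, 0.5, 0.0],
    }
)

def select_region(points, x_bounds, y_bounds):
    """Return the rows whose coordinates fall inside the given bounding box."""
    in_x = points["x"].between(x_bounds[0], x_bounds[1])
    in_y = points["y"].between(y_bounds[0], y_bounds[1])
    return points[in_x & in_y]

region = select_region(points_toy, x_bounds=(0.0, 2.0), y_bounds=(1.0, 3.0))
print(region["word"].tolist())  # ['king', 'queen']
```

The selected rows can then be plotted with the same scatter call as before, which keeps the number of points small enough to label.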