How to evaluate embeddings – vector arithmetic and analogies

The bag-of-words model creates document vectors that reflect the presence and relevance of tokens to the document. Latent semantic analysis reduces the dimensionality of these vectors and identifies what can be interpreted as latent concepts in the process. Latent Dirichlet allocation represents both documents and terms as vectors that contain the weights of latent topics.

The dimensions of the word and phrase vectors do not have an explicit meaning. However, the embeddings encode similar usage as proximity in the latent space in a way that carries over to semantic relationships. This results in the interesting property that analogies can be expressed by adding and subtracting word vectors.

The following figure shows how the vector connecting Paris and France (that is, the difference of their embeddings) reflects the "capital of" relationship. The analogous relationship, London: UK, corresponds to the same vector, that is, the UK is very close to the location obtained by adding the "capital of" vector to London:
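This arithmetic can be reproduced with the gensim library. The following is a minimal sketch, assuming a pretrained word2vec file is available locally; the file path, binary format, and lowercase tokens are assumptions for illustration:

    from gensim.models import KeyedVectors

    # Load pretrained vectors (path and binary format are assumptions)
    word_vectors = KeyedVectors.load_word2vec_format('word2vec.bin', binary=True)

    # vec('france') - vec('paris') + vec('london') should land near vec('uk')
    result = word_vectors.most_similar(positive=['france', 'london'],
                                       negative=['paris'],
                                       topn=3)
    print(result)  # a token such as 'uk' or 'britain' is expected near the top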

Just as words can be used in different contexts, they can be related to other words in different ways, and these relationships correspond to different directions in the latent space. Accordingly, there are several types of analogies that the embeddings should reflect if the training data permits.

The Word2vec authors provide a list of several thousand relationships spanning aspects of geography, grammar and syntax, and family relationships to evaluate the quality of embedding vectors. As illustrated above, the test validates that the target word (UK) is closest to the result of adding the vector that represents an analogous relationship (Paris: France) to the target's complement (London).
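gensim ships an evaluator for this test. The sketch below assumes the analogy list published with the original Word2vec code (questions-words.txt) and the vector file from the previous snippet are available locally:

    from gensim.models import KeyedVectors

    word_vectors = KeyedVectors.load_word2vec_format('word2vec.bin', binary=True)

    # Score the vectors against the Word2vec analogy test set
    score, sections = word_vectors.evaluate_word_analogies('questions-words.txt')
    print(f'Overall analogy accuracy: {score:.2%}')
    for section in sections:
        correct, incorrect = len(section['correct']), len(section['incorrect'])
        total = correct + incorrect
        if total:
            print(f"{section['section']:<40} {correct / total:.2%}")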

The following figure projects the 300-dimensional embeddings of the most closely related analogies for a Word2vec model trained on the Wikipedia corpus, with over 2 billion tokens, into two dimensions using principal component analysis (PCA). A test of over 24,400 analogies from the following categories achieved an accuracy of over 73.5% (see notebook):
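The projection itself can be reproduced along the following lines. This is an illustrative sketch rather than the notebook's exact code, and the word selection is a hypothetical example:

    import numpy as np
    from sklearn.decomposition import PCA

    # `word_vectors` as loaded above; the word list is an illustrative choice
    words = ['paris', 'france', 'london', 'uk', 'berlin', 'germany']
    vectors = np.vstack([word_vectors[w] for w in words])

    # Reduce the 300-dimensional embeddings to two principal components
    pca = PCA(n_components=2)
    coords = pca.fit_transform(vectors)          # shape: (len(words), 2)
    for word, (x, y) in zip(words, coords):
        print(f'{word:>8}: ({x:6.2f}, {y:6.2f})')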

Working with embedding models

As with other unsupervised learning techniques, the goal of learning embedding vectors is to generate features for other tasks such as text classification or sentiment analysis.

There are several options to obtain embedding vectors for a given corpus of documents:

  • Use embeddings learned from a generic large corpus such as Wikipedia or Google News
  • Train your own model using documents that reflect a domain of interest

The less generic and more specialized the content of the subsequent text modeling task is, the more preferable the second approach becomes. However, quality word embeddings are data-hungry and require informative documents containing hundreds of millions of words.
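Both routes are straightforward with gensim. The following sketch assumes gensim 4.x, the gensim-data model name 'word2vec-google-news-300', and a toy tokenized corpus standing in for domain-specific documents:

    import gensim.downloader as api
    from gensim.models import Word2Vec

    # Option 1: embeddings learned from a large generic corpus (downloads on first use)
    pretrained = api.load('word2vec-google-news-300')
    print(pretrained.most_similar('London', topn=3))

    # Option 2: train your own model on domain-specific, tokenized documents
    sentences = [['the', 'fed', 'raised', 'rates'],
                 ['earnings', 'beat', 'analyst', 'estimates']]   # toy corpus
    own_model = Word2Vec(sentences, vector_size=300, window=5,
                         min_count=1, workers=4)
    print(own_model.wv['earnings'].shape)        # (300,)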
