Best practice 14 – extracting features from text data

We worked intensively with text data in Chapter 2, Exploring the 20 Newsgroups Dataset with Text Analysis Techniques, Chapter 3, Mining the 20 Newsgroups Dataset with Clustering and Topic Modeling Algorithms, Chapter 4, Detecting Spam Email with Naive Bayes, and Chapter 5, Classifying News Topics with a Support Vector Machine, where we extracted features from text based on term frequency (tf) and term frequency-inverse document frequency (tf-idf). Both methods consider each document a collection of words (terms), or a bag of words (BoW), disregarding the order of words but keeping multiplicity. A tf approach simply uses the counts of tokens, while tf-idf extends tf by assigning each tf a weighting factor that is inversely proportional to the document frequency. With the idf factor incorporated, tf-idf diminishes the weight of common terms (such as get and make) that occur frequently but carry little meaning, and emphasizes terms that occur rarely but convey important meaning. Hence, features extracted with tf-idf are often more representative than those extracted with tf.
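
As a quick refresher, here is a minimal sketch of the two approaches using scikit-learn, as in the earlier chapters; the two-sentence corpus is made up purely for illustration:

>>> from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
>>> corpus = ['I love reading machine learning books',
...           'machine learning by example is fun']
>>> tf_features = CountVectorizer().fit_transform(corpus)      # raw term counts (tf)
>>> tfidf_features = TfidfVectorizer().fit_transform(corpus)   # counts reweighted by idf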

As you may remember, a document is represented by a very sparse vector in which only the terms present have non-zero values. Its dimensionality is usually high, as it is determined by the size of the vocabulary, that is, the number of unique terms. Also, such a one-hot encoding approach treats each term as an independent item and does not consider the relationships across words (referred to as "context" in linguistics).

In contrast, another approach, called word embedding, is able to capture the meanings of words and their context. In this approach, a word is represented by a vector of floating-point numbers. Its dimensionality is much lower than the size of the vocabulary, usually only a few hundred. For example, the word machine might be represented as [1.4, 2.1, 10.3, 0.2, 6.81]. So, how can we embed a word into a vector? One solution is word2vec, which trains a shallow neural network to predict a word given the words around it (called CBOW) or to predict the words around a given word (called skip-gram). The weights of the trained neural network serve as the embedding vectors for the corresponding words.
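
To make this concrete, here is a minimal sketch of training both variants yourself with gensim; the parameter names follow gensim 4.x, sg=0 selects CBOW, sg=1 selects skip-gram, and the toy corpus and dimensions are arbitrary:

>>> from gensim.models import Word2Vec
>>> sentences = [['i', 'love', 'reading', 'python', 'machine',
...               'learning', 'by', 'example']]
>>> cbow_model = Word2Vec(sentences, vector_size=25, window=2,
...                       min_count=1, sg=0)  # CBOW
>>> sg_model = Word2Vec(sentences, vector_size=25, window=2,
...                     min_count=1, sg=1)    # skip-gram
>>> print(cbow_model.wv['machine'])           # learned 25-dimensional embedding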

CBOW is short for Continuous Bag of Words. Given the sentence I love reading Python machine learning by example in a corpus, and 5 as the size of the word window, we can construct training samples for the CBOW neural network by pairing each center word with its surrounding context words.
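
The exact samples depend on how the window is defined; a minimal sketch, assuming a window of size 5 means the center word plus up to two neighbors on each side, is as follows:

>>> sentence = ['i', 'love', 'reading', 'python', 'machine',
...             'learning', 'by', 'example']
>>> half = 2   # (window size 5 - 1) / 2 neighbors on each side
>>> for i, target in enumerate(sentence):
...     context = sentence[max(0, i - half):i] + sentence[i + 1:i + 1 + half]
...     print(context, '->', target)   # CBOW: predict the center word from its context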

Of course, the inputs and outputs of the neural network are one-hot encoded vectors, with a value of 1 for words that are present and 0 for words that are absent. And we can construct millions of training samples from a corpus, sentence by sentence. After the network is trained, the weights that connect the input layer and the hidden layer embed the individual input words. A skip-gram-based neural network embeds words in a similar way, but its inputs and outputs are the reverse of CBOW's. Given the same sentence I love reading Python machine learning by example and 5 as the size of the word window, the skip-gram neural network is trained to predict each surrounding word from the center word.
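
Under the same window assumption as the CBOW sketch above, the skip-gram training pairs simply flip the direction, as in this sketch:

>>> sentence = ['i', 'love', 'reading', 'python', 'machine',
...             'learning', 'by', 'example']
>>> half = 2   # two neighbors on each side
>>> for i, target in enumerate(sentence):
...     for neighbor in sentence[max(0, i - half):i] + sentence[i + 1:i + 1 + half]:
...         print(target, '->', neighbor)   # skip-gram: predict each neighbor from the center word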

The embedding vectors are real-valued, and each dimension encodes an aspect of meaning for words in the vocabulary. This helps preserve the semantic information of words, as opposed to discarding it as in the dummy one-hot encoding approach using tf or tf-idf. An interesting phenomenon is that the vectors of semantically similar words are close to each other in geometric space. For example, in the context of machine learning the words clustering and grouping both refer to unsupervised clustering, hence their embedding vectors are close together.
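
Proximity here is usually measured with cosine similarity; the following sketch uses two made-up three-dimensional vectors purely to illustrate the computation:

>>> import numpy as np
>>> v_clustering = np.array([1.2, 0.8, -0.5])   # hypothetical embedding of "clustering"
>>> v_grouping = np.array([1.1, 0.9, -0.4])     # hypothetical embedding of "grouping"
>>> cosine = v_clustering @ v_grouping / (
...     np.linalg.norm(v_clustering) * np.linalg.norm(v_grouping))
>>> print(cosine)   # a value close to 1 means the vectors point in nearly the same direction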

Training a word embedding neural network can be time-consuming and computationally expensive. Fortunately, several big tech companies have trained word embedding models on various kinds of corpora and open sourced them. We can simply use these pre-trained models to map words to vectors. Popular pretrained word embedding models include word2vec (trained on Google News), GloVe (trained on Wikipedia, Gigaword, and Twitter corpora), and fastText (trained on Wikipedia and Common Crawl).

Once we have embedding vectors for individual words, we can represent a document sample by averaging the vectors of all the words present in that document. The resulting vectors of document samples are then consumed by downstream predictive tasks, such as classification, similarity ranking in search engines, and clustering.

Now let's play around with gensim, a popular NLP package with powerful word embedding modules. If you have not installed the package in Chapter 2, Exploring the 20 Newsgroups Dataset with Text Analysis Techniques, you can do so using pip.
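
For example (the exact command may vary with your Python environment):

pip install -U gensim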

First, we import the package and load a pretrained model, glove-twitter-25, as follows:

>>> import gensim.downloader as api
>>> model = api.load("glove-twitter-25")
[==================================================] 100.0%
104.8/104.8MB downloaded

You will see the progress bar the first time you run this line of code. The glove-twitter-25 model is one of the smallest ones, so the download will not take very long.

We can obtain the embedding vector for a word (computer, for example), as follows:

>>> vector = model['computer']
>>> print('Word computer is embedded into: ', vector)
Word computer is embedded into:
[ 0.64005 -0.019514 0.70148 -0.66123 1.1723 -0.58859 0.25917
-0.81541 1.1708 1.1413 -0.15405 -0.11369 -3.8414 -0.87233
0.47489 1.1541 0.97678 1.1107 -0.14572 -0.52013 -0.52234
-0.92349 0.34651 0.061939 -0.57375 ]

The result is a 25-dimensional float vector, as expected.

We can also get the top 10 words that are most contextually relevant to computer using the most_similar method, as follows:

>>> similar_words = model.most_similar("computer")
>>> print('Top ten words most contextually relevant to computer: ',
similar_words)
Top ten words most contextually relevant to computer:
[('camera', 0.907833456993103), ('cell', 0.891890287399292), ('server', 0.8744666576385498), ('device', 0.869352400302887), ('wifi', 0.8631256818771362), ('screen', 0.8621907234191895), ('app', 0.8615544438362122), ('case', 0.8587921857833862), ('remote', 0.8583616018295288), ('file', 0.8575270771980286)]

The result looks promising.

Finally, we demonstrate how to generate a representative vector for a document with a simple example, as follows:

>>> doc_sample = ['i', 'love', 'reading', 'python', 'machine', 
'learning', 'by', 'example']
>>> import numpy as np
>>> doc_vector = np.mean([model[word] for word in doc_sample],
axis=0)
>>> print('The document sample is embedded into: ', doc_vector)
The document sample is embedded into:
[-0.17100249 0.1388764 0.10616798 0.200275 0.1159925 -0.1515975
1.1621187 -0.4241785 0.2912 -0.28199488 -0.31453252 0.43692702
-3.95395 -0.35544625 0.073975 0.1408525 0.20736426 0.17444688
0.10602863 -0.04121475 -0.34942 -0.2736689 -0.47526264 -0.11842456
-0.16284864]

The resulting vector is the average of the embedding vectors of the eight input words.
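
Note that indexing the model with a word that is not in its vocabulary raises a KeyError; a slightly more defensive sketch skips unknown words before averaging:

>>> doc_vector = np.mean([model[word] for word in doc_sample
...                       if word in model], axis=0)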

In traditional NLP applications, such as text classification and topic modeling, tf or tf-idf is still an excellent solution for feature extraction. In more complicated areas, such as text summarization, machine translation, named entity recognition, question answering, and information retrieval, word embedding is used extensively and extracts far better features than the two traditional approaches.
