Natural language processing

Once the text content of a web page has been extracted, the text data is usually preprocessed to remove the parts that do not carry any relevant information. The text is tokenized, that is, transformed into a list of words (tokens), and all the punctuation marks are removed. Another usual operation is to remove all the stopwords, that is, the words used to construct the syntax of a sentence but that do not carry textual information, such as conjunctions, articles, and prepositions: a, about, an, are, as, at, be, by, for, from, how, in, is, of, on, or, that, the, these, this, to, was, what, when, where, who, will, with, and many others.

Many words in English (or any language) share the same root but have different suffixes or prefixes. For example, the words think, thinking, and thinker all share the same root, think, indicating that the meaning is the same, but their role in a sentence is different (verb, noun, and so on). The procedure that reduces all the words in a set to their roots is called stemming, and many algorithms have been devised to do so (Porter, Snowball, and Lancaster). All of these techniques are part of a broader field called natural language processing, and they are implemented in Python in the nltk library (installed as usual through sudo pip install nltk). As an example, the following code preprocesses a sample text using the techniques described previously (using the Python interactive terminal):

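A minimal sketch of such a preprocessing pipeline is shown below (the sample sentence and variable names are illustrative, and it assumes the relevant nltk corpora have been downloaded):

    import string
    from nltk.tokenize import word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    # requires nltk.download('punkt') and nltk.download('stopwords')
    text = 'The thinkers were thinking about thought and other thoughts.'

    # tokenize and drop the punctuation marks
    tokens = [t.lower() for t in word_tokenize(text) if t not in string.punctuation]

    # remove the English stopwords
    stop = set(stopwords.words('english'))
    tokens = [t for t in tokens if t not in stop]

    # reduce each word to its root with the Porter stemmer
    stemmer = PorterStemmer()
    print([stemmer.stem(t) for t in tokens])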

Note that the stopwords list has been downloaded using the nltk downloader nltk.download('stopwords').

Information retrieval models

Information retrieval methods are needed to find the documents most relevant to a given query. The words contained in the web pages can be modeled using different approaches, such as Boolean models, vector space models, and probabilistic models; in this book, we have decided to discuss the vector space models and how to implement them. Formally, given a vocabulary of V words, each web page $d_i$ (or document) in a collection of N pages can be thought of as a vector of words, $d_i = (w_{i1}, \ldots, w_{iV})$, where each word j belonging to the document i is represented by $w_{ij}$, which can be either a number (weight) or a vector, depending on the chosen algorithm:

  • Term frequency-inverse document frequency (TF-IDF): $w_{ij}$ is a real number
  • Latent Semantic Analysis (LSA): $w_{ij}$ is a real number (representation independent of the document i)
  • Doc2Vec (or word2vec): $w_{ij}$ is a vector of real numbers (representation independent of the document i)

Since the query can also be represented by a vector of words, $q = (w_{q1}, \ldots, w_{qV})$, the web pages most similar to the vector q are found by calculating a similarity measure between the query vector and each document vector. The most commonly used similarity measure is the cosine similarity, which for any document i is given by:

$\cos(q, d_i) = \dfrac{q \cdot d_i}{\lVert q \rVert\, \lVert d_i \rVert}$

Note that other measures are used in the literature (such as Okapi BM25 and pivoted normalization weighting), but for the purpose of this book, they are not necessary.

The following sections give some details about each of the three methods before they are applied to a test case in the final part of the section.

TF-IDF

This method calculates $w_{ij}$ by taking into account the fact that a word that appears many times and in a large number of pages is likely to be less important than a word that occurs many times but only in a subset of documents. It is given by the multiplication of two factors:

$w_{ij} = tf_{ij} \cdot idf_j$

where:

  • $tf_{ij}$ is the normalized frequency of the word j in the document i
  • $idf_j = \log\left(\frac{N}{df_j}\right)$ is the inverse document frequency, where $df_j$ is the number of web pages that contain the word j
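As a quick illustration (the numbers are made up): if the word j occurs 3 times in a 100-word document i, then $tf_{ij} = 3/100 = 0.03$; if it appears in $df_j = 10$ of $N = 1{,}000$ pages, then $idf_j = \log_{10}(1000/10) = 2$ (the base of the logarithm is a matter of convention), so $w_{ij} = 0.03 \times 2 = 0.06$. A very common word such as the has a $df_j$ close to N, an $idf_j$ close to 0, and therefore a negligible weight.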

Latent Semantic Analysis (LSA)

The name of this algorithm comes from the idea that there is a latent space in which each word (and each document) can be efficiently described, assuming that words with similar meanings also occur in similar text positions. The projection onto this subspace is performed using the (truncated) SVD method already discussed in Chapter 2, Machine Learning Techniques – Unsupervised Learning. We contextualize the method for LSA as follows: the web pages are collected together in a matrix X (V × N), in which each column is a document:

$X \approx U_t \Sigma_t V_t^T$

Here, $U_t$ (V × d) is the matrix of the words projected into the new latent space with d dimensions, $V_t^T$ (d × N) is the transposed matrix of the documents transformed into the subspace, and $\Sigma_t$ (d × d) is the diagonal matrix of the singular values. The query vector itself is projected into the latent space by:

$q_t = \Sigma_t^{-1} U_t^T q$

Now, each document, represented by a row of $V_t$ (a column of $V_t^T$), can be compared with $q_t$ using the cosine similarity. Note that the true mathematical representation of the documents in the latent space is given by $\Sigma_t V_t^T$ (not $V_t^T$), because the singular values are the scaling factors of the space axis components and they must be taken into account; therefore, this matrix should be compared with $\Sigma_t q_t$. Nevertheless, the similarity is usually computed between $V_t^T$ and $q_t$, and in practice it is still unclear which choice returns the best results.
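As a minimal numpy sketch of this projection (the matrix X is filled with random numbers as a stand-in for real TF-IDF weights, and all the variable names are illustrative):

    import numpy as np

    # X is the V x N term-document matrix (random numbers as a stand-in here)
    V, N, d = 1000, 200, 10
    X = np.random.rand(V, N)

    # truncated SVD: keep only the d largest singular values
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Ut, St, Vt_d = U[:, :d], np.diag(s[:d]), Vt[:d, :]

    # project a query vector q (length V) into the latent space: qt = S^-1 U^T q
    q = np.random.rand(V)
    qt = np.linalg.inv(St).dot(Ut.T).dot(q)

    # cosine similarity between qt and each document (the columns of Vt_d)
    docs = Vt_d.T                      # one row per document
    sims = docs.dot(qt) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(qt))
    print(sims.argsort()[::-1][:5])    # indices of the five most similar documents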

Doc2Vec (word2vec)

This method represents each word j, $w_j$, as a vector $v_{w_j}$ of real numbers that is independent of the document $d_i$ it occurs in. Doc2Vec is an extension of the word2vec algorithm originally proposed by Mikolov and others, and it employs neural networks and backpropagation to generate the word (and document) vectors. Due to the increasing importance of neural networks (especially deep learning) in many machine learning applications, we decided to include the main concepts and formulas of this quite advanced method here to give you an introduction to a subject that will become extremely important in the future of machine learning in various fields. The following description is based on the papers Rong (2014) and Le and Mikolov (2014), and the notation reflects that currently used in the literature.

Word2vec – continuous bag of words and skip-gram architectures

Each word j in the vocabulary V is represented by a one-hot vector of length |V|, with binary entries $x_j = (x_{1j}, \ldots, x_{Vj})$, where $x_{jj} = 1$ and all the other entries are 0. The word2vec method trains a single (hidden) layer of N neurons (weights), choosing between two different network architectures (shown in the following figure). Note that both architectures have only one layer of N neurons, or weights, h. This means that the method has to be considered shallow learning rather than deep learning, which typically refers to networks with many hidden layers. The Continuous Bag Of Words (CBOW) method (displayed on the right in the following figure) trains the model using a set of C words, called the context, as the input, trying to predict the word (the target) that occurs adjacent to the input text. The reverse approach is called Skip-gram, in which the input is the target word and the network is trained to predict the context set (displayed on the left in the following figure). Note that C is called the window parameter and it sets how far from the target word the context words are selected:


Skip-gram (left) and CBOW (right) architectures of the word2vec algorithm; figures taken from word2vec Parameter Learning Explained by X Rong (2015)

In both cases, the matrix W transforms the input vectors into the hidden layer and W' transforms from the hidden layer to the output layer y, where the target (or context) is evaluated. In the training phase, the error with respect to the true target (or context) is computed and used in a stochastic gradient descent step to update both the matrices W and W'. We will give a more mathematical description of the CBOW method in the following section. Note that the Skip-gram equations are similar, and we refer to the Rong (2015) paper for further details.

Mathematical description of the CBOW model

Starting from the input layer, the hidden layer h can be obtained by computing $h = W^T \bar{x}_C = \frac{1}{C}\sum_{c=1}^{C} v_{w_c}$, where $v_{w_c}$ is the vector of length N that represents the word $w_c$ on the hidden layer and $\bar{x}_C$ is the average of the C context vectors $x_1, \ldots, x_C$. Choosing a target word $w_j$, the score at the output layer $u_j$ is obtained by multiplying the vector $v'_{w_j}$ (the j-th column of W') by h:

$u_j = v'^{\,T}_{w_j} h$

This is not the final value of the output layer $y_j$, because we want to evaluate the posterior conditional probability of the target word $w_j$ given the context C, expressed by the softmax formula:

$y_j = p(w_j \mid w_1, \ldots, w_C) = \dfrac{\exp(u_j)}{\sum_{j'=1}^{V} \exp(u_{j'})}$

Now, the training objective is to maximize this probability, which is equivalent to minimizing the loss function $E = -\log p(w_{j_M} \mid w_1, \ldots, w_C) = -u_{j_M} + \log \sum_{j'=1}^{V} \exp(u_{j'})$, where the sum runs over all the words in the vocabulary and the index $j_M$ refers to the column of W' associated with the true target word.

The stochastic gradient descent equations are then obtained by calculating the derivatives of E with respect to the entries of W (wij) and W' (w'ij'). The final equations for each output target word wj are:

$v'^{\,(\mathrm{new})}_{w_j} = v'^{\,(\mathrm{old})}_{w_j} - \alpha\, e_j\, h \qquad j = 1, \ldots, V$

$v^{(\mathrm{new})}_{w_c} = v^{(\mathrm{old})}_{w_c} - \frac{\alpha}{C} \sum_{j=1}^{V} e_j\, v'_{w_j} \qquad c = 1, \ldots, C$

Here, $e_j = y_j - t_j$ (with $t_j = 1$ if j is the index of the true target word and 0 otherwise) and $\alpha$ is the learning rate of the gradient descent. The derivative $e_j = \partial E / \partial u_j$ represents the error of the network with respect to the true target word, so the error is backpropagated through the system, which can learn iteratively. Note that the vectors $v_{w_j}$ (the rows of W) are the usual word representations used to perform the semantic operations.
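To make the flow of these equations concrete, the following is a tiny numpy sketch of one CBOW forward pass and gradient update on made-up data (a toy illustration, not the actual word2vec implementation):

    import numpy as np

    Vsize, N, C, alpha = 20, 8, 3, 0.05   # vocabulary size, hidden units, window, learning rate
    W = np.random.rand(Vsize, N) * 0.1    # input-to-hidden weights
    Wp = np.random.rand(N, Vsize) * 0.1   # hidden-to-output weights (W')

    context = [2, 5, 7]                   # indices of the C context words
    target = 11                           # index of the true target word j_M

    # hidden layer: average of the context word vectors (rows of W)
    h = W[context].mean(axis=0)

    # output scores u_j and softmax probabilities y_j
    u = Wp.T.dot(h)
    y = np.exp(u) / np.exp(u).sum()

    # error e_j = y_j - t_j and stochastic gradient descent updates
    t = np.zeros(Vsize)
    t[target] = 1.0
    e = y - t
    EH = Wp.dot(e)                        # back-propagated error for the input weights
    Wp -= alpha * np.outer(h, e)          # update W'
    W[context] -= alpha * EH / C          # update the C context rows of W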

Further details can be found in the Rong (2015) paper.

Doc2Vec extension

As explained in Le and Mikolov (2014), Doc2Vec is a natural extension of the word2vec method in which a document is considered as an additional word vector. So, in the case of the CBOW architecture, the hidden layer vector h is just the average of the C context word vectors and the document vector $d_i$:

$h = \frac{1}{C+1}\left(\sum_{c=1}^{C} v_{w_c} + d_i\right)$

This architecture, shown in the following figure, is called the distributed memory model (DM) because the document vector $d_i$ remembers the information of the document that is not represented by the context words. The vector $d_i$ is shared by all the contexts sampled from the document i, while the matrices W (and W') are the same for all the documents:


A distributed memory model example with a context of three words (window=3); figure taken from Distributed Representations of Sentences and Documents by Le and Mikolov (2014)

The other proposed architecture is called distributed bag of words (DBOW), which considers only a document vector in the input layer and a set of context words sampled from the document in the output layer. It has been shown that the DM architecture performs better than DBOW, and it is therefore the default model in the gensim library implementation. The reader is advised to read the paper of Le and Mikolov (2014) for further details.

Movie review query example

To show the three information retrieval methods discussed previously in action, we use the IMDb movie reviews in the polarity dataset v2.0 and the pool of 27,886 unprocessed HTML files at http://www.cs.cornell.edu/people/pabo/movie-review-data/, provided by Bo Pang and Lillian Lee (the dataset and the code are also stored in the GitHub account of the author at https://github.com/ai2010/machine_learning_for_the_web/tree/master/chapter_4/). Download and unzip the movie archive (called polarity_html.zip on the website), which creates the movie folder with all the web page movie reviews (about 2,000 files). First of all, we need to prepare the data from the files:

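A possible sketch of this step is the following (the folder name and the way the title is extracted are assumptions based on the description above):

    import os
    from bs4 import BeautifulSoup

    moviedict = {}
    path = 'movie/'
    for filename in os.listdir(path):
        with open(os.path.join(path, filename), errors='ignore') as f:
            soup = BeautifulSoup(f.read(), 'html.parser')
        # map each review file to the movie title found in its <title> tag
        if soup.title and soup.title.string:
            moviedict[filename] = soup.title.string.strip()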

This time we use BeautifulSoup to parse the title of the movie from each HTML web page and create a dictionary, moviedict. The polarity dataset v2.0.tar.gz archive contains a folder, review_polarity, inside which the txt_sentoken/ folder splits the positive and negative reviews into two separate subfolders (pos and neg). These files are preprocessed using the following code:

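A sketch of how the 2,000 reviews can be collected might look as follows (the pos/neg folder names follow the polarity dataset layout, while the link between a review file and its movie title is only a guess at what the original code does with moviedict):

    import os

    tot_textreviews, tot_titles = [], []
    for label in ['pos', 'neg']:
        folder = os.path.join('review_polarity', 'txt_sentoken', label)
        for filename in sorted(os.listdir(folder)):
            with open(os.path.join(folder, filename)) as f:
                tot_textreviews.append(f.read())
            # the original code derives the title from the moviedict built in the previous step
            tot_titles.append(moviedict.get(filename, filename))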

Now all the 2,000 reviews are stored in the tot_textreviews list and the corresponding titles in tot_titles. The TF-IDF model can be trained using sklearn:

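With scikit-learn this step can be sketched as follows (PreprocessTfidf is assumed to be a helper, defined elsewhere, that returns one preprocessed string per review):

    from sklearn.feature_extraction.text import TfidfVectorizer

    # PreprocessTfidf tokenizes, removes stopwords, and stems each review
    vectorizer = TfidfVectorizer(min_df=1)
    tfidf_matrix = vectorizer.fit_transform(PreprocessTfidf(tot_textreviews))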

The PreprocessTfidf function applies all the preprocessing techniques (stop word removal, tokenization, and stemming) to each document. In the same way, we can train the LSA model using the gensim library, specifying 10 latent dimensions:

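A sketch of this step with gensim (GenSimCorpus is assumed to be a helper, defined elsewhere, that returns the preprocessed token list of each review; the exact split of responsibilities between GenSimCorpus and the code below is a guess):

    from gensim import corpora, models

    texts = GenSimCorpus(tot_textreviews)
    dict_corpus = corpora.Dictionary(texts)
    corpus = [dict_corpus.doc2bow(text) for text in texts]
    lsi = models.LsiModel(corpus, num_topics=10, id2word=dict_corpus)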

Note that the GenSimCorpus function just preprocesses the documents with the usual techniques and transforms them into a format that the gensim LSA implementation can read. From the lsi object, it is possible to obtain the matrices U, V, and S that are needed to transform the query into the latent space:

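One way to obtain these objects from the lsi model is through gensim's projection attributes (the construction of dict_words below is an assumption):

    import numpy as np
    from gensim import matutils

    U = lsi.projection.u                                    # V x d matrix of word vectors
    S = np.diag(lsi.projection.s)                           # d x d diagonal matrix of singular values
    Vt = matutils.corpus2dense(lsi[corpus], num_terms=10)   # d x N document matrix

    # map each word to its integer index in the gensim dictionary
    dict_words = dict_corpus.token2id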

Also, the indexed dictionary of words, dict_words, has been built to map each query word to the corresponding word index in dict_corpus.

The last model to train is Doc2Vec. First, we prepare the data in a format that the gensim Doc2Vec implementation can handle:

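This preparation can be sketched with gensim's TaggedDocument, which is itself a namedtuple of words and tags (PreprocessDoc2Vec is assumed to be a helper defined elsewhere; the original code tags each review with its file name, while here the title is used so that the retrieved tags can be printed directly):

    from gensim.models.doc2vec import TaggedDocument

    doc2vecstem = False   # set to True to also apply the stemmer
    alldocs = []
    for text, title in zip(tot_textreviews, tot_titles):
        # PreprocessDoc2Vec removes stopwords and tokenizes (optionally stemming)
        words = PreprocessDoc2Vec(text, doc2vecstem)
        alldocs.append(TaggedDocument(words=words, tags=[title]))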

Each review has been placed in a namedtuple object, which contains the words preprocessed by the PreprocessDoc2Vec function (stopwords removed and tokenization performed) and a tag that is the name of the file. Note that we chose not to apply a stemmer because the results are generally better without it (the reader can test this by applying the stemmer, setting the Boolean flag doc2vecstem to True). The Doc2Vec training is finally performed by the following code:

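The training call might be sketched as follows (the parameters follow the description that follows; note that older gensim versions use size instead of vector_size, and the negative/hs values here are illustrative):

    from gensim.models import Doc2Vec

    # a fixed seed with a single worker keeps the run reproducible
    model = Doc2Vec(dm=1, vector_size=500, window=10, min_count=1,
                    negative=5, hs=0, alpha=0.99, seed=1, workers=1)
    model.build_vocab(alldocs)
    model.train(alldocs, total_examples=model.corpus_count, epochs=20)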

We set the DM architecture (dm=1), a hidden layer with 500 dimensions (size), a window of 10 words, and min_count=1, so that all the words occurring at least once are taken into account by the model. The other parameters are related to the efficiency optimization method (negative for negative sampling and hs for hierarchical softmax). The training lasts for 20 epochs, with a learning rate equal to 0.99.

We can now verify which results each method returns, defining a query to retrieve all the web documents related to sci-fi movies, that is, movies usually described by this list of words:

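The query is simply a list of (preprocessed) words; the exact list used in the book is not reproduced here, so the following is only illustrative (the discussion of the results tells us that it contains the word action):

    # illustrative sci-fi query words
    query = ['alien', 'space', 'future', 'robot', 'science', 'fiction', 'action']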

The TF-IDF method returns the five most similar web pages using the following script:

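A sketch of the TF-IDF retrieval step, with a cosine_similarity helper that converts the sparse rows into dense vectors before comparing them (the query words are assumed to have been preprocessed in the same way as the documents):

    import numpy as np

    def cosine_similarity(a, b):
        # convert the sparse rows into regular (dense) vectors before comparing them
        a = np.asarray(a.todense()).ravel()
        b = np.asarray(b.todense()).ravel()
        return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

    query_vec = vectorizer.transform([' '.join(query)])
    scores = [cosine_similarity(query_vec, tfidf_matrix[i])
              for i in range(tfidf_matrix.shape[0])]

    # print the titles of the five most similar reviews
    for i in np.argsort(scores)[::-1][:5]:
        print(tot_titles[i])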

Note that the model uses a sparse matrix format to store the data, so the cosine_similarity function converts the vectors into regular (dense) vectors before computing the similarity. In a similar way, the query is converted into its latent-space representation $q_t$, and the five most similar web pages are printed out:

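A sketch of the LSA retrieval step, projecting the query with $q_t = \Sigma_t^{-1} U_t^T q$ (again, the query words are assumed to be preprocessed like the corpus):

    import numpy as np

    # build the query vector in the original word space using dict_words
    q = np.zeros(len(dict_corpus))
    for word in query:
        if word in dict_words:
            q[dict_words[word]] += 1.0

    # project the query into the latent space
    qt = np.linalg.inv(S).dot(U.T).dot(q)

    # cosine similarity between qt and each document column of Vt
    sims = [qt.dot(Vt[:, i]) / (np.linalg.norm(qt) * np.linalg.norm(Vt[:, i]))
            for i in range(Vt.shape[1])]
    for i in np.argsort(sims)[::-1][:5]:
        print(tot_titles[i])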

Finally, the doc2vec model transforms the query list into a vector using the infer_vector function, and the most similar reviews are returned by the most_similar function:

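A sketch of the Doc2Vec retrieval step (in gensim 4 the document vectors live in model.dv; older versions call it model.docvecs):

    # infer a vector for the (preprocessed) query and look up the closest document tags
    query_vec = model.infer_vector(query)
    for tag, score in model.dv.most_similar(positive=[query_vec], topn=5):
        print(tag, score)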

Note that the random seed of the model needs to be set to a fixed value to return deterministic results whenever a stochastic optimization approach is used (negative sampling or hierarchical softmax). The results are as follows:

  • TF-IDF: the five titles returned by the TF-IDF script
  • LSA: the five titles returned by the LSA model
  • Doc2vec: the five titles returned by the Doc2Vec model

All three methods return movies related to the query. Interestingly, TF-IDF performs better than the more advanced LSA and Doc2Vec algorithms: In the Heat of the Night, Pokemon, Rocky Horror Picture Show, and Wild Things are not related to the query, while the TF-IDF results contain only one unrelated movie (No Telling). The movies Charlie's Angels and Batman & Robin are action movies, so they are mostly related to the single query word action. Doc2Vec returns the worst results, mostly because the training dataset is too small to learn good vector representations (as an example, Google released a word2vec model trained on a dataset of billions of words). The website http://www.cs.cornell.edu/people/pabo/movie-review-data/ provides a larger dataset, so the reader can try to train Doc2Vec with more data as an exercise.
