Once the text content of a web page has been extracted, the text data is usually preprocessed to remove parts that do not bring any relevant information. A text is tokenized, that is, transformed into a list of words (tokens), and all the punctuation marks are removed. Another usual operation is to remove all the stopwords, that is, all the words used to construct the syntax of a sentence but not containing text information (such as conjunctions, articles, and prepositions) such as a, about, an, are, as, at, be, by, for, from, how, in, is, of, on, or, that, the, these, this, to, was, what, when, where, who, will, with, and many others.
Many words in English (or any language) share the same root but have different suffixes or prefixes. For example, the words think, thinking, and thinker all share the same root—think indicating that the meaning is the same—but the role in a sentence is different (verb, noun, and so on). The procedure to reduce all the words in a set to its roots it is called
stemming, and many algorithms have been invented to do so (Porter, Snowball, and Lancaster). All of these techniques are parts of a broader range of algorithms called
natural language processing, and they are implemented in Python on the nltk
library (installed as usual through sudo pip install nltk
). As an example, the following code preprocesses a sample text using the techniques described previously (using the Python interface terminal):
Note that the stopwords
list has been downloaded using the nltk dowloader nltk.download('stopwords')
.
The information retrieval methods are needed to find the most relevant documents to a given query. The words contained in the web pages can be modeled using different approaches such as Boolean models, vector space models, and probabilistic models, and in this book, we have decided to discuss the vector space models and how to implement them. Formally, given a vocabulary of V words, each web page di (or document) in a collection of N pages, can be thought of as a vector of words, , where each word j belonging to the document i is represented by wij, which can be either a number (weight) or a vector depending on the chosen algorithm:
Since the query can also be represented by a vector of words, , the web pages most similar to the vector q are found by calculating a similarity measure between the query vector and each document. The most used similarity measure is called cosine similarity, for any document i given by:
Note that there are other measures used in literature (okapi and pivoted normalization weighting), but for the purpose of this book, they are not necessary.
The following sections will give some details about the three methods before being applied in a text case in the final paragraph of the section.
This method calculates wij, taking into account the fact that a word that appears many times and in a large number of pages is likely to be less important than a word that occurs many times but only in a subset of documents. It is given by the multiplication of two factors:
where:
The name of this algorithm comes from the idea that there is a latent space in which each word (and each document) can be efficiently described, assuming that words with similar meanings also occur in similar text positions. Projection on this subspace is performed using the (truncated) SVD method already discussed in Chapter 2, Machine Learning Techniques – Unsupervised Learning. We contextualize the method for LSA as follows: the web pages are collected together in matrix X (V ´N), in which each column is a document:
Here, Ut (V ´d) is the matrix of the words projected in the new latent space with d dimensions, (d ´N) is the transpose matrix of the documents transformed into the subspace, and (d´d) is the diagonal matrix with singular values. The query vector itself is projected into the latent space by:
Now, each document represented by each row of Vt can be compared with qt using the cosine similarity. Note that the true mathematical representation of the documents on the latent space is given by (not Vt) because the singular values are the scaling factors of the space axis components and they must be taken into account. Therefore, this matrix should be compared with . Nevertheless, it usually computes the similarity between Vt and qt, and in practice, it is still unknown which method returns the best results.
This method represents each word j, wj, as a vector but independent of the document di it occurs in. Doc2Vec is an extension of the word2vec algorithm originally proposed by Mikolov and others, and it employs neuron networks and backpropagation to generate the word (and document) vectors. Due to the increasing importance of neuron networks (especially deep learning) in many machine learning applications, we decided to include the main concepts and formulas of this quite advanced method here to give you an introduction to a subject that will become extremely important in the future of machine learning in various fields. The following description is based on the paper Rong (2014) and Le and Mikolov (2014), and the notation also reflects the name currently used in literature.
Each word j in the vocabulary V is represented by a vector of length |V|, with binary entries xj=(x1j, …, xVj), where only xjj=1; otherwise, 0. The word2vec method trains a single (hidden) layer of N neurons (weights), choosing between two different network architectures (shown in the following figure). Note that both architectures have only one layer of N neurons or weights, h. This means that the method has to be considered shallow learning and not deep, which typically refers to networks with many hidden layers. The Continuous Bag Of Words (CBOW) method (displayed to the right in the following figure) trains the model using a set of C words as an input called context trying to predict the word (target) that occurs adjacent to the input text. The reverse approach is called Skip-gram, in which the input is the target word and the network is trained to predict the context set (displayed to the left in the following figure ). Note that C is called the window parameter and it sets how far from the target word the context words are selected:
In both cases, the matrix W transforms the input vectors into the hidden layer and W' transforms from the hidden layer to the output layer y, where the target (or context) is evaluated. In the training phase, the error from the true target (or context) is computed and used to calculate a stochastic gradient descent to update both the matrices W and W'. We will give a more mathematical description of the CBOW method in the following section. Note that the Skip-gram equations are similar and we will refer to the Rong (2015) paper for further details.
Starting from the input layer, the hidden layer h can be obtained by computing, , where is a vector of length N that represents the word wi on the hidden layer and wC is the average of the C context vectors . Choosing a target word wj, the score at the output layer uj is obtained by multiplying the vector (the j-th column of W') by h:
This is not the final value on the output layer yj because we want to evaluate the posterior conditional probability to have the target word wj given the context C expressed by the softmax formula:
Now the training objective is to maximize this probability for all the words in the vocabulary, which is equivalent to , where and the index jM represents the vector of W' in which the product is maximum, that is, the most probable target word.
The stochastic gradient descent equations are then obtained by calculating the derivatives of E with respect to the entries of W (wij) and W' (w'ij'). The final equations for each output target word wj are:
where and a is the learning rate of the gradient descent. The derivative represents the error of the network with respect to the true target word so that the error is back propagated on the system, which can learn iteratively. Note that the vectors are the usual word representations used to perform the semantic operations.
Further details can be found in the Rong (2015) paper.
As explained in Le and Mikolov (2014), Doc2Vec is a natural extension of the word2vec method in which a document is considered as an additional word vector. So in the case of the CBOW architecture, the hidden layer vector h is just the average of the context vectors and the document vector di:
This architecture is shown in the following figure, and it is called the distributed memory model (DM) because the document di vector just remembers the information of the document not represented by the context words. The vector is shared with all the context words sampled from the document di but the matrix W (and W') is the same for all the documents:
The other proposed architecture is called
distributed bag of words (DBOW), which considers only a document vector in the input layer and a set of context words sampled from the document in the output layer. It has been shown that the DM architecture performs better than DBOW, and it is therefore the default model in the gensim
library implementation. The reader is advised to read the paper of Le and Mikolov (2014) for further details.
To show in action the three information retrieval methods discussed previously, we use the IMBD movie reviews in the polarity dataset v2.0 and Pool of 27886 unprocessed html files at http://www.cs.cornell.edu/people/pabo/movie-review-data/, provided by Bo Pang and Lillian Lee (the dataset and the code are also stored in the GitHub account of the author at https://github.com/ai2010/machine_learning_for_the_web/tree/master/chapter_4/. Download and unzip the movie.zip
file from the website (called polarity_html.zip
), which creates the movie
folder with all the web page movie reviews (about 2000 files). First of all, we need to prepare the data from the files:
This time we use BeautifulSoup to parse the title of the movie from each HTML web page and create a dictionary, moviedict
. The polarity dataset v2.0.tar.gz
contains a folder, review_polarity
, which is inside the txt_sentoken/
folder that split the positive and negative reviews into two separate subfolders (pros and cons). These files are preprocessed using the following code:
Now all the 2,000 reviews are stored in the tot_textreviews
list and the corresponding titles in tot_titles
. The TF-IDF model can be trained using sklearn
:
After the PreprocessTfidf
function, apply all the preprocessing techniques (removing stop words, tokenizing, and stemming) to each document. In the same way, we can train the LSA model using the gensim
library, specifying 10 latent dimensions:
Note that the GenSimCorpus
function just preprocesses the documents with the usual techniques and transforms them into a format that the gensim LSA implementation can read. From the lsi
object, it is possible to obtain the matrices U, V, and S that are needed to transform the query into the latent space:
Also the indexed dictionary of words, dict_words
, has been calculated to transform a query word into the corresponding index word in dict_corpus
.
The last model to train is Doc2Vec. First, we prepare the data in a format that the gensim Doc2Vec implementation can handle:
Each review has been placed in a namedtuple
object, which contains the words preprocessed by the PreprocessDoc2Vec
function (stopwords removed and tokenization performed) and the tag that is the name of the file. Note that we chose not to apply a stemmer because the results are generally better without it (the reader can test the results by applying the stemmer, setting the Boolean flag doc2vecstem
to True
). The Doc2Vec training is finally performed by the following code:
We set the DM architecture (dm
=1), the hidden layer with 500 dimensions (size
), a window size of 10 words, and all the words that occur at least once have been taken into account by the model (min_count
=1).The other parameters are related to the efficiency optimization method (negative
for negative sampling and hs
for hierarchical softmax). The training lasted for 20
epochs, with a learning rate equal to 0.99
.
We can now verify which results each method returns, defining a query to retrieve all the web documents related to sci-fi movies, that is, movies usually described by this list of words:
The TF-IDF method returns the five most similar web pages using the following script:
Note that the model uses a sparse matrix format to store data, so the cosine_similarity
function converts the vectors into regular vectors. Then it computes the similarity. In a similar way, the query is converted in a qk in LSA terminology and the five most similar web pages are printed out:
Finally, the doc2vec
model transforms the query list into a vector using the infer_vector
function, and the most similar reviews are returned by the most_similar
function:
Note that the random
parameter of the model needs to be set up to a fixed value to return deterministic results whenever an optimization approach is used (negative sampling or hierarchical softmax). The results are as follows:
All three methods show movies related to the query. Interestingly, TF-IDF performs better than the more advanced LSA and Doc2Vec algorithms because In the Heat of the Night, Pokemon, Rocky Horror Picture Show, and Wild Things are not related to the query compared with the TF-IDF results which show only one movie (No Telling) as unrelated. The movies Charlie's Angels and Batman & Robin are action movies, so they are mostly related to the single query word action. Doc2Vec returns the worst results mostly because the training dataset is too small to learn good vector representations (as an example, Google released a word2vec trained dataset based on billions of documents, or more). The website http://www.cs.cornell.edu/people/pabo/movie-review-data/ provides a larger dataset, so the reader can try to train Doc2Vec with more data as an exercise.
18.221.254.61