Natural language processing

Natural language processing (NLP) involves analyzing text and carrying out predictive analysis on textual data. The algorithm we build here will address a simple problem, but the same approach applies to any text; we could, for example, predict the genre of a book with NLP.

Consider the following tab-separated values (TSV) file, a tab-delimited dataset to which we will apply NLP to see how it works:
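The rows shown here are a representative excerpt, reconstructed from the reviews quoted later in this section; the two columns are separated by a tab:

Review                                          Liked
Wow... Loved this place.                        1
Crust is not good.                              0
Not tasty and the texture was just nasty.      0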

This is a small portion of the data we will be working on. In this case, the data represents customer reviews of a restaurant. Each review is given as text along with a rating of 0 or 1 that indicates whether the customer liked the restaurant: 1 means the review is positive, and 0 means it is not.

Usually, we would use a CSV file. Here, however, we are using a TSV file, where the delimiter is a tab, because we are working with text-based data that may contain commas that don't indicate a separator. If we take the 14th record, for example, we can see a comma in the text. Had this been a CSV file, Python would have taken the first half of the sentence as the review and the second half as the rating, while the 1 would have been treated as a new review. This would mess up the whole model.

The dataset has around 1,000 reviews and has been labeled manually. Since we are importing a TSV file, some parameters of pandas.read_csv need to change. First of all, we specify that the delimiter is a tab, using \t. We should also ignore double quotes, which can be done by passing the parameter quoting=3:
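A minimal sketch of the import; the filename Restaurant_Reviews.tsv is illustrative:

import pandas as pd

# delimiter='\t' tells pandas the columns are tab-separated;
# quoting=3 (csv.QUOTE_NONE) tells it to ignore double quotes
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter='\t', quoting=3)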

The imported dataset is shown here:

We can see that the 1,000 reviews have been imported successfully. All the reviews are in the Review column and all the ratings are in the Liked column. In NLP, we have to clean text-based data before we use it. This is because NLP algorithms work on the bag-of-words concept, which means that only the words that contribute to a prediction are kept. The bag of words contains only the relevant words that impact the prediction; words such as a, the, on, and so on are considered irrelevant in this context. We also get rid of dots and numbers (unless the numbers are needed), and apply stemming to the words. An example of stemming is keeping the word love in place of loved. We apply stemming because we don't want to end up with too many words, and also to regroup variants such as loving and loved into one word, love. We also remove the capital letters and have everything in lowercase. To apply our bag-of-words model, we need to apply tokenization. After we do this, we will have the distinct words, because the pre-processing will have got rid of those that are irrelevant.

Then, we take all the words of the different reviews and make one column for each word. There are likely to be many columns, as there may be many different words across the reviews. Then, for each review, each column will contain a number that indicates how many times that word occurs in that specific review. This kind of matrix is called a sparse matrix, as there are likely to be lots of zeros in the dataset.

The dataset['Review'][0] command will give us the first review:
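For the data sketched above, this returns the raw text of the first review:

'Wow... Loved this place.'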

We use the sub function of Python's re (regular expressions) module, as follows:
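A minimal sketch, assuming the dataset imported above:

import re

# keep letters only; every non-letter character (dots, digits, and so on)
# is replaced by a single space
review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][0])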

The function we are using is called sub, short for substitute. It replaces specified characters in an input string with a character of your choice. The characters to be replaced can be given either as a plain string or as a regular expression. In the regular expression shown in the previous example, the ^ sign inside the brackets means not, so [^a-zA-Z] matches everything other than a-z and A-Z, and each match is replaced by a single space ' '. In the given string, the dots will be removed and replaced by spaces, producing this output: Wow Loved this place.

We now remove all non-significant words, such as the, a, this, and so on. To do this, we will use the nltk (Natural Language Toolkit) library. It has a module called stopwords, which contains generic words that are mostly irrelevant to the meaning of a sentence. To download the stopwords, we use the following command:
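In Python, the download is a one-time step:

import nltk
nltk.download('stopwords')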

This downloads the stop words to the current path, from where they can be used directly. First, we break each review into a list of words; then we move through the different words, compare them with the downloaded stopwords, and remove those that are unnecessary:
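A sketch of this step for a single review, continuing from the re.sub call above:

from nltk.corpus import stopwords

review = review.lower()   # everything to lowercase
review = review.split()   # tokenize the string into a list of words
# keep only the words that are not in the stopword list
review = [word for word in review if word not in set(stopwords.words('english'))]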

In the previous code snippet, we are using a for loop inside a list comprehension. The [] brackets in front of review signify that the result will be a list containing the words returned from the for loop, which in this case are the words that are not stopwords.

The code preceding the for loop indicates that we take each string word and add it to the list every time that word is present in the review list and not present in the stopwords.words('english') list. Note that we are making use of the set() function to convert the given stop word list to a set, because in Python the search operation over sets is much faster than over lists. Finally, review will hold the list of relevant words. In this case, for the first review, it will hold ['wow', 'loved', 'place'].

The next step is to perform stemming. We apply stemming to avoid sparsity, which occurs when we have lots and lots of zeros in our matrix (known as a sparse matrix). Stemming collapses variants of the same word into one, reducing the number of distinct words, and therefore the number of columns and the proportion of zeros in the matrix.

We will use the PorterStemmer class from nltk to apply stemming to each word:
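A sketch:

from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
# reduce each remaining word to its stem, for example loved -> love
review = [ps.stem(word) for word in review]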

Now, the review will hold ['wow', 'love', 'place'].

In this step, we will join the words in the transformed review list back into a single string by calling join, using a space ' ' as the separator between the words.
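In code:

# join the cleaned words back into one space-separated string
review = ' '.join(review)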

The review is now a string of relevant words all in lowercase:
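For the first review, this gives:

wow love place

Putting all the cleaning steps together over the 1,000 reviews builds the corpus list; a sketch, assuming the dataset imported earlier:

import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

corpus = []
ps = PorterStemmer()
for i in range(0, 1000):
    # keep letters only, then lowercase and tokenize
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
    review = review.lower().split()
    # remove stopwords and stem what remains
    review = [ps.stem(word) for word in review if word not in set(stopwords.words('english'))]
    # join the words back into a single string
    corpus.append(' '.join(review))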

After executing the code, if we compare the original dataset and the obtained corpus list, we will obtain the following:

Since the stopword list also contains the word not, the string at index 1, Crust is not good (which had a Liked rating of 0), became crust good. Likewise, would not go back became would go back. We need to make sure that this does not happen, because removing the negation flips the meaning of the review. One way to handle it is to customize the stop word list that we pass to set(stopwords.words('english')).
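One possible fix, sketched here, is to build the stopword list once and drop not from it before filtering:

all_stopwords = stopwords.words('english')
all_stopwords.remove('not')   # keep negations so that 'crust not good' survives cleaning
review = [ps.stem(word) for word in review if word not in set(all_stopwords)]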

Next, we will create the bag-of-words model. Here, the different words from the obtained corpus (the list of cleaned reviews) are taken, and a column is made for each distinct word. None of the words is repeated.

Thus, the words from strings such as wow love place, crust good, and tasti textur nasti will be taken, and a column will be made for each distinct word. Each row will correspond to a review, and each entry will be a number specifying how many times that word occurs in that specific review.

With this kind of setup, there will be many zeros in our table, because there may be words that do not appear frequently. The objective should always be to keep sparsity to a minimum, so that only the relevant words point to a prediction; this yields a better model. The sparse matrix we have just described will be our bag-of-words model, and it works just like the data in our earlier classification models: we have some independent variables that take some values (in this case, the independent variables are the review words) and, based on those values, we predict the dependent variable, which is whether the review is positive or not. On top of our bag-of-words model, we will apply a classification model to predict whether each new review is positive or negative. We will create the bag-of-words model with the help of tokenization and a tool called CountVectorizer.

We import this class with the following code:

from sklearn.feature_extraction.text import CountVectorizer

Next, we will create an instance of this class. Its parameters include stop words as one of the arguments, but since we have already removed stop words from our dataset, we do not need to do that again. The class also allows us to control the case and the token pattern. We could have performed all the previous steps with this class as well, but doing them separately gives us more granular control:
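A sketch of creating the instance and building the matrix of features; the max_features value is discussed below:

cv = CountVectorizer(max_features=1500)
# fit the vectorizer to the corpus and turn the sparse result into a dense array
X = cv.fit_transform(corpus).toarray()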

Note that the line cv.fit_transform will fit the vectorizer cv to the corpus and return the matrix of features that contains all the words of the corpus.

Up until now, we have made our bag of words, or sparse matrix: a matrix of independent variables, X. The next step is to use a classification model and train it over a part of the bag of words, X, and the dependent variable over the same indexes, Y. The dependent variable in this case is the Liked column.

Executing the preceding code will create a matrix of features with around 1,565 features (different columns). If the number of distinct features comes out to be very large, we can limit it by specifying a maximum threshold with max_features. Say we specify the threshold to be 1,500; then only the 1,500 most frequent distinct words will be kept in the sparse matrix, and those that are less frequent will be removed. This gives a better correlation between the independent and dependent variables, further reducing sparsity.

We now need to train our classification model on the bag-of-words model and the dependent variable:

Extract the dependent variable as follows:
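A sketch, assuming Liked is the second column of the dataset (this is the Y referred to above):

y = dataset.iloc[:, 1].values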

X and Y would look as follows:

Note that in the previous case, each index (0-1499) corresponds to a word in the original corpus list. We now have exactly what we had in the classification model: a matrix of independent variables and a result, 0 for a negative review and 1 for a positive review. However, we still have a significant amount of sparsity.

The next step for us is to make use of a classification model for training. There are two ways to choose a classification model. One way is to test all the classification models against our dataset and compare false positives and false negatives; the other is to rely on experience and past experiments. The most common models used alongside NLP are Naive Bayes and decision trees or random forest classification. In this tutorial, we will be using a Naive Bayes model:
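A sketch using the Gaussian variant from scikit-learn; MultinomialNB is another common choice for word-count features:

from sklearn.naive_bayes import GaussianNB

classifier = GaussianNB()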

The whole code is shown here:
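A sketch of the end-to-end training and evaluation, assuming the X and y built above; the random_state value is illustrative:

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

# 80/20 split: 800 reviews for training, 200 for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

# train the Naive Bayes classifier on the training set
classifier = GaussianNB()
classifier.fit(X_train, y_train)

# predict the test set and build the confusion matrix
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)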

From the preceding code, we can see that we are splitting the data into training and test sets at 80% and 20%: 800 observations go to the training set and 200 observations to the test set, and we will see how our model behaves. The confusion matrix after execution is as follows:
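Consistent with the counts described next, cm comes out as:

[[55 42]
 [12 91]]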

There are 55 correct predictions for negative reviews and 91 correct predictions for positive reviews. There are 42 incorrect predictions for negative reviews and 12 incorrect predictions for positive reviews. Therefore, out of 200 predictions, there are 146 total correct predictions, which is equal to 73%.
