Basics of Natural Language Processing

If machine learning models only operate on numerical data, how can we transform our text into a numerical representation? That is exactly the focus of Natural Language Processing (NLP). Let's take a brief look at how this is done.

We'll begin with a small corpus of three sentences:

  1. The new kitten played with the other kittens
  2. She ate lunch
  3. She loved her kitten

We'll first convert our corpus into a bag-of-words (BOW) representation, skipping preprocessing for now. This involves counting each word's occurrences to create what's called a term-document matrix. In a term-document matrix, each unique word is assigned to a column and each document is assigned to a row; at the intersection of the two is the count:

Sr. no. | the | new | kitten | played | with | other | kittens | she | ate | lunch | loved | her
1       | 2   | 1   | 1      | 1      | 1    | 1     | 1       | 0   | 0   | 0     | 0     | 0
2       | 0   | 0   | 0      | 0      | 0    | 0     | 0       | 1   | 1   | 1     | 0     | 0
3       | 0   | 0   | 1      | 0      | 0    | 0     | 0       | 1   | 0   | 0     | 1     | 1
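If you'd like to reproduce this matrix in code, here is a minimal sketch using scikit-learn's CountVectorizer (the choice of tooling here is an assumption; it's the same library we'll use for tf-idf later). Note that CountVectorizer orders its columns alphabetically, so they won't match the table's order exactly, and older versions of scikit-learn expose get_feature_names() instead of get_feature_names_out():

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'The new kitten played with the other kittens',
    'She ate lunch',
    'She loved her kitten',
]

# Each unique word becomes a column, each document a row, and each cell a count
vect = CountVectorizer()
bow = vect.fit_transform(corpus)

print(vect.get_feature_names_out())  # the 12 unique words, in alphabetical order
print(bow.toarray())                 # the 3 x 12 matrix of counts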

Notice that, for these three short sentences, we already have 12 features. As you might imagine, if we were dealing with actual documents, such as news articles or even books, the number of features would explode into the hundreds of thousands. To mitigate this explosion, we can take a number of steps to remove features that add little to no informational value to our analysis.

The first step we can take is to remove stop words. These are words that are so common that they typically tell you nothing about the content of the document. Common examples of English stop words are the, is, at, which, and on. We'll remove those, and recompute our term-document matrix:

Sr. no. | new | kitten | played | kittens | ate | lunch | loved
1       | 1   | 1      | 1      | 1       | 0   | 0     | 0
2       | 0   | 0      | 0      | 0       | 1   | 1     | 0
3       | 0   | 1      | 0      | 0       | 0   | 0     | 1
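In code, this step is a single extra argument. Here is a sketch, with the caveat that scikit-learn's built-in English stop-word list is much longer than the handful of words we removed by hand, so the surviving vocabulary may differ slightly from the table above:

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'The new kitten played with the other kittens',
    'She ate lunch',
    'She loved her kitten',
]

# stop_words='english' drops very common words such as 'the', 'she', and 'with'
vect = CountVectorizer(stop_words='english')
bow = vect.fit_transform(corpus)

print(vect.get_feature_names_out())
print(bow.toarray())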

 

As you can see, the number of features was reduced from 12 to 7. This is great, but we can take it even further by performing stemming or lemmatization. Notice that our matrix contains both kitten and kittens; by using stemming or lemmatization, we can consolidate them into just kitten:

Sr. no. | new | kitten | play | eat | lunch | love
1       | 1   | 2      | 1    | 0   | 0     | 0
2       | 0   | 0      | 0    | 1   | 1     | 0
3       | 0   | 1      | 0    | 0   | 0     | 1

Our new matrix consolidated kittens and kitten, but something else happened as well. We lost the suffixes of played and loved, and ate was transformed into eat. Why? This is what lemmatization does. If you remember your grade school grammar classes, we've gone from the inflectional form to the base form of the word. If that is lemmatization, then what is stemming? Stemming has the same goal, but uses a less sophisticated approach that can sometimes produce pseudo-words rather than the actual base form. For example, if you were to reduce ponies with lemmatization, you would get pony, but with stemming, you'd get poni.
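If you want to see the difference yourself, NLTK provides both a stemmer and a lemmatizer (the library is an assumption here, as the text doesn't prescribe one; the WordNet data may need to be downloaded on first use):

from nltk.stem import PorterStemmer, WordNetLemmatizer
# import nltk; nltk.download('wordnet')  # may be required the first time

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming chops off suffixes and can leave pseudo-words
print(stemmer.stem('ponies'))                   # poni
print(stemmer.stem('kittens'))                  # kitten

# Lemmatization maps inflected forms to their dictionary base form
print(lemmatizer.lemmatize('ponies'))           # pony
print(lemmatizer.lemmatize('ate', pos='v'))     # eat
print(lemmatizer.lemmatize('played', pos='v'))  # play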

Let's now go further to apply another transformation to our matrix. So far, we have used a simple count of each word, but we can apply an algorithm that will act like a filter on our data to enhance the words that are unique to each document. This algorithm is called term frequency-inverse document frequency (tf-idf).

We calculate this tf-idf ratio for each term in our matrix. Let's calculate it for a couple of examples. For the word new in document one, the term frequency is just the count, which is 1. The inverse document frequency is calculated as the log of the number of documents in the corpus divided by the number of documents the term appears in. For new, this is log(3/1), or .4771. So, for the complete tf-idf value, we have tf × idf; here, that is 1 × .4771, or just .4771. For the word kitten in document one, the tf-idf is 2 × log(3/2), or .3522.
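As a quick check, we can reproduce these two numbers in a few lines (assuming base-10 logarithms, which is what the figures above use; note that scikit-learn's TfidfVectorizer applies a slightly different smoothed and normalized formula, so its values won't match these exactly):

import math

# 'new' appears once in document one and in 1 of the 3 documents
tf, idf = 1, math.log10(3 / 1)
print(round(tf * idf, 4))  # 0.4771

# 'kitten' appears twice in document one and in 2 of the 3 documents
tf, idf = 2, math.log10(3 / 2)
print(round(tf * idf, 4))  # 0.3522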

Completing this for the remainder of the terms and documents, we have the following:

Sr. no. | new   | kitten | play  | eat   | lunch | love
1       | .4771 | .3522  | .4771 | 0     | 0     | 0
2       | 0     | 0      | 0     | .4771 | .4771 | 0
3       | 0     | .1761  | 0     | 0     | 0     | .4771

Why do all of this? Let's say, for example, we have a corpus of documents about many subjects (medicine, computing, food, animals, and so on) and we want to classify them into topics. Very few documents would contain the word sphygmomanometer, which is the device used to measure blood pressure; and all the documents that did would likely concern the topic of medicine. And obviously, the more times this word appears in a document, the more likely it is to be about medicine. So a term that occurs rarely across our entire corpus, but that is present many times in a document, makes it likely that this term is tied closely to the topic of that document. In this way, documents can be said to be represented by those terms with high tf-idf values.

With the help of this framework, we'll now convert our training set into a tf-idf matrix:

from sklearn.feature_extraction.text import TfidfVectorizer

# Tokenize into one- to three-word n-grams, drop English stop words,
# and ignore terms that appear in fewer than three documents
vect = TfidfVectorizer(ngram_range=(1, 3), stop_words='english', min_df=3)

# Fit on the article text and transform it into a sparse tf-idf matrix
tv = vect.fit_transform(df['text'])

With those three lines, we have converted all our documents into tf-idf vectors. We passed in a number of parameters: ngram_range, stop_words, and min_df. Let's discuss each.

First, ngram_range is how the document is tokenized. In our previous examples, we used each word as a token, but here, we are using all one- to three-word sequences as tokens. Let's take our second sentence, She ate lunch. We'll ignore stop words for the moment. The n-grams for this sentence would be: she, she ate, she ate lunch, ate, ate lunch, and lunch.
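To see this tokenization concretely, here is a small sketch (using CountVectorizer purely for illustration, which is an assumption; the snippet above uses TfidfVectorizer, but the ngram_range behavior is the same):

from sklearn.feature_extraction.text import CountVectorizer

# Tokenize 'She ate lunch' into every one- to three-word sequence
vect = CountVectorizer(ngram_range=(1, 3))
vect.fit(['She ate lunch'])
print(vect.get_feature_names_out())
# ate, ate lunch, lunch, she, she ate, she ate lunch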

Next, we have stop_words. We pass in 'english' to remove the English stop words which, as discussed previously, are terms so common that they carry little informational content.

And finally, we have min_df. This removes from consideration any term that doesn't appear in at least three documents, which eliminates very rare terms and reduces the size of our matrix.

Now that our article corpus is in a workable numerical format, we'll move on to feeding it to our classifier.
