Chapter 4. Text Classification

One of the more novel uses for binary classification is sentiment analysis, which examines a sample of text such as a product review, a tweet, or a comment left on a website and scores it on a scale of 0.0 to 1.0, where 0.0 represents negative sentiment and 1.0 represents positive sentiment. A review such as “great product at a great price” might score 0.9, while “overpriced product that barely works” might score 0.1. The score is the probability that the text expresses positive sentiment. Sentiment analysis models are difficult to build algorithmically but are relatively easy to craft with machine learning. For examples of how sentiment analysis is used in business today, see the article “8 Sentiment Analysis Real-World Use Cases” by Nicholas Bianchi.

Sentiment analysis is one example of a task that involves classifying textual data rather than numerical data. Because machine learning works with numbers, you must convert text to numbers before training a sentiment analysis model, a model that identifies spam emails, or any other model that classifies text. A common approach is to build a table of word frequencies called a bag of words. Scikit-Learn provides classes to help. It also includes support for normalizing text so that, for example, “awesome” and “Awesome” don’t count as two different words.

This chapter begins by describing how to prepare text for use in classification models. After building a sentiment analysis model, you’ll learn about another popular learning algorithm called Naive Bayes that works particularly well with text and use it to build a model that distinguishes between legitimate emails and spam emails. Finally, you’ll learn about a mathematical technique for measuring the similarity of two text samples and use it to build an app that recommends movies based on other movies you enjoy.

Preparing Text for Classification

Before you train a model to classify text, you must convert the text into numbers, a process known as vectorization. Chapter 1 presented the illustration reproduced in Figure 4-1, which demonstrates a common technique for vectorizing text. Each row represents a text sample such as a tweet or a movie review, and each column represents a word in the training text. The numbers in the rows are word counts, and the final number in each row is a label: 0 for negative and 1 for positive.

Figure 4-1. Dataset for sentiment analysis

Text is typically cleaned before it’s vectorized. Examples of cleaning include converting characters to lowercase (so, for example, “Excellent” is equivalent to “excellent”), removing punctuation symbols, and optionally removing stop words—common words such as the and and that are likely to have little impact on the outcome. Once cleaned, sentences are divided into individual words (tokenized) and the words are used to produce datasets like the one in Figure 4-1.

Scikit-Learn has three classes that handle the bulk of the work of cleaning and vectorizing text:

CountVectorizer
Creates a dictionary (vocabulary) from the corpus of words in the training text and generates a matrix of word counts like the one in Figure 4-1
HashingVectorizer
Uses word hashes rather than an in-memory vocabulary to produce word counts and is therefore more memory efficient
TfidfVectorizer
Creates a dictionary from words provided to it and generates a matrix similar to the one in Figure 4-1, but rather than containing integer word counts, the matrix contains term frequency-inverse document frequency (TFIDF) values between 0.0 and 1.0 reflecting the relative importance of individual words

All three classes are capable of converting text to lowercase, removing punctuation symbols, removing stop words, splitting sentences into individual words, and more. They also support n-grams, which are combinations of two or more consecutive words (you specify the number n) that should be treated as a single word. The idea is that words such as credit and score might be more meaningful if they appear next to each other in a sentence than if they appear far apart. Without n-grams, the relative proximity of words is ignored. The downside to using n-grams is that it increases memory consumption and training time. Used judiciously, however, it can make text classification models more accurate.
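To see the effect, here’s a minimal sketch (using a made-up sentence) showing how ngram_range=(1, 2) adds word pairs to the vocabulary alongside individual words:

from sklearn.feature_extraction.text import CountVectorizer

# Include unigrams (single words) and bigrams (adjacent word pairs)
vectorizer = CountVectorizer(ngram_range=(1, 2))
vectorizer.fit(['my credit score went up'])
print(vectorizer.get_feature_names_out())
# ['credit' 'credit score' 'my' 'my credit' 'score' 'score went' 'up'
#  'went' 'went up']

With n-grams in place, credit score becomes a feature in its own right rather than two unrelated counts.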

Note

Neural networks have other, more powerful ways of taking word order into account that don’t require related words to occur next to each other. A conventional machine learning model can’t connect the words blue and sky in the sentence “I like blue, for it is the color of the sky,” but a neural network can. I will shed more light on this in Chapter 13.

Here’s an example demonstrating what CountVectorizer does and how it’s used:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

lines = [
    'Four score and 7 years ago our fathers brought forth,',
    '... a new NATION, conceived in liberty $$$,',
    'and dedicated to the PrOpOsItIoN that all men are created equal',
    "One nation's freedom equals #freedom for another $nation!"
]

# Vectorize the lines
vectorizer = CountVectorizer(stop_words='english')
word_matrix = vectorizer.fit_transform(lines)

# Show the resulting word matrix
feature_names = vectorizer.get_feature_names_out()
line_names = [f'Line {i + 1}' for i in range(word_matrix.shape[0])]

df = pd.DataFrame(data=word_matrix.toarray(), index=line_names,
                  columns=feature_names)

df.head()

Here’s the output:

The corpus of text in this case is four strings in a Python list. CountVectorizer broke the strings into words, removed stop words and symbols, and converted all remaining words to lowercase. Those words comprise the columns in the dataset, and the numbers in the rows show how many times a given word appears in each string. The stop_words='english' parameter tells CountVectorizer to remove stop words using a built-in dictionary of more than 300 English-language stop words. If you prefer, you can provide your own list of stop words in a Python list. (Or you can leave the stop words in there; it often doesn’t matter.) And if you’re training with text written in another language, you can get lists of multilanguage stop words from other Python libraries such as the Natural Language Toolkit (NLTK) and Stop-words.

Observe from the output that equal and equals count as separate words, even though they have similar meaning. Data scientists sometimes go a step further when preparing text for machine learning by stemming or lemmatizing words. If the preceding text were stemmed, all occurrences of equals would be converted to equal. Scikit lacks support for stemming and lemmatization, but you can get it from other libraries such as NLTK.
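Here’s a minimal sketch of what lemmatization looks like with NLTK; it assumes the library is installed and the WordNet corpus has been downloaded with nltk.download('wordnet'):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Treating each word as a verb maps inflected forms to a base form
print(lemmatizer.lemmatize('equals', pos='v'))   # equal
print(lemmatizer.lemmatize('created', pos='v'))  # create

You could wrap calls like these in a custom preprocessor or tokenizer function and pass it to CountVectorizer.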

CountVectorizer removes punctuation symbols, but it doesn’t remove numbers. It ignored the 7 in line 1 because it ignores single characters. But if you changed 7 to 777, the term 777 would appear in the vocabulary. One way to fix that is to define a function that removes numbers and pass it to CountVectorizer via the preprocessor parameter:

import re

def preprocess_text(text):
    # Strip digits, then convert what remains to lowercase
    return re.sub(r'\d+', '', text).lower()

vectorizer = CountVectorizer(stop_words='english', preprocessor=preprocess_text)
word_matrix = vectorizer.fit_transform(lines)

Note the call to lower to convert the text to lowercase. CountVectorizer doesn’t convert text to lowercase if you provide a preprocessing function, so the preprocessing function must handle the conversion itself. It still removes punctuation characters, however.

Another useful parameter to CountVectorizer is min_df, which ignores words that appear fewer than the specified number of times. It can be an integer specifying a minimum count (for example, ignore words that appear fewer than five times in the training text, or min_df=5), or it can be a floating-point value from 0.0 to 1.0 specifying the minimum percentage of samples in which a word must appear—for example, ignore words that appear in less than 10% of the samples (min_df=0.1). It’s great for filtering out words that probably aren’t meaningful anyway, and it reduces memory consumption and training time by decreasing the size of the vocabulary. CountVectorizer also supports a max_df parameter for eliminating words that appear too frequently.
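For example, the following statement (with arbitrary thresholds) ignores words that appear in fewer than 5 samples as well as words that appear in more than 80% of the samples:

vectorizer = CountVectorizer(stop_words='english', min_df=5, max_df=0.8)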

The preceding examples use CountVectorizer, which probably leaves you wondering when (and why) you would use HashingVectorizer or TfidfVectorizer instead. HashingVectorizer is useful when dealing with large datasets. Rather than store words in memory, it hashes each word and uses the hash as an index into an array of word counts. It can therefore do more with less memory and is very useful for reducing the size of vectorizers when serializing them so that you can restore them later—a topic I’ll say more about in Chapter 7. The downside to HashingVectorizer is that it doesn’t let you work backward from vectorized text to the original text. CountVectorizer does, and it provides an inverse_transform method for that purpose.
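Here’s a sketch of HashingVectorizer applied to the lines list from earlier; n_features sets the number of hash buckets and is an arbitrary choice here:

from sklearn.feature_extraction.text import HashingVectorizer

# Hash words into 2**18 buckets rather than building an in-memory
# vocabulary
vectorizer = HashingVectorizer(n_features=2**18, stop_words='english')
word_matrix = vectorizer.fit_transform(lines)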

TfidfVectorizer is frequently used to perform keyword extraction: examining a document or set of documents and extracting keywords that characterize their content. It assigns words numerical weights reflecting their importance, and it uses two factors to determine the weights: how often a word appears in individual documents, and how often it appears in the overall document set. Words that appear more frequently in individual documents but occur in fewer documents receive higher weights. I won’t go further into it here, but if you’re curious to learn more, this book’s GitHub repo contains a notebook that uses TfidfVectorizer to extract keywords from the manuscript of Chapter 1.
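Still, as a small taste, here’s a minimal sketch that reuses the lines list from earlier and prints the three highest-weighted words in the first line:

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(lines)

# Sort the first row's TFIDF weights in descending order and show
# the words with the three highest weights
weights = tfidf_matrix.toarray()[0]
top = np.argsort(weights)[::-1][:3]
print(vectorizer.get_feature_names_out()[top])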

Sentiment Analysis

To train a sentiment analysis model, you need a labeled dataset. Several such datasets are available in the public domain. One of those is the IMDB movie review dataset, which contains 25,000 samples of negative reviews and 25,000 samples of positive reviews posted on the Internet Movie Database website. Each review is meticulously labeled with a 0 for negative sentiment or a 1 for positive sentiment. To demonstrate how sentiment analysis works, let’s build a binary classification model and train it with this dataset. We’ll use logistic regression as the learning algorithm. A sentiment analysis score yielded by this model is simply the probability that the input expresses positive sentiment, which is easily retrieved by calling LogisticRegression’s predict_proba method.

Start by downloading the dataset and copying it to the Data subdirectory of the directory that hosts your Jupyter notebooks. Then run the following code in a notebook to load the dataset and show the first five rows:

import pandas as pd
 
df = pd.read_csv('Data/reviews.csv', encoding='ISO-8859-1')
df.head()

The encoding parameter is necessary because the CSV file uses ISO-8859-1 character encoding rather than UTF-8. The output is as follows:

Find out how many rows the dataset contains and confirm that there are no missing values:

df.info()

Use the following statement to see how many instances there are of each class (0 for negative and 1 for positive):

df.groupby('Sentiment').describe()

Here is the output:

There are equal numbers of positive and negative samples, but in each case, the number of unique samples is less than the total number of samples for that class. That means the dataset contains duplicate rows, and duplicate rows could bias a machine learning model. Use the following statements to delete the duplicate rows and check for balance again:

df = df.drop_duplicates()
df.groupby('Sentiment').describe()

Now there are no duplicate rows, and the number of positive and negative samples is roughly equal.

Next, use CountVectorizer to prepare and vectorize the text in the Text column. Set min_df to 20 to ignore words that appear infrequently in the training text. This reduces the likelihood of out-of-memory errors and will probably make the model more accurate as well. Also use the ngram_range parameter to allow CountVectorizer to include word pairs as well as individual words:

from sklearn.feature_extraction.text import CountVectorizer
 
vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words='english',
                             min_df=20)

x = vectorizer.fit_transform(df['Text'])
y = df['Sentiment']

Now split the dataset for training and testing. We’ll use a 50/50 split since there are almost 50,000 samples in total:

from sklearn.model_selection import train_test_split
 
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5,
                                                    random_state=0)

The next step is to train a classifier. We’ll use Scikit’s LogisticRegression class, which uses logistic regression to fit a model to the data:

from sklearn.linear_model import LogisticRegression
 
model = LogisticRegression(max_iter=1000, random_state=0)
model.fit(x_train, y_train)

Validate the trained model with the 50% of the dataset set aside for testing and show the results in a confusion matrix:

%matplotlib inline
from sklearn.metrics import ConfusionMatrixDisplay as cmd

cmd.from_estimator(model, x_test, y_test,
                   display_labels=['Negative', 'Positive'],
                   cmap='Blues', xticks_rotation='vertical')

The confusion matrix reveals that the model correctly identified 10,795 negative reviews while misclassifying 1,574 of them. It correctly identified 10,966 positive reviews and got it wrong 1,456 times:

Now comes the fun part: analyzing text for sentiment. Use the following statements to produce a sentiment score for the sentence “The long lines and poor customer service really turned me off”:

text = 'The long lines and poor customer service really turned me off'
model.predict_proba(vectorizer.transform([text]))[0][1]

Here’s the output:

0.09183447847778639

Now do the same for “The food was great and the service was excellent!”:

text = 'The food was great and the service was excellent!'
model.predict_proba(vectorizer.transform([text]))[0][1]

If you expected a higher score for this one, you won’t be disappointed:

0.8536277207125618

Feel free to try sentences of your own and see if you agree with the sentiment scores the model predicts. It’s not perfect, but it’s good enough that if you run hundreds of reviews or comments through it, you should get a reliable indication of the sentiment expressed in the text.
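If you do want to score reviews in volume, note that the vectorizer and the model both accept batches. Here’s a hypothetical helper (score_sentiment is my name, not Scikit’s) that scores a list of strings with one call:

def score_sentiment(texts):
    # Return the probability of positive sentiment for each string
    return model.predict_proba(vectorizer.transform(texts))[:, 1]

score_sentiment(['Great experience, highly recommended!',
                 'Slow service and cold food'])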

Note

Sometimes CountVectorizer’s built-in list of stop words lowers the accuracy of a model because the list is so broad. As an experiment, remove stop_words='english' from CountVectorizer and run the code again. Check the confusion matrix. Does the accuracy increase or decrease? Feel free to vary other parameters such as min_df and ngram_range too. In the real world, data scientists often try many different parameter combinations to determine which one produces the best results.

Naive Bayes

Logistic regression is a go-to algorithm for classification models and is often very effective at classifying text. But in scenarios involving text classification, data scientists often turn to another learning algorithm called Naive Bayes. It’s a classification algorithm based on Bayes’ theorem, which provides a means for calculating conditional probabilities. Mathematically, Bayes’ theorem is stated this way:

P(A|B) = P(B|A) · P(A) / P(B)

This says the probability that A is true given that B is true is equal to the probability that B is true given that A is true multiplied by the probability that A is true divided by the probability that B is true. That’s a mouthful, and while accurate, it doesn’t explain why Naive Bayes is so useful for classifying text—or how you apply it, for example, to a collection of emails to determine which ones are spam.

Let’s start with a simple example. Suppose 10% of all the emails you receive are spam. That’s P(A). Analysis reveals that 5% of the spam emails you receive contain the word congratulations, but just 1% of all your emails contain the same word. Therefore, P(B|A) is 0.05 and P(B) is 0.01. The probability of an email being spam if it contains the word congratulations is P(A|B), which is (0.05 x 0.10) / 0.01, or 0.50.
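Expressed in code, the arithmetic looks like this:

p_spam = 0.10             # P(A): fraction of all email that is spam
p_word_given_spam = 0.05  # P(B|A): spam emails containing the word
p_word = 0.01             # P(B): all emails containing the word

# P(A|B) = P(B|A) * P(A) / P(B)
print(p_word_given_spam * p_spam / p_word)  # 0.5 (modulo floating-point rounding)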

Of course, a spam filter must consider all the words in an email, not just one. It turns out that if you make some simple (naive) assumptions—that the order of the words in an email doesn’t matter, and that every word has equal weight—you can write Bayes’ equation this way for a spam classifier:

P(S|message) ∝ P(S) · P(word1|S) · P(word2|S) · ... · P(wordn|S)

In plain English, the probability that a message is spam is proportional to the product of:

  • The probability that any message in the dataset is spam, or P(S)

  • The probability that each word in the message appears in a spam message, or P(word|S)

P(S) can be calculated easily enough: it’s simply the fraction of the messages in the dataset that are spam messages. If you train a machine learning model with 1,000 messages and 500 of them are spam, then P(S) = 0.5. For a given word, P(word|S) is simply the number of times the word appears in spam messages divided by the number of words in all the spam messages. The entire problem reduces to word counts. You can do a similar calculation to compute the probability that the message is not spam, and then use the higher of the two probabilities to make a prediction.

Here’s an example involving four sample emails. The emails are:

Text                                            Spam
Raise your credit score in minutes              1
Here are the minutes from yesterday’s meeting   0
Meeting tomorrow to review yesterday’s scores   0
Score tomorrow’s meds at yesterday’s prices     1

If you remove stop words, convert characters to lowercase, and stem the words such that tomorrow’s becomes tomorrow, you’re left with this:

Text                                      Spam
raise credit score minute                 1
minute yesterday meeting                  0
meeting tomorrow review yesterday score   0
score tomorrow med yesterday price        1

Because two of the four messages are spam and two are not, the probability that any message is spam (P(S)) is 0.5. The same goes for the probability that any message is not spam (P(N) = 0.5). In addition, the spam messages contain a total of nine words, while the nonspam messages contain a total of eight.

The next step is to build the following table of word frequencies. Take the word yesterday as an example. It appears once in a message that’s labeled as spam, so P(yesterday|S) is 1/9, or 0.111. It appears twice in nonspam messages, so P(yesterday|N) is 2/8, or 0.250:

Word        P(word|S)      P(word|N)
raise       1/9 = 0.111    0/8 = 0.000
credit      1/9 = 0.111    0/8 = 0.000
score       2/9 = 0.222    1/8 = 0.125
minute      1/9 = 0.111    1/8 = 0.125
yesterday   1/9 = 0.111    2/8 = 0.250
meeting     0/9 = 0.000    2/8 = 0.250
tomorrow    1/9 = 0.111    1/8 = 0.125
review      0/9 = 0.000    1/8 = 0.125
med         1/9 = 0.111    0/8 = 0.000
price       1/9 = 0.111    0/8 = 0.000

This works up to a point, but the zeros in the table are a problem. Let’s say you want to determine whether “Scores must be reviewed by tomorrow” is spam. Removing stop words leaves you with “score review tomorrow.” You can compute the probability that the message is spam this way:

P(S|score review tomorrow) = P(S) · P(score|S) · P(review|S) · P(tomorrow|S)
P(S|score review tomorrow) = 0.5 · 0.222 · 0.000 · 0.111
P(S|score review tomorrow) = 0

The result is 0 because review doesn’t appear in a spam message, and 0 times anything is 0. The algorithm simply can’t assign a spam probability to “Scores must be reviewed by tomorrow.”

A common way to resolve this is to apply Laplace smoothing, also known as additive smoothing. Typically, this involves adding 1 to each numerator and the number of unique words in the dataset (in this case, 10) to each denominator. Now, P(review|S) evaluates to (0 + 1) / (9 + 10), which equals 0.053. It’s not much, but it’s better than nothing (literally). Here are the word frequencies again, this time revised with Laplace smoothing:

Word        P(word|S)                    P(word|N)
raise       (1 + 1) / (9 + 10) = 0.105   (0 + 1) / (8 + 10) = 0.056
credit      (1 + 1) / (9 + 10) = 0.105   (0 + 1) / (8 + 10) = 0.056
score       (2 + 1) / (9 + 10) = 0.158   (1 + 1) / (8 + 10) = 0.111
minute      (1 + 1) / (9 + 10) = 0.105   (1 + 1) / (8 + 10) = 0.111
yesterday   (1 + 1) / (9 + 10) = 0.105   (2 + 1) / (8 + 10) = 0.167
meeting     (0 + 1) / (9 + 10) = 0.053   (2 + 1) / (8 + 10) = 0.167
tomorrow    (1 + 1) / (9 + 10) = 0.105   (1 + 1) / (8 + 10) = 0.111
review      (0 + 1) / (9 + 10) = 0.053   (1 + 1) / (8 + 10) = 0.111
med         (1 + 1) / (9 + 10) = 0.105   (0 + 1) / (8 + 10) = 0.056
price       (1 + 1) / (9 + 10) = 0.105   (0 + 1) / (8 + 10) = 0.056

Now you can determine whether “Scores must be reviewed by tomorrow” is spam by performing two simple calculations:

P(S|score review tomorrow) = 0.5 · 0.158 · 0.053 · 0.105 = 0.000440
P(N|score review tomorrow) = 0.5 · 0.111 · 0.111 · 0.111 = 0.000684

By this measure, “Scores must be reviewed by tomorrow” is most likely not spam. The probabilities are relative, but you could normalize them and conclude that, based on the emails the model was trained with, there’s about a 40% chance the message is spam and a 60% chance it’s not.
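You can verify the arithmetic with a few lines of Python using the unrounded fractions:

# Each numerator adds 1; each denominator adds the 10 unique words
p_spam = 0.5 * (3/19) * (1/19) * (2/19)  # P(S|score review tomorrow)
p_ham = 0.5 * (2/18) * (2/18) * (2/18)   # P(N|score review tomorrow)

print(p_spam / (p_spam + p_ham))  # ~0.39
print(p_ham / (p_spam + p_ham))   # ~0.61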

Fortunately, you don’t have to do these computations by hand. Scikit-Learn provides several classes to help out, including the MultinomialNB class, which works great with tables of word counts produced by CountVectorizer.

Spam Filtering

It’s no coincidence that modern spam filters are remarkably adept at identifying spam. Virtually all of them rely on machine learning. Spam filters are difficult to implement algorithmically because an algorithm that uses keywords such as credit and score to determine whether an email is spam is easily fooled. Machine learning, by contrast, looks at a body of emails and uses what it learns to classify the next email. Such models often achieve more than 99% accuracy. And they get smarter over time as they’re trained with more and more emails.

The previous example used logistic regression to predict whether text input to it expresses positive or negative sentiment. It used the probability that the text expresses positive sentiment as a sentiment score, and you saw that expressions such as “The long lines and poor customer service really turned me off” score close to 0.0, while expressions such as “The food was great and the service was excellent” score close to 1.0. Now let’s build a binary classification model that classifies emails as spam or not spam and use Naive Bayes to fit the model to the training data.

There are several spam classification datasets available in the public domain. Each contains a collection of emails with samples labeled with 1s for spam and 0s for not spam. We’ll use a relatively small dataset containing 1,000 samples. Begin by downloading the dataset and copying it into your notebooks’ Data subdirectory. Then load the data and display the first five rows:

import pandas as pd
 
df = pd.read_csv('Data/ham-spam.csv')
df.head()

Now check for duplicate rows in the dataset:

df.groupby('IsSpam').describe()

The dataset contains one duplicate row. Let’s remove it and check for balance:

df = df.drop_duplicates()
df.groupby('IsSpam').describe()

The dataset now contains 499 samples that are not spam, and 500 that are. The next step is to use CountVectorizer to vectorize the emails. Once more, we’ll allow CountVectorizer to consider word pairs as well as individual words and remove stop words using Scikit’s built-in dictionary of English stop words:

from sklearn.feature_extraction.text import CountVectorizer
 
vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words='english')
x = vectorizer.fit_transform(df['Text'])
y = df['IsSpam']

Split the dataset so that 80% can be used for training and 20% for testing:

from sklearn.model_selection import train_test_split
 
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2,
                                                    random_state=0)

The next step is to train a Naive Bayes classifier using Scikit’s MultinomialNB class:

from sklearn.naive_bayes import MultinomialNB
 
model = MultinomialNB()
model.fit(x_train, y_train)

Validate the trained model with the 20% of the dataset set aside for testing using a confusion matrix:

%matplotlib inline
from sklearn.metrics import ConfusionMatrixDisplay as cmd

cmd.from_estimator(model, x_test, y_test,
                   display_labels=['Not Spam', 'Spam'],
                   cmap='Blues', xticks_rotation='vertical')

The model correctly identified 101 of 102 legitimate emails as not spam, and 95 of 98 spam emails as spam:

Use the score method to get a rough measure of the model’s accuracy:

model.score(x_test, y_test)

Now use Scikit’s RocCurveDisplay class to visualize the ROC curve:

from sklearn.metrics import RocCurveDisplay as rcd
import seaborn as sns
sns.set()

rcd.from_estimator(model, x_test, y_test)

The results are encouraging. Even though the model was trained with just 999 samples, the area under the ROC curve (AUC) exceeds 0.999, indicating that the model does an excellent job of separating spam from legitimate email:

Let’s see how the model classifies a few emails that it hasn’t seen before, starting with one that isn’t spam. The model’s predict method predicts a class—0 for not spam, or 1 for spam:

msg = 'Can you attend a code review on Tuesday to make sure the logic is solid?'
input = vectorizer.transform([msg])
model.predict(input)[0]

The model says this message is not spam, but what’s the probability that it’s not spam? You can get that from predict_proba, which returns an array containing two values: the probability that the predicted class is 0, and the probability that the predicted class is 1, in that order:

model.predict_proba(input)[0][0]

The model seems very sure that this email is legitimate:

0.9999497111473539

Now test the model with a spam message:

msg = 'Why pay more for expensive meds when you can order them online ' \
      'and save $$$?'

input = vectorizer.transform([msg])
model.predict(input)[0]

What is the probability that the message is not spam?

model.predict_proba(input)[0][0]

The answer is:

0.00021423891260677753

What is the probability that the message is spam?

model.predict_proba(input)[0][1]

And the answer is:

0.9997857610873945

Observe that predict and predict_proba accept a list of inputs. Based on that, could you classify an entire batch of emails with one call to either method? How would you get the results for each email?
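One way to find out is to try it. Here’s a sketch that classifies two messages at once; each element of the output corresponds to one email:

msgs = ['Lunch at noon tomorrow?',
        'Lowest prices on meds, click now!!!']

inputs = vectorizer.transform(msgs)
print(model.predict(inputs))              # one class per message
print(model.predict_proba(inputs)[:, 1])  # spam probability per message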

Recommender Systems

Another branch of machine learning that has proven its mettle in recent years is recommender systems—systems that recommend products or services to customers. Amazon’s recommender system reportedly drives 35% of its sales. The good news is that you don’t have to be Amazon to benefit from a recommender system, nor do you have to have Amazon’s resources to build one. They’re relatively simple to create once you learn a few basic principles.

Recommender systems come in many forms. Popularity-based systems present options to customers based on what products and services are popular at the time—for example, “Here are this week’s bestsellers.” Collaborative systems make recommendations based on what others have selected, as in “People who bought this book also bought these books.” Neither of these systems requires machine learning.

Content-based systems, by contrast, benefit greatly from machine learning. An example of a content-based system is one that says “if you bought this book, you might like these books also.” These systems require a means for quantifying similarity between items. If you liked the movie Die Hard, you might or might not like Monty Python and the Holy Grail. If you liked Toy Story, you’ll probably like A Bug’s Life too. But how do you make that determination algorithmically?

Content-based recommenders require two ingredients: a way to vectorize—convert to numbers—the attributes that characterize a service or product, and a means for calculating similarity between the resulting vectors. The first one is easy. CountVectorizer converts text into tables of word counts. All you need is a way to measure similarity between rows of word counts and you can build a recommender system. And one of the simplest and most effective ways to do that is a technique called cosine similarity.

Cosine Similarity

Cosine similarity is a mathematical means for computing the similarity between pairs of vectors (or rows of numbers treated as vectors). The basic idea is to take the values in a sample—for example, the word counts in a row of vectorized text—and use them as the coordinates of a vector’s endpoint, with the other endpoint at the origin of the coordinate system. Do that for two samples, and then compute the cosine of the angle between the vectors in m-dimensional space, where m is the number of values in each sample. Because the cosine of 0 is 1, two identical vectors have a similarity of 1. The more dissimilar the vectors, the closer the cosine will be to 0.
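For reference, the cosine of the angle between two vectors A and B is their dot product divided by the product of their magnitudes:

cos(θ) = (A · B) / (‖A‖ · ‖B‖)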

Here’s an example in two-dimensional space to illustrate. Suppose you have three rows containing two values each:

Row 1: 1, 2
Row 2: 2, 3
Row 3: 3, 1

You want to determine whether row 2 is more similar to row 1 or row 3. It’s hard to tell just by looking at the numbers, and in real life, there are many more numbers. If you simply added the numbers in each row and compared the sums, you would conclude that row 2 is more similar to row 3. But what if you treated each row as a vector, as shown in Figure 4-2?

  • Row 1: (0, 0) → (1, 2)

  • Row 2: (0, 0) → (2, 3)

  • Row 3: (0, 0) → (3, 1)

Figure 4-2. Cosine similarity

Now you can plot each row as a vector, compute the cosines of the angle between rows 1 and 2 and the angle between rows 2 and 3, and determine that row 2 is more like row 1 than row 3. That’s cosine similarity in a nutshell.

Cosine similarity isn’t limited to two dimensions; it works in higher-dimensional space as well. To help compute cosine similarities regardless of the number of dimensions, Scikit offers the cosine_similarity function. The following code computes the cosine similarities of the three samples in the preceding example:

from sklearn.metrics.pairwise import cosine_similarity

data = [[1, 2], [2, 3], [3, 1]]
cosine_similarity(data)

The return value is a similarity matrix containing the cosines of every vector pair. The width and height of the matrix equals the number of samples:

array([[1.        , 0.99227788, 0.70710678],
       [0.99227788, 1.        , 0.78935222],
       [0.70710678, 0.78935222, 1.        ]])

From this, you can see that the similarity of rows 1 and 2 is 0.992, while the similarity of rows 2 and 3 is 0.789. In other words, row 2 is more similar to row 1 than it is to row 3. There is also more similarity between rows 2 and 3 (0.789) than there is between rows 1 and 3 (0.707).
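You can check one of these values by hand. For rows 1 and 2, the cosine is the dot product of the two vectors divided by the product of their magnitudes:

import numpy as np

a = np.array([1, 2])
b = np.array([2, 3])

# (1*2 + 2*3) / (sqrt(5) * sqrt(13))
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))  # 0.992...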

Building a Movie Recommendation System

Let’s put cosine similarity to work building a content-based recommender system for movies. Start by downloading the dataset, which is one of several movie datasets available from Kaggle.com. This one has information for about 4,800 movies, including title, budget, genres, keywords, cast, and more. Place the CSV file in your Jupyter notebooks’ Data subdirectory. Then load the dataset and peruse its contents:

import pandas as pd
 
df = pd.read_csv('Data/movies.csv')
df.head()

The dataset contains 24 columns, only a few of which are needed to describe a movie. Use the following statements to extract key columns such as title and genres and fill missing values with empty strings:

df = df[['title', 'genres', 'keywords', 'cast', 'director']]
df = df.fillna('') # Fill missing values with empty strings
df.head()

Next, add a column named features that combines all the words in the other columns:

df['features'] = df['title'] + ' ' + df['genres'] + ' ' + \
                 df['keywords'] + ' ' + df['cast'] + ' ' + \
                 df['director']

Use CountVectorizer to vectorize the text in the features column:

from sklearn.feature_extraction.text import CountVectorizer
 
vectorizer = CountVectorizer(stop_words='english', min_df=20)
word_matrix = vectorizer.fit_transform(df['features'])
word_matrix.shape

The table of word counts contains 4,803 rows—one for each movie—and 918 columns. The next task is to compute cosine similarities for each row pair:

from sklearn.metrics.pairwise import cosine_similarity
 
sim = cosine_similarity(word_matrix)

Ultimately, the goal of this system is to input a movie title and identify the n movies that are most similar to that movie. To that end, define a function named get_recommendations that accepts a movie title, a DataFrame containing information about all the movies, a similarity matrix, and the number of movie titles to return:

def get_recommendations(title, df, sim, count=10):
    # Get the row index of the specified title in the DataFrame
    index = df.index[df['title'].str.lower() == title.lower()]
     
    # Return an empty list if there is no entry for the specified title
    if len(index) == 0:
        return []
 
    # Get the corresponding row in the similarity matrix
    similarities = list(enumerate(sim[index[0]]))
     
    # Sort the similarity scores in that row in descending order
    recommendations = sorted(similarities, key=lambda x: x[1], reverse=True)
     
    # Get the top n recommendations, ignoring the first entry in the list since
    # it corresponds to the title itself (and thus has a similarity of 1.0)
    top_recs = recommendations[1:count + 1]
 
    # Generate a list of titles from the indexes in top_recs
    titles = []
 
    for i in range(len(top_recs)):
        title = df.iloc[top_recs[i][0]]['title']
        titles.append(title)
 
    return titles

This function sorts the cosine similarities in descending order to identify the count movies most like the one identified by the title parameter. Then it returns the titles of those movies.

Now use get_recommendations to search the database for similar movies. First ask for the 10 movies that are most similar to the James Bond thriller Skyfall:

get_recommendations('Skyfall', df, sim)

Here is the output:

['Spectre',
 'Quantum of Solace',
 'Johnny English Reborn',
 'Clash of the Titans',
 'Die Another Day',
 'Diamonds Are Forever',
 'Wrath of the Titans',
 'I Spy',
 'Sanctum',
 'Blackthorn']

Call get_recommendations again to list movies that are like Mulan:

get_recommendations('Mulan', df, sim)

Feel free to try other movies as well. Note that you can only input movie titles that are in the dataset. Use the following statements to print a complete list of titles:

pd.set_option('display.max_rows', None)
print(df['title'])

I think you’ll agree that the system does a pretty credible job of picking similar movies. Not bad for about 20 lines of code!

Summary

Machine learning models that classify text are common and see a variety of uses in industry and in everyday life. What rational human being doesn’t wish for a magic wand that eradicates spam emails, for example?

Text used to train a text classification model must be prepared and vectorized prior to training. Preparation includes converting characters to lowercase and removing punctuation characters, and may include removing stop words, removing numbers, and stemming or lemmatizing. Once prepared, text is vectorized by converting it into a table of word frequencies. Scikit’s CountVectorizer class makes short work of the vectorization process and handles some of the preparation duties too.

Logistic regression and other popular classification algorithms can be used to classify text once it’s converted to numerical form. For text classification tasks, however, the Naive Bayes learning algorithm frequently outperforms other algorithms. By making a few “naive” assumptions such as that the order in which words appear in a text sample doesn’t matter, Naive Bayes reduces to a process of word counting. Scikit’s MultinomialNB class provides a handy Naive Bayes implementation.

Cosine similarity is a mathematical means for computing the similarity between two rows of numbers. One use for it is building systems that recommend products or services based on other products or services that a customer has purchased. Word frequency tables produced from textual descriptions by CountVectorizer can be combined with cosine similarity to create intelligent recommender systems intended to supplement a company’s bottom line.

Feel free to use this chapter’s examples as a starting point for experiments of your own. For instance, see if you can tweak the parameters passed to CountVectorizer in any of the examples and increase the accuracy of the resulting model. Data scientists call the search for the optimum parameter combination hyperparameter tuning, and it’s a subject you’ll learn about in the next chapter.
