Overview
In this chapter, you will be able to categorize data based on its content and structure. You will be able to describe preprocessing steps in detail and implement them to clean up text data. You will learn about feature engineering and calculate the similarity between texts. Once you understand these concepts, you will be able to use word clouds and some other techniques to visualize text.
In the previous chapter, we learned about the concepts of Natural Language Processing (NLP) and text analytics. We also took a quick look at various preprocessing steps. In this chapter, we will learn how to make text understandable to machine learning algorithms.
As we know, to use a machine learning algorithm on textual data, we need a numerical or vector representation of text data since most of these algorithms are unable to work directly with plain text or strings. But before converting the text data into numerical form, we will need to pass it through some preprocessing steps such as tokenization, stemming, lemmatization, and stop-word removal.
So, in this chapter, we will learn a little bit more about these preprocessing steps and how to extract features from the preprocessed text and convert them into vectors. We will also explore two popular methods for feature extraction (Bag of Words and Term Frequency-Inverse Document Frequency), as well as various methods for finding similarity between different texts. By the end of this chapter, you will have gained an in-depth understanding of how text data can be visualized.
To deal with data effectively, we need to understand the various forms in which it exists. First, let's explore the types of data that exist. There are two main ways to categorize data (by structure and by content), as explained in the upcoming sections.
Data can be divided on the basis of structure into three categories, namely, structured, semi-structured, and unstructured data, as shown in the following diagram:
These three categories are as follows:
The preceding table contains information about five people, with each row representing a person and each column representing one of their attributes.
The format shown in the preceding screenshot is called markup language format. Here, the data is stored between tags, hierarchically. It is a universally accepted format, and there are a lot of parsers available that can convert this data into structured data.
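For instance, the following is a minimal sketch of how such markup can be parsed into structured records using Python's standard library (the XML snippet and its tag names here are hypothetical, invented purely for illustration):
import xml.etree.ElementTree as ET
# A hypothetical markup snippet in which employee records
# are stored hierarchically between tags.
xml_data = """
<employees>
    <employee><name>Asha</name><age>29</age></employee>
    <employee><name>Ravi</name><age>34</age></employee>
</employees>
"""
root = ET.fromstring(xml_data)
# Each <employee> element becomes one structured, row-like record.
records = [{child.tag: child.text for child in emp} for emp in root]
print(records)
# [{'name': 'Asha', 'age': '29'}, {'name': 'Ravi', 'age': '34'}]
Each dictionary in records corresponds to a row of a structured table, which is exactly the conversion such parsers enable.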
This is called unstructured data because if we want to extract employee details from the preceding text snippet with our program, we will not be able to do so by simple parsing. We would have to make our algorithm understand the semantics of the language before it could extract any information from this text.
Data can be divided into four categories based on content, as shown in the following diagram:
Let's look at each category here:
With that, we have learned about the different types of data and their categorization on the basis of structure and content. When dealing with unstructured data, it is necessary to clean it first. In the next section, we will look into some of the preprocessing steps for cleaning data.
The text data that we are going to discuss here is unstructured text data, which consists of written sentences. Most of the time, this text data cannot be used as is for analysis because it contains noisy elements, that is, elements that do not really contribute to the meaning of the sentence. If these noisy elements are not removed, they can not only waste system memory and processing time, but also negatively impact the accuracy of the results. Data cleaning is the art of extracting the meaningful portions of data by eliminating unnecessary details. Consider the sentence, "He tweeted, 'Live coverage of General Elections available at this.tv/show/ge2019. _/\_ Please tune in :) '. "
In this example, to perform NLP tasks on the sentence, we will need to remove the emojis, punctuation, and stop words, and then change the words into their base grammatical form.
To achieve this, methods such as stopword removal, tokenization, and stemming are used. We will explore them in detail in the upcoming sections. Before we do so, let's get acquainted with some basic NLP libraries that we will be using here:
Tokenization and word tokenizers were briefly described in Chapter 1, Introduction to Natural Language Processing. Tokenization is the process of splitting sentences into their constituents; that is, words and punctuation. Let's perform a simple exercise to see how this can be done using various packages.
In this exercise, we will clean some text and extract the tokens from it. Follow these steps to complete this exercise:
import re
def clean_text(sentence):
    return re.sub(r'([^\s\w]|_)+', ' ', sentence).split()
sentence = ('Sunil tweeted, "Witnessing 70th Republic Day '
            'of India from Rajpath, New Delhi. '
            'Mesmerizing performance by Indian Army! '
            'Awesome airshow! @india_official '
            '@indian_army #India #70thRepublic_Day. '
            'For more photos ping me [email protected] :)"')
clean_text(sentence)
The clean_text() function replaces every character that is not a letter, digit, or whitespace with a space and then splits the cleaned string wherever a blank space is present. The output should be as follows:
With that, we have learned how to extract tokens from text. Often, extracting each token separately does not help. For instance, consider the sentence, "I don't hate you, but your behavior." Here, if we process each of the tokens, such as "hate" and "behavior," separately, then the true meaning of the sentence would not be comprehended. In this case, the context in which these tokens are present becomes essential. Thus, we consider n consecutive tokens at a time. n-grams refers to the grouping of n consecutive tokens together.
Note
To access the source code for this specific section, please refer to https://packt.live/2CQikt7.
You can also run this example online at https://packt.live/33cn0nF.
Next, we will look at an exercise where n-grams can be extracted from a given text.
In this exercise, we will extract n-grams using three different methods. First, we will use custom-defined functions, and then the nltk and textblob libraries. Follow these steps to complete this exercise:
import re
def n_gram_extractor(sentence, n):
    tokens = re.sub(r'([^\s\w]|_)+', ' ', sentence).split()
    for i in range(len(tokens)-n+1):
        print(tokens[i:i+n])
In the preceding function, we are splitting the sentence into tokens using regex, then looping over the tokens, taking n consecutive tokens at a time.
n_gram_extractor('The cute little boy is playing with the kitten.',
2)
The preceding code generates the following output:
['The', 'cute']
['cute', 'little']
['little', 'boy']
['boy', 'is']
['is', 'playing']
['playing', 'with']
['with', 'the']
['the', 'kitten']
n_gram_extractor('The cute little boy is playing with the kitten.',
3)
The preceding code generates the following output:
['The', 'cute', 'little']
['cute', 'little', 'boy']
['little', 'boy', 'is']
['boy', 'is', 'playing']
['is', 'playing', 'with']
['playing', 'with', 'the']
['with', 'the', 'kitten']
from nltk import ngrams
list(ngrams('The cute little boy is playing with the kitten.'
.split(), 2))
The preceding code generates the following output:
[('The', 'cute'),
('cute', 'little'),
('little', 'boy'),
('boy', 'is'),
('is', 'playing'),
('playing', 'with'),
('with', 'the'),
('the', 'kitten.')]
list(ngrams('The cute little boy is playing with the kitten.'.split(), 3))
The preceding code generates the following output:
[('The', 'cute', 'little'),
('cute', 'little', 'boy'),
('little', 'boy', 'is'),
('boy', 'is', 'playing'),
('is', 'playing', 'with'),
('playing', 'with', 'the'),
('with', 'the', 'kitten.')]
!pip install -U textblob
from textblob import TextBlob
blob = TextBlob("The cute little boy is playing with the kitten.")
blob.ngrams(n=2)
The preceding code generates the following output:
[WordList(['The', 'cute']),
WordList(['cute', 'little']),
WordList(['little', 'boy']),
WordList(['boy', 'is']),
WordList(['is', 'playing']),
WordList(['playing', 'with']),
WordList(['with', 'the']),
WordList(['the', 'kitten'])]
blob.ngrams(n=3)
The preceding code generates the following output:
[WordList(['The', 'cute', 'little']),
WordList(['cute', 'little', 'boy']),
WordList(['little', 'boy', 'is']),
WordList(['boy', 'is', 'playing']),
WordList(['is', 'playing', 'with']),
WordList(['playing', 'with', 'the']),
WordList(['with', 'the', 'kitten'])]
In this exercise, we learned how to generate n-grams using various methods.
Note
To access the source code for this specific section, please refer to https://packt.live/2PabHUK.
You can also run this example online at https://packt.live/2XbjFRX.
In this exercise, we will use keras and textblob to tokenize texts. Follow these steps to complete this exercise:
from keras.preprocessing.text import text_to_word_sequence
from textblob import TextBlob
sentence = ('Sunil tweeted, "Witnessing 70th Republic Day '
            'of India from Rajpath, New Delhi. '
            'Mesmerizing performance by Indian Army! '
            'Awesome airshow! @india_official '
            '@indian_army #India #70thRepublic_Day. '
            'For more photos ping me [email protected] :)"')
def get_keras_tokens(text):
    return text_to_word_sequence(text)
get_keras_tokens(sentence)
The preceding code generates the following output:
def get_textblob_tokens(text):
    blob = TextBlob(text)
    return blob.words
get_textblob_tokens(sentence)
The preceding code generates the following output:
With that, we have learned how to tokenize texts using the keras and textblob libraries.
Note
To access the source code for this specific section, please refer to https://packt.live/3393hFi.
You can also run this example online at https://packt.live/39Dtu09.
In the next section, we will discuss the different types of tokenizers.
There are different types of tokenizers that come in handy for specific tasks. Let's look at the ones provided by nltk one by one:
Now that we have learned about the different types of tokenizers, in the next section, we will carry out an exercise to get a better understanding of them.
In this exercise, we will use different tokenizers to tokenize text. Perform the following steps to implement this exercise:
from nltk.tokenize import TweetTokenizer
from nltk.tokenize import MWETokenizer
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import WhitespaceTokenizer
from nltk.tokenize import WordPunctTokenizer
sentence = ('Sunil tweeted, "Witnessing 70th Republic Day '
            'of India from Rajpath, New Delhi. '
            'Mesmerizing performance by Indian Army! '
            'Awesome airshow! @india_official '
            '@indian_army #India #70thRepublic_Day. '
            'For more photos ping me [email protected] :)"')
def tokenize_with_tweet_tokenizer(text):
    # Here we will create an object of TweetTokenizer
    tweet_tokenizer = TweetTokenizer()
    """
    Then we will call the tokenize method of
    TweetTokenizer, which will return a list of tokens.
    """
    return tweet_tokenizer.tokenize(text)
tokenize_with_tweet_tokenizer(sentence)
Note
The # symbol in the code snippet above denotes a code comment. Comments are added into code to help explain specific bits of logic.
The preceding code generates the following output:
As you can see, the hashtags, emojis, websites, and Twitter IDs are extracted as single tokens. If we had used a punctuation-based tokenizer such as the word-punct tokenizer instead, the #, ., and @ symbols would have been split off as separate tokens.
def tokenize_with_mwe(text):
    mwe_tokenizer = MWETokenizer([('Republic', 'Day')])
    mwe_tokenizer.add_mwe(('Indian', 'Army'))
    return mwe_tokenizer.tokenize(text.split())
tokenize_with_mwe(sentence)
The preceding code generates the following output:
In the preceding output, the words "Indian" and "Army," which should have been treated as a single entity, were treated separately. This is because the token is "Army!" (not "Army"), so the multi-word expression is not matched. Let's see how this can be fixed in the next step.
tokenize_with_mwe(sentence.replace('!',''))
The preceding code generates the following output:
Here, we can see that instead of being treated as separate tokens, "Indian" and "Army" are treated as a single entity.
def tokenize_with_regex_tokenizer(text):
    reg_tokenizer = RegexpTokenizer(r'\w+|\$[\d.]+|\S+')
    return reg_tokenizer.tokenize(text)
tokenize_with_regex_tokenizer(sentence)
The preceding code generates the following output:
def tokenize_with_wst(text):
    wh_tokenizer = WhitespaceTokenizer()
    return wh_tokenizer.tokenize(text)
tokenize_with_wst(sentence)
The preceding code generates the following output:
def tokenize_with_wordpunct_tokenizer(text):
    wp_tokenizer = WordPunctTokenizer()
    return wp_tokenizer.tokenize(text)
tokenize_with_wordpunct_tokenizer(sentence)
The preceding code generates the following output:
In this section, we have learned about different tokenization techniques and their nltk implementation.
Note
To access the source code for this specific section, please refer to https://packt.live/3hSbDWi.
You can also run this example online at https://packt.live/3hOi7oR.
Now, we're ready to use them in our programs.
In many languages, the base forms of words change when they're used in sentences. For example, the word "produce" can be written as "production" or "produced" or even "producing," depending on the context. The process of converting a word back into its base form is known as stemming. It is essential to do this, because without it, algorithms would treat two or more different forms of the same word as different entities, despite them having the same semantic meaning. So, the words "producing" and "produced" would be treated as different entities, which can lead to erroneous inferences. In Python, RegexpStemmer and PorterStemmer are the most widely used stemmers. Let's explore them one at a time.
RegexpStemmer uses regular expressions to check whether morphological or structural prefixes or suffixes are present. For instance, in many cases, verbs in the present continuous tense (the present tense form ending with "ing") can be restored to their base form simply by removing "ing" from the end; for example, "playing" becomes "play".
Let's complete the following exercise to get some hands-on experience with RegexpStemmer.
In this exercise, we will use RegexpStemmer on text to convert words into their basic form by removing some generic suffixes such as "ing" and "ed". To use nltk's regex_stemmer, we have to create an object of RegexpStemmer by passing the regex of the suffix or prefix and an integer, min, which indicates the minimum length of the stemmed string. Follow these steps to complete this exercise:
from nltk.stem import RegexpStemmer
def get_stems(text):
    """
    Create an object of RegexpStemmer; the suffix matching
    the given regex 'ing$' will be removed from any word
    whose stemmed form is at least min characters long.
    """
    regex_stemmer = RegexpStemmer('ing$', min=4)
    """
    The line below converts every word into its stem
    using the regex stemmer and then joins them with spaces.
    """
    return ' '.join([regex_stemmer.stem(wd) for
                     wd in text.split()])
sentence = "I love playing football"
get_stems(sentence)
The preceding code generates the following output:
'I love play football'
As we can see, the word playing has been changed into its base form, play. In this exercise, we learned how we can perform stemming using nltk's RegexpStemmer.
Note
To access the source code for this specific section, please refer to https://packt.live/3hRYUm6.
You can also run this example online at https://packt.live/2D0Ztvk.
The Porter stemmer is the most common stemmer for dealing with English words. It removes various morphological and inflectional endings (such as the plural "s," "ed," and "ing") from English words. In doing so, it helps us extract the base form of a word from its variations. To get a better understanding of this, let's carry out a simple exercise.
In this exercise, we will apply the Porter stemmer to some text. Follow these steps to complete this exercise:
from nltk.stem.porter import *
sentence = "Before eating, it would be nice to "
"sanitize your hands with a sanitizer"
def get_stems(text):
ps_stemmer = PorterStemmer()
return ' '.join([ps_stemmer.stem(wd) for
wd in text.split()])
get_stems(sentence)
The preceding code generates the following output:
'befor eating, it would be nice to sanit your hand with a sanit'
Note
To access the source code for this specific section, please refer to https://packt.live/2CUqelc.
You can also run this example online at https://packt.live/2X8WUhD.
PorterStemmer is a generic rule-based stemmer that tries to convert a word into its basic form by removing common suffixes of English words.
Though stemming is a useful technique in NLP, it has a severe drawback. As we can see from this exercise, while "hands" has been correctly reduced to its base form, "hand," the words "sanitize" and "sanitizer" have both been converted into "sanit," which is not a valid English word (likewise, "Before" became "befor"). This may lead to some problems if we use it. To overcome this issue, there is another technique we can use called lemmatization.
As we saw in the previous section, there is a problem with stemming. It often generates meaningless words. Lemmatization deals with such cases by using vocabulary and analyzing the words' morphologies. It returns the base forms of words that can be found in dictionaries. Let's walk through a simple exercise to understand this better.
In this exercise, we will perform lemmatization on some text. Follow these steps to complete this exercise:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize
nltk.download('wordnet')
nltk.download('punkt')
sentence = "The products produced by the process today are "
"far better than what it produces generally."
lemmatizer = WordNetLemmatizer()
def get_lemmas(text):
    lemmatizer = WordNetLemmatizer()
    return ' '.join([lemmatizer.lemmatize(word) for
                     word in word_tokenize(text)])
get_lemmas(sentence)
The preceding code generates the following output:
'The product produced by the process today are far better than what it produce generally.'
With that, we learned how to generate the lemma of a word, which is its correct grammatical base form. The lemmatizer uses a dictionary (WordNet, in this case) to map each word to its nearest valid base form.
Note
To access the source code for this specific section, please refer to https://packt.live/2X5JEKA.
You can also run this example online at https://packt.live/30Zqt6v.
In the next section, we will deal with other kinds of word variations by looking at singularizing and pluralizing words using textblob.
In this exercise, we will make use of the textblob library to singularize and pluralize words in the given text. Follow these steps to complete this exercise:
from textblob import TextBlob
sentence = TextBlob('She sells seashells on the seashore')
To check the list of words in the sentence, type the following code:
sentence.words
The preceding code generates the following output:
WordList(['She', 'sells', 'seashells', 'on', 'the', 'seashore'])
def singularize(word):
    return word.singularize()
singularize(sentence.words[2])
The preceding code generates the following output:
'seashell'
def pluralize(word):
    return word.pluralize()
pluralize(sentence.words[5])
The preceding code generates the following output:
'seashores'
Note
To access the source code for this specific section, please refer to https://packt.live/3gooUoQ.
You can also run this example online at https://packt.live/309Gqrm.
Now, in the next section, we will learn about another preprocessing task: language translation.
You might have used Google Translate before, which gives the translation of text in another language; this is an example of language translation or machine translation. In Python, we can use TextBlob to translate text from one language into another. TextBlob provides a method called translate(), which is called on text in the source language and returns the translated text in the destination language. Let's look at how this is done.
In this exercise, we will make use of the TextBlob library to translate a sentence from Spanish into English. Follow these steps to implement this exercise:
from textblob import TextBlob
def translate(text, from_l, to_l):
    en_blob = TextBlob(text)
    return en_blob.translate(from_lang=from_l, to=to_l)
translate(text='muy bien',from_l='es',to_l='en')
The preceding code generates the following output:
TextBlob("very well")
With that, we have seen how we can use TextBlob to translate from one language to another.
Note
To access the source code for this specific section, please refer to https://packt.live/2XquGiH.
You can also run this example online at https://packt.live/3hQiVK8.
In the next section, we will look at another preprocessing task: stop-word removal.
Stop words, such as "am," "the," and "are," occur frequently in text data. Although they help us construct sentences properly, the meaning of a text can be inferred even without them. So, removing stop words from text is one of the preprocessing steps in NLP tasks. In Python, libraries such as nltk and textblob can be used to remove stop words from text. To get a better understanding of this, let's look at an exercise.
In this exercise, we will remove the stop words from a given text. Follow these steps to complete this exercise:
from nltk import word_tokenize
sentence = "She sells seashells on the seashore"
def remove_stop_words(text, stop_word_list):
    return ' '.join([word for word in word_tokenize(text)
                     if word.lower() not in stop_word_list])
custom_stop_word_list = ['she', 'on', 'the', 'am', 'is', 'not']
remove_stop_words(sentence,custom_stop_word_list)
The preceding code generates the following output:
'sells seashells seashore'
Thus, we've seen how stop words can be removed from a sentence.
Note
To access the source code for this specific section, please refer to https://packt.live/337aMwH.
You can also run this example online at https://packt.live/30buvJF.
In the next activity, we'll put our knowledge of preprocessing steps into practice.
In this activity, you will extract the most frequently occurring keywords from a sample news article.
Note
The news article that's being used for this activity can be found at https://packt.live/314mg1r.
The following steps will help you implement this activity:
Note
The solution to this activity can be found on page 373.
With that, we have learned about the various ways we can clean unstructured data. Now, let's examine the concept of extracting features from texts.
As we already know, machine learning algorithms do not understand textual data directly. We need to represent the text data in numerical form or vectors. To convert each textual sentence into a vector, we need to represent it as a set of features. This set of features should uniquely represent the text, though, individually, some of the features may be common across many textual sentences. Features can be classified into two different categories:
Let's explore these in detail.
As we've already learned, general features refer to those that are not directly dependent on the individual tokens constituting a text corpus. Let's consider these two sentences: "The sky is blue" and "The pillar is yellow". Here, the sentences have the same number of words (a general feature)—that is, four. But the individual constituent tokens are different. Let's complete an exercise to understand this better.
In this exercise, we will extract general features from input text. These general features include detecting the number of words, the presence of "wh" words (words beginning with "wh", such as "what" and "why") and the language in which the text is written. Follow these steps to implement this exercise:
import pandas as pd
from textblob import TextBlob
df = pd.DataFrame([['The interim budget for 2019 will '
'be announced on 1st February.'],
['Do you know how much expectation '
'the middle-class working population '
'is having from this budget?'],
['February is the shortest month '
'in a year.'],
['This financial year will end on '
'31st March.']])
df.columns = ['text']
df.head()
The preceding code generates the following output:
def add_num_words(df):
    df['number_of_words'] = df['text'].apply(lambda x:
        len(TextBlob(str(x)).words))
    return df
add_num_words(df)['number_of_words']
The preceding code generates the following output:
0 11
1 15
2 8
3 8
Name: number_of_words, dtype: int64
The preceding line of code prints the number_of_words column of the DataFrame, which represents the number of words in each row.
def is_present(wh_words, df):
    """
    The below line of code will find the intersection
    between the set of tokens of every sentence and the
    wh_words, and will return True if the length of the
    intersection set is non-zero.
    """
    df['is_wh_words_present'] = df['text'].apply(lambda x:
        True if len(set(TextBlob(str(x)).words)
                    .intersection(wh_words)) > 0 else False)
    return df
wh_words = set(['why', 'who', 'which', 'what',
'where', 'when', 'how'])
is_present(wh_words, df)['is_wh_words_present']
The preceding code generates the following output:
0 False
1 True
2 False
3 False
Name: is_wh_words_present, dtype: bool
The preceding code will print the is_wh_words_present column that was added to df by the is_present method, so for every row, we can see whether a wh-word is present.
def get_language(df):
    df['language'] = df['text'].apply(lambda x:
        TextBlob(str(x)).detect_language())
    return df
get_language(df)['language']
The preceding code generates the following output:
0 en
1 en
2 en
3 en
Name: language, dtype: object
With that, we have learned how to extract general features from text data.
Note
To access the source code for this specific section, please refer to https://packt.live/2X9jLcS.
You can also run this example online at https://packt.live/3fgrYSK.
Let's perform another exercise to get a better understanding of this.
In this exercise, we will extract various general features from documents. The dataset that we will be using here consists of random statements. Our objective is to find the frequency of various general features such as punctuation, uppercase and lowercase words, letters, digits, words, and whitespaces.
Note
The dataset that is being used in this exercise can be found at this link: https://packt.live/3k0qCPR.
import pandas as pd
from string import punctuation
import nltk
nltk.download('tagsets')
from nltk.data import load
nltk.download('averaged_perceptron_tagger')
from nltk import pos_tag
from nltk import word_tokenize
from collections import Counter
def get_tagsets():
    tagdict = load('help/tagsets/upenn_tagset.pickle')
    return list(tagdict.keys())
tag_list = get_tagsets()
print(tag_list)
The preceding code generates the following output:
"""
This method will count the occurrence of pos
tags in each sentence.
"""
def get_pos_occurrence_freq(data, tag_list):
# Get list of sentences in text_list
text_list = data.text
# create empty dataframe
feature_df = pd.DataFrame(columns=tag_list)
for text_line in text_list:
# get pos tags of each word.
pos_tags = [j for i, j in
pos_tag(word_tokenize(text_line))]
"""
create a dict of pos tags and their frequency
in given sentence.
"""
row = dict(Counter(pos_tags))
feature_df = feature_df.append(row, ignore_index=True)
feature_df.fillna(0, inplace=True)
return feature_df
tag_list = get_tagsets()
data = pd.read_csv('../data/data.csv', header=0)
feature_df = get_pos_occurrence_freq(data, tag_list)
feature_df.head()
The preceding code generates the following output:
def add_punctuation_count(feature_df, data):
    feature_df['num_of_unique_punctuations'] = data['text'].\
        apply(lambda x: len(set(x).intersection(set(punctuation))))
    return feature_df
feature_df = add_punctuation_count(feature_df, data)
feature_df['num_of_unique_punctuations'].head()
The add_punctuation_count() method will find the intersection of the set of characters in each text with the set of punctuation marks imported from the string module. Then, it will find the length of the intersection set in each row and add it to the num_of_unique_punctuations column of the DataFrame. The preceding code generates the following output:
0 0
1 0
2 1
3 1
4 0
Name: num_of_unique_punctuations, dtype: int64
def get_capitalized_word_count(feature_df, data):
    """
    The below code will tokenize the text in every row,
    create a list of the words that start with a capital
    letter, then find the length of this list and add it
    to the 'number_of_capital_words' column of the dataframe.
    """
    feature_df['number_of_capital_words'] = data['text'].\
        apply(lambda x: len([word for word in word_tokenize(str(x))
                             if word[0].isupper()]))
    return feature_df
feature_df = get_capitalized_word_count(feature_df, data)
feature_df['number_of_capital_words'].head()
The preceding code will tokenize the text in every row and create a list of the words that start with a capital letter. It will then find the length of this list and add it to the number_of_capital_words column of the DataFrame. The preceding code generates the following output:
0 1
1 1
2 1
3 1
4 1
Name: number_of_capital_words, dtype: int64
The last line of the preceding code will print the number_of_capital_words column, which represents the count of capitalized words in each row.
def get_small_word_count(feature_df, data):
    """
    The below code will tokenize the text in every row,
    create a list of the words that start with a lowercase
    letter, then find the length of this list and add it
    to the 'number_of_small_words' column of the dataframe.
    """
    feature_df['number_of_small_words'] = data['text'].\
        apply(lambda x: len([word for word in word_tokenize(str(x))
                             if word[0].islower()]))
    return feature_df
feature_df = get_small_word_count(feature_df, data)
feature_df['number_of_small_words'].head()
The preceding code will tokenize the text in every row and create a list of the words that start with a lowercase letter, then find the length of this list and add it to the number_of_small_words column of the DataFrame. The preceding code generates the following output:
0 4
1 3
2 7
3 3
4 2
Name: number_of_small_words, dtype: int64
The last line of the preceding code will print the number_of_small_words column, which represents the number of lowercase words in each row.
def get_number_of_alphabets(feature_df, data):
    feature_df['number_of_alphabets'] = data['text'].\
        apply(lambda x: len([ch for ch in str(x) if ch.isalpha()]))
    return feature_df
feature_df = get_number_of_alphabets(feature_df, data)
feature_df['number_of_alphabets'].head()
The preceding code will break the text in each row into a list of its alphabetic characters and add the count of that list to the number_of_alphabets column. This will produce the following output:
0 19
1 18
2 28
3 14
4 13
Name: number_of_alphabets, dtype: int64
The last line of the preceding code will print the number_of_alphabets column, which represents the number of alphabetic characters in each row.
def get_number_of_digit_count(feature_df, data):
    """
    The below code will break the text in each row into a
    list of its digits and add the count of that list to
    the 'number_of_digits' column.
    """
    feature_df['number_of_digits'] = data['text'].\
        apply(lambda x: len([ch for ch in str(x) if ch.isdigit()]))
    return feature_df
feature_df = get_number_of_digit_count(feature_df, data)
feature_df['number_of_digits'].head()
The preceding code will count the digits in each row and add the count to the number_of_digits column. The preceding code generates the following output:
0 0
1 0
2 0
3 0
4 0
Name: number_of_digits, dtype: int64
def get_number_of_words(feature_df, data):
    """
    The below code will break the text in each row into a
    list of words and add the count of that list to
    the 'number_of_words' column.
    """
    feature_df['number_of_words'] = data['text'].\
        apply(lambda x: len(word_tokenize(str(x))))
    return feature_df
feature_df = get_number_of_words(feature_df, data)
feature_df['number_of_words'].head()
The preceding code will split the text in each row into a list of words and add the count of that list to the number_of_words column. We will get the following output:
0 5
1 4
2 9
3 5
4 3
Name: number_of_words, dtype: int64
def get_number_of_whitespaces(feature_df, data):
    """
    The below code will generate a list of the whitespace
    characters in each row and add the length of that list
    to the 'number_of_white_spaces' column.
    """
    feature_df['number_of_white_spaces'] = data['text'].\
        apply(lambda x: len([ch for ch in str(x) if ch.isspace()]))
    return feature_df
feature_df = get_number_of_whitespaces(feature_df, data)
feature_df['number_of_white_spaces'].head()
The preceding code will generate a list of the whitespace characters in each row and add the length of that list to the number_of_white_spaces column. The preceding code generates the following output:
0 4
1 3
2 7
3 3
4 2
Name: number_of_white_spaces, dtype: int64
feature_df.head()
We will be printing the head of the final DataFrame, which means we will print five rows of all the columns. We will get the following output:
With that, we have learned how to extract general features from the given text.
Note
To access the source code for this specific section, please refer to https://packt.live/3jSsLNh.
You can also run this example online at https://packt.live/3hPFmPA.
Now, let's explore how we can extract unique features.
The Bag of Words (BoW) model is one of the most popular methods for extracting features from raw texts.
In this technique, we convert each sentence into a vector. The length of this vector is equal to the number of unique words in all the documents. This is done in two steps:
A vocabulary or dictionary is created from all the unique possible words available in the corpus (all documents) and every single word is assigned a unique index number. In the second step, every document is represented by a list whose length is equal to the number of words in the vocabulary. The following exercise illustrates how BoW can be implemented using Python.
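To make these two steps concrete, the following is a minimal pure-Python sketch on a toy two-document corpus of our own (the exercise that follows performs the same job with sklearn's CountVectorizer):
# Step 1: build a vocabulary mapping each unique word
# in the corpus to an index number.
corpus = ['the sky is blue', 'the pillar is yellow']
vocabulary = {word: i for i, word in enumerate(
    sorted({w for doc in corpus for w in doc.split()}))}
# Step 2: represent each document as a vector of word
# counts whose length equals the vocabulary size.
vectors = []
for doc in corpus:
    vec = [0] * len(vocabulary)
    for word in doc.split():
        vec[vocabulary[word]] += 1
    vectors.append(vec)
print(vocabulary)  # {'blue': 0, 'is': 1, 'pillar': 2, 'sky': 3, 'the': 4, 'yellow': 5}
print(vectors)     # [[1, 1, 0, 1, 1, 0], [0, 1, 1, 0, 1, 1]]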
In this exercise, we will create a BoW representation for all the terms in a document and ascertain the 10 most frequent terms. In this exercise, we will use the CountVectorizer module from sklearn, which performs the following tasks:
Follow these steps to implement this exercise:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
def vectorize_text(corpus):
    """
    Will return a dataframe in which every row will be
    the vector representation of a document in the corpus
    :param corpus: input text corpus
    :return: dataframe of vectors
    """
    bag_of_words_model = CountVectorizer()
    """
    performs the tasks described above on
    the given data corpus.
    """
    dense_vec_matrix = bag_of_words_model.\
        fit_transform(corpus).todense()
    bag_of_word_df = pd.DataFrame(dense_vec_matrix)
    bag_of_word_df.columns = sorted(bag_of_words_model.vocabulary_)
    return bag_of_word_df
corpus = ['Data Science is an overlap between Arts and Science',
'Generally, Arts graduates are right-brained and '
'Science graduates are left-brained',
'Excelling in both Arts and Science at a time '
'becomes difficult',
'Natural Language Processing is a part of Data Science']
df = vectorize_text(corpus)
df.head()
The vectorize_text method will take a document corpus as an argument and return a DataFrame in which every row will be a vector representation of a document in the corpus.
The preceding code generates the following output:
def bow_top_n(corpus, n):
    """
    Will return a dataframe in which every row will be
    represented by the presence or absence of the top n most
    frequently occurring words in the data corpus
    :param corpus: input text corpus
    :return: dataframe of vectors
    """
    bag_of_words_model_small = CountVectorizer(max_features=n)
    bag_of_word_df_small = pd.DataFrame(
        bag_of_words_model_small.fit_transform(corpus).todense())
    bag_of_word_df_small.columns = \
        sorted(bag_of_words_model_small.vocabulary_)
    return bag_of_word_df_small
df_2 = bow_top_n(corpus, 10)
df_2.head()
In the preceding code, we are checking the occurrence of the top 10 most frequent words in each sentence and creating a DataFrame out of it.
The preceding code generates the following output:
Note
To access the source code for this specific section, please refer to https://packt.live/3gdhViJ.
You can also run this example online at https://packt.live/3hPUTi8.
In this section, we learned what BoW is and how we can use it to convert a sentence or document into a vector. BoW is the easiest way to convert text into a vector; however, it has a severe disadvantage: it treats every word as equally important, regardless of how common or rare that word is across the whole corpus. When the semantics of a sentence matter, how informative each word is plays an important role. To overcome this issue, there is another feature extraction model called TFIDF, which we will discuss later in this chapter.
According to Zipf's law, the number of times a word occurs in a corpus is inversely proportional to its rank in the frequency table. In simple terms, if the words in a corpus are arranged in descending order of their frequency of occurrence, then the frequency of the word at the ith rank will be proportional to 1/i; that is, f(i) ∝ 1/i.
This also means that the frequency of the most frequent word will be twice the frequency of the second most frequent word. For example, if we look at the Brown University Standard Corpus of Present-Day American English, the word "the" is the most frequent word (its frequency is 69,971), while the word "of" is the second most frequent (with a frequency of 36,411). As we can see, its frequency is almost half of the most frequently occurring word. To get a better understanding of this, let's perform a simple exercise.
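We can verify this claim with quick arithmetic on the two Brown corpus counts quoted above:
# A quick check of Zipf's law using the Brown corpus
# frequencies quoted above for the two top-ranked words.
freq_rank_1 = 69971   # "the", the most frequent word
freq_rank_2 = 36411   # "of", the second most frequent word
print(freq_rank_1 / freq_rank_2)   # ~1.92, close to the predicted ratio of 2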
In this exercise, we will plot both the expected and actual ranks and frequencies of tokens with the help of Zipf's law. We will be using the 20newsgroups dataset provided by the sklearn library, which is a collection of newsgroup documents. Follow these steps to implement this exercise:
from pylab import *
import nltk
nltk.download('stopwords')
from sklearn.datasets import fetch_20newsgroups
from nltk import word_tokenize
from nltk.corpus import stopwords
import matplotlib.pyplot as plt
import re
import string
from collections import Counter
Add two methods, one for loading the stop words and one for loading and preparing the data in the newsgroups_data_sample variable:
def get_stop_words():
    stop_words = stopwords.words('english')
    stop_words = stop_words + list(string.printable)
    return stop_words
def get_and_prepare_data(stop_words):
    """
    This method will load the 20newsgroups data and
    remove stop words from it using the given stop word list.
    :param stop_words:
    :return:
    """
    newsgroups_data_sample = \
        fetch_20newsgroups(subset='train')
    tokenized_corpus = [word.lower() for sentence in
                        newsgroups_data_sample['data']
                        for word in word_tokenize(
                            re.sub(r'([^\s\w]|_)+', ' ', sentence))
                        if word.lower() not in stop_words]
    return tokenized_corpus
In the preceding code, there are two methods: get_stop_words() loads the stop word list from nltk's data, while get_and_prepare_data() loads the 20newsgroups data and removes stop words from it using the given stop word list.
def get_frequency(corpus, n):
    token_count_di = Counter(corpus)
    return token_count_di.most_common(n)
The preceding method uses the Counter class to count the frequency of tokens in the corpus and then return the most common n tokens.
stop_word_list = get_stop_words()
corpus = get_and_prepare_data(stop_word_list)
get_frequency(corpus, 50)
The preceding code generates the following output:
def get_actual_and_expected_frequencies(corpus):
    freq_dict = get_frequency(corpus, 1000)
    actual_frequencies = []
    expected_frequencies = []
    for rank, tup in enumerate(freq_dict):
        actual_frequencies.append(log(tup[1]))
        rank = 1 if rank == 0 else rank
        # expected frequency 1/rank as per zipf's law
        expected_frequencies.append(1 / rank)
    return actual_frequencies, expected_frequencies
def plot(actual_frequencies, expected_frequencies):
    plt.plot(actual_frequencies, 'g*',
             expected_frequencies, 'ro')
    plt.show()
# We will plot the actual and expected frequencies
actual_frequencies, expected_frequencies = \
    get_actual_and_expected_frequencies(corpus)
plot(actual_frequencies, expected_frequencies)
The preceding code generates the following output:
So, as we can see from the preceding output, the two curves fall off in a similar fashion. In other words, the observed token frequencies are roughly proportional to the frequencies that Zipf's law predicts.
Note
To access the source code for this specific section, please refer to https://packt.live/30ZnKtD.
You can also run this example online at https://packt.live/3f9ZFoT.
Term Frequency-Inverse Document Frequency (TFIDF) is another method of representing text data in a vector format. Here, once again, we'll represent each document as a list whose length is equal to the number of unique words/tokens in all documents (corpus), but the vector here not only represents the presence and absence of a word, but also the frequency of the word—both in the current document and the whole corpus.
This technique is based on the idea that rarely occurring words are better representatives of a document than frequently occurring words. Hence, this representation gives more weight to the rarer or less frequent words. It does so with the following formula:
TFIDF(w, d) = TF(w, d) × log(D / df(w))
Here, the term frequency, TF(w, d), is the frequency of the word w in the given document d. The inverse document frequency is defined as log(D/df), where df is the document frequency (the number of documents in which the word appears) and D is the total number of documents in the background corpus.
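To make the formula concrete, the following is a minimal sketch that applies it by hand to a toy tokenized corpus of our own. Note that sklearn's TfidfVectorizer, which we will use in the next exercise, applies a smoothed and normalized variant of this same idea:
import math
corpus = [['data', 'science', 'is', 'fun'],
          ['data', 'data', 'everywhere'],
          ['science', 'of', 'language']]
def tf_idf(word, document, corpus):
    tf = document.count(word)                  # term frequency in the document
    df = sum(word in doc for doc in corpus)    # document frequency
    idf = math.log(len(corpus) / df)           # inverse document frequency
    return tf * idf
print(tf_idf('data', corpus[1], corpus))       # common word, lower idf: ~0.81
print(tf_idf('language', corpus[2], corpus))   # rare word, higher idf: ~1.10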
Now, let's complete an exercise and learn how TFIDF can be implemented in Python.
In this exercise, we will represent the input texts with their TFIDF vectors. We will use a sklearn module named TfidfVectorizer, which converts text into TFIDF vectors. Follow these steps to implement this exercise:
from sklearn.feature_extraction.text import TfidfVectorizer
def get_tf_idf_vectors(corpus):
    tfidf_model = TfidfVectorizer()
    vector_list = tfidf_model.fit_transform(corpus).todense()
    return vector_list
corpus = ['Data Science is an overlap between Arts and Science',
'Generally, Arts graduates are right-brained and '
'Science graduates are left-brained',
'Excelling in both Arts and Science at a '
'time becomes difficult',
'Natural Language Processing is a part of Data Science']
vector_list = get_tf_idf_vectors(corpus)
print(vector_list)
In the preceding code, the get_tf_idf_vectors() method will generate TFIDF vectors from the corpus. You will then call this method on a given corpus. The preceding code generates the following output:
The preceding output represents the TFIDF vectors for each document. Each document is represented by a list whose length is equal to the number of unique words in the corpus, and each list (vector) contains the TFIDF values of the words at their corresponding indices.
Note
To access the source code for this specific section, please refer to https://packt.live/3gdzsHA.
You can also run this example online at https://packt.live/3fdP5gS.
In the next section, we will solve an activity to extract specific features from texts.
So far in this chapter, we have learned how to generate vectors from text. These vectors are then fed to machine learning algorithms to perform various tasks. Other than using them in machine learning applications, we can also perform simple NLP tasks using these vectors. Finding the string similarity is one of them. This is a technique in which we find the similarity between two strings by converting them into vectors. The technique is mainly used in full-text searching.
There are different techniques for finding the similarity between two strings or texts. They are explained one by one here:
Cosine similarity is computed as cos(θ) = (A · B) / (|A| |B|). Here, A and B are two vectors, A · B is the dot product of the two vectors, and |A| and |B| are the magnitudes of the two vectors.
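As a quick numeric sketch of this formula (the two vectors here are arbitrary, chosen only for illustration):
import numpy as np
A = np.array([1.0, 2.0, 0.0])
B = np.array([2.0, 1.0, 1.0])
# Dot product divided by the product of the magnitudes.
cosine = A.dot(B) / (np.linalg.norm(A) * np.linalg.norm(B))
print(cosine)   # ~0.73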
Consider the following example of Jaccard similarity, which is the ratio of the number of terms common to both texts to the total number of unique terms across them. Suppose there are two texts:
Text 1: I like detective Byomkesh Bakshi.
Text 2: Byomkesh Bakshi is not a detective; he is a truth seeker.
The common terms are "Byomkesh," "Bakshi," and "detective."
The number of common terms in the texts is three.
The unique terms present across both texts are "I," "like," "detective," "Byomkesh," "Bakshi," "is," "not," "a," "he," "truth," and "seeker." So, the number of unique terms is eleven.
Therefore, the Jaccard similarity is 3/11 ≈ 0.27.
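A quick set-based check of this arithmetic (lowercased, with punctuation ignored):
# Verify the Jaccard computation for the two example texts.
text1 = set("i like detective byomkesh bakshi".split())
text2 = set("byomkesh bakshi is not a detective "
            "he is a truth seeker".split())
common = text1 & text2             # common terms
union = text1 | text2              # all unique terms
print(len(common), len(union))     # 3 11
print(len(common) / len(union))    # ~0.27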
To get a better understanding of text similarity, we will complete an exercise.
In this exercise, we will calculate the Jaccard and cosine similarity for a given pair of texts. Follow these steps to complete this exercise:
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
lemmatizer = WordNetLemmatizer()
def extract_text_similarity_jaccard(text1, text2):
    """
    This method will return the Jaccard similarity between two
    texts after lemmatizing them.
    :param text1: text1
    :param text2: text2
    :return: similarity measure
    """
    lemmatizer = WordNetLemmatizer()
    words_text1 = [lemmatizer.lemmatize(word.lower())
                   for word in word_tokenize(text1)]
    words_text2 = [lemmatizer.lemmatize(word.lower())
                   for word in word_tokenize(text2)]
    nr = len(set(words_text1).intersection(set(words_text2)))
    dr = len(set(words_text1).union(set(words_text2)))
    jaccard_sim = nr / dr
    return jaccard_sim
pair1 = ["What you do defines you", "Your deeds define you"]
pair2 = ["Once upon a time there lived a king.",
"Who is your queen?"]
pair3 = ["He is desperate", "Is he not desperate?"]
extract_text_similarity_jaccard(pair1[0],pair1[1])
The preceding code generates the following output:
0.14285714285714285
extract_text_similarity_jaccard(pair2[0],pair2[1])
The preceding code generates the following output:
0.0
extract_text_similarity_jaccard(pair3[0],pair3[1])
The preceding code generates the following output:
0.6
def get_tf_idf_vectors(corpus):
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_results = tfidf_vectorizer.\
        fit_transform(corpus).todense()
    return tfidf_results
corpus = [pair1[0], pair1[1], pair2[0],
pair2[1], pair3[0], pair3[1]]
tf_idf_vectors = get_tf_idf_vectors(corpus)
cosine_similarity(tf_idf_vectors[0],tf_idf_vectors[1])
The preceding code generates the following output:
array([[0.3082764]])
cosine_similarity(tf_idf_vectors[2],tf_idf_vectors[3])
The preceding code generates the following output:
array([[0.]])
cosine_similarity(tf_idf_vectors[4],tf_idf_vectors[5])
The preceding code generates the following output:
array([[0.80368547]])
So, in this exercise, we learned how to check the similarity between texts. As you can see, the texts "He is desperate" and "Is he not desperate?" returned similarity results of 0.80 (meaning they are highly similar), whereas sentences such as "Once upon a time there lived a king." and "Who is your queen?" returned zero as their similarity measure.
Note
To access the source code for this specific section, please refer to https://packt.live/2Eyw0JC.
You can also run this example online at https://packt.live/2XbGRQ3.
The Lesk algorithm is used for word sense disambiguation. Suppose we have a sentence such as "On the bank of river Ganga, there lies the scent of spirituality" and another sentence, "I'm going to withdraw some cash from the bank". Here, the same word, "bank", is used in two different contexts. For text processing results to be accurate, the context of the words needs to be considered.
In the Lesk algorithm, the possible definitions (senses) of ambiguous words are stored in the background in synsets. The definition that is closest to the meaning of the word as used in the context of the sentence is taken as the right definition. Let's perform a simple exercise to get a better idea of how we can implement this.
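Incidentally, nltk ships a ready-made Lesk implementation; the following is a minimal sketch of it (the exact synset returned depends on the WordNet glosses, and the wordnet resource must be downloaded first). In the exercise, however, we will build a simplified variant of our own on top of TFIDF vectors and cosine similarity:
from nltk import word_tokenize
from nltk.wsd import lesk
# nltk's built-in Lesk picks the WordNet synset whose
# definition overlaps most with the context sentence.
sent = word_tokenize("I'm going to withdraw some cash from the bank")
sense = lesk(sent, 'bank')
print(sense, '-', sense.definition())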
In this exercise, we are going to implement the Lesk algorithm step by step using the techniques we have learned so far. We will find the meaning of the word "bank" in the sentence, "On the banks of river Ganga, there lies the scent of spirituality." We will use cosine similarity as well as Jaccard similarity here. Follow these steps to complete this exercise:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from nltk import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import fetch_20newsgroups
import numpy as np
def get_tf_idf_vectors(corpus):
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_results = tfidf_vectorizer.\
        fit_transform(corpus).todense()
    return tfidf_results
def to_lower_case(corpus):
    lowercase_corpus = [x.lower() for x in corpus]
    return lowercase_corpus
def find_sentence_definition(sent_vector, definition_vectors):
    """
    This method will find the cosine similarity of the sentence
    with the possible definitions and return the one with the
    highest similarity score, along with the score itself.
    """
    result_dict = {}
    for definition_id, def_vector in definition_vectors.items():
        sim = cosine_similarity(sent_vector, def_vector)
        result_dict[definition_id] = sim[0][0]
    definition = sorted(result_dict.items(),
                        key=lambda x: x[1],
                        reverse=True)[0]
    return definition[0], definition[1]
corpus = ["On the banks of river Ganga, there lies the scent "
"of spirituality",
"An institute where people can store extra "
"cash or money.",
"The land alongside or sloping down to a river or lake"
"What you do defines you",
"Your deeds define you",
"Once upon a time there lived a king.",
"Who is your queen?",
"He is desperate",
"Is he not desperate?"]
lower_case_corpus = to_lower_case(corpus)
corpus_tf_idf = get_tf_idf_vectors(lower_case_corpus)
sent_vector = corpus_tf_idf[0]
definition_vectors = {'def1':corpus_tf_idf[1],
'def2':corpus_tf_idf[2]}
definition_id, score = \
    find_sentence_definition(sent_vector, definition_vectors)
print("The definition of word {} is {} with similarity of {}"
      .format('bank', definition_id, score))
You will get the following output:
The definition of word bank is def2 with similarity of 0.14419130686278897
As we already know, def2 represents a riverbank. So, we have found the correct definition of the word here. In this exercise, we have learned how to use text vectorization and text similarity to find the right definition of ambiguous words.
Note
To access the source code for this specific section, please refer to https://packt.live/39GzJAs.
You can also run this example online at https://packt.live/3fbxQwK.
Unlike numeric data, there are very few ways in which text data can be represented visually. The most popular way of visualizing text data is by using word clouds. A word cloud is a visualization of a text corpus in which the sizes of the tokens (words) represent the number of times they have occurred, as shown in the following image:
In the following exercise, we will be using a Python library called wordcloud to build a word cloud from the 20newsgroups dataset.
Let's go through an exercise to understand this better.
In this exercise, we will visualize the most frequently occurring words in the first 1,000 articles from sklearn's fetch_20newsgroups text dataset using a word cloud. Follow these steps to complete this exercise:
import nltk
nltk.download('stopwords')
import matplotlib.pyplot as plt
plt.rcParams['figure.dpi'] = 200
from sklearn.datasets import fetch_20newsgroups
from nltk.corpus import stopwords
from wordcloud import WordCloud
import matplotlib as mpl
mpl.rcParams['figure.dpi'] = 200
def get_data(n):
    newsgroups_data_sample = fetch_20newsgroups(subset='train')
    text = str(newsgroups_data_sample['data'][:n])
    return text
def load_stop_words():
    other_stopwords_to_remove = ['\n', 'n', '', '>',
                                 'nLines', 'nI', "n'"]
    stop_words = stopwords.words('english')
    stop_words.extend(other_stopwords_to_remove)
    stop_words = set(stop_words)
    return stop_words
def generate_word_cloud(text, stopwords):
    """
    This method generates a word cloud object
    with the given corpus, stop words and dimensions
    """
    wordcloud = WordCloud(width=800, height=800,
                          background_color='white',
                          max_words=200,
                          stopwords=stopwords,
                          min_font_size=10).generate(text)
    return wordcloud
text = get_data(1000)
stop_words = load_stop_words()
wordcloud = generate_word_cloud(text, stop_words)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
The preceding code generates the following output:
So, in this exercise, we learned what word clouds are and how to generate word clouds with Python's wordcloud library and visualize this with matplotlib.
Note
To access the source code for this specific section, please refer to https://packt.live/30eaSRn.
You can also run this example online at https://packt.live/2EzqLJJ.
In the next section, we will explore other visualizations, such as dependency parse trees and named entities.
Apart from word clouds, there are various other ways of visualizing texts. Some of the most popular ways are listed here:
Let's go through the following exercise to understand this better.
In this exercise, we will look at two of the most popular visualization methods after word clouds: dependency parse trees and named entity visualization. Follow these steps to complete this exercise:
import spacy
from spacy import displacy
!python -m spacy download en_core_web_sm
import en_core_web_sm
nlp = en_core_web_sm.load()
doc = nlp('God helps those who help themselves')
displacy.render(doc, style='dep', jupyter=True)
The preceding code generates the following output:
text = ('Once upon a time there lived a saint named '
        'Ramakrishna Paramahansa. His chief disciple '
        'Narendranath Dutta also known as Swami Vivekananda '
        'is the founder of Ramakrishna Mission and '
        'Ramakrishna Math.')
doc2 = nlp(text)
displacy.render(doc2, style='ent', jupyter=True)
The preceding code generates the following output:
Note
To access the source code for this specific section, please refer to https://packt.live/313m4iD.
You can also run this example online at https://packt.live/3103fgr.
Now that you have learned about visualizations, we will solve an activity based on them to gain an even better understanding.
In this activity, you will create a word cloud for the 50 most frequent words in a dataset. The dataset we will use consists of random sentences that are not clean. First, we need to clean them and create a unique set of frequently occurring words.
Note
The text_corpus.txt file that's being used in this activity can be found at https://packt.live/2DiVIBj.
Follow these steps to implement this activity:
Note
The solution to this activity can be found on page 375.
In this chapter, you have learned about various types of data and ways to deal with unstructured text data. Text data is usually extremely noisy and needs to be cleaned and preprocessed, which mainly consists of tokenization, stemming, lemmatization, and stop-word removal. After preprocessing, features are extracted from texts using various methods, such as BoW and TFIDF. These methods convert unstructured text data into structured numeric data. New features are created from existing features using a technique called feature engineering. In the last part of this chapter, we explored various ways of visualizing text data, such as word clouds.
In the next chapter, you will learn how to develop machine learning models to classify texts using the feature extraction methods you have learned about in this chapter. Moreover, different sampling techniques and model evaluation parameters will be introduced.