© Akshay Kulkarni and Adarsha Shivananda 2019
A. Kulkarni and A. Shivananda, Natural Language Processing Recipes, https://doi.org/10.1007/978-1-4842-4267-4_2

2. Exploring and Processing Text Data

Akshay Kulkarni and Adarsha Shivananda
Bangalore, Karnataka, India

In this chapter, we are going to cover various methods and techniques to preprocess the text data along with exploratory data analysis.

We are going to discuss the following recipes under text preprocessing and exploratory data analysis:
  • Recipe 1. Lowercasing

  • Recipe 2. Punctuation removal

  • Recipe 3. Stop words removal

  • Recipe 4. Text standardization

  • Recipe 5. Spelling correction

  • Recipe 6. Tokenization

  • Recipe 7. Stemming

  • Recipe 8. Lemmatization

  • Recipe 9. Exploratory data analysis

  • Recipe 10. End-to-end processing pipeline

Before directly jumping into the recipes, let us first understand the need for preprocessing text data. As we all know, around 90% of the world’s data is unstructured and may be present in the form of images, text, audio, and video. Text can come in a variety of forms, from a list of individual words, to sentences, to multiple paragraphs with special characters (like tweets and other punctuation). It may also be present in the form of web pages, HTML, documents, etc. This data is never clean and contains a lot of noise. It needs to be treated and then run through a few preprocessing functions to make sure we have the right input data for feature engineering and model building. If we don’t preprocess the data, any algorithm built on top of it will not add any value for the business. This reminds me of a very popular phrase in the Data Science world: “Garbage in, garbage out.”

Preprocessing involves transforming raw text data into an understandable format. Real-world data is very often incomplete, inconsistent, and filled with a lot of noise and is likely to contain many errors. Preprocessing is a proven method of resolving such issues. Data preprocessing prepares raw text data for further processing.

Recipe 2-1. Converting Text Data to Lowercase

In this recipe, we are going to discuss how to lowercase the text data in order to have all the data in a uniform format and to make sure “NLP” and “nlp” are treated the same way.

Problem

You want to lowercase the text data.

Solution

The simplest way to do this is by using the default lower() function in Python.

The lower() method converts all uppercase characters in a string into lowercase characters and returns the result.

How It Works

Let’s follow the steps in this section to lowercase a given text or document. Here, we are going to use Python.

Step 1-1 Read/create the text data

Let’s create a list of strings and assign it to a variable.
text=['This is introduction to NLP','It is likely to be useful, to people ','Machine learning is the new electrcity','There would be less hype around AI and more action going forward','python is the best tool!','R is good langauage','I like this book','I want more books like this']
#convert list to data frame
import pandas as pd
df = pd.DataFrame({'tweet':text})
print(df)
#output
                                               tweet
0                        This is introduction to NLP
1              It is likely to be useful, to people
2             Machine learning is the new electrcity
3  There would be less hype around AI and more ac...
4                           python is the best tool!
5                                R is good langauage
6                                   I like this book
7                        I want more books like this

Step 1-2 Execute lower() function on the text data

When there is just a single string, apply the lower() function directly as shown below:
x = 'Testing'
x2 = x.lower()
print(x2)
#output
'testing'
When you want to perform lowercasing on a data frame, use the apply function as shown below:
df['tweet'] = df['tweet'].apply(lambda x: " ".join(x.lower() for x in x.split()))
df['tweet']
#output
0                          this is introduction to nlp
1                 it is likely to be useful, to people
2               machine learning is the new electrcity
3    there would be less hype around ai and more ac...
4                             python is the best tool!
5                                  r is good langauage
6                                     i like this book
7                          i want more books like this
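As a side note, pandas can also do this in one vectorized step through its built-in str accessor; the one-liner below is a minimal sketch that produces the same result:
#vectorized alternative to the lambda above
df['tweet'] = df['tweet'].str.lower()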

That’s all. We have converted the whole tweet column into lowercase. Let’s see what else we can do in the next recipes.

Recipe 2-2. Removing Punctuation

In this recipe, we are going to discuss how to remove punctuation from the text data. This step is very important as punctuation doesn’t add any extra information or value. Hence removal of all such instances will help reduce the size of the data and increase computational efficiency.

Problem

You want to remove punctuation from the text data.

Solution

The simplest way to do this is by using the regex and replace() function in Python.

How It Works

Let’s follow the steps in this section to remove punctuation from the text data.

Step 2-1 Read/create the text data

Let’s create a list of strings and assign it to a variable.
text=['This is introduction to NLP','It is likely to be useful, to people ','Machine learning is the new electrcity','There would be less hype around AI and more action going forward','python is the best tool!','R is good langauage','I like this book','I want more books like this']
#convert list to dataframe
import pandas as pd
df = pd.DataFrame({'tweet':text})
print(df)
#output
                                               tweet
0                        This is introduction to NLP
1              It is likely to be useful, to people
2             Machine learning is the new electrcity
3  There would be less hype around AI and more ac...
4                           python is the best tool!
5                                R is good langauage
6                                   I like this book
7                        I want more books like this

Step 2-2 Execute the below function on the text data

Using the regex and replace() functions, we can remove the punctuation as shown below:
import re
s = "I. like. This book!"
s1 = re.sub(r'[^\w\s]', '', s)
s1
#output
'I like This book'
Or:
df['tweet'] = df['tweet'].str.replace('[^\w\s]', '', regex=True)
df['tweet']
#output
0                          this is introduction to nlp
1                  it is likely to be useful to people
2               machine learning is the new electrcity
3    there would be less hype around ai and more ac...
4                              python is the best tool
5                                  r is good langauage
6                                     i like this book
7                          i want more books like this
Or:
import string
s = "I. like. This book!"
for c in string.punctuation:
    s = s.replace(c, "")
s
#output
'I like This book'
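If you are working with large amounts of text, Python's translate() method removes all punctuation characters in a single pass; here is a minimal sketch:
#remove punctuation using a translation table
import string
s = "I. like. This book!"
s.translate(str.maketrans('', '', string.punctuation))
#output
'I like This book'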

Recipe 2-3. Removing Stop Words

In this recipe, we are going to discuss how to remove stop words. Stop words are very common words that carry little meaning compared to other keywords. If we remove these very commonly used words, we can focus on the important keywords instead. For example, in the context of a search engine, suppose your search query is “How to develop chatbot using python.” If the search engine tries to find web pages that contain the terms “how,” “to,” “develop,” “chatbot,” “using,” and “python,” it is going to find many more pages that contain the terms “how” and “to” than pages that contain information about developing a chatbot, because those two terms are so commonly used in the English language. If we remove such terms, the search engine can focus on retrieving pages that contain the keywords “develop,” “chatbot,” and “python,” which would more closely bring up pages of real interest. Similarly, we can remove other very common words and rare words as well.

Problem

You want to remove stop words.

Solution

The simplest way to do this is by using the NLTK library, or you can build your own stop words file.

How It Works

Let’s follow the steps in this section to remove stop words from the text data.

Step 3-1 Read/create the text data

Let’s create a list of strings and assign it to a variable.
text=['This is introduction to NLP','It is likely to be useful, to people ','Machine learning is the new electrcity','There would be less hype around AI and more action going forward','python is the best tool!','R is good langauage','I like this book','I want more books like this']
#convert list to data frame
import pandas as pd
df = pd.DataFrame({'tweet':text})
print(df)
#output
                                               tweet
0                        This is introduction to NLP
1              It is likely to be useful, to people
2             Machine learning is the new electrcity
3  There would be less hype around AI and more ac...
4                           python is the best tool!
5                                R is good langauage
6                                   I like this book
7                        I want more books like this

Step 3-2 Execute the below commands on the text data

Using the NLTK library, we can remove the stop words as shown below.
#install and import libraries
!pip install nltk
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
#remove stop words
stop = stopwords.words('english')
df['tweet'] = df['tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
df['tweet']
#output
0                                  introduction nlp
1                              likely useful people
2                   machine learning new electrcity
3    would less hype around ai action going forward
4                                  python best tool
5                                  r good langauage
6                                         like book
7                                   want books like

There are no stop words now; they have all been removed in this step.
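If your domain has its own noise words, you can extend the default NLTK list before filtering. The snippet below is a sketch; the extra words chosen here ('rt' and 'via', common in tweets) are just examples:
#extend the default stop word list with custom words
custom_stop = stop + ['rt', 'via']
df['tweet'] = df['tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in custom_stop))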

Recipe 2-4. Standardizing Text

In this recipe, we are going to discuss how to standardize the text. But before that, let’s understand what text standardization is and why we need to do it. Most text data is in the form of either customer reviews, blogs, or tweets, where there is a high chance of people using short words and abbreviations to represent the same meaning. Standardizing these into their full forms helps the downstream processes easily understand and resolve the semantics of the text.

Problem

You want to standardize text.

Solution

We can write our own custom dictionary to look for short words and abbreviations.

How It Works

Let’s follow the steps in this section to perform text standardization.

Step 4-1 Create a custom lookup dictionary

The dictionary will be for text standardization based on your data.
lookup_dict = {'nlp':'natural language processing', 'ur':'your', "wbu" : "what about you"}
import re

Step 4-2 Create a custom function for text standardization

Here is the code:
def text_std(input_text):
    words = input_text.split()
    new_words = []
    for word in words:
        word = re.sub(r'[^\w\s]', '', word)
        #standardize the word if it is in the lookup dictionary
        if word.lower() in lookup_dict:
            word = lookup_dict[word.lower()]
            new_words.append(word)
    new_text = " ".join(new_words)
    return new_text

Step 4-3 Run the text_std function

We also need to check the output:
text_std("I like nlp it's ur choice")
#output
'natural language processing your'
Here, nlp has been standardized to 'natural language processing' and ur to 'your'. Note that, as written, the function keeps only the words it finds in the lookup dictionary; a variant that keeps the rest of the sentence as well is sketched below.
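Here is that variant, a minimal sketch (text_std_keep_all is our own hypothetical name), which standardizes the words found in the dictionary and passes everything else through unchanged:
def text_std_keep_all(input_text):
    new_words = []
    for word in input_text.split():
        clean = re.sub(r'[^\w\s]', '', word)
        #replace the word if it is in the lookup dictionary; otherwise keep it
        new_words.append(lookup_dict.get(clean.lower(), word))
    return " ".join(new_words)
text_std_keep_all("I like nlp it's ur choice")
#output
"I like natural language processing it's your choice"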

Recipe 2-5. Correcting Spelling

In this recipe, we are going to discuss how to do spelling correction. But before that, let’s understand why spelling correction is important. Most text data is in the form of either customer reviews, blogs, or tweets, where there is a high chance of people using short words and making typos. Correcting these helps us reduce multiple copies of words that represent the same meaning. For example, “proccessing” and “processing” would otherwise be treated as different words, even if they are used in the same sense.

Note that abbreviations should be handled before this step, or else the corrector may fail at times. Say, for example, “ur” (which actually means “your”) would be corrected to “or.”

Problem

You want to do spelling correction.

Solution

The simplest way to do this is by using the TextBlob library.

How It Works

Let’s follow the steps in this section to do spelling correction.

Step 5-1 Read/create the text data

Let’s create a list of strings and assign it to a variable.
text=['Introduction to NLP','It is likely to be useful, to people ','Machine learning is the new electrcity', 'R is good langauage','I like this book','I want more books like this']
#convert list to dataframe
import pandas as pd
df = pd.DataFrame({'tweet':text})
print(df)
#output
                                    tweet
0                     Introduction to NLP
1   It is likely to be useful, to people
2  Machine learning is the new electrcity
3                     R is good langauage
4                        I like this book
5             I want more books like this

Step 5-2 Execute the below code on the text data

Using TextBlob, we can do spelling correction as shown below:
#Install textblob library
!pip install textblob
#import libraries and use 'correct' function
from textblob import TextBlob
df['tweet'].apply(lambda x: str(TextBlob(x).correct()))
#output
0                        Introduction to NLP
1      It is likely to be useful, to people
2    Machine learning is the new electricity
3                         R is good language
4                           I like this book
5                I want more books like this
If you observe closely, you can see that the spellings of electricity and language have been corrected.
#You can also use autocorrect library as shown below
#install autocorrect
!pip install autocorrect
from autocorrect import spell
print(spell(u'mussage'))
print(spell(u'sirvice'))
#output
'message'
'service'
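Note that newer releases of the autocorrect library replaced the spell function with a Speller class. If the import above fails, the sketch below should work instead (the exact API depends on the version you installed):
from autocorrect import Speller
spell = Speller(lang='en')
print(spell('mussage'))
#output should again be 'message'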

Recipe 2-6. Tokenizing Text

In this recipe, we will look at ways to tokenize. Tokenization refers to splitting text into minimal meaningful units. There are sentence tokenizers as well as word tokenizers. In this recipe we will see a word tokenizer, which is a mandatory step in text preprocessing for any kind of analysis. There are many libraries to perform tokenization, like NLTK, spaCy, and TextBlob. Here are a few ways to achieve it.

Problem

You want to do tokenization.

Solution

The simplest way to do this is by using the TextBlob library.

How It Works

Let’s follow the steps in this section to perform tokenization.

Step 6-1 Read/create the text data

Let’s create a list of strings and assign it to a variable.
text=['This is introduction to NLP','It is likely to be useful, to people ','Machine learning is the new electrcity','There would be less hype around AI and more action going forward','python is the best tool!','R is good langauage','I like this book','I want more books like this']
#convert list to dataframe
import pandas as pd
df = pd.DataFrame({'tweet':text})
print(df)
#output
                                               tweet
0                        This is introduction to NLP
1              It is likely to be useful, to people
2             Machine learning is the new electrcity
3  There would be less hype around AI and more ac...
4                           python is the best tool!
5                                R is good langauage
6                                   I like this book
7                        I want more books like this

Step 6-2 Execute the below code on the text data

The result of tokenization is a list of tokens:
#Using textblob
from textblob import TextBlob
TextBlob(df['tweet'][3]).words
#output
WordList(['There', 'would', 'be', 'less', 'hype', 'around', 'AI', 'and', 'more', 'action', 'going', 'forward'])
#using NLTK
import nltk
nltk.download('punkt')
#create data
mystring = "My favorite animal is cat"
nltk.word_tokenize(mystring)
#output
['My', 'favorite', 'animal', 'is', 'cat']
#using split function from python
mystring.split()
#output
['My', 'favorite', 'animal', 'is', 'cat']
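Since this recipe mentioned sentence tokenizers as well, here is a minimal sketch of NLTK's sent_tokenize, which uses the same punkt model downloaded above:
#using NLTK's sentence tokenizer
from nltk.tokenize import sent_tokenize
sent_tokenize("I like this book. I want more books like this.")
#output
['I like this book.', 'I want more books like this.']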

Recipe 2-7. Stemming

In this recipe, we will discuss stemming. Stemming is a process of extracting a root word. For example, “fish,” “fishes,” and “fishing” are stemmed into fish.

Problem

You want to do stemming.

Solution

The simplest way to do this is by using the NLTK or TextBlob library.

How It Works

Let’s follow the steps in this section to perform stemming.

Step 7-1 Read the text data

Let’s create a list of strings and assign it to a variable.
text=['I like fishing','I eat fish','There are many fishes in pound']
#convert list to dataframe
import pandas as pd
df = pd.DataFrame({'tweet':text})
print(df)
#output
                            tweet
0                  I like fishing
1                      I eat fish
2  There are many fishes in pound

Step 7-2 Stemming the text

Execute the below code on the text data:
#Import library
from nltk.stem import PorterStemmer
st = PorterStemmer()
df['tweet'][:5].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))
#output
0                     I like fish
1                      I eat fish
2    there are mani fish in pound

If you observe this, you will notice that fish, fishing, and fishes have been stemmed to fish.
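NLTK also ships other stemmers; SnowballStemmer is a common alternative to the Porter stemmer. The sketch below shows it on the same kind of words:
#using the Snowball stemmer
from nltk.stem.snowball import SnowballStemmer
snow = SnowballStemmer('english')
print(snow.stem('fishing'), snow.stem('fishes'))
#output
fish fish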

Recipe 2-8. Lemmatizing

In this recipe, we will discuss lemmatization. Lemmatization is a process of extracting a root word by considering the vocabulary. For example, “good,” “better,” or “best” is lemmatized into good.

Lemmatization determines the part of speech of a word and returns the dictionary form of the word, which must be a valid word, whereas stemming just extracts the root word regardless of whether it is valid.
  • Lemmatization handles matching “car” to “cars” along with matching “car” to “automobile.”

  • Stemming handles matching “car” to “cars.”

Lemmatization can get better results.
  • The stemmed form of leafs is leaf.

  • The stemmed form of leaves is leav.

  • The lemmatized form of leafs is leaf.

  • The lemmatized form of leaves is leaf.

Problem

You want to perform lemmatization.

Solution

The simplest way to do this is by using NLTK or the TextBlob library.

How It Works

Let’s follow the steps in this section to perform lemmatization.

Step 8-1 Read the text data

Let’s create a list of strings and assign it to a variable.
text=['I like fishing','I eat fish','There are many fishes in pound', 'leaves and leaf']
#convert list to dataframe
import pandas as pd
df = pd.DataFrame({'tweet':text})
print(df)
                            tweet
0                  I like fishing
1                      I eat fish
2  There are many fishes in pound
3                 leaves and leaf

Step 8-2 Lemmatizing the data

Execute the below code on the text data:
#Import library
from textblob import Word
#Code for lemmatize
df['tweet'] = df['tweet'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
df['tweet']
#output
0                I like fishing
1                    I eat fish
2  There are many fish in pound
3                 leaf and leaf

You can observe that fishes has been lemmatized to fish and, as explained earlier, leaves and leaf are both lemmatized to leaf.
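Since lemmatization depends on the part of speech, you can pass a POS tag to TextBlob's lemmatize method. The classic example below is a sketch, where 'a' stands for adjective:
from textblob import Word
Word('better').lemmatize('a')
#output
'good'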

Recipe 2-9. Exploring Text Data

So far, we are comfortable with data collection and text preprocessing. Let us perform some exploratory data analysis.

Problem

You want to explore and understand the text data.

Solution

The simplest way to do this is by using the NLTK or TextBlob library.

How It Works

Let’s follow the steps in this process.

Step 9-1 Read the text data

Execute the below code to download the dataset, if you haven’t already done so:
#Importing data
import nltk
from nltk.corpus import webtext
nltk.download('webtext')
wt_sentences = webtext.sents('firefox.txt')
wt_words = webtext.words('firefox.txt')

Step 9-2 Import necessary libraries

Import the libraries for computing frequency:
from nltk.probability import FreqDist
from nltk.corpus import stopwords
import string

Step 9-3 Check number of words in the data

Count the number of sentences and words:
len(wt_sentences)
#output
1142
len(wt_words)
#output
102457

Step 9-4 Compute the frequency of all words in the reviews

Generating frequency for all the words:
frequency_dist = nltk.FreqDist(wt_words)
frequency_dist
#showing only top few results
FreqDist({'slowing': 1,
          'warnings': 6,
          'rule': 1,
          'Top': 2,
          'XBL': 12,
          'installation': 44,
          'Networking': 1,
          'inccorrect': 1,
          'killed': 3,
          ']"': 1,
          'LOCKS': 1,
          'limited': 2,
          'cookies': 57,
          'method': 12,
          'arbitrary': 2,
          'b': 3,
          'titlebar': 6,
sorted_frequency_dist = sorted(frequency_dist, key=frequency_dist.__getitem__, reverse=True)
sorted_frequency_dist
['.',
 'in',
 'to',
 '"',
 'the',
 "'",
 'not',
 '-',
 'when',
 'on',
 'a',
 'is',
 't',
 'and',
 'of',
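As a shortcut, FreqDist also has a built-in helper that returns the highest-count words directly, which avoids the manual sorted() call; a minimal sketch:
#top 10 words by frequency
frequency_dist.most_common(10)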

Step 9-5 Consider words with length greater than 3 and plot

Let’s consider only the words whose length is greater than 3 and plot their frequencies.
large_words = dict([(k,v) for k,v in frequency_dist.items() if len(k)>3])
frequency_dist = nltk.FreqDist(large_words)
frequency_dist.plot(50,cumulative=False)
#output
[Figure: frequency distribution plot of the top 50 words]

Step 9-6 Build Wordcloud

A word cloud is a pictorial representation of the most frequently occurring words, where the size of each word reflects its frequency.
#install library
!pip install wordcloud
#build wordcloud
from wordcloud import WordCloud
wcloud = WordCloud().generate_from_frequencies(frequency_dist)
#plotting the wordcloud
import matplotlib.pyplot as plt
plt.imshow(wcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
#output
[Figure: word cloud of the most frequent words]

Readers, give this a try: remove the stop words and then build the word cloud. The output would look something like the figure below; one possible approach is sketched after it.

[Figure: word cloud after stop word removal]
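One possible way to approach this exercise is sketched below (assuming the NLTK stopwords corpus is downloaded): filter the stop words and non-alphabetic tokens out of wt_words before computing the frequencies, then rebuild the cloud.
#word cloud after stop word removal (a sketch)
stop = set(stopwords.words('english'))
words_clean = [w for w in wt_words if w.isalpha() and w.lower() not in stop]
wcloud = WordCloud().generate_from_frequencies(nltk.FreqDist(words_clean))
plt.imshow(wcloud, interpolation="bilinear")
plt.axis("off")
plt.show()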

Recipe 2-10. Building a Text Preprocessing Pipeline

So far, we have completed most of the text manipulation and processing techniques and methods. In this recipe, let’s do something interesting.

Problem

You want to build an end-to-end text preprocessing pipeline. Whenever you want to do preprocessing for any NLP application, you can directly plug in data to this pipeline function and get the required clean text data as the output.

Solution

The simplest way to do this is by creating a custom function that incorporates all the techniques learned so far.

How It Works

This works by putting all the possible processing techniques into a wrapper function and passing the data through it.

Step 10-1 Read/create the text data

Let’s create a sample tweet and assign it to a variable:
tweet_sample= "How to take control of your #debt https://personal.vanguard.com/us/insights/saving-investing/debt-management.#Best advice for #family #financial #success (@PrepareToWin)"

You can also use your Twitter data extracted in Chapter 1.

Step 10-2 Process the text

Execute the below function to process the tweet:
def processRow(row):
    import re
    from textblob import Word
    tweet = row
    #Lower case
    tweet = tweet.lower()
    #Remove unicode escape strings like "\u002c" and "\x96"
    tweet = re.sub(r'(\\u[0-9A-Fa-f]+)', '', tweet)
    tweet = re.sub(r'[^\x00-\x7f]', '', tweet)
    #Convert any url to URL
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)
    #Convert any @Username to "AT_USER"
    tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet)
    #Remove additional white spaces
    tweet = re.sub(r'[\s]+', ' ', tweet)
    #Replace #word with word (removes the hashtag in front of a word)
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)
    #Replace non-alphanumeric symbols with white spaces
    tweet = re.sub(r'[^\w]', ' ', tweet)
    #Remove :( or :)
    tweet = tweet.replace(':)', '')
    tweet = tweet.replace(':(', '')
    #Remove numbers
    tweet = ''.join([i for i in tweet if not i.isdigit()])
    #Replace runs of multiple exclamation marks with a space
    tweet = re.sub(r'(!)\1+', ' ', tweet)
    #Replace runs of multiple question marks with a space
    tweet = re.sub(r'(\?)\1+', ' ', tweet)
    #Replace runs of multiple full stops with a space
    tweet = re.sub(r'(\.)\1+', ' ', tweet)
    #Lemmatize each word
    tweet = " ".join([Word(word).lemmatize() for word in tweet.split()])
    #Stemming (optional alternative to lemmatization)
    #from nltk.stem import PorterStemmer
    #st = PorterStemmer()
    #tweet = " ".join([st.stem(word) for word in tweet.split()])
    #Remove emoticons such as :) ;) :-( from the text
    tweet = re.sub(r':\)|;\)|:-\)|\(-:|:-D|=D|:P|xD|X-p|\^\^|:-\*|\^\.\^|\^-\^|\^_\^|,-\)|\)-:|:\'\(|:\(|:-\(|:S|T\.T|\._\.|:<|:-S|:-<|\*-\*|:O|=O|=-O|O\.o|XO|O_O|:-@|=/|:/|X-\(|>\.<|>=\(|D:', '', tweet)
    #Trim surrounding quotes
    tweet = tweet.strip('\'"')
    row = tweet
    return row
#call the function with your data
processRow(tweet_sample)
#output
'how to take control of your debt URL best advice for family financial success AT_USER'
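If you want to run this pipeline over an entire column rather than a single string, you can apply it row by row. The sketch below assumes a data frame with a tweet column, like the ones used in the earlier recipes:
#apply the pipeline to a whole column (assumes df['tweet'] exists)
df['clean_tweet'] = df['tweet'].apply(processRow)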