Akshay Kulkarni and Adarsha Shivananda, Natural Language Processing Recipes, https://doi.org/10.1007/978-1-4842-4267-4_2

2. Exploring and Processing Text Data

Akshay Kulkarni¹ and Adarsha Shivananda¹

(1) Bangalore, Karnataka, India

In this chapter, we are going to cover various methods and techniques to preprocess the text data along with exploratory data analysis.

We are going to discuss the following recipes under text preprocessing and exploratory data analysis.

Recipe 1. Lowercasing
Recipe 2. Punctuation removal
Recipe 3. Stop words removal
Recipe 4. Text standardization
Recipe 5. Spelling correction
Recipe 6. Tokenization
Recipe 7. Stemming
Recipe 8. Lemmatization
Recipe 9. Exploratory data analysis
Recipe 10. End-to-end processing pipeline

Before jumping into the recipes, let us first understand the need for preprocessing text data. It is often said that around 90% of the world’s data is unstructured, present in the form of images, text, audio, and video. Text can come in a variety of forms, from a list of individual words, to sentences, to multiple paragraphs with special characters (as in tweets) and punctuation. It may also be present in the form of web pages, HTML, documents, etc. This data is rarely clean and usually contains a lot of noise. It needs to be cleaned with a few preprocessing functions to make sure we have the right input data for feature engineering and model building. If we don’t preprocess the data, any algorithm built on top of it is unlikely to add value for the business. This reminds me of a very popular phrase in the data science world: “Garbage in, garbage out.”

Preprocessing involves transforming raw text data into an understandable format. Real-world data is very often incomplete, inconsistent, and filled with a lot of noise and is likely to contain many errors. Preprocessing is a proven method of resolving such issues. Data preprocessing prepares raw text data for further processing.

Recipe 2-1. Converting Text Data to Lowercase

In this recipe, we are going to discuss how to lowercase the text data in order to have all the data in a uniform format and to make sure “NLP” and “nlp” are treated as the same.

Problem

How to lowercase the text data?

Solution

The simplest way to do this is by using the built-in lower() string method in Python.

The lower() method converts all uppercase characters in a string into lowercase characters and returns the resulting string.

How It Works

Let’s follow the steps in this section to lowercase a given text or document. Here, we are going to use Python.

Step 1-1 Read/create the text data

Let’s create a list of strings and assign it to a variable.

text=['This is introduction to NLP','It is likely to be useful, to people ','Machine learning is the new electrcity','There would be less hype around AI and more action going forward','python is the best tool!','R is good langauage','I like this book','I want more books like this']

#convert list to data frame

import pandas as pd

df = pd.DataFrame({'tweet':text})

print(df)

#output

0 This is introduction to NLP

1 It is likely to be useful, to people

2 Machine learning is the new electrcity

3 There would be less hype around AI and more ac...

4 python is the best tool!

5 R is good langauage

6 I like this book

7 I want more books like this

Step 1-2 Execute lower() function on the text data

When you have just a single string, apply the lower() function directly as shown below:

x = 'Testing'

x2 = x.lower()

print(x2)

#output

testing

When you want to perform lowercasing on a data frame, use the apply function as shown below:

df['tweet'] = df['tweet'].apply(lambda x: " ".join(x.lower() for x in x.split()))

df['tweet']

#output

0 this is introduction to nlp

1 it is likely to be useful, to people

2 machine learning is the new electrcity

3 there would be less hype around ai and more ac...

4 python is the best tool!

5 r is good langauage

6 i like this book

7 i want more books like this

That’s all. We have converted the whole tweet column into lowercase. Let’s see what else we can do in the next recipes.
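As a small complement to the steps above, here is a minimal sketch using only the standard library. The helper name to_lowercase and its aggressive flag are illustrative assumptions, not part of the recipe. It shows that str.casefold() normalizes some Unicode characters that lower() leaves alone, which can matter for non-English text.

```python
# Minimal sketch: normalize case for a list of strings (standard library only).
# `to_lowercase` and `aggressive` are hypothetical names used for illustration.
def to_lowercase(texts, aggressive=False):
    """Return lowercased copies; casefold() covers more Unicode cases than lower()."""
    return [t.casefold() if aggressive else t.lower() for t in texts]

sample = ["This is introduction to NLP", "R is good langauage"]
lowered = to_lowercase(sample)
print(lowered)

# lower() keeps the German sharp s, casefold() expands it to "ss"
print("Straße".lower(), "Straße".casefold())
```

For plain English text the two methods usually give the same result, so lower() as used in the recipe is normally enough.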

Recipe 2-2. Removing Punctuation

In this recipe, we are going to discuss how to remove punctuation from the text data. This step matters because, for many text analysis tasks, punctuation adds little extra information or value. Removing it helps reduce the size of the data and increase computational efficiency.

Problem

You want to remove punctuation from the text data.

Solution

The simplest way to do this is by using a regular expression with the re.sub() function or the pandas str.replace() function in Python.

How It Works

Let’s follow the steps in this section to remove punctuation from the text data.

Step 2-1 Read/create the text data

Let’s reuse the list of strings created in Recipe 2-1 and convert it to a data frame.

#convert list to dataframe

import pandas as pd

df = pd.DataFrame({'tweet':text})

print(df)

#output

0 This is introduction to NLP

1 It is likely to be useful, to people

2 Machine learning is the new electrcity

3 There would be less hype around AI and more ac...

4 python is the best tool!

5 R is good langauage

6 I like this book

7 I want more books like this

Step 2-2 Execute below function on the text data

Using a regular expression with re.sub() or str.replace(), we can remove the punctuation as shown below:

import re

s = "I. like. This book!"

s1 = re.sub(r'[^\w\s]','',s)

#output

'I like This book'

Or:

df['tweet'] = df['tweet'].str.replace(r'[^\w\s]','',regex=True)

df['tweet']

#output

0 this is introduction to nlp

1 it is likely to be useful to people

2 machine learning is the new electrcity

3 there would be less hype around ai and more ac...

4 python is the best tool

5 r is good langauage

6 i like this book

7 i want more books like this

Or:

import string

s = "I. like. This book!"

for c in string.punctuation:

    s = s.replace(c,"")

#output

'I like This book'
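The two approaches above differ in one detail worth seeing side by side. The sketch below, with the hypothetical helper names strip_punct_translate and strip_punct_regex, uses str.translate to delete every character in string.punctuation in a single pass, and contrasts it with the regular expression, which keeps the underscore because \w matches it.

```python
import re
import string

# Translation table that deletes every character listed in string.punctuation
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punct_translate(text):
    # Single pass over the string, no explicit loop over punctuation characters
    return text.translate(_PUNCT_TABLE)

def strip_punct_regex(text):
    # \w matches letters, digits and the underscore, so "_" survives here
    return re.sub(r"[^\w\s]", "", text)

print(strip_punct_translate("I. like. This book!"))
print(strip_punct_translate("snake_case!"), strip_punct_regex("snake_case!"))
```

Which behavior is preferable depends on the data. For tweets with placeholders such as AT_USER, keeping the underscore may be the safer choice.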

Recipe 2-3. Removing Stop Words

In this recipe, we are going to discuss how to remove stop words. Stop words are very common words that carry little or no meaning compared to other keywords. If we remove these very common words, we can focus on the important keywords instead. For example, in the context of a search engine, suppose your search query is “How to develop chatbot using python.” If the search engine tries to find web pages that contain the terms “how,” “to,” “develop,” “chatbot,” “using,” and “python,” it is going to find far more pages that contain “how” and “to” than pages about developing a chatbot, because “how” and “to” are so commonly used in the English language. If we remove such terms, the search engine can focus on retrieving pages that contain the keywords “develop,” “chatbot,” and “python,” which would more closely bring up pages that are of real interest. Similarly, we can remove other very common words and very rare words as well.

Problem

You want to remove stop words.

Solution

The simplest way to do this is by using the NLTK library, or you can build your own stop words file.

How It Works

Let’s follow the steps in this section to remove stop words from the text data.

Step 3-1 Read/create the text data

Let’s reuse the list of strings created in Recipe 2-1 and convert it to a data frame.

#convert list to data frame

import pandas as pd

df = pd.DataFrame({'tweet':text})

print(df)

#output

0 This is introduction to NLP

1 It is likely to be useful, to people

2 Machine learning is the new electrcity

3 There would be less hype around AI and more ac...

4 python is the best tool!

5 R is good langauage

6 I like this book

7 I want more books like this

Step 3-2 Execute below commands on the text data

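Using the NLTK library, we can remove the stop words as shown below. The output shown assumes the text has already been lowercased (Recipe 2-1) and stripped of punctuation (Recipe 2-2); NLTK’s English stop word list is lowercase.

#install and import libraries

!pip install nltk

import nltk

#download the stop words corpus

nltk.download('stopwords')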

from nltk.corpus import stopwords

#remove stop words

stop = stopwords.words('english')

df['tweet'] = df['tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))

df['tweet']

#output

0 introduction nlp

1 likely useful people

2 machine learning new electrcity

3 would less hype around ai action going forward

4 python best tool

5 r good langauage

6 like book

7 want books like

The stop words in NLTK’s list have all been removed in this step.
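The Solution mentions building your own stop words list. A minimal sketch of that idea follows, using only the standard library. The set CUSTOM_STOP_WORDS and the function remove_stop_words are illustrative assumptions, and a real list would normally be loaded from a file and be much longer.

```python
# Hypothetical, deliberately tiny stop word list for illustration only
CUSTOM_STOP_WORDS = {"this", "is", "to", "it", "be", "the", "i"}

def remove_stop_words(text, stop_words=CUSTOM_STOP_WORDS):
    # Compare in lowercase so "This" and "this" are treated alike
    kept = [w for w in text.split() if w.lower() not in stop_words]
    return " ".join(kept)

print(remove_stop_words("This is introduction to NLP"))
```

Comparing in lowercase inside the function means the original casing of the kept words is preserved, which can be handy when lowercasing is done at a later stage.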

Recipe 2-4. Standardizing Text

In this recipe, we are going to discuss how to standardize the text. But before that, let’s understand what text standardization is and why we need to do it. Most text data is in the form of customer reviews, blogs, or tweets, where there is a high chance of people using short words and abbreviations to represent the same meaning. Text standardization maps such short forms to a single standard form, which may help the downstream process to understand and resolve the semantics of the text more easily.

Problem

You want to standardize text.

Solution

We can write our own custom dictionary to look for short words and abbreviations.

How It Works

Let’s follow the steps in this section to perform text standardization.

Step 4-1 Create a custom lookup dictionary

The dictionary will be for text standardization based on your data.

lookup_dict = {'nlp':'natural language processing', 'ur':'your', "wbu" : "what about you"}

import re

Step 4-2 Create a custom function for text standardization

Here is the code:

def text_std(input_text):

    words = input_text.split()

    new_words = []

    for word in words:

        #strip punctuation from the word

        word = re.sub(r'[^\w\s]','',word)

        #replace the word if it is a known abbreviation

        if word.lower() in lookup_dict:

            word = lookup_dict[word.lower()]

        #keep every word, replaced or not

        new_words.append(word)

    new_text = " ".join(new_words)

    return new_text

Step 4-3 Run the text_std function

We also need to check the output:

text_std("I like nlp it's ur choice")

#output

'I like natural language processing its your choice'

Here, nlp has been standardized to 'natural language processing' and ur to 'your'. Note that the punctuation-stripping step also turns it's into its.
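As an alternative sketch of the same lookup idea, the abbreviations can be replaced with a single compiled regular expression that matches whole words only, which leaves punctuation and the rest of the sentence untouched. The names _pattern and standardize are assumptions made for this illustration.

```python
import re

lookup_dict = {"nlp": "natural language processing", "ur": "your", "wbu": "what about you"}

# One pattern matching any key as a whole word; longer keys are tried first
_pattern = re.compile(
    r"\b(?:" + "|".join(sorted(map(re.escape, lookup_dict), key=len, reverse=True)) + r")\b",
    flags=re.IGNORECASE,
)

def standardize(text):
    # Look up the matched abbreviation in lowercase and substitute its expansion
    return _pattern.sub(lambda m: lookup_dict[m.group(0).lower()], text)

print(standardize("I like NLP, it's ur choice"))
```

The word boundary \b is what prevents "ur" from being replaced inside words such as "your" or "turn".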

Recipe 2-5. Correcting Spelling

In this recipe, we are going to discuss how to do spelling correction. But before that, let’s understand why spelling correction is important. Most text data is in the form of customer reviews, blogs, or tweets, where there is a high chance of people using short words and making typos. Correcting spelling helps reduce multiple variants of a word that represent the same meaning. For example, “proccessing” and “processing” will be treated as different words even though they are used in the same sense.

Note that abbreviations should be handled before this step, or else the corrector would fail at times. Say, for example, “ur” (actually means “your”) would be corrected to “or.”

Problem

You want to do spelling correction.

Solution

The simplest way to do this is by using the TextBlob library.

How It Works

Let’s follow the steps in this section to do spelling correction.

Step 5-1 Read/create the text data

Let’s create a list of strings and assign it to a variable.

text=['Introduction to NLP','It is likely to be useful, to people ','Machine learning is the new electrcity', 'R is good langauage','I like this book','I want more books like this']

#convert list to dataframe

import pandas as pd

df = pd.DataFrame({'tweet':text})

print(df)

#output

0 Introduction to NLP

1 It is likely to be useful, to people

2 Machine learning is the new electrcity

3 R is good langauage

4 I like this book

5 I want more books like this

Step 5-2 Execute below code on the text data

Using TextBlob, we can do spelling correction as shown below:

#Install textblob library

!pip install textblob

#import libraries and use 'correct' function

from textblob import TextBlob

df['tweet'].apply(lambda x: str(TextBlob(x).correct()))

#output

0 Introduction to NLP

1 It is likely to be useful, to people

2 Machine learning is the new electricity

3 R is good language

4 I like this book

5 I want more books like this

If you look closely, you will see that it corrected the spelling of electricity and language.

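#You can also use the autocorrect library as shown below

#install autocorrect

!pip install autocorrect

#recent releases expose a Speller class (older releases provided a spell function)

from autocorrect import Speller

spell = Speller(lang='en')

print(spell('mussage'))

print(spell('sirvice'))

#output

message

service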
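To make the idea behind spelling correction more concrete, here is a rough sketch that picks the closest word from a known vocabulary using difflib from the standard library. This is not how TextBlob or autocorrect work internally; the vocabulary VOCAB, the cutoff value, and the function names are assumptions chosen for illustration.

```python
from difflib import get_close_matches

# Hypothetical mini vocabulary; a real corrector would use a large word list
VOCAB = ["machine", "learning", "is", "the", "new", "electricity", "good", "language"]

def correct_word(word, vocab=VOCAB, cutoff=0.8):
    lower = word.lower()
    if lower in vocab:
        return word  # already a known word, keep it as typed
    # Closest vocabulary entry whose similarity ratio is at least `cutoff`
    matches = get_close_matches(lower, vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else word

def correct_text(text):
    return " ".join(correct_word(w) for w in text.split())

print(correct_text("Machine learning is the new electrcity"))
print(correct_text("R is good langauage"))
```

Words with no sufficiently similar vocabulary entry, such as "R", are left unchanged. This is also why abbreviations should be expanded before spelling correction.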

Recipe 2-6. Tokenizing Text

In this recipe, we will look at ways to tokenize text. Tokenization refers to splitting text into minimal meaningful units. There are sentence tokenizers and word tokenizers. We will use a word tokenizer in this recipe; word tokenization is a required step in text preprocessing for almost any kind of analysis. There are many libraries to perform tokenization, such as NLTK, SpaCy, and TextBlob. Here are a few ways to achieve it.

Problem

You want to do tokenization.

Solution

The simplest way to do this is by using the TextBlob library.

How It Works

Let’s follow the steps in this section to perform tokenization.

Step 6-1 Read/create the text data

Let’s reuse the list of strings created in Recipe 2-1 and convert it to a data frame.

#convert list to dataframe

import pandas as pd

df = pd.DataFrame({'tweet':text})

print(df)

#output

0 This is introduction to NLP

1 It is likely to be useful, to people

2 Machine learning is the new electrcity

3 There would be less hype around AI and more ac...

4 python is the best tool!

5 R is good langauage

6 I like this book

7 I want more books like this

Step 6-2 Execute below code on the text data

The result of tokenization is a list of tokens:

#Using textblob

from textblob import TextBlob

TextBlob(df['tweet'][3]).words

#output

WordList(['would', 'less', 'hype', 'around', 'ai', 'action', 'going', 'forward'])

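#using NLTK

#word_tokenize needs the Punkt tokenizer models, e.g. nltk.download('punkt') (or 'punkt_tab' on recent NLTK releases)

import nltk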

#create data

mystring = "My favorite animal is cat"

nltk.word_tokenize(mystring)

#output

['My', 'favorite', 'animal', 'is', 'cat']

#using split function from python

mystring.split()

#output

['My', 'favorite', 'animal', 'is', 'cat']
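The split() call and the library tokenizers agree on this sentence because it has no punctuation. The sketch below uses a simple regular expression (the helper name simple_tokenize is an assumption) to show where whitespace splitting falls short: punctuation stays glued to the neighboring word unless the tokenizer treats it as a separate token.

```python
import re

def simple_tokenize(text):
    # A run of word characters is one token; any other non-space character is its own token
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "python is the best tool!"
print(sentence.split())
print(simple_tokenize(sentence))
```

This is only a rough approximation of a word tokenizer; library tokenizers handle contractions, abbreviations, and similar cases with more care.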

Recipe 2-7. Stemming

In this recipe, we will discuss stemming. Stemming is a process of reducing a word to its root by stripping suffixes; the result need not be a dictionary word. For example, “fish,” “fishes,” and “fishing” are stemmed into fish.

Problem

You want to do stemming.

Solution

The simplest way to do this is by using NLTK or the TextBlob library.

How It Works

Let’s follow the steps in this section to perform stemming.

Step 7-1 Read the text data

Let’s create a list of strings and assign it to a variable.

text=['I like fishing','I eat fish','There are many fishes in pound']

#convert list to dataframe

import pandas as pd

df = pd.DataFrame({'tweet':text})

print(df)

#output

0 I like fishing

1 I eat fish

2 There are many fishes in pound

Step 7-2 Stemming the text

Execute the below code on the text data:

#Import library

from nltk.stem import PorterStemmer

st = PorterStemmer()

df['tweet'][:5].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))

#output

0 I like fish

1 I eat fish

2 there are mani fish in pound

If you observe this, you will notice that fish, fishing, and fishes have been stemmed to fish.
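To give a feel for what a stemmer does, here is a deliberately naive suffix-stripping sketch. It is not the Porter algorithm, which applies a much larger set of ordered rules; the function naive_stem, its suffix list, and the minimum stem length are assumptions made for illustration.

```python
def naive_stem(word, suffixes=("ing", "es", "s"), min_stem=3):
    # Strip the first matching suffix, as long as enough of the word remains
    lower = word.lower()
    for suffix in suffixes:
        if lower.endswith(suffix) and len(lower) - len(suffix) >= min_stem:
            return lower[: -len(suffix)]
    return lower

print([naive_stem(w) for w in ["fish", "fishes", "fishing"]])
print(naive_stem("leaves"))
```

The second call shows the typical side effect of stemming: the stem of "leaves" is cut by rule and is not a real word.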

Recipe 2-8. Lemmatizing

In this recipe, we will discuss lemmatization. Lemmatization is a process of extracting a root word by considering the vocabulary. For example, “good,” “better,” and “best” are lemmatized into good.

Lemmatization takes the part of speech of a word into account and returns the dictionary form of the word, which must be a valid word, whereas stemming just strips the word down to a root that need not be one.

Lemmatization handles matching “cars” to “car” along with irregular forms, such as matching “better” to “good.”
Stemming handles matching “cars” to “car.”

Because of this, lemmatization can give better results.

The stemmed form of leafs is leaf.
The stemmed form of leaves is leav.
The lemmatized form of leafs is leaf.
The lemmatized form of leaves is leaf.

Problem

You want to perform lemmatization.

Solution

The simplest way to do this is by using NLTK or the TextBlob library.

How It Works

Let’s follow the steps in this section to perform lemmatization.

Step 8-1 Read the text data

Let’s create a list of strings and assign it to a variable.

text=['I like fishing','I eat fish','There are many fishes in pound', 'leaves and leaf']

#convert list to dataframe

import pandas as pd

df = pd.DataFrame({'tweet':text})

print(df)

#output

0 I like fishing

1 I eat fish

2 There are many fishes in pound

3 leaves and leaf

Step 8-2 Lemmatizing the data

Execute the below code on the text data:

#Import library

from textblob import Word

#Code for lemmatize

df['tweet'] = df['tweet'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))

df['tweet']

#output

0 I like fishing

1 I eat fish

2 There are many fish in pound

3 leaf and leaf

You can observe that fish and fishes are lemmatized to fish and, as explained, leaves and leaf are lemmatized to leaf.
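The key difference from stemming is that a lemmatizer looks words up in a vocabulary rather than cutting suffixes by rule. The toy sketch below makes that concrete with a hand-written lookup table. LEMMA_TABLE and the function names are hypothetical, and a real lemmatizer such as the WordNet-based one used above relies on a full dictionary and part-of-speech information.

```python
# Hypothetical lookup of inflected form -> dictionary form, for illustration only
LEMMA_TABLE = {
    "leaves": "leaf",
    "leafs": "leaf",
    "fishes": "fish",
    "better": "good",
    "best": "good",
}

def toy_lemmatize(word, table=LEMMA_TABLE):
    # Known inflected forms map to their lemma; unknown words are returned unchanged
    return table.get(word.lower(), word)

def lemmatize_text(text):
    return " ".join(toy_lemmatize(w) for w in text.split())

print(lemmatize_text("leaves and leaf"))
```

Because every output comes from the table, the result is always a valid word, which is exactly the property that suffix stripping cannot guarantee.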

Recipe 2-9. Exploring Text Data

So far, we are comfortable with data collection and text preprocessing. Let us perform some exploratory data analysis.

Problem

You want to explore and understand the text data.

Solution

The simplest way to do this is by using NLTK or the TextBlob library.

How It Works

Let’s follow the steps in this process.

Step 9-1 Read the text data

Execute the code below to download the dataset, if you haven’t already done so:

#Importing data

import nltk

from nltk.corpus import webtext

nltk.download('webtext')

wt_sentences = webtext.sents('firefox.txt')

wt_words = webtext.words('firefox.txt')

Step 9-2 Import necessary libraries

Import Library for computing frequency:

from nltk.probability import FreqDist

from nltk.corpus import stopwords

import string

Step 9-3 Check number of words in the data

Count the number of sentences and words:

len(wt_sentences)

#output

1142

len(wt_words)

#output

102457

Step 9-4 Compute the frequency of all words in the reviews

Generating frequency for all the words:

frequency_dist = nltk.FreqDist(wt_words)

frequency_dist

#showing only top few results

FreqDist({'slowing': 1,

'warnings': 6,

'rule': 1,

'Top': 2,

'XBL': 12,

'installation': 44,

'Networking': 1,

'inccorrect': 1,

'killed': 3,

']"': 1,

'LOCKS': 1,

'limited': 2,

'cookies': 57,

'method': 12,

'arbitrary': 2,

'b': 3,

'titlebar': 6,

sorted_frequency_dist =sorted(frequency_dist,key=frequency_dist.__getitem__, reverse=True)

sorted_frequency_dist

['.',

'in',

'to',

'"',

'the',

"'",

'not',

'-',

'when',

'on',

'a',

'is',

't',

'and',

'of',
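The same counting and ranking can be sketched with collections.Counter from the standard library, which NLTK's FreqDist subclasses. The token list below is made up for illustration. The most_common method returns words together with their counts, already sorted, which avoids the manual sort.

```python
from collections import Counter

# Made-up tokens standing in for the corpus words
tokens = "the cookies and the warnings and the installation".split()

counts = Counter(tokens)

# Top two words with their counts, highest first
print(counts.most_common(2))
```

Since FreqDist inherits from Counter, calling most_common on a frequency distribution should work the same way.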

Step 9-5 Consider words with length greater than 3 and plot

Let’s take only the words whose length is greater than 3.

large_words = dict([(k,v) for k,v in frequency_dist.items() if len(k)>3])

frequency_dist = nltk.FreqDist(large_words)

frequency_dist.plot(50,cumulative=False)

#output

[Figure: frequency plot of the 50 most frequent words longer than three characters]

Step 9-6 Build Wordcloud

A word cloud is a pictorial representation of the most frequent words, in which the size of each word reflects how often it occurs.

#install library

!pip install wordcloud

#build wordcloud

from wordcloud import WordCloud

wcloud = WordCloud().generate_from_frequencies(frequency_dist)

#plotting the wordcloud

import matplotlib.pyplot as plt

plt.imshow(wcloud, interpolation="bilinear")

plt.axis("off")

#plt.axis returns the axis limits, here (-0.5, 399.5, 199.5, -0.5)

plt.show()

#output

[Figure: word cloud built from the word frequencies]

Readers, give this a try: Remove the stop words and then build the word cloud. The output would look something like that below.

[Figure: word cloud after removing stop words]
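One possible way to approach this exercise is to filter the frequency counts before they are passed to the word cloud. The sketch below is a starting point only: the stop word set STOP, the sample counts, and the function filter_frequencies are assumptions, and in practice the NLTK stop word list from Recipe 2-3 could be used instead.

```python
from collections import Counter

# Hypothetical small stop word set, for illustration only
STOP = {"this", "that", "when", "with", "from"}

def filter_frequencies(freqs, stop_words=STOP, min_len=4):
    # Keep alphabetic words that are long enough and are not stop words
    return Counter({
        w: c for w, c in freqs.items()
        if w.isalpha() and len(w) >= min_len and w.lower() not in stop_words
    })

# Made-up counts standing in for the corpus frequencies
raw = Counter({"when": 30, "cookies": 57, ".": 900, "in": 400, "installation": 44})
filtered = filter_frequencies(raw)
print(filtered)
```

The filtered counts are a plain mapping of words to frequencies, which is the kind of input generate_from_frequencies is given in the step above.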

Recipe 2-10. Building a Text Preprocessing Pipeline

So far, we have completed most of the text manipulation and processing techniques and methods. In this recipe, let’s do something interesting.

Problem

You want to build an end-to-end text preprocessing pipeline. Whenever you want to do preprocessing for any NLP application, you can directly plug in data to this pipeline function and get the required clean text data as the output.

Solution

The simplest way to do this is by creating a custom function with all the techniques learned so far.

How It Works

This works by putting all the possible processing techniques into a wrapper function and passing the data through it.

Step 10-1 Read/create the text data

Let’s create a string and assign it to a variable, here a sample tweet:

tweet_sample= "How to take control of your #debt https://personal.vanguard.com/us/insights/saving-investing/debt-management.#Best advice for #family #financial #success (@PrepareToWin)"

You can also use your Twitter data extracted in Chapter 1.

Step 10-2 Process the text

Execute the below function to process the tweet:

def processRow(row):

    import re

    from textblob import Word

    tweet = row

    #Lower case

    tweet = tweet.lower()

    #Removes escaped unicode sequences like "\u002c" and non-ASCII characters like "\x96"

    tweet = re.sub(r'(\\u[0-9A-Fa-f]+)', r'', tweet)

    tweet = re.sub(r'[^\x00-\x7f]', r'', tweet)

    #convert any url to URL

    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)

    #Convert any @Username to "AT_USER"

    tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet)

    #Remove additional white spaces

    tweet = re.sub(r'[\s]+', ' ', tweet)

    tweet = re.sub(r'[ ]+', ' ', tweet)

    #Replace non-alphanumeric symbols with white spaces

    tweet = re.sub(r'[^\w]', ' ', tweet)

    #Replace #word with word

    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)

    #Remove :( or :)

    tweet = tweet.replace(':)', '')

    tweet = tweet.replace(':(', '')

    #remove numbers

    tweet = ''.join([i for i in tweet if not i.isdigit()])

    #remove multiple exclamation

    tweet = re.sub(r"(\!)\1+", ' ', tweet)

    #remove multiple question marks

    tweet = re.sub(r"(\?)\1+", ' ', tweet)

    #remove multistop

    tweet = re.sub(r"(\.)\1+", ' ', tweet)

    #lemma

    tweet = " ".join([Word(word).lemmatize() for word in tweet.split()])

    #stemmer (optional alternative to lemmatization)

    #from nltk.stem import PorterStemmer

    #st = PorterStemmer()

    #tweet=" ".join([st.stem(word) for word in tweet.split()])

    #Removes emoticons from text

    tweet = re.sub(r":\)|;\)|:-\)|\(-:|:-D|=D|:P|xD|X-p|\^\^|:-\*|\^\.\^|\^-\^|\^_\^|,-\)|\)-:|:'\(|:\(|:-\(|:S|T\.T|\._\.|:<|:-S|:-<|\*-\*|:O|=O|=-O|O\.o|XO|O_O|:-@|=/|:/|X-\(|>\.<|>=\(|D:", '', tweet)

    #trim

    tweet = tweet.strip('\'"')

    return tweet

#call the function with your data

processRow(tweet_sample)

#output

'how to take control of your debt URL advice for family financial success AT_USER'

Note that in the sample tweet, “#Best” directly follows the URL without a space, so it is consumed by the URL replacement.
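A wrapper function like the one above can also be organized as a list of small steps applied in order, which makes it easier to add, remove, or reorder steps for a given application. The sketch below is one possible arrangement using only the standard library. The step functions, the PIPELINE list, and the example URL are all assumptions for illustration, and only a few of the steps from this chapter are included.

```python
import re
from functools import reduce

def lowercase(text):
    return text.lower()

def replace_urls(text):
    # Replace anything that looks like a web address with the placeholder URL
    return re.sub(r"(www\.\S+)|(https?://\S+)", "URL", text)

def strip_non_word(text):
    # Replace punctuation and other symbols with spaces
    return re.sub(r"[^\w\s]", " ", text)

def squeeze_spaces(text):
    # Collapse runs of whitespace and trim both ends
    return re.sub(r"\s+", " ", text).strip()

# The order matters: URLs are replaced before punctuation is stripped
PIPELINE = [lowercase, replace_urls, strip_non_word, squeeze_spaces]

def run_pipeline(text, steps=PIPELINE):
    # Feed the output of each step into the next one
    return reduce(lambda acc, step: step(acc), steps, text)

print(run_pipeline("Check THIS out: https://example.com/page !!"))
```

Keeping each step as its own function also makes it possible to test the steps individually before combining them.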
