Sentiment analysis

Sentiment analysis is one of the main application areas of natural language processing, and it is widely used across industries and domains. Every organization aims to focus on its customers and their needs, so understanding the voice and sentiment of the customer becomes a prime goal: knowing the pulse of the customers leads to revenue generation. Nowadays, customers voice their sentiments through Twitter, Facebook, or blogs, and it takes some work to refine that textual data and make it consumable. Let's look at how to do that in Python.

Here, reviews written by cinemagoers have been taken from IMDB. The dataset is shared on GitHub, too.

We will import the libraries, as follows:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(color_codes=True)
import os
print(os.listdir())

We will load the dataset, as follows:

data = pd.read_csv("imdb_master.csv", encoding="ISO-8859-1")

Now, let's explore the data and its dimensions:

print(data.head())
print(data.shape)

We get the following output:

We only need two variables, review and label, to build the model, so we will keep just those two columns. A new dataframe is created, as follows:

Newdata = data[["review","label"]]
Newdata.shape

Now, we need to check how many categories there are in label, as we are only interested in keeping the positive and negative ones:

g = Newdata.groupby("label")
g.count()

The output is as follows:

Now, it's clear that there are three categories; we will get rid of unsup, as follows:

sent=["neg","pos"]

Newdata = Newdata[Newdata.label.isin(sent)]
Newdata.head()

We get the following output:

Our data has now been set up. However, since we dropped a few rows, the index of the dataframe has gaps in it. We will reset it, because a non-contiguous index can cause issues in later steps:

print(len(Newdata))
Newdata = Newdata.reset_index(drop=True)
Newdata.head()

The output is as follows:
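To see why a gappy index can cause trouble, here is a small toy illustration (the dataframe below is made up for this example only): after rows are dropped, label-based lookups such as series[i] raise a KeyError for the missing labels until the index is reset.

import pandas as pd

toy = pd.DataFrame({"x": list("abcd")})
toy = toy[toy.x != "b"]           # the index is now 0, 2, 3
# toy["x"][1] would raise a KeyError at this point
toy = toy.reset_index(drop=True)  # the index is 0, 1, 2 again
print(toy["x"][1])                # prints 'c'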

We are done with that. Now, we will encode the label variable in order to make it usable by machine learning models. We will use LabelEncoder for that, as follows:

from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
Newdata["label"] = labelencoder.fit_transform(Newdata["label"])

Next, we have to work on the cleansing part of the data, in order to make it clean and standard. Every character that is not a letter or the # symbol is replaced with a space (note that recent versions of pandas require regex=True for a regular-expression replacement):

Newdata["Clean_review"]= Newdata['review'].str.replace("[^a-zA-Z#]", " ")

Newdata.head()

The output is as follows:
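To make the effect of the pattern concrete, the same substitution can be applied to a made-up snippet (the string below is purely illustrative and not from the dataset):

import re

sample = "Great movie!!! 10/10, would watch again :-)"
print(re.sub("[^a-zA-Z#]", " ", sample))
# digits and punctuation all become spaces, leaving only letters (and #)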

Here, we get rid of the words that are three characters or fewer in length, as the idea is that such short words don't have much of an impact on the meaning:

Newdata['Clean_review'] = Newdata['Clean_review'].apply(lambda x: ' '.join([w for w in x.split() if len(w)>3]))
Newdata.shape
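As a quick sanity check on an illustrative sentence (again, not from the dataset), note that this heuristic also drops short but meaningful words such as not, which is a known trade-off of the approach:

sample = "this is not a very good film"
print(' '.join([w for w in sample.split() if len(w) > 3]))
# 'this very good film' -- 'not' is dropped along with the other short words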

The tokenization of the data can now take place, as follows:

tokenized_data = Newdata['Clean_review'].apply(lambda x: x.split())
tokenized_data.shape

We are making use of stemming in order to get rid of different variations of the same word. For example, satisfying, satisfy, and satisfied all reduce to the same stem, as follows:

from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
tokenized_data = tokenized_data.apply(lambda x: [stemmer.stem(i) for i in x])
tokenized_data.head()

The output is as follows:
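To confirm the behavior on the example words mentioned previously:

for word in ["satisfying", "satisfy", "satisfied"]:
    print(word, "->", stemmer.stem(word))
# all three reduce to the same stem, 'satisfi'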

After stemming, we have to join the tokens back into strings, as we are heading towards producing a word cloud:

for i in range(len(tokenized_data)):
    tokenized_data[i] = ' '.join(tokenized_data[i])

tokenized_data.head()

We get the following output:
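As an aside, the loop shown above depends on the earlier reset_index call, because tokenized_data[i] looks each row up by its index label. An equivalent, more idiomatic alternative (run one or the other, not both) would be:

# the same join, expressed with apply instead of a loop
tokenized_data = tokenized_data.apply(lambda tokens: ' '.join(tokens))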

Here, the stemmed and rejoined text is added back to the Newdata dataframe as a new column:

Newdata["Clean_review2"]= tokenized_data
Newdata.head()

The following is the output for the preceding code:

Now, a word cloud combining all of the words together can be produced:

from wordcloud import WordCloud

all_words = ' '.join([str(text) for text in Newdata['Clean_review2']])
wordcloud = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(all_words)
plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()

The output can be seen as follows:

Now, we will make separate word clouds for the negative and positive sentiments, as follows:

  • For Negative sentiments, we will use the following:
Negative = ' '.join([text for text in Newdata['Clean_review2'][Newdata['label'] == 0]])
wordcloud1 = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(Negative)
plt.figure(figsize=(10, 7))
plt.imshow(wordcloud1, interpolation="bilinear")
plt.title("Word Cloud- Negative")
plt.axis('off')
plt.show()

The following output shows a word cloud for Negative sentiments:

  • We will use the following for Positive sentiments:
Positive = ' '.join([text for text in Newdata['Clean_review2'][Newdata['label'] == 1]])
wordcloud2 = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(Positive)
plt.figure(figsize=(10, 7))
plt.imshow(wordcloud2, interpolation="bilinear")
plt.title("Word Cloud-Positive")
plt.axis('off')
plt.show()

The following output shows a word cloud for Positive sentiments:
