By the end of this chapter, you will be able to:
This chapter covers the concepts involved in processing human language.
One of the most important goals of artificial intelligence (AI) is to understand the human language to perform tasks. Spellcheck, sentiment analysis, question answering, chat bots, and virtual assistants (such as Siri and Google Assistant) all have a natural language processing (NLP) module. The NLP module enables virtual assistants to process human language and perform actions based on it. For example, when we say, "OK Google, set an alarm for 7 A.M.", the speech is first converted to text and then this text is processed by the NLP module. After this processing, the virtual assistant calls the appropriate API of the Alarm/Clock application. Processing human language has its own set of challenges because it is ambiguous, with words meaning different things depending on the context in which they are used. This is the biggest pain point of language for AI.
Another major challenge is incomplete information. When communicating, we tend to leave out much of the information: things that are common sense or that are universally true or false. For example, the sentence "I saw a man on a hill with a telescope" can have different meanings depending on the contextual information. It could mean "I saw a man who had a telescope on a hill," but it could also mean "I saw a man on a hill through a telescope." It is very difficult for computers to keep track of this information, as most of it is contextual.
Due to the advances in deep learning, NLP today works much better than it did with traditional methods such as clustering and linear models. This is why we will use deep learning on text corpora to solve NLP problems. NLP, like any other machine learning problem, has two main parts: data processing and model creation. In the next topic, we will learn how to process textual data, and later, we will learn how to use this processed data to create machine learning models to solve our problems.
Before we start building machine learning models for our textual data, we need to process the data. First, we will learn the different ways in which we can understand what the data comprises. This helps us get a sense of what the data really is and decide on the preprocessing techniques to be used in the next step. Next, we will move on to learn the techniques that will help us preprocess the data. This step helps reduce the size of the data, thus reducing the training time, and also helps us transform the data into a form that would be easier for machine learning algorithms to extract information from. Finally, we will learn how to convert the textual data to numbers so that machine learning algorithms can actually use it to create models. We do this using word embedding, much like the entity embedding we performed in Chapter 5: Mastering Structured Data.
Before we start working on textual data, we need to learn about regular expressions (RegEx). RegEx is not really a preprocessing technique, but a sequence of characters that defines a search pattern in a string. RegEx is a powerful tool when dealing with textual data as it helps us find sequences in a collection of text. A RegEx consists of metacharacters and regular characters.
Using RegEx, we can search for complex patterns in a text. For example, we can use it to remove URLs from a text. We can use the re module in Python to remove a URL as follows:
re.sub(r"https?://\S+\s", '', "https://www.asfd.com hello world")
re.sub accepts three parameters: the first is RegEx, the second is the expression you want to substitute in place of the matched pattern, and the third is the text in which it should search for the pattern.
The output of the command is as follows:
It is difficult to remember all the RegEx conventions, so when working with RegEx, refer to a cheat sheet, such as: (http://www.pyregex.com/).
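To make the role of each metacharacter concrete, here is a minimal sketch of the URL-removal pattern with its parts annotated. The sample text and the variable names are illustrative assumptions, and the trailing whitespace is made optional here so that a URL at the very end of a string is also removed.

```python
import re

# Illustrative sample text; the URL is a placeholder.
text = "https://www.asfd.com hello world"

# https?  -> "http" followed by an optional "s"
# ://     -> the literal characters "://"
# \S+     -> one or more non-whitespace characters (the rest of the URL)
# \s?     -> an optional whitespace character after the URL
url_pattern = re.compile(r"https?://\S+\s?")

cleaned = url_pattern.sub("", text)
print(cleaned)  # hello world
```

Compiling the pattern once with re.compile is a small convenience when the same pattern is applied to many strings.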
In this exercise, we will use the re module of Python to modify and analyze a string. We will simply learn how to use RegEx in this exercise, and in the following section, we will see how we can use RegEx to preprocess our data. We will use a single review from the IMDB Movie Review dataset (https://github.com/TrainingByPackt/Data-Science-with-Python/tree/master/Chapter07), which we shall also work on later in the chapter to create sentiment analysis models. This dataset is already processed, and some words have been removed. This will be the case sometimes when dealing with prebuilt datasets, so it is important to analyze the dataset you are working on before you start working.
string = "first think another Disney movie, might good, it's kids movie. watch it, can't help enjoy it. ages love movie. first saw movie 10 8 years later still love it! Danny Glover superb could play part better. Christopher Lloyd hilarious perfect part. Tony Danza believable Mel Clark. can't help, enjoy movie! give 10/10!<br /><br />- review Jamie Robert Ward (http://www.invocus.net)"
len(string)
The output length is as follows:
import re
string = re.sub(r"https?://\S+", '', string)
string
The string with hyperlinks is removed as follows:
string = re.sub(r'<br />', ' ', string)
string
The string without the br tags is as follows:
string = re.sub(r'\d', '', string)
string
The string without digits is shown here:
string = re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', '', string)
string
The string without special characters and punctuations is shown here:
string = re.sub(r"can't", "cannot", string)
string = re.sub(r"it's", "it is", string)
string
The final string is as follows:
len(string)
The output size of the string is as follows:
We reduced the size of the review by 14%.
re.findall takes the regex pattern and the string as input and outputs all substrings that match the pattern.
This is shown in the following code:
re.findall(r"[A-Z][a-z]*", string)
The words are as follows:
re.findall(r"\b[A-Za-z]{1,2}\b", string)
The output is as follows:
Congratulations! You successfully modified and analyzed a review string using RegEx with the re module.
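As a small aside on re.findall, the following sketch runs two patterns on a short sample sentence. The sentence is made up for illustration, and the second pattern assumes that the goal is to find whole words of one or two letters, which is why it is wrapped in \b word boundaries.

```python
import re

# Made-up sample sentence for illustration.
sample = "Danny Glover superb, I give it 10 stars"

# Words that start with an uppercase letter
capitalized = re.findall(r"[A-Z][a-z]*", sample)

# Whole words of one or two letters; \b anchors both ends at word boundaries
short_words = re.findall(r"\b[A-Za-z]{1,2}\b", sample)

print(capitalized)  # ['Danny', 'Glover', 'I']
print(short_words)  # ['I', 'it']
```

Without the \b anchors, the second pattern would also match one- and two-letter fragments inside longer words.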
Basic feature extraction helps us understand what our data consists of. This helps us select the steps to take to preprocess the dataset. Basic feature extraction consists of actions such as calculation of the average number of words and count of special characters. We will make use of the IMDB movie review dataset in this section as an example:
import pandas as pd
data = pd.read_csv('movie_reviews.csv', encoding='latin-1')
Let's see what our dataset consists of:
data.iloc[0]
The output is as follows:
The SentimentText variable contains the actual review and the Sentiment variable contains the sentiment of the review. 1 represents a positive sentiment and 0 represents a negative sentiment. Let's print the first review to get a sense of the data we are dealing with:
data.SentimentText[0]
The first review is as follows:
Now, we will try to understand the kind of data we are working with by getting the key statistics of the dataset.
Number of words
We can get the number of words in each review with the following code:
data['word_count'] = data['SentimentText'].apply(lambda x: len(str(x).split(" ")))
Now, the word_count variable in the DataFrame contains the total number of words in the review. The apply function applies the split function to each row of the dataset iteratively. Now, we can get the average number of words for each class to see if positive reviews have more words than negative reviews.
The mean() function calculates the average of a column in pandas. For negative reviews, use the following code:
data.loc[data.Sentiment == 0, 'word_count'].mean()
The average number of words for a negative sentiment is as follows:
For positive reviews, use the following:
data.loc[data.Sentiment == 1, 'word_count'].mean()
The average number of words for a positive sentiment is as follows:
We can see that there isn't much difference in the average number of words for either class.
Stop words
Stop words are the most common words in a language – for example, "I", "me", "my", "yours", and "the." Most of the time, these words provide no real information about the sentence, so we remove these words from our dataset to reduce the size. The nltk library has a list of stop words for the English language that we can access.
from nltk.corpus import stopwords
stop = stopwords.words('english')
To get the count of these stop words, we can use the following code:
data['stop_count'] = data['SentimentText'].apply(lambda x: len([word for word in x.split() if word in stop]))
Then, we can see the average number of stop words for each class, by using the following code:
data.loc[data.Sentiment == 0, 'stop_count'].mean()
The average number of stop words for a negative sentiment is shown here:
Now, to get the number of stop words for a positive sentiment, we use the following code:
data.loc[data.Sentiment == 1, 'stop_count'].mean()
The output average number of stop words for a positive sentiment is shown here:
Number of special characters
Depending on the kind of problem you are dealing with, you will want to either keep special characters such as @, #, $, and *, or remove them. To be able to do that, you first must figure out how many special characters occur in your dataset. To get the count of ^, &, *, $, @, and # in your dataset, use the following code:
data['special_count'] = data['SentimentText'].apply(lambda x: len(re.sub('[^^&*$@#]+' ,'', x)))
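The counting trick used here can be easier to follow on a single string. The sketch below, which uses a made-up sentence, deletes every character that is not one of ^, &, *, $, @, or #, so the length of what remains is the number of special characters.

```python
import re

# Made-up sample text for illustration.
text = "Win $$$ now @home #deal & more!"

# The first ^ negates the character class; the escaped \^ is a literal caret.
# Every run of characters outside the set is deleted.
special_only = re.sub(r"[^\^&*$@#]+", "", text)
special_count = len(special_only)

print(special_only)   # $$$@#&
print(special_count)  # 6
```

Note that the exclamation mark is not counted, because it is not part of the set being searched for.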
Now that we know what our data comprises, we need to preprocess it so that our machine learning algorithms can easily find patterns in the text. In this section, we will go over some of the techniques used to clean and reduce the dimensionality of the data we feed into our machine learning algorithm.
Lowercase
The first preprocessing step we perform is converting all the data into lowercase. This prevents multiple copies of the same word. You can easily convert all text to lowercase using the following code:
data['SentimentText'] = data['SentimentText'].apply(lambda x: " ".join(x.lower() for x in x.split()))
The apply function applies the lower function to each row of the dataset iteratively.
Stop word removal
As discussed previously, stop words should be removed from the dataset as they add very little useful information. In most cases, stop words contribute little to the sentiment of a sentence. We perform this step to remove any bias that stop words might introduce:
data['SentimentText'] = data['SentimentText'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
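To see counting and removal side by side without loading nltk or pandas, here is a minimal sketch on one string. The short stop word set and the helper names count_stop_words and drop_stop_words are assumptions made for illustration; the real nltk list is much longer.

```python
# A tiny, hand-written stand-in for the nltk stop word list (assumption).
STOP_WORDS = {"i", "me", "my", "the", "a", "an", "is", "it", "this"}

def count_stop_words(text):
    """Return how many whitespace-separated tokens are stop words."""
    return sum(1 for word in text.split() if word in STOP_WORDS)

def drop_stop_words(text):
    """Return the text with all stop words removed."""
    return " ".join(word for word in text.split() if word not in STOP_WORDS)

review = "this is the best movie i have seen"
print(count_stop_words(review))  # 4
print(drop_stop_words(review))   # best movie have seen
```

Storing the stop words in a set rather than a list keeps each membership test fast, which matters when the same check runs over thousands of reviews.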
Frequent word removal
Stop words are general words such as 'a', 'an,' and 'the.' In this step, however, you will remove the most frequent words that are specific to the dataset you are working with. For example, the words that can be removed from a tweet dataset are RT, @username, and DM. First, find the most frequent words:
word_freq = pd.Series(' '.join(data['SentimentText']).split()).value_counts()
word_freq.head()
The most frequent words are:
From the output, we get a hint: the text contains HTML tags, which can be removed to considerably reduce the dataset size. So, let's first remove all <br /> HTML tags and then remove words such as 'movie' and 'film,' which will not have much impact on the sentiment detector:
data['SentimentText'] = data['SentimentText'].str.replace(r'<br />','')
data['SentimentText'] = data['SentimentText'].apply(lambda x: " ".join(x for x in x.split() if x not in ['movie', 'film']))
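The same idea can be sketched on a few strings with the standard library. The three short reviews and the variable names below are made up for illustration; Counter plays the role that value_counts plays in pandas.

```python
from collections import Counter

# Made-up mini corpus for illustration.
reviews = [
    "great movie great cast",
    "boring movie",
    "the film was a slow movie",
]

# Count every word across the whole corpus
word_freq = Counter(" ".join(reviews).split())
top_two = [word for word, _ in word_freq.most_common(2)]

# Remove the dataset-specific frequent words
domain_words = {"movie", "film"}
cleaned = [
    " ".join(word for word in review.split() if word not in domain_words)
    for review in reviews
]

print(top_two)  # ['movie', 'great']
print(cleaned)  # ['great great cast', 'boring', 'the was a slow']
```

Notice that 'great' is also frequent but is deliberately kept, because it carries sentiment. The frequency list is only a hint; which words to drop remains a judgment call.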
Punctuation and special character removal
Next, we remove all the punctuations and special characters from the text as they add very little information to the text. To remove punctuations and special characters, use this regex:
punc_special = r"[^A-Za-z0-9\s]+"
data['SentimentText'] = data['SentimentText'].str.replace(punc_special, '', regex=True)
The regex matches every run of characters that is not alphanumeric or whitespace, and each match is replaced with an empty string.
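Here is a minimal sketch of the same pattern applied to a single made-up string. It also shows one side effect worth knowing about: deleting punctuation outright can glue neighboring tokens together, so replacing with a space and then collapsing whitespace is sometimes preferable.

```python
import re

punc_special = r"[^A-Za-z0-9\s]+"

# Made-up sample text for illustration.
text = "Great movie!!! 10/10, would watch again :)"

# Option 1: delete the matched characters ("10/10" becomes "1010")
deleted = re.sub(punc_special, "", text)

# Option 2: replace with a space, then collapse repeated whitespace
spaced = " ".join(re.sub(punc_special, " ", text).split())

print(repr(deleted))  # 'Great movie 1010 would watch again '
print(repr(spaced))   # 'Great movie 10 10 would watch again'
```

Which option is better depends on the dataset, so it is worth checking a few cleaned samples before committing to one.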
Spellcheck
Sometimes, incorrect spellings of the same word cause us to have multiple copies of that word. This can be corrected by performing a spellcheck using the autocorrect library:
from autocorrect import spell
data['SentimentText'] = [' '.join([spell(i) for i in x.split()]) for x in data['SentimentText']]
Stemming
Stemming refers to the practice of removing the suffixes such as 'ily', 'iest,' and 'ing.' This helps our model because variations of the same root word have the same meaning. For example, 'happy,' 'happily,' and 'happiest' all mean the same thing, so they can be replaced with 'happy.' This is helpful in cases of sentiment analysis, where the degree of happiness is not required as it will reduce the number of dimensions required to represent the data. To perform stemming, we can use the nltk library:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
data['SentimentText'] = data['SentimentText'].apply(lambda x: " ".join([stemmer.stem(word) for word in x.split()]))
Spellcheck, stemming, and lemmatization can take a lot of time to complete depending on the size of the dataset, so make sure that you do need to perform this step by looking into the dataset.
Lemmatization
You should prefer lemmatization to stemming as it is more effective.
Lemmatization is similar to stemming, but instead of stripping suffixes, it substitutes each word with its dictionary base form (its lemma) to reduce the dimensionality of the dataset. To perform lemmatization, you can use the nltk library:
import nltk
lemmatizer = nltk.stem.WordNetLemmatizer()
data['SentimentText'][:5].apply(lambda x: " ".join([lemmatizer.lemmatize(word) for word in x.split()]))
We are working to reduce the dimensionality of the dataset because of the "curse of dimensionality." Datasets become sparse as their dimensionality (the number of features, or independent variables) increases. This can cause data science techniques to fail, because it is difficult to model a large number of features to get the correct output. As the number of features of the dataset increases, we need more data points to model them. So, to get around the curse of dimensionality, we would need to obtain a lot more data, which in turn would increase the time required to process it.
Tokenization
Tokenization is the process of dividing either sentences into sequences of words or paragraphs into sequences of sentences. We need to do this to eventually convert the data into one-hot vectors of words. Tokenization can be performed using the nltk library:
import nltk
nltk.word_tokenize("Hello Dr. Ajay. It's nice to meet you.")
The tokenized list is as follows:
As you can see, it separates punctuations from words and detects complex words such as "Dr."
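For contrast, the following sketch shows a naive regex tokenizer on the same sentence. It is not how nltk works internally; the function name simple_tokenize is made up, and the point is only to show what goes wrong without special handling of abbreviations and contractions.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Hello Dr. Ajay. It's nice to meet you.")
print(tokens)
# ['Hello', 'Dr', '.', 'Ajay', '.', 'It', "'", 's', 'nice', 'to', 'meet', 'you', '.']
```

The naive version splits "Dr." into two tokens and breaks "It's" into three pieces, which is why a purpose-built tokenizer is usually the better choice.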
In this exercise, we will preprocess the IMDB Movie Review dataset to make it ready for any machine learning algorithm. The dataset consists of 25,000 movie reviews along with the sentiment (positive or negative) of the review. We want to predict sentiments using the review, so we need to keep that in mind while performing preprocessing.
import pandas as pd
data = pd.read_csv('../../chapter 7/data/movie_reviews.csv', encoding='latin-1')
data.SentimentText = data.SentimentText.str.lower()
import re
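def clean_str(string):
    # Remove hyperlinks
    string = re.sub(r"https?://\S+", '', string)
    # Remove the opening of anchor tags
    string = re.sub(r'<a href', ' ', string)
    # Replace ampersands with the word 'and'
    string = re.sub(r'&', 'and', string)
    # Remove line-break tags
    string = re.sub(r'<br />', ' ', string)
    # Replace punctuation and special characters with spaces
    string = re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', ' ', string)
    # Remove digits
    string = re.sub(r'\d', '', string)
    string = re.sub(r"can't", "cannot", string)
    string = re.sub(r"it's", "it is", string)
    return string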
This function first removes any hyperlink from the text and then removes the HTML tags (<a> or <br>). Next, it substitutes all & with 'and,' followed by removing all special characters, punctuations, and numbers. Finally, it substitutes 'can't' with 'cannot' and 'it's' with 'it is'.
data.SentimentText = data.SentimentText.apply(lambda x: clean_str(str(x)))
Use the apply function of pandas to perform review cleaning on the complete dataset.
pd.Series(' '.join(data['SentimentText']).split()).value_counts().head(10)
The occurrence of the top 10 words is as follows:
This will be done by first tokenizing the reviews and then removing the stop words loaded from the nltk library.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize,sent_tokenize
stop_words = stopwords.words('english') + ['movie', 'film', 'time']
stop_words = set(stop_words)
remove_stop_words = lambda r: [[word for word in word_tokenize(sente) if word not in stop_words] for sente in sent_tokenize(r)]
data['SentimentText'] = data['SentimentText'].apply(remove_stop_words)
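import numpy as np
def combine_text(text):
    # Join the tokens of the first sentence; return NaN for empty reviews
    try:
        return ' '.join(text[0])
    except (IndexError, TypeError):
        return np.nan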
data.SentimentText = data.SentimentText.apply(lambda x: combine_text(x))
data = data.dropna(how='any')
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=250)
tokenizer.fit_on_texts(list(data['SentimentText']))
sequences = tokenizer.texts_to_sequences(data['SentimentText'])
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
The number of unique tokens is as follows:
Reviews with fewer than 200 tokens are padded with 0s, and longer reviews are truncated to 200 tokens. You can increase or decrease this length depending on the accuracy and training time of the model.
from keras.preprocessing.sequence import pad_sequences
reviews = pad_sequences(sequences, maxlen=200)
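To see what the tokenizer and the padding step do to the text, here is a dependency-free sketch that imitates their default behavior on three tiny made-up reviews. The helper names build_word_index, texts_to_sequences, and pad_pre are hypothetical, and this is an approximation of the Keras behavior (frequency-ranked indices starting at 1, only indices below num_words kept, padding and truncation applied at the front), not its actual implementation.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Replace words with indices, keeping only indices below num_words."""
    return [[word_index[w] for w in text.split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]

def pad_pre(sequence, maxlen):
    """Keep the last maxlen items and left-pad with zeros."""
    trimmed = sequence[-maxlen:]
    return [0] * (maxlen - len(trimmed)) + trimmed

texts = ["good fun good", "bad plot", "good plot twist"]
word_index = build_word_index(texts)
sequences = texts_to_sequences(texts, word_index, num_words=5)
padded = [pad_pre(seq, maxlen=4) for seq in sequences]

print(word_index)  # {'good': 1, 'plot': 2, 'fun': 3, 'bad': 4, 'twist': 5}
print(padded)      # [[0, 1, 3, 1], [0, 0, 4, 2], [0, 0, 1, 2]]
```

The word 'twist' has index 5, so with num_words=5 it is dropped from the third review. Index 0 is never assigned to a word, which is what makes it safe to use as the padding value.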
import pickle
with open('tokenizer.pkl', 'wb') as handle:
    pickle.dump(tokenizer,
                handle,
                protocol=pickle.HIGHEST_PROTOCOL)
To preview a cleaned review, run the following command:
data.SentimentText[124]
A cleaned review looks like this:
To get the actual input to the next step of the process, run the following command:
reviews[124]
The input to the next step for the reviews command will look something like this:
Congratulations! You have successfully preprocessed your first text dataset. The review data is now a matrix of 25,000 rows, or reviews, and 200 columns, or words. Next, we will learn how we can convert this data into embedding to make it easier to predict the sentiment.
Now that we have cleaned our dataset, we will convert it into a form that machine learning models can work with. Recall Chapter 5, Mastering Structured Data, where we discussed how neural networks cannot process words, so we need to represent words as numbers to be able to process them. Therefore, to be able to perform tasks such as sentiment analysis, we need to convert text into numbers.
So, the very first method we discussed was one-hot encoding, which performs poorly in the case of words, because words have certain relationships between them and one-hot encoding makes it so that the words are computed as if they were independent of each other. For example, let us assume we have three words: 'car,' 'truck,' and 'ship.' Now, 'car' is closer to 'truck' in terms of similarity, but it still has some similarity to 'ship.' One-hot encoding fails to capture that relationship.
Word embeddings too are vector representations of words, but they capture the relationship of each word with another word. The different ways of getting word embedding are explained in the following section.
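The difference can be illustrated numerically. In the sketch below, the one-hot vectors are exact, while the two-dimensional dense vectors are invented values chosen only to illustrate the idea; they do not come from any trained model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

one_hot = {"car": [1, 0, 0], "truck": [0, 1, 0], "ship": [0, 0, 1]}

# Invented dense vectors, for illustration only.
dense = {"car": [0.9, 0.1], "truck": [0.8, 0.3], "ship": [0.2, 0.9]}

# Every pair of one-hot vectors is equally unrelated
print(cosine(one_hot["car"], one_hot["truck"]))  # 0.0
print(cosine(one_hot["car"], one_hot["ship"]))   # 0.0

# Dense vectors can place 'car' closer to 'truck' than to 'ship'
print(cosine(dense["car"], dense["truck"]) > cosine(dense["car"], dense["ship"]))  # True
```

With one-hot vectors, every pair of distinct words has a similarity of exactly 0, so no notion of "closer" or "farther" can be expressed.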
Count Embedding
Count embedding is a simple vector representation of a word based on the number of times it appears in a piece of text. Assume a dataset where you have N unique words and M different records. To get the count embedding, you create an N x M matrix, where each row is a word and each column is a record. The value at any (n, m) location in the matrix is the number of times word n occurs in record m.
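A minimal sketch of such a matrix, built with plain Python lists on three made-up records, might look like this. Sorting the vocabulary alphabetically is an arbitrary choice made here to get a stable row order.

```python
# Made-up records for illustration.
records = ["good movie good plot", "bad movie", "good cast"]

# One row per unique word (N rows), one column per record (M columns)
vocab = sorted({word for record in records for word in record.split()})
count_matrix = [
    [record.split().count(word) for record in records]
    for word in vocab
]

for word, row in zip(vocab, count_matrix):
    print(word, row)
# bad [0, 1, 0]
# cast [0, 0, 1]
# good [2, 0, 1]
# movie [1, 1, 0]
# plot [1, 0, 0]
```

Each row is the count embedding of one word. Most entries are 0 even in this tiny example, which hints at how sparse the matrix becomes for a real corpus.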
TF-IDF Embedding
TF-IDF is a way to obtain the importance of each word in a collection of words or document. It stands for term frequency-inverse document frequency. In TF-IDF, the importance of a word increases proportionally to the frequency of the word, but this importance is offset by the number of documents that have that word, thus helping to adjust for certain words that are used more frequently. In other words, the importance of a word is calculated using the frequency of the word in one data point of the training set. This importance is increased or decreased depending on the occurrence of the word in other data points of the training set.
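Weights generated by TF-IDF consist of two terms. In the commonly used formulation, they are:
TF(w) = (number of times w appears in a document) / (total number of words in that document)
IDF(w) = log(total number of documents / number of documents that contain w)
where w is the word.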
The weight is the product of these two terms. In case of TF-IDF, we replace the count of the word with this weight in the N x M matrix that we used in the count embedding section.
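The sketch below computes these weights for three tiny made-up documents. It assumes one common formulation (term frequency as a fraction of the document length, and inverse document frequency as the natural log of the ratio of total documents to documents containing the word); libraries often apply smoothing or normalization on top of this, so their numbers will differ.

```python
import math

# Made-up tokenized documents for illustration.
docs = [
    ["the", "good", "movie"],
    ["the", "bad", "movie"],
    ["the", "good", "good", "plot"],
]

def term_frequency(word, doc):
    """Fraction of the document's words that are this word."""
    return doc.count(word) / len(doc)

def inverse_document_frequency(word, docs):
    """Log of (total documents / documents containing the word)."""
    containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / containing)

def tf_idf(word, doc, docs):
    return term_frequency(word, doc) * inverse_document_frequency(word, docs)

print(tf_idf("the", docs[0], docs))   # 0.0, because "the" is in every document
print(tf_idf("plot", docs[2], docs))  # highest weight: rare across documents
print(tf_idf("good", docs[2], docs))
```

A word that appears in every document gets a weight of 0, which is exactly the offsetting effect described above for very common words.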
Continuous bag-of-words embedding
Continuous bag-of-words (CBOW) works by using neural networks. It predicts a word when the input is its surrounding words. The input to the neural network is the one-hot vector of the surrounding words. The number of input words is selected using the window parameter. The network has only one hidden layer, and the output layer of the network is activated using the softmax activation function to get the probability. The activation function between the layers is linear, but the weights are updated in the same way as in normal neural networks.
The embedding matrix of the corpus is the weight between the hidden layer and the output layer. Thus, this embedding matrix will be of the N x H dimension, where N is the number of unique words in the corpus and H is the number of hidden layer nodes. CBOW works better than the two methods discussed previously due to its probabilistic nature and low memory requirements.
Skip-gram embedding
Using a neural network, skip-gram predicts the surrounding words given an input word. The input here is the one-hot vector of the word and the output is the probability of the surrounding words. The number of output words is decided by the window parameter. Much like CBOW, this method uses a neural network with a single hidden layer, and the activations are all linear except for the output layer, where we use the softmax function. One big difference, though, is how the error gets calculated: a separate error is calculated for each word being predicted, and then all of them are added together to get the final error. The error for each individual word is calculated by subtracting the target one-hot vector from the output probability vector.
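The error calculation can be sketched numerically as follows. The four-word vocabulary, the output-layer scores, and the choice of context words are all invented for illustration; a real model would produce the scores from its weights.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["the", "dog", "chased", "cat"]
scores = [1.0, 0.5, 2.0, 0.1]          # invented output-layer scores
probs = softmax(scores)

def one_hot(word):
    return [1.0 if w == word else 0.0 for w in vocab]

# Two surrounding words to predict for the current input word
context_words = ["the", "chased"]

# One error vector per context word: output probabilities minus the target
errors = [[p - t for p, t in zip(probs, one_hot(word))]
          for word in context_words]

# The final error is the element-wise sum over all context words
total_error = [sum(column) for column in zip(*errors)]
print(total_error)
```

Because the probabilities and each one-hot target both sum to 1, every individual error vector sums to approximately 0.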
The embedding matrix here is the weight matrix between the input layer and the hidden layer. Thus, this embedding matrix will be of the H x N dimension, where N is the number of unique words in the corpus and H is the number of hidden layer nodes. Skip-gram works much better than CBOW for less frequent words, but is generally slower.
Use CBOW for datasets with fewer words but a high number of samples. Use skip-gram when working with a dataset with more words and a low number of samples.
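The two methods differ mainly in how training examples are formed from a sentence. The following sketch generates both kinds of examples for a made-up sentence with a window of 1; the function names skipgram_pairs and cbow_pairs are hypothetical, and real implementations add refinements such as subsampling that are not shown here.

```python
def skipgram_pairs(tokens, window):
    """Yield (input word, surrounding word) pairs."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_pairs(tokens, window):
    """Yield (surrounding words, word to predict) pairs."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        pairs.append((context, target))
    return pairs

tokens = ["the", "dog", "chased", "the", "cat"]
print(skipgram_pairs(tokens, window=1)[:3])
# [('the', 'dog'), ('dog', 'the'), ('dog', 'chased')]
print(cbow_pairs(tokens, window=1)[:2])
# [(['dog'], 'the'), (['the', 'chased'], 'dog')]
```

Skip-gram produces one example per (word, neighbor) pair, while CBOW produces one example per word, which is one intuition for why skip-gram tends to be slower to train.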
Word2Vec
Word2Vec is a family of models, comprising CBOW and skip-gram, that is used to produce word embeddings. Word2Vec helps us obtain the word embedding of a corpus easily. To implement the model and obtain word embedding, we will make use of the gensim library:
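import gensim
# Parameter names follow gensim 3.x; gensim 4.x renamed iter to epochs and size to vector_size
model = gensim.models.Word2Vec(
    tokens,
    iter=5,
    size=100,
    window=5,
    min_count=5,
    workers=10,
    sg=0)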
To train the model, we need to pass the tokenized sentences as arguments to the Word2Vec class of gensim. iter is the number of epochs to train for, and size refers to the number of nodes in the hidden layer and decides the size of the embedding layer. window refers to the number of surrounding words that are considered when training the neural network. min_count refers to the minimum frequency required for a word to be considered. workers is the number of threads to use while training and sg refers to the training algorithm to be used, 0 for CBOW and 1 for skip-gram.
To get the number of unique words in the trained embedding, you can use the following:
vocab = list(model.wv.vocab)
len(vocab)
Before we use these embeddings, we need to make sure that they are correct. To do that, we find the similar words:
model.wv.most_similar('fun')
The output is as follows:
To save your embeddings to a file, use the following code:
model.wv.save_word2vec_format('movie_embedding.txt', binary=False)
To load a pretrained embedding, you can use this function:
import numpy as np
def load_embedding(filename, word_index, num_words, embedding_dim):
    # Map each word in the file to its embedding vector
    embeddings_index = {}
    with open(filename, encoding="utf-8") as file:
        for line in file:
            values = line.split()
            word = values[0]
            coef = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coef
    # Row i holds the vector of the word whose index is i
    embedding_matrix = np.zeros((num_words, embedding_dim))
    for word, pos in word_index.items():
        if pos >= num_words:
            continue
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[pos] = embedding_vector
    return embedding_matrix
The function first reads the embedding file filename and gets all the embedding vectors present in the file. Then, it creates an embedding matrix that stacks the embedding vectors together. The num_words parameter limits the size of the vocabulary and can be helpful in cases where the training time of the NLP algorithm is too high. word_index is a dictionary with the key as unique words of the corpus and the value as the index of the word. embedding_dim is the size of the embedding vectors as specified while training.
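To make the file layout and the resulting matrix concrete, here is a dependency-free sketch of the same logic on a tiny in-memory file. The three embedding lines and the word_index values are invented; the first line mimics the header (vocabulary size and dimension) that the word2vec text format begins with. Plain lists are used instead of NumPy arrays purely to keep the sketch self-contained.

```python
import io

# Invented file contents in word2vec text format: a header, then one word per line.
sample_file = io.StringIO(
    "3 2\n"
    "good 0.1 0.2\n"
    "bad -0.3 0.4\n"
    "plot 0.5 0.6\n"
)

header = sample_file.readline().split()
vocab_size, dim = int(header[0]), int(header[1])

# Map each word to its vector
embeddings_index = {}
for line in sample_file:
    values = line.split()
    embeddings_index[values[0]] = [float(v) for v in values[1:]]

# Invented tokenizer index: 'twist' has no embedding, 'bad' is beyond num_words
word_index = {"good": 1, "plot": 2, "twist": 3, "bad": 4}
num_words = 4

embedding_matrix = [[0.0] * dim for _ in range(num_words)]
for word, pos in word_index.items():
    if pos >= num_words:
        continue
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[pos] = vector

print(embedding_matrix)
# [[0.0, 0.0], [0.1, 0.2], [0.5, 0.6], [0.0, 0.0]]
```

Row 0 stays at zero because index 0 is reserved for padding, and any word without a pretrained vector also keeps a row of zeros.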
There are a lot of really good pretrained embeddings available. Some of the popular ones are GloVe: https://nlp.stanford.edu/projects/glove/ and fastText: https://fasttext.cc/docs/en/english-vectors.html
In this exercise, we will create our own Word2Vec embedding using the gensim library. The word embedding will be created for the IMDB movie review dataset that we have been working with. We will take off from where we left in Exercise 55.
data['SentimentText'] [0]
The tokens of the first review are as follows:
data['SentimentText'] = data['SentimentText'].apply(lambda x: x[0])
from gensim.models import Word2Vec
model = Word2Vec(
    data['SentimentText'],
    iter=50,
    size=100,
    window=5,
    min_count=5,
    workers=10)
model.wv.most_similar('insight')
The most similar words to 'insight' in the dataset are:
model.wv.similarity(w1='violent', w2='brutal')
The similarity score is the cosine similarity between the two word vectors, so it can range from -1 to 1. A score close to 1 means that the two words are used in very similar contexts, and a score close to 0 means that they are not related in any meaningful way.
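The similarity that gensim reports is a cosine similarity, which can be sketched in a few lines. The vectors below are invented to show the three reference points of the measure; they are not taken from the trained model.

```python
import math

def cosine_similarity(a, b):
    """Dot product of the vectors divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])    # close to 1
unrelated = cosine_similarity([1.0, 0.0], [0.0, 1.0])         # 0
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])        # close to -1

print(same_direction, unrelated, opposite)
```

Only the direction of the vectors matters, not their length, which is why a vector and a scaled copy of it have a similarity of 1.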
First, convert the embedding into two dimensions using PCA. We will plot only the first 200 words. (You can plot more if you like.)
from sklearn.decomposition import PCA
word_limit = 200
X = model[model.wv.vocab][: word_limit]
pca = PCA(n_components=2)
result = pca.fit_transform(X)
import matplotlib.pyplot as plt
plt.scatter(result[:, 0], result[:, 1])
words = list(model.wv.vocab)[: word_limit]
for i, word in enumerate(words):
    plt.annotate(word, xy=(result[i, 0], result[i, 1]))
plt.show()
Your output should look like the following:
The axes do not mean anything in the representation of word embedding. The representation simply shows the closeness of different words.
model.wv.save_word2vec_format('movie_embedding.txt', binary=False)
Congratulations! You just created your first word embedding. You can play around with the embedding and view the similarity between different words.
In this activity, we will attempt to predict sentiments of movie reviews. The dataset (https://github.com/TrainingByPackt/Data-Science-with-Python/tree/master/Chapter07) comprises 25,000 movie reviews sourced from IMDB with their sentiment (positive or negative). Let's look at the following scenario: You work at a DVD rental company, which has to predict how many DVDs of a certain movie to produce depending on how it is being perceived by the reviewers. To do this, you create a machine learning model that can analyze reviews to figure out how the movie is being perceived.
The solution for this activity can be found on page 378.
The output is a little cryptic because stop words and punctuations have been removed, but you can still understand the general sense of the review.
Congratulations! You just created your first NLP module. You should find that the model gives an accuracy of around 76%, which is quite low. This is because it is predicting sentiments based on individual words; it has no way of figuring out the context of the review. For example, it will predict "not good" as a positive sentiment as it sees the word 'good.' If it could look at multiple words, it would understand that this is a negative sentiment. In the next section, we will learn how to create neural networks that can retain information about the past.
Until now, none of the problems we discussed had a temporal dependence, that is, a prediction that depends not only on the current input but also on the past inputs. For example, in the case of the dog vs. cat classifier, we only needed the picture of the dog to classify it as a dog. No other information or images were required. If, instead, you want to make a classifier that predicts whether a dog is walking or standing, you will require multiple images in a sequence or a video to figure out what the dog is doing. Recurrent neural networks (RNNs) are like the fully connected networks that we talked about. The only change is that an RNN has memory that stores information about the previous inputs as states. The outputs of the hidden layers are fed in as inputs for the next input.
From the image, you can understand how the outputs of the hidden layers are used as inputs for the next input. This acts as a memory element in the neural network. Another thing to keep in mind is that the output of a normal neural network is a function of the input and weights of the network.
This allows us to input data points in any order and still get the right output. However, this is not the case with RNNs. In the case of RNNs, our output depends on the previous inputs, so we need to feed in the input in the correct sequence.
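The order dependence can be demonstrated with a deliberately tiny recurrent cell. The sketch below uses single numbers instead of weight matrices and assumes a tanh activation; the weight values are invented, and the names U, W, and V stand for the input, memory, and output weights respectively.

```python
import math

# Invented scalar weights: U for the input, W for the memory, V for the output.
U, W, V = 0.5, 0.8, 1.0

def run_rnn(inputs, state=0.0):
    """Feed inputs one at a time, carrying the state from step to step."""
    outputs = []
    for x in inputs:
        # The new state depends on the current input AND the previous state
        state = math.tanh(U * x + W * state)
        outputs.append(V * state)
    return outputs

sequence = [1.0, 0.5, -1.0]
forward = run_rnn(sequence)
backward = run_rnn(list(reversed(sequence)))

# The same three inputs in a different order give a different final output
print(forward[-1], backward[-1])
```

For sentiment analysis, only the last element of the output list would be used, since the whole review leads to a single prediction.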
In the preceding image, you can see a single RNN layer on the left in the "folded" model. U is the input weight, V is the output weight, and W is the weight associated with the memory input. The memory of the RNN is also referred to as state. The "unfolded" model on the right shows how the RNN works for the input sequence [xt-1, xt, xt+1]. The model differs based on the kind of application. For example, in case of sentiment analysis, the input sequence will require only one output in the end. The unfolded model for this problem is shown here:
A long short-term memory (LSTM) cell is a special kind of RNN cell that is capable of retaining information over long periods of time. Hochreiter and Schmidhuber introduced LSTMs in 1997. RNNs suffer from the vanishing gradient problem, so they lose information detected over long periods of time. For example, if we are performing sentiment analysis on a text and the first sentence says "I am happy today" and then the rest of the text is devoid of any sentiments, the RNN will not do a good job of detecting that the sentiment of the text is happy. LSTM cells overcome this issue by storing certain inputs for a longer time without forgetting them. Most real-world recurrent machine learning implementations are done using LSTMs. The main difference between RNN cells and LSTM cells is the memory states. Every RNN layer takes a memory state as input and outputs a memory state, whereas every LSTM layer takes a long-term and a short-term memory as input and outputs both the long-term and the short-term memories. The long-term memory allows the network to retain information for a longer time.
LSTM cells are implemented in Keras, and you easily can add an LSTM layer into your model:
model.add(keras.layers.LSTM(units, activation='tanh', dropout=0.0, recurrent_dropout=0.0, return_sequences=False))
Here, units is the number of nodes in the layer, activation is the activation function to use for the layer. recurrent_dropout and dropout are the dropout probability for the recurrent state and input respectively. return_sequences specifies if the output should contain the sequence or not; this is made True when you plan to use another recurrent layer after the current layer.
LSTMs almost always work better than simple RNNs.
In this exercise, we will modify the model we created for the previous activity, to make it use an LSTM cell. We will use the same IMDB movie review dataset that we have been working with. Most of the preprocessing steps are like those in Activity 19.
import pandas as pd
data = pd.read_csv('../../chapter 7/data/movie_reviews.csv', encoding='latin-1')
data.SentimentText = data.SentimentText.str.lower()
import re

def clean_str(string):
    # Remove URLs and leftover HTML fragments.
    string = re.sub(r"https?://\S+", '', string)
    string = re.sub(r'<a href', ' ', string)
    string = re.sub(r'&amp;', '', string)
    string = re.sub(r'<br />', ' ', string)
    # Replace punctuation with spaces and drop digits.
    string = re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', ' ', string)
    string = re.sub(r'\d', '', string)
    # Expand common contractions.
    string = re.sub(r"can't", "cannot", string)
    string = re.sub(r"it's", "it is", string)
    return string

data.SentimentText = data.SentimentText.apply(lambda x: clean_str(str(x)))
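To see what the main substitutions do, the following sketch applies them one at a time to a made-up review fragment. The sample text and the intermediate variable names are hypothetical; the patterns assume that \S (any non-whitespace character) and \d (any digit) are the intended character classes.

```python
import re

text = "loved it!<br />see https://example.com/review?id=42 for more, 10/10"

# \S+ consumes the rest of the URL up to the next whitespace character.
no_urls = re.sub(r"https?://\S+", "", text)
# HTML line breaks become spaces so that neighboring words do not merge.
no_html = re.sub(r"<br />", " ", no_urls)
# Every character listed in the class is replaced by a space.
no_punct = re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', " ", no_html)
# \d removes each digit.
no_digits = re.sub(r"\d", "", no_punct)

print(no_digits.split())   # ['loved', 'it', 'see', 'for', 'more']
```

The order matters: URLs have to be removed before the punctuation step, because that step would otherwise break a URL into separate word-like pieces.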
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize,sent_tokenize
stop_words = stopwords.words('english') + ['movie', 'film', 'time']
stop_words = set(stop_words)
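def remove_stop_words(review):
    # Tokenize every sentence and drop stop words; returns one token list per sentence.
    return [[word for word in word_tokenize(sentence) if word not in stop_words]
            for sentence in sent_tokenize(review)]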
data['SentimentText'] = data['SentimentText'].apply(remove_stop_words)
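The shape of the result is easier to see on a small example. The sketch below is dependency-free: a naive split on periods and whitespace stands in for NLTK's sent_tokenize and word_tokenize, and the stop word set is a tiny hypothetical subset, so the output only approximates what the NLTK-based step produces.

```python
# Tiny hypothetical stop word list, including the domain-specific additions.
stop_words = {"the", "a", "was", "and", "movie", "film", "time"}

def filter_stop_words(review):
    # Naive sentence split on periods; word split on whitespace.
    sentences = [s for s in review.split(".") if s.strip()]
    return [[word for word in s.split() if word not in stop_words]
            for s in sentences]

tokens = filter_stop_words("the movie was great. a film worth the time and money")
print(tokens)   # [['great'], ['worth', 'money']]
```

Each review becomes a list of sentences, and each sentence becomes a list of the words that survived the filter.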
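import numpy as np

def combine_text(text):
    # Join the tokens of the first sentence back into a single string.
    # Reviews with no sentences left become NaN so that they can be dropped.
    try:
        return ' '.join(text[0])
    except (IndexError, TypeError):
        return np.nan

data.SentimentText = data.SentimentText.apply(combine_text)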
data = data.dropna(how='any')
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(list(data['SentimentText']))
sequences = tokenizer.texts_to_sequences(data['SentimentText'])
word_index = tokenizer.word_index
from keras.preprocessing.sequence import pad_sequences
reviews = pad_sequences(sequences, maxlen=100)
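The following dependency-free sketch mirrors what the tokenizer and the padding step do, under the assumption that the Keras defaults are in use: words are ranked by frequency starting at index 1, words ranked at or beyond num_words are dropped, and sequences are truncated and zero-padded at the front. The helper names to_sequence and pad are hypothetical.

```python
from collections import Counter

texts = ["great acting great story", "boring story", "great"]

# Rank words by frequency; index 0 stays free for padding.
counts = Counter(word for text in texts for word in text.split())
word_index = {word: rank
              for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, num_words):
    # Keep only words whose rank is below num_words.
    return [word_index[w] for w in text.split()
            if w in word_index and word_index[w] < num_words]

def pad(seq, maxlen):
    seq = seq[-maxlen:]                        # truncate from the front
    return [0] * (maxlen - len(seq)) + seq     # pad with zeros at the front

sequences = [to_sequence(t, num_words=3) for t in texts]
padded = [pad(s, maxlen=2) for s in sequences]
print(padded)   # [[1, 2], [0, 2], [0, 1]]
```

After this step, every review has the same length, which is what allows the reviews to be stacked into a single array and fed to the network in batches.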
import numpy as np
def load_embedding(filename, word_index, num_words, embedding_dim):
    # Read lines of the form "word v1 v2 ..." into a word -> vector dictionary.
    embeddings_index = {}
    with open(filename, encoding="utf-8") as file:
        for line in file:
            values = line.split()
            word = values[0]
            coef = np.asarray(values[1:])
            embeddings_index[word] = coef
    # Copy each known word's vector into the row given by its tokenizer index.
    embedding_matrix = np.zeros((num_words, embedding_dim))
    for word, pos in word_index.items():
        if pos >= num_words:
            continue
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[pos] = embedding_vector
    return embedding_matrix
embedding_matrix = load_embedding('movie_embedding.txt', word_index, len(word_index), 16)
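The layout of the resulting matrix can be checked on a tiny example. The sketch below uses plain lists instead of NumPy and an in-memory file with made-up two-dimensional vectors; the file contents, the word index, and the choice of num_words as one more than the largest index are all assumptions for illustration.

```python
import io

# Hypothetical embedding file: one word per line, followed by its vector.
embedding_file = io.StringIO(
    "great 0.5 0.1\n"
    "boring -0.4 0.3\n"
)
word_index = {"great": 1, "story": 2, "boring": 3}
num_words, embedding_dim = 4, 2

vectors = {}
for line in embedding_file:
    word, *values = line.split()
    vectors[word] = [float(v) for v in values]

# Row i holds the vector of the word with index i; row 0 is the padding row.
matrix = [[0.0] * embedding_dim for _ in range(num_words)]
for word, pos in word_index.items():
    if pos < num_words and word in vectors:
        matrix[pos] = vectors[word]

for row in matrix:
    print(row)
```

Words that have no entry in the embedding file, such as "story" here, keep an all-zero row, so the network receives no information about them from the embedding layer.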
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(reviews, pd.get_dummies(data.Sentiment), test_size=0.2, random_state=9)
from keras.models import Model
from keras.layers import Input, Dense, Dropout, BatchNormalization, Embedding, Flatten, LSTM
inp = Input((100,))
embedding_layer = Embedding(len(word_index),
                            16,
                            weights=[embedding_matrix],
                            input_length=100,
                            trainable=False)(inp)
model = Dropout(0.10)(embedding_layer)
model = LSTM(128, dropout=0.2)(model)
model = Dense(units=256, activation='relu')(model)
model = Dense(units=64, activation='relu')(model)
model = Dropout(0.3)(model)
predictions = Dense(units=2, activation='softmax')(model)
model = Model(inputs = inp, outputs = predictions)
model.compile(loss='binary_crossentropy', optimizer='sgd', metrics = ['acc'])
model.fit(X_train, y_train, validation_data = (X_test, y_test), epochs=10, batch_size=256)
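It can help to estimate how many weights the trainable layers add. The sketch below assumes the standard LSTM formulation, in which each of the four internal transformations (three gates and the candidate value) has its own input weights, recurrent weights, and bias; the helper functions are hypothetical and are not part of Keras.

```python
def lstm_param_count(input_dim, units):
    # Four transformations, each with input weights, recurrent weights, and a bias.
    return 4 * (input_dim * units + units * units + units)

def dense_param_count(input_dim, units):
    # One weight per input-output pair plus one bias per output.
    return input_dim * units + units

print(lstm_param_count(16, 128))     # LSTM on 16-dimensional embeddings: 74240
print(dense_param_count(128, 256))   # first dense layer: 33024
print(dense_param_count(256, 64))    # second dense layer: 16448
print(dense_param_count(64, 2))      # output layer: 130
```

The embedding layer does not add to the trainable total here because it is frozen with trainable=False. Calling model.summary() reports the counts for the actual model.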
from sklearn.metrics import accuracy_score
preds = model.predict(X_test)
accuracy_score(np.argmax(preds, 1), np.argmax(y_test.values, 1))
The accuracy of the LSTM model is:
y_actual = pd.Series(np.argmax(y_test.values, axis=1), name='Actual')
y_pred = pd.Series(np.argmax(preds, axis=1), name='Predicted')
pd.crosstab(y_actual, y_pred, margins=True)
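The cross-tabulation above is a confusion matrix. As a minimal sketch of how to read one, the following code counts the four cells by hand for a handful of made-up labels (1 for positive, 0 for negative) and derives the accuracy from them; the label lists are hypothetical and are not results from the model.

```python
from collections import Counter

actual = [1, 0, 1, 1, 0, 0, 1]      # hypothetical true labels
predicted = [1, 0, 0, 1, 1, 0, 1]   # hypothetical model predictions

# Count each (actual, predicted) combination.
pairs = Counter(zip(actual, predicted))
tn, fp = pairs[(0, 0)], pairs[(0, 1)]
fn, tp = pairs[(1, 0)], pairs[(1, 1)]

# Accuracy is the share of reviews on the diagonal of the matrix.
accuracy = (tp + tn) / len(actual)
print(tn, fp, fn, tp)
```

The off-diagonal cells show which kind of error the model makes more often, which the accuracy score alone does not reveal.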
review_num = 110
print("Review: "+tokenizer.sequences_to_texts([X_test[review_num]])[0])
sentiment = "Positive" if np.argmax(preds[review_num]) else "Negative"
print(" Predicted sentiment = "+ sentiment)
sentiment = "Positive" if np.argmax(y_test.values[review_num]) else "Negative"
print(" Actual sentiment = "+ sentiment)
The output is as follows:
Congratulations! You just implemented an LSTM-based RNN to predict the sentiment of a movie review. This network works a little better than the previous network we created. Play around with the architecture and hyperparameters of the network to improve the accuracy of the model. You can also try using pretrained word embeddings from either fastText or GloVe to improve the accuracy of the model.
In this activity, we will attempt to predict the sentiment of a tweet. The provided dataset (https://github.com/TrainingByPackt/Data-Science-with-Python/tree/master/Chapter07) contains 1.5 million tweets and their sentiments (positive or negative). Let's look at the following scenario: You work at a big consumer organization, which recently created a Twitter account. Some of the customers who have had a bad experience with your company are taking to Twitter to express their sentiments, which is causing a decline in the reputation of the company. You have been tasked with identifying these tweets so that the company can get in touch with the customers to provide better support. You do this by creating a sentiment predictor, which can determine whether the sentiment of a tweet is positive or negative. Before using your new sentiment predictor on actual tweets about your company, you will test it on the provided tweets dataset.
The solution for this activity can be found on page 383.
Congratulations! You just created a machine learning module to predict sentiments from tweets. You can now deploy this using the Twitter API to perform real-time sentiment analysis on tweets. You can play around with different embeddings from GloVe and fastText and see how much improvement you can get on your model.
In this chapter, we learned how computers understand human language. We first learned what RegEx is and how it helps data scientists analyze and clean text data. Next, we learned about stop words, what they are, and why they are removed from the data to reduce dimensionality. We then learned about sentence tokenization and its importance, followed by word embeddings. Embedding is a topic that we covered in Chapter 5: Mastering Structured Data; here, we learned how to create word embeddings to boost our NLP model's performance. To create better models, we looked at RNNs, a special type of neural network that retains memory of past inputs. Finally, we learned about LSTM cells and how they are better than normal RNN cells.
Now that you have completed this chapter, you are capable of handling textual data and creating machine learning models for NLP. In the next chapter, you will learn how to make models faster using transfer learning and a few tricks of the craft.