Chapter 7

Processing Human Language

Learning Objectives

By the end of this chapter, you will be able to:

  • Create machine learning models for textual data
  • Use the NLTK library to preprocess text
  • Utilize regular expressions to clean and analyze strings
  • Create word embedding using the Word2Vec model

This chapter covers the core concepts involved in processing human language.

Introduction

One of the most important goals of artificial intelligence (AI) is to understand human language in order to perform tasks. Spellcheck, sentiment analysis, question answering, chatbots, and virtual assistants (such as Siri and Google Assistant) all have a natural language processing (NLP) module. The NLP module enables virtual assistants to process human language and perform actions based on it. For example, when we say, "OK Google, set an alarm for 7 A.M.", the speech is first converted to text, and this text is then processed by the NLP module. After this processing, the virtual assistant calls the appropriate API of the Alarm/Clock application. Processing human language has its own set of challenges because language is ambiguous, with words meaning different things depending on the context in which they are used. This is the biggest pain point of language for AI.

Another big challenge is the unavailability of complete information. We tend to leave out most of the information while communicating; information that is common sense or universally true or false. For example, the sentence "I saw a man on a hill with a telescope" can have different meanings depending on the contextual information: it could mean "I saw a man who had a telescope on a hill," but it could also mean "I saw a man on a hill through a telescope." It is very difficult for computers to keep track of this information as most of it is contextual. Due to the advances in deep learning, NLP today works much better than it did when we used traditional methods such as clustering and linear models. This is the reason we will use deep learning on text corpora to solve NLP problems. NLP, like any other machine learning problem, has two main parts: data processing and model creation. In the next topic, we will learn how to process textual data, and later, we will learn how to use this processed data to create machine learning models to solve our problems.

Text Data Processing

Before we start building machine learning models for our textual data, we need to process the data. First, we will learn the different ways in which we can understand what the data comprises. This helps us get a sense of what the data really is and decide on the preprocessing techniques to be used in the next step. Next, we will move on to learn the techniques that will help us preprocess the data. This step helps reduce the size of the data, thus reducing the training time, and also helps us transform the data into a form that would be easier for machine learning algorithms to extract information from. Finally, we will learn how to convert the textual data to numbers so that machine learning algorithms can actually use it to create models. We do this using word embedding, much like the entity embedding we performed in Chapter 5: Mastering Structured Data.

Regular Expressions

Before we start working on textual data, we need to learn about regular expressions (RegEx). RegEx is not really a preprocessing technique, but a sequence of characters that defines a search pattern in a string. RegEx is a powerful tool when dealing with textual data as it helps us find sequences in a collection of text. A RegEx consists of metacharacters and regular characters.

Figure 7.1: Tables containing metacharacters used in RegEx, and some examples
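To get a feel for the most common metacharacters, here is a minimal sketch (the example strings are made up for illustration):

import re

text = "Order #123 shipped on 2023-05-01 to john_doe"
re.findall(r"\d+", text)                      # ['123', '2023', '05', '01'] - one or more digits
re.findall(r"\w+", text)                      # sequences of letters, digits, and underscores
re.sub(r"\s+", " ", "too   many    spaces")   # collapse runs of whitespace into single spaces
re.search(r"^Order", text)                    # ^ anchors the pattern to the start of the string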

Using RegEx, we can search for complex patterns in a text. For example, we can use it to remove URLs from a text. We can use the re module in Python to remove a URL as follows:

re.sub(r"https?://S+s", '', "https://www.asfd.com hello world")

re.sub accepts three parameters: the first is RegEx, the second is the expression you want to substitute in place of the matched pattern, and the third is the text in which it should search for the pattern.

The output of the command is as follows:

Figure 7.2: Output command

Note

It is difficult to remember all the RegEx conventions, so when working with RegEx, refer to a cheat sheet, such as http://www.pyregex.com/.

Exercise 54: Using RegEx for String Cleaning

In this exercise, we will use the re module of Python to modify and analyze a string. We will simply learn how to use RegEx in this exercise, and in the following section, we will see how we can use RegEx to preprocess our data. We will use a single review from the IMDB Movie Review dataset (https://github.com/TrainingByPackt/Data-Science-with-Python/tree/master/Chapter07), which we will also work with later in the chapter to create sentiment analysis models. This dataset has already been partially processed, and some words have been removed. This will sometimes be the case when dealing with prebuilt datasets, so it is important to analyze the dataset you are working with before you start.

  1. In this exercise, we will use a movie review from IMDB. Save the review text into a variable, as in the following code. You can use any other paragraph of text for this exercise:

    string = "first think another Disney movie, might good, it's kids movie. watch it, can't help enjoy it. ages love movie. first saw movie 10 8 years later still love it! Danny Glover superb could play part better. Christopher Lloyd hilarious perfect part. Tony Danza believable Mel Clark. can't help, enjoy movie! give 10/10!<br /><br />- review Jamie Robert Ward (http://www.invocus.net)"

  2. Calculate the length of the review to know by how much we should reduce the size. We will use len(string) and get the output, as shown in the following code:

    len(string)

    The output length is as follows:

    Figure 7.3: Length of the string
  3. Sometimes, when you scrape data from websites, hyperlinks get recorded as well. Most of the time, hyperlinks do not provide us with any useful information. Remove any hyperlink from the data using a RegEx pattern such as "https?://\S+". This matches any substring that starts with http:// or https://:

    import re

    string = re.sub(r"https?://S+", '', string)

    string

    The string with the hyperlinks removed is as follows:

    Figure 7.4: The string with hyperlinks removed
  4. Next, we will remove the br HTML tags from the text, which we observed while reading the string. Sometimes, these HTML tags get added to the scraped data:

    string = re.sub(r'<br />', ' ', string)

    string

    The string without the br tags is as follows:

    Figure 7.5: The string without br tags
  5. Now, we will remove all the digits from the text. This helps us reduce the size of the dataset when digits are of no significance to us:

    string = re.sub(r'\d', '', string)

    string

    The string without digits is shown here:

    Figure 7.6: The string without digits
  6. Next, we will remove all special characters and punctuation. Depending on your problem, these could just be taking up space without providing information that is relevant to the machine learning algorithms. So, we remove them with the following RegEx pattern:

    string = re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', '', string)

    string

    The string without special characters and punctuation is shown here:

    Figure 7.7: The string without special characters
  7. Now, we will substitute can't with cannot and it's with it is. This helps us reduce the training time as the number of unique words reduces:

    string = re.sub(r"can't", "cannot", string)

    string = re.sub(r"it's", "it is", string)

    string

    The final string is as follows:

    Figure 7.8: The final string
  8. Finally, we will calculate the length of the cleaned string:

    len(string)

    The output size of the string is as follows:

    Figure 7.9: The length of cleaned string

    We reduced the size of the review by 14%.

  9. Now, we will use RegEx to analyze the data and get all the words that start with a capital letter:

    Note

    re.findall takes the regex pattern and the string as input and outputs all substrings that match the pattern.

    This is shown in the following code:

    re.findall(r"[A-Z][a-z]*", string)

    The words are as follows:

    Figure 7.10: Words starting with capital letters
  10. To find all the one- and two-letter words in the text, use the following:

    re.findall(r"[A-z]{1,2}", string)

    The output is as follows:

Figure 7.11: One and two letter words

Congratulations! You successfully modified and analyzed a review string using RegEx with the re module.

Basic Feature Extraction

Basic feature extraction helps us understand what our data consists of, which in turn helps us choose the preprocessing steps for the dataset. Basic feature extraction consists of actions such as calculating the average number of words and counting special characters. We will use the IMDB movie review dataset in this section as an example:

import pandas as pd
import re

data = pd.read_csv('movie_reviews.csv', encoding='latin-1')

Let's see what our dataset consists of:

data.iloc[0]

The output is as follows:

Figure 7.12: SentimentText data

The SentimentText variable contains the actual review and the Sentiment variable contains the sentiment of the review. 1 represents a positive sentiment and 0 represents a negative sentiment. Let's print the first review to get a sense of the data we are dealing with:

data.SentimentText[0]

The first review is as follows:

Figure 7.13: First review

Now, we will try to understand the kind of data we are working with by getting the key statistics of the dataset.

Number of words

We can get the number of words in each review with the following code:

data['word_count'] = data['SentimentText'].apply(lambda x: len(str(x).split(" ")))

Now, the word_count column in the DataFrame contains the total number of words in each review. The apply function runs the lambda, which splits each review and counts the words, on every row of the dataset. Now, we can get the average number of words for each class to see if positive reviews have more words than negative reviews.

The mean() function calculates the average of a column in pandas. For negative reviews, use the following code:

data.loc[data.Sentiment == 0, 'word_count'].mean()

The average number of words for a negative sentiment is as follows:

Figure 7.14: Total number of words for negative sentiment

For positive reviews, use the following:

data.loc[data.Sentiment == 1, 'word_count'].mean()

The average number of words for a positive sentiment is as follows:

Figure 7.15: Total number of words for positive sentiment

We can see that there isn't much difference in the average number of words for either class.

Stop words

Stop words are the most common words in a language – for example, "I", "me", "my", "yours", and "the." Most of the time, these words provide no real information about the sentence, so we remove them from our dataset to reduce its size. The nltk library provides a list of stop words for the English language that we can access (you may need to download it once by running nltk.download('stopwords')).

from nltk.corpus import stopwords

stop = stopwords.words('english')

To get the count of these stop words, we can use the following code:

data['stop_count'] = data['SentimentText'].apply(lambda x: len([x for x in x.split() if x in stop]))

Then, we can see the average number of stop words for each class, by using the following code:

data.loc[data.Sentiment == 0, 'stop_count'].mean()

The average number of stop words for a negative sentiment is shown here:

Figure 7.16: Average number of stop words for a negative sentiment

Now, to get the number of stop words for a positive sentiment, we use the following code:

data.loc[data.Sentiment == 1, 'stop_count'].mean()

The output average number of stop words for a positive sentiment is shown here:

Figure 7.17: Average number of stop words for a positive sentiment

Number of special characters

Depending on the kind of problem you are dealing with, you will want to either keep special characters such as @, #, $, and *, or remove them. To decide, you must first figure out how many special characters occur in your dataset. To get the count of ^, &, *, $, @, and # in your dataset, use the following code:

data['special_count'] = data['SentimentText'].apply(lambda x: len(re.sub('[^^&*$@#]+', '', x)))

The RegEx removes every character that is not one of these special characters, so the length of the remaining string is the count of special characters in the review.

Text Preprocessing

Now that we know what our data comprises, we need to preprocess it so that our machine learning algorithms can easily find patterns in the text. In this section, we will go over some of the techniques used to clean and reduce the dimensionality of the data we feed into our machine learning algorithm.

Lowercase

The first preprocessing step we perform is converting all the data into lowercase. This prevents multiple copies of the same word that differ only in capitalization (for example, 'Movie' and 'movie'). You can easily convert all text to lowercase using the following code:

data['SentimentText'] = data['SentimentText'].apply(lambda x: " ".join(x.lower() for x in x.split()))

The apply function applies the lower function to each row of the dataset iteratively.

Stop word removal

As discussed previously, stop words should be removed from the dataset as they add very little useful information. Stop words do not affect the sentiments of a sentence. We perform this step to remove any bias that stop words might introduce:

data['SentimentText'] = data['SentimentText'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))

Frequent word removal

Stop words are general words such as 'a', 'an,' and 'the.' In this step, however, you will remove the words that occur most frequently in the particular dataset you are working with. For example, words that could be removed from a tweet dataset are RT, @username, and DM. First, find the most frequent words:

word_freq = pd.Series(' '.join(data['SentimentText']).split()).value_counts()

word_freq.head()

The most frequent words are:

Figure 7.18: Most frequent words in the dataset

From the output, we get a hint: the text contains HTML tags, which can be removed to considerably reduce the dataset size. So, let's first remove all <br /> HTML tags and then remove words such as 'movie' and 'film,' which will not have much impact on the sentiment detector:

data['SentimentText'] = data['SentimentText'].str.replace(r'<br />','')

data['SentimentText'] = data['SentimentText'].apply(lambda x: " ".join(x for x in x.split() if x not in ['movie', 'film']))

Punctuation and special character removal

Next, we remove all the punctuations and special characters from the text as they add very little information to the text. To remove punctuations and special characters, use this regex:

punc_special = r"[^A-Za-z0-9s]+"

data['SentimentText'] = data['SentimentText'].str.replace(punc_special,'')

The regex selects all alphanumerical characters and spaces.

Spellcheck

Sometimes, incorrect spellings of the same word cause us to have multiple copies of that word. This can be corrected by performing a spellcheck using the autocorrect library:

from autocorrect import spell

data['SentimentText'] = [' '.join([spell(i) for i in x.split()]) for x in data['SentimentText']]

Stemming

Stemming refers to the practice of removing suffixes such as 'ily', 'iest,' and 'ing.' This helps our model because variations of the same root word have the same meaning. For example, 'happy,' 'happily,' and 'happiest' all convey the same sentiment, so they can all be reduced to the same root. This is helpful in sentiment analysis, where the degree of happiness is not required, as it reduces the number of dimensions needed to represent the data. To perform stemming, we can use the nltk library:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

data['SentimentText'] = data['SentimentText'].apply(lambda x: " ".join([stemmer.stem(word) for word in x.split()]))

Note

Spellcheck, stemming, and lemmatization can take a lot of time to complete depending on the size of the dataset, so make sure that you do need to perform this step by looking into the dataset.

Lemmatization

Tip

You should prefer lemmatization to stemming as it is more effective.

Lemmatization is similar to stemming, but here, we substitute each word with its dictionary root (lemma) to reduce the dimensionality of the dataset; unlike stemming, the result is always a valid word. Lemmatization is generally a more effective option than stemming. To perform lemmatization, you can use the nltk library:

import nltk
# the WordNet corpus is required: nltk.download('wordnet')
lemmatizer = nltk.stem.WordNetLemmatizer()

data['SentimentText'][:5].apply(lambda x: " ".join([lemmatizer.lemmatize(word) for word in x.split()]))

Note

We are working to reduce the dimensionality of the dataset because of the "curse of dimensionality." Datasets become sparse as their dimensionality (the number of features) increases, which causes many data science techniques to fail. This is due to the difficulty of modeling a high number of features well enough to get the correct output. As the number of features of the dataset increases, we need more data points to model them. So, to get around the curse of high-dimensional data, we need to obtain a lot more data, which in turn increases the time required to process it.

Tokenization

Tokenization is the process of dividing either sentences into sequences of words or paragraphs into sequences of sentences. We need to do this to eventually convert the data into one-hot vectors of words. Tokenization can be performed using the nltk library:

import nltk
# the punkt tokenizer models are required: nltk.download('punkt')

nltk.word_tokenize("Hello Dr. Ajay. It's nice to meet you.")

The tokenized list is as follows:

Figure 7.19: Tokenized list

As you can see, it separates punctuation from words and correctly handles abbreviations such as "Dr."
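nltk also provides sent_tokenize for splitting a paragraph into sentences (it relies on the same punkt models). A quick sketch, using the same example sentence:

from nltk.tokenize import sent_tokenize

sent_tokenize("Hello Dr. Ajay. It's nice to meet you.")
# typically: ['Hello Dr. Ajay.', "It's nice to meet you."]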

Exercise 55: Preprocessing the IMDB Movie Review Dataset

In this exercise, we will preprocess the IMDB Movie Review dataset to make it ready for any machine learning algorithm. The dataset consists of 25,000 movie reviews along with the sentiment (positive or negative) of the review. We want to predict sentiments using the review, so we need to keep that in mind while performing preprocessing.

  1. Load the IMDB movie review dataset using pandas:

    import pandas as pd

    data = pd.read_csv('../../chapter 7/data/movie_reviews.csv', encoding='latin-1')

  2. First, we will convert all characters in the dataset into lowercase:

    data.SentimentText = data.SentimentText.str.lower()

  3. Next, we will write a clean_str function, in which we will clean the reviews using the re module:

    import re

    def clean_str(string):
        string = re.sub(r"https?://\S+", '', string)                      # remove hyperlinks
        string = re.sub(r'<a href', ' ', string)                          # remove anchor tags
        string = re.sub(r'&amp;', 'and', string)
        string = re.sub(r'<br />', ' ', string)                           # remove br tags
        string = re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', ' ', string)    # special characters and punctuation
        string = re.sub(r'\d', '', string)                                # digits
        string = re.sub(r"can't", "cannot", string)
        string = re.sub(r"it's", "it is", string)
        return string

    Note

    This function first removes any hyperlink from the text and then removes the HTML tags (<a> or <br>). Next, it substitutes all &amp; with 'and,' followed by removing all special characters, punctuations, and numbers. Finally, it substitutes 'can't' with 'cannot' and 'it's' with 'it is'.

    data.SentimentText = data.SentimentText.apply(lambda x: clean_str(str(x)))

    Use the apply function of pandas to perform review cleaning on the complete dataset.

  4. Next, check the word distribution in the dataset using the following code:

    pd.Series(' '.join(data['SentimentText']).split()).value_counts().head(10)

    The occurrence of the top 10 words is as follows:

    Figure 7.20: Top 10 words
  5. Remove stop words from the reviews:

    Note

    This will be done by first tokenizing the reviews and then removing the stop word loaded from the nltk library.

  6. We add 'movie,' 'film,' and 'time' to the stop words as they occur very frequently in the reviews and don't really contribute much to understanding what the review sentiment is:

    from nltk.corpus import stopwords

    from nltk.tokenize import word_tokenize,sent_tokenize

    stop_words = stopwords.words('english') + ['movie', 'film', 'time']

    stop_words = set(stop_words)

    remove_stop_words = lambda r: [[word for word in word_tokenize(sente) if word not in stop_words] for sente in sent_tokenize(r)]

    data['SentimentText'] = data['SentimentText'].apply(remove_stop_words)

  7. Next, we convert the tokens back into sentences and drop the reviews where all the text was stop words:

    import numpy as np

    def combine_text(text):
        try:
            # join the tokens of the first sentence back into a string
            return ' '.join(text[0])
        except:
            # the review became empty after stop-word removal
            return np.nan

    data.SentimentText = data.SentimentText.apply(lambda x: combine_text(x))

    data = data.dropna(how='any')

  8. The next step is to convert the text into tokens and then numbers. We will be using the Keras Tokenizer as it performs both the steps for us:

    from keras.preprocessing.text import Tokenizer

    tokenizer = Tokenizer(num_words=250)

    tokenizer.fit_on_texts(list(data['SentimentText']))

    sequences = tokenizer.texts_to_sequences(data['SentimentText'])

  9. To get the size of the vocabulary, use the following:

    word_index = tokenizer.word_index

    print('Found %s unique tokens.' % len(word_index))

    The number of unique tokens is as follows:

    Figure 7.21: Number of unique tokens
  10. To reduce the training time of our model, we will cap the length of our reviews at 200 words. You can play around with this number to find out what gives you the best accuracy.

    Note

    Reviews with fewer than 200 words will be padded with 0s. You can increase or decrease this value depending on the accuracy and training time of the model.

    from keras.preprocessing.sequence import pad_sequences

    reviews = pad_sequences(sequences, maxlen=200)

  11. You should save the tokenizer so that you can convert the reviews back to text:

    import pickle

    with open('tokenizer.pkl', 'wb') as handle:
        pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

    To preview a cleaned review, run the following command:

    data.SentimentText[124]

    A cleaned review looks like this:

Figure 7.22: A cleaned review

To get the actual input to the next step of the process, run the following command:

reviews[124]

The input to the next step will look something like this:

Figure 7.23: Input for next step to the cleaned review

Congratulations! You have successfully preprocessed your first text dataset. The review data is now a matrix of 25,000 rows (reviews) and 200 columns (words). Next, we will learn how to convert this data into embeddings to make it easier to predict the sentiment.

Text Processing

Now that we have cleaned our dataset, we will convert it into a form that machine learning models can work with. Recall Chapter 5, Mastering Structured Data, where we discussed how neural networks cannot process words, so we need to represent words as numbers to be able to process them. Therefore, to be able to perform tasks such as sentiment analysis, we need to convert text into numbers.

The very first method we discussed was one-hot encoding, which performs poorly for words because words have relationships with one another, whereas one-hot encoding treats every word as if it were independent of all the others. For example, let us assume we have three words: 'car,' 'truck,' and 'ship.' Now, 'car' is closer to 'truck' in terms of similarity, but it still has some similarity to 'ship.' One-hot encoding fails to capture these relationships.
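As a minimal sketch (with a made-up three-word vocabulary), the snippet below shows that every pair of distinct one-hot vectors has a cosine similarity of exactly 0, so 'car' is no closer to 'truck' than it is to 'ship':

import numpy as np

vocab = ['car', 'truck', 'ship']
one_hot = np.eye(len(vocab))            # each row is the one-hot vector for one word

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(one_hot[0], one_hot[1]))   # car vs truck -> 0.0
print(cosine(one_hot[0], one_hot[2]))   # car vs ship  -> 0.0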

Word embeddings are also vector representations of words, but they capture the relationship of each word with the other words. The different ways of obtaining word embeddings are explained in the following sections.

Count Embedding

Count embedding is a simple vector representation of a word based on the number of times it appears in a piece of text. Assume a dataset where you have N unique words and M different records. To get the count embedding, you create an N x M matrix, where each row is a word and each column is a record. The value at any (n, m) location in the matrix contains the number of times word n occurs in record m.
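As a quick sketch of this idea, scikit-learn's CountVectorizer builds such a count matrix. Note that it returns the transpose of the layout described above (one row per record, one column per word), and the example documents here are made up for illustration:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the movie was good", "the movie was bad", "good good movie"]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)   # sparse matrix of shape (records, unique words)
print(vectorizer.vocabulary_)             # maps each word to its column index
print(counts.toarray())                   # raw counts of each word in each record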

TF-IDF Embedding

TF-IDF is a way to obtain the importance of each word in a collection of words or documents. It stands for term frequency-inverse document frequency. In TF-IDF, the importance of a word increases proportionally to its frequency in a document, but this importance is offset by the number of documents in the corpus that contain the word, which down-weights words that are common everywhere. In other words, the importance of a word is calculated using the frequency of the word in one data point of the training set, and this importance is decreased in proportion to how often the word occurs in the other data points of the training set.

Weights generated by TF-IDF consist of two terms:

  • Term Frequency (TF): The frequency of a word in the document, as shown in the following figure:
Figure 7.24: The term frequency equation

where w is the word.

  • Inverse Document Frequency (IDF): The amount of information the word provides, as shown in the following figure:
Figure 7.25: The inverse document frequency equation

The weight is the product of these two terms. In the case of TF-IDF, we replace the count of the word with this weight in the N x M matrix that we used in the count embedding section.
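In the standard formulation, TF(w, d) is the count of word w in document d divided by the total number of words in d, and IDF(w) is the logarithm of the total number of documents divided by the number of documents containing w. As a sketch, scikit-learn's TfidfVectorizer computes these weights for you (it uses a smoothed variant of the IDF term; the example documents are made up):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the movie was good", "the movie was bad", "good good movie"]
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)       # sparse matrix of TF-IDF weights (records x words)
print(tfidf.vocabulary_)                  # maps each word to its column index
print(weights.toarray().round(2))         # words that appear everywhere, such as 'movie', get lower weights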

Continuous bag-of-words embedding

Continuous bag-of-words (CBOW) works by using a neural network: it predicts a word given its surrounding words as input. The input to the neural network is the one-hot vectors of the surrounding words, and the number of surrounding words considered is selected using the window parameter. The network has only one hidden layer; the activation between the input and hidden layers is linear, and the output layer is activated using the softmax function to get probabilities. The gradients are updated in the same way as in a normal neural network.

The embedding matrix of the corpus is the weight between the hidden layer and the output layer. Thus, this embedding matrix will be of the N x H dimension, where N is the number of unique words in the corpus and H is the number of hidden layer nodes. CBOW works better than the two methods discussed previously due to its probabilistic nature and low memory requirements.

Figure 7.26: A representation of CBOW network

Skip-gram embedding

Using a neural network, skip-gram predicts the surrounding words given an input word. The input here is the one-hot vector of the word and the output is the probability distribution over the surrounding words. The number of output words is decided by the window parameter. Much like CBOW, this method uses a neural network with a single hidden layer, and all activations are linear except for the output layer, where we use the softmax function. One big difference, though, is how the error gets calculated: separate errors are calculated for the different words being predicted and then added together to get the final error. The error for each individual word is calculated by subtracting the output probability vector from that word's target one-hot vector.

The embedding matrix here is the weight matrix between the input layer and the hidden layer. Thus, this embedding matrix will be of the H x N dimension, where N is the number of unique words in the corpus and H is the number of hidden layer nodes. Skip-gram works much better than CBOW for less frequent words, but is generally slower:

Figure 7.27: A representation of skip-gram network

Tip

Use CBOW for datasets with fewer unique words but a high number of samples. Use skip-gram when working with a dataset with more unique words and a low number of samples.

Word2Vec

Word2Vec is a group of models (CBOW and skip-gram) used to produce word embeddings. Word2Vec helps us obtain the word embedding of a corpus easily. To implement the model and obtain word embeddings, we will make use of the gensim library:

import gensim

model = gensim.models.Word2Vec(
        tokens,
        iter=5,
        size=100,
        window=5,
        min_count=5,
        workers=10,
        sg=0)

To train the model, we need to pass the tokenized sentences as arguments to the Word2Vec class of gensim. iter is the number of epochs to train for, and size refers to the number of nodes in the hidden layer and decides the size of the embedding layer. window refers to the number of surrounding words that are considered when training the neural network. min_count refers to the minimum frequency required for a word to be considered. workers is the number of threads to use while training and sg refers to the training algorithm to be used, 0 for CBOW and 1 for skip-gram.

To get the number of unique words in the trained embedding, you can use the following:

vocab = list(model.wv.vocab)

len(vocab)

Before we use these embeddings, we need to make sure that they are correct. To do that, we find the similar words:

model.wv.most_similar('fun')

The output is as follows:

Figure 7.28: Similar words

To save your embeddings to a file, use the following code:

model.wv.save_word2vec_format('movie_embedding.txt', binary=False)

To load a pretrained embedding, you can use this function:

import numpy as np

def load_embedding(filename, word_index, num_words, embedding_dim):
    # read the embedding file into a dictionary of word -> vector
    embeddings_index = {}
    file = open(filename, encoding="utf-8")
    for line in file:
        values = line.split()
        word = values[0]
        coef = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coef
    file.close()

    # build the embedding matrix in the order given by word_index
    embedding_matrix = np.zeros((num_words, embedding_dim))
    for word, pos in word_index.items():
        if pos >= num_words:
            continue
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[pos] = embedding_vector
    return embedding_matrix

The function first reads the embedding file filename and gets all the embedding vectors present in the file. Then, it creates an embedding matrix that stacks the embedding vectors together. The num_words parameter limits the size of the vocabulary and can be helpful in cases where the training time of the NLP algorithm is too high. word_index is a dictionary with the key as unique words of the corpus and the value as the index of the word. embedding_dim is the size of the embedding vectors as specified while training.
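A hypothetical call, assuming the 100-dimensional embedding saved earlier as movie_embedding.txt and the word_index produced by the Keras Tokenizer, would look like this:

embedding_matrix = load_embedding('movie_embedding.txt', word_index, num_words=len(word_index), embedding_dim=100)
embedding_matrix.shape   # (number of words kept, 100)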

Tip

There are a lot of really good pretrained embeddings available. Some of the popular ones are GloVe: https://nlp.stanford.edu/projects/glove/ and fastText: https://fasttext.cc/docs/en/english-vectors.html

Exercise 56: Creating Word Embeddings Using Gensim

In this exercise, we will create our own Word2Vec embedding using the gensim library. The word embedding will be created for the IMDB movie review dataset that we have been working with. We will pick up from where we left off in Exercise 55.

  1. The reviews variable contains the reviews in token form, but the tokens have been converted into numbers. The gensim Word2Vec model requires tokens in string form, so we backtrack to the tokenized output of step 6 of Exercise 55, before the tokens were combined back into sentences:

    data['SentimentText'][0]

    The tokens of the first review are as follows:

    Figure 7.29: Tokens of first review
  2. Now, we convert the lists in each row into a single list using the apply function of pandas, using the following code:

    data['SentimentText'] = data['SentimentText'].apply(lambda x: x[0])

  3. Now, we feed this preprocessed data into Word2Vec to create the word embedding:

    from gensim.models import Word2Vec

    model = Word2Vec(

            data['SentimentText'],

            iter=50,

            size=100,

            window=5,

            min_count=5,

            workers=10)

  4. Let us check how well the model performs by viewing some similar words:

    model.wv.most_similar('insight')

    The most similar words to 'insight' in the dataset are:

    Figure 7.30: Similar words to 'insight'
  5. To obtain the similarity between two words, use:

    model.wv.similarity(w1='violent', w2='brutal')

  6. The output similarity is shown here:
    Figure 7.31: Similarity output

    The similarity score is the cosine similarity between the two word vectors. A score close to 1 means the words are used in very similar contexts, while a score close to 0 means they are largely unrelated.

  7. Plot the embedding on a 2D space to understand what words are found to be similar to each other.

    First, convert the embedding into two dimensions using PCA. We will plot only the first 200 words. (You can plot more if you like.)

    from sklearn.decomposition import PCA

    word_limit = 200

    X = model[model.wv.vocab][: word_limit]

    pca = PCA(n_components=2)

    result = pca.fit_transform(X)

  8. Now, plot the result on a scatter plot using matplotlib:

    import matplotlib.pyplot as plt

    plt.scatter(result[:, 0], result[:, 1])

    words = list(model.wv.vocab)[: word_limit]

    for i, word in enumerate(words):
        plt.annotate(word, xy=(result[i, 0], result[i, 1]))

    plt.show()

    Your output should look like the following:

    Figure 7.32: Representation of embedding of first 200 words using PCA

    Note

    The axes do not mean anything in the representation of word embedding. The representation simply shows the closeness of different words.

  9. Save the embedding to a file so that you can retrieve it later:

    model.wv.save_word2vec_format('movie_embedding.txt', binary=False)

Congratulations! You just created your first word embedding. You can play around with the embedding and view the similarity between different words.

Activity 19: Predicting Sentiments of Movie Reviews

In this activity, we will attempt to predict the sentiments of movie reviews. The dataset (https://github.com/TrainingByPackt/Data-Science-with-Python/tree/master/Chapter07) comprises 25,000 movie reviews sourced from IMDB, along with their sentiment (positive or negative). Let's look at the following scenario: You work at a DVD rental company, which has to predict how many copies of a certain movie to stock depending on how it is being perceived by reviewers. To do this, you create a machine learning model that can analyze reviews to figure out how the movie is being perceived.

  1. Read and preprocess the movie reviews.
  2. Create the word embedding of the reviews.
  3. Create a fully connected neural network to predict sentiments, much like the neural network models we created in Chapter 5: Mastering Structured Data. The input will be the word embedding of the reviews and the output of the model will be either 1 (positive sentiment) or 0 (negative sentiment).

    Note

    The solution for this activity can be found on page 378.

The output is a little cryptic because stop words and punctuation have been removed, but you can still understand the general sense of the review.

Congratulations! You just created your first NLP module. You should find that the model gives an accuracy of around 76%, which is quite low. This is because it predicts sentiments based on individual words; it has no way of figuring out the context of the review. For example, it will predict "not good" as a positive sentiment because it sees the word 'good.' If it could look at multiple words together, it would understand that this is a negative sentiment. In the next section, we will learn how to create neural networks that can retain information about past inputs.

Recurrent Neural Networks (RNNs)

Until now, none of the problems we have discussed had a temporal dependence, that is, a situation in which the prediction depends not only on the current input but also on past inputs. For example, in the case of the dog vs. cat classifier, we only needed a picture of the dog to classify it as a dog. No other information or images were required. In contrast, if you want to make a classifier that predicts whether a dog is walking or standing, you will require multiple images in a sequence, or a video, to figure out what the dog is doing. RNNs are like the fully connected networks that we have talked about. The only change is that an RNN has memory, which stores information about the previous inputs as states. The outputs of the hidden layers are fed back in as inputs along with the next input.

Figure 7.33: Representation of recurrent neural network

From the image, you can understand how the outputs of the hidden layers are used as inputs for the next input. This acts as a memory element in the neural network. Another thing to keep in mind is that the output of a normal neural network is a function only of the current input and the weights of the network, which allows us to feed in data points in any order and still get the right output. However, this is not the case with RNNs. In the case of RNNs, the output depends on the previous inputs, so we need to feed in the inputs in the correct sequence.

Figure 7.34: Representation of recurrent layer

In the preceding image, you can see a single RNN layer on the left in the "folded" model. U is the input weight, V is the output weight, and W is the weight associated with the memory input. The memory of the RNN is also referred to as state. The "unfolded" model on the right shows how the RNN works for the input sequence [xt-1, xt, xt+1]. The model differs based on the kind of application. For example, in case of sentiment analysis, the input sequence will require only one output in the end. The unfolded model for this problem is shown here:

Figure 7.35: Unfolded representation of a recurrent layer used to perform sentiment analysis
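To make this concrete, here is a minimal sketch of such a many-to-one recurrent classifier in Keras. The vocabulary size, embedding size, and sequence length of 100 are arbitrary assumptions for illustration, not values taken from the exercises:

from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN, Dense

model = Sequential()
model.add(Embedding(input_dim=10000, output_dim=32, input_length=100))  # word indices -> dense vectors
model.add(SimpleRNN(units=64))             # processes the sequence and returns only the final state
model.add(Dense(1, activation='sigmoid'))  # single output for positive/negative sentiment
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])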

LSTMs

A long short-term memory (LSTM) cell is a special kind of RNN cell that is capable of retaining information over long periods of time. Hochreiter and Schmidhuber introduced LSTMs in 1997. Plain RNNs suffer from the vanishing gradient problem: they lose information detected over long periods of time. For example, if we are performing sentiment analysis on a text, the first sentence says "I am happy today," and the rest of the text is devoid of any sentiment, the RNN will not do a good job of detecting that the sentiment of the text is happy. LSTM cells overcome this issue by storing certain inputs for a longer time without forgetting them. Most real-world recurrent machine learning implementations use LSTMs. The only difference between RNN cells and LSTM cells is their memory states: every RNN layer takes a memory state as input and outputs a memory state, whereas every LSTM layer takes a long-term and a short-term memory as input and outputs both. The long-term memory allows the network to retain information for a longer time.

LSTM cells are implemented in Keras, and you can easily add an LSTM layer to your model:

model.add(keras.layers.LSTM(units, activation='tanh', dropout=0.0, recurrent_dropout=0.0, return_sequences=False))

Here, units is the number of nodes in the layer and activation is the activation function to use. recurrent_dropout and dropout are the dropout probabilities for the recurrent state and the input, respectively. return_sequences specifies whether the output should contain the full sequence of outputs rather than just the final one; set it to True when you plan to stack another recurrent layer after the current one.
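For example, when stacking two LSTM layers, the first layer must return the full sequence so that the second layer has a sequence to consume. A small sketch (the layer sizes here are arbitrary assumptions):

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

model = Sequential()
model.add(Embedding(input_dim=10000, output_dim=32, input_length=100))
model.add(LSTM(64, return_sequences=True))   # outputs one vector per timestep
model.add(LSTM(32))                          # consumes that sequence and outputs only the final state
model.add(Dense(1, activation='sigmoid'))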

Note

LSTMs almost always work better than plain RNN cells.

Exercise 57: Performing Sentiment Analysis Using LSTM

In this exercise, we will modify the model we created for the previous activity, to make it use an LSTM cell. We will use the same IMDB movie review dataset that we have been working with. Most of the preprocessing steps are like those in Activity 19.

  1. Read the IMDB movie review dataset using pandas in Python:

    import pandas as pd

    data = pd.read_csv('../../chapter 7/data/movie_reviews.csv', encoding='latin-1')

  2. Convert the reviews to lowercase to reduce the number of unique words:

    data.SentimentText = data.SentimentText.str.lower()

  3. Clean the reviews using RegEx with the clean_str function:

    import re

    def clean_str(string):
        string = re.sub(r"https?://\S+", '', string)
        string = re.sub(r'<a href', ' ', string)
        string = re.sub(r'&amp;', '', string)
        string = re.sub(r'<br />', ' ', string)
        string = re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', ' ', string)
        string = re.sub(r'\d', '', string)
        string = re.sub(r"can't", "cannot", string)
        string = re.sub(r"it's", "it is", string)
        return string

    data.SentimentText = data.SentimentText.apply(lambda x: clean_str(str(x)))

  4. Next, remove stop words and other frequently occurring unnecessary words from the reviews. This step converts strings into tokens (which will be helpful in the next step):

    from nltk.corpus import stopwords

    from nltk.tokenize import word_tokenize,sent_tokenize

    stop_words = stopwords.words('english') + ['movie', 'film', 'time']

    stop_words = set(stop_words)

    remove_stop_words = lambda r: [[word for word in word_tokenize(sente) if word not in stop_words] for sente in sent_tokenize(r)]

    data['SentimentText'] = data['SentimentText'].apply(remove_stop_words)

  5. Combine the tokens to get a string and then drop any review that does not have anything in it after the stop-word removal:

    import numpy as np

    def combine_text(text):
        try:
            return ' '.join(text[0])
        except:
            return np.nan

    data.SentimentText = data.SentimentText.apply(lambda x: combine_text(x))

    data = data.dropna(how='any')

  6. Tokenize the reviews using the Keras Tokenizer and convert them into numbers:

    from keras.preprocessing.text import Tokenizer

    tokenizer = Tokenizer(num_words=5000)

    tokenizer.fit_on_texts(list(data['SentimentText']))

    sequences = tokenizer.texts_to_sequences(data['SentimentText'])

    word_index = tokenizer.word_index

  7. Finally, pad the reviews to have a maximum of 100 words each. This will truncate any words after the 100-word limit and add 0s if the number of words is less than 100:

    from keras.preprocessing.sequence import pad_sequences

    reviews = pad_sequences(sequences, maxlen=100)

  8. Load the previously created embedding to get the embedding matrix using the load_embedding function discussed in the Text Processing section, by using the following code:

    import numpy as np

    def load_embedding(filename, word_index, num_words, embedding_dim):
        # read the embedding file into a dictionary of word -> vector
        embeddings_index = {}
        file = open(filename, encoding="utf-8")
        for line in file:
            values = line.split()
            word = values[0]
            coef = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coef
        file.close()

        # build the embedding matrix in the order given by word_index
        embedding_matrix = np.zeros((num_words, embedding_dim))
        for word, pos in word_index.items():
            if pos >= num_words:
                continue
            embedding_vector = embeddings_index.get(word)
            if embedding_vector is not None:
                embedding_matrix[pos] = embedding_vector
        return embedding_matrix

    embedding_matrix = load_embedding('movie_embedding.txt', word_index, len(word_index), 16)   # embedding_dim must match the dimensionality of the saved embedding

  9. Split the data into training and testing sets with an 80:20 split. This can be modified to find the best split:

    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(reviews, pd.get_dummies(data.Sentiment), test_size=0.2, random_state=9)

  10. Create and compile the Keras model with one LSTM layer. You can experiment with different layers and hyperparameters:

    from keras.models import Model

    from keras.layers import Input, Dense, Dropout, BatchNormalization, Embedding, Flatten, LSTM

    inp = Input((100,))

    embedding_layer = Embedding(len(word_index),

                        16,

                        weights=[embedding_matrix],

                        input_length=100,

                        trainable=False)(inp)

    model = Dropout(0.10)(embedding_layer)

    model = LSTM(128, dropout=0.2)(model)

    model = Dense(units=256, activation='relu')(model)

    model = Dense(units=64, activation='relu')(model)

    model = Dropout(0.3)(model)

    predictions = Dense(units=2, activation='softmax')(model)

    model = Model(inputs = inp, outputs = predictions)

    model.compile(loss='binary_crossentropy', optimizer='sgd', metrics = ['acc'])

  11. Train the model on the data for 10 epochs to see if it performs better than the model from Activity 19, by using the following code:

    model.fit(X_train, y_train, validation_data = (X_test, y_test), epochs=10, batch_size=256)

  12. Check the accuracy of the model:

    from sklearn.metrics import accuracy_score

    preds = model.predict(X_test)

    accuracy_score(np.argmax(preds, 1), np.argmax(y_test.values, 1))

    The accuracy of the LSTM model is:

    Figure 7.36: LSTM model accuracy
  13. Plot the confusion matrix of the model to get a proper sense of the model's prediction:

    y_actual = pd.Series(np.argmax(y_test.values, axis=1), name='Actual')

    y_pred = pd.Series(np.argmax(preds, axis=1), name='Predicted')

    pd.crosstab(y_actual, y_pred, margins=True)

    Figure 7.37: Confusion matrix of the model (0 = negative sentiment, 1 = positive sentiment)
  14. Check the performance of the model by seeing the sentiment predictions on random reviews using the following code:

    review_num = 110

    print("Review: "+tokenizer.sequences_to_texts([X_test[review_num]])[0])

    sentiment = "Positive" if np.argmax(preds[review_num]) else "Negative"

    print(" Predicted sentiment = "+ sentiment)

    sentiment = "Positive" if np.argmax(y_test.values[review_num]) else "Negative"

    print(" Actual sentiment = "+ sentiment)

    The output is as follows:

Figure 7.38: A negative review from the IMDB dataset

Congratulations! You just implemented an LSTM-based recurrent neural network to predict the sentiment of movie reviews. This network works a little better than the previous network we created. Play around with the architecture and hyperparameters of the network to improve the accuracy of the model. You can also try using pretrained word embeddings from either fastText or GloVe to improve the accuracy of the model.

Activity 20: Predicting Sentiments from Tweets

In this activity, we will attempt to predict the sentiment of tweets. The provided dataset (https://github.com/TrainingByPackt/Data-Science-with-Python/tree/master/Chapter07) contains 1.5 million tweets and their sentiments (positive or negative). Let's look at the following scenario: You work at a big consumer organization, which recently created a Twitter account. Some of the customers who have had a bad experience with your company are taking to Twitter to express their sentiments, which is causing a decline in the reputation of the company. You have been tasked with identifying these tweets so that the company can get in touch with those customers and provide better support. You do this by creating a sentiment predictor, which can determine whether the sentiment of a tweet is positive or negative. Before using your new sentiment predictor on actual tweets about your company, you will test it on the provided tweets dataset.

  1. Read the data and remove all unnecessary information.
  2. Clean the tweets, tokenize them, and finally convert them into numbers.
  3. Load GloVe Twitter embedding and create the embedding matrix (https://nlp.stanford.edu/projects/glove/).
  4. Create an LSTM model to predict the sentiment.

    Note

    The solution for this activity can be found on page 383.

Congratulations! You just created a machine learning module to predict sentiments from tweets. You can now deploy it using the Twitter API to perform real-time sentiment analysis on tweets. You can play around with different embeddings from GloVe and fastText and see how much improvement you can get in your model.

Summary

In this chapter, we learned how computers understand human language. We first learned what RegEx is and how it helps data scientists analyze and clean text data. Next, we learned about stop words, what they are, and why they are removed from the data to reduce its dimensionality. We then learned about sentence tokenization and its importance, followed by word embeddings. Embedding is a topic that we covered in Chapter 5: Mastering Structured Data; here, we learned how to create word embeddings to boost our NLP model's performance. To create better models, we looked at RNNs, a special type of neural network that retains memory of past inputs. Finally, we learned about LSTM cells and how they are better than normal RNN cells.

Now that you have completed this chapter, you are capable of handling textual data and creating machine learning models for NLP. In the next chapter, you will learn how to make models faster using transfer learning and a few tricks of the craft.
