Akshay Kulkarni and Adarsha Shivananda, Natural Language Processing Recipes, https://doi.org/10.1007/978-1-4842-4267-4_6

6. Deep Learning for NLP

Akshay Kulkarni¹ and Adarsha Shivananda¹

(1) Bangalore, Karnataka, India

In this chapter, we will implement deep learning for NLP:

Recipe 1. Information retrieval using deep learning
Recipe 2. Text classification using CNN, RNN, LSTM
Recipe 3. Predicting the next word/sequence of words using LSTM for Emails

Introduction to Deep Learning

Deep learning is a subfield of machine learning inspired by how the brain functions. Just as neurons in the brain are interconnected, the units of a neural network are connected to one another. Each neuron takes inputs, transforms them, and produces an output; during training on labeled data, the network is adjusted so that its outputs move closer to the expected outputs.

What happens within the neuron is what we are interested in, because it determines how accurate the results are. In very simple terms, a neuron assigns a weight to every input, applies a function to the weighted sum, and passes the result on to the next layer, which may eventually be the output layer.

The network has 3 components:

Input layer
Hidden layer/layers
Output layer

../images/475440_1_En_6_Chapter/475440_1_En_6_Figa_HTML.jpg

The functions can be of different types based on the problem or the data. These are called activation functions. Below are the types.

Linear activation function: A linear neuron outputs a linear combination of its weighted inputs, so the output can take any value from -infinity to infinity.
Nonlinear activation functions: These are the most used ones, and most of them restrict the output to some range:
- Sigmoid or logistic activation function: It squashes the output into the range 0 to 1 by applying the logistic function 1 / (1 + e^(-x)), which makes it convenient for binary classification problems.
- Softmax function: Softmax is similar to sigmoid, but it calculates the probabilities of the event over ‘n’ different classes so that they sum to 1, which is useful for determining the target in multiclass classification problems.
- Tanh function: The range of the tanh function is -1 to 1; otherwise it has the same S shape as the sigmoid.
- Rectified Linear Unit (ReLU) activation function: ReLU converts anything that is less than zero to zero and leaves positive values unchanged. So, the range becomes 0 to infinity.
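To make the ranges listed above concrete, here is a minimal sketch of these activation functions in plain Python. The function names are our own and are not taken from any library; deep learning frameworks ship optimized versions of all of them.

```python
import math

def linear(x):
    # Identity activation: the output is unbounded.
    return x

def sigmoid(x):
    # Logistic function: squashes any real number into the range 0 to 1.
    # The two branches avoid overflow for large negative or positive x.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def tanh(x):
    # Hyperbolic tangent: squashes any real number into the range -1 to 1.
    return math.tanh(x)

def relu(x):
    # Rectified linear unit: negative inputs become zero.
    return max(0.0, x)

def softmax(scores):
    # Turns a list of scores into probabilities that sum to 1.
    # Subtracting the maximum first avoids overflow in exp().
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

for value in (-2.0, 0.0, 2.0):
    print(value, linear(value), sigmoid(value), tanh(value), relu(value))
print(softmax([1.0, 2.0, 3.0]))
```

Note that softmax works on a whole vector of scores, one per class, while the other functions are applied to each value independently.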

We still haven’t discussed how training is carried out in neural networks. Let’s do that by taking one of the networks as an example, which is the convolutional neural network.

Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are similar to ordinary neural networks, but their hidden layers include convolution layers, which slide small learned filters over the input. CNNs have been successful in identifying faces, objects, and traffic signs and are also used in self-driving cars.

Data

Algorithms work on numerical data. Images and text are unstructured data, as we discussed earlier, and they need to be converted into numerical values before we can do anything with them.

Image: A computer takes an image as an array of pixel values. Depending on the resolution and size of the image, it will see an X x Y x Z array of numbers. For example, a color image of 480 x 480 pixels is represented as a 480 x 480 x 3 array, where 3 is the number of color channels (red, green, and blue). Each of these numbers ranges from 0 to 255 and describes the pixel intensity at that point. Given this array of numbers, the computer outputs the probability of the image belonging to each class in a classification problem.
Text: We already discussed throughout the book how to create features out of text. We can use any of those techniques to convert text to features. RNNs and LSTMs, which we discuss in the next sections, are better suited to text-related problems.

../images/475440_1_En_6_Chapter/475440_1_En_6_Figb_HTML.jpg

Architecture

CNN is a special case of a neural network with an input layer, output layer, and multiple hidden layers. The hidden layers have 4 different procedures to complete the network. Each one is explained in detail.

Convolution

../images/475440_1_En_6_Chapter/475440_1_En_6_Figc_HTML.jpg

The convolution layer is the heart of a convolutional neural network and performs most of the computation. The name comes from the “convolution” operator, which extracts features from the input image. It does so with small matrices called filters (the orange 3 x 3 matrix in the figure). The matrix formed by sliding the filter over the full image and calculating, at each position, the dot product between the filter and the image patch beneath it is called the ‘Convolved Feature’, ‘Activation Map’, or ‘Feature Map’. Just as in tabular data we derive features such as “age” from “date of birth,” here the filters extract features such as straight edges, simple colors, and curves from the image.

During training, the CNN learns the values inside the filters and then applies them to test data. The more filters there are, the more image features are extracted and the better the network becomes at recognizing patterns in unseen images.
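The sliding dot product described above can be sketched in a few lines of plain Python. This is only an illustration: the 5 x 5 image and the 3 x 3 filter values are made up, the stride is assumed to be 1 with no padding, and a real CNN would learn the filter values instead of fixing them.

```python
def convolve2d(image, kernel):
    # Slide the kernel over the image (stride 1, no padding) and take the
    # dot product between the kernel and the patch under it at each position.
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    feature_map = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            total = 0
            for i in range(k_rows):
                for j in range(k_cols):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        feature_map.append(row)
    return feature_map

# Hypothetical binary image and filter, chosen only for illustration
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

feature_map = convolve2d(image, kernel)
print(feature_map)  # [[4, 3, 4], [2, 4, 3], [2, 3, 4]]
```

A 5 x 5 input and a 3 x 3 filter give a 3 x 3 feature map, because the filter fits in three positions along each axis.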

Nonlinearity (ReLU)

../images/475440_1_En_6_Chapter/475440_1_En_6_Figd_HTML.jpg

ReLU (Rectified Linear Unit) is a nonlinear function that is applied after a convolution layer in the CNN architecture. It replaces all negative values in the matrix with zero. The purpose of ReLU is to introduce nonlinearity so that the CNN can model more complex patterns.

Pooling

../images/475440_1_En_6_Chapter/475440_1_En_6_Fige_HTML.jpg

Pooling or subsampling is used to decrease the dimensionality of the feature map while retaining the most important information. It reduces the number of inputs to the fully connected layer and the computation required to process the model. It also helps to reduce overfitting. In max pooling, a 2 x 2 window slides over the feature map and takes the maximum value in each region, as shown in the figure. This is how it reduces dimensionality.
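As a rough sketch of max pooling, the function below slides a 2 x 2 window over a feature map and keeps the largest value in each region. The stride of 2 (non-overlapping windows) is our assumption, and the input values are invented for the example.

```python
def max_pool(feature_map, size=2, stride=2):
    # Keep the maximum of every size x size window; with stride 2 the
    # windows do not overlap, so each axis shrinks by half.
    pooled = []
    for r in range(0, len(feature_map) - size + 1, stride):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, stride):
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            row.append(max(window))
        pooled.append(row)
    return pooled

# Hypothetical 4 x 4 feature map
feature_map = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]

pooled = max_pool(feature_map)
print(pooled)  # [[6, 8], [3, 4]]
```

The 16 input values are reduced to 4, while the strongest response in each region is kept.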

Flatten, Fully Connected, and Softmax Layers

The last layer is a dense layer that needs feature vectors as input. But the output from the pooling layer is not a 1D feature vector. This process of converting the output of convolution to a feature vector is called flattening. The Fully Connected layer takes an input from the flatten layer and gives out an N-dimensional vector where N is the number of classes. The function of the fully connected layer is to use these features for classifying the input image into various classes based on the loss function on the training dataset. The Softmax function is used at the very end to convert these N-dimensional vectors into a probability for each class, which will eventually classify the image into a particular class.

Backpropagation: Training the Neural Network

In a neural network, you first do forward propagation to get the output, check whether this output is correct, and calculate the error. In backward propagation, you go backward through the network and find the partial derivatives of the error with respect to each weight.

Let’s see how exactly it works.

The input image is fed into the network and goes through forward propagation, that is, the convolution, ReLU, and pooling operations followed by the fully connected layer, which generates output probabilities for each class. Because the weights are randomly initialized, the first iteration of training outputs essentially random probabilities. After this first step, the network calculates the error at the output layer using

Total Error = ∑ ½ (target probability – output probability) ²

Now backpropagation calculates the gradients of the error with respect to all weights in the network, and gradient descent uses them to update all filter values and weights so that the output error gradually decreases. Hyperparameters such as the number of filters, the filter sizes, and the architecture of the network are fixed when you build the network. The filter values and connection weights are updated on each run. The whole process is repeated over the complete training set until the error is minimized.
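The loop of forward propagation, error calculation, gradient calculation, and weight update can be illustrated on the smallest possible network: a single linear neuron with one weight. This is a toy sketch under our own assumptions (one input, one target, a learning rate of 0.1), not the training code of a CNN, but it uses the same squared error as the formula above.

```python
def train_single_weight(x, target, weight=0.0, learning_rate=0.1, steps=50):
    errors = []
    for _ in range(steps):
        output = weight * x                            # forward propagation
        errors.append(0.5 * (target - output) ** 2)    # squared error
        gradient = -(target - output) * x              # d(error)/d(weight)
        weight -= learning_rate * gradient             # gradient descent update
    return weight, errors

# Learn the weight that maps the input 2.0 to the target 4.0
weight, errors = train_single_weight(x=2.0, target=4.0)
print(round(weight, 4))       # close to 2.0
print(errors[0], errors[-1])  # the error shrinks as training proceeds
```

In a real network the same update is applied to every filter value and connection weight, with the gradients obtained layer by layer through the chain rule.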

Recurrent Neural Networks

CNNs are mainly used for computer vision problems and are less suited to sequence modeling. Sequence models are those where the order of the entities also matters. For example, in text, the order of the words matters to create meaningful sentences. This is where RNNs come into the picture: they are useful with sequential data because each neuron can use its memory to remember information about the previous step.

../images/475440_1_En_6_Chapter/475440_1_En_6_Figf_HTML.jpg

It is quite complex to understand exactly how an RNN works. As the figure above shows, the recurrent neural network takes the output from the hidden layer and sends it back to the same layer before giving the prediction.
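The feedback loop can be sketched with a recurrent unit that has a single hidden value. The weights below are hypothetical and fixed by hand; a trained RNN would learn them, and real layers use vectors and matrices instead of single numbers.

```python
import math

def rnn_forward(inputs, w_input=1.0, w_hidden=0.5, bias=0.0):
    # At every step the new hidden state depends on the current input
    # and on the hidden state from the previous step.
    hidden = 0.0
    states = []
    for x in inputs:
        hidden = math.tanh(w_input * x + w_hidden * hidden + bias)
        states.append(hidden)
    return states

# The same values in a different order give a different final state,
# which is why RNNs can model sequences.
early_signal = rnn_forward([1.0, 0.0, 0.0])
late_signal = rnn_forward([0.0, 0.0, 1.0])
print(early_signal)
print(late_signal)
```

In the first sequence the signal fades step by step as it is carried through the hidden state; in the second it arrives at the last step and dominates the final state.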

Training RNN – Backpropagation Through Time (BPTT)

We know how feed forward and backpropagation work from CNN, so let’s see how training is done in case of RNN.

../images/475440_1_En_6_Chapter/475440_1_En_6_Figg_HTML.jpg

If we look only at the hidden layer, at each time step it takes input not only from the input layer but also from its own output at the previous time step. Backpropagation then happens as in the training we have seen before, except that it now depends on time. The error is backpropagated from the last time step to the first by unrolling the hidden layers. This allows calculating the error for each time step and updating the weights. Recurrent networks with recurrent connections between hidden units read an entire sequence and then produce the required output.

When the gradient values become too small, the model takes far too long to learn; this is called the vanishing gradient problem. LSTMs were designed to mitigate it.

Long Short-Term Memory (LSTM)

LSTMs are a kind of RNN with improved equations and backpropagation, which makes them perform better. LSTMs work much like RNNs, but their units can learn dependencies across very long time gaps, and they can store information much like a computer’s memory.

../images/475440_1_En_6_Chapter/475440_1_En_6_Figh_HTML.jpg

The algorithm learns the importance of each word or character through weighting and decides whether to store it or not. For this, it uses regulating structures called gates that can remove information from or add information to the cell. Each gate has a sigmoid layer that decides how much information should be passed. There are three gates, namely “input,” “forget,” and “output,” to carry out this process.
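A single step of an LSTM cell can be sketched with scalar values to show how the three gates interact. This follows the commonly used LSTM formulation; the parameter dictionary and all weight values are hypothetical and chosen only to make the gate behavior visible.

```python
import math

def sigmoid(x):
    # Numerically stable logistic function with output between 0 and 1
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def lstm_step(x, h_prev, c_prev, params):
    # params maps each name to (input weight, hidden weight, bias)
    def unit(name, activation):
        w_x, w_h, b = params[name]
        return activation(w_x * x + w_h * h_prev + b)

    forget_gate = unit("forget", sigmoid)      # how much old memory to keep
    input_gate = unit("input", sigmoid)        # how much new information to store
    candidate = unit("candidate", math.tanh)   # the new information itself
    output_gate = unit("output", sigmoid)      # how much memory to expose

    c = forget_gate * c_prev + input_gate * candidate  # updated cell state
    h = output_gate * math.tanh(c)                     # updated hidden state
    return h, c

# Hypothetical weights: forget gate wide open, input gate almost closed,
# so the cell keeps its old memory and ignores the new input.
keep_params = {
    "forget": (0.0, 0.0, 20.0),
    "input": (0.0, 0.0, -20.0),
    "candidate": (1.0, 0.0, 0.0),
    "output": (0.0, 0.0, 0.0),
}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.8, params=keep_params)
print(h, c)  # c stays very close to the previous cell state 0.8
```

Because the cell state can be carried forward almost unchanged when the forget gate is open, information can survive over many time steps.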

Going in depth on how CNN and RNNs work is beyond the scope of this book. We have mentioned references at the end of the book if anyone is interested in learning about this in more depth.

Recipe 6-1. Retrieving Information

Information retrieval is one of the most widely used applications of NLP, and it is quite tricky. The meaning of words or sentences depends not only on the exact words used but also on the context. Two sentences may use completely different words but convey the same meaning. We should be able to capture that as well.

An information retrieval (IR) system allows users to efficiently search documents and retrieve meaningful information based on a search text/query.

../images/475440_1_En_6_Chapter/475440_1_En_6_Figi_HTML.jpg

Problem

Information retrieval using word embeddings.

Solution

There are multiple ways to do information retrieval. We will see how to do it using word embeddings, which is very effective since it also takes context into consideration. We discussed how word embeddings are built in Chapter 3. We will just use the pretrained word2vec in this case.

Let’s take a simple example and see how to build document retrieval from a query input. Let’s say we have 4 documents in our database as below. (This just showcases how it works. A real-world application will have far more documents.)

Doc1 = ["With the Union cabinet approving the amendments to the Motor Vehicles Act, 2016, those caught for drunken driving will have to have really deep pockets, as the fine payable in court has been enhanced to Rs 10,000 for first-time offenders." ]

Doc2 = ["Natural language processing (NLP) is an area of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data."]

Doc3 = ["He points out that public transport is very good in Mumbai and New Delhi, where there is a good network of suburban and metro rail systems."]

Doc4 = ["But the man behind the wickets at the other end was watching just as keenly. With an affirmative nod from Dhoni, India captain Rohit Sharma promptly asked for a review. Sure enough, the ball would have clipped the top of middle and leg."]

Assume we have numerous documents like this, and you want to retrieve the most relevant ones for the query “cricket.” Let’s see how to build it.

query = "cricket"

How It Works

Step 1-1 Import the libraries

Here are the libraries:

import gensim

from gensim.models import Word2Vec

import numpy as np

import nltk

import itertools

from nltk.corpus import stopwords

from nltk.tokenize import sent_tokenize, word_tokenize

import scipy

from scipy import spatial

from nltk.tokenize.toktok import ToktokTokenizer

import re

tokenizer = ToktokTokenizer()

stopword_list = nltk.corpus.stopwords.words('english')

Step 1-2 Create/import documents

We reuse the four documents Doc1 to Doc4 defined above; they are sentences taken at random from the internet.

# Put all the documents in one list

fin= Doc1+Doc2+Doc3+Doc4

Step 1-3 Download word2vec

As mentioned earlier, we are going to use the word embeddings to solve this problem. Download word2vec from the below link:

https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit

#load the model

model = gensim.models.KeyedVectors.load_word2vec_format('/GoogleNews-vectors-negative300.bin', binary=True)

Step 1-4 Create IR system

Now we build the information retrieval system:

#Preprocessing

def remove_stopwords(text, is_lower_case=False):
    # Strip every character that is not a letter, digit, or whitespace
    pattern = r'[^a-zA-Z0-9\s]'
    text = re.sub(pattern, '', ''.join(text))
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text

# Function to get the embedding vector for n dimension, we have used "300"

def get_embedding(word):
    # KeyedVectors supports membership tests on the vocabulary directly
    if word in model:
        return model[word]
    else:
        # Out-of-vocabulary words are represented by a zero vector
        return np.zeros(300)

For every document, we will get a lot of vectors based on the number of words present. We need to calculate the average vector for the document through taking a mean of all the word vectors.

# Getting average vector for each document

out_dict = {}

for sen in fin:
    # Mean of the word vectors of all non-stopword tokens in the document
    average_vector = np.mean(np.array([get_embedding(x) for x in nltk.word_tokenize(remove_stopwords(sen))]), axis=0)
    out_dict[sen] = average_vector

# Function to calculate the similarity between the query vector and document vector

def get_sim(query_embedding, average_vector_doc):
    # Cosine similarity = 1 - cosine distance
    sim = [(1 - scipy.spatial.distance.cosine(query_embedding, average_vector_doc))]
    return sim

# Rank all the documents based on the similarity to get relevant docs

def Ranked_documents(query):
    # Average vector of the query words
    query_words = (np.mean(np.array([get_embedding(x) for x in nltk.word_tokenize(query.lower())], dtype=float), axis=0))
    rank = []
    for k, v in out_dict.items():
        rank.append((k, get_sim(query_words, v)))
    # Most similar documents first
    rank = sorted(rank, key=lambda t: t[1], reverse=True)
    print('Ranked Documents :')
    return rank
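The averaging and cosine-similarity ranking can also be tried without downloading word2vec. The sketch below replaces the pretrained model with a tiny hand-made embedding table; the 3-dimensional vectors, the words, and the two short documents are all invented for illustration and carry no real semantic information.

```python
import math

# Hypothetical 3-dimensional embeddings, invented for this example only
toy_embeddings = {
    "cricket": [0.9, 0.1, 0.0],
    "wickets": [0.8, 0.2, 0.1],
    "ball": [0.7, 0.1, 0.2],
    "driving": [0.1, 0.9, 0.1],
    "vehicles": [0.0, 0.8, 0.2],
    "transport": [0.1, 0.7, 0.3],
}

def embed(word):
    # Unknown words fall back to a zero vector, as in get_embedding
    return toy_embeddings.get(word, [0.0, 0.0, 0.0])

def average_vector(words):
    vectors = [embed(w) for w in words]
    if not vectors:
        return [0.0, 0.0, 0.0]
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(3)]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_documents(query, documents):
    query_vector = average_vector(query.split())
    scored = [(name, cosine_similarity(query_vector, average_vector(text.split())))
              for name, text in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

documents = {"sports": "wickets ball", "traffic": "driving vehicles transport"}
ranking = rank_documents("cricket", documents)
print(ranking)  # the "sports" document ranks first
```

The "sports" document ranks first for the query "cricket" even though it does not contain that word, because its words have similar vectors.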

Step 1-5 Results and applications

Let’s see how the information retrieval system we built is working with a couple of examples.

# Call the IR function with a query

Ranked_documents("cricket")

Result :

[('But the man behind the wickets at the other end was watching just as keenly. With an affirmative nod from Dhoni, India captain Rohit Sharma promptly asked for a review. Sure enough, the ball would have clipped the top of middle and leg.',

[0.44954327116871795]),

('He points out that public transport is very good in Mumbai and New Delhi, where there is a good network of suburban and metro rail systems.',

[0.23973446569030055]),

('With the Union cabinet approving the amendments to the Motor Vehicles Act, 2016, those caught for drunken driving will have to have really deep pockets, as the fine payable in court has been enhanced to Rs 10,000 for first-time offenders.',

[0.18323712012013349]),

('Natural language processing (NLP) is an area of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.',

[0.17995060855459855])]

As you can see, Doc4 (on top in the result) is the most relevant for the query “cricket,” with a similarity of 0.449, even though the word “cricket” is not mentioned even once.

Let’s take one more example, say “driving.”

Ranked_documents("driving")

[('With the Union cabinet approving the amendments to the Motor Vehicles Act, 2016, those caught for drunken driving will have to have really deep pockets, as the fine payable in court has been enhanced to Rs 10,000 for first-time offenders.',

[0.35947287723800669]),

('But the man behind the wickets at the other end was watching just as keenly. With an affirmative nod from Dhoni, India captain Rohit Sharma promptly asked for a review. Sure enough, the ball would have clipped the top of middle and leg.',

[0.19042556935316801]),

('He points out that public transport is very good in Mumbai and New Delhi, where there is a good network of suburban and metro rail systems.',

[0.17066536985237601]),

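('Natural language processing (NLP) is an area of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.',

[0.088723080005327359])]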

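Again, since driving is connected to the Motor Vehicles Act, the most relevant document comes out on top. Note that in this output the cricket document (0.190) scores slightly higher than the public transport document (0.171), so only the top result is clearly relevant to the query.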

We can use the same approach and scale it up for as many documents as possible. For more accuracy, we can build our own embeddings, as we learned in Chapter 3, for specific industries since the one we are using is generalized.

This is the fundamental approach that can be used for many applications like the following:

Search engines
Document retrieval
Passage retrieval
Question and answer

../images/475440_1_En_6_Chapter/475440_1_En_6_Figj_HTML.jpg

Results tend to be better when queries are longer and the result length is shorter. That’s the reason we don’t get great results in search engines when the search query has fewer words.

Recipe 6-2. Classifying Text with Deep Learning

In this recipe, let us build a text classifier using deep learning approaches.

Problem

We want to build a text classification model using CNN, RNN, and LSTM.

Solution

The approach and NLP pipeline would remain the same as discussed earlier. The only change would be that instead of using machine learning algorithms, we would be building models using deep learning algorithms.

How It Works

Let’s follow the steps in this section to build the email classifier using the deep learning approaches.

Step 2-1 Understanding/defining business problem

Email classification (spam or ham). We need to classify each email as spam or ham based on its content.

Step 2-2 Identifying potential data sources, collection, and understanding

Using the same data used in Recipe 4-6 from Chapter 4:

#read file

import pandas as pd

file_content = pd.read_csv('spam.csv', encoding = "ISO-8859-1")

#check sample content in the email

file_content['v2'][1]

#output

'Ok lar... Joking wif u oni...'

Step 2-3 Text preprocessing

Let’s preprocess the data:

#Import library

import re

from nltk.corpus import stopwords

from nltk import *

from sklearn.feature_extraction.text import TfidfVectorizer

from nltk.stem import WordNetLemmatizer

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

# Remove stop words

stop = stopwords.words('english')

file_content['v2'] = file_content['v2'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))

# Delete unwanted columns

Email_Data = file_content[['v1', 'v2']]

# Rename column names

Email_Data = Email_Data.rename(columns={"v1":"Target", "v2":"Email"})

Email_Data.head()

#output

Target Email

0 ham Go jurong point, crazy.. Available bugis n gre...

1 ham Ok lar... Joking wif u oni...

2 spam Free entry 2 wkly comp win FA Cup final tkts 2...

3 ham U dun say early hor... U c already say...

4 ham Nah I think goes usf, lives around though

#Delete punctuations, convert text in lower case and delete the double space

Email_Data['Email'] = Email_Data['Email'].apply(lambda x: re.sub('[!@#$:).;,?&]', '', x.lower()))

Email_Data['Email'] = Email_Data['Email'].apply(lambda x: re.sub(' +', ' ', x))

Email_Data['Email'].head(5)

#output

0 go jurong point crazy available bugis n great ...

1 ok lar joking wif u oni

2 free entry 2 wkly comp win fa cup final tkts 2...

3 u dun say early hor u c already say

4 nah i think goes usf lives around though

Name: Email, dtype: object

#Separating text(input) and target classes

list_sentences_rawdata = Email_Data["Email"].fillna("_na_").values

list_classes = ["Target"]

target = Email_Data[list_classes].values

To_Process=Email_Data[['Email', 'Target']]

Step 2-4 Data preparation for model building

Now we prepare the data:

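# Imports needed in this step (they also appear in the import list of Step 2-5)

import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

#Train and test split with 80:20 ratio

train, test = train_test_split(To_Process, test_size=0.2)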

# Define the sequence lengths, max number of words and embedding dimensions

# Sequence length of each sentence. If more, truncate. If less, pad with zeros

MAX_SEQUENCE_LENGTH = 300

# Top 20000 frequently occurring words

MAX_NB_WORDS = 20000

# Get the frequently occurring words

tokenizer = Tokenizer(num_words=MAX_NB_WORDS)

tokenizer.fit_on_texts(train.Email)

train_sequences = tokenizer.texts_to_sequences(train.Email)

test_sequences = tokenizer.texts_to_sequences(test.Email)

# dictionary containing words and their index

word_index = tokenizer.word_index

# print(tokenizer.word_index)

# total words in the corpus

print('Found %s unique tokens.' % len(word_index))

# pad or truncate every training sequence to MAX_SEQUENCE_LENGTH

train_data = pad_sequences(train_sequences, maxlen=MAX_SEQUENCE_LENGTH)

# pad or truncate every test sequence to MAX_SEQUENCE_LENGTH

test_data = pad_sequences(test_sequences, maxlen=MAX_SEQUENCE_LENGTH)

print(train_data.shape)

print(test_data.shape)

#output

Found 8443 unique tokens.

(4457, 300)

(1115, 300)
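To see what the tokenizer and the padding step do to a sentence, here is a simplified imitation in plain Python. It is not the Keras implementation: the helper names are ours, punctuation handling and the exact treatment of num_words are left out, and only the general idea is kept (frequent words get small indices, index 0 is reserved for padding, and sequences are padded or truncated at the front, which is also the Keras default).

```python
from collections import Counter

def build_word_index(texts, num_words):
    # Count words over the whole corpus and give the most frequent
    # ones the smallest indices, starting at 1 (0 is kept for padding).
    counts = Counter(word for text in texts for word in text.split())
    most_common = counts.most_common(num_words)
    return {word: i for i, (word, _) in enumerate(most_common, start=1)}

def to_sequence(text, word_index):
    # Words that are not in the index are simply dropped
    return [word_index[w] for w in text.split() if w in word_index]

def pad(sequence, maxlen):
    # Keep the last maxlen tokens, then fill the front with zeros
    sequence = sequence[-maxlen:]
    return [0] * (maxlen - len(sequence)) + sequence

texts = ["free entry win", "ok free"]
word_index = build_word_index(texts, num_words=10)
padded = pad(to_sequence("ok free", word_index), maxlen=4)
print(word_index)
print(padded)
```

Every sentence ends up as a fixed-length list of integers, which is why train_data and test_data above both have 300 columns.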

train_labels = train['Target']

test_labels = test['Target']

#import library

from sklearn.preprocessing import LabelEncoder

# converts the character array to numeric array. Assigns levels to unique labels.

le = LabelEncoder()

le.fit(train_labels)

train_labels = le.transform(train_labels)

test_labels = le.transform(test_labels)

print(le.classes_)

print(np.unique(train_labels, return_counts=True))

print(np.unique(test_labels, return_counts=True))

#output

['ham' 'spam']

(array([0, 1]), array([3889, 568]))

(array([0, 1]), array([936, 179]))

# changing data types

labels_train = to_categorical(np.asarray(train_labels))

labels_test = to_categorical(np.asarray(test_labels))

print('Shape of data tensor:', train_data.shape)

print('Shape of label tensor:', labels_train.shape)

print('Shape of label tensor:', labels_test.shape)

#output

Shape of data tensor: (4457, 300)

Shape of label tensor: (4457, 2)

Shape of label tensor: (1115, 2)
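The label tensors have two columns because each numeric label is turned into a one-hot vector. A minimal sketch of that conversion, written in plain Python with a helper name of our own choosing, looks like this:

```python
def one_hot(labels, num_classes):
    # Each label becomes a row with a single 1.0 at the label's position
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows

# 0 = ham, 1 = spam, following the LabelEncoder output above
encoded = one_hot([0, 1, 0], num_classes=2)
print(encoded)  # [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
```

This layout matches the two-unit softmax output layer used in the models that follow, where each unit predicts the probability of one class.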

EMBEDDING_DIM = 100

print(MAX_SEQUENCE_LENGTH)

#output

300

Step 2-5 Model building and predicting

We are building the models using different deep learning approaches like CNN, RNN, LSTM, and Bidirectional LSTM and comparing the performance of each model using different accuracy metrics.

We can now define our CNN model.

Here we define two 1D convolution layers with 128 filters each and a kernel size of 5, each followed by max pooling, dropout with a probability of 0.5, and batch normalization, and then a dense hidden layer with 128 units. The output layer is a dense layer using the softmax activation function to output a probability prediction.

# Import Libraries

import sys, os, re, csv, codecs
import numpy as np
import pandas as pd

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.models import Model, Sequential
from keras import initializers, regularizers, constraints, optimizers, layers
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation
from keras.layers import Flatten, BatchNormalization, Bidirectional
from keras.layers import Conv1D, MaxPooling1D, GlobalMaxPool1D, SimpleRNN

print('Training CNN 1D model.')

model = Sequential()

model.add(Embedding(MAX_NB_WORDS,

EMBEDDING_DIM,

input_length=MAX_SEQUENCE_LENGTH

))

model.add(Dropout(0.5))

model.add(Conv1D(128, 5, activation="relu"))

model.add(MaxPooling1D(5))

model.add(Dropout(0.5))

model.add(BatchNormalization())

model.add(Conv1D(128, 5, activation="relu"))

model.add(MaxPooling1D(5))

model.add(Dropout(0.5))

model.add(BatchNormalization())

model.add(Flatten())

model.add(Dense(128, activation="relu"))

model.add(Dense(2, activation="softmax"))

model.compile(loss='categorical_crossentropy',

optimizer="rmsprop",

metrics=['acc'])

We are now fitting our model to the data. Here we have 5 epochs and a batch size of 64 patterns.

model.fit(train_data, labels_train,

batch_size=64,

epochs=5,

validation_data=(test_data, labels_test))

#output

../images/475440_1_En_6_Chapter/475440_1_En_6_Figk_HTML.jpg

#predictions on test data

predicted=model.predict(test_data)

predicted

#output

array([[0.5426713 , 0.45732868],

[0.5431667 , 0.45683333],

[0.53082496, 0.46917507],

...,

[0.53582424, 0.46417573],

[0.5305845 , 0.46941552],

[0.53102577, 0.46897423]], dtype=float32)

#model evaluation

import sklearn

from sklearn.metrics import precision_recall_fscore_support as score

precision, recall, fscore, support = score(labels_test, predicted.round())

print('precision: {}'.format(precision))

print('recall: {}'.format(recall))

print('fscore: {}'.format(fscore))

print('support: {}'.format(support))

print("############################")

print(sklearn.metrics.classification_report(labels_test, predicted.round()))

#output

../images/475440_1_En_6_Chapter/475440_1_En_6_Figl_HTML.jpg
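For readers who want to see what precision, recall, and F-score measure, the sketch below computes them by hand for one class. The label lists are made up for illustration and are not the results of the model above.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical labels: 1 = spam, 0 = ham
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

Here two of the three predicted spam emails are really spam (precision 2/3), and two of the three real spam emails are found (recall 2/3).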

We can now define our RNN model.

#import library

from keras.layers import SimpleRNN

#model training

print('Training SIMPLERNN model.')

model = Sequential()

model.add(Embedding(MAX_NB_WORDS,

EMBEDDING_DIM,

input_length=MAX_SEQUENCE_LENGTH

))

model.add(SimpleRNN(2, input_shape=(None,1)))

model.add(Dense(2,activation='softmax'))

model.compile(loss = 'binary_crossentropy', optimizer="adam",metrics = ['accuracy'])

model.fit(train_data, labels_train,

batch_size=16,

epochs=5,

validation_data=(test_data, labels_test))

#output

../images/475440_1_En_6_Chapter/475440_1_En_6_Figm_HTML.jpg

# prediction on test data

predicted_Srnn=model.predict(test_data)

predicted_Srnn

#output

array([[0.9959137 , 0.00408628],

[0.99576926, 0.00423072],

[0.99044365, 0.00955638],

...,

[0.9920791 , 0.00792089],

[0.9958105 , 0.00418955],

[0.99660563, 0.00339443]], dtype=float32)

#model evaluation

from sklearn.metrics import precision_recall_fscore_support as score

precision, recall, fscore, support = score(labels_test, predicted_Srnn.round())

print('precision: {}'.format(precision))

print('recall: {}'.format(recall))

print('fscore: {}'.format(fscore))

print('support: {}'.format(support))

print("############################")

print(sklearn.metrics.classification_report(labels_test, predicted_Srnn.round()))

#output

../images/475440_1_En_6_Chapter/475440_1_En_6_Fign_HTML.jpg

And here is our Long Short-Term Memory (LSTM):

#model training

print('Training LSTM model.')

model = Sequential()

model.add(Embedding(MAX_NB_WORDS,

EMBEDDING_DIM,

input_length=MAX_SEQUENCE_LENGTH

))

model.add(LSTM(units=16, activation="relu", recurrent_activation="hard_sigmoid", return_sequences=True))

model.add(Dropout(0.2))

model.add(BatchNormalization())

model.add(Flatten())

model.add(Dense(2,activation='softmax'))

model.compile(loss = 'binary_crossentropy', optimizer="adam",metrics = ['accuracy'])

model.fit(train_data, labels_train,

batch_size=16,

epochs=5,

validation_data=(test_data, labels_test))

#output

../images/475440_1_En_6_Chapter/475440_1_En_6_Figo_HTML.jpg

#prediction on test data

predicted_lstm=model.predict(test_data)

predicted_lstm

#output

array([[1.0000000e+00, 4.0581045e-09],

[1.0000000e+00, 8.3188789e-13],

[9.9999976e-01, 1.8647323e-07],

...,

[9.9999976e-01, 1.8333606e-07],

[1.0000000e+00, 1.7347950e-09],

[9.9999988e-01, 1.3574694e-07]], dtype=float32)

#model evaluation

from sklearn.metrics import precision_recall_fscore_support as score

precision, recall, fscore, support = score(labels_test, predicted_lstm.round())

print('precision: {}'.format(precision))

print('recall: {}'.format(recall))

print('fscore: {}'.format(fscore))

print('support: {}'.format(support))

print("############################")

print(sklearn.metrics.classification_report(labels_test, predicted_lstm.round()))

#output

../images/475440_1_En_6_Chapter/475440_1_En_6_Figp_HTML.jpg

Finally, let’s see what a Bidirectional LSTM is and implement one.

As we know, an LSTM preserves information from inputs using the hidden state. In bidirectional LSTMs, inputs are fed in two ways: one from past to future and the other backward from future to past, which helps the model learn from future context as well. Bidirectional LSTMs are known for producing very good results as they are capable of understanding the context better.
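The two reading directions can be illustrated with a toy example. To keep it short, the sketch uses a simple scalar recurrent unit instead of an LSTM, and both directions share the same hand-picked weights; in a real bidirectional layer each direction has its own learned weights.

```python
import math

def run_rnn(inputs, w_input=1.0, w_hidden=0.5):
    # Simple recurrent unit: each state depends on the input and the previous state
    hidden, states = 0.0, []
    for x in inputs:
        hidden = math.tanh(w_input * x + w_hidden * hidden)
        states.append(hidden)
    return states

def bidirectional(inputs):
    forward = run_rnn(inputs)                # reads from past to future
    backward = run_rnn(inputs[::-1])[::-1]   # reads from future to past
    # Pair the two states for every position in the sequence
    return list(zip(forward, backward))

states = bidirectional([0.0, 0.0, 1.0])
for position, (fwd, bwd) in enumerate(states):
    print(position, round(fwd, 4), round(bwd, 4))
```

At position 0 the forward state is still zero because the signal has not arrived yet, while the backward state is already nonzero because it has seen the end of the sequence. This is the extra context a bidirectional model gains.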

#model training

print('Training Bidirectional LSTM model.')

model = Sequential()

model.add(Embedding(MAX_NB_WORDS,

EMBEDDING_DIM,

input_length=MAX_SEQUENCE_LENGTH

))

model.add(Bidirectional(LSTM(16, return_sequences=True, dropout=0.1, recurrent_dropout=0.1)))

model.add(Conv1D(16, kernel_size = 3, padding = "valid", kernel_initializer = "glorot_uniform"))

model.add(GlobalMaxPool1D())

model.add(Dense(50, activation="relu"))

model.add(Dropout(0.1))

model.add(Dense(2,activation='softmax'))

model.compile(loss = 'binary_crossentropy', optimizer="adam",metrics = ['accuracy'])

model.fit(train_data, labels_train,

batch_size=16,

epochs=3,

validation_data=(test_data, labels_test))

#output

../images/475440_1_En_6_Chapter/475440_1_En_6_Figq_HTML.jpg

# prediction on test data

predicted_blstm=model.predict(test_data)

predicted_blstm

#output

array([[9.9999976e-01, 2.6086647e-07],

[9.9999809e-01, 1.9633851e-06],

[9.9999833e-01, 1.6918856e-06],

...,

[9.9999273e-01, 7.2622524e-06],

[9.9999964e-01, 3.3541210e-07],

[9.9999964e-01, 3.5427794e-07]], dtype=float32)

#model evaluation

from sklearn.metrics import precision_recall_fscore_support as score

precision, recall, fscore, support = score(labels_test, predicted_blstm.round())

print('precision: {}'.format(precision))

print('recall: {}'.format(recall))

print('fscore: {}'.format(fscore))

print('support: {}'.format(support))

print("############################")

print(sklearn.metrics.classification_report(labels_test, predicted_blstm.round()))

#output

../images/475440_1_En_6_Chapter/475440_1_En_6_Figr_HTML.jpg

We can see that Bidirectional LSTM outperforms the rest of the algorithms.
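
The evaluation code above rounds the softmax outputs before scoring them. As a rough illustration of what the per-class numbers mean, the sketch below computes precision and recall by hand. It assumes the labels are one-hot encoded with one column per class, uses argmax instead of rounding to pick the predicted class, and works on made-up probabilities and labels, not on the recipe's data.

```python
import numpy as np

def per_class_precision_recall(true_ids, pred_ids, n_classes):
    """Return (precision, recall) arrays with one entry per class."""
    precision = np.zeros(n_classes)
    recall = np.zeros(n_classes)
    for c in range(n_classes):
        true_pos = np.sum((pred_ids == c) & (true_ids == c))
        predicted = np.sum(pred_ids == c)   # how often class c was predicted
        actual = np.sum(true_ids == c)      # how often class c really occurs
        precision[c] = true_pos / predicted if predicted else 0.0
        recall[c] = true_pos / actual if actual else 0.0
    return precision, recall

# Hypothetical softmax outputs for four messages (columns: class 0, class 1).
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.6, 0.4],
                  [0.3, 0.7]])
# Hypothetical one-hot labels in the same two-column layout.
one_hot_labels = np.array([[1, 0],
                           [0, 1],
                           [0, 1],
                           [0, 1]])

pred_ids = probs.argmax(axis=1)            # most probable class per row
true_ids = one_hot_labels.argmax(axis=1)
precision, recall = per_class_precision_recall(true_ids, pred_ids, n_classes=2)
print(precision)
print(recall)
```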

Recipe 6-3. Next Word Prediction

Autofill, or suggesting the likely next sequence of words, saves a lot of time while writing emails and makes a product more pleasant to use.

Problem

You want to build a model to predict/suggest the next word based on a previous sequence of words using email data.

As you can see in the image below, “language” is suggested as the next word.

../images/475440_1_En_6_Chapter/475440_1_En_6_Figs_HTML.jpg

Solution

In this section, we will build an LSTM model to learn sequences of words from email data. We will use this model to predict the next word.

How It Works

Let's follow the steps in this section to build the next word prediction model using the deep learning approach.

Step 3-1 Understanding/defining business problem

Predict the next word based on the sequence of words or sentences.

Step 3-2 Identifying potential data sources, collection, and understanding

For this problem, let us use the same email data used in Recipe 4-6 from Chapter 4. This is a small data set, but it is enough to showcase the workflow. The more data, the better the accuracy.

file_content = pd.read_csv('spam.csv', encoding = "ISO-8859-1")

# Just selecting emails and converting them into a list

Email_Data = file_content[[ 'v2']]

list_data = Email_Data.values.tolist()

list_data

#output

[['Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'],

['Ok lar... Joking wif u oni...'],

["Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"],

['U dun say so early hor... U c already then say...'],

["Nah I don't think he goes to usf, he lives around here though"],

["FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, å£1.50 to rcv"],

['Even my brother is not like to speak with me. They treat me like aids patent.'],

["As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune"],

['WINNER!! As a valued network customer you have been selected to receivea å£900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.'] ,

['Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030'],

Step 3-3 Importing and installing necessary libraries

Here are the libraries:

import numpy as np

import random

import pandas as pd

import sys

import os

import time

import codecs

import collections

import numpy

from keras.models import Sequential

from keras.layers import Dense

from keras.layers import Dropout

from keras.layers import LSTM

from keras.callbacks import ModelCheckpoint

from keras.utils import np_utils

from nltk.tokenize import sent_tokenize, word_tokenize

import scipy

from scipy import spatial

from nltk.tokenize.toktok import ToktokTokenizer

import re

tokenizer = ToktokTokenizer()

Step 3-4 Processing the data

Now we process the data:

#Converting list to string

# Iterable lives in collections.abc (the old collections.Iterable alias was removed in Python 3.10)

from collections.abc import Iterable

def flatten(items):

    """Yield items from any nested iterable"""

    for x in items:

        if isinstance(x, Iterable) and not isinstance(x, (str, bytes)):

            for sub_x in flatten(x):

                yield sub_x

        else:

            yield x

TextData=list(flatten(list_data))

TextData = ''.join(TextData)

# Remove unwanted lines and converting into lower case

TextData = TextData.replace('\n','')

TextData = TextData.lower()

# Keep only letters, digits, and whitespace

pattern = r'[^a-zA-Z0-9\s]'

TextData = re.sub(pattern, '', TextData)

# Tokenizing

tokens = tokenizer.tokenize(TextData)

tokens = [token.strip() for token in tokens]

# get the distinct words and sort it

word_counts = collections.Counter(tokens)

word_c = len(word_counts)

print(word_c)

distinct_words = [x[0] for x in word_counts.most_common()]

distinct_words_sorted = list(sorted(distinct_words))

# Generate indexing for all words

word_index = {x: i for i, x in enumerate(distinct_words_sorted)}

# decide on sentence length

sentence_length = 25
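
The processing steps above can be tried on a toy input to see what each one produces. This is only a sketch: the two messages are stand-ins for list_data, the text is joined with a space (the recipe joins with an empty string), and plain whitespace splitting is assumed in place of ToktokTokenizer so the example has no NLTK dependency. All names starting with toy_ are hypothetical.

```python
import collections
import re

# Hypothetical stand-in for the email text; the real input is list_data.
toy_messages = ["Ok lar... Joking wif u oni...", "U dun say so early hor!"]

text = " ".join(toy_messages).lower()
# Keep only letters, digits, and whitespace.
text = re.sub(r"[^a-z0-9\s]", "", text)

# Assumption: plain whitespace splitting instead of ToktokTokenizer.
toy_tokens = text.split()

toy_counts = collections.Counter(toy_tokens)
toy_vocab = sorted(toy_counts)                       # distinct words, alphabetical
toy_word_index = {word: i for i, word in enumerate(toy_vocab)}
toy_index_word = {i: word for word, i in toy_word_index.items()}

print(toy_tokens)
print(toy_word_index)
```

The word "u" appears twice in the tokens but only once in the index, which is why the vocabulary is smaller than the token list.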

Step 3-5 Data preparation for modeling

Here we divide the mails into sequences of words with a fixed length of 25 words (the sentence_length set above; you can choose any length based on the business problem and computation power). We split the text into word sequences by sliding this window along the whole document one word at a time, so that each window of words is paired with the word that follows it.

#prepare the dataset of input to output pairs encoded as integers

# Generate the data for the model

#input = the input sentence to the model with index

#output = output of the model with index

InputData = []

OutputData = []

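# word_c is the number of distinct words, so only the first word_c tokens are used here;

# use len(tokens) instead to slide the window over the whole text

for i in range(0, word_c - sentence_length, 1):

    X = tokens[i:i + sentence_length]

    Y = tokens[i + sentence_length]

    InputData.append([word_index[char] for char in X])

    OutputData.append(word_index[Y])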

print (InputData[:1])

print (" ")

print(OutputData[:1])

#output

[[5086, 12190, 6352, 9096, 3352, 1920, 8507, 5937, 2535, 7886, 5214, 12910, 6541, 4104, 2531, 2997, 11473, 5170, 1595, 12552, 6590, 6316, 12758, 12087, 8496]]

[4292]
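
The windowing is easier to follow on a tiny input. The sketch below uses a hypothetical helper, make_windows, a made-up list of word ids, and a window of 3 instead of 25. It loops over the full length of the id list.

```python
def make_windows(token_ids, window):
    """Return (inputs, targets): each input holds `window` ids, each target is the id that follows."""
    inputs, targets = [], []
    for start in range(len(token_ids) - window):
        inputs.append(token_ids[start:start + window])
        targets.append(token_ids[start + window])
    return inputs, targets

# Hypothetical sequence of word ids standing in for the indexed tokens.
toy_ids = [5, 4, 3, 10, 9, 6]
toy_inputs, toy_targets = make_windows(toy_ids, window=3)

for seq, nxt in zip(toy_inputs, toy_targets):
    print(seq, "->", nxt)
# [5, 4, 3] -> 10
# [4, 3, 10] -> 9
# [3, 10, 9] -> 6
```

A sequence of n ids with a window of w yields n - w training pairs, since the last w ids have no complete window with a following word.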

# Generate X

X = numpy.reshape(InputData, (len(InputData), sentence_length, 1))

# One hot encode the output variable

Y = np_utils.to_categorical(OutputData)

#output

array([[0., 0., 0., ..., 0., 0., 0.],

[0., 0., 0., ..., 0., 0., 0.],

...,

[0., 0., 0., ..., 0., 0., 0.],

[0., 0., 0., ..., 0., 0., 0.]])
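
One-hot encoding can be sketched in a few lines of NumPy. The function one_hot below is a hypothetical stand-in written for illustration, not the Keras helper itself, and the ids are made up. Like to_categorical, it infers the number of columns from the largest id when no class count is given.

```python
import numpy as np

def one_hot(class_ids, num_classes=None):
    """Turn a list of integer class ids into a 2-D array of 0/1 rows."""
    class_ids = np.asarray(class_ids, dtype=int)
    if num_classes is None:
        num_classes = class_ids.max() + 1    # infer width from the largest id
    encoded = np.zeros((len(class_ids), num_classes), dtype="float32")
    encoded[np.arange(len(class_ids)), class_ids] = 1.0
    return encoded

toy_output = [2, 0, 3]                       # hypothetical next-word ids
toy_Y = one_hot(toy_output)
print(toy_Y)
# [[0. 0. 1. 0.]
#  [1. 0. 0. 0.]
#  [0. 0. 0. 1.]]
```

Because the width is inferred from the largest id that occurs, the encoded array can be narrower than the full vocabulary if the highest-numbered words never appear as targets. Passing the class count explicitly avoids that.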

Step 3-6 Model building

We will now define the LSTM model: a single hidden LSTM layer with 256 memory units, followed by dropout of 0.2. The output layer uses the softmax activation function, and the model is trained with the Adam optimizer.

# define the LSTM model

model = Sequential()

model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2])))

model.add(Dropout(0.2))

model.add(Dense(Y.shape[1], activation="softmax"))

model.compile(loss='categorical_crossentropy', optimizer="adam")

#define the checkpoint

file_name_path="weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"

checkpoint = ModelCheckpoint(file_name_path, monitor="loss", verbose=1, save_best_only=True, mode="min")

callbacks = [checkpoint]
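
To give a feel for the softmax output layer and the categorical cross-entropy loss used in the model above, here is a small NumPy sketch. The vocabulary size of 5 and the scores are made up, and the two functions are simplified illustrations, not the Keras implementations.

```python
import numpy as np

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)       # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

def categorical_crossentropy(one_hot_target, probabilities):
    """Negative log of the probability assigned to the true class."""
    return float(-np.sum(one_hot_target * np.log(probabilities)))

vocab_size = 5                               # hypothetical tiny vocabulary
target = np.zeros(vocab_size)
target[2] = 1.0                              # the true next word has index 2

uniform = softmax(np.zeros(vocab_size))      # a model that has learned nothing
confident = softmax(np.array([0.0, 0.0, 4.0, 0.0, 0.0]))

print(categorical_crossentropy(target, uniform))    # equals log(vocab_size)
print(categorical_crossentropy(target, confident))  # smaller, since index 2 is favored
```

A model that spreads its probability evenly over the vocabulary has a loss equal to the logarithm of the vocabulary size, which is a useful baseline when reading the loss values that the checkpoint callback writes into the file names.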

We can now fit the model to the data. Here we use 5 epochs and a batch size of 128 patterns. For better results, you can use more epochs, such as 50 or 100, and train on more data.

#fit the model

model.fit(X, Y, epochs=5, batch_size=128, callbacks=callbacks)

Note

We have not split the data into training and testing sets, because we are not aiming for an accurate model here, only demonstrating the workflow. Deep learning models require a lot of data and take a long time to train, so we use a model checkpoint to save the model weights to a file whenever the loss improves. We will use the best set of weights for our prediction.

#output

../images/475440_1_En_6_Chapter/475440_1_En_6_Figt_HTML.jpg

After running the above code, you will have weight checkpoint files in your local directory. Pick the network weights file that is saved in your working directory. For example, when we ran this example, below was the checkpoint with the smallest loss that we achieved with 5 epochs.

# load the network weights

file_name = "weights-improvement-05-6.8213.hdf5"

model.load_weights(file_name)

model.compile(loss='categorical_crossentropy', optimizer="adam")
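
Instead of copying the file name by hand, the checkpoint with the smallest loss could be picked programmatically. The sketch below is one possible approach, not part of the original recipe: the helper best_checkpoint and the directory listing are hypothetical, and it relies only on the file name pattern defined earlier, where the loss is embedded in the name.

```python
import re

# Matches names produced by the pattern "weights-improvement-{epoch:02d}-{loss:.4f}.hdf5".
CHECKPOINT_RE = re.compile(r"weights-improvement-(\d+)-(\d+\.\d+)\.hdf5$")

def best_checkpoint(file_names):
    """Return the checkpoint name with the smallest loss, or None if none match."""
    candidates = []
    for name in file_names:
        match = CHECKPOINT_RE.search(name)
        if match:
            candidates.append((float(match.group(2)), name))
    return min(candidates)[1] if candidates else None

# Hypothetical directory listing; in practice this could come from os.listdir(".").
listing = [
    "weights-improvement-01-7.9000.hdf5",
    "weights-improvement-02-7.2500.hdf5",
    "weights-improvement-03-7.0100.hdf5",
    "notes.txt",
]
print(best_checkpoint(listing))  # weights-improvement-03-7.0100.hdf5
```

The returned name can then be passed to model.load_weights in place of a hard-coded string.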

Step 3-7 Predicting next word

We will pick a random sequence of words from the input data, feed it to the model, and see what it predicts.

# Generating random sequence

start = numpy.random.randint(0, len(InputData))

input_sent = InputData[start]

# Generate index of the next word of the email

X = numpy.reshape(input_sent, (1, len(input_sent), 1))

predict_word = model.predict(X, verbose=0)

index = numpy.argmax(predict_word)

print(input_sent)

print (" ")

print(index)

# Output

[9122, 1920, 8187, 5905, 6828, 9818, 1791, 5567, 1597, 7092, 11606, 7466, 10198, 6105, 1837, 4752, 7092, 3928, 10347, 5849, 8816, 7092, 8574, 7092, 1831]

5849

# Convert these indexes back to words

# word_index was built from distinct_words_sorted, so invert that same mapping

word_index_rev = dict((i, c) for i, c in enumerate(distinct_words_sorted))

result = word_index_rev[index]

sent_in = [word_index_rev[value] for value in input_sent]

print(sent_in)

print (" ")

print(result)

Result :

['us', 'came', 'use', 'respecthe', 'would', 'us', 'are', 'it', 'you', 'to', 'pray', 'because', 'you', 'do', 'me', 'out', 'youre', 'thk', 'where', 'are', 'mrng', 'minutes', 'long', '500', 'per']

shut

So, given the 25 input words, the model predicts “shut” as the next word. Of course, this does not make much sense, since the model was trained on very little data for only a few epochs. For useful predictions, train on a much larger data set for many more epochs, which requires substantial computation power.
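
An autofill feature usually offers a few candidates instead of a single word. As a sketch of how that could be done with the probability vector returned by model.predict, the hypothetical helper top_k_words below returns the k most probable words. The vocabulary and probabilities are made up for illustration.

```python
import numpy as np

def top_k_words(probabilities, index_to_word, k=3):
    """Return the k most probable (word, probability) pairs, best first."""
    probabilities = np.asarray(probabilities).ravel()
    best = np.argsort(probabilities)[::-1][:k]   # indices sorted by descending probability
    return [(index_to_word[int(i)], float(probabilities[i])) for i in best]

# Hypothetical vocabulary and model output for illustration only.
toy_index_to_word = {0: "call", 1: "free", 2: "later", 3: "now", 4: "ok"}
toy_prediction = np.array([[0.10, 0.05, 0.50, 0.30, 0.05]])  # shape (1, vocab), like model.predict

for word, prob in top_k_words(toy_prediction, toy_index_to_word, k=3):
    print(word, round(prob, 2))
# later 0.5
# now 0.3
# call 0.1
```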
