How to do it...

We proceed with the recipe as follows:

  1. Import TensorFlow, tflearn, and the modules needed for building our network. Then, import the IMDb library and the utilities for one-hot encoding and padding:
import tensorflow as tf
import tflearn
from tflearn.layers.core import input_data, dropout, fully_connected
from tflearn.layers.conv import conv_1d, global_max_pool
from tflearn.layers.merge_ops import merge
from tflearn.layers.estimator import regression
from tflearn.data_utils import to_categorical, pad_sequences
from tflearn.datasets import imdb
  2. Load the dataset, pad the sentences to a maximum length of 100 words with 0s, and one-hot encode the labels into two values corresponding to the true and false classes. Note that the parameter n_words is the number of words to keep in the vocabulary; all extra words are set to unknown. Also note that trainX and testX are sparse vectors, because each review will most likely contain only a subset of the whole vocabulary (a small toy illustration follows the code below):
# IMDb Dataset loading
train, test, _ = imdb.load_data(path='imdb.pkl', n_words=10000,
valid_portion=0.1)
trainX, trainY = train
testX, testY = test
#pad the sequence
trainX = pad_sequences(trainX, maxlen=100, value=0.)
testX = pad_sequences(testX, maxlen=100, value=0.)
#one-hot encoding
trainY = to_categorical(trainY, nb_classes=2)
testY = to_categorical(testY, nb_classes=2)
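The following toy illustration (not part of the recipe, using made-up word indices and labels) shows what the two transforms do on a couple of short sequences:
# Toy illustration: pad two short index sequences and one-hot encode their labels
toy_reviews = [[17, 25, 3], [42]]   # two "reviews" as lists of word indices
toy_labels = [1, 0]                 # 1 = positive, 0 = negative
print(pad_sequences(toy_reviews, maxlen=5, value=0.))  # both rows padded to length 5
print(to_categorical(toy_labels, nb_classes=2))        # [[0, 1], [1, 0]]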
  3. Print a few dimensions to inspect the just-processed data and understand the dimensions of the problem. Note that size returns the total number of elements, so trainX holds 22,500 reviews of 100 (padded) words each:
print ("size trainX", trainX.size)
print ("size testX", testX.size)
print ("size testY:", testY.size)
print ("size trainY", trainY.size)
size trainX 2250000
size testX 250000
size testY: 5000
size trainY 45000
  4. Build an embedding for the text contained in the dataset. For now, consider this step a black box that takes the words and maps them into aggregates (clusters), so that similar words are likely to appear in the same cluster. Note that the vocabulary from the previous steps is discrete and sparse. With the embedding, we create a map that embeds each word into a continuous, dense vector space. Using this vector space representation will give us a continuous, distributed representation of our vocabulary words. How to build embeddings will be discussed in detail when we talk about RNNs (a minimal sketch of the underlying lookup follows the code below):
# Build an embedding
network = input_data(shape=[None, 100], name='input')
network = tflearn.embedding(network, input_dim=10000, output_dim=128)
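Conceptually, tflearn.embedding keeps a trainable matrix of shape (input_dim, output_dim) and looks up one row per word index. The following minimal sketch (toy word indices, built in a separate graph so it does not interfere with the recipe) makes that lookup explicit in plain TensorFlow:
# Sketch of what the embedding layer does internally: every word index
# selects one trainable 128-dimensional row of a lookup matrix
with tf.Graph().as_default():
    word_ids = tf.constant([[12, 7, 0, 0]])  # one padded "review" of four word indices
    embedding_matrix = tf.Variable(
        tf.random_uniform([10000, 128], -1.0, 1.0))  # (input_dim, output_dim)
    embedded = tf.nn.embedding_lookup(embedding_matrix, word_ids)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        print(sess.run(embedded).shape)  # (1, 4, 128)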
  5. Build a suitable ConvNet. We have three convolutional layers. Since we are dealing with text, we use one-dimensional ConvNets and the layers act in parallel. Each layer takes the 128-dimensional tensor produced by the embedding and applies 128 filters with kernel sizes of 3, 4, and 5 respectively, valid padding, the ReLU activation function, and an L2 regularizer. The output of each branch is then concatenated with a merge operation; after that, a global max pooling layer is added, followed by dropout with a probability of 50%. The final layer is a fully connected one with softmax activation:
#Build the convnet
branch1 = conv_1d(network, 128, 3, padding='valid', activation='relu', regularizer="L2")
branch2 = conv_1d(network, 128, 4, padding='valid', activation='relu', regularizer="L2")
branch3 = conv_1d(network, 128, 5, padding='valid', activation='relu', regularizer="L2")
network = merge([branch1, branch2, branch3], mode='concat', axis=1)
network = tf.expand_dims(network, 2)
network = global_max_pool(network)
network = dropout(network, 0.5)
network = fully_connected(network, 2, activation='softmax')
  6. The learning phase uses the Adam optimizer with categorical_crossentropy as the loss function:
network = regression(network, optimizer='adam', learning_rate=0.001,
loss='categorical_crossentropy', name='target')
  7. Then we run the training with batch_size=32 and observe the accuracy reached on the training and validation sets. As you can see, we are able to reach an accuracy of about 79% in predicting the sentiment expressed in movie reviews (an optional snippet after the training log shows how to query the trained model):
# Training
model = tflearn.DNN(network, tensorboard_verbose=0)
model.fit(trainX, trainY, n_epoch = 5, shuffle=True, validation_set=(testX, testY), show_metric=True, batch_size=32)
Training Step: 3519 | total loss: 0.09738 | time: 85.043s
| Adam | epoch: 005 | loss: 0.09738 - acc: 0.9747 -- iter: 22496/22500
Training Step: 3520 | total loss: 0.09733 | time: 86.652s
| Adam | epoch: 005 | loss: 0.09733 - acc: 0.9741 | val_loss: 0.58740 - val_acc: 0.7944 -- iter: 22500/22500
--
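Once training has finished, the fitted model can be queried directly. The snippet below is an optional sketch that uses the standard tflearn.DNN methods evaluate, predict, and save; the file name is just an example:
# Optional: evaluate the trained model, inspect a few predictions, and save it
print("Test accuracy:", model.evaluate(testX, testY))  # returns a list with the accuracy
print(model.predict(testX[:5]))                        # class probabilities for 5 reviews
model.save('imdb_cnn.tflearn')                         # persist the trained weights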