© David Paper 2021
D. Paper, TensorFlow 2.x in the Colaboratory Cloud, https://doi.org/10.1007/978-1-4842-6649-6_9

9. Sentiment Analysis

David Paper
Logan, UT, USA

We’ve already demonstrated how to train a character-level RNN to create original text. Now, we create a word-level RNN to analyze sentiment.

Sentiment analysis is the interpretation and classification of polarity, emotions, and intentions within text data using NLP text analysis tools. Polarity can be positive, negative, or neutral. Emotions can vary across a wide range of feelings such as anger, happiness, frustration, and sadness, to name a few. Intentions can also vary across a range of motives, such as interested or not interested. A common application of sentiment analysis is to identify customer sentiment toward products, brands, or services through online feedback. General applications include social media monitoring, brand monitoring, customer service, customer feedback analysis, and market research.

For an excellent discussion of sentiment analysis, consult the following URL:

https://monkeylearn.com/sentiment-analysis/#:~:text=Sentiment%20analysis%20is%20the%20interpretation,or%20services%20in%20online%20feedback

Sentiment analysis is a very common NLP task. Technically, it computationally identifies and categorizes opinions expressed in a text corpus to determine attitude or sentiment. Typically, sentiment analysis is used to determine a positive, negative, or neutral opinion toward a particular topic or product.

Notebooks for chapters are located at the following URL: https://github.com/paperd/tensorflow.

IMDb Dataset

A popular dataset used to practice NLP is the IMDb reviews dataset. IMDb is a benchmark dataset for binary sentiment classification. The dataset contains 50,000 movie reviews labeled as either positive (1) or negative (0). Reviews are preprocessed with each encoded as a sequence of word indexes in the form of integers. Words within the reviews are indexed by their overall frequency within the dataset. The 50,000 reviews are split into 25,000 for training and 25,000 for testing. So we can predict whether a review is positive or negative using classification or other deep learning algorithms.

IMDb is popular because it is simple to use, relatively easy to process, and challenging enough for machine learning aficionados. We enjoy working with IMDb because it’s just plain fun to work with movie data.

Enable the GPU (if not already enabled):
  1. Click Runtime in the top-left menu.

  2. Click Change runtime type from the drop-down menu.

  3. Choose GPU from the Hardware accelerator drop-down menu.

  4. Click SAVE.
Test if GPU is active:
import tensorflow as tf
# display tf version and test if GPU is active
tf.__version__, tf.test.gpu_device_name()

Import the tensorflow library. If ‘/device:GPU:0’ is displayed, the GPU is active. If an empty string is displayed, the regular CPU is active.

Load IMDb as a TFDS

The recommended way to load IMDb is as a TFDS:
import tensorflow_datasets as tfds
imdb, info = tfds.load(
    'imdb_reviews/subwords8k', with_info=True,
    as_supervised=True, shuffle_files=True)

We use the imdb_reviews/subwords8k TFDS so we train the model on a smaller vocabulary. The subwords8k subset has a vocabulary of roughly 8,000 subword tokens built from the most commonly used words in the reviews, which also means that we don’t have to build our own vocabulary dictionary! We can get good performance with this subset and substantially reduce training time. Loading the TFDS also gives us access to the tfds.features.text.SubwordTextEncoder, which is the TFDS text encoder.

We set with_info=True to enable access to information about the dataset and the encoder. We set as_supervised=True so that the returned TFDS has a two-tuple structure (input, label) in accordance with builder.info.supervised_keys. If set to False (the default), the returned TFDS will have a dictionary with all features included. We set shuffle_files=True because shuffling typically improves performance.
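
For comparison, here is a minimal sketch (not part of the pipeline we build in this chapter) of what we get if we leave as_supervised at its default of False, where each element is a feature dictionary:
# sketch: with as_supervised=False each element is a dictionary of features
raw, _ = tfds.load('imdb_reviews/subwords8k', with_info=True)
for example in raw['train'].take(1):
  print (example.keys())  # includes 'text' and 'label'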

Display the Keys

By loading the dataset as a TFDS, we have access to its keys:
imdb.keys()

We see that the dataset is split into test, train, and unsupervised samples.

Split into Train and Test Sets

Since we are building a supervised model, we only have interest in the train and test samples:
train, test = imdb['train'], imdb['test']

Display the First Sample

It’s always a good idea to explore a sample from the dataset:
br = '\n'  # line break to make printed output easier to read
for sample, target in train.take(1):
  print ('encoded review:')
  print (sample, br)
  print ('target:', target.numpy())

The first training example contains an encoded review and a label. The review is already encoded as a tensor of integers with datatype int64. The label is a scalar value of either 0 (negative) or 1 (positive) with datatype int64.

The shape of the review tensor indicates the number of words it contains. For readability, we convert the target tensor to values with the numpy method.

Display Information About the TFDS

The info object gives us access to metadata:
info

Peruse Metadata

See the number of examples in train and test splits:
train_size = info.splits['train'].num_examples
test_size = info.splits['test'].num_examples
train_size, test_size
See supervised keys:
info.supervised_keys
See feature information:
info.features
See the TFDS name and a slice of its description:
info.name, info.description[0:25]
We can even slice the citation string to get the title:
info.citation[184:242]

Create the Encoder

An encoder is built into the TFDS SubwordTextEncoder. With the encoder, we can easily decode (integer to text) and encode (text to integer). We access the encoder from the dataset’s info object.

Create an encoder based on the IMDb dataset we loaded into memory:
encoder = info.features['text'].encoder

Now that the encoder is built, we can use it to vectorize strings and decode vectorized strings back into text strings.

Test the encoder:
sample_string = 'What a Beautiful Day!'
encoded_string = encoder.encode(sample_string)
print ('Encoded string:', encoded_string)
original_string = encoder.decode(encoded_string)
print ('Original string:', original_string)
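
To see how the subword encoder breaks the sample string apart, decode each token id on its own (a quick sketch reusing encoded_string from above):
# sketch: decode each subword token id individually
for token in encoded_string:
  print ('{} ----> {}'.format(token, encoder.decode([token])))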

Use the Encoder on Samples

Create a function that returns the label rating in readable form:
def rev(d):
  if tf.math.equal(d, 0): return 'negative review'
  elif tf.math.equal(d, 1): return 'positive review'
Display the first review as shown in Listing 9-1.
for sample, target in train.take(1):
  print ('review:', end=' ')
  text = encoder.decode(sample)
  print (text[0:100])
  print ('opinion:', end=' ')
  print ('"' + rev(target) + '"')
Listing 9-1

Display the first review

Display multiple reviews as shown in Listing 9-2.
n = 6
for i, sample in enumerate(train.take(n)):
  if i > 0:
    print ('review', str(i+1) +':', end=' ')
    text = encoder.decode(sample[0])
    print (text[0:100])
    print ('opinion:', end=' ')
    print ('"' + rev(sample[1]) + '"')
    if i < n-1:
      print ()
Listing 9-2

Display multiple reviews

We skip the first review because we’ve already seen it.

Display vocabulary size:
print('Vocabulary size: {}'.format(encoder.vocab_size))

Finish the Input Pipeline

Create batches of the encoded strings (or reviews) to greatly enhance performance. Since machine learning algorithms expect batches of the same size, use the padded_batch method to zero-pad the sequences so that each review is the same length as the longest string in the batch.
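
Here is a toy sketch (with made-up sequences, not IMDb data) of what padded_batch does; every sequence in a batch is zero-padded to the length of that batch’s longest sequence:
# sketch: padded_batch on made-up variable-length sequences
toy = tf.data.Dataset.from_generator(
    lambda: iter([[1, 2], [3, 4, 5]]),
    output_types=tf.int64, output_shapes=[None])
for batch in toy.padded_batch(2):
  print (batch.numpy())  # [[1 2 0] [3 4 5]]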

Initialize variables:
BUFFER_SIZE = 10000
BATCH_SIZE = 64
Shuffle (where appropriate), batch, cache, and prefetch train and test sets as shown in Listing 9-3.
train_ds = (train
            .shuffle(BUFFER_SIZE)
            .padded_batch(BATCH_SIZE)
            .cache().prefetch(1))
test_ds = (test
           .padded_batch(BATCH_SIZE)
           .cache().prefetch(1))
Listing 9-3

Finish the input pipeline

Inspect tensors:
train_ds, test_ds

Consult the following URL for updates on padding character tensors:

www.tensorflow.org/tutorials/text/text_classification_rnn

Create the Model

Plant seeds, import libraries, clear previous models, and create the model as shown in Listing 9-4.
import numpy as np
# generate seed for reproducibility
tf.random.set_seed(0)
np.random.seed(0)
# import libraries
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GRU, Dense, Embedding
# clear any previous models
tf.keras.backend.clear_session()
# build the model
embed_size = 128
model = Sequential([
  Embedding(encoder.vocab_size, embed_size, mask_zero=True,
            input_shape=[None]),
  GRU(128, return_sequences=True),
  GRU(128),
  Dense(1, activation="sigmoid")
])
Listing 9-4

Create the model

The first layer is an embedding layer. The embedding layer creates word vectors for incoming words. During training, these representations (or word vectors) are learned in a way that places similar words closer to one another, so word vectors can capture relationships between words like good and great. Word vectors are dense because the model learns word relationships. As a result, they aren’t padded with a huge number of zeros the way one-hot encodings are.

The embedding layer accepts the vocabulary size, embedding size, and input shape. We set mask_zero=True so that padding tokens are ignored by all downstream layers. Ignoring padding tokens improves performance.
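
A quick sketch (with a made-up, zero-padded review) shows what mask_zero gives us; the embedding layer builds a boolean mask that flags the padded positions so downstream layers can skip them:
# sketch: mask_zero builds a boolean mask over padded positions
emb = tf.keras.layers.Embedding(encoder.vocab_size, embed_size, mask_zero=True)
fake_review = tf.constant([[12, 7, 95, 0, 0]])  # hypothetical ids, zero-padded
print (emb(fake_review).shape)                  # (1, 5, 128)
print (emb.compute_mask(fake_review).numpy())   # [[ True  True  True False False]]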

The next two layers are GRU layers, and the final layer is a single-neuron dense output layer. The output layer uses sigmoid activation to output the estimated probability that the review expresses positive sentiment about the movie.

Model Summary

Inspect the model:
model.summary()

The first layer is an embedding. So calculate the number of learnable parameters by multiplying vocabulary size of 8185 by embedding dimension (embed_size) of 128 for a total of 1,047,680.

The second layer is a GRU. The number of learnable parameters is based on the formula 3 × (n² + mn + 2n), where m is the input dimension and n is the output dimension. We multiply by 3 because a GRU has three sets of operations, each requiring weight matrices of these sizes. The 2n term appears because TensorFlow’s GRU implementation keeps two bias vectors (an input bias and a recurrent bias) for each set of operations. So we get 99,072 learnable parameters.

Here’s how we break down the result:
  • 3 × (128² + 128 × 128 + 2 × 128)

  • 3 × (16384 + 16384 + 256)

  • 3 × 33024

  • 99,072

As we can see, calculating learnable parameters for the second layer is pretty complex. So let’s break it down logically. A GRU layer is a feedforward layer with feedback loops. As in a feedforward network, we multiply the output from the previous layer (128 values) by the neurons at the current layer (128 neurons) to get the input weights. Because of the feedback mechanism of an RNN, the current layer’s 128 outputs are also fed back into the layer, which adds another 128² recurrent weights. We then account for the biases at this layer, and TensorFlow’s GRU keeps two bias vectors, hence the 2 × 128 term. Finally, a GRU uses three sets of operations (hidden state, reset gate, and update gate), each with its own weight matrices and biases, so we multiply the total by 3.

The third layer is a GRU. We get 99,072 learnable parameters because n and m are exactly the same as the second layer. So the calculations are the same.

The final layer is dense. So calculate the number of learnable parameters by multiplying the output dimension of 1 by the input dimension of 128 and adding 1 for the bias, for a total of 129.
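
We can verify these counts with a small sketch that recomputes them and compares against what Keras reports for each layer:
# sketch: recompute learnable parameters per layer
m, n = 128, 128                               # GRU input and output dimensions
embedding_params = encoder.vocab_size * 128   # 8185 * 128 = 1,047,680
gru_params = 3 * (n * n + m * n + 2 * n)      # 99,072 for each GRU layer
dense_params = 128 * 1 + 1                    # 129
print (embedding_params, gru_params, dense_params)
print ([layer.count_params() for layer in model.layers])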

Compile the Model

Compile:
model.compile(loss='binary_crossentropy', optimizer="adam",
              metrics=['accuracy'])

Train the Model

We found that two epochs provide pretty good accuracy without tuning. However, your results may differ. So you can experiment to your heart’s content. But keep in mind that training text models requires a lot of training time!
# to suppress unimportant error messages
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)
history = model.fit(train_ds, epochs=2, validation_data=test_ds)

Generalize on Test Data

Although model fit information provides validation loss and accuracy values during training, it is always a good idea to explicitly evaluate a model on test data because accuracy and loss values may differ:
test_loss, test_acc = model.evaluate(test_ds)

Visualize Training Performance

Visualize as shown in Listing 9-5.
import matplotlib.pyplot as plt
# history.history contains the training record
history_dict = history.history
acc = history_dict['accuracy']
val_acc = history_dict['val_accuracy']
loss = history_dict['loss']
val_loss = history_dict['val_loss']
epochs = range(1, len(acc) + 1)
plt.figure(figsize=(12,9))
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.ylim((0.5,1))
plt.show()
# clear previous figure
plt.clf()
plt.figure(figsize=(12,9))
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
Listing 9-5

Visualize training performance

Make Predictions from Fabricated Reviews

Let’s make predictions from reviews that we fabricate. Begin by creating a function that returns the predictions. Since we create our own reviews, the function must convert the text review for TensorFlow consumption.

Create the function:
def predict_review(text):
  encoded_text = encoder.encode(text)
  encoded_text = tf.cast(encoded_text, tf.float32)
  prediction = model.predict(tf.expand_dims(encoded_text, 0))
  return prediction

The function accepts a text review. It begins by encoding the review. It then converts the encoded review to float32. The function ends by making the prediction and returning it to the calling environment. We use tf.expand_dims to add a batch dimension of 1 to the encoded text so it can be consumed by the TensorFlow model.

Test the function with a fabricated review:
review = ('Just loved it. My kids thought the movie was cool. '
         'Even my wife liked it.')
pred = predict_review(review)
pred, pred.shape

We have a prediction. Predictions greater than 0.5 mean that the review is positive. Otherwise, the review is negative.

Let’s make the prediction more palatable by creating another function:
def palatable(pred):
  score = tf.squeeze(pred, 0).numpy()
  return score[0]

The function converts the prediction to a numpy scalar.

Invoke the function:
score = palatable(pred)
score, score.shape

The function removes the 1 dimension from the prediction.

Let’s go a step further by creating a function that returns the review in plain English:
def impression(score):
  if score >= 0.5:
    return 'positive impression'
  else:
    return 'negative impression'
Invoke the function:
impression(score)

As expected, the review is positive.

Let’s try another one:
review = ('The movie absolutely sucked. '
          'No character development. '
          'Dialogue just blows.')
pred = predict_review(review)
score = palatable(pred)
print (impression(score))

As expected, the review is negative.

Make Predictions on a Test Data Batch

We can also predict from the test set. Let’s make predictions on the first test batch with the predict method. Since test data is already in tensor form, we don’t need to encode.

Predict from the test set batch and display the first review as shown in Listing 9-6.
# get predictions from 1st test batch
for sample, target in test_ds.take(1):
  y_pred_64 = model.predict(sample)
# display first review from this batch
print ('review:', end=' ')
print (encoder.decode(sample[0])[177:307])
# display first label from this batch
print ('label:', end=' ')
print (target[0].numpy(), br)
# display number of examples in the batch
print ('samples and target in first batch:', end=' ')
len(sample), len(target)
Listing 9-6

Make predictions based on a batch from the test set

We take the first batch from test_ds. We make predictions with the predict method and place them in y_pred_64. Variable y_pred_64 holds 64 predictions because batch size is 64. We then display the first review and its associated label from this batch. Remember that label 1 means the review is positive and label 0 means it is negative. We end by displaying the size of the sample and target to verify that we have 64 examples in our first batch.

Get the first prediction:
print (y_pred_64[0])
Make it palatable:
impression(y_pred_64[0])
Compare the prediction to the actual label:
impression(y_pred_64[0]), impression(target[0].numpy())

If the prediction matches the actual label, it was correct.

Listing 9-7 displays prediction efficacy for the first five predictions.
for i in range(5):
  p = impression(y_pred_64[i])
  t = impression(target[i].numpy())
  print (i, end=': ')
  if p == t: print ('correct')
  else: print ('incorrect')
Listing 9-7

Prediction efficacy for five predictions

Prediction Accuracy for the First Batch

Create a function to convert an impression back to a label of either 1 or 0:
def convert_label(feeling):
  if feeling == 'positive impression':
    return 1
  else: return 0
Return prediction accuracy for the entire first batch as shown in Listing 9-8.
ls = []
n = len(target)
for i, _ in enumerate(range(n)):
  t = target[i].numpy() # labels
  p = convert_label(impression(y_pred_64[i])) # predictions
  if t == p: ls.append(True)
correct = ls.count(True)
acc = correct / n
batch_accuracy = str(round(acc * 100)) + '%'
print ('accuracy for the first batch:', batch_accuracy)
Listing 9-8

Prediction accuracy for the first batch

We begin by traversing the first batch and comparing labels to predictions. If a prediction is correct, we add this information to a list. We continue by counting the number of correct predictions. We end by dividing correct predictions by the batch size to get overall prediction accuracy.
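
As a sanity check, the same batch accuracy can be computed in one vectorized step with the y_pred_64 and target variables from Listing 9-6:
# sketch: vectorized accuracy for the first batch
vec_acc = np.mean((y_pred_64.squeeze() >= 0.5) == target.numpy())
print ('accuracy for the first batch:', str(int(round(vec_acc * 100))) + '%')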

Leverage Pretrained Embeddings

Amazingly, we can reuse modules from pretrained models on the IMDb dataset. The TensorFlow Hub project is a library with hundreds of reusable machine learning modules. A module is a self-contained piece of a TensorFlow graph, along with its weights and assets, that can be reused across different tasks in a process known as transfer learning. Transfer learning is a machine learning method where a model developed for a task is reused as the starting point for a model on a different task.

You can browse the library by perusing the following URL:

http://tfhub.dev

Once you locate a module, copy the URL into your model. The module is automatically downloaded along with its pretrained weights. A huge advantage of using pretrained models is that we don’t have to create and train our own models from scratch!

Load the IMDb Dataset

Since we are using a pretrained model, we can load the IMDb dataset with the full vocabulary:
data, info = tfds.load('imdb_reviews', as_supervised=True,
                       with_info=True, shuffle_files=True)

We use the full vocabulary because we don’t have to worry about training with it!

Display metadata:
info

Build the Input Pipeline

Create train and test sets:
train, test = data['train'], data['test']
Batch and prefetch:
batch_size = 32
train_set = train.repeat().batch(batch_size).prefetch(1)
test_set = test.batch(batch_size).prefetch(1)
Inspect tensors:
train_set, test_set

Create the Pretrained Model

Import the TF Hub library and create a skeleton model to house the pretrained module as shown in Listing 9-9.
import tensorflow_hub as hub
# clear any previous models
tf.keras.backend.clear_session()
model = tf.keras.Sequential([
  hub.KerasLayer(
      'https://tfhub.dev/google/tf2-preview/nnlm-en-dim50/1',
      dtype=tf.string, input_shape=[], output_shape=[50]),
  Dense(128, activation="relu"),
  Dense(1, activation="sigmoid")
])
Listing 9-9

Create the model

The hub.KerasLayer downloads the sentence encoder module. Each string input to this layer is automatically encoded as a single 50-dimensional vector that represents the whole review. The embedding is based on a matrix pretrained on the 7 billion–word Google News corpus. The next two dense layers are added to provide a basic model for sentiment analysis. Using TF Hub is convenient and efficient because we can reuse what the pretrained model has already learned.
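
A short sketch (with two made-up sentences) confirms that the hub layer produces a single 50-dimensional vector per input string:
# sketch: the hub layer maps each raw string to one 50-dimensional vector
sentences = tf.constant(['What a beautiful day!', 'The movie was terrible.'])
print (model.layers[0](sentences).shape)  # (2, 50)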

Compile the Model

Compile the model:
model.compile(loss='binary_crossentropy', optimizer="adam", metrics=['accuracy'])

Train the Model

Train the model:
# to suppress unimportant error messages
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)
history = model.fit(train_set,
                    steps_per_epoch=train_size // batch_size,
                    epochs=5, validation_data=test_set)

Training time is substantially reduced!

Make Predictions

Make predictions based on the first test_set batch:
for sample, target in test_set.take(1):
  y_pred_32 = model.predict(sample)

Since batch size is 32, we have 32 predictions for each batch.

Display misclassifications in test_set by index as shown in Listing 9-10.
for i in range(batch_size):
  p = convert_label(impression(y_pred_32[i]))
  l = target[i].numpy()
  if p != l:
    print ('pred:', p, 'actual:', l, 'indx:', i)
Listing 9-10

Misclassifications in the first batch by index

Calculate Prediction Accuracy for the First Batch

Listing 9-11 shows the code to calculate prediction accuracy for the first batch.
ls = []
n = len(target)
for i, _ in enumerate(range(n)):
  t = target[i].numpy() # labels
  p = convert_label(impression(y_pred_32[i])) # predictions
  if t == p: ls.append(True)
correct = ls.count(True)
acc = correct / n
batch_accuracy = str(round(acc * 100)) + '%'
print ('accuracy for the first batch:', batch_accuracy)
Listing 9-11

Calculate prediction accuracy for the first batch

Instead of finding misclassifications, the code finds correct predictions. The code begins by comparing an actual label to a predicted one. If the actual label is predicted correctly, a Boolean True is appended to a list. Once the first batch is traversed, the number of elements in the list is counted. This count is divided by batch size of 32 to determine accuracy, which is displayed as a percentage.

Explore IMDb with Keras

Since Keras is very popular in industry, we demonstrate how to train IMDb with keras.datasets. We use the keras.datasets.imdb.load_data function to load the dataset in a format-ready fashion for use in neural network and deep learning models.

Loading the Keras IMDb has some advantages. First, words have already been encoded as integers. Second, encoded words are ranked by their overall frequency in the dataset. So sentences in each review are comprised of a sequence of integers. Third, calling imdb.load_data the first time downloads IMDb to your computer and caches it in your home directory under ~/.keras/datasets/. The imdb.load_data function also provides additional arguments, including the number of top words to load (less frequent words are replaced with an out-of-vocabulary token), the number of top words to skip (to avoid very common words like the), and the maximum review length to keep.
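
Here is a hedged sketch of those additional arguments (the values are arbitrary, chosen only for illustration):
# sketch: extra load_data arguments with arbitrary values
(x_tr, y_tr), (x_te, y_te) = tf.keras.datasets.imdb.load_data(
    num_words=8000,  # keep only the 8,000 most frequent words
    skip_top=20,     # replace the 20 most frequent words (like 'the') with the OOV token
    maxlen=500)      # drop reviews longer than 500 words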

Load the Keras IMDb:
train, test = tf.keras.datasets.imdb.load_data()

The function loads data into train and test tuples. So train[0] contains training reviews and train[1] contains training labels. And test[0] contains test reviews and test[1] contains test labels. Each review is a sequence of integers, with each integer representing a word. The labels are arrays of integers (0 is a negative review and 1 is positive).

For readability, create variables to represent reviews and labels:
train_reviews, train_labels = train[0], train[1]
test_reviews, test_labels = test[0], test[1]
Display the shape of train and test review samples:
train_reviews.shape, test_reviews.shape

As expected, we have 25,000 train and 25,000 test reviews.

Display the shape of train and test labels:
train_labels.shape, test_labels.shape

As expected, we have 25,000 train and 25,000 test labels.

Explore the Train Sample

Display label categories and number of unique words:
print ('categories:', np.unique(train_labels))
print ('number of unique words:',
       len(np.unique(np.hstack(train_reviews))))

The dataset is labeled by two categories that represent sentiment of each review. And the train sample contains 88,585 unique words.

Let’s see how many words are in the longest training review:
longest = np.amax([len(i) for i in train_reviews])
print ('longest review:', longest)

We create a list containing the number of words in each review and then find the length of the review with the maximum number of words.

Get the index of the longest review:
mid_result = np.where(np.asarray([len(i) for i in train_reviews]) == longest)
longest_index = mid_result[0][0]
longest_index

We use the np.where function to find the index. We used double indexing because the function returns a tuple containing a list that holds the index we desire.

Create a Decoding Function

Create a function that decodes a review into English readable form as shown in Listing 9-12.
def readable(review):
  index = tf.keras.datasets.imdb.get_word_index()
  reverse_index = dict([(value, key)
                        for (key, value) in index.items()])
  return ' '.join( [reverse_index.get(i - 3, '?')
                    for i in review])
Listing 9-12

Function that decodes a review

The function uses the tf.keras.datasets.imdb.get_word_index utility to obtain a dictionary of words and their uniquely assigned integers. The function then creates another dictionary containing key-value groupings as value and key groupings from the first dictionary. Finally, it returns the words based on their IDs (or keys). The indices are offset by 3 because 0, 1, and 2 are reserved indices for padding, start of sequence, and unknown.
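
A quick sketch confirms the offset with the most frequent word in the index:
# sketch: 'the' has raw index 1, so it appears as 1 + 3 = 4 in encoded reviews
index = tf.keras.datasets.imdb.get_word_index()
print (index['the'])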

Invoke the Decoding Function

Let’s see what the longest review looks like as shown in Listing 9-13.
review = readable(train_reviews[longest_index])
print ()
print ('review:', end=' ')
# just display a slice of the full review
print (review[:50] + ' ...', br)
label = train_labels[longest_index]
idea = impression(label)
print (idea, br)
# verify length of review
print (len(train_reviews[longest_index]))
Listing 9-13

Decode the longest review

Since we already know the index of the longest review, we can easily retrieve it from train_reviews. Display a slice of it since the review is pretty long. We can also easily retrieve the label from train_labels. Make the label readable and display it. Finally, display the length of the longest review.

Let’s see what the shortest review looks like. But we can’t do this directly.

We must first find the minimum number of words:
shortest = np.amin([len(i) for i in train_reviews])
print ('shortest review:', shortest)

Since we don’t know which review is the shortest, we use the amin method to return the minimum.

We can now get the index of the shortest review in the train sample:
result = np.where(np.asarray([len(i) for i in train_reviews]) == shortest)
shortest_index = result[0][0]
shortest_index

We use the where method to return the index we seek. Since the method returns all reviews that meet the criterion, we grab the first one and display its index.

Listing 9-14 displays the review, its label in readable form, and its length.
review = readable(train_reviews[shortest_index])
print (review[2:], br)
label = train_labels[shortest_index]
idea = impression(label)
print (idea, br)
# verify length of review
print (len(train_reviews[shortest_index]))
Listing 9-14

Display the shortest review

Continue Exploring the Training Sample

Return the average review length:
length = [len(i) for i in train_reviews]
print ('average review length:', np.mean(length))
Display the first label and its review encoded as integers as shown in Listing 9-15.
first_label = train_labels[0]
print('label:', first_label, end=' ')
idea = impression(first_label)
print ('(' + idea + ')', br)
# display slice of first review
print (train_reviews[0][:20])
# display readable slice of first review
print (readable(train_reviews[0][:20]))
Listing 9-15

First label and its review

Display the first review in readable form:
review = readable(train_reviews[0])
print (review[2:105] + ' ...')

Train Keras IMDb Data

Limit vocabulary size to improve performance:
# limit vocabulary to 8000 most commonly used words in reviews
vocab_size = 8000
Cut text size to 80 words to improve performance:
maxlen = 80

Load Data

Load data with the limited vocabulary:
(x_train, y_train), (x_test, y_test) = \
    tf.keras.datasets.imdb.load_data(num_words=vocab_size)
Display information about train and test data:
print ('train and test features:')
print (len(x_train), 'train sequences')
print (len(x_test), 'test sequences', br)
print ('sequence shape before padding:')
print ('x_train shape:', x_train.shape)
print ('x_test shape:', x_test.shape)

Pad Samples

Convert train and test sets to numpy:
x_train = np.asarray(x_train)
x_test = np.asarray(x_test)
Import the appropriate library:
from tensorflow.keras.preprocessing.sequence import pad_sequences
Pad samples to ensure that all sequences are of the same length:
print('padded sequences (samples, maxlen):')
x_train = pad_sequences(x_train, maxlen=maxlen)
x_test = pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)
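
A toy sketch (with made-up sequences) shows how pad_sequences handles short and long inputs:
# sketch: short sequences are left-padded, long ones truncated from the front
print (pad_sequences([[1, 2], [3, 4, 5, 6, 7]], maxlen=4))
# [[0 0 1 2]
#  [4 5 6 7]]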

Build the Input Pipeline

Initialize pipeline variables:
buffer_size = 10000
batch_size = 512
Prepare train data for TensorFlow consumption:
train_k = tf.data.Dataset.from_tensor_slices(
    (x_train, y_train))
train_ks = train_k.shuffle(
    buffer_size).batch(batch_size).prefetch(1)
Prepare test data for TensorFlow consumption:
test_k = tf.data.Dataset.from_tensor_slices(
    (x_test, y_test))
test_ks = test_k.batch(batch_size).prefetch(1)

Build the Model

Clear any previous models:
tf.keras.backend.clear_session()
Create the model:
embed_size = 128
model = Sequential([
  Embedding(vocab_size, embed_size, mask_zero=True,
            input_shape=[None]),
  GRU(128, return_sequences=True),
  GRU(128),
  Dense(1, activation="sigmoid")
])

Compile the Model

Compile:
model.compile(loss='binary_crossentropy', optimizer="adam",
              metrics=['accuracy'])

Train the Model

Suppress error messages:
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)
Train:
epochs = 2
model.fit(train_ks, epochs=epochs, validation_data=test_ks)

Predict

Get predictions:
k_pred = model.predict(test_ks)
Display the first prediction:
impression(k_pred[0][0])
Display a slice of the review:
pred_first = readable(x_test[0])
pred_first[26:53]
Compare with the impression of the actual label:
impression(y_test[0])