Preparing the data

Because we're using a built-in dataset, Keras takes care of a great deal of the mundane work we'd otherwise need to do around tokenizing, stemming, removing stop words, and converting our word tokens into numeric tokens. keras.datasets.imdb will give us a list of lists, each containing a variable-length sequence of integers representing the words in a review. We will define our data using the following code:

from keras.datasets import imdb

def load_data(vocab_size):
    data = dict()
    data["vocab_size"] = vocab_size
    (data["X_train"], data["y_train"]), (data["X_test"], data["y_test"]) = \
        imdb.load_data(num_words=vocab_size)
    return data

We can load our data by calling load_data and choosing a maximum size for our vocabulary. For this example, I'll use 20,000 words as the vocabulary size.

If you need to do this step by hand to make the example code work with your own data, you can use the keras.preprocessing.text.Tokenizer class, which we will also cover in the next example. We will load our data using the following code:

data = load_data(20000)
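To make what the Tokenizer does less mysterious, here is a plain-Python sketch of the same idea: build a word-to-integer map ranked by frequency, then map each review to a sequence of integers. This is an illustration of the concept, not the Keras implementation; the function names and the tiny corpus here are invented for the example.

```python
from collections import Counter

def build_vocab(texts, vocab_size):
    # Count word frequencies across the corpus.
    counts = Counter(word for text in texts for word in text.lower().split())
    # Most frequent words get the lowest indices; 0 is reserved for padding.
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common(vocab_size))}

def texts_to_sequences(texts, vocab):
    # Replace each known word with its integer index; drop out-of-vocabulary words.
    return [[vocab[w] for w in text.lower().split() if w in vocab] for text in texts]

reviews = ["the movie was great", "the movie was not great"]
vocab = build_vocab(reviews, vocab_size=10)
print(texts_to_sequences(reviews, vocab))
```

The key point is that capping the vocabulary at vocab_size (20,000 in our case) keeps only the most frequent words, which is exactly what the num_words argument to imdb.load_data does for us.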

As a next step, I'd like each of these sequences to be the same length, and I need this list of lists to be a 2D matrix, where each review is a row and each column is a word in the review. To get each list to be the same size, I will pad the shorter sequences with 0s. The LSTM we will use later will learn to ignore those 0s, which is of course very convenient for us.

This padding operation is fairly common, enough so that it is built into Keras. We can use keras.preprocessing.sequence.pad_sequences to accomplish this, using the following code:

from keras.preprocessing import sequence

def pad_sequences(data):
    data["X_train"] = sequence.pad_sequences(data["X_train"])
    data["sequence_length"] = data["X_train"].shape[1]
    data["X_test"] = sequence.pad_sequences(data["X_test"],
                                            maxlen=data["sequence_length"])
    return data

Invoking this function converts our list of lists into equal-length sequences and, conveniently, into a 2D matrix, as follows:

data = pad_sequences(data)
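If the padding behavior isn't obvious, this minimal sketch shows what pad_sequences does by default: it prepends zeros ("pre" padding) until every sequence reaches the length of the longest one, and truncates from the front when a sequence is too long. The function here is an illustrative stand-in, not the Keras code.

```python
def pad_pre(sequences, maxlen=None):
    # Default to the length of the longest sequence, as Keras does.
    maxlen = maxlen or max(len(s) for s in sequences)
    # Prepend zeros to short sequences; keep only the last maxlen items of long ones.
    return [[0] * (maxlen - len(s)) + list(s)[-maxlen:] for s in sequences]

print(pad_pre([[5, 3], [7, 2, 9, 1]]))
```

Here the short review gains two leading zeros so that both rows have four entries, which is what lets us stack the reviews into a single 2D matrix.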