Data pre-processing

Let's pre-process the raw data into its encoded form. We will extract fixed-length character sequences (called sentences in the code), one-hot encode them, and build a tensor of shape (sequences, maxlen, unique_characters), as shown in the figure below. At the same time we will prepare the target array y, which holds the one-hot encoded character that follows each extracted sequence.
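To make the sliding-window idea concrete, here is a minimal sketch on a toy string. The names `toy_text`, `toy_maxlen` and `toy_step` and their values are assumptions chosen for illustration only; they are not the corpus or the settings used in this chapter.

```python
# Toy illustration of the sliding-window extraction (hypothetical values).
toy_text = "hello world"
toy_maxlen = 4   # length of each extracted window
toy_step = 3     # distance between the starts of consecutive windows

windows = []   # extracted fixed-length sequences
targets = []   # the character that follows each window

for i in range(0, len(toy_text) - toy_maxlen, toy_step):
    windows.append(toy_text[i: i + toy_maxlen])
    targets.append(toy_text[i + toy_maxlen])

for window, target in zip(windows, targets):
    print(repr(window), '->', repr(target))
```

Each window is paired with the single character that comes right after it, which is what the model will later be asked to predict.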

This is the code we'll use to pre-process the data.

import numpy as np

# Length of extracted character sequences
maxlen = 100

# We sample a new sequence every 5 characters
step = 5

# List to hold extracted sequences
sentences = []

# List to hold the target characters 
next_chars = []

# Extracting sentences and the next characters.
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('Number of sequences:', len(sentences))

# Sorted list of unique characters in the corpus
chars = sorted(set(text))

# Dictionary mapping unique characters to their index in `chars`
char_indices = {char: i for i, char in enumerate(chars)}

# Converting characters into one-hot encoding.
# The built-in `bool` is used because the `np.bool` alias was removed in NumPy 1.24.
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
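One way to sanity-check the encoding is to decode the one-hot arrays back into characters and compare them with the original sequences. The sketch below does this on a small hypothetical corpus; all `toy_*` names and values are assumptions made for the example, and the same check could be run on `x`, `y`, `sentences` and `next_chars`.

```python
import numpy as np

# Hypothetical toy corpus; the chapter's real `text` is loaded elsewhere.
toy_text = "abcabcabcabc"
toy_maxlen, toy_step = 3, 2

starts = range(0, len(toy_text) - toy_maxlen, toy_step)
toy_sentences = [toy_text[i: i + toy_maxlen] for i in starts]
toy_next = [toy_text[i + toy_maxlen] for i in starts]

toy_chars = sorted(set(toy_text))
toy_index = {c: i for i, c in enumerate(toy_chars)}

# One-hot encode exactly as in the listing above.
toy_x = np.zeros((len(toy_sentences), toy_maxlen, len(toy_chars)), dtype=bool)
toy_y = np.zeros((len(toy_sentences), len(toy_chars)), dtype=bool)
for i, sentence in enumerate(toy_sentences):
    for t, c in enumerate(sentence):
        toy_x[i, t, toy_index[c]] = True
    toy_y[i, toy_index[toy_next[i]]] = True

# Decode: the position of the single True value gives the character index.
decoded = ["".join(toy_chars[k] for k in row.argmax(axis=-1)) for row in toy_x]
decoded_next = [toy_chars[k] for k in toy_y.argmax(axis=-1)]

print(toy_x.shape, toy_y.shape)
print(decoded == toy_sentences, decoded_next == toy_next)
```

If the round trip reproduces the original sequences and targets, the index mapping and the array shapes are consistent.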
