Data pre-processing

Let's perform the data pre-processing to convert the raw data into the encoded form. We will extract fixed length sentences, then encode those using a one-hot encoding process and finally build a tensor of shape (sequence, maxlen, unique_characters) as shown in the figure below. At the same time we will prepare the target vector y, to contain the associated next character that follows each extracted sequence.

This is the code we'll use to pre-process the data.

# Length of extracted character sequences
maxlen = 100

# We sample a new sequence every 5 characters
step = 5

# List to hold extracted sequences
sentences = []

# List to hold the target characters
next_chars = []

# Extracting sentences and the next characters.
for i in range(0, len(text) - maxlen, step):
sentences.append(text[i: i + maxlen])
next_chars.append(text[i + maxlen])
print('Number of sequences:', len(sentences))

# List of unique characters in the corpus
chars = sorted(list(set(text)))

# Dictionary mapping unique characters to their index in `chars`
char_indices = dict((char, chars.index(char)) for char in chars)

# Converting characters into one-hot encoding.
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=np.bool)
y = np.zeros((len(sentences), len(chars)), dtype=np.bool)
for i, sentence in enumerate(sentences):
for t, char in enumerate(sentence):
x[i, t, char_indices[char]] = 1
y[i, char_indices[next_chars[i]]] = 1

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.14.134.17