One hot encoding

In this function, we will take the dictionary we just built and one hot encode the text of each phrase.

Once we're done, we will be left with three NumPy arrays. Each of them will be of shape [number of texts * max sequence length * number of tokens]. If you squint, and think back to the simpler times of Chapter 10, Training LSTMs with Word Embeddings from Scratch, you can see this is really the same as the input side of the other NLP models we've built. We will define one hot encoding using the following code:

import numpy as np

def one_hot_vectorize(data):
    input_chars = data['input_chars']
    target_chars = data['target_chars']
    input_texts = data['input_texts']
    target_texts = data['target_texts']
    max_encoder_seq_length = data['max_encoder_seq_length']
    max_decoder_seq_length = data['max_decoder_seq_length']
    num_encoder_tokens = data['num_encoder_tokens']
    num_decoder_tokens = data['num_decoder_tokens']

    # map each character to its integer index in the vocabulary
    input_token_index = dict([(char, i) for i, char in
                              enumerate(input_chars)])
    target_token_index = dict([(char, i) for i, char in
                               enumerate(target_chars)])

    # pre-allocate the three one hot encoded tensors
    encoder_input_data = np.zeros((len(input_texts),
        max_encoder_seq_length, num_encoder_tokens), dtype='float32')
    decoder_input_data = np.zeros((len(input_texts),
        max_decoder_seq_length, num_decoder_tokens), dtype='float32')
    decoder_target_data = np.zeros((len(input_texts),
        max_decoder_seq_length, num_decoder_tokens), dtype='float32')

    for i, (input_text, target_text) in enumerate(zip(input_texts,
                                                      target_texts)):
        for t, char in enumerate(input_text):
            encoder_input_data[i, t, input_token_index[char]] = 1.
        for t, char in enumerate(target_text):
            # decoder_target_data is ahead of decoder_input_data
            # by one timestep
            decoder_input_data[i, t, target_token_index[char]] = 1.
            if t > 0:
                # decoder_target_data will be ahead by one timestep
                # and will not include the start character.
                decoder_target_data[i, t - 1, target_token_index[char]] = 1.

    data['input_token_index'] = input_token_index
    data['target_token_index'] = target_token_index
    data['encoder_input_data'] = encoder_input_data
    data['decoder_input_data'] = decoder_input_data
    data['decoder_target_data'] = decoder_target_data
    return data
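
As a quick sanity check, and purely as a sketch, we could call the function and print the shapes of the arrays it adds. This assumes data is the dictionary returned by the vocabulary-building function from the previous section:

data = one_hot_vectorize(data)
print(data['encoder_input_data'].shape)
# -> (number of pairs, max_encoder_seq_length, num_encoder_tokens)
print(data['decoder_input_data'].shape)
# -> (number of pairs, max_decoder_seq_length, num_decoder_tokens)
print(data['decoder_target_data'].shape)
# -> same shape as decoder_input_data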

There are three training vectors that we create in this code. Before moving on, I want to make sure we understand each of these vectors:

  • encoder_input_data is a 3D matrix of shape (number_of_pairs, max_english_sequence_length, number_of_english_characters).
  • decoder_input_data is a 3D matrix of shape (number_of_pairs, max_french_sequence_length, number_of_french_characters).
  • decoder_target_data is the same as decoder_input_data shifted one timestep ahead. This means that decoder_input_data[:, t+1, :] is equal to decoder_target_data[:, t, :].

Each of the preceding vectors is a one hot encoded representation of an entire phrase at the character level. This means that if our input phrase was Go!, the first time step of the vector would contain an element for every possible English character in the text. Each of these elements would be set to 0, except the one for G, which would be set to 1.
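
To make that concrete, here is a small check. It's only a sketch; it assumes the phrase Go! actually appears in input_texts and that the vectorized data dictionary is still in scope:

import numpy as np

idx = data['input_texts'].index('Go!')           # position of the phrase "Go!"
first_step = data['encoder_input_data'][idx, 0]  # one hot vector for timestep 0
print(first_step.sum())                          # 1.0 -- only one element is hot
print(np.argmax(first_step) == data['input_token_index']['G'])  # True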

Our goal will be to train a sequence-to-sequence model to predict decoder_target_data using encoder_input_data and decoder_input_data as our input features.
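
Although we won't build the actual network until the next section, the following is a rough, purely illustrative sketch of how these three arrays typically line up with a Keras training call under teacher forcing. The seq2seq_model name, batch size, and epoch count here are assumptions, not the chapter's final values:

# Illustrative only: assumes seq2seq_model is a Keras model (built later)
# that takes the encoder and decoder inputs and predicts the decoder
# targets one timestep ahead (teacher forcing).
seq2seq_model.fit(
    [data['encoder_input_data'], data['decoder_input_data']],
    data['decoder_target_data'],
    batch_size=64,        # assumed value
    epochs=100,           # assumed value
    validation_split=0.2)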

And at long last our data prep is done, so we can start to build our sequence-to-sequence network architecture.
