Loading data

There is quite a bit involved in loading this data. You might want to refer to the code block as you read through this text.

The first for loop in the code below will loop through the entire input file, or through however many samples we specify when we call load_data(). I'm capping the sample count because you might not have the RAM to load the entire dataset. You can get good results with as few as 10,000 examples; however, more data is generally better.

As we loop through the input file, line by line, we're doing several things at once:

  • We're wrapping each French translation in a '\t' (tab) to start the phrase and a '\n' (newline) to end it. These correspond to the <SOS> and <EOS> tags I used in the sequence-to-sequence diagram, and they will allow us to use '\t' as an input to seed the decoder when we want to generate a translation sequence.
  • We are splitting each line into the English input and its respective French translation, which are stored in the lists input_texts and target_texts (see the short sketch after this list for the raw line format).
  • Finally, we are adding each character of both the input and target text into a set. Those sets are called input_characters and target_characters. We will use these sets when it's time to one hot encode our phrases.
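
For reference, each line of fra.txt pairs an English phrase with its French translation, separated by a tab, which is why we split on '\t'. Here is a minimal sketch of that format; the exact phrases are illustrative and will vary with your copy of the dataset:

line = 'Hello.\tSalut.'  # one raw line from fra.txt: English phrase, tab, French phrase
input_text, target_text = line.split('\t')
# input_text  -> 'Hello.'
# target_text -> 'Salut.'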

After our loop completes, we will convert the character sets into sorted lists. We will also create variables called num_encoder_tokens and num_decoder_tokens to hold the size of each of these lists. We will need these later for one hot encoding as well.
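
To make that concrete, here's a tiny, self-contained illustration (not part of load_data() itself): the sorted character set gives us a stable vocabulary, and its length becomes the width of each one hot vector.

chars = sorted(set('bonjour'))  # ['b', 'j', 'n', 'o', 'r', 'u']
num_tokens = len(chars)         # 6, so each character becomes a 6-wide one hot vector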

In order to get the inputs and targets into a matrix, we will need to pad the phrases to the length of the longest phrase, just as we did in the last chapter. To do that, we will need to know the longest phrase. We will store that in max_encoder_seq_length and max_decoder_seq_length, as shown in the following code:

def load_data(num_samples=50000, start_char='\t', end_char='\n',
              data_path='data/fra-eng/fra.txt'):
    input_texts = []
    target_texts = []
    input_characters = set()
    target_characters = set()
    lines = open(data_path, 'r', encoding='utf-8').read().split('\n')
    for line in lines[: min(num_samples, len(lines) - 1)]:
        input_text, target_text = line.split('\t')
        # start_char and end_char play the roles of <SOS> and <EOS>
        target_text = start_char + target_text + end_char
        input_texts.append(input_text)
        target_texts.append(target_text)
        for char in input_text:
            if char not in input_characters:
                input_characters.add(char)
        for char in target_text:
            if char not in target_characters:
                target_characters.add(char)

    input_characters = sorted(list(input_characters))
    target_characters = sorted(list(target_characters))
    num_encoder_tokens = len(input_characters)
    num_decoder_tokens = len(target_characters)
    max_encoder_seq_length = max([len(txt) for txt in input_texts])
    max_decoder_seq_length = max([len(txt) for txt in target_texts])

    print('Number of samples:', len(input_texts))
    print('Number of unique input tokens:', num_encoder_tokens)
    print('Number of unique output tokens:', num_decoder_tokens)
    print('Max sequence length for inputs:', max_encoder_seq_length)
    print('Max sequence length for outputs:', max_decoder_seq_length)
    return {'input_texts': input_texts,
            'target_texts': target_texts,
            'input_chars': input_characters,
            'target_chars': target_characters,
            'num_encoder_tokens': num_encoder_tokens,
            'num_decoder_tokens': num_decoder_tokens,
            'max_encoder_seq_length': max_encoder_seq_length,
            'max_decoder_seq_length': max_decoder_seq_length}
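
As a quick sanity check, here's a minimal sketch of how load_data() might be called; the smaller num_samples value is just my own choice for a first run:

data = load_data(num_samples=10000)  # start small if RAM is tight

# The returned dictionary carries everything the one hot encoder will need
input_texts = data['input_texts']
target_texts = data['target_texts']
num_encoder_tokens = data['num_encoder_tokens']
num_decoder_tokens = data['num_decoder_tokens']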

After our data is loaded, we will return all this information in a dictionary that can be passed along to a function that will one hot encode each phrase. Let's do that next.
