Data Preparation

Here we will build our question answering system. For this project, we need a dataset of question-and-answer pairs, as shown in the following image. Both columns are sequences of words, which is exactly what we need to feed into our seq2seq model. Note also that the sentences can have dynamic (varying) lengths.
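
For instance (a made-up illustration rather than actual rows from the dataset), the two files pair up line by line, with the question on the left and its answer at the same line index on the right:

from.txt (questions)                  to.txt (answers)
hi how are you                        i am fine thank you
where do you live                     i live in new york city
can you help me with my homework      sure what subject is it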

Let's load them and perform the same data processing using build_dataset(). In the end, we will have a dictionary whose keys are words and whose values are the integer IDs assigned to them, along with a count of how often each word occurs in the respective corpus. We also get the four extra tokens (GO, PAD, EOS, and UNK) that we talked about earlier in this chapter.
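
The build_dataset() helper comes from the accompanying utils module. As a rough guide, a minimal sketch of what such a helper typically looks like, assuming the common word2vec-style pattern with the four special tokens reserved at the start of the vocabulary (the book's actual implementation may differ in its details), is:

import collections

def build_dataset(words, vocabulary_size):
    # Reserve the four special tokens at the lowest IDs; counts are placeholders.
    count = [['PAD', -1], ['GO', -1], ['EOS', -1], ['UNK', -1]]
    # Keep the most frequent words, up to the requested vocabulary size.
    count.extend(collections.Counter(words).most_common(vocabulary_size))
    dictionary = {}
    for word, _ in count:
        dictionary[word] = len(dictionary)      # word -> integer ID
    # Encode the whole corpus as IDs, falling back to UNK for unseen words.
    data = [dictionary.get(word, dictionary['UNK']) for word in words]
    rev_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, rev_dictionary

With that in place, let's load both corpora and build a vocabulary for each: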

import numpy as np
import tensorflow as tf
import collections
from utils import *   # provides build_dataset()


# Folder that holds the question (from.txt) and answer (to.txt) files.
file_path = './conversation_data/'

with open(file_path + 'from.txt', 'r') as fopen:
    text_from = fopen.read().lower().split('\n')   # one question per line
with open(file_path + 'to.txt', 'r') as fopen:
    text_to = fopen.read().lower().split('\n')     # one answer per line
print('len from: %d, len to: %d' % (len(text_from), len(text_to)))


# Build the question-side vocabulary: every unique word gets an integer ID.
concat_from = ' '.join(text_from).split()
vocabulary_size_from = len(list(set(concat_from)))
data_from, count_from, dictionary_from, rev_dictionary_from = build_dataset(concat_from, vocabulary_size_from)


# Do the same for the answer side.
concat_to = ' '.join(text_to).split()
vocabulary_size_to = len(list(set(concat_to)))
data_to, count_to, dictionary_to, rev_dictionary_to = build_dataset(concat_to, vocabulary_size_to)


# Integer IDs of the four special tokens, taken from the question-side vocabulary.
GO = dictionary_from['GO']
PAD = dictionary_from['PAD']
EOS = dictionary_from['EOS']
UNK = dictionary_from['UNK']
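
To make the roles of these IDs concrete, here is a small hypothetical helper (not part of the book's utils module) that turns a raw sentence into a fixed-length sequence of IDs: unknown words fall back to UNK, the sequence is terminated with EOS, and any remaining positions are filled with PAD. The GO ID serves a similar purpose on the decoder side, where it is typically prepended as the first decoder input.

def sentence_to_ids(sentence, dictionary, max_len):
    # Hypothetical helper: look up each word, falling back to UNK for unseen words.
    ids = [dictionary.get(word, UNK) for word in sentence.lower().split()]
    ids = ids[:max_len - 1] + [EOS]        # truncate if necessary, then mark the end
    ids += [PAD] * (max_len - len(ids))    # pad up to the fixed length
    return ids

print(sentence_to_ids('how are you doing today', dictionary_from, max_len=10))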