Building a vocabulary for our captions

The next step is to preprocess our caption data and build a vocabulary, or metadata dictionary, for it. We start by reading in our training dataset records and writing a function to preprocess the text captions:

import pandas as pd

train_df = pd.read_csv('image_train_dataset.tsv', delimiter='\t')
total_samples = train_df.shape[0] 
total_samples 
 
35000 
 
# function to pre-process text captions
def preprocess_captions(caption_list):
    pc = []
    for caption in caption_list:
        caption = caption.strip().lower()
        # remove punctuation and normalize special characters
        caption = caption.replace('.', '').replace(',', '')
        caption = caption.replace("'", "").replace('"', '')
        caption = caption.replace('&', 'and')
        caption = caption.replace('(', '').replace(')', '')
        caption = caption.replace('-', ' ')
        # collapse repeated whitespace and add sequence markers
        caption = ' '.join(caption.split())
        caption = '<START> ' + caption + ' <END>'
        pc.append(caption)
    return pc
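
Since the function above is self-contained, we can sanity check it on a made-up caption (the input string here is illustrative, not taken from the dataset):

# illustrative check: punctuation is stripped, whitespace is
# normalized, and start/end markers are added
preprocess_captions(['A man, riding a horse (quickly).'])

['<START> a man riding a horse quickly <END>']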

We will now preprocess our captions and build some basic metadata for our vocabulary, including utilities for converting unique words into numeric representations and vice versa:

# pre-process caption data 
train_captions = train_df.caption.tolist() 
processed_train_captions = preprocess_captions(train_captions) 
 
tc_tokens = [caption.split() for caption in processed_train_captions]
tc_tokens_length = [len(tokenized_caption) for tokenized_caption in tc_tokens]
 
# build vocabulary metadata
from collections import Counter
import numpy as np

tc_words = [word.strip() for word_list in tc_tokens for word in word_list]
unique_words = list(set(tc_words))
# count word frequencies across all captions (not just unique words)
token_counter = Counter(tc_words)

# map each word to a unique index, reserving index 0 for <PAD>
word_to_index = {word: index+1 for index, word in enumerate(token_counter)}
word_to_index['<PAD>'] = 0
index_to_word = {index: word for word, index in word_to_index.items()}
vocab_size = len(word_to_index)
max_caption_size = np.max(tc_tokens_length)
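
As a quick sanity check, we can round-trip a caption through these mappings (the variable names below are our own; exact indices depend on the vocabulary order, so they will differ across runs):

# encode a tokenized caption to indices and decode it back; the
# assert confirms the two mappings are consistent inverses
sample_tokens = processed_train_captions[0].split()
sample_indices = [word_to_index[token] for token in sample_tokens]
restored_caption = ' '.join(index_to_word[idx] for idx in sample_indices)
assert restored_caption == processed_train_captions[0]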

It is important to save this vocabulary metadata to disk so we can reuse it later for both model training and predictions. If we were to regenerate the vocabulary instead, the word-to-number mappings could differ from the ones the model was trained with, which would give us wrong results and cost us valuable time:

import joblib
 
vocab_metadata = dict() 
vocab_metadata['word2index'] = word_to_index 
vocab_metadata['index2word'] = index_to_word 
vocab_metadata['max_caption_size'] = max_caption_size 
vocab_metadata['vocab_size'] = vocab_size 
joblib.dump(vocab_metadata, 'vocabulary_metadata.pkl') 
 
['vocabulary_metadata.pkl'] 
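
When we come back to this vocabulary later, restoring it is straightforward; the following minimal sketch reloads the persisted file and unpacks the same variables we built above:

# reload the persisted vocabulary metadata from disk
vocab_metadata = joblib.load('vocabulary_metadata.pkl')
word_to_index = vocab_metadata['word2index']
index_to_word = vocab_metadata['index2word']
max_caption_size = vocab_metadata['max_caption_size']
vocab_size = vocab_metadata['vocab_size']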

If needed, you can inspect the contents of our vocabulary metadata using the following code snippet, and also see how a typical preprocessed caption looks for one of the images:

# check vocabulary metadata (show only the first five items
# of each dictionary)
{k: v if not isinstance(v, dict) else list(v.items())[:5]
 for k, v in vocab_metadata.items()}
 
{'index2word': [(0, '<PAD>'), (1, 'nearby'), (2, 'flooded'), 
                (3, 'fundraising'), (4, 'snowboarder')], 
 'max_caption_size': 39, 
 'vocab_size': 7927, 
 'word2index': [('reflections', 4122), ('flakes', 1829),    
       ('flexing', 7684), ('scaling', 1057), ('pretend', 6788)]} 
 
# check pre-processed caption 
processed_train_captions[0] 
 
'<START> a black dog is running after a white dog in the snow <END>' 
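
Since captions vary in length, the <PAD> token (index 0) exists so that every encoded caption can be brought up to max_caption_size before being fed to a model. The helper below is an illustrative sketch of that idea, not part of the chapter's code:

# illustrative helper: encode a caption and right-pad it with the
# <PAD> index (0) up to a fixed length
def to_padded_sequence(caption, word_to_index, max_len):
    indices = [word_to_index[word] for word in caption.split()]
    return indices + [word_to_index['<PAD>']] * (max_len - len(indices))

padded = to_padded_sequence(processed_train_captions[0],
                            word_to_index, max_caption_size)
len(padded) == max_caption_size   # True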

We will leverage this metadata shortly when we build a data generator function to feed inputs to our deep learning model during training.
