Building an image caption dataset generator

One of the most essential steps in any deep learning system that consumes large amounts of data is building an efficient dataset generator. This is especially relevant here because we are dealing with both image and text data, and because our sequence model needs to see the same data multiple times during training. Unpacking all the data into lists and pre-building the full dataset in memory would be the most inefficient way to tackle this problem. Hence, we will leverage the power of Python generators for our system, as illustrated in the small sketch that follows.
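To make the contrast concrete, here is a minimal, hypothetical sketch (not tied to our captioning data) of a pre-built list of batches versus a lazy generator that yields one batch at a time:

import numpy as np 
 
def prebuilt_batches(data, batch_size): 
    # materializes every batch up front -- memory grows with the dataset 
    return [np.array(data[i:i + batch_size]) 
            for i in range(0, len(data), batch_size)] 
 
def batch_generator(data, batch_size): 
    # yields one batch at a time -- only a single batch lives in memory 
    for i in range(0, len(data), batch_size): 
        yield np.array(data[i:i + batch_size]) 
 
sample_data = list(range(10)) 
print(prebuilt_batches(sample_data, batch_size=4))        # all batches at once 
print(next(batch_generator(sample_data, batch_size=4)))   # lazily, on demand 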

To start with, we will load up our image features learned from transfer learning, along with our vocabulary metadata, using the following code:

import numpy as np 
# note: in newer versions of scikit-learn, sklearn.externals.joblib has been 
# removed; use the standalone joblib package (import joblib) instead 
from sklearn.externals import joblib 
 
tl_img_feature_map = joblib.load('transfer_learn_img_features.pkl') 
vocab_metadata = joblib.load('vocabulary_metadata.pkl') 
 
 
train_img_names = train_df.image.tolist() 
train_img_features = [tl_img_feature_map[img_name] for img_name in train_img_names] 
train_img_features = np.array(train_img_features) 
 
word_to_index = vocab_metadata['word2index'] 
index_to_word = vocab_metadata['index2word']  
max_caption_size = vocab_metadata['max_caption_size'] 
vocab_size = vocab_metadata['vocab_size'] 
 
train_img_features.shape 
 
(35000, 4096) 

We can see that there are 35,000 images, each represented by a dense feature vector of size 4,096. The idea now is to build a dataset generator that yields (input, output) pairs. Each input consists of a source image converted into its dense feature vector, together with a partial caption that grows by one word at each time step. The corresponding output is the next word of that caption, which the model has to predict. The following figure makes this approach clearer:

Based on this architecture, it is clear that at each time step for the same image, we pass the same feature vector and keep adding one word of the caption at a time, while passing the next word to be predicted as the corresponding output to train our model. The following function helps us achieve this, leveraging Python generators for lazy loading and better memory efficiency:

from keras.preprocessing import sequence 
 
def dataset_generator(processed_captions, transfer_learnt_features, 
                      vocab_size, max_caption_size, batch_size=32): 
    partial_caption_set = [] 
    next_word_seq_set = [] 
    img_feature_set = [] 
    batch_count = 0 
    batch_num = 0 
     
    while True: 
        for index, caption in enumerate(processed_captions): 
            img_features = transfer_learnt_features[index] 
            # build one (partial caption, next word) sample per time step 
            for cap_idx in range(len(caption.split()) - 1): 
                partial_caption = [word_to_index[word] for word in 
                                      caption.split()[:cap_idx+1]] 
                partial_caption_set.append(partial_caption) 
 
                # one-hot encode the next word to be predicted 
                next_word_seq = np.zeros(vocab_size) 
                next_word_seq[word_to_index[caption.split()[cap_idx+1]]] = 1 
                next_word_seq_set.append(next_word_seq) 
                img_feature_set.append(img_features) 
                batch_count += 1 
 
                if batch_count >= batch_size: 
                    batch_num += 1 
                    img_feature_set = np.array(img_feature_set) 
                    # pad partial captions to a fixed length 
                    partial_caption_set = sequence.pad_sequences( 
                                              sequences=partial_caption_set, 
                                              maxlen=max_caption_size, 
                                              padding='post') 
                    next_word_seq_set = np.array(next_word_seq_set) 
 
                    yield [[img_feature_set, partial_caption_set], 
                           next_word_seq_set] 
                    batch_count = 0 
                    partial_caption_set = [] 
                    next_word_seq_set = [] 
                    img_feature_set = [] 

Let's try to understand how this function really works! While the previous figure gives a nice visual depiction, we will now generate a sample batch of size 10 using the following code:

MAX_CAPTION_SIZE = max_caption_size 
VOCABULARY_SIZE = vocab_size 
BATCH_SIZE = 10 
 
print('Vocab size:', VOCABULARY_SIZE) 
print('Max caption size:', MAX_CAPTION_SIZE) 
print('Test Batch size:', BATCH_SIZE) 
 
d = dataset_generator(processed_captions=processed_train_captions,  
                      transfer_learnt_features=train_img_features,  
                      vocab_size=VOCABULARY_SIZE,  
                      max_caption_size=MAX_CAPTION_SIZE, 
                      batch_size=BATCH_SIZE) 
# the generator loops forever (while True), so we pull a single batch 
# with next() rather than trying to materialize it with list() 
sample_batch = next(d) 
img_features, partial_captions = sample_batch[0] 
next_word = sample_batch[1] 
 
Vocab size: 7927 
Max caption size: 39 
Test Batch size: 10 

We can now verify the dimensions of the returned datasets from our data-generator function using the following code:

img_features.shape, partial_captions.shape, next_word.shape 
 
((10, 4096), (10, 39), (10, 7927)) 

It is quite clear that our image features are dense vectors of size 4,096, and the same feature vector is repeated for the same image at every time step of its caption. Each partial-caption vector is padded to MAX_CAPTION_SIZE, which is 39. The next word is returned in one-hot encoded form, which serves as the target that the model's softmax output layer is compared against to check whether the correct word was predicted. The following code shows how the image feature vectors look in a batch of size 10:

np.round(img_features, 3) 
 
array([[0.   , 0.   , 1.704, ..., 0.   , 0.   , 0.   ], 
       [0.   , 0.   , 1.704, ..., 0.   , 0.   , 0.   ], 
       [0.   , 0.   , 1.704, ..., 0.   , 0.   , 0.   ], 
       ..., 
       [0.   , 0.   , 1.704, ..., 0.   , 0.   , 0.   ], 
       [0.   , 0.   , 1.704, ..., 0.   , 0.   , 0.   ], 
       [0.   , 0.   , 1.704, ..., 0.   , 0.   , 0.   ]], dtype=float32) 

As we discussed before, the same image feature vector is repeated at each time step during batch generation. We can also check how the partial caption fed to the model as input grows at each time step. For simplicity, we show only the first 11 word positions of the output:

# display raw caption tokens at each time-step 
print(np.array([partial_caption[:11] for partial_caption in   
                 partial_captions])) 
 
[[6917    0    0    0    0    0    0    0    0    0    0] 
 [6917 2578    0    0    0    0    0    0    0    0    0] 
 [6917 2578 7371    0    0    0    0    0    0    0    0] 
 [6917 2578 7371 3519    0    0    0    0    0    0    0] 
 [6917 2578 7371 3519 3113    0    0    0    0    0    0] 
 [6917 2578 7371 3519 3113 6720    0    0    0    0    0] 
 [6917 2578 7371 3519 3113 6720    7    0    0    0    0] 
 [6917 2578 7371 3519 3113 6720    7 2578    0    0    0] 
 [6917 2578 7371 3519 3113 6720    7 2578 1076    0    0] 
 [6917 2578 7371 3519 3113 6720    7 2578 1076 3519    0]] 
 
 
# display actual caption tokens at each time-step 
print(np.array([[index_to_word[word] for word in cap][:11] for cap  
                                     in partial_captions])) 
 
[['<START>' '<PAD>' '<PAD>' '<PAD>' '<PAD>' '<PAD>' '<PAD>' '<PAD>' '<PAD>' '<PAD>' '<PAD>'] 
 ['<START>' 'a' '<PAD>' '<PAD>' '<PAD>' '<PAD>' '<PAD>' '<PAD>' '<PAD>' '<PAD>' '<PAD>'] 
 ['<START>' 'a' 'black' '<PAD>' '<PAD>' '<PAD>' '<PAD>' '<PAD>' '<PAD>' '<PAD>' '<PAD>'] 
 ['<START>' 'a' 'black' 'dog' '<PAD>' '<PAD>' '<PAD>' '<PAD>' '<PAD>' '<PAD>' '<PAD>'] 
 ['<START>' 'a' 'black' 'dog' 'is' '<PAD>' '<PAD>' '<PAD>' '<PAD>' '<PAD>' '<PAD>'] 
 ['<START>' 'a' 'black' 'dog' 'is' 'running' '<PAD>' '<PAD>' '<PAD>' '<PAD>' '<PAD>'] 
 ['<START>' 'a' 'black' 'dog' 'is' 'running' 'after' '<PAD>' '<PAD>' '<PAD>' '<PAD>'] 
 ['<START>' 'a' 'black' 'dog' 'is' 'running' 'after' 'a' '<PAD>' '<PAD>' '<PAD>'] 
 ['<START>' 'a' 'black' 'dog' 'is' 'running' 'after' 'a' 'white' '<PAD>' '<PAD>'] 
 ['<START>' 'a' 'black' 'dog' 'is' 'running' 'after' 'a' 'white' 'dog' '<PAD>']] 

We can clearly see how one word is added to the input caption at each step after the <START> symbol, which signifies the beginning of the caption. Let's now look at the corresponding next-word output (the word to be predicted given the two inputs):

next_word 
 
array([[0., 0., 0., ..., 0., 0., 0.], 
       [0., 0., 0., ..., 0., 0., 0.], 
       [0., 0., 0., ..., 0., 0., 0.], 
       ..., 
       [0., 0., 0., ..., 0., 0., 0.], 
       [0., 0., 0., ..., 0., 0., 0.], 
       [0., 0., 0., ..., 0., 0., 0.]]) 
 
 
print('Next word positions:', np.nonzero(next_word)[1]) 
print('Next words:', [index_to_word[word] for word in  
                np.nonzero(next_word)[1]]) 
 
Next word positions: [2578 7371 3519 3113 6720    7 2578 1076 3519 5070] 
Next words: ['a', 'black', 'dog', 'is', 'running', 'after', 'a', 'white', 'dog', 'in'] 
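As a side note, the manual one-hot encoding done inside dataset_generator() (an np.zeros vector with a single index set to 1) could equivalently be produced with Keras's to_categorical utility. A minimal sketch, using a hypothetical word index and a small vocabulary size purely for illustration:

import numpy as np 
from keras.utils import to_categorical 
 
vocab_size_demo = 10     # hypothetical small vocabulary for illustration 
next_word_index = 3      # hypothetical index of the next word 
 
# manual one-hot encoding, as done inside dataset_generator() 
manual_one_hot = np.zeros(vocab_size_demo) 
manual_one_hot[next_word_index] = 1 
 
# equivalent encoding using the Keras utility 
keras_one_hot = to_categorical(next_word_index, num_classes=vocab_size_demo) 
 
print(np.array_equal(manual_one_hot, keras_one_hot))  # True 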

It is quite clear that, at each time step, the output points to the next correct word in the caption, given the sequence of words in the input partial caption. This is the data that will be fed to our model at each epoch during training.
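To tie this together, the sketch below shows one way this generator could be wired into training. It assumes a compiled Keras captioning model named model (built in a later section) with an image-feature input, a partial-caption input, and a softmax output over the vocabulary; the batch-size and epoch values are purely illustrative:

EPOCHS = 30              # illustrative value 
TRAIN_BATCH_SIZE = 256   # illustrative value 
 
# total number of (image, partial caption) -> next word samples per pass 
total_samples = sum(len(caption.split()) - 1 
                    for caption in processed_train_captions) 
steps_per_epoch = total_samples // TRAIN_BATCH_SIZE 
 
train_generator = dataset_generator( 
    processed_captions=processed_train_captions, 
    transfer_learnt_features=train_img_features, 
    vocab_size=vocab_size, 
    max_caption_size=max_caption_size, 
    batch_size=TRAIN_BATCH_SIZE) 
 
# 'model' is assumed to be the compiled captioning model built later on 
model.fit_generator(train_generator, 
                    steps_per_epoch=steps_per_epoch, 
                    epochs=EPOCHS, verbose=1) 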
