Building our image language encoder-decoder deep learning model

We now have all the essential components and utilities needed to build our model. As mentioned earlier, we will use an encoder-decoder deep learning architecture to build our image-captioning system.

The following code helps us build the architecture for this model, where we take pairs of image features and caption sequences as input to predict the next possible word in the caption at each time-step:

from keras.models import Sequential, Model 
from keras.layers import (LSTM, Embedding, TimeDistributed, Dense, 
                          RepeatVector, Activation, concatenate) 
 
DENSE_DIM = 256 
EMBEDDING_DIM = 256 
MAX_CAPTION_SIZE = max_caption_size 
VOCABULARY_SIZE = vocab_size 
 
# image model: project the 4,096-dimensional image features into a dense 
# representation and repeat it once per caption time-step 
image_model = Sequential() 
image_model.add(Dense(DENSE_DIM, input_dim=4096, activation='relu')) 
image_model.add(RepeatVector(MAX_CAPTION_SIZE)) 
 
# language model: embed the caption word indices and model the 
# word sequence with an LSTM 
language_model = Sequential() 
language_model.add(Embedding(VOCABULARY_SIZE, EMBEDDING_DIM, input_length=MAX_CAPTION_SIZE)) 
language_model.add(LSTM(256, return_sequences=True)) 
language_model.add(TimeDistributed(Dense(DENSE_DIM))) 
 
# decoder: merge both branches and predict the next word over the vocabulary 
merged_output = concatenate([image_model.output, language_model.output]) 
merged_output = LSTM(1024, return_sequences=False)(merged_output) 
merged_output = Dense(VOCABULARY_SIZE)(merged_output) 
merged_output = Activation('softmax')(merged_output) 
 
model = Model([image_model.input, language_model.input], merged_output) 
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', 
              metrics=['accuracy']) 
 
model.summary()

The output of the preceding code is the model summary (not shown here), which lists each layer along with its output shape and parameter count.

We can see from the preceding architecture that the image model focuses on handling the image-based features as its input, while the language model leverages an LSTM to handle the sequence of words in each image caption. The final layer is a softmax layer with 7,927 units, because our vocabulary contains a total of 7,927 unique words and the next predicted word in the caption must be one of them. We can also visualize our model architecture using the following code snippet:

from IPython.display import SVG 
from keras.utils.vis_utils import model_to_dot 
 
# render the model graph top-to-bottom as an inline SVG, with layer shapes 
SVG(model_to_dot(model, show_shapes=True, show_layer_names=False,  
    rankdir='TB').create(prog='dot', format='svg'))

The output of the preceding code is a diagram of the model architecture (not shown here), with the image and language branches merging into the decoder LSTM and the final softmax layer.
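Before this model can be trained, every image-caption pair needs to be expanded into multiple (image features, partial caption) to next-word samples, matching the two inputs and the softmax target described above. The following is a minimal sketch of that framing, assuming the captions have already been encoded as lists of word indices; the helper name build_training_pairs and its arguments are illustrative, not part of the original code:

import numpy as np 
from keras.preprocessing.sequence import pad_sequences 
from keras.utils import to_categorical 
 
def build_training_pairs(image_features, caption_token_ids, 
                         max_caption_size, vocab_size): 
    # expand one captioned image into next-word prediction samples: 
    # each prefix of the caption predicts the word that follows it 
    image_inputs, sequence_inputs, next_words = [], [], [] 
    for i in range(1, len(caption_token_ids)): 
        image_inputs.append(image_features) 
        sequence_inputs.append(caption_token_ids[:i]) 
        # one-hot encode the target word over the full vocabulary 
        next_words.append(to_categorical(caption_token_ids[i], 
                                         num_classes=vocab_size)) 
    # pad every partial caption to the model's fixed input length 
    sequence_inputs = pad_sequences(sequence_inputs, maxlen=max_caption_size) 
    return (np.array(image_inputs), np.array(sequence_inputs), 
            np.array(next_words)) 

The resulting arrays can then be passed to model.fit (or yielded in batches from a generator), with [image_inputs, sequence_inputs] as the inputs and next_words as the targets.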
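Once trained, the model can generate a caption one word at a time: at each step, it picks the most probable word from the softmax output and feeds the extended sequence back in. The following greedy-decoding sketch illustrates this loop; the word_to_index/index_to_word mappings and the <START>/<END> marker tokens are assumptions about how the vocabulary was prepared, not part of the code above:

import numpy as np 
from keras.preprocessing.sequence import pad_sequences 
 
def generate_caption(model, image_features, word_to_index, index_to_word, 
                     max_caption_size): 
    # start the sequence with the assumed <START> marker token 
    caption = [word_to_index['<START>']] 
    words = [] 
    for _ in range(max_caption_size - 1): 
        # pad the partial caption to the model's fixed input length 
        sequence = pad_sequences([caption], maxlen=max_caption_size) 
        # the softmax layer yields a probability distribution over the vocabulary 
        probabilities = model.predict([np.array([image_features]), sequence]) 
        next_word = int(np.argmax(probabilities[0])) 
        caption.append(next_word) 
        if index_to_word[next_word] == '<END>': 
            break 
        words.append(index_to_word[next_word]) 
    return ' '.join(words)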
