How to do it...

We can proceed with the recipe as follows:

  1. Load all the modules needed by the recipe: spaCy for word embeddings, VGG16 for image feature extraction, and LSTM for language modeling. The remaining few imports are pretty standard:
%matplotlib inline
import os, argparse
import numpy as np
import cv2
import spacy
import matplotlib.pyplot as plt
from keras.models import Model, Input
from keras.layers.core import Dense, Dropout, Reshape
from keras.layers.recurrent import LSTM
from keras.layers.merge import concatenate
from keras.applications.vgg16 import VGG16
from keras.preprocessing import image
from keras.applications.vgg16 import preprocess_input
from sklearn.externals import joblib
import PIL.Image
  2. Define a few constants. Note that we assume the longest question in our corpus has max_length_questions = 30 tokens, and that VGG16 is used to extract 4,096 features describing the input image. In addition, the word embeddings live in a space of dimension length_feature_space = 300. Note that we are going to use a set of pre-trained weights downloaded from the internet (https://github.com/iamaaditya/VQA_Demo):
# mapping id -> labels for categories
label_encoder_file_name = '/Users/gulli/Books/TF/code/git/tensorflowBook/Chapter5/FULL_labelencoder_trainval.pkl'
# max length across corpus
max_length_questions = 30
# VGG16 output
length_vgg_features = 4096
# Embedding output
length_feature_space = 300
# pre-trained weights
VQA_weights_file = '/Users/gulli/Books/TF/code/git/tensorflowBook/Chapter5/VQA_MODEL_WEIGHTS.hdf5'
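
Since both files are downloaded separately, it may be worth verifying that they actually sit at the paths defined above before going on. This is only a minimal, optional check (the loop variable path_name is our own); it is not required by the recipe:

# optional sanity check: make sure the downloaded files are at the expected paths
for path_name in (label_encoder_file_name, VQA_weights_file):
    if not os.path.exists(path_name):
        raise IOError("missing pre-trained file: " + path_name)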

  3. Use VGG16 to extract image features. Note that we explicitly take them from the fc2 layer; this function returns 4,096 features for a given input image (a note on reusing the fc2 extractor for many images follows the VGG16 summary below):

'''image features'''
def get_image_features(img_path, VGG16modelFull):
    '''given an image, returns a tensor with (1, 4096) VGG16 features'''
    # since VGG16 was trained on 224x224 images, every new image
    # is required to go through the same transformation
    img = image.load_img(img_path, target_size=(224, 224))
    x = image.img_to_array(img)
    # this is required because VGG16 was originally trained on batches,
    # so even a single image must be passed as a batch of one
    x = np.expand_dims(x, axis=0)
    x = preprocess_input(x)
    # tap the fc2 layer to get the 4,096-dimensional feature vector
    model_extractfeatures = Model(inputs=VGG16modelFull.input,
                                  outputs=VGG16modelFull.get_layer('fc2').output)
    fc2_features = model_extractfeatures.predict(x)
    fc2_features = fc2_features.reshape((1, length_vgg_features))
    return fc2_features

Note that VGG16 is defined as follows:

Layer (type) Output Shape Param #
=================================================================
input_5 (InputLayer) (None, 224, 224, 3) 0
_________________________________________________________________
block1_conv1 (Conv2D) (None, 224, 224, 64) 1792
_________________________________________________________________
block1_conv2 (Conv2D) (None, 224, 224, 64) 36928
_________________________________________________________________
block1_pool (MaxPooling2D) (None, 112, 112, 64) 0
_________________________________________________________________
block2_conv1 (Conv2D) (None, 112, 112, 128) 73856
_________________________________________________________________
block2_conv2 (Conv2D) (None, 112, 112, 128) 147584
_________________________________________________________________
block2_pool (MaxPooling2D) (None, 56, 56, 128) 0
_________________________________________________________________
block3_conv1 (Conv2D) (None, 56, 56, 256) 295168
_________________________________________________________________
block3_conv2 (Conv2D) (None, 56, 56, 256) 590080
_________________________________________________________________
block3_conv3 (Conv2D) (None, 56, 56, 256) 590080
_________________________________________________________________
block3_pool (MaxPooling2D) (None, 28, 28, 256) 0
_________________________________________________________________
block4_conv1 (Conv2D) (None, 28, 28, 512) 1180160
_________________________________________________________________
block4_conv2 (Conv2D) (None, 28, 28, 512) 2359808
_________________________________________________________________
block4_conv3 (Conv2D) (None, 28, 28, 512) 2359808
_________________________________________________________________
block4_pool (MaxPooling2D) (None, 14, 14, 512) 0
_________________________________________________________________
block5_conv1 (Conv2D) (None, 14, 14, 512) 2359808
_________________________________________________________________
block5_conv2 (Conv2D) (None, 14, 14, 512) 2359808
_________________________________________________________________
block5_conv3 (Conv2D) (None, 14, 14, 512) 2359808
_________________________________________________________________
block5_pool (MaxPooling2D) (None, 7, 7, 512) 0
_________________________________________________________________
flatten (Flatten) (None, 25088) 0
_________________________________________________________________
fc1 (Dense) (None, 4096) 102764544
_________________________________________________________________
fc2 (Dense) (None, 4096) 16781312
_________________________________________________________________
predictions (Dense) (None, 1000) 4097000
=================================================================
Total params: 138,357,544
Trainable params: 138,357,544
Non-trainable params: 0
_________________________________________
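
Note that get_image_features rebuilds the fc2 sub-model on every call. If you plan to extract features for many images, one possible optimization, sketched here under the same assumptions as the recipe (the names vgg_full, fc2_extractor, and get_image_features_cached are our own, not part of the original code), is to build the extractor once and reuse it:

# build the fc2 extractor once and reuse it for every image
vgg_full = VGG16(weights='imagenet', include_top=True)
fc2_extractor = Model(inputs=vgg_full.input,
                      outputs=vgg_full.get_layer('fc2').output)

def get_image_features_cached(img_path, extractor=fc2_extractor):
    '''same as get_image_features, but reuses a pre-built fc2 sub-model'''
    img = image.load_img(img_path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return extractor.predict(x).reshape((1, length_vgg_features))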
  4. Use spaCy to get a word embedding and map the input question into a space of shape (max_length_questions, 300), where max_length_questions is the maximum question length in our corpus and 300 is the dimension of the embeddings produced by spaCy. Internally, spaCy uses GloVe vectors (http://nlp.stanford.edu/projects/glove/), which reduce each token to a 300-dimensional representation. Note that the question is right-padded with zeros up to max_length_questions:
'''embedding'''
def get_question_features(question):
    '''given a question, a unicode string, returns the time series vector
    with each word (token) transformed into a 300-dimension representation
    calculated using GloVe vectors'''
    word_embeddings = spacy.load('en', vectors='en_glove_cc_300_1m_vectors')
    tokens = word_embeddings(question)
    ntokens = len(tokens)
    if ntokens > max_length_questions:
        ntokens = max_length_questions
    question_tensor = np.zeros((1, max_length_questions, 300))
    # copy at most max_length_questions token vectors; the rest stays zero-padded
    for j in range(ntokens):
        question_tensor[0, j, :] = tokens[j].vector
    return question_tensor
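
Note that spacy.load is relatively slow, and get_question_features reloads the model on every call. If you will embed many questions, one option, sketched here with a hypothetical module-level variable nlp and helper get_question_features_cached (both our own names), is to load the spaCy model once and reuse it:

# load the spaCy model once and reuse it across questions
nlp = spacy.load('en', vectors='en_glove_cc_300_1m_vectors')

def get_question_features_cached(question, nlp=nlp):
    '''same as get_question_features, but reuses a pre-loaded spaCy model'''
    tokens = nlp(question)
    question_tensor = np.zeros((1, max_length_questions, 300))
    for j in range(min(len(tokens), max_length_questions)):
        question_tensor[0, j, :] = tokens[j].vector
    return question_tensor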
  5. Load an image and get its salient features by using the previously defined image feature extractor:
image_file_name = 'girl.jpg'
img0 = PIL.Image.open(image_file_name)
img0.show()
#get the salient features
model = VGG16(weights='imagenet', include_top=True)
image_features = get_image_features(image_file_name, model)
print(image_features.shape)
  6. Write a question and get its salient features by using the previously defined sentence feature extractor:
question = u"Who is in this picture?"
language_features = get_question_features(question)
print(language_features.shape)
  7. Combine the two heterogeneous sets of features into one. In this network, three stacked LSTM layers build our language model. Note that LSTMs will be discussed in detail in Chapter 6; for now we only use them as black boxes. The last LSTM returns 512 features, which are concatenated with the image features and fed into a sequence of Dense and Dropout layers. The final layer is a Dense one with a softmax activation over a probability space of 1,000 potential answers:
'''combine'''
def build_combined_model(
        number_of_LSTM=3,
        number_of_hidden_units_LSTM=512,
        number_of_dense_layers=3,
        number_of_hidden_units=1024,
        activation_function='tanh',
        dropout_pct=0.5):
    # input image
    input_image = Input(shape=(length_vgg_features,),
                        name="input_image")
    model_image = Reshape((length_vgg_features,),
                          input_shape=(length_vgg_features,))(input_image)
    # input language
    input_language = Input(shape=(max_length_questions, length_feature_space,),
                           name="input_language")
    # build a stack of LSTMs
    model_language = LSTM(number_of_hidden_units_LSTM,
                          return_sequences=True,
                          name="lstm_1")(input_language)
    model_language = LSTM(number_of_hidden_units_LSTM,
                          return_sequences=True,
                          name="lstm_2")(model_language)
    model_language = LSTM(number_of_hidden_units_LSTM,
                          return_sequences=False,
                          name="lstm_3")(model_language)
    # concatenate 4096+512
    model = concatenate([model_image, model_language])
    # Dense, Dropout
    for _ in range(number_of_dense_layers):
        model = Dense(number_of_hidden_units,
                      kernel_initializer='uniform')(model)
        model = Dropout(dropout_pct)(model)
    model = Dense(1000,
                  activation='softmax')(model)
    # create model from tensors
    model = Model(inputs=[input_image, input_language], outputs=model)
    return model
  8. Build the combined network and show its summary to understand how it looks internally. Load the pre-trained weights and compile the model using the categorical_crossentropy loss with the rmsprop optimizer (a quick sanity check of the parameter counts follows the summary):
combined_model = build_combined_model()
combined_model.summary()
combined_model.load_weights(VQA_weights_file)
combined_model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
____________________________
Layer (type) Output Shape Param # Connected to
====================================================================================================
input_language (InputLayer) (None, 30, 300) 0
____________________________________________________________________________________________________
lstm_1 (LSTM) (None, 30, 512) 1665024 input_language[0][0]
____________________________________________________________________________________________________
input_image (InputLayer) (None, 4096) 0
____________________________________________________________________________________________________
lstm_2 (LSTM) (None, 30, 512) 2099200 lstm_1[0][0]
____________________________________________________________________________________________________
reshape_3 (Reshape) (None, 4096) 0 input_image[0][0]
____________________________________________________________________________________________________
lstm_3 (LSTM) (None, 512) 2099200 lstm_2[0][0]
____________________________________________________________________________________________________
concatenate_3 (Concatenate) (None, 4608) 0 reshape_3[0][0]
lstm_3[0][0]
____________________________________________________________________________________________________
dense_8 (Dense) (None, 1024) 4719616 concatenate_3[0][0]
____________________________________________________________________________________________________
dropout_7 (Dropout) (None, 1024) 0 dense_8[0][0]
____________________________________________________________________________________________________
dense_9 (Dense) (None, 1024) 1049600 dropout_7[0][0]
____________________________________________________________________________________________________
dropout_8 (Dropout) (None, 1024) 0 dense_9[0][0]
____________________________________________________________________________________________________
dense_10 (Dense) (None, 1024) 1049600 dropout_8[0][0]
____________________________________________________________________________________________________
dropout_9 (Dropout) (None, 1024) 0 dense_10[0][0]
____________________________________________________________________________________________________
dense_11 (Dense) (None, 1000) 1025000 dropout_9[0][0]
====================================================================================================
Total params: 13,707,240
Trainable params: 13,707,240
Non-trainable params: 0
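
As a quick sanity check on the summary above, the LSTM parameter counts can be derived by hand: an LSTM has four gates, each with a weight matrix over the concatenation of its input and recurrent state, plus a bias. The variable names below are ours, used only for this check:

# verify the LSTM parameter counts reported by summary()
# 4 gates, each with weights over [input; recurrent state] plus a bias
lstm_1_params = 4 * 512 * (300 + 512 + 1)   # -> 1665024
lstm_2_params = 4 * 512 * (512 + 512 + 1)   # -> 2099200
print(lstm_1_params, lstm_2_params)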
  9. Use the pre-trained combined network to make a prediction. Note that in this case we use weights already available online for this network, but the interested reader can re-train the combined network on their own training set (a minimal fine-tuning sketch follows the code below):
y_output = combined_model.predict([image_features, language_features])
# The task is framed as a classification over the 1,000 most frequent answers;
# answers outside this set were not part of training and thus
# cannot show up in the result.
# These 1,000 answers are stored in the sklearn label encoder
labelencoder = joblib.load(label_encoder_file_name)
for label in reversed(np.argsort(y_output)[0, -5:]):
    print(str(round(y_output[0, label] * 100, 2)).zfill(5),
          "%", labelencoder.inverse_transform(label))
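
The following is only a minimal sketch of what such re-training could look like; the arrays X_images, X_questions, and Y_answers are hypothetical placeholders (random data here) standing in for your own (image features, embedded question, one-hot answer) triples:

# hypothetical fine-tuning sketch: replace the random placeholders with real data
num_examples = 128
X_images = np.random.rand(num_examples, length_vgg_features)
X_questions = np.random.rand(num_examples, max_length_questions, length_feature_space)
Y_answers = np.eye(1000)[np.random.randint(0, 1000, num_examples)]  # one-hot over 1,000 answers
combined_model.fit([X_images, X_questions], Y_answers,
                   epochs=2, batch_size=32)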