How to do it...

We can proceed with the recipe as follows:

  1. Load all the modules needed by the recipe: spaCy for word embeddings, VGG16 for image feature extraction, and LSTM for language modeling. The remaining few imports are pretty standard:
%matplotlib inline
import os, argparse
import numpy as np
import cv2
import spacy
import matplotlib.pyplot as plt
from keras.models import Model, Input
from keras.layers.core import Dense, Dropout, Reshape
from keras.layers.recurrent import LSTM
from keras.layers.merge import concatenate
from keras.applications.vgg16 import VGG16
from keras.preprocessing import image
from keras.applications.vgg16 import preprocess_input
from sklearn.externals import joblib
import PIL.Image
  2. Define a few constants. Note that we assume the longest question in our corpus has max_length_questions = 30 tokens, and that VGG16 is used to extract 4,096 features describing the input image. In addition, the word embeddings live in a space of dimension length_feature_space = 300. Note that we are going to use a set of pre-trained weights downloaded from the internet (https://github.com/iamaaditya/VQA_Demo):
# mapping id -> labels for categories
label_encoder_file_name = '/Users/gulli/Books/TF/code/git/tensorflowBook/Chapter5/FULL_labelencoder_trainval.pkl'
# max length across corpus
max_length_questions = 30
# VGG16 output
length_vgg_features = 4096
# Embedding output
length_feature_space = 300
# pre-trained weights
VQA_weights_file = '/Users/gulli/Books/TF/code/git/tensorflowBook/Chapter5/VQA_MODEL_WEIGHTS.hdf5'
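
Since both files are downloaded separately, it may be worth verifying that they actually sit at the paths defined above before going on. This is only a minimal, optional check (the loop variable path_name is our own); it is not required by the recipe:

# optional sanity check: make sure the downloaded files are at the expected paths
for path_name in (label_encoder_file_name, VQA_weights_file):
    if not os.path.exists(path_name):
        raise IOError("missing pre-trained file: " + path_name)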

  3. Use VGG16 to extract image features. Note that we explicitly take them from the fc2 layer; this function returns 4,096 features for a given input image (a note on reusing the fc2 extractor for many images follows the VGG16 summary below):

'''image features'''
def get_image_features(img_path, VGG16modelFull):
    '''given an image, returns a tensor with (1, 4096) VGG16 features'''
    # since VGG16 was trained on 224x224 images, every new image
    # is required to go through the same transformation
    img = image.load_img(img_path, target_size=(224, 224))
    x = image.img_to_array(img)
    # this is required because VGG16 was originally trained on batches,
    # so even a single image must be passed as a batch of one
    x = np.expand_dims(x, axis=0)
    x = preprocess_input(x)
    # tap the fc2 layer to get the 4,096-dimensional feature vector
    model_extractfeatures = Model(inputs=VGG16modelFull.input,
                                  outputs=VGG16modelFull.get_layer('fc2').output)
    fc2_features = model_extractfeatures.predict(x)
    fc2_features = fc2_features.reshape((1, length_vgg_features))
    return fc2_features

Note that VGG16 is defined as follows:

Layer (type) Output Shape Param #
=================================================================
input_5 (InputLayer) (None, 224, 224, 3) 0
_________________________________________________________________
block1_conv1 (Conv2D) (None, 224, 224, 64) 1792
_________________________________________________________________
block1_conv2 (Conv2D) (None, 224, 224, 64) 36928
_________________________________________________________________
block1_pool (MaxPooling2D) (None, 112, 112, 64) 0
_________________________________________________________________
block2_conv1 (Conv2D) (None, 112, 112, 128) 73856
_________________________________________________________________
block2_conv2 (Conv2D) (None, 112, 112, 128) 147584
_________________________________________________________________
block2_pool (MaxPooling2D) (None, 56, 56, 128) 0
_________________________________________________________________
block3_conv1 (Conv2D) (None, 56, 56, 256) 295168
_________________________________________________________________
block3_conv2 (Conv2D) (None, 56, 56, 256) 590080
_________________________________________________________________
block3_conv3 (Conv2D) (None, 56, 56, 256) 590080
_________________________________________________________________
block3_pool (MaxPooling2D) (None, 28, 28, 256) 0
_________________________________________________________________
block4_conv1 (Conv2D) (None, 28, 28, 512) 1180160
_________________________________________________________________
block4_conv2 (Conv2D) (None, 28, 28, 512) 2359808
_________________________________________________________________
block4_conv3 (Conv2D) (None, 28, 28, 512) 2359808
_________________________________________________________________
block4_pool (MaxPooling2D) (None, 14, 14, 512) 0
_________________________________________________________________
block5_conv1 (Conv2D) (None, 14, 14, 512) 2359808
_________________________________________________________________
block5_conv2 (Conv2D) (None, 14, 14, 512) 2359808
_________________________________________________________________
block5_conv3 (Conv2D) (None, 14, 14, 512) 2359808
_________________________________________________________________
block5_pool (MaxPooling2D) (None, 7, 7, 512) 0
_________________________________________________________________
flatten (Flatten) (None, 25088) 0
_________________________________________________________________
fc1 (Dense) (None, 4096) 102764544
_________________________________________________________________
fc2 (Dense) (None, 4096) 16781312
_________________________________________________________________
predictions (Dense) (None, 1000) 4097000
=================================================================
Total params: 138,357,544
Trainable params: 138,357,544
Non-trainable params: 0
_________________________________________
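
Note that get_image_features rebuilds the fc2 sub-model on every call. If you plan to extract features for many images, one possible optimization, sketched here under the same assumptions as the recipe (the names vgg_full, fc2_extractor, and get_image_features_cached are our own, not part of the original code), is to build the extractor once and reuse it:

# build the fc2 extractor once and reuse it for every image
vgg_full = VGG16(weights='imagenet', include_top=True)
fc2_extractor = Model(inputs=vgg_full.input,
                      outputs=vgg_full.get_layer('fc2').output)

def get_image_features_cached(img_path, extractor=fc2_extractor):
    '''same as get_image_features, but reuses a pre-built fc2 sub-model'''
    img = image.load_img(img_path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return extractor.predict(x).reshape((1, length_vgg_features))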
  4. Use spaCy to get a word embedding and map the input question into a space of shape (max_length_questions, 300), where max_length_questions is the maximum question length in our corpus and 300 is the dimension of the embeddings produced by spaCy. Internally, spaCy uses GloVe vectors (http://nlp.stanford.edu/projects/glove/), which reduce each token to a 300-dimensional representation. Note that the question is right-padded with zeros up to max_length_questions:
'''embedding'''
def get_question_features(question):
    '''given a question, a unicode string, returns the time series vector
    with each word (token) transformed into a 300-dimension representation
    calculated using GloVe vectors'''
    word_embeddings = spacy.load('en', vectors='en_glove_cc_300_1m_vectors')
    tokens = word_embeddings(question)
    ntokens = len(tokens)
    if ntokens > max_length_questions:
        ntokens = max_length_questions
    question_tensor = np.zeros((1, max_length_questions, 300))
    # copy at most max_length_questions token vectors; the rest stays zero-padded
    for j in range(ntokens):
        question_tensor[0, j, :] = tokens[j].vector
    return question_tensor
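
Note that spacy.load is relatively slow, and get_question_features reloads the model on every call. If you will embed many questions, one option, sketched here with a hypothetical module-level variable nlp and helper get_question_features_cached (both our own names), is to load the spaCy model once and reuse it:

# load the spaCy model once and reuse it across questions
nlp = spacy.load('en', vectors='en_glove_cc_300_1m_vectors')

def get_question_features_cached(question, nlp=nlp):
    '''same as get_question_features, but reuses a pre-loaded spaCy model'''
    tokens = nlp(question)
    question_tensor = np.zeros((1, max_length_questions, 300))
    for j in range(min(len(tokens), max_length_questions)):
        question_tensor[0, j, :] = tokens[j].vector
    return question_tensor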
  5. Load an image and get its salient features by using the previously defined image feature extractor:
image_file_name = 'girl.jpg'
img0 = PIL.Image.open(image_file_name)
img0.show()
#get the salient features
model = VGG16(weights='imagenet', include_top=True)
image_features = get_image_features(image_file_name, model)
print(image_features.shape)
  6. Write a question and get its salient features by using the previously defined sentence feature extractor:
question = u"Who is in this picture?"
language_features = get_question_features(question)
print(language_features.shape)
  7. Combine the two heterogeneous sets of features into one. In this network, three stacked LSTM layers build our language model. Note that LSTMs will be discussed in detail in Chapter 6; for now we only use them as black boxes. The last LSTM returns 512 features, which are concatenated with the image features and fed into a sequence of Dense and Dropout layers. The final layer is a Dense one with a softmax activation over a probability space of 1,000 potential answers:
'''combine'''
def build_combined_model(
        number_of_LSTM=3,
        number_of_hidden_units_LSTM=512,
        number_of_dense_layers=3,
        number_of_hidden_units=1024,
        activation_function='tanh',
        dropout_pct=0.5):
    # input image
    input_image = Input(shape=(length_vgg_features,),
                        name="input_image")
    model_image = Reshape((length_vgg_features,),
                          input_shape=(length_vgg_features,))(input_image)
    # input language
    input_language = Input(shape=(max_length_questions, length_feature_space,),
                           name="input_language")
    # build a stack of LSTMs
    model_language = LSTM(number_of_hidden_units_LSTM,
                          return_sequences=True,
                          name="lstm_1")(input_language)
    model_language = LSTM(number_of_hidden_units_LSTM,
                          return_sequences=True,
                          name="lstm_2")(model_language)
    model_language = LSTM(number_of_hidden_units_LSTM,
                          return_sequences=False,
                          name="lstm_3")(model_language)
    # concatenate 4096+512
    model = concatenate([model_image, model_language])
    # Dense, Dropout
    for _ in range(number_of_dense_layers):
        model = Dense(number_of_hidden_units,
                      kernel_initializer='uniform')(model)
        model = Dropout(dropout_pct)(model)
    model = Dense(1000,
                  activation='softmax')(model)
    # create model from tensors
    model = Model(inputs=[input_image, input_language], outputs=model)
    return model
  8. Build the combined network and show its summary to understand how it looks internally. Load the pre-trained weights and compile the model using the categorical_crossentropy loss with the rmsprop optimizer (a quick sanity check of the parameter counts follows the summary):
combined_model = build_combined_model()
combined_model.summary()
combined_model.load_weights(VQA_weights_file)
combined_model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
____________________________
Layer (type) Output Shape Param # Connected to
====================================================================================================
input_language (InputLayer) (None, 30, 300) 0
____________________________________________________________________________________________________
lstm_1 (LSTM) (None, 30, 512) 1665024 input_language[0][0]
____________________________________________________________________________________________________
input_image (InputLayer) (None, 4096) 0
____________________________________________________________________________________________________
lstm_2 (LSTM) (None, 30, 512) 2099200 lstm_1[0][0]
____________________________________________________________________________________________________
reshape_3 (Reshape) (None, 4096) 0 input_image[0][0]
____________________________________________________________________________________________________
lstm_3 (LSTM) (None, 512) 2099200 lstm_2[0][0]
____________________________________________________________________________________________________
concatenate_3 (Concatenate) (None, 4608) 0 reshape_3[0][0]
lstm_3[0][0]
____________________________________________________________________________________________________
dense_8 (Dense) (None, 1024) 4719616 concatenate_3[0][0]
____________________________________________________________________________________________________
dropout_7 (Dropout) (None, 1024) 0 dense_8[0][0]
____________________________________________________________________________________________________
dense_9 (Dense) (None, 1024) 1049600 dropout_7[0][0]
____________________________________________________________________________________________________
dropout_8 (Dropout) (None, 1024) 0 dense_9[0][0]
____________________________________________________________________________________________________
dense_10 (Dense) (None, 1024) 1049600 dropout_8[0][0]
____________________________________________________________________________________________________
dropout_9 (Dropout) (None, 1024) 0 dense_10[0][0]
____________________________________________________________________________________________________
dense_11 (Dense) (None, 1000) 1025000 dropout_9[0][0]
====================================================================================================
Total params: 13,707,240
Trainable params: 13,707,240
Non-trainable params: 0
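
As a quick sanity check on the summary above, the LSTM parameter counts can be derived by hand: an LSTM has four gates, each with a weight matrix over the concatenation of its input and recurrent state, plus a bias. The variable names below are ours, used only for this check:

# verify the LSTM parameter counts reported by summary()
# 4 gates, each with weights over [input; recurrent state] plus a bias
lstm_1_params = 4 * 512 * (300 + 512 + 1)   # -> 1665024
lstm_2_params = 4 * 512 * (512 + 512 + 1)   # -> 2099200
print(lstm_1_params, lstm_2_params)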
  9. Use the pre-trained combined network to make a prediction. Note that in this case we use weights already available online for this network, but the interested reader can re-train the combined network on their own training set (a minimal fine-tuning sketch follows the code below):
y_output = combined_model.predict([image_features, language_features])
# The task is framed as a classification over the 1,000 most frequent answers;
# answers outside this set were not part of training and thus
# cannot show up in the result.
# These 1,000 answers are stored in the sklearn label encoder
labelencoder = joblib.load(label_encoder_file_name)
for label in reversed(np.argsort(y_output)[0, -5:]):
    print(str(round(y_output[0, label] * 100, 2)).zfill(5),
          "%", labelencoder.inverse_transform(label))
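
The following is only a minimal sketch of what such re-training could look like; the arrays X_images, X_questions, and Y_answers are hypothetical placeholders (random data here) standing in for your own (image features, embedded question, one-hot answer) triples:

# hypothetical fine-tuning sketch: replace the random placeholders with real data
num_examples = 128
X_images = np.random.rand(num_examples, length_vgg_features)
X_questions = np.random.rand(num_examples, max_length_questions, length_feature_space)
Y_answers = np.eye(1000)[np.random.randint(0, 1000, num_examples)]  # one-hot over 1,000 answers
combined_model.fit([X_images, X_questions], Y_answers,
                   epochs=2, batch_size=32)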