Neural network architecture

In this chapter, we will create a neural network that takes 10 video frames as input and outputs a probability distribution over 101 action categories. The network is built around the conv3d operation in TensorFlow and is inspired by the work of D. Tran et al., Learning Spatiotemporal Features with 3D Convolutional Networks. However, we have simplified the model so that it is easier to explain in a single chapter, and we have added some techniques that are not mentioned by Tran et al., such as batch normalization and dropout.

Now, create a new Python file named nets.py and add the following code:

 import tensorflow as tf 
 from utils import print_variables, print_layers 
 from tensorflow.contrib.layers.python.layers.layers import batch_norm 


 def inference(input_data, is_training=False): 
     conv1 = _conv3d(input_data, 3, 3, 3, 64, 1, 1, 1, "conv1") 
     pool1 = _max_pool3d(conv1, 1, 2, 2, 1, 2, 2, "pool1") 
     conv2 = _conv3d(pool1, 3, 3, 3, 128, 1, 1, 1, "conv2") 
     pool2 = _max_pool3d(conv2, 2, 2, 2, 2, 2, 2, "pool2") 
     conv3a = _conv3d(pool2, 3, 3, 3, 256, 1, 1, 1, "conv3a") 
     conv3b = _conv3d(conv3a, 3, 3, 3, 256, 1, 1, 1, "conv3b") 
     pool3 = _max_pool3d(conv3b, 2, 2, 2, 2, 2, 2, "pool3") 
     conv4a = _conv3d(pool3, 3, 3, 3, 512, 1, 1, 1, "conv4a") 
     conv4b = _conv3d(conv4a, 3, 3, 3, 512, 1, 1, 1, "conv4b") 
     pool4 = _max_pool3d(conv4b, 2, 2, 2, 2, 2, 2, "pool4") 
     conv5a = _conv3d(pool4, 3, 3, 3, 512, 1, 1, 1, "conv5a") 
     conv5b = _conv3d(conv5a, 3, 3, 3, 512, 1, 1, 1, "conv5b") 
     pool5 = _max_pool3d(conv5b, 2, 2, 2, 2, 2, 2, "pool5") 
     fc6 = _fully_connected(pool5, 4096, name="fc6") 
     fc7 = _fully_connected(fc6, 4096, name="fc7") 
     # Dropout is only applied during training. 
     if is_training: 
         fc7 = tf.nn.dropout(fc7, keep_prob=0.5) 
     fc8 = _fully_connected(fc7, 101, name='fc8', relu=False) 

     # Collect every layer so that we can inspect the architecture later. 
     endpoints = dict() 
     endpoints["conv1"] = conv1 
     endpoints["pool1"] = pool1 
     endpoints["conv2"] = conv2 
     endpoints["pool2"] = pool2 
     endpoints["conv3a"] = conv3a 
     endpoints["conv3b"] = conv3b 
     endpoints["pool3"] = pool3 
     endpoints["conv4a"] = conv4a 
     endpoints["conv4b"] = conv4b 
     endpoints["pool4"] = pool4 
     endpoints["conv5a"] = conv5a 
     endpoints["conv5b"] = conv5b 
     endpoints["pool5"] = pool5 
     endpoints["fc6"] = fc6 
     endpoints["fc7"] = fc7 
     endpoints["fc8"] = fc8 
     return fc8, endpoints 


 if __name__ == "__main__": 
     inputs = tf.placeholder(tf.float32, [None, 10, 112, 112, 3], 
                             name="inputs") 
     outputs, endpoints = inference(inputs) 
     print_variables(tf.global_variables()) 
     print_variables([inputs, outputs]) 
     print_layers(endpoints) 

In the inference function, we call _conv3d, _max_pool3d, and _fully_connected to create the network. It is not that different from the image CNNs in the previous chapters. At the end of the function, we also create a dictionary named endpoints, which is used in the main section to visualize the network architecture.

Next, let's add the code of the _conv3d and _max_pool3d functions:

 def _conv3d(input_data, k_d, k_h, k_w, c_o, s_d, s_h, s_w, name, 
             relu=True, padding="SAME"): 
     c_i = input_data.get_shape()[-1].value 
     convolve = lambda i, k: tf.nn.conv3d(i, k, [1, s_d, s_h, s_w, 1], 
                                          padding=padding) 
     with tf.variable_scope(name) as scope: 
         weights = tf.get_variable(name="weights", 
                                   shape=[k_d, k_h, k_w, c_i, c_o], 
                                   regularizer=tf.contrib.layers.l2_regularizer(scale=0.0001), 
                                   initializer=tf.truncated_normal_initializer(stddev=1e-1, dtype=tf.float32)) 
         conv = convolve(input_data, weights) 
         biases = tf.get_variable(name="biases", shape=[c_o], dtype=tf.float32, 
                                  initializer=tf.constant_initializer(value=0.0)) 
         output = tf.nn.bias_add(conv, biases) 
         if relu: 
             output = tf.nn.relu(output, name=scope.name) 
         return batch_norm(output) 


 def _max_pool3d(input_data, k_d, k_h, k_w, s_d, s_h, s_w, name, 
                 padding="SAME"): 
     return tf.nn.max_pool3d(input_data, 
                             ksize=[1, k_d, k_h, k_w, 1], 
                             strides=[1, s_d, s_h, s_w, 1], 
                             padding=padding, name=name) 

This code is similar to that of the previous chapters. However, we use the built-in tf.nn.conv3d and tf.nn.max_pool3d functions instead of the tf.nn.conv2d and tf.nn.max_pool functions that we used for images. Therefore, we need the extra k_d and s_d parameters, which specify the kernel size and stride along the depth (temporal) dimension of the filters. Moreover, we will need to train this network from scratch without any pre-trained model, so we use the batch_norm function to add batch normalization to each layer.
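To make the shape handling concrete, here is a minimal, standalone sketch (the tensor names are illustrative and not part of nets.py) showing how a 3-D convolution consumes a batch of clips:

 # Standalone illustration, not part of nets.py. 
 import tensorflow as tf 

 # A batch of 4 clips: 10 frames of 112x112 RGB images. 
 clips = tf.placeholder(tf.float32, [4, 10, 112, 112, 3]) 
 # A 3x3x3 kernel mapping 3 input channels to 64 output channels; 
 # the filter shape is [k_d, k_h, k_w, c_i, c_o]. 
 kernel = tf.get_variable("demo_kernel", shape=[3, 3, 3, 3, 64]) 
 # Strides of 1 in depth, height, and width with SAME padding keep 
 # the spatiotemporal size unchanged. 
 out = tf.nn.conv3d(clips, kernel, strides=[1, 1, 1, 1, 1], padding="SAME") 
 print(out.get_shape())  # (4, 10, 112, 112, 64) 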

Let's add the code for the fully connected layer:

 def _fully_connected(input_data, num_output, name, relu=True): 
    with tf.variable_scope(name) as scope: 
        input_shape = input_data.get_shape() 
        if input_shape.ndims == 5: 
            dim = 1 
            for d in input_shape[1:].as_list(): 
                dim *= d 
            feed_in = tf.reshape(input_data, [-1, dim]) 
        else: 
            feed_in, dim = (input_data, input_shape[-1].value) 
        weights = tf.get_variable(name="weights", 
                                  shape=[dim, num_output], 
                                  regularizer=tf.contrib.layers.l2_regularizer(scale=0.0001), 
                                  initializer=tf.truncated_normal_initializer(stddev=1e-1, dtype=tf.float32)) 
        biases = tf.get_variable(name="biases", shape=[num_output], 
                                 dtype=tf.float32, 
                                 initializer=tf.constant_initializer(value=0.0)) 
        # relu_layer fuses xw_plus_b with a ReLU activation. 
        op = tf.nn.relu_layer if relu else tf.nn.xw_plus_b 
        output = op(feed_in, weights, biases, name=scope.name) 
        return batch_norm(output) 

This function is a bit different from the one we used with images. First, we check whether input_shape.ndims equals 5 instead of 4, and flatten the five-dimensional input into a two-dimensional tensor when needed. Second, we add batch normalization to the output.
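For instance, with our (?, 10, 112, 112, 3) input, pool5 comes out as (?, 1, 4, 4, 512), so the flattening step produces 8,192 features. Here is a quick back-of-the-envelope check (not part of nets.py):

 import numpy as np 

 # With SAME padding, pooling outputs ceil(size / stride) per dimension, 
 # so pool5 has shape (?, 1, 4, 4, 512). 
 dim = int(np.prod([1, 4, 4, 512])) 
 print(dim)  # 8192 -- fc6 therefore creates a (8192, 4096) weights variable 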

Finally, let's open the utils.py file and add the following utility functions:

 from prettytable import PrettyTable 


 def print_variables(variables): 
    table = PrettyTable(["Variable Name", "Shape"]) 
    for var in variables: 
        table.add_row([var.name, var.get_shape()]) 
    print(table) 
    print("") 
 
 
 def print_layers(layers): 
    table = PrettyTable(["Layer Name", "Shape"]) 
    for var in layers.values(): 
        table.add_row([var.name, var.get_shape()]) 
    print(table) 
    print("") 

Now we can run nets.py to get a better understanding of the network's architecture:

    python nets.py

In the first part of the console output, you will see a table like this:

    +------------------------------------+---------------------+
    |           Variable Name            |        Shape        |
    +------------------------------------+---------------------+
    |          conv1/weights:0           |   (3, 3, 3, 3, 64)  |
    |           conv1/biases:0           |        (64,)        |
    |       conv1/BatchNorm/beta:0       |        (64,)        |
    |   conv1/BatchNorm/moving_mean:0    |        (64,)        |
    | conv1/BatchNorm/moving_variance:0  |        (64,)        |
    |               ...                  |         ...         |
    |           fc8/weights:0            |     (4096, 101)     |
    |            fc8/biases:0            |        (101,)       |
    |        fc8/BatchNorm/beta:0        |        (101,)       |
    |    fc8/BatchNorm/moving_mean:0     |        (101,)       |
    |  fc8/BatchNorm/moving_variance:0   |        (101,)       |
    +------------------------------------+---------------------+ 

These are the shapes of the variables in the network. As you can see, three variables whose names contain BatchNorm are added to each layer. These variables increase the total number of parameters that the network needs to learn. However, since we are training from scratch, the network would be much harder to train without batch normalization. Batch normalization also improves the network's ability to generalize to unseen data.
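Note that we call batch_norm with its default arguments to keep the code short. The tf.contrib.layers.batch_norm function also accepts an is_training argument that switches between the current batch statistics and the moving averages; a fuller setup might call it roughly like this (a hedged sketch, not the code used in this chapter):

 # Sketch only: use the moving averages instead of batch statistics 
 # at inference time. 
 output = batch_norm(output, is_training=is_training, 
                     updates_collections=tf.GraphKeys.UPDATE_OPS) 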

In the second table of the console output, you will see the following:

    +---------------------------------+----------------------+
    |          Variable Name          |        Shape         |
    +---------------------------------+----------------------+
    |             inputs:0            | (?, 10, 112, 112, 3) |
    | fc8/BatchNorm/batchnorm/add_1:0 |       (?, 101)       |
    +---------------------------------+----------------------+

These are the shapes of the input and output of the network. As you can see, the input contains 10 video frames of size (112, 112, 3), and the output is a vector of 101 elements.
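Strictly speaking, fc8 produces unnormalized scores (logits); the loss function we define later applies the softmax internally. If you want actual probabilities over the 101 categories at inference time, you can add a softmax on top, for example:

 probabilities = tf.nn.softmax(outputs)  # shape (?, 101), each row sums to 1 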

In the last table, you will see how the shape of the output changes at each layer through the network:

    +------------------------------------+-----------------------+
    |             Layer Name             |         Shape         |
    +------------------------------------+-----------------------+
    |  fc6/BatchNorm/batchnorm/add_1:0   |       (?, 4096)       |
    |  fc7/BatchNorm/batchnorm/add_1:0   |       (?, 4096)       |
    |  fc8/BatchNorm/batchnorm/add_1:0   |        (?, 101)       |
    |               ...                  |         ...           |
    | conv1/BatchNorm/batchnorm/add_1:0  | (?, 10, 112, 112, 64) |
    | conv2/BatchNorm/batchnorm/add_1:0  |  (?, 10, 56, 56, 128) |
    +------------------------------------+-----------------------+

In the preceding table, we can see that the output of the conv1 layer has the same spatiotemporal size as the input, while the output of the conv2 layer is smaller because of the max pooling applied before it.
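With SAME padding, every pooled dimension shrinks to ceil(input_size / stride). The following standalone sketch (not part of nets.py) traces a clip's (frames, height, width) shape through the five pooling layers:

 # (depth, height, width) strides of pool1..pool5 in nets.py. 
 pool_strides = [(1, 2, 2), (2, 2, 2), (2, 2, 2), (2, 2, 2), (2, 2, 2)] 
 shape = (10, 112, 112) 
 for i, strides in enumerate(pool_strides, start=1): 
     # ceil(size / stride) without floating point: (size + stride - 1) // stride 
     shape = tuple((size + stride - 1) // stride 
                   for size, stride in zip(shape, strides)) 
     print("pool%d:" % i, shape) 
 # pool1: (10, 56, 56) ... pool5: (1, 4, 4) 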

Now, let's create a new Python file named models.py and add the following code:

 import tensorflow as tf 


 def compute_loss(logits, labels): 
     labels = tf.squeeze(tf.cast(labels, tf.int32)) 

     cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits( 
         logits=logits, labels=labels) 
     cross_entropy_loss = tf.reduce_mean(cross_entropy) 
     # Add the L2 regularization losses collected by the layers. 
     reg_loss = tf.reduce_mean( 
         tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)) 
     return cross_entropy_loss + reg_loss, cross_entropy_loss, reg_loss 


 def compute_accuracy(logits, labels): 
     labels = tf.squeeze(tf.cast(labels, tf.int32)) 
     batch_predictions = tf.cast(tf.argmax(logits, 1), tf.int32) 
     predicted_correctly = tf.equal(batch_predictions, labels) 
     accuracy = tf.reduce_mean(tf.cast(predicted_correctly, tf.float32)) 
     return accuracy 


 def get_learning_rate(global_step, initial_value, decay_steps, decay_rate): 
     learning_rate = tf.train.exponential_decay(initial_value, global_step, 
                                                decay_steps, decay_rate, 
                                                staircase=True) 
     return learning_rate 


 def train(total_loss, learning_rate, global_step): 
     optimizer = tf.train.AdamOptimizer(learning_rate) 
     train_op = optimizer.minimize(total_loss, global_step) 
     return train_op 

These functions create the operations that calculate the loss, accuracy, and learning rate, and that perform the training step. They are the same as in the previous chapter, so we won't explain them again.
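To see how these functions fit together, here is a minimal, hypothetical wiring of the graph; the hyperparameter values are placeholders for illustration, and the real training routine follows in the next section:

 import tensorflow as tf 
 from nets import inference 
 import models 

 inputs = tf.placeholder(tf.float32, [None, 10, 112, 112, 3], name="inputs") 
 labels = tf.placeholder(tf.int32, [None], name="labels") 
 global_step = tf.Variable(0, trainable=False, name="global_step") 

 logits, _ = inference(inputs, is_training=True) 
 total_loss, _, _ = models.compute_loss(logits, labels) 
 accuracy = models.compute_accuracy(logits, labels) 
 # Illustrative values: decay the learning rate every 10,000 steps. 
 learning_rate = models.get_learning_rate(global_step, 0.001, 10000, 0.9) 
 train_op = models.train(total_loss, learning_rate, global_step) 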

Now, we have all the functions required to train the network to recognize video actions. In the next section, we will start the training routine on a single GPU and visualize the results on TensorBoard.
