In the scripts package, create a new Python file named train.py. We will start by defining some parameters as follows:
import tensorflow as tf
import os
import sys
from datetime import datetime
from tensorflow.python.ops import data_flow_ops

import nets
import models
from utils import lines_from_file
from datasets import sample_videos, input_pipeline

# Dataset
num_frames = 16
train_folder = "/home/ubuntu/datasets/ucf101/train/"
train_txt = "/home/ubuntu/datasets/ucf101/train.txt"

# Learning rate
initial_learning_rate = 0.001
decay_steps = 1000
decay_rate = 0.7

# Training
image_size = 112
batch_size = 24
num_epochs = 20
epoch_size = 28747

train_enqueue_steps = 100
min_queue_size = 1000

save_steps = 200  # Number of steps between checkpoint saves
test_steps = 20   # Number of steps between computing train accuracy
start_test_step = 50

max_checkpoints_to_keep = 2
save_dir = "/home/ubuntu/checkpoints/ucf101"
These parameters are self-explanatory. Now, we will define some operations for training:
train_data_reader = lines_from_file(train_txt, repeat=True)

image_paths_placeholder = tf.placeholder(tf.string, shape=(None, num_frames),
                                         name='image_paths')
labels_placeholder = tf.placeholder(tf.int64, shape=(None,), name='labels')

train_input_queue = data_flow_ops.RandomShuffleQueue(
    capacity=10000,
    min_after_dequeue=batch_size,
    dtypes=[tf.string, tf.int64],
    shapes=[(num_frames,), ()])
train_enqueue_op = train_input_queue.enqueue_many(
    [image_paths_placeholder, labels_placeholder])

frames_batch, labels_batch = input_pipeline(train_input_queue,
                                            batch_size=batch_size,
                                            image_size=image_size)

with tf.variable_scope("models") as scope:
    logits, _ = nets.inference(frames_batch, is_training=True)

total_loss, cross_entropy_loss, reg_loss = models.compute_loss(logits, labels_batch)
train_accuracy = models.compute_accuracy(logits, labels_batch)

global_step = tf.Variable(0, trainable=False)
learning_rate = models.get_learning_rate(global_step, initial_learning_rate,
                                         decay_steps, decay_rate)
train_op = models.train(total_loss, learning_rate, global_step)
In this code, we get a generator object from the text file. Then, we create two placeholders for image_paths and labels, which will be enqueued into the RandomShuffleQueue. The input_pipeline function that we created in datasets.py receives the RandomShuffleQueue and returns a batch of frames and labels. Finally, we create operations to compute the loss, the accuracy, and the training operation.
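For reference, here is a minimal sketch of what the lines_from_file helper in utils.py could look like. This is an illustrative assumption, not the book's exact implementation; the only behavior the training script relies on is that, with repeat=True, the generator yields lines from the text file indefinitely:

```python
def lines_from_file(file_path, repeat=False):
    """Yield stripped, non-empty lines from a text file.

    With repeat=True, the generator restarts from the beginning of
    the file after reaching the end, so the training loop can keep
    drawing samples for as many epochs as it needs.
    """
    while True:
        with open(file_path) as f:
            for line in f:
                line = line.strip()
                if line:
                    yield line
        if not repeat:
            break
```

Because it is a generator, no line is read until the training loop asks for one with next().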
We also want to log the training process and visualize it in TensorBoard. So, we will create some summaries:
tf.summary.scalar("learning_rate", learning_rate)
tf.summary.scalar("train/accuracy", train_accuracy)
tf.summary.scalar("train/total_loss", total_loss)
tf.summary.scalar("train/cross_entropy_loss", cross_entropy_loss)
tf.summary.scalar("train/regularization_loss", reg_loss)

summary_op = tf.summary.merge_all()

saver = tf.train.Saver(max_to_keep=max_checkpoints_to_keep)
time_stamp = datetime.now().strftime("single_%Y-%m-%d_%H-%M-%S")
checkpoints_dir = os.path.join(save_dir, time_stamp)
summary_dir = os.path.join(checkpoints_dir, "summaries")
train_writer = tf.summary.FileWriter(summary_dir, flush_secs=10)

if not os.path.exists(save_dir):
    os.mkdir(save_dir)
if not os.path.exists(checkpoints_dir):
    os.mkdir(checkpoints_dir)
if not os.path.exists(summary_dir):
    os.mkdir(summary_dir)
saver and train_writer will be responsible for saving checkpoints and summaries, respectively. Now, let's finish the training process by creating the session and performing the training loop:
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

with tf.Session(config=config) as sess:
    coords = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coords)

    sess.run(tf.global_variables_initializer())

    num_batches = int(epoch_size / batch_size)

    for i_epoch in range(num_epochs):
        for i_batch in range(num_batches):
            # Prefetch some data into the queue
            if i_batch % train_enqueue_steps == 0:
                num_samples = batch_size * (train_enqueue_steps + 1)
                image_paths, labels = sample_videos(train_data_reader,
                                                    root_folder=train_folder,
                                                    num_samples=num_samples,
                                                    num_frames=num_frames)
                print(" Epoch {} Batch {} Enqueue {} videos".format(
                    i_epoch, i_batch, num_samples))
                sess.run(train_enqueue_op, feed_dict={
                    image_paths_placeholder: image_paths,
                    labels_placeholder: labels
                })

            if (i_batch + 1) >= start_test_step and \
                    (i_batch + 1) % test_steps == 0:
                _, lr_val, loss_val, ce_loss_val, reg_loss_val, \
                    summary_val, global_step_val, train_acc_val = sess.run([
                        train_op, learning_rate, total_loss,
                        cross_entropy_loss, reg_loss,
                        summary_op, global_step, train_accuracy
                    ])
                train_writer.add_summary(summary_val,
                                         global_step=global_step_val)
                print(" Epochs {}, Batch {} Step {}: Learning Rate {} "
                      "Loss {} CE Loss {} Reg Loss {} "
                      "Train Accuracy {}".format(
                          i_epoch, i_batch, global_step_val, lr_val,
                          loss_val, ce_loss_val, reg_loss_val, train_acc_val))
            else:
                _ = sess.run(train_op)
                sys.stdout.write(".")
                sys.stdout.flush()

            if (i_batch + 1) > 0 and (i_batch + 1) % save_steps == 0:
                saved_file = saver.save(
                    sess,
                    os.path.join(checkpoints_dir, 'model.ckpt'),
                    global_step=global_step)
                print("Save steps: Save to file %s " % saved_file)

    coords.request_stop()
    coords.join(threads)
This code is very straightforward. We use the sample_videos function to get a list of image paths and labels. Then, we call the train_enqueue_op operation to add these image paths and labels to the RandomShuffleQueue. After that, the training process can be run with train_op, without using the feed_dict mechanism.
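To make the sampling step concrete, here is a hedged sketch of what sample_videos in datasets.py might do. The line format of train.txt ("<video_folder> <total_frames> <label>") and the zero-padded frame-file naming are assumptions for illustration; the book's actual dataset layout may differ:

```python
import random

def sample_videos(data_reader, root_folder, num_samples, num_frames):
    """Draw num_samples random clips from the label file.

    Assumes each line of the label file has the form
    "<video_folder> <total_frames> <label>" and that frames live at
    <root_folder><video_folder>/<index>.jpg -- both are illustrative
    assumptions, not the book's exact format.
    """
    image_paths, labels = [], []
    for _ in range(num_samples):
        folder, total_frames, label = next(data_reader).split()
        total_frames = int(total_frames)
        # Pick a random start so a clip of num_frames consecutive
        # frames fits inside the video.
        start = random.randint(0, max(0, total_frames - num_frames))
        clip = ["{}{}/{:04d}.jpg".format(root_folder, folder, start + i)
                for i in range(num_frames)]
        image_paths.append(clip)
        labels.append(int(label))
    return image_paths, labels
```

Whatever the exact format, the return values must match the placeholders: image_paths has shape (num_samples, num_frames) and labels has shape (num_samples,), so enqueue_many can split them into per-video queue elements.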
Now, we can run the training process by calling the following command in the root folder:
export PYTHONPATH=.
python scripts/train.py
You may see an OUT_OF_MEMORY error if your GPU doesn't have enough memory for the configured batch_size. We created the session with gpu_options.allow_growth enabled, so TensorFlow allocates GPU memory on demand; try lowering batch_size until the model fits in your GPU memory.
The training process takes a few hours to converge. Let's take a look at its progress on TensorBoard.
In the directory that you have chosen to save the checkpoints, run the following command:
tensorboard --logdir .
Now, open your web browser and navigate to http://localhost:6006:
The regularization loss and total loss with one GPU are as follows:
As you can see in these images, the training accuracy took about 10,000 steps to reach 100% on the training data. Those 10,000 steps took six hours on our machine; your timing may differ depending on your configuration.
The training loss is decreasing, and it might decrease further if we trained for longer. However, the training accuracy barely changes after 10,000 steps.
Now, let's move on to the most interesting part of this chapter. We will use multiple GPUs to train and see how that helps.