Learning rate scheduling

In the last chapter, we briefly mentioned a problem that can occur when the learning rate is kept constant during training. As our model starts to learn, the initial learning rate is very likely to become too large for it to continue learning: the gradient descent updates start overshooting or oscillating around the minimum, and as a result the loss function stops decreasing. To solve this issue, we can decrease the value of the learning rate from time to time. This process is called learning rate scheduling, and there are several popular approaches.

The first method involves reducing the learning rate at fixed time steps during training, such as when training is 33% and 66% complete. Normally, you would decrease the learning rate by a factor of 10 at each of these points.
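As an illustration, such a step schedule can be written in TensorFlow 1.x with tf.train.piecewise_constant; the step boundaries and rates below are made up for this sketch and are not the values used for our model:

           import tensorflow as tf 

           # Hypothetical step schedule: drop the rate by 10x at 33% and 66% 
           # of a 90,000-step run (boundaries and values are illustrative). 
           global_step = tf.Variable(0, trainable=False) 
           boundaries = [30000, 60000] 
           values = [1e-3, 1e-4, 1e-5] 
           learning_rate = tf.train.piecewise_constant(global_step, boundaries, values) 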

The second approach involves reducing the learning rate according to an exponential or quadratic function of the time steps. An example of a function that would do this is as follows:

decayed_learning_rate = learning_rate * decay_rate ^ (global_step / decay_steps)

By using this approach, the learning rate is smoothly decreased over the course of training.
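To make the formula concrete, here is a small stand-alone Python check of it, using the same starting rate, decay rate, and decay_steps that we pass to TensorFlow later in this section; the step values themselves are just illustrative:

           # Evaluate the decay formula at a few steps (rates match the model below) 
           learning_rate, decay_rate, decay_steps = 1e-3, 0.9, 1000 
           for global_step in (0, 1000, 5000, 10000): 
               decayed = learning_rate * decay_rate ** (global_step / decay_steps) 
               print(global_step, decayed)  # 1e-03, 9e-04, ~5.9e-04, ~3.5e-04 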

A final approach is to use our validation set and monitor the accuracy on it. While the validation accuracy keeps increasing, we leave the learning rate untouched; once it stops increasing, we decrease the learning rate by some factor. This process is repeated until training finishes.
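A minimal sketch of this third strategy might look as follows; num_epochs, train_one_epoch, and evaluate are hypothetical helpers standing in for your own training and validation code, and the patience of 3 epochs and decay factor of 10 are just example choices:

           # Hedged sketch: decay when validation accuracy stops improving. 
           learning_rate, best_accuracy, patience = 1e-3, 0.0, 0 
           for epoch in range(num_epochs): 
               train_one_epoch(learning_rate)   # hypothetical training helper 
               accuracy = evaluate()            # hypothetical validation helper 
               if accuracy > best_accuracy: 
                   best_accuracy, patience = accuracy, 0 
               else: 
                   patience += 1 
                   if patience >= 3:            # no improvement for 3 epochs 
                       learning_rate *= 0.1     # decay by a factor of 10 
                       patience = 0 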

All of these methods can produce good results, and it may be worth trying each of them to see which one works best for your model. For this particular model, we will use the second approach: an exponentially decaying learning rate. We use the TensorFlow operation tf.train.exponential_decay to do this, which follows the formula shown previously. As input, it takes the current learning rate, the global step, the number of steps before decaying, and a decay rate.

At every iteration, the current learning rate is supplied to our Adam optimizer, whose minimize function computes a gradient-based update that reduces the loss and increments the global_step variable by one. Lastly, learning_rate and global_step are added to the summary data so they can be displayed on TensorBoard during training:

           with tf.name_scope("optimizer") as scope: 
               global_step = tf.Variable(0, trainable=False) 
               starter_learning_rate = 1e-3 
               # decay every 1000 steps with a base of 0.9 
               learning_rate = tf.train.exponential_decay(starter_learning_rate, 
                                                          global_step, 
                                                          1000, 0.9, staircase=True) 
               self.__train_step = tf.train.AdamOptimizer(learning_rate).minimize( 
                   self.__loss, global_step=global_step) 
               tf.summary.scalar("learning_rate", learning_rate) 
               tf.summary.scalar("global_step", global_step) 
Although the Adam optimizer automatically adapts the effective step size for each parameter, we still find that adding some form of learning rate scheduling on top of it improves results.

Once all the components of the graph have been defined, all the summaries collected in the graph are merged into __merged_summary_op, and all the variables of the graph are initialized by tf.global_variables_initializer().

Naturally, while training the model, we want to store the network weights as binary files so that we can load them back to perform forward propagation. Those binary files in TensorFlow are called checkpoints, and they map variable names to tensor values. To save and restore variables to and from checkpoints, we use the Saver class. To avoid filling up disks, savers manage checkpoint files automatically. For example, they can keep only the N most recent files or one checkpoint for every N hours of training. In our case, we have set max_to_keep to None, which means all checkpoint files are kept:

           # Merge op for tensorboard 
           self.__merged_summary_op = tf.summary.merge_all() 

           # Build graph 
           init = tf.global_variables_initializer() 

           # Saver for checkpoints 
           self.__saver = tf.train.Saver(max_to_keep=None) 
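For completeness, here is a hedged sketch of how this saver is typically used later on, once the session (created below) exists; the checkpoint directory is an assumed example path, not one taken from the book's code:

           # During training, periodically persist the variables to disk 
           # (the "./checkpoints/cifar10" prefix is an assumed example path). 
           self.__saver.save(self.__session, "./checkpoints/cifar10", 
                             global_step=global_step) 
           # Later, restore the latest checkpoint before running inference. 
           self.__saver.restore(self.__session, 
                                tf.train.latest_checkpoint("./checkpoints")) 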

In addition, we can specify the proportion of GPU memory to be used with tf.GPUOptions. The Session object encapsulates the environment in which ops are executed and tensors are evaluated. After creating the FileWriter object to store the summaries and events to a file, the __session.run(init) call runs one step of TensorFlow computation, executing the graph fragments needed to run every operation and evaluate every tensor grouped in init:

           # Avoid allocating the whole memory 
           gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.6) 
           self.__session = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)) 

           # Configure summary to output at given directory 
           self.__writer = tf.summary.FileWriter("./logs/cifar10", self.__session.graph) 

           self.__session.run(init)
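
To show how these pieces come together, the following is a hedged sketch of a single training iteration; batch_images, batch_labels, and the placeholder names are assumptions standing in for the model's actual inputs:

           # Illustrative training step: run the optimizer and collect summaries 
           # (placeholder and batch names are assumed, not from the book's code). 
           _, summary, step = self.__session.run( 
               [self.__train_step, self.__merged_summary_op, global_step], 
               feed_dict={images_placeholder: batch_images, 
                          labels_placeholder: batch_labels}) 
           # Write the merged summaries so TensorBoard can plot them. 
           self.__writer.add_summary(summary, global_step=step) 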
