Gradient descent

We will now implement the full training method for our NN in the form of batch-stochastic gradient descent (BSGD). Let's think about what this means, word by word. Batch means that this training algorithm operates on a subset of the training samples at a time, rather than on the whole dataset at once, while stochastic indicates that each batch is chosen randomly. Gradient means that we will be using the gradient from calculus, which here is the collection of partial derivatives of the loss function with respect to every weight and bias. Finally, descent means that we are trying to reduce the loss function; we do this by iteratively making small changes to the weights and biases by subtracting a scaled version of the gradient.

Remember from calculus that the gradient of a function at a given point points in the direction of the function's greatest increase, and that the opposite direction is that of its greatest decrease. Since we want the loss to decrease, we subtract the gradient.
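
To make this concrete, here is a tiny, self-contained illustration (completely separate from our NN code) of minimizing a simple one-variable loss by repeatedly stepping against its gradient; the particular function, starting point, and training rate are arbitrary choices for demonstration:

# toy loss function: L(w) = (w - 3)^2, which has its minimum at w = 3
def loss(w):
    return (w - 3.0)**2

def gradient(w):
    # derivative of the toy loss with respect to w
    return 2.0*(w - 3.0)

w = 0.0
training_rate = 0.1

for _ in xrange(25):
    # step in the direction opposite to the gradient to reduce the loss
    w -= training_rate * gradient(w)

print 'final w: %s , final loss: %s' % (w, loss(w))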

We will now implement BSGD as the bsgd method in our SequentialNetwork class. Let's go over the input parameters of bsgd one by one; an example call follows the list:

  • training will be a two-dimensional NumPy array of training samples
  • labels will be the desired outputs of the final layer of the NN, with one row corresponding to each training sample
  • delta will indicate the amount by which we perturb each weight or bias when numerically computing derivatives
  • max_streams will indicate the maximum number of concurrent CUDA streams that BSGD will perform calculations over
  • batch_size will indicate the number of samples in each batch on which we compute the loss function for a single update of the weights and biases
  • epochs will indicate how many times we shuffle the order of the current set of samples, break it into a collection of batches, and then perform BSGD on those batches
  • training_rate will indicate the rate at which we update our weights and biases with the computed gradients
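
For illustration, a call to bsgd might look like the following sketch; the training data here is random and the net object is assumed to be a SequentialNetwork that has already been constructed with its layers, so the exact values are placeholders rather than recommendations:

import numpy as np

# hypothetical toy dataset: 100 samples with 4 features each, one-hot labels over 3 classes
training = np.random.rand(100, 4).astype(np.float32)
labels = np.zeros((100, 3), dtype=np.float32)
labels[range(100), np.random.randint(0, 3, size=100)] = 1.0

# 'net' is assumed to be an already-constructed SequentialNetwork object
net.bsgd(training=training, labels=labels, delta=0.001, max_streams=10,
         batch_size=25, epochs=10, training_rate=0.01)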

We'll start out this method as usual by performing some checks and typecasting, setting up the collection of CUDA stream objects in a Python list, and allocating the additional GPU memory that each stream will need in another list:

def bsgd(self, training=None, labels=None, delta=None, max_streams=None, batch_size=None, epochs=1, training_rate=0.01):

    training_rate = np.float32(training_rate)

    training = np.float32(training)
    labels = np.float32(labels)

    if training.shape[0] != labels.shape[0]:
        raise Exception("Number of training data points should be same as labels!")

    # fall back to the defaults stored on the object when no value is given
    if max_streams is None:
        max_streams = self.max_streams

    if epochs is None:
        epochs = self.epochs

    if delta is None:
        delta = self.delta

    streams = []
    bgd_mem = []

    # create the streams needed for training
    for _ in xrange(max_streams):
        streams.append(drv.Stream())
        bgd_mem.append([])

    # allocate memory for each stream
    for i in xrange(len(bgd_mem)):
        for mem_bank in self.network_mem:
            bgd_mem[i].append(gpuarray.empty_like(mem_bank))

Now, we can begin training. For each epoch, we will run one full pass of BSGD over the dataset, randomly shuffling the order of the samples at the start of the epoch. We'll also print some information to the terminal so that the user gets status updates during training:

    num_points = training.shape[0]

    if batch_size is None:
        batch_size = self.max_batch_size

    # we will shuffle this list of indices to randomize the batches in each epoch
    index = range(training.shape[0])

    for k in xrange(epochs):

        print '-----------------------------------------------------------'
        print 'Starting training epoch: %s' % k
        print 'Batch size: %s , Total number of training samples: %s' % (batch_size, num_points)
        print '-----------------------------------------------------------'

        np.random.shuffle(index)
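
Slicing consecutive chunks of a shuffled index list is what makes each batch a random subset of the data. As a small standalone illustration of this batching scheme (with made-up sizes, independent of the NN code):

import numpy as np

num_points = 10
batch_size = 3

index = range(num_points)   # [0, 1, ..., 9]
np.random.shuffle(index)    # shuffle the indices in place

# each slice of the shuffled index list gives one randomly composed batch
for r in xrange(int(np.floor(num_points / float(batch_size)))):
    batch_index = index[r*batch_size:(r+1)*batch_size]
    print 'batch %s uses sample indices: %s' % (r, batch_index)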

Now, we will make a loop that iterates over each batch in the shuffled dataset. For each batch, we start by computing the cross-entropy loss of the network's current predictions, and we print it as well. If the user sees the entropy decreasing over time, they will know that gradient descent is working:

        for r in xrange(int(np.floor(training.shape[0]/batch_size))):

            # reset the list of per-layer gradients for this batch
            all_grad = []

            batch_index = index[r*batch_size:(r+1)*batch_size]

            batch_training = training[batch_index, :]
            batch_labels = labels[batch_index, :]

            batch_predictions = self.predict(batch_training)

            cur_entropy = cross_entropy(predictions=batch_predictions, ground_truth=batch_labels)

            print 'entropy: %s' % cur_entropy
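
The cross_entropy helper used above is defined earlier in this chapter; as a reminder of the underlying idea, a minimal sketch of a batch cross-entropy loss might look like the following (the actual helper may differ in details such as value clipping and input validation):

import numpy as np

def cross_entropy_sketch(predictions=None, ground_truth=None):
    # keep predictions away from exactly 0 and 1 so that the logarithms stay finite
    p = np.clip(np.float32(predictions), 1e-7, 1.0 - 1e-7)
    y = np.float32(ground_truth)
    # average binary cross-entropy over every output of every sample in the batch
    return -np.mean(y*np.log(p) + (1.0 - y)*np.log(1.0 - p))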

We will now iterate through each dense layer of our NN and calculate the gradient over its entire set of weights and biases. We will store the derivatives for the weights and biases in flattened (one-dimensional) arrays, which correspond to the flattened w_t and b_t indices used by our CUDA kernels. Since multiple streams will process the perturbed outputs for different weights, we will use a Python Queue container to hold the set of weights and biases that are yet to be processed for this batch; we can then pop values off this container and hand them to the next available stream. (We store each entry as a tuple whose first element indicates whether it refers to a weight or a bias.)

            for i in xrange(len(self.network)):

                if self.network_summary[i][0] != 'dense':
                    continue

                all_weights = Queue()

                grad_w = np.zeros((self.network[i].weights.size,), dtype=np.float32)
                grad_b = np.zeros((self.network[i].b.size,), dtype=np.float32)

                for w in xrange(self.network[i].weights.size):
                    all_weights.put(('w', np.int32(w)))

                for b in xrange(self.network[i].b.size):
                    all_weights.put(('b', np.int32(b)))

Now, we need to iterate over every weight and bias, which we can do with a while loop that runs until the queue we just set up is empty. We will set up another queue, stream_weights, to keep track of which weights and biases each stream has processed in the current pass. Notice that we have already performed a full predict for this batch of samples to calculate the entropy, so we are able to call partial_predict on this batch, provided we are careful about which memory and layers we use. After setting the weight and bias indices appropriately, we call partial_predict with the current stream and its corresponding GPU memory arrays:
                while not all_weights.empty():

                    stream_weights = Queue()

                    for j in xrange(max_streams):

                        if all_weights.empty():
                            break

                        wb = all_weights.get()

                        if wb[0] == 'w':
                            w_t = wb[1]
                            b_t = None
                        elif wb[0] == 'b':
                            b_t = wb[1]
                            w_t = None

                        stream_weights.put(wb)

                        self.partial_predict(layer_index=i, w_t=w_t, b_t=b_t, partial_mem=bgd_mem[j], stream=streams[j], batch_size=batch_size, delta=delta)

So far, we have only computed the network's output for a small set of perturbed weights and biases. We now have to compute the cross-entropy for each of these outputs and store the resulting derivative in the flattened arrays. Note that each derivative is estimated as a forward difference, (perturbed entropy - current entropy) / delta, and that we store its negation so that simply adding the stored gradient to the weights later will decrease the loss:

                    for j in xrange(max_streams):

                        if stream_weights.empty():
                            break

                        wb = stream_weights.get()

                        w_predictions = bgd_mem[j][-1].get_async(stream=streams[j])

                        w_entropy = cross_entropy(predictions=w_predictions[:batch_size, :], ground_truth=batch_labels)

                        if wb[0] == 'w':
                            w_t = wb[1]
                            grad_w[w_t] = -(w_entropy - cur_entropy) / delta

                        elif wb[0] == 'b':
                            b_t = wb[1]
                            grad_b[b_t] = -(w_entropy - cur_entropy) / delta
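
The derivative estimates above are just forward differences, the same technique you might use to numerically differentiate any function. As a standalone illustration on an ordinary Python function (unrelated to the NN code):

# forward-difference estimate of a derivative: f'(x) is roughly (f(x + delta) - f(x)) / delta
def f(x):
    return x**3

delta = 0.001
x = 2.0

numerical_derivative = (f(x + delta) - f(x)) / delta
print 'estimated derivative: %s (exact value is 12.0)' % numerical_derivative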

We have now finished the while loop; once we exit it, we know that the derivatives of all weights and biases for this particular layer have been calculated. Before we move on to the next layer, we append the gradient of the current set of weights and biases to the all_grad list, reshaping the flattened array of weight derivatives back into the shape of the original weight matrix as we do so:

                all_grad.append([np.reshape(grad_w, self.network[i].weights.shape), grad_b])

After we are done iterating over every layer, we can update the weights and biases of our NN using the gradients from this batch. Since we stored the negated derivatives, we simply add training_rate times each gradient to the corresponding weights and biases; notice that a training_rate much smaller than 1 slows down how quickly the weights change:

            # the gradients in all_grad are stored in the same order as the dense layers,
            # so we keep a separate counter to match them up
            grad_index = 0

            for i in xrange(len(self.network)):
                if self.network_summary[i][0] == 'dense':
                    new_weights = self.network[i].weights.get()
                    new_weights += training_rate*all_grad[grad_index][0]
                    new_bias = self.network[i].b.get()
                    new_bias += training_rate*all_grad[grad_index][1]
                    self.network[i].weights.set(new_weights)
                    self.network[i].b.set(new_bias)
                    grad_index += 1

We have fully implemented a (very simple) GPU-based DNN!
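
As a closing illustration, here is a hypothetical end-to-end sketch of how training might be invoked; the SequentialNetwork constructor and the add_layer layer specifications are assumed to behave as developed earlier in this chapter, and the data is random, so treat the exact keyword arguments and layer parameters as placeholders that may need to be adapted:

import numpy as np

# hypothetical random dataset: 64 samples with 4 features, one-hot labels over 3 classes
x = np.random.rand(64, 4).astype(np.float32)
y = np.zeros((64, 3), dtype=np.float32)
y[range(64), np.random.randint(0, 3, size=64)] = 1.0

# the constructor arguments and layer dictionaries below are assumptions based on the
# SequentialNetwork class built earlier in this chapter
net = SequentialNetwork(max_batch_size=32)
net.add_layer({'type': 'dense', 'num_inputs': 4, 'num_outputs': 10, 'relu': True, 'sigmoid': False})
net.add_layer({'type': 'dense', 'num_inputs': 10, 'num_outputs': 3, 'relu': True, 'sigmoid': False})
net.add_layer({'type': 'softmax'})

# train with BSGD, then inspect the network's predictions on the training data
net.bsgd(training=x, labels=y, delta=0.0001, max_streams=10, batch_size=16,
         epochs=5, training_rate=0.01)
predictions = net.predict(x)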
