Defining the train and evaluate functions

Training the model is very similar to what we saw in the previous examples in this book, but there are a few important changes we need to make so that the trained model performs better. Let's look at the code and its key parts:

criterion = nn.CrossEntropyLoss()

def trainf():
    # Turn on training mode, which enables dropout.
    lstm.train()
    total_loss = 0
    start_time = time.time()
    hidden = lstm.init_hidden(batch_size)
    for i, batch in enumerate(train_iter):
        data, targets = batch.text, batch.target.view(-1)
        # Starting each batch, we detach the hidden state from how it was previously produced.
        # If we didn't, the model would try backpropagating all the way to the start of the dataset.
        hidden = repackage_hidden(hidden)
        lstm.zero_grad()
        output, hidden = lstm(data, hidden)
        loss = criterion(output.view(-1, ntokens), targets)
        loss.backward()

        # `clip_grad_norm` helps prevent the exploding gradient problem in RNNs / LSTMs.
        torch.nn.utils.clip_grad_norm(lstm.parameters(), clip)
        for p in lstm.parameters():
            p.data.add_(-lr, p.grad.data)

        total_loss += loss.data

        if i % log_interval == 0 and i > 0:
            cur_loss = total_loss[0] / log_interval
            elapsed = time.time() - start_time
            print('| epoch {:3d} | {:5d}/{:5d} batches | lr {:02.2f} | '
                  'ms/batch {:5.2f} | loss {:5.2f} | ppl {:8.2f}'.format(
                      epoch, i, len(train_iter), lr,
                      elapsed * 1000 / log_interval, cur_loss, math.exp(cur_loss)))
            total_loss = 0
            start_time = time.time()

Since we are using dropout in our model, it needs to behave differently during training and during validation/testing. Calling train() on the model ensures that dropout is active during training, and calling eval() on the model disables it for evaluation:

lstm.train()
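To make the difference concrete, the following standalone snippet (not part of the chapter's code) shows how train() and eval() change the behaviour of a dropout layer in recent PyTorch versions:

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()    # training mode: roughly half the values are zeroed, the rest are scaled by 1/(1-p)
print(drop(x))

drop.eval()     # evaluation mode: dropout becomes the identity
print(drop(x))  # prints the unchanged input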

For an LSTM model, along with the input, we also need to pass the hidden state. The init_hidden function takes the batch size as input and returns a hidden state that can be passed to the model along with the inputs. We iterate through the training data and pass the input data to the model. Since we are processing sequence data, it does not make sense to start with a new, randomly initialized hidden state at every iteration. Instead, we reuse the hidden state from the previous iteration after removing it from the computation graph by calling the detach method. If we did not call detach, we would end up calculating gradients over a very long sequence until we ran out of GPU memory.
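For reference, a minimal sketch of what such an init_hidden method could look like is shown below; the attribute names nlayers and nhid are assumptions about how the model was constructed, and Variable is assumed to be imported from torch.autograd:

def init_hidden(self, bsz):
    # Create zero-filled (h, c) states with the same type and device as the model's weights.
    # self.nlayers and self.nhid are assumed to be stored on the model at construction time.
    weight = next(self.parameters()).data
    return (Variable(weight.new(self.nlayers, bsz, self.nhid).zero_()),
            Variable(weight.new(self.nlayers, bsz, self.nhid).zero_()))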

We then pass the input to the LSTM model and calculate the loss using CrossEntropyLoss. Reusing the previous value of the hidden state is implemented in the following repackage_hidden function:

def repackage_hidden(h):
    """Wraps hidden states in new Variables, to detach them from their history."""
    if type(h) == Variable:
        return Variable(h.data)
    else:
        return tuple(repackage_hidden(v) for v in h)
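Note that Variable has since been merged into torch.Tensor; from PyTorch 0.4 onwards the same function is typically written with detach():

def repackage_hidden(h):
    """Detach hidden states from their history (PyTorch >= 0.4 style)."""
    if isinstance(h, torch.Tensor):
        return h.detach()
    return tuple(repackage_hidden(v) for v in h)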

RNNs and their variants, such as LSTM and the Gated Recurrent Unit (GRU), suffer from a problem called exploding gradients. One simple trick to avoid the problem is to clip the gradients, which is done in the following code:

torch.nn.utils.clip_grad_norm(lstm.parameters(), clip)
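In more recent PyTorch releases this function has been renamed clip_grad_norm_ (with a trailing underscore, marking it as an in-place operation), so the equivalent call would be:

torch.nn.utils.clip_grad_norm_(lstm.parameters(), clip)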

We manually adjust the values of the parameters by using the following code. Implementing an optimizer manually gives more flexibility than using a prebuilt optimizer:

for p in lstm.parameters():
    p.data.add_(-lr, p.grad.data)

We iterate through all the parameters and add to each one its gradient multiplied by the negative learning rate, which is exactly the plain SGD update. Once we have updated all the parameters, we log statistics such as time, loss, and perplexity.
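For comparison, the same update can be expressed with a prebuilt optimizer. The following sketch is not the code used in this chapter, just the equivalent torch.optim formulation:

optimizer = torch.optim.SGD(lstm.parameters(), lr=lr)

# Inside the training loop, after loss.backward() and gradient clipping:
optimizer.step()       # applies p <- p - lr * p.grad to every parameter
optimizer.zero_grad()  # clear the gradients before the next batch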

We write a similar function for validation, where we call the eval method on the model. The evaluate function is defined using the following code:

def evaluate(data_source):
    # Turn on evaluation mode which disables dropout.
    lstm.eval()
    total_loss = 0
    hidden = lstm.init_hidden(batch_size)
    for batch in data_source:
        data, targets = batch.text, batch.target.view(-1)
        output, hidden = lstm(data, hidden)
        output_flat = output.view(-1, ntokens)
        total_loss += len(data) * criterion(output_flat, targets).data
        hidden = repackage_hidden(hidden)
    return total_loss[0] / (len(data_source.dataset[0].text) // batch_size)

Most of the training logic and evaluation logic is similar, except that in evaluation we call eval on the model and do not update its parameters.
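A driver loop along the following lines can then tie the two functions together. The names epochs and valid_iter, and the learning-rate annealing policy, are assumptions for illustration rather than part of the code above:

best_val_loss = None
for epoch in range(1, epochs + 1):
    trainf()
    val_loss = evaluate(valid_iter)
    print('| end of epoch {:3d} | valid loss {:5.2f} | valid ppl {:8.2f}'.format(
        epoch, val_loss, math.exp(val_loss)))
    if best_val_loss is None or val_loss < best_val_loss:
        best_val_loss = val_loss
    else:
        # Anneal the learning rate when validation loss stops improving.
        lr /= 4.0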
