We train the model for multiple epochs and validate it using the following code:
import math
import time

# Loop over epochs.
best_val_loss = None
epochs = 40
lr = 20  # initial learning rate

for epoch in range(1, epochs + 1):
    epoch_start_time = time.time()
    trainf()
    val_loss = evaluate(valid_iter)
    print('-' * 89)
    print('| end of epoch {:3d} | time: {:5.2f}s | valid loss {:5.2f} | '
          'valid ppl {:8.2f}'.format(epoch, (time.time() - epoch_start_time),
                                     val_loss, math.exp(val_loss)))
    print('-' * 89)
    if not best_val_loss or val_loss < best_val_loss:
        best_val_loss = val_loss
    else:
        # Anneal the learning rate if no improvement has been seen
        # on the validation dataset.
        lr /= 4.0
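The anneal-on-plateau branch is easiest to see with stubbed numbers. The following is a minimal, self-contained sketch of the same schedule, where the list of validation losses is hypothetical and stands in for the trainf/evaluate calls: the loss improves for three epochs and then plateaus twice, triggering two annealing steps.

```python
import math

# Hypothetical validation losses: three improvements, then two plateaus.
fake_val_losses = [5.2, 5.0, 4.9, 4.95, 4.93]

lr = 20.0  # same initial learning rate as in the chapter
best_val_loss = None

for epoch, val_loss in enumerate(fake_val_losses, start=1):
    if not best_val_loss or val_loss < best_val_loss:
        best_val_loss = val_loss
    else:
        # No improvement on the validation set: anneal the learning rate.
        lr /= 4.0
    print('epoch {:d} | val loss {:.2f} | val ppl {:6.2f} | lr {:g}'.format(
        epoch, val_loss, math.exp(val_loss), lr))

# Two plateaus trigger two annealing steps: 20 -> 5 -> 1.25
```

Note that dividing by 4 on every non-improving epoch decays the rate quickly; schedulers such as reduce-on-plateau typically add a patience window before each cut, but the simple version above matches the chapter's loop.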
The preceding code trains the model for 40 epochs. We start with a high learning rate of 20 and reduce it by a factor of 4 whenever the validation loss stops improving. Running the model for 40 epochs gives a perplexity (ppl) score of approximately 108.45. The following block contains the logs from the last run of the model:
-----------------------------------------------------------------------------------------
| end of epoch  39 | time: 34.16s | valid loss  4.70 | valid ppl   110.01
-----------------------------------------------------------------------------------------
| epoch  40 |   200/ 3481 batches | lr 0.31 | ms/batch 11.47 | loss  4.77 | ppl   117.40
| epoch  40 |   400/ 3481 batches | lr 0.31 | ms/batch  9.56 | loss  4.81 | ppl   122.19
| epoch  40 |   600/ 3481 batches | lr 0.31 | ms/batch  9.43 | loss  4.73 | ppl   113.08
| epoch  40 |   800/ 3481 batches | lr 0.31 | ms/batch  9.48 | loss  4.65 | ppl   104.77
| epoch  40 |  1000/ 3481 batches | lr 0.31 | ms/batch  9.42 | loss  4.76 | ppl   116.42
| epoch  40 |  1200/ 3481 batches | lr 0.31 | ms/batch  9.55 | loss  4.70 | ppl   109.77
| epoch  40 |  1400/ 3481 batches | lr 0.31 | ms/batch  9.41 | loss  4.74 | ppl   114.61
| epoch  40 |  1600/ 3481 batches | lr 0.31 | ms/batch  9.47 | loss  4.77 | ppl   117.65
| epoch  40 |  1800/ 3481 batches | lr 0.31 | ms/batch  9.46 | loss  4.77 | ppl   118.42
| epoch  40 |  2000/ 3481 batches | lr 0.31 | ms/batch  9.44 | loss  4.76 | ppl   116.31
| epoch  40 |  2200/ 3481 batches | lr 0.31 | ms/batch  9.46 | loss  4.77 | ppl   117.52
| epoch  40 |  2400/ 3481 batches | lr 0.31 | ms/batch  9.43 | loss  4.74 | ppl   114.06
| epoch  40 |  2600/ 3481 batches | lr 0.31 | ms/batch  9.44 | loss  4.62 | ppl   101.72
| epoch  40 |  2800/ 3481 batches | lr 0.31 | ms/batch  9.44 | loss  4.69 | ppl   109.30
| epoch  40 |  3000/ 3481 batches | lr 0.31 | ms/batch  9.47 | loss  4.71 | ppl   111.51
| epoch  40 |  3200/ 3481 batches | lr 0.31 | ms/batch  9.43 | loss  4.70 | ppl   109.65
| epoch  40 |  3400/ 3481 batches | lr 0.31 | ms/batch  9.51 | loss  4.63 | ppl   102.43
val loss 4.686332647950745
-----------------------------------------------------------------------------------------
| end of epoch  40 | time: 34.50s | valid loss  4.69 | valid ppl   108.45
-----------------------------------------------------------------------------------------
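Two details of this log are worth decoding. The valid ppl column is simply the exponential of the validation cross-entropy loss, so the final val loss of 4.686 maps directly to the reported perplexity of 108.45. And the lr 0.31 shown for epoch 40 is the initial learning rate of 20 after three annealing steps (20 → 5 → 1.25 → 0.3125). A quick check of both:

```python
import math

# Perplexity is the exponential of the cross-entropy loss; the final
# validation loss printed in the log maps to the reported ppl.
val_loss = 4.686332647950745
print('valid ppl {:.2f}'.format(math.exp(val_loss)))  # -> valid ppl 108.45

# lr 0.31 in the log is the initial rate of 20 annealed three times.
lr = 20.0
for _ in range(3):
    lr /= 4.0
print('lr {:.2f}'.format(lr))  # -> lr 0.31
```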
In the last few months, researchers have started exploring this approach to build language models that produce pretrained embeddings. If you are interested in this approach, I strongly recommend reading the paper Universal Language Model Fine-tuning for Text Classification (https://arxiv.org/abs/1801.06146) by Jeremy Howard and Sebastian Ruder, where they go into detail on how language-modeling techniques can be used to prepare domain-specific word embeddings, which can later be used for different NLP tasks, such as text classification.