Bias initialization of gates

At a recent ML conference, the International Conference on Learning Representations, a team from Facebook AI Research presented a paper on advances in RNNs. The paper was concerned with the effectiveness of RNNs augmented with GRU/LSTM units. Though a deep dive into the paper is outside the scope of this book, you can read more about it in the Further reading section at the end of this chapter. An interesting hypothesis emerged from their research: that the bias vectors of these units can be initialized in a particular way, and that doing so improves the network's ability to learn very long-term dependencies. Their published results suggest an improvement in training time and in the speed with which perplexity is reduced:

This graph, taken from the paper, shows the network's loss on the y axis and the number of training iterations on the x axis. The red curve indicates chrono initialization.

This is very new research, and there is definite scientific value in understanding why LSTM/GRU-based networks perform as well as they do. The main practical implication of this paper, namely the initialization of the gated unit's biases, offers us yet another tool for improving model performance and saving precious GPU cycles. For now, these performance improvements are most significant (though still incremental) on the word-level PTB and character-level text8 datasets. The network we will build in the next section can easily be adapted to test the relative performance benefits of this change, as sketched below.
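If you want to experiment with this yourself, the following is a minimal sketch of chrono-style bias initialization for an LSTM, written in Python with NumPy purely for illustration. The function name chrono_lstm_biases and the parameter t_max (the longest dependency range you expect in your data) are our own; the rough idea is to draw the forget-gate bias from the log of a uniform distribution over [1, t_max - 1] and set the input-gate bias to its negative:

```python
import numpy as np

def chrono_lstm_biases(hidden_size, t_max, rng=None):
    """Chrono-style initialization of LSTM gate biases (illustrative sketch).

    The forget-gate bias is drawn as log(U(1, t_max - 1)), so that each
    unit's characteristic memory time spans roughly 1..t_max steps; the
    input-gate bias is set to its negative. The remaining biases are
    left at zero.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Forget gate: log of a uniform draw over the expected dependency range.
    b_f = np.log(rng.uniform(1.0, t_max - 1.0, size=hidden_size))
    # Input gate: mirror of the forget gate.
    b_i = -b_f
    # Cell-candidate and output biases stay at zero.
    b_g = np.zeros(hidden_size)
    b_o = np.zeros(hidden_size)
    return b_i, b_f, b_g, b_o

# Example: biases for a 256-unit LSTM, assuming dependencies of up to ~700 steps.
b_i, b_f, b_g, b_o = chrono_lstm_biases(hidden_size=256, t_max=700)
```

Any framework that lets you set gate biases directly can accept vectors produced this way; the only tuning decision is the choice of t_max, which should reflect the longest dependencies present in your dataset.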
