Which hyperparameters should we optimize?

Even if you follow my advice above and settle on a good-enough architecture, you can and should still search for ideal hyperparameters within that architecture. Some of the hyperparameters we might want to search include the following (a minimal search over these choices is sketched after the list):

  • Our choice of optimizer. Thus far, I've been using Adam, but an RMSprop optimizer or a well-tuned SGD may do better.
  • The hyperparameters of whichever optimizer we choose, such as learning rate, momentum, and decay.
  • Network weight initialization.
  • Neuron activation functions.
  • Regularization parameters, such as dropout probability or the penalty strength used in L2 regularization.
  • Batch size.
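
To make this concrete, what follows is a minimal sketch of a random search over a space like the one above, written against the Keras API. The search_space values, the build_network helper, and the toy data are illustrative assumptions rather than recommendations, and only one optimizer hyperparameter (learning rate) is included; a real search would use your own data, more sampled combinations, and more epochs.

import random

import numpy as np
from tensorflow import keras

# Hypothetical search space covering the choices listed above; the specific
# candidate values are illustrative, not recommendations.
search_space = {
    "optimizer": ["adam", "rmsprop", "sgd"],
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "initializer": ["glorot_uniform", "he_normal"],
    "activation": ["relu", "tanh"],
    "dropout": [0.0, 0.25, 0.5],
    "l2": [0.0, 1e-4, 1e-2],
    "batch_size": [32, 64, 128],
}

def build_network(params, input_dim=20):
    """Build a small binary classifier from one sampled combination."""
    model = keras.Sequential([
        keras.Input(shape=(input_dim,)),
        keras.layers.Dense(
            64,
            activation=params["activation"],
            kernel_initializer=params["initializer"],
            kernel_regularizer=keras.regularizers.l2(params["l2"]),
        ),
        keras.layers.Dropout(params["dropout"]),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    optimizers = {
        "adam": keras.optimizers.Adam,
        "rmsprop": keras.optimizers.RMSprop,
        "sgd": keras.optimizers.SGD,
    }
    optimizer = optimizers[params["optimizer"]](
        learning_rate=params["learning_rate"])
    model.compile(optimizer=optimizer, loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# Toy data standing in for a real train/validation split.
rng = np.random.default_rng(42)
X_train, y_train = rng.normal(size=(500, 20)), rng.integers(0, 2, size=500)
X_val, y_val = rng.normal(size=(100, 20)), rng.integers(0, 2, size=100)

# Sample a handful of random combinations and keep the best validation score.
best_score, best_params = -np.inf, None
for _ in range(10):
    params = {name: random.choice(values)
              for name, values in search_space.items()}
    model = build_network(params)
    model.fit(X_train, y_train, batch_size=params["batch_size"], epochs=5,
              verbose=0)
    _, accuracy = model.evaluate(X_val, y_val, verbose=0)
    if accuracy > best_score:
        best_score, best_params = accuracy, params

print("best validation accuracy:", best_score)
print("best hyperparameters:", best_params)

Random search of this kind tends to be a reasonable default when the number of combinations is large, because it samples the space broadly rather than exhaustively enumerating a grid.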

As implied above, this is not an exhaustive list. There are certainly more options you could try, including varying the number of neurons in each hidden layer, varying the dropout probability per layer, and so on. The possible combinations of hyperparameters are, as we've been implying, practically limitless. It is also quite possible that these choices are not independent of the network architecture: adding and removing layers may result in a new optimal choice for any of these hyperparameters.
