Optimization and network architecture

For our optimization method, we use Adam. You may recall from Chapter 2, What is a Neural Network and How Do I Train One?, that the Adam solver belongs to the class of solvers that use an adaptive learning rate. In vanilla SGD, the learning rate is fixed; here, it is set per parameter, giving us more control in cases where sparsity of the data (vectors) is a problem. Additionally, Adam keeps a root mean square (RMSProp-style) moving average of the previous gradients, which tracks how quickly the shape of our optimization surface is changing and, by doing so, improves how our network handles noise in the data.
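
To make the per-parameter update concrete, the following is a minimal sketch of a single Adam step in plain Go. The names (adamStep, beta1, beta2, and so on) and the hardcoded values are illustrative only; in practice, the solver in our deep learning library performs this update for us:

package main

import (
	"fmt"
	"math"
)

// adamStep applies one Adam update to a parameter vector.
// m and v hold the running (biased) estimates of the first and second
// moments of the gradient; t is the 1-based step count.
func adamStep(params, grads, m, v []float64, t int, lr, beta1, beta2, eps float64) {
	for i := range params {
		// Exponential moving averages of the gradient and its square.
		m[i] = beta1*m[i] + (1-beta1)*grads[i]
		v[i] = beta2*v[i] + (1-beta2)*grads[i]*grads[i]

		// Bias correction for the early steps.
		mHat := m[i] / (1 - math.Pow(beta1, float64(t)))
		vHat := v[i] / (1 - math.Pow(beta2, float64(t)))

		// The effective learning rate shrinks for parameters whose gradients
		// have been large or noisy, and stays larger for sparse,
		// rarely-updated parameters.
		params[i] -= lr * mHat / (math.Sqrt(vHat) + eps)
	}
}

func main() {
	params := []float64{0.5, -0.3}
	grads := []float64{0.1, -0.02}
	m := make([]float64, len(params))
	v := make([]float64, len(params))
	adamStep(params, grads, m, v, 1, 0.001, 0.9, 0.999, 1e-8)
	fmt.Println(params)
}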

Now, let's talk about the layers of our neural network. Our first two layers are standard feedforward layers with Rectified Linear Unit (ReLU) activations:

output = activation(dotp(input, weights) + bias)

The first is sized according to the state size (that is, a vector representation of all the possible states in the system).

Our output layer is sized to the number of possible actions. Its values are produced by applying a linear activation to the output of our second hidden layer.
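
Putting these pieces together, the forward pass looks roughly like the following plain-Go sketch. The layer sizes and names (dense, stateSize, hiddenSize, numActions) are placeholders chosen for illustration; in the actual code, the matrix operations are handled by our deep learning library:

package main

import "fmt"

// dense computes activation(dot(input, weights) + bias) for a single layer.
// weights is laid out as [inputDim][outputDim].
func dense(input []float64, weights [][]float64, bias []float64, act func(float64) float64) []float64 {
	out := make([]float64, len(bias))
	for j := range out {
		sum := bias[j]
		for i, x := range input {
			sum += x * weights[i][j]
		}
		out[j] = act(sum)
	}
	return out
}

func relu(x float64) float64 {
	if x > 0 {
		return x
	}
	return 0
}

func linear(x float64) float64 { return x }

// newMatrix allocates a zero-initialized rows x cols weight matrix.
func newMatrix(rows, cols int) [][]float64 {
	m := make([][]float64, rows)
	for i := range m {
		m[i] = make([]float64, cols)
	}
	return m
}

func main() {
	const stateSize, hiddenSize, numActions = 4, 8, 2 // illustrative sizes

	state := []float64{1, 0, 0, 0} // a one-hot state vector

	// Two ReLU hidden layers followed by a linear output layer with
	// one unit per possible action.
	h1 := dense(state, newMatrix(stateSize, hiddenSize), make([]float64, hiddenSize), relu)
	h2 := dense(h1, newMatrix(hiddenSize, hiddenSize), make([]float64, hiddenSize), relu)
	out := dense(h2, newMatrix(hiddenSize, numActions), make([]float64, numActions), linear)

	fmt.Println(out) // one value per action
}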

Our loss function depends on the task and data we have; in general, we will use MSE or cross-entropy loss.
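
For the MSE case, the quantity being minimized is simply the average squared difference between the network's outputs and the targets. A minimal sketch (the mse function name is illustrative, not taken from any library):

package main

import "fmt"

// mse returns the mean squared error between predictions and targets.
func mse(pred, target []float64) float64 {
	var sum float64
	for i := range pred {
		d := pred[i] - target[i]
		sum += d * d
	}
	return sum / float64(len(pred))
}

func main() {
	fmt.Println(mse([]float64{0.5, 1.0}, []float64{0.0, 1.0})) // 0.125
}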
