What is a DQN?

As you will learn, a DQN is not that different from the standard feedforward and convolutional networks that we have covered so far. Indeed, all the standard ingredients are present:

  • A representation of our data (in this example, the state of our maze and the agent trying to navigate through it)
  • Standard layers to process that representation, including standard operations between the layers, such as the Tanh activation function (see the sketch after this list)
  • An output layer with a linear activation, which gives you predictions
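
To make these ingredients concrete, here is a rough sketch in plain Go (no external libraries) of what such a network's forward pass might look like. The qNetwork type, the layer sizes, and the one-hot encoding of the maze are illustrative assumptions, not the exact network we build later:

package main

import (
    "fmt"
    "math"
    "math/rand"
)

// qNetwork is an illustrative feedforward network: one Tanh hidden
// layer followed by a linear output layer that emits one Q-value
// per possible action.
type qNetwork struct {
    w1 [][]float64 // input-to-hidden weights
    w2 [][]float64 // hidden-to-output weights
}

// randMat builds a rows x cols weight matrix with small random values.
func randMat(rows, cols int) [][]float64 {
    m := make([][]float64, rows)
    for i := range m {
        m[i] = make([]float64, cols)
        for j := range m[i] {
            m[i][j] = rand.NormFloat64() * 0.1
        }
    }
    return m
}

// forward maps a state representation to a vector of Q-values,
// one per action (for example: up, down, left, right).
func (n *qNetwork) forward(state []float64) []float64 {
    hidden := make([]float64, len(n.w1[0]))
    for j := range hidden {
        for i, x := range state {
            hidden[j] += x * n.w1[i][j]
        }
        hidden[j] = math.Tanh(hidden[j]) // Tanh activation between layers
    }
    out := make([]float64, len(n.w2[0]))
    for k := range out {
        for j, h := range hidden {
            out[k] += h * n.w2[j][k] // linear output: raw Q-value predictions
        }
    }
    return out
}

func main() {
    // A flattened 4x4 maze encoded as a one-hot vector, with 4 possible moves.
    net := &qNetwork{w1: randMat(16, 8), w2: randMat(8, 4)}
    state := make([]float64, 16)
    state[0] = 1 // the agent starts in the top-left cell
    fmt.Println(net.forward(state))
}

The building blocks are the same ones we have already used; what changes is the interpretation of the output layer, which now produces one predicted Q-value per possible move rather than, say, class scores.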

Here, our predictions represent possible moves that affect the state of our input. In the case of maze solving, we are trying to predict the moves that produce the maximum cumulative expected reward for our player, ultimately leading to the maze's exit. These predictions occur as part of a training loop, where the learning algorithm uses a Gamma variable that decays over time to balance exploration of the environment's state space against exploitation of the knowledge gleaned by building up a map of states, actions, and rewards.

Let's introduce a number of new concepts. First, we need an m x n matrix that represents the rewards, R, for a given state (that is, a row) and action (that is, a column). We also need a Q table. This is a matrix (initialized with zero values) that represents the memory of the agent (that is, our player trying to find its way through the maze), or a history of states, actions taken, and their rewards.

These two matrices relate to each other. We can determine the memory (Q table) of our agent with respect to the table of known rewards with the following formula:

Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]
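
As a rough sketch of how these two matrices and the formula fit together, the following toy Go program applies the update once to a tiny, invented 3-state, 3-action reward matrix; the values in R and the choice of gamma are made up for illustration and are not the maze we build in this chapter:

package main

import "fmt"

const gamma = 0.8 // discount factor; 0.8 is an arbitrary illustrative value

// maxQ returns the largest Q-value reachable from a given state,
// that is, Max[Q(next state, all actions)] in the formula above.
func maxQ(q [][]float64, state int) float64 {
    best := q[state][0]
    for _, v := range q[state][1:] {
        if v > best {
            best = v
        }
    }
    return best
}

func main() {
    // R: rewards for each (state, action) pair. -1 marks an illegal move
    // and 100 marks a move that reaches the goal (invented values).
    r := [][]float64{
        {-1, 0, -1},
        {0, -1, 100},
        {-1, 0, 100},
    }

    // Q: the agent's memory, initialized with zero values.
    q := make([][]float64, len(r))
    for i := range q {
        q[i] = make([]float64, len(r[i]))
    }

    // One illustrative update: from state 1, the agent takes action 2
    // and (in this toy example) lands in state 2.
    state, action, nextState := 1, 2, 2
    q[state][action] = r[state][action] + gamma*maxQ(q, nextState)

    fmt.Println(q) // the (1, 2) entry is now 100 + 0.8 * 0 = 100
}

Repeated over many episodes, updates like this gradually propagate the reward at the exit backward through the Q table.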

Here, an epoch is an episode: our agent performs actions and receives updates or rewards from the environment until the state of the system is terminal. In our example, this means getting stuck in the maze.

The thing we are trying to learn is a policy. This policy is a function or a map of states to actions. It is a giant n-dimensional table of optimal actions given every possible state in our system.
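
If we already had a fully learned Q table, reading a policy out of it would simply mean picking the highest-valued action for each state. A minimal sketch of that lookup, over an invented Q table, might look like this:

package main

import "fmt"

// greedyPolicy maps every state to the action with the highest learned
// Q-value: the table of optimal actions per state described above.
func greedyPolicy(q [][]float64) []int {
    policy := make([]int, len(q))
    for s, row := range q {
        best := 0
        for a, v := range row {
            if v > row[best] {
                best = a
            }
        }
        policy[s] = best
    }
    return policy
}

func main() {
    // A made-up 3-state, 4-action Q table (actions: up, down, left, right).
    q := [][]float64{
        {0, 5, 0, 1},
        {2, 0, 0, 8},
        {0, 0, 9, 3},
    }
    fmt.Println(greedyPolicy(q)) // prints [1 3 2]
}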

Our ability to assess a state, S, is dependent on the assumption that it is a Markov Decision Process (MDP). As we've pointed out previously, this book is more concerned with implementation rather than theory; however, MDPs are fundamental to any real understanding of RL, so it's worth going over them in a bit of detail.

We use a capital S to denote all the possible states of our system. In the case of a maze, this is every possible location of an agent within the boundaries of the maze.

We use a lowercase s to denote a single state. The same applies to all actions, A, and an individual action, a.

Each pair (s, a) produces a distribution over rewards, R. It also produces P, which is referred to as the transition probability: for a given (s, a), P gives the distribution over possible next states, s(t + 1).
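
One minimal way to sketch these MDP ingredients as Go types is shown below; the State, Action, and Transition names, and the probabilities in the example, are assumptions made purely for illustration:

package main

import "fmt"

// State and Action are single elements of S and A, respectively.
type State int
type Action int

// Transition describes one possible outcome of taking action a in state s:
// the next state s(t + 1), the probability of landing there, and the reward.
type Transition struct {
    Next        State
    Probability float64
    Reward      float64
}

// MDP maps every (s, a) pair to a distribution over next states and rewards.
type MDP map[State]map[Action][]Transition

func main() {
    mdp := MDP{
        0: {
            0: {
                {Next: 1, Probability: 0.9, Reward: 0},  // usually the move succeeds
                {Next: 0, Probability: 0.1, Reward: -1}, // occasionally we bump into a wall
            },
        },
    }
    fmt.Println(mdp[0][0]) // the distribution of possible next states for (s=0, a=0)
}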

We also have a hyperparameter: the discount factor, gamma. As with hyperparameters generally, this is something we set ourselves. Gamma is the relative value assigned to predicted rewards at future time steps; for example, we might value a reward predicted at the next time step more highly than one predicted three time steps from now. We can express this in the context of our objective, which is to learn an optimal policy; the pseudocode looks like this:

OptimalPolicy = max(sum(gamma^t x reward(t)) over all timesteps t)
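
A small worked example of that objective, summing gamma^t x reward over the time steps of a trajectory, might look like the following; the reward sequence and the gamma of 0.9 are invented for illustration:

package main

import (
    "fmt"
    "math"
)

// discountedReturn computes the sum over t of gamma^t * reward[t],
// the quantity an optimal policy tries to maximize.
func discountedReturn(rewards []float64, gamma float64) float64 {
    total := 0.0
    for t, r := range rewards {
        total += math.Pow(gamma, float64(t)) * r
    }
    return total
}

func main() {
    rewards := []float64{0, 0, 100} // an invented trajectory: the goal is reached at t = 2
    // With gamma = 0.9, a reward two time steps away is worth 0.9 * 0.9 * 100 = 81 now,
    // so nearer rewards count for more than distant ones.
    fmt.Println(discountedReturn(rewards, 0.9)) // prints roughly 81
}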

Breaking down the conceptual components of our DQN further, we can now talk about the value function. This function indicates the cumulative expected reward for a given state. For example, early in our maze exploration, the cumulative expected reward is low, simply because of the number of possible actions or states our agent could still take or occupy before finding the exit.
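
One common way to express the value function, using the same notation as the Q formula above (and assuming we read the Q table greedily), is:

Value(state) = Max[Q(state, all actions)]

In other words, a state is worth as much as the best action we can take from it: cells close to the maze's exit carry high values, while cells many moves away from any reward carry low ones, which is why the expected cumulative reward starts out low.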
