Defining Replay

The replay function below is called inside the train function (defined in the next section) at the end of each game to train the agent. It is in this function that we define the target for each state using the Q-function Bellman equation.

import random
import numpy as np

def replay(epsilon, gamma, epsilon_min, epsilon_decay, model, training_data, batch_size=64):
    """Train the agent on a batch of data."""
    # Sample a random minibatch of indices from the stored experience.
    idx = random.sample(range(len(training_data)), min(len(training_data), batch_size))
    train_batch = [training_data[j] for j in idx]
    for state, new_state, reward, done, action in train_batch:
        target = reward
        if not done:
            # Bellman equation: immediate reward plus discounted best future reward.
            target = reward + gamma * np.amax(model.predict(new_state)[0])
        # Current predictions for both actions in this state.
        target_f = model.predict(state)
        # Replace the value of the action actually taken with the Bellman target.
        target_f[0][action] = target
        model.fit(state, target_f, epochs=1, verbose=0)
    # Decay the exploration rate after each call to replay.
    if epsilon > epsilon_min:
        epsilon *= epsilon_decay
    return epsilon
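
As a quick usage sketch, the call inside the train function looks roughly like the following; the model and training_data come from the surrounding sections, and the hyperparameter values shown here are assumptions for illustration only.

epsilon, gamma = 1.0, 0.95              # assumed starting values
epsilon_min, epsilon_decay = 0.01, 0.995

epsilon = replay(epsilon, gamma, epsilon_min, epsilon_decay, model, training_data, batch_size=64)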

It is inside this function that we train the agent, which was compiled with mean squared error loss, to learn to maximize the reward. Mean squared error is used because we are predicting the numerical value of the reward possible for the two actions. Remember that the agent accepts the state as input, which is of shape 1*4. The output of the agent is of shape 1*2, and it contains the expected reward for each of the two possible actions.
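
For reference, a minimal sketch of such an agent in Keras might look like the following. Only the input size of 4, the output size of 2, and the mean squared error loss are fixed by the discussion above; the build_agent name, hidden-layer sizes, and optimizer are illustrative assumptions and may differ from the model built earlier.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

def build_agent(state_size=4, action_size=2):
    # Maps a 1*4 state to a 1*2 vector of expected rewards (Q-values).
    model = Sequential()
    model.add(Dense(24, input_dim=state_size, activation='relu'))  # hidden sizes are assumptions
    model.add(Dense(24, activation='relu'))
    model.add(Dense(action_size, activation='linear'))
    model.compile(loss='mse', optimizer='adam')
    return model

model = build_agent()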

So, when an episode ends, we use a batch of data stored in the deque container to train the agent.
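
To make the shape of the stored data concrete, here is a minimal sketch of how such a deque could be filled during play; the maxlen value and the dummy arrays are assumptions, but each entry is a (state, new_state, reward, done, action) tuple in the same order that replay unpacks it.

from collections import deque
import numpy as np

training_data = deque(maxlen=2000)  # the maxlen value is an assumption

# After every step of the game, one experience tuple is stored.
dummy_state = np.zeros((1, 4))
dummy_new_state = np.zeros((1, 4))
training_data.append((dummy_state, dummy_new_state, 1, False, 0))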

In this batch of data, consider the first tuple. Let

state = [[-0.07294358 -0.94589796 0.03188364 1.40490844]] 
new_state = [[-0.09186154 -1.14140094 0.05998181 1.70738606]]
reward = 1
done = False
action = 0
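
If you want to follow the arithmetic below yourself, the same tuple can be written out as numpy arrays (note the 1*4 shape of each state):

import numpy as np

state = np.array([[-0.07294358, -0.94589796, 0.03188364, 1.40490844]])
new_state = np.array([[-0.09186154, -1.14140094, 0.05998181, 1.70738606]])
reward, done, action = 1, False, 0

print(state.shape, new_state.shape)  # (1, 4) (1, 4)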

For the 'state', we know the 'action' that needs to be taken to enter the 'new_state' and the 'reward' for doing so. We also have 'done', which indicates whether the 'new_state' entered is still within the game rules.

As long as the new state s' being entered is within the game rules, i.e., done is False, the total reward for entering the new state s' from state s by taking an action a follows the Bellman equation, Q(s, a) = r + gamma * max_a' Q(s', a'), which can be written in Python as

target = reward + gamma * np.amax(model.predict(new_state)[0])

Let the output of model.predict(new_state)[0] be [-0.55639267, 0.37972435].
Then np.amax([-0.55639267, 0.37972435]) is 0.37972435.

With discount/gamma as 0.95 and reward as 1, the value of

reward + gamma * np.amax(model.predict(new_state)[0]) ends up as 1.36073813587427,

which is the value of target defined above.
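
As a quick sanity check, you can redo this arithmetic with the rounded prediction printed above; it lands at essentially the same number (the small difference in the trailing digits comes from numpy displaying the float32 prediction rounded to eight decimals).

import numpy as np

reward, gamma = 1, 0.95
q_new_state = np.array([-0.55639267, 0.37972435])  # rounded display of model.predict(new_state)[0]

target = reward + gamma * np.amax(q_new_state)
print(target)  # approximately 1.3607381325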

Using the model, let's predict the reward for the two possible actions for the current state:

target_f = model.predict(state)  will be  [[-0.4597198 0.31523475]]

Since we already know the 'action' that needs to be taken for this 'state', which is 0, we set the reward at index zero of target_f equal to the target computed using the Bellman equation, so that the model learns to maximize the reward for the next state.

target_f[0][action] = 1.3607381358742714

Finally, target_f will be equal to  [[1.3607382 0.31523475]].
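
Putting those two steps together as a standalone snippet (using the predicted values quoted above as fixed numbers in place of an actual model call):

import numpy as np

action = 0
target = 1.3607381358742714

# Stands in for model.predict(state); Keras predictions are float32.
target_f = np.array([[-0.4597198, 0.31523475]], dtype=np.float32)
target_f[0][action] = target
print(target_f)  # [[1.3607382  0.31523475]]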

We will use the state as input and the target_f as the target reward and fit the agent/model on it.

This process is repeated for all the data points in the batch of training data. Also, for each call of the replay function, the value of epsilon is reduced by multiplying it by epsilon_decay, as long as it is still above epsilon_min.
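
To see how quickly the exploration rate shrinks under this scheme, here is a small standalone sketch; the starting, minimum, and decay values are assumptions for illustration.

epsilon, epsilon_min, epsilon_decay = 1.0, 0.01, 0.995  # assumed hyperparameters

for call in range(1, 6):  # pretend replay was called five times
    if epsilon > epsilon_min:
        epsilon *= epsilon_decay
    print(call, round(epsilon, 4))
# 1 0.995, 2 0.99, 3 0.9851, 4 0.9801, 5 0.9752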

Notes: 1) random.sample draws k unique elements from a population sequence. 2) np.amax returns the maximum value of an array.
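
A quick demonstration of both helpers (the output of random.sample will vary from run to run):

import random
import numpy as np

print(random.sample(range(10), 3))         # e.g. [7, 0, 4]: three unique elements
print(np.amax([-0.55639267, 0.37972435]))  # 0.37972435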