How to do it...

Let's start with the recipe:

  1. The first step, as always, is importing modules. Besides the usual modules, in this case, we will import gym so that we can use the different environments provided by it:
import gym
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
  2. Next, we create an RlAgent class. The class has three methods: the __init__ method initializes the NN size and creates the computational graph; the predict method returns the action predicted by the NN; and the get_weights method retrieves the weights and biases of the winning agent. In __init__, we employ the TensorFlow function tf.multinomial to decide which action to take. The function samples an action based on the sigmoidal values of the output neurons of our network (one per possible action), so the network chooses the final action probabilistically (see the short sampling sketch after the class definition):
class RlAgent(object):
    def __init__(self, m, n, ini=False, W=None, b=None):
        self._graph = tf.Graph()
        with self._graph.as_default():
            self._X = tf.placeholder(tf.float32, shape=(1, m))
            if ini == False:
                self.W = tf.Variable(tf.random_normal([m, n]), trainable=False)
                self.bias = tf.Variable(tf.random_normal([1, n]), trainable=False)
            else:
                self.W = W
                self.bias = b
            out = tf.nn.sigmoid(tf.matmul(self._X, self.W) + self.bias)
            # Sample one action index; the sigmoid outputs are treated as logits
            self._result = tf.multinomial(out, 1)
            init = tf.global_variables_initializer()

            # The session is created inside the graph context so it is bound to this agent's graph
            self._sess = tf.Session()
            self._sess.run(init)

    def predict(self, X):
        # Returns a (1, 1) array holding the sampled action index
        action = self._sess.run(self._result, feed_dict={self._X: X})
        return action

    def get_weights(self):
        W, b = self._sess.run([self.W, self.bias])
        return W, b
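
To get a feel for this probabilistic choice, here is a minimal sketch, separate from the recipe, that repeatedly samples from a fixed set of sigmoid outputs for a hypothetical four-action environment; tf.multinomial treats these scores as logits, so the highest-scoring action is chosen most often but not always:

import numpy as np
import tensorflow as tf

scores = tf.constant([[0.1, 0.9, 0.5, 0.2]])  # hypothetical sigmoid outputs for 4 actions
sample = tf.multinomial(scores, 1)  # samples one action index, treating the scores as logits

with tf.Session() as sess:
    draws = [sess.run(sample)[0][0] for _ in range(1000)]
print(np.bincount(draws, minlength=4) / 1000.0)  # action 1 dominates, but the others still occur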
  3. We define a helper function, play_one_episode, which plays one complete game with the given agent:
def play_one_episode(env, agent):
    obs = env.reset()
    img_pre = preprocess_image(obs)
    done = False
    t = 0
    while not done and t < 10000:
        env.render()  # This can be commented out to speed things up
        t += 1
        action = agent.predict(img_pre)
        #print(t, action)
        obs, reward, done, info = env.step(action)
        img_pre = preprocess_image(obs)
        if done:
            break
    return t
  4. The play_multiple_episodes function creates an instance of the agent and plays a number of games with this agent, returning its average duration of play:
def play_multiple_episodes(env, T, ini=False, W=None, b=None):
    episode_lengths = np.empty(T)
    obs = env.reset()
    img_pre = preprocess_image(obs)
    if ini == False:
        agent = RlAgent(img_pre.shape[1], env.action_space.n)
    else:
        agent = RlAgent(img_pre.shape[1], env.action_space.n, ini, W, b)
    for i in range(T):
        episode_lengths[i] = play_one_episode(env, agent)
    avg_length = episode_lengths.mean()
    print("avg length:", avg_length)
    if ini == False:
        W, b = agent.get_weights()
    return avg_length, W, b
  5. The random_search function invokes play_multiple_episodes; each time play_multiple_episodes is called, a new agent is instantiated with a new set of random weights and biases. One of these randomly created NN agents will outperform the others, and that is the agent we finally select:
def random_search(env):
    episode_lengths = []
    best = 0
    for t in range(10):
        print("Agent {} reporting".format(t))
        avg_length, wts, bias = play_multiple_episodes(env, 10)
        episode_lengths.append(avg_length)
        if avg_length > best:
            best_wt = wts
            best_bias = bias
            best = avg_length
    return episode_lengths, best_wt, best_bias
  6. The environment returns an observation each time it progresses by a step. The observation has three color channels; to feed it to the NN, it needs to be preprocessed, and at present the only preprocessing we do is convert it to grayscale, increase the contrast, and reshape it into a row vector (a quick shape check follows the function):
def preprocess_image(img):
    img = img.mean(axis=2)  # Convert to grayscale
    img[img == 150] = 0  # Improve the contrast
    img = (img - 128) / 128  # Normalize the pixel values to the range [-1, 1]
    m, n = img.shape
    return img.reshape(1, m * n)
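
As a quick sanity check, not part of the original recipe, the following snippet applies preprocess_image to a dummy frame with the 210 x 160 x 3 shape that Breakout-v0 observations have; the output is a single row vector of 210 * 160 = 33,600 values, which is the input size m passed to RlAgent:

# Assumes the preprocess_image function defined above is in scope
dummy_frame = np.random.randint(0, 256, size=(210, 160, 3)).astype(np.float64)
flat = preprocess_image(dummy_frame)
print(flat.shape)  # (1, 33600) -- one grayscale row vector per frame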
  7. The NN agents are instantiated one by one and the best one is selected. For computational efficiency, we currently search only 10 agents, each playing 10 games. The agent that can play the longest is considered the best:
if __name__ == '__main__':
    env_name = 'Breakout-v0'
    #env_name = 'MsPacman-v0'
    env = gym.make(env_name)
    episode_lengths, W, b = random_search(env)
    plt.plot(episode_lengths)
    plt.show()
    print("Final Run with best Agent")
    play_multiple_episodes(env, 10, ini=True, W=W, b=b)

The result is as follows:

We can see that even our random agent can play the game for an average episode length of 615.5 steps. Not bad!
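
If you want to reuse the winning agent in a later session, one simple option, not covered in this recipe, is to save the weights and biases returned by random_search with NumPy and pass them back to play_multiple_episodes (the file names below are just placeholders):

np.save('best_W.npy', W)  # placeholder file names; save the winning parameters
np.save('best_bias.npy', b)

# Later, rebuild an agent from the saved parameters and replay
W = np.load('best_W.npy')
b = np.load('best_bias.npy')
play_multiple_episodes(env, 10, ini=True, W=W, b=b)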
