How to do it...

We proceed with the recipe as follows:

  1. The first step is importing the necessary modules. This time, besides our usual TensorFlow, NumPy, and Matplotlib, we import Gym and some classes from scikit-learn; the scikit-learn classes are used to build the feature transformer that feeds our networks (a sketch of it follows the imports):
import numpy as np
import tensorflow as tf
import gym
import matplotlib.pyplot as plt
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import StandardScaler
from sklearn.kernel_approximation import RBFSampler
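The recipe relies on a FeatureTransformer class (instantiated in step 5) that converts the raw CartPole observation into a higher-dimensional feature vector using the scikit-learn classes imported above. Its definition is not shown in this section; the following is a minimal sketch, assuming the common construction of four RBFSampler kernels with 500 components each (giving the 2,000 features mentioned in step 3) and a StandardScaler fitted on randomly sampled states. The specific gamma values, number of components, and the sampling box are assumptions you may need to adjust:

class FeatureTransformer:
    def __init__(self, env, n_components=500):
        # CartPole's observation_space.sample() can return extreme values
        # (the velocity bounds are effectively unbounded), so we sample
        # example states from a small box instead (assumed range).
        n_dims = env.observation_space.shape[0]
        observation_examples = np.random.random((20000, n_dims)) * 2 - 2
        self.scaler = StandardScaler()
        self.scaler.fit(observation_examples)

        # Concatenate several RBF kernels with different widths to cover
        # different parts of the state space (assumed gammas).
        self.featurizer = FeatureUnion([
            ("rbf1", RBFSampler(gamma=5.0, n_components=n_components)),
            ("rbf2", RBFSampler(gamma=2.0, n_components=n_components)),
            ("rbf3", RBFSampler(gamma=1.0, n_components=n_components)),
            ("rbf4", RBFSampler(gamma=0.5, n_components=n_components)),
        ])
        self.featurizer.fit(self.scaler.transform(observation_examples))
        self.dimensions = n_components * 4  # 2,000 features in total

    def transform(self, observations):
        scaled = self.scaler.transform(observations)
        return self.featurizer.transform(scaled)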
  2. In Q learning, we use NNs as function approximators to estimate the value function. We define a linear NeuralNetwork class; the NN takes the transformed observation space as input and predicts the estimated Q value. As we have two possible actions, we need two different neural network objects to get the predicted state-action values. The class includes methods to train the individual NN and to predict its output (a quick sanity check of the class follows the listing):
class NeuralNetwork:
    def __init__(self, D):
        eta = 0.1
        self.W = tf.Variable(tf.random_normal(shape=(D, 1)), name='w')
        self.X = tf.placeholder(tf.float32, shape=(None, D), name='X')
        self.Y = tf.placeholder(tf.float32, shape=(None,), name='Y')

        # make prediction and cost
        Y_hat = tf.reshape(tf.matmul(self.X, self.W), [-1])
        err = self.Y - Y_hat
        cost = tf.reduce_sum(tf.pow(err, 2))

        # ops we want to call later
        self.train_op = tf.train.GradientDescentOptimizer(eta).minimize(cost)
        self.predict_op = Y_hat

        # start the session and initialize params
        init = tf.global_variables_initializer()
        self.session = tf.Session()
        self.session.run(init)

    def train(self, X, Y):
        self.session.run(self.train_op, feed_dict={self.X: X, self.Y: Y})

    def predict(self, X):
        return self.session.run(self.predict_op, feed_dict={self.X: X})
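If you want to sanity-check the NeuralNetwork class on its own, a tiny made-up regression problem such as the following (the dimensions and data are purely illustrative and not part of the recipe) should see the predictions approach the targets after a few hundred gradient steps:

# Toy check (hypothetical data): fit y = x . [1, 2, 3] with the class above.
D = 3
nn = NeuralNetwork(D)
X_toy = (np.random.randn(64, D) * 0.1).astype(np.float32)
Y_toy = X_toy.dot(np.array([1.0, 2.0, 3.0])).astype(np.float32)
for _ in range(500):
    nn.train(X_toy, Y_toy)
print(np.round(nn.predict(X_toy[:3]), 2))  # should be close to Y_toy[:3]
print(np.round(Y_toy[:3], 2))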
  3. The next important class is the Agent class, which uses the NeuralNetwork class to create a learning agent. On instantiation, the class creates an agent with two linear NNs, each with 2,000 input neurons and 1 output neuron. (Essentially, this means that the agent has 2 neurons, each with 2,000 inputs, as the input layer of the NN does not do any processing.) The Agent class has methods to predict the output of the two NNs and to update their weights. During training, the agent uses an Epsilon Greedy Policy for exploration: at each step, it either chooses the action with the highest Q value or a random action, depending on the value of epsilon (eps). Epsilon is annealed during training so that, initially, the agent takes lots of random actions (exploration), but as training progresses, it increasingly takes the action with the maximum Q value (exploitation). This is the Exploration-Exploitation trade-off: allowing the agent to occasionally try random actions instead of the currently best-known one lets it discover and learn from actions it would otherwise never take (a short illustration of the annealing schedule follows the class definition):
class Agent:
    def __init__(self, env, feature_transformer):
        self.env = env
        self.agent = []
        self.feature_transformer = feature_transformer
        for i in range(env.action_space.n):
            model = NeuralNetwork(feature_transformer.dimensions)
            self.agent.append(model)

    def predict(self, s):
        X = self.feature_transformer.transform([s])
        return np.array([m.predict(X)[0] for m in self.agent])

    def update(self, s, a, G):
        X = self.feature_transformer.transform([s])
        self.agent[a].train(X, [G])

    def sample_action(self, s, eps):
        if np.random.random() < eps:
            return self.env.action_space.sample()
        else:
            return np.argmax(self.predict(s))
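As a side illustration (not part of the recipe), the epsilon schedule used later in step 5, eps = 1/sqrt(n + 1), anneals roughly as follows:

# How the exploration rate decays with the episode index n (step 5 schedule).
for n in [0, 9, 99, 499, 999]:
    print("episode %4d -> eps = %.3f" % (n, 1.0 / np.sqrt(n + 1)))
# episode    0 -> eps = 1.000
# episode    9 -> eps = 0.316
# episode   99 -> eps = 0.100
# episode  499 -> eps = 0.045
# episode  999 -> eps = 0.032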
  4. Next, we define a function to play one episode; it is similar to the play_one function we used earlier, but now we use Q learning to update the weights of our agent. We start the episode by resetting the environment using env.reset(), and then loop until the game is done (with a cap on the number of iterations to ensure the program terminates). As before, the agent chooses an action for the present observation state (obs) and applies it to the environment (env.step(action)). The difference now is that, based on the previous state and the state reached after taking the action, the NN weights are updated using the target G = r + γ max_a' Q(s', a'), so that the network learns to predict an accurate expected return for each action. For better stability, we have modified the rewards: whenever the pole falls, the agent receives a reward of -400; otherwise, it gets a reward of +1 for each step (a short worked example of the target follows the listing):
def play_one(env, model, eps, gamma):
    obs = env.reset()
    done = False
    totalreward = 0
    iters = 0
    while not done and iters < 2000:
        action = model.sample_action(obs, eps)
        prev_obs = obs
        obs, reward, done, info = env.step(action)
        env.render()  # Can comment it to speed up.

        if done:
            reward = -400

        # update the model
        next = model.predict(obs)
        assert(len(next.shape) == 1)
        G = reward + gamma * np.max(next)
        model.update(prev_obs, action, G)

        if reward == 1:  # only count the unmodified +1 rewards
            totalreward += reward
        iters += 1

    return totalreward
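To make the update rule concrete, here is a tiny worked example with made-up numbers showing the target computed inside the loop above:

# Worked example (hypothetical values) of the Q-learning target in play_one.
reward, gamma = 1.0, 0.97
next_q = np.array([9.5, 10.0])   # predicted Q(s', a) for the two actions
G = reward + gamma * np.max(next_q)
print(G)  # 1 + 0.97 * 10.0 = 10.7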
  5. Now that all the functions and classes are in place, we define our agent and environment (in this case, 'CartPole-v0'). The agent plays a total of 1,000 episodes and learns by interacting with the environment with the help of the value function:
if __name__ == '__main__':
    env_name = 'CartPole-v0'
    env = gym.make(env_name)
    ft = FeatureTransformer(env)
    agent = Agent(env, ft)
    gamma = 0.97

    N = 1000
    totalrewards = np.empty(N)
    running_avg = np.empty(N)
    for n in range(N):
        eps = 1.0 / np.sqrt(n + 1)
        totalreward = play_one(env, agent, eps, gamma)
        totalrewards[n] = totalreward
        running_avg[n] = totalrewards[max(0, n - 100):(n + 1)].mean()
        if n % 100 == 0:
            print("episode: {0}, total reward: {1} eps: {2} avg reward (last 100): {3}".format(
                n, totalreward, eps, running_avg[n]))

    print("avg reward for last 100 episodes:", totalrewards[-100:].mean())
    print("total steps:", totalrewards.sum())

    plt.plot(totalrewards)
    plt.xlabel('episodes')
    plt.ylabel('Total Rewards')
    plt.show()

    plt.plot(running_avg)
    plt.xlabel('episodes')
    plt.ylabel('Running Average')
    plt.show()
    env.close()
  6. The following is the plot of the total reward and the running average reward as the agent learned through the game. According to the Cart-Pole wiki, a reward of 200 means that the agent won the episode. After being trained for 1,000 episodes, our agent managed to reach an average reward of 195.7 over the last 100 episodes, which is a remarkable feat:
[Figure: total reward per episode and the 100-episode running average over 1,000 training episodes]