Understanding the Gym interface

Let's continue our Gym exploration by understanding the interface between the Gym environment and the agents that we will develop. To help us with that, let's have another look at the picture we saw in Chapter 2, Reinforcement Learning and Deep Reinforcement Learning, when we were discussing the basics of reinforcement learning:

Did the picture give you an idea of the interface between the agent and the environment? We will solidify your understanding by going over a description of the interface.

After we import gym, we make an environment using the following line of code:

 env = gym.make("ENVIRONMENT_NAME") 

Here, ENVIRONMENT_NAME is the name of the environment we want, chosen from the list of the environments we found installed on our system. From the previous diagram, we can see that the first arrow comes from the environment to the agent, and is named Observation. From Chapter 2, Reinforcement Learning and Deep Reinforcement Learning, we understand the difference between partially observable environments and fully observable environments, and the difference between state and observation in each case. We get that first observation from the environment by calling env.reset(). Let's store the observation in a variable named obs using the following line of code:

obs = env.reset()
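
To make the observation concrete, here is a minimal sketch, assuming the classic Gym API (where reset() returns only the observation) and using CartPole-v0 purely as an example environment, that prints what reset() returns:

import gym

env = gym.make("CartPole-v0")
obs = env.reset()  # the first observation from the environment
# For CartPole-v0, the observation is a small NumPy array describing the cart
# and the pole (position, velocity, angle, and angular velocity); for Atari
# environments, it would be the RGB screen pixels instead.
print(type(obs))
print(obs)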

Now, the agent has received the observation (the end of the first arrow). It's time for the agent to take an action and send it to the environment to see what happens. In essence, this is what the algorithms we develop for the agents should figure out! We will develop various state-of-the-art agent algorithms in the next and subsequent chapters. Let's continue our journey towards understanding the Gym interface.

Once the action to be taken is decided, we send it to the environment (second arrow in the diagram) using the env.step() method, which will return four values in this order: next_state, reward, done, and info:

  1. The next_state is the resulting state of the environment after the action was taken in the previous state. Some environments may internally run one or more steps using the same action before returning the next_state. The Deterministic and NoFrameskip types that we discussed in the previous section are examples of such environments.
  2. The reward (third arrow in the diagram) is returned by the environment.
  3. The done variable is a Boolean (true or false), which takes the value true if the episode has terminated/finished (and therefore it is time to reset the environment) and false otherwise. This is useful for the agent to know when an episode has ended or when the environment is about to be reset to some initial state.
  4. The info variable is an optional variable that some environments may return with additional information. Usually, this is not used by the agent to decide which action to take.

Here is a consolidated summary of the four values returned by a Gym environment's step() method, together with their types and a concise description of each:

Returned value | Type | Description
next_state (or observation) | Object | Observation returned by the environment. The object could be the RGB pixel data from the screen/camera, the RAM contents, the joint angles and joint velocities of a robot, and so on, depending on the environment.
reward | Float | Reward for the previous action that was sent to the environment. The range of the Float value varies with each environment, but irrespective of the environment, a higher reward is always better and the goal of the agent should be to maximize the total reward.
done | Boolean | Indicates whether the environment is going to be reset in the next step. When the Boolean value is true, it most likely means that the episode has ended (due to loss of life of the agent, a timeout, or some other episode termination criterion).
info | Dict | Some additional information that can optionally be sent out by an environment as a dictionary of arbitrary key-value pairs. The agent we develop should not rely on any of the information in this dictionary for taking its action. It may be used (if available) for debugging purposes.
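
To tie the table above to actual code, here is a minimal sketch, again assuming the classic four-value step() API and using CartPole-v0 only as an example, that prints the type of each returned value:

import gym

env = gym.make("CartPole-v0")
obs = env.reset()
action = env.action_space.sample()  # a random, valid action just for illustration
next_state, reward, done, info = env.step(action)
print(type(next_state))  # a NumPy array observation for CartPole-v0
print(type(reward))      # a float
print(type(done))        # a bool
print(type(info))        # a dict (possibly empty)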

 

Note that the following code is provided to show the general structure; it is not ready to be executed, because ENVIRONMENT_NAME and agent.choose_action() are not defined in this snippet.

Let's put all the pieces together and look at them in one place:

import gym
env = gym.make("ENVIRONMENT_NAME")
obs = env.reset() # The first arrow in the picture
# Inner loop (roll out)
action = agent.choose_action(obs) # The second arrow in the picture
next_state, reward, done, info = env.step(action) # The third arrow (and more)
obs = next_state
# Repeat Inner loop (roll out)

I hope this gives you a good understanding of one cycle of the interaction between the environment and the agent. This process repeats until we decide to terminate the cycle, after a certain number of episodes or steps have passed. Let's now have a look at a complete example, with the inner loop running for MAX_STEPS_PER_EPISODE steps and the outer loop running for MAX_NUM_EPISODES episodes, in a Qbert-v0 environment:

#!/usr/bin/env python
import gym
env = gym.make("Qbert-v0")
MAX_NUM_EPISODES = 10
MAX_STEPS_PER_EPISODE = 500
for episode in range(MAX_NUM_EPISODES):
    obs = env.reset()
    for step in range(MAX_STEPS_PER_EPISODE):
        env.render()
        action = env.action_space.sample()  # Sample a random action. This will be replaced by our agent's action when we start developing the agent algorithms
        next_state, reward, done, info = env.step(action)  # Send the action to the environment and receive the next_state, the reward, and whether the episode is done
        obs = next_state

        if done is True:
            print(" Episode #{} ended in {} steps.".format(episode, step + 1))
            break

When you run this script, you will notice a Qbert screen pop up and Qbert taking random actions and getting a score, as shown here:

You will also see print statements on the console like the following, depending on when the episode ended. Note that the step numbers you get might be different because the actions are random:

The boilerplate code is available in this book's code repository under the ch4 folder and is named rl_gym_boilerplate_code.py. It is indeed boilerplate code, because the overall structure of the program will remain the same. When we build our intelligent agents in subsequent chapters, we will extend this boilerplate code. It is worth taking a moment to go through the script line by line to make sure you understand it well.

You may have noticed that in the example code snippets provided in this chapter and in Chapter 3, Getting Started with OpenAI Gym and Deep Reinforcement Learning, we used env.action_space.sample() in place of action. env.action_space gives the action space of the environment (Discrete(18), for example, in the case of Alien-v0), and the sample() method randomly samples a valid value from that action_space. That's all it means!
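
As a quick preview of what we will look at next, here is a minimal sketch, assuming the classic Gym API and using Alien-v0 only as an example, that inspects an environment's spaces and samples a random action:

import gym

env = gym.make("Alien-v0")
print(env.action_space)       # Discrete(18) for Alien-v0, as mentioned above
print(env.observation_space)  # the observation space (a Box of RGB screen pixels for Atari games)
random_action = env.action_space.sample()  # one randomly sampled, valid action
print(random_action)          # an integer index into the 18 possible actions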

We will now have a closer look at the spaces in Gym to understand the state and action spaces of environments.
