The Atari Gym environments produce observations with a shape of 210x160x3, which represents an RGB (color) image 210 pixels tall and 160 pixels wide. Although the color image at the original resolution has more pixels and therefore more information, it turns out that better performance is often achievable at a reduced resolution. A lower resolution means less data for the agent to process at every step, which translates into faster training, especially on the consumer-grade computing hardware that you and I own.
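We can confirm the raw observation shape directly (a minimal sketch, assuming the Atari environments are installed and using Pong purely as an example):

import gym

# Assumes gym's Atari environments are available (pip install gym[atari]);
# any Atari game would show the same raw screen dimensions
env = gym.make("PongNoFrameskip-v4")
print(env.observation_space.shape)  # (210, 160, 3)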
Let's create a preprocessing pipeline that takes the original observation image (of the Atari screen) and performs the following operations:
1. First, we crop out the region of the screen that does not contain any useful information about the environment for the agent.
2. Next, we convert the cropped RGB image to grayscale by averaging over the color channels and rescale the pixel values to the [0, 1] range.
3. Finally, we resize the image to a dimension of 84x84. We can choose a different number, other than 84, as long as it retains a reasonable amount of pixels. However, it is efficient to have a square input (like 84x84 or 80x80), as convolution operations (for example, with CUDA) are optimized for such square input:
import cv2
import numpy as np

def process_frame_84(frame, conf):
    # Crop out the game-specific rows that carry no useful information
    frame = frame[conf["crop1"]:conf["crop2"] + 160, :160]
    # Convert RGB to grayscale by averaging over the color channels
    frame = frame.mean(2)
    # Rescale pixel values to the [0, 1] range
    frame = frame.astype(np.float32)
    frame *= (1.0 / 255.0)
    # Downsample in two steps to 84x84; note cv2.resize takes (width, height)
    frame = cv2.resize(frame, (84, conf["dimension2"]))
    frame = cv2.resize(frame, (84, 84))
    # Add a leading channel dimension: (1, 84, 84)
    frame = np.reshape(frame, [1, 84, 84])
    return frame
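To sanity-check the function, we can run a randomly generated frame through it (a minimal sketch; the crop1, crop2, and dimension2 values below are illustrative placeholders, not tuned settings for any particular game):

# Placeholder config values for illustration only
conf = {"crop1": 34, "crop2": 34, "dimension2": 80}
# A fake 210x160 RGB frame standing in for a real Atari observation
dummy_frame = np.random.randint(0, 256, (210, 160, 3), dtype=np.uint8)
processed = process_frame_84(dummy_frame, conf)
print(processed.shape)  # (1, 84, 84)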
import gym
from gym.spaces import Box

class AtariRescale(gym.ObservationWrapper):
    def __init__(self, env, env_conf):
        gym.ObservationWrapper.__init__(self, env)
        # Observations are now single-channel 84x84 floats in [0, 1]
        self.observation_space = Box(0.0, 1.0, [1, 84, 84], dtype=np.float32)
        self.conf = env_conf

    def observation(self, observation):
        return process_frame_84(observation, self.conf)
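The wrapper can then be applied when creating the environment (again a sketch; the environment ID and the env_conf dictionary are placeholders, not settings from any particular agent configuration):

env_conf = {"crop1": 34, "crop2": 34, "dimension2": 80}
env = AtariRescale(gym.make("PongNoFrameskip-v4"), env_conf)
obs = env.reset()
print(obs.shape)  # (1, 84, 84)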