Solving a maze using a DQN in Gorgonia

Now, it's time to build our maze solver!

Using a DQN to solve a little ASCII maze is a bit like bringing a bulldozer to the beach to make sandcastles for your kids: it's completely unnecessary, but you get to play with a big machine. However, as a tool for learning about DQNs, mazes are invaluable. This is because the number of possible states and actions in the game is small, and the constraints are simple to represent (for example, the walls of our maze, which our agent cannot move through). This means that we can step through our program and easily inspect what our network is doing.

We will follow these steps:

  1. Create a maze.go file for this bit of code
  2. Import our libraries and set our data type
  3. Define our Maze{}
  4. Write a NewMaze() function to instantiate this struct

We also need to define our Maze{} helper functions. These include the following (a short usage sketch follows the list):

  • CanMoveTo(): Check whether a move is valid
  • Move(): Move our player to a co-ordinate in the maze
  • Value(): Return the reward for a given action
  • Reset(): Set player to start co-ordinates
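
Before we get to the implementations, here is a rough sketch of how these four helpers fit together, using the signatures implied by the way they are called later in this section. step() itself is ours, for illustration only, and is not part of the chapter's code:

// Hypothetical helper, not part of the book's code: try a move, apply it if
// legal, and read back the reward. The signatures are inferred from how
// CanMoveTo(), Value(), and Move() are called later in main.go.
func step(mz *Maze, move Vector) (reward float32, done bool) {
    if !mz.CanMoveTo(mz.player, move) {
        return 0, false // an illegal move: nothing happens
    }
    reward, done = mz.Value(move) // reward for taking this move from the current position
    mz.Move(move)                 // apply the move to the player
    return reward, done
}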

Let's take a look at the start of the code for our maze generator. This is an excerpt, and the remainder of the code can be found in the book's GitHub repository:

...
type Point struct{ X, Y int }
type Vector Point

type Maze struct {
    // some maze object
    *mazegen.Maze
    repr   *tensor.Dense
    iter   [][]tile
    values [][]float32

    player, start, goal Point

    // meta
    r *rand.Rand
}
...

Now that we've got the code that we need to generate and interact with a maze, we need to define the simple feedforward, fully connected network. This code should be familiar to us by now. Let's create nn.go:

...
type NN struct {
    g *ExprGraph
    x *Node // input: batchsize rows of 4 floats (player X/Y plus move X/Y)
    y *Node // target: one reward value per row
    l []FC

    pred    *Node
    predVal Value
}

func NewNN(batchsize int) *NN {
    g := NewGraph()
    // `of` is our package-level data type, set when we imported our libraries
    x := NewMatrix(g, of, WithShape(batchsize, 4), WithName("X"), WithInit(Zeroes()))
    y := NewVector(g, of, WithShape(batchsize), WithName("Y"), WithInit(Zeroes()))
...
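
The NN struct references an FC layer type that isn't shown in this excerpt. As a reminder, here is one plausible, minimal sketch of what such a fully connected layer can look like in Gorgonia; it is not the book's own implementation (which may carry extra state such as biases) and assumes the same dot-style gorgonia import as the rest of nn.go:

// A minimal, hypothetical fully connected layer: a weight matrix plus an
// optional activation (Rectify, for example), wired into the graph by fwd().
type FC struct {
    w   *Node                      // the layer's weight matrix
    act func(*Node) (*Node, error) // optional activation function
}

func (l *FC) fwd(x *Node) (*Node, error) {
    xw, err := Mul(x, l.w) // x · w
    if err != nil {
        return nil, err
    }
    if l.act == nil {
        return xw, nil
    }
    return l.act(xw)
}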

We can now begin to define the DQN that will make use of this neural network. First, let's create a memory.go file with the basic struct type that captures information about a given episode:

package main

type Memory struct {
    State        Point
    Action       Vector
    Reward       float32
    NextState    Point
    NextMovables []Vector
    isDone       bool
}

We will build up a slice of []Memory and use it to store, for each move, the X/Y state co-ordinates, the move vector, the expected reward, the next state and its possible moves, and whether the maze has been solved.

Now we can edit our main.go and pull everything together. First, we define our possible moves across the m x n matrix that represents our maze:

package main

import (
    "fmt"
    "log"
    "math/rand"
    "time"

    "gorgonia.org/gorgonia"
)

var cardinals = [4]Vector{
    Vector{0, 1},  // E
    Vector{1, 0},  // N
    Vector{-1, 0}, // S
    Vector{0, -1}, // W
}

Next, we need our main DQN{} structure to which we attach the neural network we defined earlier, our VM/Solver, and our DQN-specific hyperparameters. We also need an init() function to build the embedded feedforward network and to attach the VM and solver to our DQN:

type DQN struct {
    *NN
    gorgonia.VM
    gorgonia.Solver
    Memories []Memory // the Q-table: State/Action/Reward/NextState/NextMoves/IsDone, appended to `times` times per episode during training

    gamma       float32
    epsilon     float32
    epsDecayMin float32
    decay       float32
}

func (m *DQN) init() {
    if _, err := m.NN.cons(); err != nil {
        panic(err)
    }
    m.VM = gorgonia.NewTapeMachine(m.NN.g)
    m.Solver = gorgonia.NewRMSPropSolver()
}

Next is our experience replay() function. Here, we sample a batch from our memories, retrain and update our network from it, and gradually decay our epsilon. For each memory that did not end the game, the training target is the immediate reward plus the discounted best predicted reward of the next state; with gamma at 0.95, a step reward of -0.1, and a best predicted next-step reward of 0.8, for example, the target would be -0.1 + 0.95 × 0.8 = 0.66:

func (m *DQN) replay(batchsize int) error {
    // sample at most batchsize memories (fewer if we don't have that many yet)
    var N int
    if batchsize < len(m.Memories) {
        N = batchsize
    } else {
        N = len(m.Memories)
    }
    Xs := make([]input, 0, N)
    Ys := make([]float32, 0, N)
    mems := make([]Memory, N)
    copy(mems, m.Memories)
    rand.Shuffle(len(mems), func(i, j int) {
        mems[i], mems[j] = mems[j], mems[i]
    })

    // note: iterate over N (not batchsize), in case we have fewer memories than a full batch
    for b := 0; b < N; b++ {
        mem := mems[b]

        // the training target: the immediate reward, plus the discounted best
        // predicted reward of the next state if the game isn't over yet
        var y float32
        if mem.isDone {
            y = mem.Reward
        } else {
            var nextRewards []float32
            for _, next := range mem.NextMovables {
                nextReward, err := m.predict(mem.NextState, next)
                if err != nil {
                    return err
                }
                nextRewards = append(nextRewards, nextReward)
            }
            reward := max(nextRewards)
            y = mem.Reward + m.gamma*reward
        }
        Xs = append(Xs, input{mem.State, mem.Action})
        Ys = append(Ys, y)
        if err := m.VM.RunAll(); err != nil {
            return err
        }
        m.VM.Reset()
        if err := m.Solver.Step(m.model()); err != nil {
            return err
        }
        // decay epsilon towards its minimum
        if m.epsilon > m.epsDecayMin {
            m.epsilon *= m.decay
        }
    }
    return nil
}

The predict() function—called when we are determining the best possible move (or move with the greatest predicted reward)—is next. It takes a player's position in the maze and a single move, and returns our neural network's projected reward for that move:

func (m *DQN) predict(player Point, action Vector) (float32, error) {
    x := input{State: player, Action: action}
    m.Let1(x) // bind this single (state, action) pair to the network's input
    if err := m.VM.RunAll(); err != nil {
        return 0, err
    }
    m.VM.Reset()
    retVal := m.predVal.Data().([]float32)[0]
    return retVal, nil
}
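
Two pieces used by replay() and predict() are not shown in this excerpt: the input type and the Let1() helper that binds a single sample to the network's input node. The following is a rough sketch of what they might look like, inferred from how they are called above; the real versions live with the rest of the project's code and may differ:

// A guess at the input type, inferred from how it is constructed in replay()
// and predict(): a player position plus a candidate move.
type input struct {
    State  Point
    Action Vector
}

// Hypothetical sketch of Let1(): pack the single (state, action) pair into the
// first row of a zero-padded batch and bind it to the network's X node, which
// was declared with shape (batchsize, 4). predict() then reads element [0] of
// the prediction. Assumes "gorgonia.org/gorgonia" and "gorgonia.org/tensor"
// are imported.
func (m *DQN) Let1(x input) error {
    batch := m.NN.x.Shape()[0]
    backing := make([]float32, batch*4)
    backing[0] = float32(x.State.X)
    backing[1] = float32(x.State.Y)
    backing[2] = float32(x.Action.X)
    backing[3] = float32(x.Action.Y)
    t := tensor.New(tensor.WithShape(batch, 4), tensor.WithBacking(backing))
    return gorgonia.Let(m.NN.x, t)
}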

We then define our main training loop, which runs for n episodes, moving around the maze and building up the memories that replay() samples from:

func (m *DQN) train(mz *Maze) (err error) {
    var episodes = 20000
    var times = 1000
    var score float32

    for e := 0; e < episodes; e++ {
        for t := 0; t < times; t++ {
            if e%100 == 0 && t%999 == 1 {
                log.Printf("episode %d, loop %d", e, t)
            }

            moves := getPossibleActions(mz)
            action := m.bestAction(mz, moves)
            reward, isDone := mz.Value(action)
            score = score + reward
            player := mz.player
            mz.Move(action)
            nextMoves := getPossibleActions(mz)
            mem := Memory{State: player, Action: action, Reward: reward, NextState: mz.player, NextMovables: nextMoves, isDone: isDone}
            // these memories are what replay() samples from to update the network
            m.Memories = append(m.Memories, mem)
        }
    }
    return nil
}

We also need a bestAction() function that selects the best possible move to take, given a slice of options and an instance of our maze:

func (m *DQN) bestAction(state *Maze, moves []Vector) (bestAction Vector) {
    var bestActions []Vector
    var maxActValue float32 = -100
    for _, a := range moves {
        actionValue, err := m.predict(state.player, a)
        if err != nil {
            log.Printf("predict failed for %v: %v", a, err)
            continue
        }
        if actionValue > maxActValue {
            maxActValue = actionValue
            bestActions = bestActions[:0] // a strictly better move: discard the previous candidates
            bestActions = append(bestActions, a)
        } else if actionValue == maxActValue {
            bestActions = append(bestActions, a)
        }
    }
    // break ties between equally good moves at random
    rand.Shuffle(len(bestActions), func(i, j int) {
        bestActions[i], bestActions[j] = bestActions[j], bestActions[i]
    })
    return bestActions[0]
}
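
One thing worth noting: the epsilon, epsDecayMin, and decay fields we set on the DQN control the exploration/exploitation trade-off, but in the excerpts shown here bestAction() is purely greedy and never consults epsilon. A typical epsilon-greedy wrapper (a sketch of ours, not code from this chapter) would look something like this:

// Hypothetical epsilon-greedy action selection. With probability epsilon we
// explore (a random legal move), otherwise we exploit the network's
// predictions via bestAction().
func (m *DQN) chooseAction(mz *Maze, moves []Vector) Vector {
    if rand.Float32() < m.epsilon {
        return moves[rand.Intn(len(moves))] // explore
    }
    return m.bestAction(mz, moves) // exploit
}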

Finally, we define a getPossibleActions() function to produce a slice of possible moves given our maze, along with our little max() helper function for finding the maximum value in a slice of float32s:

func getPossibleActions(m *Maze) (retVal []Vector) {
    for i := range cardinals {
        if m.CanMoveTo(m.player, cardinals[i]) {
            retVal = append(retVal, cardinals[i])
        }
    }
    return retVal
}

func max(a []float32) float32 {
    var m float32 = -999999999
    for i := range a {
        if a[i] > m {
            m = a[i]
        }
    }
    return m
}

With all those pieces in place, we can write our main() function to complete our DQN. We begin by setting our variables, including epsilon, then initialize the DQN{} and instantiate our Maze. We then kick off our training loop and, once it is complete, use the trained network to try to solve our maze:

func main() {
    // DQN vars

    // var times int = 1000
    var gamma float32 = 0.95          // discount factor
    var epsilon float32 = 1.0         // exploration/exploitation bias, set to 1.0/exploration by default
    var epsilonDecayMin float32 = 0.01
    var epsilonDecay float32 = 0.995

    rand.Seed(time.Now().UTC().UnixNano())
    dqn := &DQN{
        NN:          NewNN(32),
        gamma:       gamma,
        epsilon:     epsilon,
        epsDecayMin: epsilonDecayMin,
        decay:       epsilonDecay,
    }
    dqn.init()

    m := NewMaze(5, 10)
    fmt.Printf("%+#v", m.repr)
    fmt.Printf("%v %v ", m.start, m.goal)

    // sanity check: can we move in each of the four cardinal directions from the start?
    fmt.Printf("%v ", m.CanMoveTo(m.start, Vector{0, 1}))
    fmt.Printf("%v ", m.CanMoveTo(m.start, Vector{1, 0}))
    fmt.Printf("%v ", m.CanMoveTo(m.start, Vector{0, -1}))
    fmt.Printf("%v ", m.CanMoveTo(m.start, Vector{-1, 0}))

    if err := dqn.train(m); err != nil {
        panic(err)
    }

    // reset the player to the start and walk the maze greedily using the trained network
    m.Reset()
    for {
        moves := getPossibleActions(m)
        best := dqn.bestAction(m, moves)
        reward, isDone := m.Value(best)
        log.Printf(" %#v", m.repr)
        log.Printf("player at: %v best: %v", m.player, best)
        log.Printf("reward %v, done %v", reward, isDone)
        m.Move(best)
        if isDone {
            break // the maze has been solved
        }
    }
}

Now, let's execute our program and observe the outputs.

We can see the dimensions of the maze, as well as a simple representation of walls (1), clear paths (0), our player (2), and our maze exit (3). The next line, {1 0} {9 20}, tells us the exact (X, Y) co-ordinates of our player's starting point and the maze's exit, respectively. We then loop through the movement vectors as a sanity check and begin our training run across n episodes.

Our agent now moves through the maze.

You can experiment with different numbers of episodes (and episode lengths), and generate larger and more complex mazes!
