Solving a maze using a DQN in Gorgonia

Now, it's time to build our maze solver!

Using a DQN to solve a little ASCII maze is a bit like bringing a bulldozer to the beach to make sandcastles for your kids: it's completely unnecessary, but you get to play with a big machine. However, as a tool for learning about DQNs, mazes are invaluable. This is because the number of possible states and actions in the game is small, and the constraints are simple to represent (for example, the walls of our maze, which our agent cannot move through). This means that we can step through our program and easily inspect what our network is doing.

We will follow these steps:

  1. Create a maze.go file for this bit of code
  2. Import our libraries and set our data type
  3. Define our Maze{}
  4. Write a NewMaze() function to instantiate this struct

We also need to define our Maze{} helper functions. These include the following (a short usage sketch follows the list):

  • CanMoveTo(): Check whether a move is valid
  • Move(): Move our player to a co-ordinate in the maze
  • Value(): Return the reward for a given action
  • Reset(): Set player to start co-ordinates
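
Before we get to the implementations, here is a rough sketch of how these four helpers fit together, using the signatures implied by the way they are called later in this section. step() itself is ours, for illustration only, and is not part of the chapter's code:

// Hypothetical helper, not part of the book's code: try a move, apply it if
// legal, and read back the reward. The signatures are inferred from how
// CanMoveTo(), Value(), and Move() are called later in main.go.
func step(mz *Maze, move Vector) (reward float32, done bool) {
    if !mz.CanMoveTo(mz.player, move) {
        return 0, false // an illegal move: nothing happens
    }
    reward, done = mz.Value(move) // reward for taking this move from the current position
    mz.Move(move)                 // apply the move to the player
    return reward, done
}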

Let's take a look at the start of the code for our maze generator. This is an excerpt, and the remainder of the code can be found in the book's GitHub repository:

...
type Point struct{ X, Y int }
type Vector Point

type Maze struct {
    // some maze object
    *mazegen.Maze
    repr   *tensor.Dense
    iter   [][]tile
    values [][]float32

    player, start, goal Point

    // meta
    r *rand.Rand
}
...

Now that we've got the code that we need to generate and interact with a maze, we need to define the simple feedforward, fully connected network. This code should be familiar to us by now. Let's create nn.go:

...
type NN struct {
    g *ExprGraph
    x *Node // input: batchsize rows of 4 floats (player X/Y plus move X/Y)
    y *Node // target: one reward value per row
    l []FC

    pred    *Node
    predVal Value
}

func NewNN(batchsize int) *NN {
    g := NewGraph()
    // `of` is our package-level data type, set when we imported our libraries
    x := NewMatrix(g, of, WithShape(batchsize, 4), WithName("X"), WithInit(Zeroes()))
    y := NewVector(g, of, WithShape(batchsize), WithName("Y"), WithInit(Zeroes()))
...
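
The NN struct references an FC layer type that isn't shown in this excerpt. As a reminder, here is one plausible, minimal sketch of what such a fully connected layer can look like in Gorgonia; it is not the book's own implementation (which may carry extra state such as biases) and assumes the same dot-style gorgonia import as the rest of nn.go:

// A minimal, hypothetical fully connected layer: a weight matrix plus an
// optional activation (Rectify, for example), wired into the graph by fwd().
type FC struct {
    w   *Node                      // the layer's weight matrix
    act func(*Node) (*Node, error) // optional activation function
}

func (l *FC) fwd(x *Node) (*Node, error) {
    xw, err := Mul(x, l.w) // x · w
    if err != nil {
        return nil, err
    }
    if l.act == nil {
        return xw, nil
    }
    return l.act(xw)
}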

We can now begin to define the DQN that will make use of this neural network. First, let's create a memory.go file with the basic struct type that captures information about a given episode:

package main

type Memory struct {
    State        Point
    Action       Vector
    Reward       float32
    NextState    Point
    NextMovables []Vector
    isDone       bool
}

We will build up a slice of []Memory and use it to store, for each move, the X/Y state co-ordinates, the move vector, the expected reward, the next state and its possible moves, and whether the maze has been solved.

Now we can edit our main.go and pull everything together. First, we define our possible moves across the m x n matrix that represents our maze:

package main

import (
    "fmt"
    "log"
    "math/rand"
    "time"

    "gorgonia.org/gorgonia"
)

var cardinals = [4]Vector{
    Vector{0, 1},  // E
    Vector{1, 0},  // N
    Vector{-1, 0}, // S
    Vector{0, -1}, // W
}

Next, we need our main DQN{} structure to which we attach the neural network we defined earlier, our VM/Solver, and our DQN-specific hyperparameters. We also need an init() function to build the embedded feedforward network and to attach the VM and solver to our DQN:

type DQN struct {
    *NN
    gorgonia.VM
    gorgonia.Solver
    Memories []Memory // the Q-table: State/Action/Reward/NextState/NextMoves/IsDone, appended to `times` times per episode during training

    gamma       float32
    epsilon     float32
    epsDecayMin float32
    decay       float32
}

func (m *DQN) init() {
    if _, err := m.NN.cons(); err != nil {
        panic(err)
    }
    m.VM = gorgonia.NewTapeMachine(m.NN.g)
    m.Solver = gorgonia.NewRMSPropSolver()
}

Next is our experience replay() function. Here, we sample a batch from our memories, retrain and update our network from it, and gradually decay our epsilon. For each memory that did not end the game, the training target is the immediate reward plus the discounted best predicted reward of the next state; with gamma at 0.95, a step reward of -0.1, and a best predicted next-step reward of 0.8, for example, the target would be -0.1 + 0.95 × 0.8 = 0.66:

func (m *DQN) replay(batchsize int) error {
    // sample at most batchsize memories (fewer if we don't have that many yet)
    var N int
    if batchsize < len(m.Memories) {
        N = batchsize
    } else {
        N = len(m.Memories)
    }
    Xs := make([]input, 0, N)
    Ys := make([]float32, 0, N)
    mems := make([]Memory, N)
    copy(mems, m.Memories)
    rand.Shuffle(len(mems), func(i, j int) {
        mems[i], mems[j] = mems[j], mems[i]
    })

    // note: iterate over N (not batchsize), in case we have fewer memories than a full batch
    for b := 0; b < N; b++ {
        mem := mems[b]

        // the training target: the immediate reward, plus the discounted best
        // predicted reward of the next state if the game isn't over yet
        var y float32
        if mem.isDone {
            y = mem.Reward
        } else {
            var nextRewards []float32
            for _, next := range mem.NextMovables {
                nextReward, err := m.predict(mem.NextState, next)
                if err != nil {
                    return err
                }
                nextRewards = append(nextRewards, nextReward)
            }
            reward := max(nextRewards)
            y = mem.Reward + m.gamma*reward
        }
        Xs = append(Xs, input{mem.State, mem.Action})
        Ys = append(Ys, y)
        if err := m.VM.RunAll(); err != nil {
            return err
        }
        m.VM.Reset()
        if err := m.Solver.Step(m.model()); err != nil {
            return err
        }
        // decay epsilon towards its minimum
        if m.epsilon > m.epsDecayMin {
            m.epsilon *= m.decay
        }
    }
    return nil
}

The predict() function—called when we are determining the best possible move (or move with the greatest predicted reward)—is next. It takes a player's position in the maze and a single move, and returns our neural network's projected reward for that move:

func (m *DQN) predict(player Point, action Vector) (float32, error) {
    x := input{State: player, Action: action}
    m.Let1(x) // bind this single (state, action) pair to the network's input
    if err := m.VM.RunAll(); err != nil {
        return 0, err
    }
    m.VM.Reset()
    retVal := m.predVal.Data().([]float32)[0]
    return retVal, nil
}
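
Two pieces used by replay() and predict() are not shown in this excerpt: the input type and the Let1() helper that binds a single sample to the network's input node. The following is a rough sketch of what they might look like, inferred from how they are called above; the real versions live with the rest of the project's code and may differ:

// A guess at the input type, inferred from how it is constructed in replay()
// and predict(): a player position plus a candidate move.
type input struct {
    State  Point
    Action Vector
}

// Hypothetical sketch of Let1(): pack the single (state, action) pair into the
// first row of a zero-padded batch and bind it to the network's X node, which
// was declared with shape (batchsize, 4). predict() then reads element [0] of
// the prediction. Assumes "gorgonia.org/gorgonia" and "gorgonia.org/tensor"
// are imported.
func (m *DQN) Let1(x input) error {
    batch := m.NN.x.Shape()[0]
    backing := make([]float32, batch*4)
    backing[0] = float32(x.State.X)
    backing[1] = float32(x.State.Y)
    backing[2] = float32(x.Action.X)
    backing[3] = float32(x.Action.Y)
    t := tensor.New(tensor.WithShape(batch, 4), tensor.WithBacking(backing))
    return gorgonia.Let(m.NN.x, t)
}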

We then define our main training loop, which runs for n episodes, moving around the maze and building up the memories that replay() samples from:

func (m *DQN) train(mz *Maze) (err error) {
    var episodes = 20000
    var times = 1000
    var score float32

    for e := 0; e < episodes; e++ {
        for t := 0; t < times; t++ {
            if e%100 == 0 && t%999 == 1 {
                log.Printf("episode %d, loop %d", e, t)
            }

            moves := getPossibleActions(mz)
            action := m.bestAction(mz, moves)
            reward, isDone := mz.Value(action)
            score = score + reward
            player := mz.player
            mz.Move(action)
            nextMoves := getPossibleActions(mz)
            mem := Memory{State: player, Action: action, Reward: reward, NextState: mz.player, NextMovables: nextMoves, isDone: isDone}
            // these memories are what replay() samples from to update the network
            m.Memories = append(m.Memories, mem)
        }
    }
    return nil
}

We also need a bestAction() function that selects the best possible move to take, given a slice of options and an instance of our maze:

func (m *DQN) bestAction(state *Maze, moves []Vector) (bestAction Vector) {
    var bestActions []Vector
    var maxActValue float32 = -100
    for _, a := range moves {
        actionValue, err := m.predict(state.player, a)
        if err != nil {
            log.Printf("predict failed for %v: %v", a, err)
            continue
        }
        if actionValue > maxActValue {
            maxActValue = actionValue
            bestActions = bestActions[:0] // a strictly better move: discard the previous candidates
            bestActions = append(bestActions, a)
        } else if actionValue == maxActValue {
            bestActions = append(bestActions, a)
        }
    }
    // break ties between equally good moves at random
    rand.Shuffle(len(bestActions), func(i, j int) {
        bestActions[i], bestActions[j] = bestActions[j], bestActions[i]
    })
    return bestActions[0]
}
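
One thing worth noting: the epsilon, epsDecayMin, and decay fields we set on the DQN control the exploration/exploitation trade-off, but in the excerpts shown here bestAction() is purely greedy and never consults epsilon. A typical epsilon-greedy wrapper (a sketch of ours, not code from this chapter) would look something like this:

// Hypothetical epsilon-greedy action selection. With probability epsilon we
// explore (a random legal move), otherwise we exploit the network's
// predictions via bestAction().
func (m *DQN) chooseAction(mz *Maze, moves []Vector) Vector {
    if rand.Float32() < m.epsilon {
        return moves[rand.Intn(len(moves))] // explore
    }
    return m.bestAction(mz, moves) // exploit
}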

Finally, we define a getPossibleActions() function to produce a slice of possible moves given our maze, along with our little max() helper function for finding the maximum value in a slice of float32s:

func getPossibleActions(m *Maze) (retVal []Vector) {
    for i := range cardinals {
        if m.CanMoveTo(m.player, cardinals[i]) {
            retVal = append(retVal, cardinals[i])
        }
    }
    return retVal
}

func max(a []float32) float32 {
    var m float32 = -999999999
    for i := range a {
        if a[i] > m {
            m = a[i]
        }
    }
    return m
}

With all those pieces in place, we can write our main() function to complete our DQN. We begin by setting our variables, including epsilon, then initialize the DQN{} and instantiate our Maze. We then kick off our training loop and, once it is complete, use the trained network to try to solve our maze:

func main() {
    // DQN vars

    // var times int = 1000
    var gamma float32 = 0.95          // discount factor
    var epsilon float32 = 1.0         // exploration/exploitation bias, set to 1.0/exploration by default
    var epsilonDecayMin float32 = 0.01
    var epsilonDecay float32 = 0.995

    rand.Seed(time.Now().UTC().UnixNano())
    dqn := &DQN{
        NN:          NewNN(32),
        gamma:       gamma,
        epsilon:     epsilon,
        epsDecayMin: epsilonDecayMin,
        decay:       epsilonDecay,
    }
    dqn.init()

    m := NewMaze(5, 10)
    fmt.Printf("%+#v", m.repr)
    fmt.Printf("%v %v ", m.start, m.goal)

    // sanity check: can we move in each of the four cardinal directions from the start?
    fmt.Printf("%v ", m.CanMoveTo(m.start, Vector{0, 1}))
    fmt.Printf("%v ", m.CanMoveTo(m.start, Vector{1, 0}))
    fmt.Printf("%v ", m.CanMoveTo(m.start, Vector{0, -1}))
    fmt.Printf("%v ", m.CanMoveTo(m.start, Vector{-1, 0}))

    if err := dqn.train(m); err != nil {
        panic(err)
    }

    // reset the player to the start and walk the maze greedily using the trained network
    m.Reset()
    for {
        moves := getPossibleActions(m)
        best := dqn.bestAction(m, moves)
        reward, isDone := m.Value(best)
        log.Printf(" %#v", m.repr)
        log.Printf("player at: %v best: %v", m.player, best)
        log.Printf("reward %v, done %v", reward, isDone)
        m.Move(best)
        if isDone {
            break // the maze has been solved
        }
    }
}

Now, let's execute our program and observe the outputs.

We can see the dimensions of the maze, as well as a simple representation of walls (1), clear paths (0), our player (2), and our maze exit (3). The next line, {1 0} {9 20}, tells us the exact (X, Y) co-ordinates of our player's starting point and the maze's exit, respectively. We then loop through the movement vectors as a sanity check and begin our training run across n episodes.

Our agent now moves through the maze.

You can experiment with different numbers of episodes (and episode lengths), and generate larger and more complex mazes!
