Learning to use reinforcement

Imagine that we need an enemy that selects different actions over time as the player progresses through the game and their patterns change, or a game about training different types of pets that have some degree of free will.

For these types of tasks, we can use a series of techniques aimed at modeling learning based on experience. One of these algorithms is Q-learning, which will be implemented in this recipe.

Getting ready…

Before delving into the main algorithm, it is necessary to have certain data structures implemented. We need to define a structure for game state, another for game actions, and a class for defining an instance of the problem. They can coexist in the same file.

The following is an example of the data structure for defining a game state:

public struct GameState
{
    // TODO
    // your state definition here
}

Next is an example of the data structure for defining a game action:

public struct GameAction
{
    // TODO
    // your action definition here
}
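
Both structs are intentionally left as placeholders so that they can be tailored to each game. Purely as a point of reference, a minimal sketch for a hypothetical grid-based chase game could look as follows; the fields and the Direction enum are illustrative assumptions, not part of the recipe:

// Hypothetical sketch: a grid-based chase game where the learning agent
// pursues the player. The state is the pair of cells they occupy.
public struct GameState
{
    public int agentX, agentY;
    public int playerX, playerY;
}

// The action is simply the direction the agent moves in.
public enum Direction { Up, Down, Left, Right }

public struct GameAction
{
    public Direction direction;
}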

Finally, we will build the data type for defining a problem instance (a minimal example subclass is sketched after these steps):

  1. Create the file and class:
    public class ReinforcementProblem
    {
    }
  2. Define a virtual function for retrieving a random state. Depending on the type of game we're developing, this might take the current state of the game into account:
    public virtual GameState GetRandomState()
    {
        // TODO
        // Define your own behaviour
        return new GameState();
    }
  3. Define a virtual function for retrieving all the available actions from a given game state:
    public virtual GameAction[] GetAvailableActions(GameState s)
    {
        // TODO
        // Define your own behaviour
        return new GameAction[0];
    }
  4. Define a virtual function for carrying out an action, and then retrieving the resulting state and reward:
    public virtual GameState TakeAction(
            GameState s,
            GameAction a,
            ref float reward)
    {
        // TODO
        // Define your own behaviour
        reward = 0f;
        return new GameState();
    }
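
The virtual functions above are meant to be overridden with game-specific logic. The following is a minimal sketch of how a concrete problem could subclass ReinforcementProblem, reusing the hypothetical grid-chase structs sketched earlier; the class name, movement, and reward rules are assumptions for illustration only:

using UnityEngine;

// Hypothetical example built on the grid-chase structs sketched earlier.
public class ChaseProblem : ReinforcementProblem
{
    private const int gridSize = 10;

    // Place the agent and the player on random cells.
    public override GameState GetRandomState()
    {
        GameState s = new GameState();
        s.agentX = Random.Range(0, gridSize);
        s.agentY = Random.Range(0, gridSize);
        s.playerX = Random.Range(0, gridSize);
        s.playerY = Random.Range(0, gridSize);
        return s;
    }

    // Every direction is always available in this sketch.
    public override GameAction[] GetAvailableActions(GameState s)
    {
        return new GameAction[]
        {
            new GameAction { direction = Direction.Up },
            new GameAction { direction = Direction.Down },
            new GameAction { direction = Direction.Left },
            new GameAction { direction = Direction.Right }
        };
    }

    // Move the agent (boundary clamping omitted for brevity) and reward
    // it for reaching the player, with a small penalty per step otherwise.
    public override GameState TakeAction(GameState s, GameAction a, ref float reward)
    {
        if (a.direction == Direction.Up) s.agentY++;
        else if (a.direction == Direction.Down) s.agentY--;
        else if (a.direction == Direction.Left) s.agentX--;
        else s.agentX++;
        int dist = Mathf.Abs(s.agentX - s.playerX) + Mathf.Abs(s.agentY - s.playerY);
        reward = dist == 0 ? 1f : -0.01f;
        return s;
    }
}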

How to do it…

We will implement two classes. The first one stores values in a dictionary for learning purposes, and the second one is the class that actually holds the Q-learning algorithm:

  1. Create the QValueStore class:
    using UnityEngine;
    using System.Collections.Generic;
    
    public class QValueStore : MonoBehaviour
    {
        private Dictionary<GameState, Dictionary<GameAction, float>> store;
    }
  2. Initialize the dictionary. Given that QValueStore derives from MonoBehaviour, we do this in the Awake function rather than in a constructor:
    void Awake()
    {
        store = new Dictionary<GameState, Dictionary<GameAction, float>>();
    }
  3. Define the function for getting the Q-value that results from taking a given action in a given game state. Craft this carefully, considering that the action may not have been stored yet for that particular state (possible bodies for this function and the next one are sketched after these steps):
    public virtual float GetQValue(GameState s, GameAction a)
    {
        // TODO: your behaviour here
        return 0f;
    }
  4. Implement the function for retrieving the best action to take in a certain state:
    public virtual GameAction GetBestAction(GameState s)
    {
        // TODO: your behaviour here
        return new GameAction();
    }
  5. Implement the function for storing a computed Q-value:
    public void StoreQValue(
            GameState s,
            GameAction a,
            float val)
    {
        if (!store.ContainsKey(s))
        {
            Dictionary<GameAction, float> d;
            d = new Dictionary<GameAction, float>();
            store.Add(s, d);
        }
        if (!store[s].ContainsKey(a))
        {
            store[s].Add(a, 0f);
        }
        store[s][a] = val;
    }
  6. Let's move on to the QLearning class, which will run the algorithm:
    using UnityEngine;
    using System.Collections;
    
    public class QLearning : MonoBehaviour
    {
        public QValueStore store;
    }
  7. Define the function for retrieving a random action from a given set:
    private GameAction GetRandomAction(GameAction[] actions)
    {
        int n = actions.Length;
        return actions[Random.Range(0, n)];
    }
  8. Implement the learning function; its body is split across the following steps. Start by defining its signature, taking into consideration that it is a coroutine:
    public IEnumerator Learn(
            ReinforcementProblem problem,
            int numIterations,
            float alpha,
            float gamma,
            float rho,
            float nu)
    {
        // next steps  
    }
  9. Validate that the store reference is assigned; otherwise, stop the coroutine:
    if (store == null)
        yield break;
  10. Get a random initial state and start the main loop of iterations:
    GameState state = problem.GetRandomState();
    for (int i = 0; i < numIterations; i++)
    {
        // next steps
    }
  11. Yield until the next frame so that the rest of the game keeps running while learning:
    yield return null;
  12. Check against the length-of-walk parameter; with probability nu, jump to a new random state:
    if (Random.value < nu)
        state = problem.GetRandomState();
  13. Get the available actions from the current game state:
    GameAction[] actions;
    actions = problem.GetAvailableActions(state);
    GameAction action;
  14. Choose an action depending on the randomness-of-exploration parameter (rho):
    if (Random.value < rho)
        action = GetRandomAction(actions);
    else
        action = store.GetBestAction(state);
  15. Compute the new state that results from taking the selected action in the current state, along with its reward:
    float reward = 0f;
    GameState newState;
    newState = problem.TakeAction(state, action, ref reward);
  16. Get the Q-value of the current state and the chosen action, and then the Q-value of the best action available from the new state computed before:
    float q = store.GetQValue(state, action);
    GameAction bestAction = store.GetBestAction(newState);
    float maxQ = store.GetQValue(newState, bestAction);
  17. Apply the Q-learning formula:
    q = (1f - alpha) * q + alpha * (reward + gamma * maxQ);
  18. Store the computed Q-value, using the state and action as its indices, and move on to the new state:
    store.StoreQValue(state, action, q);
    state = newState;
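
The stub functions from steps 3 and 4 can be filled in many ways depending on how states and actions are defined. A minimal sketch that leans on the store dictionary, assuming unseen state-action pairs default to a value of zero, could replace the TODO bodies as follows:

// Possible body for step 3: look the value up in the nested dictionary,
// falling back to 0 when the pair has never been stored.
public virtual float GetQValue(GameState s, GameAction a)
{
    if (store.ContainsKey(s) && store[s].ContainsKey(a))
        return store[s][a];
    return 0f;
}

// Possible body for step 4: scan the actions stored for the state and
// return the one with the highest Q-value.
public virtual GameAction GetBestAction(GameState s)
{
    GameAction best = new GameAction();
    if (!store.ContainsKey(s))
        return best;
    float maxQ = float.MinValue;
    foreach (KeyValuePair<GameAction, float> entry in store[s])
    {
        if (entry.Value > maxQ)
        {
            maxQ = entry.Value;
            best = entry.Key;
        }
    }
    return best;
}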

How it works…

In the Q-learning algorithm, the game world is treated as a state machine. It is important to take note of the meaning of the parameters:

  • alpha: This is the learning rate
  • gamma: This is the discount rate
  • rho: This is the randomness of exploration
  • nu: This is the length of the walk
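
As a rough usage reference, the Learn coroutine could be started from another component along the following lines. The LearningRunner class, the parameter values, and the reuse of the hypothetical ChaseProblem from the earlier sketch are illustrative assumptions; alpha, gamma, rho, and nu typically need tuning per game:

using UnityEngine;

public class LearningRunner : MonoBehaviour
{
    // Reference to the QLearning component, assigned in the Inspector.
    public QLearning learner;

    void Start()
    {
        // Hypothetical values: moderate learning rate and discount rate,
        // 10% random exploration, and a 5% chance of restarting the walk.
        ReinforcementProblem problem = new ChaseProblem();
        StartCoroutine(learner.Learn(problem, 10000, 0.2f, 0.75f, 0.1f, 0.05f));
    }
}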