
10. Theoretical Models

Isaiah Hull, Nacka, Sweden

Relative to other machine learning packages, TensorFlow requires a substantial time investment to master. This is because it provides users with the capacity to define and solve any graph-based model, rather than providing them with a simple and interpretable set of pre-defined models. This feature of TensorFlow was intended to foster the development of deep learning models; however, it also has secondary value for economists who want to solve theoretical models.

In this chapter, we’ll provide a brief overview of TensorFlow’s capabilities in this area. We’ll start by demonstrating how to define and solve an arbitrary mathematical model in TensorFlow. We’ll then apply these tools to solve the neoclassical business cycle model with full depreciation. This model has an analytical solution, which will allow us to evaluate how well TensorFlow performed. However, we will also discuss how to evaluate performance in cases where we do not have analytical solutions.

After we demonstrate how to solve basic mathematical models in TensorFlow, we’ll end the chapter by examining deep reinforcement learning, a field that combines reinforcement learning and deep learning. In recent years, it has accumulated several impressive achievements involving the development of robots and networks that play video games with superhuman levels of performance. We’ll see how this can be applied to solve otherwise intractable theoretical models in economics.

Solving Theoretical Models

Thus far, we have defined a model by selecting a specific architecture and then training the model’s parameters using data. In economics and finance, however, we often encounter a different set of problems that are theoretical, rather than empirical, in nature. These problems require us to solve a functional equation or a system of differential equations. Such problems are derived from a theoretical model that describes optimization problems for households, firms, or social planners.

In such settings, the model’s deep parameters – which typically describe technology, constraints, and preferences – are either calibrated or estimated outside of the model and, thus, are known prior to the implementation of the solution method. The role of TensorFlow in such settings is to enable the solution of a system of differential equations.

The Cake-Eating Problem

The cake-eating problem is commonly used as a “hello world” style introduction to dynamic programming.1 In the problem, an individual is endowed with cake and must decide how much of it to eat in each period. While highly stylized, it provides a strong analogy to the standard consumption-savings problem in economics, where an individual must decide whether to consume more today or delay consumption by allocating more to savings.

As we discussed previously, the deep parameters of such models are typically calibrated or estimated outside of the solution routine. In this case, the individual consuming the cake has a utility function and a discount factor. The utility function measures the enjoyment an individual gets from consuming a piece of cake of a certain size. And the discount factor tells us how an individual will value a slice of cake today versus in the future. We will use common values of the parameters in the utility function and for the discount factor.

Formally, the cake-eating problem can be written down as a dynamic, constrained optimization problem. Equation 10-1 defines the instantaneous utility that an individual receives from eating a slice of cake at time t. In particular, we assume that the instantaneous utility received is invariant to the period in which the agent receives it: that is, we place a time subscript on c, but not u(·). We also assume that utility is given by the natural logarithm of the amount of cake consumed. This will ensure that more cake yields more utility, but the incremental gain – the marginal utility – of more cake is decreasing in c. This provides the cake-eater with a natural desire to space consumption out over time, rather than eating the entire cake today.

Equation 10-1. Instantaneous utility of cake consumption.
$$ u(c_t) = \log(c_t) $$

The marginal utility of consumption can be expressed as the derivative of u(ct) with respect to ct, as given in Equation 10-2. Notice that neither Equation 10-1 nor Equation 10-2 contains parameters. This is one of the benefits of adopting log utility for such problems: it yields simple, parameter-free expressions for utility and marginal utility and satisfies the requirements that we typically place on utility functions in economics and finance.

Equation 10-2. Marginal utility of consumption.
$$ u'(c_t) = \frac{du(c_t)}{dc_t} = \frac{1}{c_t} $$

In addition to this, the second derivative is always negative, as can be seen in Equation 10-3.

Equation 10-3. Second derivative of the utility function.
$$ u''(c_t) = -\frac{1}{c_t^2} $$
To simplify the problem, we’ll normalize the size of the cake to 1, which means that all consumption choices will be between 0 and 1. In Figure 10-1, we plot the level of utility and its first and second derivatives over c values in this interval.
Figure 10-1. Utility of consumption, along with its first and second derivatives over the (0,1] interval
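The plotting code for Figure 10-1 is not shown in the listings; a minimal sketch using NumPy and Matplotlib (assumed here, not part of the book's code) computes the three curves directly from the expressions above.

import numpy as np
import matplotlib.pyplot as plt

# Consumption grid over the (0, 1] interval.
c = np.linspace(0.01, 1.0, 100)

# Utility and its first and second derivatives under log utility.
plt.plot(c, np.log(c), label="u(c) = log(c)")
plt.plot(c, 1/c, label="u'(c) = 1/c")
plt.plot(c, -1/c**2, label="u''(c) = -1/c^2")
plt.xlabel("c")
plt.legend()
plt.show()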

We’ll start by considering a finite horizon problem, where the agent must divide consumption over T periods. This could be because the cake only remains edible for T periods or because the individual only lives T periods. In this stylized example, the reasoning is not particularly important, but it is, of course, more important for consumption-savings problems.

At time t = 0, the agent maximizes the objective function given in Equation 10-4, subject to the budget constraint in Equation 10-5 and a positivity constraint on st + 1 in Equation 10-6. That is, the agent must make a sequence of consumption choices, c0, …, cT − 1, each of which is constrained by the amount of remaining cake, st, and the requirement to carry a positive amount of cake, st + 1, into the following period. Additionally, consumption in all future periods is discounted by β ≤ 1.

In Equation 10-4, we also apply the Principle of Optimality (Bellman 1954) to restate the value of entering period zero with s0 cake. It will be equal to the discounted sum of utilities along the optimal consumption path, which we denote by the unknown function, V(·).

Equation 10-4. Objective function for agent at time t = 0.
$$ V(s_0; 0) = \max_{c_0, \dots, c_{T-1}} \sum_{t \in \{0, \dots, T-1\}} \beta^t \log(c_t) $$
Equation 10-5. Budget constraint.
$$ c_t = s_t - s_{t+1}, \quad \forall t \in \{0, \dots, T-1\} $$
Equation 10-6. Positivity constraint.
$$ s_{t+1} > 0, \quad \forall t \in \{0, \dots, T-1\} $$

Bellman (1954) demonstrated that we may re-express the objective function in an arbitrary period using what was later termed the “Bellman equation,” given in Equation 10-7. We also substitute the budget constraint into the equation.

Equation 10-7. The Bellman equation for the cake-eating problem.
$$ V(s_t; t) = \max_{s_{t+1}} \log(s_t - s_{t+1}) + \beta V(s_{t+1}; t+1) $$

Rather than choosing a consumption sequence for T-t+1 periods, we instead choose ct or the st + 1 it implies for the current period. Solving the problem then reduces to solving a functional equation to recover V(·). After doing this, choosing an st + 1 will pin down both the instantaneous utility and the discounted flows of utility from future periods, making this a sequence of one-period optimization problems.

For finite horizon problems, such as the one we’ve set up, we can pin down V(sT; T) for all sT. Since the decision problem ends in period T − 1, all choices of sT will yield V(sT; T) = 0. Thus, we’ll start by solving Equation 10-8, where it will always be optimal to consume sT − 1. We can now step back in time recursively, solving for V(·) in each period until we arrive at t = 0.

Equation 10-8. The Bellman equation in the final decision period.
$$ V(s_{T-1}; T-1) = \max_{s_T} \log(s_{T-1} - s_T) $$

There are several ways in which we could perform the recursive optimization step. A common one is to use a discrete grid to represent the value function. For the sake of exploiting TensorFlow’s strengths and maintaining continuity with the remainder of the chapter, we’ll instead focus on a parametric approach. More specifically, we’ll parameterize the policy function that maps the state at time t, which is the amount of cake we have at the start of the period, to the state at time t+1, which is the amount of cake we carry into the following period.

To keep things simple, we'll use a linear decision rule that is proportional to the state, as shown in Equation 10-9.

Equation 10-9. Functional form of policy rule for cake-eating.
$$ s_{t+1} = \theta_t s_t $$

We will now implement this approach in TensorFlow for the simple case where T = 2. That is, we start with a full cake of size 1 and must decide how much to carry forward to period T − 1.
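Before turning to code, it is useful to note that the two-period problem can be solved by hand, which gives us a benchmark for the numerical results. Maximizing log(s0 - s1) + β log(s1) with respect to s1 yields the first-order condition and policy rule:

$$ -\frac{1}{s_0 - s_1} + \frac{\beta}{s_1} = 0 \quad \Rightarrow \quad s_1 = \frac{\beta}{1+\beta} s_0, \qquad \theta = \frac{\beta}{1+\beta} $$

With β = 1, this implies θ = 0.5, which is the value our training routine should recover.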

In Listing 10-1, we define the constants and parameters needed to solve the model. This includes the slope of the policy function, theta, which tells us the share of the cake we carry forward into the following period; the discount factor, beta, which tells us how much the agent values cake in period t relative to t+1; and the share of the cake remaining in period zero, s0. Notice that theta is a trainable variable; beta is set to 1.0, indicating that we do not discount cake consumption in period t+1; and we initially have an entire cake (s0 = 1).
import tensorflow as tf
# Define policy rule parameter.
theta = tf.Variable(0.1, dtype=tf.float32)
# Define discount factor.
beta = tf.constant(1.0, tf.float32)
# Define state at t = 0.
s0 = tf.constant(1.0, tf.float32)
Listing 10-1

Define the constants and variables for the cake-eating problem

We next define a function for the policy rule in Listing 10-2, which takes values of the parameters and yields s1. Notice that we define s1 as theta*s0. We use tf.clip_by_value() to restrict s1 to the [0.01, 0.99] interval, which imposes the positivity constraint.

Next, in Listing 10-3, we define the loss function, which takes the parameter values as an input and yields the loss. Notice that v1 is pinned down by the choice of s1, since period 1 is the terminal period. With v1 determined, we can then compute v0, conditional on the choice of theta. We will choose theta, and thus s1, to maximize v0. However, since we will perform minimization in practice, we'll instead use -v0 as the measure of loss.
# Define policy rule.
def policyRule(theta, s0 = s0, beta = beta):
        s1 = tf.clip_by_value(theta*s0,
        clip_value_min = 0.01, clip_value_max = 0.99)
        return s1
Listing 10-2

Define a function for the policy rule

# Define the loss function.
def loss(theta, s0 = s0, beta = beta):
        s1 = policyRule(theta)
        v1 = tf.math.log(s1)
        v0 = tf.math.log(s0-s1) + beta*v1
        return -v0
Listing 10-3

Define the loss function

We next instantiate an optimizer and perform minimization over the course of 500 iterations in Listing 10-4.
# Instantiate an optimizer.
opt = tf.optimizers.Adam(0.1)
# Perform minimization.
for j in range(500):
        opt.minimize(lambda: loss(theta),
                var_list = [theta])
Listing 10-4

Perform optimization

After 100 iterations of training, theta converges to 0.5, as shown in Figure 10-2. The interpretation of theta = 0.5 is that the agent should eat half of the cake in period 0 and half of the cake in period 1, which is exactly what we would expect in the case where the agent does not discount the future.
Figure 10-2. Evolution of policy function parameter over training iterations

Of course, we will typically assume a beta of less than one. Figure 10-3 plots the optimal value of theta for different values of beta; in each case, we re-solve the model, as sketched below. As expected, we see an upward-sloping relationship between the two: as we place more value on future consumption, we choose to carry more cake forward into the future.
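One way to generate the values plotted in Figure 10-3 is sketched below. It assumes the loss() function from Listing 10-3 is in scope, and the beta grid and iteration count are illustrative choices rather than values from the book.

import numpy as np

betas = np.linspace(0.80, 1.00, 5)
thetas = []
for b in betas:
        # Re-initialize the policy parameter and optimizer for each beta.
        theta = tf.Variable(0.1, dtype=tf.float32)
        opt = tf.optimizers.Adam(0.1)
        beta_b = tf.constant(b, tf.float32)
        for j in range(500):
                opt.minimize(lambda: loss(theta, beta=beta_b),
                        var_list = [theta])
        thetas.append(theta.numpy())

For the two-period problem, the resulting values should trace out theta = beta/(1 + beta), the closed form derived earlier.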

This problem was highly stylized, and focusing on the two-period case trivialized it even further. It did, however, demonstrate the basic template for constructing and solving theoretical models in TensorFlow. In the following subsection, we’ll consider a more realistic problem, but will concentrate on a case where we have a closed-form solution. This will make it relatively easy to evaluate the performance of our approach.
Figure 10-3. Relationship between the discount factor and the policy rule parameter

The Neoclassical Business Cycle Model

We will end this section by solving a special form of the neoclassical business cycle model introduced by Brock and Mirman (1972). In the model, a social planner maximizes a representative household’s discounted flows of utility from consumption. In each period, t, the planner chooses next period capital, kt + 1, which yields output in the following period, yt + 1. Under the assumption of log utility and full depreciation, the model has a tractable closed-form solution.

Equation 10-10 is the planner’s problem in the initial period, which is subject to the budget constraint in Equation 10-11. The objective is similar to the cake-eating problem, but the household is infinitely lived, so we now have an infinite summation of discounted utility streams from consumption. The budget constraint indicates that the social planner divides output into consumption and capital in each period. Equation 10-12 specifies the production function.

Equation 10-10. The social planner’s problem.
$$ \max_{c_0, c_1, \dots} \sum_{t=0}^{\infty} \beta^t \log(c_t) $$
Equation 10-11. The economy-wide budget constraint.
$$ y_t = c_t + k_{t+1} $$
Equation 10-12. The production function.
$$ y_t = k_t^{\alpha} $$

We also assume that β < 1, α ∈ (0, 1), and capital fully depreciates in each period. This means that we recover the output produced using the capital we carried forward from the previous period, but we do not recover any of the capital itself.

One way in which we can solve this problem is by identifying a policy function that satisfies the Euler equation. The Euler equation, given in Equation 10-13, requires that the marginal utility of consumption in period t be equal to the discounted gross return to capital in period t+1, multiplied by the marginal utility of consumption in period t+1.

Equation 10-13. The Euler equation.
$$ \frac{1}{c_t} = \beta \alpha k_{t+1}^{\alpha - 1} \frac{1}{c_{t+1}} \quad \Rightarrow \quad c_{t+1} = \beta \alpha k_{t+1}^{\alpha - 1} c_t $$

The Euler equation has an intuitive interpretation: a solution is optimal if the planner can’t make the household better off by reallocating a small amount of consumption from period t to period t+1 or vice versa. We will find a solution that is consistent with Equations 10-11, 10-12, and 10-13 by defining policy functions for capital and consumption. We will see, though, that the policy function for consumption is redundant.

We’ll start by assuming that the solution can be expressed as a policy function that is proportional to output. That is, the planner will choose a share of output to allocate to capital and to consumption. Equation 10-14 provides the policy function for capital, and Equation 10-15 provides the function for consumption.

Equation 10-14. Policy function for capital.
$$ k_{t+1} = \theta_k k_t^{\alpha} = \theta_k y_t $$
Equation 10-15. Policy function for consumption.
$$ c_t = (1 - \theta_k) k_t^{\alpha} = (1 - \theta_k) y_t $$

The closed-form expressions for the policy functions are given in Equations 10-16 and 10-17. We will use these to evaluate the accuracy of our results in TensorFlow.

Equation 10-16. Policy rule for capital.
$$ k_{t+1} = \alpha \beta k_t^{\alpha} $$
Equation 10-17. Policy rule for consumption.
$$ c_t = (1 - \alpha \beta) k_t^{\alpha} $$
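For completeness, substituting the conjectured policy functions (Equations 10-14 and 10-15) into the Euler equation verifies this closed form:

$$ \frac{1}{(1-\theta_k)k_t^{\alpha}} = \beta\alpha k_{t+1}^{\alpha-1}\frac{1}{(1-\theta_k)k_{t+1}^{\alpha}} = \frac{\beta\alpha}{(1-\theta_k)\theta_k k_t^{\alpha}} \quad \Rightarrow \quad \theta_k = \alpha\beta $$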
We have now defined the problem and can implement a solution in TensorFlow. We’ll start by defining the parameters and the capital grid in Listing 10-5. We’ll use standard values for alpha and beta, the production function parameter and discount factor. Next we’ll define thetaK, the share of output that is allocated to capital in the following period. Finally, we’ll define a start-of-period capital grid, k0. This is the vector of capital values that a household could hold at the start of period t.
import tensorflow as tf
# Define production function parameter.
alpha = tf.constant(0.33, tf.float32)
# Define discount factor.
beta = tf.constant(0.95, tf.float32)
# Define params for decision rules.
thetaK = tf.Variable(0.1, dtype=tf.float32)
# Define state grid.
k0 = tf.linspace(0.001, 1.00, 10000)
Listing 10-5

Define model parameters

In Listing 10-6, we define the loss function. We first compute the policy rule for next period capital and then plug the policy rules into the Euler equation. We then subtract the right-hand side from the left-hand side, yielding error, which is sometimes referred to as the Euler equation residual. We then square the residuals and compute the mean.
# Define the loss function.
def loss(thetaK, k0 = k0, beta = beta):
        # Define period t+1 capital.
        k1 = thetaK*k0**alpha
        # Define Euler equation residual.
        error = k1**alpha - \
        beta*alpha*k0**alpha*k1**(alpha-1)
        return tf.reduce_mean(tf.multiply(error,error))
Listing 10-6

Define the loss function

The final step is to define an optimizer and perform minimization, which we do in Listing 10-7. After performing optimization, we print thetaK and the parameter expression from the closed-form solution, beta*alpha. In both cases, we get 0.31350002, suggesting that our TensorFlow implementation identified the true solution to the model.
# Instantiate an optimizer.
opt = tf.optimizers.Adam(0.1)
# Perform minimization.
for j in range(1000):
        opt.minimize(lambda: loss(thetaK),
                var_list = [thetaK])
# Print thetaK.
print(thetaK)
<tf.Variable 'Variable:0' shape=() dtype=float32, numpy=0.31350002>
# Compare analytical solution and thetaK.
print(alpha*beta)
tf.Tensor(0.31350002, shape=(), dtype=float32)
Listing 10-7

Perform optimization and evaluate results

Now that we’ve solved for the policy rules, we can use them to do things like compute transition paths. Listing 10-8 shows how to compute the transitions for consumption, capital, and output using the policy rules and starting from a capital stock value of 0.05. We plot the transition paths in Figure 10-4.
# Set initial value of capital.
k0 = 0.05
# Define empty lists.
y, k, c = [], [], []
# Perform transition.
for j in range(10):
        # Update variables.
        k1 = thetaK*k0**alpha
        c0 = (1-thetaK)*k0**alpha
        # Update lists.
        y.append(k0**alpha)
        k.append(k1)
        c.append(c0)
        # Update state.
        k0 = k1
Listing 10-8

Compute transition path
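The plotting code for Figure 10-4 is not shown; a minimal Matplotlib sketch (assumed, not from the book) could be used to visualize the three lists computed in Listing 10-8.

import matplotlib.pyplot as plt

# Convert any tensors in the lists to plain floats before plotting.
y_vals = [float(v) for v in y]
k_vals = [float(v) for v in k]
c_vals = [float(v) for v in c]

plt.plot(y_vals, label="output")
plt.plot(k_vals, label="capital")
plt.plot(c_vals, label="consumption")
plt.xlabel("Period")
plt.legend()
plt.show()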

Finally, it is worth pointing out that we have used an intentionally trivial example where the solution can be computed analytically. In practice, we will typically encounter problems where this is not the case. In such cases, we will often use Euler equation residuals to evaluate the accuracy of the solution method.

Listing 10-9 demonstrates how we can modify the loss function to compute Euler equation residuals. We’ll start by defining a grid over which to compute them. In some cases, we may want to expand the bounds beyond what we used to solve the model to demonstrate that our model also performs well far away from the steady state. In this case, we’ll use the same grid that we used to solve the model.
Figure 10-4. Transition path for output, capital, and consumption

Perhaps unsurprisingly – since our policy rule matches the analytical solution – the maximum Euler equation residual is negligibly small. While not particularly important for this problem, Euler equation residuals will be helpful whenever we want to determine the extent to which our results are affected by approximation error.
# Define state grid.
k0 = tf.linspace(0.001, 1.00, 10000)
# Define function to return Euler equation residuals.
def eer(k0, thetaK = thetaK, beta = beta):
        # Define period t+1 capital.
        k1 = thetaK*k0**alpha
        # Define Euler equation residual.
        residuals = k1**alpha - \
        beta*alpha*k0**alpha*k1**(alpha-1)
        return residuals
# Generate residuals.
resids = eer(k0)
# Print largest residual.
print(resids.numpy().max())
5.9604645e-08
Listing 10-9

Compute the Euler equation residuals

Deep Reinforcement Learning

Standard theoretical models in economics and finance assume that agents are rational optimizers. This implies that agents form unbiased expectations about the future and achieve their objectives by performing optimization. A rational agent might incorrectly predict the return to capital in every period, but it won't systematically overpredict or underpredict it. Similarly, an optimizer will not always achieve the best result ex post, but ex ante, it will have made the best decision given its information set. More explicitly, an optimizer will choose the exact optimum, given its utility function and constraints, rather than relying on a heuristic or rule of thumb.

As described in Palmer (2015), there are several reasons why we may wish to deviate from the rational optimizer framework. One is that we may want to focus on the process by which agents form policy rules, rather than assuming that they have adopted the one implied by rationality and optimization. Another reason is that breaking either the rationality or optimization requirement will greatly improve the computational tractability of many models.

If we do wish to depart from the standard model, one alternative approach is reinforcement learning, described in Sutton and Barto (1998). Its value within economics has been discussed in Athey and Imbens (2019) and Palmer (2015). Additionally, it was applied in Hull (2015) as a means of solving intractable dynamic programming problems.

Similar to the standard rational optimizer framework in economics, agents in reinforcement learning problems perform optimization, but they do so in an environment where they have limited information about the state of the system. This induces a trade-off between “exploration” and “exploitation” – that is, learning more about the system or optimizing over the part of the system you understand.
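To make this trade-off concrete, the following schematic epsilon-greedy rule (an illustration, not the keras-rl implementation we use later) chooses a random action with probability epsilon and otherwise picks the action with the highest estimated value.

import numpy as np

def epsilon_greedy(q_values, epsilon=0.30):
        # Explore: choose a random action with probability epsilon.
        if np.random.rand() < epsilon:
                return np.random.randint(len(q_values))
        # Exploit: choose the action with the highest estimated value.
        return int(np.argmax(q_values))

In the keras-rl2 implementation later in this section, EpsGreedyQPolicy(0.30) plays exactly this role.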

In this section, we’ll focus on a recently introduced variant of reinforcement learning called “deep Q-learning,” which combines deep learning and reinforcement learning. Our objective will be to slacken the computational constraints that prevent us from solving the rational optimizer versions of problems with high-dimensional state spaces, rather than studying the learning process itself. That is, we will still seek a solution for the rational optimizer’s problem, but we will do so using deep Q-learning, rather than using more conventional methods in computational economics.

Similar to dynamic programming, Q-learning is often done using a “look-up table” approach. In dynamic programming, this entails constructing a table that represents the value of being in each state. We then iteratively update that table until we achieve convergence. The table itself is the solution for the value function. In contrast, in Q-learning, we instead construct a state-action table. In our neoclassical business cycle model example, which we’ll return to here, the state was the capital stock and the action was the level of consumption.

Equation 10-18 demonstrates how the Q-table is updated when we use temporal difference learning. That is, we update the value associated with the state-action pair (st, at) in iteration i+1 by taking the value from iteration i and adding to it the learning rate, λ, multiplied by the temporal difference: the reward, plus the discounted maximum Q-value attainable in the next state, minus the current estimate.

Equation 10-18. Updating the Q-table.
$$ Q_{i+1}(s_t, a_t) \leftarrow Q_i(s_t, a_t) + \lambda \left[ r_t + \beta \max_a Q_i(s_{t+1}, a) - Q_i(s_t, a_t) \right] $$
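As an illustration of Equation 10-18, a tabular update can be written in a few lines of NumPy. This sketch is not part of the book's code, and the learning rate and discount factor are assumed values.

import numpy as np

n_states, n_actions = 1000, 1000
Q = np.zeros((n_states, n_actions))
lam, beta = 0.10, 0.95  # learning rate and discount factor (assumed)

def q_update(Q, s_t, a_t, r_t, s_next):
        # Temporal difference target: reward plus discounted best continuation value.
        target = r_t + beta * Q[s_next].max()
        # Move the current estimate toward the target at rate lam.
        Q[s_t, a_t] += lam * (target - Q[s_t, a_t])
        return Q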

Deep Q-learning replaces the look-up table with a deep neural network called a “deep Q-network.” The approach was introduced in Mnih et al. (2015) and was originally applied to train Q-networks to play video games at superhuman levels of performance.

We will briefly outline how deep Q-learning can be used to solve economic models, returning to the neoclassical business cycle model example. There are several ways in which this can be done in TensorFlow. Two common options are tf-agents, which is a native TensorFlow implementation, and keras-rl2, which makes use of the high-level Keras API in TensorFlow. Since our coverage will be brief and introductory, we’ll focus on keras-rl2, which will allow for a simpler implementation with more familiar syntax.

In Listing 10-10, we install the keras-rl2 module and import tensorflow and numpy. We then import three submodules from the newly installed rl module: DQNAgent, which we will use to define a deep Q-learning agent; EpsGreedyQPolicy, which we’ll use to set the process that generates policy decisions on the training path; and SequentialMemory, which is used to retain decision paths and outcomes that are then used as inputs to train the deep Q-network. Finally, we import gym, which we will use to define the model environment.
# Install keras-rl2.
!pip install keras-rl2
# Import numpy and tensorflow.
import numpy as np
import tensorflow as tf
# Import reinforcement learning modules from keras.
from rl.agents.dqn import DQNAgent
from rl.policy import EpsGreedyQPolicy
from rl.memory import SequentialMemory
# Import module for comparing RL algorithms.
import gym
Listing 10-10

Install and import modules to perform deep Q-learning

In Listing 10-11, we’ll set the number of capital nodes and define an environment, planner, which is a subclass of gym.Env. This will specify the details of the social planner’s reinforcement learning problem.

Our class, planner, is constructed to do the following at initialization: define a discrete capital grid, define action and observation spaces, initialize the number of decisions to zero, set the maximum number of decisions, set the node index of the initial value of capital (500 out of 1000), and set the production function parameter (alpha). For our purposes, the action and observation spaces will both be discrete objects with 1000 nodes, defined using gym.spaces. The observation space in our case is the entire state space: that is, all capital nodes. The action space is also the same.
# Define number of capital nodes.
n_capital = 1000
# Define environment.
class planner(gym.Env):
        def __init__(self):
                self.k = np.linspace(0.01, 1.0, n_capital)
                self.action_space = gym.spaces.Discrete(n_capital)
                self.observation_space = gym.spaces.Discrete(n_capital)
                self.decision_count = 0
                self.decision_max = 100
                self.observation = 500
                self.alpha = 0.33
        def step(self, action):
                assert self.action_space.contains(action)
                self.decision_count += 1
                done = False
                # Reward is log utility of consumption if consumption is positive.
                if (self.k[self.observation]**self.alpha - self.k[action]) > 0:
                        reward = np.log(self.k[self.observation]**self.alpha -
                                self.k[action])
                else:
                        reward = -1000
                self.observation = action
                if (self.decision_count >= self.decision_max) or (reward == -1000):
                        done = True
                return self.observation, reward, done, {"decisions": self.decision_count}
        def reset(self):
                self.decision_count = 0
                self.observation = 500
                return self.observation
Listing 10-11

Define custom reinforcement learning environment

We next define a step method of the class, which is required to return four outputs: the observation (state), the reward (instantaneous utility), an indicator for whether a training session should be reset (done), and a dictionary object that contains relevant debugging information. Calling this method increments the decision_count attribute, which records the number of decisions an agent has made during a training session. It also initially sets done to False. We then evaluate whether the agent made a valid decision, that is, selected a positive value of consumption. If an agent reaches decision_max decisions or chooses a non-positive consumption value, done is set to True and the training loop calls the reset() method, which reinitializes the state and decision count.
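Before handing the environment to a reinforcement learning agent, it can be useful to step it manually as a sanity check. The snippet below is illustrative and not one of the book's listings; the action index 250 is an arbitrary choice.

# Instantiate and manually step the environment.
test_env = planner()
state = test_env.reset()
# Carry forward capital node 250 out of 1000.
state, reward, done, info = test_env.step(250)
print(state, reward, done, info)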

In Listing 10-12, we instantiate a planner environment and then define a neural network in TensorFlow. We use the Sequential model with a single dense hidden layer that has a relu activation function. Note that the model's output layer must contain n_capital nodes; beyond that, we can choose whatever architecture is best suited to our problem.
# Instantiate planner environment.
env = planner()
# Define model in TensorFlow.
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Flatten(input_shape=(1,) + env.observation_space.shape))
model.add(tf.keras.layers.Dense(32, activation="relu"))
model.add(tf.keras.layers.Dense(n_capital, activation="linear"))
Listing 10-12

Instantiate environment and define model in TensorFlow

Now that our environment and network have been defined, we need to specify hyperparameters and train the model, which we do in Listing 10-13. We first use SequentialMemory to retain a "replay buffer" of 10,000 past decisions, which will be used to train the model. We then set the model to use an epsilon-greedy policy with epsilon = 0.30. During training, this means that the model will maximize utility 70% of the time and explore with a random decision the remaining 30% of the time. Finally, we set the hyperparameters of the DQNAgent model, compile it, and perform training.
# Specify replay buffer.
memory = SequentialMemory(limit=10000, window_length=1)
# Define policy used to make training-time decisions.
policy = EpsGreedyQPolicy(0.30)
# Define deep Q-learning network (DQN).
dqn = DQNAgent(model=model, nb_actions=n_capital, memory=memory,
        nb_steps_warmup=100, gamma=0.95,
        target_model_update=1e-2, policy=policy)
# Compile and train model.
dqn.compile(tf.keras.optimizers.Adam(0.005), metrics=['mse'])
dqn.fit(env, nb_steps=10000)
Listing 10-13

Set model hyperparameters and train

Monitoring the training process yields two observations. First, the number of decisions per session increases across iterations, suggesting that the agent learns to avoid non-positive consumption in future periods by not drawing capital down as sharply as a purely greedy policy would suggest. And second, the loss declines and the average reward begins to rise, suggesting that the agent is moving closer to optimality.

If we wanted to perform a more thorough analysis of the quality of our solution, we could examine the Euler equation residuals, as we discussed in the previous section; one possible approach is sketched below. This would tell us whether the DQN model yielded a policy that was approximately optimal.
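One hypothetical way to do this is to extract the greedy policy implied by the trained Q-network and evaluate the residual 1/c_t - beta*alpha*k_{t+1}^(alpha-1)/c_{t+1} on the capital grid. The sketch below assumes the model, env, and n_capital objects defined earlier are in scope, and the discount factor mirrors the gamma used to train the agent.

# Sketch only: compute Euler equation residuals for the greedy DQN policy.
def greedy_action(s_index):
        # Greedy next-period capital index implied by the trained Q-network.
        q_values = model.predict(np.array([[s_index]]), verbose=0)
        return int(np.argmax(q_values))

beta_val, resids = 0.95, []
for s in range(n_capital):
        a = greedy_action(s)
        a_next = greedy_action(a)
        k_t, k_t1, k_t2 = env.k[s], env.k[a], env.k[a_next]
        c_t = k_t**env.alpha - k_t1
        c_t1 = k_t1**env.alpha - k_t2
        # Skip grid points where the policy implies non-positive consumption.
        if c_t > 0 and c_t1 > 0:
                resids.append(1/c_t - beta_val*env.alpha*k_t1**(env.alpha-1)/c_t1)
print(np.abs(resids).max())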

Summary

TensorFlow not only provides us with a means of training deep learning models but also offers a suite of tools that can be used to solve arbitrary mathematical models. This includes models that are commonly used in economics and finance. In this chapter, we examined how to do this using a toy model (the cake-eating model) and a common benchmark in the computational literature: the neoclassical business cycle model. Both models are trivial to solve using conventional methods in economics, but provide a simple means of demonstrating how TensorFlow can be used to solve theoretical models of relevance for economists.

We also showed how deep reinforcement learning could be used as an alternative to standard methods in computational economics. In particular, using deep Q-learning networks (DQN) in TensorFlow may enable economists to solve higher-dimensional models in a non-linear setting without changing model assumptions or introducing a substantial amount of numerical error.

Bibliography

Athey, S., and G.W. Imbens. 2019. "Machine Learning Methods That Economists Should Know About." Annual Review of Economics 11: 685–725.

Bellman, R. 1954. “The theory of dynamic programming.” Bulletin of the American Mathematical Society 60: 503–515.

Brock, W., and L. Mirman. 1972. “Optimal Economic Growth and Uncertainty: The Discounted Case.” Journal of Economic Theory 4 (3): 479–513.

Hull, I. 2015. “Approximate Dynamic Programming with Post-Decision States as a Solution Method for Dynamic Economic Models.” Journal of Economic Dynamics and Control 55: 57–70.

Mnih, V. et al. 2015. “Human-level control through deep reinforcement learning.” Nature 518: 529–533.

Palmer, N.M. 2015. Individual and Social Learning: An Implementation of Bounded Rationality from First Principles. Doctoral Dissertation in Computational Social Science, Fairfax, VA: George Mason University.

Sutton, R.S., and A.G. Barto. 1998. Reinforcement Learning: An Introduction. Cambridge: MIT Press.
