© Isaiah Hull 2021
I. Hull, Machine Learning for Economics and Finance in TensorFlow 2, https://doi.org/10.1007/978-1-4842-6373-0_3

3. Regression

Isaiah Hull
Nacka, Sweden

The term “regression” differs in common usage between econometrics and machine learning. In econometrics, a regression involves the estimation of parameter values that relate a dependent variable to independent variables. The most common form of regression in econometrics is multiple linear regression, which involves the estimation of a linear association between a continuous dependent variable and multiple independent variables. Within econometrics, however, the term also encompasses non-linear models and models where the dependent variable is discrete. By contrast, a regression in machine learning refers to a linear or non-linear supervised learning model with a continuous dependent variable (target). Throughout this chapter, we will adopt the broader econometrics definition of regression, but will introduce methods commonly applied in machine learning.

Linear Regression

In this section, we’ll introduce the concept of a “linear regression,” which is the most commonly employed empirical method in econometrics. It is used when the dependent variable is continuous, and the true relationships between the dependent variable and the independent variables are assumed to be linear.

Overview

A linear regression models the relationship between a dependent variable, Y, and a set of independent variables, {X0, …, Xk−1}, under the assumption of linearity in the coefficients. Linearity requires that the relationship between each Xj and Y can be modeled as a constant slope, represented by a scalar coefficient, βj. Equation 3-1 provides the general form for a linear model with k independent variables.

Equation 3-1. A linear model.
$$ Y=\alpha +{\beta}_0{X}_0+\dots +{\beta}_{k-1}{X}_{k-1} $$

In many cases, we will adopt the notation given in Equation 3-2, which explicitly specifies an index for each observation. Yi, for instance, denotes the value of variable Y for entity i.

Equation 3-2. A linear model with entity indices.
$$ {Y}_i=\alpha +{\beta}_0{X}_{i0}+\dots +{\beta}_{k-1}{X}_{ik-1} $$

In addition to entity indices, we will often use time indices in economic problems. In such cases, we will typically use a t subscript to indicate the time period in which the variable is observed, as we have done in Equation 3-3.

Equation 3-3. A linear model with entity and time indices.
$$ {Y}_{it}=\alpha +{\beta}_0{X}_{it0}+\dots +{\beta}_{k-1}{X}_{it k-1} $$

In a linear regression, the model parameters, {α, β0, …, βk−1}, do not vary with time or by entity and, thus, are not indexed by either. Additionally, non-linear transformations of the parameters are not permitted. A dense neural network layer, for instance, has a similar functional form, but applies a non-linear transformation to the sum of coefficient-variable products, as shown in Equation 3-4, where σ represents the sigmoid function.

Equation 3-4. A dense layer of a neural network with a sigmoid activation function.
$$ {Y}_{it}=\sigma \left(\alpha +{\beta}_0{X}_{it0}+\dots +{\beta}_{k-1}{X}_{it k-1}\right) $$

While linearity may appear to be a severe functional form restriction, it does not prevent us from applying transformations, including non-linear transformations, to the independent variables. We can, for instance, re-define X0 as its natural logarithm and include it as an independent variable. Linear regressions also permit interactions between two variables, such as X0 ∗ X1, or indicator variables, such as $$ {1}_{\left\{{X}_0>{x}_0\right\}} $$. Additionally, in time series and panel settings, we can include lags of variables, such as $$ {X}_{t-1,j} $$ and $$ {X}_{t-2,j} $$.
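
As a minimal sketch of these transformations, the NumPy snippet below builds a design matrix that contains a logarithm, an interaction, an indicator, and a lag. The variable names, the simulated data, and the indicator threshold of 1.0 are illustrative assumptions rather than anything specified in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
x0 = rng.uniform(0.5, 2.0, size=n)    # strictly positive, so the log is defined
x1 = rng.normal(size=n)

log_x0 = np.log(x0)                    # non-linear transformation of a regressor
interaction = x0 * x1                  # interaction term X0 * X1
indicator = (x0 > 1.0).astype(float)   # indicator 1{X0 > x0}, hypothetical threshold 1.0
lag_x1 = x1[:-1]                       # value of X1 observed one period earlier

# Dropping the first observation aligns every column with the lagged series.
design = np.column_stack([
    np.ones(n - 1), log_x0[1:], interaction[1:], indicator[1:], lag_x1
])
print(design.shape)
```

Every column is a fixed function of the data, so a regression on this matrix is still linear in its coefficients.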

Transforming and re-defining variables makes linear regression a flexible method that can be used to approximate non-linear functions to an arbitrarily high degree of precision. For instance, consider the case where the true relationship between X and Y is given by the exponential function in Equation 3-5.

Equation 3-5. An exponential model.
$$ {Y}_i=\exp \left(\alpha +\beta {X}_i\right) $$

If we take the natural logarithm of Yi, we can perform the linear regression in Equation 3-6 to recover the model parameters, {α, β}.

Equation 3-6. A transformed exponential model.
$$ \ln \left({Y}_i\right)=\alpha +\beta {X}_i $$
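
The following sketch illustrates the log transformation numerically. It generates noiseless data from an exponential model and then regresses the natural logarithm of Y on a constant and X. The parameter values (0.5 and 0.2) and the sample are hypothetical, and NumPy's least squares routine stands in for whichever estimation tool is preferred.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha_true, beta_true = 0.5, 0.2         # hypothetical parameter values
x = rng.uniform(0.0, 10.0, size=200)
y = np.exp(alpha_true + beta_true * x)   # noiseless exponential model

# Regress ln(Y) on a constant and X.
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
alpha_hat, beta_hat = coef
print(alpha_hat, beta_hat)
```

Because there is no noise in this example, the regression recovers the parameters up to floating-point error.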

In most settings, we won’t know the underlying data generating process (DGP). Furthermore, there will not be a deterministic relationship between the dependent variable and independent variables. Rather, there will be some noise, ϵi, associated with each observation, which could arise as the result of unobserved, random differences across entities or measurement error.

As an example of this, let’s say that we have data drawn from a process that is known to be non-linear, but its exact functional form is unknown. Figure 3-1 shows a scatterplot of the data, along with plots of two linear regression models. The first is trained under the assumption that the relationship between X and Y is well-approximated over the [0, 10] interval using a single line, as in Equation 3-7. The second is trained under the assumption that five line segments are needed, as in Equation 3-8.

Equation 3-7. A linear approximation to a non-linear model.
$$ {Y}_i=\alpha +\beta {X}_i+{\epsilon}_i $$

Equation 3-8. A linear approximation to a non-linear relationship.
$$ {Y}_i=\left[{\alpha}_0+{\beta}_0{X}_i\right]{1}_{\left\{0\le {X}_i<2\right\}}+\dots +\left[{\alpha}_4+{\beta}_4\left({X}_i-8\right)\right]{1}_{\left\{8\le {X}_i\le 10\right\}}+{\epsilon}_i $$
Figure 3-1

Two linear approximations of a non-linear function

Figure 3-1 suggests that a linear regression model with a single slope and intercept is insufficient to capture the relationship; however, five line segments, which together form a piecewise linear approximation, track the non-linear function closely, even though we worked entirely within the framework of linear regression.
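
The comparison can be sketched in NumPy as follows. The true function behind Figure 3-1 is not given in the text, so the sine function, the noise level, and the sample size used here are assumptions chosen only for illustration. The five-segment model interacts a separate intercept and slope with each interval's indicator.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 10.0, size=2000)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)  # hypothetical non-linear DGP

def ols_fit(X, y):
    """Return OLS coefficients and the mean squared error of the fit."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return coef, np.mean(resid ** 2)

# Single line: one intercept and one slope.
X_line = np.column_stack([np.ones_like(x), x])
_, mse_line = ols_fit(X_line, y)

# Five segments: an intercept and a slope for each interval of width 2.
edges = np.array([0.0, 2.0, 4.0, 6.0, 8.0, 10.0])
segment = np.clip(np.digitize(x, edges) - 1, 0, 4)  # segment index 0..4
columns = []
for s in range(5):
    in_segment = (segment == s).astype(float)
    columns.append(in_segment)                    # segment-specific intercept
    columns.append(in_segment * (x - edges[s]))   # segment-specific slope
X_pieces = np.column_stack(columns)
_, mse_pieces = ols_fit(X_pieces, y)

print(mse_line, mse_pieces)
```

Both models are estimated with the same least squares routine; only the columns of the design matrix differ.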

Ordinary Least Squares (OLS)

Linear regression, as we have seen, is a versatile method that can be used to model the relationship between a dependent variable and a set of independent variables. Even when that relationship is non-linear, we saw that it was possible to approximate it in a linear model using indicator functions, variable interactions, or variable transformations. In some cases, we were even able to capture it exactly through a variable transformation.

In this section, we’ll discuss how to implement a linear regression in TensorFlow. The way in which we do this will depend on our choice of loss function. In economics, the most common loss function is the sum or mean of the squared errors, which we will consider first. For the purpose of this example, we will stack all of the independent variables in an n × k matrix, X, where n is the number of observations and k is the number of independent variables, including the constant (bias) term.

We will let $$ \hat{\beta} $$ denote the vector of estimated coefficients on the independent variables, which we distinguish from the true parameter values, β. The “error” term that we will use to construct our loss function is given in Equation 3-9. It will often be referred to by different names, such as error, residual, or disturbance term.

Equation 3-9. The disturbance term from a linear regression.
$$ \epsilon =Y-X\hat{\beta} $$

Note that ϵ is an n-element column vector. This means that we can compute the sum of its squared elements by pre-multiplying it by its transpose, as in Equation 3-10, which gives us the sum of squared errors.

Equation 3-10. The sum of squared errors.
$$ {\epsilon}^{\prime}\epsilon ={\left(Y-X\hat{\beta}\right)}^{\prime}\left(Y-X\hat{\beta}\right) $$

One of the benefits of using the sum of squared errors as a loss function, also called performing “ordinary least squares” (OLS), is that it permits an analytical solution, as derived in Equation 3-11, which means that we do not need to use time-consuming and error-prone optimization algorithms. We obtain this solution by choosing $$ \hat{\beta} $$ to minimize the sum of squared errors.

Equation 3-11. Minimizing the sum of squared errors.
$$ \frac{\partial {\epsilon}^{\prime}\epsilon }{\partial \hat{\beta}}=\frac{\partial }{\partial \hat{\beta}}{\left(Y-X\hat{\beta}\right)}^{\prime}\left(Y-X\hat{\beta}\right)=0 $$
$$ -2{X}^{\prime }Y+2{X}^{\prime }X\hat{\beta}=0 $$
$$ {X}^{\prime }X\hat{\beta}={X}^{\prime }Y $$
$$ \hat{\beta}={\left({X}^{\prime }X\right)}^{-1}{X}^{\prime }Y $$

The only thing left to check is whether $$ \hat{\beta} $$ minimizes or maximizes the sum of squared errors. It will be a minimum whenever X has full column rank, which also ensures that X′X is invertible. This will hold if no column of X is a linear combination of one or more other columns of X. Listing 3-1 provides a demonstration of how we can perform ordinary least squares (OLS) in TensorFlow for a toy problem.
import tensorflow as tf
# Define the data as constants.
X = tf.constant([[1, 0], [1, 2]], tf.float32)
Y = tf.constant([[2], [4]], tf.float32)
# Compute vector of parameters.
XT = tf.transpose(X)
XTX = tf.matmul(XT,X)
beta = tf.matmul(tf.matmul(tf.linalg.inv(XTX),XT),Y)
Listing 3-1

Implementation of OLS in TensorFlow 2

For convenience, we have defined the transpose of X as XT. We have also defined XTX as XT post-multiplied by X. We can compute $$ \hat{\beta} $$ by inverting XTX, post-multiplying by XT, and then post-multiplying the result by Y.
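
As a cross-check on the same toy data, the sketch below uses NumPy (an assumption made so the example runs without TensorFlow) to verify the rank condition, solve the normal equations directly instead of forming the inverse, and confirm that the first-order condition holds at the solution.

```python
import numpy as np

X = np.array([[1.0, 0.0], [1.0, 2.0]])
Y = np.array([[2.0], [4.0]])

# Full column rank is what makes X'X invertible.
rank = np.linalg.matrix_rank(X)

# Solve the normal equations X'X beta = X'Y without forming the inverse.
beta = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta.ravel())  # → [2. 1.]

# At the solution, the gradient -2X'Y + 2X'X beta is zero.
gradient = -2 * X.T @ Y + 2 * X.T @ X @ beta
```

Solving the linear system is usually preferred numerically to explicit inversion. In TensorFlow, tf.linalg.solve() plays the same role as np.linalg.solve() here.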

The parameter vector we’ve computed, $$ \hat{\beta} $$, minimizes the sum of squared errors. While computing $$ \hat{\beta} $$ was simple, it might be unclear why we would want to use TensorFlow for such a task. If we had instead used MATLAB, the syntax for writing the linear algebra operations would have been compact and readable. Alternatively, if we had used Stata or any statistics module in Python or R, we’d be able to automatically compute standard errors and confidence intervals for the vector of parameters, as well as measures of fit for the regression.

TensorFlow does, of course, have natural advantages if a task requires parallel or distributed computing; however, the need for this is likely to be minor when performing OLS analytically. The value of TensorFlow will become apparent when we want to minimize a loss function that doesn’t have an analytical solution or when we cannot hold all of the data in memory.

Least Absolute Deviations (LAD)

While OLS is the most commonly used form of linear regression in economics and has many attractive properties, we will sometimes want to use an alternative loss function. We may, for instance, want to minimize the sum of the absolute values of the errors, rather than the sum of the squares. This form of linear regression is referred to as Least Absolute Deviations (LAD) or Least Absolute Errors (LAE).

For all models, including OLS and LAD, the sensitivity of parameter estimates to outliers is driven by the loss function. Since OLS minimizes the squares of errors, it places a high emphasis on setting parameter values to explain outliers. That is, OLS will place a greater emphasis on eliminating a single large error than it will on two errors half of its size. By contrast, LAD would place equal weight on the large error and the two smaller errors.
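
A short calculation makes the comparison concrete. The error sizes below (one error of 2 versus two errors of 1) are hypothetical numbers chosen for illustration.

```python
# Scenario A: one error of size 2. Scenario B: two errors of size 1.
one_large = [2.0]
two_small = [1.0, 1.0]

def sum_squared(errors):
    """Contribution to the OLS loss."""
    return sum(e ** 2 for e in errors)

def sum_absolute(errors):
    """Contribution to the LAD loss."""
    return sum(abs(e) for e in errors)

print(sum_squared(one_large), sum_squared(two_small))    # → 4.0 2.0
print(sum_absolute(one_large), sum_absolute(two_small))  # → 2.0 2.0
```

Under squared loss the single large error costs twice as much as the two small ones combined, whereas under absolute loss the two scenarios cost the same.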

Another difference between OLS and LAD is that we cannot express the solution to a LAD regression analytically, since the absolute value prevents us from obtaining a closed-form algebraic expression. This means we must search for the minimum by “training” or “estimating” the model.

While TensorFlow wasn’t particularly useful for solving OLS, it has clear advantages when performing a LAD regression or training another type of model that has no analytical solution. We’ll see how to do this in TensorFlow and also evaluate how accurately TensorFlow identifies the true parameter values at the same time. More specifically, we’ll perform a Monte Carlo experiment, where we randomly generate data under certain assumed parameter values. We’ll then use the data to estimate the model, allowing us to compare the true and estimated parameters.

Listing 3-2 shows how the data is generated. We start by defining the number of observations and number of samples. Since we want to evaluate TensorFlow’s performance, we’ll train the model parameters on 100 separate samples. We’ll also use 10,000 observations to ensure that there is sufficient data to train the model.

Next, we define the true values of the model parameters, alpha and beta, which correspond to the constant (bias) term and the slope. We set the constant term to 1.0 and the slope to 3.0. Since these are the true values of the parameters and do not need to be trained, we will use tf.constant() to define them.

We now draw X and epsilon from normal distributions. For X, we use a standard normal distribution, which has a mean of 0 and a standard deviation of 1. These are the default parameter values for tf.random.normal(), so we do not need to specify anything beyond the number of samples and observations. For epsilon, we use a standard deviation of 0.25, which we specify using the stddev parameter. Finally, we compute the dependent variable, Y.

We can now use the generated data to train the model using LAD. There are a few steps we will need to complete, which are common to all model construction and training processes in TensorFlow. We’ll first illustrate them using an example that makes use of only the first sample of randomly drawn data. We’ll then repeat the process for each of the 100 samples.
import tensorflow as tf
# Set number of observations and samples
S = 100
N = 10000
# Set true values of parameters.
alpha = tf.constant([1.], tf.float32)
beta = tf.constant([3.], tf.float32)
# Draw independent variable and error.
X = tf.random.normal([N, S])
epsilon = tf.random.normal([N, S], stddev=0.25)
# Compute dependent variable.
Y = alpha + beta*X + epsilon
Listing 3-2

Generate input data for a linear regression

Listing 3-3 provides the code for the first step in the model training process in TensorFlow. We first draw values from a normal distribution with a mean of 0 and a standard deviation of 5.0 and then use them to initialize alphaHat and betaHat. The choice of 5.0 is arbitrary, but is intended to emulate a problem in which we have limited prior knowledge about the true parameter values. We use the suffix “Hat” to indicate that these are not the true values, but estimates. Since we want to train the parameters to minimize the loss function, we will define them using tf.Variable(), rather than tf.constant().

The next step is to define a function to compute the loss. A LAD regression minimizes the sum of absolute errors, which is equivalent to minimizing the mean absolute error. We will minimize the mean absolute error, since this has better numerical properties.1

To compute the mean absolute error, we define a function called maeLoss, which takes the parameters and data as inputs and outputs the associated value of the loss function. The function first computes the error for each observation. It then transforms these values to their absolute values using tf.abs() and returns the mean across all observations using tf.reduce_mean().
# Draw initial values randomly.
alphaHat0 = tf.random.normal([1], stddev=5.0)
betaHat0 = tf.random.normal([1], stddev=5.0)
# Define variables.
alphaHat = tf.Variable(alphaHat0, dtype=tf.float32)
betaHat = tf.Variable(betaHat0, dtype=tf.float32)
# Define function to compute MAE loss.
def maeLoss(alphaHat, betaHat, xSample, ySample):
        prediction = alphaHat + betaHat*xSample
        error = ySample - prediction
        absError = tf.abs(error)
        return tf.reduce_mean(absError)
Listing 3-3

Initialize variables and define the loss

The final step is to perform optimization, which we do in Listing 3-4. To do this, we’ll first create an instance of the stochastic gradient descent optimizer named opt using tf.optimizers.SGD(). We’ll then use that instance to perform minimization by applying its minimize() method. To perform a single step of optimization over the entire sample, we pass minimize() a lambda function that returns the loss. Inside the lambda function, we call maeLoss() with the parameters, alphaHat and betaHat, and the first sample of input data, X[:,0] and Y[:,0]. Finally, we also need to pass a list of trainable variables, var_list, to minimize(). Each iteration of the loop performs a minimization step, which updates the parameters and the state of the optimizer. In this example, we have repeated the minimization step 1000 times.
# Define optimizer.
opt = tf.optimizers.SGD()
# Define empty lists to hold parameter values.
alphaHist, betaHist = [], []
# Perform minimization and retain parameter updates.
for j in range(1000):
        # Perform minimization step.
        opt.minimize(lambda: maeLoss(alphaHat, betaHat,
        X[:,0], Y[:,0]), var_list = [alphaHat,
        betaHat])
        # Update list of parameters.
        alphaHist.append(alphaHat.numpy()[0])
        betaHist.append(betaHat.numpy()[0])
Listing 3-4

Define an optimizer and minimize the loss function
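
To make the mechanics of a minimization step more concrete, the sketch below writes out plain gradient descent on the mean absolute error by hand in NumPy. It is not the TensorFlow implementation: the optimizer computes gradients automatically, whereas here the subgradient of the absolute value (the sign of the error) is coded explicitly. The sample size, starting values, and step size are assumptions chosen so that the example runs quickly.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
x = rng.normal(size=n)
y = 1.0 + 3.0 * x + rng.normal(scale=0.25, size=n)  # same true values as the text

alpha_hat, beta_hat = 0.0, 0.0   # hypothetical starting values
learning_rate = 0.05             # hypothetical step size

for _ in range(1500):
    error = y - (alpha_hat + beta_hat * x)
    # Subgradient of mean(|error|) with respect to each parameter.
    grad_alpha = -np.mean(np.sign(error))
    grad_beta = -np.mean(np.sign(error) * x)
    # Gradient descent update: move against the gradient.
    alpha_hat -= learning_rate * grad_alpha
    beta_hat -= learning_rate * grad_beta

print(alpha_hat, beta_hat)
```

Each pass through the loop uses the full sample, so it corresponds to one epoch in the sense used in this section.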

Before we repeat the process for the remaining 99 samples, let’s see how successful we were in identifying the true parameter values in the first. Figure 3-2 shows a plot of the values of alphaHat and betaHat at each step in the minimization process. The code for generating this plot is shown in Listing 3-5. Notice that we did not divide the sample into mini-batches, so each step is labeled as an epoch, where an epoch is a complete pass over the sample. The initial values, as we saw earlier, were randomly generated by drawing from a normal distribution with a high variance. Nevertheless, both alphaHat and betaHat appear to converge to their true parameter values after approximately 600 epochs.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Define DataFrame of parameter histories, one column per parameter.
params = pd.DataFrame(np.column_stack([alphaHist,
        betaHist]), columns = ['alphaHat', 'betaHat'])
# Generate plot.
params.plot(figsize=(10,7))
# Set x axis label.
plt.xlabel('Epoch')
# Set y axis label.
plt.ylabel('Parameter Value')
Listing 3-5

Plot the parameter training histories

Furthermore, alphaHat and betaHat do not appear to adjust any further after they converge on their true parameter values. This suggests that the training process was stable and the stochastic gradient descent algorithm, which we will discuss in detail later in the chapter, was able to identify a clear local minimum, which turned out to be the global minimum in this case.2

Now that we’ve tested the solution method for one sample, we’ll repeat the process 100 times with different initial parameter values and different samples. We’ll then evaluate the performance of our solution method to determine whether it is sensitive to the choice of initial values or the data sample drawn. Figure 3-3 shows a histogram of the parameter value estimates at the 1000th epoch for each sample. Most estimates appear to be tightly clustered around the true parameter values; however, there are some deviations, due to either the initial values or the sample drawn. If we were planning to use LAD on a dataset with attributes similar to what we’ve generated in the Monte Carlo experiment, we might want to consider using a higher number of epochs to increase the probability that we converge to the true parameter values.
Figure 3-2

History of parameter values over 1000 epochs of training

Beyond changing the number of epochs, we may also want to consider adjusting the optimization algorithm’s hyperparameters, rather than using the default options. Alternatively, we might consider using a different optimization algorithm altogether. As we will discuss later in the chapter, this is relatively simple to do in TensorFlow.
Figure 3-3

Parameter estimate counts from Monte Carlo experiment
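
The text does not list the code for the full Monte Carlo experiment, so the following is only one possible sketch of it, written in NumPy with hand-coded subgradients instead of a TensorFlow optimizer. It updates one pair of parameters per sample simultaneously by broadcasting across columns. The sample dimensions, step size, and number of steps are assumptions chosen to keep the run time short.

```python
import numpy as np

rng = np.random.default_rng(4)
n_obs, n_samples = 1000, 50   # smaller than the text's 10,000 x 100, for speed
x = rng.normal(size=(n_obs, n_samples))
y = 1.0 + 3.0 * x + rng.normal(scale=0.25, size=(n_obs, n_samples))

# One pair of starting values per sample, drawn with a standard deviation of 5.
alpha_hat = rng.normal(scale=5.0, size=n_samples)
beta_hat = rng.normal(scale=5.0, size=n_samples)
learning_rate = 0.05          # hypothetical step size

for _ in range(2000):
    error = y - (alpha_hat + beta_hat * x)   # broadcasts across columns
    sign = np.sign(error)
    grad_alpha = -sign.mean(axis=0)          # one gradient per sample
    grad_beta = -(sign * x).mean(axis=0)
    alpha_hat -= learning_rate * grad_alpha
    beta_hat -= learning_rate * grad_beta

# Summarize the distribution of estimates across samples.
print(alpha_hat.mean(), alpha_hat.std())
print(beta_hat.mean(), beta_hat.std())
```

The arrays alpha_hat and beta_hat hold the final estimate for each sample and could be passed to a histogram routine to produce a figure in the spirit of Figure 3-3.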

Other Loss Functions

As we discussed, OLS has an analytical solution, but LAD does not. Since most machine learning models do not permit an analytical solution, LAD can provide an instructive example. The same process we used to construct a model, define a loss function, and perform minimization for LAD will be repeated throughout the chapter and book. Indeed, the steps used to perform LAD can be applied to any form of linear regression by simply modifying the loss function.

There are, of course, reasons to favor OLS beyond the fact that it has a closed-form solution. For instance, if the conditions for the Gauss-Markov Theorem are satisfied, then the OLS estimator has the lowest variance among all linear and unbiased estimators.3 There is also a large econometric literature which builds on OLS and its variants, making it a natural choice for related work.

In many machine learning applications within economics and finance, however, the objective will often be to perform prediction, rather than hypothesis testing. In those cases, it may make sense to use a different form of linear regression; and using TensorFlow will make this task easier.
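
The idea that only the loss function needs to change can be sketched as follows. The example is written in NumPy, and the Huber loss is included purely as an illustration of a third option; it is not discussed in the text. The data and parameter values are hypothetical.

```python
import numpy as np

def mse_loss(error):
    """Mean squared error, the OLS objective."""
    return np.mean(error ** 2)

def mae_loss(error):
    """Mean absolute error, the LAD objective."""
    return np.mean(np.abs(error))

def huber_loss(error, delta=1.0):
    """Quadratic for small errors, linear for large ones."""
    small = np.abs(error) <= delta
    return np.mean(np.where(small, 0.5 * error ** 2,
                            delta * (np.abs(error) - 0.5 * delta)))

def linear_model_loss(alpha_hat, beta_hat, x, y, loss_fn):
    """Evaluate any error-based loss for the same linear model."""
    return loss_fn(y - (alpha_hat + beta_hat * x))

x = np.array([0.0, 1.0, 2.0])
y = np.array([1.0, 1.0, 5.5])
for loss_fn in (mse_loss, mae_loss, huber_loss):
    print(loss_fn.__name__, linear_model_loss(0.0, 3.0, x, y, loss_fn))
```

The model and the training loop stay the same; swapping the function passed as loss_fn changes which errors the estimator emphasizes.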

Partially Linear Models

In many machine learning applications, we will want to model non-linearities in a way that cannot be satisfactorily achieved using a linear regression model, even with the strategies we outlined earlier. This will require us to use a different modeling technique. In this section, we’ll expand the linear model to allow for the inclusion of a non-linear function.

Rather than constructing a purely non-linear model, we’ll start with what’s called a “partially linear model.” Such a model allows for certain independent variables to enter linearly, while others are permitted to enter the model through a non-linear function.

In the context of standard econometric applications, where the objective is typically statistical inference, a partially linear model would usually consist of a single variable of interest, which enters linearly, and a set of controls, which is permitted to enter non-linearly. The objective of such an exercise would be to perform inference on the parameter that enters linearly.

There are, however, econometric challenges to performing valid statistical inference with partially linear models. First, there is an issue with parameter consistency when the variable of interest and the controls are collinear.4 This is addressed in Robinson (1988), which constructs a consistent estimator for such cases.5 Another issue arises when we apply regularization to the non-linear function of controls. If we simply apply the estimator from Robinson (1988), the parameter of interest will be biased. Chernozhukov et al. (2017) demonstrate how to eliminate bias through the use of orthogonalization and sample splitting.

For the purposes of this chapter, we will focus exclusively on the construction and training of a partially linear model for predictive purposes, rather than for statistical inference. In doing so, we will sidestep questions related to consistency and bias and focus on the practical implementation of a training routine in TensorFlow.

We’ll start by defining the model we wish to train in Equation 3-12. Here, β is the vector of coefficients that enter the model linearly, and g(Z) is a non-linear function of the controls.

Equation 3-12. A partially linear model.
$$ Y=\alpha +\beta X+g(Z)+\epsilon $$

Similar to the example for LAD, we’ll use a Monte Carlo experiment to evaluate whether we’ve correctly constructed and trained the model in TensorFlow and also to determine whether we are likely to encounter numerical issues, given our sample size and model specification.

In order to perform the Monte Carlo experiment, we’ll need to make specific assumptions about the values of the linear parameters, as well as the functional form of g(). For the sake of simplicity, we’ll assume that there is only one variable of interest, X, and one control, Z, which enters with the functional form exp(θZ). Additionally, the true parameter values are assumed to be α = 1, β = 3, and θ = 0.05.

We’ll start the Monte Carlo experiment in Listing 3-6 by generating data. As in the previous example, we’ll use 100 samples and 10,000 observations and define the true parameter values using tf.constant(). Next, we’ll draw realizations of the regressors, X and Z, and the error term, epsilon. Finally, we use the randomly generated data to construct the dependent variable, Y.
import tensorflow as tf
# Set number of observations and samples
S = 100
N = 10000
# Set true values of parameters.
alpha = tf.constant([1.], tf.float32)
beta = tf.constant([3.], tf.float32)
theta = tf.constant([0.05], tf.float32)
# Draw independent variable and error.
X = tf.random.normal([N, S])
Z = tf.random.normal([N, S])
epsilon = tf.random.normal([N, S], stddev=0.25)
# Compute dependent variable.
Y = alpha + beta*X + tf.exp(theta*Z) + epsilon
Listing 3-6

Generate data for partially linear regression experiment

The next step, shown in Listing 3-7, is to define and initialize the model parameters, alphaHat, betaHat, and thetaHat, using the randomly drawn initial values alphaHat0, betaHat0, and thetaHat0. We then deviate slightly from the previous example: rather than computing the loss function immediately, we’ll first define a function for the partially linear model, which takes the parameters and a sample of the data as inputs and then outputs a prediction for each observation.
# Draw initial values randomly.
alphaHat0 = tf.random.normal([1], stddev=5.0)
betaHat0 = tf.random.normal([1], stddev=5.0)
thetaHat0 = tf.random.normal([1], mean=0.05,
            stddev=0.10)
# Define variables.
alphaHat = tf.Variable(alphaHat0, dtype=tf.float32)
betaHat = tf.Variable(betaHat0, dtype=tf.float32)
thetaHat = tf.Variable(thetaHat0, dtype=tf.float32)
# Compute prediction.
def plm(alphaHat, betaHat, thetaHat, xS, zS):
        prediction = (alphaHat + betaHat*xS +
                        tf.exp(thetaHat*zS))
        return prediction
Listing 3-7

Initialize variables and define the partially linear model

We’ve now generated the data, initialized the parameters, and defined the partially linear model. The next step is to define a loss function, which we do in Listing 3-8. As with the previous examples, we can use whichever loss function is best suited to our problem. In this case, we’ll use the mean absolute error (MAE). Additionally, rather than computing the MAE ourselves, as we did previously, we’ll instead use a TensorFlow operation. The first argument to the tf.losses.mae() operation is an array of true values and the second is an array of predicted values.
# Define function to compute MAE loss.
def maeLoss(alphaHat, betaHat, thetaHat, xS, zS, yS):
        yHat = plm(alphaHat, betaHat, thetaHat, xS, zS)
        return tf.losses.mae(yS, yHat)
Listing 3-8

Define a loss function for a partially linear regression

The final step is to perform minimization, which we do in Listing 3-9. As in the LAD example, we’ll do this by instantiating an optimizer and then applying the minimize method. Each time we execute the minimize method, we’ll complete an entire epoch of training.
# Instantiate optimizer.
opt = tf.optimizers.SGD()
# Perform optimization.
for i in range(1000):
        opt.minimize(lambda: maeLoss(alphaHat, betaHat,
        thetaHat, X[:,0], Z[:,0], Y[:,0]),
        var_list = [alphaHat, betaHat, thetaHat])
Listing 3-9

Train a partially linear regression model
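
For intuition about what the optimizer does with the non-linear term, the sketch below writes out the gradient descent updates for the partially linear model by hand in NumPy. It is an illustration under stated assumptions, not the TensorFlow routine: the sample size, starting values of zero, and step size are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2000
x, z = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 3.0 * x + np.exp(0.05 * z) + rng.normal(scale=0.25, size=n)

alpha_hat, beta_hat, theta_hat = 0.0, 0.0, 0.0   # hypothetical starting values
learning_rate = 0.05                              # hypothetical step size

for _ in range(2000):
    nonlinear = np.exp(theta_hat * z)
    sign = np.sign(y - (alpha_hat + beta_hat * x + nonlinear))
    # The gradient of mean(|error|) is -mean(sign * d prediction / d parameter),
    # so descending the gradient adds learning_rate * mean(sign * derivative).
    alpha_hat += learning_rate * np.mean(sign)
    beta_hat += learning_rate * np.mean(sign * x)
    # Chain rule: d exp(theta*z) / d theta = z * exp(theta*z).
    theta_hat += learning_rate * np.mean(sign * z * nonlinear)

print(alpha_hat, beta_hat, theta_hat)
```

The only difference from the linear case is the derivative with respect to theta, which depends on the current value of theta through the exponential term.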

After the optimization process terminates, we can evaluate the results, as we did for the LAD example. Figure 3-4 shows the history of parameter value estimates over 1000 epochs of training. Notice that alphaHat, betaHat, and thetaHat all converge to their true values after approximately 800 epochs of training. Additionally, they do not appear to diverge from their true values as the training process continues.
Figure 3-4

History of parameter values over 1000 epochs of training

In addition to this, we’ll also examine the estimates for all 100 samples to see how sensitive the results are to the initialization and data. The final epoch parameter values for each sample are visualized in histograms in Figure 3-5. From the figure, it is clear that estimates of both alphaHat and betaHat are tightly clustered around their respective true values. While thetaHat appears unbiased, since the histogram is centered around the true value of theta, there appears to be more variation in the estimates. This suggests that we may want to make adjustments to the training process, possibly by using a higher number of epochs.

Performing a LAD regression and a partially linear regression demonstrated that TensorFlow is capable of handling the construction and training of an arbitrary model, including those that contain non-linearities. In the following section, we’ll see that TensorFlow can also handle discrete dependent variables. We’ll then complete the chapter by discussing the various ways in which we can adjust the training process to improve results.
Figure 3-5

Monte Carlo experiment results for partially linear regression

Non-linear Regression

In the previous section, we discussed partially linear models, which had both a linear and non-linear component. Solving a fully non-linear model can be accomplished using the same workflow as the partially linear model. We first generate or load the data. Next, we define the model and loss function. And finally, we instantiate an optimizer and perform minimization of the loss function.

Rather than using generated data, as we did in earlier examples, we’ll make use of the natural logarithm of the daily exchange rate between the US dollar (USD) and the British pound (GBP), which is shown in Figure 3-6.6
Figure 3-6

Natural logarithm of the USD-GBP exchange rate at a daily frequency (1970–2020). Source: Federal Reserve Board of Governors

Since exchange rates are challenging to predict, a random walk is often used as the benchmark model in forecasting exercises. As shown in Equation 3-13, a random walk models the next period’s exchange rate as the current period’s exchange rate plus some random noise.

Equation 3-13. A random walk model of the nominal exchange rate.
$$ {e}_t=\alpha +{e}_{t-1}+{\epsilon}_t $$
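
A simulated example may help fix ideas. The sketch below generates a random walk in NumPy and confirms that the one-step forecast errors of the random walk model are just the shocks. The zero drift, the shock standard deviation of 0.01, and the series length are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(6)
alpha = 0.0                                   # drift term; zero is a hypothetical choice
shocks = rng.normal(scale=0.01, size=1000)    # epsilon_t

# e_t = alpha + e_{t-1} + epsilon_t, starting from a level of zero before the first shock.
e = np.cumsum(alpha + shocks)

# The random walk forecast of e_t is alpha + e_{t-1}, so forecast errors equal the shocks.
forecast = alpha + e[:-1]
forecast_error = e[1:] - forecast
print(np.allclose(forecast_error, shocks[1:]))
```

Since the shocks are unpredictable by construction, no function of past values can improve on this forecast for a true random walk, which is why it serves as a demanding benchmark.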

A line of literature that emerged in the 1990s argued that threshold autoregressive (TAR) models could generate improvements over the random walk model. Several variants of such models were proposed, including Smooth Transition Autoregressive (STAR) models and Exponential Smooth Transition Autoregressive (ESTAR) models.7

Our exercise will focus on implementing a TAR model in TensorFlow and will deviate from the literature by, among other things, using the nominal, rather than real, exchange rate. Additionally, we will again abstract away from questions related to statistical inference by focusing on prediction.

An autoregressive model assumes that movements in a series are explained by past values of the series and noise. A random walk, for instance, is an autoregressive model of order one – since it contains a single lag – that has an autoregressive parameter of one. The autoregressive parameter is the coefficient on the lagged value of the dependent variable.

A TAR model modifies an autoregression by allowing parameter values to vary according to pre-defined thresholds. That is, parameters are assumed to be fixed within a particular regime, but may vary across regimes. We’ll use the regimes given in Equation 3-14. If there’s a sharp depreciation of more than 2%, then we’re in one regime, associated with one autoregressive parameter value. Otherwise, we’re in another.

Equation 3-14. A threshold autoregressive (TAR) model with two regimes.
$$ {e}_t=\left\{\begin{array}{ll}{\rho}_0{e}_{t-1}+{\epsilon}_t, & {e}_{t-1}-{e}_{t-2}<-0.02\\ {\rho}_1{e}_{t-1}+{\epsilon}_t, & {e}_{t-1}-{e}_{t-2}\ge -0.02\end{array}\right. $$
Our first step in the TensorFlow implementation will be to prepare the data. In order to do this, we’ll need to load the log of the nominal exchange rate, compute a lag, and compute a lagged first difference. We’ll load and transform the data in pandas and numpy. We’ll then convert them into tf.constant() objects. For the threshold variable, we’ll also need to change its type from a Boolean to 32-bit floating-point number. All steps are shown in Listing 3-10.
import pandas as pd
import numpy as np
import tensorflow as tf
# Define data path.
data_path = '../data/chapter3/'
# Load data.
data = pd.read_csv(data_path+'exchange_rate.csv')
# Convert log exchange rate to numpy array.
e = np.array(data["log_USD_GBP"])
# Identify exchange decreases greater than 2%.
de = tf.cast(np.diff(e[:-1]) < -0.02, tf.float32)
# Define the lagged exchange rate as a constant.
le = tf.constant(e[1:-1], tf.float32)
# Define the exchange rate as a constant.
e = tf.constant(e[2:], tf.float32)
Listing 3-10

Prepare the data for a TAR model of the USD-GBP exchange rate
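
The slicing in Listing 3-10 is easy to misread, so the following sketch reproduces the same alignment on a toy series in plain Python. The five values in e_toy are made up for illustration and do not come from the exchange rate dataset; all variable names are likewise hypothetical.

```python
# Toy log exchange rate series (hypothetical values, for illustration only).
e_toy = [0.30, 0.31, 0.28, 0.285, 0.25]

# Drop the last observation, then take first differences: e[t-1] - e[t-2].
trimmed = e_toy[:-1]
lagged_diff = [b - a for a, b in zip(trimmed[:-1], trimmed[1:])]

# Regime dummy: 1.0 after a fall of more than 0.02, otherwise 0.0.
de_toy = [1.0 if d < -0.02 else 0.0 for d in lagged_diff]

# Lagged level e[t-1] and target e[t], aligned with the dummy.
le_toy = e_toy[1:-1]
target_toy = e_toy[2:]

for t, (y, lag, regime) in enumerate(zip(target_toy, le_toy, de_toy), start=2):
    print(f"t={t}: e_t={y}, e_(t-1)={lag}, sharp depreciation dummy={regime}")
```

All three lists have length n − 2, and the same position in each list refers to the same date t. This is why the first two observations are lost when the lagged first difference is constructed.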

Now that the data has been prepared, we’ll define the trainable model parameters, rho0Hat and rho1Hat, in Listing 3-11.
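# Define variables. The dtype is passed by keyword because the second
# positional argument of tf.Variable() is trainable, not dtype.
rho0Hat = tf.Variable(0.80, dtype=tf.float32)
rho1Hat = tf.Variable(0.80, dtype=tf.float32)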
Listing 3-11

Define parameters for a TAR model of the USD-GBP exchange rate

We next define both the model and the loss function in Listing 3-12. The model first multiplies each autoregressive coefficient by the lag of the exchange rate, le, which yields one prediction per regime. It then multiplies each prediction by a dummy variable for its regime, de or 1-de, so that only the active regime contributes to the final prediction. For the sake of simplicity, we’ll use the mean absolute error loss function, along with the TensorFlow operation for it.
# Define model.
def tar(rho0Hat, rho1Hat, le, de):
        # Compute regime-specific prediction.
        regime0 = rho0Hat*le
        regime1 = rho1Hat*le
        # Compute prediction for regime.
        prediction = regime0*de + regime1*(1-de)
        return prediction
# Define loss.
def maeLoss(rho0Hat, rho1Hat, e, le, de):
        ehat = tar(rho0Hat, rho1Hat, le, de)
        return tf.losses.mae(e, ehat)
Listing 3-12

Define model and loss function for TAR model of USD-GBP exchange rate
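
To see what the model and loss in Listing 3-12 compute, the sketch below evaluates the same formulas in plain Python on toy inputs. The numbers and the names tar_toy and mae_toy are assumptions made for illustration; the listing itself operates on TensorFlow tensors.

```python
def tar_toy(rho0, rho1, le, de):
    # The sharp depreciation regime (de == 1) uses rho0; the other uses rho1.
    return [rho0 * lag * d + rho1 * lag * (1.0 - d) for lag, d in zip(le, de)]

def mae_toy(e, ehat):
    # Mean absolute error between targets and predictions.
    return sum(abs(y - yhat) for y, yhat in zip(e, ehat)) / len(e)

# Hypothetical inputs: lagged rate, regime dummy, and target.
le_toy = [0.31, 0.28, 0.285]
de_toy = [0.0, 1.0, 0.0]
e_toy = [0.28, 0.285, 0.25]

ehat_toy = tar_toy(0.9, 1.0, le_toy, de_toy)
print(ehat_toy)                  # only the second prediction uses rho0
print(mae_toy(e_toy, ehat_toy))
```

Because de takes only the values 0 and 1, exactly one of the two terms is non-zero for each observation, so the sum selects the active regime rather than averaging across regimes.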

The final step is to define an optimizer and perform optimization, which we do in Listing 3-13.

Figure 3-7 shows the training history. The autoregressive parameter for the “normal” regime – where no sharp depreciation occurs the previous day – rapidly converges to approximately 1.0. This suggests that the exchange rate is best modeled as a random walk in normal times. However, when we look at cases where a sharp depreciation occurred the previous day, we instead find an autoregressive coefficient of 0.993, suggesting that the rate will be highly persistent, but will tend to drift back toward its mean, rather than remaining permanently lower.
# Define optimizer.
opt = tf.optimizers.SGD()
# Perform minimization.
for i in range(20000):
        opt.minimize(
                lambda: maeLoss(rho0Hat, rho1Hat, e, le, de),
                var_list=[rho0Hat, rho1Hat]
        )
Listing 3-13

Train TAR model of the USD-GBP exchange rate

Figure 3-7

Training history of the TAR model of the USD-GBP exchange rate

We’ve now seen how to perform linear regression with different loss functions, partially linear regression, and non-linear regression in TensorFlow. In the next section, we’ll examine another type of regression, which has a discrete dependent variable.

Logistic Regression

In machine learning, supervised learning models are typically divided into “regression” and “classification” categories based on whether they have a continuous or discrete dependent variable. As discussed earlier, we will use the definition of regression from econometrics, which also applies to classification models, such as a logistic regression.

A logistic regression or “logit” predicts the class of the dependent variable. In a microeconometric setting, a logit might be used to model the choice of transportation over two options. In a financial setting, it might be used to model whether we are in a crisis or not.

Since the process of constructing and training a logistic regression involves many of the same steps as linear, partially linear, and non-linear regression, we will focus exclusively on what differs.

First, the model takes a specific functional form – namely, that of the logistic curve – which is given in Equation 3-15.

Equation 3-15. The logistic curve.
$$ p(X)=\frac{1}{1+{e}^{-\left(\alpha +{\beta}_0{X}_0+\dots +{\beta}_k{X}_k\right)}} $$

Notice that the model’s output is a continuous probability, rather than a discrete outcome. Since probabilities range from 0 to 1, probabilities greater than 0.5 will often be treated as predictions of outcome 1. While this functional form differs from anything we’ve dealt with previously in this chapter, it can be handled using all of the same tools and operations in TensorFlow.
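
As a minimal sketch of Equation 3-15, the logistic curve can be evaluated in plain Python. The coefficient values and observations below are hypothetical and chosen only to show how the linear index maps into a probability and then into a predicted class under a 0.5 cutoff.

```python
import math

def logistic(alpha, betas, x):
    # Linear index alpha + beta_0*x_0 + ... passed through the logistic curve.
    z = alpha + sum(b * xj for b, xj in zip(betas, x))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients and observations.
alpha, betas = -1.0, [2.0, 0.5]
for x in ([0.0, 0.0], [0.5, 0.0], [1.0, 2.0]):
    p = logistic(alpha, betas, x)
    print(x, round(p, 3), int(p > 0.5))
```

An index of zero maps to a probability of exactly 0.5, so the 0.5 cutoff is equivalent to asking whether the linear index is positive.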

Finally, the other difference between a logistic model and those we’ve defined earlier in this chapter is that it will require a different loss function. Specifically, we will use the binary cross-entropy loss function, which is defined in Equation 3-16.

Equation 3-16. Binary cross-entropy loss function.
$$ \sum_i-\Big({Y}_i\ast \log \left(p\left({X}_i\right)\right)+\left(1-{Y}_i\right)\ast \log \left(1-p\left({X}_i\right)\right)\Big) $$

We use this particular functional form because the outcomes are discrete and the predictions are continuous. Note that the binary cross-entropy loss sums over the product of the outcome variable and the natural log of the predicted probability for each observation. If, for instance, the true class of Yi is 1 and the model predicts a 0.98 probability of class 1, then that observation will add 0.02 to the loss. If, instead, the prediction is 0.10, which is far from the true classification, then the addition to the loss will instead be 2.3.
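
The two contributions quoted above can be verified with a few lines of plain Python. The helper name bce_term is hypothetical; it evaluates a single term of the sum in Equation 3-16.

```python
import math

def bce_term(y, p):
    # Contribution of one observation to the binary cross-entropy sum.
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(round(bce_term(1, 0.98), 4))  # 0.0202
print(round(bce_term(1, 0.10), 4))  # 2.3026
```

When the true class is 1, only the first term is active, so the contribution reduces to the negative natural log of the predicted probability.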

While computing the binary cross-entropy loss function is relatively simple, TensorFlow simplifies it further by providing the operation tf.losses.binary_crossentropy(), which takes the true label as its first argument and the predicted probability as its second.

Loss Functions

Whenever we solve a model in TensorFlow, we will need to define a loss function. The minimization operation will make use of this function to determine how to adjust parameter values. Fortunately, it will not always be necessary to define a custom loss function. Rather, we will often be able to use one of the pre-defined loss functions provided by TensorFlow.

Loss functions can be accessed through two namespaces in TensorFlow: tf.losses and tf.keras.losses. In TensorFlow 2, tf.losses is an alias for tf.keras.losses, so both expose the same Keras implementations of the loss functions. Keras is a library for performing deep learning that is available as both a stand-alone module in Python and a high-level API in TensorFlow.

TensorFlow 2.3 offers 15 standard loss functions in the tf.losses submodule. Each of those loss functions takes the form tf.losses.loss_function(y_true, y_pred). That is, we pass the dependent variable, y_true, as the first argument and the model’s predictions, y_pred, as the second argument. It then returns the value of the loss function.

When we work with high-level APIs in TensorFlow in later chapters, we will make use of the loss functions directly. However, for the purpose of this chapter, which is centered around optimization using low-level TensorFlow operations, we will need to wrap those loss functions within a function of the model’s trainable parameters and data. The optimizer will need to make use of the outer function to perform minimization.

Discrete Dependent Variables

The submodule tf.losses offers three loss functions for discrete dependent variables in regression settings: tf.losses.binary_crossentropy(), tf.losses.categorical_crossentropy(), and tf.losses.sparse_categorical_crossentropy(). We have previously covered the binary cross-entropy function, which is used in logistic regression. This provides us with a measure of loss when we have a binary dependent variable, such as an indicator for whether the economy is in a recession, and a continuous prediction, such as a probability of being in a recession. For convenience, we repeat the formula for binary cross-entropy in Equation 3-17.

Equation 3-17. Binary cross-entropy loss function.
$$ L\left(Y,p(X)\right)=\sum_i-\Big({Y}_i\ast \log \left(p\left({X}_i\right)\right)+\left(1-{Y}_i\right)\ast \log \left(1-p\left({X}_i\right)\right)\Big) $$

The categorical cross-entropy loss is simply the extension of the binary cross-entropy loss to cases where the dependent variable has more than two categories. Such models are commonly used in discrete choice problems, such as a model of the decision to commute by subway, bicycle, car, or foot. Within machine learning, categorical cross-entropy is the standard loss function for classification problems with more than two classes and is commonly used in neural networks that perform image and text classification. The equation for categorical cross-entropy is given in Equation 3-18. Note that (Yi==k) is a binary variable equal to 1 if Yi is class k and 0 otherwise. Additionally, pk(Xi) is the probability that the model assigns to Xi being class k.

Equation 3-18. Categorical cross-entropy loss function.
$$ L\left(Y,p(X)\right)=-\sum_i\sum_k\left({Y}_i==k\right)\ast \log \left({p}_k\left({X}_i\right)\right) $$
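
Equation 3-18 can be sketched directly in plain Python as a double sum with an indicator. The labels, probabilities, and function name below are hypothetical and only illustrate the arithmetic; they are not TensorFlow's implementation.

```python
import math

def categorical_crossentropy_sum(labels, probs):
    # labels[i] is the integer class of observation i;
    # probs[i][k] is the predicted probability that observation i is class k.
    total = 0.0
    for y_i, p_i in zip(labels, probs):
        for k, p_ik in enumerate(p_i):
            indicator = 1.0 if y_i == k else 0.0
            total -= indicator * math.log(p_ik)
    return total

labels = [0, 2]  # hypothetical classes, e.g. 0 = subway and 2 = car
probs = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
print(categorical_crossentropy_sum(labels, probs))
```

Only the probability assigned to the true class enters the loss for each observation, so the result here equals −(log 0.7 + log 0.6).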

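Finally, the sparse categorical cross-entropy loss function computes the same loss as categorical cross-entropy, but expects the dependent variable to be encoded as integer class labels, rather than as one-hot vectors. Notice that both versions of the categorical cross-entropy loss assume that the dependent variable can have only one class. If a dependent variable may belong to multiple categories at once – that is, a “multi-label” problem – a common approach is instead to apply the binary cross-entropy loss to each label separately.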

Continuous Dependent Variables

For continuous dependent variables, the most common loss functions are the mean absolute error (MAE) and mean squared error (MSE). MAE is used in LAD and MSE in OLS. Equation 3-19 defines the MAE loss function, and Equation 3-20 defines the MSE loss. Recall that $$ hat{Y_i} $$ is the model’s predicted value for observation i.

Equation 3-19. Mean absolute error loss.
$$ L\left(Y,\hat{Y}\right)=\frac{1}{n}\sum_i\mid {Y}_i-\hat{Y_i}\mid $$
Equation 3-20. Mean squared error loss.
$$ L\left(Y,\hat{Y}\right)=\frac{1}{n}\sum_i{\left({Y}_i-\hat{Y_i}\right)}^2 $$

Note that we can compute the losses using tf.losses.mae() and tf.losses.mse().

Other common loss functions for linear regression include the mean absolute percentage error (MAPE), the mean squared logarithmic error (MSLE), and the Huber error, which are defined in Equations 3-21, 3-22, and 3-23. These are available as tf.losses.MAPE(), tf.losses.MSLE(), and tf.losses.Huber(), respectively. Note that tf.losses.Huber() is a class, so it must be instantiated before it is applied to the true and predicted values.

Equation 3-21. Mean absolute percentage error.
$$ L\left(Y,\hat{Y}\right)=100\ast \frac{1}{n}\sum_i\mid \left({Y}_i-\hat{Y_i}\right)/{Y}_i\mid $$
Equation 3-22. Mean squared logarithmic error.
$$ L\left(Y,\hat{Y}\right)=\frac{1}{n}\sum_i{\left(\log \left({Y}_i+1\right)-\log \left(\hat{Y_i}+1\right)\right)}^2 $$
Equation 3-23. Huber error.
$$ L\left(Y,\hat{Y}\right)=\left\{\begin{array}{ll}\frac{1}{2}{\left({Y}_i-\hat{Y_i}\right)}^2 & \text{for } \left|{Y}_i-\hat{Y_i}\right|\le \delta \\ \delta \left(\left|{Y}_i-\hat{Y_i}\right|-\frac{1}{2}\delta \right) & \text{otherwise}\end{array}\right. $$
Figure 3-8 provides a comparison of selected loss functions. For each loss function, the loss value is plotted against the error value. Notice that the MAE loss scales linearly in the error. To the contrary, the MSE loss increases slowly near zero, but grows much faster far away from zero, leading to the application of a substantial penalty on outliers. Finally, the Huber loss is similar to the MSE loss near zero, but similar to the MAE loss as the error increases in size.
Figure 3-8

Comparison of common loss functions
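
The pattern in the figure can also be tabulated for a few error values. The sketch below is plain Python rather than TensorFlow, uses hypothetical function names, and assumes the standard form of the Huber loss with δ = 1, in which the loss is quadratic within δ of zero and linear beyond it.

```python
def abs_loss(err):
    # Per-observation absolute error.
    return abs(err)

def sq_loss(err):
    # Per-observation squared error.
    return err ** 2

def huber_loss(err, delta=1.0):
    # Quadratic within delta of zero, linear beyond it.
    if abs(err) <= delta:
        return 0.5 * err ** 2
    return delta * (abs(err) - 0.5 * delta)

for err in (0.5, 1.0, 3.0):
    print(err, abs_loss(err), sq_loss(err), huber_loss(err))
```

For an error of 3.0, the squared loss is 9.0 while the Huber loss is 2.5, which illustrates how the Huber loss limits the penalty on outliers relative to the MSE.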

Optimizers

The last topic we’ll consider in this chapter is the use of optimizers in TensorFlow. We have already seen how optimizers work when we applied them in the context of linear regressions. In each case, we used the stochastic gradient descent (SGD) optimizer, which is simple and interpretable, but is less commonly used in more recent work on machine learning. In this section, we’ll expand the set of optimizers we discuss.

Stochastic Gradient Descent (SGD)

Stochastic gradient descent (SGD) is a minimization algorithm that updates parameter values through the use of the gradient. In this case, the gradient is a tensor of partial derivatives of the loss function with respect to each of the parameters.

The parameter update process is given in Equation 3-24. To ensure compatibility with the equivalent TensorFlow operation, we use the definition provided in the documentation. Note that θt is a vector of parameter values at iteration t, lr is the learning rate, and gt is the gradient computed in iteration t.

Equation 3-24. Stochastic gradient descent in TensorFlow.
$$ {\theta}_t={\theta}_{t-1}- lr\ast {g}_t $$

You might wonder in what sense SGD is “stochastic.” The stochasticity arises from the sampling process used to update the parameters: each update uses a gradient computed on a randomly drawn batch of observations. This differs from gradient descent, where the entire sample is used at each iteration. The benefits of the stochastic version of gradient descent are that it increases iteration speed and alleviates memory constraints.

Let’s take a look at a single SGD step for a linear regression with an intercept term and a single variable, where θt = [αt, βt]. We’ll start at iteration 0 and assume we’ve computed the gradient, g0, for the batch of data as [−0.25, 0.33]. Additionally, we’ll set the learning rate, lr, to 0.01. What does this imply for θ1? Using Equation 3-24, we can see that θ1 = [α0 + 0.0025, β0 − 0.0033]. That is, we increase α0 by 0.0025 and decrease β0 by 0.0033.
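
The single update step can be checked numerically in plain Python. The starting values in theta0 are hypothetical, since the text leaves α0 and β0 unspecified; the gradient and learning rate are taken from the example.

```python
lr = 0.01
g0 = [-0.25, 0.33]     # gradient with respect to [alpha, beta]
theta0 = [1.0, 2.0]    # hypothetical starting values [alpha0, beta0]

# theta_1 = theta_0 - lr * g_0, applied element by element.
theta1 = [th - lr * g for th, g in zip(theta0, g0)]
change = [t1 - t0 for t1, t0 in zip(theta1, theta0)]

print(theta1)
print(change)
```

The intercept moves up by lr × 0.25 = 0.0025 because its partial derivative is negative, and the slope moves down by lr × 0.33 = 0.0033 because its partial derivative is positive.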

Why do we increase a parameter value when the partial derivative is negative and decrease it when it is positive? Because the partial derivatives tell us how the loss function changes in response to a change in a given parameter. If the loss function is increasing in a parameter, then raising that parameter moves us further away from the minimum, so we want to change direction; however, if the loss function is decreasing, we’re moving toward a minimum, so we want to continue in the same direction. Furthermore, if the loss function is neither increasing nor decreasing, the gradient is zero, which means we’re at a local minimum or another stationary point, and the updates will naturally stop.

Figure 3-9 illustrates this for the partial derivative of the loss function with respect to the intercept term. We focus on a narrow window around the true value of the intercept and plot both the loss function and its derivative. We can see that the derivative is initially negative, but increases to 0 at the true value of the intercept. It then becomes positive and increasing thereafter.
Figure 3-9

Loss function and its derivative with respect to the intercept

Returning to Equation 3-24, notice that the selection of the learning rate can also be quite consequential. If we select a high learning rate, we’ll take larger steps with each iteration, which could bring us closer to a minimum faster. However, taking larger steps could also lead us to skip over the minimum, missing it entirely. The selection of the learning rate should take this trade-off into consideration.
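
The trade-off can be illustrated with a deliberately simple case: gradient descent on the loss f(x) = x², which has its minimum at zero. The function, starting value, and learning rates below are assumptions chosen for illustration, and the full gradient is used at each step rather than a sampled batch.

```python
def descend(lr, x0=1.0, steps=50):
    # Minimize f(x) = x**2, whose gradient is 2*x and whose minimum is at 0.
    x = x0
    for _ in range(steps):
        x = x - lr * 2.0 * x
    return x

small_step = descend(lr=0.1)   # each step shrinks x toward the minimum
large_step = descend(lr=1.1)   # each step overshoots and lands further away
print(small_step, large_step)
```

With the smaller learning rate, each update multiplies x by 0.8 and the sequence converges. With the larger one, each update multiplies x by −1.2, so the iterate jumps across the minimum and diverges.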

Finally, it is worth mentioning that the “minima” we’re identifying are local and, thus, may be higher than the global minimum. That is, SGD makes no distinction between the lowest point in an area and the lowest value of the loss function. Consequently, it may be worthwhile to re-run the algorithm for several different sets of initial parameter values to see if we always converge to the same minimum.
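
A small sketch of this sensitivity to initial values, using a hypothetical one-parameter loss with two minima. The loss function, learning rate, and starting points are assumptions chosen for illustration; the full gradient is used at each step.

```python
def loss(x):
    # A non-convex loss with a local and a global minimum.
    return x**4 - 3.0 * x**2 + x

def grad(x):
    # Derivative of the loss with respect to x.
    return 4.0 * x**3 - 6.0 * x + 1.0

def descend(x0, lr=0.01, steps=2000):
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

x_right = descend(2.0)    # converges to the minimum on the right
x_left = descend(-2.0)    # converges to the minimum on the left
print(x_right, loss(x_right))
print(x_left, loss(x_left))
```

Both runs end at a point where the gradient is approximately zero, but the loss is lower at the left minimum. Only by comparing runs from different starting values do we learn that the right minimum is merely local.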

Modern Optimizers

While SGD is easy to understand, it is rarely used in machine learning applications in its original form. This is because modern extensions typically offer more flexibility and robustness and perform better on benchmark tasks. The most common extensions of SGD are root mean square propagation (RMSProp), adaptive moment estimation (Adam), and adaptive gradient methods (Adagrad and Adadelta).

There are several advantages to using modern extensions of SGD. First, they allow for the application of separate learning rates to each parameter. In many optimization problems, there will be orders of magnitude differences between partial derivatives in the gradient. Consequently, applying a learning rate of 0.001, for instance, may be sensible for one parameter, but not for another. RMSProp allows us to overcome this problem by scaling each parameter’s update according to the recent magnitude of its own gradient. It also allows for the use of “momentum,” where the gradients accumulate over mini-batches, making it possible for the algorithm to break out of local minima.
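
The effect of per-parameter scaling can be sketched in plain Python. The update below is a simplified textbook form of RMSProp without momentum; it is an assumption made for illustration and may differ in details, such as the placement of the small constant eps, from the TensorFlow implementation. The gradient values are hypothetical.

```python
import math

def sgd_step(grads, lr):
    # Plain SGD: every parameter shares the same learning rate.
    return [lr * g for g in grads]

def rmsprop_step(grads, avg_sq, lr, rho=0.9, eps=1e-7):
    # Update the running average of squared gradients for each parameter,
    # then scale each gradient by the root of its own average.
    new_avg = [rho * v + (1.0 - rho) * g * g for v, g in zip(avg_sq, grads)]
    steps = [lr * g / (math.sqrt(v) + eps) for g, v in zip(grads, new_avg)]
    return steps, new_avg

grads = [100.0, 0.01]  # hypothetical partial derivatives, four orders apart
sgd_steps = sgd_step(grads, lr=0.001)
rms_steps, avg_sq = rmsprop_step(grads, [0.0, 0.0], lr=0.001)
print(sgd_steps)
print(rms_steps)
```

Under plain SGD, the two step sizes differ by the same four orders of magnitude as the gradients. Under the scaled update, both steps are of nearly equal size, so a single learning rate is workable for both parameters.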

Adagrad, Adadelta, and Adam all offer variants on the use of momentum and adaptive updates for each individual parameter. Adam tends to work well for many optimization problems with its default parameters. Adagrad is centered around the accumulation of gradients and the adaptation of learning rates for individual parameters. And Adadelta modifies Adagrad by introducing a window over which accumulated gradients are retained.8

In all cases, the use of optimizers will follow a familiar two-step process. We’ll first instantiate an optimizer and will set its parameter values in the process using the tf.optimizers submodule. And second, we’ll iteratively apply the minimize function and pass the loss function to it as a lambda function.

Since we have performed the second step multiple times, we’ll focus exclusively on the first step in Listing 3-14. There, we’ve instantiated SGD, RMSProp, Adagrad, Adadelta, and Adam optimizers and have emphasized how to set their respective parameter values.
# Instantiate optimizers.
sgd = tf.optimizers.SGD(learning_rate = 0.001,
        momentum = 0.5)
rms = tf.optimizers.RMSprop(learning_rate = 0.001,
        rho = 0.8, momentum = 0.9)
agrad = tf.optimizers.Adagrad(learning_rate = 0.001,
        initial_accumulator_value = 0.1)
adelt = tf.optimizers.Adadelta(learning_rate = 0.001,
        rho = 0.95)
adam = tf.optimizers.Adam(learning_rate = 0.001,
        beta_1 = 0.9, beta_2 = 0.999)
Listing 3-14

Instantiate optimizers

For SGD, we set the learning rate and the momentum. If we’re concerned that there are many local minima, we can increase momentum to a higher value. For RMSProp, we not only set a momentum parameter but also set rho, which is the rate at which information about the gradient decays. The Adadelta optimizer, which retains gradients for a period of time, also has the same decay parameter, rho. For Adagrad, we set an initial accumulator value, which is the starting value of the per-parameter accumulators in which gradient information builds up over time. Finally, for the Adam optimizer, we set decay rates for the accumulation of information about the mean and variance of the gradients. In this case, we have used the default values for the Adam optimizer, which generally perform well in large optimization problems.

We’ve now introduced the main optimizers we will use throughout the book. We will return to them again in detail when we apply them to train models. The modern variants of SGD will be particularly useful when we train large models with thousands of parameters.

Summary

The most commonly used empirical method in economics is the regression. In machine learning, the term regression refers to a supervised learning model with a continuous target. In economics, the term “regression” is more broadly defined and may refer to cases with binary or categorical dependent variables, such as logistic regression. For the purposes of this book, we adopt the economics terminology.

In this chapter, we introduced the concept of a regression, including the linear, partially linear, and non-linear varieties. We saw how to define and train such models in TensorFlow, which will ultimately form the basis for solving any arbitrary model in TensorFlow, as we will see in later chapters.

Finally, we discussed the finer details of the training process. We saw how to construct a loss function and what pre-defined loss functions were available in TensorFlow. We also saw how to perform minimization with a variety of different optimization routines.

Bibliography

Chernozhukov, V., D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. Robins. 2017. “Double/debiased machine learning for treatment and structural parameters.” The Econometrics Journal 21 (1).

Goodfellow, I., Y. Bengio, and A. Courville. 2017. Deep Learning. Cambridge, MA: MIT Press.

Robinson, P.M. 1988. “Root-N-Consistent Semiparametric Regression.” Econometrica 56 (4): 931–954.

Taylor, M.P., D.A. Peel, and L. Sarno. 2001. “Nonlinear Mean-Reversion in Real Exchange Rates: Toward a Solution to the Purchasing Power Parity Puzzles.” International Economic Review 42 (4): 1015–1042.
