Chapter 8. Regression with lagged explanatory variables

Most applications in finance are concerned with the analysis of time series data. However, most of the examples in Chapters 3 to 7 used cross-sectional data. This allowed us to build up the basic ideas underlying regression, including statistical concepts such as hypothesis testing and confidence intervals, in a simple manner. When working with time series variables, knowledge of such ideas is essential. However, some additional issues arise when working with time series data. The purpose of this chapter is to offer an introduction to these issues and to familiarize the reader with some concepts and notation used with time series models. After this introductory material, we take a first step in the direction of developing the models and methods that are used with financial time series.

The goal of the researcher working with time series data does not differ too much from that of the researcher working with cross-sectional data: both aim to develop a regression relating a dependent variable to some explanatory variables. However, the analyst using time series data will face two problems that the analyst using cross-sectional data will not encounter: (1) one time series variable can influence another with a time lag; and (2) if the variables are nonstationary, a problem known as spurious regression may arise.

At this stage, you are not expected to understand the second of these problems. The terms nonstationary, stationary and spurious regression will be discussed in detail in subsequent chapters of this book. But keep in mind this general rule: If you have nonstationary time series variables then you should not include them in a regression model. The appropriate route is to transform the variables before running a regression in order to make them stationary. There is one exception to this general rule, which we shall discuss later, and which occurs where the variables in a regression model are cointegrated. We will elaborate on what we mean by these terms later. If you find it confusing for them to be introduced now without definitions, just think in the following terms: Some problems arise with time series data that do not arise with cross-sectional data. These problems make it risky to naively use multiple regression in the manner of Chapters 4 to 7. The purpose of this and the following chapters is to show you how to correctly modify regression techniques with time series data.

In this chapter, we will assume all variables in the regression are stationary. The next chapter explains what this means. At this point, note only that the second problem will not occur and that we can therefore focus on the first problem.

The first problem can be understood intuitively with some simple examples. When we estimate a regression model we are interested in measuring the effect of one or more explanatory variables on the dependent variable. In the case of time series data we have to be very careful in our choice of explanatory variables since their effect on the dependent variable may take time to manifest itself.

For instance, in previous chapters we worked with cross-sectional regressions involving company data. In one example, our dependent variable was market capitalization and explanatory variables were company characteristics (e.g. income, assets, sales, etc.). In another, our dependent variable was executive compensation which we sought to explain using variables like profits and debt. However, all our dependent and explanatory variables referred to the same year. In practice, this may not be reasonable. The value that the stock market places on a firm might depend not only on current income, but on historical income as well. After all, current income could be affected by short-term factors and may not be a totally reliable guide to long-run performance. For instance, an ice cream company might suffer a short-term fall in income due to an unusually cold summer. Looking at data based on this one unusual event could give an unreliable view of the long-run potential of this company. Similar considerations hold for our executive compensation example, where compensation might depend not only on current profits, but also on past profits. In short, there are good reasons to include not only current values of explanatory variables, but also past values.

To put this concept in the language of regression, we say that the value of the dependent variable at a given point in time should depend not only on the value of the explanatory variable at that time period, but also on values of the explanatory variable in the past. A simple model to incorporate such dynamic effects has the form:[40]

$$Y_t = \alpha + \beta_0 X_t + \beta_1 X_{t-1} + \cdots + \beta_q X_{t-q} + e_t$$

This is precisely the same as the multiple regression model in Chapter 6, with the exception that the "explanatory variables" are not entirely different (e.g. lot size, number of bathrooms, number of bedrooms, etc.) but are just one explanatory variable that is observed at different time periods. In this model, the right-hand side variables are referred to as lagged variables and q is referred to as the lag order or lag length. We will focus on the case where the dependent variable depends on one explanatory variable and its lags. However, everything we say can be generalized in a straightforward fashion to several explanatory variables, all having time lags. Since the effect of the explanatory variable on the dependent variable does not happen all at once, but rather is distributed over several time periods, this model is sometimes referred to as a distributed lag model.

Since the regression model with time lags is a regression model, everything we said in Chapters 4 to 6 about regression is relevant here. For instance, computer packages like Excel can provide OLS estimates of coefficients, confidence intervals and P-values for testing whether coefficients are equal to zero. Coefficients can be interpreted as measures of the influence of the explanatory variable on the dependent variable. In this case, we have to be careful with timing. For instance, we interpret results as: "β2 measures the effect of the explanatory variable two periods ago on the dependent variable, ceteris paribus". Other than these minor differences, both the statistical methods and interpretation are very similar to the tools we described previously. Nevertheless, it is worth discussing this class of models separately, as it will help us to develop some time series terminology and introduce ideas that we will build on in subsequent chapters.
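
As a small worked illustration of this ceteris paribus reading (a sketch for the hypothetical case q = 2, with e denoting the regression error, which is an assumed notation here), compare two scenarios that share the same values of Xt and Xt−1 but in which the value two periods ago is one unit higher in the second scenario:

$$Y_t = \alpha + \beta_0 X_t + \beta_1 X_{t-1} + \beta_2 X_{t-2} + e_t$$

$$Y_t' = \alpha + \beta_0 X_t + \beta_1 X_{t-1} + \beta_2 (X_{t-2} + 1) + e_t$$

$$Y_t' - Y_t = \beta_2$$

Holding everything else fixed, the dependent variable differs by exactly β2 between the two scenarios.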

Before turning to an illustrative example of how to work with regression models with lagged variables, we will make two brief detours. One of these describes what lagged variables are and how to calculate them in a spreadsheet software package. The other clarifies the notation that will be used in this and subsequent chapters.

Aside on lagged variables

The concept of a lagged variable is fundamental to time series data, so we will describe in some detail what it means and how to construct and work with lagged variables on a computer spreadsheet. We do this mostly because it really helps to understand what lagged variables are by seeing how they are constructed. However, we partly work with a spreadsheet to begin to show you that it is awkward to work with spreadsheets when you have time series variables. It is possible to do almost everything in this book (with the exception of models involving volatility which we will discuss later) with a spreadsheet such as Excel. However, it is much more convenient to use a specialized computer package for financial econometrics such as E-views, MicroFit or Stata.

Suppose we have time series data for t = 1, ..., T periods on a variable X. As before, we denote individual observations by Xt for t = 1, ..., T. Consider creating a new variable W which has observations Wt = Xt for t = 2, ..., T and a new variable Z which has observations Zt = Xt−1 for t = 2, ..., T. Why do we write t = 2, ..., T instead of t = 1, ..., T ? If we had written t = 1, ..., T then the first observation of the variable Z, Z1, would be set equal to X0. Yet we do not know what X0 is since variable X is observed only from t = 1, ..., T. In other words, W and Z have only T – 1 observations. Note also that had we written Zt = Xt−2 then the new variable Z would have observations from t = 3, ..., T and only T – 2 observations.

The new variables W and Z both have T – 1 observations. If we imagine W and Z as two columns containing T – 1 numbers each (as in an Excel spreadsheet), we can see that the first element of W will be X2 and the first element of Z will be X1. The second elements of W and Z will be X3 and X2, etc. In words, we say that W contains X and Z contains X one period ago or lagged one period. In general, we can create variables "X lagged one period" (or "lagged X" for short), "X lagged two periods" or, in general, "X lagged j periods".

You can think of "X", "X lagged one period", "X lagged two periods", etc. as different explanatory variables in the same way as you can think of "house price", "lot size", or "number of bedrooms" as different explanatory variables.

Note, however, that if you want to include several explanatory variables in a multiple regression model, all variables must have the same number of observations. Let us consider the implication of this statement in the present context. Suppose a regression includes X = the interest rate lagged j periods as an explanatory variable. If you began with t = 1, ..., T observations on the interest rate, then X lagged j periods will contain only T − j observations. Since this variable contains only T − j observations, you must make sure that all the other variables in the model also contain exactly T − j observations. In words, each variable in a time series regression must contain a number of observations equal to T minus the maximum number of lags that any variable has.
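
The trimming rule can be made concrete with a short sketch. The function name `lagged_columns` and the interest-rate numbers below are hypothetical, chosen only for illustration; the point is that every column ends up with T minus the maximum lag observations.

```python
def lagged_columns(x, max_lag):
    """Return [X, X lagged 1, ..., X lagged max_lag], each trimmed to
    T - max_lag observations so that all columns line up row by row."""
    T = len(x)
    columns = []
    for j in range(max_lag + 1):
        # Row i of every column refers to period t = max_lag + 1 + i,
        # so the column for lag j holds the values X_{t-j}.
        columns.append(x[max_lag - j : T - j])
    return columns


# Hypothetical interest-rate series with T = 6 observations (X_1, ..., X_6).
rates = [5.0, 5.2, 5.1, 4.9, 4.8, 5.0]
cols = lagged_columns(rates, max_lag=2)

for name, col in zip(["X", "X lagged 1", "X lagged 2"], cols):
    print(name, col, "observations:", len(col))
```

Each printed column has 6 − 2 = 4 entries: the unlagged column starts at X3, while the column lagged two periods starts at X1.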

Many of the more sophisticated statistical software packages (e.g. E-views, Stata or MicroFit) will create lagged variables automatically with a simple command, but most spreadsheet packages, such as Excel, will not. This is a key reason why, when working with time series data, you might want to learn such a software package and not work with a spreadsheet such as Excel. When working with a spreadsheet you will have to create lagged variables yourself before running a regression involving them. A brief explanation of how to do this will be useful when you work with spreadsheets and will also provide a practical way to illustrate the material above.

As an example, suppose we have 10 observations on variables Y and X (i.e. t = 1, ..., 10) and we wish to run a regression model that includes X, lagged X, X lagged two periods and X lagged 3 periods. That is, we wish to estimate the regression model:

$$Y_t = \alpha + \beta_0 X_t + \beta_1 X_{t-1} + \beta_2 X_{t-2} + \beta_3 X_{t-3} + e_t$$

Table 8.1 shows how the data would look in a spreadsheet format.

Note that spreadsheets label each observation by row and column, as in Table 8.1. Each column contains a variable (e.g. Column C contains the variable X lagged one period) and each row contains observations. Note that each of the variables contains 7 observations, which is T minus the maximum number of lags (i.e. 10 − 3 = 7). Looking across any row (e.g. Row 4) you can see that: (a) Y and X contain data at a particular point in time (e.g. Y7 and X7 or t = 7); (b) X lagged one period contains the observation from one period previously (e.g. X6); (c) X lagged two periods contains the observation from two periods previously (e.g. X5); and (d) X lagged three periods contains the observation from three periods previously (e.g. X4).

Table 8.1. Creating lagged variables.

 

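|       | Column A: Y | Column B: X | Column C: X lagged one period | Column D: X lagged two periods | Column E: X lagged three periods |
|-------|-------------|-------------|-------------------------------|--------------------------------|----------------------------------|
| Row 1 | Y4          | X4          | X3                            | X2                             | X1                               |
| Row 2 | Y5          | X5          | X4                            | X3                             | X2                               |
| Row 3 | Y6          | X6          | X5                            | X4                             | X3                               |
| Row 4 | Y7          | X7          | X6                            | X5                             | X4                               |
| Row 5 | Y8          | X8          | X7                            | X6                             | X5                               |
| Row 6 | Y9          | X9          | X8                            | X7                             | X6                               |
| Row 7 | Y10         | X10         | X9                            | X8                             | X7                               |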

You can create this table in Excel. First use the Cut/Paste commands in the spreadsheet containing the original data on Y and X (i.e. the one that contained the 10 original observations on the two variables) to create a spreadsheet that looks like Table 8.1. Then run the regression by using the Excel regression menu in the standard way and specifying A1:A7 in the box labeled "Input Y-range", and B1:E7 in the box labeled "Input X-range".
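
For readers working outside a spreadsheet, the same layout and regression can be sketched in Python with NumPy. This is only an illustration under stated assumptions: the data are simulated (they are not from the book), and the names `X_cols`, `y_col` and `design` are hypothetical. The columns are built exactly as in Table 8.1, so each has 10 − 3 = 7 rows.

```python
import numpy as np

T, q = 10, 3  # 10 observations and three lags, as in Table 8.1

# Hypothetical data, simulated only so that the example runs.
rng = np.random.default_rng(0)
x = rng.normal(size=T)
y = 1.0 + 0.5 * x + rng.normal(scale=0.1, size=T)

# Columns B to E of Table 8.1: X, then X lagged one, two and three periods.
# Row i corresponds to period t = q + 1 + i, so each column has T - q rows.
X_cols = np.column_stack([x[q - j : T - j] for j in range(q + 1)])
y_col = y[q:]  # Column A: Y4, ..., Y10

# OLS with an intercept: prepend a column of ones and solve least squares.
design = np.column_stack([np.ones(T - q), X_cols])
coef, *_ = np.linalg.lstsq(design, y_col, rcond=None)

print("rows used in the regression:", design.shape[0])
print("estimates of alpha, beta0, ..., beta3:", np.round(coef, 3))
```

With only seven rows and five coefficients the estimates are very imprecise; the sketch is meant to show the alignment of the columns, not to suggest that such a small sample is adequate.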

This section on lagged variables may seem of little direct relevance for understanding and interpreting results. However, it is important not to forget this material if you are at the computer, working with a spreadsheet.

Aside on notation

It is also important to make sure that our notation is clear. Consider a variable, X (e.g. executive compensation). After collecting data on X we will have observations Xi for i = 1, ..., N for cross-sectional data and Xt for t = 1, ..., T for time series data (see Chapter 2).

In other words, X is a generic notation for the variable and Xi or Xt indicates a particular observation of the variable (e.g. Xi = executive compensation in the ith company or Xt = executive compensation in the tth time period). In our discussion of regression in Chapters 4 to 7 we often wrote equations of the form:

$$Y = \alpha + \beta X + e$$

Expressed in words, the above implies that "the dependent variable Y depends on the explanatory variable X in a linear fashion". When we have actual data we can write,

$$Y_i = \alpha + \beta X_i + e_i$$

Expressed in words, "observation i of Y depends on observation i of X." For instance, "executive compensation in company i depends on profits in company i". Both of these equations are perfectly correct. But, since the subscript i in the latter equation is a little obvious (e.g. it is obvious that executive compensation in Company A depends on profits in Company A — it certainly will not depend on profits in Company B), you often see the i subscript dropped out from the latter equation for simplicity's sake.

We complicated our notation even more in Chapter 6 in our discussion of multiple regression, in which X1, X2, ..., Xk were k different explanatory variables. Here the subscript on X indicated which explanatory variable we were referring to, not which observation. In the rare cases when we wanted to be more explicit we wrote, for example, X2i, to indicate the ith observation of the second explanatory variable. However, since it is usually obvious in the multiple regression case that Yi (e.g. executive compensation in company i) depends on X1i (e.g. profits in company i) and on X2i (e.g. change in sales in company i), the i subscript was often dropped from the equation.

In short, throughout this book our subscript notation, which distinguishes between a variable and a particular observation of a variable, has been a little loose. This is okay (and common in textbooks), since the meaning is fairly obvious from the context and the alternative is to clutter up equations with numerous subscripts. In the time series chapters of this book, we will show similar informality, using the notation Xt-j to indicate both a particular observation (e.g. if t = 1968 and j = 3, then Xt-j is the value of variable X in 1965) and the variable X lagged j periods. It will be obvious from the context which is which. Quite frankly, in virtually any equation in this book it will not matter which way you interpret it.

Selection of lag order

When working with distributed lag models, we rarely know a priori exactly how many lags we should include. In the previous example, why did we assume that market capitalization depends on movements in the oil price up to four months ago? Why not three or six or even eight? That is, unlike most of the regression models considered in Chapters 4 to 7, we don't know which explanatory variables in a distributed lag model belong in the regression before we actually sit down at the computer and start working with the data. Accordingly, the issue of lag length selection becomes a data-based one where we use statistical means to decide how many lags to include.

There are many different approaches to lag length selection in the econometrics literature. Here we outline a common one that does not require any new statistical techniques beyond those developed in Chapter 5. This method uses t-tests for whether βq = 0 to decide lag length. A common strategy is to: (a) Begin with a fairly large lag length,[45] qmax, and test whether the coefficient on the maximum lag is equal to zero (i.e. test whether βqmax = 0). (b) If you cannot reject this hypothesis, drop the highest lag and re-estimate the model with maximum lag equal to qmax – 1. (c) If you cannot reject βqmax−1 = 0 in this new regression, then lower the lag order by one and re-estimate the model. (d) Keep on dropping the lag order by one and re-estimating the model until you reject the hypothesis that the coefficient on the longest lag is equal to zero.

This informal description of lag length selection can be formalized in the following series of steps:

Step 1. Choose the maximum possible lag length, qmax, that seems reasonable to you.

Step 2. Estimate the distributed lag model:

$$Y_t = \alpha + \beta_0 X_t + \beta_1 X_{t-1} + \cdots + \beta_{q_{\max}} X_{t-q_{\max}} + e_t$$

If the P-value for testing βqmax = 0 is less than the significance level you choose (e.g. 0.05) then go no further. Use qmax as lag length. Otherwise go on to the next step.

Step 3. Estimate the distributed lag model:

$$Y_t = \alpha + \beta_0 X_t + \beta_1 X_{t-1} + \cdots + \beta_{q_{\max}-1} X_{t-q_{\max}+1} + e_t$$

If the P-value for testing βqmax−1 = 0 is less than the significance level you choose (e.g. 0.05) then go no further. Use qmax – 1 as lag length. Otherwise go on to the next step.

Step 4. Estimate the distributed lag model:

$$Y_t = \alpha + \beta_0 X_t + \beta_1 X_{t-1} + \cdots + \beta_{q_{\max}-2} X_{t-q_{\max}+2} + e_t$$

If the P-value for testing βqmax−2 = 0 is less than the significance level you choose (e.g. 0.05) then go no further. Use qmax – 2 as lag length. Otherwise go on to the next step, etc.

As an aside of practical relevance to note when you are working with a spreadsheet, the number of observations used in a distributed lag model is equal to the original number of observations, T, minus the maximum lag length. This means that, in Step 2, we are working with T − qmax observations; in Step 3, with T − qmax + 1 observations; in Step 4, with T − qmax + 2 observations; etc. Each step will require some cutting and pasting in the spreadsheet to create variables with the appropriate number of observations. Alternatively, some researchers simply use T − qmax observations for all regressions. This has the advantage that, at each step, the researcher uses the same observations. However, this strategy may mean using a smaller data set than necessary. Remember from Chapter 5 that having more observations increases the accuracy of OLS estimates.
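
The sequential testing procedure in Steps 1 to 4 can be sketched in Python with NumPy. Several things here are assumptions made for illustration and are not part of the book's procedure: the function names `last_coef_pvalue` and `select_lag`, the simulated data, the use of a large-sample normal approximation for the P-value (in place of the t distribution, to avoid extra dependencies), and the choice to return 0 if no lag is ever significant. The sketch uses T minus the current lag length observations at each step, which is the first of the two sample choices described above.

```python
import math
import numpy as np


def last_coef_pvalue(y, X):
    """OLS of y on X plus an intercept; return the two-sided P-value
    for the hypothesis that the last coefficient equals zero."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    k = design.shape[1]
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    sigma2 = resid @ resid / (n - k)          # error variance estimate
    cov = sigma2 * np.linalg.inv(design.T @ design)
    t_stat = coef[-1] / math.sqrt(cov[-1, -1])
    # Assumption: standard normal approximation to the t distribution.
    return math.erfc(abs(t_stat) / math.sqrt(2))


def select_lag(y, x, q_max, level=0.05):
    """Drop the longest lag until its coefficient is significant."""
    T = len(x)
    for q in range(q_max, 0, -1):
        # Columns X_t, X_{t-1}, ..., X_{t-q}, each with T - q rows.
        X = np.column_stack([x[q - j : T - j] for j in range(q + 1)])
        if last_coef_pvalue(y[q:], X) < level:
            return q
    return 0  # assumption: no lag was significant, keep only X_t


# Simulated (hypothetical) data in which X affects Y for two periods.
rng = np.random.default_rng(42)
T = 200
x_full = rng.normal(size=T + 2)  # two pre-sample values so every Y_t has its lags
noise = rng.normal(scale=0.5, size=T)
y = 1.0 + 0.8 * x_full[2:] + 0.6 * x_full[1:-1] + 0.5 * x_full[:-2] + noise
x = x_full[2:]

print("selected lag length:", select_lag(y, x, q_max=4))
```

Because each test is carried out at the 5% level, the procedure can occasionally stop too early at a lag whose true coefficient is zero, so the selected lag length may exceed the true one in some samples.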

Chapter summary

  1. Regressions with time series variables involve two issues we have not dealt with in the past. First, one variable can influence another with a time lag. Second, if the variables are nonstationary, the spurious regression problem can result. The latter issue will be dealt with in Chapter 10.

  2. Distributed lag models have the dependent variable depending on an explanatory variable and time lags of the explanatory variable.

  3. If the variables in the distributed lag model are stationary, then OLS estimates are reliable and the statistical techniques of multiple regression (e.g. looking at P-values or confidence intervals) can be used in a straightforward manner.

  4. The lag length in a distributed lag model can be selected by sequentially using t-tests beginning with a reasonably large lag length.



[40] We can, of course, label our coefficients using any convention we want. The convention chosen here relates the subscript on β to the number of periods ago to which the explanatory variable refers. For instance, β1 is the coefficient on Xt-1, which is the value of the explanatory variable one period ago.

[41] The interested reader is referred to Chapter 7 of Campbell, Lo and MacKinlay, The Econometrics of Financial Markets, for details.

[42] Formally, this is a price relative to a benchmark price which accounts for the zeros in the data set.

[43] Note that we are assuming this data to be stationary. In a real empirical exercise involving market capitalization, this may be a poor assumption. However, this data set is a fictitious one, created so as to be stationary, so we will not worry about this issue here.

[44] The value $1,268,060 is the estimate of the total effect. It is possible to calculate a confidence interval as well, but this would require a more complicated formula and is beyond the scope of this book.

[45] Although not too large! Remember that each variable in a distributed lag model will have number of observations equal to T minus the maximum number of lags. If you set the maximum number of lags too large, you will be left with very few observations.
