Understanding Granger causality

Imagine you're asked a question such as, What's the relationship between the number of new prescriptions and total prescriptions for medicine X? You know that these are measured monthly, so what could you do to understand that relationship, given that people believe that new scripts will drive up total scripts? Or how about testing the hypothesis that commodity prices, copper in particular, are a leading indicator of stock market prices in the US? Well, with two sets of time series data, x and y, Granger causality is a method that attempts to determine whether one series is likely to influence a change in the other. This is done by taking different lags of one series and using them to model the change in the second series. To accomplish this, we'll create two models that will predict y: a restricted model built only on the past values of y, and a full model built on the past values of both y and x. The models are as follows, where k is the number of lags in the time series:

  • Restricted: yt = β0 + β1yt-1 + ... + βkyt-k + et
  • Full: yt = β0 + β1yt-1 + ... + βkyt-k + α1xt-1 + ... + αkxt-k + et

The residual sum of squares (RSS) of each model is then compared, and an F-test is used to determine whether the nested (restricted) model is adequate to explain the future values of y or whether the full model is better. The F-test evaluates the following null and alternative hypotheses:

  • H0: αi = 0 for each i ∈ [1, k], no Granger causality
  • H1: αi ≠ 0 for at least one i ∈ [1, k], Granger causality

Essentially, we're trying to determine whether we can say that, statistically, x provides more information about the future values of y than the past values of y alone. In this definition, it's clear that we aren't trying to prove actual causation, only that the two values are related by some phenomenon. Along these lines, we must also run the model in reverse to verify that y doesn't provide information about the future values of x. If we find that it does, it's likely that there's some exogenous variable, say Z, that needs to be controlled, or that Z would possibly be a better candidate for Granger causation. Originally, you had to apply the method to stationary time series in order to avoid spurious results; this is no longer the case, as I'll demonstrate.
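The nested-model comparison described above can be sketched numerically. The following is a minimal illustration (shown in Python with NumPy and SciPy as a stand-in for the R workflow this book uses), which fits both regressions by ordinary least squares on simulated data in which x genuinely leads y, then forms the F statistic from the two RSS values:

```python
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(42)

# Simulated data in which x leads y by one period.
n, k = 200, 2                      # sample size and number of lags tested
x = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + rng.normal()

# Build the lag matrices: column i holds the series lagged by i periods.
T = n - k                          # usable observations after lagging
Y = y[k:]
y_lags = np.column_stack([y[k - i : n - i] for i in range(1, k + 1)])
x_lags = np.column_stack([x[k - i : n - i] for i in range(1, k + 1)])
ones = np.ones((T, 1))

def rss(X, target):
    """Residual sum of squares from an OLS fit of target on X."""
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    return resid @ resid

rss_restricted = rss(np.hstack([ones, y_lags]), Y)    # past y only
rss_full = rss(np.hstack([ones, y_lags, x_lags]), Y)  # past y and x

# F-test of the k restrictions alpha_1 = ... = alpha_k = 0.
df_denom = T - (2 * k + 1)
F = ((rss_restricted - rss_full) / k) / (rss_full / df_denom)
p_value = f.sf(F, k, df_denom)
```

A small p-value rejects H0, meaning the lags of x add explanatory power beyond the lags of y alone; to check the reverse direction, simply swap the roles of x and y.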

Note that research papers are available that discuss techniques for nonlinear models, but this is outside the scope of this book. I recommend reading an excellent introductory paper on Granger causality that revolves around the age-old conundrum of the chicken and the egg (Thurman, 1988).

There are a couple of different ways to identify the proper lag structure. Naturally, we can use brute force and ignorance to test all of the reasonable lags, one at a time. We may have a rational intuition based on domain expertise, or perhaps prior research exists to guide the lag selection. If not, then you can apply vector autoregression (VAR) to identify the lag structure with the lowest information criterion, such as Akaike's information criterion (AIC) or the final prediction error (FPE). For simplicity, here is the notation for a VAR model with two variables, incorporating only one lag for each variable. This notation can be extended for as many variables and lags as appropriate:

  • Yt = constant1 + B11Yt-1 + B12Xt-1 + e1
  • Xt = constant2 + B21Yt-1 + B22Xt-1 + e2

In R, this process is quite simple to implement as we'll see in the following practical problem.
