Autoregressive integrated moving average

An autoregressive integrated moving average (ARIMA) model is a combination of the following elements:

  • Autoregressive operator: We have already learned what this means; just to reiterate, it is the lags of the stationarized series. It is denoted by p, the number of autoregressive terms. The PACF plot helps us determine this component.
  • Integration operator: A series that needs to be differenced to be made stationary is said to be an integrated version of a stationary series. It is denoted by d, the number of differencing operations needed to transform the nonstationary time series into a stationary one. Differencing is done by subtracting the previous period's observation from the current one. If this is done only once, the series is called first differenced; this removes the trend from a series that is growing at a constant rate. If the series is growing at an increasing rate, the first-differenced series still has a trend and needs another round of differencing, which is called second differencing (see the sketch after this list).
  • Moving average operator: The lagged forecast errors, denoted by q, which is the number of lagged forecast error terms in the equation. The ACF plot helps us determine this component.
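
To make the differencing idea concrete, here is a minimal sketch using pandas' diff() on a small hypothetical series (ts_demo) that grows at an increasing rate:

import pandas as pd

# Hypothetical series growing at an increasing rate
ts_demo = pd.Series([1, 2, 4, 7, 11, 16])

first_diff = ts_demo.diff().dropna()      # first differencing: y(t) - y(t-1)
second_diff = first_diff.diff().dropna()  # second differencing

print(first_diff.tolist())   # [1.0, 2.0, 3.0, 4.0, 5.0] -- still trending
print(second_diff.tolist())  # [1.0, 1.0, 1.0, 1.0] -- now constant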

The ARIMA model can only be applied to a stationary series. Therefore, before applying it, the stationarity of the series has to be checked. The augmented Dickey-Fuller (ADF) test can be performed to establish this.
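
As a quick sketch, the ADF test is available in statsmodels; here we assume ts_log_dif is the differenced log series that is used throughout the rest of this section:

from statsmodels.tsa.stattools import adfuller

adf_result = adfuller(ts_log_dif)
print('ADF statistic: %f' % adf_result[0])
print('p-value: %f' % adf_result[1])
# A p-value below 0.05 lets us reject the null hypothesis of a unit
# root, which means the series can be treated as stationary.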

The equation of ARIMA looks like the following:

$$\hat{y}_t = \mu + \phi_1 y_{t-1} + \cdots + \phi_p y_{t-p} - \theta_1 e_{t-1} - \cdots - \theta_q e_{t-q}$$

Here, $y$ is the series after $d$ rounds of differencing and $e$ denotes the forecast errors. The first part of the equation (before the - sign) is the autoregressive section, and the second part (after the - sign) is the MA section.

We can go ahead and add a seasonal component to ARIMA as well, which gives us a seasonal model of the form ARIMA(p,d,q)(P,D,Q)s. While adding it, we need to perform seasonal differencing, which means subtracting from each observation the observation from one season (s periods) earlier.
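
Seasonal differencing itself is a one-liner in pandas. The following sketch assumes monthly data, so one season is s=12 periods:

# Subtract from each observation the value from one season (12 periods) ago
ts_log_seasonal_dif = ts_log - ts_log.shift(12)
ts_log_seasonal_dif.dropna(inplace=True)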

Let's plot the ACF and PACF in order to find the p and q parameters.

Here, we take the number of lags as 20 and use the statsmodels.tsa.stattools library to import the acf and pacf functions, as follows:

from statsmodels.tsa.stattools import acf,pacf
lag_acf= acf(ts_log_dif,nlags=20)
lag_pacf = pacf(ts_log_dif, nlags=20,method="ols")

Now we will plot with the help of matplotlib using the following code:


import matplotlib.pyplot as plt
import numpy as np

#Plot ACF:
plt.subplot(121)
plt.plot(lag_acf)
plt.axhline(y=0,linestyle='--',color='gray')
plt.axhline(y=-1.96/np.sqrt(len(ts_log_dif)),linestyle='--',color='gray')  # lower 95% confidence bound
plt.axhline(y=1.96/np.sqrt(len(ts_log_dif)),linestyle='--',color='gray')   # upper 95% confidence bound
plt.title('Autocorrelation Function')

The output is as follows:

Here, we are measuring the correlation of the time series with a lagged version of itself. For instance, at lag 5, the ACF would compare the series at time instants t1, t2, ... with the series at instants t1-5, t2-5, .... It is a plot of the correlation coefficients of the series with its lagged values.
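
As a quick check, a single lagged correlation can also be computed directly with pandas' built-in autocorr method:

# Correlation of the series with itself shifted by five periods;
# this will be close to lag_acf[5] (the normalization differs slightly)
print(ts_log_dif.autocorr(lag=5))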

If we look at the preceding plot carefully, we will see that the ACF curve crosses the upper confidence line at lag 2. Therefore, the order of the MA component would be 2, and q=2.

Plotting the partial correlation between the series and its lagged values gives us the partial autocorrelation function (PACF) plot. Partial correlation is a very interesting concept: if we compute the correlation between Y and X3 while knowing that Y is also associated with X1 and X2, the partial correlation captures the portion of the correlation between Y and X3 that is not explained by their correlations with X1 and X2.

Here, the partial correlation is the square root of the reduction in variance that is achieved by adding a variable (here, X3) to the regression of Y on the other variables (here, X1 and X2).
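
The following is a minimal sketch of this idea with hypothetical arrays Y, X1, X2, and X3: regress both Y and X3 on the controls X1 and X2, and then correlate the residuals:

import numpy as np
import statsmodels.api as sm

# Hypothetical data
rng = np.random.default_rng(0)
X1, X2, X3 = rng.normal(size=(3, 100))
Y = X1 + 0.5 * X2 + 0.3 * X3 + rng.normal(size=100)

controls = sm.add_constant(np.column_stack([X1, X2]))
resid_y = sm.OLS(Y, controls).fit().resid    # part of Y not explained by X1, X2
resid_x3 = sm.OLS(X3, controls).fit().resid  # part of X3 not explained by X1, X2

# The partial correlation of Y and X3, controlling for X1 and X2
print(np.corrcoef(resid_y, resid_x3)[0, 1])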

In the case of a time series, the partial autocorrelation between Yt and its lagged value Yt-3 is the correlation that is not explained by the correlations of Yt with Yt-1 and Yt-2, and the PACF is plotted with the following code:

#Plot PACF:
plt.subplot(122)
plt.plot(lag_pacf)
plt.axhline(y=0,linestyle='--',color='gray')
plt.axhline(y=-1.96/np.sqrt(len(ts_log_dif)),linestyle='--',color='gray')  # lower 95% confidence bound
plt.axhline(y=1.96/np.sqrt(len(ts_log_dif)),linestyle='--',color='gray')   # upper 95% confidence bound
plt.title('Partial Autocorrelation Function')
plt.tight_layout()

We will get the following output:

If we look at the preceding plot carefully, we will see that the PACF curve crosses the upper confidence line at lag 2. Therefore, the order of the AR component would be 2, and p=2.

Let's try out an AR model of order (p=2, d=1, q=0). The d value has been taken as 1, since this is a case of single differencing. The residual sum of squares (RSS) is calculated as well, to judge how good the model is and to compare it with others, as shown in the following code:

from statsmodels.tsa.arima_model import ARIMA
model1 = ARIMA(ts_log, order=(2, 1, 0))
results_AR = model1.fit(disp=-1)
plt.plot(ts_log_dif)
plt.plot(results_AR.fittedvalues, color='red')
plt.title('RSS: %.4f'% sum((results_AR.fittedvalues-ts_log_dif)**2))

The output can be seen as follows:

Now, we can have a look at the model summary that depicts the coefficients of AR1 and AR2 using the following code:

results_AR.summary()

Now, let's build an MA model of the order (p=0,d=1,q=2) using the following code:

model2 = ARIMA(ts_log, order=(0, 1, 2)) 
results_MA = model2.fit(disp=-1)
plt.plot(ts_log_dif)
plt.plot(results_MA.fittedvalues, color='red')
plt.title('RSS: %.4f'% sum((results_MA.fittedvalues-ts_log_dif)**2))

The output can be seen as follows:

Now, let's combine these two models and build an ARIMA model using the following code:

model3 = ARIMA(ts_log, order=(2, 1, 2))  
results_ARIMA = model3.fit(disp=-1)
plt.plot(ts_log_dif)
plt.plot(results_ARIMA.fittedvalues, color='red')
plt.title('RSS: %.4f'% sum((results_ARIMA.fittedvalues-ts_log_dif)**2))

The output is as follows:

We can see a dip in the value of the RSS from the AR model to the ARIMA model; the RSS is now 1.0292:

results_ARIMA.summary()

We can see the coefficients of AR1, AR2, MA1, and MA2 and, if we go by the p-values, we can see that all of these parameters are significant, as shown in the following screenshot:
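
If needed, the coefficients and their p-values can also be pulled out of the results object programmatically, rather than read off the summary table; a short sketch:

# Coefficients and p-values of the fitted ARIMA(2,1,2) model
print(results_ARIMA.params)
print(results_ARIMA.pvalues)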

Let's turn the predicted values into a series using the following code:

predictions_ARIMA_dif= pd.Series(results_ARIMA.fittedvalues, copy=True)
print(predictions_ARIMA_dif.head())

We will get the following output:

The way to convert the differenced predictions back to the log scale is to add these differences consecutively to the base number. An easy way to do this is to first compute the cumulative sum at each index and then add it to the base number. The cumulative sum can be found using the following code:

predictions_ARIMA_dif_cumsum = predictions_ARIMA_dif.cumsum()
print(predictions_ARIMA_dif_cumsum.head())

From this, we will get the following output:

We will create a series that has the base number (the first value of ts_log) in every position, and then add the cumulative differences to it, as follows:

predictions_ARIMA_log = pd.Series(ts_log.iloc[0], index=ts_log.index)
predictions_ARIMA_log = predictions_ARIMA_log.add(predictions_ARIMA_dif_cumsum,fill_value=0)
predictions_ARIMA_log.head()

The following shows the output:

Let's now recover the forecast on the original scale by taking the exponent, using the following code:

predictions_ARIMA = np.exp(predictions_ARIMA_log)
plt.plot(ts)
plt.plot(predictions_ARIMA)
plt.title('RMSE: %.4f'% np.sqrt(sum((predictions_ARIMA-ts)**2)/len(ts)))

The output can be seen as follows:
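
To take this one step further, the fitted model can also produce genuinely out-of-sample forecasts. The following sketch assumes the old arima_model API, whose forecast method returns the forecast, its standard error, and a confidence interval, on the scale of the series the model was fit on (here, the log scale):

# Forecast the next 24 periods on the log scale, then undo the log transform
fc, se, conf = results_ARIMA.forecast(steps=24)
print(np.exp(fc)[:5])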
