We have covered a lot of statistical models for time series forecasting. Back in chapters 4 and 5, you learned how to model moving average processes and autoregressive processes. We then combined these models to form the ARMA model and added a parameter to forecast non-stationary time series, leading us to the ARIMA model. We then added a seasonal component with the SARIMA model. Adding the effect of exogenous variables culminated in the SARIMAX model. Finally, we covered multivariate time series forecasting using the VAR model. Thus, you now have access to many statistical models that allow you to forecast a wide variety of time series, from simple to more complex. This is a good time to consolidate your learning and put your knowledge into practice with a capstone project.
The objective of the project in this chapter is to forecast the number of antidiabetic drug prescriptions in Australia, from 1991 to 2008. In a professional setting, solving this problem would allow us to gauge the production of antidiabetic drugs, so as to produce enough to meet demand while avoiding overproduction. The data we’ll use was recorded by the Australian Health Insurance Commission. We can visualize the time series in figure 11.1.
In figure 11.1 you’ll see a clear trend in the time series, as the number of prescriptions increases over time. Furthermore, you’ll observe strong seasonality, as each year seems to start at a low value and end at a high value. By now, you should intuitively know which model is potentially the most suitable for solving this problem.
To solve this problem, refer to the following steps:
The objective is to forecast 12 months of antidiabetic drug prescriptions. Use the last 36 months of the dataset as a test set to allow for rolling forecasts.
Use time series decomposition to extract the trend and seasonal components.
Based on your exploration, determine the most suitable model.
Compare the model’s performance to a baseline. Select an appropriate baseline and error metric.
To get the most out of this capstone project, you are highly encouraged to complete it on your own by referring to the preceding steps. This will help you assess your autonomy in the modeling process and your understanding.
If you ever feel stuck or want to validate your reasoning, the rest of this chapter walks through the completion of this project. Also, the full solution is available on GitHub if you wish to refer to the code directly: https://github.com/marcopeix/TimeSeriesForecastingInPython/tree/master/CH11.
I wish you luck on this project!
The natural first step is to import the libraries that will be needed to complete the project. We can then load the data and store it in a DataFrame to be used throughout the project.
Thus, we’ll import the following libraries and specify the magic function %matplotlib inline to display the plots in the notebook:
from sklearn.metrics import mean_squared_error, mean_absolute_error
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.seasonal import seasonal_decompose, STL
from statsmodels.stats.diagnostic import acorr_ljungbox
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.graphics.gofplots import qqplot
from statsmodels.tsa.stattools import adfuller
from tqdm import tqdm_notebook
from itertools import product
from typing import Union

import matplotlib.pyplot as plt
import statsmodels.api as sm
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline
Once the libraries are imported, we can read the data and store it in a DataFrame. We can also display the shape of the DataFrame to determine the number of data points.
❶ Displays the shape of a DataFrame. The first value is the number of rows, and the second value is the number of columns.
The data is now ready to be used throughout the project.
With the data loaded, we can now easily visualize the series. This essentially recreates figure 11.1.
fig, ax = plt.subplots()

ax.plot(df.y)
ax.set_xlabel('Date')
ax.set_ylabel('Number of anti-diabetic drug prescriptions')

plt.xticks(np.arange(6, 203, 12), np.arange(1992, 2009, 1))

fig.autofmt_xdate()
plt.tight_layout()
Next we can perform decomposition to visualize the different components of the time series. Remember that time series decomposition allows us to visualize the trend component, seasonal component, and the residuals.
decomposition = STL(df.y, period=12).fit()    # ❶

fig, (ax1, ax2, ax3, ax4) = plt.subplots(nrows=4, ncols=1, sharex=True, figsize=(10,8))

ax1.plot(decomposition.observed)
ax1.set_ylabel('Observed')

ax2.plot(decomposition.trend)
ax2.set_ylabel('Trend')

ax3.plot(decomposition.seasonal)
ax3.set_ylabel('Seasonal')

ax4.plot(decomposition.resid)
ax4.set_ylabel('Residuals')

plt.xticks(np.arange(6, 203, 12), np.arange(1992, 2009, 1))

fig.autofmt_xdate()
plt.tight_layout()
❶ Column y holds the number of monthly antidiabetic prescriptions. Also, the period is set to 12, since we have monthly data.
The result is shown in figure 11.2. Everything seems to suggest that a SARIMA(p,d,q)(P,D,Q)m model would be the optimal solution for forecasting this time series. We have a trend as well as clear seasonality. Plus, we do not have any exogenous variables to work with, so the SARIMAX model cannot be applied. Finally, we wish to predict only one target, meaning that a VAR model is also not relevant in this case.
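To build some intuition for what a decomposition extracts, here is a minimal sketch of a classical additive decomposition based on a centered moving average. This is a simpler technique than the STL algorithm used in this chapter, and it runs on a synthetic series rather than on the prescription data; the seasonal pattern, the helper names, and the trend slope are all assumptions made for illustration.

```python
# Classical additive decomposition on a toy monthly series.
# NOTE: this is not STL; it is a simpler moving-average method on synthetic data.
PERIOD = 12

# Hypothetical seasonal pattern that sums to zero over one year.
pattern = [-6, -4, -3, -1, 0, 1, 2, 3, 4, 5, 6, -7]
series = [2.0 * t + pattern[t % PERIOD] for t in range(48)]


def centered_moving_average(values, period):
    """2 x period centered moving average; None where the window does not fit."""
    half = period // 2
    trend = [None] * len(values)
    for t in range(half, len(values) - half):
        window = values[t - half:t + half + 1]
        # End points get half weight so every month carries weight 1/period.
        total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        trend[t] = total / period
    return trend


def seasonal_means(values, trend, period):
    """Average detrended value for each position in the yearly cycle."""
    buckets = [[] for _ in range(period)]
    for t, (value, level) in enumerate(zip(values, trend)):
        if level is not None:
            buckets[t % period].append(value - level)
    return [sum(b) / len(b) for b in buckets]


trend = centered_moving_average(series, PERIOD)
seasonal = seasonal_means(series, trend, PERIOD)

# Whatever is left after removing trend and seasonality is the residual.
residuals = [
    y - level - seasonal[t % PERIOD]
    for t, (y, level) in enumerate(zip(series, trend))
    if level is not None
]
```

Because the toy series is an exact sum of a linear trend and a repeating pattern, the residuals come out as zero. On real data such as ours, the residual component captures whatever the trend and seasonal components cannot explain.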
We’ve decided that a SARIMA(p,d,q)(P,D,Q)m model is the most suitable for modeling and forecasting this time series. Therefore, we’ll follow the general modeling procedure for a SARIMAX model, as a SARIMA model is a special case of the SARIMAX model. The modeling procedure is shown in figure 11.3.
Following the modeling procedure outlined in figure 11.3, we’ll first determine whether the series is stationary using the augmented Dickey-Fuller (ADF) test.
ad_fuller_result = adfuller(df.y)

print(f'ADF Statistic: {ad_fuller_result[0]}')
print(f'p-value: {ad_fuller_result[1]}')
This returns a p-value of 1.0, meaning that we cannot reject the null hypothesis, and we conclude that the series is not stationary. Thus, we must apply transformations to make it stationary.
We’ll first apply a first-order differencing on the data and test for stationarity again.
y_diff = np.diff(df.y, n=1)

ad_fuller_result = adfuller(y_diff)

print(f'ADF Statistic: {ad_fuller_result[0]}')
print(f'p-value: {ad_fuller_result[1]}')
This returns a p-value of 0.12. Again, the p-value is greater than 0.05, meaning that the series is not stationary. Let’s try applying a seasonal difference, since we noticed a strong seasonal pattern in the data. Recall that we have monthly data, meaning that m = 12. Thus, a seasonal difference subtracts values that are 12 timesteps apart.
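y_diff_seasonal_diff = y_diff[12:] - y_diff[:-12]    # ❶

ad_fuller_result = adfuller(y_diff_seasonal_diff)

print(f'ADF Statistic: {ad_fuller_result[0]}')
print(f'p-value: {ad_fuller_result[1]}')
❶ We have monthly data, so we subtract values that are 12 timesteps apart. Note that np.diff(y_diff, n=12) would instead difference the series 12 times in succession, because n sets the number of differencing passes, not the lag.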
The returned p-value is 0.0. Thus, we can reject the null hypothesis and conclude that our time series is stationary.
Since we differenced the series once and took one seasonal difference, d = 1 and D = 1. Also, since we have monthly data, we know that m = 12. Therefore, we know that our final model will be a SARIMA(p,1,q)(P,1,Q)12 model.
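To see why these two transformations correspond to d = 1 and D = 1, consider the following minimal sketch. It uses a synthetic series (a linear trend plus a pattern that repeats every 12 steps) instead of the prescription data, and the helper name difference is an assumption made for illustration. A first-order difference removes the trend, and a single difference at lag 12 then removes the seasonality.

```python
def difference(values, lag=1):
    """Return values[t] - values[t - lag] for every t where both exist."""
    return [values[t] - values[t - lag] for t in range(lag, len(values))]


# Synthetic monthly series: linear trend plus a pattern repeating every 12 steps.
pattern = [0, 1, 3, 2, 5, 4, 6, 8, 7, 9, 11, 10]
series = [2 * t + pattern[t % 12] for t in range(48)]

first_diff = difference(series, lag=1)          # d = 1 removes the trend
seasonal_diff = difference(first_diff, lag=12)  # D = 1 with m = 12 removes the seasonality

print(len(series), len(first_diff), len(seasonal_diff))  # 48 47 35
print(set(seasonal_diff))                                 # {0}
```

Each differencing step shortens the series: one observation is lost to the first difference and 12 more to the seasonal difference. The seasonal difference here is a single pass at lag 12, which is different from applying a lag-1 difference 12 times.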
We have established that our model will be a SARIMA(p,1,q)(P,1,Q)12 model. Now we need to find the optimal values of p, q, P, and Q. This is the model selection step where we choose the parameters that minimize the Akaike information criterion (AIC).
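As a reminder of what is being minimized, the AIC can be written as follows, where k denotes the number of estimated parameters and \hat{L} the maximized value of the likelihood function (both symbols are notation chosen here for illustration):

```latex
\mathrm{AIC} = 2k - 2\ln(\hat{L})
```

A lower AIC indicates a better trade-off between goodness of fit and model complexity, since each additional parameter must improve the likelihood enough to offset its penalty.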
To do so, we’ll first split the data into train and test sets. As specified in the steps in the chapter introduction, the test set will consist of the last 36 months of data.
❶ Print out the length of the test set to make sure that it contains the last 36 months.
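A minimal sketch of such a split is shown here, using a plain Python list as a stand-in for the y column of the DataFrame. The placeholder values and the names N_OBS and TEST_LEN are assumptions for illustration; only the sizes (204 observations in total, with the last 36 held out) come from this chapter.

```python
N_OBS = 204     # total number of monthly observations in the dataset
TEST_LEN = 36   # the last 36 months are reserved for testing

# Placeholder values standing in for the y column of the DataFrame.
observations = list(range(N_OBS))

train = observations[:-TEST_LEN]
test = observations[-TEST_LEN:]

print(len(train), len(test))  # 168 36
```

Because the split is made by position rather than at random, the chronological order is preserved, and the model is never fitted on observations that come after the ones it is asked to forecast.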
With our split done, we can now use the optimize_SARIMAX function to find the values of p, q, P, and Q that minimize the AIC. Note that we can use optimize_SARIMAX here because SARIMA is a special case of the more general SARIMAX model. The function is shown in the following listing.
from typing import Union
from tqdm import tqdm_notebook
from statsmodels.tsa.statespace.sarimax import SARIMAX

def optimize_SARIMAX(endog: Union[pd.Series, list], exog: Union[pd.Series, list],
                     order_list: list, d: int, D: int, s: int) -> pd.DataFrame:

    results = []

    for order in tqdm_notebook(order_list):
        try:
            model = SARIMAX(
                endog,
                exog,
                order=(order[0], d, order[1]),
                seasonal_order=(order[2], D, order[3], s),
                simple_differencing=False).fit(disp=False)
        except Exception:
            # Skip parameter combinations for which the fit fails
            continue

        aic = model.aic
        results.append([order, aic])

    result_df = pd.DataFrame(results)
    result_df.columns = ['(p,q,P,Q)', 'AIC']

    # Sort in ascending order, lower AIC is better
    result_df = result_df.sort_values(by='AIC', ascending=True).reset_index(drop=True)

    return result_df
With the function defined, we can now decide on the range of values to try for p, q, P, and Q. Then we’ll generate a list of unique combinations of parameters. Feel free to test a different range of values than I’ve used here. Simply note that the larger the range, the longer it will take to run the optimize_SARIMAX function.
ps = range(0, 5, 1)
qs = range(0, 5, 1)
Ps = range(0, 5, 1)
Qs = range(0, 5, 1)

order_list = list(product(ps, qs, Ps, Qs))

d = 1
D = 1
s = 12
We can now run the optimize_SARIMAX function. In this example, 625 unique combinations are tested, since we have 5 possible values for 4 parameters.
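The size of the search grid can be checked with a few lines of standard-library Python. This sketch only counts the combinations and does not fit any model; the variable smaller_grid is a hypothetical alternative added for comparison.

```python
from itertools import product

# Same ranges as in the chapter: 5 candidate values for each of p, q, P, and Q.
ps = qs = Ps = Qs = range(0, 5)
order_list = list(product(ps, qs, Ps, Qs))

print(len(order_list))                 # 625
print(order_list[0], order_list[-1])   # (0, 0, 0, 0) (4, 4, 4, 4)

# Hypothetical narrower search: 4 candidate values per parameter.
smaller_grid = list(product(range(0, 4), repeat=4))
print(len(smaller_grid))               # 256
```

The grid grows with the fourth power of the number of candidate values, which is why widening the ranges quickly makes the search slower. Given the signature in the listing, and since we have no exogenous variables, the search would presumably be launched with something like optimize_SARIMAX(train, None, order_list, d, D, s); treat that exact call as an assumption.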
Once the function is finished, the result shows that the minimum AIC is achieved with p = 2, q = 3, P = 1, and Q = 3. Therefore, the optimal model is a SARIMA(2,1,3)(1,1,3)12 model.
Now that we have the optimal model, we must analyze its residuals to determine whether the model can be used or not. This will depend on the residuals, which should behave like white noise. If that is the case, the model can be used for forecasting.
We can fit the model and use the plot_diagnostics method to qualitatively analyze its residuals.
SARIMA_model = SARIMAX(train, order=(2,1,3), seasonal_order=(1,1,3,12), simple_differencing=False)
SARIMA_model_fit = SARIMA_model.fit(disp=False)

SARIMA_model_fit.plot_diagnostics(figsize=(10,8));
The result is shown in figure 11.4, and we can conclude from this qualitative analysis that the residuals closely resemble white noise.
The next step is to perform the Ljung-Box test, which determines whether the residuals are independent and uncorrelated. The null hypothesis of the Ljung-Box test states that the residuals are uncorrelated, just like white noise. Thus, we want the test to return p-values larger than 0.05. In that case, we cannot reject the null hypothesis and conclude that our residuals are independent, and therefore behave like white noise.
residuals = SARIMA_model_fit.resid

# Recent statsmodels releases return a DataFrame with lb_stat and lb_pvalue columns
# (older releases returned a tuple of arrays instead)
lb_result = acorr_ljungbox(residuals, lags=np.arange(1, 11, 1))

print(lb_result['lb_pvalue'])
In this case, all the p-values are above 0.05, so we do not reject the null hypothesis, and we conclude that the residuals are independent and uncorrelated. We can conclude that the model can be used for forecasting.
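To make the statistic behind the Ljung-Box test more concrete, here is a minimal pure-Python sketch that computes it by hand. It is not the statsmodels implementation; the helper names and the synthetic series are assumptions for illustration. The statistic is compared against 18.307, the 95% quantile of the chi-square distribution with 10 degrees of freedom, rather than converted to a p-value.

```python
import math


def autocorrelation(values, lag):
    """Sample autocorrelation of values at the given lag."""
    n = len(values)
    mean = sum(values) / n
    denom = sum((v - mean) ** 2 for v in values)
    numer = sum((values[t] - mean) * (values[t - lag] - mean) for t in range(lag, n))
    return numer / denom


def ljung_box_q(values, max_lag):
    """Ljung-Box statistic: Q = n(n+2) * sum over k=1..h of rho_k^2 / (n - k)."""
    n = len(values)
    return n * (n + 2) * sum(
        autocorrelation(values, k) ** 2 / (n - k) for k in range(1, max_lag + 1)
    )


# 95% quantile of the chi-square distribution with 10 degrees of freedom.
CHI2_95_DF10 = 18.307

# Synthetic "residuals" with an obvious 12-step cycle, so clearly not white noise.
cyclic = [math.sin(2 * math.pi * t / 12) for t in range(120)]
q_stat = ljung_box_q(cyclic, max_lag=10)

print(q_stat > CHI2_95_DF10)  # True: the null hypothesis would be rejected
```

Strongly autocorrelated values inflate the statistic far beyond the critical value, which corresponds to a p-value below 0.05. Residuals that behave like white noise keep the autocorrelations, and therefore the statistic, small.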
We have a model that can be used for forecasting, so we’ll now perform rolling forecasts of 12 months over the test set of 36 months. That way we’ll have a better evaluation of our model’s performance, as testing on fewer data points might lead to skewed results. We’ll use the naive seasonal forecast as a baseline; it will simply take the last 12 months of data and use them as forecasts for the next 12 months.
We’ll first define the rolling_forecast function to generate the predictions over the entire test set with a window of 12 months. The function is shown in the following listing.
def rolling_forecast(df: pd.DataFrame, train_len: int, horizon: int, window: int, method: str) -> list:

    total_len = train_len + horizon

    if method == 'last_season':
        pred_last_season = []

        for i in range(train_len, total_len, window):
            # Reuse the last `window` observed values as the forecast
            last_season = df['y'][i-window:i].values
            pred_last_season.extend(last_season)

        return pred_last_season

    elif method == 'SARIMA':
        pred_SARIMA = []

        for i in range(train_len, total_len, window):
            # Refit the model on all data up to the forecast origin i
            model = SARIMAX(df['y'][:i], order=(2,1,3), seasonal_order=(1,1,3,12), simple_differencing=False)
            res = model.fit(disp=False)
            predictions = res.get_prediction(0, i + window - 1)
            # Keep only the out-of-sample part of the predictions
            oos_pred = predictions.predicted_mean.iloc[-window:]
            pred_SARIMA.extend(oos_pred)

        return pred_SARIMA
Next, we’ll create a DataFrame to hold the predictions as well as the actual values. This is simply a copy of the test set.
Now we can define the parameters to be used for the rolling_forecast function. The dataset contains 204 rows, and the test set contains 36 data points, which means the length of the training set is 204 – 36 = 168. The horizon is 36, since our test set contains 36 months of data. Finally, the window is 12 months, as we are forecasting 12 months at a time.
With those values set, we can record the predictions coming from our baseline, which is a naive seasonal forecast. It simply takes the last 12 months of observed data and uses them as forecasts for the next 12 months.
TRAIN_LEN = 168
HORIZON = 36
WINDOW = 12

pred_df['last_season'] = rolling_forecast(df, TRAIN_LEN, HORIZON, WINDOW, 'last_season')
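To make the mechanics of the rolling baseline explicit, the following sketch reproduces the naive seasonal logic on a plain Python list. The function name naive_seasonal_forecast and the placeholder series are assumptions for illustration; each value in the series equals its own position, so it is easy to see which observations are reused as forecasts.

```python
def naive_seasonal_forecast(values, train_len, horizon, window):
    """Repeat the `window` observations that precede each forecast origin."""
    predictions = []
    for origin in range(train_len, train_len + horizon, window):
        predictions.extend(values[origin - window:origin])
    return predictions


# Placeholder series: the value at each position equals its index.
values = list(range(204))

preds = naive_seasonal_forecast(values, train_len=168, horizon=36, window=12)

print(list(range(168, 204, 12)))  # forecast origins: [168, 180, 192]
print(preds[0], preds[-1])        # 156 191
print(len(preds))                 # 36
```

Note that only the first window is forecast from training data alone. The second and third windows reuse observations from the test period, which is what makes the forecast "rolling": at each origin, everything observed so far is available to the method.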
Next, we’ll compute the forecasts from the SARIMA model.
At this point, pred_df contains the actual values, the forecasts from the naive seasonal method, and the forecasts from the SARIMA model. We can use this to visualize our forecasts against the actual values. For clarity, we’ll limit the x-axis to zoom in on the test period. The resulting plot is shown in figure 11.5.
fig, ax = plt.subplots()

ax.plot(df.y)
ax.plot(pred_df.y, 'b-', label='actual')
ax.plot(pred_df.last_season, 'r:', label='naive seasonal')
ax.plot(pred_df.SARIMA, 'k--', label='SARIMA')
ax.set_xlabel('Date')
ax.set_ylabel('Number of anti-diabetic drug prescriptions')
ax.axvspan(168, 204, color='#808080', alpha=0.2)

ax.legend(loc=2)

plt.xticks(np.arange(6, 203, 12), np.arange(1992, 2009, 1))
plt.xlim(120, 204)

fig.autofmt_xdate()
plt.tight_layout()
In figure 11.5 you can see that the predictions from the SARIMA model (the dashed line) follow the actual values more closely than the naive seasonal forecasts (the dotted line). We can therefore intuitively expect the SARIMA model to have performed better than the baseline method.
To evaluate the performance quantitatively, we’ll use the mean absolute percentage error (MAPE). The MAPE is easy to interpret, as it returns a percentage error.
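For reference, the MAPE over n forecasts can be written as follows, where y_t is the actual value and \hat{y}_t the forecast at time t (the symbols are notation chosen here for illustration):

```latex
\mathrm{MAPE} = \frac{100}{n} \sum_{t=1}^{n} \left| \frac{y_t - \hat{y}_t}{y_t} \right|
```

Because each error is divided by the actual value, the metric is undefined when an actual value is zero. That is not a concern here, since the number of prescriptions is always positive.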
def mape(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

mape_naive_seasonal = mape(pred_df.y, pred_df.last_season)
mape_SARIMA = mape(pred_df.y, pred_df.SARIMA)

print(mape_naive_seasonal, mape_SARIMA)
This prints out a MAPE of 12.69% for the baseline and 7.90% for the SARIMA model. We can optionally plot the MAPE of each model in a bar chart for a nice visualization, as shown in figure 11.6.
fig, ax = plt.subplots()

x = ['naive seasonal', 'SARIMA(2,1,3)(1,1,3,12)']
y = [mape_naive_seasonal, mape_SARIMA]

ax.bar(x, y, width=0.4)
ax.set_xlabel('Models')
ax.set_ylabel('MAPE (%)')
ax.set_ylim(0, 15)

for index, value in enumerate(y):
    plt.text(x=index, y=value + 1, s=str(round(value,2)), ha='center')

plt.tight_layout()
Since the SARIMA model achieves the lowest MAPE, we can conclude that the SARIMA(2,1,3)(1,1,3)12 model should be used to forecast the monthly number of antidiabetic drug prescriptions in Australia.
Congratulations on completing this capstone project. I hope that you were able to complete it on your own and that you now feel confident in your skills and knowledge of time series forecasting using statistical models.
Of course, practice makes perfect, so I highly encourage you to find other time series datasets and practice modeling and forecasting them. This will help you build your intuition and hone your skills.
In the next chapter, we’ll start a new section where we’ll use deep learning models to model and forecast complex time series with high dimensionality.