11 Capstone: Forecasting the number of antidiabetic drug prescriptions in Australia

This chapter covers

  • Developing a forecasting model to predict the number of antidiabetic drug prescriptions in Australia
  • Applying the modeling procedure with a SARIMA model
  • Evaluating our model against a baseline
  • Determining the champion model

We have covered a lot of statistical models for time series forecasting. Back in chapters 4 and 5, you learned how to model moving average processes and autoregressive processes. We then combined these models to form the ARMA model and added a parameter to forecast non-stationary time series, leading us to the ARIMA model. We then added a seasonal component with the SARIMA model. Adding the effect of exogenous variables culminated in the SARIMAX model. Finally, we covered multivariate time series forecasting using the VAR model. Thus, you now have access to many statistical models that allow you to forecast a wide variety of time series, from simple to more complex. This is a good time to consolidate your learning and put your knowledge into practice with a capstone project.

The objective of this chapter’s project is to forecast the number of antidiabetic drug prescriptions in Australia from 1991 to 2008. In a professional setting, solving this problem would allow us to plan the production of antidiabetic drugs, producing enough to meet demand while avoiding overproduction. The data we’ll use was recorded by the Australian Health Insurance Commission. The time series is shown in figure 11.1.

Figure 11.1 Monthly number of antidiabetic drug prescriptions in Australia between 1991 and 2008.

In figure 11.1 you’ll see a clear trend in the time series, as the number of prescriptions increases over time. Furthermore, you’ll observe strong seasonality, as each year seems to start at a low value and end at a high value. By now, you should intuitively know which model is potentially the most suitable for solving this problem.

To solve this problem, refer to the following steps:

  1. The objective is to forecast 12 months of antidiabetic drug prescriptions. Use the last 36 months of the dataset as a test set to allow for rolling forecasts.

  2. Visualize the time series.

  3. Use time series decomposition to extract the trend and seasonal components.

  4. Based on your exploration, determine the most suitable model.

  5. Model the series with the usual steps:

    1. Apply transformations to make it stationary
    2. Set the values of d, D, and m.
    3. Find the optimal (p,d,q)(P,D,Q)m parameters.
    4. Perform residual analysis to validate your model.
  6. Perform rolling forecasts of 12 months on the test set.

  7. Visualize your forecasts.

  8. Compare the model’s performance to a baseline. Select an appropriate baseline and error metric.

  9. Conclude whether the model should be used or not.

To get the most out of this capstone project, you are highly encouraged to complete it on your own by referring to the preceding steps. Doing so will help you assess both your autonomy in the modeling process and your understanding of the material.

If you ever feel stuck or want to validate your reasoning, the rest of this chapter walks through the completion of this project. Also, the full solution is available on GitHub if you wish to refer to the code directly: https://github.com/marcopeix/TimeSeriesForecastingInPython/tree/master/CH11.

I wish you luck on this project!

11.1 Importing the required libraries and loading the data

The natural first step is to import the libraries that will be needed to complete the project. We can then load the data and store it in a DataFrame to be used throughout the project.

Thus, we’ll import the following libraries and specify the magic function %matplotlib inline to display the plots in the notebook:

from sklearn.metrics import mean_squared_error, mean_absolute_error
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.seasonal import seasonal_decompose, STL
from statsmodels.stats.diagnostic import acorr_ljungbox
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.graphics.gofplots import qqplot
from statsmodels.tsa.stattools import adfuller
from tqdm import tqdm_notebook
from itertools import product
from typing import Union
 
import matplotlib.pyplot as plt
import statsmodels.api as sm
import pandas as pd
import numpy as np
 
import warnings
warnings.filterwarnings('ignore')
 
%matplotlib inline

Once the libraries are imported, we can read the data and store it in a DataFrame. We can also display the shape of the DataFrame to determine the number of data points.

df = pd.read_csv('data/AusAnti-diabeticDrug.csv')
print(df.shape)                                    

Displays the shape of a DataFrame. The first value is the number of rows, and the second value is the number of columns.
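Before moving on, it can be worth a quick peek at the contents. This is an optional sketch; it assumes only the y column that the rest of the chapter uses, since the name of the date column is not relied upon here.

print(df.head())        # first five rows
print(df.isna().sum())  # check for missing values

The dataset used in this chapter contains 204 rows of monthly data, a fact we’ll rely on later when splitting the series.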

The data is now ready to be used throughout the project.

11.2 Visualizing the series and its components

With the data loaded, we can now easily visualize the series. This essentially recreates figure 11.1.

fig, ax = plt.subplots()
 
ax.plot(df.y)
ax.set_xlabel('Date')
ax.set_ylabel('Number of anti-diabetic drug prescriptions')
 
# Place a tick every 12 months and label it with the year
plt.xticks(np.arange(6, 203, 12), np.arange(1992, 2009, 1))
 
fig.autofmt_xdate()
plt.tight_layout()

Next we can perform decomposition to visualize the different components of the time series. Remember that time series decomposition allows us to visualize the trend component, seasonal component, and the residuals.

decomposition = STL(df.y, period=12).fit()     
 
fig, (ax1, ax2, ax3, ax4) = plt.subplots(nrows=4, ncols=1, sharex=True,
                                         figsize=(10,8))
 
ax1.plot(decomposition.observed)
ax1.set_ylabel('Observed')
 
ax2.plot(decomposition.trend)
ax2.set_ylabel('Trend')
 
ax3.plot(decomposition.seasonal)
ax3.set_ylabel('Seasonal')
 
ax4.plot(decomposition.resid)
ax4.set_ylabel('Residuals')
 
plt.xticks(np.arange(6, 203, 12), np.arange(1992, 2009, 1))
 
fig.autofmt_xdate()
plt.tight_layout()

Column y holds the number of monthly antidiabetic prescriptions. Also, the period is set to 12, since we have monthly data.

The result is shown in figure 11.2. Everything seems to suggest that a SARIMA(p,d,q)(P,D,Q)m model would be the optimal solution for forecasting this time series. We have a trend as well as clear seasonality. Plus, we do not have any exogenous variables to work with, so the SARIMAX model cannot be applied. Finally, we wish to predict only one target, meaning that a VAR model is also not relevant in this case.

Figure 11.2 Time series decomposition on the antidiabetic drug prescriptions dataset. The first plot shows the observed data. The second plot shows the trend component, which tells us that the number of antidiabetic drug prescriptions is increasing over time. The third plot shows the seasonal component, where we can see a repeating pattern over time, indicating the presence of seasonality. The last plot shows the residuals, which are variations that are not explained by the trend or the seasonal component.

11.3 Modeling the data

We’ve decided that a SARIMA(p,d,q)(P,D,Q)m model is the most suitable for modeling and forecasting this time series. Therefore, we’ll follow the general modeling procedure for a SARIMAX model, as a SARIMA model is a special case of the SARIMAX model. The modeling procedure is shown in figure 11.3.

Figure 11.3 The SARIMA modeling procedure. This procedure is the most general modeling procedure, and it can be used for a SARIMA, ARIMA, or ARMA model, as they are simply special cases of the SARIMAX model.

Following the modeling procedure outlined in figure 11.3, we’ll first determine whether the series is stationary using the augmented Dickey-Fuller (ADF) test.

ad_fuller_result = adfuller(df.y)
 
print(f'ADF Statistic: {ad_fuller_result[0]}')
print(f'p-value: {ad_fuller_result[1]}')

This returns a p-value of 1.0, meaning that we cannot reject the null hypothesis, and we conclude that the series is not stationary. Thus, we must apply transformations to make it stationary.

We’ll first apply first-order differencing to the data and test for stationarity again.

y_diff = np.diff(df.y, n=1)
 
ad_fuller_result = adfuller(y_diff)
 
print(f'ADF Statistic: {ad_fuller_result[0]}')
print(f'p-value: {ad_fuller_result[1]}')

This returns a p-value of 0.12. Again, the p-value is greater than 0.05, meaning that the series is not stationary. Let’s try applying a seasonal difference, since we noticed a strong seasonal pattern in the data. Recall that we have monthly data, meaning that m = 12. Thus, a seasonal difference subtracts values that are 12 timesteps apart.

y_diff_seasonal_diff = y_diff[12:] - y_diff[:-12]      
 
ad_fuller_result = adfuller(y_diff_seasonal_diff)
 
print(f'ADF Statistic: {ad_fuller_result[0]}')
print(f'p-value: {ad_fuller_result[1]}')

We have monthly data, so the seasonal difference subtracts values that are 12 timesteps apart. Note that np.diff(y_diff, n=12) would instead difference the series 12 times in a row, which is not the same operation.

The returned p-value falls well below 0.05. Thus, we can reject the null hypothesis and conclude that our time series is stationary.

Since we differenced the series once and took one seasonal difference, d = 1 and D = 1. Also, since we have monthly data, we know that m = 12. Therefore, we know that our final model will be a SARIMA(p,1,q)(P,1,Q)12 model.
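As an optional cross-check, the same two transformations can be expressed with pandas, since Series.diff(12) subtracts values that are 12 timesteps apart. This sketch should lead to the same conclusion as the NumPy version above.

# First-order difference followed by a lag-12 seasonal difference
y_stationary = df.y.diff(1).diff(12).dropna()

ad_fuller_result = adfuller(y_stationary)
print(f'p-value: {ad_fuller_result[1]}')   # expected to be well below 0.05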

11.3.1 Performing model selection

We have established that our model will be a SARIMA(p,1,q)(P,1,Q)12 model. Now we need to find the optimal values of p, q, P, and Q. This is the model selection step where we choose the parameters that minimize the Akaike information criterion (AIC).

To do so, we’ll first split the data into train and test sets. As specified in the steps in the chapter introduction, the test set will consist of the last 36 months of data.

train = df.y[:168]
test = df.y[168:]
 
print(len(test))      

Print out the length of the test set to make sure that it contains the last 36 months.

With our split done, we can now use the optimize_SARIMAX function to find the values of p, q, P, and Q that minimize the AIC. Note that we can use optimize_SARIMAX here because SARIMA is a special case of the more general SARIMAX model. The function is shown in the following listing.

Listing 11.1 Function to find the values of p, q, P, and Q that minimize the AIC

from typing import Union
from tqdm import tqdm_notebook
from statsmodels.tsa.statespace.sarimax import SARIMAX
 
def optimize_SARIMAX(endog: Union[pd.Series, list], exog: Union[pd.Series, list],
                     order_list: list, d: int, D: int, s: int) -> pd.DataFrame:
    
    results = []
    
    for order in tqdm_notebook(order_list):
        try:
            model = SARIMAX(
                endog,
                exog,
                order=(order[0], d, order[1]),
                seasonal_order=(order[2], D, order[3], s),
                simple_differencing=False).fit(disp=False)
        except Exception:
            continue          # skip combinations that fail to fit
        
        aic = model.aic
        results.append([order, aic])
    
    result_df = pd.DataFrame(results)
    result_df.columns = ['(p,q,P,Q)', 'AIC']
    
    # Sort in ascending order, since a lower AIC is better
    result_df = result_df.sort_values(by='AIC', ascending=True).reset_index(drop=True)
    
    return result_df

With the function defined, we can now decide on the range of values to try for p, q, P, and Q. Then we’ll generate a list of unique combinations of parameters. Feel free to test a different range of values than I’ve used here. Simply note that the larger the range, the longer it will take to run the optimize_SARIMAX function.

ps = range(0, 5, 1)
qs = range(0, 5, 1)
Ps = range(0, 5, 1)
Qs = range(0, 5, 1)
 
order_list = list(product(ps, qs, Ps, Qs))
 
d = 1
D = 1
s = 12
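
Before running the search, it can be worth confirming the size of the grid, since each combination requires fitting a full model:

print(len(order_list))    # 625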

We can now run the optimize_SARIMAX function. In this example, 625 unique combinations are tested, since there are 5 possible values for each of the 4 parameters.

SARIMA_result_df = optimize_SARIMAX(train, None, order_list, d, D, s)
SARIMA_result_df

Once the function is finished, the result shows that the minimum AIC is achieved with p = 2, q = 3, P = 1, and Q = 3. Therefore, the optimal model is a SARIMA(2,1,3)(1,1,3)12 model.
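If you want to read the winning combination off programmatically rather than by eye, the first row of the sorted DataFrame holds it. A small sketch, relying on the column names defined in Listing 11.1:

# The DataFrame returned by optimize_SARIMAX is sorted by AIC in
# ascending order, so row 0 holds the best combination.
best_p, best_q, best_P, best_Q = SARIMA_result_df['(p,q,P,Q)'][0]
print(best_p, best_q, best_P, best_Q)   # expected: 2 3 1 3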

11.3.2 Conducting residual analysis

Now that we have the optimal model, we must analyze its residuals to determine whether the model can be used. If the residuals behave like white noise, the model is suitable for forecasting.

We can fit the model and use the plot_diagnostics method to qualitatively analyze its residuals.

SARIMA_model = SARIMAX(train, order=(2,1,3), seasonal_order=(1,1,3,12),
                       simple_differencing=False)
SARIMA_model_fit = SARIMA_model.fit(disp=False)
 
SARIMA_model_fit.plot_diagnostics(figsize=(10,8));

The result is shown in figure 11.4, and we can conclude from this qualitative analysis that the residuals closely resemble white noise.

Figure 11.4 Visual diagnostics of the residuals. In the top-left plot, the residuals have no trend over time, and the variance seems constant. At the top right, the distribution of the residuals is very close to a normal distribution. This is further supported by the Q-Q plot at the bottom left, which displays a fairly straight line that sits on y = x. Finally, the correlogram at the bottom right shows no significant coefficients after lag 0, just like white noise.

The next step is to perform the Ljung-Box test, which determines whether the residuals are independent and uncorrelated. The null hypothesis of the Ljung-Box test states that the residuals are uncorrelated, just like white noise. Thus, we want the test to return p-values larger than 0.05. In that case, we cannot reject the null hypothesis and conclude that our residuals are independent, and therefore behave like white noise.

residuals = SARIMA_model_fit.resid
 
# Recent versions of statsmodels return a DataFrame with
# 'lb_stat' and 'lb_pvalue' columns.
ljung_box_df = acorr_ljungbox(residuals, lags=np.arange(1, 11, 1))
 
print(ljung_box_df['lb_pvalue'])

In this case, all the p-values are above 0.05, so we cannot reject the null hypothesis, and we conclude that the residuals are independent and uncorrelated, just like white noise. The model can therefore be used for forecasting.

11.4 Forecasting and evaluating the model’s performance

We have a model that can be used for forecasting, so we’ll now perform rolling forecasts of 12 months over the test set of 36 months. That way we’ll have a better evaluation of our model’s performance, as testing on fewer data points might lead to skewed results. We’ll use the naive seasonal forecast as a baseline; it will simply take the last 12 months of data and use them as forecasts for the next 12 months.

We’ll first define the rolling_forecast function to generate the predictions over the entire test set with a window of 12 months. The function is shown in the following listing.

Listing 11.2 Function to perform a rolling forecast over a horizon

def rolling_forecast(df: pd.DataFrame, train_len: int, horizon: int,
                     window: int, method: str) -> list:
 
    total_len = train_len + horizon
 
    if method == 'last_season':
        pred_last_season = []
        
        for i in range(train_len, total_len, window):
            # Use the last observed season as the forecast for the next window
            last_season = df['y'][i-window:i].values
            pred_last_season.extend(last_season)
            
        return pred_last_season
    
    elif method == 'SARIMA':
        pred_SARIMA = []
        
        for i in range(train_len, total_len, window):
            # Refit the model on all data available up to the current step
            model = SARIMAX(df['y'][:i], order=(2,1,3),
                            seasonal_order=(1,1,3,12),
                            simple_differencing=False)
            res = model.fit(disp=False)
            predictions = res.get_prediction(0, i + window - 1)
            # Keep only the out-of-sample portion of the predictions
            oos_pred = predictions.predicted_mean.iloc[-window:]
            pred_SARIMA.extend(oos_pred)
            
        return pred_SARIMA

Next, we’ll create a DataFrame to hold the predictions as well as the actual values. This is simply a copy of the test set.

pred_df = df[168:].copy()   # .copy() avoids a SettingWithCopyWarning when we add columns

Now we can define the parameters to be used for the rolling_forecast function. The dataset contains 204 rows, and the test set contains 36 data points, which means the length of the training set is 204 – 36 = 168. The horizon is 36, since our test set contains 36 months of data. Finally, the window is 12 months, as we are forecasting 12 months at a time.

With those values set, we can record the predictions coming from our baseline, which is a naive seasonal forecast. It simply takes the last 12 months of observed data and uses them as forecasts for the next 12 months.

TRAIN_LEN = 168
HORIZON = 36
WINDOW = 12
 
pred_df['last_season'] = rolling_forecast(df, TRAIN_LEN, HORIZON, WINDOW,
                                          'last_season')

Next, we’ll compute the forecasts from the SARIMA model.

pred_df['SARIMA'] = rolling_forecast(df, TRAIN_LEN, HORIZON, WINDOW,
                                     'SARIMA')

At this point, pred_df contains the actual values, the forecasts from the naive seasonal method, and the forecasts from the SARIMA model. We can use this to visualize our forecasts against the actual values. For clarity, we’ll limit the x-axis to zoom in on the test period. The resulting plot is shown in figure 11.5.

fig, ax = plt.subplots()
 
ax.plot(df.y)
ax.plot(pred_df.y, 'b-', label='actual')
ax.plot(pred_df.last_season, 'r:', label='naive seasonal')
ax.plot(pred_df.SARIMA, 'k--', label='SARIMA')
ax.set_xlabel('Date')
ax.set_ylabel('Number of anti-diabetic drug prescriptions')
ax.axvspan(168, 204, color='#808080', alpha=0.2)   # shade the test period
 
ax.legend(loc=2)
 
plt.xticks(np.arange(6, 203, 12), np.arange(1992, 2009, 1))
plt.xlim(120, 204)
 
fig.autofmt_xdate()
plt.tight_layout()

Figure 11.5 Forecasts of the number of antidiabetic drug prescriptions in Australia. The predictions from the baseline are shown as a dotted line, while the predictions from the SARIMA model are shown as a dashed line.

In figure 11.5 you can see that the predictions from the SARIMA model (the dashed line) follow the actual values more closely than the naive seasonal forecasts (the dotted line). We can therefore intuitively expect the SARIMA model to have performed better than the baseline method.

To evaluate the performance quantitatively, we’ll use the mean absolute percentage error (MAPE). The MAPE is easy to interpret, as it returns a percentage error.
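For reference, the MAPE is defined as

MAPE = (100 / n) * Σ |y_true,i − y_pred,i| / |y_true,i|

where n is the number of forecasts. Since every value in this series is positive, the implementation below can safely divide by y_true directly.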

def mape(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
 
mape_naive_seasonal = mape(pred_df.y, pred_df.last_season)
mape_SARIMA = mape(pred_df.y, pred_df.SARIMA)
 
print(mape_naive_seasonal, mape_SARIMA)

This prints out a MAPE of 12.69% for the baseline and 7.90% for the SARIMA model. We can optionally plot the MAPE of each model in a bar chart for a nice visualization, as shown in figure 11.6.

fig, ax = plt.subplots()
 
x = ['naive seasonal', 'SARIMA(2,1,3)(1,1,3,12)']
y = [mape_naive_seasonal, mape_SARIMA]
 
ax.bar(x, y, width=0.4)
ax.set_xlabel('Models')
ax.set_ylabel('MAPE (%)')
ax.set_ylim(0, 15)
 
for index, value in enumerate(y):
    plt.text(x=index, y=value + 1, s=str(round(value,2)), ha='center')
 
plt.tight_layout()

Figure 11.6 The MAPE for the naive seasonal forecast and the SARIMA model. Since the MAPE of the SARIMA model is lower than the MAPE of the baseline, we can conclude that the SARIMA model should be used to forecast the number of antidiabetic drug prescriptions.

Since the SARIMA model achieves the lowest MAPE, we can conclude that the SARIMA(2,1,3)(1,1,3)12 model should be used to forecast the monthly number of antidiabetic drug prescriptions in Australia.

Next steps

Congratulations on completing this capstone project. I hope that you were able to complete it on your own and that you now feel confident in your skills and knowledge of time series forecasting using statistical models.

Of course, practice makes perfect, so I highly encourage you to find other time series datasets and practice modeling and forecasting them. This will help you build your intuition and hone your skills.

In the next chapter, we’ll start a new section where we’ll use deep learning models to model and forecast complex time series with high dimensionality.
