12 Introducing deep learning for time series forecasting

This chapter covers

  • Using deep learning for forecasting
  • Exploring different types of deep learning models
  • Getting ready to apply deep learning to time series forecasting

In the last chapter, we concluded the part of the book on time series forecasting using statistical models. Those models work particularly well when you have small datasets (usually less than 10,000 data points), and when the seasonal period is monthly, quarterly, or yearly. In situations where you have daily seasonality or where the dataset is very large (more than 10,000 data points), those statistical models become very slow, and their performance degrades.

Thus, we turn to deep learning. Deep learning is a subset of machine learning that focuses on building models based on neural network architectures. Deep learning has the advantage that it tends to perform better as more data is available, making it a great choice for forecasting high-dimensional time series.

In this part of the book, we’ll explore various model architectures so you’ll have a set of tools for tackling virtually any time series forecasting problem. Note that I’ll assume you have some familiarity with deep learning, so topics such as activation functions, loss functions, batches, layers, and epochs should be known. This part of the book will not serve as an introduction to deep learning, but rather focuses on applying deep learning to time series forecasting. Of course, each model architecture will be thoroughly explained, and you will gain an intuition as to why a particular architecture might work better than another in particular situations. Throughout these chapters, we will use TensorFlow, or more specifically Keras, to build different deep learning models.

In this chapter specifically, we identify the conditions that justify the use of deep learning and explore the different types of models that can be built, such as single-step, multi-step, and multi-output models. We’ll conclude the chapter with the initial setup that will get us ready to apply deep learning models in the following chapters. Finally, we’ll explore the data, perform feature engineering, and split the data into training, validation, and testing sets.

12.1 When to use deep learning for time series forecasting

Deep learning shines when we have large complex datasets. In those situations, deep learning can leverage all the available data to infer relationships between each feature and the target, usually resulting in good forecasts.

In the context of time series, a dataset is considered to be large when we have more than 10,000 data points. Of course, this is an approximation rather than a hard limit, so if you have 8,000 data points, deep learning could still be a viable option. When the dataset is that large, any variant of the SARIMAX model will take a long time to fit, which is not ideal for model selection, as we usually fit many models during that step.

If your data has multiple seasonal periods, the SARIMAX model cannot be used. For example, suppose you must forecast the hourly temperature. It is reasonable to assume that there will be daily seasonality, as temperature tends to be lower at night and higher during the day, but there is also yearly seasonality, due to temperatures being lower in winter and higher during summer. In such a case, deep learning can be used to leverage the information from both seasonal periods to make forecasts. In fact, from experience, fitting a SARIMA model in such a case will usually result in residuals that are not normally distributed and still correlated, meaning that the model cannot be used at all.

Ultimately, deep learning is used either when statistical models take too much time to fit or when they result in correlated residuals that do not approximate white noise. This can be due to the fact that there is another seasonal period that cannot be considered in the model, or simply because there is a nonlinear relationship between the features and the target. In those cases, deep learning models can be used to capture this nonlinear relationship, and they have the added advantage of being very fast to train.

12.2 Exploring the different types of deep learning models

There are three main types of deep learning models that we can build for time series forecasting: single-step models, multi-step models, and multi-output models.

The single-step model is the simplest of the three. Its output is a single value representing the forecast of one variable one step into the future. The model therefore simply returns a scalar, as shown in figure 12.1.

Figure 12.1 The single-step model outputs the value of one target one timestep into the future. The output is therefore a scalar.

Single-step model

The single-step model outputs a single value representing the prediction for the next timestep. The input can be of any length, but the output remains a single prediction for the next timestep.

Next we can have a multi-step model, meaning that we output the value for one target, but for many timesteps into the future. For example, given hourly data, we may want to forecast the next 24 hours. In that case, we have a multi-step model, since we are forecasting 24 timesteps into the future. The output is a 24 × 1 matrix, as shown in figure 12.2.

Figure 12.2 A multi-step model outputs the predictions for one variable multiple timesteps into the future. This example predicts 24 timesteps, resulting in a 24 × 1 output matrix.

Multi-step model

In a multi-step model, the output of the model is a sequence of values representing predictions for many timesteps into the future. For example, if the model predicts the next 6 hours, 24 hours, or 12 months, it is a multi-step model.

Finally, the multi-output model generates predictions for more than one target. For example, if we were to predict both the temperature and the humidity, we would use a multi-output model. This model can output as many timesteps as desired. Figure 12.3 shows a multi-output model returning predictions for two targets for the next 24 timesteps. In that particular case, the output is a 24 × 2 matrix.

Figure 12.3 A multi-output model makes predictions for more than one target for one or more timesteps in the future. Here the model outputs predictions for two targets for the next 24 timesteps.

Multi-output model

A multi-output model generates predictions for more than one target. For example, if we forecast the temperature and wind speed, it is a multi-output model.

Each of these models can have different architectures. For example, a convolutional neural network can be used as a single-step model, a multi-step model, or a multi-output model. In the following chapters, we will implement different model architectures and apply them for all three model types.
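To make the distinction concrete, here is a minimal Keras sketch, with hypothetical layer sizes rather than the architectures we'll build in the following chapters, showing that the three model types differ mainly in the shape of their output. The input is assumed to be a window of 24 past timesteps with 5 features.

import tensorflow as tf

# Shared input and hidden layer (hypothetical sizes, for illustration only)
inputs = tf.keras.Input(shape=(24, 5))                # 24 past timesteps, 5 features
x = tf.keras.layers.Flatten()(inputs)
x = tf.keras.layers.Dense(32, activation='relu')(x)

# Single-step model: one target, one timestep -> a single scalar
single_step = tf.keras.layers.Dense(1)(x)

# Multi-step model: one target, 24 timesteps -> a 24 x 1 output
multi_step = tf.keras.layers.Reshape((24, 1))(tf.keras.layers.Dense(24)(x))

# Multi-output model: two targets, 24 timesteps -> a 24 x 2 output
multi_output = tf.keras.layers.Reshape((24, 2))(tf.keras.layers.Dense(24 * 2)(x))

model = tf.keras.Model(inputs, [single_step, multi_step, multi_output])
model.summary()

In practice, we'll build and train each type of model separately; the combined model above only serves to compare the output shapes side by side.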

This brings us to the stage where we’ll do the initial setup for the different deep learning models we’ll implement in the next five chapters.

12.3 Getting ready to apply deep learning for forecasting

From here through chapter 17, we will use the Metro Interstate Traffic Volume dataset, available in the UCI Machine Learning Repository. The original dataset recorded the hourly westbound traffic on I-94 between Minneapolis and St. Paul in Minnesota, from 2012 to 2018. For the purpose of learning how to apply deep learning to time series forecasting, the dataset has been shortened and cleaned to remove missing values. While the cleaning steps are not covered in this chapter, you can consult the preprocessing code in the GitHub repository for this chapter. Our main forecasting goal is to predict the hourly traffic volume. In the case of multi-output models, we will also forecast the hourly temperature. As the initial setup for the next several chapters, we'll load the data, perform feature engineering, and split it into training, validation, and testing sets.

We will use TensorFlow, or more specifically Keras, in this part of the book. At the time of writing, the latest stable version of TensorFlow was 2.6.0, which is what I’ll use in this and the following chapters.
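If you want to confirm which version you have installed, you can print it (this quick check is not part of the chapter's listings):

import tensorflow as tf

print(tf.__version__)   # the code in this part of the book was written against 2.6.0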

Note The full source code for this chapter is available on GitHub: https://github.com/marcopeix/TimeSeriesForecastingInPython/tree/master/CH12.

12.3.1 Performing data exploration

We will first load the data using pandas.

import datetime
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.read_csv('../data/metro_interstate_traffic_volume_preprocessed.csv')
df.head()

As mentioned, this dataset is a shortened and cleaned version of the original dataset available on the UCI machine learning repository. In this case, the dataset starts on September 29, 2016, at 5 p.m. and ends on September 30, 2018, at 11 p.m. Using df.shape, we can see that we have a total of six features and 17,551 rows.
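You can check this directly:

df.shape    # returns (17551, 6): 17,551 rows and 6 features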

The features include the date and time, the temperature, the amount of rain and snow, the cloud coverage, as well as the traffic volume. Table 12.1 describes each column in more detail.

Table 12.1 The variables in the metro interstate traffic volume dataset

Feature          Description
date_time        Date and time of the data, recorded in the CST time zone. The format is YYYY-MM-DD HH:MM:SS.
temp             Average temperature recorded in the hour, expressed in Kelvin.
rain_1h          Amount of rain that occurred in the hour, expressed in millimeters.
snow_1h          Amount of snow that occurred in the hour, expressed in millimeters.
clouds_all       Percentage of cloud cover during the hour.
traffic_volume   Volume of traffic reported westbound on I-94 during the hour.

Now, let’s visualize the evolution of the traffic volume over time. Since our dataset is very large, with more than 17,000 records, we’ll plot only the first 400 data points, which is roughly equivalent to two weeks of data. The result is shown in figure 12.4.

fig, ax = plt.subplots()
 
ax.plot(df['traffic_volume'])
ax.set_xlabel('Time')
ax.set_ylabel('Traffic volume')
 
plt.xticks(np.arange(7, 400, 24), ['Friday', 'Saturday', 'Sunday', 
 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 
 'Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 
 'Saturday', 'Sunday'])
plt.xlim(0, 400)
 
fig.autofmt_xdate()
plt.tight_layout()

In figure 12.4 you’ll notice clear daily seasonality, since the traffic volume is lower at the start and end of each day. You’ll also see a smaller traffic volume during the weekends. As for the trend, two weeks of data is likely insufficient to draw a reasonable conclusion, but it seems that the volume is neither increasing nor decreasing over time in the figure.

Figure 12.4 Westbound traffic volume on I-94 between Minneapolis and St. Paul in Minnesota, starting on September 29, 2016, at 5 p.m. You’ll notice clear daily seasonality, with traffic being lower at the start and end of each day.

We can also plot the hourly temperature, as it will be a target for our multi-output models. Here, we’ll expect to see both yearly and daily seasonality. The yearly seasonality should be due to the seasons in the year, while the daily seasonality will be due to the fact that temperatures tend to be lower at night and higher during the day.

Let’s first visualize the hourly temperature over the entire dataset to see if we can identify any yearly seasonality. The result is shown in figure 12.5.

fig, ax = plt.subplots()
 
ax.plot(df['temp'])
ax.set_xlabel('Time')
ax.set_ylabel('Temperature (K)')
 
plt.xticks([2239, 10999], [2017, 2018])
 
fig.autofmt_xdate()
plt.tight_layout()

Figure 12.5 Hourly temperature (in Kelvin) from September 29, 2016, to September 30, 2018. Although there is noise, we can see a yearly seasonal pattern.

In figure 12.5 you’ll see a yearly seasonal pattern in the hourly temperature, since temperatures are lower at the end and beginning of the year (winter in Minnesota), and higher in the middle of the year (summer). Thus, as expected, the temperature has yearly seasonality.

Now let’s verify whether we can observe daily seasonality in temperature. The result is shown in figure 12.6.

fig, ax = plt.subplots()
 
ax.plot(df['temp'])
ax.set_xlabel('Time')
ax.set_ylabel('Temperature (K)')
 
plt.xticks(np.arange(7, 400, 24), ['Friday', 'Saturday', 'Sunday', 
 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 
 'Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 
 'Saturday', 'Sunday'])
plt.xlim(0, 400)
 
fig.autofmt_xdate()
plt.tight_layout()

Figure 12.6 Hourly temperature (in Kelvin) starting on September 29, 2016, at 5 p.m. CST. Although it is a bit noisy, we can see that temperatures are indeed lower at the start and end of each day and peak during midday, suggesting daily seasonality.

In figure 12.6 you’ll notice that the temperature is indeed lower at the start and end of each day and peaks toward the middle of each day. This suggests daily seasonality, just as we observed for traffic volume in figure 12.4.

12.3.2 Feature engineering and data splitting

With our data exploration done, we’ll move on to feature engineering and data splitting. In this section, we will study each feature and create new ones that will help our models forecast the traffic volume and hourly temperature. Finally, we’ll split the data and save each set as a CSV file for later use.

A great way to study the features of a dataset is to use the describe method from pandas. This method returns the number of records for each feature, allowing us to quickly identify missing values, the mean, standard deviation, quartiles, and maximum and minimum values for each feature.

df.describe().transpose()    # transpose puts each feature on its own row

From the output, you’ll notice that rain_1h is mostly 0 throughout the dataset, as its third quartile is still at 0. Since at least 75% of the values for rain_1h are 0, it is unlikely that it is a strong predictor of traffic volume. Thus, this feature will be removed.

Looking at snow_1h, you'll notice that this variable is 0 throughout the entire dataset. This is easy to see, since its minimum and maximum values are both 0. Thus, it is not predictive of the variation in traffic volume over time. This feature will also be removed from the dataset.

cols_to_drop = ['rain_1h', 'snow_1h']
df = df.drop(cols_to_drop, axis=1)

Now we reach the interesting problem of encoding time as a usable feature for our deep learning models. Right now, the date_time feature is not usable by our models, since it is a datetime string. We will thus convert it into a numerical value.

A simple way to do that is to express the date as a number of seconds. This is achieved through the use of the timestamp method from the datetime library.

timestamp_s = pd.to_datetime(df['date_time']).map(datetime.datetime.timestamp)

Unfortunately, we are not done, as this simply expresses each date as a number of seconds, as shown in figure 12.7. This loses the cyclical nature of time, because the number of seconds simply increases linearly with time.

Figure 12.7 Number of seconds expressing each date in the dataset. The number of seconds linearly increases with time, meaning that we lose the cyclical property of time.

Therefore, we must apply a transformation to recover the cyclical behavior of time. A simple way to do that is to apply a sine transformation. We know that the sine function is cyclical, bounded between –1 and 1. This will help us regain part of the cyclical property of time.

day = 24 * 60 * 60   # the timestamp is in seconds, so we first compute the number of seconds in a day

df['day_sin'] = (np.sin(timestamp_s * (2*np.pi/day))).values   # sine transformation (the argument is in radians)

With a single sine transformation, we regain some of the cyclical property that was lost when converting to seconds. However, a single sine value does not uniquely identify the time of day: 12 a.m. and 12 p.m. produce the same value, as do 5 a.m. and 7 a.m. This is undesired, as we want each time of day to be encoded distinctly. Thus, we'll also apply a cosine transformation. Because cosine is out of phase with the sine function, the pair of sine and cosine values identifies each time of day uniquely, expressing the cyclical nature of time in a day. At this point, we can remove the date_time column from the DataFrame.

df['day_cos'] = (np.cos(timestamp_s * (2*np.pi/day))).values   # cosine transformation of the timestamp in seconds
df = df.drop(['date_time'], axis=1)                            # remove the date_time column

We can quickly convince ourselves that these transformations worked by plotting a sample of day_sin and day_cos. The result is shown in figure 12.8.

df.sample(50).plot.scatter('day_sin','day_cos').set_aspect('equal');

Figure 12.8 Plot of a sample of the day_sin and day_cos encoding. We have successfully encoded the time as a numerical value while keeping the daily cycle.

In figure 12.8 you’ll notice that the points form a circle, just like a clock. Therefore, we have successfully expressed each timestamp as a point on the clock, meaning that we now have numerical values that retain the cyclical nature of time in a day, and this can be used in our deep learning models. This will be useful since we observed daily seasonality for both the temperature and the volume of traffic.
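As a quick numerical check (this snippet is illustrative and not part of the chapter's listings), we can verify that two times sharing the same sine value, such as 5 a.m. and 7 a.m., are told apart by their cosine value:

import numpy as np

day = 24 * 60 * 60   # number of seconds in a day

# 5 a.m. and 7 a.m. share the same day_sin value, but their day_cos values differ,
# so the (day_sin, day_cos) pair uniquely identifies the time of day.
for hour in [5, 7]:
    t = hour * 60 * 60
    print(f"{hour}:00 -> sin={np.sin(t * 2 * np.pi / day):.3f}, cos={np.cos(t * 2 * np.pi / day):.3f}")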

With the feature engineering complete, we can now split our data into train, validation, and test sets. The train set is the sample of data used to fit the model. The validation set is a bit like a test set that the model can peek at during training to tune its hyperparameters and improve its performance. The test set is kept completely separate from the training procedure and is used for an unbiased evaluation of the model’s performance.

Here we’ll use a simple 70:20:10 split for the train, validation, and test sets. While 10% of the data seems like a small portion for the test set, remember that we have more than 17,000 records, meaning that we will evaluate the model on more than 1,000 data points, which is more than enough.

n = len(df)

# Split 70:20:10 (train:validation:test)
train_df = df[0:int(n*0.7)]          # first 70% goes to the train set
val_df = df[int(n*0.7):int(n*0.9)]   # next 20% goes to the validation set
test_df = df[int(n*0.9):]            # remaining 10% goes to the test set

Before saving the data, we must scale it so all values are between 0 and 1. This decreases the time required for training deep learning models, and it improves their performance. We’ll use MinMaxScaler from sklearn to scale our data.

Note that we will fit the scaler on the training set to avoid data leakage. That way, we are simulating the fact that we only have the training data available when we’re using the model, and no future information is known by the model. The evaluation of the model remains unbiased.

from sklearn.preprocessing import MinMaxScaler
 
scaler = MinMaxScaler()
scaler.fit(train_df)   # fit the scaler on the training set only

train_df[train_df.columns] = scaler.transform(train_df[train_df.columns])
val_df[val_df.columns] = scaler.transform(val_df[val_df.columns])
test_df[test_df.columns] = scaler.transform(test_df[test_df.columns])

It is worth mentioning why the data is scaled and not normalized. Scaling and normalization can be confusing terms for data scientists, as they are often used interchangeably. In short, scaling the data affects only its scale and not its distribution. Thus, it simply forces the values into a certain range. In our case, we force the values to be between 0 and 1.

Normalizing the data, on the other hand, affects both its distribution and its scale, forcing the values toward a normal, or Gaussian, distribution. The original range changes as well, and plotting the frequency of each value would generate the classic bell curve.

Normalizing the data is only useful when the model we use requires the data to be normal. For example, linear discriminant analysis (LDA) is derived from the assumption of a normal distribution, so it is better to normalize data before using LDA. Deep learning models, however, make no such assumption, so normalizing is not required.
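To make the min-max scaling concrete, here is a tiny illustration with made-up values; only the range changes, not the shape of the distribution:

import numpy as np

x = np.array([270.0, 285.0, 300.0])             # made-up temperatures in Kelvin
x_scaled = (x - x.min()) / (x.max() - x.min())  # min-max scaling: (x - min) / (max - min)
print(x_scaled)                                 # [0.  0.5 1. ]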

Finally, we’ll save each set as a CSV file for use in the following chapters.

train_df.to_csv('../data/train.csv')
val_df.to_csv('../data/val.csv')
test_df.to_csv('../data/test.csv')

12.4 Next steps

In this chapter, we looked at the use of deep learning for forecasting and covered the three main types of deep learning models. We then explored the data we’ll be using and performed feature engineering so the data is ready to be used in the next chapter, where we’ll apply deep learning models to forecast traffic volume.

In the next chapter, we will start by implementing baseline models that will serve as benchmarks for more complex deep learning architectures. We will also implement linear models, the simplest models that can be built, followed by deep neural networks, which have at least one hidden layer. Baselines, linear models, and deep neural networks will be implemented as single-step models, multi-step models, and multi-output models. You should be excited for the next chapter, as we’ll start modeling and forecasting using deep learning.

12.5 Exercise

As an exercise, we will prepare some data for use in deep learning exercises in chapters 12 through 18. This data will be used to develop a deep learning model to forecast the air quality in Beijing at the Aotizhongxin station.

Specifically, for univariate modeling, we will ultimately predict the concentration of nitrogen dioxide (NO2). For the multivariate problem, we will predict the concentration of nitrogen dioxide and temperature.

Note Predicting the concentration of air pollutants is an important problem, as they can have negative health effects on a population, such as coughing, wheezing, inflammation, and reduced lung function. Temperature also plays an important role, because hot air tends to rise, creating a convection effect and moving pollutants from the ground to higher altitudes. With accurate models, we can better manage air pollution and better inform the population to take the right precautions.

The original dataset is available in the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Beijing+Multi-Site+Air-Quality+Data. It has been preprocessed and cleaned to treat missing data and make it easy to work with (the preprocessing steps are available on GitHub). You will find the data in a CSV file on GitHub: https://github.com/marcopeix/TimeSeriesForecastingInPython/tree/master/CH12.

The objective of this exercise is to prepare the data for deep learning. Follow these steps:

  1. Read the data.

  2. Plot the target.

  3. Remove unnecessary columns.

  4. Identify whether there is daily seasonality and encode the time accordingly.

  5. Split your data into training, validation, and testing sets.

  6. Scale the data using MinMaxScaler.

  7. Save the train, validation, and test sets to be used later.

Summary

  • Deep learning for forecasting is used when:

    • The dataset is large (more than 10,000 data points).
    • Any variant of the SARIMAX model takes a long time to fit.
    • The residuals of the statistical model still show some correlation.
    • There is more than one seasonal period.
  • There are three types of models for forecasting:

    • Single-step model: Predicts one step into the future for one variable.
    • Multi-step model: Predicts many steps into the future for one variable.
    • Multi-output model: Predicts many variables one or more steps into the future.