16 Filtering a time series with CNN

This chapter covers

  • Examining the CNN architecture
  • Implementing a CNN with Keras
  • Combining a CNN with an LSTM

In the last chapter, we examined and implemented a long short-term memory (LSTM) network, which is a type of recurrent neural network (RNN) that processes sequences of data especially well. It was the top-performing architecture for the single-step model, the multi-step model, and the multi-output model.

Now we’re going to explore the convolutional neural network (CNN). CNNs are mostly applied in the field of computer vision, and this architecture is behind many algorithms for image classification and image segmentation.

Of course, this architecture can also be used for time series analysis. It turns out that CNNs are noise resistant and can effectively filter out the noise in a time series with the convolution operation. This allows the network to produce a set of robust features that do not include abnormal values. In addition, CNNs are usually faster to train than LSTMs, as their operations can be parallelized.

In this chapter, we’ll first explore the CNN architecture and understand how the network filters a time series and creates a unique set of features. Then we’ll implement a CNN using Keras to produce forecasts. We’ll also combine the CNN architecture with the LSTM architecture to see if we can further improve the performance of our deep learning models.

16.1 Examining the convolutional neural network (CNN)

A convolutional neural network is a deep learning architecture that makes use of the convolution operation. The convolution operation allows the network to create a reduced set of features. Therefore, it is a way of regularizing the network, preventing overfitting, and effectively filtering the inputs. Of course, for this to make sense, you must first understand the convolution operation and how it impacts the inputs.

In mathematical terms, a convolution is an operation on two functions that generates a third function that expresses how the shape of one function is changed by the other. In a CNN, this operation occurs between the inputs and a kernel (also known as a filter). The kernel is simply a matrix that is placed on top of the feature matrix. In figure 16.1, the kernel is slid along the time axis, taking the dot product between the kernel and the features. This results in a reduced set of features, achieving regularization and the filtering of abnormal values.

Figure 16.1 Visualizing the kernel and the feature map. The kernel is the light gray matrix that is applied on top of the feature map. Each row corresponds to a feature of the dataset, while the length is the time axis.

To better understand the convolution operation, let’s consider a simple example with only one feature and one kernel, as shown in figure 16.2. To make things simple, we’ll consider only one row of features. Keep in mind that the horizontal axis remains the time dimension. The kernel is a smaller vector that is used to perform the convolution operation. Do not worry about the values used inside the kernel and the feature vector. They are arbitrary values. The values of the kernel are optimized and will change as the network is trained.

Figure 16.2 A simple example of one row of features and one kernel.

We can visualize the convolution operation and its result in figure 16.3. At first, the kernel is aligned with the beginning of the feature vector and the dot product is taken between the kernel and the values of the feature vector that are aligned with it. Once this is done, the kernel shifts one timestep to the right—this is also called a stride of one timestep. The dot product is again taken between the kernel and the feature vector, again only with the values that are aligned with the kernel. The kernel again shifts one timestep to the right, and the process is repeated until the kernel reaches the end of the feature vector. This happens when the kernel cannot be shifted any further with all of its values having an aligned feature value.

Figure 16.3 The full convolution operation. The operation starts with the kernel aligned at the beginning of the feature vector in step 1. The dot product is computed as shown by the intermediary equation of step 1, resulting in the first value in our output vector. In step 2, the kernel shifts one timestep to the right, and the dot product is taken again, resulting in the second value in the output vector. The process is repeated two more times until the kernel reaches the end of the feature vector.

In figure 16.3 you can see that using a feature vector of length 6 and a kernel of length 3, we obtain an output vector of length 4. Thus, in general, the length of the output vector of a convolution is given by equation 16.1.

output length = input length – kernel length + 1

Equation 16.1
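The sliding dot product can be sketched in a few lines of plain Python. This is only an illustration of the operation, not how Keras implements it; the function name conv1d_valid and the feature and kernel values are made up for the example.

```python
def conv1d_valid(features, kernel, stride=1):
    """Slide `kernel` along `features` and take the dot product at each position."""
    n_positions = (len(features) - len(kernel)) // stride + 1
    output = []
    for i in range(n_positions):
        start = i * stride
        window = features[start:start + len(kernel)]
        output.append(sum(f * k for f, k in zip(window, kernel)))
    return output


# Arbitrary illustrative values (not the ones shown in figure 16.3)
features = [2, 4, 1, 3, 5, 0]   # length 6
kernel = [1, 0, -1]             # length 3

output = conv1d_valid(features, kernel)
print(output)                                           # [1, 1, -4, 3]
print(len(output) == len(features) - len(kernel) + 1)   # True, as in equation 16.1
```

With a stride of 1, the kernel fits in four positions, which is why the output has four values.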

Note that since the kernel is moving only in one direction (to the right), this is a 1D convolution. Luckily, Keras comes with the Conv1D layer, allowing us to easily implement it in Python. A 1D convolution is the usual choice for time series forecasting, as the kernel only moves along the time dimension. For image processing, you’ll often see 2D or 3D convolutions, but those are outside the scope of this book.

A convolution layer reduces the length of the set of features, and performing many convolutions will keep reducing the feature space. This can be problematic, as it limits the number of layers in the network, and we might lose too much information in the process. A common technique to prevent that is padding. Padding simply means adding values before and after the feature vector to keep the output length equivalent to the input length. Padding values are often zeros. You can see this in action in figure 16.4, where the output of the convolution is the same length as the input.

Figure 16.4 Convolution with padding. Here we padded the original input vector with zeros, as shown by the black squares. The output of the convolution thus has a length of 6, just like the original feature vector.

You can thus see how padding keeps the dimension of the output constant, allowing us to stack more convolution layers and allowing the network to process features for a longer time. We use zeros for padding because a zero contributes nothing to the dot product, so the padded positions do not add any spurious signal to the output. Thus, using zeros as padding values is usually a good initial option.
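To make the effect of padding concrete, here is a minimal sketch in plain Python that stacks three convolutions with and without zero padding and tracks the sequence length. The conv1d helper, its padding argument, and the values are assumptions made for illustration; the "same" option only mimics the idea of keeping the output as long as the input.

```python
def conv1d(features, kernel, padding="valid"):
    """1D convolution with stride 1; padding="same" zero-pads the input first."""
    k = len(kernel)
    if padding == "same":
        total = k - 1                 # zeros needed to keep the length unchanged
        left = total // 2
        right = total - left
        features = [0] * left + list(features) + [0] * right
    return [
        sum(f * w for f, w in zip(features[i:i + k], kernel))
        for i in range(len(features) - k + 1)
    ]


signal = [2, 4, 1, 3, 5, 0]   # made-up values
kernel = [1, 0, -1]

# Stack three convolutions and record the sequence length after each one
valid_lengths, same_lengths = [], []
x_valid, x_same = signal, signal
for _ in range(3):
    x_valid = conv1d(x_valid, kernel, padding="valid")
    x_same = conv1d(x_same, kernel, padding="same")
    valid_lengths.append(len(x_valid))
    same_lengths.append(len(x_same))

print(valid_lengths)  # [4, 2, 0]
print(same_lengths)   # [6, 6, 6]
```

Without padding, a sequence of length 6 is exhausted after three layers, whereas the padded version keeps its length no matter how many layers are stacked.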

Convolutional neural network (CNN)

A convolutional neural network (CNN) is a deep learning architecture that uses the convolution operation. This allows the network to reduce the feature space, effectively filtering the inputs and preventing overfitting.

The convolution is performed with a kernel, whose values are learned during model fitting. The stride of the kernel determines how many timesteps it shifts at each step of the convolution. In time series forecasting, only 1D convolution is used.

To avoid reducing the feature space too quickly, we can use padding, which adds zeros before and after the input vector. This keeps the output dimension the same as the original feature vector, allowing us to stack more convolution layers, which in turn allows the network to process the features for a longer time.

Now that you understand the inner working of a CNN, we can implement it with Keras and see if a CNN can produce more accurate predictions than the models we have built so far.

16.2 Implementing a CNN

As in previous chapters, we’ll implement the CNN architecture as a single-step model, a multi-step model, and a multi-output model. The single-step model will predict the traffic volume for the next timestep only, the multi-step model will predict the traffic volume for the next 24 hours, and the multi-output model will predict the temperature and traffic volume at the next timestep.

Make sure you have the DataWindow class and the compile_and_fit function (from chapters 13 to 15) in your notebook or Python script, as we’ll use both pieces of code to create windows of data and train the CNN model.

Note The source code for this chapter is available on GitHub: https://github.com/marcopeix/TimeSeriesForecastingInPython/tree/master/CH16.

In this chapter, we’ll also combine the CNN architecture with the LSTM architecture. It can be interesting to see if filtering our time series with a convolution layer and then processing the filtered sequence with an LSTM will improve the accuracy of our predictions. Thus, we’ll implement both a CNN only, and the combination of a CNN with an LSTM.

Of course, the other prerequisite is to read the training set, the validation set, and the test set, so let’s do that right now.

import pandas as pd

train_df = pd.read_csv('../data/train.csv', index_col=0)
val_df = pd.read_csv('../data/val.csv', index_col=0)
test_df = pd.read_csv('../data/test.csv', index_col=0)

Finally, we’ll use a kernel length of three timesteps in our CNN implementation. This is an arbitrary value, and you will have a chance to experiment with various kernel lengths in this chapter’s exercises and see how they impact the model’s performance. However, your kernel should have a length greater than 1; otherwise, you are simply multiplying the feature space by a scalar, and no filtering will be achieved.
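The claim about a kernel of length 1 can be checked with a short sketch in plain Python. The helper conv1d_valid and the numbers below are made up for illustration.

```python
def conv1d_valid(features, kernel):
    """Sliding dot product with stride 1 and no padding."""
    k = len(kernel)
    return [
        sum(f * w for f, w in zip(features[i:i + k], kernel))
        for i in range(len(features) - k + 1)
    ]


series = [3.0, 5.0, 4.0, 6.0]   # made-up values
weight = 0.5

# A kernel of length 1 never sees a neighboring timestep...
length_one = conv1d_valid(series, [weight])
# ...so the result is just the series multiplied by a scalar
scaled = [weight * value for value in series]
print(length_one == scaled)     # True

# A kernel of length 3 combines each value with its neighbors
print(conv1d_valid(series, [1, 1, 1]))   # [12.0, 15.0]
```

Only when the kernel spans several timesteps does each output value depend on more than one input value, which is what makes filtering possible.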

16.2.1 Implementing a CNN as a single-step model

We’ll start by implementing a CNN as a single-step model. Recall that the single-step model outputs a prediction for traffic volume at the next timestep using the last known feature.

In this case, however, it does not make sense to provide the CNN model with only one timestep as an input because we want to run a convolution. We will instead use three input values to generate a prediction for the next timestep. That way we’ll have a sequence of data on which we can run a convolution operation. Furthermore, our input sequence must have a length at least equal to the kernel’s length, which in our case is 3. Recall that we expressed the relationship between the input length, kernel length, and output length in equation 16.1:

output length = input length – kernel length + 1

In this equation, no length can be equal to 0, since that would mean that no data is being processed or output. The condition that no length can be 0 is only satisfied if the input length is greater than or equal to the kernel length. Therefore, our input sequence must have at least three timesteps.

We can thus define the data window that will be used to train the model.

KERNEL_WIDTH = 3
 
conv_window = DataWindow(input_width=KERNEL_WIDTH, label_width=1, shift=1, 
 label_columns=['traffic_volume']) 

For plotting purposes, we would like to see the predictions of the model over a period of 24 hours. That way, we can evaluate the rolling forecasts of the model 1 timestep at a time, over 24 timesteps. Thus, we need to define another data window with a label_width of 24. The shift remains 1, as the model only predicts the next timestep. The input length is obtained by rearranging equation 16.1 as equation 16.2.

output length = input length – kernel length + 1

input length = output length + kernel length – 1

Equation 16.2

We can now simply compute the required input length to generate predictions over a sequence of 24 timesteps. In this case, the input length is 24 + 3 – 1 = 26. That way, we avoid using padding. Later, in the exercises, you’ll be able to try using padding instead of a longer input sequence to accommodate the output length.

We can now define our data window for plotting the predictions of the model.

LABEL_WIDTH = 24
INPUT_WIDTH = LABEL_WIDTH + KERNEL_WIDTH - 1    # From equation 16.2
 
wide_conv_window = DataWindow(input_width=INPUT_WIDTH, 
 label_width=LABEL_WIDTH, shift=1, label_columns=['traffic_volume'])
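To see why 26 inputs yield exactly 24 rolling predictions, the following sketch lists which input timesteps each kernel position covers and which label it lines up with. It assumes the windowing convention of the DataWindow class from the earlier chapters (a window spans input_width + shift timesteps, and the labels are its last label_width positions); treat that as an assumption if your implementation differs.

```python
KERNEL_WIDTH = 3
LABEL_WIDTH = 24
INPUT_WIDTH = LABEL_WIDTH + KERNEL_WIDTH - 1   # equation 16.2
SHIFT = 1

# Assumption: the labels are the last LABEL_WIDTH positions of a window
# that spans INPUT_WIDTH + SHIFT timesteps
total_window_size = INPUT_WIDTH + SHIFT
label_start = total_window_size - LABEL_WIDTH

# For each kernel position, record the inputs it covers and the label it predicts
mapping = []
for position in range(INPUT_WIDTH - KERNEL_WIDTH + 1):
    input_indices = list(range(position, position + KERNEL_WIDTH))
    label_index = label_start + position
    mapping.append((input_indices, label_index))

print(len(mapping))   # 24
print(mapping[0])     # ([0, 1, 2], 3)
print(mapping[-1])    # ([23, 24, 25], 26)
```

Each prediction is therefore built from the three timesteps that immediately precede its label, just like in the narrow window of three inputs.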

With all the data windows ready, we can define our CNN model. Again, we’ll use the Sequential model from Keras to stack different layers. Then we’ll use the Conv1D layer, as we are working with time series, and the kernel only moves in the temporal dimension. The filters parameter plays a role similar to the units parameter of the Dense layer: it sets the number of kernels that the convolutional layer learns, and therefore the number of features it outputs at each timestep. We’ll set the kernel_size to the width of our kernel, which is 3. We don’t need to specify the other dimensions, as Keras automatically adapts to the shape of the inputs. Then we’ll pass the output of the CNN to a Dense layer. That way, the model will be learning on a reduced set of features that were previously filtered by the convolutional step. We’ll finally output a prediction with a Dense layer of only one unit, as we are forecasting only the traffic volume for the next timestep.

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv1D, Dense, LSTM

cnn_model = Sequential([
    Conv1D(filters=32,                    # Like units in a Dense layer: the number of kernels the layer learns
          kernel_size=(KERNEL_WIDTH,),    # Only the width is specified; Keras adapts the other dimension to the inputs
          activation='relu'),
    Dense(units=32, activation='relu'),
    Dense(units=1)
])
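One way to get a feel for what filters and kernel_size control is to count the trainable parameters of the convolutional layer. The sketch below uses the standard count for a 1D convolution (one weight per kernel position and input feature for each filter, plus one bias per filter). The function name and the number of input features are hypothetical and chosen only for illustration.

```python
def conv1d_param_count(filters, kernel_size, in_channels, use_bias=True):
    """Trainable parameters in a 1D convolution layer.

    Each filter holds one weight per (kernel position, input feature) pair,
    plus one bias term when use_bias is True.
    """
    weights = filters * kernel_size * in_channels
    biases = filters if use_bias else 0
    return weights + biases


# n_features is a hypothetical number of input columns, chosen for illustration
n_features = 5
print(conv1d_param_count(filters=32, kernel_size=3, in_channels=n_features))  # 512
```

Note that the number of input features never appears in the layer definition: it is the dimension that Keras infers from the inputs.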

Next, we’ll compile and fit the model, and we’ll store its performance metrics for comparison later.

history = compile_and_fit(cnn_model, conv_window)
 
val_performance = {}
performance = {}
 
val_performance['CNN'] = cnn_model.evaluate(conv_window.val)
performance['CNN'] = cnn_model.evaluate(conv_window.test, verbose=0)

We can visualize the predictions against the labels using the plot method of our data window. The result is shown in figure 16.5.

wide_conv_window.plot(cnn_model)

Figure 16.5 Predicting traffic volume with a CNN as a single-step model. The model takes three values as an input, which is why we only see a prediction at the fourth timestep. Again, many predictions (shown as crosses) overlap labels (shown as squares), meaning that the model is fairly accurate.

As you can see in figure 16.5, many predictions overlap labels, meaning that we have fairly accurate predictions. Of course, we must compare this model’s performance metrics to those of the other models to properly assess its performance.
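As a reminder of what the comparison metric measures, here is a minimal sketch of the mean absolute error in plain Python. The function and the toy numbers are for illustration only; they are not taken from the traffic dataset, and Keras computes the metric for us during evaluation.

```python
def mean_absolute_error(labels, predictions):
    """Average of the absolute differences between labels and predictions."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    total = sum(abs(y - y_hat) for y, y_hat in zip(labels, predictions))
    return total / len(labels)


# Toy numbers for illustration only
labels = [0.5, 0.25, 0.75, 1.0]
predictions = [0.5, 0.5, 0.5, 0.5]

print(mean_absolute_error(labels, predictions))   # 0.25
```

A lower MAE means that, on average, the predictions sit closer to the labels, which is why we compare models on this single number.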

Before doing that, let’s combine the CNN and LSTM architectures into a single model. You saw in the previous chapter how the LSTM architecture resulted in the best-performing models so far. Thus, it is a reasonable hypothesis that filtering our input sequence before feeding it to an LSTM might improve the performance.

Thus, we’ll follow the Conv1D layer with two LSTM layers. This is an arbitrary choice, so make sure you experiment with it later on. There is rarely only one good way of building models, so it is important to showcase what is possible.

cnn_lstm_model = Sequential([
    Conv1D(filters=32,
          kernel_size=(KERNEL_WIDTH,),
          activation='relu'),
    LSTM(32, return_sequences=True),
    LSTM(32, return_sequences=True),
    Dense(1)
]) 

We’ll then fit the model and store its evaluation metrics.

history = compile_and_fit(cnn_lstm_model, conv_window)
 
val_performance['CNN + LSTM'] = cnn_lstm_model.evaluate(conv_window.val)
performance['CNN + LSTM'] = cnn_lstm_model.evaluate(conv_window.test, 
 verbose=0)

With both models built and evaluated, we can look at the MAE of our newly built models in figure 16.6. As you can see, the CNN model did not perform any better than the LSTM, and the combination of CNN and LSTM resulted in a slightly higher MAE than the CNN alone.

Figure 16.6 The MAE of all the single-step models built so far. You can see that the CNN did not improve upon the LSTM performance. Combining the CNN with an LSTM did not help either, and the combination even performed slightly worse than the CNN.

These results might be explained by the length of the input sequence. The model is given only an input sequence of three values, which might not be sufficient for the CNN to extract valuable features for predictions. While a CNN is better than the baseline model and the linear model, the LSTM remains the best-performing single-step model for now.

16.2.2 Implementing a CNN as a multi-step model

We’ll now move on to the multi-step model. Here we’ll use the last known 24 hours to forecast the traffic volume over the next 24 hours.

Again, keep in mind that the convolution reduces the length of the features, but we still expect the model to generate 24 predictions in a single shot. Therefore, we’ll reuse equation 16.2 and feed the model an input sequence with a length of 26 to make sure that we get an output of length 24. This, of course, means that we’ll keep the kernel length of 3. We can thus define our data window for the multi-step model.

KERNEL_WIDTH = 3
LABEL_WIDTH = 24
INPUT_WIDTH = LABEL_WIDTH + KERNEL_WIDTH - 1
 
multi_window = DataWindow(input_width=INPUT_WIDTH, label_width=LABEL_WIDTH, 
 shift=24, label_columns=['traffic_volume'])
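The layout of this window can be sketched with a small helper that returns the positions of the inputs and of the labels. It assumes the same convention as the DataWindow class (a window spans input_width + shift timesteps, and the labels are its last label_width positions); the helper name window_indices is hypothetical.

```python
def window_indices(input_width, label_width, shift):
    """Return (input_indices, label_indices) for one window.

    Assumption: a window spans input_width + shift timesteps and the labels
    are its last label_width positions.
    """
    total_window_size = input_width + shift
    label_start = total_window_size - label_width
    inputs = list(range(input_width))
    labels = list(range(label_start, total_window_size))
    return inputs, labels


inputs, labels = window_indices(input_width=26, label_width=24, shift=24)
print(inputs[0], inputs[-1])           # 0 25
print(labels[0], labels[-1])           # 26 49
print(set(inputs).isdisjoint(labels))  # True
```

Under that assumption, the 26 inputs and the 24 labels do not overlap: the model sees timesteps 0 to 25 and forecasts timesteps 26 to 49.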

Next, we’ll define the CNN model. Again, we’ll use the Sequential model, in which we’ll stack the Conv1D layer, followed by a Dense layer with 32 neurons, and then a Dense layer with one unit, since we are predicting only traffic volume.

ms_cnn_model = Sequential([
    Conv1D(32, activation='relu', kernel_size=(KERNEL_WIDTH,)),
    Dense(units=32, activation='relu'),
    Dense(1, kernel_initializer=tf.initializers.zeros),
])

We can then train the model and store its performance metrics for comparison later.

history = compile_and_fit(ms_cnn_model, multi_window)
 
ms_val_performance = {}
ms_performance = {}
 
ms_val_performance['CNN'] = ms_cnn_model.evaluate(multi_window.val)
ms_performance['CNN'] = ms_cnn_model.evaluate(multi_window.test, verbose=0)

Optionally, we can visualize the forecasts of the model using multi_window.plot(ms_cnn_model). For now, let’s skip this and combine the CNN architecture with the LSTM architecture as previously. Here we’ll simply replace the intermediate Dense layer with an LSTM layer. Once the model is defined, we can fit it and store its performance metrics.

ms_cnn_lstm_model = Sequential([
    Conv1D(32, activation='relu', kernel_size=(KERNEL_WIDTH,)),
    LSTM(32, return_sequences=True),
    Dense(1, kernel_initializer=tf.initializers.zeros),
])
 
history = compile_and_fit(ms_cnn_lstm_model, multi_window)
ms_val_performance['CNN + LSTM'] = ms_cnn_lstm_model.evaluate(
    multi_window.val)
ms_performance['CNN + LSTM'] = ms_cnn_lstm_model.evaluate(
    multi_window.test, verbose=0)

With the two new models trained, we can evaluate their performance against all the multi-step models built so far. As you can see in figure 16.7, the CNN model did not improve upon the LSTM model. However, combining both models resulted in the lowest MAE of all the multi-step models, meaning that it generates the most accurate predictions. The LSTM model is thus dethroned, and we have a new winning model.

Figure 16.7 The MAE of all multi-step models built so far. The CNN model is worse than the LSTM model, since it has a higher MAE. However, combining the CNN with an LSTM resulted in the lowest MAE of all.

16.2.3 Implementing a CNN as a multi-output model

Finally, we’ll implement the CNN architecture as a multi-output model. In this case, we wish to forecast the temperature and traffic volume for the next timestep only.

In the single-step case, an input sequence of length 3 did not seem sufficient for the CNN model to extract meaningful features, so we will use the same input length as for the multi-step model. This time, however, we are forecasting one timestep at a time over 24 timesteps.

We’ll define our data window as follows:

KERNEL_WIDTH = 3
LABEL_WIDTH = 24
INPUT_WIDTH = LABEL_WIDTH + KERNEL_WIDTH - 1
 
wide_mo_conv_window = DataWindow(input_width=INPUT_WIDTH, label_width=24, 
 shift=1, label_columns=['temp', 'traffic_volume'])

By now you should be comfortable building models with Keras, so defining the CNN architecture as a multi-output model should be straightforward. Again, we’ll use the Sequential model, in which we’ll stack a Conv1D layer, followed by a Dense layer, allowing the network to learn on a set of filtered features. The output layer will have two neurons, since we’re forecasting both the temperature and the traffic volume. Next we’ll fit the model and store its performance metrics.

mo_cnn_model = Sequential([
    Conv1D(filters=32, kernel_size=(KERNEL_WIDTH,), activation='relu'),
    Dense(units=32, activation='relu'),
    Dense(units=2)
])
 
history = compile_and_fit(mo_cnn_model, wide_mo_conv_window)
 
mo_val_performance = {}
mo_performance = {}
 
mo_val_performance['CNN'] = mo_cnn_model.evaluate(wide_mo_conv_window.val)
mo_performance['CNN'] = mo_cnn_model.evaluate(wide_mo_conv_window.test, 
 verbose=0)
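To check that this stack of layers produces two values for each of the 24 timesteps, we can trace the shape of the data through the model with a small sketch in plain Python. The two helper functions are hypothetical, and the batch size and number of input features are arbitrary values chosen for illustration. The sketch relies on two facts: a convolution without padding shortens the time axis following equation 16.1, and a Dense layer only transforms the last axis.

```python
def conv1d_output_shape(input_shape, filters, kernel_size):
    """Shape after a 1D convolution with stride 1 and no padding."""
    batch, timesteps, _ = input_shape
    return (batch, timesteps - kernel_size + 1, filters)


def dense_output_shape(input_shape, units):
    """A Dense layer acts on the last axis only and leaves the others unchanged."""
    return input_shape[:-1] + (units,)


# batch_size and n_features are hypothetical values chosen for illustration
batch_size, n_features = 32, 5

shape = (batch_size, 26, n_features)                            # (batch, time, features)
shape = conv1d_output_shape(shape, filters=32, kernel_size=3)   # (32, 24, 32)
shape = dense_output_shape(shape, units=32)                     # (32, 24, 32)
shape = dense_output_shape(shape, units=2)                      # (32, 24, 2)
print(shape)   # (32, 24, 2)
```

The final shape matches the labels: 24 timesteps with two targets each, the temperature and the traffic volume.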

We can also combine the CNN architecture with the LSTM architecture as done previously. We’ll simply replace the intermediate Dense layer with an LSTM layer, fit the model, and store its metrics.

mo_cnn_lstm_model = Sequential([
    Conv1D(filters=32, kernel_size=(KERNEL_WIDTH,), activation='relu'),
    LSTM(32, return_sequences=True),
    Dense(units=2)
])
 
history = compile_and_fit(mo_cnn_lstm_model, wide_mo_conv_window)
 
mo_val_performance['CNN + LSTM'] = mo_cnn_lstm_model.evaluate(
    wide_mo_conv_window.val)
mo_performance['CNN + LSTM'] = mo_cnn_lstm_model.evaluate(
    wide_mo_conv_window.test, verbose=0)

As usual, we’ll compare the performance of the new models with the previous multi-output models in figure 16.8. You’ll notice that the CNN, and the combination of CNN and LSTM, did not result in an improvement over the LSTM. In fact, all three models achieve the same MAE.

Figure 16.8 The MAE of all multi-output models built so far. As you can see, the CNN and the combination of CNN and LSTM did not result in improvements over the LSTM model.

Explaining this behavior is hard, as deep learning models are black boxes, meaning that they are hard to interpret. While they can be very performant, the tradeoff lies in their interpretability. Methods to interpret neural network models do exist, but they are outside of the scope of this book. If you want to learn more, take a look at Christoph Molnar’s book, Interpretable Machine Learning, Second Edition (https://christophm.github.io/interpretable-ml-book/).

16.3 Next steps

In this chapter, we examined the architecture of the CNN. We observed how the convolution operation is used in the network and how it effectively filters the input sequence with the use of a kernel. We then implemented the CNN architecture and combined it with the LSTM architecture to produce two new single-step models, multi-step models, and multi-output models.

In the case of the single-step models, using a CNN did not improve the results. In fact, it performed worse than the LSTM alone. For the multi-step models, we observed a slight performance boost and obtained the best-performing multi-step model with the combination of a CNN and an LSTM. In the case of the multi-output model, the use of a CNN resulted in constant performance, so we have a tie between the CNN, the LSTM, and the combination of CNN and LSTM. Thus, we can see that a CNN does not necessarily result in the best-performing model. In one situation it did, in another it did not, and in another there was no difference.

It is important to consider the CNN architecture as a tool in your toolset when it comes to modeling with deep learning. Models will perform differently depending on the dataset and the forecasting goal. The key lies in windowing your data correctly, as is done by the DataWindow class, and in following a testing methodology, as we have done by keeping the training set, validation set, and testing set constant and evaluating all models using the MAE against baseline models.

The last deep learning architecture that we are going to explore specifically concerns the multi-step models. Up until now, all multi-step models have output predictions for the next 24 hours in a single shot. However, it is possible to gradually predict the next 24 hours and feed a past prediction back into the model to output the next prediction. This is especially done with the LSTM architecture, resulting in an autoregressive LSTM (ARLSTM). This will be the subject of the next chapter.

16.4 Exercises

In the previous chapter’s exercises, you built LSTM models. Now you’ll experiment with a CNN and a combination of CNN and LSTM to see if you can gain in performance. The solutions to these exercises are available on GitHub: https://github.com/marcopeix/TimeSeriesForecastingInPython/tree/master/CH16.

  1. For the single-step model:

    1. Build a CNN model. Set the kernel width to 3.
    2. Plot its predictions.
    3. Evaluate the model using the mean absolute error (MAE) and store the MAE.
    4. Build a CNN + LSTM model.
    5. Plot its predictions.
    6. Evaluate the model using the MAE and store the MAE.
    7. Which model performs best?
  2. For the multi-step model:

    1. Build a CNN model. Set the kernel width to 3.
    2. Plot its predictions.
    3. Evaluate the model using the MAE and store the MAE.
    4. Build a CNN + LSTM model.
    5. Plot its predictions.
    6. Evaluate the model using the MAE and store the MAE.
    7. Which model performs best?
  3. For the multi-output model:

    1. Build a CNN model. Set the kernel width to 3.
    2. Plot its predictions.
    3. Evaluate the model using the MAE and store the MAE.
    4. Build a CNN + LSTM model.
    5. Plot its predictions.
    6. Evaluate the model using the MAE and store the MAE.
    7. Which model performs best?

As always, this is an occasion to experiment. You can explore the following:

  • Add more layers.

  • Change the number of units.

  • Pad the sequence instead of increasing the input length. This is done in the Conv1D layer using the parameter padding="same". In that case, your input sequence must have a length of 24.

  • Use different layer initializers.

Summary

  • The convolutional neural network (CNN) is a deep learning architecture that makes use of the convolution operation.

  • The convolution operation is performed between a kernel and the feature space. It is simply the dot product between the kernel and the feature vector.

  • Running a convolution operation results in an output sequence that is shorter than the input sequence. Running many convolutions can therefore decrease the output length quickly. Padding can be used to prevent that.

  • In time series forecasting, the convolution is performed in one dimension only: the temporal dimension.

  • The CNN is just another model in your toolbox and may not always be the best-performing model. Make sure you window your data correctly with DataWindow, and keep your testing methodology valid by keeping each set of data constant, building baseline models, and evaluating all models with the same error metric.
