13 Data windowing and creating baselines for deep learning

This chapter covers

  • Creating windows of data
  • Implementing baseline models for deep learning

In the last chapter, I introduced deep learning for forecasting by covering the situations where deep learning is ideal and by outlining the three main types of deep learning models: single-step, multi-step, and multi-output. We then proceeded with data exploration and feature engineering to remove useless features and create new features that will help us forecast traffic volume. With that setup done, we are now ready to implement deep learning to forecast our target variable, which is the traffic volume.

In this chapter, we’ll build a reusable class that will create windows of data. Data windowing is probably the most complicated, but also the most useful, topic in this part of the book on deep learning. Applying deep learning for forecasting relies on creating appropriate time windows and specifying the inputs and labels. Once that is done, you will see that implementing different models becomes incredibly easy, and this framework can be reused for different situations and datasets.

Once you know how to create windows of data, we’ll move on to implementing baseline models. This will give us benchmarks to measure performance against, and we can then move on to more complex architectures, such as linear models and deep neural networks, in the following chapters.

13.1 Creating windows of data

We’ll start off by creating the DataWindow class, which will allow us to format the data appropriately to be fed to our deep learning models. We’ll also add a plotting method to this class so that we can visualize the predictions and the actual values.

Before diving into the code and building the DataWindow class, however, it is important to understand why we must perform data windowing for deep learning. Deep learning models have a particular way of fitting on data, which we’ll explore in the next section. Then we’ll move on and implement the DataWindow class.

13.1.1 Exploring how deep learning models are trained for time series forecasting

In the first half of this book, we fit statistical models, such as SARIMAX, on training sets and made predictions. We were, in reality, fitting a set of predefined functions of a certain order (p,d,q)(P,D,Q)m, and finding out which order resulted in the best fit.

For deep learning models, we do not have a set of functions to try. Instead, we let the neural network derive its own function such that when it takes the inputs, it generates the best predictions possible. To achieve that, we perform what is called data windowing. This is a process in which we define a sequence of data points on our time series and define which are inputs and which are labels. That way, the deep learning model can fit on the inputs, generate predictions, compare them to the labels, and repeat this process until it cannot improve the accuracy of its predictions.

Let’s walk through an example of data windowing. Our data window will use 24 hours of data to predict the next 24 hours. You might wonder why we are using only 24 hours of data to generate predictions; after all, deep learning is data hungry and shines on large datasets. The key lies in the data window. A single window has 24 timesteps as input and generates an output of 24 timesteps. However, the entire training set is separated into multiple windows, meaning that we have many windows with inputs and labels, as shown in figure 13.1.

Figure 13.1 Visualizing the data windows on the training set. The inputs are shown with square markers, and the labels are shown with crosses. Each data window consists of 24 timesteps with square markers followed by 24 labels with crosses.

In figure 13.1 you can see the first 400 timesteps of our training set for traffic volume. Each data window consists of 24 input timesteps and 24 label timesteps (as shown in figure 13.2), giving us a total length of 48 timesteps. We can generate many data windows with the training set, so we are, in fact, leveraging this large quantity of data.

As you can see in figure 13.2, the data window’s total length is the sum of the lengths of each sequence. In this case, since we have 24 timesteps as input and 24 labels, the total length of the data window is 48 timesteps.

Figure 13.2 An example of a data window. Our data window has 24 timesteps as input and 24 timesteps as output. The model will then use 24 hours of input to generate 24 hours of predictions. The total length of the data window is the sum of the length of inputs and labels. In this case, we have a total length of 48 timesteps.

You might think that we are wasting a lot of training data, since in figure 13.2 timesteps 24 to 47 are labels. Are those never going to be used as inputs? Of course, they will be. The DataWindow class that we’ll implement in the next section generates data windows with inputs starting at t = 0. Then it will create another set of data windows, but this time starting at t = 1. Then it will start at t = 2. This goes on until it cannot have a sequence of 24 consecutive labels in the training set, as illustrated in figure 13.3.
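To make this concrete, here is a minimal sketch, in plain NumPy and independent of the DataWindow class we’ll build shortly, of how consecutive data windows are carved out of a series by sliding the starting point one timestep at a time:

import numpy as np

series = np.arange(100)                        # a toy series of 100 timesteps
input_width, label_width = 24, 24
total_window_size = input_width + label_width  # 48 timesteps per window

windows = []
for start in range(len(series) - total_window_size + 1):
    window = series[start:start + total_window_size]
    inputs, labels = window[:input_width], window[input_width:]
    windows.append((inputs, labels))

print(len(windows))                            # 53 windows from only 100 timesteps
print(windows[0][0][:3], windows[1][0][:3])    # [0 1 2] [1 2 3] -- shifted by one timestep

Each window keeps the chronological order of the series; only the starting point moves.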

Figure 13.3 Visualizing the different data windows that are generated by the DataWindow class. You can see that by repeatedly shifting the starting point by one timestep, we use as much of the training data as possible to fit our deep learning models.

To make computation more efficient, deep learning models are trained with batches. A batch is simply a collection of data windows that are fed to the model for training, as shown in figure 13.4.

Figure 13.4 A batch is simply a collection of data windows that are used for training the deep learning model.

Figure 13.4 shows an example of a batch with a batch size of 32. That means that 32 data windows are grouped together and used to train the model. Of course, this is only one batch—the DataWindow class generates as many batches as possible with the given training set. In our case, we have a training set with 12,285 rows. If each batch holds 32 data windows, that gives us roughly 12,285/32 ≈ 384 batches.

Training the model on all 384 batches once is called one epoch. One epoch often does not result in an accurate model, so the model will train for as many epochs as necessary until it cannot improve the accuracy of its predictions.

The final important concept in data windowing for deep learning is shuffling. I mentioned in the very first chapter of this book that time series data cannot be shuffled. Time series data has an order, and that order must be kept, so why are we shuffling the data here?

In this context, shuffling occurs at the batch level, not inside the data window—the order of the time series itself is maintained within each data window. Each data window is independent of all others. Therefore, in a batch, we can shuffle the data windows and still keep the order of our time series, as shown in figure 13.5. Shuffling the data is not essential, but it is recommended, as it tends to produce more robust models.

Figure 13.5 Shuffling the data windows in a batch. Each data window is independent of all others, so it is safe to shuffle the data windows within a batch. Note that the order of the time series is maintained within each data window.

Now that you understand the inner workings of data windowing and how it is used for training deep learning models, let’s implement the DataWindow class.

13.1.2 Implementing the DataWindow class

We are now ready to implement the DataWindow class. This class has the advantage of being flexible, meaning that you can use it in a wide variety of scenarios to apply deep learning. The full code is available on GitHub: https://github.com/marcopeix/TimeSeriesForecastingInPython/tree/master/CH13%26CH14.

The class is based on the width of the input, the width of the label, and the shift. The width of the input is simply the number of timesteps that are fed into the model to make predictions. For example, given that we have hourly data in our dataset, if we feed the model with 24 hours of data to make a prediction, the input width is 24. If we feed only 12 hours of data, the input width is 12.

The label width is equivalent to the number of timesteps in the predictions. If we predict only one timestep, the label width is 1. If we predict a full day of data (with hourly data), the label width is 24.

Finally, the shift is the number of timesteps separating the input and the predictions. If we predict the next timestep, the shift is 1. If we predict the next 24 hours (with hourly data), the shift is 24.
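As a quick worked example of these three parameters, here are the values for the two windows we will visualize next (the total window size, as we’ll see in listing 13.1, works out to the input width plus the shift):

# Predict the next timestep from a single data point
input_width, label_width, shift = 1, 1, 1
total_window_size = input_width + shift      # 2 timesteps

# Predict the next 24 hours from the last 24 hours of data
input_width, label_width, shift = 24, 24, 24
total_window_size = input_width + shift      # 48 timesteps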

Let’s visualize some windows of data to better understand these parameters. Figure 13.6 shows a window of data where the model predicts the next data point, given a single data point.

Figure 13.6 A data window where the model predicts one timestep in the future, given a single point of data. The input width is 1, since the model takes only 1 data point as input. The label width is also only 1, since the model outputs the prediction for 1 timestep only. Since the model predicts the next timestep, the shift is also 1. Finally, the total window size is the sum of the input width and the shift, which equals 2.

Now let’s consider the situation where we feed 24 hours of data to the model in order to predict the next 24 hours. The data window in that situation is shown in figure 13.7. Now that you understand the concept of input width, label width, and shift, we can create the DataWindow class and define its initialization function in listing 13.1. The function will also take in the training, validation, and test sets, as the windows of data will come from our dataset. Finally, we’ll allow the target column to be specified.

Figure 13.7 Data window where the model predicts the next 24 hours using the last 24 hours of data. The input width is 24 and the label width is also 24. Since there are 24 timesteps separating the inputs and the predictions, the shift is also 24. This gives a total window size of 48 timesteps.

Note that the following listing reuses code from the official TensorFlow documentation (https://www.tensorflow.org/tutorials/structured_data/time_series). This method of creating windows of data is a flexible and widely used way of preparing time series for deep learning models. It builds on TensorFlow’s native timeseries_dataset_from_array function, extending it so that we can apply deep learning models in a wide range of forecasting scenarios.

The full implementation of the data windowing technique is shown in code listing 13.3. All code is reused within the terms of the Apache 2.0 License (https://www.apache.org/licenses/LICENSE-2.0), which you can consult in the GitHub repository (https://github.com/marcopeix/TimeSeriesForecastingInPython) for this book.

The examples that follow in the book build upon the code from the documentation, to make it more reusable in any scenario you might encounter outside of this book.
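One practical note before the listing: the code in this chapter assumes that NumPy, TensorFlow, and Matplotlib are already imported, and that the scaled train_df, val_df, and test_df DataFrames from the previous chapter are in memory. If you are starting from a fresh session, a minimal setup might look like this (the aliases are assumptions; the book’s notebook may differ slightly):

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

# train_df, val_df, and test_df are the scaled DataFrames produced
# during the data preparation of the previous chapter.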

Listing 13.1 Defining the initialization function of DataWindow

class DataWindow():
    def __init__(self, input_width, label_width, shift, 
                 train_df=train_df, val_df=val_df, test_df=test_df, 
                 label_columns=None):
        
        self.train_df = train_df
        self.val_df = val_df
        self.test_df = test_df
        
        # Name of the column that we wish to predict
        self.label_columns = label_columns
        if label_columns is not None:
            # Create a dictionary with the name and index of the label column.
            # This will be used for plotting.
            self.label_columns_indices = {name: i for i, name in enumerate(label_columns)}
        # Create a dictionary with the name and index of each column.
        # This will be used to separate the features from the target variable.
        self.column_indices = {name: i for i, name in enumerate(train_df.columns)}
        
        self.input_width = input_width
        self.label_width = label_width
        self.shift = shift
        
        self.total_window_size = input_width + shift
        
        # slice() returns a slice object that specifies how to slice a sequence.
        # Here the input slice starts at 0 and ends when we reach input_width.
        self.input_slice = slice(0, input_width)
        # Assign indices to the inputs. These are useful for plotting.
        self.input_indices = np.arange(self.total_window_size)[self.input_slice]
        
        # Get the index at which the labels start: the total window size
        # minus the width of the labels.
        self.label_start = self.total_window_size - self.label_width
        # The same steps that were applied to the inputs are applied to the labels.
        self.labels_slice = slice(self.label_start, None)
        self.label_indices = np.arange(self.total_window_size)[self.labels_slice]

In listing 13.1 you can see that the initialization function assigns the variables and manages the indices of the inputs and the labels. Our next step is to split our window between inputs and labels, so that our models can make predictions based on the inputs and measure an error metric against the labels. The following split_to_inputs_labels function is defined within the DataWindow class.

def split_to_inputs_labels(self, features):
    # Slice the window to get the inputs, using the input_slice defined in __init__.
    inputs = features[:, self.input_slice, :]
    # Slice the window to get the labels, using the labels_slice defined in __init__.
    labels = features[:, self.labels_slice, :]
    # If we have more than one target, we stack the labels.
    if self.label_columns is not None:
        labels = tf.stack(
            [labels[:, :, self.column_indices[name]] for name in self.label_columns],
            axis=-1
        )
    # The shape will be [batch, time, features]. At this point, we only specify
    # the time dimension and allow the batch and feature dimensions to be
    # defined later.
    inputs.set_shape([None, self.input_width, None])
    labels.set_shape([None, self.label_width, None])
    
    return inputs, labels

The split_to_inputs_labels function will separate the big data window into two windows: one for the inputs and the other for the labels, as shown in figure 13.8.

Figure 13.8 The split_to_inputs_labels function simply separates the big data window into two windows, where one contains the inputs and the other the labels.
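As a quick sanity check of what this split produces, the following sketch (the name w48 is just illustrative) feeds a random batch of the right shape through split_to_inputs_labels, once the class contains the two methods shown so far, and prints the resulting shapes:

w48 = DataWindow(input_width=24, label_width=24, shift=24,
                 label_columns=['traffic_volume'])

# A fake batch of 32 windows, each 48 timesteps long, with every feature column
fake_batch = tf.random.uniform((32, w48.total_window_size, train_df.shape[1]))

inputs, labels = w48.split_to_inputs_labels(fake_batch)
print(inputs.shape)   # (32, 24, number of features)
print(labels.shape)   # (32, 24, 1) -- only the traffic_volume column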

Next we’ll define a function to plot the input data, the predictions, and the actual values (listing 13.2). Since we will be working with many time windows, we’ll show only the plot of three time windows, but this parameter can easily be changed. Also, the default label will be traffic volume, but we can change that by specifying any column we choose. Again, this function should be included in the DataWindow class.

Listing 13.2 Method to plot a sample of data windows

def plot(self, model=None, plot_col='traffic_volume', max_subplots=3):
    inputs, labels = self.sample_batch
    
    plt.figure(figsize=(12, 8))
    plot_col_index = self.column_indices[plot_col]
    max_n = min(max_subplots, len(inputs))
    
    for n in range(max_n):
        plt.subplot(3, 1, n+1)
        plt.ylabel(f'{plot_col} [scaled]')
        # Plot the inputs. They will appear as a continuous blue line with dots.
        plt.plot(self.input_indices, inputs[n, :, plot_col_index],
                 label='Inputs', marker='.', zorder=-10)
        
        if self.label_columns:
            label_col_index = self.label_columns_indices.get(plot_col, None)
        else:
            label_col_index = plot_col_index
        
        if label_col_index is None:
            continue
        
        # Plot the labels or actual values. They will appear as green squares.
        plt.scatter(self.label_indices, labels[n, :, label_col_index],
                    edgecolors='k', marker='s', label='Labels', c='green', s=64)
        
        if model is not None:
            # Plot the predictions. They will appear as red crosses.
            predictions = model(inputs)
            plt.scatter(self.label_indices, predictions[n, :, label_col_index],
                        marker='X', edgecolors='k', label='Predictions',
                        c='red', s=64)
        
        if n == 0:
            plt.legend()
    
    plt.xlabel('Time (h)')

We are almost done building the DataWindow class. The last main piece of logic will format our dataset into tensors so that they can be fed to our deep learning models. TensorFlow comes with a very handy function called timeseries_dataset_from_array, which creates a dataset of sliding windows, given an array.

def make_dataset(self, data):
    # Pass in the data: our training set, validation set, or test set.
    data = np.array(data, dtype=np.float32)
    ds = tf.keras.preprocessing.timeseries_dataset_from_array(
        data=data,
        # Targets are set to None, as they are handled by split_to_inputs_labels.
        targets=None,
        # The length of each sequence, which is equal to the total window size.
        sequence_length=self.total_window_size,
        # The number of timesteps separating the start of each sequence. We want
        # consecutive sequences, so sequence_stride=1.
        sequence_stride=1,
        # Shuffle the sequences. The data within each sequence stays in
        # chronological order; we are only shuffling the order of the sequences,
        # which makes the model more robust.
        shuffle=True,
        # The number of sequences in a single batch.
        batch_size=32
    )
    
    ds = ds.map(self.split_to_inputs_labels)
    return ds

Remember that we are shuffling the sequences in a batch. This means that within each sequence, the data is in chronological order. However, in a batch of 32 sequences, we can and should shuffle them to make our model more robust and less prone to overfitting.
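If you want to see what timeseries_dataset_from_array produces on its own, here is a minimal sketch on a toy array, with shuffling turned off so the windows are easy to read (in make_dataset we leave shuffling on):

toy = np.arange(10, dtype=np.float32)

toy_ds = tf.keras.preprocessing.timeseries_dataset_from_array(
    data=toy,
    targets=None,
    sequence_length=4,     # each sequence covers 4 consecutive timesteps
    sequence_stride=1,
    shuffle=False,
    batch_size=3
)

print(next(iter(toy_ds)).numpy())
# [[0. 1. 2. 3.]
#  [1. 2. 3. 4.]
#  [2. 3. 4. 5.]]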

We’ll conclude our DataWindow class by defining some properties to apply the make_dataset function on the training, validation, and testing sets. We’ll also create a sample batch that we’ll cache within the class for plotting purposes.

@property
def train(self):
    return self.make_dataset(self.train_df)

@property
def val(self):
    return self.make_dataset(self.val_df)

@property
def test(self):
    return self.make_dataset(self.test_df)

@property
def sample_batch(self):
    # Get a sample batch of data for plotting purposes. If the sample batch
    # does not exist yet, retrieve one and cache it.
    result = getattr(self, '_sample_batch', None)
    if result is None:
        result = next(iter(self.train))
        self._sample_batch = result
    return result

Our DataWindow class is now complete. The full class with all methods and properties is shown in listing 13.3.

Listing 13.3 The complete DataWindow class

class DataWindow():
    def __init__(self, input_width, label_width, shift, 
                 train_df=train_df, val_df=val_df, test_df=test_df, 
                 label_columns=None):
        
        self.train_df = train_df
        self.val_df = val_df
        self.test_df = test_df
        
        self.label_columns = label_columns
        if label_columns is not None:
            self.label_columns_indices = {name: i for i, name in enumerate(label_columns)}
        self.column_indices = {name: i for i, name in enumerate(train_df.columns)}
        
        self.input_width = input_width
        self.label_width = label_width
        self.shift = shift
        
        self.total_window_size = input_width + shift
        
        self.input_slice = slice(0, input_width)
        self.input_indices = np.arange(self.total_window_size)[self.input_slice]
        
        self.label_start = self.total_window_size - self.label_width
        self.labels_slice = slice(self.label_start, None)
        self.label_indices = np.arange(self.total_window_size)[self.labels_slice]
    
    def split_to_inputs_labels(self, features):
        inputs = features[:, self.input_slice, :]
        labels = features[:, self.labels_slice, :]
        if self.label_columns is not None:
            labels = tf.stack(
                [labels[:, :, self.column_indices[name]] for name in self.label_columns],
                axis=-1
            )
        inputs.set_shape([None, self.input_width, None])
        labels.set_shape([None, self.label_width, None])
        
        return inputs, labels
    
    def plot(self, model=None, plot_col='traffic_volume', max_subplots=3):
        inputs, labels = self.sample_batch
        
        plt.figure(figsize=(12, 8))
        plot_col_index = self.column_indices[plot_col]
        max_n = min(max_subplots, len(inputs))
        
        for n in range(max_n):
            plt.subplot(3, 1, n+1)
            plt.ylabel(f'{plot_col} [scaled]')
            plt.plot(self.input_indices, inputs[n, :, plot_col_index],
                     label='Inputs', marker='.', zorder=-10)
            
            if self.label_columns:
                label_col_index = self.label_columns_indices.get(plot_col, None)
            else:
                label_col_index = plot_col_index
            
            if label_col_index is None:
                continue
            
            plt.scatter(self.label_indices, labels[n, :, label_col_index],
                        edgecolors='k', marker='s', label='Labels', c='green', s=64)
            if model is not None:
                predictions = model(inputs)
                plt.scatter(self.label_indices, predictions[n, :, label_col_index],
                            marker='X', edgecolors='k', label='Predictions',
                            c='red', s=64)
            
            if n == 0:
                plt.legend()
        
        plt.xlabel('Time (h)')
    
    def make_dataset(self, data):
        data = np.array(data, dtype=np.float32)
        ds = tf.keras.preprocessing.timeseries_dataset_from_array(
            data=data,
            targets=None,
            sequence_length=self.total_window_size,
            sequence_stride=1,
            shuffle=True,
            batch_size=32
        )
        
        ds = ds.map(self.split_to_inputs_labels)
        return ds
    
    @property
    def train(self):
        return self.make_dataset(self.train_df)
    
    @property
    def val(self):
        return self.make_dataset(self.val_df)
    
    @property
    def test(self):
        return self.make_dataset(self.test_df)
    
    @property
    def sample_batch(self):
        result = getattr(self, '_sample_batch', None)
        if result is None:
            result = next(iter(self.train))
            self._sample_batch = result
        return result

For now, the DataWindow class might seem a bit abstract, but we will soon use it to apply baseline models. We will be using this class in all the chapters in this deep learning part of the book, so you will gradually tame this code and appreciate how easy it is to test different deep learning architectures.
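To see the class in action right away, here is a short sketch (the name demo_window is just illustrative) that builds a window, grabs the cached sample batch, and checks the shapes of the inputs and labels:

demo_window = DataWindow(input_width=24, label_width=24, shift=24,
                         label_columns=['traffic_volume'])

inputs, labels = demo_window.sample_batch
print(inputs.shape)    # (32, 24, number of features)
print(labels.shape)    # (32, 24, 1)

demo_window.plot()     # plots three sample windows, without predictions for now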

13.2 Applying baseline models

With the DataWindow class complete, we are ready to use it. We will apply baseline models as single-step, multi-step, and multi-output models. You will see that their implementation is similar and incredibly simple when we have the right data windows.

Recall that a baseline is used as a benchmark to evaluate more complex models. A model is performant if it compares favorably to another, so building a baseline is an important step in modeling.

13.2.1 Single-step baseline model

We’ll first implement a single-step model as a baseline. In a single-step model, the input is one timestep and the output is the prediction of the next timestep.

The first step is to generate a window of data. Since we are defining a single-step model, the input width is 1, the label width is 1, and the shift is also 1, since the model predicts the next timestep. Our target variable is the volume of traffic.

single_step_window = DataWindow(input_width=1, label_width=1, shift=1,
                                label_columns=['traffic_volume'])

For plotting purposes, we’ll also define a wider window so we can visualize many predictions of our model. Otherwise, we could only visualize one input data point and one output prediction, which is not very interesting.

wide_window = DataWindow(input_width=24, label_width=24, shift=1,
                         label_columns=['traffic_volume'])
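The baseline we are about to define subclasses Keras’s Model, and compiling it requires a loss function and an evaluation metric. A minimal set of imports for this section might look like the following sketch (assuming the standard tensorflow.keras locations; the book’s notebook may import them differently):

from tensorflow.keras import Model
from tensorflow.keras.losses import MeanSquaredError
from tensorflow.keras.metrics import MeanAbsoluteError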

In this situation, the simplest prediction we can make is the last observed value. Basically, the prediction is simply the input data point. This is implemented by the class Baseline. As you can see in the following listing, the Baseline class can also be used for a multi-output model. For now, we’ll solely focus on a single-step model.

Listing 13.4 Class to return the input data as a prediction

class Baseline(Model):
    def __init__(self, label_index=None):
        super().__init__()
        self.label_index = label_index
        
    def call(self, inputs):
        # If no target is specified, we return all columns. This is useful for
        # multi-output models where all columns are to be predicted.
        if self.label_index is None:
            return inputs
        
        # If we specify a list of targets, return only the specified columns.
        # Again, this is used for multi-output models.
        elif isinstance(self.label_index, list):
            tensors = []
            for index in self.label_index:
                result = inputs[:, :, index]
                result = result[:, :, tf.newaxis]
                tensors.append(result)
            return tf.concat(tensors, axis=-1)
        
        # Return the input for a given target variable.
        result = inputs[:, :, self.label_index]
        return result[:, :, tf.newaxis]

With the class defined, we can now initialize the model and compile it to generate predictions. To do so, we’ll find the index of our target column, traffic_volume, and pass it in to Baseline. Note that TensorFlow requires us to provide a loss function and a metric of evaluation. In this case, and throughout the deep learning chapters, we’ll use the mean squared error (MSE) as a loss function—it penalizes large errors, and it generally yields well-fitted models. For the evaluation metric, we’ll use the mean absolute error (MAE) for its ease of interpretation.

# Generate a dictionary with the name and index of each column in the training set.
column_indices = {name: i for i, name in enumerate(train_df.columns)}

# Pass the index of the target column to the Baseline class.
baseline_last = Baseline(label_index=column_indices['traffic_volume'])

# Compile the model so it can generate predictions.
baseline_last.compile(loss=MeanSquaredError(), metrics=[MeanAbsoluteError()])

We’ll now evaluate the performance of our baseline on both the validation and test sets. Models built with TensorFlow conveniently come with the evaluate method, which allows us to compare the predictions to the actual values and calculate the error metric.

# Create dictionaries to hold the MAE of each model on the validation set
# and on the test set.
val_performance = {}
performance = {}

# Store the MAE of the baseline on the validation set and the test set.
val_performance['Baseline - Last'] = baseline_last.evaluate(single_step_window.val)
performance['Baseline - Last'] = baseline_last.evaluate(single_step_window.test, verbose=0)

Great, we have successfully built a baseline that predicts the last known value and evaluated it. We can visualize the predictions using the plot method of the DataWindow class. Remember to use the wide_window to see more than just two data points.

wide_window.plot(baseline_last)

In figure 13.9 the labels are squares and the predictions are crosses. The crosses at each timestep are simply the last known value, meaning that we have a baseline that functions as expected. Your plot may differ from figure 13.9, as the cached sample batch changes every time a data window is initialized.

Figure 13.9 Predictions of our baseline single-step model on three sequences from the sample batch. The prediction at each timestep is the last known value, meaning that our baseline works as expected.

We can optionally print the MAE of our baseline on the test set.

print(performance['Baseline - Last'][1])

This returns an MAE of 0.081. More complex models should perform better than the baseline, resulting in a smaller MAE.

13.2.2 Multi-step baseline models

In the previous section, we built a single-step baseline model that simply predicted the last known value. For multi-step models, we’ll predict more than one timestep into the future. In this case, we’ll forecast the traffic volume for the next 24 hours of data given an input of 24 hours.

Again, the first step is to generate the appropriate window of data. Because we wish to predict 24 timesteps into the future with an input of 24 hours, the input width is 24, the label width is 24, and the shift is also 24.

multi_window = DataWindow(input_width=24, label_width=24, shift=24,
                          label_columns=['traffic_volume'])

With the data window generated, we can now focus on implementing the baseline models. In this situation, there are two reasonable baselines:

  • Predict the last known value for the next 24 timesteps.

  • Predict the last 24 timesteps for the next 24 timesteps.

With that in mind, let’s implement the first baseline, where we’ll simply repeat the last known value over the next 24 timesteps.

Predicting the last known value

To predict the last known value, we’ll define a MultiStepLastBaseline class that simply takes in the input and repeats the last value of the input sequence over 24 timesteps. This acts as the prediction of the model.

class MultiStepLastBaseline(Model):
    def __init__(self, label_index=None):
        super().__init__()
        self.label_index = label_index
        
    def call(self, inputs):
        # If no target is specified, return the last known value of all columns
        # over the next 24 timesteps.
        if self.label_index is None:
            return tf.tile(inputs[:, -1:, :], [1, 24, 1])
        # Return the last known value of the target column over the next 24 timesteps.
        return tf.tile(inputs[:, -1:, self.label_index:], [1, 24, 1])
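If tf.tile is unfamiliar, here is a minimal sketch of what it does in this context: it takes the last timestep of the input window and repeats it 24 times along the time axis.

x = tf.constant([[[1.0], [2.0], [3.0]]])   # shape (1, 3, 1): one window, 3 timesteps, 1 feature
last = x[:, -1:, :]                        # shape (1, 1, 1): the last timestep only
repeated = tf.tile(last, [1, 24, 1])       # shape (1, 24, 1): the value 3.0 repeated 24 times
print(repeated.shape)                      # (1, 24, 1)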

Next we’ll initialize the class and specify the target column. We’ll then repeat the same steps as in the previous section, compiling the model and evaluating it on the validation set and test set.

ms_baseline_last = MultiStepLastBaseline(label_index=column_indices['traffic_volume'])

ms_baseline_last.compile(loss=MeanSquaredError(), metrics=[MeanAbsoluteError()])

ms_val_performance = {}
ms_performance = {}

ms_val_performance['Baseline - Last'] = ms_baseline_last.evaluate(multi_window.val)
ms_performance['Baseline - Last'] = ms_baseline_last.evaluate(multi_window.test, verbose=0)

We can now visualize the predictions using the plot method of DataWindow. The result is shown in figure 13.10.

multi_window.plot(ms_baseline_last)

Figure 13.10 Predicting the last known value for the next 24 timesteps. We can see that the predictions, shown as crosses, correspond to the last value of the input sequence, so our baseline behaves as expected.

Again, we can optionally print the baseline’s MAE. From figure 13.10, we can expect it to be fairly high, since there is a large discrepancy between the labels and the predictions.

print(ms_performance['Baseline - Last'][1])

This gives an MAE of 0.347. Now let’s see if we can build a better baseline by simply repeating the input sequence.

Repeating the input sequence

Let’s implement a second baseline for multi-step models, which simply returns the input sequence. This means that the prediction for the next 24 hours will simply be the last known 24 hours of data. This is implemented through the RepeatBaseline class.

class RepeatBaseline(Model):
    def __init__(self, label_index=None):
        super().__init__()
        self.label_index = label_index
        
    def call(self, inputs):
        # Return the input sequence for the given target column.
        return inputs[:, :, self.label_index:]

Now we can initialize the baseline model and generate predictions. Note that the loss function and evaluation metric remain the same.

ms_baseline_repeat = RepeatBaseline(label_index=column_indices['traffic_volume'])

ms_baseline_repeat.compile(loss=MeanSquaredError(), metrics=[MeanAbsoluteError()])

ms_val_performance['Baseline - Repeat'] = ms_baseline_repeat.evaluate(multi_window.val)
ms_performance['Baseline - Repeat'] = ms_baseline_repeat.evaluate(multi_window.test, verbose=0)

Next we can visualize the predictions by calling multi_window.plot(ms_baseline_repeat). The result is shown in figure 13.11.

Figure 13.11 Repeating the input sequence as the predictions. You’ll see that the predictions (represented as crosses) match exactly the input sequence. You’ll also notice that many predictions overlap the labels, which indicates that this baseline performs quite well.

This baseline performs well, which is to be expected, since we identified daily seasonality in the previous chapter. It is equivalent to predicting the last known seasonal cycle.

Again, we can print the MAE on the test set to verify that we indeed have a better baseline than simply predicting the last known value.

print(ms_performance['Baseline - Repeat'][1])

This gives an MAE of 0.341, which is lower than the MAE obtained by predicting the last known value. We have therefore successfully built a better baseline.

13.2.3 Multi-output baseline model

The final type of model we’ll cover is the multi-output model. In this situation, we wish to predict the traffic volume and the temperature for the next timestep using a single input data point. Essentially, we’re applying the single-step model on both the traffic volume and temperature, making it a multi-output model.

Again, we’ll start off by defining the window of data, but here we’ll define two windows: one for training and the other for visualization. Since the model takes in one data point and outputs one prediction, we want to initialize a wide window of data to visualize many predictions over many timesteps.

mo_single_step_window = DataWindow(input_width=1, label_width=1, shift=1,
                                   label_columns=['temp', 'traffic_volume'])
mo_wide_window = DataWindow(input_width=24, label_width=24, shift=1,
                            label_columns=['temp', 'traffic_volume'])

Notice that we pass in both temp and traffic_volume, as those are our two targets for the multi-output model.

Then we’ll use the Baseline class that we defined for the single-step model. Recall that this class can output the last known value for a list of targets.

Listing 13.5 Class to return the input data as a prediction

class Baseline(Model):
    def __init__(self, label_index=None):
        super().__init__()
        self.label_index = label_index
        
    def call(self, inputs):
        # If no target is specified, we return all columns. This is useful for
        # multi-output models where all columns are to be predicted.
        if self.label_index is None:
            return inputs
        
        # If we specify a list of targets, return only the specified columns.
        # Again, this is used for multi-output models.
        elif isinstance(self.label_index, list):
            tensors = []
            for index in self.label_index:
                result = inputs[:, :, index]
                result = result[:, :, tf.newaxis]
                tensors.append(result)
            return tf.concat(tensors, axis=-1)
        
        # Return the input for a given target variable.
        result = inputs[:, :, self.label_index]
        return result[:, :, tf.newaxis]

In the case of the multi-output model, we must simply pass the indexes of the temp and traffic_volume columns to output the last known value for the respective variables as a prediction.

print(column_indices['traffic_volume'])    # Prints out 2
print(column_indices['temp'])              # Prints out 0

mo_baseline_last = Baseline(label_index=[0, 2])

With the baseline initialized with our two target variables, we can now compile the model and evaluate it.

# Compile the model, as before, and then evaluate it.
mo_baseline_last.compile(loss=MeanSquaredError(), metrics=[MeanAbsoluteError()])

mo_val_performance = {}
mo_performance = {}

mo_val_performance['Baseline - Last'] = mo_baseline_last.evaluate(mo_wide_window.val)
mo_performance['Baseline - Last'] = mo_baseline_last.evaluate(mo_wide_window.test, verbose=0)

Finally, we can visualize the predictions against the actual values. By default, our plot method will show the traffic volume on the y-axis, allowing us to quickly display one of our targets, as shown in figure 13.12.

mo_wide_window.plot(mo_baseline_last)

Figure 13.12 Predicting the last known value for traffic volume

Figure 13.12 does not show anything surprising, as we already saw these results when we built a single-step baseline model. The particularity of the multi-output model is that we also have predictions for the temperature. Of course, we can also visualize the predictions for the temperature by specifying the target in the plot method. The result is shown in figure 13.13.

mo_wide_window.plot(model=mo_baseline_last, plot_col='temp')

Figure 13.13 Predicting the last known value for the temperature. The predictions (crosses) are equal to the previous data point, so our baseline model behaves as expected.

Again, we can print the MAE of our baseline model.

print(mo_performance['Baseline - Last'][1])

We obtain an MAE of 0.047 on the test set. In the next chapter, we’ll start building more complex models, and they should result in a lower MAE, as they will be trained to fit the data.

13.3 Next steps

In this chapter, we covered the crucial step of creating data windows, which will allow us to quickly build any type of model. We then proceeded to build baseline models for each type of model, so that we have benchmarks we can compare to when we build our more complex models in later chapters.

Of course, building baseline models is not an application of deep learning just yet. In the next chapter, we will implement linear models and deep neural networks and see whether they already outperform these simple baselines.

13.4 Exercises

In the previous chapter, as an exercise, we prepared the air pollution dataset for deep learning modeling. Now we’ll use the training set, validation set, and test set to build baseline models and evaluate them.

For each type of model, follow the steps outlined. Recall that the target for the single-step and multi-step model is the concentration of NO2, and the targets for the multi-output model are the concentration of NO2 and temperature. The complete solution is available on GitHub: https://github.com/marcopeix/TimeSeriesForecastingInPython/tree/master/CH13%26CH14.

  1. For the single-step model

    1. Build a baseline model that predicts the last known value.
    2. Plot it.
    3. Evaluate its performance using the mean absolute error (MAE) and store it for comparison in a dictionary.
  2. For the multi-step model

    1. Build a baseline that predicts the last known value over a horizon of 24 hours.
    2. Build a baseline model that repeats the last 24 hours.
    3. Plot the predictions of both models.
    4. Evaluate both models using the MAE and store their performance.
  3. For the multi-output model

    1. Build a baseline model that predicts the last known value.
    2. Plot it.
    3. Evaluate its performance using the MAE and store it for comparison in a dictionary.

Summary

  • Data windowing is essential in deep learning to format the data as inputs and labels for the model.

  • The DataWindow class can easily be used in any situation and can be extended to your liking. Make use of it in your own projects.

  • Deep learning models require a loss function and an evaluation metric. In our case, we chose the mean squared error (MSE) as the loss function, because it penalizes large errors and tends to yield better-fit models. The evaluation metric is the mean absolute error (MAE), chosen for its ease of interpretation.
