We’ve already leveraged RNNs for NLP. In this chapter, we create experiments to forecast with time series data. We use the famous Weather dataset to demonstrate both a univariate and a multivariate example.
An RNN is well suited for time series forecasting because it remembers the past, and its decisions are influenced by what it has learned from the past. So it continues to make good decisions as data changes. Time series forecasting uses a model to predict future values based on previously observed values.
Time series data is different from what we’ve worked with so far because it is a sequence of observations taken sequentially in time. Time series data includes a time dimension, which imposes an explicit order dependence between observations.
Weather Forecasting
Forecasting the weather is a difficult and complex endeavor. But leading-edge companies like Google, IBM, Monsanto, and Facebook are leveraging AI technology to realize accurate and timely weather forecasts. Given the introductory nature of our lessons, we cannot hope to demonstrate such complex AI experiments. But we show you how to build simple time series forecasting models with weather data.
Notebooks for chapters are located at the following URL: https://github.com/paperd/tensorflow.
The Weather Dataset
We introduce time series forecasting with RNNs for a univariate problem. We then forecast a multivariate time series. We use weather time series data to train our models. The data we use is recorded by the Max Planck Institute for Biogeochemistry.
Find out more about the institute by perusing the following URL:
www.bgc-jena.mpg.de/index.php/Main/HomePage
- 1. Click Runtime in the top-left menu.
- 2. Click Change runtime type from the drop-down menu.
- 3. Choose GPU from the Hardware accelerator drop-down menu.
- 4. Click SAVE.
Import the tensorflow library. If ‘/device:GPU:0’ is displayed, the GPU is active. If an empty string (‘’) is displayed, the regular CPU is active.
Get weather data
Use the splitext method to extract the CSV file from the URL. Create the appropriate path for easy loading into pandas.
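As a sketch, assuming the zipped CSV was fetched with tf.keras.utils.get_file (the path below is illustrative of where get_file stores downloads by default):

```python
import os

# Illustrative path: tf.keras.utils.get_file saves downloads under
# ~/.keras/datasets by default (hypothetical file name shown here)
zip_path = os.path.expanduser('~/.keras/datasets/jena_climate_2009_2016.csv.zip')

# splitext splits off the final extension, leaving the path to the CSV
csv_path, ext = os.path.splitext(zip_path)
print(ext)       # '.zip'
print(csv_path)  # same path without the trailing '.zip'
```

The resulting csv_path can then be passed straight to pandas.read_csv.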
Explore the Data
We see that data collection began on January 1, 2009, and ended on December 31, 2016. The last data recorded was on January 1, 2017, but it’s irrelevant for our purposes since it is the only recorded piece of data for that year. We also see that data is recorded every 10 minutes. So the time step for this experiment is 10 minutes. A time step is a single occurrence of an event.
The first time step begins on January 1, 2009 (01.01.2009), with data recorded from 00:00:00 to 00:10:00. The second time step runs from 00:10:00 to 00:20:00. This pattern continues throughout the day, with the last time step beginning at 23:50:00. The second day (and all subsequent days), January 2, 2009 (02.01.2009), follows the same pattern. Generally, time series forecasting predicts the observation at the next time step.
Date Time – Date-time reference
p (mbar) – Atmospheric pressure in millibars
T (degC) – Temperature in Celsius
Tpot (K) – Potential temperature in Kelvin
Tdew (degC) – Dew point temperature in Celsius
rh (%) – Relative humidity
VPmax (mbar) – Saturation vapor pressure in millibars
VPact (mbar) – Vapor pressure in millibars
VPdef (mbar) – Vapor pressure deficit in millibars
sh (g/kg) – Specific humidity in grams per kilogram
H2OC (mmol/mol) – Water vapor concentration in millimoles per mole
rho (g/m**3) – Air density in grams per cubic meter
wv (m/s) – Wind speed in meters per second
max. wv (m/s) – Maximum wind speed in meters per second
wd (deg) – Wind direction in degrees
We have 14 features because Date Time is a reference column. There is no missing data. All data is float64 except the Date Time reference object. And the dataset contains 420,551 rows of data with indexes ranging from 0 to 420,550.
The describe method generates descriptive statistics. The transpose method transposes indexes and columns.
The dataframe contains 420,551 rows with 15 columns included in each row.
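A quick illustration of describe and transpose on a tiny synthetic frame (the real dataframe has 420,551 rows and 15 columns):

```python
import pandas as pd

# Tiny illustrative frame (synthetic values, real column names)
df = pd.DataFrame({'T (degC)': [9.1, 8.9, 9.4], 'rh (%)': [93.3, 93.4, 93.9]})

# describe computes count, mean, std, min, quartiles, and max per column;
# transpose puts one feature per row, which reads better with many columns
stats = df.describe().transpose()
print(stats[['count', 'mean', 'min', 'max']])
```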
Plot Relative Humidity Over Time
Since data is in a pandas dataframe, it is easy to plot any of the 14 features against Date Time time steps.
Since we have an observation every 10 minutes, each hour has six observations. And each day has 144 (6 observations × 24 hours) observations.
Since we have 144 observations per day, plotting the first 10 days means plotting a total of 1,440 (10 days × 144 observations/day) observations. Notice that indexing begins at 0.
With a narrow view of the data (the first 10 days), we can see daily periodicity. We also see that fluctuation is pretty chaotic, which means that prediction is more difficult.
Forecast a Univariate Time Series
Scale Data
Scale data
Scale relative humidity data. Squeeze out the extra 1 dimension added by the TensorFlow function so we can convert the TensorFlow tensor into a numpy array for easier processing. Display the first five scaled observations to verify that scaling works as expected.
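A NumPy sketch of this step, assuming an L2-style scaling utility such as tf.keras.utils.normalize, which returns a 2-D result (the readings shown are illustrative):

```python
import numpy as np

# First few relative humidity readings (illustrative values)
rh = np.array([93.3, 93.4, 93.9, 94.2, 94.1])

# Mimic a TensorFlow utility that returns shape (1, n)
scaled = (rh / np.linalg.norm(rh))[np.newaxis, :]
print(scaled.shape)  # (1, 5)

# Squeeze out the extra dimension for a plain 1-D numpy array
uni = np.squeeze(scaled)
print(uni.shape)     # (5,)
print(uni[:5])
```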
Establish Training Split
For this experiment, use the first 315,413 rows of data for training and the remaining 105,138 (420,551 – 315,413) rows for the test set. Training data accounts for about 2,190 (315,413/144) days of data. And test data accounts for about 730 (105,138/144) days of data.
Create Features and Labels
Function that creates features and labels
The function accepts a dataset, an index where we want to start the split, an ending index, the size of each window, and the target size. The window parameter is the size of the past window of information. The target_size parameter is how far into the future we want our model to learn to predict.
The function creates lists to hold features and labels. It then establishes the starting point that reflects the window size. To create a test set, the function checks the end value. If None, it uses the length of the entire dataset less the target size as the ending value so the test set can start where the training set left off.
Once training and test starting points are established, the function creates the feature windows and labels. During iteration, the window of indices slides forward one observation at a time, so consecutive windows overlap heavily, each dropping its oldest observation and adding the next one. Feature windows are reshaped for TensorFlow consumption and added to the features list. Each label is the observation target_size steps beyond its window and is added to the labels list. Both features and labels are returned as numpy arrays.
The function may seem confusing, but all it really does is create a feature set that holds windows of relative humidity observations (for our experiment) and another set that holds targets. Each target is the relative humidity observation that immediately follows its window (when the target size is 0). This makes sense because the most recent window of readings is a pretty good indicator of the next relative humidity reading. So the feature set becomes a set of windows that contain time step observations, and the label set contains the prediction target for each window.
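A sketch of such a function under the assumptions just described (names are illustrative):

```python
import numpy as np

def univariate_data(dataset, start_index, end_index, window, target_size):
    """Slice a 1-D series into overlapping windows of features and labels."""
    features, labels = [], []
    start_index = start_index + window      # first full window of history
    if end_index is None:                   # test set: run to the end
        end_index = len(dataset) - target_size
    for i in range(start_index, end_index):
        indices = range(i - window, i)      # window slides forward one step
        # reshape to (window, 1) for TensorFlow consumption
        features.append(np.reshape(dataset[indices], (window, 1)))
        labels.append(dataset[i + target_size])
    return np.array(features), np.array(labels)

series = np.arange(10, dtype=float)
x, y = univariate_data(series, 0, None, window=3, target_size=0)
print(x.shape)  # (7, 3, 1)
print(y[0])     # 3.0 -- the observation right after the first window
```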
Create Train and Test Sets
For the train set, start at index 0 from the dataset and continue up to 315,412. For the test set, take the remainder. Set the window size to 20 and target to 0.
Create train and test sets
As expected, the shapes reflect the size of each dataset, the window size, and the 1 dimension. The 1 dimension indicates that we are making one prediction into the future. The train set contains 315,393 records composed of windows holding 20 relative humidity readings. The test set contains 105,118 records composed of windows holding 20 relative humidity readings.
So why do we have 315,393 training observations instead of the original 315,413? The reason is that we need the first window to act as history. So just subtract the first window of 20 from 315,413. For test data, subtract the first window of 20 from 105,138 to get 105,118.
We can create bigger windows, but this dramatically increases the amount of data we must process. With only 20 observations per window, we already have 6,307,860 (315,393 × 20) data points for training and 2,102,360 (105,118 × 20) data points for testing!
View Windows of Past History
As expected, the window contains 20 relative humidity readings. So how did we get the target?
Inspect the patterns for the next few windows
Plot a Single Example
Start at -length to 0 to use the previous window as history.
Function that plots an example
Parameter delta indicates a change in a variable. The default is no change. The function plots each element in the data window with its associated time step.
The actual future for each window is its label, which is the observation that immediately follows the window.
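A minimal sketch of the time-step bookkeeping behind such a plot (the full routine would also draw the history, the true future, and the model prediction with matplotlib):

```python
def create_time_steps(length):
    # history occupies steps -length through -1; the value being
    # predicted sits at time step 0
    return list(range(-length, 0))

steps = create_time_steps(20)
print(steps[0], steps[-1])  # -20 -1
```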
Create a Visual Performance Baseline
Before training, it is a good idea to create a simple visual performance baseline to compare against model performance. Of course, there are many ways to do this, but a very simple way is to use the average of the last 20 observations.
Create a Baseline Metric
It’s also a good idea to create a baseline metric to compare against our model. The simplest approach is to predict the last value for each window. We can then find the average mean squared error of the predictions and use this value as our metric.
Create a baseline metric
The MSE is very small, which means that our baseline metric might be hard to beat. Why? In general, machine learning has a pretty significant limitation. Unless the learning algorithm is hardcoded to look for a specific kind of simple model, parameter learning can sometimes fail to find a simple solution to a simple problem. Our time series problem is a very simple problem.
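The baseline computation can be sketched with synthetic stand-ins for the real windows (the array names and values are illustrative; only the shapes match the experiment):

```python
import numpy as np

# Synthetic stand-ins for the real test windows
rng = np.random.default_rng(0)
x_test = rng.random((1000, 20, 1))                     # 1,000 windows of 20 readings
y_test = x_test[:, -1, 0] + rng.normal(0, 0.01, 1000)  # targets near the last value

# Naive baseline: predict the last observation of each window
preds = x_test[:, -1, 0]
baseline_mse = np.mean((preds - y_test) ** 2)
print(baseline_mse)
```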
Finish the Input Pipeline
Finish the input pipeline
As expected, windows have 20 observations and 1 prediction.
Explore a Data Window
As expected, the window contains a feature set with 20 observations and a label with one prediction.
Create the Model
Input shape indicates window size of 20 with 1 prediction.
Create the model
An RNN is well suited to time series data because its layers can provide feedback to earlier layers. Specifically, it processes time series data time step by time step while remembering information it sees during training. Our model uses a GRU layer, a specialized RNN layer capable of remembering information over long periods of time, so it is especially well suited for time series modeling.
Model Summary
We use the formula 3 × (n² + mn + 2n) to calculate the number of learnable parameters for a GRU, where m is the input dimension and n is the output dimension. As with any neural net, multiply the input to the layer by the neurons at the current layer (m × n). In addition, account for the bias terms at the current layer, which this implementation counts twice (2n). Because a GRU layer feeds its output back into itself, also include the recurrent weights (n²). Finally, multiply by 3 because a GRU has three sets of operations (update gate, reset gate, and candidate hidden state), each requiring its own weight matrices.
3 × (32² + 1 × 32 + 2 × 32)
3 × (1,024 + 32 + 64)
3 × 1,120
3,360
The output dimension n is 32. The input dimension m is 1 because each time step contains a single feature (relative humidity).
The dense layer has 33 learnable parameters calculated by multiplying neurons from the previous layer (32) by neurons at the current layer (1) and adding the number of neurons at the current layer (1).
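The two calculations above can be checked with a few lines of arithmetic:

```python
# GRU: three gates/operations, each with recurrent (n*n), input (m*n),
# and doubled bias (2n) parameters
n, m = 32, 1                              # output units, input features
gru_params = 3 * (n * n + m * n + 2 * n)
print(gru_params)                         # 3360

# Dense: 32 incoming weights per output neuron, plus 1 bias
dense_params = n * 1 + 1
print(dense_params)                       # 33
```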
Verify Model Output
The prediction shows batch size of 256 with 1 prediction. So the model is working as expected.
Compile the Model
Train the Model
Generalize on Test Data
Make Predictions
Make some predictions
Although visual inspection is nice, it doesn’t efficiently gauge overall performance.
Plot Model Performance
Plot model performance
Voilà! Our model performs pretty well!
Forecast a Multivariate Time Series
We just demonstrated how to make a single prediction based on a single feature. Now, let’s make multiple predictions on multiple variables. We can choose any of the 14 features (we don’t want to predict from the date-time reference).
p (mbar) – Atmospheric pressure in millibars
T (degC) – Temperature in Celsius
Tpot (K) – Potential temperature in Kelvin
Tdew (degC) – Dew point temperature in Celsius
rh (%) – Relative humidity
VPmax (mbar) – Saturation vapor pressure in millibars
VPact (mbar) – Vapor pressure in millibars
VPdef (mbar) – Vapor pressure deficit in millibars
sh (g/kg) – Specific humidity in grams per kilogram
H2OC (mmol/mol) – Water vapor concentration in millimoles per mole
rho (g/m**3) – Air density in grams per cubic meter
wv (m/s) – Wind speed in meters per second
max. wv (m/s) – Maximum wind speed in meters per second
wd (deg) – Wind direction in degrees
We chose four features – Tdew (degC), sh (g/kg), H2OC (mmol/mol), and T (degC). Tdew (degC) is the dew point temperature in Celsius. sh (g/kg) is the specific humidity in grams per kilogram. H2OC (mmol/mol) is the water vapor concentration in millimoles per mole. And T (degC) is the temperature in Celsius. We chose these features for demonstration purposes only. In practice, the choice of features should be based on the problem domain.
Scale Data
As expected, we have 420,551 observations.
Scale data
Multistep Model
With relative humidity, we only predicted a single future point. But we can create a model to learn to predict a range of future values, which is what we are going to do with the multivariate data we just established.
Let’s say that we want to train our multistep model to learn to predict for the next 6 hours. Since our time steps are 10 minutes apart (one observation every 10 minutes), there are six observations every hour. Given that we want to predict the next 6 hours, our model makes 36 (6 observations/hour × 6 hours) predictions.
Let’s also say that we want to show our model data from the last 3 days for each sample. Since there are 24 hours in a day, we have 144 (6 × 24) observations each day, for a total of 432 (3 × 144) observations. But we sample only one observation per hour because we don’t expect a drastic change in any of our features within 60 minutes. Thus, 72 (432 observations / 6 observations per hour) observations represent each window of data.
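The window arithmetic above, spelled out:

```python
obs_per_hour = 6                  # one observation every 10 minutes
obs_per_day = obs_per_hour * 24   # 144 observations per day

history_size = 3 * obs_per_day    # show the model the last 3 days: 432
step = obs_per_hour               # but sample only once per hour
window_size = history_size // step
print(window_size)                # 72 observations per window

future_target = 6 * obs_per_hour  # predict the next 6 hours
print(future_target)              # 36 predictions
```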
Generators
Since we are training on multiple features to predict a range of future values, we create a generator function to create the train and test splits. A generator is a function that returns an iterator object that we iterate over one value at a time. A generator is defined like a normal function, but it produces a value with the yield keyword rather than return. So adding the yield keyword automatically makes a normal function a generator function.
Generators are easy to implement but a bit difficult to understand. We invoke a generator function the same way as a normal function. But, when we invoke it, a generator object is created, and we must iterate over that object to see its contents. As we iterate, the code inside the generator function runs until it reaches a yield statement. At that point, the generator yields a value and hands execution back to the for loop. So a generator yields one element for each yield statement encountered.
Advantages of Using a Generator
Generator functions allow us to declare a function that behaves like an iterator. So we can make iterators in a fast, easy, and clean way. An iterator is an object that can be iterated upon. It is used to abstract a container of data to make it behave like an iterable object. As programmers, we use iterable objects like strings, lists, and dictionaries frequently.
Generators save memory space because they don’t compute the value of each item when instantiated. Generators only compute a value when explicitly asked to do so. Such behavior is known as lazy evaluation. Lazy evaluation is useful when we process large datasets because it allows us to start using data immediately rather than having to wait until the entire dataset is processed. It also saves memory because data is generated only when needed.
Generator Caveats
A generator creates a single object. So a generator object can only be assigned to a single variable no matter how many values it yields.
Once iterated over, a generator is exhausted. So it must be rerun to be repopulated.
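A minimal generator illustrates both the yield mechanics and the caveats above:

```python
def count_up(n):
    # yield pauses the function and hands one value back per iteration
    for i in range(n):
        yield i

gen = count_up(3)     # creates a single generator object; nothing runs yet
first = list(gen)     # iteration drives the function, one yield at a time
print(first)          # [0, 1, 2]
second = list(gen)    # exhausted: a second pass yields nothing
print(second)         # []
```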
Create a Generator Function
Generator function
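A sketch of such a generator under the assumptions described above (the name, signature, and synthetic data are illustrative):

```python
import numpy as np

def multivariate_generator(dataset, target, start_index, end_index,
                           window, target_size, step):
    """Yield (feature window, label window) pairs one at a time."""
    start_index = start_index + window
    if end_index is None:                   # test set: run to the end
        end_index = len(dataset) - target_size
    for i in range(start_index, end_index):
        indices = range(i - window, i, step)   # sample every `step` rows
        yield dataset[indices], target[i:i + target_size]

# Synthetic stand-in: 100 time steps of 4 features
data = np.random.rand(100, 4)
gen = multivariate_generator(data, data[:, 1], 0, None,
                             window=12, target_size=5, step=3)
x0, y0 = next(gen)
print(x0.shape, y0.shape)  # (4, 4) (5,)
```

With the experiment’s actual sizes (window=432, step=6, target_size=36), each yielded feature window would have shape (72, 4) and each label window 36 values.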
Generate Train and Test Data
Invoke the generator
Notice that we assign the generator object to a single variable!
Reconstitute Generated Tensors
Remake generated data into numpy arrays. Since train and test data are generators, we must iterate to create features and labels.
As expected, features have 72 observations per window and 4 features. And labels have 36 predictions.
Reconstitute test data
Finish the Input Pipeline
Finish the input pipeline
Each batch contains 256 windows of feature data and 256 labels. Each window has 72 observations with each observation containing 4 features. Each label has 36 predictions.
As expected, the first window has 72 observations with 4 features, and the first label has 36 predictions.
As expected, window size is 72 with 4 features for each window.
Create the Model
Create the model
Model Summary
3 × (32² + 32 × 4 + 2 × 32)
3 × (1,024 + 128 + 64)
3 × 1,216
3,648
The only difference is that we have four input features, so the input dimension m is 4, which makes the m × n term 32 × 4.
The dense layer has 1,188 learnable parameters calculated by multiplying neurons from the previous layer (32) by neurons at the current layer (36) and adding the number of neurons at the current layer (36).
Compile the Model
Train the Model
Generalize on Test Data
Plot Performance
Plot performance
The model is overfitting a bit, but performance is still pretty good.
Plot a Data Window
Plotting function
Make a Prediction
Not bad.