Manipulating time-series data

We will now examine several common operations that are performed on time-series data. These operations entail realigning data, changing the frequency of the samples and their values, and calculating aggregate results on continuously moving subsets of the data to see how the values behave as time changes. We will examine each of the following:

  • Shifting and lagging values to calculate percentage changes
  • Changing the frequency of the data in the time series
  • Up and down resampling of the intervals and values in the time series
  • Performing rolling-window calculations

Shifting and lagging

A common operation on time-series data is to shift the values backward and forward in time. The pandas method for this is .shift(), which will shift the values in a Series or DataFrame a specified number of units of the index's frequency.

To demonstrate shifting, we will use the following Series. This Series has five values, is indexed by date starting at 2014-08-01, and uses a daily frequency:

In [70]:
   # create a Series to work with
   np.random.seed(123456)
   ts = Series([1, 2, 2.5, 1.5, 0.5],
               pd.date_range('2014-08-01', periods=5))
   ts

Out[70]:
   2014-08-01    1.0
   2014-08-02    2.0
   2014-08-03    2.5
   2014-08-04    1.5
   2014-08-05    0.5
   Freq: D, dtype: float64

The following shifts the values forward by 1 day:

In [71]:
   # shift forward one day
   ts.shift(1)

Out[71]:
   2014-08-01    NaN
   2014-08-02    1.0
   2014-08-03    2.0
   2014-08-04    2.5
   2014-08-05    1.5
   Freq: D, dtype: float64

pandas has moved the values forward one unit of the index's frequency, which is one day. The index remains unchanged. There was no replacement data for 2014-08-01, so it is filled with NaN.

A lag is a shift in a negative direction. The following lags the Series by 2 days:

In [72]:
   # lag two days
   ts.shift(-2)

Out[72]:
   2014-08-01    2.5
   2014-08-02    1.5
   2014-08-03    0.5
   2014-08-04    NaN
   2014-08-05    NaN
   Freq: D, dtype: float64

Index labels 2014-08-04 and 2014-08-05 now have NaN values, as there were no items to replace them with.

A common calculation that is performed using a shift is to determine the day-to-day change in values. The ratio of each value to the previous day's value can be calculated by dividing a Series object by its values shifted by 1:

In [73]:
   # calculate the ratio of each value to the previous day's
   ts / ts.shift(1)

Out[73]:
   2014-08-01         NaN
   2014-08-02    2.000000
   2014-08-03    1.250000
   2014-08-04    0.600000
   2014-08-05    0.333333
   Freq: D, dtype: float64
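
Subtracting 1 from this ratio gives the actual percentage change. For this common pattern, pandas also provides the .pct_change() method, which computes ts / ts.shift(1) - 1 in a single step:

   # .pct_change() computes the percentage change directly,
   # equivalent to ts / ts.shift(1) - 1
   ts.pct_change()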

Shifts can also be performed using a different frequency than that of the index. When this is done, the index labels are modified and the values remain the same. As an example, the following shifts the Series forward by one business day:

In [74]:
   # shift forward one business day
   ts.shift(1, freq="B")

Out[74]:
   2014-08-04    1.0
   2014-08-04    2.0
   2014-08-04    2.5
   2014-08-05    1.5
   2014-08-06    0.5
   dtype: float64

Since 2014-08-01 was a Friday, shifting the weekend dates forward by one business day lands the first three labels all on Monday 2014-08-04, which is why that label appears three times in the result. As another example, the following shifts the index forward by 5 hours:

In [75]:
   # shift forward five hours
   ts.shift(5, freq="H")

Out[75]:
   2014-08-01 05:00:00    1.0
   2014-08-02 05:00:00    2.0
   2014-08-03 05:00:00    2.5
   2014-08-04 05:00:00    1.5
   2014-08-05 05:00:00    0.5
   dtype: float64

A time series can also be shifted using a DateOffset. The following code shifts the time series forward by half a minute (30 seconds):

In [76]:
   # shift using a DateOffset
   ts.shift(1, pd.DateOffset(minutes=0.5))

Out[76]:
   2014-08-01 00:00:30    1.0
   2014-08-02 00:00:30    2.0
   2014-08-03 00:00:30    2.5
   2014-08-04 00:00:30    1.5
   2014-08-05 00:00:30    0.5
   dtype: float64

There is an alternative form of shifting provided by the .tshift() method. This method shifts only the index labels, by the specified number of units of the frequency given by the freq parameter (which is required). The following code demonstrates this approach by adjusting the index by -1 hour:

In [77]:
   # shift just the index values
   ts.tshift(-1, freq='H')

Out[77]:
   2014-07-31 23:00:00    1.0
   2014-08-01 23:00:00    2.0
   2014-08-02 23:00:00    2.5
   2014-08-03 23:00:00    1.5
   2014-08-04 23:00:00    0.5
   dtype: float64
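
Note that in later versions of pandas, .tshift() was deprecated and eventually removed in favor of calling .shift() with the freq parameter. On a recent release, the equivalent of the preceding example would be:

   # equivalent in recent pandas versions, where .tshift() no longer exists
   ts.shift(-1, freq='H')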

Frequency conversion

Frequency data can be converted in pandas using the .asfreq() method of a time-series object, such as Series or DataFrame. When converting frequency, a new Series object with a new DatetimeIndex object will be created. The DatetimeIndex of the new Series object starts at the first Timestamp of the original and progresses at the given frequency until the last Timestamp of the original. Values will then be aligned into the new Series.

To demonstrate, we will use the following time series of consecutive incremental integers sampled every two hours, starting on August 1, 2014 (744 samples, spanning 62 days):

In [78]:
   # create a Series of incremental values
   # indexed at two-hour intervals starting 2014-08-01
   periods = 31 * 24
   hourly = Series(np.arange(0, periods),
                  pd.date_range('08-01-2014', freq="2H", 
                                periods = periods))
   hourly

Out[78]:
   2014-08-01 00:00:00    0
   2014-08-01 02:00:00    1
   2014-08-01 04:00:00    2
   ...
   2014-10-01 18:00:00    741
   2014-10-01 20:00:00    742
   2014-10-01 22:00:00    743
   Freq: 2H, Length: 744

As an example, the following converts this time series to a daily frequency using .asfreq('D'):

In [79]:
   # convert to daily frequency
   # many items will be dropped due to alignment
   daily = hourly.asfreq('D')
   daily

Out[79]:
   2014-08-01     0
   2014-08-02    12
   2014-08-03    24
   ...
   2014-09-29    708
   2014-09-30    720
   2014-10-01    732
   Freq: D, Length: 62

As the data was aligned from the two-hour time series to the new daily time series, only the values whose timestamps fall at exactly midnight of each day were copied.
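
We can verify this alignment by looking up one of the daily timestamps in the original series; with twelve samples per day, midnight of the second day holds the value 12:

   # the daily value for 2014-08-02 is the source value
   # at exactly midnight of that day
   hourly['2014-08-02 00:00:00']   # -> 12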

If we convert this result back to an hourly frequency, we will see that many of the values are NaN:

In [80]:
   # convert back to hourly.  Results in many NaNs
   # as the new index has many labels that do not
   # align with the source
   daily.asfreq('H')

Out[80]:
   2014-08-01 00:00:00     0
   2014-08-01 01:00:00   NaN
   2014-08-01 02:00:00   NaN
   ...
   2014-09-30 22:00:00    NaN
   2014-09-30 23:00:00    NaN
   2014-10-01 00:00:00    732
   Freq: H, Length: 1465

The new index has Timestamp objects at hourly intervals, so only the timestamps at exact days align with the daily time series; the remaining 1,403 entries are NaN values.

This default behavior can be changed using the method parameter of the .asfreq() method, which can specify that the NaN values be forward filled, backward filled, or padded.

The ffill method will forward fill using the last known value (pad is an alias that does the same):

In [81]:
   # forward fill values
   daily.asfreq('H', method='ffill')

Out[81]:
   2014-08-01 00:00:00    0
   2014-08-01 01:00:00    0
   2014-08-01 02:00:00    0
   ...
   2014-09-30 22:00:00    720
   2014-09-30 23:00:00    720
   2014-10-01 00:00:00    732
   Freq: H, Length: 1465

The bfill method will back fill values from the next known value:

In [82]:
   daily.asfreq('H', method='bfill')

Out[82]:
   2014-08-01 00:00:00     0
   2014-08-01 01:00:00    12
   2014-08-01 02:00:00    12
   ...
   2014-09-30 22:00:00    732
   2014-09-30 23:00:00    732
   2014-10-01 00:00:00    732
   Freq: H, Length: 1465
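
Under the covers, .asfreq() behaves much like a reindex against an index generated with pd.date_range(). The following is a rough sketch of the forward-fill conversion above (new_index is just a local name used for illustration):

   # build an hourly index spanning the daily series
   new_index = pd.date_range(daily.index[0], daily.index[-1], freq='H')
   # reindex with forward filling; roughly what
   # daily.asfreq('H', method='ffill') does
   daily.reindex(new_index, method='ffill')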

Up and down resampling

Frequency conversion provides a basic way to convert the index in a time series to another frequency. Data in the new time series is aligned with the old data and can result in many NaN values. This can be partially solved using a fill method, but that is limited in its ability to fill with appropriate information.

Resampling differs in that it does not perform a pure alignment. The values placed in the new series can use the same forward and reverse fill options, but they can also be specified using other pandas-provided algorithms or with your own functions.

To demonstrate resampling, we will use the following time series, which represents a random walk of values over a 5-day period:

In [83]:
   # calculate a random walk five days long at one second intervals
   # this is the number of samples that will be needed
   count = 24 * 60 * 60 * 5
   # create a series of values
   np.random.seed(123456)
   values = np.random.randn(count)
   ws = pd.Series(values)
   # calculate the walk
   walk = ws.cumsum()
   # patch the index
   walk.index = pd.date_range('2014-08-01', periods=count, freq="S")
   walk

Out[83]:
   2014-08-01 00:00:00    0.469112
   2014-08-01 00:00:01    0.186249
   2014-08-01 00:00:02   -1.322810
   ...
   2014-08-05 23:59:57    455.202981
   2014-08-05 23:59:58    454.947362
   2014-08-05 23:59:59    456.191430
   Freq: S, Length: 432000

Resampling in pandas is accomplished using the .resample() method, by passing it a new frequency. To demonstrate this, the following resamples our by-the-second data to by-the-minute intervals. This is a downsampling, as the result has a lower frequency and fewer values:

In [84]:
   # resample to minute intervals
   walk.resample("1Min")

Out[84]:
   2014-08-01 00:00:00    -8.718220
   2014-08-01 00:01:00   -15.239213
   2014-08-01 00:02:00    -9.179315
   ...
   2014-08-05 23:57:00    450.078149
   2014-08-05 23:58:00    444.637806
   2014-08-05 23:59:00    453.837417
   Freq: T, Length: 7200

Notice that the first value is -8.718220, whereas the original data had a value of 0.469112. A frequency conversion would have left this value at 0.469112, because frequency conversion copies data through alignment; a resampling does not. A resampling actually splits the data into buckets based on the new intervals and then applies a particular operation to the data in each bucket, the default being to calculate the mean of each bucket. This can be verified with the following, which slices the first minute of data from the walk and calculates its mean:

In [85]:
   # calculate the mean of the first minute of the walk
   walk['2014-08-01 00:00'].mean()

Out[85]:
   -8.7182200528326437

In downsampling, as the existing data is put into buckets based on the new intervals, there can often be a question of which timestamps fall at each end of a bucket. As an example, should the first interval in the previous resampling run from 2014-08-01 00:00:00 through 2014-08-01 00:00:59, or should it start at 2014-08-01 00:00:01 and end at 2014-08-01 00:01:00?

The default is the former, which is referred to as a left close. The other scenario, where the left value is excluded and the right included, is a right close and can be selected using the closed='right' parameter. The following demonstrates this; notice the slight difference in the intervals and values:

In [86]:
   # use a right close
   walk.resample("1Min", closed='right')

Out[86]:
   2014-07-31 23:59:00     0.469112
   2014-08-01 00:00:00    -8.907477
   2014-08-01 00:01:00   -15.280685
   ...
   2014-08-05 23:57:00    450.039159
   2014-08-05 23:58:00    444.631719
   2014-08-05 23:59:00    453.955377
   Freq: T, Length: 7201

The decision about whether to use a right or left close is really up to you and your data modeling, but pandas gives you the option.
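
Relatedly, the label parameter controls which edge of each interval is used to label the bucket in the result. A brief sketch (output not shown):

   # label each bucket with the right edge of its interval
   # instead of the default left edge
   walk.resample("1Min", closed='right', label='right')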

The specific method that is applied to each group of samples can be specified using the how parameter. As an example, the following takes the first value of each bucket:

In [87]:
   # take the first value of each bucket
   walk.resample("1Min", how='first')

Out[87]:
   2014-08-01 00:00:00     0.469112
   2014-08-01 00:01:00   -10.886314
   2014-08-01 00:02:00   -13.374656
   ...
   2014-08-05 23:57:00    449.582419
   2014-08-05 23:58:00    447.243014
   2014-08-05 23:59:00    446.877810
   Freq: T, Length: 7200

The how parameter can be given the name of any NumPy function that takes an array and produces an aggregated value, as well as any function you create on your own.
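
Note that in modern versions of pandas, the how parameter has been removed; .resample() instead returns a Resampler object on which you call an aggregation method. On a recent release, the equivalents of the preceding examples would be:

   # modern equivalents: .resample() returns a Resampler object
   walk.resample("1Min").mean()    # the default mean aggregation
   walk.resample("1Min").first()   # the first value in each bucket
   walk.resample("1Min").apply(np.median)  # a NumPy or user function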

To demonstrate upsampling, we will resample the walk to minutes and then back to seconds:

In [88]:
   # resample to 1 minute intervals, then back to 1 sec
   bymin = walk.resample("1Min")
   bymin.resample('S')

Out[88]:
   2014-08-01 00:00:00   -8.71822
   2014-08-01 00:00:01        NaN
   2014-08-01 00:00:02        NaN
   ...
   2014-08-05 23:58:58           NaN
   2014-08-05 23:58:59           NaN
   2014-08-05 23:59:00    453.837417
   Freq: S, Length: 431941

The upsampling created the index values for the second-by-second data but inserted NaN values by default; the result has (7200 - 1) * 60 + 1 = 431,941 entries, of which only those at exact minutes have values. This default behavior can be modified using the fill_method parameter. We saw the forward and backward fill options when changing frequency, and these are also available with resampling. The following demonstrates how to use the backward fill:

In [89]:
   # resample to 1 second intervals using backward fill
   bymin.resample("S", fill_method="bfill")

Out[89]:
   2014-08-01 00:00:00    -8.718220
   2014-08-01 00:00:01   -15.239213
   2014-08-01 00:00:02   -15.239213
   ...
   2014-08-05 23:58:58    453.837417
   2014-08-05 23:58:59    453.837417
   2014-08-05 23:59:00    453.837417
   Freq: S, Length: 431941

It is also possible to interpolate the missing values using the .interpolate() method on the result. This will calculate a linear interpolation between the values existing in the result for all of the NaN values created during the resampling. For example, the value at 2014-08-01 00:00:01 is -8.718220 + (-15.239213 - (-8.718220)) / 60 ≈ -8.826903, one sixtieth of the way from the first minute's value to the second's:

In [90]:
   # demonstrate interpolating the NaN values
   interpolated = bymin.resample("S").interpolate()
   interpolated

Out[90]:
   2014-08-01 00:00:00   -8.718220
   2014-08-01 00:00:01   -8.826903
   2014-08-01 00:00:02   -8.935586
   ...
   2014-08-05 23:58:58    453.530764
   2014-08-05 23:58:59    453.684090
   2014-08-05 23:59:00    453.837417
   Freq: S, Length: 431941

pandas provides a very convenient resampling method referred to as open, high, low, and close, which is specified by the how='ohlc' parameter. The following example takes our second-by-second data and calculates hour-by-hour ohlc values:

In [91]:
   # show ohlc resampling
   ohlc = walk.resample("H", how="ohlc")
   ohlc

Out[91]:
                           open        high         low       close
  2014-08-01 00:00:00    0.469112    0.469112  -67.873166   -2.922520
  2014-08-01 01:00:00   -3.374321   23.793007  -56.585154  -55.101543
  2014-08-01 02:00:00  -54.276885    5.232441  -87.809456    1.913276
  2014-08-01 03:00:00    0.260576   17.124638  -65.820652  -38.530620
  2014-08-01 04:00:00  -38.436581    3.537231 -109.805294  -61.014553
  ...                         ...         ...         ...         ...
  2014-08-05 19:00:00  437.652077  550.408942  430.549178  494.471788
  2014-08-05 20:00:00  496.539759  510.371745  456.365565  476.505765
  2014-08-05 21:00:00  476.025498  506.952877  425.472410  498.547578
  2014-08-05 22:00:00  497.941355  506.599652  411.119919  443.925832
  2014-08-05 23:00:00  443.017962  489.083657  426.062444  456.191430

   [120 rows x 4 columns]
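
In modern versions of pandas, the same result is obtained by calling the .ohlc() method on the Resampler object:

   # modern form of the ohlc resampling
   walk.resample("H").ohlc()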

Time-series moving-window operations

pandas provides a number of functions to compute moving (also known as rolling) statistics. In a rolling window, pandas computes the statistic on a window of data represented by a particular period of time. The window is then rolled along a certain interval, and the statistic is continually calculated on each window as long as the window fits within the dates of the time series.

pandas provides direct support for rolling windows with the following functions:

Function            Description
------------------  ------------------------------------------------------------
rolling_mean        The mean of values in the window
rolling_std         The standard deviation of values in the window
rolling_var         The variance of values in the window
rolling_min         The minimum of values in the window
rolling_max         The maximum of values in the window
rolling_cov         The covariance of values in the window
rolling_quantile    The moving window score at a percentile/sample quantile
rolling_corr        The correlation of values in the window
rolling_median      The median of values in the window
rolling_sum         The sum of values in the window
rolling_apply       The application of a user function to values in the window
rolling_count       The number of non-NaN values in the window
rolling_skew        The skewness of values in the window
rolling_kurt        The kurtosis of values in the window
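
Note that these pd.rolling_* functions were deprecated in later versions of pandas in favor of the .rolling() method on Series and DataFrame, which returns a Rolling object exposing the corresponding aggregations. For example:

   # modern equivalents using the .rolling() method
   walk.rolling(window=5).mean()   # replaces pd.rolling_mean(walk, 5)
   walk.rolling(window=5).std()    # replaces pd.rolling_std(walk, 5)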

As a practical example, a rolling mean is commonly used to smooth out short-term fluctuations and highlight longer-term trends in data, and it appears frequently in financial time-series analysis.

To demonstrate, we will calculate a rolling mean with a window of 5 on the first minute of the random walk created earlier in the chapter. The following calculates this and then plots the result against the raw data to demonstrate the smoothing created by the rolling mean:

In [92]:
   first_minute = walk['2014-08-01 00:00']
   # calculate a rolling mean with a window of 5 periods
   pd.rolling_mean(first_minute, 5).plot()
   # plot it against the raw data
   first_minute.plot()
   # add a legend
   plt.legend(labels=['Rolling Mean', 'Raw']);
[Plot: the 5-period rolling mean plotted against the raw data]

Note

The generation of charts will be covered in more detail in Chapter 11, Visualization.

It can be seen how rolling_mean provides a smoother representation of the underlying data. A larger window will create less variance, and smaller windows will create more (until the window size is 1, which will be identical to the original series).

The following demonstrates the rolling mean with windows of 2, 5, and 10 plotted against the original series:

In [93]:
   # demonstrate the difference between 2, 5 and
   # 10 interval rolling windows
   h1w = walk['2014-08-01 00:00']
   h1w.plot()
   pd.rolling_mean(h1w, 2).plot()
   pd.rolling_mean(h1w, 5).plot()
   pd.rolling_mean(h1w, 10).plot()
   plt.legend(labels=['Raw', '2-interval RM',
                      '5-interval RM',
                      '10-interval RM']);
[Plot: 2-, 5-, and 10-interval rolling means plotted against the raw data]

Note that the larger the window, the more data is missing at the beginning of the curve. A window of size n requires n data points before the measure can be calculated and hence the gap in the beginning of the plot.
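
If the leading gap is a problem, the min_periods parameter allows the statistic to be computed from however many observations are available at the start:

   # allow the mean to be computed with as few as one observation,
   # removing the gap at the start of the plot
   pd.rolling_mean(h1w, 10, min_periods=1).plot();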

Any function can be applied via a rolling window using the pd.rolling_apply function. The supplied function will be passed an array of the values in the window and should return a single value; pandas collects these results into a new time series.

To demonstrate, the following code calculates the mean absolute deviation, which gives you a feel for how far, on average, the values in each window lie from that window's mean:

In [94]:
   # calculate mean absolute deviation with a window of 5 intervals
   mean_abs_dev = lambda x: np.fabs(x - x.mean()).mean()
   pd.rolling_apply(h1w, 5, mean_abs_dev).plot();
[Plot: the rolling mean absolute deviation over a window of 5 intervals]
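
With the newer .rolling() API, the same calculation would be written as follows; the raw=True flag of recent releases passes a NumPy array rather than a Series to the function:

   # modern form of the rolling apply
   h1w.rolling(window=5).apply(mean_abs_dev, raw=True).plot();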

An expanding window mean repeatedly calculates the mean by always starting with the first value in the time series and increasing the window size by one for each iteration. It can be expressed as a variant of pd.rolling_mean (as the expanding_mean lambda below illustrates), but pandas provides it directly as pd.expanding_mean. An expanding window mean will be more stable (less responsive) than a rolling window mean, because as the size of the window increases, the impact of each new value decreases:

In [95]:
   # an expanding mean is equivalent to a rolling mean whose
   # window always reaches back to the start of the series
   expanding_mean = lambda x: pd.rolling_mean(x, len(x),
                                              min_periods=1)
   h1w.plot()
   pd.expanding_mean(h1w).plot()
   plt.legend(labels=['Raw', 'Expanding Mean']);
[Plot: the expanding window mean plotted against the raw data]
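
In modern pandas, the expanding mean is computed with the .expanding() method:

   # modern form of the expanding mean
   h1w.expanding().mean().plot()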