We will now examine several common operations that are performed on time-series data. These operations entail realigning data, changing the frequency of the samples and their values, and calculating aggregate results on continuously moving subsets of the data to determine how the values behave as time changes. We will examine each of the following:

- Shifting and lagging values
- Frequency conversion
- Up- and downsampling (resampling)
- Rolling (moving) window calculations
A common operation on time-series data is to shift the values backward and forward in time. The pandas method for this is .shift(), which will shift the values in a Series or DataFrame by a specified number of units of the index's frequency.

To demonstrate shifting, we will use the following Series. It has five values, is indexed by date starting at 2014-08-01, and uses a daily frequency:
In [70]:
# create a Series to work with
np.random.seed(123456)
ts = Series([1, 2, 2.5, 1.5, 0.5],
            pd.date_range('2014-08-01', periods=5))
ts

Out[70]:
2014-08-01    1.0
2014-08-02    2.0
2014-08-03    2.5
2014-08-04    1.5
2014-08-05    0.5
Freq: D, dtype: float64
The following shifts the values forward by 1 day:
In [71]:
# shift forward one day
ts.shift(1)

Out[71]:
2014-08-01    NaN
2014-08-02    1.0
2014-08-03    2.0
2014-08-04    2.5
2014-08-05    1.5
Freq: D, dtype: float64
pandas has moved the values forward one unit of the index's frequency, which is one day. The index remains unchanged. There was no replacement value for 2014-08-01, so it is filled with NaN.
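If the NaN introduced by the shift is not wanted, it can be replaced afterwards with .fillna(); this is a minimal sketch, and the fill value of 0 is purely illustrative:

# replace the NaN created by the shift with a value of our choosing
ts.shift(1).fillna(0)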
A lag is a shift in a negative direction. The following lags the Series by 2 days:
In [72]:
# lag two days
ts.shift(-2)

Out[72]:
2014-08-01    2.5
2014-08-02    1.5
2014-08-03    0.5
2014-08-04    NaN
2014-08-05    NaN
Freq: D, dtype: float64
Index labels 2014-08-04 and 2014-08-05 now have NaN values, as there were no items to replace them with.
A common calculation that is performed using a shift is to calculate the daily percentage change in values. This can be performed by dividing a Series object by its values shifted by 1:
In [73]:
# calculate daily percentage change
ts / ts.shift(1)

Out[73]:
2014-08-01         NaN
2014-08-02    2.000000
2014-08-03    1.250000
2014-08-04    0.600000
2014-08-05    0.333333
Freq: D, dtype: float64
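Strictly speaking, the division above yields the ratio of each value to the previous one; the actual percentage change subtracts 1 from that ratio. pandas also provides the .pct_change() method, which computes this directly. A minimal sketch:

# percentage change as a fraction: equivalent to ts / ts.shift(1) - 1
ts.pct_change()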
Shifts can be performed on a different frequency than that of the index. When this is done, the index is modified and the values remain the same. As an example, the following shifts the Series forward by one business day:
In [74]:
# shift forward one business day
ts.shift(1, freq="B")

Out[74]:
2014-08-04    1.0
2014-08-04    2.0
2014-08-04    2.5
2014-08-05    1.5
2014-08-06    0.5
dtype: float64
As another example, the following shifts the index forward by 5 hours:
In [75]:
# shift forward five hours
ts.shift(5, freq="H")

Out[75]:
2014-08-01 05:00:00    1.0
2014-08-02 05:00:00    2.0
2014-08-03 05:00:00    2.5
2014-08-04 05:00:00    1.5
2014-08-05 05:00:00    0.5
dtype: float64
A time series can also be shifted using a DateOffset. The following code shifts the time series forward by 0.5 minutes:
In [76]:
# shift using a DateOffset
ts.shift(1, DateOffset(minutes=0.5))

Out[76]:
2014-08-01 00:00:30    1.0
2014-08-02 00:00:30    2.0
2014-08-03 00:00:30    2.5
2014-08-04 00:00:30    1.5
2014-08-05 00:00:30    0.5
dtype: float64
There is an alternative form of shifting provided by the .tshift() method. This method shifts only the index labels, by the specified number of units of the frequency given by the freq parameter (which is required). The following code demonstrates this approach by adjusting the index by -1 hour:
In [77]:
# shift just the index values
ts.tshift(-1, freq='H')

Out[77]:
2014-07-31 23:00:00    1.0
2014-08-01 23:00:00    2.0
2014-08-02 23:00:00    2.5
2014-08-03 23:00:00    1.5
2014-08-04 23:00:00    0.5
dtype: float64
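Conceptually, .tshift() simply relabels the index; much the same result can be obtained by applying an offset to the index directly. A minimal sketch, assuming the ts Series defined above:

# roughly equivalent to ts.tshift(-1, freq='H'):
# move every index label back one hour, leaving the values untouched
shifted = ts.copy()
shifted.index = shifted.index - pd.DateOffset(hours=1)
shifted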
Frequency data can be converted in pandas using the .asfreq() method of a time-series object, such as a Series or DataFrame. When converting frequency, a new Series object with a new DatetimeIndex object will be created. The DatetimeIndex of the new Series object starts at the first Timestamp of the original and progresses at the given frequency until the last Timestamp of the original. The values are then aligned into the new Series.
To demonstrate, we will use the following time series of consecutive incremental integers, indexed at two-hour intervals starting on August 1, 2014:
In [78]:
# create a Series of incremental values
# indexed at two-hour intervals starting August 1, 2014
periods = 31 * 24
hourly = Series(np.arange(0, periods),
                pd.date_range('08-01-2014', freq="2H", periods=periods))
hourly

Out[78]:
2014-08-01 00:00:00      0
2014-08-01 02:00:00      1
2014-08-01 04:00:00      2
...
2014-10-01 18:00:00    741
2014-10-01 20:00:00    742
2014-10-01 22:00:00    743
Freq: 2H, Length: 744
As an example, the following converts this time series to a daily frequency using .asfreq('D'):
In [79]:
# convert to daily frequency
# many items will be dropped due to alignment
daily = hourly.asfreq('D')
daily

Out[79]:
2014-08-01      0
2014-08-02     12
2014-08-03     24
...
2014-09-29    708
2014-09-30    720
2014-10-01    732
Freq: D, Length: 62
As the data was aligned from the source series into the new daily time series, only the values whose timestamps fall exactly on a day boundary were copied.
If we convert this result back to an hourly frequency, we will see that many of the values are NaN:
In [80]:
# convert back to hourly. Results in many NaNs
# as the new index has many labels that do not
# align with the source
daily.asfreq('H')

Out[80]:
2014-08-01 00:00:00      0
2014-08-01 01:00:00    NaN
2014-08-01 02:00:00    NaN
...
2014-09-30 22:00:00    NaN
2014-09-30 23:00:00    NaN
2014-10-01 00:00:00    732
Freq: H, Length: 1465
The new index has Timestamp objects at hourly intervals, so only the timestamps that fall exactly at midnight align with the daily time series; the remaining 1,403 of the 1,465 entries are NaN.
This default behavior can be changed using the method parameter of the .asfreq() method. This parameter can be used to forward fill, backward fill, or pad the NaN values.
The ffill method will forward fill using the last known value (pad does the same):
In [81]:
# forward fill values
daily.asfreq('H', method='ffill')

Out[81]:
2014-08-01 00:00:00      0
2014-08-01 01:00:00      0
2014-08-01 02:00:00      0
...
2014-09-30 22:00:00    720
2014-09-30 23:00:00    720
2014-10-01 00:00:00    732
Freq: H, Length: 1465
The bfill method will back fill values from the next known value:
In [82]:
daily.asfreq('H', method='bfill')

Out[82]:
2014-08-01 00:00:00      0
2014-08-01 01:00:00     12
2014-08-01 02:00:00     12
...
2014-09-30 22:00:00    732
2014-09-30 23:00:00    732
2014-10-01 00:00:00    732
Freq: H, Length: 1465
Frequency conversion provides a basic way to convert the index of a time series to another frequency. Data in the new time series is aligned with the old data, which can result in many NaN values. This can be partially addressed using a fill method, but that is limited in its ability to fill with appropriate information.
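When forward or backward filling is too coarse, the gaps created by .asfreq() can instead be filled by interpolation using the .interpolate() method (demonstrated again later in this section). A minimal sketch, assuming the daily series created above:

# convert back to hourly and linearly interpolate between the daily values
daily.asfreq('H').interpolate()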
Resampling differs in that it does not perform a pure alignment. The values placed in the new series can use the same forward and reverse fill options, but they can also be specified using other pandas-provided algorithms or with your own functions.
To demonstrate resampling, we will use the following time series, which represents a random walk of values over a 5-day period:
In [83]:
# calculate a random walk five days long at one-second intervals
# this is the number of items that will be needed
count = 24 * 60 * 60 * 5
# create a series of values
np.random.seed(123456)
values = np.random.randn(count)
ws = pd.Series(values)
# calculate the walk
walk = ws.cumsum()
# patch the index
walk.index = pd.date_range('2014-08-01', periods=count, freq="S")
walk

Out[83]:
2014-08-01 00:00:00      0.469112
2014-08-01 00:00:01      0.186249
2014-08-01 00:00:02     -1.322810
...
2014-08-05 23:59:57    455.202981
2014-08-05 23:59:58    454.947362
2014-08-05 23:59:59    456.191430
Freq: S, Length: 432000
Resampling in pandas is accomplished using the .resample() method and passing it a new frequency. To demonstrate this, the following resamples our by-the-second data to by-the-minute data. This is a downsampling, as the result has a lower frequency and therefore fewer values:
In [84]:
# resample to minute intervals
walk.resample("1Min")

Out[84]:
2014-08-01 00:00:00     -8.718220
2014-08-01 00:01:00    -15.239213
2014-08-01 00:02:00     -9.179315
...
2014-08-05 23:57:00    450.078149
2014-08-05 23:58:00    444.637806
2014-08-05 23:59:00    453.837417
Freq: T, Length: 7200
Notice that the first value is -8.718220, whereas the original data had a value of 0.469112. A frequency conversion would have left this value at 0.469112, because frequency conversion only copies data through alignment; resampling does not. A resampling will actually split the data into buckets based on the new periods and then apply a particular operation to the data in each bucket, the default being to calculate the mean of each bucket. This can be verified with the following, which slices the first minute of data from the walk and calculates its mean:
In [85]:
# calculate the mean of the first minute of the walk
walk['2014-08-01 00:00'].mean()

Out[85]:
-8.7182200528326437
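To make the bucketing explicit, a similar result can be produced by grouping each sample by the minute it falls in and averaging each group. This is a minimal sketch, and it assumes a pandas version where DatetimeIndex.floor() is available (it is not part of the original examples):

# group the samples by the minute they belong to and average each group;
# this mirrors what resample("1Min") does with its default aggregation
walk.groupby(walk.index.floor('1Min')).mean().head()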
In downsampling, as the existing data is put into buckets based on the new intervals, there can often be a question of which values sit at each end of a bucket. As an example, should the first interval in the previous resampling run from 2014-08-01 00:00:00 through 2014-08-01 00:00:59, or should it end at 2014-08-01 00:01:00 but start at 2014-08-01 00:00:01?
The default is the former, which is referred to as a left close. The other scenario, which excludes the left value and includes the right, is a right close and can be selected using the closed='right' parameter. The following demonstrates this; notice the slight difference in the intervals and values:
In [86]:
# use a right close
walk.resample("1Min", closed='right')

Out[86]:
2014-07-31 23:59:00      0.469112
2014-08-01 00:00:00     -8.907477
2014-08-01 00:01:00    -15.280685
...
2014-08-05 23:57:00    450.039159
2014-08-05 23:58:00    444.631719
2014-08-05 23:59:00    453.955377
Freq: T, Length: 7201
The decision about whether to use a right or left close is really up to you and your data modeling, but pandas gives you the option.
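Closely related to closed is the label parameter of .resample(), which controls whether each bucket is labeled with its left or right edge. This is not shown in the original examples, but a minimal sketch using the same older resample API would be:

# label each right-closed bucket by its right edge instead of its left edge
walk.resample("1Min", closed='right', label='right')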
The specific method that is applied to each group of samples can be specified using the how parameter. As an example, the following takes the first value of each bucket:
In [87]:
# take the first value of each bucket
walk.resample("1Min", how='first')

Out[87]:
2014-08-01 00:00:00      0.469112
2014-08-01 00:01:00    -10.886314
2014-08-01 00:02:00    -13.374656
...
2014-08-05 23:57:00    449.582419
2014-08-05 23:58:00    447.243014
2014-08-05 23:59:00    446.877810
Freq: T, Length: 7200
The how parameter can be supplied with the name of any NumPy array function that takes an array and produces an aggregated value, as well as with any function you create on your own.
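For example, under the older resample API used throughout these examples, how can be given a user-defined function directly; the following sketch (the lambda is purely illustrative) computes the range of the values within each one-minute bucket:

# aggregate each one-minute bucket with a custom function: max minus min
walk.resample("1Min", how=lambda x: x.max() - x.min())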
To demonstrate upsampling, we will resample the walk to minutes and then back to seconds:
In [88]:
# resample to 1 minute intervals, then back to 1 sec
bymin = walk.resample("1Min")
bymin.resample('S')

Out[88]:
2014-08-01 00:00:00      -8.71822
2014-08-01 00:00:01           NaN
2014-08-01 00:00:02           NaN
...
2014-08-05 23:58:58           NaN
2014-08-05 23:58:59           NaN
2014-08-05 23:59:00    453.837417
Freq: S, Length: 431941
The upsampling created the index values for the second-by-second data but inserted NaN values by default. This default behavior can be modified using the fill_method parameter. We saw the forward and backward filling options when changing frequency, and they are also available with resampling. The following demonstrates how to use the backward fill:
In [89]:
# resample to 1 second intervals using backward fill
bymin.resample("S", fill_method="bfill")

Out[89]:
2014-08-01 00:00:00     -8.718220
2014-08-01 00:00:01    -15.239213
2014-08-01 00:00:02    -15.239213
...
2014-08-05 23:58:58    453.837417
2014-08-05 23:58:59    453.837417
2014-08-05 23:59:00    453.837417
Freq: S, Length: 431941
It is also possible to interpolate the missing values using the .interpolate() method on the result. This will calculate a linear interpolation between the existing values for all of the NaN values created during the resampling:
In [90]:
# demonstrate interpolating the NaN values
interpolated = bymin.resample("S").interpolate()
interpolated

Out[90]:
2014-08-01 00:00:00     -8.718220
2014-08-01 00:00:01     -8.826903
2014-08-01 00:00:02     -8.935586
...
2014-08-05 23:58:58    453.530764
2014-08-05 23:58:59    453.684090
2014-08-05 23:59:00    453.837417
Freq: S, Length: 431941
pandas provides a very convenient resampling method referred to as open, high, low, and close, which is specified with the how='ohlc' parameter. The following example takes our second-by-second data and calculates hour-by-hour ohlc values:
In [91]:
# show ohlc resampling
ohlc = walk.resample("H", how="ohlc")
ohlc

Out[91]:
                           open        high         low       close
2014-08-01 00:00:00    0.469112    0.469112  -67.873166   -2.922520
2014-08-01 01:00:00   -3.374321   23.793007  -56.585154  -55.101543
2014-08-01 02:00:00  -54.276885    5.232441  -87.809456    1.913276
2014-08-01 03:00:00    0.260576   17.124638  -65.820652  -38.530620
2014-08-01 04:00:00  -38.436581    3.537231 -109.805294  -61.014553
...                         ...         ...         ...         ...
2014-08-05 19:00:00  437.652077  550.408942  430.549178  494.471788
2014-08-05 20:00:00  496.539759  510.371745  456.365565  476.505765
2014-08-05 21:00:00  476.025498  506.952877  425.472410  498.547578
2014-08-05 22:00:00  497.941355  506.599652  411.119919  443.925832
2014-08-05 23:00:00  443.017962  489.083657  426.062444  456.191430

[120 rows x 4 columns]
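The result is a DataFrame with one column per statistic, so the usual column operations apply. As a small illustrative usage (not from the original examples), the hourly range of the walk can be derived from it:

# the difference between the high and low of each hourly bucket
(ohlc['high'] - ohlc['low']).head()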
pandas provides a number of functions to compute moving (also known as rolling) statistics. In a rolling window, pandas computes the statistic on a window of data represented by a particular period of time. The window is then rolled along a certain interval, and the statistic is continually calculated on each window as long as the window fits within the dates of the time series.
pandas provides direct support for rolling windows with the following functions:
Function | Description
---|---
pd.rolling_mean | The mean of the values in the window
pd.rolling_std | The standard deviation of the values in the window
pd.rolling_var | The variance of the values in the window
pd.rolling_min | The minimum of the values in the window
pd.rolling_max | The maximum of the values in the window
pd.rolling_cov | The covariance of the values in the window
pd.rolling_quantile | The moving window score at a percentile/sample quantile
pd.rolling_corr | The correlation of the values in the window
pd.rolling_median | The median of the values in the window
pd.rolling_sum | The sum of the values in the window
pd.rolling_apply | The application of a user function to the values in the window
pd.rolling_count | The number of non-NaN values in the window
pd.rolling_skew | The skewness of the values in the window
pd.rolling_kurt | The kurtosis of the values in the window
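Each of these functions takes the Series (or DataFrame) as its first argument and the window size as its second. As a small hedged sketch using the same pd.rolling_* API on the walk created earlier (the 60-sample window is arbitrary):

# rolling standard deviation over a 60-sample (one-minute) window
pd.rolling_std(walk, 60)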
As a practical example, a rolling mean is commonly used to smooth out short-term fluctuations and highlight longer-term trends, particularly in financial time-series analysis.
To demonstrate, we will calculate a rolling mean with a window of 5 on the first minute of the random walk created earlier in the chapter. The following calculates this and then plots the result against the raw data to demonstrate the smoothing created by the rolling mean:
In [92]:
first_minute = walk['2014-08-01 00:00']
# calculate a rolling mean window of 5 periods
pd.rolling_mean(first_minute, 5).plot()
# plot it against the raw data
first_minute.plot()
# add a legend
plt.legend(labels=['Rolling Mean', 'Raw']);
The generation of charts will be covered in more detail in Chapter 11, Visualization.
It can be seen how rolling_mean provides a smoother representation of the underlying data. A larger window will create less variance, and a smaller window will create more (until the window size is 1, at which point the result is identical to the original series).
The following demonstrates the rolling mean with windows of 2, 5, and 10 plotted against the original series:
In [93]:
# demonstrate the difference between 2, 5 and
# 10 interval rolling windows
h1w = walk['2014-08-01 00:00']
h1w.plot()
pd.rolling_mean(h1w, 2).plot()
pd.rolling_mean(h1w, 5).plot()
pd.rolling_mean(h1w, 10).plot()
plt.legend(labels=['Raw', '2-interval RM',
                   '5-interval RM', '10-interval RM']);
Note that the larger the window, the more data is missing at the beginning of the curve. A window of size n requires n data points before the measure can be calculated, hence the gap at the beginning of the plot.
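If that leading gap is undesirable, the rolling functions accept a min_periods parameter that lets a result be produced as soon as that many observations are available. A minimal sketch, reusing the 10-sample window purely for illustration:

# start producing values after a single observation instead of waiting
# for the full 10-sample window to fill
pd.rolling_mean(h1w, 10, min_periods=1).plot();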
Any function can be applied via a rolling window using the pd.rolling_apply function. The supplied function will be passed an array of the values in the window and should return a single value; pandas aggregates these results into a time series.
To demonstrate, the following code calculates the mean absolute deviation, which gives you a feel for how far, on average, the values in each window are from that window's mean:
In [94]:
# calculate mean absolute deviation with window of 5 intervals
mean_abs_dev = lambda x: np.fabs(x - x.mean()).mean()
pd.rolling_apply(h1w, 5, mean_abs_dev).plot();
An expanding window mean can be calculated as a slight variant of pd.rolling_mean: the mean is repeatedly calculated with a window that always starts at the first value of the time series and grows by one with each iteration (pandas provides this directly as pd.expanding_mean). An expanding window mean will be more stable (less responsive) than a rolling window because, as the size of the window increases, the impact of the next value decreases:
In [95]:
# the rolling_mean-based equivalent of an expanding mean
expanding_mean = lambda x: pd.rolling_mean(x, len(x), min_periods=1)
# plot the raw data and the expanding mean provided by pandas
h1w.plot()
pd.expanding_mean(h1w).plot()
plt.legend(labels=['Raw', 'Expanding mean']);
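pandas also provides expanding counterparts for most of the rolling functions listed earlier (for example pd.expanding_sum, pd.expanding_std, and pd.expanding_apply in the same older API). A small hedged sketch:

# expanding standard deviation of the first minute of the walk
pd.expanding_std(h1w).plot();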