Handling time series

In this section, we show you how to handle time series data. We start by creating a time series from data read in from a CSV file.

Reading in time series data

Here, we demonstrate the various ways to read in time series data:

In [7]: ibmData=pd.read_csv('ibm-common-stock-closing-prices-1959_1960.csv')
  ibmData.head()
Out[7]:    TradeDate     closingPrice
        0   1959-06-29   445
        1   1959-06-30   448
        2   1959-07-01   450
        3   1959-07-02   447
        4   1959-07-06   451
        5 rows 2 columns

The source of this information can be found at http://datamarket.com.

We would like the TradeDate column to be a series of datetime values so that we can index it and create a time series. Let us first check the type of values in the TradeDate series:

In [16]: type(ibmData['TradeDate'])
Out[16]: pandas.core.series.Series
In [12]: type(ibmData['TradeDate'][0])
Out[12]: str

Next, we convert it to a Timestamp type:

In [17]: ibmData['TradeDate']=pd.to_datetime(ibmData['TradeDate'])
        type(ibmData['TradeDate'][0])
Out[17]: pandas.tslib.Timestamp

We can now use the TradeDate column as an index:

In [113]: #Convert DataFrame to TimeSeries
          #Resampling creates NaN rows for weekend dates, hence use dropna
          ibmTS=ibmData.set_index('TradeDate').resample('D')['closingPrice'].dropna()
          ibmTS
Out[113]: TradeDate
          1959-06-29    445
          1959-06-30    448
          1959-07-01    450
          1959-07-02    447
          1959-07-06    451
          ...
          Name: closingPrice, Length: 255

DateOffset and TimeDelta objects

A DateOffset object represents a change or offset in time. The key features of a DateOffset object are as follows:

  • This can be added/subtracted to/from a datetime object to obtain a shifted date
  • This can be multiplied by an integer (positive or negative) so that the increment can be applied multiple times
  • This has the rollforward and rollback methods to move a date forward to the next offset date or backward to the previous offset date

We illustrate how we use a DateOffset object as follows:

In [371]: xmasDay=pd.datetime(2014,12,25)
          xmasDay
Out[371]: datetime.datetime(2014, 12, 25, 0, 0)

In [373]: boxingDay=xmasDay+pd.DateOffset(days=1)
          boxingDay
Out[373]: Timestamp('2014-12-26 00:00:00', tz=None)

In [390]: today=pd.datetime.now()
          today
Out[390]: datetime.datetime(2014, 5, 31, 13, 7, 36, 440060)

Note that datetime.datetime is different from pd.Timestamp. The former is a standard Python class and is less efficient, while the latter is based on the numpy.datetime64 datatype. The pd.DateOffset object works with pd.Timestamp, and adding it to a datetime.datetime object casts that object into a pd.Timestamp object.
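We can verify this promotion with a quick check. The following is a minimal sketch; note that the pd.datetime alias used above has been removed in newer pandas releases, so here we construct the datetime from the standard library instead:

    import datetime
    import pandas as pd

    plainDate = datetime.datetime(2014, 12, 25)   # standard library datetime
    shifted = plainDate + pd.DateOffset(days=1)   # adding a DateOffset promotes it

    type(plainDate)   # datetime.datetime
    type(shifted)     # pandas Timestamp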

The following illustrates the command for one week from today:

In [392]: today+pd.DateOffset(weeks=1)
Out[392]: Timestamp('2014-06-07 13:07:36.440060', tz=None)

The following illustrates the command for five years from today (an offset of 2 years and 6 months, applied twice):

In [394]: today+2*pd.DateOffset(years=2, months=6)
Out[394]: Timestamp('2019-05-30 13:07:36.440060', tz=None)

Here is an example of using the rollforward functionality. QuarterBegin is a DateOffset object that is used to increment a given datetime object to the start of the next calendar quarter:

In [18]: lastDay=pd.datetime(2013,12,31)
In [24]: from pandas.tseries.offsets import QuarterBegin
        dtoffset=QuarterBegin()
        lastDay+dtoffset
Out[24]: Timestamp('2014-03-01 00:00:00', tz=None)

In [25]: dtoffset.rollforward(lastDay)
Out[25]: Timestamp('2014-03-01 00:00:00', tz=None)

Thus, we can see that the next quarter after December 31, 2013 starts on March 1, 2014 (QuarterBegin defaults to quarters beginning in March). Timedeltas are similar to DateOffsets, but work with datetime.datetime objects. Their use is illustrated by the following commands:

In [40]: weekDelta=datetime.timedelta(weeks=1)
         weekDelta
Out[40]: datetime.timedelta(7)

In [39]: today=pd.datetime.now()
         today
Out[39]: datetime.datetime(2014, 6, 2, 3, 56, 0, 600309)

In [41]: today+weekDelta
Out[41]: datetime.datetime(2014, 6, 9, 3, 56, 0, 600309)

Time series-related instance methods

In this section, we explore various methods for Time Series objects such as shifting, frequency conversion, and resampling.

Shifting/lagging

Sometimes, we may wish to shift the values in a Time Series backward or forward in time. One possible scenario is when a dataset contains the list of start dates for last year's new employees in a firm, and the company's human resource program wishes to shift these dates forward by one year so that the employees' benefits can be activated. We can do this by using the shift() function as follows:

In [117]: ibmTS.shift(3)
Out[117]: TradeDate
    1959-06-29    NaN
    1959-06-30    NaN
    1959-07-01    NaN
    1959-07-02    445
    1959-07-06    448
    1959-07-07    450
    1959-07-08    447
    ...

This shifts the values forward across all the days in the index. However, if we wish to shift by business days only, we must use the following command:

In [119]: ibmTS.shift(3, freq=pd.datetools.bday)
Out[119]: TradeDate
    1959-07-02    445
    1959-07-03    448
    1959-07-06    450
    1959-07-07    447
    1959-07-09    451

In the preceding snippet, we have specified the freq argument to shift; this tells the function to shift by business days only. The freq argument accepts a DateOffset object, a timedelta-like object, or an offset alias. Thus, using ibmTS.shift(3, freq='B') would produce the same result.
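Here is a minimal sketch of these equivalent forms (note that pd.datetools has been removed in newer pandas releases, so the business-day offset is imported from pandas.tseries.offsets instead):

    from pandas.tseries.offsets import BDay

    shiftedByAlias = ibmTS.shift(3, freq='B')       # offset alias
    shiftedByOffset = ibmTS.shift(3, freq=BDay())   # DateOffset instance

    shiftedByAlias.equals(shiftedByOffset)          # True: both shift the index by 3 business days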

Frequency conversion

We can use the asfreq function to change frequencies, as shown in the following example:

In [131]: # Frequency conversion using asfreq
          ibmTS.asfreq('BM')
Out[131]: 1959-06-30    448
    1959-07-31    428
    1959-08-31    425
    1959-09-30    411
    1959-10-30    411
    1959-11-30    428
    1959-12-31    439
    1960-01-29    418
    1960-02-29    419
    1960-03-31    445
    1960-04-29    453
    1960-05-31    504
    1960-06-30    522
    Freq: BM, Name: closingPrice, dtype: float64

In this case, we just obtain the values corresponding to the last business day of each month from the ibmTS time series. Here, BM stands for business month end frequency. For a list of all possible frequency aliases, go to http://bit.ly/1cMI3iA.

If we specify a frequency that is smaller than the granularity of the data, the gaps will be filled in with NaN values:

In [132]: ibmTS.asfreq('H')
Out[132]: 1959-06-29 00:00:00    445
    1959-06-29 01:00:00    NaN
    1959-06-29 02:00:00    NaN
    1959-06-29 03:00:00    NaN
    ...
    1960-06-29 23:00:00    NaN
    1960-06-30 00:00:00    522
    Freq: H, Name: closingPrice, Length: 8809

We can also apply the asfreq method to the Period and PeriodIndex objects similar to how we do for the datetime and Timestamp objects. Period and PeriodIndex are introduced later and are used to represent time intervals.
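As a quick preview, here is a minimal sketch of asfreq applied to a Period (Periods are covered in detail later in this section):

    p = pd.Period('2014', freq='A-DEC')   # the calendar year 2014
    p.asfreq('M', how='start')            # Period('2014-01', 'M')
    p.asfreq('M', how='end')              # Period('2014-12', 'M')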

The asfreq method accepts a method argument that allows you to forward fill (ffill) or back fill (bfill) the gaps, similar to fillna:

In [140]: ibmTS.asfreq('H', method='ffill')
Out[140]: 1959-06-29 00:00:00    445
    1959-06-29 01:00:00    445
    1959-06-29 02:00:00    445
    1959-06-29 03:00:00    445
    ...
    1960-06-29 23:00:00    522
    1960-06-30 00:00:00    522
    Freq: H, Name: closingPrice, Length: 8809

Resampling of data

The TimeSeries.resample function enables us to summarize/aggregate more granular data based on a sampling interval and a sampling function.

Downsampling is a term that originates from digital signal processing and refers to the process of reducing the sampling rate of a signal. In the case of data, we use it to reduce the amount of data that we wish to process.

The opposite process is upsampling, which is used to increase the amount of data to be processed and requires interpolation to obtain the intermediate data points. For more information on downsampling and upsampling, refer to Practical Applications of Upsampling and Downsampling at http://bit.ly/1JC95HD and Downsampling Time Series for Visual Representation at http://bit.ly/1zrExVP.
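To make the distinction concrete, here is a minimal sketch using a small synthetic series (the dates and values are made up for illustration, and the example uses the method-chaining style of newer pandas releases):

    rng = pd.date_range('2014-01-01', periods=6, freq='12H')
    ts = pd.Series(range(6), index=rng)

    ts.resample('D').mean()     # downsampling: 12-hourly -> daily, aggregated with the mean
    ts.resample('6H').ffill()   # upsampling: 12-hourly -> 6-hourly, gaps forward filled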

Here, we examine some tick data for use in resampling. Before we examine the data, we need to prepare it. In doing so, we will learn some useful techniques for time series data, which are as follows:

  • Epoch Timestamps
  • Timezone handling

Here is an example that uses tick data for stock prices of Google for Tuesday, May 27, 2014:

In [150]: googTickData=pd.read_csv('./GOOG_tickdata_20140527.csv')
In [151]: googTickData.head()
Out[151]:     Timestamp   close    high    low   open   volume
         0    1401197402  555.008 556.41  554.35 556.38   81100
         1    1401197460  556.250 556.30  555.25 555.25   18500
         2    1401197526  556.730 556.75  556.05 556.39   9900
         3    1401197582  557.480 557.67  556.73 556.73   14700
         4    1401197642  558.155 558.66  557.48 557.59   15700
         5 rows 6 columns

The source for the preceding data can be found at http://bit.ly/1MKBwlB.

As you can see from the preceding section, we have a Timestamp column along with the columns for the close, high, low, and opening prices and the volume of trades of the Google stock.

So, why does the Timestamp column seem a bit strange? Well, tick data Timestamps are generally expressed in epoch time (for more information, refer to http://en.wikipedia.org/wiki/Unix_epoch) as a more compact means of storage. We'll need to convert this into a more human-readable time, and we can do this as follows:

In [201]: googTickData['tstamp']=pd.to_datetime(googTickData['Timestamp'],unit='s',utc=True)

In [209]: googTickData.head()
Out[209]:
       Timestamp   close   high   low    open   volume tstamp
    0  1401197402  555.008 556.41 554.35 556.38 81100 2014-05-27 13:30:02
    1  1401197460  556.250 556.30 555.25 555.25 18500 2014-05-27 13:31:00
    2  1401197526  556.730 556.75 556.05 556.39 9900  2014-05-27 13:32:06
    3  1401197582  557.480 557.67 556.73 556.73 14700 2014-05-27 13:33:02
    4  1401197642  558.155 558.66 557.48 557.59 15700 2014-05-27 13:34:02
    5 rows 7 columns

We would now like to make the tstamp column the index and eliminate the epoch Timestamp column:

In [210]: googTickTS=googTickData.set_index('tstamp')
          googTickTS=googTickTS.drop('Timestamp',axis=1)
          googTickTS.head()
Out[210]: 
        tstamp                 close    high    low     open     volume
        2014-05-27 13:30:02    555.008  556.41  554.35  556.38   81100
        2014-05-27 13:31:00    556.250  556.30  555.25  555.25   18500
        2014-05-27 13:32:06    556.730  556.75  556.05  556.39   9900
        2014-05-27 13:33:02    557.480  557.67  556.73  556.73   14700
        2014-05-27 13:34:02    558.155  558.66  557.48  557.59   15700
        5 rows 5 columns

Note that the tstamp index column has the times in UTC, and we can convert it to US/Eastern time using two methods, tz_localize and tz_convert:

In [211]: googTickTS.index=googTickTS.index.tz_localize('UTC').tz_convert('US/Eastern')

In [212]: googTickTS.head()
Out[212]:   
        tstamp                     close    high    low     open   volume    
        2014-05-27 09:30:02-04:00  555.008  556.41  554.35  556.38  81100
        2014-05-27 09:31:00-04:00  556.250  556.30  555.25  555.25  18500
        2014-05-27 09:32:06-04:00  556.730  556.75  556.05  556.39   9900
        2014-05-27 09:33:02-04:00  557.480  557.67  556.73  556.73  14700
        2014-05-27 09:34:02-04:00  558.155  558.66  557.48  557.59  15700
        5 rows 5 columns

In [213]: googTickTS.tail()
Out[213]:
      tstamp                       close     high   low    open    volume
      2014-05-27 15:56:00-04:00    565.4300  565.48 565.30 565.385  14300
      2014-05-27 15:57:00-04:00    565.3050  565.46 565.20 565.400  14700
      2014-05-27 15:58:00-04:00    565.1101  565.31 565.10 565.310  23200
      2014-05-27 15:59:00-04:00    565.9400  566.00 565.08 565.230  55600
      2014-05-27 16:00:00-04:00    565.9500  565.95 565.95 565.950 126000
      5 rows 5 columns

In [214]: len(googTickTS)
Out[214]: 390

From the preceding output, we can see ticks for every minute of the trading day, from 9:30 a.m., when the stock market opens, until 4:00 p.m., when it closes. This results in 390 rows in the dataset, since there are 390 minutes (6.5 hours × 60) between 9:30 a.m. and 4:00 p.m.

Suppose we want to obtain a snapshot every 5 minutes instead of every minute. We can achieve this by downsampling as follows:

In [216]: googTickTS.resample('5Min').head(6)
Out[216]:                      close       high        low       open   volume
tstamp
2014-05-27 09:30:00-04:00  556.72460  557.15800  555.97200  556.46800    27980
2014-05-27 09:35:00-04:00  556.93648  557.64800  556.85100  557.34200    24620
2014-05-27 09:40:00-04:00  556.48600  556.79994  556.27700  556.60678     8620
2014-05-27 09:45:00-04:00  557.05300  557.27600  556.73800  556.96600     9720
2014-05-27 09:50:00-04:00  556.66200  556.93596  556.46400  556.80326    14560
2014-05-27 09:55:00-04:00  555.96580  556.35400  555.85800  556.23600    12400
6 rows 5 columns

The default function used for resampling is the mean. However, we can also specify other functions, such as the minimum, and we can do this via the how parameter to resample:

In [245]: googTickTS.resample('10Min', how=np.min).head(4)
Out[245]:         close   high      low  open  volume
tstamp
2014-05-27 09:30:00-04:00   555.008  556.3000  554.35  555.25   9900
2014-05-27 09:40:00-04:00   556.190  556.5600  556.13  556.35   3500
2014-05-27 09:50:00-04:00   554.770  555.5500  554.77  555.55   3400
2014-05-27 10:00:00-04:00   554.580  554.9847  554.45  554.58   1800

Various function names can be passed to the how parameter, such as sum, ohlc, max, min, std, mean, median, first, and last.
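Note that newer pandas releases have removed the how parameter in favor of method chaining and the agg function; here is a minimal sketch of the equivalent calls:

    googTickTS.resample('10Min').min().head(4)                 # equivalent to how=np.min
    googTickTS.resample('10Min').agg(['min', 'max']).head(4)   # several aggregations at once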

The ohlc function returns open-high-low-close values on time series data, that is, the first, maximum, minimum, and last values. To specify whether the left or right interval is closed, we can pass the closed parameter as follows:

In [254]: pd.set_option('display.precision',5)
          googTickTS.resample('5Min', closed='right').tail(3)
Out[254]:                   close     high  low     open       volume
tstamp
2014-05-27 15:45:00-04:00   564.3167  564.3733   564.1075  564.1700  12816.6667
2014-05-27 15:50:00-04:00   565.1128  565.1725   565.0090  565.0650  13325.0000
2014-05-27 15:55:00-04:00   565.5158  565.6033   565.3083  565.4158  40933.3333
3 rows 5 columns

Thus, in the preceding command, we can see that the last row shows the tick at 15:55 instead of 16:00.

For upsampling, we need to specify a fill method to determine how the gaps should be filled via the fill_method parameter:

In [263]: googTickTS[:3].resample('30s', fill_method='ffill')
Out[263]:     close    high     low    open  volume    tstamp
     2014-05-27 09:30:00-04:00   555.008  556.41  554.35  556.38   81100
    2014-05-27 09:30:30-04:00   555.008  556.41  554.35  556.38   81100
    2014-05-27 09:31:00-04:00   556.250  556.30  555.25  555.25   18500
    2014-05-27 09:31:30-04:00   556.250  556.30  555.25  555.25   18500
    2014-05-27 09:32:00-04:00   556.730  556.75  556.05  556.39   9900
    5 rows 5 columns


In [264]: googTickTS[:3].resample('30s', fill_method='bfill')
Out[264]:
               close     high    low  open  volume     tstamp
    2014-05-27 09:30:00-04:00  555.008   556.41  554.35  556.38   81100
    2014-05-27 09:30:30-04:00  556.250   556.30  555.25  555.25   18500
    2014-05-27 09:31:00-04:00  556.250   556.30  555.25  555.25   18500
    2014-05-27 09:31:30-04:00  556.730   556.75  556.05  556.39   9900
    2014-05-27 09:32:00-04:00  556.730   556.75  556.05  556.39   9900
    5 rows 5 columns

Unfortunately, the fill_method parameter currently supports only two methods—forward fill and back fill. An interpolation method would be valuable.
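Newer pandas releases address this by providing an interpolate method on the resampled object; a minimal sketch:

    googTickTS[:3]['close'].resample('30s').interpolate()   # linear interpolation between ticks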

Aliases for Time Series frequencies

To specify offsets, a number of aliases are available; some of the most commonly used ones are as follows:

  • B, BM: These stand for business day and business month end. Business days are the working days of the month, that is, any day that is not a holiday or a weekend.
  • D, W, M, Q, A: These stand for calendar day, week, month end, quarter end, and year end.
  • H, T, S, L, U: These stand for hour, minute, second, millisecond, and microsecond.

These aliases can also be combined. In the following case, we resample every 7 minutes and 30 seconds:

In [267]: googTickTS.resample('7T30S').head(5)
Out[267]:
                           close     high   low   open    volume  
tstamp
2014-05-27 09:30:00-04:00 556.8266 557.4362 556.3144 556.8800 28075.0
2014-05-27 09:37:30-04:00 556.5889 556.9342 556.4264 556.7206 11642.9
2014-05-27 09:45:00-04:00 556.9921 557.2185 556.7171 556.9871  9800.0
2014-05-27 09:52:30-04:00 556.1824 556.5375 556.0350 556.3896 14350.0
2014-05-27 10:00:00-04:00 555.2111 555.4368 554.8288 554.9675 12512.5
5 rows x 5 columns

Suffixes can be applied to the frequency aliases to specify when in a frequency period to start. These are known as anchoring offsets:

  • W-SUN, W-MON, ...: for example, W-TUE indicates a weekly frequency starting on a Tuesday.
  • Q-JAN, Q-FEB, ... Q-DEC: for example, Q-MAY indicates a quarterly frequency with the year ending in May.
  • A-JAN, A-FEB, ... A-DEC: for example, A-MAY indicates an annual frequency with the year ending in May.

These offsets can be used as arguments to the date_range and bdate_range functions as well as constructors for index types such as PeriodIndex and DatetimeIndex. A comprehensive discussion on this can be found in the pandas documentation at http://pandas.pydata.org/pandas-docs/stable/timeseries.html#.
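For instance, here is a minimal sketch of an anchored weekly range (the start date is chosen only for illustration):

    pd.date_range('2014-06-01', periods=4, freq='W-TUE')
    # DatetimeIndex of the first four Tuesdays on or after 2014-06-01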

Time series concepts and datatypes

When dealing with time series, there are two main concepts that you have to consider: points in time and ranges, or time spans. In pandas, the former is represented by the Timestamp datatype, which is equivalent to Python's datetime.datetime (datetime) datatype and is interchangeable with it. The latter (time span) is represented by the Period datatype, which is specific to pandas.

Each of these datatypes has index datatypes associated with them: DatetimeIndex for Timestamp/Datetime and PeriodIndex for Period. These index datatypes are basically subtypes of numpy.ndarray that contain the corresponding Timestamp and Period datatypes and can be used as indexes for Series and DataFrame objects.
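As a minimal sketch of the correspondence between the two index types:

    pd.date_range('2014-01-01', periods=3, freq='D')   # DatetimeIndex of Timestamps (points in time)
    pd.period_range('2014-01', periods=3, freq='M')    # PeriodIndex of Periods (time spans)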

Period and PeriodIndex

The Period datatype is used to represent a range or span of time. Here are a few examples:

# Period representing the year ending in May 2014
In [287]: pd.Period('2014', freq='A-MAY')
Out[287]: Period('2014', 'A-MAY')

# Period representing specific day – June 11, 2014
In [292]: pd.Period('06/11/2014')
Out[292]: Period('2014-06-11', 'D')

# Period representing 11AM, Nov 11, 1918 
In [298]: pd.Period('11/11/1918 11:00',freq='H')
Out[298]: Period('1918-11-11 11:00', 'H')

We can add integers to Periods, which advances the period by the requisite number of units of the frequency:

In [299]: pd.Period('06/30/2014')+4
Out[299]: Period('2014-07-04', 'D')

In [303]: pd.Period('11/11/1918 11:00',freq='H') - 48
Out[303]: Period('1918-11-09 11:00', 'H')

We can also calculate the difference between two Periods; this returns the number of units of frequency between them:

In [304]: pd.Period('2014-04', freq='M')-pd.Period('2013-02', freq='M')
Out[304]: 14

PeriodIndex

A PeriodIndex object, which is an index type for a Period object, can be created in two ways:

  1. From a series of Period objects, using the period_range function (an analogue of date_range):
    In [305]: perRng=pd.period_range('02/01/2014','02/06/2014',freq='D')
              perRng
    Out[305]: <class 'pandas.tseries.period.PeriodIndex'>
              freq: D
              [2014-02-01, ..., 2014-02-06]
              length: 6
    
    In [306]: type(perRng[:2])
    Out[306]: pandas.tseries.period.PeriodIndex
    
    In [307]: perRng[:2]
    Out[307]: <class 'pandas.tseries.period.PeriodIndex'>
              freq: D
             [2014-02-01, 2014-02-02]
    

    As we can confirm from the preceding output, when you pull back the covers, a PeriodIndex is really an ndarray of Period objects underneath.

  2. It can also be done via a direct call to the PeriodIndex constructor:
    In [312]: JulyPeriod=pd.PeriodIndex(['07/01/2014','07/31/2014'], freq='D')
        JulyPeriod
    Out[312]: <class 'pandas.tseries.period.PeriodIndex'>
        freq: D
        [2014-07-01, 2014-07-31]
    

The difference between the two approaches, as can be seen from the preceding output, is that period_range fills in the resulting ndarray, but the PeriodIndex constructor does not, and you have to specify all the values that should be in the index.

Conversions between Time Series datatypes

We can convert the Period and PeriodIndex datatypes to Datetime/Timestamp and DatetimeIndex datatypes via the to_period and to_timestamp functions, as follows:

In [339]: worldCupFinal=pd.to_datetime('07/13/2014',
                                       errors='raise')
          worldCupFinal
Out[339]: Timestamp('2014-07-13 00:00:00')

In [340]: worldCupFinal.to_period('D')
Out[340]: Period('2014-07-13', 'D')

In [342]: worldCupKickoff=pd.Period('06/12/2014','D')
         worldCupKickoff
Out[342]: Period('2014-06-12', 'D')
In [345]: worldCupKickoff.to_timestamp()
Out[345]: Timestamp('2014-06-12 00:00:00', tz=None)

In [346]: worldCupDays=pd.date_range('06/12/2014',periods=32,   
                                      freq='D')
         worldCupDays
Out[346]: <class 'pandas.tseries.index.DatetimeIndex'>
    [2014-06-12, ..., 2014-07-13]
    Length: 32, Freq: D, Timezone: None

In [347]: worldCupDays.to_period()
Out[347]: <class 'pandas.tseries.period.PeriodIndex'>
    freq: D
    [2014-06-12, ..., 2014-07-13]
    length: 32