Introducing time-series data

Due to its roots in finance, pandas excels at manipulating time-series data. These capabilities have been refined continuously across its versions. They are part of the pandas core and do not require additional libraries, unlike R, which requires the inclusion of the zoo package to provide this functionality.

The core of the time-series functionality in pandas revolves around specialized indexes that represent measurements of data at one or more timestamps. These indexes are referred to as DatetimeIndex objects. Because they are part of the pandas core, data is automatically aligned based on date and time, which makes working with time-stamped sequences of data as easy as working with any other type of index.

We will now examine how to create time-series data and DatetimeIndex objects both using explicit timestamp objects and using specific durations of time (referred to in pandas as frequencies).

DatetimeIndex

Sequences of timestamp objects are represented by pandas as DatetimeIndex, which is a type of pandas index that is optimized for indexing by date and time.

There are several ways to create DatetimeIndex objects in pandas. The following creates a DatetimeIndex by passing a list of datetime objects as the index of a Series:

In [15]:
   # create a very simple time-series with two index labels
   # and random values
   dates = [datetime(2014, 8, 1), datetime(2014, 8, 2)]
   ts = pd.Series(np.random.randn(2), dates)
   ts

Out[15]:
   2014-08-01    1.566024
   2014-08-02    0.938517
   dtype: float64

Series has taken the datetime objects and constructed a DatetimeIndex from the date values, where each value of DatetimeIndex is a Timestamp object. This is one of the cases where pandas directly constructs Timestamp objects on your behalf.

The following verifies the type of the index and the types of the labels in the index:

In [16]:
   # what is the type of the index?
   type(ts.index)

Out[16]:
   pandas.tseries.index.DatetimeIndex

In [17]:
   # and we can see it is a collection of timestamps
   type(ts.index[0])

Out[17]:
   pandas.tslib.Timestamp
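
The module paths printed by type() can differ between pandas versions. The following is a minimal sketch of a check that does not depend on them, assuming only the public names pd.DatetimeIndex and pd.Timestamp; the variable names are illustrative:

```python
from datetime import datetime

import numpy as np
import pandas as pd

# same construction as above: datetime objects used as index labels
dates = [datetime(2014, 8, 1), datetime(2014, 8, 2)]
ts = pd.Series(np.random.randn(2), dates)

# check against the public class names instead of comparing module paths
index_is_datetime = isinstance(ts.index, pd.DatetimeIndex)
label_is_timestamp = isinstance(ts.index[0], pd.Timestamp)
print(index_is_datetime, label_is_timestamp)
```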

It is not required that you pass datetime objects to create a time series. Dates can also be given as strings, but note that Series does not convert string labels on its own; passed directly, they produce an ordinary index of strings that merely prints the same way. Wrapping the strings in pd.DatetimeIndex() performs the conversion. The following builds the same kind of time series as the previous example:

In [18]:
   # create from just a list of dates as strings!
   np.random.seed(123456)
   dates = ['2014-08-01', '2014-08-02']
   ts = pd.Series(np.random.randn(2), pd.DatetimeIndex(dates))
   ts

Out[18]:
   2014-08-01    0.469112
   2014-08-02   -0.282863
   dtype: float64

pandas provides a utility function, pd.to_datetime(). This function takes a sequence of similar- or mixed-type objects and attempts to convert each into a Timestamp and the collection of these timestamps into a DatetimeIndex. A missing value in the sequence, such as None, is represented by NaT (not-a-time) at that position in the index:

In [19]:
   # convert a sequence of objects to a DatetimeIndex
   dti = pd.to_datetime(['Aug 1, 2014', 
                         '2014-08-02', 
                         '2014.8.3', 
                         None])
   for l in dti: print (l)

   2014-08-01 00:00:00
   2014-08-02 00:00:00
   2014-08-03 00:00:00
   NaT
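
As a small sketch of how the NaT entries can be detected afterwards, the following converts a list containing None and asks the index which positions are missing. It deliberately uses strings in a single format, since how mixed formats are handled can depend on the installed pandas version:

```python
import pandas as pd

# None becomes NaT in the resulting DatetimeIndex
dti = pd.to_datetime(['2014-08-01', '2014-08-02', None])

# boolean array marking the missing (NaT) positions
missing = dti.isna()
print(dti)
print(missing)
```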

Be careful, as the pd.to_datetime() function will, by default, fall back to returning a NumPy array of objects instead of DatetimeIndex if it cannot parse a value to Timestamp:

In [20]:
   # this is a list of objects, not timestamps...
   pd.to_datetime(['Aug 1, 2014', 'foo'])

Out[20]:
   array(['Aug 1, 2014', 'foo'], dtype=object)

To force the function to convert to dates, you can use the coerce=True parameter. Values that cannot be converted will be assigned NaT in the resulting index:

In [21]:
   # force the conversion, NaT for items that don't work
   pd.to_datetime(['Aug 1, 2014', 'foo'], coerce=True)

Out[21]:
   <class 'pandas.tseries.index.DatetimeIndex'>
   [2014-08-01, NaT]
   Length: 2, Freq: None, Timezone: None
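
The coerce=True parameter belongs to the pandas version used in this text. In more recent releases, it was replaced by errors='coerce', and an unparseable value raises an error by default instead of falling back to an array of objects. The following sketch shows the newer spelling; treat it as an illustration for a recent installation, not as the form used elsewhere in this chapter:

```python
import pandas as pd

# errors='coerce' turns values that cannot be parsed into NaT
converted = pd.to_datetime(['Aug 1, 2014', 'foo'], errors='coerce')
print(converted)
```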

A range of timestamps at a specific frequency can be easily created using the pd.date_range() function. The following creates a Series object indexed by a DatetimeIndex of 10 consecutive days:

In [22]:
   # create a range of dates starting at a specific date
   # and for a specific number of days, creating a Series
   np.random.seed(123456)
   periods = pd.date_range('8/1/2014', periods=10)
   date_series = pd.Series(np.random.randn(10), index=periods)
   date_series

Out[22]:
   2014-08-01    0.469112
   2014-08-02   -0.282863
   2014-08-03   -1.509059
   2014-08-04   -1.135632
   2014-08-05    1.212112
   2014-08-06   -0.173215
   2014-08-07    0.119209
   2014-08-08   -1.044236
   2014-08-09   -0.861849
   2014-08-10   -2.104569
   Freq: D, dtype: float64
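
A range can be described either by a start and a number of periods or by a start and an end. The following sketch, using illustrative variable names, shows that both descriptions produce the same daily index for the dates used above:

```python
import pandas as pd

# start plus a count of periods (daily frequency is the default)
by_periods = pd.date_range('2014-08-01', periods=10)

# start plus an explicit end date
by_end = pd.date_range('2014-08-01', '2014-08-10')

print(by_periods.equals(by_end))
print(by_periods.freqstr)
```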

Like any pandas index, DatetimeIndex can be used for various index operations, such as data alignment, selection, and slicing. The following demonstrates slicing using index locations:

In [23]:
   # slice by location
   subset = date_series[3:7]
   subset

Out[23]:
   2014-08-04   -1.135632
   2014-08-05    1.212112
   2014-08-06   -0.173215
   2014-08-07    0.119209
   Freq: D, dtype: float64

To demonstrate alignment, we will use the following Series, created with the index of the subset we just created:

In [24]:
   # a Series to demonstrate alignment
   s2 = pd.Series([10, 100, 1000, 10000], subset.index)
   s2

Out[24]:
   2014-08-04       10
   2014-08-05      100
   2014-08-06     1000
   2014-08-07    10000
   Freq: D, dtype: int64

When we add s2 and date_series, alignment will be performed, returning NaN where items do not align and the sum of the two values where they align:

In [25]:
   # demonstrate alignment by date on a subset of items
   date_series + s2

Out[25]:
   2014-08-01             NaN
   2014-08-02             NaN
   2014-08-03             NaN
   2014-08-04        8.864368
   2014-08-05      101.212112
   2014-08-06      999.826785
   2014-08-07    10000.119209
   2014-08-08             NaN
   2014-08-09             NaN
   2014-08-10             NaN
   Freq: D, dtype: float64
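
If NaN is not the desired result for dates that exist in only one of the two Series, the add() method with fill_value is one possible alternative. The following is a sketch with small, made-up values so that the arithmetic is easy to follow:

```python
import pandas as pd

dates = pd.date_range('2014-08-01', periods=5)
s1 = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0], index=dates)
s2 = pd.Series([10, 100], index=dates[1:3])

# plain addition: NaN wherever a date is missing from either Series
total = s1 + s2

# add() with fill_value treats the missing side as 0 instead
filled = s1.add(s2, fill_value=0)
print(total)
print(filled)
```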

Items in a Series with a DatetimeIndex can be retrieved using a string representing a date instead of having to specify a datetime object:

In [26]:
   # lookup item by a string representing a date
   date_series['2014-08-05']

Out[26]:
   1.2121120250208506

DatetimeIndex can also be indexed and sliced using a string that represents a date or using datetime objects:

In [27]:
   # slice between two dates specified by string representing dates
   date_series['2014-08-05':'2014-08-07']

Out[27]:
   2014-08-05    1.212112
   2014-08-06   -0.173215
   2014-08-07    0.119209
   Freq: D, dtype: float64
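
Note that slicing by date labels includes both endpoints, whereas slicing by position excludes the end position. The following sketch contrasts the two using the explicit .loc and .iloc accessors; the Series values are simply the positions 0 through 9 to make the result easy to read:

```python
from datetime import datetime

import pandas as pd

s = pd.Series(range(10), index=pd.date_range('2014-08-01', periods=10))

# label-based slice: both 2014-08-05 and 2014-08-07 are included
by_label = s.loc['2014-08-05':'2014-08-07']

# the same slice using datetime objects instead of strings
by_datetime = s.loc[datetime(2014, 8, 5):datetime(2014, 8, 7)]

# position-based slice: position 7 is excluded
by_position = s.iloc[4:7]
print(by_label)
```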

Another convenient feature of pandas is that DatetimeIndex can be sliced using partial date specifications. As an example, the following code creates a Series object with dates spanning two years and then selects only those items of the year 2013:

In [28]:
   # a two year range of daily data in a Series
   # only select those in 2013
   s3 = pd.Series(0, pd.date_range('2013-01-01', '2014-12-31'))
   s3['2013']

Out[28]:
   2013-01-01    0
   2013-01-02    0
   2013-01-03    0
   ...
   2013-12-29    0
   2013-12-30    0
   2013-12-31    0
   Freq: D, Length: 365

We can also select items only in a specific year and month. This is demonstrated by the following, which selects the items in May 2014:

In [29]:
   # 31 items for May 2014
   s3['2014-05']

Out[29]:
   2014-05-01    0
   2014-05-02    0
   2014-05-03    0
   ...
   2014-05-29    0
   2014-05-30    0
   2014-05-31    0
   Freq: D, Length: 31

We can slice data contained within two specified months, as demonstrated by the following, which returns items in August and September, 2014:

In [30]:
   # items between two months
   s3['2014-08':'2014-09']

Out[30]:
   2014-08-01    0
   2014-08-02    0
   2014-08-03    0
   ...
   2014-09-28    0
   2014-09-29    0
   2014-09-30    0
   Freq: D, Length: 61
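
The partial date selections above can also be written with the .loc accessor, which makes it explicit that the selection is by label. The following sketch repeats the three selections and checks the number of items in each; the variable names are illustrative:

```python
import pandas as pd

s3 = pd.Series(0, index=pd.date_range('2013-01-01', '2014-12-31'))

in_2013 = s3.loc['2013']                 # every day of 2013
in_may_2014 = s3.loc['2014-05']          # every day of May 2014
aug_to_sep = s3.loc['2014-08':'2014-09'] # August through September 2014

print(len(in_2013), len(in_may_2014), len(aug_to_sep))
```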

Creating time-series data with specific frequencies

Time-series data in pandas can be created on intervals other than daily frequency. Different frequencies can be generated with pd.date_range() by utilizing the freq parameter. This parameter defaults to a value of 'D', which represents daily frequency.

To demonstrate alternative frequencies, the following creates a DatetimeIndex with 1-minute intervals between the two specified dates by specifying freq='T':

In [31]:
   # generate a Series at one minute intervals
   np.random.seed(123456)
   bymin = pd.Series(np.random.randn(24*60*90), 
                     pd.date_range('2014-08-01', 
                                   '2014-10-29 23:59',
                                   freq='T'))
   bymin

Out[31]:
   2014-08-01 00:00:00    0.469112
   2014-08-01 00:01:00   -0.282863
   2014-08-01 00:02:00   -1.509059
   ...
   2014-10-29 23:57:00    1.850604
   2014-10-29 23:58:00   -1.589660
   2014-10-29 23:59:00    0.266429
   Freq: T, Length: 129600

This time series allows us to slice at a finer resolution, down to the minute and smaller intervals if using finer frequencies. To demonstrate minute-level slicing, the following slices the values at 9 consecutive minutes:

In [32]:
   # slice down to the minute
   bymin['2014-08-01 00:02':'2014-08-01 00:10']

Out[32]:
   2014-08-01 00:02:00   -1.509059
   2014-08-01 00:03:00   -1.135632
   2014-08-01 00:04:00    1.212112
   2014-08-01 00:05:00   -0.173215
   2014-08-01 00:06:00    0.119209
   2014-08-01 00:07:00   -1.044236
   2014-08-01 00:08:00   -0.861849
   2014-08-01 00:09:00   -2.104569
   2014-08-01 00:10:00   -0.494929
   Freq: T, dtype: float64
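
The length of 129,600 follows from 90 days of 24 hours of 60 minutes each. The following sketch verifies that count. It uses the 'min' alias, which is an alternative spelling of minute frequency that newer pandas releases favor over 'T'; whether 'T' is still accepted depends on the installed version:

```python
import pandas as pd

# one timestamp per minute over the same range as above
minutes = pd.date_range('2014-08-01', '2014-10-29 23:59', freq='min')

# 90 days * 24 hours * 60 minutes
expected = 24 * 60 * 90
print(len(minutes), expected)
```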

The following table lists the possible frequency values:

Alias   Description
B       Business day frequency
C       Custom business day frequency
D       Calendar day frequency (the default)
W       Weekly frequency
M       Month end frequency
BM      Business month end frequency
CBM     Custom business month end frequency
MS      Month start frequency
BMS     Business month start frequency
CBMS    Custom business month start frequency
Q       Quarter end frequency
BQ      Business quarter frequency
QS      Quarter start frequency
BQS     Business quarter start frequency
A       Year end frequency
BA      Business year-end frequency
AS      Year start frequency
BAS     Business year start frequency
H       Hourly frequency
T       Minute-by-minute frequency
S       Second-by-second frequency
L       Milliseconds
U       Microseconds

As an example, if you want to generate a time series that uses only business days, then use the 'B' frequency:

In [33]:
   # generate a series based upon business days
   days = pd.date_range('2014-08-29', '2014-09-05', freq='B')
   for d in days : print (d)

   2014-08-29 00:00:00
   2014-09-01 00:00:00
   2014-09-02 00:00:00
   2014-09-03 00:00:00
   2014-09-04 00:00:00
   2014-09-05 00:00:00

In this time series, we can see that two days were skipped as they were on the weekend, which would not have occurred using a calendar-day frequency.
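
To see exactly which dates were skipped, the business-day range can be compared with a calendar-day range over the same interval. The following is a minimal sketch of that comparison using the index difference() method:

```python
import pandas as pd

business = pd.date_range('2014-08-29', '2014-09-05', freq='B')
calendar = pd.date_range('2014-08-29', '2014-09-05', freq='D')

# dates present in the calendar range but not in the business range
skipped = calendar.difference(business)
skipped_labels = [d.strftime('%Y-%m-%d') for d in skipped]
print(skipped_labels)
```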

A range can be created starting at a particular date and time with a specific frequency and for a specific number of periods using the periods parameter. To demonstrate, the following creates a 10-item DatetimeIndex starting at 2014-08-01 12:10:01 and at 1-second intervals:

In [34]:
   # periods will use the frequency as the increment
   pd.date_range('2014-08-01 12:10:01', freq='S', periods=10)

Out[34]:
   <class 'pandas.tseries.index.DatetimeIndex'>
   [2014-08-01 12:10:01, ..., 2014-08-01 12:10:10]
   Length: 10, Freq: S, Timezone: None
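
The frequency does not have to be given as a string alias. As a sketch of one alternative, the following passes a pd.Timedelta of one second as the freq argument, which avoids depending on the exact alias spelling accepted by a given pandas version:

```python
import pandas as pd

# ten timestamps, one second apart, starting at 12:10:01
seconds = pd.date_range('2014-08-01 12:10:01',
                        periods=10,
                        freq=pd.Timedelta(seconds=1))
print(seconds[0], seconds[-1])
```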