12

Dates and Times

One of the bigger reasons for using Pandas is its ability to work with timeseries data. We observed some of this capability earlier, when we concatenated data in Chapter 6 and saw how the indices automatically aligned themselves. This chapter focuses on the more common tasks encountered when working with data that involve dates and times.

Learning Objectives

  • Create date objects with the datetime library

  • Use functions to convert strings into a date

  • Use functions to format dates

  • Perform date calculations

  • Use functions to resample dates

  • Use functions to work with and convert time zones

12.1 Python's datetime Object

Python has a built-in datetime object that is found in the datetime library.

from datetime import datetime

We can use datetime to get the current date and time.

now = datetime.now()
print(f"Last time this chapter was rendered for print: {now}")
Last time this chapter was rendered for print: 2022-09-01 01:55:41.496795

We can also create our own datetime manually.

t1 = datetime.now()
t2 = datetime(1970, 1,1)

And we can do datetime math.


diff = t1 - t2
print(diff)
19236 days, 1:55:41.499914

The data type of a date calculation is a timedelta.

print(type(diff))
<class 'datetime.timedelta'>

We can perform these types of actions when working within a Pandas dataframe.

12.2 Converting to datetime

Converting an object type into a datetime type is done with the to_datatime function. Let’s load up our Ebola data set and convert the Date column into a proper datetime object.

import pandas as pd
ebola = pd.read_csv('data/country_timeseries.csv')

# top left corner of the data
print(ebola.iloc[:5, :5])
        Date Day Cases_Guinea Cases_Liberia Cases_SierraLeone
0   1/5/2015 289       2776.0           NaN           10030.0
1   1/4/2015 288       2775.0           NaN            9780.0
2   1/3/2015 287       2769.0        8166.0            9722.0
3   1/2/2015 286          NaN        8157.0               NaN
4 12/31/2014 284       2730.0        8115.0            9633.0

The first Date column contains date information, but the .info() attribute tells us it is actually encoded as a generic string object in Pandas.

print(ebola.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122 entries, 0 to 121
Data columns (total 18 columns):

 #  Column              Non-Null Count Dtype
--- ------              -------------- -----
 0  Date                122 non-null   object
 1  Day                 122 non-null   int64
 2  Cases_Guinea        93 non-null   float64
 3  Cases_Liberia       83 non-null   float64
 4  Cases_SierraLeone   87 non-null   float64
 5  Cases_Nigeria       38 non-null   float64
 6  Cases_Senegal       25 non-null   float64
 7  Cases_UnitedStates  18 non-null   float64
 8  Cases_Spain         16 non-null   float64
 9  Cases_Mali          12 non-null   float64
 10 Deaths_Guinea       92 non-null   float64
 11 Deaths_Liberia      81 non-null   float64
 12 Deaths_SierraLeone  87 non-null   float64
 13 Deaths_Nigeria      38 non-null   float64
 14 Deaths_Senegal      22 non-null   float64
 15 Deaths_UnitedStates 18 non-null   float64
 16 Deaths_Spain        16 non-null   float64
 17 Deaths_Mali         12 non-null   float64
dtypes: float64(16), int64(1), object(1)
memory usage: 17.3+ KB
None

We can create a new column, date_dt, that converts the Date column into a datetime.

ebola['date_dt'] = pd.to_datetime(ebola['Date'])

We can also be a little more explicit with how we convert data into a datetime object.

The to_datetime() function has a parameter called format that allows you to manually specify the format of the date you are hoping to parse. Since our date is in a month/day/year format, we can pass in the string %m/%d/%Y.

ebola['date_dt'] = pd.to_datetime(ebola['Date'], format='%m/%d/%Y')

In both cases, we end up with a new column with a datetime type.

print(ebola.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122 entries, 0 to 121
Data columns (total 21 columns):

 #  Column              Non-Null Count Dtype
--- ------              -------------- -----
 0  Date                122 non-null   object
 1  Day                 122 non-null   int64
 2  Cases_Guinea        93 non-null   float64
 3  Cases_Liberia       83 non-null   float64
 4  Cases_SierraLeone   87 non-null   float64
 5  Cases_Nigeria       38 non-null   float64
 6  Cases_Senegal       25 non-null   float64
 7  Cases_UnitedStates  18 non-null   float64
 8  Cases_Spain         16 non-null   float64
 9  Cases_Mali          12 non-null   float64
 10 Deaths_Guinea       92 non-null   float64
 11 Deaths_Liberia      81 non-null   float64
 12 Deaths_SierraLeone  87 non-null   float64
 13 Deaths_Nigeria      38 non-null   float64
 14 Deaths_Senegal      22 non-null   float64
 15 Deaths_UnitedStates 18 non-null   float64
 16 Deaths_Spain        16 non-null   float64
 17 Deaths_Mali         12 non-null   float64
 18 date_dt             122 non-null  datetime64[ns]
 19 date_dt_a           122 non-null  datetime64[ns]
 20 date_dt_al          122 non-null  datetime64[ns]
dtypes: datetime64[ns](3), float64(16), int64(1), object(1)
memory usage: 20.1+ KB
None

The to_datetime() function includes convenient built-in options. For example, you can set the dayfirst or yearfirst options to True if the date format begins with a day (e.g., 31-03-2014) or if the date begins with a year (e.g., 2014-03-31), respectively.

For other date formats, you can manually specify how they are represented using the syntax specified by python’s strptime.1 This syntax is replicated in Table 12.1 from the official Python documentation.

1. strftime (string format time) and strptime (string parse time) behavior: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior

Table 12.1 Python strftime and strptime Behavior (reproduced from the official Python documentation2)

Directive

Meaning

Example

%a

Weekday abbreviated name

Sun, Mon, …, Sat

%A

Weekday full name

Sunday, Monday, …, Saturday

%w

Weekday as a number, where 0 is Sunday

0, 1, …, 6

%d

Day of the month as a two-digit number

01, 02, …, 31

%b

Month abbreviated name

Jan, Feb, …, Dec

%B

Month full name

January, February, …, December

%m

Month as a two-digit number

01, 02, …, 12

%y

Year as a two-digit number

00, 01, …, 99

%Y

Year as a four-digit number

0001, 0002, …, 2013, 2014, …, 9999

%H

Hour (24-hour clock) as a two-digit number

00, 01, …, 23

%I

Hour (12-hour clock) as a two-digit number

01, 02, …, 12

%p

AM or PM

AM, PM

%M

Minute as a two-digit number

00, 01, …, 59

%S

Second as a two-digit number

00, 01, …, 59

%f

Microsecond as a number

000000, 000001, …, 999999

%z

UTC offset in the form of +HHMM or hbox{--HHMM}

(empty), +0000, -0400, +1030

%Z

Time zone name

(empty), UTC, EST, CST

%j

Day of the year as a three-digit number

001, 002, …, 366

%U

Week number of the year (Sunday first)

00, 01, …, 53

%W

Week number of the year (Monday first)

00, 01, …, 53

%c

Date and time representation

Tue Aug 16 21:30:00 1988

 

%x

Date representation

08/16/88 (None);08/16/1988

 

%X

Time representation

21:30:00

%%

Literal % character

%

%G

ISO 8601 year

0001, 0002, …, 2013, 2014, …, 9999

%u

ISO 8601 weekday

1, 2, …, 7

%V

ISO 8601 week

01, 02, …, 53

2. strftime (string format time) and strptime (string parse time) behavior: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior

12.3 Loading Data That Include Dates

Many of the data sets used in this book are in a CSV format, or else they come from the seaborn library. The gapminder data set was an exception: It was a tab-separated file (TSV). The read_csv() function has a lot of parameters – for example, parse_dates, inher_datetime_format, keep_date_col, date_parser, dayfirst, and cache_dates. We can parse the Date column directly by specifying the column we want in the parse_dates parameter.

ebola = pd.read_csv('data/country_timeseries.csv', parse_dates=["Date"])
print(ebola.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122 entries, 0 to 121
Data columns (total 18 columns):

 #  Column              Non-Null Count Dtype
--- ------              -------------- -----
 0  Date                122 non-null   datetime64[ns]
 1  Day                 122 non-null   int64
 2  Cases_Guinea        93 non-null   float64
 3  Cases_Liberia       83 non-null   float64
 4  Cases_SierraLeone   87 non-null   float64
 5  Cases_Nigeria       38 non-null   float64
 6  Cases_Senegal       25 non-null   float64
 7  Cases_UnitedStates  18 non-null   float64
 8  Cases_Spain         16 non-null   float64
 9  Cases_Mali          12 non-null   float64
 10 Deaths_Guinea       92 non-null   float64
 11 Deaths_Liberia      81 non-null   float64
 12 Deaths_SierraLeone  87 non-null   float64
 13 Deaths_Nigeria      38 non-null   float64
 14 Deaths_Senegal      22 non-null   float64
 15 Deaths_UnitedStates 18 non-null   float64
 16 Deaths_Spain        16 non-null   float64
 17 Deaths_Mali         12 non-null   float64
dtypes: datetime64[ns](1), float64(16), int64(1)
memory usage: 17.3 KB
None

This example shows how we can automatically convert columns into dates directly when the data are loaded.

12.4 Extracting Date Components

Now that we have a datetime object, we can extract various parts of the date, such as year, month, or day. Here’s an example datetime object.

d = pd.to_datetime('2021-12-14')
print(d)
2021-12-14 00:00:00

If we pass in a single string, we get a Timestamp.

print(type(d))
<class 'pandas._libs.tslibs.timestamps.Timestamp'>

Now that we have a proper datetime, we can access various date components as attributes.

print(d.year)
2021
print(d.month)
12
print(d.day)
14

In Chapter 4, we tidied our data when we needed to parse a column that stored multiple bits of information and used the .str. accessor to use string methods like .split(). We can do something similar here with datetime objects by accessing datetime methods using the .dt. accessor.3 Let’s first re-create our date_dt column.

3. Datetime-like properties: https://pandas.pydata.org/docs/reference/series.html#datetimelike-properties

ebola['date_dt'] = pd.to_datetime(ebola['Date'])

We know we can get date components such as the year, month, and day by using the year, month, and day attributes, respectively, on a column basis; we saw how this works when we parsed strings in a column using .str.. Here’s the Date and date_dt columns we just created.

print(ebola[['Date', 'date_dt']])
          Date    date_dt
0   2015-01-05 2015-01-05
1   2015-01-04 2015-01-04
2   2015-01-03 2015-01-03
3   2015-01-02 2015-01-02
4   2014-12-31 2014-12-31
..         ...        ...
117 2014-03-27 2014-03-27
118 2014-03-26 2014-03-26
119 2014-03-25 2014-03-25
120 2014-03-24 2014-03-24
121 2014-03-22 2014-03-22
[122 rows x 2 columns]

We can create a new year column based on the Date column.

ebola['year'] = ebola['date_dt'].dt.year
print(ebola[['Date', 'date_dt', 'year']])
          Date    date_dt year
0   2015-01-05 2015-01-05 2015
1   2015-01-04 2015-01-04 2015
2   2015-01-03 2015-01-03 2015
3   2015-01-02 2015-01-02 2015
4   2014-12-31 2014-12-31 2014
..         ...        ...  ...
117 2014-03-27 2014-03-27 2014
118 2014-03-26 2014-03-26 2014
119 2014-03-25 2014-03-25 2014
120 2014-03-24 2014-03-24 2014
121 2014-03-22 2014-03-22 2014
[122 rows x 3 columns]

Let’s finish parsing our date.

ebola = ebola.assign(
    month=ebola["date_dt"].dt.month,
    day=ebola["date_dt"].dt.day
)
print(ebola[['Date', 'date_dt', 'year', 'month', 'day']])
          Date    date_dt year month day
0   2015-01-05 2015-01-05 2015     1   5
1   2015-01-04 2015-01-04 2015     1   4
2   2015-01-03 2015-01-03 2015     1   3
3   2015-01-02 2015-01-02 2015     1   2
4   2014-12-31 2014-12-31 2014    12  31
..         ...        ...  ...   ... ...
117 2014-03-27 2014-03-27 2014      3 27
118 2014-03-26 2014-03-26 2014      3 26
119 2014-03-25 2014-03-25 2014      3 25
120 2014-03-24 2014-03-24 2014      3 24
121 2014-03-22 2014-03-22 2014      3 22
[122 rows x 5 columns]

When we parsed out our dates, the data type was not preserved.

print(ebola.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122 entries, 0 to 121
Data columns (total 22 columns):

 #   Column             Non-Null Count Dtype
---  ------             -------------- -----
 0  Date                122 non-null   datetime64[ns]
 1  Day                 122 non-null   int64
 2  Cases_Guinea        93 non-null    float64
 3  Cases_Liberia       83 non-null    float64
 4  Cases_SierraLeone   87 non-null    float64
 5  Cases_Nigeria       38 non-null    float64
 6  Cases_Senegal       25 non-null    float64
 7  Cases_UnitedStates  18 non-null    float64
 8  Cases_Spain         16 non-null    float64
 9  Cases_Mali          12 non-null    float64
 10 Deaths_Guinea       92 non-null    float64
 11 Deaths_Liberia      81 non-null    float64
 12 Deaths_SierraLeone  87 non-null    float64
 13 Deaths_Nigeria      38 non-null    float64
 14 Deaths_Senegal      22 non-null    float64
 15 Deaths_UnitedStates 18 non-null    float64
 16 Deaths_Spain        16 non-null    float64
 17 Deaths_Mali         12 non-null    float64
 18 date_dt             122 non-null   datetime64[ns]
 19 year                122 non-null   int64
 20 month               122 non-null   int64
 21 day                 122 non-null   int64
dtypes: datetime64[ns](2), float64(16), int64(4)
memory usage: 21.1 KB
None

12.5 Date Calculations and Timedeltas

One of the benefits of having date objects is being able to do date calculations. Our Ebola data set includes a column named Day that indicates how many days into an Ebola outbreak a country is. We can recreate this column using date arithmetic. Here’s the bottom left corner of our data.

print(ebola.iloc[-5:, :5])
          Date  Day   Cases_Guinea   Cases_Liberia  Cases_SierraLeone
117 2014-03-27    5          103.0             8.0                6.0
118 2014-03-26    4           86.0             NaN                NaN
119 2014-03-25    3           86.0             NaN                NaN
120 2014-03-24    2           86.0             NaN                NaN
121 2014-03-22    0           49.0             NaN                NaN

The first day of the outbreak (the earliest date in this data set) is 2015-03-22. So, if we want to calculate the number of days into the outbreak, we can subtract this date from each date by using the .min() method of the column.

print(ebola['date_dt'].min())
2014-03-22 00:00:00

We can use this date in our calculation.

ebola['outbreak_d'] = ebola['date_dt'] - ebola['date_dt'].min()
print(ebola[['Date', 'Day', 'outbreak_d']])
          Date Day  outbreak_d
0   2015-01-05 289    289 days
1   2015-01-04 288    288 days
2   2015-01-03 287    287 days
3   2015-01-02 286    286 days
4   2014-12-31 284    284 days
..         ... ...         ...
117 2014-03-27   5      5 days
118 2014-03-26   4      4 days
119 2014-03-25   3      3 days
120 2014-03-24   2      2 days
121 2014-03-22   0      0 days
[122 rows x 3 columns]

When we perform this kind of date calculation, we actually end up with a timedelta object.

print(ebola.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122 entries, 0 to 121
Data columns (total 23 columns):
 #  Column              Non-Null Count  Dtype
--- ------              --------------  -----
 0  Date                122 non-null    datetime64[ns]
 1  Day                 122 non-null    int64
 2  Cases_Guinea        93 non-null     float64
 3  Cases_Liberia       83 non-null     float64
 4  Cases_SierraLeone   87 non-null     float64
 5  Cases_Nigeria       38 non-null     float64
 6  Cases_Senegal       25 non-null     float64
 7  Cases_UnitedStates  18 non-null     float64
 8  Cases_Spain         16 non-null     float64
 9  Cases_Mali          12 non-null     float64
 10 Deaths_Guinea       92 non-null     float64
 11 Deaths_Liberia      81 non-null     float64
 12 Deaths_SierraLeone  87 non-null     float64
 13 Deaths_Nigeria      38 non-null     float64
 14 Deaths_Senegal      22 non-null     float64
 15 Deaths_UnitedStates 18 non-null     float64
 16 Deaths_Spain        16 non-null     float64
 17 Deaths_Mali         12 non-null     float64
 18 date_dt             122 non-null    datetime64[ns]
 19 year                122 non-null    int64
 20 month               122 non-null    int64
 21 day                 122 non-null    int64
 22 outbreak_d          122 non-null    timedelta64[ns]
dtypes: datetime64[ns](2), float64(16), int64(4), timedelta64[ns](1)
memory usage: 22.0 KB
None

We get timedelta objects as results when we perform calculations with datetime objects.

12.6 Datetime Methods

Let’s look at another data set. This one deals with bank failures.

banks = pd.read_csv('data/banklist.csv')
print(banks.head())
                                             Bank Name  
0                                  Fayette County Bank
1    Guaranty Bank, (d/b/a BestBank in Georgia & Mi...
2                                       First NBC Bank
3                                       Proficio Bank
4                         Seaway Bank and Trust Company

                   City     ST   CERT  
0            Saint Elmo     IL   1802
1             Milwaukee     WI  30003
2           New Orleans     LA  58302
3    Cottonwood Heights     UT  35495
4               Chicago     IL  19328

                   Acquiring Institution      Closing Date Updated Date
0             United Fidelity  Bank, fsb         26-May-17    26-Jul-17
1    First-Citizens Bank & Trust Company          5-May-17    26-Jul-17
2                           Whitney Bank         28-Apr-17    26-Jul-17
3                      Cache Valley Bank          3-Mar-17    18-May-17
4                    State Bank of Texas         27-Jan-17    18-May-17

Again, we can import our data with the dates directly parsed.

banks = pd.read_csv(
  "data/banklist.csv", parse_dates=["Closing Date", "Updated Date"]
)

print(banks.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 553 entries, 0 to 552
Data columns (total 7 columns):
 #  Column                Non-Null Count  Dtype
--- ------                --------------  -----
 0  Bank Name             553 non-null    object
 1  City                  553 non-null    object
 2  ST                    553 non-null    object
 3  CERT                  553 non-null    int64
 4  Acquiring Institution 553 non-null    object
 5  Closing Date          553 non-null    datetime64[ns]
 6  Updated Date          553 non-null    datetime64[ns]
dtypes: datetime64[ns](2), int64(1), object(4)
memory usage: 30.4+ KB
None

We can parse out the date by obtaining the quarter and year in which the bank closed.

banks = banks.assign(
  closing_quarter=banks['Closing Date'].dt.quarter,
  closing_year=banks['Closing Date'].dt.year
)
closing_year = banks.groupby(['closing_year']).size()

Alternatively, we can calculate how many banks closed in each quarter of each year.

closing_year_q = (
  banks
  .groupby(['closing_year', 'closing_quarter'])
  .size()
)

We can then plot these results as shown in Figure 12.1 and Figure 12.2.

Images

Figure 12.1 Number of banks closing each year

Images

Figure 12.2 Number of banks closing each year by quarter

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax = closing_year.plot()
plt.show()
fig, ax = plt.subplots()
ax = closing_year_q.plot()
plt.show()

12.7 Getting Stock Data

One commonly encountered type of data that contains dates is stock prices. Luckily Python has a way of getting this type of data programmatically with the pandas-datareader library.4

4. pandas-datareader library: https://pandas-datareader.readthedocs.io/

# we can install and use the pandas_datafreader
# to get data from the Internet
import pandas_datareader.data as web

# in this example we are getting stock information about Tesla
tesla = web.DataReader('TSLA', 'yahoo')

print(tesla)
      Date      High       Low      Open         Close 
2017-09-05 23.699333 23.059334 23.586666     23.306000
2017-09-06 23.398666 22.770666 23.299999     22.968666
2017-09-07 23.498667 22.896667 23.065332     23.374001
2017-09-08 23.318666 22.820000 23.266001     22.893333
2017-09-11 24.247334 23.333332 23.423332     24.246000
...        ...        ...        ...               ...
2022-08-25 302.959991 291.600006 302.359985 296.070007
2022-08-26 302.000000 287.470001 297.429993 288.089996
2022-08-29 287.739990 280.700012 282.829987 284.820007
2022-08-30 288.480011 272.649994 287.869995 277.700012
2022-08-31 281.250000 271.809998 280.619995 275.609985
      Date      Volume   Adj Close
2017-09-05  57526500.0   23.306000
2017-09-06  61371000.0   22.968666
2017-09-07  63588000.0   23.374001
2017-09-08  48952500.0   22.893333
2017-09-11 115006500.0   24.246000
       ...         ...         ...
2022-08-25  53230000.0  296.070007
2022-08-26  56905800.0  288.089996
2022-08-29  41864700.0  284.820007
2022-08-30  50541800.0  277.700012
2022-08-31  51788900.0  275.609985
[1257 rows x 6 columns]
# the stock data was saved
# so we do not need to rely on the Internet again
# instead we can load the same data set as a file
tesla = pd.read_csv(
  'data/tesla_stock_yahoo.csv', parse_dates=["Date"]
)

print(tesla)
           Date       Open      High          Low      Close 
0    2010-06-29  19.000000  25.000000   17.540001  23.889999
1    2010-06-30  25.790001  30.420000   23.299999  23.830000
2    2010-07-01  25.000000  25.920000   20.270000  21.959999
3    2010-07-02  23.000000  23.100000   18.709999  19.200001
4    2010-07-06  20.000000  20.000000   15.830000  16.110001
...         ...        ...        ...         ...        ...
1786 2017-08-02 318.940002 327.119995  311.220001 325.890015
1787 2017-08-03 345.329987 350.000000  343.149994 347.089996
1788 2017-08-04 347.000000 357.269989  343.299988 356.910004
1789 2017-08-07 357.350006 359.480011  352.750000 355.170013
1790 2017-08-08 357.529999 368.579987  357.399994 365.220001
     Adj Close    Volume
0    23.889999  18766300
1    23.830000  17187100
2    21.959999   8218800
3    19.200001   5139800
4    16.110001   6866900
...        ...       ...
1786 325.890015 13091500
1787 347.089996 13535000
1788 356.910004  9198400
1789 355.170013  6276900
1790 365.220001  7449837
[1791 rows x 7 columns]

12.8 Subsetting Data Based on Dates

Since we now know how to extract parts of a date out of a column (Section 12.4), we can incorporate these methods to subset our data without having to parse out the individual components manually.

For example, if we want only data for June 2010 from our stock price data set, we can use boolean subsetting.

print(
  tesla.loc[
    (tesla.Date.dt.year == 2010) & (tesla.Date.dt.month == 6)
  ]
)
        Date       Open   High        Low      Close Adj Close 
0 2010-06-29  19.000000  25.00  17.540001  23.889999 23.889999
1 2010-06-30  25.790001  30.42  23.299999  23.830000 23.830000
    Volume
0 18766300
1 17187100

12.8.1 The DatetimeIndex Object

When we are working with datetime data, we often need to set the datetime object to be the dataframe’s index. To this point, we’ve mainly left the dataframe row index to be the row number. We have also seen some side effects that arise because the row index may not always be the row number, such as when we were concatenating dataframes in Chapter 6.

First, let’s assign the Date column as the index.

tesla.index = tesla['Date']
print(tesla.index)
DatetimeIndex(['2010-06-29', '2010-06-30', '2010-07-01',
               '2010-07-02', '2010-07-06', '2010-07-07',
               '2010-07-08', '2010-07-09', '2010-07-12',
               '2010-07-13',
               ...
               '2017-07-26', '2017-07-27', '2017-07-28',
               '2017-07-31', '2017-08-01', '2017-08-02',
               '2017-08-03', '2017-08-04', '2017-08-07',
               '2017-08-08'],
              dtype='datetime64[ns]', name='Date', length=1791, freq=None)

With the index set as a date object, we can now use the date directly to subset rows. For example, we can subset our data based on the year.

print(tesla['2015'])
                 Date       Open       High        Low 
Date
2015-01-02 2015-01-02 222.869995 223.250000 213.259995
2015-01-05 2015-01-05 214.550003 216.500000 207.160004
2015-01-06 2015-01-06 210.059998 214.199997 204.210007
2015-01-07 2015-01-07 213.350006 214.779999 209.779999
2015-01-08 2015-01-08 212.809998 213.800003 210.009995
       ...        ...        ...        ...        ...
2015-12-24 2015-12-24 230.559998 231.880005 228.279999
2015-12-28 2015-12-28 231.490005 231.979996 225.539993
2015-12-29 2015-12-29 230.059998 237.720001 229.550003
2015-12-30 2015-12-30 236.600006 243.630005 235.669998
2015-12-31 2015-12-31 238.509995 243.449997 238.369995
                Close  Adj Close  Volume
Date
2015-01-02 219.309998 219.309998 4764400
2015-01-05 210.089996 210.089996 5368500
2015-01-06 211.279999 211.279999 6261900
2015-01-07 210.949997 210.949997 2968400
2015-01-08 210.619995 210.619995 3442500
...               ...        ...     ...
2015-12-24 230.570007 230.570007  708000
2015-12-28 228.949997 228.949997 1901300
2015-12-29 237.190002 237.190002 2406300
2015-12-30 238.089996 238.089996 3697900
2015-12-31 240.009995 240.009995 2683200
[252 rows x 7 columns]
  print(tesla.loc['2015'])

Alternatively, we can subset the data based on the year and month.


print(tesla['2010-06'])
                 Date       Open  High        Low       Close 
Date
2010-06-29 2010-06-29  19.000000  25.00  17.540001  23.889999
2010-06-30 2010-06-30  25.790001  30.42  23.299999  23.830000
           Adj Close   Volume
Date
2010-06-29 23.889999 18766300
2010-06-30 23.830000 17187100
print(tesla.loc['2010-06'])

12.8.2 The TimedeltaIndex Object

Just as we set the index of a dataframe to a datetime to create a DatetimeIndex, so we can do the same thing with a timedelta to create a TimedeltaIndex.

Let’s create a timedelta.

tesla['ref_date'] = tesla['Date'] - tesla['Date'].min()

Now we can assign the timedelta to the index.

tesla.index = tesla['ref_date']
print(tesla)
                 Date      Open        High        Low 
ref_date
0 days     2010-06-29 19.000000   25.000000  17.540001
1 days     2010-06-30 25.790001   30.420000  23.299999
2 days     2010-07-01 25.000000   25.920000  20.270000
3 days     2010-07-02 23.000000   23.100000  18.709999
7 days     2010-07-06 20.000000   20.000000  15.830000
...               ...       ...         ...        ...
2591 days  2017-08-02 318.940002 327.119995 311.220001
2592 days  2017-08-03 345.329987 350.000000 343.149994
2593 days  2017-08-04 347.000000 357.269989 343.299988
2596 days  2017-08-07 357.350006 359.480011 352.750000
2597 days  2017-08-08 357.529999 368.579987 357.399994
           Close Adj Close Volume   ref_date
ref_date
0 days     23.889999  23.889999  18766300    0 days
1 days     23.830000  23.830000  17187100    1 days
2 days     21.959999  21.959999   8218800    2 days
3 days     19.200001  19.200001   5139800    3 days
7 days     16.110001  16.110001   6866900    7 days
...              ...        ...      ...       ...
2591 days 325.890015 325.890015  13091500 2591 days
2592 days 347.089996 347.089996  13535000 2592 days
2593 days 356.910004 356.910004   9198400 2593 days
2596 days 355.170013 355.170013   6276900 2596 days
2597 days 365.220001 365.220001   7449837 2597 days
[1791 rows x 8 columns]

We can now select our data based on these deltas.

print(tesla['0 day': '10 day'])
              Date      Open      High       Low     Close 
ref_date
0 days  2010-06-29 19.000000 25.000000 17.540001 23.889999
1 days  2010-06-30 25.790001 30.420000 23.299999 23.830000
2 days  2010-07-01 25.000000 25.920000 20.270000 21.959999
3 days  2010-07-02 23.000000 23.100000 18.709999 19.200001
7 days  2010-07-06 20.000000 20.000000 15.830000 16.110001
8 days  2010-07-07 16.400000 16.629999 14.980000 15.800000
9 days  2010-07-08 16.139999 17.520000 15.570000 17.459999
10 days 2010-07-09 17.580000 17.900000 16.549999 17.400000
          Adj Close   Volume ref_date
ref_date
0 days    23.889999 18766300    0 days
1 days    23.830000 17187100    1 days
2 days    21.959999  8218800    2 days
3 days    19.200001  5139800    3 days
7 days    16.110001  6866900    7 days
8 days    15.800000  6921700    8 days
9 days    17.459999  7711400    9 days
10 days   17.400000  4050600   10 days

12.9 Date Ranges

Not every data set will have a fixed frequency of values. For example, in our Ebola data set, we do not have an observation for every day in the date range.

ebola = pd.read_csv(
'data/country_timeseries.csv', parse_dates=["Date"]
)

Here, 2015-01-01 is missing from the .head() of the data.

print(ebola.iloc[:, :5])

            Date Day  Cases_Guinea  Cases_Liberia  Cases_SierraLeone
0     2015-01-05 289        2776.0            NaN            10030.0
1     2015-01-04 288        2775.0            NaN             9780.0
2     2015-01-03 287        2769.0         8166.0             9722.0
3     2015-01-02 286           NaN         8157.0                NaN
4     2014-12-31 284        2730.0         8115.0             9633.0
..           ... ...           ...            ...                ...
117   2014-03-27   5         103.0            8.0                6.0
118   2014-03-26   4          86.0            NaN                NaN
119   2014-03-25   3          86.0            NaN                NaN
120   2014-03-24   2          86.0            NaN                NaN
121   2014-03-22   0          49.0            NaN                NaN
[122 rows x 5 columns]

It’s common practice to create a date range to .reindex() a data set. We can use the date_range()

head_range = pd.date_range(start='2014-12-31', end='2015-01-05')
print(head_range)
DatetimeIndex(['2014-12-31', '2015-01-01', '2015-01-02',
               '2015-01-03', '2015-01-04', '2015-01-05'],
             dtype='datetime64[ns]', freq='D')

We’ll just work with the first five rows in this example.

ebola_5 = ebola.head()

If we want to set this date range as the index, we need to first set the date as the index.

ebola_5.index = ebola_5['Date']

Next we can .reindex() our data.

ebola_5 = ebola_5.reindex(head_range)

print(ebola_5.iloc[:, :5])
                 Date          Day  Cases_Guinea Cases_Liberia 
2014-12-31 2014-12-31        284.0        2730.0        8115.0
2015-01-01        NaT          NaN           NaN           NaN
2015-01-02 2015-01-02        286.0           NaN        8157.0
2015-01-03 2015-01-03        287.0        2769.0        8166.0
2015-01-04 2015-01-04        288.0        2775.0           NaN
2015-01-05 2015-01-05        289.0        2776.0           NaN

      Cases_SierraLeone
2014-12-31       9633.0
2015-01-01          NaN
2015-01-02          NaN
2015-01-03       9722.0
2015-01-04       9780.0
2015-01-05      10030.0

12.9.1 Frequencies

When we created our head_range, the print statement included a parameter called freq. In that example, freq was 'D' for “day.” That is, the values in our date range were stepped through using a day-by-day increment. The possible frequencies are reproduced from the Pandas timeseries documentation that is listed in Table 12.2.5

5. Frequency offset aliases: https://pandas.pydata.org/docs/user_guide/timeseries.html#offset-aliases

Table 12.2 Possible Frequencies

Alias

Description

B

Business day frequency

C

Custom business day frequency (experimental)

D

Calendar day frequency

W

Weekly frequency

M

Month end frequency

SM

Semi-month end frequency (15th and end of month)

BM

Business month end frequency

CBM

Custom business month end frequency

MS

Month start frequency

SMS

Semi-month start frequency (1st and 15th)

BMS

Business month start frequency

CBMS

Custom business month start frequency

Q

Quarter end frequency

BQ

Business quarter end frequency

QS

Quarter start frequency

BQS

Business quarter start frequency

A

Year end frequency

BA

Business year end frequency

AS

Year start frequency

BAS

Business year start frequency

BH

Business hour frequency

H

Hour frequency

T

Minute frequency

S

Second frequency

L

Millisecond frequency

U

Microsecond frequency

N

Nanosecond frequency

These values can be passed into the freq parameter when calling date_range. For example, January 2, 2022, was a Sunday, and we can create a range consisting of the business days in that week.

# business days during the week of Jan 1, 2022
print(pd.date_range('2022-01-01', '2022-01-07', freq='B'))
DatetimeIndex(['2022-01-03', '2022-01-04', '2022-01-05',
               '2022-01-06', '2022-01-07'],
             dtype='datetime64[ns]', freq='B')

12.9.2 Offsets

Offsets are variations on a base frequency. For example, we can take the business days range that we just created and add an offset such that instead of every business day, data are included for every other business day.

# every other business day during the week of Jan 1, 2022
print(pd.date_range('2022-01-01', '2017-01-07', freq='2B'))
DatetimeIndex([], dtype='datetime64[ns]', freq='2B')

We created this offset by putting a multiplying value before the base frequency. This kind of offset can be combined with other base frequencies as well. For example, we can specify the first Thursday of each month in the year 2022.

print(pd.date_range('2022-01-01', '2022-12-31', freq='WOM-1THU'))
DatetimeIndex(['2022-01-06', '2022-02-03', '2022-03-03',
               '2022-04-07', '2022-05-05', '2022-06-02',
               '2022-07-07', '2022-08-04', '2022-09-01',
               '2022-10-06', '2022-11-03', '2022-12-01'],
             dtype='datetime64[ns]', freq='WOM-1THU')

We can also specify the third Friday of each month.

print(pd.date_range('2022-01-01', '2022-12-31', freq='WOM-3FRI'))
DatetimeIndex(['2022-01-21', '2022-02-18', '2022-03-18',
               '2022-04-15', '2022-05-20', '2022-06-17',
               '2022-07-15', '2022-08-19', '2022-09-16',
               '2022-10-21', '2022-11-18', '2022-12-16'],
             dtype='datetime64[ns]', freq='WOM-3FRI')

12.10 Shifting Values

There are a few reasons why you might want to shift your dates by a certain value. For example, you might need to correct some kind of measurement error in your data. Alternatively, you might want to standardize the start dates for your data so you can compare trends.

Even though our Ebola data isn’t “tidy,” one of the benefits of the data in its current format is that it allows us to plot the outbreak. This plot is shown in Figure 12.3.

Images

Figure 12.3 Ebola plot of cases and deaths (unshifted dates)

import matplotlib.pyplot as plt

ebola.index = ebola['Date']
fig, ax = plt.subplots()
ax = ebola.plot(ax=ax)
ax.legend(fontsize=7, loc=2, borderaxespad=0.0)
plt.show()

When we’re looking at an outbreak, one useful piece of information is how fast an outbreak is spreading relative to other countries. Let’s look at just a few columns from our Ebola data set.


ebola_sub = ebola[['Day', 'Cases_Guinea', 'Cases_Liberia']]
print(ebola_sub.tail(10))
          Day Cases_Guinea  Cases_Liberia
Date
2014-04-04 13        143.0           18.0
2014-04-01 10        127.0            8.0
2014-03-31  9        122.0            8.0
2014-03-29  7        112.0            7.0
2014-03-28  6        112.0            3.0
2014-03-27  5        103.0            8.0
2014-03-26  4         86.0            NaN
2014-03-25  3         86.0            NaN
2014-03-24  2         86.0            NaN
2014-03-22  0         49.0            NaN

You can see that each country’s starting date is different, which makes it difficult to compare the actual slopes between countries when a new outbreak occurs later in time.

In this example, we want all our dates to start from a common 0 day. There are multiple steps to this process.

  • Since not every date is listed, we need to create a date range of all the dates in our data set.

  • We need to calculate the difference between the earliest date in our data set, and the earliest valid (non NaN) date in each column.

  • We can then shift each of the columns by this calculated value.

Before we begin, let’s start with a fresh copy of the Ebola data set. We’ll parse the Date column as a proper date object, and assign this date to the .index. In this example, we are parsing the date and setting it as the index directly.

ebola = pd.read_csv(
  "data/country_timeseries.csv",
  index_col="Date",
  parse_dates=["Date"],
)

print(ebola.iloc[:, :4])
           Day Cases_Guinea Cases_Liberia Cases_SierraLeone
Date
2015-01-05 289       2776.0           NaN           10030.0
2015-01-04 288       2775.0           NaN            9780.0
2015-01-03 287       2769.0        8166.0            9722.0
2015-01-02 286          NaN        8157.0               NaN
2014-12-31 284       2730.0        8115.0            9633.0
...        ...          ...           ...               ...
2014-03-27   5        103.0           8.0               6.0
2014-03-26   4         86.0           NaN               NaN
2014-03-25   3         86.0           NaN               NaN
2014-03-24   2         86.0           NaN               NaN
2014-03-22   0         49.0           NaN               NaN
[122 rows x 4 columns]

First, we need to create the date range to fill in all the missing dates in our data. Then, when we shift our date values downward, the number of days that the data will shift will be the same as the number of rows that will be shifted.

new_idx = pd.date_range(ebola.index.min(), ebola.index.max())

print(new_idx)
DatetimeIndex(['2014-03-22', '2014-03-23', '2014-03-24',
              '2014-03-25', '2014-03-26', '2014-03-27',
              '2014-03-28', '2014-03-29', '2014-03-30',
              '2014-03-31',

              ...
              '2014-12-27', '2014-12-28', '2014-12-29',
              '2014-12-30', '2014-12-31', '2015-01-01',
              '2015-01-02', '2015-01-03', '2015-01-04',
              '2015-01-05'],
             dtype='datetime64[ns]', length=290, freq='D')

Looking at our new_idx, we see that the dates are not in the order that we want. To fix this, we can reverse the order of the index.

new_idx = reversed(new_idx)
print(new_idx)
<reversed object at 0x105aedfc0>

Now we can properly .reindex() our data. This will create rows of NaN values if the index does not exist already in our data set.

ebola = ebola.reindex(new_idx)

If we look at the .head() and .tail() of the resulting data, we see that dates that were originally not listed have been added into the data set, along with a row of NaN missing values. Additionally, the Date column is filled with the NaT value, which is an internal Pandas representation for missing time value (similar to how NaN is used for numeric missing values).

print(ebola.iloc[:, :4])

             Day Cases_Guinea Cases_Liberia Cases_SierraLeone
Date
2015-01-05  289.0       2776.0           NaN           10030.0
2015-01-04  288.0       2775.0           NaN            9780.0
2015-01-03  287.0       2769.0        8166.0            9722.0
2015-01-02  286.0          NaN        8157.0               NaN
2015-01-01    NaN          NaN           NaN               NaN
...           ...          ...           ...               ...
2014-03-26    4.0         86.0           NaN               NaN
2014-03-25    3.0         86.0           NaN               NaN
2014-03-24    2.0         86.0           NaN               NaN
2014-03-23    NaN          NaN           NaN               NaN
2014-03-22    0.0         49.0           NaN               NaN
[290 rows x 4 columns]

Now that we’ve created our date range and assigned it to the index, our next step is to calculate the difference between the earliest date in our data set and the earliest valid (non-missing) date in each column. To perform this calculation, we can use the Series method called .last_valid_index(), which returns the label (index) of the last non-missing or non-null value. An analogous method called .first_valid_index() returns the first non-missing or non-null value. Since we want to perform this calculation across all the columns, we can use the .apply() method.

last_valid = ebola.apply(pd.Series.last_valid_index)
print(last_valid)
Day                 2014-03-22
Cases_Guinea        2014-03-22
Cases_Liberia       2014-03-27
Cases_SierraLeone   2014-03-27
Cases_Nigeria       2014-07-23
                         ...
Deaths_Nigeria      2014-07-23
Deaths_Senegal      2014-09-07
Deaths_UnitedStates 2014-10-01
Deaths_Spain        2014-10-08
Deaths_Mali         2014-10-22
Length: 17, dtype: datetime64[ns]

Next, we want to get the earliest date in our data set.

earliest_date = ebola.index.min()
print(earliest_date)
2014-03-22 00:00:00

We then subtract this date from each of our last_valid dates.


shift_values = last_valid - earliest_date
print(shift_values)
Day                    0 days
Cases_Guinea           0 days
Cases_Liberia          5 days
Cases_SierraLeone     5 days
Cases_Nigeria        123 days
                       ...
Deaths_Nigeria       123 days
Deaths_Senegal       169 days
Deaths_UnitedStates  193 days
Deaths_Spain         200 days
Deaths_Mali          214 days
Length: 17, dtype: timedelta64[ns]

Finally, we can iterate through each column, using the .shift() method to shift the columns down by the corresponding value in shift_values. Note that the values in shift_values are all positive. If they were negative (if we flipped the order of our subtraction), this operation would shift the values up.

ebola_dict = {}

for idx, col in enumerate(ebola):
    d = shift_values[idx].days
    shifted = ebola[col].shift(d)
    ebola_dict[col] = shifted

#print(ebola_dict)

Since we have a dict of values, we can convert it to a dataframe using the Pandas DataFrame function.

ebola_shift = pd.DataFrame(ebola_dict)

The last row in each column now has a value; that is, the columns have been shifted down appropriately.

print(ebola_shift.tail())
           Day Cases_Guinea Cases_Liberia  Cases_SierraLeone 
Date
2014-03-26 4.0         86.0           8.0                2.0
2014-03-25 3.0         86.0           NaN                NaN
2014-03-24 2.0         86.0           7.0                NaN
2014-03-23 NaN          NaN           3.0                2.0
2014-03-22 0.0         49.0           8.0                6.0

           Cases_Nigeria Cases_Senegal  Cases_UnitedStates 
Date
2014-03-26           1.0           NaN                1.0
2014-03-25           NaN           NaN                NaN
2014-03-24           NaN           NaN                NaN
2014-03-23           NaN           NaN                NaN
2014-03-22           0.0           1.0                1.0
              Cases_Spain Cases_Mali Deaths_Guinea Deaths_Liberia 
Date
2014-03-26            1.0        NaN          62.0            4.0
2014-03-25            NaN        NaN          60.0            NaN
2014-03-24            NaN        NaN          59.0            2.0
2014-03-23            NaN        NaN           NaN            3.0
2014-03-22            1.0        1.0          29.0            6.0
            Deaths_SierraLeone Deaths_Nigeria Deaths_Senegal 
Date
2014-03-26                 2.0            1.0            NaN
2014-03-25                 NaN            NaN            NaN
2014-03-24                 NaN            NaN            NaN
2014-03-23                 2.0            NaN            NaN
2014-03-22                 5.0            0.0            0.0
            Deaths_UnitedStates Deaths_Spain Deaths_Mali
Date
2014-03-26                  0.0          1.0         NaN
2014-03-25                  NaN          NaN         NaN
2014-03-24                  NaN          NaN         NaN
2014-03-23                  NaN          NaN         NaN
2014-03-22                  0.0          1.0         1.0

Finally, since the indices are no longer valid across each row, we can remove them, and then assign the correct index, which is the Day. Note that Day no longer represents the first day of the entire outbreak, but rather the first day of an outbreak for the given country.

ebola_shift.index = ebola_shift['Day']
ebola_shift = ebola_shift.drop(['Day'], axis="columns")
print(ebola_shift.tail())

      Cases_Guinea Cases_Liberia Cases_SierraLeone Cases_Nigeria 
Day
4.0           86.0           8.0               2.0           1.0
3.0           86.0           NaN               NaN           NaN
2.0           86.0           7.0               NaN           NaN
NaN            NaN           3.0               2.0           NaN
0.0           49.0           8.0               6.0           0.0
    Cases_Senegal Cases_UnitedStates Cases_Spain Cases_Mali 
Day
4.0           NaN                1.0         1.0        NaN
3.0           NaN                NaN         NaN        NaN
2.0           NaN                NaN         NaN        NaN
NaN           NaN                NaN         NaN        NaN
0.0           1.0                1.0         1.0        1.0
     Deaths_Guinea Deaths_Liberia Deaths_SierraLeone 
Day
4.0           62.0            4.0                2.0
3.0           60.0            NaN                NaN
2.0           59.0            2.0                NaN
NaN            NaN            3.0                2.0
0.0           29.0            6.0                5.0
    Deaths_Nigeria Deaths_Senegal Deaths_UnitedStates 
Day
4.0            1.0            NaN                 0.0
3.0            NaN            NaN                 NaN
2.0            NaN            NaN                 NaN
NaN            NaN            NaN                 NaN
0.0            0.0            0.0                 0.0
   Deaths_Spain Deaths_Mali
Day
4.0         1.0         NaN
3.0         NaN         NaN
2.0         NaN         NaN
NaN         NaN         NaN
0.0         1.0         1.0

12.11 Resampling

Resampling converts a datetime from one frequency to another frequency. Three types of resampling can occur:

  • Downsampling: from a higher frequency to a lower frequency (e.g., daily to monthly)

  • Upsampling: from a lower frequency to a higher frequency (e.g., monthly to daily)

  • No change: frequency does not change (e.g., every first Thursday of the month to the last Friday of the month)

The values we can pass into .resample() are listed in Table 12.2.

# downsample daily values to monthly values
# since we have multiple values, we need to aggregate the results
# here we will use the mean
down = ebola.resample('M').mean()
print(down.iloc[:,  :5])
                  Day Cases_Guinea   Cases_Liberia 
Date
2014-03-31   4.500000    94.500000        6.500000
2014-04-30  24.333333   177.818182       24.555556
2014-05-31  51.888889   248.777778       12.555556
2014-06-30  84.636364   373.428571       35.500000
2014-07-31 115.700000   423.000000      212.300000
...               ...          ...             ...
2014-09-30 177.500000   967.888889     2815.625000
2014-10-31 207.470588  1500.444444     4758.750000
2014-11-30 237.214286  1950.500000     7039.000000
2014-12-31 271.181818  2579.625000     7902.571429
2015-01-31 287.500000  2773.333333     8161.500000
     Cases_SierraLeone  Cases_Nigeria
Date
2014-03-31    3.333333            NaN
2014-04-30    2.200000            NaN
2014-05-31    7.333333            NaN
2014-06-30  125.571429            NaN
2014-07-31  420.500000       1.333333
...                ...            ...
2014-09-30 1726.000000      20.714286
2014-10-31 3668.111111      20.000000
2014-11-30 5843.625000      20.000000
2014-12-31 8985.875000      20.000000
2015-01-31 9844.000000            NaN
[11 rows x 5 columns]
# here we will upsample our downsampled value
# notice how missing dates are populated,
# but they are filled in with missing values
up = down.resample('D').mean()
print(up.iloc[:, :5])
             Day Cases_Guinea Cases_Liberia Cases_SierraLeone 
Date
2014-03-31   4.5    94.500000           6.5          3.333333
2014-04-01   NaN          NaN           NaN               NaN
2014-04-02   NaN          NaN           NaN               NaN
2014-04-03   NaN          NaN           NaN               NaN
2014-04-04   NaN          NaN           NaN               NaN
...          ...          ...           ...               ...
2015-01-27   NaN          NaN           NaN               NaN
2015-01-28   NaN          NaN           NaN               NaN
2015-01-29   NaN          NaN           NaN               NaN
2015-01-30   NaN          NaN           NaN               NaN
2015-01-31 287.5  2773.333333        8161.5       9844.000000
     Cases_Nigeria
Date
2014-03-31     NaN
2014-04-01     NaN
2014-04-02     NaN
2014-04-03     NaN
2014-04-04     NaN
...            ...
2015-01-27     NaN
2015-01-28     NaN
2015-01-29     NaN
2015-01-30     NaN
2015-01-31     NaN
[307 rows x 5 columns]

12.12 Time Zones

Don’t try to write your own time zone converter. As Tom Scott explains in a “Computerphile” video, “That way lies madness.”6 There are many things you probably did not even think to consider when working with different time zones. For example, not every country implements daylight savings time, and even those that do, may not necessarily change the clocks on the same day of the year. And don’t forget about leap years and leap seconds! Luckily Python has a library specifically designed to work with time zones7, Pandas also wraps this library when working with time zones.

6. The problem with time and time zones: Computerphile: www.youtube.com/watch?v=-5wpm-gesOY

7. Documentation for pytz:a https://pythonhosted.org/pytz/

import pytz

There are many time zones available in the library.

print(len(pytz.all_timezones))
594

Here are the U.S. time zones:

import re
regex = re.compile(r'^US')
selected_files = filter(regex.search, pytz.common_timezones)
print(list(selected_files))
['US/Alaska', 'US/Arizona', 'US/Central', 'US/Eastern', 'US/Hawaii',
' US/Mountain', 'US/Pacific']

The easiest way to interact with time zones in Pandas is to use the string names given in pytz.all_timezones().

One way to illustrate time zones is to create two timestamps using the Pandas Timestamp function. For example, if there was a flight between the JFK and LAX airports that departed at 7:00 AM from New York and landed at 9:57 AM in Los Angeles. We can encode these times with the proper time zone.

# 7AM Eastern
depart = pd.Timestamp('2017-08-29 07:00', tz='US/Eastern')
print(depart)
2017-08-29 07:00:00-04:00
arrive = pd.Timestamp('2017-08-29 09:57')
print(arrive)
2017-08-29 09:57:00

Another way we can encode a time zone is by using the .tz_localize() method on an “empty” timestamp.

arrive = arrive.tz_localize('US/Pacific')
print(arrive)
2017-08-29 09:57:00-07:00

We can convert the arrival time back to the Eastern time zone to see what the time would be on the East Coast when the flight arrives.

print(arrive.tz_convert('US/Eastern'))
2017-08-29 12:57:00-04:00

We can also perform operations on time zones. Here we look at the difference between the times to get the flight duration.

duration = arrive - depart
print(duration)
0 days 05:57:00

12.13 Arrow for Better Dates and Times

If you do end up working with date and time columns often, I would suggest looking into the arrow library. You can find the documentation page here: https://arrow.readthedocs.io/en/latest/ Do not confuse this Arrow library with the Apache Arrow project for language-independent dataframe formats.

Arrow is a separate library that needs to be installed, but works slightly different from the methods shown in this chapter. However, it does do a better job handling time zones. See this post by Paul Ganssle for more information about the benefits of arrow over pytz: https://blog.ganssle.io/articles/2018/03/pytz-fastest-footgun.html

Conclusion

Pandas provides a series of convenient methods and functions when we are working with dates and times because these types of data are used so often with time-series data. A common example of time-series data is stock prices, but other examples include observational and simulated data. These convenient Pandas functions and methods allow you to easily work with date objects without having to resort to string manipulation and parsing.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.191.165.62