Reading data from remote data services

pandas has direct support for various web-based data source classes in the pandas.io.data namespace. The primary class of interest is pandas.io.data.DataReader, which is implemented to read data from various supported sources and return it to the application directly as DataFrame.

Currently, support exists for the following sources via the DataReader class:

  • Daily historical prices' stock from either Yahoo! and Google Finance
  • Yahoo! Options
  • The Federal Reserve Economic Data library
  • Kenneth French's Data Library
  • The World Bank

The specific source of data is specified via the DataReader object's data_source parameter. The specific items to be retrieved are specified using the name parameter. If the data source supports selecting data between a range of dates, these dates can be specified with the start and end parameters. We will now take a look at reading data from each of these sources.

Reading stock data from Yahoo! and Google Finance

Yahoo! Finance is specified by passing 'yahoo' as the data_source parameter. The following retrieves data from Yahoo! Finance, specifically, the data for MSFT between 2012-01-01 and 2014-01-27:

In [36]:
   # import pandas.io.data namespace, alias as web
   import pandas.io.data as web
   # and datetime for the dates
   import datetime

   # start end end dates
   start = datetime.datetime(2012, 1, 1)
   end = datetime.datetime(2014, 1, 27)

   # read the MSFT stock data from yahoo! and view the head
   yahoo = web.DataReader('MSFT', 'yahoo', start, end)
   yahoo.head()

Out[36]:
                Open   High    Low  Close    Volume  Adj Close
   Date                                                       
   2012-01-03  26.55  26.96  26.39  26.77  64731500      24.42
   2012-01-04  26.82  27.47  26.78  27.40  80516100      25.00
   2012-01-05  27.38  27.73  27.29  27.68  56081400      25.25
   2012-01-06  27.53  28.19  27.53  28.11  99455500      25.64
   2012-01-09  28.05  28.10  27.72  27.74  59706800      25.31

The source of the data can be changed to Google Finance with a change of the data_source parameter to 'google':

In [37]:
   # read from google and display the head of the data
   goog = web.DataReader("MSFT", 'google', start, end)
   goog.head()

Out[37]:
                Open   High    Low  Close    Volume
   Date                                           
   2012-01-03  26.55  26.96  26.39  26.76  64735391
   2012-01-04  26.82  27.47  26.78  27.40  80519402
   2012-01-05  27.38  27.73  27.29  27.68  56082205
   2012-01-06  27.53  28.19  27.52  28.10  99459469
   2012-01-09  28.05  28.10  27.72  27.74  59708266

Notice that the result of the Google query has different columns than the Yahoo! data; specifically, Google Finance does not provide an Adjusted Close value.

Retrieving data from Yahoo! Finance Options

pandas provides experimental support for Yahoo! Finance Options data to be retrieved via the pandas.io.data.Options class. In the following example, the .get_all_data() method is used to download options data for AAPL from yahoo:

In [38]:
   # specify we want all yahoo options data for AAPL
   # this can take a little time...
   aapl = pd.io.data.Options('AAPL', 'yahoo')
   # read all the data
   data = aapl.get_all_data()
   # examine the first six rows and four columns
   data.iloc[0:6, 0:4]

Out[38]:
                                               Last   Bid    Ask  Chg
   Strike Expiry     Type Symbol                                      
   27.86  2015-01-17 call AAPL150117C00027860 82.74 79.50  79.85    0
                     put  AAPL150117P00027860  0.02  0.00   0.01    0
   28.57  2015-01-17 call AAPL150117C00028570 82.02 78.80  79.10    0
                     put  AAPL150117P00028570  0.01  0.00   0.01    0
   29.29  2015-01-17 call AAPL150117C00029290 84.75 78.05  78.40    0
                     put  AAPL150117P00029290 0.01   0.00   0.01    0

The resulting DataFrame object contains a hierarchical index, which can be used to easily extract specific subsets of data. To demonstrate, look at several examples of slicing by the values in the index.

The following code will return all put options at a Strike price of $80. Using slice(None) as one of the values in the tuple used to select by index will include all Expiry dates:

In [39]:
   # get all puts at strike price of $80 (first four columns only)
   data.loc[(80, slice(None), 'put'), :].iloc[0:5, 0:4]

Out[39]:
                                               Last   Bid   Ask  Chg
   Strike Expiry     Type Symbol                                    
   80     2015-01-09 put  AAPL150109P00080000  0.01  0.00  0.01    0
          2015-01-17 put  AAPL150117P00080000  0.01  0.01  0.02    0
          2015-01-23 put  AAPL150123P00080000  0.06  0.01  0.04    0
          2015-01-30 put  AAPL150130P00080000  0.12  0.07  0.12    0
          2015-02-20 put  AAPL150220P00080000  0.22  0.22  0.24    0

As another example, we can narrow the date range by specifying a date slice instead of slice(None). The following narrows the result down to those where Expiry date is between 2015-01-17 and 2015-04-17:

In [40]:
   # put options at strike of $80, between 2015-01-17 and 2015-04-17
   data.loc[(80, slice('20150117','20150417'), 
             'put'), :].iloc[:, 0:4]

Out[40]:
                                               Last   Bid   Ask   Chg
   Strike Expiry     Type Symbol                                     
   80     2015-01-17 put  AAPL150117P00080000  0.01  0.01  0.02  0.00
          2015-01-23 put  AAPL150123P00080000  0.06  0.01  0.04  0.00
          2015-01-30 put  AAPL150130P00080000  0.12  0.07  0.12  0.00
          2015-02-20 put  AAPL150220P00080000  0.22  0.22  0.24  0.00
          2015-03-20 put  AAPL150320P00080000  0.40  0.39  0.43 -0.06
          2015-04-17 put  AAPL150417P00080000  0.64  0.61  0.66  0.00

If you do not want to download all of the data (which can take a few minutes), then you can use other methods, such as .get_call_data() and .get_put_data(), which will only download the data for the call or puts, respectively, and for a specific expiry date. To demonstrate, the following loads all call data for MSFT with the expiry date of 2015-01-05:

In [41]:
   # msft calls expiring on 2015-01-05
   expiry = datetime.date(2015, 1, 5)
   msft_calls = pd.io.data.Options('MSFT', 'yahoo').get_call_data(
                                                       expiry=expiry)
   msft_calls.iloc[0:5, 0:5]

Out[41]:
                                               Last   Bid   Ask  Chg       PctChg
   Strike Expiry     Type Symbol                                           
   35.5  2015-03-13 call MSFT150313C00035500  6.20  5.55  7.20   0  0.00%
   36.5  2015-03-13 call MSFT150313C00036500  7.60  5.65  5.95   0  0.00%
   37.5  2015-03-13 call MSFT150313C00037500  5.10  4.65  5.25   0  0.00%
   38.0  2015-03-13 call MSFT150313C00038000  3.10  4.15  4.45   0  0.00%
   39.0  2015-03-13 call MSFT150313C00039000  3.15  3.15  3.45   0  0.00%

The .get_all_data() method will load data for all expiry months and cache that data. The cache will be automatically used to make subsequent calls return much more quickly if the data is present in the cache. As an example, the following example will return calls quickly, as the data has already been cached by the previous .get_all_data() call for appl call options:

In [42]:
   # msft calls expiring on 2015-01-17
   expiry = datetime.date(2015, 1, 17)
   aapl_calls = aapl.get_call_data(expiry=expiry)
   aapl_calls.iloc[0:5, 0:4]

Out[42]:
                                               Last   Bid    Ask  Chg
   Strike Expiry     Type Symbol                                       
   27.86  2015-01-17 call AAPL150117C00027860 82.74 79.50  79.85    0
   28.57  2015-01-17 call AAPL150117C00028570 82.02 78.80  79.10    0
   29.29  2015-01-17 call AAPL150117C00029290 84.75 78.05  78.40    0
   30.00  2015-01-17 call AAPL150117C00030000 81.20 77.35  77.70    0
   30.71  2015-01-17 call AAPL150117C00030710 83.20 76.65  77.00    0

Reading economic data from the Federal Reserve Bank of St. Louis

The Federal Reserve Economic Data (FRED) of St. Louis (http://research.stlouisfed.org/fred2/) provides downloads of over 240,000 US and International time series from over 76 data sources, and it is constantly growing.

Reading economic data from the Federal Reserve Bank of St. Louis

FRED data can be specified by data_source="fred", passing the specific series tag as the name parameter. A specific data series can be specified using the series ID in the first parameter. As an example, the following retrieves GDP information between the two specified dates:

In [43]:
   # read GDP data from FRED
   gdp = web.DataReader("GDP", "fred", 
                        datetime.date(2012, 1, 1), 
                        datetime.date(2014, 1, 27))
   gdp

Out[43]:
                   GDP
   DATE               
   2012-01-01  15956.5
   2012-04-01  16094.7
   2012-07-01  16268.9
   2012-10-01  16332.5
   2013-01-01  16502.4
   2013-04-01  16619.2
   2013-07-01  16872.3
   2013-10-01  17078.3
   2014-01-01  17044.0

To select another series, simply specify the series identifier in the first parameter. The site can be conveniently navigated through series and data visualized directly on the site. For example, the following screenshot shows you the series Compensation of employees: Wages and salaries:

Reading economic data from the Federal Reserve Bank of St. Louis

This data series is represented by the A576RC1A027NBEA ID and we can download it with the following code:

In [44]:
   # Get Compensation of employees: Wages and salaries
   web.DataReader("A576RC1A027NBEA",
                  "fred", 
                  datetime.date(1929, 1, 1),
                  datetime.date(2013, 1, 1))

Out[44]:
               A576RC1A027NBEA
   DATE                       
   1929-01-01             50.5
   1930-01-01             46.2
   1931-01-01             39.2
   1932-01-01             30.5
   1933-01-01             29.0
   ...                     ...
   2009-01-01           6251.4
   2010-01-01           6377.5
   2011-01-01           6633.2
   2012-01-01           6932.1
   2013-01-01           7124.7

   [85 rows x 1 columns]

Accessing Kenneth French's data

Kenneth R. French is a professor of finance at the Tuck School of Business at Dartmouth University. He has created an extensive library of economic data, which is available for download over the Web. The website for his data is at http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html, and it contains a detailed description of the datasets.

Accessing Kenneth French's data

The data available at the site is downloadable in zip files and can be read directly into DataFrame by specifying the dataset's filename (without .zip) as the first parameter of DataReader and using data_source="famafrench":

As an example, the following reads the Global Factors data:

In [45]:
   # read from Kenneth French fama global factors data set
   factors = web.DataReader("Global_Factors", "famafrench")
   factors

Out[45]:
   {3:         1 Mkt-RF  2 SMB  3 HML  4 WML  5 RF
    199007      0.79   0.07   0.24 -99.99  0.68
    199008    -10.76  -1.56   0.42 -99.99  0.66
    199009    -12.24   1.68   0.34 -99.99  0.60
    199010      9.58  -8.11  -3.29 -99.99  0.68
    199011     -3.87   1.62   0.68  -0.32  0.57
    ...          ...    ...    ...    ...   ...
    201409     -3.05  -2.63  -1.05   1.07  0.00
    201410      0.34  -0.23  -2.98  -0.46  0.00
    201411      1.67  -2.14  -1.92   0.65  0.00
    201412     -1.45   1.89  -0.33   1.06  0.00
    201501     -1.75   0.04  -2.78   4.50  0.00
 
    [295 rows x 5 columns]}

Reading from the World Bank

Thousands of data feeds are available from the World Bank and can be read directly into pandas DataFrame objects. The World Bank data catalog can be explored at http://www.worldbank.org/.

Reading from the World Bank

World Bank datasets are identified using indicators, a text code that represents each dataset. A full list of indicators can be retrieved using the pandas.io.wb.get_indicators() function. At the time of writing this, there were 13079 indicators:

In [46]:
   # make referencing pandas.io.wb a little less typing
   import pandas.io.wb as wb
   # get all indicators
   all_indicators = wb.get_indicators()

In [47]:
   # examine some of the indicators
   all_indicators.ix[:,0:1]

Out[47]:
                                     id
   0                 1.0.HCount.1.25usd
   1                   1.0.HCount.10usd
   2                  1.0.HCount.2.5usd
   3               1.0.HCount.Mid10to50
   4                    1.0.HCount.Ofcl
   ...                              ...
   13074      per_sionl.overlap_pop_urb
   13075  per_sionl.overlap_q1_preT_tot
   13076       per_sionl.overlap_q1_rur
   13077       per_sionl.overlap_q1_tot
   13078       per_sionl.overlap_q1_urb

   [13079 rows x 1 columns]

These indicators can be investigated using the World Bank website, but if you have an idea of the indicator you would like to sample, you can perform a search. As an example, the following uses the pandas.io.wb.search() function to search for indicators with data related to life expectancy:

In [48]:
   # search of life expectancy indicators
   le_indicators = wb.search("life expectancy")
   # report first three rows, first two columns
   le_indicators.iloc[:3,:2]

Out[48]:
                        id                                      name
   7785  SP.DYN.LE00.FE.IN  Life expectancy at birth, female (years)
   7786     SP.DYN.LE00.IN   Life expectancy at birth, total (years)
   7787  SP.DYN.LE00.MA.IN    Life expectancy at birth, male (years)

Each indicator is broken down into various countries. A full list of country data can be retrieved using the pandas.io.wb.get_countries() function, as demonstrated here:

In [49]:
   # get countries and show the 3 digit code and name
   countries = wb.get_countries()
   # show a subset of the country data
   countries.iloc[0:10].ix[:,['name', 'capitalCity', 'iso2c']]

Out[49]:
                      name       capitalCity iso2c
   0                 Aruba        Oranjestad    AW
   1           Afghanistan             Kabul    AF
   2                Africa                      A9
   3                Angola            Luanda    AO
   4               Albania            Tirane    AL
   5               Andorra  Andorra la Vella    AD
   6         Andean Region                      L5
   7            Arab World                      1A
   8  United Arab Emirates         Abu Dhabi    AE
   9             Argentina      Buenos Aires    AR

Data for an indicator can be downloaded using the pandas.io.wb.download() function, specifying the dataset using the indicator parameter. The following downloads the life expectancy data for countries from 1980 through 2012:

In [50]:
   # get life expectancy at birth for all countries from 1980 to 2014
   le_data_all = wb.download(indicator="SP.DYN.LE00.IN", 
                             start='1980', 
                             end='2014')
   le_data_all

Out[50]:
                       SP.DYN.LE00.IN
   country       year                
   Canada        2014             NaN
                 2013             NaN
                 2012       81.238049
                 2011       81.068317
                 2010       80.893488
   ...                            ...
   United States 1984       74.563415
                 1983       74.463415
                 1982       74.360976
                 1981       74.007317
                 1980       73.658537

   [105 rows x 1 columns]

By default, data is only returned for the United States, Canada, and Mexico. This can be seen by examining the index of the result of the previous query:

In [51]:
   # only US, CAN, and MEX are returned by default
   le_data_all.index.levels[0]

Out[51]:
   Index([u'Canada', u'Mexico', u'United States'], dtype='object')

To get the data for more countries, specify them explicitly using the country parameter. The following gets the data for all known countries:

In [52]:
   # retrieve life expectancy at birth for all countries 
   # from 1980 to 2014
   le_data_all = wb.download(indicator="SP.DYN.LE00.IN", 
                             country = countries['iso2c'],
                             start='1980', 
                             end='2012')
   le_data_all

Out[52]:
                  SP.DYN.LE00.IN
   country  year                
   Aruba    2012       75.206756
            2011       75.080390
            2010       74.952024
            2009       74.816146
            2008       74.674220
   ...                       ...
   Zimbabwe 1984       61.217561
            1983       60.902854
            1982       60.466171
            1981       59.944951
            1980       59.377610

   [8151 rows x 1 columns]

We can do some interesting things with this data. The example we will look at, determines which country has the lowest life expectancy for each year. To do this, we first need to pivot this data, so that the index is the country name and the year is the column. We will look at pivoting in more detail in later chapters, but for now, just know that the following reorganized the data into the country along the index and the year across the columns. Also, each value is the life expectancy for each country for that specific year:

In [53]:
#le_data_all.pivot(index='country', columns='year')
le_data = le_data_all.reset_index().pivot(index='country', 
                                          columns='year')
# examine pivoted data
le_data.ix[:,0:3]

Out[53]:
                  SP.DYN.LE00.IN                      
year                         1980       1981       1982
country                                                
Afghanistan             41.233659  41.760634  42.335610
Albania                 70.235976  70.454463  70.685122
Algeria                 58.164024  59.486756  60.786341
American Samoa                NaN        NaN        NaN
Andorra                       NaN        NaN        NaN
...                           ...        ...        ...
West Bank and Gaza            NaN        NaN        NaN
World                   63.186868  63.494118  63.798264
Yemen, Rep.             50.559537  51.541341  52.492707
Zambia                  51.148951  50.817707  50.350805
Zimbabwe                59.377610  59.944951  60.466171

   [247 rows x 3 columns]

With the data in this format, we can perform and determine which country has the lowest life expectancy for each year using .idxmin(axis=0):

In [54]:
   # ask what is the name of the country for each year
   # with the least life expectancy
   country_with_least_expectancy = le_data.idxmin(axis=0)
   country_with_least_expectancy

Out[54]:
                   year
   SP.DYN.LE00.IN  1980       Cambodia
                   1981       Cambodia
                   1982    Timor-Leste
   ...
   SP.DYN.LE00.IN  2010    Sierra Leone
                   2011    Sierra Leone
                   2012    Sierra Leone
   Length: 33, dtype: object

The actual minimum value for each year can be retrieved using .min(axis=0):

In [55]:
   # and what is the minimum life expectancy for each year
   expectancy_for_least_country = le_data.min(axis=0)
   expectancy_for_least_country

Out[55]:
                   year
   SP.DYN.LE00.IN  1980    29.613537
                   1981    35.856341
                   1982    38.176220
   ...
   SP.DYN.LE00.IN  2010    44.838951
                   2011    45.102585
                   2012    45.329049
   Length: 33, dtype: float64

These two results can then be combined into a new DataFrame that tells us which country had the least life expectancy for each year and what that value is:

In [56]:
   # this merges the two frames together and gives us
   # year, country and expectancy where there minimum exists
   least = pd.DataFrame(
       data = {'Country': country_with_least_expectancy.values,
               'Expectancy': expectancy_for_least_country.values},
       index = country_with_least_expectancy.index.levels[1])
   least

Out[56]:
              Country  Expectancy
   year                          
   1980      Cambodia   29.613537
   1981      Cambodia   35.856341
   1982   Timor-Leste   38.176220
   1983   South Sudan   39.676488
   1984   South Sudan   40.011024
   ...            ...         ...
   2008  Sierra Leone   44.067463
   2009  Sierra Leone   44.501439
   2010  Sierra Leone   44.838951
   2011  Sierra Leone   45.102585
   2012  Sierra Leone   45.329049

   [33 rows x 2 columns]
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.225.254.192