pandas has direct support for various web-based data source classes in the pandas.io.data
namespace. The primary class of interest is pandas.io.data.DataReader
, which is implemented to read data from various supported sources and return it to the application directly as DataFrame
.
Currently, support exists for the following sources via the DataReader
class:
The specific source of data is specified via the DataReader
object's data_source
parameter. The specific items to be retrieved are specified using the name
parameter. If the data source supports selecting data between a range of dates, these dates can be specified with the start
and end
parameters. We will now take a look at reading data from each of these sources.
Yahoo! Finance is specified by passing 'yahoo'
as the data_source
parameter. The following retrieves data from Yahoo! Finance, specifically, the data for MSFT between 2012-01-01
and 2014-01-27
:
In [36]: # import pandas.io.data namespace, alias as web import pandas.io.data as web # and datetime for the dates import datetime # start end end dates start = datetime.datetime(2012, 1, 1) end = datetime.datetime(2014, 1, 27) # read the MSFT stock data from yahoo! and view the head yahoo = web.DataReader('MSFT', 'yahoo', start, end) yahoo.head() Out[36]: Open High Low Close Volume Adj Close Date 2012-01-03 26.55 26.96 26.39 26.77 64731500 24.42 2012-01-04 26.82 27.47 26.78 27.40 80516100 25.00 2012-01-05 27.38 27.73 27.29 27.68 56081400 25.25 2012-01-06 27.53 28.19 27.53 28.11 99455500 25.64 2012-01-09 28.05 28.10 27.72 27.74 59706800 25.31
The source of the data can be changed to Google Finance with a change of the data_source
parameter to 'google'
:
In [37]: # read from google and display the head of the data goog = web.DataReader("MSFT", 'google', start, end) goog.head() Out[37]: Open High Low Close Volume Date 2012-01-03 26.55 26.96 26.39 26.76 64735391 2012-01-04 26.82 27.47 26.78 27.40 80519402 2012-01-05 27.38 27.73 27.29 27.68 56082205 2012-01-06 27.53 28.19 27.52 28.10 99459469 2012-01-09 28.05 28.10 27.72 27.74 59708266
Notice that the result of the Google query has different columns than the Yahoo! data; specifically, Google Finance does not provide an Adjusted Close
value.
pandas provides experimental support for Yahoo! Finance Options data to be retrieved via the pandas.io.data.Options
class. In the following example, the .get_all_data()
method is used to download options data for AAPL
from yahoo
:
In [38]: # specify we want all yahoo options data for AAPL # this can take a little time... aapl = pd.io.data.Options('AAPL', 'yahoo') # read all the data data = aapl.get_all_data() # examine the first six rows and four columns data.iloc[0:6, 0:4] Out[38]: Last Bid Ask Chg Strike Expiry Type Symbol 27.86 2015-01-17 call AAPL150117C00027860 82.74 79.50 79.85 0 put AAPL150117P00027860 0.02 0.00 0.01 0 28.57 2015-01-17 call AAPL150117C00028570 82.02 78.80 79.10 0 put AAPL150117P00028570 0.01 0.00 0.01 0 29.29 2015-01-17 call AAPL150117C00029290 84.75 78.05 78.40 0 put AAPL150117P00029290 0.01 0.00 0.01 0
The resulting DataFrame
object contains a hierarchical index, which can be used to easily extract specific subsets of data. To demonstrate, look at several examples of slicing by the values in the index.
The following code will return all put
options at a Strike price of $80. Using slice(None)
as one of the values in the tuple used to select by index will include all Expiry
dates:
In [39]: # get all puts at strike price of $80 (first four columns only) data.loc[(80, slice(None), 'put'), :].iloc[0:5, 0:4] Out[39]: Last Bid Ask Chg Strike Expiry Type Symbol 80 2015-01-09 put AAPL150109P00080000 0.01 0.00 0.01 0 2015-01-17 put AAPL150117P00080000 0.01 0.01 0.02 0 2015-01-23 put AAPL150123P00080000 0.06 0.01 0.04 0 2015-01-30 put AAPL150130P00080000 0.12 0.07 0.12 0 2015-02-20 put AAPL150220P00080000 0.22 0.22 0.24 0
As another example, we can narrow the date range by specifying a date slice instead of slice(None)
. The following narrows the result down to those where Expiry
date is between 2015-01-17 and 2015-04-17:
In [40]: # put options at strike of $80, between 2015-01-17 and 2015-04-17 data.loc[(80, slice('20150117','20150417'), 'put'), :].iloc[:, 0:4] Out[40]: Last Bid Ask Chg Strike Expiry Type Symbol 80 2015-01-17 put AAPL150117P00080000 0.01 0.01 0.02 0.00 2015-01-23 put AAPL150123P00080000 0.06 0.01 0.04 0.00 2015-01-30 put AAPL150130P00080000 0.12 0.07 0.12 0.00 2015-02-20 put AAPL150220P00080000 0.22 0.22 0.24 0.00 2015-03-20 put AAPL150320P00080000 0.40 0.39 0.43 -0.06 2015-04-17 put AAPL150417P00080000 0.64 0.61 0.66 0.00
If you do not want to download all of the data (which can take a few minutes), then you can use other methods, such as .get_call_data()
and .get_put_data()
, which will only download the data for the call or puts, respectively, and for a specific expiry date. To demonstrate, the following loads all call data for MSFT with the expiry date of 2015-01-05:
In [41]: # msft calls expiring on 2015-01-05 expiry = datetime.date(2015, 1, 5) msft_calls = pd.io.data.Options('MSFT', 'yahoo').get_call_data( expiry=expiry) msft_calls.iloc[0:5, 0:5] Out[41]: Last Bid Ask Chg PctChg Strike Expiry Type Symbol 35.5 2015-03-13 call MSFT150313C00035500 6.20 5.55 7.20 0 0.00% 36.5 2015-03-13 call MSFT150313C00036500 7.60 5.65 5.95 0 0.00% 37.5 2015-03-13 call MSFT150313C00037500 5.10 4.65 5.25 0 0.00% 38.0 2015-03-13 call MSFT150313C00038000 3.10 4.15 4.45 0 0.00% 39.0 2015-03-13 call MSFT150313C00039000 3.15 3.15 3.45 0 0.00%
The .get_all_data()
method will load data for all expiry months and cache that data. The cache will be automatically used to make subsequent calls return much more quickly if the data is present in the cache. As an example, the following example will return calls quickly, as the data has already been cached by the previous .get_all_data()
call for appl
call options:
In [42]: # msft calls expiring on 2015-01-17 expiry = datetime.date(2015, 1, 17) aapl_calls = aapl.get_call_data(expiry=expiry) aapl_calls.iloc[0:5, 0:4] Out[42]: Last Bid Ask Chg Strike Expiry Type Symbol 27.86 2015-01-17 call AAPL150117C00027860 82.74 79.50 79.85 0 28.57 2015-01-17 call AAPL150117C00028570 82.02 78.80 79.10 0 29.29 2015-01-17 call AAPL150117C00029290 84.75 78.05 78.40 0 30.00 2015-01-17 call AAPL150117C00030000 81.20 77.35 77.70 0 30.71 2015-01-17 call AAPL150117C00030710 83.20 76.65 77.00 0
The Federal Reserve Economic Data (FRED) of St. Louis (http://research.stlouisfed.org/fred2/) provides downloads of over 240,000 US and International time series from over 76 data sources, and it is constantly growing.
FRED data can be specified by data_source="fred"
, passing the specific series tag as the name
parameter. A specific data series can be specified using the series ID in the first parameter. As an example, the following retrieves GDP information between the two specified dates:
In [43]: # read GDP data from FRED gdp = web.DataReader("GDP", "fred", datetime.date(2012, 1, 1), datetime.date(2014, 1, 27)) gdp Out[43]: GDP DATE 2012-01-01 15956.5 2012-04-01 16094.7 2012-07-01 16268.9 2012-10-01 16332.5 2013-01-01 16502.4 2013-04-01 16619.2 2013-07-01 16872.3 2013-10-01 17078.3 2014-01-01 17044.0
To select another series, simply specify the series identifier in the first parameter. The site can be conveniently navigated through series and data visualized directly on the site. For example, the following screenshot shows you the series Compensation of employees: Wages and salaries:
This data series is represented by the A576RC1A027NBEA
ID and we can download it with the following code:
In [44]: # Get Compensation of employees: Wages and salaries web.DataReader("A576RC1A027NBEA", "fred", datetime.date(1929, 1, 1), datetime.date(2013, 1, 1)) Out[44]: A576RC1A027NBEA DATE 1929-01-01 50.5 1930-01-01 46.2 1931-01-01 39.2 1932-01-01 30.5 1933-01-01 29.0 ... ... 2009-01-01 6251.4 2010-01-01 6377.5 2011-01-01 6633.2 2012-01-01 6932.1 2013-01-01 7124.7 [85 rows x 1 columns]
Kenneth R. French is a professor of finance at the Tuck School of Business at Dartmouth University. He has created an extensive library of economic data, which is available for download over the Web. The website for his data is at http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html, and it contains a detailed description of the datasets.
The data available at the site is downloadable in zip files and can be read directly into DataFrame
by specifying the dataset's filename (without .zip) as the first parameter of DataReader
and using data_source="famafrench"
:
As an example, the following reads the Global Factors data:
In [45]: # read from Kenneth French fama global factors data set factors = web.DataReader("Global_Factors", "famafrench") factors Out[45]: {3: 1 Mkt-RF 2 SMB 3 HML 4 WML 5 RF 199007 0.79 0.07 0.24 -99.99 0.68 199008 -10.76 -1.56 0.42 -99.99 0.66 199009 -12.24 1.68 0.34 -99.99 0.60 199010 9.58 -8.11 -3.29 -99.99 0.68 199011 -3.87 1.62 0.68 -0.32 0.57 ... ... ... ... ... ... 201409 -3.05 -2.63 -1.05 1.07 0.00 201410 0.34 -0.23 -2.98 -0.46 0.00 201411 1.67 -2.14 -1.92 0.65 0.00 201412 -1.45 1.89 -0.33 1.06 0.00 201501 -1.75 0.04 -2.78 4.50 0.00 [295 rows x 5 columns]}
Thousands of data feeds are available from the World Bank and can be read directly into pandas DataFrame objects. The World Bank data catalog can be explored at http://www.worldbank.org/.
World Bank datasets are identified using indicators, a text code that represents each dataset. A full list of indicators can be retrieved using the pandas.io.wb.get_indicators()
function. At the time of writing this, there were 13079
indicators:
In [46]: # make referencing pandas.io.wb a little less typing import pandas.io.wb as wb # get all indicators all_indicators = wb.get_indicators() In [47]: # examine some of the indicators all_indicators.ix[:,0:1] Out[47]: id 0 1.0.HCount.1.25usd 1 1.0.HCount.10usd 2 1.0.HCount.2.5usd 3 1.0.HCount.Mid10to50 4 1.0.HCount.Ofcl ... ... 13074 per_sionl.overlap_pop_urb 13075 per_sionl.overlap_q1_preT_tot 13076 per_sionl.overlap_q1_rur 13077 per_sionl.overlap_q1_tot 13078 per_sionl.overlap_q1_urb [13079 rows x 1 columns]
These indicators can be investigated using the World Bank website, but if you have an idea of the indicator you would like to sample, you can perform a search. As an example, the following uses the pandas.io.wb.search()
function to search for indicators with data related to life expectancy:
In [48]: # search of life expectancy indicators le_indicators = wb.search("life expectancy") # report first three rows, first two columns le_indicators.iloc[:3,:2] Out[48]: id name 7785 SP.DYN.LE00.FE.IN Life expectancy at birth, female (years) 7786 SP.DYN.LE00.IN Life expectancy at birth, total (years) 7787 SP.DYN.LE00.MA.IN Life expectancy at birth, male (years)
Each indicator is broken down into various countries. A full list of country data can be retrieved using the pandas.io.wb.get_countries()
function, as demonstrated here:
In [49]: # get countries and show the 3 digit code and name countries = wb.get_countries() # show a subset of the country data countries.iloc[0:10].ix[:,['name', 'capitalCity', 'iso2c']] Out[49]: name capitalCity iso2c 0 Aruba Oranjestad AW 1 Afghanistan Kabul AF 2 Africa A9 3 Angola Luanda AO 4 Albania Tirane AL 5 Andorra Andorra la Vella AD 6 Andean Region L5 7 Arab World 1A 8 United Arab Emirates Abu Dhabi AE 9 Argentina Buenos Aires AR
Data for an indicator can be downloaded using the pandas.io.wb.download()
function, specifying the dataset using the indicator
parameter. The following downloads the life expectancy data for countries from 1980
through 2012
:
In [50]: # get life expectancy at birth for all countries from 1980 to 2014 le_data_all = wb.download(indicator="SP.DYN.LE00.IN", start='1980', end='2014') le_data_all Out[50]: SP.DYN.LE00.IN country year Canada 2014 NaN 2013 NaN 2012 81.238049 2011 81.068317 2010 80.893488 ... ... United States 1984 74.563415 1983 74.463415 1982 74.360976 1981 74.007317 1980 73.658537 [105 rows x 1 columns]
By default, data is only returned for the United States, Canada, and Mexico. This can be seen by examining the index of the result of the previous query:
In [51]: # only US, CAN, and MEX are returned by default le_data_all.index.levels[0] Out[51]: Index([u'Canada', u'Mexico', u'United States'], dtype='object')
To get the data for more countries, specify them explicitly using the country parameter. The following gets the data for all known countries:
In [52]: # retrieve life expectancy at birth for all countries # from 1980 to 2014 le_data_all = wb.download(indicator="SP.DYN.LE00.IN", country = countries['iso2c'], start='1980', end='2012') le_data_all Out[52]: SP.DYN.LE00.IN country year Aruba 2012 75.206756 2011 75.080390 2010 74.952024 2009 74.816146 2008 74.674220 ... ... Zimbabwe 1984 61.217561 1983 60.902854 1982 60.466171 1981 59.944951 1980 59.377610 [8151 rows x 1 columns]
We can do some interesting things with this data. The example we will look at, determines which country has the lowest life expectancy for each year. To do this, we first need to pivot this data, so that the index is the country name and the year is the column. We will look at pivoting in more detail in later chapters, but for now, just know that the following reorganized the data into the country along the index and the year across the columns. Also, each value is the life expectancy for each country for that specific year:
In [53]: #le_data_all.pivot(index='country', columns='year') le_data = le_data_all.reset_index().pivot(index='country', columns='year') # examine pivoted data le_data.ix[:,0:3] Out[53]: SP.DYN.LE00.IN year 1980 1981 1982 country Afghanistan 41.233659 41.760634 42.335610 Albania 70.235976 70.454463 70.685122 Algeria 58.164024 59.486756 60.786341 American Samoa NaN NaN NaN Andorra NaN NaN NaN ... ... ... ... West Bank and Gaza NaN NaN NaN World 63.186868 63.494118 63.798264 Yemen, Rep. 50.559537 51.541341 52.492707 Zambia 51.148951 50.817707 50.350805 Zimbabwe 59.377610 59.944951 60.466171 [247 rows x 3 columns]
With the data in this format, we can perform and determine which country has the lowest life expectancy for each year using .idxmin(axis=0)
:
In [54]: # ask what is the name of the country for each year # with the least life expectancy country_with_least_expectancy = le_data.idxmin(axis=0) country_with_least_expectancy Out[54]: year SP.DYN.LE00.IN 1980 Cambodia 1981 Cambodia 1982 Timor-Leste ... SP.DYN.LE00.IN 2010 Sierra Leone 2011 Sierra Leone 2012 Sierra Leone Length: 33, dtype: object
The actual minimum value for each year can be retrieved using .min(axis=0)
:
In [55]: # and what is the minimum life expectancy for each year expectancy_for_least_country = le_data.min(axis=0) expectancy_for_least_country Out[55]: year SP.DYN.LE00.IN 1980 29.613537 1981 35.856341 1982 38.176220 ... SP.DYN.LE00.IN 2010 44.838951 2011 45.102585 2012 45.329049 Length: 33, dtype: float64
These two results can then be combined into a new DataFrame
that tells us which country had the least life expectancy for each year and what that value is:
In [56]: # this merges the two frames together and gives us # year, country and expectancy where there minimum exists least = pd.DataFrame( data = {'Country': country_with_least_expectancy.values, 'Expectancy': expectancy_for_least_country.values}, index = country_with_least_expectancy.index.levels[1]) least Out[56]: Country Expectancy year 1980 Cambodia 29.613537 1981 Cambodia 35.856341 1982 Timor-Leste 38.176220 1983 South Sudan 39.676488 1984 South Sudan 40.011024 ... ... ... 2008 Sierra Leone 44.067463 2009 Sierra Leone 44.501439 2010 Sierra Leone 44.838951 2011 Sierra Leone 45.102585 2012 Sierra Leone 45.329049 [33 rows x 2 columns]
18.225.254.192