The data used in analyses is typically provided from other systems via files that are created and updated at various intervals, dynamically via access over the Web, or from various types of databases. The pandas library provides powerful facilities for easy retrieval of data from a variety of data sources and converting it into pandas objects. Here, we will briefly demonstrate this ease of use by loading data from files and from financial web services.
The pandas library provides built-in support for loading data in .csv
format, a common means of storing structured data in text files. Provided with the code from this book is a file data/test1.csv
in the CSV format, which represents some time series information. The specific content isn't important right now, as we just want to demonstrate the ease of loading data into a DataFrame
.
The following statement in IPython uses the operating system to display the content of this file (the command to use is different based upon your operating system).
In [29]: # display the contents of test1.csv # which command to use depends on your OS !cat data/test1.csv # on non-windows systems #!type data est1.csv # on windows systems date,0,1,2 2000-01-01 00:00:00,1.10376250134,-1.90997889703,-0.808955536115 2000-01-02 00:00:00,1.18891664768,0.581119740849,0.86159734949 2000-01-03 00:00:00,-0.964200042412,0.779764393246,1.82906224532 2000-01-04 00:00:00,0.782130444001,-1.72066965573,-1.10824167327 2000-01-05 00:00:00,-1.86701699823,-0.528368292754,-2.48830894087 2000-01-06 00:00:00,2.56928022646,-0.471901478927,-0.835033249865 2000-01-07 00:00:00,-0.39932258251,-0.676426550985,-0.0112559158931 2000-01-08 00:00:00,1.64299299394,1.01341997845,1.43566709724 2000-01-09 00:00:00,1.14730764657,2.13799951538,0.554171306191 2000-01-10 00:00:00,0.933765825769,1.38715526486,-0.560142729978
This information can be easily imported into DataFrame
using the pd.read_csv()
function.
In [30]: # read the contents of the file into a DataFrame df = pd.read_csv('data/test1.csv') df Out [30]: date 0 1 2 0 2000-01-01 00:00:00 1.103763 -1.909979 -0.808956 1 2000-01-02 00:00:00 1.188917 0.581120 0.861597 2 2000-01-03 00:00:00 -0.964200 0.779764 1.829062 3 2000-01-04 00:00:00 0.782130 -1.720670 -1.108242 4 2000-01-05 00:00:00 -1.867017 -0.528368 -2.488309 5 2000-01-06 00:00:00 2.569280 -0.471901 -0.835033 6 2000-01-07 00:00:00 -0.399323 -0.676427 -0.011256 7 2000-01-08 00:00:00 1.642993 1.013420 1.435667 8 2000-01-09 00:00:00 1.147308 2.138000 0.554171 9 2000-01-10 00:00:00 0.933766 1.387155 -0.560143
pandas has no idea that the first column is a date and has treated the contents of the date field as a string. This can be verified using the following Python statements:
In [31]: # the contents of the date column df.date Out [31]: 0 2000-01-01 00:00:00 1 2000-01-02 00:00:00 2 2000-01-03 00:00:00 3 2000-01-04 00:00:00 4 2000-01-05 00:00:00 5 2000-01-06 00:00:00 6 2000-01-07 00:00:00 7 2000-01-08 00:00:00 8 2000-01-09 00:00:00 9 2000-01-10 00:00:00 Name: date, dtype: object In [32]: # we can get the first value in the date column df.date[0] Out [32]: '2000-01-01 00:00:00' In [33]: # it is a string type(df.date[0]) Out [33]: str
To guide pandas on how to convert data directly into a Python/pandas date object, we can use the parse_dates
parameter of the pd.read_csv()
function. The following code informs pandas to convert the content of the 'date' column into actual TimeStamp
objects.
In [34]: # read the data and tell pandas the date column should be # a date in the resulting DataFrame df = pd.read_csv('data/test1.csv', parse_dates=['date']) df Out [34]: date 0 1 2 0 2000-01-01 1.103763 -1.909979 -0.808956 1 2000-01-02 1.188917 0.581120 0.861597 2 2000-01-03 -0.964200 0.779764 1.829062 3 2000-01-04 0.782130 -1.720670 -1.108242 4 2000-01-05 -1.867017 -0.528368 -2.488309 5 2000-01-06 2.569280 -0.471901 -0.835033 6 2000-01-07 -0.399323 -0.676427 -0.011256 7 2000-01-08 1.642993 1.013420 1.435667 8 2000-01-09 1.147308 2.138000 0.554171 9 2000-01-10 0.933766 1.387155 -0.560143
On checking whether it worked, we see it is indeed a Timestamp
object now.
In [35]: # verify the type now is date # in pandas, this is actually a Timestamp type(df.date[0]) Out [35]: pandas.tslib.Timestamp
Unfortunately, this has not used the date field as the index for the DataFrame
, instead it uses the default zero-based integer index labels.
In [36]: # unfortunately the index is numeric, which makes # accessing data by date more complicated df.index Out [36]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')
This can be rectified using the index_col
parameter of the pd.read_csv()
method to specify which column in the file should be used as the index.
In [37]: # read in again, now specify the data column as being the # index of the resulting DataFrame df = pd.read_csv('data/test1.csv', parse_dates=['date'], index_col='date') df Out [37]: 0 1 2 date 2000-01-01 1.103763 -1.909979 -0.808956 2000-01-02 1.188917 0.581120 0.861597 2000-01-03 -0.964200 0.779764 1.829062 2000-01-04 0.782130 -1.720670 -1.108242 2000-01-05 -1.867017 -0.528368 -2.488309 2000-01-06 2.569280 -0.471901 -0.835033 2000-01-07 -0.399323 -0.676427 -0.011256 2000-01-08 1.642993 1.013420 1.435667 2000-01-09 1.147308 2.138000 0.554171 2000-01-10 0.933766 1.387155 -0.560143 In [38]: df.index Out [38]: <class 'pandas.tseries.index.DatetimeIndex'> [2000-01-01, ..., 2000-01-10] Length: 10, Freq: None, Timezone: None
Data from the Web can also be easily read via pandas. To demonstrate this, we will perform a simple load of actual stock data. The example here uses the pandas.io.data.DataReader
class, which is able to read data from various web sources, one of which is stock data from Yahoo! Finance.
The following reads the data of the previous three months for GOOG (based on the current date), and prints the five most recent days of stock data:
In [39]: # imports for reading data from Yahoo! from pandas.io.data import DataReader from datetime import date from dateutil.relativedelta import relativedelta # read the last three months of data for GOOG goog = DataReader("GOOG", "yahoo", date.today() + relativedelta(months=-3)) # the result is a DataFrame #and this gives us the 5 most recent prices goog.tail() Out [39]: Open High Low Close Volume Adj Close Date 2015-02-02 531.73 533.00 518.55 528.48 2826300 528.48 2015-02-03 528.00 533.40 523.26 529.24 2029200 529.24 2015-02-04 529.24 532.67 521.27 522.76 1656800 522.76 2015-02-05 523.79 528.50 522.09 527.58 1840300 527.58 2015-02-06 527.64 537.20 526.41 531.00 1744600 531.00
Downloading the example code
You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you. The code examples in the book are also publicly available on Wakari.io at https://wakari.io/sharing/bundle/LearningPandas/LearningPandas_Index.
This is actually performs quite a bit of work on your behalf. It makes the web requests retrieving the CSV data and converting it into a DataFrame
with the proper conversion of types for the various series of data.
3.145.8.8