Chapter 6. Accessing Data

pandas is a tool for manipulating and analyzing data. But where does this data come from? Almost every pandas application starts by loading data from an outside source. As pandas is built on Python, you can use any means available in Python to retrieve data from outside the application. This makes the range of data that can be accessed essentially unlimited, including files, Excel spreadsheets, web sites and services, databases, and cloud services.

However, loading data with plain Python requires you to convert the resulting Python objects into pandas Series or DataFrame objects, increasing the complexity of your code. Fortunately, over the evolution of pandas, it became apparent that there is a frequently used set of data access patterns, as well as a set of commonly used data formats, and direct support for them was added into the core of pandas. This often makes retrieving data from outside sources, directly as a pandas Series or DataFrame object, as simple as a single line of code.

This chapter will introduce you to these capabilities that are part of the pandas framework. Specifically, in this chapter, we will cover:

  • Reading and writing pandas data from files
  • Working with data in CSV, JSON, HTML, Excel, and HDF5 formats
  • Accessing data on the web and in the cloud
  • Reading and writing from/to SQL databases
  • Reading data from remote web data services

Setting up the IPython notebook

We will, as usual, start with some imports and set up options for pandas that facilitate the examples:

In [1]:
   # import pandas and numpy
   import numpy as np
   import pandas as pd

   # Set some pandas options for controlling output
   pd.set_option('display.notebook_repr_html', False)
   pd.set_option('display.max_columns', 10)
   pd.set_option('display.max_rows', 10)

CSV and Text/Tabular format

It is a pretty safe bet that Comma Separated Values (CSV) will be the most common format of data that you deal with in pandas. Many web-based services provide data in CSV format, as do many information systems within enterprises. It is an easy format to use, and it is commonly used as an export format for spreadsheet applications such as Excel.

A CSV file consists of multiple lines of text-based data, with values separated by commas. It can be thought of as a table of data, similar to a single sheet in a spreadsheet program. Each row of data is on its own line in the file, and each column of each row is stored in text format, with a comma separating the values in adjacent columns. For more detail on the specifics of CSV files, see http://en.wikipedia.org/wiki/Comma-separated_values.

As CSV is so common and easily understood, we will spend most of our time describing how to read and write pandas data in this format. Lessons learned from the CSV methods will apply to the other formats as well, allowing a little more expediency when covering them.

The sample CSV data set

We will start by reading a simple CSV file, data/msft.csv (in the book's source data folder). This file is a snapshot of Yahoo! Finance data for the MSFT ticker. Conveniently, Yahoo! Finance happens to be one of the financial web services that offers its data in a CSV format, and this data was simply retrieved and saved to a file.

The first several lines of this file can be examined using the !head command (on a Windows system, use the type command):

In [2]:
   # view the first five lines of data/msft.csv
   !head -n 5 data/msft.csv # osx or Linux
   # !type data\msft.csv # on windows, but shows the entire file

   Date,Open,High,Low,Close,Volume,Adj Close
   2014-07-21,83.46,83.53,81.81,81.93,2359300,81.93
   2014-07-18,83.30,83.40,82.52,83.35,4020800,83.35
   2014-07-17,84.35,84.63,83.33,83.63,1974000,83.63
   2014-07-16,83.77,84.91,83.66,84.91,1755600,84.91

The first row of the file contains the names of all of the columns represented in the data, separated by commas. Each remaining row then contains the open, high, low, close, volume, and adjusted close values for MSFT on the date given in the first column.

Reading a CSV file into a DataFrame

The data in data/msft.csv is perfect for reading into a DataFrame. All of its data is complete, and it has column names in the first row. All that we need to do to read this data into a DataFrame is call the pandas pd.read_csv() function:

In [3]:
   # read in msft.csv into a DataFrame
   msft = pd.read_csv("data/msft.csv")
   msft.head()

Out[3]:
            Date   Open   High    Low  Close   Volume  Adj Close
   0  2014-07-21  83.46  83.53  81.81  81.93  2359300      81.93
   1  2014-07-18  83.30  83.40  82.52  83.35  4020800      83.35
   2  2014-07-17  84.35  84.63  83.33  83.63  1974000      83.63
   3  2014-07-16  83.77  84.91  83.66  84.91  1755600      84.91
   4  2014-07-15  84.30  84.38  83.20  83.58  1874700      83.58

Wow, that was easy! pandas realized that the first line of the file contains the names of the columns and bulk read the data into the DataFrame.

Specifying the index column when reading a CSV file

In the result of the previous example, the index is numeric, starting from 0, instead of dates. This is because pandas does not assume that any specific column should be the index. To remedy this, you can specify which column(s) should be the index in the call to read_csv() using the index_col parameter, assigning it the zero-based position of the column to be used as the index.

The following reads the data and tells pandas to use the column at position 0 in the file as the index (the Date column):

In [4]:
   # use column 0 as the index
   msft = pd.read_csv("data/msft.csv", index_col=0)
   msft.head()

Out[4]:
                Open   High    Low  Close   Volume  Adj Close
   Date                                                      
   2014-07-21  83.46  83.53  81.81  81.93  2359300      81.93
   2014-07-18  83.30  83.40  82.52  83.35  4020800      83.35
   2014-07-17  84.35  84.63  83.33  83.63  1974000      83.63
   2014-07-16  83.77  84.91  83.66  84.91  1755600      84.91
   2014-07-15  84.30  84.38  83.20  83.58  1874700      83.58

The Date field is now the index. However, because of this, it is also no longer column data. If you want to use the dates as a column, you will need to create a new column and assign the index labels to it.
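
As a minimal sketch of doing this, the index labels can be copied back into a regular column (the column name date is our choice, not something pandas requires):

   # copy the index labels into a new column named 'date'
   msft['date'] = msft.index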

Data type inference and specification

An examination of the types of each column shows that pandas has attempted to infer the types of the columns from their content:

In [5]:
   # examine the types of the columns in this DataFrame
   msft.dtypes

Out[5]:
   Open         float64
   High         float64
   Low          float64
   Close        float64
   Volume         int64
   Adj Close    float64
   dtype: object

To force the types of columns, use the dtype parameter of pd.read_csv(). The following forces the Volume column to also be float64:

In [6]:
   # specify that the Volume column should be a float64
   msft = pd.read_csv("data/msft.csv", 
                      dtype = { 'Volume' : np.float64})
   msft.dtypes

Out[6]:
   Date          object
   Open         float64
   High         float64
   Low          float64
   Close        float64
   Volume       float64
   Adj Close    float64
   dtype: object
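
Note also that the Date column was inferred as object, that is, it is held as strings. If you want real datetime values instead, the parse_dates parameter of pd.read_csv() can be used; a minimal sketch:

   # re-read, asking pandas to parse the Date column into
   # datetime64 values instead of leaving it as strings
   msft = pd.read_csv("data/msft.csv", parse_dates=['Date'])
   msft.dtypes  # Date will now show as datetime64[ns]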

Specifying column names

It is also possible to specify the column names at the time of reading the data using the names parameter:

In [7]:
   # specify a new set of names for the columns
   # all lower case, remove space in Adj Close
   # also, header=0 skips the header row
   df = pd.read_csv("data/msft.csv", 
                    header=0,
                    names=['open', 'high', 'low', 
                           'close', 'volume', 'adjclose'])
   df.head()

Out[7]:
                open   high    low  close   volume  adjclose
   2014-07-21  83.46  83.53  81.81  81.93  2359300     81.93
   2014-07-18  83.30  83.40  82.52  83.35  4020800     83.35
   2014-07-17  84.35  84.63  83.33  83.63  1974000     83.63
   2014-07-16  83.77  84.91  83.66  84.91  1755600     84.91
   2014-07-15  84.30  84.38  83.20  83.58  1874700     83.58

Note that because we specified the names of the columns, we need to skip over the row of column names in the file, which was done with header=0. If we don't, pandas will assume the first row is part of the data, which will cause some issues later in processing.

Specifying specific columns to load

It is also possible to specify which columns to load when reading the file. This can be useful if you have many columns in the file, some of which are of no interest to your analysis, and you want to save the time and memory required to read and store them. Specifying which columns to read is accomplished with the usecols parameter, which can be passed a list of column names or column offsets.

The following reads only the Date and Close columns and uses Date as the index:

In [8]:
   # read in data only in the Date and Close columns
   # and index by the Date column
   df2 = pd.read_csv("data/msft.csv", 
                     usecols=['Date', 'Close'], 
                     index_col=['Date'])
   df2.head()

Out[8]:
               Close
   Date             
   2014-07-21  81.93
   2014-07-18  83.35
   2014-07-17  83.63
   2014-07-16  84.91
   2014-07-15  83.58

Saving DataFrame to a CSV file

CSV files can be saved from DataFrame using the .to_csv() method. To demonstrate saving data to a CSV file, we will save the df2 object with the revised column names to a new file named data/msft_modified.csv:

In [9]:
   # save df2 to a new csv file
   # also specify naming the index as date
   df2.to_csv("data/msft_modified.csv", index_label='date')

It was necessary to tell the method that the index label should be saved with a column name of date, using index_label='date'. Otherwise, the index does not have a name added to the first row of the file, which makes it difficult to read the file back properly.

To verify that this worked properly, we can explore the new file to view some of its content using the !head command (or, on a Windows system, the !type command):

In [10]:
   # view the start of the file just saved
   !head data/msft_modified.csv # Linux or osx
   # !type data\msft_modified.csv # windows

   date,Close
   2014-07-21,81.93
   2014-07-18,83.35
   2014-07-17,83.63
   2014-07-16,84.91
   2014-07-15,83.58
   2014-07-14,84.4
   2014-07-11,83.35
   2014-07-10,83.42
   2014-07-09,85.5
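
As a quick check of the round trip, the saved file can be read straight back in, nominating the date column as the index (a sketch reusing the index_col parameter from earlier; the name df3 is our choice):

   # read the saved file back, restoring date as the index
   df3 = pd.read_csv("data/msft_modified.csv", index_col='date')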

General field-delimited data

CSV is actually a specific implementation of what is referred to as field-delimited data, in which the items in each row are separated by a specific symbol. In the case of CSV, that symbol happens to be a comma. Other symbols are also common, such as the | (pipe) symbol; when | is the delimiter, the data is often referred to as pipe-delimited data.

To facilitate reading field-delimited data, pandas provides the pd.read_table() function. The following example uses this function to read the data/msft.csv file by specifying a comma as the value of the sep parameter:

In [11]:
   # use read_table with sep=',' to read a CSV
   df = pd.read_table("data/msft.csv", sep=',')
   df.head()

Out[11]:
            Date   Open   High    Low  Close   Volume  Adj Close
   0  2014-07-21  83.46  83.53  81.81  81.93  2359300      81.93
   1  2014-07-18  83.30  83.40  82.52  83.35  4020800      83.35
   2  2014-07-17  84.35  84.63  83.33  83.63  1974000      83.63
   3  2014-07-16  83.77  84.91  83.66  84.91  1755600      84.91
   4  2014-07-15  84.30  84.38  83.20  83.58  1874700      83.58

pandas does not provide a .to_table() method as a write analog of .to_csv(). However, the .to_csv() method can be used to write field-delimited data with a delimiter other than a comma. As an example, the following writes a pipe-delimited version of the data in the DataFrame:

In [12]:
   # save as pipe delimited
   df.to_csv("data/msft_piped.txt", sep='|')
   # check that it worked
   !head -n 5 data/msft_piped.txt # osx or linux
   # !type data\msft_piped.txt # on windows

   |Date|Open|High|Low|Close|Volume|Adj Close
   0|2014-07-21|83.46|83.53|81.81|81.93|2359300|81.93
   1|2014-07-18|83.3|83.4|82.52|83.35|4020800|83.35
   2|2014-07-17|84.35|84.63|83.33|83.63|1974000|83.63
   3|2014-07-16|83.77|84.91|83.66|84.91|1755600|84.91
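
Notice that the index was written as the first, unnamed field of each row. If you do not want the index in the file, it can be suppressed with the index parameter of .to_csv(); a minimal sketch (the output filename is our choice):

   # write pipe-delimited data without the integer index
   df.to_csv("data/msft_piped_noindex.txt", sep='|', index=False)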

Handling noise rows in field-delimited data

Sometimes, data in a field-delimited file contains extraneous headers and footers: company information at the top, invoice numbers, addresses, summary footers, and so on. I have even seen files where data is stored on every other line. These situations will cause errors when the data is loaded naively, so to handle them, the pandas pd.read_csv() and pd.read_table() functions provide some useful parameters.

To demonstrate, take the following variation on the MSFT stock data, which has extra rows of what could be referred to as noise information:

In [13]:
   # messy file
   !head data/msft2.csv # osx or Linux
   #!type data\msft2.csv # windows

   This is fun because the data does not start on the first line
   Date,Open,High,Low,Close,Volume,Adj Close

   And there is space between the header row and data
   2014-07-21,83.46,83.53,81.81,81.93,2359300,81.93
   2014-07-18,83.30,83.40,82.52,83.35,4020800,83.35
   2014-07-17,84.35,84.63,83.33,83.63,1974000,83.63
   2014-07-16,83.77,84.91,83.66,84.91,1755600,84.91
   2014-07-15,84.30,84.38,83.20,83.58,1874700,83.58
   2014-07-14,83.66,84.64,83.11,84.40,1432100,84.40

This situation can be handled using the skiprows parameter, informing pandas to skip rows 0, 2, and 3:

In [14]:
   # read, but skip rows 0, 2 and 3
   df = pd.read_csv("data/msft2.csv", skiprows=[0, 2, 3])
   df

Out[14]:
            Date   Open   High    Low  Close   Volume  Adj Close
   0         NaN    NaN    NaN    NaN    NaN      NaN        NaN
   1  2014-07-21  83.46  83.53  81.81  81.93  2359300      81.93
   2  2014-07-18  83.30  83.40  82.52  83.35  4020800      83.35
   3  2014-07-17  84.35  84.63  83.33  83.63  1974000      83.63
   4  2014-07-16  83.77  84.91  83.66  84.91  1755600      84.91
   5  2014-07-15  84.30  84.38  83.20  83.58  1874700      83.58
   6  2014-07-14  83.66  84.64  83.11  84.40  1432100      84.40
   7  2014-07-11  83.55  83.98  82.85  83.35  2001400      83.35
   8  2014-07-10  85.20  85.57  83.36  83.42  2713300      83.42
   9  2014-07-09  84.83  85.79  84.76  85.50  1540700      85.50
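
Note the row of NaN at index 0: an extra blank line in the file made it past skiprows and was read as an empty record. One way to clean this up, as a sketch, is to drop any row whose values are all NaN:

   # drop rows in which every value is NaN (the leftover blank line)
   df = df.dropna(how='all')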

Another common situation is a file with extra content at the end, which should be ignored to prevent errors, such as the following:

In [15]:
   # another messy file, with the mess at the end
   !cat data/msft_with_footer.csv # osx or Linux
   # !type data\msft_with_footer.csv # windows

   Date,Open,High,Low,Close,Volume,Adj Close
   2014-07-21,83.46,83.53,81.81,81.93,2359300,81.93
   2014-07-18,83.30,83.40,82.52,83.35,4020800,83.35

   Uh oh, there is stuff at the end.

This footer will be treated as (broken) data when read, but it can be excluded using the skip_footer parameter, which specifies how many lines at the end of the file to ignore:

In [16]:
   # skip only two lines at the end
   df = pd.read_csv("data/msft_with_footer.csv", 
                    skip_footer=2,
                    engine = 'python')
   df

Out[16]:
            Date   Open   High    Low  Close   Volume  Adj Close
   0  2014-07-21  83.46  83.53  81.81  81.93  2359300      81.93
   1  2014-07-18  83.30  83.40  82.52  83.35  4020800      83.35

Note

Note that I had to specify engine='python'. At least with Anaconda, using skip_footer without this option gives a warning, as the option is not implemented by the default underlying C parser; specifying it forces pandas to use the Python implementation. (In later versions of pandas, this parameter is renamed skipfooter.)

Suppose the file is large and you only want to read the first few rows, as you only need the data at the start of the file and do not want to read it all into memory. This can be handled with the nrows parameter:

In [17]:
   # only process the first three rows
   pd.read_csv("data/msft.csv", nrows=3)

Out[17]:
            Date   Open   High    Low  Close   Volume  Adj Close
   0  2014-07-21  83.46  83.53  81.81  81.93  2359300      81.93
   1  2014-07-18  83.30  83.40  82.52  83.35  4020800      83.35
   2  2014-07-17  84.35  84.63  83.33  83.63  1974000      83.63

If you want, you can skip a specific number of rows at the start of a file and then read to the end, or read just a few lines once you get to that point in the file. To do this, combine the skiprows and nrows parameters. The following example skips 100 rows and then reads the next five:

In [18]:
   # skip 100 lines, then only process the next five
   pd.read_csv("data/msft.csv", skiprows=100, nrows=5, 
               header=0,
               names=['open', 'high', 'low', 'close', 'vol', 
                      'adjclose'])

Out[18]:
                open   high    low  close      vol  adjclose
   2014-03-03  80.35  81.31  79.91  79.97  5004100     77.40
   2014-02-28  82.40  83.42  82.17  83.42  2853200     80.74
   2014-02-27  84.06  84.63  81.63  82.00  3676800     79.36
   2014-02-26  82.92  84.03  82.43  83.81  2623600     81.12
   2014-02-25  83.80  83.80  81.72  83.08  3579100     80.41

Note

Note that the preceding example also skipped reading the header line, so it was necessary to inform the process to not look for a header and use the specified names.

Reading and writing data in an Excel format

pandas supports reading data in Excel 2003 and newer formats, using either the pd.read_excel() function or the ExcelFile class. Internally, both techniques use the xlrd or openpyxl packages, so you will need to ensure that one of them is installed in your Python environment.

For demonstration, a data/stocks.xlsx file is provided with the sample data. If you open it in Excel, you will see something similar to what is shown in the following:

[Screenshot: the data/stocks.xlsx workbook opened in Excel]

The workbook contains two sheets, msft and aapl, which hold the stock data for each respective stock.

The following reads the data/stocks.xlsx file into DataFrame:

In [19]:
   # read excel file
   # only reads first sheet (msft in this case)
   df = pd.read_excel("data/stocks.xlsx")
   df.head()

Out[19]:
           Date   Open   High    Low  Close   Volume  Adj Close
   0 2014-07-21  83.46  83.53  81.81  81.93  2359300      81.93
   1 2014-07-18  83.30  83.40  82.52  83.35  4020800      83.35
   2 2014-07-17  84.35  84.63  83.33  83.63  1974000      83.63
   3 2014-07-16  83.77  84.91  83.66  84.91  1755600      84.91
   4 2014-07-15  84.30  84.38  83.20  83.58  1874700      83.58

This has read only the content of the first worksheet in the Excel file (the msft worksheet), using the contents of the first row as column names. To read the other worksheet, you can pass the name of the worksheet using the sheetname parameter:

In [20]:
   # read from the aapl worksheet
   aapl = pd.read_excel("data/stocks.xlsx", sheetname='aapl')
   aapl.head()

Out[20]:
           Date   Open   High    Low  Close    Volume  Adj Close
   0 2014-07-21  94.99  95.00  93.72  93.94  38887700      93.94
   1 2014-07-18  93.62  94.74  93.02  94.43  49898600      94.43
   2 2014-07-17  95.03  95.28  92.57  93.09  57152000      93.09
   3 2014-07-16  96.97  97.10  94.74  94.78  53396300      94.78
   4 2014-07-15  96.80  96.85  95.03  95.32  45477900      95.32

As with pd.read_csv(), many assumptions are made about column names, data types, and indexes. All of the options explained for pd.read_csv() for specifying this information also apply to the pd.read_excel() function.
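
For example, just as with CSV, a column can be nominated as the index while reading. A sketch combining parameters already covered (using the sheetname parameter expected by this version of pandas):

   # read the aapl worksheet, using the first column (Date)
   # as the index
   aapl = pd.read_excel("data/stocks.xlsx", sheetname='aapl',
                        index_col=0)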

Excel files can be written using the .to_excel() method of DataFrame. Writing to the XLS format requires the xlwt package, so make sure it is installed in your Python environment.

The following writes the data we just acquired to stocks2.xls. The default is to store DataFrame in the Sheet1 worksheet:

In [21]:
   # save to an .XLS file, in worksheet 'Sheet1'
   df.to_excel("data/stocks2.xls")

Opening this in Excel shows you the following:

[Screenshot: data/stocks2.xls opened in Excel, with the data in the Sheet1 worksheet]

You can specify the name of the worksheet using the sheet_name parameter:

In [22]:
   # write making the worksheet name MSFT
   df.to_excel("data/stocks_msft.xls", sheet_name='MSFT')

In Excel, we can see that the sheet has been named MSFT.

[Screenshot: data/stocks_msft.xls opened in Excel, showing the worksheet named MSFT]

To write more than one DataFrame to a single Excel file, with each DataFrame object on a separate worksheet, use the ExcelWriter object along with the with keyword. ExcelWriter is part of pandas; here it is imported explicitly so it can be referred to by its short name. The following writes two DataFrame objects to two different worksheets in one Excel file:

In [23]:
   # write multiple sheets
   # requires use of the ExcelWriter class
   from pandas import ExcelWriter
   with ExcelWriter("data/all_stocks.xls") as writer:
       aapl.to_excel(writer, sheet_name='AAPL')
       df.to_excel(writer, sheet_name='MSFT')

We can see that there are two worksheets in the Excel file:

[Screenshot: data/all_stocks.xls opened in Excel, showing the AAPL and MSFT worksheets]

Writing to XLSX files uses the same method, with the XLSX format selected simply by giving the file an .xlsx extension:

In [24]:
   # write to xlsx
   df.to_excel("data/msft2.xlsx")

When writing an XLSX file, pandas will use the openpyxl or xlsxwriter package, so make sure one of them is installed.
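
If you want to be explicit about which package performs the write, the engine parameter of .to_excel() can be passed; a sketch assuming the xlsxwriter package is installed:

   # explicitly select the xlsxwriter engine for the write
   df.to_excel("data/msft2.xlsx", engine='xlsxwriter')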
