pandas is a tool for manipulating and analyzing data. But where does this data come from? Almost any pandas application will start by loading data from an outside source. Because pandas is built on Python, you can use any means available in Python to retrieve data from outside the application. This makes the range of accessible data effectively unlimited, including files, Excel spreadsheets, websites and services, databases, and cloud services.
However, using plain Python to load data requires you to convert Python objects into pandas Series or DataFrame objects, increasing the complexity of your code. Fortunately, over the evolution of pandas, it became apparent that there is a frequently used set of data access patterns, as well as data formats so commonly used, that direct support for them was added into the core of pandas. This often makes retrieving data from outside sources, directly as a pandas Series or DataFrame object, as simple as a single line of code.
This chapter will introduce you to these capabilities that are part of the pandas framework. Specifically, in this chapter, we will cover:
We will, as usual, start with some imports and set up options for pandas that facilitate the examples:
In [1]:
# import pandas and numpy
import numpy as np
import pandas as pd

# Set some pandas options for controlling output
pd.set_option('display.notebook_repr_html', False)
pd.set_option('display.max_columns', 10)
pd.set_option('display.max_rows', 10)
Comma Separated Values (CSV) is likely to be the most common format of data that you will deal with in pandas. Many web-based services provide data in a CSV format, as do many information systems within an enterprise. It is an easy format to use and is commonly used as an export format for spreadsheet applications, such as Excel.
CSV is a file consisting of multiple lines of text-based data with values separated by commas. It can be thought of as a table of data similar to a single sheet in a spreadsheet program. Each row of the data is in its own line in the file, and each column for each row is stored in the text format, with a comma separating the data in each column. For more detail on the specifics of CSV files, feel free to visit http://en.wikipedia.org/wiki/Comma-separated_values.
As CSV is so common and easily understood, we will spend most of the time describing how to read and write pandas data in this format. Lessons learned from CSV methods will apply to the other formats as well and allow a little more expediency when covering these other formats.
We will start by reading a simple CSV file, data/msft.csv
(in the book's source data folder). This file is a snapshot of Yahoo! Finance data for the MSFT ticker. Conveniently, Yahoo! Finance happens to be one of the financial web services that offers its data in a CSV format, and this data was simply retrieved and saved to a file.
The first several lines of this file can be examined using the !head
command (on a Windows system, use the type
command):
In [2]:
# view the first five lines of data/msft.csv
!head -n 5 data/msft.csv # osx or Linux
# !type data\msft.csv # on windows, but shows the entire file
Date,Open,High,Low,Close,Volume,Adj Close
2014-07-21,83.46,83.53,81.81,81.93,2359300,81.93
2014-07-18,83.30,83.40,82.52,83.35,4020800,83.35
2014-07-17,84.35,84.63,83.33,83.63,1974000,83.63
2014-07-16,83.77,84.91,83.66,84.91,1755600,84.91
The first row of the file contains the names of all of the columns represented in the data, separated with a comma. For a specific day (the first column), each remaining row represents the open, high, low, close, volume, and adjusted close values for MSFT on that date.
The data in data/msft.csv is perfect for reading into a DataFrame. All of its data is complete, and it has column names in the first row. All that we need to do to read this data into a DataFrame is use the pandas pd.read_csv() function:
In [3]:
# read in msft.csv into a DataFrame
msft = pd.read_csv("data/msft.csv")
msft.head()

Out[3]:
         Date   Open   High    Low  Close   Volume  Adj Close
0  2014-07-21  83.46  83.53  81.81  81.93  2359300      81.93
1  2014-07-18  83.30  83.40  82.52  83.35  4020800      83.35
2  2014-07-17  84.35  84.63  83.33  83.63  1974000      83.63
3  2014-07-16  83.77  84.91  83.66  84.91  1755600      84.91
4  2014-07-15  84.30  84.38  83.20  83.58  1874700      83.58
Wow, that was easy! pandas has realized that the first line of the file contains the names of the columns and has read all of the data into the DataFrame.
In the result of the previous example, the index is numeric, starting from 0, instead of date-based. This is because pandas does not assume that any specific column should be the index. To help with this situation, you can specify which column(s) should be the index in the call to read_csv() using the index_col parameter, assigning it the zero-based position of the column to be used as the index.
The following reads the data and tells pandas to use the column at position 0 in the file as the index (the Date column):
In [4]:
# use column 0 as the index
msft = pd.read_csv("data/msft.csv", index_col=0)
msft.head()

Out[4]:
             Open   High    Low  Close   Volume  Adj Close
Date
2014-07-21  83.46  83.53  81.81  81.93  2359300      81.93
2014-07-18  83.30  83.40  82.52  83.35  4020800      83.35
2014-07-17  84.35  84.63  83.33  83.63  1974000      83.63
2014-07-16  83.77  84.91  83.66  84.91  1755600      84.91
2014-07-15  84.30  84.38  83.20  83.58  1874700      83.58
The date field is now the index. However, because of this, it is no longer column data. If you want to use the date as a column, you will need to create a new column and assign the index labels to that column.
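A minimal sketch of copying the index labels back into a column (using a small in-memory CSV instead of the data file, so it stands alone):

```python
import io
import pandas as pd

# a small stand-in for data/msft.csv
csv_text = """Date,Open,Close
2014-07-21,83.46,81.93
2014-07-18,83.30,83.35
"""

# read with the column at position 0 (Date) as the index
msft = pd.read_csv(io.StringIO(csv_text), index_col=0)

# copy the index labels back into a regular column
msft['Date'] = msft.index
```

After this assignment, the dates are available both as the index and as ordinary column data.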
An examination of the types of each column shows that pandas has attempted to infer the types of the columns from their content:
In [5]:
# examine the types of the columns in this DataFrame
msft.dtypes

Out[5]:
Open         float64
High         float64
Low          float64
Close        float64
Volume         int64
Adj Close    float64
dtype: object
To force the types of columns, use the dtype parameter of pd.read_csv(). The following forces the Volume column to also be float64:
In [6]:
# specify that the Volume column should be a float64
msft = pd.read_csv("data/msft.csv",
                   dtype={'Volume': np.float64})
msft.dtypes

Out[6]:
Date          object
Open         float64
High         float64
Low          float64
Close        float64
Volume       float64
Adj Close    float64
dtype: object
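Note that in the preceding output, the Date column is left as a generic object (string) type. pandas can parse such a column into datetime64 values while reading, using the parse_dates parameter; here is a minimal self-contained sketch with an in-memory CSV:

```python
import io
import pandas as pd

csv_text = """Date,Close
2014-07-21,81.93
2014-07-18,83.35
"""

# ask pandas to parse the Date column into datetime64 values
df = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'])
```

The Date column now has a datetime64 dtype instead of object, which enables date arithmetic and time-series operations.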
It is also possible to specify the column names at the time of reading the data using the names
parameter:
In [7]:
# specify a new set of names for the columns
# all lower case, remove space in Adj Close
# also, header=0 skips the header row
df = pd.read_csv("data/msft.csv", header=0,
                 names=['open', 'high', 'low',
                        'close', 'volume', 'adjclose'])
df.head()

Out[7]:
             open   high    low  close   volume  adjclose
2014-07-21  83.46  83.53  81.81  81.93  2359300     81.93
2014-07-18  83.30  83.40  82.52  83.35  4020800     83.35
2014-07-17  84.35  84.63  83.33  83.63  1974000     83.63
2014-07-16  83.77  84.91  83.66  84.91  1755600     84.91
2014-07-15  84.30  84.38  83.20  83.58  1874700     83.58
Note that because we specified the names of the columns, we need to skip over the row of column names in the file, which is done with header=0. Otherwise, pandas will assume the first row is part of the data, which will cause issues later in processing.
It is also possible to specify which columns to load when reading the file. This can be useful if the file has many columns, some of which are of no interest to your analysis, and you want to save the time and memory required to read and store them. Specifying which columns to read is accomplished with the usecols parameter, which can be passed a list of column names or column offsets.
The following reads only the Date
and Close
columns and uses Date
as the index:
In [8]:
# read in data only in the Date and Close columns
# and index by the Date column
df2 = pd.read_csv("data/msft.csv",
                  usecols=['Date', 'Close'],
                  index_col=['Date'])
df2.head()

Out[8]:
            Close
Date
2014-07-21  81.93
2014-07-18  83.35
2014-07-17  83.63
2014-07-16  84.91
2014-07-15  83.58
A DataFrame can be saved to a CSV file using the .to_csv() method. To demonstrate saving data to a CSV file, we will save the df2 object to a new file named data/msft_modified.csv:
In [9]:
# save df2 to a new csv file
# also specify naming the index as date
df2.to_csv("data/msft_modified.csv", index_label='date')
It was necessary to tell the method that the index label should be saved with a column name of date using index_label='date'. Otherwise, the index would not have a name in the first row of the file, which makes it difficult to read the file back properly.
To verify that this worked properly, we can explore the new file to view some of its content using the !head command (on a Windows system, use the !type command):
In [10]:
# view the start of the file just saved
!head data/msft_modified.csv # Linux or osx
# !type data\msft_modified.csv # windows
date,Close
2014-07-21,81.93
2014-07-18,83.35
2014-07-17,83.63
2014-07-16,84.91
2014-07-15,83.58
2014-07-14,84.4
2014-07-11,83.35
2014-07-10,83.42
2014-07-09,85.5
CSV is actually a specific implementation of what is referred to as field-delimited data. In field-delimited data, items in each row are separated by a specific symbol. In the case of CSV, it happens to be a comma. However, other symbols are common, such as the | (pipe) symbol. When using a |
character, the data is often called pipe-delimited data.
To facilitate reading field-delimited data, pandas provides the pd.read_table()
function. The following example uses this function to read the data/msft.csv
file by specifying a comma as the value to the sep
parameter:
In [11]:
# use read_table with sep=',' to read a CSV
df = pd.read_table("data/msft.csv", sep=',')
df.head()

Out[11]:
         Date   Open   High    Low  Close   Volume  Adj Close
0  2014-07-21  83.46  83.53  81.81  81.93  2359300      81.93
1  2014-07-18  83.30  83.40  82.52  83.35  4020800      83.35
2  2014-07-17  84.35  84.63  83.33  83.63  1974000      83.63
3  2014-07-16  83.77  84.91  83.66  84.91  1755600      84.91
4  2014-07-15  84.30  84.38  83.20  83.58  1874700      83.58
pandas does not provide a .to_table() method as an analogous write method to .to_csv(). However, the .to_csv() method can be used to write field-delimited data using a delimiter other than a comma. As an example, the following writes a pipe-delimited version of the data in the DataFrame:
In [12]:
# save as pipe delimited
df.to_csv("data/msft_piped.txt", sep='|')

# check that it worked
!head -n 5 data/msft_piped.txt # osx or linux
# !type data\msft_piped.txt # on windows
|Date|Open|High|Low|Close|Volume|Adj Close
0|2014-07-21|83.46|83.53|81.81|81.93|2359300|81.93
1|2014-07-18|83.3|83.4|82.52|83.35|4020800|83.35
2|2014-07-17|84.35|84.63|83.33|83.63|1974000|83.63
3|2014-07-16|83.77|84.91|83.66|84.91|1755600|84.91
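Reading pipe-delimited data back works the same way, by passing sep='|'. A self-contained sketch using an in-memory copy of the piped output above (note that the unnamed first field in the header is the saved integer index):

```python
import io
import pandas as pd

piped = """|Date|Close
0|2014-07-21|81.93
1|2014-07-18|83.35
"""

# read pipe-delimited data; column 0 is the unnamed saved index
df = pd.read_csv(io.StringIO(piped), sep='|', index_col=0)
```

The same technique applies to any single-character delimiter, such as tabs or semicolons.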
Sometimes, data in a field-delimited file may contain extraneous headers and footers. Examples include company information at the top, such as an invoice number and addresses, as well as summary footers. Sometimes, data is even stored on every other line. These situations will cause errors when loading the data. To handle these scenarios, the pandas pd.read_csv() and pd.read_table() functions have some useful parameters to help out.
To demonstrate, take the following variation on the MSFT stock data, which has extra rows of what could be referred to as noise information:
In [13]:
# messy file
!head data/msft2.csv # osx or Linux
# !type data\msft2.csv # windows
This is fun because the data does not start on the first line
Date,Open,High,Low,Close,Volume,Adj Close
And there is space between the header row and data
2014-07-21,83.46,83.53,81.81,81.93,2359300,81.93
2014-07-18,83.30,83.40,82.52,83.35,4020800,83.35
2014-07-17,84.35,84.63,83.33,83.63,1974000,83.63
2014-07-16,83.77,84.91,83.66,84.91,1755600,84.91
2014-07-15,84.30,84.38,83.20,83.58,1874700,83.58
2014-07-14,83.66,84.64,83.11,84.40,1432100,84.40
This situation can be handled using the skiprows
parameter, informing pandas to skip rows 0
, 2
, and 3
:
In [14]:
# read, but skip rows 0, 2 and 3
df = pd.read_csv("data/msft2.csv", skiprows=[0, 2, 3])
df

Out[14]:
         Date   Open   High    Low  Close   Volume  Adj Close
0         NaN    NaN    NaN    NaN    NaN      NaN        NaN
1  2014-07-21  83.46  83.53  81.81  81.93  2359300      81.93
2  2014-07-18  83.30  83.40  82.52  83.35  4020800      83.35
3  2014-07-17  84.35  84.63  83.33  83.63  1974000      83.63
4  2014-07-16  83.77  84.91  83.66  84.91  1755600      84.91
5  2014-07-15  84.30  84.38  83.20  83.58  1874700      83.58
6  2014-07-14  83.66  84.64  83.11  84.40  1432100      84.40
7  2014-07-11  83.55  83.98  82.85  83.35  2001400      83.35
8  2014-07-10  85.20  85.57  83.36  83.42  2713300      83.42
9  2014-07-09  84.83  85.79  84.76  85.50  1540700      85.50
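The all-NaN row at position 0 can be removed after the read with .dropna(how='all'). A hedged sketch using in-memory data (note that newer pandas versions skip blank lines by default via the skip_blank_lines parameter, in which case the NaN row never appears and the dropna call is a harmless no-op):

```python
import io
import pandas as pd

messy = """noise line before the header
Date,Close

2014-07-21,81.93
2014-07-18,83.35
"""

# skip the noise line; depending on the pandas version, the
# blank line may survive the read as an all-NaN row
df = pd.read_csv(io.StringIO(messy), skiprows=[0])

# remove any rows that are entirely NaN
df = df.dropna(how='all')
```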
Another common situation is where a file has content at the end, which should be ignored to prevent an error, such as the following:
In [15]:
# another messy file, with the mess at the end
!cat data/msft_with_footer.csv # osx or Linux
# !type data\msft_with_footer.csv # windows
Date,Open,High,Low,Close,Volume,Adj Close
2014-07-21,83.46,83.53,81.81,81.93,2359300,81.93
2014-07-18,83.30,83.40,82.52,83.35,4020800,83.35
Uh oh, there is stuff at the end.
This will cause an exception during reading, but it can be handled using the skip_footer parameter (renamed skipfooter in later versions of pandas), which specifies how many lines at the end of the file to ignore:
In [16]:
# skip only two lines at the end
df = pd.read_csv("data/msft_with_footer.csv",
                 skip_footer=2,
                 engine='python')
df

Out[16]:
         Date   Open   High    Low  Close   Volume  Adj Close
0  2014-07-21  83.46  83.53  81.81  81.93  2359300      81.93
1  2014-07-18  83.30  83.40  82.52  83.35  4020800      83.35
Suppose the file is large and you only want the data at the start of the file, without reading it all into memory. This can be handled with the nrows parameter:
In [17]:
# only process the first three rows
pd.read_csv("data/msft.csv", nrows=3)

Out[17]:
         Date   Open   High    Low  Close   Volume  Adj Close
0  2014-07-21  83.46  83.53  81.81  81.93  2359300      81.93
1  2014-07-18  83.30  83.40  82.52  83.35  4020800      83.35
2  2014-07-17  84.35  84.63  83.33  83.63  1974000      83.63
You can also skip a specific number of rows at the start of a file and either read to the end or read just a few lines from that point. To do this, combine the skiprows and nrows parameters. The following example skips 100 rows and then reads in the next 5:
In [18]:
# skip 100 lines, then only process the next five
pd.read_csv("data/msft.csv",
            skiprows=100,
            nrows=5,
            header=0,
            names=['open', 'high', 'low', 'close',
                   'vol', 'adjclose'])

Out[18]:
             open   high    low  close      vol  adjclose
2014-03-03  80.35  81.31  79.91  79.97  5004100     77.40
2014-02-28  82.40  83.42  82.17  83.42  2853200     80.74
2014-02-27  84.06  84.63  81.63  82.00  3676800     79.36
2014-02-26  82.92  84.03  82.43  83.81  2623600     81.12
2014-02-25  83.80  83.80  81.72  83.08  3579100     80.41
pandas supports reading data in Excel 2003 and newer formats using the pd.read_excel() function or via the ExcelFile class. Internally, both techniques use either the XLRD or OpenPyXL packages, so you will need to ensure that one of them is installed in your Python environment.
For demonstration, a data/stocks.xlsx
file is provided with the sample data. If you open it in Excel, you will see something similar to what is shown in the following:
The workbook contains two sheets, msft and aapl, which hold the stock data for each respective stock.
The following reads the data/stocks.xlsx
file into DataFrame
:
In [19]:
# read excel file
# only reads first sheet (msft in this case)
df = pd.read_excel("data/stocks.xlsx")
df.head()

Out[19]:
         Date   Open   High    Low  Close   Volume  Adj Close
0  2014-07-21  83.46  83.53  81.81  81.93  2359300      81.93
1  2014-07-18  83.30  83.40  82.52  83.35  4020800      83.35
2  2014-07-17  84.35  84.63  83.33  83.63  1974000      83.63
3  2014-07-16  83.77  84.91  83.66  84.91  1755600      84.91
4  2014-07-15  84.30  84.38  83.20  83.58  1874700      83.58
This has read only the content of the first worksheet in the Excel file (the msft worksheet) and used the contents of the first row as column names. To read the other worksheet, you can pass the name of the worksheet using the sheetname parameter (renamed sheet_name in later versions of pandas):
In [20]:
# read from the aapl worksheet
aapl = pd.read_excel("data/stocks.xlsx", sheetname='aapl')
aapl.head()

Out[20]:
         Date   Open   High    Low  Close    Volume  Adj Close
0  2014-07-21  94.99  95.00  93.72  93.94  38887700      93.94
1  2014-07-18  93.62  94.74  93.02  94.43  49898600      94.43
2  2014-07-17  95.03  95.28  92.57  93.09  57152000      93.09
3  2014-07-16  96.97  97.10  94.74  94.78  53396300      94.78
4  2014-07-15  96.80  96.85  95.03  95.32  45477900      95.32
Like with pd.read_csv()
, many assumptions are made about column names, data types, and indexes. All of the options explained for pd.read_csv()
to specify this information also apply to the pd.read_excel()
function.
Excel files can be written using the .to_excel() method of DataFrame. Writing to the XLS format requires the XLWT package, so make sure it is installed in your Python environment.
The following writes the data we just acquired to stocks2.xls
. The default is to store DataFrame
in the Sheet1
worksheet:
In [21]:
# save to an .XLS file, in worksheet 'Sheet1'
df.to_excel("data/stocks2.xls")
Opening this in Excel shows you the following:
You can specify the name of the worksheet using the sheet_name
parameter:
In [22]:
# write making the worksheet name MSFT
df.to_excel("data/stocks_msft.xls", sheet_name='MSFT')
In Excel, we can see that the sheet has been named MSFT.
To write more than one DataFrame to a single Excel file, with each DataFrame object on a separate worksheet, use the ExcelWriter object along with the with keyword. ExcelWriter is part of pandas, but you will need to make sure it is imported, as it is not in the top-level namespace of pandas. The following writes two DataFrame objects to two different worksheets in one Excel file:
In [23]:
# write multiple sheets
# requires use of the ExcelWriter class
from pandas import ExcelWriter

with ExcelWriter("data/all_stocks.xls") as writer:
    aapl.to_excel(writer, sheet_name='AAPL')
    df.to_excel(writer, sheet_name='MSFT')
We can see that there are two worksheets in the Excel file:
Writing to XLSX files uses the same method, with the format selected by the .xlsx file extension:
In [24]:
# write to xlsx
df.to_excel("data/msft2.xlsx")
When writing an XLSX file, pandas will use the openpyxl or xlsxwriter packages, so make sure one is installed.