Where possible, the examples in this chapter will utilize several datasets provided with the code in the download for the text. These datasets make the examples a little less academic in nature. These datasets will be read from files using the pd.read_csv()
function that will load the sample data from the file into a DataFrame
object.
pd.read_csv()
will be more extensively examined in Chapter 6, Accessing Data. For now, its function is simply to load some example DataFrame
objects.
The remainder of the samples will still utilize the DataFrame
objects created on demand if they demonstrate the concept being examined in a better way. Let's see the brief descriptions of these datasets.
The first dataset we will use is a snapshot of the S&P 500 from Yahoo! Finance. The following code shows the first three lines of the file:
In [13]: # show the first three lines of the file !head -n 3 data/sp500.csv # on Mac or Linux # !type datasp500.csv # on Windows, but will show the entire file Out[13]: Symbol,Name,Sector,Price,Dividend Yield,Price/Earnings,Earnings/Share,Book Value,52 week low,52 week high,Market Cap,EBITDA,Price/Sales,Price/Book,SEC Filings MMM,3M Co.,Industrials,141.14,2.12,20.33,6.90,26.668,107.15,143.37,92.345,8.121,2.95,5.26,http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=MMM ABT,Abbott Laboratories,Health Care,39.60,1.82,25.93,1.529,15.573,32.70,40.49,59.477,4.359,2.74,2.55,http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=ABT
The first line is the name of the field, and the remaining 500 lines represents the values for the 500 different stocks.
For now, we will load this data into a DataFrame
that we can used to demonstrate various operations. This code only uses four specific columns of data in the file by specifying those columns via the usecols
parameter to pd.read_csv().
This makes the output for examples a bit less unwieldy.
The following reads in the data.
In [14]: # read in the data and print the first five rows # use the Symbol column as the index, and # only read in columns in positions 0, 2, 3, 7 sp500 = pd.read_csv("data/sp500.csv", index_col='Symbol', usecols=[0, 2, 3, 7])
We can examine the first five rows of the DataFrame
using the .head()
method. The examples will use this method frequently to limit the number of rows in the output of the examples.
In [15]: # peek at the first 5 rows of the data using .head() sp500.head() Out[15]: Sector Price Book Value Symbol MMM Industrials 141.14 26.668 ABT Health Care 39.60 15.573 ABBV Health Care 53.95 2.954 ACN Information Technology 79.79 8.326 ACE Financials 102.91 86.897
Alternatively, we can examine the last five rows with the .tail()
method:
In [16]: # peek at the first 5 rows of the data using .head() sp500.tail() Out[16]: Sector Price Book Value Symbol YHOO Information Technology 35.02 12.768 YUM Consumer Discretionary 74.77 5.147 ZMH Health Care 101.84 37.181 ZION Financials 28.43 30.191 ZTS Health Care 30.53 2.150
We can see that there are indeed 500 rows of data.
In [17]: # how many rows of data? len(sp500) Out[17]: 500
The index of the DataFrame
consists of the symbols for the 500 stocks representing the S&P 500.
In [18]: # examine the index sp500.index Out[18]: Index([u'MMM', u'ABT', u'ABBV', u'ACN', u'ACE', u'ACT', u'ADBE', u'AES', u'AET', u'AFL', u'A', u'GAS', u'APD', u'ARG', u'AKAM', u'AA', u'ALXN', u'ATI', u'ALLE', u'AGN', u'ADS', u'ALL', u'ALTR', u'MO', u'AMZN', u'AEE', u'AEP', u'AXP', u'AIG', u'AMT', u'AMP', u'ABC', u'AME', u'AMGN', u'APH', u'APC', u'ADI', u'AON', u'APA', u'AIV', u'AAPL', u'AMAT', u'ADM', u'AIZ', u'T', u'ADSK', u'ADP', u'AN', u'AZO', u'AVB', u'AVY', u'AVP', u'BHI', u'BLL', u'BAC', u'BCR', u'BAX', u'BBT', u'BEAM', u'BDX', u'BBBY', u'BMS', u'BRK-B', u'BBY', u'BIIB', u'BLK', u'HRB', u'BA', u'BWA', u'BXP', u'BSX', u'BMY', u'BRCM', u'BF-B', u'CA', u'CVC', u'COG', u'CAM', u'CPB', u'COF', u'CAH', u'CFN', u'KMX', u'CCL', u'CAT', u'CBG', u'CBS', u'CELG', u'CNP', u'CTL', u'CERN', u'CF', u'CHRW', u'CHK', u'CVX', u'CMG', u'CB', u'CI', u'CINF', u'CTAS', ...], dtype='object')
Also, there are three columns in the DataFrame
:
In [19]: # get the columns sp500.columns Out[19]: Index([u'Sector', u'Price', u'Book Value'], dtype='object')
The second dataset we will use, is the adjusted closing price for MSFT
and AAPL
for December 2014.
In [20]: # first three lines of the file !head -n 3 data/omh.csv # Mac or Linux # !type dataomh.csv # on Windows, but prints the entire file Out[20]: Date,MSFT,AAPL 2014-12-01,48.62,115.07 2014-12-02,48.46,114.63
The data is read into a DataFrame
as follows, and the first three lines examined using a slice:
In [21]: # read in the data one_mon_hist = pd.read_csv("data/omh.csv") # examine the first three rows one_mon_hist[:3] Out[21]: Date MSFT AAPL 0 2014-12-01 48.62 115.07 1 2014-12-02 48.46 114.63 2 2014-12-03 48.08 115.93
This type of data is referred to as a time series. We will examine time series data in depth in Chapter 10, Time-series Data, but for the purposes of this chapter, we will use this data simply for demonstration of manipulating a DataFrame
.
3.15.151.32