Example data

Where possible, the examples in this chapter will utilize several datasets provided with the code in the download for the text. These datasets make the examples a little less academic in nature. These datasets will be read from files using the pd.read_csv() function that will load the sample data from the file into a DataFrame object.

Note

pd.read_csv() will be more extensively examined in Chapter 6, Accessing Data. For now, its function is simply to load some example DataFrame objects.

The remainder of the samples will still utilize the DataFrame objects created on demand if they demonstrate the concept being examined in a better way. Let's see the brief descriptions of these datasets.

S&P 500

The first dataset we will use is a snapshot of the S&P 500 from Yahoo! Finance. The following code shows the first three lines of the file:

In [13]:
   # show the first three lines of the file
   !head -n 3 data/sp500.csv # on Mac or Linux
   # !type datasp500.csv # on Windows, but will show the entire file

Out[13]:
   Symbol,Name,Sector,Price,Dividend Yield,Price/Earnings,Earnings/Share,Book Value,52 week low,52 week high,Market Cap,EBITDA,Price/Sales,Price/Book,SEC Filings
   MMM,3M Co.,Industrials,141.14,2.12,20.33,6.90,26.668,107.15,143.37,92.345,8.121,2.95,5.26,http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=MMM
   ABT,Abbott Laboratories,Health Care,39.60,1.82,25.93,1.529,15.573,32.70,40.49,59.477,4.359,2.74,2.55,http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=ABT

The first line is the name of the field, and the remaining 500 lines represents the values for the 500 different stocks.

For now, we will load this data into a DataFrame that we can used to demonstrate various operations. This code only uses four specific columns of data in the file by specifying those columns via the usecols parameter to pd.read_csv(). This makes the output for examples a bit less unwieldy.

The following reads in the data.

In [14]:
   # read in the data and print the first five rows
   # use the Symbol column as the index, and 
   # only read in columns in positions 0, 2, 3, 7
   sp500 = pd.read_csv("data/sp500.csv", 
                       index_col='Symbol', 
                       usecols=[0, 2, 3, 7])

We can examine the first five rows of the DataFrame using the .head() method. The examples will use this method frequently to limit the number of rows in the output of the examples.

In [15]:
   # peek at the first 5 rows of the data using .head()
   sp500.head()

Out[15]:
                           Sector   Price  Book Value
   Symbol                                            
   MMM                Industrials  141.14      26.668
   ABT                Health Care   39.60      15.573
   ABBV               Health Care   53.95       2.954
   ACN     Information Technology   79.79       8.326
   ACE                 Financials  102.91      86.897

Alternatively, we can examine the last five rows with the .tail() method:

In [16]:
   # peek at the first 5 rows of the data using .head()
   sp500.tail()

Out[16]:
                           Sector   Price  Book Value
   Symbol                                            
   YHOO    Information Technology   35.02      12.768
   YUM     Consumer Discretionary   74.77       5.147
   ZMH                Health Care  101.84      37.181
   ZION                Financials   28.43      30.191
   ZTS                Health Care   30.53       2.150

We can see that there are indeed 500 rows of data.

In [17]:
   # how many rows of data?
   len(sp500)

Out[17]:
   500

The index of the DataFrame consists of the symbols for the 500 stocks representing the S&P 500.

In [18]:
   # examine the index
   sp500.index

Out[18]:
   Index([u'MMM', u'ABT', u'ABBV', u'ACN', u'ACE', u'ACT', u'ADBE', u'AES', u'AET', u'AFL', u'A', u'GAS', u'APD', u'ARG', u'AKAM', u'AA', u'ALXN', u'ATI', u'ALLE', u'AGN', u'ADS', u'ALL', u'ALTR', u'MO', u'AMZN', u'AEE', u'AEP', u'AXP', u'AIG', u'AMT', u'AMP', u'ABC', u'AME', u'AMGN', u'APH', u'APC', u'ADI', u'AON', u'APA', u'AIV', u'AAPL', u'AMAT', u'ADM', u'AIZ', u'T', u'ADSK', u'ADP', u'AN', u'AZO', u'AVB', u'AVY', u'AVP', u'BHI', u'BLL', u'BAC', u'BCR', u'BAX', u'BBT', u'BEAM', u'BDX', u'BBBY', u'BMS', u'BRK-B', u'BBY', u'BIIB', u'BLK', u'HRB', u'BA', u'BWA', u'BXP', u'BSX', u'BMY', u'BRCM', u'BF-B', u'CA', u'CVC', u'COG', u'CAM', u'CPB', u'COF', u'CAH', u'CFN', u'KMX', u'CCL', u'CAT', u'CBG', u'CBS', u'CELG', u'CNP', u'CTL', u'CERN', u'CF', u'CHRW', u'CHK', u'CVX', u'CMG', u'CB', u'CI', u'CINF', u'CTAS', ...], dtype='object')

Also, there are three columns in the DataFrame:

In [19]:
   # get the columns
   sp500.columns

Out[19]:
   Index([u'Sector', u'Price', u'Book Value'], dtype='object')

Monthly stock historical prices

The second dataset we will use, is the adjusted closing price for MSFT and AAPL for December 2014.

In [20]:
   # first three lines of the file
   !head -n 3 data/omh.csv # Mac or Linux
   # !type dataomh.csv # on Windows, but prints the entire file

Out[20]:
   Date,MSFT,AAPL
   2014-12-01,48.62,115.07
   2014-12-02,48.46,114.63

The data is read into a DataFrame as follows, and the first three lines examined using a slice:

In [21]:
   # read in the data
   one_mon_hist = pd.read_csv("data/omh.csv")
   # examine the first three rows
   one_mon_hist[:3]

Out[21]:
            Date   MSFT    AAPL
   0  2014-12-01  48.62  115.07
   1  2014-12-02  48.46  114.63
   2  2014-12-03  48.08  115.93

Note

This type of data is referred to as a time series. We will examine time series data in depth in Chapter 10, Time-series Data, but for the purposes of this chapter, we will use this data simply for demonstration of manipulating a DataFrame.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.15.151.32