The pandas was created by Wed McKinney in 2008 as a result of frustrations he encountered while working on time series data in R. It is built on top of NumPy and provides features not available in it. It provides fast, easy-to-understand data structures and helps fill the gap between Python and a language such as R.
A key reference for the various operations I demonstrate here is the official pandas data structure documentation: http://pandas.pydata.org/pandas-docs/dev/dsintro.html.
There are three main data structures in pandas:
Series is really a 1D NumPy array under the hood. It consists of a NumPy array coupled with an array of labels.
The general construct for creating a Series data structure is as follows:
import pandas as pd ser=pd.Series(data, index=idx)
where data can be one of the following:
In this case, the index must be the same length as the data. If an index is not specified, the following default index [0,... n-1]
will be created, where n
is the length of the data. The following example creates a Series structure of seven random numbers between 0 and 1; the index is not specified:
In [466]: import numpy as np np.random.seed(100) ser=pd.Series(np.random.rand(7)); ser Out[466]:0 0.543405 1 0.278369 2 0.424518 3 0.844776 4 0.004719 5 0.121569 6 0.670749 dtype: float64
The following example creates a Series structure of the first 5 months of the year with a specified index of month names:
In [481]: import calendar as cal monthNames=[cal.month_name[i] for i in np.arange(1,6)] months=pd.Series(np.arrange(1,6),index=monthNames);months Out[481]: January 1 February 2 March 3 April 4 May 5 dtype: int64 In [482]: months.index Out[482]: Index([u'January', u'February', u'March', u'April', u'May'], dtype=object)
If the data is a dictionary and an index is provided, the labels will be constructed from it; else, the keys of the dictionary will be used for the labels. The values of the dictionary are used to populate the Series structure.
In [486]: currDict={'US' : 'dollar', 'UK' : 'pound', 'Germany': 'euro', 'Mexico':'peso', 'Nigeria':'naira', 'China':'yuan', 'Japan':'yen'} currSeries=pd.Series(currDict); currSeries Out[486]: China yuan Germany euro Japan yen Mexico peso Nigeria naira UK pound US dollar dtype: object
The index of a pandas Series structure is of type pandas.core.index.Index
and can be viewed as an ordered multiset.
In the following case, we specify an index, but the index contains one entry that isn't a key in the corresponding dict
. The result is that the value for the key is assigned as NaN
, indicating that it is missing. We will deal with handling missing values in a later section.
In [488]: stockPrices = {'GOOG':1180.97,'FB':62.57, 'TWTR': 64.50, 'AMZN':358.69, 'AAPL':500.6} stockPriceSeries=pd.Series(stockPrices, index=['GOOG','FB','YHOO', 'TWTR','AMZN','AAPL'], name='stockPrices') stockPriceSeries Out[488]: GOOG 1180.97 FB 62.57 YHOO NaN TWTR 64.50 AMZN 358.69 AAPL 500.60 Name: stockPrices, dtype: float64
Note that Series also has a name attribute that can be set as shown in the preceding snippet. The name attribute is useful in tasks such as combining Series objects into a DataFrame structure.
For scalar data, an index must be provided. The value will be repeated for as many index values as possible. One possible use of this method is to provide a quick and dirty method of initialization, with the Series structure to be filled in later. Let us see how to create a Series using scalar values:
In [491]: dogSeries=pd.Series('chihuahua', index=['breed','countryOfOrigin', 'name', 'gender']) dogSeries Out[491]: breed chihuahua countryOfOrigin chihuahua name chihuahua gender chihuahua dtype: object
Failure to provide an index just results in a scalar value being returned as follows:
In [494]: dogSeries=pd.Series('pekingese'), dogSeries Out[494]: 'pekingese' In [495]: type(dogSeries) Out[495]: str
The behavior of Series is very similar to that of numpy
arrays discussed in a previous section, with one caveat being that an operation such as slicing also slices the index.
Values can be set and accessed using the index label in a dictionary-like manner:
In [503]: currDict['China'] Out[503]: 'yuan' In [505]: stockPriceSeries['GOOG']=1200.0 stockPriceSeries Out[505]: GOOG 1200.00 FB 62.57 YHOO NaN TWTR 64.50 AMZN 358.69 AAPL 500.60 dtype: float64
Just as in the case of dict
, KeyError
is raised if you try to retrieve a missing label:
In [506]: stockPriceSeries['MSFT'] KeyError: 'MSFT'
This error can be avoided by explicitly using get
as follows:
In [507]: stockPriceSeries.get('MSFT',np.NaN) Out[507]: nan
In this case, the default value of np.NaN
is specified as the value to return when the key does not exist in the Series structure.
The slice operation behaves the same way as a NumPy array:
In [498]: stockPriceSeries[:4] Out[498]: GOOG 1180.97 FB 62.57 YHOO NaN TWTR 64.50 dtype: float64
Logical slicing also works as follows:
In [500]: stockPriceSeries[stockPriceSeries > 100] Out[500]: GOOG 1180.97 AMZN 358.69 AAPL 500.60 dtype: float64
Arithmetic and statistical operations can be applied, just as with a NumPy array:
In [501]: np.mean(stockPriceSeries) Out[501]: 433.46600000000001 In [502]: np.std(stockPriceSeries) Out[502]: 410.50223047384287
Element-wise operations can also be performed on series:
In [506]: ser Out[506]: 0 0.543405 1 0.278369 2 0.424518 3 0.844776 4 0.004719 5 0.121569 6 0.670749 dtype: float64 In [508]: ser*ser Out[508]: 0 0.295289 1 0.077490 2 0.180215 3 0.713647 4 0.000022 5 0.014779 6 0.449904 dtype: float64 In [510]: np.sqrt(ser) Out[510]: 0 0.737160 1 0.527607 2 0.651550 3 0.919117 4 0.068694 5 0.348668 6 0.818993 dtype: float64
An important feature of Series is that the data is automatically aligned on the basis of the label:
In [514]: ser[1:] Out[514]: 1 0.278369 2 0.424518 3 0.844776 4 0.004719 5 0.121569 6 0.670749 dtype: float64 In [516]:ser[1:] + ser[:-2] Out[516]: 0 NaN 1 0.556739 2 0.849035 3 1.689552 4 0.009438 5 NaN 6 NaN dtype: float64
Thus, we can see that for non-matching labels, NaN
is inserted. The default behavior is that the union of the indexes is produced for unaligned Series structures. This is preferable as information is preserved rather than lost. We will handle missing values in pandas in a later chapter of the book.
DataFrame is an 2-dimensional labeled array. Its column types can be heterogeneous: that is, of varying types. It is similar to structured arrays in NumPy with mutability added. It has the following properties:
ndarray
but not a subclass of np.ndarray
.Every axis in a Series/DataFrame has an index, whether default or not. Indexes are needed for fast lookups as well as proper aligning and joining of data in pandas. The axes can also be named-for example in the form of month for the array of columns Jan Feb Mar... Dec. Here is a representation of an indexed DataFrame, with named columns across and an index column of characters V, W, X, Y, Z:
columns nums strs bools decs index V 11 cat True 1.4 W -6 hat False 6.9 X 25 bat False -0.6 Y 8 mat True 3.7 Z -17 sat False 18.
DataFrame is the most commonly used data structure in pandas. The constructor accepts many different types of arguments:
Row label indexes and column labels can be specified along with the data. If they're not specified, they will be generated from the input data in an intuitive fashion, for example, from the keys of dict.
(in case of column labels) or by using np.range(n)
in the case of row labels, where n
corresponds to the number of rows.
Here, we create a DataFrame structure by using a dictionary of Series objects.
In [97]:stockSummaries={ 'AMZN': pd.Series([346.15,0.59,459,0.52,589.8,158.88], index=['Closing price','EPS', 'Shares Outstanding(M)', 'Beta', 'P/E','Market Cap(B)']), 'GOOG': pd.Series([1133.43,36.05,335.83,0.87,31.44,380.64], index=['Closing price','EPS','Shares Outstanding(M)', 'Beta','P/E','Market Cap(B)']), 'FB': pd.Series([61.48,0.59,2450,104.93,150.92], index=['Closing price','EPS','Shares Outstanding(M)', 'P/E', 'Market Cap(B)']), 'YHOO': pd.Series([34.90,1.27,1010,27.48,0.66,35.36], index=['Closing price','EPS','Shares Outstanding(M)', 'P/E','Beta', 'Market Cap(B)']), 'TWTR':pd.Series([65.25,-0.3,555.2,36.23], index=['Closing price','EPS','Shares Outstanding(M)', 'Market Cap(B)']), 'AAPL':pd.Series([501.53,40.32,892.45,12.44,447.59,0.84], index=['Closing price','EPS','Shares Outstanding(M)','P/E', 'Market Cap(B)','Beta'])} In [99]: stockDF=pd.DataFrame(stockSummaries); stockDF Out[99]:
AAPL |
AMZN |
FB |
GOOG |
TWTR |
YHOO | |
---|---|---|---|---|---|---|
Beta |
0.84 |
0.52 |
NaN |
0.87 |
NaN |
0.66 |
Closing price |
501.53 |
346.15 |
61.48 |
1133.43 |
65.25 |
34.9 |
EPS |
40.32 |
0.59 |
0.59 |
36.05 |
-0.3 |
1.27 |
Market Cap(B) |
447.59 |
158.88 |
150.92 |
380.64 |
36.23 |
35.36 |
P/E |
12.44 |
589.8 |
104.93 |
31.44 |
NaN |
27.48 |
Shares Outstanding(M) |
892.45 |
459 |
2450 |
335.83 |
555.2 |
1010 |
In [100]:stockDF=pd.DataFrame(stockSummaries, index=['Closing price','EPS', 'Shares Outstanding(M)', 'P/E', 'Market Cap(B)','Beta']);stockDF Out [100]:
AAPL |
AMZN |
FB |
GOOG |
TWTR |
YHOO | |
---|---|---|---|---|---|---|
Closing price |
501.53 |
346.15 |
61.48 |
1133.43 |
65.25 |
34.9 |
EPS |
40.32 |
0.59 |
0.59 |
36.05 |
-0.3 |
1.27 |
Shares Outstanding(M) |
892.45 |
459 |
2450 |
NaN |
555.2 |
1010 |
P/E |
12.44 |
589.8 |
104.93 |
31.44 |
NaN |
27.48 |
Market Cap(B) |
447.59 |
158.88 |
150.92 |
380.64 |
36.23 |
35.36 |
Beta |
0.84 |
0.52 |
NaN |
0.87 |
NaN |
0.66 |
In [102]:stockDF=pd.DataFrame(stockSummaries, index=['Closing price','EPS','Shares Outstanding(M)', 'P/E', 'Market Cap(B)','Beta'], columns=['FB','TWTR','SCNW']) stockDF Out [102]:
|
FB |
TWTR |
SCNW |
---|---|---|---|
Closing price |
61.48 |
65.25 |
NaN |
EPS |
0.59 |
-0.3 |
NaN |
Shares Outstanding(M) |
2450 |
555.2 |
NaN |
P/E |
104.93 |
NaN |
NaN |
Market Cap(B) |
150.92 |
36.23 |
NaN |
Beta |
NaN |
NaN |
NaN |
The row index labels and column labels can be accessed via the index and column attributes:
In [527]: stockDF.index Out[527]: Index([u'Closing price', u'EPS', u'Shares Outstanding(M)', u'P/E', u'Market Cap(B)', u'Beta'], dtype=object) In [528]: stockDF.columns Out[528]: Index([u'AAPL', u'AMZN', u'FB', u'GOOG', u'TWTR', u'YHOO'], dtype=object)
The source for the preceding data is Google Finance, accessed on 2/3/2014: http://finance.google.com.
Here, we create a DataFrame structure from a dictionary of lists. The keys become the column labels in the DataFrame structure and the data in the list becomes the column values. Note how the row label indexes are generated using np.range(n)
.
In [529]:algos={'search':['DFS','BFS','Binary Search', 'Linear','ShortestPath (Djikstra)'], 'sorting': ['Quicksort','Mergesort', 'Heapsort', 'Bubble Sort', 'Insertion Sort'], 'machine learning':['RandomForest', 'K Nearest Neighbor', 'Logistic Regression', 'K-Means Clustering', 'Linear Regression']} algoDF=pd.DataFrame(algos); algoDF Out[529]: machine learning search sorting 0 RandomForest DFS Quicksort 1 K Nearest Neighbor BFS Mergesort 2 Logistic Regression Binary Search Heapsort 3 K-Means Clustering Linear Bubble Sort 4 Linear Regression ShortestPath (Djikstra) Insertion Sort In [530]: pd.DataFrame(algos,index=['algo_1','algo_2','algo_3','algo_4', 'algo_5']) Out[530]: machine learning search sorting algo_1 RandomForest DFS Quicksort algo_2 K Nearest Neighbor BFS Mergesort algo_3 Logistic Regression Binary Search Heapsort algo_4 K-Means Clustering Linear Bubble Sort algo_5 Linear Regression ShortestPath (Djikstra) Insertion Sort
In this case, we use a structured array, which is an array of records or structs
. For more information on structured arrays, refer to the following: http://docs.scipy.org/doc/numpy/user/basics.rec.html.
In [533]: memberData = np.zeros((4,), dtype=[('Name','a15'), ('Age','i4'), ('Weight','f4')]) memberData[:] = [('Sanjeev',37,162.4), ('Yingluck',45,137.8), ('Emeka',28,153.2), ('Amy',67,101.3)] memberDF=pd.DataFrame(memberData);memberDF Out[533]: Name Age Weight 0 Sanjeev 37 162.4 1 Yingluck 45 137.8 2 Emeka 28 153.2 3 Amy 67 101.3 In [534]: pd.DataFrame(memberData, index=['a','b','c','d']) Out[534]: Name Age Weight a Sanjeev 37 162.4 b Yingluck 45 137.8 c Emeka 28 153.2 d Amy 67 101.3
Here, we show how to construct a DataFrame structure from a Series structure:
In [ 540]: currSeries.name='currency' pd.DataFrame(currSeries) Out[540]: currency China yuan Germany euro Japan yen Mexico peso Nigeria naira UK pound US dollar
There are also alternative constructors for DataFrame; they can be summarized as follows:
DataFrame.from_dict
: It takes a dictionary of dictionaries or sequences and returns DataFrame.DataFrame.from_records
: It takes a list of tuples or structured ndarray.DataFrame.from_items
: It takes a sequence of (key, value) pairs. The keys are the column or index names, and the values are the column or row values. If you wish the keys to be row index names, you must specify orient='index' as a parameter and specify the column names.pandas.io.parsers.read_csv
: This is a helper function that reads a CSV file into a pandas DataFrame structure.pandas.io.parsers.read_table
: This is a helper function that reads a delimited file into a pandas DataFrame structure.pandas.io.parsers.read_fwf
: This is a helper function that reads a table of fixed-width lines into a pandas DataFrame structure.Here, I will briefly describe the various DataFrame operations.
A specific column can be obtained as a Series structure:
In [543]: memberDF['Name'] Out[543]: 0 Sanjeev 1 Yingluck 2 Emeka 3 Amy Name: Name, dtype: object
A new column can be added via assignment, as follows:
In [545]: memberDF['Height']=60;memberDF Out[545]: Name Age Weight Height 0 Sanjeev 37 162.4 60 1 Yingluck 45 137.8 60 2 Emeka 28 153.2 60 3 Amy 67 101.3 60
A column can be deleted, as you would in the case of dict
:
In [546]: del memberDF['Height']; memberDF Out[546]: Name Age Weight 0 Sanjeev 37 162.4 1 Yingluck 45 137.8 2 Emeka 28 153.2 3 Amy 67 101.3
It can also be popped, as with a dictionary:
In [547]: memberDF['BloodType']='O' bloodType=memberDF.pop('BloodType'), bloodType Out[547]: 0 O 1 O 2 O 3 O Name: BloodType, dtype: object
Basically, a DataFrame structure can be treated as if it were a dictionary of Series objects. Columns get inserted at the end; to insert a column at a specific location, you can use the insert function:
In [552]: memberDF.insert(2,'isSenior',memberDF['Age']>60); memberDF Out[552]: Name Age isSenior Weight 0 Sanjeev 37 False 162.4 1 Yingluck 45 False 137.8 2 Emeka 28 False 153.2 3 Amy 67 True 101.3
DataFrame objects align in a manner similar to Series objects, except that they align on both column and index labels. The resulting object is the union of the column and row labels:
In [559]: ore1DF=pd.DataFrame(np.array([[20,35,25,20], [11,28,32,29]]), columns=['iron','magnesium', 'copper','silver']) ore2DF=pd.DataFrame(np.array([[14,34,26,26], [33,19,25,23]]), columns=['iron','magnesium', 'gold','silver']) ore1DF+ore2DF Out[559]: copper gold iron magnesium silver 0 NaN NaN 34 69 46 1 NaN NaN 44 47 52
In the case where there are no row labels or column labels in common, the value is filled with NaN
, for example, copper and gold. If you combine a DataFrame object and a Series object, the default behavior is to broadcast the Series object across the rows:
In [562]: ore1DF + pd.Series([25,25,25,25], index=['iron','magnesium', 'copper','silver']) Out[562]: iron magnesium copper silver 0 45 60 50 45 1 36 53 57 54
Panel is a 3D array. It is not as widely used as Series or DataFrame. It is not as easily displayed on screen or visualized as the other two because of its 3D nature. The Panel data structure is the final piece of the data structure jigsaw puzzle in pandas. It is less widely used, and is used for 3D data. The three axis names are as follows:
As for Series and DataFrame, there are different ways to create Panel objects. They are explained in the upcoming sections.
Here, we show how to construct a Panel object from a 3D NumPy array.
In 586[]: stockData=np.array([[[63.03,61.48,75], [62.05,62.75,46], [62.74,62.19,53]], [[411.90, 404.38, 2.9], [405.45, 405.91, 2.6], [403.15, 404.42, 2.4]]]) stockData Out[586]: array([[[ 63.03, 61.48, 75. ], [ 62.05, 62.75, 46. ], [ 62.74, 62.19, 53. ]], [[ 411.9 , 404.38, 2.9 ], [ 405.45, 405.91, 2.6 ], [ 403.15, 404.42, 2.4 ]]]) In [587]: stockHistoricalPrices = pd.Panel(stockData, items=['FB', 'NFLX'], major_axis=pd.date_range('2/3/2014', periods=3), minor_axis=['open price', 'closing price', 'volume']) stockHistoricalPrices Out[587]: <class 'pandas.core.panel.Panel'> Dimensions: 2 (items) x 3 (major_axis) x 3 (minor_axis) Items axis: FB to NFLX Major_axis axis: 2014-02-03 00:00:00 to 2014-02-05 00:00:00 Minor_axis axis: open price to volume
We construct a Panel structure by using a Python dictionary of DataFrame structures.
In [591]: USData=pd.DataFrame(np.array([[249.62 , 8900], [ 282.16,12680], [309.35,14940]]), columns=['Population(M)','GDP($B)'], index=[1990,2000,2010]) USData Out[591]: Population(M) GDP($B) 1990 249.62 8900 2000 282.16 12680 2010 309.35 14940 In [590]: ChinaData=pd.DataFrame(np.array([[1133.68, 390.28], [ 1266.83,1198.48], [1339.72, 6988.47]]), columns=['Population(M)','GDP($B)'], index=[1990,2000,2010]) ChinaData Out[590]: Population(M) GDP($B) 1990 1133.68 390.28 2000 1266.83 1198.48 2010 1339.72 6988.47 In [592]:US_ChinaData={'US' : USData, 'China': ChinaData} pd.Panel(US_ChinaData) Out[592]: <class 'pandas.core.panel.Panel'> Dimensions: 2 (items) x 3 (major_axis) x 2 (minor_axis) Items axis: China to US Major_axis axis: 1990 to 2010
This method converts a DataFrame structure having a MultiIndex to a Panel structure:
In [617]: mIdx = pd.MultiIndex(levels=[['US', 'China'], [1990,2000, 2010]], labels=[[1,1,1,0,0,0],[0,1,2,0,1,2]]) mIdx Out[617]: MultiIndex [(u'China', 1990), (u'China', 2000), (u'China', 2010), (u'US', 1990), (u'US', 2000), (u'US', 2010)] ChinaUSDF = pd.DataFrame({'Population(M)' : [1133.68, 1266.83, 1339.72, 249.62, 282.16,309.35], 'GDB($B)': [390.28, 1198.48, 6988.47, 8900,12680, 14940]}, index=mIdx) ChinaUSDF In [618]: ChinaUSDF = pd.DataFrame({'Population(M)' : [1133.68, 1266.83, 1339.72, 249.62, 282.16, 309.35], 'GDB($B)': [390.28, 1198.48, 6988.47, 8900, 12680,14940]}, index=mIdx) ChinaUSDF Out[618]: GDB($B) Population(M) China 1990 390.28 1133.68 2000 1198.48 1266.83 2010 6988.47 1339.72 US 1990 8900.00 249.62 2000 12680.00 282.16 2010 14940.00 309.35 In [622]: ChinaUSDF.to_panel() Out[622]: <class 'pandas.core.panel.Panel'> Dimensions: 2 (items) x 2 (major_axis) x 3 (minor_axis) Items axis: GDB($B) to Population(M) Major_axis axis: US to China Minor_axis axis: 1990 to 2010
The sources of US/China Economic data are the following sites:
3.14.144.108