Chapter 5. Getting Started with pandas

pandas will be the primary library of interest throughout much of the rest of the book. It contains high-level data structures and manipulation tools designed to make data analysis fast and easy in Python. pandas is built on top of NumPy, which makes it easy to use in NumPy-centric applications.

As a bit of background, I started building pandas in early 2008 during my tenure at AQR, a quantitative investment management firm. At the time, I had a distinct set of requirements that were not well-addressed by any single tool at my disposal:

  • Data structures with labeled axes supporting automatic or explicit data alignment. This prevents common errors resulting from misaligned data and working with differently-indexed data coming from different sources.

  • Integrated time series functionality.

  • The same data structures handle both time series data and non-time series data.

  • Arithmetic operations and reductions (like summing across an axis) would pass on the metadata (axis labels).

  • Flexible handling of missing data.

  • Merge and other relational operations found in popular databases (SQL-based, for example).

I wanted to be able to do all of these things in one place, preferably in a language well-suited to general purpose software development. Python was a good candidate language for this, but at that time there was not an integrated set of data structures and tools providing this functionality.

Over the last four years, pandas has matured into a quite large library capable of solving a much broader set of data handling problems than I ever anticipated, but it has expanded in its scope without compromising the simplicity and ease-of-use that I desired from the very beginning. I hope that after reading this book, you will find it to be just as much of an indispensable tool as I do.

Throughout the rest of the book, I use the following import conventions for pandas:

In [1]: from pandas import Series, DataFrame

In [2]: import pandas as pd

Thus, whenever you see pd. in code, it’s referring to pandas. Series and DataFrame are used so much that I find it easier to import them into the local namespace.

Introduction to pandas Data Structures

To get started with pandas, you will need to get comfortable with its two workhorse data structures: Series and DataFrame. While they are not a universal solution for every problem, they provide a solid, easy-to-use basis for most applications.

Series

A Series is a one-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of data labels, called its index. The simplest Series is formed from only an array of data:

In [4]: obj = Series([4, 7, -5, 3])

In [5]: obj
Out[5]: 
0    4
1    7
2   -5
3    3
dtype: int64

The string representation of a Series displayed interactively shows the index on the left and the values on the right. Since we did not specify an index for the data, a default one consisting of the integers 0 through N - 1 (where N is the length of the data) is created. You can get the array representation and index object of the Series via its values and index attributes, respectively:

In [6]: obj.values
Out[6]: array([ 4,  7, -5,  3])

In [7]: obj.index
Out[7]: Int64Index([0, 1, 2, 3], dtype='int64')

Often it will be desirable to create a Series with an index identifying each data point:

In [8]: obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

In [9]: obj2
Out[9]: 
d    4
b    7
a   -5
c    3
dtype: int64

In [10]: obj2.index
Out[10]: Index([u'd', u'b', u'a', u'c'], dtype='object')

Compared with a regular NumPy array, you can use values in the index when selecting single values or a set of values:

In [11]: obj2['a']
Out[11]: -5

In [12]: obj2['d'] = 6

In [13]: obj2[['c', 'a', 'd']]
Out[13]: 
c    3
a   -5
d    6
dtype: int64

NumPy array operations, such as filtering with a boolean array, scalar multiplication, or applying math functions, will preserve the index-value link:

In [14]: obj2[obj2 > 0]      In [15]: obj2 * 2      In [16]: np.exp(obj2)
Out[14]:                     Out[15]:               Out[16]:             
d    6                       d    12                d     403.428793     
b    7                       b    14                b    1096.633158     
c    3                       a   -10                a       0.006738     
dtype: int64                 c     6                c      20.085537     
                             dtype: int64           dtype: float64  

Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values. It can be substituted into many functions that expect a dict:

In [17]: 'b' in obj2
Out[17]: True

In [18]: 'e' in obj2
Out[18]: False
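Because a Series is dict-like, it also round-trips to a plain dict. A small sketch using the obj2 values from above (to_dict is a standard Series method):

```python
import pandas as pd

# the Series from above: labeled values behaving like an ordered dict
obj2 = pd.Series([6, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# to_dict() produces a plain dict of index -> value
mapping = obj2.to_dict()

# membership tests consult the index, like keys of a dict
has_b = 'b' in obj2
has_e = 'e' in obj2
```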

Should you have data contained in a Python dict, you can create a Series from it by passing the dict:

In [19]: sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

In [20]: obj3 = Series(sdata)

In [21]: obj3
Out[21]: 
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

When only passing a dict, the index in the resulting Series will have the dict’s keys in sorted order.

In [22]: states = ['California', 'Ohio', 'Oregon', 'Texas']

In [23]: obj4 = Series(sdata, index=states)

In [24]: obj4
Out[24]: 
California      NaN
Ohio          35000
Oregon        16000
Texas         71000
dtype: float64

In this case, three values found in sdata were placed in the appropriate locations, but since no value for 'California' was found, it appears as NaN (not a number), which pandas uses to mark missing or NA values. I will use the terms “missing” and “NA” interchangeably to refer to missing data. The isnull and notnull functions in pandas can be used to detect missing data:

In [25]: pd.isnull(obj4)      In [26]: pd.notnull(obj4)
Out[25]:                      Out[26]:                 
California     True           California    False      
Ohio          False           Ohio           True      
Oregon        False           Oregon         True      
Texas         False           Texas          True      
dtype: bool                   dtype: bool 

Series also has these as instance methods:

In [27]: obj4.isnull()
Out[27]: 
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

I discuss working with missing data in more detail later in this chapter.

A critical Series feature for many applications is that it automatically aligns differently-indexed data in arithmetic operations:

In [28]: obj3          In [29]: obj4      
Out[28]:               Out[29]:           
Ohio      35000        California      NaN
Oregon    16000        Ohio          35000
Texas     71000        Oregon        16000
Utah       5000        Texas         71000
dtype: int64           dtype: float64     
                                          
In [30]: obj3 + obj4
Out[30]: 
California       NaN
Ohio           70000
Oregon         32000
Texas         142000
Utah             NaN
dtype: float64

Data alignment features are addressed as a separate topic.

Both the Series object itself and its index have a name attribute, which integrates with other key areas of pandas functionality:

In [31]: obj4.name = 'population'

In [32]: obj4.index.name = 'state'

In [33]: obj4
Out[33]: 
state
California      NaN
Ohio          35000
Oregon        16000
Texas         71000
Name: population, dtype: float64

A Series’s index can be altered in place by assignment:

In [34]: obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']

In [35]: obj
Out[35]: 
Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

DataFrame

A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row and column index; it can be thought of as a dict of Series (one for each column, all sharing the same index). Compared with other such DataFrame-like structures you may have used before (like R’s data.frame), row-oriented and column-oriented operations in DataFrame are treated roughly symmetrically. Under the hood, the data is stored as one or more two-dimensional blocks rather than a list, dict, or some other collection of one-dimensional arrays. The exact details of DataFrame’s internals are far outside the scope of this book.

Note

While DataFrame stores the data internally in a two-dimensional format, you can easily represent much higher-dimensional data in a tabular format using hierarchical indexing, a subject of a later section and a key ingredient in many of the more advanced data-handling features in pandas.
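The “dict of Series” description can be made concrete with a tiny hypothetical frame: selecting a column yields a Series, and every column Series carries the same row index:

```python
import pandas as pd

# a small DataFrame, viewed as a dict of Series sharing one row index
frame = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]},
                     index=['x', 'y', 'z'])

col = frame['b']                              # column selection yields a Series
shares_index = col.index.equals(frame.index)  # same labels as the frame's rows
```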

There are numerous ways to construct a DataFrame, though one of the most common is from a dict of equal-length lists or NumPy arrays:

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)

The resulting DataFrame will have its index assigned automatically as with Series, and the columns are placed in sorted order:

In [37]: frame
Out[37]: 
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002

If you specify a sequence of columns, the DataFrame’s columns will be exactly what you pass:

In [38]: DataFrame(data, columns=['year', 'state', 'pop'])
Out[38]: 
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9

As with Series, if you pass a column that isn’t contained in data, it will appear with NA values in the result:

In [39]: frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
   ....:                    index=['one', 'two', 'three', 'four', 'five'])

In [40]: frame2
Out[40]: 
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN

In [41]: frame2.columns
Out[41]: Index([u'year', u'state', u'pop', u'debt'], dtype='object')

A column in a DataFrame can be retrieved as a Series either by dict-like notation or by attribute:

In [42]: frame2['state']          In [43]: frame2.year    
Out[42]:                          Out[43]:                
one        Ohio                   one      2000           
two        Ohio                   two      2001           
three      Ohio                   three    2002           
four     Nevada                   four     2001           
five     Nevada                   five     2002           
Name: state, dtype: object        Name: year, dtype: int64

Note that the returned Series have the same index as the DataFrame, and their name attribute has been appropriately set.

Rows can also be retrieved by position or name using a couple of methods, such as the ix indexing field (much more on this later):

In [44]: frame2.ix['three']
Out[44]: 
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

Columns can be modified by assignment. For example, the empty 'debt' column could be assigned a scalar value or an array of values:

In [45]: frame2['debt'] = 16.5

In [46]: frame2
Out[46]: 
       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5

In [47]: frame2['debt'] = np.arange(5.)

In [48]: frame2
Out[48]: 
       year   state  pop  debt
one    2000    Ohio  1.5     0
two    2001    Ohio  1.7     1
three  2002    Ohio  3.6     2
four   2001  Nevada  2.4     3
five   2002  Nevada  2.9     4

When assigning lists or arrays to a column, the value’s length must match the length of the DataFrame. If you assign a Series, it will instead be conformed exactly to the DataFrame’s index, inserting missing values in any holes:

In [49]: val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])

In [50]: frame2['debt'] = val

In [51]: frame2
Out[51]: 
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7

Assigning a column that doesn’t exist will create a new column. The del keyword will delete columns as with a dict:

In [52]: frame2['eastern'] = frame2.state == 'Ohio'

In [53]: frame2
Out[53]: 
       year   state  pop  debt eastern
one    2000    Ohio  1.5   NaN    True
two    2001    Ohio  1.7  -1.2    True
three  2002    Ohio  3.6   NaN    True
four   2001  Nevada  2.4  -1.5   False
five   2002  Nevada  2.9  -1.7   False

In [54]: del frame2['eastern']

In [55]: frame2.columns
Out[55]: Index([u'year', u'state', u'pop', u'debt'], dtype='object')

Caution

The column returned when indexing a DataFrame is a view on the underlying data, not a copy. Thus, any in-place modifications to the Series will be reflected in the DataFrame. The column can be explicitly copied using the Series’s copy method.
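Because view semantics have shifted across pandas versions, the reliably safe pattern is the explicit copy mentioned above. A minimal sketch with hypothetical data:

```python
import pandas as pd

# hypothetical frame; whether plain indexing returns a view varies by
# pandas version, so the safe pattern is an explicit copy
frame = pd.DataFrame({'debt': [1.0, 2.0, 3.0]},
                     index=['one', 'two', 'three'])

safe = frame['debt'].copy()   # an independent Series
safe['one'] = 99.0            # modifies the copy only

untouched = frame.loc['one', 'debt']  # original value is preserved
```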

Another common form of data is a nested dict of dicts format:

In [56]: pop = {'Nevada': {2001: 2.4, 2002: 2.9},
   ....:        'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

If passed to DataFrame, it will interpret the outer dict keys as the columns and the inner keys as the row indices:

In [57]: frame3 = DataFrame(pop)

In [58]: frame3
Out[58]: 
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6

Of course you can always transpose the result:

In [59]: frame3.T
Out[59]: 
        2000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6

The keys in the inner dicts are unioned and sorted to form the index in the result. This isn’t true if an explicit index is specified:

In [60]: DataFrame(pop, index=[2001, 2002, 2003])
Out[60]: 
      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN

Dicts of Series are treated much in the same way:

In [61]: pdata = {'Ohio': frame3['Ohio'][:-1],
   ....:          'Nevada': frame3['Nevada'][:2]}

In [62]: DataFrame(pdata)
Out[62]: 
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7

For a complete list of things you can pass the DataFrame constructor, see Table 5-1.

If a DataFrame’s index and columns have their name attributes set, these will also be displayed:

In [63]: frame3.index.name = 'year'; frame3.columns.name = 'state'

In [64]: frame3
Out[64]: 
state  Nevada  Ohio
year               
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6

Like Series, the values attribute returns the data contained in the DataFrame as a 2D ndarray:

In [65]: frame3.values
Out[65]: 
array([[ nan,  1.5],
       [ 2.4,  1.7],
       [ 2.9,  3.6]])

If the DataFrame’s columns are different dtypes, the dtype of the values array will be chosen to accommodate all of the columns:

In [66]: frame2.values
Out[66]: 
array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7]], dtype=object)

Table 5-1. Possible data inputs to DataFrame constructor

Type                               Notes
2D ndarray                         A matrix of data, passing optional row and column labels
dict of arrays, lists, or tuples   Each sequence becomes a column in the DataFrame. All sequences must be the same length.
NumPy structured/record array      Treated as the “dict of arrays” case
dict of Series                     Each value becomes a column. Indexes from each Series are unioned together to form the result’s row index if no explicit index is passed.
dict of dicts                      Each inner dict becomes a column. Keys are unioned to form the row index as in the “dict of Series” case.
list of dicts or Series            Each item becomes a row in the DataFrame. Union of dict keys or Series indexes become the DataFrame’s column labels.
list of lists or tuples            Treated as the “2D ndarray” case
Another DataFrame                  The DataFrame’s indexes are used unless different ones are passed
NumPy MaskedArray                  Like the “2D ndarray” case except masked values become NA/missing in the DataFrame result
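Two of the entries from Table 5-1 sketched with made-up data: a list of dicts (each dict becomes a row; keys are unioned into columns) and another DataFrame:

```python
import pandas as pd
import numpy as np

# list of dicts: each dict becomes a row; keys are unioned into columns
rows = [{'a': 1, 'b': 2}, {'b': 3, 'c': 4}]
frame_a = pd.DataFrame(rows)

# another DataFrame: its index and columns carry over unless overridden
frame_b = pd.DataFrame(frame_a)
```

Keys absent from a given dict show up as NA in that row.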

Index Objects

pandas’s Index objects are responsible for holding the axis labels and other metadata (like the axis name or names). Any array or other sequence of labels used when constructing a Series or DataFrame is internally converted to an Index:

In [67]: obj = Series(range(3), index=['a', 'b', 'c'])

In [68]: index = obj.index

In [69]: index
Out[69]: Index([u'a', u'b', u'c'], dtype='object')

In [70]: index[1:]
Out[70]: Index([u'b', u'c'], dtype='object')

Index objects are immutable and thus can’t be modified by the user:

In [71]: index[1] = 'd'
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-71-676fdeb26a68> in <module>()
----> 1 index[1] = 'd'
/home/phillip/miniconda3/envs/conda2/lib/python2.7/site-packages/pandas/core/
base.pyc in _disabled(self, *args, **kwargs)
    177         """This method will not function because object is immutable."""
    178         raise TypeError("'%s' does not support mutable operations." %
--> 179                         self.__class__)
    180 
    181     __setitem__ = __setslice__ = __delitem__ = __delslice__ = _disabled
TypeError: '<class 'pandas.core.index.Index'>' does not support mutable operations.

Immutability is important so that Index objects can be safely shared among data structures:

In [72]: index = pd.Index(np.arange(3))

In [73]: obj2 = Series([1.5, -2.5, 0], index=index)

In [74]: obj2.index is index
Out[74]: True

Table 5-2 has a list of built-in Index classes in the library. With some development effort, Index can even be subclassed to implement specialized axis indexing functionality.

Note

Many users will not need to know much about Index objects, but they’re nonetheless an important part of pandas’s data model.

Table 5-2. Main Index objects in pandas

Class          Description
Index          The most general Index object, representing axis labels in a NumPy array of Python objects.
Int64Index     Specialized Index for integer values.
MultiIndex     “Hierarchical” index object representing multiple levels of indexing on a single axis. Can be thought of as similar to an array of tuples.
DatetimeIndex  Stores nanosecond timestamps (represented using NumPy’s datetime64 dtype).
PeriodIndex    Specialized Index for Period data (timespans).

In addition to being array-like, an Index also functions as a fixed-size set:

In [75]: frame3
Out[75]: 
state  Nevada  Ohio
year               
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6

In [76]: 'Ohio' in frame3.columns
Out[76]: True

In [77]: 2003 in frame3.index
Out[77]: False

Each Index has a number of methods and properties for set logic and answering other common questions about the data it contains. These are summarized in Table 5-3.

Table 5-3. Index methods and properties

Method        Description
append        Concatenate with additional Index objects, producing a new Index
diff          Compute set difference as an Index
intersection  Compute set intersection
union         Compute set union
isin          Compute boolean array indicating whether each value is contained in the passed collection
delete        Compute new Index with element at index i deleted
drop          Compute new Index by deleting passed values
insert        Compute new Index by inserting element at index i
is_monotonic  Returns True if each element is greater than or equal to the previous element
is_unique     Returns True if the Index has no duplicate values
unique        Compute the array of unique values in the Index
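A quick sketch of several of the set-style operations from Table 5-3 (shown with methods available in current pandas; some names in the table are specific to older releases):

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['b', 'c', 'd'])

both = left.intersection(right)   # set intersection
either = left.union(right)        # set union
mask = left.isin(['b', 'e'])      # boolean membership array
no_dups = left.is_unique          # True: no duplicate labels
```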

Essential Functionality

In this section, I’ll walk you through the fundamental mechanics of interacting with the data contained in a Series or DataFrame. Upcoming chapters will delve more deeply into data analysis and manipulation topics using pandas. This book is not intended to serve as exhaustive documentation for the pandas library; I instead focus on the most important features, leaving the less common (that is, more esoteric) things for you to explore on your own.

Reindexing

A critical method on pandas objects is reindex, which means to create a new object with the data conformed to a new index. Consider a simple example from above:

In [78]: obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

In [79]: obj
Out[79]: 
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

Calling reindex on this Series rearranges the data according to the new index, introducing missing values if any index values were not already present:

In [80]: obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])

In [81]: obj2
Out[81]: 
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

In [82]: obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)
Out[82]: 
a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64

For ordered data like time series, it may be desirable to do some interpolation or filling of values when reindexing. The method option allows us to do this, using a method such as ffill which forward fills the values:

In [83]: obj3 = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

In [84]: obj3.reindex(range(6), method='ffill')
Out[84]: 
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

Table 5-4 lists available method options. At this time, interpolation more sophisticated than forward- and backfilling would need to be applied after the fact.

Table 5-4. reindex method (interpolation) options

Argument           Description
ffill or pad       Fill (or carry) values forward
bfill or backfill  Fill (or carry) values backward
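The bfill option is the mirror image of the ffill example above: holes take the next valid value rather than the previous one.

```python
import pandas as pd

obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# bfill carries the *next* valid value backward into each hole;
# position 5 has no following value, so it stays missing
filled = obj3.reindex(range(6), method='bfill')
```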

With DataFrame, reindex can alter either the (row) index, columns, or both. When passed just a sequence, the rows are reindexed in the result:

In [85]: frame = DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'],
   ....:                   columns=['Ohio', 'Texas', 'California'])

In [86]: frame
Out[86]: 
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

In [87]: frame2 = frame.reindex(['a', 'b', 'c', 'd'])

In [88]: frame2
Out[88]: 
   Ohio  Texas  California
a     0      1           2
b   NaN    NaN         NaN
c     3      4           5
d     6      7           8

The columns can be reindexed using the columns keyword:

In [89]: states = ['Texas', 'Utah', 'California']

In [90]: frame.reindex(columns=states)
Out[90]: 
   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8

Both can be reindexed in one shot, though interpolation will only apply row-wise (axis 0):

In [91]: frame.reindex(index=['a', 'b', 'c', 'd'], method='ffill',
   ....:               columns=states)
Out[91]: 
   Texas  Utah  California
a      1   NaN           2
b      1   NaN           2
c      4   NaN           5
d      7   NaN           8

As you’ll see soon, reindexing can be done more succinctly by label-indexing with ix:

In [92]: frame.ix[['a', 'b', 'c', 'd'], states]
Out[92]: 
   Texas  Utah  California
a      1   NaN           2
b    NaN   NaN         NaN
c      4   NaN           5
d      7   NaN           8

Table 5-5. reindex function arguments

Argument    Description
index       New sequence to use as index. Can be an Index instance or any other sequence-like Python data structure. An Index will be used exactly as is without any copying.
method      Interpolation (fill) method; see Table 5-4 for options.
fill_value  Substitute value to use when introducing missing data by reindexing
limit       When forward- or backfilling, maximum size gap to fill
level       Match simple Index on level of MultiIndex; otherwise select subset of the MultiIndex.
copy        If True, always copy underlying data even if the new index is equivalent to the old index. If False, do not copy the data when the indexes are equivalent.
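The limit argument caps how far a fill propagates; a sketch with hypothetical data:

```python
import pandas as pd

# sparse series: known values only at positions 0 and 4
obj = pd.Series(['a', 'b'], index=[0, 4])

# forward-fill, but bridge gaps of at most one step (limit=1);
# positions 2 and 3 are too far from a known value and stay missing
limited = obj.reindex(range(6), method='ffill', limit=1)
```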

Dropping entries from an axis

Dropping one or more entries from an axis is easy if you have an index array or list without those entries. As that can require a bit of munging and set logic, the drop method will return a new object with the indicated value or values deleted from an axis:

In [93]: obj = Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

In [94]: new_obj = obj.drop('c')

In [95]: new_obj
Out[95]: 
a    0
b    1
d    3
e    4
dtype: float64

In [96]: obj.drop(['d', 'c'])
Out[96]: 
a    0
b    1
e    4
dtype: float64

With DataFrame, index values can be deleted from either axis:

In [97]: data = DataFrame(np.arange(16).reshape((4, 4)),
   ....:                  index=['Ohio', 'Colorado', 'Utah', 'New York'],
   ....:                  columns=['one', 'two', 'three', 'four'])

In [98]: data.drop(['Colorado', 'Ohio'])
Out[98]: 
          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15

In [99]: data.drop('two', axis=1)      In [100]: data.drop(['two', 'four'], axis=1)
Out[99]:                               Out[100]:                                   
          one  three  four                       one  three                        
Ohio        0      2     3             Ohio        0      2                        
Colorado    4      6     7             Colorado    4      6                        
Utah        8     10    11             Utah        8     10                        
New York   12     14    15             New York   12     14        

Indexing, selection, and filtering

Series indexing (obj[...]) works analogously to NumPy array indexing, except you can use the Series’s index values instead of only integers. Here are some examples of this:

In [101]: obj = Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

In [102]: obj['b']          In [103]: obj[1]
Out[102]: 1.0               Out[103]: 1.0   
                                            
In [104]: obj[2:4]          In [105]: obj[['b', 'a', 'd']]
Out[104]:                   Out[105]:                     
c    2                      b    1                        
d    3                      a    0                        
dtype: float64              d    3                        
                            dtype: float64                
                                                          
In [106]: obj[[1, 3]]       In [107]: obj[obj < 2]
Out[106]:                   Out[107]:             
b    1                      a    0                
d    3                      b    1                
dtype: float64              dtype: float64 

Slicing with labels behaves differently than normal Python slicing in that the endpoint is inclusive:

In [108]: obj['b':'c']
Out[108]: 
b    1
c    2
dtype: float64

Setting using these methods works just as you would expect:

In [109]: obj['b':'c'] = 5

In [110]: obj
Out[110]: 
a    0
b    5
c    5
d    3
dtype: float64

As you’ve seen above, indexing into a DataFrame is for retrieving one or more columns either with a single value or sequence:

In [111]: data = DataFrame(np.arange(16).reshape((4, 4)),
   .....:                  index=['Ohio', 'Colorado', 'Utah', 'New York'],
   .....:                  columns=['one', 'two', 'three', 'four'])

In [112]: data
Out[112]: 
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

In [113]: data['two']          In [114]: data[['three', 'one']]
Out[113]:                      Out[114]:                       
Ohio         1                           three  one            
Colorado     5                 Ohio          2    0            
Utah         9                 Colorado      6    4            
New York    13                 Utah         10    8            
Name: two, dtype: int64        New York     14   12  

Indexing like this has a few special cases. First, selecting rows by slicing or with a boolean array:

In [115]: data[:2]                     In [116]: data[data['three'] > 5]
Out[115]:                              Out[116]:                        
          one  two  three  four                  one  two  three  four  
Ohio        0    1      2     3        Colorado    4    5      6     7  
Colorado    4    5      6     7        Utah        8    9     10    11  
                                       New York   12   13     14    15  

This might seem inconsistent to some readers, but this syntax arose out of practicality and nothing more. Another use case is in indexing with a boolean DataFrame, such as one produced by a scalar comparison:

In [117]: data < 5
Out[117]: 
            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False

In [118]: data[data < 5] = 0

In [119]: data
Out[119]: 
          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

This is intended to make DataFrame syntactically more like an ndarray in this case.

For DataFrame label-indexing on the rows, I introduce the special indexing field ix. It enables you to select a subset of the rows and columns from a DataFrame with NumPy-like notation plus axis labels. As I mentioned earlier, this is also a less verbose way to do reindexing:

In [120]: data.ix['Colorado', ['two', 'three']]
Out[120]: 
two      5
three    6
Name: Colorado, dtype: int64

In [121]: data.ix[['Colorado', 'Utah'], [3, 0, 1]]
Out[121]: 
          four  one  two
Colorado     7    0    5
Utah        11    8    9

In [122]: data.ix[2]            In [123]: data.ix[:'Utah', 'two']
Out[122]:                       Out[123]:                        
one       8                     Ohio        0                    
two       9                     Colorado    5                    
three    10                     Utah        9                    
four     11                     Name: two, dtype: int64          
Name: Utah, dtype: int64                                         
                                                                 
In [124]: data.ix[data.three > 5, :3]
Out[124]: 
          one  two  three
Colorado    0    5      6
Utah        8    9     10
New York   12   13     14

So there are many ways to select and rearrange the data contained in a pandas object. For DataFrame, there is a short summary of many of them in Table 5-6. You have a number of additional options when working with hierarchical indexes as you’ll later see.

Note

When designing pandas, I felt that having to type frame[:, col] to select a column was too verbose (and error-prone), since column selection is one of the most common operations. Thus I made the design trade-off to push all of the rich label-indexing into ix.

Table 5-6. Indexing options with DataFrame

Type                          Notes
obj[val]                      Select single column or sequence of columns from the DataFrame. Special case conveniences: boolean array (filter rows), slice (slice rows), or boolean DataFrame (set values based on some criterion).
obj.ix[val]                   Selects single row or subset of rows from the DataFrame.
obj.ix[:, val]                Selects single column or subset of columns.
obj.ix[val1, val2]            Select both rows and columns.
reindex method                Conform one or more axes to new indexes.
xs method                     Select single row or column as a Series by label.
icol, irow methods            Select single column or row, respectively, as a Series by integer location.
get_value, set_value methods  Select single value by row and column label.
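The xs method from Table 5-6, sketched with the data frame used earlier in this section (icol and irow belong to older pandas releases, so only xs is shown):

```python
import pandas as pd
import numpy as np

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

row = data.xs('Utah')           # single row by label, as a Series
col = data.xs('two', axis=1)    # single column by label, as a Series
```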

Arithmetic and data alignment

One of the most important pandas features is the behavior of arithmetic between objects with different indexes. When adding together objects, if any index pairs are not the same, the respective index in the result will be the union of the index pairs. Let’s look at a simple example:

In [125]: s1 = Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])

In [126]: s2 = Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

In [127]: s1          In [128]: s2  
Out[127]:             Out[128]:     
a    7.3              a   -2.1      
c   -2.5              c    3.6      
d    3.4              e   -1.5      
e    1.5              f    4.0      
dtype: float64        g    3.1      
                      dtype: float64

Adding these together yields:

In [129]: s1 + s2
Out[129]: 
a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

The internal data alignment introduces NA values in the indices that don’t overlap. Missing values propagate in arithmetic computations.

In the case of DataFrame, alignment is performed on both the rows and the columns:

In [130]: df1 = DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
   .....:                 index=['Ohio', 'Texas', 'Colorado'])

In [131]: df2 = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
   .....:                 index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [132]: df1            In [133]: df2    
Out[132]:                Out[133]:        
          b  c  d                b   d   e
Ohio      0  1  2        Utah    0   1   2
Texas     3  4  5        Ohio    3   4   5
Colorado  6  7  8        Texas   6   7   8
                         Oregon  9  10  11

Adding these together returns a DataFrame whose index and columns are the unions of the ones in each DataFrame:

In [134]: df1 + df2
Out[134]: 
           b   c   d   e
Colorado NaN NaN NaN NaN
Ohio       3 NaN   6 NaN
Oregon   NaN NaN NaN NaN
Texas      9 NaN  12 NaN
Utah     NaN NaN NaN NaN

Arithmetic methods with fill values

In arithmetic operations between differently-indexed objects, you might want to fill with a special value, like 0, when an axis label is found in one object but not the other:

In [135]: df1 = DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))

In [136]: df2 = DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))

In [137]: df1          In [138]: df2        
Out[137]:              Out[138]:            
   a  b   c   d            a   b   c   d   e
0  0  1   2   3        0   0   1   2   3   4
1  4  5   6   7        1   5   6   7   8   9
2  8  9  10  11        2  10  11  12  13  14
                       3  15  16  17  18  19

Adding these together results in NA values in the locations that don’t overlap:

In [139]: df1 + df2
Out[139]: 
    a   b   c   d   e
0   0   2   4   6 NaN
1   9  11  13  15 NaN
2  18  20  22  24 NaN
3 NaN NaN NaN NaN NaN

Using the add method on df1, I pass df2 and an argument to fill_value:

In [140]: df1.add(df2, fill_value=0)
Out[140]: 
    a   b   c   d   e
0   0   2   4   6   4
1   9  11  13  15   9
2  18  20  22  24  14
3  15  16  17  18  19

Relatedly, when reindexing a Series or DataFrame, you can also specify a different fill value:

In [141]: df1.reindex(columns=df2.columns, fill_value=0)
Out[141]: 
   a  b   c   d  e
0  0  1   2   3  0
1  4  5   6   7  0
2  8  9  10  11  0

Table 5-7. Flexible arithmetic methods

Method    Description
add       Method for addition (+)
sub       Method for subtraction (-)
div       Method for division (/)
mul       Method for multiplication (*)
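All of these accept a fill_value just as add does. A brief sketch with df1 and df2 as constructed above, treating missing entries as 0 for subtraction and as 1 for multiplication:

```python
# A sketch of the other flexible methods from Table 5-7; like add,
# each accepts a fill_value for labels present in only one object.
import numpy as np
from pandas import DataFrame

df1 = DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))

diff = df1.sub(df2, fill_value=0)   # missing entries treated as 0
prod = df1.mul(df2, fill_value=1)   # missing entries treated as 1
```

So, for example, column 'e' of diff is just the negation of df2['e'] over the overlapping rows, since df1 contributes a filled 0 there.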

Operations between DataFrame and Series

As with NumPy arrays, arithmetic between DataFrame and Series is well-defined. First, as a motivating example, consider the difference between a 2D array and one of its rows:

In [142]: arr = np.arange(12.).reshape((3, 4))

In [143]: arr
Out[143]: 
array([[  0.,   1.,   2.,   3.],
       [  4.,   5.,   6.,   7.],
       [  8.,   9.,  10.,  11.]])

In [144]: arr[0]
Out[144]: array([ 0.,  1.,  2.,  3.])

In [145]: arr - arr[0]
Out[145]: 
array([[ 0.,  0.,  0.,  0.],
       [ 4.,  4.,  4.,  4.],
       [ 8.,  8.,  8.,  8.]])

This is referred to as broadcasting and is explained in more detail in Chapter 12. Operations between a DataFrame and a Series are similar:

In [146]: frame = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
   .....:                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [147]: series = frame.ix[0]

In [148]: frame          In [149]: series          
Out[148]:                Out[149]:                 
        b   d   e        b    0                    
Utah    0   1   2        d    1                    
Ohio    3   4   5        e    2                    
Texas   6   7   8        Name: Utah, dtype: float64
Oregon  9  10  11                                  

By default, arithmetic between DataFrame and Series matches the index of the Series on the DataFrame’s columns, broadcasting down the rows:

In [150]: frame - series
Out[150]: 
        b  d  e
Utah    0  0  0
Ohio    3  3  3
Texas   6  6  6
Oregon  9  9  9

If an index value is not found in either the DataFrame’s columns or the Series’s index, the objects will be reindexed to form the union:

In [151]: series2 = Series(range(3), index=['b', 'e', 'f'])

In [152]: frame + series2
Out[152]: 
        b   d   e   f
Utah    0 NaN   3 NaN
Ohio    3 NaN   6 NaN
Texas   6 NaN   9 NaN
Oregon  9 NaN  12 NaN

If you want to instead broadcast over the columns, matching on the rows, you have to use one of the arithmetic methods. For example:

In [153]: series3 = frame['d']

In [154]: frame      In [155]: series3      
Out[154]:            Out[155]:              
        b   d   e    Utah       1           
Utah    0   1   2    Ohio       4           
Ohio    3   4   5    Texas      7           
Texas   6   7   8    Oregon    10           
Oregon  9  10  11    Name: d, dtype: float64
                                            
In [156]: frame.sub(series3, axis=0)
Out[156]: 
        b  d  e
Utah   -1  0  1
Ohio   -1  0  1
Texas  -1  0  1
Oregon -1  0  1

The axis number that you pass is the axis to match on. In this case we mean to match on the DataFrame’s row index and broadcast across.
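As a sketch of this, matching on the columns with axis=1 through the method form gives the same result as the default subtraction (I use iloc here in place of the ix shown above to grab the first row):

```python
# axis=0 matches on the row index (as in the example above), while
# axis=1 matches on the columns, which is what frame - series does
# by default.
import numpy as np
from pandas import DataFrame

frame = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                  index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.iloc[0]   # the Utah row

explicit = frame.sub(series, axis=1)   # match on columns, broadcast down rows
default = frame - series               # the same operation
```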

Function application and mapping

NumPy ufuncs (element-wise array methods) work fine with pandas objects:

In [157]: frame = DataFrame(np.random.randn(4, 3), columns=list('bde'),
   .....:                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [158]: frame                           In [159]: np.abs(frame)             
Out[158]:                                 Out[159]:                           
               b         d         e                     b         d         e
Utah   -0.204708  0.478943 -0.519439      Utah    0.204708  0.478943  0.519439
Ohio   -0.555730  1.965781  1.393406      Ohio    0.555730  1.965781  1.393406
Texas   0.092908  0.281746  0.769023      Texas   0.092908  0.281746  0.769023
Oregon  1.246435  1.007189 -1.296221      Oregon  1.246435  1.007189  1.296221

Another frequent operation is applying a function on 1D arrays to each column or row. DataFrame’s apply method does exactly this:

In [160]: f = lambda x: x.max() - x.min()

In [161]: frame.apply(f)        In [162]: frame.apply(f, axis=1)
Out[161]:                       Out[162]:                       
b    1.802165                   Utah      0.998382              
d    1.684034                   Ohio      2.521511              
e    2.689627                   Texas     0.676115              
dtype: float64                  Oregon    2.542656              
                                dtype: float64          

Many of the most common array statistics (like sum and mean) are DataFrame methods, so using apply is not necessary.
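For example, the range computed above could equally be written with the built-in reductions; a small sketch on deterministic data:

```python
# The apply-based range and the built-in reductions agree column-wise.
import numpy as np
from pandas import DataFrame

frame = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'))

via_apply = frame.apply(lambda x: x.max() - x.min())
via_builtin = frame.max() - frame.min()   # no apply needed
```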

The function passed to apply need not return a scalar value; it can also return a Series with multiple values:

In [163]: def f(x):
   .....:     return Series([x.min(), x.max()], index=['min', 'max'])

In [164]: frame.apply(f)
Out[164]: 
            b         d         e
min -0.555730  0.281746 -1.296221
max  1.246435  1.965781  1.393406

Element-wise Python functions can be used, too. Suppose you wanted to compute a formatted string from each floating point value in frame. You can do this with applymap:

In [165]: format = lambda x: '%.2f' % x

In [166]: frame.applymap(format)
Out[166]: 
            b     d      e
Utah    -0.20  0.48  -0.52
Ohio    -0.56  1.97   1.39
Texas    0.09  0.28   0.77
Oregon   1.25  1.01  -1.30

The reason for the name applymap is that Series has a map method for applying an element-wise function:

In [167]: frame['e'].map(format)
Out[167]: 
Utah      -0.52
Ohio       1.39
Texas      0.77
Oregon    -1.30
Name: e, dtype: object
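As an aside, map also accepts a dict-like of values to replacements, which makes it handy for recoding a Series element-wise. A small sketch with made-up color data:

```python
# Series.map with a dict recodes each value through the mapping.
from pandas import Series

colors = Series(['red', 'green', 'red', 'blue'])
codes = colors.map({'red': 0, 'green': 1, 'blue': 2})
```

Values absent from the mapping would come through as NaN.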

Sorting and ranking

Sorting a data set by some criterion is another important built-in operation. To sort lexicographically by row or column index, use the sort_index method, which returns a new, sorted object:

In [168]: obj = Series(range(4), index=['d', 'a', 'b', 'c'])

In [169]: obj.sort_index()
Out[169]: 
a    1
b    2
c    3
d    0
dtype: int64

With a DataFrame, you can sort by index on either axis:

In [170]: frame = DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'],
   .....:                   columns=['d', 'a', 'b', 'c'])

In [171]: frame.sort_index()        In [172]: frame.sort_index(axis=1)
Out[171]:                           Out[172]:                         
       d  a  b  c                          a  b  c  d                 
one    4  5  6  7                   three  1  2  3  0                 
three  0  1  2  3                   one    5  6  7  4

The data is sorted in ascending order by default, but can be sorted in descending order, too:

In [173]: frame.sort_index(axis=1, ascending=False)
Out[173]: 
       d  c  b  a
three  0  3  2  1
one    4  7  6  5

To sort a Series by its values, use its order method:

In [174]: obj = Series([4, 7, -3, 2])

In [175]: obj.order()
Out[175]: 
2   -3
3    2
0    4
1    7
dtype: int64

Any missing values are sorted to the end of the Series by default:

In [176]: obj = Series([4, np.nan, 7, np.nan, -3, 2])

In [177]: obj.order()
Out[177]: 
4    -3
5     2
0     4
2     7
1   NaN
3   NaN
dtype: float64

On DataFrame, you may want to sort by the values in one or more columns. To do so, pass one or more column names to the by option:

In [178]: frame = DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

In [179]: frame        In [180]: frame.sort_index(by='b')
Out[179]:              Out[180]:                         
   a  b                   a  b                           
0  0  4                2  0 -3                           
1  1  7                3  1  2                           
2  0 -3                0  0  4                           
3  1  2                1  1  7 

To sort by multiple columns, pass a list of names:

In [181]: frame.sort_index(by=['a', 'b'])
Out[181]: 
   a  b
2  0 -3
0  0  4
3  1  2
1  1  7

Ranking is closely related to sorting, assigning ranks from one through the number of valid data points in an array. It is similar to the indirect sort indices produced by numpy.argsort, except that ties are broken according to a rule. The rank methods for Series and DataFrame are the place to look; by default rank breaks ties by assigning each group the mean rank:

In [182]: obj = Series([7, -5, 7, 4, 2, 0, 4])

In [183]: obj.rank()
Out[183]: 
0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

Ranks can also be assigned according to the order they’re observed in the data:

In [184]: obj.rank(method='first')
Out[184]: 
0    6
1    1
2    7
3    4
4    3
5    2
6    5
dtype: float64

Naturally, you can rank in descending order, too:

In [185]: obj.rank(ascending=False, method='max')
Out[185]: 
0    2
1    7
2    2
3    4
4    5
5    6
6    4
dtype: float64

See Table 5-8 for a list of tie-breaking methods available. DataFrame can compute ranks over the rows or the columns:

In [186]: frame = DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
   .....:                    'c': [-2, 5, 8, -2.5]})

In [187]: frame        In [188]: frame.rank(axis=1)
Out[187]:              Out[188]:                   
   a    b    c            a  b  c                  
0  0  4.3 -2.0         0  2  3  1                  
1  1  7.0  5.0         1  1  3  2                  
2  0 -3.0  8.0         2  2  1  3                  
3  1  2.0 -2.5         3  2  3  1

Table 5-8. Tie-breaking methods with rank

Method       Description
'average'    Default: assign the average rank to each entry in the equal group.
'min'        Use the minimum rank for the whole group.
'max'        Use the maximum rank for the whole group.
'first'      Assign ranks in the order the values appear in the data.
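To round out the table, here is a sketch of the 'min' and 'max' tie-breaking methods applied to the same Series as above; compare these with the 'average' and 'first' results:

```python
# 'min' gives every member of a tied group the group's lowest rank,
# 'max' gives every member the group's highest rank.
from pandas import Series

obj = Series([7, -5, 7, 4, 2, 0, 4])

min_ranks = obj.rank(method='min')
max_ranks = obj.rank(method='max')
```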

Axis indexes with duplicate values

Up until now all of the examples I’ve shown you have had unique axis labels (index values). While many pandas functions (like reindex) require that the labels be unique, it’s not mandatory. Let’s consider a small Series with duplicate indices:

In [189]: obj = Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

In [190]: obj
Out[190]: 
a    0
a    1
b    2
b    3
c    4
dtype: int64

The index’s is_unique property can tell you whether its values are unique or not:

In [191]: obj.index.is_unique
Out[191]: False

Data selection is one of the main things that behaves differently with duplicates. Indexing a value with multiple entries returns a Series, while single entries return a scalar value:

In [192]: obj['a']    In [193]: obj['c']
Out[192]:             Out[193]: 4       
a    0                                  
a    1                                  
dtype: int64

The same logic extends to indexing rows in a DataFrame:

In [194]: df = DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])

In [195]: df
Out[195]: 
          0         1         2
a  0.274992  0.228913  1.352917
a  0.886429 -2.001637 -0.371843
b  1.669025 -0.438570 -0.539741
b  0.476985  3.248944 -1.021228

In [196]: df.ix['b']
Out[196]: 
          0         1         2
b  1.669025 -0.438570 -0.539741
b  0.476985  3.248944 -1.021228

Summarizing and Computing Descriptive Statistics

pandas objects are equipped with a set of common mathematical and statistical methods. Most of these fall into the category of reductions or summary statistics, methods that extract a single value (like the sum or mean) from a Series or a Series of values from the rows or columns of a DataFrame. Compared with the equivalent methods of vanilla NumPy arrays, they are all built from the ground up to exclude missing data. Consider a small DataFrame:

In [197]: df = DataFrame([[1.4, np.nan], [7.1, -4.5],
   .....:                 [np.nan, np.nan], [0.75, -1.3]],
   .....:                index=['a', 'b', 'c', 'd'],
   .....:                columns=['one', 'two'])

In [198]: df
Out[198]: 
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3

Calling DataFrame’s sum method returns a Series containing column sums:

In [199]: df.sum()
Out[199]: 
one    9.25
two   -5.80
dtype: float64

Passing axis=1 sums over the rows instead:

In [200]: df.sum(axis=1)
Out[200]: 
a    1.40
b    2.60
c     NaN
d   -0.55
dtype: float64

NA values are excluded unless the entire slice (row or column in this case) is NA. This can be disabled using the skipna option:

In [201]: df.mean(axis=1, skipna=False)
Out[201]: 
a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64

See Table 5-9 for a list of common options for each reduction method.

Table 5-9. Options for reduction methods

Method    Description
axis      Axis to reduce over. 0 for DataFrame’s rows and 1 for columns.
skipna    Exclude missing values, True by default.
level     Reduce grouped by level if the axis is hierarchically-indexed (MultiIndex).

Some methods, like idxmin and idxmax, return indirect statistics like the index value where the minimum or maximum values are attained:

In [202]: df.idxmax()
Out[202]: 
one    b
two    d
dtype: object

Other methods are accumulations:

In [203]: df.cumsum()
Out[203]: 
    one  two
a  1.40  NaN
b  8.50 -4.5
c   NaN  NaN
d  9.25 -5.8

Another type of method is neither a reduction nor an accumulation. describe is one such example, producing multiple summary statistics in one shot:

In [204]: df.describe()
Out[204]: 
            one       two
count  3.000000  2.000000
mean   3.083333 -2.900000
std    3.493685  2.262742
min    0.750000 -4.500000
25%    1.075000 -3.700000
50%    1.400000 -2.900000
75%    4.250000 -2.100000
max    7.100000 -1.300000

On non-numeric data, describe produces alternate summary statistics:

In [205]: obj = Series(['a', 'a', 'b', 'c'] * 4)

In [206]: obj.describe()
Out[206]: 
count     16
unique     3
top        a
freq       8
dtype: object

See Table 5-10 for a full list of summary statistics and related methods.

Table 5-10. Descriptive and summary statistics

Method            Description
count             Number of non-NA values
describe          Compute set of summary statistics for Series or each DataFrame column
min, max          Compute minimum and maximum values
argmin, argmax    Compute index locations (integers) at which minimum or maximum value obtained, respectively
idxmin, idxmax    Compute index values at which minimum or maximum value obtained, respectively
quantile          Compute sample quantile ranging from 0 to 1
sum               Sum of values
mean              Mean of values
median            Arithmetic median (50% quantile) of values
mad               Mean absolute deviation from mean value
var               Sample variance of values
std               Sample standard deviation of values
skew              Sample skewness (3rd moment) of values
kurt              Sample kurtosis (4th moment) of values
cumsum            Cumulative sum of values
cummin, cummax    Cumulative minimum or maximum of values, respectively
cumprod           Cumulative product of values
diff              Compute 1st arithmetic difference (useful for time series)
pct_change        Compute percent changes
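As a small sketch of a couple of these, quantile and diff applied to the same df as above; note how each skips or propagates NA values as appropriate:

```python
# quantile skips NA values; diff propagates them through the
# first-difference computation.
import numpy as np
from pandas import DataFrame

df = DataFrame([[1.4, np.nan], [7.1, -4.5],
                [np.nan, np.nan], [0.75, -1.3]],
               index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

med = df['one'].quantile(0.5)   # median of the non-NA values
first_diff = df['one'].diff()   # difference with the prior row
```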

Correlation and Covariance

Some summary statistics, like correlation and covariance, are computed from pairs of arguments. Let’s consider some DataFrames of stock prices and volumes obtained from Yahoo! Finance:

import pandas.io.data as web

all_data = {}
for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']:
    all_data[ticker] = web.get_data_yahoo(ticker)

price = DataFrame({tic: data['Adj Close']
                   for tic, data in all_data.iteritems()})
volume = DataFrame({tic: data['Volume']
                    for tic, data in all_data.iteritems()})

I now compute percent changes of the prices:

In [208]: returns = price.pct_change()

In [209]: returns.tail()
Out[209]: 
                AAPL      GOOG       IBM      MSFT
Date                                              
2014-07-07  0.020632 -0.004241 -0.002599  0.004545
2014-07-08 -0.006460 -0.019167 -0.004361 -0.005001
2014-07-09  0.000420  0.008738  0.006410 -0.002633
2014-07-10 -0.003669 -0.008645 -0.003821  0.000480
2014-07-11  0.001894  0.014148  0.001598  0.009595

The corr method of Series computes the correlation of the overlapping, non-NA, aligned-by-index values in two Series. Relatedly, cov computes the covariance:

In [210]: returns.MSFT.corr(returns.IBM)
Out[210]: 0.51360438136345077

In [211]: returns.MSFT.cov(returns.IBM)
Out[211]: 8.4825099973219876e-05

DataFrame’s corr and cov methods, on the other hand, return a full correlation or covariance matrix as a DataFrame, respectively:

In [212]: returns.corr()
Out[212]: 
          AAPL      GOOG       IBM      MSFT
AAPL  1.000000  0.180425  0.384439  0.346196
GOOG  0.180425  1.000000  0.368115  0.534293
IBM   0.384439  0.368115  1.000000  0.513604
MSFT  0.346196  0.534293  0.513604  1.000000

In [213]: returns.cov()
Out[213]: 
          AAPL      GOOG       IBM      MSFT
AAPL  0.000294  0.000037  0.000077  0.000084
GOOG  0.000037  0.000233  0.000051  0.000085
IBM   0.000077  0.000051  0.000137  0.000085
MSFT  0.000084  0.000085  0.000085  0.000200

Using DataFrame’s corrwith method, you can compute pairwise correlations between a DataFrame’s columns or rows with another Series or DataFrame. Passing a Series returns a Series with the correlation value computed for each column:

In [214]: returns.corrwith(returns.IBM)
Out[214]: 
AAPL    0.384439
GOOG    0.368115
IBM     1.000000
MSFT    0.513604
dtype: float64

Passing a DataFrame computes the correlations of matching column names. Here I compute correlations of percent changes with volume:

In [215]: returns.corrwith(volume)
Out[215]: 
AAPL   -0.108623
GOOG   -0.255828
IBM    -0.158163
MSFT   -0.117687
dtype: float64

Passing axis=1 does things row-wise instead. In all cases, the data points are aligned by label before computing the correlation.

Unique Values, Value Counts, and Membership

Another class of related methods extracts information about the values contained in a one-dimensional Series. To illustrate these, consider this example:

In [216]: obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

The first function is unique, which gives you an array of the unique values in a Series:

In [217]: uniques = obj.unique()

In [218]: uniques
Out[218]: array(['c', 'a', 'd', 'b'], dtype=object)

The unique values are not necessarily returned in sorted order, but could be sorted after the fact if needed (uniques.sort()). Relatedly, value_counts computes a Series containing value frequencies:

In [219]: obj.value_counts()
Out[219]: 
c    3
a    3
b    2
d    1
dtype: int64

The Series is sorted by value in descending order as a convenience. value_counts is also available as a top-level pandas method that can be used with any array or sequence:

In [220]: pd.value_counts(obj.values, sort=False)
Out[220]: 
a    3
c    3
b    2
d    1
dtype: int64

Lastly, isin is responsible for vectorized set membership and can be very useful in filtering a data set down to a subset of values in a Series or column in a DataFrame:

In [221]: mask = obj.isin(['b', 'c'])

In [222]: mask        In [223]: obj[mask]
Out[222]:             Out[223]:          
0     True            0    c             
1    False            5    b             
2    False            6    b             
3    False            7    c             
4    False            8    c             
5     True            dtype: object      
6     True                               
7     True                               
8     True                               
dtype: bool

See Table 5-11 for a reference on these methods.

Table 5-11. Unique, value counts, and binning methods

Method          Description
isin            Compute boolean array indicating whether each Series value is contained in the passed sequence of values.
unique          Compute array of unique values in a Series, returned in the order observed.
value_counts    Return a Series containing unique values as its index and frequencies as its values, sorted by count in descending order.

In some cases, you may want to compute a histogram on multiple related columns in a DataFrame. Here’s an example:

In [224]: data = DataFrame({'Qu1': [1, 3, 4, 3, 4],
   .....:                   'Qu2': [2, 3, 1, 2, 3],
   .....:                   'Qu3': [1, 5, 2, 4, 4]})

In [225]: data
Out[225]: 
   Qu1  Qu2  Qu3
0    1    2    1
1    3    3    5
2    4    1    2
3    3    2    4
4    4    3    4

Passing pandas.value_counts to this DataFrame’s apply function gives:

In [226]: result = data.apply(pd.value_counts).fillna(0)

In [227]: result
Out[227]: 
   Qu1  Qu2  Qu3
1    1    1    1
2    0    2    1
3    2    2    0
4    2    0    2
5    0    0    1

Handling Missing Data

Missing data is common in most data analysis applications. One of the goals in designing pandas was to make working with missing data as painless as possible. For example, all of the descriptive statistics on pandas objects exclude missing data as you’ve seen earlier in the chapter.

pandas uses the floating point value NaN (Not a Number) to represent missing data in both floating point and non-floating point arrays. It is just used as a sentinel that can be easily detected:

In [228]: string_data = Series(['aardvark', 'artichoke', np.nan, 'avocado'])

In [229]: string_data        In [230]: string_data.isnull()
Out[229]:                    Out[230]:                     
0     aardvark               0    False                    
1    artichoke               1    False                    
2          NaN               2     True                    
3      avocado               3    False                    
dtype: object                dtype: bool

The built-in Python None value is also treated as NA in object arrays:

In [231]: string_data[0] = None

In [232]: string_data.isnull()
Out[232]: 
0     True
1    False
2     True
3    False
dtype: bool

I do not claim that pandas’s NA representation is optimal, but it is simple and reasonably consistent. It’s the best solution, with good all-around performance characteristics and a simple API, that I could concoct in the absence of a true NA data type or bit pattern in NumPy’s data types. Ongoing development work in NumPy may change this in the future.

Table 5-12. NA handling methods

Method     Description
dropna     Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate.
fillna     Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'.
isnull     Return like-type object containing boolean values indicating which values are missing / NA.
notnull    Negation of isnull.

Filtering Out Missing Data

You have a number of options for filtering out missing data. While doing it by hand is always an option, dropna can be very helpful. On a Series, it returns the Series with only the non-null data and index values:

In [233]: from numpy import nan as NA

In [234]: data = Series([1, NA, 3.5, NA, 7])

In [235]: data.dropna()
Out[235]: 
0    1.0
2    3.5
4    7.0
dtype: float64

Naturally, you could have computed this yourself by boolean indexing:

In [236]: data[data.notnull()]
Out[236]: 
0    1.0
2    3.5
4    7.0
dtype: float64

With DataFrame objects, these are a bit more complex. You may want to drop rows or columns which are all NA or just those containing any NAs. dropna by default drops any row containing a missing value:

In [237]: data = DataFrame([[1., 6.5, 3.], [1., NA, NA],
   .....:                   [NA, NA, NA], [NA, 6.5, 3.]])

In [238]: cleaned = data.dropna()

In [239]: data        In [240]: cleaned
Out[239]:             Out[240]:        
    0    1   2           0    1  2     
0   1  6.5   3        0  1  6.5  3     
1   1  NaN NaN                         
2 NaN  NaN NaN                         
3 NaN  6.5   3

Passing how='all' will only drop rows that are all NA:

In [241]: data.dropna(how='all')
Out[241]: 
    0    1   2
0   1  6.5   3
1   1  NaN NaN
3 NaN  6.5   3

Dropping columns in the same way is only a matter of passing axis=1:

In [242]: data[4] = NA

In [243]: data            In [244]: data.dropna(axis=1, how='all')
Out[243]:                 Out[244]:                               
    0    1   2   4            0    1   2                          
0   1  6.5   3 NaN        0   1  6.5   3                          
1   1  NaN NaN NaN        1   1  NaN NaN                          
2 NaN  NaN NaN NaN        2 NaN  NaN NaN                          
3 NaN  6.5   3 NaN        3 NaN  6.5   3

A related way to filter out DataFrame rows tends to concern time series data. Suppose you want to keep only rows containing a certain number of observations. You can indicate this with the thresh argument:

In [245]: df = DataFrame(np.random.randn(7, 3))

In [246]: df.ix[:4, 1] = NA; df.ix[:2, 2] = NA

In [247]: df                           In [248]: df.dropna(thresh=3)  
Out[247]:                              Out[248]:                      
          0         1         2                  0         1         2
0 -0.577087       NaN       NaN        5  0.332883 -2.359419 -0.199543
1  0.523772       NaN       NaN        6 -1.541996 -0.970736 -1.307030
2 -0.713544       NaN       NaN                                       
3 -1.860761       NaN  0.560145                                       
4 -1.265934       NaN -1.063512                                       
5  0.332883 -2.359419 -0.199543                                       
6 -1.541996 -0.970736 -1.307030

Filling in Missing Data

Rather than filtering out missing data (and potentially discarding other data along with it), you may want to fill in the “holes” in any number of ways. For most purposes, the fillna method is the workhorse function to use. Calling fillna with a constant replaces missing values with that value:

In [249]: df.fillna(0)
Out[249]: 
          0         1         2
0 -0.577087  0.000000  0.000000
1  0.523772  0.000000  0.000000
2 -0.713544  0.000000  0.000000
3 -1.860761  0.000000  0.560145
4 -1.265934  0.000000 -1.063512
5  0.332883 -2.359419 -0.199543
6 -1.541996 -0.970736 -1.307030

Calling fillna with a dict, you can use a different fill value for each column:

In [250]: df.fillna({1: 0.5, 3: -1})
Out[250]: 
          0         1         2
0 -0.577087  0.500000       NaN
1  0.523772  0.500000       NaN
2 -0.713544  0.500000       NaN
3 -1.860761  0.500000  0.560145
4 -1.265934  0.500000 -1.063512
5  0.332883 -2.359419 -0.199543
6 -1.541996 -0.970736 -1.307030

fillna returns a new object, but you can modify the existing object in place:

# always returns a reference to the filled object
In [251]: _ = df.fillna(0, inplace=True)

In [252]: df
Out[252]: 
          0         1         2
0 -0.577087  0.000000  0.000000
1  0.523772  0.000000  0.000000
2 -0.713544  0.000000  0.000000
3 -1.860761  0.000000  0.560145
4 -1.265934  0.000000 -1.063512
5  0.332883 -2.359419 -0.199543
6 -1.541996 -0.970736 -1.307030

The same interpolation methods available for reindexing can be used with fillna:

In [253]: df = DataFrame(np.random.randn(6, 3))

In [254]: df.ix[2:, 1] = NA; df.ix[4:, 2] = NA

In [255]: df
Out[255]: 
          0         1         2
0  0.286350  0.377984 -0.753887
1  0.331286  1.349742  0.069877
2  0.246674       NaN  1.004812
3  1.327195       NaN -1.549106
4  0.022185       NaN       NaN
5  0.862580       NaN       NaN

In [256]: df.fillna(method='ffill')      In [257]: df.fillna(method='ffill', limit=2)
Out[256]:                                Out[257]:                                   
          0         1         2                    0         1         2             
0  0.286350  0.377984 -0.753887          0  0.286350  0.377984 -0.753887             
1  0.331286  1.349742  0.069877          1  0.331286  1.349742  0.069877             
2  0.246674  1.349742  1.004812          2  0.246674  1.349742  1.004812             
3  1.327195  1.349742 -1.549106          3  1.327195  1.349742 -1.549106             
4  0.022185  1.349742 -1.549106          4  0.022185       NaN -1.549106             
5  0.862580  1.349742 -1.549106          5  0.862580       NaN -1.549106

With fillna you can do lots of other things with a little creativity. For example, you might pass the mean or median value of a Series:

In [258]: data = Series([1., NA, 3.5, NA, 7])

In [259]: data.fillna(data.mean())
Out[259]: 
0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64

See Table 5-13 for a reference on fillna.

Table 5-13. fillna function arguments

Argument    Description
value       Scalar value or dict-like object to use to fill missing values
method      Interpolation method; 'ffill' by default if the function is called with no other arguments
axis        Axis to fill on, default axis=0
inplace     Modify the calling object without producing a copy
limit       For forward and backward filling, maximum number of consecutive periods to fill
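As a sketch of the limit argument combined with a constant fill value (rather than a fill method), only the first two missing entries get filled here:

```python
# With a constant value, limit caps how many NA entries along the
# axis are filled; the remaining NAs are left alone.
import numpy as np
from pandas import Series

data = Series([1., np.nan, np.nan, np.nan, 5.])
filled = data.fillna(0, limit=2)
```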

Hierarchical Indexing

Hierarchical indexing is an important feature of pandas enabling you to have multiple (two or more) index levels on an axis. Somewhat abstractly, it provides a way for you to work with higher dimensional data in a lower dimensional form. Let’s start with a simple example; create a Series with a list of lists or arrays as the index:

In [260]: data = Series(np.random.randn(10),
   .....:               index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
   .....:                      [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])

In [261]: data
Out[261]: 
a  1    0.670216
   2    0.852965
   3   -0.955869
b  1   -0.023493
   2   -2.304234
   3   -0.652469
c  1   -1.218302
   2   -1.332610
d  2    1.074623
   3    0.723642
dtype: float64

What you’re seeing is a prettified view of a Series with a MultiIndex as its index. The “gaps” in the index display mean “use the label directly above”:

In [262]: data.index
Out[262]: 
MultiIndex(levels=[[u'a', u'b', u'c', u'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], 
           [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])

With a hierarchically-indexed object, so-called partial indexing is possible, enabling you to concisely select subsets of the data:

In [263]: data['b']
Out[263]: 
1   -0.023493
2   -2.304234
3   -0.652469
dtype: float64

In [264]: data['b':'c']        In [265]: data.ix[['b', 'd']]
Out[264]:                      Out[265]:                    
b  1   -0.023493               b  1   -0.023493             
   2   -2.304234                  2   -2.304234             
   3   -0.652469                  3   -0.652469             
c  1   -1.218302               d  2    1.074623             
   2   -1.332610                  3    0.723642             
dtype: float64                 dtype: float64

Selection is even possible in some cases from an “inner” level:

In [266]: data[:, 2]
Out[266]: 
a    0.852965
b   -2.304234
c   -1.332610
d    1.074623
dtype: float64
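
An equivalent, more explicit spelling of this inner-level selection uses the xs method, which takes the label to match and the level to match it on. A self-contained sketch (the values here are a fresh sequence rather than the random data above):

```python
import numpy as np
import pandas as pd

data = pd.Series(np.arange(10.0),
                 index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])

# Select every entry whose inner (level 1) label is 2
inner = data.xs(2, level=1)
```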

Hierarchical indexing plays a critical role in reshaping data and group-based operations like forming a pivot table. For example, this data could be rearranged into a DataFrame using its unstack method:

In [267]: data.unstack()
Out[267]: 
          1         2         3
a  0.670216  0.852965 -0.955869
b -0.023493 -2.304234 -0.652469
c -1.218302 -1.332610       NaN
d       NaN  1.074623  0.723642

The inverse operation of unstack is stack:

In [268]: data.unstack().stack()
Out[268]: 
a  1    0.670216
   2    0.852965
   3   -0.955869
b  1   -0.023493
   2   -2.304234
   3   -0.652469
c  1   -1.218302
   2   -1.332610
d  2    1.074623
   3    0.723642
dtype: float64

stack and unstack will be explored in more detail in Chapter 7.

With a DataFrame, either axis can have a hierarchical index:

In [269]: frame = DataFrame(np.arange(12).reshape((4, 3)),
   .....:                   index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
   .....:                   columns=[['Ohio', 'Ohio', 'Colorado'],
   .....:                            ['Green', 'Red', 'Green']])

In [270]: frame
Out[270]: 
      Ohio       Colorado
     Green  Red     Green
a 1      0    1         2
  2      3    4         5
b 1      6    7         8
  2      9   10        11

The hierarchical levels can have names (as strings or any Python objects). If so, these will show up in the console output (don’t confuse the index names with the axis labels!):

In [271]: frame.index.names = ['key1', 'key2']

In [272]: frame.columns.names = ['state', 'color']

In [273]: frame
Out[273]: 
state       Ohio       Colorado
color      Green  Red     Green
key1 key2                      
a    1         0    1         2
     2         3    4         5
b    1         6    7         8
     2         9   10        11

With partial column indexing you can similarly select groups of columns:

In [274]: frame['Ohio']
Out[274]: 
color      Green  Red
key1 key2            
a    1         0    1
     2         3    4
b    1         6    7
     2         9   10

A MultiIndex can be created by itself and then reused; the columns in the above DataFrame with level names could be created like this:

MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
                       names=['state', 'color'])
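
Making that snippet self-contained (using the conventional import of pandas as pd):

```python
import pandas as pd

# Rebuild the column MultiIndex from the DataFrame example above
columns = pd.MultiIndex.from_arrays(
    [['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
    names=['state', 'color'])
```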

Reordering and Sorting Levels

At times you will need to rearrange the order of the levels on an axis or sort the data by the values in one specific level. The swaplevel method takes two level numbers or names and returns a new object with the levels interchanged (but the data is otherwise unaltered):

In [275]: frame.swaplevel('key1', 'key2')
Out[275]: 
state       Ohio       Colorado
color      Green  Red     Green
key2 key1                      
1    a         0    1         2
2    a         3    4         5
1    b         6    7         8
2    b         9   10        11

sortlevel, on the other hand, sorts the data (stably) using only the values in a single level. When swapping levels, it’s not uncommon to also use sortlevel so that the result is lexicographically sorted:

In [276]: frame.sortlevel(1)           In [277]: frame.swaplevel(0, 1).sortlevel(0)
Out[276]:                              Out[277]:                                   
state       Ohio       Colorado        state       Ohio       Colorado             
color      Green  Red     Green        color      Green  Red     Green             
key1 key2                              key2 key1                                   
a    1         0    1         2        1    a         0    1         2             
b    1         6    7         8             b         6    7         8             
a    2         3    4         5        2    a         3    4         5             
b    2         9   10        11             b         9   10        11
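
In later pandas releases, sortlevel was folded into sort_index; a sketch of the same swap-then-sort pattern in that spelling:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']

# Swap the two row levels, then sort lexicographically on the new outer level
result = frame.swaplevel(0, 1).sort_index(level=0)
```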

Note

Data selection performance is much better on hierarchically indexed objects if the index is lexicographically sorted starting with the outermost level, that is, the result of calling sortlevel(0) or sort_index().

Summary Statistics by Level

Many descriptive and summary statistics on DataFrame and Series have a level option in which you can specify the level you want to sum by on a particular axis. Consider the above DataFrame; we can sum by level on either the rows or columns like so:

In [278]: frame.sum(level='key2')
Out[278]: 
state   Ohio       Colorado
color  Green  Red     Green
key2                       
1          6    8        10
2         12   14        16

In [279]: frame.sum(level='color', axis=1)
Out[279]: 
color      Green  Red
key1 key2            
a    1         2    1
     2         8    4
b    1        14    7
     2        20   10

Under the hood, this utilizes pandas’s groupby machinery which will be discussed in more detail later in the book.
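
Because this rides on groupby, the same sums can be written directly with groupby(level=...), which is also the spelling newer pandas versions require (the level keyword on sum was eventually removed):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# Equivalent to frame.sum(level='key2')
by_key2 = frame.groupby(level='key2').sum()

# Equivalent to frame.sum(level='color', axis=1): transpose, group the
# row levels, then transpose back
by_color = frame.T.groupby(level='color').sum().T
```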

Using a DataFrame’s Columns

It’s not unusual to want to use one or more columns from a DataFrame as the row index; alternatively, you may wish to move the row index into the DataFrame’s columns. Here’s an example DataFrame:

In [280]: frame = DataFrame({'a': range(7), 'b': range(7, 0, -1),
   .....:                    'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
   .....:                    'd': [0, 1, 2, 0, 1, 2, 3]})

In [281]: frame
Out[281]: 
   a  b    c  d
0  0  7  one  0
1  1  6  one  1
2  2  5  one  2
3  3  4  two  0
4  4  3  two  1
5  5  2  two  2
6  6  1  two  3

DataFrame’s set_index function will create a new DataFrame using one or more of its columns as the index:

In [282]: frame2 = frame.set_index(['c', 'd'])

In [283]: frame2
Out[283]: 
       a  b
c   d      
one 0  0  7
    1  1  6
    2  2  5
two 0  3  4
    1  4  3
    2  5  2
    3  6  1

By default the columns are removed from the DataFrame, though you can leave them in:

In [284]: frame.set_index(['c', 'd'], drop=False)
Out[284]: 
       a  b    c  d
c   d              
one 0  0  7  one  0
    1  1  6  one  1
    2  2  5  one  2
two 0  3  4  two  0
    1  4  3  two  1
    2  5  2  two  2
    3  6  1  two  3

reset_index, on the other hand, does the opposite of set_index; the hierarchical index levels are moved into the columns:

In [285]: frame2.reset_index()
Out[285]: 
     c  d  a  b
0  one  0  0  7
1  one  1  1  6
2  one  2  2  5
3  two  0  3  4
4  two  1  4  3
5  two  2  5  2
6  two  3  6  1
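
A quick sanity check that reset_index undoes set_index, modulo column order (the former index columns come back at the front):

```python
import pandas as pd

frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                      'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                      'd': [0, 1, 2, 0, 1, 2, 3]})

round_trip = frame.set_index(['c', 'd']).reset_index()
# Same data as the original; only the column order differs
```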

Other pandas Topics

Here are some additional topics that may be of use to you in your data travels.

Integer Indexing

Working with pandas objects indexed by integers is something that often trips up new users due to some differences with indexing semantics on built-in Python data structures like lists and tuples. For example, you would not expect the following code to generate an error:

ser = Series(np.arange(3.))
ser[-1]

In this case, pandas could “fall back” on integer indexing, but there’s not a safe and general way (that I know of) to do this without introducing subtle bugs. Here we have an index containing 0, 1, 2, but inferring what the user wants (label-based indexing or position-based) is difficult:

In [287]: ser
Out[287]: 
0    0
1    1
2    2
dtype: float64

On the other hand, with a non-integer index, there is no potential for ambiguity:

In [288]: ser2 = Series(np.arange(3.), index=['a', 'b', 'c'])

In [289]: ser2[-1]
Out[289]: 2.0

To keep things consistent, if you have an axis index containing integers, data selection with integers will always be label-oriented. This includes slicing with ix, too:

In [290]: ser.ix[:1]
Out[290]: 
0    0
1    1
dtype: float64

In cases where you need reliable position-based indexing regardless of the index type, you can use the iget_value method from Series and irow and icol methods from DataFrame:

In [291]: ser3 = Series(range(3), index=[-5, 1, 3])

In [292]: ser3.iget_value(2)
Out[292]: 2

In [293]: frame = DataFrame(np.arange(6).reshape(3, 2), index=[2, 0, 1])

In [294]: frame.irow(0)
Out[294]: 
0    0
1    1
Name: 2, dtype: int64
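
In later pandas versions, iget_value, irow, and icol were consolidated into the single iloc indexer; a sketch of the same selections in that spelling:

```python
import numpy as np
import pandas as pd

ser3 = pd.Series(range(3), index=[-5, 1, 3])
third = ser3.iloc[2]          # position-based, ignores the integer labels

frame = pd.DataFrame(np.arange(6).reshape(3, 2), index=[2, 0, 1])
first_row = frame.iloc[0]     # the first row, whatever its label
```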

Panel Data

While not a major topic of this book, pandas has a Panel data structure, which you can think of as a three-dimensional analogue of DataFrame. Much of the development focus of pandas has been in tabular data manipulations as these are easier to reason about, and hierarchical indexing makes using truly N-dimensional arrays unnecessary in a lot of cases.
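
To make that last point concrete, here is a sketch (with made-up item names) of a small three-dimensional array carried instead as a DataFrame with a two-level row index:

```python
import numpy as np
import pandas as pd

arr = np.arange(12).reshape(2, 3, 2)    # 2 items x 3 rows x 2 columns

# Fold the first axis into an extra index level
idx = pd.MultiIndex.from_product([['item1', 'item2'], [0, 1, 2]],
                                 names=['item', 'major'])
df = pd.DataFrame(arr.reshape(6, 2), index=idx, columns=['x', 'y'])

panel_like = df.loc['item2']            # one "item" is an ordinary DataFrame
```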

To create a Panel, you can use a dict of DataFrame objects or a three-dimensional ndarray:

import pandas.io.data as web

pdata = pd.Panel(dict((stk, web.get_data_yahoo(stk))
                       for stk in ['AAPL', 'GOOG', 'MSFT', 'DELL']))

Each item (the analogue of columns in a DataFrame) in the Panel is a DataFrame:

In [298]: pdata
Out[298]: 
<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 1158 (major_axis) x 6 (minor_axis)
Items axis: AAPL to MSFT
Major_axis axis: 2010-01-04 00:00:00 to 2014-07-11 00:00:00
Minor_axis axis: Open to Adj Close

In [299]: pdata = pdata.swapaxes('items', 'minor')

In [300]: pdata['Adj Close']
Out[300]: 
             AAPL   DELL    GOOG   MSFT
Date                                   
2010-01-04  29.22  14.07     NaN  27.48
2010-01-05  29.27  14.38     NaN  27.49
2010-01-06  28.81  14.10     NaN  27.32
2010-01-07  28.75  14.24     NaN  27.03
2010-01-08  28.94  14.37     NaN  27.22
...           ...    ...     ...    ...
2014-07-07  95.97    NaN  582.25  41.99
2014-07-08  95.35    NaN  571.09  41.78
2014-07-09  95.39    NaN  576.08  41.67
2014-07-10  95.04    NaN  571.10  41.69
2014-07-11  95.22    NaN  579.18  42.09
[1158 rows x 4 columns]

ix-based label indexing generalizes to three dimensions, so we can select all data at a particular date or a range of dates like so:

In [301]: pdata.ix[:, '6/1/2012', :]
Out[301]: 
        Open    High     Low   Close     Volume  Adj Close
AAPL  569.16  572.65  560.52  560.99  130246900      76.60
DELL   12.15   12.30   12.05   12.07   19397600      11.68
GOOG     NaN     NaN     NaN     NaN        NaN        NaN
MSFT   28.76   28.96   28.44   28.45   56634300      26.82

In [302]: pdata.ix['Adj Close', '5/22/2012':, :]
Out[302]: 
             AAPL   DELL    GOOG   MSFT
Date                                   
2012-05-22  76.05  14.59     NaN  28.05
2012-05-23  77.91  12.08     NaN  27.44
2012-05-24  77.19  12.04     NaN  27.40
2012-05-25  76.78  12.05     NaN  27.39
2012-05-28    NaN  12.05     NaN    NaN
...           ...    ...     ...    ...
2014-07-07  95.97    NaN  582.25  41.99
2014-07-08  95.35    NaN  571.09  41.78
2014-07-09  95.39    NaN  576.08  41.67
2014-07-10  95.04    NaN  571.10  41.69
2014-07-11  95.22    NaN  579.18  42.09
[551 rows x 4 columns]

An alternate way to represent panel data, especially for fitting statistical models, is in “stacked” DataFrame form:

In [303]: stacked = pdata.ix[:, '5/30/2012':, :].to_frame()

In [304]: stacked
Out[304]: 
                    Open    High     Low   Close     Volume  Adj Close
Date       minor                                                      
2012-05-30 AAPL   569.20  579.99  566.56  579.17  132357400      79.08
           DELL    12.59   12.70   12.46   12.56   19787800      12.15
           MSFT    29.35   29.48   29.12   29.34   41585500      27.66
2012-05-31 AAPL   580.74  581.50  571.46  577.73  122918600      78.89
           DELL    12.53   12.54   12.33   12.33   19955600      11.93
...                  ...     ...     ...     ...        ...        ...
2014-07-10 GOOG   565.91  576.59  565.01  571.10    1353000     571.10
           MSFT    41.37   42.00   41.05   41.69   21854700      41.69
2014-07-11 AAPL    95.36   95.89   94.86   95.22   33988800      95.22
           GOOG   571.91  580.85  571.42  579.18    1617300     579.18
           MSFT    41.70   42.09   41.48   42.09   24083000      42.09
[1508 rows x 6 columns]

DataFrame has a related to_panel method, the inverse of to_frame:

In [305]: stacked.to_panel()
Out[305]: 
<class 'pandas.core.panel.Panel'>
Dimensions: 6 (items) x 545 (major_axis) x 4 (minor_axis)
Items axis: Open to Adj Close
Major_axis axis: 2012-05-30 00:00:00 to 2014-07-11 00:00:00
Minor_axis axis: AAPL to MSFT