pandas will be the primary library of interest throughout much of the rest of the book. It contains high-level data structures and manipulation tools designed to make data analysis fast and easy in Python. pandas is built on top of NumPy and makes it easy to use in NumPy-centric applications.
As a bit of background, I started building pandas in early 2008 during my tenure at AQR, a quantitative investment management firm. At the time, I had a distinct set of requirements that were not well-addressed by any single tool at my disposal:
Data structures with labeled axes supporting automatic or explicit data alignment. This prevents common errors resulting from misaligned data and working with differently-indexed data coming from different sources.
Integrated time series functionality.
The same data structures handle both time series data and non-time series data.
Arithmetic operations and reductions (like summing across an axis) would pass on the metadata (axis labels).
Flexible handling of missing data.
Merge and other relational operations found in popular database systems (SQL-based, for example).
I wanted to be able to do all of these things in one place, preferably in a language well-suited to general purpose software development. Python was a good candidate language for this, but at that time there was not an integrated set of data structures and tools providing this functionality.
Over the last four years, pandas has matured into a quite large library capable of solving a much broader set of data handling problems than I ever anticipated, but it has expanded in its scope without compromising the simplicity and ease-of-use that I desired from the very beginning. I hope that after reading this book, you will find it to be just as much of an indispensable tool as I do.
Throughout the rest of the book, I use the following import conventions for pandas:
In [1]: from pandas import Series, DataFrame

In [2]: import pandas as pd
Thus, whenever you see pd. in code, it’s referring to pandas. Series and DataFrame are used so much that I find it easier to import them into the local namespace.
To get started with pandas, you will need to get comfortable with its two workhorse data structures: Series and DataFrame. While they are not a universal solution for every problem, they provide a solid, easy-to-use basis for most applications.
A Series is a one-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of data labels, called its index. The simplest Series is formed from only an array of data:
In [4]: obj = Series([4, 7, -5, 3])

In [5]: obj
Out[5]:
0    4
1    7
2   -5
3    3
dtype: int64
The string representation of a Series displayed interactively shows the index on the left and the values on the right. Since we did not specify an index for the data, a default one consisting of the integers 0 through N - 1 (where N is the length of the data) is created. You can get the array representation and index object of the Series via its values and index attributes, respectively:
In [6]: obj.values
Out[6]: array([ 4,  7, -5,  3])

In [7]: obj.index
Out[7]: Int64Index([0, 1, 2, 3], dtype='int64')
Often it will be desirable to create a Series with an index identifying each data point:
In [8]: obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

In [9]: obj2
Out[9]:
d    4
b    7
a   -5
c    3
dtype: int64

In [10]: obj2.index
Out[10]: Index([u'd', u'b', u'a', u'c'], dtype='object')
Compared with a regular NumPy array, you can use values in the index when selecting single values or a set of values:
In [11]: obj2['a']
Out[11]: -5

In [12]: obj2['d'] = 6

In [13]: obj2[['c', 'a', 'd']]
Out[13]:
c    3
a   -5
d    6
dtype: int64
NumPy array operations, such as filtering with a boolean array, scalar multiplication, or applying math functions, will preserve the index-value link:
In [14]: obj2[obj2 > 0]
Out[14]:
d    6
b    7
c    3
dtype: int64

In [15]: obj2 * 2
Out[15]:
d    12
b    14
a   -10
c     6
dtype: int64

In [16]: np.exp(obj2)
Out[16]:
d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64
Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values. It can be substituted into many functions that expect a dict:
In [17]: 'b' in obj2
Out[17]: True

In [18]: 'e' in obj2
Out[18]: False
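To make the dict analogy concrete, here is a small sketch (the mapping used is purely illustrative) showing that a Series round-trips cleanly to and from a plain dict:

```python
import pandas as pd

# A Series behaves like a fixed-length, ordered dict: it can be built
# from a mapping and converted back with to_dict().
prices = {'apple': 1.5, 'banana': 0.5}
s = pd.Series(prices)

assert 'apple' in s            # membership tests check the index
assert s.to_dict() == prices   # round-trips back to the original dict
```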
Should you have data contained in a Python dict, you can create a Series from it by passing the dict:
In [19]: sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

In [20]: obj3 = Series(sdata)

In [21]: obj3
Out[21]:
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64
When only passing a dict, the index in the resulting Series will have the dict’s keys in sorted order.
In [22]: states = ['California', 'Ohio', 'Oregon', 'Texas']

In [23]: obj4 = Series(sdata, index=states)

In [24]: obj4
Out[24]:
California      NaN
Ohio          35000
Oregon        16000
Texas         71000
dtype: float64
In this case, 3 values found in sdata were placed in the appropriate locations, but since no value for 'California' was found, it appears as NaN (not a number), which is considered in pandas to mark missing or NA values. I will use the terms “missing” or “NA” to refer to missing data. The isnull and notnull functions in pandas should be used to detect missing data:
In [25]: pd.isnull(obj4)
Out[25]:
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [26]: pd.notnull(obj4)
Out[26]:
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
Series also has these as instance methods:
In [27]: obj4.isnull() Out[27]: California True Ohio False Oregon False Texas False dtype: bool
I discuss working with missing data in more detail later in this chapter.
A critical Series feature for many applications is that it automatically aligns differently-indexed data in arithmetic operations:
In [28]: obj3
Out[28]:
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

In [29]: obj4
Out[29]:
California      NaN
Ohio          35000
Oregon        16000
Texas         71000
dtype: float64

In [30]: obj3 + obj4
Out[30]:
California       NaN
Ohio           70000
Oregon         32000
Texas         142000
Utah             NaN
dtype: float64
Data alignment features are addressed as a separate topic.
Both the Series object itself and its index have a name attribute, which integrates with other key areas of pandas functionality:
In [31]: obj4.name = 'population'

In [32]: obj4.index.name = 'state'

In [33]: obj4
Out[33]:
state
California      NaN
Ohio          35000
Oregon        16000
Texas         71000
Name: population, dtype: float64
A Series’s index can be altered in place by assignment:
In [34]: obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']

In [35]: obj
Out[35]:
Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64
A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row and column index; it can be thought of as a dict of Series (all sharing the same index). Compared with other such DataFrame-like structures you may have used before (like R’s data.frame), row-oriented and column-oriented operations in DataFrame are treated roughly symmetrically. Under the hood, the data is stored as one or more two-dimensional blocks rather than a list, dict, or some other collection of one-dimensional arrays. The exact details of DataFrame’s internals are far outside the scope of this book.
While DataFrame stores the data internally in a two-dimensional format, you can easily represent much higher-dimensional data in a tabular format using hierarchical indexing, a subject of a later section and a key ingredient in many of the more advanced data-handling features in pandas.
There are numerous ways to construct a DataFrame, though one of the most common is from a dict of equal-length lists or NumPy arrays:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
The resulting DataFrame will have its index assigned automatically as with Series, and the columns are placed in sorted order:
In [37]: frame
Out[37]:
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002
If you specify a sequence of columns, the DataFrame’s columns will be exactly what you pass:
In [38]: DataFrame(data, columns=['year', 'state', 'pop'])
Out[38]:
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
As with Series, if you pass a column that isn’t contained in data, it will appear with NA values in the result:
In [39]: frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
   ....:                    index=['one', 'two', 'three', 'four', 'five'])

In [40]: frame2
Out[40]:
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN

In [41]: frame2.columns
Out[41]: Index([u'year', u'state', u'pop', u'debt'], dtype='object')
A column in a DataFrame can be retrieved as a Series either by dict-like notation or by attribute:
In [42]: frame2['state']
Out[42]:
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

In [43]: frame2.year
Out[43]:
one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64
Note that the returned Series have the same index as the DataFrame, and their name attribute has been appropriately set. Rows can also be retrieved by position or name by a couple of methods, such as the ix indexing field (much more on this later):
In [44]: frame2.ix['three']
Out[44]:
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object
Columns can be modified by assignment. For example, the empty 'debt' column could be assigned a scalar value or an array of values:
In [45]: frame2['debt'] = 16.5

In [46]: frame2
Out[46]:
       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5

In [47]: frame2['debt'] = np.arange(5.)

In [48]: frame2
Out[48]:
       year   state  pop  debt
one    2000    Ohio  1.5     0
two    2001    Ohio  1.7     1
three  2002    Ohio  3.6     2
four   2001  Nevada  2.4     3
five   2002  Nevada  2.9     4
When assigning lists or arrays to a column, the value’s length must match the length of the DataFrame. If you assign a Series, it will be instead conformed exactly to the DataFrame’s index, inserting missing values in any holes:
In [49]: val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])

In [50]: frame2['debt'] = val

In [51]: frame2
Out[51]:
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
Assigning a column that doesn’t exist will create a new column. The del keyword will delete columns as with a dict:
In [52]: frame2['eastern'] = frame2.state == 'Ohio'

In [53]: frame2
Out[53]:
       year   state  pop  debt eastern
one    2000    Ohio  1.5   NaN    True
two    2001    Ohio  1.7  -1.2    True
three  2002    Ohio  3.6   NaN    True
four   2001  Nevada  2.4  -1.5   False
five   2002  Nevada  2.9  -1.7   False

In [54]: del frame2['eastern']

In [55]: frame2.columns
Out[55]: Index([u'year', u'state', u'pop', u'debt'], dtype='object')
The column returned when indexing a DataFrame is a view on the underlying data, not a copy. Thus, any in-place modifications to the Series will be reflected in the DataFrame. The column can be explicitly copied using the Series’s copy method.
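A short sketch of the defensive-copying point: mutating an explicit copy leaves the parent DataFrame untouched (whether an uncopied column view propagates changes back is version-dependent, so copy is the safe choice):

```python
import pandas as pd

# Copying a column breaks the link to the parent DataFrame, so edits
# to the copy do not appear in the original.
df = pd.DataFrame({'x': [1, 2, 3]})
col = df['x'].copy()
col.iloc[0] = 99

assert df['x'][0] == 1    # parent DataFrame unaffected
assert col.iloc[0] == 99
```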
Another common form of data is a nested dict of dicts format:
In [56]: pop = {'Nevada': {2001: 2.4, 2002: 2.9},
   ....:        'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
If passed to DataFrame, it will interpret the outer dict keys as the columns and the inner keys as the row indices:
In [57]: frame3 = DataFrame(pop)

In [58]: frame3
Out[58]:
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6
Of course you can always transpose the result:
In [59]: frame3.T
Out[59]:
        2000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6
The keys in the inner dicts are unioned and sorted to form the index in the result. This isn’t true if an explicit index is specified:
In [60]: DataFrame(pop, index=[2001, 2002, 2003])
Out[60]:
      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN
Dicts of Series are treated much in the same way:
In [61]: pdata = {'Ohio': frame3['Ohio'][:-1],
   ....:          'Nevada': frame3['Nevada'][:2]}

In [62]: DataFrame(pdata)
Out[62]:
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
For a complete list of things you can pass the DataFrame constructor, see Table 5-1.
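One more constructor case from Table 5-1 worth seeing in action: a list of dicts becomes rows, with the union of the dict keys forming the columns and NaN filling any missing entries (the keys here are illustrative):

```python
import pandas as pd

# Each dict becomes a row; 'b' is absent from the second row and 'c'
# from the first, so those cells are filled with NaN.
rows = [{'a': 1, 'b': 2}, {'a': 3, 'c': 4}]
df = pd.DataFrame(rows)

assert list(df.columns) == ['a', 'b', 'c']
assert pd.isnull(df.loc[0, 'c'])
```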
If a DataFrame’s index and columns have their name attributes set, these will also be displayed:
In [63]: frame3.index.name = 'year'; frame3.columns.name = 'state'

In [64]: frame3
Out[64]:
state  Nevada  Ohio
year
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6
Like Series, the values attribute returns the data contained in the DataFrame as a 2D ndarray:
In [65]: frame3.values
Out[65]:
array([[ nan,  1.5],
       [ 2.4,  1.7],
       [ 2.9,  3.6]])
If the DataFrame’s columns are different dtypes, the dtype of the values array will be chosen to accommodate all of the columns:
In [66]: frame2.values
Out[66]:
array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7]], dtype=object)
Table 5-1. Possible data inputs to DataFrame constructor
Type | Notes |
---|---|
2D ndarray | A matrix of data, passing optional row and column labels |
dict of arrays, lists, or tuples | Each sequence becomes a column in the DataFrame. All sequences must be the same length. |
NumPy structured/record array | Treated as the “dict of arrays” case |
dict of Series | Each value becomes a column. Indexes from each Series are unioned together to form the result’s row index if no explicit index is passed. |
dict of dicts | Each inner dict becomes a column. Keys are unioned to form the row index as in the “dict of Series” case. |
list of dicts or Series | Each item becomes a row in the DataFrame. Union of dict keys or Series indexes become the DataFrame’s column labels. |
List of lists or tuples | Treated as the “2D ndarray” case |
Another DataFrame | The DataFrame’s indexes are used unless different ones are passed |
NumPy MaskedArray | Like the “2D ndarray” case except masked values become NA/missing in the DataFrame result |
pandas’s Index objects are responsible for holding the axis labels and other metadata (like the axis name or names). Any array or other sequence of labels used when constructing a Series or DataFrame is internally converted to an Index:
In [67]: obj = Series(range(3), index=['a', 'b', 'c'])

In [68]: index = obj.index

In [69]: index
Out[69]: Index([u'a', u'b', u'c'], dtype='object')

In [70]: index[1:]
Out[70]: Index([u'b', u'c'], dtype='object')
Index objects are immutable and thus can’t be modified by the user:
In [71]: index[1] = 'd'
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-71-676fdeb26a68> in <module>()
----> 1 index[1] = 'd'

/home/phillip/miniconda3/envs/conda2/lib/python2.7/site-packages/pandas/core/base.pyc in _disabled(self, *args, **kwargs)
    177         """This method will not function because object is immutable."""
    178         raise TypeError("'%s' does not support mutable operations." %
--> 179                         self.__class__)
    180
    181     __setitem__ = __setslice__ = __delitem__ = __delslice__ = _disabled

TypeError: '<class 'pandas.core.index.Index'>' does not support mutable operations.
Immutability is important so that Index objects can be safely shared among data structures:
In [72]: index = pd.Index(np.arange(3))

In [73]: obj2 = Series([1.5, -2.5, 0], index=index)

In [74]: obj2.index is index
Out[74]: True
Table 5-2 has a list of built-in Index classes in the library. With some development effort, Index can even be subclassed to implement specialized axis indexing functionality.
Many users will not need to know much about Index objects, but they’re nonetheless an important part of pandas’s data model.
Table 5-2. Main Index objects in pandas
In addition to being array-like, an Index also functions as a fixed-size set:
In [75]: frame3
Out[75]:
state  Nevada  Ohio
year
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6

In [76]: 'Ohio' in frame3.columns
Out[76]: True

In [77]: 2003 in frame3.index
Out[77]: False
Each Index has a number of methods and properties for set logic and answering other common questions about the data it contains. These are summarized in Table 5-3.
Table 5-3. Index methods and properties
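To give a flavor of the set logic summarized in Table 5-3, here is a small sketch of the Index methods union, intersection, and difference, each of which returns a new Index:

```python
import pandas as pd

# Index set operations: none of these mutate the operands.
a = pd.Index(['a', 'b', 'c'])
b = pd.Index(['b', 'c', 'd'])

assert list(a.union(b)) == ['a', 'b', 'c', 'd']
assert list(a.intersection(b)) == ['b', 'c']
assert list(a.difference(b)) == ['a']
```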
In this section, I’ll walk you through the fundamental mechanics of interacting with the data contained in a Series or DataFrame. Upcoming chapters will delve more deeply into data analysis and manipulation topics using pandas. This book is not intended to serve as exhaustive documentation for the pandas library; I instead focus on the most important features, leaving the less common (that is, more esoteric) things for you to explore on your own.
A critical method on pandas objects is reindex, which means to create a new object with the data conformed to a new index. Consider a simple example from above:
In [78]: obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

In [79]: obj
Out[79]:
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
Calling reindex on this Series rearranges the data according to the new index, introducing missing values if any index values were not already present:
In [80]: obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])

In [81]: obj2
Out[81]:
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64
The fill_value option substitutes a value of your choosing for any labels introduced by reindexing:

In [82]: obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)
Out[82]:
a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64
For ordered data like time series, it may be desirable to do some interpolation or filling of values when reindexing. The method option allows us to do this, using a method such as ffill, which forward fills the values:
In [83]: obj3 = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

In [84]: obj3.reindex(range(6), method='ffill')
Out[84]:
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
Table 5-4 lists available method options. At this time, interpolation more sophisticated than forward- and backfilling would need to be applied after the fact.
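The counterpart to ffill is bfill, which propagates the next valid observation backward; a quick sketch:

```python
import pandas as pd

# 'bfill' fills each hole with the next valid value below it; labels
# past the last valid observation stay NaN.
s = pd.Series(['blue', 'purple'], index=[0, 3])
filled = s.reindex(range(5), method='bfill')

assert filled[1] == 'purple'     # filled from the value at label 3
assert pd.isnull(filled[4])      # nothing below label 4 to fill from
```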
With DataFrame, reindex can alter either the (row) index, columns, or both. When passed just a sequence, the rows are reindexed in the result:
In [85]: frame = DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'],
   ....:                   columns=['Ohio', 'Texas', 'California'])

In [86]: frame
Out[86]:
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

In [87]: frame2 = frame.reindex(['a', 'b', 'c', 'd'])

In [88]: frame2
Out[88]:
   Ohio  Texas  California
a     0      1           2
b   NaN    NaN         NaN
c     3      4           5
d     6      7           8
The columns can be reindexed using the columns keyword:
In [89]: states = ['Texas', 'Utah', 'California']

In [90]: frame.reindex(columns=states)
Out[90]:
   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8
Both can be reindexed in one shot, though interpolation will only apply row-wise (axis 0):
In [91]: frame.reindex(index=['a', 'b', 'c', 'd'], method='ffill',
   ....:               columns=states)
Out[91]:
   Texas  Utah  California
a      1   NaN           2
b      1   NaN           2
c      4   NaN           5
d      7   NaN           8
As you’ll see soon, reindexing can be done more succinctly by label-indexing with ix:
In [92]: frame.ix[['a', 'b', 'c', 'd'], states]
Out[92]:
   Texas  Utah  California
a      1   NaN           2
b    NaN   NaN         NaN
c      4   NaN           5
d      7   NaN           8
Table 5-5. reindex function arguments
Argument | Description |
---|---|
index | New sequence to use as index. Can be Index instance or any other sequence-like Python data structure. An Index will be used exactly as is without any copying. |
method | Interpolation (fill) method; see Table 5-4 for options. |
fill_value | Substitute value to use when introducing missing data by reindexing |
limit | When forward- or backfilling, maximum size gap to fill |
level | Match simple Index on level of MultiIndex, otherwise select subset of |
copy | If True, always copy underlying data even if the new index is equivalent to the old index. If False, do not copy the data when the indexes are equivalent. |
Dropping one or more entries from an axis is easy if you have an index array or list without those entries. As that can require a bit of munging and set logic, the drop method will return a new object with the indicated value or values deleted from an axis:
In [93]: obj = Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

In [94]: new_obj = obj.drop('c')

In [95]: new_obj
Out[95]:
a    0
b    1
d    3
e    4
dtype: float64

In [96]: obj.drop(['d', 'c'])
Out[96]:
a    0
b    1
e    4
dtype: float64
With DataFrame, index values can be deleted from either axis:
In [97]: data = DataFrame(np.arange(16).reshape((4, 4)),
   ....:                  index=['Ohio', 'Colorado', 'Utah', 'New York'],
   ....:                  columns=['one', 'two', 'three', 'four'])

In [98]: data.drop(['Colorado', 'Ohio'])
Out[98]:
          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15

In [99]: data.drop('two', axis=1)
Out[99]:
          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15

In [100]: data.drop(['two', 'four'], axis=1)
Out[100]:
          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14
Series indexing (obj[...]) works analogously to NumPy array indexing, except you can use the Series’s index values instead of only integers. Here are some examples of this:
In [101]: obj = Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

In [102]: obj['b']
Out[102]: 1.0

In [103]: obj[1]
Out[103]: 1.0

In [104]: obj[2:4]
Out[104]:
c    2
d    3
dtype: float64

In [105]: obj[['b', 'a', 'd']]
Out[105]:
b    1
a    0
d    3
dtype: float64

In [106]: obj[[1, 3]]
Out[106]:
b    1
d    3
dtype: float64

In [107]: obj[obj < 2]
Out[107]:
a    0
b    1
dtype: float64
Slicing with labels behaves differently than normal Python slicing in that the endpoint is inclusive:
In [108]: obj['b':'c']
Out[108]:
b    1
c    2
dtype: float64
Setting using these methods works just as you would expect:
In [109]: obj['b':'c'] = 5

In [110]: obj
Out[110]:
a    0
b    5
c    5
d    3
dtype: float64
As you’ve seen above, indexing into a DataFrame retrieves one or more columns, either with a single value or a sequence:
In [111]: data = DataFrame(np.arange(16).reshape((4, 4)),
   .....:                  index=['Ohio', 'Colorado', 'Utah', 'New York'],
   .....:                  columns=['one', 'two', 'three', 'four'])

In [112]: data
Out[112]:
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

In [113]: data['two']
Out[113]:
Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64

In [114]: data[['three', 'one']]
Out[114]:
          three  one
Ohio          2    0
Colorado      6    4
Utah         10    8
New York     14   12
Indexing like this has a few special cases. First, selecting rows by slicing or with a boolean array:
In [115]: data[:2]
Out[115]:
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7

In [116]: data[data['three'] > 5]
Out[116]:
          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
This might seem inconsistent to some readers, but this syntax arose out of practicality and nothing more. Another use case is in indexing with a boolean DataFrame, such as one produced by a scalar comparison:
In [117]: data < 5
Out[117]:
            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False

In [118]: data[data < 5] = 0

In [119]: data
Out[119]:
          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
This is intended to make DataFrame syntactically more like an ndarray in this case.
For DataFrame label-indexing on the rows, I introduce the special indexing field ix. It enables you to select a subset of the rows and columns from a DataFrame with NumPy-like notation plus axis labels. As I mentioned earlier, this is also a less verbose way to do reindexing:
In [120]: data.ix['Colorado', ['two', 'three']]
Out[120]:
two      5
three    6
Name: Colorado, dtype: int64

In [121]: data.ix[['Colorado', 'Utah'], [3, 0, 1]]
Out[121]:
          four  one  two
Colorado     7    0    5
Utah        11    8    9

In [122]: data.ix[2]
Out[122]:
one       8
two       9
three    10
four     11
Name: Utah, dtype: int64

In [123]: data.ix[:'Utah', 'two']
Out[123]:
Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int64

In [124]: data.ix[data.three > 5, :3]
Out[124]:
          one  two  three
Colorado    0    5      6
Utah        8    9     10
New York   12   13     14
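A note for readers on newer pandas releases: ix was later deprecated and removed in favor of loc (label-based) and iloc (integer position-based). The examples above translate roughly as follows:

```python
import pandas as pd
import numpy as np

# .loc selects by label, .iloc by integer position; together they
# cover the use cases .ix served.
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

row = data.loc['Colorado', ['two', 'three']]   # label-based selection
third = data.iloc[2]                           # position-based: the Utah row

assert row['two'] == 5
assert third['one'] == 8
```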
So there are many ways to select and rearrange the data contained in a pandas object. For DataFrame, there is a short summary of many of them in Table 5-6. You have a number of additional options when working with hierarchical indexes as you’ll later see.
When designing pandas, I felt that having to type frame[:, col] to select a column was too verbose (and error-prone), since column selection is one of the most common operations. Thus I made the design trade-off to push all of the rich label-indexing into ix.
Table 5-6. Indexing options with DataFrame
One of the most important pandas features is the behavior of arithmetic between objects with different indexes. When adding together objects, if any index pairs are not the same, the respective index in the result will be the union of the index pairs. Let’s look at a simple example:
In [125]: s1 = Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])

In [126]: s2 = Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

In [127]: s1
Out[127]:
a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

In [128]: s2
Out[128]:
a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64
Adding these together yields:
In [129]: s1 + s2
Out[129]:
a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64
The internal data alignment introduces NA values in the indices that don’t overlap. Missing values propagate in arithmetic computations.
In the case of DataFrame, alignment is performed on both the rows and the columns:
In [130]: df1 = DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
   .....:                 index=['Ohio', 'Texas', 'Colorado'])

In [131]: df2 = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
   .....:                 index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [132]: df1
Out[132]:
          b  c  d
Ohio      0  1  2
Texas     3  4  5
Colorado  6  7  8

In [133]: df2
Out[133]:
        b   d   e
Utah    0   1   2
Ohio    3   4   5
Texas   6   7   8
Oregon  9  10  11
Adding these together returns a DataFrame whose index and columns are the unions of the ones in each DataFrame:
In [134]: df1 + df2
Out[134]:
           b   c   d   e
Colorado NaN NaN NaN NaN
Ohio       3 NaN   6 NaN
Oregon   NaN NaN NaN NaN
Texas      9 NaN  12 NaN
Utah     NaN NaN NaN NaN
In arithmetic operations between differently-indexed objects, you might want to fill with a special value, like 0, when an axis label is found in one object but not the other:
In [135]: df1 = DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))

In [136]: df2 = DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))

In [137]: df1
Out[137]:
   a  b   c   d
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11

In [138]: df2
Out[138]:
    a   b   c   d   e
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19
Adding these together results in NA values in the locations that don’t overlap:
In [139]: df1 + df2
Out[139]:
    a   b   c   d   e
0   0   2   4   6 NaN
1   9  11  13  15 NaN
2  18  20  22  24 NaN
3 NaN NaN NaN NaN NaN
Using the add method on df1, I pass df2 and an argument to fill_value:
In [140]: df1.add(df2, fill_value=0)
Out[140]:
    a   b   c   d   e
0   0   2   4   6   4
1   9  11  13  15   9
2  18  20  22  24  14
3  15  16  17  18  19
Relatedly, when reindexing a Series or DataFrame, you can also specify a different fill value:
In [141]: df1.reindex(columns=df2.columns, fill_value=0)
Out[141]:
   a  b   c   d  e
0  0  1   2   3  0
1  4  5   6   7  0
2  8  9  10  11  0
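add has siblings sub, mul, and div that accept the same fill_value argument; a quick sketch with sub (the frames here are small illustrative examples, not the ones above):

```python
import pandas as pd
import numpy as np

# Missing cells are treated as fill_value (0 here) before subtracting,
# so df1's absent 'c' column contributes zeros.
df1 = pd.DataFrame(np.arange(4.).reshape((2, 2)), columns=list('ab'))
df2 = pd.DataFrame(np.arange(6.).reshape((2, 3)), columns=list('abc'))
result = df1.sub(df2, fill_value=0)

assert result.loc[0, 'c'] == -2.0   # 0 (filled) - 2
assert result.loc[1, 'a'] == -1.0   # 2 - 3
```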
As with NumPy arrays, arithmetic between DataFrame and Series is well-defined. First, as a motivating example, consider the difference between a 2D array and one of its rows:
In [142]: arr = np.arange(12.).reshape((3, 4))

In [143]: arr
Out[143]:
array([[  0.,   1.,   2.,   3.],
       [  4.,   5.,   6.,   7.],
       [  8.,   9.,  10.,  11.]])

In [144]: arr[0]
Out[144]: array([ 0.,  1.,  2.,  3.])

In [145]: arr - arr[0]
Out[145]:
array([[ 0.,  0.,  0.,  0.],
       [ 4.,  4.,  4.,  4.],
       [ 8.,  8.,  8.,  8.]])
This is referred to as broadcasting and is explained in more detail in Chapter 12. Operations between a DataFrame and a Series are similar:
In [146]: frame = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
   .....:                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [147]: series = frame.ix[0]

In [148]: frame
Out[148]:
        b   d   e
Utah    0   1   2
Ohio    3   4   5
Texas   6   7   8
Oregon  9  10  11

In [149]: series
Out[149]:
b    0
d    1
e    2
Name: Utah, dtype: float64
By default, arithmetic between DataFrame and Series matches the index of the Series on the DataFrame’s columns, broadcasting down the rows:
In [150]: frame - series
Out[150]:
        b  d  e
Utah    0  0  0
Ohio    3  3  3
Texas   6  6  6
Oregon  9  9  9
If an index value is not found in either the DataFrame’s columns or the Series’s index, the objects will be reindexed to form the union:
In [151]: series2 = Series(range(3), index=['b', 'e', 'f'])

In [152]: frame + series2
Out[152]:
        b   d   e   f
Utah    0 NaN   3 NaN
Ohio    3 NaN   6 NaN
Texas   6 NaN   9 NaN
Oregon  9 NaN  12 NaN
If you want to instead broadcast over the columns, matching on the rows, you have to use one of the arithmetic methods. For example:
In [153]: series3 = frame['d']

In [154]: frame
Out[154]:
        b   d   e
Utah    0   1   2
Ohio    3   4   5
Texas   6   7   8
Oregon  9  10  11

In [155]: series3
Out[155]:
Utah       1
Ohio       4
Texas      7
Oregon    10
Name: d, dtype: float64

In [156]: frame.sub(series3, axis=0)
Out[156]:
        b  d  e
Utah   -1  0  1
Ohio   -1  0  1
Texas  -1  0  1
Oregon -1  0  1
The axis number that you pass is the axis to match on. In this case we mean to match on the DataFrame’s row index and broadcast across.
NumPy ufuncs (element-wise array methods) work fine with pandas objects:
In [157]: frame = DataFrame(np.random.randn(4, 3), columns=list('bde'),
   .....:                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [158]: frame
Out[158]:
               b         d         e
Utah   -0.204708  0.478943 -0.519439
Ohio   -0.555730  1.965781  1.393406
Texas   0.092908  0.281746  0.769023
Oregon  1.246435  1.007189 -1.296221

In [159]: np.abs(frame)
Out[159]:
               b         d         e
Utah    0.204708  0.478943  0.519439
Ohio    0.555730  1.965781  1.393406
Texas   0.092908  0.281746  0.769023
Oregon  1.246435  1.007189  1.296221
Another frequent operation is applying a function on 1D arrays to each column or row. DataFrame’s apply method does exactly this:
In [160]: f = lambda x: x.max() - x.min()

In [161]: frame.apply(f)
Out[161]:
b    1.802165
d    1.684034
e    2.689627
dtype: float64

In [162]: frame.apply(f, axis=1)
Out[162]:
Utah      0.998382
Ohio      2.521511
Texas     0.676115
Oregon    2.542656
dtype: float64
Many of the most common array statistics (like sum and mean) are DataFrame methods, so using apply is not necessary.
The function passed to apply need not return a scalar value; it can also return a Series with multiple values:
In [163]: def f(x):
   .....:     return Series([x.min(), x.max()], index=['min', 'max'])

In [164]: frame.apply(f)
Out[164]:
            b         d         e
min -0.555730  0.281746 -1.296221
max  1.246435  1.965781  1.393406
Element-wise Python functions can be used, too. Suppose you wanted to compute a formatted string from each floating point value in frame. You can do this with applymap:
In [165]: format = lambda x: '%.2f' % x

In [166]: frame.applymap(format)
Out[166]:
            b     d      e
Utah    -0.20  0.48  -0.52
Ohio    -0.56  1.97   1.39
Texas    0.09  0.28   0.77
Oregon   1.25  1.01  -1.30
The reason for the name applymap is that Series has a map method for applying an element-wise function:
In [167]: frame['e'].map(format)
Out[167]:
Utah      -0.52
Ohio       1.39
Texas      0.77
Oregon    -1.30
Name: e, dtype: object
Sorting a data set by some criterion is another important built-in operation. To sort lexicographically by row or column index, use the sort_index method, which returns a new, sorted object:
In [168]: obj = Series(range(4), index=['d', 'a', 'b', 'c'])

In [169]: obj.sort_index()
Out[169]:
a    1
b    2
c    3
d    0
dtype: int64
With a DataFrame, you can sort by index on either axis:
In [170]: frame = DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'],
   .....:                   columns=['d', 'a', 'b', 'c'])

In [171]: frame.sort_index()
Out[171]:
       d  a  b  c
one    4  5  6  7
three  0  1  2  3

In [172]: frame.sort_index(axis=1)
Out[172]:
       a  b  c  d
three  1  2  3  0
one    5  6  7  4
The data is sorted in ascending order by default, but can be sorted in descending order, too:
In [173]: frame.sort_index(axis=1, ascending=False)
Out[173]:
       d  c  b  a
three  0  3  2  1
one    4  7  6  5
To sort a Series by its values, use its order method:
In [174]: obj = Series([4, 7, -3, 2])

In [175]: obj.order()
Out[175]:
2   -3
3    2
0    4
1    7
dtype: int64
Any missing values are sorted to the end of the Series by default:
In [176]: obj = Series([4, np.nan, 7, np.nan, -3, 2])

In [177]: obj.order()
Out[177]:
4    -3
5     2
0     4
2     7
1   NaN
3   NaN
dtype: float64
On DataFrame, you may want to sort by the values in one or more columns. To do so, pass one or more column names to the by option:
In [178]: frame = DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

In [179]: frame
Out[179]:
   a  b
0  0  4
1  1  7
2  0 -3
3  1  2

In [180]: frame.sort_index(by='b')
Out[180]:
   a  b
2  0 -3
3  1  2
0  0  4
1  1  7
To sort by multiple columns, pass a list of names:
In [181]: frame.sort_index(by=['a', 'b'])
Out[181]:
   a  b
2  0 -3
0  0  4
3  1  2
1  1  7
Ranking is closely related to sorting, assigning ranks from one through the number of valid data points in an array. It is similar to the indirect sort indices produced by numpy.argsort, except that ties are broken according to a rule. The rank methods for Series and DataFrame are the place to look; by default rank breaks ties by assigning each group the mean rank:
In [182]: obj = Series([7, -5, 7, 4, 2, 0, 4])

In [183]: obj.rank()
Out[183]:
0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64
Ranks can also be assigned according to the order they’re observed in the data:
In [184]: obj.rank(method='first')
Out[184]:
0    6
1    1
2    7
3    4
4    3
5    2
6    5
dtype: float64
Naturally, you can rank in descending order, too:
In [185]: obj.rank(ascending=False, method='max')
Out[185]:
0    2
1    7
2    2
3    4
4    5
5    6
6    4
dtype: float64
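The relationship to numpy.argsort mentioned above can be made concrete. As a sketch (not from the book's session), applying a stable argsort twice converts sort indices into ranks, which reproduces method='first':

```python
import numpy as np
from pandas import Series

obj = Series([7, -5, 7, 4, 2, 0, 4])

# method='first' assigns ranks in order of appearance, so it has no ties;
# it coincides with the 1-based ranks recovered by applying a stable
# argsort twice (argsort of argsort turns sort indices into ranks)
first_ranks = obj.rank(method='first').astype(int).values
indirect = np.argsort(np.argsort(obj.values, kind='stable')) + 1
```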
See Table 5-8 for a list of tie-breaking methods available. DataFrame can compute ranks over the rows or the columns:
In [186]: frame = DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
   .....:                    'c': [-2, 5, 8, -2.5]})

In [187]: frame
Out[187]:
   a    b    c
0  0  4.3 -2.0
1  1  7.0  5.0
2  0 -3.0  8.0
3  1  2.0 -2.5

In [188]: frame.rank(axis=1)
Out[188]:
   a  b  c
0  2  3  1
1  1  3  2
2  2  1  3
3  2  3  1
Up until now all of the examples I’ve shown you have had unique axis labels (index values). While many pandas functions (like reindex) require that the labels be unique, it’s not mandatory. Let’s consider a small Series with duplicate indices:
In [189]: obj = Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

In [190]: obj
Out[190]:
a    0
a    1
b    2
b    3
c    4
dtype: int64
The index’s is_unique property can tell you whether its values are unique or not:
In [191]: obj.index.is_unique Out[191]: False
Data selection is one of the main things that behaves differently with duplicates. Indexing a value with multiple entries returns a Series while single entries return a scalar value:
In [192]: obj['a']
Out[192]:
a    0
a    1
dtype: int64

In [193]: obj['c']
Out[193]: 4
The same logic extends to indexing rows in a DataFrame:
In [194]: df = DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])

In [195]: df
Out[195]:
          0         1         2
a  0.274992  0.228913  1.352917
a  0.886429 -2.001637 -0.371843
b  1.669025 -0.438570 -0.539741
b  0.476985  3.248944 -1.021228

In [196]: df.ix['b']
Out[196]:
          0         1         2
b  1.669025 -0.438570 -0.539741
b  0.476985  3.248944 -1.021228
pandas objects are equipped with a set of common mathematical and statistical methods. Most of these fall into the category of reductions or summary statistics, methods that extract a single value (like the sum or mean) from a Series or a Series of values from the rows or columns of a DataFrame. Compared with the equivalent methods of vanilla NumPy arrays, they are all built from the ground up to exclude missing data. Consider a small DataFrame:
In [197]: df = DataFrame([[1.4, np.nan], [7.1, -4.5],
   .....:                 [np.nan, np.nan], [0.75, -1.3]],
   .....:                index=['a', 'b', 'c', 'd'],
   .....:                columns=['one', 'two'])

In [198]: df
Out[198]:
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3
Calling DataFrame’s sum method returns a Series containing column sums:
In [199]: df.sum()
Out[199]:
one    9.25
two   -5.80
dtype: float64
Passing axis=1 sums over the rows instead:
In [200]: df.sum(axis=1)
Out[200]:
a    1.40
b    2.60
c     NaN
d   -0.55
dtype: float64
NA values are excluded unless the entire slice (row or column in this case) is NA. This can be disabled using the skipna option:
In [201]: df.mean(axis=1, skipna=False)
Out[201]:
a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64
See Table 5-9 for a list of common options for each of these reduction methods.
Some methods, like idxmin and idxmax, return indirect statistics like the index value where the minimum or maximum values are attained:
In [202]: df.idxmax()
Out[202]:
one    b
two    d
dtype: object
Other methods are accumulations:
In [203]: df.cumsum()
Out[203]:
    one  two
a  1.40  NaN
b  8.50 -4.5
c   NaN  NaN
d  9.25 -5.8
Another type of method is neither a reduction nor an accumulation. describe is one such example, producing multiple summary statistics in one shot:
In [204]: df.describe()
Out[204]:
            one       two
count  3.000000  2.000000
mean   3.083333 -2.900000
std    3.493685  2.262742
min    0.750000 -4.500000
25%    1.075000 -3.700000
50%    1.400000 -2.900000
75%    4.250000 -2.100000
max    7.100000 -1.300000
On non-numeric data, describe produces alternate summary statistics:
In [205]: obj = Series(['a', 'a', 'b', 'c'] * 4)

In [206]: obj.describe()
Out[206]:
count     16
unique     3
top        a
freq       8
dtype: object
See Table 5-10 for a full list of summary statistics and related methods.
Table 5-10. Descriptive and summary statistics
Some summary statistics, like correlation and covariance, are computed from pairs of arguments. Let’s consider some DataFrames of stock prices and volumes obtained from Yahoo! Finance:
import pandas.io.data as web

all_data = {}
for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']:
    all_data[ticker] = web.get_data_yahoo(ticker)

price = DataFrame({tic: data['Adj Close']
                   for tic, data in all_data.iteritems()})
volume = DataFrame({tic: data['Volume']
                    for tic, data in all_data.iteritems()})
I now compute percent changes of the prices:
In [208]: returns = price.pct_change()

In [209]: returns.tail()
Out[209]:
                AAPL      GOOG       IBM      MSFT
Date
2014-07-07  0.020632 -0.004241 -0.002599  0.004545
2014-07-08 -0.006460 -0.019167 -0.004361 -0.005001
2014-07-09  0.000420  0.008738  0.006410 -0.002633
2014-07-10 -0.003669 -0.008645 -0.003821  0.000480
2014-07-11  0.001894  0.014148  0.001598  0.009595
The corr method of Series computes the correlation of the overlapping, non-NA, aligned-by-index values in two Series. Relatedly, cov computes the covariance:
In [210]: returns.MSFT.corr(returns.IBM)
Out[210]: 0.51360438136345077

In [211]: returns.MSFT.cov(returns.IBM)
Out[211]: 8.4825099973219876e-05
DataFrame’s corr and cov methods, on the other hand, return a full correlation or covariance matrix as a DataFrame, respectively:
In [212]: returns.corr()
Out[212]:
          AAPL      GOOG       IBM      MSFT
AAPL  1.000000  0.180425  0.384439  0.346196
GOOG  0.180425  1.000000  0.368115  0.534293
IBM   0.384439  0.368115  1.000000  0.513604
MSFT  0.346196  0.534293  0.513604  1.000000

In [213]: returns.cov()
Out[213]:
          AAPL      GOOG       IBM      MSFT
AAPL  0.000294  0.000037  0.000077  0.000084
GOOG  0.000037  0.000233  0.000051  0.000085
IBM   0.000077  0.000051  0.000137  0.000085
MSFT  0.000084  0.000085  0.000085  0.000200
Using DataFrame’s corrwith method, you can compute pairwise correlations between a DataFrame’s columns or rows with another Series or DataFrame. Passing a Series returns a Series with the correlation value computed for each column:
In [214]: returns.corrwith(returns.IBM)
Out[214]:
AAPL    0.384439
GOOG    0.368115
IBM     1.000000
MSFT    0.513604
dtype: float64
Passing a DataFrame computes the correlations of matching column names. Here I compute correlations of percent changes with volume:
In [215]: returns.corrwith(volume)
Out[215]:
AAPL   -0.108623
GOOG   -0.255828
IBM    -0.158163
MSFT   -0.117687
dtype: float64
Passing axis=1 computes correlations row-wise instead. In all cases, the data points are aligned by label before the correlation is computed.
Another class of related methods extracts information about the values contained in a one-dimensional Series. To illustrate these, consider this example:
In [216]: obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
The first function is unique, which gives you an array of the unique values in a Series:
In [217]: uniques = obj.unique()

In [218]: uniques
Out[218]: array(['c', 'a', 'd', 'b'], dtype=object)
The unique values are not necessarily returned in sorted order, but could be sorted after the fact if needed (uniques.sort()). Relatedly, value_counts computes a Series containing value frequencies:
In [219]: obj.value_counts()
Out[219]:
c    3
a    3
b    2
d    1
dtype: int64
The Series is sorted by value in descending order as a convenience. value_counts is also available as a top-level pandas method that can be used with any array or sequence:
In [220]: pd.value_counts(obj.values, sort=False)
Out[220]:
a    3
c    3
b    2
d    1
dtype: int64
Lastly, isin is responsible for vectorized set membership and can be very useful in filtering a data set down to a subset of values in a Series or column in a DataFrame:
In [221]: mask = obj.isin(['b', 'c'])

In [222]: mask
Out[222]:
0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

In [223]: obj[mask]
Out[223]:
0    c
5    b
6    b
7    c
8    c
dtype: object
See Table 5-11 for a reference on these methods.
Table 5-11. Unique, value counts, and binning methods
Method | Description |
---|---|
isin | Compute boolean array indicating whether each Series value is contained in the passed sequence of values. |
unique | Compute array of unique values in a Series, returned in the order observed. |
value_counts | Return a Series containing unique values as its index and frequencies as its values, ordered by count in descending order. |
In some cases, you may want to compute a histogram on multiple related columns in a DataFrame. Here’s an example:
In [224]: data = DataFrame({'Qu1': [1, 3, 4, 3, 4],
   .....:                   'Qu2': [2, 3, 1, 2, 3],
   .....:                   'Qu3': [1, 5, 2, 4, 4]})

In [225]: data
Out[225]:
   Qu1  Qu2  Qu3
0    1    2    1
1    3    3    5
2    4    1    2
3    3    2    4
4    4    3    4
Passing pandas.value_counts to this DataFrame’s apply function gives:
In [226]: result = data.apply(pd.value_counts).fillna(0)

In [227]: result
Out[227]:
   Qu1  Qu2  Qu3
1    1    1    1
2    0    2    1
3    2    2    0
4    2    0    2
5    0    0    1
Missing data is common in most data analysis applications. One of the goals in designing pandas was to make working with missing data as painless as possible. For example, all of the descriptive statistics on pandas objects exclude missing data as you’ve seen earlier in the chapter.
pandas uses the floating point value NaN (Not a Number) to represent missing data in both floating point and non-floating point arrays. It is just used as a sentinel that can be easily detected:
In [228]: string_data = Series(['aardvark', 'artichoke', np.nan, 'avocado'])

In [229]: string_data
Out[229]:
0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [230]: string_data.isnull()
Out[230]:
0    False
1    False
2     True
3    False
dtype: bool
The built-in Python None value is also treated as NA in object arrays:
In [231]: string_data[0] = None

In [232]: string_data.isnull()
Out[232]:
0     True
1    False
2     True
3    False
dtype: bool
I do not claim that pandas’s NA representation is optimal, but it is simple and reasonably consistent. It’s the best solution, with good all-around performance characteristics and a simple API, that I could concoct in the absence of a true NA data type or bit pattern in NumPy’s data types. Ongoing development work in NumPy may change this in the future.
Table 5-12. NA handling methods
You have a number of options for filtering out missing data. While doing it by hand is always an option, dropna can be very helpful. On a Series, it returns the Series with only the non-null data and index values:
In [233]: from numpy import nan as NA

In [234]: data = Series([1, NA, 3.5, NA, 7])

In [235]: data.dropna()
Out[235]:
0    1.0
2    3.5
4    7.0
dtype: float64
Naturally, you could have computed this yourself by boolean indexing:
In [236]: data[data.notnull()]
Out[236]:
0    1.0
2    3.5
4    7.0
dtype: float64
With DataFrame objects, these are a bit more complex. You may want to drop rows or columns which are all NA or just those containing any NAs. dropna by default drops any row containing a missing value:
In [237]: data = DataFrame([[1., 6.5, 3.], [1., NA, NA],
   .....:                   [NA, NA, NA], [NA, 6.5, 3.]])

In [238]: cleaned = data.dropna()

In [239]: data
Out[239]:
    0    1    2
0   1  6.5    3
1   1  NaN  NaN
2 NaN  NaN  NaN
3 NaN  6.5    3

In [240]: cleaned
Out[240]:
   0    1  2
0  1  6.5  3
Passing how='all' will only drop rows that are all NA:
In [241]: data.dropna(how='all')
Out[241]:
    0    1    2
0   1  6.5    3
1   1  NaN  NaN
3 NaN  6.5    3
Dropping columns in the same way is only a matter of passing axis=1:
In [242]: data[4] = NA

In [243]: data
Out[243]:
    0    1    2   4
0   1  6.5    3 NaN
1   1  NaN  NaN NaN
2 NaN  NaN  NaN NaN
3 NaN  6.5    3 NaN

In [244]: data.dropna(axis=1, how='all')
Out[244]:
    0    1    2
0   1  6.5    3
1   1  NaN  NaN
2 NaN  NaN  NaN
3 NaN  6.5    3
A related way to filter out DataFrame rows tends to concern time series data. Suppose you want to keep only rows containing a certain number of observations. You can indicate this with the thresh argument:
In [245]: df = DataFrame(np.random.randn(7, 3))

In [246]: df.ix[:4, 1] = NA; df.ix[:2, 2] = NA

In [247]: df
Out[247]:
          0         1         2
0 -0.577087       NaN       NaN
1  0.523772       NaN       NaN
2 -0.713544       NaN       NaN
3 -1.860761       NaN  0.560145
4 -1.265934       NaN -1.063512
5  0.332883 -2.359419 -0.199543
6 -1.541996 -0.970736 -1.307030

In [248]: df.dropna(thresh=3)
Out[248]:
          0         1         2
5  0.332883 -2.359419 -0.199543
6 -1.541996 -0.970736 -1.307030
Rather than filtering out missing data (and potentially discarding other data along with it), you may want to fill in the “holes” in any number of ways. For most purposes, the fillna method is the workhorse function to use. Calling fillna with a constant replaces missing values with that value:
In [249]: df.fillna(0)
Out[249]:
          0         1         2
0 -0.577087  0.000000  0.000000
1  0.523772  0.000000  0.000000
2 -0.713544  0.000000  0.000000
3 -1.860761  0.000000  0.560145
4 -1.265934  0.000000 -1.063512
5  0.332883 -2.359419 -0.199543
6 -1.541996 -0.970736 -1.307030
Calling fillna with a dict, you can use a different fill value for each column:
In [250]: df.fillna({1: 0.5, 3: -1})
Out[250]:
          0         1         2
0 -0.577087  0.500000       NaN
1  0.523772  0.500000       NaN
2 -0.713544  0.500000       NaN
3 -1.860761  0.500000  0.560145
4 -1.265934  0.500000 -1.063512
5  0.332883 -2.359419 -0.199543
6 -1.541996 -0.970736 -1.307030
fillna returns a new object, but you can modify the existing object in place:
# always returns a reference to the filled object
In [251]: _ = df.fillna(0, inplace=True)

In [252]: df
Out[252]:
          0         1         2
0 -0.577087  0.000000  0.000000
1  0.523772  0.000000  0.000000
2 -0.713544  0.000000  0.000000
3 -1.860761  0.000000  0.560145
4 -1.265934  0.000000 -1.063512
5  0.332883 -2.359419 -0.199543
6 -1.541996 -0.970736 -1.307030
The same interpolation methods available for reindexing can be used with fillna:
In [253]: df = DataFrame(np.random.randn(6, 3))

In [254]: df.ix[2:, 1] = NA; df.ix[4:, 2] = NA

In [255]: df
Out[255]:
          0         1         2
0  0.286350  0.377984 -0.753887
1  0.331286  1.349742  0.069877
2  0.246674       NaN  1.004812
3  1.327195       NaN -1.549106
4  0.022185       NaN       NaN
5  0.862580       NaN       NaN

In [256]: df.fillna(method='ffill')
Out[256]:
          0         1         2
0  0.286350  0.377984 -0.753887
1  0.331286  1.349742  0.069877
2  0.246674  1.349742  1.004812
3  1.327195  1.349742 -1.549106
4  0.022185  1.349742 -1.549106
5  0.862580  1.349742 -1.549106

In [257]: df.fillna(method='ffill', limit=2)
Out[257]:
          0         1         2
0  0.286350  0.377984 -0.753887
1  0.331286  1.349742  0.069877
2  0.246674  1.349742  1.004812
3  1.327195  1.349742 -1.549106
4  0.022185       NaN -1.549106
5  0.862580       NaN -1.549106
With fillna you can do lots of other things with a little creativity. For example, you might pass the mean or median value of a Series:
In [258]: data = Series([1., NA, 3.5, NA, 7])

In [259]: data.fillna(data.mean())
Out[259]:
0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
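The same idea extends to DataFrames: passing df.mean() to fillna fills each column’s holes with that column’s own mean. A small sketch with made-up values:

```python
import numpy as np
from pandas import DataFrame

# made-up frame with holes in both columns
df = DataFrame({'one': [1., np.nan, 3.5], 'two': [np.nan, 2., 4.]})

# passing the Series of column means fills each column separately
filled = df.fillna(df.mean())
```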
See Table 5-13 for a reference on fillna.
Table 5-13. fillna function arguments
Argument | Description |
---|---|
value | Scalar value or dict-like object to use to fill missing values |
method | Interpolation, by default 'ffill' if function called with no other arguments |
axis | Axis to fill on, default axis=0 |
inplace | Modify the calling object without producing a copy |
limit | For forward and backward filling, maximum number of consecutive periods to fill |
Hierarchical indexing is an important feature of pandas enabling you to have multiple (two or more) index levels on an axis. Somewhat abstractly, it provides a way for you to work with higher dimensional data in a lower dimensional form. Let’s start with a simple example; create a Series with a list of lists or arrays as the index:
In [260]: data = Series(np.random.randn(10),
   .....:               index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
   .....:                      [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])

In [261]: data
Out[261]:
a  1    0.670216
   2    0.852965
   3   -0.955869
b  1   -0.023493
   2   -2.304234
   3   -0.652469
c  1   -1.218302
   2   -1.332610
d  2    1.074623
   3    0.723642
dtype: float64
What you’re seeing is a prettified view of a Series with a MultiIndex as its index. The “gaps” in the index display mean “use the label directly above”:
In [262]: data.index
Out[262]:
MultiIndex(levels=[[u'a', u'b', u'c', u'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3],
                   [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])
With a hierarchically-indexed object, so-called partial indexing is possible, enabling you to concisely select subsets of the data:
In [263]: data['b']
Out[263]:
1   -0.023493
2   -2.304234
3   -0.652469
dtype: float64

In [264]: data['b':'c']
Out[264]:
b  1   -0.023493
   2   -2.304234
   3   -0.652469
c  1   -1.218302
   2   -1.332610
dtype: float64

In [265]: data.ix[['b', 'd']]
Out[265]:
b  1   -0.023493
   2   -2.304234
   3   -0.652469
d  2    1.074623
   3    0.723642
dtype: float64
Selection is even possible in some cases from an “inner” level:
In [266]: data[:, 2]
Out[266]:
a    0.852965
b   -2.304234
c   -1.332610
d    1.074623
dtype: float64
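Inner-level selection can also be expressed with the xs cross-section method; a minimal sketch on a made-up Series (not the random data above) showing the two forms agree:

```python
import numpy as np
from pandas import Series

# small hierarchically-indexed Series with made-up values
data = Series(np.arange(4.),
              index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]])

# data[:, 2] and the xs method both select inner-level label 2
inner = data.xs(2, level=1)
```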
Hierarchical indexing plays a critical role in reshaping data and group-based operations like forming a pivot table. For example, this data could be rearranged into a DataFrame using its unstack method:
In [267]: data.unstack()
Out[267]:
          1         2         3
a  0.670216  0.852965 -0.955869
b -0.023493 -2.304234 -0.652469
c -1.218302 -1.332610       NaN
d       NaN  1.074623  0.723642
The inverse operation of unstack is stack:
In [268]: data.unstack().stack()
Out[268]:
a  1    0.670216
   2    0.852965
   3   -0.955869
b  1   -0.023493
   2   -2.304234
   3   -0.652469
c  1   -1.218302
   2   -1.332610
d  2    1.074623
   3    0.723642
dtype: float64
stack and unstack will be explored in more detail in Chapter 7.
With a DataFrame, either axis can have a hierarchical index:
In [269]: frame = DataFrame(np.arange(12).reshape((4, 3)),
   .....:                   index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
   .....:                   columns=[['Ohio', 'Ohio', 'Colorado'],
   .....:                            ['Green', 'Red', 'Green']])

In [270]: frame
Out[270]:
      Ohio     Colorado
     Green Red    Green
a 1      0   1        2
  2      3   4        5
b 1      6   7        8
  2      9  10       11
The hierarchical levels can have names (as strings or any Python objects). If so, these will show up in the console output (don’t confuse the index names with the axis labels!):
In [271]: frame.index.names = ['key1', 'key2']

In [272]: frame.columns.names = ['state', 'color']

In [273]: frame
Out[273]:
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11
With partial column indexing you can similarly select groups of columns:
In [274]: frame['Ohio']
Out[274]:
color      Green  Red
key1 key2
a    1         0    1
     2         3    4
b    1         6    7
     2         9   10
A MultiIndex can be created by itself and then reused; the columns in the above DataFrame with level names could be created like this:
MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'],
                        ['Green', 'Red', 'Green']],
                       names=['state', 'color'])
At times you will need to rearrange the order of the levels on an axis or sort the data by the values in one specific level. The swaplevel method takes two level numbers or names and returns a new object with the levels interchanged (but the data is otherwise unaltered):
In [275]: frame.swaplevel('key1', 'key2')
Out[275]:
state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
2    a        3   4        5
1    b        6   7        8
2    b        9  10       11
sortlevel, on the other hand, sorts the data (stably) using only the values in a single level. When swapping levels, it’s not uncommon to also use sortlevel so that the result is lexicographically sorted:
In [276]: frame.sortlevel(1)
Out[276]:
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
b    1        6   7        8
a    2        3   4        5
b    2        9  10       11

In [277]: frame.swaplevel(0, 1).sortlevel(0)
Out[277]:
state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
     b        6   7        8
2    a        3   4        5
     b        9  10       11
Many descriptive and summary statistics on DataFrame and Series have a level option in which you can specify the level you want to sum by on a particular axis. Consider the above DataFrame; we can sum by level on either the rows or columns like so:
In [278]: frame.sum(level='key2')
Out[278]:
state  Ohio     Colorado
color Green Red    Green
key2
1         6   8       10
2        12  14       16

In [279]: frame.sum(level='color', axis=1)
Out[279]:
color      Green  Red
key1 key2
a    1         2    1
     2         8    4
b    1        14    7
     2        20   10
Under the hood, this utilizes pandas’s groupby machinery, which will be discussed in more detail later in the book.
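That equivalence can be sketched directly. In recent pandas versions the level keyword on reductions has been folded into groupby, so the following uses groupby(level=...) explicitly on a frame built like the one above:

```python
import numpy as np
from pandas import DataFrame, MultiIndex

# rebuild the example frame with a named MultiIndex
frame = DataFrame(np.arange(12).reshape((4, 3)),
                  index=MultiIndex.from_arrays(
                      [['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                      names=['key1', 'key2']))

# summing by level is a groupby over that index level under the hood
by_level = frame.groupby(level='key2').sum()
```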
It’s not unusual to want to use one or more columns from a DataFrame as the row index; alternatively, you may wish to move the row index into the DataFrame’s columns. Here’s an example DataFrame:
In [280]: frame = DataFrame({'a': range(7), 'b': range(7, 0, -1),
   .....:                    'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
   .....:                    'd': [0, 1, 2, 0, 1, 2, 3]})

In [281]: frame
Out[281]:
   a  b    c  d
0  0  7  one  0
1  1  6  one  1
2  2  5  one  2
3  3  4  two  0
4  4  3  two  1
5  5  2  two  2
6  6  1  two  3
DataFrame’s set_index function will create a new DataFrame using one or more of its columns as the index:
In [282]: frame2 = frame.set_index(['c', 'd'])

In [283]: frame2
Out[283]:
       a  b
c   d
one 0  0  7
    1  1  6
    2  2  5
two 0  3  4
    1  4  3
    2  5  2
    3  6  1
By default the columns are removed from the DataFrame, though you can leave them in:
In [284]: frame.set_index(['c', 'd'], drop=False)
Out[284]:
       a  b    c  d
c   d
one 0  0  7  one  0
    1  1  6  one  1
    2  2  5  one  2
two 0  3  4  two  0
    1  4  3  two  1
    2  5  2  two  2
    3  6  1  two  3
reset_index, on the other hand, does the opposite of set_index; the hierarchical index levels are moved into the columns:
In [285]: frame2.reset_index()
Out[285]:
     c  d  a  b
0  one  0  0  7
1  one  1  1  6
2  one  2  2  5
3  two  0  3  4
4  two  1  4  3
5  two  2  5  2
6  two  3  6  1
Here are some additional topics that may be of use to you in your data travels.
Working with pandas objects indexed by integers is something that often trips up new users due to some differences with indexing semantics on built-in Python data structures like lists and tuples. For example, you would not expect the following code to generate an error:
ser = Series(np.arange(3.))
ser[-1]
In this case, pandas could “fall back” on integer indexing, but there’s not a safe and general way (that I know of) to do this without introducing subtle bugs. Here we have an index containing 0, 1, 2, but inferring what the user wants (label-based indexing or position-based) is difficult:
In [287]: ser
Out[287]:
0    0
1    1
2    2
dtype: float64
On the other hand, with a non-integer index, there is no potential for ambiguity:
In [288]: ser2 = Series(np.arange(3.), index=['a', 'b', 'c'])

In [289]: ser2[-1]
Out[289]: 2.0
To keep things consistent, if you have an axis index containing integers, data selection with integers will always be label-oriented. This includes slicing with ix, too:
In [290]: ser.ix[:1]
Out[290]:
0    0
1    1
dtype: float64
In cases where you need reliable position-based indexing regardless of the index type, you can use the iget_value method from Series and the irow and icol methods from DataFrame:
In [291]: ser3 = Series(range(3), index=[-5, 1, 3])

In [292]: ser3.iget_value(2)
Out[292]: 2

In [293]: frame = DataFrame(np.arange(6).reshape(3, 2), index=[2, 0, 1])

In [294]: frame.irow(0)
Out[294]:
0    0
1    1
Name: 2, dtype: int64
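In later versions of pandas, these methods were consolidated into the strictly position-based iloc indexer; a minimal sketch of the same Series lookup:

```python
from pandas import Series

ser3 = Series(range(3), index=[-5, 1, 3])

# iloc selects by position, regardless of the integer index labels
val = ser3.iloc[2]
```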
While not a major topic of this book, pandas has a Panel data structure, which you can think of as a three-dimensional analogue of DataFrame. Much of the development focus of pandas has been in tabular data manipulations as these are easier to reason about, and hierarchical indexing makes using truly N-dimensional arrays unnecessary in a lot of cases.
To create a Panel, you can use a dict of DataFrame objects or a three-dimensional ndarray:
import pandas.io.data as web

pdata = pd.Panel(dict((stk, web.get_data_yahoo(stk))
                      for stk in ['AAPL', 'GOOG', 'MSFT', 'DELL']))
Each item (the analogue of columns in a DataFrame) in the Panel is a DataFrame:
In [298]: pdata
Out[298]:
<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 1158 (major_axis) x 6 (minor_axis)
Items axis: AAPL to MSFT
Major_axis axis: 2010-01-04 00:00:00 to 2014-07-11 00:00:00
Minor_axis axis: Open to Adj Close

In [299]: pdata = pdata.swapaxes('items', 'minor')

In [300]: pdata['Adj Close']
Out[300]:
             AAPL   DELL    GOOG   MSFT
Date
2010-01-04  29.22  14.07     NaN  27.48
2010-01-05  29.27  14.38     NaN  27.49
2010-01-06  28.81  14.10     NaN  27.32
2010-01-07  28.75  14.24     NaN  27.03
2010-01-08  28.94  14.37     NaN  27.22
...           ...    ...     ...    ...
2014-07-07  95.97    NaN  582.25  41.99
2014-07-08  95.35    NaN  571.09  41.78
2014-07-09  95.39    NaN  576.08  41.67
2014-07-10  95.04    NaN  571.10  41.69
2014-07-11  95.22    NaN  579.18  42.09

[1158 rows x 4 columns]
ix-based label indexing generalizes to three dimensions, so we can select all data at a particular date or a range of dates like so:
In [301]: pdata.ix[:, '6/1/2012', :]
Out[301]:
        Open    High     Low   Close     Volume  Adj Close
AAPL  569.16  572.65  560.52  560.99  130246900      76.60
DELL   12.15   12.30   12.05   12.07   19397600      11.68
GOOG     NaN     NaN     NaN     NaN        NaN        NaN
MSFT   28.76   28.96   28.44   28.45   56634300      26.82

In [302]: pdata.ix['Adj Close', '5/22/2012':, :]
Out[302]:
             AAPL   DELL    GOOG   MSFT
Date
2012-05-22  76.05  14.59     NaN  28.05
2012-05-23  77.91  12.08     NaN  27.44
2012-05-24  77.19  12.04     NaN  27.40
2012-05-25  76.78  12.05     NaN  27.39
2012-05-28    NaN  12.05     NaN    NaN
...           ...    ...     ...    ...
2014-07-07  95.97    NaN  582.25  41.99
2014-07-08  95.35    NaN  571.09  41.78
2014-07-09  95.39    NaN  576.08  41.67
2014-07-10  95.04    NaN  571.10  41.69
2014-07-11  95.22    NaN  579.18  42.09

[551 rows x 4 columns]
An alternate way to represent panel data, especially for fitting statistical models, is in “stacked” DataFrame form:
In [303]: stacked = pdata.ix[:, '5/30/2012':, :].to_frame()

In [304]: stacked
Out[304]:
                    Open    High     Low   Close     Volume  Adj Close
Date       minor
2012-05-30 AAPL   569.20  579.99  566.56  579.17  132357400      79.08
           DELL    12.59   12.70   12.46   12.56   19787800      12.15
           MSFT    29.35   29.48   29.12   29.34   41585500      27.66
2012-05-31 AAPL   580.74  581.50  571.46  577.73  122918600      78.89
           DELL    12.53   12.54   12.33   12.33   19955600      11.93
...                  ...     ...     ...     ...        ...        ...
2014-07-10 GOOG   565.91  576.59  565.01  571.10    1353000     571.10
           MSFT    41.37   42.00   41.05   41.69   21854700      41.69
2014-07-11 AAPL    95.36   95.89   94.86   95.22   33988800      95.22
           GOOG   571.91  580.85  571.42  579.18    1617300     579.18
           MSFT    41.70   42.09   41.48   42.09   24083000      42.09

[1508 rows x 6 columns]
DataFrame has a related to_panel method, the inverse of to_frame:
In [305]: stacked.to_panel()
Out[305]:
<class 'pandas.core.panel.Panel'>
Dimensions: 6 (items) x 545 (major_axis) x 4 (minor_axis)
Items axis: Open to Adj Close
Major_axis axis: 2012-05-30 00:00:00 to 2014-07-11 00:00:00
Minor_axis axis: AAPL to MSFT