Reindexing in pandas is a process that makes the data in a Series or DataFrame match a given set of labels. This is core to the functionality of pandas as it enables label alignment across multiple objects, which may originally have different indexing schemes.
The process of performing a reindex includes the following steps:
1. Reordering existing data to match a set of labels.
2. Inserting NaN markers where no data exists for a label.
3. Optionally, filling in missing data for a label using some type of logic (defaulting to adding NaN values).
Here is a simple example of reindexing a Series. The following Series has an index with numerical values, and the index is modified to be alphabetic by simply assigning a list of characters to the .index property. This makes the values accessible via the character labels in the new index:
In [66]:
   # the examples assume these imports
   import numpy as np
   import pandas as pd

   # sample series of five items
   s = pd.Series(np.random.randn(5))
   s

Out[66]:
   0   -0.173215
   1    0.119209
   2   -1.044236
   3   -0.861849
   4   -2.104569
   dtype: float64

In [67]:
   # change the index
   s.index = ['a', 'b', 'c', 'd', 'e']
   s

Out[67]:
   a   -0.173215
   b    0.119209
   c   -1.044236
   d   -0.861849
   e   -2.104569
   dtype: float64
Now, let's examine a slightly more practical example. The following code concatenates two Series objects resulting in duplicate index labels, which may not be desired in the resulting Series:
In [68]:
   # concat copies index values verbatim,
   # potentially making duplicates
   np.random.seed(123456)
   s1 = pd.Series(np.random.randn(3))
   s2 = pd.Series(np.random.randn(3))
   combined = pd.concat([s1, s2])
   combined

Out[68]:
   0    0.469112
   1   -0.282863
   2   -1.509059
   0   -1.135632
   1    1.212112
   2   -0.173215
   dtype: float64
To fix this, the following creates a new index for the concatenated result which has sequential and distinct values.
In [69]:
   # reset the index
   combined.index = np.arange(0, len(combined))
   combined

Out[69]:
   0    0.469112
   1   -0.282863
   2   -1.509059
   3   -1.135632
   4    1.212112
   5   -0.173215
   dtype: float64
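As an alternative to assigning a new index by hand, the renumbering can be requested from pandas directly. The following is a minimal sketch rather than part of the original example: it relies on the ignore_index parameter of pd.concat() and the drop parameter of .reset_index(), and it uses small hand-picked values in place of the random data above.

```python
import pandas as pd

s1 = pd.Series([10, 20, 30])
s2 = pd.Series([40, 50, 60])

# ignore_index=True discards the original labels and numbers
# the concatenated result from 0 to n-1
combined = pd.concat([s1, s2], ignore_index=True)

# a Series that was already concatenated can be renumbered with
# reset_index; drop=True discards the old labels instead of
# turning them into a column
renumbered = pd.concat([s1, s2]).reset_index(drop=True)
```

Both approaches return a new object and leave s1 and s2 untouched.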
Greater flexibility in creating a new index is provided by the .reindex() method. One example of the flexibility of .reindex() over assigning the .index property directly is that the list provided to .reindex() can be of a different length than the number of rows in the Series:
In [70]:
   np.random.seed(123456)
   s1 = pd.Series(np.random.randn(4), ['a', 'b', 'c', 'd'])
   # reindex with different number of labels
   # results in dropped rows and/or NaNs
   s2 = s1.reindex(['a', 'c', 'g'])
   s2

Out[70]:
   a    0.469112
   c   -1.509059
   g         NaN
   dtype: float64
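To make the contrast with direct assignment concrete, the following sketch tries both approaches with a label list that is shorter than the Series. The variable names and values are chosen here for illustration only; the behavior shown is that assigning to .index with the wrong number of labels raises a ValueError.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0], index=['a', 'b', 'c', 'd'])

# assigning three labels to a four-row Series is rejected
try:
    s.index = ['a', 'c', 'g']
    assignment_failed = False
except ValueError:
    assignment_failed = True

# reindex accepts any number of labels and returns a new Series
shorter = s.reindex(['a', 'c', 'g'])
```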
There are several things here that are important to point out about .reindex(). First, the result of the .reindex() method is a new Series. This new Series has an index with the labels that are passed to .reindex(). For each label in the given list, if the original Series contains that label, then its value is assigned to that label. If the label does not exist in the original Series, pandas assigns a NaN value. Rows in the Series whose labels are not specified in the parameter of .reindex() are not included in the result.
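These rules can be checked with a small sketch. The data and the names original, result, missing, and dropped are illustrative assumptions, not taken from the examples above. Note that the labels in the result follow the order requested, not the order of the original index.

```python
import pandas as pd

original = pd.Series([1.0, 2.0, 3.0, 4.0], index=['a', 'b', 'c', 'd'])

# 'd' and 'a' exist and are returned in the requested order,
# 'z' does not exist and becomes NaN, 'b' and 'c' are left out
result = original.reindex(['d', 'a', 'z'])

# labels in the result that had no data in the original
missing = result[result.isna()].index.tolist()

# labels of the original that are absent from the result
dropped = original.index.difference(result.index).tolist()
```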
To demonstrate that the result of .reindex() is a new Series object, note that changing a value in s2 does not change the values in s1:
In [71]:
   # s2 is a different Series than s1
   s2['a'] = 0
   s2

Out[71]:
   a    0.000000
   c   -1.509059
   g         NaN
   dtype: float64

In [72]:
   # this did not modify s1
   s1

Out[72]:
   a    0.469112
   b   -0.282863
   c   -1.509059
   d   -1.135632
   dtype: float64
Reindexing is also useful when you want to align two Series to perform an operation on matching elements from each series but, for some reason, the two Series have index labels that do not initially align.
The following example demonstrates this, where the first Series has sequential integers as index labels, but the second has string representations of what would be the same values:
In [73]:
   # different types for the same values of labels
   # causes big trouble
   s1 = pd.Series([0, 1, 2], index=[0, 1, 2])
   s2 = pd.Series([3, 4, 5], index=['0', '1', '2'])
   s1 + s2

Out[73]:
   0   NaN
   1   NaN
   2   NaN
   0   NaN
   1   NaN
   2   NaN
   dtype: float64
This is almost a catastrophic failure in accomplishing the desired result, and exemplifies a scenario where data may have been retrieved from two different systems that used different representations for the index labels. The reason this happens in pandas is that the integer label 0 and the string label '0' are different labels, so no label in one series has a match in the other. Every value is NaN because the operation tries to add the item in the first series with the integer label 0, which has a value of 0, but cannot find an item with that label in the other series, and therefore the result is NaN (and this fails six times in this case).
Once this situation is identified, it is fairly trivial to fix by converting the index labels of the second series to integers:
In [74]:
   # reindex by casting the label types
   # and we will get the desired result
   s2.index = s2.index.values.astype(int)
   s1 + s2

Out[74]:
   0    3
   1    5
   2    7
   dtype: int64
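A mismatch of this kind can be confirmed before attempting a fix, and the fix can be applied without mutating the original object. The following is a minimal sketch under those goals. The name s2_fixed is introduced here only for illustration, and Index.astype() is used instead of going through .values.

```python
import pandas as pd

s1 = pd.Series([0, 1, 2], index=[0, 1, 2])
s2 = pd.Series([3, 4, 5], index=['0', '1', '2'])

# the integer 0 and the string '0' are different labels
int_label_found = 0 in s2.index
str_label_found = '0' in s2.index

# work on a copy so that s2 keeps its string labels
s2_fixed = s2.copy()
s2_fixed.index = s2.index.astype(int)

# the labels now match, so the values are added pairwise
total = s1 + s2_fixed
```

Checking membership with the in operator on the index is a quick way to see which representation of a label a Series actually uses.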
The default action of inserting NaN as a missing value during reindexing can be changed by using the fill_value parameter of the method. The following example demonstrates using 0 instead of NaN:
In [75]:
   # fill with 0 instead of NaN
   s2 = s.copy()
   s2.reindex(['a', 'f'], fill_value=0)

Out[75]:
   a   -0.173215
   f    0.000000
   dtype: float64
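One detail worth illustrating is that fill_value applies only to labels introduced by the reindex, whereas calling .fillna() afterwards replaces every missing value. The sketch below uses made-up data in which one value is already missing before reindexing. The variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# 'b' already holds a missing value before reindexing
s = pd.Series([1.0, np.nan], index=['a', 'b'])

# fill_value is used only for the new label 'c'
with_fill_value = s.reindex(['a', 'b', 'c'], fill_value=0)

# fillna replaces every NaN, including the pre-existing one at 'b'
with_fillna = s.reindex(['a', 'b', 'c']).fillna(0)
```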
When performing a reindex on ordered data such as a time series, it is possible to perform interpolation or filling of values. There will be a more elaborate discussion on interpolation and filling in Chapter 10, Time-series Data, but the following examples introduce the concept using this Series:
In [76]:
   # create example to demonstrate fills
   s3 = pd.Series(['red', 'green', 'blue'], index=[0, 3, 5])
   s3

Out[76]:
   0      red
   3    green
   5     blue
   dtype: object
The following example demonstrates forward filling, often referred to as "last known value." The Series is reindexed to create a contiguous integer index and, by using the method='ffill' parameter, any new index labels are assigned the value of the closest preceding label in the original Series object:
In [77]:
   # forward fill example
   s3.reindex(np.arange(0,7), method='ffill')

Out[77]:
   0      red
   1      red
   2      red
   3    green
   4    green
   5     blue
   6     blue
   dtype: object
The following example fills backward using method='bfill':
In [78]:
   # backwards fill example
   s3.reindex(np.arange(0,7), method='bfill')

Out[78]:
   0      red
   1    green
   2    green
   3    green
   4     blue
   5     blue
   6      NaN
   dtype: object
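The number of consecutive new labels that receive a filled value can also be capped. The following sketch assumes the limit parameter of .reindex(), which is not used in the examples above, and reuses the same colors Series. The name still_missing is illustrative.

```python
import numpy as np
import pandas as pd

s3 = pd.Series(['red', 'green', 'blue'], index=[0, 3, 5])
new_index = np.arange(0, 7)

# fill at most one consecutive new label after each known label
limited = s3.reindex(new_index, method='ffill', limit=1)

# labels 1, 4 and 6 are filled; label 2 is the second new label
# after 0, so it stays missing
still_missing = limited[limited.isna()].index.tolist()
```

A cap like this can help avoid carrying a stale value across a long gap in ordered data.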
There are several ways that an existing Series can be modified in-place, having either its values changed or having rows added or deleted. In-place modification of a Series is a slightly controversial topic. When possible, it is preferred to perform operations that return a new Series with the modifications represented in the new Series. However, it is possible to change values and add/remove rows in-place, and these techniques will be explained here briefly.
A new item can be added to a Series by assigning a value to an index label that does not already exist. The following code creates a Series object and adds a new item to the series:
In [79]:
   # generate a Series to play with
   np.random.seed(123456)
   s = pd.Series(np.random.randn(3), index=['a', 'b', 'c'])
   s

Out[79]:
   a    0.469112
   b   -0.282863
   c   -1.509059
   dtype: float64

In [80]:
   # add a new item to the Series
   # this is done in-place
   # a new Series is not returned
   s['d'] = 100
   s

Out[80]:
   a      0.469112
   b     -0.282863
   c     -1.509059
   d    100.000000
   dtype: float64
The value at a specific index label can be changed by assignment:
In [81]:
   # modify the value at 'd' in-place
   s['d'] = -100
   s

Out[81]:
   a      0.469112
   b     -0.282863
   c     -1.509059
   d   -100.000000
   dtype: float64
Items can be removed from a Series using the del statement with the index label to be removed. The following code removes the item at index label 'a':
In [82]:
   # remove a row / item
   del s['a']
   s

Out[82]:
   b     -0.282863
   c     -1.509059
   d   -100.000000
   dtype: float64
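For comparison with the in-place operations, the same changes can be expressed with calls that return a new Series and leave the original untouched, which is the style described as preferable at the start of this section. This is a minimal sketch with illustrative data and names (without_a, with_d), using .drop() and pd.concat().

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])

# drop returns a new Series without the given label
without_a = s.drop('a')

# concat returns a new Series with an extra item appended
with_d = pd.concat([s, pd.Series({'d': 100.0})])
```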