Reindexing a Series

Reindexing in pandas is a process that makes the data in a Series or DataFrame match a given set of labels. This is core to the functionality of pandas as it enables label alignment across multiple objects, which may originally have different indexing schemes.

This process of performing a reindex includes the following steps:

  1. Reordering existing data to match a set of labels.
  2. Inserting NaN markers where no data exists for a label.
  3. Optionally, filling the missing data for a label using some type of logic (by default, the missing values are left as NaN).
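The three steps can be sketched with a small Series. This is an illustrative example only; the values and the labels 'a', 'b', 'c', and 'z' are made up for the demonstration, and it uses the .reindex() method and its fill_value parameter.

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0], index=['a', 'b', 'c'])

# step 1: existing data is reordered to match the requested labels
reordered = s.reindex(['c', 'a', 'b'])

# step 2: a label with no data ('z') receives NaN
with_missing = s.reindex(['c', 'a', 'z'])

# step 3: optionally, the missing label is filled with another value
filled = s.reindex(['c', 'a', 'z'], fill_value=0.0)

print(reordered)
print(with_missing)
print(filled)
```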

Here is a simple example of reindexing a Series. The following Series has an index with numerical values, and the index is modified to be alphabetic by simply assigning a list of characters to the .index property. This makes the values accessible via the character labels in the new index:

In [66]:
   # sample series of five items
   s = pd.Series(np.random.randn(5))
   s

Out[66]:
   0   -0.173215
   1    0.119209
   2   -1.044236
   3   -0.861849
   4   -2.104569
   dtype: float64

In [67]:
   # change the index
   s.index = ['a', 'b', 'c', 'd', 'e']
   s

Out[67]:
   a   -0.173215
   b    0.119209
   c   -1.044236
   d   -0.861849
   e   -2.104569
   dtype: float64

Note

The number of elements in the list assigned to the .index property must match the number of rows in the Series; otherwise, pandas raises a ValueError.
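A minimal sketch of the length check, assuming a three-row Series with made-up values. The name mismatch_error is only a placeholder for this illustration.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])

try:
    # two labels for three rows: the lengths do not match
    s.index = ['a', 'b']
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc

# report which exception type was raised
print(type(mismatch_error).__name__)
```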

Now, let's examine a slightly more practical example. The following code concatenates two Series objects resulting in duplicate index labels, which may not be desired in the resulting Series:

In [68]:
   # concat copies index values verbatim,
   # potentially making duplicates
   np.random.seed(123456)
   s1 = pd.Series(np.random.randn(3))
   s2 = pd.Series(np.random.randn(3))
   combined = pd.concat([s1, s2])
   combined

Out[68]:
   0    0.469112
   1   -0.282863
   2   -1.509059
   0   -1.135632
   1    1.212112
   2   -0.173215
   dtype: float64

To fix this, the following creates a new index for the concatenated result which has sequential and distinct values.

In [69]:
   # reset the index
   combined.index = np.arange(0, len(combined))
   combined

Out[69]:
   0    0.469112
   1   -0.282863
   2   -1.509059
   3   -1.135632
   4    1.212112
   5   -0.173215
   dtype: float64

Note

Assigning to the .index property modifies the Series in place. It only relabels the existing rows; the data is not reordered or realigned.
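Where modifying the Series in place is not wanted, a sequential index can also be produced without assigning to .index. The following is an alternative sketch rather than the approach used in the preceding example; it relies on the ignore_index parameter of pd.concat() and on the .reset_index() method with drop=True.

```python
import numpy as np
import pandas as pd

np.random.seed(123456)
s1 = pd.Series(np.random.randn(3))
s2 = pd.Series(np.random.randn(3))

# ignore_index=True discards the original labels and
# numbers the concatenated result from 0
combined = pd.concat([s1, s2], ignore_index=True)

# reset_index(drop=True) returns a new, renumbered Series
# from one that was already concatenated
renumbered = pd.concat([s1, s2]).reset_index(drop=True)

print(combined.index.tolist())
print(renumbered.index.tolist())
```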

Greater flexibility in creating a new index is provided using the .reindex() method. An example of the flexibility of .reindex() over assigning the .index property directly is that the list provided to .reindex() can be of a different length than the number of rows in the Series:

In [70]:
   np.random.seed(123456)
   s1 = pd.Series(np.random.randn(4), ['a', 'b', 'c', 'd'])
   # reindex with different number of labels
   # results in dropped rows and/or NaN's
   s2 = s1.reindex(['a', 'c', 'g'])
   s2

Out[70]:
   a    0.469112
   c   -1.509059
   g         NaN
   dtype: float64

There are several important things to point out about .reindex(). First, the result of the .reindex() method is a new Series. This new Series has an index made up of the labels passed to .reindex(). For each label in the given list, if the original Series contains that label, its value is assigned to that label in the result. If the label does not exist in the original Series, pandas assigns NaN. Rows in the original Series whose labels are not passed to .reindex() are not included in the result.

To demonstrate that the result of .reindex() is a new Series object, changing a value in s2 does not change the values in s1:

In [71]:
   # s2 is a different Series than s1
   s2['a'] = 0
   s2

Out[71]:
   a    0.000000
   c   -1.509059
   g         NaN
   dtype: float64

In [72]:
   # this did not modify s1
   s1

Out[72]:
   a    0.469112
   b   -0.282863
   c   -1.509059
   d   -1.135632
   dtype: float64

Reindexing is also useful when you want to align two Series to perform an operation on matching elements from each, but the index labels of the two Series do not initially align.

The following example demonstrates this. The first Series uses sequential integers as its index labels, but the second uses the string representations of those same values:

In [73]:
   # different types for the same values of labels
   # causes big trouble
   s1 = pd.Series([0, 1, 2], index=[0, 1, 2])
   s2 = pd.Series([3, 4, 5], index=['0', '1', '2'])
   s1 + s2

Out[73]:
   0   NaN
   1   NaN
   2   NaN
   0   NaN
   1   NaN
   2   NaN
   dtype: float64

This result is far from what was intended, and it exemplifies a scenario where data has been retrieved from two different systems that use different representations for the index labels. The reasons why this happens in pandas are as follows:

  1. pandas aligns the two Series by index label before adding. The index of the result is the union of both indexes, so it holds the three integer labels from the first Series and the three string labels from the second.
  2. The integer 0 and the string '0' are different labels, so no label occurs in both Series. The string labels are printed without quotes, which is why the output appears to contain duplicate labels.
  3. Finally, all values are NaN because addition needs a value from each Series for a given label. For the integer label 0, the first Series has the value 0, but the second Series has no item with that label, so the result is NaN (and this happens for all six labels in this case).
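Inspecting the labels of the result directly can make the cause easier to see. The following sketch repeats the addition and prints each label with repr(), which keeps the quotes around strings. The names result, labels, and shared are placeholders for this illustration, and some pandas versions may emit a warning about being unable to sort the mixed-type labels.

```python
import pandas as pd

s1 = pd.Series([0, 1, 2], index=[0, 1, 2])
s2 = pd.Series([3, 4, 5], index=['0', '1', '2'])
result = s1 + s2

# repr() keeps the quotes, so string labels can be told
# apart from integer labels
labels = [repr(label) for label in result.index]
print(labels)

# no label occurs in both inputs, so no pair of values
# was available to add
shared = set(s1.index) & set(s2.index)
print(len(shared))
```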

Once this situation is identified, it is straightforward to fix by converting the index labels of the second Series to integers:

In [74]:
   # reindex by casting the label types
   # and we will get the desired result
   s2.index = s2.index.values.astype(int)
   s1 + s2

Out[74]:
   0    3
   1    5
   2    7
   dtype: int64

The default action of inserting NaN as a missing value during reindexing can be changed by using the fill_value parameter of .reindex(). The following example demonstrates using 0 instead of NaN:

In [75]:
   # fill with 0 instead of NaN
   # (s is the Series from In [67], labeled 'a' through 'e')
   s2 = s.copy()
   s2.reindex(['a', 'f'], fill_value=0)

Out[75]:
   a   -0.173215
   f    0.000000
   dtype: float64

When performing a reindex on ordered data such as a time series, it is possible to perform interpolation or filling of values. There will be a more elaborate discussion on interpolation and filling in Chapter 10, Time-series Data, but the following examples introduce the concept using this Series:

In [76]:
   # create example to demonstrate fills
   s3 = pd.Series(['red', 'green', 'blue'], index=[0, 3, 5])
   s3

Out[76]:
   0      red
   3    green
   5     blue
   dtype: object

The following example demonstrates forward filling, often referred to as "last known value." The Series is reindexed to create a contiguous integer index, and with the method='ffill' parameter, each new index label is assigned the value of the nearest preceding label that exists in the original Series:

In [77]:
   # forward fill example
   s3.reindex(np.arange(0,7), method='ffill')

Out[77]:
   0      red
   1      red
   2      red
   3    green
   4    green
   5     blue
   6     blue
   dtype: object

The following example fills backward using method='bfill':

In [78]:
   # backwards fill example
   s3.reindex(np.arange(0,7), method='bfill')

Out[78]:
   0      red
   1    green
   2    green
   3    green
   4     blue
   5     blue
   6      NaN
   dtype: object

Modifying a Series in-place

There are several ways that an existing Series can be modified in-place, having either its values changed or having rows added or deleted. In-place modification of a Series is a slightly controversial topic. When possible, it is preferred to perform operations that return a new Series with the modifications represented in the new Series. However, it is possible to change values and add/remove rows in-place, and they will be explained here briefly.

A new item can be added to a Series by assigning a value to an index label that does not already exist. The following code creates a Series object and adds a new item to the series:

In [79]:
   # generate a Series to play with
   np.random.seed(123456)
   s = pd.Series(np.random.randn(3), index=['a', 'b', 'c'])
   s

Out[79]:
   a    0.469112
   b   -0.282863
   c   -1.509059
   dtype: float64

In [80]:
   # add a new item by assigning to a label that does not exist
   # this is done in-place
   # a new Series is not returned
   s['d'] = 100
   s

Out[80]:
   a      0.469112
   b     -0.282863
   c     -1.509059
   d    100.000000
   dtype: float64

The value at a specific index label can be changed by assignment:

In [81]:
   # modify the value at 'd' in-place
   s['d'] = -100
   s

Out[81]:
   a      0.469112
   b     -0.282863
   c     -1.509059
   d   -100.000000
   dtype: float64

An item can be removed from a Series using the del statement with the index label to be removed. The following code removes the item at index label 'a':

In [82]:
   # remove a row / item
   del s['a']
   s

Out[82]:
   b     -0.282863
   c     -1.509059
   d   -100.000000
   dtype: float64

Note

To add and remove items out-of-place, use pd.concat() to add items and a Boolean selection to remove them.
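A minimal sketch of both out-of-place operations, assuming a small Series with made-up values. The names added and removed are placeholders; each operation returns a new Series and leaves s unchanged.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])

# add: concatenate s with a one-item Series
added = pd.concat([s, pd.Series([4.0], index=['d'])])

# remove: a Boolean selection keeps only the rows
# whose label is not 'a'
removed = s[s.index != 'a']

print(added)
print(removed)
print(s)
```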
