Working with missing data

Data is "missing" in pandas when it has a value of NaN (also seen as np.nan, the form from NumPy). A NaN value represents that, in a particular Series, there is no value specified for a particular index label.

In pandas, there are a number of reasons why a value can be NaN:

  • A join of two sets of data does not have matched values
  • Data that you retrieved from an external source is incomplete
  • The NaN value is not known at a given point in time and will be filled in later
  • There is a data collection error retrieving a value, but the event must still be recorded in the index
  • Reindexing of data has resulted in an index that does not have a value
  • The shape of data has changed and there are now additional rows or columns, which at the time of reshaping could not be determined

There are likely more reasons, but the general point is that missing values occur, and you, as a pandas programmer, will need to work with them effectively to perform correct data analysis. Fortunately, pandas provides you with several tools to identify missing values and to substitute them with values that facilitate your analysis.
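
As a quick illustration of two of these reasons, the following is a minimal sketch (assuming the standard pandas and NumPy imports used throughout this chapter) showing how reindexing and joins introduce NaN values for unmatched labels:

   import numpy as np
   import pandas as pd

   # reindexing to a label that does not exist introduces NaN
   s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
   s.reindex(['a', 'b', 'c', 'd'])   # the value at 'd' is NaN

   # an outer merge leaves NaN where keys are unmatched
   left = pd.DataFrame({'key': ['a', 'b'], 'lval': [1, 2]})
   right = pd.DataFrame({'key': ['b', 'c'], 'rval': [3, 4]})
   pd.merge(left, right, on='key', how='outer')   # unmatched cells are NaN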

To demonstrate handling missing data, we will use the following DataFrame object, which exhibits various patterns of missing data:

In [2]:
   # create a DataFrame with 5 rows and 3 columns
   df = pd.DataFrame(np.arange(0, 15).reshape(5, 3), 
                  index=['a', 'b', 'c', 'd', 'e'], 
                  columns=['c1', 'c2', 'c3'])
   df

Out[2]:
      c1  c2  c3
   a   0   1   2
   b   3   4   5
   c   6   7   8
   d   9  10  11
   e  12  13  14

There is no missing data at this point, so let's add some:

In [3]:
   # add some columns and rows to the DataFrame
   # column c4 with NaN values
   df['c4'] = np.nan
   # row 'f' with 15 through 18 
   df.loc['f'] = np.arange(15, 19) 
   # row 'g' with all NaN values
   df.loc['g'] = np.nan
   # column 'c5' with NaN values
   df['c5'] = np.nan
   # change the value in column 'c4', row 'a'
   # (using .loc avoids chained assignment)
   df.loc['a', 'c4'] = 20
   df

Out[3]:
      c1  c2  c3  c4  c5
   a   0   1   2  20 NaN
   b   3   4   5 NaN NaN
   c   6   7   8 NaN NaN
   d   9  10  11 NaN NaN
   e  12  13  14 NaN NaN
   f  15  16  17  18 NaN
   g NaN NaN NaN NaN NaN

This DataFrame object exhibits the following characteristics that will support most of the examples that follow in this section:

  • One row consisting only of NaN values
  • One column consisting only of NaN values
  • Several rows and columns consisting of both numeric values and NaN values

Determining NaN values in Series and DataFrame objects

The NaN values in a DataFrame object can be identified using the .isnull() method. Any True value means that the item is a NaN value:

In [4]:
   # which items are NaN?
   df.isnull()

Out[4]:
         c1     c2     c3     c4    c5
   a  False  False  False  False  True
   b  False  False  False   True  True
   c  False  False  False   True  True
   d  False  False  False   True  True
   e  False  False  False   True  True
   f  False  False  False  False  True
   g   True   True   True   True  True

We can use the fact that the .sum() method treats True as 1 and False as 0 to determine the number of NaN values in a DataFrame object. By applying .sum() on the result of .isnull(), we will get a total for the number of True values (representing NaN values) in each column:

In [5]:
   # count the number of NaN values in each column
   df.isnull().sum()

Out[5]:
   c1    1
   c2    1
   c3    1
   c4    5
   c5    7
   dtype: int64

Applying .sum() to the resulting series gives the total number of NaN values in the original DataFrame object.

In [6]:
   # total count of NaN values
   df.isnull().sum().sum()

Out[6]:
   15

Another way to determine this is to use the .count() method of the Series and DataFrame objects. For a Series, this method returns the number of non-NaN values. For a DataFrame object, it counts the number of non-NaN values in each column:

In [7]:
   # number of non-NaN values in each column
   df.count()

Out[7]:
   c1    6
   c2    6
   c3    6
   c4    2
   c5    0
   dtype: int64

This then needs to be flipped around to sum the number of NaN values, which can be calculated as follows:

In [8]:
   # and this counts the number of NaN values too
   (len(df) - df.count()).sum()

Out[8]:
   15

We can also determine whether an item is not NaN using the .notnull() method, which returns True if the value is not NaN and False otherwise:

In [9]:
   # which items are not null?
   df.notnull()

Out[9]:
         c1     c2     c3     c4     c5
   a   True   True   True   True  False
   b   True   True   True  False  False
   c   True   True   True  False  False
   d   True   True   True  False  False
   e   True   True   True  False  False
   f   True   True   True   True  False
   g  False  False  False  False  False

Selecting out or dropping missing data

One technique of handling missing data is to simply remove it from your dataset. A scenario for this would be where data is sampled at regular intervals, but some devices occasionally go offline and miss a reading, and your analysis only needs the readings that were actually recorded.

The pandas library makes this possible using several techniques; one is through Boolean selection, using the results of .isnull() and .notnull() to retrieve the values that are NaN or not NaN out of a Series object. To demonstrate, the following example selects all non-NaN values from the c4 column of the DataFrame:

In [10]:
   # select the non-NaN items in column c4
   df.c4[df.c4.notnull()]

Out[10]:
   a    20
   f    18
   Name: c4, dtype: float64

pandas also provides the convenience method .dropna(), which drops the items in a Series whose value is NaN and involves less typing than the previous example.

In [11]:
   # .dropna() will also return the non-NaN values
   # this gets all non-NaN items in column c4
   df.c4.dropna()

Out[11]:
   a    20
   f    18
   Name: c4, dtype: float64

Note that .dropna() has actually returned a copy of the Series with the NaN values removed. The original Series (and hence the DataFrame) is not changed:

In [12]:
   # dropna returns a copy with the values dropped
   # the source DataFrame / column is not changed
   df.c4

Out[12]:
   a    20
   b   NaN
   c   NaN
   d   NaN
   e   NaN
   f    18
   g   NaN
   Name: c4, dtype: float64

When applied to a DataFrame object, .dropna() will drop all rows from a DataFrame object that have at least one NaN value. The following code demonstrates this in action, and since each row has at least one NaN value, there are no rows in the result:

In [13]:
   # on a DataFrame this will drop entire rows
   # where there is at least one NaN
   # in this case, that is all rows
   df.dropna()

Out[13]:
   Empty DataFrame
   Columns: [c1, c2, c3, c4, c5]
   Index: []

If you want to only drop rows where all values are NaN, you can use the how='all' parameter. The following code only drops the g row since it has all NaN values:

In [14]:
   # using how='all', only rows that have all values
   # as NaN will be dropped
   df.dropna(how='all')

Out[14]:
      c1  c2  c3  c4  c5
   a   0   1   2  20 NaN
   b   3   4   5 NaN NaN
   c   6   7   8 NaN NaN
   d   9  10  11 NaN NaN
   e  12  13  14 NaN NaN
   f  15  16  17  18 NaN

This can also be applied to the columns instead of the rows, by changing the axis parameter to axis=1. The following code drops the c5 column as it is the only one with all NaN values:

In [15]:
   # flip to drop columns instead of rows
   df.dropna(how='all', axis=1) # say goodbye to c5

Out[15]:
      c1  c2  c3  c4
   a   0   1   2  20
   b   3   4   5 NaN
   c   6   7   8 NaN
   d   9  10  11 NaN
   e  12  13  14 NaN
   f  15  16  17  18
   g NaN NaN NaN NaN

We can also examine this process using a slightly different DataFrame object, in which columns c1 and c3 contain no NaN values. In this case, all columns except c1 and c3 will be dropped:

In [16]:
   # make a copy of df
   df2 = df.copy()

   # replace two NaN cells with values
   df2.loc['g', 'c1'] = 0
   df2.loc['g', 'c3'] = 0
   df2

Out[16]:
      c1  c2  c3  c4  c5
   a   0   1   2  20 NaN
   b   3   4   5 NaN NaN
   c   6   7   8 NaN NaN
   d   9  10  11 NaN NaN
   e  12  13  14 NaN NaN
   f  15  16  17  18 NaN
   g   0 NaN   0 NaN NaN

In [17]:
   # now drop columns with any NaN values
   df2.dropna(how='any', axis=1)

Out[17]:
      c1  c3
   a   0   2
   b   3   5
   c   6   8
   d   9  11
   e  12  14
   f  15  17
   g   0   0

The .dropna() method also has a thresh parameter, which, when given an integer value, specifies the minimum number of non-NaN values that must exist for a row or column to be kept. The following code drops all columns that have fewer than five non-NaN values; these are the c4 and c5 columns:

In [18]:
   # keep only columns with at least 5 non-NaN values
   df.dropna(thresh=5, axis=1)

Out[18]:
      c1  c2  c3
   a   0   1   2
   b   3   4   5
   c   6   7   8
   d   9  10  11
   e  12  13  14
   f  15  16  17
   g NaN NaN NaN

Note that the .dropna() method (and the Boolean selection) returns a copy of the DataFrame object, with the data dropped from that copy. If you want to drop the data in the actual DataFrame, use the inplace=True parameter.
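
The following is a minimal sketch of the inplace variant, applied to a throwaway copy so that df is preserved for the examples that follow:

   # .dropna(..., inplace=True) modifies the object directly
   # and returns None rather than a new DataFrame
   df3 = df.copy()
   df3.dropna(how='all', inplace=True)
   df3   # row 'g' is now gone from df3 itself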

How pandas handles NaN values in mathematical operations

The NaN values are handled differently in pandas than in NumPy. This is demonstrated using the following example:

In [19]:
   # create a NumPy array with one NaN value
   a = np.array([1, 2, np.nan, 3])
   # create a Series from the array
   s = pd.Series(a)
   # the mean of each is different
   a.mean(), s.mean()

Out[19]:
   (nan, 2.0)

NumPy functions, when encountering a NaN value, will return NaN. pandas functions will typically ignore the NaN values and continue processing as though those values were not part of the Series object.

Note

Note that the mean of the preceding series was calculated as (1+2+3)/3 = 2, not (1+2+3)/4 or (1+2+0+3)/4. This verifies that NaN is totally ignored and not even counted as an item in the Series.

More specifically, the way that pandas handles NaN values is as follows:

  • Summing of data treats NaN as 0
  • If all values are NaN, the result is NaN
  • Methods like .cumsum() and .cumprod() ignore NaN values, but preserve them in the resulting arrays

The following code demonstrates all of these concepts:

In [20]:
   # demonstrate sum, mean and cumsum handling of NaN
   # get one column
   s = df.c4
   s.sum() # NaN values treated as 0

Out[20]:
   38.0

In [21]:
   s.mean() # NaN values are excluded from the count

Out[21]:
   19.0

In [22]:
   # NaN values are ignored in the cumsum, but preserved in the result Series
   s.cumsum()

Out[22]:
   a    20
   b   NaN
   c   NaN
   d   NaN
   e   NaN
   f    38
   g   NaN
   Name: c4, dtype: float64

When using traditional mathematical operators, NaN is propagated through to the result.

In [23]:
   # in arithmetic, a NaN value will result in NaN
   df.c4 + 1

Out[23]:
   a    21
   b   NaN
   c   NaN
   d   NaN
   e   NaN
   f    19
   g   NaN
   Name: c4, dtype: float64

Filling in missing data

If you prefer to replace the NaN values with a specific value, instead of having them propagated or flat out ignored, you can use the .fillna() method. The following code fills the NaN values with 0:

In [24]:
   # return a new DataFrame with NaN values filled with 0
   filled = df.fillna(0)
   filled

Out[24]:
      c1  c2  c3  c4  c5
   a   0   1   2  20   0
   b   3   4   5   0   0
   c   6   7   8   0   0
   d   9  10  11   0   0
   e  12  13  14   0   0
   f  15  16  17  18   0
   g   0   0   0   0   0

Be aware that this causes differences in the resulting values. As an example, the following code shows the result of applying the .mean() method to the DataFrame object with the NaN values, as compared to the DataFrame that has its NaN values filled with 0:

In [25]:
   # NaNs don't count as an item in calculating
   # the means
   df.mean()

Out[25]:
   c1     7.5
   c2     8.5
   c3     9.5
   c4    19.0
   c5     NaN
   dtype: float64

In [26]:
   # having replaced NaN with 0 can make
   # operations such as mean have different results
   filled.mean()

Out[26]:
   c1    6.428571
   c2    7.285714
   c3    8.142857
   c4    5.428571
   c5    0.000000
   dtype: float64

It is also possible to limit the number of values that will be filled using the limit parameter. When filling with a value, pandas fills at most limit NaN values in each column, starting from the top of the column.

In [27]:
   # only fills the first two NaN values in each column with 0
   df.fillna(0, limit=2)

Out[27]:
      c1  c2  c3  c4  c5
   a   0   1   2  20   0
   b   3   4   5   0   0
   c   6   7   8   0 NaN
   d   9  10  11 NaN NaN
   e  12  13  14 NaN NaN
   f  15  16  17  18 NaN
   g   0   0   0 NaN NaN

Forward and backward filling of missing values

Gaps in data can be filled by propagating non-NaN values forward or backward along a Series. To demonstrate this, the following example will "fill forward" the c4 column of the DataFrame:

In [28]:
   # extract the c4 column and fill NaNs forward
   df.c4.fillna(method="ffill")

Out[28]:
   a    20
   b    20
   c    20
   d    20
   e    20
   f    18
   g    18
   Name: c4, dtype: float64

Note

When working with time series data, this technique of filling is often referred to as the "last known value".

The direction of the fill can be reversed using method='bfill':

In [29]:
   # perform a backwards fill
   df.c4.fillna(method="bfill")

Out[29]:
   a    20
   b    18
   c    18
   d    18
   e    18
   f    18
   g   NaN
   Name: c4, dtype: float64

To save a little typing, pandas also provides the .ffill() and .bfill() methods on Series and DataFrame objects, which are equivalent to .fillna(method="ffill") and .fillna(method="bfill"), respectively.
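
For example, the following two expressions produce the same result as the forward fill shown earlier:

   # .ffill() is shorthand for .fillna(method='ffill')
   df.c4.ffill()
   # equivalent to
   df.c4.fillna(method='ffill')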

Filling using index labels

Data can be filled using the labels of a Series or keys of a Python dictionary. This allows you to specify different fill values for different elements based upon the value of the index label:

In [30]:
   # create a new Series of values to be 
   # used to fill NaN values where the index label matches
   fill_values = pd.Series([100, 101, 102], index=['a', 'e', 'g'])
   fill_values

Out[30]:
   a    100
   e    101
   g    102
   dtype: int64

In [31]:
   # using c4, fill using fill_values
   # a, e and g will be filled with matching values
   df.c4.fillna(fill_values)

Out[31]:
   a     20
   b    NaN
   c    NaN
   d    NaN
   e    101
   f     18
   g    102
   Name: c4, dtype: float64

Only NaN values are filled. Notice that the value with label a is not changed, even though a fill value of 100 was supplied for that label.

Another common scenario is to fill all the NaN values in a column with the mean of that column:

In [32]:
   # fill NaN values in each column with the 
   # mean of the values in that column
   df.fillna(df.mean())

Out[32]:
        c1    c2    c3  c4  c5
   a   0.0   1.0   2.0  20 NaN
   b   3.0   4.0   5.0  19 NaN
   c   6.0   7.0   8.0  19 NaN
   d   9.0  10.0  11.0  19 NaN
   e  12.0  13.0  14.0  19 NaN
   f  15.0  16.0  17.0  18 NaN
   g   7.5   8.5   9.5  19 NaN

Interpolation of missing values

Both DataFrame and Series have an .interpolate() method that will, by default, perform a linear interpolation of missing values:

In [33]:
   # linear interpolate the NaN values from 1 through 2
   s = pd.Series([1, np.nan, np.nan, np.nan, 2])
   s.interpolate()

Out[33]:
   0    1.00
   1    1.25
   2    1.50
   3    1.75
   4    2.00
   dtype: float64

The interpolated values are calculated by taking the known values immediately before and after a sequence of NaN values, dividing their difference by the number of steps between them, and then adding that increment cumulatively in place of the NaN values. In this case, 1.0 and 2.0 are the surrounding values, resulting in an increment of (2.0 - 1.0)/(5 - 1) = 0.25, which is then added cumulatively through all the NaN values.
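
Because the fill is a constant increment, the result matches a sequence of evenly spaced values between the two known endpoints. The following is a quick sanity check of this, reusing the Series from the previous example:

   # the interpolated values are the same as five evenly
   # spaced values from 1 to 2
   np.allclose(s.interpolate().values, np.linspace(1, 2, 5))   # True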

The .interpolate() method also lets you specify a specific method of interpolation. One common choice is time-based interpolation. Consider the following Series of dates and values:

In [34]:
   # create a time series, but missing one date in the Series
   ts = pd.Series([1, np.nan, 2], 
               index=[datetime.datetime(2014, 1, 1), 
                      datetime.datetime(2014, 2, 1),                   
                      datetime.datetime(2014, 4, 1)])
   ts

Out[34]:
   2014-01-01     1
   2014-02-01   NaN
   2014-04-01     2
   dtype: float64

Normal interpolation results are as shown in the following example:

In [35]:
   # linear interpolate based on the number of items in the Series
   ts.interpolate()

Out[35]:
   2014-01-01    1.0
   2014-02-01    1.5
   2014-04-01    2.0
   dtype: float64

The value for 2014-02-01 is calculated as 1.0 + (2.0-1.0)/2 = 1.5, since there is one NaN value between the values 2.0 and 1.0.

The important thing to note is that the series is missing an entry for 2014-03-01. If we were interpolating over evenly spaced monthly values, there would be two values to calculate, one for 2014-02-01 and another for 2014-03-01, and the gap between 1.0 and 2.0 would be divided into three steps instead of two (see the sketch following the time-based example below).

This can be corrected by specifying the method of interpolation as "time":

In [36]:
   # this accounts for the fact that we don't have
   # an entry for 2014-03-01
   ts.interpolate(method="time")

Out[36]:
   2014-01-01    1.000000
   2014-02-01    1.344444
   2014-04-01    2.000000
   dtype: float64

This is the correct interpolation for 2014-02-01 based upon dates: 31 of the 90 days between the two known values have elapsed by 2014-02-01, so the interpolated value is 1.0 + 31/90 ≈ 1.3444. Also note that the index label and value for 2014-03-01 are not added to the Series; the missing date is just factored into the interpolation.
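
To see the effect of the missing entry directly, the series can be reindexed to include 2014-03-01 before interpolating. The following sketch shows that the added label splits the gap into three equal steps:

   # add the missing month to the index, then interpolate linearly;
   # 2014-02-01 becomes 1.333... and 2014-03-01 becomes 1.666...
   dates = pd.to_datetime(['2014-01-01', '2014-02-01',
                           '2014-03-01', '2014-04-01'])
   ts.reindex(dates).interpolate()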

Interpolation can also be specified to calculate values relative to the index values when using numeric index labels. To demonstrate this, we will use the following Series:

In [37]:
   # a Series to demonstrate index label based interpolation
   s = pd.Series([0, np.nan, 100], index=[0, 1, 10])
   s

Out[37]:
   0       0
   1     NaN
   10    100
   dtype: float64

If we perform a linear interpolation, we get the following value for label 1, which is correct if the points are treated as evenly spaced:

In [38]:
   # linear interpolate
   s.interpolate()

Out[38]:
   0       0
   1      50
   10    100
   dtype: float64

However, what if we want to interpolate the value to be relative to the index value? To do this, we can use method="values":

In [39]:
   # interpolate based upon the values in the index
   s.interpolate(method="values")

Out[39]:
   0       0
   1      10
   10    100
   dtype: float64

Now, the value calculated for NaN is interpolated using relative positioning based upon the labels in the index. The NaN value has a label of 1, which is one tenth of the way between 0 and 10, so the interpolated value will be 0 + (100-0)/10, or 10.
