Data is "missing" in pandas when it has a value of NaN (also seen as np.nan, the form from NumPy). A NaN value represents that, in a particular Series, there is no value specified for a particular index label. In pandas, there are a number of reasons why a value can be NaN; for example, the value may simply not be known at a given point in time and will be filled in later. There are likely more reasons, but the general point is that they occur, and you, as a pandas programmer, will need to work with them effectively to be able to perform correct data analysis. Fortunately, pandas provides you with several tools to identify missing values and to substitute them with values that facilitate your analysis.
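One subtlety worth keeping in mind is that NaN compares unequal to everything, including itself, so you cannot test for missing data with ==. A minimal sketch (the Series here is hypothetical, just for illustration):

```python
import numpy as np
import pandas as pd

# NaN is not equal to anything, including itself,
# so == cannot be used to detect missing values
same = (np.nan == np.nan)   # False

# use pd.isnull() / the .isnull() method instead
s = pd.Series([1.0, np.nan, 3.0])
mask = s.isnull()           # [False, True, False]
```

This is why pandas provides explicit methods such as .isnull() and .notnull(), which are covered next.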
To demonstrate handling missing data, we will use the following DataFrame
object, which exhibits various patterns of missing data:
In [2]: # create a DataFrame with 5 rows and 3 columns
        df = pd.DataFrame(np.arange(0, 15).reshape(5, 3),
                          index=['a', 'b', 'c', 'd', 'e'],
                          columns=['c1', 'c2', 'c3'])
        df

Out[2]:
   c1  c2  c3
a   0   1   2
b   3   4   5
c   6   7   8
d   9  10  11
e  12  13  14
There is no missing data at this point, so let's add some:
In [3]: # add some columns and rows to the DataFrame
        # column c4 with NaN values
        df['c4'] = np.nan
        # row 'f' with 15 through 18
        df.loc['f'] = np.arange(15, 19)
        # row 'g' with all NaN
        df.loc['g'] = np.nan
        # column 'c5' with NaN values
        df['c5'] = np.nan
        # change value in col 'c4' row 'a'
        df.loc['a', 'c4'] = 20
        df

Out[3]:
   c1  c2  c3  c4  c5
a   0   1   2  20 NaN
b   3   4   5 NaN NaN
c   6   7   8 NaN NaN
d   9  10  11 NaN NaN
e  12  13  14 NaN NaN
f  15  16  17  18 NaN
g NaN NaN NaN NaN NaN
This DataFrame object exhibits the following characteristics, which will support most of the examples that follow in this section:

- a row (g) consisting only of NaN values
- a column (c5) consisting only of NaN values
- rows and columns containing a mix of actual values and NaN values

The NaN values in a DataFrame object can be identified using the .isnull() method. Any True value means that the item at that location is a NaN value:
In [4]: # which items are NaN?
        df.isnull()

Out[4]:
      c1     c2     c3     c4    c5
a  False  False  False  False  True
b  False  False  False   True  True
c  False  False  False   True  True
d  False  False  False   True  True
e  False  False  False   True  True
f  False  False  False  False  True
g   True   True   True   True  True
We can use the fact that the .sum()
method treats True
as 1 and False
as 0 to determine the number of NaN
values in a DataFrame
object. By applying .sum()
on the result of .isnull()
, we will get a total for the number of True
values (representing NaN
values) in each column:
In [5]: # count the number of NaN values in each column
        df.isnull().sum()

Out[5]:
c1    1
c2    1
c3    1
c4    5
c5    7
dtype: int64
Applying .sum()
to the resulting series gives the total number of NaN
values in the original DataFrame
object.
In [6]: # total count of NaN values
        df.isnull().sum().sum()

Out[6]: 15
Another way to determine this is to use the .count() method of the Series and DataFrame objects. For a Series object, this method returns the number of non-NaN values. For a DataFrame object, it counts the number of non-NaN values in each column:
In [7]: # number of non-NaN values in each column
        df.count()

Out[7]:
c1    6
c2    6
c3    6
c4    2
c5    0
dtype: int64
This then needs to be flipped around to sum the number of NaN
values, which can be calculated as follows:
In [8]: # and this counts the number of NaN values too
        (len(df) - df.count()).sum()

Out[8]: 15
We can also determine whether an item is not NaN
using the .notnull()
method, which returns True
if the value is not a NaN
value, otherwise it returns False
:
In [9]: # which items are not null?
        df.notnull()

Out[9]:
      c1     c2     c3     c4     c5
a   True   True   True   True  False
b   True   True   True  False  False
c   True   True   True  False  False
d   True   True   True  False  False
e   True   True   True  False  False
f   True   True   True   True  False
g  False  False  False  False  False
One technique for handling missing data is to simply remove it from your dataset. A scenario for this would be where data is sampled at regular intervals, but some devices were offline and did not produce a reading, and you only need the actual recorded values.
The pandas library makes this possible using several techniques; one is through Boolean selection using the results of .isnull()
and .notnull()
to retrieve the values that are NaN
or not NaN
out of a Series
object. To demonstrate, the following example selects all non-NaN
values from the c4
column of DataFrame
:
In [10]: # select the non-NaN items in column c4
         df.c4[df.c4.notnull()]

Out[10]:
a    20
f    18
Name: c4, dtype: float64
pandas also provides a convenience method, .dropna(), which will drop the items in a Series where the value is NaN, involving less typing than the previous example.
In [11]: # .dropna will also return non-NaN values
         # this gets all non-NaN items in column c4
         df.c4.dropna()

Out[11]:
a    20
f    18
Name: c4, dtype: float64
Note that .dropna() has actually returned a copy of the data with the NaN items removed. The original Series is not changed:
In [12]: # dropna returns a copy with the values dropped
         # the source DataFrame / column is not changed
         df.c4

Out[12]:
a    20
b   NaN
c   NaN
d   NaN
e   NaN
f    18
g   NaN
Name: c4, dtype: float64
When applied to a DataFrame
object, .dropna()
will drop all rows from a DataFrame
object that have at least one NaN
value. The following code demonstrates this in action, and since each row has at least one NaN
value, there are no rows in the result:
In [13]: # on a DataFrame this will drop entire rows
         # where there is at least one NaN
         # in this case, that is all rows
         df.dropna()

Out[13]:
Empty DataFrame
Columns: [c1, c2, c3, c4, c5]
Index: []
If you want to only drop rows where all values are NaN
, you can use the how='all'
parameter. The following code only drops the g
row since it has all NaN
values:
In [14]: # using how='all', only rows that have all values
         # as NaN will be dropped
         df.dropna(how='all')

Out[14]:
   c1  c2  c3  c4  c5
a   0   1   2  20 NaN
b   3   4   5 NaN NaN
c   6   7   8 NaN NaN
d   9  10  11 NaN NaN
e  12  13  14 NaN NaN
f  15  16  17  18 NaN
This can also be applied to the columns instead of the rows, by changing the axis parameter to axis=1
. The following code drops the c5
column as it is the only one with all NaN
values:
In [15]: # flip to drop columns instead of rows
         df.dropna(how='all', axis=1)  # say goodbye to c5

Out[15]:
   c1  c2  c3  c4
a   0   1   2  20
b   3   4   5 NaN
c   6   7   8 NaN
d   9  10  11 NaN
e  12  13  14 NaN
f  15  16  17  18
g NaN NaN NaN NaN
We can also examine this process using a slightly different DataFrame
object that has columns c1
and c3
with all values that are not NaN
. In this case, all columns except c1
and c3
will be dropped:
In [16]: # make a copy of df
         df2 = df.copy()
         # replace two NaN cells with values
         df2.loc['g', 'c1'] = 0
         df2.loc['g', 'c3'] = 0
         df2

Out[16]:
   c1  c2  c3  c4  c5
a   0   1   2  20 NaN
b   3   4   5 NaN NaN
c   6   7   8 NaN NaN
d   9  10  11 NaN NaN
e  12  13  14 NaN NaN
f  15  16  17  18 NaN
g   0 NaN   0 NaN NaN

In [17]: # now drop columns with any NaN values
         df2.dropna(how='any', axis=1)

Out[17]:
   c1  c3
a   0   2
b   3   5
c   6   8
d   9  11
e  12  14
f  15  17
g   0   0
The .dropna() method also has a parameter, thresh, which specifies the minimum number of non-NaN values that must be present for a row or column to be kept. The following code keeps only the columns with at least five non-NaN values, which drops the c4 and c5 columns:
In [18]: # keep only columns with at least 5 non-NaN values
         df.dropna(thresh=5, axis=1)

Out[18]:
    c1   c2   c3
a    0    1    2
b    3    4    5
c    6    7    8
d    9   10   11
e   12   13   14
f   15   16   17
g  NaN  NaN  NaN
Note that the .dropna()
method (and the Boolean selection) returns a copy of the DataFrame
object, and the data is dropped from that copy. If you want to drop the data in the actual DataFrame
, use the inplace=True
parameter.
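A quick sketch of the copy-versus-inplace distinction, using a small hypothetical Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# default behavior: a copy is returned; the original is untouched
dropped = s.dropna()
still_missing = s.isnull().sum()   # the NaN is still in s

# inplace=True modifies s itself (and returns None)
s.dropna(inplace=True)
```

The same pattern applies to .fillna() and the other methods discussed in this section.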
The NaN
values are handled differently in pandas than in NumPy. This is demonstrated using the following example:
In [19]: # create a NumPy array with one NaN value
         a = np.array([1, 2, np.nan, 3])
         # create a Series from the array
         s = pd.Series(a)
         # the mean of each is different
         a.mean(), s.mean()

Out[19]: (nan, 2.0)
NumPy functions, when encountering a NaN value, will return NaN. pandas functions will typically ignore the NaN values and continue processing as though those values were not part of the Series object.
More specifically, the way that pandas handles NaN values is as follows:

- Summing of data treats NaN values as 0
- Methods such as .mean() exclude NaN values entirely; they are not counted as items
- Cumulative methods such as .cumsum() treat NaN values as 0 in the running total, but preserve NaN in the corresponding positions of the result

The following code demonstrates all of these concepts:
In [20]: # demonstrate sum, mean and cumsum handling of NaN
         # get one column
         s = df.c4
         s.sum()  # NaN values treated as 0
Out[20]: 38.0

In [21]: s.mean()  # NaN values are not counted as items
Out[21]: 19.0

In [22]: # treated as 0 in the cumsum, but NaN values
         # are preserved in the result Series
         s.cumsum()
Out[22]:
a    20
b   NaN
c   NaN
d   NaN
e   NaN
f    38
g   NaN
Name: c4, dtype: float64
When using traditional mathematical operators, NaN
is propagated through to the result.
In [23]: # in arithmetic, a NaN value will result in NaN
         df.c4 + 1

Out[23]:
a    21
b   NaN
c   NaN
d   NaN
e   NaN
f    19
g   NaN
Name: c4, dtype: float64
If you prefer to replace the NaN
values with a specific value, instead of having them propagated or flat out ignored, you can use the .fillna()
method. The following code fills the NaN
values with 0
:
In [24]: # return a new DataFrame with NaN values filled with 0
         filled = df.fillna(0)
         filled

Out[24]:
   c1  c2  c3  c4  c5
a   0   1   2  20   0
b   3   4   5   0   0
c   6   7   8   0   0
d   9  10  11   0   0
e  12  13  14   0   0
f  15  16  17  18   0
g   0   0   0   0   0
Be aware that this causes differences in the resulting values. As an example, the following code shows the result of applying the .mean()
method to the DataFrame
object with the NaN
values, as compared to the DataFrame
that has its NaN
values filled with 0
:
In [25]: # NaNs don't count as an item in calculating
         # the means
         df.mean()

Out[25]:
c1     7.5
c2     8.5
c3     9.5
c4    19.0
c5     NaN
dtype: float64

In [26]: # having replaced NaN with 0 can make
         # operations such as mean have different results
         filled.mean()

Out[26]:
c1    6.428571
c2    7.285714
c3    8.142857
c4    5.428571
c5    0.000000
dtype: float64
It is also possible to limit the number of values that will be filled using the limit parameter. With limit=n, pandas replaces at most the first n NaN values along each column (or each row, when axis=1).
In [27]: # only fill the first two NaN values in each column with 0
         df.fillna(0, limit=2)

Out[27]:
   c1  c2  c3  c4  c5
a   0   1   2  20   0
b   3   4   5   0   0
c   6   7   8   0 NaN
d   9  10  11 NaN NaN
e  12  13  14 NaN NaN
f  15  16  17  18 NaN
g   0   0   0 NaN NaN
Gaps in data can be filled by propagating non-NaN values forward or backward along a Series
. To demonstrate this, the following example will "fill forward" the c4
column of DataFrame
:
In [28]: # extract the c4 column and fill NaNs forward
         df.c4.fillna(method="ffill")

Out[28]:
a    20
b    20
c    20
d    20
e    20
f    18
g    18
Name: c4, dtype: float64
The direction of the fill can be reversed using method='bfill'
:
In [29]: # perform a backwards fill
         df.c4.fillna(method="bfill")

Out[29]:
a    20
b    18
c    18
d    18
e    18
f    18
g   NaN
Name: c4, dtype: float64
To save a little typing, pandas also provides the convenience methods .ffill() and .bfill(), which are equivalent to .fillna(method="ffill") and .fillna(method="bfill"), respectively.
Data can be filled using the labels of a Series
or keys of a Python dictionary. This allows you to specify different fill values for different elements based upon the value of the index label:
In [30]: # create a new Series of values to be
         # used to fill NaN values where the index label matches
         fill_values = pd.Series([100, 101, 102], index=['a', 'e', 'g'])
         fill_values

Out[30]:
a    100
e    101
g    102
dtype: int64

In [31]: # using c4, fill using fill_values
         # a, e and g will be filled with matching values
         df.c4.fillna(fill_values)

Out[31]:
a     20
b    NaN
c    NaN
d    NaN
e    101
f     18
g    102
Name: c4, dtype: float64
Only NaN values will be filled. Notice that the value with label a is not changed, even though fill_values contains an entry for it.
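On a DataFrame, a related option is to pass a dictionary to .fillna() mapping column names to per-column fill values. A sketch with a small, hypothetical frame:

```python
import numpy as np
import pandas as pd

# two columns with missing values
df = pd.DataFrame({'c4': [20.0, np.nan], 'c5': [np.nan, np.nan]})

# keys are column names, values are the fill value for that column
filled = df.fillna({'c4': 0, 'c5': -1})
```

As with the Series case, existing values (such as the 20.0 in c4) are left untouched.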
Another common scenario is to fill all the NaN values in a column with the mean of that column:
In [32]: # fill NaN values in each column with the
         # mean of the values in that column
         df.fillna(df.mean())

Out[32]:
     c1    c2    c3  c4  c5
a   0.0   1.0   2.0  20 NaN
b   3.0   4.0   5.0  19 NaN
c   6.0   7.0   8.0  19 NaN
d   9.0  10.0  11.0  19 NaN
e  12.0  13.0  14.0  19 NaN
f  15.0  16.0  17.0  18 NaN
g   7.5   8.5   9.5  19 NaN
Both DataFrame
and Series
have an .interpolate()
method that will, by default, perform a linear interpolation of missing values:
In [33]: # linear interpolate the NaN values from 1 through 2
         s = pd.Series([1, np.nan, np.nan, np.nan, 2])
         s.interpolate()

Out[33]:
0    1.00
1    1.25
2    1.50
3    1.75
4    2.00
dtype: float64
The interpolated values are calculated by taking the first values before and after any sequence of NaN values, computing the step between them, and then incrementally adding that step while substituting the NaN values. In this case, 1.0 and 2.0 are the surrounding values, giving a step of (2.0 - 1.0)/(5 - 1) = 0.25, which is then added incrementally through all the NaN values.
The interpolation method also has the ability to specify a specific method of interpolation. One of the common methods is to use time-based interpolation. Consider the following Series
of dates and values:
In [34]: # create a time series, but missing one date in the Series
         ts = pd.Series([1, np.nan, 2],
                        index=[datetime.datetime(2014, 1, 1),
                               datetime.datetime(2014, 2, 1),
                               datetime.datetime(2014, 4, 1)])
         ts

Out[34]:
2014-01-01     1
2014-02-01   NaN
2014-04-01     2
dtype: float64
Normal interpolation results are as shown in the following example:
In [35]: # linear interpolate based on the number of items in the Series
         ts.interpolate()

Out[35]:
2014-01-01    1.0
2014-02-01    1.5
2014-04-01    2.0
dtype: float64
The value for 2014-02-01 is calculated as 1.0 + (2.0 - 1.0)/2 = 1.5, since there is one NaN value between the values 1.0 and 2.0.
The important thing to note is that the series is missing an entry for 2014-03-01. If we were expecting to interpolate monthly values, there would be two values to calculate, one for 2014-02-01 and another for 2014-03-01, changing the divisor of the interpolation and hence the result for 2014-02-01.
This can be corrected by specifying the method of interpolation as "time
":
In [36]: # this accounts for the fact that we don't have
         # an entry for 2014-03-01
         ts.interpolate(method="time")

Out[36]:
2014-01-01    1.000000
2014-02-01    1.344444
2014-04-01    2.000000
dtype: float64
This is the correct interpolation for 2014-02-01 based upon the dates. Also note that the index label and value for 2014-03-01 are not added to the Series; the missing date is just factored into the interpolation.
Interpolation can also be specified to calculate values relative to the index values when using numeric index labels. To demonstrate this, we will use the following Series
:
In [37]: # a Series to demonstrate index label based interpolation
         s = pd.Series([0, np.nan, 100], index=[0, 1, 10])
         s

Out[37]:
0       0
1     NaN
10    100
dtype: float64
If we perform a linear interpolation, we get the following value for label 1
, which is correct for a linear interpolation:
In [38]: # linear interpolate
         s.interpolate()

Out[38]:
0       0
1      50
10    100
dtype: float64
However, what if we want to interpolate the value to be relative to the index value? To do this, we can use method="values"
:
In [39]: # interpolate based upon the values in the index
         s.interpolate(method="values")

Out[39]:
0       0
1      10
10    100
dtype: float64
Now, the value calculated for NaN
is interpolated using relative positioning based upon the labels in the index. The NaN
value has a label of 1
, which is one tenth of the way between 0
and 10
, so the interpolated value will be 0 + (100-0)/10, or 10
.
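The two behaviors can be checked side by side (repeating the Series above as a sketch):

```python
import numpy as np
import pandas as pd

s = pd.Series([0, np.nan, 100], index=[0, 1, 10])

# default 'linear' ignores the index: the single NaN
# is the positional midpoint between 0 and 100
by_position = s.interpolate()

# method="values" weights by the numeric index labels:
# label 1 is one tenth of the way from 0 to 10
by_values = s.interpolate(method="values")
```

The choice matters whenever the index labels are not evenly spaced, just as with the time-based example earlier.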