We regularly encounter empty fields in data records. It's best that we accept this and learn how to handle this kind of issue in a robust manner. Real data can not only have gaps, it can also have wrong values because of faulty measuring equipment, for example. In pandas, missing numerical values will be designated as NaN
, objects as None
, and the datetime64
objects as NaT
. The outcome of arithmetic operations with NaN
values is NaN
as well. Descriptive statistics methods, such as summation and average, behave differently. As we observed in an earlier example, in such a case, NaN
values are treated as zero values. However, if all the values are NaN
during summation, for example, the sum returned is still NaN
. In aggregation operations, NaN
values in the column that we group are ignored. We will again load the WHO_first9cols.csv
file into a DataFrame. Recall that this file contains empty fields. Let's only select the first three rows, including the headers of the Country
and Net primary school enrolment ratio male (%)
columns as follows:
df = df[['Country', df.columns[-2]]][:2] print "New df ", df
We get a DataFrame with two NaN values:
New df Country Net primary school enrolment ratio male (%) 0 Afghanistan NaN 1 Albania 94 [2 rows x 2 columns]
The pandas isnull()
function checks for missing values as follows:
print "Null Values ", pd.isnull(df)
The output for our DataFrame is as follows:
Null Values Country Net primary school enrolment ratio male (%) 0 False True 1 False False
To count the number of NaN
values for each column, we can sum the Boolean values returned by isnull()
. This works because during summation, True
values are considered as ones and False
values are treated as zeros:
Total Null Values Country 0 Net primary school enrolment ratio male (%) 1 dtype: int64
Likewise, we can check with the DataFrame notnull()
method for any non-missing values that are present:
print "Not Null Values ", df.notnull()
The result of the notnull()
method is the opposite of the isnull()
function:
Not Null Values Country Net primary school enrolment ratio male (%) 0 True False 1 True True
When we double values in a DataFrame that has NaN
values, the product will still contain NaN
values, since doubling is an arithmetic operation:
print "Last Column Doubled ", 2 * df[df.columns[-1]]
We double the last column, which contains numerical values (doubling string values repeats the string):
Last Column Doubled 0 NaN 1 188 Name: Net primary school enrolment ratio male (%), dtype: float64
If we add a NaN
value, however, the NaN
value wins:
print "Last Column plus NaN ", df[df.columns[-1]] + np.nan
As you can see, the NaN
values declared total victory:
Last Column plus NaN 0 NaN 1 NaN Name: Net primary school enrolment ratio male (%), dtype: float64
Replace the missing values by a scalar value, for example, 0
(we can't always replace missing values with zeros, but sometimes this is good enough) with the fillna()
method:
print "Zero filled ", df.fillna(0)
The effect of the preceding line is to replace the NaN value with 0:
Zero filled Country Net primary school enrolment ratio male (%) 0 Afghanistan 0 1 Albania 94
The code for this section is in the missing_values.py
file of this book's code bundle:
import pandas as pd import numpy as np df = pd.read_csv('WHO_first9cols.csv') # Select first 3 rows of country and Net primary school enrolment ratio male (%) df = df[['Country', df.columns[-2]]][:2] print "New df ", df print "Null Values ", pd.isnull(df) print "Total Null Values ", pd.isnull(df).sum() print "Not Null Values ", df.notnull() print "Last Column Doubled ", 2 * df[df.columns[-1]] print "Last Column plus NaN ", df[df.columns[-1]] + np.nan print "Zero filled ", df.fillna(0)
3.144.17.128