Handling missing values

We regularly encounter empty fields in data records. It's best that we accept this and learn how to handle this kind of issue in a robust manner. Real data can not only have gaps, it can also have wrong values because of faulty measuring equipment, for example. In pandas, missing numerical values will be designated as NaN, objects as None, and the datetime64 objects as NaT. The outcome of arithmetic operations with NaN values is NaN as well. Descriptive statistics methods, such as summation and average, behave differently. As we observed in an earlier example, in such a case, NaN values are treated as zero values. However, if all the values are NaN during summation, for example, the sum returned is still NaN. In aggregation operations, NaN values in the column that we group are ignored. We will again load the WHO_first9cols.csv file into a DataFrame. Recall that this file contains empty fields. Let's only select the first three rows, including the headers of the Country and Net primary school enrolment ratio male (%) columns as follows:

df = df[['Country', df.columns[-2]]][:2]
print "New df
", df

We get a DataFrame with two NaN values:

New df
       Country  Net primary school enrolment ratio male (%)
0  Afghanistan                                          NaN
1      Albania                                           94

[2 rows x 2 columns]

The pandas isnull() function checks for missing values as follows:

print "Null Values
", pd.isnull(df)

The output for our DataFrame is as follows:

Null Values
  Country Net primary school enrolment ratio male (%)
0   False                                        True
1   False                                       False

To count the number of NaN values for each column, we can sum the Boolean values returned by isnull(). This works because during summation, True values are considered as ones and False values are treated as zeros:

Total Null Values
Country                                        0
Net primary school enrolment ratio male (%)    1
dtype: int64

Likewise, we can check with the DataFrame notnull() method for any non-missing values that are present:

print "Not Null Values
", df.notnull()

The result of the notnull() method is the opposite of the isnull() function:

Not Null Values
  Country Net primary school enrolment ratio male (%)
0    True                                       False
1    True                                        True

When we double values in a DataFrame that has NaN values, the product will still contain NaN values, since doubling is an arithmetic operation:

print "Last Column Doubled
", 2 * df[df.columns[-1]]

We double the last column, which contains numerical values (doubling string values repeats the string):

Last Column Doubled
0    NaN
1    188
Name: Net primary school enrolment ratio male (%), dtype: float64

If we add a NaN value, however, the NaN value wins:

print "Last Column plus NaN
", df[df.columns[-1]] + np.nan

As you can see, the NaN values declared total victory:

Last Column plus NaN
0   NaN
1   NaN
Name: Net primary school enrolment ratio male (%), dtype: float64

Replace the missing values by a scalar value, for example, 0 (we can't always replace missing values with zeros, but sometimes this is good enough) with the fillna() method:

print "Zero filled
", df.fillna(0)

The effect of the preceding line is to replace the NaN value with 0:

Zero filled
       Country  Net primary school enrolment ratio male (%)
0  Afghanistan                                            0
1      Albania                                           94

The code for this section is in the missing_values.py file of this book's code bundle:

import pandas as pd
import numpy as np

df = pd.read_csv('WHO_first9cols.csv')
# Select first 3 rows of country and Net primary school enrolment ratio male (%)
df = df[['Country', df.columns[-2]]][:2]
print "New df
", df
print "Null Values
", pd.isnull(df)
print "Total Null Values
", pd.isnull(df).sum()
print "Not Null Values
", df.notnull()
print "Last Column Doubled
", 2 * df[df.columns[-1]]
print "Last Column plus NaN
", df[df.columns[-1]] + np.nan
print "Zero filled
", df.fillna(0)
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.144.17.128