5. Missing Data

5.1 Introduction

Rarely will you be given a data set without any missing values. There are many representations of missing data. In databases, they are NULL values; certain programming languages use NA; and depending on where you get your data, missing values can be an empty string, '', or even numeric values such as 88 or 99. Pandas displays missing values as NaN.

Concept Map

1. Prior knowledge

a. importing libraries

b. slicing and indexing data

c. using functions and methods

d. using function parameters

Objectives

This chapter will cover:

1. What a missing value is

2. How missing values are created

3. How to recode and make calculations with missing values

5.2 What Is a NaN Value?

The NaN value in Pandas comes from numpy. Missing values may be used or displayed in a few ways in Python—NaN, NAN, or nan—but they are all equivalent. Appendix H describes how these missing values are imported.

# Just import the numpy missing values
from numpy import NaN, NAN, nan

Missing values are different than other types of data, in that they don’t really equal anything. The data is missing, so there is no concept of equality. NaN is not be equivalent to 0 or an empty string, ''.

We can illustrate this idea in Python by testing for equality.

print(NaN == True)

False

print(NaN == False)

False

print(NaN == 0)

False

print(NaN == '')

False

Missing values are also not equal to other missing values.

print(NaN == NaN)

False

print(NaN == nan)

False

print(NaN == NAN)

False

print(nan == NAN)

False

Pandas has built-in methods to test for a missing value.

import pandas as pd

print(pd.isnull(NaN))

True

print(pd.isnull(nan))

True

print(pd.isnull(NAN))

True

Pandas also has methods for testing non-missing values.

print(pd.notnull(NaN))

False

print(pd.notnull(42))

True

print(pd.notnull('missing'))

True

5.3 Where Do Missing Values Come From?

We can get missing values when we load in a data set with missing values, or from the data munging process.

5.3.1 Load Data

The survey data we used in Chapter 4 included a data set, visited, that contained missing data. When we loaded the data, Pandas automatically found the missing data cell, and gave us a dataframe with the NaN value in the appropriate cell. In the read_csv function, three parameters relate to reading in missing values: na_values, keep_default_na, and na_filter.

The na_values parameter allows you to specify additional missing or NaN values. You can pass in either a Python str or a list-like object to be automatically coded as missing values when the file is read. Of course, default missing values, such as NA, NaN, or nan, are already available, which is why this parameter is not always used. Some health data may code 99 as a missing value; to specify the use of this value, you would set na_values=[99].

The keep_default_na parameter is a bool that allows you to specify whether any additional values need to be considered as missing. This parameter is True by default, meaning any additional missing values specified with the na_values parameter will be appended to the list of missing values. However, keep_default_na can also be set to keep_default_na=False, which will use only the missing values specified in na_values.

Lastly, na_filter is a bool that will specify whether any values will be read as missing. The default value of na_filter=True means that missing values will be coded as NaN. If we assign na_filter=False, then nothing will be recoded as missing. This parameter can by thought of as a means to turn off all the parameters set for na_values and keep_default_na, but it is more likely to be used when you want to achieve a performance boost by loading in data without missing values.

# set the location for data
visited_file = '../data/survey_visited.csv'

# load the data with default values
print(pd.read_csv(visited_file))

   ident   site       dated
0    619   DR-1  1927-02-08
1    622   DR-1  1927-02-10
2    734   DR-3  1939-01-07
3    735   DR-3  1930-01-12
4    751   DR-3  1930-02-26
5    752   DR-3         NaN
6    837  MSK-4  1932-01-14
7    844   DR-1  1932-03-22

# load the data without default missing values
print(pd.read_csv(visited_file, keep_default_na=False))

   ident   site       dated
0    619   DR-1  1927-02-08
1    622   DR-1  1927-02-10
2    734   DR-3  1939-01-07
3    735   DR-3  1930-01-12
4    751   DR-3  1930-02-26
5    752   DR-3
6    837  MSK-4  1932-01-14
7    844   DR-1  1932-03-22

# manually specify missing values
print(pd.read_csv(visited_file,
                  na_values=[''],
                  keep_default_na=False))

   ident   site       dated
0    619   DR-1  1927-02-08
1    622   DR-1  1927-02-10
2    734   DR-3  1939-01-07
3    735   DR-3  1930-01-12
4    751   DR-3  1930-02-26
5    752   DR-3         NaN
6    837  MSK-4  1932-01-14
7    844   DR-1  1932-03-22

5.3.2 Merged Data

Chapter 4 showed you how to combine data sets. Some of the examples in that chapter included missing values in the output. If we recreate the merged table from Section 4.4.3, we will see missing values in the merged output.

visited = pd.read_csv('../data/survey_visited.csv')
survey = pd.read_csv('../data/survey_survey.csv')

print(visited)

   ident   site       dated
0    619   DR-1  1927-02-08
1    622   DR-1  1927-02-10
2    734   DR-3  1939-01-07
3    735   DR-3  1930-01-12
4    751   DR-3  1930-02-26
5    752   DR-3         NaN
6    837  MSK-4  1932-01-14
7    844   DR-1  1932-03-22

print(survey)

   taken person quant reading
0    619   dyer   rad    9.82
1    619   dyer   sal    0.13
2    622   dyer   rad    7.80
3    622   dyer   sal    0.09
4    734     pb   rad    8.41
5    734   lake   sal    0.05
6    734     pb  temp  -21.50
7    735     pb   rad    7.22
8    735    NaN   sal    0.06
9    735    NaN  temp  -26.00
10   751     pb   rad    4.35
11   751     pb  temp  -18.50
12   751   lake   sal    0.10
13   752   lake   rad    2.19
14   752   lake   sal    0.09
15   752   lake  temp  -16.00
16   752    roe   sal   41.60
17   837   lake   rad    1.46
18   837   lake   sal    0.21
19   837    roe   sal   22.50
20   844    roe   rad   11.25

vs = visited.merge(survey, left_on='ident', right_on='taken')
print(vs)

    ident   site        dated  taken person quant  reading
0     619   DR-1   1927-02-08    619   dyer   rad     9.82
1     619   DR-1   1927-02-08    619   dyer   sal     0.13
2     622   DR-1   1927-02-10    622   dyer   rad     7.80
3     622   DR-1   1927-02-10    622   dyer   sal     0.09
4     734   DR-3   1939-01-07    734     pb   rad     8.41
5     734   DR-3   1939-01-07    734   lake   sal     0.05
6     734   DR-3   1939-01-07    734     pb  temp   -21.50
7     735   DR-3   1930-01-12    735     pb   rad     7.22
8     735   DR-3   1930-01-12    735    NaN   sal     0.06
9     735   DR-3   1930-01-12    735    NaN  temp   -26.00
10    751   DR-3   1930-02-26    751     pb   rad     4.35
11    751   DR-3   1930-02-26    751     pb  temp   -18.50
12    751   DR-3   1930-02-26    751   lake   sal     0.10
13    752   DR-3          NaN    752   lake   rad     2.19
14    752   DR-3          NaN    752   lake   sal     0.09
15    752   DR-3          NaN    752   lake  temp   -16.00
16    752   DR-3          NaN    752    roe   sal    41.60
17    837  MSK-4   1932-01-14    837   lake   rad     1.46
18    837  MSK-4   1932-01-14    837   lake   sal     0.21
19    837  MSK-4   1932-01-14    837    roe   sal    22.50
20    844   DR-1   1932-03-22    844    roe   rad    11.25

5.3.3 User Input Values

The user can also create missing values—for example, by creating a vector of values from a calculation or a manually curated vector. To build on the examples from Section 2.2, we will create our own data with missing values. NaNs are valid values for both Series and DataFrames.

# missing value in a series
num_legs = pd.Series({'goat': 4, 'amoeba': nan})
print(num_legs)

amoeba    NaN
goat      4.0
dtype: float64

# missing value in a dataframe
scientists = pd.DataFrame({
    'Name': ['Rosaline Franklin', 'William Gosset'],
    'Occupation': ['Chemist', 'Statistician'],
    'Born': ['1920-07-25', '1876-06-13'],
    'Died': ['1958-04-16', '1937-10-16'],
    'missing': [NaN, nan]})
print(scientists)

         Born        Died               Name    Occupation  missing
0  1920-07-25  1958-04-16  Rosaline Franklin       Chemist      NaN
1  1876-06-13  1937-10-16     William Gosset  Statistician      NaN

You can also assign a column of missing values to a dataframe directly.

# create a new dataframe
scientists = pd.DataFrame({
    'Name': ['Rosaline Franklin', 'William Gosset'],
    'Occupation': ['Chemist', 'Statistician'],
    'Born': ['1920-07-25', '1876-06-13'],
    'Died': ['1958-04-16', '1937-10-16']})

# assign a column of missing values
scientists['missing'] = nan

print(scientists)

         Born        Died               Name    Occupation  missing
0  1920-07-25  1958-04-16  Rosaline Franklin       Chemist      NaN
1  1876-06-13  1937-10-16     William Gosset  Statistician      NaN

5.3.4 Re-indexing

Another way to introduce missing values into your data is to reindex your dataframe. This is useful when you want to add new indices to your dataframe, but still want to retain its original values. A common usage is when the index represents some time interval, and you want to add more dates.

If we wanted to look at only the years from 2000 to 2010 from the Gapminder data plot in Section 1.5, we could perform the same grouped operations, subset the data, and then re-index it.

gapminder = pd.read_csv('../data/gapminder.tsv', sep=' ')

life_exp = gapminder.groupby(['year'])['lifeExp'].mean()
print(life_exp)

year
1952    49.057620
1957    51.507401
1962    53.609249
1967    55.678290
1972    57.647386
1977    59.570157
1982    61.533197
1987    63.212613
1992    64.160338
1997    65.014676
2002    65.694923
2007    67.007423
Name: lifeExp, dtype: float64

We can re-index by slicing the data (see Section 1.3).

# you can continue to chain the 'loc' from the code above
print(life_exp.loc[range(2000, 2010), ])

year
2000          NaN
2001          NaN
2002    65.694923
2003          NaN
2004          NaN
2005          NaN
2006          NaN
2007    67.007423
2008          NaN
2009          NaN
Name: lifeExp, dtype: float64

Alternatively, you can subset the data separately, and use the reindex method.

# subset
y2000 = life_exp[life_exp.index > 2000]
print(y2000)

year
2002    65.694923
2007    67.007423
Name: lifeExp, dtype: float64

# reindex
print(y2000.reindex(range(2000, 2010)))

year
2000          NaN
2001          NaN
2002    65.694923
2003          NaN
2004          NaN
2005          NaN
2006          NaN
2007    67.007423
2008          NaN
2009          NaN
Name: lifeExp, dtype: float64

5.4 Working With Missing Data

Now that we know how missing values can be created, let’s see how they behave when we are working with data.

5.4.1 Find and Count Missing Data

ebola = pd.read_csv('../data/country_timeseries.csv')

One way to look at the number of missing values is to count them.

# count the number of non-missing values
print(ebola.count())

Date                  122
Day                   122
Cases_Guinea           93
Cases_Liberia          83
Cases_SierraLeone      87
Cases_Nigeria          38
Cases_Senegal          25
Cases_UnitedStates     18
Cases_Spain            16
Cases_Mali             12
Deaths_Guinea          92
Deaths_Liberia         81
Deaths_SierraLeone     87
Deaths_Nigeria         38
Deaths_Senegal         22
Deaths_UnitedStates    18
Deaths_Spain           16
Deaths_Mali            12
dtype: int64

You can also subtract the number of non-missing rows from the total number of rows.

num_rows = ebola.shape[0]
num_missing = num_rows - ebola.count()
print(num_missing)

Date                     0
Day                      0
Cases_Guinea            29
Cases_Liberia           39
Cases_SierraLeone       35
Cases_Nigeria           84
Cases_Senegal           97
Cases_UnitedStates     104
Cases_Spain            106
Cases_Mali             110
Deaths_Guinea           30
Deaths_Liberia          41
Deaths_SierraLeone      35
Deaths_Nigeria          84
Deaths_Senegal         100
Deaths_UnitedStates    104
Deaths_Spain           106
Deaths_Mali            110
dtype: int64

If you want to count the total number of missing values in your data, or count the number of missing values for a particular column, you can use the count_nonzero function from numpy in conjunction with the isnull method.

import numpy as np

print(np.count_nonzero(ebola.isnull()))

1214

print(np.count_nonzero(ebola['Cases_Guinea'].isnull()))

29

Another way to get missing data counts is to use the value_counts method on a series. This will print a frequency table of values. If you use the dropna parameter, you can also get a missing value count.

# get the first 5 value counts from the Cases_Guinea column
print(ebola.Cases_Guinea.value_counts(dropna=False).head())

NaN       29
 86.0      3
 495.0     2
 112.0     2
 390.0     2
Name: Cases_Guinea, dtype: int64

5.4.2 Cleaning Missing Data

There are many different ways we can deal with missing data. For example, we can replace the missing data with another value, fill in the missing data using existing data, or drop the data from our data set.

5.4.2.1 Recode/Replace

We can use the fillna method to recode the missing values to another value. For example, suppose we wanted the missing values to be recoded as a 0.

print(ebola.fillna(0).iloc[0:10, 0:5])

         Date  Day  Cases_Guinea  Cases_Liberia  Cases_SierraLeone
0    1/5/2015  289        2776.0            0.0            10030.0
1    1/4/2015  288        2775.0            0.0             9780.0
2    1/3/2015  287        2769.0         8166.0             9722.0
3    1/2/2015  286           0.0         8157.0                0.0
4  12/31/2014  284        2730.0         8115.0             9633.0
5  12/28/2014  281        2706.0         8018.0             9446.0
6  12/27/2014  280        2695.0            0.0             9409.0
7  12/24/2014  277        2630.0         7977.0             9203.0
8  12/21/2014  273        2597.0            0.0             9004.0
9  12/20/2014  272        2571.0         7862.0             8939.0

When if we use fillna, we can recode the values to a specific value. If you look at the documentation, you will discover that fillna, like many other Pandas functions, has a parameter for inplace. This means that the underlying data will be automatically changed for you; you do not need to create a new copy with the changes. You will want to use this parameter when your data gets larger and you want your code to be more memory efficient.

5.4.2.2 Fill Forward

We can use built-in methods to fill forward or backward. When we fill data forward, the last known value is used for the next missing value. In this way, missing values are replaced with the last known/recorded value.

print(ebola.fillna(method='ffill').iloc[0:10, 0:5])

         Date  Day  Cases_Guinea  Cases_Liberia  Cases_SierraLeone
0    1/5/2015  289        2776.0            NaN            10030.0
1    1/4/2015  288        2775.0            NaN             9780.0
2    1/3/2015  287        2769.0         8166.0             9722.0
3    1/2/2015  286        2769.0         8157.0             9722.0
4  12/31/2014  284        2730.0         8115.0             9633.0
5  12/28/2014  281        2706.0         8018.0             9446.0
6  12/27/2014  280        2695.0         8018.0             9409.0
7  12/24/2014  277        2630.0         7977.0             9203.0
8  12/21/2014  273        2597.0         7977.0             9004.0
9  12/20/2014  272        2571.0         7862.0             8939.0

If a column begins with a missing value, then that data will remain missing because there is no previous value to fill in.

5.4.2.3 Fill Backward

We can also have Pandas fill data backward. When we fill data backward, the newest value is used to replace the missing data. In this way, missing values are replaced with the newest value.

print(ebola.fillna(method='bfill').iloc[:, 0:5].tail())

          Date  Day   Cases_Guinea  Cases_Liberia  Cases_SierraLeone
117  3/27/2014    5          103.0            8.0                6.0
118  3/26/2014    4           86.0            NaN                NaN
119  3/25/2014    3           86.0            NaN                NaN
120  3/24/2014    2           86.0            NaN                NaN
121  3/22/2014    0           49.0            NaN                NaN

If a column ends with a missing value, then it will remain missing because there is no new value to fill in.

5.4.2.4 Interpolate

Interpolation uses existing values to fill in missing values. Although there are many ways to fill in missing values, the interpolation in Pandas fills in missing values linearly. Specifically, it treats the missing values as if they should be equally spaced apart.

print(ebola.interpolate().iloc[0:10, 0:5])

         Date  Day  Cases_Guinea  Cases_Liberia  Cases_SierraLeone
0    1/5/2015  289        2776.0            NaN            10030.0
1    1/4/2015  288        2775.0            NaN             9780.0
2    1/3/2015  287        2769.0         8166.0             9722.0
3    1/2/2015  286        2749.5         8157.0             9677.5
4  12/31/2014  284        2730.0         8115.0             9633.0
5  12/28/2014  281        2706.0         8018.0             9446.0
6  12/27/2014  280        2695.0         7997.5             9409.0
7  12/24/2014  277        2630.0         7977.0             9203.0
8  12/21/2014  273        2597.0         7919.5             9004.0
9  12/20/2014  272        2571.0         7862.0             8939.0

The interpolate method has a method parameter that can change the interpolation method.1

1. Series interpolate documentation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.interpolate.html

5.4.2.5 Drop Missing Values

The last way to work with missing data is to drop observations or variables with missing data. Depending on how much data is missing, keeping only complete case data can leave you with a useless data set. Perhaps the missing data is not random, so that dropping missing values will leave you with a biased data set, or perhaps keeping only complete data will leave you with insufficient data to run your analysis.

We can use the dropna method to drop missing data, and specify parameters to this method that control how data are dropped. For instance, the how parameter lets you specify whether a row (or column) is dropped when 'any' or 'all' of the data is missing. The thresh parameter lets you specify how many non-NaN values you have before dropping the row or column.

print(ebola.shape)

(122, 18)

If we keep only complete cases in our Ebola data set, we are left with just one row of data.

ebola_dropna = ebola.dropna()
print(ebola_dropna.shape)

(1, 18)

print(ebola_dropna)

          Date  Day  Cases_Guinea  Cases_Liberia  Cases_SierraLeone  
19  11/18/2014  241        2047.0         7082.0             6190.0

    Cases_Nigeria  Cases_Senegal Cases_UnitedStates  Cases_Spain  
19           20.0            1.0                4.0          1.0

    Cases_Mali  Deaths_Guinea  Deaths_Liberia  Deaths_SierraLeone  
19         6.0         1214.0          2963.0              1267.0

    Deaths_Nigeria  Deaths_Senegal  Deaths_UnitedStates  
19             8.0             0.0                  1.0

    Deaths_Spain  Deaths_Mali
19           0.0          6.0

5.4.3 Calculations With Missing Data

Suppose we wanted to look at the case counts for multiple regions. We can add multiple regions together to get a new column holding the case counts.

ebola['Cases_multiple'] = ebola['Cases_Guinea'] +
                          ebola['Cases_Liberia'] +
                          ebola['Cases_SierraLeone']

Let’s look at the first 10 lines of the calculation.

ebola_subset = ebola.loc[:, ['Cases_Guinea', 'Cases_Liberia',
                             'Cases_SierraLeone', 'Cases_multiple']]
print(ebola_subset.head(n=10))

   Cases_Guinea  Cases_Liberia  Cases_SierraLeone  Cases_multiple
0        2776.0            NaN            10030.0             NaN
1        2775.0            NaN             9780.0             NaN
2        2769.0         8166.0             9722.0         20657.0
3           NaN         8157.0                NaN             NaN
4        2730.0         8115.0             9633.0         20478.0
5        2706.0         8018.0             9446.0         20170.0
6        2695.0            NaN             9409.0             NaN
7        2630.0         7977.0             9203.0         19810.0
8        2597.0            NaN             9004.0             NaN
9        2571.0         7862.0             8939.0         19372.0

You can see that a value for Cases_multiple was calculated only when there was no missing value for Cases_Guinea, Cases_Liberia, and Cases_SierraLeone. Calculations with missing values will typically return a missing value, unless the function or method called has a means to ignore missing values in its calculations.

Examples of built-in methods that can ignore missing values include mean and sum. These functions will typically have a skipna parameter that will still calculate a value by skipping over the missing values.

# skipping missing values is True by default
print(ebola.Cases_Guinea.sum(skipna = True))

84729.0

print(ebola.Cases_Guinea.sum(skipna = False))

nan

5.5 Conclusion

It is rare to have a data set without any missing values. It is important to know how to work with missing values because, even when you are working with data that is complete, missing values can still arise from your own data munging. In this chapter, we examined some of the basic methods used in the data analysis process that pertain to data validity. By looking at your data and tabulating missing values, you can start the process of assessing whether the data is of sufficiently high quality for making decisions and drawing inferences.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.191.185.40