Rarely will you be given a data set without any missing values. There are many representations of missing data. In databases, they are NULL values; certain programming languages use NA; and depending on where you get your data, missing values can be an empty string, '', or even numeric values such as 88 or 99. Pandas displays missing values as NaN.
Prior knowledge:
a. Importing libraries
b. Slicing and indexing data
c. Using functions and methods
d. Using function parameters

This chapter will cover:
1. What a missing value is
2. How missing values are created
3. How to recode and make calculations with missing values
The NaN value in Pandas comes from numpy. Missing values may be written or displayed in a few ways in Python (NaN, NAN, or nan), but they are all equivalent. Appendix H describes how these missing values are imported.
# Just import the numpy missing values
from numpy import NaN, NAN, nan
Missing values are different from other types of data, in that they don't really equal anything. The data is missing, so there is no concept of equality. NaN is not equivalent to 0 or an empty string, ''.
We can illustrate this idea in Python by testing for equality.
print(NaN == True)   # False
print(NaN == False)  # False
print(NaN == 0)      # False
print(NaN == '')     # False
Missing values are also not equal to other missing values.
print(NaN == NaN)  # False
print(NaN == nan)  # False
print(NaN == NAN)  # False
print(nan == NAN)  # False
Pandas has built-in methods to test for a missing value.
import pandas as pd
print(pd.isnull(NaN))  # True
print(pd.isnull(nan))  # True
print(pd.isnull(NAN))  # True
Pandas also has methods for testing non-missing values.
print(pd.notnull(NaN))        # False
print(pd.notnull('missing'))  # True
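Beyond these top-level functions, isnull and notnull are also available as Series and DataFrame methods, which test each element. A small sketch with made-up values:

```python
import pandas as pd
from numpy import nan

# isnull / notnull also work element-wise as Series methods
s = pd.Series([1.0, nan, 3.0])
print(s.isnull())   # True only where the value is missing
print(s.notnull())  # the element-wise complement
```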
Missing values can appear when we load a data set that contains them, or they can be created during the data munging process.
The survey data we used in Chapter 4 included a data set, visited, that contained missing data. When we loaded the data, Pandas automatically found the missing data cells and gave us a dataframe with the NaN value in the appropriate cells. In the read_csv function, three parameters relate to reading in missing values: na_values, keep_default_na, and na_filter.
The na_values parameter allows you to specify additional missing or NaN values. You can pass in either a Python str or a list-like object to be automatically coded as missing values when the file is read. Of course, default missing values, such as NA, NaN, or nan, are already recognized, which is why this parameter is not always used. Some health data may code 99 as a missing value; to specify the use of this value, you would set na_values=[99].
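As a sketch of that last point, suppose a hypothetical health data set codes 99 as a missing age; reading it from an in-memory string (the data here is made up for illustration):

```python
import pandas as pd
from io import StringIO

# hypothetical data in which 99 means "age unknown"
data = StringIO("id,age\n1,34\n2,99\n3,27")
df = pd.read_csv(data, na_values=[99])
print(df)  # the 99 is read in as NaN
```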
The keep_default_na parameter is a bool that controls whether the default set of missing values is used in addition to any values you specify. This parameter is True by default, meaning any additional missing values specified with the na_values parameter will be appended to the default list of missing values. However, keep_default_na can also be set to keep_default_na=False, which will use only the missing values specified in na_values.
Lastly, na_filter is a bool that specifies whether any values will be read as missing. The default value of na_filter=True means that missing values will be coded as NaN. If we assign na_filter=False, then nothing will be recoded as missing. This parameter can be thought of as a means to turn off all the parameters set for na_values and keep_default_na, but it is more likely to be used when you want to achieve a performance boost by loading in data known to have no missing values.
# set the location for data
visited_file = '../data/survey_visited.csv'
# load the data with default values
print(pd.read_csv(visited_file))
# load the data without default missing values
print(pd.read_csv(visited_file, keep_default_na=False))
# manually specify missing values
print(pd.read_csv(visited_file,
                  na_values=[''],
                  keep_default_na=False))
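The na_filter behavior described above can also be sketched on a small in-memory CSV (made-up values, with one empty cell and one explicit NA string):

```python
import pandas as pd
from io import StringIO

# made-up CSV with an empty cell and an explicit "NA"
csv_text = "site,reading\nA,1.0\nB,\nC,NA"

# default (na_filter=True): both the empty cell and "NA" become NaN
with_filter = pd.read_csv(StringIO(csv_text))
print(with_filter)

# na_filter=False: nothing is recoded, so the column stays as strings
without_filter = pd.read_csv(StringIO(csv_text), na_filter=False)
print(without_filter)
```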
Chapter 4 showed you how to combine data sets. Some of the examples in that chapter included missing values in the output. If we recreate the merged table from Section 4.4.3, we will see missing values in the merged output.
visited = pd.read_csv('../data/survey_visited.csv')
survey = pd.read_csv('../data/survey_survey.csv')
print(visited)
vs = visited.merge(survey, left_on='ident', right_on='taken')
print(vs)
The user can also create missing values, for example, by creating a vector of values from a calculation or a manually curated vector. To build on the examples from Section 2.2, we will create our own data with missing values. NaNs are valid values for both Series and DataFrames.
# missing value in a series
num_legs = pd.Series({'goat': 4, 'amoeba': nan})
print(num_legs)
# missing value in a dataframe
scientists = pd.DataFrame({
    'Name': ['Rosaline Franklin', 'William Gosset'],
    'Occupation': ['Chemist', 'Statistician'],
    'Born': ['1920-07-25', '1876-06-13'],
    'Died': ['1958-04-16', '1937-10-16'],
    'missing': [NaN, nan]})
print(scientists)
You can also assign a column of missing values to a dataframe directly.
# create a new dataframe
scientists = pd.DataFrame({
    'Name': ['Rosaline Franklin', 'William Gosset'],
    'Occupation': ['Chemist', 'Statistician'],
    'Born': ['1920-07-25', '1876-06-13'],
    'Died': ['1958-04-16', '1937-10-16']})
# assign a column of missing values
scientists['missing'] = nan
print(scientists)
Another way to introduce missing values into your data is to reindex your dataframe. This is useful when you want to add new indices to your dataframe, but still want to retain its original values. A common usage is when the index represents some time interval, and you want to add more dates.
If we wanted to look at only the years from 2000 to 2010 from the Gapminder data plot in Section 1.5, we could perform the same grouped operations, subset the data, and then re-index it.
gapminder = pd.read_csv('../data/gapminder.tsv', sep='\t')
life_exp = gapminder.groupby(['year'])['lifeExp'].mean()
print(life_exp)
We can re-index by slicing the data (see Section 1.3).
# you can continue to chain the 'loc' from the code above
# (note: newer versions of pandas raise a KeyError when labels
# passed to .loc are missing from the index; use reindex instead)
print(life_exp.loc[range(2000, 2010), ])
Alternatively, you can subset the data separately and use the reindex method.
# subset
y2000 = life_exp[life_exp.index > 2000]
print(y2000)
# reindex
print(y2000.reindex(range(2000, 2010)))
Now that we know how missing values can be created, let’s see how they behave when we are working with data.
ebola = pd.read_csv('../data/country_timeseries.csv')
One way to look at the number of missing values is to use the count method, which counts the non-missing values.
# count the number of non-missing values
print(ebola.count())
You can also subtract the number of non-missing rows from the total number of rows.
num_rows = ebola.shape[0]
num_missing = num_rows - ebola.count()
print(num_missing)
If you want to count the total number of missing values in your data, or count the number of missing values for a particular column, you can use the count_nonzero function from numpy in conjunction with the isnull method.
import numpy as np
print(np.count_nonzero(ebola.isnull()))
print(np.count_nonzero(ebola['Cases_Guinea'].isnull()))
Another way to get missing data counts is to use the value_counts method on a series, which prints a frequency table of values. If you set the dropna parameter to False, you will also get a count of the missing values.
# get the first 5 value counts from the Cases_Guinea column
print(ebola.Cases_Guinea.value_counts(dropna=False).head())
There are many different ways we can deal with missing data. For example, we can replace the missing data with another value, fill in the missing data using existing data, or drop the data from our data set.
We can use the fillna method to recode the missing values to another value. For example, suppose we wanted the missing values to be recoded as 0.
print(ebola.fillna(0).iloc[0:10, 0:5])
When we use fillna, we can recode the values to a specific value. If you look at the documentation, you will discover that fillna, like many other Pandas functions, has an inplace parameter. When inplace=True is set, the underlying data is changed directly, without a new copy being created. You may want to use this parameter when your data gets larger and you want your code to be more memory efficient.
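The difference between the default behavior and inplace=True can be sketched on a small made-up dataframe:

```python
import pandas as pd
from numpy import nan

df = pd.DataFrame({'a': [1.0, nan, 3.0]})

# default: fillna returns a new object and leaves df untouched
filled = df.fillna(0)
print(df)      # still contains the NaN
print(filled)  # NaN replaced with 0

# inplace=True modifies df directly (and returns None)
df.fillna(0, inplace=True)
print(df)
```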
We can use built-in methods to fill forward or backward. When we fill data forward, the last known value is used for the next missing value. In this way, missing values are replaced with the last known/recorded value.
print(ebola.fillna(method='ffill').iloc[0:10, 0:5])
If a column begins with a missing value, then that data will remain missing because there is no previous value to fill in.
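A minimal sketch of that edge case, using a made-up series with a leading missing value:

```python
import pandas as pd
from numpy import nan

s = pd.Series([nan, 1.0, nan, nan, 5.0])
# forward fill: each NaN takes the last known value,
# but the leading NaN stays missing (nothing precedes it)
filled = s.ffill()
print(filled)
```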
We can also have Pandas fill data backward. When we fill data backward, the next known value in the column is used to replace the missing data. In this way, missing values are replaced with the next recorded value.
print(ebola.fillna(method='bfill').iloc[:, 0:5].tail())
If a column ends with a missing value, then it will remain missing because there is no subsequent value to fill it in.
Interpolation uses existing values to fill in missing values. Although there are many ways to fill in missing values, interpolation in Pandas fills in missing values linearly by default. Specifically, it treats the missing values as if they should be equally spaced apart.
print(ebola.interpolate().iloc[0:10, 0:5])
The interpolate method has a method parameter that can change the interpolation technique.1

1. Series.interpolate documentation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.interpolate.html
The last way to work with missing data is to drop observations or variables with missing data. Depending on how much data is missing, keeping only complete case data can leave you with a useless data set. Perhaps the missing data is not random, so that dropping missing values will leave you with a biased data set, or perhaps keeping only complete data will leave you with insufficient data to run your analysis.
We can use the dropna method to drop missing data, and specify parameters to this method that control how data are dropped. For instance, the how parameter lets you specify whether a row (or column) is dropped when 'any' or 'all' of the data is missing. The thresh parameter lets you specify the minimum number of non-NaN values a row (or column) must contain in order to be kept.
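These options can be sketched on a tiny made-up dataframe:

```python
import pandas as pd
from numpy import nan

df = pd.DataFrame({'a': [1.0, nan, nan],
                   'b': [2.0, 5.0, nan],
                   'c': [3.0, nan, nan]})

print(df.dropna())           # how='any' is the default: any NaN drops the row
print(df.dropna(how='all'))  # drop only rows that are entirely missing
print(df.dropna(thresh=2))   # keep rows with at least 2 non-missing values
```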
print(ebola.shape)
If we keep only complete cases in our Ebola data set, we are left with just one row of data.
ebola_dropna = ebola.dropna()
print(ebola_dropna.shape)
print(ebola_dropna)
Suppose we wanted to look at the case counts for multiple regions. We can add multiple regions together to get a new column holding the case counts.
ebola['Cases_multiple'] = (ebola['Cases_Guinea'] +
                           ebola['Cases_Liberia'] +
                           ebola['Cases_SierraLeone'])
Let’s look at the first 10 lines of the calculation.
ebola_subset = ebola.loc[:, ['Cases_Guinea', 'Cases_Liberia',
                             'Cases_SierraLeone', 'Cases_multiple']]
print(ebola_subset.head(n=10))
You can see that a value for Cases_multiple was calculated only when there was no missing value in Cases_Guinea, Cases_Liberia, and Cases_SierraLeone. Calculations with missing values will typically return a missing value, unless the function or method called has a means to ignore missing values in its calculations.
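The propagation of missing values through arithmetic can be sketched with two made-up series:

```python
import pandas as pd
from numpy import nan

left = pd.Series([1.0, nan, 3.0])
right = pd.Series([10.0, 20.0, nan])

# element-wise addition: any NaN operand yields a NaN result
total = left + right
print(total)
```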
Examples of built-in methods that can ignore missing values include mean and sum. These methods typically have a skipna parameter that will still calculate a value by skipping over the missing values.
# skipping missing values is True by default
print(ebola.Cases_Guinea.sum(skipna=True))
print(ebola.Cases_Guinea.sum(skipna=False))
It is rare to have a data set without any missing values. It is important to know how to work with missing values because, even when you are working with data that is complete, missing values can still arise from your own data munging. In this chapter, we examined some of the basic methods used in the data analysis process that pertain to data validity. By looking at your data and tabulating missing values, you can start the process of assessing whether the data is of sufficiently high quality for making decisions and drawing inferences.