Rarely will you be given a data set without any missing values. There are many representations of missing data. In databases, they are NULL
values; certain programming languages use NA
; and depending on where you get your data, missing values can be an empty string, "
, or even numeric values such as 88
or 99
. Pandas displays missing values as NaN
.
Identify how missing values are represented in pandas
Recognize potential ways data can go missing in data processing
Use different functions to fill in missing values
The NaN
value in Pandas comes from numpy
. Missing values may be used or displayed in a few ways in Pandas — NaN
, NAN
, or nan
— they are all the same in terms of how you specify a missing (floating point) number, but they are not the same in terms of equality. Appendix I describes how these missing values are imported.
# Just import the numpy missing values
from numpy import NaN, NAN, nan
Missing values are different than other types of data in that they don’t really equal anything, not even to themselves. The data is missing, so there is no concept of equality. NaN
is not equivalent to 0
or an empty string, "
. This is known as “three-valued logic.”
print(NaN == True)
False
print(NaN == 0)
False
print(NaN == ")
False
print(NaN == NaN)
False
print(NaN == NAN)
False
print(NaN == nan)
False
print(nan == NAN)
False
Pandas has functions to test for missing values, isnull()
.
import pandas as pd
print(pd.isnull(NaN))
True
print(pd.isnull(nan))
True
print(pd.isnull(NAN))
True
Pandas also has functions for testing non-missing values, notnull()
.
print(pd.notnull(NaN))
False
print(pd.notnull(42))
True
print(pd.notnull('missing'))
True
We can get missing values when we load in a data set with missing values, or from the data munging process.
The survey data we used in Chapter 6 included a data set, visited
, that contained missing data. When we loaded the data, Pandas automatically found the missing data cell and gave us a dataframe with the NaN
value in the appropriate cell. In the read_csv()
function, three parameters relate to reading missing values: na_values
, keep_default_na
, and na_filter
.
The na_values
parameter allows you to specify additional missing or NaN
values. You can pass in either a Python str
(i.e., string) or a list-like object to be automatically coded as missing values when the file is read. Of course, default missing values, such as NA
, NaN
, or nan
, are already available, which is why this parameter is not always used. Some health data may code 99
as a missing value; to specify the use of this value, you would set na_values=[99]
.
The keep_default_na
parameter is a bool
(i.e., True
or False
boolean) that allows you to specify whether any additional values need to be considered as missing. This parameter is True
by default, meaning any additional missing values specified with the na_values
parameter will be appended to the list of missing values. However, keep_default_na
can also be set to keep_default_na=False
, which will only use the missing values specified in na_values
.
Lastly, na_filter
is a bool
that will specify whether any values will be read as missing. The default value of na_filter=True
means that missing values will be coded as NaN
. If we assign na_filter=False
, then nothing will be recoded as missing. This parameter can be thought of as a means to turn off all the parameters set for na_values
and keep_default_na
, but it is more likely to be used when you want to achieve a performance boost by loading in data without missing values.
# set the location for data
visited_file = 'data/survey_visited.csv'
print(pd.read_csv(visited_file))
ident site dated
0 619 DR-1 1927-02-08
1 622 DR-1 1927-02-10
2 734 DR-3 1939-01-07
3 735 DR-3 1930-01-12
4 751 DR-3 1930-02-26
5 752 DR-3 NaN
6 837 MSK-4 1932-01-14
7 844 DR-1 1932-03-22
print(pd.read_csv(visited_file, keep_default_na=False))
ident site dated
0 619 DR-1 1927-02-08
1 622 DR-1 1927-02-10
2 734 DR-3 1939-01-07
3 735 DR-3 1930-01-12
4 751 DR-3 1930-02-26
5 752 DR-3
6 837 MSK-4 1932-01-14
7 844 DR-1 1932-03-22
print(
pd.read_csv(visited_file, na_values=[""], keep_default_na=False)
)
ident site dated
0 619 DR-1 1927-02-08
1 622 DR-1 1927-02-10
2 734 DR-3 1939-01-07
3 735 DR-3 1930-01-12
4 751 DR-3 1930-02-26
5 752 DR-3 NaN
6 837 MSK-4 1932-01-14
7 844 DR-1 1932-03-22
Chapter 6 showed you how to combine data sets. Some of the examples in that chapter included missing values in the output. If we recreate the merged table from Section 6.4.3, we will see missing values in the merged output.
visited = pd.read_csv('data/survey_visited.csv')
survey = pd.read_csv('data/survey_survey.csv')
print(visited)
ident site dated
0 619 DR-1 1927-02-08
1 622 DR-1 1927-02-10
2 734 DR-3 1939-01-07
3 735 DR-3 1930-01-12
4 751 DR-3 1930-02-26
5 752 DR-3 NaN
6 837 MSK-4 1932-01-14
7 844 DR-1 1932-03-22
print(survey)
taken person quant reading
0 619 dyer rad 9.82
1 619 dyer sal 0.13
2 622 dyer rad 7.80
3 622 dyer sal 0.09
4 734 pb rad 8.41
.. ... ... ... ...
16 752 roe sal 41.60
17 837 lake rad 1.46
18 837 lake sal 0.21
19 837 roe sal 22.50
20 844 roe rad 11.25
[21 rows x 4 columns]
vs = visited.merge(survey, left_on='ident', right_on='taken')
print(vs)
ident site dated taken person quant reading
0 619 DR-1 1927-02-08 619 dyer rad 9.82
1 619 DR-1 1927-02-08 619 dyer sal 0.13
2 622 DR-1 1927-02-10 622 dyer rad 7.80
3 622 DR-1 1927-02-10 622 dyer sal 0.09
4 734 DR-3 1939-01-07 734 pb rad 8.41
.. ... ... ... ... ... ... ...
16 752 DR-3 NaN 752 roe sal 41.60
17 837 MSK-4 1932-01-14 837 lake rad 1.46
18 837 MSK-4 1932-01-14 837 lake sal 0.21
19 837 MSK-4 1932-01-14 837 roe sal 22.50
20 844 DR-1 1932-03-22 844 roe rad 11.25
[21 rows x 7 columns]
The user can also create missing values—for example, by creating a vector of values from a calculation or a manually curated vector. To build on the examples from Section 2.1, we will create our own data with missing values. NaN
values are valid for both Series
and DataFrame
objects.
# missing value in a series
num_legs = pd.Series({'goat': 4, 'amoeba': nan})
print(num_legs)
goat 4.0
amoeba NaN
dtype: float64
# missing value in a dataframe
scientists = pd.DataFrame(
{
"Name": ["Rosaline Franklin", "William Gosset"],
"Occupation": ["Chemist", "Statistician"],
"Born": ["1920-07-25", "1876-06-13"],
"Died": ["1958-04-16", "1937-10-16"],
"missing": [NaN, nan],
}
)
print(scientists)
Name Occupation Born Died missing
0 Rosaline Franklin Chemist 1920-07-25 1958-04-16 NaN
1 William Gosset Statistician 1876-06-13 1937-10-16 NaN
You will notice the dtype
of the missing
column will be a float64
. This is because the NaN
missing value from numpy
is a floating point value.
print(scientists.dtypes)
Name object
Occupation object
Born object
Died object
missing float64
dtype: object
You can also assign a column of missing values to a dataframe directly.
# create a new dataframe
scientists = pd.DataFrame(
{
"Name": ["Rosaline Franklin", "William Gosset"],
"Occupation": ["Chemist", "Statistician"],
"Born": ["1920-07-25", "1876-06-13"],
"Died": ["1958-04-16", "1937-10-16"],
}
)
# assign a column of missing values
scientists["missing"] = nan
print(scientists)
Name Occupation Born Died missing
0 Rosaline Franklin Chemist 1920-07-25 1958-04-16 NaN
1 William Gosset Statistician 1876-06-13 1937-10-16 NaN
Another way to introduce missing values into your data is to reindex your dataframe. This is useful when you want to add new indices to your dataframe, but still want to retain its original values. A common usage is when the index represents some time interval, and you want to add more dates.
If we wanted to look at only the years from 2000 to 2010 from the Gapminder data plot in Section 1.5, we could perform the same grouped operations, subset the data, and then reindex it.
gapminder = pd.read_csv('data/gapminder.tsv', sep=' ')
life_exp = gapminder.groupby(['year'])['lifeExp'].mean()
print(life_exp)
year
1952 49.057620
1957 51.507401
1962 53.609249
1967 55.678290
1972 57.647386
...
1987 63.212613
1992 64.160338
1997 65.014676
2002 65.694923
2007 67.007423
Name: lifeExp, Length: 12, dtype: float64
We can reindex by subsetting the data and use the .reindex()
method.
# subset
y2000 = life_exp[life_exp.index > 2000]
print(y2000)
year
2002 65.694923
2007 67.007423
Name: lifeExp, dtype: float64
# reindex
print(y2000.reindex(range(2000, 2010)))
year
2000 NaN
2001 NaN
2002 65.694923
2003 NaN
2004 NaN
2005 NaN
2006 NaN
2007 67.007423
2008 NaN
2009 NaN
Name: lifeExp, dtype: float64
Now that we know how missing values can be created, let’s see how they behave when we are working with data.
One way to look at the number of missing values is to count()
them.
ebola = pd.read_csv('data/country_timeseries.csv')
# count the number of non-missing values
print(ebola.count())
Date 122
Day 122
Cases_Guinea 93
Cases_Liberia 83
Cases_SierraLeone 87
...
Deaths_Nigeria 38
Deaths_Senegal 22
Deaths_UnitedStates 18
Deaths_Spain 16
Deaths_Mali 12
Length: 18, dtype: int64
You can also subtract the number of non-missing rows from the total number of rows.
num_rows = ebola.shape[0]
num_missing = num_rows - ebola.count()
print(num_missing)
Date 0
Day 0
Cases_Guinea 29
Cases_Liberia 39
Cases_SierraLeone 35
...
Deaths_Nigeria 84
Deaths_Senegal 100
Deaths_UnitedStates 104
Deaths_Spain 106
Deaths_Mali 110
Length: 18, dtype: int64
If you want to count the total number of missing values in your data, or count the number of missing values for a particular column, you can use the count_nonzero()
function from numpy
in conjunction with the .isnull()
method.
import numpy as np
print(np.count_nonzero(ebola.isnull()))
1214
print(np.count_nonzero(ebola['Cases_Guinea'].isnull()))
29
Another way to get missing data counts is to use the .value_counts()
method on a series. This will print a frequency table of values. If you use the dropna
parameter, you can also get a missing value count.
# value counts from the Cases_Guinea column
cnts = ebola.Cases_Guinea.value_counts(dropna=False)
print(cnts)
NaN 29
86.0 3
495.0 2
112.0 2
390.0 2
..
1199.0 1
1298.0 1
1350.0 1
1472.0 1
49.0 1
Name: Cases_Guinea, Length: 89, dtype: int64
The results are sorted so you can subset the count vector to just look at the missing values.
# select the values in the Series where the index is a NaN value
print(cnts.loc[pd.isnull(cnts.index)])
NaN 29
Name: Cases_Guinea, dtype: int64
In Python, True
values equate to the integer value 1
, and False
values equate to the integer value 0
. We can use this behavior to get the number of missing values by summing up a boolean vector with the .sum()
method.
# check if the value is missing, and sum up the results
print(ebola.Cases_Guinea.isnull().sum())
29
There are many different ways we can deal with missing data. For example, we can replace the missing data with another value, fill in the missing data using existing data, or drop the data from our data set.
We can use the .fillna()
method to recode the missing values to another value. For example, suppose we wanted the missing values to be recoded as a 0. When we use .fillna()
, we can recode the values to a specific value.
# fill the missing values to 0 and only look at the first 5 columns
print(ebola.fillna(0).iloc[:, 0:5])
Date Day Cases_Guinea Cases_Liberia Cases_SierraLeone
0 1/5/2015 289 2776.0 0.0 10030.0
1 1/4/2015 288 2775.0 0.0 9780.0
2 1/3/2015 287 2769.0 8166.0 9722.0
3 1/2/2015 286 0.0 8157.0 0.0
4 12/31/2014 284 2730.0 8115.0 9633.0
.. ... ... ... ... ...
117 3/27/2014 5 103.0 8.0 6.0
118 3/26/2014 4 86.0 0.0 0.0
119 3/25/2014 3 86.0 0.0 0.0
120 3/24/2014 2 86.0 0.0 0.0
121 3/22/2014 0 49.0 0.0 0.0
[122 rows x 5 columns]
We can use built-in methods to fill forward or backward. When we fill data forward, the last known value (from top to bottom) is used for the next missing value. In this way, missing values are replaced with the last known and recorded value.
print(ebola.fillna(method='ffill').iloc[:, 0:5])
Date Day Cases_Guinea Cases_Liberia Cases_SierraLeone
0 1/5/2015 289 2776.0 NaN 10030.0
1 1/4/2015 288 2775.0 NaN 9780.0
2 1/3/2015 287 2769.0 8166.0 9722.0
3 1/2/2015 286 2769.0 8157.0 9722.0
4 12/31/2014 284 2730.0 8115.0 9633.0
.. ... ... ... ... ...
117 3/27/2014 5 103.0 8.0 6.0
118 3/26/2014 4 86.0 8.0 6.0
119 3/25/2014 3 86.0 8.0 6.0
120 3/24/2014 2 86.0 8.0 6.0
121 3/22/2014 0 49.0 8.0 6.0
[122 rows x 5 columns]
If a column begins with a missing value, then that data will remain missing because there is no previous value to fill in.
We can also have Pandas fill data backward. When we fill data backward, the newest value (from top to bottom) is used to replace the missing data. In this way, missing values are replaced with the newest value.
print(ebola.fillna(method='bfill').iloc[:, 0:5])
Date Day Cases_Guinea Cases_Liberia Cases_SierraLeone
0 1/5/2015 289 2776.0 8166.0 10030.0
1 1/4/2015 288 2775.0 8166.0 9780.0
2 1/3/2015 287 2769.0 8166.0 9722.0
3 1/2/2015 286 2730.0 8157.0 9633.0
4 12/31/2014 284 2730.0 8115.0 9633.0
.. ... ... ... ... ...
117 3/27/2014 5 103.0 8.0 6.0
118 3/26/2014 4 86.0 NaN NaN
119 3/25/2014 3 86.0 NaN NaN
120 3/24/2014 2 86.0 NaN NaN
121 3/22/2014 0 49.0 NaN NaN
[122 rows x 5 columns]
If a column ends with a missing value, then it will remain missing because there is no new value to fill in.
Interpolation uses existing values to fill in missing values. There are many ways to fill in missing values, the interpolation in Pandas fills in missing values linearly. Specifically, it treats the missing values as if they should be equally spaced apart.
print(ebola.interpolate().iloc[:, 0:5])
Date Day Cases_Guinea Cases_Liberia Cases_SierraLeone
0 1/5/2015 289 2776.0 NaN 10030.0
1 1/4/2015 288 2775.0 NaN 9780.0
2 1/3/2015 287 2769.0 8166.0 9722.0
3 1/2/2015 286 2749.5 8157.0 9677.5
4 12/31/2014 284 2730.0 8115.0 9633.0
.. ... ... ... ... ...
117 3/27/2014 5 103.0 8.0 6.0
118 3/26/2014 4 86.0 8.0 6.0
119 3/25/2014 3 86.0 8.0 6.0
120 3/24/2014 2 86.0 8.0 6.0
121 3/22/2014 0 49.0 8.0 6.0
[122 rows x 5 columns]
Notice how it behaves kind of in a forward fill fashion, but instead of passing on the last known value, it will fill in the differences between values.
The .interpolate()
method has a method
parameter that can change the interpolation method.1 Possible values at the time of writing have been reproduced in Table 9.1.
1. Series.interpolate()
documentation: https://pandas.pydata.org/docs/reference/api/pandas.Series.interpolate.html
Table 9.1 Possible Values (at the Time of Writing) to Pass Into the method
Parameter in the .interpolate()
Method
Technique | Description | |
---|---|---|
1 | linear | Ignore the index and treat the values as equally spaced. This is the only method supported on Multi-Indexes |
2 | time | Works on daily and higher resolution data to interpolate given length of interval |
3 | index, values | Use the actual numerical values of the index |
4 | pad | Fill in NaNs using existing values |
5 | nearest, zero, slinear, quadratic, cubic, spline, barycentric, polynomial | Passed to |
6 | krogh, piecewise_polynomial, spline, pchip, akima, cubicspline | Wrappers around the SciPy interpolation methods of similar names |
7 | from_derivatives | Refers to |
The last way to work with missing data is to drop observations or variables with missing data. Depending on how much data is missing, keeping only complete case data can leave you with a useless data set. Perhaps the missing data is not random, so that dropping missing values will leave you with a biased data set, or perhaps keeping only complete data will leave you with insufficient data to run your analysis.
We can use the.dropna()
method to drop missing data, and specify parameters to this method that control how data are dropped. For instance, the how
parameter lets you specify whether a row (or column) is dropped when 'any'
or 'all'
of the data is missing. The thresh
parameter lets you specify how many non-NaN
values you have before dropping the row or column.
print(ebola.shape)
(122, 18)
If we keep only complete cases in our Ebola data set, we are left with just one row of data.
ebola_dropna = ebola.dropna()
print(ebola_dropna.shape)
(1, 18)
print(ebola_dropna)
Date Day Cases_Guinea Cases_Liberia Cases_SierraLeone
19 11/18/2014 241 2047.0 7082.0 6190.0
Cases_Nigeria Cases_Senegal Cases_UnitedStates Cases_Spain
19 20.0 1.0 4.0 1.0
Cases_Mali Deaths_Guinea Deaths_Liberia Deaths_SierraLeone
19 6.0 1214.0 2963.0 1267.0
Deaths_Nigeria Deaths_Senegal Deaths_UnitedStates
19 8.0 0.0 1.0
Deaths_Spain Deaths_Mali
19 0.0 6.0
Suppose we wanted to look at the case counts for multiple regions. We can add multiple regions together to get a new column holding the case counts.
ebola["Cases_multiple"] = (
ebola["Cases_Guinea"]
+ ebola["Cases_Liberia"]
+ ebola["Cases_SierraLeone"]
)
Let’s look at the first 10 lines of the calculation.
ebola_subset = ebola.loc[
:,
[
"Cases_Guinea",
"Cases_Liberia",
"Cases_SierraLeone",
"Cases_multiple",
],
]
print(ebola_subset.head(n=10))
Cases_Guinea Cases_Liberia Cases_SierraLeone Cases_multiple
0 2776.0 NaN 10030.0 NaN
1 2775.0 NaN 9780.0 NaN
2 2769.0 8166.0 9722.0 20657.0
3 NaN 8157.0 NaN NaN
4 2730.0 8115.0 9633.0 20478.0
5 2706.0 8018.0 9446.0 20170.0
6 2695.0 NaN 9409.0 NaN
7 2630.0 7977.0 9203.0 19810.0
8 2597.0 NaN 9004.0 NaN
9 2571.0 7862.0 8939.0 19372.0
You can see that a value for Cases_multiple
was calculated only when there was no missing value for Cases_Guinea
, Cases_Liberia
, and Cases_SierraLeone
. Calculations with missing values will typically return a missing value, unless the function or method called has a means to ignore missing values in its calculations.
Examples of built-in methods that can ignore missing values include .mean()
and .sum()
. These functions will typically have a skipna
parameter that will still calculate a value by skipping over the missing values.
# skipping missing values is True by default
print(ebola.Cases_Guinea.sum(skipna = True))
84729.0
print(ebola.Cases_Guinea.sum(skipna = False))
nan
Pandas 1.0 introduced a built-in NA
value (pd.NA
). At the time of writing this feature is still “experimental.”2 The main goal of this feature is to provide a missing value that works across different data types.
2. Pandas experimental NA
: https://pandas.pydata.org/docs/user_guide/missing_data.html#experimental-na-scalar-to-denote-missing-values
Let’s use our previous scientists
data set from earlier and look at the .dtypes
.
scientists = pd.DataFrame(
{
"Name": ["Rosaline Franklin", "William Gosset"],
"Occupation": ["Chemist", "Statistician"],
"Born": ["1920-07-25", "1876-06-13"],
"Died": ["1958-04-16", "1937-10-16"],
"Age": [37, 61]
}
)
print(scientists)
Name Occupation Born Died Age
0 Rosaline Franklin Chemist 1920-07-25 1958-04-16 37
1 William Gosset Statistician 1876-06-13 1937-10-16 61
print(scientists.dtypes)
Name object
Occupation object
Born object
Died object
Age int64
dtype: object
scientists.loc[1, "Name"] = pd.NA
scientists.loc[1, "Age"] = pd.NA
print(scientists)
Name Occupation Born Died Age
0 Rosaline Franklin Chemist 1920-07-25 1958-04-16 37
1 <NA> Statistician 1876-06-13 1937-10-16 <NA>
print(scientists.dtypes)
Name object
Occupation object
Born object
Died object
Age object
dtype: object
Compare the .dtypes
from pd.NA
and np.NaN
from earlier in this chapter.
scientists = pd.DataFrame(
{
"Name": ["Rosaline Franklin", "William Gosset"],
"Occupation": ["Chemist", "Statistician"],
"Born": ["1920-07-25", "1876-06-13"],
"Died": ["1958-04-16", "1937-10-16"],
"Age": [37, 61]
}
)
scientists.loc[1, "Name"] = np.NaN
scientists.loc[1, "Age"] = np.NaN
print(scientists.dtypes)
Name object
Occupation object
Born object
Died object
Age float64
dtype: object
Since pd.NA
is still experimental, best follow up with its behavior in the official documentation.
It is rare to have a data set without any missing values. It is important to know how to work with missing values because, even when you are working with data that is complete, missing values can still arise from your own data munging. In this chapter, we examined some of the basic methods used in the data analysis process that pertain to data validity. By looking at your data and tabulating missing values, you can start the process of assessing whether the data is of sufficiently high quality for making decisions and drawing inferences.
3.135.191.86