Data does not always come in forms ready for analysis. It could, for example, be in the wrong format, incorrect or even missing. Industry experience has shown that data scientists can spend as much as 75% of their time preparing data before they begin their studies. Preparing data for analysis is called data munging or data wrangling. These are synonyms—from this point forward, we’ll say data munging.
Two of the most important steps in data munging are data cleaning and transforming data into the optimal formats for your database systems and analytics software. Some common data cleaning examples are:
deleting observations with missing values,
substituting reasonable values for missing values,
deleting observations with bad values,
substituting reasonable values for bad values,
tossing outliers (although sometimes you’ll want to keep them),
eliminating duplicates (although sometimes duplicates are valid),
dealing with inconsistent data,
and more.
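Several of these cleaning steps are one-liners in pandas. The following sketch uses made-up patient data; dropna, drop_duplicates and fillna are the standard pandas methods for the first few items above:

```python
import numpy as np
import pandas as pd

# hypothetical data: one missing reading (NaN) and one duplicated row
df = pd.DataFrame({'patient': ['Sue', 'Bob', 'Bob', 'Ann'],
                   'temp': [98.6, 99.1, 99.1, np.nan]})

dropped = df.dropna()           # delete observations with missing values
deduped = df.drop_duplicates()  # eliminate duplicate observations

# substitute a reasonable value (here, the column mean) for missing values
filled = df.fillna({'temp': df['temp'].mean()})
```

Whether dropping, deduplicating or filling is appropriate depends on the data and the project, as discussed above.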
You’re probably already thinking that data cleaning is a difficult and messy process where you could easily make bad decisions that would negatively impact your results. This is correct. When you get to the data science case studies in the later chapters, you’ll see that data science is more of an empirical science, like medicine, and less of a theoretical science, like theoretical physics. Empirical sciences base their conclusions on observations and experience. For example, many medicines that effectively solve medical problems today were developed by observing the effects that early versions of these medicines had on lab animals and eventually humans, and gradually refining ingredients and dosages. The actions data scientists take can vary per project, be based on the quality and nature of the data and be affected by evolving organizational and professional standards.
Some common data transformations include:
removing unnecessary data and features (we’ll say more about features in the data science case studies),
combining related features,
sampling data to obtain a representative subset (we’ll see in the data science case studies that random sampling is particularly effective for this and we’ll say why),
standardizing data formats,
grouping data,
and more.
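A couple of the transformations above are also single pandas calls. A sketch with an invented sales table:

```python
import pandas as pd

sales = pd.DataFrame({'region': ['East', 'West', 'East', 'West'],
                      'amount': [100, 200, 300, 400]})

# random sampling for a representative subset (random_state makes it reproducible)
subset = sales.sample(n=2, random_state=1)

# grouping data: total amount per region
totals = sales.groupby('region')['amount'].sum()
```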
It’s always wise to hold onto your original data. We’ll show simple examples of cleaning and transforming data in the context of pandas Series and DataFrames.
Bad data values and missing values can significantly impact data analysis. Some data scientists advise against any attempts to insert “reasonable values.” Instead, they advocate clearly marking missing data and leaving it up to the data analytics package to handle the issue. Others offer strong cautions.4
Let’s consider a hospital that records patients’ temperatures (and probably other vital signs) four times per day. Assume that the data consists of a name and four float values, such as

['Brown, Sue', 98.6, 98.4, 98.7, 0.0]

The preceding patient’s first three recorded temperatures are 98.6, 98.4 and 98.7. The last temperature was missing and recorded as 0.0, perhaps because the sensor malfunctioned. The average of the first three values is 98.57, which is close to normal. However, if you calculate the average temperature including the missing value for which 0.0 was substituted, the average is only 73.93, clearly a questionable result. Certainly, doctors would not want to take drastic remedial action on this patient—it’s crucial to “get the data right.”
One common way to clean the data is to substitute a reasonable value for the missing temperature, such as the average of the patient’s other readings. Had we done that above, then the patient’s average temperature would remain 98.57—a much more likely average temperature, based on the other readings.
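In pandas, that cleaning step might look like the following sketch: we mark the sensor’s 0.0 as missing, then fill it with the mean of the valid readings:

```python
import numpy as np
import pandas as pd

temps = pd.Series([98.6, 98.4, 98.7, 0.0])

# treat the sensor's 0.0 as missing data rather than a real temperature
temps = temps.replace(0.0, np.nan)

# pandas excludes NaNs from mean(), so this averages only the valid readings
cleaned = temps.fillna(temps.mean())
```

After this, cleaned.mean() is about 98.57 rather than the misleading 73.93.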
Let’s begin by creating a Series of five-digit ZIP Codes from a dictionary of city-name/five-digit-ZIP-Code key–value pairs. We intentionally entered an invalid ZIP Code for Miami:
In [1]: import pandas as pd
In [2]: zips = pd.Series({'Boston': '02215', 'Miami': '3310'})
In [3]: zips
Out[3]:
Boston 02215
Miami 3310
dtype: object
Though zips looks like a two-dimensional array, it’s actually one-dimensional. The “second column” represents the Series’ ZIP Code values (from the dictionary’s values), and the “first column” represents their indices (from the dictionary’s keys).
We can use regular expressions with pandas to validate data. The str attribute of a Series provides string-processing and various regular-expression methods. Let’s use the str attribute’s match method to check whether each ZIP Code is valid:
In [4]: zips.str.match(r'\d{5}')
Out[4]:
Boston True
Miami False
dtype: bool
Method match applies the regular expression \d{5} to each Series element, attempting to ensure that the element consists of exactly five digits. You do not need to loop explicitly through all the ZIP Codes—match does this for you. This is another example of functional-style programming with internal rather than external iteration. The method returns a new Series containing True for each valid element. In this case, the ZIP Code for Miami did not match, so its element is False.
There are several ways to deal with invalid data. One is to catch it at its source and interact with the source to correct the value. That’s not always possible. For example, the data could be coming from high-speed sensors in the Internet of Things. In that case, we would not be able to correct it at the source, so we could apply data-cleaning techniques. In the case of the bad Miami ZIP Code of 3310, we might look for Miami ZIP Codes beginning with 3310. There are two—33101 and 33109—and we could pick one of those.
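Picking one of those corrections and revalidating takes just two more lines. Here we arbitrarily choose 33101:

```python
import pandas as pd

zips = pd.Series({'Boston': '02215', 'Miami': '3310'})

# substitute one of the plausible Miami ZIP Codes for the bad value
zips['Miami'] = '33101'

valid = zips.str.match(r'\d{5}')  # now True for every element
```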
Sometimes, rather than matching an entire value to a pattern, you’ll want to know whether a value contains a substring that matches the pattern. In this case, use method contains instead of match. Let’s create a Series of strings, each containing a U.S. city, state and ZIP Code, then determine whether each string contains a substring matching the pattern ' [A-Z]{2} ' (a space, followed by two uppercase letters, followed by a space):
In [5]: cities = pd.Series(['Boston, MA 02215', 'Miami, FL 33101'])
In [6]: cities
Out[6]:
0 Boston, MA 02215
1 Miami, FL 33101
dtype: object
In [7]: cities.str.contains(r' [A-Z]{2} ')
Out[7]:
0 True
1 True
dtype: bool
In [8]: cities.str.match(r' [A-Z]{2} ')
Out[8]:
0 False
1 False
dtype: bool
We did not specify the index values, so the Series uses zero-based indexes by default (snippet [6]). Snippet [7] uses contains to show that both Series elements contain substrings that match ' [A-Z]{2} '. Snippet [8] uses match to show that neither element’s value matches that pattern in its entirety, because each has other characters in its complete value.
We’ve discussed data cleaning. Now let’s consider munging data into a different format. As a simple example, assume that an application requires U.S. phone numbers in the format ###-###-####, with hyphens separating each group of digits. The phone numbers have been provided to us as 10-digit strings without hyphens. Let’s create the DataFrame:
In [9]: contacts = [['Mike Green', '[email protected]', '5555555555'],
...: ['Sue Brown', '[email protected]', '5555551234']]
...:
In [10]: contactsdf = pd.DataFrame(contacts,
...: columns=['Name', 'Email', 'Phone'])
...:
In [11]: contactsdf
Out[11]:
Name Email Phone
0 Mike Green [email protected] 5555555555
1 Sue Brown [email protected] 5555551234
In this DataFrame, we specified column indices via the columns keyword argument but did not specify row indices, so the rows are indexed from 0. Also, the output shows the column values right aligned by default. This differs from Python formatting, in which numbers in a field are right aligned by default but non-numeric values are left aligned by default.
Now, let’s munge the data with a little more functional-style programming. We can map the phone numbers to the proper format by calling the Series method map on the DataFrame’s 'Phone' column. Method map’s argument is a function that receives a value and returns the mapped value. The function get_formatted_phone maps 10 consecutive digits into the format ###-###-####:
In [12]: import re
In [13]: def get_formatted_phone(value):
...: result = re.fullmatch(r'(\d{3})(\d{3})(\d{4})', value)
...: return '-'.join(result.groups()) if result else value
...:
...:
The regular expression in the block’s first statement matches only 10 consecutive digits. It captures substrings containing the first three digits, the next three digits and the last four digits. The return statement operates as follows:
If result is None, we simply return value unmodified.
Otherwise, we call result.groups() to get a tuple containing the captured substrings and pass that tuple to string method join to concatenate the elements, separating each from the next with '-' to form the mapped phone number.
Series method map returns a new Series containing the results of calling its function argument for each value in the column. Snippet [15] displays the result, including the column’s name and type:
In [14]: formatted_phone = contactsdf['Phone'].map(get_formatted_phone)
In [15]: formatted_phone
0 555-555-5555
1 555-555-1234
Name: Phone, dtype: object
Once you’ve confirmed that the data is in the correct format, you can update it in the original DataFrame by assigning the new Series to the 'Phone' column:
In [16]: contactsdf['Phone'] = formatted_phone
In [17]: contactsdf
Out[17]:
Name Email Phone
0 Mike Green [email protected] 555-555-5555
1 Sue Brown [email protected] 555-555-1234
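As an aside, this particular munge can also be done without a custom function: Series.str.replace accepts a regular expression and backreferences, applying the substitution to every element at once. Note that, unlike re.fullmatch, this replaces any matching 10-digit run wherever it appears within a string:

```python
import pandas as pd

phones = pd.Series(['5555555555', '5555551234'])

# \1-\2-\3 rebuilds the three captured digit groups, separated by hyphens
formatted = phones.str.replace(r'(\d{3})(\d{3})(\d{4})', r'\1-\2-\3', regex=True)
```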
We’ll continue our pandas discussion in the next chapter’s Intro to Data Science section, and we’ll use pandas in several later chapters.
(Fill-In) Preparing data for analysis is called ______ or ______. A subset of this process is data cleaning.
Answer: data munging, data wrangling.
(IPython Session) Let’s assume that an application requires U.S. phone numbers in the format (###) ###-####. Modify the get_formatted_phone function in snippet [13] to return the phone number in this new format. Then recreate the DataFrame from snippets [9] and [10] and use the updated get_formatted_phone function to munge the data.
Answer:
In [1]: import pandas as pd
In [2]: import re
In [3]: contacts = [['Mike Green', '[email protected]', '5555555555'],
...: ['Sue Brown', '[email protected]', '5555551234']]
...:
In [4]: contactsdf = pd.DataFrame(contacts,
...: columns=['Name', 'Email', 'Phone'])
...:
In [5]: def get_formatted_phone(value):
...: result = re.fullmatch(r'(\d{3})(\d{3})(\d{4})', value)
...: if result:
...: part1, part2, part3 = result.groups()
...: return '(' + part1 + ') ' + part2 + '-' + part3
...: else:
...: return value
...:
In [6]: contactsdf['Phone'] = contactsdf['Phone'].map(get_formatted_phone)
In [7]: contactsdf
Out[7]:
Name Email Phone
0 Mike Green [email protected] (555) 555-5555
1 Sue Brown [email protected] (555) 555-1234