By now, you should be able to load data into Pandas
and do some basic visualizations. This part of the book focuses on various data cleaning tasks. We begin with assembling a data set for analysis by combining various data sets together.
1. Prior knowledge
a. loading data
b. subsetting data
c. functions and class methods
This chapter will cover:
1. Tidy data
2. Concatenating data
3. Merging data sets
Hadley Wickham,1 one of the more prominent members of the R community, talks about the idea of tidy data. In fact, he’s written a paper about this concept in the Journal of Statistical Software.2 Tidy data is a framework to structure data sets so they can be easily analyzed. It is mainly used as a goal one should aim for when cleaning data. Once you understand what tidy data is, that knowledge will make data collection much easier.
1. Hadley Wickham’s homepage: http://hadley.nz
2. Tidy data paper: http://vita.had.co.nz/papers/tidy-data.pdf
So what is tidy data? Hadley Wickham’s paper defines it as meeting the following criteria:
■ Each row is an observation.
■ Each column is a variable.
■ Each type of observational unit forms a table.
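As a concrete illustration, a minimal tidy data set (the country/year/cases columns here are invented for this sketch) might look like this: each row is one observation of a country in a year, and each column is a single variable.

```python
import pandas as pd

# each row is an observation; each column is a variable
tidy = pd.DataFrame({
    'country': ['Canada', 'Canada', 'Mexico', 'Mexico'],
    'year': [2019, 2020, 2019, 2020],
    'cases': [10, 12, 7, 9],
})
print(tidy)
```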
We begin with Hadley Wickham’s last tidy data point: “Each type of observational unit forms a table.” When data is tidy, you need to combine various tables together to answer a question. For example, there may be a separate table holding company information and another table holding stock prices. If we want to look at all the stock prices within the tech industry, we may first have to find all the tech companies from the company information table, and then combine that data with the stock price data to get the data we need for our question. The data may have been split up into separate tables to reduce the amount of redundant information (we don’t need to store the company information with each stock price entry), but this arrangement means we as data analysts must combine the relevant data ourselves to answer our question.
At other times, a single data set may be split into multiple parts. For example, with time-series data, each date may be in a separate file. In another case, a file may have been split into parts to make the individual files smaller. You may also need to combine data from multiple sources to answer a question (e.g., combine latitudes and longitudes with zip codes). In all of these cases, you will need to combine data into a single dataframe for analysis.
One of the (conceptually) easier ways to combine data is with concatenation. Concatenation can be thought of as appending a row or column to your data. This approach is possible if your data was split into parts or if you performed a calculation that you want to append to your existing data set.
Concatenation is accomplished by using the concat function from Pandas.
Let’s begin with some example data sets so you can see what is actually happening.
import pandas as pd
df1 = pd.read_csv('../data/concat_1.csv')
df2 = pd.read_csv('../data/concat_2.csv')
df3 = pd.read_csv('../data/concat_3.csv')
print(df1)
print(df3)
Stacking the dataframes on top of each other uses the concat function in Pandas. All of the dataframes to be concatenated are passed in a list.
row_concat = pd.concat([df1, df2, df3])
print(row_concat)
As you can see, concat blindly stacks the dataframes together. If you look at the row names (i.e., the row indices), they are also simply a stacked version of the original row indices. If we apply the various subsetting methods from Table 2.3, the table will be subsetted as expected.
# subset the fourth row of the concatenated dataframe
print(row_concat.iloc[3,])
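Because concat preserves the original row indices, label-based subsetting with loc behaves differently from position-based subsetting with iloc. A small sketch with made-up data shows the difference:

```python
import pandas as pd

df_a = pd.DataFrame({'A': ['a0', 'a1']})
df_b = pd.DataFrame({'A': ['a2', 'a3']})
stacked = pd.concat([df_a, df_b])

# iloc is positional: returns exactly one row
print(stacked.iloc[0])

# loc matches the label 0, which now appears twice: returns two rows
print(stacked.loc[0])
```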
Section 2.2.1 showed the process for creating a Series. However, if we create a new series to append to a dataframe, it does not append correctly.
# create a new row of data
new_row_series = pd.Series(['n1', 'n2', 'n3', 'n4'])
print(new_row_series)
# attempt to add the new row to a dataframe
print(pd.concat([df1, new_row_series]))
The first things you may notice are the NaN values. NaN is how Pandas represents a "missing value" (see Chapter 5, "Missing Data"). We were hoping to append our new values as a row, but that didn't happen. In fact, not only did our code not append the values as a row, but it also created a new column completely misaligned with everything else.
If we pause to think about what is happening here, we can see that the results actually make sense. First, if we look at the new indices that were added, we notice that they are very similar to the results we obtained when we concatenated dataframes earlier. The indices of the new_row_series object are analogs to the row numbers of the dataframe. Also, since our series did not have a matching column name, it was added as a new column.
To fix this problem, we can turn our series into a dataframe. This dataframe contains one row of data, with column names matching the columns the data will bind to.
# note the double brackets
new_row_df = pd.DataFrame([['n1', 'n2', 'n3', 'n4']],
columns=['A', 'B', 'C', 'D'])
print(new_row_df)
print(pd.concat([df1, new_row_df]))
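An equivalent way to get a one-row dataframe from a series, assuming the values are already in column order, is to give the series the target column names as its index and then transpose it:

```python
import pandas as pd

new_row_series = pd.Series(['n1', 'n2', 'n3', 'n4'],
                           index=['A', 'B', 'C', 'D'])

# to_frame() yields a 4x1 dataframe; transposing it with .T
# produces a 1x4 dataframe whose columns are A, B, C, D
new_row_df = new_row_series.to_frame().T
print(new_row_df)
```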
concat is a general function that can concatenate multiple things at once. If you just need to append a single object to an existing dataframe, the append method can handle that task.
Using a DataFrame:
print(df1.append(df2))
Using a single-row DataFrame:
print(df1.append(new_row_df))
Using a Python dictionary:
data_dict = {'A': 'n1',
'B': 'n2',
'C': 'n3',
'D': 'n4'}
print(df1.append(data_dict, ignore_index=True))
In the last example, when we added a dict to a dataframe, we had to use the ignore_index parameter. If we look closer, we can see that the row index was incremented by 1, rather than repeating a previous index value.
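Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0. On newer versions, the same result can be obtained by wrapping the dictionary in a list to build a one-row dataframe and then using concat (stand-in data shown here):

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['a0'], 'B': ['b0'],
                    'C': ['c0'], 'D': ['d0']})
data_dict = {'A': 'n1', 'B': 'n2', 'C': 'n3', 'D': 'n4'}

# wrap the dict in a list to create a one-row dataframe, then concat
appended = pd.concat([df1, pd.DataFrame([data_dict])],
                     ignore_index=True)
print(appended)
```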
If we simply want to concatenate or append data together, we can use the ignore_index parameter to reset the row index after the concatenation.
row_concat_i = pd.concat([df1, df2, df3], ignore_index=True)
print(row_concat_i)
Concatenating columns is very similar to concatenating rows. The main difference is the axis parameter in the concat function. The default value of axis is 0, so it will concatenate data in a row-wise fashion. However, if we pass axis=1 to the function, it will concatenate data in a column-wise manner.
col_concat = pd.concat([df1, df2, df3], axis=1)
print(col_concat)
If we try to subset data based on column names, we will get a result similar to the one we got when we concatenated row-wise and subset by row index.
print(col_concat['A'])
Adding a single column to a dataframe can be done directly without using any specific Pandas function. Simply assign the vector of values you want to a new column name.
col_concat['new_col_list'] = ['n1', 'n2', 'n3', 'n4']
print(col_concat)
col_concat['new_col_series'] = pd.Series(['n1', 'n2', 'n3', 'n4'])
print(col_concat)
Using the concat function still works, as long as you pass it a dataframe, but this approach requires a bit more code.
Finally, we can reset the column indices so we do not have duplicated column names.
print(pd.concat([df1, df2, df3], axis=1, ignore_index=True))
The examples shown so far have assumed we are performing a simple row or column concatenation. They also assume that the new row(s) had the same column names or the column(s) had the same row indices.
This section addresses what happens when the row and column indices are not aligned.
Let’s modify our dataframes for the next few examples.
df1.columns = ['A', 'B', 'C', 'D']
df2.columns = ['E', 'F', 'G', 'H']
df3.columns = ['A', 'C', 'F', 'H']
print(df1)
print(df2)
print(df3)
If we try to concatenate these dataframes as we did in Section 4.3.1, the dataframes now do much more than simply stack one on top of the other. The columns align themselves, and NaN fills in any missing areas.
row_concat = pd.concat([df1, df2, df3])
print(row_concat)
One way to avoid the inclusion of NaN values is to keep only those columns that are shared in common by the list of objects to be concatenated. A parameter named join accomplishes this. By default, it has a value of 'outer', meaning it will keep all the columns. However, we can set join='inner' to keep only the columns that are shared among the data sets.
If we try to keep only the columns from all three dataframes, we will get an empty dataframe, since there are no columns in common.
print(pd.concat([df1, df2, df3], join='inner'))
If we use the dataframes that have columns in common, only the columns that all of them share will be returned.
print(pd.concat([df1,df3], ignore_index=False, join='inner'))
Let’s take our dataframes and modify them again so that they have different row indices. Here, we are building on the same dataframe modifications from Section 4.3.3.1.
df1.index = [0, 1, 2, 3]
df2.index = [4, 5, 6, 7]
df3.index = [0, 2, 5, 7]
print(df1)
print(df2)
print(df3)
When we concatenate along axis=1, the behavior is analogous to concatenating along axis=0. The new dataframes are added in a column-wise fashion and matched against their respective row indices. Missing value indicators appear in the areas where the indices did not align.
col_concat = pd.concat([df1, df2, df3], axis=1)
print(col_concat)
Just as we did when we concatenated in a row-wise manner, we can choose to keep the results only when there are matching indices by using join='inner'.
print(pd.concat([df1, df3], axis=1, join='inner'))
The previous section alluded to a few database concepts. The join='inner' and the default join='outer' parameters come from working with databases, where we want to merge tables.
Instead of simply having a row or column index that you want to use to concatenate values, sometimes you may have two or more dataframes that you want to combine based on common data values. This task is known in the database world as performing a “join.”
Pandas dataframes have a join method that uses merge under the hood. join merges dataframe objects based on their index, but the merge method is much more explicit and flexible. If you are planning to merge dataframes by the row index, you might want to look into the join method.3
3. Pandas DataFrame join method: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.join.html
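A quick sketch of join with invented data: by default it aligns the two dataframes on their row indices.

```python
import pandas as pd

left = pd.DataFrame({'lval': [1, 2]}, index=['k0', 'k1'])
right = pd.DataFrame({'rval': [3, 4]}, index=['k0', 'k1'])

# join aligns on the row index by default
joined = left.join(right)
print(joined)
```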
We will be using sets of survey data in this series of examples.
person = pd.read_csv('../data/survey_person.csv')
site = pd.read_csv('../data/survey_site.csv')
survey = pd.read_csv('../data/survey_survey.csv')
visited = pd.read_csv('../data/survey_visited.csv')
print(person)
print(site)
print(visited)
print(survey)
Currently, our data is split into multiple parts, where each part is an observational unit. If we wanted to look at the dates at each site along with the latitude and longitude information for that site, we would have to combine (and merge) multiple dataframes. We can do this with merge, which is a DataFrame method in Pandas.
When we call this method, the dataframe on which it is called is referred to as the one on the 'left'. Within merge, the first parameter is the 'right' dataframe. The next parameter, how, specifies how the final merged result looks; Table 4.1 provides more details. Next, we set the on parameter, which specifies the columns to match on. If the left and right columns do not have the same name, we can use the left_on and right_on parameters instead.
Pandas | SQL         | Description
-------|-------------|------------------------------------------------------
left   | left outer  | Keep all the keys from the left
right  | right outer | Keep all the keys from the right
outer  | full outer  | Keep all the keys from both left and right
inner  | inner       | Keep only the keys that exist in both left and right
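The effect of the how parameter can be sketched with two tiny dataframes (the keys and values here are invented): an inner merge keeps only the shared keys, while an outer merge keeps every key and fills the gaps with NaN.

```python
import pandas as pd

left = pd.DataFrame({'key': ['k0', 'k1', 'k2'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key': ['k1', 'k2', 'k3'], 'rval': [4, 5, 6]})

# inner keeps only k1 and k2, the keys present in both dataframes
inner = left.merge(right, on='key', how='inner')

# outer keeps k0 through k3, with NaN where a side has no match
outer = left.merge(right, on='key', how='outer')
print(len(inner), len(outer))
```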
In the simplest type of merge, we have two dataframes where we want to join one column to another column, and where the columns we want to join do not contain any duplicate values.
For this example, we will modify the visited dataframe so there are no duplicated site values.
visited_subset = visited.loc[[0, 2, 6], ]
We can perform our one-to-one merge as follows:
# the default value for 'how' is 'inner'
# so it doesn't need to be specified
o2o_merge = site.merge(visited_subset,
left_on='name', right_on='site')
print(o2o_merge)
As you can see, we have now created a new dataframe from two separate dataframes where the rows were matched based on a particular set of columns. In SQL-speak, the columns used to match are called “keys.”
If we choose to do the same merge, but this time without the subsetted visited dataframe, we would perform a many-to-one merge. In this kind of merge, one of the dataframes has key values that repeat. The dataframe that contains the single observations will then be duplicated in the merge.
m2o_merge = site.merge(visited, left_on='name', right_on='site')
print(m2o_merge)
As you can see, the site information (name, lat, and long) was duplicated and matched to the visited data.
Lastly, there will be times when we want to perform a match based on multiple columns. As an example, suppose we have one dataframe that comes from person merged with survey, and another dataframe that comes from visited merged with survey.
ps = person.merge(survey, left_on='ident', right_on='person')
vs = visited.merge(survey, left_on='ident', right_on='taken')
print(ps)
print(vs)
We can perform a many-to-many merge by passing the multiple columns to match on in a Python list.
ps_vs = ps.merge(vs,
left_on=['ident', 'taken', 'quant', 'reading'],
right_on=['person', 'ident', 'quant', 'reading'])
Let’s look at just the first row of data.
print(ps_vs.loc[0, ])
Pandas will automatically add a suffix to a column name if there are collisions in the name. In the output, the _x suffix refers to values from the left dataframe, and the _y suffix comes from values in the right dataframe.
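The suffix behavior, and the suffixes parameter that controls it, can be seen with a small example (made-up data):

```python
import pandas as pd

left = pd.DataFrame({'key': ['k0', 'k1'], 'value': [1, 2]})
right = pd.DataFrame({'key': ['k0', 'k1'], 'value': [3, 4]})

# 'value' collides, so pandas appends _x (left) and _y (right)
merged = left.merge(right, on='key')
print(merged.columns.tolist())  # ['key', 'value_x', 'value_y']

# the defaults can be overridden with the suffixes parameter
merged2 = left.merge(right, on='key', suffixes=('_left', '_right'))
```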
Sometimes you may need to combine various parts of your data or multiple data sets, depending on the question you are trying to answer. Keep in mind, however, that the shape of data you need for analysis does not necessarily equate to the best shape for data storage.
The survey data used in the last example came in four separate parts that needed to be merged together. After we merged the tables, a lot of redundant information appeared across the rows. From a data storage and data entry point of view, each of these duplications can lead to errors and data inconsistency. This is what Hadley Wickham meant by saying that in tidy data, "each type of observational unit forms a table."