By now, you should be able to load data into pandas
and do some basic visualizations. This part of the book focuses on various data cleaning tasks. We begin with assembling a data set for analysis by combining various data sets together.
Identify when data needs to be combined
Identify whether data needs to be concatenated or joined together
Use the appropriate function or methods to combine multiple data sets
Produce a single data set from multiple files
Assess whether data was joined properly
We first talked about tidy data principles in Chapter 4. This chapter will cover the third criterion in the original “Tidy Data” paper1: “each type of observational unit forms a table.”
1. Tidy Data paper: http://vita.had.co.nz/papers/tidy-data.pdf
When data is tidy, you need to combine various tables together to answer a question. For example, there may be a separate table holding company information and another table holding stock prices. If we want to look at all the stock prices within the tech industry, we may first have to find all the tech companies from the company information table, and then combine that data with the stock price data to get the data we need for our question. The data may have been split up into separate tables to reduce the amount of redundant information (we don’t need to store the company information with each stock price entry), but this arrangement means we as data analysts must combine the relevant data ourselves to answer our question.
Other times, a single data set may be split into multiple parts. For example, with timeseries data, each date may be in a separate file. In another case, a file may have been split into parts to make the individual files smaller. You may also need to combine data from multiple sources to answer a question (e.g., combine latitudes and longitudes with zip codes). In both cases, you will need to combine data into a single dataframe for analysis.
One of the (conceptually) easier ways to combine data is with concatenation. Concatenation can be thought of as appending a row or column to your data. This approach is possible if your data was split into parts or if you performed a calculation that you want to append to your existing data set.
Let’s begin with some example data sets so you can see what is actually happening.
import pandas as pd
df1 = pd.read_csv('data/concat_1.csv')
df2 = pd.read_csv('data/concat_2.csv')
df3 = pd.read_csv('data/concat_3.csv')
print(df1)
A B C D
0 a0 b0 c0 d0
1 a1 b1 c1 d1
2 a2 b2 c2 d2
3 a3 b3 c3 d3
print(df2)
A B C D
0 a4 b4 c4 d4
1 a5 b5 c5 d5
2 a6 b6 c6 d6
3 a7 b7 c7 d7
print(df3)
A B C D
0 a8 b8 c8 d8
1 a9 b9 c9 d9
2 a10 b10 c10 d10
3 a11 b11 c11 d11
Concatenation is accomplished by using the concat() function from Pandas.
Section 2.3.1 talked about the three parts of a dataframe: .index, .columns, and .values. We will be working with .index and .columns a lot in this chapter.
The .index refers to the labels on the left of the dataframe; by default they will be numbered starting from 0.
print(df1.index)
RangeIndex(start=0, stop=4, step=1)
The “index” is an “axis” of a dataframe. These terms are important because pandas will try to automatically align by axis. The other axis is the “columns,” which we can get with .columns.
print(df1.columns)
Index(['A', 'B', 'C', 'D'], dtype='object')
This refers to the column names of the dataframe.
Finally, just to be complete, the body of the dataframe can be represented as a numpy array with .values.
print(df1.values)
[['a0' 'b0' 'c0' 'd0']
['a1' 'b1' 'c1' 'd1']
['a2' 'b2' 'c2' 'd2']
['a3' 'b3' 'c3' 'd3']]
Stacking (i.e., concatenating) the dataframes on top of each other uses the concat() function in pandas. All of the dataframes to be concatenated are passed in a list.
row_concat = pd.concat([df1, df2, df3])
print(row_concat)
A B C D
0 a0 b0 c0 d0
1 a1 b1 c1 d1
2 a2 b2 c2 d2
3 a3 b3 c3 d3
0 a4 b4 c4 d4
.. ... ... ... ...
3 a7 b7 c7 d7
0 a8 b8 c8 d8
1 a9 b9 c9 d9
2 a10 b10 c10 d10
3 a11 b11 c11 d11
[12 rows x 4 columns]
As you can see, concat() blindly stacks the dataframes together. If you look at the row names (i.e., the row indices), they are also simply a stacked version of the original row indices. If we apply the various subsetting methods (Table 2.3), the table will be subsetted as expected.
# subset the fourth row of the concatenated dataframe
print(row_concat.iloc[3, :])
A a3
B b3
C c3
D d3
Name: 3, dtype: object
Section 2.1.1 showed the process for creating a Series. However, if we create a new series to append to a dataframe, it does not append correctly.
# create a new row of data
new_row_series = pd.Series(['n1', 'n2', 'n3', 'n4'])
print(new_row_series)
0 n1
1 n2
2 n3
3 n4
dtype: object
# attempt to add the new row to a dataframe
print(pd.concat([df1, new_row_series]))
A B C D 0
0 a0 b0 c0 d0 NaN
1 a1 b1 c1 d1 NaN
2 a2 b2 c2 d2 NaN
3 a3 b3 c3 d3 NaN
0 NaN NaN NaN NaN n1
1 NaN NaN NaN NaN n2
2 NaN NaN NaN NaN n3
3 NaN NaN NaN NaN n4
The first thing you may notice is the NaN missing values. This is how pandas represents a “missing value” (more about missing values in Chapter 9). We were hoping to append our new values as a row, but that didn’t happen. In fact, not only did our code not append the values as a row, but it also created a new column completely misaligned with everything else.
Let’s think about what is happening here. First, our series did not have a matching column name, so our new_row_series was added as a new column. The rest of the values were concatenated to the bottom of the dataframe, and the original index values were retained.
To fix this problem, we need to turn our series into a dataframe. This dataframe contains one row of data, and its column names are the ones the data will bind to.
new_row_df = pd.DataFrame(
# note the double brackets to create a "row" of data
data=[["n1", "n2", "n3", "n4"]],
columns=["A", "B", "C", "D"],
)
print(new_row_df)
A B C D
0 n1 n2 n3 n4
# concatenate the row of data
print(pd.concat([df1, new_row_df]))
A B C D
0 a0 b0 c0 d0
1 a1 b1 c1 d1
2 a2 b2 c2 d2
3 a3 b3 c3 d3
0 n1 n2 n3 n4
concat() is a general function that can concatenate multiple things at once.
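For instance, a single concat() call can stack any number of objects. Here is a minimal sketch with small toy dataframes (illustrative data, not the book's CSV files):

```python
import pandas as pd

# small toy dataframes (illustrative; not the book's CSV files)
df1 = pd.DataFrame({"A": ["a0", "a1"], "B": ["b0", "b1"]})
df2 = pd.DataFrame({"A": ["a2", "a3"], "B": ["b2", "b3"]})
new_row = pd.DataFrame([["n1", "n2"]], columns=["A", "B"])

# one concat() call can stack any number of objects in the list
combined = pd.concat([df1, df2, new_row], ignore_index=True)
print(combined)
```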
Notice in the last example that the appended row kept its original index value of 0, so the concatenated result repeats an index label.
If we simply want to concatenate or append data together, we can use the ignore_index parameter to reset the row index after the concatenation.
row_concat_i = pd.concat([df1, df2, df3], ignore_index=True)
print(row_concat_i)
A B C D
0 a0 b0 c0 d0
1 a1 b1 c1 d1
2 a2 b2 c2 d2
3 a3 b3 c3 d3
4 a4 b4 c4 d4
.. ... ... ... ...
7 a7 b7 c7 d7
8 a8 b8 c8 d8
9 a9 b9 c9 d9
10 a10 b10 c10 d10
11 a11 b11 c11 d11
[12 rows x 4 columns]
Concatenating columns is very similar to concatenating rows. The main difference is the axis parameter in the concat function. The default value of axis is 0 (or "index"), so it will concatenate data in a row-wise fashion. However, if we pass axis=1 (or axis="columns") to the function, it will concatenate data in a column-wise manner.
col_concat = pd.concat([df1, df2, df3], axis="columns")
print(col_concat)
A B C D A B C D A B C D
0 a0 b0 c0 d0 a4 b4 c4 d4 a8 b8 c8 d8
1 a1 b1 c1 d1 a5 b5 c5 d5 a9 b9 c9 d9
2 a2 b2 c2 d2 a6 b6 c6 d6 a10 b10 c10 d10
3 a3 b3 c3 d3 a7 b7 c7 d7 a11 b11 c11 d11
If we subset the data based on column names, we get a result similar to what we saw when we concatenated row-wise and subset by row index.
print(col_concat['A'])
A A A
0 a0 a4 a8
1 a1 a5 a9
2 a2 a6 a10
3 a3 a7 a11
Adding a single column to a dataframe can be done directly without using any specific Pandas function (we saw this in Section 2.4.1). Simply assign the vector of values to a new column name.
col_concat['new_col_list'] = ['n1', 'n2', 'n3', 'n4']
print(col_concat)
A B C D A B C D A B C D new_col_list
0 a0 b0 c0 d0 a4 b4 c4 d4 a8 b8 c8 d8 n1
1 a1 b1 c1 d1 a5 b5 c5 d5 a9 b9 c9 d9 n2
2 a2 b2 c2 d2 a6 b6 c6 d6 a10 b10 c10 d10 n3
3 a3 b3 c3 d3 a7 b7 c7 d7 a11 b11 c11 d11 n4
col_concat['new_col_series'] = pd.Series(['n1', 'n2', 'n3', 'n4'])
print(col_concat)
A B C D A B C D A B C D new_col_list
0 a0 b0 c0 d0 a4 b4 c4 d4 a8 b8 c8 d8 n1
1 a1 b1 c1 d1 a5 b5 c5 d5 a9 b9 c9 d9 n2
2 a2 b2 c2 d2 a6 b6 c6 d6 a10 b10 c10 d10 n3
3 a3 b3 c3 d3 a7 b7 c7 d7 a11 b11 c11 d11 n4
new_col_series
0 n1
1 n2
2 n3
3 n4
Using the concat() function still works, as long as you give it a dataframe. However, this approach requires more code.
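A sketch of what that longer concat() version might look like, using a small toy dataframe (the names here are illustrative, not from the book's data):

```python
import pandas as pd

# toy dataframe (illustrative names, not the book's data)
df = pd.DataFrame({"A": ["a0", "a1"], "B": ["b0", "b1"]})

# to use concat(), the new column must first be wrapped in a dataframe
new_col = pd.DataFrame({"new_col_df": ["n1", "n2"]})

# axis="columns" attaches it column-wise, aligning on the row index
df = pd.concat([df, new_col], axis="columns")
print(df)
```

Compared with a direct assignment, this takes an extra step (building the one-column dataframe), which is why direct assignment is usually preferred for a single column.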
Finally, we can reset the column indices so we do not have duplicated column names.
print(pd.concat([df1, df2, df3], axis="columns", ignore_index=True))
0 1 2 3 4 5 6 7 8 9 10 11
0 a0 b0 c0 d0 a4 b4 c4 d4 a8 b8 c8 d8
1 a1 b1 c1 d1 a5 b5 c5 d5 a9 b9 c9 d9
2 a2 b2 c2 d2 a6 b6 c6 d6 a10 b10 c10 d10
3 a3 b3 c3 d3 a7 b7 c7 d7 a11 b11 c11 d11
The examples shown so far have assumed we are performing a row or column concatenation. They also assume that the new row(s) had the same column names or the column(s) had the same row indices.
This section addresses what happens when the row and column indices are not aligned.
Let’s modify our dataframes for the next few examples.
# rename the columns of our dataframes
df1.columns = ['A', 'B', 'C', 'D']
df2.columns = ['E', 'F', 'G', 'H']
df3.columns = ['A', 'C', 'F', 'H']
print(df1)
A B C D
0 a0 b0 c0 d0
1 a1 b1 c1 d1
2 a2 b2 c2 d2
3 a3 b3 c3 d3
print(df2)
E F G H
0 a4 b4 c4 d4
1 a5 b5 c5 d5
2 a6 b6 c6 d6
3 a7 b7 c7 d7
print(df3)
A C F H
0 a8 b8 c8 d8
1 a9 b9 c9 d9
2 a10 b10 c10 d10
3 a11 b11 c11 d11
If we try to concatenate these dataframes as we did in Section 6.2.2, the dataframes now do much more than simply stack one on top of the other. The columns align themselves, and NaN fills in any missing areas.
row_concat = pd.concat([df1, df2, df3])
print(row_concat)
A B C D E F G H
0 a0 b0 c0 d0 NaN NaN NaN NaN
1 a1 b1 c1 d1 NaN NaN NaN NaN
2 a2 b2 c2 d2 NaN NaN NaN NaN
3 a3 b3 c3 d3 NaN NaN NaN NaN
0 NaN NaN NaN NaN a4 b4 c4 d4
.. ... ... ... ... ... ... ... ...
3 NaN NaN NaN NaN a7 b7 c7 d7
0 a8 NaN b8 NaN NaN c8 NaN d8
1 a9 NaN b9 NaN NaN c9 NaN d9
2 a10 NaN b10 NaN NaN c10 NaN d10
3 a11 NaN b11 NaN NaN c11 NaN d11
[12 rows x 8 columns]
One way to avoid the inclusion of NaN values is to keep only those columns that are shared in common by the objects to be concatenated. A parameter named join accomplishes this. By default, it has a value of 'outer', meaning it will keep all the columns. However, we can set join='inner' to keep only the columns that are shared among the data sets.
If we try to keep only the columns from all three dataframes, we will get an empty dataframe, since there are no columns in common.
print(pd.concat([df1, df2, df3], join='inner'))
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]
[12 rows x 0 columns]
If we use the dataframes that have columns in common, only the columns that all of them share will be returned.
print(pd.concat([df1,df3], ignore_index=False, join='inner'))
A C
0 a0 c0
1 a1 c1
2 a2 c2
3 a3 c3
0 a8 b8
1 a9 b9
2 a10 b10
3 a11 b11
Let’s take our dataframes and modify them again so that they have different row indices. Here, we are building on the same dataframe modifications from Section 6.2.4.1.
df1.index = [0, 1, 2, 3]
df2.index = [4, 5, 6, 7]
df3.index = [0, 2, 5, 7]
print(df1)
A B C D
0 a0 b0 c0 d0
1 a1 b1 c1 d1
2 a2 b2 c2 d2
3 a3 b3 c3 d3
print(df2)
E F G H
4 a4 b4 c4 d4
5 a5 b5 c5 d5
6 a6 b6 c6 d6
7 a7 b7 c7 d7
print(df3)
A C F H
0 a8 b8 c8 d8
2 a9 b9 c9 d9
5 a10 b10 c10 d10
7 a11 b11 c11 d11
When we concatenate along axis="columns" (axis=1), the new dataframes will be added in a column-wise fashion and matched against their respective row indices. Missing value indicators appear in the areas where the indices did not align.
col_concat = pd.concat([df1, df2, df3], axis="columns")
print(col_concat)
A B C D E F G H A C F H
0 a0 b0 c0 d0 NaN NaN NaN NaN a8 b8 c8 d8
1 a1 b1 c1 d1 NaN NaN NaN NaN NaN NaN NaN NaN
2 a2 b2 c2 d2 NaN NaN NaN NaN a9 b9 c9 d9
3 a3 b3 c3 d3 NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN a4 b4 c4 d4 NaN NaN NaN NaN
5 NaN NaN NaN NaN a5 b5 c5 d5 a10 b10 c10 d10
6 NaN NaN NaN NaN a6 b6 c6 d6 NaN NaN NaN NaN
7 NaN NaN NaN NaN a7 b7 c7 d7 a11 b11 c11 d11
Just as we did when we concatenated in a row-wise manner, we can choose to keep the results only when there are matching indices by using join="inner".
print(pd.concat([df1, df3], axis="columns", join='inner'))
A B C D A C F H
0 a0 b0 c0 d0 a8 b8 c8 d8
2 a2 b2 c2 d2 a9 b9 c9 d9
One reason why data might be split across multiple files would be the size of the files. By splitting up data into various parts, each part would be smaller. This may be good when we need to share data on the Internet or via email, since many services limit the size of a file that can be opened or shared. Another reason why a data set might be split into multiple parts would be to account for the data collection process. For example, a separate data set containing stock information could be created for each day.
Since merging and concatenation have already been covered, this section will focus on techniques for quickly loading multiple data sources and assembling them together.
In this example, all of the billboard ratings files follow a pattern:
data/billboard-by_week/billboard-XX.csv
where XX represents the week (e.g., 03). We can use a pattern-matching function (glob) from the built-in pathlib module in Python to get a list of all the filenames that match a particular pattern.
from pathlib import Path
# from the current directory, find (glob) files matching this pattern
billboard_data_files = (
Path(".")
.glob("data/billboard-by_week/billboard-*.csv")
)
# this line is optional if you want to see the full list of files
billboard_data_files = sorted(list(billboard_data_files))
print(billboard_data_files)
[PosixPath('data/billboard-by_week/billboard-01.csv'),
PosixPath('data/billboard-by_week/billboard-02.csv'),
PosixPath('data/billboard-by_week/billboard-03.csv'),
PosixPath('data/billboard-by_week/billboard-04.csv'),
PosixPath('data/billboard-by_week/billboard-05.csv'),
.. ... ... ... ...
PosixPath('data/billboard-by_week/billboard-72.csv'),
PosixPath('data/billboard-by_week/billboard-73.csv'),
PosixPath('data/billboard-by_week/billboard-74.csv'),
PosixPath('data/billboard-by_week/billboard-75.csv'),
PosixPath('data/billboard-by_week/billboard-76.csv')]
The type() of billboard_data_files is a generator object, so if you “use it,” you will lose its contents. If you want to see the full list, you would need to run:
billboard_data_files = list(billboard_data_files)
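A minimal sketch of why that matters: a generator can only be iterated once, so materializing it a second time yields nothing.

```python
from pathlib import Path

# glob() returns a lazy generator, which is consumed as it is iterated
files = Path(".").glob("*")
first_pass = list(files)   # the generator is read (and used up) here
second_pass = list(files)  # nothing is left on a second pass

print(len(second_pass))  # always 0
```

This is why the examples below re-create the generator before looping over it again.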
Now that we have a list of filenames we want to load, we can load each file into a dataframe. We can choose to load each file individually, as we have been doing so far.
billboard01 = pd.read_csv(billboard_data_files[0])
billboard02 = pd.read_csv(billboard_data_files[1])
billboard03 = pd.read_csv(billboard_data_files[2])
# just look at one of the data sets we loaded
print(billboard01)
year artist track time
0 2000 2 Pac Baby Don't Cry (Keep... 4:22
1 2000 2Ge+her The Hardest Part Of ... 3:15
2 2000 3 Doors Down Kryptonite 3:53
3 2000 3 Doors Down Loser 4:24
4 2000 504 Boyz Wobble Wobble 3:35
.. ... ... ... ...
312 2000 Yankee Grey Another Nine Minutes 3:10
313 2000 Yearwood, Trisha Real Live Woman 3:55
314 2000 Ying Yang Twins Whistle While You Tw... 4:19
315 2000 Zombie Nation Kernkraft 400 3:30
316 2000 matchbox twenty Bent 4:12
date.entered week rating
0 2000-02-26 wk1 87.0
1 2000-09-02 wk1 91.0
2 2000-04-08 wk1 81.0
3 2000-10-21 wk1 76.0
4 2000-04-15 wk1 57.0
.. ... ... ...
312 2000-04-29 wk1 86.0
313 2000-04-01 wk1 85.0
314 2000-03-18 wk1 95.0
315 2000-09-02 wk1 99.0
316 2000-04-29 wk1 60.0
[317 rows x 7 columns]
We can concatenate them just as we did in Chapter 6.
# shape of each dataframe
print(billboard01.shape)
print(billboard02.shape)
print(billboard03.shape)
(317, 7)
(317, 7)
(317, 7)
# concatenate the dataframes together
billboard = pd.concat([billboard01, billboard02, billboard03])
# shape of final concatenated taxi data
print(billboard.shape)
(951, 7)
Let’s write a check to make sure the rows were concatenated correctly.
assert (
billboard01.shape[0]
+ billboard02.shape[0]
+ billboard03.shape[0]
== billboard.shape[0]
)
However, manually loading each dataframe into its own variable will get tedious when the data is split into many parts. As an alternative, we can automate the process using loops and list comprehensions.
An easier way to load multiple files is to first create an empty list, use a loop to iterate through each of the CSV files, load each CSV file into a Pandas dataframe, and then append the dataframe to the list. The final data type we want is a list of dataframes because the concat() function takes a list of dataframes to concatenate.
# this part was the same as earlier
from pathlib import Path
billboard_data_files = (
Path(".")
.glob("data/billboard-by_week/billboard-*.csv")
)
# create an empty list to append to
list_billboard_df = []
# loop through each CSV filename
for csv_filename in billboard_data_files:
# you can choose to print the filename for debugging
# print(csv_filename)
# load the CSV file into a dataframe
df = pd.read_csv(csv_filename)
# append the dataframe to the list that will hold the dataframes
list_billboard_df.append(df)
# print the length of the list
print(len(list_billboard_df))
76
# type of the first element
print(type(list_billboard_df[0]))
<class 'pandas.core.frame.DataFrame'>
# look at the first dataframe
print(list_billboard_df[0])
year artist track time
0 2000 2 Pac Baby Don't Cry (Keep... 4:22
1 2000 2Ge+her The Hardest Part Of ... 3:15
2 2000 3 Doors Down Kryptonite 3:53
3 2000 3 Doors Down Loser 4:24
4 2000 504 Boyz Wobble Wobble 3:35
.. ... ... ... ...
312 2000 Yankee Grey Another Nine Minutes 3:10
313 2000 Yearwood, Trisha Real Live Woman 3:55
314 2000 Ying Yang Twins Whistle While You Tw... 4:19
315 2000 Zombie Nation Kernkraft 400 3:30
316 2000 matchbox twenty Bent 4:12
date.entered week rating
0 2000-02-26 wk15 NaN
1 2000-09-02 wk15 NaN
2 2000-04-08 wk15 38.0
3 2000-10-21 wk15 72.0
4 2000-04-15 wk15 78.0
.. ... ... ...
312 2000-04-29 wk15 NaN
313 2000-04-01 wk15 NaN
314 2000-03-18 wk15 NaN
315 2000-09-02 wk15 NaN
316 2000-04-29 wk15 3.0
[317 rows x 7 columns]
Now that we have a list of dataframes, we can concatenate them.
billboard_loop_concat = pd.concat(list_billboard_df)
print(billboard_loop_concat.shape)
(24092, 7)
Python has an idiom for looping through something and adding it to a list, called a list comprehension. The loop given previously, which is shown here again without the comments, can be written as a list comprehension (Appendix K).
# we have to re-create the generator because we
# "used it up" in the previous example
billboard_data_files = (
Path(".")
.glob("data/billboard-by_week/billboard-*.csv")
)
# the loop code without comments
list_billboard_df = []
for csv_filename in billboard_data_files:
df = pd.read_csv(csv_filename)
list_billboard_df.append(df)
billboard_data_files = (
Path(".")
.glob("data/billboard-by_week/billboard-*.csv")
)
# same code in a list comprehension
billboard_dfs = [pd.read_csv(data) for data in billboard_data_files]
The result from our list comprehension is a list, just as in the earlier loop example.
print(type(billboard_dfs))
<class 'list'>
print(len(billboard_dfs))
76
Finally, we can concatenate the results just as we did earlier.
billboard_concat_comp = pd.concat(billboard_dfs)
print(billboard_concat_comp)
year artist track time
0 2000 2 Pac Baby Don't Cry (Keep... 4:22
1 2000 2Ge+her The Hardest Part Of ... 3:15
2 2000 3 Doors Down Kryptonite 3:53
3 2000 3 Doors Down Loser 4:24
4 2000 504 Boyz Wobble Wobble 3:35
.. ... ... ... ...
312 2000 Yankee Grey Another Nine Minutes 3:10
313 2000 Yearwood, Trisha Real Live Woman 3:55
314 2000 Ying Yang Twins Whistle While You Tw... 4:19
315 2000 Zombie Nation Kernkraft 400 3:30
316 2000 matchbox twenty Bent 4:12
date.entered week rating
0 2000-02-26 wk15 NaN
1 2000-09-02 wk15 NaN
2 2000-04-08 wk15 38.0
3 2000-10-21 wk15 72.0
4 2000-04-15 wk15 78.0
.. ... ... ...
312 2000-04-29 wk18 NaN
313 2000-04-01 wk18 NaN
314 2000-03-18 wk18 NaN
315 2000-09-02 wk18 NaN
316 2000-04-29 wk18 3.0
[24092 rows x 7 columns]
The previous section alluded to a few database concepts. The join="inner" and the default join="outer" parameters come from working with databases when we want to merge tables.
Instead of simply having a row or column index that you want to use to concatenate values, sometimes you may have two or more dataframes that you want to combine based on common data values. This task is known in the database world as performing a “join.”
Pandas has a .join() method that uses .merge() under the hood. .join() will merge dataframe objects based on an index, but the .merge() function is much more explicit and flexible.
If you are planning to merge dataframes by the row index, for example, you might want to look into the .join() method.2
2. Pandas DataFrame.join() method: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html
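As a brief illustration of index-based joining, here is a sketch with toy frames loosely modeled on the survey data (the values are illustrative, not loaded from the book's files):

```python
import pandas as pd

# toy frames indexed by site name (loosely modeled on the survey data)
left = pd.DataFrame({"lat": [-49.85, -47.15]}, index=["DR-1", "DR-3"])
right = pd.DataFrame(
    {"dated": ["1927-02-08", "1939-01-07"]}, index=["DR-1", "DR-3"]
)

# .join() aligns on the row index by default (how="left")
joined = left.join(right)
print(joined)
```

Because both frames share the same index labels, each site's latitude and date line up in a single row without specifying any key columns.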
We will be using the set of survey data in the following examples.
person = pd.read_csv('data/survey_person.csv')
site = pd.read_csv('data/survey_site.csv')
survey = pd.read_csv('data/survey_survey.csv')
visited = pd.read_csv('data/survey_visited.csv')
print(person)
ident personal family
0 dyer William Dyer
1 pb Frank Pabodie
2 lake Anderson Lake
3 roe Valentina Roerich
4 danforth Frank Danforth
print(site)
name lat long
0 DR-1 -49.85 -128.57
1 DR-3 -47.15 -126.72
2 MSK-4 -48.87 -123.40
print(visited)
ident site dated
0 619 DR-1 1927-02-08
1 622 DR-1 1927-02-10
2 734 DR-3 1939-01-07
3 735 DR-3 1930-01-12
4 751 DR-3 1930-02-26
5 752 DR-3 NaN
6 837 MSK-4 1932-01-14
7 844 DR-1 1932-03-22
print(survey)
taken person quant reading
0 619 dyer rad 9.82
1 619 dyer sal 0.13
2 622 dyer rad 7.80
3 622 dyer sal 0.09
4 734 pb rad 8.41
.. ... ... ... ...
16 752 roe sal 41.60
17 837 lake rad 1.46
18 837 lake sal 0.21
19 837 roe sal 22.50
20 844 roe rad 11.25
[21 rows x 4 columns]
Currently, our data is split into multiple parts, where each part is an observational unit. If we wanted to look at the dates at each site along with the latitude and longitude information for that site, we would have to combine (and merge) multiple dataframes. We can do this with the .merge() method in Pandas.
When we call this method, the dataframe it is called on is referred to as the one on the “left.” Within the .merge() method, the first parameter is the “right” dataframe (i.e., left.merge(right)). The next parameter, how, controls how the final merged result looks; Table 6.1 provides more details. Next, we set the on parameter. This specifies which columns to match on. If the left and right columns do not have the same name, we can use the left_on and right_on parameters instead.
Table 6.1 How the Pandas how Parameter Relates to SQL
| Pandas | SQL | Description |
|---|---|---|
| left | left outer | Keep all the keys from the left |
| right | right outer | Keep all the keys from the right |
| outer | full outer | Keep all the keys from both left and right |
| inner | inner | Keep only the keys that exist in both left and right |
In the simplest type of merge, we have two dataframes where we want to join one column to another column, and where the columns we want to join do not contain any duplicate values.
For this example, we will modify the visited dataframe so there are no duplicated site values.
visited_subset = visited.loc[[0, 2, 6], :]
print(visited_subset)
ident site dated
0 619 DR-1 1927-02-08
2 734 DR-3 1939-01-07
6 837 MSK-4 1932-01-14
# get a count of the values in the site column
print(
visited_subset["site"].value_counts()
)
DR-1 1
DR-3 1
MSK-4 1
Name: site, dtype: int64
We can perform our one-to-one merge as follows:
# the default value for 'how' is 'inner'
# so it doesn't need to be specified
o2o_merge = site.merge(
visited_subset, left_on="name", right_on="site"
)
print(o2o_merge)
name lat long ident site dated
0 DR-1 -49.85 -128.57 619 DR-1 1927-02-08
1 DR-3 -47.15 -126.72 734 DR-3 1939-01-07
2 MSK-4 -48.87 -123.40 837 MSK-4 1932-01-14
As you can see, we have now created a new dataframe from two separate dataframes where the rows were matched based on a particular set of columns. In SQL-speak, the columns used to match are called “keys.”
If we choose to do the same merge, but this time without using the subsetted visited dataframe, we would perform a many-to-one merge. In this kind of merge, one of the dataframes has key values that repeat.
# get a count of the values in the site column
print(
visited["site"].value_counts()
)
DR-3 4
DR-1 3
MSK-4 1
Name: site, dtype: int64
The rows from the dataframe containing the single (unique) observations will then be duplicated in the merge.
m2o_merge = site.merge(visited, left_on='name', right_on='site')
print(m2o_merge)
name lat long ident site dated
0 DR-1 -49.85 -128.57 619 DR-1 1927-02-08
1 DR-1 -49.85 -128.57 622 DR-1 1927-02-10
2 DR-1 -49.85 -128.57 844 DR-1 1932-03-22
3 DR-3 -47.15 -126.72 734 DR-3 1939-01-07
4 DR-3 -47.15 -126.72 735 DR-3 1930-01-12
5 DR-3 -47.15 -126.72 751 DR-3 1930-02-26
6 DR-3 -47.15 -126.72 752 DR-3 NaN
7 MSK-4 -48.87 -123.40 837 MSK-4 1932-01-14
The site information (name, lat, and long) was duplicated and matched to the visited data.
Lastly, there will be times when we want to perform a match based on multiple columns. As an example, suppose we have two dataframes: one that comes from person merged with survey, and another that comes from visited merged with survey.
ps = person.merge(survey, left_on='ident', right_on='person')
vs = visited.merge(survey, left_on='ident', right_on='taken')
print(ps)
ident personal family taken person quant reading
0 dyer William Dyer 619 dyer rad 9.82
1 dyer William Dyer 619 dyer sal 0.13
2 dyer William Dyer 622 dyer rad 7.80
3 dyer William Dyer 622 dyer sal 0.09
4 pb Frank Pabodie 734 pb rad 8.41
.. ... ... ... ... ... ... ...
14 lake Anderson Lake 837 lake rad 1.46
15 lake Anderson Lake 837 lake sal 0.21
16 roe Valentina Roerich 752 roe sal 41.60
17 roe Valentina Roerich 837 roe sal 22.50
18 roe Valentina Roerich 844 roe rad 11.25
[19 rows x 7 columns]
print(vs)
ident site dated taken person quant reading
0 619 DR-1 1927-02-08 619 dyer rad 9.82
1 619 DR-1 1927-02-08 619 dyer sal 0.13
2 622 DR-1 1927-02-10 622 dyer rad 7.80
3 622 DR-1 1927-02-10 622 dyer sal 0.09
4 734 DR-3 1939-01-07 734 pb rad 8.41
.. ... ... ... ... ... ... ...
16 752 DR-3 NaN 752 roe sal 41.60
17 837 MSK-4 1932-01-14 837 lake rad 1.46
18 837 MSK-4 1932-01-14 837 lake sal 0.21
19 837 MSK-4 1932-01-14 837 roe sal 22.50
20 844 DR-1 1932-03-22 844 roe rad 11.25
[21 rows x 7 columns]
We know there is a many-to-many merge happening because there are duplicate values in the keys for both the left and right dataframe.
print(
ps["quant"].value_counts()
)
rad 8
sal 8
temp 3
Name: quant, dtype: int64
print(
vs["quant"].value_counts()
)
sal 9
rad 8
temp 4
Name: quant, dtype: int64
We can perform a many-to-many merge by passing a list of the columns to match on. Here we match on a single column, quant, which contains duplicate values in both dataframes.
ps_vs = ps.merge(
vs,
left_on=["quant"],
right_on=["quant"],
)
Let’s look at just the first row of data.
print(ps_vs.loc[0, :])
ident_x dyer
personal William
family Dyer
taken_x 619
person_x dyer
...
site DR-1
dated 1927-02-08
taken_y 619
person_y dyer
reading_y 9.82
Name: 0, Length: 13, dtype: object
Pandas will automatically add a suffix to a column name if there are collisions in the name. In the output, the _x suffix refers to values from the left dataframe, and the _y suffix comes from values in the right dataframe.
A simple way to check your work before and after a merge is by looking at the number of rows of your data before and after the merge. If you end up with more rows than either of the dataframes you are merging together, that means a many-to-many merge occurred, and that is usually a situation you do not want.
print(ps.shape) # left dataframe
(19, 7)
print(vs.shape) # right dataframe
(21, 7)
print(ps_vs.shape) # after merge
(148, 13)
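Beyond eyeballing row counts, .merge() itself can enforce the relationship you expect via its validate parameter, which raises a MergeError when the check fails. A small sketch with toy data (not the survey tables):

```python
import pandas as pd

# toy frames where the key repeats on both sides (illustrative data)
left = pd.DataFrame({"key": ["a", "a"], "x": [1, 2]})
right = pd.DataFrame({"key": ["a", "a"], "y": [3, 4]})

# validate= makes .merge() raise a MergeError when the merge is not
# the relationship you expected; here a one-to-one check fails
try:
    left.merge(right, on="key", validate="one_to_one")
    merge_ok = True
except pd.errors.MergeError:
    merge_ok = False

print(merge_ok)  # False: the many-to-many merge was caught
```

Other accepted values are "one_to_many", "many_to_one", and "many_to_many", matching the relationships described in this section.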
One way you can check your work is by having your code fail when you know a bad condition exists. You can achieve this by using the Python assert statement. When an expression evaluates to True, assert will not return anything, and your code will continue on to the next expression.
# expect this to be true
# note there is no output
assert vs.shape[0] == 21
However, if the expression given to assert evaluates to False, it will throw an AssertionError, and your code will stop.
assert ps_vs.shape[0] <= vs.shape[0]
AssertionError:
Using assert is a good technique for building checks into your code without having to run it and visually inspect the result. This is also the basis for creating “unit tests” for functions.
Sometimes, you may need to combine various parts of data or multiple data sets, depending on the question you are trying to answer. Keep in mind, however, that the shape of data you need for analysis does not necessarily equate to the best shape of data for storage.
The survey data used in the last example came in four separate parts that needed to be merged together. After we merged the tables, a lot of redundant information appeared across the rows. From a data storage and data entry point of view, each of these duplications can lead to errors and data inconsistency. This is what Hadley meant by saying that in tidy data, “each type of observational unit forms a table.”