6

Data Assembly

By now, you should be able to load data into pandas and do some basic visualizations. This part of the book focuses on various data cleaning tasks. We begin with assembling a data set for analysis by combining various data sets together.

Learning Objectives

  • Identify when data needs to be combined

  • Identify whether data needs to be concatenated or joined together

  • Use the appropriate function or methods to combine multiple data sets

  • Produce a single data set from multiple files

  • Assess whether data was joined properly

6.1 Combine Data Sets

We first talked about tidy data principles in Chapter 4. This chapter will cover the third criterion in the original “Tidy Data” paper1: “each type of observational unit forms a table.”

1. Tidy Data paper: http://vita.had.co.nz/papers/tidy-data.pdf

When data is tidy, you need to combine various tables together to answer a question. For example, there may be a separate table holding company information and another table holding stock prices. If we want to look at all the stock prices within the tech industry, we may first have to find all the tech companies from the company information table, and then combine that data with the stock price data to get the data we need for our question. The data may have been split up into separate tables to reduce the amount of redundant information (we don’t need to store the company information with each stock price entry), but this arrangement means we as data analysts must combine the relevant data ourselves to answer our question.

Other times, a single data set may be split into multiple parts. For example, with timeseries data, each date may be in a separate file. In another case, a file may have been split into parts to make the individual files smaller. You may also need to combine data from multiple sources to answer a question (e.g., combine latitudes and longitudes with zip codes). In both cases, you will need to combine data into a single dataframe for analysis.

6.2 Concatenation

One of the (conceptually) easier ways to combine data is with concatenation. Concatenation can be thought of as appending a row or column to your data. This approach is possible if your data was split into parts or if you performed a calculation that you want to append to your existing data set.

Let’s begin with some example data sets so you can see what is actually happening.

import pandas as pd

df1 = pd.read_csv('data/concat_1.csv')
df2 = pd.read_csv('data/concat_2.csv')
df3 = pd.read_csv('data/concat_3.csv')
print(df1)
    A   B   C   D
0  a0  b0  c0  d0
1  a1  b1  c1  d1
2  a2  b2  c2  d2
3  a3  b3  c3  d3
print(df2)
    A   B   C   D
0  a4  b4  c4  d4
1  a5  b5  c5  d5
2  a6  b6  c6  d6
3  a7  b7  c7  d7
print(df3)
     A    B    C    D
0   a8   b8   c8   d8
1   a9   b9   c9   d9
2  a10  b10  c10  d10
3  a11  b11  c11  d11

Concatenation is accomplished by using the concat() function from Pandas.

6.2.1 Review Parts of a DataFrame

Section 2.3.1 talked about the three parts of a dataframe: .index, .columns, and .values. We will be working with .index and .columns a lot in this chapter.

The .index refers to the row labels on the left side of the dataframe; by default, they are numbered starting from 0.

print(df1.index)
RangeIndex(start=0, stop=4, step=1)

The “index” is an “axis” of a dataframe. These terms are important because pandas will try to automatically align by axis. The other axis is the “columns,” which we can get with .columns.

print(df1.columns)
Index(['A', 'B', 'C', 'D'], dtype='object')

This refers to the column names of the dataframe.

Finally, just to be complete, the body of the dataframe can be represented as a NumPy array with .values.

print(df1.values)
[['a0' 'b0' 'c0' 'd0']
 ['a1' 'b1' 'c1' 'd1']
 ['a2' 'b2' 'c2' 'd2']
 ['a3' 'b3' 'c3' 'd3']]

6.2.2 Add Rows

Stacking (i.e., concatenating) the dataframes on top of each other uses the concat() function in pandas. All of the dataframes to be concatenated are passed in a list.

row_concat = pd.concat([df1, df2, df3])
print(row_concat)
      A    B    C    D
0    a0   b0   c0   d0
1    a1   b1   c1   d1
2    a2   b2   c2   d2
3    a3   b3   c3   d3
0    a4   b4   c4   d4
..  ...   ... ...  ...
3    a7   b7   c7   d7
0    a8   b8   c8   d8
1    a9   b9   c9   d9
2   a10  b10  c10  d10
3   a11  b11  c11  d11

[12 rows x 4 columns]

As you can see, concat() blindly stacks the dataframes together. If you look at the row names (i.e., the row indices), they are also simply a stacked version of the original row indices. If we apply the various subsetting methods (Table 2.3), the table will be subsetted as expected.

# subset the fourth row of the concatenated dataframe
print(row_concat.iloc[3, :])
A    a3
B    b3
C    c3
D    d3
Name: 3, dtype: object
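
Because the row labels now repeat, label-based subsetting with .loc behaves differently from position-based subsetting with .iloc. A quick sketch of the difference:

# .loc matches by label, so it returns every row labeled 3
# (one from each of the three concatenated dataframes)
print(row_concat.loc[3, :])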

Section 2.1.1 showed the process for creating a Series. However, if we create a new series to append to a dataframe, it does not append correctly.

# create a new row of data
new_row_series = pd.Series(['n1', 'n2', 'n3', 'n4'])
print(new_row_series)
0    n1
1    n2
2    n3
3    n4
dtype: object
# attempt to add the new row to a dataframe
print(pd.concat([df1, new_row_series]))
      A     B     C     D     0
0    a0    b0    c0    d0   NaN
1    a1    b1    c1    d1   NaN
2    a2    b2    c2    d2   NaN
3    a3    b3    c3    d3   NaN
0   NaN   NaN   NaN   NaN    n1
1   NaN   NaN   NaN   NaN    n2
2   NaN   NaN   NaN   NaN    n3
3   NaN   NaN   NaN   NaN    n4

The first thing you may notice is the NaN values. NaN is how pandas represents a “missing value” (more about missing values in Chapter 9). We were hoping to append our new values as a row, but that didn’t happen. In fact, not only did our code fail to append the values as a row, but it also created a new column completely misaligned with everything else.

Let’s think about what is happening here. Our series did not have a matching column name, so new_row_series was added as a brand-new column (named 0, the default name of the series). Its values were concatenated to the bottom of the dataframe, and the original row index values were retained.

To fix this problem, we need to turn our series into a dataframe. This dataframe contains one row of data, and its column names match the columns the values should bind to.

new_row_df = pd.DataFrame(
  # note the double brackets to create a "row" of data
  data=[["n1", "n2", "n3", "n4"]],
  columns=["A", "B", "C", "D"],
)

print(new_row_df)
    A    B    C    D
0  n1   n2   n3   n4
# concatenate the row of data
print(pd.concat([df1, new_row_df]))
    A    B    C    D
0  a0   b0   c0   d0
1  a1   b1   c1   d1
2  a2   b2   c2   d2
3  a3   b3   c3   d3
0  n1   n2   n3   n4

concat() is a general function that can concatenate multiple things at once.
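
For instance, we can pass more than two objects in the list. A small sketch reusing the dataframes from above:

# concat() accepts any number of objects,
# so several dataframes can be appended in a single call
print(pd.concat([df1, new_row_df, new_row_df]))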

6.2.2.1 Ignore the Index

In the last example, notice that the row index of the concatenated dataframe repeated the label 0: each dataframe kept its original row labels.

If we simply want to concatenate or append data together, we can use the ignore_index parameter to reset the row index after the concatenation.

row_concat_i = pd.concat([df1, df2, df3], ignore_index=True)
print(row_concat_i)
       A     B     C     D
0     a0    b0    c0    d0
1     a1    b1    c1    d1
2     a2    b2    c2    d2
3     a3    b3    c3    d3
4     a4    b4    c4    d4
..   ...   ...   ...   ...
7     a7    b7    c7    d7
8     a8    b8    c8    d8
9     a9    b9    c9    d9
10   a10   b10   c10   d10
11   a11   b11   c11   d11

[12 rows x 4 columns]
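
As an aside, if you would rather keep track of which dataframe each row came from than discard the index, concat() also accepts a keys parameter; a brief sketch (the part labels here are arbitrary):

# label each part; this creates a hierarchical (MultiIndex) row index
# with the part label as the outer level
labeled = pd.concat([df1, df2, df3], keys=["part1", "part2", "part3"])

# select all the rows that came from df2
print(labeled.loc["part2"])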

6.2.3 Add Columns

Concatenating columns is very similar to concatenating rows. The main difference is the axis parameter in the concat function. The default value of axis is 0 (or "index"), so it will concatenate data in a row-wise fashion. However, if we pass axis=1 (or axis="columns" ) to the function, it will concatenate data in a column-wise manner.

col_concat = pd.concat([df1, df2, df3], axis="columns")
print(col_concat)
    A   B   C   D   A   B   C   D    A    B    C    D
0  a0  b0  c0  d0  a4  b4  c4  d4   a8   b8   c8   d8
1  a1  b1  c1  d1  a5  b5  c5  d5   a9   b9   c9   d9
2  a2  b2  c2  d2  a6  b6  c6  d6  a10  b10  c10  d10
3  a3  b3  c3  d3  a7  b7  c7  d7  a11  b11  c11  d11

If we try to subset data based on a column name, we will get a result similar to what happened when we concatenated row-wise and subset by row index: every column named A is returned.

print(col_concat['A'])
    A   A    A
0  a0  a4   a8
1  a1  a5   a9
2  a2  a6  a10
3  a3  a7  a11

Adding a single column to a dataframe can be done directly without using any specific Pandas function (we saw this in Section 2.4.1). Simply assign the new values to a new column name using bracket notation.

col_concat['new_col_list'] = ['n1', 'n2', 'n3', 'n4']
print(col_concat)
    A   B   C   D   A   B   C   D    A    B   C    D  new_col_list
0  a0  b0  c0  d0  a4  b4  c4  d4   a8   b8   c8   d8           n1
1  a1  b1  c1  d1  a5  b5  c5  d5   a9   b9   c9   d9           n2
2  a2  b2  c2  d2  a6  b6  c6  d6  a10  b10  c10  d10           n3
3  a3  b3  c3  d3  a7  b7  c7  d7  a11  b11  c11  d11           n4
col_concat['new_col_series'] = pd.Series(['n1', 'n2', 'n3', 'n4'])
print(col_concat)
    A   B   C   D   A   B   C   D    A    B    C    D  new_col_list 
0  a0  b0  c0  d0  a4  b4  c4  d4   a8   b8   c8   d8            n1
1  a1  b1  c1  d1  a5  b5  c5  d5   a9   b9   c9   d9            n2
2  a2  b2  c2  d2  a6  b6  c6  d6  a10  b10  c10  d10            n3
3  a3  b3  c3  d3  a7  b7  c7  d7  a11  b11  c11  d11            n4
  new_col_series
0             n1
1             n2
2             n3
3             n4

Using the concat() function still works, as long as you give it a dataframe. However, this approach requires more code.
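
For example, to add the same new column with concat(), we would first have to wrap the values in a dataframe. A sketch of this more verbose equivalent (the new_col_df name is made up for illustration):

# wrap the new values in a one-column dataframe
new_col_df = pd.DataFrame(data={"new_col_df": ["n1", "n2", "n3", "n4"]})

# concatenate it column-wise; the row indices align automatically
col_concat = pd.concat([col_concat, new_col_df], axis="columns")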

Finally, we can reset the column indices so we do not have duplicated column names.

print(pd.concat([df1, df2, df3], axis="columns", ignore_index=True))
    0   1   2   3   4   5   6   7    8    9   10   11
0  a0  b0  c0  d0  a4  b4  c4  d4   a8   b8   c8   d8
1  a1  b1  c1  d1  a5  b5  c5  d5   a9   b9   c9   d9
2  a2  b2  c2  d2  a6  b6  c6  d6  a10  b10  c10  d10
3  a3  b3  c3  d3  a7  b7  c7  d7  a11  b11  c11  d11

6.2.4 Concatenate with Different Indices

The examples shown so far have assumed we are performing a simple row or column concatenation. They also assumed that the new rows had the same column names, or that the new columns had the same row indices.

This section addresses what happens when the row and column indices are not aligned.

6.2.4.1 Concatenate Rows with Different Columns

Let’s modify our dataframes for the next few examples.

# rename the columns of our dataframes
df1.columns = ['A', 'B', 'C', 'D']
df2.columns = ['E', 'F', 'G', 'H']
df3.columns = ['A', 'C', 'F', 'H']

print(df1)
    A   B   C   D
0  a0  b0  c0  d0
1  a1  b1  c1  d1
2  a2  b2  c2  d2
3  a3  b3  c3  d3
print(df2)
    E   F   G   H
0  a4  b4  c4  d4
1  a5  b5  c5  d5
2  a6  b6  c6  d6
3  a7  b7  c7  d7
print(df3)
     A    C    F    H
0   a8   b8   c8   d8
1   a9   b9   c9   d9
2  a10  b10  c10  d10
3  a11  b11  c11  d11

If we try to concatenate these dataframes as we did in Section 6.2.2, the dataframes now do much more than simply stack one on top of the other. The columns align themselves, and NaN fills in any missing areas.

row_concat = pd.concat([df1, df2, df3])
print(row_concat)
     A     B     C    D      E     F     G     H
0   a0    b0    c0    d0   NaN   NaN   NaN   NaN
1   a1    b1    c1    d1   NaN   NaN   NaN   NaN
2   a2    b2    c2    d2   NaN   NaN   NaN   NaN
3   a3    b3    c3    d3   NaN   NaN   NaN   NaN
0  NaN   NaN   NaN   NaN    a4    b4    c4    d4
.. ...   ...   ...   ...   ...   ...   ...   ...
3  NaN   NaN   NaN   NaN    a7    b7    c7    d7
0   a8   NaN    b8   NaN   NaN    c8   NaN    d8
1   a9   NaN    b9   NaN   NaN    c9   NaN    d9
2  a10   NaN   b10   NaN   NaN   c10   NaN   d10
3  a11   NaN   b11   NaN   NaN   c11   NaN   d11

[12 rows x 8 columns]

One way to avoid the inclusion of NaN values is to keep only those columns that are shared in common by the list of objects to be concatenated. A parameter named join accomplishes this. By default, it has a value of 'outer', meaning it will keep all the columns. However, we can set join='inner' to keep only the columns that are shared among the data sets.

If we try to keep only the columns common to all three dataframes, we will get an empty dataframe, since there are no columns shared by all of them.

print(pd.concat([df1, df2, df3], join='inner'))

Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]

[12 rows x 0 columns]

If we use the dataframes that have columns in common, only the columns that all of them share will be returned.

print(pd.concat([df1, df3], ignore_index=False, join='inner'))
     A    C
0   a0   c0
1   a1   c1
2   a2   c2
3   a3   c3
0   a8   b8
1   a9   b9
2  a10  b10
3  a11  b11

6.2.4.2 Concatenate Columns with Different Rows

Let’s take our dataframes and modify them again so that they have different row indices. Here, we are building on the same dataframe modifications from Section 6.2.4.1.

df1.index = [0, 1, 2, 3]
df2.index = [4, 5, 6, 7]
df3.index = [0, 2, 5, 7]
print(df1)
    A   B   C   D
0  a0  b0  c0  d0
1  a1  b1  c1  d1
2  a2  b2  c2  d2
3  a3  b3  c3  d3
print(df2)
    E   F   G   H
4  a4  b4  c4  d4
5  a5  b5  c5  d5
6  a6  b6  c6  d6
7  a7  b7  c7  d7
print(df3)
     A    C    F    H
0   a8   b8   c8   d8
2   a9   b9   c9   d9
5  a10  b10  c10  d10
7  a11  b11  c11  d11

When we concatenate along axis="columns" (axis=1), the new dataframes are added in a column-wise fashion and matched against their respective row indices. Missing value indicators appear where the indices did not align.

col_concat = pd.concat([df1, df2, df3], axis="columns")
print(col_concat)
     A    B    C    D    E    F    G    H    A    C    F    H
0   a0   b0   c0   d0  NaN  NaN  NaN  NaN   a8   b8   c8   d8
1   a1   b1   c1   d1  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
2   a2   b2   c2   d2  NaN  NaN  NaN  NaN   a9   b9   c9   d9
3   a3   b3   c3   d3  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
4  NaN  NaN  NaN  NaN   a4   b4   c4   d4  NaN  NaN  NaN  NaN
5  NaN  NaN  NaN  NaN   a5   b5   c5   d5  a10  b10  c10  d10
6  NaN  NaN  NaN  NaN   a6   b6   c6   d6  NaN  NaN  NaN  NaN
7  NaN  NaN  NaN  NaN   a7   b7   c7   d7  a11  b11  c11  d11

Just as we did when we concatenated in a row-wise manner, we can choose to keep the results only when there are matching indices by using join="inner".

print(pd.concat([df1, df3], axis="columns", join='inner'))
    A   B   C   D   A   C   F   H
0  a0  b0  c0  d0  a8  b8  c8  d8
2  a2  b2  c2  d2  a9  b9  c9  d9

6.3 Observational Units Across Multiple Tables

One reason why data might be split across multiple files would be the size of the files. By splitting up data into various parts, each part would be smaller. This may be good when we need to share data on the Internet or via email, since many services limit the size of a file that can be opened or shared. Another reason why a data set might be split into multiple parts would be to account for the data collection process. For example, a separate data set containing stock information could be created for each day.

Since concatenation has already been covered (and merging is covered in Section 6.4), this section focuses on techniques for quickly loading multiple data sources and assembling them together.

In this example, all of the billboard ratings files follow a naming pattern.

data/billboard-by_week/billboard-XX.csv

Here XX represents the week (e.g., 03). We can use a pattern-matching function from the built-in pathlib module to get a list of all the filenames that match a particular pattern.

from pathlib import Path

# from the current directory, find (glob) files matching this pattern
billboard_data_files = (
    Path(".")
    .glob("data/billboard-by_week/billboard-*.csv")
)

# sort the filenames; sorted() also converts the generator into a list
billboard_data_files = sorted(billboard_data_files)

print(billboard_data_files)
[PosixPath('data/billboard-by_week/billboard-01.csv'),
PosixPath('data/billboard-by_week/billboard-02.csv'),
PosixPath('data/billboard-by_week/billboard-03.csv'),
PosixPath('data/billboard-by_week/billboard-04.csv'),
PosixPath('data/billboard-by_week/billboard-05.csv'),
 ...
PosixPath('data/billboard-by_week/billboard-72.csv'),
PosixPath('data/billboard-by_week/billboard-73.csv'),
PosixPath('data/billboard-by_week/billboard-74.csv'),
PosixPath('data/billboard-by_week/billboard-75.csv'),
PosixPath('data/billboard-by_week/billboard-76.csv')]

Note that .glob() returns a generator object, and a generator is exhausted once you iterate over (“use”) its contents. That is why we converted it into a list (via sorted()) before printing it above.
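
A minimal sketch of this exhaustion behavior:

gen = Path(".").glob("data/billboard-by_week/billboard-*.csv")
first_pass = list(gen)   # this consumes the generator
second_pass = list(gen)  # the generator is now exhausted
print(len(first_pass))   # the number of matching files
print(len(second_pass))  # 0 -- nothing left to yield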

Now that we have a list of filenames we want to load, we can load each file into a dataframe. We can choose to load each file individually, as we have been doing so far.

billboard01 = pd.read_csv(billboard_data_files[0])
billboard02 = pd.read_csv(billboard_data_files[1])
billboard03 = pd.read_csv(billboard_data_files[2])
# just look at one of the data sets we loaded
print(billboard01)
     year            artist                    track   time  
0    2000             2 Pac  Baby Don't Cry (Keep...   4:22
1    2000           2Ge+her  The Hardest Part Of ...   3:15
2    2000      3 Doors Down               Kryptonite   3:53
3    2000      3 Doors Down                    Loser   4:24
4    2000          504 Boyz            Wobble Wobble   3:35
..    ...               ...                       ...   ...
312  2000       Yankee Grey      Another Nine Minutes  3:10
313  2000  Yearwood, Trisha           Real Live Woman  3:55
314  2000   Ying Yang Twins   Whistle While You Tw...  4:19
315  2000     Zombie Nation             Kernkraft 400  3:30
316  2000   matchbox twenty                      Bent  4:12

   date.entered week  rating
0    2000-02-26  wk1    87.0
1    2000-09-02  wk1    91.0
2    2000-04-08  wk1    81.0
3    2000-10-21  wk1    76.0
4    2000-04-15  wk1    57.0
..          ...  ...     ...
312  2000-04-29  wk1    86.0
313  2000-04-01  wk1    85.0
314  2000-03-18  wk1    95.0
315  2000-09-02  wk1    99.0
316  2000-04-29  wk1    60.0

[317 rows x 7 columns]

We can concatenate them just as we did earlier in this chapter (Section 6.2.2).

# shape of each dataframe
print(billboard01.shape)
print(billboard02.shape)
print(billboard03.shape)
(317, 7)
(317, 7)
(317, 7)
# concatenate the dataframes together
billboard = pd.concat([billboard01, billboard02, billboard03])

# shape of the final concatenated billboard data
print(billboard.shape)
(951, 7)

Let’s write a check to make sure the rows were concatenated correctly.

assert (
    billboard01.shape[0]
    + billboard02.shape[0]
    + billboard03.shape[0]
    == billboard.shape[0]
)

However, manually loading each dataframe into its own variable gets tedious when the data is split into many parts. As an alternative, we can automate the process using loops and list comprehensions.

6.3.1 Load Multiple Files Using a Loop

An easier way to load multiple files is to first create an empty list, use a loop to iterate through each of the CSV files, load each CSV file into a Pandas dataframe, and finally append the dataframe to the list. We want to end up with a list of dataframes because the concat() function takes a list of dataframes to concatenate.

# this part was the same as earlier
from pathlib import Path
billboard_data_files = (
    Path(".")
    .glob("data/billboard-by_week/billboard-*.csv")
)

# create an empty list to append to
list_billboard_df = []

# loop through each CSV filename
for csv_filename in billboard_data_files:
    # you can choose to print the filename for debugging
    # print(csv_filename)

    # load the CSV file into a dataframe
    df = pd.read_csv(csv_filename)

    # append the dataframe to the list that will hold the dataframes
    list_billboard_df.append(df)

# number of dataframes in the list
print(len(list_billboard_df))
76
# type of the first element
print(type(list_billboard_df[0]))
<class 'pandas.core.frame.DataFrame'>
# look at the first dataframe
print(list_billboard_df[0])
     year            artist                    track  time  
0    2000             2 Pac  Baby Don't Cry (Keep...  4:22
1    2000           2Ge+her  The Hardest Part Of ...  3:15
2    2000      3 Doors Down               Kryptonite  3:53
3    2000      3 Doors Down                    Loser  4:24
4    2000          504 Boyz            Wobble Wobble  3:35
..    ...               ...                      ...   ...
312  2000       Yankee Grey     Another Nine Minutes  3:10
313  2000  Yearwood, Trisha          Real Live Woman  3:55
314  2000   Ying Yang Twins  Whistle While You Tw...  4:19
315  2000     Zombie Nation            Kernkraft 400  3:30
316  2000   matchbox twenty                     Bent  4:12

   date.entered  week  rating
0    2000-02-26  wk15     NaN
1    2000-09-02  wk15     NaN
2    2000-04-08  wk15    38.0
3    2000-10-21  wk15    72.0
4    2000-04-15  wk15    78.0
..          ...   ...     ...
312  2000-04-29  wk15     NaN
313  2000-04-01  wk15     NaN
314  2000-03-18  wk15     NaN
315  2000-09-02  wk15     NaN
316  2000-04-29  wk15     3.0

[317 rows x 7 columns]

Now that we have a list of dataframes, we can concatenate them.

billboard_loop_concat = pd.concat(list_billboard_df)
print(billboard_loop_concat.shape)
(24092, 7)
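
We can also extend the earlier assert check to an arbitrary number of parts; a short sketch:

# the concatenated row count should equal the sum of the parts
assert billboard_loop_concat.shape[0] == sum(
    df.shape[0] for df in list_billboard_df
)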

6.3.2 Load Multiple Files Using a List Comprehension

Python has an idiom for looping through something and collecting the results in a list, called a list comprehension. The loop given previously, shown here again without the comments, can be rewritten as a list comprehension (Appendix K).

# we have to re-create the generator because we
# "used it up" in the previous example
billboard_data_files = (
    Path(".")
    .glob("data/billboard-by_week/billboard-*.csv")
)
# the loop code without comments
list_billboard_df = []
for csv_filename in billboard_data_files:
    df = pd.read_csv(csv_filename)
    list_billboard_df.append(df)

# re-create the generator once more for the list comprehension
billboard_data_files = (
    Path(".")
    .glob("data/billboard-by_week/billboard-*.csv")
)

# same code in a list comprehension
billboard_dfs = [pd.read_csv(data) for data in billboard_data_files]

The result of our list comprehension is a list, just as in the earlier loop example.

print(type(billboard_dfs))
<class 'list'>
print(len(billboard_dfs))
76

Finally, we can concatenate the results just as we did earlier.

billboard_concat_comp = pd.concat(billboard_dfs)
print(billboard_concat_comp)
     year            artist                    track  time  
0    2000             2 Pac  Baby Don't Cry (Keep...  4:22
1    2000           2Ge+her  The Hardest Part Of ...  3:15
2    2000      3 Doors Down               Kryptonite  3:53
3    2000      3 Doors Down                    Loser  4:24
4    2000          504 Boyz            Wobble Wobble  3:35
..    ...               ...                      ...   ...
312  2000       Yankee Grey     Another Nine Minutes  3:10
313  2000  Yearwood, Trisha          Real Live Woman  3:55
314  2000   Ying Yang Twins  Whistle While You Tw...  4:19
315  2000     Zombie Nation            Kernkraft 400  3:30
316  2000   matchbox twenty                     Bent  4:12

   date.entered  week  rating
0    2000-02-26  wk15     NaN
1    2000-09-02  wk15     NaN
2    2000-04-08  wk15    38.0
3    2000-10-21  wk15    72.0
4    2000-04-15  wk15    78.0
..          ...   ...     ...
312  2000-04-29  wk18    NaN
313  2000-04-01  wk18    NaN
314  2000-03-18  wk18    NaN
315  2000-09-02  wk18    NaN
316  2000-04-29  wk18    3.0

[24092 rows x 7 columns]
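
As a side note, concat() accepts any iterable of dataframes, so the glob, read, and concatenate steps can be collapsed into a single expression. A compact sketch under the same file-pattern assumption:

# glob, read, and concatenate in one step
billboard_one_step = pd.concat(
    pd.read_csv(csv_filename)
    for csv_filename in Path(".").glob("data/billboard-by_week/billboard-*.csv")
)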

6.4 Merge Multiple Data Sets

The previous section alluded to a few database concepts. The join="inner" and the default join="outer" parameters come from working with databases when we want to merge tables.

Instead of simply having a row or column index that you want to use to concatenate values, sometimes you may have two or more dataframes that you want to combine based on common data values. This task is known in the database world as performing a “join.”

Pandas has a .join() method that uses .merge() under the hood. .join() will merge dataframe objects based on an index, but the .merge() function is much more explicit and flexible.

If you are planning to merge dataframes by the row index, for example, you might want to look into the .join() method.2

2. Pandas DataFrame.join() method: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html
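
A minimal sketch of the difference, using two throwaway dataframes (the names here are made up):

left = pd.DataFrame({"x": [1, 2]}, index=["a", "b"])
right = pd.DataFrame({"y": [3, 4]}, index=["a", "b"])

# .join() aligns on the row index by default
print(left.join(right))

# the equivalent, more explicit .merge() call
print(left.merge(right, left_index=True, right_index=True))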

We will be using the set of survey data in the following examples.

person = pd.read_csv('data/survey_person.csv')
site = pd.read_csv('data/survey_site.csv')
survey = pd.read_csv('data/survey_survey.csv')
visited = pd.read_csv('data/survey_visited.csv')
print(person)
      ident   personal    family
0      dyer    William      Dyer
1        pb      Frank   Pabodie
2      lake   Anderson      Lake
3       roe  Valentina   Roerich
4  danforth      Frank  Danforth
print(site)
    name    lat    long
0   DR-1 -49.85 -128.57
1   DR-3 -47.15 -126.72
2  MSK-4 -48.87 -123.40
print(visited)
  ident  site       dated
0   619  DR-1  1927-02-08
1   622  DR-1  1927-02-10
2   734  DR-3  1939-01-07
3   735  DR-3  1930-01-12
4   751  DR-3  1930-02-26
5   752  DR-3         NaN
6   837  MSK-4  1932-01-14
7   844  DR-1  1932-03-22
print(survey)
   taken  person  quant  reading
0    619   dyer    rad    9.82
1    619   dyer    sal    0.13
2    622   dyer    rad    7.80
3    622   dyer    sal    0.09
4    734     pb    rad    8.41
..   ...    ...    ...     ...
16   752    roe    sal   41.60
17   837   lake    rad    1.46
18   837   lake    sal    0.21
19   837    roe    sal   22.50
20   844    roe    rad   11.25

[21 rows x 4 columns]

Currently, our data is split into multiple parts, where each part is an observational unit. If we wanted to look at the dates at each site along with the latitude and longitude information for that site, we would have to combine (and merge) multiple dataframes. We can do this with the .merge() method in Pandas.

When we call this method, the dataframe on which it is called is referred to as the one on the “left.” Within the .merge() method, the first parameter is the “right” dataframe (i.e., left.merge(right)). The next parameter, how, specifies how the final merged result looks; Table 6.1 provides more details.

Finally, we set the on parameter, which specifies the columns to match on. If the left and right columns do not have the same name, we can use the left_on and right_on parameters instead.

Table 6.1 How the Pandas how Parameter Relates to SQL

Pandas   SQL          Description
left     left outer   Keep all the keys from the left
right    right outer  Keep all the keys from the right
outer    full outer   Keep all the keys from both left and right
inner    inner        Keep only the keys that exist in both left and right
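
For example, a left merge keeps every row of the calling dataframe even when there is no match on the right. A brief sketch using the survey data loaded above (here every site value happens to have a match, so no NaN values appear):

# keep every row of visited, filling in site information where it matches
left_all = visited.merge(site, how="left", left_on="site", right_on="name")
print(left_all.shape)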

6.4.1 One-to-One Merge

In the simplest type of merge, we have two dataframes where we want to join one column to another column, and where the columns we want to join do not contain any duplicate values.

For this example, we will modify the visited dataframe so there are no duplicated site values.

visited_subset = visited.loc[[0, 2, 6], :]
print(visited_subset)
  ident   site       dated
0   619   DR-1  1927-02-08
2   734   DR-3  1939-01-07
6   837  MSK-4  1932-01-14
# get a count of the values in the site column
print(
  visited_subset["site"].value_counts()
)
DR-1     1
DR-3     1
MSK-4    1
Name: site, dtype: int64

We can perform our one-to-one merge as follows:

# the default value for 'how' is 'inner'
# so it doesn't need to be specified
o2o_merge = site.merge(
    visited_subset, left_on="name", right_on="site"
)
print(o2o_merge)
    name    lat    long  ident   site       dated
0   DR-1 -49.85 -128.57    619   DR-1  1927-02-08
1   DR-3 -47.15 -126.72    734   DR-3  1939-01-07
2  MSK-4 -48.87 -123.40    837  MSK-4  1932-01-14

As you can see, we have now created a new dataframe from two separate dataframes where the rows were matched based on a particular set of columns. In SQL-speak, the columns used to match are called “keys.”
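
As a related tip, .merge() can verify the type of merge for you via its validate parameter; if the keys are not actually unique on both sides, pandas raises a MergeError. A hedged sketch:

# raise an error if either key column contains duplicate values
checked = site.merge(
    visited_subset,
    left_on="name",
    right_on="site",
    validate="one_to_one",
)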

6.4.2 Many-to-One Merge

If we choose to do the same merge, but this time without using the subsetted visited dataframe, we would perform a many-to-one merge. In this kind of merge, one of the dataframes has key values that repeat.

# get a count of the values in the site column
print(
  visited["site"].value_counts()
)
DR-3     4
DR-1     3
MSK-4    1
Name: site, dtype: int64

The rows from the dataframe that contains the single observations will be duplicated in the merge.

m2o_merge = site.merge(visited, left_on='name', right_on='site')
print(m2o_merge)
    name    lat    long  ident   site       dated
0   DR-1 -49.85 -128.57    619   DR-1  1927-02-08
1   DR-1 -49.85 -128.57    622   DR-1  1927-02-10
2   DR-1 -49.85 -128.57    844   DR-1  1932-03-22
3   DR-3 -47.15 -126.72    734   DR-3  1939-01-07
4   DR-3 -47.15 -126.72    735   DR-3  1930-01-12
5   DR-3 -47.15 -126.72    751   DR-3  1930-02-26
6   DR-3 -47.15 -126.72    752   DR-3         NaN
7  MSK-4 -48.87 -123.40    837  MSK-4  1932-01-14

The site information (name, lat, and long) was duplicated and matched to the visited data.

6.4.3 Many-to-Many Merge

Lastly, there will be times when we want to perform a match based on multiple columns. As an example, suppose we have one dataframe that comes from merging person with survey, and another that comes from merging visited with survey.

ps = person.merge(survey, left_on='ident', right_on='person')
vs = visited.merge(survey, left_on='ident', right_on='taken')
print(ps)
   ident   personal   family taken person quant reading
0   dyer    William     Dyer   619   dyer   rad    9.82
1   dyer    William     Dyer   619   dyer   sal    0.13
2   dyer    William     Dyer   622   dyer   rad    7.80
3   dyer    William     Dyer   622   dyer   sal    0.09
4     pb      Frank  Pabodie   734     pb   rad    8.41
..   ...        ...      ...   ...    ...   ...     ...
14  lake   Anderson     Lake   837   lake   rad    1.46
15  lake   Anderson     Lake   837   lake   sal    0.21
16   roe  Valentina  Roerich   752    roe   sal   41.60
17   roe  Valentina  Roerich   837    roe   sal   22.50
18   roe  Valentina  Roerich   844    roe   rad   11.25

[19 rows x 7 columns]
print(vs)
  ident   site       dated  taken person quant reading
0   619   DR-1  1927-02-08    619   dyer   rad    9.82
1   619   DR-1  1927-02-08    619   dyer   sal    0.13
2   622   DR-1  1927-02-10    622   dyer   rad    7.80
3   622   DR-1  1927-02-10    622   dyer   sal    0.09
4   734   DR-3  1939-01-07    734     pb   rad    8.41
..  ...    ...         ...    ...    ...   ...    ...
16  752   DR-3         NaN    752    roe   sal   41.60
17  837  MSK-4  1932-01-14    837   lake   rad    1.46
18  837  MSK-4  1932-01-14    837   lake   sal    0.21
19  837  MSK-4  1932-01-14    837    roe   sal   22.50
20  844   DR-1  1932-03-22    844    roe   rad   11.25

[21 rows x 7 columns]

We know a many-to-many merge is happening because there are duplicate values in the keys of both the left and right dataframes.

print(
  ps["quant"].value_counts()
)
rad     8
sal     8
temp    3
Name: quant, dtype: int64
print(
  vs["quant"].value_counts()
)
sal     9
rad     8
temp    4
Name: quant, dtype: int64

We can perform a many-to-many merge by passing a Python list of the column(s) to match on; here, both dataframes are matched on the quant column, which contains duplicate values on both sides.

ps_vs = ps.merge(
    vs,
    left_on=["quant"],
    right_on=["quant"],
)

Let’s look at just the first row of data.

print(ps_vs.loc[0, :])
ident_x            dyer
personal        William
family             Dyer
taken_x             619
person_x           dyer
                ...
site               DR-1
dated        1927-02-08
taken_y             619
person_y           dyer
reading_y          9.82
Name: 0, Length: 13, dtype: object

Pandas will automatically add a suffix to a column name if there are collisions in the name. In the output, the _x suffix refers to values from the left dataframe, and the _y suffix refers to values from the right dataframe.
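
If the default _x and _y suffixes are not descriptive enough, we can supply our own via the suffixes parameter; a small sketch (the suffix names are arbitrary):

# use more descriptive suffixes for the overlapping column names
ps_vs_named = ps.merge(vs, on="quant", suffixes=("_person", "_visited"))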

6.4.4 Check Your Work with Assert

A simple way to check your work is to look at the number of rows of your data before and after a merge. If you end up with more rows than either of the dataframes you are merging together, that means a many-to-many merge occurred, and that is usually a situation you do not want.

print(ps.shape) # left dataframe
(19, 7)
print(vs.shape) # right dataframe
(21, 7)
print(ps_vs.shape) # after merge
(148, 13)

One way you can check your work is by having your code fail when you know a bad condition exists. You can achieve this by using the Python assert statement. When the given expression evaluates to True, assert returns nothing, and your code continues on to the next statement.

# expect this to be true
# note there is no output
assert vs.shape[0] == 21

However, if the expression to assert evaluates to False, it will throw an AssertionError, and your code will stop.

assert ps_vs.shape[0] <= vs.shape[0]
AssertionError:

Using assert is a good technique for building checks into your code without having to run it and visually inspect the results. It is also the basis for writing “unit tests” for functions.

Conclusion

Sometimes, you may need to combine various parts of data or multiple data sets, depending on the question you are trying to answer. Keep in mind, however, that the shape of data you need for analysis does not necessarily equate to the best shape of data for storage.

The survey data used in the last example came in four separate parts that needed to be merged together. After we merged the tables, a lot of redundant information appeared across the rows. From a data storage and data entry point of view, each of these duplications can lead to errors and data inconsistency. This is what Hadley Wickham meant by saying that in tidy data, “each type of observational unit forms a table.”
