By now, you should be able to load data into pandas
and do some basic visualizations. This part of the book focuses on various data cleaning tasks. We begin with assembling a data set for analysis by combining various data sets together.
Identify when data needs to be combined
Identify whether data needs to be concatenated or joined together
Use the appropriate function or methods to combine multiple data sets
Produce a single data set from multiple files
Assess whether data was joined properly
We first talked about tidy data principles in Chapter 4. This chapter will cover the third criterion in the original “Tidy Data” paper1: “each type of observational unit forms a table.”
1. Tidy Data paper: http://vita.had.co.nz/papers/tidy-data.pdf
When data is tidy, you need to combine various tables together to answer a question. For example, there may be a separate table holding company information and another table holding stock prices. If we want to look at all the stock prices within the tech industry, we may first have to find all the tech companies from the company information table, and then combine that data with the stock price data to get the data we need for our question. The data may have been split up into separate tables to reduce the amount of redundant information (we don’t need to store the company information with each stock price entry), but this arrangement means we as data analysts must combine the relevant data ourselves to answer our question.
Other times, a single data set may be split into multiple parts. For example, with timeseries data, each date may be in a separate file. In another case, a file may have been split into parts to make the individual files smaller. You may also need to combine data from multiple sources to answer a question (e.g., combine latitudes and longitudes with zip codes). In both cases, you will need to combine data into a single dataframe for analysis.
One of the (conceptually) easier ways to combine data is with concatenation. Concatenation can be thought of as appending a row or column to your data. This approach is possible if your data was split into parts or if you performed a calculation that you want to append to your existing data set.
Let’s begin with some example data sets so you can see what is actually happening.
import pandas as pd
df1 = pd.read_csv('data/concat_1.csv')
df2 = pd.read_csv('data/concat_2.csv')
df3 = pd.read_csv('data/concat_3.csv')
print(df1)
A B C D
0 a0 b0 c0 d0
1 a1 b1 c1 d1
2 a2 b2 c2 d2
3 a3 b3 c3 d3
print(df2)
A B C D
0 a4 b4 c4 d4
1 a5 b5 c5 d5
2 a6 b6 c6 d6
3 a7 b7 c7 d7
print(df3)
A B C D
0 a8 b8 c8 d8
1 a9 b9 c9 d9
2 a10 b10 c10 d10
3 a11 b11 c11 d11
Concatenation is accomplished by using the concat() function from Pandas.
Section 2.3.1 talked about the three parts of a dataframe: .index, .columns, and .values. We will be working with .index and .columns a lot in this chapter.
The .index refers to the labels on the left of the dataframe; by default they will be numbered starting from 0.
print(df1.index)
RangeIndex(start=0, stop=4, step=1)
The “index” is an “axis” of a dataframe. These terms are important because pandas will try to automatically align by axis. The other axis is the “columns,” which we can get with .columns.
print(df1.columns)
Index(['A', 'B', 'C', 'D'], dtype='object')
This refers to the column names of the dataframe.
Finally, just to be complete, the body of the dataframe can be represented as a numpy array with .values.
print(df1.values)
[['a0' 'b0' 'c0' 'd0']
['a1' 'b1' 'c1' 'd1']
['a2' 'b2' 'c2' 'd2']
['a3' 'b3' 'c3' 'd3']]
Stacking (i.e., concatenating) the dataframes on top of each other uses the concat() function in pandas. All of the dataframes to be concatenated are passed in a list.
row_concat = pd.concat([df1, df2, df3])
print(row_concat)
A B C D
0 a0 b0 c0 d0
1 a1 b1 c1 d1
2 a2 b2 c2 d2
3 a3 b3 c3 d3
0 a4 b4 c4 d4
.. ... ... ... ...
3 a7 b7 c7 d7
0 a8 b8 c8 d8
1 a9 b9 c9 d9
2 a10 b10 c10 d10
3 a11 b11 c11 d11
[12 rows x 4 columns]
As you can see, concat() blindly stacks the dataframes together. If you look at the row names (i.e., the row indices), they are also simply a stacked version of the original row indices. If we apply the various subsetting methods (Table 2.3), the table will be subsetted as expected.
# subset the fourth row of the concatenated dataframe
print(row_concat.iloc[3, :])
A a3
B b3
C c3
D d3
Name: 3, dtype: object
Section 2.1.1 showed the process for creating a Series. However, if we create a new series to append to a dataframe, it does not append correctly.
# create a new row of data
new_row_series = pd.Series(['n1', 'n2', 'n3', 'n4'])
print(new_row_series)
0 n1
1 n2
2 n3
3 n4
dtype: object
# attempt to add the new row to a dataframe
print(pd.concat([df1, new_row_series]))
A B C D 0
0 a0 b0 c0 d0 NaN
1 a1 b1 c1 d1 NaN
2 a2 b2 c2 d2 NaN
3 a3 b3 c3 d3 NaN
0 NaN NaN NaN NaN n1
1 NaN NaN NaN NaN n2
2 NaN NaN NaN NaN n3
3 NaN NaN NaN NaN n4
The first thing you may notice is the NaN missing values. This is how pandas represents a “missing value” (more about missing values in Chapter 9). We were hoping to append our new values as a row, but that didn’t happen. In fact, not only did our code not append the values as a row, but it also created a new column completely misaligned with everything else.
Let’s think about what is happening here. First, our series did not have a matching column name, so our new_row_series was added as a new column. The rest of the values were concatenated to the bottom of the dataframe, and the original index values were retained.
To fix this problem, we need to turn our series into a dataframe. This dataframe contains one row of data, and its column names are the ones the data will bind to.
new_row_df = pd.DataFrame(
# note the double brackets to create a "row" of data
data=[["n1", "n2", "n3", "n4"]],
columns=["A", "B", "C", "D"],
)
print(new_row_df)
A B C D
0 n1 n2 n3 n4
# concatenate the row of data
print(pd.concat([df1, new_row_df]))
A B C D
0 a0 b0 c0 d0
1 a1 b1 c1 d1
2 a2 b2 c2 d2
3 a3 b3 c3 d3
0 n1 n2 n3 n4
concat() is a general function that can concatenate multiple things at once.
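For instance, a single concat() call can stack any number of objects. Here is a minimal sketch with small toy dataframes (illustrative data, not the book's CSV files):

```python
import pandas as pd

# small toy dataframes (illustrative; not the book's CSV files)
df1 = pd.DataFrame({"A": ["a0", "a1"], "B": ["b0", "b1"]})
df2 = pd.DataFrame({"A": ["a2", "a3"], "B": ["b2", "b3"]})
new_row = pd.DataFrame([["n1", "n2"]], columns=["A", "B"])

# one concat() call can stack any number of objects in the list
combined = pd.concat([df1, df2, new_row], ignore_index=True)
print(combined)
```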
Notice in the last example that the appended row kept its original index value of 0, so the concatenated result repeats an index label.
If we simply want to concatenate or append data together, we can use the ignore_index parameter to reset the row index after the concatenation.
row_concat_i = pd.concat([df1, df2, df3], ignore_index=True)
print(row_concat_i)
A B C D
0 a0 b0 c0 d0
1 a1 b1 c1 d1
2 a2 b2 c2 d2
3 a3 b3 c3 d3
4 a4 b4 c4 d4
.. ... ... ... ...
7 a7 b7 c7 d7
8 a8 b8 c8 d8
9 a9 b9 c9 d9
10 a10 b10 c10 d10
11 a11 b11 c11 d11
[12 rows x 4 columns]
Concatenating columns is very similar to concatenating rows. The main difference is the axis parameter in the concat function. The default value of axis is 0 (or "index"), so it will concatenate data in a row-wise fashion. However, if we pass axis=1 (or axis="columns") to the function, it will concatenate data in a column-wise manner.
col_concat = pd.concat([df1, df2, df3], axis="columns")
print(col_concat)
A B C D A B C D A B C D
0 a0 b0 c0 d0 a4 b4 c4 d4 a8 b8 c8 d8
1 a1 b1 c1 d1 a5 b5 c5 d5 a9 b9 c9 d9
2 a2 b2 c2 d2 a6 b6 c6 d6 a10 b10 c10 d10
3 a3 b3 c3 d3 a7 b7 c7 d7 a11 b11 c11 d11
If we subset the data based on column names, we get a result similar to what we saw when we concatenated row-wise and subset by row index.
print(col_concat['A'])
A A A
0 a0 a4 a8
1 a1 a5 a9
2 a2 a6 a10
3 a3 a7 a11
Adding a single column to a dataframe can be done directly without using any specific Pandas function (we saw this in Section 2.4.1). Simply assign the vector of values to a new column name.
col_concat['new_col_list'] = ['n1', 'n2', 'n3', 'n4']
print(col_concat)
A B C D A B C D A B C D new_col_list
0 a0 b0 c0 d0 a4 b4 c4 d4 a8 b8 c8 d8 n1
1 a1 b1 c1 d1 a5 b5 c5 d5 a9 b9 c9 d9 n2
2 a2 b2 c2 d2 a6 b6 c6 d6 a10 b10 c10 d10 n3
3 a3 b3 c3 d3 a7 b7 c7 d7 a11 b11 c11 d11 n4
col_concat['new_col_series'] = pd.Series(['n1', 'n2', 'n3', 'n4'])
print(col_concat)
A B C D A B C D A B C D new_col_list
0 a0 b0 c0 d0 a4 b4 c4 d4 a8 b8 c8 d8 n1
1 a1 b1 c1 d1 a5 b5 c5 d5 a9 b9 c9 d9 n2
2 a2 b2 c2 d2 a6 b6 c6 d6 a10 b10 c10 d10 n3
3 a3 b3 c3 d3 a7 b7 c7 d7 a11 b11 c11 d11 n4
new_col_series
0 n1
1 n2
2 n3
3 n4
Using the concat() function still works, as long as you give it a dataframe. However, this approach requires more code.
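A sketch of what that longer concat() version might look like, using a small toy dataframe (the names here are illustrative, not from the book's data):

```python
import pandas as pd

# toy dataframe (illustrative names, not the book's data)
df = pd.DataFrame({"A": ["a0", "a1"], "B": ["b0", "b1"]})

# to use concat(), the new column must first be wrapped in a dataframe
new_col = pd.DataFrame({"new_col_df": ["n1", "n2"]})

# axis="columns" attaches it column-wise, aligning on the row index
df = pd.concat([df, new_col], axis="columns")
print(df)
```

Compared with a direct assignment, this takes an extra step (building the one-column dataframe), which is why direct assignment is usually preferred for a single column.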
Finally, we can reset the column indices so we do not have duplicated column names.
print(pd.concat([df1, df2, df3], axis="columns", ignore_index=True))
0 1 2 3 4 5 6 7 8 9 10 11
0 a0 b0 c0 d0 a4 b4 c4 d4 a8 b8 c8 d8
1 a1 b1 c1 d1 a5 b5 c5 d5 a9 b9 c9 d9
2 a2 b2 c2 d2 a6 b6 c6 d6 a10 b10 c10 d10
3 a3 b3 c3 d3 a7 b7 c7 d7 a11 b11 c11 d11
The examples shown so far have assumed we are performing a row or column concatenation. They also assume that the new row(s) had the same column names or the column(s) had the same row indices.
This section addresses what happens when the row and column indices are not aligned.
Let’s modify our dataframes for the next few examples.
# rename the columns of our dataframes
df1.columns = ['A', 'B', 'C', 'D']
df2.columns = ['E', 'F', 'G', 'H']
df3.columns = ['A', 'C', 'F', 'H']
print(df1)
A B C D
0 a0 b0 c0 d0
1 a1 b1 c1 d1
2 a2 b2 c2 d2
3 a3 b3 c3 d3
print(df2)
E F G H
0 a4 b4 c4 d4
1 a5 b5 c5 d5
2 a6 b6 c6 d6
3 a7 b7 c7 d7
print(df3)
A C F H
0 a8 b8 c8 d8
1 a9 b9 c9 d9
2 a10 b10 c10 d10
3 a11 b11 c11 d11
If we try to concatenate these dataframes as we did in Section 6.2.2, the dataframes now do much more than simply stack one on top of the other. The columns align themselves, and NaN fills in any missing areas.
row_concat = pd.concat([df1, df2, df3])
print(row_concat)
A B C D E F G H
0 a0 b0 c0 d0 NaN NaN NaN NaN
1 a1 b1 c1 d1 NaN NaN NaN NaN
2 a2 b2 c2 d2 NaN NaN NaN NaN
3 a3 b3 c3 d3 NaN NaN NaN NaN
0 NaN NaN NaN NaN a4 b4 c4 d4
.. ... ... ... ... ... ... ... ...
3 NaN NaN NaN NaN a7 b7 c7 d7
0 a8 NaN b8 NaN NaN c8 NaN d8
1 a9 NaN b9 NaN NaN c9 NaN d9
2 a10 NaN b10 NaN NaN c10 NaN d10
3 a11 NaN b11 NaN NaN c11 NaN d11
[12 rows x 8 columns]
One way to avoid the inclusion of NaN values is to keep only those columns that are shared in common by the objects to be concatenated. A parameter named join accomplishes this. By default, it has a value of 'outer', meaning it will keep all the columns. However, we can set join='inner' to keep only the columns that are shared among the data sets.
If we try to keep only the columns from all three dataframes, we will get an empty dataframe, since there are no columns in common.
print(pd.concat([df1, df2, df3], join='inner'))
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]
[12 rows x 0 columns]
If we use the dataframes that have columns in common, only the columns that all of them share will be returned.
print(pd.concat([df1,df3], ignore_index=False, join='inner'))
A C
0 a0 c0
1 a1 c1
2 a2 c2
3 a3 c3
0 a8 b8
1 a9 b9
2 a10 b10
3 a11 b11
Let’s take our dataframes and modify them again so that they have different row indices. Here, we are building on the same dataframe modifications from Section 6.2.4.1.
df1.index = [0, 1, 2, 3]
df2.index = [4, 5, 6, 7]
df3.index = [0, 2, 5, 7]
print(df1)
A B C D
0 a0 b0 c0 d0
1 a1 b1 c1 d1
2 a2 b2 c2 d2
3 a3 b3 c3 d3
print(df2)
E F G H
4 a4 b4 c4 d4
5 a5 b5 c5 d5
6 a6 b6 c6 d6
7 a7 b7 c7 d7
print(df3)
A C F H
0 a8 b8 c8 d8
2 a9 b9 c9 d9
5 a10 b10 c10 d10
7 a11 b11 c11 d11
When we concatenate along axis="columns" (axis=1), the new dataframes will be added in a column-wise fashion and matched against their respective row indices. Missing value indicators appear in the areas where the indices did not align.
col_concat = pd.concat([df1, df2, df3], axis="columns")
print(col_concat)
A B C D E F G H A C F H
0 a0 b0 c0 d0 NaN NaN NaN NaN a8 b8 c8 d8
1 a1 b1 c1 d1 NaN NaN NaN NaN NaN NaN NaN NaN
2 a2 b2 c2 d2 NaN NaN NaN NaN a9 b9 c9 d9
3 a3 b3 c3 d3 NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN a4 b4 c4 d4 NaN NaN NaN NaN
5 NaN NaN NaN NaN a5 b5 c5 d5 a10 b10 c10 d10
6 NaN NaN NaN NaN a6 b6 c6 d6 NaN NaN NaN NaN
7 NaN NaN NaN NaN a7 b7 c7 d7 a11 b11 c11 d11
Just as we did when we concatenated in a row-wise manner, we can choose to keep the results only when there are matching indices by using join="inner".
print(pd.concat([df1, df3], axis="columns", join='inner'))
A B C D A C F H
0 a0 b0 c0 d0 a8 b8 c8 d8
2 a2 b2 c2 d2 a9 b9 c9 d9
One reason why data might be split across multiple files would be the size of the files. By splitting up data into various parts, each part would be smaller. This may be good when we need to share data on the Internet or via email, since many services limit the size of a file that can be opened or shared. Another reason why a data set might be split into multiple parts would be to account for the data collection process. For example, a separate data set containing stock information could be created for each day.
Since merging and concatenation have already been covered, this section will focus on techniques for quickly loading multiple data sources and assembling them together.
In this example, all of the billboard ratings files follow a pattern:
data/billboard-by_week/billboard-XX.csv
where XX represents the week (e.g., 03). We can use a pattern-matching function (glob) from the built-in pathlib module in Python to get a list of all the filenames that match a particular pattern.
from pathlib import Path
# from the current directory, find (glob) files matching this pattern
billboard_data_files = (
Path(".")
.glob("data/billboard-by_week/billboard-*.csv")
)
# this line is optional if you want to see the full list of files
billboard_data_files = sorted(list(billboard_data_files))
print(billboard_data_files)
[PosixPath('data/billboard-by_week/billboard-01.csv'),
PosixPath('data/billboard-by_week/billboard-02.csv'),
PosixPath('data/billboard-by_week/billboard-03.csv'),
PosixPath('data/billboard-by_week/billboard-04.csv'),
PosixPath('data/billboard-by_week/billboard-05.csv'),
.. ... ... ... ...
PosixPath('data/billboard-by_week/billboard-72.csv'),
PosixPath('data/billboard-by_week/billboard-73.csv'),
PosixPath('data/billboard-by_week/billboard-74.csv'),
PosixPath('data/billboard-by_week/billboard-75.csv'),
PosixPath('data/billboard-by_week/billboard-76.csv')]
The type() of billboard_data_files is a generator object, so if you “use it,” you will lose its contents. If you want to see the full list, you would need to run:
billboard_data_files = list(billboard_data_files)
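A minimal sketch of why that matters: a generator can only be iterated once, so materializing it a second time yields nothing.

```python
from pathlib import Path

# glob() returns a lazy generator, which is consumed as it is iterated
files = Path(".").glob("*")
first_pass = list(files)   # the generator is read (and used up) here
second_pass = list(files)  # nothing is left on a second pass

print(len(second_pass))  # always 0
```

This is why the examples below re-create the generator before looping over it again.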
Now that we have a list of filenames we want to load, we can load each file into a dataframe. We can choose to load each file individually, as we have been doing so far.
billboard01 = pd.read_csv(billboard_data_files[0])
billboard02 = pd.read_csv(billboard_data_files[1])
billboard03 = pd.read_csv(billboard_data_files[2])
# just look at one of the data sets we loaded
print(billboard01)
year artist track time
0 2000 2 Pac Baby Don't Cry (Keep... 4:22
1 2000 2Ge+her The Hardest Part Of ... 3:15
2 2000 3 Doors Down Kryptonite 3:53
3 2000 3 Doors Down Loser 4:24
4 2000 504 Boyz Wobble Wobble 3:35
.. ... ... ... ...
312 2000 Yankee Grey Another Nine Minutes 3:10
313 2000 Yearwood, Trisha Real Live Woman 3:55
314 2000 Ying Yang Twins Whistle While You Tw... 4:19
315 2000 Zombie Nation Kernkraft 400 3:30
316 2000 matchbox twenty Bent 4:12
date.entered week rating
0 2000-02-26 wk1 87.0
1 2000-09-02 wk1 91.0
2 2000-04-08 wk1 81.0
3 2000-10-21 wk1 76.0
4 2000-04-15 wk1 57.0
.. ... ... ...
312 2000-04-29 wk1 86.0
313 2000-04-01 wk1 85.0
314 2000-03-18 wk1 95.0
315 2000-09-02 wk1 99.0
316 2000-04-29 wk1 60.0
[317 rows x 7 columns]
We can concatenate them just as we did in Chapter 6.
# shape of each dataframe
print(billboard01.shape)
print(billboard02.shape)
print(billboard03.shape)
(317, 7)
(317, 7)
(317, 7)
# concatenate the dataframes together
billboard = pd.concat([billboard01, billboard02, billboard03])
# shape of final concatenated taxi data
print(billboard.shape)
(951, 7)
Let’s write a check to make sure the rows were concatenated correctly.
assert (
billboard01.shape[0]
+ billboard02.shape[0]
+ billboard03.shape[0]
== billboard.shape[0]
)
However, manually loading each dataframe into its own variable will get tedious when the data is split into many parts. As an alternative, we can automate the process using loops and list comprehensions.
An easier way to load multiple files is to first create an empty list, use a loop to iterate through each of the CSV files, load each CSV file into a Pandas dataframe, and then append the dataframe to the list. The final data type we want is a list of dataframes because the concat() function takes a list of dataframes to concatenate.
# this part was the same as earlier
from pathlib import Path
billboard_data_files = (
Path(".")
.glob("data/billboard-by_week/billboard-*.csv")
)
# create an empty list to append to
list_billboard_df = []
# loop through each CSV filename
for csv_filename in billboard_data_files:
# you can choose to print the filename for debugging
# print(csv_filename)
# load the CSV file into a dataframe
df = pd.read_csv(csv_filename)
# append the dataframe to the list that will hold the dataframes
list_billboard_df.append(df)
# print the length of the list
print(len(list_billboard_df))
76
# type of the first element
print(type(list_billboard_df[0]))
<class 'pandas.core.frame.DataFrame'>
# look at the first dataframe
print(list_billboard_df[0])
year artist track time
0 2000 2 Pac Baby Don't Cry (Keep... 4:22
1 2000 2Ge+her The Hardest Part Of ... 3:15
2 2000 3 Doors Down Kryptonite 3:53
3 2000 3 Doors Down Loser 4:24
4 2000 504 Boyz Wobble Wobble 3:35
.. ... ... ... ...
312 2000 Yankee Grey Another Nine Minutes 3:10
313 2000 Yearwood, Trisha Real Live Woman 3:55
314 2000 Ying Yang Twins Whistle While You Tw... 4:19
315 2000 Zombie Nation Kernkraft 400 3:30
316 2000 matchbox twenty Bent 4:12
date.entered week rating
0 2000-02-26 wk15 NaN
1 2000-09-02 wk15 NaN
2 2000-04-08 wk15 38.0
3 2000-10-21 wk15 72.0
4 2000-04-15 wk15 78.0
.. ... ... ...
312 2000-04-29 wk15 NaN
313 2000-04-01 wk15 NaN
314 2000-03-18 wk15 NaN
315 2000-09-02 wk15 NaN
316 2000-04-29 wk15 3.0
[317 rows x 7 columns]
Now that we have a list of dataframes, we can concatenate them.
billboard_loop_concat = pd.concat(list_billboard_df)
print(billboard_loop_concat.shape)
(24092, 7)
Python has an idiom for looping through something and adding it to a list, called a list comprehension. The loop given previously, which is shown here again without the comments, can be written as a list comprehension (Appendix K).
# we have to re-create the generator because we
# "used it up" in the previous example
billboard_data_files = (
Path(".")
.glob("data/billboard-by_week/billboard-*.csv")
)
# the loop code without comments
list_billboard_df = []
for csv_filename in billboard_data_files:
df = pd.read_csv(csv_filename)
list_billboard_df.append(df)
billboard_data_files = (
Path(".")
.glob("data/billboard-by_week/billboard-*.csv")
)
# same code in a list comprehension
billboard_dfs = [pd.read_csv(data) for data in billboard_data_files]
The result from our list comprehension is a list, just as in the earlier loop example.
print(type(billboard_dfs))
<class 'list'>
print(len(billboard_dfs))
76
Finally, we can concatenate the results just as we did earlier.
billboard_concat_comp = pd.concat(billboard_dfs)
print(billboard_concat_comp)
year artist track time
0 2000 2 Pac Baby Don't Cry (Keep... 4:22
1 2000 2Ge+her The Hardest Part Of ... 3:15
2 2000 3 Doors Down Kryptonite 3:53
3 2000 3 Doors Down Loser 4:24
4 2000 504 Boyz Wobble Wobble 3:35
.. ... ... ... ...
312 2000 Yankee Grey Another Nine Minutes 3:10
313 2000 Yearwood, Trisha Real Live Woman 3:55
314 2000 Ying Yang Twins Whistle While You Tw... 4:19
315 2000 Zombie Nation Kernkraft 400 3:30
316 2000 matchbox twenty Bent 4:12
date.entered week rating
0 2000-02-26 wk15 NaN
1 2000-09-02 wk15 NaN
2 2000-04-08 wk15 38.0
3 2000-10-21 wk15 72.0
4 2000-04-15 wk15 78.0
.. ... ... ...
312 2000-04-29 wk18 NaN
313 2000-04-01 wk18 NaN
314 2000-03-18 wk18 NaN
315 2000-09-02 wk18 NaN
316 2000-04-29 wk18 3.0
[24092 rows x 7 columns]
The previous section alluded to a few database concepts. The join="inner" and the default join="outer" parameters come from working with databases when we want to merge tables.
Instead of simply having a row or column index that you want to use to concatenate values, sometimes you may have two or more dataframes that you want to combine based on common data values. This task is known in the database world as performing a “join.”
Pandas has a .join() method that uses .merge() under the hood. .join() will merge dataframe objects based on an index, but the .merge() function is much more explicit and flexible.
If you are planning to merge dataframes by the row index, for example, you might want to look into the .join() method.2
2. Pandas DataFrame.join() method: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html
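As a brief illustration of index-based joining, here is a sketch with toy frames loosely modeled on the survey data (the values are illustrative, not loaded from the book's files):

```python
import pandas as pd

# toy frames indexed by site name (loosely modeled on the survey data)
left = pd.DataFrame({"lat": [-49.85, -47.15]}, index=["DR-1", "DR-3"])
right = pd.DataFrame(
    {"dated": ["1927-02-08", "1939-01-07"]}, index=["DR-1", "DR-3"]
)

# .join() aligns on the row index by default (how="left")
joined = left.join(right)
print(joined)
```

Because both frames share the same index labels, each site's latitude and date line up in a single row without specifying any key columns.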
We will be using the set of survey data in the following examples.
person = pd.read_csv('data/survey_person.csv')
site = pd.read_csv('data/survey_site.csv')
survey = pd.read_csv('data/survey_survey.csv')
visited = pd.read_csv('data/survey_visited.csv')
print(person)
ident personal family
0 dyer William Dyer
1 pb Frank Pabodie
2 lake Anderson Lake
3 roe Valentina Roerich
4 danforth Frank Danforth
print(site)
name lat long
0 DR-1 -49.85 -128.57
1 DR-3 -47.15 -126.72
2 MSK-4 -48.87 -123.40
print(visited)
ident site dated
0 619 DR-1 1927-02-08
1 622 DR-1 1927-02-10
2 734 DR-3 1939-01-07
3 735 DR-3 1930-01-12
4 751 DR-3 1930-02-26
5 752 DR-3 NaN
6 837 MSK-4 1932-01-14
7 844 DR-1 1932-03-22
print(survey)
taken person quant reading
0 619 dyer rad 9.82
1 619 dyer sal 0.13
2 622 dyer rad 7.80
3 622 dyer sal 0.09
4 734 pb rad 8.41
.. ... ... ... ...
16 752 roe sal 41.60
17 837 lake rad 1.46
18 837 lake sal 0.21
19 837 roe sal 22.50
20 844 roe rad 11.25
[21 rows x 4 columns]
Currently, our data is split into multiple parts, where each part is an observational unit. If we wanted to look at the dates at each site along with the latitude and longitude information for that site, we would have to combine (and merge) multiple dataframes. We can do this with the .merge() method in Pandas.
When we call this method, the dataframe it is called on is referred to as the one on the “left.” Within the .merge() method, the first parameter is the “right” dataframe (i.e., left.merge(right)). The next parameter, how, controls how the final merged result looks; Table 6.1 provides more details. Next, we set the on parameter. This specifies which columns to match on. If the left and right columns do not have the same name, we can use the left_on and right_on parameters instead.
Table 6.1 How the Pandas how Parameter Relates to SQL
| Pandas | SQL | Description |
|---|---|---|
| left | left outer | Keep all the keys from the left |
| right | right outer | Keep all the keys from the right |
| outer | full outer | Keep all the keys from both left and right |
| inner | inner | Keep only the keys that exist in both left and right |
In the simplest type of merge, we have two dataframes where we want to join one column to another column, and where the columns we want to join do not contain any duplicate values.
For this example, we will modify the visited dataframe so there are no duplicated site values.
visited_subset = visited.loc[[0, 2, 6], :]
print(visited_subset)
ident site dated
0 619 DR-1 1927-02-08
2 734 DR-3 1939-01-07
6 837 MSK-4 1932-01-14
# get a count of the values in the site column
print(
visited_subset["site"].value_counts()
)
DR-1 1
DR-3 1
MSK-4 1
Name: site, dtype: int64
We can perform our one-to-one merge as follows:
# the default value for 'how' is 'inner'
# so it doesn't need to be specified
o2o_merge = site.merge(
visited_subset, left_on="name", right_on="site"
)
print(o2o_merge)
name lat long ident site dated
0 DR-1 -49.85 -128.57 619 DR-1 1927-02-08
1 DR-3 -47.15 -126.72 734 DR-3 1939-01-07
2 MSK-4 -48.87 -123.40 837 MSK-4 1932-01-14
As you can see, we have now created a new dataframe from two separate dataframes where the rows were matched based on a particular set of columns. In SQL-speak, the columns used to match are called “keys.”
If we choose to do the same merge, but this time without using the subsetted visited dataframe, we would perform a many-to-one merge. In this kind of merge, one of the dataframes has key values that repeat.
# get a count of the values in the site column
print(
visited["site"].value_counts()
)
DR-3 4
DR-1 3
MSK-4 1
Name: site, dtype: int64
The rows from the dataframe containing the single (unique) observations will then be duplicated in the merge.
m2o_merge = site.merge(visited, left_on='name', right_on='site')
print(m2o_merge)
name lat long ident site dated
0 DR-1 -49.85 -128.57 619 DR-1 1927-02-08
1 DR-1 -49.85 -128.57 622 DR-1 1927-02-10
2 DR-1 -49.85 -128.57 844 DR-1 1932-03-22
3 DR-3 -47.15 -126.72 734 DR-3 1939-01-07
4 DR-3 -47.15 -126.72 735 DR-3 1930-01-12
5 DR-3 -47.15 -126.72 751 DR-3 1930-02-26
6 DR-3 -47.15 -126.72 752 DR-3 NaN
7 MSK-4 -48.87 -123.40 837 MSK-4 1932-01-14
The site information (name, lat, and long) was duplicated and matched to the visited data.
Lastly, there will be times when we want to perform a match based on multiple columns. As an example, suppose we have two dataframes: one that comes from person merged with survey, and another that comes from visited merged with survey.
ps = person.merge(survey, left_on='ident', right_on='person')
vs = visited.merge(survey, left_on='ident', right_on='taken')
print(ps)
ident personal family taken person quant reading
0 dyer William Dyer 619 dyer rad 9.82
1 dyer William Dyer 619 dyer sal 0.13
2 dyer William Dyer 622 dyer rad 7.80
3 dyer William Dyer 622 dyer sal 0.09
4 pb Frank Pabodie 734 pb rad 8.41
.. ... ... ... ... ... ... ...
14 lake Anderson Lake 837 lake rad 1.46
15 lake Anderson Lake 837 lake sal 0.21
16 roe Valentina Roerich 752 roe sal 41.60
17 roe Valentina Roerich 837 roe sal 22.50
18 roe Valentina Roerich 844 roe rad 11.25
[19 rows x 7 columns]
print(vs)
ident site dated taken person quant reading
0 619 DR-1 1927-02-08 619 dyer rad 9.82
1 619 DR-1 1927-02-08 619 dyer sal 0.13
2 622 DR-1 1927-02-10 622 dyer rad 7.80
3 622 DR-1 1927-02-10 622 dyer sal 0.09
4 734 DR-3 1939-01-07 734 pb rad 8.41
.. ... ... ... ... ... ... ...
16 752 DR-3 NaN 752 roe sal 41.60
17 837 MSK-4 1932-01-14 837 lake rad 1.46
18 837 MSK-4 1932-01-14 837 lake sal 0.21
19 837 MSK-4 1932-01-14 837 roe sal 22.50
20 844 DR-1 1932-03-22 844 roe rad 11.25
[21 rows x 7 columns]
We know there is a many-to-many merge happening because there are duplicate values in the keys for both the left and right dataframe.
print(
ps["quant"].value_counts()
)
rad 8
sal 8
temp 3
Name: quant, dtype: int64
print(
vs["quant"].value_counts()
)
sal 9
rad 8
temp 4
Name: quant, dtype: int64
We can perform a many-to-many merge by passing a list of the columns to match on. Here we match on a single column, quant, which contains duplicate values in both dataframes.
ps_vs = ps.merge(
vs,
left_on=["quant"],
right_on=["quant"],
)
Let’s look at just the first row of data.
print(ps_vs.loc[0, :])
ident_x dyer
personal William
family Dyer
taken_x 619
person_x dyer
...
site DR-1
dated 1927-02-08
taken_y 619
person_y dyer
reading_y 9.82
Name: 0, Length: 13, dtype: object
Pandas will automatically add a suffix to a column name if there are collisions in the name. In the output, the _x suffix refers to values from the left dataframe, and the _y suffix comes from values in the right dataframe.
A simple way to check your work before and after a merge is by looking at the number of rows of your data before and after the merge. If you end up with more rows than either of the dataframes you are merging together, that means a many-to-many merge occurred, and that is usually a situation you do not want.
print(ps.shape) # left dataframe
(19, 7)
print(vs.shape) # right dataframe
(21, 7)
print(ps_vs.shape) # after merge
(148, 13)
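Beyond eyeballing row counts, .merge() itself can enforce the relationship you expect via its validate parameter, which raises a MergeError when the check fails. A small sketch with toy data (not the survey tables):

```python
import pandas as pd

# toy frames where the key repeats on both sides (illustrative data)
left = pd.DataFrame({"key": ["a", "a"], "x": [1, 2]})
right = pd.DataFrame({"key": ["a", "a"], "y": [3, 4]})

# validate= makes .merge() raise a MergeError when the merge is not
# the relationship you expected; here a one-to-one check fails
try:
    left.merge(right, on="key", validate="one_to_one")
    merge_ok = True
except pd.errors.MergeError:
    merge_ok = False

print(merge_ok)  # False: the many-to-many merge was caught
```

Other accepted values are "one_to_many", "many_to_one", and "many_to_many", matching the relationships described in this section.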
One way you can check your work is by having your code fail when you know a bad condition exists. You can achieve this by using the Python assert statement. When an expression evaluates to True, assert will not return anything, and your code will continue on to the next expression.
# expect this to be true
# note there is no output
assert vs.shape[0] == 21
However, if the expression given to assert evaluates to False, it will throw an AssertionError, and your code will stop.
assert ps_vs.shape[0] <= vs.shape[0]
AssertionError:
Using assert is a good technique for building checks into your code without having to run it and visually inspect the result. This is also the basis for creating “unit tests” for functions.
Sometimes, you may need to combine various parts of data or multiple data sets, depending on the question you are trying to answer. Keep in mind, however, that the shape of data you need for analysis does not necessarily equate to the best shape of data for storage.
The survey data used in the last example came in four separate parts that needed to be merged together. After we merged the tables, a lot of redundant information appeared across the rows. From a data storage and data entry point of view, each of these duplications can lead to errors and data inconsistency. This is what Hadley meant by saying that in tidy data, “each type of observational unit forms a table.”