By now, you should be able to load data into Pandas
and do some basic visualizations. This part of the book focuses on various data cleaning tasks. We begin with assembling a data set for analysis by combining various data sets together.
1. Prior knowledge
a. loading data
b. subsetting data
c. functions and class methods
This chapter will cover:
1. Tidy data
2. Concatenating data
3. Merging data sets
Hadley Wickham,1 one of the more prominent members of the R community, talks about the idea of tidy data. In fact, he’s written a paper about this concept in the Journal of Statistical Software.2 Tidy data is a framework to structure data sets so they can be easily analyzed. It is mainly used as a goal one should aim for when cleaning data. Once you understand what tidy data is, that knowledge will make data collection much easier.
1. Hadley Wickham’s homepage: http://hadley.nz
2. Tidy data paper: http://vita.had.co.nz/papers/tidy-data.pdf
So what is tidy data? Hadley Wickham’s paper defines it as meeting the following criteria:
■ Each row is an observation.
■ Each column is a variable.
■ Each type of observational unit forms a table.
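As a concrete illustration, a minimal tidy data set (the country/year/cases columns here are invented for this sketch) might look like this: each row is one observation of a country in a year, and each column is a single variable.

```python
import pandas as pd

# each row is an observation; each column is a variable
tidy = pd.DataFrame({
    'country': ['Canada', 'Canada', 'Mexico', 'Mexico'],
    'year': [2019, 2020, 2019, 2020],
    'cases': [10, 12, 7, 9],
})
print(tidy)
```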
We begin with Hadley Wickham’s last tidy data point: “Each type of observational unit forms a table.” When data is tidy, you need to combine various tables together to answer a question. For example, there may be a separate table holding company information and another table holding stock prices. If we want to look at all the stock prices within the tech industry, we may first have to find all the tech companies from the company information table, and then combine that data with the stock price data to get the data we need for our question. The data may have been split up into separate tables to reduce the amount of redundant information (we don’t need to store the company information with each stock price entry), but this arrangement means we as data analysts must combine the relevant data ourselves to answer our question.
At other times, a single data set may be split into multiple parts. For example, with time-series data, each date may be in a separate file. In another case, a file may have been split into parts to make the individual files smaller. You may also need to combine data from multiple sources to answer a question (e.g., combine latitudes and longitudes with zip codes). In all of these cases, you will need to combine data into a single dataframe for analysis.
One of the (conceptually) easier ways to combine data is with concatenation. Concatenation can be thought of as appending a row or column to your data. This approach is possible if your data was split into parts or if you performed a calculation that you want to append to your existing data set.
Concatenation is accomplished by using the concat function from Pandas.
Let’s begin with some example data sets so you can see what is actually happening.
import pandas as pd
df1 = pd.read_csv('../data/concat_1.csv')
df2 = pd.read_csv('../data/concat_2.csv')
df3 = pd.read_csv('../data/concat_3.csv')
print(df1)
print(df3)
Stacking the dataframes on top of each other uses the concat function in Pandas. All of the dataframes to be concatenated are passed in a list.
row_concat = pd.concat([df1, df2, df3])
print(row_concat)
As you can see, concat blindly stacks the dataframes together. If you look at the row names (i.e., the row indices), they are also simply a stacked version of the original row indices. If we apply the various subsetting methods from Table 2.3, the table will be subsetted as expected.
# subset the fourth row of the concatenated dataframe
print(row_concat.iloc[3,])
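Because concat preserves the original row indices, label-based subsetting with loc behaves differently from position-based subsetting with iloc. A small sketch with made-up data shows the difference:

```python
import pandas as pd

df_a = pd.DataFrame({'A': ['a0', 'a1']})
df_b = pd.DataFrame({'A': ['a2', 'a3']})
stacked = pd.concat([df_a, df_b])

# iloc is positional: returns exactly one row
print(stacked.iloc[0])

# loc matches the label 0, which now appears twice: returns two rows
print(stacked.loc[0])
```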
Section 2.2.1 showed the process for creating a Series. However, if we create a new series to append to a dataframe, it does not append correctly.
# create a new row of data
new_row_series = pd.Series(['n1', 'n2', 'n3', 'n4'])
print(new_row_series)
# attempt to add the new row to a dataframe
print(pd.concat([df1, new_row_series]))
The first things you may notice are the NaN values. NaN is how Pandas represents a "missing value" (see Chapter 5, "Missing Data"). We were hoping to append our new values as a row, but that didn't happen. In fact, not only did our code not append the values as a row, but it also created a new column completely misaligned with everything else.
If we pause to think about what is happening here, we can see that the results actually make sense. First, if we look at the new indices that were added, we notice that they are very similar to the results we obtained when we concatenated dataframes earlier. The indices of the new_row_series object are analogs to the row numbers of the dataframe. Also, since our series did not have a matching column name, it was added as a new column.
To fix this problem, we can turn our series into a dataframe. This dataframe contains one row of data, with column names matching the columns the data will bind to.
# note the double brackets
new_row_df = pd.DataFrame([['n1', 'n2', 'n3', 'n4']],
columns=['A', 'B', 'C', 'D'])
print(new_row_df)
print(pd.concat([df1, new_row_df]))
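An equivalent way to get a one-row dataframe from a series, assuming the values are already in column order, is to give the series the target column names as its index and then transpose it:

```python
import pandas as pd

new_row_series = pd.Series(['n1', 'n2', 'n3', 'n4'],
                           index=['A', 'B', 'C', 'D'])

# to_frame() yields a 4x1 dataframe; transposing it with .T
# produces a 1x4 dataframe whose columns are A, B, C, D
new_row_df = new_row_series.to_frame().T
print(new_row_df)
```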
concat is a general function that can concatenate multiple things at once. If you just need to append a single object to an existing dataframe, the append method can handle that task.
Using a DataFrame:
print(df1.append(df2))
Using a single-row DataFrame:
print(df1.append(new_row_df))
Using a Python dictionary:
data_dict = {'A': 'n1',
'B': 'n2',
'C': 'n3',
'D': 'n4'}
print(df1.append(data_dict, ignore_index=True))
In the last example, when we added a dict to a dataframe, we had to use the ignore_index parameter. If we look closer, we can see that the row index was incremented by 1, rather than repeating a previous index value.
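Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0. On newer versions, the same result can be obtained by wrapping the dictionary in a list to build a one-row dataframe and then using concat (stand-in data shown here):

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['a0'], 'B': ['b0'],
                    'C': ['c0'], 'D': ['d0']})
data_dict = {'A': 'n1', 'B': 'n2', 'C': 'n3', 'D': 'n4'}

# wrap the dict in a list to create a one-row dataframe, then concat
appended = pd.concat([df1, pd.DataFrame([data_dict])],
                     ignore_index=True)
print(appended)
```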
If we simply want to concatenate or append data together, we can use the ignore_index parameter to reset the row index after the concatenation.
row_concat_i = pd.concat([df1, df2, df3], ignore_index=True)
print(row_concat_i)
Concatenating columns is very similar to concatenating rows. The main difference is the axis parameter in the concat function. The default value of axis is 0, so it will concatenate data in a row-wise fashion. However, if we pass axis=1 to the function, it will concatenate data in a column-wise manner.
col_concat = pd.concat([df1, df2, df3], axis=1)
print(col_concat)
If we try to subset data based on column names, we will get a result similar to the one we got when we concatenated row-wise and subset by row index.
print(col_concat['A'])
Adding a single column to a dataframe can be done directly without using any specific Pandas function. Simply assign the vector of values you want to a new column name.
col_concat['new_col_list'] = ['n1', 'n2', 'n3', 'n4']
print(col_concat)
col_concat['new_col_series'] = pd.Series(['n1', 'n2', 'n3', 'n4'])
print(col_concat)
Using the concat function still works, as long as you pass it a dataframe, but this approach requires a bit more code.
Finally, we can reset the column indices so we do not have duplicated column names.
print(pd.concat([df1, df2, df3], axis=1, ignore_index=True))
The examples shown so far have assumed we are performing a simple row or column concatenation. They also assume that the new row(s) had the same column names or the column(s) had the same row indices.
This section addresses what happens when the row and column indices are not aligned.
Let’s modify our dataframes for the next few examples.
df1.columns = ['A', 'B', 'C', 'D']
df2.columns = ['E', 'F', 'G', 'H']
df3.columns = ['A', 'C', 'F', 'H']
print(df1)
print(df2)
print(df3)
If we try to concatenate these dataframes as we did in Section 4.3.1, the dataframes now do much more than simply stack one on top of the other. The columns align themselves, and NaN fills in any missing areas.
row_concat = pd.concat([df1, df2, df3])
print(row_concat)
One way to avoid the inclusion of NaN values is to keep only those columns that are shared in common by the list of objects to be concatenated. A parameter named join accomplishes this. By default, it has a value of 'outer', meaning it will keep all the columns. However, we can set join='inner' to keep only the columns that are shared among the data sets.
If we try to keep only the columns from all three dataframes, we will get an empty dataframe, since there are no columns in common.
print(pd.concat([df1, df2, df3], join='inner'))
If we use the dataframes that have columns in common, only the columns that all of them share will be returned.
print(pd.concat([df1,df3], ignore_index=False, join='inner'))
Let’s take our dataframes and modify them again so that they have different row indices. Here, we are building on the same dataframe modifications from Section 4.3.3.1.
df1.index = [0, 1, 2, 3]
df2.index = [4, 5, 6, 7]
df3.index = [0, 2, 5, 7]
print(df1)
print(df2)
print(df3)
When we concatenate along axis=1, the behavior is analogous to concatenating along axis=0. The new dataframes are added in a column-wise fashion and matched against their respective row indices. Missing value indicators appear in the areas where the indices did not align.
col_concat = pd.concat([df1, df2, df3], axis=1)
print(col_concat)
Just as we did when we concatenated in a row-wise manner, we can choose to keep the results only when there are matching indices by using join='inner'.
print(pd.concat([df1, df3], axis=1, join='inner'))
The previous section alluded to a few database concepts. The join='inner' and the default join='outer' parameters come from working with databases, where we want to merge tables.
Instead of simply having a row or column index that you want to use to concatenate values, sometimes you may have two or more dataframes that you want to combine based on common data values. This task is known in the database world as performing a “join.”
Pandas dataframes have a join method that uses merge under the hood. join merges dataframe objects based on their index, but the merge method is much more explicit and flexible. If you are planning to merge dataframes by the row index, you might want to look into the join method.3
3. Pandas DataFrame join method: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.join.html
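A quick sketch of join with invented data: by default it aligns the two dataframes on their row indices.

```python
import pandas as pd

left = pd.DataFrame({'lval': [1, 2]}, index=['k0', 'k1'])
right = pd.DataFrame({'rval': [3, 4]}, index=['k0', 'k1'])

# join aligns on the row index by default
joined = left.join(right)
print(joined)
```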
We will be using sets of survey data in this series of examples.
person = pd.read_csv('../data/survey_person.csv')
site = pd.read_csv('../data/survey_site.csv')
survey = pd.read_csv('../data/survey_survey.csv')
visited = pd.read_csv('../data/survey_visited.csv')
print(person)
print(site)
print(visited)
print(survey)
Currently, our data is split into multiple parts, where each part is an observational unit. If we wanted to look at the dates at each site along with the latitude and longitude information for that site, we would have to combine (and merge) multiple dataframes. We can do this with merge, which is a DataFrame method in Pandas.
When we call this method, the dataframe on which it is called is referred to as the one on the 'left'. Within merge, the first parameter is the 'right' dataframe. The next parameter, how, specifies how the final merged result looks; Table 4.1 provides more details. Next, we set the on parameter, which specifies the columns to match on. If the left and right columns do not have the same name, we can use the left_on and right_on parameters instead.
Pandas | SQL         | Description
-------|-------------|------------------------------------------------------
left   | left outer  | Keep all the keys from the left
right  | right outer | Keep all the keys from the right
outer  | full outer  | Keep all the keys from both left and right
inner  | inner       | Keep only the keys that exist in both left and right
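The effect of the how parameter can be sketched with two tiny dataframes (the keys and values here are invented): an inner merge keeps only the shared keys, while an outer merge keeps every key and fills the gaps with NaN.

```python
import pandas as pd

left = pd.DataFrame({'key': ['k0', 'k1', 'k2'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key': ['k1', 'k2', 'k3'], 'rval': [4, 5, 6]})

# inner keeps only k1 and k2, the keys present in both dataframes
inner = left.merge(right, on='key', how='inner')

# outer keeps k0 through k3, with NaN where a side has no match
outer = left.merge(right, on='key', how='outer')
print(len(inner), len(outer))
```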
In the simplest type of merge, we have two dataframes where we want to join one column to another column, and where the columns we want to join do not contain any duplicate values.
For this example, we will modify the visited dataframe so there are no duplicated site values.
visited_subset = visited.loc[[0, 2, 6], ]
We can perform our one-to-one merge as follows:
# the default value for 'how' is 'inner'
# so it doesn't need to be specified
o2o_merge = site.merge(visited_subset,
left_on='name', right_on='site')
print(o2o_merge)
As you can see, we have now created a new dataframe from two separate dataframes where the rows were matched based on a particular set of columns. In SQL-speak, the columns used to match are called “keys.”
If we choose to do the same merge, but this time without the subsetted visited dataframe, we would perform a many-to-one merge. In this kind of merge, one of the dataframes has key values that repeat. The dataframe that contains the single observations will then be duplicated in the merge.
m2o_merge = site.merge(visited, left_on='name', right_on='site')
print(m2o_merge)
As you can see, the site information (name, lat, and long) was duplicated and matched to the visited data.
Lastly, there will be times when we want to perform a match based on multiple columns. As an example, suppose we have one dataframe that comes from person merged with survey, and another dataframe that comes from visited merged with survey.
ps = person.merge(survey, left_on='ident', right_on='person')
vs = visited.merge(survey, left_on='ident', right_on='taken')
print(ps)
print(vs)
We can perform a many-to-many merge by passing the multiple columns to match on in a Python list.
ps_vs = ps.merge(vs,
left_on=['ident', 'taken', 'quant', 'reading'],
right_on=['person', 'ident', 'quant', 'reading'])
Let’s look at just the first row of data.
print(ps_vs.loc[0, ])
Pandas will automatically add a suffix to a column name if there are collisions in the name. In the output, the _x suffix refers to values from the left dataframe, and the _y suffix comes from values in the right dataframe.
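The suffix behavior, and the suffixes parameter that controls it, can be seen with a small example (made-up data):

```python
import pandas as pd

left = pd.DataFrame({'key': ['k0', 'k1'], 'value': [1, 2]})
right = pd.DataFrame({'key': ['k0', 'k1'], 'value': [3, 4]})

# 'value' collides, so pandas appends _x (left) and _y (right)
merged = left.merge(right, on='key')
print(merged.columns.tolist())  # ['key', 'value_x', 'value_y']

# the defaults can be overridden with the suffixes parameter
merged2 = left.merge(right, on='key', suffixes=('_left', '_right'))
```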
Sometimes you may need to combine various parts of your data or multiple data sets, depending on the question you are trying to answer. Keep in mind, however, that the shape of data you need for analysis does not necessarily equate to the best shape for data storage.
The survey data used in the last example came in four separate parts that needed to be merged together. After we merged the tables, a lot of redundant information appeared across the rows. From a data storage and data entry point of view, each of these duplications can lead to errors and data inconsistency. This is what Hadley Wickham meant by saying that in tidy data, "each type of observational unit forms a table."