11 Manipulating Data with dplyr

The dplyr¹ (“dee-ply-er”) package is the preeminent tool for data wrangling in R (and perhaps in data science more generally). It provides programmers with an intuitive vocabulary for executing data management and analysis tasks. Learning and using this package will make your data preparation and management process faster and easier to understand. This chapter introduces the philosophy behind the package and provides an overview of how to use the package to work with data frames using its expressive and efficient syntax.

¹dplyr: http://dplyr.tidyverse.org

11.1 A Grammar of Data Manipulation

Hadley Wickham, the original creator of the dplyr package, fittingly refers to it as a Grammar of Data Manipulation. This is because the package provides a set of verbs (functions) to describe and perform common data preparation tasks. One of the core challenges in programming is mapping from questions about a data set to specific programming operations. The presence of a data manipulation grammar makes this process smoother, as it enables you to use the same vocabulary to both ask questions and write your program. Specifically, the dplyr grammar lets you easily talk about and perform tasks such as the following:

Select specific features (columns) of interest from a data set
Filter out irrelevant data and keep only observations (rows) of interest
Mutate a data set by adding more features (columns)
Arrange observations (rows) in a particular order
Summarize data in terms of aggregates such as the mean, median, or maximum
Join multiple data sets together into a single data frame

You can use these words when describing the algorithm or process for interrogating data, and then use dplyr to write code that will closely follow your “plain language” description because it uses functions and procedures that share the same language. Indeed, many real-world questions about a data set come down to isolating specific rows/columns of the data set as the “elements of interest” and then performing a basic comparison or computation (e.g., mean, count, max). While it is possible to perform such computation with base R functions (described in the previous chapters), the dplyr package makes it much easier to write and read such code.
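As a minimal sketch of that contrast, the example below answers one question ("What is the mean fuel efficiency of the 6-cylinder cars?") first with base R indexing and then with dplyr verbs. It assumes dplyr is already installed and uses R's built-in mtcars data set, which is chosen here only for illustration and is not a data set from this chapter.

```r
library("dplyr")

# Question: what is the mean fuel efficiency (mpg) of the 6-cylinder cars?

# Base R: index the rows with a logical vector, then extract the column
six_cyl_base <- mtcars[mtcars$cyl == 6, ]
mean_mpg_base <- mean(six_cyl_base$mpg)

# dplyr: filter the rows of interest, then summarize them
six_cyl <- filter(mtcars, cyl == 6)
mean_mpg_dplyr <- summarize(six_cyl, mean_mpg = mean(mpg))

print(mean_mpg_base)   # a single number
print(mean_mpg_dplyr)  # a one-row data frame with a mean_mpg column
```

Both approaches compute the same value. The dplyr version reads more like the plain-language description ("filter, then summarize"), and it returns a data frame rather than a bare number.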

11.2 Core `dplyr` Functions

The dplyr package provides functions that mirror the verbs mentioned previously. Using this package’s functions will allow you to quickly and effectively write code to ask questions of your data sets.

Since dplyr is an external package, you will need to install it (once per machine) with install.packages() and load it with library() in each script in which you want to use its functions.
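A minimal sketch of those two steps follows. The requireNamespace() guard is an addition made here so that the script can be rerun without reinstalling the package each time. A plain install.packages("dplyr") call run once from the console works just as well.

```r
# Install dplyr only if it is not already available (once per machine)
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}

# Load the package (in each script that uses its functions)
library("dplyr")
```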
