Basic tools of data wrangling

In this section, we cover some of the common data mining and aggregation operations that can be performed using data.frame, dplyr, and data.table.

First, we will learn how to use a few functions in Base R to perform basic manipulation operations. We'll then cover dplyr and data.table, two of the most well-known and powerful packages in the R world today for managing data. dplyr is widely used in conjunction with tibble, an alternative to data.frame that follows the behavior of data.frame more closely and uses the same conventions for slicing, indexing, and other operations. The latter, data.table, uses a slightly different convention, but is extremely powerful, especially for handling large datasets.

Each of these sections will, in turn, cover the following:

  • Reading and writing: How to read and write data from and to files, websites, and other sources
  • Analysis: How to perform ad hoc data analysis such as aggregations and pivots

Although we assume that readers have some familiarity with R, we have nevertheless provided a brief example of data.frame and its common operations as a primer for new users.

The fundamental data structure used across R is called data.frame ("data frame"). Many of you may already be familiar with the concept; the following is provided as a refresher.

A data.frame is similar to a table or a spreadsheet consisting of rows and columns. As in spreadsheets such as Excel, each column has a header, known as the column name, and every value in a column has the same data type; for example, a column of data type numeric cannot store characters. The general syntax for indexing a data.frame is dataframe[rows, columns], where dataframe is the name of the data.frame being referenced.
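The indexing syntax and the one-type-per-column rule can be sketched on a small example. The scores data.frame below is hypothetical and used only for illustration; it is not part of the state dataset used in the rest of this section.

```r
# Hypothetical toy data, for illustration only
scores <- data.frame(
  name  = c("Ann", "Bob", "Cara"),
  score = c(90, 85, 77),
  stringsAsFactors = FALSE
)

# dataframe[rows, columns]: leaving one side empty selects everything on that side
second_row   <- scores[2, ]        # second row, all columns
score_column <- scores[, "score"]  # a single column, returned as a numeric vector

# A logical condition in the rows position keeps only the matching rows
high_scorers <- scores[scores$score > 80, "name"]

# Each column holds exactly one data type
column_types <- sapply(scores, class)

# Assigning a character value into a numeric column converts the whole column to character
scores$score[1] <- "high"
class(scores$score)
```

After the last assignment, score is no longer numeric, which is why mixing types within a column is best avoided.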

For instance, the state dataset in R contains several key characteristics of US states:

# Load the data for state 
data(state) 
 
state <- data.frame(state.x77) # Creating a data.frame from the matrix state.x77 
 
# View the first few rows of state 
head(state) 
 
# View First 3 rows 
state[1:3,] 
 
# View First 3 columns 
state[,1:3] 
 
# View First 3 rows and 3 columns 
state[1:3,1:3] 
 
# Create a new column 
state$State <- row.names(state) 
 
# Find matches using boolean operations 
state[state$State == "Connecticut",] 
 
# Find states with Population > 1000 and Income > 2000 
state[state$Population > 1000 & state$Income > 2000,] 

As we will be covering how to read and write CSV files in later sections, let us also see how we can create and read CSV-formatted text files. In order to save the data.frame state as a CSV file, we can use the write.csv command as follows:

# Saving the state data.frame as a CSV File 
write.csv(x = state, file = "state.csv", row.names = FALSE) 
 
# The arguments were as follows: 
# x = the data.frame we want to save; file = the name of the file to save it as; row.names = FALSE means do not include the row names 
 
# We can read/import the file back to see what it contains 
read.csv("state.csv") 

Note the difference: the original data.frame state contained row names, whereas the saved CSV file doesn't, because of the options we selected. Both write.csv and read.csv take several other options, which you can view by running ?write.csv and ?read.csv, respectively, in the R console.
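As a minimal sketch of how these options interact, the following round trip keeps the row names instead of dropping them. It assumes a temporary file (the csv_path and state_from_csv names are illustrative, not from the text) so that nothing is written to the working directory.

```r
# Recreate the data.frame from the built-in matrix state.x77
state <- data.frame(state.x77)

# Illustrative temporary path; any writable file name would do
csv_path <- tempfile(fileext = ".csv")

# Keep the row names this time; write.csv stores them as an unnamed first column
write.csv(state, file = csv_path, row.names = TRUE)

# row.names = 1 tells read.csv to use the first column of the file as row names
state_from_csv <- read.csv(csv_path, row.names = 1)

# The state names survive the round trip
head(row.names(state_from_csv), 3)
identical(row.names(state), row.names(state_from_csv))
```

Whether to keep row names is a design choice: storing the state name as a regular column, as done earlier with state$State, tends to be easier to work with once the data moves to dplyr or data.table.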
