Manipulating Spark data using both dplyr and SQL

Once you're done with the installation from this chapter's introduction, let's create a remote dplyr data source for the Spark cluster. To do this, use the spark_connect() function, as shown:

library(sparklyr)
sc <- spark_connect(master = "local")

This creates a local Spark cluster on your computer; you can see it in RStudio, in the Connections tab alongside your R environment. To disconnect, call spark_disconnect(sc). Stay connected for now and copy a couple of datasets from an R package into the cluster:

library(DAAG)   # provides the sugar and stVincent datasets
library(dplyr)  # provides copy_to() and the verbs used below
dt_sugar <- copy_to(sc, sugar, "SUGAR")
dt_stVincent <- copy_to(sc, stVincent, "STVINCENT")

The preceding code uploads the DAAG::sugar and DAAG::stVincent data frames into your connected Spark cluster. It also creates the table definitions, which are saved in the dt_sugar and dt_stVincent objects. To read data from files directly into Spark DataFrames, call the spark_read_* functions, where the asterisk stands for the file type (csv, json, parquet, and so on). Check the R documentation (?spark_read_csv, for example) for the exact syntax and arguments.
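As an illustration, here is a minimal sketch of reading a CSV file straight into Spark with spark_read_csv(); the file path and table name are hypothetical, so point them at your own data:

# Hypothetical example: load data/flights.csv as a Spark table named "FLIGHTS"
dt_flights <- spark_read_csv(
  sc,
  name = "FLIGHTS",
  path = "data/flights.csv",  # hypothetical path; replace with a real file
  header = TRUE,              # first row contains the column names
  infer_schema = TRUE         # let Spark guess the column types
)

Now you can use the dplyr verbs with these table objects. For example, you can filter dt_sugar this way: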

dt_sugar %>% filter(trt == "Control")

What happens is that dplyr automatically translates your code into SQL queries: you only need to write the dplyr functions with their normal syntax. To see how the translation works, pipe the result into show_query() to inspect it. If you perform simple mathematical operations with R functions inside the dplyr verbs, dplyr will translate those math operators into Spark SQL too.
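For example, piping the previous filter into show_query() prints the generated SQL (the exact text can vary between dplyr and sparklyr versions):

dt_sugar %>%
  filter(trt == "Control") %>%
  show_query()
# Prints something like:
# SELECT * FROM `SUGAR` WHERE (`trt` = 'Control')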

Try another dplyr verb; summarise(), for example, works in the same way. Remember that you can chain many verbs together with the %>% pipe operator.
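As a sketch, the following groups the sugar table by treatment and computes the mean weight (both columns exist in DAAG::sugar); the computation runs inside Spark:

dt_sugar %>%
  group_by(trt) %>%                                    # one group per treatment level
  summarise(mean_weight = mean(weight, na.rm = TRUE))  # translated to AVG in Spark SQL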

If your objective is to write SQL queries directly, you need the DBI package (install.packages("DBI") if you don't have it) and its dbGetQuery() function. This function returns an R data frame and requires two arguments: the connection object (sc, in our case) and the query, as a quoted string. Try the following code as an example:

library(DBI)
# Spark SQL table names are case-insensitive by default, so "sugar"
# matches the table registered above as "SUGAR"
dbGetQuery(sc, "SELECT trt FROM sugar")

Naturally, this approach requires some knowledge of SQL queries. Notice that the last code returns a regular R data frame. When we use the dplyr verbs to manipulate the table definitions, only those definitions change; the dataset itself stays unaffected in the Spark environment, at least until you disconnect, when the tables are cleaned up (if it is a local cluster).
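When you are done, disconnect as mentioned earlier; for a local cluster, this shuts Spark down and removes the copied tables:

spark_disconnect(sc)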
