Filtering and aggregating Spark datasets 

To manipulate the table's dataset, we need to use the verb commands from dplyr. They will automatically be translated as SQL statements if you are connected to a DataFrame. I think here the best way to understand how it works is with an example, so let's run the following code:

mod_stvincent <- dt_stVincent %>% select(code, id, harvwt) %>% 
  filter(harvwt > 15) %>% arrange(desc(id))

The select function is used to choose the code, id, and harvwt columns from our dt_stVincent table object. The filter function is added to the code to choose only the row lines where harvwt is bigger than 15. In the end, arrange is used to set the order. You can also use summarise() as aggregators query and mutate() as operators query. Dplyr knows how to translate their mathematical algorithms to SQL. Run the show_query(mod_stvincent) code to see how the SQL translation works.

If you have paid enough attention, you will have noted that mod_stvincent isn't a DataFrame, but tables containing SQL queries. But don't worry, bringing Spark datasets into R is as simple as calling a function. Just use the collect() dplyr function and voila. Run the following code to make the (plot) magic happen:

mod1_stvincent <- collect(mod_stvincent)
plot(mod1_stvincent$id, mod1_stvincent$harvwt)

The collect function brings the datasets required, and it is manipulated by the dplyr functions to the R environment. Use class(mod1_stvincent) to see that it contains a DataFrame object. Now we can plot the variables from dt_stVincent easily. We only need to use our new mod1_stvincent object; we could use any R function we desire with it, too. If you try to build a plot directly with the dt_stVincent object or with the mod_stvincent tables (plot(mod_stvincent$id, mod_stvincent$harvwt)), the unique thing you'll get is a huge error message. If you want the same result using SQL query, try the next code:

mod1_stvincent <- dbGetQuery(sc, "SELECT *
  FROM (SELECT 'code', 'id', 'harvwt'
  FROM 'STVINCENT') 'tgssodfxtm'
  WHERE ('harvwt' > 15.0)
  ORDER BY 'id' DESC")
plot(mod1_stvincent$id, mod1_stvincent$harvwt)

You can see that the SQL query used is the same when you use show_query(mod_stvincent). This is the dplyr package translate capability in action. If you are not a SQL expert, I advise you to keep with the dplyr package regular functions. They are easier to learn and apply by anyone who already knows R. If you need a specific function, you can look for it in the dplyr documentation, found at: https://cran.r-project.org/web/packages/dplyr/dplyr.pdf.

Table of Contents for Filtering and aggregating Spark datasets&#xA0;

Create new playlist

Sign In

Sign Up

Table of Contents for
Filtering and aggregating Spark datasets