Using SQL to query data frames via the sqldf package

In the previous chapter, you learned how to compose SQL statements to query data from relational databases such as SQLite and MySQL. Is there a way to directly use SQL to query data frames in R as if these data frames are tables in relational databases? The sqldf package says yes.

This package takes advantage of SQLite, thanks to its lightweight structure and easiness to embed into an R session. Run the following command to install this package if you don't have it:

install.packages("sqldf") 

First, let's attach the package, as shown in the following code:

library(sqldf) 
## Loading required package: gsubfn 
## Loading required package: proto 
## Loading required package: RSQLite 
## Loading required package: DBI 

Note that when we attach sqldf, a number of other packages are automatically loaded. The sqldf package depends on these packages, because what it does is basically transferring data and converting data types between R and SQLite.

Then, we will reload the product tables we used in the previous sections:

product_info <- read_csv("data/product-info.csv") 
product_stats <- read_csv("data/product-stats.csv") 
product_tests <- read_csv("data/product-tests.csv") 
toy_tests <- read_csv("data/product-toy-tests.csv") 

The magic of this package is that we can directly query the data frames in our working environment with SQL. For example, we can select all records of product_info:

sqldf("select * from product_info") 
## Loading required package: tcltk 
##    id      name  type   class released 
## 1 T01    SupCar   toy vehicle      yes 
## 2 T02  SupPlane   toy vehicle       no 
## 3 M01     JeepX model vehicle      yes 
## 4 M02 AircraftX model vehicle      yes 
## 5 M03    Runner model  people      yes 
## 6 M04    Dancer model  people       no 

The sqldf package supports simple select queries that are supported by SQLite. For example, we can select a certain set of columns:

sqldf("select id, name, class from product_info") 
##    id      name   class 
## 1 T01    SupCar vehicle 
## 2 T02  SupPlane vehicle 
## 3 M01     JeepX vehicle 
## 4 M02 AircraftX vehicle 
## 5 M03    Runner  people 
## 6 M04    Dancer  people 

We can filter records by a certain condition:

sqldf("select id, name from product_info where released = 'yes'") 
##    id      name 
## 1 T01    SupCar 
## 2 M01     JeepX 
## 3 M02 AircraftX 
## 4 M03    Runner 

We can compute a new column and give it a name:

sqldf("select id, material, size / weight as density from product_stats") 
##    id material   density 
## 1 T01    Metal 12.000000 
## 2 T02    Metal  7.777778 
## 3 M01 Plastics        NA 
## 4 M02 Plastics 28.333333 
## 5 M03     Wood        NA 
## 6 M04     Wood 26.666667 

We can sort records by certain columns in given orders:

sqldf("select * from product_stats order by size desc") 
##    id material size weight 
## 1 T02    Metal  350   45.0 
## 2 T01    Metal  120   10.0 
## 3 M02 Plastics   85    3.0 
## 4 M01 Plastics   50     NA 
## 5 M04     Wood   16    0.6 
## 6 M03     Wood   15     NA 

The package also supports querying multiple data frames such as join. In the following code, we will merge product_info and product_stats by id, just like what we did with merge() previously:

sqldf("select * from product_info join product_stats using (id)") 
##    id      name  type   class released material size weight 
## 1 T01    SupCar   toy vehicle      yes    Metal  120   10.0 
## 2 T02  SupPlane   toy vehicle       no    Metal  350   45.0 
## 3 M01     JeepX model vehicle      yes Plastics   50     NA 
## 4 M02 AircraftX model vehicle      yes Plastics   85    3.0 
## 5 M03    Runner model  people      yes     Wood   15     NA 
## 6 M04    Dancer model  people       no     Wood   16    0.6 

Moreover, it also supports nested query. In the following code, we will select all records in product_info that are made of wood:

sqldf("select * from product_info where id in  
  (select id from product_stats where material = 'Wood')") 
##    id   name  type  class released 
## 1 M03 Runner model people      yes 
## 2 M04 Dancer model people       no 

Alternatively, we can use join with the same where condition to achieve the same goal. For many relational databases, join usually works faster than in when the data is large:

sqldf("select * from product_info join product_stats using (id)  
  where material = 'Wood'") 
##    id   name  type  class released material size weight 
## 1 M03 Runner model people      yes     Wood   15     NA 
## 2 M04 Dancer model people       no     Wood   16    0.6 

In addition to join, we can easily summarize data by group. For example, we group product_tests by waterproof into two groups: yes and no. For each group, we compute the average values of quality and durability, respectively:

sqldf("select waterproof, avg(quality), avg(durability) from product_tests 
  group by waterproof") 
##   waterproof avg(quality) avg(durability) 
## 1         no        10.00             9.5 
## 2        yes         5.75             5.0 

For the toy_tests data, it is easy to aggregate data for each product. Here is an example of averaging quality and durability values across time for each product:

sqldf("select id, avg(quality), avg(durability) from toy_tests  
  group by id") 
##    id avg(quality) avg(durability) 
## 1 T01         9.25            9.25 
## 2 T02         8.50            8.50 

To make the results more informative, we can join product_info with the grouped summary table so that the product information and average measures are presented together:

sqldf("select * from product_info join  
  (select id, avg(quality), avg(durability) from toy_tests  
    group by id) using (id)") 
##    id     name type   class released avg(quality) 
## 1 T01   SupCar  toy vehicle      yes         9.25 
## 2 T02 SupPlane  toy vehicle       no         8.50 
##   avg(durability) 
## 1            9.25 
## 2            8.50 

Using sqldf and SQL to query data frames looks very handy, but the limitations are obvious too.

First, since sqldf is, by default, based on SQLite, the limitation of the package is also the limitation of SQLite database, that is, the built-in group aggregate functions are limited. The official webpage (https://sqlite.org/lang_aggfunc.html) provides a list of functions: avg()count()group_concat()max()min()sum(), and total(). If we need more than that, for example, quantile(), it won't be easy. In R, we can use much more advanced algorithms to aggregate columns.

Second, since we need to supply a string of select statements to query data, it is not very convenient to generate SQL dynamically when part of it is determined by R variables. Therefore, we need to use sprintf() to allow the values of R variables to appear in the SQL statement.

Third, the limitation of sqldf is also the limitation of SQL. It is hard to compute new columns with more complex algorithms. For example, if we need to compute a ranking column based on an existing numeric column, it would not be very easy to implement. However, in R, we just need order(). Another thing is that it is hard or verbose to implement more complex filter operations such as ranking-based data filtering. For example, how do you select the first one or two products ordered by size in descending order grouped by material? Such a query requires a lot more thinking and tricks.

However, if we use the plyr package, such a task is a piece of cake. If you have the package installed, run the following code:

install.packages("plyr") 

To demonstrate how simple it is, we will use plyr::ddply to do this. We will supply material as the data splitter, that is, product_stats is divided into several parts for each value taken by material. We also supply a function to transform the input data frame (each part) to a new data frame. Then, the ddply function combines these data frames together:

plyr::ddply(product_stats, "material",  
  function(x) { 
    head(x[order(x$size, decreasing = TRUE),], 1L) 
  }) 
##    id material size weight 
## 1 T02    Metal  350   45.0 
## 2 M02 Plastics   85    3.0 
## 3 M04     Wood   16    0.6 

The anonymous function we supplied is called with three parts of product_stats with distinct material, each part having identical material.

Another example is to select top two test results with the most samples:

plyr::ddply(toy_tests, "id",  
  function(x) { 
    head(x[order(x$sample, decreasing = TRUE), ], 2) 
  }) 
##    id     date sample quality durability 
## 1 T01 20160405    180       9         10 
## 2 T01 20160302    150      10          9 
## 3 T02 20160403     90       9          8 
## 4 T02 20160502     85      10          9 

The anonymous function we supplied is called with two parts of toy_tests: one part is a data frame with id being T01 and the other T02. For each part, we order the sub-dataframe by sample in the descending order and take the top two records. The task is easily finished.

In addition, ddplyplyr provides a variety of functions of many possible pairs of input-output data types. To learn more, visit http://had.co.nz/plyr/ and https://github.com/hadley/plyr.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.136.17.12