Chapter 3. Filtering and Summarizing Data

After loading data from flat files or databases (as we have seen in Chapter 1, Hello, Data!), or directly from the web via APIs (as covered in Chapter 2, Getting Data from the Web), we often have to aggregate, transform, or filter the original dataset before the actual data analysis can take place.

In this chapter, we will focus on how to:

  • Filter rows and columns in data frames
  • Summarize and aggregate data
  • Improve the performance of such tasks with the dplyr and data.table packages, in addition to the base R methods

Drop needless data

Although the optimal solution is not to load the needless data in the first place (see the Loading a subset of text files and Loading data from databases sections in Chapter 1, Hello, Data!), we often have to filter the original dataset inside R. This can be done with traditional tools and functions from base R, such as subset, or by using which with the [ or [[ operators (a base R sketch is shown a bit later in this section), or, for example, with the SQL-like approach of the sqldf package:

> library(sqldf)
> sqldf("SELECT * FROM mtcars WHERE am=1 AND vs=1")
   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
2 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
3 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
4 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
5 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
6 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
7 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

I am sure that all readers with a decent SQL background who are just getting acquainted with R will appreciate this alternative way of filtering data, but I personally prefer the following rather similar, native, and much more concise R version:

> subset(mtcars, am == 1 & vs == 1)
                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Please note the slight difference in the results: the row.names argument of sqldf is FALSE by default, so the original row names are dropped. This default can of course be overridden to get the exact same results:

> identical(
+     sqldf("SELECT * FROM mtcars WHERE am=1 AND vs=1",
+       row.names = TRUE),
+     subset(mtcars, am == 1 & vs == 1)
+     )
[1] TRUE
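
For the sake of completeness, here is a minimal base R sketch with which and the [ operator, as mentioned at the beginning of this section; it is expected to return the very same rows as the subset call above:

mtcars[which(mtcars$am == 1 & mtcars$vs == 1), ]

Plain logical indexing, such as mtcars[mtcars$am == 1 & mtcars$vs == 1, ], works just as well on this dataset; the main difference is that which silently drops the rows where the condition evaluates to NA.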

These examples focused on how to drop rows from a data.frame, but what if we also want to remove some columns?

The SQL approach is really straightforward: just specify the required columns instead of * in the SELECT statement (a short sketch of this follows the next example). On the other hand, subset also supports this via the select argument, which can take a vector of column names or an R expression describing, for example, a range of columns:

> subset(mtcars, am == 1 & vs == 1, select = hp:wt)
                hp drat    wt
Datsun 710      93 3.85 2.320
Fiat 128        66 4.08 2.200
Honda Civic     52 4.93 1.615
Toyota Corolla  65 4.22 1.835
Fiat X1-9       66 4.08 1.935
Lotus Europa   113 3.77 1.513
Volvo 142E     109 4.11 2.780
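
For comparison, keeping only some columns in the SQL approach is a simple tweak of the earlier query; a minimal sketch (remember that sqldf drops the row names unless row.names = TRUE is set):

sqldf("SELECT hp, drat, wt FROM mtcars WHERE am=1 AND vs=1")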

Tip

Pass the unquoted column names as a vector via the c function to select an arbitrary list of columns in the given order, or exclude the specified columns by using the - operator, for example, subset(mtcars, select = -c(hp, wt)).
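
For example, the following two calls keep an arbitrary set of columns in the given order and drop the listed columns, respectively, on the same filtered rows as before:

subset(mtcars, am == 1 & vs == 1, select = c(qsec, hp, wt))
subset(mtcars, am == 1 & vs == 1, select = -c(hp, wt))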

Let's take this to the next step, and see how we can apply the aforementioned filters to some larger datasets, where we may face performance issues with the base functions.

Drop needless data in an efficient way

R works best with datasets that fit into the actual physical memory, and some R packages provide extremely fast access to such in-memory data.

Note

Some benchmarks (see the References section at the end of the book) provide real-life examples in which R's summarizing functions outperform what current open source (for example, MySQL, PostgreSQL, and Impala) and commercial databases (such as HP Vertica) can deliver.

Some of the related packages were already mentioned in Chapter 1, Hello, Data!, where we benchmarked reading a relatively large amount of data from the hflights package into R.

Let's see how the preceding examples perform on this dataset of a quarter of a million rows:

> library(hflights)
> system.time(sqldf("SELECT * FROM hflights WHERE Dest == 'BNA'", 
+   row.names = TRUE))
   user  system elapsed 
  1.487   0.000   1.493 
> system.time(subset(hflights, Dest == 'BNA'))
   user  system elapsed 
  0.132   0.000   0.131

The base::subset function seems to perform pretty well, but can we make it any faster? Well, the second generation of the plyr package, called dplyr (the relevant details are discussed in the High-performance helper functions section of this chapter and in Chapter 4, Restructuring Data), provides extremely fast C++ implementations of the most common data manipulation methods in a rather intuitive way:

> library(dplyr)
> system.time(filter(hflights, Dest == 'BNA'))
   user  system elapsed 
  0.021   0.000   0.022

Further, we can extend this solution by dropping some columns from the dataset, just like we did with subset before, although now we call the select function instead of passing an argument with the same name:

> str(select(filter(hflights, Dest == 'BNA'), DepTime:ArrTime))
'data.frame':  3481 obs. of  2 variables:
 $ DepTime: int  1419 1232 1813 900 716 1357 2000 1142 811 1341 ...
 $ ArrTime: int  1553 1402 1948 1032 845 1529 2132 1317 945 1519 ...

Therefore, it's simply a matter of calling the filter function instead of subset, and we get the results faster than the blink of an eye! The dplyr package can work with traditional data.frame or data.table objects, or it can interact directly with the most widely used database engines. Please note that row names are not preserved in dplyr, so if you require them, it's worth copying the names into an explicit variable before passing the data to dplyr or directly to data.table, as follows:

> mtcars$rownames <- rownames(mtcars)
> select(filter(mtcars, hp > 300), c(rownames, hp))
       rownames  hp
1 Maserati Bora 335
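
As a side note, nested calls such as select(filter(...)) can also be written from left to right with the %>% pipe operator, which dplyr re-exports from the magrittr package; a minimal sketch of the earlier hflights example in this style:

library(dplyr)
library(hflights)
hflights %>%
    filter(Dest == 'BNA') %>%
    select(DepTime:ArrTime) %>%
    str()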

Drop needless data in another efficient way

Let's see a quick example of the data.table solution on its own, without dplyr.

Note

The data.table package provides an extremely efficient way to handle larger datasets in a column-based, auto-indexed in-memory data structure, with backward compatibility for the traditional data.frame methods.

After loading the package, we have to transform the traditional hflights data.frame into a data.table. Then, we create a new column, called rownames, to which we assign the rownames of the original dataset with the help of the := assignment operator that is specific to data.table:

> library(data.table)
> hflights_dt <- data.table(hflights)
> hflights_dt[, rownames := rownames(hflights)]
> system.time(hflights_dt[Dest == 'BNA'])
   user  system elapsed 
  0.021   0.000   0.020
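
As a brief illustration of what the preceding note means by indexing, data.table can also use a sorted key to filter with a binary search instead of scanning the whole column. The following sketch does this on a copy, so that the row order of hflights_dt used in the upcoming examples is left untouched:

hflights_key <- copy(hflights_dt)  # work on a copy, as setkey reorders the rows
setkey(hflights_key, Dest)         # sort by Dest and mark it as the key
hflights_key['BNA']                # keyed subsetting via binary search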

Well, it takes some time to get used to the custom data.table syntax and it might even seem a bit strange to the traditional R user at first sight, but it's definitely worth mastering in the long run. You get great performance, and the syntax turns out to be natural and flexible after the relatively steep learning curve of the first few examples.

As a matter of fact, the data.table syntax is pretty similar to SQL:

DT[i, j, by, ...]

This could be described with SQL commands as follows:

DT[where, select | update, group by][having][order by][ ]...[ ]

Therefore, [.data.table (which stands for the [ operator applied to a data.table object) has some different arguments as compared to the traditional [.data.frame syntax, as you have already seen in the preceding example.
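
To make this mapping a bit more tangible, here is a minimal sketch that uses all three main arguments on the hflights_dt object defined above; the by argument (the GROUP BY part) is only previewed here, as aggregation is discussed in detail later in this chapter, and mean_arrdelay is just an illustrative column name:

hflights_dt[Dest == 'BNA',                                   # i: WHERE
    list(mean_arrdelay = mean(ArrDelay, na.rm = TRUE)),      # j: SELECT
    by = UniqueCarrier]                                      # by: GROUP BY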

Note

We will not deal with the := assignment operator in detail here, as this example might already be too complex for such an introductory part of the book, and it would probably take us out of our comfort zone. Please find more details in Chapter 4, Restructuring Data, or head to ?data.table for a rather technical overview.

So the first argument (i) of the [.data.table operator stands for filtering, or in other words, for the WHERE clause in SQL parlance, while [.data.frame expects indices specifying which rows to keep from the original dataset. The real difference between the two is that the former can take any R expression, while the latter traditional method expects mainly integers or logical values.
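
To make the contrast concrete, the following two calls express the same filter, first with the traditional logical index and then with a plain R expression passed to i:

hflights[hflights$Dest == 'BNA', ]  # [.data.frame: logical (or integer) index
hflights_dt[Dest == 'BNA']          # [.data.table: any R expression in i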

Anyway, filtering is as easy as passing an R expression to the i argument of the [ operator specific to data.table. Now, let's see how we can select columns in the data.table syntax, which should be done in the second argument (j) of the call, following the general data.table syntax described above:

> str(hflights_dt[Dest == 'BNA', list(DepTime, ArrTime)]) 
Classes 'data.table' and 'data.frame':     3481 obs. of 2 variables:
 $ DepTime: int  1419 1232 1813 900 716 1357 2000 1142 811 1341 ...
 $ ArrTime: int  1553 1402 1948 1032 845 1529 2132 1317 945 1519 ...
 - attr(*, ".internal.selfref")=<externalptr>

Okay, so we now have the two expected columns with the 3,481 observations. Note that list was used to define the required columns to keep, while c (a base R function for combining vector elements) is what is traditionally used with [.data.frame. The latter is also possible with [.data.table, but then you have to pass the variable names as a character vector and set with to FALSE:

> hflights_dt[Dest == 'BNA', c('DepTime', 'ArrTime'), with = FALSE] 

Note

Instead of list, you can use a dot as the function name in the style of the plyr package; for example: hflights_dt[, .(DepTime, ArrTime)].

Now that we are more or less familiar with our options for filtering data inside a live R session, and we know the overall syntax of the dplyr and data.table packages, let's see how these can be used to aggregate and summarize data in action.
