Filtering data by string matching

Although some filtering algorithms were already discussed in the previous chapters, the dplyr package contains some magic features that have not yet been covered and are worth mentioning here. As we all know by this time, the subset function in base, or the filter function from dplyr is used for filtering rows, and the select function can be used to choose a subset of columns.

The function filtering rows usually takes an R expression, which returns the IDs of the rows to drop, similar to the which function. On the other hand, providing such R expressions to describe column names is often more problematic for the select function; it's harder if not impossible to evaluate R expressions on column names.

The dplyr package provides some useful functions to select some columns of the data, based on column name patterns. For example, we can keep only the variables ending with the string, delay:

> library(dplyr)
> library(hflights)
> str(select(hflights, ends_with("delay")))
'data.frame':  227496 obs. of  2 variables:
 $ ArrDelay: int  -10 -9 -8 3 -3 -7 -1 -16 44 43 ...
 $ DepDelay: int  0 1 -8 3 5 -1 -1 -5 43 43 ...

Of course, there is a similar helper function to check the first characters of the column names with starts_with, and both functions can ignore (by default) or take into account the upper or lower case of the characters with the ignore.case parameter. And we have the more general, contains function, looking for substrings in the column names:

> str(select(hflights, contains("T", ignore.case = FALSE)))
'data.frame':  227496 obs. of  7 variables:
 $ DepTime          : int  1400 1401 1352 1403 1405 ...
 $ ArrTime          : int  1500 1501 1502 1513 1507 ...
 $ TailNum          : chr  "N576AA" "N557AA" "N541AA" "N403AA" ...
 $ ActualElapsedTime: int  60 60 70 70 62 64 70 59 71 70 ...
 $ AirTime          : int  40 45 48 39 44 45 43 40 41 45 ...
 $ TaxiIn           : int  7 6 5 9 9 6 12 7 8 6 ...
 $ TaxiOut          : int  13 9 17 22 9 13 15 12 22 19 ...

The other option is that we might need a more complex approach with regular expressions, which is another extremely important skill for data scientists. Now, we will provide a regular expression to the matches function, which is to be fitted against all the columns names. Let's select all the columns with a name comprising of 5 or 6 characters:

> str(select(hflights, matches("^[[:alpha:]]{5,6}$")))
'data.frame':  227496 obs. of  3 variables:
 $ Month : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Origin: chr  "IAH" "IAH" "IAH" "IAH" ...
 $ TaxiIn: int  7 6 5 9 9 6 12 7 8 6 ...

We can keep all column names that do not match a regular expression by using a negative sign before the expression. For example, let's identify the most frequent number of characters in the columns' names:

> table(nchar(names(hflights)))
 4  5  6  7  8  9 10 13 16 17 
 2  1  2  5  4  3  1  1  1  1

And then, let's remove all the columns with 7 or 8 characters from the dataset. Now, we will show the column names from the filtered dataset:

> names(select(hflights, -matches("^[[:alpha:]]{7,8}$")))
 [1] "Year"              "Month"             "DayofMonth"       
 [4] "DayOfWeek"         "UniqueCarrier"     "FlightNum"        
 [7] "ActualElapsedTime" "Origin"            "Dest"             
[10] "TaxiIn"            "Cancelled"         "CancellationCode"
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.145.179.35