© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
T. Mailund, R 4 Data Science Quick Reference, https://doi.org/10.1007/978-1-4842-8780-4_4

4. Tidy Select

Thomas Mailund, Aarhus, Denmark

So-called “tidy select” is not a package you would use on its own (although you can import it; the package is called tidyselect). Rather, it is a small language for selecting columns in a data frame or tibble. Many packages in the following chapters support this language, so rather than describing it in each chapter, I have put the description in a chapter of its own, and this is as good a place as any. We cannot, however, use the language without functionality from packages that are described in later chapters, so I will use the function select from the dplyr package; see Chapter 8.

The select function helps you select columns from a tibble or a data frame. Its first argument is a table, so we can use it with a pipe operator, and the remaining arguments are the columns we want to extract.
library(tibble) # for tribble()
library(dplyr)  # for select()
tbl <- tribble(
  ~sample, ~min_size, ~max_size, ~min_weight, ~max_weight,
  "foo", 13, 16, 45.2, 67.2,
  "bar", 12, 17, 83.1, 102.5
)
tbl |> select(sample, min_size, min_weight)
## # A tibble: 2 × 3
## sample min_size min_weight
##   <chr>    <dbl>     <dbl>
## 1 foo         13      45.2
## 2 bar         12      83.1

This simple way of selecting columns is useful in itself, of course, but doesn’t add much on top of just indexing into tables. The power of tidy select is the small language that we can use instead of explicitly listing columns.
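To make the comparison with plain indexing concrete, here is a minimal sketch that selects the same three columns both ways. The variable names by_select and by_index are only illustrative, and the table is rebuilt so the snippet runs on its own.

```r
library(tibble)
library(dplyr)

tbl <- tribble(
  ~sample, ~min_size, ~max_size, ~min_weight, ~max_weight,
  "foo", 13, 16, 45.2, 67.2,
  "bar", 12, 17, 83.1, 102.5
)

# Listing the columns explicitly with select() ...
by_select <- tbl |> select(sample, min_size, min_weight)

# ... picks the same columns as indexing with a vector of names
by_index <- tbl[, c("sample", "min_size", "min_weight")]

# Compare the column names of the two results
identical(names(by_select), names(by_index))
```

Both approaches need every column spelled out, which is exactly what the selection language lets us avoid.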

Ranges

If you want a range of columns, say from min_size to max_weight, you can use : to select the columns:
tbl|>select(min_size:max_weight)
## # A tibble: 2 × 4
## min_size max_size min_weight max_weight
##    <dbl>    <dbl>      <dbl>      <dbl>
## 1     13       16       45.2       67.2
## 2     12       17       83.1      102.
This only works for contiguous columns, but you can select multiple ranges if you want:
tbl|>select(sample:min_size, min_weight:max_weight)
## # A tibble: 2 × 4
##   sample min_size min_weight max_weight
##    <chr>    <dbl>      <dbl>      <dbl>
## 1  foo         13       45.2       67.2
## 2  bar         12       83.1      102.
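Ranges do not have to be written with column names. As a small sketch (the table is rebuilt so the snippet is self-contained, and the result names are illustrative), column positions work with : as well:

```r
library(tibble)
library(dplyr)

tbl <- tribble(
  ~sample, ~min_size, ~max_size, ~min_weight, ~max_weight,
  "foo", 13, 16, 45.2, 67.2,
  "bar", 12, 17, 83.1, 102.5
)

# Columns 2 through 3 are min_size and max_size
by_position <- tbl |> select(2:3)

# The same range written with names
by_name <- tbl |> select(min_size:max_size)

names(by_position)
names(by_name)
```

Names are usually the safer choice, since positions silently change meaning if the column order changes.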

Complements

Sometimes, it is easier to specify which columns you do not want, and then you can use the complement operator !. This works for the complement of a single column, so “everything except this column”:
tbl|>select(!sample)
## # A tibble: 2 × 4
## min_size max_size min_weight max_weight
##    <dbl>    <dbl>      <dbl>     <dbl>
## 1     13       16       45.2      67.2
## 2     12       17       83.1     102.
but it also works if you select columns some other way, for example, through ranges. If sample:min_size selects the columns sample and min_size, then !(sample:min_size) is everything except those columns:
tbl|>select(!(sample:min_size))
## # A tibble: 2 × 3
##  max_size min_weight max_weight
##     <dbl>     <dbl>       <dbl>
## 1      16      45.2        67.2
## 2      17      83.1       102.
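The complement also applies to a group of columns that do not form a range. A minimal sketch, assuming we wrap the unwanted columns in c() (the result name kept is illustrative):

```r
library(tibble)
library(dplyr)

tbl <- tribble(
  ~sample, ~min_size, ~max_size, ~min_weight, ~max_weight,
  "foo", 13, 16, 45.2, 67.2,
  "bar", 12, 17, 83.1, 102.5
)

# Everything except sample and max_size, which are not adjacent
kept <- tbl |> select(!c(sample, max_size))

names(kept)
```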

Unions and Intersections

It gets a bit strange if you include both the complements of columns and the columns themselves, but you will get all the columns you ask for, so with
tbl|>select(sample,!(sample:min_size))
## # A tibble: 2 × 4
## sample max_size min_weight max_weight
##   <chr>   <dbl>     <dbl>       <dbl>
## 1 foo        16      45.2        67.2
## 2 bar        17      83.1       102.

we get the column sample, followed by all the columns that are not in the range sample:min_size.

You can explicitly ask for unions or intersections of selections using & (intersection) and | (union).
tbl|>select(
  sample:min_weight   # sample, min_size, max_size, and min_weight
  &                   # intersect with
  max_size:max_weight # max_size, min_weight, max_weight
)
## # A tibble: 2 × 2
##   max_size min_weight
##      <dbl>     <dbl>
## 1       16      45.2
## 2       17      83.1
tbl|>select(
   sample:min_weight    # sample, min_size, max_size, and min_weight
   |                    # union with
   max_size:max_weight  # max_size, min_weight, max_weight
)
## # A tibble: 2 × 5
##   sample min_size max_size min_weight max_weight
##   <chr>     <dbl>    <dbl>      <dbl>      <dbl>
## 1 foo          13      16        45.2       67.2
## 2 bar          12      17        83.1      102.
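Intersections combine naturally with complements. The following sketch, which uses only the operators introduced so far, cuts a hole in a range by intersecting it with the complement of one column (the result name holed is illustrative):

```r
library(tibble)
library(dplyr)

tbl <- tribble(
  ~sample, ~min_size, ~max_size, ~min_weight, ~max_weight,
  "foo", 13, 16, 45.2, 67.2,
  "bar", 12, 17, 83.1, 102.5
)

# The range sample:min_weight, but without min_size
holed <- tbl |> select(sample:min_weight & !min_size)

names(holed)
```

The selection keeps sample, max_size, and min_weight: the columns that are both in the range and not min_size.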

Select Columns Based on Name

There are several functions that will let you select columns based on their names. The starts_with() function will pick columns that start with a specific string:
tbl|>select(starts_with("min"))
## # A tibble: 2 × 2
##   min_size min_weight
##      <dbl>       <dbl>
## 1       13        45.2
## 2       12        83.1
Similarly, the ends_with() function will let you pick columns with names that end with a specific string:
tbl|>select(ends_with("weight"))
## # A tibble: 2 × 2
##   min_weight max_weight
##        <dbl>      <dbl>
## 1       45.2       67.2
## 2       83.1      102.
If you want to pick all columns whose names contain a string, you use the function contains():
tbl|>select(contains("_"))
## # A tibble: 2 × 4
## min_size max_size min_weight max_weight
##    <dbl>    <dbl>      <dbl>      <dbl>
## 1     13       16       45.2       67.2
## 2     12       17       83.1      102.
For more complex matching, the matches() function lets you pick columns based on regular expressions. Here, I’ll just pick the columns with names that contain an underscore again, because there really isn’t anything complicated to select for in this example table:
tbl|>select(matches(".*_.*"))
## # A tibble: 2 × 4
## min_size max_size min_weight max_weight
##    <dbl>    <dbl>      <dbl>      <dbl>
## 1     13       16       45.2       67.2
## 2     12       17       83.1      102.
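For a pattern that contains() cannot express, here is a small sketch with a slightly more selective regular expression. The pattern is my own illustration: it asks for names that are exactly min_size or max_size.

```r
library(tibble)
library(dplyr)

tbl <- tribble(
  ~sample, ~min_size, ~max_size, ~min_weight, ~max_weight,
  "foo", 13, 16, 45.2, 67.2,
  "bar", 12, 17, 83.1, 102.5
)

# ^ and $ anchor the pattern to the whole name;
# (min|max) accepts either prefix
sizes <- tbl |> select(matches("^(min|max)_size$"))

names(sizes)
```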

Everything

It might not seem like much of a feature, since selecting all columns just gives you the original table back, but there is a function, everything(), that does just that.
tbl|>select(everything())
## # A tibble: 2 × 5
##   sample min_size max_size min_weight max_weight
##   <chr>     <dbl>    <dbl>     <dbl>       <dbl>
## 1 foo          13       16      45.2        67.2
## 2 bar          12       17      83.1       102.

You wouldn’t use that with select(), of course, because it doesn’t do anything, but there are other functions that require that you select columns for some transformation of the data, and there everything() can be used to pick all the columns in a data frame.
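As one possible illustration of such a function, the sketch below uses across() from dplyr, which takes a tidy selection and a function to apply to each selected column. Converting every column to character is an arbitrary choice made for this example.

```r
library(tibble)
library(dplyr)

tbl <- tribble(
  ~sample, ~min_size, ~max_size, ~min_weight, ~max_weight,
  "foo", 13, 16, 45.2, 67.2,
  "bar", 12, 17, 83.1, 102.5
)

# Apply as.character() to all columns; everything() is the selection
chr_tbl <- tbl |> mutate(across(everything(), as.character))

# Check the column types after the transformation
vapply(chr_tbl, is.character, logical(1))
```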

Indexing from the Last Column

If you just want to select the last column, you can use the function last_col():
tbl|>select(last_col())
## # A tibble: 2 × 1
##   max_weight
##        <dbl>
## 1       67.2
## 2      102.
but you can also use the function to select by indexing from the right. If you give it an integer argument, i, it will select the column that is i from the right:
tbl|>select(last_col(0))
## # A tibble: 2 × 1
##   max_weight
##       <dbl>
## 1      67.2
## 2     102.
tbl|>select(last_col(1))
## # A tibble: 2 × 1
##   min_weight
##        <dbl>
## 1       45.2
## 2       83.1
tbl|>select(last_col(2))
## # A tibble: 2 × 1
##    max_size
##       <dbl>
## 1        16
## 2        17
tbl|>select(last_col(3))
## # A tibble: 2 × 1
##   min_size
##      <dbl>
## 1       13
## 2       12
and you can, of course, use this in ranges:
tbl|>select(last_col(3):last_col())
## # A tibble: 2 × 4
##   min_size max_size min_weight max_weight
##      <dbl>    <dbl>      <dbl>     <dbl>
## 1       13       16       45.2      67.2
## 2       12       17       83.1     102.
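Since last_col() is an ordinary selection, it also combines with the complement operator. A minimal sketch (the result names are illustrative):

```r
library(tibble)
library(dplyr)

tbl <- tribble(
  ~sample, ~min_size, ~max_size, ~min_weight, ~max_weight,
  "foo", 13, 16, 45.2, 67.2,
  "bar", 12, 17, 83.1, 102.5
)

# Drop the last column, whatever it is called
without_last <- tbl |> select(!last_col())

# Keep only the last two columns
last_two <- tbl |> select(last_col(1):last_col())

names(without_last)
names(last_two)
```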

Selecting from Strings

If you have the column names you want to select in a vector of strings
vars<-c("min_size","min_weight")
you can use the all_of() or any_of() functions to pick them:
tbl|>select(all_of(vars))
## # A tibble: 2 × 2
##   min_size min_weight
##      <dbl>   <dbl>
## 1       13    45.2
## 2       12    83.1
tbl|>select(any_of(vars))
## # A tibble: 2 × 2
##   min_size min_weight
##      <dbl>     <dbl>
## 1       13      45.2
## 2       12      83.1
The difference between the two functions is that all_of() considers it an error if vars contains a name that isn’t found in the table, while any_of() does not.
vars<-c(vars,"foo")
tbl|>select(all_of(vars))
## Error in 'select()':
## ! Can't subset columns past the end.
## x Column 'foo' doesn't exist.
tbl|>select(any_of(vars))
## # A tibble: 2 × 2
##   min_size min_weight
##      <dbl>   <dbl>
## 1       13    45.2
## 2       12    83.1
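The forgiving behavior of any_of() is handy when you want to make sure certain columns are gone without knowing whether they are present. In this sketch, the name "not_a_column" is a made-up placeholder for a column that does not exist in the table:

```r
library(tibble)
library(dplyr)

tbl <- tribble(
  ~sample, ~min_size, ~max_size, ~min_weight, ~max_weight,
  "foo", 13, 16, 45.2, 67.2,
  "bar", 12, 17, 83.1, 102.5
)

# "not_a_column" is hypothetical and not in tbl; any_of() ignores it
unwanted <- c("sample", "not_a_column")

# Remove whichever of the unwanted columns happen to exist
cleaned <- tbl |> select(!any_of(unwanted))

names(cleaned)
```

With all_of() in place of any_of(), the same call would fail because of the missing column.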

Selecting Columns Based on Their Content

Perhaps the most powerful selection function is where. You give it a function as an argument, that function is called with each column, and the columns for which the function returns TRUE are selected.

So, for example, if you only want columns that are numeric, you can use where combined with is.numeric:
tbl|>select(where(is.numeric))
## # A tibble: 2 × 4
## min_size max_size min_weight max_weight
##    <dbl>    <dbl>    <dbl>     <dbl>
## 1     13       16     45.2      67.2
## 2     12       17     83.1     102.
Or if you want columns that are numeric and whose largest value is greater than 100, you can use
tbl |> select(where(\(x) is.numeric(x) && max(x) > 100.0))
## # A tibble: 2 × 1
##   max_weight
##       <dbl>
## 1      67.2
## 2     102.

(The syntax \(args) body was introduced in R 4.1 as a short form of function(args) body, but means the same thing.)
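As a sketch of two variations, the first call below spells the predicate out with function(x), which is equivalent to the short form, and the second combines where() with a name-based helper using &. The result names heavy and weights are illustrative.

```r
library(tibble)
library(dplyr)

tbl <- tribble(
  ~sample, ~min_size, ~max_size, ~min_weight, ~max_weight,
  "foo", 13, 16, 45.2, 67.2,
  "bar", 12, 17, 83.1, 102.5
)

# Long-form anonymous function; && short-circuits, so max() is
# never called on the character column
heavy <- tbl |>
  select(where(function(x) is.numeric(x) && max(x) > 100))

# Content-based and name-based selections can be intersected
weights <- tbl |> select(where(is.numeric) & ends_with("weight"))

names(heavy)
names(weights)
```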

It Is a Growing Language, so Check for Changes

The language used by functions that support tidy select is evolving and growing, so there is likely more you can do by the time you read this. Check the documentation of the select function
?select

to get the most recent description.
