So-called “tidy select” is not a package you would use on its own (although you can import it, and it is called tidyselect); rather, it is a small language for selecting columns in a data frame or tibble. Many packages in the following chapters support this language, so rather than describing it in each chapter, I decided to give it a chapter of its own, and this is as good a place as any. We cannot, however, use the language without functionality from packages that are described in later chapters, so I will use the function select from the dplyr package; see Chapter 8.
The select function helps you select columns from a tibble or a data frame. Its first argument is a table, so we can use it with a pipe operator, and the remaining arguments can be the columns we want to extract.
library(tibble) # tribble()
library(dplyr)  # select()

tbl <- tribble(
  ~sample, ~min_size, ~max_size, ~min_weight, ~max_weight,
  "foo", 13, 16, 45.2, 67.2,
  "bar", 12, 17, 83.1, 102.5
)
tbl |> select(sample, min_size, min_weight)
## # A tibble: 2 × 3
## sample min_size min_weight
## <chr> <dbl> <dbl>
## 1 foo 13 45.2
## 2 bar 12 83.1
This simple way of selecting columns is useful in itself, of course, but doesn’t add much on top of just indexing into tables. The power of tidy select is the small language that we can use instead of explicitly listing columns.
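To make the comparison with plain indexing concrete, here is a minimal sketch that picks the same three columns both ways. The names by_index and by_select are only illustrative, and the table is the example table from above.

```r
library(tibble)
library(dplyr)

tbl <- tribble(
  ~sample, ~min_size, ~max_size, ~min_weight, ~max_weight,
  "foo", 13, 16, 45.2, 67.2,
  "bar", 12, 17, 83.1, 102.5
)

# Indexing with a character vector of column names
by_index <- tbl[, c("sample", "min_size", "min_weight")]

# The same columns picked with select()
by_select <- tbl |> select(sample, min_size, min_weight)

# Both approaches give the same columns in the same order
identical(names(by_index), names(by_select))
```

For an explicit list of columns, the two forms are interchangeable; the difference shows up once the selection is described rather than listed.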
Ranges
If you want a range of columns, say from min_size to max_weight, you can use : to select the columns:
tbl |> select(min_size:max_weight)
## # A tibble: 2 × 4
## min_size max_size min_weight max_weight
## <dbl> <dbl> <dbl> <dbl>
## 1 13 16 45.2 67.2
## 2 12 17 83.1 102.
This only works for contiguous columns, but you can select multiple ranges if you want:
tbl |> select(sample:min_size, min_weight:max_weight)
## # A tibble: 2 × 4
## sample min_size min_weight max_weight
## <chr> <dbl> <dbl> <dbl>
## 1 foo 13 45.2 67.2
## 2 bar 12 83.1 102.
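Ranges are not limited to names. As a small sketch, assuming the same example table, the range below is written once with column names and once with column positions; by_name and by_position are illustrative names only.

```r
library(tibble)
library(dplyr)

tbl <- tribble(
  ~sample, ~min_size, ~max_size, ~min_weight, ~max_weight,
  "foo", 13, 16, 45.2, 67.2,
  "bar", 12, 17, 83.1, 102.5
)

# A range given by column names
by_name <- tbl |> select(min_size:max_weight)

# The same range given by column positions (columns 2 through 5)
by_position <- tbl |> select(2:5)

identical(names(by_name), names(by_position))
```

Names are usually the safer choice, since positions silently pick different columns if the table layout changes.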
Complements
Sometimes, it is easier to specify which columns you do not want, and then you can use the complement operator !. This works both for the complement of a single column, so “everything except this column”:
tbl |> select(!sample)
## # A tibble: 2 × 4
## min_size max_size min_weight max_weight
## <dbl> <dbl> <dbl> <dbl>
## 1 13 16 45.2 67.2
## 2 12 17 83.1 102.
but it also works if you select columns some other way, for example, through ranges. If sample:min_size selects the columns sample and min_size, then !(sample:min_size) is everything except those columns:
tbl |> select(!(sample:min_size))
## # A tibble: 2 × 3
## max_size min_weight max_weight
## <dbl> <dbl> <dbl>
## 1 16 45.2 67.2
## 2 17 83.1 102.
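The complement also applies to columns that are not next to each other. The sketch below, which assumes the same example table, wraps the unwanted columns in c() before negating them; the name kept is illustrative.

```r
library(tibble)
library(dplyr)

tbl <- tribble(
  ~sample, ~min_size, ~max_size, ~min_weight, ~max_weight,
  "foo", 13, 16, 45.2, 67.2,
  "bar", 12, 17, 83.1, 102.5
)

# Everything except sample and max_size
kept <- tbl |> select(!c(sample, max_size))
names(kept)
```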
Unions and Intersections
It gets a bit strange if you include both the complements of columns and the columns themselves, but you will get all the columns you ask for, so with
tbl |> select(sample, !(sample:min_size))
## # A tibble: 2 × 4
## sample max_size min_weight max_weight
## <chr> <dbl> <dbl> <dbl>
## 1 foo 16 45.2 67.2
## 2 bar 17 83.1 102.
we first get the column sample, and then all the columns that are not in the range sample:min_size.
You can explicitly ask for unions or intersections of selections using & (intersection) and | (union).
tbl |> select(
  sample:min_weight     # sample, min_size, max_size, and min_weight
  &                     # intersect with
  max_size:max_weight   # max_size, min_weight, max_weight
)
## # A tibble: 2 × 2
## max_size min_weight
## <dbl> <dbl>
## 1 16 45.2
## 2 17 83.1
tbl |> select(
  sample:min_weight     # sample, min_size, max_size, and min_weight
  |                     # union with
  max_size:max_weight   # max_size, min_weight, max_weight
)
## # A tibble: 2 × 5
## sample min_size max_size min_weight max_weight
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 foo 13 16 45.2 67.2
## 2 bar 12 17 83.1 102.
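Intersection and complement can be combined to express a set difference, that is, "these columns, but not those". The following is a sketch on the same example table; outer_cols is an illustrative name.

```r
library(tibble)
library(dplyr)

tbl <- tribble(
  ~sample, ~min_size, ~max_size, ~min_weight, ~max_weight,
  "foo", 13, 16, 45.2, 67.2,
  "bar", 12, 17, 83.1, 102.5
)

# All measurement columns, minus the two in the middle
outer_cols <- tbl |> select(
  min_size:max_weight     # min_size, max_size, min_weight, max_weight
  &                       # intersect with
  !(max_size:min_weight)  # everything except max_size and min_weight
)
names(outer_cols)
```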
Select Columns Based on Name
There are several functions that will let you select columns based on their names. The starts_with() function will pick columns that start with a specific string:
tbl |> select(starts_with("min"))
## # A tibble: 2 × 2
## min_size min_weight
## <dbl> <dbl>
## 1 13 45.2
## 2 12 83.1
Similarly, the ends_with() function will let you pick columns with names that end with a specific string:
tbl |> select(ends_with("weight"))
## # A tibble: 2 × 2
## min_weight max_weight
## <dbl> <dbl>
## 1 45.2 67.2
## 2 83.1 102.
If you want to pick all columns whose names contain a string, you use the function contains():
tbl |> select(contains("_"))
## # A tibble: 2 × 4
## min_size max_size min_weight max_weight
## <dbl> <dbl> <dbl> <dbl>
## 1 13 16 45.2 67.2
## 2 12 17 83.1 102.
For more complex matching, the matches() function lets you pick columns based on regular expressions. Here, I’ll just pick the columns with names that contain an underscore again, because there really isn’t anything complicated to select for in this example table:
tbl |> select(matches(".*_.*"))
## # A tibble: 2 × 4
## min_size max_size min_weight max_weight
## <dbl> <dbl> <dbl> <dbl>
## 1 13 16 45.2 67.2
## 2 12 17 83.1 102.
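Two details of the name-based helpers are worth seeing in code. The sketch below, on the same example table, uses an anchored regular expression with matches() and shows that starts_with() ignores case unless told otherwise through its ignore.case argument. The variable names are illustrative.

```r
library(tibble)
library(dplyr)

tbl <- tribble(
  ~sample, ~min_size, ~max_size, ~min_weight, ~max_weight,
  "foo", 13, 16, 45.2, 67.2,
  "bar", 12, 17, 83.1, 102.5
)

# An anchored regular expression: names ending in "_size"
sizes <- tbl |> select(matches("_size$"))
names(sizes)

# Matching ignores case by default, so "MIN" still finds min_size and min_weight
loose <- tbl |> select(starts_with("MIN"))
names(loose)

# With ignore.case = FALSE nothing matches, and we get a table with no columns
strict <- tbl |> select(starts_with("MIN", ignore.case = FALSE))
ncol(strict)
```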
Everything
It might not seem like much of a feature, since selecting all columns just gives you the original table back, but there is a function, everything(), that does just that.
tbl |> select(everything())
## # A tibble: 2 × 5
## sample min_size max_size min_weight max_weight
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 foo 13 16 45.2 67.2
## 2 bar 12 17 83.1 102.
You wouldn’t use that with select(), of course, because it doesn’t do anything, but there are other functions that require that you select columns for some transformation of the data, and there everything() can be used to pick all the columns in a data frame.
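As one possible illustration of such a function, the sketch below uses across() from dplyr inside mutate() to convert every column to character. This assumes a version of dplyr that provides across() (1.0 or later); the name as_text is illustrative.

```r
library(tibble)
library(dplyr)

tbl <- tribble(
  ~sample, ~min_size, ~max_size, ~min_weight, ~max_weight,
  "foo", 13, 16, 45.2, 67.2,
  "bar", 12, 17, 83.1, 102.5
)

# Apply as.character() to all columns; everything() says which columns to transform
as_text <- tbl |> mutate(across(everything(), as.character))

# Check the type of each column after the transformation
sapply(as_text, class)
```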
Indexing from the Last Column
If you just want to select the last column, you can use the function last_col():
tbl |> select(last_col())
## # A tibble: 2 × 1
## max_weight
## <dbl>
## 1 67.2
## 2 102.
but you can also use the function to select by indexing from the right. If you give it an integer argument, i, it will select the column that is i from the right, counting from zero:
tbl |> select(last_col(0))
## # A tibble: 2 × 1
## max_weight
## <dbl>
## 1 67.2
## 2 102.
tbl |> select(last_col(1))
## # A tibble: 2 × 1
## min_weight
## <dbl>
## 1 45.2
## 2 83.1
tbl |> select(last_col(2))
## # A tibble: 2 × 1
## max_size
## <dbl>
## 1 16
## 2 17
tbl |> select(last_col(3))
## # A tibble: 2 × 1
## min_size
## <dbl>
## 1 13
## 2 12
and you can, of course, use this in ranges:
tbl |> select(last_col(3):last_col())
## # A tibble: 2 × 4
## min_size max_size min_weight max_weight
## <dbl> <dbl> <dbl> <dbl>
## 1 13 16 45.2 67.2
## 2 12 17 83.1 102.
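Since last_col() is an ordinary selection, it also combines with the complement operator. A short sketch on the same example table, with an illustrative variable name:

```r
library(tibble)
library(dplyr)

tbl <- tribble(
  ~sample, ~min_size, ~max_size, ~min_weight, ~max_weight,
  "foo", 13, 16, 45.2, 67.2,
  "bar", 12, 17, 83.1, 102.5
)

# Everything except the last column
all_but_last <- tbl |> select(!last_col())
names(all_but_last)
```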
Selecting from Strings
If you have the column names you want to select in a vector of strings
vars <- c("min_size", "min_weight")
you can use the all_of() or any_of() functions to pick them:
tbl |> select(all_of(vars))
## # A tibble: 2 × 2
## min_size min_weight
## <dbl> <dbl>
## 1 13 45.2
## 2 12 83.1
tbl |> select(any_of(vars))
## # A tibble: 2 × 2
## min_size min_weight
## <dbl> <dbl>
## 1 13 45.2
## 2 12 83.1
The difference between the two functions is that all_of() considers it an error if vars contains a name that isn’t found in the table, while any_of() does not.
vars <- c(vars, "foo")
tbl |> select(all_of(vars))
## Error in 'select()':
## ! Can't subset columns past the end.
## x Column 'foo' doesn't exist.
tbl |> select(any_of(vars))
## # A tibble: 2 × 2
## min_size min_weight
## <dbl> <dbl>
## 1 13 45.2
## 2 12 83.1
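The forgiving behavior of any_of() is handy together with the complement operator, when you want to drop columns that may or may not be present. The following is a sketch on the same example table; the vector unwanted and its contents are made up for the illustration.

```r
library(tibble)
library(dplyr)

tbl <- tribble(
  ~sample, ~min_size, ~max_size, ~min_weight, ~max_weight,
  "foo", 13, 16, 45.2, 67.2,
  "bar", 12, 17, 83.1, 102.5
)

# "sample" exists in the table, "foo" does not
unwanted <- c("sample", "foo")

# Drop whichever of the unwanted columns are present; missing ones are ignored
cleaned <- tbl |> select(!any_of(unwanted))
names(cleaned)
```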
Selecting Columns Based on Their Content
Perhaps the most powerful selection function is where. You give it a function as an argument; that function is called with each column, and the columns for which it returns TRUE are selected.
So, for example, if you only want columns that are numeric, you can use where combined with is.numeric:
tbl |> select(where(is.numeric))
## # A tibble: 2 × 4
## min_size max_size min_weight max_weight
## <dbl> <dbl> <dbl> <dbl>
## 1 13 16 45.2 67.2
## 2 12 17 83.1 102.
Or if you want columns that are numeric and the largest value is greater than 100, you can use
tbl |> select(where(\(x) is.numeric(x) && max(x) > 100.0))
## # A tibble: 2 × 1
## max_weight
## <dbl>
## 1 67.2
## 2 102.
(The syntax \(args) body was introduced in R 4.1 as a short form of function(args) body and means the same thing.)
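A where() selection combines with the other parts of the language like any other selection. The sketch below, again on the example table, intersects a content-based test with a name-based one, and uses a second predicate that looks at the values in each column. The variable names are illustrative.

```r
library(tibble)
library(dplyr)

tbl <- tribble(
  ~sample, ~min_size, ~max_size, ~min_weight, ~max_weight,
  "foo", 13, 16, 45.2, 67.2,
  "bar", 12, 17, 83.1, 102.5
)

# Numeric columns whose names end in "weight"
weights <- tbl |> select(where(is.numeric) & ends_with("weight"))
names(weights)

# Numeric columns where every value is below 20
small <- tbl |> select(where(\(x) is.numeric(x) && all(x < 20)))
names(small)
```

The && in the predicate matters: it skips the comparison for non-numeric columns such as sample, where all(x < 20) would not be a meaningful test.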
It Is a Growing Language, so Check for Changes
The language used by functions that support tidy select is evolving and growing, so there is likely more you can do by the time you read this. Check the documentation of the select function to get the most recent description.