Computing on language

In the previous section, we introduced the functional programming facilities in R. You learned that functions are just another type of object we can pass around. When we create a new function, say fun, the environment in which we create it is associated with the function. This environment is called the enclosing environment of the function and can be accessed via environment(fun). Each time we call the function, a new executing environment that contains the unevaluated arguments (promises) is created to host the execution of the function, which enables lazy evaluation. The parent of the executing environment is the enclosing environment of the function, which enables lexical scoping.

Functional programming allows us to write code at a higher level of abstraction. Metaprogramming goes even further. It allows us to tweak the language itself and make certain language constructs easier to use in a certain scenario. Some popular R packages use metaprogramming in their functions to make things easier. In this section, I will show you the power of metaprogramming as well as its pros and cons, so that you can understand how related packages and functions work.

Before digging into how things work, let's look at a few built-in functions that use metaprogramming to make things easier.

Suppose we want to filter the built-in dataset iris for records in which each numeric column is greater than its 80th percentile, that is, greater than 80 percent of all records in that column.

The standard method is to subset the rows of the data frame by composing a logical vector:

iris[iris$Sepal.Length > quantile(iris$Sepal.Length, 0.8) &
    iris$Sepal.Width > quantile(iris$Sepal.Width, 0.8) &
    iris$Petal.Length > quantile(iris$Petal.Length, 0.8) &
    iris$Petal.Width > quantile(iris$Petal.Width, 0.8), ]
##     Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
## 110     7.2           3.6          6.1           2.5
## 118     7.7           3.8          6.7           2.2
## 132     7.9           3.8          6.4           2.0
## Species
## 110 virginica
## 118 virginica
## 132 virginica 

In the preceding code, each call of quantile() yields the 80 percent threshold for a column. Although the code works, it is quite redundant, because each time we use a column, we have to begin with iris$. In total, iris$ appears eight times.

The built-in function subset is useful to make things easier:

subset(iris,
    Sepal.Length > quantile(Sepal.Length, 0.8) &
    Sepal.Width > quantile(Sepal.Width, 0.8) &
    Petal.Length > quantile(Petal.Length, 0.8) &
    Petal.Width > quantile(Petal.Width, 0.8))
##      Sepal.Length Sepal.Width Petal.Length Petal.Width
## 110      7.2          3.6         6.1          2.5
## 118      7.7          3.8         6.7          2.2
## 132      7.9          3.8         6.4          2.0
## Species
## 110 virginica 
## 118 virginica
## 132 virginica 

The preceding code returns exactly the same results, but with cleaner code. But why does it work while omitting iris$ in the previous example does not work?

iris[Sepal.Length > quantile(Sepal.Length, 0.8) &
    Sepal.Width > quantile(Sepal.Width, 0.8) &
    Petal.Length > quantile(Petal.Length, 0.8) &
    Petal.Width > quantile(Petal.Width, 0.8), ]
## Error in `[.data.frame`(iris, Sepal.Length > quantile(Sepal.Length, 0.8) & : object 'Sepal.Length' not found 

The preceding code does not work, because Sepal.Length and other columns are not defined in the scope (or environment) where we evaluate the subsetting expression. The magic function, subset, uses metaprogramming techniques to tweak the evaluation environment of its arguments so that Sepal.Length>quantile(Sepal.Length, 0.8) is evaluated in the environment with the columns of iris.

Moreover, subset not only works with rows, but is also useful in selecting columns. For example, we can also specify the select argument by directly using the column names as variables instead of using a character vector to select columns:

subset(iris,
    Sepal.Length > quantile(Sepal.Length, 0.8) &
    Sepal.Width > quantile(Sepal.Width, 0.8) &
    Petal.Length > quantile(Petal.Length, 0.8) &
    Petal.Width > quantile(Petal.Width, 0.8),
    select = c(Sepal.Length, Petal.Length, Species))
##     Sepal.Length Petal.Length  Species
## 110     7.2          6.1      virginica
## 118     7.7          6.7      virginica
## 132     7.9          6.4      virginica 

See, subset tweaks how its second argument (subset) and third argument (select) are evaluated. The result is we can write simpler code with less redundancy.

In the next few sections, you will learn what happens behind the scenes and how it is designed to work.

Capturing and modifying expressions

When we type an expression and hit the Enter (or return) key, R will evaluate the expression and show the output. Here is an example:

rnorm(5)
## [1] 0.54744813 1.15202065 0.74930997 -0.02514251
## [5]  0.99714852 

It shows the five random numbers generated. The magic of subset is that it tweaks the environment where the argument is evaluated. This happens in two steps: first, capture the expression, and then interfere with the evaluation of the expression.

Capturing expressions as language objects

Capturing an expression means preventing the expression from being evaluated and storing the expression itself as a variable instead. The function that does this is quote(); we can call quote() to capture the expression between the parentheses:

call1 <- quote(rnorm(5))
call1
## rnorm(5) 

The preceding code does not result in five random numbers, but in the function call itself. We can use typeof() and class() to see the type and class of the resulting object, call1:

typeof(call1)
## [1] "language"
class(call1)
## [1] "call" 

We can see that call1 is essentially a language object and it is a call. We can also write a function name in quote():

name1 <- quote(rnorm)
name1
## rnorm
typeof(name1)
## [1] "symbol"
class(name1)
## [1] "name" 

In this case, we don't get a call but a symbol (or name) instead.

In fact, quote() will return a call if a function call is captured and return a symbol if a variable name is captured. The only requirement is the validity of the code to capture; that is, as long as the code is syntactically correct, quote() will return the language object that represents the captured expression itself.

Even if the function does not exist or the variable as yet is undefined, the expression can be captured on its own:

quote(pvar)
## pvar
quote(xfun(a = 1:n))
## xfun(a = 1:n) 

In the preceding language objects, pvar, xfun, and n may all be undefined as yet, but we can quote() them anyway.

It is important to understand the difference between a variable and a symbol object, and between a function and a call object. A variable is a name of an object, and a symbol object is the name itself. A function is an object that is callable, and a call object is a language object that represents such a function call, which is as yet unevaluated. In this case, rnorm is a function and it is callable (for example, rnorm(5) returns five random numbers), but quote(rnorm) returns a symbol object and quote(rnorm(5)) returns a call object, both of which are only the representations of the language itself.

We can convert the call object to a list so that we can see its internal structure:

as.list(call1)
## [[1]]
## rnorm
##
## [[2]]
## [1] 5 

This shows that the call consists of two components: the symbol of the function and one argument. We can extract objects from a call object:

call1[[1]]
## rnorm
typeof(call1[[1]])
## [1] "symbol"
class(call1[[1]])
## [1] "name" 

The first element of call1 is a symbol. We can extract the second element in the same way:

call1[[2]]
## [1] 5
typeof(call1[[2]])
## [1] "double"
class(call1[[2]])
## [1] "numeric" 

The second element of call1 is a numeric value. From the previous examples, we know that quote() captures a variable name as a symbol object and a function call as a call object. Both of them are language objects. Like typical data structures, we can use is.symbol()/is.name() and is.call() to detect whether an object is a symbol or a call, respectively. More generally, we can also use is.language() to detect both the symbol and the call.
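
As a quick check of these predicates, the following sketch applies them to a quoted name, a quoted call, and a plain number. The variable names sym, cl, and num are only illustrative and do not appear elsewhere in this section:

```r
# Illustrative objects: a symbol, a call, and a literal number
sym <- quote(rnorm)
cl <- quote(rnorm(5))
num <- 5

is.symbol(sym)    # TRUE: a quoted name is a symbol
is.call(cl)       # TRUE: a quoted function call is a call
is.call(sym)      # FALSE: a symbol is not a call
is.language(sym)  # TRUE
is.language(cl)   # TRUE
is.language(num)  # FALSE: a plain number is not a language object
```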

Another question is, "What if we call quote() on a literal value? What about a number or a string?" The following code creates a numeric value num1 and a quoted numeric value num2:

num1 <- 100
num2 <- quote(100) 

They have exactly the same representation:

num1
## [1] 100
num2
## [1] 100 

In fact, they have exactly the same value:

identical(num1, num2) 
## [1] TRUE 

Therefore, quote() does not transform a literal value (such as a number, logical value, string, and so on) to a language object, but it leaves it as it is. However, an expression that combines several literal values into a vector will still be transformed into a call object. Here is an example:

call2 <- quote(c("a", "b")) 
call2 
## c("a", "b") 

It is consistent because c() is indeed a function that combines values and vectors. Moreover, if you look at the list representation of the call using as.list(), we can see the structure of the call:

as.list(call2) 
## [[1]] 
## c 
##  
## [[2]] 
## [1] "a" 
##  
## [[3]] 
## [1] "b" 

The types of elements in the call can be revealed by str():

str(as.list(call2)) 
## List of 3 
##  $ : symbol c 
##  $ : chr "a" 
##  $ : chr "b" 

Another noteworthy fact here is that simple arithmetic calculations are captured as calls too, because they are function calls to arithmetic operators such as + and *, which are essentially built-in functions. For example, we can apply quote() to the simplest arithmetic calculation, 1 + 1:

call3 <- quote(1 + 1) 
call3 
## 1 + 1 

The arithmetic representation is preserved, but it is a call and has exactly the same structure as any other call:

is.call(call3) 
## [1] TRUE 
str(as.list(call3)) 
## List of 3 
##  $ : symbol + 
##  $ : num 1 
##  $ : num 1 

Given all the preceding knowledge about capturing an expression, we can now capture a nested call; that is, a call that contains more calls:

call4 <- quote(sqrt(1 + x ^ 2)) 
call4 
## sqrt(1 + x ^ 2) 

We can use a function in the pryr package to view the recursive structure of the call. To install the package, run install.packages("pryr"). Once the package is ready, we can call pryr::call_tree() to do that:

pryr::call_tree(call4) 
## - ()
##   - `sqrt
##   - ()
##     - `+
##     -  1
##     - ()
##       - `^
##       - `x
##       -  2

For call4, the recursive structure is printed as a tree. Each - () entry denotes a call, `var denotes a symbol object var, and the other entries are literal values. In the preceding output, we can see that symbols and calls are captured and literal values are preserved.

If you are curious about the call tree of an expression, you can always use this function because it precisely reflects the way R processes the expression.

Modifying expressions

When we capture an expression as a call object, the call can be modified as if it were a list. For example, we can change the function to call by replacing the first element of the call with another symbol:

call1 
## rnorm(5) 
call1[[1]] <- quote(runif) 
call1 
## runif(5) 

So, rnorm(5) is changed to runif(5).

We can also add a new argument to the call:

call1[[3]] <- -1 
names(call1)[[3]] <- "min" 
call1 
## runif(5, min = -1) 

The call now has another argument: min = -1.
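
Named arguments of a call can also be modified with $, much like the elements of a list. The following is a minimal sketch of that idea; it starts from a fresh call stored in an illustrative variable cl so that it does not depend on the state of call1:

```r
# Start from a fresh call so the example is self-contained
cl <- quote(runif(5, min = -1))

cl$max <- 2      # add a new named argument
cl$min <- NULL   # remove an existing named argument
cl[[2]] <- 3     # replace the first (unnamed) argument
cl
## runif(3, max = 2)
```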

Capturing expressions of function arguments

In the previous examples, you learned how to use quote() to capture a known expression, but subset works with arbitrary user-input expressions. Suppose we want to capture the expression of argument x.

The first implementation uses quote():

fun1 <- function(x) { 
  quote(x) 
} 

Let's see if fun1 can capture the input expression when we call the function with rnorm(5):

fun1(rnorm(5)) 
## x 

Obviously, quote(x) only captures x and has nothing to do with the input expression rnorm(5). To correctly capture it, we need to use substitute(). The function captures an expression and substitutes existing symbols with their expressions. The simplest usage of this function is to capture the expression of a function argument:

fun2 <- function(x) { 
  substitute(x) 
} 
fun2(rnorm(5)) 
## rnorm(5) 

With this implementation, fun2 returns the input expression rather than x because x is replaced with the input expression, in this case, rnorm(5).

The following examples demonstrate the behavior of substitute when we supply a list of language objects or literal values. In the first example, we substitute each symbol x in the given expression with 1:

substitute(x + y + x ^ 2, list(x = 1)) 
## 1 + y + 1 ^ 2 

In the second example, we substitute each symbol f that is supposed to be a function name with another quoted function name sin:

substitute(f(x + f(y)), list(f = quote(sin))) 
## sin(x + sin(y)) 

Now, we are able to capture a certain expression with quote() and user-input expression with substitute().
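
A common use of substitute() is to recover the text of an argument for labeling purposes. The following sketch assumes a hypothetical helper named describe(), which is not part of any package; it captures the argument expression, converts it to a string with deparse(), and still evaluates the argument in the usual way:

```r
# Hypothetical helper: report an argument's expression together with its value
describe <- function(x) {
  expr <- substitute(x)          # capture the unevaluated argument
  label <- deparse(expr)         # convert the language object to a string
  paste(label, "=", format(x))   # x itself is still evaluated as usual
}

describe(1 + 2)
## [1] "1 + 2 = 3"
```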

Constructing function calls

In addition to capturing expressions, we can directly build language objects with built-in functions. For example, call1 is a captured call using quote():

call1 <- quote(rnorm(5, mean = 3)) 
call1 
## rnorm(5, mean = 3) 

We can use call() to create a call of the same function with the same arguments:

call2 <- call("rnorm", 5,  mean = 3) 
call2 
## rnorm(5, mean = 3) 

Alternatively, we can convert a list of call components to a call using as.call():

call3 <- as.call(list(quote(rnorm), 5, mean = 3)) 
call3 
## rnorm(5, mean = 3) 

The three methods create identical calls; that is, they call a function of the same name and with the same arguments, which can be confirmed by calling identical() with the three resulting call objects:

identical(call1, call2) 
## [1] TRUE 
identical(call2, call3) 
## [1] TRUE 
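
Building calls from components is mostly useful when the function to call is only known at runtime. As a rough sketch, assuming the function names arrive as a character vector (fun_names is an illustrative name), we can turn each name into a symbol with as.name() and assemble the calls with as.call():

```r
# Build one call per function name; the names here are only examples
fun_names <- c("sum", "mean", "max")
calls <- lapply(fun_names, function(f) {
  as.call(list(as.name(f), quote(x)))  # equivalent to quoting f(x)
})
calls[[2]]
## mean(x)
```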

Evaluating expressions

After capturing an expression, the next step is evaluating it. This can be done with eval().

For example, if we type sin(1) and press Enter, the value will appear immediately:

sin(1) 
## [1] 0.841471 

To control the evaluation of sin(1), we can use quote() to capture the expression and then eval() to evaluate the function call:

call1 <- quote(sin(1)) 
call1 
## sin(1) 
eval(call1) 
## [1] 0.841471 

We can capture any expression that is syntactically correct, which allows us to quote() an expression that uses undefined variables:

call2 <- quote(sin(x)) 
call2 
## sin(x) 

In call2, sin(x) uses an undefined variable x. If we directly evaluate it, an error occurs:

eval(call2) 
## Error in eval(expr, envir, enclos): object 'x' not found 

This error is similar to what happens when we directly run sin(x) without x being defined:

sin(x) 
## Error in eval(expr, envir, enclos): object 'x' not found 

The difference between directly running code in the console and using eval() is that eval() allows us to provide a list in which to evaluate the given expression. In this case, we don't have to create a variable x; instead, we supply a temporary list that contains x so that the expression looks up symbols in the list:

eval(call2, list(x = 1)) 
## [1] 0.841471 

Alternatively, eval() also accepts an environment for symbol lookup. Here, we will create a new environment e1 in which we create a variable x with value 1, and then we use eval() to evaluate the call in e1:

e1 <- new.env() 
e1$x <- 1 
eval(call2, e1) 
## [1] 0.841471 

The same logic also applies when the captured expression has more undefined variables:

call3 <- quote(x ^ 2 + y ^ 2) 
call3 
## x ^ 2 + y ^ 2 

Directly evaluating the expression without a complete specification of the undefined symbols will result in an error:

eval(call3) 
## Error in eval(expr, envir, enclos): object 'x' not found 

So does a partial specification, as follows:

eval(call3, list(x = 2)) 
## Error in eval(expr, envir, enclos): object 'y' not found 

Only when we fully specify the values of the symbols in the expression can the evaluation result in a value:

eval(call3, list(x = 2, y = 3)) 
## [1] 13 

The evaluation model of eval(expr, envir, enclos) is the same as calling a function. The function body is expr, and the executing environment is envir. If envir is given as a list, then the enclosing environment is enclos, or otherwise the enclosing environment is the parent environment of envir.

This model implies the exact behavior of symbol lookup. Suppose we use an environment instead to evaluate call3. Since e1 only contains variable x, the evaluation does not proceed:

e1 <- new.env() 
e1$x <- 2 
eval(call3, e1) 
## Error in eval(expr, envir, enclos): object 'y' not found 

Then, we create a new environment e2 whose parent is e1 and which contains variable y. If we now evaluate call3 in e2, both x and y are found and the evaluation works:

e2 <- new.env(parent = e1) 
e2$y <- 3 
eval(call3, e2) 
## [1] 13 

In the preceding code, eval(call3, e2) tries to evaluate call3, with e2 being the executing environment. Now, we can go through the evaluating process to get a better understanding of how it works. The evaluation process is reflected by travelling recursively along the call tree produced by pryr::call_tree():

pryr::call_tree(call3) 
## - ()
##   - `+
##   - ()
##     - `^
##     - `x
##     -  2
##   - ()
##     - `^
##     - `y
##     -  2

First, it tries to find a function called +. It goes through e2 and e1, and does not find + until it reaches the base environment (baseenv()), where all the basic arithmetic operators are defined. Then, + needs to evaluate its arguments, so it looks for another function called ^ and finds it by going through the same flow. Then, ^ in turn needs to evaluate its arguments, so it looks for symbol x in e2. Environment e2 does not contain variable x, so it continues searching in e2's parent environment, e1, and finds x there. Finally, it looks for symbol y in e2 and finds it immediately. When the arguments a call needs are ready, the call can be evaluated to a result.

An alternative approach is to supply a list to envir and an enclosing environment:

e3 <- new.env() 
e3$y <- 3 
eval(call3, list(x = 2), e3) 
## [1] 13 

The evaluating process begins with an executing environment generated from the list whose parent environment is e3, as specified. Then, the process is exactly the same as the previous example.

Since everything we do is essentially calling functions, quote() and substitute() can capture everything, including assignment and other operations that do not look like calling functions. In fact, for example, x <- 1 is essentially calling <- with (x, 1), and length(x) <- 10 is essentially calling length<- with (x, 10).
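
The following sketch illustrates this point with illustrative variable names (assign_call, repl_call, and v). Once captured, both kinds of assignment are ordinary calls to <-; for a replacement such as length(x) <- 10, the target of the assignment is itself a call, and the replacement function `length<-` is what gets applied during evaluation:

```r
# Both assignments are ordinary calls to `<-` once captured
assign_call <- quote(x <- 1)
as.character(assign_call[[1]])
## [1] "<-"

repl_call <- quote(length(x) <- 10)
deparse(repl_call[[2]])   # the target of the assignment is itself a call
## [1] "length(x)"

# The replacement function can also be called directly
v <- 1:3
v <- `length<-`(v, 5)
v
## [1]  1  2  3 NA NA
```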

To demonstrate the point, we may construct another example in which we create a new variable.

In the following example, we supply a list to generate the executing environment and e3 as the enclosing environment:

eval(quote(z <- x + y + 1), list(x = 1), e3) 
e3$z 
## NULL 

As a result, z is not created in e3 but in a temporary executing environment created from the list. If we, instead, specify e3 as the executing environment, the variable will be created in it:

eval(quote(z <- y + 1), e3) 
e3$z 
## [1] 4 

In conclusion, eval() works in a way extremely close to the behavior of function calling, but eval() allows us to customize the evaluation of an expression by tweaking its executing and enclosing environment, which allows us to do good things such as subset as well as bad things such as follows:

eval(quote(1 + 1), list(`+` = `-`)) 
## [1] 0 

Understanding non-standard evaluation

In the previous sections, you learned how to use quote() and substitute() to capture an expression as a language object, and you learned how to use eval() to evaluate it within a given list or environment. These functions constitute the facility of metaprogramming in R and allow us to tweak standard evaluation. The main application of metaprogramming is to perform non-standard evaluation to make certain usage easier. In the following sections, we will discuss a few examples to gain a better understanding of how it works.

Implementing quick subsetting using non-standard evaluation

Often, we need to take out a certain subset from a vector. The range of the subset may be the first few elements, last few elements, or some elements in the middle.

The first two cases can be easily handled by head(x, n) and tail(x, n). The third case requires an input of the length of the vector.

For example, suppose we have an integer vector and want to take out the elements from the third up to the one five positions before the last:

x <- 1:10 
x[3:(length(x) - 5)]
## [1] 3 4 5 

The preceding subsetting expression uses x twice and looks a bit redundant. We can define a quick subsetting function that uses metaprogramming facilities to provide a special symbol that refers to the length of the input vector. The following function, qs, is a simple implementation of this idea that allows us to use dot (.) to represent the length of the input vector x:

qs <- function(x, range) {
  range <- substitute(range)
  selector <- eval(range, list(. = length(x)))
  x[selector]
}

Using this function, we can use 3:(. - 5) to represent the same range as the motivating example:

qs(x, 3:(. - 5))
## [1] 3 4 5 

We can also easily pick out a number by counting from the last element:

qs(x, . - 1)
## [1] 9 

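Based on qs(), the following function is designed to trim n elements from both margins of the input vector x; that is, it is meant to return a vector without the first n and last n elements of x. Note that the upper bound . - n - 1 used here actually drops n + 1 elements from the end; . - n would drop exactly n: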

trim_margin <- function(x, n) { 
  qs(x, (n + 1):(. - n - 1))
} 

The function looks alright, but when we call it with an ordinary input, an error occurs:

trim_margin(x, 3) 
## Error in eval(expr, envir, enclos): object 'n' not found 

How come it couldn't find n? To understand why this happens, we need to analyze the path of symbol lookup when trim_margin is called. In the next section, we will go into this in detail and introduce the concept of dynamic scoping to resolve the problem.

Understanding dynamic scoping

Before trying to tackle the problem, let's use what you have learned to analyze what went wrong. When we call trim_margin(x, 3), we call qs(x, (n + 1):(. - n - 1)) in a fresh executing environment that contains x and n. qs() is special because it uses non-standard evaluation. More specifically, it first captures range as a language object and then evaluates it with a list of additional symbols, which, at the moment, only contains . = length(x).

The error happens exactly at eval(range, list(. = length(x))). The number of margin elements to trim, n, cannot be found here. There must be something wrong with the enclosing environment of evaluation. Now, we will take a closer look at the default value of the enclos argument of eval():

eval 
## function (expr, envir = parent.frame(), enclos = if (is.list(envir) ||  
## is.pairlist(envir)) parent.frame() else baseenv())  
## .Internal(eval(expr, envir, enclos)) 
## <bytecode: 0x00000000106722c0> 
## <environment: namespace:base> 

The definition of eval() says that if we supply a list to envir, which is exactly what we have done, enclos will take parent.frame() by default, which is the calling environment of eval(); that is, the executing environment when we call qs(). Certainly, there is no n in any executing environment of qs.
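
To see what parent.frame() refers to, consider the following minimal sketch. The functions callee() and caller() are hypothetical and exist only for this illustration; callee() returns its calling environment, which turns out to be the executing environment of caller():

```r
# Hypothetical functions: callee() reports its calling environment
callee <- function() parent.frame()
caller <- function() {
  n <- 3
  env <- callee()                  # environment from which callee() was called
  identical(env, environment())   # compare with caller()'s executing environment
}
caller()
## [1] TRUE
```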

Here, trim_margin() exposes a shortcoming of using substitute() in qs(): the captured expression is only fully meaningful in the correct context, that is, the executing environment of trim_margin(), which is also the calling environment of qs(). Unfortunately, substitute() only captures the expression; it does not capture the environment in which the expression is meaningful. Therefore, we have to do it ourselves.

Now, we know where the problem comes from. The solution is simple: always use the correct enclosing environment in which the captured expression is defined. In this case, we specify enclos = parent.frame() so that eval() looks for all symbols other than . in the calling environment of qs(), that is, the executing environment of trim_margin() where n is supplied.

The following lines of code are the fixed version of qs():

qs <- function(x, range) {
  range <- substitute(range)
  selector <- eval(range, list(. = length(x)), parent.frame())
  x[selector]
}

We can test the function with the same code that went wrong previously:

trim_margin(x, 3) 
## [1] 4 5 6 

Now, the function works in the correct manner. In fact, this mechanism is the so-called dynamic scoping. Recall what you learned in the previous chapter. Each time a function is called, an executing environment is created. If a symbol cannot be found in the executing environment, it will search the enclosing environment.

With lexical scoping used in standard evaluation, the enclosing environment of a function is determined when the function is defined: it is the environment in which the function is defined.

However, with dynamic scoping used in non-standard evaluation, by contrast, the enclosing environment should be the calling environment in which the captured expression is defined so that symbols can be found either in the customized executing environment or in the enclosing environment, along with its parents.

In conclusion, when a function uses non-standard evaluation, it is important to ensure that dynamic scoping is correctly implemented.
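
The contrast between the two lookup rules can be sketched as follows. All function names here (show_n(), show_n_dynamic(), and wrapper()) are hypothetical and serve only to illustrate the idea; the first function finds n where it was defined, whereas the second evaluates the symbol n in its calling environment:

```r
n <- 1  # n in the environment where the functions are defined

# Lexical scoping: n is looked up in the enclosing environment of show_n()
show_n <- function() n

# Dynamic scoping: n is looked up in the calling environment
show_n_dynamic <- function() eval(quote(n), parent.frame())

wrapper <- function() {
  n <- 2  # local n, visible only to code evaluated in this environment
  c(lexical = show_n(), dynamic = show_n_dynamic())
}
wrapper()
## lexical dynamic
##       1       2
```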

Using formulas to capture expression and environment

To correctly implement dynamic scoping, we use parent.frame() to keep track of the environment of the expression captured by substitute(). An easier way is to use a formula to capture the expression and environment at the same time.

In the chapter on working with data, we saw that a formula is often used to represent the relationship between variables. Most model functions (such as lm()) accept a formula to specify the relationship between a response variable and explanatory variables.

In fact, a formula object is much simpler than that. It automatically captures the expressions beside ~ and the environment where it is created. For example, we can directly create a formula and store it in a variable:

formula1 <- z ~ x ^ 2 + y ^ 2 

We can see that the formula is essentially a language object with the formula class:

typeof(formula1) 
## [1] "language" 
class(formula1) 
## [1] "formula" 

If we convert the formula to a list, we can have a closer look at its structure:

str(as.list(formula1)) 
## List of 3 
##  $ : symbol ~ 
##  $ : symbol z 
##  $ : language x^2 + y^2 
##  - attr(*, "class")= chr "formula" 
##  - attr(*, ".Environment")=<environment: R_GlobalEnv>

We can see that formula1 captured not only the expressions as language objects on both sides of ~, but also the environment where it was created. In fact, a formula is merely a call of function ~ with the arguments and calling environment captured. If both sides of ~ are specified, the length of the call is 3:

is.call(formula1) 
## [1] TRUE 
length(formula1) 
## [1] 3 

To access the language objects it captured, we can extract the second and the third elements:

formula1[[2]] 
## z 
formula1[[3]] 
## x^2 + y^2 

To access the environment where it was created, we can call environment():

environment(formula1) 
## <environment: R_GlobalEnv> 
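
The captured environment matters most when a formula is created inside a function. In the following sketch, make_formula() is a hypothetical function used only for illustration; the formula it returns carries the function's executing environment with it, so the local variable n can still be found later:

```r
# Hypothetical function that creates a formula in its own executing environment
make_formula <- function() {
  n <- 10
  z ~ n + 1
}
f <- make_formula()
environment(f)$n                 # n lives in the captured environment
## [1] 10
eval(f[[3]], environment(f))     # evaluate the right side where it was created
## [1] 11
```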

A formula can also be right-sided, that is, only the right side of ~ is specified. Here is an example:

formula2 <- ~x + y 
str(as.list(formula2)) 
## List of 2 
##  $ : symbol ~ 
##  $ : language x + y 
##  - attr(*, "class")= chr "formula" 
##  - attr(*, ".Environment")=<environment: R_GlobalEnv> 

In this case, only one argument of ~ is supplied and captured so that we have a call of two language objects and we can access the expression it captured by extracting its second element:

length(formula2) 
## [1] 2 
formula2[[2]] 
## x + y 

With the knowledge of how the formula works, we can implement another version of qs() and trim_margin() using the formula.

The following function, qs2, behaves consistently with qs when range is a formula, or otherwise, it directly uses range to subset x:

qs2 <- function(x, range) {
  selector <- if (inherits(range, "formula")) {
    eval(range[[2]], list(. = length(x)), environment(range))
  } else range
  x[selector]
}

Note that we use inherits(range, "formula") to check whether range is a formula and use environment(range) to implement dynamic scoping. Then, we can use a right-sided formula to activate non-standard evaluation:

qs2(1:10, ~ 3:(. - 2))
## [1] 3 4 5 6 7 8 

Otherwise, we can use standard evaluation:

qs2(1:10, 3) 
## [1] 3 

Now, we can re-implement trim_margin with qs2 using a formula:

trim_margin2 <- function(x, n) { 
  qs2(x, ~ (n + 1):(. - n - 1))
} 

As can be verified, dynamic scoping works correctly because the formula used in trim_margin2 automatically captures the executing environment, which is also the environment where the formula and n are defined:

trim_margin2(x, 3) 
## [1] 4 5 6 

Implementing subset with metaprogramming

With the knowledge of language objects, evaluation functions, and dynamic scoping, now we have the capability to implement a version of subset.

The underlying idea of the implementation is simple:

  • Capture the row subsetting expression and evaluate it within the data frame which is, in essence, a list
  • Capture the column-selecting expression and evaluate it in a named list of integer indices
  • Use the resulting row selector (logical vector) and column selector (integer vector) to subset the data frame

Here is an implementation of the preceding logic:

subset2 <- function(x, subset = TRUE, select = TRUE) { 
  enclos <- parent.frame() 
  subset <- substitute(subset) 
  select <- substitute(select) 
  row_selector <- eval(subset, x, enclos) 
  col_envir <- as.list(seq_along(x)) 
  names(col_envir) <- colnames(x) 
  col_selector <- eval(select, col_envir, enclos) 
  x[row_selector, col_selector] 
} 

The feature of row subsetting is easier to implement than the column selecting part. To perform row subsetting, we only need to capture subset and evaluate it within the data frame.

The column selection is trickier here. We create a list of integer indices for the columns and give them the corresponding names. For example, a data frame with three columns (say, x, y, and z) needs a list of indices such as list(x = 1, y = 2, z = 3), which allows us to select columns in the form of select = c(x, y) because c(x, y) is evaluated within the list.
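
To make the column-selecting step concrete, the following sketch builds the list of indices for a small, made-up data frame df and evaluates two captured selections in it. The data frame and its contents are purely illustrative:

```r
# A small illustrative data frame
df <- data.frame(x = 1:2, y = c("a", "b"), z = c(TRUE, FALSE))

# Map each column name to its integer position
col_envir <- as.list(seq_along(df))
names(col_envir) <- colnames(df)
str(col_envir)
## List of 3
##  $ x: int 1
##  $ y: int 2
##  $ z: int 3

# Evaluating a captured selection yields column positions
eval(quote(c(x, y)), col_envir)
## [1] 1 2
eval(quote(x:z), col_envir)
## [1] 1 2 3
```

Because the symbols evaluate to integer positions, an expression such as x:z naturally expands to a range of columns.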

Now, the behavior of subset2 is very close to the built-in function subset:

subset2(mtcars, mpg >= quantile(mpg, 0.9), c(mpg, cyl, qsec)) 
##                 mpg  cyl  qsec 
## Fiat 128       32.4   4  19.47 
## Honda Civic    30.4   4  18.52 
## Toyota Corolla 33.9   4  19.90 
## Lotus Europa   30.4   4  16.90 

Both implementations allow us to use a:b to select all columns between a and b, including both sides:

subset2(mtcars, mpg >= quantile(mpg, 0.9), mpg:drat) 
##                 mpg cyl disp  hp drat
## Fiat 128       32.4   4 78.7  66 4.08
## Honda Civic    30.4   4 75.7  52 4.93
## Toyota Corolla 33.9   4 71.1  65 4.22
## Lotus Europa   30.4   4 95.1 113 3.77