© Thomas Mailund 2018

Thomas Mailund, Domain-Specific Languages in R, https://doi.org/10.1007/978-1-4842-3588-1_5

5. Parsing and Manipulating Expressions

Thomas Mailund

(1)Aarhus N, Staden København, Denmark

A powerful feature of the R programming language is that it readily allows us to treat expressions in the language itself as data that we can examine and modify as part of a program—so-called meta-programming. From within a program we can take a piece of R code and computationally manipulate it before we evaluate it. We need to get hold of the code before it is evaluated, and there are several ways to do that. The simplest is to “quote” expressions, which leaves them as unevaluated expressions.

In this chapter, we will use the following libraries:

library(purrr)
library(rlang)
library(magrittr)

Quoting and Evaluating

If you write an expression such as the following, R will immediately try to evaluate it:

2 * x + y

It will look for the variables x and y in the current scope, and if it finds them, it will evaluate the expression; if it does not, it will report an error. By the time R has evaluated the expression, we have either a value or an error. If it is the former, the expression is essentially equivalent to the result of evaluating the expression (computation time notwithstanding). A literal expression such as this one is not something we can get hold of in a program; we get either an error or the value the expression evaluates to. If we want to get hold of the actual expression, we need to “quote” it. If we wrap the expression in a call to the function quote, then we prevent the evaluation of the expression and instead get a data structure that represents the unevaluated expression.

quote(2 * x + y)

## 2 * x + y

The class of this expression is “call.”

expr <- quote(2 * x + y)
class(expr)


## [1] "call"

It is a call because infix operators are syntactic sugar for function calls, and all function call expressions will have this type. For “call” objects, we can get their components by indexing as we would a list. The first element will be the function name, and the remaining elements will be the arguments to the function call. For binary operators, of course, there will be two arguments.

For this expression, the function call is an addition:

expr[[1]]

## `+`

expr[[2]]

## 2 * x

expr[[3]]

## y

It is an addition because multiplication has higher precedence than addition, so the expression is equivalent to the following:

(2 * x) + y

The multiplication is nested deeper in the expression than the addition. It is the first argument to the addition call, that is, the second element of the call object, and we can index into it in turn:

expr[[2]][[1]]

## `*`

expr[[2]][[2]]

## [1] 2

expr[[2]][[3]]

## x
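As a small sketch of the list-like behavior described above, we can convert a call to an ordinary list with as.list, inspect its pieces, and rebuild the call with as.call. The variable names parts and rebuilt are only illustrative.

```r
expr <- quote(2 * x + y)

# A call can be converted to a plain list of its components
parts <- as.list(expr)
length(parts)       # 3: the function and its two arguments

class(parts[[1]])   # "name": the symbol `+`
class(parts[[2]])   # "call": the nested multiplication 2 * x
class(parts[[3]])   # "name": the symbol y

# as.call() turns the list back into a call
rebuilt <- as.call(parts)
identical(rebuilt, expr)   # TRUE
```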

To evaluate a quoted expression, we can use the function eval. The following expression:

eval(quote(2 * x + y))

is equivalent to writing the literal expression shown here:

2 * x + y

The eval function provides more flexibility in how an expression is evaluated since we can modify the scope of the evaluation, something we return to in more detail in Chapter 7.
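As a brief preview of what modifying the scope means, the sketch below evaluates the same quoted expression against two different sets of variable values, first a list and then an environment. The names expr and env are only illustrative.

```r
expr <- quote(2 * x + y)

# Evaluate with variables looked up in a list
eval(expr, list(x = 2, y = 3))     # 7

# Evaluate with variables looked up in an explicit environment
env <- new.env()
assign("x", 10, envir = env)
assign("y", 1, envir = env)
eval(expr, env)                    # 21
```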

Combining quoted expressions and functions introduces additional complications, at least if we want to handle the quoting within a function call. We can, however, pass quoted expressions as parameters to a function, as shown here:

f <- function(expr) expr[[1]]
f(quote(2 * x + y))


## `+`

However, it gets more complicated if we want to provide the literal expression to the function.

f(2 * x + y)

## Error in f(2 * x + y): object 'x' not found

In the function f, when we return expr[[1]], R will first attempt to evaluate the expression, but the expression depends on the variables x and y, which are undefined. Even if we define x and y, we still do not get a “call” object that we can manipulate. We get the result of evaluating the expression.

x <- 2
y <- 3
f(2 * x + y)


## [1] 7

Using quote inside the function doesn’t help us. If we write quote(expr), we get the expression expr—a single symbol—as a result, not the argument we give to f.

f <- function(expr) {
  expr <- quote(expr)
  expr[[1]]
}
f(2 * x + y)


## Error in expr[[1]]: object of type 'symbol' is not subsettable

To get the actual argument as a quoted expression, we need to use the function substitute.

f <- function(expr) {
  expr <- substitute(expr)
  expr[[1]]
}
f(2 * x + y)


## `+`

Two things come together to make this work. First, function arguments in R are lazily evaluated, so the expr argument is never evaluated if we do not use it in an expression. So, even though x and y are not defined, we do not get any errors as long as we do not evaluate the argument to f. Second, substitute does not evaluate its argument, but it returns a quoted object where variables are replaced with the value they have in the current scope (see footnote 1). The argument to substitute does not have to be a single variable name. It can be any expression that will be considered quoted after which variable substitution is done, and the return value will be the modified quoted expression.

f <- function(expr) {
  expr <- substitute(expr + expr)
  expr
}
f(2 * x + y)


## 2 * x + y + (2 * x + y)
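The substitution step can also be seen in isolation by giving substitute an explicit list of replacements as its second argument. The following is only a sketch; the names template and show_arg are made up for the illustration.

```r
# Replace the variables a and b in the quoted expression a + b
template <- substitute(a + b, list(a = quote(2 * x), b = 1))
template                     # 2 * x + 1

# Inside a function, substitute() recovers the unevaluated argument,
# which deparse() can turn into text
show_arg <- function(arg) deparse(substitute(arg))
show_arg(2 * x + y)          # "2 * x + y"
```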

Another complication appears if we attempt to evaluate a quoted expression inside a function. You might expect these two functions to be equivalent since eval(quote(expr)) should be the same as expr, but they are not equivalent.

f <- function(expr) {
  expr + expr
}
g <- function(expr) {
  x <- substitute(expr + expr)
  eval(x)
}

If we make sure that both x and y are defined, then the function f returns twice the value of the expression.

x <- 2; y <- 3
f(2 * x + y)


## [1] 14

Function g, on the other hand, raises an error because the type of x is incorrect.

g(2 * x + y)

## Error in 2 * x: non-numeric argument to binary operator

By default, the eval function will evaluate an expression in the current scope, which inside a function will be that function’s evaluation environment. Inside g, we have defined x to be the expression we get from the call to substitute, so it is this x that is seen by eval. If you want eval to evaluate an expression in another scope, you need to give it an environment as a second argument. If you want it to evaluate the expression in the scope where the function is called, rather than inside the function scope itself, then you can get that using the parent.frame function.

g <- function(expr) {
  x <- substitute(expr + expr)
  eval(x, parent.frame())
}
g(2 * x + y)


## [1] 14

We will discuss environments, scopes, and how expressions are evaluated in more detail in Chapter 7. For the remainder of this chapter, we will focus on manipulating expressions and not on evaluating them.

Exploring Expressions

An expression is a recursive data structure, and you can explore it as such. We can define expressions in a grammar like this:

EXPRESSION ::= CONSTANT
            |  NAME
            |  PRIMITIVE
            |  PAIRLIST
            |  CALL EXPRESSION_LIST
EXPRESSION_LIST
           ::= EXPRESSION
            |  EXPRESSION EXPRESSION_LIST

We will not expand the grammar of expressions further, but just agree that they can be any legal R expressions. Every expression is one of these five kinds. The first four are terminals in the grammar, while call expressions are recursive; a call is constructed from a function and its arguments, and all these are other expressions.

We can explore expressions using recursive functions where the first three meta-variables, CONSTANT, NAME, and PRIMITIVE, are base cases that do not contain other expressions, while PAIRLIST might and CALL will contain other expressions and must be handled in recursive calls.

Of the meta-variables, CONSTANT refers to any literal data such as numbers or strings, NAME refers to any variable name, PRIMITIVE refers to a function written in C as part of the implementation of R, PAIRLIST refers to formal arguments in function definitions (more on this below), and CALL refers to function calls. Function calls capture everything more complicated than the first four options. Since everything in R that does anything is considered a function call, including such statements as function definitions and control structures, these are captured in the CALL case. As we saw earlier, calls are list-like and always have at least one element. The first element is the function that is called, and the remaining components are the arguments to that function.

To recursively explore an expression, we can write functions that test for these cases. Constants are recognized by the is.atomic function, names by the is.name function, primitives by the is.primitive function, pair lists by the is.pairlist function, and calls by the is.call function. A function for printing out an expression’s structure can look like this:

print_expression <- function(expr, indent = "") {
  if (is.atomic(expr)) {
    if (inherits(expr, "srcref")) {
      expr <- paste0("srcref = ", expr)
    }
    cat(indent, " - ", expr, "\n")

  } else if (is.name(expr)) {
    if (expr == "") {
      expr <- "MISSING"
    }
    cat(indent, " - ", expr, "\n")

  } else if (is.primitive(expr)) {
    # cat cannot print function objects, so deparse primitives first
    cat(indent, " - ", deparse(expr), "\n")

  } else if (is.pairlist(expr)) {
    cat(indent, " - ", "[\n")
    new_indent <- paste0(indent, "       ")
    vars <- names(expr)
    for (i in seq_along(expr)) {
      cat(indent, "    ", vars[i], " ->\n")
      print_expression((expr[[i]]), new_indent)
    }
    cat(indent, "    ]\n")

  } else {
    print_expression((expr[[1]]), indent)
    new_indent <- paste0("  ", indent)
    # seq_along(expr)[-1] is empty for a call without arguments,
    # whereas 2:length(expr) would count down from 2 to 1
    for (i in seq_along(expr)[-1]) {
      print_expression(expr[[i]], new_indent)
    }
  }
}

Here, we do not explicitly test for the type of calls; if the expression is not one of the first four cases, it must be the fifth. There are two special cases we handle in this printing function: source references for function definitions and missing expressions in pair lists. We discuss these next.

We can see the function in action by calling it on the expression we explored earlier.

print_expression(quote(2 * x + y))

##   -  +
##     -  *
##       -  2
##       -  x
##     -  y

The pretty-printed expression shows the structure we explored explicitly in the previous section.
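The same predicates can be used to label which case of the grammar an expression falls into. The helper below, with the made-up name expression_kind, is only a sketch of that idea; it tests for calls explicitly rather than treating them as the fall-through case.

```r
# Return the name of the grammar case an expression belongs to
expression_kind <- function(expr) {
  if (is.atomic(expr)) "CONSTANT"
  else if (is.name(expr)) "NAME"
  else if (is.primitive(expr)) "PRIMITIVE"
  else if (is.pairlist(expr)) "PAIRLIST"
  else if (is.call(expr)) "CALL"
  else "OTHER"
}

expression_kind(2)                        # "CONSTANT"
expression_kind(quote(x))                 # "NAME"
expression_kind(sum)                      # "PRIMITIVE"
expression_kind(formals(function(x) x))   # "PAIRLIST"
expression_kind(quote(2 * x + y))         # "CALL"
```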

Declaring a function is considered a function call—a call to the function function.

print_expression(quote(function(x) x))

##   -  function
##     -  [
##         x  ->
##            -  MISSING
##        ]
##     -  x
##     -  srcref = function(x) x

For a function definition, we have a call object where the first element is function, the second element is the pair list that defines the function parameters, and the third element is the function body, another expression. There is also a fourth element, the srcref, an atomic vector that captures the actual code used to define the function. In the printing function, we just print the text representation of the source reference, which we get by pasting the expression.

The argument list of a function we declare is where the pair list data structure is used. We can get the names of the formal parameters using the names function and the default arguments by indexing into the pair list. Parameters without default arguments are a special case here: the expression they contain is the empty symbol, which compares equal to the empty string. In the printing function, we make this explicit by replacing it with the string MISSING. If we have default arguments, then those are represented as expressions we can explore recursively.

print_expression(quote(function(x = 2 * 2 + 4) x))

##   -  function
##     -  [
##         x  ->
##            -  +
##              -  *
##                -  2
##                -  2
##              -  4
##        ]
##     -  x
##     -  srcref = function(x = 2 * 2 + 4) x


print_expression(quote(function(x, y = 2 * x) x + y))

##   -  function
##     -  [
##         x  ->
##            -  MISSING
##         y  ->
##            -  *
##              -  2
##              -  x
##        ]
##     -  +
##       -  x
##       -  y
##     -  srcref = function(x, y = 2 * x) x + y
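The structure shown in these printouts can also be inspected directly by indexing into the quoted function definition. In this sketch the names fdef, params, and has_default are only illustrative; the test for a missing default uses the same comparison with the empty string as print_expression.

```r
fdef <- quote(function(x, y = 2 * x) x + y)

fdef[[1]]                 # the symbol `function`
params <- fdef[[2]]       # the pair list of formal parameters
names(params)             # "x" "y"
fdef[[3]]                 # the body: x + y

# A parameter has a default unless its value is the empty symbol
has_default <- vapply(
  seq_along(params),
  function(i) !(is.name(params[[i]]) && params[[i]] == ""),
  logical(1)
)
has_default               # FALSE TRUE
```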

The usual case for function calls is that the first element in the “call” list is a symbol that refers to a function, but any expression that returns a function can be used as a function in R. This means the first element of a call can be any expression. For example, if we define a function and call it right after, the first element of the call object will be the function definition.

expr <- quote((function(x) x)(2))
print_expression(expr)


##   -  (
##     -  function
##       -  [
##           x  ->
##              -  MISSING
##          ]
##       -  x
##       -  srcref = function(x) x
##     -  2


expr[[1]]

## (function(x) x)

expr[[2]]

## [1] 2

As an example of doing something non-trivial with expressions, we can write a function that collects all unbound variables in an expression. If we recurse through an expression, we can collect all the bound and unbound symbols. To get the unbound variables, we can keep track of those that are bound and not collect those. Ignoring, at first, those variables that might be bound outside of the expression itself—in the scope where we will call the function—the variables that are bound are those that are named in a function definition. We can identify those from the pair list that is the second argument to calls to function. When recursing over expressions, we capture those and pass them on down the recursion. Aside from that, we simply collect the symbols. In the following implementation, I use the linked lists we saw earlier to collect the symbols, and I translate the symbols into characters as I collect them. I do this because I can use the character representation of symbols to check whether a symbol exists in an environment later. I use the cons function to collect symbols in a linked list.

cons <- function(car, cdr) list(car = car, cdr = cdr)
collect_symbols_rec <- function(expr, lst, bound) {
  if (is.symbol(expr) && expr != "") {
    if (as.character(expr) %in% bound) lst
    else cons(as.character(expr), lst)

  } else if (is.pairlist(expr)) {
    for (i in seq_along(expr)) {
      lst <- collect_symbols_rec(expr[[i]], lst, bound)
    }
    lst

  } else if (is.call(expr)) {
    # the parameters of a function definition are bound inside it
    if (identical(expr[[1]], as.symbol("function")))
      bound <- c(names(expr[[2]]), bound)

    for (i in seq_along(expr)) {
      lst <- collect_symbols_rec(expr[[i]], lst, bound)
    }
    lst

  } else {
    lst
  }
}

For processing the result, it is easier to work with list objects than with linked lists, so we need the lst_to_list function from earlier as well.

lst_length <- function(lst) {
  len <- 0
  while (!is.null(lst)) {
    lst <- lst$cdr
    len <- len + 1
  }
  len
}
lst_to_list <- function(lst) {
  v <- vector(mode = "list", length = lst_length(lst))
  index <- 1
  while (!is.null(lst)) {
    v[[index]] <- lst$car
    lst <- lst$cdr
    index <- index + 1
  }
  v
}

We explicitly avoid the empty symbol when we collect symbols. The empty symbol is the symbol we get when we recurse on a pair list for a function parameter without a default value. We do not consider this a variable, bound or otherwise. The way we handle symbols is straightforward. For pair lists, we collect the parameters that will be bound and recurse through the default arguments to collect any unbound variables there. As for calls, we handle the function definitions by extending the list of bound variables and then recursing. For anything else—which in practice means for any atomic value—we just return the list we called the function with. There are no unbound variables in constant values after all.

The recursive function works on a quoted expression and collects all symbols that are not bound within the expression itself. We wrap it in a function that quotes the expression, calls the recursive function, and then removes the symbols that are defined in the calling scope (the parent.frame).

collect_symbols <- function(expr) {
  expr <- substitute(expr)
  bound <- c()
  lst <- collect_symbols_rec(expr, NULL, bound)
  lst %>% lst_to_list() %>% unique() %>%
          purrr::discard(exists, parent.frame()) %>%
          unlist()
}

Here, I use the discard function from the purrr package to remove all elements that satisfy a predicate. For the predicate, I use the function exists with a second argument that is the calling environment, parent.frame. This gets rid of symbols that are defined in the scope where we call collect_symbols, including globally defined functions such as *, +, and function.

I pipe the final result through unlist to translate the list into a character vector. This is only for pretty-printing reasons. It gives nicer output when printed in the console. For programming, you can work with lists as well as with vectors.

If we get rid of variables x and y that we defined earlier, the expression 2 * x + y + z should have three unbound variables, x, y, and z. This is indeed what we find:

rm(x) ; rm(y)
collect_symbols(2 * x + y + z)


## [1] "z" "y" "x"

If we define one of the variables, for example, z, then it is no longer unbound.

z <- 3
collect_symbols(2 * x + y + z)


## [1] "y" "x"

Function definitions also bind variables, so those are not collected.

collect_symbols(function(x) 2 * x + y + z)

## [1] "y"

collect_symbols(function(x) function(y) f(2 * x + y))

## NULL

Default values can contain unbound variables; we collect those values.

collect_symbols(function(x, y = 2 * w) 2 * x + y)

## [1] "w"
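For comparison, base R has the functions all.names and all.vars, which also walk an expression and collect the names in it. They are not a replacement for collect_symbols: as far as I can tell they do not look at which variables are bound by function parameters or defined in an environment. The sketch below only illustrates the difference.

```r
expr <- quote(2 * x + y + z)

all.names(expr)   # every symbol, including the functions `+` and `*`
all.vars(expr)    # only the variables: "x" "y" "z"

# all.vars() does not track parameters, so x is reported here
# even though the function definition binds it
all.vars(quote(function(x) 2 * x + y))
```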

We are not entirely done learning about how to explore expressions yet. The actual recursive exploration of expressions is simple, as shown previously. But often, it must be combined with an evaluation of expressions. And often, this evaluation does not follow the usual rules for how expressions are evaluated because we have to evaluate some expressions while we keep others quoted. When we start manipulating how expressions are evaluated, we call it non-standard evaluation, which is the topic of Chapter 7. Here, however, I want to give you a taste of what it involves.

If we write a simple function such as this:

f <- function(expr) collect_symbols(expr)

we might expect it to give us the unbound variables in an expression, but it returns NULL, meaning that no unbound variables were found, as shown here:

f(2 + y * w)

## NULL

This is caused by a combination of two issues that arise when we program with functions that use so-called non-standard evaluation. First, when we use substitute in the collect_symbols function, we get the literal expression that collect_symbols was called with. In f, that argument is the symbol expr; the expression that f itself was called with does not get passed along. Second, the environment in which we test for a bound variable inside collect_symbols is the calling environment. When we call the function from f, the calling environment is the body of f. In this environment, the variable expr is defined (it is the formal argument of the function), so it will be considered bound.
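The first issue can be reproduced with two tiny functions. The names inner and outer are made up for this sketch; inner plays the role of collect_symbols and outer the role of f.

```r
inner <- function(expr) substitute(expr)
outer <- function(expr) inner(expr)

# Called directly, inner() sees the expression we wrote
inner(2 + y * w)   # 2 + y * w

# Called through outer(), inner() only sees the symbol `expr`
outer(2 + y * w)   # expr
```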

We will explore environments and how to program with non-standard evaluation in some detail later, but the general solution to these problems is to avoid using non-standard evaluation in functions you plan to call from other functions. It is a powerful technique for writing a domain-specific language, but keep it to the interface of the language and not the internal functions. For collect_symbols, we can get around the problem by writing another function that takes as arguments a quoted expression and an environment we should look for variables in. We can then call this function from collect_symbols when we want a non-standard evaluation and call the other function directly if we want to use it from other functions.

collect_symbols_ <- function(expr, env) {
  bound <- c()
  lst <- collect_symbols_rec(expr, NULL, bound)
  lst %>% lst_to_list() %>% unique() %>%
    purrr::discard(exists, env) %>%
    unlist()
}
collect_symbols <- function(expr) {
  collect_symbols_(substitute(expr), parent.frame())
}
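The same split between a quoting interface and a standard-evaluation worker can be sketched in a self-contained form. This is not the implementation above: it assumes base R's all.vars as a stand-in for the recursive collector, and the names free_vars_, free_vars, and describe_all are hypothetical.

```r
# Worker: takes a quoted expression and an environment, quotes nothing itself
free_vars_ <- function(expr, env) {
  vars <- all.vars(expr)
  is_bound <- vapply(vars, exists, logical(1), envir = env)
  vars[!is_bound]
}

# Interface: quotes its argument once and hands over the caller's environment
free_vars <- function(expr) {
  free_vars_(substitute(expr), parent.frame())
}

# Other functions work on already quoted expressions and call the worker
describe_all <- function(exprs, env = parent.frame()) {
  force(env)  # resolve the caller's environment before it is passed on
  lapply(exprs, free_vars_, env = env)
}

free_vars(2 + unbound_a * unbound_b)
describe_all(list(quote(unbound_a + 1), quote(unbound_b * unbound_c)))
```

Because describe_all never calls substitute on something it received from another function, it does not run into the problem shown for f.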

Manipulating Expressions

We can do more than simply inspect expressions. We can also modify them or create new ones from within programs. You cannot modify the two primitive expressions, constants and symbols. They are simply data. We can, however, modify calls and pair lists, although the second is not something we would usually do. We work with pair lists when we create new functions, but usually we either create new pair lists to set the formal arguments of a function or take the arguments from another function; we rarely modify existing pair lists. In any case, both pair lists and calls can be assigned to by indexing into their components.

To get it out of the way, the following is an example where we modify a pair list. We can construct the expression for defining a function like this:

f <- quote(function(x) 2 * x)
f


## function(x) 2 * x

This is an expression of the type “call”—it is a call to the function function that defines functions (try saying that fast)—and its second argument is the pair list that defines its arguments.

f[[2]]

## $x

If we assign to the elements in this pair list, we provide default arguments to the function. The values we assign must be quoted expressions.

f[[2]][[1]] <- quote(2 * y)
f


## function(x = 2 * y) 2 * x

To change the names of function arguments, we must change the names of the pair list components. We can do this using the names<- function.

names(f[[2]]) <- c("a")
f[[3]] <- quote(2 * a)
f


## function(a = 2 * y) 2 * a

In this example, we also saw how we could modify the function body through its third component.

Through this example, we have already seen all we need to know about how to modify call expressions. What we were modifying was simply a particular case of a call—the call to function. Any other call can be changed the same way.

expr <- quote(2 * x + y)
expr


## 2 * x + y

expr[[1]] <- as.symbol("/")
expr


## 2 * x/y

expr[[2]][[1]] <- as.symbol("+")
expr


## (2 + x)/y
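Indexing and assignment can be combined with recursion to rewrite a whole expression. The function below, with the made-up name rename_symbol, is a sketch that replaces every occurrence of one symbol with another. It rebuilds each call from a list instead of assigning into it, and it makes no attempt to respect variables bound by function definitions.

```r
rename_symbol <- function(expr, from, to) {
  if (is.name(expr)) {
    # symbols cannot be modified, so we return a new one when it matches
    if (identical(expr, as.symbol(from))) as.symbol(to) else expr
  } else if (is.call(expr)) {
    # rewrite every component, including the function position
    as.call(lapply(as.list(expr), rename_symbol, from = from, to = to))
  } else {
    # constants, pair lists, and anything else are left untouched
    expr
  }
}

rename_symbol(quote(2 * x + f(x, y)), "x", "z")   # 2 * z + f(z, y)
```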

We can construct new call objects using the call function. As its first argument, this function takes the name of the function to call, given as a string. After that, you can give it a variable number of arguments that will be evaluated before they are put into the constructed expression.

call("+", quote(2 * x), quote(y))

## 2 * x + y

call("+", call("*", 2, quote(x)), quote(y))

## 2 * x + y

If you are creating a call to a function with named arguments, rather than an operator, you can provide those to the call function as well.

call("f", a = quote(2 * x), b = quote(y))

## f(a = 2 * x, b = y)

It is essential that you quote the arguments if you do not want them evaluated. The call function will not do it for you.

z <- 2
call("+", 2 * z, quote(y))


## 4 + y
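When the components of the call are already collected in a list, base R's as.call is an alternative to call. It takes the function position as the first list element, so that element can be a symbol or any other expression. A small sketch, where fs is a hypothetical name that appears only inside a quoted expression:

```r
# Build the call from a list; the first element is the function position
as.call(list(as.symbol("+"), quote(2 * x), quote(y)))   # 2 * x + y

# The function position can be any expression, not only a symbol
as.call(list(quote(fs[[1]]), quote(x)))                 # fs[[1]](x)
```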

In the rlang package you have two additional functions for creating calls. The function lang works as the call function except that you can specify a namespace in which the called function should be found. The new_language function lets you provide the call arguments as an explicit pair list. (Later rlang releases renamed these functions to call2 and new_call.)

library(rlang)
lang("+", quote(2 * x), quote(y))


## 2 * x + y

new_language(as.symbol("+"), pairlist(quote(2 * x), quote(y)))

## 2 * x + y

The rlang package is worth exploring if you plan to do much meta-programming in R. It provides several functions for manipulating and creating expressions and functions and for managing environments. We will explore the package more in Chapter 8.

There is one extra complication if the call you are making is to function. This function needs a pair list as its second argument, so you will have to make such an object. If you want to create a function without default parameters, you need to make a list with “missing” elements at named positions. The way to make a missing argument is by calling substitute without arguments, so a function that creates a list of function parameters without default arguments can look like this:

make_args_list <- function(args) {
  res <- replicate(length(args), substitute())
  names(res) <- args
  as.pairlist(res)
}

We can use it to construct a call to function like this:

f <- call("function",
          make_args_list(c("x", "y")),
          quote(2 * x + y))
f


## function(x, y) 2 * x + y

Remember, however, that this is an expression for creating a function; it is not the function itself, and it does not behave like a function.

f(2, 3)

## Error in f(2, 3): could not find function "f"

The error message here looks a bit odd. R is not complaining that f is not a function but that the function f cannot be found. This is because R will look for functions when you use a symbol for a function call and will not confuse the value f with the function f. Here, we only have a value-version of f. Anyway, to get the actual function, we need to evaluate the call.

f <- eval(f)
f


## function (x, y)
## 2 * x + y


f(2, 3)

## [1] 7
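The lookup rule behind the odd error message above can be seen from the other side: when a name is used in function position, R skips values that are not functions and keeps searching. In this sketch we deliberately shadow sum with a number.

```r
sum <- 10                # a plain value that shadows the name `sum`
total <- sum(1, 2, 3)    # function lookup skips the value and finds base::sum
total                    # 6
rm(sum)                  # remove the shadowing value again
```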

A more direct way of creating a function is by using the new_function function from the rlang package.

f <- new_function(make_args_list(c("x", "y")),
                  quote(2 * x + y))
f


## function (x, y)
## 2 * x + y


f(2, 3)

## [1] 7
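If you prefer to stay in base R, similar results can be had with alist, as.function, and the assignment forms of formals and body. This is only a sketch of those alternatives, with h and k as illustrative names.

```r
# alist() keeps empty arguments; the last element is the function body
h <- as.function(alist(x = , y = , 2 * x + y))
h(2, 3)              # 7
names(formals(h))    # "x" "y"

# An existing function can be given new parameters and a new body
k <- function() NULL
formals(k) <- alist(x = , y = 1)
body(k) <- quote(2 * x + y)
k(2)                 # 5
```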

As a final example, we can combine the expression creating methods we have seen with the expression exploration functions from the previous section to translate expressions with unbound variables into functions. We can collect all unbound variables in an expression using the collect_symbols_ function from earlier and then use new_function to create the function.

expr_to_function <- function(expr) {
  expr <- substitute(expr)
  unbound <- collect_symbols_(expr, caller_env())
  new_function(make_args_list(unbound), expr, caller_env())
}

Here, I have used another function from rlang, caller_env. This function does the same as the parent.frame we have used earlier but with a more informative name. I recommend using caller_env over parent.frame for that reason.

We provide more arguments in this call to new_function than in the previous example where we used it. There, we provided only two arguments, the parameters of the function and its body. Here, we also provide its environment. This will be the function’s enclosing environment. It is here that the function will find the value of variables that are not local to the function itself or parameters to the function. Since we consider variables found in the caller environment as bound, we have to make sure that the function we create can also find them, so we put the function in the same environment. If this explanation is unclear to you, then return to this example after you have read Chapter 7 where we go into environments in much more detail. It should, ideally, be clearer then.
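The effect of the enclosing environment can be sketched with base R alone. Here two functions share the same parameters and body but are created with different environments, so the free variable offset resolves differently. The names env_a, env_b, add_a, and add_b are made up for the illustration, and as.function stands in for new_function.

```r
env_a <- new.env()
env_a$offset <- 10
env_b <- new.env()
env_b$offset <- 100

# Same parameters and body, different enclosing environments
add_a <- as.function(alist(x = , x + offset), envir = env_a)
add_b <- as.function(alist(x = , x + offset), envir = env_b)

add_a(1)   # 11
add_b(1)   # 101
```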

expr_to_function does exactly what we intended it to do. It creates a function from an expression, whose arguments are the unbound variables.

f <- expr_to_function(2 * x + y)
f


## function (y, x)
## 2 * x + y


f(x = 2, y = 3)

## [1] 7

g <- expr_to_function(function(x) 2 * x + y)
g


## function (y)
## function(x) 2 * x + y


g(y = 3)(x = 2)

## [1] 7

The order of the variables in the function will depend on the order in which they appear in the expression and in whatever order the unique function will leave them in. Therefore, calling the resulting function is best done with named arguments.

Footnotes

1 The substitute function will replace variables by the value they contain in the current scope or the value they have in an environment you provide as a second argument, except for variables in the global environment. Those variables are left alone. If you experiment with substitute, be aware that it behaves differently inside the scope of a function from how it behaves in the global scope.
