Chapter 8. Inside R

In the previous chapters, you learned the basics of the R programming language and saw how vectors, matrices, lists, and data frames represent data in different shapes. You also saw how to use built-in functions to solve simple problems. However, simply knowing these features does not help you solve every problem. Real-world data analysis usually involves careful and detailed transformation and aggregation of data, which can be done with a wide variety of functions, whether they are built-in or provided by extension packages.

To best use these functions rather than let them confuse you with unexpected results, you need a basic but concrete understanding of how R functions work. In this chapter, we will cover the following topics:

  • Lazy evaluation
  • Copy-on-modify mechanism
  • Lexical scoping
  • Environments

If you understand these concepts and the roles they play in your code, most R code should appear highly predictable to you, which makes you more productive both in finding bugs and in writing code that works correctly.

Understanding lazy evaluation

A big part of understanding how R works is figuring out how R functions work. After going through the previous chapters, you should know the most commonly used basic functions. However, you may still be unsure about their exact behavior. Suppose we create the following function:

test0 <- function(x, y) {
  if (x > 0) x else y
} 

The function is somewhat special because y seems to be needed only when x is not greater than zero. What if we only supply a positive number to x and ignore y? Will the function fail because we don't supply every argument in its definition? Let's find out by calling the function as follows:

test0(1)
## [1] 1 

The function works without y being supplied. It looks like we are not required to supply the values to all arguments when we call a function but only to those that are needed. If we call test0 with a negative number, y is needed:

test0(-1)
## Error in test0(-1): argument "y" is missing, with no default 

Since we did not specify the value of y, the function stopped, reporting that y is missing.

From the preceding examples, you learn that a function does not require all arguments to be specified if they are not needed to return a value. What if we insist on specifying those arguments that are not used in the function? Will they be evaluated before we call the function or not evaluated at all? Let's find out by putting a stop() function in the position of argument y. If the expression is evaluated by any means somewhere, it should stop immediately before x is returned:

test0(1, stop("Stop now"))
## [1] 1 

The output shows that stop() is never triggered, which means that the expression is not evaluated at all. If we change the value of x to a negative number, the function should stop instead:

test0(-1, stop("Stop now"))
## Error in test0(-1, stop("Stop now")): Stop now 

Now, it is very clear that stop() is evaluated in this case. The mechanism becomes quite transparent. In a function call, the expression of an argument is evaluated only when the value of the argument is needed. This mechanism is called lazy evaluation, and therefore, we can also say that the arguments of a function call are lazily evaluated, that is, evaluated only when needed.
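
To watch the moment of evaluation directly, the following minimal sketch supplies an argument that announces its own evaluation. The function name show_order is hypothetical and introduced only for illustration:

```r
# show_order() is a hypothetical helper, not one of the chapter's functions.
show_order <- function(x) {
  message("entering function body")
  x + x  # x is used twice here
}

# The braces form a single expression that prints a message when evaluated.
result <- show_order({
  message("evaluating argument")
  2
})
result
```

The message from the function body should appear before the message from the argument, which shows that the argument is not evaluated when the call is made. The argument's message should also appear only once, even though x is used twice, because R keeps the value once the argument has been evaluated.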

If you are not aware of the lazy evaluation mechanism, you may expect a call such as test0(1, rnorm(10000000)) to be extremely time consuming and to use a great deal of memory. However, lazy evaluation prevents this from happening because rnorm(10000000) is never evaluated: its value is never needed when evaluating if (x > 0) x else y. This can be verified by timing the relevant expressions in turn using system.time(), starting with the random number generation on its own:

system.time(rnorm(10000000))
## user system elapsed
## 0.91  0.01   0.92 

Generating 10 million random numbers is not an easy job. It takes nearly a second here. By contrast, evaluating a number should be the easiest thing R can do, and it is so fast that the timer itself can't tell:

system.time(1)
## user system elapsed
##  0     0      0 

Given the logic of test0 and what we know about lazy evaluation, an educated guess is that timing the following expression should also report zero:

system.time(test0(1, rnorm(10000000)))
## user system elapsed
##  0     0      0 

Lazy evaluation also applies to the default values of arguments. More precisely, the default values of function arguments are really default expressions, because the value is not available until the expression is actually evaluated. Consider the following function:

test1 <- function(x, y = stop("Stop now")) {
  if (x > 0) x else y
} 

We give y a default value that calls stop(). If lazy evaluation does not apply here, that is, if y is evaluated irrespective of whether it is needed, we should receive an error as long as we call test1() without supplying y. However, if lazy evaluation applies, calling test1() with a positive x argument should not cause an error since the stop() expression of y is never evaluated.

Let's do an experiment to find out which is true. First, we will call test1() with a positive x argument:

test1(1)
## [1] 1 

The output implies that lazy evaluation also works here. The function only uses x, and the default expression of y is not evaluated at all. If we supply a negative x argument instead, the function should stop as expected:

test1(-1)
## Error in test1(-1): Stop now 

The preceding examples demonstrate an advantage of lazy evaluation: it makes it possible to save time and avoid unnecessary evaluation of expressions. Besides, it also allows more flexible specification of default values of function arguments. For example, you can use other arguments in the expression of a function argument:

test2 <- function(x, n = floor(length(x) / 2)) {
  x[1:n]
} 

This allows you to set up the default behavior of a function in a more reasonable or desirable way, while the function arguments are still as customizable as they were without those default values.

If we call test2 without specifying n, the default behavior takes the first half of the elements of x:

test2(1:10)
## [1] 1 2 3 4 5 

The function remains flexible because you can always override its default behavior by specifying another value of n:

test2(1:10, 3)
## [1] 1 2 3 
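
Because the default expression is evaluated only when n is first used, it sees whatever value x has at that moment. The following sketch illustrates this with a hypothetical variant of test2, named first_half_positive here, that filters x before n is used:

```r
# first_half_positive() is a hypothetical variant of test2 for illustration.
first_half_positive <- function(x, n = floor(length(x) / 2)) {
  x <- x[x > 0]  # x is modified before n is used for the first time
  x[1:n]         # n is evaluated here, so it is computed from the filtered x
}

first_half_positive(c(-1, -2, 3, 4, 5, 6))
## [1] 3 4
```

The input has six elements, but only four remain after filtering, so the default n is 2 rather than 3. This is convenient when intended, but it also means that the value of a default depends on where in the function body the argument is first used.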

Like all other features, lazy evaluation has its pros and cons. Since the arguments of a function are only parsed but not evaluated when the function is called, we can only be sure that the expressions supplied to the arguments are syntactically correct. We cannot tell in advance whether those expressions will evaluate successfully.

For example, if an undefined variable appears in the default value of an argument, there will be no warning or error the moment we create the function. In the following example, we create a test3 function, which is exactly the same as test2, except that the x in the default expression of n is mistakenly written as an undefined variable m.

test3 <- function(x, n = floor(length(m) / 2)) {
  x[1:n]
} 

When we create test3, there's no warning or error because floor(length(m) / 2) is not evaluated until test3 is called and the value of n is demanded by 1:n. The function will stop only when we actually call it:

test3(1:10)
## Error in test3(1:10): object 'm' not found 

If we have m defined before test3 is called, the function works, but in an unexpected way:

m <- c(1, 2, 3)
test3(1:10)
## [1] 1 
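
One way to catch this kind of mistake before calling the function is to inspect the default expression itself. The following is a rough sketch, not a complete checker: it uses formals() to get the unevaluated default of n and all.vars() to list the variable names in it, then reports the names that are not arguments of the function. The variable names used and free_symbols are chosen here for illustration:

```r
test3 <- function(x, n = floor(length(m) / 2)) {
  x[1:n]
}

# Variable names that appear in the default expression of n
used <- all.vars(formals(test3)$n)

# Names that are not arguments of test3 must be found somewhere else
free_symbols <- setdiff(used, names(formals(test3)))
free_symbols
## [1] "m"
```

A non-empty result does not always mean a bug, since a default may refer to a variable on purpose, but it is a useful hint that the function depends on something outside its argument list.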

Another example that makes how lazy evaluation works more explicit is as follows:

test4 <- function(x, y = p) {
  p <- x + 1
  c(x, y)
} 

Note that the default value of y is p, which is not defined before the function is called, just like m in the previous example. A notable difference between the two examples is when the missing symbol in the default value of the second argument is supplied. In the previous example, m is defined before the function is called. In this example, however, p is defined inside the function before y is used.

Let's see what happens when we call the function:

test4(1)
## [1] 1 2 

It looks like the function works rather than ending up in an error. It will become easier to understand if we go through the detailed process of how test4(1) is executed:

  1. Find a function named test4.
  2. Match the given arguments to x and y, leaving both unevaluated for now.
  3. p <- x + 1 evaluates x + 1 and assigns the value to a new variable p.
  4. c(x, y) evaluates both x and y, where x takes 1 and y takes p, which just happens to get the value of x + 1, which is 2.
  5. The function returns a numeric vector c(1, 2).

Therefore, in the whole evaluation process of test4(1), no warning or error occurs because no rules are violated. The most important trick here is that p is defined just before y is used.

The preceding example helps explain how lazy evaluation works, but it is indeed bad practice. I don't recommend writing a function in this way because such a trick only makes the behavior of the function less transparent. A good practice is to keep the default expressions simple and avoid relying on symbols that are defined outside the function. Otherwise, it can be hard to predict the function's behavior or to debug it, due to its dependency on the outer environment.

Despite this, there are sensible uses of lazy evaluation too. For example, stop() can be used as the last argument of switch() to make the function stop when no cases are matched. The following function check_input() uses switch() to regulate the input of x so that it only accepts y or n and stops when other strings are supplied:
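
The order of these steps can be made visible with a traced copy of the function. The following sketch uses a hypothetical test4_traced, which is test4 with messages added. It also contrasts the default expression with a supplied argument, which is evaluated where the call is made rather than inside the function:

```r
# test4_traced() is a hypothetical copy of test4 with messages added.
test4_traced <- function(x, y = {message("default of y evaluated"); p}) {
  message("body started")
  p <- x + 1
  message("p defined")
  c(x, y)
}

# Uses the default: p is looked up inside the function, after it is defined.
test4_traced(1)

# Supplies y explicitly: this p is the caller's p, not the one in the function.
local({
  p <- 100
  test4_traced(1, y = p)
})
```

In the first call, the message from the default expression should come last, after p has been defined, and the result is 1 and 2. In the second call, the default expression is never evaluated and the result is 1 and 100, because the supplied p refers to the variable defined in the calling code.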

check_input <- function(x) {
  switch(x,
    y = message("yes"),
    n = message("no"),
    stop("Invalid input"))
} 

When x takes y, a message saying yes shows:

check_input("y")
## yes 

When x takes n, a message saying no shows:

check_input("n")
## no 

Otherwise, the function stops:

check_input("what")
## Error in check_input("what"): Invalid input 

The example works because stop() is lazily evaluated as an argument of switch().

To summarize the examples, you cannot rely too much on the parser to check the code. It only checks the syntax of the code, and it does not tell you whether the code follows good practice. To avoid the potential pitfalls caused by lazy evaluation, perform the necessary checks in the function to make sure that the input can be handled correctly.
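
As one possible way to do such checking, the following sketch adds validation to test2. The name test2_checked and the particular checks are assumptions made for illustration; the right checks depend on what the function is meant to accept:

```r
# test2_checked() is a hypothetical, stricter variant of test2.
test2_checked <- function(x, n = floor(length(x) / 2)) {
  # stopifnot() evaluates n immediately, so a bad default or a bad
  # supplied value fails here rather than somewhere later in the body.
  stopifnot(is.numeric(n), length(n) == 1, n >= 0, n <= length(x))
  x[seq_len(n)]  # seq_len(0) is empty, whereas 1:0 counts down from 1 to 0
}

test2_checked(1:10)
## [1] 1 2 3 4 5
test2_checked(1:10, 3)
## [1] 1 2 3
```

Checking the arguments at the top of the function forces their evaluation early, which gives up some laziness in exchange for errors that appear at a predictable place.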
