Understanding how an environment works

In the previous sections, you learned about lazy evaluation, copy-on-modify, and lexical scoping. These mechanisms are closely related to a type of object called an environment. In fact, lexical scoping is made possible by environments. Although environments look quite similar to lists, they are fundamentally different in several respects. In the following sections, we will get to know the behavior of environment objects by creating and manipulating them, and see how their structure determines how R functions work.

Knowing the environment object

An environment is an object that consists of a set of names and has a parent environment. Each name (also known as a symbol or variable) points to an object. When we look up a symbol in an environment, R searches the set of symbols and returns the object the symbol points to if it exists in the environment. Otherwise, R continues the lookup in the parent environment. The following diagram illustrates the structure of an environment and the relationship between environments:

[Diagram: the structure of an environment and the relationship between environments]

In the preceding diagram, Environment 1 consists of two names (id and grades), and its parent environment is Environment 0, which consists of one name (scores). Each name in these environments points to an object stored somewhere in memory. If we look up id in Environment 1, we'll get the numeric vector it points to directly. If we look up scores instead, Environment 1 does not contain scores, so the lookup continues in its parent environment, Environment 0, and gets the value successfully. For any other name, the lookup proceeds along the chain of parent environments until the name is found, or ends with a symbol-not-found error.
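The lookup rule described above can be sketched in a few lines of R. This is only an illustration: the names env0 and env1 and the values stored in them are assumptions made up to mirror the diagram, and the functions used here (new.env(), assign(), get(), and exists()) are covered in the sections that follow.

```r
# Hypothetical environments mirroring the diagram; all values are made up.
env0 <- new.env()                # plays the role of Environment 0
env1 <- new.env(parent = env0)   # Environment 1, whose parent is Environment 0

assign("scores", c(80, 90), envir = env0)
assign("id", c(1, 2), envir = env1)
assign("grades", c("B", "A"), envir = env1)

# id is found directly in env1
get("id", envir = env1)

# scores is not in env1, so the lookup continues in its parent, env0
exists("scores", envir = env1, inherits = FALSE)
get("scores", envir = env1)

# A name defined nowhere along the chain raises an error, captured here
result <- tryCatch(
  get("no_such_symbol_here", envir = env1),
  error = function(e) conditionMessage(e)
)
result
```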

In the following sections, we will go through these concepts in detail.

Creating and chaining environments

We can create a new environment using the new.env() function:

e1 <- new.env() 

An environment is usually printed as a hexadecimal number, which is its memory address:

e1
## <environment: 0x0000000014a45748> 

Extraction operators ($ and [[) can be used to create variables in the environment, just like modifying a list:

e1$x <- 1
e1[["x"]]
## [1] 1 

However, there are three major differences between an environment and a list:

  • An environment has no index
  • An environment has a parent environment
  • Environments have reference semantics

In the following sections, we will explain them in detail.

Accessing an environment

An environment has no index. This means that we cannot subset an environment nor can we extract an element from it by index. If we try to subset the environment using a range of positions, we will get an error:

e1[1:3]
## Error in e1[1:3]: object of type 'environment' is not subsettable 

We will get a different error when we try to extract a variable from an environment by index:

e1[[1]]
## Error in e1[[1]]: wrong arguments for subsetting an environment 

The correct way to work with an environment is using names and environment-access functions. For example, we can detect whether a variable exists in an environment using exists():

exists("x", e1)
## [1] TRUE 

For an existing variable, we can call get() to retrieve its value:

get("x", e1)
## [1] 1 

We can call ls() to see all variable names in a given environment, as we mentioned in Chapter 3, Managing Your Workspace:

ls(e1)
## [1] "x" 

If we use $ or [[ to access variables that don't exist in the environment, we will get NULL, just like what we get when we extract an element from a list using a non-existing name:

e1$y
## NULL
e1[["y"]]
## NULL 

However, if we call get() on a variable that does not exist in the environment, we will receive an error, just like when we refer to a non-existing variable directly:

get("y", e1)
## Error in get("y", e1): object 'y' not found 

To handle this situation before an error occurs, we can use exists() to check for the variable before we call get() on it:

exists("y", e1)
## [1] FALSE 
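One way to combine the two steps is a small helper that checks with exists() before calling get(). The helper name get_or_default and its default argument are hypothetical, introduced only for illustration, and the sketch uses its own environment (demo_env) so that it runs on its own. Base R also provides get0(), whose ifnotfound argument serves the same purpose.

```r
# Hypothetical helper: return the value of `name` in `env`,
# or `default` if the name cannot be found.
get_or_default <- function(name, env, default = NULL) {
  if (exists(name, envir = env)) {
    get(name, envir = env)
  } else {
    default
  }
}

demo_env <- new.env()   # a fresh environment used only for this sketch
demo_env$x <- 1

# The variable exists, so its value is returned
get_or_default("x", demo_env)

# The variable does not exist, so the default is returned instead of an error
get_or_default("undefined_name_xyz", demo_env, default = NA)

# get0() from base R does the same in a single call
get0("undefined_name_xyz", envir = demo_env, ifnotfound = NA)
```

Note that exists() and get() search parent environments by default, so the helper also finds variables defined further up the chain.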

Chaining environments

An environment has a parent environment, which is the next place to look up a symbol if the symbol does not exist in the original environment. Suppose we call get() on a variable in an environment. If the variable is found directly in that environment, we get its value. Otherwise, get() will look for the variable in the parent environment.

In the following example, we will create a new environment e2, whose parent (or enclosing) environment is e1, the environment we created in the previous section:

e2 <- new.env(parent = e1) 

Different environments have different memory addresses:

e2
## <environment: 0x000000001772ef70>
e1
## <environment: 0x0000000014a45748> 

However, the parent environment of e2 is, by definition, exactly the same environment e1 refers to, which can be verified by parent.env():

parent.env(e2)
## <environment: 0x0000000014a45748> 

Now, we create a variable y in e2:

e2$y <- 2 

We can use ls() to inspect all variable names in e2:

ls(e2)
## [1] "y" 

We can also access the value of the variable using $, [[, exists(), or get():

e2$y
## [1] 2
e2[["y"]]
## [1] 2
exists("y", e2)
## [1] TRUE
get("y", e2)
## [1] 2 

However, the extraction operators ($ and [[) and the environment-access functions have a notable difference. The operators only work in the scope of a single environment, but the functions work along a chain of environments.

Note that we don't define any variable called x in e2. Not surprisingly, both operators return NULL when extracting x:

e2$x
## NULL
e2[["x"]]
## NULL 

However, the parent environment plays a role when we use exists() and get(). Since x is not found in e2, the functions will continue the search in its parent environment e1:

exists("x", e2)
## [1] TRUE
get("x", e2)
## [1] 1 

That's why we get positive results from both the preceding function calls. If we don't want the functions to search the parent environment, we can set inherits = FALSE. In this case, if the variable is not immediately available in the given environment, the search will not continue. Instead, exists() will return FALSE:

exists("x", e2, inherits = FALSE)
## [1] FALSE 

Also, the get() function will result in an error:

get("x", e2, inherits = FALSE)
## Error in get("x", e2, inherits = FALSE): object 'x' not found 

The chaining of environments works at many levels. For example, you may create an environment, e3, whose parent is e2. When you call get() on a variable from e3, the search will go along the chain of environments.
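A three-level chain can be sketched as follows. To keep the example self-contained and avoid clashing with the variables used elsewhere in this chapter, it uses the hypothetical names env_a, env_b, and env_c rather than e1, e2, and e3.

```r
# Build a chain: env_c -> env_b -> env_a
env_a <- new.env()
env_b <- new.env(parent = env_a)
env_c <- new.env(parent = env_b)

env_a$a_var <- "defined in env_a"
env_b$b_var <- "defined in env_b"

# Neither variable lives in env_c, so get() walks up the chain
get("b_var", envir = env_c)   # found one level up, in env_b
get("a_var", envir = env_c)   # found two levels up, in env_a

# The extraction operators do not follow the chain
env_c$a_var

# With inherits = FALSE, the search stops at env_c
exists("a_var", envir = env_c, inherits = FALSE)
```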

Using environments for reference semantics

Environments have reference semantics. This means that unlike data types such as atomic vectors and lists, an environment will not be copied when it is modified, whether it has multiple names or is passed as an argument to a function.

For example, we assign the value of e1 to another variable e3:

ls(e1)
## [1] "x"
e3 <- e1 

If we have two variables pointing to the same list, modifying one would make a copy first and then modify the copied version, which does not influence the other list. Reference semantics behave otherwise. No copy is made when we modify the environment through either variable. So, we can see the changes through both e1 and e3 since they point to exactly the same environment.

The following code demonstrates how reference semantics work:

e3$y
## NULL
e1$y <- 2
e3$y
## [1] 2 

First, there is no y defined in e3. Then, we created a new variable y in e1. Since e1 and e3 point to exactly the same environment, we can also access y through e3.

The same thing happens when we pass an environment as an argument to a function. Suppose we define the following function that tries to set z of e to 10:

modify <- function(e) {
  e$z <- 10
} 

If we pass a list to this function, the modification will not take effect. Instead, a local copy is created and modified, but it is discarded when the function call ends:

list1 <- list(x = 1, y = 2)
list1$z
## NULL
modify(list1)
list1$z
## NULL 

However, if we pass an environment to the function, modifying the environment does not produce a local copy but directly creates a new variable z in the environment:

e1$z
## NULL
modify(e1)
e1$z
## [1] 10 
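Because no copy is made, two names that refer to the same environment can be distinguished from two separate environments with identical(), and a function can update state held in an environment in place. The following is a minimal sketch; the names env_x, alias, other, counter, and increment are hypothetical and chosen only for this illustration.

```r
env_x <- new.env()
alias <- env_x       # a second name for the same environment
other <- new.env()   # a different, separately created environment

identical(env_x, alias)   # same environment
identical(env_x, other)   # different environments, even though both are empty

# A function that updates a counter stored in an environment.
# The caller sees the change because the environment is not copied.
increment <- function(counter_env) {
  counter_env$count <- counter_env$count + 1
  invisible(NULL)
}

counter <- new.env()
counter$count <- 0
increment(counter)
increment(counter)
counter$count
```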

Knowing the built-in environments

An environment is quite a special type of object in R, but it is used everywhere, from the implementation of a function call to the mechanism of lexical scoping. In fact, when you run a chunk of R code, you run it in a certain environment. To know which environment we are running the code in, we can call environment():

environment()
## <environment: R_GlobalEnv> 

The output says that the current environment is the global environment. In fact, when a fresh R session gets ready for user input, the working environment is always the global environment. It is in this environment that we usually create variables and functions in data analysis.

As the previous examples demonstrated, an environment is also an object we can create and work with. For example, we can assign the current environment to a variable and create new symbols in this environment:

global <- environment()
global$some_obj <- 1 

The preceding assignment is equivalent to directly calling some_obj <- 1, because we are already in the global environment. Once you run the preceding code, the global environment is modified and some_obj gets a value:

some_obj
## [1] 1 

There are other ways to access the global environment. For example, both globalenv() and .GlobalEnv refer to the global environment:

globalenv()
## <environment: R_GlobalEnv>
.GlobalEnv
## <environment: R_GlobalEnv> 

The global environment (globalenv()) is the user workspace, while the base environment (baseenv()) provides basic functions and operators:

baseenv()
## <environment: base> 

If you type base:: in the RStudio editor, a long list of functions should appear. Most of the functions we introduced in the previous chapters are defined in the base environment, including, for example, functions to create basic data structures (for example, list() and data.frame()) and operators to work with them (for example, [, :, and even +).

The global environment and the base environment are the most important built-in environments. Now, you may ask "What is the parent environment of the global environment? And what about the base environment? What about their grandparents?"

The following function can be used to find out the chain of a given environment:

parents <- function(env) {
  while (TRUE) {
    name <- environmentName(env)
    txt <- if (nzchar(name)) name else format(env)
    cat(txt, "\n")
    env <- parent.env(env)
  }
} 

The preceding function repeatedly prints the name of an environment and then moves on to its parent, so each environment printed is the parent of the one before it. Now, we can find out all levels of parent environments of the global environment:

parents(globalenv())
## R_GlobalEnv
## package:stats
## package:graphics
## package:grDevices
## package:utils
## package:datasets
## package:methods
## Autoloads
## base
## R_EmptyEnv
## Error in parent.env(env): the empty environment has no parent 

Note that the chain terminates at an environment called the empty environment, which is the only environment that has nothing in it and has no parent environment. There is also an emptyenv() function that refers to the empty environment, but parent.env(emptyenv()) will cause an error. This explains why parents() will always end up with an error.
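If the error is unwanted, one possible variation is to stop as soon as the empty environment is reached. The following sketch is not the function defined above but a hypothetical alternative, named parent_names() here, that returns the names as a character vector instead of printing them.

```r
# Hypothetical variant of parents(): collect the names along the chain
# and stop at the empty environment instead of failing.
parent_names <- function(env) {
  out <- character(0)
  repeat {
    name <- environmentName(env)
    out <- c(out, if (nzchar(name)) name else format(env))
    if (identical(env, emptyenv())) break   # nothing beyond this point
    env <- parent.env(env)
  }
  out
}

chain <- parent_names(globalenv())
chain
```

The exact contents of chain depend on which packages are attached in the session, but it should start with R_GlobalEnv and end with R_EmptyEnv.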

The chain of environments is a combination of built-in environments and package environments. We can call search() to get the search path of symbol lookup from the perspective of the global environment:

search()
## [1] ".GlobalEnv" "package:stats"
## [3] "package:graphics" "package:grDevices"
## [5] "package:utils" "package:datasets"
## [7] "package:methods" "Autoloads"
## [9] "package:base" 

Given the knowledge of symbol lookup along a chain of environments, we can figure out the process in detail of how the following code is evaluated in the global environment:

median(c(1, 2, 1 + 3)) 

The expression looks simple, but its evaluation process is more complex than it appears. First, R looks for median along the chain and finds it in the stats package environment. Then, it looks for c and finds it in the base environment. Finally, and perhaps surprisingly, it looks for + (this is also a function!), which is found in the base environment as well.
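To check where a symbol is found, we can walk the chain ourselves and stop at the first environment that defines it. The helper below, where_defined(), is a hypothetical function written only for this illustration; it is built from exists() with inherits = FALSE and parent.env(), both used earlier in this chapter.

```r
# Hypothetical helper: name of the first environment, starting from `env`,
# in which `name` is defined. Returns NA if it is not found anywhere.
# Note that environments without a name are reported as an empty string.
where_defined <- function(name, env = globalenv()) {
  while (!identical(env, emptyenv())) {
    if (exists(name, envir = env, inherits = FALSE)) {
      return(environmentName(env))
    }
    env <- parent.env(env)
  }
  NA_character_
}

where_defined("c")
where_defined("+")
where_defined("median")   # the stats package environment, if stats is attached
where_defined("no_such_symbol_here")
```

The results depend on the session: if you define your own c in the global environment, for example, where_defined("c") will report the global environment instead of base.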

In fact, each time you attach a package, the package environment is inserted right after the global environment in the search path, so it becomes the immediate parent of the global environment. If two packages export functions with conflicting names, the functions defined in the package attached later will mask the formerly defined ones, since that package becomes a closer parent to the global environment.

Understanding environments associated with a function

Environments govern symbol lookup not only at the global level but also at the function level. There are three important environments associated with a function and its execution process: the executing environment, the enclosing environment, and the calling environment.

Each time a function is called, a new environment is created to host the execution process. This is the executing environment of the function call. The arguments of the function and the variables we create in the function are actually variables in the executing environment.

Like all other environments, the executing environment of a function is created with a parent environment. That parent environment, also called the enclosing environment of the function, is the environment where the function is defined. This means that during the execution of the function, any variable that is not defined in the executing environment will be looked for in the enclosing environment. This is exactly what makes lexical scoping possible.

Sometimes it is also useful to know the calling environment, that is, the environment in which the function is called. We can use parent.frame() to get the calling environment of the currently executing function.

To demonstrate these concepts, suppose we define the following function:

simple_fun <- function() {
  cat("Executing environment: ")
  print(environment())
  cat("Enclosing environment: ")
  print(parent.env(environment()))
} 

The function does nothing but print the executing and enclosing environments when it is called:

simple_fun()
## Executing environment: <environment: 0x0000000014955db0>
## Enclosing environment: <environment: R_GlobalEnv>
simple_fun()
## Executing environment: <environment: 0x000000001488f430>
## Enclosing environment: <environment: R_GlobalEnv>
simple_fun()
## Executing environment: <environment: 0x00000000146a23c8>
## Enclosing environment: <environment: R_GlobalEnv> 

Note that each time the function is called, the executing environment is different, but the enclosing environment remains the same. In fact, when the function is defined, its enclosing environment is determined. We can call environment() over a function to get its enclosing environment:

environment(simple_fun)
## <environment: R_GlobalEnv> 
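A practical consequence of these rules is that a function created inside another function keeps access to the variables of the call that created it. The following sketch also shows parent.frame() returning the calling environment. All function names here (make_adder, add2, report_caller, and outer_fun) are hypothetical and used only for illustration.

```r
# A function that returns another function
make_adder <- function(n) {
  function(x) x + n   # n is not defined here, so it is looked up
                      # in the enclosing environment
}

add2 <- make_adder(2)
add2(5)

# The enclosing environment of add2 is the executing environment
# of the make_adder(2) call, which is where n lives
environment(add2)$n

# The calling environment is where the call happens, not where
# the function is defined
report_caller <- function() parent.frame()

outer_fun <- function() {
  caller <- report_caller()
  # report_caller() was called from this function's executing environment
  identical(caller, environment())
}
outer_fun()

# report_caller is defined at the top level, so its enclosing
# environment is not the executing environment of outer_fun()
environment(report_caller)
```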

The following example involves the three environments of three nested functions. In each function, the executing environment, enclosing environment, and calling environment are printed. If you feel you firmly understand these concepts, I suggest that you predict which environments are the same and which are different before reading the output:

f1 <- function() {
  cat("[f1] Executing in ")
  print(environment())
  cat("[f1] Enclosed by ")
  print(parent.env(environment()))
  cat("[f1] Calling from ")
  print(parent.frame())
  f2 <- function() {
    cat("[f2] Executing in ")
    print(environment())
    cat("[f2] Enclosed by ")
    print(parent.env(environment()))
    cat("[f2] Calling from ")
    print(parent.frame())
  }
  f3 <- function() {
    cat("[f3] Executing in ")
    print(environment())
    cat("[f3] Enclosed by ")
    print(parent.env(environment()))
    cat("[f3] Calling from ")
    print(parent.frame())
    f2()
  }
  f3()
} 

Let's call f1 and find out when each message is printed. The output requires some effort to read in its original form. We split the output into chunks for easier reading while preserving the order of output for consistency.

Note that temporarily created environments only have memory addresses (for example, 0x0000000016a39fe8) instead of a common name like the global environment (R_GlobalEnv). To make it easier to identify identical environments, we give the same memory addresses the same tags (for example, *A) at the end of each line of text output for the environments:

f1()
## [f1] Executing in <environment: 0x0000000016a39fe8> *A
## [f1] Enclosed by <environment: R_GlobalEnv>
## [f1] Calling from <environment: R_GlobalEnv> 

When we call f1, its associated environments are printed as expected. Then f2 and f3 are defined, and finally f3 is called, which continues producing the following text output:

## [f3] Executing in <environment: 0x0000000016a3def8> *B
## [f3] Enclosed by <environment: 0x0000000016a39fe8> *A
## [f3] Calling from <environment: 0x0000000016a39fe8> *A 

Then, f2 is called in f3, which further produces the following text output:

## [f2] Executing in <environment: 0x0000000016a41f90> *C
## [f2] Enclosed by <environment: 0x0000000016a39fe8> *A
## [f2] Calling from <environment: 0x0000000016a3def8> *B 

The printed messages show the following facts:

  • Both the enclosing environment and calling environment of f1 are the global environment
  • The enclosing environment and the calling environment of f3, as well as the enclosing environment of f2, are the executing environment of f1
  • The calling environment of f2 is the executing environment of f3

The preceding facts are consistent with how the functions are defined and called:

  • f1 is both defined and called in the global environment
  • f3 is both defined and called in f1
  • f2 is defined in f1 but called in f3

If you managed to make the right predictions, you have a good understanding of how an environment and a function basically work. To go even deeper, I strongly recommend Hadley Wickham's Advanced R (http://amzn.com/1466586966?tag=devtools-20).
