Automated calculations using the apply family of functions

In this section, you are going to learn about two very useful functions for applying an operation to subsets of data. The two functions, tapply and apply, along with a few others, form a collection of functions called apply functions. The functions in the collection are used to apply (hence the name) a function we choose over subsets of an object, and then join the results to form a single object once again. The apply functions are a defining feature of R; they remove the need to write explicit loops in many common situations in data analysis, which makes the code shorter and more elegant.

Applying a function on separate parts of a vector

The tapply function is used to apply a function over different sections of a vector and then combine the results into a single object. To do this, we need to provide three arguments for the following three parameters:

  • Vector A, which the function will operate upon (X)
  • Vector B, which defines the subsets of vector A (INDEX)
  • A function that will be applied to the subsets of vector A (FUN)

As an example, we shall use a short table, which is a random subset of six rows (out of the original 150) in the iris dataset (available in R by typing iris). These are measurements of four floral traits (first four columns) on different plants (rows) that belong to three different iris species (fifth column, Species). You can create a data.frame object such as the following example with iris=iris[sample(1:nrow(iris),6),] (note that since it is a random sample, the exact values will be different each time). The exact table being used in the examples is provided on the book's website (iris2.csv). Here is the iris dataset subset we are going to use:

> iris
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
100          5.7         2.8          4.1         1.3 versicolor
45           5.1         3.8          1.9         0.4     setosa
90           5.5         2.5          4.0         1.3 versicolor
34           5.5         4.2          1.4         0.2     setosa
38           4.9         3.6          1.4         0.1     setosa
101          6.3         3.3          6.0         2.5  virginica

Using tapply, we can quickly find out, for example, the average petal width per species, as follows:

> x = tapply(iris$Petal.Width, iris$Species, mean)
> x
    setosa versicolor  virginica
 0.2333333  1.3000000  2.5000000

The first argument, iris$Petal.Width, is the vector on which we apply our function. The second argument, iris$Species, is the vector that defines the subsets in iris$Petal.Width. Basically, all elements of iris$Petal.Width whose positions share the same value in iris$Species are treated as one group. The last argument is the function that we apply on the subsets of iris$Petal.Width; in this case, the mean function. Thus, the iris$Petal.Width vector was split into three subsets, a mean was calculated for each subset, and the results were combined once again.
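The split, apply, and combine steps described above can be sketched by hand. This is only an illustration, not how tapply is implemented internally; it assumes the six-row table can be rebuilt from the built-in dataset using the row numbers shown in the printed table, and the names iris2, groups, by_hand, and with_tapply are ours (iris2 is used so that the built-in iris object is not overwritten). The split and sapply functions are not covered in this section.

```r
# Rebuild the six-row table from the built-in dataset
# (row numbers taken from the row names of the printed table)
iris2 = datasets::iris[c(100, 45, 90, 34, 38, 101), ]

# Step 1: split the values into one vector per species
groups = split(iris2$Petal.Width, iris2$Species)

# Step 2: apply the function to each subset;
# sapply() also combines the results into a single named vector
by_hand = sapply(groups, mean)

# The same calculation in a single tapply() call
with_tapply = tapply(iris2$Petal.Width, iris2$Species, mean)

by_hand
with_tapply
```

Both objects hold the same three means under the same species names; the only difference is that tapply returns a one-dimensional array rather than a plain vector.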

The returned object of tapply is an array, which is a vector with an additional attribute stating the number and size of its dimensions. A one-dimensional array, which is what we have here, is identical to a vector in its usage. The reason that the returned object of tapply is an array, however, is that in some cases (which we will not cover here), the returned object will have more than one dimension, and thus cannot be represented by a vector (for example, when the subsets are defined by more than one grouping vector, supplied together as a list). We will further elaborate on two-dimensional (matrix) and three-dimensional (array) vector-like objects in the next chapter.
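As a minimal sketch of a case where the result has more than one dimension, we can pass two grouping vectors to tapply as a list. The second grouping vector, long_sepal, is a hypothetical one made up for this illustration, and iris2 is our own name for the six-row table rebuilt from the built-in dataset.

```r
# Rebuild the six-row table from the built-in dataset
iris2 = datasets::iris[c(100, 45, 90, 34, 38, 101), ]

# One grouping vector: the result has a single dimension
x = tapply(iris2$Petal.Width, iris2$Species, mean)
dim(x)

# Hypothetical second grouping vector: is the sepal longer than 5.4?
long_sepal = iris2$Sepal.Length > 5.4

# Two grouping vectors, supplied as a list: the result has two
# dimensions (species in rows, FALSE/TRUE in columns)
x2 = tapply(iris2$Petal.Width, list(iris2$Species, long_sepal), mean)
dim(x2)
x2
```

Combinations that do not occur in the data (for example, a versicolor plant with a short sepal) are filled with NA.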

Note that the array is named using the values in the grouping vector, so we can access any value of interest using its name as follows:

> x["setosa"]
   setosa 
0.2333333

Also, if we wish, we can transform the result to a vector using as.numeric as follows:

> as.numeric(x)
[1] 0.2333333 1.3000000 2.5000000

Note

As previously mentioned, the apply functions are similar to loops in purpose and concept, although simpler and clearer in their syntax. For example, the preceding operation can be performed using a for loop, although the code would be longer (and, arguably, less clear):

> x = NULL
> for(i in unique(iris$Species)) {
+ x = c(x, mean(iris$Petal.Width[iris$Species == i]))
+ }
> names(x) = unique(iris$Species)
> x
versicolor     setosa  virginica
 1.3000000  0.2333333  2.5000000

Here, we create an empty object (with NULL, the special value that denotes an empty object in R) and then go through the unique values in iris$Species using a loop, each time appending the mean of iris$Petal.Width for the respective species to x. Finally, we edit the names attribute of the resulting vector, using the names function, to add the unique species names.

Let's see another example with tapply involving our climatic data. Say we are interested in finding out how many stations (and which ones) have at least one missing value in their time series of precipitation amount. For an individual station (such as the one named "IZANA SP"), we could check whether its tpcp column contains at least one NA value as follows:

> any(is.na(dat[dat$station_name == "IZANA SP", "tpcp"]))
[1] TRUE

The returned value is TRUE, meaning the answer is yes. Note that the operation consisted of three steps. First, we created a subset of dat (consisting of the rows for which the station name is "IZANA SP" and the column named "tpcp"). Since the subset is taken from a single column, it was automatically simplified to a vector. Second, we checked, for each element, whether it is NA with the is.na function. Finally, we checked whether at least one element in the resulting logical vector is TRUE, with the any function.
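The three steps can also be written out one at a time. The following sketch uses a tiny made-up stand-in for dat: only the column names station_name and tpcp come from the text, while the values and the intermediate names izana and izana_na are hypothetical.

```r
# Hypothetical toy stand-in for the climatic table (values are made up)
dat = data.frame(
  station_name = c("IZANA SP", "IZANA SP", "AVILA SP", "AVILA SP"),
  tpcp = c(12.5, NA, 30.1, 8.4)
)

# Step 1: subset one station's precipitation values
# (a single column, so the result is simplified to a vector)
izana = dat[dat$station_name == "IZANA SP", "tpcp"]

# Step 2: check, element by element, whether the value is NA
izana_na = is.na(izana)

# Step 3: is at least one element of the logical vector TRUE?
any(izana_na)  # TRUE
```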

To instantly perform this operation on all stations, we can use tapply:

> result = tapply(
+ dat$tpcp,
+ dat$station_name,
+ function(x) any(is.na(x)))

This time, the vector of values we pass to tapply is dat$tpcp (since we want to look for missing values in the precipitation data) and the vector that defines the subsets is dat$station_name (since we want to apply the function on data from each station separately). Finally, the function that we apply is a user-defined one; its definition is encompassed within the tapply function call for compactness. The function takes one argument (x) and returns TRUE if x contains at least one NA value and FALSE otherwise, the same way that we did in the previous code section.

The resulting array indicates, for each station, whether at least one precipitation measurement is missing. Here are its first ten elements:

> result[1:10]
        A CORUNA ALVEDRO SP                 A CORUNA SP
                      FALSE                       FALSE
     ALBACETE LOS LLANOS SP            ALBACETE OBS. SP
                      FALSE                       FALSE
      ALMERIA AEROPUERTO SP          ASTURIAS AVILES SP
                      FALSE                       FALSE
                   AVILA SP BADAJOZ TALAVERA LA REAL SP
                      FALSE                       FALSE
    BARCELONA AEROPUERTO SP                BARCELONA SP
                      FALSE                       FALSE

To check how many stations have at least one missing value, we can simply use the sum function (see the previous chapter):

> sum(result)
[1] 11

The answer is that 11 stations have at least one NA value in their tpcp column. To see which stations these are, we can subset the result array with the array itself, since the TRUE values in that array exactly define the subset we are looking for:

> result[result]
 COLMENAR VIEJO FAMET SP    CORDOBA AEROPUERTO SP
                    TRUE                     TRUE
          GUADALAJARA SP                 IZANA SP
                    TRUE                     TRUE
                 JAEN SP PALENCIA OBSERVATORIO SP
                    TRUE                     TRUE
PAMPLONA OBSERVATORIO SP              PAMPLONA SP
                    TRUE                     TRUE
                 ROTA SP      SANTANDER CENTRO SP
                    TRUE                     TRUE
               TARIFA SP
                    TRUE

The values of the array are now unimportant (since they are all TRUE); we are actually interested only in the elements' names. The names attribute of an array (or of a vector for that matter) can be extracted with the names function, which we already met, as follows:

> names(result[result])
 [1] "COLMENAR VIEJO FAMET SP"  "CORDOBA AEROPUERTO SP"   
 [3] "GUADALAJARA SP"           "IZANA SP"                
 [5] "JAEN SP"                  "PALENCIA OBSERVATORIO SP"
 [7] "PAMPLONA OBSERVATORIO SP" "PAMPLONA SP"             
 [9] "ROTA SP"                  "SANTANDER CENTRO SP"     
[11] "TARIFA SP"

These are the names of the stations we were looking for, in the form of a character vector.
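The whole workflow (flagging stations, counting them, and extracting their names) can be tried on a small made-up table. The station names "A", "B", and "C" and all values below are hypothetical; only the column names station_name and tpcp follow the text.

```r
# Hypothetical toy table with three stations, two of which have NA values
dat = data.frame(
  station_name = c("A", "A", "B", "B", "C", "C"),
  tpcp = c(1.0, NA, 2.0, 3.0, NA, NA)
)

# One TRUE/FALSE value per station: does it have at least one NA?
result = tapply(dat$tpcp, dat$station_name, function(x) any(is.na(x)))

# TRUE counts as 1 and FALSE as 0, so the sum is the number of stations
n_missing = sum(result)

# Keep only the TRUE elements, then extract their names
stations = names(result[result])

n_missing
stations
```

Here, stations "A" and "C" are flagged, so n_missing is 2 and stations is the character vector of those two names.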

Applying a function on rows or columns of a table

The second function of the apply family that we will meet is apply. This function is also used to apply a certain function on subsets of data, but instead of operating on subsets defined by a grouping object, it does this on the margins of an array (or an object that is analogous to an array, such as a data.frame object with numeric values only). Applying a function on each row or each column of a table is, for example, such an operation. We will limit ourselves to this type of two-dimensional operation for now. In Chapter 6, Modifying Rasters and Analyzing Raster Time Series, we will see an example of apply involving three dimensions.

Similar to tapply, the first parameter of apply is the object we would like to base our calculation on (X), and the third parameter is the function we would like to apply (FUN). The second parameter (MARGIN), however, defines the dimension across which we would like to apply the function (rather than which subsets of the input, as in tapply). For example, data.frame objects (and matrices, which will be introduced in the next chapter) have two dimensions: rows (dimension number 1) and columns (dimension number 2). When the input has more than two dimensions (such as in a three-dimensional array), we can apply a function on the third dimension as well and so on, although having an array of more than three dimensions is not common in practice.
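To make the role of MARGIN concrete, here is a small sketch on a hypothetical table (tab, with made-up values). The number of results matches the size of the dimension given in MARGIN.

```r
# Hypothetical table with three rows and two numeric columns
tab = data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

# Sizes of dimension 1 (rows) and dimension 2 (columns)
dim(tab)

# MARGIN = 1: the function is applied to each row (three results)
row_sums = apply(tab, 1, sum)

# MARGIN = 2: the function is applied to each column (two results)
col_sums = apply(tab, 2, sum)

length(row_sums) == dim(tab)[1]  # TRUE
length(col_sums) == dim(tab)[2]  # TRUE
```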

Let's return to our iris example to see how apply works. Using apply, we can find out the mean measured value for each of the six individual plants by averaging the values on the first dimension (that is, the rows) as follows:

> apply(iris[, 1:4], 1, mean)
  100    45    90    34    38   101
3.475 2.800 3.325 2.825 2.500 4.525

We can also find the mean measured value for each of the four measured traits by averaging the values of the second dimension (that is, the columns) as follows:

> apply(iris[, 1:4], 2, mean)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
   5.5000000    3.3666667    3.1333333    0.9666667

Note that we are working only with the numeric part of the iris object (columns 1 to 4) since the function that we apply (mean) operates on numeric vectors.
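Instead of typing the column numbers by hand, the numeric columns can also be selected programmatically. The following is one possible sketch, not the approach used in the text: it relies on sapply (another member of the apply family, not covered here) together with is.numeric, and the names iris2 and num_cols are ours.

```r
# Rebuild the six-row table from the built-in dataset
iris2 = datasets::iris[c(100, 45, 90, 34, 38, 101), ]

# One TRUE/FALSE value per column: is the column numeric?
num_cols = sapply(iris2, is.numeric)
num_cols

# Use the logical vector to keep only the numeric columns
col_means = apply(iris2[, num_cols], 2, mean)
col_means
```

The result is the same as with iris2[, 1:4], since the first four columns are the numeric ones and Species is a factor.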

We can also pass additional arguments to apply, which will, in turn, be passed to the specific function that we apply. For example, the mean function has an additional parameter, na.rm, which we can set to TRUE within the apply function call. In that case, we will be able to, for example, find out the column means excluding the missing values:

> iris[3,2] = NA
> iris
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
100          5.7         2.8          4.1         1.3 versicolor
45           5.1         3.8          1.9         0.4     setosa
90           5.5          NA          4.0         1.3 versicolor
34           5.5         4.2          1.4         0.2     setosa
38           4.9         3.6          1.4         0.1     setosa
101          6.3         3.3          6.0         2.5  virginica
> apply(iris[, 1:4], 2, mean)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
   5.5000000           NA    3.1333333    0.9666667
> apply(iris[, 1:4], 2, mean, na.rm = TRUE)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
   5.5000000    3.5400000    3.1333333    0.9666667

Here, we first introduced an NA value to our iris table and then applied the mean function on the columns, first with the default arguments (na.rm=FALSE) and then with na.rm set to TRUE. Note that passing additional arguments can be done the same way in tapply as well.
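As a brief sketch of the same idea in tapply, we can calculate the mean sepal width per species with and without na.rm. The names iris2, with_na, and without_na are ours; iris2 is the six-row table rebuilt from the built-in dataset, with the same NA value introduced in the third row.

```r
# Rebuild the six-row table and introduce the same missing value
iris2 = datasets::iris[c(100, 45, 90, 34, 38, 101), ]
iris2[3, 2] = NA

# Default arguments: the group containing the NA value has an NA mean
with_na = tapply(iris2$Sepal.Width, iris2$Species, mean)

# na.rm = TRUE is passed on to mean(), so the NA value is ignored
without_na = tapply(iris2$Sepal.Width, iris2$Species, mean, na.rm = TRUE)

with_na
without_na
```

The missing value belongs to a versicolor plant, so only the versicolor mean changes between the two calls.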
