In this section, you are going to learn about two very useful functions to apply an operation on the subsets of data. The two functions, tapply
and apply
, along with a few others, form a collection of functions called apply
functions. The functions in the collection are used to apply (hence the name) a function we choose over subsets of an object, and then join the results to form a single object once again. The apply
functions are a defining feature of R; they replace the necessity to write explicit loops in many common situations in data analysis, which makes the code shorter and more elegant.
The tapply
function is used to apply a function over different sections of a vector and then combine the results into a single object. To do this, we need to provide three arguments for the following three parameters:
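The vector on which we apply the function (X)
The vector that defines the subsets (INDEX)
The function to apply (FUN)

As an example, we shall use a short table, which is a random subset of six rows (out of the original 150) in the iris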
dataset (available in R by typing iris
). These are measurements of four floral traits (first four columns) on different plants (rows) that belong to three different iris species (fifth column, Species
). You can create a data.frame
object such as the following example with iris=iris[sample(1:nrow(iris),6),]
(note that since it is a random sample, the exact values will be different each time). The exact table being used in the examples is provided on the book's website (iris2.csv
). Here is the iris
dataset subset we are going to use:
> iris
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
100          5.7         2.8          4.1         1.3 versicolor
45           5.1         3.8          1.9         0.4     setosa
90           5.5         2.5          4.0         1.3 versicolor
34           5.5         4.2          1.4         0.2     setosa
38           4.9         3.6          1.4         0.1     setosa
101          6.3         3.3          6.0         2.5  virginica
Using tapply
, we can quickly find out, for example, the average petal width per species, as follows:
> x = tapply(iris$Petal.Width, iris$Species, mean)
> x
    setosa versicolor  virginica
 0.2333333  1.3000000  2.5000000
The first argument, iris$Petal.Width
, is the vector on which we apply our function. The second argument, iris$Species
, is the vector that defines the subsets in iris$Petal.Width
. Basically, all elements in iris$Petal.Width
at the positions with a unique value in iris$Species
are treated as groups. The last argument is the function that we apply on the subsets of iris$Petal.Width
; in this case, the mean
function. Thus, the iris$Petal.Width
vector was split into three subsets, a mean was calculated for each subset, and the results were combined once again.
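The split, apply, and combine steps described above can also be written out one at a time. The following is only a sketch: it assumes the six rows can be rebuilt from the built-in dataset by their row names, and the object names iris6, groups, by_hand, and with_tapply are hypothetical, chosen so that iris itself is not overwritten.

```r
# Rebuild the six-row table from the built-in dataset, using the row
# names shown in the table above (iris6 is a hypothetical name)
iris6 <- datasets::iris[c(100, 45, 90, 34, 38, 101), ]

# Step 1: split the petal widths into one subset per species
groups <- split(iris6$Petal.Width, iris6$Species)

# Steps 2 and 3: apply mean to each subset and combine the results
by_hand <- sapply(groups, mean)

# The same calculation in a single tapply call
with_tapply <- tapply(iris6$Petal.Width, iris6$Species, mean)

print(groups)
print(by_hand)
print(with_tapply)
```

Both approaches should give the same three means; tapply simply wraps the three steps in one call.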
The returned object of tapply
is an array
, which is a vector with an additional attribute stating the number and size of its dimensions. A one-dimensional array, which is what we have here, is identical to a vector in its usage. The reason that the returned object of tapply
is an array, however, is that in some cases (which we will not cover here), the returned object will have more than one dimension, and thus cannot be represented by a vector (for example, when the subsets are defined by more than one grouping vector at once, passed as a list to INDEX
). We will further elaborate on two-dimensional (matrix
) and three-dimensional (array
) vector-like objects in the next chapter.
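As a rough illustration of the array attributes described above, the following sketch inspects a tapply result with is.array and dim. It also shows one situation that yields a two-dimensional result: passing a list of two grouping vectors as INDEX. The names iris6, wide_sepal, x, and y are hypothetical, and the threshold of 3 for sepal width is an arbitrary choice made only for this example.

```r
# Six rows of the built-in dataset, selected by the row names shown earlier
# (iris6 is a hypothetical name)
iris6 <- datasets::iris[c(100, 45, 90, 34, 38, 101), ]

# One grouping vector: the result is a one-dimensional array
x <- tapply(iris6$Petal.Width, iris6$Species, mean)
print(is.array(x))
print(dim(x))

# Two grouping vectors passed as a list: the result has two dimensions.
# wide_sepal is an arbitrary logical grouping used for illustration only.
wide_sepal <- iris6$Sepal.Width > 3
y <- tapply(iris6$Petal.Width, list(iris6$Species, wide_sepal), mean)
print(dim(y))
print(y)
```

Combinations of groups that contain no observations are filled with NA in the two-dimensional result.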
Note that the array is named using the values in the grouping vector, so we can access any value of interest using its name as follows:
> x["setosa"]
   setosa
0.2333333
Also, if we wish, we can transform the result to a vector using as.numeric
as follows:
> as.numeric(x)
[1] 0.2333333 1.3000000 2.5000000
As previously mentioned, the apply
functions are similar to loops in purpose and concept, although simpler and clearer in their syntax. For example, the preceding operation can be performed using a for
loop, although the code would be longer (and, arguably, less clear):
> x = NULL
> for(i in unique(iris$Species)) {
+   x = c(x, mean(iris$Petal.Width[iris$Species == i]))
+ }
> names(x) = unique(iris$Species)
> x
versicolor     setosa  virginica
 1.3000000  0.2333333  2.5000000
Here, we create an empty object (with NULL
, the special value that denotes an empty object in R) and then go through the unique values in iris$Species
using a loop, each time taking the mean of the iris$Petal.Width
values for the respective species and appending it to x
. Finally, we edit the names
attribute of the resulting vector, using the names
function, to add the unique species names.
Let's see another example with tapply
involving our climatic data. Say we are interested in finding out how many stations (and which ones) have at least one missing value in their time series of precipitation amount. For an individual station (such as the one named "IZANA SP"
), we could check whether its tpcp
column contains at least one NA
value as follows:
> any(is.na(dat[dat$station_name == "IZANA SP", "tpcp"]))
[1] TRUE
The returned value is TRUE
, meaning the answer is yes. Note that the operation consisted of three steps. We first created a subset of dat
(consisting of the rows for which the station name is "IZANA SP"
and the column name is "tpcp"
). Since the subset is taken from a single column, it is automatically simplified to a vector. Second, we checked whether each element is NA
with the is.na
function. Finally, we checked whether at least one element in the resulting logical vector is TRUE
, with the any
function.
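The three steps can be made explicit by storing each intermediate result. Since the full dat table is not reproduced here, the sketch below uses a tiny stand-in with invented precipitation values; the station "OTHER SP" and the object names izana, izana_missing, and has_missing are hypothetical.

```r
# Hypothetical stand-in for dat: the values are invented for illustration,
# and "OTHER SP" is not a real station name
dat <- data.frame(
  station_name = c("IZANA SP", "IZANA SP", "OTHER SP", "OTHER SP"),
  tpcp = c(12.5, NA, 3.0, 8.1)
)

# Step 1: subset the rows of one station and the tpcp column
# (a single column is simplified to a vector)
izana <- dat[dat$station_name == "IZANA SP", "tpcp"]

# Step 2: test each element for being NA
izana_missing <- is.na(izana)

# Step 3: check whether at least one element is TRUE
has_missing <- any(izana_missing)

print(izana)
print(izana_missing)
print(has_missing)
```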
To instantly perform this operation on all stations, we can use tapply
:
> result = tapply(
+   dat$tpcp,
+   dat$station_name,
+   function(x) any(is.na(x)))
This time, the vector of values that we pass to the tapply
function is dat$tpcp
(since we want to look for missing values in the precipitation data) and the vector that defines the subsets is dat$station_name
(since we want to apply the function on the data from each station separately). Finally, the function that we apply is a user-defined one; its definition is embedded within the tapply
function call for compactness. The function takes one argument (x
) and returns TRUE
or FALSE
depending on whether x
does or does not contain at least one NA
value, respectively, the same way that we did in the previous code section.
The resulting array indicates, for each station, whether at least one precipitation measurement is missing. Here are its first ten elements:
> result[1:10]
        A CORUNA ALVEDRO SP                 A CORUNA SP
                      FALSE                       FALSE
     ALBACETE LOS LLANOS SP            ALBACETE OBS. SP
                      FALSE                       FALSE
      ALMERIA AEROPUERTO SP          ASTURIAS AVILES SP
                      FALSE                       FALSE
                   AVILA SP BADAJOZ TALAVERA LA REAL SP
                      FALSE                       FALSE
    BARCELONA AEROPUERTO SP                BARCELONA SP
                      FALSE                       FALSE
To check how many stations have at least one missing value, we can simply use the sum
function (see the previous chapter):
> sum(result)
[1] 11
The answer is that 11 stations have at least one NA
value in their tpcp
column. To see which stations these are, we can subset the result
array with the array itself, since the TRUE
values in that array exactly define the subset we are looking for:
> result[result]
 COLMENAR VIEJO FAMET SP    CORDOBA AEROPUERTO SP
                    TRUE                     TRUE
          GUADALAJARA SP                 IZANA SP
                    TRUE                     TRUE
                 JAEN SP PALENCIA OBSERVATORIO SP
                    TRUE                     TRUE
PAMPLONA OBSERVATORIO SP              PAMPLONA SP
                    TRUE                     TRUE
                 ROTA SP      SANTANDER CENTRO SP
                    TRUE                     TRUE
               TARIFA SP
                    TRUE
The values of the array are now unimportant (since they are all TRUE
); we are actually interested only in the elements' names. The names
attribute of an array (or of a vector for that matter) can be extracted with the names
function, which we already met, as follows:
> names(result[result])
 [1] "COLMENAR VIEJO FAMET SP"  "CORDOBA AEROPUERTO SP"
 [3] "GUADALAJARA SP"           "IZANA SP"
 [5] "JAEN SP"                  "PALENCIA OBSERVATORIO SP"
 [7] "PAMPLONA OBSERVATORIO SP" "PAMPLONA SP"
 [9] "ROTA SP"                  "SANTANDER CENTRO SP"
[11] "TARIFA SP"
These are the names of the stations we were looking for, in the form of a character vector.
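To try the whole sequence without the full climatic table, we can run it on a toy data.frame. This is a sketch under stated assumptions: the station names "STATION A" to "STATION C" and all tpcp values are invented, and only the column names are taken from the text.

```r
# Hypothetical toy table: station names and values are invented
dat <- data.frame(
  station_name = c("STATION A", "STATION A",
                   "STATION B", "STATION B",
                   "STATION C", "STATION C"),
  tpcp = c(10.2, NA, 0.5, 4.4, NA, NA)
)

# For each station, is at least one tpcp value missing?
result <- tapply(
  dat$tpcp,
  dat$station_name,
  function(x) any(is.na(x))
)

# How many stations have at least one missing value?
n_missing <- sum(result)

# Which stations are they? Subset the array with itself, keep the names.
missing_stations <- names(result[result])

print(result)
print(n_missing)
print(missing_stations)
```

In this toy table, stations A and C contain missing values and station B does not, so the count is 2.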
The second function of the apply family that we will meet is apply
. This function is also used to apply a certain function on subsets of data, but instead of operating on subsets defined by a grouping object, it does this on the margins of an array (or an object that is analogous to an array, such as a data.frame
object with numeric values only). Applying a function on each row or each column of a table is, for example, such an operation. We will limit ourselves to this type of two-dimensional operation for now. In Chapter 6, Modifying Rasters and Analyzing Raster Time Series, we will see an example of apply
involving three dimensions.
Similar to tapply
, the first parameter of apply
is the object we would like to base our calculation on (X
), and the third parameter is the function we would like to apply (FUN
). The second parameter (MARGIN
), however, defines the dimension across which we would like to apply the function (rather than which subsets of the input, as in tapply
). For example, the data.frame
objects (and matrices, which will be introduced in the next chapter) have two dimensions: rows (dimension number 1) and columns (dimension number 2). When the input has more than two dimensions (such as in a three-dimensional array), we can apply a function on the third dimension as well and so on, although having an array of more than three dimensions is not common in practice.
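The effect of the MARGIN argument is easiest to see on a very small table. In the sketch below, the data.frame named tab and its values are hypothetical; sum is used instead of mean only because the results are easy to verify by eye.

```r
# Hypothetical table with three rows and two numeric columns
tab <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

# MARGIN = 1: apply the function to each row (one result per row)
row_sums <- apply(tab, 1, sum)

# MARGIN = 2: apply the function to each column (one result per column)
col_sums <- apply(tab, 2, sum)

print(row_sums)
print(col_sums)
```

The first call returns three values (11, 22, and 33), one per row, while the second returns two values (6 and 60), one per column.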
Let's return to our iris
example to see how apply
works. Using apply
, we can find out the mean measured value for each of the six individual plants by averaging the values on the first dimension (that is, the rows) as follows:
> apply(iris[, 1:4], 1, mean)
  100    45    90    34    38   101
3.475 2.800 3.325 2.825 2.500 4.525
We can also find the mean measured value for each of the four measured traits by averaging the values of the second dimension (that is, the columns) as follows:
> apply(iris[, 1:4], 2, mean)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
   5.5000000    3.3666667    3.1333333    0.9666667
Note that we are working only with the numeric part of the iris
object (columns 1 to 4) since the function that we apply (mean
) operates on numeric vectors.
We can also pass additional arguments to apply
, which will, in turn, be passed to the specific function that we apply. For example, the mean
function has an additional parameter, na.rm
, which we can set to TRUE
within the apply
function call. In that case, we will be able to, for example, find out the column means excluding the missing values:
> iris[3,2] = NA
> iris
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
100          5.7         2.8          4.1         1.3 versicolor
45           5.1         3.8          1.9         0.4     setosa
90           5.5          NA          4.0         1.3 versicolor
34           5.5         4.2          1.4         0.2     setosa
38           4.9         3.6          1.4         0.1     setosa
101          6.3         3.3          6.0         2.5  virginica
> apply(iris[, 1:4], 2, mean)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
   5.5000000           NA    3.1333333    0.9666667
> apply(iris[, 1:4], 2, mean, na.rm = TRUE)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
   5.5000000    3.5400000    3.1333333    0.9666667
Here, we first introduced an NA
value to our iris
table and then applied the mean
function on the columns, first with the default arguments (na.rm=FALSE
) and then with na.rm
set to TRUE
. Note that passing additional arguments can be done the same way in tapply
as well.
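For instance, a sketch of the same idea with tapply might look as follows. It assumes the six rows are rebuilt from the built-in dataset under the hypothetical name iris6, with the same NA introduced in the third row of the second column; with_na and without_na are likewise hypothetical names.

```r
# Six rows of the built-in dataset, selected by the row names shown earlier
# (iris6 is a hypothetical name)
iris6 <- datasets::iris[c(100, 45, 90, 34, 38, 101), ]

# Introduce the same missing value as in the apply example
iris6[3, 2] <- NA

# Default behavior: the versicolor group contains an NA, so its mean is NA
with_na <- tapply(iris6$Sepal.Width, iris6$Species, mean)

# na.rm = TRUE is passed on to mean, so the NA is dropped before averaging
without_na <- tapply(iris6$Sepal.Width, iris6$Species, mean, na.rm = TRUE)

print(with_na)
print(without_na)
```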