How it works...

Step 1 begins with the creation of a 10 x 10 matrix in which every row holds a single repeated value, with the values running from 1 to 10 down each column. Inspecting it makes this clear, as partly shown in the following output:

> m
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    1    1    1    1    1    1    1    1    1     1
 [2,]    2    2    2    2    2    2    2    2    2     2
 [3,]    3    3    3    3    3    3    3    3    3     3

We then use apply(): the first argument is the object to loop over, the second is the direction to loop in (the margin, where 1 = rows and 2 = columns), and the third is the code to apply. Here, it's the name of a built-in function, but it could be a custom one. Note that the margin argument determines which slice of the data is passed to the function on each iteration. Contrast the two apply() calls:

> apply(m, 1, sum) 
[1] 10 20 30 40 50 60 70 80 90 100
> apply(m, 2, sum) 
[1] 55 55 55 55 55 55 55 55 55 55

Clearly, margin = 1 takes one row at a time, whereas margin = 2 takes one column at a time. In both calls, apply() returns a vector with one result per row or column, because sum() returns a single value each time; since a vector holds only one type, the results must all be of the same type. The result does not have the same shape as the input data.
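The effect of the margin can be sketched as follows. The matrix construction is an assumption reconstructed from the output shown above (the recipe's own Step 1 code is not reproduced here), and the anonymous functions are purely illustrative.

```r
# Assumed reconstruction of the Step 1 matrix: row i holds the value i
m <- matrix(rep(1:10, 10), nrow = 10)

row_sums <- apply(m, 1, sum)  # margin 1: each row is passed to sum()
col_sums <- apply(m, 2, sum)  # margin 2: each column is passed to sum()

# A custom (here anonymous) function can be used in place of a built-in one
row_spread <- apply(m, 1, function(x) max(x) - min(x))

# If the function returns more than one value per slice, apply() returns
# a matrix rather than a vector: one column per row that was processed
row_range <- apply(m, 1, range)
dim(row_range)
```

Because range() returns two values per row, the last result is a 2 x 10 matrix, which illustrates that the shape of the result depends on what the applied function returns.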

With Step 2, we move on to using lapply(), which can loop over many types of data structures, but always returns a list with one member for each iteration. Because it's a list, each member can be of a different type. We start by creating a simple vector containing the integers 1 to 3 and a custom function that just creates a vector of random numbers of a given length. Then, we use lapply() to apply that function over the vector; the first argument to lapply() is the thing to iterate over, and the second is the code to apply. Note that the current value of the vector we're looping over is passed automatically to the called function as its first argument. Inspecting the resulting list, we see the following:

> my_list
[[1]]
[1] -0.3069078

[[2]]
[1] 0.9207697 1.8198781

[[3]]
[1]  0.3801964 -1.3022340 -0.8660626

We get a list holding one random number, then two, then three, matching the values 1, 2, and 3 in the original vector.
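A minimal sketch of this step follows. The names numbers and make_random are hypothetical stand-ins for the recipe's own objects, rnorm() is assumed as the source of random numbers, and set.seed() is used only to make the sketch repeatable, so the values will differ from those printed above.

```r
set.seed(1)  # only for repeatability of this sketch

numbers <- 1:3                        # hypothetical name for the input vector
make_random <- function(n) rnorm(n)   # hypothetical: n random normal values

# Each element of numbers is passed to make_random() as its first argument
my_list <- lapply(numbers, make_random)

lengths(my_list)  # one list member per element, of length 1, 2, and 3

# Roughly equivalent explicit loop, shown for comparison
manual <- vector("list", length(numbers))
for (i in seq_along(numbers)) {
  manual[[i]] <- make_random(numbers[i])
}
```

The lapply() call expresses the same iteration as the loop without the need to preallocate and index the result list by hand.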

In Step 3, we see the difference between lapply() and sapply() when running over the same object. Recall that lapply() always returns a list, whereas sapply() returns a vector when the results can be simplified to one (the s can be thought of as standing for simplify). We create a simple summary function that returns a single value for each member, so that sapply() can simplify the results into a vector. Inspecting the results, we see the following:

> lapply(my_list, summary_function)
[[1]]
[1] -0.3069078

[[2]]
[1] 1.370324

[[3]]
[1] -0.5960334

> sapply(my_list, summary_function)
[1] -0.3069078  1.3703239 -0.5960334
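The contrast can be sketched with the values printed above. The definition of summary_function as a wrapper around mean() is an assumption (it is consistent with the output shown, but the recipe's own definition is not reproduced here).

```r
# List rebuilt from the values printed in Step 2
my_list <- list(
  -0.3069078,
  c(0.9207697, 1.8198781),
  c(0.3801964, -1.3022340, -0.8660626)
)

# Assumed definition: returns a single value per list member
summary_function <- function(x) mean(x)

as_list   <- lapply(my_list, summary_function)  # always a list
as_vector <- sapply(my_list, summary_function)  # simplified to a numeric vector

# When results differ in length, sapply() cannot simplify and returns a list
not_simplified <- sapply(my_list, function(x) x)
is.list(not_simplified)
```

The last call shows why a single-value summary function matters here: sapply() only returns a vector when every result has length one.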

Finally, in Step 4, we use lapply() over a dataframe, namely, the built-in iris data. A dataframe is a list of columns, so lapply() iterates over the columns, applying the mean() function to each one in turn. Note that the last two arguments (trim and na.rm) are not arguments for lapply(), although they look as if they are. In lapply() and sapply(), any arguments after the object to iterate over and the code to apply (in other words, after argument positions 1 and 2) are passed on to the code being run, which here is our mean() function; apply() does the same with anything after its third argument. The column names of the dataframe are used as the member names for the list. You may recall that one of the columns in iris is categorical, so mean() doesn't make much sense. Inspect the result to see what lapply() has done in this case:

> lapply(iris, mean, trim = 0.1, na.rm = TRUE)
$Sepal.Length
[1] 5.808333

$Sepal.Width
[1] 3.043333

$Petal.Length
[1] 3.76

$Petal.Width
[1] 1.184167

$Species
[1] NA

It has returned NA for the Species column and generated a warning, but it has not failed. If the warning goes unnoticed, this can be a source of bugs in later analyses.
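The sketch below illustrates two points from this step: that the extra arguments are forwarded to mean(), and one possible way to avoid the NA by selecting the numeric columns first. The column filter is a suggestion rather than part of the recipe, and the variable names are hypothetical.

```r
# Identify numeric columns; sapply() returns a named logical vector
numeric_cols <- sapply(iris, is.numeric)

# Extra arguments after the function are forwarded to mean()
via_dots <- lapply(iris[numeric_cols], mean, trim = 0.1, na.rm = TRUE)

# Equivalent call with the forwarding written out explicitly
via_wrapper <- lapply(
  iris[numeric_cols],
  function(col) mean(col, trim = 0.1, na.rm = TRUE)
)

names(via_dots)  # Species is no longer included, so no NA and no warning
```

Filtering up front makes the intent explicit, whereas relying on the warning leaves an NA in the result for later code to trip over.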

With a simple list like this, we can also use unlist() to get a vector of the results:

> unlist(list_from_data_frame)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
    5.808333     3.043333     3.760000     1.184167           NA

If the list members are named, the resulting vector carries those names.
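A short sketch of how names carry through unlist() follows. The list is rebuilt here with the same call as in Step 4, with suppressWarnings() added only to keep the sketch quiet; unnamed_vector is a hypothetical example for contrast.

```r
# Same call as in Step 4; the warning for Species is suppressed here
list_from_data_frame <- suppressWarnings(
  lapply(iris, mean, trim = 0.1, na.rm = TRUE)
)

named_vector <- unlist(list_from_data_frame)
names(named_vector)            # the dataframe's column names
named_vector["Petal.Length"]   # members can be looked up by name

# A list without names gives a vector without names
unnamed_vector <- unlist(list(1, 2, 3))
is.null(names(unnamed_vector))
```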
