© Matt Wiley and Joshua F. Wiley 2016

Matt Wiley and Joshua F. Wiley, Advanced R, 10.1007/978-1-4842-2077-1_1

1. Programming Basics

Matt Wiley and Joshua F. Wiley (1)

(1) Elkhart Group Ltd. & Victoria College, Columbia City, Indiana, USA

Electronic supplementary material

The online version of this chapter (doi:10.1007/978-1-4842-2077-1_1) contains supplementary material, which is available to authorized users.

As with most languages, more advanced usage requires delving into the underlying structure. This chapter covers such programming basics, and this first section of the book (through Chapter 6) develops some advanced programming techniques. We start with R’s basic building blocks, which create our foundation for programming, data management, and cloud analytics.

Before we dig too deeply into R, some general principles to follow may well be in order. First, experimentation is good. It is much more powerful to learn hands-on than it is simply to read. Download the source files that come with this text, and try new things!

Second, it can help quite a bit to become familiar with the ? function. Simply type ? immediately followed by text in your R console to call up help of some kind. We cover more on functions later, but this is too useful to ignore until that time.

Finally, just before we dive into the real reason you bought this book, a word of caution: this is an applied text. There may be topics and areas of R we skip or ignore. While we, the authors, like to imagine this is due to careful pruning of ideas, it may well be due to ignorance. There are likely other ways to perform these tasks or additional good topics to learn. Our goal is to get you up and running as quickly as possible toward some useful skills. Good luck!

Advanced R Software Choices

This book is written for advanced users of the R language. We should note that for most of our examples, we continue using RStudio (www.rstudio.com/products/rstudio/download/) as in Beginning R: An Introduction to Statistical Programming (Apress, 2015). We also assume you are using a Microsoft Windows (www.microsoft.com) operating system, except for the later chapters, where we delve into using R in the cloud via Ubuntu (www.ubuntu.com). What is different is the underlying R distribution.

We are going to use Microsoft R Open (MRO), which is fully aligned with the current version(s) of R. This provides performance enhancements that happen behind the scenes. We also use Intel Math Kernel Library (Intel MKL), which is available for download at the same site as MRO (https://mran.microsoft.com/download/). In fact, as this book goes to print, these two software programs were combined in their latest release. It would be wonderful if that trend continues. These downloads are very straightforward, and we anticipate that our readers, already familiar with using R and RStudio, will find this a seamless installation. On Windows (and Linux-based operating systems), the MKL replaces the default linear algebra system with an optimized system and allows implicit parallel processing for linear algebra operations, such as the matrix multiplication and decomposition that are used in many statistical algorithms.

In case it is not already installed, you also need Java. We used Java Version 8 Update 91 for 64 bit in this book. Java may be downloaded at www.oracle.com/technetwork/java/javase/; specifically, get the Java Development Kit (JDK).

While these choices may have minor consequences, our goal is to provide universal guidance that remains true enough regardless of environmental specifics. Nevertheless, some packages and prebuilt functions on occasion have quirks. We turn our attention to ensuring that you can readily reproduce our results.

Reproducing Results

One useful feature of R is the abundance of packages written by experts worldwide. This is also potentially the Achilles’ heel of using R: from the version of R itself to the version of particular packages, lots of code specifics are in flux. Your code has the potential to not work from day to day, let alone our code written months before this book was published. To solve this, we use the Revolution Analytics checkpoint package (Microsoft Corporation, 2016), which uses server-stored snapshots from the Comprehensive R Archive Network (CRAN) to “lock” our code to a specific version and date. To learn the technical specifics of how this is done, visit the link in the “References” section at the end of this chapter. We’ll get you started with the basics.

For this book, we used R version 3.3.1, Bug in Your Hair, along with Windows 10 Professional x64. As this version moves from the current version to historical, CRAN maintains an archive of past releases. Thus, the checkpoint package has ready access to previous versions of R, and indeed all packages. What you need to do is add the following code to the top of your Chapter 1 R file in your project directory:

## uncomment to install the checkpoint package
## install.packages("checkpoint")
library(checkpoint)
checkpoint("2016-09-04", R.version = "3.3.1")
library(data.table)

We place all library calls at the start of each chapter’s project file, after the call to the checkpoint library. By including the date of September 4, 2016, we ensure that the latest version of all packages up to that cutoff is installed and run by checkpoint. The first time it is run, after asking permission, checkpoint creates a folder to host the needed versions of the packages used. Thus, as long as you start each chapter’s code file with the correct library calls, you use the same versions of the packages we use.
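
One way to confirm that a session matches the setup described above is to inspect the running R version and the library folders. The following is a minimal sketch using only base R functions; the variable names are ours, and the comparison against "3.3.1" simply mirrors the version named in this chapter.

```r
## report the version of R that is currently running
running_version <- getRversion()
running_version

## TRUE only if this session runs the R version used in this book
matches_book <- running_version == "3.3.1"
matches_book

## library folders searched for packages, in order; after a call to
## checkpoint(), its snapshot folder is expected to appear in this list
.libPaths()
```

If matches_book is FALSE, the examples should still run, but the output may differ in small details.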

Types of Objects

First of all, we need things to build our language, and in R, these are called objects. We start with five very common types of objects.

Logical objects take on just two values: TRUE or FALSE. Computers are binary machines, and data often may be recorded and modeled in an all-or-nothing world. These logical values can be helpful, where TRUE has a value of 1, and FALSE has a value of 0:

TRUE              
[1] TRUE
FALSE
[1] FALSE
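
To see the numeric values behind TRUE and FALSE, it may help to use logical values in arithmetic. This is a small sketch; the vector name flags is ours.

```r
flags <- c(TRUE, FALSE, TRUE, TRUE)

## TRUE counts as 1 and FALSE as 0 in arithmetic
TRUE + TRUE         # 2
sum(flags)          # 3, the number of TRUE values
mean(flags)         # 0.75, the proportion of TRUE values
as.integer(flags)   # 1 0 1 1
```

Summing a logical vector is a common way to count how many elements meet a condition.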

As you may remember from the quickly muttered comments of your algebra professor, there are many types, or flavors, of numbers. Whole numbers, which include zero as well as negative values, are called integers. In set notation, the integers are {…, -2, -1, 0, 1, 2, …}; these numbers are helpful for headcounts or other indexes (as well as other things, naturally). In R, integers have the capital L suffix. If decimal numbers are needed, then double numeric objects are in order. These are the numbers suited even for ratio-level data. Complex numbers have useful properties as well and are understood precisely as you might expect, with an i suffix on the imaginary portion. R is quite friendly in using all of these numbers, and you simply type in the desired numbers (remember to add the L or i suffix as needed):

42L              
[1] 42
1.5
[1] 1.5
2+3i
[1] 2+3i
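
Because the printed output does not show how a number is stored, it can be useful to ask R directly. The sketch below uses the base functions typeof(), Re(), and Im(); the variable z is ours.

```r
typeof(42L)    # "integer"
typeof(42)     # "double": without the L suffix, 42 is stored as a double
typeof(1.5)    # "double"
typeof(2+3i)   # "complex"

is.integer(42)    # FALSE
is.integer(42L)   # TRUE

## the real and imaginary parts of a complex number
z <- 2+3i
Re(z)   # 2
Im(z)   # 3
```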

Nominal-level data may be stored via the character class and is designated with quotation marks:

"a" ## character              
[1] "a"

Of course, data may have missing values. A missing value takes the type of the rest of the data in its set (we discuss data storage shortly). Nevertheless, it can be helpful to know how to hand-code logical, integer, double, complex, or character missing values:

NA              
[1] NA
NA_integer_
[1] NA
NA_real_
[1] NA
NA_character_
[1] NA
NA_complex_
[1] NA
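
All five missing values print identically, so a quick check with typeof() may make the difference visible. This is a sketch using base R only.

```r
typeof(NA)              # "logical"
typeof(NA_integer_)     # "integer"
typeof(NA_real_)        # "double"
typeof(NA_character_)   # "character"
typeof(NA_complex_)     # "complex"

## a plain NA takes on the type of the vector it is placed in
typeof(c(1.5, NA))      # "double"
typeof(c("a", NA))      # "character"
```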

Factors are a special kind of object, not so useful for general programming, but used a fair amount in statistics. A factor variable indicates that a variable should be treated discretely. Factors are stored as integers, with labels to indicate the original value:

factor(1:3)              
[1] 1 2 3
Levels: 1 2 3
factor(c("a", "b", "c"))
[1] a b c
Levels: a b c
factor(letters[1:3])
[1] a b c
Levels: a b c
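
The integer storage behind a factor can be inspected directly. In this sketch, the factor sizes and its labels are made up for illustration.

```r
sizes <- factor(c("small", "large", "medium", "small"))

levels(sizes)       # "large" "medium" "small", sorted alphabetically by default
as.integer(sizes)   # 3 1 2 3, the integer code stored for each element
typeof(sizes)       # "integer"
nlevels(sizes)      # 3
```

Each integer code is the position of that element's label in levels(sizes).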

We turn now to data structures, which can store objects of the types we have discussed (and of course more). A vector is a relatively simple data storage object. A simple way to create a vector is with the concatenate function c():

c(1, 2, 3)              
[1] 1 2 3

Just as in mathematics, a scalar is a vector of just length 1. Toward the opposite end of the continuum, a matrix is a vector with dimensions for both rows and columns. Notice the way the matrix is populated with the numbers 1 through 6, counting down each column:

c(1)
[1] 1


matrix(c(1:6), nrow = 3, ncol = 2)
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
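
If filling across rows is preferred, the byrow argument of matrix() changes the order. The sketch below contrasts the two; the object names are ours.

```r
by_column <- matrix(1:6, nrow = 3, ncol = 2)                 # default: fill down columns
by_row    <- matrix(1:6, nrow = 3, ncol = 2, byrow = TRUE)   # fill across rows

by_column[1, ]   # 1 4
by_row[1, ]      # 1 2
dim(by_row)      # 3 2
```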

All vectors, be they scalar, vector, or matrix, can have only one data type (for example, integer, logical, or complex). If more than one type of data is needed, it may make sense to store the data in a list. A list is a vector of objects, in which each element of the list may be a different type. In the following example, we build a list that has character, vector, and matrix elements:

list(
  c("a"),
  c(1, 2, 3),
  matrix(c(1:6), nrow = 3, ncol = 2)
  )
[[1]]
[1] "a"


[[2]]
[1] 1 2 3


[[3]]
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
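
The one-type rule for vectors, and the way a list avoids it, can be seen in a short sketch. The object names here are ours.

```r
## mixing types in c() converts everything to the most flexible type
mixed_numbers <- c(1L, 2.5)
typeof(mixed_numbers)   # "double"

mixed_text <- c(1, "a", TRUE)
mixed_text              # "1" "a" "TRUE"
typeof(mixed_text)      # "character"

## a list keeps each element's own type
kept <- list(1L, "a", TRUE)
sapply(kept, typeof)    # "integer" "character" "logical"
```

Note that R performs this conversion silently, so it is worth checking the type after combining values.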

A particular type of list is the data frame, in which each element of the list is identical in length (although not necessarily in object type). Take a look at the following instructive examples with output:

data.frame(1:3, 4:6)                
  X1.3 X4.6
1    1    4
2    2    5
3    3    6


## using non equal length objects causes problems
data.frame( 1:3, 4:5)
Error in data.frame(1:3, 4:5) :
  arguments imply differing number of rows: 3, 2


data.frame( 1:3, letters[1:3])
  X1.3 letters.1.3.
1    1            a
2    2            b
3    3            c

Because of their superior speed, we use data table objects in R from the data.table package. Data tables are similar to data frames, but are designed to be more memory efficient and faster. Even though we recommend data tables, we show some examples with data frames as well because when you work with R, many other people’s code includes data frames, and indeed data tables inherit many methods from data frames.

library(data.table)              
data.table( 1:3, 4:6)
   V1 V2
1:  1  4
2:  2  5
3:  3  6

Having explored several types of objects, we turn our attention to ways of manipulating those objects with operators and functions.

Base Operators and Functions

Objects are not enough for a language; some things require actions. Operators and functions are the verbs of the programming world. We start with assignment, which can be done in two ways. Much like written languages, more-elegant turns of phrase can be more helpful than simpler prose. So although = and <- are both assignment operators and do the same thing, because = is used within functions to set arguments, we recommend for clarity’s sake to use <- for general assignment. We nevertheless demonstrate both assignment techniques. Assignments allow objects to be given sensible names; this can significantly enhance code readability (for your future self as well as for other users).

In addition to assigning names to variables, you can check specifics by using functions. Functions in R take the general format of function name, followed by parentheses, with input inside the parentheses, and then R provides output. Here are examples:

x <- 5                
y = 3
x
[1] 5
y
[1] 3


is.integer(x)
[1] FALSE


is.double(y)
[1] TRUE


is.vector(x)
[1] TRUE
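
The practical difference between = and <- shows up inside a function call. The following sketch wraps the comparison in a small helper, demo_assignment(), which is our own hypothetical function and not part of R; writing functions is covered later in the book.

```r
demo_assignment <- function() {
  ## `=` inside a call only names the argument; no variable x is created
  mean(x = c(2, 4, 6))
  after_equals <- exists("x", inherits = FALSE)

  ## `<-` inside a call creates the variable x and then passes its value
  mean(x <- c(2, 4, 6))
  after_arrow <- exists("x", inherits = FALSE)

  c(after_equals = after_equals, after_arrow = after_arrow)
}

demo_assignment()
## after_equals  after_arrow
##        FALSE         TRUE
```

Keeping <- for assignment and = for arguments makes it easier to tell at a glance which of the two a line of code is doing.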

Once an object is assigned, you can access specific object elements by using brackets. Most computer languages start their indexing at either 0 or 1. R starts indexing at 1. Also, note that you can readily change old assignments with little trouble and no warning; it is wise to watch names cautiously and comment code carefully.

x <- c("a", "b", "c")                
x[1]
[1] "a"


is.vector(x)
[1] TRUE


is.vector(x[1])
[1] TRUE


is.character(x[1])
[1] TRUE

While a vector may take only a single index, more-complex structures require more indices. For the matrix you met earlier, the first index is the row, and the second is for column position. Notice that after building a matrix and assigning it, there are many ways to access various combinations of elements. This process of accessing just some of the elements is sometimes called subsetting:

x2 <- matrix(c(1:6), nrow = 3, ncol = 2)                
x2
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6


x2[1, 2] ## row 1, column 2
[1] 4


x2[1, ] ## all row 1
[1] 1 4


x2[, 1] ## all column 1
[1] 1 2 3


x2[c(1, 2), ] ## rows 1 and 2
     [,1] [,2]
[1,]    1    4
[2,]    2    5


x2[c(1, 3), ] ## rows 1 and 3
     [,1] [,2]
[1,]    1    4
[2,]    3    6


x[-2] ## drop element two
[1] "a" "c"


x2[, -2] ## drop column two
[1] 1 2 3


x2[-1, ] ## drop row 1
     [,1] [,2]
[1,]    2    5
[2,]    3    6


is.vector(x2)
[1] FALSE


is.matrix(x2)
[1] TRUE
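
Several of the results above came back as plain vectors rather than matrices. When the matrix structure should be kept, the drop argument of the bracket operator can help, as in this sketch (x2 is rebuilt here so the block runs on its own; the other names are ours).

```r
x2 <- matrix(1:6, nrow = 3, ncol = 2)

row_vec <- x2[1, ]                 # simplified to a plain vector
row_mat <- x2[1, , drop = FALSE]   # kept as a 1 x 2 matrix

is.matrix(row_vec)   # FALSE
is.matrix(row_mat)   # TRUE
dim(row_mat)         # 1 2

## logical vectors can also select rows: keep rows whose first column exceeds 1
x2[x2[, 1] > 1, ]
```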

Accessing and subsetting lists is perhaps a trifle more complex, yet all the more essential to learn and master for later techniques. A single index in a single bracket returns the entire element at that spot (recall that for a list, each element may be a vector or just a single object). Using double brackets returns the object within that element of the list—nothing more.

Thus, the following code returns, in fact, a list of length 1 (itself a kind of vector) with the element a inside. Again, using the data-type-checking functions can be helpful in learning how to interpret various pieces of code.

y <- list( c("a"), c(1:3))                
y[1]
[[1]]
[1] "a"


is.vector(y[1])
[1] TRUE


is.list(y[1])
[1] TRUE


is.character(y[1])
[1] FALSE

Contrast that with this code, which is simply the element a:

y[[1]]                
[1] "a"


is.vector(y[[1]])
[1] TRUE


is.list(y[[1]])
[1] FALSE


is.character(y[[1]])
[1] TRUE

You can, in fact, chain brackets together, so the second element of the list (a vector with the numbers 1 through 3) can be accessed, and then, within that vector, the third element can be accessed:

y[[2]][3]
[1] 3

Brackets almost always work, depending on the type of object, but there may be additional ways to access components. Named data frames and lists can use the $ operator. Notice in the following code how the bracket or dollar sign ends up being equivalent:

x3 <- data.frame( A = 1:3, B = 4:6)                
y2 <- list( C = c("a"), D = c(1, 2, 3))


x3$A
[1] 1 2 3
y2$C
[1] "a"
x3[["A"]]
[1] 1 2 3
y2[["C"]]
[1] "a"
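
The two forms are not identical in every respect. The sketch below, using a made-up list named settings, shows two differences that are worth knowing about when working with plain lists.

```r
settings <- list(alpha = 0.05, label = "test")

settings$alpha        # 0.05
settings[["alpha"]]   # 0.05

## `$` accepts an unambiguous partial name on a list; `[[` does not
settings$al           # 0.05
settings[["al"]]      # NULL

## `[[` can use a name stored in a variable, which `$` cannot
wanted <- "label"
settings[[wanted]]    # "test"
```

For code that will be reused, the double bracket form is often the safer choice because it matches names exactly.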

Notice that although data frames and lists are both lists, neither is a matrix:

is.list(x3)                
[1] TRUE


is.list(y2)
[1] TRUE


is.matrix(x3)
[1] FALSE


is.matrix(y2)
[1] FALSE

Moreover, despite not being matrices, because of their special nature (that is, all elements have equal length), data frames and data tables can be indexed similarly to matrices:

x3[1, 1]                
[1] 1


x3[1, ]
  A B
1 1 4


x3[, 1]
[1] 1 2 3

Any named object can be indexed by using the names rather than the positional numbers, provided those names have been set:

x3[1, "A"]                
[1] 1


x3[, "A"]
[1] 1 2 3

This applies to both column and row names, and these names can be established after building the object, as we do here for the data frame x3:

rownames(x3) <- c("first", "second", "third")                

x3["second", "B"]
[1] 5

Data tables use a slightly different approach. Selecting rows works almost identically, but selecting columns does not require quotes. Additionally, you can select multiple columns by name without quotes by using the .() operator. Should you need to use quotes, the data table can be accessed by using the option with = FALSE, as follows:

x4 <- data.table( A = 1:3, B = 4:6)                

x4[1, ]
   A B
1: 1 4


x4[, A]
[1] 1 2 3
x4[1, A]
[1] 1


x4[1:2, .(A, B)]
   A B
1: 1 4
2: 2 5


x4[1, "A", with = FALSE]
   A
1: 1

Technically, the bracket operators are functions. Although they are not usually called like ordinary functions, they can be. Most functions are named, but the brackets are a particular case and must be wrapped in backticks to be used in the regular function format, as in the following example:

`[`(x, 1)                
[1] "a"


`[`(x3, "second", "A")
[1] 2
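
One place where the function form of the brackets becomes handy is when a function must be passed to another function. The sketch below uses sapply() from base R to pull the second element out of every vector in a list; the list pieces is ours.

```r
pieces <- list(c("a", "b", "c"), c("d", "e", "f"))

## pass `[` as the function, with 2 as its index argument
sapply(pieces, `[`, 2)    # "b" "e"

## `[[` is a function as well
`[[`(pieces, 1)           # "a" "b" "c"
```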

Although we have been using the is.datatype() functions (such as is.vector() and is.list()) to better illustrate what an object is, you can do more. Specifically, you can check whether a value is missing by using the is.na() function:

is.na(NA) ## works              
[1] TRUE

Of course, in practice the argument to is.na() is usually a vector or matrix whose elements may or may not be populated. Our last (for now) exploratory function is the inherits() function. It is helpful when no is.class() function exists, which can occur when specific classes outside the core ones you have seen presented so far are developed:

inherits(x3, "data.frame")              
[1] TRUE
inherits(x2, "matrix")
[1] TRUE
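
Both functions are more useful on realistic inputs. The following sketch applies is.na() to a vector with some missing entries and shows inherits() testing more than one class; the vector scores is made up for illustration.

```r
scores <- c(10, NA, 30, NA)

is.na(scores)          # FALSE  TRUE FALSE  TRUE
sum(is.na(scores))     # 2 missing values
which(is.na(scores))   # 2 4, their positions
anyNA(scores)          # TRUE

## comparing with == does not detect missing values; the result is itself NA
NA == NA               # NA

## inherits() can test several candidate classes at once
inherits(data.frame(a = 1), c("matrix", "data.frame"))   # TRUE
```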

You can also force lower types into higher types. This coercion can be helpful but may have unintended consequences. It can be particularly risky if you have a more advanced data object being coerced to a lesser type (pay close attention to the attempt to coerce 3.8 to an integer, which silently drops the fractional part).

as.integer(3.8)                
[1] 3


as.character(3)
[1] "3"


as.numeric(3)
[1] 3


as.complex(3)
[1] 3+0i


as.factor(3)
[1] 3
Levels: 3


as.matrix(3)
     [,1]
[1,]    3


as.data.frame(3)
  3
1 3


as.list(3)
[[1]]
[1] 3


as.logical("a")
[1] NA


as.logical(3)
[1] TRUE


as.numeric("a")
[1] NA
Warning message:
NAs introduced by coercion
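
Two unintended consequences come up often enough to be worth a sketch: converting a factor straight to a number returns its internal codes, and as.integer() truncates rather than rounds. The factor f is made up for illustration.

```r
f <- factor(c("10", "20", "10"))

as.numeric(f)                 # 1 2 1, the internal integer codes
as.numeric(as.character(f))   # 10 20 10, the original values

## truncation toward zero, not rounding
as.integer(-3.8)              # -3
round(-3.8)                   # -4
```

Going through as.character() first is one common way to recover the values a factor was built from.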

Coercion can be helpful. All the same, it must be used cautiously. Before you move on from this section, if any of this is new, be sure to experiment with different inputs than the ones we tried in the preceding example! Experimenting never hurts, and it can be a powerful way to learn.

Let’s turn our attention now to mathematical and logical operators and functions.

Mathematical Operators and Functions

Several operators can be used for comparison. These will be helpful later, once we get into loops and building our own functions. Equally useful are symbolic logic forms. We start with some basic comparisons and admit to a strange predilection for the number 4:

4 > 4              
[1] FALSE
4 >= 4
[1] TRUE
4 < 4
[1] FALSE
4 <= 4
[1] TRUE
4 == 4
[1] TRUE
4 != 4
[1] FALSE

It is sensible now to mention that although the preceding code may be helpful, numbers often differ from one another only slightly, particularly in the programming environment, which relies on the computer’s floating-point representation of real numbers. That representation is only approximate for most values. Therefore, we often check that things are close within a tolerance:

all.equal(1, 1.00000002, tolerance = .00001)              
[1] TRUE
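
A classic case where exact comparison fails is the sum 0.1 + 0.2. The sketch below also shows why all.equal() is usually wrapped in isTRUE() when used in a condition.

```r
0.1 + 0.2 == 0.3                    # FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE

## when the values differ, all.equal() returns a description, not FALSE,
## so wrap it in isTRUE() before using it in a condition
all.equal(1, 1.1)
isTRUE(all.equal(1, 1.1))           # FALSE
```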

In symbolic logic, AND as well as OR are useful comparisons between two objects. In R, we use & for AND, as well as | for OR. Complex logic tests can be constructed from these simple structures:

TRUE | FALSE              
[1] TRUE
FALSE | TRUE
[1] TRUE
TRUE & TRUE
[1] TRUE
TRUE & FALSE
[1] FALSE

All of the logic tests mentioned so far apply just as well to vectors as they apply to single objects:

1:3 >= 3:1                
[1] FALSE  TRUE  TRUE


c(TRUE, TRUE) | c(TRUE, FALSE)
[1] TRUE TRUE


c(TRUE, TRUE) & c(TRUE, FALSE)
[1]  TRUE FALSE

If you want only a single response, such as for if-else flow control, you can use && or ||, which stop evaluating as soon as they have determined the final result. Work through the following code and output carefully:

W                
Error: object 'W' not found


TRUE | W
Error: object 'W' not found


TRUE || W
[1] TRUE


W || TRUE
Error: object 'W' not found


FALSE & W
Error: object 'W' not found


FALSE && W
[1] FALSE
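
This early stopping is what makes guard conditions possible: a cheap safety check on the left protects a test on the right that would otherwise fail. The sketch below assumes a variable named value that may be NULL; both names are ours.

```r
value <- NULL

## on its own, value > 0 would be logical(0) here, which if() rejects;
## && never evaluates it because the left side is already FALSE
if (!is.null(value) && value > 0) {
  status <- "positive"
} else {
  status <- "missing or not positive"
}
status   # "missing or not positive"
```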

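Note that the double operators are not, in fact, vectorized. In the R version used for this book (3.3.1), they simply use the first element of any vectors; be aware that R 4.3.0 and later instead signal an error when either operand has a length greater than 1: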

c(TRUE, TRUE) || c(TRUE, FALSE)                
[1] TRUE


c(TRUE, TRUE) && c(TRUE, FALSE)
[1] TRUE

The any() and all() functions are helpful as well in these contexts for similar reasons:

any(c(TRUE, FALSE, FALSE))                
[1] TRUE


all(c(TRUE, FALSE, TRUE))
[1] FALSE


all(c(TRUE, TRUE, TRUE))
[1] TRUE

We turn our attention now to mathematical, rather than logical, operators. R is powerful mathematically and can perform most mathematical calculations. So although we introduce some functions, we are leaving many out of the mix. For more details, ?Arithmetic can be your friend. It is (as always) important to be aware of the way computers perform mathematical calculations. Being able to code bespoke solutions directly is powerful, yet with the freedom to customize comes a corresponding amount of responsibility. Take a careful look at the following mathematical operations (which can behave differently than expected because of implementation choices):

3 + 3              
[1] 6
3 - 3
[1] 0
3 * 3
[1] 9
3 / 3
[1] 1
(-27) ^ (1/3)
[1] NaN
4 %/% .7
[1] 5
4 %% .3
[1] 0.1
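
Two of these results deserve a closer look. The sketch below shows one common workaround for the real cube root of a negative number and checks how integer division and the remainder fit together; the variable names are ours, and comparisons use all.equal() because the results are floating-point values.

```r
## real cube root of a negative number: take the root of the absolute value
x <- -27
cube_root <- sign(x) * abs(x)^(1/3)
all.equal(cube_root, -3)      # TRUE

## precedence: ^ binds more tightly than unary minus
-27 ^ (1/3)                   # same as -(27 ^ (1/3)), about -3

## integer division and remainder together reconstruct the original value
q <- 4 %/% 0.7                # 5
r <- 4 %% 0.7                 # about 0.5
all.equal(q * 0.7 + r, 4)     # TRUE
```

The parentheses around -27 in the earlier example are therefore what produces NaN; without them, R negates the result of the power instead.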

R also has some common functions that have straightforward names:

sqrt(3)              
[1] 1.732051
abs(-3)
[1] 3
exp(1)
[1] 2.718282
log(2.71)
[1] 0.9969486

Trigonometric functions also have their part, and ?Trig can bring up a nice list of these. For brevity, we show only the cosine function, cos(). Note the slight inaccuracy again in the output: 3.1415 only approximates π, yet the result is displayed as exactly -1:

cos(3.1415)              
[1] -1
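
The displayed -1 is a rounded value. A short sketch using the digits argument of print() and R's built-in constant pi may make this clearer.

```r
cos(3.1415) == -1                 # FALSE: the stored value is slightly above -1
print(cos(3.1415), digits = 15)   # shows digits hidden by the default display
cos(pi)                           # uses R's built-in constant pi
```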

We close this section and this chapter with a brief selection of matrix operations. Scalar operations use the basic arithmetic operators. To perform matrix multiplication, we use %*%:

x2                
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6


x2 * 3
     [,1] [,2]
[1,]    3   12
[2,]    6   15
[3,]    9   18


x2 + 3
     [,1] [,2]
[1,]    4    7
[2,]    5    8
[3,]    6    9


x2 %*% matrix(c(1, 1), 2)
     [,1]
[1,]    5
[2,]    7
[3,]    9
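
The difference between * and %*% is easy to overlook, so the following sketch contrasts them and shows what happens when dimensions do not conform. The matrix x2 is rebuilt so the block runs on its own, and the tryCatch() wrapper is only there to capture the error as a value.

```r
x2 <- matrix(1:6, nrow = 3, ncol = 2)

## * multiplies element by element, so both matrices need the same shape
x2 * x2

## %*% needs the column count on the left to match the row count on the right
dim(x2 %*% t(x2))   # 3 3
dim(t(x2) %*% x2)   # 2 2

## x2 %*% x2 fails because a 3 x 2 matrix cannot multiply a 3 x 2 matrix
product <- tryCatch(x2 %*% x2, error = function(e) "non-conformable")
product             # "non-conformable"
```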

Matrices have a few other fairly common operations that are helpful in linear algebra. We suppose you have some idea about the mathematics behind some of the applications in modeling we cover, and we discuss an appropriate amount of mathematics as needed in the following chapters. Still, this seems a good place to show how the transpose, cross product, and transpose cross product might be coded. We show both the raw code to make the cross product and transpose cross product occur, as well as easier function calls that may be used. This is a relatively common occurrence in R, incidentally. Through packages, quite a few techniques are implemented in fairly clear function calls. Here are the examples:

t(x2)                
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6


t(x2) %*% x2
     [,1] [,2]
[1,]   14   32
[2,]   32   77


crossprod(x2)
     [,1] [,2]
[1,]   14   32
[2,]   32   77


x2 %*% t(x2)
     [,1] [,2] [,3]
[1,]   17   22   27
[2,]   22   29   36
[3,]   27   36   45


tcrossprod(x2)
     [,1] [,2] [,3]
[1,]   17   22   27
[2,]   22   29   36
[3,]   27   36   45
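
Rather than comparing the printed matrices by eye, the equivalence can be checked in code. This sketch rebuilds x2 so it runs on its own; the matrix ones is ours and illustrates the optional second argument of crossprod().

```r
x2 <- matrix(1:6, nrow = 3, ncol = 2)

all.equal(crossprod(x2), t(x2) %*% x2)    # TRUE
all.equal(tcrossprod(x2), x2 %*% t(x2))   # TRUE

## both products are symmetric matrices
isSymmetric(crossprod(x2))    # TRUE
isSymmetric(tcrossprod(x2))   # TRUE

## crossprod() also takes a second matrix: crossprod(a, b) is t(a) %*% b
ones <- matrix(1, nrow = 3, ncol = 1)
crossprod(x2, ones)           # column sums of x2: 6 and 15
```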

We end this chapter with some final thoughts. First, as you have just seen, it is common in R for someone else to have done the heavy lifting by making a function that simply creates the desired outcome. Of course, these friendly programmers’ work is subject only to the underlying constraints of R itself as well as the ability to acquire a free GitHub account. Thus, it can be helpful to understand at least some of the base commands and operators that make R work. Second, R runs on computers, and for those who have not yet met computer logic, there are differences due to the hardware structure (and consequent software implementation choices).

Next, let’s focus on understanding implementation nuances as well as quickly getting data in and out of R.
