Chapter 1

Introduction

1.1 Computational Statistics and Statistical Computing

Computational statistics and statistical computing are two areas within statistics that may be broadly described as computational, graphical, and numerical approaches to solving statistical problems. Statistical computing traditionally has more emphasis on numerical methods and algorithms, such as optimization and random number generation, while computational statistics may encompass such topics as exploratory data analysis, Monte Carlo methods, and data partitioning. However, most researchers who apply computationally intensive methods in statistics use both computational statistics and statistical computing methods; there is much overlap and the terms are used differently in different contexts and disciplines. Gentle [113] and Givens and Hoeting [121] use “computational statistics” to encompass all the relevant topics that should be covered in a modern introductory text, so that “statistical computing” is somewhat absorbed under this broader definition of computational statistics. On the other hand, journals and professional organizations seem to use both terms to cover similar areas. Some examples are the International Association for Statistical Computing (IASC), part of the International Statistical Institute, and the Statistical Computing section of the American Statistical Association.

This book encompasses parts of both of these subjects, because a first course in computational methods for statistics necessarily includes both. Some examples of topics covered are described below.

Monte Carlo methods refer to a diverse collection of methods in statistical inference and numerical analysis where simulation is used. Many statistical problems can be approached through some form of Monte Carlo integration. In parametric bootstrap, samples are generated from a given probability distribution to compute probabilities, gain information about sampling distributions of statistics such as bias and standard error, assess the performance of procedures in statistical inference, and compare the performance of competing methods for the same problem. Resampling methods such as the ordinary bootstrap and jackknife are nonparametric methods that can be applied when the distribution of the random variable or a method to simulate it directly is unavailable. The need for Monte Carlo analysis also arises because in many problems, an asymptotic approximation is unsatisfactory or intractable. The convergence to the limit distribution may be too slow, or we require results for finite samples; or the asymptotic distribution has unknown parameters. Monte Carlo methods are covered in Chapters 5, 6, 7, 8, and 9. The first tool needed in a simulation is a method for generating pseudo-random samples; these methods are covered in Chapter 3.
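As a minimal sketch of Monte Carlo integration (an illustration only, not an example from the later chapters), the integral of exp(-x) over (0, 1), which equals 1 - exp(-1), can be estimated by the sample mean of exp(-U) for Uniform(0, 1) variates U. The seed and the number of replicates m are arbitrary choices.

```r
set.seed(1)                       # arbitrary seed, for reproducibility only
m <- 10000                        # number of replicates (arbitrary choice)
u <- runif(m)                     # pseudo-random Uniform(0,1) sample
theta.hat <- mean(exp(-u))        # Monte Carlo estimate of the integral
se.hat <- sd(exp(-u)) / sqrt(m)   # estimated standard error of the estimate
theta <- 1 - exp(-1)              # exact value, for comparison
print(c(estimate = theta.hat, exact = theta, se = se.hat))
```

The estimate should agree with the exact value to within a few standard errors; increasing m reduces the standard error at the rate 1/sqrt(m).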

Markov Chain Monte Carlo (MCMC) methods are based on an algorithm to sample from a specified target probability distribution that is the stationary distribution of a Markov chain. These methods are widely applied for problems arising in Bayesian analysis, and in such diverse fields as computational physics and computational finance. Markov Chain Monte Carlo methods are covered in Chapter 9.
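To make the idea concrete, here is a rough sketch of a random walk Metropolis sampler whose stationary distribution is the standard normal. This is only an illustration under simple assumptions (a Uniform(-1, 1) proposal increment, chain length 20000, starting value 0, all chosen arbitrarily); it is not the treatment given in Chapter 9.

```r
set.seed(2)                 # arbitrary seed
n <- 20000                  # length of the chain (arbitrary choice)
x <- numeric(n)             # storage for the chain; x[1] = 0 is the start
u <- runif(n)               # uniforms used for the accept/reject step
for (i in 2:n) {
  y <- x[i - 1] + runif(1, -1, 1)        # symmetric random walk proposal
  ratio <- dnorm(y) / dnorm(x[i - 1])    # ratio of target densities
  # accept the candidate with probability min(1, ratio)
  x[i] <- if (u[i] <= ratio) y else x[i - 1]
}
print(c(mean = mean(x), sd = sd(x)))     # compare with 0 and 1
```

Because successive values of the chain are dependent, the sample mean and standard deviation converge more slowly than they would for an independent sample of the same size.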

Several special topics also deserve an introduction in a survey of computationally intensive methods. Density estimation (Chapter 10) provides a nonparametric estimate of a density, which has many applications in addition to estimation ranging from exploratory data analysis to cluster analysis. Computational methods are essential for the visualization of multivariate data and reduction of dimensionality. The increasing interest in massive and streaming data sets, and high dimensional data arising in applications of biology and engineering, for example, demand improved and new computational approaches for multivariate analysis and visualization. Chapter 4 is an introduction to methods for visualization of multivariate data. A review of selected topics in numerical methods for optimization and numerical integration is presented in Chapter 11.

Many references can be recommended for further reading on these topics. Gentle [113] and the volume edited by Gentle, et al. [114] have thorough coverage of topics in computational statistics. Givens and Hoeting [121] is a recent graduate text on computational statistics and statistical computing. Martinez and Martinez [192] is an accessible introduction to computational statistics, with numerous examples in Matlab. Texts on statistical computing include the classics by Kennedy and Gentle [161] and Thisted [269], and a more recent survey of methods in statistical computing is covered in Kundu and Basu [165]. For statistical applications of numerical analysis see Lange [168] or Monahan [202]. Books that primarily cover Monte Carlo methods or resampling methods include Davison and Hinkley [63], Efron and Tibshirani [84], Hjorth [143], Liu [179], and Robert and Casella [228]. On density estimation see Scott [244] and Silverman [252].

1.2 The R Environment

The R environment is a suite of software and programming language based on S, for data analysis and visualization. “What is R” is one of the frequently asked questions included in the online documentation for R. Here is an excerpt from the R FAQ [217]:

R is a system for statistical computation and graphics. It consists of a language plus a run-time environment with graphics, a debugger, access to certain system functions, and the ability to run programs stored in script files.

The home page of the R project is http://www.r-project.org/, and the current R distribution and documentation are available on the Comprehensive R Archive Network (CRAN). The CRAN master site is at TU Wien, Austria, http://cran.R-project.org/. The R distribution includes the base and recommended packages with documentation. A help system and several reference manuals are installed with the program.

R is based on the S language. Some details about differences between R and S are given in the R FAQ [147]. Venables and Ripley [278] is a good resource for applied statistics with S, Splus, and R. Other references on the S language include [24, 41, 42, 277].

An excellent starting point is the manual Introduction to R [279]. Some introductory books using R include Dalgaard [62] and Verzani [280]. On programming methods see Chambers [41], and Venables and Ripley [277, 278]. Other texts that feature Splus, S, and/or R may also be helpful (see e.g. Crawley [57] or Everitt and Hothorn [88]). Albert [5] is an introductory text on Bayesian computation. On statistical models see Faraway [90, 91], Fox [97], Harrell [131], and Pinheiro and Bates [211]. Many more references can be found through links on the R project home page.

Programming is discussed as needed in the chapters that follow. In this text, new functions or programming methods are explained in remarks called “R notes” as they arise. Readers are always encouraged to consult the R help system and manuals [147, 279, 217]. For platform specific details about installation and interacting with the graphical user interface the best resource is the R manual [218] and current information at www.r-project.org.

In the remainder of this chapter, we cover some basic information aimed to help a new user get started with R. Topics include basic syntax, using the online help, datasets, files, scripts, and packages. There is a brief overview of basic graphics functions. Also see Appendix B on working with data frames.

1.3 Getting Started with R

R has a command line interface that can be used interactively or in batch mode. Commands can be typed at the prompt in the R Console window, or submitted by the source command (see Section 1.8). For example, we can evaluate the standard normal density $\phi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^{2}/2}$ at x = 2 by typing the formula or (more conveniently) the dnorm function:

 > 1/sqrt(2*pi) * exp(-2)

 [1] 0.05399097

 > dnorm(2)

 [1] 0.05399097

In the example above, the command prompt is >. The [1] indicates that the result displayed is the first element of a vector.
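The same comparison can be made at several points at once. The short sketch below (the evaluation points are arbitrary) applies the formula to a vector and checks the result against dnorm.

```r
x <- c(-1, 0, 2)                           # arbitrary evaluation points
phi <- 1 / sqrt(2 * pi) * exp(-x^2 / 2)    # the density formula, elementwise
print(phi)
print(dnorm(x))
print(all.equal(phi, dnorm(x)))            # TRUE: the two agree
```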

A command can be continued on the next line. The prompt symbol changes whenever the command on the previous line is not complete. In the example below, the plot command is continued on the second line, as indicated by the prompt symbol changing to +.

 > plot(cars, xlab="Speed", ylab="Distance to Stop",

 + main="Stopping Distance for Cars in 1920")

Whenever a statement or expression is not complete at the end of a line, the parser automatically continues it on the next line. No special symbol is needed to end a line. (A semicolon can be used to separate statements on a single line, although this tends to make code harder to read.) A group of statements can be gathered into a single (compound) expression by enclosing them in curly braces {} .
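A small sketch of both features follows; the variable names are arbitrary. The value of a compound expression is the value of the last statement evaluated inside the braces.

```r
x <- 3; y <- 4      # two statements separated by a semicolon
z <- {              # a compound expression
  s <- x^2 + y^2
  sqrt(s)           # the value of the braces is the value of this line
}
print(z)            # 5
```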

To cancel a command, a partial command, or a running script use Ctrl-C, or in the Windows version of the R GUI, press the escape key (Esc). To exit the R system type the command q() or close the R GUI.

The usual assignment operator is <-. For example, x <- sqrt(2 * pi) assigns the value of $\sqrt{2\pi}$ to the symbol x.

Commands entered at the command prompt in the R console are automatically echoed to the console, but assignment operations are silent. Some objects have print methods so that the output displayed is not necessarily the entire object, but a summarized report. Compare the effect of these commands. The first command displays a sequence (0.0 0.5 1.0 1.5 2.0 2.5 3.0), but does not store it. The second command stores the sequence in x, but does not display it.

 seq(0, 3, 0.5)

 x <- seq(0, 3, 0.5)
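To store a value and display it in one step, the assignment can be enclosed in parentheses. A brief sketch:

```r
x <- seq(0, 3, 0.5)     # silent: the sequence is stored, not displayed
x                       # typing the name at the prompt displays the value
(x <- seq(0, 3, 0.5))   # parentheses: store and display in one step
print(x)                # explicit printing, which also works inside scripts
```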

Syntax

Below are some help topics on R operators and syntax. The ? invokes the help system for the indicated keyword.

 ?Syntax

 ?Arithmetic

 ?Logic

 ?Comparison #relational operators

 ?Extract #operators on vectors and arrays

 ?Control #control flow

Symbols or labels for functions and variables are case-sensitive and can include letters, digits, periods, and (since R 1.9.0) the underscore character. Symbols cannot start with a digit or an underscore. Many symbols are already defined by the R base or recommended packages. To check if a symbol is already defined, type the symbol at the prompt. The symbols q, t, I, T, and F, for example, are used by R. Note that whenever a package is loaded, other symbols may now be defined by the package.

> T

 [1] TRUE

> t

 function (x) UseMethod("t") <environment: namespace:base>

> g

 Error: Object "g" not found

Here we see that both T and t are already defined, but g is not yet defined by R or by the user. Nothing prevents a user from assigning a new value to predefined symbols such as t or T, but it is a bad programming practice in general and can lead to unexpected results and programming errors.
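The following sketch shows the kind of surprise that can result. It assumes that the user has not previously defined an object named T in the workspace.

```r
T <- 0                              # legal, but T no longer means TRUE
ans <- if (T) "yes" else "no"       # the condition is now false
print(ans)                          # "no"
rm(T)                               # remove the user's T
print(T)                            # the predefined value TRUE is visible again
```

The reserved words TRUE and FALSE cannot be reassigned, which is one reason to prefer them to the abbreviations T and F in code.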

Most new R users have some experience with other programming environments and languages such as C, MATLAB, or SAS. Some operations and features are common to all these languages. A brief list summarizing R syntax for some of these common elements is shown in Table 1.1. For more details see the help topic Syntax. Some of the functions common to most development environments are listed in Table 1.2.

Table 1.1

R Syntax and Commonly Used Operators

Description	R symbol	Example
Comment	#	#this is a comment
Assignment	<-	x <- log2(2)
Concatenation operator	c	c(3,2,2)
Elementwise multiplication	*	a*b
Exponentiation	^	2^1.5
x mod y	x%%y	25 %% 3
Integer division	%/%	25 %/% 3
Sequence from a to b by h	seq	seq(a,b,h)
Sequence operator	:	0:20

Table 1.2

Commonly Used Functions

Description	R symbol
Square root	sqrt
⌊x⌋, ⌈x⌉	floor, ceiling
Natural logarithm	log
Exponential function e^x	exp
Factorial	factorial
Random Uniform numbers	runif
Random Normal numbers	rnorm
Normal distribution	pnorm, dnorm, qnorm
Rank, sort	rank, sort
Variance, covariance	var, cov
Std. dev., correlation	sd, cor
Frequency tables	table
Missing values	NA, is.na

Most arithmetic operations are vectorized. For example, x^2 will square each of the elements of the vector x, or each entry of the matrix x if x is a matrix. Similarly, x*y will multiply each of the elements of the vector x times the corresponding element of y. If the vectors are not the same length, the shorter one is recycled, and a warning is generated when the longer length is not a multiple of the shorter. Operators for matrices are described in Table 1.3.
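A short sketch of vectorized arithmetic with vectors of unequal length follows (the values are arbitrary). When the lengths differ, R reuses (recycles) the elements of the shorter vector.

```r
x <- 1:6
y <- c(10, 20)
print(x^2)      # 1 4 9 16 25 36
print(x * y)    # y is recycled: 10 40 30 80 50 120

# lengths 5 and 2: the product is still computed, but with a warning;
# here the warning is caught and its message is stored instead
w <- tryCatch(1:5 * 1:2, warning = function(w) conditionMessage(w))
print(w)
```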

Table 1.3

R Syntax and Functions for Vectors and Matrices

Description	R symbol	Example
Zero vector	numeric(n)	x <- numeric(n)
	integer(n)	x <- integer(n)
	rep(0,n)	x <- rep(0,n)
Zero matrix	matrix(0,n,m)	x <- matrix(0,n,m)
ith element of vector a	a[i]	a[i] <- 0
jth column of a matrix A	A[,j]	sum(A[,j])
ijth entry of matrix A	A[i,j]	x <- A[i,j]
Matrix multiplication	%*%	a %*% b
Elementwise multiplication	*	a*b
Matrix transpose	t	t(A)
Matrix inverse	solve	solve(A)
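The difference between the elementwise and matrix operators in Table 1.3 is easy to check on a small example. The 2 × 2 matrix below is an arbitrary choice.

```r
A <- matrix(c(2, 1, 1, 3), nrow = 2, ncol = 2)   # symmetric, invertible
print(A * A)          # elementwise product: entries 4, 1, 1, 9
print(A %*% A)        # matrix product: entries 5, 5, 5, 10
Ainv <- solve(A)      # matrix inverse
print(A %*% Ainv)     # identity matrix, up to rounding error
print(t(A))           # transpose (equal to A here, since A is symmetric)
```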

1.4 Using the R Online Help System

For documentation on a topic, type ?topic or help(topic) where “topic” is the name of the topic for which you need help. For example, ?seq will bring up documentation for the sequence function. In some cases, it may be necessary to surround the topic with quotation marks.

 > ?%%

 Error: syntax error, unexpected SPECIAL in " ?%%"

The second version (below) produces the help topic.

 > ?"%%"

On most systems HTML help is also available by the command help.start(); in Windows also try the Help menu, HTML help. This command displays Help in a web browser, with hyperlinks. The HTML help system has a search engine.

Another way to search for help on a topic is help.search(). This and the search engine in Html help may help locate several relevant topics. For example, if we are searching for a method to compute a permutation,

 help.search("permutation")

produces two results: order and sample. We can then consult the help topics for order and sample. The help topic for sample shows that x is sampled without replacement (a permutation of the elements of vector x) by:

 sample(x)  #permutation of all elements of x

 sample(x, size=k) #permutation of k elements of x

(If the goal was to count permutations, and evaluate $\frac{n!}{(n - k)!}$, we want ?Special, a list of special functions including factorial and gamma.)
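Both tasks can be sketched in a few lines. The values n = 5 and k = 3 and the seed are arbitrary choices for illustration.

```r
n <- 5; k <- 3
print(factorial(n) / factorial(n - k))          # 60 permutations of k out of n
print(choose(n, k) * factorial(k))              # 60, the same count
# on the log scale, which avoids overflow when n is large
print(exp(lgamma(n + 1) - lgamma(n - k + 1)))   # approximately 60

set.seed(3)              # arbitrary seed
p <- sample(1:n)         # a random permutation of 1, ..., n
print(p)
print(sort(p))           # sorting recovers 1 2 3 4 5
```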

Many help files end with executable examples. The examples can be copied and pasted at the command line. To run all the examples associated with topic, use example(topic). See e.g. the interesting set of examples for density. To run all the examples for density, type example(density). To see one example, open the help page, copy the lines and paste them at the command prompt.

 help(density)

 # copy and paste the lines below from the help page

   # The Old Faithful geyser data

 d <- density(faithful$eruptions, bw = "sj")

 plot(d)

A list of available data sets in the base and loaded packages is displayed by data(), and documentation on a loaded data set is displayed by the associated help topic. For example, help(faithful) displays the Old Faithful geyser data help topic. If a package is installed but not yet loaded, specify the name of the package. For example, help("geyser", package = MASS) displays help for the dataset geyser without loading the package MASS [278].

R note 1.1 Data sets in the base package can be accessed without explicitly loading them via data. Data sets in other packages can be loaded by the data function. For example,

 data("geyser", package = "MASS")

loads geyser data from the MASS package.

1.5 Functions

The syntax for a function definition is

  function(arglist) expr

  return(value)

Many examples of functions are documented in the chapter “Writing your own functions” of the manual [279].

Here is a simple example of a user-defined R function that “rolls” n fair dice and returns the sum.

 sumdice <- function(n) {

  k <- sample(1:6, size=n, replace=TRUE)

  return(sum(k))

 }

The function definition can be entered by several methods.

Typing the lines at the prompt, if the definition is short.
Copy from an editor and paste at the command prompt.
Save the function in a script file and source the file.

Note that the R GUI provides an editor and toolbar for submitting code. Once the user-defined function is entered in the workspace, it can be used like other R functions.

 #to print the result at the console

 > sumdice(2)

 [1] 9

 #to store the result rather than print it

 a <- sumdice(100)

 #we expect the mean for 100 dice to be close to 3.5

 > a / 100

 [1] 3.59

The value returned by an R function is the argument of the return statement or the value of the last evaluated expression. The sumdice function could be written as

 sumdice <- function(n)

   sum(sample(1:6, size=n, replace=TRUE))

Functions can have default argument values. For example, sumdice can be generalized to roll s-sided dice, but keep the default as 6-sided. The usage is shown below.

 sumdice <- function(n, sides = 6) {

  if (sides < 1) return (0)

  k <- sample(1:sides, size=n, replace=TRUE)

  return(sum(k))

 }

 > sumdice(5) #default 6 sides

 [1] 12

 > sumdice(n=5, sides=4) #4 sides

 [1] 14
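Once defined, the function can be called repeatedly in a simulation. The sketch below redefines sumdice so that the block is self-contained, then uses replicate to roll two dice 1000 times; the seed and the number of replicates are arbitrary.

```r
sumdice <- function(n, sides = 6) {
  if (sides < 1) return(0)
  sum(sample(1:sides, size = n, replace = TRUE))  # value of the last expression
}

set.seed(4)                             # arbitrary seed
sums <- replicate(1000, sumdice(2))     # 1000 rolls of two fair dice
print(mean(sums))                       # should be close to 7
print(table(sums))                      # frequencies of the sums 2, ..., 12
```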

1.6 Arrays, Data Frames, and Lists

Arrays, data frames, and lists are some of the objects used to store data in R. A matrix is a two dimensional array. A data frame is not a matrix, although it can be represented in a rectangular layout like a matrix. Unlike a matrix, the columns of a data frame may be different types of variables. Arrays contain a single type.

Data Frames

A data frame is a list of variables, each of the same length but not necessarily of the same type. In this section we will discuss how to extract values of variables from a data frame.

Example 1.1 (Iris data)

The Fisher iris data set gives four measurements on observations from three species of iris. The first few cases in the iris data are shown below.

 Sepal.Length Sepal.Width Petal.Length Petal.Width Species

1    5.1    3.5   1.4    0.2  setosa

2    4.9    3.0   1.4    0.2  setosa

3    4.7    3.2   1.3    0.2  setosa

4    4.6    3.1   1.5    0.2  setosa

The iris data is an example of a data frame object. It has 150 cases in rows and 5 variables in columns. After loading the data, variables can be referenced by $name (the column name), by subscripts like a matrix, or by position using the [[]] operator. The list of variable names is returned by names. Some examples with output are shown below.

 > names(iris)

 [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"

 [5] "Species"

 > table(iris$Species)

 setosa versicolor virginica

   50  50   50

 > w <- iris[[2]]  #Sepal.Width

 > mean(w)

 [1] 3.057333

Alternately, the data frame can be attached and variables referenced directly by name. If a data frame is attached, it is a good practice to detach it when it is no longer needed, to avoid clashes with names of other variables.

 > attach(iris)

 > summary(Petal.Length[51:100]) #versicolor petal length

  Min. 1st Qu. Median Mean 3rd Qu. Max.

  3.00 4.00  4.35 4.26 4.60 5.10

If we only need the iris data temporarily, we can use with. The syntax in this example would be

 with(iris, summary(Petal.Length[51:100]))

Suppose we wish to compute the means of all variables, by species. The first four columns of the data frame can be extracted with iris[,1:4]. Here the missing row index indicates that all rows should be included. The by function easily computes the means by species.

 > by(iris[,1:4], Species, colMeans)

 Species: setosa

 Sepal.Length Sepal.Width Petal.Length Petal.Width

   5.006  3.428   1.462  0.246

 ------------------------------------------------

 Species: versicolor

 Sepal.Length Sepal.Width Petal.Length Petal.Width

   5.936  2.770   4.260  1.326

 ------------------------------------------------

 Species: virginica

 Sepal.Length Sepal.Width Petal.Length Petal.Width

   6.588  2.974   5.552  2.026

 > detach(iris)

R note 1.2 Although iris$Sepal.Width, iris[[2]], and iris[,2] all produce the same result, the $ and [[]] operators can only select one element, while the [] operator can select several. See the help topic Extract.
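The distinction can be checked directly with the iris data. In this sketch the object names a, b, d, e, and f are arbitrary.

```r
a <- iris$Sepal.Width     # numeric vector, selected by name
b <- iris[[2]]            # the same vector, selected by position
d <- iris[, 2]            # the same vector, matrix-style subscripts
print(identical(a, b) && identical(b, d))   # TRUE

e <- iris[2]              # list-style [ ]: a data frame with one column
print(class(e))           # "data.frame"

f <- iris[, c("Sepal.Width", "Petal.Width")]   # [ ] can select several columns
print(dim(f))             # 150 2
```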

Arrays and Matrices

An array is a multiply subscripted collection of a single type of data. An array has a dimension attribute, which is a vector containing the dimensions of the array.

Example 1.2 (Arrays)

Different arrays are shown. The sequence of numbers from 1 to 24 is first a vector without a dimension attribute, then a one dimensional array, then used to fill a 4 by 6 matrix, and finally a 3 by 4 by 2 array.

 x <- 1:24     # vector

 dim(x) <- length(x)   # 1 dimensional array

 matrix(1:24, nrow=4, ncol=6)  # 4 by 6 matrix

 x <- array(1:24, c(3, 4, 2))  # 3 by 4 by 2 array

The 3 × 4 × 2 array defined by the last statement is displayed below.

 , , 1

   [,1] [,2] [,3] [,4]

 [1,]  1   4   7  10

 [2,]  2   5   8  11

 [3,]  3   6   9  12

 , , 2

   [,1] [,2] [,3] [,4]

 [1,]  13  16  19  22

 [2,]  14  17  20  23

 [3,]  15  18  21  24

The array x is displayed showing x[, , 1] (the first 3 × 4 elements) followed by x[, , 2] (the second 3 × 4 elements).

A matrix is a doubly subscripted array of a single type of data. If A is a matrix, then A[i, j] is the ij-th element of A, A[, j] is the j-th column of A, and A[i ,] is the i-th row of A. A range of rows or columns can be extracted using the : sequence operator. For example, A[2:3, 1:4] extracts the 2 × 4 matrix containing rows 2 and 3 and columns 1 through 4 of A.
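These subscripting rules can be tried on a small matrix. The 3 × 4 matrix below is an arbitrary example; note that extracting a single row or column returns a vector unless drop = FALSE is specified.

```r
A <- matrix(1:12, nrow = 3, ncol = 4)
print(A[2, 3])                # 8, the entry in row 2, column 3
print(A[, 2])                 # 4 5 6, the second column
print(A[3, ])                 # 3 6 9 12, the third row
print(A[2:3, 1:4])            # rows 2 and 3, columns 1 through 4
print(dim(A[2:3, 1:4]))       # 2 4
print(A[, 2, drop = FALSE])   # the second column, kept as a 3 x 1 matrix
```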

Example 1.3 (Matrices)

The statements

 A <- matrix(0, nrow=2, ncol=2)

 A <- matrix(c(0, 0, 0, 0), nrow=2, ncol=2)

 A <- matrix(0, 2, 2)

all assign to A the 2 × 2 zero matrix. Matrices are filled in column major order by default; that is, the row index changes faster than the column index. Thus,

  A <- matrix(1:8, nrow=2, ncol=4)

stores in A the matrix

$\begin{bmatrix} 1 & 3 & 5 & 7 \\ 2 & 4 & 6 & 8 \end{bmatrix}.$

If necessary, use the option byrow=TRUE in matrix to change the default.
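A brief sketch comparing the two fill orders:

```r
A <- matrix(1:8, nrow = 2, ncol = 4)                 # filled by columns
B <- matrix(1:8, nrow = 2, ncol = 4, byrow = TRUE)   # filled by rows
print(A[1, ])     # 1 3 5 7
print(B[1, ])     # 1 2 3 4
# filling by rows is the same as filling the transposed shape by columns
print(all(B == t(matrix(1:8, nrow = 4, ncol = 2))))  # TRUE
```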

Example 1.4 (Iris data: Example 1.1, cont.)

We can convert the first four columns of the iris data to a matrix using as.matrix.

   > x <- as.matrix(iris[,1:4]) #all rows of columns 1 to 4

 > mean(x[,2])  #mean of sepal width, all species

 [1] 3.057333

 > mean(x[51:100,3])  #mean of petal length, versicolor

 [1] 4.26

It is possible to convert the matrix to a three dimensional array, but arrays (and matrices) are stored in “column major order” by default. For arrays, “column major” means that the indices to the left are changing faster than indices to the right. In this case it is easy to convert the matrix to a 50 × 3 × 4 array, with the species as the second dimension. This works because in the data matrix, by column major order, the iris species changes faster than the variable name (column).

 > y <- array(x, dim=c(50, 3, 4))

 > mean(y[,,2]) #mean of sepal width, all species

 [1] 3.057333

 > mean(y[,2,3]) #mean of petal length, versicolor

 [1] 4.26

It is somewhat more difficult to produce a 50 × 4 × 3 array of iris data, with species as the third dimension. Here is one approach. First the matrix is sliced into three blocks of 50 observations each, corresponding to the three species. Then the three blocks are concatenated into a vector length 600, so that species is changing the most slowly, and observation (row) is changing fastest. This vector then fills a 50 × 4 × 3 array.

 > y <- array(c(x[1:50,], x[51:100,], x[101:150,]),

 + dim=c(50, 4, 3))

 > mean(y[,2,]) #mean of sepal width, all species

 [1] 3.057333

  > mean(y[,3,2]) #mean of petal length, versicolor

 [1] 4.26

This array is provided in R as the data set iris3.
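Another possible approach, sketched below, is to build the 50 × 3 × 4 array first and then permute its dimensions with aperm. This is offered as an alternative for comparison; the names y1, y2, and y3 are arbitrary.

```r
x <- as.matrix(iris[, 1:4])
y1 <- array(x, dim = c(50, 3, 4))   # observation, species, variable
y2 <- aperm(y1, c(1, 3, 2))         # observation, variable, species

# the construction by slicing, for comparison
y3 <- array(c(x[1:50, ], x[51:100, ], x[101:150, ]), dim = c(50, 4, 3))

print(dim(y2))              # 50 4 3
print(all(y2 == y3))        # TRUE: both approaches give the same array
print(mean(y2[, 3, 2]))     # 4.26, mean of petal length, versicolor
```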

Lists

A list is an ordered collection of objects. The members of a list (the components) can be different types. Lists are more general than data frames; in fact, a data frame is a list with class “data.frame”. A list can be created by the list() function.

Lists are frequently used to return several results of a function in a single object. Several classical hypothesis tests that return class htest are a good example. See e.g. the help topic for t.test or chisq.test. Refer to the “Value” section of the documentation. The value returned is a list containing the test statistic, p-value, etc. The components of a list can be referenced by name using $ or by position using [[]].

Example 1.5 (Named list)

The Wilcoxon rank sum test is implemented in the function wilcox.test. Here the test is applied to two normal samples with different means.

 w <- wilcox.test(rnorm(10), rnorm(10, 2))

 > w #print the summary

   Wilcoxon rank sum test

 data: rnorm(10) and rnorm(10, 2)

 W = 2, p-value = 4.33e-05

 alternative hypothesis:

 true location shift is not equal to 0

 > w$statistic  #stored in object w

W 2

 > w$p.value

 [1] 4.330035e-05

Try unlist(w) and unclass(w) to see more details.

Some examples of functions in this book that return a named list can be found in Examples 7.14 on page 205, 10.12 on page 305, and 11.17 on page 349.
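As a small illustration of the same idea, here is a sketch of a user-defined function that returns a named list. The function samplestats is hypothetical and is not one of the examples cited above.

```r
# hypothetical helper, for illustration only
samplestats <- function(x) {
  list(n = length(x), mean = mean(x), sd = sd(x))   # a named list
}

s <- samplestats(c(2, 4, 4, 6))
print(s$mean)       # 4, component referenced by name with $
print(s[["n"]])     # 4, component referenced with [[ ]]
print(names(s))     # "n" "mean" "sd"
str(s)              # compact display of the structure of the list
```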

Example 1.6 (A list of names)

Below we create a list to assign row and column names in a matrix. The first component for row names will be NULL in this case because we do not want to assign row names.

 a <- matrix(runif(8), 4, 2) #a 4x2 matrix

 dimnames(a) <- list(NULL, c("x", "y"))

Here is the 4 × 2 matrix with column names (type a to display it).

    x  y

 [1,] 0.88009604 0.6583918

 [2,] 0.32964955 0.1385332

 [3,] 0.61625490 0.1378254

 [4,] 0.08102034 0.1746324

 # if we want row names

 > dimnames(a) <- list(letters[1:4], c("x", "y"))

> a

   x  y

 a 0.88009604 0.6583918

 b 0.32964955 0.1385332

 c 0.61625490 0.1378254

 d 0.08102034 0.1746324

 # another way to assign row names

 > row.names(a) <- list("NE", "NW", "SW", "SE")

> a

    x  y

 NE 0.88009604 0.6583918

 NW 0.32964955 0.1385332

 SW 0.61625490 0.1378254

 SE 0.08102034 0.1746324

1.7 Workspace and Files

The workspace in R contains data and other objects. User defined objects created in a session will persist until R is closed. If the workspace is saved before quitting R, the objects created during the session will be saved. It is not necessary to save the workspace for the examples and code here.

The ls command will display the names of objects in the current workspace. One or more objects can be removed from the workspace by the rm or remove command. For more information consult the R documentation.

Note that saving objects in the workspace can lead to unexpected results and serious hidden programming errors. For example, in the following, suppose that the programmer intended to randomly generate the value of b, but accidentally omitted the code.

 y <- runif(100, 0, b)

Now, if an object named b happens to be found in the workspace, and the value of b produces a valid expression in runif, no error will be reported. A programming error has occurred, but the programmer will not realize it.
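One way to reduce this risk, sketched below, is to place the code in a function so that the value of b must be supplied as an argument. The function gen.y is hypothetical and is used only for illustration.

```r
# hypothetical wrapper: the upper limit b must be passed by the caller,
# so a forgotten value is not picked up silently from the workspace
gen.y <- function(n, b) {
  runif(n, 0, b)
}

# calling the function without b stops with an error, which we catch here
r <- tryCatch(gen.y(100), error = function(e) "error: b is missing")
print(r)

y <- gen.y(100, b = 2)    # the intended usage
print(range(y))           # all values lie between 0 and 2
```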

It is recommended that the user occasionally check what is stored in the workspace, and remove unneeded objects. The entire list of objects returned by ls() can be removed (without warning!) by rm(list = ls()).

In general, it is probably a bad practice to save functions in the workspace, because the user may forget that certain objects exist and these objects are either not documented at all or only through comments. It is a better idea to save functions in scripts and data in files. Collections of functions and data sets can also be organized and documented in packages. (See Sections 1.8 and 1.9 below.)

The Working Directory

Many scripts and data sets are provided, and many will be created by users. It is convenient to create a folder or directory with a short path name to store these files. In the examples, we assume that the files are located in /Rfiles, which will be created by the user. Any other name or path can be used.

Although it is not necessary to specify the working directory, sometimes it may be convenient to do so. A user can get or set the current working directory by the commands getwd and setwd. To set the working directory to "/Rfiles", for example, the command is setwd("/Rfiles"). Windows users can make this change the default by editing the Properties (Start in) in the Windows shortcut to R-GUI. More information about startup options for R can be found in the help topic Startup.

Reading Data from External Files

Often data to be analyzed is stored in external files. Typically, data is stored in plain text files, delimited by white space such as tabs or spaces, or by special characters such as commas.

Univariate data from an external file can be read into a vector by the scan command. If the file contains a data frame or a matrix, or is csv (comma separated values) format, use the read.table function. The read.table function has many options to support different file formats. Here are a few simple examples that refer to data files in Hand, et al. [126]. The data files currently are available at http://www.stat.ncsu.edu/sas/sicl/data/ or at http://www.stat.ucla.edu/data/. To download, do not save the web page. Instead copy the data into a local text editor and save as plain text. Windows users note the unix style forward slashes in the path name below. See the R for Windows FAQ [225].

 forearm <- scan("/Rfiles/forearm.dat") #a vector

 x <- read.table("/Rfiles/irises.dat") #a data frame

 > dim(x)

 [1] 50 12

 #get the fourth variable in the data frame

 x <- read.table("/Rfiles/irises.dat")[[4]] #a vector

 #read and coerce to matrix

 x <- as.matrix(read.table("/Rfiles/irises.dat"))

The version of the iris data in [126] is given in a 50 by 12 array, with the variables in columns 1:4, 5:8, and 9:12 corresponding to the four measurements on each of the three species. Note that many of the data files from [126] are divided in groups by horizontal white space only (see e.g. the Tibetan skulls data), so they may require reformatting before reading into a data frame.

The help topic for read.table also contains documentation for read.csv and read.delim, for reading comma-separated-values (.csv) files and text files with other delimiters. Also see Appendix B.3.4 for an example with .csv format.
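The following sketch can be run without downloading any data. It writes a few rows of the iris data to a temporary .csv file (the path is generated by tempfile) and reads them back in two equivalent ways.

```r
f <- tempfile(fileext = ".csv")                 # temporary file name
write.csv(iris[1:5, ], f, row.names = FALSE)    # create a small csv file

x <- read.csv(f)                                # header and comma separator by default
y <- read.table(f, header = TRUE, sep = ",")    # the same file via read.table
print(dim(x))                                   # 5 5
print(all.equal(x, y))                          # TRUE
unlink(f)                                       # delete the temporary file
```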

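R note 1.3 In versions of R before 4.0.0, read.table converts character variables to factors by default; from R 4.0.0 on, character variables are left as character (stringsAsFactors = FALSE). To prevent conversion of character data to factors in older versions, set as.is = TRUE (also see the colClasses argument of read.table).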

One of the recommended R packages included with the distribution is the foreign package, which provides several utility functions for reading files in Minitab, S, SAS, SPSS, Stata, and other formats. For details type help(package = foreign).

1.8 Using Scripts

R scripts are plain text files containing R code. Once code is saved in a script, all of it can be submitted via the source command, or part of it can be executed by copy and paste (to the console).

To save R commands in a file, prepare the file with a plain text editor and save with extension .R. The Windows R GUI provides an integrated text editor. The File menu contains commands “New Script”, “Open Script”, “Source R code”, etc. If a script editor is open, more commands for submitting the code are provided under the Edit menu and on the toolbar.

There are many other GUIs available for preparing and submitting scripts in R. Currently a list of several appears at the URL www.sciviews.org/_rgui. The RWinEdt package [177] is particularly nice for Windows users who like WinEdt.

The source command loads and executes the commands in the script. It is not necessary to close the file, and in fact, it may be convenient to keep it open for editing. Save changes before source-ing the file. For example, if "/Rfiles/example.R" is a file containing R code, the command

 source("/Rfiles/example.R")

will enter all lines of the file at the command prompt and execute the code. Windows users should use the Unix-style forward slashes above or double backslashes like the command below.

 source("\\Rfiles\\example.R")
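The reason for doubling the backslash is that the backslash is the escape character in R strings. The short sketch below illustrates this and, as an alternative, builds the same path with file.path; the variable names p_back and p_fwd are introduced here only for illustration.

```r
# In an R string, "\\" is the escape sequence for ONE backslash character
p_back <- "\\Rfiles\\example.R"
nchar("\\")         # length of the string holding a single backslash
cat(p_back, "\n")   # cat shows the string as it is stored

# file.path joins its arguments with forward slashes on every platform
p_fwd <- file.path("", "Rfiles", "example.R")
p_fwd
```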

Recent commands can be recalled using the up-arrow key. To edit your source file and run it again (after saving), simply use the up-arrow to recall your source command and press Enter.

Note that by default, evaluations of expressions are not printed at the console when a script is running. Use the print command within a script to display the value of an expression.

Thus, in interactive mode, an expression and its value are both printed

 > sqrt(pi)

 [1] 1.772454

but from a script it is necessary to use print(sqrt(pi)).

Alternatively, set options in the source statement to control how much is printed. By setting echo=TRUE, both the statements and the values of evaluated expressions are echoed to the console. To see the values of expressions but not the statements, leave echo=FALSE and set print.eval=TRUE. The examples are below.

 source("/Rfiles/example.R", echo=TRUE)

 source("/Rfiles/example.R", print.eval=TRUE)
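To compare the three behaviors without depending on a particular file, the following sketch writes a two-line script to a temporary file and sources it three ways. The script contents and the variable name script are assumptions made for illustration.

```r
# Write a two-line script to a temporary file (hypothetical content)
script <- tempfile(fileext = ".R")
writeLines(c("x <- sqrt(pi)", "x"), script)

source(script)                      # default: the value of x is not printed
source(script, print.eval = TRUE)   # the value of x is printed
source(script, echo = TRUE)         # statements and the value are printed

unlink(script)                      # remove the temporary file
```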

1.9 Using Packages

The R installation consists of the base and several recommended packages. Type library() to see a list of installed packages. A package must be installed and loaded to be available. Base packages are automatically loaded. Other packages can be installed and loaded as needed.

Several of the recommended packages are used in this text. Some contributed packages are also used. The R system provides an interface to install contributed packages from CRAN as needed (see install.packages; in the Windows GUI see the Packages menu). A frequent error is the ‘Object not found’ error, which can occur when a symbol is used from a package that is not available. If this error occurs, check spelling, then check that the package containing the object is loaded.

To load an installed package use the library or require command. For example, to load the recommended package boot, type library(boot) at the command prompt. If the package is loaded, the help system for the package is also loaded. The package can also be loaded via the Packages menu in the GUI. Typing the command help(package=boot) will bring up a window showing the contents of the package, whether or not the package is loaded. Once the package is loaded, typing ?boot will bring up the help topic for the boot function in the boot package (if not loaded, use help(boot, package=boot)).

A complete list of all available packages is provided on the CRAN web site. A list of available packages is also included in the R FAQ [147]. Type installed.packages() to see a list of all the installed packages.
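The steps of checking for, loading, and confirming a package can be sketched as follows. The example assumes the recommended package boot, which is usually but not necessarily installed; the variable names pkg and installed are illustrative.

```r
# boot is a recommended package, so it is usually (not always) installed
pkg <- "boot"
installed <- pkg %in% rownames(installed.packages())

if (installed) {
  library(boot)                         # load and attach the package
  print("package:boot" %in% search())   # check that boot is on the search path
} else {
  message("Package ", pkg, " is not installed; see install.packages")
}
```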

1.10 Graphics

The R graphics package contains most of the commonly used graphics functions. In this section, for reference, some of the graphics functions and options or parameters are listed. Examples of graphics and the R code used to produce them appear throughout the text. See Murrell [204] for many more examples. Maindonald and Braun [184] and Venables and Ripley [278] also have many examples of graphics in R.

Table 1.4 lists some basic 2D graphics functions in R (graphics) and other packages. Several examples using the graphics functions in Table 1.4 are given throughout the text. See Table 4.1 and the examples of Chapter 4 for more 2D graphics functions and some 3D visualization methods. Also see the gallery of graphics at http://addictedtor.free.fr/graphiques/.

Table 1.4

Some Basic Graphics Functions in R (graphics) and Other Packages

Method	in (graphics)	in (package)
Scatter plot	plot
Add regression line to plot	abline
Add reference line to plot	abline
Reference curve	curve
Histogram	hist	truehist (MASS)
Bar plot	barplot
Plot empirical CDF	plot.ecdf
QQ Plot	qqplot	qqmath (lattice)
Normal QQ plot	qqnorm
QQ normal ref. line	qqline
Box plot	boxplot
Stem plot	stem
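As a brief illustration of several functions in Table 1.4, the following sketch applies them to simulated data. The data, the seed, and the model y = 2x + error are arbitrary choices made for this example.

```r
set.seed(1)                   # arbitrary seed, for reproducibility
x <- rnorm(100)
y <- 2 * x + rnorm(100)

plot(x, y)                    # scatter plot
abline(lm(y ~ x))             # add the fitted regression line
abline(h = 0, lty = 2)        # add a horizontal reference line

hist(x, freq = FALSE)         # density-scaled histogram
curve(dnorm(x), add = TRUE)   # standard normal density as a reference curve

qqnorm(x)                     # normal QQ plot
qqline(x)                     # reference line for the normal QQ plot

boxplot(x, y)                 # side-by-side box plots
stem(x)                       # stem plot, printed at the console
```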

Colors, plotting symbols, and line types

In most plotting functions, colors, symbols, and line types can be specified using col, pch, and lty. The size of a symbol is specified by cex. Available plotting characters are shown in the manual [279, Ch. 12], which includes this example for displaying plotting characters in a legend.

 plot.new() #if a plot is not open

 legend(locator(1), as.character(0:25), pch=0:25)

 #then click to locate the legend

The example above can be used to display line types, by substituting lty for pch. The following produces a display of colors.

 legend(locator(1), as.character(0:8), lwd=20, col=0:8)

Other colors and color palettes are available. For example,

 plot.new()

 palette(rainbow(15))

 legend(locator(1), as.character(1:15), lwd=15, col=1:15)

puts a 15 color rainbow palette into effect and displays the colors. Use colors() to see the vector of named colors.
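Because locator(1) waits for a mouse click, the examples above need an interactive graphics device. A variant that also runs from a script is sketched below; placing the legend with the keyword "center" and restoring the earlier palette afterwards are choices made for this sketch, and the names old and n_colors are illustrative.

```r
plot.new()
old <- palette(rainbow(15))     # install the new palette; keep the previous one
legend("center", legend = as.character(1:15), lwd = 15, col = 1:15, ncol = 3)
n_colors <- length(palette())   # number of colors in the palette now in effect
palette(old)                    # restore the previous palette
head(colors())                  # first few of the named colors
```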

The figures in this text have been drawn in black and white. Where color palettes would normally be used, we have substituted a grayscale palette. In these cases, on screen it is better to substitute one of the pre-defined color palettes or a custom palette. To define a color palette, refer to ?palette, and to use a defined color palette, see the topic ?rainbow (the topics rainbow, heat.colors, topo.colors, and terrain.colors are documented on the same page).

A table of plotting characters is produced by show.pch() (Hmisc). A utility to display available colors in R is show.colors() in the DAAG package [184]. Also see show.col() in the Hmisc package [132].

Setting the graphical parameter par(ask = TRUE) has the effect that the graphics device will wait for user input before displaying the next plot; e.g. the message “Waiting to confirm page change ...” appears, and in the GUI the user should click on the graphics window to display the next screen. To turn off this behavior, type par(ask = FALSE).
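A common pattern, sketched below, is to save the previous setting returned by par and restore it afterwards. Setting ask to the value of interactive(), so that a prompt is requested only in an interactive session, is an assumption made for this sketch and not a requirement.

```r
# Request a prompt between plots only in an interactive session
op <- par(ask = interactive())   # par returns the previous setting
x <- rnorm(50)
hist(x)
qqnorm(x)    # in an interactive session, R asks before showing this plot
par(op)      # restore the previous setting of ask
```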
