Chapter 1

Getting Started with R

This chapter serves as a primer for R. We invite the reader to start his or her R session and follow along with our discussion. We assume the reader is familiar with the basic summary statistics and graphics taught in standard introductory statistics courses. We present a short tour of the language; those interested in a more thorough introduction are referred to a monograph on R (e.g., Chambers 2008). In addition, a number of manuals are available at the Comprehensive R Archive Network (CRAN) (http://cran.r-project.org/). An excellent overview, written by developers of R, is Venables and Ripley (2002).

R provides a built-in documentation system. Use the help function, i.e., help(command) or ?command, in your R session to bring up the help page (similar to a man page in traditional Unix) for a command. For example, try help(help), help(median), or help(rfit). Of course, Google is another excellent resource.
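If the exact name of a command is not known, the installed documentation can also be searched. For example, to find help pages related to the median:

> help.search('median')  # search installed documentation
> ??median               # shorthand for the same search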

1.1 R Basics

Without going into a lot of detail, R has the capability of handling character (string), logical (TRUE or FALSE), and, of course, numeric data types. To illustrate the use of R we multiply the system-defined constant pi by 2.

> 2*pi

[1] 6.283185

We usually want to save the result for later calculation, so assignment is important. Assignment in R is usually carried out using either the <- operator or the = operator. As an example, the following code computes the area of a circle with radius 4/3 and assigns it to the variable A:

> r<-4/3
> A<-pi*r^2
> A

[1] 5.585054

In data analysis we often have a set of numbers with which we wish to work; as illustrated in the following code segment, we use the c operator to combine values into a vector. There are also the functions rep (repeat) and seq (sequence) for creating patterned data.

> x<-c(11,218,123,36,1001)
> y<-rep(1,5)
> z<-seq(1,5,by=1)
> x+y

[1]   12  219  124   37 1002

> y+z

[1] 2 3 4 5 6

The vector z could also be created with z<-1:5 or z<-c(1:3,4:5). Notice that R does vector arithmetic; that is, when given two vectors of the same length it adds the corresponding elements. Adding a scalar to a vector results in the scalar being added to each element of the vector.

> z+10

[1] 11 12 13 14 15

One of the great things about R is that it uses logical naming conventions as illustrated in the following code segment.

> sum(y)

[1] 5

> mean(z)

[1] 3

> sd(z)

[1] 1.581139

> length(z)

[1] 5

Character data are embedded in quotation marks, either single or double quotes; for example, first<-'Fred' or last<-"Flintstone". The outcomes from the toss of a coin can be represented by

> coin<-c('H','T')

To simulate three tosses of a fair coin one can use the sample command

> sample(coin,3,replace=TRUE)
[1] "H" "T" "T"

The values TRUE and FALSE are reserved words and represent logical constants. The global variables T and F are defined as TRUE and FALSE respectively. When writing production code, one should use the reserved words, as the following illustrates.
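Since T is an ordinary variable, it can be reassigned, while the reserved word TRUE cannot:

> T<-0
> T==TRUE

[1] FALSE

> rm(T)   # remove the binding, restoring the default meaning of T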

1.1.1 Data Frames and Matrices

Data frames are a standard data object in R and are used to combine several variables of the same length, but not necessarily the same type, into a single unit. To combine x and y into a single data object we execute the following code.

> D<-data.frame(x,y)
> D

     x y
1   11 1
2  218 1
3  123 1
4   36 1
5 1001 1

To access one of the vectors the $ operator may be used. For example, to calculate the mean of x the following code may be executed.

> mean(D$x)

[1] 277.8

One may also use the column number or the column name, D[,1] or D[,'x'] respectively. Omitting the first subscript means that all rows are used. The with command, as follows, is another convenient alternative.

> with(D,mean(x))

[1] 277.8

As yet another alternative, many of the modeling functions in R have a data= option to which the data frame (or matrix) may be supplied. We utilize this option when we discuss regression modeling beginning in Chapter 4; a brief preview is given at the end of the discussion of data frames below.

In data analysis, records often consist of mixed types of data. The following code illustrates combining the different types into one data frame.

> subjects<-c('Jim','Jack','Joe','Mary','Jean')
> sex<-c('M','M','M','F','F')
> score<-c(85,90,75,100,70)
> D2<-data.frame(subjects,sex,score)
> D2

  subjects sex score
1      Jim   M    85
2     Jack   M    90
3      Joe   M    75
4     Mary   F   100
5     Jean   F    70

Another variable can be added using the $ operator; for example, D2$letter<-c('B','A','C','A','C').
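As a brief preview of the data= option mentioned above, base R's lm function fits a linear model with its variables looked up in the supplied data frame. For instance,

> fit<-lm(score~sex,data=D2)

fits score on sex using the columns of D2; regression modeling is taken up in earnest in Chapter 4.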

A set of vectors of the same type and size can be grouped into a matrix.

> X<-cbind(x,y,z)
> is.matrix(X)

[1] TRUE

> dim(X)

[1] 5 3

Note that R is case sensitive so that X is a different variable (or more generally, data object) than x.
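Elements of a matrix are accessed with the same [row, column] bracket notation used for data frames. A few examples:

> X[1,]       # the first row
> X[,'z']     # the column named z
> t(X)%*%X    # the matrix product X'X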

1.2 Reading External Data

There are a number of ways to read data from an external file into R, for example scan or read.table. Though read.table and its variants (see help(read.table)) can read files from a local file system, in the following we illustrate reading a file over the Internet. Using the command

egData<-read.csv('http://www.biostat.wisc.edu/~kloke/eg1.csv')

the contents of the dataset are now available in the current R session. To display the first several lines we may use the head command:

> head(egData)
  X        x1 x2           y
1 1 0.3407328  0  0.19320286
2 2 0.0620808  1  0.17166831
3 3 0.9105367  0  0.02707827
4 4 0.2687611  1 -0.78894410
5 5 0.2079045  0  9.39790066
6 6 0.9947691  1 -0.86209203
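Reading a file on the local file system is similar. A hypothetical sketch (the filenames are placeholders and would need to exist, e.g., in the working directory):

> dat1<-read.csv('eg1.csv')                  # comma-separated values
> dat2<-read.table('eg1.txt',header=TRUE)    # whitespace-delimited; first row names the columns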

1.3 Generating Random Data

R has an abundance of methods for random number generation. The names of these functions start with the letter r (for random) followed by an abbreviation of the distribution's name. For example, to generate a pseudo-random sample from a normal (Gaussian) distribution, one would use the function rnorm. The following code segment generates a sample of size n = 8 from a standard normal distribution.

> z<-rnorm(8)
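The same naming convention covers many other distributions; for example,

> u<-runif(8)        # 8 variates from a uniform(0,1) distribution
> w<-rexp(8,rate=2)  # 8 exponential variates with rate 2
> t3<-rt(8,df=3)     # 8 variates from a t distribution with 3 degrees of freedom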

Often, in introductory statistics courses, to illustrate the generation of data, each student is asked to toss a fair coin, say, 10 times and record the number of tosses that resulted in heads. The following experiment simulates a class of 28 students each tossing a fair coin 10 times. Note that any text to the right of the sharp (or pound) symbol # is ignored by R, i.e., it represents a comment.

> n<-10
> CoinTosses<-rbinom(28,n,0.5)
> mean(CoinTosses) # should be close to 10*0.5 = 5

[1] 5.178571

> var(CoinTosses) # should be close to 10*0.5*0.5 = 2.5

[1] 2.300265

In nonparametric statistics, a contaminated normal distribution is often used to compare the robustness of two procedures to a violation of model assumptions. The contaminated normal is a mixture of two normal distributions, say X ~ N(0, 1) and Y ~ N(0, σc²). In this case X is a standard normal and both distributions have the same location parameter μ = 0. Let ϵ denote the probability that an observation is drawn from Y and 1 − ϵ the probability that an observation is drawn from X. The cumulative distribution function (cdf) of this model is given by

F(x) = (1 − ϵ)Φ(x) + ϵΦ(x/σc)    (1.1)

where Φ(x) is the cdf of a standard normal distribution. In npsm we have included the function rcn which returns random deviates from this model. The function rcn takes three arguments: n is the sample size (n), eps is the amount of contamination (ϵ), and sigmac is the standard deviation of the contaminated component (σc). In the following code segment we obtain a sample of size n = 1000 from this model with ϵ = 0.1 and σc = 3.

> d<-rcn(1000,0.1,3)
> mean(d)   # should be close to 0

[1] -0.02892658

> var(d)  # should be close to 0.9*1 + 0.1*9 = 1.8

[1] 2.124262
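For intuition, here is a minimal sketch of how such a generator might be written; the actual rcn in npsm may differ in its details, and we name the sketch rcn_sketch to avoid masking it.

rcn_sketch <- function(n, eps, sigmac) {
 # with probability eps draw from N(0, sigmac^2), otherwise from N(0, 1)
 contaminated <- runif(n) < eps
 ifelse(contaminated, rnorm(n, sd=sigmac), rnorm(n))
}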

1.4 Graphics

R has some of the best graphics capabilities of any statistical software package; one can make high quality graphics with only a few lines of R code. In this book we use base graphics, but other graphical R packages are available, for example, ggplot2 (Wickham 2009).

Continuing with the classroom coin toss example, we can examine the sampling distribution of the sample proportion. The following code segment generates the histogram of the sample proportions p̂ displayed in Figure 1.1.

Figure 1.1: Histogram of 28 sample proportions; each estimates the proportion of heads in 10 tosses of a fair coin.

> phat<-CoinTosses/n
> hist(phat)

To examine the relationship between two variables we can use the plot command which, when applied to numeric objects, draws a scatterplot. As an illustration, we first generate a set of n = 47 data points from the linear model y = 0.5x + e, where e ~ N(0, 0.1²) and x ~ U(0, 1).

> n<-47
> x<-runif(n)
> y<-0.5*x+rnorm(n,sd=0.1)

Next, using the command plot(x,y), we create a simple scatterplot of x versus y. One could also use a formula, as in plot(y~x). Generally one will want to label the axes and add a title, as the following code illustrates; the resulting scatterplot is presented in Figure 1.2.

Figure 1.2: Example usage of the plot command.

> plot(x,y,xlab='Explanatory Variable',ylab='Response Variable',
+ main='An Example of a Scatterplot')

There are many options that can be set; for example, the plotting symbol, its size, and its color. Text and a legend may be added using the commands text and legend; a brief sketch follows.
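In the following illustration, pch selects the plotting symbol, cex its size, and col its color:

> plot(x,y,pch=19,cex=1.2,col='blue')
> text(0.1,0.5,'a label')    # place text at the point (0.1, 0.5)
> legend('topleft',legend='simulated data',pch=19,col='blue')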

1.5 Repeating Tasks

Often in scientific computing a task must be repeated many times. R offers several ways of replicating the same code, making iteration straightforward. In this section we discuss the apply, for, and tapply functions.

The apply function repeatedly applies a function to the rows or columns of a matrix (or data frame). For example, to calculate the mean of each column of the data frame D previously defined, we execute the following code:

> apply(D,2,mean)
    x     y
277.8   1.0

To apply a function to the rows of a matrix the second argument would be set to 1. The apply function is discussed further in the next section in the context of Monte Carlo simulations.

A simple example demonstrating the use of a for loop returns a vector of length n whose ith element is the cumulative sum 1 + 2 + ⋯ + i.

> n<-10
> result<-rep(1,n)
> for(i in 1:n) result[i]<-sum(1:i)

Using for is discouraged in R when a vectorized alternative exists; a loop generally results in much slower computational time than a vectorized function such as apply (see the sketch below).
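For instance, the loop above can be replaced by the built-in cumsum or by sapply; a brief check that they agree:

> result2<-cumsum(1:n)                        # built-in cumulative sums
> result3<-sapply(1:n,function(i) sum(1:i))   # apply-family version
> all.equal(result,result2)

[1] TRUE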

The function tapply is useful in obtaining summary statistics by cohort. For example, to calculate the mean score by sex from the D2 data we may use the tapply command.

> with(D2, tapply(score,sex,mean))
       F        M
85.00000 83.33333

A general purpose package for applying functions over arrays, lists, matrices, or data frames is plyr (Wickham 2011).

1.6 User Defined Functions

The syntax for creating an R function is relatively simple. A brief schematic for an R function is:

name_of_function <- function(0 or more arguments){
  ...  body of function ...
}

where name_of_function contains the newly created function; the parentheses after function enclose the arguments of the function; and the braces {} enclose any number of R statements, including function calls. A call to a user defined function is done in the expected way, e.g., result<-name_of_function(data,arguments). Usually, the last line of the body is the value to be returned. We illustrate these concepts with the following example, which computes the median and interquartile range of a sample contained in the data vector x. We name it mSummary.

mSummary <- function(x) {
 q1 <- quantile(x,.25)            # first quartile
 q3 <- quantile(x,.75)            # third quartile
 list(med=median(x),iqr=q3-q1)    # return the median and the IQR
}

These commands can be typed directly into an R session or copied and pasted from another file. Alternatively, the function may be sourced. If the function is in the file mSummary.r in the working directory, this can be accomplished by the R command source("mSummary.r"). If the file is in another directory (or folder) then the path to it must be included; the path may be relative or absolute. For example, if mSummary.r is in the directory Myfunctions, which is a subdirectory of the current working directory, then the command is source("Myfunctions/mSummary.r"). For a simple debugging run, we use the sample consisting of the first 13 positive integers.

> xsamp <- 1:13
> mSummary(xsamp)

$med
[1] 7

$iqr
75%
 6

Notice a list is returned with two elements: the median (med) and the IQR (iqr). A function need only be sourced once in an R session.

1.7 Monte Carlo Simulation

Simulation is a powerful tool in modern statistics. Inferences for rank-based procedures discussed in this book are based, generally, on the asymptotic distribution of the estimators. Simulation studies allow us to examine their performance for small samples. Specifically, simulation is used to examine the empirical level (or power) of rank-based tests of hypotheses or the empirical coverage of their confidence intervals. Comparisons of estimators are often based on their empirical relative efficiencies (the ratio of the mean squared error of the two estimators). Simulation is also used to examine the effect of violations of model assumptions on the validity of the rank-based inference. Another inference procedure used in this text is based on the bootstrap. This is a resampling technique, i.e., a Monte Carlo technique.

R is an excellent tool for simulation studies, because a simulation may be carried out with only a few lines of code. One way to run a simulation in R is to generate many samples from a distribution and use the apply function. For example,

> X<-matrix(rnorm(10*100),ncol=10)

generates a dataset with 100 rows and 10 columns. In the context of simulation, we think of the rows as distinct samples, each of size n = 10. To calculate the sample mean of each of the 100 samples, we use the apply function:

> xbar<-apply(X,1,mean)

The mean of each of the rows is calculated and the results are stored in the vector xbar. If we calculate the variance of the sample means we observe that it is similar to the theoretical result (σ2/n = 0.1).

> var(xbar)

[1] 0.1143207

We can do the same with the median:

> xmed<-apply(X,1,median)

The empirical relative efficiency of the sample median with respect to the sample mean is

> var(xbar)/var(xmed)

[1] 0.7146234

For normal data the asymptotic relative efficiency of the median to the mean is 2/π ≈ 0.64, consistent with the estimate above. Exercise 1.9.4 asks the reader to compare the efficiency of these two estimators of location when the data are drawn from a t3 distribution.

The level (α) of a statistical test is defined as the probability that the data support rejection of the null hypothesis when in fact the null hypothesis is true. The power of a statistical test is defined as the probability that the data support rejection of the null hypothesis when it is in fact false.

For our simple example, suppose we are interested in testing the null hypothesis that the true mean is 0. Using the 100 samples X, the following code obtains the empirical α-level of the nominal 5% t-test.

> myttest<-function(data) t.test(data)$p.value
> pval<-apply(X,1,myttest)
> mean(pval<0.05)

[1] 0.09

Exercise 1.9.11 asks the reader to approximate the power of the t-test under an alternative hypothesis.

1.8 R Packages

The developers of R have made it fairly easy to extend R by creating a straightforward mechanism for building a package. A package may be developed for a small number of users or distributed worldwide. Two notable distribution sites are the Comprehensive R Archive Network (CRAN) and Bioconductor. The packages hosted at CRAN tend to be for general use, while those hosted at Bioconductor are intended for analyzing high-throughput genomic data. At the time this book went to press, the CRAN repository contained over 5500 packages developed by individual users.

We have written two R packages related to nonparametrics: Rfit and npsm. Rfit (Kloke and McKean 2012) contains rank-based estimation and testing procedures for general linear models and is discussed extensively in Chapters 4 and 5. The package npsm includes many of the additional functions used in the book which are not already available in Rfit or base R. Most of the datasets used in this book are available in one of these packages. Both Rfit and npsm are available at CRAN. By loading npsm along with its dependencies the reader will have the software tools and data necessary to work through the first six chapters of the text. For later chapters, additional packages may be required and are available through https://github.com/kloke/book. New methods and features are being added to these packages and information will be available at that website. We anticipate any new code to be backward compatible with what is presented in this book. Two built-in R functions that help the user keep installed packages up-to-date are new.packages and update.packages; a brief example follows.
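> new.packages()      # packages available at the repositories but not installed
> update.packages()   # offers to update installed packages to current versions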

The install.packages command is a straightforward way to install a package from CRAN in an R session. For example to install the version of npsm on CRAN one could use the command

install.packages('npsm')

A pop-up window may appear asking the user to select a mirror. Once the mirror is selected (the user should choose one that is geographically close), R will download npsm as well as any required packages and then perform the installation. From then on, the package only needs to be loaded into R once per session using the function library. For example,

library(npsm)

will load npsm and any packages on which it depends (e.g. Rfit).

1.9 Exercises

  1.9.1. Use the commands seq and rep to create the following lists.

    1. Even numbers less than 20
    2. Odd numbers between 101 and 203
    3. 1 3 1 3 1 3 1 3
    4. 1 1 1 1 3 3 3 3
  1.9.2. Calculate the mean and variance of the following.

    1. First 100 integers.
    2. Random sample of 50 normal random variates with mean 30 and standard deviation 5.
  1.9.3. Use the sample command to simulate a sequence of 10 tosses of a fair coin. Use 'H' to denote heads and 'T' to denote tails.
  1.9.4. Using a t3 distribution, approximate the relative efficiency of the sample median to the sample mean. Which estimator is more efficient for t3 data?
  1.9.5. Create a data frame D where the first column is named x and contains a vector of observed numeric values. Verify that the following commands all produce the same result.

    1. summary(D[1:nrow(D),'x'])
    2. summary(D[,'x'])
    3. summary(D[!is.na(D$x),1])
    4. summary(D[rep(TRUE,nrow(D)),1])
  1.9.6. What is the output of the command rep(c(37,39,40,41,42),times=c(2,2,4,1,2))?
  1.9.7. A dotplot may be created with the command stripchart by using the option method='stack'. Create a dotplot of the data discussed in the previous exercise using the command stripchart.
  1.9.8. A sunflower plot can be useful for visualizing the relationship between two numeric variables which are either discrete or have been rounded. Use the R function sunflowerplot to obtain a sunflower plot of the relationship between height and weight for the baseball data in Rfit.
  1.9.9. A diagnostic test of clairvoyance is to declare a person clairvoyant if they get 8 or more tosses of a fair coin correct out of 10. Determine, either via simulation or directly, the specificity of the test; that is, determine the probability that a person who is guessing is correctly classified as non-clairvoyant.
  1.9.10. Simulate the sampling distribution of the mean of 10 tosses of a fair die.
  1.9.11. Approximate the power of a t-test of H0 : μ = 0 versus HA : μ > 0 when the true mean is μ = 0.5. Assume a random sample of size n = 25 from a normal distribution with σ = 1. Assume α = 0.05.
  1.9.12. Use the commands dnorm, seq, and lines to create a plot of the pdf of a normal distribution with μ = 50 and σ² = 10.