Chapter 13. Reproducibility and Best Practices

At the close of some programming texts, the user, now knowing the intricacies of the subject of the text, is nevertheless bewildered about how to actually get started with some serious programming. Very often, discussion of the tooling, environment, and the like—the things that inveterate programmers of language x take for granted—is left for the reader to figure out on their own.

Take R, for example—when you click on the R icon on your system, a rather Spartan window with a text-based interface appears, imploring you to enter commands interactively. Are you to program R in this manner? By typing commands one at a time into this window? This was, more or less, permissible up until this point in the book, but it just won't cut it when you're out there on your own. For any kind of serious work—requiring the rerunning of analyses with modifications, and so on—you need knowledge of the tools and typical workflows that professional R programmers use.

To not leave you in this unenviable position of not knowing how to get started, dear reader, we will be going through a whole chapter's worth of information on typical workflows and common/best practices.

You may have also noticed (via the enormous text at the top of this page) that the subject discussed in the previous paragraphs is sharing the spotlight with reproducibility. What's this, then?

Reproducibility is the ability for you, or an independent party, to repeat a study, experiment, or line of inquiry. This implies the possession of all the relevant and necessary materials and information. It is one of the principal tenets of scientific inquiry. If a study is not replicable, it is simply not science.

If you are a scientist, you are likely already aware of the virtues of reproducibility (if not, shame on you!). If you're a non-scientist data analyst, there is great merit in your taking reproducibility seriously, too. For one, starting an analysis with reproducibility in mind requires a level of organization that makes your job a whole lot easier, in the medium and long run. Secondly, the person who is likely going to be reproducing your analyses the most is you; do yourself a favor, and take reproducibility seriously so that when you need to make changes to an analysis, alter your priors, update your data source, adjust your plots and figures, or roll back to an established checkpoint, you make things easier on yourself. Lastly—and true to the intended spirit of reproducibility—it makes for more reliable and trustworthy dissemination of information.

By the way, all these benefits still hold even if you are working for a private (or otherwise confidential) enterprise, where the analyses are not to be repeated or known about outside of the institution. The ability of your coworkers to follow the narrative of your analysis is invaluable, and can give your firm a competitive edge. Additionally, the ability for supervisors to track and audit your progress is helpful—if you're honest. Finally, keeping your analyses reproducible will make your coworkers' lives much easier when you finally drop everything to go live on the high seas.

Anyway, we are talking about best practices and reproducibility in the same chapter because of the intimate relationship between the two goals. More explicitly, it is best practice for your code to be as reproducible as possible.

Both reproducibility and best practices are wide and diverse topics, but the information in this chapter should give you a great starting point.

R Scripting

The absolute first thing you should know about standard R workflows is that programs are not generally written directly at the interactive R interpreter. Instead, R programs are usually written in a text file (with a .r or .R file extension). These are usually referred to as R scripts. When these scripts are completed, the commands in this text file are usually executed all at once (we'll get to see how, soon). During development of the script, however, the programmer usually executes portions of the script interactively to get feedback and confirm proper behavior. This interactive component to R scripting allows for building each command or function iteratively.

I've known some serious R programmers who copy and paste from their favorite text editor into an interactive R session to achieve this effect. To most people, particularly beginners, the better solution is to use an editor that can send R code from the script that is actively being written to an interactive R console, line-by-line (or block-by-block). This provides a convenient mechanism to run code, get feedback, and tweak code (if need be) without having to constantly switch windows.

If you're a user of the venerable Vim editor, you may find that the Vim-R-plugin achieves this nicely. If you use the equally revered Emacs editor, you may find that Emacs Speaks Statistics (ESS) accomplishes this goal. If you don't have any compelling reason not to, though, I strongly suggest you use RStudio to fill this need. RStudio is a powerful, free Integrated Development Environment (IDE) for R. Not only does RStudio give you the ability to send blocks of code to be evaluated by the R interpreter as you write your scripts, but it also provides all the affordances you'd expect from the most advanced of IDEs, such as syntax highlighting, an interactive debugger, code completion, integrated help and documentation, and project management. It also provides some very helpful R-specific functionality, like a mechanism for visualizing a data frame in memory as a spreadsheet and an integrated plot window. Lastly, it is very widely used within the R community, so there is an enormous amount of help and support available.

Given that RStudio is so helpful, some of the remainder of the chapter will assume you are using it.

RStudio

First things first—go to http://www.rstudio.com, and navigate to the downloads page. Download and install the Open Source Edition of the RStudio Desktop application.

When you first open RStudio, you may only see three panes (as opposed to the four-paned window in Figure 13.1). If this is the case, click the button labeled e in Figure 13.1, and click R Script from the dropdown. Now the RStudio window should look a lot like the one in Figure 13.1.

The first thing you should know about the interface is that all of the panels serve more than one function. The pane labeled a is the source code editor. This will be the pane wherein you edit your R scripts. This will also serve as the editor panel for LaTeX, C++, or RMarkdown, if you are writing these kinds of files. You can work on multiple files at the same time using tabs to switch from document to document. Panel a will also serve as a data viewer that will allow you to view datasets loaded in memory in a spreadsheet-like manner.

Panel b is the interactive R console, which is functionally equivalent to the interactive R console that shipped with R from CRAN. This pane will also display other helpful information or the output of various goings-on in secondary or tertiary tabs.

Panel c allows you to see the objects that you have defined in your global environment. For example, if you load a dataset from disk or the web, the name of the dataset will appear in this panel; if you click on it, RStudio will open the dataset in the data viewer in panel a. This panel also has a tab labeled History, which you can use to view R statements you've executed in the past.

Panel d is the most versatile one; depending on which of its tabs are open, it can be a file explorer, a plot displayer, an R package manager, or a help browser.


Figure 13.1: RStudio's four-panel interface in Mac OS X (version 0.99.486)

The typical R script development workflow is as follows: R statements, expressions, and functions are typed into the editor in panel a; statements from the editor are executed in the console in panel b by putting the cursor on a chosen line and clicking the Run button (component g from the figure), or by selecting multiple lines and then clicking the Run button. If the outputs of any of these statements are plots, panel d will automatically display these. The script is named and saved when the script is complete (or, preferably, many times while you are writing it).

To learn your way around the RStudio interface, write an R script called nothing.R with the following content:

library(ggplot2)
nothing <- data.frame(a=rbinom(1000, 20, .5),
                      b=c("red", "white"),
                      c=rnorm(1000, mean=100, sd=10))
qplot(c, data=nothing, geom="histogram")
write.csv(nothing, "nothing.csv")

Execute the statements one by one. Notice that the histogram is automatically displayed in panel d. After you are done, type and execute ?rbinom in the interactive console. Notice how panel d displays the help page for this function? Finally, click on the object labeled nothing in panel c and inspect the dataset in the data viewer.

Running R scripts

There are a few ways to run saved R scripts, like nothing.R. First—and this is RStudio specific—is to click the button labeled Source (component h). This is roughly equivalent to highlighting the entire document and clicking Run.

Of course, we would like to run R scripts without being dependent on RStudio. One way to do this is to use the source function in the interactive R console—either RStudio's console, the console that ships with R from CRAN, or your operating system's command prompt running R. The source function takes a filename as its first and only required argument. The file specified will be executed, and when it's done, you will be returned to the prompt with all the objects from the R script now in your workspace. Try this with nothing.R; executing the ls() command after the source function ends should indicate that the nothing data frame is now in your workspace. Calling the source() function is what happens under the hood when you press the Source button in RStudio. If you have trouble making this work, make sure that either (a) you specify the full path to the file nothing.R in the source() function call, or (b) you use setwd() to make the directory containing nothing.R your current working directory before you execute source("nothing.R").
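Concretely, such a session might look like the following (a sketch that assumes nothing.R is in your current working directory):

```r
# assumes nothing.R is in the current working directory; otherwise,
# pass the full path to source() or call setwd() first
source("nothing.R")

# list the objects the script created in the workspace --
# "nothing" (the data frame) should be among them
ls()

# the script also wrote nothing.csv, which we can read back in
head(read.csv("nothing.csv"))
```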

A third, less popular method is to use the R CMD BATCH command at your operating system's command/terminal prompt. This should work on all systems, out of the box, except Windows, which may require you to add the R binary folder (usually something like C:\Program Files\R\R-3.2.1\bin) to your PATH variable. There are instructions on how to accomplish this on the web.

Note

Your system's command prompt (or terminal emulator) will depend on which operating system you use. Windows users' command prompt is called cmd.exe (which you can run by pressing Windows-key+R, typing cmd, and pressing Enter). Macintosh users' terminal emulator is known as Terminal.app, and is under /Applications/Utilities. If you use GNU/Linux or BSD, you know where the terminal is.

Use the following incantation:

R CMD BATCH nothing.R

This will execute the code in the file and automatically direct its output into a file named nothing.Rout, which can be read with any text editor.

R may have asked you, any time you tried to quit R, whether you wanted to save your workspace image. Saving your workspace image means that R will create a special file in your current working directory (usually named .RData) containing all the objects in your current workspace, which will be automatically loaded again if you start R in that directory. This is super useful if you are working with R interactively and you want to exit R but be able to pick up where you left off some other time. However, this can cause issues with reproducibility, since another useR won't have the same .RData file on their computer (and you won't have it when you rerun the same script on another computer). For this reason, we use R CMD BATCH with the --vanilla option:

R --vanilla CMD BATCH nothing.R

which means: don't restore previously saved objects from .RData, don't save the workspace image when the R script is done running, and don't read any of the files that can store custom R code to be automatically loaded in each R session. Basically, this amounts to: don't do anything that couldn't be replicated using another computer and R installation.
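Relatedly, if you want to quit an interactive session without saving (or being prompted about) the workspace image, you can be explicit about it (a small sketch; q() and quit() are equivalent):

```r
# quit R without saving the workspace image -- the interactive
# analog of --vanilla's "don't save" behavior
q(save="no")
```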

The final method—which is my preference—is to use the Rscript program that comes with recent versions of R. On GNU/Linux, Macintosh, or any other Unix-like system that supports R, this will automatically be available to use from the command/terminal prompt. On Windows, the aforementioned R binary folder must be added to your PATH variable.

Using Rscript is as easy as typing the following:

Rscript nothing.R

Or, if you care about reproducibility (and you do!):

Rscript --vanilla nothing.R

This is the way I suggest you run R scripts when you're not using RStudio.

Note

If you are using a Unix or Unix-like operating system (like Mac OS X or GNU/Linux), you may want to put a line like #!/usr/bin/Rscript --vanilla as the first line in your R scripts. This is called a shebang line, and will allow you to run your R scripts as a program without specifying Rscript at the prompt. For more information, read the article Shebang (Unix) on Wikipedia.
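For instance, nothing.R could be rewritten to start with a shebang line like this (assuming Rscript lives at /usr/bin/Rscript on your system):

```r
#!/usr/bin/Rscript --vanilla
# after a one-time `chmod +x nothing.R` at the terminal, this script
# can be run directly as ./nothing.R -- no need to type Rscript
library(ggplot2)
nothing <- data.frame(a=rbinom(1000, 20, .5),
                      b=c("red", "white"),
                      c=rnorm(1000, mean=100, sd=10))
qplot(c, data=nothing, geom="histogram")
write.csv(nothing, "nothing.csv")
```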

An example script

Here's an example R script that we will be referring to for the rest of the chapter:

#!/usr/bin/Rscript --vanilla
###########################################################
##                                                       ##
##   nyc-sat-scores.R                                    ##
##                                                       ##
##                Author: Tony Fischetti                 ##
##                        [email protected]       ##
##                                                       ##
###########################################################

##
## Aim: to use Bayesian analysis to compare NYC's 2010 
##      combined SAT scores against the average of the
##      rest of the country, which, according to
##      FairTest.com, is 1509
##

# workspace cleanup
rm(list=ls())

# options
options(echo=TRUE)
options(stringsAsFactors=FALSE)

# libraries
library(assertr)   # for data checking
library(runjags)   # for MCMC

# make sure everything is all set with JAGS
testjags()
# yep!


## read data file
# data was retrieved from NYC Open Data portal
# direct link: https://data.cityofnewyork.us/api/views/zt9s-n5aj/rows.csv?accessType=DOWNLOAD
nyc.sats <- read.csv("./data/SAT_Scores_NYC_2010.csv")

# let's give the columns easier names
better.names <- c("id", "school.name", "n", "read.mean",
                  "math.mean", "write.mean")
names(nyc.sats) <- better.names


# there are 460 rows but almost 700 NYC schools
# we will *assume*, then, that this is a random
# sample of NYC schools

# let's first check the veracity of this data...
#nyc.sats <- assert(nyc.sats, is.numeric,
#                   n, read.mean, math.mean, write.mean)

# It looks like the check failed because there are "s"s for some
# rows. (??) A look at the data set description indicates
# that the "s" is for schools with 5 or fewer students.
# For our purposes, let's just exclude them.


# This is a function that takes a vector and replaces all "s"s
# with NAs, so that the remaining values can be converted to numerics
remove.s <- function(vec){
  ifelse(vec=="s", NA, vec)
}

nyc.sats$n          <- as.numeric(remove.s(nyc.sats$n))
nyc.sats$read.mean  <- as.numeric(remove.s(nyc.sats$read.mean))
nyc.sats$math.mean  <- as.numeric(remove.s(nyc.sats$math.mean))
nyc.sats$write.mean <- as.numeric(remove.s(nyc.sats$write.mean))

# Remove schools with fewer than 5 test takers
nyc.sats <- nyc.sats[complete.cases(nyc.sats), ]

# Calculate a total combined SAT score
nyc.sats$combined.mean <- (nyc.sats$read.mean +
                           nyc.sats$math.mean +
                           nyc.sats$write.mean)

# Let's build a posterior distribution of the true mean
# of NYC high schools' combined SAT scores.

# We're not going to look at the summary statistics, because
# we don't want to bias our priors

# Specify a standard gaussian model
the.model <- "
model {
  # priors
  mu ~ dunif(0, 2400)
  stddev ~ dunif(0, 500)
  tau <- pow(stddev, -2)

  # likelihood
  for(i in 1:theLength){
     samp[i] ~ dnorm(mu, tau)
  }
}"

the.data <- list(
  samp = nyc.sats$combined.mean,
  theLength = length(nyc.sats$combined.mean)
)

results <- autorun.jags(the.model, data=the.data,
                        n.chains = 3,
                        monitor = c('mu', 'stddev'))

# View the results of the MCMC
print(results)

# Plot the MCMC diagnostics
plot(results, plot.type=c("histogram", "trace"), layout=c(2,1))
# Looks good!

# Let's extract the MCMC samples of the mean and get the
# bounds of the middle 95%
results.matrix <- as.matrix(results$mcmc)
mu.samples <- results.matrix[,'mu']
bounds <- quantile(mu.samples, c(.025, .975))

# We are 95% sure that the true mean is between 1197 and 1232

# Now let's plot the marginal posterior distribution for the mean
# of the NYC high schools' combined SAT grades and draw the 95%
# percent credible interval.
plot(density(mu.samples),
     main=paste("Posterior distribution of mean combined SAT",
                "score in NYC high schools (2010)", sep="\n"))
lines(c(bounds[1], bounds[2]), c(0, 0), lwd=3, col="red")


# Given the results, the SAT scores for NYC high schools in 2010
# are *incontrovertibly* not at par with the average SAT scores of
# the nation.

There are a few things I'd like you to note about this R script and its adherence to best practices.

First, the filename is nyc-sat-scores.R—not foo.R, do it.R, or any of that nonsense; when you are looking through your files in six months, there will be no question about what the file was supposed to do.

The second is that comments are sprinkled liberally throughout the entire script. These comments serve to state the intentions and purpose of the analysis, separate sections of code, and remind ourselves (or anyone else who is reading) where the data file came from. Additionally, comments are used to block out sections of code that we'd like to keep in the script but don't want to execute. In this example, we commented out the statement that calls assert, since the assertion fails. With these comments, anybody—even an R beginner—can follow along with the code.

There are a few other manifestations of good practice on display in this script: indentation that aids in following the code flow, spaces and newlines that enhance readability, lines that are restricted to under 80 characters, and variables with informative names (no foo, bar, or baz).

Lastly, take note of the remove.s function we employ instead of copy-and-pasting ifelse(vec=="s", NA, …) four times. An angel loses its wings every time you copy-and-paste code, since it is a notorious vector for mistakes.
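In the same spirit, the four nearly identical as.numeric(remove.s(...)) lines could themselves be collapsed into a single loop over the affected columns (a sketch, not part of the original script):

```r
# clean every score column in one pass instead of four
# copy-and-pasted lines
score.cols <- c("n", "read.mean", "math.mean", "write.mean")
nyc.sats[score.cols] <- lapply(nyc.sats[score.cols],
                               function(vec) as.numeric(remove.s(vec)))
```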

Scripting and reproducibility

Put any code that is not one-off, and is meant to be run again, in a script. Even for one-off code, you are better off putting it in a script, because (a) you may be wrong (and often are) about not needing to run it again, (b) it provides a record of what you've done (including, perhaps, unnoticed bugs), and (c) you may want to use similar code at another time.

Scripting enhances reproducibility, because now, the only things we need to reproduce this line of inquiry on another computer are the script and the data file. If we didn't place all this code in a script, we would have had to copy and paste our interactive R console history, which is ugly and messy to say the absolute least.

It's time to come clean about a fib I told in the preceding paragraph. In most cases, all you need to reproduce the results are the data file(s) and the R script(s). In some cases, however, some code you've written that works in your version of R may not work on another person's version of R. Somewhat more common is that the code you write, which uses a functionality provided by a package, may not work on another version of that package.

For this reason, it's good practice to record the version of R and the packages you're using. You can do this by executing sessionInfo(), and copying the output and pasting it into your R script at the bottom. Make sure to comment all of these lines out, or R will attempt to execute them the next time the script is run. For a prettier/better alternative to sessionInfo(), use the session_info() function from the devtools package. The output of devtools::session_info() for our example script looks like this:

> devtools::session_info()
Session info ---------------------------------
 setting  value                       
 version  R version 3.2.1 (2015-06-18)
 system   x86_64, darwin13.4.0        
 ui       RStudio (0.99.486)          
 language (EN)                        
 collate  en_US.UTF-8                 
 tz       America/New_York            
 date     1969-07-20

Packages -------------------------------------
 package  * version date       source        
 assertr  * 1.0.0   2015-06-26 CRAN (R 3.2.1)
 coda       0.17-1  2015-03-03 CRAN (R 3.2.0)
 devtools   1.9.1   2015-09-11 CRAN (R 3.2.0)
 digest     0.6.8   2014-12-31 CRAN (R 3.2.0)
 lattice    0.20-33 2015-07-14 CRAN (R 3.2.0)
 memoise    0.2.1   2014-04-22 CRAN (R 3.2.0)
 modeest    2.1     2012-10-15 CRAN (R 3.2.0)
 rjags      3-15    2015-04-15 CRAN (R 3.2.0)
 runjags  * 2.0.2-8 2015-09-14 CRAN (R 3.2.0)

The packages that we explicitly loaded are marked with an asterisk; all the other packages listed are packages that are used by the packages we loaded. It is important to note the version of these packages, too, as they can potentially cause cross-version irreproducibility.
