R projects

There are some (rare) cases where a single R script contains the totality of your research/analyses. This may happen if you are doing simulation studies, for example. For most cases, an analysis will consist of a script (or scripts) and at least one data set. I refer to any R analysis that uses at least two files as an R project.

In R projects, special attention must be paid to how the files are stored relative to each other. For example, if we stored the file SAT_Scores_NYC_2010.csv on our desktop, the data import line would have read:

read.csv("/Users/bensisko/Desktop/SAT_Scores_NYC_2010.csv")

If we wanted to send this analysis to a collaborator to be replicated, we would send them the script and the data file. Even if we instructed them to place the file on their desktop, the script would still not be reproducible. Our collaborators on Windows and Linux would have to manually change the argument of read.csv to C:/Users/jameskirk/Desktop/SAT_Scores_NYC_2010.csv or /home/katjaneway/Desktop/SAT_Scores_NYC_2010.csv, respectively.

A far better way to handle this situation is to organize all your files in a neat hierarchy that will allow you to specify relative paths for your data imports. In this case, it means making a folder called sat-scores (or something like that), which contains the script nyc-sat-scores.R and a folder called data that contains the file SAT_Scores_NYC_2010.csv:


Figure 13.2: A sample file/folder hierarchy for an R analysis project

The function call read.csv("./data/SAT_Scores_NYC_2010.csv") instructs R to load the dataset inside the data folder in the current working directory. Now, if we wanted to send our analysis to a collaborator, we would just send them the folder (which we can compress, if we want), and it will work no matter what our collaborator's username and operating system are, as long as R's working directory is the project folder when the script runs. Additionally, everything is nice and neat, and in one place. Note that we put a file called README.txt into the root directory of our project. This file would contain information about the analysis, instructions for running it, and so on. This is a common convention.
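As a minimal sketch of this layout, the following base R code builds the hierarchy in a temporary directory and reads the data through a relative path. The two-row data frame and its column names are placeholders invented for illustration, not the contents of the real SAT file, and the temporary location is an assumption made only so that the example can run anywhere.

```r
# Build the sat-scores hierarchy in a throwaway location (assumption:
# a temporary directory stands in for wherever the project really lives).
project_root <- file.path(tempdir(), "sat-scores")
dir.create(file.path(project_root, "data"),
           recursive = TRUE, showWarnings = FALSE)

# Placeholder data: these columns and values are made up for the demo.
placeholder <- data.frame(school = c("A", "B"), score = c(1, 2))
write.csv(placeholder,
          file.path(project_root, "data", "SAT_Scores_NYC_2010.csv"),
          row.names = FALSE)
writeLines("Instructions for running the analysis go here.",
           file.path(project_root, "README.txt"))

# Relative paths are resolved against the working directory, so make the
# project root the working directory before importing.
old_wd <- setwd(project_root)
sat <- read.csv("./data/SAT_Scores_NYC_2010.csv")
setwd(old_wd)  # restore the previous working directory

print(dim(sat))
```

The import line contains no username or drive letter, so it should run unchanged on a collaborator's machine once the working directory points at their copy of the project folder.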

Anyway, never use absolute paths!

In projects that use more than one R script, some choose a slightly different project layout. For example, let's say we divided our preceding script into load-and-clean-sat-data.R and analyze-sat-data.R; we might choose a folder hierarchy that looks like this:


Figure 13.3: A sample file/folder hierarchy for a multiscript R analysis project

Under this organizational paradigm, the two scripts are now placed in a folder called code, and a new script, master.R, is placed in the project's root directory. master.R is called a driver script, and it will call our two non-driver scripts in the right order. For example, master.R may look like this:

#!/usr/bin/Rscript --vanilla
source("./code/load-and-clean-sat-data.R")
source("./code/analyze-sat-data.R")

Now, our collaborator just has to execute master.R, which will, in turn, execute our analysis scripts.
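To see the driver pattern end to end, here is a sketch that writes stand-in versions of the three scripts to a temporary project and then runs the driver. The script bodies, the object sat, and the output file results.txt are hypothetical placeholders; only the script names and the two source() calls come from the text.

```r
# Throwaway project root (assumption: a temporary directory).
root <- file.path(tempdir(), "sat-scores-multi")
dir.create(file.path(root, "code"), recursive = TRUE, showWarnings = FALSE)

# Stand-in for the loading/cleaning script: defines a placeholder object.
writeLines('sat <- data.frame(score = c(1, 2, 3))',
           file.path(root, "code", "load-and-clean-sat-data.R"))

# Stand-in for the analysis script: relies on `sat` from the first script.
writeLines('writeLines(as.character(mean(sat$score)), "results.txt")',
           file.path(root, "code", "analyze-sat-data.R"))

# The driver sources the two scripts in the required order.
writeLines(c('source("./code/load-and-clean-sat-data.R")',
             'source("./code/analyze-sat-data.R")'),
           file.path(root, "master.R"))

# Run the driver with the project root as the working directory.
old_wd <- setwd(root)
source("master.R")
setwd(old_wd)  # restore the previous working directory

result <- readLines(file.path(root, "results.txt"))
```

From a terminal, the equivalent would typically be to run Rscript master.R from inside the project folder. The order of the source() calls matters here because the analysis stand-in uses an object that only the loading stand-in creates.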

Note

There are a few alternatives to using an R script as a driver. One common alternative is to use a shell script as a driver. These scripts contain code that is run by the operating system's command-line interpreter. A downside of this approach is that shell scripts are, in general, not portable across the Windows versus all-other-operating-systems divide.

A common, but somewhat more advanced, alternative is to replace master.R with a dependency-tracking build utility like make, shake, sake, or drake. This offers a host of benefits, including extensibility and the identification of redundant computations.
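The idea of identifying redundant computations can be illustrated in base R without any of those tools. The sketch below is not how any of these utilities is implemented; it only mimics the timestamp rule used by make, where a target is rebuilt if it is missing or older than a file it depends on. The helper needs_rebuild() and the file names are hypothetical.

```r
# Hypothetical helper: TRUE when `target` is missing or older than any
# of the files in `deps`.
needs_rebuild <- function(target, deps) {
  if (!file.exists(target)) return(TRUE)
  any(file.mtime(deps) > file.mtime(target))
}

# Demo files in a fresh temporary directory (names are placeholders).
demo_dir <- file.path(tempdir(), "rebuild-demo")
unlink(demo_dir, recursive = TRUE)
dir.create(demo_dir)
raw   <- file.path(demo_dir, "raw.csv")
clean <- file.path(demo_dir, "clean.csv")
writeLines(c("x", "1"), raw)

before_build <- needs_rebuild(clean, raw)  # target does not exist yet

# "Build" the target, then set explicit timestamps so the demo does not
# depend on the file system's clock resolution.
file.copy(raw, clean, overwrite = TRUE)
now <- Sys.time()
Sys.setFileTime(raw, now - 60)
Sys.setFileTime(clean, now)
after_build <- needs_rebuild(clean, raw)   # target is newer than its input

# Simulate an edit to the raw data after the target was built.
Sys.setFileTime(raw, now + 60)
after_edit <- needs_rebuild(clean, raw)
```

A driver could wrap each source() call in a check like this to skip steps whose outputs are already up to date, which is the kind of bookkeeping a dedicated build utility handles automatically.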
