A very compelling benefit of our neat hierarchical organization scheme is that it integrates easily with version control systems. At a basic level, a version control system tracks changes (revisions) to a set of files and makes it easy to roll those files back to a previous state.
A simple (and inadequate) approach is to compress your analysis project at regular intervals and suffix the filename of each compressed copy with a timestamp. This way, if you make a mistake and would like to revert to a previous version, all you have to do is delete your current project and uncompress the copy from the time you want to roll back to.
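The timestamped-archive approach can be sketched in a few lines. This is only an illustration, written in Python and assuming the backups live outside the project folder; the function names `snapshot_project` and `restore_project` and the timestamp format are hypothetical choices, not part of any tool discussed here.

```python
import shutil
from datetime import datetime
from pathlib import Path


def snapshot_project(project_dir, backup_dir, now=None):
    """Zip project_dir into backup_dir with a timestamp suffix.

    backup_dir is assumed to sit outside project_dir; otherwise each
    snapshot would swallow all the earlier ones.
    """
    project = Path(project_dir).resolve()
    backups = Path(backup_dir).resolve()
    backups.mkdir(parents=True, exist_ok=True)

    # Hypothetical naming scheme: <project>_<YYYYMMDD-HHMMSS>.zip
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    base_name = backups / f"{project.name}_{stamp}"

    # make_archive appends ".zip" itself and returns the archive path
    archive = shutil.make_archive(
        str(base_name), "zip",
        root_dir=str(project.parent), base_dir=project.name,
    )
    return Path(archive)


def restore_project(archive_path, restore_parent):
    """Unpack a snapshot, recreating the project folder under restore_parent."""
    target = Path(restore_parent)
    target.mkdir(parents=True, exist_ok=True)
    shutil.unpack_archive(str(archive_path), str(target))
```

Calling `snapshot_project("analysis", "backups")` would leave a file such as `analysis_<timestamp>.zip` in `backups`. Note that a rollback restores the whole project at once, which is exactly the limitation that makes this approach inadequate.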
A far saner solution is to use a remote file synchronization service that features revision tracking. The most popular of these services at the time of writing is Dropbox, though there are others such as TeamDrive and Box. These services allow you to upload your project into the cloud. When you make changes to your local copy, they track your changes, resynchronize the remotely stored copy, and version your project for you. Now you can revert to a previous version of just one file, instead of having to revert the entire project hierarchy.
A great benefit of using one of these services is that any number of collaborators can be invited to work on the project simultaneously. You can even set permissions for the files each collaborator can read/write to. The service you choose should be able to track the changes made by the collaborators, too.
Perhaps the sanest solution is to use an actual version control system like Git, Mercurial, Subversion, or CVS. These are traditionally used for software projects that contain hundreds of files and many contributors, but they are proving to be a crackerjack solution for data analysts with just a few files and few or no other contributors. These systems offer the most flexibility in terms of rollback, revision tracking, conflict resolution (reconciling incompatible changes), compression, and merging. The combination of Git and GitHub (a remote Git repository hosting service) is proving to be a particularly effective and common solution for statistical programmers.
Version control enhances reproducibility: since all the changes to the entire project (scripts, data, and folder layout) are documented, all the changes are repeatable.
If your data files are small to medium, keeping them in your project will play nicely with your version control solution; it will even offer great benefits like the assurance that no one tampered with your data. If your data is too large, though, you might look into other data storage solutions like remote database storage.
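The tamper assurance comes from the fact that systems such as Git identify stored file contents by a cryptographic hash, so any change to a tracked data file is detectable. The Python sketch below illustrates the idea only; it is not how any version control system is implemented, and the function names (`file_digest`, `build_manifest`, `changed_files`) and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir):
    """Map each file's path (relative to data_dir) to its digest."""
    root = Path(data_dir)
    return {
        path.relative_to(root).as_posix(): file_digest(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(data_dir, recorded_manifest):
    """List files added, removed, or modified since the manifest was recorded."""
    current = build_manifest(data_dir)
    all_paths = set(current) | set(recorded_manifest)
    return sorted(
        p for p in all_paths if current.get(p) != recorded_manifest.get(p)
    )
```

Recording `build_manifest("data")` once and later calling `changed_files("data", manifest)` returns an empty list only if every data file is byte-for-byte unchanged. A version control system gives you this check for free on every commit.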
Package version management
R analysts who rely heavily on add-on CRAN packages may choose to use a tool to manage these packages and their versions. The two most popular tools for this are the packages packrat and checkpoint.
packrat, which is the more popular of the two, maintains a library of the packages an analysis uses inside the project's root directory. This allows the analysis and the packages it depends on to be version controlled.
checkpoint allows you to use the versions of CRAN packages as they were on a particular date. An analyst would store the date of the CRAN snapshot used at the top of a script, and the proper versions of these packages would automatically download on a collaborator's machine.