To write code that works with data, you will need several (free) software programs for writing, executing, and managing that code. This chapter details which software you will need and explains how to install it. While there are a variety of options for each task, we discuss programs that are widely supported within the data science community and whose popularity continues to grow.
It is an unfortunate reality that one of the most frustrating and confusing barriers to working with code is getting your machine properly set up. This chapter aims to provide sufficient information for setting up your machine and troubleshooting the installation process.
In short, you will need to install the following programs, each of which is described in detail in the following sections.
For Writing Code
There are two different programs that we suggest you use for writing code:
RStudio: An integrated development environment (IDE) for writing and executing R code. This will be your primary work environment for doing data science. You will also need to install the R software so that RStudio will be able to execute your code (discussed later in this section).
Atom: A lightweight text editor that supports programming in lots of different languages. (Other text editors will also work effectively; some further suggestions are included in this chapter.)
For Managing Code
To manage your code, you will need to install and set up the following programs:
git: An application used to track changes to your files (namely, your code). This is crucial for maintaining an organized project, and can help facilitate collaboration with other developers. This program is already installed on Macs.
GitHub: A web service for hosting code online. You don’t actually need to install anything (GitHub uses git), but you will need to create a free account on the GitHub website. The corresponding exercises for this book are hosted on GitHub.
For Executing Code
To provide instructions to your machine (i.e., run code), you will need to have an environment in which to provide those instructions, while also ensuring that your machine is able to understand the language in which you’re writing your code.
Bash shell: A command line interface for controlling your computer. This will provide you with a text-based interface you can use to work with your machine. Macs already have a Bash shell program called Terminal, which you can use “out of the box.” On Windows, installing git will also install an application called Git Bash, which you can use as your Bash shell.
R: A programming language commonly used for data science. This is the primary programming language used throughout this book. “Installing R” actually means downloading and installing tools that will let your computer understand and run R code.
The remainder of this chapter has additional information about the purpose of each software system, how to install it, and alternative configurations or options. The programs are described in the order they are introduced in the book (though in many cases, the software programs are used in tandem).
The command line provides a text-based interface for giving instructions to your computer (much more on this in Chapter 2). As you are getting started with data science, you will largely use the command line for navigating your computer’s file structure and executing commands that allow you to keep track of changes to the code you write (i.e., version control with git).
To use the command line, you will need to use a command shell (also called a command prompt or terminal). This computer program provides the interface in which you type commands. In particular, this book discusses the Bash shell, which provides a particular set of commands common to Mac and Linux machines.
On a Mac, you will want to use the built-in app called Terminal as your Bash shell. This application is part of the Mac operating system, so you don’t need to install anything. You can open Terminal by searching via Spotlight (press cmd+spacebar together, type in “terminal”, then select the app to open it), or by finding it in the Applications > Utilities folder. This will open your Terminal window, as described in Chapter 2.
On Windows, we recommend using Git Bash as your Bash shell, which is installed along with git. Open this program to open the command shell. This works well, since you will primarily be using the command line for performing version control.
Alternatively, the 64-bit Windows 10 Anniversary Update (August 2016) includes a version of an integrated Bash shell. You can access this by enabling the subsystem for Linux1 and then running bash in the command prompt.
1Install the Windows subsystem for Linux: https://msdn.microsoft.com/en-us/commandline/wsl/install_guide
Windows includes its own command shell, called Command Prompt (previously DOS Prompt), but it has a different set of commands and features. If you try to use the commands described in Chapter 2 with Command Prompt, they will not work. For a more advanced Windows management framework, you can look into using PowerShell. Because Bash is more common in open source programming like that in this book, we will focus on that set of commands.
Most Linux flavors come with a command shell pre-installed; for example, in Ubuntu you can use the Terminal application (use ctrl+alt+t to open it, or search for it from the Ubuntu dashboard).
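Once a shell window is open, a quick check can confirm that a Bash program is available on your machine. The following is a minimal sketch, assuming a Bash-compatible shell; the variable name shell_report is only illustrative and does not come from this book.

```shell
# Record which Bash version (if any) is available on this machine.
# shell_report is a hypothetical variable name used only for this example.
if command -v bash >/dev/null 2>&1; then
  shell_report="$(bash --version | head -n 1)"
else
  shell_report="bash not found"
fi

# Print the result so you can read it in the terminal
echo "$shell_report"
```

Note that this reports whether Bash can be found, not which shell the current window happens to be running.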
One of the most important aspects of doing data science is keeping track of the changes that you (and others) make to your code.
git is a version control system that provides a set of commands that you can use to manage changes to written code, particularly when collaborating with other programmers (version control is described in more detail in Chapter 3).
git comes pre-installed on Macs, though it is possible that the first time you try to use the tool you will be prompted to install the Xcode command line developer tools via a dialog box. You can choose to install these tools, or download the latest version of
On Windows, you will need to download2 the git software. Once you have downloaded the installer, double-click on your downloaded file, and follow the instructions to complete installation. This will also install a program called Git Bash, which provides a command line (text-based) interface for executing commands on your computer. See Section 1.1.2 for alternative and additional Windows command line tools.
On Linux, you can install git using apt-get or a similar command. For more information, see the download page for Linux.3
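Whichever operating system you use, you may want to confirm that the installation worked before moving on. The sketch below assumes a Bash shell; git_status is a hypothetical variable name chosen for illustration, and the apt-get line in the comment applies only to Debian-based systems such as Ubuntu.

```shell
# Check whether git can be found, and record its version if so.
# git_status is a hypothetical variable name used only for this example.
if command -v git >/dev/null 2>&1; then
  git_status="$(git --version)"
else
  # On Debian or Ubuntu, git is typically installed with:
  #   sudo apt-get install git
  git_status="git not found"
fi

# Print the result so you can read it in the terminal
echo "$git_status"
```

If the output is a version line rather than "git not found", the git program is available from your command line.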
GitHub4 is a website that is used to store copies of computer code that are being managed with git. To use GitHub, you will need to create a free GitHub account.5 When you register, remember that your profile is public, and future collaborators or employers may review your GitHub account to assess your background and ongoing projects. Because GitHub leverages the git software package, you don’t need to install anything else on your machine to use GitHub.
While you will be using RStudio to write R code, you will sometimes want to use another text editor that is more lightweight (e.g., runs faster), more robust, or supports a different programming language than R. A coding-focused text editor provides features such as automatic formatting and coloring for easier interpretation of the code, auto-completion, and integration with version control (features that are also available in RStudio).
Many different text editors are available, all of which have slightly different appearances and features. You only need to download and use one of the following programs (we recommend Atom as a default), but feel free to try out different ones until you find something you like (and then evangelize about it to your friends!).
Programming involves working with many different file types, each of which is indicated by its extension (the letters after the
. in the file name, such as
Atom6 is a text editor built by the folks at GitHub. As it is an open source project, people are continually building (and making available) interesting and useful extensions to Atom. Atom’s built-in spell-check is a great feature, especially for documents that require lots of written text. It also has excellent support for Markdown, a markup language used regularly in this book (see Chapter 4). In fact, much of this text was written using Atom!
To download Atom, visit the application’s webpage and click the “Download” button. On Windows, you will download the AtomSetup.exe installer file; double-click on that icon to install the application. On a Mac, you will download a zip file; open that file and drag the Atom.app file to your “Applications” folder.
Once you’ve installed Atom, you can open the program and create a new text file (just like you would create a new file with a word processor such as Microsoft Word). When you save a document that is a particular file type (e.g., FILE_NAME.md), Atom (or any other modern text editor) will apply a language-specific color scheme to your text, making it easier to read.
The trick to using Atom more efficiently is to get comfortable with the Command Palette.7 If you press cmd+shift+p (Mac) or ctrl+shift+p (Windows), Atom will open a small window where you can search for whatever you want the editor to do. For example, if you type in markdown, you can get a list of commands related to Markdown files (including the ability to open up a preview right in Atom).
7Atom Command Palette: http://flight-manual.atom.io/getting-started/sections/atom-basics/#command-palette
For more information about using Atom, see the manual.8
R, and provides a number of extensions for adding even more features. It has a similar command palette to Atom, but isn’t quite as nice for editing Markdown specifically. Although fairly new, it is updated regularly and has become one of the authors’ main editors for programming.
Sublime Text10 is a very popular text editor with excellent defaults and a variety of available extensions (though you will need to manage and install extensions to achieve the functionality offered by other editors out of the box). While the software can be used for free, every 20 or so saves it will prompt you to purchase the full version (an offer that you can decline without loss of functionality).
The primary programming language used throughout this book is called R.11 It is a very powerful statistical programming language that is built to work well with large and diverse data sets. Chapter 5 provides a more in-depth introduction to the language.
To program with R, you will need to install the R Interpreter on your machine. This software is able to “read” code written in R and use that code to control your computer, thereby “programming” it.
The easiest way to install R is to download it from the Comprehensive R Archive Network (CRAN).12 Click on the appropriate link for your operating system to find a link to the installer. On a Mac, click the link to download the .pkg file for the latest version supported by your computer. Double-click on the .pkg file and follow the prompts to install the software. On Windows, follow the Windows link to “install R for the first time,” then click the link to download the latest version of R for Windows. You will need to double-click on the .exe file and follow the prompts to install the software.
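After the installer finishes, one way to check the result is to ask R for its version from the command line. This is a minimal sketch, assuming a Bash shell and that the installer made the R command available there (on Windows, Git Bash may not be able to find R even when it is installed correctly). The variable name r_status is hypothetical.

```shell
# Check whether the R command can be found, and record its version if so.
# r_status is a hypothetical variable name used only for this example.
if command -v R >/dev/null 2>&1; then
  r_status="$(R --version | head -n 1)"
else
  r_status="R not found on the command line"
fi

# Print the result so you can read it in the terminal
echo "$r_status"
```

A "not found" message here does not necessarily mean the installation failed; opening RStudio is another way to confirm that R is working.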
While you are able to execute R scripts without a dedicated application, the RStudio program provides a wonderful way to engage with the R language by providing a single interface to write and execute code, search documentation, and view results such as charts and maps. RStudio is described in more detail in Chapter 5. This book assumes you are using RStudio to write R code.
To install the RStudio program, visit the download page,13 click “Download” for the free version of RStudio Desktop, and then select the installer for your operating system.
13Download RStudio: https://www.rstudio.com/products/rstudio/download/
After the download is complete, double-click on the downloaded installer (a .dmg file on a Mac, or an .exe file on Windows) to run it. Follow the steps of the installer, and you should be prepared to use RStudio.
This chapter has walked you through setting up the necessary software for basic data science, including the following programs:
Bash for controlling your computer
R for programmatically analyzing and working with data
RStudio as an IDE for writing and executing R code
git for version control
Atom as a general text editor for creating and editing documents
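The command line tools in this list can be checked together from a Bash shell. The following is a rough sketch rather than a required step: check_tool is a hypothetical helper function defined only for this example, and RStudio and Atom are left out because they are desktop applications that are opened like any other app.

```shell
# check_tool is a hypothetical helper defined only for this example.
# It prints whether the named program can be found from the command line.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: installed"
  else
    echo "$1: not found"
  fi
}

# Check each command line tool covered in this chapter
for tool in bash git R; do
  check_tool "$tool"
done
```

Any tool reported as "not found" is worth revisiting in the corresponding section of this chapter.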
With this software installed, you are ready to get started programming for data science!