3
Version Control with git and GitHub

One of the most important parts of writing code to work with data is keeping track of changes to your code. Maintaining a clear and well-documented history of your work is crucial for transparency and collaboration. Even if you are working independently, tracking your changes will enable you to revert to earlier versions of your project and more easily identify errors.

Alternatives to proper version control systems—such as emailing code to others, or having dozens of versions of the same file—lack any structured way of backing up work, and are time-consuming and error-prone. This is why you should be using a version control system like git.

This chapter introduces the git command line program and the GitHub cloud storage service, two wonderful tools that track changes to your code (git) and facilitate collaboration (GitHub). git and GitHub are the industry standards for the family of tasks known as version control. Being able to manage changes to your code and share that code with others is one of the most important technical skills a data scientist can learn, and is the focus of this chapter as well as Chapter 20.

Tip

Because this chapter revolves around using new interfaces and commands to track file changes—which can be difficult to understand abstractly—we suggest that you follow along with the instructions as they are introduced throughout the chapter. The best way to learn is by doing!

3.1 What Is git?

git1 is an example of a version control system. Open source software guru Eric Raymond defines version control as follows:

1Git homepage: http://git-scm.com/

A version control system (VCS) is a tool for managing a collection of program code that provides you with three important capabilities: reversibility, concurrency, and annotation.2

2Raymond, E. S. (2009). Understanding version-control systems. http://www.catb.org/esr/writings/version-control/version-control.html

Version control systems work a lot like Dropbox or Google Docs: they allow multiple people to work on the same files at the same time, and to view or “roll back” to previous versions. However, systems like git differ from Dropbox in a couple of key ways:

  • Each new version or “checkpoint” of your files must be explicitly created (committed). git doesn’t save a new version of your entire project each time you save a file to disk. Instead, after making progress on your project (which may involve editing multiple files), you take a snapshot of your work, along with a description of what you’ve changed.

  • For text files (which almost all programming files are), git tracks changes line by line. This means it can easily and automatically combine changes from multiple people, and give you very precise information about which lines of code have changed.

Like Dropbox and Google Docs, git can show you all previous versions of a file and can quickly roll back to one of those previous versions. This is often helpful in programming, especially if you embark on making a massive set of changes, only to discover partway through that those changes were a bad idea (we speak from experience here).

But where git really comes in handy is in team development. Almost all professional development work is done in teams, which involves multiple people working on the same set of files at the same time. git helps teams coordinate all these changes, and provides a record so that anyone can see how a given file ended up the way it did.

There are a number of different version control systems that offer these features, but git is the de facto standard—particularly when used in combination with the cloud-based service GitHub.

3.1.1 git Core Concepts

To understand how git works, you need to understand its core concepts and terms:

  • repository (repo): A database of your file history, containing all the checkpoints of all your files, along with some additional metadata. This database is stored in a hidden subdirectory named .git within your project directory. If you want to sound cool and in-the-know, call the project folder itself a “repo” (even though the repository is technically the database inside the project folder).

  • commit: A snapshot or checkpoint of your work at a given time that has been added to the repository (saved in the database). Each commit will also maintain additional information, including the name of the person who did the commit, a message describing the commit, and a timestamp. This extra tracking information allows you to see when, why, and by whom changes were made to a given file. Committing a set of changes creates a snapshot of what that work looks like at the time, which you can return to in the future.

  • remote: A link to a copy of your repository on a different machine. This link points to a location on the web where the copy is stored. Typically this will be a central (“master”) version of the project that all local copies point to. This chapter generally deals with copies stored on GitHub as remote repositories. You can push (upload) commits to, and pull (download) commits from, a remote repository to keep everything in sync.

  • merging: git supports having multiple different versions of your work that all live side by side (in what are called branches), which may be created by one person or by many collaborators. git allows the commits (checkpoints) saved in different versions of the code to be easily merged (combined) back together without any need to manually copy and paste different pieces of the code. This makes it easy to separate and then recombine work from different developers.

3.1.2 What Is GitHub?

git was created to support completely decentralized development, in which developers pull commits (sets of changes) from one another’s machines directly. But in practice, most professional teams take the approach of creating one central repository on a server that all developers push to and pull from. This repository contains the authoritative version of the source code, and all deployments to the “rest of the world” are done by downloading from this centralized repository.

Teams can set up their own servers to host these centralized repositories, but many choose to use a server maintained by someone else. The most popular of these in the open source world is GitHub,3 which as of 2017 had more than 24 million developers using the site.4 In addition to hosting centralized repositories, GitHub offers other team development features such as issue tracking, wiki pages, and notifications. Public repositories on GitHub are free. Private repositories originally required a paid plan, but GitHub has since made them available on free accounts as well.

In short, GitHub is a site that will host a copy of your project in the cloud, enabling multiple people to collaborate (using git). git is what you use to do version control; GitHub is one possible place where repositories of code can be stored.

3GitHub: https://github.com

4The State of the Octoverse 2017: https://octoverse.github.com

Going Further

Although GitHub is the most popular service that hosts git repositories, it is not the only such site. Bitbucket [a] offers a similar set of features to GitHub, though it has a different pricing model (you get unlimited free private repos, but are limited in the number of collaborators). GitLab [b] offers a hosting system that incorporates more operations and deployment services for software projects.

[a] https://bitbucket.org

[b] https://gitlab.com

Caution

The interface and functionality of websites such as GitHub are constantly evolving and may change. Additional features may become available, and the current structure may be reorganized to better support common usage.

3.2 Configuration and Project Setup

This section walks you through all the commands needed to set up version control for a project using git. It focuses on using git from the command line, which is the most effective way to learn (if not use) the program, and is how most professional developers interact with the software. That said, it is also possible to use git directly through code editors and IDEs such as Atom or RStudio—as well as through dedicated graphical software such as GitHub Desktop5 or Sourcetree.6

5GitHub Desktop: https://desktop.github.com

6Sourcetree: https://www.sourcetreeapp.com

The first time you use git on your machine after having installed it, you will need to configure7 the installation, telling git who you are so you can commit changes to a repository. You can do this by running the git config command:

7GitHub: Set Up Git: https://help.github.com/articles/set-up-git/

# Configure `git` on your machine (only needs to be done once)

# Set your name to appear alongside your commits
# This *does not* need to be your GitHub username
git config --global user.name "YOUR FULL NAME"


# Set your email address
# This *does* need to be the email associated with your GitHub account
git config --global user.email "YOUR_EMAIL_ADDRESS"
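To confirm what git has recorded, you can read a setting back by name. The following is a minimal sketch, assuming a throwaway folder created with mktemp and a made-up name and email address. It leaves out the --global option so that the values apply only to that one scratch repository and your machine-wide configuration is not touched.

```shell
# Work in a throwaway folder so nothing else on your machine is changed
scratch_dir=$(mktemp -d)
cd "$scratch_dir"
git init --quiet

# Without --global, a setting applies only to this repository
# (the name and email below are made-up example values)
git config user.name "Ada Example"
git config user.email "ada@example.com"

# Read a single setting back by name
configured_name=$(git config user.name)
echo "$configured_name"
```

Running git config --global --list instead would print every machine-wide setting you have configured.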

Even after git knows who you are, it will still ask you to authenticate before pushing your code up to GitHub. (GitHub no longer accepts your account password for this over HTTPS; it expects a personal access token in its place.) One way to save some time is by setting up an SSH key for GitHub. This will allow GitHub to recognize and permit interactions coming from your machine. If you don’t set up the key, you may need to enter your credentials each time you want to push changes up to GitHub (which may be multiple times a day). Instructions for setting up an SSH key are available from GitHub Help.8 Make sure you set up your key on a machine that you control and trust!

8GitHub: Authenticating to GitHub: https://help.github.com/articles/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent/

3.2.1 Creating a Repo

To work with git, you will need to create a repository. A repository acts as a “database” of changes that you make to files in a directory. A repository is always created in an existing directory (folder) on your computer. For example, you could create a new folder called learning_git on your computer’s Desktop. You can turn this directory into a repository by telling the git program to run the init action (running the git init command) inside that directory:

# Create a new folder in your current location called `learning_git`
mkdir learning_git

# Change your current directory to the new folder you just created
cd learning_git

# Initialize a new repository inside your `learning_git` folder
git init

The git init command creates a new hidden folder called .git inside the current directory. Because it’s hidden, you won’t see this folder in Finder, but if you use ls -a (the “list” command with the all option) you can see it listed. This folder is the “database” of changes that you will make—git will store all changes you commit in this folder. The inclusion of the .git folder causes a directory to become a repository; you refer to the whole directory as the “repo.” However, you won’t ever have to directly interact with this hidden folder; instead, you will use a short set of terminal commands to interact with the database.
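As a quick check of the description above, the following sketch creates a repository in a throwaway folder (made with mktemp, an assumption for illustration) and then looks for the hidden .git folder in two ways.

```shell
# Create a repository in a throwaway folder
project_dir=$(mktemp -d)
cd "$project_dir"
git init --quiet

# List everything, including hidden entries; `.git` appears in the listing
ls -a

# Ask git where the repository database lives (relative to this folder)
git_dir=$(git rev-parse --git-dir)
echo "$git_dir"
```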

Caution

Do not put one repo inside of another! Because a git repository tracks all of the content inside of a single folder (including the content in subfolders), this will turn one repo into a “sub-repo” of another. Managing changes to both the repo and sub-repo will be difficult and should be avoided.

Instead, you should create a lot of different repos on your computer (one for each project), making sure that they are in separate folders.

Note that it is also not a good idea to have a git repository inside of a shared folder, such as one managed with Dropbox or Google Drive. Those systems’ built-in file tracking will interfere with how git manages changes to files.
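One way to avoid creating a repo inside another is to ask git whether the current folder already belongs to one before running git init. The sketch below is illustrative only: inside_repo is a hypothetical helper name, and the folders are throwaway ones created with mktemp.

```shell
# Hypothetical helper: succeeds if the current folder is already inside a repo
inside_repo() {
  git rev-parse --is-inside-work-tree > /dev/null 2>&1
}

# Set up a repo with a subfolder to test against
outer_dir=$(mktemp -d)
cd "$outer_dir"
git init --quiet
mkdir subfolder
cd subfolder

# The subfolder is tracked by the outer repo, so `git init` here would nest repos
if inside_repo; then
  nested_check="already inside a repo: do not run git init here"
else
  nested_check="safe to create a repo here"
fi
echo "$nested_check"
```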

3.2.2 Checking Status

Once you have a repo, the next thing you should do is check its status:

# Check the status of your repository
# (this and other commands will only work inside git project folders)
git status

The git status command will give you information about the current “state” of the repo. Running this command on a new repo tells you a few things (as shown in Figure 3.1):

  • That you’re actually in a repo (otherwise you will get an error)

  • That you’re on the master branch (think: line of development)

  • That you’re at the initial commit (you haven’t committed anything yet)

  • That currently there are no changes to files that you need to commit (save) to the database

  • What to do next! (namely, create/copy files and use “git add” to track)

    Figure 3.1 Checking the status of a new (empty) repository with the git status command.

That last point is important. git status messages are verbose and somewhat awkward to read (this is the command line after all). Nevertheless, if you look at them carefully, they will almost always tell you which command to use next.

Tip

If you ever get stuck, use git status to figure out what to do next!

This makes git status the most useful command in the entire process. As you are learning the basics of git, you will likely find it useful to run the command before and after each other command to see how the status of your project changes. Learn it, use it, love it.
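Once the full messages feel familiar, the --short option of git status gives a compact, one-line-per-file summary. The following sketch, run in a throwaway folder (an assumption for illustration), shows the short status before and after a file is created.

```shell
repo_dir=$(mktemp -d)
cd "$repo_dir"
git init --quiet

# A brand-new repo has nothing to report, so the short status is empty
empty_status=$(git status --short)

# Create a file; git sees it but is not yet tracking it
echo "Dune" > favorite_books.txt
untracked_status=$(git status --short)
echo "$untracked_status"   # ?? favorite_books.txt
```

The ?? prefix marks a file that git has noticed but is not yet tracking.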

3.3 Tracking Project Changes

Running git status in a new repository will tell you to create a file—which we suggest you do now to practice the steps of using version control. For example, open up your favorite text editor (e.g., Atom) and create a plain text file with a list of your favorite books. Save this file in your learning_git folder as favorite_books.txt. git will be able to detect and manage changes to your file as long as it was saved inside the repo (project directory).

Remember

After editing a file, always save it to your computer’s hard drive (e.g., with File > Save). git can track only changes that have been saved!

3.3.1 Adding Files

After making a change to your repository (such as creating and saving the favorite_books.txt file), run git status again. As shown in Figure 3.2, git now gives a list of changed and “untracked” files, as well as instructions about what to do next to save those changes to the repo’s database.

Figure 3.2 The status of a repository with changes that have not (yet) been added and are therefore shown in red.

The first step is to add those changes to the staging area. The staging area is like a shopping cart in an online store: you put changes in temporary storage before you commit to recording them in the database (e.g., before clicking “purchase”).

You add files to the staging area using the git add command (replacing FILENAME in the following example with the name/path of the file or folder you want to add):

# Add changes to a file with the name FILENAME to the staging area
# Replace FILENAME with the name of your file (e.g., favorite_books.txt)
git add FILENAME

This will add a single file in its current saved state to the staging area. For example, git add favorite_books.txt would add that file to the staging area. If you change the file later, you will need to add the updated version by running the git add command again.

You can also add all of the contents of the current directory (tracked or untracked) to the staging area with the following command:

# Add all saved contents of the directory to the staging area
git add .

This command is the most common way to add files to the staging area, unless you’ve made changes to specific files that you aren’t ready to commit yet. Once you’ve added files to the staging area, you’ve “changed” the repo and so can run git status again to see what it says to do next. As you can see in Figure 3.3, git will tell you which files are in the staging area, as well as the command to unstage those files (i.e., remove them from the “cart”).

Figure 3.3 The status of a repository after adding changes (added files are displayed in green).
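The point that git add stages a file in its current saved state can be seen with the compact git status --short output. This is a minimal sketch in a throwaway folder (an assumption for illustration): the first status column describes the staging area, and the second describes the working directory.

```shell
repo_dir=$(mktemp -d)
cd "$repo_dir"
git init --quiet

# Stage a new file
echo "Dune" > favorite_books.txt
git add favorite_books.txt
staged_status=$(git status --short)       # A  favorite_books.txt

# Edit the file again after staging it: the new edit is not staged
echo "Beloved" >> favorite_books.txt
edited_status=$(git status --short)       # AM favorite_books.txt

# Re-add to put the updated version in the staging area
git add favorite_books.txt
restaged_status=$(git status --short)     # A  favorite_books.txt
```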

3.3.2 Committing

When you’re happy with the contents of your staging area (i.e., you’re ready to purchase), it’s time to commit those changes, saving that snapshot of the files in the repository database. You do this with the git commit command:

# Create a commit (checkpoint) of the changes in the staging area
# Replace "Your message here" with a more informative message
git commit -m "Your message here"

You should replace "Your message here" with a short message saying what changes that commit makes to the repo. For example, you could type git commit -m "Create favorite_books.txt file".

Caution

If you forget the -m option, git will put you into a command line text editor so that you can compose a message (then save and exit to finish the commit). If you haven’t done any other configuration, you might be dropped into the vim editor. Type :q! (colon, q, then an exclamation point) and press enter to flee from this place without saving, which cancels the commit, and try again, remembering the -m option! Don’t panic: getting stuck in vim happens to everyone. [a]

[a] https://stackoverflow.blog/2017/05/23/stack-overflow-helping-one-million-developers-exit-vim/

3.3.2.1 Commit Message Etiquette

Your commit messages should be informative9 about which changes the commit is making to the repo. "stuff" is not a good commit message. In contrast, "Fix critical authorization error" is a good commit message.

9Do not do this: https://xkcd.com/1296/

Commit messages should use the imperative mood ("Add feature", not "Added feature"). They should complete the following sentence:

If applied, this commit will {your message}

Other advice suggests that you limit your message to 50 characters (like an email subject line), at least for the first line—this helps when you are going back and looking at previous commits. If you want to include more detail, do so after a blank line. (For more detailed commit messages, we recommend you learn to use vim or another command line text editor.)
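For a short body, one option that avoids opening an editor is to pass -m more than once: git joins the messages as separate paragraphs, so the first becomes the subject line and the second follows after a blank line. The sketch below assumes a throwaway folder and a made-up identity, set only for that scratch repository.

```shell
repo_dir=$(mktemp -d)
cd "$repo_dir"
git init --quiet
git config user.name "Ada Example"        # made-up identity for this sketch
git config user.email "ada@example.com"

echo "Dune" > favorite_books.txt
git add favorite_books.txt

# Each -m becomes its own paragraph, separated by a blank line
git commit --quiet -m "Create favorite_books.txt file" \
  -m "Start the list with one title; more will be added later."

# Read back the subject line and the body of the most recent commit
subject=$(git log -1 --format=%s)
body=$(git log -1 --format=%b)
echo "$subject"
echo "$body"
```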

A specific commit message format may also be required by your company or project team. Further consideration of good commit messages can be found in this blog post.10

10Chris Beams: How to Write a Git Commit Message blog post: http://chris.beams.io/posts/git-commit/

As you make commits, remember that they are a public part of your project history, and will be read by your professors, bosses, coworkers, and other developers on the internet.11

11Don’t join this group: https://twitter.com/gitlost

After you’ve committed your changes, be sure to check git status, which should now say that there is nothing to commit!

3.3.3 Reviewing the local git Process

This cycle of edit files–add files–commit changes is the standard “development loop” when working with git, and is illustrated in Figure 3.4.

Figure 3.4 The local git process: add changes to the staging area, then create a checkpoint of your project by making a commit. The commit saves a version of the project at this point in time to the database of file history.

In general, you will make lots of changes to your code (editing lots of files, running and testing your code, and so on). Once you’re at a good “break point”—you’ve got a feature working, you’re stuck and need some coffee, you’re about to embark on some radical changes—be sure to add and commit your changes to make sure you don’t lose any work and you can always get back to that point.

Remember

Each commit represents a set of changes, which can and usually does include multiple files. Do not think about each commit being a change to a file; instead, think about each commit as being a snapshot of your entire project!
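To see that a single commit can record several files at once, you can ask git which files a commit touched. The following is a small sketch in a throwaway folder with a made-up identity (both assumptions for illustration).

```shell
repo_dir=$(mktemp -d)
cd "$repo_dir"
git init --quiet
git config user.name "Ada Example"        # made-up identity for this sketch
git config user.email "ada@example.com"

# Create two files and record them in one commit
echo "Dune" > favorite_books.txt
echo "Arrival" > favorite_movies.txt
git add .
git commit --quiet -m "Add lists of favorite books and movies"

# List the files recorded by the most recent commit
files_in_commit=$(git show --name-only --format= HEAD)
commit_count=$(git rev-list --count HEAD)
echo "$files_in_commit"
```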

Tip

If you accidentally add files that you want to “unadd,” you can use the git reset command (with no additional arguments) to remove all added files from the staging area.

If you accidentally commit files when you didn’t want to, you can “undo” the commit using the command git reset --soft HEAD~1. This command makes it so the commit you just made never occurred, while leaving its changes in the staging area and your files untouched in the working directory. You can then adjust which files are staged (e.g., with git reset) before running the git commit command again. Note that this undoes only the most recent commit, and you cannot (easily) undo commits that have been pushed to a remote repository.
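Both kinds of undo can be tried safely in a scratch repository. The sketch below assumes a throwaway folder, a made-up identity, and a hypothetical notes.txt file that was committed by mistake.

```shell
repo_dir=$(mktemp -d)
cd "$repo_dir"
git init --quiet
git config user.name "Ada Example"        # made-up identity for this sketch
git config user.email "ada@example.com"

echo "Dune" > favorite_books.txt
git add favorite_books.txt
git commit --quiet -m "Create favorite_books.txt file"

# A second commit that you decide was a mistake
echo "draft" > notes.txt
git add notes.txt
git commit --quiet -m "Add notes file"

# Undo the most recent commit; its changes stay in the staging area
git reset --soft HEAD~1
commits_after_undo=$(git rev-list --count HEAD)      # 1
staged_after_undo=$(git diff --cached --name-only)   # notes.txt

# Empty the staging area; the file itself stays in the folder
git reset --quiet
staged_after_reset=$(git diff --cached --name-only)  # (empty)
```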

3.4 Storing Projects on GitHub

Once you are able to track your changes locally with git, you will often want to access your project from a different computer, or share your project with other people. You can do this using GitHub, an online service that stores copies of repositories in the cloud. These repositories can be linked to your local repositories (the ones on your machine, like those you’ve been working with so far) so that you can synchronize changes between them. The relationship between git and GitHub is the same as that between a Photos application on your computer and a photo hosting service such as Flickr: git is the program you use to (locally) create and manage repositories (like Photos); GitHub is simply a website that stores these repositories (like Flickr). Thus you use git, but upload to and download from GitHub.

Repositories stored on GitHub are examples of remotes: other repos that are linked to your local one. Each repo can have multiple remotes, and you can synchronize commits between them. Each remote has a URL associated with it (indicating where on the internet the remote copy of the repo can be found), but they are given “alias” names—similar to browser bookmarks. By convention, the remote repo stored on GitHub’s servers is named origin, since it tends to be the “origin” of any code you’ve started working on.

To use GitHub, you will need to create a free GitHub account, which is discussed in Chapter 1.

Next, you will need to “link” your local repository to the remote one on GitHub. There are two common processes for doing this:

  1. If you already have a project tracked with git on your computer, you can create a new repository on GitHub by clicking the green “New Repository” button on the GitHub homepage (you will need to be logged in). This will create a new empty repo on GitHub’s servers under your account. Follow the provided instructions on how to link a repository on your machine to the new one on GitHub.

  2. If there is a project on GitHub that you want to edit on your computer, you can clone (download) a copy of a repo that already exists on GitHub, allowing you to work with and modify that code. This process is more common, so it is described in more detail here.

Each repository on GitHub has a web portal at a unique location. For example, https://github.com/programming-for-data-science/book-exercises is the webpage for the programming exercises that accompany this book. You can click on the files and folders on this page to view their source and contents online, but you won’t change them through the browser.

Remember

You should always create a local copy of the repository when working with code. Although GitHub’s web interface supports it, you should never make changes or commit directly to GitHub. All development work is done locally, and changes you make are then uploaded and merged into the remote. This allows you to test your work and to be more flexible with your development.

3.4.1 Forking and Cloning

Just like with Flickr or other image-hosting sites, all GitHub users have their own account under which repos are stored. The repo mentioned earlier is under this book’s account programming-for-data-science. Because it’s under the book’s user account, you won’t be able to modify it—just as you can’t change someone else’s picture on Flickr. So the first thing you will need to do is copy the repo over to your own account on GitHub’s servers. This process is called forking the repo (you’re creating a “fork” in the development, splitting off to your own version).

To fork a repo, click the “Fork” button in the upper right of the screen (shown in Figure 3.5). This will copy the repo over to your own account; you will be able to download and upload changes to that copy but not to the original. Once you have a copy of the repo under your own account, you need to download the entire project (files and their history) to your local machine to make changes. You do this by using the git clone command:

# Change to the folder that will contain the downloaded repository folder
cd ~/Desktop

# Download the repository folder into the current directory
git clone REPO_URL

Figure 3.5 The Fork button for a repository on GitHub’s web portal. Click this button to create your own copy of a repository on GitHub.

This command creates a new repo (directory) in the current folder, and downloads a copy of the code and all the commits from the URL you specify into that new folder.

Caution

Make sure that you are in the desired location in the command line before running any git commands. For example, you would want to cd out of the learning_git directory described earlier; you don’t want to clone into a folder that is already a repo!

You can get the URL for the git clone command from the address bar of your browser, or by clicking the green “Clone or Download” button. If you click that button, you will see a pop-up that contains a small clipboard icon that will copy the URL to your clipboard, as shown in Figure 3.6. This allows you to use your terminal to clone the repository. If you click “Open in Desktop,” it will prompt you to use a program called GitHub Desktop12 to manage your version control (a technology not discussed in this book). But do not click the “Download Zip” option, as it contains code without the previous version history (the code, but not the repository itself).

12https://desktop.github.com

Figure 3.6 The Clone button for a repository on GitHub’s web portal. Click this button to open the dialog box, then click the clipboard icon to copy the GitHub URL needed to clone the repository to your machine. Red notes are added.

Remember

Make sure you clone from the forked version (the one under your account!) so that the repo downloads with a proper link back to the origin remote.

Note that you will only need to clone once per machine. clone is like init for repos that are on GitHub; in fact, the clone command includes the init command (so you do not need to init a cloned repo). After cloning, you will have a full copy of the repository—which includes the full project history—on your machine.
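If you want to practice cloning without a network connection, you can clone from a folder on your own machine. In the sketch below, a local “bare” repository named hosted.git stands in for the copy on GitHub; that name, the seed folder, and the made-up identity are all assumptions for illustration, not part of any real GitHub workflow.

```shell
workspace=$(mktemp -d)
cd "$workspace"

# Stand-in for a repo hosted on GitHub: a local "bare" repo with one commit
git init --quiet seed
git -C seed config user.name "Ada Example"      # made-up identity
git -C seed config user.email "ada@example.com"
echo "# Exercises" > seed/README.md
git -C seed add README.md
git -C seed commit --quiet -m "Add README"
git clone --quiet --bare seed hosted.git

# Clone it, just as you would with a GitHub URL
git clone --quiet "$workspace/hosted.git" my_copy

# The clone is a full repo: it has the history and an `origin` bookmark
cd my_copy
origin_url=$(git remote get-url origin)
commit_count=$(git rev-list --count HEAD)
echo "$origin_url"
```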

3.4.2 Pushing and Pulling

Once you have a copy of the repo code, you can make changes to that code on your machine and then push those changes up to GitHub. You can edit the files (e.g., the README.md) in an editor as if you had created them locally. After making changes, you will, of course, need to add the changed files to the staging area and commit the changes to the repo (don’t forget the -m message!).

Committing will save your changes locally, but it does not push those changes to GitHub. If you refresh the web portal page (make sure you’re looking at the one under your account), you shouldn’t see your changes yet.

To get the changes to GitHub (and share your code with others), you will need to push (upload) them to GitHub’s computers. You can do this with the following command:

# Push commits from your computer up to a remote server (e.g., GitHub)
git push

By default, this command will push the current code to the origin remote (specifically, to its master branch of development). When you cloned the repo, it came with an origin “bookmark” link to the original repo’s location on GitHub. To check where the remote is, you can use the following command:

# Print out (verbosely) the remote location(s)
git remote -v

Once you’ve pushed your code, you should be able to refresh the GitHub webpage and see your changes on the web portal.

If you want to download the changes (commits) that someone else has made, you can do that using the pull command. This command will download the changes from GitHub and merge them into the code on your local machine:

# Pull changes down from a remote server (e.g., GitHub)
git pull
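The push and pull round trip can be rehearsed entirely on one machine. This sketch is a stand-in rather than a real GitHub workflow: a local bare repository (hosted.git) plays the role of GitHub, and two clones (laptop and desktop) play the role of two computers. All of these names, and the identity set through environment variables, are assumptions for illustration.

```shell
workspace=$(mktemp -d)
cd "$workspace"

# Identity for the commits in this sketch (made-up values)
export GIT_AUTHOR_NAME="Ada Example" GIT_AUTHOR_EMAIL="ada@example.com"
export GIT_COMMITTER_NAME="Ada Example" GIT_COMMITTER_EMAIL="ada@example.com"

# Stand-in for GitHub: a local bare repo seeded with one commit
git init --quiet seed
echo "Dune" > seed/favorite_books.txt
git -C seed add favorite_books.txt
git -C seed commit --quiet -m "Create favorite_books.txt file"
git clone --quiet --bare seed hosted.git

# Two working copies, e.g., a laptop and a desktop
git clone --quiet hosted.git laptop
git clone --quiet hosted.git desktop

# Commit on the laptop, then push the commit to the shared remote
echo "Beloved" >> laptop/favorite_books.txt
git -C laptop add favorite_books.txt
git -C laptop commit --quiet -m "Add Beloved to favorite books"
git -C laptop push --quiet

# Pull on the desktop to download and merge that commit
git -C desktop pull --quiet
desktop_commits=$(git -C desktop rev-list --count HEAD)   # 2
```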

Caution

Because pulling down code involves merging versions of code together, you will need to keep an eye out for merge conflicts! Merge conflicts are discussed in more detail in Chapter 20.

Going Further

The commands git pull and git push have the default behavior of interacting with the master branch at the origin remote location. git push is thus equivalent to the more explicit command git push origin master. As discussed in Chapter 20, you will adjust these arguments when engaging in more complex and collaborative development processes.

The overall process of using git and GitHub together is illustrated in Figure 3.7.

Figure 3.7 The remote git process: fork a repository to create a copy on GitHub, then clone it to your machine. Then add and commit changes, and push them up to GitHub to share.

Tip

If you are working with others (or just on different computers), always pull in the latest changes before you start working. This will get you the most up-to-date changes, and reduce the chances that you will encounter an issue when you try to push your code.

3.5 Accessing Project History

The benefit of making each commit (checkpoint) is that you can easily view your project or revert to that checkpoint at any point in the future. This section details the introductory approaches for viewing files at an earlier point in time, and reverting to those checkpoints.

3.5.1 Commit History

You can view the history of commits you’ve made by using the git log command while inside of your repo on the command line:

# Print out a repository's commit history
git log

This will give you a list of the sequence of commits you’ve made: you can see who made which changes and when. (The term HEAD refers to the most recent commit made.) The optional --oneline argument gives you a nice compact printout, though it displays less information (as shown in Figure 3.8). Note that each commit is listed with its SHA-1 hash (the sequence of random-looking numbers and letters), which you can use to identify that commit.

Figure 3.8 A project’s commit history displayed using the git log --oneline command in the terminal. Each commit is identified by a seven-character abbreviated hash (e.g., e4894a0), the most recent of which is referred to as the HEAD.
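Beyond --oneline, git log accepts a --format option for choosing which details to print. The following sketch builds a two-commit history in a throwaway folder with a made-up identity (both assumptions for illustration) and prints it in two ways.

```shell
repo_dir=$(mktemp -d)
cd "$repo_dir"
git init --quiet
git config user.name "Ada Example"        # made-up identity for this sketch
git config user.email "ada@example.com"

echo "Dune" > favorite_books.txt
git add favorite_books.txt
git commit --quiet -m "Create favorite_books.txt file"

echo "Beloved" >> favorite_books.txt
git add favorite_books.txt
git commit --quiet -m "Add Beloved to favorite books"

# One line per commit: abbreviated hash and message, newest first
git log --oneline
log_lines=$(git log --oneline | wc -l)

# Abbreviated hash, author name, date, and message for each commit
git log --format="%h %an %ad %s" --date=short
newest_subject=$(git log -1 --format=%s)
```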

3.5.2 Reverting to Earlier Versions

One of the key benefits of version control systems is reversibility, meaning the ability to “undo” a mistake (and we all make lots of mistakes when programming!). git provides two basic ways that you can go back and fix a mistake you’ve made previously:

  1. You can replace a file (or the entire project directory!) with a version saved as a previous commit.

  2. You can have git “reverse” the changes that you made with a previous commit, effectively applying the opposite changes and thereby undoing it.

Note that both of these approaches require you to have committed a working version of the code that you want to go back to. git only knows about changes that have been committed: if you don’t commit, git can’t help you!

Tip

Commit early; commit often.

For both forms of undoing, you first need to decide which version of the file to revert to. Use the git log --oneline command described earlier, and note the SHA-1 hash for the commit that saved the version you want to revert to. The abbreviated hash that this command prints (typically the first seven characters of the full hash) is enough to identify the commit and acts as its “name.”

To go back to an older version of the file (to “revert” it to the version of a previous commit), you can use the git checkout command:

# Print a list of commit hashes
git log --oneline

# Checkout (load) the version of the file from the given commit
git checkout COMMIT_HASH FILENAME

Replace COMMIT_HASH and FILENAME with the commit ID hash and the file you want to revert, respectively. This will replace the current version of that single file with the version saved in COMMIT_HASH. You can also leave out the commit hash to discard changes you have not yet staged. The -- is not a hash: it separates the command's options from the file name. Without a hash, git restores the file from the staging area, which matches the most recent commit (called the HEAD) unless you have staged newer changes:

# Discard unstaged changes to the file
git checkout -- FILENAME

This will change the file in your working directory, so that it appears just as it did when you made the earlier commit.
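As a minimal sketch of reverting a single file, the script below commits two versions of a hypothetical report.txt in a temporary repository and then restores the first one. It places -- between the hash and the file name, which is optional here but guards against a file name being mistaken for a branch or commit.

```shell
#!/bin/sh
set -e

# Throwaway repository (all names here are hypothetical)
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

# Commit a first version and remember its abbreviated hash
echo "version 1" > report.txt
git add report.txt
git commit -q -m "Write first draft"
first_commit=$(git rev-parse --short HEAD)

# Commit a second version on top of it
echo "version 2" > report.txt
git add report.txt
git commit -q -m "Rewrite report"

# Replace report.txt with the version saved in the first commit
git checkout "$first_commit" -- report.txt
cat report.txt

# The restored version is staged, ready to be committed if you want to keep it
git status --short
```

Note that the history still contains both commits; only the working copy (and staging area) of the file changed.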

Caution

You can use the git checkout command to view project files at the time of a particular commit by leaving off the filename (i.e., git checkout COMMIT_HASH). However, doing so puts your repo in a “detached HEAD” state: any commits you make there will not belong to a branch and are easy to lose. Thus you should use this command only to explore the files at a previous point in time.

If you do this (or if you forget the filename when checking out), you can return to your most recent version of the code with the following command:

# Checkout the most recent version of the master branch
# (use your branch's name instead if it is not called master, e.g., main)
git checkout master
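The round trip of exploring an old commit and coming back might look like the sketch below. Because the default branch may be called master or main depending on your setup, the script looks up the current branch name before leaving it; the file log.txt and the commit messages are hypothetical.

```shell
#!/bin/sh
set -e

# Throwaway repository (all names here are hypothetical)
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

echo "first" > log.txt
git add log.txt
git commit -q -m "First commit"
first_commit=$(git rev-parse HEAD)

echo "second" >> log.txt
git commit -q -a -m "Second commit"

# Remember which branch we are on before leaving it
branch=$(git symbolic-ref --short HEAD)

# Look at the whole project as of the first commit (detached HEAD)
git checkout -q "$first_commit"
cat log.txt

# Return to the most recent version on the branch
git checkout -q "$branch"
cat log.txt
```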

If you just had one bad commit but don’t want to throw out other valuable changes you made to your project later, you can use the git revert command:

# Apply the opposite changes made by the given commit
git revert COMMIT_HASH --no-edit

This command will determine which changes the specified commit made to the files, and then apply the opposite changes to effectively “back out” the commit. Note that this does not go back to the given commit number (that’s what git checkout is for!), but rather reverses only the commit you specify.

The git revert command does create a new commit (the --no-edit option tells git that you don’t want to include a custom commit message). This is great from an archival point of view: you never “destroy history” and lose the record of which changes were made and then reverted. History is important; don’t mess with it!
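To see that git revert undoes one commit while keeping later work, consider this sketch. It makes a good commit, a bad commit, and an unrelated later commit in a temporary repository, then reverts only the bad one. File names and messages are invented for the example.

```shell
#!/bin/sh
set -e

# Throwaway repository (all names here are hypothetical)
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

echo "good line" > data.txt
git add data.txt
git commit -q -m "Add good line"

# The commit we will later regret
echo "bad line" >> data.txt
git commit -q -a -m "Add bad line"
bad_commit=$(git rev-parse --short HEAD)

# Valuable later work in a different file
echo "analysis notes" > other.txt
git add other.txt
git commit -q -m "Add other file"

# Apply the opposite of the bad commit as a new commit
git revert --no-edit "$bad_commit"

# data.txt no longer has the bad line; other.txt is untouched
cat data.txt

# The history now has four commits, including the bad one and its revert
git log --oneline
```

If a later commit had edited the same lines as the bad commit, git would instead stop and ask you to resolve a conflict, so this sketch deliberately keeps the later change in a separate file.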

Caution

The git reset command can destroy your commit history. Be very careful when using it. We recommend you never reset beyond the most recent commit—that is, use it only to unstage files (git reset) or undo the most recent commit (git reset --soft HEAD~1).
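The two safe uses of git reset mentioned above can be tried in a temporary repository, as in this sketch (the file names and commit messages are hypothetical). Only the most recent commit is undone, and its changes are kept rather than discarded.

```shell
#!/bin/sh
set -e

# Throwaway repository (all names here are hypothetical)
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

echo "draft" > todo.txt
git add todo.txt
git commit -q -m "Add todo list"

echo "oops" > scratch.txt
git add scratch.txt
git commit -q -m "Commit too early"

# Undo the most recent commit, keeping its changes in the staging area
git reset --soft HEAD~1
git status --short

# Unstage everything; the file itself stays in the working directory
git reset -q
git status --short
```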

3.6 Ignoring Files from a Project

Sometimes you want git to always ignore particular directories or files in your project. For example, if you use a Mac and you tend to organize your files in Finder, the operating system will create a hidden file in that folder named .DS_Store (the leading dot makes it “hidden”) to track the positions of icons, which folders have been “expanded,” and so on. This file will likely be different from machine to machine, and has no meaningful information for your project. If it is added to your repository and you work from multiple machines (or as part of a team), it could lead to a lot of merge conflicts (not to mention cluttering up the folders for Windows users).

You can tell git to ignore files like these by creating a special hidden file in your project directory called .gitignore (note the leading dot). This text file contains a list of files or folders that git should “ignore” and therefore not “see” as one of the files in the folder. The file uses a very simple format: each line contains the path to a directory or file to ignore; multiple files are placed on multiple lines. For example:

# This is an example .gitignore file
# The leading "#" marks a comment describing the code

# Ignore Mac system files
.DS_Store

# Don't check in passwords stored in this file
secret/my_password.txt

# Don't include large files or libraries
movies/my_four_hour_epic.mov

# Ignore everything in a particular folder; note the slash
raw-data/
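One way to check that a .gitignore file behaves as intended is to create the ignored files and ask git what it sees. The sketch below does this in a temporary repository, reusing some of the entries from the example above; analysis.R and survey.csv are hypothetical files added for illustration.

```shell
#!/bin/sh
set -e

# Throwaway repository (all names here are hypothetical)
repo=$(mktemp -d)
cd "$repo"
git init -q

# Create some files that should be ignored and one that should not
mkdir raw-data secret
echo "id,answer" > raw-data/survey.csv
echo "not-a-real-password" > secret/my_password.txt
touch .DS_Store
echo "print('hello')" > analysis.R

# Write the ignore rules, one per line
printf '%s\n' '.DS_Store' 'secret/my_password.txt' 'raw-data/' > .gitignore

# Only .gitignore and analysis.R are reported as untracked
git status --short

# Ask git which rule causes a particular path to be ignored
git check-ignore -v raw-data/survey.csv
```

The .gitignore file itself shows up as untracked: it is a normal project file that you would usually add and commit.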

The easiest way to create the .gitignore file is to use your preferred text editor (e.g., Atom). Select File > New from the menu and choose to make the .gitignore file directly inside your repo (in the root folder of that repo, not in a subfolder).

If you are on a Mac, we strongly suggest globally ignoring your .DS_Store file. There’s no need to ever share or track this file. To always ignore this file on your machine, you can create a “global” .gitignore file (e.g., in your ~ home directory), and then tell git to always exclude files listed there through the core.excludesfile configuration option:

# Append `.DS_Store` to your `.gitignore` file in your home directory
echo ".DS_Store" >> ~/.gitignore

# Always ignore files listed in that central file
git config --global core.excludesfile ~/.gitignore
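To confirm that the global setting took effect, you can read it back and test it in a fresh repository. The sketch below is an illustration only: it points HOME at a temporary folder (an assumption made so the example does not alter your real configuration), so on your own machine you would skip those first lines.

```shell
#!/bin/sh
set -e

# Use a temporary home directory so this sketch leaves real settings alone
HOME=$(mktemp -d)
export HOME
unset XDG_CONFIG_HOME

# The two commands from the text
echo ".DS_Store" >> ~/.gitignore
git config --global core.excludesfile ~/.gitignore

# Read the setting back to confirm which file git will consult
git config --global --get core.excludesfile

# In a brand-new repo with no local .gitignore, .DS_Store is still ignored
repo=$(mktemp -d)
cd "$repo"
git init -q
touch .DS_Store
git check-ignore -v .DS_Store
git status --short
```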

Note that you may still want to list .DS_Store in a repo’s local .gitignore file in case you are collaborating with others.

Additionally, GitHub provides a number of suggested .gitignore files for different languages,13 including R.14 These are good places to start when creating a local .gitignore file for a project.

13.gitignore templates: https://github.com/github/gitignore

14.gitignore template for R: https://github.com/github/gitignore/blob/master/R.gitignore

Whew! You made it through! This chapter has a lot to take in, but really you just need to understand and use the following half-dozen commands:

  • git status: Check the status of a repo.

  • git add: Add files to the staging area.

  • git commit -m "Message": Commit changes.

  • git clone: Copy a repo to the local machine.

  • git push: Upload commits to GitHub.

  • git pull: Download commits from GitHub.
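The six commands above can be rehearsed together without a GitHub account. In the sketch below, a local “bare” repository stands in for the copy hosted on GitHub (an assumption made so the example runs offline), and two clones named laptop and desktop play the role of two machines. All names are hypothetical.

```shell
#!/bin/sh
set -e

work=$(mktemp -d)
cd "$work"

# A bare repository stands in for the repo hosted on GitHub
git init -q --bare remote.git

# git clone: copy the repo to the "local machine"
git clone -q remote.git laptop
cd laptop
git config user.name "Example Author"
git config user.email "author@example.com"

# git status, git add, git commit: record a change
echo "# My Project" > README.md
git status --short
git add README.md
git commit -q -m "Add README"

# git push: upload the commit (HEAD means the current branch)
git push -q -u origin HEAD

# A second clone plays the role of another machine
cd "$work"
git clone -q remote.git desktop

# Make and push another change from the first machine
cd "$work/laptop"
echo "More notes" >> README.md
git add README.md
git commit -q -m "Expand README"
git push -q

# git pull: download the new commit on the second machine
cd "$work/desktop"
git pull -q
cat README.md
```

With a real GitHub repository, the only difference is that the clone command takes the repo's URL instead of a local path.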

While it’s tempting to ignore version control systems, they will save you time in the long run. For a tool so useful and popular, git is particularly complex and difficult to understand. Fortunately, a wide variety of tutorials and explanations are available online if you need further clarification.

For practice working with git and GitHub, see the set of accompanying book exercises.22

22Version control exercises: https://github.com/programming-for-data-science/chapter-03-exercises
