Version control for datasets

Let's start by understanding why version control matters in data science. The data science and scientific computing community faces something of a reproducibility crisis: one data team can extract a specific insight from a dataset, but others cannot, even when using the same methods. In many cases, this is because the teams are not actually working with the same data. Some might be using an outdated version of the same dataset, while others might have collected their data from a different source.

For this reason, version control for datasets is increasingly important. However, as we discussed in Chapter 5, Version Control with Git in PyCharm, common version control tools such as Git are not well suited to datasets, which are typically large files that should not be stored alongside code. In particular, GitHub rejects any push that contains a file larger than 100 MB.
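Before committing, it can help to check whether a project contains any files that would exceed this limit. The following is a minimal sketch of such a check. The function name find_oversized_files is hypothetical, and the threshold is an assumption: it takes the 100 MB limit mentioned above to mean 100 * 1024 * 1024 bytes, so adjust it if your hosting service differs.

```python
from pathlib import Path

# Assumed threshold: the 100 MB limit, taken here as 100 * 1024 * 1024 bytes.
SIZE_LIMIT_BYTES = 100 * 1024 * 1024


def find_oversized_files(root, limit=SIZE_LIMIT_BYTES):
    """Return (path, size) pairs for files under root larger than limit.

    Anything inside a .git directory is skipped, since those files are
    Git's own storage rather than project files.
    """
    root = Path(root)
    oversized = []
    for path in root.rglob("*"):
        if ".git" in path.relative_to(root).parts:
            continue
        if path.is_file():
            size = path.stat().st_size
            if size > limit:
                oversized.append((path, size))
    # Largest files first, so the worst offenders are easy to spot.
    return sorted(oversized, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for path, size in find_oversized_files("."):
        print(f"{path}: {size / (1024 * 1024):.1f} MiB")
```

Running the script from the project root lists any files that are candidates for the approach described next.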

Luckily, there is an extension to Git designed specifically for this purpose, Git Large File Storage (Git LFS), which integrates nicely with regular Git. When we register a file with Git LFS, the system replaces that file in the repository with a small pointer that references it. So, when the file is placed under version control, Git stores only the pointer, while the actual file content is stored on a separate server.
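To make the idea of a pointer concrete, here is a rough sketch of what such a pointer contains. According to the Git LFS specification, a pointer is a short text file holding a version line, the SHA-256 hash of the real content, and its size in bytes. The function build_lfs_pointer is a hypothetical helper written for illustration only; in practice, Git LFS generates pointers itself and you never need to create them by hand.

```python
import hashlib


def build_lfs_pointer(data_path):
    """Return pointer text in the Git LFS v1 format for the given file.

    Illustrative only: Git LFS creates real pointers automatically.
    """
    digest = hashlib.sha256()
    size = 0
    with open(data_path, "rb") as handle:
        # Read in 1 MiB chunks so that large datasets are not loaded at once.
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
            size += len(chunk)
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{digest.hexdigest()}\n"
        f"size {size}\n"
    )
```

However large the dataset is, the resulting pointer is only three short lines, which is why Git can version it cheaply.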

In short, Git LFS allows us to apply version control to large files (in this case, datasets) with Git, without actually storing the files in Git. Now, let's walk through the process of using Git LFS in the following steps:

Git LFS is bundled with some Git distributions, such as Git for Windows, which you can download from the official website, https://git-scm.com/. Otherwise, you need to install Git LFS separately. Once it is installed, run the following command to set it up for your user account:

git lfs install

To have Git LFS track files of a given extension, run the following command:

git lfs track "*.[extension]"

Git LFS will now track any file with that extension. Go ahead and run the command for the .txt extension (that is, with the pattern "*.txt") within our current project, which will register our text data files with Git LFS.

We also need to add the .gitattributes file to Git, since this file records the file patterns that Git LFS is tracking:

git add .gitattributes
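To see what .gitattributes records, note that tracking the pattern "*.txt" adds a line of the form `*.txt filter=lfs diff=lfs merge=lfs -text` to it. The sketch below lists the patterns that are handled by Git LFS. The function lfs_tracked_patterns is a hypothetical helper, and the parsing is a simplification that assumes each line is a pattern followed by whitespace-separated attributes.

```python
def lfs_tracked_patterns(gitattributes_text):
    """Return the patterns whose attributes include filter=lfs."""
    patterns = []
    for line in gitattributes_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        pattern, *attributes = line.split()
        if "filter=lfs" in attributes:
            patterns.append(pattern)
    return patterns


example = "*.txt filter=lfs diff=lfs merge=lfs -text\n*.py text\n"
print(lfs_tracked_patterns(example))  # prints ['*.txt']
```

In a real project, you could pass in the contents of the file instead, for example with `open(".gitattributes").read()`.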

That is essentially the process of using Git LFS. Now, when a file with an extension tracked by Git LFS is added through regular Git, Git LFS will automatically handle all of the backend referencing logic that we mentioned earlier. This also concludes our discussion of working with datasets in a data science project.

In the next section, we will start the exploratory process with the dataset we have.
