Starting with data

DVC is easy to incorporate into your workflow. First, install it with pip install dvc; after that, we'll set it up step by step. You should always start by adding the raw data to DVC. Let's assume the data is collected outside of the workflow, so we only need to store the files. To do so, perform the following steps:

  1. First, open a terminal (in VS Code, for example), make sure you're in the right folder, the one where git was initialized (and where the .git folder is therefore located), and type this:
dvc init
  2. If the command succeeds, DVC will print out a few links to documentation and offer to commit the changes to git. If you type git status, you'll notice a newly generated folder, .dvc, with two files in it: .gitignore and config. So, let's commit this change to git:
git add .
git commit -m "dvc initialized"
git push  # optional
  3. Now, let's register our data file in DVC:
dvc add Chapter14/data/EF_battles_corrected.csv

If you check git status, you'll notice that a new file was generated: EF_battles_corrected.csv.dvc. Feel free to open it in a text editor. The most important element in it is the string of gibberish, which is the MD5 hash of the data file. The hash is computed from the file's contents by a deterministic algorithm, so it acts as a fingerprint of your data: the same contents always produce the same hash. If the data changes, the new hash won't match the one stored in the .dvc file, so DVC will know that the file has changed and will store a new version.
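
To make the fingerprint idea concrete, here is a minimal Python sketch of hashing a file's contents with MD5. It only illustrates the principle and is not DVC's own implementation, which may differ in details; the function name file_md5, the file name sample.csv, and the sample rows are all made up for this example.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files never need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Hypothetical stand-in for a data file; the rows are invented for illustration.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "sample.csv"
    data.write_bytes(b"id,value\n1,a\n")
    original = file_md5(data)
    unchanged = file_md5(data)  # same contents, hashed a second time

    data.write_bytes(b"id,value\n1,a\n2,b\n")  # simulate a data update
    modified = file_md5(data)

print(original == unchanged)  # identical contents give an identical hash
print(original == modified)   # changed contents give a different hash
```

Comparing a stored hash with a freshly computed one is enough to detect that a file has changed, without keeping a copy of the old contents around for comparison.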

For the same reason, the hash is used as the path to the file in the DVC cache (inside the .dvc folder) and on the remote server. By committing the .dvc file to git, you essentially tie this specific version of the code to a specific version of the data: anyone who pulls the code and the .dvc file will be able (given access to the server, of course) to pull that exact version of the dataset. Because git tracks every version of the .dvc file, we'll always be able to match a version of the code with the corresponding version of the data.
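
The idea of using a hash as a storage path can be sketched in a few lines of Python. The layout below (the first two characters of the hash as a subfolder under .dvc/cache) is an assumption made for illustration; the exact directory structure differs between DVC versions. The helper name cache_path and the example hash are likewise hypothetical.

```python
from pathlib import PurePosixPath


def cache_path(md5_hex, cache_root=".dvc/cache"):
    """Build a content-addressed path: <root>/<first 2 hex chars>/<rest>."""
    # Assumed layout for illustration; check your DVC version's actual cache.
    return PurePosixPath(cache_root) / md5_hex[:2] / md5_hex[2:]


# Illustrative hash only (the MD5 of the bytes b"abc"), not a real dataset.
example_hash = "900150983cd24fb0d6963f7d28e17f72"
print(cache_path(example_hash))
```

Because the path is derived from the contents alone, two versions of the same dataset never overwrite each other, and identical files are stored only once.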

Let's commit it to git:

git add .; git commit -m "adding the first dataset to dvc"

Note that the data itself is not uploaded to git: DVC explicitly adds it to .gitignore, and because no remote storage is configured for DVC yet, the data is kept locally. For now, we won't use remote storage for DVC, but if you have an S3 bucket, an FTP server, or an Azure or Google Cloud account, feel free to use one of them with dvc remote. Once the remote is set up, just run dvc push every time the data is updated (this can also be set to run automatically on every git push by installing DVC's git hooks with dvc install). Others can then clone the git repository and pull the data with dvc pull, which uses the hash in the .dvc file.
