© Johan Abildskov 2020
J. Abildskov, Practical Git, https://doi.org/10.1007/978-1-4842-6270-2_8

8. Additional Git Features

Johan Abildskov1 
(1)
Tilst, Denmark
 

In this chapter, I have a lovely amalgam of Git features for you – features that I could not find any place to put. The reason they ended up here might be that where they would have originally fit, we had not established the right mental models, or that they are slightly tangential to the rest of the content in this book. These are features that might help you in your work but should not come into play on an everyday basis. Being aware of their existence might key you in for those dire situations where they are just the right thing for you. We cover figuring out what specific commit introduced a discrepancy using Git bisect. We use Git Submodules to manage dependencies between repositories. And we are going to use Git Large File Storage or Git LFS for short. If you made it this far, congratulations. You have completed the Practical Git curriculum and mastered the foundations. The rest is the icing on the cake.

Git Bisect

In a perfect world, we have quick tests that we can run on every commit, letting us know if we introduce an error that breaks existing functionality. Unfortunately, this seems to be a utopian vision. In reality, we seldom have perfect tests, and even when we do have extensive test coverage, there are no guarantees that no bugs slip through our net. The discrepancy can also be a nonbreaking change, such as an element changing color, which would have been hard to test for in a meaningful way. In these cases, we can revert the change manually to remedy the unwanted behavior. But this is both tedious and error-prone. Plus, there might be a good reason for the change to be valid. As such, it is valuable to be able to find the commit that introduced the change.

The most straightforward strategy is to start from the most recent commit that was in a healthy state and check out commits one at a time. For each commit, poke around and figure out if it was the commit that introduced the problem. At some point, we find the commit that introduced the breaking change, and we can revert that commit. If we are lucky, we have tests that we can run on each commit to verify its quality. In the worst-case scenario, we need to check all of the commits between the good and the bad commit, which is a tiresome and arduous task. This linear strategy is shown in Figure 8-1. There are small improvements, such as starting from the newest commit if you believe it was a recent change you are looking for, but nevertheless, it might take a long time.
../images/495602_1_En_8_Chapter/495602_1_En_8_Fig1_HTML.jpg
Figure 8-1

Searching through history in a linear fashion

We are fortunate that Git provides us with a better way of finding the culprit. You might have heard of binary search, a much faster approach to finding an element in a sorted list. As we are searching through time, we can consider our commit history to be a sorted list. Binary search recursively looks at the middle element to determine whether the desired element is in the left or right half of the list. Repeating this quickly yields the desired element. The performance is particularly attractive for long histories. Looking through 1000 elements linearly takes a long time and has a horrible worst case of going through all the elements. Using binary search, we can guarantee that we have found the element after at most 10 iterations. This is a huge difference! Figure 8-2 shows jumping through history to find the breaking change.
../images/495602_1_En_8_Chapter/495602_1_En_8_Fig2_HTML.jpg
Figure 8-2

Jumping through the history like a binary search
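The iteration count is easy to check with a few lines of shell: repeatedly halving a pool of 1000 candidate commits narrows it down to a single commit in 10 steps.

```shell
# Count how many halvings it takes to narrow 1000 candidates down to 1.
n=1000
steps=0
while [ "$n" -gt 1 ]; do
  n=$(( (n + 1) / 2 ))   # each test discards roughly half the remaining commits
  steps=$(( steps + 1 ))
done
echo "$steps halvings"   # prints "10 halvings"
```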

It is tedious to keep track of where we are and which commit to investigate next. Git automates this bookkeeping with the bisect command.

Git bisect works by marking a commit as being bad and one as being good. Then, bisect will iteratively check out commits that we can mark as either good or bad, and bisect will continue until it is unambiguous which commit was the first bad one.

GIT BISECT EXERCISE
In this exercise, we will go through the bisect kata. It can be found in the katas repository in the bisect folder. In this exercise, we are left with 100 commits and changes in 50 files. It’s no easy task to figure out when this broke! Fortunately, we have a script that can verify if a commit is broken, so we will use bisect to move through history.
$ . setup.sh
<Truncated>
$ git bisect start
After having started the bisection, we need to mark a commit as good and one as bad. This sets the endpoints for our search. We find the tag that we want to mark as good and mark HEAD as bad.
$ git tag
initial-commit
$ git bisect good initial-commit
$ git bisect bad
Bisecting: 49 revisions left to test after this (roughly 6 steps)
[9d7c0188ea01453068cab551cd07bc2f52cb4a44] 50
Now that we have marked the endpoints for the search, Git checks out the first commit that we need to verify. We use the script test.sh in the exercise folder to verify the commit. Depending on the test outcome, we mark the commit as either good or bad and continue verifying the commits that Git presents us with.
$ ./test.sh
test failed
$ git bisect bad
Bisecting: 24 revisions left to test after this (roughly 5 steps)
[7ff73ce2a82182eaa46e7239e093b976b851c2fc] 25
$ ./test.sh
test failed
$ git bisect bad
Bisecting: 12 revisions left to test after this (roughly 4 steps)
[1bb261b8f8d9549430af7c93e27c54a25abee63d] 12
$ ./test.sh
test passed
$ git bisect good
Bisecting: 6 revisions left to test after this (roughly 3 steps)
[c3b042dd17d10492c94d2544ec36982637efef36] 18
$ ./test.sh
test passed
$ git bisect good
Bisecting: 3 revisions left to test after this (roughly 2 steps)
[a604e7c7d423c6271925d1f7431cdbaa0c069a5a] 21
$ ./test.sh
test passed
$ git bisect good
Bisecting: 1 revision left to test after this (roughly 1 step)
[878630d3e906eb6e262f58d16b5611c79313ba91] 23
$ ./test.sh
test failed
$ git bisect bad
Bisecting: 0 revisions left to test after this (roughly 0 steps)
[819fa50314086a1e031427704e7bbc9419375cfd] 22
$ ./test.sh
test failed
$ git bisect bad
819fa50314086a1e031427704e7bbc9419375cfd is the first bad commit
commit 819fa50314086a1e031427704e7bbc9419375cfd
Author: Johan Abildskov <[email protected]>
Date:   Sun Aug 2 21:15:32 2020 +0200
    22
 22.txt | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 22.txt

While this was tedious, we unambiguously found the bad commit, with a good guarantee on the maximum number of commits that we needed to test.

Fortunately, because we were provided with a test, we can do this even more efficiently using git bisect run.
$ git bisect reset
Previous HEAD position was 819fa50 22
Switched to branch 'master'
$ git bisect start
$ git bisect good initial-commit
$ git bisect bad
Bisecting: 49 revisions left to test after this (roughly 6 steps)
[9d7c0188ea01453068cab551cd07bc2f52cb4a44] 50
After Git provides us with the initial commit to verify, rather than doing so manually, we pass the test script to bisect.
$ git bisect run './test.sh'
running ./test.sh
test failed
Bisecting: 24 revisions left to test after this (roughly 5 steps)
[7ff73ce2a82182eaa46e7239e093b976b851c2fc] 25
running ./test.sh
test failed
Bisecting: 12 revisions left to test after this (roughly 4 steps)
[1bb261b8f8d9549430af7c93e27c54a25abee63d] 12
running ./test.sh
test passed
Bisecting: 6 revisions left to test after this (roughly 3 steps)
[c3b042dd17d10492c94d2544ec36982637efef36] 18
running ./test.sh
test passed
Bisecting: 3 revisions left to test after this (roughly 2 steps)
[a604e7c7d423c6271925d1f7431cdbaa0c069a5a] 21
running ./test.sh
test passed
Bisecting: 1 revision left to test after this (roughly 1 step)
[878630d3e906eb6e262f58d16b5611c79313ba91] 23
running ./test.sh
test failed
Bisecting: 0 revisions left to test after this (roughly 0 steps)
[819fa50314086a1e031427704e7bbc9419375cfd] 22
running ./test.sh
test failed
819fa50314086a1e031427704e7bbc9419375cfd is the first bad commit
commit 819fa50314086a1e031427704e7bbc9419375cfd
Author: Johan Abildskov <[email protected]>
Date:   Sun Aug 2 21:15:32 2020 +0200
    22
 22.txt | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 22.txt
bisect run success

Using run, we found the offending commit easily. In many cases, it will pay off to have such a test script at hand. This exercise showed how we can smoothly figure out the particular commit that introduced a given change or bug.

Git bisect is an excellent feature, but it requires you to care about the history that you create. If you bundle too many changes into a single commit, it will still be difficult to figure out which particular part of that commit introduced the bad change. Ideally, the change is atomic, so you can simply revert the bad commit. Git bisect is also easier to work with when you do not have many merges and levels of branches. As such, bisect can be easier to use if you rebase rather than merge.

Git Submodules

One of the general problems in developing software is handling dependencies and using code from other people. Whether this code is open sourced and publicly available or proprietary, how you get that code into your workspace in a traceable manner is a challenge. Depending on the ecosystem of your programming language of choice, there are preferred solutions. Python has pip packages, JavaScript npm, and .NET NuGet packages, and many languages have their own. The native package management should be the preferred solution for sharing code across code bases. In some scenarios, such a solution might not present itself. C and C++ do not come with a native dependency management solution, for instance. In these scenarios, we can turn to Git Submodules to share code across code bases. As Git is language-agnostic, it should be our fallback solution rather than the default. Defaulting to Git Submodules for dependency management causes you to miss out on the benefits of the integrated ecosystem.

With Git Submodules, we add folders whose content comes from a different repository. Git Submodules uses a file called .gitmodules to keep track of the paths that are submodules. This allows Git to restore the content of such a folder from the remote we have added. If, for instance, we want to add the Git katas repository as a dependency in our repository, we can run the command git submodule add [email protected]:eficode-academy/git-katas katas. After running this command, the folder katas contains the content of the master branch of the kata repository. If we look at the content of .gitmodules, it looks as follows.
$ cat .gitmodules
[submodule "katas"]
        path = katas
        url = [email protected]:eficode-academy/git-katas
Listing 8-1

Content of .gitmodules after adding submodule

Note that this is very different from manually putting the katas repository inside another Git repo, which is something we should never do. The .gitmodules file allows us to reestablish this dependency in other clones of our remote. Git Submodule configuration lives in .git/config, but as that file is not shared through the remote, we need to initialize submodules to re-create the configuration from .gitmodules in new clones. This initialization is done either by git submodule init, followed by git submodule update, or by git submodule update --init. The latter is preferred unless you need to customize submodule locations. Init restores configuration to .git/config, while update checks out the content to the path.
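A compact, self-contained sketch of this clone-then-initialize flow (all repository names are invented; the protocol.file.allow setting is needed on newer Git versions, which block file-based submodule clones by default):

```shell
set -e
work=$(mktemp -d)
# A stand-in "remote" holding the component.
git init -q "$work/component"
git -C "$work/component" config user.email "[email protected]"
git -C "$work/component" config user.name "Example"
echo "int component;" > "$work/component/component.h"
git -C "$work/component" add component.h
git -C "$work/component" commit -q -m "Add component header"
# The product adds the component as a submodule under include/.
git init -q "$work/product"
git -C "$work/product" config user.email "[email protected]"
git -C "$work/product" config user.name "Example"
git -C "$work/product" -c protocol.file.allow=always \
  submodule add "$work/component" include
git -C "$work/product" commit -q -m "Add component submodule"
# A fresh clone starts with an empty include/ folder...
git clone -q "$work/product" "$work/product_alpha"
ls "$work/product_alpha/include"
# ...until the submodule is initialized and updated in one step.
git -C "$work/product_alpha" -c protocol.file.allow=always \
  submodule update --init
ls "$work/product_alpha/include"   # now contains component.h
```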

Note

One of the challenges of working with submodules is keeping track of which project you are currently trying to make a change in. Is this a change in the outer or the inner project? There is no tooling to help with this, other than being deliberate about what changes belong where and staying focused while delivering.
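One small aid, though it is plain Git rather than anything submodule-specific, is git rev-parse --show-toplevel, which prints the root of the repository that the current directory belongs to. A sketch with two nested throwaway repositories standing in for a product and its submodule:

```shell
set -e
work=$(mktemp -d)
git init -q "$work/outer"
git init -q "$work/outer/inner"   # stands in for a submodule checkout
cd "$work/outer/inner"
git rev-parse --show-toplevel     # prints the path ending in outer/inner
cd "$work/outer"
git rev-parse --show-toplevel     # prints the path ending in outer
```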

SUBMODULE EXERCISE

In this exercise, we go through the Git Submodule kata. We show how to add submodules and the workflow around delivering changes to both outer and inner repositories. The submodule kata is in the katas in the folder submodules/.

First, we set up the exercise.
cd submodules/
$ ls
README.md  setup.ps1  setup.sh
$ . setup.sh
<Truncated>
$ ls
component  product  remote

We note three folders, each a Git repository. We have the product that we are building. The folder remote represents the presence of the component on a repository manager like GitHub. The component folder represents the local working folder of those developing the submodule.

The first thing we do is add the component to our product.
$ cd product/
/product$ ls
product.h
/product$ git submodule add ../remote include
Cloning into '/home/randomsort/repos/git-katas/submodules/exercise/product/include'...
done.
/product$ ls
include  product.h
/product$ git status
On branch master
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)
        new file:   .gitmodules
        new file:   include

We observe that the two paths have changed: the .gitmodules file that keeps track of submodules and the path where we have added the submodule.

Inside include, the content of the module is present.
/product$ ls include
component.h
/product$ cd include
/product/include$ git status
On branch master
Your branch is up to date with 'origin/master'.
nothing to commit, working tree clean
The status of the submodule is clean, even though our root repository is dirty. This is one of the things that can be tricky with submodules.
/product/include$ cd ..
/product$ git diff --cached
diff --git a/.gitmodules b/.gitmodules
new file mode 100644
index 0000000..79d5c92
--- /dev/null
+++ b/.gitmodules
@@ -0,0 +1,3 @@
+[submodule "include"]
+       path = include
+       url = ../remote
diff --git a/include b/include
new file mode 160000
index 0000000..3aecaf4
--- /dev/null
+++ b/include
@@ -0,0 +1 @@
+Subproject commit 3aecaf441cca7d98cbec906bf7bf61902fcd41ee
The diff in the product repository matches what we expect based on the previous step, except for the +Subproject commit <hash> line.
/product$ cat .gitmodules
[submodule "include"]
        path = include
        url = ../remote

However, when we look in the .gitmodules file, there is no information letting us know which commit we have added to our product repository. This is because Git stores that object reference directly in its internal database, as a commit entry in the tree object. We cover how commits are constructed and what trees look like in the next chapter.

Now, we commit our change to the product repository, namely, adding the submodule.
/product$ git commit -m "Add component"
[master f7a101d] Add component
 2 files changed, 4 insertions(+)
 create mode 100644 .gitmodules
 create mode 160000 include
/product$ cd ..
Let’s move on and create a change inside of the submodules remote. As the submodule itself is a completely ordinary Git repository, nothing new is going on here.
$ cd component
/component$ ls
component.h
/component$ git status
On branch master
Your branch is up to date with 'origin/master'.
nothing to commit, working tree clean
/component$ echo "important change" > file
/component$ git add file
/component$ git commit -m "important change"
[master 19451c0] important change
 1 file changed, 1 insertion(+)
 create mode 100644 file
/component$ git status
On branch master
Your branch is ahead of 'origin/master' by 1 commit.
  (use "git push" to publish your local commits)
nothing to commit, working tree clean
/component$ git push
Counting objects: 3, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), 298 bytes | 149.00 KiB/s, done.
Total 3 (delta 0), reused 0 (delta 0)
To /home/randomsort/repos/git-katas/submodules/exercise/remote
   3aecaf4..19451c0  master -> master
/component$ cd ..
We published the change to the remote. Let’s check how that looks from the perspective of the product.
$ cd product
/product$ git status
On branch master
nothing to commit, working tree clean
Our master branch is clean, so we do not detect a change of the submodule.
/product$ git submodule foreach 'git status'
Entering 'include'
On branch master
Your branch is up to date with 'origin/master'.
nothing to commit, working tree clean
Even going through the submodules and running status in there does not help us. We need to pull inside of the submodule.
/product$ cd include
/product/include$ git pull
remote: Counting objects: 3, done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 3 (delta 0), reused 0 (delta 0)
Unpacking objects: 100% (3/3), done.
From /home/randomsort/repos/git-katas/submodules/exercise/remote
   3aecaf4..19451c0  master     -> origin/master
Updating 3aecaf4..19451c0
Fast-forward
 file | 1 +
 1 file changed, 1 insertion(+)
 create mode 100644 file
/product/include$ ls
component.h  file
While this works and we could have used git submodule foreach to iterate over each of our repositories, it becomes less transparent what changes we are pulling into our product.
/product/include$ cd ..
/product$ git status
On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)
        modified:   include (new commits)
no changes added to commit (use "git add" and/or "git commit -a")
After updating the submodule, we can see that there is a change from the vantage point of the product. With git diff, we can see the change from tracking one commit to another. We commit that change to our product.
/product$ git diff
diff --git a/include b/include
index 3aecaf4..19451c0 160000
--- a/include
+++ b/include
@@ -1 +1 @@
-Subproject commit 3aecaf441cca7d98cbec906bf7bf61902fcd41ee
+Subproject commit 19451c07652a282a71eeb7d953d9d807c66284a8
/product$ git add .
/product$ git commit -m "Update include"
[master ebb028e] Update include
 1 file changed, 1 insertion(+), 1 deletion(-)
With the product thus updated, we can take advantage of having the submodule embedded as a proper Git repository inside of our product. This is a powerful feature as we can develop our submodule in the context of the product that uses it. It has the disadvantage that it becomes more difficult to discern when you are working in which repository, and if a submodule is used in multiple products, it is unlikely to be a good idea to develop in the context of a single specific product.
/product$ cd include/
/product/include$ ls
component.h  file
/product/include$ git mv file file.txt
/product/include$ git status
On branch master
Your branch is up to date with 'origin/master'.
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)
        renamed:    file -> file.txt
/product/include$ git commit -m "Add file extension to file"
[master d9ba324] Add file extension to file
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename file => file.txt (100%)
/product/include$ git push
Counting objects: 2, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (2/2), 285 bytes | 285.00 KiB/s, done.
Total 2 (delta 0), reused 0 (delta 0)
To /home/randomsort/repos/git-katas/submodules/exercise/remote
   19451c0..d9ba324  master -> master
We have delivered a change to the submodule from the Git repository embedded in our product. Next, we clone a second product from the product folder to show how it looks if you are not adding submodules, but rather cloning a repository that is already using submodules.
/product/include$ cd ..
/product$ cd ..
$ git clone product product_alpha
Cloning into 'product_alpha'...
done.
$ cd product_alpha/
/product_alpha$ ls
include  product.h
/product_alpha$ ls include/
In our freshly cloned repository, the include folder exists, but it is empty. The following log output shows that we do indeed have the newest commit in the product repository. So the issue must be with the submodule itself.
/product_alpha$ git log
commit ebb028e42833ba80df82f1694257e646d26436d1 (HEAD -> master, origin/master, origin/HEAD)
Author: Johan Abildskov <[email protected]>
Date:   Tue Aug 4 20:57:06 2020 +0200
    Update include
commit f7a101df8286b36cd2abee11cd878306c5b89a7b
Author: Johan Abildskov <[email protected]>
Date:   Tue Aug 4 20:50:24 2020 +0200
    Add component
commit 53e5bf7ed2455e9aa578ff1f9a7bdd7a09eb4c21
Author: Johan Abildskov <[email protected]>
Date:   Tue Aug 4 20:47:44 2020 +0200
    Touch product header
After cloning a repository using submodules, we first need to init the submodules. Initialization is required to populate the local repository configuration correctly.
/product_alpha$ git submodule init
Submodule 'include' (/home/randomsort/repos/git-katas/submodules/exercise/remote) registered for path 'include'
/product_alpha$ ls include
The still frustratingly empty include directory tells us that it is not enough to initialize the submodule. We use update to check out the submodule to the relevant path.
/product_alpha$ git submodule update
Cloning into '/home/randomsort/repos/git-katas/submodules/exercise/product_alpha/include'...
done.
Submodule path 'include': checked out '19451c07652a282a71eeb7d953d9d807c66284a8'
/product_alpha$ ls include
component.h  file
So we did not get the newest change on the submodule, as we have file rather than file.txt.
/product_alpha$ cd ..
$ cd product
/product$ git status
On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)
        modified:   include (new commits)
no changes added to commit (use "git add" and/or "git commit -a")

As we can see, making the change to the submodule in the context of the product is no guarantee that the product reflects this change. This trap is another caveat of using submodules. People who have experience with other version control systems, such as ClearCase, might have an intuition that we can deliver a single change atomically across multiple repositories, but that is not possible in Git. While it might not feel like it, changes in the submodule and in the products using the submodule are completely independent and cannot be done as a transaction.

So let us commit the change to the submodule version in the product repository.
/product$ git add .
/product$ git commit -m "update submodule"
[master 6102bac] update submodule
 1 file changed, 1 insertion(+), 1 deletion(-)
/product$ cd ..
$ cd product_alpha/
/product_alpha$ git submodule update
/product_alpha$ git pull
remote: Counting objects: 2, done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 2 (delta 0), reused 0 (delta 0)
Unpacking objects: 100% (2/2), done.
From /home/randomsort/repos/git-katas/submodules/exercise/product
   ebb028e..6102bac  master     -> origin/master
Updating ebb028e..6102bac
Fast-forward
 include | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
/product_alpha$ ls include
component.h  file
/product_alpha$ git submodule update
Submodule path 'include': checked out 'd9ba3247bb58bfc4f36ed3d6fa60781b0b32a5e1'
/product_alpha$ ls include/
component.h  file.txt

Here, we again notice the two-step approach to getting a change from a submodule. First, we update the reference to the submodule, and then we make the local path reflect the content of the submodule at that reference.

This exercise walked you through working with submodules. As you can see, the tooling is quite easy to use. The difficulties with submodules come from nontrivial usage where it can become hard to keep track of what is going on.

This section has covered Git Submodules, so you now should have an idea about how they work and what you can do with them. I still recommend going with the native dependency management tooling if there is one available for the framework that you are using.

Git Large File Storage

Git is excellent at managing text files, which is the polite way of saying that Git is not well suited to storing binary files. Large binary assets, in particular, are taxing in Git. This is a consequence of Git's offline capabilities: the distributed nature of Git puts every version in each of our clones. This can consume a lot of bandwidth and storage, which might make Git slow to work with.

My first reaction when people want to store binary assets in Git is to tell them not to. In the general case, storing binary assets in Git is a workaround rather than a real solution. A proper artifact management strategy together with a binary repository manager, such as JFrog Artifactory or Sonatype Nexus, is usually the best solution. There can be scenarios where it is useful to store binary assets in Git, and if this is necessary, in my opinion, the only right way to do it is using Git LFS. The primary workflow cost of using Git LFS is that you can no longer work truly offline. Depending on connectivity and the size of the binary assets, this is a smaller problem today than it was five or ten years ago.

These days Git LFS is bundled with most installers. You can test if you have it installed with the command git lfs. If it doesn’t error, you have Git LFS on your computer. If you lack Git LFS, you can download and install it from https://git-lfs.github.com/.

Implementation

Although invisible in daily use, I believe that understanding the shape of the Git LFS implementation helps build intuition about which parts of your workflow change fundamentally compared to a non-Git LFS workflow. Git LFS uses some features that we have covered previously, namely, filters and Git attributes.

Git LFS uses Git attributes to track which paths should be processed through LFS, rather than Git’s normal persistence model. Filters are used to substitute the read and write operations of plain Git, with those from Git LFS.

In order to be able to work with Git LFS, the repository manager that you are using needs to support it. The large Git repository managers support Git LFS out of the box. Some need a secondary storage to put the large files in, while others are able to maintain them on a stand-alone basis. Consult the documentation for your specific repository manager.

When you track a path with Git LFS, Git will not write the full binary object to the repository, but rather a small pointer file. When the commit is pushed, the binary content is not pushed directly to the repository but offloaded to the secondary storage defined by the repository configuration on the central host. When you check out a tracked path, Git LFS will, if necessary, download the file from secondary storage and then check it out to the given path. Except that you are unable to work offline when switching to previously non-checked-out versions, this functions completely transparently. Figure 8-3 shows this workflow.
../images/495602_1_En_8_Chapter/495602_1_En_8_Fig3_HTML.jpg
Figure 8-3

Git LFS flow showing uploading to secondary storage during push and downloading during a checkout

So rather than retrieving all commits with all objects when fetching, some objects are not fetched by Git until they are needed by a given checkout.
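For intuition, the pointer that Git actually commits for an LFS-tracked path is a small text file along these lines; the oid and size shown here are placeholders, not real values:

```
version https://git-lfs.github.com/spec/v1
oid sha256:<64-character SHA-256 of the actual file content>
size <size of the actual file in bytes>
```

The real binary content lives in LFS storage, keyed by that oid.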

Tracking Files with Git LFS

In this section, we cover working with new files added to Git LFS. Later, we cover how to remove large assets from your repository and move them to Git LFS. Initially, we need to run the command git lfs install to initialize Git LFS. This only needs to be done once per local repository. After having done that, we can add paths to track using git lfs track path. This will create an entry in the .gitattributes file with the relevant properties. Commonly, we want to track patterns of paths rather than concrete paths. This removes the need to explicitly add every binary asset that we want LFS to track individually. So we would rather use git lfs track "*.iso" than git lfs track image.iso. Note the quotes, which prevent the shell from expanding the glob before Git LFS sees it.

After running the command git lfs track "*.iso", the .gitattributes file should contain the following:
*.iso filter=lfs diff=lfs merge=lfs -text

This means that whenever someone adds an ISO to our repository, it will be handled by Git LFS. Assuming that your remote supports Git LFS, this is all you need to do.
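Because git lfs track only records that attribute line, you can inspect the effect with plain Git: git check-attr shows how a given path will be treated. A minimal sketch in a throwaway repository, writing the attribute line by hand (equivalent to what git lfs track "*.iso" would write):

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
# The same line that `git lfs track "*.iso"` would have written:
echo '*.iso filter=lfs diff=lfs merge=lfs -text' >> .gitattributes
git check-attr filter diff merge -- image.iso
# prints:
# image.iso: filter: lfs
# image.iso: diff: lfs
# image.iso: merge: lfs
```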

As we covered earlier, commits are immutable, so this does not clean up binary assets that were previously added to the repository. We cover how to find them and clean them up in the next sections.

Git Sizer

It is not uncommon to have a feeling that working with one of your repositories is clunky. Often, we even have a good idea about what is making the repository bothersome to work with. But if we are going to undertake something huge, like cleaning up our repository, we should not do it on a gut feeling; we should do it based on data. Fortunately, there are free tools that can help us investigate our repository. One such tool is git-sizer (https://github.com/github/git-sizer). git-sizer allows us to analyze repositories and report common problems with big Git repositories. Listing 8-2 shows a snapshot from analyzing the DevOpsDays assets repository. Even though it primarily contains binary assets, a common cause of overly big repositories, git-sizer only reports one problematic asset. This shows that Git can be used sensibly for assets, if done right. The DevOpsDays web team has also separated binary assets from the code base to make it easier to work with.
/pg-lfs$ ~/git-sizer
Processing blobs: 5
Processing trees: 4
Processing commits: 4
Matching commits to trees: 4
Processing annotated tags: 0
Processing references: 3
| Name                         | Value     | Level of concern               |
| ---------------------------- | --------- | ------------------------------ |
| Biggest objects              |           |                                |
| * Blobs                      |           |                                |
|   * Maximum size         [1] |  81.6 MiB | ********                       |
[1]  6660801deb787c5d0fa941801c73dd573381c4c6 (refs/heads/master:alpine-rpi-3.12.0-armv7.tar.gz)
Listing 8-2

Report from Git sizer

This report can be useful to determine if there are particular aspects of the repository that we can address. The README of the git-sizer repository contains some remedies for different Git repository size ailments. In our case, we are looking for problematic binary assets, and now that we know how to use git-sizer to locate them, we move on to using the BFG repo cleaner to move those files to Git LFS.

Converting a Repository to Git LFS

Now that we can detect problematic files that are already present in our repository, we are ready to clean up the repository and make it a bit more efficient in our workflow.

We can use the BFG repo cleaner to remove unwanted files from our history. This unwanted data can be sensitive information that we would prefer not to have in the history or, more commonly, binary assets that we either never should have added or that have grown problematic over time.

Caution

We are now moving into potentially dangerous territory. As long as we are careful, these operations should be safe, but there are potentially destructive, nonrecoverable scenarios that can occur. If, however, we are deliberate and move with caution, we can prevent any unexpected incidents.

We can use Git LFS to rewrite our history and add problematic paths to Git LFS.

CONVERT TO LFS

This exercise involves forking a repository from GitHub and working in that, so complete it in your terminal, wherever you put your repositories. Start by heading to https://github.com/randomsort/practical-git-lfs and create a fork of that repository to your account. In this exercise, I work from the fork pg-lfs. Note that this exercise requires a remote that supports Git LFS. GitHub does this, but you might need to enable it on your settings page.

First off, I clone the repository that I work in, in this exercise. Replace the URL with your personal fork.
$ git clone git@github.com:randomsort/pg-lfs
Cloning into 'pg-lfs'...
remote: Enumerating objects: 13, done.
remote: Total 13 (delta 0), reused 0 (delta 0), pack-reused 13
Receiving objects: 100% (13/13), 81.64 MiB | 11.28 MiB/s, done.
Resolving deltas: 100% (2/2), done.
It is not apparent from the printed terminal output, but this took a long, tedious time, which we know kills developer productivity and motivation. So we look to see if we can find a problem.
$ cd pg-lfs
/pg-lfs$ ls
LICENSE  README.md  alpine-rpi-3.12.0-armv7.tar.gz
We note that there is a tar.gz file and that the Git folder is large for such a small repository. We run git-sizer to find out if there are any problems.
/pg-lfs$ du -s -h .git
82M     .git
/pg-lfs$ ~/git-sizer
Processing blobs: 5
Processing trees: 4
Processing commits: 4
Matching commits to trees: 4
Processing annotated tags: 0
Processing references: 3
| Name                         | Value     | Level of concern               |
| ---------------------------- | --------- | ------------------------------ |
| Biggest objects              |           |                                |
| * Blobs                      |           |                                |
|   * Maximum size         [1] |  81.6 MiB | ********                       |
[1]  6660801deb787c5d0fa941801c73dd573381c4c6 (refs/heads/master:alpine-rpi-3.12.0-armv7.tar.gz)
From the output of git-sizer, we see that at least a tar.gz file is problematic. We decide it would be good to store tar.gz files in Git LFS, rather than directly in the Git repository. We can use the git lfs migrate tool for that. We pass the patterns and references we want Git LFS to migrate.
/pg-lfs$ git lfs migrate import --include="*.tar.gz" --include-ref=master
migrate: Sorting commits: ..., done
migrate: Rewriting commits: 100% (4/4), done
  master        9a3d24f44a28e5f514633b834afbe6022062febe -> 873439a4869e29b388027465e2a488d68c977df2
migrate: Updating refs: ..., done
migrate: checkout: ..., done
/pg-lfs$ git status
On branch master
Your branch and 'origin/master' have diverged,
and have 4 and 4 different commits each, respectively.
  (use "git pull" to merge the remote branch into yours)
nothing to commit, working tree clean
Git status tells us that our branch and the remote have diverged completely and that our working directory is clean. In this scenario, this shows that the rewritten history shares no commits with our remote.
/pg-lfs$ ls -al
total 4
drwxrwxrwx 1 randomsort randomsort  512 Aug  4 22:06 .
drwxrwxrwx 1 randomsort randomsort  512 Aug  4 22:03 ..
drwxrwxrwx 1 randomsort randomsort  512 Aug  4 22:06 .git
-rw-rw-rw- 1 randomsort randomsort   45 Aug  4 22:06 .gitattributes
-rw-rw-rw- 1 randomsort randomsort 1080 Aug  4 22:03 LICENSE
-rw-rw-rw- 1 randomsort randomsort  287 Aug  4 22:03 README.md
-rw-rw-rw- 1 randomsort randomsort  133 Aug  4 22:06 alpine-rpi-3.12.0-armv7.tar.gz
/pg-lfs$ cat .gitattributes
*.tar.gz filter=lfs diff=lfs merge=lfs -text
The Git LFS migration added the correct entry to .gitattributes, retroactively. We are happy with the state of our repository and push to the remote.
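It can be worth confirming which paths that filter entry actually applies to. `git check-attr` resolves .gitattributes rules with plain Git, no git-lfs installation required. The sketch below uses a throwaway repository and a hypothetical file name:

```shell
# Sketch: verify which paths the LFS filter applies to, using only
# git check-attr. The repo and file names are placeholders.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q attr-demo && cd attr-demo
printf '%s\n' '*.tar.gz filter=lfs diff=lfs merge=lfs -text' > .gitattributes

# Matching paths report the lfs filter...
git check-attr filter diff merge -- alpine-rpi.tar.gz
# -> alpine-rpi.tar.gz: filter: lfs  (and likewise for diff and merge)

# ...while unmatched paths report "unspecified":
git check-attr filter -- README.md
# -> README.md: filter: unspecified
```

If a path you expected to be in LFS reports `unspecified`, the pattern in .gitattributes does not match it, and the file will be committed into the regular object store.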
/pg-lfs$ git push --force
Counting objects: 14, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (8/8), done.
Writing objects: 100% (14/14), 2.34 KiB | 2.34 MiB/s, done.
Total 14 (delta 3), reused 14 (delta 3)
remote: Resolving deltas: 100% (3/3), done.
remote: This repository moved. Please use the new location:
remote:   git@github.com:RandomSort/pg-lfs.git
To github.com:randomsort/pg-lfs
 + 9a3d24f...873439a master -> master (forced update)
A force push should not be done leisurely, and as mentioned earlier, we should use --force-with-lease, but that does not work in this case as we have no common history. After pushing to the remote, we clone to a separate location to see if we saved any space.
/pg-lfs$ cd ..
$ git clone git@github.com:randomsort/pg-lfs lfs2
Cloning into 'lfs2'...
remote: Enumerating objects: 10, done.
remote: Counting objects: 100% (10/10), done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 14 (delta 3), reused 10 (delta 3), pack-reused 4
Receiving objects: 100% (14/14), done.
Resolving deltas: 100% (3/3), done.
$ cd lfs2
/lfs2$ du -s -h .git
48K     .git
/lfs2$ ls
LICENSE  README.md  alpine-rpi-3.12.0-armv7.tar.gz

Even though our workspace looks the same, our Git repository is only a fraction of the size. 48K compared to 82M is a difference that we cannot fathom without experiencing it. This will improve developer quality of life and speed up automation.
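Measuring with `du` works, but Git can also report its own object-store footprint portably with `git count-objects`. A minimal sketch, with an invented throwaway repository:

```shell
# Sketch: report the size of the packed object store -- the bulk of
# what a clone transfers -- with git count-objects. Demo repo only.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q size-demo && cd size-demo
git config user.email you@example.com
git config user.name "You"
head -c 262144 /dev/zero > asset.bin   # 256 KiB placeholder binary
git add . && git commit -qm "add asset"
git gc -q                              # pack loose objects first

# size-pack is the on-disk size of the packfiles; -H prints it
# human-readable.
git count-objects -vH
```

Unlike `du`, this excludes working-tree checkouts, hooks, and other .git housekeeping, so it is a fairer before/after comparison when judging a cleanup like the LFS migration above.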

Remember to delete your fork so you don’t take up unnecessary resources at GitHub :).

This exercise showed how easy it is to move oversized assets out of your repository history when their size starts to hurt.

Git Katas

To support the learning goals of this chapter, complete the following katas:
  • Bisect

  • Submodules

Summary

In this chapter, we learned how to manage dependencies using submodules, to find bad changesets efficiently using Git bisect, and finally to remove problematic assets from our repositories with Git LFS.

I sincerely hope that none of these will be useful for you on a day-to-day basis, as they represent corner cases. But now you are aware should the need arise for one of these specialized Git features.
