Chapter 22

Outsourcing Long Histories

Git repositories tend to grow large over time, despite Git's efficient storage of objects. This effect is usually negligible if source files are the only items versioned in the repository: the size of such a repository is small compared to current hard disks and network bandwidth.

However, if large binary files (libraries, release artifacts, test databases, images) are also versioned, the repository size can indeed attract negative attention.

Compared to a centralized version control system, a distributed version control system tends to consume more resources. When you clone a repository, every historical version of every file is copied as well.

This workflow describes how you can outsource the history of a Git repository so that

  • the new project repository uses fewer resources and
  • it is still possible to search the old commits using the log, blame and annotate commands.

Overview

This workflow has three pillars:

grafts file: With the grafts file, the parents of a commit can be overridden or removed in the local repository.

filter-branch command: The filter-branch command can copy all commits of a repository and modify them in the process. This makes the parent relationships changed by the grafts file permanent.

alternates file: With the alternates file, the objects (and thus the commits) of another repository can be made available in the local repository.

The top section of Figure 22.1 outlines our project repository. There are three commits: A, B and C. The history before C (commits A and B) is to be removed.

First, with the help of the grafts file, we remove the parents of commit C. Then the filter-branch command is used to create a new project repository that contains only the modified commit C. The outsourcing is now complete: development continues only in the new project repository, and the previous project repository serves only as an archive.

To perform searches in the whole history, the archive repository is linked from the new project repository using the alternates file. With the grafts file, commit C is assigned the historically correct parent (see the bottom section of Figure 22.1).

Requirements

Coordinated break: All team members must agree on a simultaneous break from the repository, and then continue working on the new clone.

History is rarely needed: If the historical information is needed very often and by many developers, then it makes more sense to accept the larger resource consumption.

Commit hashes do not matter: Git can use commit hashes to detect unauthorized changes to old versions. However, this workflow breaks the history and creates new commits with new hashes, so it is only suitable if the old hashes do not need to remain valid.

Workflow “Outsourcing Long Histories”

This workflow aims to reduce the size of a repository that contains a very long commit history with many large files. The older commits are outsourced to a separate repository. Searching the history is still possible.

Figure 22.1: Workflow overview

Process and Implementation

Outsourcing the History

This procedure describes in detail how the history of a repository can be outsourced. More specifically, a new repository is created which only contains the truncated history.

Attention! Clones of the old repository will not work with the new repository. Therefore, all development changes must be merged into the central repository before the next steps, and all developers must be informed that no further changes may be made in the old clones.

The starting point is shown in the top section of Figure 22.1. The example is based on a bare repository with three commits on the master branch. A new project repository containing only commit C must be created.

For the following steps we need the complete hashes of commits C and B. They can be determined with the following log command:

> cd project.git 

> git log --pretty=oneline 

166a7e047a85b318720dc6e857a5321f9a3df7b4 C
dcbddd5cd590de3d30e1ecca1882c9187e7eab95 B 
577b8e2cf613c43ed969453477fadc189482c1fb A 

The --pretty=oneline parameter specifies that each commit is printed on a single line. Unlike the --oneline parameter, it prints the full hash value.

Step 1: Create a grafts tag

This step is a preparation for integrating the future archive repository. To integrate the archive, you will need to know the hash of the last historical predecessor commit. In our case, this is commit B. An elegant solution to permanently store this information is to create a tag (grafts/master) at the new “first” commit (commit C) and store the hash in the tag comment. This tag will be present in the future project repository. It makes no sense to create a tag at commit B, as this will not be included in the new project repository.

The tag command is passed the hash of commit C as the commit to tag, and the hash of commit B is stored in the tag message.

> git tag -a grafts/master \
       166a7e047a85b318720dc6e857a5321f9a3df7b4 \
       -m "Predecessor: dcbddd5cd590de3d30e1ecca1882c9187e7eab95"

Here, grafts/master is the name of the new tag. In Git, it is possible to create a tag and branch name hierarchy by using the / symbol.

If you have multiple branches, then you have to perform the above step for all the branches. That is, for each branch you have to decide where the history should cease and create a tag named grafts/<branch-name> with the previous information.
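For several branches, the tagging step can be wrapped in a small helper. The following is only a sketch under stated assumptions: the function name tag_cut_point and its arguments are hypothetical (they are not part of Git), and the chosen new "first" commit is assumed to have exactly one predecessor. Instead of copying the hashes by hand, the helper derives them with rev-parse.

```shell
# Hypothetical helper (not part of Git): tag the new "first" commit of a
# branch and record the hash of its predecessor in the tag message.
# Usage: tag_cut_point <repository> <branch-name> <new-first-commit>
tag_cut_point() {
    repo=$1
    branch=$2
    new_first=$3

    # Full hash of the commit at which the new history will start.
    first_hash=$(git -C "$repo" rev-parse --verify "$new_first^{commit}") || return 1
    # Full hash of its first parent, the last commit that stays in the archive.
    predecessor=$(git -C "$repo" rev-parse --verify "$new_first^") || return 1

    git -C "$repo" tag -a "grafts/$branch" "$first_hash" \
        -m "Predecessor: $predecessor"
}
```

In the example of this chapter, the call would be tag_cut_point project.git master 166a7e047a85b318720dc6e857a5321f9a3df7b4, repeated once per branch with the respective branch name and cut point.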

Step 2: Create a clone

The following steps change the repository contents permanently. Since we want to continue using the existing repository as an archive, we need to create a clone. The clone is again a bare repository, because it will later receive pushes.

> cd .. 

> git clone --bare project.git temp-project.git 

Step 3: Alter the history with the grafts file

Now you can prune the history in the cloned repository. For this, you need to create and edit the info/grafts file. The file has a simple format: each line redefines the predecessor relationship of one commit. A line consists of the hash of the commit to manipulate, followed by a space and the hash of the new predecessor. If the second value is omitted, the commit will have no predecessor.

In our example, commit C will have no predecessors. The following command creates a new grafts file and writes the hash of commit C into it:

> cd temp-project.git 

> echo 166a7e047a85b318720dc6e857a5321f9a3df7b4 >info/grafts 

If you are editing multiple branches, you need to add one line for every branch (appending with >> instead of overwriting with >).

To check if the manipulation worked, you can use the log command. In our example, only commit C should appear.

> git log --pretty=oneline

166a7e047a85b318720dc6e857a5321f9a3df7b4 C 
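Recent Git versions mark the info/grafts file as deprecated and recommend replacement objects instead. As a hedged alternative to the echo command in this step, the same cut can be sketched with git replace --graft. The helper name cut_history_at is hypothetical. According to the filter-branch documentation, both grafts and replacement references are made permanent by a subsequent filter-branch run, but you should verify this with the Git version you use.

```shell
# Hypothetical helper: make <commit> look like a commit without predecessors
# by creating a replacement object (alternative to editing info/grafts).
# Usage: cut_history_at <repository> <commit>
cut_history_at() {
    repo=$1
    commit=$2
    # With no parent arguments, --graft creates a parentless replacement.
    git -C "$repo" replace --graft "$commit"
}
```

After cut_history_at temp-project.git 166a7e047a85b318720dc6e857a5321f9a3df7b4, the log command in the cloned repository should again show only commit C, while git --no-replace-objects log still shows the original history.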

Step 4: Change the repository permanently

After the repository has been adjusted by using the grafts file, you can now create a permanent new commit history with the filter-branch command. This command takes all commits in the specified branches and creates new commits according to the specified filters. In this particular case, you do not need a filter that changes the content, as the only objective is to rewrite the commit history according to the grafts file.

Only the --tag-name-filter parameter is used, to bind the existing tags to the new commits.

> git filter-branch --tag-name-filter cat -- --all

Rewrite 166a7e047a85b318720dc6e857a5321f9a3df7b4 (2/2) 
Ref 'refs/heads/master' was rewritten 
Ref 'refs/tags/grafts/master' was rewritten 
WARNING: Ref 'refs/tags/release-1' is unchanged 
Ref 'refs/tags/release-2' was rewritten 
grafts/master -> grafts/master 
        (166a7e047a85b318720dc6e857a5321f9a3df7b4 
   -> 259ee224ac1f2d73898ec2ed25ad4dccd3c40f70) 
release-1 -> release-1 
        (577b8e2cf613c43ed969453477fadc189482c1fb 
   -> 577b8e2cf613c43ed969453477fadc189482c1fb) 
release-2 -> release-2 
        (166a7e047a85b318720dc6e857a5321f9a3df7b4 
   -> 259ee224ac1f2d73898ec2ed25ad4dccd3c40f70) 

The parameters used are as follows.

--tag-name-filter cat: All tags are newly created and point to the new commits.

-- --all: All references (branches and tags) in the repository are filtered. The separating -- keeps the revision options apart from the filter-branch options.

You can see in the output of the filter-branch command that commit C, which had the hash 166a7, was copied and that the copy has been assigned the new hash 259ee.

In the output you can also see a warning. There is a release-1 tag whose commit no longer exists in the new history. In our example, the release-1 tag points to commit A. However, after the changes, this commit is no longer part of the history.

These tags must be manually deleted, because otherwise they will ultimately prevent Git from deleting the corresponding old commits.

> git tag -d release-1 
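In a repository with many tags, it can be tedious to pick the affected tags out of the filter-branch output. The following sketch lists every tag whose commit is not contained in any branch. The helper name list_orphaned_tags is hypothetical, and the sketch assumes that all tags point to commits (a tag on any other object type would also be listed).

```shell
# Hypothetical helper: print every tag whose commit is not contained in any
# branch of the repository, i.e. the candidates for "git tag -d".
# Usage: list_orphaned_tags <repository>
list_orphaned_tags() {
    repo=$1
    git -C "$repo" for-each-ref --format='%(refname)' refs/tags |
    while read -r ref; do
        # "git branch --contains" prints nothing if no branch reaches the tag.
        if [ -z "$(git -C "$repo" branch --contains "$ref")" ]; then
            printf '%s\n' "${ref#refs/tags/}"
        fi
    done
}
```

The output is meant to be reviewed by hand before any tag is deleted; in the example of this chapter, only release-1 should be listed.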

Step 5: Reduce the repository

At this stage, the repository is completely converted to the new history. However, the filter-branch command does not delete the old commits, as they are still referenced by other names (for example, by the backup references that filter-branch stores under refs/original). That is why the new repository is not smaller than the original.

By cloning once more, however, you can create a new repository that only contains the new history. After that, you can delete the temporary repository.

> git clone --bare temp-project.git new-project.git 

> rm -rf temp-project.git 

The new repository can be compressed by using the gc command. This command performs various cleanups in the repository. Among other things, it compresses new files and deletes unreferenced objects that can no longer be reached.

> cd new-project.git 

> git gc --prune 

The --prune option indicates that all file versions that are no longer needed must be removed.
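By default, git gc keeps unreachable objects for a grace period (two weeks, unless gc.pruneExpire is configured differently), so a freshly rewritten repository may not shrink immediately. As a sketch, and assuming that nobody else is writing to the repository at that moment, the expiry can be set explicitly and the effect checked with count-objects. The helper name shrink_repo is hypothetical.

```shell
# Hypothetical helper: remove everything that is no longer reachable and
# report the resulting repository size.
# Usage: shrink_repo <repository>
shrink_repo() {
    repo=$1
    # Drop reflog entries that could still reference old commits.
    git -C "$repo" reflog expire --expire=now --all
    # Repack and delete unreachable objects regardless of their age.
    git -C "$repo" gc --quiet --prune=now
    # Report the remaining object counts and the pack size (in KiB).
    git -C "$repo" count-objects -v
}
```

Running the helper once on the archive clone and once on new-project.git gives a simple before-and-after comparison of the size-pack values.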

You can now notify all developers that a new repository is available and can be cloned.

Linking the Archive Repository

If you want to access historical information, the current repository must be linked with the archive repository. This is only a local link in the developer repository and can be activated individually by each developer.

For the following procedure, we assume that a developer already has his or her own clone (the new-project directory) of the new repository. In this repository, there is already a new commit D (see the bottom section of Figure 22.1).

Step 1: Clone the archive repository

To access the historical information, you will need a clone of the archive repository. Since there is no development in the archive repository, a bare clone is good enough.

> git clone --bare project.git archive-project.git 

Step 2: Link the archive repository

The commits in the archive repository must be made available in the developer repository.

In order for a repository to access the commits of some other repository, “alternate” paths can be specified in the .git/objects/info/alternates file. Each line in this file specifies the absolute path to the objects directory of another repository.

Note that you must specify the actual path to the objects directory. The path to the project’s root directory is not enough.

Use the echo command to add a new line to the alternates file.

> cd new-project

> echo /gitrepos/archive-project.git/objects >> .git/objects/info/alternates
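Because a relative path or a path to the repository root is an easy mistake here, the linking step can be sketched as a small helper that resolves the absolute objects path itself. The function name link_archive is hypothetical, and the sketch assumes a non-bare developer clone and a bare archive repository, as in this chapter.

```shell
# Hypothetical helper: make the objects of a bare archive repository
# available in a developer clone via objects/info/alternates.
# Usage: link_archive <developer-clone> <bare-archive-repository>
link_archive() {
    repo=$1
    archive=$2
    # Resolve the absolute path of the archive's objects directory.
    objects=$(cd "$archive/objects" && pwd) || return 1
    # Append (>>) so that alternates that already exist are kept.
    mkdir -p "$repo/.git/objects/info"
    printf '%s\n' "$objects" >> "$repo/.git/objects/info/alternates"
}
```

A quick check after link_archive new-project archive-project.git is git cat-file -t with the hash of commit B, which should print commit once the link works.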

Step 3: Connect the histories

Finally, using the already familiar .git/info/grafts file, commit C’ must be linked to commit B in the archive repository.

In this case, the prepared grafts tag will be very helpful as it contains all the necessary information (see Step 1 in the previous sequence).

> git show grafts/master --pretty=oneline 

tag grafts/master 
Predecessor: dcbddd5cd590de3d30e1ecca1882c9187e7eab95 
259ee224ac1f2d73898ec2ed25ad4dccd3c40f70 C 
diff --git a/foo.txt b/foo.txt 
.. 

You can see two commit hashes in the output. The first, dcbdd, corresponds to the historically correct predecessor, commit B. The second hash, 259ee, corresponds to the new commit C’ in the current repository.

In the grafts file the hashes must be specified in reverse order. First comes commit C’, followed by a space and the new predecessor, commit B.

> echo 259ee224ac1f2d73898ec2ed25ad4dccd3c40f70 \
       dcbddd5cd590de3d30e1ecca1882c9187e7eab95 \
       > .git/info/grafts

To verify that the histories are now connected, use the log command. The output must now also contain commits A and B.

> git log --pretty=oneline 

da8ba94d6bd9ec293f22a558756a91927f8b3525 D
259ee224ac1f2d73898ec2ed25ad4dccd3c40f70 C
dcbddd5cd590de3d30e1ecca1882c9187e7eab95 B
577b8e2cf613c43ed969453477fadc189482c1fb A
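The same connection can also be sketched without copying hashes by hand. The helper below reads the predecessor hash from the message of the grafts tag and, as a swapped-in technique, uses git replace --graft instead of the grafts file, since recent Git versions mark info/grafts as deprecated. The function name reconnect_history is hypothetical, and the sketch assumes that the tag was created as an annotated tag in the "Predecessor: <hash>" format shown earlier and that the archive is already linked.

```shell
# Hypothetical helper: connect the first commit of the new history with its
# archived predecessor, using the information stored in grafts/<branch>.
# Usage: reconnect_history <developer-clone> [<branch-name>]
reconnect_history() {
    repo=$1
    branch=${2:-master}
    tag="refs/tags/grafts/$branch"
    # The tagged commit is the first commit of the new history (C').
    new_first=$(git -C "$repo" rev-parse --verify "$tag^{commit}") || return 1
    # The tag message holds the hash of the archived predecessor (B).
    predecessor=$(git -C "$repo" cat-file tag "$tag" |
        sed -n 's/^Predecessor: //p')
    [ -n "$predecessor" ] || return 1
    git -C "$repo" replace --graft "$new_first" "$predecessor"
}
```

The replacement is stored as a local reference under refs/replace, so like the grafts file it only affects the developer's own clone; git replace -d with the hash of C’ removes it again.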

All historical information is now available in the current development repository.

Why Not the Alternatives?

Why Not Fetch the Archive Repository?

The workflow described uses the objects/info/alternates file to link the commits of the archive into a repository. An alternative is to use the normal fetch command to import these commits. Using the grafts file to create the parent relationship would work either way.

However, the workflow assumes that access to the history is only required rarely and temporarily. In this case, the solution with the alternates file is more useful, because the developer's own repository does not grow by the additional commits.
