The command git filter-branch is a generic branch processing command that allows you to arbitrarily rewrite the commits of a branch using custom commands that operate on different objects within the repository. Some filters work on commits, some work on tree objects and directory structures, and others let you manipulate the environment in which the commits are rewritten.
Does that sound useful and yet dangerous?
Good.
As you might suspect, with great power comes great responsibility.[40] The power and purpose of filter-branch is also the source of my warning: it has the potential to rewrite the entire repository’s commit history. Executing this command on a repository that has already been published for others to clone and use will likely cause them endless grief later. As with all rebasing operations, commit history will change. After this command, you should consider any repositories cloned from it earlier as obsolete.
With that warning about rewriting repository history behind us, let’s find out what the command can do, when and why it might be useful, and how to use it responsibly.
The filter-branch command runs a series of filters on one or more branches within your repository. Each filter can have its own custom filtering command. You don’t have to run them all, or even more than one. But they are designed and sequenced so that earlier filters can affect the behavior of later filters. The subdirectory-filter runs as a precommit-processing selection filter, and the tag-name-filter runs as a postcommit-processing step.
To help you get a clearer picture of what is happening during the filtering process, it might help to know that as of version 1.7.9, git filter-branch is a shell script.[41] Except for the commit-filter, each command is evaluated in a shell (sh) context using eval.
Here is a brief description of each filter and the order in which they run:
--env-filter command
The env-filter can be used to create or alter the shell environment settings prior to running the subsequent filters and committing the newly rewritten objects. Of note, changing variables such as GIT_AUTHOR_NAME, GIT_AUTHOR_EMAIL, GIT_COMMITTER_NAME, and GIT_COMMITTER_EMAIL may be useful. The command should likely both set and export environment variables.
--tree-filter command
The tree-filter allows you to modify the contents of a directory that will be captured by a tree object. You can use this filter to remove files from or add files to the repository retroactively. This filter checks out the branch at each commit during the filtering. Be aware that the .gitignore file is not effective during this filter.
--index-filter command
The index-filter is used to alter the contents of the index prior to making a commit. Throughout the filtering process, the index of each commit is made available without checking out the corresponding files into a working directory. Thus, this filter is similar to the tree-filter but faster if you don’t actually need the file contents during the filter. You should study the low-level git update-index command.
--parent-filter command
The parent-filter allows you to restructure the parent relationship of every commit. For a given commit, you specify its new parent or parents. To use this properly, you should study the low-level git commit-tree command.
--msg-filter command
Just prior to actually making a newly filtered commit, the msg-filter allows you to edit the commit message. The command should accept the old message on stdin and write the new message on stdout.
--commit-filter command
Normally during the filtering pipeline, git commit-tree will be used to perform the commit. However, this filter gives you control over this step yourself. Your command will be called with the new (possibly rewritten) tree-obj and a list of (possibly rewritten) -p parent-obj parameters. The (possibly rewritten) commit message will be on stdin. You should likely still use git commit-tree, but there are also a few convenience functions provided in the environment: map, skip_commit, git_commit_non_empty_tree, and die. The git filter-branch manual page has details for each of these functions.
--tag-name-filter command
If your repository has any tags, you should probably use tag-name-filter to rewrite existing tags to reference the newly created corresponding commits. By default, the old tags will remain, but you can use cat as the filter to obtain direct new-for-old mappings of your tags. Although simply mapping tags to reference the new, corresponding commits is certainly possible, maintaining a signed tag is not. Remember that the whole point of signing a tag was to maintain a cryptographically secure indicator of the repository at a certain point in its history. That just went out the window here, right? So all those signatures on signed tags will be removed from the corresponding new tags.
--subdirectory-filter directory
The subdirectory-filter can be used to limit the rewriting of history to only those commits that affect a specific directory. That is, after filtering, the new repository will contain only the named directory at its root.
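As a minimal sketch of the env-filter described in the list above, the following script builds a throwaway repository and rewrites an author's email address. The names, the addresses (old@example.com, new@example.com), and the temporary path are hypothetical; FILTER_BRANCH_SQUELCH_WARNING is set only to suppress the advisory notice that newer Git releases print before filtering.

```shell
#!/bin/sh
# Throwaway repository; every name and address here is made up for illustration.
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" symbolic-ref HEAD refs/heads/master
echo notes > "$repo/Readme"
git -C "$repo" add Readme
GIT_AUTHOR_NAME='A U Thor' GIT_AUTHOR_EMAIL='old@example.com' \
GIT_COMMITTER_NAME='A U Thor' GIT_COMMITTER_EMAIL='old@example.com' \
    git -C "$repo" commit -q -m 'Collect some notes.'

# The env-filter is evaluated by the shell once per commit, so it can
# test, change, and re-export the identity variables before the rewrite.
FILTER_BRANCH_SQUELCH_WARNING=1 git -C "$repo" filter-branch --env-filter '
    if [ "$GIT_AUTHOR_EMAIL" = "old@example.com" ]; then
        GIT_AUTHOR_EMAIL="new@example.com"
        export GIT_AUTHOR_EMAIL
    fi
' master >/dev/null

# Read back the author email recorded in the rewritten commit.
new_email=$(git -C "$repo" log -1 --format=%ae master)
echo "$new_email"
```

Only the author field is touched in this sketch; the committer email would need the same treatment with GIT_COMMITTER_EMAIL.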
After a git filter-branch completes, the original references comprising the entire old commit history are available as new refs in refs/original. Naturally, this implies that the refs/original directory must be empty at the start of the filtering operation. After verifying that you obtained the filtered history you desired, and the original commit history is no longer needed, carefully remove the .git/refs/original refs. (Or, if you want to be fully Git compliant and Git friendly, you can even use the command git update-ref -d refs/original/refs/heads/branch for each branch you filtered.) If you do not remove this directory, you will continue to have the entirety of both the old and new content within your repository. The old refs will linger and prevent garbage collection (see Garbage Collection) from trimming out the otherwise obsolete commits.[42] If you don’t want to explicitly remove this directory, you can also clone away from it. That is, make a clone of the repository, leaving these original refs behind and not cloning them into a new repository. Think of it as a natural checkpoint backup.
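The cleanup just described can be sketched with plumbing commands. This example is only an illustration under assumed names: it creates a small temporary repository, rewrites it once, lists the backup refs with git for-each-ref, and deletes each one with git update-ref -d.

```shell
#!/bin/sh
# Throwaway repository with one rewritten branch (paths are illustrative).
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" symbolic-ref HEAD refs/heads/master
echo secret > "$repo/1984"
echo notes > "$repo/Readme"
git -C "$repo" add 1984 Readme
git -C "$repo" -c user.name='A U Thor' -c user.email='author@example.com' \
    commit -q -m 'Read a few classics.'
FILTER_BRANCH_SQUELCH_WARNING=1 git -C "$repo" filter-branch \
    --index-filter 'git rm -q --cached --ignore-unmatch 1984' master >/dev/null

# The backup of the old history lives under refs/original/.
before=$(git -C "$repo" for-each-ref --format='%(refname)' refs/original/)
echo "backup refs: $before"

# Delete each backup ref through Git instead of removing files by hand.
git -C "$repo" for-each-ref --format='%(refname)' refs/original/ |
while read -r ref; do
    git -C "$repo" update-ref -d "$ref"
done

# Nothing should be listed under refs/original/ any more.
after=$(git -C "$repo" for-each-ref --format='%(refname)' refs/original/)
```

Deleting the refs only makes the old commits eligible for collection; reflog entries may still keep them reachable until those entries expire.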
There are several reasons that best practices with git filter-branch suggest you should always operate on a newly cloned repository. For starters, git filter-branch flat-out requires that the operation begin with a clean working directory. Because git filter-branch modifies your original repository in place, it is often described as being a “destructive” operation. Because the command has many steps, options, and subtleties, running the command can be quite tricky and often difficult to get right on the first attempt. Saving the original repository is just prudent computing.
Now that we know what git filter-branch can do, let’s look at a few cases where it can be used productively. One of the most useful situations occurs when you have just created a repository full of commit history and want to clean it up or do a large-scale alteration on it prior to making it available for cloning and general use by others.
A common use for git filter-branch is to completely remove a file from the entire history of a repository. Remember, Git maintains the complete history of every file within the repository. Thus, simply deleting a file with git rm will not remove it from older history. One can always go back to earlier commits and retrieve the file.
However, by using git filter-branch, a file can be removed from any and every commit in the repository, making it appear as if it was never there in the first place.
Let’s work on an example repository that contains personal notes after reading various books. The notes are stored in files named after the works.
$ cd BookNotes
$ ls
1984 Animal_Farm Nightfall Readme Snow_Crash
$ git log --pretty=oneline --abbrev-commit
ffd358c Read Asimov's 'Nightfall'.
4df8f74 Read a few classics.
8d3f5a9 Read 'Snow Crash'
3ed7354 Collect some notes about books.
And the classics from the third commit 4df8f74 are:
$ git show 4df8f74
commit 4df8f74b786b31b6043c44df59d7d13ee2b4b298
Author: Jon Loeliger <[email protected]>
Date: Sat Jan 14 12:57:35 2012 -0600
Read a few classics.
- Animal Farm by George Orwell
- 1984 by George Orwell
diff --git a/1984 b/1984
new file mode 100644
index 0000000..84a2da2
--- /dev/null
+++ b/1984
@@ -0,0 +1 @@
+George Orwell is disturbed.
diff --git a/Animal_Farm b/Animal_Farm
new file mode 100644
index 0000000..e1fcda1
--- /dev/null
+++ b/Animal_Farm
@@ -0,0 +1 @@
+Animal Farm was interesting.
Suppose for some history-revising reason we have decided to remove any record of George Orwell’s 1984 from the repository. If you don’t care about the old commit history, simply issuing a git rm 1984 would suffice. But to be thoroughly Orwellian, it must be removed from the complete history of the repository. It must never have existed.
Of all the filters listed previously, the likeliest candidates for this operation are the tree-filter and index-filter. Because this is a small repository and the operation we want to do, namely, remove one file, is pretty simple and direct, we’ll use the tree-filter.
As advised earlier, start with a clean clone, just in case.
$ cd ..
$ git clone BookNotes BookNotes.revised
Cloning into 'BookNotes.revised'...
done.
$ cd BookNotes.revised
$ git filter-branch --tree-filter 'rm 1984' master
Rewrite 3ed7354c2c8ae2678122512b26d591a9ed61663e (1/4)
rm: cannot remove `1984': No such file or directory
tree filter failed: rm 1984
$ ls
1984 Animal_Farm Nightfall Readme Snow_Crash
Clearly that didn’t go well and something failed. The file is still in the repository.
Let’s think a little about what Git is doing here. Git will iterate over each commit in the master branch, starting with the very first commit, establish the context (index, files, directories, etc.) of that commit, and then try to remove the file 1984.
Git tells you which commit it was modifying when the command failed. Commit 3ed7354 is the first of 4 commits.
Rewrite 3ed7354c2c8ae2678122512b26d591a9ed61663e (1/4)
But recall that the file 1984 was introduced in the third commit, 4df8f74, and not the first. That means that for the first two commits, 3ed7354 and 8d3f5a9, the 1984 file was not yet in the repository or any of its managed files. That in turn means that when establishing the filtering context of those first two commits, a simple rm 1984 shell command within the top-level directory will fail for lack of a file to remove. It’s exactly as if you had typed rm snizzle-frotz in a directory with no snizzle-frotz file in it.
$ cd /tmp
$ rm snizzle-frotz
rm: cannot remove `snizzle-frotz': No such file or directory
The trick is to realize that when removing a file, you don’t care whether the file is actually present or not. So just force the removal and ignore nonexistent files using the -f or --force option:
$ cd /tmp
$ rm -f snizzle-frotz
$
OK, back to the BookNotes.revised repository:
$ cd BookNotes.revised
$ git filter-branch --tree-filter 'rm -f 1984' master
Rewrite ffd358c675a1c6d36114e10a92d93fdc1ee84629 (4/4)
Ref 'refs/heads/master' was rewritten
As a side note, Git really scrolls through all the commits, stating which one it is presently rewriting, but only the last one shows up on your screen, as just shown. If you are a bit more clever, perhaps by piping that output through less, you can see that it actually prints each commit processed:
Rewrite 3ed7354c2c8ae2678122512b26d591a9ed61663e (1/4)
Rewrite 8d3f5a96b18f9795a1bb41295e5a9d2d4eb414b4 (2/4)
Rewrite 4df8f74b786b31b6043c44df59d7d13ee2b4b298 (3/4)
Rewrite ffd358c675a1c6d36114e10a92d93fdc1ee84629 (4/4)
But it worked this time:
$ ls
Animal_Farm Nightfall Readme Snow_Crash
The 1984 file is now gone!
For the terminally curious, the corresponding command using index-filter would be something like this:
$ git filter-branch --index-filter \
    'git rm --cached --ignore-unmatch 1984' master
Let’s look at the new commit log:
$ git log --pretty=oneline --abbrev-commit
ad1000b Read Asimov's 'Nightfall'.
7298fc5 Read a few classics.
8d3f5a9 Read 'Snow Crash'
3ed7354 Collect some notes about books.
Notice how each commit starting with the original third commit (4df8f74 and ffd358c) now has a different SHA1 value (7298fc5 and ad1000b), whereas the earlier commits (3ed7354 and 8d3f5a9) remain unchanged.
During the filtering and rewriting process, Git creates and maintains this mapping between old and new commit values and makes it available to you as the map convenience function. If for some reason you need to convert from an old commit SHA1 to the corresponding new SHA1, you can do so using this mapping from within your filter command.
Let’s investigate a bit more, though.
$ git show 7298fc5
commit 7298fc55d1496c7e70909f3ebce238d447d07951
Author: Jon Loeliger <[email protected]>
Date: Sat Jan 14 12:57:35 2012 -0600
Read a few classics.
- Animal Farm by George Orwell
- 1984 by George Orwell
diff --git a/Animal_Farm b/Animal_Farm
new file mode 100644
index 0000000..e1fcda1
--- /dev/null
+++ b/Animal_Farm
@@ -0,0 +1 @@
+Animal Farm was interesting.
Indeed the commit that first introduced 1984 no longer does so! That means the file was never introduced in the first place. It is not just gone from the top commit; it is not just gone from any commit reachable from the master branch; it never existed on this branch.
But doesn’t it bother you that the commit message itself still mentions the 1984 book? Let’s fix that in the next section!
Here’s the problem we’re solving: some commit message needs to be revised. In the previous section, we saw how to remove a file from the complete history of a repository. However, the commit message that used to introduce it still alludes to it:
$ git log -1 7298fc55
commit 7298fc55d1496c7e70909f3ebce238d447d07951
Author: Jon Loeliger <[email protected]>
Date: Sat Jan 14 12:57:35 2012 -0600
Read a few classics.
- Animal Farm by George Orwell
- 1984 by George Orwell
That last line has to go!
This is the perfect use case for the --msg-filter filter. Your filter command should accept the old text of a commit message on stdin and write its revised text on stdout. That is, your filter should be a classic stdin-to-stdout edit filter. Typically, it will be something like sed, although it can be as complex as needed.
In our case, we’ll want to delete that last 1984 line. We’ll also want to touch up the previous sentence to just talk about one book rather than “a few.” A sed command to do these edits looks like this:
sed -e "/1984/d" -e "s/few classics/classic/"
Put that together with the --msg-filter option. Be careful with your line breaks on input here. It should be all one line, or use the single quote as a command input continuation technique.
$ git filter-branch --msg-filter '
sed -e "/1984/d" -e "s/few classics/classic/"' master
Rewrite ad1000b936acf7dbe4a29da6706cb759efded1ae (4/4)
Ref 'refs/heads/master' was rewritten
Let’s check:
$ git log --pretty=oneline --abbrev-commit
bf7351c Read Asimov's 'Nightfall'.
f28e55d Read a classic.
8d3f5a9 Read 'Snow Crash'
3ed7354 Collect some notes about books.
We can already see that the log message from commit f28e55d has been singularized by our sed script. Good. Looking again at the whole message:
$ git log -1 f28e55d
commit f28e55dc8bbdee555a3f7778ba8355db9ab4c4a1
Author: Jon Loeliger <[email protected]>
Date: Sat Jan 14 12:57:35 2012 -0600
Read a classic.
- Animal Farm by George Orwell
Now it is truly as if it never existed in this repository! And we’ve always been at war with Eastasia.
One cautionary note about the filtering process: make sure that you are both operating on the items you want to change, and that you are operating on only those items!
For example, the sed command from the previous --msg-filter example appears to change precisely the one commit message we wanted to adjust. However, be aware that the same sed script is applied to every commit message in the history. If there were other, perhaps incidental occurrences of the string 1984 in other commit messages, they would also have been deleted because our script was not very discriminating. Consequently, you may have to write a more detailed sed command or a more clever script.
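To make that caution concrete, here is a small sketch that runs two sed filters on sample messages outside of Git. The function names loose and strict, and the second commit message, are invented for this illustration; the strict patterns are just one possible way to anchor the match to the whole line.

```shell
#!/bin/sh
# A sample commit message standing in for what --msg-filter receives on stdin.
msg='Read a few classics.

- Animal Farm by George Orwell
- 1984 by George Orwell'

# An unrelated message that merely mentions the year (hypothetical).
other='Import notes written in 1984.'

# Loose filter from the text: deletes ANY line containing "1984".
loose() { sed -e '/1984/d' -e 's/few classics/classic/'; }

# Stricter filter: matches only whole lines that are exactly the targets.
strict() {
    sed -e '/^- 1984 by George Orwell$/d' \
        -e 's/^Read a few classics\.$/Read a classic./'
}

loose_other=$(printf '%s\n' "$other" | loose)
strict_other=$(printf '%s\n' "$other" | strict)
strict_msg=$(printf '%s\n' "$msg" | strict)

printf 'loose:  [%s]\nstrict: [%s]\n' "$loose_other" "$strict_other"
```

The loose filter wipes out the unrelated message entirely, while the strict one leaves it alone and still edits the intended message. Another option, not shown here, is to have the filter test the GIT_COMMIT variable that git filter-branch exports and edit only the one commit of interest.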
It is important to understand a brutal consequence of the name of this Git command: it is filter-branch. At its core, the git filter-branch command is designed to operate on just one branch or ref. However, it can operate on many branches or refs.
In many cases, you want to have it operate on all branches so as to obtain repository-wide coverage. In these cases, you will need the -- --all tacked onto the end of the command.
$ git filter-branch --index-filter \
    "git rm --cached -f --ignore-unmatch '*.jpeg'" \
    -- --all
Similarly, you almost certainly want to translate any tag refs from a prefiltered state into the new postfiltered repository. That means adding --tag-name-filter cat is also quite common:
$ git filter-branch --index-filter \
    "git rm --cached -f --ignore-unmatch '*.jpeg'" \
    --tag-name-filter cat \
    -- --all
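The following sketch exercises that same command on a throwaway repository so the effect on a second branch and on a tag can be inspected. The file names, the branch name topic, and the tag v1.0 are hypothetical, and the tag is a lightweight one to keep the example small.

```shell
#!/bin/sh
# Throwaway repository with two branches and one tag (all names made up).
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" symbolic-ref HEAD refs/heads/master
echo text > "$repo/notes.txt"
echo image > "$repo/photo.jpeg"
git -C "$repo" add notes.txt photo.jpeg
git -C "$repo" -c user.name='A U Thor' -c user.email='author@example.com' \
    commit -q -m 'Add notes and a photo.'
git -C "$repo" tag v1.0        # lightweight tag on the original commit
git -C "$repo" branch topic    # second branch, reached only because of --all

old_tag=$(git -C "$repo" rev-parse v1.0)

# Remove every *.jpeg from all refs and carry the tag name over unchanged.
FILTER_BRANCH_SQUELCH_WARNING=1 git -C "$repo" filter-branch \
    --index-filter "git rm -q --cached -f --ignore-unmatch '*.jpeg'" \
    --tag-name-filter cat \
    -- --all >/dev/null

# The tag and the extra branch should now name commits without the photo.
new_tag=$(git -C "$repo" rev-parse v1.0)
tag_files=$(git -C "$repo" ls-tree -r --name-only v1.0)
topic_files=$(git -C "$repo" ls-tree -r --name-only topic)
echo "v1.0: $old_tag -> $new_tag"
```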
How about this one? You used --tree-filter or --index-filter to remove a file from a repository, but did that file get moved or have its name changed at some point in its history? You can use a command like this to find out:
$ git log --name-only --follow --all -- file
If other names for that file exist, you might want to delete those versions as well.
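As a rough illustration of that check, the script below renames a file in a scratch repository and then collects every name the file has had. The long file name and the commit helper function are assumptions made for the example.

```shell
#!/bin/sh
# Scratch repository in which the notes file is renamed once.
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" symbolic-ref HEAD refs/heads/master
commit() {
    git -C "$repo" -c user.name='A U Thor' -c user.email='author@example.com' \
        commit -q -m "$1"
}
printf 'George Orwell is disturbed.\n' > "$repo/Nineteen_Eighty_Four"
git -C "$repo" add Nineteen_Eighty_Four
commit 'Add notes under the long title.'
git -C "$repo" mv Nineteen_Eighty_Four 1984
commit 'Rename the notes file.'

# --follow tracks the file across the rename; the empty format string
# suppresses commit headers so only path names (and blank lines) remain.
names=$(git -C "$repo" log --pretty=format: --name-only --follow --all -- 1984 |
        grep . | LC_ALL=C sort -u)
printf '%s\n' "$names"
```

Each name in the result is a candidate for removal by the tree-filter or index-filter.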
One day, I received this piece of email:
Jon,
I’m trying to figure out how to do a date-based check out from a Git repository into an empty working directory. Unfortunately, winding my way through the Git manual pages makes me feel like I’m playing “Adventure.”
Eric
Indeed. Let’s see if we can navigate some of those twisty passages.
It might seem that a command like git checkout master@{Jan 1, 2011} should work. However, that command is really using the reflog (see The Stash) to resolve the date-based reference for the master ref. There are lots of ways this innocent-looking construct might fail: your repository may not have the reflog enabled, you may not have manipulated the master ref during that time period, or the reflog may have already expired refs from that time period. Even more subtly, that construct may not give you your expected answer. It requests the reflog to resolve where your master was at the given time as you manipulated the branch, and not according to the branch’s commit time line. They may be related, especially if you developed and committed that history in this repository, but they don’t have to be.
Ultimately, this approach can be a misleading dead-end. Using the reflog might get what you want. But it can also fail, and it isn’t a reliable method.
Instead, you should use the git rev-list command. It is the general purpose workhorse whose job is to combine a multitude of options, sort through a complex commit history of many branches, intuit potentially vague user specifications, limit search spaces, and ultimately locate selected commits from within the repository history. It then emits one or more SHA1 IDs for use by other tools. Think of git rev-list and its myriad options as a commit database front-end query tool for your repository.
In this case, the goal is fairly simple: find the one commit in a repository that existed immediately before a given date on a given branch and then check it out.
Let’s use the actual Git source repository because it has a fairly extensive and explorable history. First, we’ll use rev-list to find that SHA1. The -n 1 option limits the output from the command to just one commit ID.
Here, we try to locate just the last master commit of 2011 from the Git source repository:
$ git clone git://github.com/gitster/git.git
Cloning into 'git'...
remote: Counting objects: 126850, done.
remote: Compressing objects: 100% (41033/41033), done.
remote: Total 126850 (delta 93115), reused 117003 (delta 84141)
Receiving objects: 100% (126850/126850), 27.56 MiB | 1.03 MiB/s, done.
Resolving deltas: 100% (93115/93115), done.
$ cd git
$ git rev-list -n 1 --before="Jan 1, 2012 00:00:00" master
0eddcbf1612ed044de586777b233caf8016c6e70
Having identified the commit, you may use it, tag it, reference it, or even check it out. But as the checkout note reminds you, you are on a detached HEAD.
$ git checkout 0eddcb
Note: checking out '0eddcb'.
You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.
If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:
git checkout -b new_branch_name
HEAD is now at 0eddcbf... Add MYMETA.json to perl/.gitignore
But is that really the right commit?
$ git log -1 --pretty=fuller
commit 0eddcbf1612ed044de586777b233caf8016c6e70
Author: Jack Nagel <[email protected]>
AuthorDate: Wed Dec 28 22:42:05 2011 -0600
Commit: Junio C Hamano <[email protected]>
CommitDate: Thu Dec 29 13:08:47 2011 -0800
Add MYMETA.json to perl/.gitignore
...
The rev-list date selection uses the CommitDate field, not the AuthorDate field. So it looks like the last commit of 2011 in the Git repository happened on December 29, 2011.
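The distinction between the two date fields can be reproduced in a scratch repository. In this sketch the identities and all four timestamps are invented; the second commit is authored in 2011 but committed in 2012, so a --before cutoff at the start of 2012 should skip it.

```shell
#!/bin/sh
# Scratch repository with hand-picked author and committer dates.
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" symbolic-ref HEAD refs/heads/master
export GIT_AUTHOR_NAME='A U Thor' GIT_AUTHOR_EMAIL='author@example.com'
export GIT_COMMITTER_NAME='C O Mitter' GIT_COMMITTER_EMAIL='committer@example.com'

# Commit 1: both authored and committed in 2011.
GIT_AUTHOR_DATE='2011-12-28 12:00:00 +0000' \
GIT_COMMITTER_DATE='2011-12-29 12:00:00 +0000' \
    git -C "$repo" commit -q --allow-empty -m 'first'
first=$(git -C "$repo" rev-parse HEAD)

# Commit 2: authored in 2011, but committed one hour into 2012.
GIT_AUTHOR_DATE='2011-12-31 12:00:00 +0000' \
GIT_COMMITTER_DATE='2012-01-01 01:00:00 +0000' \
    git -C "$repo" commit -q --allow-empty -m 'second'

# The cutoff is compared against the committer date, so commit 2 is skipped
# even though its author date falls before the cutoff.
found=$(git -C "$repo" rev-list -n 1 --before='2012-01-01 00:00:00 +0000' master)
git -C "$repo" log -1 --pretty=fuller "$found"
```

Spelling out the time and the time zone in the cutoff, as done here, leaves less for Git to guess.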
A few words of caution are in order, though. Git’s date handling is implemented using a function called approxidate(). Not that dates are inherently approximate, but rather that Git’s interpretation of what you meant is approximated, usually due to insufficient details or precision.
$ git rev-list -n 1 --before="Jan 1, 2012 00:00:00" master
0eddcbf1612ed044de586777b233caf8016c6e70
$ git rev-list -n 1 --before="Jan 1, 2012" master
5c951ef47bf2e34dbde58bda88d430937657d2aa
I typed those two commands at 11:05 A.M. local time. For lack of a specified time in the second command, Git assumed I meant “at this time on Jan 1, 2012.” Consequently, 11 more hours of leeway were available in which to match commits.
$ git log -1 --pretty=fuller 5c951ef
commit 5c951ef47bf2e34dbde58bda88d430937657d2aa
Author: Clemens Buchacher <[email protected]>
AuthorDate: Sat Dec 31 12:50:56 2011 +0100
Commit: Junio C Hamano <[email protected]>
CommitDate: Sun Jan 1 01:18:53 2012 -0800
Documentation: read-tree --prefix works with existing subtrees
...
This commit happened an hour and 18 minutes into the new year; well within the 11 hours past midnight that I accidentally specified in my second command.
One more caution about date-based checkout. Although you may get a valid answer to your query for a specific commit, that same question at some later date may yield a different answer. For example, consider a repository with several lines of development happening on different branches. As previously, when you request the commit --before date on a given branch, you get an answer for the branch as it exists just then. At some later point in time, however, new commits from other branches might be merged into your branch, altering the notion of which commit might satisfy your search conditions. In the previous January 1, 2012 example, someone might merge in a commit from another branch that is closer to midnight December 31, 2011 than December 29, 2011 at 13:08:47.
Sometimes in the course of software archeology, you simply want to retrieve an old version of a file from the repository history. It seems overkill to use the techniques of a date-based checkout as described in Date-Based Checkout because that causes a complete change in your working directory state for every directory and file just to get one file. In fact, it is even likely that you want to keep your current working directory state but replace the current version of just one file by reverting it to an earlier version.
The first step is to identify a commit that contains the desired version of the file. The direct approach is to use an explicit branch, tag, or ref already known to have the correct version. In the absence of that information, some searching has to be done. And when searching the commit history, you should think about using some rev-list techniques to identify commits that have the desired file. As previously seen, dates can be used to select interesting commits. Git also allows the search to be restricted to a particular file or set of files. Git calls this approach “path limiting.” It provides the ultimate guide to possible previous commits that might contain different versions of a file, or as Git calls them, paths.
Again, let’s explore Git’s source repository itself to see what previous versions of, say, date.c are available.
$ git clone git://github.com/gitster/git.git
Cloning into 'git'...
remote: Counting objects: 126850, done.
remote: Compressing objects: 100% (41033/41033), done.
remote: Total 126850 (delta 93115), reused 117003 (delta 84141)
Receiving objects: 100% (126850/126850), 27.56 MiB | 1.03 MiB/s, done.
Resolving deltas: 100% (93115/93115), done.
$ cd git
$ git rev-list master -- date.c
ee646eb48f9a7fc6c225facf2b7449a8a65ef8f2
f1e9c548ce45005521892af0299696204ece286b
...
89967023da94c0d874713284869e1924797d30bb
ecee9d9e793c7573cf3730fb9746527a0a7e94e7
Uh, yeah, something like 60-odd lines of SHA1 commit IDs. Fun! But what does it all mean? And how do you use it?
Because I didn’t specify the -n 1 option, all matching commit IDs have been generated and printed. The default is to emit them in reverse chronological order. So this means commit ee646e contains the most recent version of the file date.c, and ecee9d9 contains the oldest version. In fact, looking at commit ecee9d9 shows the file being introduced into the repository for the first time.
$ git show --stat ecee9d9 --pretty=short
commit ecee9d9e793c7573cf3730fb9746527a0a7e94e7
Author: Edgar Toernig <[email protected]>
[PATCH] Do date parsing by hand...
Makefile | 4 +-
cache.h | 3 +
commit-tree.c | 27 +--------
date.c | 184 +++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 191 insertions(+), 27 deletions(-)
Where you go from here to find your desired commit is kind of sketchy. You could do git log operations on randomly selected SHA1 values from that rev-list output. Or you could binary search the time stamps on commits from that list. Earlier we used the -n 1 option to select the most recent. It’s really hard to say what trick might work in your selection process to identify the precise commit that contains the version of a file that is interesting to you.
But once you have identified one of those commits, how do you use it? What does that version of date.c look like? What if we wanted to retrieve it in place?
There are three slightly different approaches you can use to get that version of a file. The first form directly checks out the named version and overwrites the existing version in your working directory.
$ git checkout ecee9d9 date.c
If you want to get the version of a file from a commit and you don’t know its SHA1, but you do happen to know some text from its commit log message, you can use this searching technique to obtain it:
$ git checkout :/"Fix PR-1705" main.c
The youngest commit found is used.
In two other very similar commands, Git accepts the form commit:path to name the desired file (i.e., path) as it existed at the time the commit happened, and writes the specified version of the file to stdout. What you do with that output is up to you, though. You could pipe the output to other commands or create files:
$ git show ecee9d9:date.c > date.c-oldest
Or:
$ git cat-file -p 89967:date.c > date.c-first-change
The difference between these two forms is a bit esoteric. The former filters the output file through any applicable text conversion filters, whereas the latter is a more basic, plumbing command and does not. Differences might show up between these two commands when manipulating binaries, when textconv filters are set up, or possibly during some newline handling transformations. If you want the raw data, use the cat-file -p form. If you want the transformed version as it would be when checked out or added to the repository, use the show form.
These are exactly the same mechanisms you would use to obtain versions of a file as it appears in another branch:
$ git checkout dev date.c
$ git show dev:date.c > date.c-dev
Or even earlier on the same branch:
$ git checkout HEAD~2 date.c
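To see the retrieval forms side by side, here is a small sketch on a scratch repository with two versions of a file. The file contents and the commit helper are made up; the file name date.c is borrowed from the discussion above only for continuity.

```shell
#!/bin/sh
# Scratch repository holding two versions of one file.
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" symbolic-ref HEAD refs/heads/master
commit() {
    git -C "$repo" -c user.name='A U Thor' -c user.email='author@example.com' \
        commit -q -a -m "$1"
}
echo 'version 1' > "$repo/date.c"
git -C "$repo" add date.c
commit 'first'
echo 'version 2' > "$repo/date.c"
commit 'second'

# Form 1: write the older version to a separate file; date.c is untouched.
git -C "$repo" show HEAD~1:date.c > "$repo/date.c-old"
shown=$(cat "$repo/date.c-old")
current=$(cat "$repo/date.c")

# Form 2: overwrite the working copy and index entry with the older version.
git -C "$repo" checkout HEAD~1 -- date.c
restored=$(cat "$repo/date.c")
```

The show form leaves the working tree alone, which is usually the safer first step; the checkout form replaces the current file in place.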
Although a bit of an ominous moniker, interactive hunk staging is nevertheless an incredibly powerful tool that can be used to simplify and organize your development into concise and easily understood commits. If anyone has ever asked you to split your patch up or make single-concept patches, chances are good that this section is for you!
Unless you are a super coder, and both think and develop in concise patches, your day-to-day development probably resembles mine: a little scattered, perhaps over-extended, and likely containing several intertwined ideas all mixed up as they occurred to you. One coding thought leads to another and pretty soon you fixed the original bug, stumbled onto another (but fixed it!), and then added a new easy feature while you were there. Oh, and, you fixed those two typos as well.
And, if you, like I do, appreciate having someone review your changes to important code before you ask for it to be accepted upstream, chances are good that having all of those different, unrelated changes will not make for a logical presentation of a single patch. Indeed, some open source projects insist that submitted patches contain separate self-contained fixes. That is, a patch shouldn’t serve multiple purposes in one shot. Instead, each idea should stand alone and should be presentable as a well-defined, simple patch that is just large enough to do the job and nothing more. If more than one idea needs to be upstreamed, more than one patch, perhaps in a sequence, will be needed. Common wisdom suggests that these sorts of patches and patch sequences lead to very solid reviews, quick turnaround, and easy acceptance into the mainline upstream development.
So how do these perfect patch sequences come about? Although I strive for a development style that facilitates simple patches, I’m not always successful. Nevertheless, Git provides some tools to help formulate good patches. One of those tools is the ability to interactively select and commit pieces, or “hunks,” of a patch, leaving the rest to be committed in a later patch. Ultimately, you will want to create a new sequence of smaller commits that still sum up to your original work.
What Git won’t do for you is decide which conceptual pieces of a patch belong together and which do not. You have to be able to discern the meaning and grouping of hunks that make logical sense together. Sometimes those hunks are all in one file, but sometimes they are in multiple files. Collectively, all the conceptually related hunks must be selected and staged together as part of one commit.
Furthermore, you must ensure that your selection of hunks still meets any external requirements. For example, if you are writing source code that must be compiled, you will likely want to ensure that the code base continues to be compilable after each commit. Thus, you must ensure that your patch breakup, when reassembled in smaller parts, still compiles at each commit within the new sequence. Git can’t do that for you; that’s the part where you have to think. Sorry.
Staging hunks interactively is as easy as adding the -p option to the git add command!
$ git add -p file.c
Interactive hunk staging looks pretty easy, and it is. But we should probably still have a mental model in mind of what Git is doing with our patches. Remember way back in Chapter 5, I explained how Git maintains the index as a staging area that accumulates your changes prior to committing them. That’s still happening. But instead of gathering the changes an entire file at a time, Git is picking apart the changes you have made in your working copy of a file, and allowing you to select which individual part or parts to stage in the index, waiting to be committed.
Let’s suppose we’re developing a program to print out a histogram of white-space–separated words found in a file. The very first version of this program is the “Hello, World!” program that proves things are starting out on the right compilation track. Here’s main.c:
#include <stdio.h>

int main(int argc, char **argv)
{
    /*
     * Print a histogram of words found in a file.
     * "Words" are any whitespace separated characters.
     * Words are listed in no particular order.
     * FIXME: Implementation needed still!
     */
    printf("Histogram of words\n");
}
Add a Makefile and .gitignore, and put it all in a new repository:
$ mkdir /tmp/histogram
$ cd /tmp/histogram
$ git init
Initialized empty Git repository in /tmp/histogram/.git/
$ git add main.c Makefile .gitignore
$ git commit -m "Initial histogram program."
[master (root-commit) 42300e7] Initial histogram program.
 3 files changed, 18 insertions(+), 0 deletions(-)
 create mode 100644 .gitignore
 create mode 100644 Makefile
 create mode 100644 main.c
Let’s do some miscellaneous development until main.c looks like this:
#include <stdio.h>
#include <stdlib.h>

struct htentry {
    char *item;
    int count;
    struct htentry *next;
};

struct htentry ht_table[256];

void ht_init(void)
{
    /* FIXME: details */
}

int main(int argc, char **argv)
{
    FILE *f;

    f = fopen(argv[1], "r");
    if (f == 0)
        exit(-1);

    /*
     * Print a histogram of words found in a file.
     * "Words" are any whitespace separated characters.
     * Words are listed in no particular order.
     * FIXME: Implementation needed still!
     */
    printf("Histogram of words\n");

    ht_init();
}
Notice that this development effort has introduced two conceptually different changes: the hash table structure and storage, and the beginnings of the file reading operation. In a perfect world, these two concepts would be introduced into the program with two separate patches. It will take us a couple of steps to get there, but Git will help us split these changes properly.
Git, along with most of the Free World, considers a hunk to be any series of lines from a diff command that are delineated by a line that looks something like this:
@@ -1,7 +1,27 @@
or this:
@@ -9,4 +29,6 @@ int main(int argc, char **argv)
In this case, git diff shows two hunks:
$ git diff
diff --git a/main.c b/main.c
index 9243ccf..b07f5dd 100644
--- a/main.c
+++ b/main.c
@@ -1,7 +1,27 @@
#include <stdio.h>
+#include <stdlib.h>
+
+struct htentry {
+ char *item;
+ int count;
+ struct htentry *next;
+};
+
+struct htentry ht_table[256];
+
+void ht_init(void)
+{
+ /* FIXME: details */
+}
int main(int argc, char **argv)
{
+ FILE *f;
+
+ f = fopen(argv[1], "r");
+ if (f == 0)
+ exit(-1);
+
/*
* Print a histogram of words found in a file.
* "Words" are any whitespace separated characters.
@@ -9,4 +29,6 @@ int main(int argc, char **argv)
* FIXME: Implementation needed still!
*/
printf("Histogram of words\n");
+
+ ht_init();
}
The first hunk starts with the line @@ -1,7 +1,27 @@ and finishes at the start of the second hunk: @@ -9,4 +29,6 @@ int main(int argc, char **argv).
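The numbers in a hunk header follow the unified diff convention: the pair after the minus sign gives the starting line and line count in the old file, and the pair after the plus sign gives the same for the new file. As a small sketch of how to read them, the function below pulls the four numbers out of a header line. The name parse_hunk_header is hypothetical and is not part of Git.

```shell
# Print the old and new ranges named by a unified-diff hunk header.
# parse_hunk_header is a hypothetical helper, not a Git command.
parse_hunk_header() {
    # Reduce "@@ -1,7 +1,27 @@ ..." to "1,7 1,27".
    ranges=$(printf '%s\n' "$1" |
        sed -n 's/^@@ -\([0-9,]*\) +\([0-9,]*\) @@.*/\1 \2/p')
    old=${ranges% *}
    new=${ranges#* }
    old_start=${old%,*}
    new_start=${new%,*}
    # A range written without ",count" covers exactly one line.
    case $old in *,*) old_count=${old#*,} ;; *) old_count=1 ;; esac
    case $new in *,*) new_count=${new#*,} ;; *) new_count=1 ;; esac
    echo "old: start=$old_start count=$old_count; new: start=$new_start count=$new_count"
}

parse_hunk_header '@@ -1,7 +1,27 @@'
parse_hunk_header '@@ -9,4 +29,6 @@ int main(int argc, char **argv)'
```

For the two headers in this example, the first hunk replaces 7 old lines with 27 new ones, and the second replaces 4 old lines starting at line 9 with 6 new lines starting at line 29.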
When interactively staging hunks with git add -p, Git offers a choice for each hunk in turn: do you want to stage it?
But let’s look at our patch a bit more closely and consider the need to break up the pieces so that conceptually related parts are all gathered up and staged at the same time. That means we’d like to stage all the hash table parts together in one patch, and then stage all the file operations in a second patch. Unfortunately, it looks like the first hunk has both hash table and file operation pieces in one hunk! That means, for the purposes of the first commit (i.e., the hash table pieces), we want to both stage it and not stage it. Or more precisely, we want to stage part of the hunk. If Git only asks us about the first and second hunks, we are in trouble.
But not to worry! Hunk staging allows us to split a hunk. A split may be performed anywhere a contiguous run of added and deleted lines (those marked with a plus or minus in the first column) is broken up by original context text.
Let’s see how this works by starting with a git add -p main.c command:
$ git add -p
diff --git a/main.c b/main.c
index 4809266..c60b800 100644
--- a/main.c
+++ b/main.c
@@ -1,7 +1,27 @@
#include <stdio.h>
+#include <stdlib.h>
+
+struct htentry {
+ char *item;
+ int count;
+ struct htentry *next;
+};
+
+struct htentry ht_table[256];
+
+void ht_init(void)
+{
+ /* FIXME: details */
+}
int main(int argc, char **argv)
{
+ FILE *f;
+
+ f = fopen(argv[1], "r");
+ if (f == 0)
+ exit(-1);
+
/*
* Print a histogram of words found in a file.
* "Words" are any whitespace separated characters.
Stage this hunk [y,n,q,a,d,/,j,J,g,s,e,?]?
After reviewing this hunk and seeing both the hash table and file operation related changes there, you realize you need to both stage and not stage this hunk. That is your clue to answer s, for split, to the question.
Stage this hunk [y,n,q,a,d,/,j,J,g,s,e,?]? s
Split into 2 hunks.
@@ -1,4 +1,18 @@
#include <stdio.h>
+#include <stdlib.h>
+
+struct htentry {
+ char *item;
+ int count;
+ struct htentry *next;
+};
+
+struct htentry ht_table[256];
+
+void ht_init(void)
+{
+ /* FIXME: details */
+}
int main(int argc, char **argv)
{
Stage this hunk [y,n,q,a,d,/,j,J,g,e,?]?
Excellent. We want this hunk staged.
Stage this hunk [y,n,q,a,d,/,j,J,g,e,?]? y
And immediately next up:
@@ -2,6 +16,12 @@
int main(int argc, char **argv)
{
+ FILE *f;
+
+ f = fopen(argv[1], "r");
+ if (f == 0)
+ exit(-1);
+
/*
* Print a histogram of words found in a file.
* "Words" are any whitespace separated characters.
Stage this hunk [y,n,q,a,d,/,K,j,J,g,e,?]?
But not that one.
Stage this hunk [y,n,q,a,d,/,K,j,J,g,e,?]? n
And finally, Git offers to stage the last hunk. We want it, too.
@@ -9,4 +29,6 @@ int main(int argc, char **argv)
* FIXME: Implementation needed still!
*/
printf("Histogram of words\n");
+
+ ht_init();
}
Stage this hunk [y,n,q,a,d,/,j,J,g,s,e,?]? y
Let’s review. Originally, there were two hunks. But we wanted only part of the first hunk and all of the second. So when Git offered us the first hunk we had to split it into two subhunks. We then staged the first subhunk, and not the second subhunk. We then staged the entire original second hunk.
Verifying that the staged pieces look correct is easy:
$ git diff --staged
diff --git a/main.c b/main.c
index 4809266..8a95bb0 100644
--- a/main.c
+++ b/main.c
@@ -1,4 +1,18 @@
#include <stdio.h>
+#include <stdlib.h>
+
+struct htentry {
+ char *item;
+ int count;
+ struct htentry *next;
+};
+
+struct htentry ht_table[256];
+
+void ht_init(void)
+{
+ /* FIXME: details */
+}
int main(int argc, char **argv)
{
@@ -9,4 +23,6 @@ int main(int argc, char **argv)
* FIXME: Implementation needed still!
*/
printf("Histogram of words\n");
+
+ ht_init();
}
That looks good, so you can go ahead and commit it. Don’t worry that there are lingering differences remaining in the file main.c. That’s by design because it is the next patch! Oh, and don’t use the filename with this next git commit command because that would commit the entire file and not just the staged parts.
$ git commit -m "Introduce a Hash Table."
[master 66a212c] Introduce a Hash Table.
 1 files changed, 16 insertions(+), 0 deletions(-)
$ git diff
diff --git a/main.c b/main.c
index 8a95bb0..c60b800 100644
--- a/main.c
+++ b/main.c
@@ -16,6 +16,12 @@ void ht_init(void)
int main(int argc, char **argv)
{
+ FILE *f;
+
+ f = fopen(argv[1], "r");
+ if (f == 0)
+ exit(-1);
+
/*
* Print a histogram of words found in a file.
* "Words" are any whitespace separated characters.
And with that, just add and commit the remaining change because it is the total material for the file operations patch.
$ git add main.c
$ git commit -m "Open the word source file."
[master e649d27] Open the word source file.
 1 files changed, 6 insertions(+), 0 deletions(-)
A glance at the commit history shows two new commits:
$ git log --graph --oneline
* e649d27 Open the word source file.
* 66a212c Introduce a Hash Table.
* 3ba81f7 Initial histogram program.
And that is a happy patch sequence!
As usual, there are a few caveats and extenuating circumstances. For instance, what about that sneaky line:
#include <stdlib.h>
Doesn’t it really belong with the file operation patch and not the hash table patch? Yep. You got me. It does.
That’s a bit trickier to handle. But let’s do it anyway. We’ll have to use the e option. First, reset to the first commit and leave all those changes in your working tree so we can do it all over again.
$ git reset 3ba81f7
Unstaged changes after reset:
M main.c
Do the git add -p again, and split the first patch just like before. But this time, instead of answering y to the first subhunk staging request, answer e and request to edit the patch:
$ git add -p
diff --git a/main.c b/main.c
index 4809266..c60b800 100644
--- a/main.c
+++ b/main.c
@@ -1,7 +1,27 @@
#include <stdio.h>
+#include <stdlib.h>
+
+struct htentry {
+ char *item;
+ int count;
+ struct htentry *next;
+};
+
+struct htentry ht_table[256];
+
+void ht_init(void)
+{
+ /* FIXME: details */
+}
int main(int argc, char **argv)
{
+ FILE *f;
+
+ f = fopen(argv[1], "r");
+ if (f == 0)
+ exit(-1);
+
/*
* Print a histogram of words found in a file.
* "Words" are any whitespace separated characters.
Stage this hunk [y,n,q,a,d,/,j,J,g,s,e,?]? s
Split into 2 hunks.
@@ -1,4 +1,18 @@
#include <stdio.h>
+#include <stdlib.h>
+
+struct htentry {
+ char *item;
+ int count;
+ struct htentry *next;
+};
+
+struct htentry ht_table[256];
+
+void ht_init(void)
+{
+ /* FIXME: details */
+}
int main(int argc, char **argv)
{
Stage this hunk [y,n,q,a,d,/,j,J,g,e,?]? e
You will be placed in your favorite editor[43] and allowed the chance to manually edit the patch. Read the comment at the bottom of the editor buffer. Carefully delete that one #include <stdlib.h> line. Don’t disturb the context lines, and don’t mess with the line counts. Git, and most any patch program, will lose its mind if you mess with the context lines. However, my editor updates the line counts automatically.
In this case, because the #include line was removed, it will be swept up in the remainder of the patches that get formed. This effectively introduces it at the correct time in the patch with the other file operation changes.
It is kind of tricky here, but Git now assumes that when you exit your editor, the patch that is left in your editor should be applied and its effects staged. So it offers you the following hunk and lets you choose its disposition. Be careful.
Because Git has moved on to the file operation changes, don’t stage those changes yet, but do pick up the last hash table change:
@@ -2,6 +16,12 @@
int main(int argc, char **argv)
{
+ FILE *f;
+
+ f = fopen(argv[1], "r");
+ if (f == 0)
+ exit(-1);
+
/*
* Print a histogram of words found in a file.
* "Words" are any whitespace separated characters.
Stage this hunk [y,n,q,a,d,/,K,j,J,g,e,?]? n
@@ -9,4 +29,6 @@ int main(int argc, char **argv)
* FIXME: Implementation needed still!
*/
printf("Histogram of words\n");
+
+ ht_init();
}
Stage this hunk [y,n,q,a,d,/,K,g,e,?]? y
The separation can be verified, noting that the #include <stdlib.h> line has been correctly associated with the file operations now:
$ git diff
diff --git a/main.c b/main.c
index 3e77315..c60b800 100644
--- a/main.c
+++ b/main.c
@@ -1,4 +1,5 @@
#include <stdio.h>
+#include <stdlib.h>
struct htentry {
char *item;
@@ -15,6 +16,12 @@ void ht_init(void)
int main(int argc, char **argv)
{
+ FILE *f;
+
+ f = fopen(argv[1], "r");
+ if (f == 0)
+ exit(-1);
+
/*
* Print a histogram of words found in a file.
* "Words" are any whitespace separated characters.
As before, wrap up with a git commit for the hash table patch, then stage and commit the remaining file operation pieces.
I’ve only touched on the essential responses to the “Stage this hunk?” question. In fact, even more options than those listed in its prompt (i.e., [y,n,q,a,d,/,K,g,e,?]) are available. There are options to delay the fate of a hunk and then revisit it when prompted again later.
Furthermore, although this example only had two hunks in one file, the staging operation generalizes to many hunks, possibly split, in many files. Pulling together changes across multiple files can be a simple process of applying git add -p to each file that has a hunk needing to be staged.
However, there is another, outer level to the whole interactive hunk staging process that can be invoked using the git add -i command. It can be a bit cryptic, but its purpose is to allow you to select which paths (i.e., files) to stage in the index. As a sub-option, you may then select the patch option for your chosen paths. This enters the previously described per-file staging mechanism.
Occasionally, an ill-timed git reset command or an accidental branch deletion leaves you wishing you hadn’t lost the development it represented, and wishing you could recover it somehow. The usual approach to recovering such work is to inspect your reflog as shown in Chapter 11. Sometimes the reflog isn’t available, perhaps because it has been turned off (e.g., core.logAllRefUpdates = false), because you are manipulating a bare repository directly, or perhaps because the reflog has simply expired. For whatever reason, sometimes the reflog cannot help recover a lost commit.
Although not foolproof, Git provides the command git fsck to help locate lost data. The word “fsck” is an old abbreviation for “file system check.” Although this command does not check your filesystem, it does have many characteristics and algorithms that are quite similar to a traditional filesystem check, and results in some of the same output data as well.
Understanding how git fsck works is predicated on a good understanding of Git’s data structures as described in Chapter 4. Normally, every object in the Git repository, whether it is a blob, tree, commit, or tag, is connected to another object and anchored to a branch name, tag name, or some other symbolic ref such as a reflog name.
However, various commands and manipulations can leave objects in the object store that are not linked into the complete data structure somehow. These objects are called “unreachable” or “dangling.” They are unreachable because a traversal of the full data structure that starts from every named ref and follows every tag, commit, commit parent, and tree object reference will never encounter the lost object. In a sense, it is out there dangling on its own.
But traversing the ref-based commit graph isn’t the only way to walk every object in the database! Consider simply listing the objects in your object store using ls directly:
$ cd path/to/some/repo
$ ls -R .git/objects/
.git/objects/:
25 3b 73 82 info pack
.git/objects/25:
7cc5642cb1a054f08cc83f2d943e56fd3ebe99
.git/objects/3b:
d1f0e29744a1f32b08d5650e62e2e62afb177c
.git/objects/73:
8d05ac5663972e2dcf4b473e04b3d1f19ba674
.git/objects/82:
b5fee28277349b6d46beff5fdf6a7152347ba0
.git/objects/info:
.git/objects/pack:
In this simple example, the set of objects in the repository has been listed without doing a traversal of the refs and commits.
By carefully comparing the total set of objects with those reachable via a traversal of the ref-based commit graph, you can determine all of the unreferenced objects. From the previous example, the second object listed turns out to be an unreferenced blob (i.e., file):
$ git fsck
Checking object directories: 100% (256/256), done.
dangling blob 3bd1f0e29744a1f32b08d5650e62e2e62afb177c
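That comparison can also be done by hand, which makes the idea behind git fsck more concrete. The following is a rough sketch, not a description of how git fsck is implemented: it assumes a throwaway repository, plants an unreferenced blob with git hash-object -w, and then subtracts the set of reachable objects from the set of all objects. The file names all.txt and reachable.txt are hypothetical scratch files.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"
echo foo > file
git add file
git commit -q -m "Add some foo"

# Write a blob into the object store that nothing refers to.
orphan=$(printf 'foo\nbar\n' | git hash-object -w --stdin)

# Every object in the store, regardless of reachability.
git cat-file --batch-all-objects --batch-check='%(objectname)' | sort > all.txt

# Every object reachable from refs, reflogs, and the index.
git rev-list --objects --all --reflog --indexed-objects |
    cut -d' ' -f1 | sort > reachable.txt

# Lines found only in the first list are the unreferenced objects.
unreferenced=$(comm -23 all.txt reachable.txt)
echo "unreferenced: $unreferenced"
```

The one ID left over is the blob that was written without being added to the index or to any commit.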
Let’s follow an example that shows how a lost commit can occur, and see how git fsck can recover it. First, construct a simple, new repository with a single simple file in it.
$ mkdir /tmp/lost
$ cd /tmp/lost
$ git init
Initialized empty Git repository in /tmp/lost/.git/
$ echo "foo" >> file
$ git add file
$ git commit -m "Add some foo"
[master (root-commit) 1adf46e] Add some foo
 1 files changed, 1 insertions(+), 0 deletions(-)
 create mode 100644 file
$ git fsck
Checking object directories: 100% (256/256), done.
$ ls -R .git/objects/
.git/objects/:
25 4a f8 info pack
.git/objects/25:
7cc5642cb1a054f08cc83f2d943e56fd3ebe99
.git/objects/4a:
1c03029e7407c0afe9fc0320b3258e188b115e
.git/objects/f8:
5b097ee0f77c5f4dc1868037acbffe59b0e93e
.git/objects/info:
.git/objects/pack:
Notice that there are only three objects and none of them are dangling. In fact, starting from the master ref, which is the f85b097ee commit object, the traversal points to the tree object 4a1c0302 and then the blob 257cc564.
The command git cat-file -t object-id can be used to determine an object’s type.
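As a brief illustration, the sketch below rebuilds a one-commit repository like the one above in a throwaway directory and asks for the type of each object along the traversal. It names the objects with the revision expressions HEAD, HEAD^{tree}, and HEAD:file rather than with raw IDs, so it does not depend on the particular hashes shown here.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"
echo foo > file
git add file
git commit -q -m "Add some foo"

# Follow the chain: the commit, the tree it points to, the blob in that tree.
commit_type=$(git cat-file -t HEAD)
tree_type=$(git cat-file -t 'HEAD^{tree}')
blob_type=$(git cat-file -t HEAD:file)
echo "$commit_type -> $tree_type -> $blob_type"
```

Passing a full or abbreviated object ID in place of those expressions works the same way.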
Now let’s make a second commit, and then hard reset back to the first commit:
$ echo bar >> file
$ git commit -m "Add some bar" file
[master 11e0dc9] Add some bar
 1 files changed, 1 insertions(+), 0 deletions(-)
And now the “accident” that causes us to lose a commit:
$ git reset --hard HEAD^
HEAD is now at f85b097 Add some foo
$ git fsck
Checking object directories: 100% (256/256), done.
But wait! git fsck doesn’t report any dangling object. It doesn’t seem to be lost after all. This is exactly what the reflog is designed to do: prevent you from accidentally losing commits. (See The Reflog.)
So let’s try again after brutally eliminating the reflog:
# Not recommended; this is for purposes of exposition only!
$ rm -rf .git/logs
$ git fsck
Checking object directories: 100% (256/256), done.
dangling commit 11e0dc9c11d8f650711b48c4a5707edf5c8a02fe
$ ls -R .git/objects/
.git/objects/:
11 25 3b 41 4a f8 info pack
.git/objects/11:
e0dc9c11d8f650711b48c4a5707edf5c8a02fe
.git/objects/25:
7cc5642cb1a054f08cc83f2d943e56fd3ebe99
.git/objects/3b:
d1f0e29744a1f32b08d5650e62e2e62afb177c
.git/objects/41:
31fe4d33cd85da805ac9a6697c2251c913881c
.git/objects/4a:
1c03029e7407c0afe9fc0320b3258e188b115e
.git/objects/f8:
5b097ee0f77c5f4dc1868037acbffe59b0e93e
.git/objects/info:
.git/objects/pack:
You can use the git fsck --no-reflogs command to find dangling objects as if the reflog were not available to reference commits. That is, objects that are only reachable from the reflog will be considered unreachable.
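Here is a minimal sketch of that difference that leaves the reflog in place. It assumes a throwaway repository with the same two commits and the same hard reset as the example; the variable names are hypothetical.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"
echo foo > file
git add file
git commit -q -m "Add some foo"
echo bar >> file
git commit -q -m "Add some bar" file
lost=$(git rev-parse HEAD)
git reset -q --hard HEAD^

# With the reflog consulted, the second commit still counts as reachable.
with_reflog=$(git fsck 2>/dev/null | grep -c '^dangling commit' || true)

# Ignoring reflog entries (documented spelling: --no-reflogs) exposes it.
without_reflog=$(git fsck --no-reflogs 2>/dev/null | grep '^dangling commit')

echo "dangling commits seen with reflog: $with_reflog"
echo "seen without reflog: $without_reflog"
```

Nothing is deleted along the way, so this is a safer way to look for lost commits than removing .git/logs.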
Now we can see that only the reflog was referencing the second commit 11e0dc9c in which the “bar” content was added.
But how would we even know what that dangling commit is?
$ git show 11e0dc9c
commit 11e0dc9c11d8f650711b48c4a5707edf5c8a02fe
Author: Jon Loeliger <[email protected]>
Date: Sun Feb 10 11:59:59 2012 -0600
    Add some bar
diff --git a/file b/file
index 257cc56..3bd1f0e 100644
--- a/file
+++ b/file
@@ -1 +1,2 @@
foo
+bar
# The "index" line above named blob 3bd1f0e
$ git show 3bd1f0e
foo
bar
Note that the blob 3bd1f0e is not considered dangling because it is actually referenced by the commit 11e0dc9c, even though the commit itself is unreferenced.
Sometimes, though, git fsck will find blobs that are unreferenced. Remember, every time you git add a file to the index, its blob is added to the object store. If you subsequently change that content and re-add it, no commit will have captured the intermediate blob that was added to the object store. Thus, it will be unreferenced.
$ echo baz >> file
$ git add file
$ git fsck
Checking object directories: 100% (256/256), done.
dangling commit 11e0dc9c11d8f650711b48c4a5707edf5c8a02fe
$ echo quux >> file
$ git add file
$ git fsck
Checking object directories: 100% (256/256), done.
dangling blob 0c071e1d07528f124e31f1b6c71348ec13f21a7a
dangling commit 11e0dc9c11d8f650711b48c4a5707edf5c8a02fe
The first git fsck didn’t show a dangling blob because that blob was still referenced directly by the index. Only after the content associated with the pathname file was changed again and re-added did that blob become dangling.
$ git show 0c071e1d
foo
baz
If you find you have a very cluttered git fsck report consisting entirely of unnecessary blobs and commits and want to clean it up, consider running garbage collection as described in Garbage Collection.
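As a hedged illustration of such a cleanup, the sketch below plants a dangling blob in a throwaway repository and then removes it with git gc --prune=now. The --prune=now setting skips the usual grace period, so whatever is unreachable at that moment is deleted for good; this sketch assumes nothing in the repository is worth recovering.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"
echo foo > file
git add file
git commit -q -m "Add some foo"

# Create an unreferenced blob so that fsck has something to report.
printf 'scratch\n' | git hash-object -w --stdin > /dev/null
before=$(git fsck 2>/dev/null | grep -c '^dangling' || true)

# Repack and drop every unreachable object immediately.
git gc -q --prune=now
after=$(git fsck 2>/dev/null | grep -c '^dangling' || true)

echo "dangling objects before: $before, after: $after"
```

Objects that are still referenced by reflog entries count as reachable and survive this step.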
Although using git fsck is a handy way to discover the SHA1 of lost commits and blobs, I mentioned the reflog earlier as another mechanism. You could even cut and paste a SHA1 from some lingering line of output found by scrolling back through your terminal. Ultimately, it doesn’t matter how you discover the SHA1 of a lost blob or commit. The question remains: once you know it, how do you reconnect it or otherwise incorporate it into your project?
By definition, blobs are nameless file content. All you really have to do to reestablish a blob is place that content into a file and git add it again. As I showed in the previous section, git show can be used on the blob SHA1 to obtain the full object content. Just redirect that to your desired file:
$ git show 0c071e1d > file2
On the other hand, reconnecting a commit might depend on what you want to do with it. The simple example from the previous section is only one commit. But it could equally well have been the first commit in an entire sequence of commits that was lost. Maybe even an entire branch was accidentally lost! Consequently, a usual practice would reintroduce a lost commit as a branch.
Here, the previously lost commit that introduced the bar content, 11e0dc9c, is re-introduced on the new branch called recovered:
$ git branch recovered 11e0dc9c
$ git show-branch
* [master] Add some foo
 ! [recovered] Add some bar
--
 + [recovered] Add some bar
*+ [master] Add some foo
From there it can be manipulated (kept as is, merged, etc.) as you wish.
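The whole recovery can be strung together in a few commands. The sketch below is one way to do it under simplifying assumptions: it works in a throwaway repository, expects exactly one dangling commit in the git fsck report, and picks its ID out of the third field of that report with awk.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"
echo foo > file
git add file
git commit -q -m "Add some foo"
echo bar >> file
git commit -q -m "Add some bar" file
git reset -q --hard HEAD^    # the "accident"

# Find the lost commit as if the reflog were unavailable.
lost=$(git fsck --no-reflogs 2>/dev/null |
    awk '$1 == "dangling" && $2 == "commit" { print $3 }')

# Anchor it to a branch name so it is reachable again.
git branch recovered "$lost"
subject=$(git log -1 --format=%s recovered)
echo "recovered: $subject"
```

With several dangling commits in the report, you would inspect each candidate with git show before deciding which one to attach a branch to.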