Chapter 5
The Repository
You can use Git quite well without knowing how the repository works. However, you will have a better understanding of the workflow if you know how Git stores and organizes its data. If you really hate theory, you can skim through this chapter and just read the Step by Step sections.
Git is built in two layers. The upper layer contains commands such as log, reset or commit, which are easy to use and offer numerous options. Git developers call these porcelain commands.
The layer below it is called plumbing. It consists of a number of simple commands with few options, on which the porcelain commands are built. Plumbing commands are rarely used directly. This chapter provides a little insight into the plumbing level of the system.
A Simple and Efficient Storage System
The core of Git is an object database. Here you can store text or binary data, such as the content of a file. You can use the hash-object command with the -w option (w stands for write) to insert a record into the object database.
> git hash-object -w hello.txt
28cf67640e502fe8e879a863bd1bbcd4366689e8
When you store an object, Git returns a 40-character code. This is the key for the stored object. You can retrieve the object later with this key, using the cat-file command with the -p option (p stands for print). An unambiguous prefix of the key is sufficient.
> git cat-file -p 28cf67640e
Hello World!
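How such a key comes about can be sketched in a few lines of Python. The sketch assumes Git's default SHA1 object format, in which a short header (the object type, the content length and a NUL byte) is hashed together with the content. The function name git_blob_hash is made up for this illustration and is not part of Git.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Return the key Git would assign to a blob with this content.

    Assumes the default SHA1 object format. git_blob_hash is a
    hypothetical helper for illustration, not a Git API.
    """
    # Header: object type, content length in bytes, NUL separator.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


# The empty file always gets the same well-known key.
print(git_blob_hash(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Because the key depends only on the content, the same bytes yield the same key in every repository.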
The implementation of the object database is very efficient. Even in a large project with a very long commit history (such as the Linux kernel, which has more than 200,000 commits and almost two million objects), accessing objects from the repository is almost instantaneous. Git is extremely well suited for projects with a large number of small source files. The limits become apparent only when the total volume of data gets very large. Those who want to manage large binary files in a Git repository will not be well served.
Storing Directories: Blob & Tree
To store files and directories, Git uses a simple tree with two types of nodes. File contents are stored unchanged, byte by byte, as blob objects in the object database. Directories are represented by tree objects that look like this:
> git cat-file -p 2790ef78
100644 blob 507d3a30ae9ed53bcf953744c5f5c9391a263356 README
040000 tree 91c7822ab43800b0e3c13049519587df4fd74591 src
The tree object lists files and subdirectories, as shown in Figure 5.2. Each entry consists of the access rights (e.g. 100644), the type (blob or tree), the hash of the referenced blob or tree object, and the name of the file or directory.
sample-workspace/
  README
  src/
    Hello.java
    World.java
Figure 5.1: A small project
Figure 5.2: Representation of directories in the repository
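The structure of a tree object can be sketched in Python. The sketch assumes the default SHA1 object format: each entry is stored as the mode, a space, the name, a NUL byte and the 20 raw bytes of the referenced hash, and directories are stored with mode 40000 (cat-file prints it padded as 040000). The helper names are hypothetical, and the two hashes are simply copied from the listing above.

```python
import hashlib


def git_object_hash(obj_type: bytes, body: bytes) -> str:
    """Hash an object body together with Git's type/length header."""
    header = obj_type + b" %d\x00" % len(body)
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries) -> bytes:
    """Serialize (mode, name, hex_hash) entries the way a tree stores them."""
    def sort_key(entry):
        mode, name, _ = entry
        # Directories are sorted as if their name ended in "/".
        return name.encode() + (b"/" if mode == "40000" else b"")

    body = b""
    for mode, name, hex_hash in sorted(entries, key=sort_key):
        body += mode.encode() + b" " + name.encode() + b"\x00"
        body += bytes.fromhex(hex_hash)  # 20 raw bytes, not 40 hex digits
    return body


entries = [
    ("40000", "src", "91c7822ab43800b0e3c13049519587df4fd74591"),
    ("100644", "README", "507d3a30ae9ed53bcf953744c5f5c9391a263356"),
]
body = tree_body(entries)
print(len(body))                      # 64 bytes for the two entries
print(git_object_hash(b"tree", body))
```

A tree therefore never contains file contents itself, only names and keys of other objects.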
Identical Data Is Stored Only Once
To save memory, identical data is stored only once. In the following example, the file contents from foo.txt and copy-of-foo.txt return the same hash, because they are identical:
> git hash-object -w foo.txt
a42a0aba404c211e8fdf33d4edde67bb474368a7
> git hash-object -w copy-of-foo.txt
a42a0aba404c211e8fdf33d4edde67bb474368a7
By using this approach, not only does Git save memory, it also gains performance. Many Git operations are fast because they only compare hashes in their algorithms without looking at the actual data.
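The idea can be illustrated with a tiny content-addressed store in Python. This is a sketch of the principle only, assuming SHA1 keys over a blob header plus content. The class ObjectStore is hypothetical and keeps everything in memory, whereas Git writes objects to disk.

```python
import hashlib


class ObjectStore:
    """Minimal in-memory content-addressed store (illustration only)."""

    def __init__(self):
        self._objects = {}

    def write(self, content: bytes) -> str:
        header = b"blob %d\x00" % len(content)
        key = hashlib.sha1(header + content).hexdigest()
        # Identical content maps to the same key, so it is kept only once.
        self._objects.setdefault(key, content)
        return key

    def read(self, key: str) -> bytes:
        return self._objects[key]

    def __len__(self) -> int:
        return len(self._objects)


store = ObjectStore()
key_foo = store.write(b"same content\n")   # stands in for foo.txt
key_copy = store.write(b"same content\n")  # stands in for copy-of-foo.txt
print(key_foo == key_copy, len(store))     # True 1
```

Comparing two files then reduces to comparing two keys, which is why many operations never need to look at the data itself.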
Compressing Similar Content
Git can do more than just store identical file content once. When programmers keep creating new file versions that differ from their predecessors in only a few lines, Git can save these versions using a delta method, which stores only the changes relative to the original version in pack files.
To trigger this, run the gc command when you want to save space. Git then removes any unwanted commits that are no longer accessible from any branch head, and stores the remaining commits in pack files. For projects that contain mostly source code, an amazingly high compression is achieved. Often the uncompressed workspace holding only the current version of the project is larger than the Git repository containing years of project history.
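The delta idea can be sketched in Python as follows. Git's pack files use their own binary delta encoding and their own heuristics for choosing a base version. This sketch only illustrates the principle of storing a new version as "copy" and "insert" instructions against an older one, and it uses difflib from the standard library as a stand-in for Git's matching algorithm. All function names are hypothetical.

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, target: bytes) -> list:
    """Describe target as copy/insert instructions relative to base."""
    delta = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Bytes already present in base: store only offset and length.
            delta.append(("copy", i1, i2 - i1))
        elif tag in ("replace", "insert"):
            # New bytes have to be stored literally.
            delta.append(("insert", target[j1:j2]))
        # "delete" needs no instruction: those bytes are simply not copied.
    return delta


def apply_delta(base: bytes, delta: list) -> bytes:
    """Rebuild the target version from base and the instructions."""
    out = bytearray()
    for instruction in delta:
        if instruction[0] == "copy":
            _, offset, length = instruction
            out += base[offset:offset + length]
        else:
            out += instruction[1]
    return bytes(out)


def literal_bytes(delta: list) -> int:
    """Number of bytes that must be stored literally in the delta."""
    return sum(len(i[1]) for i in delta if i[0] == "insert")


old_version = b"".join(b"line %d\n" % n for n in range(50))
new_version = old_version.replace(b"line 25\n", b"line twenty-five\n")
delta = make_delta(old_version, new_version)
print(apply_delta(old_version, delta) == new_version)  # True
print(len(new_version), literal_bytes(delta))
```

Only the changed bytes are stored literally, so the delta is much smaller than the complete new version.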
Is It Bad When Various Files Happen to Get the Same Hash?
That would be bad, because Git identifies content by its hash. Git could therefore provide incorrect data when the contents of various files happen to have the same hash. This is known as a hash collision.
The good news is that a hash collision is an extremely unlikely event. The reason is that there are 2^160 possible hashes. By comparison, after five years the Linux kernel project “only” has about 2^21 objects in its repository.
In theory, the SHA1 hash algorithm has weaknesses that would allow an SHA1 collision to be found with 2^51 operations. However, a research project at Graz University of Technology tried from 2007 to 2009 to find one (!) such hash collision, and was terminated without success. In summary, it is safe to say that security in the context of version control is okay today.
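To get a feeling for the numbers, the chance of an accidental collision can be estimated with the birthday-bound approximation p ≈ n(n-1) / 2^(b+1) for n objects and b-bit hashes. The following Python sketch assumes that hashes behave like uniformly random values, which is an idealization, and the function name is hypothetical. It uses roughly two million objects (2^21) and 160-bit hashes.

```python
from fractions import Fraction


def collision_probability(num_objects: int, hash_bits: int = 160) -> float:
    """Birthday-bound estimate for at least one accidental collision.

    Assumes uniformly random hashes; only valid while the result is tiny.
    """
    pairs = Fraction(num_objects * (num_objects - 1), 2)  # pairs of objects
    return float(pairs / 2**hash_bits)


p = collision_probability(2**21)
print(p)  # on the order of 1e-36
```

The estimate covers accidental collisions only and says nothing about collisions constructed deliberately.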
Commits
Commits are stored in the object database. The format is simple:
> git cat-file -p 64b98df0
tree 319c67d41a0b3f7464550b41db4bb1584939ad2a
parent 6c7f1ba0828a5b595026e08d2476808105a6b815
author Bjørn Stachmann <[email protected]> 1295906997 +0100
committer Bjørn Stachmann <[email protected]> 1295906997 +0100

Section on trees & blobs.
In addition to metadata, such as the author, committer, date and comment, a commit object contains hashes for other objects in the object database. The tree object describes the contents of the commit. It refers to the project’s root directory and, as described above, is represented by trees and blobs. The parent object represents the previous commit.
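Because the format is plain text, a commit object is easy to take apart. The following Python sketch splits the header lines from the comment and collects the referenced hashes. The function parse_commit is a hypothetical helper, and the author address in the sample is a placeholder. The hashes and the timestamp are taken from the listing above.

```python
def parse_commit(text: str) -> dict:
    """Split the text of a commit object into header fields and message."""
    headers, _, message = text.partition("\n\n")
    commit = {"parents": [], "message": message.strip()}
    for line in headers.splitlines():
        if line.startswith(" "):
            continue  # continuation of a multi-line header, ignored here
        key, _, value = line.partition(" ")
        if key == "parent":
            commit["parents"].append(value)  # merge commits have several
        else:
            commit[key] = value
    return commit


sample = (
    "tree 319c67d41a0b3f7464550b41db4bb1584939ad2a\n"
    "parent 6c7f1ba0828a5b595026e08d2476808105a6b815\n"
    "author A. Author <author@example.com> 1295906997 +0100\n"
    "committer A. Author <author@example.com> 1295906997 +0100\n"
    "\n"
    "Section on trees & blobs.\n"
)
commit = parse_commit(sample)
print(commit["tree"])
print(commit["parents"])
print(commit["message"])
```

The parents are collected in a list because a commit created by a merge refers to more than one predecessor.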
Object Reuse in the Commit History
Except for the very first commit, every commit has at least one predecessor commit (parent object). Often a commit only changes a few files in a project, and most of the files and directories remain unchanged. Whenever possible Git reuses objects from the previous commit.
Figure 5.3: Reusing of an object tree
Figure 5.3 shows an example. A commit (the rectangle with the header “commit” on the second row from the top, with solid arrows) includes a README file and a src directory containing other files. Then a new commit is created (the rectangle with the header “commit” on the first row, with dashed arrows) in which only the README file has changed. A new blob object is created for README. For the src directory, however, the existing tree object and its associated blobs can still be used.
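The reuse follows directly from the way keys are computed: an unchanged directory produces the same tree key again. The Python sketch below builds two snapshots that differ only in README and compares the keys. It assumes the default SHA1 object format. The helper names and the file contents are made up for this illustration.

```python
import hashlib


def object_hash(kind: bytes, body: bytes) -> str:
    """Hash an object body together with Git's type/length header."""
    return hashlib.sha1(kind + b" %d\x00" % len(body) + body).hexdigest()


def tree_hash(entries: dict) -> str:
    """entries maps a name to (mode, hex hash of the blob or tree)."""
    def sort_key(name):
        mode, _ = entries[name]
        # Directories are sorted as if their name ended in "/".
        return name.encode() + (b"/" if mode == "40000" else b"")

    body = b""
    for name in sorted(entries, key=sort_key):
        mode, hex_hash = entries[name]
        body += mode.encode() + b" " + name.encode() + b"\x00"
        body += bytes.fromhex(hex_hash)
    return object_hash(b"tree", body)


def snapshot(readme: bytes, sources: dict) -> dict:
    """Compute the keys for a project with a README and a src directory."""
    src = tree_hash({name: ("100644", object_hash(b"blob", content))
                     for name, content in sources.items()})
    readme_blob = object_hash(b"blob", readme)
    root = tree_hash({"README": ("100644", readme_blob),
                      "src": ("40000", src)})
    return {"root": root, "README": readme_blob, "src": src}


sources = {"Hello.java": b"class Hello {}\n", "World.java": b"class World {}\n"}
first = snapshot(b"First version\n", sources)
second = snapshot(b"Second version\n", sources)

print(first["src"] == second["src"])        # True: src tree is reused
print(first["README"] == second["README"])  # False: new blob for README
print(first["root"] == second["root"])      # False: new root tree
```

Only the objects on the path from the root to the changed file get new keys. Everything else is shared between the two commits.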
Renaming, Moving and Copying
In many version control systems you can trace the history of a file across renames and moves. Most often this is achieved by using a special command to move or rename files. In Subversion, for example, you move a file with svn move. However, Subversion is clueless when the user moves a file by dragging and dropping it to a new location. In this case, Subversion knows nothing about the move and instead records a file deletion and the creation of a new file.
Git takes a different approach: it stores no information about which files have been moved. Instead, it employs a rename detection algorithm: If a file is missing from a commit but was still present in the predecessor, Git checks if a file with the same name or very similar content emerges in another location. If this is so, Git assumes that the file has been moved. Figure 5.4 shows this: The foo.txt file that has been moved is missing in the second commit. Git then examines all recently added files for one that has similar content and locates it in src/foo-moved.txt. This is interpreted as renaming.
(Commit 1)              (Commit 2)
sample-workspace/       sample-workspace/
  foo.txt                 (foo.txt missing)
  src/                    src/
    bar.txt                 bar.txt
                            foo-moved.txt
Figure 5.4: A file is moved
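The principle of rename detection can be sketched in Python. Git uses its own similarity measure. This sketch substitutes difflib's ratio from the standard library and pairs every file that disappeared with the most similar file that appeared, provided the similarity reaches a threshold (50% here, matching the default of the -M option). The function detect_renames and the file contents are hypothetical.

```python
from difflib import SequenceMatcher


def detect_renames(before: dict, after: dict, threshold: float = 0.5) -> list:
    """Pair removed paths with the most similar added paths.

    before and after map paths to file contents. Returns a list of
    (old_path, new_path, similarity) tuples.
    """
    removed = {p: c for p, c in before.items() if p not in after}
    added = {p: c for p, c in after.items() if p not in before}
    renames = []
    for old_path, old_content in removed.items():
        best = None
        for new_path, new_content in added.items():
            score = SequenceMatcher(None, old_content, new_content).ratio()
            if score >= threshold and (best is None or score > best[1]):
                best = (new_path, score)
        if best is not None:
            renames.append((old_path, best[0], best[1]))
            del added[best[0]]  # each new file explains at most one rename
    return renames


commit_1 = {"foo.txt": "One\nTwo\nThree\n", "src/bar.txt": "bar\n"}
commit_2 = {"src/bar.txt": "bar\n", "src/foo-moved.txt": "One\nTwo\nThree\n"}
print(detect_renames(commit_1, commit_2))
# [('foo.txt', 'src/foo-moved.txt', 1.0)]
```

Since nothing about the move is recorded, the result is computed anew each time from the two snapshots.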
Step by Step
Following renames and moving
Git will show which files have been renamed or moved.
1. Get a summary
You activate rename detection using the log command with the -M option (M for “Move”). The value after -M (here 90%) is the minimum similarity for a file to count as renamed. Add the --summary option to display information about file changes. Because the full output is very long, you can filter it with the grep command. The percentage in each line indicates how similar the source and target files are.
> git log --summary -M90% | grep -e "^ rename"
 rename foo.txt => foo-renamed.txt (90%)
 rename src/{before => after}/bar.txt (100%)
2. Track the history of a file that has been moved
Use the log command with the --follow option to continue listing the history of a file beyond renames (works only for a single file). Without this option, the log would end at file renames.
> git log --follow foo-renamed.txt
Step by Step
Tracking down copies
You can also track data that has been copied using the -C option.
> git log --summary -C90% | grep -e "^ copy"
If necessary, you can use the --find-copies-harder option to make Git search more thoroughly, at the cost of a longer running time. With this option, Git examines all files in a commit as possible copy sources, not only those that have been changed.
You can also configure Git so that rename detection is enabled by default. Then you do not need to specify the -M option for each log command you issue. The --follow option is not affected by this setting and must still be given explicitly.
> git config diff.renames true
Step by Step
Determining the origin of a code section
You want to find out who last changed some lines of code and when.
1. Print the origin information line by line
Git can even determine the origin of lines of code when larger sections of the code were copied or moved from other files. The blame command also displays when and by whom a line of code was last modified.
> git blame -M -C -C -C copied-together.txt
f5fdbad0 foo.txt (Rene  2010-11-14 18:30:42 +0100 1) One,
a5b80903 bar.txt (Bjørn 2011-01-31 21:32:49 +0100 2) Two or
f5fdbad0 foo.txt (Rene  2010-11-14 18:30:42 +0100 3) Three
The -M option (M for Move) detects lines that were moved or copied within a file. The -C option also detects lines that were moved or copied from other files modified in the same commit. Repeating -C widens the search: given twice, Git also looks at the other files in the commit that created the file, and given three times, it looks at files in any commit. For large repositories this can sometimes take a little longer.
Summary