Chapter 5: The Repository

You can use Git quite well without knowing how the repository works. However, you will have a better understanding of the workflow if you know how Git stores and organizes its data. If you really hate theory, you can skim through this chapter and just read the Step by Step sections.

Git is built in two layers. On the upper layer you have commands such as log, reset or commit, which are easy to use and offer numerous options. Git developers call these porcelain commands.

The layer below it is called plumbing. It consists of a number of simple commands with few options, on which the porcelain commands are built. Plumbing commands are rarely used directly. This chapter provides a little insight into the plumbing level of the system.

A Simple and Efficient Storage System

The core of Git is an object database. Here you can store text or binary data, such as the content of a file. You can use the hash-object command with the -w option (w stands for write) to insert a record into the object database.

> git hash-object -w hello.txt 
28cf67640e502fe8e879a863bd1bbcd4366689e8

When you store an object, Git returns a 40-character hexadecimal code. This is the key for the stored object. You can access the object again later with this key using the cat-file command with the -p option (p stands for print). An unambiguous prefix of the key is sufficient.

> git cat-file -p 28cf67640e
Hello World!
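
How the key is derived can be sketched in a few lines of Python. This is a minimal illustration, assuming a repository that uses SHA-1 object names: the key is the SHA-1 checksum of a short header ("blob", the content length in bytes, and a NUL byte) followed by the content itself. The function name git_blob_hash is made up for this example and is not part of Git.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Return the key Git assigns to `content` when it is stored as a blob."""
    # Header: object type, a space, the content length in bytes, a NUL byte.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Same value as: echo 'hello world' | git hash-object --stdin
print(git_blob_hash(b"hello world\n"))  # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
```

Because the key depends only on the content, the file name plays no role in it.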

The implementation of the object database is very efficient. Even in a large project with a very long commit history (such as the Linux kernel, which has more than 200,000 commits and almost two million objects), accessing objects from the repository is almost instantaneous. Git is extremely well suited for projects with a large number of small source files. Its limits become apparent only when the total data volume gets very large. Those wanting to manage large binary files with a Git repository will not be well served.

Storing Directories: Blob & Tree

To store files and directories, Git uses a simple tree with two types of nodes. File contents are stored unchanged, byte for byte, as blob objects in the object database. Directories are represented by tree objects that look like this:

> git cat-file -p 2790ef78
100644 blob 507d3a30ae9ed53bcf953744c5f5c9391a263356 README 
040000 tree 91c7822ab43800b0e3c13049519587df4fd74591 src

The tree object contains files and subdirectories, as shown in Figure 5.2. Each entry comes with information about access rights (e.g. 100644), the type (blob or tree), a hash of the content, and the name of the file or directory.

                    sample-workspace/ 
                        README 
                        /src 
                            Hello.java 
                            World.java

Figure 5.1: A small project

Figure 5.2: Representation of directories in the repository
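
The following Python sketch shows one way the key of a tree object could be computed from its entries. It assumes SHA-1 object names and the usual binary layout of a tree (mode, space, name, NUL byte, then the 20 raw bytes of the entry's hash); the function name tree_hash and the tuple layout are chosen for this example only.

```python
import hashlib


def tree_hash(entries):
    """Compute the key of a tree object.

    entries: list of (mode, name, hex_hash) tuples as printed by
    `git cat-file -p`, for example ("100644", "README", "507d3a30...").
    """
    def sort_key(entry):
        mode, name, _ = entry
        # Git orders directory names as if they ended with a slash.
        is_dir = mode.lstrip("0") == "40000"
        return (name + "/" if is_dir else name).encode("utf-8")

    body = b""
    for mode, name, hex_hash in sorted(entries, key=sort_key):
        # The stored mode has no leading zero ("40000", not "040000").
        body += mode.lstrip("0").encode("ascii") + b" "
        body += name.encode("utf-8") + b"\0" + bytes.fromhex(hex_hash)

    header = b"tree " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# A tree without entries has a well-known key.
print(tree_hash([]))  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904
```

Since a tree's key is computed from the hashes of its entries, any change to a file changes the key of every tree above it, up to the root.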

Identical Data Is Stored Only Once

To save disk space, identical data is stored only once. In the following example, the contents of foo.txt and copy-of-foo.txt return the same hash, because they are identical:

> git hash-object -w foo.txt
a42a0aba404c211e8fdf33d4edde67bb474368a7

> git hash-object -w copy-of-foo.txt
a42a0aba404c211e8fdf33d4edde67bb474368a7

With this approach, Git not only saves disk space, it also gains performance. Many Git operations are fast because they only compare hashes without looking at the actual data.
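
The principle can be illustrated with a toy content-addressed store in Python. This is only a sketch of the idea, not Git's implementation: the class name ObjectStore and its methods are invented for this example, and the store lives in memory rather than on disk.

```python
import hashlib


class ObjectStore:
    """A toy in-memory store that addresses content by its hash."""

    def __init__(self):
        self._objects = {}

    def write(self, content: bytes) -> str:
        # Same key derivation as for a Git blob (SHA-1 over header + content).
        key = hashlib.sha1(b"blob %d\0" % len(content) + content).hexdigest()
        # Identical content yields the same key, so it is stored only once.
        self._objects.setdefault(key, content)
        return key

    def read(self, key: str) -> bytes:
        return self._objects[key]

    def __len__(self):
        return len(self._objects)


store = ObjectStore()
first = store.write(b"same content\n")   # e.g. foo.txt
second = store.write(b"same content\n")  # e.g. copy-of-foo.txt
print(first == second, len(store))  # True 1
```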

Compressing Similar Content

Git can do more than just store identical file content once. When programmers are constantly creating new versions of files that differ from their predecessors in only a few lines, Git can save these files using a delta method, which stores only the changes relative to another version in pack files.

To trigger this when you want to save space, use the gc command. Git removes any commits that are no longer reachable from any branch head, and stores the remaining objects in pack files. For projects that contain mostly source code, an amazingly high compression is achieved. Often the uncompressed workspace containing only the current version of the project is larger than the Git repository with years of project history.
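
The idea behind a delta can be sketched as follows. This is an illustration only: Git's real pack format works on bytes and uses its own encoding, whereas this example works on lines and uses Python's difflib. The function names make_delta and apply_delta are invented for the example.

```python
from difflib import SequenceMatcher


def make_delta(base_lines, new_lines):
    """Describe new_lines as copy/insert instructions relative to base_lines."""
    ops = []
    matcher = SequenceMatcher(a=base_lines, b=new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged range: only the position in the base is recorded.
            ops.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            # New material has to be stored literally.
            ops.append(("insert", new_lines[j1:j2]))
        # "delete" needs no instruction: the range is simply not copied.
    return ops


def apply_delta(base_lines, ops):
    """Rebuild the new version from the base version and the delta."""
    result = []
    for op in ops:
        if op[0] == "copy":
            result.extend(base_lines[op[1]:op[2]])
        else:
            result.extend(op[1])
    return result


base = ["line %d\n" % i for i in range(1000)]
new = base[:500] + ["a changed line\n"] + base[501:]
delta = make_delta(base, new)
print(len(delta), apply_delta(base, delta) == new)
```

For a file that differs in a single line, the delta consists of a few instructions instead of a thousand lines, which is why histories of slowly changing source files compress so well.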

Is It Bad When Various Files Happen to Get the Same Hash?

That would be bad, because Git identifies content by its hash. Git could therefore provide incorrect data when the contents of various files happen to have the same hash. This is known as a hash collision.

The good news is that a hash collision is an extremely unlikely event. The reason is that there are 2^160 possible hashes. For comparison, after five years the Linux kernel project “only” had about 2^21 objects in its repository.

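In theory, the SHA1 hash algorithm has weaknesses that would allow an SHA1 collision to be found with about 2^51 operations. However, a research project at Graz University of Technology tried from 2007 to 2009 to find one (!) such hash collision, and was terminated without success. (A deliberately constructed SHA1 collision was first published in 2017; collisions that occur by chance remain as improbable as described above.) In summary, it is safe to say that for accidental collisions, security in the context of version control is okay today.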
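
To get a feeling for the numbers, the chance of an accidental collision can be bounded with a short calculation. The sketch below assumes that hashes behave like uniformly random 160-bit values and uses the simple union bound (number of object pairs divided by number of possible hashes); the function name is invented for this example.

```python
from fractions import Fraction


def collision_probability_bound(num_objects: int, hash_bits: int = 160) -> Fraction:
    """Upper bound for the probability that any two objects share a hash."""
    # Each of the n*(n-1)/2 pairs collides with probability 1 / 2**hash_bits.
    pairs = num_objects * (num_objects - 1) // 2
    return Fraction(pairs, 2 ** hash_bits)


# About 2^21 objects, as in the Linux kernel example above.
bound = collision_probability_bound(2 ** 21)
print(float(bound))
```

For 2^21 objects the bound is smaller than 2^-119, which is far below any probability that matters in practice.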

Commits

Commits are stored in the object database. The format is simple:

> git cat-file -p 64b98df0 
tree 319c67d41a0b3f7464550b41db4bb1584939ad2a 
parent 6c7f1ba0828a5b595026e08d2476808105a6b815 
author Bjørn Stachmann <[email protected]> 1295906997 +0100 
committer Bjørn Stachmann <[email protected]> 1295906997 +0100 

Section on trees & blobs.

In addition to metadata, such as the author, committer, date and comment, a commit object contains hashes for other objects in the object database. The tree object describes the contents of the commit. It refers to the project’s root directory and, as described above, is represented by trees and blobs. The parent object represents the previous commit.
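
Because the format is plain text, a commit object is easy to take apart. The following Python sketch parses output like that of cat-file -p into a dictionary. The function name parse_commit is invented for this example, and the sample uses a placeholder name and email address, since the real address is not shown above.

```python
def parse_commit(raw: str) -> dict:
    """Split the text of a commit object into header fields and message."""
    # A blank line separates the header lines from the commit message.
    header, _, message = raw.partition("\n\n")
    commit = {"parents": [], "message": message.strip()}
    for line in header.splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":
            # A merge commit has several parent lines, so collect them in a list.
            commit["parents"].append(value.strip())
        else:
            commit[key] = value.strip()
    return commit


# Placeholder author data; hashes taken from the example above.
SAMPLE = (
    "tree 319c67d41a0b3f7464550b41db4bb1584939ad2a\n"
    "parent 6c7f1ba0828a5b595026e08d2476808105a6b815\n"
    "author A. Author <author@example.com> 1295906997 +0100\n"
    "committer A. Author <author@example.com> 1295906997 +0100\n"
    "\n"
    "Section on trees & blobs.\n"
)

commit = parse_commit(SAMPLE)
print(commit["tree"], commit["parents"], commit["message"])
```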

Object Reuse in the Commit History

Except for the very first commit, every commit has at least one predecessor commit (parent object). Often a commit only changes a few files in a project, and most of the files and directories remain unchanged. Whenever possible Git reuses objects from the previous commit.

Figure 5.3: Reuse of a tree object

Figure 5.3 shows an example. A commit (represented by the rectangle with the header “commit” on the second row from the top, with solid arrows) includes a README file and a src directory containing other files. Then a new commit is created (the rectangle with the header “commit” on the first row, with dashed arrows) in which only the README file has changed. A new blob object is created for README. However, for the src directory, the existing tree object and its associated blobs can still be used.
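
The effect can be reproduced with a small Python model. This is a simplified sketch, not Git's implementation: snapshots are nested dictionaries, all files get mode 100644, names are sorted as plain strings, and the helper names store_object and write_tree are invented for the example.

```python
import hashlib


def store_object(store, kind, body):
    """Store an object under the hash of its type, length and content."""
    header = kind.encode() + b" " + str(len(body)).encode() + b"\0"
    key = hashlib.sha1(header + body).hexdigest()
    store[key] = (kind, body)  # an existing key is simply reused
    return key


def write_tree(store, directory):
    """Store a directory (dict of name -> bytes or nested dict) recursively."""
    body = b""
    for name in sorted(directory):
        node = directory[name]
        if isinstance(node, dict):
            mode, key = "40000", write_tree(store, node)
        else:
            mode, key = "100644", store_object(store, "blob", node)
        body += f"{mode} {name}".encode() + b"\0" + bytes.fromhex(key)
    return store_object(store, "tree", body)


store = {}
src = {"Hello.java": b"class Hello {}\n", "World.java": b"class World {}\n"}

write_tree(store, {"README": b"Version 1\n", "src": src})
print(len(store))  # 5: three blobs and two trees

write_tree(store, {"README": b"Version 2\n", "src": src})
print(len(store))  # 7: only a new README blob and a new root tree were added
```

The src tree and its two blobs hash to the same keys in both snapshots, so the second snapshot adds nothing for them.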

Renaming, Moving and Copying

In many version control systems you can trace the history of file renames and moves. Most often this is achieved by using a special command to move or rename files. In Subversion, for example, you move a file with svn move. However, Subversion is clueless when the user moves a file by dragging and dropping it to a new location. In this case, Subversion knows nothing about the move and instead records a file deletion and the creation of a new file.

Git takes a different approach: it stores no information about which files have been moved. Instead, it employs a rename detection algorithm: If a file is missing from a commit but was still present in the predecessor, Git checks if a file with the same name or very similar content emerges in another location. If this is so, Git assumes that the file has been moved. Figure 5.4 shows this: The foo.txt file that has been moved is missing in the second commit. Git then examines all recently added files for one that has similar content and locates it in src/foo-moved.txt. This is interpreted as renaming.

            sample-workspace/        sample-workspace
                foo.txt                  (foo.txt missing)
                /src                      /src
                    bar.txt                   bar.txt
                                              foo-moved.txt 

            (Commit 1)               (Commit 2)

Figure 5.4: A file is moved
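
The principle of rename detection can be sketched in Python. This is an illustration under simplifying assumptions: snapshots are dictionaries from path to text, similarity is measured with difflib (Git uses its own similarity measure), and the function name detect_renames is invented for the example. The default threshold of 0.5 mirrors Git's default of 50%.

```python
from difflib import SequenceMatcher


def detect_renames(old_files, new_files, threshold=0.5):
    """Pair files that vanished with newly added files of similar content.

    old_files, new_files: dicts mapping path -> file content (str).
    Returns a dict: old path -> (new path, similarity between 0 and 1).
    """
    removed = {p: c for p, c in old_files.items() if p not in new_files}
    added = {p: c for p, c in new_files.items() if p not in old_files}

    renames = {}
    for old_path, old_content in removed.items():
        best_path, best_score = None, threshold
        for new_path, new_content in added.items():
            score = SequenceMatcher(None, old_content, new_content).ratio()
            if score >= best_score:
                best_path, best_score = new_path, score
        if best_path is not None:
            renames[old_path] = (best_path, best_score)
            del added[best_path]  # each added file is matched at most once
    return renames


commit_1 = {"foo.txt": "one\ntwo\nthree\n", "src/bar.txt": "bar\n"}
commit_2 = {"src/bar.txt": "bar\n", "src/foo-moved.txt": "one\ntwo\nthree\n"}
print(detect_renames(commit_1, commit_2))
# {'foo.txt': ('src/foo-moved.txt', 1.0)}
```

Nothing about the move is recorded in the commits themselves; the pairing is computed from the two snapshots whenever it is needed.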

Step by Step

Following renames and moving

Git will show which files have been renamed or moved.

1. Get a summary

You activate rename detection using the log command with the -M option (for “Move”). Add the --summary option to display information about file changes. Because the resulting output can be very long, you can filter it with the grep command. The percentage in each line indicates how similar the source and target files are.

> git log --summary -M90% | grep -e "^ rename"

rename foo.txt => foo-renamed.txt (90%) 
rename src/{before => after}/bar.txt (100%)

2. Track the history of a file that has been moved

Use the log command with the --follow option to continue listing the history of a file beyond renames (works only for a single file). Without this option, the log would end at file renames.

> git log --follow foo-renamed.txt

Step by Step

Tracking down copies

You can also track data that has been copied using the -C option.

> git log --summary -C90% | grep -e "^ copy"

If necessary, you can use the --find-copies-harder option to make Git search more thoroughly, at the cost of a longer running time. If this option is present, Git will examine all files in a commit as possible copy sources, not only those that have been changed.

You can also configure Git so that rename detection is enabled by default. Then you do not need to specify the -M option for each log command you issue. The --follow option is not affected by this setting and must still be given explicitly.

> git config diff.renames true

Step by Step

Determining the origin of a code section

You want to find out who last changed some lines of code and when.

1. Print the origin information line by line

Git can even determine the origin of lines of code when larger sections of the code were copied or moved from other files. The blame command also displays when and by whom a line of code was last modified.

> git blame -M -C -C -C copied-together.txt

f5fdbad0 foo.txt  (Rene  2010-11-14 18:30:42 +0100   1) One,
a5b80903 bar.txt  (Bjørn 2011-01-31 21:32:49 +0100   2) Two or
f5fdbad0 foo.txt  (Rene  2010-11-14 18:30:42 +0100   3) Three

The -M option (M for Move) detects lines that were moved or copied within a file. The -C option also detects lines copied from other files that were modified in the same commit. You can repeat the -C option to extend the search to files in other commits. For large repositories this can take noticeably longer.

Summary

Object database: The files, directories, and metadata for all commits are stored in this database.
SHA1 hash: You can retrieve objects from the object database using an SHA1 hash. An SHA1 hash is a cryptographic checksum of an object's contents.
Identical data is stored only once: Objects with the same content have the same SHA1 hash and are stored only once.
Similar data is compressed: For similar data there is a delta method that stores only the changes.
Blob: The content of a file is stored in a blob.
Tree: A directory is stored in a tree object. A tree object contains a list of file and directory names, each with the SHA1 hash of the blob or tree that holds its content.
Commit graph: Commit objects form a commit graph, along with the tree and blob objects.
Rename detection: File renames and moves do not have to be reported before a commit. Git recognizes them later by examining the similarity of the file contents. Example: git log --follow
Who was it: You can use the blame command to determine the origin of lines of code, even if they have been moved or copied.
