Chapter 15. Combining Projects

There are many reasons to combine outside projects with your own. A submodule is simply a project that forms a part of your own Git repository but also exists independently in its own source control repository. This chapter discusses why developers create submodules and how Git attempts to deal with them.

Earlier in this book, we worked with a repository named public_html that we imagine contains your web site. If your web site relies on an AJAX library such as Prototype or jQuery, you’ll need to have a copy of that library somewhere inside public_html. Not only that, you’d like to be able to update that library automatically, see what has changed when you do, and maybe even contribute changes back to the authors. Or maybe, as Git allows and encourages, you want to make changes and not contribute them back but still be able to update your repository to their latest version.

Git does make all these things possible.

But here’s the bad news. Git’s initial support for submodules was unapologetically awful—for the simple reason that none of the Git developers had a need for them. At the time this book is being written, the situation has only recently started to improve.

In the beginning, there were only two major projects that used Git: Git itself, and the Linux kernel. These projects have two important things in common: they were both originally written by Linus Torvalds, and they both have virtually no dependencies on any outside project. Where they’ve borrowed code from other projects, they’ve imported it directly and made it their own. There’s no intention of ever trying to merge that code back into someone else’s project. Such an occurrence would be rare, and it would be easy enough to generate some diffs by hand and submit them back to the other project.

If your project’s submodules are like that—things you import once, leaving the old project behind forever—you don’t need this chapter. You already know enough about Git to simply add a directory full of files.
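A one-time import of that kind might look like the following minimal sketch. The directory names (upstream, jslib), the file contents, and the committer identity are all hypothetical, invented only for this illustration; the sketch assumes nothing beyond a working git command.

```shell
set -eu

# Hypothetical scratch area standing in for a downloaded copy of the library.
work=$(mktemp -d)
mkdir -p "$work/upstream/jslib"
echo "// library code" > "$work/upstream/jslib/lib.js"

# Your own project: a fresh repository standing in for public_html.
git init -q "$work/public_html"
cd "$work/public_html"

# Import once: copy the files in and commit them like any other files.
cp -R "$work/upstream/jslib" jslib
git add jslib
git -c user.name="Example" -c user.email="example@example.com" \
    commit -q -m "Import jslib snapshot"

# The library files are now ordinary tracked files in your repository.
tracked=$(git ls-files jslib)
echo "$tracked"
```

From this point on, Git keeps no record of where the files came from. That is exactly the trade-off of this approach: the import is simple, but later upstream changes have to be brought in by hand.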

On the other hand, sometimes things get more complicated. One common situation at many companies is to have a lot of applications that rely on a common utility library or set of libraries. You want each of your applications to be developed, shared, branched, and merged in its own Git repository, either because that’s the logical unit of separation or, perhaps, because of code ownership issues.

But dividing up your applications this way creates a problem: what about the shared library? Each application relies on a particular version of the shared library, and you need to keep track of exactly which version. If someone accidentally upgrades the library to a version that hasn’t been tested, they might end up breaking your application. Yet the utility library isn’t developed in isolation; usually people are tweaking it to add new features that they need in their own applications. Eventually, they want to share those new features with everybody else writing other applications; that’s what a utility library is for.

What can you do? That’s what this chapter is about. I discuss several strategies—although some people might not dignify them with that term, preferring to call them hacks—in common use and end with the most sophisticated solution, submodules.

The Old Solution: Partial Checkouts

A popular feature in many version control systems, including CVS and Subversion, is called a partial checkout. With a partial checkout, you choose to retrieve only a particular subdirectory or subdirectories of the repository and work just in there.[28]

If you have a central repository that holds all your projects, partial checkouts can be a workable way of handling submodules. Simply put your utility library in one subdirectory and put each application using that library in another directory. When you want to get one application, just check out two subdirectories (the library and the application) instead of checking out all directories: that’s a partial checkout.

One benefit of using partial checkouts is that you don’t have to download the gigantic, full history of every file. You download just the files you need for a particular revision of a particular project. You may not even need the full history of just those files; the current version alone may suffice.

This technique was especially popular in an older version control system, CVS. CVS has no conceptual understanding of the whole repository; it only understands the history of individual files. In fact, the history of the files is stored in the files themselves. CVS’s repository format was so simple that the repository administrator could make copies and use symbolic links between different application repositories. Checking out a copy of each application would then automatically check out a copy of the referenced files. You wouldn’t even have to know that the files were shared with other projects.

This technique has its idiosyncrasies, but it has worked well on many projects for years. The KDE (K Desktop Environment) project, for example, encourages partial checkouts with its multi-gigabyte Subversion repository.

Unfortunately, this idea isn’t compatible with distributed version control systems like Git. With Git, you don’t just download the current version of a selected set of files; you download all the versions of all the files. After all, every Git repository is a complete copy of the repository. Git’s current architecture doesn’t support partial checkouts well.[29]
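The point that a clone carries every version, not just the current one, can be checked with a small experiment. This is only a sketch: the repository names (origin, copy), the file, and the committer identity are hypothetical, and the repository is tiny, but the same behavior applies at any size.

```shell
set -eu

work=$(mktemp -d)
git init -q "$work/origin"
cd "$work/origin"

# Record three versions of one file (hypothetical content).
for n in 1 2 3; do
    echo "version $n" > file.txt
    git add file.txt
    git -c user.name="Example" -c user.email="example@example.com" \
        commit -q -m "Version $n"
done

# Clone the repository; the copy receives the whole history, not a snapshot.
git clone -q "$work/origin" "$work/copy"

# Count the commits reachable from HEAD in each repository.
origin_count=$(git -C "$work/origin" rev-list --count HEAD)
copy_count=$(git -C "$work/copy" rev-list --count HEAD)
echo "origin: $origin_count commits, clone: $copy_count commits"
```

Both counts come out the same, because the clone holds every commit the original does. In a multi-gigabyte repository, that is precisely the cost described above.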

As of this writing, the KDE project is considering a switch from Subversion to Git, and submodules are their main point of contention. An import of the entire KDE repository into Git is still several gigabytes in size. Every KDE contributor would have to have a copy of all that data, even if they wanted to work on only one application. But you can’t just make one repository per application: each application depends on one or more of the KDE core libraries.

For KDE to successfully switch to Git, it needs an alternative to huge, monolithic repositories using simple partial checkouts. For example, one experimental import of KDE into Git separated the code base into roughly 500 separate repositories.[30]



[28] In fact, Subversion cleverly uses partial checkouts to implement all its branching and tagging features. You just make a copy of your files in a subdirectory and then check out only that subdirectory.

[29] Actually, there are some experimental patches that implement partial checkouts in Git. They aren’t yet in any released Git version and may never be. Also, they are only partial checkouts, not partial clones. You still have to download the entire history even if it doesn’t end up in your working tree, and this limits the benefit. Some people are interested in solving that problem, too, but it’s extremely complicated—maybe even impossible—to do right.
