Modern Version Control With Git
In this article, we will cover the basic background information for understanding how — and more importantly, why — to use Git. Later, we will take a closer look at Git’s features, including branching and merging, and discuss how to use it in your own design and development projects.
Git: Born Of Necessity
Linus Torvalds was unsatisfied: none of the version control systems (VCS) available in 2005 met his requirements. Once the proprietary version control system BitKeeper changed its license agreement, it couldn’t be used to manage the Linux kernel project anymore. An alternative had to be found, one that was distributed, scalable and — above all — fast.
The Linux community took action by starting two new projects: Git and Mercurial. Both have their origins in this emergency, and they are among today’s leading distributed version control systems. Git is now used in countless well-known open-source projects: the Linux kernel, jQuery, Ruby on Rails, Symfony, CakePHP, Debian, Fedora, Perl and many more. The large number of tutorials and tools, including desktop clients, shows how important Git has become.
Centralized Vs. Distributed
Git (like Mercurial) is a “distributed” version control system (DVCS). The classic systems like Subversion and CVS, in contrast, function as centralized systems (CVCS).
In centralized systems, there is only one “master” repository, which every developer feeds their changes into. Every action must be synchronized with this central repository. And because it usually resides on a central server, each action has to pass through the network — leaving a developer unable to work if they happen to have no network connection.
In distributed systems, each developer has their own full-fledged repository on their computer. In most set-ups there’s an additional central repository on a server that’s used for sharing. However, this is not a requirement; every developer can perform all important actions in their local repository: committing changes, viewing differences between revisions, switching branches, etc.
Git’s Advantages
One of Git’s main advantages is its distributed nature. It doesn’t matter whether you’re using a complex set-up with multiple remote repositories or you have just one central server to share code (working “Subversion style”). A DVCS can be used independently of any one person’s workflow. Being able to work offline is an important advantage of DVCS for many developers. You can work without constraints, even if you’re not connected to the network.
Speed is another important factor, and the differences between Git and other DVCS here are evident. In almost any situation, Git is faster than other modern systems, such as Mercurial and Bazaar. One of the reasons for Git’s remarkable speed is that it was written in C. Another reason is that it was designed to work with the Linux kernel and therefore has to perform well even under huge amounts of data.
Another convenience: every local Git repository can serve as a full-fledged back-up, because it contains the project’s complete history. And considering that almost every action in Git only adds data, losing data is pretty hard to do.
The biggest advantages, however, lie in Git’s feature set: in how it deals with code and in its tools and workflows. We’ll take a closer look at things like the staging area, the stash and the concept of branching later on.
The Local Git Repository
In SVN, every directory that’s under version control is assigned a hidden .svn folder, which saves all relevant meta data for that directory. Have you ever (inadvertently, of course) deleted or moved this magical folder? If so, then you’ll appreciate that Git has only one of these folders. This means you can move, delete or rename in your favorite editor or file browser as you wish, without any headaches the next morning.
This folder (.git) resides in the root directory of your project and makes up your local repository. The actual files you work with comprise your so-called “working directory” (or “working tree”).
In addition to the .git repository and the working directory, the third crucial part is the so-called “staging area.” This enables you to precisely define which changes you want to have in your next commit — even down to individual lines in a file.
Let’s consider this more carefully, because it’s one of Git’s best features. Say you have modified 10 files, and you realize that splitting these changes into two separate commits would be best (because every commit should contain only related changes and not be a hodgepodge). Using the staging area, you can define exactly which changes to commit and which to leave for a later commit.
Technically, the staging area is nothing more than a file named index that lies in your local .git repository. That’s why it’s sometimes referred to as the index.
States Of Files
In Git, a file in your working directory can be in one of several states. The most basic distinction is between “tracked” and “untracked.” A file is tracked if it’s already under version control; in this case, Git observes all changes in that file. If it’s not (yet) saved in your Git repository, then it’s treated as untracked.
Tracked files can be in one of the following three states:
- Unmodified or committed. The file is in its last committed state and therefore has no local modifications.
- Modified. The file has been changed since it was last committed.
- Staged. The file was not only changed but also added to the staging area, meaning that the changes will be included in the next commit. Because a file can be staged partially, it’s entirely possible for it to be both staged and modified.
A Basic Workflow
A typical workflow in Git usually consists of the following steps:
- You modify, create or delete files in your working directory. Your favorite editor or file browser is perfectly suited to this job.
- You execute the
git status
command in the command line to see an overview of what has been changed. People use this command rather frequently in their workflow to stay on top of things. - You add all changes for the next commit to the staging area using the
git add
command. - Finally, you execute
git commit
to save the staged changes in your local repository.
Hashes Instead Of Revision Numbers
I’ll try to break this to you gently: Git doesn’t understand revision numbers.
In a CVCS such as Subversion, every commit is assigned a consecutive revision number. This doesn’t work in DVCS anymore. Because commits are created locally, the system can’t assign consecutive numbers. Imagine the developers on a team working on their own, producing commits locally, and then publishing their work on a shared remote repository as they go along. When an individual commit is created locally, there’s no way for Git to foresee the eventual order.
So, Git uses SHA-1 hashes to identify commits (and all other objects) internally. SHA-1 hashes are 40-character checksums that are unique, like conventional revision numbers, but with the benefit of being compatible with DVCS.
Installation And Tools
Thanks to graphical installers and package managers, installing Git has become very easy. The steps on all major platforms are explained in detail in Scott Chacon’s free eBook Pro Git.
Typically, Git is used from the command line. In many scenarios, however, a desktop client (like Tower on Mac OS [author’s product] or Tortoise Git on Windows) can make life a lot easier.
Improve Development Quality
“Complex” is often used to describe the Git version control system (VCS). At least compared to classic VCS’ like Subversion, Git does indeed have a steeper learning curve.
When inviting people to learn a “complex” new technology, you’ll hardly get volunteers. But what if the technology could improve software quality and maybe even your own way of developing software? Git is such a technology for which investing time is worth it. Moreover, desktop clients such as Tower for Mac OS (disclaimer: this is the author’s product) and Tortoise Git for Windows make a lot of the tasks easier.
Installation And Help
If a recent version of Git is not installed on your machine, you can quickly and easily catch up by, for example, reading the instructions in Scott Chacon’s free eBook Pro Git. I’ll mention commands only briefly in this article. To get more detailed descriptions and parameter listings, you can use git help <command>
in the command line or browse the GitRef online.
Creating And Cloning Repositories
Having a repository is the most basic requirement for working in Git. If there is not one for your current project or if you’re starting afresh, then create a new repository in the current location by executing git init
in the command line. If a remote repository already exists on a server, then you can get it via git clone
on your machine. You can learn more about git clone
on the GitRef website.
Committing Changes
After working for some time, you’ll have a couple of new, deleted or modified files that you’ll want to save to your local repository as commits. As a first step, we’ll use git status
to show us which changes we currently have in our working directory.
To add certain changes to the next commit, you have to explicitly add them to Git’s “staging area.” The command git add
is used for this purpose. Let’s look at a concrete scenario:
git add
The paragraph that begins “Changes to be committed” lists all of the files that will be included in the next commit. The changes had to be added to the staging area via git add
.
The paragraph that begins “Changes not staged for commit” lists all of the files that have been changed but that have not been added to the staging area. Therefore, they won’t be included in our next commit but will remain simply as changes in our working directory.
The last paragraph lists “Untracked files,” files that aren’t under version control yet. In other words they are unknown to Git.
You might have spotted a little peculiarity: error.html is listed twice! That’s because we staged some of the changes in this file, while leaving other changes in the same file unstaged. This feature — staging individual files or even parts of a file — enables us to create extremely granular commits that really only contain related changes.
git commit
After having composed the commit exactly as we want it to be, we can save it to our local repository via git commit
.
Commit History
Once a repository contains a couple of commits, we should look into its history (or “log”). The git log
command gives us an overview of the last few commits. Git log’s default output shows each commit with its SHA-1 hash (which is the equivalent of a revision number in a classic centralized VCS, or CVCS), its author, the date and the message used.
Customizing git log
You can bend the presentation of git log
almost completely to your will. Using the parameter –oneline
, for example, reduces the output to a single line per commit. Using -p
adds the “diff” (or patch) to every commit, thereby showing you what detailed changes were introduced. Customization can be taken to the point where you have your very own output format:
Branching And Merging
Branching is often referred to as Git’s killer feature, and rightly so. If you already know the term from another VCS like Subversion, then take an unbiased, fresh look at the topic. Using branches in Git is extremely easy and fast because they have been a core feature since the very beginning. They’re certainly one of the most important tools in a developer’s daily work. But what are they actually used for?
Branches allow you to cleanly separate variants or features in your code base. “Separation of concerns” is a good term for this.
When To Use Branching
Take the following scenario. At a certain point, you decide to develop a new feature. To separate it from your other development work, you create a new branch, based on the current status. The git branch
command creates the new branch, while git checkout
makes it your current working branch.
After having implemented a couple of new features, you commit the current status to this branch (C2 in the sketch below). But you haven’t yet finished developing the new feature when an urgent bug report forces you to continue working on your code’s main line (your production or “live” code).
After having switched back (via git checkout
) to your master branch, you commit the fix for the problem (C3). Again, you switch to your feature branch and commit some more changes (C4) to complete the development for this feature. Finally, you use git merge
to unite the two branches again (C5).
Imagine this scenario without branches. Because you already had changes that belonged to an unfinished, not yet releasable feature, you would have been mixing two different concerns (the experimental feature and the bug fix in your production code).
With only one feature, the problem might not have been major. But in larger projects, with numerous parallel features and development stages, such processes quickly become confusing and error-prone without branches.
Of course, Git is not the only VCS that uses the concept of branches, but it does make branching easier, faster and more effortless than any other system. As soon as you discover branching in Git, switching your active branch a dozen times a day or creating a new branch for every bug fix, however small, becomes totally run of the mill. Greater clarity and a clear separation of code is worthwhile in the long run.
Terms Related To Branching
The currently active branch is called “HEAD” in Git. Switching the HEAD is done with the git checkout
command (not related to the checkout
command in SVN). Your working directory will then contain the files that belong to this branch (or, more precisely, that belong to the last commit in this branch).
Sharing Code: Working With Remote Repositories
Remote repositories are used to make your code available to teammates and to integrate code from other developers into your local repository. With git remote add
, you can connect any number of remotes to your local repository.
To publish local commits from your current branch to a remote repository, use the git push
command.
Integrating remote changes into your local repository, on the other hand, is done via git fetch
. This downloads all of the changes from the remote server to your local computer, but it does not modify the files in your working directory in any way!
You can decide for yourself when to integrate the downloaded changes into your current HEAD via git merge
. Alternatively, you can use git pull
to combine downloading and merging into one step.
As with most things, and as anyone who has worked with Git for a while knows, there’s more than one way to skin a cat. A lot of tasks can be performed with a couple of basic commands. However, a few advanced concepts and tricks will sometimes help you achieve your goals more elegantly.
Advanced Concepts
Stash: The Clipboard
The “stash” is one such feature. In some situations, a “clean” working directory is recommended (or even essential). That means there shouldn’t be any local changes (for example, when switching the current branch).
Imagine the following scenario. You’ve been working on a new feature for several hours, when suddenly a critical bug report comes in. Of course, you’ve already changed a couple of files. But now you must switch branches to be able to work on your current production code. You could simply commit your changes — but they’re only half done (and committing stuff that is only half done is bad karma!).
The stash helps you solve precisely this dilemma. All current changes are saved on this clipboard, and your working directory is left in a clean condition. As soon as you’re done fixing that bug, you can return to working on your feature — and simply restore all stashed changes.
Staging Parts Of Files
A large commit that mixes a lot of different topics is hard for other developers to understand, and rolling back problems will be hard if problems should occur. That’s why creating granular commits that contain only related changes is so important in version control.
Git helps you do this by enabling you to add parts of a changed file to the staging area. If you execute git add
with the -p
parameter, Git lets you choose for every part of the file whether you want to stage it or not. This way, you can control very precisely which changes should go into your next commit — and which should remain for a later commit.
Tracking Branches
If you’ve already glanced at the configuration file of one of your local Git repositories (.git/config), you might have spotted one of these sections:
Git saves some meta data about the relationship between two branches; in this case, our local “master” branch tracks the same-named branch on the remote “origin.” This meta data is used by a couple of commands in Git, such as push
, pull
and status
.
In general, though, you don’t have to worry about managing all of this meta data. If you create a new local branch based on a remote branch, Git will set up the tracking relationship for you.
Undoing Things
Most mistakes that you make in Git can be corrected pretty easily.
Let’s take a simple case. You have mistyped your last commit message and now want to correct this typo. Git offers an –amend
parameter for its commit
command. This will overwrite the last commit and make it look as if your little mistake never happened. Amending also allows you to change the set of committed files by adding and removing items to and from the commit. But remember one golden rule: don’t amend commits that you’ve already pushed to a remote!
The revert
command lets you “take back” a commit (and this time it doesn’t have to be your most recent commit). Reverting, however, will not delete any commits. Quite the opposite: a new commit will be created that reverses the effects of the corresponding commit.
The reset
command is useful if you truly regret your most recent commit(s). It takes advantage of the fact that branches are really nothing more than pointers to a certain commit. This command rolls the pointer back to an older commit. In fact, reset
will not even delete any commits; but your project’s history will look like it has done exactly that.
Integrating Selected Commits
Usually, you would integrate changes into a branch by merging with another one. In those rare cases where a merge is undesired, Git offers an alternative with the cherry-pick
command. Instead of integrating a complete branch (when merging), cherry-pick
allows you to integrate any desired commit. You can even integrate multiple selected commits in one go — but remember to start with the oldest one to avoid problems.
GUI applications like Tower make tasks like these a lot easier by allowing you to simply drag and drop the desired commits.
Rebase Instead Of Merge
The most common method to integrate one branch into another is to perform a “merge.” For an ordinary three-way merge, Git takes the endpoints of the branches to be merged and the common parent commit as the basis for the integration. This results in a so-called “merge commit” that connects both branches like a knot.
A “rebase” is an alternative to an ordinary merge. A rebase does not result in a separate merge commit and therefore produces no “bumps” in your project’s history. It will look as if the history has run linearly and all commits have happened on the same branch.
Let’s look at a concrete scenario to better understand what a rebase does:
Here, branch_B is our current HEAD branch. If we execute git rebase branch_A
, the following things will happen. First, all new commits (C2 and C4) that originated after the last common commit (C1) will be temporarily removed. Now, branch_A’s new commits will be applied on branch_B. This means that both branches are now on the same position: on branch_A’s position.
Right at the beginning, branch_B’s new commits were removed temporarily. Now is the time to reapply them: one after the other, and in their original order.
In the end, by rebasing the branches, no merge commit was necessary, and the history has remained linear.
Rewriting History
A handful of commands in Git will change a project’s history. In addition to amending a commit, rebase
also falls into this category.
Note that the reapplied commits in our sample scenario aren’t completely identical to the original commits (which is why they’re named C2’ and C4’). The commit SHA-1 has changed because we rewrote history with the rebase
.
Always follow this golden rule when using commands that change your history: local commits that haven’t been published can safely be changed by using rebase
or amend
. However, if they’ve already been pushed to a remote repository, you should not use these tools anymore. Your teammates will thank you for it.
Housekeeping In Git
A Git repository can accumulate quite a number of objects over its lifetime, be they commits, files or file trees. Organizing these objects optimally is crucial to keeping Git fast. The git gc
command (where gc
stands for “garbage collect”) was made for just this purpose. Although it’s executed automatically in the background when running certain commands, running gc
from time to time is still a good idea (preferably with the –aggressive
parameter, to ensure the best result).
Desktop Tools And External Code Hosting
Tools
If you spend a lot of time in the command line, you can spice things up by using plug-ins. Nice little helpers include Tab auto-completion for branch names, and directly displaying the current branch in your shell’s prompt. Plug-ins are available for Bash and zsh.
As alternatives to the command line, various desktop clients might be worth a look. A lot of tasks can be performed more easily and comfortably using a graphical user interface — if not only to not have to memorize all of the commands and parameters. Windows users might want to look at Tortoise Git, while Mac OS users can try Tower (disclaimer: this is the author’s product).
Repository Hosting
More and more companies and individual developers are opting not to host their own code anymore. Providing and maintaining expensive server infrastructure is not everyone’s cup of tea: special know-how is necessary, ressources are tied up, and a high degree of security and availability must be guaranteed.
Meanwhile, some companies are already offering code hosting as “software as a service.” Some of the most popular ones currently are GitHub, Beanstalk and Codebase. If hosting your code externally is an option, then check out one of these services.
Conclusion
Git is an extremely versatile tool. In its early days, it consisted of more than 140 binaries that could be combined flexibly with one another. While using Git has become much easier with recent versions, it has managed to maintain this flexibility. As a result, Git can be viewed as a toolset for creating your very own version control workflow.
But the advanced tools are not all that make Git interesting. Its extraordinary speed and unique branching concept, for example, are great to have, too.
Further Reading
- Moving A Git Repository To A New Server
- S(GH)PA: The Single-Page App Hack For GitHub Pages
- How To Deploy WordPress Plugins With GitHub Using Transients
- Ultimate Round-Up For Version Control with Subversion