2006-11-03

Bazaar vs. Mercurial: An unscientific comparison

Two things are a constant with me: I have some personal programming project going at all times, and all of my coding takes place on my laptop. That means I need a VCS (or an RCS, SCM system, or some other buzzword term) that I can use when I don't have a Net connection.

That has led me to look at distributed VCSs for personal use and to compare two Python-implemented ones: Bazaar and Mercurial (a.k.a. bzr and hg). Since not everyone might know what the "distributed" part means but most likely knows how Subversion or CVS works (and if you are in the latter camp but not the former, for the love of something you care about please upgrade!), I am going to start with a quick distributed VCS primer.

The biggest difference between a client/server VCS (à la svn) and a distributed one is that branches are used a lot more. In svn, branches are used when you want to do some long-term development on something. But in a distributed VCS you create one for every individual bug that you are working on. This provides a nice level of isolation between the disparate things you might be working on. Because of all this branching, distributed VCSs try to make branching a cheap operation.

And one way to make branching cheap is to do it locally. The "distributed" part of these tools comes from the fact that they are designed to work offline. This lets you commit changes and such to your local disk without requiring Internet access to hit a central repository online (but you can commit your changes to an online repository later when you are ready to). This is really handy if you develop on a laptop; it really sucks to have to do a bunch of coding remotely and then commit one huge patch instead of committing a bunch of individual patches. You want the atomicity of commits that represent a single piece of semantic change to allow better tracking of how/why your code changed, not some blob of code that changes a bunch of things at once. Plus it allows for easier rollback if you accidentally introduced a bug.

Finally, the other big difference for these tools is that since they expect everyone to be working from a branch they use branches as a way to share code changes between each other. From the view of Python development, I might end up working on a branch that adds the new Whizbang feature to Python while some other developer does the Doohickey feature which interacts with my feature. We could share our work by telling each other where our repositories were online and then constantly pulling and pushing patches from each other (although having a repository online is not required).

This does lead to a shift in development compared to svn in terms of patches. You end up pushing what are called changesets between branches. You can view a changeset as basically a bundle of patches. This is rather different (at least for me in terms of Python development) when compared to using svn, as I am used to getting a big diff file, looking at the diff, applying the diff, possibly looking at the patched source and tweaking it, running the test suite, and then committing. The distributed VCSs want you to view it more at a commit level and not as a huge diff file. This does change how one pushes around changesets, though, as not everyone has a place online to push an online version of a branch. Luckily both tools at the bare minimum support generating a diff file as needed.

Hopefully that explanation is clear. If not, Bazaar has an explanation, as does Mercurial, from the perspective of a cvs/svn user. Once you have a handle on what these tools are trying to do, hopefully this comparison will make some sense. =)

To begin, let's introduce Bazaar (commonly known as bzr). Created by Martin Pool, it is now a project developed by Canonical (of Ubuntu fame) with Martin as the lead. It is written entirely in Python. This comparison was done against version 0.12.

Mercurial was started by Matt Mackall of Selenic (although I view Bryan O'Sullivan as co-front man, as I have met him in person and he has always answered my emails himself). Mercurial is mostly written in Python with some key parts of the code written as an extension module. Version 0.9.1 was used in this comparison.

When comparing the two there is not much difference in terms of commands. The only glaring difference is what you do after you pull a new changeset into a branch. In bzr it gets applied and everybody is happy. But in hg you have to do either an explicit update or merge step. I found out this is so that you can have more fine-grained control over what version you update to. Say you have a branch at revision 100 and you pull in a changeset that has a commit against revision 101. By making the application of the changeset a separate step you can have only changesets applied to revision 100, make sure everything is happy, and then pull in the new stuff to help minimize possible change conflicts. There is an extension, though, that will automate this stuff for hg and act more like bzr (called 'fetch').
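
As a rough illustration of that difference, here is a toy model in Python. It is not how either tool is implemented; the ``ToyRepo`` class, its methods, and the changeset names are all made up for this sketch. The only point is that an hg-style pull records new changesets without touching the working copy, while a bzr-style pull applies them right away.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRepo:
    """Toy model: known changesets plus the revision the working copy is at."""

    changesets: list = field(default_factory=list)
    working_rev: int = 0  # how many changesets are applied to the working copy

    def pull(self, incoming, apply=False):
        # Record the incoming changesets in the local history.
        self.changesets.extend(incoming)
        if apply:
            # bzr-like behaviour: the working copy follows immediately.
            self.update()

    def update(self, rev=None):
        # hg-like explicit step: pick the revision to move the working copy to.
        self.working_rev = len(self.changesets) if rev is None else rev


# hg-like: pull first, then decide how far to update.
hg_like = ToyRepo(changesets=["c1", "c2"], working_rev=2)
hg_like.pull(["c3", "c4"])   # history grows, working copy stays at 2
hg_like.update(rev=3)        # move forward one changeset only

# bzr-like: pulling also applies everything that came in.
bzr_like = ToyRepo(changesets=["c1", "c2"], working_rev=2)
bzr_like.pull(["c3", "c4"], apply=True)

print(hg_like.working_rev, bzr_like.working_rev)
```

The separate ``update`` call is what gives the finer control: you can stop at an intermediate revision, check that everything is happy, and only then move on.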

The other huge difference is how they handle pushing and pulling changesets to a remote repository. Hg has you install it on the server and then uses SSH to tunnel the data and commands across. There is also support for pulling over HTTP. Bzr, on the other hand, has no built-in support for pushing (yet), but they do support pulling over HTTP. If you look at the bzr web site you will notice they suggest rsync for a naive way of pushing changesets. They do have support using SFTP if you have Paramiko installed (which itself requires pycrypto). Lastly, there is support for using SSH, but it is undocumented (use the ``bzr+ssh`` protocol and then just follow the normal SFTP instructions) and it still requires Paramiko (although that is unneeded and should be removed as a dependency soon).

One last difference that is more a tech thing than how you use the tools is bzr's shared repository. Basically a shared repository stores common metadata between branches in a single location in order to cut down on disk usage. In order to use the feature you just need to run a single command in the directory above where you store your branches. Because of its simple usage I ended up adding a comparison with bzr using shared resources along with vanilla bzr.

To start this comparison I am going to talk about the benchmarking results first and then follow with a discussion of the warm, fuzzy stuff (docs, community, etc.).

First, a word about my benchmarking numbers: they are not exactly scientifically rigorous. I didn't run the benchmarks multiple times and average them, or try to set up my laptop in a clean room setting so that other apps were not running in the background. What I did do was run them back-to-back while I left my laptop alone (short of tapping the touchpad to prevent the screensaver from kicking in). And the timings are consistent with what I noticed when I constantly ran my tests.

To handle all of this I wrote a Python script that acted as a replay driver. Each VCS had a command module that listed the various commands required to get it to do what I needed it to do (e.g., commit, pull, etc.). I logged the info I cared about and dumped it all to pickle files. I then wrote another script to output a reST document for easy comparison of the results. All of this was run against a pydebug version of Python from svn. In other words all reported numbers should only be viewed in terms of relative performance comparison and not what absolute performance would be like.
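
For a feel of what such a replay driver can look like, here is a minimal sketch. It is not the actual script: the command tables, step names, and function names are assumptions made for illustration, and only a few steps are shown. The subcommands themselves (``hg init``, ``hg commit -m``, ``bzr init-repo``, and so on) are real, but the exact arguments the original modules used are not known.

```python
import pickle
import subprocess
import sys
import time

# Hypothetical per-VCS command tables; the real command modules may differ.
HG = {
    "init": ["hg", "init", "main"],
    "commit": ["hg", "commit", "-m", "benchmark commit"],
    "pull": ["hg", "pull"],
    "update": ["hg", "update"],  # the extra step hg needs after a pull
}
BZR = {
    "prep": ["bzr", "init-repo", "."],  # only for the shared repository run
    "init": ["bzr", "init", "main"],
    "commit": ["bzr", "commit", "-m", "benchmark commit"],
    "pull": ["bzr", "pull"],
}


def run_timed(argv, cwd=None):
    """Run one command and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(argv, cwd=cwd, check=True, stdout=subprocess.DEVNULL)
    return time.perf_counter() - start


def replay(steps, commands, cwd=None):
    """Time each named step in order, skipping steps the tool does not define."""
    return {
        step: run_timed(commands[step], cwd)
        for step in steps
        if step in commands
    }


def save_results(results, path):
    """Dump the timings to a pickle file for a later reporting script."""
    with open(path, "wb") as fh:
        pickle.dump(results, fh)


if __name__ == "__main__":
    # Stand-in command so the sketch runs without either tool installed.
    demo = {"init": [sys.executable, "-c", "pass"]}
    print(replay(["prep", "init"], demo))
```

Keeping the commands in plain tables means the driver itself never needs to know which tool it is timing; a step such as ``prep`` that only one configuration defines is simply skipped for the others.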

The code I used to run this (and a copy of the reST document for my results) can be found here. You will need to change the server location since the address uses an SSH alias I have for my personal server on top of pointing to a specific location on the server. And you can say how I fouled something up and how something else can be done better, but I don't guarantee I am going to care after having put so much time and effort into this already. =)

Here are the basic steps I had each VCS do after cleaning up the environment from any previous run:
  1. Any prep step needed (only used by bzr's shared repository).
  2. Create and initialize a repository (named 'main').
  3. Add 100 files, each 1000 lines long and commit them.
  4. Append a line to all the files and commit.
  5. Create a repository on a server.
  6. Push the 'main' repository to the server.
  7. Create a 'pristine' repository from the remote repository.
  8. Create a 'branch' repository by cloning the 'main' repository locally.
  9. Append a line to every file in the 'branch' repository and commit.
  10. Push changeset from 'branch' repository to remote repository.
  11. Update 'pristine' repository by pulling latest changeset from the remote repository.
Yes, it is contrived, as you usually don't update 100 files at once, but then again a handful of files can easily total over 100 changed lines. What it really comes down to is that the above was just very easy to write and configure (getting all of the reporting to work the way I wanted, though, is something else =).
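
The file-related steps (3, 4, and 9) are simple enough to sketch in a few lines of Python. This is an illustration under assumptions, not the original code: the file names, line contents, and function names are invented here, and the VCS commands that would run between the steps are left out.

```python
import tempfile
from pathlib import Path

NUM_FILES = 100
LINES_PER_FILE = 1000


def create_files(repo_dir):
    """Step 3: write NUM_FILES files of LINES_PER_FILE lines each."""
    repo_dir = Path(repo_dir)
    repo_dir.mkdir(parents=True, exist_ok=True)
    for i in range(NUM_FILES):
        lines = (f"file {i} line {n}\n" for n in range(LINES_PER_FILE))
        (repo_dir / f"file_{i:03d}.txt").write_text("".join(lines))


def append_line(repo_dir, text="appended line\n"):
    """Steps 4 and 9: append one line to every file."""
    for path in sorted(Path(repo_dir).glob("file_*.txt")):
        with path.open("a") as fh:
            fh.write(text)


# Try it in a throwaway directory instead of a real repository.
with tempfile.TemporaryDirectory() as tmp:
    create_files(tmp)
    append_line(tmp)
    files = sorted(Path(tmp).glob("file_*.txt"))
    line_counts = {len(p.read_text().splitlines()) for p in files}
    print(len(files), line_counts)
```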

So, how did the two VCSs do? In terms of execution time for local disk commands, hg is the faster of the two by a good amount (I am only bothering to report the faster of the two bzr versions for brevity; download the tar.bz2 file to see the full output):
  • repository initialization: hg 2.8x faster
    • hg: 0.67 seconds
    • bzr: 1.93 seconds
  • adding files for committal: hg 2.4x faster
    • hg: 1.12 seconds
    • bzr shared repository: 2.73 seconds
  • committing new files: hg 3x faster
    • hg: 4.08 seconds
    • bzr: 12.36 seconds
  • commit an append line to every file: hg 2.3x faster
    • hg: 8.63 seconds
    • bzr: 20.1 seconds
  • clone a repository: hg 4x faster
    • hg: 3.23 seconds
    • bzr shared repository: 12.92 seconds
  • committing in the cloned branch: hg 2x faster
    • hg: 10.6 seconds
    • bzr shared repository: 21.79 seconds
But where the performance numbers really differed was anything dealing with networking. Now granted, these numbers are subject to a lot of variance: network connectivity, whether the server was being hit to serve a web page, etc. But the speed differences are not minuscule and seemed consistent:
  • Initialize remote repository: hg 2.8x faster
    • hg: 5.71 seconds
    • bzr: 16.43 seconds
  • Push local repository to remote repository: hg 7.1x faster
    • hg: 11.38 seconds
    • bzr shared repository: 80.99 seconds
  • Branch from remote repository: hg 1.9x faster
    • hg: 8.16 seconds
    • bzr shared repository: 15.62 seconds (vanilla was 52.85 seconds)
  • Push appended line change to server: hg 10.1x faster
    • hg: 5.36 seconds
    • bzr: 54.62 seconds
  • Pull appended line change from server: hg 1.8x faster
    • hg: 9.65 seconds
    • bzr shared repository: 17.69 seconds (46.89 seconds for vanilla bzr)
Something to realize about these network numbers is that they are much better than what I originally was benchmarking. I originally was creating 1000 files of 100 lines each, but bzr was taking over 10 minutes to push the initial version to the server, and when you have to do that constantly because of bugs in your code from introducing new reporting stats and such, it gets old really fast.

One interesting thing is how much faster the shared repository run of bzr was compared to the vanilla one. The remote branching, for instance, was 3.3x faster with the shared repository than without it.
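
For anyone who wants to check the multipliers, the arithmetic is just the slower time divided by the faster one. The sketch below recomputes a few of them from the timings listed above; the dictionary layout and the function name are mine. Judging from the numbers, the multipliers appear to be truncated to one decimal place rather than rounded (1.93 / 0.67 is about 2.88, reported as 2.8x), so the sketch truncates as well.

```python
from decimal import Decimal, ROUND_DOWN

# (faster seconds, slower seconds), copied from the timings reported above.
TIMINGS = {
    "repository initialization (hg vs. bzr)": ("0.67", "1.93"),
    "clone a repository (hg vs. bzr shared)": ("3.23", "12.92"),
    "push to remote (hg vs. bzr shared)": ("11.38", "80.99"),
    "push appended line (hg vs. bzr)": ("5.36", "54.62"),
    "remote branch (bzr shared vs. vanilla bzr)": ("15.62", "52.85"),
}


def speedup(fast, slow):
    """Return slow/fast truncated (not rounded) to one decimal place."""
    ratio = Decimal(slow) / Decimal(fast)
    return ratio.quantize(Decimal("0.1"), rounding=ROUND_DOWN)


for name, (fast, slow) in TIMINGS.items():
    print(f"{name}: {speedup(fast, slow)}x faster")
```

``Decimal`` is used instead of floats so that a ratio that is exactly 4 (12.92 / 3.23) cannot slip to 3.9 through floating-point error before truncation.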

But speed is not everything; there is disk usage to consider since every change needs to be stored somewhere. Here I will report all three configurations using ``du -k -s``:
  • Total size of all local disk repositories: bzr shared repository 1.07x smaller
    • hg: 2468 KB
    • bzr: 3932 KB
    • bzr shared repository: 2292 KB
As you can see, using the shared repository makes a big difference for bzr (1.7x smaller between the two versions). But compared to hg the difference is minuscule; I had to add decimal places I left out in the performance multipliers just so the difference showed up.
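
If you want to take a similar measurement from Python instead of the shell, something like the following gets close. It is only an approximation of ``du -k -s``: it sums file sizes rounded up to whole kilobytes, whereas ``du`` reports the disk blocks actually allocated (directories included), so the totals will not match exactly. The function name and the demo files are made up for this sketch; the KB figures at the end are the ones reported above.

```python
import os
import tempfile


def dir_size_kb(root):
    """Approximate `du -k -s`: total size of all files under root, in KB."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            size = os.path.getsize(os.path.join(dirpath, name))
            total += -(-size // 1024)  # round each file up to a whole KB
    return total


# Small self-check on a throwaway directory (2048 bytes + 1 byte).
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "a.bin"), "wb") as fh:
        fh.write(b"x" * 2048)
    with open(os.path.join(tmp, "b.bin"), "wb") as fh:
        fh.write(b"x")
    demo_kb = dir_size_kb(tmp)

# Ratios computed from the sizes reported above (KB).
REPORTED_KB = {"hg": 2468, "bzr": 3932, "bzr shared repository": 2292}
smallest = min(REPORTED_KB.values())
for name, kb in REPORTED_KB.items():
    print(f"{name}: {kb} KB, {kb / smallest:.2f}x the smallest")
```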

For expanding, both support plug-ins. I didn't really delve into either beyond noticing that they both take different approaches to the API required for a plug-in to work.

In terms of other techy stuff, both projects have their own trajectory. For instance Mercurial seems to have more support for packaging up changesets and shipping them around. This compares to Bazaar where they are working on a "smart server" to be more competitive with svn in terms of server capabilities.

OK, with the techy stuff out of the way, let's talk about documentation. Neither is spectacular. Both use wikis and thus are kind of all over the place. Mercurial seems to have slightly better documentation that is more up-to-date. For instance, the bzr+ssh protocol is not documented in the Bazaar docs even though it has been in the tool since version 0.11. What's worse is that the tutorial for Bazaar has not been updated since version 0.8.

But luckily both tools have helpful people. For Mercurial I emailed the general list and got help within a day to answer my question about why an update/merge step was required after pulling a changeset. For Bazaar community support, I went to the IRC channel since it is supposedly a "vibrant community" even though I personally dislike IRC. I wanted to see if rsync was really the only option I had for pushing to/pulling from a server short of setting up my web server to point to the repository directory or installing third-party software (having the batteries included is a big thing with me). There I had 'marienz' and 'j-a-meinel' help me out and tell me about the bzr+ssh protocol. They even generated a patch while I was talking with them to remove the Paramiko dependency for the protocol.

After all of this, I think I like Mercurial the most. But with both tools still feeling fairly beta (I think a real push to clean up and update their documentation would help) I am not about to say I am not willing to switch in the future; I would not be upset if I was forced to use Bazaar. But if I did, I would only use a shared repository since it makes such a difference. And I am sure someone is going to point out some feature that Bazaar supports that Mercurial doesn't that some people are going to consider crucial. But my gut says that I like Mercurial the most.

With that being said, there are other possibilities for distributed VCSs. You can always look at SVK, darcs, Git, or Monotone (and use Tailor to switch between them).

[update]
I originally said that I might be willing to evaluate another VCS. I got a good amount of responses recommending darcs. Problem is that installing darcs on my server for remote tests would be a huge inconvenience. I did look at it and it seems like a fine VCS, but not enough to motivate me to get ghc installed (never managed to get that program to install properly).