2009-03-31

Why Python is switching to Mercurial

Starting at PyCon 2008, thanks to Barry Warsaw and the Bazaar team, I started thinking about moving Python over to a distributed version control system (DVCS). While I wanted to get offline commits for the benefit of non-core developers along with easier merging from 2.6 to 3.0 (ah, the days when there were only three branches under development), I knew that would not necessarily be enough of a reason for others to switch.

Come October I started a PEP for switching from svn to a DVCS. Originally it was going to be hg vs. bzr, but enough of an outcry on python-dev led me to relent and add git to the PEP. With the list of DVCSs decided I began writing up common use cases that I and other developers have come up against in developing Python. I then had a representative for each DVCS (to whom I am grateful for helping) fill in the PEP with the best solution to each use case. This all became PEP 374.

And for me that's when the stress began. I was being bombarded on all sides on this pretty regularly. I quickly realized that choosing a DVCS was like choosing a code editor; it's a very personal thing for a lot of people. Plus I forget how big Python is now; when what I was doing hit the Net I ended up talking with developers from all three DVCSs which I didn't expect.

Time passed and I tried to absorb all three DVCSs as much as I could, although with my internship and trying to finish importlib for 3.1 I only had so much time. I ran a survey of the core developers where I asked them to rate the three DVCSs as either better, equal, or worse than svn if they felt they had enough experience to have an opinion.

That survey showed git was clearly the most disliked tool among the core developers. Add in git having the weakest Windows support and not being implemented in Python, and I decided to eliminate git from the running and announce its elimination at the first lightning talk at PyCon.

When I arrived at PyCon pretty much everyone asked me about the DVCS PEP. People wanted to know how it was going and who was going to win, and they gave me support/pity for what I was going through. Guido noticed this and decided to end my misery by saying he wanted to make a decision by the end of PyCon. I said I was fine with that as one was already about to be eliminated and I knew my personal preference at that exact moment aligned with Guido's.

So I did my lightning talk eliminating git. Luckily that went well with only about two people telling me directly they disliked the decision.

But the more telling thing was what everyone else told me after that lightning talk. I ended up with a surprisingly large number of people telling me -- including core developers -- they wanted, preferred, or guessed that hg would now win. Now the guesses could be explained away by Guido having publicly stated he likes hg, but to me the number of people telling me they wanted hg to be chosen was surprisingly large. And honestly no one told me they preferred bzr (although no one said I had better choose hg over bzr either).

So Monday morning came around and I walked into the sprint. I asked Guido if he was ready to make a decision. He said yes, we both said hg, and so Guido tweeted the decision before telling python-dev that we chose Mercurial.

There has been a lot of speculation as to why Guido pronounced the way he did. On Twitter Guido said to read PEP 374 for the reasons. Since I helped write the PEP, my reasons are reflected in it too.

Obviously community preference as shown at PyCon played a role. No one wants to choose a DVCS that causes the community to not want to contribute to Python. And I would never choose a VCS that would cause Guido to not want to work on Python. Some people seem surprised that something non-technical played a role, but ignoring social issues is to ignore how much open source is a social phenomenon. And we are not the first project to take social preference into consideration: I know both GNOME and Pinax chose git because their developers preferred git.

And there are technical reasons. Hg being 2x to 3x faster than bzr does matter to some extent. No one wants to cause someone to not contribute because they didn't want to wait for a checkout. And having personally experienced long checkout times because of a subpar connection to a specific server I know this can occur. The performance margin between hg and bzr is typically within reason and is not a flat-out deal-breaker, but it doesn't help either.

Bazaar also has its short timespan of format stability working against it. The tool has changed its format at least three times based on what the man page says (1.0, 1.6, and 1.9). Mercurial, on the other hand, has been stable since I think it went public or near that time. They take great pride in the fact they have not changed it. And that stability more aligns with python-dev's sensibilities regarding stability.

Stephen Turnbull's explanation of why on the bzr mailing list is also a good explanation of why we chose hg. Basically no one is saying bzr is bad, just that hg is a better fit for our needs on python-dev.

But the thing I really love about having made this decision -- other than I don't have to stress about this anymore -- is that everyone seems to be flat-out happy we made a decision to switch as well. Once again the Python community stands out as being friendly and understanding about stuff like this with no one really seeming to be upset that we made the decision we did.

As for when the switch will happen, I don't know. We are hoping by summer, but that is just a hope at the moment. We have to figure out the best way to convert our history as well as what workflow we want to have.

2009-03-19

Consider volunteering at PyCon

Doug has sent out a request for help at PyCon. If you will be attending PyCon and are willing to help, please do! PyCon is volunteer-driven so we need all the volunteers we can get.

2009-03-18

Last day for online registration to PyCon

Helping to spread the word that today is the last day to register for PyCon online. You can still register at the door, although obviously it is a little bit more expensive.

2009-03-16

Use XMPP for microblogging?

I had something interesting happen to me today: people noticed my status message on Google Talk. Normally I figure no one really pays attention to status messages for IM beyond whether someone is online or not (we all know people who are always listed as away or have not changed their status message in ages). But for some reason today I decided to use my status message to bitch about the car issues I was having.

And people responded. I had various friends IM me throughout the day to inquire about what was going on (short answer: transmission is shot and the repair bill is more than the value of the car, so I am trying to get the car recycled; not easy when you drove down to visit a parent in another country). It reminded me of what can happen when I post to Twitter or FriendFeed when I get to have a topical conversation about what is going on in my life that exact second; instantaneous blogging in snippet form. And it was nice to get to have a conversation with some friends about what was happening. That's something I don't necessarily get to have on Twitter since the threading of the conversation gets all out of whack. And unlike FriendFeed the conversations were private and happening in real-time.

But the other nice thing is that these conversations were with people I normally don't get to interact with on a regular basis online. Typically I interact with people online through FriendFeed, Twitter, and Google Reader. IM has a much broader reach than any of these newfangled services have.

For me, the difference between the former services and IM is one takes more proactive participation while the latter is usually only for when I have a specific need or desire to talk with someone. If the people I follow on FF or Twitter post something I will eventually see it when I actively check the application (although obviously you can use notifiers so that the engagement is passively triggered). But with IM, it is typically relegated to the background, sitting there until either I initiate a conversation with someone or vice-versa.

And yet today that didn't happen. It makes sense that one's status in the world at any moment be tied to one's IM persona. I mean my status of being online or not is listed in IM, why shouldn't what I am thinking about or reading also be displayed? We talk about microblogging and lifestreaming services, and yet the one service we all use that portrays our status online is not doing more than showing a green, red, or gray dot next to some picture. It feels like an unneeded disconnect to me that my IM persona does not tie into what else I might share with the world at that moment.

There is also a case that IM is a better model for microblogging/lifestreaming. It's very much a push model; I decide to say something to the world, it gets pushed to the people who care to listen. And yet all of these other services use a pull model to get the information. Why should my Twitter client have to refresh to find out that there is something new to read? Can't it just be pushed to me? Can't the pull from a web site be a side-effect of providing access while the actual system is push?
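To make the push/pull distinction concrete, here is a minimal sketch in plain Python. The class and method names (PullFeed, PushFeed, fetch_since, subscribe) are made up for illustration and have nothing to do with the actual Twitter or XMPP APIs; the only point is who initiates delivery.

```python
class PullFeed:
    """Pull model: readers must ask whether anything new has arrived."""

    def __init__(self):
        self._posts = []

    def post(self, text):
        self._posts.append(text)

    def fetch_since(self, index):
        # The client calls this on every refresh, even when nothing changed.
        return self._posts[index:]


class PushFeed:
    """Push model: each post is delivered to subscribers as it happens."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def post(self, text):
        # Delivery is triggered by the author posting, not by the reader asking.
        for deliver in self._subscribers:
            deliver(text)


# Pull: the reader has to go and look.
pull = PullFeed()
pull.post("car trouble")
seen_by_polling = pull.fetch_since(0)

# Push: the reader registers once and then just receives.
push = PushFeed()
inbox = []
push.subscribe(inbox.append)
push.post("car trouble")
```

In the pull version the reader pays for every refresh whether or not anything is new; in the push version a web page could still be offered on top as a side-effect, which is the arrangement the questions above are getting at.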

If you listen to FLOSS Weekly episode 49 you will hear Peter Saint-Andre mention how he thinks XMPP is a good system to use for microblogging. I personally buy his arguments. Apparently I am not the only one as people have blogged about this idea and gone as far as to sketch out a proposal of microblogging over XMPP. But what exactly would microblogging tied to IM act like?

First things first, though, is to realize how microblogging and IM are different. For one, anyone can follow me on Twitter or FF but not on IM. If some stranger wants to follow me on FF that's fine, but I don't necessarily care when they are online or want to have a personal conversation with them. So any microblogging service would need to make the idea of subscribing to one's microblog feed separate from getting to interact with them directly over IM. Another difference is the temporal nature of IM status messages versus microblogging. When I change my IM status, the last one vanishes into the digital ether. But for microblogging, I don't necessarily want that to happen. So a ticker would be needed to keep track of the messages that have happened throughout the day that I may have missed.

So what are the benefits of tying microblogging to IM? For one, I could easily respond directly to someone about something and have a truly real-time conversation with them about it. Yes, Twitter has direct messages, but the fluidity of an IM conversation, I feel, is much better. It also simplifies things by not forcing me to run yet another service.

Jaiku actually tried this approach somewhat. While being a lifestreaming app, they also had their S60 client which updated your online status and your location in the world. But I don't think it included IM services which would have been nice.

Or maybe all I really want is better support for longer IM status messages. What I honestly get out of Twitter is what people are currently up to and some conversations. If longer status messages could easily be displayed and a public group chat around my status message could take place that would be close to what I get out of Twitter with the only thing missing is people overhearing my public conversation.

2009-03-15

Google Moderator pages for PyCon VM panel

I organized a VM panel at PyCon this year where a representative from each of the major VMs (CPython, Jython, IronPython, and PyPy) will be in one room to answer your various questions. To help get questions from people who cannot attend, along with making our esteemed MC's job easier, Jacob (who is the moderator) and I have set up a Google Moderator page for the panel where you can propose questions.

Please propose questions or vote questions up and down so that Jacob can come up with a battery of questions for the panel to fill in lulls during the Q&A. And I think all talks are being recorded this year so the panel should eventually end up online.

Oh, and if you are not attending PyCon, you still have a chance to register and come!

2009-03-11

Time deltas between Python releases


I have been thinking about what needs to be covered during my session at the language summit at PyCon which will cover 2.7/3.2 issues. One thing I have been wondering about is whether Python should consider shifting to a time-based release schedule. The rule of thumb has been that we do a minor release (e.g. 2.7 is the next minor release in the 2.x series) every 18 months.

To see if that 18 month idea was true, I created a chart (thanks to the Google Chart API for existing and Doug Napoleone for always hyping up the API). The x-axis is number of days and each section of the graph is the length of time to develop a micro release, although developing x.y.0 starts from the release of x.y-1.0 so that's why the initial values are so large.

It would seem that we are trending away from 18 months between minor releases rather quickly. While I would never propose a hard deadline (that's just stupid for open source when you are not trying to get a packaged piece of software done by a specific date for shipment), I wouldn't mind planning out release dates after the last minor release went out so we know when exactly stuff needs to get out (minus any bumps; see Python 3.1 and its planned 6 month release cycle).

The chart also points out some interesting things. One is that 2.3 was a trouble-maker. =) It has the most micro releases (2.3.0 through 2.3.5) on top of having the shortest amount of time between micro releases (10 days between 2.3.1 and 2.3.2 thanks to a bug). The 2.5 branch has been actively maintained the longest (4 years, 24 days from 2.4.0 to 2.5.4). And a random fact for myself: 2.2.3 was the first release which occurred while I had commit privileges.
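For anyone who wants to redo the arithmetic behind the chart, the deltas can be sketched with nothing but datetime. The release dates below were typed in by hand and are an assumption to double-check against python.org; the helper name deltas and the 30-day month used for the 18 month comparison are likewise just illustrative choices.

```python
from datetime import date

# Minor release dates, entered by hand; verify against python.org before relying on them.
MINOR_RELEASES = {
    "2.3": date(2003, 7, 29),
    "2.4": date(2004, 11, 30),
    "2.5": date(2006, 9, 19),
    "2.6": date(2008, 10, 1),
}


def deltas(releases):
    """Return (previous, next, days) for each pair of consecutive releases."""
    ordered = sorted(releases.items(), key=lambda item: item[1])
    return [
        (old, new, (new_date - old_date).days)
        for (old, old_date), (new, new_date) in zip(ordered, ordered[1:])
    ]


EIGHTEEN_MONTHS = 18 * 30  # rough day count, good enough for a rule of thumb

for old, new, days in deltas(MINOR_RELEASES):
    print(f"{old} -> {new}: {days} days ({days - EIGHTEEN_MONTHS:+d} vs. 18 months)")
```

The same function works for micro releases: feed it the x.y.z dates of a single branch and the gaps between them fall out the same way.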

2009-03-09

Interacting between <insert JVM lang here> and Java

[2009-03-15: This blog post has been thoroughly rewritten after I found out some details were wrong and people wanted more details]

I don't like Java. If you care to find out why you can search my blog, but it's no secret Java is not exactly at the top of my list of favorite programming languages. But I am not stupid enough to think that Java is about to go away. And I have nothing against the JVM, just the primary language that runs on it.

Because of the JVM, having to work with Java code does not mean I have to write Java code. Thanks to various languages being ported to the JVM there are now multiple options for working with Java code in languages other than Java:
  • Jython
  • JRuby
  • Rhino
  • Clojure
  • Groovy
  • Scala



By targeting the JVM all the above languages can call Java code in order to be relevant in a Java-heavy world. But just because they can easily consume Java code does not mean that the reverse is true. What I am looking for is not just a way to call into Java, but to call other languages from Java.

To better explain this, take the following three files. First, I have Spam.java:

public class Spam {
    public String serves() {
        return "spam";
    }
}

I have another class that implements a subclass, BaconSpam.java:

public class BaconSpam extends Spam {
    @Override
    public String serves() {
        return "bacon " + super.serves();
    }
}

And finally, the class that runs the show, Waitress.java:

public class Waitress {
    public static void main(String[] args) {
        BaconSpam menu = new BaconSpam();
        String food = menu.serves();
        System.out.println("We serve " + food + "!");
    }
}

What I am after is a language in which I can rewrite BaconSpam.java such that I don't touch Spam.java. I also want the changes to Waitress.java to be minimal or non-existent. Waitress.java still has to store both an instance of the class and the value returned by serves(), to show how objects would be stored and used in a long-running Java application (i.e. no cheating by inlining some call in the println() call; this makes the example a little tougher and more "real world").


Jython



Thanks to the Jython guys I was pointed to a Jython Monthly article from October 2006 that explains how best to go about accessing Jython code from Java.

To start I wrote BaconSpam in Python:

import Spam

class BaconSpam(Spam):
    def serves(self):
        return " ".join(["bacon", Spam.serves(self)])


While that was rather simple, there is still the issue of getting an instance of the class. Because Jython dynamically interprets Python code you can't simply drop in BaconSpam.py and have it work with Waitress.java. You need to create an instance of org.python.util.PythonInterpreter and interface with it to get at the Python code:

import org.python.core.__builtin__;
import org.python.core.PyObject;

public class Waitress {
    public static void main(String[] args) {
        PyObject BaconSpam = __builtin__.__import__("BaconSpam").__getattr__("BaconSpam");
        Spam menu = (Spam)BaconSpam.__call__().__tojava__(Spam.class);
        String food = menu.serves();
        System.out.println("We serve " + food + "!");
    }
}


Because Jython dynamically loads Python code, Waitress.java has to be modified. Luckily a large chunk of the code is boilerplate that can be extracted out into a factory class to help simplify things.

And just a warning for anyone wanting to run the above code: for some odd reason the above code only worked for me with Jython 2.5b3, not 2.2.1.


JRuby



Writing the BaconSpam subclass was simple:


class BaconSpam < Spam
  def serves
    "bacon " + super
  end
end


But as of right now you need to use JSR 223 to interface with the class which is worse than the Jython approach, so I am not going to go through the steps here.

But Charles Nutter has blogged about adding signature support to JRuby's in-dev compiler2. It looks like as soon as inheritance is handled properly in compiler2, JRuby will be in the same position as Groovy and Scala for ease of integration (see below for details on those two languages).

Rhino



I don't like JavaScript, so I am skipping Rhino. =) But it's another JSR 223 approach.

Clojure



I actually didn't get Clojure to work. I tried to follow the gen-class example from the Clojure wikibooks, but ran into several issues that included having to manually execute compilation for the Eclipse plug-in to even get an error message and putting the code in a package to get Clojure to not assume I was working off of java.lang when inheriting from Spam.java.


(ns pkg.BaconSpam
  (:gen-class
   :extends pkg.Spam
   :exposes {serves servesSuper}))

(defn -serves [this]
  (str "bacon " (.servesSuper this)))


With an error of "java.lang.IllegalArgumentException: Don't know how to create ISeq from: Symbol", I just stopped trying to make this work. I assume Waitress.java will need to be changed beyond just being put in a package anyway in order to deal with Clojure's dynamic typing.


Groovy



Writing the Groovy version of BaconSpam was very easy thanks to the language having been designed for the JVM from the outset. The only trick was that serves() needed to have the return type specified for the method instead of being dynamic:


public class BaconSpam extends Spam {
    // Using 'def' makes return value dynamic.
    String serves() {
        return "bacon " + super.serves()
    }
}


With the typed method there is no need for modifying Waitress.java. Groovy basically ends up looking like Java with some syntax removed.

One issue that did come up with writing the Groovy example was that the Eclipse plug-in is in some bad shape; I couldn't get it to run the project. This drove me to download and use NetBeans since it has Groovy support included. That was a much better experience since it actually worked.

Oh, and the docs suck. Took way too long to figure out how the darn language is even structured. Just had to read various examples to figure things out.


Scala



Much like Groovy, Scala was easy to use to rewrite BaconSpam.


class BaconSpam extends Spam {
  override def serves(): String =
    return "bacon " + super.serves()
}


And just like Groovy there was no need to change Waitress.java in order to interact with the class. But unlike Groovy the Eclipse plug-in for Scala actually allowed me to execute the application. Plus the docs are better so I didn't have to go digging around to figure out what I needed to do.

2009-03-08

Importlib is now useful to other people

I have always had two goals for importlib. The docs for importlib say they were to provide a reference implementation of import and to make it easier for people to implement their own importers. The former goal is sort of true; the real goal is to not just be a reference implementation of import but to actually become THE implementation of import. The latter goal, though, is spot-on and I finally checked in a big chunk of code tonight that gets me closer to helping other people really harness the flexibility of import.

The cron job to rebuild the docs has not kicked in yet (I think it runs twice a day), but once it does you will discover there is now a new module: importlib.abc. Within the module there are ABCs for everything specified in PEP 302. Back in January I asked for help to name those classes. The fruition of that discussion is now finally live.

But simple ABCs that only require load_module to exist as a method aren't THAT helpful. What I really wanted to provide was something to make it as easy as possible for people to write their own custom loaders such that they didn't have to worry about the little details that are consistent between all loaders (and there are a lot of details; just look at the "see also" section of the importlib docs). Way back in August 2007 I came up with an idea called handlers which would deal with stuff like setting __file__, making sure bytecode is recreated when it's older than the source code, etc.

Unfortunately handlers turned out to be somewhat burdensome. They required a bunch of information upfront to be passed into them that loaders had to provide. On their own they couldn't do much and just became internal delegates for loaders. So I set about trying to come up with a way to merge the handler concept into loaders.

And it turns out PEP 302 got me part way there. If you look at what it takes to import Python source code (I am ignoring bytecode), you essentially create an empty module, read in the source code, compile the source into a code object, and then execute the code object in the __dict__ of the empty module. After that it's just details like __file__ and such. When you take these basic steps you will notice they roughly align with some APIs that PEP 302 defined as optional protocols loaders could implement. So I asked myself if I could somehow harness the optional protocols to get all the information I need so that I can simply provide a loader that uses those protocols.
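Those basic steps can be sketched in a few lines of Python 3. This is only an illustration of the sequence just described, not importlib's code; the function name load_from_source and the example path are made up, and it skips everything a real loader also has to do (sys.modules, packages, bytecode).

```python
import types


def load_from_source(name, source, path):
    """Create a module from source text using the basic import steps."""
    module = types.ModuleType(name)      # 1. create an empty module
    module.__file__ = path               # the "details like __file__" part
    code = compile(source, path, "exec") # 2-3. take the source and compile it
    exec(code, module.__dict__)          # 4. execute it in the module's __dict__
    return module


# Hypothetical usage: the path does not need to exist for this sketch.
spam = load_from_source("spam", "menu = ['spam', 'eggs']\n", "/hypothetical/spam.py")
```

Note that the function needs both the source and a path handed to it, which is exactly the pairing that turns out to be awkward to get from the optional protocols alone.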

Looking at source only, it would seem like the answer is "yes" since there is a get_source method which obviously returns source code for a module. One would think then an implementation for load_module would simply call get_source, compile it, and then use it to create a module. But of course life is not simple.

First of all, as I have discussed before, reading source from disk was not working for me as I didn't have a simple way to get source from disk decoded with the proper encoding thanks to PEP 263. Everything I came up with was on the complicated side. That meant get_source was not exactly a nice thing to rely upon.
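For reference, the decoding problem itself looks roughly like the sketch below on a current Python 3, where the standard library's tokenize.detect_encoding reads the PEP 263 coding cookie (or a BOM) from the first lines of a file. The helper name decode_source is made up here, and this is not a claim about how importlib handled it at the time.

```python
import io
import tokenize


def decode_source(source_bytes):
    """Decode raw source bytes using the encoding declared per PEP 263."""
    # detect_encoding wants a readline callable that yields bytes.
    readline = io.BytesIO(source_bytes).readline
    encoding, _ = tokenize.detect_encoding(readline)
    return source_bytes.decode(encoding)


# A Latin-1 encoded file that says so on its first line.
raw = "# -*- coding: latin-1 -*-\nname = 'caf\xe9'\n".encode("latin-1")
text = decode_source(raw)
```

Without a cookie or BOM the detected encoding falls back to UTF-8, the default source encoding in Python 3.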

But the other issue of relying on get_source is it doesn't tell me the path the source came from. That's needed for __file__ to be set. That completely kills solely relying on get_source.

You could potentially rely on another part of PEP 302 which defines a get_code method which is supposed to return the code object for a module. But that puts more burden on a developer than I wanted.

At this point I realized I was probably going to have to add to the APIs that PEP 302 provided. I didn't want to do this as that just makes it that much more difficult to implement a loader, but I realized that the PEP 302 protocols simply did not provide all the information needed to create a module from scratch. So I started to think about the minimum I needed to add to the API.

And I thought. And I thought. And I thought. Whenever I have blogged about APIs while working on importlib, it has been in regards to this conundrum of building off of PEP 302's protocols with something that is simple and useful no matter what the storage back-end for modules happened to be.

Eventually I had an epiphany. Using get_source was not an option because it was missing the path to set __file__ to. Somehow the loader needed to have some concept of paths to set __file__ in some meaningful fashion, even if it wasn't really a file path. If the loader provided some concept of a path, then I could use the loader as if it was using a file-based back-end. If I went with that assumption I could use get_data from PEP 302 in order to get at the source code; get_data(source_path('module')).

But I hesitated for a long time at using get_data to fetch source code. Having get_source sitting right there was just so tempting! But then I started to consider how to handle reading bytecode. Should I have a get_bytecode method? But I then run into the problem of needing a bytecode_path method to be able to set __file__ properly for modules loaded from bytecode (and when no source is available; new semantics of 3.0). Going that route means I would have added source_path, bytecode_path, and get_bytecode just to read source and bytecode. This still doesn't deal with getting the modification time for source to see if the bytecode is stale or writing bytecode to the storage back-end through the loader.

Realizing that going the get_bytecode route duplicated functionality needlessly, I went with using get_data to read source and bytecode based on what source_path and bytecode_path return. This keeps the functionality per method simple and mostly unique. It's definitely a "misuse" of get_data as it was not meant to be used this way, but it makes sense and keeps the API simple.

With all of this put together I can provide an ABC that implements load_module in terms of the PEP 302 protocols plus just a couple of other methods, and that handles all the stuff that is not specific to the back-end being used to store the source or bytecode. For instance, to implement a source loader, one needs to implement:
  • get_data
  • source_path
  • is_package
With those three methods, you get a bunch of other methods for free:
  • get_code
  • get_source (eventually; actually figured this out just before starting this blog post)
  • load_module
As you can see, the methods one needs to implement are rather simple to do for a storage back-end. It does follow a path-like API which isn't really needed for non-file back-ends, e.g. databases, which is unfortunate. But since most people blindly assume __file__ and items in __path__ are paths anyway, I don't think you can get around this without breaking people's code.
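As a rough illustration of that division of labour, here is a self-contained sketch. It is not the code in importlib.abc: the class names SourceLoaderSketch and DictLoader, the "<dict>/" pseudo-paths, and the way __path__ is derived are all assumptions made for the example. Only the method names get_data, source_path, is_package, get_code, and load_module come from the list above.

```python
import abc
import sys
import types


class SourceLoaderSketch(abc.ABC):
    """Hypothetical base class: implement three methods, get the rest."""

    @abc.abstractmethod
    def get_data(self, path):
        """Return the bytes stored at path."""

    @abc.abstractmethod
    def source_path(self, fullname):
        """Return the path (real or notional) to the module's source."""

    @abc.abstractmethod
    def is_package(self, fullname):
        """Return True if the module is a package."""

    def get_code(self, fullname):
        path = self.source_path(fullname)
        return compile(self.get_data(path), path, "exec")

    def load_module(self, fullname):
        code = self.get_code(fullname)
        module = types.ModuleType(fullname)
        module.__file__ = self.source_path(fullname)
        module.__loader__ = self
        if self.is_package(fullname):
            # Assumes "/"-separated paths; a real loader would be more careful.
            module.__path__ = [module.__file__.rsplit("/", 1)[0]]
        sys.modules[fullname] = module
        try:
            exec(code, module.__dict__)
        except BaseException:
            # Don't leave a half-initialized module behind.
            del sys.modules[fullname]
            raise
        return module


class DictLoader(SourceLoaderSketch):
    """Non-file back-end: source lives in a dict keyed by pseudo-path."""

    def __init__(self, sources):
        self._sources = sources

    def source_path(self, fullname):
        return "<dict>/" + fullname.replace(".", "/") + ".py"

    def get_data(self, path):
        return self._sources[path]

    def is_package(self, fullname):
        return False


loader = DictLoader({"<dict>/spam_sketch.py": b"answer = 6 * 7\n"})
module = loader.load_module("spam_sketch")
```

DictLoader shows the trade-off mentioned above: a back-end with no real files still has to invent something path-shaped so that __file__ has a value.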

But the big one is when handling source and bytecode together. There you add the above methods plus:
  • bytecode_path
  • source_mtime
  • write_bytecode
Since write_bytecode can be a no-op, you really only need to implement the first two methods to be able to use bytecode. Heck you could implement all three as dud methods and you would end up with a source-only loader that just happened to always try for bytecode. The point is that I have implemented the other methods so that no one else should have to care about what the format for bytecode files is or when to use bytecode or source.
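The "when to use bytecode or source" decision can be sketched as below. The layout used here (a four-byte marker, a four-byte source mtime, then a marshalled code object) is a simplified stand-in invented for the example, as are the names MAGIC, dump_bytecode, and load_code; real bytecode files use an interpreter-specific magic number and the loader ABC hides those details.

```python
import marshal
import struct

MAGIC = b"SKCH"  # hypothetical marker, NOT the interpreter's real magic number


def dump_bytecode(code, source_mtime):
    """Serialize a code object together with the mtime of its source."""
    stamp = struct.pack("<I", int(source_mtime) & 0xFFFFFFFF)
    return MAGIC + stamp + marshal.dumps(code)


def load_code(source, path, source_mtime, bytecode=None):
    """Return (code, used_bytecode), preferring bytecode only if it is fresh."""
    if bytecode is not None and bytecode[:4] == MAGIC:
        (recorded,) = struct.unpack("<I", bytecode[4:8])
        if recorded == int(source_mtime) & 0xFFFFFFFF:
            return marshal.loads(bytecode[8:]), True
    # Missing, foreign, or stale bytecode: fall back to compiling the source.
    return compile(source, path, "exec"), False


cached = dump_bytecode(compile("x = 1", "<sketch>", "exec"), 100)

# Same mtime as recorded: the cached bytecode wins.
fresh_code, fresh_used = load_code("x = 2", "<sketch>", 100, cached)

# Source has a newer mtime: the bytecode is stale, so the source is compiled.
stale_code, stale_used = load_code("x = 2", "<sketch>", 200, cached)
```

In terms of the methods listed above, source_mtime supplies the timestamp to compare against, bytecode_path says where to look for the cached bytes, and write_bytecode is where the output of something like dump_bytecode would go (or nowhere, if it is a no-op).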

With all of this done, that leaves just two parts left to implement for the public API (get_source in terms of get_data and source_path along with a decorator I found useful). Once that is done, importlib is semantically done for CPython. I do need to talk to Jython, IronPython, and PyPy to see what they might be missing that I rely on from CPython such that if someone implements a source/bytecode loader it will still work on those VMs even if bytecode happens to be present (this worry is thanks to PEP 370).

My first academic research paper is now available online

I just got back from AOSD.09 in Charlottesville, VA. Now that the conference is over, the paper I presented there is online at the ACM Portal. I have also put up a PDF here for those that are not ACM members.

You can read the abstract at the ACM page to see what it is about. But just a heads-up that the paper has nothing to do with Python; it's all Java/AspectJ in terms of coding stuff and assumes you know both.