2009-03-08

Importlib is now useful to other people

I have always had two goals for importlib. The docs for importlib say they were to provide a reference implementation of import and to make it easier for people to implement their own importers. The former goal is sort of true; the real goal is to not just be a reference implementation of import but to actually become THE implementation of import. The latter goal, though, is spot-on and I finally checked in a big chunk of code tonight that gets me closer to helping other people really harness the flexibility of import.

The cron job to rebuild the docs has not kicked in yet (I think it runs twice a day), but once it does you will discover there is now a new module: importlib.abc. Within the module there are ABCs for everything specified in PEP 302. Back in January I asked for help naming those classes. The fruit of that discussion is now finally live.

But simple ABCs that merely require load_module to exist as a method aren't THAT helpful. What I really wanted to provide was something to make it as easy as possible for people to write their own custom loaders such that they didn't have to worry about the little details that are consistent between all loaders (and there are a lot of details; just look at the "see also" section of the importlib docs). Way back in August 2007 I came up with an idea called handlers which would deal with stuff like setting __file__, making sure bytecode is recreated when it's older than the source code, etc.

Unfortunately handlers turned out to be somewhat burdensome. They required a bunch of information upfront to be passed into them that loaders had to provide. On their own they couldn't do much and just became internal delegates for loaders. So I set about trying to come up with a way to merge the handler concept into loaders.

And it turns out PEP 302 got me part way there. If you look at what it takes to import Python source code (I am ignoring bytecode), you essentially create an empty module, read in the source code, compile the source into a code object, and then execute the code object in the __dict__ of the empty module. After that it's just details like __file__ and such. When you take these basic steps you will notice they roughly align with some APIs that PEP 302 defined as optional protocols loaders could implement. So I asked myself if I could somehow harness the optional protocols to get all the information I need so that I can simply provide a loader that uses those protocols.
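The basic steps above can be sketched in a few lines. This is only an illustration, not importlib's code: the function name load_from_source and the "<memory>/hello.py" path are made up for the example, and it skips bytecode, packages, and sys.modules entirely.

```python
import types


def load_from_source(name, source, path):
    """Hypothetical helper: build a module from source text in four steps."""
    module = types.ModuleType(name)       # 1. create an empty module
    module.__file__ = path                # one of the "details" set afterwards
    code = compile(source, path, "exec")  # 2-3. take the source, compile it
    exec(code, module.__dict__)           # 4. run the code in the module's __dict__
    return module


mod = load_from_source(
    "hello",
    "def greet():\n    return 'hi'\n",
    "<memory>/hello.py",  # made-up path, only there so __file__ has a value
)
```

Notice that the function needs the source and the path as separate inputs, which is exactly the information a loader has to be able to supply.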

Looking at source only, it would seem like the answer is "yes" since there is a get_source method which obviously returns source code for a module. One would think then an implementation for load_module would simply call get_source, compile it, and then use it to create a module. But of course life is not simple.

First of all, as I have discussed before, reading source from disk was not working for me, as I didn't have a simple way to get source from disk properly decoded, which PEP 263 requires. Everything I came up with was on the complicated side. That meant get_source was not exactly a nice thing to rely upon.

But the other issue with relying on get_source is that it doesn't tell me the path the source came from. That's needed for __file__ to be set. That completely kills solely relying on get_source.

You could potentially rely on another part of PEP 302 which defines a get_code method which is supposed to return the code object for a module. But that puts more burden on a developer than I wanted.

At this point I realized I was probably going to have to add to the APIs that PEP 302 provided. I didn't want to do this as that just makes it that much more difficult to implement a loader, but I realized that the PEP 302 protocols simply did not provide all the information needed to create a module from scratch. So I started to think about the minimum I needed to add to the API.

And I thought. And I thought. And I thought. Whenever I have blogged about APIs while working on importlib, it has been in regards to this conundrum of building off of PEP 302's protocols with something that is simple and useful no matter what the storage back-end for modules happened to be.

Eventually I had an epiphany. Using get_source was not an option because it was missing the path to set __file__ to. Somehow the loader needed to have some concept of paths to set __file__ in some meaningful fashion, even if it wasn't really a file path. If the loader provided some concept of a path, then I could use the loader as if it was using a file-based back-end. If I went with that assumption I could use get_data from PEP 302 in order to get at the source code; get_data(source_path('module')).

But I hesitated for a long time at using get_data to fetch source code. Having get_source sitting right there was just so tempting! But then I started to consider how to handle reading bytecode. Should I have a get_bytecode method? But then I run into the problem of needing a bytecode_path method to be able to set __file__ properly for modules loaded from bytecode (and when no source is available; new semantics of 3.0). Going that route means I would have added source_path, bytecode_path, and get_bytecode just to read source and bytecode. This still doesn't deal with getting the modification time for source to see if the bytecode is stale or writing bytecode to the storage back-end through the loader.

Realizing that going the get_bytecode route duplicated functionality needlessly, I went with using get_data to read source and bytecode based on what source_path and bytecode_path return. This keeps the functionality per method simple and mostly unique. It's definitely a "misuse" of get_data as it was not meant to be used this way, but it makes sense and keeps the API simple.

With all of this put together I can provide an ABC that implements load_module in terms of the PEP 302 protocols plus just a couple of other methods, and that handles all the stuff that is not specific to the back-end being used to store the source or bytecode. For instance, to implement a source loader, one needs to implement:
  • get_data
  • source_path
  • is_package
With those three methods, you get a bunch of other methods for free:
  • get_code
  • get_source (eventually; actually figured this out just before starting this blog post)
  • load_module
As you can see, the methods one needs to implement are rather simple to do for a storage back-end. It does follow a path-like API which isn't really needed for non-file back-ends, e.g. databases, which is unfortunate. But since most people blindly assume __file__ and items in __path__ are paths anyway, I don't think you can get around this without breaking people's code.
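To make the division of labor concrete, here is a rough sketch of the idea, not the actual importlib.abc code. The class names SketchSourceLoader and DictLoader and the "/virtual/..." pseudo-paths are invented for the example, and the load_module shown is simplified (it does not handle reloading an already-imported module). The base class supplies get_code and load_module; the subclass only supplies the three back-end methods.

```python
import sys
import types


class SketchSourceLoader:
    """Hypothetical base class illustrating the approach; not importlib's ABC."""

    # The three back-end specific methods a subclass has to provide.
    def get_data(self, path):
        raise NotImplementedError

    def source_path(self, fullname):
        raise NotImplementedError

    def is_package(self, fullname):
        raise NotImplementedError

    # Everything below comes "for free" from the three methods above.
    def get_code(self, fullname):
        path = self.source_path(fullname)
        # compile() accepts bytes and honors a PEP 263 coding declaration.
        return compile(self.get_data(path), path, "exec")

    def load_module(self, fullname):
        code = self.get_code(fullname)
        module = types.ModuleType(fullname)
        module.__file__ = self.source_path(fullname)
        module.__loader__ = self
        if self.is_package(fullname):
            # Assumes "/"-separated pseudo-paths ending in "/__init__.py".
            module.__path__ = [module.__file__.rsplit("/", 1)[0]]
        sys.modules[fullname] = module
        try:
            exec(code, module.__dict__)
        except BaseException:
            # Don't leave a half-initialized module behind.
            del sys.modules[fullname]
            raise
        return module


class DictLoader(SketchSourceLoader):
    """Toy back-end: source is stored in a dict keyed by pseudo-path."""

    def __init__(self, sources):
        self._sources = sources

    def get_data(self, path):
        return self._sources[path]

    def source_path(self, fullname):
        return "/virtual/" + fullname.replace(".", "/") + ".py"

    def is_package(self, fullname):
        return False


loader = DictLoader({"/virtual/demo_mod.py": b"answer = 42\n"})
demo = loader.load_module("demo_mod")
```

Even though nothing here touches the file system, the back-end still has to hand out path-like strings so that __file__ has something to hold.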

But the big one is when handling source and bytecode together. There you add the above methods plus:
  • bytecode_path
  • source_mtime
  • write_bytecode
Since write_bytecode can be a no-op, you really only need to implement the first two methods to be able to use bytecode. Heck, you could implement all three as dud methods and you would end up with a source-only loader that just happened to always try for bytecode. The point is that I have implemented the other methods so that no one else should have to care about what the format for bytecode files is or when to use bytecode or source.
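Here is a rough sketch of the kind of decision the three extra methods make possible. It rests on several assumptions that are not from importlib: the helper get_code_with_cache and the MemoryLoader class are made up, write_bytecode is assumed to take (fullname, data), and the cache layout (a 4-byte mtime followed by a marshalled code object) is a deliberate simplification, not the real bytecode file format.

```python
import marshal
import struct


def get_code_with_cache(loader, fullname):
    """Hypothetical helper: use cached bytecode unless it is stale."""
    source_path = loader.source_path(fullname)
    bytecode_path = loader.bytecode_path(fullname)
    mtime = int(loader.source_mtime(fullname))

    if bytecode_path is not None:
        try:
            data = loader.get_data(bytecode_path)
        except (OSError, KeyError):
            data = None  # no cached bytecode yet
        if data is not None and len(data) >= 4:
            (cached_mtime,) = struct.unpack("<I", data[:4])
            if cached_mtime == mtime:
                # Cache matches the source's mtime, so skip compiling.
                return marshal.loads(data[4:])

    # Missing or stale bytecode: compile from source and refresh the cache.
    code = compile(loader.get_data(source_path), source_path, "exec")
    loader.write_bytecode(fullname, struct.pack("<I", mtime) + marshal.dumps(code))
    return code


class MemoryLoader:
    """Toy back-end keeping source, bytecode, and mtimes in dicts."""

    def __init__(self):
        self.files = {"/virtual/spam.py": b"value = 1 + 1\n"}
        self.mtimes = {"spam": 1000}
        self.writes = 0  # counts how often bytecode was (re)written

    def get_data(self, path):
        return self.files[path]

    def source_path(self, fullname):
        return "/virtual/%s.py" % fullname

    def bytecode_path(self, fullname):
        return "/virtual/%s.pyc" % fullname

    def source_mtime(self, fullname):
        return self.mtimes[fullname]

    def write_bytecode(self, fullname, data):
        self.files[self.bytecode_path(fullname)] = data
        self.writes += 1


loader = MemoryLoader()
first = get_code_with_cache(loader, "spam")   # compiles and writes the cache
second = get_code_with_cache(loader, "spam")  # served from the cache
loader.mtimes["spam"] = 2000                  # pretend the source was edited
third = get_code_with_cache(loader, "spam")   # stale cache, so recompiles
```

Turning write_bytecode into a no-op here would still work; the helper would simply recompile from source on every call.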

With all of this done, that leaves just two parts left to implement for the public API (get_source in terms of get_data and source_path along with a decorator I found useful). Once that is done, importlib is semantically done for CPython. I do need to talk to Jython, IronPython, and PyPy to see what they might be missing that I rely on from CPython such that if someone implements a source/bytecode loader it will still work on those VMs even if bytecode happens to be present (this worry is thanks to PEP 370).