2007-08-09

Making it easier to write custom code formats while doing the py/pyc dance

Wouldn't it be nice if you could write Python code with slightly tweaked syntax, or some other custom code format, without having to write all the import machinery to pull it off? And wouldn't you like bytecode compilation supported as well?

As an example, consider Quixote. It has a templating syntax called PTL where string literals in Python code end up being used as output. The Quixote developers had to write a custom __import__ implementation to get proper support for importing PTL files, and there is no bytecode support.
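
If you have never seen PTL, a template looks roughly like this (I am going from memory, so the exact syntax may be off):

    # greeting.ptl -- a rough PTL-style template
    def greeting [html] (name):
        "<p>Hello, "    # bare string literals (and expressions) become output
        name
        "!</p>"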

When I was designing importlib I wanted to make sure that situations like PTL could be supported without jumping through ridiculous hoops like reimplementing entire parts of the import machinery. I figured there should be a way to write a class that acts as a delegate between the file system (or whatever the backend store is) and the code that manages the source/bytecode dance.

That is when I came up with the idea of handlers. From PEP 302, importers are for querying a backend as to whether it has the desired module, and loaders are for actually getting the source or bytecode from the backend store. Handlers are designed to process the data received from the loader and to make requests of the loader as needed. Thus the py/pyc handler deals with validating that the bytecode is new enough and has the proper magic cookie, requesting source when needed, and creating new bytecode to be stored by the loader.
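
To make the idea concrete, here is a rough sketch of what the py/pyc handler's job amounts to. This is not importlib's actual code; the loader methods it calls are just my working names, and the staleness check is simplified:

    import imp
    import marshal
    import struct

    class PyPycHandler(object):

        def code_from_loader(self, loader, name, have_source, have_bytecode):
            """Return a code object, regenerating bytecode if it is bad."""
            code = None
            if have_bytecode:
                data = loader.read_bytecode(name)
                magic, timestamp, raw = data[:4], data[4:8], data[8:]
                if magic == imp.get_magic():
                    pyc_mtime = struct.unpack('<I', timestamp)[0]
                    # Treat bytecode older than the source as stale.
                    if (not have_source or
                            pyc_mtime >= int(loader.source_mtime(name))):
                        code = marshal.loads(raw)
            if code is None and have_source:
                # Bytecode was missing, stale, or had a bad magic cookie:
                # compile the source and hand the new bytecode to the loader.
                source = loader.read_source(name)
                code = compile(source, name, 'exec')
                mtime = int(loader.source_mtime(name))
                new_data = (imp.get_magic() + struct.pack('<I', mtime) +
                            marshal.dumps(code))
                loader.write_bytecode(name, new_data)
            return code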

And to support this I came up with an API for loaders to implement to be used by the py/pyc handler. But last night, when I decided to stop working on my zipimport rewrite, I looked at the API and realized it was not simple enough. I realized that the py/pyc dance requires some information up front, while other info is optional and depends entirely on whether source and/or bytecode is available for a module. I also realized that loaders can cache any info they need, and thus I did not need to complicate things by passing opaque token objects through the handler to give back to the loader just to minimize wasted lookup costs and such.

Let's look at what it takes to import a source or bytecode module. To begin with, you need to know the values for __name__, __file__, __loader__ (optional, but since this is a PEP 302 implementation it's easy to provide), and __path__ if dealing with a package. This all needs to be set on the module before any code is executed.
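
In sketch form, that initial setup is nothing more than this (the helper function itself is just for illustration, not part of importlib):

    import sys
    import types

    def new_module(name, path, loader, pkg_dir=None):
        module = types.ModuleType(name)
        module.__name__ = name
        module.__file__ = path
        module.__loader__ = loader        # optional per PEP 302, but easy to set
        if pkg_dir is not None:
            module.__path__ = [pkg_dir]   # only packages get __path__
        sys.modules[name] = module        # in place before any code executes
        return module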

You also need to know very quickly whether source and/or bytecode is available for the named module, since that determines which information you need from there on. But the available formats for the module and the attributes that must be set before any code executes are all upfront costs. So I figured those should be passed to the handler at call time and thus not require the loader to implement them as part of any API.
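
Continuing the sketch from above, the loader ends up looking something like this when it goes to load a module (source_exists, bytecode_exists, and source_path are made-up helpers standing in for whatever availability checks the loader does):

    class PyPycFileLoader(object):

        def load_module(self, fullname):
            # Pass the upfront facts to the handler at call time so they do
            # not need to be part of the loader's own API.
            code = PyPycHandler().code_from_loader(
                        self, fullname,
                        have_source=self.source_exists(fullname),
                        have_bytecode=self.bytecode_exists(fullname))
            module = new_module(fullname, self.source_path(fullname), self)
            exec(code, module.__dict__)   # attributes are set before this runs
            return module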

After that, what is needed varies based on the formats the module is available in. If there is bytecode you need to get said bytecode. If there is both bytecode and source you need the modification time of the source to see if the bytecode is stale. If there is only source, or the bytecode is stale, you need to be able to get the source itself. Finally, if the bytecode turned out to be bad (stale because it's out of date, or carrying a bad magic cookie) you need to be able to write out the new bytecode. So that's four things that the loader needs an API for: reading bytecode, reading source, getting the source's modification time, and writing out new bytecode.
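
In other words, the loader side of the sketch boils down to four methods (again, the method names are just my working names, not anything official):

    class PyPycLoaderAPI(object):

        def read_bytecode(self, name):
            """Return the raw .pyc data (magic + timestamp + marshalled code)."""
            raise NotImplementedError

        def read_source(self, name):
            """Return the module's source code."""
            raise NotImplementedError

        def source_mtime(self, name):
            """Return the modification time of the module's source."""
            raise NotImplementedError

        def write_bytecode(self, name, data):
            """Store freshly generated bytecode for the module."""
            raise NotImplementedError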

Now the question becomes how this API should look. Originally, as I mentioned above, the idea was to have opaque token objects representing paths that were passed around within the handler code and given back to the loader. This was to keep the loader stateless and to minimize any repeated lookups on the filesystem (or whatever store the loader used).

But then I realized the loader did not need to be stateless for any reason. Plus, if someone implementing a loader wanted to optimize, they could easily cache information in the loader themselves. That meant the API really only needed the module's name. That does add the possible inconvenience of verifying that a request is reasonable for that specific module (e.g., requesting the bytecode for a module that has no bytecode yet), but that should be minimal. And the biggest perk of just requiring the module's name is that PEP 302's optional extensions for importers and loaders can be used (e.g., the get_source method).
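
So a loader that wants to avoid repeated filesystem hits can just cache things itself, along these lines (a sketch; the class and helper names are made up):

    import os

    class CachingSourceLoader(object):

        def __init__(self, base_dir):
            self.base_dir = base_dir
            self._path_cache = {}

        def _source_path(self, fullname):
            # Cache the path calculation so repeated requests cost nothing extra.
            if fullname not in self._path_cache:
                tail = fullname.rpartition('.')[2] + '.py'
                self._path_cache[fullname] = os.path.join(self.base_dir, tail)
            return self._path_cache[fullname]

        def get_source(self, fullname):
            # PEP 302's optional get_source() extension fits a name-only API.
            with open(self._source_path(fullname)) as source_file:
                return source_file.read()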

So that is the direction I am going with for my API change to importlib. I am also breaking up the filesystem importer and loader I wrote so that extension modules and py/pyc modules each get their own importers and loaders. That way the generalization does not need to be so severe across everything. Plus extension modules are just plain different from py/pyc files, and trying to make them work with the same API seems extreme.