2008-09-17

Designing an archive format for Python source/bytecode files

Google App Engine just released version 1.1.3 of the SDK. As part of this release, there is now a pure Python implementation of zipimport. This was created to let people work around the 1000-file limit placed on GAE apps. One could now create a zip of the latest version of Django, for instance, and have it count as only a single file against the quota.

This, along with the fact that zipimport can be faster (thanks to the significant drop in stat calls and to potentially having the entire zipfile already in memory, thus skipping the disk entirely), got me thinking about what an archive format designed explicitly for Python might look like.

To begin, it would need to support both source and bytecode. This would allow for tracebacks to include source. That's about it in terms of clear-cut answers. =)

The first question is how to handle __path__. For zipimport, __path__ is set to a munged path: the path to the zip file followed by the internal path in the zip to the package directory. The drawback to this is that __path__ now contains a string that is not a literal file path. It also means going through sys.path_hooks for all packages and potentially creating new importers, which has always seemed suboptimal to me. Unfortunately for anyone wanting to drop this approach, it does let code tweak __path__ as necessary. I don't know how often __path__ is actually played with, so I don't know if this is important to support. If it is, then I would want to find a way to make multiple importers unnecessary.
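To make the munged __path__ concrete, here is a small sketch that builds a throwaway zip, imports a package from it, and inspects the result. It assumes a modern Python 3 interpreter; the names ``demo_zip_pkg`` and ``demo.zip`` are made up for the illustration.

```python
import importlib
import os
import sys
import tempfile
import zipfile

# Build a throwaway zip holding a package with a single module.
tmpdir = tempfile.mkdtemp()
archive = os.path.join(tmpdir, "demo.zip")
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("demo_zip_pkg/__init__.py", "")
    zf.writestr("demo_zip_pkg/mod.py", "VALUE = 42\n")

# Putting the zip on sys.path lets zipimport handle it.
sys.path.insert(0, archive)
importlib.invalidate_caches()
pkg = importlib.import_module("demo_zip_pkg")
mod = importlib.import_module("demo_zip_pkg.mod")

# The package's __path__ entry is the zip path plus the internal directory.
pkg_path = list(pkg.__path__)
print(pkg_path)

# That string looks like a file path but names nothing on disk.
print(os.path.exists(pkg_path[0]))
```

The entry in ``pkg_path`` starts with the path to the archive itself, so any code that treats it as a real directory will be surprised.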

Next comes how names should be specified within the archive. Dotted names make sense, but then the __path__ issue comes up again. If people assume that everything in __path__ is a file path, they are probably using ``os.path.join()``, meaning that at least for that part of the importer (if __path__ manipulation was supported), __path__ would need to contain paths with slashes in them. But I suspect modules will be imported more often than packages (merely because there will typically be at least a couple of modules per package), so I think storing the name in dotted format is best.
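As a rough sketch of that trade-off: the table of contents could be keyed by dotted name so a module lookup is a single dictionary access, with a slash-separated __path__ entry built only when a package needs one. Everything here (``TOC``, ``find``, ``package_path``, the offsets and sizes) is hypothetical and only illustrates the idea.

```python
import posixpath

# Hypothetical table of contents: dotted name -> (is_package, offset, size).
TOC = {
    "django": (True, 0, 120),
    "django.http": (True, 120, 300),
    "django.http.request": (False, 420, 2048),
}


def find(name):
    """Look up a module by dotted name; returns None if it is not stored."""
    return TOC.get(name)


def package_path(archive, name):
    """Build a slash-separated __path__ entry for a package on demand."""
    is_package, _, _ = TOC[name]
    if not is_package:
        raise ValueError(f"{name} is not a package")
    return posixpath.join(archive, *name.split("."))


print(find("django.http.request"))
print(package_path("/srv/app.pyarc", "django.http"))
```

The common case (finding a module) never touches path manipulation; only code that pokes at __path__ pays for the conversion.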

Next comes handling new bytecode. zipimport does not support this, which I have always thought to be a shame. I am sure this is because it costs so much to rewrite the zipfile: the main header info sits at the end of the file, so appending any new file means writing over it. Tarballs are more of a linked list, but that makes looking for a single file expensive. You could traverse the file as needed to look for a file and cache any info found, but that still has a startup cost of having to traverse the archive in the first place to build the table of contents.
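The layout problem is easy to see with the standard ``zipfile`` module. This sketch writes a tiny in-memory zip and locates the central directory by its signature bytes; the member names are only placeholders.

```python
import io
import zipfile

# Write a small zip into memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("heapq.py", "# source\n")
    zf.writestr("heapq.pyc", b"\x00" * 16)
data = buf.getvalue()

# Offset of the last member's local header, as recorded by zipfile.
with zipfile.ZipFile(io.BytesIO(data)) as zf:
    last_member_offset = max(info.header_offset for info in zf.infolist())

# Signatures defined by the zip format.
central_dir = data.find(b"PK\x01\x02")  # first central directory entry
end_record = data.rfind(b"PK\x05\x06")  # end of central directory record

print(last_member_offset, central_dir, end_record, len(data))
```

The central directory comes after every member, so appending a new member has to start where the directory currently lives and then write the directory out again.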

But what if you overwrote old bytecode? If you stored file size along with absolute start position within the archive, you could pad the amount of space bytecode was given per file, write out whatever bytecode you wanted, and then if the Python you were running had different bytecode, simply overwrite the old version. If you look at Python 2.4, 2.5, and 2.6rc1, the bytecode size for heapq is 12725, 12254, and 12242, respectively, according to ``len`` for the same heapq.py file. So bytecode size is decreasing over time, at least for this module. Given that, you could probably pad the amount of space allocated for bytecode files and most of the time be able to write into it.
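A minimal sketch of the padded-slot idea, assuming a table of contents that records offset, capacity, and used size per entry. The ``SLACK`` factor and the function names are invented for the example, and an in-memory buffer stands in for the archive file.

```python
import io

SLACK = 1.25  # hypothetical padding factor, not a measured recommendation


def allocate(archive, toc, name, payload):
    """Append payload in a padded slot and record (offset, capacity, used)."""
    archive.seek(0, io.SEEK_END)
    offset = archive.tell()
    capacity = int(len(payload) * SLACK)
    archive.write(payload + b"\x00" * (capacity - len(payload)))
    toc[name] = (offset, capacity, len(payload))


def rewrite(archive, toc, name, payload):
    """Overwrite a slot in place; return False if the payload does not fit."""
    offset, capacity, _ = toc[name]
    if len(payload) > capacity:
        return False
    archive.seek(offset)
    archive.write(payload)
    toc[name] = (offset, capacity, len(payload))
    return True


def read(archive, toc, name):
    """Return only the bytes in use for an entry, ignoring the padding."""
    offset, _, used = toc[name]
    archive.seek(offset)
    return archive.read(used)


archive, toc = io.BytesIO(), {}
allocate(archive, toc, "heapq", b"a" * 12242)  # size reported for 2.6rc1
allocate(archive, toc, "other", b"z" * 100)
fits = rewrite(archive, toc, "heapq", b"b" * 12725)  # size reported for 2.4
print(fits)
```

With this padding the larger 2.4-sized payload still fits in a slot sized for 2.6rc1, and neighbouring entries are untouched. A payload that outgrows its slot would need some fallback, such as appending a fresh slot, which the sketch leaves out.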

Or, if you decide to make the archive read-only and generate it as part of installing some package, you could build it with the version of Python meant to use it and write bytecode and source near each other for locality reasons. Being read-only also prevents any race conditions you might have if you are sharing the archive between multiple interpreters (people who have run Python on supercomputers used to have this problem, where .pyc files were constantly out of whack because they were written to by multiple processes; the -B flag solves this by suppressing writing out bytecode).
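For reference, the same switch is available from inside a running interpreter. This short sketch uses ``sys.dont_write_bytecode``, which has the same effect as the -B flag for any imports that happen afterwards.

```python
import sys

# Same effect as running "python -B": later imports will not write
# bytecode files to disk.
sys.dont_write_bytecode = True

print(sys.dont_write_bytecode)
```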

But I have no clue if any of this is even worth it. The potential savings of space and speed-up from having a simpler archive format over zipfiles could quite easily be negligible to non-existent. Potentially the only benefit is support for rewriting bytecode as needed, but I don't know how reasonable that would be to pull off or how often it really comes up, since I don't know if most people would generate an archive at install time or ship the archive.