Rethinking import

Recently there has been some discussion on python-dev about import and what people don't like about it, how to possibly change it, etc. Since I am going to have to muck around with the import machinery for my security work I have been contemplating how I would change imports if I had the time and inclination.

But before I start dribbling ideas from my proverbial mouth, I should say two things. First, this is about the machinery for imports, not new ways of doing the import statement. Second, this is all for Py3K and not for Python 2.x since this is all pie-in-the-sky stuff.

To begin any discussion about imports you need to identify the different facets to an import. There is .py vs. .pyc/.pyo vs. extension modules vs. custom imports. There are absolute vs. relative imports. Finally there are packages vs. individual modules.

Now, we should also look at how imports are controlled currently. We have sys.path, __import__, .pth files, and the PEP 302 importers. In terms of sys.path you have the stdlib (with all of the plat and lib directories included in that), site-packages, .pth files, and PYTHONPATH. __import__ works off of sys.path. The PEP 302 importers can basically do what they want, but are usually associated with an entry in sys.path.

What issues are there with this current setup, though? For one, there are several places to tweak imports: sys.path, __import__, and the three sys module attributes for PEP 302. It would be nice if this were a little bit more unified. People also tend not to like it when code tweaks sys.path, but complicated packages usually do tweak it. I also personally hate .pth files, but that's partially because I had to re-implement them when I partially rewrote site.py.
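For reference, the three PEP 302 attributes on the sys module are ``sys.meta_path``, ``sys.path_hooks``, and ``sys.path_importer_cache``. Here is a small sketch that gathers them next to ``sys.path`` so you can see all the tweak points in one place; the helper name ``import_tweak_points`` is made up for illustration.

```python
import sys


def import_tweak_points():
    """Return the current size of each place where imports can be tweaked.

    Hypothetical helper, only meant to list the knobs side by side.
    """
    return {
        "sys.path": len(sys.path),
        "sys.meta_path": len(sys.meta_path),
        "sys.path_hooks": len(sys.path_hooks),
        "sys.path_importer_cache": len(sys.path_importer_cache),
    }


for name, size in import_tweak_points().items():
    print(f"{name}: {size} entries")
```

The numbers will differ from one installation to the next; the point is only that there are four separate containers to keep track of.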

OK, so let's look at the various facets of imports and see what they require and what is bad about them. Let's start with packages vs. modules. An absolute import builds off of an entry in sys.path. A package is a directory with an __init__.py inside a sys.path entry. A module is just a file in a sys.path entry. Having both hang off of a sys.path entry allows packages and modules to be added dynamically to a sys.path entry. The problem is that it ties into the file system in a loose way, which leads to a ton of stat calls looking for the code you are trying to import, since you have to check for a .py, a .pyc, or an extension module. It would be nice to not have to tie into the filesystem as much for more esoteric import setups such as zip files or tarfiles, URLs, etc.

Absolute imports are simple; just look off of sys.path or have a PEP 302 importer pick up the import. As for relative imports, they do not work off of sys.path but on the dotted name of the module that contains the import statement. So relative imports get turned into absolute imports by doing the proper change to the dotted name.
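To make the "change to the dotted name" concrete, here is a minimal sketch of that rewrite. It assumes the leading-dot notation from PEP 328, where the number of dots says how far up the package hierarchy to go; the function name ``resolve_relative`` and its parameters are my own invention, not anything in the interpreter.

```python
def resolve_relative(name, package, level):
    """Turn a relative import into an absolute dotted name.

    package -- dotted name of the package containing the importing module
    level   -- number of leading dots in the relative import (1 or more)
    name    -- what follows the dots; may be an empty string
    """
    if level < 1:
        raise ValueError("level must be >= 1 for a relative import")
    parts = package.split(".")
    if level > len(parts):
        raise ImportError("attempted relative import beyond top-level package")
    # One dot means "this package"; each extra dot climbs one level up.
    base = ".".join(parts[: len(parts) - (level - 1)])
    return f"{base}.{name}" if name else base


print(resolve_relative("utils", "pkg.sub", 1))  # pkg.sub.utils
print(resolve_relative("other", "pkg.sub", 2))  # pkg.other
```

Once the name has been rewritten this way, the rest of the machinery only ever has to deal with absolute names.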

And there is the issue for supporting .py, .pyc, and extension modules for all sys.path entries. If you look at the number of stats required to import a module you will notice that it stats for all three different file types plus package directory for *every* entry in sys.path until the code is found. That is expensive if you are running over something like NFS or any other filesystem where stat calls are expensive.
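A rough sketch of why the cost adds up: the code below lists every location a naive finder would probe and counts the probes until something is found. The suffix list and the probing order are simplifying assumptions on my part (real interpreters use platform-specific suffixes), and ``candidate_paths`` and ``count_checks`` are hypothetical names.

```python
import os

# Assumed suffixes for illustration: source, bytecode, extension module.
SUFFIXES = (".py", ".pyc", ".so")


def candidate_paths(module_name, search_path):
    """Yield every filesystem location that might hold module_name."""
    for entry in search_path:
        # Package check: a directory containing __init__.py.
        yield os.path.join(entry, module_name, "__init__.py")
        # Then one check per supported file type.
        for suffix in SUFFIXES:
            yield os.path.join(entry, module_name + suffix)


def count_checks(module_name, search_path, exists=os.path.exists):
    """Count how many locations are probed before the module is found."""
    checks = 0
    for path in candidate_paths(module_name, search_path):
        checks += 1
        if exists(path):
            break
    return checks


# Pretend the module lives only in the third entry of the search path.
fake_path = ["/a", "/b", "/c"]
target = os.path.join("/c", "mod.py")
print(count_checks("mod", fake_path, exists=lambda p: p == target))  # 10
```

With four probes per entry, a module sitting in the third entry costs ten probes, and a failed import costs four times the length of the path.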

The difference between packages and modules is kind of interesting. A package gains its root namespace name based on the name of a directory, while modules are based on a filename. This means that if you look at a directory it could be the top of a package, or just a random collection of modules.

Another thing that must be considered is directly executing modules. Doing that sets __name__ to '__main__', which helps to kill relative imports (which is good since they are only meant for packages and should be minimized).

And there is working off of code in the current directory. I am sure I am not the only person who writes a quick one-off script in one terminal/buffer/whatever and has an interpreter that was executed from the same directory in another window, constantly reloading the script to test it until it does what I want. So there has to be some consideration of what is in the current directory.

OK, so after all of the blathering on my part in terms of what imports should support and how they are a pain to work with, how am I going to suggest this whole situation be fixed? Well, it begins with the idea of import objects. Not necessarily just like PEP 302, but a simple interface that is to be implemented for a module that is to handle importing. Hell, it might be as simple as a function that gets called (more discussion below).

Now, let's have a registry of top-level namespaces. This registry would work like a dict (might even be a dict) where the key is the top-level namespace name and the value somehow associates that namespace with an importer object that can go forth and import that namespace (might be as simple as a string to associate with an importer and a tuple of the arguments to pass in to the importer). That way when an import occurs for something, it is a simple key lookup in this registry, and if it matches you then use the value to do the import. And if the namespace is not in the registry then it is an ImportError. Relative imports act as they do now and are just a change in the dot name of a package and so are turned into absolute imports.
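Here is a minimal sketch of that lookup, under a pile of assumptions: the registry is a plain dict, each value is a tuple of an importer name followed by its arguments, and importers live in a second dict keyed by name. The ``py_import`` stub only returns a description instead of loading anything, and every name here is made up for illustration.

```python
def py_import(dotted_name, kind, location):
    """Stand-in importer: describes the import instead of performing it."""
    return f"<{kind} import of {dotted_name} from {location}>"


# Hypothetical table mapping importer names to callables.
IMPORTERS = {"py_import": py_import}

# Hypothetical registry: top-level namespace -> (importer name, *arguments).
REGISTRY = {"email": ("py_import", "pkg", "/usr/local/lib/python3.0/email")}


def registry_import(dotted_name, registry=REGISTRY, importers=IMPORTERS):
    """Import dotted_name through a single key lookup in the registry."""
    top_level = dotted_name.partition(".")[0]
    try:
        importer_name, *args = registry[top_level]
    except KeyError:
        raise ImportError(f"no registered namespace {top_level!r}") from None
    # The importer is only looked up and called at import time.
    return importers[importer_name](dotted_name, *args)


print(registry_import("email.charsets"))
```

Asking for a namespace that was never registered, such as ``registry_import("nope.thing")``, raises ImportError right away without touching the filesystem.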

This can allow third-party packages and modules to register themselves with the Python installation as needed for the importer they want to use to be able to import their namespaces. This could obviously be added to Distutils' setup.py scripts. The Python interpreter could even have an extremely simple registration flag built in for quick registrations.

But how about temporary registrations? This is when we should borrow from Java and have our own version of the classpath. Stuff listed on that path is added at run-time to the registry for this one execution of the interpreter. Everything needed for the import can be inferred from the name of a listed module or directory, assuming it is a standard .py/.pyc/extension import. The current directory could even be implicitly included in this classpath.
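A sketch of how that inference might go, working purely from the names on the path and never touching the filesystem. The entry format, the suffix list, the "first listed wins" rule, and the function name ``entries_from_classpath`` are all assumptions for illustration.

```python
import os

# Assumed suffixes that mark a listed item as a single module.
MODULE_SUFFIXES = (".py", ".pyc")


def entries_from_classpath(classpath, sep=os.pathsep):
    """Build temporary registry entries from a classpath-style string."""
    registry = {}
    for item in classpath.split(sep):
        if not item:
            continue  # skip empty segments such as a trailing separator
        base = os.path.basename(os.path.normpath(item))
        stem, ext = os.path.splitext(base)
        if ext in MODULE_SUFFIXES:
            name, kind = stem, "module"
        else:
            # Anything else is assumed to be a package directory.
            name, kind = base, "pkg"
        # First item listed for a given name wins (an assumption).
        registry.setdefault(name, ("py_import", kind, item))
    return registry


print(entries_from_classpath("/tmp/scratch.py:/opt/mylib", sep=":"))
```

The resulting dict has the same shape as a permanent registry, so it could simply be merged in for the lifetime of the process.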

There could even be support for pointing to other registries that get pulled in. So there can be the master registry for the Python installation, a system-wide one that acts like site-python, and then the one generated on the fly from the Python version of classpath.
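One way to layer several registries without copying them is ``collections.ChainMap`` from the standard library, which searches its mappings in order. The precedence below (temporary first, then system-wide, then the installation) and all of the entries are assumptions made up for this sketch.

```python
from collections import ChainMap

# Hypothetical registries; each maps a namespace to (importer, *arguments).
installation = {"email": ("py_import", "pkg", "/usr/local/lib/python3.0/email")}
system_wide = {"mylib": ("py_import", "pkg", "/opt/mylib")}
temporary = {
    "scratch": ("py_import", "module", "/tmp/scratch.py"),
    # Shadows the system-wide entry for this one run of the interpreter.
    "mylib": ("py_import", "pkg", "/home/me/dev/mylib"),
}

# Earlier mappings win, so temporary registrations override the others.
registry = ChainMap(temporary, system_wide, installation)

print(registry["email"][2])  # /usr/local/lib/python3.0/email
print(registry["mylib"][2])  # /home/me/dev/mylib
```

A lookup is still a plain key access from the caller's point of view, and a missing namespace still raises KeyError, which can be turned into an ImportError.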

Going back to the registry, it could be something like ``{'email' : ('py_import', 'pkg', '/usr/local/lib/python3.0/email')}``. The key specifies the namespace. The tuple specifies which importer to use and the arguments to pass into the importer, in addition to the dotted name of what is to be imported (``py_import('email.charsets', 'pkg', '/usr/local/lib/python3.0/email')``). This allows the importer to be as lazy as possible and minimizes object creation at interpreter startup. This registry could just be a pickled dict that gets loaded at run-time.
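As a sketch of the pickled-dict idea: because the registry holds nothing but strings, loading it creates no importer objects and imports no code. The file name ``registry.pickle`` and the two helper functions are hypothetical.

```python
import os
import pickle
import tempfile

registry = {"email": ("py_import", "pkg", "/usr/local/lib/python3.0/email")}


def save_registry(registry, path):
    """Write the registry to disk as a pickled dict."""
    with open(path, "wb") as f:
        pickle.dump(registry, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_registry(path):
    """Read the registry back; only plain strings and tuples are created."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Round-trip through a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "registry.pickle")
    save_registry(registry, path)
    loaded = load_registry(path)

print(loaded["email"])
```

One caveat worth keeping in mind is that unpickling data from an untrusted source is unsafe, so a real registry file would need to be as well protected as the installation itself.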

That does add the need for another registry of importers. That might have to act like site-packages where the code is always put in a specific place to prevent bootstrapping issues of importers themselves and such. So you could have a per-file importer directory that gets pulled from for all importers. Since custom importers are a rare thing this shouldn't be a big issue having a single directory where importers are kept.

Anyway, that is my crazy idea for re-implementing importers. I think it provides the proper flexibility people tend to want, but might be overly complicated for some. I think the trick is making sure that the way things work now carries over so that all of this is as transparent as possible. But this should allow for far fewer stat calls, make various styles of importers easier to work with, and remove the need for site-packages so that people can install their software where they want to.

My big worry is whether this is too complex or will have performance issues somehow. But then again that assumes this is even a sane idea overall and people like it. And then it requires me to put the time and effort into implementing it. =)