Coder Who Says Py: Import pseudocode now covers how modules are searched for

Previously I announced some pseudocode I tossed together to explain the core part of the import algorithm Python uses. Turns out some people liked the post, so I decided to flesh out the pseudocode some more.

In terms of the pre-existing pseudocode, I added support for how __path__ entries are handled. I also tried to clarify some comments.

This update includes code to explain how the file system is searched for an extension module or Python source or bytecode module. Probably the biggest thing to get out of this is how many stat calls Python has to make for each entry in sys.path. Two things come into play for this: file suffixes and packages.

Both extension modules and Python code have multiple file suffixes that can be valid. For extension modules, the typically valid suffixes are ".so" and "module.so" (on OS X; if I don't specify an OS assume I am talking about OS X or some other UNIX-based OS). That means that when import looks for the 'spam' module it has to look for 'spammodule.so' and 'spam.so' in the entry on sys.path that is currently being considered. The same goes for Python code, with '.py' and either '.pyc' or '.pyo' depending on whether the interpreter is running with optimizations turned on. And you Windows users are even worse off as you have '.pyw' as well.
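To make the suffix handling concrete, here is a minimal sketch of the file names that would be tried for a plain (non-package) module in one directory. The suffix lists are hard-coded from the ones named above, and module_candidates is a made-up helper for illustration, not something taken from importlib or the pseudocode.

```python
import os

# Assumed suffix lists, copied from the ones named in the post. A real
# interpreter works these out per platform rather than hard-coding them.
EXTENSION_SUFFIXES = ["module.so", ".so"]
SOURCE_SUFFIXES = [".py"]
BYTECODE_SUFFIXES = [".pyc"]  # ".pyo" instead when running with optimizations


def module_candidates(name, directory):
    """Return the file paths to check for a non-package module."""
    suffixes = EXTENSION_SUFFIXES + SOURCE_SUFFIXES + BYTECODE_SUFFIXES
    return [os.path.join(directory, name + suffix) for suffix in suffixes]
```

Calling module_candidates('spam', entry) gives four paths, one per suffix, for a single sys.path entry.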

But what is being imported may be a package. That means you need to see if the module name is actually a directory name with an __init__ file (the current implementation only checks Python code files for this; importlib allows extension modules as well but that was for ease of coding reasons). That means you don't only check for 'spam.py' and 'spam.pyc', but also 'spam/__init__.py' and 'spam/__init__.pyc'.

All told, that is a lot of stat calls per entry on sys.path. To cover extension modules you need two (one per suffix), and for Python code you need four (a module check and a package check for each of the source and bytecode suffixes). That adds up to six stat calls per entry in sys.path in the worst case (it's actually worse in the C code, as a stat on the directory for the package check adds one more, but importlib gets around it, I think, by just using os.path.isfile).
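The count of six can be sketched as follows. This is only an illustration under assumptions: the order in which the candidates are tried is my guess, the suffix lists are hard-coded, and find_in_entry is a hypothetical helper rather than importlib's actual code. Each os.path.isfile call stands in for one stat call.

```python
import os

# Assumed suffixes, as named in the post.
EXTENSION_SUFFIXES = ["module.so", ".so"]
CODE_SUFFIXES = [".py", ".pyc"]


def candidate_paths(name, entry):
    """List every path checked for `name` in one sys.path entry."""
    # Two checks for extension modules, one per suffix.
    paths = [os.path.join(entry, name + s) for s in EXTENSION_SUFFIXES]
    # Four checks for Python code: package and module per suffix.
    # (The ordering here is an assumption made for the sketch.)
    for suffix in CODE_SUFFIXES:
        paths.append(os.path.join(entry, name, "__init__" + suffix))
        paths.append(os.path.join(entry, name + suffix))
    return paths


def find_in_entry(name, entry, isfile=os.path.isfile):
    """Return (path or None, number of file checks performed)."""
    checks = 0
    for path in candidate_paths(name, entry):
        checks += 1
        if isfile(path):
            return path, checks
    return None, checks
```

A miss costs all six checks for that entry, which is the worst case described above, and it repeats for every entry that does not hold the module.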

Luckily not all imports are top-level and thus hitting all of sys.path. If an import is happening within a package then you only need to check the __path__ attribute for the containing package (which is typically only a single directory).
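Choosing which list of directories to search can be sketched like this. It assumes the parent package has already been imported and is sitting in sys.modules; search_path_for and its keyword arguments are hypothetical names used only for this example.

```python
import sys


def search_path_for(fullname, modules=None, top_level_path=None):
    """Return the directories to search for the module `fullname`."""
    modules = sys.modules if modules is None else modules
    parent_name, _, _ = fullname.rpartition(".")
    if not parent_name:
        # Top-level import: every entry on sys.path is a candidate.
        return list(sys.path if top_level_path is None else top_level_path)
    # Import within a package: only the parent's __path__ is searched.
    # Assumes the parent package is already imported.
    parent = modules[parent_name]
    return list(parent.__path__)
```

For 'pkg.mod' this returns pkg.__path__, usually a single directory, so the six checks happen once instead of once per sys.path entry.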

Now some people blame Python's startup time on all of the imports the interpreter does to get itself going. And usually people blame the stat calls specifically. On NFS I can believe this. But as far as I know, no one has bothered to implement a caching importer that cuts out the stat calls and then benchmark it on a local hard disk to see if the stat calls really are the problem.

My immediate worry with a caching importer is staleness. I want to be able to add a file to a directory on sys.path and have that file be noticed. I also want to be able to delete files and not have an importer think they are still there. I assume a stat call on the directory to check the modification time should pick up any changes to it and thus lead to a refreshing of the directory contents cache.
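The staleness check described here might look roughly like the sketch below. DirectoryCache is a hypothetical class, not part of importlib, and it leans on the assumption that the directory's modification time changes whenever a file is added or removed. On a file system with coarse timestamps, a change landing in the same tick as the last listing could be missed.

```python
import os


class DirectoryCache:
    """Cache a directory listing, refreshing it when the mtime changes."""

    def __init__(self, directory):
        self.directory = directory
        self._mtime = None
        self._names = frozenset()

    def contents(self):
        # One stat call on the directory replaces the per-file stat calls.
        mtime = os.stat(self.directory).st_mtime_ns
        if mtime != self._mtime:
            # Directory changed (or first use): reload the listing.
            self._names = frozenset(os.listdir(self.directory))
            self._mtime = mtime
        return self._names

    def __contains__(self, filename):
        return filename in self.contents()
```

With this, a check such as 'spam.py' in cache costs a single stat on the directory plus a set lookup, however many suffixes are tried.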

There is also the cost of getting the directory contents for the first time. I don't know how expensive that is compared to six stat calls. If it turns out to be more expensive in the common case it might not be worth it when there are not a ton of imports; there is a reason we have sys.modules.
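One rough way to get a feel for that trade-off is to time both approaches. The sketch below assumes the same six candidate files as before and uses a hypothetical compare_costs helper. It measures with whatever the operating system already has cached, so it says little about a cold start or about NFS.

```python
import os
import timeit


def compare_costs(directory, name="spam", repeat=1000):
    """Time six file checks against one directory listing."""
    candidates = [
        os.path.join(directory, name + suffix)
        for suffix in ("module.so", ".so", ".py", ".pyc")
    ]
    candidates += [
        os.path.join(directory, name, "__init__" + suffix)
        for suffix in (".py", ".pyc")
    ]
    # Six os.path.isfile calls, i.e. roughly six stat calls per round.
    stat_time = timeit.timeit(
        lambda: [os.path.isfile(p) for p in candidates], number=repeat
    )
    # One full directory listing per round.
    listdir_time = timeit.timeit(lambda: os.listdir(directory), number=repeat)
    return stat_time, listdir_time
```

The numbers will vary with the size of the directory and the file system, which is exactly why it needs benchmarking rather than guessing.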

Anyway, part of the reason I am really happy I wrote importlib is that it allows for easy experimentation like this. I am hoping to get to this at some point to see if it really does make a difference.

2007-06-12
