Coder Who Says Py: January 2009

2009-01-27

Snakebite announced

Well, I am definitely not the first person to blog about Snakebite, but it is such a big thing I figured I would do my part to get the word out. Having a huge server farm available to the Python developers to fix platform-specific bugs is going to be great.

A perk of being the chairman of the infrastructure committee for the PSF is I got to know about this when Trent first started it. But even knowing about it didn't prepare me for the scope of what he has done with help from Titus (and others).

And so a big "thank you" from me to Trent for doing this! The man shelled out a lot of time and cash to do this for the benefit of the Python community and I think he deserves all the kudos he has been getting for this.

2009-01-17

importlib is now in Python 3.1

[update: fixed a typo and a broken link]

Back in the summer of 2006 I interned at Google under Neal Norwitz. Part of what I did that summer was try to figure out how to potentially secure the Python interpreter for embedding into Firefox. I did finally figure out how to secure the interpreter for embedding to protect resources, culminating in a paper for a security course I took at UBC.

Part of the solution I developed required controlling import such that you couldn't import arbitrary built-in, frozen, and extension modules. As import is currently implemented, that is simply not possible to do in a secure fashion. That meant reworking or rewriting import. Since the import code was known to be a little difficult to work with I decided to rewrite it in pure Python.

Work began on October 4, 2006. At the time I was planning on making my Python security work my thesis topic with the long-term goal of making my rewrite of import the official implementation of import. Little did I know how massive of a project this would turn out to be.

Two years, three months, and 13 days later, importlib came into being for Python 3.1 in revision 68698. Between the beginning and now my security work stopped being my thesis topic (as did anything directly relating to Python), I dropped support for Python 2.x, and I learned part of the reason the C implementation is so difficult to work with is that import's semantics are rather nuanced and require juggling a lot in your head at once. Oh, and allowing different source encodings is evil.

This is easily the longest amount of time I have ever spent on a single piece of code. One of the surprising things is that the thing is not even 2,000 LOC, including tests! It just took forever to get the semantics fully backwards-compatible (short of some assumptions in the tests, you can currently run the entire test suite for Python with importlib as __import__ and have things work).

The other thing that held up checking importlib in was being too much of a perfectionist. I think I implemented importlib twice, and I still have plans on how to clean things up. Importlib has become the perfect example that your initial implementation might work, but it most definitely will not be the best implementation you can do. Heck, I still have some things to change to make the code easier to work with and more useful to users.

The perfectionist part also came out through worrying about the public API. I know people want to have access to all of the code I have written for their own importers. So I have constantly worried about how to expose it in a sane way. But API design is hard, especially when it is in Python's standard library. Get something wrong and you have to live with it for at least one extra release when you add a deprecation. This is why I am going to expose the API slowly over time and probably blog about it so that I can get feedback from people.

Now that the code is in, what are the long term plans? Well, I have notes with the code that cover what I plan to do. They start with documenting importlib.import_module. That is to be the function that everyone has asked for: a usable interface over __import__. As it stands now the interface is ``import_module(name, package)`` where 'name' is what to import, including relative imports, and 'package' is the package for the calling module. Calling the function returns the specified module, not the top module like __import__; no more fake values in fromlist! I might change the argument names, but I can't think of any other way to make the API simpler and more straightforward.
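To make the difference concrete, here is a small sketch using importlib.import_module as it exists in today's standard library (the argument names may not match what I had at the time of writing). The modules used, os.path and json.decoder, are just convenient stand-ins.

```python
import importlib
import json.decoder
import os.path

# __import__ hands back the top-level package, not the module you named.
top = __import__('os.path')

# import_module hands back the module that was actually requested.
leaf = importlib.import_module('os.path')

# A relative name is resolved against the 'package' argument.
relative = importlib.import_module('.decoder', package='json')

print(top.__name__)       # the top-level package
print(leaf is os.path)    # the requested module itself
print(relative.__name__)  # the module found relative to 'json'
```

The only way to get the leaf module out of __import__ is to pass a non-empty fromlist, which is exactly the trick import_module is meant to make unnecessary.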

Past that is cleaning up some things through better refactorings, exposing more code, and then exposing more of the code. But the end goal is still to get this all bootstrapped into Python 3.1 so that importlib becomes the actual implementation of __import__.

2009-01-09

Getting importlib into the standard library by PyCon

My email to python-dev about importlib has not raised anyone's ire within 24 hours, which is a good sign. Assuming no one is going to try to stone-wall me (I would honestly be shocked if anyone did), I am committing to getting importlib into py3k by PyCon.

I have two things left before I am willing to check importlib into py3k and continue development there. One is getting the tests running under regrtest since I have not structured them for that. Second is that I have some file reorganizing to do in the test suite. Since svn is not exactly a lover of file renames I figure it would be best to just have that all straightened out as much as possible before I check into svn.

But PyCon is a hard deadline. No matter how much is left, the code will be checked in no later than when I get WiFi access at the hotel for PyCon.

2009-01-06

The confusing terminology of imports

While I somewhat await someone to tell me importlib is broken, I am thinking about the public API I plan to expose for the package. That has put squarely in my face the issues I have with import terminology and how PEP 302 has muddled things somewhat.

PEP 302 essentially introduces the concepts of "hooks", "importers", and "loaders". When I read the PEP I come away with "hooks" being an all-encompassing term for objects that help with importing. "Importers" are objects that define find_module and potentially load_module. "Loaders" define load_module.

The issues I have are with the definitions of "hooks" and "importers". I personally view "hooks" as things that go on sys.path_hooks, not just any object that helps with importing. For that I prefer the term "importer" as the word is tied into "import".

That means the definition of "importer", as I read the PEP, is not right to me. I prefer the term "finder" for an object that defines find_module as that is what the object does; it finds the module if possible. That would mean an importer is a finder, a loader, or both.
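Here is a minimal sketch of that terminology, following the PEP 302 method names find_module and load_module. The class DictImporter and the module name demo_mod_sketch are hypothetical, made up for illustration. The object is driven by hand rather than registered with the import system, so the sketch does not depend on how any particular Python version calls its hooks.

```python
import sys
import types


class DictImporter:
    """Hypothetical importer: a finder and a loader in a single object.

    Modules are served from a dict mapping module names to source strings.
    """

    def __init__(self, sources):
        self.sources = sources

    # The "finder" half: say whether the module can be handled.
    def find_module(self, fullname, path=None):
        return self if fullname in self.sources else None

    # The "loader" half: create the module and execute its source.
    def load_module(self, fullname):
        if fullname in sys.modules:
            return sys.modules[fullname]
        module = types.ModuleType(fullname)
        module.__loader__ = self
        # PEP 302 asks that the module be in sys.modules before its code runs.
        sys.modules[fullname] = module
        try:
            exec(self.sources[fullname], module.__dict__)
        except BaseException:
            del sys.modules[fullname]
            raise
        return module


importer = DictImporter({'demo_mod_sketch': 'answer = 6 * 7'})
loader = importer.find_module('demo_mod_sketch')  # finder step
module = loader.load_module('demo_mod_sketch')    # loader step
print(module.answer)
```

Splitting the two halves into separate classes would give a pure finder that returns a separate loader, which is the distinction the terms are meant to capture.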

But how does this play out in a potential package layout? Assuming I stick with importlib as the package name (I don't want to make imp a package as that just gets messy with existing names along with what to name the existing imp and _importlib), that would mean I would want to stick all of the importers into the importlib.importers module. While that is fine, that is a lot of "import", especially if you end up with "importlib.importers.BuiltinImporter". Using a name more like importlib.hooks is easier to read and less error-prone to type; "importlib.hooks.BuiltinImporter" is a lot less repetitive.

So my hope of redefining "hooks" might not work out for pragmatic reasons. Hell, PEP 302 is entitled "New Import Hooks" which seems to make it a catch-all term. And I don't want to put everything directly under importlib as that makes the namespace huge; BuiltinImporter, FrozenImporter, Finder, Loader, etc., all under the same module? I would rather have importlib.hooks hold BuiltinImporter and FrozenImporter and importlib.abc hold Finder, Loader, etc.

So I think I just convinced myself of having under the importlib package the hooks, abc, util, and test modules. But I am going to use the term "finder" for an object that defines find_module and thus the soon-to-exist importlib.abc.Finder ABC.

That leaves the challenge of naming the ABC for the PEP 302 protocol that covers get_source/get_code/is_package. If something defines these methods what would it be called? Is the other API an introspective one? I could go with IntrospectiveLoader or InspectLoader. I think I prefer the former, but that's rather long and makes me feel dirty like Java makes me feel dirty with its naming. InspectLoader it is!
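As a rough sketch of what such an ABC could look like, here is one built on the standard abc module. The names InspectLoaderSketch and DictInspectLoader are hypothetical and this is not the code in importlib; only the three method names come from PEP 302.

```python
import abc


class InspectLoaderSketch(abc.ABC):
    """Hypothetical ABC for the optional PEP 302 introspection methods."""

    @abc.abstractmethod
    def is_package(self, fullname):
        """Return True if the named module is a package."""

    @abc.abstractmethod
    def get_code(self, fullname):
        """Return the code object for the named module."""

    @abc.abstractmethod
    def get_source(self, fullname):
        """Return the source of the named module as a string."""


class DictInspectLoader(InspectLoaderSketch):
    """Toy implementation backed by a dict of module name -> source."""

    def __init__(self, sources):
        self.sources = sources

    def _check(self, fullname):
        # PEP 302 has these methods raise ImportError for unknown modules.
        if fullname not in self.sources:
            raise ImportError('cannot find module %r' % fullname)

    def is_package(self, fullname):
        self._check(fullname)
        return False  # this toy loader only knows plain modules

    def get_source(self, fullname):
        self._check(fullname)
        return self.sources[fullname]

    def get_code(self, fullname):
        return compile(self.get_source(fullname), '<%s>' % fullname, 'exec')


inspect_loader = DictInspectLoader({'demo': 'x = 40 + 2'})
namespace = {}
exec(inspect_loader.get_code('demo'), namespace)
print(namespace['x'])
```

Leaving out any of the three methods in a subclass makes instantiation fail with a TypeError, which is the main thing the ABC buys you.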

Obviously if any of this is nuts, please speak up. This post may seem like me just thinking out loud (and it is), but I also blog about stuff like this to get feedback from the community. While I obviously need to be happy with an API if I am going to end up having to maintain it, I want the Python community to be happy with my decisions as there will be more consumers (you) of the API than producers (me). So if you have an opinion, positive or negative, let me know (although be warned I switched off anonymous posting since OpenID is supported by Blogger and I want to cut down on WoW gold farming spam).

2009-01-03

importlib hits beta (with PEP 263 support!)

As I type this I am doing a ``bzr push`` to importlib that puts the code at beta quality. At this point I pass all of my own tests and the failures I have from the standard library are either from explicit checking of exception messages or from code not expecting __loader__ to be defined.

This means I finally dealt with how to read source code that has a defined encoding. Supporting PEP 263 has literally taken me nearly a year. "Why did it take so long?", you might ask. Well, this is where the difference between 'compile' and what import does becomes glaringly apparent and Python 3.0 shows how badly even the interpreter intermixed strings as bytes and strings as text.

When you want to create a code object, you use 'compile'. Typically you simply take a string, pass it in with the proper arguments, and out comes a code object. But those pesky encoding declarations make things a wee bit tricky. You see, in Python 2.x, when you passed in a string, you could pass in a string read from a file raw with universal newline support. You didn't have to decode anything or worry about newlines since the former was handled by the parser and the latter was handled by 'open'. This meant the parser simply took the string's buffer and used that as input to tokenize everything based on either the default encoding for source code or the encoding declaration.

Enter Python 3.0 and the distinction between strings and bytes. Passing in a string means that the underlying buffer is already decoded for you. But you see, the parser doesn't know that. Instead it just sees that it was given a 'const char *' and figures it needs to decode it. Now if you have no encoding declaration this works fine as it is assumed to be encoded using the default encoding for strings. But what if you have an encoding declaration for the source? Well, turns out you are hosed since the tokenizer has no idea that the 'const char *' it is working with is coming from a decoded string and not some encoded bytes. Pass 'compile' a string which is encoded underneath as UTF-8 but which contains a Latin-1 encoding declaration for the source and the tokenizer throws a decoding hissy fit.

What is a core developer to do? He created an issue for it of course! Issue 4626 covers the problem of 'compile' not playing well with a string that has a declared encoding other than UTF-8. I spent most of my day trying to fix this bug. It turns out that 'compile' underneath it all simply gets the buffer for its first argument if it is a bytes or string object and passes it along to be parsed. It seems like a waste to throw out the knowledge that the 'const char *' that is being parsed is already decoded as UTF-8, so I tried to pass that information around.

But I hit my first snag. Since this is C I can't simply toss in a new argument to a function and have everything just work. Not only does that break any other code that calls the function I added an argument to, it breaks ABI and API compatibility for extension authors, where the former is an absolute no-no in Python and the latter is just really bad.

OK, so I try a hacked solution by adding a flag to the compiler flags that is meant more for __future__ statements. It really isn't that bad since it is already used for other backwards-compatibility stuff, so I don't feel dirty. But then I discover that the flags used by the compiler are not the same used by the parser, even when they do overlap. Oops. So I find the proper macro and tweak it to translate from compiler flags to parser flags. That way I can send down the call chain into the parser that the 'const char *' it is working off of is already decoded in the default encoding.

That is when I hit the next snag. Turns out that the parser is not compiled with the various Python objects linked in. That means calls into the PyUnicode API are not supported. And keeping the parser simple is on purpose; suddenly linking in the various Python objects would make it tempting to walk away from that simplicity. This also means I can't call PyUnicode_GetDefaultEncoding to know what the bytes are encoded in.

So I take a different tack and try to explicitly translate to UTF-8. That way the flag I have added simply means that it is all already translated explicitly into UTF-8. But even that doesn't work easily. After battling with pgen/not-pgen code in the parser and everything else, I decide to take another approach.

Have I mentioned I don't like working on the parser? I didn't like mucking with it for doing the parse tree to AST conversion code, I didn't like mucking with it to fix the bug that I blogged about a couple months ago, and I didn't like having to deal with it today.

Time for a different approach. Remember how earlier in this post I mentioned universal newline support along with encoding being a thorn in my side? The reason I mentioned newlines is because passing in a bytes object read directly from the file into 'compile' gets around the encoding mess as the parser can just go ahead and do its own decoding since bytes have not been decoded ahead of time. But reading bytes from a file doesn't give you universal newlines support. Turns out that the universal newline support that the parser uses is entirely based on using FILE pointers and the requisite C functions. Thus issue 4628 was born.
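A small sketch of the bytes route, as it behaves on a current Python 3 (not necessarily on 3.0 as it stood when I hit this). The source is encoded as Latin-1 and declares that encoding per PEP 263; because 'compile' is handed the raw bytes, the parser does the decoding itself. The variable names and the '<demo>' filename are made up for the example.

```python
# Source bytes as they would come from a file opened in 'rb' mode.
# The text is encoded as Latin-1 and says so in its encoding declaration.
raw_source = "# -*- coding: latin-1 -*-\nname = 'caf\xe9'\n".encode('latin-1')

# The e-acute is the single byte 0xE9 here, which is not valid UTF-8 by itself.
assert b'\xe9' in raw_source

# Handing the bytes to compile lets the parser honour the declaration.
code = compile(raw_source, '<demo>', 'exec')

namespace = {}
exec(code, namespace)
print(namespace['name'])
```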

Initially I thought I was going to have to write some C code to add to the bytes object to have bytes.splitlines do proper universal newlines (as it stands now that method just splits on \n, \r, and \r\n all in the same string). But then I realized that the basic algorithm is dead-simple in Python code and I had already abstracted out the reading of source code thanks to PEP 302 protocols and the get_source method. Running with my realization, I wrote the code to read in the bytes, figure out the newlines used, and do a bytes.replace to make the newlines universal. Simple.
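Roughly, the idea looks like the sketch below. The function name normalize_newlines is hypothetical and this is not the code in importlib. It assumes, as described above, that a file sticks to one newline convention, so finding the first newline is enough to decide what to replace.

```python
def normalize_newlines(source):
    """Translate the newlines of a bytes object to b'\\n'.

    Assumption: the whole file uses a single newline convention, so the
    first newline found decides which replacement (if any) is needed.
    """
    for index, byte in enumerate(source):  # iterating bytes yields ints
        if byte == 0x0A:
            # First newline is \n: nothing to translate.
            return source
        if byte == 0x0D:
            if source[index + 1:index + 2] == b'\n':
                # Windows-style \r\n.
                return source.replace(b'\r\n', b'\n')
            # Old Mac-style lone \r.
            return source.replace(b'\r', b'\n')
    # No newline at all.
    return source


print(normalize_newlines(b'a = 1\r\nb = 2\r\n'))
print(normalize_newlines(b'a = 1\rb = 2\r'))
```

When the source already uses \n the function returns as soon as it sees the first newline, so the cost of the check is only the length of the first line.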

But this is where the whole issue of compatibility and performance comes into play. Consider compatibility with get_source. That is supposed to return a decoded string according to PEP 302. That's fine, but that sucks in my case as I want bytes. Do I define a new method like get_bytes or add an argument to get_source like 'raw' that flags I want just the bytes? Or do I just not care and tweak the definition of get_source to return anything that 'compile' will handle properly? And if I have it return bytes, does it have to have universal newlines, or should import do that?

And this all then spills into performance. Compare having get_source return a string instead of bytes. To return a decoded string takes two file system calls; you have to read the first two lines to find the possible source encoding and then open the file again with the proper encoding specified so that you properly decode the file. Luckily universal newline support is done in a streaming manner so there is no real penalty for that.

Using bytes cuts those file system calls down to one as you only need to read the file. But the memory pressure potentially rises as you need to handle the universal newline support after the file has been completely read.
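The two options can be sketched side by side. The function names read_source_text and read_source_bytes are hypothetical; tokenize.detect_encoding is the standard library helper that looks at the first two lines for a PEP 263 declaration, used here as a stand-in for whatever import does internally.

```python
import tokenize


def read_source_text(path):
    """Return decoded source: the file is opened twice."""
    # First open: peek at the first two lines for an encoding declaration.
    with open(path, 'rb') as file:
        encoding, _ = tokenize.detect_encoding(file.readline)
    # Second open: decode with that encoding; text mode translates
    # newlines to '\n' as the data streams in.
    with open(path, encoding=encoding) as file:
        return file.read()


def read_source_bytes(path):
    """Return raw source: the file is opened once.

    Decoding is left to 'compile', and newline translation has to happen
    afterwards on the fully read bytes.
    """
    with open(path, 'rb') as file:
        return file.read()
```

The first function pays for a second open but never holds untranslated data; the second does a single read but leaves the newline work for later, once everything is already in memory.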

Considering how I/O bound importing is, I went with the latter case of more memory but fewer file system calls. It's bad enough that importing from the file system potentially takes four stat calls per entry on sys.path/__path__ just to figure out what file to use for the import, let alone actually reading the files and dealing with ImportWarning for directories that match by name but lack an __init__.py file. But then what about the potential need to tweak the API for loaders to return the bytes for source?

There the universal newlines overhead comes into play. If the method that returns the bytes for source code handles the universal newlines support, that's annoying as that will be a common thing to have to do. It would be easier to simply do the universal newline translation in import itself so that loader developers do not need to deal with this themselves.

But what about those loaders that take the time to actually do the newline translation upfront? If I decide to create a sqlite loader that guarantees that the stored source code already has its newlines translated, why should the loader pay the penalty of me trying to figure out what newlines are used?

Luckily the penalty for detecting that universal newlines are already being used is negligible as it is simply iterating through the bytes until the first possible newline is found. I will probably just shift it out of get_source and just have import do it automatically.

Yes, this is the stuff I think about when contemplating what the API will be for importlib in the end. I have put in enough time into this code that I don't want to screw up the API and have to support something that just frustrates me for the rest of my days.

Where does this leave importlib? At this moment I consider it semantically complete short of someone finding an incompatibility. It passes on OS X for me, although I don't know about Windows as I may have made some silly path separator assumption (if anyone runs importlib's test suite by doing ``python3.0 -m tests.__init__``, let me know if things pass for you). Assuming no bugs are found that leaves removing Python 2.x cruft in the code and cleaning up the docstrings. After all that is done I will create The Great Import Function where you can do something like ``importlib.import_module('some.module')`` and have it give you back 'some.module'. Then I will try to get the code into Python 3.1 for at least inclusion in the standard library, if not bootstrapped in as the official implementation of import.