2009-01-03

importlib hits beta (with PEP 263 support!)

As I type this I am doing a ``bzr push`` to importlib that puts the code at beta quality. At this point I pass all of my own tests and the failures I have from the standard library are either from explicit checking of exception messages or from code not expecting __loader__ to be defined.

This means I finally dealt with how to read source code that has a declared encoding. Supporting PEP 263 has literally taken me nearly a year. "Why did it take so long?", you might ask. Well, this is where the difference between 'compile' and what import does becomes glaringly apparent, and where Python 3.0 shows how badly even the interpreter intermixed bytes and strings when dealing with text.

When you want to create a code object, you use 'compile'. Typically you simply take a string, pass it in with the proper arguments, and out comes a code object. But those pesky encoding declarations make things a wee bit tricky. You see, in Python 2.x you could pass in a string read raw from a file opened with universal newline support. You didn't have to decode anything or worry about newlines, since the former was handled by the parser and the latter by 'open'. This meant the parser simply took the string's buffer and used it as input, tokenizing everything based on either the default encoding for source code or the encoding declaration.

Enter Python 3.0 and the distinction between strings and bytes. Passing in a string means that the underlying buffer is already decoded for you. But you see, the parser doesn't know that. Instead it just sees that it was given a 'const char *' and figures it needs to decode it. Now if you have no encoding declaration this works fine as it is assumed to be encoded using the default encoding for strings. But what if you have an encoding declaration for the source? Well, turns out you are hosed since the tokenizer has no idea that the 'const char *' it is working with is coming from a decoded string and not some encoded bytes. Pass 'compile' a string which is encoded underneath as UTF-8 but which contains a Latin-1 encoding declaration for the source and the tokenizer throws a decoding hissy fit.
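To make that concrete, here is roughly what the two cases look like in 3.0 (the file name is made up; the point is just the bytes-versus-string difference)::

    # Source with a Latin-1 encoding declaration and a Latin-1 byte in it.
    source_bytes = b"# -*- coding: latin-1 -*-\nname = '\xe9'\n"

    # Pass in bytes and the parser reads the cookie and decodes for itself.
    compile(source_bytes, '<example>', 'exec')  # works

    # Pass in a str and the text is already decoded, but the parser only
    # sees a UTF-8 buffer with a Latin-1 cookie in it and chokes on it.
    source_str = source_bytes.decode('latin-1')
    compile(source_str, '<example>', 'exec')  # the decoding hissy fit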

What is a core developer to do? He created an issue for it, of course! Issue 4626 covers the problem of 'compile' not playing well with a string that has a declared encoding other than UTF-8. I spent most of my day trying to fix this bug. It turns out that 'compile', underneath it all, simply gets the buffer for its first argument if it is a bytes or string object and passes it along to be parsed. It seems like a waste to throw out the knowledge that the 'const char *' being parsed comes from already decoded text (handed over as UTF-8), so I tried to pass that information around.

But I hit my first snag. Since this is C, I can't simply toss in a new argument to a function and have everything just work. Not only does that break any other code that calls the function I added an argument to, it breaks ABI and API compatibility for extension authors, where the latter is an absolute no-no in Python and the former is just really bad.

OK, so I try a hackish solution: adding a flag to the compiler flags, which are meant more for __future__ statements. It really isn't that bad since the flags are already used for other backwards-compatibility stuff, so I don't feel dirty. But then I discover that the flags used by the compiler are not the same as those used by the parser, even where they overlap. Oops. So I find the proper macro and tweak it to translate from compiler flags to parser flags. That way I can send word down the call chain into the parser that the 'const char *' it is working off of is already decoded in the default encoding.

That is when I hit the next snag. Turns out that the parser is not compiled with the various Python objects linked in, which means calls into the PyUnicode API are not supported. And keeping the parser simple like that is on purpose, so suddenly linking in the various Python objects just for this made it tempting to walk away from the whole idea. It also means I can't call PyUnicode_GetDefaultEncoding to find out what the bytes are encoded in.

So I take a different tack and try to explicitly translate to UTF-8. That way the flag I have added simply means that it is all already translated explicitly into UTF-8. But even that doesn't work easily. After battling with pgen/not-pgen code in the parser and everything else, I decide to take another approach.

Have I mentioned I don't like working on the parser? I didn't like mucking with it for doing the parse tree to AST conversion code, I didn't like mucking with it to fix the bug that I blogged about a couple months ago, and I didn't like having to deal with it today.

Time for a different approach. Remember how earlier in this post I mentioned universal newline support along with encoding being a thorn in my side? The reason I mentioned newlines is because passing a bytes object read directly from the file into 'compile' gets around the encoding mess, as the parser can just go ahead and do its own decoding since the bytes have not been decoded ahead of time. But reading bytes from a file doesn't give you universal newline support. Turns out that the universal newline support the parser uses is entirely based on FILE pointers and the requisite C functions. Thus issue 4628 was born.

Initially I thought I was going to have to write some C code to add to the bytes object to have bytes.splitlines do proper universal newlines (as it stands now that method just splits on \n, \r, and \r\n all in the same string). But then I realized that the basic algorithm is dead-simple in Python code and I had already abstracted out the reading of source code thanks to PEP 302 protocols and the get_source method. Running with my realization, I wrote the code to read in the bytes, figure out the newlines used, and do a bytes.replace to make the newlines universal. Simple.
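The core of it is nothing more than a couple of bytes.replace calls; a simplified sketch (not the actual importlib code) looks like this::

    def universal_newlines(source_bytes):
        # Translate \r\n and bare \r into \n so the parser only ever sees
        # one kind of line ending. \r\n has to go first or the bare-\r
        # replacement would mangle it.
        return source_bytes.replace(b'\r\n', b'\n').replace(b'\r', b'\n')

    universal_newlines(b"a\r\nb\rc\n")  # b'a\nb\nc\n'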

But this is where the whole issue of compatibility and performance comes into play. Consider compatibility with get_source. That is supposed to return a decoded string according to PEP 302. That's fine, but it sucks in my case as I want bytes. Do I define a new method like get_bytes, or add an argument to get_source like 'raw' that flags I want just the bytes? Or do I just not care and tweak the definition of get_source to return anything that 'compile' will handle properly? And if I have it return bytes, does it have to have universal newlines, or should import do that?
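Just to illustrate the choice (everything beyond plain get_source here is hypothetical, not anything I have committed to)::

    # Option 1: leave get_source alone and add a separate method for bytes.
    class OptionOne:
        def get_source(self, fullname):
            """PEP 302: return the source as a decoded str."""
        def get_bytes(self, fullname):
            """Hypothetical: return the raw, undecoded source bytes."""

    # Option 2: bolt a flag onto get_source itself.
    class OptionTwo:
        def get_source(self, fullname, raw=False):
            """Return a str, or the undecoded bytes when raw is true."""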

And this all then spills into performance. Consider having get_source return a string instead of bytes. To return a decoded string takes two trips to the file system: you have to read the first two lines of the file to find any source encoding declaration, and then open the file again with the proper encoding specified so that you decode it correctly. Luckily universal newline support is done in a streaming manner, so there is no real penalty for that.

Using bytes cuts that down to one trip, as you only need to read the file once. But the memory pressure potentially rises, as you have to handle the universal newline translation after the file has been completely read.
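Roughly, the two approaches look like this (a sketch with a hand-rolled encoding cookie check rather than whatever ends up in the real code)::

    import re

    # Loose version of the PEP 263 encoding declaration pattern.
    CODING_RE = re.compile(br'coding[:=]\s*([-\w.]+)')

    def source_as_str(path):
        # Two trips to the file system: sniff the encoding declaration from
        # the first two lines, then reopen with that encoding to decode
        # (universal newlines are handled while streaming in text mode).
        with open(path, 'rb') as file:
            first_two = file.readline() + file.readline()
        match = CODING_RE.search(first_two)
        encoding = match.group(1).decode('ascii') if match else 'utf-8'
        with open(path, 'r', encoding=encoding) as file:
            return file.read()

    def source_as_bytes(path):
        # One trip: read the raw bytes and leave the decoding to the parser;
        # newline translation then has to happen in memory afterwards.
        with open(path, 'rb') as file:
            return file.read()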

Considering how I/O bound importing is, I went with the latter case of more memory but fewer stat calls. It's bad enough that importing from the file system potentially takes four stat calls per entry on sys.path/__path__ just to figure out what file to use for the import, let alone actually reading the files and dealing with ImportWarning for directories that match by name but lack an __init__.py file. But then what about the potential need to tweak the API for loaders to return the bytes for source?

That is where the universal newline overhead comes into play. If the method that returns the bytes for source code has to handle universal newline support itself, that's annoying, as it will be a common thing to have to do. It would be easier to simply do the universal newline translation in import itself so that loader developers do not need to deal with it themselves.

But what about those loaders that take the time to actually do the newline translation upfront? If I decide to create a sqlite loader that guarantees that the stored source code already has its newlines translated, why should the loader pay the penalty of me trying to figure out what newlines are used?

Luckily the penalty for detecting that universal newlines are already in use is negligible, as it is simply a matter of iterating through the bytes until the first possible newline is found. I will probably just shift the translation out of get_source and have import do it automatically.
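One way to read that check as code (just a sketch of the heuristic, not settled API)::

    def already_universal(source_bytes):
        # Look at the first newline-ish byte found; if it is a bare \n,
        # assume the source already uses universal newlines.
        cr = source_bytes.find(b'\r')
        lf = source_bytes.find(b'\n')
        return cr == -1 or (lf != -1 and lf < cr)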

Yes, this is the stuff I think about when contemplating what the API for importlib will be in the end. I have put enough time into this code that I don't want to screw up the API and have to support something that just frustrates me for the rest of my days.

Where does this leave importlib? At this moment I consider it semantically complete, short of someone finding an incompatibility. It passes on OS X for me, although I don't know about Windows as I may have made some silly path separator assumption (if anyone runs importlib's test suite by doing ``python3.0 -m tests.__init__``, let me know if things pass for you). Assuming no bugs are found, that leaves removing Python 2.x cruft from the code and cleaning up the docstrings. After all that is done I will create The Great Import Function, where you can do something like ``importlib.import_module('some.module')`` and have it give you back 'some.module'. Then I will try to get the code into Python 3.1, at least for inclusion in the standard library, if not bootstrapped in as the official implementation of import.