2013-05-26

Practicality beats purity when it comes to backwards-compatibility

This past week I was on a 5 hour train ride where I was doing some API work on importlib to make it easier to customize imports. The route I have been taking to accomplish that is to define a couple key methods in the appropriate ABC so that subclasses can do as little custom work as possible and use sensible defaults as defined by the ABCs. That way gnarly details are taken care of in a consistent way by importlib and users can simply focus on what is different/special about their importers.

Part of that includes dealing with __path__. For those of you who might not be aware, the existence of __path__ on a module is the official definition of a package in Python. And if __path__ is defined it must be an iterable that only returns strings (if it returns anything at all; this all means [] is a legit value for __path__ to define that something is a package). Relatively straight-forward in the common case of some directory on your file system representing a package.

But what about situations like zipfiles? They are kind of like a file system, but not exactly; at least they have the innate concept of a path. Or how about a database that stores your code? In that situation a file path isn't really there unless you choose to make it happen; you could easily have a primary key of just module names and completely forgo the idea of paths. Since I'm trying to minimize the work anyone has to do for a custom importer I wanted to make sure that the only thing a user of importlib might have to do for __path__ is provide its custom value, I tried to come up with a way to allow that.

This all came up while I was designing a method to handle the setting of the various attributes on modules. As of right now that's all handled in importlib.abc.Loader.load_module()  which is not a good point to do it at since you need to set it on a module and all load_module() takes is a name as a string (i.e. there is no hook in the API to tweak attributes on a module before it is executed). By providing a method that does the right thing using the pre-existing APIs allows for a convenient location to tweak values, such as __path__. The method is going to be called init_module_attrs() and will set things like __loader__, __package__, __file__, __cached__, and hopefully __path__ (the trigger for this post).

As it stands now, the only API for packages in importlib is importlib.abc.InspectLoader.is_package() on loaders which I inherited from PEP 302. The problem with this method is that all is_package() does is return a boolean value to say whether something is a package or not. There was no provided API to return what __path__ should be. Basically there was no hook point in the API for overloading a function to provide __path__ for a module until after import which is not where you want it to happen.

My initial reaction was, "I'll define a new method!" From an API perspective that the purest solution. I could define a new method called package_paths() that either returns what __path__ should be set to or even subsume is_package() by returning None when a module isn't a package and all other values signify a package and what __path__ should be (this is to allow [] as a value for __path__). Nice and straight-forward.

But upon reflecting on the idea, I realized it wasn't practical. While databases and such might not need to rely on paths, in the massive majority of cases __path__ will end up with at least path-like entries (if for any other reason than most code that mucks with __path__ just assumes it can use os.path), so that suggests using what importlib.abc.ExecutionLoader.get_filename() returns is reasonable. There is also the point that users would need to define this new method just to get any help with setting __path__ which is always a problem since code would only be usable in 3.4 without users writing their own shim for 3.3 and earlier. All of this led to me to the realization that it would be better to just set __path__ in init_module_attrs() as best as I could (e.g. os.path.dirname(self.get_filename())or []) and that if the default didn't work then people could override the method, call super().init_module_attrs(), and then reset __path__ to what they want. That still provides a way to make overriding simple while still giving a reasonable default for the majority of cases. Practicality beats purity, especially when you are trying to provide a solution that works in a backwards-compatible fashion.

And just to really hit this point home, I had another "brilliant" idea that I decided was too much of me thinking about API purity and not what was practical. With the advent of __package__, an alternative way to tell if a module is a package is to see if __name__ == __package__. That's nice and simple and is a clearer definition since both of those attributes always exist (at least starting in Python 3.3) and it doesn't require having a dummy empty list entry on packages where __path__ must exist but lacks any other reasonable value (which is what I was trying to avoid since you don't see an empty list for __path__ very often and I didn't want to confuse users by doing that or that __path__ had to have a true value). Good idea I thought!

Except in a practical sense, there is too much code out there relying on __path__ being the way to tell if a module is a package. Now making older code work with this __name__ == __package__ idea wouldn't be hard since it's a check for that and if __package__ isn't set or is None then fall back on looking for __path__. I could even provide an importlib.util.is_package() whose code could be copied easily for code that needed to stay compatible with 3.3 and earlier. But what benefit would it get me? Not confusing users when they see an empty list for __path__? That's pretty minor since the vast majority of code should be doing a hasattr(module, '__path__') check to tell if something is a package instead of checking for the attribute and making sure the value is true. Plus how much code ever changes __path__, let alone even cares if it exists? My bet is hardly anyone cares, and so it would be better to not try to clean up the semantics to be more straight-forward and easier to explain for all future versions of Python how to tell if a module is a package and just stick with what is seemingly working now for all pre-existing code when the status quo isn't that bad.

So practicality beats purity when dealing with backwards-compatibility. When the itch strikes you to expand an API to try to have a more pure solution to something, take a moment before you scratch that itch and see if there is a more backwards-compatible way that might not be as pure or simple, but still allows for the same outcome. This has happened to me more than once with importlib which is why I always mull over any API changes for at least a week before actually committing any code.