2010-11-04

My Python minifier which was not to be

After two days of off-and-on work, I have given up on my attempt at writing a Python minifier. Since I'm not going to continue on with the work I figured I would explain why I decided to give this a try, why I have decided not to continue pursuing it, and some tips on how to minify Python code (and thus what I wanted my minifier to do).

When I think "minifier", I think JavaScript first, and then the Google Closure Compiler second (especially with ADVANCED_OPTIMIZATIONS turned on). These days people minify their JavaScript so that it takes up less bandwidth when shipping the code over to the browser. Less bandwidth means faster load times and lower bandwidth costs; win-win.

In my case, it was to make the SL4A implementation of Oplop small enough to fit into a QR Code that was easy to scan as that's one way to get a script to your phone (at least with my phone, the larger the QR Code the harder it is for ZXing to discover the barcode). I have been maintaining a minified version of the code manually which is a slight pain when updates are needed (I had to change some method calls so I stopped using API calls from SL4A's ASE days). Being a programmer it seemed like writing a script to minify code for me would make sense.

When you think of minification, the other thing that comes to mind is obfuscation. Just try to read the JavaScript for Gmail and you quickly realize how good an obfuscator minification can be (I have tried to reverse-engineer that code, and it is hard). Being an open source developer means I have the benefit of not having to obfuscate code that I ship to clients. But not everyone chooses to go down that route. When I tried to kill bytecode-only importing when Python 3.2 gained its __pycache__ support (PEP 3147), some people (<cough> Doug Hellmann </cough>) said they shipped bytecode-only code to "[keep] support costs under control from 'ambitious' users". Well, bytecode is actually a horrible obfuscation technique, as a ton of info is kept in .pyc files (e.g., variable names are preserved). You also prevent all Python VMs other than CPython from being able to run your code. And lastly, bytecode is not necessarily smaller than the comparative source file (and is probably never smaller than minified source).

So, with the desire to have a smaller QR Code for Oplop on SL4A and a script that would let me say "you don't need bytecode-only imports, just minify your source code", I tried to write a minifier for Python.

But then it began to require thought. =)

The first thing is that great minification requires access to an AST. While you could probably get away with using a parser to simply strip out extraneous whitespace (e.g., from 3 + 5 to 3+5), that won't let you pull off advanced minification tricks which require semantic information about the source. This meant I wanted to work off the AST that is provided through the ast module.

But not working with a parser meant I needed to be able to map the AST back to source code. 2to3 is no help as it is a source-to-source translator using a pure Python version of the parser CPython uses. The best pre-existing code I could find was from Armin Ronacher (code), but he admitted it was probably out of date. The real issue, though, is that the code takes a somewhat simplistic approach to creating code, where I was going to need more smarts in order to minify code. For instance, for an if statement, the AST has an If node which consists of the guard, the true clause, and an optional else clause: If(expr test, stmt* body, stmt* orelse). But notice the lack of direct support for elif clauses? That's because an elif semantically gets put into the else clause. E.g.:

if A:
  pass
elif B:
  pass
else:
  pass

becomes:

if A:
  pass
else:
  if B:
    pass
  else:
    pass

That translation is entirely accurate semantically, but a minifier would need to use elif clauses to cut down on the wasted characters. This same issue comes up with try statements if you use both except/else and finally.
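You can see this nesting for yourself with the ast module (a quick illustration of my own, not code from the abandoned minifier):

```python
import ast

# Parse an if/elif and inspect the tree; the elif does not get its own
# node type -- it shows up as a nested If inside the outer node's orelse.
tree = ast.parse("if A:\n    pass\nelif B:\n    pass\n")
outer = tree.body[0]
nested = outer.orelse[0]
print(type(outer).__name__, type(nested).__name__)  # prints: If If
```

A minifier walking this tree has to notice that an orelse containing exactly one If (and nothing else) can be emitted as an elif rather than an indented else block.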

Things continue to require more thought when dealing with accessing objects. Something like a.b.c becomes (in shorthand) Attribute(Attribute(Name('a'), 'b'), 'c'). This is semantically accurate (and how bytecode actually loads objects), but it is a slight pain when you start to do variable rewriting anywhere past the initial object (e.g., turning a.b.c into d = a.b; d.c) as you won't know without looking ahead where you are in an access.
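The same kind of inspection shows the attribute-access nesting (again, just my own illustration):

```python
import ast

# a.b.c parses as Attribute nodes wrapped around a single Name node,
# with the outermost node holding the *last* attribute ('c').
expr = ast.parse("a.b.c", mode="eval").body
print(ast.dump(expr))
```

Because the outermost node you visit first is the end of the chain, deciding where to split off a temporary variable means peeking down through the nested nodes before emitting anything.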

Basically it felt like I was starting to write a reverse parser and that was sucking the enjoyment out of this. The AST-to-source part was supposed to be minor compared to the AST analysis and rewriting which was going to bring the true fun/perks of thorough minification. But it was just not meant to be.

Without a script to do minification for you, what is one to do? Well, you do it by hand (and honestly probably get more minification because of it). Having tried to wring every last byte out of oplop_sl4a.py I have a couple of tips.

Obviously the first one is to strip out all extraneous whitespace and comments (before someone points it out, oplop_sl4a.py has that comment at the top because SL4A takes the first line of a scanned file as the file name, so I leave the filename in as a comment). This includes docstrings! You also want to strip all spaces around bits of syntax like commas and operators.
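As a small taste of what automating this could look like, the stdlib tokenize module can drop comment tokens and regenerate source (a rough sketch I am writing here, not part of the abandoned minifier; it leaves docstrings and most whitespace alone):

```python
import io
import tokenize

def strip_comments(source):
    """Return source with comment tokens dropped (docstrings untouched)."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    # COMMENT tokens carry no semantics, so filtering them out is safe.
    kept = [tok for tok in tokens if tok.type != tokenize.COMMENT]
    return tokenize.untokenize(kept)

minified = strip_comments("x = 3 + 5  # a comment\n")
print(minified)
```

Working at the token level like this is enough for whitespace and comments, but it is exactly the approach that cannot do the semantic tricks discussed below.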

You definitely want to consolidate your imports. You should put as many on a single line as possible. You can use multiple as clauses on the same line, e.g., import hashlib as h,base64 as b is valid. You also want to take into consideration the length of the name of what you are importing. Consider import hashlib, which has 7 characters in the module name. Because the module name is so long, it pays off to spend the extra 4 characters to use an as clause: import hashlib as h. Now every time you use h you are saving (len('hashlib')-len('h')) * x - (4 + len('h')) characters (len('hashlib') - len('h') is the savings you get per use, and -(4 + len('h')) is the cost of as h tacked on to the import statement). Notice that the usefulness of the as clause is a function of the number of times you use the name and the number of characters saved. So using a shortened 2 character name instead of 1 character negates the usefulness if you use the name once, but if you use it twice you get a perk. For a module like re, you would need to use the module's name 6 times before a single character name became beneficial. You can see how this can be taken further when dealing with from clauses in import statements; you have to care whether the characters eliminated make up for the 5 character overhead of going from import a.b (and then writing a.b) to from a import b (writing just b).
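The break-even arithmetic above is simple enough to sketch as a helper (a hypothetical function name; the formula is the one from the text):

```python
def alias_savings(name, alias, uses):
    """Net characters saved by 'import name as alias' over plain 'import name'."""
    # Each use saves len(name) - len(alias) characters; the 'as alias'
    # clause itself costs 4 + len(alias) characters (' as ' plus the alias).
    return (len(name) - len(alias)) * uses - (4 + len(alias))

# hashlib -> h pays off after a single use; re -> r needs six uses.
print(alias_savings("hashlib", "h", 1))  # prints: 1
print(alias_savings("re", "r", 6))       # prints: 1
```

A minifier would run this calculation per imported name, which is exactly the sort of whole-module bookkeeping that needs the AST rather than a token stream.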

Think about indents. Even with single space indents, you are paying a cost for scoping code compared to using a semi-colon. With things at the top level of a module, there is no difference between a newline and a semi-colon. But the instant you have to indent you accrue that indent cost compared to using semi-colons to separate simple statements. This rule of thumb includes putting as much as possible on the same line as a statement that introduces scope. In other words, you can go from:

if A:
 B=3+5
 C()

to if A:B=3+5;C() (a savings of 3 characters). And part of the trick when automating something like this is to make sure not to add an extra semi-colon (on top of properly identifying when you can even use a semi-colon).
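Both the validity and the claimed savings of that rewrite are easy to check (a quick illustration of my own):

```python
indented = "if A:\n B=3+5\n C()\n"
flattened = "if A:B=3+5;C()\n"

# Both forms must parse; compile() raises SyntaxError otherwise.
compile(indented, "<long>", "exec")
compile(flattened, "<short>", "exec")

print(len(indented) - len(flattened))  # prints: 3
```

Note that the flattened form only works because the body contains nothing but simple statements; a nested compound statement (another if, a loop) cannot follow the colon on the same line.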

You can even take simple function definitions into account to save characters. Consider the function:

def A(x):return x+6

Compare that to:

A=lambda x:x+6

The def statement took 19 characters while the lambda took 14 characters (if you ditched the argument you would even save an additional character). Obviously this can't always work as it requires a function whose body is only a return of an expression, but it does save characters when the situation comes up.
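You can verify both the equivalence and the character counts (my own quick check):

```python
def_form = "def A(x):return x+6"
lambda_form = "A=lambda x:x+6"

# Execute each form in its own namespace and compare behaviour.
ns1, ns2 = {}, {}
exec(def_form, ns1)
exec(lambda_form, ns2)
assert ns1["A"](1) == ns2["A"](1) == 7

print(len(def_form), len(lambda_form))  # prints: 19 14
```

(The two are not perfectly identical -- the lambda's __name__ is '<lambda>' rather than 'A' -- but for minification purposes the behaviour is the same.)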

The last trick for minifying Python code is string concatenation. If you have any proper length prefixes or suffixes shared amongst strings you should simply break those prefixes and suffixes out and concatenate any strings you need with the unique parts. So instead of:

A('Hello there, Guido')
B('Hello there, Brett')

use:

h='Hello there, '
A(h+'Guido')
B(h+'Brett')

Much like the import situation, the usefulness of this optimization depends on the length of the substring being reused and how often it is reused (this example originally just had 'Hello ,', but that was too short by two characters to make a difference).
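Counting characters confirms the rewrite pays off here (another quick check of my own):

```python
original = "A('Hello there, Guido')\nB('Hello there, Brett')\n"
factored = "h='Hello there, '\nA(h+'Guido')\nB(h+'Brett')\n"

# The factored form pays for "h='...'\n" plus "h+" at each use,
# but saves the repeated prefix in every string literal.
print(len(original) - len(factored))  # prints: 4
```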

Then you start using unrecognizable variable names. From there you change all numbers greater than 10**12 (or less than -10**12) to use hexadecimal notation as it's smaller. And then you start to do hand-coded stuff like using 'utf8' instead of 'utf-8' when encoding/decoding as it saves a character, or going from A>=8 to A>7 when you know you are only comparing integers. It never really ends (heck, I managed to perform three optimizations on oplop_sl4a.py that saved 4 characters while writing this blog post, and undo one which bought me nothing). You can even start doing actual optimizations like constant folding.
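The 10**12 cut-off for hexadecimal is easy to verify by comparing literal lengths, counting the 0x prefix (my own quick check):

```python
# Just below the cut-off the two notations tie; at 10**12 hex wins.
below, above = 10**11, 10**12
print(len(str(below)), len(hex(below)))  # prints: 12 12
print(len(str(above)), len(hex(above)))  # prints: 13 12
```

The gap only widens from there, since decimal digit counts grow about 20% faster than hex digit counts.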

But as with anything to do with Python and optimizations, you have to be careful you are not changing semantics (e.g., imports might need to happen in a specific order because of side-effects, and you must not rename anything that is somehow used by external code). But it can still be a fun exercise in optimization to see how far you can take it.

If only the AST-to-source translation were simpler. Who knows, maybe this will gnaw at me long enough that I will bother to finish the AST-to-source translator (or people will reliably offer to help with this, though that would probably require making the translator much fancier in order to support minification, PEP 8 output, or even output damn close to the original source).