2006-02-12

Possibilities of the AST

Since the AST has been tossed about in comments on this blog I figured I would do a post on my thoughts on it while giving a quick overview for those who don't know about the AST.

First, a quick intro. As of Python 2.5 there is a new bytecode compiler. It was started by Jeremy Hylton about three years ago and up to this point has been a main sprint topic at every PyCon. The joke amongst python-dev members who have gone to the sprints was that the AST could never finish because we would lose our PyCon sprinting tradition. =)

Well, in October Guido drew a line in the sand and gave us three weeks to get it in. So some people worked hard and got it in. =) The biggest perk of the AST compiler is it makes it much easier to add features to the language. By working with an AST it is much easier to add language features. For more implementation details, see Python/compile.txt .

Since the checkin, two things have happened. One is that an arena memory management setup has been implemented. This made the AST code much easier to use since it uses C structs for objects and not PyObjects. Another is that a competing memory management scheme was started that does use PyObjects by Martin v. Löwis (that can be seen in the ast-objects branch). This is not yet complete although I am heavily considering working on this at PyCon since I want to see the AST compiler's design locked down and settled so some cool features can be added.

What cool features? Well, how would you like to have access to the AST in Python code? Semantic checks could be added, optimizations could be applied. Basically anything that mucks with bytecode directly stands to benefit from the easier mental concept of working with an AST over bytecode. But the problem is deciding how to expose the AST to Python code.

The first problem is how to go from C to Python for the AST. As it stands now the AST uses custom C structs internally. That won't fly for exporting directly to Python. The current thinking it to come up with a serialization form that can be used to generate Python objects. This has the benefit of not allowing Python code that might modify the AST from generating a bad tree that would prevent sane bytecode from being used. So if the AST was found to be malformed the original version could be used instead, for instance.

The other option is the ast-objects branch that uses PyObjects directly for the AST. If this version is used the AST itself can be exported to Python simply through a pointer to the root node. Then code can modify the AST directly and then the resulting tree can be used directly to write out to disk. Really simple, but it does allow for possible malformed trees that could prevent bytecode from being generated.

The next issue is choosing the level of exposure. I think there are three different levels: command line, sys module, and code objects. For the command line, something like the -m option could be introduced that listed various functions in modules to have the AST passed through. So ``python --ast pychecker.ast_check code.py`` would generate the AST, pass that to pychecker.ast_check(), and then use the returned AST for generating the final bytecode.

Another possibility is adding something like sys.ast_transformations . This would be a list of functions that has the AST passed into the first function, and then the returned AST is subsequently passed in to the next function until every one has had a chance to play with the AST. And then the final AST returned is what is used to write the final bytecode. This allows for a more programmatic way to specific global AST transformations, but does not lead to easy external changes. But writing up a simple script that adds the AST transformation functions to the list and then start executing your code.

Lastly, making it so that one can get the AST for a specific code object could be possible. The obvious way is to just keep the AST object with the code object so acces is always there. But the memory overhead would be too high for something that wouldn't be used that often, and so on a per-request basis is the only reasonable thing to do. That can be done if we open up a function that takes in a code object, reads in the file and the line the code object has its source code start, parse the source, and then generate the AST again.

The problem with any of these is performance cost. If serialization is used to go from C to Python, the overhead will kill performance. Chances are only applying the changes to newly generated bytecode would be worth it. Serialization would also kill direct code object access. If the PyObject solution is used the cost drops significantly for the global access, but having to parse source again is still rather costly.

Then there is keeping track of what changes have been applied to the bytecode. As of right now a .pyc file and a .pyo file only flag their difference on the file extension. There is no built-in notion of marking what transformations that could be considered semantic changes have been applied to bytecode. Guido has said this should be done and it's on my list of PEPs to write. This would allow for .pyo files to disappear since a .pyc with a mark saying assertions are gone would negate the biggest need of .pyo files.

Anyway, these are just some ideas floating in my head. At the bare minimum global access will come into existence. Pretty much whatever is needed by PyChecker to get access to the AST will come into existence. After that it will be up to what the community wants and what people are willing to implement in terms of exposure. Personally global access is good enough to me. I just want to be able to play with random optimization ideas and add better semantic checks.