The case for AST-based computer language representation and specification

This is the idea:

To have a language which is defined in its AST form, rather than its human-readable form.
To save it as a structured AST dump rather than as readable source code.
To have editors and viewers that format the AST into a readable form for human use.

(Note: AST == Abstract Syntax Tree, the internal representation of the source code which is the output of the parser in a compiler.)

Some advantages:

The user can choose their preferred syntax, the spacing and brace arrangements that seem most comfortable to them, and everyone else's code appears to them formatted this way.
The cruft of language development accumulates in the AST spec not in the syntax read by the user. The user-visible syntax can be upgraded independently from upgrades to the AST spec. For example, the AST can be upgraded from version 4 to 5, and a new syntax formatter written for version 5 which makes the code look just the same to the user. In addition, all the existing code (in AST-dump form) could be automatically upgraded to version 5 without the user seeing any difference. Or alternatively the readable syntax could be improved, and from the user's point of view all old code would immediately appear to have been rewritten using the new syntax. Compare this to current languages which jump through hoops to avoid breaking the compilation of old source code, meaning lots of legacy syntax, and new features bolted onto legacy syntax rather than using a more natural or modern representation.

Some disadvantages:

The user has less control over precise formatting, e.g. of complex expressions -- they give up control of this to a particular code formatter. However, with a good enough code formatter, I believe the trade-off is worth it -- and the user can potentially write their own (better) formatter, or tweak an existing one to improve its formatting. I am experimenting with these ideas in Java (see below) to see how well it could work.
Diffs and patches appear in the AST-dump form. Whilst it would be possible to structure the AST dump in a way that minimises diff size, it still makes reviewing patches in raw form much harder. A viewer would be required to convert patches into a readable form. An alternative may be to save in a standardised readable formatted form, and re-parse back into AST, but that gives up some of the advantages of the AST-based approach. However, this is what I'm experimenting with in Java right now -- using standard Java syntax as the saved form.

One of the advantages of making the AST the primary specification of the language is that it is half-way between the user and the tools. So refactoring is easier because a refactoring tool doesn't have to worry about how the refactored code should be formatted. A compiler should be faster because parsing an AST dump is less complex.

Looking at source code from an AST point of view, a project is just an immense tree. Taking Java as an example, the package names form a tree with classes/enums/etc as leaves, and below those appear methods and fields, and below those come type specifications and nested lists of statements and expressions.

So in an abstract form we have this huge tree. We could dump it out in folders and files, similar to how Java source is structured right now -- which is probably the best way to interact with a version control system (git, mercurial, etc).

Alternatively we could load it up into some kind of a database optimised for storing trees. As a database, we could maintain indexes for cross-referencing. This would make it very easy to do rename refactoring -- if every reference to class A is a database pointer to A, then we can rename A with one operation -- and similarly for local variables and all other named entities. Reverse indexes could optimise searches for references to a definition. This is an ideal environment for refactoring and analysis. It could even be used for global optimisation -- make a temporary copy of the database and eliminate dead code or inline code globally, before it passing on to the compiler.

Putting this into practice

As a practical test of some of these ideas, I'm working on a Java editor which edits the AST not the source. The source is loaded up and parsed, then the resulting AST reformatted to the user's preferred syntax for display and editing. The edited AST is then reformatted back into standard syntax before white space is adjusted to minimise changes and it is saved.

Like this I will get a number of advantages:

Uniform display of code mangled by tools or other people's weird formatting preferences, whilst still maintaining the original formatting in unchanged parts of the source (thus minimising patches).
The opportunity to experiment with other display syntaxes, or representations optimised to what is possible with a display rather than what is possible with a line-printer, e.g. using colour or visual structure to indicate something rather than symbols or specific keywords, and so on.
The opportunity to experiment with analysis and refactoring enabled by working with the AST.

I'm using Eclipse JDT as a base for now (org.eclipse.jdt.core.dom) to see whether I can make this fly.

The future

I wonder whether we will ever see a language as I envisage? Having worked through all this in my head, it is now frustrating to see languages still being designed in terms of their ASCII representation. It just seems so limiting. We will have to see how far I can develop these ideas myself, and what the future may hold.

-- Peru, 19-Dec-2011

These pages and files, including applets and artwork, are Copyright (c) 1997-2019 Jim Peters unless otherwise stated. Please contact me if you'd like to use anything not explicitly released, or if you have something interesting to discuss.