One of the oldest parts of the Loci compiler was replaced earlier this month: the parser. The switch was from a Bison GLR parser to a hand-written recursive descent parser. Rewriting a major compiler component is a significant time/effort investment (it took around 2-3 weeks of relatively intense development), so I’ll explain why I took the leap.
Original GLR Grammar
LL vs LR
If you don’t know anything about LR parsers, I’ll try to give a summary here. Basically, LR parsers construct the parse tree bottom-up, matching sequences of symbols to their associated non-terminals. This contrasts with LL parsers, which start from the top-level non-terminal and construct the parse tree top-down. There’s an explanation here: http://stackoverflow.com/a/6824545/142088 . Out of the two, LL parsers are the more intuitive, though LR parsers are more ‘powerful’ (i.e. every language that has an LL parser also has an LR parser, but not vice versa).
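To make the top-down idea concrete, here is a minimal sketch of an LL-style recursive descent parser in C++. The toy grammar and the `SumParser` class are invented for this illustration and are not taken from the Loci compiler; the point is only that each non-terminal becomes a function, and parsing starts from the top-level one.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical toy grammar (not Loci's), with no whitespace handling:
//   expr := term ('+' term)*
//   term := DIGITS | '(' expr ')'
class SumParser {
public:
    explicit SumParser(std::string text) : text_(std::move(text)) {}

    // Start from the top-level non-terminal and work downwards.
    int parse() {
        const int value = parseExpr();
        if (pos_ != text_.size()) throw std::runtime_error("unexpected trailing input");
        return value;
    }

private:
    // expr := term ('+' term)*
    int parseExpr() {
        int value = parseTerm();
        while (peek() == '+') {
            ++pos_;
            value += parseTerm();
        }
        return value;
    }

    // term := DIGITS | '(' expr ')'
    int parseTerm() {
        if (peek() == '(') {
            ++pos_;
            const int value = parseExpr();
            if (peek() != ')') throw std::runtime_error("expected ')'");
            ++pos_;
            return value;
        }
        if (!isDigit(peek())) throw std::runtime_error("expected a number");
        int value = 0;
        while (isDigit(peek())) value = value * 10 + (text_[pos_++] - '0');
        return value;
    }

    static bool isDigit(char c) { return std::isdigit(static_cast<unsigned char>(c)) != 0; }

    // Returns '\0' once the end of the input has been reached.
    char peek() const { return pos_ < text_.size() ? text_[pos_] : '\0'; }

    std::string text_;
    std::size_t pos_ = 0;
};
```

For example, `SumParser("1+(2+30)").parse()` evaluates the sum, while `SumParser("1+").parse()` throws because `parseTerm` knows exactly what it was expecting at that point.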
Template Syntax
Of course for a generalised systems programming language like Loci it’s not this simple. Most significantly, Loci inherits C++’s template syntax, which uses ‘<’ and ‘>’ both to delimit template arguments and as the less-than/greater-than operators. This is combined with Loci’s multi-pass compilation, meaning that the parser doesn’t know whether a NAME is a type, a function, a value, etc., making code like the following potentially ambiguous:
f(name < a, b > (c));
This code could be parsed as two comparisons, i.e.:
f( (name < a), (b > (c)) );
Or it could be a call to a templated function:
f( (name<a, b>)(c) );
(It’s clearly a templated function call and this is the interpretation chosen by the compiler.) All of this could’ve been made easier for the compiler; Loci could’ve been designed to use a different template syntax, such as (from the D programming language):
f(name!(a, b)(c));
However, in the design of Loci I deliberately focused on doing the easiest thing for the user, which is to preserve the familiarity of the C++ template (and Java generic) syntax. Loci always tries to follow language conventions where they exist, following the principle of least surprise. I personally also think that the C++ template syntax is clearer, since the triangle brackets are noticeably different from the round brackets.
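One common way for a hand-written parser to resolve the `f(name < a, b > (c))` ambiguity is a speculative parse: tentatively read a template argument list, and commit to the template interpretation only if it is followed by ‘(’. The C++ sketch below illustrates that general technique. It is not the Loci compiler’s actual logic, and the token representation, the names `tryParseTemplateArgs` and `classify`, and the exact rule are all assumptions made for the example.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Tokens are plain strings here to keep the sketch short.
using Tokens = std::vector<std::string>;

bool isName(const std::string& token) {
    return !token.empty() &&
           (std::isalpha(static_cast<unsigned char>(token[0])) != 0 || token[0] == '_');
}

// Tries to consume `< NAME (, NAME)* >` starting at `pos`.
// On success `pos` is advanced past the '>'; on failure `pos` is unchanged.
bool tryParseTemplateArgs(const Tokens& tokens, std::size_t& pos) {
    std::size_t cursor = pos;  // work on a copy, so failure needs no undo
    const auto at = [&](const std::string& text) {
        return cursor < tokens.size() && tokens[cursor] == text;
    };

    if (!at("<")) return false;
    ++cursor;
    while (true) {
        if (cursor >= tokens.size() || !isName(tokens[cursor])) return false;
        ++cursor;
        if (!at(",")) break;
        ++cursor;
    }
    if (!at(">")) return false;
    ++cursor;

    pos = cursor;  // commit only once the whole list has been read
    return true;
}

enum class Interpretation { TemplateCall, Comparison };

// Assumed rule for this sketch: NAME '<' args '>' '(' is a templated call;
// anything else falls back to treating '<' as a comparison.
Interpretation classify(const Tokens& tokens, std::size_t nameIndex) {
    std::size_t pos = nameIndex + 1;
    if (tryParseTemplateArgs(tokens, pos) && pos < tokens.size() && tokens[pos] == "(") {
        return Interpretation::TemplateCall;
    }
    return Interpretation::Comparison;
}
```

With the tokens of `f(name < a, b > (c));`, calling `classify` at the index of `name` selects the templated call, whereas `name < a, b > c` falls back to the comparison reading because no ‘(’ follows the ‘>’.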
Switching to GLR
The Loci compiler started off with an LALR parser (LALR is essentially a variant of LR which is less powerful but which produces significantly smaller parsing tables). However, to handle syntax like that shown above, it was upgraded to use Bison’s GLR parsing functionality.
A GLR (Generalised LR) parser is a way of running multiple parsers simultaneously. Most of the time the action for any given input token is unambiguous (i.e. one choice), but in some cases there are multiple choices (e.g. treat ‘<’ as less-than versus treat it as the start of a template argument list). A normal LALR parser wouldn’t allow this (it would complain about shift/reduce conflicts), but GLR works around this problem by splitting the parser into two parsers, and hence exploring both possibilities simultaneously. If one of the parsers hits an error it simply vanishes and the other parser continues. If both parsers produce different parse trees for a given non-terminal this is an ambiguity, which can be handled by the parser implementer (e.g. one can emit an error).
Problems
(You can see the old parser, from the v1.3 release, here.)
While GLR was a good solution to the parsing problem, the parser had many problems including (from most significant to least significant):
- Poor quality diagnostics – This is related to LL vs LR; the top-down LL parser means the context is unambiguous, so it’s easier to produce good error messages. It’s also related to the parser generator, since it’s difficult to get Bison to produce good errors (and adding extra rules for errors often seems to introduce shift/reduce conflicts).
- AST must be held in a union – Bison requires that the AST nodes corresponding to non-terminals are kept in a union. This is unfortunate, because C++ does not automatically call destructors for class-type union members (since it doesn’t know which member is active). Hence all the AST nodes were referred to by heap-allocated pointers, with lots of memory leaks (I wrote the early compiler with too little regard for issues like this).
- Parser is in one large file – It seems that Bison is designed to have the entire parser in a single file. The Loci GLR parser was around 2500 lines, which is impressively concise given the syntactical complexity of the language. However keeping all this logic in a single file makes it harder to move between different parts of the parser (e.g. type parsing versus value parsing), meaning that you tend to end up in a sea of code, not really knowing where you are. I tend to keep source files below 500 lines in length for this reason.
- C++ <-> Bison communication is messy – This isn’t a deal breaker, but it does make the code harder to read, particularly when following the control flow of the parser. Code readability is extremely important for long-term maintenance of the compiler and for allowing others (beyond myself) to make sense of the code.
- Invalid conflicts – LR grammars are always unambiguous, but not all unambiguous grammars are LR. While some of the shift/reduce conflicts Bison reported were useful, others simply arose from the structure of the grammar. Restructuring the grammar fixes these conflicts, but leaves the grammar incomprehensible.
- Token header file generated at build time – One of the tedious aspects of Bison is that you have to wait until build time to get a header file containing the tokens. This means that the tokens can’t be exposed in a pre-written library API, unlike the rest of the compiler code.
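The union problem in the list above can be illustrated with a small C++17 sketch. It assumes a hand-written parser is free to store nodes in a type such as `std::variant`, which tracks the active alternative and therefore runs the right destructor automatically. The node types, `LiveCounter` and the helper functions are hypothetical and do not reflect how the Loci AST is actually defined.

```cpp
#include <string>
#include <variant>
#include <vector>

// Counts live objects so that leaks become observable in this sketch.
struct LiveCounter {
    static inline int count = 0;
    LiveCounter() { ++count; }
    LiveCounter(const LiveCounter&) { ++count; }
    LiveCounter& operator=(const LiveCounter&) = default;
    ~LiveCounter() { --count; }
};

// Hypothetical AST node types with non-trivial destructors.
struct NameNode {
    std::string name;
    LiveCounter live;
};

struct NumberNode {
    long value;
    LiveCounter live;
};

// Unlike a raw union, std::variant knows which alternative is active,
// so destroying a Node destroys the NameNode or NumberNode inside it.
using Node = std::variant<NameNode, NumberNode>;

std::string describe(const Node& node) {
    if (const auto* name = std::get_if<NameNode>(&node)) return "name:" + name->name;
    return "number:" + std::to_string(std::get<NumberNode>(node).value);
}

// Builds a few nodes, lets them go out of scope, and reports how many
// node objects are still alive afterwards.
int liveNodesAfterBuildingAndDropping() {
    {
        std::vector<Node> nodes;
        nodes.push_back(NameNode{"name"});
        nodes.push_back(NumberNode{42});
    }  // every node is destroyed here without any manual delete
    return LiveCounter::count;
}
```

Because ownership is expressed in the types, no node outlives the container holding it, and `liveNodesAfterBuildingAndDropping()` reports zero live nodes.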
For a while most of these issues were manageable, since the goal was to build a compiler that works. However, now that the compiler has reasonably solid support for the language, the challenge is to make the compiler work well, part of which is to get a good user experience.
With this in mind, I decided that the poor diagnostics were no longer acceptable and proceeded to design a new parser.
New parser
Existing Compilers
Unsurprisingly there’s a lot of prior art when it comes to programming language parsers. Given the syntactic similarities between C++ and Loci, it seemed to make sense to look at the mainstream C++ compilers: Clang and GCC.
Clang
Clang is a relatively new compiler that supports C, C++ and Objective-C and targets LLVM. The design of Clang and LLVM (the two are closely linked) is well known to be excellent, since they thoroughly embrace modularity and well-defined API boundaries. The best example of this is LLVM IR, a representation documented in exquisite detail that separates compiler front-ends and back-ends. As you may have guessed, the Loci compiler targets LLVM IR.
Given that it’s newer than most, Clang was designed with understanding gained from the development of older compilers. With this experience in hand, the Clang developers chose to write their C++ parser as a hand-written recursive descent parser. They say:
We are convinced that the right parsing technology for this class of languages is a hand-built recursive-descent parser. Because it is plain C++ code, recursive descent makes it very easy for new developers to understand the code, it easily supports ad-hoc rules and other strange hacks required by C/C++, and makes it straight-forward to implement excellent diagnostics and error recovery.
Clang is one of the fastest C++ compilers, produces excellent diagnostics and is used in library form as libclang in an increasing number of tools to do tasks such as code highlighting, analysis etc. So there’s every reason to believe these claims.
GCC
Unlike Clang, GCC is a relatively old compiler (by software standards) and has therefore travelled down a slightly different path. GCC started off with a Bison LALR grammar; however, in the early 2000s they switched to a hand-written recursive descent parser. Claimed advantages included:
- Better error-messages
- Better error-recovery
- Simplification of many parts of the compiler that presently do funny things to work around oddities in the parser.
- Easier to debug than a yacc parser. (Don’t have to understand the detailed works of yacc.)
- Don’t have hard-to-understand data and control flow dependencies between lexer, parser, and rest of compiler.
- Likely to be smaller, since you don’t have as many duplications due to contortions to deal with complicated parts of the grammar.
- Might be faster. – If the C grammar is re-written at the same time, you can perhaps share much of the code.
- Can use same grammar in other applications (such as gdb), some of which might need a reentrant grammar, which is difficult with yacc. (A re-entrant grammar was a goal for cppexp.c.)
Building the parser
Looking at the existing compilers and analysing all the solutions available, it became clear that a hand-written recursive descent parser was the way to go. My goals were to:
- Produce excellent diagnostics.
- Keep the parser code as clear as possible.
- Split the parser into multiple well-organised files.
- Avoid using a union to hold AST nodes (and hence avoid potential memory leaks).
- Ideally, use C++ code to avoid introducing dependencies.
Building a hand-written parser for a language like Loci is a significant project, and a much larger parser than I’d created before. I therefore decided to first rewrite the lexer in C++; at that time it used Flex and had similar issues to the Bison parser. This gave me a good test case for developing the structure of the parser and working out how to represent and issue diagnostics. When I finished the lexer I had a very clear idea about how the parser would be designed. In fact, if you look at the code you can see the lexer has an inferior design to the parser; this will be addressed in due course.
Testing the parser
Most of the compiler testing started off as using the compiler to build a program and then verifying the output of running that program. Since then the testing has become more and more fine-grained, to reduce testing times and to achieve more comprehensive testing of the compiler.
The rewrite took this further by adding suites of unit tests for the lexer and parser; up to then only the support functionality (e.g. arrays) had unit tests. Unit tests essentially take a module, isolate it and then hammer it with a combination of pleasant and unpleasant situations to check it responds appropriately. For example, a unit test of an arithmetic module could hand it zero inputs, or max-sized inputs, to check it doesn’t explode with the edge cases.
The idea of unit testing isn’t new; (non-software) engineering regularly involves testing individual components to ensure they can withstand the expected stresses. This provides a high degree of confidence that when put together the system will operate properly, since only one component has to fail for the system to start to fall apart. Unit tests are also extremely quick: a couple of seconds could cover many thousands of unit tests. And, since they depend only on the code being tested, they’re extremely reliable.
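As a rough illustration of what lexer unit tests of this kind can look like, here is a self-contained C++ sketch. The miniature `tokenize` function and the test names are invented for the example, and plain `assert` stands in for whatever test framework a real project would use; none of this is the Loci compiler’s own test code.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical miniature lexer: splits the input into words (names and
// numbers) and single-character symbols, skipping whitespace.
std::vector<std::string> tokenize(const std::string& source) {
    const auto isWordChar = [](char c) {
        return std::isalnum(static_cast<unsigned char>(c)) != 0 || c == '_';
    };

    std::vector<std::string> tokens;
    std::size_t pos = 0;
    while (pos < source.size()) {
        const char c = source[pos];
        if (std::isspace(static_cast<unsigned char>(c)) != 0) {
            ++pos;
        } else if (isWordChar(c)) {
            const std::size_t start = pos;
            while (pos < source.size() && isWordChar(source[pos])) ++pos;
            tokens.push_back(source.substr(start, pos - start));
        } else {
            tokens.push_back(std::string(1, c));
            ++pos;
        }
    }
    return tokens;
}

// Each test isolates one behaviour, including the unpleasant edge cases.
void testEmptyInputProducesNoTokens() {
    assert(tokenize("").empty());
}

void testWhitespaceOnlyProducesNoTokens() {
    assert(tokenize(" \t\n").empty());
}

void testTemplateLikeCall() {
    const std::vector<std::string> expected = {
        "f", "(", "name", "<", "a", ",", "b", ">", "(", "c", ")", ")", ";"};
    assert(tokenize("f(name < a, b > (c));") == expected);
}

void runLexerTests() {
    testEmptyInputProducesNoTokens();
    testWhitespaceOnlyProducesNoTokens();
    testTemplateLikeCall();
}
```

Calling `runLexerTests()` exercises the lexer in isolation; nothing else in the compiler needs to be built or run, which is what keeps tests like these fast and reliable.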
The new diagnostics
Along with the lexer and parser rewrites I added some initial infrastructure for ‘rendering’ diagnostics. Here are some examples of the many diagnostics that the compiler now emits:
test.loci:6:2: error: unexpected method declaration; was expecting method definition
    void method();
    ^~~~~~~~~~~~~~
1 error generated.

test.loci:11:4: error: 'else' without a previous 'if'
    } else {
      ^~~~
1 error generated.
These diagnostics are modelled after Clang, because they’re succinct, extremely readable and familiar to C++ developers.
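A Clang-style diagnostic like the ones above can be rendered from a location, a message and the offending source line. The following C++ sketch shows one way to do it; the `Diagnostic` struct, its field names and the `render` function are assumptions made for illustration and are not the Loci compiler’s actual diagnostics infrastructure.

```cpp
#include <cstddef>
#include <string>

// Hypothetical diagnostic record; the field names are illustrative only.
struct Diagnostic {
    std::string file;
    std::size_t line;    // 1-based line number
    std::size_t column;  // 1-based column of the first offending character
    std::size_t length;  // number of characters to mark
    std::string message;
};

// Renders "file:line:column: error: message", then the source line, then
// a caret line marking the offending range.
std::string render(const Diagnostic& diag, const std::string& sourceLine) {
    std::string out = diag.file + ":" + std::to_string(diag.line) + ":" +
                      std::to_string(diag.column) + ": error: " + diag.message + "\n";
    out += sourceLine + "\n";

    // Copy tabs from the source line into the padding so that the caret
    // lines up however wide the terminal renders a tab.
    const std::size_t padding = diag.column > 0 ? diag.column - 1 : 0;
    for (std::size_t i = 0; i < padding; ++i) {
        const bool isTab = i < sourceLine.size() && sourceLine[i] == '\t';
        out += isTab ? '\t' : ' ';
    }

    out += '^';
    if (diag.length > 1) out.append(diag.length - 1, '~');
    out += "\n";
    return out;
}
```

Given a `Diagnostic` for line 11, column 4 with a length of 4 and a source line of a tab followed by `} else {`, `render` produces the header line, the source line, and a caret line whose `^~~~` sits under `else`.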
Follow-on work
The new lexer and parser are now done, and the old Flex lexer and Bison parser have been removed. Further improvements will be made to both in due course. Current work is now upgrading Semantic Analysis to use the new diagnostics infrastructure, with the aim of releasing v1.4 in February with considerable improvements to the user experience. Semantic Analysis is also the next target of refactoring, as the slowest and messiest part of the compiler, to bring it up to a good level in preparation for the next batch of language features.
Summary
Re-writing the parser was a significant amount of work. I started just before Christmas (2015) and worked almost continuously until finishing in early January. However the result has been a much more usable and maintainable compiler.
The new lexer and parser were written in a different way to the existing code. A better way. Ultimately the biggest gain may have been reinvigorating my focus on improving programmer productivity and quality, which after all is what Loci is all about.