Loci v1.4 Released

The latest version of the Loci compiler is now available. See the release notes for detailed information about what’s included.

The main feature in this release is a brand new compiler diagnostics system with a rewritten lexer and parser. Here’s an example of errors from the old compiler:

And here’s how it looks with the new diagnostics system:

This release therefore brings an enormous improvement in the usability of the compiler: the cause of each error or warning should now be very clear to the user.

GLR vs Recursive Descent Parsing

One of the oldest parts of the Loci compiler was replaced earlier this month: the parser. The switch was from a Bison GLR parser to a hand-written recursive descent parser. Rewriting a major compiler component is a significant time/effort investment (it took around 2-3 weeks of relatively intense development), so I’ll explain why I took the leap.

Original GLR Grammar

LL vs LR

If you don’t know anything about LR parsers, I’ll try to give a summary here. Basically LR parsers construct the parse tree bottom-up, trying to match a sequence of symbols to their associated non-terminal. This contrasts with LL parsers, which start from the non-terminal and try to construct the parse tree top-down. There’s an explanation here: http://stackoverflow.com/a/6824545/142088 . Out of the two, LL parsers are the more intuitive, though LR parsers are more ‘powerful’ (i.e. for every language that there’s an LL parser there’s an LR parser, but not vice versa).

Template Syntax

Of course, for a generalised systems programming language like Loci it’s not this simple. Most significantly, Loci inherits C++’s template syntax, which uses ‘<’ and ‘>’ both to delimit template arguments and as the less-than/greater-than operators. This is combined with Loci’s multi-pass compilation, which means the parser doesn’t know whether a NAME is a type, a function, a value, etc., making code like the following potentially ambiguous:

This code could be parsed as two comparisons, i.e.:

Or it could be parsed as a call to a templated function:

(It’s clearly a templated function call, and this is the interpretation chosen by the compiler.) All of this could’ve been made easier for the compiler; Loci could’ve been designed to use a different template syntax, such as that of the D programming language, which writes template instantiations as f!(b, c)(d) and so cannot be confused with a comparison.

However, in the design of Loci I deliberately focused on doing the easiest thing for the user, which is to preserve the familiarity of the C++ template (and Java generic) syntax. Loci always tries to follow existing language conventions, following the principle of least surprise. I personally also think that the C++ template syntax is clearer, since the angle brackets are noticeably different from the round brackets.

Switching to GLR

The Loci compiler started off with an LALR parser (LALR is essentially a variant of LR which is less powerful but which produces significantly smaller parsing tables). However in the context of syntax like that shown above, it was upgraded to use Bison’s GLR parsing functionality.

A GLR (Generalised LR) parser is a way of running multiple parsers simultaneously. Most of the time the action for any given input token is unambiguous (i.e. there is only one choice), but in some cases there are multiple choices (e.g. treat ‘<’ as less-than versus treat it as opening a template argument list). A normal LALR parser wouldn’t allow this (it would complain about shift/reduce conflicts), but GLR works around the problem by splitting the parser into two parsers, thereby exploring both possibilities simultaneously. If one of the parsers hits an error it simply vanishes and the other parser continues. If the parsers produce different parse trees for the same non-terminal, this is an ambiguity, which can be handled by the parser implementer (e.g. by emitting an error).

Problems

(You can see the old parser, from the v1.3 release, here.)

While GLR was a good solution to the parsing problem, the parser had many problems including (from most significant to least significant):

  • Poor quality diagnostics – This is related to LL vs LR; a top-down LL parser always knows which construct it is currently parsing, so it’s easier to produce good error messages. It’s also related to the parser generator, since it’s difficult to get Bison to produce good errors (and adding extra rules for errors often seems to introduce shift/reduce conflicts).
  • AST must be held in a union – Bison requires that the AST nodes corresponding to non-terminals are kept in a union. This is unfortunate, because C++ cannot call class destructors in a union (since it doesn’t know which destructor to call). Hence all the AST nodes were referred to by heap-allocated pointers, which produced lots of memory leaks (I wrote the early compiler with too little regard for issues like this).
  • Parser is in one large file – It seems that Bison is designed to have the entire parser in a single file. The Loci GLR parser was around 2500 lines, which is impressively concise given the syntactical complexity of the language. However keeping all this logic in a single file makes it harder to move between different parts of the parser (e.g. type parsing versus value parsing), meaning that you tend to end up in a sea of code, not really knowing where you are. I tend to keep source files below 500 lines in length for this reason.
  • C++ <-> Bison communication is messy – This isn’t a deal breaker, but it does make the code harder to read, particularly when following the control flow of the parser. Code readability is extremely important for long-term maintenance of the compiler and for allowing others (beyond myself) to make sense of the code.
  • Invalid conflicts – LR grammars are always unambiguous, but not all unambiguous grammars are LR. While Bison did produce some useful shift/reduce conflicts, some conflicts simply arose from the structure of the grammar. Restructuring the grammar fixes those conflicts, but tends to make the grammar incomprehensible.
  • Token header file generated at build time – One of the tedious aspects of Bison is that you have to wait until build time to get a header file containing the tokens. This means that the tokens can’t be exposed in a pre-written library API, unlike the rest of the compiler code.

For a while most of these issues were manageable, since the goal was to build a compiler that works. However, now that the compiler has reasonably solid support for the language, the challenge is to make the compiler work well, part of which is to get a good user experience.

With this in mind, I decided that the poor diagnostics were no longer acceptable and proceeded to design a new parser.

New parser

Existing Compilers

Unsurprisingly there’s a lot of prior art when it comes to programming language parsers. Given the syntactic similarities between C++ and Loci, it seemed to make sense to look at the mainstream C++ compilers: Clang and GCC.

Clang

Clang is a relatively new compiler that supports C, C++ and Objective-C and targets LLVM. The design of Clang and LLVM (the two are closely linked) is well known to be excellent, since they thoroughly embrace modularity and well-defined API boundaries. The best example of this is LLVM IR, a representation documented in exquisite detail that separates compiler front-ends and back-ends. As you may have guessed, the Loci compiler targets LLVM IR.

Given that it’s newer than most, Clang was designed with understanding gained from the development of older compilers. With this experience in hand, the Clang developers chose to write their C++ parser using a hand-written recursive descent parser. They say:

We are convinced that the right parsing technology for this class of languages is a hand-built recursive-descent parser. Because it is plain C++ code, recursive descent makes it very easy for new developers to understand the code, it easily supports ad-hoc rules and other strange hacks required by C/C++, and makes it straight-forward to implement excellent diagnostics and error recovery.

Clang is one of the fastest C++ compilers, produces excellent diagnostics and is used in library form as libclang in an increasing number of tools to do tasks such as code highlighting, analysis etc. So there’s every reason to believe these claims.

GCC

Unlike Clang, GCC is a relatively old compiler (by software standards) and has therefore travelled down a slightly different path. GCC started off with a Bison LALR grammar; however, in the early 2000s the developers switched to a hand-written recursive descent parser. Claimed advantages included:

  • Better error-messages
  • Better error-recovery
  • Simplification of many parts of the compiler that presently do funny things to work around oddities in the parser.
  • Easier to debug than a yacc parser. (Don’t have to understand the detailed works of yacc.)
  • Don’t have hard-to-understand data and control flow dependencies between lexer, parser, and rest of compiler.
  • Likely to be smaller, since you don’t have as many duplications due to contortions to deal with complicated parts of the grammar.
  • Might be faster. – If the C grammar is re-written at the same time, you can perhaps share much of the code.
  • Can use the same grammar in other applications (such as gdb), some of which might need a reentrant grammar, which is difficult with yacc. (A re-entrant grammar was a goal for cppexp.c.)

Building the parser

Looking at the existing compilers and analysing all the solutions available, it became clear that a hand-written recursive descent parser was the way to go. My goals were to:

  • Produce excellent diagnostics.
  • Keep the parser code as clear as possible.
  • Split the parser into multiple well-organised files.
  • Avoid using a union to hold AST nodes (and hence avoid potential memory leaks).
  • Ideally, use C++ code to avoid introducing dependencies.

Building a hand-written parser for a language like Loci is a significant project, a much larger parser than any I’d created before. I therefore decided to first rewrite the lexer (which at that time used Flex, with issues similar to the Bison parser’s) in plain C++. This gave me a good test case for developing the structure of the parser and for working out how to represent and issue diagnostics. By the time I finished the lexer I had a very clear idea of how the parser would be designed. In fact, if you look at the code you can see the lexer has an inferior design to the parser; this will be addressed in due course.

Testing the parser

Most of the compiler testing started off with using the compiler to build a program and then verifying the output of running that program. Since then the testing has become more and more fine-grained, both to reduce testing times and to achieve more comprehensive coverage of the compiler.

The rewrite took this further by adding suites of unit tests for the lexer and parser; up to then only the support functionality (e.g. arrays) had unit tests. Unit tests essentially take a module, isolate it and then hammer it with a combination of pleasant and unpleasant situations to check it responds appropriately. For example, a unit test of an arithmetic module could hand it zero inputs, or max-sized inputs, to check it doesn’t explode with the edge cases.

The idea of unit testing isn’t new; (non-software) engineering regularly involves testing individual components to ensure they can withstand the expected stresses. This provides a high degree of confidence that the assembled system will operate properly, since it only takes one faulty component for the system to start to fall apart. Unit tests are also extremely quick: a couple of seconds could cover many thousands of unit tests. And, since they depend only on the code being tested, they’re extremely reliable.

The new diagnostics

Along with the lexer and parser rewrites I added some initial infrastructure for ‘rendering’ diagnostics. Here are some examples of the many diagnostics that the compiler now emits:

These diagnostics are modelled after Clang, because they’re succinct, extremely readable and familiar to C++ developers.

Follow-on work

The new lexer and parser are now done, and the old Flex lexer and Bison parser have been removed. Further improvements will be made to both in due course. Work is now focused on upgrading Semantic Analysis to use the new diagnostics infrastructure, with the aim of releasing v1.4 in February with considerable improvements to the user experience. Semantic Analysis, as the slowest and messiest part of the compiler, is also the next target of refactoring, to bring it up to a good level in preparation for the next batch of language features.

Summary

Re-writing the parser was a significant amount of work. I started just before Christmas (2015) and worked almost continuously until finishing in early January. However the result has been a much more usable and maintainable compiler.

The new lexer and parser were written in a different way to the existing code. A better way. Ultimately the biggest gain may have been reinvigorating my focus on improving programmer productivity and quality, which after all is what Loci is all about.

Loci Compiler v1.3 released

The latest version of the Loci compiler is now available. See the release notes for detailed information about what’s included.

A lot of this release is clearing up the low level mechanisms of the language, including:

  • Specifying object memory states, which is important for move semantics.
  • Adding sizeof() and alignmask() methods so that unsized types (e.g. opaque structs) can be represented.
  • Adding predicate aliases (or ‘named predicates’) to improve readability.
  • Fixing noexcept so that it is inferred correctly.
  • Adding a static array type.
  • Fixing value templates (primarily for the static array element count template argument).

There’s also been a lot of work on the language grammar and parser to work out the more difficult cases. It’s becoming increasingly clear that the Bison parser has many problems and so rewriting it as a recursive descent parser will soon be necessary. However in the meantime the priority is to augment the standard library and make the compiler easier to use.

I’m also planning to shorten the release cycle down to about 2-3 months, so the feature list for v1.4 is naturally trimmed down as well.

LLVM ARM and AArch64 nest attribute patches accepted

In the process of developing a mechanism for efficient virtual method calls in Loci I came across difficulties related to generating the appropriate assembly via LLVM IR. For more information, read the LLVM mailing list discussion.

Fortunately it turns out LLVM has recently been adding the required features, one of which is the ‘nest’ attribute, designed to be used for GCC’s nested functions extension to C (Clang needs to be compatible with GCC). The only issue was that while full support was already present for x86 32-bit and 64-bit, the ‘nest’ attribute wasn’t implemented in the LLVM ARM 32-bit or AArch64 backends.

Hence I created patches for ARM and AArch64, and these have now been accepted and committed to LLVM. Separately, support has also been implemented for PowerPC. These will all be going into LLVM 3.7.

Locic 1.2 Very soon

The next release of the Loci Compiler Tools will soon be available!

Full release notes will be part of the provided locic-1.2 archives (and up on http://loci-lang.org) but, to briefly mention it here, the main feature of this release is predicates, which are essentially a much clearer and easier-to-use form of C++’s proposed concepts feature.

Predicates are just boolean expressions over template variables (i.e. expressions that evaluate to either true or false). For example:

Predicates come in three forms:

  • Const predicates – Specify that a variable or value is const depending on the outcome of a predicate. The common case is methods that return a non-const result if their parent is non-const, but return a const result if their parent is const.
  • Noexcept predicates – Specify that a function or method is noexcept depending on the outcome of a predicate. For example, a method may only be noexcept if another method it calls is noexcept.
  • Require predicates – Specify that a type/function/method is only valid depending on the outcome of a predicate. For example, an array is only copyable if its elements are copyable.

This kind of functionality and capability is far ahead of C++; combined with the design of Loci templates (being usable across compilation boundaries!) it provides an extremely clear and concise way to write structured code. It’s an important core aspect of the Loci type system and substantial portions of the standard library rely on this (and that will be an increasing trend).

Named Predicates

For the next release I’m planning to augment this functionality to allow specifying named predicates. Right now you can create type aliases like:

My idea is to extend this to predicates:

The vast majority of the compiler machinery is already in place for this sort of feature and there is a clear gain in readability, so I expect this to be one of the first features I work on in the next release cycle (1.3).

Remaining Work

So the compiler is now mostly ready for release (release date is expected to be this weekend). The remaining work is:

  • Improvements/fixes to the standard library.
  • Updating documentation.
  • Adding more tests.
  • Clearing up some of the messier parts of the code.

Dropped Features

Unfortunately exception specifications are going to be dropped from this release and pushed back. Fortunately they aren’t, relatively speaking, a particularly important feature, and they also have a slight conflict with noexcept predicates that should be resolved first. I think people (including myself) care a lot more about other features (e.g. lambdas, an import-from-export generator tool, a better standard library), so I’m going to push the implementation of exception specifications into the more distant future.

Next release (1.3)

I’m expecting the next release to be focused on even further development of templates in the form of value templates (already mostly ready in the compiler), variadic templates and template argument deduction.

By adding these, functions, methods, interface methods etc. will become primitive types (i.e. like ints, floats, lvals etc.), removing a nasty (and probably slightly broken) special case from the compiler. Developers will then be able to treat functions in the same way as other types (e.g. putting them in containers). The other (slightly simpler) new primitive type will be a statically sized array type (like C++’s std::array).

I’ll also start building a ported version of OpenP2P and in the process start building (alpha quality) tools that make it easier to create C or C++ bindings for Loci (i.e. for calling from Loci into C and C++). The idea is to use Clang to analyse C and C++ source code and produce both some C++ glue code (unnecessary in the C case) and a Loci ‘header’ file with the relevant imports. In the C++ case, we need to generate C++ code that provides ‘export C’ (i.e. unmangled, standard C calling convention) functions which are callable from Loci and which call the relevant C++ methods.

The C case is fairly simple but for C++ there are some issues, of course:

  • C++ templates require compile-time expansion. They also don’t express any constraints that could be translated to require() predicates.
  • C++ classes must have a compile-time known size (Loci classes have fixed size, but in the general case this size doesn’t need to be known until run-time) – This is actually not a problem for calling from Loci into C++, but would raise issues for the reverse situation (solution is probably just using PIMPL idiom, meaning a heap allocation, which is what C++ programmers would do anyway).
  • C++ supports function/method overloading.

These problems can all be solved, probably via some sort of human direction (i.e. specify that particular template instantiations be created, indicate names for overloaded methods). Overall with the help of Clang it should ultimately be possible to do the vast majority of the work automatically and leave the developer to clear up the odd cases.

In terms of release dates, I’m aiming for mid-Summer (i.e. around July).

Moving projects to Loci

It’s always been my intention to develop the Loci programming language as a tool that I would regularly deploy myself and this year that’s going to start happening!

The language and compiler have reached an early stage of maturity, with a substantial mix of implemented language features and a small (but quickly growing) set of useful standard library modules. Hence this seems the right time to move some of my projects, currently written in C++, over to Loci: to take advantage of the strengths of the language, to guide its future design decisions, and to provide a real-world test of the compiler. Chain Reversi has already been ported to Loci and has been particularly useful when optimising the compiler; I consider quick build times extremely important, and I’m pleased to say both that the design of Loci means it can be expected to compile much faster than similar C++ code and that this is borne out in practice.

The first project to move over is my ‘second’ (but slightly neglected) project, OpenP2P, which is basically a set of libraries that make it easy to develop applications that use peer-to-peer networking. One of the aspects that held OpenP2P back was the lack of good event notification/handling in C++ (which was basically implemented as a layer over boost::asio) but fortunately Loci has std.event, which in my opinion follows a very clean approach to event signalling. So I’ll be moving OpenP2P across and integrating it neatly with the standard library modules.

There’s another reason for making this change. Loci has presented me with a couple of problems:

  • It’s too much fun to work on!
  • It’s too successful!

Hence it’s starting to dominate my projects time, which has led to it becoming relatively well structured but on the other hand my other projects have been neglected. By moving these projects over to Loci I get to work on both at the same time! I also get the thrilling feeling of the last 5 years of designing Loci and building the compiler coming to fruition in a very practical way.

The Internet discovers Loci

So it seems that in the last couple of days postings have appeared about Loci on Y Combinator and Reddit (first and second).

It’s really awesome to see that developers are interested in Loci. Various questions have been asked about the language, which will help me to improve the documentation to better explain how the language works and the reasoning behind its design decisions. The 1.2 release of Loci will be appearing in the next couple of weeks; it looks likely that there’ll be a desperate rush next week to get some great features into the compiler, and this release also includes some optimisation of Semantic Analysis (whereas 1.1 involved an enormous performance improvement to Code Generation).

If you’re reading this and you have a question, a suggestion and/or constructive criticism of the language, I recommend raising an issue on the GitHub repository and I’ll do my best to respond promptly; at some point I’ll probably set up a better system for discussion/questions (I’m thinking about a forum on the Loci website).

Locic Website is now documentation

So after much consideration I’ve just gone ahead and replaced the Loci website with the compiler’s own generated HTML documentation. This follows the realisation that, since Sphinx and reStructuredText are such amazingly convenient tools, I’ve ended up writing a huge amount about the language and the compiler. The original website used WordPress, and it was tedious to update both the website and the documentation. Also, with a bit of theming, the Locic documentation actually looks really good.

Hopefully this means anyone discovering the website will be able to get a clear idea of the language by just visiting the website; it’s obviously tedious to actually get the compiler source and to build the documentation from source.

Right now the documentation is mostly up-to-date. There are a few sections that are ahead of the implementation (!), which should probably be fixed. For example, there’s a section about exception specifications, but they don’t actually exist in the compiler. Fortunately the features affected tend to be fairly small and so are unlikely to affect users. Realistically, the main issues from a user’s point of view are platform support, standard library development (I’ve already done loads on this and am actively adding more functionality) and easier integration with C and development tools.