WebAssembly – native code IR for the web

Right now various developers are getting excited about WebAssembly, a new IR (intermediate representation) designed to support languages like C and C++ (and Loci) with native-like performance. Most notably, the intention is to remove the parsing overhead of asm.js (a subset of JavaScript designed as a compilation target for native languages), open up new low-level functionality and stop using JavaScript as the ‘assembly of the web’.

There’s been a long-held desire for something like this and it finally seems to be happening, given the broad support from browser vendors.

From my point of view the most interesting part of their documentation is the comparison to LLVM IR, in which they say:

The LLVM compiler infrastructure has a lot to recommend it: it has an existing intermediate representation (LLVM IR) and binary encoding format (bitcode). It has code generation backends targeting many architectures and is actively developed and maintained by a large community. […] However, the goals and requirements that LLVM was designed to meet are subtly mismatched with those of WebAssembly.

WebAssembly has several requirements and goals for its IR and binary encoding:

  • Portability: The IR must be the same for every machine architecture.
  • Stability: The IR and binary encoding must not change over time (or change only in ways that can be kept backward-compatible).

[…]

LLVM IR is meant to make compiler optimizations easy to implement, and to represent the constructs and semantics required by C, C++, and other languages on a large variety of operating systems and architectures. This means that by default the IR is not portable (the same program has different representations for different architectures) or stable (it changes over time as optimization and language requirements change). […] LLVM’s binary format (bitcode) was designed for temporary on-disk serialization of the IR for link-time optimization, and not for stability or compressibility […]

This is a very insightful comparison of WebAssembly versus LLVM IR. A lot of people have wanted LLVM IR to be a portable and stable representation, but that directly contradicts its goals of being an excellent substrate for optimisation and for cleanly separating front-ends from back-ends. There was a mailing list discussion in 2011 about what LLVM IR is (“LLVM IR is a compiler IR”) and the use cases it does and doesn’t support.
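To make the portability point concrete, here’s a small C++ illustration of my own (not taken from the WebAssembly documentation): the front-end resolves architecture-specific details such as type sizes and struct layout against a concrete target, so the constants folded into the LLVM IR differ between targets even for the ‘same’ source program:

#include <cstdio>

// The front-end evaluates sizeof/alignof against a concrete target, so the
// values below are baked into the generated LLVM IR as constants.
struct Node {
    long value;   // 4 bytes on some targets, 8 on others
    void* next;   // pointer size also depends on the target
};

int main() {
    // Prints 8/8/16 on x86-64 Linux, 4/4/8 on a typical 32-bit target,
    // so the emitted IR is different for each architecture.
    std::printf("sizeof(long) = %zu, sizeof(void*) = %zu, sizeof(Node) = %zu\n",
                sizeof(long), sizeof(void*), sizeof(Node));
    return 0;
}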

The new model seems to be to use WebAssembly to communicate code across machines/platforms/architectures, while on the machines themselves a translation to/from LLVM IR seems likely (of course, platforms can use whatever internal representation they prefer). Logically then, work is beginning on a new WebAssembly back-end for LLVM.

Fortunately, now that we have WebAssembly, it looks like LLVM IR can be left to focus on its core objectives. So while WebAssembly is standardised as a stable, portable and backwards-compatible representation for communicating programs between machines, LLVM IR can be continuously modified and improved to enable state-of-the-art optimisation and to further develop the capabilities of front-ends and back-ends.

The Internet discovers Loci

So it seems that in the last couple of days posts about Loci have appeared on Y Combinator and Reddit (first and second).

It’s really awesome to see that developers are interested in Loci, and the various questions that have been asked about the language will help me to improve the documentation to better explain how the language works and the reasons behind its design decisions. The 1.2 release of Loci will be appearing in the next couple of weeks; it looks likely that there’ll be a desperate rush next week to get some great features into the compiler. This release also includes some optimisation of Semantic Analysis (whereas 1.1 involved an enormous performance improvement to Code Generation).

If you’re reading this and you have a question, a suggestion and/or constructive criticism of the language, I recommend raising an issue on the GitHub repository and I’ll do my best to respond promptly; at some point I’ll probably set up a better system for discussion/questions (I’m thinking about a forum on the Loci website).

Locic 1.1 released!

So the second version of the Loci Compiler Tools is now available (see Loci Compiler), with the main new features being:

  • Switching from C++-like template expansion to use Template Generators (to allow templated APIs across module boundaries)
  • Module imports and exports
  • scope(success), scope(failure) and scope(exit)
  • noexcept
  • Type-templated functions/methods
  • Type aliases
  • assert and unreachable statements
  • Implicit and explicit casts between types using templated methods
  • Standard library memory allocators and smart pointers
  • Standard library containers
  • Standard library strings
  • Vastly improved performance, particularly for Code Generation
  • A larger set of examples and updates to examples to demonstrate newly implemented features
  • Significantly improved documentation in reStructuredText using Sphinx, which can generate multiple output formats including HTML and PDF
  • A much larger set of integrated tests to check both accept and reject cases, as well as testing the standard library

The release was delayed slightly from the mid-August estimate in order to add support for LLVM 3.5 (so that LLVM 3.3, 3.4 and 3.5 are all supported as backends for Locic); LLVM 3.5 was initially scheduled for release on 25th August 2014.

LLVM 3.5’s release has since been re-scheduled for the start of September, so the Locic 1.1 release was modified and tested for compatibility with LLVM 3.5 RC3 (pulled from SVN), which is expected to be near-identical to the actual LLVM 3.5 release.

Locic 1.1 Soon

I’m currently working hard on the second release of ‘Locic’, the compiler tools for the Loci programming language. The initial target date was around mid-August and it looks likely that I will be able to release on schedule.

While 1.0 provides the key features of the language (classes, algebraic datatypes, templated types), this new release really fleshes out the language implementation (and the language itself!) to include lesser but nevertheless important features. The internal structure is also being thoroughly improved, to generate better error messages, better quality code, etc. Performance-wise, the compiler is now around 50 times faster than 1.0 (tested on the Chain Reversi example), due mainly to optimisations focused on CodeGen. Testing is also much more thorough, with a much larger set of integration tests for accept/reject cases, and I’m planning to add some unit tests on top.

One of the most valuable improvements to the compiler in this release is the ‘template fix’. Specifically, you can now put a templated implementation in one module, a declaration in another module, and then just link* them together and everything will work nicely!

In other words, you can do this:

// ModuleA.loci
template <typename T>
interface Addable {
    T add(const T& other) const;
}

template <typename T: Addable<T>>
T addValues(T value0, T value1) {
    return value0 + value1;
}

// ModuleB.loci
template <typename T>
interface Addable {
    T add(const T& other) const;
}

template <typename T: Addable<T>>
T addValues(T value0, T value1);

void yourFunction() {
    int a = 1;
    int b = 2;
    int c = addValues<int>(a, b);
    // etc.
}

So the point here is that, unlike C++, you can use a templated class or function without needing its implementation**. This is one of the many ground-breaking features of Loci, and it enables:

  • No more avoiding templated APIs; they can now be part of a stable API!
  • Switching between different implementations of a templated class/function by simply linking them together.
  • Much faster compile times.
  • Optimisations to more effectively balance the trade-off between code size and performance (i.e. it’s easier to generate multiple instantiations of a template than to merge them).
  • No need to release templated code to clients (in a commercial setting).

So now when users write templated constructs they don’t need to treat them any differently from normal constructs!

* You could generate object code and then use the platform linker. However, I’d strongly encourage the use of llvm-link followed by opt so that you can benefit from inter-procedural optimisation.

** Naturally some users will be interested in the implementation, and this will be explained fully in the language design document. To summarise, the compiler will generate vtables for each type and then effectively perform virtual calls as needed. If you then link two modules together and run opt on the result, you’ll find that all the virtual calls are eliminated by inlining (unless the compiler determines it’s not beneficial, though currently it always inlines), generating code similar to (or better than) what you’d write if you ‘manually’ instantiated the templates.
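As a rough analogy (this is my own C++ sketch, not the code Locic actually emits), the mechanism is comparable to passing a small table of function pointers alongside the values, and letting the optimiser devirtualise and inline the indirect call once both 'modules' are visible:

#include <cstdio>

// A hand-written 'vtable' describing how to add two values of some type.
struct AddableVTable {
    void (*add)(void* result, const void* a, const void* b);
};

// The declaration-only 'module' calls through the table, much like a virtual call.
void addValues(const AddableVTable* vtable, void* result,
               const void* a, const void* b) {
    vtable->add(result, a, b);
}

// The implementing 'module' provides the table entry for int.
static void addInts(void* result, const void* a, const void* b) {
    *static_cast<int*>(result) =
        *static_cast<const int*>(a) + *static_cast<const int*>(b);
}
static const AddableVTable intVTable = { &addInts };

int main() {
    int a = 1, b = 2, c = 0;
    addValues(&intVTable, &c, &a, &b);
    // Once both sides are linked and optimised, the indirect call can be
    // inlined away, as if the template had been instantiated directly.
    std::printf("%d\n", c);
    return 0;
}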

Concurrent Programming Talk Slides

I recently gave a talk on ‘Concurrent Programming’, for which the slides can be found here.

The abstract for the talk was:

In this talk I’ll discuss the fundamental problems facing the development of concurrent applications and common approaches that aim to solve them. I’ll propose a solution that is based on the best of these approaches, and consider its wider implications. Finally, I’ll examine how it should be used in real applications.

Threads vs Events

As I’ve been programming OpenP2P, I’ve iterated over a few styles of programming – a process which consumed a lot of time, but taught me a lot.

Originally, I built OpenP2P on top of boost::asio to make use of its asynchronous features, which I had supposed would make my programming life easier as well as providing greater efficiency. The style was effectively based upon the Actor model, as deployed by languages such as Erlang, in which I would have a thread managing the state for each object – although in this case the ‘thread’ was an abstraction, and the system as a whole was based on a small number of real threads on which methods would be cooperatively queued and executed. In other words, it was an event-based system.

The problems became obvious as the number of boost::shared_ptrs required increased dramatically while I tried to manage all the state of the abstract threads I had created, and I felt that I needed garbage collection to be able to safely free all the data. RAII on the stack, a very useful technique in C++, quickly became useless since all the state now had to live on the heap.
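To illustrate the kind of code this leads to, here’s a condensed sketch of the event-based style (my own example, not actual OpenP2P code) using boost::asio’s asynchronous API: every object whose state must outlive the current call ends up on the heap behind a shared_ptr, because the work continues later inside a callback rather than on the current stack:

#include <boost/asio.hpp>
#include <boost/enable_shared_from_this.hpp>
#include <boost/shared_ptr.hpp>

class Connection : public boost::enable_shared_from_this<Connection> {
public:
    explicit Connection(boost::asio::io_service& service)
        : socket_(service) {}

    void startReceive() {
        // Keep the object alive until the callback runs, whenever that is.
        boost::shared_ptr<Connection> self = shared_from_this();
        socket_.async_read_some(boost::asio::buffer(buffer_),
            [self](const boost::system::error_code& error, std::size_t bytes) {
                if (!error) {
                    // Handle 'bytes' bytes of data, then queue the next read;
                    // all of the state lives in *self, on the heap.
                    self->startReceive();
                }
            });
    }

private:
    boost::asio::ip::tcp::socket socket_;
    char buffer_[4096];
};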

Then I discovered this.

The paper argues that, historically, threading packages have provided poor performance, but that this isn’t inherent to the threading model and is only an artifact of those implementations. It then goes on to show that more modern threading implementations can achieve excellent performance. Such conclusions are not isolated, as demonstrated by this presentation, which makes a similar argument, focusing particularly on java.io vs java.nio.

The point is that threading packages have many advantages, making it easier to understand the code and to manage state on the stack (utilising RAII), and therefore that it isn’t wise to switch systems to an event-based architecture in pursuit of a temporary performance increase. In fact, Linux’s NPTL shows that the days of poorly-performing threading implementations are over.

So, after comparing the two approaches, I decided to change OpenP2P to a synchronous thread-based architecture. Operations such as socket.receive() no longer take a callback (to be called at some point in the future from some thread); instead they simply take the necessary parameters and block, returning to the calling thread when they complete and indicating errors via a return value (or by throwing an exception). If you want to do other things simultaneously, spawn a new thread. The only thing I ended up adding was a facility for easily cancelling any operation, for example when a signal is raised from another thread or when a timer completes.
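For contrast, here’s the shape of the blocking style (again my own sketch rather than OpenP2P itself): the receive call simply blocks the current thread, errors become exceptions, and per-connection state can live on the stack and be released by RAII:

#include <boost/asio.hpp>
#include <cstdio>

// Thread-based style: run one of these per connection, each on its own thread
// (e.g. std::thread(handleConnection, std::ref(socket)).detach()).
void handleConnection(boost::asio::ip::tcp::socket& socket) {
    char buffer[4096];  // per-connection state stays on the stack
    try {
        for (;;) {
            // Blocks the current thread until data arrives; throws on error.
            const std::size_t bytes = socket.read_some(boost::asio::buffer(buffer));
            std::printf("received %zu bytes\n", bytes);
        }
    } catch (const boost::system::system_error&) {
        // Connection closed or failed; stack unwinding cleans everything up.
    }
}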

An event-based architecture is obviously useful for some purposes, such as a GUI, although I can see potential advantages in an approach of using a separate thread for each event, to prevent the GUI from blocking if one of the event handling routines decides to do something like open a file, which should generally be fast, but in this case happens to be on an external hard drive that takes 10 seconds or so to wake up. However, creating threads clearly has to be efficient (or otherwise we might have to keep threads around for new events), and this could potentially generate a large number, as well as creating all sorts of synchronisation concerns since events are likely to raise other events. Naturally, it all comes down to whether you can rely on your event handlers to cooperate.