Investigating the performance of wasmjit-omr, Part 1

Introduction

As a final year university project, a few classmates and I created a JIT compiler for WebAssembly called wasmjit-omr. Our goal was to create a simple JIT compiler capable of accelerating most WebAssembly workloads.

Naturally, we wrote the JIT compiler using Eclipse OMR JitBuilder. A JIT compiler by itself isn’t that useful, however, since it usually runs inside a virtual machine (VM). To avoid having to create a full WebAssembly VM, we decided to integrate our JIT into the interpreter provided by the WebAssembly Binary Toolkit project, a.k.a. WABT.

After the project ended, we continued to work on it on our own time, with most of the work going towards improving performance.

Once we had completed a few improvements intended to give wasmjit-omr a significant performance boost (these will be discussed in a later post), it was time to validate our assumptions and measure the performance of our JIT.

In this post, the first in a series, I will discuss the process I went through to get an initial insight into the performance improvement our JIT compiler provides over the interpreter. First, I will explain how I implemented the benchmark used to measure performance. Then, I will describe some of the tooling I had to implement to run the benchmark: a wasmjit-omr based VM that provides the system calls needed to execute WebAssembly code generated by Emscripten. Finally, I will show and discuss the results of the performance measurements I did.

Creating a Benchmark

The purpose of a JIT compiler is to make programs written in interpreted languages run faster. So, for the initial performance measurements, I focused on measuring the speedup the JIT compiler provides compared to the interpreter alone.

The benchmark I used is a Mandelbrot program because it’s simple, relatively computationally intensive, and frequently used to benchmark language implementations.

To avoid having to write the benchmark in WebAssembly directly (which is doable but unnecessarily tedious), I wrote the Mandelbrot program in C and compiled it to WebAssembly using Emscripten.

For technical reasons, one property of the wasmjit-omr JIT compiler is that it never compiles the top-level function (main or start function). So, to ensure the JIT compiles the Mandelbrot code, the code must be placed in a separate function. Also, annotating the function with __attribute__((noinline)) prevents Emscripten from inlining it into the main function.

typedef double FP_t;

__attribute__((noinline))
void mandelbrot(int x_pixels, int y_pixels, int* out_table, int max_iterations) {
    FP_t x_scale_factor = 3.5 / (FP_t) x_pixels;
    FP_t y_scale_factor = 2.0 / (FP_t) y_pixels;

    for (int py = 0; py < y_pixels; ++py) {
        for (int px = 0; px < x_pixels; ++px) {
            FP_t x0 = px * x_scale_factor - 2.5;
            FP_t y0 = py * y_scale_factor - 1.0;

            FP_t x = 0.0;
            FP_t y = 0.0;
            int iteration = 0;
            while (x*x + y*y <= 2*2 && iteration < max_iterations) {
                FP_t x_temp = x*x - y*y + x0;
                y = 2*x*y + y0;
                x = x_temp;
                iteration += 1;
            }

            out_table[py*x_pixels + px] = iteration;
        }
    }
}

Since we’re only interested in the time it takes for the Mandelbrot code to execute (and not the VM startup or JIT compilation time), I surrounded a call to mandelbrot() with calls to C’s time() function, which returns the current time in seconds.

int main(void) {
    // ...

    int table[34][80] = {0};
    time_t time_diff = time(NULL);
    mandelbrot(80, 34, (int*)table, 300000);
    time_diff = time(NULL) - time_diff;

    // ..
}

One drawback of using time() is that measurements only have one-second precision. However, for reasons that I will explain shortly, time() is a convenient way of measuring time with relatively little effort. And for a first ballpark comparison between interpreted code and JIT compiled code, one-second precision won’t be a problem.
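To make the one-second precision concrete, here is a minimal sketch of what subtracting two time() readings reports. The helper whole_second_diff is hypothetical (it is not part of the benchmark) and assumes the true start and end instants are known as non-negative fractional seconds.

```c
/* Hypothetical helper, not part of the benchmark: models the value obtained
 * by subtracting two time() readings, given the true (fractional) start and
 * end instants in seconds. Assumes both instants are non-negative, so the
 * cast to long truncates exactly like time() does. */
long whole_second_diff(double start, double end) {
    return (long)end - (long)start;
}
```

With this model, 0.2 seconds of work that happens to cross a second boundary (10.9 to 11.1) is reported as 1, while 0.8 seconds of work that does not (10.1 to 10.9) is reported as 0. A reading can therefore be off by almost a full second in either direction.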

Also, to make sure that the compilation time of mandelbrot() is not measured, I added a dummy call to it before the first call to time(). Because, by default, wasmjit-omr compiles a function just before its first call, using a dummy call ensures compilation has completed by the time the benchmarked code runs.

int main(void) {
    int small_table [3][4] = {0};
    mandelbrot(4, 3, (int*)small_table, 10);

    int table[34][80] = {0};
    time_t time_diff = time(NULL);
    mandelbrot(80, 34, (int*)table, 300000);
    time_diff = time(NULL) - time_diff;

    // ..
}

To report the execution time, we would ideally print it to the screen. However, printing to the screen requires significant support from the VM and would take time to implement.

Instead, I decided to make the execution time the return value of the main() function. Because the time is measured in whole seconds, it easily fits in the integer return value, which is enough to get meaningful results from the benchmark.

int main(void) {
    // force JIT compilation
    int small_table [3][4] = {0};
    mandelbrot(4, 3, (int*)small_table, 10);

    // run and time the benchmark
    int table[34][80] = {0};
    time_t time_diff = time(NULL);
    mandelbrot(80, 34, (int*)table, 300000);
    time_diff = time(NULL) - time_diff;

    return time_diff;
}

The complete code for this benchmark and build instructions can be seen here: https://github.com/wasmjit-omr/micro-benchmarks/tree/post1.

Dealing with System Calls

In most systems, tasks such as getting the current time and printing text to the screen are generally done via system calls: special calls that are handled by the operating system. Most programming languages hide this behind more convenient interfaces like standard libraries. However, the implementation of these interfaces eventually relies on making a system call.

WebAssembly does not support making system calls directly. Instead, it allows VMs to provide “host functions” that are callable from WebAssembly code. Compilers like Emscripten take advantage of this by emitting “host calls” where a regular C compiler would have emitted a system call.

By design, the only system call that is required to run the Mandelbrot benchmark is _time, which is used by the time() function. Other unused/uncalled host imports are removed by Emscripten when using the -O3 flag (-Os also does this).

Implementing the System Calls

By default, the WABT interpreter does not provide any host functions. Since WABT is designed to be embedded, it is the embedder’s responsibility to define the host functions and other imports they want to expose to WebAssembly code. WABT’s wasm-interp tool shows an example of how this is done.

To execute the Mandelbrot benchmark with wasmjit-omr, a custom tool (or “embedder”) that provides the host functions that handle system calls was needed. I decided to call this tool em-interp (the em- is for EMscripten).

To implement this tool, I started by copying the wasm-interp source code, which is just one file.

Next, I created a function that takes a pointer to a wabt::interp::Environment object and registers a new module in it.

void AppendEmscriptenModule(wabt::interp::Environment* env);

The new function is called from the environment initialization routine. Creating a separate function decouples the definition and implementation of the host module (its functions, memories, tables, constants, etc.) from the core tool implementation.

static void InitEnvironment(Environment* env) {
  AppendEmscriptenModule(env);
  // ...
}

WebAssembly code generated by Emscripten imports system calls from a host module called env. env is also used to import other components like linear memory, tables, and a few constants initialized by the host VM. In addition to the _time system call, the Mandelbrot code only needs a linear memory. As such, these are the only components that I implemented in AppendEmscriptenModule.

void AppendEmscriptenModule(Environment* env) {
  HostModule* module = env->AppendHostModule("env");

  Memory* memory = nullptr;
  Index memIndex = 0;
  std::tie(memory, memIndex) = module->AppendMemoryExport("memory", Limits{256, 256});

The function first creates a host module called env and adds to it a linear memory with the appropriate size limits.

  module->on_unknown_func_export =
      [](Environment* env, HostModule* host_module, string_view name, Index sig_index) -> Index {
          printf("Importing unimplemented host function '%s' from '%s' module\n", name.to_string().c_str(), host_module->name.c_str());
          auto name_s = name.to_string(); // cached copy of name to avoid reading bad values from string_view
          auto callback = host_module->AppendFuncExport(
                  name,
                  sig_index,
                  [=](const HostFunc* func, const FuncSignature* sig, const TypedValues& args, TypedValues& results) -> interp::Result {
                      printf("Call to unimplemented host function '%s' from '%s' module!!!\n", name_s.c_str(), host_module->name.c_str());
                      return interp::Result::TrapUnreachable;
                  });
          return callback.second;
      };

The next part of the function may be a bit confusing (a lambda within a lambda, whaaaat?) but what it does is fairly straightforward. If a WebAssembly module attempts to import a function from env that isn’t provided, this code will instead provide a “dummy” function that will immediately trap. That way, as long as the unimplemented function is never called, the WebAssembly code will work correctly without the tool having to provide all of its imported functions.
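Stripped of the lambdas, the fallback idea can be sketched in plain C as a lookup table with a default. Everything in this sketch (result_t, stub_time, trap_stub, env_exports, resolve_import) is a hypothetical name made up for illustration and is not the WABT API. Unlike the C++ version, the C stub cannot capture the name of the missing import, so it only signals the trap.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result codes standing in for the interpreter's results. */
typedef enum { RESULT_OK, RESULT_TRAP } result_t;
typedef result_t (*host_func_t)(void);

/* Stand-in for an implemented host function. */
result_t stub_time(void) { return RESULT_OK; }

/* Fallback used for every import the host does not implement. */
result_t trap_stub(void) { return RESULT_TRAP; }

typedef struct {
    const char* name;
    host_func_t func;
} host_export_t;

/* The only function this toy "env" module actually provides. */
const host_export_t env_exports[] = {
    {"_time", stub_time},
};

/* Resolve an import by name. Unknown names still resolve, but to a stub
 * that traps only if it is ever called. */
host_func_t resolve_import(const char* name) {
    size_t count = sizeof env_exports / sizeof env_exports[0];
    for (size_t i = 0; i < count; ++i) {
        if (strcmp(env_exports[i].name, name) == 0) {
            return env_exports[i].func;
        }
    }
    return trap_stub;
}
```

Resolving "_time" yields the real function, while resolving any other name succeeds at import time and only fails when the returned stub is invoked.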

  module->AppendFuncExport("_time"
                          ,{{Type::I32}, {Type::I32}}
                          ,[=](const HostFunc* func, const FuncSignature* sig, const TypedValues& args, TypedValues& results) -> interp::Result {
                             auto t = static_cast<uint32_t>(time(nullptr));
                             auto addr = args.at(0).get_i32();
                             if (addr != 0) {
                                memcpy(memory->data.data() + addr, &t, sizeof(uint32_t));
                             }
                             results.at(0).set_i32(t);
                             return interp::Result::Ok;
                          });
}

Finally, the _time system call is created and made available as a host import. The host function to be invoked is defined inline using a lambda. The VM (interpreter and JIT) takes care of performing the transition from WebAssembly code to the host function when the latter is invoked.
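For readers less familiar with C++ lambdas, the logic of this host function can be sketched in plain C. This is only an illustration under stated assumptions: emulated_time and host_result_t are hypothetical names, linear memory is modelled as a plain byte array, and the bounds check is an addition that the code above does not have.

```c
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical result codes for this sketch. */
typedef enum { HOST_OK, HOST_TRAP_OOB } host_result_t;

/* Emulates time(): returns the current time through *result and, if addr is
 * non-zero, also stores it in linear memory at that offset. The bounds check
 * is an assumption added for this sketch. */
host_result_t emulated_time(uint8_t* memory, size_t memory_size,
                            uint32_t addr, uint32_t* result) {
    uint32_t now = (uint32_t)time(NULL);
    if (addr != 0) {
        /* Refuse to write past the end of linear memory. */
        if (memory_size < sizeof now || addr > memory_size - sizeof now) {
            return HOST_TRAP_OOB;
        }
        /* Like the original, this copies in host byte order. */
        memcpy(memory + addr, &now, sizeof now);
    }
    *result = now;
    return HOST_OK;
}
```

A zero address corresponds to the benchmark's time(NULL) calls, in which case nothing is written to memory and only the return value is used.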

The complete code for em-interp can be seen here: https://github.com/wasmjit-omr/em-interp/tree/initial.

Benchmark and Results

With the benchmark written and the system calls set up, all that was left was to run the benchmark.

My primary goal was to compare the performance of interpreted code vs. JIT compiled code. So first, I ran the Mandelbrot program with the --disable-jit option, which (as the name suggests) disables the JIT and forces the program to be interpreted. I then did the same thing again but without the --disable-jit option so that the benchmark code would be JIT compiled.

$ ./em-interp ../../micro-benchmarks/mandelbrot.wasm --disable-jit
_main() => i32:110
$ ./em-interp ../../micro-benchmarks/mandelbrot.wasm
_main() => i32:1

Right away, it was pretty evident that the JIT provides a significant performance boost. JIT compiled code was roughly two orders of magnitude faster than interpreted code!

To make things a little more scientific I did the comparison 10 more times. These were the results:

Interpreter (s)    JIT (s)
110                1
111                1
110                1
110                0
111                0
111                1
112                1
110                1
111                0
110                1

I did not do a particularly detailed analysis of the results because only having one-second precision makes it difficult to draw any reasonable conclusions from the data. For example, the zeros in the “JIT” column don’t mean that the JIT compiled code took zero seconds to execute, just that it took less than one second to execute (probably). The only thing we can say is that JIT compiled code was roughly two orders of magnitude faster than interpreted code, which is pretty cool!
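As a rough sanity check of the "two orders of magnitude" figure, the readings from the table can simply be totalled. The sketch below does that; the array and function names are made up for illustration, and the values are copied from the table above.

```c
#include <stddef.h>

/* Readings copied from the results table, in seconds. */
const int interp_seconds[] = {110, 111, 110, 110, 111, 111, 112, 110, 111, 110};
const int jit_seconds[]    = {1, 1, 1, 0, 0, 1, 1, 1, 0, 1};

/* Sum a list of whole-second readings. */
int sum_seconds(const int* samples, size_t count) {
    int total = 0;
    for (size_t i = 0; i < count; ++i) {
        total += samples[i];
    }
    return total;
}
```

The totals are 1106 seconds for the interpreter and 7 seconds for the JIT, a ratio of about 158. Since every JIT reading can be off by almost a full second, that ratio should be read as a ballpark that is consistent with "roughly two orders of magnitude" and nothing more precise.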

Conclusion

In this blog post, I described some work I did for measuring and comparing the performance of JIT compiled code vs interpreted code in wasmjit-omr.

I first had to create a Mandelbrot program in C that I then compiled to WebAssembly using Emscripten. I then created a tool based on wasmjit-omr that can run code generated by Emscripten. Finally, I ran the benchmark using this tool and found the JIT compiled code was roughly two orders of magnitude faster than interpreted code.

In the next blog post, I will discuss further performance measurements and comparisons made with higher precision, and take a closer look at what contributes to the improved performance.