A Guide to Double Map Arraylets

What are arraylets?

When an array, or any object, is created in a Java program, one of the places where the Garbage Collector (GC) stores it is the heap. An individual object is stored in a contiguous block of memory, which makes it easy to reference. Arrays are likewise normally stored in a contiguous region of memory, so in theory we can access any element of the array given its starting address in memory. However, that's not always the case for GC policies such as metronome and balanced [1][2]. Both of these policies are region based, meaning the heap associated with the running program is divided into regions, and objects are allocated inside those regions. An array object that is smaller than a region is allocated contiguously, which is the default allocation layout. On the other hand, if the size of an array object surpasses the size of a region, the array is allocated across multiple regions (most likely not adjacent ones) and represented as an arraylet (internally, contiguous arrays are still treated as arraylets). An arraylet has a spine, which contains the class pointer and size, and leaves, which contain the data associated with the array. The spine also contains arrayoids, which are pointers to the respective arraylet leaves. Figure 1 illustrates this layout.

Figure 1: Arraylet Representation (not to scale)
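To make this layout concrete, here is a small C sketch of how a discontiguous arraylet could be modeled. The field names and the access helper are hypothetical illustrations, not the actual OpenJ9 data structures.

#include <stddef.h>

/* Hypothetical model of a discontiguous arraylet: the spine holds the
   header and the arrayoids; the leaves hold the actual array data. */
typedef struct ArrayletSpine {
    void   *clazz;        /* class pointer */
    size_t  size;         /* number of elements in the array */
    double *arrayoids[];  /* pointers to the arraylet leaves */
} ArrayletSpine;

/* Element access goes through the spine: find the right leaf,
   then index into it. */
static inline double arrayletGet(ArrayletSpine *spine, size_t index,
                                 size_t elementsPerLeaf) {
    double *leaf = spine->arrayoids[index / elementsPerLeaf];
    return leaf[index % elementsPerLeaf];
}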

Furthermore, arraylets themselves have different layouts depending on the data size: contiguous, discontiguous, and hybrid. The first layout, shown in the next figure, is used when an array is sufficiently small to fit in a single region. Note that in the metronome and balanced policies all arrays are treated as arraylets, even those that fit in their entirety in one region. In this case there are no arrayoids or arraylet leaves, only the header and the array data. Figure 2 illustrates two arraylets that both have a contiguous layout.

Figure 2: Arraylet Contiguous Layout

The discontiguous layout, depicted in the next figure, is used when all of the array's data is stored in the arraylet leaves and no data is stored alongside the spine. Figure 3 illustrates two arraylets that both have a discontiguous layout.

Figure 3: Arraylet Discontiguous Layout

Lastly, the hybrid layout is a mixture of the contiguous and discontiguous layouts, where data is stored both in the leaves and in the spine after the arrayoids. Figure 4 illustrates two arraylets that both have a hybrid layout.

Figure 4: Arraylet Hybrid Layout

Note that in all these representations, all arraylet leaves are 100% full of array data. That will change once we introduce double mapping.

Double Mapping

The above representation of arraylets gives the GC the flexibility to allocate regions that are not necessarily adjacent to each other. To access the array's data in this case, we use arraylet leaf iterators, which hide the complexity of the arraylet data access implementation.

Arraylets were introduced so that arrays could be stored more cleverly in the heap under the balanced and metronome GC policies. However, they have a few drawbacks. One shows up when the Java Native Interface (JNI) [3] is used. In short, JNI allows Java programs to call native code (e.g. C, C++), giving the program more flexibility in the functions it can use. When large arrays and JNI critical sections [4] are used together, Java programs experience a slowdown.

A JNI critical section is used when the programmer wants direct addressability of an object. For that, JNI needs the object to be in a contiguous region of memory, or at least to appear to be. To make that happen, JNI critical creates a temporary array of the same size as the original array and copies the original into it, element by element. After the JNI critical section is done, it copies everything back, again element by element, from the temporary array to the arraylet. Yes, that does sound expensive, and it is. In one use case, the SPECjbb2015 benchmark [5] is not able to finish its RT curve building phase when the balanced GC policy is used. This has also become a problem for the JIT (Just-In-Time compiler) when it performs array copy optimizations.
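As a concrete illustration, here is roughly what a JNI critical section looks like using the standard JNI API (the class and method names are made up for this example). Without double mapping, the Get/Release pair below is exactly where the VM may have to copy a discontiguous arraylet to a temporary buffer and back.

#include <jni.h>

/* Hypothetical native method: sum the elements of a Java int[]. */
JNIEXPORT jlong JNICALL
Java_Example_sumArray(JNIEnv *env, jobject obj, jintArray arr) {
    jsize len = (*env)->GetArrayLength(env, arr);

    /* Pins the array if it is contiguous; otherwise the VM may copy it. */
    jint *data = (*env)->GetPrimitiveArrayCritical(env, arr, NULL);
    if (data == NULL) return 0; /* out of memory */

    jlong sum = 0;
    for (jsize i = 0; i < len; i++) {
        sum += data[i]; /* direct addressability: no per-element JNI calls */
    }

    /* JNI_ABORT: the data was not modified, so no copy-back is needed. */
    (*env)->ReleasePrimitiveArrayCritical(env, arr, data, JNI_ABORT);
    return sum;
}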

Double mapping of arraylets was introduced to solve these problems. The initial idea was to make a large array, which is divided into different chunks across the heap, look like a contiguous array. And that is what we did, but how does it actually work?

For double mapping to happen we leverage operating system virtualization, better known as virtual memory [7][8]. At a high level, the operating system uses virtual memory to abstract a program's storage: among other things, it hides memory fragmentation and where program variables are physically stored. Virtual memory also lets a program behave as if it has more physical memory than is actually available.

The virtual address space on 64-bit systems is very large, 64 bits worth in fact, compared to 32 bits on 32-bit systems. Physical memory, however, is limited to a much smaller size in most cases. Using more virtual address space does not really cost the application anything, but using more physical memory does. Therefore, instead of using more physical memory and performing expensive copies, we map two different virtual memory addresses to the same physical memory address. Think of it as a mirror of the original arraylet, where any modification through the newly mapped address is reflected in the original array data. The following figure depicts this architecture. Figure 5 captures the main advantage of double mapping explained earlier: original data is shown as red rectangles and data being modified as green rectangles. Whenever an element is modified through the contiguous region of memory (green), the change hits the shared physical memory (not shown), which is then reflected in the arraylet leaves stored in the heap; hence, there is no need to copy data over.

Figure 5: Double Mapping Example
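The mechanism itself can be demonstrated in a few lines of C on Linux. The following minimal sketch (not the OMR implementation; error checking omitted, and shm_open may require linking with -lrt) maps the same physical pages at two different virtual addresses and shows a write through one mapping appearing in the other.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

int main(void) {
    size_t size = 4096; /* one page */
    int fd = shm_open("/dm_example", O_CREAT | O_RDWR, 0600);
    shm_unlink("/dm_example"); /* the name is no longer needed */
    ftruncate(fd, size);

    /* Two distinct virtual addresses backed by the same physical memory. */
    char *first  = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    char *second = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    strcpy(first, "written through the first mapping");
    printf("second mapping reads: %s\n", second); /* same bytes */

    munmap(first, size);
    munmap(second, size);
    close(fd);
    return 0;
}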

With this additional way to represent arraylets, we no longer make use of the hybrid layout. Whenever the number of elements in an array is not a multiple of the region size, the remaining elements occupy an entire region (arraylet leaf), leaving the rest of that region's space unused; before, every leaf of an arraylet was 100% occupied with array data. Figure 5 shows this behavior, where the last arraylet leaf is roughly half full. Note that only the part containing arraylet data is double mapped.

Now, whenever JNI or the JIT uses arraylets, they no longer need to create a temporary array and copy elements into it one by one. The double mapping is created as soon as the array is instantiated in the Java program, so JNI and the JIT only need to reference the contiguous, double-mapped mirror of the original array. This saves a lot of time: the contiguous view of the array, whose data is chunked across its arraylet leaves, already exists and only needs to be referenced by JNI and/or the JIT.

Prior to implementation, we conducted some experiments to prove our initial concept. These results are shown in the "Preliminary Experiments and Results" section, where we also explain why, after analyzing the results and their effect on the existing code base, we decided to implement approach 1 for the Linux version. Even though Windows does not support calls to mmap, we were able to implement a version for it that follows the same principles as the Linux version.

Linux Experiments

After implementing the Linux version of double mapping, we wrote two simple Java programs to test its performance: one that creates several arrays of different sizes, and another that does the same thing but through JNI critical. The source code can be found in [6].

We forcibly set the heap size to 1 GB and the GC policy to balanced in order to analyze whether garbage collection would suffer from collecting objects constantly. This configuration creates a heap with 2,048 regions, each 512 KB in size. For this test we used a double array of 8,484,144 elements, which comprises 129 arraylet leaves in total. After the JNI critical section, we overwhelm the GC by allocating 2,000 arraylet leaves through several arrays (occupying 2,000 regions). We then iterate over a loop 100 times, where each iteration allocates and deallocates 30 arraylet leaves; by the end of the loop we have created and freed roughly 3,000 regions. We keep the 2,000 arraylet leaves alive throughout the loop, so the GC needs to constantly free regions for upcoming new arraylets.

We ran the program with and without double mapping, and in both cases the GC was able to collect all necessary objects. The time reported comprises only the time related to the 129-leaf array. These simple test programs showed that with double mapping, JNI critical performed its operation 26 times faster on average: without double mapping, JNI critical took an average of 90 milliseconds to complete, while with double mapping enabled it took 3.5 milliseconds. The above experiments were conducted on an Ubuntu machine with 8 virtual cores, 16 GB of RAM, and a 2,200.0 MHz clock speed.

Similar to Linux, we also conducted preliminary Windows experiments, which yielded better results when double mapping was used. We are still in the process of enabling double mapping on the Windows platform, but we are confident that it will yield results similar to the preliminary experiments.

Preliminary Experiments and Results

The preliminary experiments shown in this section were performed prior to the implementation of this new feature. Two approaches were studied for implementing it. Following is a brief, more technical description of the two.

Approach 1

  1. In order to simulate the GC heap, we use a shm_open call to reserve a shared memory space. After that, we call ftruncate to set the desired size to be allocated, say 256 MB of memory. Next, we call mmap on the file descriptor returned by shm_open to reserve the memory space at a location chosen by the OS (first parameter: NULL). For this heap mmap call we pass the PROT_READ and PROT_WRITE protection flags so we can write to the heap, and MAP_SHARED as the fourth parameter; note that we cannot use MAP_PRIVATE here, because if we do, double mapping fails. A better way to implement this would be to make the entire heap MAP_PRIVATE and share only those regions where the arraylets are located; for simplicity we share the entire heap.
  2. Next, we simulate arraylets by randomly picking locations in the heap and storing values spanning two pages each. So, if the system page size is 4 KB, we store 8 KB for each arraylet (this size is configurable and does not have to be a multiple of the page size). At this point we have our representation of the heap, with the arraylets scattered across it.
  3. In this step we create/reserve a contiguous memory space into which to double map the arraylets, pre-calculating the total size of all arraylets in order to reserve such a space. We use an anonymous mmap with the PROT_READ, PROT_WRITE, and MAP_PRIVATE flags (since our heap is already shared, we can pass MAP_PRIVATE here).
  4. Lastly, we make one mmap call per arraylet, because a single mmap call would not work for all of them: mmap always maps a contiguous range. In this approach only one file descriptor is used (unlike approach 2), the one that refers to the heap. Each mmap call is passed this file descriptor, created in step 1, along with the arraylet's offset into the heap (see the sketch after this list).
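The four steps above can be condensed into the following sketch (illustrative names, sizes, and offsets; error checking omitted; the actual experiment code is linked at the end of this section).

#include <stdio.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

#define HEAP_SIZE  (256 * 1024 * 1024) /* 256 MB simulated heap */
#define LEAF_SIZE  (2 * 4096)          /* two pages per arraylet */
#define LEAF_COUNT 4

int main(void) {
    /* Step 1: one shared-memory fd backs the entire heap. */
    int fd = shm_open("/heap_example", O_CREAT | O_RDWR, 0600);
    shm_unlink("/heap_example");
    ftruncate(fd, HEAP_SIZE);
    char *heap = mmap(NULL, HEAP_SIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);

    /* Step 2: pretend these are the heap offsets where the arraylets live. */
    off_t offsets[LEAF_COUNT] = {
        3 * LEAF_SIZE, 40 * LEAF_SIZE, 7 * LEAF_SIZE, 100 * LEAF_SIZE
    };

    /* Step 3: reserve a contiguous block big enough for all arraylets. */
    char *contig = mmap(NULL, LEAF_COUNT * LEAF_SIZE,
                        PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    /* Step 4: remap each arraylet, in order, into the contiguous block;
       MAP_FIXED overwrites the reservation made in step 3. */
    for (int i = 0; i < LEAF_COUNT; i++) {
        mmap(contig + i * LEAF_SIZE, LEAF_SIZE, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_FIXED, fd, offsets[i]);
    }

    /* A write through the contiguous view is visible in the heap. */
    contig[0] = 42;
    printf("heap sees: %d\n", heap[offsets[0]]); /* prints 42 */
    close(fd);
    return 0;
}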

Approach 2

  1. Unlike approach 1, to simulate the heap we only make a call to mmap, without using shm_open. In this case we do not use a file descriptor, and instead of passing PROT_READ and PROT_WRITE we pass PROT_NONE to indicate that nothing can be read, written, or executed in the heap. We just “reserve” a region of memory so that later we can allocate proper regions for storing data (in this case, the arraylets).
  2. To allocate the arraylets we use one file descriptor for each of them: we call shm_open, then ftruncate with the size of the arraylet (e.g. 8 KB), and finally mmap with the flags PROT_READ, PROT_WRITE, MAP_SHARED, and MAP_FIXED, because we want to place each arraylet at a specific (random) location in the heap. With this approach only the arraylet space can be read from or written to, while the rest of the heap stays protected. Furthermore, if there are 500 arraylets, this approach requires 500 file descriptors. This is a bottleneck, because systems have a hard limit on how many file descriptors can be open at a time; e.g. the system we used had a limit of 253.
  3. This step is exactly the same as approach 1’s step 3, where we create/reserve a contiguous block of memory into which to double map the arraylets from the heap.
  4. Lastly, unlike the previous approach, where we used each arraylet’s offset into the heap, in this approach we use each arraylet’s file descriptor to mmap it into the contiguous block of memory created in step 3. Therefore, instead of passing the heap’s file descriptor and an offset, we pass addresses within the contiguous block along with the arraylet file descriptors. The end result is the same as before: the double mapping succeeds (see the sketch after this list).
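The sketch below condenses approach 2's key differences for a single arraylet (again with illustrative names and no error checking): a PROT_NONE heap reservation and one file descriptor per arraylet.

#include <stdio.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

#define HEAP_SIZE (256 * 1024 * 1024)
#define LEAF_SIZE (2 * 4096)

int main(void) {
    /* Step 1: reserve the heap; nothing is readable or writable yet. */
    char *heap = mmap(NULL, HEAP_SIZE, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    /* Step 2: one fd per arraylet, placed at a fixed heap location;
       only the arraylet's pages become readable/writable. */
    int leafFd = shm_open("/leaf_0", O_CREAT | O_RDWR, 0600);
    shm_unlink("/leaf_0");
    ftruncate(leafFd, LEAF_SIZE);
    char *leaf = mmap(heap + 5 * LEAF_SIZE, LEAF_SIZE,
                      PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_FIXED, leafFd, 0);

    /* Steps 3 and 4: the same fd also backs the arraylet's slot in the
       contiguous block, so no heap offsets are needed. */
    char *contig = mmap(NULL, LEAF_SIZE, PROT_READ | PROT_WRITE,
                        MAP_SHARED, leafFd, 0);

    leaf[0] = 7;
    printf("contiguous view sees: %d\n", contig[0]); /* prints 7 */
    close(leafFd);
    return 0;
}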
Cons of approach 1:
  • It associates a file descriptor with the entire heap; as a result, the entire heap must be one large chunk of shared memory.
Cons of approach 2:
  • There is a hard limit on the number of file descriptors that can be open at a time.

We decided to implement approach 1 for OMR and OpenJ9. Had we chosen approach 2, we would have needed several hacks to work around the hard limit on the file descriptor count.

Source code can be found at: https://github.com/bragaigor/Arraylets

Experiments / Benchmark

One iteration corresponds to creating/reserving the contiguous block of memory, double mapping each arraylet into this chunk of memory, and modifying it. The results also include the time taken to free the addresses reserved by shm_open/mmap in each iteration.

In the case of the GC simulation, an iteration corresponds to copying the arraylets into a separate array (index by index, or through malloc followed by a memcpy call), modifying this array, and then copying its contents back to the original arraylet locations in the heap. For experiments 1 through 4, the heap simulation does not require malloc, because the stack is large enough to accommodate all arraylets; for consistency, however, we used malloc throughout the experiments.

Approaches 1 and 2 use munmap to free the addresses allocated by shm_open/mmap, while the heap simulations use free to release the allocated memory.
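For reference, one copy-based (heap simulation) iteration looks roughly like this sketch; the names are illustrative and error checking is omitted.

#include <stdlib.h>
#include <string.h>

/* Copy every arraylet into a contiguous temporary, modify it, and
   copy everything back; this is the work double mapping avoids. */
void copyBasedIteration(char **leaves, size_t leafCount, size_t leafSize) {
    char *temp = malloc(leafCount * leafSize);

    for (size_t i = 0; i < leafCount; i++) /* copy in */
        memcpy(temp + i * leafSize, leaves[i], leafSize);

    temp[0] ^= 1; /* stand-in for "modify the array" */

    for (size_t i = 0; i < leafCount; i++) /* copy back */
        memcpy(leaves[i], temp + i * leafSize, leafSize);

    free(temp);
}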

System configuration
  • MacBook Pro (13-inch, Early 2015)
  • Processor: 2.7 GHz Intel Core i5
  • RAM: 8GB
  • Graphics: Intel Iris Graphics 6100 1536 MB


  • Experiments 1 through 4 use a heap size of 256 MB
  • Experiment 5 uses a heap size of 1 GB
  • Experiment 6 uses a heap size of 4 GB
  • Experiment 7 uses a heap size of 16 GB
  • Experiment 8 uses a heap size of 64 GB

Experiment 1

  • Low number of arraylets: 16
  • Small arraylet size: 8 KB
  • Heap size: 256 MB
  • Arraylets total size: 128 KB

Figure: Experiment 1 results (see Legend 1)

Experiment 2

  • High number of arraylets: 64
  • Medium arraylet size: 64 KB
  • Heap size: 256 MB
  • Arraylets total size: 4 MB

Figure: Experiment 2 results

Experiment 3

  • Low number of arraylets: 12
  • Large arraylet size: 256 KB
  • Heap size: 256 MB
  • Arraylets total size: 3 MB

Figure: Experiment 3 results

Experiment 4

  • High number of arraylets: 128
  • Small arraylet size: 16 KB
  • Heap size: 256 MB
  • Arraylets total size: 2 MB

Figure: Experiment 4 results (see Legend 1)

Experiment 5

  • Medium number of arraylets: 32
  • Large arraylet size: 1 MB
  • Heap size: 1 GB
  • Arraylets total size: 32 MB

Figure: Experiment 5 results (see Legend 2)

Experiment 6

  • Low number of arraylets: 16
  • Very large arraylet size: 64 MB
  • Heap size: 4 GB
  • Arraylets total size: 1 GB

Figure: Experiment 6 results

Experiment 7

  • Medium number of arraylets: 32
  • Very large arraylet size: 128 MB
  • Heap size: 16 GB
  • Arraylets total size: 4 GB

Figure: Experiment 7 results

Experiment 8

  • Medium number of arraylets: 48
  • Very large arraylet size: 256 MB
  • Heap size: 64 GB
  • Arraylets total size: 12 GB

Figure: Experiment 8 results (see Legend 2)


Conclusion and Next Steps

In this post we introduced arraylets: we explained what they are and how the balanced and metronome GC policies benefit from them. Additionally, we introduced a new feature called double mapping that improves JNI critical array access. Experiments showed a considerable improvement when using double mapping; around 26 times faster than without it, to be more precise. Going forward, we will finish the Windows implementation and add this feature to macOS and AIX as well. Lastly, we will leverage this new feature in the JIT, because we believe the JIT can take advantage of double mapping to optimize array operations.


References

[1] OpenJ9 GC -Xgcpolicy, URL: https://www.ibm.com/support/knowledgecenter/en/SSYKE2_8.0.0/openj9/xgcpolicy/index.html

[2] Balanced Garbage Collection Policy, URL: https://www.ibm.com/support/knowledgecenter/en/SSYKE2_8.0.0/com.ibm.java.vm.80.doc/docs/mm_gc_balanced.html

[3] S. Stricker, Java programming with JNI, URL: https://www.ibm.com/developerworks/java/tutorials/j-jni/j-jni.html

[4] Copying and Pinning, URL: https://www.ibm.com/support/knowledgecenter/en/SSYKE2_7.0.0/com.ibm.java.lnx.70.doc/diag/understanding/jni_copypin.html

[5] Standard Performance Evaluation Corporation, URL: https://www.spec.org/jbb2015/

[6] Arraylet JNI Java source code, URL: https://github.com/bragaigor/ArrayletTests

[7] What are the differences between virtual memory and physical memory?, URL: https://stackoverflow.com/questions/14347206/what-are-the-differences-between-virtual-memory-and-physical-memory

[8] Virtual Memory, URL: https://www.student.cs.uwaterloo.ca/~cs350/F16/notes/vm-2up.pdf

Investigating the performance of wasmjit-omr, Part 1

Introduction

As a final year university project, a few classmates and I created a JIT compiler for WebAssembly called wasmjit-omr. Our goal was to create a simple JIT compiler capable of accelerating most WebAssembly workloads.

Naturally, we wrote the JIT compiler using Eclipse OMR JitBuilder. However, a JIT compiler by itself isn’t that useful, as it usually runs inside a virtual machine (VM). To avoid having to create a full WebAssembly VM, we decided to integrate our JIT into the interpreter provided by the WebAssembly Binary Toolkit project, a.k.a. WABT.

After the project ended, we continued to work on it on our own time, with most of the work going towards improving performance.

With a few improvements intended to provide a significant performance boost to wasmjit-omr completed (these will be further discussed in a later post), it was time to validate our assumptions and measure the performance of our JIT.

In this post, the first in a series, I will discuss the process I went through to get an initial insight into the performance improvement our JIT compiler provides over the interpreter. First, I will explain how I implemented the benchmark used to measure performance. Then, I will describe some of the tooling I had to implement to run the benchmark: a wasmjit-omr based VM that provides the system calls needed to execute WebAssembly code generated by Emscripten. Finally, I will show and discuss the results of the performance measurements I did.

Creating a Benchmark

The purpose of a JIT compiler is to make programs written in interpreted languages run faster. So, for the initial performance measurements, I focused on measuring the speed up the JIT compiler provides compared to the interpreter alone.

The benchmark I used is a Mandelbrot program because it’s simple, relatively computationally intensive, and frequently used to benchmark language implementations.

To avoid having to write the benchmark in WebAssembly directly (which is doable but unnecessarily tedious), I wrote the Mandelbrot program in C and transpiled it to WebAssembly using Emscripten.
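For the record, an invocation along these lines is all that's needed; this is just the general shape, and the exact flags used are in the benchmark repository linked below.

$ emcc -O3 mandelbrot.c -o mandelbrot.js

Emscripten emits the mandelbrot.wasm module alongside the JavaScript loader. The -O3 flag also matters here because, as discussed later, it strips unused host imports.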

For technical reasons, one property of the wasmjit-omr JIT compiler is that it never compiles the top-level function (main or start function). So, to ensure the JIT compiles the Mandelbrot code, the code must be placed in a separate function. Also, annotating the function with __attribute__((noinline)) prevents Emscripten from inlining it into the main function.

typedef double FP_t;

__attribute__((noinline))
void mandelbrot(int x_pixels, int y_pixels, int* out_table, int max_iterations) {
    FP_t x_scale_factor = 3.5 / (FP_t) x_pixels;
    FP_t y_scale_factor = 2.0 / (FP_t) y_pixels;

    for (int py = 0; py < y_pixels; ++py) {
        for (int px = 0; px < x_pixels; ++px) {
            FP_t x0 = px * x_scale_factor - 2.5;
            FP_t y0 = py * y_scale_factor - 1.0;

            FP_t x = 0.0;
            FP_t y = 0.0;
            int iteration = 0;
            while (x*x + y*y <= 2*2 && iteration < max_iterations) {
                FP_t x_temp = x*x - y*y + x0;
                y = 2*x*y + y0;
                x = x_temp;
                iteration += 1;
            }

            out_table[py*x_pixels + px] = iteration;
        }
    }
}

Since we’re only interested in the time it takes for the Mandelbrot code to execute (and not the VM startup or JIT compilation time), I surrounded a call to mandelbrot() with calls to C’s time() function, which returns the current time in seconds.

int main(void) {
    // ...

    int table[34][80] = {0};
    time_t time_diff = time(NULL);
    mandelbrot(80, 34, (int*)table, 300000);
    time_diff = time(NULL) - time_diff;

    // ..
}

One drawback of using time() is that measurements will only have one-second precision. However, for reasons that I will explain shortly, in this case using time() is a convenient way of measuring time with relatively little effort. Even so, as a first ballpark comparison between interpreted code and JIT compiled code, one-second precision won’t be a problem.

Also, to make sure that the compilation time of mandelbrot() is not measured, I added a dummy call to it before the first call to time(). Because, by default, wasmjit-omr compiles a function just before its first call, using a dummy call ensures compilation has completed by the time the benchmarked code runs.

int main(void) {
    int small_table [3][4] = {0};
    mandelbrot(4, 3, (int*)small_table, 10);

    int table[34][80] = {0};
    time_t time_diff = time(NULL);
    mandelbrot(80, 34, (int*)table, 300000);
    time_diff = time(NULL) - time_diff;

    // ..
}

To report the execution time, we would ideally print it to the screen. However, printing to the screen requires significant support from the VM and would take time to implement.

Instead, I decided to make the execution time the return value of the main() function. Given that the time is in seconds, the range of the return value should be just enough to get meaningful results from the benchmark.

int main(void) {
    // force JIT compilation
    int small_table [3][4] = {0};
    mandelbrot(4, 3, (int*)small_table, 10);

    // run and time the benchmark
    int table[34][80] = {0};
    time_t time_diff = time(NULL);
    mandelbrot(80, 34, (int*)table, 300000);
    time_diff = time(NULL) - time_diff;

    return time_diff;
}

The complete code for this benchmark and build instructions can be seen here: https://github.com/wasmjit-omr/micro-benchmarks/tree/post1.

Dealing with System Calls

In most systems, tasks such as getting the current time and printing text to the screen are generally done via system calls; special calls that are handled by the operating system. Most programming languages hide this behind more convenient interfaces like standard libraries. However, the implementation of these eventually relies on making a system call.

WebAssembly does not support making system calls directly. Instead, it allows VMs to provide “host functions” that are callable from WebAssembly code. Compilers like Emscripten take advantage of this by emitting “host calls” where a regular C compiler would have emitted a system call.

By design, the only system call required to run the Mandelbrot benchmark is _time, which is used by the time() function. Other unused/uncalled host imports are removed by Emscripten when using the -O3 flag (-Os also does this).

Implementing the System Calls

By default, the WABT interpreter does not provide any host functions. Since WABT is designed to be embedded, it is the embedder’s responsibility to define the host functions and other imports they want to expose to WebAssembly code. WABT’s wasm-interp tool shows an example of how this is done.

To execute the Mandelbrot benchmark with wasmjit-omr, I needed a custom tool (or “embedder”) that provides the host functions that handle system calls. I decided to call this tool em-interp (the em- is for EMscripten).

To implement this tool, I started by simply copying the wasm-interp source code, which is just one file.

Next, I created a function that takes a pointer to a wabt::Environment object and that registers a new module in it.

void AppendEmscriptenModule(wabt::interp::Environment* env);

The new function is called from the environment initialization routine. Creating a separate function decouples the definition and implementation of the host module (its functions, memories, tables, constants, etc.) from the core tool implementation.

static void InitEnvironment(Environment* env) {
  AppendEmscriptenModule(env);
  // ...
}

WebAssembly code generated by Emscripten imports system calls from a host module called env. env is also used to import other components, like linear memory, tables, and a few constants initialized by the host VM. In addition to the _time system call, the Mandelbrot code only needs a linear memory. As such, these are the only components that I implemented in AppendEmscriptenModule.

void AppendEmscriptenModule(Environment* env) {
  HostModule* module = env->AppendHostModule("env");

  Memory* memory = nullptr;
  Index memIndex = 0;
  std::tie(memory, memIndex) = module->AppendMemoryExport("memory", Limits{256, 256});

The function first creates a host module called env and adds to it a linear memory with the appropriate size limits.

module->on_unknown_func_export =
      [](Environment* env, HostModule* host_module, string_view name, Index sig_index) -> Index {
          printf("Importing unimplemented host function '%s' from '%s' module\n", name.to_string().c_str(), host_module->name.c_str());
          auto name_s = name.to_string(); // cached copy of name to avoid reading bad values from string_view
          auto callback = host_module->AppendFuncExport(
                  name,
                  sig_index,
                  [=](const HostFunc* func, const FuncSignature* sig, const TypedValues& args, TypedValues& results) -> interp::Result {
                      printf("Call to unimplemented host function '%s' from '%s' module!!!\n", name_s.c_str(), host_module->name.c_str());
                      return interp::Result::TrapUnreachable;
                  });
          return callback.second;
      };

The next part of the function may be a bit confusing (a lambda within a lambda, whaaaat?) but what it does is fairly straightforward. If a WebAssembly module attempts to import a function from env that isn’t provided, this code will instead provide a “dummy” function that immediately traps. That way, as long as the unimplemented function is never called, the WebAssembly code will work correctly without the VM having to provide all of its imported functions.

  module->AppendFuncExport("_time"
                          ,{{Type::I32}, {Type::I32}}
                          ,[=](const HostFunc* func, const FuncSignature* sig, const TypedValues& args, TypedValues& results) -> interp::Result {
                             auto t = static_cast<uint32_t>(time(nullptr));
                             auto addr = args.at(0).get_i32();
                             if (addr != 0) {
                                memcpy(memory->data.data() + addr, &t, sizeof(uint32_t));
                             }
                             results.at(0).set_i32(t);
                             return interp::Result::Ok;
                          });
}

Finally, the _time system call is created and made available as a host import. The host function to be invoked is defined inline using a lambda. The VM (interpreter and JIT) takes care of performing the transition from WebAssembly code to the host function when the latter is invoked.

The complete code for em-interp can be seen here: https://github.com/wasmjit-omr/em-interp/tree/initial.

Benchmark and Results

With the benchmark written and the system calls set up, all that was left was to run the benchmark.

My primary goal was to compare the performance of interpreted code vs. JIT compiled code. So first, I ran the Mandelbrot program with the --disable-jit option, which (as the name suggests) disables the JIT and forces the program to be interpreted. I then did the same thing again but without the --disable-jit option so that the benchmark code would be JIT compiled.

$ ./em-interp ../../micro-benchmarks/mandelbrot.wasm --disable-jit
_main() => i32:110
$ ./em-interp ../../micro-benchmarks/mandelbrot.wasm
_main() => i32:1

Right away, it was pretty evident that the JIT provides a significant performance boost. JIT compiled code was roughly two orders of magnitude faster than interpreted code!

To make things a little more scientific I did the comparison 10 more times. These were the results:

Interpreter (s) JIT (s)
110 1
111 1
110 1
110 0
111 0
111 1
112 1
110 1
111 0
110 1

I did not do a particularly detailed analysis of the results because only having one-second precision makes it difficult to draw any reasonable conclusions from the data. For example, the zeros in the “JIT” column don’t mean that the JIT compiled code took zero seconds to execute, just that it took less than one second to execute (probably). The only thing we can say is that JIT compiled code was roughly two orders of magnitude faster than interpreted code, which is pretty cool!

Conclusion

In this blog post, I described some work I did for measuring and comparing the performance of JIT compiled code vs interpreted code in wasmjit-omr.

I first had to create a Mandelbrot program in C that I then compiled to WebAssembly using Emscripten. I then created a tool based on wasmjit-omr that can run code generated by Emscripten. Finally, I ran the benchmark using this tool and found the JIT compiled code was roughly two orders of magnitude faster than interpreted code.

In the next blog post, I will discuss further performance measurements and comparison with higher precision and a closer look at what contributes to the improved performance.