Simple VM for dynamic languages

A few weeks ago, I've started working on a lisp interpreter. I already did a few implementations of lisp in different languages, but those were mostly just recursive evaluators. This time, it's a bit more serious.

Instead of writing the implementation top-to-bottom, I started with a virtual machine. Virtual machines are used to execute most of the scripting languages, since bytecode is more compact and faster to evaluate. Compared to a tree-walker, bytecode VMs are better for branch prediction and friendlier to the CPU cache.

Because the VM is very barebones at the moment, and no language is written on top of it, I created a very simple assembly language. Here's an example of computing a factorial function in it:

li r1, 10
li r2, 1
li r0, 1

factorial:
    mul r0, r0, r2
    addi r2, r2, 1
    jle r2, r1, factorial

This code computes 10! and returns it in the r0 register.

The architecture of the VM is "load/store", meaning that computation (addition, multiplication, conditions, etc...) can only be performed on registers. Data can be loaded from the memory to registers with separate instructions. This contrasts a bit with how some VMs are implemented: for some reason many of them don't use registers at all, and instead rely only on the stack. For example, this is what a factorial function would look like in Python bytecode:

  2     >>    0 LOAD_FAST                0 (N)
              3 LOAD_CONST               1 (1)
              6 COMPARE_OP               2 (==)
              9 POP_JUMP_IF_FALSE       16

  3          12 LOAD_FAST               1 (result)
             15 RETURN_VALUE

  4     >>   16 LOAD_FAST                0 (N)
             19 LOAD_CONST               1 (1)
             22 BINARY_SUBTRACT
             23 LOAD_FAST                0 (N)
             26 LOAD_FAST                1 (result)
             29 BINARY_MULTIPLY
             30 STORE_FAST               1 (result)
             33 STORE_FAST               0 (N)
             36 JUMP_ABSOLUTE            0

If you look carefully, you'd notice that there are no registers here. This is because all operations that write something, usually do so to the top of the stack.

There are a few problems I find with the stack machines:

It is unnatural to read the disassembly (you have to keep track of changing stack offsets instead of register names)
Either instructions waste space, or we deal with variable-widths instructions (as is the case for Python)
Some potential for optimizations is wasted

In my personal opinion, a good virtual machine for a dynamic language should be also a suitable target for compiling regular expression state machines.

So instead, I opted for a more traditional approach, that is similar to a RISC CPU:

32-bit constant-width instructions
flexible stack
32 registers, most of which are general-purpose, except for frame pointer/instruction pointer/etc...
64-bit width for both register and stack entries

The only twist that I've added compared to the "normal" CPUs is register and stack tagging. With physical processors, it is often the case that software is written in strongly-typed languages, where data types are known during compile time, and thus the compiler can generate specific instructions for handling, say, int32 vs int64.

Consider the following code:

mul r3, r2, r1

It is essentially equivalent to r3 = r2 * r1. But what types do r1 and r2 have? Well, in case of my virtual machine, registers and stack entries "know" their type. So if you attempt to multiply an int32 with uint64, you'd get standard type promotion, and the result would be tagged as uint64. Because of this, you don't have to perform checks on the bytecode level.

So, how fast is this approach? In my preliminary tests, a simple loop from 1 to 100 million, with a multiplication inside, takes 0.7 seconds to complete. Which is plenty fast, considering that the VM implementation is naive and has never been seriously optimized.

The code for the experimental version can be found here.