VM progress update: data segment and constants

Today I'd like to introduce another addition to the virtual machine I'm working on: the data segment.

Let's first take a look at why this is necessary. Here's a piece of assembly code that multiplies 2 by 2:

; Load value '2' into register r0
li r0, 2
; Multiply register r0 by 2 and
; put result back to r0
mul r0, r0, 2

The instruction here that loads a constant into a register is li, which stands for "load immediate". The value that it loads is encoded as part of the instruction itself. Since an instruction is 32 bits wide, the instruction code takes 7 bits, and the register number takes 4 bits, that leaves 21 bits for the value. Interpreted as a signed integer, 21 bits is only enough to encode values between -1048576 and 1048575.
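To make the arithmetic concrete, here is a small Python sketch of how such an instruction could be packed and unpacked. The field order (instruction code in the top bits, then the register, then the value), the opcode number 0x01, and the names encode_li and decode_li are assumptions made for illustration; they are not the VM's actual layout.

```python
# Field widths taken from the text: 32-bit instruction, 7-bit code, 4-bit register.
OPCODE_BITS = 7
REGISTER_BITS = 4
IMMEDIATE_BITS = 32 - OPCODE_BITS - REGISTER_BITS  # 21 bits left for the value

IMMEDIATE_MASK = (1 << IMMEDIATE_BITS) - 1
IMM_MIN = -(1 << (IMMEDIATE_BITS - 1))     # -1048576
IMM_MAX = (1 << (IMMEDIATE_BITS - 1)) - 1  # 1048575

LI_OPCODE = 0x01  # hypothetical opcode number, chosen only for this sketch


def encode_li(register, value):
    """Pack 'li register, value' into one 32-bit word (assumed field order)."""
    if not 0 <= register < (1 << REGISTER_BITS):
        raise ValueError("register number does not fit in 4 bits")
    if not IMM_MIN <= value <= IMM_MAX:
        raise ValueError("value does not fit in a signed 21-bit immediate")
    immediate = value & IMMEDIATE_MASK  # two's complement, truncated to 21 bits
    return (
        (LI_OPCODE << (REGISTER_BITS + IMMEDIATE_BITS))
        | (register << IMMEDIATE_BITS)
        | immediate
    )


def decode_li(word):
    """Unpack a word produced by encode_li into (opcode, register, value)."""
    opcode = word >> (REGISTER_BITS + IMMEDIATE_BITS)
    register = (word >> IMMEDIATE_BITS) & ((1 << REGISTER_BITS) - 1)
    value = word & IMMEDIATE_MASK
    if value & (1 << (IMMEDIATE_BITS - 1)):  # sign bit set: value is negative
        value -= 1 << IMMEDIATE_BITS
    return opcode, register, value
```

With this layout, encode_li(0, 2) round-trips through decode_li, while encode_li(0, 1048576) is rejected because the value is one past the largest representable immediate.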

Such a narrow range is fine in practice for a lot of code, for example code that deals with offsets. Larger values do occur, but most of the values in your code are likely to be quite small (think loop counters).

But if you have a larger value, you need to be able to represent it somehow. One way to do that is through bitwise operations: load 21 bits at a time, and then bit-shift and bit-or until you get the desired value. This works for integers and doesn't require adding any new kind of instruction. However, it takes several instructions to do the work that a single instruction should accomplish.
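Here is a rough Python sketch of that approach, just to show how the step count grows. It assumes each 21-bit chunk is treated as an unsigned field and that a shift and an or each cost one instruction; the helper names are made up for this example.

```python
CHUNK_BITS = 21
CHUNK_MASK = (1 << CHUNK_BITS) - 1


def split_into_chunks(value, total_bits=64):
    """Split an unsigned value into 21-bit chunks, most significant first."""
    count = -(-total_bits // CHUNK_BITS)  # ceiling division
    return [(value >> (i * CHUNK_BITS)) & CHUNK_MASK for i in reversed(range(count))]


def rebuild(chunks):
    """Mimic one load followed by repeated shift-left and bitwise-or.

    Returns the rebuilt value and the number of simulated instructions.
    """
    value = chunks[0]  # one load of the top chunk
    steps = 1
    for chunk in chunks[1:]:
        value = (value << CHUNK_BITS) | chunk  # one shift plus one or
        steps += 2
    return value, steps
```

Under these assumptions a 64-bit value splits into four chunks, so rebuilding it takes seven simulated instructions instead of one.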

Another problem with just using bit-shifts is that you can't easily encode strings and other more complex data structures this way. Any more complex constant will require executing lots of sequential instructions to reconstruct a particular data type.

For this reason, regular executables contain both a code section and a data section. When the kernel loads an executable into memory, the sections get mapped at predictable addresses, and the code can load a constant either from an absolute address or from an offset relative to the instruction pointer.

For my virtual machine, I don't want to mix code and data in one logical memory block. I would like to minimize the chance of memory corruption, so arbitrary memory access in the VM is not possible. What I have instead is a representation of code as a pair of values: an array of instructions and an array of constants. This pair can be predictably serialized to a file and loaded into memory later.

To make it possible to load constants from the constant array, I've added a new instruction called loadc, which takes a register and an index into the array. When executed, the instruction loads the value of the constant at that index into the specified register.

Here's what its usage looks like (note the u8 suffix is just a way to tell the compiler that this is an 8-bit unsigned integer):

loadc r0, 255u8
loadc r1, 2u8
add r0, r0, r1 ; r0 now contains "1" (255 + 2 wraps around in 8 bits)
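To show how the instruction array and the constant array work together at run time, here is a minimal interpreter sketch in Python. It is not the VM's real implementation: the tuple-based instruction format, the run function, and the 8-bit wrap-around in add are assumptions chosen to match the example above.

```python
def run(instructions, constants, register_count=16):
    """Execute a tiny program given as (instructions, constants)."""
    registers = [0] * register_count  # 4-bit register numbers allow 16 registers
    for op, *args in instructions:
        if op == "loadc":
            dst, index = args
            # The only data access is a bounds-checked read of the constant array.
            if not 0 <= index < len(constants):
                raise IndexError("constant index out of range")
            registers[dst] = constants[index]
        elif op == "add":
            dst, lhs, rhs = args
            # Assumption: u8 arithmetic wraps around at 8 bits.
            registers[dst] = (registers[lhs] + registers[rhs]) & 0xFF
        else:
            raise ValueError(f"unknown instruction: {op}")
    return registers


# The example above, after the constants have been moved into the array.
constants = [255, 2]
instructions = [("loadc", 0, 0), ("loadc", 1, 1), ("add", 0, 0, 1)]
registers = run(instructions, constants)
```

Because loadc can only index into the constant array, a bad index fails with an error instead of reading unrelated memory.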

Also note that when writing the assembly code, you don't need to fill in the constant array yourself. The compiler does that for you. You just specify the constant as the second argument, and during compilation the compiler moves the constant into the array and replaces the loadc instruction with an opcode that carries the correct index.
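That compilation step could look roughly like the following Python sketch. The function name lower_constants and the tuple-based program format are hypothetical, and reusing the index of a repeated constant is my own assumption rather than something the compiler is known to do.

```python
def lower_constants(program):
    """Move loadc literals into a constant array and replace them with indices.

    Input instructions look like ("loadc", register, literal); the result is a
    pair (instructions, constants) in which loadc carries an index instead.
    """
    constants = []
    index_of = {}  # literal -> position in the constant array
    instructions = []
    for op, *args in program:
        if op == "loadc":
            register, literal = args
            if literal not in index_of:  # assumption: repeated constants share a slot
                index_of[literal] = len(constants)
                constants.append(literal)
            instructions.append(("loadc", register, index_of[literal]))
        else:
            instructions.append((op, *args))  # other instructions pass through
    return instructions, constants


source = [("loadc", 0, 255), ("loadc", 1, 2), ("loadc", 2, 255), ("add", 0, 0, 1)]
instructions, constants = lower_constants(source)
```

For this input the constant array ends up as [255, 2], and both loads of 255 refer to index 0.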

Right now there is only support for integers, but adding other data types should be relatively easy.