VM progress update: flexible arena, ASAN and gcov

This is another post in the series where I talk about what's new in the virtual machine I'm working on.

Today let's discuss memory allocation. In a typical garbage-collected language, the memory for language objects is taken from a "pool". Usually, the pool is a contiguous region pre-allocated with malloc(), and while there is still space in it, specialized allocator logic places objects there. Once the pool is completely filled, or a certain allocation threshold is reached, the garbage collector is called. The collector then traverses the known "roots", and everything not reachable through those roots is discarded.

In high-performance languages with a garbage collector there are, of course, multiple layers of optimization. In my case, performance is not an explicit goal (at least not yet). So I initially went with two pools, where one is used for allocations and the other is always vacant. As soon as the active pool is fully occupied, the garbage collector copies all reachable objects to the vacant one and makes that pool the primary one. This approach worked well until I hit the C interop problems.
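To make the two-pool scheme concrete, here is a minimal sketch of it in C. It is not the actual VM code: the names (heap_t, obj_t, heap_alloc and so on) are made up for illustration, and I assume every object has the same fixed shape with two reference fields, which keeps the copying loop short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical fixed-shape object: two references plus a payload. */
typedef struct obj {
    struct obj *forward;   /* set once the object has been copied */
    struct obj *fields[2]; /* references to other objects, or NULL */
    int payload;
} obj_t;

typedef struct {
    obj_t *from, *to;  /* active pool and vacant pool */
    size_t cap, used;  /* capacity and fill level, counted in objects */
} heap_t;

static int heap_init(heap_t *h, size_t cap) {
    h->from = malloc(cap * sizeof(obj_t));
    h->to = malloc(cap * sizeof(obj_t));
    h->cap = cap;
    h->used = 0;
    return h->from != NULL && h->to != NULL;
}

/* Copy one object into the vacant pool unless it was already moved. */
static obj_t *copy_obj(heap_t *h, obj_t *o, size_t *next) {
    if (o == NULL) return NULL;
    if (o->forward != NULL) return o->forward;
    obj_t *dst = &h->to[(*next)++];
    *dst = *o;
    o->forward = dst; /* leave a forwarding address in the old copy */
    return dst;
}

/* Copy everything reachable from the roots, then swap the pools. */
static void heap_collect(heap_t *h, obj_t **roots, size_t nroots) {
    size_t next = 0;
    for (size_t i = 0; i < nroots; i++)
        roots[i] = copy_obj(h, roots[i], &next);
    /* Scan the copied objects and pull in whatever they reference. */
    for (size_t scan = 0; scan < next; scan++)
        for (int f = 0; f < 2; f++)
            h->to[scan].fields[f] = copy_obj(h, h->to[scan].fields[f], &next);
    obj_t *tmp = h->from;
    h->from = h->to;
    h->to = tmp;
    h->used = next;
}

/* Bump allocation; collects first when the active pool is full. */
static obj_t *heap_alloc(heap_t *h, obj_t **roots, size_t nroots, int payload) {
    if (h->used == h->cap) {
        heap_collect(h, roots, nroots);
        if (h->used == h->cap) return NULL; /* everything is still alive */
    }
    obj_t *o = &h->from[h->used++];
    o->forward = NULL;
    o->fields[0] = o->fields[1] = NULL;
    o->payload = payload;
    return o;
}
```

Note that heap_collect() moves objects, so after a collection only the pointers stored in roots are still valid. Any other pointer held by the caller refers to the old pool.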

And this is where things become interesting. When I have the virtual machine context around (essentially in the VM implementation code), I can pass it to the memory allocation functions like this:

vm_t* vm = ...;
// allocate an array of 10 elements
tagged_value_t obj = vm_mk_array(vm, 10);
// do something with obj

vm_mk_array() would call vm_alloc(vm, size) internally to allocate a raw chunk of memory. If during that call the pool has insufficient memory, the vm_alloc() function can trigger garbage collection itself and then resume allocation so the caller would not be aware of all the complications.

In this case, you need to be very careful with the results of functions that perform memory allocations. If you don't save the pointer to the memory region in a VM register or on the stack, or otherwise mark it as a GC root, the next garbage collection won't treat this object as alive and will just "erase" it.
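The hazard is easier to see in a toy model. The sketch below is not my VM: toy_vm_t and the toy_* functions are hypothetical, the pool has only two slots, and the collector simply frees every slot that is not on the root stack. It is still enough to show what happens to an object that lives only in a C local variable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { POOL_SLOTS = 2, MAX_ROOTS = 8 };

/* Toy VM (all names hypothetical): a tiny pool of slots plus a root stack. */
typedef struct {
    bool in_use[POOL_SLOTS];
    int value[POOL_SLOTS];
    int roots[MAX_ROOTS]; /* slot indices the collector must keep */
    size_t nroots;
} toy_vm_t;

static void toy_push_root(toy_vm_t *vm, int slot) { vm->roots[vm->nroots++] = slot; }
static void toy_pop_root(toy_vm_t *vm) { vm->nroots--; }

/* Free every slot that is not referenced from the root stack. */
static void toy_collect(toy_vm_t *vm) {
    for (int s = 0; s < POOL_SLOTS; s++) {
        bool rooted = false;
        for (size_t r = 0; r < vm->nroots; r++)
            if (vm->roots[r] == s) rooted = true;
        if (!rooted) vm->in_use[s] = false;
    }
}

/* Returns a slot index, or -1. Collects once if the pool is full. */
static int toy_alloc(toy_vm_t *vm, int value) {
    for (int attempt = 0; attempt < 2; attempt++) {
        for (int s = 0; s < POOL_SLOTS; s++) {
            if (!vm->in_use[s]) {
                vm->in_use[s] = true;
                vm->value[s] = value;
                return s;
            }
        }
        toy_collect(vm); /* the caller never sees this happen */
    }
    return -1;
}

/* Unsafe: `a` is held only in a C local, so its slot gets reused. */
static int unrooted_survives(void) {
    toy_vm_t vm = {0};
    int a = toy_alloc(&vm, 1);
    toy_alloc(&vm, 2);
    toy_alloc(&vm, 3); /* pool is full: collection runs, `a` is erased */
    return vm.value[a] == 1;
}

/* Safe: `a` is registered as a root before the next allocation. */
static int rooted_survives(void) {
    toy_vm_t vm = {0};
    int a = toy_alloc(&vm, 1);
    toy_push_root(&vm, a);
    toy_alloc(&vm, 2);
    toy_alloc(&vm, 3); /* collection runs, but `a` is kept */
    int ok = vm.value[a] == 1;
    toy_pop_root(&vm);
    return ok;
}
```

In this model unrooted_survives() returns 0 and rooted_survives() returns 1. The two functions differ only in the push/pop pair, which is exactly the kind of bookkeeping that is easy to forget in real code.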

When I started rewriting the assembly compiler to use "native objects" and allocate them in the VM memory pool, I immediately hit this problem. Writing code with the expectation that every dynamically allocated object can be pulled from under your feet if you're not careful makes the code complicated and hard to read.

And then I remembered how Zig solves these problems. In Zig, there's a thing called an "arena allocator", which lets you not care about freeing individual objects and instead free all allocated memory at once when you're done with the computation. It is implemented as a linked list of buffers that the allocator maintains internally. When the current buffer runs out of space, a new buffer is allocated and added to the list. This allows the arena to grow dynamically, while all allocated objects "stay put".

So to solve the problem with the C code, I ended up borrowing the idea of arena allocators from Zig. Instead of memory pools having a fixed size, I turned them into a linked list of pages internally. This means that individual memory allocations never call the garbage collector, as memory is always available (provided there is enough memory on the physical machine). I moved all garbage collection to the upper level, so that it is only called while the virtual machine bytecode is being evaluated. As a result, any C code that is called from the VM (or that calls back into it) can safely allocate memory from the VM pool and expect that memory to be freed by the VM later, while also being sure that garbage collection will not be triggered while the C function is executing.
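A pool built from a linked list of pages can be sketched roughly like this. Again, this is an illustration under my own assumptions rather than the real implementation: arena_t, page_t, arena_alloc() and the 4096-byte page size are hypothetical, and the garbage collector itself is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumed page payload size; the real value is a tuning decision. */
#define ARENA_PAGE_BYTES ((size_t)4096)

typedef struct page {
    struct page *next;
    size_t cap, used;   /* payload bytes: capacity and fill level */
    max_align_t data[]; /* payload, aligned for any object type */
} page_t;

typedef struct {
    page_t *head;  /* most recently added page */
    size_t npages;
} arena_t;

/* Round a size up to the strictest fundamental alignment. */
static size_t align_up(size_t n) {
    size_t a = _Alignof(max_align_t);
    return (n + a - 1) / a * a;
}

/* Never triggers a collection: when the current page is full,
 * a new page is linked in and existing objects stay where they are. */
static void *arena_alloc(arena_t *a, size_t size) {
    size = align_up(size ? size : 1);
    page_t *p = a->head;
    if (p == NULL || p->cap - p->used < size) {
        size_t cap = size > ARENA_PAGE_BYTES ? size : ARENA_PAGE_BYTES;
        p = malloc(sizeof(page_t) + cap);
        if (p == NULL) return NULL; /* the host machine is out of memory */
        p->next = a->head;
        p->cap = cap;
        p->used = 0;
        a->head = p;
        a->npages++;
    }
    void *ptr = (unsigned char *)p->data + p->used;
    p->used += size;
    return ptr;
}

/* Release every page at once. */
static void arena_free_all(arena_t *a) {
    page_t *p = a->head;
    while (p != NULL) {
        page_t *next = p->next;
        free(p);
        p = next;
    }
    a->head = NULL;
    a->npages = 0;
}
```

The important property is that a pointer returned by arena_alloc() stays valid no matter how many allocations follow. C code can therefore keep raw pointers in local variables, and reclaiming memory is left to the code that drives bytecode evaluation.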

This has made the implementation of memory allocation and garbage collection a little more complex, and I started getting segfaults and leaks (which is to be expected in low-level code like this). To make debugging easier, I enabled the -fsanitize=address compiler flag, which turns on AddressSanitizer (ASAN): it wraps memory allocations and instruments the code to detect invalid memory accesses. It allowed me to very quickly iron out most of the trivial allocation bugs.

In addition to enabling the address sanitizer, I started gathering test coverage with gcov, which ships with the GCC toolchain. It lets me see which parts of the critical functionality are not covered by tests and so need more work. I even added a plugin to my editor that annotates open .c and .h files with colored markers for lines that don't have test coverage.

I find that if you have clangd, ASAN, gcov, gdb and some test coverage, working on the low-level C code can actually be pretty enjoyable!