Why use red-black trees instead of hash tables?

In the programming language that I'm working on, I'm putting a lot of thought into making data serialization seamless. And after a bit of thinking and experimenting, I've found that dictionary implementation based on hash tables won't work, and that I need a different data structure. This post will explain why, and also why I settled on red-black trees instead.

Let's begin with data serialization. My language would have threads implemented using the Erlang model (full isolation, with no ability to mutate each other's data). This essentially means that I need a way to copy the objects somehow. I could've done this specifically for the virtual machine in a simple manner: just walk the tree of pointers in the object and then copy every component one-by-one. The problem with this approach is that it only works within one process. As soon as you need to pass objects through the network or to a different programming language (through bindings) -- you'd need something else. You'd need a flat, serialized representation that can be sent as a contiguous sequence of bytes.

At the time of initial design, I decided to go with one unified approach: there should be a way to take any data structure, and "freeze" it. Freezing a data structure creates its immutable copy, that occupies a single, contiguous block of memory, and all its "children" would be copied there as well. All the references inside this block of memory would be relative, so it can be easily moved. But otherwise, for all intents and purposes it still stays the same data structure, and you can operate with the frozen version the same was as with its non-frozen counterpart.

This is how the freezing of data structures would look like if it was in Python-like syntax:

a = [42, {"foo", "bar"}]
b = freeze(a)
print(b[1])
# -> {"foo", "bar"}
b[1]["foo"] = "baz"
# -> error: object is immutable

In this example, b is a "deep" copy of a, within a linear memory block, and marked as immutable.

This approach has benefits not only from the point of view of serialization and message passing, but also because it makes distinction between Tuple and Array types unnecessary. A Tuple would be just a frozen array, and not a distinct type as it is in Python.

Now let's talk about why Python makes this distinction. Here are a few examples:

mydict = {}
mydict[1] = 2
# -> ok
mydict["foo"] = "bar"
# -> ok
mydict[ [42] ] = 5
# -> error: unhashable type

The reason why using arrays as keys in a dictionary won't work is because they are mutable. In theory, you can use the array as a key in the dictionary and then update it later. The dictionary won't know about the change, and won't have the ability to recalculate its hashes. This would lead to surprises when you later look up the object: it won't be found by either of the keys - neither the old one, nor the new one.

I guess, because using arrays as keys in dictionaries is a common thing, the authors of Python came up with Tuples. They allow you to overcome this limitation, by forcing the array to be immutable. So what you would do instead is:

mydict = {}
mydict[ (42,) ] = 5
# -> ok

But what about something more complex? What if you have a custom data structure? Well, for this purpose, Python gives you a way to define __hash__ and __eq__ properties in your class. This would allow you to use custom data structures as keys in dictionaries. But it's completely on you to make sure that they don't change.

With that said, let's get back to my language. Of course I wanted to have support for dictionaries. But on the other hand, I didn't want to deal with the complexities of implementing support for classes and OOP in general. It was meant to be a simple language after all. And this is when I realized that since I have frozen data structures, then something like this should be possible:

mydict = {}
a = [42]
b = freeze(a)
mydict[b] = "foobar"

If I have a frozen object, that is a contiguous memory slice, I should be able to calculate a hash for it without problems, right? And so I did an initial prototype for dictionary implementation that is based on hashes. You could pass primitive values and frozen arrays (even nested), and it would work just fine. I thought the design was solid, until I tried something like this:

mydict = {}
a = {"foo": "bar"}
b = freeze(a)
mydict[b] = 42

This case is where everything started falling apart: I remembered that hash tables cannot be easily compared with each other, and that their in-memory layout depends on the order of insertion of their keys, and often on the size of the array that contains hash buckets. All of these are not predictable. This means that to calculate a hash function from a dictionary to then use it as a key in another dictionary, you'd need to sort its keys. This requires allocation of a temporary array with n elements, and approximately n * log(n) time complexity for sorting.

In short, this would only work if the order of the keys would be the same for all dictionaries, and already be sorted. This means that the only reasonable underlying data structure would be a tree (in case if I want to keep things simple).

From my previous experience with C++, I remember that the standard std::map type (which is a dictionary) is usually implemented there with red-black trees, which gives it a nice property of always having ordered keys. So I thought -- maybe I can just do the same. The only "catch" is that in C++ the keys in the dictionary should have exactly the same type, whereas in my dynamic language, you should be able to mix them. For example:

mydict = {}
a = freeze({"foo": "bar"})
b = freeze({"foo": "baz"})
c = freeze(42)
mydict[a] = 1
mydict[b] = 2
mydict[c] = 3

How would that work? Well, the red-black tree requires that all elements that you put into it should have a "strict total order". Which means that you should be able to tell if one element is less than, equal to, or greater than any other element.

Resolving this problem took a bit of thinking, but eventually I've formulated the following approach:

If elements have a different data type - compare type identifiers instead, and use that as a result of comparison (they would never be equal)
If elements have the same data type - use the comparison defined by that data type
for simple data types, use native comparison (arithmetic comparison for numbers, lexicographic for strings, etc...)
for complex data types, write a "walker" that would defer to simpler comparisons

It seems to fit nicely into the whole idea of making the language simple, and doesn't require any custom implementation if you need to use your data structures as keys in dictionaries. Of course, using red-black trees is not without disadvantages: they are slower than hash tables, and would take more memory. But as of now, I feel that this is a good tradeoff to make.

At the time of writing this post, I've finished a simple implementation of a red-black tree in C, and plugged it into the dictionary implementation of my language. As usual, you can find the code here.