Error reporting and syntax objects

If you know a little bit about lisp, you may think that it is "homoiconic": the code that it compiles is written the same way as regular data. For example:


valeri> (+ 1 2 3 (* 4 5))
26

This is of course a program, but you can also quote it to get back a list:


valeri> '(+ 1 2 3 (* 4 5))
(+ 1 2 3 (* 4 5))

Many people will either know or realize that this opens up the possibility of source code transformation, and macros in particular. I've even heard some say that in lisp you write code directly in the AST. But this is actually wrong!

Consider for a moment what happens if you get a runtime error during an arithmetic operation. How would the runtime show you the source code location of the error? To do that, the compiler must emit debug information that maps the compiled code back to the source. And if the data structures that the compiler receives as input are just regular lists, that source mapping is lost: a plain list does not remember which file, line, or column it was read from.

So, practical lisp implementations (at least of Scheme) actually do have an AST, built from what are called "syntax objects". See the Racket docs for an in-depth explanation.

In Scheme, syntax objects can wrap any other object and give it additional context such as lexical scope, source code location, or any other custom metadata. You can "pack" and "unpack" syntax objects if you really want to fiddle with the low-level representation. Scheme also uses syntax objects for its hygienic macro system, but that's out of scope for me right now.
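To make the idea of "wrapping" concrete, here is a minimal sketch in Python. It is not how Valeri or any Scheme implements syntax objects; the names `SrcLoc`, `Syntax`, and `strip` are made up for illustration, and only the source location part of the context is modeled (no lexical scope).

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class SrcLoc:
    """Hypothetical source location: source name, 1-based line and column."""
    source: str
    line: int
    column: int


@dataclass(frozen=True)
class Syntax:
    """Wraps any datum together with the place it was read from."""
    datum: Any
    loc: Optional[SrcLoc] = None


def strip(obj: Any) -> Any:
    """Recursively "unpack" syntax objects, returning plain data."""
    if isinstance(obj, Syntax):
        return strip(obj.datum)
    if isinstance(obj, list):
        return [strip(item) for item in obj]
    return obj


# The form (1 2 3) as a reader might produce it:
# a wrapped list whose elements are wrapped atoms.
form = Syntax(
    [
        Syntax(1, SrcLoc("<repl>", 1, 2)),
        Syntax(2, SrcLoc("<repl>", 1, 4)),
        Syntax(3, SrcLoc("<repl>", 1, 6)),
    ],
    SrcLoc("<repl>", 1, 1),
)

plain = strip(form)
print(plain)                      # [1, 2, 3]
print(form.datum[1].loc.column)   # 4
```

The wrapped form still knows that `2` came from column 4, while the stripped list is ordinary data with no way to recover that.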

Since I want Valeri to be friendly, I've taken a stab at implementing syntax objects. To play with them in the REPL, you can do as follows:


valeri> (syntax 42)
#<syntax 42>

valeri> (syntax (1 2 3))
#<syntax (#<syntax 1> #<syntax 2> #<syntax 3>)>

valeri> (syntax {1 2 3 4})
#<syntax (dict #<syntax 1> #<syntax 2> #<syntax 3> #<syntax 4>)>

Here, syntax is a special form that allows you to keep the syntax information of its parameter. Compare (syntax (1 2 3)) in the example with the following:


valeri> (quote (1 2 3))
(1 2 3)

Quote actually does the reverse: it strips the syntax information from its parameter, so the user sees what they expect. Whenever "atoms" (numbers, strings, symbols, etc.) get compiled into the bytecode, their syntax information is stripped as well.

In the current implementation, the reader that parses source code into the object hierarchy already embeds source code information. Neither the compiler nor the runtime uses this information yet to enrich error messages, but that's coming up soon.
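As a rough illustration of what "embedding source code information in the reader" can look like, here is a toy reader sketched in Python. It is an assumption-laden stand-in, not Valeri's reader: it only understands parentheses, integers, and symbols, and the `Syntax` and `Reader` names are hypothetical. The key point is that the reader records the line and column before it consumes each form.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Syntax:
    """A datum plus the 1-based line and column where it starts."""
    datum: Any
    line: int
    column: int


class Reader:
    def __init__(self, text: str):
        self.text = text
        self.pos = 0
        self.line = 1
        self.column = 1

    def peek(self) -> str:
        # Empty string signals end of input.
        return self.text[self.pos] if self.pos < len(self.text) else ""

    def advance(self) -> str:
        # Consume one character, keeping line and column up to date.
        ch = self.text[self.pos]
        self.pos += 1
        if ch == "\n":
            self.line, self.column = self.line + 1, 1
        else:
            self.column += 1
        return ch

    def skip_whitespace(self) -> None:
        while self.peek().isspace():
            self.advance()

    def read(self) -> Syntax:
        self.skip_whitespace()
        line, column = self.line, self.column  # remember where the form starts
        ch = self.peek()
        if ch == "":
            raise SyntaxError(f"{line}:{column} unexpected end of input")
        if ch == ")":
            raise SyntaxError(f"{line}:{column} unexpected closing paren")
        if ch == "(":
            self.advance()
            items = []
            while True:
                self.skip_whitespace()
                if self.peek() == "":
                    raise SyntaxError(f"{line}:{column} unterminated list")
                if self.peek() == ")":
                    self.advance()
                    return Syntax(items, line, column)
                items.append(self.read())
        token = ""
        while self.peek() and not self.peek().isspace() and self.peek() not in "()":
            token += self.advance()
        try:
            return Syntax(int(token), line, column)  # integer atom
        except ValueError:
            return Syntax(token, line, column)       # anything else is a symbol


form = Reader("(+ 1\n   (* 4 5))").read()
inner = form.datum[2]
print(inner.line, inner.column)                        # 2 4
print(inner.datum[1].datum, inner.datum[1].column)     # 4 7
```

With locations attached this way, a later compiler pass could copy `line` and `column` into its debug information instead of having to guess.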

And finally, because I've added the collection of syntax context to the reader, it can now report errors that happen during the reader phase, like this:


valeri> (1 2 "foo)
#<error:syntax-error "<unknown>:1:6 Syntax error: unterminated string">
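For the curious, here is one way such a message could be produced, sketched in Python rather than taken from Valeri's source. It assumes that the reported position is the 1-based line and column of the opening quote, which happens to match the `1:6` above but is only a guess about the real behavior. `ReaderError` and `check_strings` are hypothetical names, and escape sequences inside strings are ignored to keep the sketch short.

```python
class ReaderError(Exception):
    """Error raised by the reader, carrying the offending source position."""

    def __init__(self, source: str, line: int, column: int, message: str):
        super().__init__(f"{source}:{line}:{column} Syntax error: {message}")
        self.line = line
        self.column = column


def check_strings(text: str, source: str = "<unknown>") -> None:
    """Raise ReaderError at the opening quote of an unterminated string.

    Simplification: backslash escapes are not handled.
    """
    line, column = 1, 1
    start = None  # (line, column) of the currently open string, if any
    for ch in text:
        if ch == '"':
            # A quote either opens a string or closes the open one.
            start = None if start else (line, column)
        if ch == "\n":
            line, column = line + 1, 1
        else:
            column += 1
    if start is not None:
        raise ReaderError(source, start[0], start[1], "unterminated string")


try:
    check_strings('(1 2 "foo)')
except ReaderError as err:
    message = str(err)
    print(message)  # <unknown>:1:6 Syntax error: unterminated string
```

Remembering where the string started, rather than where the input ended, is what lets the message point at the place the user most likely needs to fix.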