Writing a compiler…
Right now, I am writing the scripting language compiler for NGEDIT. It is loosely based on JavaScript (typeless variables, object oriented), and so it has a C-like syntax. It generates code for a simple stack machine. I’m using a handwritten recursive-descent parser that emits the bytecode as it parses.
I’ve noticed two interesting things in the process.
The first one is how much of C/C++ syntax is unnecessary. Python did away with a lot of it, but I never thought that even in C/C++ most code could be parsed without any problems with many commas and semicolons left out. Ever wondered about class/struct/enum declarations ending in a semicolon, while functions don’t have one? In the parser, it’s as simple as this: after the declaration or definition, skip the next token if it is a semicolon. After most statements, there is no ambiguity if the semicolon is removed (expression parsing stops if the next token isn’t a possible continuation of the expression). It is possible to leave out the commas between function arguments and between identifiers in an enum definition, even if they carry an explicit = expr initialization.
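The optional semicolon can be sketched with a tiny recursive-descent parser. This is only an illustration in Python, not the NGEDIT parser: the Parser class, the token set and the tuple-based tree are all assumptions made for the example.

```python
import re

TOKEN = re.compile(r"\s*(\w+|[-+*/=;()])")


def tokenize(src):
    """Split the source into identifiers, numbers and single-character symbols."""
    src = src.strip()
    tokens, pos = [], 0
    while pos < len(src):
        match = TOKEN.match(src, pos)
        if not match:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    """Hypothetical parser for assignments and arithmetic (illustration only)."""

    def __init__(self, src):
        self.toks = tokenize(src)
        self.i = 0

    def peek(self):
        return self.toks[self.i] if self.i < len(self.toks) else None

    def next(self):
        tok = self.peek()
        self.i += 1
        return tok

    def program(self):
        stmts = []
        while self.peek() is not None:
            stmts.append(self.statement())
        return stmts

    def statement(self):
        node = self.assignment()
        # The separator is optional: skip the next token only if it is a semicolon.
        if self.peek() == ";":
            self.next()
        return node

    def assignment(self):
        left = self.additive()
        if self.peek() == "=":
            self.next()
            return ("=", left, self.assignment())
        return left

    def additive(self):
        node = self.term()
        # Parsing stops as soon as the next token cannot continue the expression.
        while self.peek() in ("+", "-"):
            op = self.next()
            node = (op, node, self.term())
        return node

    def term(self):
        node = self.primary()
        while self.peek() in ("*", "/"):
            op = self.next()
            node = (op, node, self.primary())
        return node

    def primary(self):
        tok = self.next()
        if tok == "(":
            node = self.assignment()
            if self.next() != ")":
                raise SyntaxError("expected ')'")
            return node
        if tok is None or not re.fullmatch(r"\w+", tok):
            raise SyntaxError(f"unexpected token {tok!r}")
        return tok


with_semicolons = Parser("a = 1; b = a + 2;").program()
without_semicolons = Parser("a = 1 b = a + 2").program()
```

Both calls produce the same two statements. The sketch has no unary minus and no function calls, which are the kinds of constructs where a missing separator could start to change the meaning.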
The second one occurred to me when I was writing the intermediate structures that hold the code generated for a full function and the code to evaluate an expression. I had to separate the code for the expression, as it can’t simply be added to the tail of the function (as can be done with the statements). It turned out that the structure to hold a compiled function and the structure to hold a compiled expression are identical! (At least, given that I was storing the function signature separately.) In NGS (the NGEDIT scripting language) all functions return values, so that is common as well. This reminded me of one extension GCC used to have years ago… a body of code surrounded by braces can be embedded in an expression, resulting in the value of the last statement. I don’t remember being able to use return inside the code, but it would make sense. In this way, we could write:
int a = { if (b) return 1; else { while(c--) do_something(d, e); return get_something(); } };
And if we used the parser with optional separators we could even write:
int a = { if (b) return 1 else { while(c--) do_something(d e) return get_something() } }
The optional separators are already working in the NGEDIT parser, but I don’t plan to implement the code-as-subexpression trick.
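The point about functions and expressions sharing one structure can be sketched as follows. This is a minimal illustration in Python under assumed names: CodeChunk, the opcodes and run are hypothetical and are not the actual NGS structures or bytecodes.

```python
from dataclasses import dataclass, field


@dataclass
class CodeChunk:
    """One bytecode buffer, used for a function body and for a lone expression."""
    ops: list = field(default_factory=list)

    def emit(self, opcode, arg=None):
        self.ops.append((opcode, arg))

    def splice(self, other):
        """Append code that was compiled separately, e.g. an expression."""
        self.ops.extend(other.ops)


def run(chunk, env):
    """Run a chunk on a tiny stack machine; the result is the value left on top."""
    stack = []
    for opcode, arg in chunk.ops:
        if opcode == "PUSH":
            stack.append(arg)
        elif opcode == "LOAD":
            stack.append(env[arg])
        elif opcode == "STORE":
            env[arg] = stack.pop()
        elif opcode == "ADD":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif opcode == "MUL":
            right, left = stack.pop(), stack.pop()
            stack.append(left * right)
        else:
            raise ValueError(f"unknown opcode {opcode!r}")
    return stack.pop() if stack else None


# The expression  b * 2 + 1  compiled into its own chunk.
expr = CodeChunk()
expr.emit("LOAD", "b")
expr.emit("PUSH", 2)
expr.emit("MUL")
expr.emit("PUSH", 1)
expr.emit("ADD")

# A function body equivalent to  a = b * 2 + 1; return a
body = CodeChunk()
body.splice(expr)        # the separately compiled expression is placed where needed
body.emit("STORE", "a")
body.emit("LOAD", "a")   # the value left on the stack acts as the return value

expr_value = run(expr, {"b": 5})
body_value = run(body, {"b": 5})
```

The same run function executes both chunks, since each one is just a list of instructions that leaves a value on the stack.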
September 17th, 2006 at 8:39 am
A bit late…was just stumbling on this. Leaving out operators is a bit dangerous; using a parser generator will point out misconceptions very quickly. Try removing the comma in ‘f (a, (b + c) * 2)’… But then there are languages that do not have commas or parentheses in function invocations, but those are designed for that in the expression department.
For the semicolon: Up to now I’ve only seen languages that take an implicit semicolon at the end of a line.
And you shouldn’t need returns in the statements-for-expression. Actually you shouldn’t need returns at all at the end of a function. See Ruby.
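The comma example above can be checked with a small expression parser that treats commas between arguments as optional. This is a sketch in Python with hypothetical names (parse, and the tuple-based tree), not the NGEDIT parser:

```python
import re


def parse(src):
    """Parse one expression; commas between call arguments are optional."""
    toks = re.findall(r"\w+|[-+*/(),]", src)
    pos = 0

    def peek():
        return toks[pos] if pos < len(toks) else None

    def advance():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def expression():
        node = term()
        while peek() in ("+", "-"):
            op = advance()
            node = (op, node, term())
        return node

    def term():
        node = postfix()
        while peek() in ("*", "/"):
            op = advance()
            node = (op, node, postfix())
        return node

    def postfix():
        tok = advance()
        if tok is None:
            raise SyntaxError("unexpected end of input")
        if tok == "(":
            node = expression()
            if advance() != ")":
                raise SyntaxError("expected ')'")
        else:
            node = tok
        # An opening parenthesis right after an operand continues it as a call.
        while peek() == "(":
            advance()
            args = []
            while peek() != ")":
                if peek() is None:
                    raise SyntaxError("unterminated argument list")
                args.append(expression())
                if peek() == ",":   # optional separator
                    advance()
            advance()
            node = ("call", node, args)
        return node

    return expression()


with_comma = parse("f(a, (b + c) * 2)")
without_comma = parse("f(a (b + c) * 2)")
```

With the comma, f receives two arguments. Without it, the parenthesis is taken as a continuation of a, so f receives the single argument a(b + c) * 2.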
September 17th, 2006 at 11:01 pm
Andreas, yeah, it’s been a long time since I touched NGS. Thanks for pointing out a case where the comma actually changes the meaning. Actually, the parentheses are incredibly overused in C-like languages (expression priority, function call, part of specific constructs like if, while, etc…).
The semicolon is less tricky, as it’s only used at statement level (although with some inconsistencies, the do-while form for example), so I think it’s actually the case that it can be elided more often.
But in any case, this was more of a curiosity than anything else. Actually, I wrote a basic vi/vim emulator in the scripting language, which I then ported to C++, and adding the semicolons back was a PITA 🙂