Compiler overview (III)

Hello back. I have been very busy working in NGEDIT during the past days, and it has taken a bit to get back to posting to the blog. Development is advancing nicely: VimEmu (the vi/vim emulation layer) is working quite well and I have even started using it for continuous editing. There are a couple of major parts yet to be done (namely, named registers and character-wise, linewise and blockwise selection), but most of the core parts are working very well. Well, actually not now, as I have done a major restructuring of the VimEmu.ngs script, and it’s not working right now, but it was just before the restructuring. VimEmu is already a 1,500 lines script which is also doubling as the best testing base for the scripting core.

In order to work in VimEmu, I have integrated in NGEDIT some hotkeys to swap the InputProcessor (the name for the scripts that can process user input), reload it, compile the current file and signal the errors. I have also added the possibility of dumping the contents of the full Undo stack onto a file and to load and play it back. If bugs appear, this is a great way to save and inspect the way to reproduce a given bug.

Given the host of comments regarding a new, incompatible language, I have been researching a bit more about Javascript. Although I won’t do this for version 1.0, it is quite possible that I will evolve NGS into Javascript after that, which will be nicer for folks who want to extend the editor. NGEDIT will have an extension system similar to Firefox and the like, such that extensions can be easily developed and deployed. Regarding interesting Javascript sites, check out DHTML Lemmings.

I have to correct one of my previous points: in fact, in Javascript you have to qualify member access within member functions by using “this.” before member access. This is good, as a mistyped variable access won’t be interpreted as a possible member access and go unflagged as a compile error. It is the same that I’m doing in NGS. Apologies for the mistake.

Last time, we saw how to generate code for an expression which involves no assignments. Today, we are going to see what to do with expressions involving an assignment, and delve into the wonderful world of l-value and r-value parsing.

Anyone with experience in programming has an intuitive graps for what l-values and r-values are. Consider for example the assignment a = 5; . This will represent a fairly easy to understand piece of code for mostly any C-descendant-language programming (it would be a := 5; in other languages, but they are not that ubiquitous nowadays). Something such as a = b; will also be common to understand. But a piece of code such as 5 = a; will raise an alarm on mostly anyone: you can’t assign a value to 5. The same will happen for a+b=c;. Well, what this means is that we understand that there are expressions which are valid both at the right and the left side of an assignment (such as variable_name) and others that are valid only on the right hand side, because you can’t assign to them (such as 5 or a+b. These concepts are explicitly handled in compiler writing and they are formally called l-value and r-value expressions. Actually, many compilers will flag an error for the 5 = a by issuing "Error: '5': not an l-value" diagnostic. Thus there are two classes of expressions: those that can only be on the right side of an assignment (r-values) and those that can be on both sides (l-values). Take into account that the operand to ++ or += also has to be an l-value, as they also mean to assign something to it.

So, how do we go about this in parsing a language such as NGS? For starters, let’s assume that we only allow assignments as separate statements, not in parts of other expressions (function calls, etc…). The generalization will be easy afterwards, but this limitations makes things easier to explain.

When we are parsing a function or code block, we have to parse one statement after the other and generate the code for each. The best thing to do is to peek at the first token in the statement, and then do different things depending on the token:

  • If it’s an if token, go to a separate subroutine to parse the if-expression, the code-block, and possibly the else-block.
  • If it’s a while token, go to a separate subroutine to parse its parts.
  • If it’s a while token, go to a separate subroutine to parse its parts.
  • Idem for other keyword-marked language contructs like return, for, etc…
  • If it’s an identifier, go to the assignment or function call parsing routine

This easiness comes from the fact that language keywords are “reserved” and can’t be used for variable names (in C, C++, etc..). This makes writing the compiler easier. This is not true for all languages, for example, the following is a famously valid PL/I statement:

IF THEN THEN THEN = ELSE; ELSE ELSE = THEN;

You can imagine this language is more difficult to parse than C (C++ has its own, different difficulties).

By the way, I took this sample from the famous Dragon book (Compilers: Principles, Techniques and Tools, by Aho, Sethi and Ullman). It’s the book on compiler foundations, a 1986 book called after the Dragon on the cover that stands for compiler complexity - and if you are interested in compilers you could do much worse than check it out. Be adviced that it is heavily theoretical, and you would probably be grateful to complement it with a more practical one such as Holub’s Compiler design in C.

Back to where we were, we saw how to get into our assigment parsing routine quite easily: a statement starting with a user-provided identifier is deemed to be either a function call, an assignment, or a mistake on the part of the programmer.

In a simple procedural (that is, non-OO) language, with no support for structs, the statement we are looking at would only be one of the following two constructs:

variable = expr;

or

function(expr, expr, …);

We already saw last week how to parse the right-hand-side expressions and generate code to push them on the stack, so I hope you won’t be afraid by that part.

Regarding the rest, the strategy to parse this would be simple: save the first token, and peek the next one. If it is the assignment operator, we have to parse the right hand side expression and then emit the code to store the top of the stack (TOS) in the right variable. If it is an opening parenthesis, we have to parse a sequence of expressions separated by commas (emitting the code for one after the other will leave them nicely ordered on the stack) and then emit the code to call the appropriate function.

For example, code such as a = b + 2; could generate the following stack-machine code:


PUSH_VAR 'b'
PUSH_CONST '2'
ADD
POP_VAR 'a'

In both JavaVM and CLR (.NET VM) slang, the last instruction to pop the TOS into a given variable would be called a STORE, I tend to use the POP word, but the function is the same. In the same way, both JavaVM and CLR call LOAD what I call PUSH_VAR.

The first three instructions come from the parser we saw last week, which evaluate the b + 2 part. The last instruction is the tricky one we have to generate by recognizing the left-hand-side of the assignment.

For another example, code such as function(b + 2, c); could generate the following stack-machine code:


PUSH_VAR 'b'
PUSH_CONST '2'
ADD
PUSH_VAR 'c'
PUSH_FN 'function'
CALL

Of course, function calls should be allowed on the right hand side of assignments, so we better include the parsing of this in the ParseFactor() function of the parser described in the previous post - but we can leave that for later and only allow it for separate statements for now.

How would the function to parse an statement look? Something similar to this:

void TParser::ParseStatement()
{
  TToken tok = m_pLexer->PeekNextToken();

  switch (tok)
  {
    case TOK_IF:
      ParseIfStatement(); break;
    case TOK_WHILE:
      ParseWhileStatement(); break;
    //...
    case TOK_ID:
      ParseAssignmentOrFunctionCall(); break;
    case TOK_SEMICOLON:
      /* Empty statement */ break;
    default:
      Error("Unrecognized statement"); break;
  }
}

void TParser::ParseAssignmentOrFunctionCall()
{
  // Get the identifier name
  const char *psz = m_pLexer->GetTokenID();
  m_pLexer->EatToken();

  TToken tok = m_pLexer->PeekNextToken();

  if (tok == TOK_ASSIGNMENT)
  {
    m_pLexer->EatToken();

    // Parse the right-hand-side and
    //  emit code to leave it on TOS
    ParseExpression();
    if (ErrorFound())
      return;

    // There should be a semicolon...
    Require(TOK_SEMICOLON);
    if (ErrorFound())
      return;

    ushort uVar;
    if (m_pEmitter->FindVariable(psz, &uVar))
    {
      EmitCode(OP_POP_VAR);
      EmitWord(uVar);
    } else {
      Error('Unrecognized variable name');
    }
  }
  else if (tok == TOK_OPEN_PAREN)
  {
    m_pLexer->EatToken();

    // Parse all arguments in order and push them
    while (
      m_pLexer->PeekNextToken() != TOK_CLOSE_PAREN
    )
    {
      ParseExpression(); // Parse the argument
      if (ErrorFound())
        return;

      // There should be a semicolon...
      Require(TOK_COMMA);
      if (ErrorFound())
        return;
    }

    m_pLexer->EatToken(); // Skip the close paren

    // There should be a semicolon...
    Require(TOK_SEMICOLON);
    if (ErrorFound())
      return;

    ushort uFn;
    if (m_pEmitter->FindFunction(psz, &uFn))
    {
      EmitCode(OP_PUSH_FN);
      EmitWord(uFn);

      EmitCode(OP_CALL);
    } else {
      Error('Unrecognized function name');
    }
  } else {
    Error('Unrecognized statement');
  }
}

Ok, so now that we are done compiling simple assignments, we are ready to see what goes on in a more complex language design involving structs, or even objects with member functions, etc….

If we have variables which can be objects or records, there are other types of l-values apart from simple variable names: we can access an object’s members by name by using the ‘.’ (dot) operator. It can be ‘chained’, such that expressions as Parser.PartialInput.str are valid l-values, meaning the field ’str’ of the ‘PartialInput’ member of the ‘Parser’ variable. As all l-values, it can also appear as an r-value (if you can assign something to an expression, surely you can read it and use it later for other expressions).

The key in the design of our stack machine (the same as in the Java VM, but only to some extent in the CLR VM) is that we can only push values on to the stack, and if we need to store something in a place, there is some kind of special instruction to pop values from the stack there. POP_VAR seems quite easy, popping the TOS into a variable. In NGS, actually, I have POP_MODVAR to pop into a module variable and POP_LOCVAR to pop into a function’s local variable. This or a similar distinction also exists both in the Java VM and Microsoft’s CLR VM.

What to do with objects? The first thing is to decide how to operate on them - the solution in many cases (NGS among them) is to allow the stack or general variables to contain references to objects. These are internally just some kind of pointers into objects. Thus, when a variable holding an object appears in a right-hand-side expression, it is pushed as a normal variable, which pushes a reference to the object onto the stack. If this is assigned to another variable, then we have both variables referencing the same object. This is how it works in Javascript. And if we want to make an actual copy of a variable, we have to make it explicit in the code in some way. This is very different from C or C++, where if you assign a = b; a and b being struct variables, the whole contents are copied over.

In order to handle objects in our stack-machine, we just need to new instructions: WRITE_MEMBER and READ_MEMBER. READ_MEMBER expects to find the object reference on TOS, and the member-identifying string could be either on the stack or as an immediate value in the bytecode stream (both approaches work, and this is a design decision regarding the VM - in NGS it is stored in the bytestream). On the other hand, WRITE_MEMBER expects to find the object reference on the top of the stack and the value just below it.

How do we modify ParseFactor() to accept variable.member as a value? When we found an identifier, we just emitted a PUSH_VAR opcode. What we have to check now, is, if we find a trailing dot token, the value pushed on the stack is already the object reference we need, and we just have to emit a READ_MEMBER instruction with the proper member-reference immediate value.

Now, how we do allow to use variable.member to be taken as a valid assignment left-hand-side? Suppose an assignment such as this.PartialInput = 0; . The code we’d like to emit looks like this:


PUSH_CONST '0'
PUSH_VAR 'this'
WRITE_MEMBER 'PartialInput'

Now suppose we had an expression such as this.PartialInput.len = 0; . The beginning of the expression looks the same (this.PartialInput), but the code we’d like to generate looks like this:


PUSH_CONST '0'
PUSH_VAR 'this'
READ_MEMBER 'PartialInput'
WRITE_MEMBER 'len'

That is, the first member access (this.PartialInput) has to be emitted as if it were on the right hand side of an assignment expression, as we want to bring its value on the top of the stack. But in both cases, the last member access (.PartialInput in the first, and .len in the second) has to be emitted as an l-value assignment, that is, as a WRITE_MEMBER opcode.

We can see this in the following light: a chain of member access can generate different code if it is used as the target of an assignment or if it is used in order to access its value, but only the “tail” of it is different. And we don’t know whether it is used in one way or the other until the next token in the expression is evaluated: if it is yet another dot token, we need to emit the r-value code, and if it is finally an assignment, we will need to emit a WRITE_MEMBER opcode.

We can see that it is necessary to emit the code to evaluate the right-hand-side of the assignment before the code to access and assign the left hand side. Likewise, in function calls, we need to emit the code to evaluate the arguments before the function call itself. This may easily done by shuffling around a few buffers when the expression to be emitted afterwards is parsed before, so usually intermediate results of expression parsing go in a separate buffer instead of directly to the output bytecode stream.

So then, the trick is to prepare some kind of TChainReference that parses a chain of object member references. The best thing is to make this TChainReference accept variables as well, such that it will parse expressions such as module_var_name, local_var_name, localvar.member1.member2, etc… The parser will call it first with the first identifier in the chain, and then once again for each .memberref found. The class must then be able to respond to two requests: either get the r-value code for the expression passed in until now, or get the l-value code as we have found an assignment token and we need that. When we notify it of an additional .member element in the chain, it can safely compile in its current chain into only-r-value code and forget about it, as it won’t be used as an l-value, and buffer the last member access for either l-value or r-value generation.

This class can also be called from the ParseFactor() function in the expression parser we saw last week, and we will only be calling the GetRValueCode() request in the end if there is no assignment.

Given that we are parsing an OO language, there are not only member variable accesses, but also member function (method) calls. In order to call a member function, we could have a CALL_MEMBER opcode, but the approach taken in NGS is that member functions are just member variables of the internal FUNCTION type. Thus, member functions are called in the same way as regular function - although the “this” argument is passed as the first argument. I have even gone all the way to having all functions receiving a “this” argument by default, which is just nil for non-member function calls.

Thus, the code generated for something like Parser.ParseElement(); looks like this:


PUSH_VAR 'Parser'
DUP
READ_MEMBER 'ParseElement'
CALL

What does the DUP opcode do? It duplicates the TOS element - in this way, after the DUP we have two copies of the object reference on the stack. The one on TOS is used by the READ_MEMBER instruction, which leaves the value of the the ‘ParseElement’ member variable on the stack - which hopefully contained a function reference. Finally, the CALL opcode actually calls the function, passing in the original object reference as the first and only argument - the ‘this’ reference for the member function.

Given that member functions can return any type of value, they could easily return an object reference or even a function value. Thus, the reference chains we actually need to parse could be of any of the folllowing forms:


Parser.GetCurrentValue().bUsed = expr
Parser.PartialInput.Reset()
GetCurrentParser().PartialInput.bUsed = true
Parser.GetLoggingFunction()(”log text”); // even this!

We can see how the chain can be arbitrarily complex, involving as many function calls and member accesses along the way as the programmer wants. Take into account that a reference chain (as I like to call them) ending in a function call cannot act as an l-value, as values pushed on the stack can never actually be references to storage places. But, as you see in the last case, we could call a function that returns a function object and call it, making pairs of argument lists one after another valied syntax with a valid meaning if proper types are returned.

I will leave as an exercise for you to do the bookkeeping to generate the code for the arguments in function calls interwoven with the chain call (hint one: compile the arguments into a separate byte stream and just stick it where it is necessary; hint two: the DUP trick for the this variable doesn’t always work unless you push arguments right-to-left). You can also easily see how the extension for array member access would work, as signaled by the presence of square brackets in the reference chain - the general bookkeeping gets more complex, but should be straightforward if a bit messy to write.

The extension to allow assignments in expressions is quite simple now: we can accept an assignment sign after a factor that is actually a reference, and the code to emit would be:


code to eval the right-hand-side of the assignment
DUP
code to eval the left-hand-side as an L-VALUE assignment

This leaves the value of the assigned on the TOS, respecting the C rule that an assignment evaluates to the assigned value and performs the assignment as a side-effect.

I hope you are finding this material useful or interesting - please drop a note in that case (or if you don’t as well). We’ll finish the compiler overview with a peek at the runtime environment in the next installment, and hopefully I will be able to get back to shorter blog entries :)

4 Responses to “Compiler overview (III)”

  1. Tomás Fernández Says:

    Hi, I would say that i have found very interisting your explanation (actually I saved the page to ensure I have it for further study) Although I am having a little problems to understand the TChainReference class (But finally sure i will success :-). Anyway, reading it i found a doubt:

    If I have undertood well, an assigment like this This.Obj1.Obj2.Obj3 = 0 would generate code like this:

    PUSH_CONST ‘0′
    PUSH_VAR ‘This’
    READ_MEMBER ‘Obj1′
    READ_MEMBER ‘Obj2′
    WRITE_MEMBER ‘Obj3′

    Each “READ_MEMBER” insturction will read the memeber belonging to the previous object in the TOS and put it in the TOS.

    So to achieve the “address” of the inner member it is neccessary to “calculate” all the reference chain each time it is needed to be accesed.

    Could it be posissible to pre-calculate the iner member reference?, so that the code generated could be not neccessary the READ and WRITE instructions. Could be it possible use only in this way the generic PUSH and POP instructions set. In other words if a pre-calculation of the reference of the inner member woul be possible then could be possible to use the PUSH_REFERENCE ‘Calculated-var-reference” and the same with the POP.

    Than you very much for your useful explanations.

    Best regards.
    Tomás.

  2. J Says:

    You understood it well. The question is actually a good question. Your idea is actually doable (and done). The pointer-based language tradition (such as C and C++) would need such an approach, if only to shuffle around pointer type variable values. But, many other languages don’t offer pointers, and in such case it is simpler to just implement the LOAD/STORE model to bring stuff on to the stack and send it back to assignment destinations, and you can do without allowing pointers on the stack.

    Take into account that it’s not only about object members, variables are read and written via specific instructions.

    CLR, Microsoft’s .NET runtime, actually supports pointer values on the stack. It makes the VM quite a lot more complex. I’ll be guessing that they needed to provide it only in order to implement managed C++, which is plain old C++ compiled to a stack-based VM (with some extensions thrown in for the new model, but by and large compatible with standard C++).

  3. shankar Says:

    Hi,

    I found this very useful. In fact I read this page many years ago, now that i am writing my own compiler for my software, i used google to find you again and refresh the knowledge. Have you done any further work on this? Would appreciate if you put up see also: links.

    Thank you.

  4. J Says:

    shankar, I’m glad you found this interesting. I’ve often thought about recasting these articles into proper compiler-writing articles, but meanwhile here there are to help out whoever needs them!

    Writing a compiler is not as difficult as it’s often thought to be, and I believe any serious programming should write at least one to become comfortable with all the concepts.

    I haven’t written more than I did in the few articles in the blog. I will be working in the language in the near future again, as I plan to base ViEmu-scripting on it, so I might post a bit more, but unfurtunately I hardly have any time left from my main development work.

    Best wishes with your project!

    - Jon

Leave a Reply