Compiler overview (II)

In the last installment, I wrote an overview of the first stages of the NGS compiler work. We’ll now get into more detail on parsing, and we’ll see how code generation works.

Now, knowing how to write a Recursive Descent Parser, we confront the task of compiling a whole script module. It is quite easy to do that for NGS or similar languages: a module is a sequence of distinct types of blocks that can be identified by the type of their first token. So, the parser just has to peek at the next token in the input queue and call the parsing function for the appropriate type of section. In NGS, this currently means either a ‘const’ declaration, a ‘var’ declaration, or a ‘function’ block. Functions may be either simple forward declarations (a semicolon after the argument list) or full definitions (an opening brace and hopefully a full block o’code after the argument list).

‘const’ and ‘var’ sections are simple enough to parse and process. A ‘const’ section serves the purpose that ‘#define’ and ‘enum’ do in C and C++: it declares symbols and associates constant values with them, so that constants can be ‘abstracted out’ in the code. A ‘const’ section allows you to either assign explicit values to the symbols, or let the compiler implicitly generate them as consecutive integers (same as C/C++ enums). In the current incarnation of the NGS compiler, constants cannot be initialized with expressions; only simple ‘literal’ value initializers are allowed. A constant expression parser, or the ability to run at compile time the bytecode generated by the regular expression-parsing code, would easily lift this limitation.

‘var’ sections declare module-level variables, which keep their value between function invocations. Given that they have no complex type declaration syntax such as C/C++’s, they are straightforward to parse, and they only have to be added to the variable pool of the script object being generated. Simple literal initializers are allowed for variables right now; the same notes regarding ‘const’ expression initializers apply.

Both ‘const’ and ‘var’ sections share the property that they can define either a single const/var or a group of them, with or without initializers:

const PI = 3.1415926535;
const
{
  CH_RETURN = 13,
  CH_BACKSPACE = 8
};
const
{
  MODE_NORMAL, // Implicit '0'
  MODE_INPUT,  // 1
  MODE_VISUAL, // ...and so on
  MODE_COMMANDLINE
};

var a;
var { v1, v2, v3 };

Peeking for the opening brace token allows telling group blocks from standalone declarations, peeking for the assignment sign allows detecting initializers, and commas (as well as semicolons) are optional, as their presence is not required in order to know what’s going on.

The only other type of module-level block is a function. This is easily detected when seeing the ‘function’ token, and that is where the real fun begins!
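The peek-and-dispatch loop described above can be sketched in a few lines. This is a self-contained, illustrative sketch: the token names and the string log (which stands in for calls to the real section parsers) are my assumptions, not NGS’s actual identifiers:

```cpp
#include <string>
#include <vector>

enum TToken { TOK_CONST, TOK_VAR, TOK_FUNCTION, TOK_EOF };

// Routes each module-level block to the right section parser based on
// its first token. The string log stands in for the real parse calls.
std::string ParseModule(const std::vector<TToken> &vBlockStarts)
{
  std::string sLog;
  for (TToken tok : vBlockStarts)        // stands in for PeekNextToken()
  {
    switch (tok)
    {
      case TOK_CONST:    sLog += "const;";    break;  // ParseConstSection()
      case TOK_VAR:      sLog += "var;";      break;  // ParseVarSection()
      case TOK_FUNCTION: sLog += "function;"; break;  // ParseFunction()
      case TOK_EOF:      return sLog;
    }
  }
  return sLog;
}
```

The real parser, of course, consumes tokens from the lexer as it goes, but the routing decision is exactly this one-token peek.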

Let’s diverge a bit from parsing into the realm of code generation. There are several ways to compile a high-level language into something which is quick to execute. The target language could be a bytecode machine (as is the case for Java or .NET language compilers), it could be assembly code (or directly native machine code, bypassing the intermediate assembly step), or it could be another high-level language. As an example, Bjarne Stroustrup’s Cfront compiler generated C code from the original C++ code. In the case of NGS, I had already decided to generate bytecode for a stack-based machine: it provided the best trade-off between implementation complexity and execution speed, while providing the needed flexibility.

What does code for a stack machine look like? Basically, all operations are done by ‘pushing’ the operands onto a stack, having instructions that operate on the stack, and having other instructions to store the result back into variables. For example, the bytecodes to evaluate a = b + c would look like this:

  PUSH_VAR 'b'
  PUSH_VAR 'c'
  ADD
  STORE_VAR 'a'

Things get a bit more complex for conditional jumps (involving instructions such as EQ or GREATER_THAN that operate on the stack, and a JCOND instruction checking the TOS or Top-Of-Stack element) and for function calls (arguments are usually passed on the stack, where the return value is also left, and return addresses may be stored on the same stack or on a separate special-purpose stack, which is the case in NGEDIT).

Although it’s not the only way, additional benefit is gained by storing the instructions as sequences of bytes, called bytecode or bytecodes. These sequences can easily be generated, inspected, saved to disk, or passed around between platforms. Instructions can be one byte long (for operand-less instructions such as ADD, which implicitly operate on the stack) or can carry additional information (such as the target address for a JMP).

Why are stack-based machines so popular? On one hand, we have their technical merits (portability, speed of execution, etc…). On the other, there are two other significant reasons: the first one is that implementing the virtual machine that executes the code is straightforward, as basic operations are simple push_from/operate/pop_into actions. The second one is that implementing a compiler to generate code for such a machine is quite straightforward as well.

Generating machine code is quite a bit more difficult (I plan to write a post on a simple way to generate code for the x86 architecture, that is, the architecture underlying PCs and Macs :) - even if there are no immediate plans for native code generation in NGEDIT). Although I have not attempted it, I think it is quite a lot simpler than it may seem.

There are other intermediate representations, such as operation ‘tuples’, and others can be thought up, such as keeping the expression syntax tree explicitly for evaluation, but they don’t provide advantages important enough to outweigh the bytecode-based stack-machine alternative.

So now - how can we generate bytecodes from a C-like source file? The code generated for a function will be a sequence of bytecodes, plus the list of local variables the function requires. We shall start with a simple case: how to compile a = b + c; into the instructions shown above.

This being an assignment, with an expression on the right-hand side, there are two parts to code generation. The first one involves parsing the right-hand side into bytecodes that will leave the calculated value on the stack. The second part involves generating code to store the value on the top of the stack (TOS) in the place indicated by the left-hand side.

Generating code to evaluate the expression can be done with a parser similar to the one we saw in the previous article. If the expression is a simple variable name, the code is just a single PUSH_VAR opcode (there will probably be different opcodes for pushing local or module-level variables, but that is easily decided by looking up the identifier and finding out where it lives). The key to generating code to evaluate a generic expression is in seeing that the code for a ‘+’ expression can be generated in three steps:

(1) Generate the code to push the value of the operand to the left of the addition
(2) Generate, and append to the previous, the code to calculate and push the value of the operand to the right of the addition sign
(3) Append an ‘ADD’ bytecode

In the absence of any other elements, the code in part (1) will start operating on an empty stack, and will leave the value on TOS. Then, the code in part (2) will operate on a stack which has the previous element pushed, but won’t touch it. Its operations will leave the first value just below the TOS, and will leave its result on TOS. And finally, the ADD bytecode will add them together and leave a single value on the stack, the result of the addition.

This is a key part, as any expression can be evaluated this way. All calculation operators taking two operands work in this way: ‘a - b’ will emit the code for the (a) and (b) parts and then a ‘SUB’ opcode, and ‘a > b’ will emit a ‘GREATER_THAN’ opcode which leaves TRUE on TOS if the inequality holds or FALSE if it doesn’t. Unary operators such as ‘-a’ or ‘~b’ are even simpler: they simply emit the code for their operand and then an opcode that performs the operation on the single operand, such as ‘NEG’, ‘BINARY_NOT’, etc…

We can emit this code as we go along in a recursive descent parser - instead of directly calculating the values and the result of the operators, we emit code along the way to calculate the subexpressions, and then code to perform the requested operation. Let’s modify the parser in the previous article to emit code to calculate the expression, step by step starting with the ParseFactor function:

void TParser::ParseFactor()
{
  TToken tok = m_pLexer->PeekNextToken();

  if (tok == TOK_NUM)
  {
    int iVal = m_pLexer->GetTokenVal();

    // Emits a byte with the opcode
    m_pEmitter->EmitCode(OP_PUSH_CONSTANT);

    // Adds to constant pool
    ushort usConstant = m_pEmitter->AddConstant(iVal);

    // Emits a two-byte ref to the constant
    m_pEmitter->EmitWord(usConstant);

    m_pLexer->EatToken();
  }
  else if (tok == TOK_ID)
  {
    const char *psz = m_pLexer->GetTokenID();
    ushort uVar;
    if (m_pEmitter->FindVariable(psz, &uVar))
    {
      m_pEmitter->EmitCode(OP_PUSH_VAR);
      m_pEmitter->EmitWord(uVar);

      m_pLexer->EatToken();
    }
    else
    {
      Error("Unrecognized variable name");
    }
  }
  else if (tok == TOK_OPEN_PAREN)
  {
    m_pLexer->EatToken();
    ParseExpression();
    if (!ErrorFound())
    {
      if (m_pLexer->PeekNextToken() != TOK_CLOSE_PAREN)
      {
        Error("Closing parenthesis expected");
      }
      m_pLexer->EatToken();
    }
  }
  else
  {
    Error("Unexpected token");
  }
}

You can now see how ParseFactor doesn’t return the value of the factor, as its job is to generate the code to evaluate it. In the case of a literal number, we add the value to the constant pool of the ‘generated’ or ‘emitted’ script, emit a bytecode with the PUSH_CONSTANT meaning, and emit an inline word reference with the index of the constant to push.

We have added code to parse an identifier referring to a variable, which involves looking it up in the previously declared variables table and inserting a reference to it. Parsing of nested parenthesized expressions is very similar, as it is the function ParseExpression that does the job.

Now we can see how simply the ParseProduct function emits the code to multiply the factors:

void TParser::ParseProduct()
{
  bool bFactorEmitted = false;
  TToken tok; // Declared outside the loop, as the loop condition uses it

  do
  {
    ParseFactor();
    if (ErrorFound())
      break; // Stop parsing

    if (bFactorEmitted)
    {
      m_pEmitter->EmitCode(OP_MUL);
    }

    tok = m_pLexer->PeekNextToken();
    if (tok == TOK_MUL)
    {
      // Other factors have to be multiplied in
      bFactorEmitted = true;
      m_pLexer->EatToken();
    }
  } while (tok == TOK_MUL);
}

It’s easy to see how this function loops over one or more factors joined by TOK_MUL (that is, ‘*’ signs), emitting the code to evaluate each of them, and then emitting a ‘MUL’ instruction after each factor starting with the second. This will leave the result of the multiplications on the stack.

And finally, the function to parse an addition, which is our ‘top-level expression’, is very similar and straightforward:

void TParser::ParseExpression()
{
  bool bProductEmitted = false;
  TToken tok; // Declared outside the loop, as the loop condition uses it

  do
  {
    ParseProduct();
    if (ErrorFound())
      break; // Stop parsing

    if (bProductEmitted)
    {
      m_pEmitter->EmitCode(OP_ADD);
    }

    tok = m_pLexer->PeekNextToken();
    if (tok == TOK_ADD)
    {
      // Other products have to be added in
      bProductEmitted = true;
      m_pLexer->EatToken();
    }
  } while (tok == TOK_ADD);
}

You can easily see how an expression such as b + (2 * c) results in emitting the following code:

  PUSH_VAR 'b'
  PUSH_CONST '2'
  PUSH_VAR 'c'
  MUL
  ADD

Which indeed evaluates the expression we are looking for!
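Tracing the stack after each instruction makes this concrete; assuming b holds 5 and c holds 7:

```
  PUSH_VAR 'b'    ; stack: 5
  PUSH_CONST '2'  ; stack: 5 2
  PUSH_VAR 'c'    ; stack: 5 2 7
  MUL             ; stack: 5 14
  ADD             ; stack: 19
```

The single value left on TOS, 19, is indeed 5 + (2 * 7).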

Given that this overview is becoming quite lengthy, I will continue in a separate post. Feel free to comment on anything.

