Compiler overview (I)

Unlike other microISV entrepreneurs out there, I started writing the blog a bit late in the development process. Some people start a blog as the very first step towards building their product. Myself, I started with pure development, as I saw that getting the product to exist was the key part in my case. After hanging around the JoS forums for a while, and a suggestion from Ian of StealthPlayer to get a blog running, I actually set up a website and started writing. It’s been very interesting, and the amount of attention brought over by Eric Sink has been a pleasant surprise (thanks Eric!).

Given the interest in the scripting engine side of things, I will do a brief recap of the scripting system. I have, myself, always had a frankly lustful interest in compilers 🙂

I will split the overview into two or three parts, covering the lexer and parser in the first, and leaving code generation and execution for later.

1. “Lexing”
Formally called lexical analysis. The first stage is loading the actual text file and breaking it up into tokens. The interface is the typical GetNextToken() or PeekNextToken() (the latter returning the next token without removing it). Skipping comments also happens here. Tokens are identified by a token enumerator, and the ones requiring more info (such as TOK_ID, which is returned for any non-keyword identifier) point into the raw input buffer. This avoids memory management during the lexing phase.
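
In rough code terms, the lexer interface looks something like this (a simplified sketch, not the exact production class; GetTokenText() and GetTokenLen() are illustrative names for the pointer-into-the-buffer mechanism):

class TLexer
{
public:
  TToken GetNextToken();   // Return the next token and consume it
  TToken PeekNextToken();  // Return the next token without consuming it
  void   EatToken();       // Consume the token last returned by PeekNextToken()

  // For tokens carrying extra info, such as TOK_ID: the token's text,
  // pointing straight into the raw input buffer (no allocations)
  const char* GetTokenText() const;
  int         GetTokenLen() const;
};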

One common option is to use a tool such as LEX or FLEX (a free clone of the former). I chose to implement the lexer by hand, as it is simple enough and has no licensing issues. LEX/FLEX-generated lexers are usually based on regular expressions to detect the tokens – I just wrote specialized code to detect each type of token.
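
To give an idea of what specialized per-token code means, here is roughly how scanning an integer literal looks by hand (a sketch: ScanNumber, m_pszCur and m_iTokenVal are illustrative names, assuming the current read position is kept as a plain char pointer into the input buffer):

TToken TLexer::ScanNumber()
{
  int iVal = 0;

  // Accumulate consecutive decimal digits into the value
  while (*m_pszCur >= '0' && *m_pszCur <= '9')
  {
    iVal = iVal * 10 + (*m_pszCur - '0');
    m_pszCur++;
  }

  m_iTokenVal = iVal;  // GetTokenVal() will return this
  return TOK_NUM;
}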

2. “Parsing”
Formally called syntax analysis. I could have also gone with a tool such as YACC or BISON (the GNU clone) to generate a state machine parser, but I implemented a simple recursive descent one myself.

For those of you who haven’t seen a living specimen in the wild, how does a simple recursive descent parser look? Suppose we want to parse a syntax consisting of products and additions with no parentheses, with positive integer numbers as elements. The parser would accept and evaluate expressions such as 5 + 3 * 4, or 10 * 20 * 3 + 2 * 3 + 41 * 5. Suppose the lexer returns TOK_NUM for an integer value, TOK_ADD for the ‘+’ symbol, TOK_MUL for the ‘*’ symbol, and TOK_EOF when the end of file is reached. The value of a TOK_NUM can be retrieved by calling GetTokenVal(). The lexer is also supposed to skip the whitespace between elements.
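
In code, the token set for this toy language would be just the following (a sketch; the two parenthesis tokens are included because we will need them further down):

enum TToken
{
  TOK_NUM,          // Positive integer literal, value via GetTokenVal()
  TOK_ADD,          // '+'
  TOK_MUL,          // '*'
  TOK_OPEN_PAREN,   // '(' (used in the extended version below)
  TOK_CLOSE_PAREN,  // ')' (used in the extended version below)
  TOK_EOF           // End of file
};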

To write a recursive descent parser, we take into account that the multiply operator has precedence over the addition operator, that is, the previous expressions have to be evaluated as 5 + (3 * 4), or (10 * 20 * 3) + (2 * 3) + (41 * 5). This means that any such expression can be looked at as the addition of a series of products, which are themselves the product of a series of factors.

This is the key to structuring the parser: we will parse the whole expression by getting one product after the other, and adding them together. The products are parsed by getting one factor after the other and multiplying them together. When should the product parser stop? This is another key: when it finds a symbol it doesn’t recognize as continuing the product (such as a ‘+’ sign or the end of the sequence).
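
Written down as a grammar (my own informal notation, where { } means zero or more repetitions), the structure is:

Expression := Product { '+' Product }
Product    := Factor { '*' Factor }
Factor     := number

The three parsing functions below map one-to-one to these three rules.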

int TParser::ParseExpression()
{
  int iSum = 0;
  TToken tok;

  do
  {
    int iVal = ParseProduct();
    if (ErrorFound())
      return -1;

    iSum += iVal;

    // Peek at the next token: a '+' means the addition continues
    tok = m_pLexer->PeekNextToken();
    if (tok == TOK_ADD)
    {
      m_pLexer->EatToken();
    }
  } while (tok == TOK_ADD);

  return iSum;
}

int TParser::ParseProduct()
{
  int iProduct = 1;
  TToken tok;

  do
  {
    int iVal = ParseFactor();
    if (ErrorFound())
      return -1;

    iProduct *= iVal;

    // Peek at the next token: a '*' means the product continues
    tok = m_pLexer->PeekNextToken();
    if (tok == TOK_MUL)
    {
      m_pLexer->EatToken();
    }
  } while (tok == TOK_MUL);

  return iProduct;
}

int TParser::ParseFactor()
{
  TToken tok = m_pLexer->PeekNextToken();

  if (tok == TOK_NUM)
  {
    int iVal = m_pLexer->GetTokenVal();
    m_pLexer->EatToken();
    return iVal;
  }
  else
  {
    Error("Unexpected token");
    return -1;
  }
}

Ok, so now, why is it called a recursive descent parser? Clearly, the one above is not recursive at all. Now, suppose we want to expand the parser above to evaluate expressions with parentheses in them. Anywhere an integer used to appear, we must now allow a whole ‘nother subexpression, using both ‘*’ and ‘+’, enclosed between parentheses. This is simple enough to do: we only have to modify the ParseFactor function to accept what we just described:

int TParser::ParseFactor()
{
  TToken tok = m_pLexer->PeekNextToken();

  if (tok == TOK_NUM)
  {
    int iVal = m_pLexer->GetTokenVal();
    m_pLexer->EatToken();
    return iVal;
  }
  else if (tok == TOK_OPEN_PAREN)
  {
    m_pLexer->EatToken();
    int iVal = ParseExpression();  // Recurse into a full subexpression
    if (ErrorFound())
      return -1;
    if (m_pLexer->PeekNextToken() != TOK_CLOSE_PAREN)
    {
      Error("Closing parenthesis expected");
      return -1;
    }
    m_pLexer->EatToken();
    return iVal;
  }
  else
  {
    Error("Unexpected token");
    return -1;
  }
}

And now you can see how the parser operates recursively to parse complex expressions such as 3 + 4 * (6 + 1) or (((1 + 3) * 4 + 5) + 2) * 6, up to any nesting depth (or as far as the runtime environment will allow the stack to grow). The fact that the parsing functions stop at any unexpected token without consuming it allows them to stop when a closing parenthesis is reached, returning to the calling ParseFactor, which then “eats it up”.
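
To see the whole thing run end to end, a minimal driver could look like this (a sketch: the string-taking TLexer constructor and the TParser constructor taking a lexer pointer are assumptions for illustration):

#include <cstdio>

int main()
{
  TLexer lexer("3 + 4 * (6 + 1)");  // Hypothetical string-taking constructor
  TParser parser(&lexer);           // Hypothetical constructor storing m_pLexer

  int iResult = parser.ParseExpression();
  if (!parser.ErrorFound())
    printf("Result: %d\n", iResult);  // Prints: Result: 31

  return 0;
}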

Usually, parsers are written using general tools based on algorithms such as LALR, which replace explicit recursion with stack-based iterative methods. But recursive descent parsers are famed for their fast execution, and they are simple enough to write by hand.

In the next installment, I will cover how the code generation mechanism fits in.

7 Responses to “Compiler overview (I)”

  1. Tomás Fernández Says:

    Beautiful algorithm, so clear…
    I hope you post more explanations with code as this one.

  2. J Says:

Thanks Tomás! I didn’t mention it, but feel free to comment on what types of posts you find most interesting/useful/boring, and it will most probably affect my decisions on what to post about.

  3. Matt Says:

Very nice article with clear, elegant code examples. Good work fella!

  4. J Says:

Thanks Matt. Really glad you liked it. I wondered if anyone read these old posts 🙂

  5. Tarun Says:

    Hi J,
    I am afraid I am new to this site and couldn’t find your name.

Thanks for the nicely written article. Could you separate out your compiler-writing articles into their own section, if it’s not too much effort?

  6. J Says:

    Hi Tarun, my name is Jon. Thanks for the positive feedback. The closest thing to what you are asking for are the links in:

    http://www.ngedit.com/articles.html

Actually, the blog format is not so nice for content that can stay valid for quite some time. I may switch to another format together with a full site redesign I’m thinking about, but it will take quite some time.

I plan to cover more advanced compiling subjects in the next few months, with an emphasis on functional languages, so I suggest you keep an eye on blog.ngedit.com, where I’ll announce it.

    Thanks again,

    Jon

  7. RaiulBaztepo Says:

    Hello!
    Very Interesting post! Thank you for such interesting resource!
    PS: Sorry for my bad english, I’v just started to learn this language 😉
    See you!
    Your, Raiul Baztepo
