Archive for June, 2005

Main site updated

Monday, June 27th, 2005

I have updated the main site to a much nicer design. Now only the blog needs a face-lift. And please let me apologize for not posting in a whole week. I can tell you it’s been a heavy development week, delving really deep into hardcore COM programming!

I hope I’ll be able to post a new, interesting technical article during the week.

Compiler IV: the runtime environment

Sunday, June 19th, 2005

This is the last installment in the compiler series. I will be reviewing how NGEDIT implements the stack-based machine that executes the bytecodes generated by the compiler, and actually breathes life into the code written in the script language.

The first design decision is how to store the values that the runtime environment manipulates. For one, we have the variables that may hold values - and object members as well, which are just another kind of variable. As the other important group, we have the values that are stored in the stack.

We know that variables may hold a value of any type, which in the case of NGS means:

  • char: single byte signed quantities as in regular C/C++
  • int: native machine integer, that is, 32-bit, signed quantities
  • unsigned: same as the previous one, but unsigned
  • float: regular 32-bit floating point number
  • string: byte-per-character strings
  • object: reference to an NGS object
  • function: reference to an NGS function
  • built-in function: reference to a built-in, NGEDIT-provided function
  • nil: both a type and a value of its own, the “non-value”

There were some design decisions in coming up with this list. Even if the main editor interface is provided via objects, I decided not to provide a special type for built-in objects. Instead, I chose to use regular script objects with member functions initialized to built-in functions, and some “opaque” data to access the internal application data. This makes things simpler and fits nicely in the overall architecture.

I also decided not to provide a boolean type. Bytecodes that generate a logical boolean, such as comparisons, push either integer 1 or integer 0. Instructions that test a value (conditional jumps), take nil and any numeric zero as false, and anything else as true. As an important case, an empty string evaluates as true (you have to check its length if you want to check whether a string is non-empty). When adding values involving strings, which is implemented as string concatenation, nil values are promoted to the empty string, so string manipulation is quite comfortable.
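As a concrete illustration of these truth rules, here is a minimal C++ sketch; the type and function names are my own for illustration, not NGEDIT’s actual code:

```cpp
#include <cassert>
#include <string>

// Simplified stand-in for an NGS value; names are illustrative.
enum EType { T_NIL, T_INT, T_FLOAT, T_STRING };

struct Value {
    EType type;
    int i;
    float f;
    std::string s;
};

// The truth rules described above: nil and numeric zero are false;
// everything else, including the empty string, is true.
bool IsTrue(const Value &v)
{
    switch (v.type) {
        case T_NIL:    return false;
        case T_INT:    return v.i != 0;
        case T_FLOAT:  return v.f != 0.0f;
        case T_STRING: return true;   // note: even "" is true in NGS
    }
    return true;
}
```

This is what a conditional-jump instruction would call on the value it pops, before deciding whether to branch.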

Another decision I took was to initially support only byte-encoded strings. That means Unicode UTF-16 (almost the same as UCS-2, or what Windows simply calls “Unicode”) is not supported as a native NGS value. Even if the editor is fully enabled to handle all sorts of encodings, I didn’t want to overcomplicate the internal script system. Please take into account that this does not mean the script cannot manipulate text in other encodings - that is done via NGEDIT’s built-in functions; it only means NGS’s native variables cannot hold this type of text. I’ll probably add a second string type to support Unicode strings in the future.

How does one implement these values? Via a good ol’ union inside a C++ class, which separately holds the “type” of the value:

enum EAtomType
{
  SA_NIL,
  SA_CHAR,
  SA_INT,
  SA_UNSIGNED,
  SA_FLOAT,
  SA_BUILTIN_FN,  // Built-in global functions
                // (::MessageBox(), ...)
  SA_FUNCTION,    // Identifies a script function
  SA_STRING,
  SA_OBJECT,
};

class TAtom
{
  public:
    TAtom () { m_Type = SA_NIL; }
    ...
    void SetFloat( float fVal )
    {
      m_Type = SA_FLOAT;
      m_Data.f = fVal;
    }
    ...
    EAtomType   GetType     () const
      { return m_Type; }
    char        GetVal_char () const
      { ASSERT(m_Type == SA_CHAR); return m_Data.c; }
    ...

  private:
    EAtomType m_Type;
    union
    {
      char     c;
      int      i;
      unsigned u;
      float    f;
      char     *psz;
      SObject  *pObj;
    } m_Data;
};

You get the idea. Each value then occupies exactly 8 bytes. 3 of them are wasted (the type tag only really needs one byte), but that’s the price of alignment.

The operand stack, then, is simply an array of TAtom values together with a pointer to the TOS element. Variables are just TAtom’s. The members of an object or the elements in an array are just TAtom’s.

By the way, pardon my French in borrowing the ‘atom’ and ‘nil’ terms from Lisp terminology. ‘Atom’ was quite clear to me, and I preferred ‘nil’ over ‘null’, given that it represents a non-value rather than a null pointer.

As I have already commented, I will probably evolve NGS towards Javascript so some of this terminology will probably change.

So, now that we know how to structure the stack and the array pool, we just need to start interpreting the bytecodes. From the rest of NGEDIT, a script function is invoked with a call similar to this:

  TScrExec exec;
  TScript *ps = FindScript(...);
  TAtom arg, retval;

  arg.SetInt(key.Code);
  exec.AddArg( &arg );

  arg.SetInt(key.ModifKeyState.Val);
  exec.AddArg( &arg );

  exec.InvokeFunction(
    ps,
    ps->FindFunction("IProcessChar"), NULL, &retval
  );

  return CastInt(retval) ? true : false;

TScrExec is the class that executes a script function by interpreting its bytecode. Right now, scripts are non-concurrent, and as such a function runs until it returns or generates a “trap”. I will be adding a concurrent script engine, which will help in using the scripts more efficiently and, importantly, in being able to debug them step by step.

Leaving aside the interfacing code to pass in the arguments and get the result back, the core function is InvokeFunction. It is the function that actually performs the work. How does interpreting the bytecodes work? Quite simply: we keep a pointer into the bytecode stream pointing to the next instruction to execute, and perform a large switch statement on the bytecode, which branches into the code to execute each instruction. Each instruction is simple enough to be quite straightforward to execute. Push, pop, dup, etc… Quite a lot of error-checking goes in here, as we have to take care that the stack doesn’t underflow, that invalid codes are not present in the bytecode stream, that operations are not performed on values of the wrong type, etc.:

TRet TScrExec::InvokeFunction(
  TScript *ps, unsigned uFn,
  const TAtom *pThisVal, TAtom *pRetVal
)
{
  // Push 'this' and args on the stack
  ...

  // Prepare the run context
  ...
  m_pRC->uPC = 0; // Next instruction pointer
  ...

  // Execute instructions
  while (!IsFinished())
  {
    ExecInstr();
  }

  ...
}

void TScrExec::ExecInstr()
{
  // Trap on invalid address
  if (!IsValidPC())
  {
    Trap( TRAP_INVALID_ADDRESS );
    return;
  }
  ...

  // Read the opcode
  byte opcode = pCode->GetByte( m_pRC->uPC );

  // Immediate handling & decoding
  unsigned immediate = 0;
  unsigned imm_bytes = GetByteCodeImmBytes ( opcode );
  unsigned instr_bytes = imm_bytes + 1;

  if (imm_bytes)
  {
    ...

    // Read immediate
    if (imm_bytes == 1)
      immediate = pCode->GetByte( m_pRC->uPC + 1 );
    else if (imm_bytes == 2)
      immediate = pCode->GetWord( m_pRC->uPC + 1 );
  }

  // Advance PC to next instruction
  m_pRC->uPC += instr_bytes;

  // Instruction execution
  switch  (opcode)
  {
    case BC_NULL:
      Trap( TRAP_USER );
      break;

    case BC_PUSH_CONST:
    {
      TAtom atom;
      ps->GetConstant(&atom, immediate);
      m_Stack.Add(atom);
      break;
    }

    ... Really long switch statement ...

  }
}

There is one thing we haven’t considered here: functions in the script will be calling other functions. Values are passed to those functions on the stack, so it’s just a matter of pushing the arguments before calling the function. The return value is returned on the stack as well, so that part’s simple. But we need to store the return address in order to know where to jump when we hit the RET bytecode in the called function.

Apart from the need for the return address, we need to reserve space for local variables. Sure, we could use the regular stack for them, but that would make code generation a bit more complex. What we do is we have a separate “run-context” stack. A run-context holds the return address, the local variables, and some other useful info. When we find a CALL instruction, a new run-context is generated and pushed on what I call the “call-stack”. The variable m_pRC shown above always points to the top of the run-context stack.
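A hypothetical sketch of that run-context stack might look like the following; the names and layout are my own guesses based on the description above, not the actual NGEDIT source:

```cpp
#include <cassert>
#include <vector>

struct Atom { int type = 0; int i = 0; };   // simplified value

// One run-context per active call: the saved return address plus
// room for the callee's local variables.
struct RunContext {
    unsigned uRetPC;
    std::vector<Atom> locals;
};

struct CallStack {
    std::vector<RunContext> rcs;

    // m_pRC in the post always points at this top context.
    RunContext *Top() { return rcs.empty() ? nullptr : &rcs.back(); }

    // CALL: push a fresh context recording where RET should resume.
    void Call(unsigned uRetPC, unsigned nLocals)
    { rcs.push_back(RunContext{ uRetPC, std::vector<Atom>(nLocals) }); }

    // RET: discard the context and hand back the saved PC.
    unsigned Ret()
    {
        unsigned pc = rcs.back().uRetPC;
        rcs.pop_back();
        return pc;
    }
};
```

Keeping locals here rather than on the operand stack means the code generator can address a local by a simple index into the current context.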

There is one last thing we haven’t talked about: the calling convention. Even though we have already said that the calling function pushes the args on the stack, and the called function pops them and leaves the result on the stack, there are some details that have to be taken care of. For one, we need to decide in what order the arguments are pushed: is the leftmost one pushed first or last? Where does the implicit “this” pointer get pushed?

Given that NGS handles values, and functions are called via pushing references to them on the stack (maybe even taken from members of objects), there is in general no easy mechanism to check at compile time whether the number of arguments passed matches the number expected by the called function. We could check it at runtime and fail if the number is not right. I preferred to implement a calling mechanism that also pushes the actual number of arguments passed on the stack, after pushing all the regular arguments. The called function can then use that number to know how many arguments have actually been passed and pop them into local variables. If more arguments have been passed, the extra ones are ignored. If fewer than the expected number have been passed, their corresponding local variables are set to nil. The ENTER instruction, which is the first instruction in every function, does exactly this.
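The ENTER behavior just described can be modeled with a small sketch. All names are illustrative, and it assumes the leftmost argument is pushed first, which the post leaves open:

```cpp
#include <cassert>
#include <vector>

struct Atom {
    enum EType { NIL, INT } type = NIL;
    int i = 0;
};

// Models the ENTER step: on entry the stack holds the args (leftmost
// pushed first -- an assumption) with the argument count on top.
// Extra args are dropped; missing ones leave their locals as nil.
void Enter(std::vector<Atom> &stack, std::vector<Atom> &locals,
           unsigned nExpected)
{
    unsigned argc = (unsigned)stack.back().i;
    stack.pop_back();
    locals.assign(nExpected, Atom{});          // all locals default to nil
    for (unsigned n = argc; n-- > 0; ) {       // pop rightmost arg first
        Atom a = stack.back();
        stack.pop_back();
        if (n < nExpected)
            locals[n] = a;                     // extras silently ignored
    }
}
```

Note that the stack is left clean no matter how many arguments the caller pushed, which is the whole point of pushing the count.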

This is the last part of our review of NGEDIT’s scripting architecture. I hope it has been worthwhile as an overview of how a full scripting system is developed. I will cover some other compiling issues in the future, but I will also be covering other general editor areas such as text memory management, etc.

Feel free to comment or ask about whatever you are interested in.

Programming productivity

Thursday, June 16th, 2005

I always keep an eye on my own productivity. Don’t misunderstand me, I’m no productivity freak. I don’t buy into all the “Getting Things Done” or Steve Pavlina-style productivity obsession based on pushing yourself. Years ago, I would feel guilty about having spent too much time watching TV, reading, or generally procrastinating. I spent way too many years in the vicious cycle of not getting things done, not forgiving myself, feeling guilty about it, and falling into the same hole again and again.

(BTW, do read the link above if you haven’t yet, it’s one of the funniest things I’ve ever read.)

Later, I’ve learned that, when you don’t do something, there is always a reason. And your best possible effort is to discover it - not to put pressure on yourself to crush the reason. That’s just doing stuff over and above your own instinct, a nasty thing to do.

The most common reasons I’ve found are: (1) I didn’t really want to do it, or (2) I was afraid of something.

When you discover that you really don’t want to do something you’re trying to push yourself into doing, the best thing you can do is take the steps to actually not do it. If there are other people involved, you probably have to tell them you are not going to do it. If the task was the center of your life, you probably have to discover what you actually want to do with your life (a tough task). Sometimes the implications are so overwhelming that your best choice is to simply do the thing, for other reasons. But then, you know why you are doing it, and you won’t be punishing yourself for not doing with pleasure what you really don’t want to do.

The other very common reason is that some fear is pushing you back and making the unconscious decision that it’s better not to tackle whatever it is, lest whatever the fearful thing may happen. Sometimes it is fear of failing, and discovery of failure would be much worse than just having the task lying around. Sometimes it is fear of others seeing you fail. Anything.

I am digressing, I wanted to focus on the feelings that come with programming.

The deal is that, even when you are actually pursuing a goal you really want to pursue, there may be a myriad of little tasks in the way that are not satisfying by themselves, but you still want to do them for the goal you will be getting towards. In programming, I often find out that a lot of the things I have to do are not really a pleasure, it’s the final goal that matters to me. I’m the kind of impatient guy, and I think much faster than I code (who doesn’t?), so I have a lot of conflicting feelings when programming.

Having to deal with the Windows API is the typical part that I really hate, so tasks that involve it tend to get done later than others. But not only those: almost every programming task creates some feeling of discomfort. I actually have to drag myself to the actual coding, which is much less fun than thinking up the design of things. I think programming tools will evolve so that programmers get more of the fun part and less of the nitty-gritty-nasty-details part (think: assembler programming was much worse than C/C++ programming, and most programmers can get a lot more work done just by not having to worry about which registers things end up in).

I can attest that I do this dragging quite successfully, as I’m averaging at about 300 lines of code per day for the past few months, which is quite a lot. And the code is of quite high quality, not that I’m writing a mess - I can produce that much because I am quite eager to have all the core of the editor working in order to start with the innovative stuff. But even then, I find I have to push myself into doing things every day. Reading JoS forums is too often a much more interesting temptation.

Anyway, I tried an experiment yesterday and it has worked nicely. There were three tasks that I had to accomplish in order to get to the next step. Vi/Vim emulation is mostly complete for my everyday usage (complete vi emulation is a really large task and I don’t need every single little command), and after I get to my desired level of emulation completeness I’ll be moving on to other areas of NGEDIT development. The three tasks were:

  • Smart input buffer: in vi, you enter a special “input” mode whenever you are going to add new text. The ‘.’ command later on lets you repeat whatever you input (inserted characters, replaced characters, erased characters, etc). I use that continuously.
  • Named clipboards (“registers” in vi slang): you use them in vi much more often than in regular editors, as they are very straightforward to use, among other reasons. This feature is useful not only for the vi interface, but for the normal editor interface as well. More importantly, the clipboard in the editor had been “internal” until now, with no interaction with the Windows clipboard, and this task involved starting to deal with the Win32 API.
  • Selection types: just the regular character/line/column types of selections for copy/pasting. Character selections were implemented a long time ago, but I really need the other types for everyday usage.

I was looking at this as a somewhat steep hill to climb, with three separate “pushing myself’s”, one for each task.

Instead, I tried an alternative: I would start the three tasks at once, cutting through the need for three separate starts. I would get to the VimEmu.ngs script, implement the script side of all three, and then go through the other areas of the application source and get the three of them done in each part. This would mean that the application wouldn’t compile for some time, but I know that compiling and testing a minor thing is often an escape from actually confronting the programming task at hand.

How did it go? I think it’s gone quite well. The features are not finished yet, but the smart input buffer was completely implemented yesterday (if not working perfectly). The named registers (clipboards) are mostly working and code to write to / read from the windows clipboard is there (even if not all cases of text encoding conversions are working). And the selection types are lacking the core low-level implementation (actually displaying the different selections and actually reading/writing the right text when copy/pasting), but all the framework including communication with NGS is there.

I’m hoping I can get a decent chunk of that finished today. And I think I gained some time, and avoided some dismay, by “parallelizing” the starting-up conflict of the three.

PS: Does anyone know why on earth non-platform-native text on the Windows clipboard is characterized not by the codepage but by the locale? It seems that if I have text in the codepage used in Russia, I can’t paste it into the Windows clipboard unless I look up what their currency or date format is. And no, I don’t yet know how to get from the codepage to the specific locale - I’m using Unicode for copy/pasting in those cases.

What’s broken in the WM_KEYDOWN/WM_CHAR input model?

Monday, June 13th, 2005

I was planning on writing a rant on the input model of the Win32 API. See, while developing the input model for NGEDIT, I’ve been pondering how to implement the key-mappings, input handling, etc. Although I had used the API in the past, it wasn’t in an edit-intensive, fully customizable, internationally targeted program (although I’ve implemented IME-based chat for China, Japan and Korea :) ). The more I looked at the model and thought about how to handle it, the more it seemed a complete mess in my head. But I could not pin down exactly what was wrong, or what a better model might be.

I set out to gain a complete understanding of the model, which I will describe here for whoever needs it while programming (I couldn’t find a good explanation that made sense - of course, MSDN is of no help in this kind of situation).

The Win32 model works like this: when a key is pressed, you get a WM_KEYDOWN message (more if the key is kept depressed past the auto-repeat time). When it is released, you get a WM_KEYUP. You can access the state of the modifier keys in that moment (SHIFT, ALT and CONTROL). Both messages indicate the pressed key with their virtual key code: an integer code which uniquely identifies a key on the keyboard.

Apart from this, as messages get translated by the TranslateMessage() Win32 function, they are checked against the current keyboard configuration, and an additional WM_CHAR message is sent for each WM_KEYDOWN, carrying the current-codepage character code of the key pressed.

This is all fine and dandy, although things start getting tricky after this.

  • In non-US keyboard mappings, or in the US international keyboard, you can use some keys to modify the character typed afterwards with an accent. If you press one of these “accent” keys, a WM_DEADCHAR is generated that you’d probably better ignore, and the next WM_KEYDOWN of a vowel will generate a WM_CHAR message with a different char code representing the accented character.
  • When you want to handle a key in your program, you have a choice between handling the WM_KEYDOWN or the WM_CHAR message. Keys such as the cursor arrows or function keys don’t have a character equivalent, so you need to handle WM_KEYDOWN. For regular character keys that are used for typing, you probably prefer to handle the WM_CHAR message, as it will get you either the lower-case or upper-case version, or even the accented version in international versions. A doubt always arises about what to do with other keys which generate both a WM_KEYDOWN and a WM_CHAR: what to do with TAB, RETURN, ESC, etc…? Even worse, the DEL key doesn’t generate the ASCII DEL (127) WM_CHAR message that you would expect - it is a “mute” key WM_CHAR-wise. I think I remember Charles Petzold mourning the same decision in his Programming Windows book - I don’t remember what conclusion he reached, but I remember it didn’t satisfy me.
  • The virtual key code numerically matches the uppercase ASCII code of the character on the keyboard, and has special codes for cursor arrows, function keys, and some weird keys with names like VK_OEM_1 (for the ‘tilde’ key in a US mapping). A key such as VK_OEM_1 is mapped to an actual alphabetic character in non-English European languages, making it difficult to distinguish alphabetic from non-alphabetic WM_KEYDOWNs.
  • Windows nicely throws a different version of the messages at you if the ALT key is pressed at the same time: you get WM_SYSKEYDOWN and WM_SYSCHAR instead of the regular WM_KEYDOWN and WM_CHAR. You also get WM_SYSKEYUP for the sake of consistency. These messages usually get handled by DefWindowProc() in order to bring up the corresponding menu item and some special key processing. Curiously, pressing the ALT key by itself and releasing it afterwards buys you a WM_SYSKEYDOWN/WM_SYSKEYUP pair for the ALT key. But if you press ALT, then a character key, and release both, you will get all WM_SYS* messages, except for the KEYUPs of keys released after the ALT key (and including it!), which generate regular WM_KEYUPs. The F10 key is a “special key” as well, generating WM_SYS* versions of its KEYDOWN - I guess because it brings up the menu as well, and I’m unsure about whether its KEYUP is SYS or not. No wonder the virtual key code of the ALT key is VK_MENU.
  • I couldn’t obviously derive from MSDN whether having CTRL or SHIFT pressed at the same time as ALT would buy me SYS messages or plain vanilla messages.
  • European and US-int’l mapping treat the right ALT key as a different AltGr key, that buys some extra characters that are not placed in regular positions. This key is sent to the application as having both CTRL and ALT pressed at the same time, and the system takes care of mapping the WM_KEYDOWN to the appropriate char code for the WM_CHAR message.
  • If you are interested in your Asian customers, you have a new world of IME messages to serve you. It opens a whole new avenue for spec nightmare - suffice it to say that you will be receiving two WM_CHAR messages in a row for some complex kanji or hanzi or hangeul double-byte character, corresponding to no keypress, as the user selects complex characters from his/her IME window. Your US-only-thought application may even work if you just don’t shuffle your WM_CHAR’s too much.

After writing the list, I’m looking back at the title of the post and wondering if it’s really a good question. Anyway.

Coming back to development, I keep a “todo list” of vi/vim commands that need implementation. It happens that the innocent “Redo” command in vi/vim is achieved by the Ctrl-R key combination. There I went and mapped the ‘r’ WM_CHAR with the CTRL modifier set to the redo command, pressed Ctrl-R and… nothing. I checked what the program was receiving, and found out that Windows was translating the VK_R keypress with CTRL into the proper C-R ASCII code, which happens to be 0x12, or 18 in decimal.

At the moment, I just hated the fact, and the memory of how input behaved in previous projects (which was not as clear as I have described above) kicked in, and had me for days happily implementing all sorts of other vi/vim commands that didn’t require handling the !~@# control key. I was working, but in some way, I was procrastinating :) Seriously, this is the kind of stuff that keeps me from tackling a task.

After a few days, I finally decided to drag myself to it, and started researching all the intricacies of key input, dumping the results, and trying to build a clear mental image - I found out some stuff that I didn’t know and haven’t seen anywhere on the web (of course, MSDN is less than complete).

Here is a nice list of the messages that the R key can generate in its different combinations:


            nothing   CTRL    ALT    CTRL+ALT
  nothing    KD+'r'  KD+^R  SKD+S'r'    KD
  SHIFT      KD+'R'  KD+^R  SKD+S'r'    KD

The row selects between having Shift pressed or not. The column selects the combination of CTRL and ALT that is pressed. ‘KD’ represents the WM_KEYDOWN message, SKD represents WM_SYSKEYDOWN. An ‘x’ character represents the WM_CHAR message, and S’x’ represents WM_SYSCHAR. ^X means the ASCII C-X control code in WM_CHAR. WM_KEYUPs are not shown. And of course, the char code on the left is toggled depending on the CAPSLOCK status (I’m unsure about what happens when CAPSLOCK is on and ALT is used together with the key, although your program’d better not behave too differently).

Other non-alphabetic keys get similar mappings, although they don’t get a WM_CHAR when pressed together with CTRL (in most cases, as we’ll see in a moment).

I set out to complete my understanding of the mapping, and wondered what would happen with other keys. It turns out Windows goes to great lengths to produce the ASCII code that may correspond to the keypress (in some cases quite dubious ones, to my understanding of affairs).

All (English) alphabetic keys (from A to Z, all included) get mapped to their control codes (ASCII 1 to 26). That means that if you press Ctrl-M, your program will receive a nice WM_CHAR message with code 13, usually corresponding to the RETURN key. Be sure to filter that out if you provide Ctrl+Key mappings! It even works as a replacement for RETURN in unsophisticated programs such as Notepad.
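This Ctrl+letter rule is easy to model. A tiny helper of my own (not a Windows API) that computes the WM_CHAR code delivered for CTRL plus an uppercase letter key:

```cpp
#include <cassert>

// Models the rule above: CTRL plus a letter key delivers WM_CHAR with
// that letter's ASCII control code (A -> 1 ... Z -> 26). Illustrative
// helper only; no such function exists in the Win32 API.
int CtrlChar(char keytop)
{
    if (keytop >= 'A' && keytop <= 'Z')
        return keytop - 'A' + 1;   // ^M == 13, colliding with RETURN
    return 0;                      // other keys not modeled here
}
```

Running it for ‘R’ gives 18, exactly the 0x12 that surprised me above.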

Then I set out to find out if Windows would generate the rest of the control codes - poking around, looking at ASCII charts in more detail that I’d like to, I found out that you can generate WM_CHAR with ASCII 28 with C-\, ASCII 29 with C-], ASCII 30 with C-S-6 (which is more like C-^), and ASCII 31 with C-S-hyphen (which should be read as C-_). I usually use a non-US keyboard, for which I can tell you that the ASCII mapping is a bit lousier than for the US mapping, as the ‘\[]^’ symbols are moved around to other keys but Windows still performs the mapping as if the keyboard had US keytops, except in the case of DEADCHAR generating keys which simply behave differently…

The ASCII code 27, that is, ESC, can be both generated by the ESC key and by the C-[ combination.

And I also discovered some curious behavior: RETURN by itself generates WM_CHAR with ASCII 13, while C-RETURN generates WM_CHAR with ASCII 10. I was about not to tell you that C-S-RETURN generates no WM_CHAR, in order to get you lost if you aren’t already :)

And a mysterious ASCII code, DEL (127), was conspicuously absent from the keyboard - the DEL key would not generate it with any combination of modifier keys. But finally I found it: while BACKSPACE generates ASCII 8 (BS), combined with the CTRL key it generates the famous ASCII 127 DEL character.

Wow… I sincerely hope this explanation turns out useful or interesting for someone. I would’ve been happy to find it myself a couple of weeks ago.

With this understanding, I could finally go and map the Ctrl-R key to the Redo command, go around binding other keys, and move on. I could also comment on it with geek friends and all have a good laugh. But something was still dangling in my mind. Apart from being awkward, the system seemed flawed, but I could not pinpoint why.

Even if this is all a mess, and disregarding the Far East input stuff, I was thinking that I had to come up with a better design if I was to rant about it. Of course, the SYS nightmare should be easy to fix in a new design (the messages should be the same, and the fact that ALT or F10 or an ALT shortcut brings up the menu should be stuffed somewhere else), but something seemed to be fundamentally wrong and I could not grasp it.

Finally, I realized today.

The input model loses one piece of critical information: because WM_KEYDOWN and WM_CHAR are separate and completely unrelated, the concept of a keypress is lost. And there is no way of bringing it back.

See, having the system “cook” some type of information, such as the character mapping of the key if the key has a sensible one, is ok. Receiving extra information is great and the system will do a much better job of supporting all national specifics than the app. Even if that information is “cooked” in a weird way, such as the SYS stuff (or even the DEADCHAR stuff), it is ok and you could just cope with it. But you dearly want to be able to handle a keypress in some way, either using or ignoring the “cooked” information or just the raw virtual key code.

Realizing this, I could finally think up what a better scheme would be. Wouldn’t it be great if Windows provided a WM_KEYPRESS message including all the information? Say, the virtual key code, the fully-cooked ASCII mapping that WM_CHAR gets, some sort of indication of a DEADCHAR keypress, the state of the modifier keys, maybe even the uncooked keytop ASCII char if that exists?

Probably, Asian IME could generate keypresses with a null virtual key code to send weird characters to un-aware applications just handling input, but making it clear that no key was pressed for that character to be sent to the application, and the application would not try to match it up with keyboard mappings. Key maps would always refer to virtual keys, and text input would always just take the fully-cooked ASCII component. And we would all be happy.
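Just to make the proposal concrete, the payload of such a hypothetical WM_KEYPRESS might look like this. This is purely an illustration of the idea; no such message or struct exists in the Win32 API:

```cpp
#include <cassert>

// A sketch of what the proposed WM_KEYPRESS could carry. Hypothetical:
// invented for this post's proposal, not a real Windows structure.
struct KeyPressInfo {
    unsigned vkCode;       // raw virtual key; 0 for IME-synthesized chars
    unsigned charCode;     // fully-cooked character, if the key has one
    bool     bDeadChar;    // true if this press only composed an accent
    bool     bCtrl, bAlt, bShift;
    unsigned keytopChar;   // uncooked keytop ASCII, if one exists
};
```

Key maps would match on vkCode plus the modifier flags, text input would consume charCode, and neither would ever need to correlate two separate messages.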

The good thing is that this could have been implemented in any moment, as adding a new message does not create incompatibility. But we are now stuck with the really weird WM_KEYDOWN/WM_CHAR model.

I won’t hold my breath until Microsoft includes something along these lines :)

Compiler overview (III)

Sunday, June 12th, 2005

Hello back. I have been very busy working on NGEDIT during the past days, and it has taken a while to get back to posting to the blog. Development is advancing nicely: VimEmu (the vi/vim emulation layer) is working quite well and I have even started using it for everyday editing. There are a couple of major parts yet to be done (namely, named registers and characterwise, linewise and blockwise selection), but most of the core parts are working very well. Well, actually not right now, as I have done a major restructuring of the VimEmu.ngs script and it’s not working at the moment, but it was just before the restructuring. VimEmu is already a 1,500-line script, which is also doubling as the best testing base for the scripting core.

In order to work on VimEmu, I have integrated into NGEDIT some hotkeys to swap the InputProcessor (the name for the scripts that can process user input), reload it, compile the current file and signal the errors. I have also added the ability to dump the contents of the full undo stack to a file and to load and play it back. If bugs appear, this is a great way to save and inspect a way to reproduce a given bug.

Given the host of comments regarding a new, incompatible language, I have been researching a bit more about Javascript. Although I won’t do this for version 1.0, it is quite possible that I will evolve NGS into Javascript after that, which will be nicer for folks who want to extend the editor. NGEDIT will have an extension system similar to Firefox and the like, such that extensions can be easily developed and deployed. Regarding interesting Javascript sites, check out DHTML Lemmings.

I have to correct one of my previous points: in fact, in Javascript you have to qualify member access within member functions by using “this.” before member access. This is good, as a mistyped variable access won’t be interpreted as a possible member access and go unflagged as a compile error. It is the same that I’m doing in NGS. Apologies for the mistake.

Last time, we saw how to generate code for an expression which involves no assignments. Today, we are going to see what to do with expressions involving an assignment, and delve into the wonderful world of l-value and r-value parsing.

Anyone with programming experience has an intuitive grasp of what l-values and r-values are. Consider for example the assignment a = 5; . This is a fairly easy to understand piece of code for a programmer of almost any C-descendant language (it would be a := 5; in other languages, but they are not that ubiquitous nowadays). Something such as a = b; is also easy to understand. But a piece of code such as 5 = a; will raise an alarm for almost anyone: you can’t assign a value to 5. The same happens with a+b=c;. What this means is that we understand that there are expressions which are valid both on the right and the left side of an assignment (such as variable_name) and others that are valid only on the right hand side, because you can’t assign to them (such as 5 or a+b). These concepts are explicitly handled in compiler writing, and they are formally called l-value and r-value expressions. Actually, many compilers will flag an error for the 5 = a case by issuing an "Error: '5': not an l-value" diagnostic. Thus there are two classes of expressions: those that can only be on the right side of an assignment (r-values) and those that can be on both sides (l-values). Take into account that the operand of ++ or += also has to be an l-value, as these operators also assign something to it.

So, how do we go about this in parsing a language such as NGS? For starters, let’s assume that we only allow assignments as separate statements, not within other expressions (function calls, etc…). The generalization will be easy afterwards, but this limitation makes things easier to explain.

When we are parsing a function or code block, we have to parse one statement after the other and generate the code for each. The best thing to do is to peek at the first token in the statement, and then do different things depending on the token:

  • If it’s an if token, go to a separate subroutine to parse the if-expression, the code-block, and possibly the else-block.
  • If it’s a while token, go to a separate subroutine to parse its parts.
  • Idem for other keyword-marked language constructs like return, for, etc…
  • If it’s an identifier, go to the assignment or function call parsing routine.

This ease comes from the fact that language keywords are “reserved” and can’t be used for variable names (in C, C++, etc.). This makes writing the compiler easier. This is not true for all languages; for example, the following is a famously valid PL/I statement:

IF THEN THEN THEN = ELSE; ELSE ELSE = THEN;

You can imagine this language is more difficult to parse than C (C++ has its own, different difficulties).

By the way, I took this sample from the famous Dragon book (Compilers: Principles, Techniques and Tools, by Aho, Sethi and Ullman). It’s the book on compiler foundations, a 1986 book named after the dragon on its cover, which stands for compiler complexity - and if you are interested in compilers you could do much worse than check it out. Be advised that it is heavily theoretical, and you would probably be grateful to complement it with a more practical one such as Holub’s Compiler Design in C.

Back to where we were, we saw how to get into our assignment parsing routine quite easily: a statement starting with a user-provided identifier is deemed to be either a function call, an assignment, or a mistake on the part of the programmer.

In a simple procedural (that is, non-OO) language, with no support for structs, the statement we are looking at would only be one of the following two constructs:

variable = expr;

or

function(expr, expr, …);

We already saw last week how to parse the right-hand-side expressions and generate code to push them on the stack, so I hope that part won’t scare you.

Regarding the rest, the strategy to parse this would be simple: save the first token, and peek the next one. If it is the assignment operator, we have to parse the right hand side expression and then emit the code to store the top of the stack (TOS) in the right variable. If it is an opening parenthesis, we have to parse a sequence of expressions separated by commas (emitting the code for one after the other will leave them nicely ordered on the stack) and then emit the code to call the appropriate function.

For example, code such as a = b + 2; could generate the following stack-machine code:


PUSH_VAR 'b'
PUSH_CONST '2'
ADD
POP_VAR 'a'

In both Java VM and CLR (.NET VM) parlance, the last instruction, which pops the TOS into a given variable, would be called a STORE; I tend to use the word POP, but the function is the same. In the same way, both the Java VM and the CLR call LOAD what I call PUSH_VAR.

The first three instructions come from the parser we saw last week, which evaluate the b + 2 part. The last instruction is the tricky one we have to generate by recognizing the left-hand-side of the assignment.

For another example, code such as function(b + 2, c); could generate the following stack-machine code:


PUSH_VAR 'b'
PUSH_CONST '2'
ADD
PUSH_VAR 'c'
PUSH_FN 'function'
CALL

Of course, function calls should be allowed on the right hand side of assignments, so we had better include their parsing in the ParseFactor() function of the parser described in the previous post - but we can leave that for later and only allow them as separate statements for now.

How would the function to parse a statement look? Something similar to this:

void TParser::ParseStatement()
{
  TToken tok = m_pLexer->PeekNextToken();

  switch (tok)
  {
    case TOK_IF:
      ParseIfStatement(); break;
    case TOK_WHILE:
      ParseWhileStatement(); break;
    //...
    case TOK_ID:
      ParseAssignmentOrFunctionCall(); break;
    case TOK_SEMICOLON:
      /* Empty statement */ break;
    default:
      Error("Unrecognized statement"); break;
  }
}

void TParser::ParseAssignmentOrFunctionCall()
{
  // Get the identifier name
  const char *psz = m_pLexer->GetTokenID();
  m_pLexer->EatToken();

  TToken tok = m_pLexer->PeekNextToken();

  if (tok == TOK_ASSIGNMENT)
  {
    m_pLexer->EatToken();

    // Parse the right-hand-side and
    //  emit code to leave it on TOS
    ParseExpression();
    if (ErrorFound())
      return;

    // There should be a semicolon...
    Require(TOK_SEMICOLON);
    if (ErrorFound())
      return;

    ushort uVar;
    if (m_pEmitter->FindVariable(psz, &uVar))
    {
      EmitCode(OP_POP_VAR);
      EmitWord(uVar);
    } else {
      Error("Unrecognized variable name");
    }
  }
  else if (tok == TOK_OPEN_PAREN)
  {
    m_pLexer->EatToken();

    // Parse all arguments in order and push them
    while (
      m_pLexer->PeekNextToken() != TOK_CLOSE_PAREN
    )
    {
      ParseExpression(); // Parse the argument
      if (ErrorFound())
        return;

      // Arguments are separated by commas; the
      //  last one is followed by the close paren
      if (m_pLexer->PeekNextToken() != TOK_COMMA)
        break;
      m_pLexer->EatToken();
    }

    Require(TOK_CLOSE_PAREN);
    if (ErrorFound())
      return;

    // There should be a semicolon...
    Require(TOK_SEMICOLON);
    if (ErrorFound())
      return;

    ushort uFn;
    if (m_pEmitter->FindFunction(psz, &uFn))
    {
      EmitCode(OP_PUSH_FN);
      EmitWord(uFn);

      EmitCode(OP_CALL);
    } else {
      Error("Unrecognized function name");
    }
  } else {
    Error("Unrecognized statement");
  }
}

Ok, so now that we are done compiling simple assignments, we are ready to see what goes on in a more complex language design involving structs, or even objects with member functions, etc….

If we have variables which can be objects or records, there are other types of l-values apart from simple variable names: we can access an object’s members by name by using the ‘.’ (dot) operator. It can be ‘chained’, such that expressions like Parser.PartialInput.str are valid l-values, meaning the field ’str’ of the ‘PartialInput’ member of the ‘Parser’ variable. As with all l-values, such an expression can also appear as an r-value (if you can assign something to an expression, surely you can read it and use it in other expressions).

The key in the design of our stack machine (the same as in the Java VM, but only to some extent in the CLR VM) is that we can only push values on to the stack, and if we need to store something in a place, there is some kind of special instruction to pop values from the stack there. POP_VAR seems quite easy, popping the TOS into a variable. In NGS, actually, I have POP_MODVAR to pop into a module variable and POP_LOCVAR to pop into a function’s local variable. This or a similar distinction also exists both in the Java VM and Microsoft’s CLR VM.

What to do with objects? The first thing is to decide how to operate on them - the solution in many cases (NGS among them) is to allow the stack or general variables to contain references to objects. These are internally just some kind of pointers to the objects. Thus, when a variable holding an object appears in a right-hand-side expression, it is pushed as a normal variable, which pushes a reference to the object onto the stack. If this is assigned to another variable, then we have both variables referencing the same object. This is how it works in Javascript. And if we want to make an actual copy of an object, we have to make it explicit in the code in some way. This is very different from C or C++, where if you assign a = b;, a and b being struct variables, the whole contents are copied over.
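As an aside, this reference semantics can be modeled in C++ with a shared_ptr. The following is purely illustrative (the names and the map-based object are made up, not NGS’s internal representation):

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// A toy model of NGS/Javascript object variables: the variable holds
// a reference (here, a shared_ptr), so assignment copies the
// reference, not the object's contents.
using Object = std::map<std::string, int>;
using ObjRef = std::shared_ptr<Object>;

// Returns true if, after a 'b = a;' style assignment, both variables
// observe the same underlying object.
bool AssignmentShares()
{
    ObjRef a = std::make_shared<Object>();
    ObjRef b = a;        // copies the reference only
    (*b)["x"] = 42;
    return (*a)["x"] == 42;
}

// Returns true if an explicit copy of the contents is independent,
// as with C/C++ struct assignment.
bool ExplicitCopyIsIndependent()
{
    ObjRef a = std::make_shared<Object>();
    (*a)["x"] = 42;
    Object c = *a;       // explicit copy of the contents
    c["x"] = 7;
    return (*a)["x"] == 42;
}
```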

In order to handle objects in our stack machine, we just need two new instructions: WRITE_MEMBER and READ_MEMBER. READ_MEMBER expects to find the object reference on TOS, and the member-identifying string could be either on the stack or an immediate value in the bytecode stream (both approaches work, and this is a design decision regarding the VM - in NGS it is stored in the byte stream). On the other hand, WRITE_MEMBER expects to find the object reference on the top of the stack and the value just below it.
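To make the stack discipline of these two opcodes concrete, here is a minimal C++ sketch of their behavior. The types and names are illustrative assumptions, not NGEDIT’s actual implementation:

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <vector>

// A toy value: either an integer or a reference to an object
// (an object is just a name->value map here). Illustrative only.
struct Value;
using Object = std::map<std::string, Value>;

struct Value {
    int iNum = 0;
    std::shared_ptr<Object> pObj; // non-null when it's an object ref
};

// READ_MEMBER: pops the object reference on TOS, pushes the value
// of the named member.
void ReadMember(std::vector<Value> &stack, const std::string &member)
{
    std::shared_ptr<Object> pObj = stack.back().pObj;
    stack.pop_back();
    stack.push_back((*pObj)[member]);
}

// WRITE_MEMBER: pops the object reference on TOS, then pops the
// value just below it and stores it into the named member.
void WriteMember(std::vector<Value> &stack, const std::string &member)
{
    std::shared_ptr<Object> pObj = stack.back().pObj;
    stack.pop_back();
    (*pObj)[member] = stack.back();
    stack.pop_back();
}
```

For instance, this.PartialInput = 7; would push the constant, push ‘this’, and then execute WriteMember, leaving the stack empty.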

How do we modify ParseFactor() to accept variable.member as a value? When we found an identifier, we just emitted a PUSH_VAR opcode. What we have to check for now is a trailing dot token: if we find one, the value pushed on the stack is already the object reference we need, and we just have to emit a READ_MEMBER instruction with the proper member-reference immediate value.

Now, how do we allow variable.member to be used as a valid assignment left-hand side? Suppose an assignment such as this.PartialInput = 0; . The code we’d like to emit looks like this:


PUSH_CONST '0'
PUSH_VAR 'this'
WRITE_MEMBER 'PartialInput'

Now suppose we had an expression such as this.PartialInput.len = 0; . The beginning of the expression looks the same (this.PartialInput), but the code we’d like to generate looks like this:


PUSH_CONST '0'
PUSH_VAR 'this'
READ_MEMBER 'PartialInput'
WRITE_MEMBER 'len'

That is, the first member access (this.PartialInput) has to be emitted as if it were on the right hand side of an assignment expression, as we want to bring its value on the top of the stack. But in both cases, the last member access (.PartialInput in the first, and .len in the second) has to be emitted as an l-value assignment, that is, as a WRITE_MEMBER opcode.

We can see this in the following light: a chain of member accesses can generate different code depending on whether it is used as the target of an assignment or in order to access its value, but only the “tail” of it is different. And we don’t know which way it is used until the next token in the expression is seen: if it is yet another dot token, we need to emit the r-value code, and if it is finally an assignment sign, we will need to emit a WRITE_MEMBER opcode.

We can see that it is necessary to emit the code to evaluate the right-hand side of the assignment before the code to access and assign the left-hand side. Likewise, in function calls, we need to emit the code to evaluate the arguments before the function call itself. This is easily done by shuffling around a few buffers, given that the expression that must be emitted last is parsed first: intermediate results of expression parsing usually go into a separate buffer instead of directly into the output bytecode stream.

So then, the trick is to prepare some kind of TChainReference class that parses a chain of object member references. The best thing is to make this TChainReference accept plain variables as well, so that it will parse expressions such as module_var_name, local_var_name, localvar.member1.member2, etc… The parser will call it first with the first identifier in the chain, and then once again for each .memberref found. The class must then be able to respond to two requests: either get the r-value code for the expression passed in so far, or get the l-value code if we have found an assignment token and need that. When we notify it of an additional .member element in the chain, it can safely compile its current chain into r-value-only code and forget about it, as it won’t be used as an l-value, and buffer the last member access for either l-value or r-value generation.
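The buffering idea can be sketched as follows. This is an assumption about how such a class could look (the real NGS class surely differs), and it emits readable text instead of bytecode to keep the example self-contained:

```cpp
#include <cassert>
#include <string>

// Sketch of the chain-reference helper: it buffers the *last* link
// of the chain so that it can still be emitted either as an r-value
// (READ_MEMBER) or as an l-value (WRITE_MEMBER).
class TChainReference
{
public:
    // First identifier in the chain: a plain variable.
    explicit TChainReference(const std::string &sVar)
      : m_sCode("PUSH_VAR '" + sVar + "'\n") {}

    // Another ".member" link: the previously buffered link, if any,
    // is now known to be an r-value access, so flush it as a read.
    void AddMember(const std::string &sMember)
    {
        FlushPendingAsRead();
        m_sPending = sMember;
    }

    // The whole chain was used as a value.
    std::string GetRValueCode()
    {
        FlushPendingAsRead();
        return m_sCode;
    }

    // The chain was the target of an assignment: the last link
    // becomes a WRITE_MEMBER.
    std::string GetLValueCode()
    {
        std::string s = m_sCode + "WRITE_MEMBER '" + m_sPending + "'\n";
        m_sPending.clear();
        return s;
    }

private:
    void FlushPendingAsRead()
    {
        if (!m_sPending.empty())
        {
            m_sCode += "READ_MEMBER '" + m_sPending + "'\n";
            m_sPending.clear();
        }
    }

    std::string m_sCode;
    std::string m_sPending;
};
```

Feeding it this.PartialInput.len and then asking for l-value code reproduces the PUSH_VAR / READ_MEMBER / WRITE_MEMBER sequence shown above.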

This class can also be called from the ParseFactor() function in the expression parser we saw last week, and in that case we will only be calling the GetRValueCode() request at the end, as there is no assignment.

Given that we are parsing an OO language, there are not only member variable accesses, but also member function (method) calls. In order to call a member function, we could have a CALL_MEMBER opcode, but the approach taken in NGS is that member functions are just member variables of the internal FUNCTION type. Thus, member functions are called in the same way as regular functions - although the “this” argument is passed as the first argument. I have even gone all the way to having all functions receive a “this” argument by default, which is just nil for non-member function calls.

Thus, the code generated for something like Parser.ParseElement(); looks like this:


PUSH_VAR 'Parser'
DUP
READ_MEMBER 'ParseElement'
CALL

What does the DUP opcode do? It duplicates the TOS element - in this way, after the DUP we have two copies of the object reference on the stack. The one on TOS is used by the READ_MEMBER instruction, which leaves the value of the ‘ParseElement’ member variable on the stack - which hopefully contained a function reference. Finally, the CALL opcode actually calls the function, passing in the original object reference as the first and only argument - the ‘this’ reference for the member function.

Given that member functions can return any type of value, they could easily return an object reference or even a function value. Thus, the reference chains we actually need to parse could be of any of the following forms:


Parser.GetCurrentValue().bUsed = expr
Parser.PartialInput.Reset()
GetCurrentParser().PartialInput.bUsed = true
Parser.GetLoggingFunction()("log text"); // even this!

We can see how the chain can be arbitrarily complex, involving as many function calls and member accesses along the way as the programmer wants. Take into account that a reference chain (as I like to call them) ending in a function call cannot act as an l-value, as values pushed on the stack can never actually be references to storage places. But, as you see in the last case, we could call a function that returns a function object and call that in turn, making argument lists chained one after another valid syntax with a valid meaning if the proper types are returned.

I will leave it as an exercise for you to do the bookkeeping to generate the code for the arguments in function calls interwoven with the chain call (hint one: compile the arguments into a separate byte stream and just stick it in where necessary; hint two: the DUP trick for the this variable doesn’t always work unless you push arguments right-to-left). You can also easily see how the extension for array member access would work, as signaled by the presence of square brackets in the reference chain - the general bookkeeping gets more complex, but should be straightforward if a bit messy to write.

The extension to allow assignments within expressions is quite simple now: we can accept an assignment sign after a factor that is actually a reference chain, and the code to emit would be:


code to eval the right-hand-side of the assignment
DUP
code to eval the left-hand-side as an L-VALUE assignment

This leaves the assigned value on the TOS, respecting the C rule that an assignment evaluates to the assigned value and performs the assignment as a side effect.
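As a hypothetical example of this scheme (assuming C-style right-associative assignment), a chained statement such as a = b = 5; could compile to:

PUSH_CONST '5'
DUP
POP_VAR 'b'
POP_VAR 'a'

The inner assignment DUPlicates the value so that it survives its own POP_VAR and remains on the stack for the outer assignment to pop into ‘a’.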

I hope you are finding this material useful or interesting - please drop a note in that case (or if you don’t as well). We’ll finish the compiler overview with a peek at the runtime environment in the next installment, and hopefully I will be able to get back to shorter blog entries :)

Compiler overview (II)

Wednesday, June 8th, 2005

In the last installment, I wrote an overview of the first stages of the NGS compiler work. We’ll now get a bit more into the details of parsing, and we’ll see how code generation works.

Now, knowing how to write a Recursive Descent Parser, we confront the task of compiling a whole script module. It is quite easy to do that for NGS or similar languages: a module is a sequence of distinct types of blocks that can be identified by the type of their first token. So, the parser just has to peek at the first token in the input queue, and call the parsing function for the appropriate type of section. In NGS, this currently means either a ‘const’ declaration, a ‘var’ declaration, or a ‘function’ block. Functions may be either simple forward declarations (a semicolon after the argument list) or full definitions (an opening brace and hopefully a full block o’code after the argument list).

‘const’ and ‘var’ sections are simple enough to parse and process. A ‘const’ section serves the purpose that ‘#define’ and ‘enum’ do in C and C++: it declares symbols and associates constant values with them, so that constants can be ‘abstracted out’ in the code. A ‘const’ section allows you to either assign explicit values to the symbols, or let the compiler implicitly generate them as consecutive integers (same as C/C++ enums). In the current incarnation of the NGS compiler, they cannot be initialized with expressions, allowing only simple ‘literal’ value initializers. A constant expression parser, or being able to run bytecodes generated by the regular expression parser at compile time, can easily remove this limitation.

‘var’ sections declare module-level variables, which keep their value between function invocations. Given that they have no complex type declaration syntax such as C/C++’s, they are straightforward to parse, and they only have to be added to the variable pool of the script object being generated. Simple literal initializers are allowed for variables right now; the same notes regarding ‘const’ expression initializers hold.

Both ‘const’ and ‘var’ sections share the property that they can define either a single const/var or a group of them, with or without initializers:

const PI = 3.1415926535;
const
{
  CH_RETURN = 13,
  CH_BACKSPACE = 8
};
const
{
  MODE_NORMAL, // Implicit '0'
  MODE_INPUT,  // 1
  MODE_VISUAL, // ...and so on
  MODE_COMMANDLINE
};

var a;
var { v1, v2, v3 };

Peeking for the opening brace token allows telling blocks from standalone declarations, peeking for the assignment sign allows detecting initializers, and commas (as well as semicolons) are optional, as their presence is not required in order to know what’s going on.
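For reference, the implicit-numbering ‘const’ block shown above behaves just like a C/C++ enum:

```cpp
#include <cassert>

// C/C++ equivalent of the implicit consecutive numbering in the
// NGS 'const' block: values start at 0 and count up.
enum { MODE_NORMAL, MODE_INPUT, MODE_VISUAL, MODE_COMMANDLINE };
```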

The only other type of module-level block is a function. This is easily detected when seeing the ‘function’ token, and that is where the real fun begins!

Let’s diverge a bit from parsing into the realm of code generation. There are several ways to compile a high-level language into something which is quick to execute. The goal language could be a bytecode machine (such is the case for Java or .NET language compilers), it could generate assembly code (or directly native machine code, bypassing the intermediate assembly step), or it can generate code in another language. As an example, Bjarne Stroustrup’s Cfront compiler generated C code from original C++ code. In the case of NGS, I had already decided to generate bytecode for a stack based machine: it provided the best trade-off between implementation complexity and execution speed, while providing the needed flexibility.

What does code for a stack machine look like? Basically, all operations are done by ‘pushing’ the operands onto a stack, having instructions to operate on the stack, and having other instructions to store the result back into variables. For example, the bytecodes to evaluate a = b + c would look like this:

  PUSH_VAR 'b'
  PUSH_VAR 'c'
  ADD
  STORE_VAR 'a'

Things get a bit more complex for conditional jumps (involving instructions such as EQ or GREATER_THAN that operate on the stack, and a JCOND instruction checking the TOS or Top-Of-Stack element) and for function calls (arguments are usually passed on the stack, where the return value is also left, and return addresses may be stored on the same stack or on a separate special-purpose stack - the latter being the case in NGEDIT).

Although it’s not the only way, additional benefit is gained by storing the instructions as sequences of bytes, called bytecode or bytecodes. These sequences can easily be generated, inspected, saved to disk, or passed around between platforms. Instructions can be one byte long (for operand-less instructions such as ADD, which implicitly operates on the stack) or can carry additional information (such as the target address for a JMP).

Why are stack based machines so popular? On one hand, we have their technical merits (portability, speed of execution, etc…). On the other, there are two other significant reasons: the first is that implementing the virtual machine that executes the code is straightforward, as basic operations are simple push_from/operate/pop_into actions. The second is that implementing a compiler to generate code for such a machine is quite straightforward as well.

Generating machine code is quite a bit more difficult (I plan to write a post on a simple way to generate code for the x86 architecture, that is, the architecture underlying PCs and Macs :) - even if there are no immediate plans for native code generation in NGEDIT). Although I have not attempted it, I think that it is quite a lot simpler than it may seem.

There are other intermediate representations, such as operation ‘tuples’, and others can be thought up, such as keeping the expression syntax tree explicitly for evaluation, but they don’t provide important enough advantages to outweigh the bytecode-based stack machine alternative.

So now - how can we generate bytecodes from a C-like source file? The code generated for a function will be a sequence of bytecodes, plus the list of the necessary local variables required by the function. We shall start with a simple case: how to compile a = b + c; into the instructions shown above.

This being an assignment, with an expression on the right-hand side, there are two parts to code generation. The first one involves parsing the right-hand side into bytecodes that will leave the calculated value on the stack. The second part involves generating code to store the value on the top of the stack (TOS) in the place indicated by the left-hand side.

Generating code to evaluate the expression can be done by parsing the expression with a parser similar to what we saw in the previous article. If the expression were a simple variable name, the code is just a simple PUSH_VAR opcode (there will probably be different opcodes for pushing local or module level variables, but that is easily decided by looking up the identifier and finding out where it lies). The key to generating code to evaluate a generic expression is in seeing that, in order to generate the code for a ‘+’ expression, we can do it in three steps:

(1) Generate the code to push the value of the operand to the left of the addition
(2) Generate and append the code to calculate and push the value of the operand to the right of the addition sign
(3) Append an ‘ADD’ bytecode

In the absence of any other elements, the code in part (1) will start operating on an empty stack, and will leave the value on TOS. Then, the code in part (2) will operate on a stack which has the previous element pushed, but won’t touch it. Its operations will leave the first value just below the TOS, and will leave its result on TOS. And finally, the ADD bytecode will add them together and leave a single value on the stack, the result of the addition.

This is a key part, as any expression can be evaluated this way. All calculation operators taking two operands work in this way: ‘a - b’ will emit the code for the (a) and (b) parts and then a ‘SUB’ opcode, and ‘a > b’ will emit a ‘GREATER_THAN’ opcode which leaves TRUE on TOS if the inequality holds or FALSE if it doesn’t. Unary operators such as ‘-a’ or ‘~b’ are even simpler: they simply emit the code for their operand and then an opcode that performs the operation on the single operand, such as ‘NEG’, ‘BINARY_NOT’, etc…

We can emit this code as we go along in a recursive descent parser - instead of directly calculating the values and the result of the operators, we emit code along the way to calculate the subexpressions, and then code to perform the requested operation. Let’s modify the parser in the previous article to emit code to calculate the expression, step by step starting with the ParseFactor function:

void TParser::ParseFactor()
{
  TToken tok = m_pLexer->PeekNextToken();

  if (tok == TOK_NUM)
  {
    int iVal = m_pLexer->GetTokenVal();

    // Emits a byte with the opcode
    m_pEmitter->EmitCode(OP_PUSH_CONSTANT);

    // Adds to constant pool
    ushort usConstant = m_pEmitter->AddConstant(iVal);

    // Emits a two-byte ref to the constant
    EmitWord( usConstant );

    m_pLexer->EatToken();
  }
  else if (tok == TOK_ID)
  {
    const char *psz = m_pLexer->GetTokenID();
    ushort uVar;
    if (m_pEmitter->FindVariable(psz, &uVar))
    {
      EmitCode(OP_PUSH_VAR);
      EmitWord(uVar);

      m_pLexer->EatToken();
    }
    else
    {
      Error("Unrecognized variable name");
    }
  }
  else if (tok == TOK_OPEN_PAREN)
  {
    m_pLexer->EatToken();
    ParseExpression();
    if (!ErrorFound())
    {
      if (m_pLexer->PeekNextToken() != TOK_CLOSE_PAREN)
      {
        Error("Closing parenthesis expected");
        return;
      }
      m_pLexer->EatToken();
    }
  }
  else
  {
    Error("Unexpected token");
  }
}

You can now see how ParseFactor doesn’t return the value of the factor, as its job is to generate the code to evaluate it. In the case of a literal number, we add it to the constant pool of the ‘generated’ or ‘emitted’ script, emit a bytecode with the PUSH_CONSTANT meaning, and emit an inline word reference with the index of the constant to push.

We have added code to parse an identifier referring to a variable, which involves looking it up in the previously declared variables table and inserting a reference to it. Parsing of nested parenthesized expressions is very similar, as it is the ParseExpression function that does the job.

Now we can see how simply the ParseProduct function emits the code to multiply the factors:

void TParser::ParseProduct()
{
  bool bFactorEmitted = false;
  TToken tok;

  do
  {
    ParseFactor();
    if (ErrorFound())
      break; // Stop parsing

    if (bFactorEmitted)
    {
      m_pEmitter->EmitCode(OP_MUL);
    }

    tok = m_pLexer->PeekNextToken();
    if (tok == TOK_MUL)
    {
      // Other factors have to be multiplied in
      bFactorEmitted = true;
      m_pLexer->EatToken();
    }
  } while (tok == TOK_MUL);
}

It’s easy to see how this function loops over one or more factors joined by TOK_MUL (that is, ‘*’ signs), emitting the code to evaluate each of them, and then emitting a ‘MUL’ instruction after each of them starting with the second. This will leave the result of the multiplications on the stack.

And finally, the function to parse an addition, which is our ‘top-level expression’, is very similar and straightforward:

void TParser::ParseExpression()
{
  bool bProductEmitted = false;
  TToken tok;

  do
  {
    ParseProduct();
    if (ErrorFound())
      break; // Stop parsing

    if (bProductEmitted)
    {
      m_pEmitter->EmitCode(OP_ADD);
    }

    tok = m_pLexer->PeekNextToken();
    if (tok == TOK_ADD)
    {
      // Other products have to be added in
      bProductEmitted = true;
      m_pLexer->EatToken();
    }
  } while (tok == TOK_ADD);
}

You can easily see how an expression such as b + (2 * c) results in emitting the following code:

  PUSH_VAR 'b'
  PUSH_CONST '2'
  PUSH_VAR 'c'
  MUL
  ADD

Which indeed evaluates the expression we are looking for!

Given that this overview is becoming quite lengthy, I will continue in a separate post. Feel free to comment on anything.


Shame on me

Wednesday, June 8th, 2005

While preparing the new post in the Compiler Overview series, I realized the control flow logic in the previous post was flawed. I have fixed it and updated the code in the post. My apologies - such are the drawbacks of writing code without either compiling or debugging it.

Compiler overview (I)

Sunday, June 5th, 2005

Unlike other microISV entrepreneurs out there, I’ve started writing the blog a bit late in the development process. Some people start writing the blog as the first action towards building their product. Myself, I started with pure development as I saw that getting the product to exist was the key part in my case. After some hanging around JoS forums, and a suggestion to get the blog running from Ian of StealthPlayer, I actually set up a website and started writing the blog. It’s been very interesting, and the amount of attention brought over by Eric Sink has been a pleasant surprise (thanks Eric!).

Given the interest in the scripting engine side of things, I will do a brief recap of the scripting system. I have myself always had a clearly lustful interest in compilers :)

I will split the overview into two or three parts, covering the lexer and parser in the first, and leaving code generation and execution for afterwards.

1. “Lexing”
Formally called lexical analysis. The first stage is loading the actual text file and breaking it up into tokens. The interface is the typical GetNextToken() or PeekNextToken() (the latter not removing the next token from the input). Skipping comments also happens here. Tokens are identified with a token enumerator, and the ones requiring more info (such as TOK_ID, which is returned for any non-keyword identifier) point into the raw input buffer. This avoids memory management during the lexing phase.

One common option is to use a tool such as LEX or FLEX (a GNU clone of the former). I chose to implement it by hand, as it is simple enough and has no licensing issues. LEX/FLEX lexers are usually based on regular expressions to detect the tokens - I just wrote specialized code to detect each type of token.
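As an illustration of the hand-written approach, here is a minimal lexer with the Peek/Eat interface described above, covering just the tokens used in the toy expression grammar below. This is a sketch, not NGEDIT’s actual lexer:

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Token kinds for the toy expression language.
enum TToken { TOK_NUM, TOK_ADD, TOK_MUL, TOK_EOF, TOK_ERROR };

// Minimal hand-written lexer: numbers, '+' and '*' only.
class TLexer
{
public:
    explicit TLexer(const std::string &s) : m_s(s), m_i(0) { Lex(); }

    TToken PeekNextToken() const { return m_tok; }   // doesn't consume
    int    GetTokenVal()   const { return m_iVal; }  // valid for TOK_NUM
    void   EatToken()            { Lex(); }          // advance

private:
    void Lex()
    {
        while (m_i < m_s.size() && isspace((unsigned char)m_s[m_i]))
            ++m_i; // skip whitespace between tokens

        if (m_i >= m_s.size()) { m_tok = TOK_EOF; return; }
        char c = m_s[m_i];
        if (c == '+') { ++m_i; m_tok = TOK_ADD; return; }
        if (c == '*') { ++m_i; m_tok = TOK_MUL; return; }
        if (isdigit((unsigned char)c))
        {
            m_iVal = 0;
            while (m_i < m_s.size() && isdigit((unsigned char)m_s[m_i]))
                m_iVal = m_iVal * 10 + (m_s[m_i++] - '0');
            m_tok = TOK_NUM;
            return;
        }
        m_tok = TOK_ERROR; // anything else is not a valid token
    }

    std::string m_s;
    size_t m_i;
    TToken m_tok;
    int m_iVal = 0;
};
```

Note how all the state needed is an index into the raw input buffer and the current token; no memory is allocated while lexing.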

2. “Parsing”
Formally called syntax analysis. I could have also gone with a tool such as YACC or BISON (the GNU clone) to generate a state machine parser, but I implemented a simple recursive descent one myself.

For those of you who haven’t seen a living specimen in the wild, what does a simple recursive descent parser look like? Suppose we want to parse a syntax consisting of products and additions with no parentheses, with positive integer numbers as elements. The parser would accept and evaluate expressions such as 5 + 3 * 4, or 10 * 20 * 3 + 2 * 3 + 41 * 5. Suppose the lexer returns TOK_NUM for an integer value, TOK_ADD for the ‘+’ symbol, TOK_MUL for the ‘*’ symbol, and TOK_EOF when the end of file is reached. The value of the TOK_NUM returned can be asked for by calling GetTokenVal(). The lexer is also supposed to skip the whitespace in between the elements.

To write a recursive descent parser, we take into account that the multiply operator has precedence over the addition operator, that is, the previous expressions have to be evaluated as 5 + (3 * 4), or (10 * 20 * 3) + (2 * 3) + (41 * 5). This means that any such expression can be looked at as the addition of a series of products, which are themselves the product of a series of factors.

This is the key to structuring the parser: we will parse the whole expression by getting one product after the other, and adding them together. The products are parsed by getting one factor after the other and multiplying them together. When should the product parser stop? This is another key: when it finds a symbol that it doesn’t recognize as continuing the product (which could be another ‘+’ sign or the end of the sequence).

int TParser::ParseExpression()
{
  int iSum = 0;
  TToken tok;

  do
  {
    int iVal = ParseProduct();
    if (ErrorFound())
      return -1;

    iSum += iVal;

    tok = m_pLexer->PeekNextToken();
    if (tok == TOK_ADD)
    {
      m_pLexer->EatToken();
    }
  } while (tok == TOK_ADD);

  return iSum;
}

int TParser::ParseProduct()
{
  int iProduct = 1;
  TToken tok;

  do
  {
    int iVal = ParseFactor();
    if (ErrorFound())
      return -1;

    iProduct *= iVal;

    tok = m_pLexer->PeekNextToken();
    if (tok == TOK_MUL)
    {
      m_pLexer->EatToken();
    }
  } while (tok == TOK_MUL);

  return iProduct;
}

int TParser::ParseFactor()
{
  TToken tok = m_pLexer->PeekNextToken();

  if (tok == TOK_NUM)
  {
    int iVal = m_pLexer->GetTokenVal();
    m_pLexer->EatToken();
    return iVal;
  }
  else
  {
    Error("Unexpected token");
    return -1;
  }
}

Ok, so now, why is it called a recursive descent parser? Clearly, the one above is not recursive at all. Now, suppose we want to expand the parser above to evaluate expressions with parentheses in them. In any place an integer used to appear, we must now allow a whole ‘nother subexpression, using both ‘*’ and ‘+’, enclosed between parentheses. This is simple enough to do: we only have to modify the ParseFactor function to accept what we just described:

int TParser::ParseFactor()
{
  TToken tok = m_pLexer->PeekNextToken();

  if (tok == TOK_NUM)
  {
    int iVal = m_pLexer->GetTokenVal();
    m_pLexer->EatToken();
    return iVal;
  }
  else if (tok == TOK_OPEN_PAREN)
  {
    m_pLexer->EatToken();
    int iVal = ParseExpression();
    if (m_pLexer->PeekNextToken() != TOK_CLOSE_PAREN)
    {
      Error("Closing parenthesis expected");
      return -1;
    }
    m_pLexer->EatToken();
    return iVal;
  }
  else
  {
    Error("Unexpected token");
    return -1;
  }
}

And now, you can see how the parser operates recursively to parse complex expressions such as 3 + 4 * (6 + 1) or (((1 + 3) * 4 + 5) + 2) * 6 up to any nesting depth (or as long as the runtime environment will allow the stack to grow). The fact that the parsing functions stop at any unexpected token without “consuming it” allows them to stop when a closing parenthesis is reached, returning to the calling ParseFactor who “eats it up”.
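To see the recursion in action, here is a self-contained version of the same parser, paired with a minimal throwaway lexer. The lexer internals are assumptions made just for this demo, error handling is elided, and the loops are written as plain while loops, but the structure is the one described above:

```cpp
#include <cctype>

enum TToken { TOK_NUM, TOK_ADD, TOK_MUL, TOK_OPEN_PAREN, TOK_CLOSE_PAREN, TOK_EOF };

// Throwaway lexer, just enough to drive the parser (hypothetical, for the demo)
class TLexer {
public:
    explicit TLexer(const char* psz) : m_psz(psz) { Advance(); }
    TToken PeekNextToken() const { return m_tok; }
    void   EatToken()            { Advance(); }
    int    GetTokenVal() const   { return m_iVal; }
private:
    void Advance() {
        while (*m_psz == ' ') ++m_psz;              // skip whitespace
        if (*m_psz == '\0') { m_tok = TOK_EOF; return; }
        if (isdigit((unsigned char)*m_psz)) {
            m_iVal = 0;
            while (isdigit((unsigned char)*m_psz))
                m_iVal = m_iVal * 10 + (*m_psz++ - '0');
            m_tok = TOK_NUM; return;
        }
        switch (*m_psz++) {
            case '+': m_tok = TOK_ADD; break;
            case '*': m_tok = TOK_MUL; break;
            case '(': m_tok = TOK_OPEN_PAREN; break;
            case ')': m_tok = TOK_CLOSE_PAREN; break;
            default:  m_tok = TOK_EOF; break;       // treat unknown as end
        }
    }
    const char* m_psz;
    TToken m_tok;
    int m_iVal;
};

class TParser {
public:
    explicit TParser(TLexer* pLexer) : m_pLexer(pLexer) {}

    // Expression = sum of products; stops at any token that is not '+'
    int ParseExpression() {
        int iSum = ParseProduct();
        while (m_pLexer->PeekNextToken() == TOK_ADD) {
            m_pLexer->EatToken();
            iSum += ParseProduct();
        }
        return iSum;
    }
private:
    // Product = product of factors; stops at any token that is not '*'
    int ParseProduct() {
        int iProduct = ParseFactor();
        while (m_pLexer->PeekNextToken() == TOK_MUL) {
            m_pLexer->EatToken();
            iProduct *= ParseFactor();
        }
        return iProduct;
    }
    // Factor = number, or a parenthesized subexpression (the recursive case)
    int ParseFactor() {
        if (m_pLexer->PeekNextToken() == TOK_NUM) {
            int iVal = m_pLexer->GetTokenVal();
            m_pLexer->EatToken();
            return iVal;
        }
        if (m_pLexer->PeekNextToken() == TOK_OPEN_PAREN) {
            m_pLexer->EatToken();
            int iVal = ParseExpression();       // recurse into the subexpression
            m_pLexer->EatToken();               // consume ')' (error handling elided)
            return iVal;
        }
        return 0;                               // error handling elided
    }
    TLexer* m_pLexer;
};

int Eval(const char* psz) {
    TLexer lexer(psz);
    TParser parser(&lexer);
    return parser.ParseExpression();
}
```

Calling Eval("3 + 4 * (6 + 1)") walks exactly the call chain described above: ParseExpression calls ParseProduct, which calls ParseFactor, which on the open parenthesis recurses back into ParseExpression.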

Usually, parsers are written using general tools employing algorithms such as LALR, which discard explicit recursion and implement stack-based iterative methods. But recursive descent parsers are famed for their fast execution, and they’re simple enough to write.

In the next installment, I will cover how the code generation mechanism fits in.

Tons o’garbage

Saturday, June 4th, 2005

The vi emulation script needs to get positions from the built-in editor object. At the very least, this is necessary to store “marks” in the file and get back to them later when the user requests it. But it turns out that most motion commands in vi are better implemented by separating the act of getting the target position from that of setting it later on.

For example, the ‘w’ command advances to the beginning of the next word. Although the built-in interface may provide a “GotoNextWord” function, the fact is that ‘dw’ deletes everything up to the beginning of the next word, and ‘yw’ copies everything up to the beginning of the next word, without moving the cursor. And so on with many commands.
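The idea of splitting the motion from the action can be sketched like this. The buffer model and all the names here are made up for illustration (a flat string with byte offsets), not NGEDIT's built-in interface, but they show why one motion routine can serve ‘w’, ‘dw’ and ‘yw’ alike:

```cpp
#include <string>

// Hypothetical toy buffer: one motion routine computes the target position,
// and three different commands apply it in three different ways.
struct TBuffer {
    std::string text;
    size_t cursor = 0;

    // Motion: compute the start of the next word WITHOUT moving the cursor
    size_t NextWordPos() const {
        size_t i = cursor;
        while (i < text.size() && text[i] != ' ') ++i;   // skip current word
        while (i < text.size() && text[i] == ' ') ++i;   // skip whitespace
        return i;
    }

    // 'w' : move the cursor to the target position
    void MoveNextWord() { cursor = NextWordPos(); }

    // 'dw': delete everything up to the target position
    void DeleteToNextWord() { text.erase(cursor, NextWordPos() - cursor); }

    // 'yw': copy everything up to the target position; the cursor stays put
    std::string YankToNextWord() const {
        return text.substr(cursor, NextWordPos() - cursor);
    }
};
```

Each command reuses NextWordPos(), so adding a new motion automatically makes it available to delete, yank and plain movement.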

Now, a “position” within a text buffer, as handled by NGEDIT, may be quite complex. If you are editing a simple ASCII file, or a file in your local code page (as long as it is a single-byte-per-character codepage), things are simpler. If, on top of that, your file does not contain a single tab character (ASCII 9, which I call a “hard” tab), and you are not using a variable width font to edit the file, a position can be as simple as a line number and an offset within that line.

But if you are editing, for example, UTF-8 text (in which a single character may occupy anything from 1 to 6 bytes, although the Unicode consortium promises they will not be using the 5- and 6-byte encodings), which does include tabs (so the count of characters does not match the column on the screen), and you are even using a proportional font to display the file, then a position may hold the above mentioned line and offset, but also a column count and a horizontal display coordinate in pixels.
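As a rough sketch, such a position might bundle all of these together. The field names are my assumptions for illustration, not NGEDIT's actual layout:

```cpp
// Hypothetical sketch of what a full text-buffer position might carry
struct TPosition {
    int iLine;      // line number in the buffer
    int iOffset;    // byte offset within the line (bytes != chars in UTF-8)
    int iColumn;    // on-screen column, after expanding hard tabs
    int iPixelX;    // horizontal display coordinate, for proportional fonts
};
```

The point is that the script never needs to see these fields; it just carries the value around and hands it back to the editor object.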

So, when a script wants to grab, for instance, the current cursor position, it actually has no idea what it is trying to grab. This is great for the script, because it needn’t worry about the underlying complexity: these positions are only manipulated by the built-in editor object.

The complexity with this is that a position has to be returned as an object from the built-in engine.

The NGS language, like JavaScript and others, doesn’t have pointers. You can’t pass a pointer to an object to any function, be it another script function or a built-in function. I had a look at the Microsoft .NET CLR specification, together with their IL bytecode-based language, and it turns out that the model is quite similar to what I am doing for NGS. They have pointers, but most of the code doesn’t need them. It probably came to them as a necessity to be able to generate managed C++ code, which will of course require shuffling and passing and squeezing all types of pointers around.

Besides,

  var pos;
  s_Editor.GetCursorPos(&pos);

is of course much, much uglier than

  var pos = s_Editor.GetCursorPos();

By the way, I think the Java VM doesn’t manipulate pointers either. Of course, “object” values are actually pointers to objects in the heap and can be thought of as a kind of pointer - but pointers to general variables are just not there (please correct me if memory fails.)

The fact is, I had to return objects. That meant the built-in engine creating them, and then the script manipulating them: keeping them in a variable, operating on them and discarding them, whatever.

I had avoided complex memory management of objects until now. ‘new’ and ‘delete’ were there for manual memory management, which I was doing fine with - vi emulation does not involve any heavy object shuffling, so I thought I could leave it for later.

Not any more. I had to do Garbage Collection.

I had never implemented a GC scheme - I had an idea of how a mark-n-sweep algorithm works, and it seemed simple enough, so off I went to implement it.

And it turns out it took less than a couple of hours to have a decent memory management system working. Reclaimed objects are reused on demand and only freed at the unloading of a script context, so it does reduce the amount of memory shuffling with the CRT.

All objects allocated are kept in a doubly-linked list - hand-written, mind you - whose root is at the TScript object. There are actually two lists: one is the list of the “busy” objects and the other one holds the “free” objects. When someone asks for a new object (be it script code, the built-in editor object, or whoever), I first check if there are any objects in the “free” list. If not, I perform a GC cycle. This involves first un-marking all known supposedly-busy objects, then iterating over all script variables and the running stack and marking the objects they refer to (if indeed they do refer to objects), applying this recursively, and voila! we can now pass all unmarked objects from the busy list to the free list. If nothing turns up, we can always revert to a good-ole-fashioned new Object; within the C++ code.
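A sketch of that two-list mark-and-sweep scheme might look as follows. TObject, TScript and every name here are illustrative assumptions, not the actual NGEDIT code, and the root set is a plain vector standing in for the script variables and the running stack:

```cpp
#include <vector>

struct TObject {
    bool      bMarked = false;
    TObject*  pPrev   = nullptr;   // hand-written doubly-linked list
    TObject*  pNext   = nullptr;
    std::vector<TObject*> refs;    // objects this object refers to
};

class TScript {
public:
    ~TScript() { FreeList(m_pBusy); FreeList(m_pFree); }

    // Stand-in for the real roots: script variables and the running stack
    std::vector<TObject*> roots;

    TObject* NewObject() {
        if (!m_pFree)
            Collect();                      // GC cycle only when the free list is empty
        TObject* p = m_pFree;
        if (p) {
            Unlink(m_pFree, p);             // reuse a reclaimed object
            *p = TObject();
        } else {
            p = new TObject();              // good-ole-fashioned new
        }
        Link(m_pBusy, p);
        return p;
    }

    int CountBusy() const { return Count(m_pBusy); }
    int CountFree() const { return Count(m_pFree); }

private:
    TObject* m_pBusy = nullptr;
    TObject* m_pFree = nullptr;

    void Collect() {
        for (TObject* p = m_pBusy; p; p = p->pNext)
            p->bMarked = false;             // un-mark all supposedly-busy objects
        for (TObject* p : roots)
            Mark(p);                        // mark everything reachable from the roots
        TObject* p = m_pBusy;
        while (p) {
            TObject* pNext = p->pNext;
            if (!p->bMarked) {              // unreachable: move busy -> free
                Unlink(m_pBusy, p);
                Link(m_pFree, p);
            }
            p = pNext;
        }
    }

    static void Mark(TObject* p) {
        if (!p || p->bMarked) return;
        p->bMarked = true;
        for (TObject* pRef : p->refs)
            Mark(pRef);                     // apply recursively
    }

    static void Link(TObject*& pHead, TObject* p) {
        p->pPrev = nullptr; p->pNext = pHead;
        if (pHead) pHead->pPrev = p;
        pHead = p;
    }
    static void Unlink(TObject*& pHead, TObject* p) {
        if (p->pPrev) p->pPrev->pNext = p->pNext; else pHead = p->pNext;
        if (p->pNext) p->pNext->pPrev = p->pPrev;
        p->pPrev = p->pNext = nullptr;
    }
    static int Count(const TObject* p) {
        int n = 0; for (; p; p = p->pNext) ++n; return n;
    }
    static void FreeList(TObject* p) {
        while (p) { TObject* n = p->pNext; delete p; p = n; }
    }
};
```

Since collection only runs when the free list is exhausted, the common allocation path is just an unlink from one list and a link into the other.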

Now I can even write such insulting code as:

  function fn()
  {
    var v = new Whatever();
  }

and be called a lazy programmer :)

The vi input model

Friday, June 3rd, 2005

I don’t want to start a flame war. I don’t want to advocate anything over to anyone. I’m just going to tell a little tale pertaining to my use of text editors.

About 6-8 months ago, I started doing most of my work on a laptop. I hadn’t used a laptop extensively before, and I felt quite happy editing with UltraEdit (my editor of choice for years) or the built-in editor of Microsoft Visual Studio. Using VS to edit gives you Intellisense (a tooltip-box showing you the prototype of the function whose name you’ve just typed), a quick type-compile-correct cycle, and lets you start debugging in no time.

You see, these laptops we have are nifty and all. Plenty of horsepower, decent screens, and you can use them on your lap on the sofa. There are only two quirks for me: one is that, if you add a mouse, you need a surface, which usually means a table, and a lot of the mobility is lost. The bag also starts to become bulky. The second quirk is that arrow keys, and even worse the navigation keys (Home/End/PgUp/PgDn), seem to be an afterthought.

For the past ten years, I had never even thought about the fact that I was moving my hand across the keyboard to reach all these magic little keys hundreds if not thousands of times a day, performing all sorts of selection/copying/pasting/navigation quickly. A common idiom: selecting the current line completely, and pasting a copy below it. I used to do that by pressing Home twice (as the first one gets you to the first non-blank character in the line and I want the whole line), then pressing Shift and holding it down while pressing the down arrow, then Ctrl-C, and then Ctrl-V twice (either that or Down-Arrow then Ctrl-V).

It’s more or less that. I must have made this action more times than I’ve said “hi”.

Now comes the wonderful world of laptop editing. There are many small variations of the laptop keyboard torture devised by different incarnations of Evil. Mine is an Acer with Del on the top-right corner (a good choice), Ins just left of it, and the arrows in a nice little inverted T on the bottom right (not incredibly easy to find by touch; you see, I am a quite proficient touch typist, as most people who spend a lot of their day typing are).

The worst comes with the Home/End and PgUp/PgDn keys. PgUp and PgDn are dutifully stuck there in the two little gaps of the inverted T - PgUp is above the left arrow, and PgDn is above the right arrow. This alone makes you unable to distinguish the inverted T shape by touch, which is such an aid for quickly locating it. And the mental cue of having PgUp above PgDn is lost. But the worst is yet to come: Home and End are those same two keys, provided you also press a difficult-to-reach “Fn” key almost on the bottom left corner of the keyboard (second from the left, between the Ctrl and the Windows key).

This layout is incredibly uncomfortable. Such a simple idiom as I spoke about above turns into something like this:

- Move the hand around the bottom right corner trying to locate the keys
- Give up trying blindly and look down to locate them
- Try to locate Fn directly with the left thumb or little finger
- Give up and locate it either as the second key on the row or just look down
- Fn-PgUp (Home) twice
- Shift-Down, usually by touch from the previous position but w/o feeling reliable
- Ctrl-C (with which you lose track of where your left keys are)
- Ctrl-V twice
- Swear about these damn keyboards and try to remember what you were trying to write.

You see, a few days of this and I was bound to find a better solution.

I started thinking that there must be a way to edit text which didn’t require the Home/End/Arrow keys. I started dreaming up text editing interfaces that would save all the hand movement. During one of those daydreams, it struck me: “remember vi, that old-school, bizarre, complex text editor?”. I had used vi (most probably vim) maybe a couple of times, once at the University HP-9000 Unix system and another time when installing Linux circa 1994. I promptly switched to “pico” or “joe” or one of the other more standard editors and forgot about it - not without learning that “i” helped you enter text (?!) and quitting required typing ‘:’ and then q or quit or e or exit or x or whatnot. The only way to add text at the end of a line was to use ‘a’ instead of ‘i’ while at the end of the line - really weird. But I had sometime heard that you DID NOT USE ARROW KEYS WITH VI. It resounded like a siren song in my mind. I mentioned it to a coworker, quite a Linux supporter, and he praised the vi input model really strongly - no, he wasn’t using it most of the time, having succumbed to development under Windows using plain Microsoft Visual Studio, but it was the best way of editing text known to man.

You see, I was perfectly happy with my editor(s), but suddenly the input model was a pain to use with my environment.

I looked up some resources, downloaded vim, and I started using it. First web page I printed out and kept at arm’s length for emergency:

Efficient Editing With vim - Jonathan McPherson

This is the best introduction to the vi input model I have found. The first thing it does is tell you to stay away from the arrow keys and use h, j, k, l for moving around. Sure it’s weird, and I probably couldn’t have done it on a regular keyboard, but the laptop nightmare got me avoiding the arrow keys by sheer instinct.

It felt weird to have my hands stuck there in their home position. It felt even a bit “restrictive”, and things didn’t come out quickly. It took about 2 weeks to become basically fluent with the vi model, and I was playing games of what commands to use for every little thing (“How do I erase this word? Oh, it may be, what key was it? ah, yeah, ‘b’, to go to the beginning of the word, then ‘d’ to delete and maybe ‘w’ to delete to the next word HURRAH! IT WORKED!” - then “diw - Delete Inner Word does it as well!”).

But after the steeeeeeeep learning curve, I started to become very proficient with it. You could split, resize and move around editing windows with just keyboard commands (“you mean, ‘k’ is up, but Ctrl-W and then ‘k’ gets you to the window above the current one? NEAT!”). Everything is there at your fingertips.

So - there I was, early 2005, discovering the input model of a text editor written, what, 25 years ago?, and finding out that it is the best input model for my current platform.

Mind you, once I’ve learnt it, it’s also much quicker to use on a regular keyboard - the incentive just wasn’t there with a fancy regular keyboard.

Now editing is no less comfortable on the sofa, on the plane or on the bus than at a non-portable, full-size desk.

And do you know how I make an additional copy of the current line nowadays?

yyp

[Update May 31, 2006: if you want to learn vi/vim editing yourself, I think the best way to do it is the graphical tutorial I prepared recently. Enjoy!]