Archive for the ‘ngdev’ Category

What’s broken in the WM_KEYDOWN/WM_CHAR input model?

Monday, June 13th, 2005

I was planning on writing a rant on the input model of the Win32 API. See, while developing the input model for NGEDIT, I’ve been pondering how to implement the key-mappings, input handling, etc. Although I had used it in the past, it wasn’t in an edit-intensive, fully customizable, internationally targeted program (although I’ve implemented IME-based chat for China, Japan and Korea 🙂 ). The more I looked at the model and thought about how to handle it, the more it seemed a complete mess in my head. But I could not pin down exactly what it was, or what a better model might be.

I set out to grasp a complete understanding of the model, which I will be describing here for whoever needs it while programming (I couldn’t find a decent explanation that made sense – of course MSDN is of no help in this kind of situation).

The Win32 model works like this: when a key is pressed, you get a WM_KEYDOWN message (more if the key is kept depressed past the auto-repeat time). When it is released, you get a WM_KEYUP. You can access the state of the modifier keys (SHIFT, ALT and CONTROL) at that moment. Both messages indicate the pressed key with their virtual key code: an integer code which uniquely identifies a key on the keyboard.

Apart from this, as messages get translated by the TranslateMessage() Win32 function, they are checked with the current keyboard config, and an additional WM_CHAR message is sent for each WM_KEYDOWN, with the current-codepage character code of the key pressed.
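
For reference, this is where that translation happens in a typical application – the standard Win32 message loop:

MSG msg;
while (GetMessage(&msg, NULL, 0, 0) > 0)
{
  TranslateMessage(&msg);   // posts WM_CHAR/WM_DEADCHAR as needed
  DispatchMessage(&msg);    // routes the messages to the WndProc
}

Skip the TranslateMessage() call and the WM_CHAR messages simply never arrive.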

This is all fine and dandy, although things start getting tricky after this.

  • In non-US keyboard mappings, or in the US international keyboard, you can use some keys to modify the character typed afterwards with an accent. If you press one of these “accent” keys, a WM_DEADCHAR is generated, which you’d probably better ignore, and the next WM_KEYDOWN of a vowel character will generate a WM_CHAR message with a different char code representing the accented character.
  • When you want to handle a key in your program, you have a choice between handling the WM_KEYDOWN or the WM_CHAR message (see the sketch after this list). Keys such as the cursor arrows or function keys don’t have a character equivalent, so you need to handle WM_KEYDOWN. For regular character keys that are used for typing, you probably prefer to handle the WM_CHAR message, as it will get you the lower case, upper case, or even accented version in international versions. A doubt always arises about what to do regarding other keys which generate both a WM_KEYDOWN and a WM_CHAR: what to do with TAB, RETURN, ESC, etc…? Even worse, the DEL key doesn’t generate the ASCII DEL (127) WM_CHAR message that you would expect – it is a “mute” key WM_CHAR-wise. I think I remember Charles Petzold bemoaning the same decision in his Programming Windows book – I don’t remember what conclusion he reached, but I remember it didn’t satisfy me.
  • The virtual key code numerically matches the uppercase-ASCII code of the character on the keyboard, and has special codes for cursor arrows, function keys, and some weird keys called stuff like VK_OEM_1 (the ’tilde’ key in a US mapping). A key such as VK_OEM_1 is mapped to an actual alphabetic character in non-English European languages, making it difficult to distinguish alphabetic from non-alphabetic WM_KEYDOWNs.
  • Windows nicely throws a different version of the messages at you if the ALT key is pressed at the same time: you get WM_SYSKEYDOWN and WM_SYSCHAR instead of the regular WM_KEYDOWN and WM_CHAR. You also get WM_SYSKEYUP for the sake of consistency. These messages usually get handled by DefWindowProc() in order to bring up the corresponding menu item and some special key processing. Curiously, pressing the ALT key by itself and releasing it afterwards buys you a WM_SYSKEYDOWN/WM_SYSKEYUP pair for the ALT key. But if you press ALT, then a character key, and release both, you will get all WM_SYS* messages, except for the KEYUPs of keys released after the ALT key (including ALT itself!), which generate regular WM_KEYUPs. The F10 key is a “special key” as well, generating the WM_SYS* version of its KEYDOWN – I guess because it brings up the menu too, and I’m unsure about whether its KEYUP is SYS or not. No wonder the virtual key code of the ALT key is VK_MENU.
  • I couldn’t readily derive from MSDN whether having CTRL or SHIFT pressed at the same time as ALT would buy me SYS messages or plain vanilla messages.
  • European and US-int’l mappings treat the right ALT key as a different AltGr key, which buys you some extra characters that are not placed in regular positions. This key is reported to the application as having both CTRL and ALT pressed at the same time, and the system takes care of mapping the WM_KEYDOWN to the appropriate char code for the WM_CHAR message.
  • If you are interested in your Asian customers, you have a whole new world of IME messages to serve. It opens a whole new avenue of spec nightmare – suffice it to say that you will be receiving two WM_CHAR messages in a row for some complex kanji or hanzi or hangeul double-byte character, corresponding to no keypress, as the user selects complex characters from his/her IME window. Your US-only-minded application may even work if you just don’t shuffle your WM_CHARs too much.
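
To make the WM_KEYDOWN/WM_CHAR split concrete, here is a minimal sketch of the choice described in the list above – cursor keys handled on WM_KEYDOWN, typed text on WM_CHAR. MoveCursorLeft(), MoveCursorRight() and InsertChar() are hypothetical placeholders:

LRESULT CALLBACK WndProc(HWND hWnd, UINT uMsg, WPARAM wParam, LPARAM lParam)
{
  switch (uMsg)
  {
    case WM_KEYDOWN:
      switch (wParam)       // wParam holds the virtual key code
      {
        case VK_LEFT:  MoveCursorLeft();  return 0; // hypothetical
        case VK_RIGHT: MoveCursorRight(); return 0; // hypothetical
      }
      break;  // other keys fall through and may produce a WM_CHAR

    case WM_CHAR:
      InsertChar((TCHAR)wParam);  // hypothetical typing handler
      return 0;
  }

  // DefWindowProc() takes care of the WM_SYS* menu behavior
  return DefWindowProc(hWnd, uMsg, wParam, lParam);
}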

After writing the list, I’m looking back at the title of the post and wondering if it’s really a good question. Anyway.

Coming back to development, I keep a “todo list” of vi/vim commands that need implementation. It happens that the innocent “Redo” command in vi/vim is achieved by the Ctrl-R key combination. There I went and mapped the ‘r’ WM_CHAR with the CTRL modifier to the Redo command, pressed Ctrl-R and… nothing. I checked what the program was receiving, and found out that Windows was translating the VK_R keypress with CTRL into the proper C-R ASCII code, which happens to be 0x12, or 18 in decimal.

At that moment, I just hated the fact, and the memory of how input behaved in previous projects (which was not as clear as I have described above) kicked in, and had me for days happily implementing all sorts of other vi/vim commands that didn’t require handling the !~@# control key. I was working, but in some way, I was procrastinating 🙂 Seriously, this is the kind of stuff that keeps me from tackling a task.

After a few days, I finally decided to drag myself to do it, and started researching all the intricacies of key input, dumping the results, and trying to build a clear mental image – I found out some stuff that I didn’t know and haven’t seen anywhere on the web (of course, MSDN is less than complete).

Here is a nice table of the messages that the R key can generate in its different combinations:


            nothing   CTRL    ALT    CTRL+ALT
  nothing    KD+'r'  KD+^R  SKD+S'r'    KD
  SHIFT      KD+'R'  KD+^R  SKD+S'r'    KD

The row selects between having SHIFT pressed or not. The column selects the combination of CTRL and ALT that is pressed. ‘KD’ represents the WM_KEYDOWN message, SKD represents WM_SYSKEYDOWN. An ‘x’ character represents the WM_CHAR message. S’x’ represents WM_SYSCHAR. ^X means the ASCII C-X control code in WM_CHAR. WM_KEYUPs are not shown. And of course, the char code on the left is toggled depending on the CAPSLOCK status (I’m unsure about what happens when CAPSLOCK is on and ALT is used together with the key, although your program’d better not behave too differently).

Other non-alphabetic keys get similar mappings, although they don’t get a WM_CHAR when pressed together with CTRL (in most cases, as we’ll see in a moment).

I set out to try to complete my understanding of the mapping, and wondered what would happen with other keys. It turns out, Windows goes to great lengths trying to find the ASCII code that may correspond to the keypress (some of them quite dubious to my understanding of affairs).

All (English) alphabetic keys (A to Z, all included) get mapped to their control codes (ASCII 1 to 26). That means that if you press Ctrl-M, your program will receive a nice WM_CHAR message with char code 13, usually corresponding to the RETURN key. Be sure to filter that out if you provide Ctrl+Key mappings! It even works as a replacement for RETURN in unsophisticated programs such as Notepad.
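
So a WM_CHAR handler that offers Ctrl+letter bindings has to tell these control codes apart from regular typing. A sketch of the idea, with HandleControlChar() and InsertChar() as hypothetical handlers:

case WM_CHAR:
  // Ctrl+A..Ctrl+Z arrive as char codes 1..26; route them to the
  //  key-binding machinery instead of inserting them as text
  if (wParam >= 1 && wParam <= 26 &&
      (GetKeyState(VK_CONTROL) & 0x8000))
  {
    HandleControlChar((int)wParam);
    return 0;
  }
  InsertChar((TCHAR)wParam);
  return 0;

Note that a plain RETURN also arrives as char code 13, but without CTRL held down, so the GetKeyState() check tells the two apart.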

Then I set out to find out whether Windows would generate the rest of the control codes – poking around, looking at ASCII charts in more detail than I’d like to, I found out that you can generate WM_CHAR with ASCII 28 with C-\, ASCII 29 with C-], ASCII 30 with C-S-6 (which is more like C-^), and ASCII 31 with C-S-hyphen (which should be read as C-_). I usually use a non-US keyboard, for which I can tell you that the ASCII mapping is a bit lousier than for the US mapping, as the ‘\[]^’ symbols are moved around to other keys but Windows still performs the mapping as if the keyboard had US keytops – except in the case of DEADCHAR-generating keys, which simply behave differently…

The ASCII code 27, that is, ESC, can be generated both by the ESC key and by the C-[ combination.

And I also discovered some curious behavior: RETURN by itself generates WM_CHAR with ASCII 13, while C-RETURN generates WM_CHAR with ASCII 10. I was tempted not to tell you that C-S-RETURN generates no WM_CHAR, in order to get you lost if you aren’t already 🙂

And a mysterious ASCII code, DEL (127), was conspicuously absent from the keyboard – the DEL key would not generate it with any combination of modifier keys. But finally I found it: while BACKSPACE generates ASCII 8 (BS), combined with the CTRL key it generates the famous ASCII 127 DEL character.

Wow… I sincerely hope this explanation turns out useful or interesting for someone. I would’ve been happy to find it myself a couple of weeks ago.

With this understanding, I could finally go and map the Ctrl-R key to the Redo command, go around binding other keys, and move on. I could also comment on it with geek friends and all have a good laugh. But something was still dangling in my mind. Apart from being awkward, the system seemed flawed, but I could not pinpoint why.

Even if this is all a mess, and disregarding the far-east input stuff, I was thinking that I had to come up with a better design if I was to rant about it. Of course, the SYS nightmare should be easy to fix in a new design (the messages should be the same, and the fact that ALT or F10 or an ALT-shortcut brings up the menu should be stuffed somewhere else), but something seemed to be fundamentally wrong and I could not grasp it.

Finally, I realized today.

The input model is losing one piece of critical information: the fact that WM_KEYDOWN and WM_CHAR are separate and completely unrelated loses the concept of a keypress. And there is no way of bringing that back.

See, having the system “cook” some type of information, such as the character mapping of the key if the key has a sensible one, is ok. Receiving extra information is great and the system will do a much better job of supporting all national specifics than the app. Even if that information is “cooked” in a weird way, such as the SYS stuff (or even the DEADCHAR stuff), it is ok and you could just cope with it. But you dearly want to be able to handle a keypress in some way, either using or ignoring the “cooked” information or just the raw virtual key code.

Realizing this, I could finally think up what a better scheme would be. Wouldn’t it be great if Windows provided a WM_KEYPRESS message including all the information? Say, the virtual key code, the fully-cooked ASCII mapping that WM_CHAR gets, some sort of indication of a DEADCHAR keypress, the state of the modifier keys, maybe even the uncooked keytop ASCII char if that exists?

Probably, Asian IMEs could generate keypresses with a null virtual key code to send weird characters to unaware applications that just handle input, making it clear that no key was pressed for that character to be sent to the application, so the application would not try to match it up with keyboard mappings. Key maps would always refer to virtual keys, and text input would always just take the fully-cooked ASCII component. And we would all be happy.
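
Just to make the idea concrete, the information such a hypothetical WM_KEYPRESS could carry might look like this (this message does not exist in Windows – it’s only the shape of what I’m wishing for):

struct TKeyPress   // hypothetical, wishful thinking only
{
  int   iVirtKey;    // VK_* code, or 0 for IME-generated characters
  WCHAR wchCooked;   // fully-cooked character, 0 if none (e.g. VK_LEFT)
  bool  bDeadKey;    // true if the press only starts an accent
  bool  bShift, bCtrl, bAlt;  // modifier state at the time of the press
  WCHAR wchKeytop;   // uncooked keytop character, if the key has one
};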

The good thing is that this could have been implemented at any moment, as adding a new message does not create an incompatibility. But we are now stuck with the really weird WM_KEYDOWN/WM_CHAR model.

I won’t hold my breath until Microsoft includes something along these lines 🙂

Compiler overview (III)

Sunday, June 12th, 2005

Hello back. I have been very busy working on NGEDIT during the past days, and it has taken a bit to get back to posting to the blog. Development is advancing nicely: VimEmu (the vi/vim emulation layer) is working quite well and I have even started using it for continuous editing. There are a couple of major parts yet to be done (namely, named registers and character-wise, linewise and blockwise selection), but most of the core parts are working very well. Well, actually not right now, as I have done a major restructuring of the VimEmu.ngs script and it’s not working at the moment, but it was just before the restructuring. VimEmu is already a 1,500-line script which is also doubling as the best testing base for the scripting core.

In order to work on VimEmu, I have integrated into NGEDIT some hotkeys to swap the InputProcessor (the name for the scripts that can process user input), reload it, compile the current file and signal the errors. I have also added the possibility of dumping the contents of the full Undo stack onto a file, and of loading and playing it back. If bugs appear, this is a great way to save and inspect the steps to reproduce a given bug.

Given the host of comments regarding a new, incompatible language, I have been researching a bit more about Javascript. Although I won’t do this for version 1.0, it is quite possible that I will evolve NGS into Javascript after that, which will be nicer for folks who want to extend the editor. NGEDIT will have an extension system similar to Firefox and the like, such that extensions can be easily developed and deployed. Regarding interesting Javascript sites, check out DHTML Lemmings.

I have to correct one of my previous points: in fact, in Javascript you have to qualify member access within member functions by using “this.” before the member name. This is good, as a mistyped variable access won’t be interpreted as a possible member access and go unflagged as a compile error. It is the same as what I’m doing in NGS. Apologies for the mistake.

Last time, we saw how to generate code for an expression which involves no assignments. Today, we are going to see what to do with expressions involving an assignment, and delve into the wonderful world of l-value and r-value parsing.

Anyone with experience in programming has an intuitive grasp of what l-values and r-values are. Consider for example the assignment a = 5; . This will represent a fairly easy to understand piece of code for almost any C-descendant-language programmer (it would be a := 5; in other languages, but they are not that ubiquitous nowadays). Something such as a = b; will also be commonly understood. But a piece of code such as 5 = a; will raise an alarm for almost anyone: you can’t assign a value to 5. The same will happen for a+b=c;. Well, what this means is that we understand that there are expressions which are valid both on the right and the left side of an assignment (such as variable_name) and others that are valid only on the right hand side, because you can’t assign to them (such as 5 or a+b). These concepts are explicitly handled in compiler writing, and they are formally called l-value and r-value expressions. Actually, many compilers will flag an error for the 5 = a case by issuing an "Error: '5': not an l-value" diagnostic. Thus there are two classes of expressions: those that can only be on the right side of an assignment (r-values) and those that can be on both sides (l-values). Take into account that the operand to ++ or += also has to be an l-value, as they also mean to assign something to it.

So, how do we go about this when parsing a language such as NGS? For starters, let’s assume that we only allow assignments as separate statements, not as parts of other expressions (function calls, etc…). The generalization will be easy afterwards, but this limitation makes things easier to explain.

When we are parsing a function or code block, we have to parse one statement after the other and generate the code for each. The best thing to do is to peek at the first token in the statement, and then do different things depending on the token:

  • If it’s an if token, go to a separate subroutine to parse the if-expression, the code-block, and possibly the else-block.
  • If it’s a while token, go to a separate subroutine to parse its parts.
  • Idem for other keyword-marked language constructs like return, for, etc…
  • If it’s an identifier, go to the assignment or function call parsing routine.

This ease comes from the fact that language keywords are “reserved” and can’t be used for variable names (in C, C++, etc.). This makes writing the compiler easier. It is not true for all languages; for example, the following is a famously valid PL/I statement:

IF THEN THEN THEN = ELSE; ELSE ELSE = THEN;

You can imagine this language is more difficult to parse than C (C++ has its own, different difficulties).

By the way, I took this sample from the famous Dragon book (Compilers: Principles, Techniques and Tools, by Aho, Sethi and Ullman). It’s the book on compiler foundations, a 1986 book named after the dragon on its cover, which stands for compiler complexity – and if you are interested in compilers you could do much worse than check it out. Be advised that it is heavily theoretical, and you would probably be grateful to complement it with a more practical one such as Holub’s Compiler Design in C.

Back to where we were: we saw how to get into our assignment parsing routine quite easily – a statement starting with a user-provided identifier is deemed to be either a function call, an assignment, or a mistake on the part of the programmer.

In a simple procedural (that is, non-OO) language, with no support for structs, the statement we are looking at would only be one of the following two constructs:

variable = expr;

or

function(expr, expr, ...);

We already saw last week how to parse the right-hand-side expressions and generate code to push them on the stack, so I hope you won’t be afraid of that part.

Regarding the rest, the strategy to parse this would be simple: save the first token, and peek the next one. If it is the assignment operator, we have to parse the right hand side expression and then emit the code to store the top of the stack (TOS) in the right variable. If it is an opening parenthesis, we have to parse a sequence of expressions separated by commas (emitting the code for one after the other will leave them nicely ordered on the stack) and then emit the code to call the appropriate function.

For example, code such as a = b + 2; could generate the following stack-machine code:


PUSH_VAR 'b'
PUSH_CONST '2'
ADD
POP_VAR 'a'

In both JavaVM and CLR (.NET VM) slang, the last instruction, which pops the TOS into a given variable, would be called a STORE; I tend to use the word POP, but the function is the same. In the same way, both the JavaVM and the CLR call LOAD what I call PUSH_VAR.

The first three instructions come from the parser we saw last week, which evaluate the b + 2 part. The last instruction is the tricky one we have to generate by recognizing the left-hand-side of the assignment.

For another example, code such as function(b + 2, c); could generate the following stack-machine code:


PUSH_VAR 'b'
PUSH_CONST '2'
ADD
PUSH_VAR 'c'
PUSH_FN 'function'
CALL

Of course, function calls should be allowed on the right hand side of assignments, so we’d better include the parsing of this in the ParseFactor() function of the parser described in the previous post – but we can leave that for later and only allow it in separate statements for now.

How would the function to parse a statement look? Something similar to this:

void TParser::ParseStatement()
{
  TToken tok = m_pLexer->PeekNextToken();

  switch (tok)
  {
    case TOK_IF:
      ParseIfStatement(); break;
    case TOK_WHILE:
      ParseWhileStatement(); break;
    //...
    case TOK_ID:
      ParseAssignmentOrFunctionCall(); break;
    case TOK_SEMICOLON:
      /* Empty statement */ break;
    default:
      Error("Unrecognized statement"); break;
  }
}

void TParser::ParseAssignmentOrFunctionCall()
{
  // Get the identifier name
  const char *psz = m_pLexer->GetTokenID();
  m_pLexer->EatToken();

  TToken tok = m_pLexer->PeekNextToken();

  if (tok == TOK_ASSIGNMENT)
  {
    m_pLexer->EatToken();

    // Parse the right-hand-side and
    //  emit code to leave it on TOS
    ParseExpression();
    if (ErrorFound())
      return;

    // There should be a semicolon...
    Require(TOK_SEMICOLON);
    if (ErrorFound())
      return;

    ushort uVar;
    if (m_pEmitter->FindVariable(psz, &uVar))
    {
      m_pEmitter->EmitCode(OP_POP_VAR);
      m_pEmitter->EmitWord(uVar);
    } else {
      Error("Unrecognized variable name");
    }
  }
  else if (tok == TOK_OPEN_PAREN)
  {
    m_pLexer->EatToken();

    // Parse all arguments in order and push them
    while (
      m_pLexer->PeekNextToken() != TOK_CLOSE_PAREN
    )
    {
      ParseExpression(); // Parse the argument
      if (ErrorFound())
        return;

      // Arguments are separated by commas; the
      //  last one is followed by the close paren
      if (m_pLexer->PeekNextToken() == TOK_COMMA)
        m_pLexer->EatToken();
    }

    m_pLexer->EatToken(); // Skip the close paren

    // There should be a semicolon...
    Require(TOK_SEMICOLON);
    if (ErrorFound())
      return;

    ushort uFn;
    if (m_pEmitter->FindFunction(psz, &uFn))
    {
      m_pEmitter->EmitCode(OP_PUSH_FN);
      m_pEmitter->EmitWord(uFn);

      m_pEmitter->EmitCode(OP_CALL);
    } else {
      Error("Unrecognized function name");
    }
  } else {
    Error("Unrecognized statement");
  }
}

Ok, so now that we are done compiling simple assignments, we are ready to see what goes on in a more complex language design involving structs, or even objects with member functions, etc….

If we have variables which can be objects or records, there are other types of l-values apart from simple variable names: we can access an object’s members by name using the ‘.’ (dot) operator. It can be ‘chained’, such that expressions such as Parser.PartialInput.str are valid l-values, meaning the field ‘str’ of the ‘PartialInput’ member of the ‘Parser’ variable. As with all l-values, such an expression can also appear as an r-value (if you can assign something to an expression, surely you can read it and use it in other expressions).

The key in the design of our stack machine (the same as in the Java VM, but only to some extent in the CLR VM) is that we can only push values onto the stack, and if we need to store something somewhere, there is some kind of special instruction to pop values from the stack into that place. POP_VAR seems quite easy, popping the TOS into a variable. In NGS, actually, I have POP_MODVAR to pop into a module variable and POP_LOCVAR to pop into a function’s local variable. This or a similar distinction also exists both in the Java VM and Microsoft’s CLR VM.

What to do with objects? The first thing is to decide how to operate on them – the solution in many cases (NGS among them) is to allow the stack or general variables to contain references to objects. These are internally just some kind of pointers to objects. Thus, when a variable holding an object appears in a right-hand-side expression, it is pushed as a normal variable, which pushes a reference to the object onto the stack. If this is assigned to another variable, then we have both variables referencing the same object. This is how it works in Javascript. And if we want to make an actual copy of an object, we have to make it explicit in the code in some way. This is very different from C or C++, where, if you assign a = b; with a and b being struct variables, the whole contents are copied over.

In order to handle objects in our stack machine, we just need two new instructions: WRITE_MEMBER and READ_MEMBER. READ_MEMBER expects to find the object reference on TOS, and the member-identifying string could be either on the stack or an immediate value in the bytecode stream (both approaches work, and this is a design decision regarding the VM – in NGS it is stored in the bytestream). On the other hand, WRITE_MEMBER expects to find the object reference on the top of the stack and the value just below it.

How do we modify ParseFactor() to accept variable.member as a value? When we found an identifier, we just emitted a PUSH_VAR opcode. What we have to check now is whether a trailing dot token follows: if it does, the value pushed on the stack is already the object reference we need, and we just have to emit a READ_MEMBER instruction with the proper member-reference immediate value.

Now, how do we allow variable.member to be used as a valid assignment left-hand side? Suppose an assignment such as this.PartialInput = 0; . The code we’d like to emit looks like this:


PUSH_CONST '0'
PUSH_VAR 'this'
WRITE_MEMBER 'PartialInput'

Now suppose we had an expression such as this.PartialInput.len = 0; . The beginning of the expression looks the same (this.PartialInput), but the code we’d like to generate looks like this:


PUSH_CONST '0'
PUSH_VAR 'this'
READ_MEMBER 'PartialInput'
WRITE_MEMBER 'len'

That is, the first member access (this.PartialInput) has to be emitted as if it were on the right hand side of an assignment expression, as we want to bring its value onto the top of the stack. But in both cases, the last member access (.PartialInput in the first, and .len in the second) has to be emitted as an l-value assignment, that is, as a WRITE_MEMBER opcode.

We can see this in the following light: a chain of member access can generate different code if it is used as the target of an assignment or if it is used in order to access its value, but only the “tail” of it is different. And we don’t know whether it is used in one way or the other until the next token in the expression is evaluated: if it is yet another dot token, we need to emit the r-value code, and if it is finally an assignment, we will need to emit a WRITE_MEMBER opcode.

We can see that it is necessary to emit the code to evaluate the right-hand side of the assignment before the code to access and assign the left-hand side. Likewise, in function calls, we need to emit the code to evaluate the arguments before the function call itself. This may be easily done by shuffling around a few buffers when the expression whose code must be emitted last is parsed first, so intermediate results of expression parsing usually go into a separate buffer instead of directly to the output bytecode stream.

So then, the trick is to prepare some kind of TChainReference class that parses a chain of object member references. The best thing is to make this TChainReference accept plain variables as well, such that it will parse expressions such as module_var_name, local_var_name, localvar.member1.member2, etc… The parser will call it first with the first identifier in the chain, and then once again for each .memberref found. The class must then be able to respond to two requests: either produce the r-value code for the expression passed in until now, or produce the l-value code if we have found an assignment token and need that. When we notify it of an additional .member element in the chain, it can safely compile its current chain into r-value-only code and forget about it, as it won’t be used as an l-value, and buffer the last member access for either l-value or r-value generation.
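
A minimal sketch of such a helper, assuming an emitter class called TEmitter that can append one buffer onto another – names and details are illustrative, not NGEDIT’s actual code. The chain compiles into a side buffer, so the caller can splice it in after the right-hand-side code, as discussed above:

class TChainReference
{
public:
  // The chain starts at a plain variable
  void Start(ushort uVar)
  {
    m_ChainCode.EmitCode(OP_PUSH_VAR);
    m_ChainCode.EmitWord(uVar);
    m_bMemberPending = false;
  }

  // Another ".member" was found: the buffered link, if any, is now
  //  known to be a plain read, so it compiles into a READ_MEMBER
  void AddMember(ushort uMember)
  {
    FlushPendingAsRead();
    m_uPendingMember = uMember;
    m_bMemberPending = true;
  }

  // The whole chain is an r-value: the last link is a read too
  void GetRValueCode(TEmitter *pOut)
  {
    FlushPendingAsRead();
    pOut->Append(m_ChainCode);
  }

  // An '=' followed the chain: the last link becomes a write. The
  //  caller emits the right-hand-side code first, then splices this
  //  in after it, so WRITE_MEMBER finds the value below the object.
  void GetLValueCode(TEmitter *pOut)
  {
    pOut->Append(m_ChainCode);
    pOut->EmitCode(OP_WRITE_MEMBER);
    pOut->EmitWord(m_uPendingMember);
    m_bMemberPending = false;
  }

private:
  void FlushPendingAsRead()
  {
    if (!m_bMemberPending)
      return;
    m_ChainCode.EmitCode(OP_READ_MEMBER);
    m_ChainCode.EmitWord(m_uPendingMember);
    m_bMemberPending = false;
  }

  TEmitter m_ChainCode;      // side buffer holding the chain's code
  ushort   m_uPendingMember; // last ".member", not yet compiled
  bool     m_bMemberPending;
};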

This class can also be called from the ParseFactor() function in the expression parser we saw last week, and we will only be calling the GetRValueCode() request in the end if there is no assignment.

Given that we are parsing an OO language, there are not only member variable accesses, but also member function (method) calls. In order to call a member function, we could have a CALL_MEMBER opcode, but the approach taken in NGS is that member functions are just member variables of the internal FUNCTION type. Thus, member functions are called in the same way as regular functions – although the “this” argument is passed as the first argument. I have even gone all the way to having all functions receive a “this” argument by default, which is just nil for non-member function calls.

Thus, the code generated for something like Parser.ParseElement(); looks like this:


PUSH_VAR 'Parser'
DUP
READ_MEMBER 'ParseElement'
CALL

What does the DUP opcode do? It duplicates the TOS element – in this way, after the DUP we have two copies of the object reference on the stack. The one on TOS is used by the READ_MEMBER instruction, which leaves the value of the ‘ParseElement’ member variable on the stack – which hopefully contained a function reference. Finally, the CALL opcode actually calls the function, passing in the original object reference as the first and only argument – the ‘this’ reference for the member function.

Given that member functions can return any type of value, they could easily return an object reference or even a function value. Thus, the reference chains we actually need to parse could be of any of the following forms:


Parser.GetCurrentValue().bUsed = expr
Parser.PartialInput.Reset()
GetCurrentParser().PartialInput.bUsed = true
Parser.GetLoggingFunction()("log text"); // even this!

We can see how the chain can be arbitrarily complex, involving as many function calls and member accesses along the way as the programmer wants. Take into account that a reference chain (as I like to call them) ending in a function call cannot act as an l-value, as values pushed on the stack can never actually be references to storage places. But, as you see in the last case, we could call a function that returns a function object and call that in turn, making pairs of argument lists one after another valid syntax, with a valid meaning if the proper types are returned.

I will leave it as an exercise for you to do the bookkeeping to generate the code for the arguments in function calls interwoven with the chain call (hint one: compile the arguments into a separate byte stream and just stick it in where it is necessary; hint two: the DUP trick for the this variable doesn’t always work unless you push arguments right-to-left). You can also easily see how the extension for array member access would work, as signaled by the presence of square brackets in the reference chain – the general bookkeeping gets more complex, but should be straightforward, if a bit messy, to write.

The extension to allow assignments in expressions is quite simple now: we can accept an assignment sign after a factor that is actually a reference, and the code to emit would be:


code to eval the right-hand-side of the assignment
DUP
code to eval the left-hand-side as an L-VALUE assignment

This leaves the assigned value on the TOS, respecting the C rule that an assignment evaluates to the assigned value and performs the assignment as a side effect.

I hope you are finding this material useful or interesting – please drop a note in that case (or if you don’t as well). We’ll finish the compiler overview with a peek at the runtime environment in the next installment, and hopefully I will be able to get back to shorter blog entries 🙂

Compiler overview (II)

Wednesday, June 8th, 2005

In the last installment, I wrote an overview of the first stages of the NGS compiler work. We’ll now get a bit more into the details of parsing, and we’ll be seeing how code generation works.

Now, knowing how to write a Recursive Descent Parser, we confront the task of compiling a whole script module. It is quite easy to do that for NGS or similar languages: a module is a sequence of distinct types of blocks that can be identified by the type of their first token. So, the parser just has to peek at the token in the input queue, and call the parsing function for the appropriate type of section. In NGS, this currently means either a ‘const’ declaration, a ‘var’ declaration, or a ‘function’ block. Functions may be either simple forward declarations (a semicolon after the argument list) or full definitions (an opening brace and hopefully a full block o’code after the argument list).
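
In code, the module-level loop is little more than this – a sketch with illustrative names, in the style of the parser listings further down:

void TParser::ParseModule()
{
  TToken tok;
  while ((tok = m_pLexer->PeekNextToken()) != TOK_EOF)
  {
    switch (tok)
    {
      case TOK_CONST:    ParseConstSection(); break;
      case TOK_VAR:      ParseVarSection();   break;
      case TOK_FUNCTION: ParseFunction();     break;
      default:
        Error("Unexpected token at module level");
        return;
    }
    if (ErrorFound())
      return;
  }
}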

‘const’ and ‘var’ sections are simple enough to parse and process. A ‘const’ section serves the purpose that ‘#define’ and ‘enum’ do in C and C++: it declares symbols and associates constant values with them, so that constants can be ‘abstracted out’ in the code. A ‘const’ section allows you to either assign explicit values to the symbols, or let the compiler implicitly generate them as consecutive integers (same as C/C++ enums). In the current incarnation of the NGS compiler, they cannot be initialized with expressions, allowing only simple ‘literal’ value initializers. A constant expression parser, or being able to run bytecodes generated by the regular expression parser at compile time, can easily solve this limitation.

‘var’ sections declare module-level variables, which keep their value between function invocations. Given that they have no complex type declaration syntax such as C/C++’s, they are straightforward to parse, and they only have to be added to the variable pool of the script object being generated. Simple literal initializers are allowed for variables right now; the same notes regarding ‘const’ expression initializers hold.

Both ‘const’ and ‘var’ sections share the property that they can define either a single const/var or a group of them, with or without initializers:

const PI = 3.1415926535;
const
{
  CH_RETURN = 13,
  CH_BACKSPACE = 8
};
const
{
  MODE_NORMAL, // Implicit '0'
  MODE_INPUT,  // 1
  MODE_VISUAL, // ...and so on
  MODE_COMMANDLINE
};

var a;
var { v1, v2, v3 };

Peeking for the opening brace token allows telling blocks from standalone declarations, peeking for the assignment sign allows detecting initializers, and commas (as well as semicolons) are optional, as their presence is not required in order to know what’s going on.

The only other type of module-level block is a function. This is easily detected when seeing the ‘function’ token, and that is where the real fun begins!

Let’s diverge a bit from parsing into the realm of code generation. There are several ways to compile a high-level language into something which is quick to execute. The goal language could be a bytecode machine (such is the case for Java or .NET language compilers), it could generate assembly code (or directly native machine code, bypassing the intermediate assembly step), or it can generate code in another language. As an example, Bjarne Stroustrup’s Cfront compiler generated C code from original C++ code. In the case of NGS, I had already decided to generate bytecode for a stack based machine: it provided the best trade-off between implementation complexity and execution speed, while providing the needed flexibility.

What does code for a stack machine look like? Basically, all operations are done by ‘pushing’ the operands onto a stack, having instructions to operate on the stack, and having other instructions to store the result back to variables. For example, the bytecodes to evaluate a = b + c would look like this:

  PUSH_VAR 'b'
  PUSH_VAR 'c'
  ADD
  STORE_VAR 'a'

Things get a bit more complex for conditional jumps (involving instructions such as EQ or GREATER_THAN that operate on the stack, and a JCOND instruction checking the TOS, or Top-Of-Stack, element) and for function calls (arguments are usually passed on the stack, where the return value is also left, and return addresses may be stored on the same stack or on a separate special-purpose stack – the latter being the case in NGEDIT).
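
For instance, a statement such as if (a > b) c = a; could compile to something along these lines (illustrative mnemonics in the style of the listing above, with JNCOND jumping when TOS is FALSE):

  PUSH_VAR 'a'
  PUSH_VAR 'b'
  GREATER_THAN
  JNCOND L0     ; skip the body if the comparison left FALSE
  PUSH_VAR 'a'
  STORE_VAR 'c'
L0: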

Although it’s not the only way, additional benefit is gained by storing the instructions as sequences of bytes, called bytecode or bytecodes. These sequences can easily be generated, inspected, saved to disk or passed around between platforms. Instructions can be one byte long (for operand-less instructions such as ADD, which implicitly operate on the stack) or can carry additional information (such as the target address for a JMP).

Why are stack based machines so popular? On one hand, we have their technical merits (portability, speed of execution, etc…). On the other, there are two other significant reasons: the first one is that implementing the virtual machine that executes the code is straightforward, as basic operations are simple push_from/operate/pop_into actions. The second one is that implementing a compiler to generate code for such a machine is quite straightforward as well.
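
To see why the VM side is straightforward, here is a sketch of the classic dispatch loop such a machine runs (TVirtualMachine, TValue, Push() and Pop() are illustrative names, not NGEDIT’s actual VM):

// Reads a little-endian 16-bit operand from the bytecode stream
static ushort ReadWord(const unsigned char **ppc)
{
  ushort u = (ushort)((*ppc)[0] | ((*ppc)[1] << 8));
  *ppc += 2;
  return u;
}

void TVirtualMachine::Run(const unsigned char *pCode)
{
  const unsigned char *pc = pCode;
  for (;;)
  {
    switch (*pc++)              // fetch and decode one opcode
    {
      case OP_PUSH_CONST:
        Push(m_Constants[ReadWord(&pc)]);   // inline word operand
        break;
      case OP_ADD:
      {
        TValue b = Pop();       // right operand was pushed last
        TValue a = Pop();
        Push(a + b);
        break;
      }
      case OP_STORE_VAR:
        m_Variables[ReadWord(&pc)] = Pop();
        break;
      case OP_RET:
        return;
      // ... one case per opcode
    }
  }
}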

Generating machine code is quite a bit more difficult (I plan to write a post on a simple way to generate code for the x86 architecture, that is, the architecture underlying PCs and Macs 🙂 – even if there are no immediate plans for native code generation in NGEDIT). Although I have not attempted it, I think that it is quite a lot simpler than it may seem.

There are other intermediate representations, such as operation ‘tuples’, and others can be thought up, such as keeping the expression syntax tree explicitly for evaluation, but they don’t provide important enough advantages to outweigh the bytecode-based stack machine alternative.

So now – how can we generate bytecodes from a C-like source file? The code generated for a function will be a sequence of bytecodes, plus the list of the necessary local variables required by the function. We shall start with a simple case: how to compile a = b + c; into the instructions shown above.

This being an assignment with an expression on the right-hand side, there are two parts to code generation. The first one involves parsing the right-hand side into bytecodes that will leave the calculated value on the stack. The second part involves generating code to store the value on the top of the stack (TOS) in the place indicated by the left-hand side.

Generating code to evaluate the expression can be done by parsing the expression with a parser similar to what we saw in the previous article. If the expression were a simple variable name, the code is just a simple PUSH_VAR opcode (there will probably be different opcodes for pushing local or module level variables, but that is easily decided by looking up the identifier and finding out where it lives). The key to generating code to evaluate a generic expression is in seeing that we can generate the code for a ‘+’ expression in three steps:

(1) Generate the code to push the value of the operand to the left of the addition sign
(2) Append the code to calculate and push the value of the operand to the right of the addition sign
(3) Append an ‘ADD’ bytecode

In the absence of any other elements, the code in part (1) will start operating on an empty stack, and will leave the value on TOS. Then, the code in part (2) will operate on a stack which has the previous element pushed, but won’t touch it. Its operations will leave the first value just below the TOS, and will leave its result on TOS. And finally, the ADD bytecode will add them together and leave a single value on the stack, the result of the addition.

This is a key part, as any expression can be evaluated this way. All calculation operators taking two operands work in this way: ‘a – b’ will emit the code for the (a) and (b) parts and then a ‘SUB’ opcode, and ‘a > b’ will emit a ‘GREATER_THAN’ opcode which leaves TRUE on TOS if the inequality holds or FALSE if it doesn’t. Unary operators such as ‘-a’ or ‘~b’ are even simpler: they simply emit the code for their operand and then an opcode that performs the operation on the single operand, such as ‘NEG’, ‘BINARY_NOT’, etc…

We can emit this code as we go along in a recursive descent parser – instead of directly calculating the values and the result of the operators, we emit code along the way to calculate the subexpressions, and then code to perform the requested operation. Let’s modify the parser in the previous article to emit code to calculate the expression, step by step starting with the ParseFactor function:

void TParser::ParseFactor()
{
  TToken tok = m_pLexer->PeekNextToken();

  if (tok == TOK_NUM)
  {
    int iVal = m_pLexer->GetTokenVal();

    // Emits a byte with the opcode
    m_pEmitter->EmitCode(OP_PUSH_CONSTANT);

    // Adds to constant pool
    ushort usConstant = m_pEmitter->AddConstant(iVal);

    // Emits a two-byte ref to the constant
    m_pEmitter->EmitWord( usConstant );

    m_pLexer->EatToken();
  }
  else if (tok == TOK_ID)
  {
    const char *psz = m_pLexer->GetTokenID();
    ushort uVar;
    if (m_pEmitter->FindVariable(psz, &uVar))
    {
      m_pEmitter->EmitCode(OP_PUSH_VAR);
      m_pEmitter->EmitWord(uVar);

      m_pLexer->EatToken();
    }
    else
    {
      Error("Unrecognized variable name");
    }
  }
  else if (tok == TOK_OPEN_PAREN)
  {
    m_pLexer->EatToken();
    ParseExpression();
    if (!ErrorFound())
    {
      if (m_pLexer->PeekNextToken() != TOK_CLOSE_PAREN)
      {
        Error("Closing parenthesis expected");
      }
      m_pLexer->EatToken();
    }
  }
  else
  {
    Error("Unexpected token");
  }
}

You can now see how ParseFactor doesn’t return the value of the factor, as its job is to generate the code to evaluate it. In the case of a literal number, we add it to the constant pool of the ‘generated’ or ’emitted’ script, emit a bytecode with the PUSH_CONSTANT meaning, and emit an inline word reference with the index of the constant to push.

We have added code to parse an identifier referring to a variable, which involves looking it up in the previously declared variables table and inserting a reference to it. Parsing of nested parenthesized expressions is very similar, as it’s the ParseExpression function that does the job.

Now we can see how simply the ParseProduct function emits the code to multiply the factors:

void TParser::ParseProduct()
{
  bool bFactorEmitted = false;
  TToken tok;

  do
  {
    ParseFactor();
    if (ErrorFound())
      break; // Stop parsing

    if (bFactorEmitted)
    {
      m_pEmitter->EmitCode(OP_MUL);
    }

    tok = m_pLexer->PeekNextToken();
    if (tok == TOK_MUL)
    {
      // Other factors have to be multiplied in
      bFactorEmitted = true;
      m_pLexer->EatToken();
    }
  } while (tok == TOK_MUL);
}

It’s easy to see how this function loops over one or more factors joined by TOK_MUL (that is, ‘*’ signs), emitting the code to evaluate each of them, and then emitting a ‘MUL’ instruction for each factor starting with the second. This will leave the result of the multiplications on the stack.

And finally, the function to parse an addition, which is our ‘top-level expression’ is very similar and straightforward:

void TParser::ParseExpression()
{
  bool bProductEmitted = false;
  TToken tok;

  do
  {
    ParseProduct();
    if (ErrorFound())
      break; // Stop parsing

    if (bProductEmitted)
    {
      m_pEmitter->EmitCode(OP_ADD);
    }

    tok = m_pLexer->PeekNextToken();
    if (tok == TOK_ADD)
    {
      // Other products have to be added in
      bProductEmitted = true;
      m_pLexer->EatToken();
    }
  } while (tok == TOK_ADD);
}

You can easily see how an expression such as b + (2 * c) results in emitting the following code:

  PUSH_VAR 'b'
  PUSH_CONST '2'
  PUSH_VAR 'c'
  MUL
  ADD

Which indeed evaluates the expression we are looking for!

Given that this overview is becoming quite lengthy, I will continue in a separate post. Feel free to comment on anything.


Compiler overview (I)

Sunday, June 5th, 2005

Unlike other microISV entrepreneurs out there, I’ve started writing the blog a bit late in the development process. Some people start a blog as the first action towards building their product. Myself, I started with pure development, as I saw that getting the product to exist was the key part in my case. After hanging around the JoS forums for a while, and a suggestion from Ian of StealthPlayer to get the blog running, I actually set up a website and started writing the blog. It’s been very interesting, and the amount of attention brought over by Eric Sink has been a pleasant surprise (thanks Eric!).

Given the interest in the scripting engine side of things, I will do a brief recap of the scripting system. I have, myself, always had a clearly lustful interest in compilers 🙂

I will split the overview into two or three parts, covering the lexer and parser in the first, and leaving code generation and execution for afterwards.

1. “Lexing”
Formally called lexical analysis. The first stage is loading the actual text file and breaking it up into tokens. The interface is the typical GetNextToken() or PeekNextToken() (the latter not removing the actual next token). Skipping comments also happens here. Tokens are identified with a token enumerator, and the ones requiring more info (such as TOK_ID, which is returned for any non-keyword identifier) point into the raw input buffer. This avoids memory management during the lexing phase.

One common option is to use a tool such as LEX or FLEX (a GNU clone of the former). I chose to implement it by hand, as it is simple enough and has no licensing issues. LEX/FLEX lexers are usually based on regular expressions to detect the tokens – I just wrote specialized code to detect each type of token.

2. “Parsing”
Formally called syntax analysis. I could have also gone with a tool such as YACC or BISON (the GNU clone) to generate a state machine parser, but I implemented a simple recursive descent one myself.

For those of you who haven’t seen a living specimen in the wild, what does a simple recursive descent parser look like? Suppose we want to parse a syntax consisting of products and additions with no parentheses, with positive integer numbers as elements. The parser would accept and evaluate expressions such as 5 + 3 * 4, or 10 * 20 * 3 + 2 * 3 + 41 * 5. Suppose the lexer returns TOK_NUM for an integer value, TOK_ADD for the ‘+’ symbol, TOK_MUL for the ‘*’ symbol, and TOK_EOF when the end of file is reached. The value of the TOK_NUM returned can be asked for by calling GetTokenVal(). The lexer is also supposed to skip the whitespace in between the elements.

To write a recursive descent parser, we take into account that the multiply operator has precedence over the addition operator, that is, the previous expressions have to be evaluated as 5 + (3 * 4), or (10 * 20 * 3) + (2 * 3) + (41 * 5). This means that any such expression can be looked at as the addition of a series of products, which are themselves the product of a series of factors.

This is the key to structuring the parser: we will parse the whole expression by getting one product after the other, and adding them together. The products are parsed by getting one factor after the other and multiplying them together. When should the product parser stop? This is another key: when it finds a symbol that it doesn’t recognize as continuing the product (which could be another ‘+’ sign or the end of the sequence).
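
Written down as an informal grammar (not something the compiler uses – just to make the structure explicit):

  expression := product ( '+' product )*
  product    := factor ( '*' factor )*
  factor     := NUM

Each rule maps to one parsing function below; the factor rule will grow a parenthesized alternative in a moment.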

int TParser::ParseExpression()
{
  int iSum = 0;
  TToken tok;

  do
  {
    int iVal = ParseProduct();
    if (ErrorFound())
      return -1;

    iSum += iVal;

    tok = m_pLexer->PeekNextToken();
    if (tok == TOK_ADD)
    {
      m_pLexer->EatToken();
    }
  } while (tok == TOK_ADD);

  return iSum; // No more '+' signs: the sum is complete
}

int TParser::ParseProduct()
{
  int iProduct = 1;
  TToken tok;

  do
  {
    int iVal = ParseFactor();
    if (ErrorFound())
      return -1;

    iProduct *= iVal;

    tok = m_pLexer->PeekNextToken();
    if (tok == TOK_MUL)
    {
      m_pLexer->EatToken();
    }
  } while (tok == TOK_MUL);

  return iProduct; // No more '*' signs: the product is complete
}

int TParser::ParseFactor()
{
  TToken tok = m_pLexer->PeekNextToken();

  if (tok == TOK_NUM)
  {
    int iVal = m_pLexer->GetTokenVal();
    m_pLexer->EatToken();
    return iVal;
  }
  else
  {
    Error("Unexpected token");
    return -1;
  }
}

Ok, so now, why is it called a recursive descent parser? Clearly, the one above is not recursive at all. Now, suppose we want to expand the parser above to evaluate expressions with parentheses in them. In any place an integer used to appear, we must now allow a whole ‘nother subexpression, using both ‘*’ and ‘+’, enclosed between parentheses. This is simple enough to do: we only have to modify the ParseFactor function to accept what we just described:

int TParser::ParseFactor()
{
  TToken tok = m_pLexer->PeekNextToken();

  if (tok == TOK_NUM)
  {
    int iVal = m_pLexer->GetTokenVal();
    m_pLexer->EatToken();
    return iVal;
  }
  else if (tok == TOK_OPEN_PAREN)
  {
    m_pLexer->EatToken();
    int iVal = ParseExpression();
    if (m_pLexer->PeekNextToken() != TOK_CLOSE_PAREN)
    {
      Error("Closing parentheses expected");
      return -1;
    }
    m_pLexer->EatToken();
    return iVal;
  }
  else
  {
    Error("Unexpected token");
    return -1;
  }
}

And now, you can see how the parser operates recursively to parse complex expressions such as 3 + 4 * (6 + 1) or (((1 + 3) * 4 + 5) + 2) * 6, up to any nesting depth (or as far as the runtime environment will allow the stack to grow). The fact that the parsing functions stop at any unexpected token without “consuming” it allows them to stop when a closing parenthesis is reached, returning to the calling ParseFactor, which “eats it up”.

Usually, parsers are written using general tools employing algorithms such as LALR, which discard explicit recursion and implement stack-based iterative methods. But recursive descent parsers are famed for fast execution, and they’re simple enough to write.

Next installment, I will cover how the code generation mechanism fits.

Tons o’garbage

Saturday, June 4th, 2005

The vi emulation script needs to get positions from the built-in editor object. At the very least, this is necessary to store “marks” in the file and get back to them later when the user so requests. But it turns out that most motion commands in vi are better implemented by separating the act of getting the target position from the act of setting it later on.

For example, the ‘w’ command advances to the beginning of the next word. Although the built-in interface may provide a “GotoNextWord” function, the fact is that ‘dw’ deletes everything up to the beginning of the next word, and ‘yw’ copies everything up to the beginning of the next word, without moving the cursor. And so on with many commands.

Now, a “position” within a text buffer, as handled by NGEDIT, may be quite complex. If you are editing a simple ASCII file, or a file in your local code page as long as it is a single-byte-per-character codepage, things are simpler. If, on top of that, your file does not contain a single tab character (ASCII 9, which I call “hard” tabs), and you are not using a variable-width font to edit the file, a position can be as simple as a line number and an offset within that line.

But if you are editing, for example, UTF-8 text (in which a single character may occupy anything from 1 to 6 bytes, although the Unicode consortium promises they will not be using the 5- and 6-byte encodings), which does include tabs (so the count of characters does not match the column on the screen), and you are even using a proportional font to display the file, then a position may host the above mentioned line and offset, but also a column count and a horizontal display coordinate in pixels.
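
So a full position ends up carrying something like this (an illustrative sketch, not NGEDIT’s actual layout):

struct TPosition
{
  unsigned uLine;    // line number within the buffer
  unsigned uOffset;  // byte offset within the line (UTF-8 aware)
  unsigned uColumn;  // column, after expanding hard tabs
  unsigned uPixelX;  // horizontal coordinate for proportional fonts
};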

So, when a script wants to grab, for instance, the current cursor position, it actually has no idea what it is trying to grab. This is great for the script, because it needn’t worry about the underlying complexity: these positions are only manipulated by the built-in editor object.

The complexity with this is that a position has to be returned as an object from the built-in engine.

The NGS language, the same as JavaScript and others, doesn’t have pointers. You can’t pass a pointer to an object to any function, be it another script function or a built-in function. I had a look at the Microsoft .NET CLR specification, together with their IL bytecode-based language, and it turns out that the model is quite similar to what I am doing for NGS. They have pointers, but most of the code doesn’t need them. It probably came to them as a necessity in order to be able to generate managed C++ code, which will of course require shuffling and passing and squeezing all types of pointers around.

Besides,

  var pos;
  s_Editor.GetCursorPos(&pos);

is of course much, much uglier than

  var pos = s_Editor.GetCursorPos();

By the way, I think the Java VM doesn’t manipulate pointers either. Of course, “object” values are actually pointers to objects in the heap and can be thought of as a kind of pointer – but pointers to general variables are just not there (please correct me if memory fails.)

The fact is, I had to return objects. That meant the built-in engine creating them, and then the script manipulating them: either keeping them in a variable, or operating on and discarding them, whatever.

I had avoided complex memory management of objects until now. ‘new’ and ‘delete’ were there for manual memory management, which I was doing fine with – vi emulation does not involve any heavy object shuffling, so I thought I could leave it for later.

Not any more. I had to do Garbage Collection.

I had never implemented a GC scheme – I had an idea of how a mark-n-sweep algorithm works, and it seemed simple enough, so off I went to implement it.

And it turns out, it took less than a couple of hours to have a decent memory management system working. Reclaimed objects are reused on demand and only freed at the unloading of a script context, so it does reduce the amount of memory shuffling with the CRT.

All objects allocated are kept in a doubly-linked list, hand-written mind you, whose root is at the TScript object. There are actually two lists: one is the list of the “busy” objects and the other one holds the “free” objects. When someone asks for a new object (be it script code, the built-in editor object, or whoever), I first check if there are any objects in the “free” list. If not, I perform a GC cycle. This involves first un-marking all known supposedly-busy objects, then iterating over all script variables and the running stack and marking the objects they refer to (if indeed they do refer to objects), applying this recursively, and voilà! we can now pass all unmarked objects from the busy list to the free list. If nothing turns up, we can always revert to a good-ole-fashioned new Object; within the C++ code.
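
In code, a cycle of the scheme just described looks roughly like this – a sketch with illustrative names (TObject, TValue, MoveToFreeList() and friends are not NGEDIT’s actual identifiers):

void TScript::CollectGarbage()
{
  // 1. Un-mark every supposedly-busy object
  for (TObject *p = m_pBusyList; p != NULL; p = p->m_pNext)
    p->m_bMarked = false;

  // 2. Mark everything reachable from the roots: the script's
  //    variables and the running stack
  for (int i = 0; i < NumRoots(); i++)
    MarkValue(GetRoot(i));

  // 3. Sweep: whatever is still unmarked is unreachable
  TObject *p = m_pBusyList;
  while (p != NULL)
  {
    TObject *pNext = p->m_pNext;   // save before unlinking
    if (!p->m_bMarked)
      MoveToFreeList(p);           // reclaim for later reuse
    p = pNext;
  }
}

void TScript::MarkValue(TValue *pVal)
{
  if (!pVal->IsObjectRef())
    return;                        // plain value, nothing to do
  TObject *pObj = pVal->GetObject();
  if (pObj->m_bMarked)
    return;                        // already visited; handles cycles
  pObj->m_bMarked = true;
  for (int i = 0; i < pObj->NumMembers(); i++)
    MarkValue(pObj->GetMember(i)); // apply recursively to members
}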

Now I can even write such insulting code as:

  function fn()
  {
    var v = new Whatever();
  }

and be called a lazy programmer 🙂

New customs

Tuesday, May 31st, 2005

I’m already using the scripting language extensively. I first had to complete the array access code, and I also had to include “true”, “false” and “nil” (not “null”). I’m now fighting against the intricacies of vi/vim emulation instead of against the language, so I’d guess it’s faring well.

I’m using the newly discovered style of avoiding all unnecessary syntactic sugar such as semicolons, commas, etc. The funny thing is that it feels comfortable everywhere except in a single place: I cannot get used to leaving out the comma between function arguments. Given that arguments are typeless (the same as all other variables), something such as:

function IProcessKey ( key modifkey ) 

looks too much as if “key” was the type and “modifkey” the name, and my brain is already too hardwired to that.

On other issues, I’ve started working on a design for the blog – I already have a nice design, I just need to work on some graphics elements and then work out how to integrate everything with WordPress.

And I’m also thinking about changing the name of the blog. I don’t like seeing “pains” up there so prominently; it paints a negative picture of something which is a positive, enjoyable venture (if not completely painless).

Compiler & VM finished

Monday, May 30th, 2005

Today, I’ve written the bytecode-based virtual machine that runs the scripts within NGEDIT. It didn’t take more than 6-8 hours. Complete with the access mechanisms to built-in functions, casting of values for inter-type operations, function calls, etc. It doesn’t amount to much more than 1,000 lines of code.

It always strikes me as weird that the kind of things that seem “complex” are often done in quite little time, while other parts which seem easier take so much more. The Undo system and selecting/copying/pasting have taken much less time to develop than things such as the custom menus, tab bar, preferences dialogs, etc., which are supposedly much simpler. No surprise we’re so bad at forecasting development times. I almost dare predict that printing support will take double the time that syntax highlighting will.

I find that my productivity is much lower when I have to be talking to the Windows API than when I’m writing self-contained code. Does the same happen to you?

My next goal is vi/vim emulation using the internal NGS scripting language, I’ll report how it fares.

More compiler writing

Saturday, May 28th, 2005

The parser/code generator is already working very nicely. I’ve written a little disassembler to examine the output of the code generation phase – it has already helped me find a couple of bugs, and it’s crying for a peephole optimizer (although I’ll leave this for later). Here’s some sample input to the compiler and the output generated:

NGS source:

const
{
  NORMAL
  INPUT
};

function ProcessChar( Char )
{
  if (this.m_Mode == NORMAL)
  {
    var pfn = this.m_MapNormal.Get(Char);
    if (!pfn)
    {
      pfn = this.m_MotionMap.Get(Char);
      if (!pfn)
        ::Beep();
    }
  }
  else if (this.m_Mode == INPUT)
  {
    //TODO: implement insert
    //::Insert(Char);
    ::MessageBox("Insert", Char);
  }
}

Compiler output:

(002) ProcessChar (this, Char)
locals: this, Char, pfn
13           ENTER
02 00 00     PUSH_LOCVAR    0000 ; this
06 07 00     READ_MEMBERVAR 0007 ; m_Mode
01 00 00     PUSH_CONST     0000 ; 0 (NORMAL)
20           EQUAL
16 40 00     JNCOND         0040 ; L0
02 02 00     PUSH_LOCVAR    0002 ; Char
01 07 00     PUSH_CONST     0007 ; 1 (*unnamed*)
02 00 00     PUSH_LOCVAR    0000 ; this
06 09 00     READ_MEMBERVAR 0009 ; m_MapNormal
06 0A 00     READ_MEMBERVAR 000A ; Get
11           CALL
03 03 00     POP_LOCVAR     0003 ; pfn
02 03 00     PUSH_LOCVAR    0003 ; pfn
1E           LOGIC_NOT
16 23 00     JNCOND         0023 ; L1
02 02 00     PUSH_LOCVAR    0002 ; Char
01 08 00     PUSH_CONST     0008 ; 1 (*unnamed*)
02 00 00     PUSH_LOCVAR    0000 ; this
06 05 00     READ_MEMBERVAR 0005 ; m_MotionMap
06 0A 00     READ_MEMBERVAR 000A ; Get
11           CALL
03 03 00     POP_LOCVAR     0003 ; pfn
02 03 00     PUSH_LOCVAR    0003 ; pfn
1E           LOGIC_NOT      
16 09 00     JNCOND         0009 ; L1
01 09 00     PUSH_CONST     0009 ; 0 (*unnamed*)
0A 00 00     PUSH_BUILTIN   0000 ; Built-in
11           CALL
0C 01        STACK_DEL      01
14 1C 00 L1: JMP            001C ; L2
02 00 00 L0: PUSH_LOCVAR    0000 ; this
06 07 00     READ_MEMBERVAR 0007 ; m_Mode
01 01 00     PUSH_CONST     0001 ; 1 (INPUT)
20           EQUAL
16 0F 00     JNCOND         000F ; L2
01 0A 00     PUSH_CONST     000A ; Insert (*unnamed*)
02 02 00     PUSH_LOCVAR    0002 ; Char
01 0B 00     PUSH_CONST     000B ; 2 (*unnamed*)
0A 01 00     PUSH_BUILTIN   0001 ; Built-in
11           CALL
0C 01        STACK_DEL      01
09       L2: PUSH_NIL        
12           RET             

Ain’t it beautiful? Compiler writing has some “magic” feeling to it.
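
As for the peephole optimizer it’s crying for: the POP_LOCVAR 0003 / PUSH_LOCVAR 0003 pairs around pfn above are exactly the kind of pattern it would chew on. A minimal sketch (the decoded-instruction form and the DUP opcode are invented for the illustration, not part of the real instruction set):

#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical decoded-instruction form, purely for the illustration.
struct TInstr { uint8_t op; uint16_t arg; };

enum : uint8_t { OP_PUSH_LOCVAR = 0x02, OP_POP_LOCVAR = 0x03, OP_DUP = 0x30 };

// "POP_LOCVAR n; PUSH_LOCVAR n" stores a value and immediately reloads it
// (it shows up twice around pfn in the listing above). Duplicating the
// value before the store saves the reload.
void Peephole(std::vector<TInstr>& Code)
{
    std::vector<TInstr> Out;
    for (std::size_t i = 0; i < Code.size(); ++i)
    {
        if (i + 1 < Code.size()
            && Code[i].op == OP_POP_LOCVAR
            && Code[i + 1].op == OP_PUSH_LOCVAR
            && Code[i].arg == Code[i + 1].arg)
        {
            Out.push_back({ OP_DUP, 0 });    // keep a copy on the stack
            Out.push_back(Code[i]);          // then do the store
            ++i;                             // drop the redundant push
            continue;
        }
        Out.push_back(Code[i]);
    }
    Code.swap(Out);
}

Of course, once instructions change size, all the JMP/JNCOND offsets need fixing up as well, which is a good part of why it’s staying on the to-do list.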

I am facing a little design decision: as you can see in the source above, references to “this” are explicit within functions. The reason for this is that NGS, the same as JavaScript, is a class-less language. Each object can have its own set of members, be they values or member functions. For this reason, there is no way of telling at compile time whether an identifier refers to a member of the “this” object. Allowing direct references to member variables would turn off the detection of undeclared identifiers altogether, turning them into accesses to the “this” object. This is bad, because a mis-spelling of a constant or module variable would silently be turned into a member-variable access (even worse given that all functions in NGS have a “this” object).

The way it is now, member accesses are unchecked at compile-time, but non-member accesses are checked and flagged as errors if not found either as constants, module variables, built-ins or local variables (this includes arguments to the current function).
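
To make the lookup order concrete, here is a sketch of the check (the tables, names and exact ordering are illustrative stand-ins, not the actual NGEDIT compiler):

#include <set>
#include <string>

// Illustrative stand-ins: not the actual NGEDIT compiler structures.
struct TScope
{
    std::set<std::string> Locals;     // includes the function's arguments
    std::set<std::string> Constants;
    std::set<std::string> ModuleVars;
    std::set<std::string> BuiltIns;
};

enum class ERef { Member, Local, Constant, ModuleVar, BuiltIn, Error };

ERef Resolve(const TScope& s, const std::string& sName, bool bExplicitThis)
{
    if (bExplicitThis)             // "this.x": unchecked at compile time,
        return ERef::Member;       // resolved at run-time against the object
    if (s.Locals.count(sName))     return ERef::Local;
    if (s.Constants.count(sName))  return ERef::Constant;
    if (s.ModuleVars.count(sName)) return ERef::ModuleVar;
    if (s.BuiltIns.count(sName))   return ERef::BuiltIn;
    return ERef::Error;            // flagged at compile time: likely a typo
}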

JavaScript does exactly what I described above as “bad”. Any reference is checked not against the “this” object, but against the whole identifier resolution chain (which may include other stuff). In my experience developing in JavaScript (which consists basically of a Firefox extension), the lack of such checking is very uncomfortable (I wonder how they manage, as most of the Firefox UI is actually implemented in JavaScript).

One possible solution is to use a single dot, eliding the “this” identifier, as an implicit reference to the this object. It reminds me of the “with” statement in Pascal, which I really liked (and miss in C++ etc.). I could even include “with” as a feature in NGS, with “un-with’ed” code referring to “this” and explicit “with” referring to whatever the programmer wants.

I’ll leave this as it is now, and if it proves too cumbersome as I start using NGS to implement some functionality, I’ll get back to it.

By the way, the page is already in Google (although one needs to search explicitly for NGEDIT, of course), and no spam has arrived in my inbox as a result of publishing the e-mail address. I haven’t received any non-spam e-mail, either. A friend told me he’d dropped me a line, but it didn’t arrive. I checked by sending one myself through the link and it worked. I hope it keeps working.

Writing a compiler…

Thursday, May 26th, 2005

Right now, I am writing the scripting language compiler for NGEDIT. The language is loosely based on JavaScript (typeless variables, object-oriented), and so it has a C-like syntax. It generates code for a simple stack machine. I’m using a handwritten recursive-descent parser, which emits the bytecodes as parsing is done.

I’ve noticed two interesting things in the process.

The first one is how much of C/C++ syntax is unnecessary. Python did away with a lot of it, but I never thought that even in C/C++, most code could be parsed without any problem with many commas and semicolons left out. Ever wondered why class/struct/enum declarations end in a semicolon, while function definitions don’t? In the parser, it’s as simple as skipping the next token if it is a semicolon after the declaration or definition. After most statements, there is no ambiguity if the semicolon is removed (expression parsing stops if the next token isn’t a possible continuation of the expression). It is even possible to leave out the commas between function arguments and between the identifiers in an enum definition, even if they carry an explicit “= expr” initialization, as the sketch below shows.
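
The whole trick is a one-liner in the parser: a helper that consumes a separator if it’s there and doesn’t complain if it isn’t. A sketch (TLexer and TToken are hypothetical stand-ins, not the actual NGEDIT parser):

#include <cstddef>
#include <vector>

// Hypothetical stand-ins for whatever the real parser uses.
enum class TToken { Semicolon, Comma, Ident, Equals, Number, End };

struct TLexer
{
    std::vector<TToken> Toks;
    std::size_t i = 0;
    TToken Peek() const { return i < Toks.size() ? Toks[i] : TToken::End; }
    void   Skip()       { ++i; }
};

// The separator becomes a courtesy: consume it if present, never require it.
void SkipOptional(TLexer& lex, TToken tok)
{
    if (lex.Peek() == tok)
        lex.Skip();
}

// E.g. one enumerator in an enum body, where "= expr" is explicit but the
// comma after it may simply be left out:
void ParseEnumerator(TLexer& lex)
{
    lex.Skip();                          // the identifier
    if (lex.Peek() == TToken::Equals)    // optional "= expr" initialization
    {
        lex.Skip();
        lex.Skip();                      // the (toy, one-token) expression
    }
    SkipOptional(lex, TToken::Comma);    // comma between identifiers: optional
}

The same SkipOptional call handles the semicolon after a class/struct/enum declaration and the commas between function arguments.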

The second one came to me when I was writing the intermediate structures that hold the code generated for a full function and the code to evaluate an expression. I had to separate the code for the expression, as it can’t simply be added to the tail of the function (as can be done with the statements). It turned out that the structure to hold a compiled function and the structure to hold a compiled expression are identical! (At least, given that I was storing the function signature separately.) In NGS (the NGEDIT scripting language) all functions return values, so that is common as well. This reminded me of one extension GCC has had for years: a body of code surrounded by braces can be embedded in an expression, resulting in the value of the last statement. I don’t remember being able to use return inside the code, but it would make sense. In this way, we could write:

int a = { if (b) return 1; else { while(c--) do_something(d, e); return get_something(); } };

And if we used the parser with optional separators we could even write:

int a = { if (b) return 1 else { while(c--) do_something(d e) return get_something() } }

The optional separators are already working in the NGEDIT parser, but I don’t plan to implement the code-as-subexpression trick.
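
For reference, the GCC extension is the statement expression, written ({ ... }): the block’s value is that of its last expression statement, and indeed return doesn’t yield the value there, it returns from the enclosing function. The first example above would read like this in GNU C:

// GNU C/C++ statement expression: the ({ ... }) block is an expression
// whose value is that of its last expression statement.
int get_something(void);
void do_something(int d, int e);

int example(int b, int c, int d, int e)
{
    int a = ({
        int result;
        if (b)
            result = 1;
        else
        {
            while (c--)
                do_something(d, e);
            result = get_something();
        }
        result;   // this is the value of the whole ({ ... }) expression
    });
    return a;
}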

Welcome to the NGEDIT blog

Wednesday, May 25th, 2005

What’s NGEDIT? NGEDIT is my project for a new text editor, which I’ve been developing for the past months, and which will hopefully be ready very soon.

You may be asking yourself, why another text editor? There are dozens of them in the market, of all types and sizes, prices and weights, colors and fits. Why would anyone in his right mind think of developing yet another one?

The answer is actually simple. It turns out that I have a set of features in mind that will make current-day text editing look prehistoric. Of course, only time (and a lot of work on my part) will be able to prove or disprove this.

Before I can start writing all the innovative features, I need to implement a full-blown modern text editor: multiple text file formats and character sets (codepages, DBCS, Unicode anyone?), multi-file and multi-window support, fixed- and variable-width fonts, syntax highlighting, keystroke and programmable macros, multi-file find&replace, regular expressions, tons of preferences,… you name it.

I’m currently in the process of developing this first text editor, the first step towards the full NGEDIT. I plan to have a first release with:

  • all the solid features of a modern text editor
  • some advanced ones that are not very common
  • some very carefully crafted usability-oriented features
  • and some very specific and useful features that aren’t found in other editors

All in all, I think NGEDIT 1.0 will already be a winner over other editors out there, if not in all cases, at least in many of them. I hope to have a downloadable trial version soon enough that you will be able to see for yourself.

I will be posting about how development progresses and about the process of setting up the web store infrastructure; there will probably be some rants and miscellaneous other stuff as well. Feel free to comment on any issue.