Entrepreneur: Get your HTML right!

July 29th, 2005

I’ve been having some problems with www.ngedit.com. It was working nicely, but almost zero searches had ever found it. It’s weird, because quite a few searches have found blog.ngedit.com.

After the release of ViEmu, it’s a bit different, but up until last week, the top search item finding this blog was “WM_CHAR”. WM_KEYDOWN and friends were also up there on the list. Search phrases revealed the frustration and desperation of programmers fighting to tame the Win32 keyboard input model: “how to get utf-16 wm_char without unicode”, “getting wm_syskeydown instead of wm_keydown ime”, “wm_char arrows repeat”, even “brief explanation of egytian currency”! My blog post on the issue is even on the first page you get from google when you just search for WM_CHAR!

As an aside, web stats are a source of awe and wonder for me: I cannot help but imagine the story behind every search, and this links to my theory of “exceptions”, but that’s a story for another day.

But only two terms have been finding the main page: ngedit and www.ngedit.com. A grand total of 23 search hits since the existence of the site (and I’m pretty sure several of them were either me or some friend I told about the endeavor).

It didn’t worry me too much until now. People were finding the blog and that was nice, and the main site was almost placeholder stuff. But now, after the release of my first product, www.ngedit.com is important to the progress of my venture.

And this week, after the release, I’ve gotten quite a lot of traffic from JoS and some announcements here and there, as well as the nice blog posts from fellow entrepreneurs setting up their own shop at the same time. But nothing from google or other search engines.

The weird thing has been that my site has basically not appeared when looking for “visual studio vi emulation” (I know I searched for that myself when I had the need). And, even if I didn’t do it on purpose for SEO, the main site must be packed to death with that phrase!

Weirder still, my blog post on the release of ViEmu is on one of the first pages for that search phrase, and even Ian Landsman’s post is on the second page! How come www.ngedit.com does not appear in Google searches?

I set up a Google AdWords campaign yesterday (please, if you see the ad, and you already know about the project, do not click on the ad 🙂 I’m doing it with a very limited budget). Ads are not appearing either (although Google reports 6 impressions with 0 click-throughs).

Weirdly, today I found out that if I wrote it in another order, such as “vi studio visual emulation”, the site appeared… on the second page! Come on, this is not such a crowded market niche!

Today, I’ve found out something which I believe is the key to all of this: as I was preparing a new page for the site, I found out with terror that my html pages were full of html syntax errors! Well, not exactly full: there were about two or three on the main ViEmu product page, all pages were missing the DOCTYPE line, and, by some accident, I had removed the opening <html> and <head> tags from over half the pages. Duh.

Please don’t get me wrong – I do all my html and css by hand, with a lot of love put into every html tag and every css style. I don’t have much experience with html (although that’s rapidly changing), and I prefer to do it this way for now. I check everything with Firefox and IE, and I pay a lot of attention to proper html – but I had simply forgotten to actually verify it with something such as the w3c validator.

So, now everything’s corrected (the new content section is not uploaded yet), and I hope to get good Google search results in a very short while (a couple of days?), unless they really punish you for having lowered the average quality of html on the web.

And a happy fact is that I actually used NGEDIT to convert all the files from CRLF terminators to LF terminators, so that they don’t have to be converted by ftp each time I upload them 🙂

In other news, yes, I will be adding an “Articles” section to the main site, with some new content which is ready now, and I’ll publish the main articles there instead of on the blog – leaving the blog for shorter and more day-to-day entries, plus announcements of the articles posted there. I really don’t think the blog is the right place for the long articles I tend to post.

Unicode, text management, and C++ techniques (II)

July 29th, 2005

We left the series a few weeks ago, after having talked a bit about UCS-2/UTF-16, which in its little-endian version is simply called “Unicode” by Microsoft.

We’re now going to review a bit of what UTF-8, probably the most widespread Unicode encoding, actually means.

Remember the context (I’m probably just reminding myself, as I’ve been so busy with ViEmu for the past few weeks): we’re going to see how NGEDIT handles the different text encodings internally – based on the fact that NGEDIT does not convert on read and write, but keeps files in memory in their original encoding.

UTF-8 was a very neat trick (elevated to the category of a standard) devised by Ken Thompson. The basic unit in UTF-8 is a byte, but only some characters occupy a single byte. Characters may actually take anything from 1 to 4 bytes, depending on their value. Actually, the encoding method allows characters of 5 or even 6 bytes, but those only happen for code points above 0x10FFFF, which the Unicode standard now forbids – so no 5 or 6 byte sequences should be found in a “legal” UTF-8 file.

Basically, ASCII characters are stored as single-byte 0..127 values (because, as I guess you know, ASCII is a 7-bit code, and the Unicode set of characters coincides with ASCII in the first 128 characters). That means a file consisting of only ASCII values will be exactly the same in good old 8-bit-stored ASCII and in UTF-8.

The 128 characters from 128 to 255 in Unicode, together with the first 128 which are plain ASCII, complete the ISO-8859-1 encoding, usually called Latin1. This was the default encoding for HTML, and even if a lot of HTML these days uses UTF-8, I think the default if no encoding is specified is still ISO-8859-1. How are these characters encoded in UTF-8? Actually, with two-byte sequences:

Latin1 character 128: UTF-8 bytes 0xC2 0x80

Latin1 character 255: UTF-8 bytes 0xC3 0xBF

Unicode code points from 0x80 up to 0x7FF are all encoded in two bytes in UTF-8, and the last one, 0x7FF, becomes 0xDF 0xBF.

As you can deduce, the first byte is not just a marker. Two-byte UTF-8 sequences are marked by the high 3 bits of the first byte being binary 110 (so, in hexadecimal, the byte will be between 0xC0 and 0xDF). The other 5 bits of the first byte are the highest 5 bits of the 11-bit character encoded. And the trailing byte actually carries 6 bits of info, as its highest two bits must be binary 10.

Higher code points use 3- and 4-byte-per-character encodings: 3-byte characters are marked by the high four bits of the first byte being binary 1110, and 4-byte characters are marked by the high five bits being binary 11110.

As an important point, all trailing bytes in characters of any byte-length always have the high two bits as binary 10, so finding where characters start is easy.
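
To make the bit layout concrete, here’s a small sketch of how a code point could be turned into UTF-8 bytes following the rules above (UTF8_Encode is just an illustrative name, not part of NGEDIT; it assumes the same byte typedef as the decoding code further below, and only handles code points up to 0xFFFF):

inline unsigned UTF8_Encode(unsigned ch, byte *pOut)
{
  if (ch < 0x80u)
  {
    pOut[0] = (byte)ch;                              // 0xxxxxxx
    return 1;
  }
  else if (ch < 0x800u)
  {
    pOut[0] = (byte)(0xC0u | (ch >> 6));             // 110xxxxx
    pOut[1] = (byte)(0x80u | (ch & 0x3Fu));          // 10xxxxxx
    return 2;
  }
  else
  {
    pOut[0] = (byte)(0xE0u | (ch >> 12));            // 1110xxxx
    pOut[1] = (byte)(0x80u | ((ch >> 6) & 0x3Fu));   // 10xxxxxx
    pOut[2] = (byte)(0x80u | (ch & 0x3Fu));          // 10xxxxxx
    return 3;
  }
}

Feeding it Latin1 character 255 (0xFF) takes the second branch and produces exactly the 0xC3 0xBF pair shown above.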

Anyway, the point is, how do we translate the code from the last post, which looks for whitespace, so that it will work with UTF-8? Let’s see again the beautifully simple original one-byte-per-character code:

unsigned FindWhiteSpaceRight(
  const char *psz, unsigned uLen, unsigned uOffStart
)
{
  unsigned u = uOffStart;

  while (u+1 < uLen)
  {
    if (IsWhiteSpace(psz[u+1]))
      return u+1;
    u++;
  }

  return uOffStart;
}

It’s not the most beautiful code, but it’s beautifully simple.

Now, let’s see the UTF-8 enabled version, which could actually recognize a hieroglyphic whitespace if it were necessary:

unsigned FindWhiteSpaceRight(
  const byte *psz, unsigned uLen, unsigned uColStart
)
{
  unsigned u = uColStart;

  while (u+1 < uLen)
  {
    unsigned len; // Characters may be long...
    unsigned ch; // Characters may be >0xFFFF

    len = UTF8_CalcLen(psz[u]);
    if (u + len < uLen)
    {
      ch = UTF8_Decode(psz + u + len);
    }
    else
    {
      // Invalid or truncated sequence!
      //TODO: Handle it in some better way than just failing
      return uColStart;
    }

    if (IsWhiteSpace(ch))
      return u+len;
    u += len;
  }

  return uColStart;
}

The code to calculate the length of and decode a UTF-8 character would look more or less like this:

inline unsigned UTF8_CalcLen(byte b)
{
       if (b < 0x80u) return 1;
  else if (b < 0xE0u) return 2;
  else if (b < 0xF0u) return 3;
  else if (b < 0xF8u) return 4;
  else if (b < 0xFCu) return 5;
  else return 6;
}


#define EX(x, shl) ((x & 0x3F) << shl)
inline unsigned UTF8_Decode(const byte *p)
{
  byte lead = *p;

  if (lead < 0x80u)
  {
    return lead;
  }
  else if (lead < 0xE0u)
  {
    return ((lead & 0x1Fu) << 6u) | EX(p[1], 0);
  }
  else if (lead < 0xF0u)
  {
    return ((lead & 0x0Fu) << 12) | EX(p[1], 6)
        | EX(p[2], 0);
  }
  else if (lead < 0xF8u)
  {
    return ((lead & 0x07u) << 18u) | EX(p[1], 12)
        | EX(p[2], 6) | EX(p[3], 0);
  }
  else if (lead < 0xFCu)
  {
    return ((lead & 0x03u) << 24u) | EX(p[1], 18)
        | EX(p[2], 12) | EX(p[3], 6)
        | EX(p[4], 0);
  }
  else
  {
    return ((lead & 0x03u) << 30u) | EX(p[1], 24)
        | EX(p[2], 18) | EX(p[3], 12)
        | EX(p[4], 6) | EX(p[5], 0);
  }
}
#undef EX

Take into account that this code is not actually Unicode conformant, given that it shouldn’t accept 5 and 6 byte characters, and it should filter out overlong sequences (characters which occupy N bytes but could have been encoded with fewer bytes).

So, now you see how the actual code gets more complex for UTF-8, and the innocent loop actually involves a lot of operations now.

We’ve now seen the complexities of dealing with different encodings: one byte per character, Windows “Unicode” with possible “surrogates”, UTF-8 with all its varying-length management needs. We haven’t even looked at DBCS, the systems in which Japanese, Korean, and various kinds of Chinese text are commonly stored, and in which seeking backwards in text is all but impossible, because lead bytes and trail bytes are not distinguishable by value. And then there are all the other Unicode encoding variants, including little-endian and big-endian versions, etc…

How can one choose to implement support for all of these in C++?

One possibility is to write a version of each text-management function such as FindWhiteSpaceRight for each supported encoding.

Just kidding 🙂

What we really want is to write code almost as simple as the one-byte-per-character version above, which will work for all encodings.

As a common C++ idiom, we could design a base class with virtual methods representing the required operations. Methods could be “unsigned GetChar()”, “AdvancePointer()”, etc…, and each derived class would implement its own version of each.
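
Just to make the idea concrete, such a hierarchy could look more or less like this (a sketch with made-up names, not NGEDIT’s actual classes):

class TTextEncoding
{
  public:
    virtual ~TTextEncoding() {}

    // Code point of the character starting at p
    virtual unsigned GetChar (const byte *p) const = 0;

    // Length in bytes of the character starting at p
    virtual unsigned CharLen (const byte *p) const = 0;
};

class TSingleByteEncoding: public TTextEncoding
{
  public:
    unsigned GetChar (const byte *p) const { return *p; }
    unsigned CharLen (const byte *) const { return 1; }
};

class TUTF8Encoding: public TTextEncoding
{
  public:
    unsigned GetChar (const byte *p) const { return UTF8_Decode(p); }
    unsigned CharLen (const byte *p) const { return UTF8_CalcLen(*p); }
};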

This would work. Indeed. But we would be paying a high price.

For one, there is the price of a virtual function call for every simple operation. The one-byte-per-char version is not only simple to read, but the code it generates is really good, because the CPU is very good at handling plain bytes.

But the second, very important one, is that the virtual functions would need to receive and return the most general character representation, that is, 32-bit-per-char UCS-4. And that would mean converting back and forth for really simple operations.

This is especially important for one reason: I wanted NGEDIT to handle all encoding types, to handle them natively, but most day-to-day editing happens with one-byte-per-char encodings. Burdening the code which is run 90% of the time in a large part of the world (at least, all of Europe and the US) with a high performance impediment seems a bit absurd, and I didn’t want to do it.

The goal is code that is simple to write and read, code which can be made to work with all encoding types, but also code that will become the simple byte-handling code that we had for the first case when we are actually dealing with one-byte-per-char encodings. And, sure, we don’t want to write gobs of code.

The solution? Templates, of course. That will be the topic of the last article in this mini-series, together with some other genuinely important reasons to use such a solution (hint: tab handling code is often a waste!)
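
Just as a hedged teaser of the shape that solution could take (illustrative names only, not NGEDIT’s actual code): describe each encoding with a small traits class whose static functions the compiler can inline, write the generic function once as a template, and let the single-byte instantiation boil down to the plain byte-scanning loop we started with.

struct TSingleByteTraits
{
  static unsigned CharLen (const byte *) { return 1; }
  static unsigned GetChar (const byte *p) { return *p; }
};

struct TUTF8Traits
{
  static unsigned CharLen (const byte *p) { return UTF8_CalcLen(*p); }
  static unsigned GetChar (const byte *p) { return UTF8_Decode(p); }
};

template <class ENC>
unsigned FindWhiteSpaceRight(
  const byte *psz, unsigned uLen, unsigned uOffStart
)
{
  unsigned u = uOffStart;

  while (u+1 < uLen)
  {
    unsigned len = ENC::CharLen(psz + u);
    if (u + len >= uLen)
      break;
    if (IsWhiteSpace(ENC::GetChar(psz + u + len)))
      return u + len;
    u += len;
  }

  return uOffStart;
}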

vi/vim emulation for Visual Studio

July 27th, 2005

After a lot of work and testing, ViEmu 1.0 is finally out. You can check it at www.ngedit.com. I hope you like it. All feedback is welcome.

I think I will be able to come back to posting some interesting articles later this week.

Best!

Beta process going

July 18th, 2005

ViEmu beta is going well. Beta 1 had some trouble, as I had left a DLL out of the package, but after several hours of fresh Visual Studio installs I found it out. Today I’ve issued Beta 2 with some improvements in Intellisense integration and better TAB handling. I also have the new web design almost ready.

I still keep the target of releasing before the end of July. We’ll see if nothing turns it into an August release 🙂

Unfortunately, this isn’t leaving too much time for technical posts, although they’ll be back after the release.

Imminent release announcement

July 12th, 2005

No, it’s not NGEDIT. It’s ViEmu, and it’s almost ready. Vi editor emulation for Microsoft Visual Studio .NET 2003.

Let me explain myself.

A few weeks ago, an idea struck me – I had a quite large chunk of vi emulation working. And having vi emulation would sure be a great addition for Microsoft Visual Studio (at least for those, like me, who have their fingers hardwired to vi’s input model). I had no experience writing extensions or add-ins for VS, but it couldn’t be that hard. I gave myself a full afternoon to research it (“whaddayamean, switching to another project before the first is ready?”) and verify it was as good an idea as it seemed.

After two hours, reading a blog comment saying “if I had vi emulation in Visual Studio I’d be in heaven” and reading a note from Microsoft that yes, they had it as a feature for future versions, but no, VS .NET 2005 would not have it, I registered the domain name (actually two, viemu.com and vimemu.com, I wasn’t sure). I downloaded the VSIP SDK (add-ins only have limited extension capabilities, you need the VSIP SDK if you really want to shake Visual Studio). And I started hacking at all that COM code.

In a couple of days, I had a custom package loading within Visual Studio and had subclassed the editor window. I did have to learn a whole lot of COM programming (I had done some basic COM, but not the kind of COM you have to do in order to talk to a beast like VS – no, I don’t love it more now). I briefly considered C# but went with C++ instead. I did the first experiment with the following thought: “if I can make pressing 0 send the cursor to the beginning of the line, I can do the rest” (0 is the default way to send the cursor to the beginning of the line). I did it, felt great, and went to sleep (it was late).

I then started porting the vi emulation core. You know, the emulation core was written in NGS, the NGEDIT scripting language. I had to port it to C++ (no, I wasn’t going to bring NGS scripting over to Visual Studio). It took a solid four days to port all of it. No, adding the semicolons wasn’t the worst part. The emulation core was nicely separated from the editing core actions implementation, so I only had to implement a few primitives to get it working. The day I finished porting the ViEmu core, a lot of vi started working simultaneously – wonders of porting. I started using ViEmu full-time to develop itself.

A few weeks and a lot of vi implementation later (there were still many things missing in the NGEDIT ViEmu module), ViEmu is now almost ready. What’s still missing:

  • The preferences section (you wouldn’t believe the COM programming required for a simple dialog with 5 or 6 checkboxes and an edit box – multiple inheritance from twelve base classes is used in MS’s sample code, only one of which is not a template).
  • The installer – you must use an MSI installer for it – I have to decipher the MSI SDK, which I still haven’t been able to even reliably find for download – I’m waiting to see if my recent MSDN Universal subscription has arrived so that I can find it there more easily.
  • Solving some small interactions with Intellisense and the undo system. It’s working, but there are still some tricky cases.
  • General review of cursor positions after vi commands are performed. Some of them are not the same as with vim, and that should be right for 1.0.
  • And beta testing. I’ve been using it myself for all development and it’s very stable, but I’m only using C++ and a simple one-byte-per-character codepage (even if VS uses UCS-2 Unicode internally). I’ll do some basic testing on Visual Basic or C#, but not the kind of testing someone who is actually developing does.

Ok, so, my expected timeframe is quite short. I definitely want to release it before the end of July. I also have to set up the web page and e-commerce system, but I don’t expect many problems with that. I expect to finish the development tasks listed above this week, and I want to run beta-testing as soon as the installer is ready (I need to deliver the beta itself as an installable package).

So, if you would like to beta test ViEmu, please drop me a line. I’m planning a quite small beta testing group. I’m mostly looking for people who use languages other than C++, codepages other than the usual US/Western Europe one, and Windows versions different from Windows 2K and Windows XP.

Someone who writes VB applications in Korean on a Windows Millennium machine would be a dream come true 🙂

I hope it will work with any left-to-right writing system (I have no idea how it will perform with Arabic or Hebrew bidirectional writing, and I’m not delaying 1.0 until that is OK).

Leaving out the vi/vim command line, which is not emulated, almost all of vi/vim input is emulated (including visual selection modes, etc). I’ll make available a full list of emulated features.

It is not compatible with Whole Tomato Software’s Visual Assist. I’m looking into it, but I’m not sure it will be fixed by 1.0, maybe a bit later.

Porting to VS 2005 is on the radar, of course, but 1.0 will be for VS .NET 2003. VS 2005 is actually still in beta.

And, rest assured, I’ll come back to NGEDIT after ViEmu is released.

Unicode, text management, and C++ techniques

July 2nd, 2005

Let me apologize for taking so long to post. I’ve been in a kind of a “development frenzy” for the past couple of weeks. I will be posting some news regarding all the new development shortly 🙂

Today, I’m going to start reviewing how NGEDIT manages the text buffers of the files being edited. I was explaining it and showing the source to a developer friend of mine a few days ago, and he found the C++ techniques interesting. I hope it will be useful and/or interesting to you as well.

The model is rooted in the way NGEDIT handles different text encodings, such as Windows codepages, DBCS, or different flavors of Unicode. It will take a few blog posts to cover the subject.

Some months ago, when I developed the earliest prototype, I started out with simple one-byte-per-char buffers. It was not final code and I just wanted to have the editor up and running. At the end of the day, most editing I do is in the good ol’ 1252 codepage, using a single byte per character. Quite probably, so is yours, if you’re in the US or Western Europe.

As soon as basic editing and UI were working, I started researching how to handle the different encoding types.

I know that one can use Windows’ understanding of Unicode, using two bytes per character. Well, actually, it’s not two bytes per character – even though the Unicode standard creators initially thought that 65,536 unique characters would be enough to encode all writing systems, in the end they found out they needed more. I’m not completely sure, but I think Microsoft’s decision to use a two-byte-per-character encoding predates the (bad) news that some characters would not fit in two bytes, thus requiring some sort of extension (actually called “surrogates”). That is, if you decide to use two bytes per character, you still cannot assume uniform character length. That is only true for the first 65,536 characters (technically “code-points”) in the Unicode standard. This set is nicely dubbed the “Basic Multilingual Plane”, and I think it covers all widespread systems (including Japanese, Chinese, Korean, Greek, Cyrillic, Hebrew, Arabic, Thai, etc.). I think the writing systems you are forgetting about would include Klingon, Egyptian hieroglyphs and some other alphabets which you’d better not use in the comments in your code, in your config files, or in your customer database.

If two bytes per character bought you universality together with simplicity, I’d be much more inclined to use it. But the thought that the code should gracefully handle the kind-of-escape-sequence surrogate pairs makes me feel that, apart from wasting the memory in most cases, I have to tolerate variable-length characters anyway. And in most cases (Greek and Eastern writing systems excluded), UTF-8 is a much better encoding for this: ASCII characters, that is, the first 128 characters in the character system you are actually using now (unless you are reading this from an IBM mainframe, which I seriously doubt), are one-byte-coded in UTF-8. If you use English, unless you are the type of person that writes “naïve” or “résumé”, the whole of your file can be encoded in one byte per character, while still allowing the occasional hieroglyph in the middle.

Anyway, I had to support the different Unicode encodings in the editor. Even if you only use them sometimes, an editor with support for only one-byte-per-character encodings is simply not serious nowadays. I also decided that I would be supporting DBCS encodings, that is, Asian code pages in which characters can be encoded in one- or two-byte sequences. When I had to do some localization support for Japan, Korea, China and Taiwan a few years ago, I was not sure whether Unicode would be widespread in those countries. I simply asked them to send me some localized materials without specifying the format, and they just sent DBCS encoded text files. I found out Unicode was not too widespread there either.

Let’s look at how the early NGEDIT code to do some handling looked. This sample shows the code to find the next “whitespace” character in the line:

unsigned FindWhiteSpaceRight(
  const char *psz, unsigned uLen, unsigned uOffStart
)
{
  unsigned u = uOffStart;

  while (u+1 < uLen)
  {
    if (IsWhiteSpace(psz[u+1]))
      return u+1;
    u++;
  }

  return uOffStart;
}

This is quite fine and dandy. And quick. The call to IsWhiteSpace() can easily be inlined, and the whole loop can be easily optimized by the compiler.

Now, let’s see how this may look for the default Windows Unicode encoding (which is formally called UCS-2LE or UTF-16LE, where LE means little-endian, and although there is some technical difference between UCS-2 and UTF-16, it is nothing of any importance in this context). We will do a simple translation.

unsigned FindWhiteSpaceRight(
  const wchar_t *psz, unsigned uLen, unsigned uColStart
)
{
  unsigned u = uColStart;

  while (u+1 < uLen)
  {
    if (IsWhiteSpace(psz[u+1]))
      return u+1;
    u++;
  }

  return uColStart;
}

It seems like a really simple transformation, one that is easy to perform, and which results in much more general code dealing with Asian or Arabic or Greek or Cyrillic encodings. wchar_t is a built-in C/C++ standard type used for wide characters. We switched from talking about offsets to talking about columns, as they’re not equivalent any more, but the rest seems pretty good.

But things are trickier.

As always happens with standard C/C++ types, wchar_t is not technically a very well defined type. According to Microsoft’s compilers, it is a two byte word able to store one of the first 65,536 code-points. According to GNU’s gcc compiler, it is a FOUR BYTE integer able to store any Unicode character. I don’t even know what it means in other environments.

So, the above code would be correct when compiled under gcc, although using 4 bytes per character – probably something you don’t want to do to handle really large files.

Compiling under Microsoft’s Visual C, or just using “unsigned short” in gcc in order to save some space, the above code is not really correct.

What happens if there is some Klingon character thrown in in the middle of the source code?

First thing, you should probably fire the programmer who wrote that. But that’s not very satisfying.

How do these characters get encoded in UCS-2/UTF-16? Well, the first 65,536 characters in the Unicode standard get simply encoded as-is. But, they were so cunning as to leave certain ranges unused for characters – most importantly the so called surrogate range from 0xD800 to 0xDFFF. These codepoints are not assigned to any character in the standard.

The standard defines characters from the starting 0x0000, and they have promised not to use any single value above 0x10FFFF. That is, there are 16 times 65,536 possible codepoints that can get encoded apart from the first 65,536 ones. That is, there are gobs of characters above the 0xFFFF ceiling. They decided to use what are called surrogate pairs. A sequence of two values in the 0xD800-0xDFFF range defines a single character. Actually, the surrogate range is divided into a “High Surrogate” range (0xD800 to 0xDBFF) and a “Low Surrogate” range (0xDC00 to 0xDFFF). The high surrogate must always come first, they must always come together (an independent surrogate with no companion has no meaning), and together they can encode 1024 times 1024 different characters. That covers the 0x100000 extra values beyond the BMP (‘Basic Multilingual Plane’).

This leaves us in the uncomfortable situation that the above code handling wchar_t’s is actually unaware of what it is doing with those symbols.

What will happen if we just do that? Well, it’s not that bad, as you probably won’t encounter Klingon characters “in the wild”. But if there are any, then you will be incorrectly manipulating them, and even if your OS does a good job of rendering them (the user had better have installed some good fonts to display that), you will be mangling the text.

UTF-8 encoding has similar properties, although the “naive” code will find characters wrongly handled much more easily (more about this in the next installment).

So, what should we really do to handle UCS-2/UTF-16 correctly? Something like this:

unsigned FindWhiteSpaceRight(
  const wchar_t *psz, unsigned uLen, unsigned uColStart
)
{
  unsigned u = uColStart;

  while (u+1 < uLen)
  {
    unsigned len; // Characters may be long...
    unsigned ch; // Characters may be >0xFFFF

    if (u+2 < uLen)
    {
      ch = UTF16_Decode(psz + u + 1, &len);
    }
    else
    {
      // Last wchar_t in seq, if it's a surrogate, invalid!
      if (UTF16_IsSurrogate(psz[u+1]))
      {
        // What to do now? Just fail?
        return uColStart;
      }
      else
      {
        ch = (unsigned)psz[u+1];
        len = 1;
      }
    }

    if (IsWhiteSpace(ch))
      return u+len;
    u += len;
  }

  return uColStart;
}

You see, now things are much uglier. We can find “invalid” sequences, and have to think about a sensible way to handle them. Encodings in which all sequences are valid make life much easier. On the other hand, we switched to talking about “columns” when getting into UCS-2/UTF-16, but that’s not quite valid anymore, given that the code just above isn’t counting characters (which are variable length) or bytes, but a kind of “word offset”. Such are the nasty things of variable-length encodings.
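
By the way, UTF16_IsSurrogate and UTF16_Decode aren’t defined anywhere in this post; a minimal sketch of what they might look like, following the surrogate-pair rules described above (my own guesses, not NGEDIT’s actual helpers), would be:

inline bool UTF16_IsSurrogate(wchar_t w)
{
  return w >= 0xD800u && w <= 0xDFFFu;
}

// Decodes the character starting at p; stores in *pLen how many
// wchar_t's it took (1 or 2) and returns the full code point
inline unsigned UTF16_Decode(const wchar_t *p, unsigned *pLen)
{
  unsigned hi = (unsigned)p[0];

  if (hi >= 0xD800u && hi <= 0xDBFFu)
  {
    // High surrogate: combine with the low surrogate that follows
    // (a robust version would verify p[1] really is a low surrogate)
    unsigned lo = (unsigned)p[1];
    *pLen = 2;
    return 0x10000u + ((hi - 0xD800u) << 10) + (lo - 0xDC00u);
  }

  *pLen = 1;
  return hi;
}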

Next time, I’ll review UTF-8, which really requires this kind of special handling, and start elaborating on how we can use some C++ mechanisms in order to handle all this gracefully.

Main site updated

June 27th, 2005

I have updated the main site to a much nicer design. Now only the blog needs a face-lift. And, please let me apologize for not posting in a whole week. I can tell you it’s been a heavy development week, delving really deep into hardcore COM programming!

I hope I’ll be able to post some new interesting technical article during the week.

Compiler IV: the runtime environment

June 19th, 2005

This is the last installment in the compiler series. I will be reviewing how NGEDIT implements the stack-based machine that executes the bytecodes generated by the compiler, and actually breathes life into the code written in the script language.

The first design decision is how to store the values that the runtime environment manipulates. For one, we have the variables that may hold values – and object members as well, which are just another kind of variable. As the other important group, we have the values that are stored in the stack.

We know that variables may hold a value of any type, which in the case of NGS means:

  • char: single byte signed quantities as in regular C/C++
  • int: native machine integer, that is, 32-bit, signed quantities
  • unsigned: same as the previous one, but unsigned
  • float: regular 32-bit floating point number
  • string: byte-per-character strings
  • object: reference to an NGS object
  • function: reference to an NGS function
  • built-in function: reference to a builtin NGEDIT provided function
  • nil: both a type and a value of its own, the “non-value”

There were some design decisions in coming up with this list. Even if the main editor interface is provided via objects, I decided not to provide a special type for built-in objects. Instead, I chose to use regular script objects with member functions initialized to built-in functions, and some “opaque” data to access the internal application data. This makes things simpler and fits nicely in the overall architecture.

I also decided not to provide a boolean type. Bytecodes that generate a logical boolean, such as comparisons, push either integer 1 or integer 0. Instructions that test a value (conditional jumps), take nil and any numeric zero as false, and anything else as true. As an important case, an empty string evaluates as true (you have to check its length if you want to check whether a string is non-empty). When adding values involving strings, which is implemented as string concatenation, nil values are promoted to the empty string, so string manipulation is quite comfortable.

Another decision that I took is to initially support only byte-encoded strings. That means Unicode UTF-16 (almost the same as UCS-2, or what Windows simply calls “Unicode”) is not supported as a native NGS value. Even if the editor is fully enabled to handle all sorts of encodings, I didn’t want to overcomplicate the internal script system. Please take into account that this does not mean the script cannot manipulate text in other encodings – that is done via NGEDIT’s built-in functions – it only means NGS’s native variables cannot hold this type of text. I’ll probably add a second string type to support Unicode strings in the future.

How does one implement these values? Via a good ol’ union inside a C++ class, which holds the “type” of the value separately:

enum EAtomType
{
  SA_NIL,
  SA_CHAR,
  SA_INT,
  SA_UNSIGNED,
  SA_FLOAT,
  SA_BUILTIN_FN,  // Built-in global functions
                  // (::MessageBox(), ...)
  SA_FUNCTION,    // Identifies a script function
  SA_STRING,
  SA_OBJECT,
};

class TAtom
{
  public:
    TAtom () { m_Type = SA_NIL; }
    ...
    void SetFloat( float fVal )
    {
      m_Type = SA_FLOAT;
      m_Data.f = fVal;
    }
    ...
    EAtomType   GetType     () const
      { return m_Type; }
    char        GetVal_char () const
      { ASSERT(m_Type == SA_CHAR); return m_Data.c; }
    ...

  private:
    EAtomType m_Type;
    union
    {
      char     c;
      int      i;
      unsigned u;
      float    f;
      char     *psz;
      SObject  *pObj;
    } m_Data;
};

You get the idea. Each value then occupies exactly 8 bytes. 3 of them are wasted, but that’s the price of alignment.

The operand stack, then, is simply an array of TAtom values together with a pointer to the TOS element. Variables are just TAtom’s. The members of an object or the elements in an array are just TAtom’s.
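
As an illustration of the truth-testing rules described earlier (nil and any numeric zero count as false, anything else, including the empty string, counts as true), the conditional-jump bytecodes could rely on a helper like this (a sketch; CastBool is a made-up name, and I'm assuming the GetVal_* accessors elided above follow the same pattern as GetVal_char):

inline bool CastBool(const TAtom &atom)
{
  switch (atom.GetType())
  {
    case SA_NIL:      return false;
    case SA_CHAR:     return atom.GetVal_char() != 0;
    case SA_INT:      return atom.GetVal_int() != 0;
    case SA_UNSIGNED: return atom.GetVal_unsigned() != 0;
    case SA_FLOAT:    return atom.GetVal_float() != 0.0f;
    default:          return true; // strings (even empty ones), objects, functions...
  }
}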

By the way, pardon my French in borrowing the ‘atom’ and ‘nil’ terms from Lisp terminology. ‘Atom’ was quite clear to me, and I preferred ‘nil’ over ‘null’, given that it represents a non-value rather than a null pointer.

As I have already commented, I will probably evolve NGS towards Javascript so some of this terminology will probably change.

So, now that we know how to structure the stack and the array pool, we just need to start interpreting the bytecodes. From the rest of NGEDIT, a script function is invoked with a call similar to this:

  TScrExec exec;
  TScript *ps = FindScript(...);
  TAtom arg, retval;

  arg.SetInt(key.Code);
  exec.AddArg( &arg );

  arg.SetInt(key.ModifKeyState.Val);
  exec.AddArg( &arg );

  exec.InvokeFunction(
    ps,
    ps->FindFunction("IProcessChar"), NULL, &retval
  );

  return CastInt(retval) ? true : false;

TScrExec is the class that executes a script function by interpreting its bytecode. Right now, scripts are non-concurrent, and as such the function runs until it returns or it generates a “trap”. I will be adding a concurrent script engine which will help in using the scripts more efficiently and, importantly, in being able to debug them by executing step-by-step.

Leaving aside the interfacing code to pass in the arguments and get the result back, the core function is InvokeFunction. It is the function that actually performs the work. How does interpreting the bytecodes work? Quite simply, we keep a pointer into the bytecode stream pointing to the next instruction to execute, and we perform a large switch statement on the bytecode, which branches into the code to execute each instruction. Each instruction is so simple that they are quite straightforward to execute. Push, pop, dup, etc… Quite a lot of error-checking goes in here, as we have to take care that the stack doesn’t underflow, invalid codes are not present in the bytecode stream, operations are not performed on values of the wrong type, etc:

TRet TScrExec::InvokeFunction(
  TScript *ps, unsigned uFn,
  const TAtom *pThisVal, TAtom *pRetVal
)
{
  // Push 'this' and args on the stack
  ...

  // Prepare the run context
  ...
  m_pRC->uPC = 0; // Next instruction pointer
  ...

  // Execute instructions
  while (!IsFinished())
  {
    ExecInstr();
  }

  ...
}

void TScrExec::ExecInstr()
{
  // Trap on invalid address
  if (!IsValidPC())
  {
    Trap( TRAP_INVALID_ADDRESS );
    return;
  }
  ...

  // Read the opcode
  byte opcode = pCode->GetByte( m_pRC->uPC );

  // Immediate handling & decoding
  unsigned immediate = 0;
  unsigned imm_bytes = GetByteCodeImmBytes ( opcode );
  unsigned instr_bytes = imm_bytes + 1;

  if (imm_bytes)
  {
    ...

    // Read immediate
    if (imm_bytes == 1)
      immediate = pCode->GetByte( m_pRC->uPC + 1 );
    else if (imm_bytes == 2)
      immediate = pCode->GetWord( m_pRC->uPC + 1 );
  }

  // Advance PC to next instruction
  m_pRC->uPC += instr_bytes;

  // Instruction execution
  switch  (opcode)
  {
    case BC_NULL:
      Trap( TRAP_USER );
      break;

    case BC_PUSH_CONST:
    {
      TAtom atom;
      ps->GetConstant(&atom, immediate);
      m_Stack.Add(atom);
      break;
    }

    ... Really long switch statement ...

  }
}

There is one thing we haven’t considered here: functions in the script will be calling other functions. Values to those functions are passed on the stack, so it’s just a matter of pushing the arguments before calling the function. The return value is returned on the stack as well, so that part’s simple. But there is the need of storing the return address in order to know where to jump when we hit on the RET bytecode in the called function.

Apart from the need for the return address, we need to reserve space for local variables. Sure, we could use the regular stack for them, but that would make code generation a bit more complex. What we do is we have a separate “run-context” stack. A run-context holds the return address, the local variables, and some other useful info. When we find a CALL instruction, a new run-context is generated and pushed on what I call the “call-stack”. The variable m_pRC shown above always points to the top of the run-context stack.
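
A run-context, then, could be little more than a plain struct along these lines (a simplified guess at the layout; the field names, except uPC, are mine):

struct SRunContext
{
  unsigned uPC;           // Next instruction pointer within the function's bytecode
  unsigned uFn;           // Which function this context is executing
  unsigned uRetAddress;   // Where to resume in the caller after RET
  unsigned uNumLocals;    // How many local variables this invocation holds
  TAtom    *paLocals;     // The local variables themselves
  // ... plus some other useful info
};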

There is one last thing we haven’t talked about: the calling convention. Even if we have already said that the calling function pushes the args on the stack, and the called function pops them and leaves the result on the stack, there are some details that have to be taken care of. For one, we need to decide in what order the arguments are pushed: is the leftmost one pushed first or last? Where does the implicit “this” pointer get pushed?

Given that NGS handles values, and functions are called via pushing references to them on the stack (maybe even taken from members of objects), there is in general no easy mechanism to check at compile time whether the number of arguments passed matches the number expected by the called function. We could check it at runtime and fail if the number is not right. I preferred to implement a calling mechanism that pushes the actual number of arguments passed on the stack as well, after pushing all the regular arguments. The called function, then, can use that number to know how many arguments have actually been passed and pop them into local variables. If more arguments have been passed than expected, the extra ones are ignored. If fewer than the expected number have been passed, their corresponding local variables are set to nil. The ENTER instruction, which is the first instruction in every function, does exactly this.
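
In the big switch of ExecInstr(), the handling of ENTER under this convention could look roughly like this (a sketch built on the assumptions above: Pop() returning the popped TAtom and SetLocal() are made-up helpers, the expected argument count is assumed to travel in the instruction's immediate, and arguments are assumed to be pushed left to right):

    case BC_ENTER:
    {
      // The caller pushed the arguments and then the actual count
      unsigned uPassed = (unsigned)CastInt(m_Stack.Pop());
      unsigned uExpected = immediate;

      // Extra arguments are simply ignored
      while (uPassed > uExpected)
      {
        m_Stack.Pop();
        uPassed--;
      }

      // Pop the passed arguments into locals, rightmost first;
      // missing ones become nil (a default-constructed TAtom)
      for (unsigned i = uExpected; i > 0; i--)
      {
        if (i <= uPassed)
          m_pRC->SetLocal(i - 1, m_Stack.Pop());
        else
          m_pRC->SetLocal(i - 1, TAtom());
      }
      break;
    }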

This is the last part of our review of NGEDIT’s scripting architecture. I hope it has been worthwhile as an overview of how a full scripting system is developed. I will cover some other compiling issues in the future, but I will also be covering other general editor areas such as text memory management, etc.

Feel free to comment or ask about whatever you are interested in.

Programming productivity

June 16th, 2005

I always have an eye on my own productivity. Don’t misunderstand me, I’m no productivity freak. I don’t buy into all the “Getting Things Done” or Steve Pavlina-style productivity obsession based on pushing yourself. Years ago, I would feel guilty about having spent too much time watching TV, reading or generally procrastinating. I spent way too many years in the vicious cycle of not getting things done, not forgiving myself, feeling guilty about it, and falling into the same hole again and again.

(BTW, do read the link above if you haven’t yet, it’s one of the funniest things I’ve ever read.)

Later, I’ve learned that, when you don’t do something, there is always a reason. And your best possible effort is to discover it – not to put pressure on yourself to crush the reason. That’s just doing stuff over and above your own instinct, a nasty thing to do.

The most common reasons I’ve found are: (1) I didn’t really want to do it, or (2) I was afraid of something.

When you discover that you really don’t want to do something you’re trying to push yourself into doing, the best thing you can do is take the steps to actually not do it. If there are other people involved, you probably have to tell them you are not going to do it. If the task was the center of your life, you probably have to discover what you actually want to do with your life (a tough task). Sometimes the implications are so overwhelming that your best choice is to simply do the thing, because of other reasons. But then, you know why you are doing the stuff, and you won’t be punishing yourself for not actually doing with pleasure what you really don’t want to do.

The other very common reason is that some fear is pushing you back and making the unconscious decision that it’s better not to tackle whatever it is, lest the feared thing happen. Sometimes it is fear of failing, when discovering failure would be much worse than just having the task lying around. Sometimes it is fear of others seeing you fail. Anything.

I am digressing, I wanted to focus on the feelings that come with programming.

The deal is that, even when you are actually pursuing a goal you really want to pursue, there may be a myriad of little tasks in the way that are not satisfying by themselves, but which you still want to do for the goal they take you towards. In programming, I often find that a lot of the things I have to do are not really a pleasure; it’s the final goal that matters to me. I’m the impatient kind of guy, and I think much faster than I code (who doesn’t?), so I have a lot of conflicting feelings when programming.

Knowing that I have to deal with the Windows API is the typical thing that I really hate, so tasks that involve it tend to get done later than others. But not only those; almost every programming task creates some feeling of discomfort. I actually have to drag myself to actual coding, which is much less fun than thinking up the design of things. I think programming tools will evolve in the sense that programmers will have more of the fun part and less of the nitty-gritty-nasty-details part (think about it: assembler programming was much worse than C/C++ programming, and most programmers can actually get a lot more work done just by not having to worry about which registers things get put in).

I can attest that I do this dragging quite successfully, as I’m averaging about 300 lines of code per day for the past few months, which is quite a lot. And the code is of quite high quality, not that I’m writing a mess – I can produce that much because I am quite eager to have all the core of the editor working in order to start with the innovative stuff. But even then, I find I have to push myself into doing things every day. Reading the JoS forums is too often a much more interesting temptation.

Anyway, I tried an experiment yesterday and it has worked nicely. There were three tasks that I had to accomplish in order to get to the next step. Vi/Vim emulation is working so that it’s mostly complete for my everyday usage (complete vi emulation is really a huge amount of work and I don’t need every single little command), and after I get to my desired level of emulation completeness I’ll be moving to other areas of NGEDIT development. The three tasks were actually:

  • Smart input buffer: in vi, you enter a special “input” mode whenever you are going to add new text. The ‘.’ command later on lets you repeat whatever you input (inserted characters, replaced characters, erased characters, etc). I use that continuously.
  • Named clipboards (“registers” in vi slang). You use them in vi much more often than in regular editors, as they are very straightforward to use, and for some other reasons. This feature is not only useful for the vi interface, but for the normal editor interface as well. More importantly, the clipboard in the editor has been “internal” until now, with no interaction with the Windows clipboard, and this task involved starting to deal with the Win32 API.
  • Selection types: just the regular character/line/column types of selections for copy/pasting. Character selections were implemented a long time ago, but I really need the other types for everyday usage.

I was looking at this as a somewhat steep hill to climb, with three separate “pushing myself’s”, one for each task.

Instead, I tried an alternative: I would start the three tasks at once, cutting through the need for three separate starts. I would get to the VimEmu.ngs script, implement the script side of the three of them, and then go through the other areas of the application source and get the three of them done in each part. This would mean that the application wouldn’t compile for some time, but I know that compiling and testing a minor thing is often an escape from actually confronting the programming task at hand.

How did it go? I think it’s gone quite well. The features are not finished yet, but the smart input buffer was completely implemented yesterday (if not working perfectly). The named registers (clipboards) are mostly working and code to write to / read from the windows clipboard is there (even if not all cases of text encoding conversions are working). And the selection types are lacking the core low-level implementation (actually displaying the different selections and actually reading/writing the right text when copy/pasting), but all the framework including communication with NGS is there.

I’m hoping I can get a decent chunk of that finished today. And I think I gained some time, and avoided some dismay, by “parallelizing” the starting-up conflict of the three.

PS: Does anyone know why on earth non-platform-native text on the Windows clipboard is not characterized by the codepage but by the locale? It seems that if I have text in the codepage used in Russia, I can’t paste it into the windows clipboard, unless I look up what their currency or date format is. And no, I don’t yet know how to get from the codepage to the specific locale – I’m using Unicode for copy/pasting in those cases.

What’s broken in the WM_KEYDOWN/WM_CHAR input model?

June 13th, 2005

I was planning on writing a rant on the input model of the Win32 API. See, while developing the input model for NGEDIT, I’ve been pondering how to implement the key-mappings, input handling, etc. Although I had used it in the past, it wasn’t in an edit-intensive, fully customizable, internationally targeted program (although I have implemented IME-based chat for China, Japan and Korea 🙂 ). The more I looked at the model and thought about how to handle it, the more of a complete mess it seemed in my head. But I could not pin down exactly what it was, or what a better model might be.

I set out to get a complete understanding of the model, which I will be describing here for whoever needs it while programming (I couldn’t find a good explanation that made sense – of course MSDN is of no help in this kind of situation).

The Win32 model works like this: when a key is pressed, you get a WM_KEYDOWN message (more if the key is kept depressed past the auto-repeat time). When it is released, you get a WM_KEYUP. You can access the state of the modifier keys (SHIFT, ALT and CONTROL) at that moment. Both messages indicate the pressed key with their virtual key code: an integer code which uniquely identifies a key on the keyboard.

Apart from this, as messages get translated by the TranslateMessage() Win32 function, they are checked with the current keyboard config, and an additional WM_CHAR message is sent for each WM_KEYDOWN, with the current-codepage character code of the key pressed.
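
In code, the basic (naive) handling looks something like this – a minimal sketch of a window procedure, not NGEDIT’s actual code, assuming the message loop calls TranslateMessage() so that WM_CHAR gets generated; MoveCaretLeft() and InsertTypedChar() are placeholders:

LRESULT CALLBACK EditorWndProc(HWND hwnd, UINT msg, WPARAM wParam, LPARAM lParam)
{
  switch (msg)
  {
    case WM_KEYDOWN:
      // wParam carries the virtual key code (VK_LEFT, VK_F1, 'R', ...)
      if (wParam == VK_LEFT)
      {
        MoveCaretLeft();
        return 0;
      }
      break;

    case WM_CHAR:
      // wParam carries the translated character in the current codepage
      InsertTypedChar((unsigned)wParam);
      return 0;
  }

  return DefWindowProc(hwnd, msg, wParam, lParam);
}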

This is all fine and dandy, although things start getting tricky after this.

  • In non-US keyboard mappings, or in the US international keyboard, you can use some keys to modify the character typed afterwards with an accent. If you press one of these “accent” keys, a WM_DEADCHAR is generated that you’d probably better ignore, and the next WM_KEYDOWN of a vowel character will generate a WM_CHAR message with a different char code representing the accented character.
  • When you want to handle a key in your program, you have a choice as to handle either the WM_KEYDOWN or the WM_CHAR message. Keys such as the cursor arrows or function keys don’t have a character equivalent, so you need to handle WM_KEYDOWN. For regular character keys that are used for typing, you probably prefer to handle the WM_CHAR message, as it will get you either the lower case or upper case version, or even the accented version in international layouts. A doubt always arises about what to do regarding other keys which generate both a WM_KEYDOWN and a WM_CHAR: what to do with TAB, RETURN, ESC, etc…? Even worse, the DEL key doesn’t generate the ASCII DEL (127) WM_CHAR message that you would expect – it is a “mute” key WM_CHAR-wise. I think I remember Charles Petzold mourning the same decision in his Programming Windows book – I don’t remember what conclusion he reached, but I remember it didn’t satisfy me.
  • The virtual key code numerically matches the uppercase ASCII code of the character on the keyboard, and has special codes for cursor arrows, function keys, and some weird keys called stuff like VK_OEM_1 (for the ’tilde’ key in a US mapping). A key such as VK_OEM_1 is mapped to an actual alphabetic character in non-English European languages, making it difficult to distinguish alphabetic from non-alphabetic WM_KEYDOWNs.
  • Windows nicely throws a different version of the messages at you if the ALT key is pressed at the same time: you get WM_SYSKEYDOWN and WM_SYSCHAR instead of the regular WM_KEYDOWN and WM_CHAR. You also get WM_SYSKEYUP for the sake of consistency. These messages usually get handled by DefWindowProc() in order to bring up the corresponding menu item and some special key processing. Curiously, pressing the ALT key by itself and releasing it afterwards buys you a WM_SYSKEYDOWN/WM_SYSKEYUP pair for the ALT key. But if you press ALT, then a character key, and release both, you will get all WM_SYS* messages, except for the KEYUPs of keys released after the ALT key (and including it!), which generate regular WM_KEYUPs. The F10 key as well is a “special key”, generating the WM_SYS* version of its KEYDOWN – I guess because that brings up the menu as well, and I’m unsure about whether its KEYUP is SYS or not. No wonder the virtual key code of the ALT key is VK_MENU.
  • I couldn’t readily derive from MSDN whether having CTRL or SHIFT pressed at the same time as ALT would buy me SYS messages or plain vanilla messages.
  • European and US-international mappings treat the right ALT key as a different AltGr key, which buys you some extra characters that are not placed in regular positions. This key is sent to the application as having both CTRL and ALT pressed at the same time, and the system takes care of mapping the WM_KEYDOWN to the appropriate char code for the WM_CHAR message.
  • If you are interested in your Asian customers, you have a new world of IME messages to serve you. It opens a whole new avenue of spec nightmares – suffice it to say that you will be receiving two WM_CHAR messages in a row for some complex kanji or hanzi or hangeul double-byte character, corresponding to no keypress, as the user selects complex characters from his/her IME window. Your US-only-minded application may even work if you just don’t shuffle your WM_CHARs too much.

After writing the list, I’m looking back at the title of the post and wondering if it’s really a good question. Anyway.

Coming back to development, I keep a “todo list” of vi/vim commands that need implementation. It happens that the innocent “Redo” command in vi/vim is achieved by the Ctrl-R key combination. There I went and mapped the ‘r’ WM_CHAR with the CTRL key modifier set to the redo command, pressed Ctrl-R and… nothing. To no avail. I checked what the program was receiving, and found out that Windows was translating the VK_R keypress with CTRL into the proper C-R ASCII code, which happens to be 0x12, or 18 in decimal.

At that moment, I just hated the fact, and the memory of how input behaves from previous projects (which was not as clear as I have described above) kicked in, and had me for days happily implementing all sorts of other vi/vim commands that didn’t require handling the !~@# control key. I was working, but in some way, I was procrastinating 🙂 Seriously, this is the kind of stuff that keeps me from tackling a task.

After a few days, I finally decided to drag myself to do it, and started researching all the intricacies of key input, dumping the results, and trying to build a clear mental image – I found out some stuff that I didn’t know and haven’t seen anywhere on the web (of course, MSDN is less than complete).

Here is a nice list of the messages that the R key can generate in its different combinations:


            nothing   CTRL    ALT    CTRL+ALT
  nothing    KD+'r'  KD+^R  SKD+S'r'    KD
  SHIFT      KD+'R'  KD+^R  SKD+S'r'    KD

The row selects between having Shift pressed or not pressed. The column selects the combination of CTRL and ALT that is pressed. ‘KD’ represents the WM_KEYDOWN message, SKD represents WM_SYSKEYDOWN. An ‘x’ character represents the WM_CHAR message. S’x’ represents WM_SYSCHAR. ^X means the ASCII C-X control code in WM_CHAR. WM_KEYUPs are not shown. And of course, the char code on the left is toggled depending on the CAPSLOCK status (I’m unsure about what happens when CAPSLOCK is on and ALT is used together with the key, although your program had better not behave too differently).

Other non-alphabetic keys get similar mappings, although they don’t get an WM_CHAR when pressed together with CTRL (in most cases, as we’ll see in a moment).

I set out to try to complete my understanding of the mapping, and wondered what would happen with other keys. It turns out Windows goes to great lengths trying to get the ASCII code that may correspond to the keypress (in some cases quite dubious ones, to my understanding of affairs).

All (English) alphabetic keys (from A to Z all included) get mapped to their control codes (ASCII 1 to 26). That means that if you press Ctrl-M, your program will receive a nice WM_CHAR message with the 13 code, usually corresponding to the RETURN key. Be sure to filter that if you provide Ctrl+Key mappings! It even works as a replacement to RETURN in unsophisticated programs such as notepad.
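
One way to tell these control codes apart from real typing (a sketch, assuming key handling is done in WM_CHAR; IsCtrlGeneratedChar is a made-up helper, but GetKeyState() is the real Win32 call to query the modifier state):

// Returns true if this WM_CHAR carries a control code produced by a
// Ctrl+Key combination (e.g. Ctrl-M) rather than a real typing key
// such as RETURN, TAB or ESC pressed on its own
bool IsCtrlGeneratedChar(WPARAM wParam)
{
  return wParam < 32 && (GetKeyState(VK_CONTROL) & 0x8000) != 0;
}

With that, Ctrl-M can be routed to the key-mapping machinery while a real RETURN still inserts a line break.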

Then I set out to find out whether Windows would generate the rest of the control codes – poking around, looking at ASCII charts in more detail than I’d like to, I found out that you can generate WM_CHAR with ASCII 28 with C-\, ASCII 29 with C-], ASCII 30 with C-S-6 (which is more like C-^), and ASCII 31 with C-S-hyphen (which should be read as C-_). I usually use a non-US keyboard, for which I can tell you that the ASCII mapping is a bit lousier than for the US mapping, as the ‘\[]^’ symbols are moved around to other keys but Windows still performs the mapping as if the keyboard had US keytops, except in the case of DEADCHAR-generating keys, which simply behave differently…

The ASCII code 27, that is, ESC, can be both generated by the ESC key and by the C-[ combination.

And I also discovered some curious behavior: RETURN by itself generates WM_CHAR with ASCII 13, while C-RETURN generates WM_CHAR with ASCII 10. I was about to not tell you that C-S-RETURN generates no WM_CHAR, in order to get you lost if you aren’t already 🙂

And a mysterious ASCII code, DEL (127), was conspicuously absent from the keyboard – the DEL key would not generate it with any combination of modifier keys. But finally I found it: while BACKSPACE generates ASCII 8 (BS), combined with the CTRL key it generates the famous ASCII 127 DEL character.

Wow… I sincerely hope this explanation turns out useful or interesting for someone. I would’ve been happy to find it myself a couple of weeks ago.

With this understanding, I could finally go and map the Ctrl-R key to the Redo command, go around binding other keys, and go on. I could also comment on it with geek friends and all have a good laugh. But something was still dangling in my mind. Apart from being awkward, the system seemed flawed, but I could not pinpoint why.

Even if this is all a mess, and disregarding the far-east input stuff, I was thinking that I had to come up with a better design if I was to rant about it. Of course, the SYS nightmare should be easy to fix in a new design (messages should be the same, and the fact that ALT or F10 or ALT-Shortcut brings the menu should be stuffed somewhere else), but something seemed to be fundamentally wrong and I could not grasp it.

Finally, I realized today.

The input model is losing one piece of critical information: the fact that WM_KEYDOWN and WM_CHAR are separate and completely unrelated loses the concept of a keypress. And there is no way of bringing that back.

See, having the system “cook” some type of information, such as the character mapping of the key if the key has a sensible one, is ok. Receiving extra information is great and the system will do a much better job of supporting all national specifics than the app. Even if that information is “cooked” in a weird way, such as the SYS stuff (or even the DEADCHAR stuff), it is ok and you could just cope with it. But you dearly want to be able to handle a keypress in some way, either using or ignoring the “cooked” information or just the raw virtual key code.

Realizing this, I could finally think up what a better scheme would be. Wouldn’t it be great if Windows provided a WM_KEYPRESS message including all the information? Say, the virtual key code, the fully-cooked ASCII mapping that WM_CHAR gets, some sort of indication of a DEADCHAR keypress, the state of the modifier keys, maybe even the uncooked keytop ASCII char if that exists?
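
Purely hypothetically, the information carried by such a WM_KEYPRESS could be something like this (this message does not exist in Windows; the struct is only there to make the wish concrete):

// Hypothetical payload of the wished-for WM_KEYPRESS message
struct SKeyPress
{
  unsigned uVirtKey;     // Virtual key code, or 0 if no physical key was pressed
  unsigned uChar;        // Fully-cooked character, as WM_CHAR would deliver it (0 if none)
  unsigned uKeyTopChar;  // Uncooked keytop character, if the key has one
  bool     bDeadChar;    // The keypress was a dead-key (accent) press
  bool     bShift, bCtrl, bAlt; // Modifier state at the time of the keypress
};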

Asian IMEs could probably generate keypresses with a null virtual key code to send weird characters to unaware applications that just handle input, making it clear that no key was pressed for that character to be sent to the application, so the application would not try to match it up with keyboard mappings. Key maps would always refer to virtual keys, and text input would always just take the fully-cooked ASCII component. And we would all be happy.

The good thing is that this could have been implemented in any moment, as adding a new message does not create incompatibility. But we are now stuck with the really weird WM_KEYDOWN/WM_CHAR model.

I won’t hold my breath until Microsoft includes something along these lines 🙂