Archive for the ‘ngdev’ Category

ViEmu 1.2 Release Candidate out, & html macro language

Monday, September 19th, 2005

I have finished implementing and packaging ViEmu 1.2, and sent out an initial release to current customers and interested users. It includes folding support and window-command support as in vim (I think neither of these was in the original vi). By the way, it is already using the C++ string class I talked about in the last post - not heavy use yet, but already in use. After a bit of testing of this release candidate, I will be announcing it and putting it up for download on the main page.

The main web site is built with simple, static html files. There is quite a lot of repetition, both for common elements like the navigation bar and for common parts such as general layout. I guess that must be the case with many sites. I’ve been wanting to add two new sections to the web site during the last weeks, but having to update those elements on all pages was not something I wanted to do. I am going to set up a sensible framework such that those elements don’t have to be updated in many places.

I think many sites use a dynamic mechanism, such as ASP or PHP, to avoid replicating such elements. This blog, for example, which is based on WordPress, does such things. But I do not want to switch the whole site to a dynamic system - it seems absurd to evaluate code on each page request when a one-off step can generate the proper html files.

I do the html and css by hand, using vim, and I like to have that kind of control. I don’t know of any system that provides what I want - some kind of “macro preprocessor” for html pages. My idea is that I will be writing “.hs” (“html source”) files, and a preprocessor will process them to generate the actual html files. There will be a “.hi” (“html include”) file with the common element definitions.

It’s not that I like to do stuff from scratch, but I’ve never heard of a tool that does such a thing. I’ve checked the “m4” macro preprocessor, but the main problem I see is that it leans towards single-line macros - and most definitions that I’d be using will be multi-line. It needs to be comfortable to use in that case.

Unless I find out about a tool that does this, I will be writing it myself. It should only take a couple of hours to get it working. If you know of such a tool, I’d be very grateful if you leave a comment here.
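If I do end up writing it, the core would be little more than placeholder expansion. As a rough sketch (the `@name@` placeholder syntax and the `ExpandMacros` name are just illustrations here, not a committed design):

```cpp
#include <map>
#include <string>

// Expand @name@ placeholders using a table of definitions.
// Definitions may be multi-line; unknown names are left untouched.
std::string ExpandMacros(const std::string &sIn,
                         const std::map<std::string, std::string> &defs)
{
    std::string sOut;
    std::string::size_type u = 0;
    while (u < sIn.size())
    {
        if (sIn[u] == '@')
        {
            std::string::size_type uEnd = sIn.find('@', u + 1);
            if (uEnd != std::string::npos)
            {
                std::string sName = sIn.substr(u + 1, uEnd - u - 1);
                std::map<std::string, std::string>::const_iterator it =
                    defs.find(sName);
                if (it != defs.end())
                {
                    sOut += it->second; // substitute the definition
                    u = uEnd + 1;
                    continue;
                }
            }
        }
        sOut += sIn[u++]; // ordinary character, copy through
    }
    return sOut;
}
```

The “.hi” file would just populate the definitions map, and each “.hs” file would be run through this and written out as html.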

It’s good to see how, as the months pass, I’m getting to automate common tasks and the general “work environment” gets better every week. Starting from scratch, you have to live with many cumbersome methodologies for some time, but if you are patient it’s very satisfying to improve each part little by little: I can already develop text-encoding-independent text-processing code, I will be able to restructure the web site easily… I’m dying to develop a dual installer for Visual Studio 2003 / Visual Studio 2005 (for ViEmu) and take out another thorn!

A C++ string class that doesn’t suck

Wednesday, September 14th, 2005

No, the title is actually not mine - although I loved it when I read it. Read on.

One stumbling block I’m finding is that I desperately want to write code which is common to both ViEmu and NGEDIT. I mean, code that deals with text - which means all areas in which I’m developing: vi emulation, the new features of NGEDIT, etc…

On one hand, ViEmu works with the internal text buffer of Visual Studio, which works with 16-bit wchar_t values - called simply ‘Unicode’ by Microsoft, but actually the UCS-2 little-endian encoding of Unicode. (I talk about 16-bit wchar_t’s because the size is actually not defined in the C++ standard, as happens with all built-in types; GNU gcc, for example, implements them as 4-byte values.)

On the other hand, NGEDIT deals with all types of underlying text encodings, including Unicode, platform-native one-byte-per-char, variable length UTF-8, etc… The underlying text store management code stores the text in fixed-size pages and is accessible through simple byte pointers and unsigned count of bytes - not null-terminated strings.

I’ve been working on some template classes to manipulate strings for months now. They are unfinished and only partly usable. The deal is, when you actually start trying to build complex template stuff in C++, it gets hairy very soon - and you find out about the many limitations of C++’s templates. Not only the syntax, but the sheer amount of red tape you have to write.

I think I’ve made a significant advance in this area. My goal is to be able to write, for example, a tokenizer in such a way that I can instantiate it as a TTokenizer on one string type and use it right away, and the same for every other string type. I’ve been after this for quite some time and I’ve ended up with my head spinning several times. One other reason for the problems is that I don’t want to duplicate too much code, and TCountedNativeCharString and the other string classes are themselves templates on more basic types.
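To make the goal concrete, here is roughly the shape I’m after - a sketch, not my actual classes; the string policy class here (TSimpleCharString) is a stand-in for TCountedNativeCharString and friends:

```cpp
#include <string>

// Stand-in string policy: anything exposing TChar, Length() and
// CharAt() can be plugged into the tokenizer below.
struct TSimpleCharString
{
    typedef char TChar;
    TSimpleCharString(const char *psz) : m_s(psz) {}
    unsigned Length() const { return (unsigned)m_s.size(); }
    TChar CharAt(unsigned u) const { return m_s[u]; }
    std::string m_s;
};

template <class TSTRING>
class TTokenizer
{
public:
    TTokenizer(const TSTRING &s) : m_s(s), m_u(0) {}

    // Tokens are runs of non-space chars; returns false when done.
    bool GetNext(std::basic_string<typename TSTRING::TChar> &sTok)
    {
        while (m_u < m_s.Length() && m_s.CharAt(m_u) == ' ')
            ++m_u;
        if (m_u >= m_s.Length())
            return false;
        sTok.clear();
        while (m_u < m_s.Length() && m_s.CharAt(m_u) != ' ')
            sTok += m_s.CharAt(m_u++);
        return true;
    }

private:
    const TSTRING &m_s;
    unsigned m_u;
};
```

The point is that the tokenizer itself never mentions char or wchar_t: swap in a different string policy and the same code works over it.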

Anyway, I was Goo^H^H^Hresearching earlier this week, and stumbled into a similar initiative by Ryan Myers from Microsoft. He’s been documenting it in his blog, and although it is still unfinished, the 8 posts he wrote early this year are a very interesting read for C++ programmers. His goal is not exactly the same as mine: he is developing a single template-based string class that manages both interpretation and memory needs, while I have separate assorted template classes for each need - I think it will be better for my code to separate those two. But his blog posts were an awesome read. His own words: “I’m out to create a string class for C++ that doesn’t suck”. And I couldn’t agree more.

I’m hoping to have the template stuff working by the end of the week, which will in turn unblock my work on the common regular expression engine for both applications. I will also be rewriting the new features I’ve started implementing for NGEDIT using the new common framework (among other reasons, because I hate writing code bound to work with ‘char’s or ‘wchar_t’s, and the features may some day make it into another VS .NET add-in).

Next steps

Monday, August 29th, 2005

Sorry for the delay - I’ve finally spent several days with the flu, which isn’t the best premise for doing anything useful.

When I released ViEmu, this past July 26th, I talked to a friend and told him “step one.” It was quite true: that was only the first step. One month later, I have released a new version incorporating suggestions from customers, I have had a few reviews (mostly positive, although some point to limitations, and others simply wonder about the whole idea), and I have had a few sales. Like me, even if vi command line support is not present yet, some people have found using the vi/vim input model within VS a big win. My site is nowhere to be seen in Google results for obvious searches like “visual studio vi emulation” (it was once something like #12, but has disappeared into oblivion), but I have an adwords campaign set up to handle that, and it is indeed #1 for similar searches on MSN and Yahoo.

I’m very happy to have a product actually released and customers using it. I have already learnt a lot about setting up the sales & marketing, deployment issues, receiving and acting upon customer feedback, and, overall, about the whole development-release-sales cycle. I’ve even learnt how to prepare a Microsoft Installer (.msi) file to perform automatic upgrades (for which the closest analogy I can think of is drilling a hole in your skull.)

Now I have to think what the next steps are. I now have two “babies” to take care of: ViEmu and NGEDIT.

The next step for ViEmu is clear: adding “ex” commands support (the vi command line which allows gems such as :v/./.,/./-1join to compress empty lines, and similar useful tasks.)

As an aside, it would be good to “back-port” the vi/vim emulation to NGEDIT. The emulation in ViEmu is implemented in C++ (originally ported from the NGEDIT Scripting Language), and is now much more complete than the original.

And as a second aside, some refactoring of NGEDIT is in order, now that I can see the code with more critical eyes (thanks to the perspective gained in two months of not touching the codebase.) I’ve actually already started with this, which is also a good way to get familiar again with the NGEDIT code (it’s already a small beast of almost 50k lines of code.)

But these last two are minor tasks which shouldn’t take too long. The vi/ex command line support is more work, but I will surely work it out in the following weeks/months: I need to do a regular expression engine (which I plan to share between both products), and a parser for the ex commands, which shouldn’t be too bad. (Note: following with my tradition of technical postings, I’ll probably explain how I implement regular expression support - no, I won’t use any off-the-shelf libraries, in part because of the functionality I want to provide, but also probably due to my severe case of the Not-Invented-Here syndrome.)

The main area that needed a full reevaluation was how to tackle the development of NGEDIT itself. I’ve actually thought out a new development and release strategy/roadmap, but let me first set out the preliminaries.

I have some ideas for what I think will be “killer” features for a text editor. I may be wrong, but please bear with me and assume that’s the case. The reason I decided to develop a text editor was the thought of these features, as the text editor market is already quite saturated.

Strongly as I believe in the value of the new features, I initially thought that other editor developers would rip them off as soon as NGEDIT was out. Maybe the precaution is unwarranted, but I decided to first develop a fairly complete text editor, mostly up to par with current editors (probably not Visual SlickEdit, which is probably the most feature-loaded editor there is, but definitely with other popular text editors such as UltraEdit.) This way, I wouldn’t lose my edge with the new stuff.

It turned out that the whole feature set of a modern text editor is a heckuva lot of work to develop. I initially thought this part would be less work, but it turns out that all of the development work on NGEDIT has only started to bring it up to what other editors offer. Actually, I haven’t even started designing or implementing the actual innovative features of NGEDIT (and they do require quite a lot of research!)

Now comes the experience of ViEmu. The kind of “echo” that ViEmu has received has been a bit less than I expected. Probably releasing in August is not the best moment, probably I should wait more than a month before evaluating the result, and probably it will take some time and some more released versions until it becomes more widely known. But I have found out that, even if the internet is a great resonance chamber which creates a great echo for remarkable products, it behaves as a dense and difficult-to-travel information mesh for products that are not remarkable, or which target too small a group.

I can’t help but think that ViEmu targets those who are “vi lovers”, probably much less than 5% of developers, who themselves are probably less than 5% of the general software-buying public. 5% times 5% is 0.25% of the potential software-buying audience (and, yes, this is a bit of a faux argument, as you can never target 100% of the audience, but my point that ViEmu is a very niche product still stands.)

NGEDIT is a general purpose text editor, which already targets a much wider audience than ViEmu. This makes it a better starting point in order to generate a profitable business. But then, one thing is easy to see: even if I develop an editor with the 14,273 or so features other text editors have, that won’t make it remarkable. I could spend one year implementing everything down to spell checking, FTP file access, and emulation of some obscure and forgotten editor’s command set, and even then I would still have a slow start.

The point about usability is important, and will help it become a successful product, but that’s not something that creates a phenomenon of its own.

Fortunately, I have the “killer” features to try to create a remarkable product. But then, does it make sense to spend a bunch of time to have just a “standard” product before I even start with them?

On the other hand, I’ve realized one thing: not only is vi emulation something that (at most) 5% of the users of a text editor miss; even regular expressions are something that many programmers don’t use daily (or at all!).

And as a final element, my motivation starts to decline if my work consists in coding known-feature-after-known-feature. And given that I have created a quite powerful underlying technology framework, the codebase looks a bit like a Gruyère cheese, full of holes that are designed-in but not filled in yet: double-byte character set support, filling in the template code for a lot of template-based generalized elements, completing the scripting framework, plug-in support, etc… All of these promise a lot of not-too-creative code-churning hours.

Meanwhile, I’m really eager to start researching, designing and developing the new features.

So, I’ve figured out there is a much more sensible strategy for NGEDIT. Put shortly, I will be focusing on the innovative features, and leaving “completeness” in comparison to other editors as an aside. I have two or three basic tasks to perform before actually starting with them, partly due to a few early wrong decisions which require a bit of code refactoring, but basically I think I will be able to start on them this week.

I’ve changed my immediate focus from the-complete-text-editor to a-nice-little-editor-with-nifty-stuff. I am currently using it myself as I make changes, and focusing on making it a better tool for me than vim or Visual Studio with ViEmu. If I can achieve that, I’m confident it will be likewise for other people (even if other people don’t care about vi-like editing, of course - the new features have nothing to do with that.)

I also know, from other projects and even other disciplines, that when you focus on immediate use, many other issues become apparent and pieces start to fall in place, and even completeness comes along.

Regarding the point of staying ahead of other text editors, or at least not too far behind, I think I have probably overestimated the risk. Even if it is successful, NGEDIT will take some time to catch on, which is usually measured in months, and I will be able to employ that time in getting the checklist features that are missing from NGEDIT (although probably not yet C++ Intellisense, but that’s not so common either.)

I’m using NGEDIT as I go along. Developing a text editor is good in that you can use and test it for its own development. On the other hand, it is a bit of a mess. Even if you get a bit organized, prepare deployment releases and don’t use the bleeding-edge latest build (as I had to do for ViEmu, in an even more complex interaction involving the IDE), it is still a bit of a mess. So, what I am doing is developing the regular expression library with NGEDIT, which I start from Visual Studio itself, and the whole process is a bit less messy. ViEmu then also benefits from this, even if it does require a bit of mental task-switching.

So: I’m focusing on bringing out NGEDIT 1.0 as a very innovative editor, if not as feature-complete as existing editors, and I’m really pleased to have taken this decision, which I think makes sense both development- and business-wise.

And only time will tell if the strategy was right, but then this is a bet!

ViEmu 1.1 released

Tuesday, August 23rd, 2005

Finally! I have just released ViEmu 1.1. It improves version 1.0 in many ways, including a new way to hook into the Visual Studio environment: much better integrated, and compatible with third party tools like Visual Assist or Resharper (which was a recurring request with the previous version.) There are also a few minor vi/vim emulation improvements.

I plan on summarizing the general project status in another blog post, which I think will be interesting from a more business-oriented rather than technical angle.

Next steps & screenshot

Wednesday, August 10th, 2005

ViEmu 1.0 has been out for two weeks now and I’ve had some interesting feedback. I’m readying ViEmu 1.1, which solves some limitations in the integration with VS - ViEmu will take over all editing windows within VS, instead of being a separate editor type. This helps with the VB form and HTML editors within VS, and also with not getting the standard editor instead of ViEmu through some UI elements which are not completely correct within VS (such as the “View Code” button, which bypasses the whole internal VS editor mechanism).

I think I will be able to release ViEmu 1.1 next week. This will address all the major outstanding issues in 1.0 and, barring unforeseen bugs or problems, it will be the latest released version for some time.

Right after that, I will be getting back to NGEDIT. I’ve had the chance to think quite a lot about NGEDIT during the development of ViEmu, and I will probably start by doing some refactoring of the code. Apart from this, there are two major pieces that are missing before I can feel comfortable with how complete the tech base of NGEDIT is: syntax highlighting and regular expression search. I also want to improve the core memory management code, so that it will better lend itself to hex editing of files and storing other types of information.

There are a myriad other things to do then, in order to cover what the expectation of a modern text editor is. They will take quite some time, as each little detail needs its own love and care. But, after the refactoring and the major two features, I will already be able to start focusing on the most important part: the UI. I have a ton of ideas that I’m looking forward to implement and try out.

I thought it might be time to post an early screenshot of NGEDIT. It is from April, and quite incomplete, but it gives a good idea of the general look. The toolbar icons are shown large so that the drawings can be appreciated :) (click to see a full size version)

Screenshot

Given that I’ve checked out the beta of VS2005 to port ViEmu to VS2005 (the port is already working), I’ve seen that Microsoft has gone with a similar look for the gradient-based UI background. With a bit less taste, if I may say so myself :)

Unicode, text management, and C++ techniques (and III)

Tuesday, August 2nd, 2005

If you are a C++ programmer, I recommend you read this article - the techniques discussed are quite interesting and could be useful for your own non-text-related projects.

In the last article in the series, we saw what UTF-8 is about. And I promised to cover some interesting techniques in order to handle this maze of encodings. Let’s get to it.

In order to abstract out the differences between text encodings, I decided to implement the concept of a codec: a text coding and decoding module that can convert between a particular encoding and some general interface that the main editor uses.

As we saw in the last article, using a base class with virtual functions has two important drawbacks: the first one is that all access is through costly virtual function calls (at least, quite costly compared to raw byte-size character access), and the second one is that it most probably forces us to use full unicode for the arguments and return values.

So, I decided to implement the codec as a class which is used as a template argument. There is one such class for each supported encoding (TCodecNativeSimple, TCodecForeignSimple, TCodecUTF8, TCodecUTF16BE, TCodecUTF16LE, TCodecDBCS, and TCodecGeneral). Each such class is not meant to be instantiated, and it doubles as an access mechanism to codec-defined specifics - that is, it only has public members, and it doesn’t have any member variables with the exception of const static members (the C++ idiom for constant class-wide values).

For example, each of these classes contains a TLineDecoder class. So, we can instantiate a TCodec::TLineDecoder object in order to walk a line of encoded text char by char and do whatever specifics we may need.

But the greatest strength of this technique comes from defining types within each codec. Each codec additionally defines a TChar type, which represents the preferred type for manipulating such text.

For example, the native-simple codec is used for the platform-native text encoding, but only when such encoding is a single-byte-per-char encoding (eg, US and European native codepages qualify, whereas Japanese and Korean native text is not handled by this codec). This codec doesn’t require converting the characters input by the user, and its text can be output via native Windows API calls. And the TChar type for this codec is a simple byte.

As another example, the foreign-simple codec is used for one-byte-per-char text encodings which are foreign to the current platform (for example, US-Windows-codepage in a machine using another codepage as native, Mac text on a PC, or any of the ISO encodings such as Latin1, etc…). Given that this text cannot be reliably represented in a single byte, the TChar type in this codec maps to TUCS4Char (a full 4-byte Unicode codepoint).

We use this mechanism in order to map concepts in as many levels as we want. This allows us to map both high- and low-level concepts, so that we can have the required level of access in every part of the main application without performance getting hit. I really hate it when a concept that makes development much more comfortable makes a significant compromise in runtime performance.

Apart from operation classes (such as TLineDecoder) and basic types (such as TChar), the codec class also features some static const members, that represent features of the encoding. For example, all codecs have a c_bFixedCharWidth boolean member which indicates exactly that, whether encoded chars are all of the same byte length.
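Putting the pieces together, the overall shape is something like this - a simplified sketch of the native-simple codec, with only the members used by the example below (the real classes have more machinery):

```cpp
typedef unsigned char byte;

// Sketch of the native-simple codec: one byte per char, no invalid
// sequences possible, so IsInvalid() is a compile-time 'false' that
// the optimizer can remove entirely.
struct TCodecNativeSimple
{
    typedef byte TChar;
    static const bool c_bFixedCharWidth = true;

    static bool IsWhiteSpace(TChar ch) { return ch == ' ' || ch == '\t'; }

    class TLineDecoder
    {
    public:
        TLineDecoder(const byte *psz, unsigned uLen, unsigned uOff)
            : m_psz(psz), m_uLen(uLen), m_u(uOff) {}
        bool     IsAtEnd()   const { return m_u >= m_uLen; }
        bool     IsInvalid() const { return false; } // optimized away
        TChar    GetChar()   const { return m_psz[m_u]; }
        unsigned GetCurPos() const { return m_u; }
        void     Advance()         { ++m_u; } // just a pointer bump
    private:
        const byte *m_psz;
        unsigned    m_uLen, m_u;
    };
};
```

A TCodecUTF8 would have the same public shape, but with TChar as a 4-byte value and a TLineDecoder that does real multi-byte decoding.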

As an example of how this works, the function to find whitespace which we have used as an example may be written like this:

template<class TCODEC>
unsigned FindWhiteSpaceRight(
  const byte *psz, unsigned uLen, unsigned uOffStart
)
{
  typename TCODEC::TLineDecoder ld(psz, uLen, uOffStart);

  while (!ld.IsAtEnd())
  {
    typename TCODEC::TChar ch = 0;

    if (ld.IsInvalid())
      ; // Handle in some way - ch stays 0, treated as non-whitespace
    else
      ch = ld.GetChar();

    if (TCODEC::IsWhiteSpace(ch))
      return ld.GetCurPos();

    ld.Advance();
  }

  return uOffStart;
}

Let’s see some aspects of this code. For one, you can see that we are indeed checking for invalid characters. For encodings that may present invalid encoded characters, this function will check validity. But for encodings that can never encounter invalid encoded characters, the IsInvalid() call is hardwired to return ‘false’, and so the compiler will optimize that part of the loop away! The same optimization happens for a function such as Advance(), which amounts to just a pointer increment for the most common one-byte-per-char encodings, while the same written code properly compiles to all the complex mechanics involved in decoding UTF-8.

Code that checks TCODEC::c_bFixedCharWidth with a seemingly runtime ‘if’ will also be evaluated and optimized out at compile time in Release builds, as the compiler is smart enough to see that it is actually a compile-time constant.

And, as a final remark, we talked about TAB character decoding at the end of the last article. It turns out that having TAB characters in a file involves quite a lot of complexity, as the offset within a line loses any correlation with the graphic column. But this is not the case for files which sport no TAB characters, and we would be losing performance on those for nothing. One way to handle this seamlessly: abstract TAB handling behind a scheme such as the one above (I call a TABBER the concept equivalent to a CODEC, but for TAB decoding), and choose between two TABBERs depending on whether the file contains TABs when loading. You can always switch to a non-null TABBER if the user inserts a TAB character. For people like me, who prefer not to use TAB characters at all, this is a win in most if not all editing sessions.
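A sketch of the two TABBERs, reduced to the offset-to-column computation (the class and member names here are illustrative, not the actual code):

```cpp
// Null tabber, for files with no TABs: graphic column == offset,
// trivially, and the call inlines away to nothing.
struct TNullTabber
{
    static unsigned OffsetToColumn(const char *, unsigned uOff)
    {
        return uOff;
    }
};

// The real tabber must walk the line to account for TAB stops.
struct TTabTabber
{
    static const unsigned c_uTabWidth = 8;
    static unsigned OffsetToColumn(const char *psz, unsigned uOff)
    {
        unsigned uCol = 0;
        for (unsigned u = 0; u < uOff; ++u)
        {
            if (psz[u] == '\t')
                uCol = (uCol / c_uTabWidth + 1) * c_uTabWidth;
            else
                ++uCol;
        }
        return uCol;
    }
};
```

Code templated on the TABBER gets the fast path for TAB-free files and the full computation only when it is actually needed.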

Unicode, text management, and C++ techniques (II)

Friday, July 29th, 2005

We left the series a few weeks ago, after having talked a bit about UCS-2/UTF-16, which in its little-endian version is simply called “Unicode” by Microsoft.

We’re now going to review a bit of what UTF-8, probably the most widespread Unicode encoding, actually means.

Remember the context (I’m probably just reminding myself, as I’ve been so busy with ViEmu for the past few weeks.) We’re going to see how NGEDIT handles the different text encodings internally - based on the fact that NGEDIT does not convert on write and read, but it keeps files in memory in their original encoding.

UTF-8 was a very neat trick (elevated to the category of a standard) devised by Ken Thompson. The basic unit in UTF-8 is a byte, but only a few characters occupy a single byte. Characters may actually be anything from 1 to 4 bytes, depending on their value. Actually, the encoding method allows characters of 5 or even 6 bytes, but those only happen for code points above 0x10FFFF, which the Unicode standard now forbids - so no 5 or 6 byte sequences should be found in a “legal” UTF-8 file.

Basically, ASCII characters are stored as single-byte 0..127 values (because, as I guess you know, ASCII is a 7-bit code, and the first 128 Unicode characters coincide with it). That means a file consisting of only ASCII values will be exactly the same in good old 8-bit-stored ASCII and in UTF-8.

The 128 characters from 128 to 255 in Unicode, together with the first 128 which are plain ASCII, complete the ISO-8859-1 encoding, usually called Latin1. This was the default encoding for HTML, and even if a lot of HTML these days uses UTF-8, I think ISO-8859-1 is still the default if no encoding is specified. How are these characters encoded in UTF-8? With two-byte sequences:

Latin1 character 128: UTF-8 bytes 0xC2 0x80

Latin1 character 255: UTF-8 bytes 0xC3 0xBF

Unicode code points up to 0x7FF (2048 characters) are all encoded in two bytes in UTF-8, and the last one is 0xDF 0xBF.

As you can deduce, it’s not that the first byte is just a marker. Two-byte UTF-8 sequences are marked by the high 3 bits of the first byte being binary 110 (so, in hexadecimal, the byte will be between 0xC0 and 0xDF). The other 5 bits of the first byte are the highest 5 bits of the 11-bit character encoded. And the trailing byte actually carries 6 bits of info, as its highest two bits must be binary 10.

Higher code points use 3- and 4-byte-per-character encodings: 3-byte characters are marked by the high four bits being binary 1110, and 4-byte characters are marked by the high five bits being binary 11110.

As an important point, all trailing bytes in characters of any byte-length always have the high two bits as binary 10, so finding where characters start is easy.
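The encoding direction is the mirror image of all this. A sketch restricted to the cases described above, up to three bytes (the UTF8_Encode name is mine, for illustration):

```cpp
typedef unsigned char byte;

// Sketch: encode a code point up to 0xFFFF into UTF-8.
// Returns the number of bytes written (1 to 3).
unsigned UTF8_Encode(unsigned ch, byte *pOut)
{
    if (ch < 0x80u)
    {
        pOut[0] = (byte)ch;                      // plain ASCII
        return 1;
    }
    else if (ch < 0x800u)
    {
        pOut[0] = (byte)(0xC0u | (ch >> 6));     // 110xxxxx
        pOut[1] = (byte)(0x80u | (ch & 0x3Fu));  // 10xxxxxx
        return 2;
    }
    else
    {
        pOut[0] = (byte)(0xE0u | (ch >> 12));           // 1110xxxx
        pOut[1] = (byte)(0x80u | ((ch >> 6) & 0x3Fu));  // 10xxxxxx
        pOut[2] = (byte)(0x80u | (ch & 0x3Fu));         // 10xxxxxx
        return 3;
    }
}
```

You can check it against the examples above: code point 128 comes out as 0xC2 0x80, 255 as 0xC3 0xBF, and 0x7FF as 0xDF 0xBF.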

Anyway, the point is, how do we translate the code from the last post, which looks for whitespace, so that it will work with UTF-8? Let’s see again the beautifully simple original one-byte-per-character code:

unsigned FindWhiteSpaceRight(
  const char *psz, unsigned uLen, unsigned uOffStart
)
{
  unsigned u = uOffStart;

  while (u+1 < uLen)
  {
    if (IsWhiteSpace(psz[u+1]))
      return u+1;
    u++;
  }

  return uOffStart;
}

It’s not the most beautiful code, but it’s beautifully simple.

Now, let’s see the UTF-8 enabled version, which could actually recognize a hieroglyphic whitespace if it were necessary:

unsigned FindWhiteSpaceRight(
  const byte *psz, unsigned uLen, unsigned uOffStart
)
{
  unsigned u = uOffStart;

  while (u+1 < uLen)
  {
    unsigned len; // Characters may be long...
    unsigned ch;  // Characters may be >0xFFFF

    len = UTF8_CalcLen(psz[u]);
    if (u + len < uLen)
      ch = UTF8_Decode(psz + u);
    else
    {
      // Invalid!
      //TODO: Handle it in some way! For now, give up.
      return uOffStart;
    }

    if (IsWhiteSpace(ch))
      return u+len;
    u += len;
  }

  return uOffStart;
}

The code to calculate the length of and decode a UTF-8 character would look more or less like this:

inline unsigned UTF8_CalcLen(byte b)
{
       if (b < 0x80u) return 1;
  else if (b < 0xE0u) return 2;
  else if (b < 0xF0u) return 3;
  else if (b < 0xF8u) return 4;
  else if (b < 0xFCu) return 5;
  else return 6;
}

#define EX(x, shl) ((x & 0x3F) << shl)
inline unsigned UTF8_Decode(const byte *p)
{
  byte lead = *p;

  if (lead < 0x80u)
  {
    return lead;
  }
  else if (lead < 0xE0u)
  {
    return ((lead & 0x1Fu) << 6u) | EX(p[1], 0);
  }
  else if (lead < 0xF0u)
  {
    return ((lead & 0x0Fu) << 12) | EX(p[1], 6)
        | EX(p[2], 0);
  }
  else if (lead < 0xF8u)
  {
    return ((lead & 0x07u) << 18u) | EX(p[1], 12)
        | EX(p[2], 6) | EX(p[3], 0);
  }
  else if (lead < 0xFCu)
  {
    return ((lead & 0x03u) << 24u) | EX(p[1], 18)
        | EX(p[2], 12) | EX(p[3], 6)
        | EX(p[4], 0);
  }
  else
  {
    return ((lead & 0x01u) << 30u) | EX(p[1], 24)
        | EX(p[2], 18) | EX(p[3], 12)
        | EX(p[4], 6) | EX(p[5], 0);
  }
}
#undef EX

Take into account that this code is not actually Unicode conformant, given that it shouldn’t accept 5- and 6-byte characters, and it should filter out overlong sequences (characters which occupy N bytes but could have been encoded in fewer).
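A conformant decoder would add checks along these lines - a sketch of just the validation (the function name is mine; the rejection of surrogate code points is also required by the standard, though I haven’t discussed surrogates yet in this post):

```cpp
typedef unsigned char byte;

// Sketch: validate one UTF-8 sequence of uLen bytes starting at p.
// Rejects sequences longer than 4 bytes, bad trailing bytes, overlong
// encodings, surrogates, and code points beyond 0x10FFFF.
bool UTF8_IsValidSequence(const byte *p, unsigned uLen)
{
    if (uLen < 1 || uLen > 4)
        return false;
    if (uLen == 1)
        return p[0] < 0x80u;

    // Trailing bytes must all be 10xxxxxx
    for (unsigned u = 1; u < uLen; ++u)
        if ((p[u] & 0xC0u) != 0x80u)
            return false;

    // Decode, then check the result against the minimum value
    // representable for this length (rejects overlong encodings)
    unsigned ch;
    if (uLen == 2)      ch = (p[0] & 0x1Fu);
    else if (uLen == 3) ch = (p[0] & 0x0Fu);
    else                ch = (p[0] & 0x07u);
    for (unsigned u = 1; u < uLen; ++u)
        ch = (ch << 6) | (p[u] & 0x3Fu);

    static const unsigned auMin[5] = { 0, 0, 0x80u, 0x800u, 0x10000u };
    if (ch < auMin[uLen])
        return false; // overlong
    if (ch > 0x10FFFFu || (ch >= 0xD800u && ch <= 0xDFFFu))
        return false; // out of range, or a surrogate code point
    return true;
}
```

For example, the infamous overlong sequence 0xC0 0x80 (a two-byte encoding of NUL) is rejected, while the minimal two-byte sequence 0xC2 0x80 passes.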

So, now you see how the actual code gets more complex for UTF-8, and the innocent loop actually involves a lot of operations now.

We’ve now seen the complexities of dealing with different encodings: one byte per character, Windows “Unicode” with its possible “surrogates”, UTF-8 with all its varying-length management needs. We haven’t even looked at DBCS, the systems by which Japanese, Korean, and the different Chinese scripts are commonly stored, and in which seeking backwards in text is all but impossible, because lead bytes and trail bytes are not distinguishable by value. And then there are all the other Unicode encoding variants, including little-endian and big-endian versions, etc…

How can one choose to implement support for all of these in C++?

One possibility is to write a version of each text-management function such as FindWhiteSpaceRight for each supported encoding.

Just kidding :)

What we really want is to write code almost as simple as the one-byte-per-character version above, which will work for all encodings.

As a common C++ idiom, we could design a base class with virtual methods that represent the required functions. Methods could be “unsigned GetChar()”, “AdvancePointer()”, etc., and each derived class would implement its version of each.
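In sketch form, that interface would look something like this (names as per the examples above; the derived class is illustrative):

```cpp
typedef unsigned char byte;

// Sketch of the virtual-interface approach: every operation is a
// virtual call, and GetChar() is forced to return the most general
// character type (a full UCS-4 value) for every encoding.
class IDecoder
{
public:
    virtual ~IDecoder() {}
    virtual bool     IsAtEnd() const = 0;
    virtual unsigned GetChar() const = 0;   // always full UCS-4
    virtual void     AdvancePointer() = 0;
};

// Even the trivial one-byte-per-char case pays the virtual call price.
class CSingleByteDecoder : public IDecoder
{
public:
    CSingleByteDecoder(const byte *psz, unsigned uLen)
        : m_psz(psz), m_uLen(uLen), m_u(0) {}
    virtual bool     IsAtEnd() const { return m_u >= m_uLen; }
    virtual unsigned GetChar() const { return m_psz[m_u]; }
    virtual void     AdvancePointer() { ++m_u; }
private:
    const byte *m_psz;
    unsigned    m_uLen, m_u;
};
```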

This would work. Indeed. But we would be paying a high price.

For one, the price of a virtual function call for each simple operation. The one-byte-per-char version is not only simple to read - the code it generates is really good, because the CPU is very good at handling simple bytes.

But the second, very important one is that the virtual functions would need to receive and return the most general character type, that is, 32-bit-per-char UCS-4. And that would mean converting for really simple operations.

This is especially important for one reason: I wanted NGEDIT to handle all encoding types, to handle them natively, but most day-to-day editing happens with one-byte-per-char encodings. Burdening the code which is run 90% of the time in a large part of the world (at least, all of Europe and the US) with a high performance impediment seems a bit absurd, and I didn’t want to do it.

The goal is code that is simple to write and read, code which can be made to work with all encoding types, but also code that will become the simple byte-handling code that we had for the first case when we are actually dealing with one-byte-per-char encodings. And, sure, we don’t want to write gobs of code.

The solution? Courtesy of templates, of course. It will be the topic of the last article in this mini-series, together with some other actually important reasons to use such a solution (hint: tab handling code is often a waste!)
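To give a rough idea of where this is heading, here is a minimal sketch (the names are illustrative, not NGEDIT’s actual classes): the encoding becomes a template parameter whose methods the compiler can inline, so the one-byte instantiation compiles down to the same simple byte-scanning loop shown earlier.

```cpp
#include <cassert>

// Hypothetical encoding "policy" class for plain one-byte encodings.
struct EncOneByte
{
    // Decode the character at psz[u]; report its length in code units.
    static unsigned GetChar(const char *psz, unsigned u, unsigned *pLen)
    {
        *pLen = 1;
        return (unsigned char)psz[u];
    }
};

static bool IsWhiteSpace(unsigned ch)
{
    return ch == ' ' || ch == '\t';
}

// One generic implementation; for EncOneByte the compiler can inline
// GetChar and recover the simple byte-handling loop with no overhead.
template <class TEnc>
unsigned FindWhiteSpaceRight(const char *psz, unsigned uLen, unsigned uOffStart)
{
    unsigned u = uOffStart;
    while (u + 1 < uLen)
    {
        unsigned len;
        unsigned ch = TEnc::GetChar(psz, u + 1, &len);
        if (IsWhiteSpace(ch))
            return u + 1;
        u += len;       // advance by the character's actual length
    }
    return uOffStart;
}
```

A multi-byte encoding would then be just another policy class with its own GetChar, and the same function body would serve both.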

Unicode, text management, and C++ techniques

Saturday, July 2nd, 2005

Let me apologize for taking so long to post. I’ve been in a kind of a “development frenzy” for the past couple of weeks. I will be posting some news regarding all the new development shortly :)

Today, I’m going to start reviewing how NGEDIT manages the text buffers of the files being edited. I was explaining it and showing the source to a developer friend of mine a few days ago, and he found the C++ techniques interesting. I hope it will be useful and/or interesting to you as well.

The model is rooted in the way NGEDIT handles different text encodings, such as Windows codepages, DBCS, or different flavors of Unicode. It will take a few blog posts to cover the subject.

Some months ago, when I developed the earliest prototype, I started out with simple one-byte-per-char buffers. It was not final code and I just wanted to have the editor up and running. At the end of the day, most editing I do is in the good ole’ 1252 codepage, using a single byte per character. So is quite probably yours, if you’re in the US or Western Europe.

As soon as basic editing and UI were working, I started researching how to handle the different encoding types.

I know that one can use Windows’ understanding of Unicode, using two bytes per character. Well, actually, it’s not two bytes per character - even though the Unicode standard creators initially thought that 65,536 unique characters would be enough to encode all writing systems, in the end they found out they needed more. I’m not completely sure, but I think Microsoft’s decision to use a two-byte-per-character encoding predates the (bad) news that some characters would not fit in two bytes, thus requiring some sort of extension (actually called “surrogates”). That is, if you decide to use two bytes per character, you can still not assume uniform character length. That is only true for the first 65,536 characters (technically “code-points”) in the Unicode standard. This set is nicely dubbed the “Basic Multilingual Plane”, and I think it covers all widespread systems (including Japanese, Chinese, Korean, Greek, Cyrillic, Hebrew, Arabic, Thai, etc.). I think the writing systems you are forgetting about would include Klingon, Egyptian hieroglyphs and some other alphabets which you’d better not use in the comments in your code or in your config files or in your customer database.

If two bytes per character brought you universality together with simplicity, I’d be much more inclined to use it. But the thought that the code should gracefully handle the kind-of-escape-sequence surrogate pairs makes me feel that, apart from wasting memory in most cases, I still have to tolerate variable-length characters. And in most cases (Greek and Eastern writing systems excluded), UTF-8 is a much better encoding for this: ASCII characters, that is, the first 128 characters in the character system you are actually using now (unless you are reading this from an IBM mainframe, which I seriously doubt), are encoded as a single byte in UTF-8. If you use English, unless you are the type of person that writes “naïve” or “résumé”, the whole of your file can be encoded in one byte per character, while still allowing the occasional hieroglyph in the middle.
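A nice property of UTF-8, as a side note, is that the lead byte alone tells you how long each sequence is. A hedged sketch (not NGEDIT code) of that lookup:

```cpp
#include <cassert>

// Length in bytes of a UTF-8 sequence, derived from its lead byte.
// Returns 0 for a continuation byte (0x80-0xBF), which cannot start
// a valid sequence. Illustrative sketch, not NGEDIT's actual code.
static unsigned Utf8SeqLen(unsigned char lead)
{
    if (lead < 0x80) return 1;   // 0xxxxxxx: plain ASCII
    if (lead < 0xC0) return 0;   // 10xxxxxx: continuation, invalid start
    if (lead < 0xE0) return 2;   // 110xxxxx
    if (lead < 0xF0) return 3;   // 1110xxxx
    return 4;                    // 11110xxx
}
```

Note that unlike DBCS, a continuation byte is recognizable by value, which is what makes seeking backwards in UTF-8 feasible.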

Anyway, I had to support the different Unicode encodings in the editor. Even if you only use it sometimes, an editor with just one-byte-per-character encoding support is simply not serious nowadays. I also decided that I would be supporting DBCS encodings, that is, Asian code pages in which characters can be encoded in one- or two-byte sequences. When I had to do some localization support for Japan, Korea, China and Taiwan a few years ago, I was not sure whether Unicode would be widespread in those countries. I simply asked them to send me some localized materials without specifying the format, and they just sent DBCS encoded text files. I found out Unicode was not too widespread there either.

Let’s look at how the early NGEDIT code to do some handling looked. This sample shows the code to find the next “whitespace” character in the line:

unsigned FindWhiteSpaceRight(
  const char *psz, unsigned uLen, unsigned uOffStart
)
{
  unsigned u = uOffStart;

  while (u+1 < uLen)
  {
    if (IsWhiteSpace(psz[u+1]))
      return u+1;
    u++;
  }

  return uOffStart;
}

This is quite fine and dandy. And quick. The call to IsWhiteSpace() can easily be inlined, and the whole loop can be easily optimized by the compiler.

Now, let’s see how this may look for the default Windows Unicode encoding (which is formally called UCS-2LE or UTF-16LE, where LE means little-endian, and although there is some technical difference between UCS-2 and UTF-16, it is nothing of any importance in this context). We will do a simple translation.

unsigned FindWhiteSpaceRight(
  const wchar_t *psz, unsigned uLen, unsigned uColStart
)
{
  unsigned u = uColStart;

  while (u+1 < uLen)
  {
    if (IsWhiteSpace(psz[u+1]))
      return u+1;
    u++;
  }

  return uColStart;
}

It seems like a really simple transformation, one that is easy to perform, and which results in much more general code dealing with Asian or Arabic or Greek or Cyrillic encodings. wchar_t is a built-in C/C++ standard type used for wide characters. We switched from talking about offsets to talking about columns, as they’re not equivalent any more, but the rest seems pretty good.

But things are trickier.

As always happens with standard C/C++ types, wchar_t is not technically a very well defined type. Under Microsoft’s compilers, it is a two-byte word able to store one of the first 65,536 code-points. Under GNU’s gcc compiler, it is a FOUR-BYTE integer able to store any Unicode character. I don’t even know what it means in other environments.

So, the above code would be correct when compiled under gcc, although using 4 bytes per character - probably something you don’t want to do to handle really large files.

Compiling under Microsoft’s Visual C, or just using “unsigned short” in gcc in order to save some space, the above code is not really correct.

What happens if there is some Klingon character thrown in in the middle of the source code?

First thing, you should probably fire the programmer who wrote that. But that’s not very satisfying.

How do these characters get encoded in UCS-2/UTF-16? Well, the first 65,536 characters in the Unicode standard get simply encoded as-is. But they were cunning enough to leave certain ranges unused for characters - most importantly the so-called surrogate range from 0xD800 to 0xDFFF. These code-points are not assigned to any character in the standard.

The standard defines characters from 0x0000 upwards, and they have promised not to use any single value above 0x10FFFF. That is, there are 16 times 65,536 possible code-points that can get encoded apart from the first 65,536 ones - gobs of characters above the 0xFFFF ceiling. They decided to use what are called surrogate pairs. A sequence of two values in the 0xD800-0xDFFF range defines a single character. Actually, the surrogate range is divided into a “High Surrogate” range (0xD800 to 0xDBFF) and a “Low Surrogate” range (0xDC00 to 0xDFFF). The high surrogate must always come first, they must always come together (an independent surrogate with no companion has no meaning), and together they can encode 1024 times 1024 different characters. That covers the extra 0x100000 values beyond the BMP (’Basic Multilingual Plane’).

This leaves us in the uncomfortable situation that the above code handling wchar_t’s is actually unaware of what it is doing with those symbols.

What will happen if we just do that? Well, it’s not that bad, as you probably won’t encounter Klingon characters “in the wild”. But if there are any, you will be manipulating them incorrectly, and even if your OS does a good job of rendering them (the user had better have installed some good fonts to display that), you will be mangling the text.

UTF-8 encoding has similar properties, although the “naive” code will find characters wrongly handled much more easily (more about this in the next installment).

So, what should we really do to handle UCS-2/UTF-16 correctly? Something like this:

unsigned FindWhiteSpaceRight(
  const wchar_t *psz, unsigned uLen, unsigned uColStart
)
{
  unsigned u = uColStart;

  while (u+1 < uLen)
  {
    unsigned len; // Characters may take one or two wchar_t's
    unsigned ch;  // Characters may be >0xFFFF

    if (u+2 < uLen)
      ch = UTF16_Decode(psz + u + 1, &len);
    else
    {
      // Last wchar_t in seq; if it's a surrogate, invalid!
      if (UTF16_IsSurrogate(psz[u+1]))
      {
        // What to do now? Just fail?
        return uColStart;
      }
      else
      {
        ch = (unsigned)psz[u+1];
        len = 1;
      }
    }

    if (IsWhiteSpace(ch))
      return u+len;
    u += len;
  }

  return uColStart;
}

You see, now things are much uglier. We can find “invalid” sequences, and have to think about a sensible way to handle them. Encodings in which all sequences are valid make life much easier. On the other hand, we switched to talking about “columns” when getting into UCS-2/UTF-16, but that’s not so valid anymore, given that the code just above isn’t using characters (which are variable-length) or bytes, but a kind of “word offset”. The nasty things of variable-length encodings.

Next time, I’ll review UTF-8, which really requires this kind of special handling, and start elaborating on how we can use some C++ mechanisms in order to handle all this gracefully.

Compiler IV: the runtime environment

Sunday, June 19th, 2005

This is the last installment in the compiler series. I will be reviewing how NGEDIT implements the stack-based machine that executes the bytecodes generated by the compiler, and actually breathes life into the code written in the script language.

The first design decision is how to store the values that the runtime environment manipulates. For one, we have the variables that may hold values - and object members as well, which are just another kind of variable. The other important group is the values that are stored on the stack.

We know that variables may hold a value of any type, which in the case of NGS means:

  • char: single byte signed quantities as in regular C/C++
  • int: native machine integer, that is, 32-bit, signed quantities
  • unsigned: same as the previous one, but unsigned
  • float: regular 32-bit floating point number
  • string: byte-per-character strings
  • object: reference to an NGS object
  • function: reference to an NGS function
  • built-in function: reference to a builtin NGEDIT provided function
  • nil: both a type and a value of its own, the “non-value”

There were some design decisions in coming up with this list. Even if the main editor interface is provided via objects, I decided not to provide a special type for built-in objects. Instead, I chose to use regular script objects with member functions initialized to built-in functions, and some “opaque” data to access the internal application data. This makes things simpler and fits nicely in the overall architecture.

I also decided not to provide a boolean type. Bytecodes that generate a logical boolean, such as comparisons, push either integer 1 or integer 0. Instructions that test a value (conditional jumps) take nil and any numeric zero as false, and anything else as true. As an important case, an empty string evaluates as true (you have to check its length if you want to know whether a string is non-empty). When adding values involving strings, which is implemented as string concatenation, nil values are promoted to the empty string, so string manipulation is quite comfortable.
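The truth rules above can be captured in a few lines. This is a standalone sketch using a simplified value type (in NGS the actual tests live in the conditional-jump bytecodes, not in a function like this):

```cpp
#include <cassert>
#include <string>

// Simplified stand-in for NGS values, for illustration only.
enum EValType { VT_NIL, VT_INT, VT_FLOAT, VT_STRING };

struct TVal
{
    EValType type;
    int i;
    float f;
    std::string s;
};

// The truthiness rules described in the text.
static bool IsTrue(const TVal &v)
{
    switch (v.type)
    {
        case VT_NIL:    return false;       // nil is false
        case VT_INT:    return v.i != 0;    // any numeric zero is false
        case VT_FLOAT:  return v.f != 0.0f;
        case VT_STRING: return true;        // even the empty string!
    }
    return false;
}
```

The empty-string case is the one most likely to surprise people coming from other scripting languages.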

Another decision I took was to initially support only byte-based encoded strings. That means Unicode UTF-16 (almost the same as UCS-2, or what Windows simply calls “Unicode”) is not supported as a native NGS value. Even if the editor is fully enabled to handle all sorts of encodings, I didn’t want to overcomplicate the internal script system. Please take into account that this does not mean the script cannot manipulate text in other encodings - this is done via NGEDIT’s built-in functions. It only means NGS’s native variables cannot hold this type of text. I’ll probably add a second string type to support Unicode strings in the future.

How does one implement these values? Via a good ole’ union inside a C++ class, which holds the “type” of the value separately:

enum EAtomType
{
  SA_NIL,
  SA_CHAR,
  SA_INT,
  SA_UNSIGNED,
  SA_FLOAT,
  SA_BUILTIN_FN,  // Built-in global functions
                  // (::MessageBox(), ...)
  SA_FUNCTION,    // Identifies a script function
  SA_STRING,
  SA_OBJECT,
};

class TAtom
{
  public:
    TAtom () { m_Type = SA_NIL; }
    ...
    void SetFloat( float fVal )
    {
      m_Type = SA_FLOAT;
      m_Data.f = fVal;
    }
    ...
    EAtomType   GetType     () const
      { return m_Type; }
    char        GetVal_char () const
      { ASSERT(m_Type == SA_CHAR); return m_Data.c; }
    ...

  private:
    EAtomType m_Type;
    union
    {
      char     c;
      int      i;
      unsigned u;
      float    f;
      char     *psz;
      SObject  *pObj;
    } m_Data;
};

You get the idea. Each value then occupies exactly 8 bytes. 3 of them are wasted, but that’s the price of alignment.
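The 8-byte figure is easy to verify with a reduced version of the layout (pointer members left out so the size is the same on 32- and 64-bit builds; with pointers in the union, a 64-bit build would grow each value to 16 bytes):

```cpp
#include <cassert>

// Reduced sketch of the tag+union layout: a 4-byte enum tag plus a
// union whose largest member here is 4 bytes. The compiler pads
// nothing extra, so the whole struct is 8 bytes - but when the tag
// is SA_CHAR, 3 of the union's 4 bytes go unused.
enum EAtomTag { TAG_NIL, TAG_CHAR, TAG_INT, TAG_FLOAT };

struct SVal
{
    EAtomTag tag;
    union
    {
        char  c;
        int   i;
        float f;
    } data;
};
```

This is why an operand stack or a member table can be a flat array of such values, with no per-value allocation.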

The operand stack, then, is simply an array of TAtom values together with a pointer to the TOS element. Variables are just TAtom’s. The members of an object or the elements in an array are just TAtom’s.

By the way, pardon my French in borrowing the ‘atom’ and ‘nil’ terms from Lisp terminology. ‘Atom’ seemed quite clear to me, and I preferred ‘nil’ over ‘null’, given that it represents a non-value rather than a null pointer.

As I have already commented, I will probably evolve NGS towards Javascript so some of this terminology will probably change.

So, now that we know how to structure the stack and the array pool, we just need to start interpreting the bytecodes. From the rest of NGEDIT, a script function is invoked with a call similar to this:

  TScrExec exec;
  TScript *ps = FindScript(...);
  TAtom arg, retval;

  arg.SetInt(key.Code);
  exec.AddArg( &arg );

  arg.SetInt(key.ModifKeyState.Val);
  exec.AddArg( &arg );

  exec.InvokeFunction(
    ps,
    ps->FindFunction("IProcessChar"), NULL, &retval
  );

  return CastInt(retval) ? true : false;

TScrExec is the class that executes a script function by interpreting its bytecode. Right now, scripts are non-concurrent, and as such a function runs until it returns or generates a “trap”. I will be adding a concurrent script engine, which will help in using scripts more efficiently and, importantly, in being able to debug them by executing step-by-step.

Leaving aside the interfacing code to pass in the arguments and get the result back, the core function is InvokeFunction - the function that actually performs the work. How does interpreting the bytecodes work? Quite simply: we keep a pointer into the bytecode stream pointing at the next instruction to execute, and perform a large switch statement on the bytecode, which branches into the code to execute each instruction. Each instruction is so simple that it is quite straightforward to execute. Push, pop, dup, etc… Quite a lot of error-checking goes on here, as we have to take care that the stack doesn’t underflow, that invalid codes are not present in the bytecode stream, that operations are not performed on values of the wrong type, etc:

TRet TScrExec::InvokeFunction(
  TScript *ps, unsigned uFn,
  const TAtom *pThisVal, TAtom *pRetVal
)
{
  // Push 'this' and args on the stack
  ...

  // Prepare the run context
  ...
  m_pRC->uPC = 0; // Next instruction pointer
  ...

  // Execute instructions
  while (!IsFinished())
  {
    ExecInstr();
  }

  ...
}

void TScrExec::ExecInstr()
{
  // Trap on invalid address
  if (!IsValidPC())
  {
    Trap( TRAP_INVALID_ADDRESS );
    return;
  }
  ...

  // Read the opcode
  byte opcode = pCode->GetByte( m_pRC->uPC );

  // Immediate handling & decoding
  unsigned immediate = 0;
  unsigned imm_bytes = GetByteCodeImmBytes ( opcode );
  unsigned instr_bytes = imm_bytes + 1;

  if (imm_bytes)
  {
    ...

    // Read immediate
    if (imm_bytes == 1)
      immediate = pCode->GetByte( m_pRC->uPC + 1 );
    else if (imm_bytes == 2)
      immediate = pCode->GetWord( m_pRC->uPC + 1 );
  }

  // Advance PC to next instruction
  m_pRC->uPC += instr_bytes;

  // Instruction execution
  switch  (opcode)
  {
    case BC_NULL:
      Trap( TRAP_USER );
      break;

    case BC_PUSH_CONST:
    {
      TAtom atom;
      ps->GetConstant(&atom, immediate);
      m_Stack.Add(atom);
      break;
    }

    ... Really long switch statement ...

  }
}

There is one thing we haven’t considered here: functions in the script will be calling other functions. Values are passed to those functions on the stack, so it’s just a matter of pushing the arguments before calling the function. The return value is returned on the stack as well, so that part is simple. But we also need to store the return address, in order to know where to jump when we hit the RET bytecode in the called function.

Apart from the return address, we need to reserve space for local variables. Sure, we could use the regular stack for them, but that would make code generation a bit more complex. What we do is keep a separate “run-context” stack. A run-context holds the return address, the local variables, and some other useful info. When we find a CALL instruction, a new run-context is generated and pushed on what I call the “call-stack”. The variable m_pRC shown above always points to the top of the run-context stack.
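A minimal sketch of that call-stack mechanism might look like this (the names and the int-only locals are illustrative, not NGEDIT’s actual code):

```cpp
#include <cassert>
#include <vector>

// A run-context per the description above: the return address plus
// this function's local variables (reduced to ints for the sketch).
struct TRunContext
{
    unsigned uRetPC;             // where to resume in the caller
    std::vector<int> locals;     // this invocation's local variables
};

// CALL pushes a new run-context; RET pops it and resumes at uRetPC.
class TCallStack
{
public:
    void Call(unsigned uRetPC, unsigned nLocals)
    {
        m_rcs.push_back(TRunContext{uRetPC, std::vector<int>(nLocals, 0)});
    }
    unsigned Ret()
    {
        unsigned pc = m_rcs.back().uRetPC;
        m_rcs.pop_back();
        return pc;
    }
    TRunContext &Top()          { return m_rcs.back(); }
    unsigned Depth() const      { return (unsigned)m_rcs.size(); }
private:
    std::vector<TRunContext> m_rcs;
};
```

The "m_pRC always points to the top" detail falls out naturally: it is just Top() of this stack, refreshed on every CALL and RET.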

There is one last thing we haven’t talked about: the calling convention. Even though we have already said that the calling function pushes the args on the stack, and the called function pops them and leaves the result on the stack, there are some details that have to be taken care of. For one, we need to decide in what order the arguments are pushed: is the leftmost one pushed first or last? Where does the implicit “this” pointer get pushed?

Given that NGS handles values, and functions are called via pushing references to them on the stack (maybe even taken from members of objects), there is in general no easy mechanism to check at compile time whether the number of arguments passed matches the number expected by the called function. We could check it at runtime and fail if the number is not right. I preferred to implement a calling mechanism that also pushes the actual number of arguments passed on the stack, after pushing all the regular arguments. The called function can then use that number to know how many arguments have actually been passed and pop them into local variables. If more arguments have been passed, the extra ones are ignored. If fewer than the expected number have been passed, their corresponding local variables are set to nil. The ENTER instruction, which is the first instruction in every function, does exactly this.
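The argument-count convention can be sketched as follows. This is an illustrative reconstruction, not NGEDIT's actual ENTER code: it assumes the leftmost argument is pushed first, with the count pushed last, and uses 0 as a stand-in for nil.

```cpp
#include <cassert>
#include <vector>

// Sketch of what ENTER does: pop the argument count, then pop the
// arguments into exactly nExpected locals - padding missing ones
// with nil (0 here) and discarding extras.
static std::vector<int> Enter(std::vector<int> &stack, unsigned nExpected)
{
    unsigned nPassed = (unsigned)stack.back();
    stack.pop_back();

    std::vector<int> locals(nExpected, 0);  // missing args become nil

    // The last-pushed argument is on top, so pop in reverse order
    // so that locals[0] receives the first argument pushed.
    for (unsigned i = nPassed; i-- > 0; )
    {
        int v = stack.back();
        stack.pop_back();
        if (i < nExpected)
            locals[i] = v;      // extras (i >= nExpected) are dropped
    }
    return locals;
}
```

Either way, the stack is left clean regardless of any mismatch, which is the whole point of passing the count.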

This is the last part of our review of NGEDIT’s scripting architecture. I hope it has been worthwhile as an overview of how a full scripting system is developed. I will cover some other compiling issues in the future, but I will also be covering other general editor areas such as text memory management, etc.

Feel free to comment or ask about whatever you are interested in.

Programming productivity

Thursday, June 16th, 2005

I always keep an eye on my own productivity. Don’t misunderstand me, I’m no productivity freak. I don’t buy into all the “Getting Things Done” or Steve Pavlina-style productivity obsession based on pushing yourself. Years ago, I would feel guilty about having spent too much time watching TV, reading or generally procrastinating. I spent way too many years in the vicious cycle of not getting things done, not forgiving myself, feeling guilty about it, and falling into the same hole again and again.

(BTW, do read the link above if you haven’t yet, it’s one of the funniest things I’ve ever read.)

Later, I’ve learned that, when you don’t do something, there is always a reason. And your best possible effort is to discover it - not to put pressure on yourself to crush the reason. That’s just doing stuff over and above your own instinct, a nasty thing to do.

The most common reasons I’ve found are: (1) I didn’t really want to do it, or (2) I was afraid of something.

When you discover that you really don’t want to do something you’re trying to push yourself into doing, the best thing you can do is take the steps to actually not do it. If other people are involved, you probably have to tell them you are not going to do it. If the task was the center of your life, you probably have to discover what you actually want to do with your life (a tough task). Sometimes the implications are so overwhelming that your best choice is to simply do the thing, for other reasons. But then you know why you are doing it, and you won’t be punishing yourself for not doing with pleasure what you really don’t want to do.

The other very common reason is that some fear is pushing you back and making the unconscious decision that it’s better not to tackle whatever it is, lest whatever the fearful thing may happen. Sometimes it is fear of failing, and discovery of failure would be much worse than just having the task lying around. Sometimes it is fear of others seeing you fail. Anything.

I am digressing, I wanted to focus on the feelings that come with programming.

The deal is that, even when you are pursuing a goal you really want to pursue, there may be a myriad of little tasks along the way that are not satisfying in themselves, but that you still want to do for the goal they bring you towards. In programming, I often find that a lot of the things I have to do are not really a pleasure - it’s the final goal that matters to me. I’m the impatient kind of guy, and I think much faster than I code (who doesn’t?), so I have a lot of conflicting feelings when programming.

Dealing with the Windows API is the typical thing I really hate, so tasks that involve it tend to get done later than others. But not only those - almost every programming task creates some feeling of discomfort. I actually have to drag myself to actual coding, which is much less fun than thinking up the design of things. I think programming tools will evolve in the direction of giving programmers more of the fun part and less of the nitty-gritty-nasty-details part (think: assembler programming was much worse than C/C++ programming, and most programmers can get a lot more work done just by not having to worry about which registers things end up in).

I can attest that I do this dragging quite successfully, as I’ve been averaging about 300 lines of code per day for the past few months, which is quite a lot. And the code is of quite high quality - it’s not that I’m writing a mess. I can produce that much because I am quite eager to have the whole core of the editor working in order to start with the innovative stuff. But even then, I find I have to push myself into doing things every day. Reading the JoS forums is too often a much more interesting temptation.

Anyway, I tried an experiment yesterday and it worked nicely. There were three tasks I had to accomplish in order to get to the next step. Vi/Vim emulation is now mostly complete for my everyday usage (complete vi emulation is a really large undertaking, and I don’t need every single little command), and after I get to my desired level of emulation completeness I’ll be moving on to other areas of NGEDIT development. The three tasks were:

  • Smart input buffer: in vi, you enter a special “input” mode whenever you are going to add new text. The ‘.’ command later on lets you repeat whatever you input (inserted characters, replaced characters, erased characters, etc). I use that continuously.
  • Named clipboards (”registers” in vi slang). You use them in vi much more often than in regular editors, as they are very straightforward to use, among other reasons. This feature is useful not only for the vi interface, but for the normal editor interface as well. More importantly, the clipboard in the editor has been “internal” until now, with no interaction with the Windows clipboard, and this task involved starting to deal with the Win32 API.
  • Selection types: just the regular character/line/column types of selections for copy/pasting. Character selections were implemented a long time ago, but I really need the other types for everyday usage.

I was looking at this as a somewhat steep hill to climb, with three separate “pushing myself’s”, one for each task.

Instead, I tried an alternative: I would start the three tasks at once, cutting through the need for three separate starts. I would get to the VimEmu.ngs script, implement the script side of all three, and then go through the other areas of the application source and get the three of them done in each part. This would mean that the application wouldn’t compile for some time, but I know that compiling and testing a minor thing is often an escape from actually confronting the programming task at hand.

How did it go? I think it’s gone quite well. The features are not finished yet, but the smart input buffer was completely implemented yesterday (even if not working perfectly). The named registers (clipboards) are mostly working, and the code to write to / read from the Windows clipboard is there (even if not all cases of text encoding conversion are working). And the selection types are lacking the core low-level implementation (actually displaying the different selections and reading/writing the right text when copy/pasting), but all the framework, including communication with NGS, is there.

I’m hoping I can get a decent chunk of that finished today. And I think I gained some time, and avoided some dismay, by “parallelizing” the starting-up conflict of the three.

PS: Does anyone know why on earth non-platform-native text on the Windows clipboard is characterized not by the codepage but by the locale? It seems that if I have text in the codepage used in Russia, I can’t paste it into the Windows clipboard unless I look up what their currency or date format is. And no, I don’t yet know how to get from the codepage to the specific locale - I’m using Unicode for copy/pasting in those cases.