Unicode, text management, and C++ techniques

Let me apologize for taking so long to post. I’ve been in a kind of a “development frenzy” for the past couple of weeks. I will be posting some news regarding all the new development shortly 🙂

Today, I’m going to start reviewing how NGEDIT manages the text buffers of the files being edited. I was explaining it and showing the source to a developer friend of mine a few days ago, and he found the C++ techniques interesting. I hope it will be useful and/or interesting to you as well.

The model is rooted in the way NGEDIT handles different text encodings, such as Windows codepages, DBCS, or different flavors of Unicode. It will take a few blog posts to cover the subject.

Some months ago, when I developed the earliest prototype, I started out with simple one-byte-per-char buffers. It was not final code and I just wanted to have the editor up and running. At the end of the day, most editing I do is in the good ole’ 1252 codepage, using a single byte per character. So, quite probably, is yours, if you’re in the US or Western Europe.

As soon as basic editing and UI were working, I started researching how to handle the different encoding types.

I know that one can use Windows’ understanding of Unicode, using two bytes per character. Well, actually, it’s not two bytes per character – even though the Unicode standard creators initially thought that 65,536 unique characters would be enough to encode all writing systems, in the end they found out they needed more. I’m not completely sure, but I think Microsoft’s decision to use a two-byte-per-character encoding predates the (bad) news that some characters would not fit in two bytes, thus requiring some sort of extension (actually called “surrogates”). That is, if you decide to use two bytes per character, you still cannot assume uniform character length. That is only true for the first 65,536 characters (technically “code-points”) in the Unicode standard. This set is nicely dubbed the “Basic Multilingual Plane”, and I think it covers all widespread writing systems (including Japanese, Chinese, Korean, Greek, Cyrillic, Hebrew, Arabic, Thai, etc.). The writing systems you are forgetting about would include Klingon, Egyptian hieroglyphs and some other alphabets which you’d better not use in the comments in your code, in your config files, or in your customer database.

If two bytes per character brought you universality together with simplicity, I’d be much more inclined to use it. But the thought that the code should gracefully handle the kind-of-escape-sequence surrogate pairs makes me feel that, apart from wasting memory in most cases, I still have to tolerate variable-length characters. And in most cases (Greek and Eastern writing systems excluded), UTF-8 is a much better encoding for this: ASCII characters, that is, the first 128 characters in the character system you are actually using now (unless you are reading this from an IBM mainframe, which I seriously doubt), are encoded as a single byte in UTF-8. If you write in English, unless you are the type of person that writes “naïve” or “résumé”, the whole of your file can be encoded in one byte per character, while still allowing the occasional hieroglyph in the middle.
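As a quick illustration (my own sketch, not NGEDIT code), here is what “naïve” looks like byte by byte in both encodings:

// "naive" in UTF-8: the ASCII letters take one byte each,
// the accented 'i' (U+00EF) takes two
const unsigned char utf8[] =
  { 'n', 'a', 0xC3, 0xAF, 'v', 'e' };              // 6 bytes

// The same string in UTF-16LE: two bytes per character
const unsigned char utf16[] =
  { 'n', 0, 'a', 0, 0xEF, 0, 'v', 0, 'e', 0 };     // 10 bytes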

Anyway, I had to support the different Unicode encodings in the editor. Even if you only need them sometimes, an editor that only supports one-byte-per-character encodings is simply not serious nowadays. I also decided that I would be supporting DBCS encodings, that is, Asian code pages in which characters can be encoded in one- or two-byte sequences. When I had to do some localization support for Japan, Korea, China and Taiwan a few years ago, I was not sure whether Unicode would be widespread in those countries. I simply asked them to send me some localized materials without specifying the format, and they just sent DBCS-encoded text files. I found out Unicode was not too widespread there either.

Let’s look at how the early NGEDIT code looked. This sample shows the code to find the next “whitespace” character in a line:

unsigned FindWhiteSpaceRight(
  const char *psz, unsigned uLen, unsigned uOffStart
)
{
  unsigned u = uOffStart;

  while (u+1 < uLen)
  {
    if (IsWhiteSpace(psz[u+1]))
      return u+1;
    u++;
  }

  return uOffStart;
}

This is quite fine and dandy. And quick. The call to IsWhiteSpace() can easily be inlined, and the whole loop can be easily optimized by the compiler.
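For reference, IsWhiteSpace() can be as simple as the following (just a sketch; the actual NGEDIT version may check more characters):

// A minimal whitespace test, trivially inlinable by the compiler
inline bool IsWhiteSpace(char c)
{
  return c == ' ' || c == '\t';
}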

Now, let’s see how this may look for the default Windows Unicode encoding (which is formally called UCS-2LE or UTF-16LE, where LE means little-endian; although there is some technical difference between UCS-2 and UTF-16, it is nothing of any importance in this context). We will do a simple translation.

unsigned FindWhiteSpaceRight(
  const wchar_t *psz, unsigned uLen, unsigned uColStart
)
{
  unsigned u = uColStart;

  while (u+1 < uLen)
  {
    if (IsWhiteSpace(psz[u+1]))
      return u+1;
    u++;
  }

  return uColStart;
}

It seems like a really simple transformation, one that is easy to perform, and which results in much more general code dealing with Asian or Arabic or Greek or Cyrillic encodings. wchar_t is a built-in C/C++ standard type used for wide characters. We switched from talking about offsets to talking about columns, as they’re not equivalent any more, but the rest seems pretty good.

But things are trickier.

As always happens with standard C/C++ types, wchar_t is technically not a very well-defined type. With Microsoft’s compilers, it is a two-byte word able to store one of the first 65,536 code-points. With GNU’s gcc compiler, it is a FOUR BYTE integer able to store any Unicode character. I don’t even know what it means in other environments.

So, the above code would be correct when compiled under gcc, although using 4 bytes per character – probably something you don’t want to do to handle really large files.

Compiling under Microsoft’s Visual C, or just using “unsigned short” in gcc in order to save some space, the above code is not really correct.
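By the way, if you are unsure what a given toolchain gives you, a one-liner settles it (a quick sketch, nothing NGEDIT-specific):

#include <cstdio>

int main()
{
  // Prints 2 with Microsoft's compilers, 4 with a typical gcc build
  std::printf("sizeof(wchar_t) = %u\n", (unsigned)sizeof(wchar_t));
  return 0;
}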

What happens if there is some Klingon character thrown into the middle of the source code?

First thing, you should probably fire the programmer who wrote that. But that’s not very satisfying.

How do these characters get encoded in UCS-2/UTF-16? Well, the first 65,536 characters in the Unicode standard simply get encoded as-is. But the designers were so cunning as to leave certain ranges unused for characters – most importantly the so-called surrogate range from 0xD800 to 0xDFFF. These code-points are not assigned to any character in the standard.

The standard defines characters from 0x0000 upwards, and they have promised not to use any single value above 0x10FFFF. That is, there are 16 times 65,536 possible code-points that can get encoded apart from the first 65,536 ones – gobs of characters above the 0xFFFF ceiling. They decided to use what are called surrogate pairs. A sequence of two values in the 0xD800-0xDFFF range defines a single character. Actually, the surrogate range is divided into a “High Surrogate” value (0xD800 to 0xDBFF) and a “Low Surrogate” value (0xDC00 to 0xDFFF). The high surrogate must always come first, they must always come together (an independent surrogate with no companion has no meaning), and together they can encode 1024 times 1024 different characters. That covers the 0x100000 extra values beyond the BMP (‘Basic Multilingual Plane’).
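In code, combining a surrogate pair back into its code-point boils down to something like this (my sketch of the standard formula, not an NGEDIT helper):

// Combine a high/low surrogate pair into the code-point it encodes.
// hi must be in 0xD800-0xDBFF, lo in 0xDC00-0xDFFF.
inline unsigned CombineSurrogates(unsigned hi, unsigned lo)
{
  // Each surrogate contributes 10 bits; the result lies past the BMP
  return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
}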

This leaves us in the uncomfortable situation that the above code handling wchar_t’s is actually unaware of what it is doing with those symbols.

What will happen if we just do that? Well, it’s not that bad, as you probably won’t encounter Klingon characters “in the wild”. But if you do, you will be manipulating them incorrectly, and even if your OS does a good job of rendering them (the user had better have installed some good fonts to display that), you will be mangling the text.

UTF-8 encoding has similar properties, although with it the “naive” code will run into wrongly handled characters much more easily (more about this in the next installment).

So, what should we really do to handle UCS-2/UTF-16 correctly? Something like this:

unsigned FindWhiteSpaceRight(
  const wchar_t *psz, unsigned uLen, unsigned uColStart
)
{
  unsigned u = uColStart;

  while (u+1 < uLen)
  {
    unsigned len; // Characters may be long...
    unsigned ch; // Characters may be >0xFFFF

    if (u+2 < uLen)
      ch = UTF16_Decode(psz + u + 1, &len);
    else
    {
      // Last wchar_t in seq, if it's a surrogate, invalid!
      if (UTF16_IsSurrogate(psz[u+1]))
      {
        // What to do now? Just fail?
        return uColStart;
      }
      else
      {
        ch = (unsigned)psz[u+1];
        len = 1;
      }
    }

    if (IsWhiteSpace(ch))
      return u+len;
    u += len;
  }

  return uColStart;
}

You see, now things are much uglier. We can find “invalid” sequences, and we have to think about a sensible way to handle them. Encodings in which all sequences are valid make life much easier. On the other hand, we switched to talking about “columns” when getting into UCS-2/UTF-16, but that’s not so valid anymore, given that the code just above isn’t using characters (which are variable-length) or bytes, but a kind of “word offset”. Those are the nasty things of variable-length encodings.
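For completeness, the two helpers used above could look roughly like this (my own sketch; the names match the calls above, but the real implementations may well differ):

// True if w falls in the surrogate range 0xD800-0xDFFF
inline bool UTF16_IsSurrogate(wchar_t w)
{
  return ((unsigned)w & 0xF800) == 0xD800;
}

// Decode one code-point starting at p, storing the number of
// wchar_t's consumed (1 or 2) in *pLen.
inline unsigned UTF16_Decode(const wchar_t *p, unsigned *pLen)
{
  unsigned hi = (unsigned)p[0];
  if (hi >= 0xD800 && hi <= 0xDBFF)
  {
    unsigned lo = (unsigned)p[1];
    if (lo >= 0xDC00 && lo <= 0xDFFF)
    {
      *pLen = 2;
      return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
    }
  }
  *pLen = 1; // Plain BMP value (or a lone surrogate): one unit
  return hi;
}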

Next time, I’ll review UTF-8, which really requires this kind of special handling, and start elaborating on how we can use some C++ mechanisms in order to handle all this gracefully.
