The growing pains of NGEDIT

Archive for July, 2005

Entrepreneur: The post yesterday was stupid!

Sunday, July 31st, 2005

I like to get carried away with stuff. Mainly with ideas, it gives me the energy required to beat them to death until I can squeeze out the last ounce of juice. When I don’t let myself get carried away with stuff, creativity is gone.

Of course, if I’m doing something risky and dangerous, like driving or debugging, I prefer not to get carried away. It’s much better for all.

And of course, letting yourself get carried away has its drawbacks. Sometimes you get carried away with stuff that is plain dumb.

One such thing was the post yesterday. I found out that the HTML in my site was wrong and made the association with not getting search results. But it was just a thought, and it may or may not be true. I just jumped to the conclusion it was true… and posted about it.

When Ian Landsman mentioned there are probably other reasons for the fact, I revolved a bit until I actually admitted to myself that I had jumped to unjustified conclusions. Duh.

I have expertise in many areas, but web site creation and search engine optimization are not two of them.

Anyway, both the good and and the bad sides of blogs is that you get “the human factor”, mistakes and all.

I’ll post about how searches evolve… and will try not to jump to conclusions before checking them!

Posted in misc | 3 Comments »

Entrepreneur: Get your HTML right!

Friday, July 29th, 2005

I’ve been having some problems with the www.ngedit.com. It was working nicely, but almost zero searches had ever found them. It’s weird, because quite many searches have found blog.ngedit.com.

After the release of ViEmu, it’s a bit different, but up until last week, the top search item finding this blog was “WM_CHAR”. WM_KEYDOWN and friends were also up there on the list. Search phrases revealed the frustration and desperation of programmers fighting to tame the Win32 keyboard input model: “how to get utf-16 wm_char without unicode”, “getting wm_syskeydown instead of wm_keydown ime”, “wm_char arrows repeat”, even “brief explanation of egytian currency”! My blog post on the issue is even on the first page you get from google when you just search for WM_CHAR!

As an aside, web stats are a source of awe and wonder for me, I cannot help but imagine the story behind every search, and this links to my theory of “exceptions”, but that’s a story for another day.

But only two terms have been finding the main page: ngedit and www.ngedit.com. A grand total of 23 search hits since the existence of the site (and I’m pretty sure several of them were either me or some friend I told about the endeavor).

It didn’t worry me too much until now. People were finding the blog and that was nice, the main site was almost placeholder stuff. But now, after the release of my first product, www.ngedit.com is important for the good advance of my venture.

And this week, after the release, I’ve gotten quite a lot of traffic from JoS and some announcements here and there, as well as the nice blog posts from fellow entrepreneurs setting up their own shop at the same time. But nothing from google or other search engines.

The weird thing has been that my site has basically not appeared when looking for “visual studio vi emulation” (I know I looked for that when I had the need myself.) And, even if I didn’t do it on purpose for SEO, the main site must be filled up to death with the phrase!

As a weirder thing, my blog post on the release of ViEmu is on one of the first pages with that search phrase, and even Ian Landsman‘s post is on the second page! How come www.ngedit.com does not appear on Google searches?

I set up a google adwords campaign yesterday (please, if you see the add, and you already know about the project, do not click on the add 🙂 I’m doing it with a very limited budget). Ads are not appearing either (although google reports 6 impressions with 0 click-throughs).

Weirdly, today I found out that if I wrote it in another order, such as “vi studio visual emulation”, the site appeared… on the second page! Come on, this is not such a crowded market niche!

Today, I’ve found out something which I believe is the key to all of this: as I was preparing a new page for the site, I found out with terror that my html pages were full of html syntax errors! Well, actually not full, there were about two or three on the main ViEmu product page, all pages were missing the DOCTYPE line, and, by some accident, I had removed the opening <html> and <head> tags from over half the pages. Duh.

Please don’t take me wrong – I do all my html and css by hand, with a lot of love put in every html tag and every css style. I don’t have much experience with html (although that’s rapidly changing), and I prefer to do it this way by now. I check everything with Firefox and IE, and I pay a lot of attention to proper html – but had simply forgotten to actually verify it with something such as the w3c validator.

So, now everything’s corrected (the new content section is not uploaded yet), and I hope to get good google search results in a very short while (couple days?), unless they really punish you for having lowered the average quality of html on the web.

And a happy fact is that I actually used NGEDIT to convert all the files from CRLF terminators to LF terminators, so that they don’t have to be converted by ftp each time I upload them 🙂

On other news, yes, I will be adding an “Articles” section to the main site, with some new content which is ready now, and I’ll publish there the main articles instead of on the blog – leaving the blog for shorter and more day-to-day entries, and announcements for the articles posted there. I really don’t think the blog is the right place for the long articles I tend to post.

Posted in webdev | 9 Comments »

Unicode, text management, and C++ techniques (II)

Friday, July 29th, 2005

We left the series a few weeks ago, after having talked a bit about UCS-2/UTF-16, which in its little-endian version is simply called “Unicode” by Microsoft.

We’re now going to review a bit of what UTF-8, probably the most extended Unicode encoding, actually means.

Remember the context (I’m probably just reminding myself, as I’ve been so busy with ViEmu for the past few weeks.) We’re going to see how NGEDIT handles the different text encodings internally – based on the fact that NGEDIT does not convert on write and read, but it keeps files in memory in their original encoding.

UTF-8 was a very neat trick (elevated to category of a standard) devised by Ken Thompson. The basic unit in UTF-8 is a byte, but only a few characters occupy a single byte. Characters may actually be anything from 1 to 4 bytes, depending on their value. Actually, the encoding method allows characters of 5 or even 6 bytes, but those only happen for code points above 0x10FFFF, which the Unicode standard now forbids – so no 5 or 6 byte sequences should be found in a “legal” UTF-8 file.

Basically, ASCII characters are stored as single-byte 0..127 values (because, I guess you know, ASCII is a 7-bit code, and the Unicode set of characters is coincidential in the 127 first characters). That means a file consisting of only ASCII values will be exactly the same in good old 8-bit-stored ASCII, or in UTF-8.

The 128 characters from 128 to 255 in Unicode, together with the first 128 which are plain ASCII, complete the ISO-8859-1 encoding, usually called Latin1. This was the default encoding for HTML, and even if a lot of HTML these days uses UTF-8, I think the default if no encoding is specified is ISO-8859-1. How are these characters encoded in UTF-8? Actually, with two byte sequences:

Latin1 character 128: UTF-8 bytes 0xC2 0x80
…
Latin1 character 255: UTF-8 bytes 0xC3 0xBF

Unicode code points up to 0x7FF (2048 characters) are all encoded in two bytes in UTF-8, and the last one is 0xDF,0xBF.

As you can deduce, it’s not that the first byte is just a marker. Two-byte UTF-8 secuences are marked the high 3 bits of the first byte being binary 110 (so, in hexadecimal, the number will be between 0xC0 and 0xDF). The other 5 bits of the first char are the highest 5 bits of the 11 bit character encoded. And the trailing bytes actually has 6 bits of info, as the highest two bits must be binary 10.

Higher code points use 3 and 4 bytes per character encodigs: 3 byte characters are marked by high four bits being binary 1110 and 4 byte characters are marked by high five bits being binary binary 11110.

As an important point, all trailing bytes in characters of any byte-length always have the high two bits as binary 10, so finding where characters start is easy.

Anyway, the point is, how do we translate the code from the last post, which looks for whitespace, so that it will work with UTF-8? Let’s see again the beautifully simple original one-byte-per-character code:


unsigned FindWhiteSpaceRight(
  const char *psz, unsigned uLen, unsigned uOffStart
)
{
  unsigned u = uOffStart;

  while (u+1 < uLen)
  {
    if (IsWhiteSpace(psz[u+1]))
      return u+1;
    u++;
  }

  return uOffStart;
}

It’s not the most beautiful code, but it’s beautifully simple.

Now, let’s see the UTF-8 enabled version, which could actually recognize a hieroglyphic whitespace if it were necessary:


unsigned FindWhiteSpaceRight(
  const byte *psz, unsigned uLen, unsigned uColStart
)
{
  unsigned u = uColStart;

  while (u+1 < uLen)
  {
    unsigned len; // Characters may be long...
    unsigned ch; // Characters may be >0xFFFF

    len = UTF8_CalcLen(psz[u]);
    if (u + len < uLen)
      unsigned ch = UTF8_Decode(psz);
    else
    {
      // Invalid!
      //TODO: Handle it in some way!
    }

    if (IsWhiteSpace(ch))
      return u+len;
    u += len;
  }

  return uOffStart;
}

The code to calculate the length of and decode a UTF-8 character would look more or less like this:


inline unsigned UTF8_CalcLen(byte b)
{
       if (b < 0x80u) return 1;
  else if (b < 0xE0u) return 2;
  else if (b < 0xF0u) return 3;
  else if (b < 0xF8u) return 4;
  else if (b < 0xFCu) return 5;
  else return 6;
}


#define EX(x, shl) ((x & 0x3F) << shl)
inline unsigned UTF8_Decode(const byte *p)
{
  byte lead = *p;

  if (lead < 0x80u)
  {
    return lead;
  }
  else if (lead < 0xE0u)
  {
    return ((lead & 0x1Fu) << 6u) | EX(p[1], 0);
  }
  else if (lead < 0xF0u)
  {
    return ((lead & 0x0Fu) << 12) | EX(p[1], 6)
        | EX(p[2], 0);
  }
  else if (lead < 0xF8u)
  {
    return ((lead & 0x07u) << 18u) | EX(p[1], 12)
        | EX(p[2], 6) | EX(p[3], 0);
  }
  else if (lead < 0xFCu)
  {
    return ((lead & 0x03u) << 24u) | EX(p[1], 18)
        | EX(p[2], 12) | EX(p[3], 6)
        | EX(p[4], 0);
  }
  else
  {
    return ((lead & 0x03u) << 30u) | EX(p[1], 24)
        | EX(p[2], 18) | EX(p[3], 12)
        | EX(p[4], 6) | EX(p[5], 0);
  }
}
#undef EX

Take into account that this code is not actually Unicode conformant, given that it shouldn’t accept 5 and 6 byte characters, and it should filter out overlong sequences (characters which occupy N bytes but could have been encoded with less bytes).

So, now you see how the actual code gets more complex for UTF-8, and the innocent loop actually involves a lot of operations now.

We’ve now seen the complexities of dealing with different encodings: one byte per character, Windows “Unicode” with possible “surrogates”, UTF-8 with all its varying length management needs. We haven’t even checked DBCS, which are the systems by which Japanese, Korean, and different Chinese text are commonly stored, and in which seeking backwards in text is all but impossible, because lead bytes and trail bytes are not distinguishable by value. And then there are all the other Unicode encoding variants, including little-endian and big-endian versions, etc…

How can one choose to implement support for all of these in C++?

One possibility is to write a version of each text-management function such as FindWhiteSpaceRight for each supported encoding.

Just kidding 🙂

What we really want is to write code almost as simple as the one-byte-per-character version above, which will work for all encodings.

As a common C++ idiom, we could a design base class with virtual methods which represent the required functions. Methods could be “unsigned GetChar()”, “AdvancePointer()”, etc… and each derived would implement their version of each.

This would work. Indeed. But we would be paying a high price.

For one, the price of a virtual function call for each simple operation. The one-byte-per-char version is not only simple to see, but the code it generates is really good because the CPU is very good at handling simple bytes.

But the second very important one is that the virtual function would need to receive and return the most general class of characters in mind, actually, 32-bit-per-char UCS-4. And that would mean converting for really simple operations.

This is especially important for one reason: I wanted NGEDIT to handle all encoding types, to handle them natively, but most day-to-day editing happens with one-byte-per-char encodings. Burdening the code which is run 90% of the time in a large part of the world (at least, all of Europe and the US) with a high performance impediment seems a bit absurd, and I didn’t want to do it.

The goal is code that is simple to write and read, code which can be made to work with all encoding types, but also code that will become the simple byte-handling code that we had for the first case when we are actually dealing with one-byte-per-char encodings. And, sure, we don’t want to write gobs of code.

The solution? Of course, courtesy of templates, and will be the topic of the last article in this mini-series, together with some other actually important reasons to use such a solution (hint: tab handling code is often a waste!)

Posted in ngdev | No Comments »

vi/vim emulation for Visual Studio

Wednesday, July 27th, 2005

After a lot of work and testing, ViEmu 1.0 is finally out. You can check it at www.ngedit.com. I hope you like it. All feedback is welcome.

I think I will be able to come back to posting some interesting articles later this week.

Best!

Posted in misc | 7 Comments »

Beta process going

Monday, July 18th, 2005

ViEmu beta is going well. Beta 1 had some trouble, as I had left a DLL out from the package, but after several hours of fresh installs of Visual Studio I found out. Today I’ve issued Beta 2 with some improvements in Intellisense integration and better TAB handling. I also have the new web design almost ready.

I still keep the target of releasing before the end of July. We’ll see if nothing turns it into an August release 🙂

Unfortunately, this isn’t leaving too much time for technical posts, although they’ll be back after the release.

Posted in misc | No Comments »

Imminent release announcement

Tuesday, July 12th, 2005

No, it’s not NGEDIT. It’s ViEmu, and it’s almost ready. Vi editor emulation for Microsoft Visual Studio .NET 2003.

Let me explain myself.

A few weeks ago, an idea struck me – I had a quite large chunk of vi emulation working. And having vi emulation would sure be a great addition for Microsoft Visual Studio (at least for those, like me, who have their fingers hardwired to vi’s input model). I had no experience writing extensions or add-in’s for VS, but it couldn’t be that hard. I gave myself a full afternoon to research it (“whaddayamean, switching to another project before the first is ready?”) and verify it was as good an idea as it seemed.

After two hours, reading a blog comment saying “if I had vi emulation in Visual Studio I’d be in heaven” and reading a note from Microsoft that yes, they had it as a feature for future versions, but no, VS .NET 2005 would not have it, I registered the domain name (actually two, viemu.com and vimemu.com, I wasn’t sure). I downloaded the VSIP SDK (add-in’s only have limited extension capabilities, you need the VSIP SDK if you really want to shake Visual Studio). And I started hacking at all that COM code.

In a couple of days, I had been able to have a custom package loaded within Visual Studio, and subclassed the editor window. I did have to learn a whole lot of COM programming (I had done some basic COM, but not the kind of COM you have to do in order to talk with a beast like VS – no, I don’t love it more now). I briefly considered C# but went with C++ instead. I did the first experiment with the following thought: “if I can make pressing 0 send the cursor to the beginning of the line, I can do the rest” (0 is the default way to send the cursor to the beginning of the line). I did it, felt great, and went to sleep (it was late).

I then started porting the vi emulation core. You know, the emulation core was written in NGS, the NGEDIT scripting language. I had to port it to C++ (no, I wasn’t going to bring over NGS scripting to Visual Studio). It took a solid four days to port all of it. No, adding the semicolons wasn’t the worst part. The emulation core was nicely separated from the editing core actions implementation, so I only had to implement a few primitives to get it working. The day I finished porting the ViEmu core, a lot of vi started working simultaneously – wonders of porting. I started using ViEmu fulltime to develop itself.

A few weeks later, a lot of vi implementation afterwards (there were many missing things yet in the NGEDIT ViEmu module), ViEmu is now almost ready. Missing things:

The preferences section (you wouldn’t believe the COM programming required for a simple dialog with 5 or 6 checkboxes and an edit box – multiple inheritance from twelve base classes is used in MS’s sample code, only one of which is not a template.
The installer – you must use MSI installer for it – I have to decode the MSI SDK, which I still haven’t been able to even reliably find for download – I’m waiting to see if I received my recent MSDN Universal subscription and I can better find it there.
Solve some small interactions with Intellisense and the undo system. It’s working but there is some tricky case yet.
General review of cursor positions after vi commands are performed. Some of them are not the same as with vim, and that should be right for 1.0.
And beta testing. I’ve been using it myself for all development and it’s very stable, but I’m only using C++ and a simple one-byte-per-character codepage (even if VS uses UCS-2 Unicode internally). I’ll do some basic testing on Visual Basic or C#, but not the kind of testing someone who is actually developing does.

Ok, so, my expected timeframe is quite short. I definitely want to release it before the end of July. I also have to set up the web page and e-commerce system, but I don’t expect many problems with that. I expect to finish the development tasks listed above this week, and I want to run beta-testing as soon as the installer is ready (I need to deliver the beta itself as an installable package).

So, if you would like to beta test ViEmu, please drop me a line. I’m planning a quite small beta testing group. I’m mostly looking for people who use languages other than C++, codepages other than the usual US/Western Europe one, and Windows versions different from Windows 2K and Windows XP.

Someone who writes VB applications in Korean on a Windows Millenium machine would be a dream come true 🙂

I hope it will work with any left-to-right writing system (I don’t have any idea how it will perform with Arab or Hebrew bidirectional writing, and I’m not delaying 1.0 until that is ok).

Leaving out the vi/vim command line, which is not emulated, almost all of vi/vim input is emulated (including visual selection modes, etc). I’ll make available a full list of emulated features.

It is not compatible with Whole Tomato Software’s Visual Assist. I’m looking into it, but I’m not sure it will be fixed by 1.0, maybe a bit later.

Porting to VS 2005 is on the radar, of course, but 1.0 will be for VS .NET 2003. VS 2005 is only beta yet, actually.

And, rest assured, I’ll come back to NGEDIT after ViEmu is released.

Posted in misc | 6 Comments »

Unicode, text management, and C++ techniques

Saturday, July 2nd, 2005

Let me apologize for taking so long to post. I’ve been in a kind of a “development frenzy” for the past couple of weeks. I will be posting some news regarding all the new development shortly 🙂

Today, I’m going to start reviewing how NGEDIT manages the text buffers of the files being edited. I was explaining it and showing the source to a developer friend of mine a few days ago, and he found the C++ techniques interesting. I hope it will be useful and/or interesting to you as well.

The model is rooted in the way NGEDIT handles different text encodings, such as Windows codepages, DBCS, or different flavors of Unicode. It will take a few blog posts to cover the subject.

Some months ago, when I developed the earliest prototype, I started out with simple one-byte-per-char buffers. It was not final code and I just wanted to have the editor up and running. At the end of the day, most editing I do is in the good ole’ 1252 codepage, using a single byte per character. So is quite probably yours, if you’re in the US or Western Europe.

As soon as basic editing and UI were working, I started researching how to handle the different encoding types.

I know that one can use Windows’ understanding of Unicode, using two bytes per character. Well, actually, it’s not two bytes per character – even though the Unicode standard creators initially thought that 65,536 unique characters would be enough to encode all writing systems, in the end they found out they needed more. I’m not completely sure, but I think Microsoft’s decision to use a two-byte-per-character encoding predates the (bad) news that some characters would not fit in two bytes, thus requiring some sort of extension (actually called “surrogates”). That is, if you decide to use two bytes per character, you can still not assume uniform character length. That is only true for the first 65,536 characters (technically “code-points”) in the Unicode standard. This set is nicely dubbed “Basic Multilingual Plane”, and I think it covers all widespread systems (including Japanese, Chinese, Korean, Greek, Cyrillic, Hebrew, Arabic, Thai, etc. I think the writing systems you are forgetting about would include Klingon, Egytian hieroglyphs and some other alphabets which you’d better not use in the comments in your code or in your config files or in your customer database.

If two bytes per characters brought you universality together with simplicity, I’d be much more inclined to using it. But the thought that the code should gracefully handle the kind-of-escape-sequence surrogate pairs makes me feel that, apart from wasting the memory in most cases, I have to tolerate variable-length characters. And in most cases (Greek and Eastern writings excluded), UTF-8 is a much better encoding for this: ASCII characters, that is, the first 128 characters in the character system you are actuallly using now (unless you are reading this from an IBM mainframe, which I seriously doubt), are one-byte-coded in UTF-8. If you use english, unless you are the type of person that writes “naïve” or “résumé”, the whole of your file can be encoded in one byte per character, while still allowing the occasional hieroglyph in the middle.

Anway, I had to support the different Unicode encodings in the editor. Even if you only use it sometimes, an editor with just one-byte-per-character encodings support is simply not serious nowadays. I also decided that I would be supporting DBCS encodings, that is, Asian code pages in which characters can be encoded in one or two byte sequences. When I had to do some localization support for Japan, Korea, China and Taiwan a few years ago, I was not sure whether Unicode would be widespread in those countries. I simply asked them to send me some localized materials without specifying the format, and they just sent DBCS encoded text files. I found out Unicode was not too widespread there either.

Let’s look at how the early NGEDIT code to do some handling looked. This sample shows the code to find the next “whitespace” character in the line:


unsigned FindWhiteSpaceRight(
  const char *psz, unsigned uLen, unsigned uOffStart
)
{
  unsigned u = uOffStart;

  while (u+1 < uLen)
  {
    if (IsWhiteSpace(psz[u+1]))
      return u+1;
    u++;
  }

  return uOffStart;
}

This is quite fine and dandy. And quick. The call to IsWhiteSpace() can easily be inlined, and the whole loop can be easily optimized by the compiler.

Now, let’s see how this may look for the default Windows Unicode encoding (which is formally callled UCS-2LE or UTF-16LE, where LE means little-endian, and although there is some technical difference between UCS-2 and UTF-16, it is nothing of any importance in this context). We will do a simple translation.


unsigned FindWhiteSpaceRight(
  const wchar_t *psz, unsigned uLen, unsigned uColStart
)
{
  unsigned u = uColStart;

  while (u+1 < uLen)
  {
    if (IsWhiteSpace(psz[u+1]))
      return u+1;
    u++;
  }

  return uOffStart;
}

It seems like a really simple transformation, one that is easy to perform, and which results in much more general code dealing with Asian or Arabic or Greek or Cyrillic encodings. wchar_t is a built-in C/C++ standard type used for wide characters. We switched from talking about offsets into talking about columns, as they’re not equivalent any more, but the rest seems pretty good.

But things are trickier.

As always happens with standard C/C++ types, wchar_t is not technically a very well defined type. According to Microsoft’s compilers, it is a two byte word able to store one of the first 65,536 code-points. According to GNU’s gcc compiler, it is a FOUR BYTE integer able to store any Unicode character. I don’t even know what it means in other environments.

So, the above code would be correct when compiled under gcc, although using 4 bytes per character – probably something you don’t want to do to handle really large files.

Compiling under Microsoft’s Visual C, or just using “unsigned short” in gcc in order to save some space, the above code is not really correct.

What happens if there is some Klingon character thrown in in the middle of the source code?

First thing, you should probably fire the programmer who wrote that. But that’s not very satisfying.

How do these characters get encoded in UCS-2/UTF-16? Well, the first 65,536 characters in the Unicode standard get simply encoded as-is. But, they were so cunning as to leave certain ranges unused for characters – most importantly the so called surrogate range from 0xD800 to 0xDFFF. These codepoints are not assigned to any character in the standard.

The standard defines characters from the starting 0x0000, and they have promised no to use any single value above 0x10FFFF. That is, there are 16 times 65,536 possible codepoints that can get encoded apart from the first 65,536 ones. That is, there are gobs of characters above the 0xFFFF ceiling. They decided to use what are called surrogate pairs. A sequence of two values in the 0xD800-0xDFFF range defines a single character. Actually, the surrogate range is divided in a “High Surrogate” value (0xD800 to 0xDBFF) and a “Low Surrogate” value (0xDC00 to 0xDFFF). The high surrogate must always come first, they must always come together (an idependent surrogate with no companion has no meaning), and together they can encode 1024 times 1024 different characters. That covers the extra 0x0FFFFF values beyond the BMP (‘Basic Multilingual Plane’).

This leaves us in the unconfortable situation that the above code handling wchar_t’s is actually unaware of what it is doing with those symbols.

What will happen if we just do that? Well, it’s not that bad, as you probably won’t encounter Klingon characters “in the wild”. But if there are, then you will be incorrectly manipulating them, and even if your OS does a good job of rendering them (the user had better installed some good fonts to display that), you will be mangling the text.

UTF-8 encoding has similar properties, although the “naive” code will find characters wrongly handled much more easily (more about this in the next installment).

So, what should we really do to handle UCS-2/UTF-16 correctly? Something like this:


unsigned FindWhiteSpaceRight(
  const wchar_t *psz, unsigned uLen, unsigned uColStart
)
{
  unsigned u = uColStart;

  while (u+1 < uLen)
  {
    unsigned len; // Characters may be long...
    unsigned ch; // Characters may be >0xFFFF

    if (u+2 < uLen)
      unsigned ch = UTF16_Decode(psz + 1, &len);
    else
    {
      // Last wchar_t in seq, if it's surrogate, invalid!
      if (UTF16_IsSurrogate(psz[u+1]))
      {
        // What to do now? Just fail?
        return uOffStart;
      }
      else
      {
        ch = (unsigned)psz[1];
      }
    }

    if (IsWhiteSpace(ch))
      return u+len;
    u += len;
  }

  return uOffStart;
}

You see, now things are much uglier. We can find “invalid” sequences, and have to think about a sensible way to handle that. Encodings in which all sequences are valid make life much easier. On the other hand, we switched into talking about “columns” when getting into UCS-2/UTF-16, but that’s not so valid anymore, given that the code just above isn’t using characters (which are variable length) or bytes, but a kind of “word offset”. The nasty things of variable length encoding.

Next time, I’ll review UTF-8, which really requires this kind of special handling, and start ellaborating on how we can use some C++ mechanisms in order to handle all this gracefully.