Unicode, text management, and C++ techniques (II)
We left the series a few weeks ago, after having talked a bit about UCS-2/UTF-16, which in its little-endian version is simply called “Unicode” by Microsoft.
We’re now going to review a bit of what UTF-8, probably the most widespread Unicode encoding, actually means.
Remember the context (I’m probably just reminding myself, as I’ve been so busy with ViEmu for the past few weeks). We’re going to see how NGEDIT handles the different text encodings internally – based on the fact that NGEDIT does not convert on read and write, but keeps files in memory in their original encoding.
UTF-8 was a very neat trick (elevated to the category of a standard) devised by Ken Thompson. The basic unit in UTF-8 is a byte, but only a few characters occupy a single byte. Characters may actually be anything from 1 to 4 bytes, depending on their value. The encoding method itself allows characters of 5 or even 6 bytes, but those only happen for code points above 0x10FFFF, which the Unicode standard now forbids – so no 5 or 6 byte sequences should be found in a “legal” UTF-8 file.
Basically, ASCII characters are stored as single-byte 0..127 values (because, I guess you know, ASCII is a 7-bit code, and the Unicode character set coincides with ASCII in the first 128 characters). That means a file consisting of only ASCII values will be exactly the same in good old 8-bit-stored ASCII as in UTF-8.
The 128 characters from 128 to 255 in Unicode, together with the first 128 which are plain ASCII, complete the ISO-8859-1 encoding, usually called Latin1. This was the default encoding for HTML, and even though a lot of HTML these days uses UTF-8, I think the default when no encoding is specified is still ISO-8859-1. How are these characters encoded in UTF-8? With two-byte sequences:
Latin1 character 128: UTF-8 bytes 0xC2 0x80
…
Latin1 character 255: UTF-8 bytes 0xC3 0xBF
Unicode code points from 0x80 up to 0x7FF are all encoded as two bytes in UTF-8 (together with the 128 single-byte ASCII values, that covers the first 2048 characters), and the last two-byte sequence is 0xDF 0xBF.
As you can deduce, the first byte is not just a marker. Two-byte UTF-8 sequences are marked by the high 3 bits of the first byte being binary 110 (so, in hexadecimal, the first byte will be between 0xC0 and 0xDF). The other 5 bits of the first byte are the highest 5 bits of the 11-bit character being encoded. The trailing byte actually carries 6 bits of information, as its highest two bits must be binary 10.
Higher code points use 3- and 4-byte-per-character encodings: 3-byte characters are marked by the high four bits of the first byte being binary 1110, and 4-byte characters by the high five bits being binary 11110.
As an important point, all trailing bytes in characters of any byte-length always have the high two bits as binary 10, so finding where characters start is easy.
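To make the bit layout a bit more concrete, here is a minimal encoder sketch (my own illustration, not NGEDIT code) that writes a code point of up to 0x10FFFF as 1 to 4 bytes, following the rules just described. The byte type is assumed to be a typedef for unsigned char, as in the code further below:

typedef unsigned char byte;

// Writes ch into pOut as UTF-8 and returns the number of bytes written (1 to 4)
unsigned UTF8_Encode(unsigned ch, byte *pOut)
{
  if (ch < 0x80u)
  {
    pOut[0] = (byte)ch;                             // 0xxxxxxx
    return 1;
  }
  else if (ch < 0x800u)
  {
    pOut[0] = (byte)(0xC0u | (ch >> 6));            // 110xxxxx
    pOut[1] = (byte)(0x80u | (ch & 0x3Fu));         // 10xxxxxx
    return 2;
  }
  else if (ch < 0x10000u)
  {
    pOut[0] = (byte)(0xE0u | (ch >> 12));           // 1110xxxx
    pOut[1] = (byte)(0x80u | ((ch >> 6) & 0x3Fu));  // 10xxxxxx
    pOut[2] = (byte)(0x80u | (ch & 0x3Fu));         // 10xxxxxx
    return 3;
  }
  else
  {
    pOut[0] = (byte)(0xF0u | (ch >> 18));           // 11110xxx
    pOut[1] = (byte)(0x80u | ((ch >> 12) & 0x3Fu)); // 10xxxxxx
    pOut[2] = (byte)(0x80u | ((ch >> 6) & 0x3Fu));  // 10xxxxxx
    pOut[3] = (byte)(0x80u | (ch & 0x3Fu));         // 10xxxxxx
    return 4;
  }
}

Feeding it 128 and 255 produces exactly the 0xC2 0x80 and 0xC3 0xBF pairs shown above.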
Anyway, the point is, how do we translate the code from the last post, which looks for whitespace, so that it will work with UTF-8? Let’s look again at the beautifully simple original one-byte-per-character code:
unsigned FindWhiteSpaceRight( const char *psz, unsigned uLen, unsigned uOffStart )
{
  unsigned u = uOffStart;

  while (u+1 < uLen)
  {
    if (IsWhiteSpace(psz[u+1]))
      return u+1;
    u++;
  }

  return uOffStart;
}
It’s not the most beautiful code, but it’s beautifully simple.
Now, let’s see the UTF-8 enabled version, which could actually recognize a hieroglyphic whitespace if it were necessary:
unsigned FindWhiteSpaceRight( const byte *psz, unsigned uLen, unsigned uOffStart )
{
  unsigned u = uOffStart;

  while (u+1 < uLen)
  {
    unsigned len; // Characters may be long...
    unsigned ch;  // Characters may be >0xFFFF

    len = UTF8_CalcLen(psz[u]);

    if (u + len <= uLen)
    {
      ch = UTF8_Decode(psz + u);
    }
    else
    {
      // Invalid!
      //TODO: Handle it in some way!
      break;
    }

    if (IsWhiteSpace(ch))
      return u+len;

    u += len;
  }

  return uOffStart;
}
The code to calculate the length of and decode a UTF-8 character would look more or less like this:
inline unsigned UTF8_CalcLen(byte b)
{
  if (b < 0x80u)      return 1;
  else if (b < 0xE0u) return 2;
  else if (b < 0xF0u) return 3;
  else if (b < 0xF8u) return 4;
  else if (b < 0xFCu) return 5;
  else                return 6;
}

#define EX(x, shl) ((x & 0x3F) << shl)

inline unsigned UTF8_Decode(const byte *p)
{
  byte lead = *p;

  if (lead < 0x80u)
    return lead;
  else if (lead < 0xE0u)
    return ((lead & 0x1Fu) << 6) | EX(p[1], 0);
  else if (lead < 0xF0u)
    return ((lead & 0x0Fu) << 12) | EX(p[1], 6) | EX(p[2], 0);
  else if (lead < 0xF8u)
    return ((lead & 0x07u) << 18) | EX(p[1], 12) | EX(p[2], 6) | EX(p[3], 0);
  else if (lead < 0xFCu)
    return ((lead & 0x03u) << 24) | EX(p[1], 18) | EX(p[2], 12) | EX(p[3], 6) | EX(p[4], 0);
  else
    return ((lead & 0x01u) << 30) | EX(p[1], 24) | EX(p[2], 18) | EX(p[3], 12) | EX(p[4], 6) | EX(p[5], 0);
}

#undef EX
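As a quick sanity check (an example of mine, not code from the editor), the two-byte sequence 0xC3 0xA9 decodes to 0xE9, which is Latin1 “é”:

void UTF8_DecodeExample()
{
  const byte utf8_e_acute[] = { 0xC3, 0xA9 };

  unsigned len = UTF8_CalcLen(utf8_e_acute[0]);  // yields 2
  unsigned ch  = UTF8_Decode(utf8_e_acute);      // yields 0xE9
}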
Bear in mind that this code is not actually Unicode conformant: it shouldn’t accept 5- and 6-byte characters, and it should filter out overlong sequences (characters which occupy N bytes but could have been encoded in fewer).
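Just to show what a closer-to-conformant decoder would need, here is a sketch of mine built on top of the helpers above: it rejects 5- and 6-byte sequences, values beyond 0x10FFFF, and overlong encodings. A full validator would also check that every trailing byte really has its high two bits set to binary 10.

bool UTF8_DecodeStrict(const byte *p, unsigned *puChar)
{
  unsigned len = UTF8_CalcLen(*p);

  if (len > 4)
    return false;               // 5/6-byte sequences are not legal Unicode

  unsigned ch = UTF8_Decode(p);

  if (ch > 0x10FFFFu)
    return false;               // beyond the Unicode range

  // Minimum code point that genuinely requires each sequence length
  static const unsigned auMin[5] = { 0, 0, 0x80u, 0x800u, 0x10000u };
  if (ch < auMin[len])
    return false;               // overlong sequence

  *puChar = ch;
  return true;
}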
So, now you see how the code gets more complex for UTF-8, and the innocent loop now involves a lot of operations.
We’ve now seen the complexities of dealing with different encodings: one byte per character, Windows “Unicode” with possible “surrogates”, UTF-8 with all its varying length management needs. We haven’t even looked at DBCS (double-byte character sets), the encodings in which Japanese, Korean, and the different Chinese scripts are commonly stored, and in which seeking backwards in text is all but impossible, because lead bytes and trail bytes are not distinguishable by value. And then there are all the other Unicode encoding variants, including little-endian and big-endian versions, etc…
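For the curious, here is roughly what forward scanning looks like under a Windows DBCS code page, using the Win32 IsDBCSLeadByteEx call (a sketch of mine, not NGEDIT code); notice there is no equivalent way to step backwards, which is exactly the problem mentioned above:

#include <windows.h>

// Returns the offset of the next character: a lead byte is followed by
// exactly one trail byte, anything else is a single-byte character.
unsigned DBCS_NextChar(const unsigned char *p, unsigned uOff, unsigned uLen, UINT uCodePage)
{
  if (IsDBCSLeadByteEx(uCodePage, p[uOff]) && uOff + 1 < uLen)
    return uOff + 2;

  return uOff + 1;
}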
How can one choose to implement support for all of these in C++?
One possibility is to write a version of each text-management function such as FindWhiteSpaceRight for each supported encoding.
Just kidding 🙂
What we really want is to write code almost as simple as the one-byte-per-character version above, which will work for all encodings.
As a common C++ idiom, we could design a base class with virtual methods representing the required functions. Methods could be “unsigned GetChar()”, “AdvancePointer()”, etc… and each derived class would implement its own version of each.
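Something like this, say (a hypothetical sketch just to make the idea concrete, reusing the byte typedef and the UTF-8 helpers from the listings above; the method names are the ones mentioned, with semantics I’m assuming):

class ITextEncoding
{
public:
  virtual ~ITextEncoding() {}

  virtual unsigned GetChar(const byte *p) const = 0;         // decode the character at p
  virtual unsigned AdvancePointer(const byte *p) const = 0;  // bytes to skip to the next one
};

class CEncSingleByte : public ITextEncoding
{
public:
  unsigned GetChar(const byte *p) const { return *p; }
  unsigned AdvancePointer(const byte *p) const { return 1; }
};

class CEncUTF8 : public ITextEncoding
{
public:
  unsigned GetChar(const byte *p) const { return UTF8_Decode(p); }
  unsigned AdvancePointer(const byte *p) const { return UTF8_CalcLen(*p); }
};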
This would work. Indeed. But we would be paying a high price.
For one, the price of a virtual function call for each simple operation. The one-byte-per-char version is not only simple to read: the code it generates is really good, because the CPU is very good at handling plain bytes.
But the second, very important one is that the virtual functions would need to receive and return the most general character representation possible, that is, 32-bit-per-char UCS-4. And that would mean converting even for really simple operations.
This is especially important for one reason: I wanted NGEDIT to handle all encoding types, to handle them natively, but most day-to-day editing happens with one-byte-per-char encodings. Burdening the code which is run 90% of the time in a large part of the world (at least, all of Europe and the US) with a high performance impediment seems a bit absurd, and I didn’t want to do it.
The goal is code that is simple to write and read, code which can be made to work with all encoding types, but also code that will become the simple byte-handling code that we had for the first case when we are actually dealing with one-byte-per-char encodings. And, sure, we don’t want to write gobs of code.
The solution? Templates, of course. That will be the topic of the last article in this mini-series, together with some other genuinely important reasons to use such a solution (hint: tab handling code is often a waste!)
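Just to hint at the general shape before that article (this is my own guess at an illustration, not necessarily the exact design NGEDIT uses): the encoding can become a template parameter with small, non-virtual, inlineable functions, so that the one-byte instantiation compiles down to the same simple byte-handling code we started with.

struct CEncodingSingleByte
{
  static unsigned CharLen(const byte *p) { return 1; }
  static unsigned GetChar(const byte *p) { return *p; }
};

struct CEncodingUTF8
{
  static unsigned CharLen(const byte *p) { return UTF8_CalcLen(*p); }
  static unsigned GetChar(const byte *p) { return UTF8_Decode(p); }
};

template <class TEnc>
unsigned FindWhiteSpaceRightT( const byte *psz, unsigned uLen, unsigned uOffStart )
{
  unsigned u = uOffStart;

  while (u+1 < uLen)
  {
    unsigned len = TEnc::CharLen(psz + u);

    if (u + len > uLen)
      break;                                   // incomplete character at the end

    if (IsWhiteSpace(TEnc::GetChar(psz + u)))
      return u + len;

    u += len;
  }

  return uOffStart;
}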