Unicode, text management, and C++ techniques (and III)
If you are a C++ programmer, I recommend you read this article – the techniques discussed are quite interesting and could be useful for your own non-text-related projects.
In the last article in the series, we saw what UTF-8 is about. And I promised to cover some interesting techniques in order to handle this maze of encodings. Let’s get to it.
In order to abstract out the differences between text encodings, I decided to implement the concept of a codec: a text coding and decoding module that can convert between a particular encoding and some general interface that the main editor uses.
As we saw in the last article, using a base class with virtual functions has two important drawbacks: the first is that all access goes through costly virtual function calls (at least, quite costly compared to raw byte-sized character access), and the second is that it most probably forces us to use full Unicode for the arguments and return values.
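To make that concrete, the rejected approach would look something like the following (a hypothetical sketch, not code from the editor): every character is fetched through a virtual call, and the interface has little choice but to traffic in full Unicode code points.

typedef unsigned int TUCS4Char;   // assumed: a plain 32-bit stand-in for the
                                  // editor's 4-byte Unicode code point type

class ILineDecoder
{
public:
    virtual ~ILineDecoder() {}
    virtual bool      IsAtEnd() const = 0;
    virtual TUCS4Char GetChar() const = 0;  // must return full Unicode
    virtual void      Advance()       = 0;  // one virtual call per character
};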
So, I decided to implement the codec as a class which will be used as a template argument. There is one such class for each supported encoding (TCodecNativeSimple, TCodecForeignSimple, TCodecUTF8, TCodecUTF16BE, TCodecUTF16LE, TCodecDBCS, and TCodecGeneral). Each such class is not meant to be instantiated, and it doubles as an access mechanism to codec-defined specifics – that is, it only has public members, and it doesn’t have any member variables with the exception of static const members (the C++ idiom for constant class-wide values).
For example, each of these classes contains a TLineDecoder class. So, we can instantiate a TCodec::TLineDecoder object in order to walk a line of encoded text char by char and do whatever specific processing we need.
But the greatest strength of this technique comes from defining types within each codec. Each codec additionally defines a TChar type, which represents the preferred type for manipulating text in that encoding.
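Putting these pieces together, every codec class ends up exposing roughly the same shape. The following is an interface sketch of the UTF-8 codec, with the member list inferred from how the codecs are used later in this article rather than copied from the editor’s source:

typedef unsigned char byte;   // assumed typedef, matching the code below

class TCodecUTF8
{
public:
    typedef TUCS4Char TChar;            // preferred type for handling chars

    class TLineDecoder                  // walks one line of encoded text
    {
    public:
        TLineDecoder(const byte *psz, unsigned uLen, unsigned uOffStart);
        bool     IsAtEnd()   const;
        bool     IsInvalid() const;     // malformed sequence under the cursor?
        TChar    GetChar()   const;     // decode the char under the cursor
        unsigned GetCurPos() const;     // current byte offset within the line
        void     Advance();             // move past the current char
    };

    static bool IsWhiteSpace(TChar ch); // codec-specific classification

    // ... plus static const members describing the encoding (more below)
};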
For example, the native-simple codec is used for the platform-native text encoding, but only when that encoding uses a single byte per char (e.g., US and European native codepages qualify, whereas Japanese and Korean native text is not handled by this codec). Text in this codec doesn’t require converting the characters input by the user, and it can be output directly via native Windows API calls. And the TChar type for this codec is a simple byte.
As another example, the foreign-simple codec is used for one-byte-per-char text encodings which are foreign to the current platform (for example, the US Windows codepage on a machine using another codepage as native, Mac text on a PC, or any of the ISO encodings such as Latin-1, etc…). Given that these characters cannot be reliably represented as single native bytes, the TChar type in this codec maps to TUCS4Char (a full 4-byte Unicode codepoint).
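Schematically, and with the table name and details being purely illustrative rather than taken from the editor’s code, the foreign-simple decoder could promote every input byte to a code point like this:

class TCodecForeignSimple
{
public:
    typedef TUCS4Char TChar;                 // full code points, not raw bytes

    class TLineDecoder
    {
    public:
        // ... same walking machinery as the other one-byte-per-char codecs ...
        TChar GetChar() const { return s_auToUCS4[m_psz[m_uPos]]; }
    private:
        const byte *m_psz;
        unsigned    m_uLen, m_uPos;
    };

    static const TUCS4Char s_auToUCS4[256];  // per-encoding conversion table
};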
This mechanism lets us model concepts at as many levels as we want, both high- and low-level, so that every part of the main application gets the level of access it needs without taking a performance hit. I really hate it when a concept that makes development much more comfortable requires a significant compromise in runtime performance.
Apart from operation classes (such as TLineDecoder) and basic types (such as TChar), the codec class also features some static const members that represent features of the encoding. For example, all codecs have a c_bFixedCharWidth boolean member which indicates exactly that: whether encoded chars are all of the same byte length.
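Continuing the sketches above, the declarations look something like this (in-class initialization of a static const integral member is the idiom mentioned earlier):

class TCodecNativeSimple
{
public:
    // ...
    static const bool c_bFixedCharWidth = true;   // always one byte per char
};

class TCodecUTF8
{
public:
    // ...
    static const bool c_bFixedCharWidth = false;  // one to four bytes per char
};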
As an example of how this works, the function to find whitespace which we have used as an example may be written like this:
template<class TCODEC>
unsigned FindWhiteSpaceRight( const byte *psz, unsigned uLen, unsigned uOffStart )
{
    typename TCODEC::TLineDecoder ld(psz, uLen, uOffStart);

    while (!ld.IsAtEnd())
    {
        if (ld.IsInvalid())
        {
            ; // Handle in some way (skip it, flag an error...)
        }
        else
        {
            typename TCODEC::TChar ch = ld.GetChar();
            if (TCODEC::IsWhiteSpace(ch))
                return ld.GetCurPos();
        }
        ld.Advance();
    }

    return uOffStart;
}
Let’s see some aspects of this code. For one, you can see that we are indeed checking for invalid characters. For encodings that may contain invalid encoded characters, this function will check validity. But for encodings that can never encounter an invalid encoded character, the IsInvalid() call will be hardwired to return ‘false’, and so the compiler will optimize that part of the loop away! The same optimization happens for a function such as Advance(), which amounts to just a pointer increment for the most common one-byte-per-char encodings, while the very same code we have written compiles into all the complex mechanics involved in decoding UTF-8.
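To see why, here is what the native-simple codec’s decoder can look like (again a sketch, not the editor’s actual code): with every member a trivial inline function, the instantiation of FindWhiteSpaceRight for this codec compiles down to a plain byte-scanning loop with no validity check left in it.

class TCodecNativeSimple
{
public:
    typedef byte TChar;

    class TLineDecoder
    {
    public:
        TLineDecoder(const byte *psz, unsigned uLen, unsigned uOffStart)
            : m_psz(psz), m_uLen(uLen), m_uPos(uOffStart) {}
        bool     IsAtEnd()   const { return m_uPos >= m_uLen; }
        bool     IsInvalid() const { return false; }   // hardwired: the whole
                                                       // check vanishes in Release
        TChar    GetChar()   const { return m_psz[m_uPos]; }
        unsigned GetCurPos() const { return m_uPos; }
        void     Advance()         { ++m_uPos; }       // just an increment
    private:
        const byte *m_psz;
        unsigned    m_uLen, m_uPos;
    };

    static bool IsWhiteSpace(TChar ch)
        { return ch == ' ' || ch == '\t'; }            // simplistic, illustrative
};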
Code that checks TCODEC::c_bFixedCharWidth with a seemingly runtime ‘if’ will also be evaluated at compile time and optimized out in Release builds, as the compiler is smart enough to see that it is actually a compile-time constant.
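As a purely illustrative example (neither CountChars nor the c_uCharWidth member used in the fixed-width branch is taken from the editor’s code; only c_bFixedCharWidth is an actual codec member), a helper like this keeps just one of the two branches in each instantiation:

template<class TCODEC>
unsigned CountChars( const byte *psz, unsigned uLen )
{
    if (TCODEC::c_bFixedCharWidth)                 // compile-time constant
        return uLen / TCODEC::c_uCharWidth;        // hypothetical width member

    unsigned uCount = 0;
    typename TCODEC::TLineDecoder ld(psz, uLen, 0);
    while (!ld.IsAtEnd())
    {
        ++uCount;
        ld.Advance();                              // full decoding loop only for
    }                                              // variable-width encodings
    return uCount;
}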
And, as a final remark, we talked about TAB character decoding at the end of the last article. It turns out that having TAB characters in a file involves quite a lot of complexity, as the offset within a line loses any correlation with the graphic column. But this is not the case for files with no TAB characters, and paying the TAB-handling cost for them wastes performance. One way to handle this seamlessly: abstract TAB handling behind a scheme such as the one above (I call TABBER the concept equivalent to a CODEC for TAB decoding), and choose between two TABBERs depending on whether the file contains TABs when it is loaded. You can always switch to a non-null TABBER if the user inserts a TAB character. For people like me, who prefer not to use TAB characters at all, this is a win in most if not all editing sessions.
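As a sketch of the idea (the names, the interface, and the TAB width of 8 are all illustrative, not the editor’s actual classes):

class TTabberNull
{
public:
    static unsigned ColumnOfChar(const TUCS4Char * /*pch*/, unsigned uChar)
        { return uChar; }                       // no TABs: column == char index
};

class TTabberReal
{
public:
    static unsigned ColumnOfChar(const TUCS4Char *pch, unsigned uChar)
    {
        unsigned uCol = 0;
        for (unsigned u = 0; u < uChar; ++u)
            uCol = (pch[u] == '\t') ? (uCol / 8 + 1) * 8   // next TAB stop
                                    : uCol + 1;
        return uCol;
    }
};

Any function templated on the TABBER gets instantiated for both, and the loading code simply picks which instantiation to call based on whether the file contains TABs; for the null TABBER the column computation collapses to the char index.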