Archive for October, 2005

Long time no see

Friday, October 28th, 2005

I’ve been swamped with work in the past few days, so I didn’t have any time to blog. But just yesterday I made available the first alpha version of ViEmu 1.3, which provides a single star feature: regular expressions and ex command line emulation. This means my regular expression engine is working nicely (on top of my encoding-independent C++ string template class!). And that ViEmu is starting to bring the full power of vi/vim to Visual Studio. I hope to iron out the remaining bugs and release it to the public around next week.

I think I will have be having more time to blog starting next week, and I have a line up of stuff I want to blog about. Part of it thanks to Baruch Even who’s started the very nice Planet uISV blog aggregator (be sure to check it out if you’re interested in other small software shops and start-ups!).

Visual Studio 2005 has just been released (finally!) and I’m downloading it through msdn subscriptions, but I already have news from some customer that the build I provide for VS2005-beta-2 seems to work nicely with it. I will finally prepare a version of ViEmu that installs dually to both VS.NET 2003 and VS2005. Some customers were already using ViEmu with VS2005 beta versions quite happily, but I had some “false positives” on ViEmu problems where VS2005 was actually the culprit – I hope everything of importance will be fixed now and I’ll be able to release ViEmu for VS2005 “officially”.

To finish this post, I’ll extract the information on what ViEmu 1.3 brings – be warned, it is for heavy vi users and regex experts, so skip it as soon as you have a doubt whether you’re interested in it.

Summary of what's contained in ViEmu 1.3-a-1:

  - Regular expression support for '/' and '?' searches
  - Command line editing, with command history
    (use the cursor arrows, BACKSPACE and DEL)
  - '< , '> marks for the last active visual selection
  - gv normal mode command to restore the last visual selection
  - The following ex commands:
    - :set - basic implementation allowing [no]ig[norecase]/[no]ic,
      [no]sm[artcase]/[no]sc, and [no]ma[gic]
    - :d   - :[range]d[elete] [x] [count] to delete (x is the register)
    - :y   - :[range]y[ank] [x] [count] to yank (x is the register)
    - :j   - :[range]j[oin][!] to join the lines in the range,
               or default to the given line (or cursor line) and the next
    - :pu  - :[range]pu[t][!] [dest] to paste after (!=before) the given
    - :co  - :[range]co[py] [dest] to copy the lines in range to the
               destination address (:t is a synonim for this)
    - :m   - :[range]m[ove] [dest] to move the lines in range to the
               destination address
    - :p   - :[range]p[rint] [count] to print the lines (send them to the
               output window) (:P is a synonim for this)
    - :nu  - :[range]nu[mber] [count] to print the lines (send them to the
               output window), w/line number (:# is a synonim for this)
    - :s   - :[range]s[ubstitute]/re/sub/[g] to substitute matches for the
               given regex with sub (do not give 'g' for only 1st match on
               each line)
    - :g   - :[range]g[lobal][!]/re/cmd to run ':cmd' on all lines matching
               the given regex (! = *not* matching)
    - :v   - :[range]v[global]/re/cmd to run ':cmd' on all lines *not*
               matching the given regex

You can now use :g/^/m0 to invert the file, :g/^$/d to remove
all empty lines, :%s/\s\+$// to remove all trailing whitespace,
and use many of your favorite vi/vim tricks.

In implementing the regular expression engine, I've gone through vim
documentation and implemented everything there. There are a few things
not implemented yet - I plan to add them later on. This is a summary of the
implemented features (for now, you can look at vim's documentation for

 - Regular matching characters
 - '.' for any character
 - Sets (full vim syntax): [abc], [^1-9a-z], [ab[:digit:]], ...
     (including '\_[' to include newline)
 - Standard repetitions: * for 0-or-more, \+ for 1-or-more, \= or \? for
     0 or 1
 - Counted repetitions: {1,2} for 1-to-2 repetitions, {1,} for 1-to-any,
     {,5} for 5 or less, {-1,} for non-greedy versions
 - Branches: foo\|bar matches either "foo" or "bar"
 - Concats: foobar\|.. matches the first two characters where 'foobar'
 - Subexpressions: \( and \) to delimit them ('\%(' to make them
 - ^ and $ for start- and end-of-line. (See the note on the limitation
 - \_^ and \_$ for s-o-l and eol anywhere in the pattern
 - \_. for any character including newline
 - \zs and \ze to mark the match boundaries
 - \< and \> for beg and end of word
 - Character classes: \s for space, \d for digit, \S for non-space, etc...
     and '\_x' for the '\x' class plus newline (all of them work)
 - Special chars: \n for newline, \e, \t, \r, \b
 - \1..\9 repeat matches
 - Regex control: \c to forece ignore chase, \C to force check case, \m for
     magic, \M for nomagic, \v for verymagic, \V for verynomagic

Full [very][no]magic is supported.

These are the vim regular expression features not yet implemented by ViEmu:

 - ~ to match last substitute string
 - \@>, \@=, \@!, \@< = and |@<! zero-width and dependent matches/non-matches
 - \%^ (beg-of-file), \%$ (end-of-file), \%# (cursor-pos), \%23l
     (specific line), \%23c (col) and \%23v (vcol)
 - optional tail "wh\%[atever]"
 - *NO PROTECTION* for repetitions of possibly zero-width matches, be
     careful! \zs* or \(a*\)* MAY HANG VIEMU!!!
 - ^ and $ are only detected as special at the very beginning and very end
     of the regular expression string, use \_^ or \_$
 - \Z (ignore differences in unicode combining chars)

Other limitations:

 - The :s replacement string does not yet understand the full vi/vim options,
and cannot insert multi-line text. Only & and \1..\9 are recognized as special,
and if some of them matched a multi-line range, only the regular characters
will be inserted. You can't insert new line breaks by using \r either.

 - After-regular-expression displacement strings are not implemented
     ('/abc/+1' to go to the line after the match).

 - Ex-ranges accept everything (%, *, ., $, marks, searches) but not
     references to previous searches (\/, \?, \&) or +n/-n arithmetics.

 - The command line editing at the status bar looks a bit crude, with
     that improvised cursor, but it should make the ex emulation very

 - :p and :# output is sent to a "ViEmu" pane on the output window.

Beautiful regular expressions code

Sunday, October 9th, 2005

My regular expression engine is starting to work. I can already compile and match basic regular expressions, and the framework for the most complex features is already there, even if not completely implemented yet. The first use of the engine will be for ViEmu (I might go straight from 1.2 to 1.5, as I feel regex and ex command line support take it to the next level, and Firefox is already going to do a 1.0->1.5 jump so I shouldn’t be less).

It’s probably the piece of code with which I’ve been most happy in a long time. The reason? Not that it is complex and it was hard to write (which it was), as several other pieces have been more complex, and many others have taken a lot more work. The actual reason is that it is free of tight bindings to anything else: it uses the generic string template framework I talked about, and so it can handle any variation of string encoding, format, storage, access-mechanism… whatever – and not losing an ounce of efficiency from writing code that uses straight ‘char’ or ‘wchar_t’.

For one, I feel I will never need to write another regular expression engine, and that is a good feeling.

But, most important, when I look at the code, I get a feeling of beauty. And it is a feeling that I miss most of the time I write code. I don’t know how you feel about it, but it actually hurts me when I write code that is too tightly bound to some specific circumstance. Reusability is great, but the feeling of being right is a separate issue and that’s what makes me happiest.

I’m going to try to post the declaration here so that you can have a look at it. Let’s see if the beauty survives the adjustment to the blog’s width:

// Regular expressions support for NGEDIT and ViEmu, templatized to
//  support text and input in any encoding (ViEmu only uses wchar_t)

#ifndef _NGREGEXP_H_
#define _NGREGEXP_H_

#include "ngbase.h"
#include "vector.h"

namespace nglib

template<class TREADSTR>
class TRegExp
      typedef typename TREADSTR::TREF           TSREF;
      typedef typename TREADSTR::TREF::iterator TSITER;
      typedef typename TREADSTR::TPOS           TSPOS;
      typedef typename TREADSTR::TCACHE         TSCACHE;
      typedef typename TREADSTR::TCHAR          TSCHAR;

    enum ECompileError

    struct TCompileError
      ECompileError type;
      TSPOS         pos;

    struct TMatchResult
      TSPOS posStart, posAfterEnd;
      struct TSubMatch
        TSPOS posStart, posAfterEnd;
      TVector<TSubMatch> vSubmatches;

    TRegExp() { m_ok = false; }
    //~TRegExp() { }

    TRet Compile (
      TSREF rsRegExp,
      TCompileError *pCompileError = NULL,
      TSCHAR chTerm = TSCHAR::zero

    // Methods for simple string matching
    bool TryToMatch (
      TSREF rsInput,
      TMatchResult *pMatchResult,
      bool bBOL = true,
      bool bEOL = true

    bool Contains (
      TSREF rsInput,
      TMatchResult *pMatchResult,
      bool bBOL = true,
      bool bEOL = true

    // Methods for possibly multi-line regexps
    template <class TTEXTBUF>
    bool TryToMatch (
      TTEXTBUF txtBuf,
      unsigned uStartLine,
      TSPOS pos,
      TMatchResult *pMatchResult
    ); // At a certain line pos

    template <class TTEXTBUF>
    bool Contains (
      TTEXTBUF txtBuf,
      unsigned uStartLine,
      TSPOS pos,
      TMatchResult *pMatchResult
    ); // Starting anywhere in that line


    enum ENodeType
      NT_MANDATORY_JUMP,      // + 2 bytes signed offset
      NT_OPTIONAL_JUMP,       // + 2 bytes signed offset
      NT_OPTIONAL_JUMP_PREF,  // + 2 bytes signed offset
            // jump with preference (to control greediness)

      NT_MATCH,               // + 1 byte match_type + details
      NT_OPEN_SUBEXPR,        // + 1 byte subexpr #
      NT_CLOSE_SUBEXPR,       // + 1 byte subexpr #

      NT_SAVE_IPOS,    // + 1 byte pos-reg where to save
      NT_JUMPTO_IPOS,  // + 1 byte temp to read
      NT_SET_TEMP,     // + 1 byte temp where to save
      NT_INC_TEMP,     // + 1 byte temp to read

    enum EMatchType
      MT_CHAR,          // any literal char (+ ENC_CHAR)
      MT_DOT,           // .                        
      MT_BOL,           // ^
      MT_EOL,           // $
      MT_NEXTLINE,      // \n
      MT_SET,           // [abc] or [a-zA-Z]
                        //+ (byte)num_chars
                        //+ (byte)num_ranges
                        //+ nc * ENC_CHAR
                        //+ 2 * nr * ENC_CHAR
      MT_NEGSET,        // [^abc] or [^a-zA-Z]  same
      MT_WHITESPACE,    // \s
      MT_WORD,          // \w
      MT_NONWORD        // \W

    // Compiler and matcher classes, not accessible externally
    class TCompiler;
    class TMatcher;

    bool          m_ok;
    TVector<byte> m_vbCompiledExpr;


} // end namespace

#include "ngregexp.inl"


Main things: the class is in a namespace that I use for all my common code, you can see how the interface uses the string’s specific types, the TCompiler and TMatcher classes are just declared and are unnecessary for the user of the class, and the only declarations – apart from the main interface – are for the node types, which need to be accessible both by the TCompiler and the TMatcher.

Even if C++ templates force you to include the definition of the members at the point of use, I usually separate template-based code in a header file with the declarations and an “.inl” (inline) file with the definitions, which helps keep code sanity.

The only actual types are bool (which is fair enough to use without loss of generality), byte, which allows generality through the TSCHAR::EncodeToBytes() and TSCHAR::DecodeFromBytes(), and the lonely unsigned to refer to a line number within a multi-line text buffer. I will probably get rid of that one by using an abstract TIXLINE (line-index type) in TTEXTBUF.

The only shortcut I took from my original idea is that both the regex definition string and the target text have to be of the same type, but templatizing on two string types seemed a bit overkill and I can always easily convert at the use point via special-purpose inter-string-type conversion inline functions (possibly even template-based to avoid too much rewriting).

Now that I think of it, if the matched target is a sparse or arbitrarily ordered disk-based text buffer, abstracted away through a very smart TTEXTBUF class, I will probably have to allow specifying the regex itself with another type.

It’s taken a long time to develop this C++ style, but I’m starting to feel really happy with how my code is looking – for the first time after over 10 years of C++ programming!

I haven’t posted much on the blog lately, as I’ve been focusing in development. Bringing ViEmu to maturity is taking quite some work, although the result is satisfying – and I’d like to blog more in the future, but we’ll have to see if development leaves time and energy for this…