Beautiful regular expressions code
My regular expression engine is starting to work. I can already compile and match basic regular expressions, and the framework for the most complex features is already there, even if not completely implemented yet. The first use of the engine will be for ViEmu (I might go straight from 1.2 to 1.5, as I feel regex and ex command line support take it to the next level, and Firefox is already going to do a 1.0->1.5 jump so I shouldn’t be less).
It’s probably the piece of code with which I’ve been most happy in a long time. The reason? Not that it is complex and it was hard to write (which it was), as several other pieces have been more complex, and many others have taken a lot more work. The actual reason is that it is free of tight bindings to anything else: it uses the generic string template framework I talked about, and so it can handle any variation of string encoding, format, storage, access-mechanism… whatever – and not losing an ounce of efficiency from writing code that uses straight ‘char’ or ‘wchar_t’.
For one, I feel I will never need to write another regular expression engine, and that is a good feeling.
But, most important, when I look at the code, I get a feeling of beauty. And it is a feeling that I miss most of the time I write code. I don’t know how you feel about it, but it actually hurts me when I write code that is too tightly bound to some specific circumstance. Reusability is great, but the feeling of being right is a separate issue and that’s what makes me happiest.
I’m going to try to post the declaration here so that you can have a look at it. Let’s see if the beauty survives the adjustment to the blog’s width:
//------------------------------------------------------------------------- // // Regular expressions support for NGEDIT and ViEmu, templatized to // support text and input in any encoding (ViEmu only uses wchar_t) // //------------------------------------------------------------------------- #ifndef _NGREGEXP_H_ #define _NGREGEXP_H_ #include "ngbase.h" #include "vector.h" namespace nglib { template<class TREADSTR> class TRegExp { public: typedef typename TREADSTR::TREF TSREF; typedef typename TREADSTR::TREF::iterator TSITER; typedef typename TREADSTR::TPOS TSPOS; typedef typename TREADSTR::TCACHE TSCACHE; typedef typename TREADSTR::TCHAR TSCHAR; //-------------------------------------------- enum ECompileError { E_OK, E_SYNTAX_ERROR, E_NOT_IMPL, E_NO_MEM, E_UNCLOSED_SET, E_UNCLOSED_SUBEXPR, }; struct TCompileError { ECompileError type; TSPOS pos; }; struct TMatchResult { TSPOS posStart, posAfterEnd; struct TSubMatch { TSPOS posStart, posAfterEnd; }; TVector<TSubMatch> vSubmatches; }; //-------------------------------------------- TRegExp() { m_ok = false; } //~TRegExp() { } TRet Compile ( TSREF rsRegExp, TCompileError *pCompileError = NULL, TSCHAR chTerm = TSCHAR::zero ); // Methods for simple string matching bool TryToMatch ( TSREF rsInput, TMatchResult *pMatchResult, bool bBOL = true, bool bEOL = true ); bool Contains ( TSREF rsInput, TMatchResult *pMatchResult, bool bBOL = true, bool bEOL = true ); // Methods for possibly multi-line regexps template <class TTEXTBUF> bool TryToMatch ( TTEXTBUF txtBuf, unsigned uStartLine, TSPOS pos, TMatchResult *pMatchResult ); // At a certain line pos template <class TTEXTBUF> bool Contains ( TTEXTBUF txtBuf, unsigned uStartLine, TSPOS pos, TMatchResult *pMatchResult ); // Starting anywhere in that line //-------------------------------------------- private: //-------------------------------------------- enum ENodeType { NT_MANDATORY_JUMP, // + 2 bytes signed offset NT_OPTIONAL_JUMP, // + 2 bytes signed offset NT_OPTIONAL_JUMP_PREF, // + 2 bytes signed offset // jump with preference (to control greediness) NT_MATCH, // + 1 byte match_type + details NT_ACCEPT, NT_OPEN_SUBEXPR, // + 1 byte subexpr # NT_CLOSE_SUBEXPR, // + 1 byte subexpr # NT_SAVE_IPOS, // + 1 byte pos-reg where to save NT_JUMPTO_IPOS, // + 1 byte temp to read NT_SET_TEMP, // + 1 byte temp where to save NT_INC_TEMP, // + 1 byte temp to read }; //-------------------------------------------- enum EMatchType { MT_CHAR, // any literal char (+ ENC_CHAR) MT_DOT, // . MT_BOL, // ^ MT_EOL, // $ MT_NEXTLINE, // \n MT_SET, // [abc] or [a-zA-Z] //+ (byte)num_chars //+ (byte)num_ranges //+ nc * ENC_CHAR //+ 2 * nr * ENC_CHAR MT_NEGSET, // [^abc] or [^a-zA-Z] same MT_WHITESPACE, // \s MT_NONWHITESPACE, // \S MT_WORD, // \w MT_NONWORD // \W }; //-------------------------------------------- // Compiler and matcher classes, not accessible externally class TCompiler; class TMatcher; bool m_ok; TVector<byte> m_vbCompiledExpr; }; } // end namespace #include "ngregexp.inl" #endif
Main things: the class is in a namespace that I use for all my common code, you can see how the interface uses the string’s specific types, the TCompiler and TMatcher classes are just declared and are unnecessary for the user of the class, and the only declarations – apart from the main interface – are for the node types, which need to be accessible both by the TCompiler and the TMatcher.
Even if C++ templates force you to include the definition of the members at the point of use, I usually separate template-based code in a header file with the declarations and an “.inl” (inline) file with the definitions, which helps keep code sanity.
The only actual types are bool (which is fair enough to use without loss of generality), byte, which allows generality through the TSCHAR::EncodeToBytes() and TSCHAR::DecodeFromBytes(), and the lonely unsigned to refer to a line number within a multi-line text buffer. I will probably get rid of that one by using an abstract TIXLINE (line-index type) in TTEXTBUF.
The only shortcut I took from my original idea is that both the regex definition string and the target text have to be of the same type, but templatizing on two string types seemed a bit overkill and I can always easily convert at the use point via special-purpose inter-string-type conversion inline functions (possibly even template-based to avoid too much rewriting).
Now that I think of it, if the matched target is a sparse or arbitrarily ordered disk-based text buffer, abstracted away through a very smart TTEXTBUF class, I will probably have to allow specifying the regex itself with another type.
It’s taken a long time to develop this C++ style, but I’m starting to feel really happy with how my code is looking – for the first time after over 10 years of C++ programming!
I haven’t posted much on the blog lately, as I’ve been focusing in development. Bringing ViEmu to maturity is taking quite some work, although the result is satisfying – and I’d like to blog more in the future, but we’ll have to see if development leaves time and energy for this…
October 9th, 2005 at 11:28 pm
wow. your throwaway comment stuck a chord. my obsession, sometimes excessive, about writing perfect code – even when most of it will never be seen by another – is because i want to be Right!
CT.
October 9th, 2005 at 11:34 pm
BTW. I know full reg-exp support ala Perl/Java may be a way off if ever, but please make sure you support the non-greedy ‘?’ quantifier, as in “B.*?C+” matching (B)(AB)(CC).
This makes quick and dirty text file hacking with reg-exps significantly easier as you can filter out the noise without worrying about it overlapping your targets.
October 9th, 2005 at 11:46 pm
Glad to hear someone else feels the pain! Well, actually I shouldn’t be happy for anyone’s pain, but I think it opens the way to innovation.
Regarding the non-greedy multi operators, you can see on the listing above NT_OPTIONAL_JUMP_PREF which is the opcode emitted when non-greedy operators are compiled – they are actually already working!
It’s not complete, but only partial implementation stuff is missing – the whole compiler and matcher framework is there and supports everything necessary. It took a lot of effort to get the template-based design right.
What do you mean by “full reg-ex support”? I’m implementing full vi/vim reg-ex support, but I’m not aware of what perl/Java may do that vim doesn’t (although I know some of the syntax details differ).
It would be great to hear what things are needed at this stage.
Thanks for the comments!
October 10th, 2005 at 1:28 am
Java reg-exp capabilities are summed up here :
http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Pattern.html
The main advantage java has has over perl is the posessive quantifier ‘+’.
So whereas J+JF would match JJJJF , J++JF would not. I havent found this overly useful.
When I meant full support, I guess I was thinking about lookaheads and lookbehinds. As in “Easter (?!eggs)\w+” matching “Easter Monday” or “Easter afdasdsadfa” but not “Easter eggs”. This can be damn handy but in the context of an editor you can always do a second filtering operation.
CT
October 10th, 2005 at 2:47 pm
I didn’t know about the possesive quantifier – it seems quite doable. If I understand your example well, I would call it the super-greedy multi – it matches everything it can, but not happy with that, it won’t accept nothing less that everything “the full cake”. I’ll still have to think about a syntax with which to implement it in vim. I think two ‘+’ signs is ok (or two ‘*’ or two {2,3}’s).
Lookahead support is there in my regex engine – the node types NT_SAVE_IPOS and NT_JUMPTO_IPOS are there for that (as well as for \& concats). vim also supports it, although the syntax is different from perl (I think it is \@= to make a zero-width match and some other operator to make zero-width non-match).
Not all of the stuff is working as of yet, but I hope to have it fully working by the end of the week. Only implementation details are missing. I designed the engine by following vim’s dcumentation for regular expressions, and with the determination to make an engine powerful enough to handle all of it.
October 28th, 2005 at 9:12 pm
[…] The growing pains of NGEDIT A blog on the development of the NGEDIT text editor « Beautiful regular expressions code […]
November 26th, 2005 at 5:16 pm
[…] Well, the thing is that my tendency to drift off, my ambition, and my yearning for beautiful code kicked in. Instead of a simple solution, I found myself implementing the “ultimate” command line (of course). It’s already pretty much fully architected, and about half-working (although opening files from the command line ended up being just a small part of the available functionality). As I did this, I also started refactoring the part of the code that handles file loading into using my C++ string class that doesn’t suck, which is great, but it’s quite an effort by itself. Meanwhile, I found myself whining that I didn’t want to have all that code written using the non-portable Windows API (as a shortcut I took before summer, NGEDIT code is uglily using the Windows API directly in way too many places), so I started implementing an OS-independence layer (I know, I know, these things are better done from day 1, but you sometimes have to take shortcuts and that was one of many cases). Of course, with the OS-independence layer using said generic string class for the interface. And establishing a super-flexible application framework for NGEDIT, which was a bit cluttered to my taste. And sure, I started trying to establish the ultimate error-handling policy, which took me to posting about and researching C++ exceptions and some other fundamental problems of computing… […]