Readable Regular Expressions Revisited

Many, many years ago (internet time), I proposed a fluent interface for composing regular expressions. People either loved the idea or hated it (or thought it was just OK). The intention was to tackle the opacity of regular expressions embedded in otherwise familiar C# source code.

I’ll confess I never used that approach in production code. It started as a thought experiment, and that’s as far as it went (for me, at least). However, I did pick up a great technique from the comments. William “OmegaMan” Wegerson suggested using RegexOptions.IgnorePatternWhitespace along with liberal use of inline comments. Here is a recent example from the fubumvc source:

const string propertyFindingPattern = @"
\{              # start variable
(?<varname>\w+) # capture 1 or more word characters as the variable name
(:              # optional section beginning with a colon
(?<default>\w+) # capture 1 or more word characters as the default value
)?              # end optional section
\}              # end variable";

Notice that the comments violate one of the main rules of good commenting: do not restate what the code says. Usually, someone reading your code is literate enough in the programming language to figure out “what” the code does; it just isn’t always clear “why”. But when it comes to regular expressions, I would guess that a majority of C# programmers need to look at a regex reference every time they try to decipher a pattern. Do them a favor and document what each part of the pattern does while you are writing it, since you’re probably looking at the reference already anyway. This should make it much easier for someone to follow (and modify) the code going forward. No fancy fluent interface required.
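For completeness, here is a minimal sketch of how such a commented pattern might be used, written as C# top-level statements (C# 9 or later). It restates the pattern with explicit backslash escapes (`\{`, `\w`, `\}`). The input string and the names `propertyFinder`, `varName`, and `defaultValue` are made up for illustration and are not taken from the fubumvc source.

```csharp
using System;
using System.Text.RegularExpressions;

// The commented pattern, with explicit backslash escapes.
const string propertyFindingPattern = @"
\{              # start variable
(?<varname>\w+) # capture 1 or more word characters as the variable name
(:              # optional section beginning with a colon
(?<default>\w+) # capture 1 or more word characters as the default value
)?              # end optional section
\}              # end variable";

// IgnorePatternWhitespace tells the engine to skip unescaped whitespace
// in the pattern and to treat everything from # to end of line as a comment.
var propertyFinder = new Regex(propertyFindingPattern, RegexOptions.IgnorePatternWhitespace);

// Hypothetical input, invented for this example.
Match match = propertyFinder.Match("/products/{category:all}");

string varName = match.Groups["varname"].Value;
string defaultValue = match.Groups["default"].Value;
Console.WriteLine($"{varName} -> {defaultValue}");
```

Without the IgnorePatternWhitespace option, the spaces, line breaks, and # comments would all be treated as literal characters to match, so the option and the comments go together.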

  • Benjamin Smith

    I have a take: break up the string into variables that describe their intention. For example, parsing a log file might look like this:

    $year = "([0-9]{4})";
    // $month, $day, and $message are defined the same way
    $match = "/$year\.$month\.$day $message/";

    This produces something that’s easy to understand, and breaks up the expression into chunks that can be easily added/removed from $match so that diagnosing a pattern failure is easy.

  • You might want to check my project that isomorphically transforms regular expressions into XML. It was my way of transforming regular expressions into something that is more human-readable and, eventually, can be XSLTed into another regex.

    I also think Benjamin’s solution is valid, although I could never make it work in a usable way. I would create subpatterns and use them in a regular expression like this:

    var $int=@"-?\d+";
    var match=@"{$int/hour}-{$int/minute}-{$int/second}";

    But it still isn’t modular enough for me.

  • As much as I love regex, as you do no doubt, sometimes the patterns get intense. Hence commenting as you pointed out can be a time saver in trying to decipher the intent of the pattern not only for someone else but ourselves as we go back to older patterns. Good article.
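
The idea raised in the comments of building a pattern out of named fragments can be sketched in C# roughly as follows, again as top-level statements. This is only an illustration: the log format, the fragment names, the named groups, and the sample line are all assumptions, not something taken from the comments.

```csharp
using System;
using System.Text.RegularExpressions;

// Each fragment is named for what it matches. The fragments are plain
// verbatim strings, so the braces in {4} and {2} are ordinary quantifiers.
const string year = @"(?<year>[0-9]{4})";
const string month = @"(?<month>[0-9]{2})";
const string day = @"(?<day>[0-9]{2})";
const string message = @"(?<message>.*)";

// An interpolated verbatim string stitches the fragments together.
// Here the braces are interpolation holes, and \. is a literal dot.
string logLinePattern = $@"^{year}\.{month}\.{day} {message}$";

var logLine = new Regex(logLinePattern);

// Hypothetical log line, invented for this example.
Match match = logLine.Match("2011.03.14 Server started");

Console.WriteLine(match.Groups["year"].Value);
Console.WriteLine(match.Groups["message"].Value);
```

When a match fails, individual fragments can be removed from `logLinePattern` or tested on their own, which is the diagnostic benefit described in the first comment. This can also be combined with RegexOptions.IgnorePatternWhitespace and inline comments inside each fragment.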