Unicode in regular expressions


Wow, what an interesting blog post title!  Two technologies, each scintillating by itself, when brought together have more energy than a 1988 GnR concert.

A feature request came up in AutoMapper to support international characters.  On the surface, that might seem simple, were it not for AutoMapper’s flattening feature.  AutoMapper requires zero configuration for flattened mappings, as long as the names all match up (minus all those dots).  The trick came in figuring out how to take a destination member name and search the source type’s members for it, taking into account that .NET PrefersPascalCasingForPublicMembers.

So what could this match to?  Any permutation of putting that dot anywhere in the chain.  And if it goes down the chain in one spot (PrefersPascal.Casing.ForPublic.Memberz) and misses, it needs to go back and search another vector.

Long story short, I’m no algorithms expert, so the best option I could come up with is one where I just split a string based on uppercase characters, instead of one. character. at. a. time.  Yes, I’d love to hear a better way.
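For the curious, here is a rough sketch of the "go down the chain, and back up on a miss" search described above. It is not AutoMapper's actual code: the FlattenedNameResolver class and its Resolve method are hypothetical names made up for illustration, and the sketch assumes the destination name has already been split into words.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;

// Hypothetical helper for illustration only, not AutoMapper's implementation.
public static class FlattenedNameResolver
{
    // Tries to turn name parts (e.g. "Customer", "Name") into a chain of public
    // instance properties starting at sourceType. Returns null if no chain fits.
    public static List<PropertyInfo> Resolve(Type sourceType, IReadOnlyList<string> parts, int start = 0)
    {
        if (start == parts.Count)
            return new List<PropertyInfo>(); // every part consumed: success

        string candidate = "";
        // Grow the candidate one part at a time: "Customer", "CustomerName", ...
        for (int end = start; end < parts.Count; end++)
        {
            candidate += parts[end];
            PropertyInfo property = sourceType.GetProperty(
                candidate, BindingFlags.Public | BindingFlags.Instance);
            if (property == null)
                continue; // no member with this name, so try a longer one

            // Put a "dot" here and resolve the remaining parts on the property's type.
            List<PropertyInfo> rest = Resolve(property.PropertyType, parts, end + 1);
            if (rest != null)
            {
                rest.Insert(0, property);
                return rest;
            }
            // Dead end further down the chain: back up and try a longer candidate.
        }
        return null;
    }
}
```

Calling Resolve(typeof(DateTime), new[] { "Day", "Of", "Week" }) first tries the Day property, finds nothing named Of or OfWeek on int, backs up, and ends with the single DayOfWeek property.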

In any case, my regular expression was a little too assuming:

string[] matches = Regex.Matches(nameToSearch, "[A-Z][a-z0-9]*")
    .Cast<Match>().Select(m => m.Value).ToArray();

That’s all fine and dandy in the States, but not in other countries, as the list of valid characters for members is much larger than this.  Suppose I’m trying to map this flattened DTO:

private class Order
{
    public Customer Customer { get; set; }
}

private class Customer
{
    public string Æøå { get; set; }
}

private class OrderDto
{
    public string CustomerÆøå { get; set; }
}

It should match the Customer property, then the…other…property.  I can’t just use the normal “A-Z” assumption.  Instead, I can use Unicode general categories.  Unicode assigns every character a general category, such as uppercase letter, lowercase letter, titlecase letter (for digraphs such as ǅ), decimal digit and so on.  To match a whole category in a regular expression, I can use a character class escape, which in .NET is “\p{name}”, where name is the name of the category (“Lu” for uppercase letters, “Ll” for lowercase letters).  For more character classes and categories, check out the MSDN documentation.
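To get a feel for general categories, a small sketch like the following can help. The CategoryDemo class and its method names are made up for illustration; CharUnicodeInfo.GetUnicodeCategory and Regex.IsMatch are the real .NET calls doing the work.

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper for poking at Unicode general categories.
public static class CategoryDemo
{
    // Returns the general category .NET reports for a character.
    public static UnicodeCategory CategoryOf(char c) =>
        CharUnicodeInfo.GetUnicodeCategory(c);

    // True when the character belongs to the named category, e.g. "Lu" or "Ll".
    public static bool IsInCategory(char c, string category) =>
        Regex.IsMatch(c.ToString(), @"\p{" + category + "}");
}
```

CategoryDemo.CategoryOf('Æ') reports UppercaseLetter and CategoryDemo.CategoryOf('ø') reports LowercaseLetter, so IsInCategory('Æ', "Lu") is true even though 'Æ' falls outside "[A-Z]".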

With that in hand, I can use the categories for uppercase (Lu) and lowercase (Ll) letters:

string[] matches = Regex.Matches(nameToSearch, @"\p{Lu}[\p{Ll}0-9]*")
    .Cast<Match>().Select(m => m.Value).ToArray();
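To see the difference side by side, here is a small self-contained sketch that runs both patterns over the same name. The NameSplitter class is a made-up wrapper for illustration; the two patterns are the ASCII-only one and the Unicode-aware one from this post.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the two patterns discussed in the post.
public static class NameSplitter
{
    // ASCII-only: an uppercase A-Z followed by lowercase a-z or digits.
    public const string AsciiPattern = "[A-Z][a-z0-9]*";

    // Unicode-aware: any uppercase letter followed by lowercase letters or digits.
    public const string UnicodePattern = @"\p{Lu}[\p{Ll}0-9]*";

    // Returns every pattern match in the name, in order, as plain strings.
    public static string[] Split(string name, string pattern) =>
        Regex.Matches(name, pattern)
             .Cast<Match>()
             .Select(m => m.Value)
             .ToArray();
}
```

NameSplitter.Split("CustomerÆøå", NameSplitter.AsciiPattern) finds only "Customer", while the same call with NameSplitter.UnicodePattern finds "Customer" and "Æøå".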

Now if I was being thorough, I might look up the rest of the valid Unicode characters in members.  But no, enough double-insanity for one day.
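For anyone who does want to go further, one possible direction is sketched below. It assumes a word starts at an uppercase or titlecase letter and continues through any other character category the C# specification allows inside an identifier. The BroadNameSplitter name and that word-boundary rule are my own assumptions for illustration, not something AutoMapper is known to use.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical, broader splitter. The word-boundary rule is an assumption.
public static class BroadNameSplitter
{
    // Start: uppercase (Lu) or titlecase (Lt) letter.
    // Continue: lowercase, modifier and other letters, letter numbers,
    // combining marks, decimal digits, connector punctuation, format characters.
    public const string Pattern =
        @"[\p{Lu}\p{Lt}][\p{Ll}\p{Lm}\p{Lo}\p{Nl}\p{Mn}\p{Mc}\p{Nd}\p{Pc}\p{Cf}]*";

    // Returns every word found in the name, in order.
    public static string[] Split(string name) =>
        Regex.Matches(name, Pattern)
             .Cast<Match>()
             .Select(m => m.Value)
             .ToArray();
}
```

One limitation of this sketch: any leading run that does not begin with an uppercase or titlecase letter (a camelCased or underscore-prefixed name, say) is skipped rather than returned as a word.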
