Feel like a geek and get yourself Ema Personal Wiki for Android and Windows

09 April 2010

i18n-proof C# regex recognizing uppercase and lowercase letters

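Finding WikiWords with patterns like [A-Z][a-z][...] won't do: the ranges [A-Z] and [a-z] only cover the 26 unaccented ASCII letters, so letters such as É or ö are missed and the recognition of uppercase and lowercase letters is not i18n-proof.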
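To make the problem concrete, here is a minimal sketch comparing an ASCII-only pattern with one based on the Unicode categories \p{Lu} and \p{Ll}. The class name and the sample words are my own illustration, not part of any wiki engine.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AsciiVersusUnicode
{
    // ASCII-only ranges: they know nothing about letters such as É or Ö.
    public static readonly Regex Ascii = new Regex(@"^[A-Z][a-z]+$");

    // Unicode general categories: Lu = uppercase letter, Ll = lowercase letter.
    public static readonly Regex Unicode = new Regex(@"^\p{Lu}\p{Ll}+$");

    public static void Main()
    {
        // Escapes keep the source file encoding-independent:
        // \u00C9 is É, \u00E9 is é, \u00D6 is Ö.
        string[] words = { "Wiki", "\u00C9\u00E9n", "\u00D6sterreich" };
        foreach (string word in words)
        {
            Console.WriteLine("{0}: ascii={1}, unicode={2}",
                word, Ascii.IsMatch(word), Unicode.IsMatch(word));
        }
    }
}
```

Only "Wiki" passes the ASCII pattern. The Unicode pattern accepts all three words.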
The following Regex finds WikiWords in an i18n-proof way:
// requires: using System.Text.RegularExpressions;
private static Regex _wikiWords = new Regex(@"
    \b                    #start on a word boundary
    (?:                   #group the alternatives so the boundary applies to each
        \p{Lu}            #start with an uppercase letter
        \p{Ll}*           #zero or more lowercase letters
        \p{Lu}            #a second uppercase letter
        \w*               #and zero or more word characters
        |                 #or
        \p{L}+\d\w*       #a mix of letters and digits
        |                 #or
        \d+\p{L}\w*       #a mix of digits and letters
    )
", RegexOptions.IgnorePatternWhitespace);
Speaking of i18n: the term itself is flawed. I am from the Netherlands. If I want to support my own language, and no other language, no international boundary is crossed. But I still need an i18n-proof WikiWord engine.
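A Dutch page name such as ÉénPagina is enough to show the need. The sketch below runs a copy of the WikiWord pattern over a sample sentence. The alternatives are wrapped in a group so that the word boundary applies to each of them, which is my assumption about the intent. The class, the helper FindWikiWords and the sample text are hypothetical names for illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WikiWordDemo
{
    // Copy of the WikiWord pattern, with the alternatives grouped (an assumption
    // about the intent) so that \b applies to every alternative.
    private static readonly Regex WikiWords = new Regex(@"
        \b
        (?:
            \p{Lu}\p{Ll}*\p{Lu}\w*   # uppercase, lowercase run, uppercase: WikiWord
            | \p{L}+\d\w*            # letters followed by a digit: page2
            | \d+\p{L}\w*            # digits followed by a letter: 3D
        )
    ", RegexOptions.IgnorePatternWhitespace);

    // Hypothetical helper: returns every WikiWord found in the text, in order.
    public static List<string> FindWikiWords(string text)
    {
        var found = new List<string>();
        foreach (Match match in WikiWords.Matches(text))
        {
            found.Add(match.Value);
        }
        return found;
    }

    public static void Main()
    {
        // \u00C9 is É and \u00E9 is é, so the first candidate is the Dutch "ÉénPagina".
        string text = "See \u00C9\u00E9nPagina, WikiWord, page2 and 3D, but not plain or Word.";
        foreach (string word in FindWikiWords(text))
        {
            Console.WriteLine(word);
        }
    }
}
```

The helper finds ÉénPagina, WikiWord, page2 and 3D, in that order. Ordinary words such as "See", "plain" and "Word" are skipped because they contain neither a second uppercase letter nor a digit.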
