Regular Expressions
• With .NET Regex, think about efficiency first: a carelessly written pattern can backtrack heavily.
• The Regex class supports two flavors of regular expressions:
- Full-featured .NET version (the default)
- ECMAScript (JavaScript) version, selected with RegexOptions.ECMAScript
• The result of Regex.Match (a Match object) tells you which part of the subject matched which part of the pattern, through its Groups collection.
• Basic operations: concatenation, alternation (|), repetition (*).
• Regex is an immutable class.
• Regex finds the leftmost match.
• Alternation: The order of terms in an alternation matters. (catnip|cat) tries catnip first and only tries cat if catnip did not match at that position.
• Repetition
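- Group a set of characters with parentheses and apply a quantifier to the group, e.g. (cat)* matches zero or more consecutive occurrences of cat.
- (cat)* : Match 0 or more
- (cat)? : Match 0 or 1
- (cat)+ : Match 1 or more
- (cat){1,4} : Match at least 1 and at most 4 occurrences. Write the quantifier without a space after the comma; with a space, the braces are read as literal characters.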
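The points above can be tried out with a short sketch. It is only an illustration: the sample strings and variable names are made up for this example, and it assumes a C# version that allows top-level statements.

```csharp
using System;
using System.Text.RegularExpressions;

// Alternation is ordered: the first term that succeeds at a position wins.
string shortFirst = Regex.Match("catnip", "cat|catnip").Value;
string longFirst = Regex.Match("catnip", "catnip|cat").Value;

// The leftmost match wins, even if a longer match exists further right.
Match leftmost = Regex.Match("a cat and some catnip", "catnip|cat");

// Quantifiers applied to a group; ^ and $ force the whole string to match.
bool star = Regex.IsMatch("", "^(cat)*$");               // zero repetitions are allowed
bool plus = Regex.IsMatch("", "^(cat)+$");               // at least one is required
bool opt = Regex.IsMatch("catcat", "^(cat)?$");          // at most one is allowed
bool range = Regex.IsMatch("catcatcat", "^(cat){1,4}$"); // between 1 and 4

// Groups tell which part of the subject matched which part of the pattern.
Match order = Regex.Match("order 42", @"(\w+) (\d+)");
string word = order.Groups[1].Value;
string number = order.Groups[2].Value;

// Flavor difference: U+0663 is an Arabic-Indic digit. The default flavor
// treats it as \d, the ECMAScript flavor only accepts 0-9.
bool defaultDigit = Regex.IsMatch("\u0663", @"\d");
bool ecmaDigit = Regex.IsMatch("\u0663", @"\d", RegexOptions.ECMAScript);

Console.WriteLine($"{shortFirst} / {longFirst}");             // cat / catnip
Console.WriteLine($"{leftmost.Value} at {leftmost.Index}");   // cat at 2
Console.WriteLine($"{star} {plus} {opt} {range}");            // True False False True
Console.WriteLine($"{word} {number}");                        // order 42
Console.WriteLine($"{defaultDigit} {ecmaDigit}");             // True False
```

Putting the longer term first, as in catnip|cat, is the usual way to make sure the shorter term does not hide it.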
Matching whole Expressions
• Match.NextMatch() returns the next match in the same input.
• (?m) : Multiline modifier. It makes the rest of the pattern multiline, which changes where ^ and $ match.
• \A : Matches the start of the string. Not affected by the multiline modifier.
• ^: Same as \A when the pattern is not multiline. When it is multiline, ^ also matches at the start of each line.
• $: (Opposite of ^) When the pattern is not multiline, matches at the end of the string (or just before a final newline). When it is multiline, also matches at the end of each line, just before the newline.
• \Z: (Opposite of \A) Matches the end of the string (or just before a final newline). Not affected by the multiline modifier. Use \z to match only at the very end.
• Custom character class e.g. [abc]
• A custom character class is more efficient than the equivalent alternation, i.e. [abc] is more efficient than a|b|c. Prefer character classes over alternations where possible.
• Character class matches single character only.
• [a-z0-9] : All characters from a to z and 0 to 9
• [^a-z]: All characters except a to z. ^ is negation in this context.
• \d : A digit, i.e. [0-9] for ASCII text. By default .NET also matches other Unicode decimal digits; with RegexOptions.ECMAScript it is exactly [0-9].
• \D: Negation of \d, i.e. any character that is not a digit.
• \s: Any whitespace character
• \S: Negation of \s
• \w: Word characters: a to z, A to Z, 0 to 9, underscore, and (by default) other Unicode letters and digits
• \W: Negation of \w
• \w{3}: Match exactly 3 word characters
• \ is the escape character. The Regex.Escape method adds escape characters wherever necessary.
• \b: Word boundary. Matches at the start or end of a word.
• \B: Negation of \b
• . (Dot): Wildcard. Matches any single character. By default it does not match a newline.
• (?s) : Dot modifier (Singleline). With it, . (dot) matches the newline character as well. A colon scopes the modifier to part of the pattern, e.g. (?s:c.t)|d.g applies the modifier only to the first term of the alternation. Do not put spaces inside the pattern; they would be matched literally.
• Avoid applying * to . (dot) where possible, for performance reasons: .* can cause a lot of backtracking.
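A small sketch of the anchors, character classes, escaping and dot behavior described above. The sample strings and variable names are invented for illustration, and the negated character class at the end is just one common substitute for .* (an assumption of this example, not something the notes prescribe).

```csharp
using System;
using System.Text.RegularExpressions;

string text = "one\ntwo\nthree";

// Without (?m), ^ and $ only anchor at the ends of the whole string.
int plainCount = Regex.Matches(text, @"^\w+$").Count;
// With (?m), ^ and $ anchor at every line.
int lineCount = Regex.Matches(text, @"(?m)^\w+$").Count;
// \A ignores (?m): only the very start of the string qualifies.
int startCount = Regex.Matches(text, @"(?m)\A\w+").Count;

// NextMatch continues the search after the previous match.
Match first = Regex.Match(text, @"(?m)^\w+$");
Match second = first.NextMatch();

// Character classes match exactly one character; ^ inside negates.
bool inClass = Regex.IsMatch("b", "^[abc]$");
string lettersOnly = Regex.Replace("a1b2", "[^a-z]", "");

// \w{3} takes exactly three word characters; \b requires a word boundary.
string threeWordChars = Regex.Match("ab_1", @"\w{3}").Value;
int wholeWordCats = Regex.Matches("cat catnip", @"\bcat\b").Count;

// Regex.Escape turns literal text into a pattern that matches it as-is.
bool unescaped = Regex.IsMatch("1+1=2", "^1+1=2$");          // + acts as a quantifier
bool escaped = Regex.IsMatch("1+1=2", Regex.Escape("1+1=2"));

// Dot skips newlines unless (?s) is on; (?s:...) scopes it to one term.
bool dotDefault = Regex.IsMatch("c\nt", "c.t");
bool dotSingle = Regex.IsMatch("c\nt", "(?s)c.t");
bool scopedFirst = Regex.IsMatch("c\nt", "(?s:c.t)|d.g");
bool scopedSecond = Regex.IsMatch("d\ng", "(?s:c.t)|d.g");

// A negated class states what may be skipped, instead of .* grabbing
// everything and backtracking.
string greedy = Regex.Match("<a><b>", "<.*>").Value;
string bounded = Regex.Match("<a><b>", "<[^>]*>").Value;

Console.WriteLine($"{plainCount} {lineCount} {startCount}");   // 0 3 1
Console.WriteLine($"{first.Value} {second.Value}");            // one two
Console.WriteLine($"{inClass} {lettersOnly}");                 // True ab
Console.WriteLine($"{threeWordChars} {wholeWordCats}");        // ab_ 1
Console.WriteLine($"{unescaped} {escaped}");                   // False True
Console.WriteLine($"{dotDefault} {dotSingle} {scopedFirst} {scopedSecond}"); // False True True False
Console.WriteLine($"{greedy} {bounded}");                      // <a><b> <a>
```

The sample text uses \n line endings on purpose: in multiline mode $ matches just before \n, so input with \r\n line endings would need the \r handled in the pattern.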