1.5 .NET and C#

Microsoft's .NET framework provides a consistent and powerful set of regular expression classes for all .NET implementations. The following sections list the .NET regular expression syntax, the core .NET classes, and C# examples. Microsoft's .NET uses a Traditional NFA match engine. For an explanation of the rules behind a Traditional NFA engine, see Section 1.2.

1.5.1 Supported Metacharacters

.NET supports the metacharacters and metasequences listed in Table 1-15 through Table 1-20. For expanded definitions of each metacharacter, see Section 1.2.1.

Table 1-15. Character representations

Sequence   Meaning
\a         Alert (bell), x07.
\b         Backspace, x08, supported only in character class.
\e         ESC character, x1B.
\n         Newline, x0A.
\r         Carriage return, x0D.
\f         Form feed, x0C.
\t         Horizontal tab, x09.
\v         Vertical tab, x0B.
\0octal    Character specified by a two-digit octal code.
\xhex      Character specified by a two-digit hexadecimal code.
\uhex      Character specified by a four-digit hexadecimal code.
\cchar     Named control character.

Table 1-16. Character classes and class-like constructs

Class      Meaning
[...]      A single character listed or contained within a listed range.
[^...]     A single character not listed and not contained within a listed range.
.          Any character, except a line terminator (unless single-line mode, s).
\w         Word character, [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}] or [a-zA-Z_0-9] in ECMAScript mode.
\W         Non-word character, [^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}] or [^a-zA-Z_0-9] in ECMAScript mode.
\d         Digit, \p{Nd} or [0-9] in ECMAScript mode.
\D         Non-digit, \P{Nd} or [^0-9] in ECMAScript mode.
\s         Whitespace character, [ \f\n\r\t\v\x85\p{Z}] or [ \f\n\r\t\v] in ECMAScript mode.
\S         Non-whitespace character, [^ \f\n\r\t\v\x85\p{Z}] or [^ \f\n\r\t\v] in ECMAScript mode.
\p{prop}   Character contained by given Unicode block or property.
\P{prop}   Character not contained by given Unicode block or property.

Table 1-17. Anchors and other zero-width tests

Sequence   Meaning
^          Start of string, or after any newline if in MULTILINE mode.
\A         Beginning of string, in all match modes.
$          End of string, or before any newline if in MULTILINE mode.
\Z         End of string but before any final line terminator, in all match modes.
\z         End of string, in all match modes.
\b         Boundary between a \w character and a \W character.
\B         Not-word-boundary.
\G         End of the previous match.
(?=...)    Positive lookahead.
(?!...)    Negative lookahead.
(?<=...)   Positive lookbehind.
(?<!...)   Negative lookbehind.
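
The difference between some of these anchors is easiest to see side by side. The following is a minimal sketch, not a definitive reference: the class name AnchorSketch, its method names, and the sample strings are invented for illustration. It contrasts \Z with \z, shows ^ in Multiline mode, and uses a lookbehind to test for text that is not part of the match.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class; only Regex and RegexOptions come from .NET.
public static class AnchorSketch
{
    // \Z allows one final line terminator after the match position.
    public static bool EndsWithBigZ(string s)
    {
        return Regex.IsMatch(s, @"end\Z");
    }

    // \z matches only at the very end of the string.
    public static bool EndsWithSmallZ(string s)
    {
        return Regex.IsMatch(s, @"end\z");
    }

    // In Multiline mode, ^ also matches just after each embedded newline.
    public static int CountLineStarts(string s)
    {
        return Regex.Matches(s, @"^\w", RegexOptions.Multiline).Count;
    }

    // The lookbehind requires a preceding "$" without including it in the match.
    public static string AmountAfterDollar(string s)
    {
        return Regex.Match(s, @"(?<=\$)\d+").Value;
    }

    public static void Main()
    {
        Console.WriteLine(EndsWithBigZ("the end\n"));      // True
        Console.WriteLine(EndsWithSmallZ("the end\n"));    // False
        Console.WriteLine(CountLineStarts("one\ntwo"));    // 2
        Console.WriteLine(AmountAfterDollar("cost: $42")); // 42
    }
}
```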

Table 1-18. Comments and mode modifiers

Modifier/sequence         Mode character  Meaning
Singleline                s               Dot (.) matches any character, including a line terminator.
Multiline                 m               ^ and $ match next to embedded line terminators.
IgnorePatternWhitespace   x               Ignore whitespace and allow embedded comments starting with #.
IgnoreCase                i               Case-insensitive match based on characters in the current culture.
CultureInvariant          (none)          Culture-insensitive match.
ExplicitCapture           n               Allow named capture groups, but treat unnamed parentheses as non-capturing groups.
Compiled                  (none)          Compile regular expression.
RightToLeft               (none)          Search from right to left, starting to the left of the start position.
ECMAScript                (none)          Enable ECMAScript-compliant behavior; may be combined only with IgnoreCase, Multiline, and Compiled.
(?imnsx-imnsx)            (none)          Turn match flags on or off for rest of pattern.
(?imnsx-imnsx:...)        (none)          Turn match flags on or off for the rest of the subexpression.
(?#...)                   (none)          Treat substring as a comment.
#...                      (none)          Treat rest of line as a comment in IgnorePatternWhitespace (x) mode.
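
A mode can be requested either through a RegexOptions value or through a mode character inside the pattern. The sketch below illustrates both routes for case-insensitive matching, under the assumption that a small demonstration is useful here; the class ModeModifierSketch and its method names are made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class; only Regex and RegexOptions come from .NET.
public static class ModeModifierSketch
{
    // Case-insensitive match requested through the options parameter.
    public static bool ViaOptions(string input)
    {
        return Regex.IsMatch(input, "spider-man", RegexOptions.IgnoreCase);
    }

    // The same request made inside the pattern with the (?i) mode modifier.
    public static bool ViaInlineFlag(string input)
    {
        return Regex.IsMatch(input, "(?i)spider-man");
    }

    // (?i:...) limits the flag to one subexpression:
    // "spider" matches in any case, but "-Man" must match exactly.
    public static bool ViaScopedFlag(string input)
    {
        return Regex.IsMatch(input, "(?i:spider)-Man");
    }

    public static void Main()
    {
        Console.WriteLine(ViaOptions("SPIDER-MAN"));     // True
        Console.WriteLine(ViaInlineFlag("SPIDER-MAN"));  // True
        Console.WriteLine(ViaScopedFlag("SPIDER-Man"));  // True
        Console.WriteLine(ViaScopedFlag("SPIDER-MAN"));  // False
    }
}
```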

Table 1-19. Grouping, capturing, conditional, and control

Sequence       Meaning
(...)          Grouping. Submatches fill \1, \2, ... and $1, $2, ....
\n             In a regular expression, match what was matched by the nth earlier submatch.
$n             In a replacement string, contains the nth earlier submatch.
(?<name>...)   Captures matched substring into group, name.
(?:...)        Grouping-only parentheses, no capturing.
(?>...)        Disallow backtracking for subpattern.
...|...        Alternation; match one or the other.
*              Match 0 or more times.
+              Match 1 or more times.
?              Match 1 or 0 times.
{n}            Match exactly n times.
{n,}           Match at least n times.
{x,y}          Match at least x times, but no more than y times.
*?             Match 0 or more times, but as few times as possible.
+?             Match 1 or more times, but as few times as possible.
??             Match 0 or 1 times, but as few times as possible.
{n,}?          Match at least n times, but as few times as possible.
{x,y}?         Match at least x times, no more than y times, but as few times as possible.
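
The following sketch contrasts a greedy quantifier with its "as few times as possible" counterpart, and shows a \1 backreference. It is meant only as an illustration; the class QuantifierSketch, its method names, and the sample inputs are assumptions of this example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class; only Regex comes from .NET.
public static class QuantifierSketch
{
    // Greedy: .+ takes as much as it can, then gives back only what it must.
    public static string Greedy(string input)
    {
        return Regex.Match(input, "<.+>").Value;
    }

    // Lazy: .+? takes as little as it can while still allowing a match.
    public static string Lazy(string input)
    {
        return Regex.Match(input, "<.+?>").Value;
    }

    // \1 must match the same text that the first group captured,
    // so this finds a word that is immediately repeated.
    public static string DoubledWord(string input)
    {
        return Regex.Match(input, @"\b(\w+) \1\b").Groups[1].Value;
    }

    public static void Main()
    {
        Console.WriteLine(Greedy("<b>bold</b>"));              // <b>bold</b>
        Console.WriteLine(Lazy("<b>bold</b>"));                // <b>
        Console.WriteLine(DoubledWord("it was the the best")); // the
    }
}
```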

Table 1-20. Replacement sequences

Sequence      Meaning
$1, $2, ...   Captured submatches.
${name}       Matched text of a named capture group.
$`            Text before match.
$&            Text of match.
$'            Text after match.
$+            Last parenthesized match.
$_            Copy of original input string.
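
As a rough illustration of how a few of these sequences behave in a replacement string, consider the sketch below. The class ReplacementSketch, its method names, the group names first and last, and the sample inputs are all hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class; only Regex comes from .NET.
public static class ReplacementSketch
{
    // ${last} and ${first} insert the text of the named capture groups.
    public static string SwapNames(string input)
    {
        return Regex.Replace(input, @"(?<first>\w+) (?<last>\w+)", "${last}, ${first}");
    }

    // $& inserts the text of the whole match.
    public static string BracketNumbers(string input)
    {
        return Regex.Replace(input, @"\d+", "[$&]");
    }

    // $` inserts the text that precedes the match.
    public static string CopyPrefix(string input)
    {
        return Regex.Replace(input, "b", "$`");
    }

    public static void Main()
    {
        Console.WriteLine(SwapNames("Peter Parker")); // Parker, Peter
        Console.WriteLine(BracketNumbers("room 42")); // room [42]
        Console.WriteLine(CopyPrefix("abc"));         // aac
    }
}
```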

1.5.2 Regular Expression Classes and Interfaces

.NET defines its regular expression support in the System.Text.RegularExpressions namespace. The Regex( ) constructor handles regular expression creation, and the rest of the Regex methods handle pattern matching. The Group and Match classes contain information about each match.

C#'s verbatim string syntax, @"", allows you to define regular expression patterns without having to escape embedded backslashes.
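
To make the saving concrete, the sketch below writes the same pattern as a regular string and as an @"" string. The class VerbatimStringSketch and its member names are invented for this example. One detail to keep in mind: inside an @"" string, a literal double quote is written as two double quotes.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class; only Regex comes from .NET.
public static class VerbatimStringSketch
{
    // Regular string: every regex backslash must itself be escaped.
    public const string Regular = "\\d+\\.\\d+";

    // @"" string: backslashes are passed through to the regex as written.
    public const string Verbatim = @"\d+\.\d+";

    // A double quote inside an @"" string is doubled.
    // The resulting pattern is "[^"]*" (a quoted run of non-quote characters).
    public const string Quoted = @"""[^""]*""";

    public static string FirstQuoted(string input)
    {
        return Regex.Match(input, Quoted).Value;
    }

    public static void Main()
    {
        Console.WriteLine(Regular == Verbatim);             // True
        Console.WriteLine(FirstQuoted("say \"hi\" now"));   // "hi"
    }
}
```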

Regex

This class handles the creation of regular expressions and pattern matching. Several static methods allow for pattern matching without creating a Regex object.

Methods

public Regex(string pattern)
public Regex(string pattern, RegexOptions options): Return a regular expression object based on pattern and with the optional mode modifiers, options.
public static void CompileToAssembly(RegexCompilationInfo[ ] regexinfos, System.Reflection.AssemblyName assemblyname)
public static void CompileToAssembly(RegexCompilationInfo[ ] regexinfos, System.Reflection.AssemblyName assemblyname, System.Reflection.Emit.CustomAttributeBuilder[ ] attributes)
public static void CompileToAssembly(RegexCompilationInfo[ ] regexinfos, System.Reflection.AssemblyName assemblyname, System.Reflection.Emit.CustomAttributeBuilder[ ] attributes, string resourceFile): Compile one or more Regex objects to an assembly. The regexinfos array describes the regular expressions to include. The assembly filename is assemblyname. The array attributes defines attributes for the assembly. resourceFile is the name of a Win32 resource file to include in the assembly.
public static string Escape(string str): Return a string with all regular expression metacharacters, pound characters (#), and whitespace escaped.
public static bool IsMatch(string input, string pattern)
public static bool IsMatch(string input, string pattern, RegexOptions options)
public bool IsMatch(string input)
public bool IsMatch(string input, int startat): Return the success of a single match against the input string input. Static versions of this method require the regular expression pattern. The options parameter allows for optional mode modifiers (OR'd together). The startat parameter defines a starting position in input to start matching.
public static Match Match(string input, string pattern)
public static Match Match(string input, string pattern, RegexOptions options)
public Match Match(string input)
public Match Match(string input, int startat)
public Match Match(string input, int startat, int length): Perform a single match against the input string input and return information about the match in a Match object. Static versions of this method require the regular expression pattern. The options parameter allows for optional mode modifiers (OR'd together). The startat and length parameters define a starting position and the number of characters after the starting position to perform the match.
public static MatchCollection Matches(string input, string pattern)
public static MatchCollection Matches(string input, string pattern, RegexOptions options)
public MatchCollection Matches(string input)
public MatchCollection Matches(string input, int startat): Find all matches in the input string input, and return information about the matches in a MatchCollection object. Static versions of this method require the regular expression pattern. The options parameter allows for optional mode modifiers (OR'd together). The startat parameter defines a starting position in input to perform the match.
public static string Replace(string input, string pattern, MatchEvaluator evaluator)
public static string Replace(string input, string pattern, MatchEvaluator evaluator, RegexOptions options)
public static string Replace(string input, string pattern, string replacement)
public static string Replace(string input, string pattern, string replacement, RegexOptions options)
public string Replace(string input, MatchEvaluator evaluator)
public string Replace(string input, MatchEvaluator evaluator, int count)
public string Replace(string input, MatchEvaluator evaluator, int count, int startat)
public string Replace(string input, string replacement)
public string Replace(string input, string replacement, int count)
public string Replace(string input, string replacement, int count, int startat): Return a string in which each match in input is replaced with either the evaluation of the replacement string or a call to a MatchEvaluator object. The string replacement can contain backreferences to captured text with the $n or ${name} syntax.

The options parameter allows for optional mode modifiers (OR'd together). The count parameter limits the number of replacements. The startat parameter defines a starting position in input to start the replacement.
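
The two replacement styles described above might be used as in the following sketch. It is illustrative only: the class ReplaceSketch, its method names, and the sample inputs are assumptions, and ToUpperInvariant is used simply so that the output does not depend on the current culture.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class; Regex, Match, and MatchEvaluator come from .NET.
public static class ReplaceSketch
{
    // A MatchEvaluator target: called once per match, returns the replacement text.
    private static string Upper(Match m)
    {
        return m.Value.ToUpperInvariant();
    }

    // Static Replace with a MatchEvaluator delegate.
    public static string ShoutWords(string input)
    {
        return Regex.Replace(input, @"\b\w+\b", new MatchEvaluator(Upper));
    }

    // Instance Replace with a count: at most two matches are replaced.
    public static string ReplaceFirstTwo(string input)
    {
        Regex r = new Regex("a");
        return r.Replace(input, "-", 2);
    }

    public static void Main()
    {
        Console.WriteLine(ShoutWords("web slinger"));  // WEB SLINGER
        Console.WriteLine(ReplaceFirstTwo("banana"));  // b-n-na
    }
}
```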

public static string[ ] Split(string input, string pattern)
public static string[ ] Split(string input, string pattern, RegexOptions options)
public string[ ] Split(string input)
public string[ ] Split(string input, int count)
public string[ ] Split(string input, int count, int startat): Return an array of strings broken around matches of the regex pattern. If specified, no more than count strings are returned. You can specify a starting position in input with startat.
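
A short sketch of Split and Escape from the list above follows. The class SplitSketch, its method names, and the sample inputs are hypothetical; the point is only to show the static and instance forms of Split and one way Escape can be used to match text literally.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class; only Regex comes from .NET.
public static class SplitSketch
{
    // Static form: split around commas and any surrounding whitespace.
    public static string[] SplitAll(string input)
    {
        return Regex.Split(input, @"\s*,\s*");
    }

    // Instance form with a count: at most two strings are returned,
    // and the last one holds the unsplit remainder.
    public static string[] SplitInTwo(string input)
    {
        Regex r = new Regex(@"\s*,\s*");
        return r.Split(input, 2);
    }

    // Escape lets arbitrary text be used as a literal pattern.
    public static bool IsExactly(string input, string literal)
    {
        return Regex.IsMatch(input, @"\A" + Regex.Escape(literal) + @"\z");
    }

    public static void Main()
    {
        Console.WriteLine(string.Join("|", SplitAll("a , b,c")));    // a|b|c
        Console.WriteLine(string.Join("|", SplitInTwo("a , b,c")));  // a|b,c
        Console.WriteLine(IsExactly("1.5 (beta)", "1.5 (beta)"));    // True
        Console.WriteLine(IsExactly("1x5 (beta)", "1.5 (beta)"));    // False
    }
}
```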

Match

Properties

public bool Success: Indicates whether the match was successful.
public string Value: Text of the match.
public int Length: Number of characters in the matched text.
public int Index: Zero-based character index of the start of the match.
public GroupCollection Groups: A GroupCollection object where Groups[0].Value contains the text of the entire match, and each additional Groups element contains the text matched by a capture group.

Methods

public Match NextMatch( ): Return a Match object for the next match of the regex in the input string.
public virtual string Result(string result): Return result with special replacement sequences replaced by values from the previous match.
public static Match Synchronized(Match inner): Return a Match object identical to inner, except also safe for multithreaded use.

Group

Properties

public bool Success: True if the group participated in the match.
public string Value: Text captured by this group.
public int Length: Number of characters captured by this group.
public int Index: Zero-based character index of the start of the text captured by this group.
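
The Match and Group members listed above can be combined as in the following sketch, which walks through successive matches with NextMatch( ) and checks whether an optional group participated. The class MatchWalkSketch, its method names, the group names, and the sample inputs are invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class; Regex, Match, and Group come from .NET.
public static class MatchWalkSketch
{
    // Collect the year from every MM/DD/YYYY date, one Match at a time.
    public static List<string> Years(string input)
    {
        List<string> years = new List<string>();
        Regex r = new Regex(@"(?<month>\d\d)/(?<day>\d\d)/(?<year>\d{4})");

        // NextMatch() resumes the search where the previous match ended.
        for (Match m = r.Match(input); m.Success; m = m.NextMatch())
        {
            years.Add(m.Groups["year"].Value);
        }
        return years;
    }

    // Group.Success is false when the overall match succeeded
    // but this particular group did not take part in it.
    public static bool HyphenGroupParticipated(string input)
    {
        Match m = Regex.Match(input, "spider(-)?man");
        return m.Success && m.Groups[1].Success;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(",", Years("12/30/1969 and 01/02/2003"))); // 1969,2003
        Console.WriteLine(HyphenGroupParticipated("spider-man"));                 // True
        Console.WriteLine(HyphenGroupParticipated("spiderman"));                  // False
    }
}
```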

1.5.3 Unicode Support

.NET provides built-in support for Unicode 3.1, including full support in the \w, \d, and \s sequences. The range of characters matched can be limited to ASCII characters by turning on ECMAScript mode. Case-insensitive matching is limited to the characters of the current language defined in Thread.CurrentCulture, unless the CultureInvariant option is set.

.NET supports the standard Unicode properties (see Table 1-2) and blocks. Only the short form of property names is supported. Block names require the Is prefix and must use the simple name form, without spaces or underscores.
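
The sketch below illustrates the points made in this section: a short-form property name, a block name with the Is prefix, and the way ECMAScript mode narrows \d to ASCII digits. The class UnicodeSketch and its method names are assumptions of this example, and the test characters are written as \u escapes so the source stays ASCII-only.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class; only Regex and RegexOptions come from .NET.
public static class UnicodeSketch
{
    // \p{Lu}: short-form property name for an uppercase letter.
    public static bool IsUppercaseLetter(string s)
    {
        return Regex.IsMatch(s, @"\A\p{Lu}\z");
    }

    // \p{IsGreek}: block name with the required Is prefix.
    public static bool IsGreek(string s)
    {
        return Regex.IsMatch(s, @"\A\p{IsGreek}+\z");
    }

    // By default \d matches any Unicode decimal digit.
    public static bool IsDigitDefault(string s)
    {
        return Regex.IsMatch(s, @"^\d$");
    }

    // In ECMAScript mode \d is limited to [0-9].
    public static bool IsDigitEcma(string s)
    {
        return Regex.IsMatch(s, @"^\d$", RegexOptions.ECMAScript);
    }

    public static void Main()
    {
        string alpha = "\u0391";        // GREEK CAPITAL LETTER ALPHA
        string arabicThree = "\u0663";  // ARABIC-INDIC DIGIT THREE

        Console.WriteLine(IsUppercaseLetter(alpha));   // True
        Console.WriteLine(IsGreek(alpha));             // True
        Console.WriteLine(IsGreek("A"));               // False
        Console.WriteLine(IsDigitDefault(arabicThree)); // True
        Console.WriteLine(IsDigitEcma(arabicThree));    // False
    }
}
```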

1.5.4 Examples

Example 1-9. Simple match

//Match Spider-Man, Spiderman, SPIDER-MAN, etc.
namespace Regex_PocketRef
{
  using System.Text.RegularExpressions;

  class SimpleMatchTest
  {
    static void Main(  )
    {
      string dailybugle = "Spider-Man Menaces City!";

      string regex = "spider[- ]?man";

      if (Regex.IsMatch(dailybugle, regex, RegexOptions.IgnoreCase)) {
        //do something
      }
    }
  }
}

Example 1-10. Match and capture group

//Match dates formatted like MM/DD/YYYY, MM-DD-YY,...
using System.Text.RegularExpressions;

class MatchTest 
{
  static void Main(  )  
  {
    string date = "12/30/1969";
    Regex r = 
      new Regex( @"(\d\d)[-/](\d\d)[-/](\d\d(?:\d\d)?)" );

    Match m = r.Match(date);

    if (m.Success) {
      string month = m.Groups[1].Value;
      string day   = m.Groups[2].Value;
      string year  = m.Groups[3].Value;
    }
  } 
}

Example 1-11. Simple substitution

//Convert <br> to <br /> for XHTML compliance
using System.Text.RegularExpressions;

class SimpleSubstitutionTest 
{
  static void Main(  ) 
  {
    string text = "Hello world. <br>";
    string regex = "<br>";
    string replacement = "<br />";

    string result = 
      Regex.Replace(text, regex, replacement, RegexOptions.IgnoreCase);
  }
}

Example 1-12. Harder substitution

//urlify - turn URLs into HTML links
using System.Text.RegularExpressions;

public class Urlify 
{
  static void Main(  ) 
  {
    string text = "Check the website, http://www.oreilly.com/catalog/repr.";
    string regex =
      @"\b                            # start at word boundary
        (                             # capture to $1
        (https?|telnet|gopher|file|wais|ftp) :
                                      # resource and colon
        [\w/#~:.?+=&%@!\-] +?         # one or more valid
                                      # characters
                                      # but take as little as
                                      # possible
        )                                                               
        (?=                           # lookahead
        [.:?\-] *                     # for possible
                                      # punctuation
        (?: [^\w/#~:.?+=&%@!\-]       # invalid character
        | $ )                         # or end of string  
        )";

    Regex r = new Regex(regex,  RegexOptions.IgnoreCase
                     | RegexOptions.IgnorePatternWhitespace);
    string result = r.Replace(text, "<a href=\"$1\">$1</a>");
  } 
}

1.5.5 Other Resources

Programming C#, by Jesse Liberty (O'Reilly), gives a thorough introduction to C#, .NET, and regular expressions.
Mastering Regular Expressions, Second Edition, by Jeffrey E. F. Friedl (O'Reilly), covers the details and failings of .NET regular expressions on pages 399-432.
Microsoft's online documentation at http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpgenref/html/cpconregularexpressionslanguageelements.asp.
