The .NET Framework provides an object-oriented approach to regular expression matching and replacement.
The Framework Class Library namespace System.Text.RegularExpressions is the home to all the .NET Framework objects associated with regular expressions. The central class for regular expression support is Regex, which represents an immutable, compiled regular expression. Example 10-9 rewrites Example 10-8 to use regular expressions and thus solve the problem of searching for more than one type of delimiter.
Option Strict On
Imports System
Imports System.Text
Imports System.Text.RegularExpressions

Namespace RegularExpressions
    Class Tester
        Public Sub Run( )
            Dim s1 As String = "One,Two,Three Liberty Associates, Inc."
            Dim theRegex As New Regex(" |, |,")
            Dim sBuilder As New StringBuilder( )
            Dim id As Integer = 1
            Dim subString As String
            For Each subString In theRegex.Split(s1)
                ' Number the current substring, then advance the counter
                sBuilder.AppendFormat("{0}: {1}" _
                    & Environment.NewLine, id, subString)
                id = id + 1
            Next subString
            Console.WriteLine("{0}", sBuilder.ToString( ))
        End Sub 'Run

        Public Shared Sub Main( )
            Dim t As New Tester( )
            t.Run( )
        End Sub 'Main
    End Class 'Tester
End Namespace 'RegularExpressions

Output:
1: One
2: Two
3: Three
4: Liberty
5: Associates
6: Inc.
Example 10-9 begins by creating a string, s1, identical to the string used in Example 10-8:
Dim s1 As String = "One,Two,Three Liberty Associates, Inc."
and a regular expression that will be used to search that string:
Dim theRegex As New Regex(" |, |,")
One of the overloaded constructors for Regex takes a regular expression string as its parameter.
The rest of the program proceeds like Example 10-8 except that rather than calling Split() on string s1, the Split( ) method of Regex is called. Regex.Split( ) acts in much the same way as String.Split( ), returning an array of strings as a result of matching the regular expression pattern within theRegex.
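To make the difference concrete, here is a minimal sketch that runs both kinds of split on the same string. It assumes, for illustration only, that the String.Split( ) call uses a space and a comma as its character delimiters; the module name SplitComparison is hypothetical.

```vbnet
Option Strict On
Imports System
Imports System.Text.RegularExpressions

Module SplitComparison
    Sub Main()
        Dim s1 As String = "One,Two,Three Liberty Associates, Inc."

        ' String.Split treats every delimiter character separately, so the
        ' comma and the space in ", " leave an empty string between them.
        Dim fromString As String() = s1.Split(New Char() {" "c, ","c})
        Console.WriteLine("String.Split: {0} substrings", fromString.Length)
        ' String.Split: 7 substrings

        ' Regex.Split matches ", " as a single delimiter, so no empty
        ' string appears in the result.
        Dim theRegex As New Regex(" |, |,")
        Dim fromRegex As String() = theRegex.Split(s1)
        Console.WriteLine("Regex.Split: {0} substrings", fromRegex.Length)
        ' Regex.Split: 6 substrings
    End Sub
End Module
```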
Regex.Split( ) is overloaded. The simplest version is called on an instance of Regex as shown in Example 10-9. There is also a shared version of this method, which takes a string to search and the pattern to search with, as illustrated in Example 10-10.
Option Strict On
Imports System
Imports System.Text
Imports System.Text.RegularExpressions

Namespace RegularExpressions
    Class Tester
        Public Sub Run( )
            Dim s1 As String = "One,Two,Three Liberty Associates, Inc."
            Dim sBuilder As New StringBuilder( )
            Dim id As Integer = 1
            Dim subString As String
            For Each subString In Regex.Split(s1, " |, |,")
                ' Number the current substring, then advance the counter
                sBuilder.AppendFormat("{0}: {1}" _
                    & Environment.NewLine, id, subString)
                id = id + 1
            Next subString
            Console.WriteLine("{0}", sBuilder.ToString( ))
        End Sub 'Run

        Public Shared Sub Main( )
            Dim t As New Tester( )
            t.Run( )
        End Sub 'Main
    End Class 'Tester
End Namespace 'RegularExpressions
Example 10-10 is identical to Example 10-9 except that it does not instantiate an object of type Regex. Instead, Example 10-10 uses the shared version of Split( ), which takes two arguments: a string to be searched and a regular expression string that represents the pattern to match.
The instance Split( ) method is also overloaded: one version limits the number of substrings returned, and another also sets the position within the target string where the search will begin.
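The following sketch shows how those two overloads might be called on the string from Example 10-9. The limit of 3 and the starting index of 8 are arbitrary values chosen for illustration, and the module name SplitOverloads is hypothetical.

```vbnet
Option Strict On
Imports System
Imports System.Text.RegularExpressions

Module SplitOverloads
    Sub Main()
        Dim s1 As String = "One,Two,Three Liberty Associates, Inc."
        Dim theRegex As New Regex(" |, |,")

        ' Return at most three substrings; the last one holds the
        ' unsplit remainder of the input.
        For Each part As String In theRegex.Split(s1, 3)
            Console.WriteLine("[{0}]", part)
        Next
        ' [One]
        ' [Two]
        ' [Three Liberty Associates, Inc.]

        ' Same limit, but begin looking for delimiters at index 8
        ' (the T of Three), so the two earlier commas are left alone.
        For Each part As String In theRegex.Split(s1, 3, 8)
            Console.WriteLine("[{0}]", part)
        Next
        ' [One,Two,Three]
        ' [Liberty]
        ' [Associates, Inc.]
    End Sub
End Module
```

Note that the text before the starting index is not discarded; it stays in the first substring.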
Two additional classes in the .NET RegularExpressions namespace allow you to search a string repeatedly and to return the results in a collection. The collection returned is of type MatchCollection, which consists of zero or more Match objects. Two important properties of a Match object are its length and its value, each of which can be read, as illustrated in Example 10-11.
Option Strict On
Imports System
Imports System.Text
Imports System.Text.RegularExpressions

Namespace RegularExpressions
    Class Tester
        Public Sub Run( )
            Dim string1 As String = "This is a test string"
            Dim theReg As New Regex("(\S+)\s")
            Dim theMatches As MatchCollection = theReg.Matches(string1)
            Dim theMatch As Match
            For Each theMatch In theMatches
                Console.WriteLine("theMatch.Length: {0}", _
                    theMatch.Length)
                If theMatch.Length <> 0 Then
                    Console.WriteLine("theMatch: {0}", _
                        theMatch.ToString( ))
                End If
            Next theMatch
        End Sub 'Run

        Public Shared Sub Main( )
            Dim t As New Tester( )
            t.Run( )
        End Sub 'Main
    End Class 'Tester
End Namespace 'RegularExpressions

Output:
theMatch.Length: 5
theMatch: This
theMatch.Length: 3
theMatch: is
theMatch.Length: 2
theMatch: a
theMatch.Length: 5
theMatch: test
Example 10-11 creates a simple string to search:
Dim string1 As String = "This is a test string"
and a trivial regular expression to search it:
Dim theReg As New Regex("(\S+)\s")
The string \S finds nonwhitespace, and the plus sign indicates one or more. The string \s (note lowercase) indicates whitespace. Thus, together, this string looks for any nonwhitespace characters followed by whitespace.
The output shows that the first four words were found. The final word was not found because it is not followed by a space. If you insert a space after the word string and before the closing quote marks, this program will find that word as well.
The Length property is the length of the captured substring and will be discussed in Section 10.4.3, later in this chapter.
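The effect of the trailing space can be checked with a short sketch like the one below, which runs the same pattern against both versions of the string. The module and variable names are hypothetical.

```vbnet
Option Strict On
Imports System
Imports System.Text.RegularExpressions

Module TrailingSpaceSketch
    Sub Main()
        Dim theReg As New Regex("(\S+)\s")

        ' Without a trailing space, the last word has no whitespace after it.
        Dim noSpace As MatchCollection = theReg.Matches("This is a test string")
        Console.WriteLine("matches: {0}", noSpace.Count)
        ' matches: 4

        ' With a trailing space, the last word is matched as well.
        Dim withSpace As MatchCollection = theReg.Matches("This is a test string ")
        Console.WriteLine("matches: {0}", withSpace.Count)
        ' matches: 5

        ' The match includes the whitespace, so its length is 7, not 6.
        Dim lastMatch As Match = withSpace(withSpace.Count - 1)
        Console.WriteLine("[{0}] length {1}", lastMatch.Value, lastMatch.Length)
        ' [string ] length 7
    End Sub
End Module
```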
It is often convenient to group subexpression matches together so that you can parse out pieces of the matching string. For example, you might want to match on IP addresses and group all IP addresses found anywhere within the string.
The Group class allows you to create groups of matches based on regular expression syntax, and represents the results from a single grouping expression.
A grouping expression names a group and provides a regular expression; any substring matching the regular expression will be added to the group. For example, to create an ip group you might write:
"(?<ip>(\d|\.)+)\s"
The Match class derives from Group and has a collection called "Groups," which contains all the groups your Match finds.
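As a small illustration of the ip group on its own, the sketch below applies that pattern to a made-up string of addresses and reads the group from each match. The input string and the module name are assumptions for the example.

```vbnet
Option Strict On
Imports System
Imports System.Text.RegularExpressions

Module IpGroupSketch
    Sub Main()
        ' Hypothetical input: three addresses, each followed by a space.
        Dim addresses As String = "10.0.0.1 192.168.0.7 127.0.0.1 "
        Dim theReg As New Regex("(?<ip>(\d|\.)+)\s")

        For Each theMatch As Match In theReg.Matches(addresses)
            ' The match includes the trailing whitespace; the ip group
            ' holds only the digits and dots.
            Console.WriteLine("ip: [{0}]", theMatch.Groups("ip").Value)
        Next
        ' ip: [10.0.0.1]
        ' ip: [192.168.0.7]
        ' ip: [127.0.0.1]
    End Sub
End Module
```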
Example 10-12 illustrates the creation and use of the Groups collection and Group classes.
Option Strict On
Imports System
Imports System.Text
Imports System.Text.RegularExpressions

Namespace RegularExpressions
    Class Tester
        Public Sub Run( )
            Dim string1 As String = _
                "04:03:27 127.0.0.0 LibertyAssociates.com"
            ' time = one or more digits or colons
            ' followed by a space
            ' ip address = one or more digits or dots
            ' followed by space
            ' site = one or more characters
            Dim regString As String = "(?<time>(\d|\:)+)\s" & _
                "(?<ip>(\d|\.)+)\s" & _
                "(?<site>\S+)"
            Dim theReg As New Regex(regString)
            Dim theMatches As MatchCollection = theReg.Matches(string1)
            Dim theMatch As Match
            For Each theMatch In theMatches
                If theMatch.Length <> 0 Then
                    Console.WriteLine( _
                        "theMatch: {0}", _
                        theMatch.ToString( ))
                    Console.WriteLine( _
                        "time: {0}", _
                        theMatch.Groups("time"))
                    Console.WriteLine( _
                        "ip: {0}", _
                        theMatch.Groups("ip"))
                    Console.WriteLine( _
                        "site: {0}", _
                        theMatch.Groups("site"))
                End If
            Next theMatch
        End Sub 'Run

        Public Shared Sub Main( )
            Dim t As New Tester( )
            t.Run( )
        End Sub 'Main
    End Class 'Tester
End Namespace 'RegularExpressions

Output:
theMatch: 04:03:27 127.0.0.0 LibertyAssociates.com
time: 04:03:27
ip: 127.0.0.0
site: LibertyAssociates.com
Again, Example 10-12 begins by creating a string to search:
Dim string1 As String = _
    "04:03:27 127.0.0.0 LibertyAssociates.com"
This string might be one of many recorded in a web server log file or produced as the result of a search of the database. In this simple example there are three columns: one for the time of the log entry, one for an IP address, and one for the site, each separated by spaces; of course, in a real example solving a real-life problem, you might need more complex searches and might choose other delimiters.
In Example 10-12, you create a single Regex object to search strings of this type and break them into three groups: time, ip address, and site. The regular expression string is fairly simple (as regular expressions go), so the example is easy to understand (however, keep in mind that in a real search, you would probably only use a part of the source string rather than the entire source string, as shown here):
Dim regString As String = "(?<time>(\d|\:)+)\s" & _
    "(?<ip>(\d|\.)+)\s" & _
    "(?<site>\S+)"
Let's focus on the characters that create the group:
(?<time>
The parentheses create a group. Everything between the opening parenthesis (just before the question mark) and the closing parenthesis (in this case, after the plus sign) is a single group:
(?<time>(\d|\:)+)
The string ?<time> names that group time, and the group is associated with the text matched by the regular expression (\d|\:)+. Together with the \s that follows the group, this can be interpreted as "one or more digits or colons followed by a space."
Similarly, the string ?<ip> names the ip group, and ?<site> names the site group. As Example 10-11 does, Example 10-12 asks for a collection of all the matches:
Dim theMatches As MatchCollection = theReg.Matches(string1)
Example 10-12 iterates through the Matches collection, finding each Match object.
If the Length of theMatch is not 0, a match was found, and the example prints the entire match:
If theMatch.Length <> 0 Then
    Console.WriteLine( _
        "theMatch: {0}", _
        theMatch.ToString( ))
Here's the output:
theMatch: 04:03:27 127.0.0.0 LibertyAssociates.com
It then gets the "time" group from theMatch.Groups collection and prints that value:
Console.WriteLine( _
    "time: {0}", _
    theMatch.Groups("time"))
This produces the output:
time: 04:03:27
The code then obtains ip and site groups:
Console.WriteLine( _
    "ip: {0}", _
    theMatch.Groups("ip"))
Console.WriteLine( _
    "site: {0}", _
    theMatch.Groups("site"))
This produces the output:
ip: 127.0.0.0
site: LibertyAssociates.com
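Each Group also records where in the string it was found. The sketch below, which is not part of Example 10-12, reads the Value, Index, and Length properties of the three named groups; the module name and the groupName variable are hypothetical.

```vbnet
Option Strict On
Imports System
Imports System.Text.RegularExpressions

Module GroupDetails
    Sub Main()
        Dim string1 As String = "04:03:27 127.0.0.0 LibertyAssociates.com"
        Dim theReg As New Regex("(?<time>(\d|\:)+)\s" & _
            "(?<ip>(\d|\.)+)\s" & _
            "(?<site>\S+)")
        Dim theMatches As MatchCollection = theReg.Matches(string1)

        If theMatches.Count > 0 Then
            Dim theMatch As Match = theMatches(0)
            For Each groupName As String In New String() {"time", "ip", "site"}
                ' Look up each named group and report its text and position.
                Dim g As Group = theMatch.Groups(groupName)
                Console.WriteLine("{0}: value={1}, index={2}, length={3}", _
                    groupName, g.Value, g.Index, g.Length)
            Next
        End If
        ' time: value=04:03:27, index=0, length=8
        ' ip: value=127.0.0.0, index=9, length=9
        ' site: value=LibertyAssociates.com, index=19, length=21
    End Sub
End Module
```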
In Example 10-12, the Matches collection has only one Match. It is possible, however, to match more than one expression within a string. To see this, modify string1 in Example 10-12 to provide several log file entries instead of one, as follows:
Dim string1 As String = "04:03:27 127.0.0.0 LibertyAssociates.com " & _
    "04:03:28 127.0.0.0 foo.com " & _
    "04:03:29 127.0.0.0 bar.com "
This creates three matches in the MatchCollection, theMatches. Here's the resulting output:
theMatch: 04:03:27 127.0.0.0 LibertyAssociates.com
time: 04:03:27
ip: 127.0.0.0
site: LibertyAssociates.com
theMatch: 04:03:28 127.0.0.0 foo.com
time: 04:03:28
ip: 127.0.0.0
site: foo.com
theMatch: 04:03:29 127.0.0.0 bar.com
time: 04:03:29
ip: 127.0.0.0
site: bar.com
In this example, theMatches contains three Match objects. Each time through the For Each loop, you find the next Match in the collection and display its contents:
For Each theMatch In theMatches
For each of the Match items found, you can print out the entire match, various groups, or both.
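For instance, a variation on the example might print only selected groups from each match, or reach a particular match by its position in the collection. The following is a sketch under those assumptions, with a hypothetical module name:

```vbnet
Option Strict On
Imports System
Imports System.Text.RegularExpressions

Module MultipleMatches
    Sub Main()
        Dim string1 As String = "04:03:27 127.0.0.0 LibertyAssociates.com " & _
            "04:03:28 127.0.0.0 foo.com " & _
            "04:03:29 127.0.0.0 bar.com "
        Dim theReg As New Regex("(?<time>(\d|\:)+)\s" & _
            "(?<ip>(\d|\.)+)\s" & _
            "(?<site>\S+)")
        Dim theMatches As MatchCollection = theReg.Matches(string1)

        Console.WriteLine("matches: {0}", theMatches.Count)
        ' matches: 3

        ' A single match can be fetched by its zero-based index.
        Console.WriteLine("second site: {0}", theMatches(1).Groups("site").Value)
        ' second site: foo.com

        ' Print just two of the three groups for every match.
        For Each theMatch As Match In theMatches
            Console.WriteLine("{0} -> {1}", _
                theMatch.Groups("time").Value, theMatch.Groups("site").Value)
        Next
        ' 04:03:27 -> LibertyAssociates.com
        ' 04:03:28 -> foo.com
        ' 04:03:29 -> bar.com
    End Sub
End Module
```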
Each time a Regex object matches a subexpression, a Capture instance is created and added to a CaptureCollection collection. Each capture object represents a single capture. Each group has its own capture collection of the matches for the subexpression associated with the group.
A key property of the Capture object is its length, which is the length of the captured substring. When you ask Match for its length, it is Capture.Length that you retrieve, because Match derives from Group, which in turn derives from Capture.
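One way to see this inheritance chain is to assign a Match to Group and Capture variables, which compiles without a cast even under Option Strict. This is only an illustrative sketch; the module and variable names are hypothetical.

```vbnet
Option Strict On
Imports System
Imports System.Text.RegularExpressions

Module MatchInheritance
    Sub Main()
        Dim theReg As New Regex("(\S+)\s")
        Dim theMatches As MatchCollection = theReg.Matches("This is a test string")
        Dim theMatch As Match = theMatches(0)

        ' Widening conversions: Match is a Group, and Group is a Capture.
        Dim asGroup As Group = theMatch
        Dim asCapture As Capture = theMatch

        ' Length is defined on Capture, so all three report the same value.
        Console.WriteLine(theMatch.Length)
        ' 5
        Console.WriteLine(asGroup.Length)
        ' 5
        Console.WriteLine(asCapture.Length)
        ' 5
        Console.WriteLine(TypeOf theMatch Is Capture)
        ' True
    End Sub
End Module
```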
Typically, you will find only a single Capture in a CaptureCollection; but that need not be so. Consider what would happen if you were parsing a string in which the company name might occur in either of two positions. To group these together in a single match you create the ?<company> group in two places in your regular expression pattern:
Dim regString As String = "(?<time>(\d|\:)+)\s" & _
    "(?<company>\S+)\s" & _
    "(?<ip>(\d|\.)+)\s" & _
    "(?<company>\S+)\s"
This regular expression group captures any matching string of characters that follows time, and also any matching string of characters that follows ip. Given this regular expression, you are ready to parse the following string:
Dim string1 As String = "04:03:27 Jesse 0.0.0.127 Liberty "
The string includes names in both the positions specified. Here is the result:
theMatch: 04:03:27 Jesse 0.0.0.127 Liberty
time: 04:03:27
ip: 0.0.0.127
Company: Liberty
What happened? Why is the Company group showing Liberty? Where is the first term, which also matched? The answer is that the second term overwrote the first. The group, however, has captured both; its Captures collection can show that to you, as illustrated in Example 10-13.
Imports System
Imports System.Text
Imports System.Text.RegularExpressions

Namespace RegularExpressions
    Class Tester
        Public Sub Run( )
            Dim string1 As String = _
                "04:03:27 Jesse 0.0.0.127 Liberty "
            ' time = one or more digits or colons
            ' followed by a space
            ' company = one or more nonwhitespace characters
            ' followed by a space (appears twice)
            ' ip address = one or more digits or dots
            ' followed by a space
            Dim regString As String = "(?<time>(\d|\:)+)\s" & _
                "(?<company>\S+)\s" & _
                "(?<ip>(\d|\.)+)\s" & _
                "(?<company>\S+)\s"
            Dim theReg As New Regex(regString)
            Dim theMatches As MatchCollection = theReg.Matches(string1)
            Dim theMatch As Match
            For Each theMatch In theMatches
                If theMatch.Length <> 0 Then
                    Console.WriteLine( _
                        "theMatch: {0}", _
                        theMatch.ToString( ))
                    Console.WriteLine( _
                        "time: {0}", _
                        theMatch.Groups("time"))
                    Console.WriteLine( _
                        "ip: {0}", _
                        theMatch.Groups("ip"))
                    Console.WriteLine( _
                        "Company: {0}", _
                        theMatch.Groups("company"))
                    Dim cap As Capture
                    For Each cap In _
                        theMatch.Groups("company").Captures
                        Console.WriteLine( _
                            "cap: {0}", cap.ToString( ))
                    Next
                End If
            Next theMatch
        End Sub 'Run

        Public Shared Sub Main( )
            Dim t As New Tester( )
            t.Run( )
        End Sub 'Main
    End Class 'Tester
End Namespace 'RegularExpressions

Output:
theMatch: 04:03:27 Jesse 0.0.0.127 Liberty
time: 04:03:27
ip: 0.0.0.127
Company: Liberty
cap: Jesse
cap: Liberty
The following lines iterate through the Captures collection for the company group:
Dim cap As Capture
For Each cap In _
    theMatch.Groups("company").Captures
Let's review how this line is parsed. The compiler begins by finding the collection that it will iterate. theMatch is an object that has a collection named Groups. The Groups collection has a default property (as explained in the previous chapter) that takes a string and returns a single Group object. Thus, the following line returns a single Group object:
theMatch.Groups("company")
The Group object has a collection named Captures. Thus, the following line returns a Captures collection for the Group stored at Groups("company") within the theMatch object:
theMatch.Groups("company").Captures
The For Each loop iterates over the Captures collection, extracting each element in turn and assigning it to the local variable cap, which is of type Capture. You can see from the output that there are two capture elements: Jesse and Liberty. The second one overwrites the first in the group, and so the displayed value is just Liberty, but by examining the Captures collection you can find both values that were captured.
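The relationship between a group's value and its captures can also be inspected directly. The sketch below is a variation on Example 10-13, with hypothetical module and variable names, that indexes into the Captures collection and confirms that the group's value is the last capture.

```vbnet
Option Strict On
Imports System
Imports System.Text.RegularExpressions

Module CaptureDetails
    Sub Main()
        Dim string1 As String = "04:03:27 Jesse 0.0.0.127 Liberty "
        Dim regString As String = "(?<time>(\d|\:)+)\s" & _
            "(?<company>\S+)\s" & _
            "(?<ip>(\d|\.)+)\s" & _
            "(?<company>\S+)\s"
        Dim theReg As New Regex(regString)
        Dim theMatches As MatchCollection = theReg.Matches(string1)

        If theMatches.Count > 0 Then
            Dim company As Group = theMatches(0).Groups("company")
            Dim caps As CaptureCollection = company.Captures

            Console.WriteLine("captures: {0}", caps.Count)
            ' captures: 2

            ' Each Capture keeps its own text and position in the input.
            For i As Integer = 0 To caps.Count - 1
                Console.WriteLine("cap {0}: {1} at index {2}", _
                    i, caps(i).Value, caps(i).Index)
            Next
            ' cap 0: Jesse at index 9
            ' cap 1: Liberty at index 25

            ' The group's own value is the most recent capture.
            Console.WriteLine(company.Value = caps(caps.Count - 1).Value)
            ' True
        End If
    End Sub
End Module
```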