17.3 Regular Expressions
Regular expressions are a powerful
language for describing and manipulating text. Underlying regular
expressions is a technique called pattern
matching, which involves comparing one string to another,
or comparing a series of wildcards that represent a type of string to
a literal string. A regular expression is applied to a string, that is,
to a sequence of characters. Often that string is an entire text document.
Applying a regular expression to a string either returns a substring
or returns a new string representing a modification of some part of
the original string. (Remember that
string objects are immutable and so cannot be changed by the regular
expression.)
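As a minimal sketch of these two outcomes, the following uses Python's standard re module. The choice of Python, the sample text, and the variable names are assumptions made for illustration only. It shows one call that returns a substring and one that returns a modified copy, leaving the original string untouched.

```python
import re

# Sample text invented for this illustration.
original = "regular expressions describe text"

# search() returns a match object (or None); group() yields the matched substring.
match = re.search(r"express\w*", original)
substring = match.group() if match else None

# sub() builds and returns a new string; the original string is not changed.
modified = re.sub(r"text", "TEXT", original)

print(substring)
print(modified)
print(original)
```

Because strings are immutable, the only way to keep the modified text is to hold on to the string that sub() returns.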
By applying a properly constructed regular expression to the
following string:
One,Two,Three Liberty Associates, Inc.
you can return any or all of its substrings (e.g., Liberty or One) or
modified versions of its substrings (e.g., LIBeRtY or OnE). What the
regular expression does is determined by the syntax of the regular
expression itself.
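The sample string above can be explored with a short sketch, again assuming Python's re module. The specific patterns are illustrative choices, not the only way to obtain these results: one pattern returns a single substring, one returns all of the substrings, and one produces a modified version of a substring.

```python
import re

text = "One,Two,Three Liberty Associates, Inc."

# Return one substring: the literal word "Liberty".
liberty = re.search(r"Liberty", text).group()

# Return all of the substrings by splitting on runs of commas and spaces.
parts = re.split(r"[, ]+", text)

# Return a modified version: uppercase the matched substring in a new string.
shouted = re.sub(r"Liberty", lambda m: m.group().upper(), text)

print(liberty)
print(parts)
print(shouted)
```

In each case the pattern alone decides what comes back; the input string is the same throughout.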
A regular expression consists of two types of characters: literals and
metacharacters. A literal is a character you want to match in the target
string. A metacharacter is a special symbol that acts as a command to the
regular expression parser. The parser is the engine responsible for
understanding the regular expression. For example, if you create a
regular expression:
^(From|To|Subject|Date):
this will match any substring consisting of the letters "From", "To",
"Subject", or "Date", so long as those letters start a new line (^)
and are followed by a colon (:).
The caret (^) indicates to the regular expression parser that the string
you're searching for must begin a new line. The letters "From" and "To"
are literals, and the metacharacters left and right parentheses, ( and ),
and the vertical bar (|) are used together to group sets of literals and
indicate that any of the choices should match. Thus you would read
the following line as "match any string that begins
a new line, followed by any of the four literal strings From, To,
Subject, or Date, and followed by a colon":
^(From|To|Subject|Date):
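To see this pattern at work, here is a small sketch assuming Python's re module and a made-up message (the addresses and header values are hypothetical). One detail to be aware of: in Python, as in many regular expression engines, the caret matches only at the very start of the input unless a multiline option is enabled, so the sketch passes re.MULTILINE to get the "begins a new line" behavior described above.

```python
import re

# Hypothetical message text, invented for this illustration.
message = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Lunch\n"
    "Note: From: here does not begin a line\n"
    "Date: Monday\n"
)

# MULTILINE makes ^ match at the start of every line, not just the input.
header = re.compile(r"^(From|To|Subject|Date):", re.MULTILINE)

# With one group in the pattern, findall() returns the text each group matched.
names = header.findall(message)

# Without MULTILINE, ^ matches only at the very beginning of the string.
first_only = re.findall(r"^(From|To|Subject|Date):", message)

print(names)
print(first_only)
```

The "From:" inside the Note line is skipped because it does not begin a line, which is exactly the constraint the caret expresses.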
A full explanation of regular expressions is beyond the scope of this
book, but all the regular expressions used in the examples are
explained. For a complete understanding of regular expressions, I
highly recommend Mastering Regular
Expressions, Second Edition, by Jeffrey E. F. Friedl
(O'Reilly).