1.6 Python
Python
provides a rich, Perl-like regular expression syntax in the
re module. The re module uses a
Traditional NFA match engine. For an explanation of the rules behind
an NFA engine, see Section 1.2.
This chapter covers the version of re included
with Python 2.2, although the module has been available in similar
form since Python 1.5.
1.6.1 Supported Metacharacters
The re module supports the metacharacters and
metasequences listed in Table 1-21 through
Table 1-25. For expanded definitions of each
metacharacter, see Section 1.2.1.
Table 1-21. Character representations
Sequence | Meaning
\a | Alert (bell), x07.
\b | Backspace, x08, supported only in character class.
\n | Newline, x0A.
\r | Carriage return, x0D.
\f | Form feed, x0C.
\t | Horizontal tab, x09.
\v | Vertical tab, x0B.
\octal | Character specified by up to three octal digits.
\xhh | Character specified by a two-digit hexadecimal code.
\uhhhh | Character specified by a four-digit hexadecimal code.
\Uhhhhhhhh | Character specified by an eight-digit hexadecimal code.
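A few of the character representations above can be tried directly. This is a minimal sketch, assuming a current Python interpreter; the variable names are illustrative only.

```python
import re

# \x41 is "A" by hex code; \102 is "B" by three-digit octal code.
hex_and_octal = re.match(r'\x41\102', 'AB')

# \t in the pattern matches a literal tab character.
tab_fields = re.split(r'\t', 'name\tvalue')

# Inside a character class, \b means backspace (x08), not word boundary.
backspace = re.search(r'[\b]', 'abc\x08def')

print(hex_and_octal.group())   # AB
print(tab_fields)              # ['name', 'value']
print(backspace.start())       # 3
```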
Table 1-22. Character classes and class-like constructs
Class | Meaning
[...] | Any character listed or contained within a listed range.
[^...] | Any character that is not listed and is not contained within a listed range.
. | Any character, except a newline (unless DOTALL mode).
\w | Word character, [a-zA-Z0-9_] (unless LOCALE or UNICODE mode).
\W | Non-word character, [^a-zA-Z0-9_] (unless LOCALE or UNICODE mode).
\d | Digit character, [0-9].
\D | Non-digit character, [^0-9].
\s | Whitespace character, [ \t\n\r\f\v].
\S | Nonwhitespace character, [^ \t\n\r\f\v].
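The classes above behave as follows on a short sample string. This is an illustrative sketch only; the sample text and variable names are made up for the demonstration.

```python
import re

sample = 'id_42 \t x-ray\n'

words = re.findall(r'\w+', sample)         # ['id_42', 'x', 'ray']
digits = re.findall(r'\d', sample)         # ['4', '2']
chunks = re.findall(r'\S+', sample)        # ['id_42', 'x-ray']
others = re.findall(r'[^a-z\s]', sample)   # ['_', '4', '2', '-']

# Dot stops at a newline unless DOTALL is set.
no_dotall = re.match(r'a.b', 'a\nb')               # None
with_dotall = re.match(r'a.b', 'a\nb', re.DOTALL)  # a match object
```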
Table 1-23. Anchors and zero-width tests
Sequence | Meaning
^ | Start of string, or after any newline if in MULTILINE match mode.
\A | Start of search string, in all match modes.
$ | End of search string or before a string-ending newline, or before any newline in MULTILINE match mode.
\Z | End of string, in any match mode.
\b | Word boundary.
\B | Not-word-boundary.
(?=...) | Positive lookahead.
(?!...) | Negative lookahead.
(?<=...) | Positive lookbehind.
(?<!...) | Negative lookbehind.
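The following sketch shows how the line anchors change under MULTILINE and how the lookaround tests match without consuming text. The sample strings and names are assumptions chosen for illustration.

```python
import re

text = 'first line\nsecond line\n'

# ^ matches only at the start of the string by default.
start_default = re.findall(r'^\w+', text)               # ['first']
# In MULTILINE mode, ^ also matches after each newline.
start_multi = re.findall(r'^\w+', text, re.MULTILINE)   # ['first', 'second']
# \A ignores MULTILINE and stays pinned to the start.
start_abs = re.findall(r'\A\w+', text, re.MULTILINE)    # ['first']

# $ matches before the string-ending newline.
end_default = re.search(r'line$', text).span()          # (18, 22)
# In MULTILINE mode, $ matches before every newline.
end_multi = [m.start() for m in re.finditer(r'line$', text, re.MULTILINE)]  # [6, 18]

# \b requires a word boundary on both sides here.
whole_word = re.findall(r'\bline\b', 'line inline lines')   # ['line']

# Lookaround tests check context but are not part of the match.
prices = 'cost $15 or 20'
before_comma = re.findall(r'\w+(?=,)', 'a, b, c')           # ['a', 'b']
after_dollar = re.findall(r'(?<=\$)\d+', prices)            # ['15']
not_after_dollar = re.findall(r'(?<!\$)\b\d+', prices)      # ['20']
```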
Table 1-24. Comments and mode modifiers
Modifier/sequence | Mode character | Meaning
I or IGNORECASE or (?i) | i | Case-insensitive matching.
L or LOCALE or (?L) | L | Cause \w, \W, \b, and \B to use the current locale's definition of alphanumeric.
M or MULTILINE or (?m) | m | ^ and $ match next to embedded \n.
S or DOTALL or (?s) | s | Dot (.) matches newline.
U or UNICODE or (?u) | u | Cause \w, \W, \b, and \B to use the Unicode definition of alphanumeric.
X or VERBOSE or (?x) | x | Ignore whitespace and allow comments (#) in pattern.
(?mode) | | Turn listed modes (iLmsux) on for the entire regular expression.
(?#...) | | Treat substring as a comment.
#... | | Treat rest of line as a comment in VERBOSE mode.
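Modes can be passed as a flags argument or embedded in the pattern. The sketch below assumes a current Python interpreter and uses made-up sample data; combining flags with `|` is the conventional form.

```python
import re

# The same mode, given as a flag argument and as an inline modifier.
by_flag = re.search(r'spider', 'SPIDER', re.IGNORECASE)
by_inline = re.search(r'(?i)spider', 'SPIDER')

# Several modes combined: ignore case, line anchors, dot matches newline.
combined = re.findall(r'^b.c$', 'B\nC\nbxc', re.I | re.M | re.S)

# VERBOSE mode ignores pattern whitespace and allows # comments.
phone = re.compile(r'''
    (\d{3})    # exchange
    [-.\s]?    # optional separator
    (\d{4})    # line number
    ''', re.VERBOSE)
parts = phone.match('555-1234').groups()

print(by_flag.group(), by_inline.group())   # SPIDER SPIDER
print(combined)                             # ['B\nC', 'bxc']
print(parts)                                # ('555', '1234')
```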
Table 1-25. Grouping, capturing, conditional, and control
Sequence | Meaning
(...) | Group subpattern and capture submatch into \1,\2,...
(?P<name>...) | Group subpattern and capture submatch into named capture group, name.
(?P=name) | Match text matched by earlier named capture group, name.
\n | Contains the results of the nth earlier submatch.
(?:...) | Groups subpattern, but does not capture submatch.
...|... | Try subpatterns in alternation.
* | Match 0 or more times.
+ | Match 1 or more times.
? | Match 1 or 0 times.
{n} | Match exactly n times.
{x,y} | Match at least x times but no more than y times.
*? | Match 0 or more times, but as few times as possible.
+? | Match 1 or more times, but as few times as possible.
?? | Match 0 or 1 time, but as few times as possible.
{x,y}? | Match at least x times, no more than y times, and as few times as possible.
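A short sketch of the grouping and quantifier constructs above: named and numbered backreferences, non-capturing groups, and the difference between greedy and minimal quantifiers. The sample strings are illustrative assumptions.

```python
import re

# Named group plus (?P=name) backreference: find a doubled word.
doubled = re.search(r'\b(?P<word>\w+)\s+(?P=word)\b', 'Paris in the the spring')
print(doubled.group('word'))                 # the

# Numbered backreference: a character followed by itself.
pair = re.search(r'(\w)\1', 'hello')
print(pair.group())                          # ll

# (?:...) groups without capturing, so findall returns whole matches.
repeats = re.findall(r'(?:ab)+', 'ababab ab')
print(repeats)                               # ['ababab', 'ab']

# Greedy + takes as much as it can; +? takes as little as it can.
html = '<b>bold</b> and <i>italic</i>'
greedy = re.findall(r'<.+>', html)
lazy = re.findall(r'<.+?>', html)
print(greedy)                                # ['<b>bold</b> and <i>italic</i>']
print(lazy)                                  # ['<b>', '</b>', '<i>', '</i>']
```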
1.6.2 re Module Objects and Functions
The re module defines all regular expression
functionality. Pattern matching is done directly through module
functions, or patterns are compiled into regular expression objects
that can be used for repeated pattern matching. Information about the
match, including captured groups, is retrieved through match objects.
Python's raw string syntax, r''
or r"", allows you to specify regular expression
patterns without having to escape embedded backslashes. The
raw-string pattern, r'\n', is equivalent to the
regular string pattern, '\\n'. Python also
provides triple-quoted raw strings for multiline regular expressions:
r'''text''' and r"""text""".
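The equivalence between raw and regular string patterns can be checked directly. This is a small sketch; the sample path is a made-up value.

```python
import re

same = (r'\n' == '\\n')     # True: both are a backslash followed by "n"
length = len(r'\n')         # 2 characters, not a single newline

path = 'C:\\temp'           # the seven-character string C:\temp

# To match one literal backslash, the regex itself needs two.
raw_hit = re.search(r'\\temp', path)       # raw string: two backslashes typed
plain_hit = re.search('\\\\temp', path)    # regular string: four backslashes typed

print(same, length)                         # True 2
print(raw_hit.start(), plain_hit.start())   # 2 2
```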
The re module
defines the following functions and one exception.
- compile( pattern [, flags])
-
Return a regular expression object with the optional mode modifiers,
flags.
- match( pattern, string [, flags])
-
Search for pattern at starting position of
string, and return a match object or
None if no match.
- search( pattern, string [, flags])
-
Search for pattern in
string, and return a match object or
None if no match.
- split( pattern, string [, maxsplit=0])
-
Split string on
pattern. Limit the number of splits to
maxsplit. Submatches from capturing
parentheses are also returned.
- sub( pattern, repl, string [, count=0])
-
Return a string with all or up to count
occurrences of pattern in
string replaced with
repl. repl may
be either a string or a function that takes a match object argument.
- subn( pattern, repl, string [, count=0])
-
Perform sub( ) but return a tuple of the new
string and the number of replacements.
- findall( pattern, string)
-
Return matches of pattern in
string. If
pattern has capturing groups, returns a
list of submatches or a list of tuples of submatches.
- finditer( pattern, string)
-
Return an iterator over matches of pattern
in string. For each match, the iterator
returns a match object.
- escape( string)
-
Return string with nonalphanumerics backslashed so that
string can be matched literally.
- exception error
-
Exception raised if an error occurs during compilation or matching.
This is common if a string passed to a function is not a valid
regular expression.
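The module-level functions listed above can be exercised as follows. This is a sketch with made-up sample data; the helper `double` is hypothetical and exists only to show a function used as repl. The optional maxsplit and count arguments are passed by keyword here.

```python
import re

text = 'alpha, beta;gamma'
plain_split = re.split(r'[,;]\s*', text)                # ['alpha', 'beta', 'gamma']
with_seps = re.split(r'([,;])\s*', text)                # ['alpha', ',', 'beta', ';', 'gamma']
limited = re.split(r'[,;]\s*', text, maxsplit=1)        # ['alpha', 'beta;gamma']

def double(match):
    # Hypothetical helper: repl may be a function taking a match object.
    return str(int(match.group(0)) * 2)

doubled = re.sub(r'\d+', double, '3 apples, 10 pears')  # '6 apples, 20 pears'
first_two = re.sub(r'a', 'A', 'banana', count=2)        # 'bAnAna'
counted = re.subn(r'a', 'A', 'banana')                  # ('bAnAnA', 3)

pairs = re.findall(r'(\w+)=(\d+)', 'a=1, b=22')         # [('a', '1'), ('b', '22')]
spans = [m.span() for m in re.finditer(r'\d+', 'a1 b22')]   # [(1, 2), (4, 6)]

literal = re.escape('a.b')                              # 'a\\.b'

try:
    re.compile('(')          # unbalanced parenthesis
    compiled_ok = True
except re.error:
    compiled_ok = False      # invalid patterns raise re.error
```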
Regular expression objects are created with the
re.compile function.
- flags
-
Return the flags argument used when the object was compiled or 0.
- groupindex
-
Return a dictionary that maps symbolic group names to group numbers.
- pattern
-
Return the pattern string used when the object was compiled.
- match( string [, pos [, endpos]])
- search( string [, pos [, endpos]])
- split( string [, maxsplit=0])
- sub( repl, string [, count=0])
- subn( repl, string [, count=0])
- findall( string)
-
Same as the re module functions, except
pattern is implied. pos
and endpos give start and end string
indexes for the match.
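A compiled regular expression object exposes the attributes and methods above. The sketch below uses illustrative patterns and strings to show pos and endpos limiting where a match may occur.

```python
import re

names = re.compile(r'(?P<first>\w+) (?P<second>\w+)', re.IGNORECASE)
print(names.pattern)                  # (?P<first>\w+) (?P<second>\w+)
print(names.groupindex['second'])     # 2
has_ignorecase = bool(names.flags & re.IGNORECASE)   # True

digits = re.compile(r'\d+')

# match() anchors at the starting position, which defaults to index 0.
at_zero = digits.match('abc123')              # None
# pos moves the starting position.
at_three = digits.match('abc123', 3)          # matches '123'
# endpos limits how far the match may extend.
clipped = digits.search('abc123456', 0, 5)    # matches '12'

print(at_three.group(), clipped.group())      # 123 12
```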
Match objects are created by the
match and
search functions.
- pos
- endpos
-
Value of pos or endpos passed
to search or match.
- re
-
The regular expression object whose match or
search returned this object.
- string
-
String passed to match or
search.
- group([ g1, g2, ...])
-
Return one or more submatches from capturing groups. Groups may be
either numbers corresponding to capturing groups or strings
corresponding to named capturing groups. Group zero corresponds to
the entire match. If no arguments are provided, this function returns
the entire match. Capturing groups that did not match have a result
of None.
- groups([ default])
-
Return a tuple of the results of all capturing groups. Groups that
did not match have the value None or
default.
- groupdict([ default])
-
Return a dictionary of named capture groups, keyed by group name.
Groups that did not match have the value None or
default.
- start([ group])
-
Index of start of substring matched by
group (or start of entire matched string
if no group).
- end([ group])
-
Index of end of substring matched by group
(or end of entire matched string if no
group).
- span([ group])
-
Return a tuple of starting and ending indexes of
group (or matched string if no
group).
- expand( template)
-
Return a string obtained by doing backslash substitution on
template. Character escapes, numeric
backreferences, and named backreferences are expanded.
- lastgroup
-
Name of the last matching capture group, or None
if no match or if the group had no name.
- lastindex
-
Index of the last matching capture group, or None
if no match.
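The match object attributes and methods above can be seen together on one match. This is a sketch with a made-up pattern in which the optional year group does not participate, so it reports None.

```python
import re

m = re.search(r'(?P<month>\d\d)/(?P<day>\d\d)(?:/(?P<year>\d{4}))?',
              'Due 12/30 at noon')

whole = m.group()                  # '12/30' (same as m.group(0))
both = m.group(1, 2)               # ('12', '30')
year = m.group('year')             # None: the group did not match
filled = m.groups('????')          # ('12', '30', '????')
named = m.groupdict()              # {'month': '12', 'day': '30', 'year': None}

bounds = (m.start(), m.end())      # (4, 9)
day_span = m.span('day')           # (7, 9)

# expand() fills a template with numeric or named backreferences.
by_number = m.expand(r'\2.\1')               # '30.12'
by_name = m.expand(r'\g<day>.\g<month>')     # '30.12'

last = (m.lastindex, m.lastgroup)  # (2, 'day')
searched_text = m.string           # 'Due 12/30 at noon'
```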
1.6.3 Unicode Support
re provides limited
Unicode
support. Strings may contain Unicode characters, and individual
Unicode characters can be specified with \u.
Additionally, the UNICODE flag causes
\w, \W, \b,
and \B to recognize all Unicode alphanumerics.
However, re does not provide support for matching
Unicode properties, blocks, or categories.
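As a rough illustration of the UNICODE flag, the sketch below matches accented words in a Unicode string. It assumes an interpreter that accepts the u'' string prefix; the sample text is made up. On recent Python versions, string patterns already use Unicode definitions by default, so the flag is shown only to mirror the description above.

```python
import re

# A Unicode string with two non-ASCII letters, written with \u escapes.
text = u'caf\u00e9 na\u00efve'

# With UNICODE, \w accepts the accented letters as word characters.
words = re.findall(r'\w+', text, re.UNICODE)

# An individual Unicode character specified with \u in the pattern string.
e_acute = re.search(u'\u00e9', text)

print(len(words))         # 2
print(e_acute.start())    # 3
```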
1.6.4 Examples
Example 1-13. Simple match
#Match Spider-Man, Spiderman, SPIDER-MAN, etc.
import re
dailybugle = 'Spider-Man Menaces City!'
pattern = r'spider[- ]?man.'
if re.match(pattern, dailybugle, re.IGNORECASE):
    print(dailybugle)
Example 1-14. Match and capture group
#Match dates formatted like MM/DD/YYYY, MM-DD-YY,...
import re
date = '12/30/1969'
regex = re.compile(r'(\d\d)[-/](\d\d)[-/](\d\d(?:\d\d)?)')
match = regex.match(date)
if match:
    month = match.group(1) #12
    day = match.group(2) #30
    year = match.group(3) #1969
Example 1-15. Simple substitution
#Convert <br> to <br /> for XHTML compliance
import re
text = 'Hello world. <br>'
regex = re.compile(r'<br>', re.IGNORECASE)
repl = r'<br />'
result = regex.sub(repl, text)
Example 1-16. Harder substitution
#urlify - turn URLs into HTML links
import re
text = 'Check the website, http://www.oreilly.com/catalog/repr.'
pattern = r'''
    \b                       # start at word boundary
    (                        # capture to \1
    (https?|telnet|gopher|file|wais|ftp) :
                             # resource and colon
    [\w/#~:.?+=&%@!\-] +?    # one or more valid chars
                             # take little as possible
    )
    (?=                      # lookahead
    [.:?\-] *                # for possible punc
    (?: [^\w/#~:.?+=&%@!\-]  # invalid character
    | $ )                    # or end of string
    )'''
regex = re.compile(pattern, re.IGNORECASE | re.VERBOSE)
result = regex.sub(r'<a href="\1">\1</a>', text)
1.6.5 Other Resources