1.8 The re Module
The re module provides a set of powerful regular expression facilities, which let you quickly check whether a given string matches a given pattern (using the match function) or contains such a pattern (using the search function). A regular expression is a string pattern written in a compact (and quite cryptic) syntax.

The match function attempts to match a pattern against the beginning of the given string, as shown in Example 1-54. If the pattern matches anything at all (including an empty string, if the pattern allows that!), match returns a match object; otherwise it returns None. The group method can be used to find out what matched.

Example 1-54. Using the re Module to Match Strings

File: re-example-1.py

    import re

    text = "The Attila the Hun Show"

    # a single character
    m = re.match(".", text)
    if m: print repr("."), "=>", repr(m.group(0))

    # any string of characters
    m = re.match(".*", text)
    if m: print repr(".*"), "=>", repr(m.group(0))

    # a string of word characters (letters, digits, underscore; at least one)
    m = re.match("\w+", text)
    if m: print repr("\w+"), "=>", repr(m.group(0))

    # a string of digits
    m = re.match("\d+", text)
    if m: print repr("\d+"), "=>", repr(m.group(0))

    '.' => 'T'
    '.*' => 'The Attila the Hun Show'
    '\\w+' => 'The'

The last pattern, \d+, does not match at the beginning of the string, so match returns None and nothing is printed for it.

You can use parentheses to mark regions in the pattern. If the pattern matched, the group method can be used to extract the contents of these regions, as shown in Example 1-55. group(1) returns the contents of the first group, group(2) returns the contents of the second, and so on. If you pass several group numbers to the group method, it returns a tuple.

Example 1-55. Using the re Module to Extract Matching Substrings

File: re-example-2.py

    import re

    text = "10/15/99"

    m = re.match("(\d{2})/(\d{2})/(\d{2,4})", text)
    if m:
        print m.group(1, 2, 3)

    ('10', '15', '99')

The search function searches for the pattern inside the string, as shown in Example 1-56.
It tries the pattern at every possible character position, starting from the left, and returns a match object as soon as it has found a match. If the pattern doesn't match anywhere, it returns None.

Example 1-56. Using the re Module to Search for Substrings

File: re-example-3.py

    import re

    text = "Example 3: There is 1 date 10/25/95 in here!"

    m = re.search("(\d{1,2})/(\d{1,2})/(\d{2,4})", text)

    print m.group(1), m.group(2), m.group(3)

    month, day, year = m.group(1, 2, 3)
    print month, day, year

    date = m.group(0)
    print date

    10 25 95
    10 25 95
    10/25/95

The sub function, used in Example 1-57, replaces every occurrence of a pattern with another string.

Example 1-57. Using the re Module to Replace Substrings

File: re-example-4.py

    import re

    text = "you're no fun anymore..."

    # literal replace (string.replace is faster)
    print re.sub("fun", "entertaining", text)

    # collapse each run of non-word characters into a single dash
    print re.sub("[^\w]+", "-", text)

    # convert all words to beeps
    print re.sub("\S+", "-BEEP-", text)

    you're no entertaining anymore...
    you-re-no-fun-anymore-
    -BEEP- -BEEP- -BEEP- -BEEP-

You can also use sub to replace patterns via a callback function, which is called with the match object for each match and returns the replacement string. Example 1-58 demonstrates this, and also shows how to precompile patterns.

Example 1-58. Using the re Module to Replace Substrings via a Callback Function

File: re-example-5.py

    import re
    import string

    text = "a line of text\\012another line of text\\012etc..."

    def octal(match):
        # replace octal code with corresponding ASCII character
        return chr(string.atoi(match.group(1), 8))

    octal_pattern = re.compile(r"\\(\d\d\d)")

    print text
    print octal_pattern.sub(octal, text)

    a line of text\012another line of text\012etc...
    a line of text
    another line of text
    etc...

If you don't compile patterns yourself, the re module caches compiled versions for you, so you usually don't have to compile regular expressions in small scripts. In Python 1.5.2, the cache holds 20 patterns. In 2.0, the cache size has been increased to 100 patterns.
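As a rough illustration of the callback and precompilation techniques described above, here is a minimal sketch written in Python 3 syntax (an assumption; the examples in this section target Python 1.5.2 and 2.0). It swaps string.atoi, which Python 3 no longer provides, for the built-in int with base 8, and narrows the pattern from \d\d\d to [0-7]{3} so that only valid octal digits reach the callback. The names result and same are illustrative only.

```python
import re

# Precompiled pattern: a backslash followed by exactly three octal digits.
# (Assumption: [0-7]{3} is used instead of \d\d\d so that int(..., 8)
# never sees the digits 8 or 9.)
octal_pattern = re.compile(r"\\([0-7]{3})")

def octal(match):
    # Called once per match; returns the replacement string.
    return chr(int(match.group(1), 8))

text = "a line of text\\012another line of text\\012etc..."

# Using the precompiled pattern object.
result = octal_pattern.sub(octal, text)

# Using the module-level function with the same pattern string,
# relying on the module's internal cache of compiled patterns.
same = re.sub(r"\\([0-7]{3})", octal, text)

print(result)
print(result == same)
```

Both calls should produce the same text; compiling explicitly mainly helps when a pattern is reused many times or needs to be passed around as an object.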
Finally, Example 1-59 matches a string against a list of patterns. The patterns are combined into a single pattern, with each one wrapped in its own group, and precompiled to save time. The returned function reports the index of the first pattern that matches, or None if none of them do.

Example 1-59. Using the re Module to Match Against One of Many Patterns

File: re-example-6.py

    import re, string

    def combined_pattern(patterns):
        p = re.compile(
            string.join(map(lambda x: "("+x+")", patterns), "|")
            )
        def fixup(v, m=p.match, r=range(0,len(patterns))):
            try:
                regs = m(v).regs
            except AttributeError:
                return None # no match, so m.regs will fail
            else:
                for i in r:
                    if regs[i+1] != (-1, -1):
                        return i
        return fixup

    #
    # try it out!

    patterns = [
        r"\d+",
        r"abc\d{2,4}",
        r"p\w+"
    ]

    p = combined_pattern(patterns)

    print p("129391")
    print p("abc800")
    print p("abc1600")
    print p("python")
    print p("perl")
    print p("tcl")

    0
    1
    1
    2
    2
    None
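The combined-pattern idea above can also be sketched in Python 3 syntax (an assumption; this is not the book's code). This variant uses the match object's lastindex attribute instead of scanning regs, and it assumes that the individual patterns contain no capturing groups of their own, since extra groups would shift the group numbers. The name results is illustrative only.

```python
import re

def combined_pattern(patterns):
    # Wrap each pattern in its own group and join the groups with "|".
    # Assumption: the patterns themselves contain no capturing groups.
    matcher = re.compile("|".join("(" + p + ")" for p in patterns)).match

    def fixup(value):
        m = matcher(value)
        if m is None:
            return None  # none of the patterns matched
        # lastindex is the 1-based number of the group that matched.
        return m.lastindex - 1

    return fixup

patterns = [r"\d+", r"abc\d{2,4}", r"p\w+"]
p = combined_pattern(patterns)

results = [p(s) for s in ("129391", "abc800", "abc1600", "python", "perl", "tcl")]
print(results)
```

As in the original, alternatives are tried from left to right, so when several patterns could match, the index of the earliest one in the list is returned.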