“Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems.”
The re module provides a set of powerful regular expression facilities, which allow you to quickly check whether a given string matches a given pattern (using the match function) or contains such a pattern (using the search function). A regular expression is a string pattern written in a compact (and quite cryptic) syntax.
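The difference between the two functions can be sketched as follows. This is a minimal illustration written in Python 3 syntax (the listings in this section use the older print statement); the sample string and the variable names are assumptions chosen for the sketch.

```python
import re

text = "The Attila the Hun Show"

# match only succeeds if the pattern matches at the start of the string
at_start = re.match("The", text)
not_at_start = re.match("Hun", text)

# search succeeds if the pattern matches anywhere in the string
anywhere = re.search("Hun", text)

print(at_start is not None)      # True
print(not_at_start is not None)  # False: "Hun" is not at the start
print(anywhere.group(0))         # Hun
```

Both functions return None on failure, so the result can be tested directly in an if statement.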
The match function attempts to match a pattern against the beginning of the given string, as shown in Example 1-54. If the pattern matches anything at all (including an empty string, if the pattern allows that!), match returns a match object. The group method can be used to find out what matched.
Example 1-54. Using the re Module to Match Strings
File: re-example-1.py

import re

text = "The Attila the Hun Show"

# a single character
m = re.match(".", text)
if m: print repr("."), "=>", repr(m.group(0))

# any string of characters
m = re.match(".*", text)
if m: print repr(".*"), "=>", repr(m.group(0))

# a string of letters (at least one)
m = re.match("\w+", text)
if m: print repr("\w+"), "=>", repr(m.group(0))

# a string of digits (the text does not start with a digit, so nothing is printed)
m = re.match("\d+", text)
if m: print repr("\d+"), "=>", repr(m.group(0))

'.' => 'T'
'.*' => 'The Attila the Hun Show'
'\\w+' => 'The'
You can use parentheses to mark regions in the pattern. If the pattern matched, the group method can be used to extract the contents of these regions, as shown in Example 1-55. group(1) returns the contents of the first group, group(2) returns the contents of the second, and so on. If you pass several group numbers to the group method, it returns a tuple.
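A minimal sketch of group extraction, written in Python 3 syntax. The date string and the pattern are illustrative assumptions, not taken from the book's own listing.

```python
import re

text = "10/15/99"

# three parenthesized regions: month, day, year
m = re.match(r"(\d{2})/(\d{2})/(\d{2,4})", text)

whole = m.group(0)         # the entire match
first = m.group(1)         # contents of the first group
pair = m.group(1, 3)       # several group numbers give a tuple
everything = m.groups()    # all groups at once, as a tuple

print(whole)       # 10/15/99
print(first)       # 10
print(pair)        # ('10', '99')
print(everything)  # ('10', '15', '99')
```

group(0) always refers to the whole match; the numbered groups are counted by their opening parenthesis, from left to right.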
The search function searches for the pattern inside the string, as shown in Example 1-56. It tries the pattern at every possible character position, starting from the left, and returns a match object as soon as it has found a match. If the pattern doesn’t match anywhere, it returns None.
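The left-to-right scan can be illustrated with a short sketch in Python 3 syntax. The sample text and pattern are assumptions made for this illustration; note that the digits "3" and "1" earlier in the string are tried first and rejected because no slash follows them.

```python
import re

text = "Example 3: There is 1 date 10/25/95 in here!"

# search scans from the left and stops at the first position that matches
m = re.search(r"(\d{1,2})/(\d{1,2})/(\d{2,4})", text)
found = m.group(0)
located = text[m.start():m.end()]   # start() and end() give the match position

# a pattern that occurs nowhere in the string yields None
missing = re.search(r"\d{4}-\d{2}-\d{2}", text)

print(found)    # 10/25/95
print(located)  # 10/25/95
print(missing)  # None
```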
The sub function used in Example 1-57 can be used to replace patterns with another string.
Example 1-57. Using the re Module to Replace Substrings
File: re-example-4.py

import re

text = "you're no fun anymore..."

# literal replace (string.replace is faster)
print re.sub("fun", "entertaining", text)

# collapse all non-letter sequences to a single dash
print re.sub("[^\w]+", "-", text)

# convert all words to beeps
print re.sub("\S+", "-BEEP-", text)

you're no entertaining anymore...
you-re-no-fun-anymore-
-BEEP- -BEEP- -BEEP- -BEEP-
You can also use sub to replace patterns via a callback function. Example 1-58 shows this, and also shows how to precompile patterns.
Example 1-58. Using the re Module to Replace Substrings via the callback Function
File: re-example-5.py

import re
import string

text = "a line of text\\012another line of text\\012etc..."

def octal(match):
    # replace octal code with corresponding ASCII character
    return chr(string.atoi(match.group(1), 8))

octal_pattern = re.compile(r"\\(\d\d\d)")

print text
print octal_pattern.sub(octal, text)

a line of text\012another line of text\012etc...
a line of text
another line of text
etc...
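For readers on Python 3, where string.atoi is no longer available, the same callback technique might be sketched as follows. This is an adaptation under stated assumptions, not the book's listing: it uses int with an explicit base, and it narrows the pattern to the octal digits 0-7 so that the conversion cannot fail.

```python
import re

# the doubled backslashes put a literal backslash followed by 012 in the string
text = "a line of text\\012another line of text\\012etc..."

def octal(match):
    # convert the three octal digits captured by group 1 to a character
    return chr(int(match.group(1), 8))

# precompiled pattern: a literal backslash followed by three octal digits
octal_pattern = re.compile(r"\\([0-7]{3})")

result = octal_pattern.sub(octal, text)
print(result)
```

The callback receives a match object for each occurrence and returns the replacement string, so octal 012 (decimal 10) becomes a newline character.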
If you don’t compile, the re module caches compiled versions for you, so you usually don’t have to compile regular expressions in small scripts. In Python 1.5.2, the cache holds 20 patterns. In 2.0, the cache size has been increased to 100 patterns.
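The two styles can be compared in a short sketch, written in Python 3 syntax. The word list and names are assumptions for illustration; the point is only that a precompiled pattern object and the module-level function give the same results.

```python
import re

words = ["python", "perl", "tcl"]

# explicit compilation: the pattern object is built once and reused
p_word = re.compile(r"p\w+")
compiled_hits = [w for w in words if p_word.match(w)]

# module-level function: re compiles the pattern and caches it internally
cached_hits = [w for w in words if re.match(r"p\w+", w)]

print(compiled_hits)                 # ['python', 'perl']
print(compiled_hits == cached_hits)  # True
```

Explicit compilation is mostly worthwhile when a pattern is used many times, or when a program uses more distinct patterns than the cache can hold.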
Finally, Example 1-59 matches a string against a list of patterns. The list of patterns is combined into a single pattern, and precompiled to save time.
Example 1-59. Using the re Module to Match Against One of Many Patterns
File: re-example-6.py

import re, string

def combined_pattern(patterns):
    p = re.compile(
        string.join(map(lambda x: "("+x+")", patterns), "|")
        )
    def fixup(v, m=p.match, r=range(0,len(patterns))):
        try:
            regs = m(v).regs
        except AttributeError:
            return None # no match, so m.regs will fail
        else:
            for i in r:
                if regs[i+1] != (-1, -1):
                    return i
    return fixup

#
# try it out!

patterns = [
    r"\d+",
    r"abc\d{2,4}",
    r"p\w+"
]

p = combined_pattern(patterns)

print p("129391")
print p("abc800")
print p("abc1600")
print p("python")
print p("perl")
print p("tcl")

0
1
1
2
2
None
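The same idea might be sketched in Python 3 as follows. This is a different technique from the listing above: instead of inspecting the regs attribute, it relies on the match object's lastindex attribute, and it assumes that none of the individual patterns contains capturing groups of its own (otherwise the group numbers would shift). The helper name which is hypothetical.

```python
import re

def combined_pattern(patterns):
    # wrap each pattern in a capturing group and join the groups with "|"
    combined = re.compile("|".join("(" + p + ")" for p in patterns))

    def which(value):
        m = combined.match(value)
        if m is None:
            return None          # none of the patterns matched
        # lastindex is the 1-based number of the group that matched
        return m.lastindex - 1

    return which

patterns = [r"\d+", r"abc\d{2,4}", r"p\w+"]
p = combined_pattern(patterns)

samples = ["129391", "abc800", "abc1600", "python", "perl", "tcl"]
results = [p(s) for s in samples]
print(results)  # [0, 1, 1, 2, 2, None]
```

Because the alternatives are tried from left to right, the returned index is that of the first pattern in the list that matches at the start of the string.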