To demonstrate, let’s do some matching on the following strings (see file
re-interactive.txt for all the interactive code run in this section):
>>> text1 = 'Hello spam...World'
>>> text2 = 'Hello spam...other'
The match performed in the following code does not precompile: it executes an
immediate match to look for all the characters between the words Hello and World in our
text strings:
>>> import re
>>> matchobj = re.match('Hello(.*)World', text2)
>>> print(matchobj)
None
When a match fails as it does here (the text2 string doesn’t end in World), we get back
the None object, which is Boolean false if tested in an if statement.
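Because a failed match yields None, the result can be tested directly in an if statement. The following is a minimal sketch of that idiom; the report function is a hypothetical helper written for illustration, not something provided by the re module:

```python
import re

text1 = 'Hello spam...World'
text2 = 'Hello spam...other'

def report(text):
    # Hypothetical helper: describes whether the pattern matched the text
    matchobj = re.match('Hello(.*)World', text)
    if matchobj:                 # a match object tests as true
        return 'matched'
    else:                        # None tests as false
        return 'no match'

print(report(text1))
print(report(text2))
```

Testing the result this way before using it avoids calling methods on None when a match fails.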
In the pattern string we’re using here in the first argument to re.match, the words
Hello and World match themselves, and (.*) means any character (.) repeated zero or
more times (*). The fact that it is enclosed in parentheses tells Python to save away the
part of the string matched by that part of the pattern as a group—a matched substring
available after the match. To see how, we need to make a match work:
>>> matchobj = re.match('Hello(.*)World', text1)
>>> print(matchobj)
<_sre.SRE_Match object at 0x009D6520>
>>> matchobj.group(1)
' spam...'
When a match succeeds, we get back a match object, which has interfaces for extracting
matched substrings—the group(1) call returns the portion of the string matched by the
first, leftmost, parenthesized portion of the pattern (our (.*)). As mentioned, matching
is not just a yes/no answer; by enclosing parts of the pattern in parentheses, it is also a
way to extract matched substrings. In this case, we’ve parsed out the text between
Hello and World. Group number 0 is the entire string matched by the pattern—useful
if you want to be sure your pattern is consuming all the text you think it is.
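As a rough illustration of the difference between group 0 and group 1, the sketch below compares them on the original string and on a longer one. The text3 string and the variable names are assumptions made up for this example:

```python
import re

text1 = 'Hello spam...World'
text3 = 'Hello spam...World, and more'    # hypothetical string with trailing text

m1 = re.match('Hello(.*)World', text1)
m3 = re.match('Hello(.*)World', text3)

whole1 = m1.group(0)      # everything the pattern matched in text1
inner1 = m1.group(1)      # only the parenthesized part
whole3 = m3.group(0)      # the match stops at World; trailing text is left over

# Comparing group 0 to the subject string shows whether all text was consumed
consumed_all = (whole3 == text3)
print(repr(whole1), repr(inner1), repr(whole3), consumed_all)
```

Here group 0 equals the full string only for text1; for text3 the comparison is false, which is one way to notice that a pattern matched less text than expected.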
The interface for precompiling is similar, but the pattern is implied in the pattern
object we get back from the compile call:
>>> pattobj = re.compile('Hello(.*)World')
>>> matchobj = pattobj.match(text1)
>>> matchobj.group(1)
' spam...'
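The same pattern object can be reused across many strings, which is where precompiling tends to pay off. The following sketch applies one compiled pattern to a small list standing in for the lines of a file; the lines list and the found name are assumptions for illustration only:

```python
import re

pattobj = re.compile('Hello(.*)World')    # compile once, before the loop

# Hypothetical stand-in for lines read from a file
lines = ['Hello spam...World',
         'Hello spam...other',
         'Hello eggs World']

found = []
for line in lines:
    matchobj = pattobj.match(line)        # reuse the compiled pattern
    if matchobj:                          # skip lines that do not match
        found.append(matchobj.group(1))

print(found)
```

Only the first and third strings match, so found holds the two captured substrings.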
Again, you should precompile for speed if you will run the pattern multiple times, and
you normally will when scanning files line by line. Here’s something a bit more complex
that hints at the generality of patterns. This one allows for zero or more blanks or tabs
at the front ([ \t]*), skips one or more after the word Hello ([ \t]+), captures characters