Programming Python, 4th Edition, Mark Lutz (text version)

locate and extract bracketed text anywhere in a string, even pairs with optional text
between:


>>> '<spam>/<ham>/<eggs>'.find('ham') # find substring offset
8
>>> re.findall('<(.*?)>', '<spam>/<ham>/<eggs>') # find all matches/groups
['spam', 'ham', 'eggs']
>>> re.findall('<(.*?)>', '<spam> / <ham><eggs>')
['spam', 'ham', 'eggs']

>>> re.findall('<(.*?)>/?<(.*?)>', '<spam>/<ham> ... <eggs><cheese>')
[('spam', 'ham'), ('eggs', 'cheese')]
>>> re.search('<(.*?)>/?<(.*?)>', 'todays menu: <spam>/<ham>...<eggs><s>').groups()
('spam', 'ham')
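
Outside the interactive prompt, the same extraction is often wrapped in a small function so the pattern is compiled only once. The following is a minimal sketch rather than code from this chapter: the names BRACKETED and bracketed_items are hypothetical, and only the pattern itself comes from the session above.

```python
import re

# Hypothetical names for illustration; the pattern is the one used above.
BRACKETED = re.compile(r'<(.*?)>')       # nongreedy: stop at the first '>'

def bracketed_items(text):
    """Return the text inside each <...> pair, in left-to-right order."""
    return BRACKETED.findall(text)

for sample in ('<spam>/<ham>/<eggs>', '<spam> / <ham><eggs>', 'no brackets'):
    print(repr(sample), '=>', bracketed_items(sample))
```

Because findall returns an empty list when nothing matches, callers of such a helper need no special case for text without brackets.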

Especially when using findall, the (?s) operator comes in handy to force . to match
end-of-line characters in multiline text; without it, . matches everything except line
ends. The following searches look for two adjacent bracketed strings with arbitrary text
between, with and without skipping line breaks:


>>> re.findall('<(.*?)>.*<(.*?)>', '<spam> \n <ham>\n<eggs>') # stop at \n
[]
>>> re.findall('(?s)<(.*?)>.*<(.*?)>', '<spam> \n <ham>\n<eggs>') # greedy
[('spam', 'eggs')]
>>> re.findall('(?s)<(.*?)>.*?<(.*?)>', '<spam> \n <ham>\n<eggs>') # nongreedy
[('spam', 'ham')]
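
The inline (?s) code has a flag equivalent in the re module: re.DOTALL (also spelled re.S) can be passed to the matching function or to re.compile. The sketch below assumes the same test string as the session above and simply checks that the three spellings agree; the variable names are illustrative only.

```python
import re

text = '<spam> \n <ham>\n<eggs>'
pattern = '<(.*?)>.*?<(.*?)>'            # same nongreedy pattern, minus the (?s)

inline   = re.findall('(?s)' + pattern, text)               # flag inside the pattern
flagged  = re.findall(pattern, text, re.DOTALL)             # flag as an argument
compiled = re.compile(pattern, re.DOTALL).findall(text)     # flag at compile time

print(inline, flagged, compiled)
print(inline == flagged == compiled)
```

Which form to use is largely a matter of taste: the inline code travels with the pattern string, while the flag argument keeps the pattern itself a bit easier to read.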

To make larger patterns more mnemonic, we can even associate names with matched
substring groups by using the (?P<name>...) pattern syntax and fetch them by name
after matches, though this is of limited utility for findall. The next tests look for
strings of “word” characters (\w) separated by a /. This isn’t much more than a string
split, but parts are named, and search and findall both scan ahead:


>>> re.search(r'(?P<part1>\w*)/(?P<part2>\w*)', '...aaa/bbb/ccc]').groups()
('aaa', 'bbb')
>>> re.search(r'(?P<part1>\w*)/(?P<part2>\w*)', '...aaa/bbb/ccc]').groupdict()
{'part1': 'aaa', 'part2': 'bbb'}

>>> re.search(r'(?P<part1>\w*)/(?P<part2>\w*)', '...aaa/bbb/ccc]').group(2)
'bbb'
>>> re.search(r'(?P<part1>\w*)/(?P<part2>\w*)', '...aaa/bbb/ccc]').group('part2')
'bbb'

>>> re.findall(r'(?P<part1>\w*)/(?P<part2>\w*)', '...aaa/bbb ccc/ddd]')
[('aaa', 'bbb'), ('ccc', 'ddd')]
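
Since findall returns plain tuples, the group names are lost in its result. If the names matter for every match, one option is re.finditer, which yields match objects that still support groupdict and group(name). This is a sketch under that assumption, reusing the pattern and text from the last test; the variable names are illustrative.

```python
import re

# Raw string keeps \w from being treated as a string escape.
pattern = re.compile(r'(?P<part1>\w*)/(?P<part2>\w*)')
text = '...aaa/bbb ccc/ddd]'

# One dictionary per match, keyed by group name.
pairs = [match.groupdict() for match in pattern.finditer(text)]
print(pairs)

# Named access also works on each match object individually.
for match in pattern.finditer(text):
    print(match.group('part1'), '->', match.group('part2'))
```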

Finally, although basic string operations such as slicing and splits are sometimes
enough, patterns are much more flexible. The following uses [^ ] to match any
character not listed after the ^, and escapes a dash within a [] alternative set using \-
so it’s not taken to be a character set range separator. It runs equivalent slices, splits,
and matches, along with a more general match that the other two cannot approach:


>>> line = 'aaa bbb ccc'
>>> line[:3], line[4:7], line[8:11] # slice data at fixed offsets

1420 | Chapter 19: Text and Language
