Programming Python, 4th Edition, by Mark Lutz (text edition)


Let’s quickly explore ways to extract this file’s book ISBN numbers and corresponding
titles by example, using each of the four primary Python tools at our disposal—patterns,
SAX, DOM, and ElementTree.


Regular expression parsing


In some contexts, the regular expressions we met earlier can be used to parse
information from XML files. They are not complete parsers, and are not very robust or
accurate in the presence of arbitrary text (text in tag attributes can especially throw
them off). Where applicable, though, they offer a simple option. Example 19-9 shows
how we might go about parsing the XML file in Example 19-8 with the prior section’s
re module. Like all four examples in this section, it scans the XML file looking for
ISBN numbers and associated titles, and stores the two as keys and values in a Python
dictionary.


Example 19-9. PP4E\Lang\Xml\rebook.py


"""
XML parsing: regular expressions (not robust or general)
"""


import re, pprint
text = open('books.xml').read() # str if str pattern
pattern = '(?s)isbn="(.*?)".*?<title>(.*?)</title>' # *?=nongreedy
found = re.findall(pattern, text) # (?s)=dot matches \n
mapping = {isbn: title for (isbn, title) in found} # dict from tuple list
pprint.pprint(mapping)


When run, the re.findall method locates all the nested tags we’re interested in,
extracts their content, and returns a list of tuples representing the two parenthesized
groups in the pattern. Python’s pprint module displays the dictionary created by the
comprehension nicely. The extraction works, but only as long as the text doesn’t deviate
from the expected pattern in ways that would invalidate our script. Moreover, the XML
entity for “&” in the first book’s title is not un-escaped automatically:


C:\...\PP4E\Lang\Xml> python rebook.py
{'0-596-00128-2': 'Python &amp; XML',
 '0-596-00797-3': 'Python Cookbook, 2nd Edition',
 '0-596-10046-9': 'Python in a Nutshell, 2nd Edition',
 '0-596-15806-8': 'Learning Python, 4th Edition',
 '0-596-15808-4': 'Python Pocket Reference, 4th Edition',
 '0-596-15810-6': 'Programming Python, 4th Edition'}
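
If the un-escaped text is needed, one option is to post-process each title by hand. The following is a minimal sketch rather than part of the book's example: it assumes titles are wrapped in title tags, uses a small hypothetical stand-in string in place of books.xml, and relies on the standard library's xml.sax.saxutils.unescape function, which converts the &amp;amp;, &amp;lt;, and &amp;gt; entities back to their characters.

```python
import re
from xml.sax.saxutils import unescape   # handles &amp;, &lt;, and &gt;

# Hypothetical stand-in for books.xml, reduced to two entries
sample = '''<catalog>
  <book isbn="0-596-00128-2">
    <title>Python &amp; XML</title>
  </book>
  <book isbn="0-596-15810-6">
    <title>Programming Python, 4th Edition</title>
  </book>
</catalog>'''

pattern = '(?s)isbn="(.*?)".*?<title>(.*?)</title>'   # same idea as rebook.py
found = re.findall(pattern, sample)                   # list of (isbn, title)

# un-escape each title while building the dictionary
mapping = {isbn: unescape(title) for (isbn, title) in found}
print(mapping)
```

Here the first title comes back as 'Python & XML' instead of retaining the entity text. This is still a manual step layered on a fragile pattern; the full parsers covered next perform this translation as part of the parse.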

SAX parsing


To do better, Python’s full-blown XML parsing tools let us perform this data extraction
in a more accurate and robust way. Example 19-10, for instance, defines a SAX-based
parsing procedure: its class implements callback methods that will be called during the
parse, and its top-level code creates and runs a parser.

