Example 19-10. PP4E\Lang\Xml\saxbook.py
"""
XML parsing: SAX is a callback-based API for intercepting parser events
"""
import xml.sax, xml.sax.handler, pprint
class BookHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        self.inTitle = False                      # handle XML parser events
        self.mapping = {}                         # a state machine model

    def startElement(self, name, attributes):
        if name == "book":                        # on start book tag
            self.buffer = ""                      # save ISBN for dict key
            self.isbn = attributes["isbn"]
        elif name == "title":                     # on start title tag
            self.inTitle = True                   # save title text to follow

    def characters(self, data):
        if self.inTitle:                          # on text within tag
            self.buffer += data                   # save text if in title

    def endElement(self, name):
        if name == "title":
            self.inTitle = False                  # on end title tag
            self.mapping[self.isbn] = self.buffer # store title text in dict

parser = xml.sax.make_parser()
handler = BookHandler()
parser.setContentHandler(handler)
parser.parse('books.xml')
pprint.pprint(handler.mapping)
The SAX model is efficient, but it is potentially confusing at first glance, because the
class must keep track of where the parse currently is using state information. For
example, when the start of a book tag is detected, we save its ISBN and initialize a
buffer; when the title tag is first detected, we set a state flag; as each run of text
within the title tag is parsed, we append it to the buffer until the ending portion of
the title tag is encountered. The net effect saves the title tag’s content as a string.
This model is simple in concept, but its state can be complex to manage; in cases of
potentially arbitrary nesting, for instance, state information may need to be stacked
as the class receives callbacks for nested tags.
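To make the point about stacked state concrete, here is a minimal sketch of a handler that tolerates book tags nested inside other book tags. It is not part of the book's examples: the class name NestedBookHandler, the sample document, and its placeholder ISBN values are all assumptions made up for illustration. The only change from the flat handler is that the current ISBN is kept on a list used as a stack rather than in a single attribute.

```python
import xml.sax
import xml.sax.handler

class NestedBookHandler(xml.sax.handler.ContentHandler):
    """Hypothetical variant of BookHandler that stacks ISBNs for nested books."""
    def __init__(self):
        super().__init__()
        self.isbns = []          # stack: ISBNs of the book tags currently open
        self.inTitle = False     # True while inside a title tag
        self.buffer = ""         # text collected for the current title
        self.mapping = {}        # ISBN -> title text

    def startElement(self, name, attributes):
        if name == "book":
            self.isbns.append(attributes["isbn"])    # push on start book tag
        elif name == "title":
            self.inTitle = True
            self.buffer = ""                         # fresh buffer per title

    def characters(self, data):
        if self.inTitle:
            self.buffer += data                      # text may arrive in pieces

    def endElement(self, name):
        if name == "title":
            self.inTitle = False
            self.mapping[self.isbns[-1]] = self.buffer   # innermost open book
        elif name == "book":
            self.isbns.pop()                         # pop on end book tag

# Made-up sample: the inner book closes before the outer title appears
SAMPLE = b"""<catalog>
  <book isbn="outer-isbn">
    <book isbn="inner-isbn"><title>Inner title</title></book>
    <title>Outer title</title>
  </book>
</catalog>"""

handler = NestedBookHandler()
xml.sax.parseString(SAMPLE, handler)
print(handler.mapping)
```

With a single isbn attribute instead of a stack, the outer title in this sample would be filed under the inner book's ISBN, because the inner start tag would have overwritten the saved value. Popping the stack at each closing book tag restores the enclosing book's ISBN.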
To kick off the parse, we make a parser object, set its handler to an instance of our
class, and start the parse; as Python scans the XML file, our class’s methods are called
automatically as components are encountered. When the parse is complete, we use the
Python pprint module to display the result again: the mapping dictionary object
attached to our handler. The result is mostly the same this time, but notice that the
“&amp;” escape sequence is properly un-escaped now; SAX performs XML parsing, not
text matching: