C:\...\PP4E\Lang> python cheader.py test.h
2 defined TEST_H =
4 include stdio.h
5 include lib/spam.h
6 include Python.h
8 defined DEBUG =
9 defined HELLO = 'hello regex world'
10 defined SPAM = 1234
12 defined EGGS = sunny + side + up
13 defined ADDER = (arg) 123 + arg
For an additional example of regular expressions at work, see the file pygrep1.py in the
book examples package; it implements a simple pattern-based “grep” file search utility,
though its listing was cut here for space. As we’ll see, we can also sometimes use regular
expressions to parse information from XML and HTML text, the topics of the next section.
XML and HTML Parsing
Beyond string objects and regular expressions, Python ships with support for parsing
some specific and commonly used types of formatted text. In particular, it provides
precoded parsers for XML and HTML which we can deploy and customize for our text
processing goals.
In the XML department, Python includes parsing support in its standard library and
plays host to a prolific XML special-interest group. XML (for eXtensible Markup
Language) is a tag-based markup language for describing many kinds of structured data.
Among other things, it has been adopted as a standard representation for database and
Internet content in many contexts. As an object-oriented scripting language,
Python mixes remarkably well with XML’s core notion of structured document
interchange.
XML is based upon a tag syntax familiar to web page writers, used to describe and
package data. The xml module package in Python’s standard library includes tools for
parsing this data from XML text, with both the SAX and the DOM standard parsing
models, as well as the Python-specific ElementTree package. Although regular
expressions can sometimes extract information from XML documents, too, they can be easily
misled by unexpected text, and don’t directly support the notion of arbitrarily nested
XML constructs (more on this limitation later when we explore languages in general).
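To make the nesting limitation concrete, here is a minimal sketch that compares a regular expression with a real parser on nested tags. The sample text and its tag names are invented for illustration only, and the pattern shown is just one plausible attempt, not a recommended technique.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample text: one <list> element nested inside another
text = '<list><item>a</item><list><item>b</item></list><item>c</item></list>'

# A non-greedy pattern stops at the first closing tag it finds, which
# belongs to the inner <list>, so the outer element is cut short
match = re.search(r'<list>(.*?)</list>', text)
print(match.group(1))

# A real parser tracks nesting, so the outer element keeps all its children
root = ET.fromstring(text)
print([child.tag for child in root])
```

Making the pattern greedy fixes this one case but fails as soon as two sibling lists appear in the same text, which is why a true parser is usually the safer choice for nested constructs.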
In short, SAX parsers require us to provide a handler subclass whose methods are called
during the parsing operation, while DOM parsers give us access to an object tree
representing the (usually) already parsed document. SAX-based code is essentially a state
machine and must record (and possibly stack) document details as the parse progresses;
DOM-based code walks object trees using loops, attributes, and methods defined by the
DOM standard. ElementTree is roughly a Python-specific analog of DOM, and as such can
often yield simpler code; it can also be used to generate XML text from its object-based
representation.
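As a rough preview of how the three models differ in practice, the following sketch extracts the same text from one small document with each of them. The document, its tag names, and the TitleHandler class are assumptions made up for this illustration; only the standard library calls themselves are real.

```python
import xml.sax
import xml.dom.minidom
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for illustration
xmltext = ('<books><title>Learning Python</title>'
           '<title>Programming Python</title></books>')

# SAX: we subclass a handler, and the parser calls its methods as it scans;
# the handler must keep its own state between callbacks
class TitleHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.in_title = False              # state: are we inside <title>?
        self.titles = []
    def startElement(self, name, attrs):
        if name == 'title':
            self.in_title = True
            self.titles.append('')
    def endElement(self, name):
        if name == 'title':
            self.in_title = False
    def characters(self, content):
        if self.in_title:
            self.titles[-1] += content     # text may arrive in pieces

handler = TitleHandler()
xml.sax.parseString(xmltext.encode('utf-8'), handler)
sax_titles = handler.titles

# DOM: parse the whole document into an object tree, then walk the tree
doc = xml.dom.minidom.parseString(xmltext)
dom_titles = [node.firstChild.data
              for node in doc.getElementsByTagName('title')]

# ElementTree: also a tree, but with a simpler Python-specific interface
root = ET.fromstring(xmltext)
et_titles = [elem.text for elem in root.findall('title')]

print(sax_titles, dom_titles, et_titles)

# ElementTree can also generate XML text from its objects
ET.SubElement(root, 'title').text = 'Python Pocket Reference'
print(ET.tostring(root, encoding='unicode'))
```

All three approaches collect the same two titles here; the SAX version is the longest because it has to track by hand where the parser currently is, while the tree-based versions simply query a document that has already been parsed.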