Programming Python, 4th Edition, by Mark Lutz (text edition)

>>> from html.parser import HTMLParser
>>> class ParsePage(HTMLParser):
...     def handle_starttag(self, tag, attrs):
...         print('Tag start:', tag, attrs)
...     def handle_endtag(self, tag):
...         print('tag end: ', tag)
...     def handle_data(self, data):
...         print('data......', data.rstrip())
...

Now, create a web page’s HTML text string; we hardcode one here, but it might also
be loaded from a file, or fetched from a website with urllib.request:


>>> page = """
... <html>
... <h1>Spam!</h1>
... <p>Click this <a href="http://www.python.org">python</a> link</p>
... </html>"""
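
As an alternative to the hardcoded string, the page text could come from a file or a URL. The following is a minimal sketch of both options, not code from the book: the helper names load_page_from_file and fetch_page, the temporary file page.html, and the UTF-8 default encoding are all assumptions made for illustration. Only the file-based helper is exercised here, so the example runs without network access.

```python
import os
import tempfile
import urllib.request

def load_page_from_file(path, encoding='utf-8'):
    """Return the HTML text stored in a local file (hypothetical helper)."""
    with open(path, encoding=encoding) as f:
        return f.read()

def fetch_page(url, default_encoding='utf-8'):
    """Return the HTML text served at url (hypothetical helper; needs network)."""
    with urllib.request.urlopen(url) as response:
        # Use the charset from the Content-Type header when the server sends one.
        charset = response.headers.get_content_charset() or default_encoding
        return response.read().decode(charset)

# Demo with a temporary file, so no network access is required.
html_text = '<html>\n<h1>Spam!</h1>\n</html>'
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'page.html')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(html_text)
    page = load_page_from_file(path)

print(page == html_text)    # True
```

Either helper returns a plain str, which is what the parser's feed method expects.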

Finally, kick off the parse by feeding text to a parser instance. Tags in the HTML text
trigger callbacks to the methods of our class, with tag names and attribute sequences
passed in as arguments:


>>> parser = ParsePage()
>>> parser.feed(page)
data......
Tag start: html []
data......
Tag start: h1 []
data...... Spam!
tag end:  h1
data......
Tag start: p []
data...... Click this
Tag start: a [('href', 'http://www.python.org')]
data...... python
tag end:  a
data......  link
tag end:  p
data......
tag end:  html
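
Feeding text to a parser instance need not happen in a single call. As a rough sketch under stated assumptions (the EventRecorder class and the 10-character chunk size are invented for this illustration), the code below feeds the same page once as a whole and once in small pieces, then compares the recorded events. Text data may arrive split across several handle_data calls when the input is chunked, so the text is joined before comparing.

```python
from html.parser import HTMLParser

class EventRecorder(HTMLParser):
    """Record tag events and text data instead of printing them (hypothetical)."""
    def __init__(self):
        super().__init__()
        self.tags = []      # ('start', tag, attrs) and ('end', tag) tuples
        self.text = []      # text pieces, in the order they were delivered

    def handle_starttag(self, tag, attrs):
        self.tags.append(('start', tag, attrs))

    def handle_endtag(self, tag):
        self.tags.append(('end', tag))

    def handle_data(self, data):
        self.text.append(data)

page = """
<html>
<h1>Spam!</h1>
<p>Click this <a href="http://www.python.org">python</a> link</p>
</html>"""

# Feed the whole page in one call.
whole = EventRecorder()
whole.feed(page)
whole.close()

# Feed the same page in 10-character chunks; incomplete tags are buffered
# by the parser until the rest of the tag arrives.
chunked = EventRecorder()
for start in range(0, len(page), 10):
    chunked.feed(page[start:start + 10])
chunked.close()

print(whole.tags == chunked.tags)                       # True
print(''.join(whole.text) == ''.join(chunked.text))     # True
```

This matters mostly when the text comes from a socket or a large file read in blocks; calling close at the end asks the parser to process anything still buffered.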

As you can see, the parser’s methods receive callbacks for events during the parse. Much
like SAX XML parsing, your parser class will need to keep track of its state in attributes
as it goes if it wishes to do something more specific than print tag names, attributes,
and content. Watching for specific tags’ content, though, might be as simple as checking
names and setting state flags. Moreover, building object trees to reflect the page’s
structure during the parse would be straightforward.
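
To make the last two ideas concrete, here is one possible sketch, offered as an illustration rather than the book's own code. The LinkCollector class uses a state flag to capture the text inside a elements, and the TreeBuilder class keeps a stack of open elements to build a nested tree of dictionaries. Both class names, the dictionary node layout, and the assumption that the input is well nested (no unclosed tags such as a bare br) are choices made for this example.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect (href, text) pairs by tracking state between callbacks."""
    def __init__(self):
        super().__init__()
        self.in_link = False    # state flag: are we inside an <a> element?
        self.href = None
        self.text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.in_link = True
            self.href = dict(attrs).get('href')
            self.text = []

    def handle_data(self, data):
        if self.in_link:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self.in_link:
            self.links.append((self.href, ''.join(self.text).strip()))
            self.in_link = False

class TreeBuilder(HTMLParser):
    """Build nested dict nodes mirroring the page's tag structure."""
    def __init__(self):
        super().__init__()
        self.root = {'tag': None, 'attrs': [], 'children': []}
        self.stack = [self.root]    # open elements, innermost last

    def handle_starttag(self, tag, attrs):
        node = {'tag': tag, 'attrs': attrs, 'children': []}
        self.stack[-1]['children'].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Assumes well-nested input: pop only when the closing tag matches.
        if len(self.stack) > 1 and self.stack[-1]['tag'] == tag:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():            # skip whitespace-only text between tags
            self.stack[-1]['children'].append(data.strip())

page = """
<html>
<h1>Spam!</h1>
<p>Click this <a href="http://www.python.org">python</a> link</p>
</html>"""

collector = LinkCollector()
collector.feed(page)
collector.close()
print(collector.links)      # [('http://www.python.org', 'python')]

builder = TreeBuilder()
builder.feed(page)
builder.close()
html_node = builder.root['children'][0]
print([child['tag'] for child in html_node['children']])    # ['h1', 'p']
```

Real-world pages are often not well nested, so a production tree builder would need recovery rules for unclosed and void elements; this sketch deliberately leaves those out.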


1436 | Chapter 19: Text and Language
