Programming Python, 4th Edition, Mark Lutz (text edition)


Handling HTML entity references (revisited)


Here’s another HTML parsing example: in Chapter 15, we used a simple method exported
by this module to unquote HTML escape sequences (a.k.a. entities) in strings embedded
in an HTML reply page:


>>> import cgi, html.parser
>>> s = cgi.escape("1<2 <b>hello</b>")
>>> s
'1&lt;2 &lt;b&gt;hello&lt;/b&gt;'
>>>
>>> html.parser.HTMLParser().unescape(s)
'1<2 <b>hello</b>'
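
As far as I can tell, both calls in this listing have since been dropped from the standard library: cgi.escape was removed in Python 3.8 and the HTMLParser.unescape method in 3.9. The html module's top-level escape and unescape functions appear to cover the same ground. A minimal sketch of the equivalent round trip, assuming Python 3.4 or later:

```python
import html

# html.escape stands in for cgi.escape; quote=False leaves quote
# characters alone, which matches cgi.escape's default behavior
s = html.escape("1<2 <b>hello</b>", quote=False)
print(s)                     # 1&lt;2 &lt;b&gt;hello&lt;/b&gt;

# html.unescape stands in for HTMLParser().unescape
print(html.unescape(s))      # 1<2 <b>hello</b>
```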

This works for undoing HTML escapes, but that’s all. When we saw this solution, I
implied that there was a more general approach; now that you know about the method
callback model of the HTML parser class, the more idiomatic way to handle entities
during a parse should make sense—simply catch entity callbacks in a parser subclass,
and translate as needed:


>>> class Parse(html.parser.HTMLParser):
...     def handle_data(self, data):
...         print(data, end='')
...     def handle_entityref(self, name):
...         map = dict(lt='<', gt='>')
...         print(map[name], end='')
...
>>> p = Parse()
>>> p.feed(s); print()
1<2 <b>hello</b>
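
The same idea can be packaged as a script that collects text instead of printing it. One caveat, to the best of my knowledge: since Python 3.5 the parser's convert_charrefs argument defaults to True, in which case entities are converted before handle_data runs and handle_entityref is never invoked. The sketch below passes convert_charrefs=False to keep the callback model described here. The class name EntityCollector and its parts and entity_map attributes are illustrative, not part of the original listing:

```python
from html.parser import HTMLParser

class EntityCollector(HTMLParser):
    """Hypothetical helper: accumulates text, translating entities by hand."""
    entity_map = dict(lt='<', gt='>')        # same tiny table as the listing

    def __init__(self):
        # convert_charrefs=False so that entity callbacks are actually fired
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        # raises KeyError for any entity other than lt and gt
        self.parts.append(self.entity_map[name])

p = EntityCollector()
p.feed('1&lt;2 &lt;b&gt;hello&lt;/b&gt;')
p.close()                                    # flush any buffered text
result = ''.join(p.parts)
print(result)                                # 1<2 <b>hello</b>
```

Collecting into a list rather than printing makes the result easier to reuse or test, at the cost of a little extra state in the parser object.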

Better still, we can use Python’s related html.entities module to avoid hardcoding
entity-to-character mappings for HTML entities. This module defines many more entity
names than the simple dictionary in the prior example and includes all those you’ll
likely encounter when parsing HTML text in the wild:


>>> s
'1&lt;2 &lt;b&gt;hello&lt;/b&gt;'
>>>
>>> from html.entities import entitydefs
>>> class Parse(html.parser.HTMLParser):
...     def handle_data(self, data):
...         print(data, end='')
...     def handle_entityref(self, name):
...         print(entitydefs[name], end='')
...
>>> P = Parse()
>>> P.feed(s); print()
1<2 <b>hello</b>
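
Text in the wild may also contain entity names missing from the table, as well as numeric character references such as &#233; that arrive through a separate handle_charref callback. A hedged sketch of one way to cope with both follows. The TextExtractor class and extract function are hypothetical names, and passing unknown entities through unchanged is just one possible policy:

```python
from html.entities import entitydefs
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text with entities translated."""
    def __init__(self):
        # convert_charrefs=False keeps the entity callbacks active
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        # unknown names are passed through as-is instead of raising KeyError
        self.parts.append(entitydefs.get(name, '&%s;' % name))

    def handle_charref(self, name):
        # numeric references: '233' is decimal, 'xe9' is hexadecimal
        code = int(name[1:], 16) if name[0] in 'xX' else int(name)
        self.parts.append(chr(code))

def extract(text):
    parser = TextExtractor()
    parser.feed(text)
    parser.close()                           # flush any buffered text
    return ''.join(parser.parts)

print(extract('caf&eacute; &#233; &#xe9; &copy; &bogus;'))
```

Here the three spellings of the accented e all come out as the same character, and the unrecognized &bogus; reference is left in the output untouched.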

Strictly speaking, the html.entities module can map entity names to Unicode code
points and vice versa, through its name2codepoint and codepoint2name tables; the
entitydefs table used here simply converts those code point integers to characters
with chr. See this module’s documentation, as well as its source code in the Python
standard library, for more details.
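
To make the relationship between these tables concrete, here is a short sketch using the module's name2codepoint, codepoint2name, and entitydefs dictionaries; the variable name code is just for illustration:

```python
from html.entities import name2codepoint, codepoint2name, entitydefs

code = name2codepoint['lt']          # entity name to code point integer
print(code, chr(code))               # 60 <

print(codepoint2name[ord('>')])      # code point back to entity name: gt

# entitydefs holds the character for each name, i.e. chr of the code point
print(entitydefs['lt'] == chr(name2codepoint['lt']))   # True
```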

