'a &lt; b &gt; c &amp; d &quot;spam&quot;'
>>> s = cgi.escape("1<2 <b>hello</b>")
>>> s
'1&lt;2 &lt;b&gt;hello&lt;/b&gt;'
Python’s cgi module automatically converts characters that are special in HTML syntax
according to the HTML convention. It translates <, >, and &, and with an extra true
argument, ", into escape sequences of the form &X;, where the X is a mnemonic that
denotes the original character. For instance, &lt; stands for the “less than” operator (<)
and &amp; denotes a literal ampersand (&).
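As a side note for readers on newer Pythons, here is a minimal sketch of the same translation using html.escape from the standard html module. It assumes Python 3.2 or later; cgi.escape was deprecated in 3.2 and removed in 3.8. The variable names are illustrative only. Unlike cgi.escape, html.escape translates quote characters by default (its quote argument defaults to True), so quote=False is passed to get the three-character behavior:

```python
import html

text = 'a < b > c & d "spam"'

# quote=False: only &, <, and > are translated
basic = html.escape(text, quote=False)

# default (quote=True): double and single quotes are translated too
quoted = html.escape(text)

# markup embedded in text is neutralized the same way
markup = html.escape("1<2 <b>hello</b>")

print(basic)
print(quoted)
print(markup)
```

The escaped strings can be inserted into a reply page safely, because the browser's HTML parser displays the entities as the original characters instead of interpreting them as tags.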
There is no unescaping tool in the CGI module, because HTML escape code sequences
are recognized within the context of an HTML parser, like the one used by your web
browser when a page is downloaded. Python comes with a full HTML parser, too, in
the form of the standard module html.parser. We won’t go into details on the HTML
parsing tools here (they’re covered in Chapter 19 in conjunction with text processing),
but to illustrate how escape codes are eventually undone, here is the HTML parser
module at work reading back the preceding output:
>>> import cgi, html.parser
>>> s = cgi.escape("1<2 <b>hello</b>")
>>> s
'1&lt;2 &lt;b&gt;hello&lt;/b&gt;'
>>>
>>> html.parser.HTMLParser().unescape(s)
'1<2 <b>hello</b>'
This uses a utility method on the HTML parser class to unquote. In Chapter 19, we’ll
see that using this class for more substantial work involves subclassing to override
methods run as callbacks during the parse upon detection of tags, data, entities, and
more. For more on full-blown HTML parsing, watch for the rest of this story in
Chapter 19.
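The unescape method on the parser class was removed in Python 3.9. As a hedged alternative, the sketch below performs the same round trip with the html module's functions, assuming Python 3.4 or later (html.unescape was added in 3.4). The variable names are illustrative only:

```python
import html

original = "1<2 <b>hello</b>"

# escape for embedding in a page, as in the session above
escaped = html.escape(original)

# undo the entity references, as a browser's parser would for display
restored = html.unescape(escaped)

print(escaped)
print(restored)
```

Because unescaping reverses escaping for these characters, the restored string is the same as the original.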
Escaping URLs
By contrast, URLs reserve other characters as special and must adhere to different
escape conventions. As a result, we use different Python library tools to escape URLs for
transmission. Python’s urllib.parse module provides two tools that do the translation
work for us: quote, which implements the standard %XX hexadecimal URL escape code
sequences for most nonalphanumeric characters, and quote_plus, which additionally
translates spaces to + signs. The urllib.parse module also provides functions for
unescaping quoted characters in a URL string: unquote undoes %XX escapes, and
unquote_plus also changes plus signs back to spaces. Here is the module at work, at the
interactive prompt:
>>> import urllib.parse
>>> urllib.parse.quote("a & b #! c")
'a%20%26%20b%20%23%21%20c'