The urllib package gives us a file-like interface to the server’s reply for a URL.
Notice that the output we read from the server is raw HTML code (normally rendered
by a browser). We can process this text with any of Python’s text-processing tools,
including:
- String methods to search and split
- The re regular expression pattern-matching module
- Full-blown HTML and XML parsing support in the standard library, including
html.parser, as well as SAX-, DOM-, and ElementTree–style XML parsing tools.
When combined with such tools, the urllib package is a natural for a variety of
techniques—ad-hoc interactive testing of websites, custom client-side GUIs, “screen
scraping” of web page content, and automated regression testing systems for remote
server-side CGI scripts.
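As a rough sketch of the three options listed above, the following code applies each to a small reply string. The reply text here is a hypothetical stand-in for whatever a real server returns (in practice it would come from decoding the bytes read from the urllib reply object), and the TextCollector class name is ours, not part of the standard library.

```python
import re
from html.parser import HTMLParser

# Hypothetical reply text, standing in for the decoded text read from a server
reply = "<title>Reply Page</title>\n<h1>Hello <i>Bob</i></h1>\n"

# 1. String methods: search and split
has_title = "<title>" in reply
first_line = reply.split("\n")[0]

# 2. The re module: pull the name out of the (assumed) heading markup
match = re.search(r"<h1>Hello <i>(.*?)</i></h1>", reply)
name = match.group(1) if match else None

# 3. html.parser: collect the text found between tags
class TextCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Skip whitespace-only runs such as the newlines between tags
        if data.strip():
            self.chunks.append(data.strip())

parser = TextCollector()
parser.feed(reply)
parser.close()
text = " ".join(parser.chunks)

print(has_title, first_line, name, text, sep="\n")
```

String methods and re are quick for replies whose layout we control; a real parser such as html.parser is usually the safer choice when the HTML may vary.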
Formatting Reply Text
One last fine point: because CGI scripts use text to communicate with clients, they
need to format their replies according to a set of rules. For instance, notice how
Example 1-31 adds a blank line between the reply’s header and its HTML by printing an
explicit newline (\n) in addition to the one print adds automatically; this is a required
separator.
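The separator rule can be sketched as follows. This is not the code of Example 1-31; the build_reply function and the sample body are hypothetical, and we assume the usual Content-type header line for an HTML reply.

```python
def build_reply(body_html):
    """Return CGI reply text: header line, blank line, then the HTML body."""
    header = "Content-type: text/html"      # assumed header for an HTML reply
    return header + "\n\n" + body_html + "\n"

reply = build_reply("<h1>Hello</h1>")

# The first blank line marks where the headers end and the HTML begins
headers, body = reply.split("\n\n", 1)

print(reply, end="")
```

A script that prints the header with an ordinary print call gets one newline for free, which is why only one explicit \n is needed to produce the blank line.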
Also note how the text inserted into the HTML reply is run through the cgi.escape
call (html.escape as of Python 3.2, and the only choice since cgi.escape was removed
in Python 3.8; see the note under “Python HTML and URL Escape Tools” on page 1203),
just in case the input includes a character that is special in
HTML. For example, Figure 1-13 shows the reply we receive for form input Bob </i>
Smith: the </i> in the middle becomes &lt;/i&gt; in the reply, and so doesn’t interfere
with real HTML code (use your browser’s view source option to see this for yourself);
if not escaped, the rest of the name would not be italicized.
Figure 1-13. Escaping HTML characters
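To see the escaping step in isolation, here is a minimal sketch using html.escape. The input string assumes a name with a closing italic tag in the middle, and the reply template is a hypothetical one, not the exact markup the example script emits.

```python
import html

# Assumed form input: a name with a closing italic tag in the middle
user_input = "Bob </i> Smith"

# Convert characters that are special in HTML into entity references
safe = html.escape(user_input)

# Hypothetical reply template that italicizes the name
escaped_reply = "<h1>Hello <i>%s</i></h1>" % safe
unescaped_reply = "<h1>Hello <i>%s</i></h1>" % user_input

print(escaped_reply)
print(unescaped_reply)
```

In the unescaped version the input's own closing tag ends the italics early, so the browser would render Smith in plain text; in the escaped version the tag is displayed as literal characters instead.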