The other trick is to use some kind of “tidy” program, like that distributed by the W3C and available as the tidy
package on Debian and Ubuntu. It turns out that both of the parsing libraries that were used in Listing 11-12 have
such routines built in. Once the soup object exists, you can display its elements to the screen with helpful indentation by calling:
print(soup.prettify())
An lxml document tree requires a little more work to display.
from lxml import etree
print(etree.tostring(root, pretty_print=True).decode('ascii'))
Either way, the result is likely to be far easier to read than the raw HTML if the site that is delivering it is not
putting elements on separate lines and indenting them to make their document structure clear—steps that, of course,
can be inconvenient and would increase the bandwidth needs of any site serving HTML.
Examining HTML involves the following three steps:
- Ask your library of choice to parse the HTML. This can be difficult for the library because much HTML on the Web contains errors and broken markup, but designers often fail to notice this because browsers always try to recover and understand the markup anyway. After all, what browser vendor would want their browser to be the only one that returns an error for some popular web site when all of the other browsers display it just fine? Both of the libraries used in Listing 11-12 have a reputation for being robust HTML parsers.
- Dive into the document using selectors, which are text patterns that will automatically find the elements you want. While you can instead make the dive yourself, slowly iterating over each element’s children and looking for the tags and attributes that interest you, it is generally much faster to use selectors. They also usually result in cleaner Python code that is easier to read.
- Ask each element object for the text and attribute values you need. You are then back in the world of normal Python strings and can use all of the normal string methods to postprocess the data. (A short sketch of these three steps follows this list.)
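To make those three steps concrete, here is a minimal sketch that parses a small fragment of HTML with lxml and uses XPath expressions as its selectors. The markup, the date and amount class names, and the XPath patterns are all invented for this illustration and are not taken from Listing 11-12.
from lxml import html

# Hypothetical markup, standing in for a page you have already downloaded.
text = """
<table>
  <tr><td class="date">2014-01-01</td><td class="amount">125.00</td></tr>
  <tr><td class="date">2014-01-02</td><td class="amount">200.00</td></tr>
</table>
"""

root = html.fromstring(text)           # Step 1: parse the (possibly messy) HTML
for row in root.xpath('//tr'):         # Step 2: use selectors to find elements
    date = row.xpath('.//td[@class="date"]/text()')[0]
    amount = row.xpath('.//td[@class="amount"]/text()')[0]
    print(date, float(amount))         # Step 3: back to ordinary Python strings
The same selection could also be written with CSS selectors if the optional cssselect package is installed, but XPath support ships with lxml itself.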
This three-stage process is enacted twice in Listing 11-12 using two separate libraries.
The scrape_with_soup() function uses the venerable BeautifulSoup library that is a go-to resource for
programmers the world over. Its API is quirky and unique because it was the first library to make document parsing so
convenient in Python, but it does get the job done.
All “soup” objects, whether the one representing the whole document or a subordinate one that represents a
single element, offer a find_all() method that will search for subordinate elements that match a given tag name and,
optionally, HTML class name. The get_text() method can be used when you finally reach the bottom element you
want and are ready to read its content. With these two methods alone, the code is able to get data from this simple web
site, and even complicated web sites can often be scraped with only a half-dozen or a dozen separate steps.
The full BeautifulSoup documentation is available online at http://www.crummy.com/software/BeautifulSoup/.
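As a rough sketch of how those two methods combine, using the same invented markup and class names as the lxml sketch shown earlier rather than the actual code of Listing 11-12, a loop like the following pulls a value out of every matching row:
from bs4 import BeautifulSoup

text = """
<table>
  <tr><td class="date">2014-01-01</td><td class="amount">125.00</td></tr>
  <tr><td class="date">2014-01-02</td><td class="amount">200.00</td></tr>
</table>
"""

soup = BeautifulSoup(text, 'html.parser')
for row in soup.find_all('tr'):               # every <tr> element in the document
    date = row.find_all('td', 'date')[0]      # the second argument matches the HTML class
    amount = row.find_all('td', 'amount')[0]
    print(date.get_text(), float(amount.get_text()))
Because find_all() works the same way on the whole soup and on each element it returns, the search can be narrowed step by step until only the text you want is left.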
The scrape_with_lxml() function instead uses the modern and fast lxml library that is built atop libxml2
and libxslt. It can be difficult to install if you are on a legacy operating system that does not come with compilers installed, or if you have not installed the python-dev or python-devel package that your operating system needs in order to build compiled Python packages. Debian-derived operating systems will already have the library compiled
against the system Python as a package, often simply named python-lxml.
A modern Python distribution such as Anaconda will have lxml already compiled and ready to install, even on
Mac OS X and Windows: http://continuum.io/downloads.
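If you are not sure whether lxml will be present on a particular machine, one defensive pattern, offered here only as a sketch and not as anything from the chapter’s listings, is to test the import and let BeautifulSoup fall back to the parser in the Python Standard Library:
from bs4 import BeautifulSoup

try:
    import lxml                 # only checking that the fast C parser is importable
    parser = 'lxml'
except ImportError:
    parser = 'html.parser'      # pure-Python parser from the Standard Library

soup = BeautifulSoup('<p>Hello, <b>world</b>!</p>', parser)
print(soup.get_text())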