If you are able to get it installed, Listing 11-12 can use the lxml library to parse the HTML instead.
$ pip install lxml
$ python mscrape.py -l http://127.0.0.1:5000/
125 Registration for PyCon
200 Payment for writing that code
325 Total payments made
Again, the same basic steps are in operation as with BeautifulSoup. You start at the top of the document, use a
find or search method—in this case cssselect()—to zero in on the elements that interest you, and then either use
further searches to grab subordinate elements or, finally, ask each element for the text it contains so that you
can parse and display it.
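To make those steps concrete, here is a minimal sketch of the same flow against the little payments site started
earlier; the class name payment is a hypothetical stand-in for whatever class the real page uses, so adjust the
selector to match the markup you are actually scraping.

import requests
from lxml import html

response = requests.get('http://127.0.0.1:5000/')
root = html.fromstring(response.text)       # parse the HTML into an element tree
for element in root.cssselect('.payment'):  # hypothetical class name: zero in on elements
    print(element.text_content().strip())   # ask each element for the text it contains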
lxml is not only faster than BeautifulSoup, but it also offers many options for how you can select elements; the sketch following this list demonstrates each approach.
• It supports CSS patterns with cssselect(). This is especially important when looking for
elements by class because an element is considered to be in the class x whether its class
attribute is written as class="x" or class="x y" or class="w x".
• It supports XPath expressions with its xpath() method, beloved by XML aficionados. They
look like './/p' to find all paragraphs, for example. One fun aspect of an XPath expression
is that you can end it with '/text()', as in './/p/text()', and get back the text inside each
matching element directly, instead of getting back Python objects whose text you then have
to request.
• It natively supports a fast subset of XPath operations through its find() and findall()
methods.
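All three styles can be tried in a small self-contained sketch; the tiny document and its class names here are
invented purely for illustration, and note that recent versions of lxml require the separate cssselect package
(pip install cssselect) for the cssselect() method to work.

from lxml import html

document = html.fromstring(
    '<ul>'
    '<li class="x">first</li>'
    '<li class="x y">second</li>'
    '<li class="z">third</li>'
    '</ul>')

# CSS selector: matches class="x" and class="x y" alike.
print([li.text for li in document.cssselect('li.x')])

# Full XPath, ending with /text() to get strings back directly.
print(document.xpath('.//li[@class="z"]/text()'))

# The fast XPath subset understood by find() and findall().
print([li.text for li in document.findall('.//li')])

This prints ['first', 'second'], then ['third'], and finally the text of all three list items.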
Note that, in both cases, the scraper had to do a bit of work because the payment description field sits in its
own element, but the dollar amount at the beginning of each line was not given its own element by the
site designer. This is quite a typical problem: some things that you want from a page will sit conveniently in an
element by themselves, while others will be buried in the middle of other text and will require traditional Python
string methods such as split() and strip() to rescue them from their context.
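For instance, one of the output lines shown earlier combines an amount and a description in a single string, and a
single split() call suffices to pull them apart:

text = '125 Registration for PyCon'        # mixed text from one line of output
amount, description = text.split(None, 1)  # split on the first run of whitespace
print(int(amount), description.strip())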
Recursive Scraping
The source code repository for this book includes a small static web site that makes it deliberately difficult for a web
scraper to reach all of its pages. You can view it online here:
https://github.com/brandon-rhodes/fopnp/tree/m/py3/chapter11/tinysite/
If you have checked out the source code repository, you can serve it on your own machine by using Python’s
built-in web server.
$ cd py3/chapter11/tinysite
$ python -m http.server
Serving HTTP on 0.0.0.0 port 8000 ...
If you view the page source and then look around using the web debugging tools of your browser, you will see that
not all of the links on the front page at http://127.0.0.1:8000/ are delivered at the same moment. Only two, in
fact (“page1” and “page2”), are present in the raw HTML of the page as real anchor tags with href="" attributes.
The next two pages sit behind a form with a Search submit button and will not be accessible unless that
button is clicked.
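If you would rather confirm this from Python than from the browser's debugging tools, a quick sketch along these
lines, assuming the http.server instance started above is still running, will list every genuine anchor tag in the
raw HTML of the front page:

import requests
from lxml import html

response = requests.get('http://127.0.0.1:8000/')
root = html.fromstring(response.text)
for anchor in root.cssselect('a[href]'):  # only real anchor tags with href attributes
    print(anchor.get('href'), anchor.text_content().strip())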