Chapter 3
The lxml library's ElementTree implementation has been designed to be 100 percent
compatible with the standard library's, so we can start exploring the document in the
same way as we did with XML:
>>> [e.tag for e in root]
['head', 'body']
>>> root.find('head').find('title').text
'Debian -- Debian \u201cjessie\u201d Release Information'
In the preceding code, we have printed out the text content of the document's
<title> element. We can already see that it contains the codename that we want.
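To try these calls without downloading the page, here is a minimal sketch that uses the standard library's ElementTree on a small, well-formed stand-in document. The SAMPLE string is an assumption made up for illustration and is not the real page; because the two APIs are compatible, the same calls should behave the same way on a root element produced by lxml.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the downloaded page (not the real HTML).
SAMPLE = (
    "<html>"
    "<head><title>Debian -- Debian \u201cjessie\u201d Release Information</title></head>"
    "<body><div id='content'><p>Debian 8.0 was released ...</p></div></body>"
    "</html>"
)

root = ET.fromstring(SAMPLE)

# Iterating over an element yields its direct children, in document order.
child_tags = [e.tag for e in root]

# find() returns the first matching child; a path such as 'head/title'
# walks down several levels in one call.
title = root.find("head/title").text

print(child_tags)
print(title)
```

Note that 'head/title' is just a shorter way of writing the chained find('head').find('title') call used above.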
Zeroing in
Screen scraping is the art of finding a way to unambiguously address the elements
in the HTML that contain the information that we want, and extract the information
from only those elements.
However, we also want the selection criteria to be as simple as possible. The less we
rely on the contents of the document, the lower the chance of our scraper breaking if
the page's HTML changes.
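To make this trade-off concrete, here is a small sketch that compares two ways of addressing the same element. The OLD and NEW documents and the by_position() and by_id() helpers are hypothetical and exist only for illustration; the example uses the standard library's ElementTree and its limited XPath support.

```python
import xml.etree.ElementTree as ET

# Hypothetical page before and after a redesign that adds a banner <div>.
OLD = "<html><body><div id='content'><p>Debian 8.0</p></div></body></html>"
NEW = (
    "<html><body>"
    "<div id='banner'><p>News</p></div>"
    "<div id='content'><p>Debian 8.0</p></div>"
    "</body></html>"
)

def by_position(root):
    # Relies on layout: the <p> inside the first <div> child of <body>.
    return root.find("body/div[1]/p").text

def by_id(root):
    # Relies on one attribute: the <p> child of the <div> with id 'content'.
    return root.find(".//div[@id='content']/p").text

old_root = ET.fromstring(OLD)
new_root = ET.fromstring(NEW)

print(by_position(old_root), "|", by_position(new_root))
print(by_id(old_root), "|", by_id(new_root))
```

The positional selector silently returns the banner text once the layout changes, whereas the id-based selector keeps working as long as the id attribute stays the same.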
Let's inspect the HTML source of the page, and see what we're dealing with. For this,
either use View Source in a web browser, or save the HTML to a file and open it in
a text editor. The page's source code is also included in the source code download
for this book. Search for the text Debian 8.0, so that we are taken straight to the
information we want. For me, it looks like the following block of code:
<body>
...
<div id="content">
<h1>Debian “jessie” Release Information</h1>
<p>Debian 8.0 was
released October 18th, 2014.
The release included many major
changes, described in
...
I've skipped the HTML between the <body> and the <div> to show that the <div>
is a direct child of the <body> element. From the above, we can see that we want the
contents of the <p>
tag child of the <div>