Chapter 3

The lxml library's ElementTree implementation has been designed to be compatible
with the standard library's API (aside from a few documented differences), so we can
start exploring the document in the same way as we did with XML:

[e.tag for e in root]

['head', 'body']
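As a quick offline check of that compatibility, the same list comprehension can be run against the standard library's ElementTree. This is only a sketch under stated assumptions: `SNIPPET` is a hand-written, well-formed stand-in rather than the real Debian page, and it is parsed with `xml.etree.ElementTree` instead of lxml.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the downloaded page (not the real HTML).
SNIPPET = """<html>
  <head><title>Example title</title></head>
  <body><div id="content"><p>Example text</p></div></body>
</html>"""

root = ET.fromstring(SNIPPET)

# Iterating over an element yields its direct children, in document order.
child_tags = [e.tag for e in root]
print(child_tags)
```

Real-world HTML is often not well-formed XML, so the standard library parser may reject it; that is the reason for parsing the actual page with lxml.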


root.find('head').find('title').text

'Debian -- Debian \u201cjessie\u201d Release Information'

In the preceding code, we have printed out the text content of the document's
<title> element, which is the text that appears in the tab in the preceding
screenshot. We can already see it contains the codename that we want.

Zeroing in

Screen scraping is the art of finding a way to unambiguously address the elements
in the HTML that contain the information that we want, and extract the information
from only those elements.

However, we also want the selection criteria to be as simple as possible. The less we
rely on the contents of the document, the lower the chance of our code breaking if the
page's HTML changes.

Let's inspect the HTML source of the page, and see what we're dealing with. For this,
either use View Source in a web browser, or save the HTML to a file and open it in
a text editor. The page's source code is also included in the source code download
for this book. Search for the text Debian 8.0, so that we are taken straight to the
information we want. For me, it looks like the following block of code:

<body>
...
<div id="content">
<h1>Debian “jessie” Release Information</h1>
<p>Debian 8.0 was
released October 18th, 2014.
The release included many major
changes, described in
...

I've skipped the HTML between the <body> and the <div> to show that the <div>
is a direct child of the <body> element. From the above, we can see that we want the
contents of the <p> element that is a child of the <div> element.
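One way to address that element can be sketched as follows. This is not necessarily the approach the chapter goes on to use. It assumes a simplified, well-formed stand-in for the page (`PAGE`, including a hypothetical `header` div that is not in the real source) and uses the standard library's ElementTree, whose `find()` accepts a limited XPath subset including `[@attrib='value']` predicates.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the page source; the "header" div is
# invented here only to show that the id predicate picks the right <div>.
PAGE = """<html>
  <head><title>Debian -- Debian “jessie” Release Information</title></head>
  <body>
    <div id="header"><p>Navigation</p></div>
    <div id="content">
      <h1>Debian “jessie” Release Information</h1>
      <p>Debian 8.0 was released October 18th, 2014.</p>
    </div>
  </body>
</html>"""

root = ET.fromstring(PAGE)

# Address the <div> by its id attribute rather than by its position,
# then take its first <p> child.
content_div = root.find("body/div[@id='content']")
first_para = content_div.find("p")
release_text = first_para.text.strip()
print(release_text)
```

Selecting by the `id` attribute keeps the criteria simple: the lookup does not depend on how many other elements precede the `<div>` inside `<body>`.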