Python Programming for Raspberry Pi, Sams Teach Yourself in 24 Hours


Using the etree Methods to Parse HTML


The etree methods break an HTML document down into the individual HTML elements. If you’re
familiar with HTML code, you’ve seen the HTML elements that are used to define the layout and
structure of the webpage. Here’s a quick example of the HTML code in a simple webpage:




<!DOCTYPE html>
<html>
<head>
<title>This is a test webpage</title>
</head>
<body>
<h1>This is a test webpage!</h1>
<p>This webpage contains a simple title and two paragraphs of text</p>
<p>This is the second paragraph of text on the webpage</p>
<h2>This is the end of the test webpage</h2>
</body>
</html>

The etree methods can return each HTML element in the document as a separate object that you can
manipulate. Here’s the code required to extract the HTML elements from the html variable returned
from the urllib process shown earlier:




import lxml.etree
encoding = lxml.etree.HTMLParser(encoding='utf-8')
doctree = lxml.etree.fromstring(html, encoding)

First, you need to create a parser that knows which encoding to use when it decodes the raw binary
HTML data. The encoding variable holds that HTMLParser object. This example specifies the utf-8
encoding scheme, which can represent text in virtually every written language.


The second statement uses the fromstring() function to parse the HTML data into a tree of element
objects. The doctree variable holds the root element of that tree (the html element). Each element
behaves much like a list of its child elements, and carries its own tag name, attributes, and text.
You can iterate through the tree, looking for the data, or if you know exactly which position in the
tree your data appears, you can index directly to it. That method is a little better than using the
regular expression method to search for data, but you can still make things easier!
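
As a rough sketch of both approaches, the following walks the parsed tree and then jumps straight to a known position. To keep the example self-contained, the html variable is a hard-coded copy of the sample webpage shown earlier rather than data retrieved with urllib; that substitution is an assumption made only for illustration:

```python
import lxml.etree

# Stand-in for the bytes returned by urllib (assumption: hard-coded sample page)
html = b"""<!DOCTYPE html>
<html>
<head>
<title>This is a test webpage</title>
</head>
<body>
<h1>This is a test webpage!</h1>
<p>This webpage contains a simple title and two paragraphs of text</p>
<p>This is the second paragraph of text on the webpage</p>
<h2>This is the end of the test webpage</h2>
</body>
</html>"""

encoding = lxml.etree.HTMLParser(encoding='utf-8')
doctree = lxml.etree.fromstring(html, encoding)

# Walk every element in document order, collecting each tag name and its text
elements = [(el.tag, (el.text or '').strip()) for el in doctree.iter()]
for tag, text in elements:
    print(tag, repr(text))

# Jump directly to a known position: the second child of <html> is <body>,
# and the first child of <body> is the <h1> heading
body = doctree[1]
heading = body[0]
print(heading.tag, heading.text)
```

Indexing by position is quick, but it breaks as soon as the page layout changes, which is why searching by a more stable feature of the page is usually the safer choice.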


Most webpages use Cascading Style Sheets (CSS) to differentiate important content on the webpage.
The next step is to leverage that information to look for the specific data you want.


Using CSS to Find Data


Now that you have the webpage data broken down into the separate elements, you can use the
CSSSelector class in the lxml.cssselect module to narrow the search even further, based on CSS
information in the webpage.


You may need to do some hunting around through the raw HTML code to figure out just what unique
features make the data you’re looking for stand out. Most modern webpages use CSS classes to define
CSS styles for specific content on the webpage. It looks something like this:




<div class="day-temp-current temp-f">79</div>