Python Programming for Raspberry Pi, Sams Teach Yourself in 24 Hours

Using the etree Methods to Parse HTML

The etree methods break an HTML document down into the individual HTML elements. If you’re
familiar with HTML code, you’ve seen the HTML elements that are used to define the layout and
structure of the webpage. Here’s a quick example of the HTML code in a simple webpage:


<!DOCTYPE html>
<html>
<head>
<title>This is a test webpage</title>
</head>
<body>
<h1>This is a test webpage!</h1>
<p>This webpage contains a simple title and two paragraphs of text</p>
<p>This is the second paragraph of text on the webpage</p>
<h2>This is the end of the test webpage</h2>
</body>
</html>

The etree methods can return each HTML element in the document as a separate object that you can
manipulate. Here’s the code required to extract the HTML elements from the html variable returned
from the urllib process shown earlier:

import lxml.etree
encoding = lxml.etree.HTMLParser(encoding='utf-8')
doctree = lxml.etree.fromstring(html, encoding)

First, you need to create an HTML parser and tell it which encoding to use when it converts the raw
binary HTML data into text. The encoding variable holds that parser object (despite its name, it is
an HTMLParser, not just an encoding). This example specifies the utf-8 encoding scheme, which can
represent the characters of virtually every written language.
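To see what the encoding setting does, here is a minimal sketch that parses a few bytes containing non-ASCII characters. The sample markup and the names raw, parser, and text are made up for illustration; they are not part of the original example.

```python
import lxml.etree

# Hypothetical sample data: UTF-8 encoded bytes, like the raw data urllib returns.
raw = '<p>Température: 26°C</p>'.encode('utf-8')

# Tell the parser how to decode the bytes.
parser = lxml.etree.HTMLParser(encoding='utf-8')
root = lxml.etree.fromstring(raw, parser)

# The HTML parser wraps the fragment in html and body elements,
# so search the whole tree for the first p element.
text = root.findtext('.//p')
print(text)
```

The text comes back as an ordinary Python string, so the accented letter and the degree sign survive the trip from bytes.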

The next statement uses the fromstring() function to parse the HTML data into a tree of element
objects. The doctree variable holds the root element of that tree (the html element). Each element
exposes its tag name, its text, and its attributes, and contains its child elements. You can walk the
tree looking for the data, or if you know exactly where in the tree your data appears, you can jump
directly there. That approach is a little better than using regular expressions to search for data,
but you can still make things easier!
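As a rough sketch of searching the parsed document, the following code parses the test webpage shown earlier and visits every element with the iter() method. Here the page is hard-coded as bytes instead of being fetched with urllib, and the names parser, elements, and paragraphs are chosen for this illustration only.

```python
import lxml.etree

# The test webpage from the text, as bytes (urllib also returns bytes).
html = b"""<!DOCTYPE html>
<html>
<head>
<title>This is a test webpage</title>
</head>
<body>
<h1>This is a test webpage!</h1>
<p>This webpage contains a simple title and two paragraphs of text</p>
<p>This is the second paragraph of text on the webpage</p>
<h2>This is the end of the test webpage</h2>
</body>
</html>"""

# The text calls this parser variable "encoding".
parser = lxml.etree.HTMLParser(encoding='utf-8')
doctree = lxml.etree.fromstring(html, parser)

# iter() visits the root element and all its descendants in document order.
elements = [(el.tag, (el.text or '').strip()) for el in doctree.iter()]
for tag, text in elements:
    print(tag, repr(text))

# Passing a tag name to iter() limits the walk to matching elements.
paragraphs = [p.text for p in doctree.iter('p')]
print(paragraphs)
```

Walking the whole tree works, but you still have to know which tag or position holds the value you want.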

Most webpages use Cascading Style Sheets (CSS) to differentiate important content on the webpage.
The next step is to leverage that information to look for the specific data you want.

Using CSS to Find Data

Now that you have the webpage data broken down into the separate elements, you can use the
CSSSelector class in the lxml.cssselect module to narrow the search even further, based on CSS
information in the webpage.

You may need to do some hunting around through the raw HTML code to figure out just what unique
features make the data you’re looking for stand out. Most modern webpages use CSS classes to define
CSS styles for specific content on the webpage. It looks something like this:

<div class="day-temp-current temp-f">79</div>
