quite a chore.
If you find the data you want and try to use a positional method of extracting the content (such as
looking for the 1,200th character in an HTML document and slicing out the next 10 characters), you
might be disappointed when, the next time the webpage is updated, the data is at the 1,201st position.
One solution to this problem is to use an HTML parser library. An HTML parser library allows you
to parse through the individual HTML elements contained in the document, looking for specific tags
and keywords. This makes the job of searching for data much easier, and it can help your program
survive simple changes to the webpage.
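To make the difference concrete, here is a minimal sketch that compares the two approaches. The two page strings, the `temp` id, and the helper names `by_position`, `by_tag`, and `SpanFinder` are all made up for illustration, and the parser comes from the standard Python library only to keep the example self-contained.

```python
from html.parser import HTMLParser

# Two hypothetical versions of the same page. The second has one extra
# space before the data, as might happen after a small page update.
OLD_PAGE = '<html><body><p>Temp:</p><span id="temp">72</span></body></html>'
NEW_PAGE = '<html><body><p>Temp: </p><span id="temp">72</span></body></html>'

# Positional approach: work out the offset once, then hard-code it.
START = OLD_PAGE.index("72")


def by_position(page):
    """Return the two characters found at the remembered offset."""
    return page[START:START + 2]


class SpanFinder(HTMLParser):
    """Collect the text inside the span element whose id is 'temp'."""

    def __init__(self):
        super().__init__()
        self.inside = False
        self.value = ""

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "span" and ("id", "temp") in attrs:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.value += data


def by_tag(page):
    """Return the text of the 'temp' span, wherever it sits in the page."""
    finder = SpanFinder()
    finder.feed(page)
    return finder.value


for page in (OLD_PAGE, NEW_PAGE):
    print("position:", repr(by_position(page)), " tag:", repr(by_tag(page)))
```

The positional lookup returns the right value only for the page it was tuned against, while the tag-based lookup keeps working after the extra character is added.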
There are plenty of HTML parser libraries available in Python. The html.parser module, which
provides the HTMLParser class, is included in the standard Python v3 library, but it can be
somewhat difficult to work with. In the
following Try It Yourself, you will use the LXML module, which is fairly easy to use yet robust
enough to help you parse through the webpages you need.
Try It Yourself: Install the LXML Module
To complete the web parsing project, you need to install the Python v3 version of the
LXML module from the Raspbian Linux distribution software repository. Just follow
these steps:
- Open a command prompt, either from the main Raspberry Pi login interface or from
the LXTerminal utility in the graphical desktop.
- Run the apt-get command as the root user account to refresh the list of available
software packages, like this:
pi@raspberrypi ~ $ sudo apt-get update
- Run the apt-get command as the root user account to install the Python v3 version
of the LXML module, like this:
pi@raspberrypi ~ $ sudo apt-get install python3-lxml
Watch Out!: The LXML Module
Be careful. The Raspbian Linux distribution software repository includes both
the Python v2 and Python v3 versions of the LXML module. Make sure you
install the Python v3 version to use with your Python v3 code! The Python v3
version is python3-lxml, while the Python v2 version is python-lxml.
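One quick way to confirm that the Python v3 interpreter can see the module is to ask Python itself. The following is only a sketch; the function name `lxml_status` is invented for this example, and it assumes you run it with the python3 command.

```python
import importlib.util


def lxml_status():
    """Report whether this Python v3 interpreter can find the lxml package."""
    if importlib.util.find_spec("lxml") is None:
        return "lxml not found; install the python3-lxml package"
    return "lxml is available"


print(lxml_status())
```

If the message says the module was not found, check that you installed python3-lxml rather than python-lxml.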
Now that you have the LXML module installed, you can import it into your program and use its
features. There are two specific features that you’re interested in:
- The etree methods, which break an HTML document down into the individual HTML
elements in the document.
- The cssselect methods, which find elements in an HTML document by using CSS selector
expressions.
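Before going into detail, here is a brief sketch of how the two features might be used together. The sample page, the `item` class name, and the helper functions `tag_names` and `item_text` are assumptions made up for this example. Note also that `lxml.cssselect` relies on the separate cssselect package, which may need to be installed on its own, so the import is guarded.

```python
try:
    from lxml import etree
    from lxml.cssselect import CSSSelector
except ImportError:
    # lxml (or the separate cssselect package it relies on) is not installed
    etree = None

# A small, made-up page to parse
SAMPLE = """<html><body>
<h1>Fruit</h1>
<ul><li class="item">Apples</li><li class="item">Pears</li><li>Note</li></ul>
</body></html>"""


def tag_names(page):
    """Use etree to walk every element in the document, in document order."""
    root = etree.HTML(page)
    return [element.tag for element in root.iter()]


def item_text(page):
    """Use a CSS selector to pick out only the li elements with class 'item'."""
    root = etree.HTML(page)
    select = CSSSelector("li.item")
    return [element.text for element in select(root)]


if etree is not None:
    print(tag_names(SAMPLE))
    print(item_text(SAMPLE))
else:
    print("lxml is not available in this interpreter")
```

Here etree supplies the parsed tree of elements, and the CSS selector narrows that tree down to just the elements of interest.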
Let’s take a closer look at using each of these features.