Learning Python Network Programming

Chapter 3

You should always check a site's terms and conditions before scraping. Some
websites explicitly disallow automated parsing and retrieval. Breaching the terms
may result in your IP address being barred. However, in most cases, as long as you
don't republish the data and don't make excessively frequent requests, you should
be okay.
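
As a rough illustration of polite scraping, the following sketch checks a
site's robots.txt rules and spaces out requests. It uses only the standard
library. The robots.txt content, the example-bot user agent, the example.com
URLs, and the Throttle and build_parser helpers are all assumptions made for
this example; a robots.txt file is also no substitute for reading the site's
actual terms.

```python
import time
import urllib.robotparser

# Hypothetical robots.txt content. A real scraper would fetch the file from
# the site using RobotFileParser.set_url() followed by read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_text):
    """Parse robots.txt rules from a string, without any network access."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


class Throttle:
    """Enforce a minimum delay, in seconds, between successive requests."""

    def __init__(self, delay, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay
        self.clock = clock
        self.sleep = sleep
        self.last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least `delay` seconds have passed since the last call."""
        now = self.clock()
        if self.last is not None:
            remaining = self.delay - (now - self.last)
            if remaining > 0:
                self.sleep(remaining)
        self.last = self.clock()


parser = build_parser(ROBOTS_TXT)
allowed = parser.can_fetch("example-bot", "http://example.com/index.html")
blocked = parser.can_fetch("example-bot", "http://example.com/private/data.html")
delay = parser.crawl_delay("example-bot")  # None if the site sets no delay
```

A scraper would call can_fetch() before requesting each URL, and call
Throttle.wait() immediately before each request.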

HTML parsers

We'll be parsing HTML just as we parsed XML. We again have a choice between
pull-style APIs and object-oriented APIs. We are going to use ElementTree for
the same reasons as mentioned before.

Several HTML parsing libraries are available. They differ in their speed, in
the interfaces that they offer for navigating within HTML documents, and in
their ability to handle badly constructed HTML. The Python standard library
doesn't include an object-oriented HTML parser. The most widely recommended
third-party package for this is lxml, which is primarily an XML parser. However,
it also includes a very good HTML parser. It's quick, it offers several ways of
navigating documents, and it is tolerant of broken HTML.
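
To give a feel for this, here is a minimal sketch that feeds deliberately
sloppy markup to lxml and then navigates the result in two ways. It assumes
that lxml is already installed; the BROKEN_HTML sample and the variable names
are made up for this illustration.

```python
from lxml import etree  # third-party package, must be installed first

# Deliberately sloppy markup: no <html> or <body>, unclosed <li> and <p> tags.
BROKEN_HTML = """
<h1>Releases</h1>
<ul>
  <li><a href="/v1">Version 1</a>
  <li><a href="/v2">Version 2</a>
</ul>
<p>Unclosed paragraph
"""

# etree.HTML() uses lxml's forgiving HTML parser and returns the root
# <html> element, adding the missing structure for us.
root = etree.HTML(BROKEN_HTML)

# Navigation style 1: an XPath expression.
hrefs = root.xpath("//a/@href")

# Navigation style 2: ElementTree-style iteration over the tree.
labels = [a.text for a in root.iter("a")]

print(hrefs)
print(labels)
```

Both approaches work on the same tree, so they can be mixed freely within a
single script.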

The lxml library can be installed on Debian and Ubuntu through the python-lxml
package (python3-lxml for Python 3). If you need an up-to-date version or if
you're not able to install the system packages, then lxml can be installed
through pip. Note that you'll need a build environment for this. Debian usually
comes with one already set up, but if it's missing, then the following will
install one on both Debian and Ubuntu:

$ sudo apt-get install build-essential

Then you should be able to install lxml, like this:

$ sudo STATIC_DEPS=true pip install lxml

If you hit compilation problems on a 64-bit system, then you can also try:

$ CFLAGS="$CFLAGS -fPIC" STATIC_DEPS=true pip install lxml

On Windows, installer packages are available from the lxml website at
http://lxml.de/installation.html. Check the page for links to third-party
installers in case an installer for your version of Python isn't available.
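
Whichever route you take, it can be worth confirming that the installation
worked before moving on. The following is a small sketch of one way to do
this; describe_lxml is a hypothetical helper written for this example, not
part of lxml.

```python
import importlib.util


def describe_lxml():
    """Return a one-line summary of the lxml installation, if there is one."""
    if importlib.util.find_spec("lxml") is None:
        return "lxml is not installed"
    from lxml import etree

    # Both attributes are tuples of integers.
    lxml_version = ".".join(str(n) for n in etree.LXML_VERSION)
    libxml_version = ".".join(str(n) for n in etree.LIBXML_VERSION)
    return "lxml {} (libxml2 {})".format(lxml_version, libxml_version)


print(describe_lxml())
```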

The next best library, in case lxml doesn't work for you, is BeautifulSoup.
BeautifulSoup is pure Python, so it can be installed with pip, and it should run
anywhere. It has its own API rather than the ElementTree one, but it's a
well-respected and capable library, and it can, in fact, use lxml as its parsing
backend.
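
As a brief sketch of what this looks like in practice, the following parses
some sloppy markup with BeautifulSoup, using lxml as the backend when it is
available and the standard library's parser otherwise. It assumes that the
beautifulsoup4 package has been installed with pip; the BROKEN_HTML sample and
the variable names are invented for this example.

```python
import importlib.util

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Sloppy markup with unclosed <li> tags, made up for this example.
BROKEN_HTML = """
<ul>
  <li><a href="/v1">Version 1</a>
  <li><a href="/v2">Version 2</a>
</ul>
"""

# Prefer lxml as the backend when it is installed, else fall back to the
# pure-Python parser from the standard library.
backend = "lxml" if importlib.util.find_spec("lxml") else "html.parser"

soup = BeautifulSoup(BROKEN_HTML, backend)

# BeautifulSoup's own navigation API: find_all() and find().
hrefs = [a["href"] for a in soup.find_all("a")]
first_label = soup.find("a").get_text()

print(backend, hrefs, first_label)
```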
