Learning Python Network Programming

APIs in Action

Show me the data

Before we start parsing HTML, we need something to parse! Let's grab the
version and codename of the latest stable Debian release from the Debian website.
Information about the current stable release can be found at
https://www.debian.org/releases/stable/.

The information that we want is displayed in the page title and in the first sentence.

So, we should extract the "jessie" codename and the 8.0 version number.

Parsing HTML with lxml

Let's open a Python shell and get to parsing. First, we'll download the page with
Requests.

import requests

response = requests.get('https://www.debian.org/releases/stable')

Next, we parse the source into an ElementTree tree. The process is the same as
parsing XML with the standard library's ElementTree, except that here we use
lxml's specialized HTMLParser.

from lxml.etree import HTML

root = HTML(response.content)
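
For comparison, the standard-library side of that analogy can be sketched as follows. The XML snippet and the variable names are made up for illustration; only the ElementTree calls come from the standard library.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML snippet, used only for illustration.
xml_source = b'<release><codename>jessie</codename><version>8.0</version></release>'

# fromstring() parses the document and returns its root element.
xml_root = ET.fromstring(xml_source)

# findtext() returns the text of the first matching child element.
codename = xml_root.findtext('codename')
version = xml_root.findtext('version')
print(codename, version)
```

The tree returned by lxml's HTML() can be navigated with the same kind of calls.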

The HTML() function is a shortcut that parses the HTML passed to it and returns
the root element of the resulting tree. Notice that we're passing response.content
and not response.text. The lxml library produces better results when it is given
the raw response bytes rather than the decoded Unicode text.
