Learning Python Network Programming

(Sean Pound) #1

APIs in Action


Show me the data


Before we start parsing HTML, we need something to parse! Let's grab the
version and codename of the latest stable Debian release from the Debian website.
Information about the current stable release can be found at
https://www.debian.org/releases/stable/.


The information that we want is displayed in the page title and in the first sentence.


So, we should extract the "jessie" codename and the 8.0 version number.


Parsing HTML with lxml


Let's open a Python shell and get to parsing. First, we'll download the page with
Requests.

import requests
response = requests.get('https://www.debian.org/releases/stable')
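Outside an interactive shell, the download step could be wrapped in a small helper that fails loudly on HTTP errors. The following is only a sketch: the name fetch_page, the injectable get parameter, and the 10-second timeout are assumptions made for illustration, not something the text prescribes.

```python
import requests


def fetch_page(url, get=requests.get, timeout=10):
    """Return the raw body bytes of `url`, raising on HTTP error statuses.

    `get` defaults to requests.get; it is a parameter only so that the
    helper can be exercised without network access (an assumption of
    this sketch).
    """
    response = get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    # Return bytes rather than decoded text, ready to hand to a parser.
    return response.content


# Example call (needs network access):
# html_bytes = fetch_page('https://www.debian.org/releases/stable/')
```

Returning response.content keeps the body as bytes, which is the form handed to the parser in the next step.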





Next, we parse the source into an ElementTree tree. This works the same way as
parsing XML with the standard library's ElementTree, except that here we use
lxml's specialized HTMLParser.

from lxml.etree import HTML
root = HTML(response.content)
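To see what the parse step returns without touching the network, the same call can be tried on a small in-memory document. This is a minimal sketch: the SAMPLE markup is a made-up stand-in, not the real Debian page.

```python
from lxml.etree import HTML

# Hypothetical stand-in for a downloaded page body (bytes, like response.content).
SAMPLE = b"""<!DOCTYPE html>
<html>
  <head><title>Example release page</title></head>
  <body><p>Example 1.0 was released.</p></body>
</html>"""

# HTML() returns the root element of the parsed tree.
root = HTML(SAMPLE)

# The root behaves like an ElementTree element: it has a tag and children.
child_tags = [child.tag for child in root]

# ElementTree-style path lookup for the <title> element.
title = root.find('head/title').text

print(root.tag, child_tags, title)
```

The element returned here supports the usual ElementTree operations such as find() and iteration over children.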





The HTML() function is a shortcut that parses the HTML passed to it and returns
the root element of the resulting tree. Notice that we're passing response.content
and not response.text. The lxml library produces better results when it is given the
raw response bytes rather than the decoded Unicode text.
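When the parser is given bytes, it handles the decoding itself, and an encoding can be supplied explicitly through HTMLParser if it is known in advance. The sketch below assumes a hypothetical page encoded as ISO-8859-1; the helper name parse_bytes and the sample markup are illustrative only.

```python
from lxml.etree import HTML, HTMLParser

# Hypothetical page body encoded as ISO-8859-1 (Latin-1).
# '\u00e9' is an accented e, stored as the single byte 0xE9 in that encoding.
RAW = ('<html><head><title>caf\u00e9</title></head>'
       '<body></body></html>').encode('iso-8859-1')


def parse_bytes(raw, encoding):
    """Parse raw HTML bytes, telling the parser which encoding to assume."""
    parser = HTMLParser(encoding=encoding)
    return HTML(raw, parser=parser)


root = parse_bytes(RAW, 'iso-8859-1')
title = root.find('head/title').text
print(title)
```

Because the bytes reach the parser untouched, the decoding decision stays in one place instead of being split between Requests and lxml.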
