Learning Python Network Programming

(Sean Pound) #1

APIs in Action


The square brackets after div, [@id="content"], form a condition that we place on
the <div> elements that we're matching. The @ sign before id means that id refers
to an attribute, so the condition means: only elements with an id attribute equal to
"content". This is how we can find our content <div>.
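As a rough illustration of the attribute condition, here is a minimal sketch that runs without a network connection. It makes two assumptions: the SAMPLE markup is a made-up, well-formed page rather than the real Debian HTML, and it uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath and is queried through findall() with a relative path instead of lxml's xpath().

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (not the real Debian markup).
SAMPLE = """
<html>
  <body>
    <div id="header"><p>Navigation</p></div>
    <div id="content"><h1>Title</h1><p>First paragraph</p></div>
  </body>
</html>
""".strip()

root = ET.fromstring(SAMPLE)

# [@id="content"] keeps only the div elements whose id attribute is "content".
matches = root.findall('.//div[@id="content"]')
content_ids = [div.get('id') for div in matches]
print(content_ids)
```

Of the two <div> elements in the sample, only the one whose id attribute equals "content" passes the condition.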


Before we employ this to extract our information, let's just touch on a couple of
useful things that we can do with conditions. We can specify just a tag name,
as shown here:





root.xpath('//div[h1]')





[<Element div at 0x39e05c8>]


This returns all <div> elements which have an <h1> child element. Also try:





root.xpath('body/div[2]')





[<Element div at 0x39e05c8>]


Putting a number as a condition will return the element at that position in the
matched list. In this case this is the second <div> child element of <body>.
Note that these indexes start at 1, unlike Python indexing which starts at 0.
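The child-element and position conditions can be tried offline with a small sketch like the one below. The SAMPLE page is hypothetical. The code again assumes the standard library's xml.etree.ElementTree as a stand-in for lxml, since it supports both of these conditions through find() and findall().

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (not the real Debian markup).
SAMPLE = """
<html>
  <body>
    <div id="header"><p>Navigation</p></div>
    <div id="content"><h1>Title</h1><p>First paragraph</p></div>
  </body>
</html>
""".strip()

root = ET.fromstring(SAMPLE)

# [h1] keeps only the div elements that have an h1 child.
with_h1 = [div.get('id') for div in root.findall('.//div[h1]')]

# [2] selects the second div child of body (XPath positions start at 1).
second_by_xpath = root.find('body/div[2]').get('id')

# The same element through Python list indexing, which starts at 0.
second_by_index = root.findall('body/div')[1].get('id')

print(with_h1, second_by_xpath, second_by_index)
```

Both lookups land on the same element here, which mirrors the interactive session above, where the two queries return the same <div>.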


There's a lot more that XPath can do; the full specification is a World Wide
Web Consortium (W3C) standard. The latest version can be found at
http://www.w3.org/TR/xpath-3/.


Pulling it together


Now that we've added XPath to our superpowers, let's finish up by writing a script
to get our Debian version information. Create a new file, get_debian_version.py,
and save the following to it:


import re
import requests
from lxml.etree import HTML

# Download the release page and parse it into an element tree
response = requests.get('http://www.debian.org/releases/stable/')
root = HTML(response.content)

# The codename sits between curly quotes in the page title
title_text = root.find('head').find('title').text
release = re.search('\u201c(.*)\u201d', title_text).group(1)

# The version number is the second word of the first content paragraph
p_text = root.xpath('//div[@id="content"]/p[1]')[0].text
version = p_text.split()[1]

print('Codename: {}\nVersion: {}'.format(release, version))
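To see what the regular expression and the split() call are doing without fetching the page, the extraction steps can be sketched on their own. The two input strings are hypothetical placeholders for the title text and the first paragraph, and the None fallbacks are an addition that the script above does not have; the live page may be worded differently.

```python
import re

# Hypothetical sample strings standing in for the page's title text and the
# first paragraph of the content div; the live page may differ.
title_text = 'Debian -- Debian \u201ccodename\u201d Release Information'
p_text = 'Debian 0.0 was released on an example date.'

# Capture whatever sits between the curly quotation marks.
match = re.search('\u201c(.*)\u201d', title_text)
release = match.group(1) if match else None

# The version is assumed to be the second whitespace-separated word.
words = p_text.split()
version = words[1] if len(words) > 1 else None

print('Codename: {}\nVersion: {}'.format(release, version))
```

Checking for a failed match before calling group(1) avoids an AttributeError if the title ever stops using curly quotes.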