APIs in Action
The square brackets after div, [@id="content"], form a condition that we place on
the <div> elements that we're matching. The @ sign before id means that id refers
to an attribute, so the condition means: only elements with an id attribute equal to
"content". This is how we can find our content <div>.
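To see the attribute condition in isolation, here is a minimal sketch that runs against a small, made-up HTML snippet rather than a real page. The sample markup and the variable names are assumptions for illustration only:

```python
from lxml.etree import HTML

# Made-up sample page, standing in for a real document.
sample = """
<html>
  <body>
    <div id="header"><h1>Sample page</h1></div>
    <div id="content"><p>First paragraph.</p></div>
  </body>
</html>
"""

root = HTML(sample)

# Without a condition, //div matches every <div> in the tree.
all_divs = root.xpath('//div')

# With [@id="content"], only <div> elements whose id attribute
# equals "content" are kept.
content_divs = root.xpath('//div[@id="content"]')

print(len(all_divs))              # 2
print(content_divs[0].get('id'))  # content
```

Note that xpath() always returns a list here, even when only one element matches, which is why the result is indexed with [0].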
Before we employ this to extract our information, let's just touch on a couple of
useful things that we can do with conditions. We can specify just a tag name,
as shown here:
root.xpath('//div[h1]')
[<Element div at 0x39e05c8>]
This returns all <div> elements that have an <h1>
child element. Also try:
root.xpath('body/div[2]')
[<Element div at 0x39e05c8>]
Putting a number as a condition will return the element at that position in the
matched list. In this case, this is the second <div> child of <body>.
Note that these indexes start at 1, unlike Python indexing, which starts at 0.
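The two kinds of condition can be compared side by side in a short sketch. As before, the sample markup is made up for illustration and is not taken from any real page:

```python
from lxml.etree import HTML

# Made-up sample page with three <div> children of <body>.
sample = """
<html>
  <body>
    <div id="header"><h1>Sample page</h1></div>
    <div id="content"><p>First paragraph.</p></div>
    <div id="footer"><p>Footer text.</p></div>
  </body>
</html>
"""

root = HTML(sample)

# A tag name as the condition: keep only <div> elements
# that have an <h1> child.
with_h1 = root.xpath('//div[h1]')

# A number as the condition: XPath positions are 1-based,
# so [2] selects the second <div> child of <body>.
second = root.xpath('body/div[2]')

# The same element through Python's 0-based list indexing.
body_divs = root.xpath('body/div')

print([d.get('id') for d in with_h1])  # ['header']
print(second[0].get('id'))             # content
print(body_divs[1].get('id'))          # content
```

The last two lines pick out the same element: position 2 in XPath corresponds to index 1 in the Python list.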
There's a lot more that XPath can do; the full specification is a World Wide
Web Consortium (W3C) standard. The latest version can be found at
http://www.w3.org/TR/xpath-3/.
Pulling it together
Now that we've added XPath to our superpowers, let's finish up by writing a script
to get our Debian version information. Create a new file, get_debian_version.py,
and save the following to it:
import re
import requests
from lxml.etree import HTML

# Download the page for the current stable release and parse it
response = requests.get('http://www.debian.org/releases/stable/')
root = HTML(response.content)
# The codename appears between curly quotes in the page title
title_text = root.find('head').find('title').text
release = re.search('\u201c(.*)\u201d', title_text).group(1)
# The version is the second word of the first paragraph in the content div
p_text = root.xpath('//div[@id="content"]/p[1]')[0].text
version = p_text.split()[1]
print('Codename: {}\nVersion: {}'.format(release, version))
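The script relies on the page keeping its current layout: a title with the codename in curly quotes, and a first paragraph whose second word is the version. If either changes, it will fail with an AttributeError or IndexError. As a rough sketch of a more defensive variant, the parsing step can be moved into a function that checks each lookup. The function name parse_release and the sample markup below are hypothetical, and the sample is made up so that the sketch runs without network access:

```python
import re
from lxml.etree import HTML


def parse_release(page):
    """Return (codename, version) from release-page HTML, or raise ValueError."""
    root = HTML(page)

    # findtext() returns None when there is no <title>, so fall back to ''.
    title_text = root.findtext('head/title') or ''
    match = re.search('\u201c(.*)\u201d', title_text)
    if match is None:
        raise ValueError('no quoted codename in the page title')

    # xpath() returns an empty list when nothing matches.
    paragraphs = root.xpath('//div[@id="content"]/p[1]')
    if not paragraphs or not paragraphs[0].text:
        raise ValueError('no first paragraph in the content div')

    words = paragraphs[0].text.split()
    if len(words) < 2:
        raise ValueError('first paragraph is too short to hold a version')

    return match.group(1), words[1]


# Made-up page that mimics the layout the script expects.
sample = (
    '<html><head>'
    '<title>Project \u201cexample\u201d Release Information</title>'
    '</head><body>'
    '<div id="content"><p>Project 1.2 was released today.</p></div>'
    '</body></html>'
)

release, version = parse_release(sample)
print('Codename: {}\nVersion: {}'.format(release, version))
```

With the live site, the same function could be called on the body of the response from requests.get(); raising ValueError gives the caller one exception type to handle when the page layout no longer matches.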