Foundations of Python Network Programming

Chapter 11 ■ the World Wide Web


The “network location” netloc can have several subordinate pieces, but they are uncommon enough that urlsplit()
does not break them out as separate items in its tuple. Instead, they are available only as attributes of its result.





>>> from urllib.parse import urlsplit
>>> u = urlsplit('https://brandon:atigdng@localhost:8000/')
>>> u.netloc
'brandon:atigdng@localhost:8000'
>>> u.username
'brandon'
>>> u.password
'atigdng'
>>> u.hostname
'localhost'
>>> u.port
8000
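As a small illustrative sketch (the URL below is an assumption chosen for this example, not one taken from the text), the same attributes come back as None when the corresponding piece is absent from the netloc, so code can test for them without picking the netloc apart by hand:

```python
from urllib.parse import urlsplit

# Hypothetical URL with no credentials and no explicit port.
bare = urlsplit('http://example.com/index.html')

username = bare.username   # None: there is no "user@" prefix in the netloc
password = bare.password   # None: no ":password" was given
hostname = bare.hostname   # 'example.com'
port = bare.port           # None: the scheme's default port is not filled in

# Fall back to a default of our own choosing when no port is given.
effective_port = port if port is not None else 80
print(hostname, effective_port)
```

Note that the fallback to port 80 is this sketch's own choice; urlsplit() itself does not supply a default.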





Reducing a URL to pieces is only half of the process of parsing. The path and query components can both include
characters that had to be escaped before becoming part of the URL. For example, & and # cannot appear literally
because they delimit the URL itself. And the character / needs to be escaped if it occurs inside a particular path
component because the slash serves to separate path components.
The query portion of a URL has encoding rules all its own. Query values often contain spaces—think of all of
the searches you type into Google that include a space—and so the plus sign + is designated as an alternative way of
encoding spaces in queries. The query string would otherwise only have the option of encoding spaces the way the
rest of the URL does, as a %20 hexadecimal escape code.
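The difference between the two space encodings can be sketched with the quoting helpers in urllib.parse. The variable names here are only illustrative; the point is that the plain routines leave a plus sign alone, while the "plus" variants apply the query-string rule:

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

# Encoding: the general rule uses %20, the query rule uses a plus sign.
encoded_for_path = quote('packet loss')        # 'packet%20loss'
encoded_for_query = quote_plus('packet loss')  # 'packet+loss'

# Decoding: unquote() treats + as a literal plus sign,
# while unquote_plus() turns it back into a space.
decoded_plain = unquote('packet+loss')         # 'packet+loss'
decoded_query = unquote_plus('packet+loss')    # 'packet loss'

# Both decoders understand the %20 form.
decoded_hex = unquote_plus('packet%20loss')    # 'packet loss'
```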
The only correct way to parse a URL that visits the “Q&A” section of your site, descends into its “TCP/IP”
subsection, and searches there for information about “packet loss” is to split the path into components first and
unquote each component afterward, as follows:





>>> from urllib.parse import parse_qs, parse_qsl, unquote, urlsplit
>>> u = urlsplit('http://example.com/Q%26A/TCP%2FIP?q=packet+loss')
>>> path = [unquote(s) for s in u.path.split('/')]
>>> query = parse_qsl(u.query)
>>> path
['', 'Q&A', 'TCP/IP']
>>> query
[('q', 'packet loss')]





Note that my splitting of the path using split() returns an initial empty string because this particular path is an
absolute path that begins with a slash.
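To see why the order of operations matters, here is a brief sketch that contrasts splitting before unquoting with the tempting shortcut of unquoting the whole path first. The names right and wrong are just labels for this illustration:

```python
from urllib.parse import unquote, urlsplit

u = urlsplit('http://example.com/Q%26A/TCP%2FIP?q=packet+loss')

# Split on the literal slashes first, then decode each component.
right = [unquote(s) for s in u.path.split('/')]

# Decoding first turns %2F into a real slash, so the later split()
# can no longer tell it apart from a component separator.
wrong = unquote(u.path).split('/')

print(right)  # ['', 'Q&A', 'TCP/IP']
print(wrong)  # ['', 'Q&A', 'TCP', 'IP']
```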
The query is given as a list of tuples, and not a simple dictionary, because a URL query string allows a query
parameter to be specified multiple times. If you are writing code that does not care about this possibility, you can pass
the list of tuples to dict() and you will only see the last value given for each parameter. If you want a dictionary back
but also want to let a parameter be specified multiple times, you can switch from parse_qsl() to parse_qs() and get
back a dictionary whose values are lists.





>>> parse_qs(u.query)
{'q': ['packet loss']}
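The three options are easier to compare on a query string that really does repeat a parameter. The query string below is a hypothetical one made up for this sketch:

```python
from urllib.parse import parse_qs, parse_qsl

# Hypothetical query string in which "lang" is given twice.
query_string = 'q=packet+loss&lang=en&lang=fr'

# A list of tuples preserves every occurrence, in order.
pairs = parse_qsl(query_string)
# [('q', 'packet loss'), ('lang', 'en'), ('lang', 'fr')]

# dict() keeps only the last value seen for each name.
last_only = dict(pairs)
# {'q': 'packet loss', 'lang': 'fr'}

# parse_qs() groups all values for a name into a list.
grouped = parse_qs(query_string)
# {'q': ['packet loss'], 'lang': ['en', 'fr']}
```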





The Standard Library provides all of the necessary routines to go back in the other direction. Given the path and
query shown previously, Python can reconstruct the URL from its parts by quoting each path component, joining
them back together with slashes, encoding the query, and presenting the result to urlunsplit(), the “unsplit” routine
that is the opposite of the urlsplit() function called earlier.
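One way those steps might look is sketched below. It assumes the path and query values produced earlier, repeated here so that the example runs on its own, and it passes safe='' to quote() so that a slash inside a component is escaped rather than left alone:

```python
from urllib.parse import quote, urlencode, urlunsplit

# Values as produced by the parsing steps described earlier.
path = ['', 'Q&A', 'TCP/IP']
query = [('q', 'packet loss')]

# Quote each component on its own; safe='' makes quote() escape "/" too.
quoted_path = '/'.join(quote(p, safe='') for p in path)

# urlencode() applies the query rules, so the space becomes a plus sign.
encoded_query = urlencode(query)

# The five fields are scheme, netloc, path, query, and fragment.
url = urlunsplit(('http', 'example.com', quoted_path, encoded_query, ''))
print(url)  # http://example.com/Q%26A/TCP%2FIP?q=packet+loss
```

Without safe='' the default behavior of quote() leaves slashes untouched, which would silently turn the single “TCP/IP” component back into two.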
