Foundations of Python Network Programming


Chapter 11 ■ the World Wide Web







>>> from urllib.parse import quote, urlencode, urlunsplit
>>> urlunsplit(('http', 'example.com',
...             '/'.join(quote(p, safe='') for p in path),
...             urlencode(query), ''))
'http://example.com/Q%26A/TCP%2FIP?q=packet+loss'





If you carefully defer all URL parsing to these Standard Library routines, you should find that all of the tiny details
of the full specification are taken care of for you.
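As a minimal sketch of that advice, the following script builds the URL shown above and then takes it apart again. The values of path and query are assumptions chosen to reproduce that output, and the helper names build_url() and take_apart() are hypothetical, not part of the Standard Library.

```python
from urllib.parse import (parse_qsl, quote, unquote, urlencode,
                          urlsplit, urlunsplit)

# Assumed inputs, chosen to reproduce the URL shown in the session above.
path = ['', 'Q&A', 'TCP/IP']
query = [('q', 'packet loss')]

def build_url(scheme, netloc, path, query):
    """Assemble a URL, quoting each path component on its own."""
    quoted_path = '/'.join(quote(p, safe='') for p in path)
    return urlunsplit((scheme, netloc, quoted_path, urlencode(query), ''))

def take_apart(url):
    """Recover the unquoted path components and the query pairs."""
    parts = urlsplit(url)
    # Split on '/' first, then unquote, so an escaped slash stays in place.
    components = [unquote(p) for p in parts.path.split('/')]
    return components, parse_qsl(parts.query)

url = build_url('http', 'example.com', path, query)
print(url)
print(take_apart(url))
```

The order of operations in take_apart() is the important part: splitting before unquoting is what keeps the %2F inside 'TCP/IP' from being mistaken for a path separator.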
The code in the previous examples is so utterly correct that some programmers might describe it as fussy,
or even overwrought. How often, really, do path components themselves have slashes in them? Most web sites are
careful to design path elements, called slugs by developers, so that they never require ugly escaping to appear in a
URL. If a site only allows URL slugs to include letters, numbers, dashes, and the underscore, then the fear that a slug
could itself include a slash is obviously misplaced.
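A site that enforces such a rule might check slugs with something like the sketch below. The pattern and the name is_safe_slug() are assumptions made for illustration; real sites define their own rules.

```python
import re
from urllib.parse import quote

# Hypothetical slug rule: letters, digits, dashes, and underscores only.
SLUG_RE = re.compile(r'[A-Za-z0-9_-]+')

def is_safe_slug(text):
    """Return True if text matches the assumed slug rule in full."""
    return SLUG_RE.fullmatch(text) is not None

for candidate in ['packet-loss', 'tcp_ip_101', 'TCP/IP']:
    # A slug that passes the check comes back from quote() unchanged.
    print(candidate, is_safe_slug(candidate), quote(candidate, safe=''))
```

Slugs that pass this check contain only characters that quote() never escapes, so for them the strict and the relaxed approaches produce identical URLs.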
If you are sure that you are dealing with paths that never have escaped slashes inside individual path components,
then you can simply pass the whole path to quote() and unquote() without the bother of splitting it first.





>>> from urllib.parse import quote, unquote
>>> quote('Q&A/TCP IP')
'Q%26A/TCP%20IP'
>>> unquote('Q%26A/TCP%20IP')
'Q&A/TCP IP'





In fact, the quote() routine expects this to be the common case, and so its safe parameter defaults to '/', which
leaves slashes untouched. That default is what safe='' overrode in the fussy version of the code.
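The difference between the two settings, and the cost of the shortcut, can be seen in a short sketch. The variable names are chosen only for illustration.

```python
from urllib.parse import quote, unquote

component = 'TCP/IP'

# Default behavior: safe='/' leaves the slash alone.
default_quoted = quote(component)
# Fussy behavior: safe='' escapes the slash as %2F.
strict_quoted = quote(component, safe='')

print(default_quoted)   # TCP/IP
print(strict_quoted)    # TCP%2FIP

# Unquoting a whole path at once turns an escaped slash into a real one,
# so a later split() reports one component too many.
whole = unquote('/Q%26A/TCP%2FIP')
print(whole.split('/'))  # ['', 'Q&A', 'TCP', 'IP']
```

The last line shows why the shortcut is only appropriate when you know that no component can contain a slash.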
The Standard Library urllib.parse module also offers several routines that are more specialized than the general
ones outlined previously, including urldefrag() for splitting a URL apart from its fragment at the # character. Read the
documentation to learn about this and the other functions that can make a few special cases more convenient.
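As a brief illustration of urldefrag(), the sketch below splits a fragment from a URL. The fragment name section-2 is made up for the example.

```python
from urllib.parse import urldefrag

# The '#section-2' fragment is a hypothetical addition for illustration.
url, fragment = urldefrag(
    'http://example.com/Q%26A/TCP%2FIP?q=packet+loss#section-2')
print(url)       # http://example.com/Q%26A/TCP%2FIP?q=packet+loss
print(fragment)  # section-2

# A URL without a fragment comes back whole, with an empty fragment.
bare_url, bare_fragment = urldefrag('http://example.com/')
print(repr(bare_url), repr(bare_fragment))
```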


Relative URLs

Your filesystem command line supports a “change working directory” command that establishes the location where
the system will start searching relative paths, which lack a leading slash. Paths that do start with a slash are explicitly
declaring that they begin their search at the root of the filesystem. They are absolute paths, which always name the
same location regardless of your working directory.


$ wc -l /var/log/dmesg
977 /var/log/dmesg
$ wc -l dmesg
wc: dmesg: No such file or directory
$ cd /var/log
$ wc -l dmesg
977 dmesg
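The same rule can be sketched in Python with the posixpath module, which manipulates path strings without touching the disk. The function name resolve() and the directory /home/user are assumptions made for this illustration.

```python
import posixpath

def resolve(working_directory, path):
    """Resolve path against working_directory as a POSIX shell would."""
    # join() discards working_directory when path starts with a slash.
    return posixpath.normpath(posixpath.join(working_directory, path))

# A relative path depends on the working directory.
print(resolve('/var/log', 'dmesg'))              # /var/log/dmesg
print(resolve('/home/user', 'dmesg'))            # /home/user/dmesg

# An absolute path names the same location from anywhere.
print(resolve('/home/user', '/var/log/dmesg'))   # /var/log/dmesg
```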


Hypertext has the same concept. If all the links in a document are absolute URLs, like the ones in the previous
section, then there is no question about the resource to which each of them links. However, if the document includes
relative URLs, then the document’s own location will have to be taken into account.
Python provides a urljoin() routine that understands the entire standard in all of its nuance. Suppose that you have
recovered a URL from inside a hypertext document, and that it might be either relative or absolute. You can pass it
to urljoin(), together with the document's own URL as the base, to have any missing information filled in. If the URL
was absolute to begin with, no problem; it will be returned unchanged.
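A short sketch of urljoin() at work follows. The base URL and the links are hypothetical values invented for the illustration; only the urljoin() call itself comes from the Standard Library.

```python
from urllib.parse import urljoin

# Hypothetical location of the document in which the links were found.
base = 'http://example.com/docs/guide/index.html'

# Hypothetical links, relative and absolute, as they might appear in it.
links = ['intro.html', '../api/', '/about', 'https://www.python.org/']

resolved = [urljoin(base, link) for link in links]
for link, url in zip(links, resolved):
    print(link, '->', url)
```

The first two links are resolved against the directory of the base document, the third against the root of its site, and the fourth, being absolute already, is returned as is.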
