Learning Python Network Programming

(Sean Pound) #1
Chapter 2

The full rules for where the reserved characters need to be escaped are given in RFC
3986. However, urllib provides us with a couple of functions that help us construct
URLs, so we don't need to memorize all of these rules!


We just need to:



  • URL-encode the path

  • URL-encode the query string

  • Combine them by using the urllib.parse.urlunparse() function


Let's see how to use the aforementioned steps in code. First, we encode the path:





>>> from urllib.parse import quote
>>> path = 'pypi'
>>> path_enc = quote(path)





Then, we encode the query string:





>>> from urllib.parse import urlencode
>>> query_dict = {':action': 'search',
...               'term': 'Are you quite sure this is a cheese shop?'}
>>> query_enc = urlencode(query_dict)
>>> query_enc
'%3Aaction=search&term=Are+you+quite+sure+this+is+a+cheese+shop%3F'





Lastly, we compose everything into a URL:





>>> from urllib.parse import urlunparse
>>> netloc = 'pypi.python.org'
>>> urlunparse(('http', netloc, path_enc, '', query_enc, ''))
'http://pypi.python.org/pypi?%3Aaction=search&term=Are+you+quite+sure+this+is+a+cheese+shop%3F'
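
The three steps can also be wrapped in a single function. The following is only a minimal sketch: build_url() is a hypothetical helper name, not part of urllib, and the shortened search term is made up for illustration.

```python
from urllib.parse import quote, urlencode, urlunparse


def build_url(netloc, path, query, scheme='http'):
    """Hypothetical helper: encode the path and query, then assemble a URL."""
    path_enc = quote(path)        # percent-encode the path, leaving '/' alone
    query_enc = urlencode(query)  # build 'key=value&key=value' from a dict
    # urlunparse() takes six fields; params and fragment are left empty here.
    return urlunparse((scheme, netloc, path_enc, '', query_enc, ''))


url = build_url('pypi.python.org', 'pypi',
                {':action': 'search', 'term': 'cheese shop?'})
print(url)
```

Keeping the encoding inside one function means callers pass in raw, unescaped values and never have to think about which characters are reserved.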





The quote() function is set up specifically for encoding paths. By default,
it treats the slash character as safe and leaves it unencoded. This isn't obvious
in the preceding example, so try the following to see how it works:





>>> from urllib.parse import quote
>>> path = '/images/users/+Zoot+/'
>>> quote(path)
'/images/users/%2BZoot%2B/'





Notice that it ignores the slashes, but it escapes the +. That is perfect for paths.
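
The slash is left alone because quote() takes a safe argument that defaults to '/'. If a single path segment may itself contain a slash, we can pass an empty safe string so that the slash is escaped too. The following is a small sketch; the segment value is made up for illustration.

```python
from urllib.parse import quote

# A hypothetical path segment that contains a literal slash and a space.
segment = 'AC/DC live'

default_enc = quote(segment)          # '/' is in the default safe set
strict_enc = quote(segment, safe='')  # nothing is safe, so '/' is escaped

print(default_enc)
print(strict_enc)
```

Note that quote() encodes the space as %20, not as +.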


The urlencode() function is similarly intended for encoding query strings directly
from dicts. Notice how it percent-encodes our keys and values, replaces spaces
with +, and then joins the resulting key=value pairs with & to construct the query
string.
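
Two optional arguments of urlencode() are worth knowing about: doseq, which expands a list value into repeated keys, and quote_via, which selects the function used to escape each key and value. The following sketch uses made-up keys and values to show both.

```python
from urllib.parse import quote, urlencode

# doseq=True turns each item of a list value into its own key=value pair.
repeated = urlencode({'tag': ['dairy', 'hard cheese']}, doseq=True)

# quote_via=quote encodes spaces as %20 instead of the default +.
percent_spaces = urlencode({'term': 'cheese shop'}, quote_via=quote)

print(repeated)
print(percent_spaces)
```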


Lastly, the urlunparse() function expects a 6-tuple whose elements match those of
the result of urlparse(): scheme, netloc, path, params, query, and fragment. We
don't use params or fragment here, hence the two empty strings.
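
Because urlparse() and urlunparse() share the same six fields, a parsed URL can be modified and reassembled. The following is a minimal sketch of that round trip; the example URL and the replacement query are assumptions chosen for illustration.

```python
from urllib.parse import urlencode, urlparse, urlunparse

original = 'http://pypi.python.org/pypi?%3Aaction=search#results'
parts = urlparse(original)

# The result is a named tuple with six fields:
# scheme, netloc, path, params, query, fragment
print(parts)

# Passing it straight back to urlunparse() rebuilds the URL.
rebuilt = urlunparse(parts)

# _replace() returns a copy with one field swapped, here a new query string.
changed = urlunparse(parts._replace(query=urlencode({'term': 'spam'})))
print(changed)
```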
