Learning Python Network Programming

(Sean Pound) #1

HTTP and Working with the Web


Since urllib follows redirects for us, they generally don't affect us, but it's worth
knowing that a response urllib returns may be for a URL different from the one we
requested. Also, if we hit too many redirects for a single request (more than
10 for urllib), then urllib will give up and raise a urllib.error.HTTPError
exception.
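
The following is a minimal sketch of both behaviors, not code from the original text. To avoid depending on a real website, it starts a throwaway local server; the RedirectDemoHandler class, the /old, /new and /hop/N paths, and the run_redirect_demo() function are all hypothetical names invented for this illustration. An opener with an empty ProxyHandler is used instead of plain urlopen() only so that any proxy settings in the environment don't interfere with the loopback request.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.error import HTTPError
from urllib.request import ProxyHandler, build_opener


class RedirectDemoHandler(BaseHTTPRequestHandler):
    """Hypothetical local test server, used only for this illustration."""

    def do_GET(self):
        if self.path == '/old':
            # A single redirect to the final resource.
            self._redirect('/new')
        elif self.path.startswith('/hop/'):
            # A chain of redirects that never ends: /hop/0 -> /hop/1 -> ...
            hop = int(self.path.rsplit('/', 1)[1])
            self._redirect('/hop/{}'.format(hop + 1))
        else:
            body = b'final resource'
            self.send_response(200)
            self.send_header('Content-Length', str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def _redirect(self, location):
        self.send_response(302)
        self.send_header('Location', location)
        self.send_header('Content-Length', '0')
        self.end_headers()

    def log_message(self, format, *args):
        pass  # silence the per-request log lines


def run_redirect_demo():
    # Port 0 asks the operating system for any free port.
    server = HTTPServer(('127.0.0.1', 0), RedirectDemoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = 'http://127.0.0.1:{}'.format(server.server_address[1])
    # Assumption: bypass any configured proxies for the local request.
    opener = build_opener(ProxyHandler({}))
    try:
        # urllib follows the redirect, so the final URL differs from ours.
        with opener.open(base + '/old') as response:
            final_url = response.geturl()
        # The endless chain makes urllib give up with an HTTPError.
        try:
            opener.open(base + '/hop/0')
            gave_up = False
        except HTTPError:
            gave_up = True
    finally:
        server.shutdown()
        server.server_close()
    return base, final_url, gave_up


base, final_url, gave_up = run_redirect_demo()
print('requested:', base + '/old')
print('received: ', final_url)
print('gave up on endless redirects:', gave_up)
```

Comparing the URL we asked for with the value returned by geturl() is a simple way to detect that a redirect took place.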


URLs


Uniform Resource Locators, or URLs, are fundamental to the way in which the web
operates, and they have been formally described in RFC 3986. A URL represents a
resource on a given host. How URLs map to the resources on the remote system is
entirely at the discretion of the system administrator. URLs can point to files on
the server, or the resources may be dynamically generated when a request is received.
What a URL maps to doesn't matter to us as a client, though, as long as the URL
works when we request it.


URLs are composed of several sections. Python uses the urllib.parse module for
working with URLs. Let's use Python to break a URL into its component parts:

>>> from urllib.parse import urlparse
>>> result = urlparse('http://www.python.org/dev/peps')
>>> result
ParseResult(scheme='http', netloc='www.python.org', path='/dev/peps',
params='', query='', fragment='')


The urllib.parse.urlparse() function interprets our URL and recognizes http as
the scheme, www.python.org as the network location, and /dev/peps as the path.
We can access these components as attributes of the ParseResult:

>>> result.netloc
'www.python.org'
>>> result.path
'/dev/peps'
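
As a small supplementary sketch, the example below parses a URL in which all six components are non-empty. The URL itself is a made-up example.com address chosen only for illustration. Because ParseResult is a named tuple, it can be unpacked in order as well as accessed by attribute, and its geturl() method reassembles the parts.

```python
from urllib.parse import urlparse

# Hypothetical URL, chosen so that every component is non-empty.
url = 'http://www.example.com/docs/page;type=a?lang=en#intro'
result = urlparse(url)

# ParseResult is a named tuple, so it unpacks in field order.
scheme, netloc, path, params, query, fragment = result

print(scheme)    # http
print(netloc)    # www.example.com
print(path)      # /docs/page
print(params)    # type=a
print(query)     # lang=en
print(fragment)  # intro

# geturl() puts the components back together.
print(result.geturl() == url)  # True
```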


For almost all resources on the web, we'll be using the http or https schemes. In
these schemes, to locate a specific resource, we need to know the host that it resides
on and the TCP port that we should connect to (together these are the netloc
component), and we also need to know the path to the resource on the host
(the path component).
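
To make the host and port parts of netloc concrete, here is a brief sketch using the hostname and port attributes of the parse result. The host_and_port() helper and the DEFAULT_PORTS table are hypothetical names introduced for this illustration; the port numbers 80 and 443 are the standard defaults for http and https, which apply when the URL doesn't give a port explicitly.

```python
from urllib.parse import urlparse

# Standard default ports, used when a URL doesn't specify one.
DEFAULT_PORTS = {'http': 80, 'https': 443}


def host_and_port(url):
    """Return the (host, port) pair that a client would connect to."""
    parts = urlparse(url)
    port = parts.port  # None when the netloc contains no explicit port
    if port is None:
        port = DEFAULT_PORTS.get(parts.scheme)
    return parts.hostname, port


print(host_and_port('http://www.python.org:8080/dev/peps'))
# ('www.python.org', 8080)
print(host_and_port('https://www.python.org/dev/peps'))
# ('www.python.org', 443)
```

Note that netloc itself keeps the port as part of the string (for example, 'www.python.org:8080'), whereas hostname and port give the two parts separately.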
