Learning Python Network Programming

(Sean Pound) #1
Chapter 2

Port numbers can be specified explicitly in a URL by appending them to the host.
They are separated from the host by a colon. Let's see what happens when we try
this with urlparse.





>>> urlparse('http://www.python.org:8080/')
ParseResult(scheme='http', netloc='www.python.org:8080', path='/',
            params='', query='', fragment='')


The urlparse function simply treats the port as part of the netloc. This is fine
because handlers such as urllib.request.urlopen() expect the port in this form.
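
If we do need the host and the port separately, the result object returned by urlparse also exposes hostname and port attributes. The following is a minimal sketch using the same URL as above; the variable name result is just chosen for illustration.

```python
from urllib.parse import urlparse

result = urlparse('http://www.python.org:8080/')

# netloc holds the host and port together, exactly as written in the URL.
print(result.netloc)    # www.python.org:8080

# hostname and port are convenience attributes that split netloc for us.
print(result.hostname)  # www.python.org
print(result.port)      # 8080 (an int, not a string)
```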


If we don't supply a port (as is usually the case), then the default port 80 is used for
http, and the default port 443 is used for https. This is usually what we want, as
these are the standard ports for the HTTP and HTTPS protocols respectively.
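
When a URL has no explicit port, the port attribute of the parsed result is None, so code that needs the actual port number has to fall back to the scheme's default itself. The sketch below shows one way this could be done; the DEFAULT_PORTS table and the effective_port function are hypothetical names introduced for this example and are not part of urllib.

```python
from urllib.parse import urlparse

# Hypothetical lookup table for this sketch; not part of urllib.
DEFAULT_PORTS = {'http': 80, 'https': 443}

def effective_port(url):
    """Return the explicit port if the URL has one, else the scheme's default."""
    parts = urlparse(url)
    if parts.port is not None:
        return parts.port
    # Returns None for schemes that are not in our table.
    return DEFAULT_PORTS.get(parts.scheme)

print(effective_port('http://www.python.org/'))       # 80
print(effective_port('https://www.python.org/'))      # 443
print(effective_port('http://www.python.org:8080/'))  # 8080
```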


Paths and relative URLs


The path in a URL is everything that comes after the host and the port, up to any
query string or fragment. In an absolute URL, the path starts with a forward slash
(/), and when just a slash appears on its own, it's called the root. We can see this
by parsing the following:





>>> urlparse('http://www.python.org/')
ParseResult(scheme='http', netloc='www.python.org', path='/',
            params='', query='', fragment='')


If a URL doesn't include a path, then by default urllib will send a request for
the root.
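
Note that urlparse itself does not fill in the root: for a URL with no path, it reports an empty path, and the substitution happens later when the request is made. The following sketch imitates that fallback; request_path is a hypothetical helper written for this example, not a urllib function.

```python
from urllib.parse import urlparse

def request_path(url):
    """Hypothetical helper: the path to request, defaulting to the root."""
    # An empty path is falsy, so 'or' substitutes the root for it.
    return urlparse(url).path or '/'

print(repr(urlparse('http://www.python.org').path))  # ''
print(request_path('http://www.python.org'))         # /
print(request_path('http://www.python.org/about/'))  # /about/
```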


When a scheme and a host are included in a URL (as in the previous example), the
URL is called an absolute URL. Conversely, it's possible to have relative URLs,
which contain just a path component, as shown here:





>>> urlparse('../images/tux.png')
ParseResult(scheme='', netloc='', path='../images/tux.png',
            params='', query='', fragment='')


We can see that the ParseResult contains only a path. If we want to use a relative
URL to request a resource, then we need to supply the missing scheme, host, and
base path.


Usually, we encounter relative URLs in a resource that we've already retrieved from
a URL. So, we can just use this resource's URL to fill in the missing components. Let's
look at an example.
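
As a minimal sketch, the urljoin function in urllib.parse can combine a base URL with a relative URL. The base URL used here is an assumed, illustrative address for the page in which the relative link was found.

```python
from urllib.parse import urljoin

# Assumed URL of the page in which the relative link was found.
base_url = 'http://www.python.org/community/index.html'

# The scheme and host come from the base URL; '..' steps up one
# directory from /community/ before descending into images/.
absolute = urljoin(base_url, '../images/tux.png')
print(absolute)  # http://www.python.org/images/tux.png
```

The result is an absolute URL that can be passed to a handler such as urllib.request.urlopen().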
