Foundations of Python Network Programming

Chapter 11 ■ the World Wide Web


The argument order of urljoin() is the same as that of os.path.join(). First provide the base URL of the
document that you are examining and then provide the URL that you have found inside of it. There are several
different ways that a relative URL can rewrite parts of its base.





>>> from urllib.parse import urljoin
>>> base = 'http://tools.ietf.org/html/rfc3986'
>>> urljoin(base, 'rfc7320')
'http://tools.ietf.org/html/rfc7320'
>>> urljoin(base, '.')
'http://tools.ietf.org/html/'
>>> urljoin(base, '..')
'http://tools.ietf.org/'
>>> urljoin(base, '/dailydose/')
'http://tools.ietf.org/dailydose/'
>>> urljoin(base, '?version=1.0')
'http://tools.ietf.org/html/rfc3986?version=1.0'
>>> urljoin(base, '#section-5.4')
'http://tools.ietf.org/html/rfc3986#section-5.4'
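As a rough illustration of the "base first, found URL second" order in practice, the following sketch collects the href values from a fragment of HTML and resolves each one against the URL of the page that contained it. The LinkCollector class, the resolve_links() helper, and the sample HTML are hypothetical names and data invented for this example; only urljoin() and the Standard Library HTMLParser are real.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical helper: gather <a href="..."> values as absolute URLs."""

    def __init__(self, base):
        super().__init__()
        self.base = base      # URL of the document being examined
        self.links = []       # resolved URLs, in document order

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value is not None:
                # Base URL first, then the URL found inside the document.
                self.links.append(urljoin(self.base, value))


def resolve_links(base, html):
    """Return every link in `html`, resolved against `base`."""
    collector = LinkCollector(base)
    collector.feed(html)
    return collector.links


# Made-up page content, for illustration only.
sample_html = (
    '<a href="rfc7320">one</a> '
    '<a href="/dailydose/">two</a> '
    '<a href="#section-5.4">three</a>'
)
links = resolve_links('http://tools.ietf.org/html/rfc3986', sample_html)
for link in links:
    print(link)
```

Each printed URL should match the corresponding result in the interactive session above, since the same base and the same relative references are used.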





Again, it is perfectly safe to provide an absolute URL to urljoin() because it will detect that the URL is entirely
self-contained and return it without borrowing anything from the base URL.





>>> urljoin(base, 'https://www.google.com/search?q=apod&btnI=yes')
'https://www.google.com/search?q=apod&btnI=yes'





Relative URLs make it easy, even on static parts of a page, to write web pages that are agnostic about whether they
are served by HTTP or HTTPS because a relative URL can omit the scheme but specify everything else. In that case,
only the scheme is copied from the base URL.





>>> urljoin(base, '//www.google.com/search?q=apod')
'http://www.google.com/search?q=apod'
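To see the scheme-agnostic behavior from both sides, this small sketch resolves the same scheme-less reference against an HTTP base and an HTTPS base. The HTTPS variant of the base URL is an assumption added purely for comparison.

```python
from urllib.parse import urljoin

# A reference that omits the scheme but specifies everything else.
reference = '//www.google.com/search?q=apod'

bases = [
    'http://tools.ietf.org/html/rfc3986',
    'https://tools.ietf.org/html/rfc3986',   # assumed HTTPS twin of the base
]

# Only the scheme should be borrowed from each base.
results = [urljoin(base, reference) for base in bases]
for result in results:
    print(result)
```

The host, path, and query come entirely from the reference; the two results differ only in whether they begin with http or https.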





If your site is going to use relative URLs, then it is critical that you be strict about whether pages carry a trailing
slash or not because a relative URL means two different things depending on whether the trailing slash is present.





>>> urljoin('http://tools.ietf.org/html/rfc3986', 'rfc7320')
'http://tools.ietf.org/html/rfc7320'
>>> urljoin('http://tools.ietf.org/html/rfc3986/', 'rfc7320')
'http://tools.ietf.org/html/rfc3986/rfc7320'





What might look to the naked eye like a slight difference between these two base URLs is crucial for the meaning
of any relative links! The first URL can be thought of as visiting the html directory in order to display the rfc3986 file
that it finds there, which leaves the “current working directory” as the html directory. The second URL instead treats
rfc3986 itself as the directory that it is visiting, because only directories can take a trailing slash in a real filesystem.
So, the relative link built atop the second URL starts building at the rfc3986 component instead of at its parent html
component.
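One way to make the “current working directory” idea concrete is to resolve the relative reference '.' against each base, which asks urljoin() which directory it considers the base to live in. This is only a sketch; the variable names are invented for the example.

```python
from urllib.parse import urljoin

without_slash = 'http://tools.ietf.org/html/rfc3986'
with_slash = 'http://tools.ietf.org/html/rfc3986/'

# '.' means "the directory I am currently in."
cwd_without = urljoin(without_slash, '.')
cwd_with = urljoin(with_slash, '.')

print(cwd_without)   # the html directory
print(cwd_with)      # rfc3986 itself, treated as a directory
```

The first base resolves to the html directory, while the second resolves to rfc3986/ itself, which is why a link such as rfc7320 lands in different places for the two bases.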
Always design your site so that a user arriving at a URL that is written the wrong way gets immediately redirected
to the correct path. For example, if you try visiting the second URL from the previous example, then the IETF
web server will detect the erroneous trailing slash and send a Location: header with the correct URL in its response.
This carries a lesson for anyone who writes a web client: relative URLs are not necessarily relative to the path that you
provided in your HTTP request! If the site chooses to respond with a Location: header, then relative URLs should be
constructed relative to that alternative location.
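A client might handle this along the following lines. This is a minimal sketch that performs no network I/O: effective_base() is a hypothetical helper, and the Location value shown is an assumed example of what a server could send, not a recorded response.

```python
from urllib.parse import urljoin


def effective_base(request_url, location=None):
    """Hypothetical helper: pick the URL that relative links resolve against.

    `location` is the value of the Location: header, or None if the
    response did not include one.
    """
    if location is None:
        return request_url
    # A Location value may itself be relative, so resolve it against
    # the URL that was originally requested.
    return urljoin(request_url, location)


requested = 'http://tools.ietf.org/html/rfc3986/'
# Assumed header value, standing in for a real server response.
location = 'http://tools.ietf.org/html/rfc3986'

base = effective_base(requested, location)
link = urljoin(base, 'rfc7320')
print(link)
```

Resolving rfc7320 against the requested URL would have produced a path beneath rfc3986/, whereas resolving it against the redirect target keeps it in the html directory. A client built on urllib.request can obtain the post-redirect URL from the response object's geturl() method rather than tracking the header by hand.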
