HTTP and Working with the Web
Suppose that we've retrieved the page at http://www.debian.org, and within its
source code we found a link to the 'About' page. The link is given as the relative
URL intro/about.
We can create an absolute URL by using the URL for the original page and the
urllib.parse.urljoin() function. Let's see how we can do this:
from urllib.parse import urljoin
urljoin('http://www.debian.org', 'intro/about')
'http://www.debian.org/intro/about'
By supplying urljoin with a base URL and a relative URL, we've created a new
absolute URL.
Here, notice how urljoin has filled in the slash between the host and the path. The
only time that urljoin will fill in a slash for us is when the base URL does not have
a path, as shown in the preceding example. Let's see what happens if the base URL
does have a path.
urljoin('http://www.debian.org/intro/', 'about')
'http://www.debian.org/intro/about'
urljoin('http://www.debian.org/intro', 'about')
'http://www.debian.org/about'
The two calls give different results. Notice how urljoin appends to the path if the
base URL ends in a slash, but it replaces the last path element of the base URL if the
base URL doesn't end in a slash.
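If we always want the relative URL to be appended, one option is to make sure the base URL ends in a slash before joining. The following is a minimal sketch of that idea. The join_under name is our own hypothetical helper, not part of urllib, and it assumes the base URL has no query string or fragment.

```python
from urllib.parse import urljoin


def join_under(base, relative):
    """Join relative beneath base, treating base as a directory.

    Hypothetical helper, not part of urllib. It assumes base carries
    no query string or fragment.
    """
    # Add the trailing slash if it is missing, so that urljoin appends
    # to the base path instead of replacing its last element.
    if not base.endswith('/'):
        base += '/'
    return urljoin(base, relative)


print(join_under('http://www.debian.org/intro', 'about'))
print(join_under('http://www.debian.org/intro/', 'about'))
```

Both calls print the same URL, because the helper normalizes the base before urljoin sees it. This is only appropriate when we know the base URL refers to a directory-like resource.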
We can force a relative URL to replace the entire path of the base URL by prefixing
it with a slash. Do the following:
urljoin('http://www.debian.org/intro/about', '/News')
'http://www.debian.org/News'
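To see which parts of the base URL survive such a join, we can split both URLs with urllib.parse.urlsplit() and compare them. This is a small illustrative check rather than something we would normally need in practice.

```python
from urllib.parse import urljoin, urlsplit

base = 'http://www.debian.org/intro/about'
joined = urljoin(base, '/News')

base_parts = urlsplit(base)
joined_parts = urlsplit(joined)

# The scheme and host are taken from the base URL unchanged.
print(joined_parts.scheme == base_parts.scheme)
print(joined_parts.netloc == base_parts.netloc)

# Only the path differs: the leading slash replaced all of it.
print(base_parts.path)
print(joined_parts.path)
```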
How about navigating to parent directories? Let's try the standard dot syntax,
as shown here:
urljoin('http://www.debian.org/intro/about/', '../News')
'http://www.debian.org/intro/News'
urljoin('http://www.debian.org/intro/about/', '../../News')
'http://www.debian.org/News'
urljoin('http://www.debian.org/intro/about', '../News')
'http://www.debian.org/News'
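The dot syntax interacts with the trailing-slash rule described earlier, so it can help to compare the cases side by side. The following sketch joins the same relative URLs against a base with and without a trailing slash. The single-dot form ./News is our own addition for comparison.

```python
from urllib.parse import urljoin

base_dir = 'http://www.debian.org/intro/about/'   # ends in a slash
base_file = 'http://www.debian.org/intro/about'   # no trailing slash

results = {}
for base in (base_dir, base_file):
    for relative in ('./News', '../News'):
        # Without a trailing slash, the last element of the base path
        # is dropped before the dots are resolved.
        results[(base, relative)] = urljoin(base, relative)

for (base, relative), url in results.items():
    print(base, relative, '->', url)
```

With the trailing slash, ../News climbs one level from intro/about/ and lands in intro/. Without it, about is discarded first, so the same relative URL ends up at the top level.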