Learning Python Network Programming

HTTP and Working with the Web

Suppose that we've retrieved the http://www.debian.org URL and, within the
webpage source code, found a link to the 'About' page. The link is given as the
relative URL intro/about.

We can create an absolute URL by using the URL for the original page and the
urllib.parse.urljoin() function. Let's see how we can do this:

from urllib.parse import urljoin

urljoin('http://www.debian.org', 'intro/about')

'http://www.debian.org/intro/about'

By supplying urljoin with a base URL and a relative URL, we've created a new
absolute URL.
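
In practice a page usually contains many links, and each one is resolved against the URL of the page it came from. The following is a minimal sketch of that idea. Note that resolve_links is a hypothetical helper of our own, not part of urllib, and only the intro/about link comes from the example above; the other two are made up for illustration.

```python
from urllib.parse import urljoin

def resolve_links(page_url, hrefs):
    """Resolve each href against page_url (hypothetical helper, not part of urllib)."""
    return [urljoin(page_url, href) for href in hrefs]

page_url = 'http://www.debian.org'
# Only 'intro/about' appears in the text; the other hrefs are illustrative.
hrefs = ['intro/about', 'News', 'distrib/']

absolute = resolve_links(page_url, hrefs)
for url in absolute:
    print(url)
```

Because the base URL here has no path, each relative URL ends up directly under the host.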

Notice how urljoin has filled in the slash between the host and the path. The
only time that urljoin will fill in a slash for us is when the base URL does not
have a path, as in the preceding example. Let's see what happens if the base URL
does have a path.

urljoin('http://www.debian.org/intro/', 'about')

'http://www.debian.org/intro/about'

urljoin('http://www.debian.org/intro', 'about')

'http://www.debian.org/about'

The two calls give different results. Notice how urljoin appends to the path if
the base URL ends in a slash, but replaces the last path element of the base URL
if the base URL doesn't end in a slash.
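
If we always want the relative URL to be appended, one option is to make sure that the base URL ends in a slash before joining. The sketch below shows this with a hypothetical helper, join_under, which is our own and not part of urllib. It assumes that the base URL carries no query string or fragment, since a slash appended after those would not end up in the path.

```python
from urllib.parse import urljoin

def join_under(base, relative):
    """Treat base as a directory so that relative is appended to its path.

    Hypothetical helper; assumes base has no query string or fragment.
    """
    if not base.endswith('/'):
        base += '/'
    return urljoin(base, relative)

with_slash = join_under('http://www.debian.org/intro/', 'about')
without_slash = join_under('http://www.debian.org/intro', 'about')

print(with_slash)
print(without_slash)
```

Both calls now produce the same URL, whichever form of the base we start with.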

We can force a path to replace the entire path of the base URL, rather than just
its last element, by prefixing it with a slash. The scheme and host of the base
URL are kept. Do the following:

urljoin('http://www.debian.org/intro/about', '/News')

'http://www.debian.org/News'
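
To check what the leading slash does and doesn't replace, we can compare the components of the URL before and after the join. This is only a sketch: it uses urllib.parse.urlsplit, which hasn't been introduced in the text above, and the variable names are our own.

```python
from urllib.parse import urljoin, urlsplit

base = 'http://www.debian.org/intro/about'
joined = urljoin(base, '/News')

base_parts = urlsplit(base)
joined_parts = urlsplit(joined)

# The scheme and host are carried over from the base URL...
print(joined_parts.scheme == base_parts.scheme)
print(joined_parts.netloc == base_parts.netloc)
# ...while the path comes entirely from the second argument.
print(joined_parts.path)
```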

How about navigating to parent directories? Let's try the standard dot syntax,
as shown here:

urljoin('http://www.debian.org/intro/about/', '../News')

'http://www.debian.org/intro/News'

urljoin('http://www.debian.org/intro/about/', '../../News')

'http://www.debian.org/News'

urljoin('http://www.debian.org/intro/about', '../News')

'http://www.debian.org/News'
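
The dot syntax can also be built up programmatically. As a rough sketch, the hypothetical helper below (up_and_join is our own name, not a urllib function) climbs a given number of directory levels from the base URL and then appends a name. It assumes that the base URL ends in a slash, so that every path element counts as a directory.

```python
from urllib.parse import urljoin

def up_and_join(base, levels, name):
    """Go up `levels` directories from base, then append name.

    Hypothetical helper; assumes base ends in a slash.
    """
    return urljoin(base, '../' * levels + name)

base = 'http://www.debian.org/intro/about/'
same_dir = up_and_join(base, 0, 'News')
one_up = up_and_join(base, 1, 'News')
two_up = up_and_join(base, 2, 'News')

print(same_dir)
print(one_up)
print(two_up)
```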
