Learning Python Network Programming

(Sean Pound) #1

HTTP and Working with the Web


Suppose that we've retrieved the http://www.debian.org URL, and within the
webpage source code we found a link to the 'About' page. The link is given
as the relative URL intro/about.


We can create an absolute URL by using the URL for the original page and the
urllib.parse.urljoin() function. Let's see how we can do this:





>>> from urllib.parse import urljoin
>>> urljoin('http://www.debian.org', 'intro/about')
'http://www.debian.org/intro/about'


By supplying urljoin with a base URL, and a relative URL, we've created a new
absolute URL.
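
The same call works for every link found on a page. The following is a minimal sketch of that idea; the base URL comes from the preceding example, but the list of relative links is made up for illustration and is not taken from the real page:

```python
from urllib.parse import urljoin

# The base URL is the page we retrieved.
base_url = 'http://www.debian.org'

# Hypothetical relative links, invented for this example.
relative_links = ['intro/about', 'distrib/', 'News']

# Resolve each relative link against the base URL.
absolute_links = [urljoin(base_url, link) for link in relative_links]

for link in absolute_links:
    print(link)
```

Each printed line is an absolute URL that could be passed on to a later request.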


Notice how urljoin has filled in the slash between the host and the path.
urljoin only fills in a slash for us when the base URL does not have a path,
as in the preceding example. Let's see what happens if the base URL does
have a path:





>>> urljoin('http://www.debian.org/intro/', 'about')
'http://www.debian.org/intro/about'
>>> urljoin('http://www.debian.org/intro', 'about')
'http://www.debian.org/about'


The two calls give different results. Notice how urljoin appends to the path
if the base URL ends in a slash, but replaces the last path element of the
base URL if it doesn't end in a slash.
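
If we know that a base URL refers to a directory, we can guard against a missing trailing slash before joining. The helper below is a small sketch rather than part of urllib; the name join_as_directory is our own, and it assumes that the base URL has no query string or fragment:

```python
from urllib.parse import urljoin

def join_as_directory(base, relative):
    """Join relative onto base, treating base as a directory.

    Hypothetical helper: assumes base has no query string or fragment.
    """
    # Add the trailing slash so that urljoin appends to the path
    # instead of replacing its last element.
    if not base.endswith('/'):
        base += '/'
    return urljoin(base, relative)

print(join_as_directory('http://www.debian.org/intro', 'about'))
print(join_as_directory('http://www.debian.org/intro/', 'about'))
```

Both calls print the same URL, because the helper normalizes the base before urljoin sees it.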


We can force a relative URL to replace the entire path of the base URL by
prefixing it with a slash. The scheme and the host are kept:





>>> urljoin('http://www.debian.org/intro/about', '/News')
'http://www.debian.org/News'
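
To confirm that only the path changes when the relative URL starts with a slash, we can split both URLs into their components. This is a brief sketch that uses urllib.parse.urlsplit(), which the text has not used so far; the variable names are our own:

```python
from urllib.parse import urljoin, urlsplit

base = 'http://www.debian.org/intro/about'
joined = urljoin(base, '/News')

base_parts = urlsplit(base)
joined_parts = urlsplit(joined)

# The scheme and the host are carried over from the base URL.
print(joined_parts.scheme == base_parts.scheme)
print(joined_parts.netloc == base_parts.netloc)

# The whole path of the base URL has been replaced.
print(base_parts.path, '->', joined_parts.path)
```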


How about navigating to parent directories? Let's try the standard dot syntax,
as shown here:





>>> urljoin('http://www.debian.org/intro/about/', '../News')
'http://www.debian.org/intro/News'
>>> urljoin('http://www.debian.org/intro/about/', '../../News')
'http://www.debian.org/News'
>>> urljoin('http://www.debian.org/intro/about', '../News')
'http://www.debian.org/News'
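
The parent directory examples follow the same trailing slash rule as before. The sketch below runs the three cases side by side; the cases list and the loop are our own arrangement of the calls already shown:

```python
from urllib.parse import urljoin

# (base URL, relative URL) pairs taken from the examples above.
cases = [
    ('http://www.debian.org/intro/about/', '../News'),
    ('http://www.debian.org/intro/about/', '../../News'),
    ('http://www.debian.org/intro/about', '../News'),
]

results = [urljoin(base, relative) for base, relative in cases]

for (base, relative), result in zip(cases, results):
    print(f'{base!r} + {relative!r} -> {result!r}')
```

In the third case the base URL has no trailing slash, so urljoin first drops the last element, about, and the single .. then removes intro. That is why it produces the same result as the second case.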
