Chapter 11 ■ the World Wide Web
This syntax serves more general purposes than describing material to be fetched from a network.
A uniform resource identifier (URI) can either locate a physical, network-accessible document, in which case it is
a URL, or act as a generic unique identifier that gives a computer-readable name to a conceptual entity, in which case
the label is called a uniform resource name (URN). Every URI in this book will specifically be a URL.
The pronunciation of URL, by the way, is you-are-ell. An “earl” is a member of the British aristocracy whose rank
is not quite that of a marquis but who does rank above a viscount—so an earl is the equivalent of a count over on the
Continent (not, in other words, a network document address).
When a document is automatically generated based on parameters specified by the user, the URL is extended
with a query string that starts with a question mark (?) and then uses the ampersand character (&) to delimit each
further parameter. Each parameter consists of a name, an equals sign, and a value.
https://www.google.com/search?q=apod&btnI=yes
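As a small illustration of the name, equals sign, and value structure, the following sketch takes the query string from the example URL apart and puts it back together. It uses parse_qsl() and urlencode() from the Standard Library's urllib.parse module; the variable names are only illustrative.

```python
from urllib.parse import parse_qsl, urlencode

# The query string from the example URL, without the leading '?'.
query = 'q=apod&btnI=yes'

# Split on '&' and '=' into an ordered list of (name, value) pairs.
pairs = parse_qsl(query)

# Going the other way, urlencode() joins the pairs with '&' and
# percent-quotes any characters that are unsafe in a query string.
rebuilt = urlencode(pairs)

print(pairs)
print(rebuilt)
```

For a query this simple, the rebuilt string is identical to the original. Values containing spaces or punctuation would come back quoted.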
Finally, a URL can be suffixed with a fragment that names the particular location on a page to which the link is referring.
http://tools.ietf.org/html/rfc2324#section-2.3.2
The fragment is different from the other components of a URL. Because a web browser presumes that it needs
to fetch the entire page named by the path in order to find the element named by the fragment, it does not actually
transmit the fragment in its HTTP request! All that the server can learn from the browser when it fetches an HTTP
URL is the hostname, the path, and the query. The hostname, you will recall from Chapter 9, is delivered as the Host
header, and the path and query are concatenated together to produce the full path that follows the HTTP method on
the first line of the request.
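To make the division of labor concrete, here is a minimal sketch of how a client might derive the opening of its request from a URL. The helper name request_head() is hypothetical, not part of any library, and the sketch assumes a plain GET with no credentials embedded in the URL.

```python
from urllib.parse import urlsplit

def request_head(url):
    """Return the request line and Host header a client might send for url."""
    u = urlsplit(url)
    # The path and query are joined back together; the fragment is left out.
    target = u.path or '/'
    if u.query:
        target += '?' + u.query
    return 'GET {} HTTP/1.1\r\nHost: {}\r\n\r\n'.format(target, u.netloc)

head = request_head('http://tools.ietf.org/html/rfc2324#section-2.3.2')
print(head)
```

The printed request names /html/rfc2324 and the host tools.ietf.org, but section-2.3.2 appears nowhere in it. The fragment stays behind in the browser.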
If you study RFC 3986, you will discover a few additional features that are only rarely in use. It is the authoritative
resource to consult when you run across rare features that you want to learn more about, like the possibility of
including a user:password@ authentication string right in the URL itself.
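As a quick sketch of that rarely used feature, urlsplit() from urllib.parse exposes the pieces of the network location as attributes. The URL below is hypothetical; its host, port, and credentials are placeholders chosen for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical URL with credentials and an explicit port.
u = urlsplit('https://alice:secret@example.com:8443/private')

print(u.netloc)    # the whole network location, credentials included
print(u.username)  # text before the colon in the userinfo part
print(u.password)  # text between the colon and the '@'
print(u.hostname)  # hostname alone, lowercased
print(u.port)      # port as an integer, or None if absent
```

Because the credentials travel in plain sight wherever the URL is logged or displayed, this form is best avoided outside of experiments.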
Parsing and Building URLs
The urllib.parse module that comes built in to the Python Standard Library provides the tools that you’ll need both
to interpret and to build URLs. Splitting a URL into its component pieces is a single function call. It returns what in
earlier versions of Python was simply a tuple, and you can still view the result that way and use integer indexing—or
tuple unpacking in an assignment statement—to access its items.
>>> from urllib.parse import urlsplit
>>> u = urlsplit('https://www.google.com/search?q=apod&btnI=yes')
>>> tuple(u)
('https', 'www.google.com', '/search', 'q=apod&btnI=yes', '')
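Integer indexing and tuple unpacking, both mentioned above, might look like the following in a script. This is only a sketch, and the variable names are arbitrary.

```python
from urllib.parse import urlsplit

url = 'https://www.google.com/search?q=apod&btnI=yes'

# Tuple unpacking: the result always has exactly five items.
scheme, netloc, path, query, fragment = urlsplit(url)

# Integer indexing: item 0 is the scheme, item 2 is the path.
first = urlsplit(url)[0]
third = urlsplit(url)[2]

print(scheme, netloc, path, query, repr(fragment))
print(first, third)
```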
But the tuple also supports named attribute access to its items to help make your code more readable when you
are inspecting a URL.
>>> u.scheme
'https'
>>> u.netloc
'www.google.com'
>>> u.path
'/search'
>>> u.query
'q=apod&btnI=yes'
>>> u.fragment
''
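Since the result is a named tuple, one plausible way to build a new URL is to copy an existing result with some fields changed and then reassemble it. The sketch below uses the named tuple's _replace() method together with geturl() and urlunsplit(); the replacement query and fragment values are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

u = urlsplit('https://www.google.com/search?q=apod&btnI=yes')

# _replace() returns a modified copy; the original u is left untouched.
v = u._replace(query='q=python', fragment='top')

# geturl() reassembles the five pieces into a single string.
rebuilt = v.geturl()

# urlunsplit() does the same for any five-item sequence.
same = urlunsplit(u)

print(rebuilt)
print(same)
```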