Working with Web Servers
These days, just about everyone gets information from the Internet. The World Wide Web (WWW)
has become a primary source of information for news, weather, sports, and even personal
information.
You can leverage this wealth of information on the Internet from your Python scripts. You might be
wondering how you can use your Python scripts to extract data from the graphical world of webpages.
Fortunately, Python makes it easy.
The Python urllib package, which is part of the standard Python library, allows you to interact with
a remote website to retrieve information. It retrieves the full HTML code sent from the website so
that you can store it in a variable. The downside is that you then have to parse through the HTML
code, looking for the content you need. Fortunately, Python provides help for doing that, too.
To summarize, extracting data from websites is basically a two-step process:
- Connect to the website and retrieve the webpage.
- Parse the HTML code to find the data you’re looking for.
The following sections walk through these two steps to help you retrieve useful information from any
website by using a Python script.
Retrieving Webpages
Retrieving the HTML code for a webpage involves three steps:
- Connect to the remote web server.
- Send an HTTP request for the webpage.
- Read the HTML code that the web server returns.
All these steps are handled with just two simple commands from the urllib module (after you
import the module):
import urllib.request

url = "https://www.example.com/"  # placeholder address; substitute the page you want
response = urllib.request.urlopen(url)
html = response.read()
The urlopen() function attempts to establish the HTTP connection with the remote website
specified in its parameter. You need to specify the full address in the URL, including the
http:// or https:// prefix. The read() method of the returned response object then retrieves
the HTML code sent from the remote website.
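In practice, a request can fail (the server may be down, or the address may be wrong), so it helps to wrap the two commands in a small function. The following is a minimal sketch rather than a definitive implementation: the function name fetch_html and the 10-second timeout are assumptions chosen for illustration, not something the urllib module requires.

```python
import urllib.error
import urllib.request


def fetch_html(url, timeout=10):
    """Return the raw bytes of the page at url, or None if the request fails.

    fetch_html and the default timeout are illustrative choices only.
    """
    try:
        # urlopen() connects, sends the request, and returns a response object.
        # The with statement closes the connection when the block ends.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as err:
        # URLError also covers HTTPError, such as a 404 response.
        print(f"Could not retrieve {url}: {err}")
        return None
```

Calling fetch_html() with a full address, such as the placeholder https://www.example.com/, should return the page contents as a bytes object, or None if the connection could not be made.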
The read() method returns the data as a bytes object instead of as a text string. You can use some of
the standard Python tools to decode the bytes into text (see Hour 10, “Working with Strings”)
and then use the standard Python searching tools (see Hour 16, “Regular Expressions”) to parse
through the HTML code, looking for the data you need, in a process called screen scraping.
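To make the screen-scraping idea concrete, here is a small sketch that decodes the bytes and uses a regular expression to pull out the page title. The function name extract_title, the sample HTML, and the UTF-8 default are all assumptions made for this illustration; a real page may use a different encoding.

```python
import re


def extract_title(raw_html, encoding="utf-8"):
    """Decode raw page bytes and return the text of the <title> tag, or None.

    extract_title is a hypothetical helper, and UTF-8 is an assumed default.
    """
    # errors="replace" keeps decoding from failing on unexpected bytes.
    text = raw_html.decode(encoding, errors="replace")
    # Search for the first <title>...</title> pair, ignoring letter case.
    match = re.search(r"<title[^>]*>(.*?)</title>", text,
                      re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


# Sample bytes standing in for what response.read() would return.
sample = b"<html><head><title>Sample Page</title></head><body></body></html>"
print(extract_title(sample))  # prints: Sample Page
```

Even this simple case needs a carefully written pattern, and the pattern would have to change for every new piece of data you want, which is why screen scraping quickly becomes tedious.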
However, there’s an easier way to extract the data, and you’ll learn about it next.
Parsing Webpage Data
While screen scraping is certainly one way to extract data from a webpage, it can be extremely
painful. Trying to hunt down individual data elements buried in the HTML code of a webpage can be