There are two general flavors of automation that programmers tackle.
The first is where you are casting a wide net because there is a large amount of data you want to download. Aside
from the possibility of an initial login step to get the cookies that you need, this kind of task tends to involve repeated
GET operations that might fuel even further GETs as you read links from the pages that you are downloading. This is the same pattern followed by the “spider” programs that web search engines use to learn which pages exist on each web site.
The term spider for these programs comes from the early days when the term web still made people think of
spider webs.
The other flavor is when you perform a specific and targeted action at only one or two pages, instead of wanting
a whole section of a web site. This might be because you need the data only from a specific page—maybe you want
your shell prompt to print the temperature from a specific weather page—or because you are trying to automate
an action that would normally require a browser, such as paying a customer or listing yesterday’s credit
card transactions so that you can look for fraud. This often involves far more caution regarding clicks and forms and
authentication, and it often requires a full-fledged browser running the show instead of Python by itself because the
bank uses in-page JavaScript to discourage automated attempts to gain unauthorized access to accounts.
Remember to check terms-of-service conditions and a site’s robots.txt files before even considering unleashing
an automated program against it. And expect to be blocked if your program’s behavior—even when it gets stuck in
edge cases that you did not anticipate—becomes noticeably more demanding than a normal human user clicking
through pages that they stop to scan or read.
I am not even going to talk about OAuth and other maneuvers that make it even more difficult to write programs that accomplish what would otherwise require a browser. When unfamiliar
maneuvers or protocols seem to be involved, look for as much help from third-party libraries as possible and watch
your outgoing headers carefully to try to make them match exactly what you see emitted when you post a form or visit
a page successfully with your browser. Even the user-agent field can matter, depending on how opinionated the site is!
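For example, a small sketch like the following, which assumes only that the Requests library is installed, lets you inspect exactly which headers went out with a request so that you can compare them against what your browser emits. The User-Agent string shown is merely an illustrative placeholder, not a value copied from any real browser.

import requests

session = requests.Session()
session.headers.update({
    # Placeholder values: copy the real headers from your browser's
    # developer tools if a site insists on seeing them.
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0',
    'Accept': 'text/html,application/xhtml+xml',
})
response = session.get('http://example.com/')
print(response.request.headers)    # the headers that were actually sent
print(response.status_code)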
Fetching Pages
There are three broad approaches to fetching pages from the Web so that you can examine their content in a
Python program.
• Making direct GET or POST requests using a Python library. Use the Requests library as your go-to solution, and ask it for a Session object so that it can keep up with cookies and do connection pooling for you (a short sketch follows this list). An alternative for low-complexity situations is urllib.request if you want to stay within the Standard Library.
• There was once a middle ground of tools that could act enough like a primitive web browser
that they could find <form> elements and help you build an HTTP request using the same rules
that a browser would use to deliver the form inputs back to the server. Mechanize was the most
famous, but I cannot find that it has been maintained—probably because so many sites are now
complicated enough that JavaScript is nearly a requirement for browsing the modern Web.
• You can use a real web browser. You will control Firefox with the Selenium WebDriver library
in the examples that follow, but experiments are also ongoing with “headless” tools that would
act like browsers without having to bring up a full window. They typically work by creating a
WebKit instance that is not connected to a real window. PhantomJS has made this approach
popular in the JavaScript community, and Ghost.py is one current experiment in bringing the
capability to Python.
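Here is the promised sketch of the first approach. It assumes only that the Requests library is installed and uses example.com as a stand-in for whatever site you want to fetch; the Session object quietly handles cookies and connection pooling for you.

import requests

session = requests.Session()             # keeps cookies, pools connections
response = session.get('http://example.com/')
response.raise_for_status()              # complain loudly about 4xx/5xx replies
print(response.headers['Content-Type'])  # what kind of document came back
print(len(response.text), 'characters of text')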
If you already know which URLs you want to visit, your algorithm can be quite simple. Take the list of URLs, run
an HTTP request against each one, and save or examine its content. Things get complicated only if you do not know
the list of URLs up front and need to learn them as you go. You will then need to keep up with where you have been so
that you do not visit a URL twice and go in loops forever.
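A minimal way to avoid going in loops is to maintain a set of URLs that you have already fetched. The sketch below assumes the Requests library; find_links() is a hypothetical helper, not part of any library, that you would supply to extract the URLs that interest you from each downloaded page.

import requests

def crawl(start_urls, find_links, max_pages=100):
    """Fetch pages starting from start_urls, never visiting a URL twice."""
    session = requests.Session()
    seen = set()                       # URLs that have already been fetched
    to_visit = list(start_urls)
    while to_visit and len(seen) < max_pages:
        url = to_visit.pop()
        if url in seen:
            continue                   # skip anything fetched earlier
        seen.add(url)
        response = session.get(url)
        # ...save or examine response.text here...
        # find_links(text, base_url) should return an iterable of absolute URLs
        for link in find_links(response.text, url):
            if link not in seen:
                to_visit.append(link)
    return seen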