Web Scraping
The number of programmers who start their web programming careers by trying to scrape a web site is probably
much larger than the number who start by writing their own example site. After all, how many beginning
programmers have access to great stacks of data waiting to be displayed on the Web compared to the number who can
easily think of data already on the Web that they would like to copy?
A first piece of advice about web scraping: avoid it, always, if at all possible!
There are often many ways to get data besides raw scraping. Using such data sources is less expensive not only
for you, the programmer, but also for the site itself. The Internet Movie Database will let you download movie data
from http://www.imdb.com/interfaces so that you can run statistics across Hollywood films without forcing the main site
to render hundreds of thousands of extra pages that you would then have to parse! Many sites such as Google and
Yahoo provide APIs for their core services that can help you avoid getting back raw HTML.
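As a quick sketch of the difference, here is what asking a documented API for structured data might look like with the requests library. The endpoint URL and the field names are invented purely for illustration; you would substitute whatever API or bulk download the site actually documents.

import requests

# Invented endpoint, for illustration only; use whatever documented API
# or bulk-download URL the target site actually offers.
API_URL = 'https://api.example.com/films'

def fetch_films(year):
    """Ask the API for structured data instead of scraping rendered HTML."""
    response = requests.get(API_URL, params={'year': year},
                            headers={'Accept': 'application/json'})
    response.raise_for_status()
    return response.json()   # already parsed: no HTML to pick apart

if __name__ == '__main__':
    for film in fetch_films(1977):
        print(film['title'])

The data arrives already structured, so the parsing step that dominates a scraping project simply disappears.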
If searching for the data you want does not turn up any download or API alternatives, there are a few
rules of the road to keep in mind. Check whether the site you are targeting has a “Terms of Service” page. Also
check for a /robots.txt file that will tell you which URLs are designed for downloading by search engines and which
should be avoided. This can help you avoid getting several copies of the same article but with different ads, while also
helping the site control the load it faces.
Obeying the Terms of Service and robots.txt can also make it less likely that your IP address will be blocked for
generating excessive traffic.
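The Standard Library can even perform the robots.txt check for you. Here is a minimal sketch using urllib.robotparser, with example.com standing in for the site you are targeting and a made-up user agent string.

from urllib.robotparser import RobotFileParser

# The host name and user agent string here are placeholders.
robots = RobotFileParser('http://www.example.com/robots.txt')
robots.read()

url = 'http://www.example.com/articles/2014/05/some-article.html'
if robots.can_fetch('my-scraper', url):
    print('robots.txt allows fetching', url)
else:
    print('robots.txt asks crawlers to stay away from', url)

Calling can_fetch() before each download keeps your scraper inside the boundaries that the site itself has published.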
Scraping a web site will, in the most general case, require everything you have learned in Chapter 9, Chapter 10,
and this chapter about HTTP and the way that it is used by web browsers:
• The GET and POST methods and how a method, path, and headers combine to form an HTTP
request
• The status codes and structure of an HTTP response, including the difference between a
success, a redirect, a temporary failure, and a permanent failure
• Basic HTTP authentication—both how it is demanded by a server response and then provided
in a client request
• Form-based authentication and how it sets cookies that then need to be present in your
subsequent requests for them to be judged authentic (several of these pieces are combined in
the sketch that follows this list)
• JavaScript-based authentication, where a script attached to the login form performs the POST
back to the web server itself instead of letting the browser submit the form
• The way that hidden form fields, and even new cookies, can be supplied in HTTP responses as
you are browsing to protect the site from CSRF attacks
• The difference between a query or action that appends its data to the URL and performs a GET
on that location, versus an action that POSTs its data directly to the server, carried in the
request body instead
• The contrast between POST URLs designed for form-encoded data arriving from the browser
and URLs designed for direct interaction with front-end JavaScript code and therefore likely to
expect and return data in JSON or another programmer-friendly format
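To make the list above more concrete, here is a rough sketch of how several of these pieces tend to combine when logging in to a site with the requests library and BeautifulSoup. The site layout, the form field names, and the credentials are all invented for illustration; a real site will use its own names, which the developer tools described next can help you discover.

import requests
from bs4 import BeautifulSoup

BASE = 'http://www.example.com'   # hypothetical site layout throughout

session = requests.Session()      # carries cookies between requests

# Fetch the login page first, so the server can set its initial cookies
# and plant a hidden CSRF token in the form.
response = session.get(BASE + '/login')
soup = BeautifulSoup(response.text, 'html.parser')
token = soup.find('input', {'name': 'csrf_token'})['value']

# Submit the form-encoded credentials; the Session resends the cookies,
# and the hidden token goes back in the request body.
session.post(BASE + '/login', data={
    'username': 'myname',         # made-up field names and credentials
    'password': 'mypassword',
    'csrf_token': token,
})

# Later requests ride on the now-authenticated session.  A URL intended
# for front-end JavaScript will typically expect and return JSON rather
# than a form encoding.
reply = session.post(BASE + '/api/comments', json={'text': 'Hello, world'})
print(reply.json())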
Scraping a complicated site will often require hours of experimentation, tweaking, and long sessions of clicking
around in your browser’s web developer tools to learn what is going on. Three tabs are essential, and all three should
be available in either Firefox or Google Chrome once you have right-clicked a page and selected Inspect Element. The
Elements tab (refer to Figure 11-1) shows you the live document, even after JavaScript has added or removed
elements, so that you can learn which elements live inside which others. The Network tab (refer to Figure 11-2)
lets you hit Reload and see all of the HTTP requests and responses—even those kicked off by JavaScript—that together
have delivered a complete page. And the Console lets you see any errors that the page is encountering, including
ones that might never be displayed to you as a user.