def main(GET):
    parser = argparse.ArgumentParser(description='Scrape a simple site.')
    parser.add_argument('url', help='the URL at which to begin')
    start_url = parser.parse_args().url
    starting_netloc = urlsplit(start_url).netloc
    url_filter = (lambda url: urlsplit(url).netloc == starting_netloc)
    scrape((GET, start_url), url_filter)

if __name__ == '__main__':
    main(GET)
Beyond the task of starting up and reading its command-line arguments, Listing 11-13 has only two moving parts.
The simpler is its GET() function, which attempts to download a URL and, if the content type turns out to be HTML,
to parse it; only if both steps succeed does it pull the href="" attributes of all the anchor tags (<a>) to learn which
additional pages the current page links to. Because any of these links might be relative URLs, it calls urljoin() on
every one of them to supply any base components they might lack.
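The full definition of GET() appears earlier in Listing 11-13; as a reminder of the shape of the idea, here is a
minimal sketch of such a function, written against the requests and lxml packages, that may differ in its details
from the listing itself.

import requests, lxml.html
from urllib.parse import urljoin

def GET(url):
    """Download `url` and, if it turns out to be HTML, return further work."""
    print('GET', url)
    response = requests.get(url)
    if response.headers.get('Content-Type', '').split(';')[0] != 'text/html':
        return []
    tree = lxml.html.document_fromstring(response.text)
    # Each href might be relative, so urljoin() supplies any missing base parts.
    return [(GET, urljoin(url, href)) for href in tree.xpath('//a/@href')]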
For each URL that GET() discovers in the text of the page, it returns a tuple asking the scraping engine to call
GET() again on that new URL, unless the engine knows that it has already done so. The engine itself merely needs to
keep track of which combinations of function and URL it has already invoked, so that a URL appearing again and again
across the web site gets visited only once. It keeps one set of combinations it has already seen and another of
combinations that have not yet been visited, and it continues looping until the latter set is finally empty.
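In other words, the engine is nothing more than a small work loop. The following sketch, whose variable names are
merely illustrative rather than copied from the listing, shows one way such a scrape() routine could be written to
match the call made in main() above.

def scrape(start, url_filter):
    """Call (function, url) work items until no unvisited combination remains."""
    seen = set()           # combinations already dispatched
    to_visit = {start}     # combinations waiting to be dispatched
    while to_visit:
        func, url = to_visit.pop()
        seen.add((func, url))
        for new_func, new_url in func(url) or ():
            if url_filter(new_url) and (new_func, new_url) not in seen:
                to_visit.add((new_func, new_url))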
You can run this scraper against a big public web site, like httpbin:
$ python rscrape1.py http://httpbin.org/
Or you can run it against the small static site whose web server you started up a few paragraphs ago—and, alas,
this scraper will find only the two links that appear literally in the HTML as first delivered by the HTTP response.
$ python rscrape1.py http://127.0.0.1:8000/
GET http://127.0.0.1:8000/
GET http://127.0.0.1:8000/page1.html
GET http://127.0.0.1:8000/page2.html
Two ingredients are needed if the scraper is to see more.
First, you will need to load the HTML in a real browser so that the JavaScript can run and load the rest of the page.
Second, you will need a second operation besides GET() that takes a deep breath and clicks the Search
button to see what lies behind it.
This is the sort of operation that should never, under any circumstances, be part of an automated scraper
designed to pull general content from a public web site because, as you have learned at length by this point, form
submission is expressly designed for user actions, especially when it is backed by a POST operation. (In this case,
the form performs a GET, which is at least a little safer.) Here, however, you have studied this small site and have
concluded that clicking the button should be safe.
Note that Listing 11-14 can simply reuse the engine from the previous scraper, because that engine was never tightly
coupled to any particular opinion about which functions it should call: it will invoke whatever functions are submitted to it as work.
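To make the idea concrete, here is a rough sketch of the kind of work functions a browser-driven scraper could hand
to that same engine. It assumes Selenium driving a local Firefox; the names page_links() and SUBMIT_SEARCH(), and the
CSS selector used to find the search button, are inventions for this illustration and not the code of Listing 11-14.

from urllib.parse import urljoin
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()   # assumes Selenium plus a local Firefox/geckodriver

def page_links(base_url):
    """Build (GET, absolute_url) work items for every link now in the DOM."""
    work = []
    for anchor in driver.find_elements(By.TAG_NAME, 'a'):
        href = anchor.get_attribute('href')
        if href:
            work.append((GET, urljoin(base_url, href)))
    return work

def GET(url):
    """Load the page in a real browser so that its JavaScript runs first."""
    print('GET', url)
    driver.get(url)
    return page_links(url) + [(SUBMIT_SEARCH, url)]

def SUBMIT_SEARCH(url):
    """Click the search button, if any, and harvest links from the result."""
    print('SEARCH', url)
    driver.get(url)
    buttons = driver.find_elements(By.CSS_SELECTOR,
                                   'form button, form input[type=submit]')
    if not buttons:
        return []
    buttons[0].click()
    return page_links(driver.current_url)

Because both functions return (function, URL) tuples, the engine shown earlier can dispatch them without modification.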