The two final links (“page5” and “page6”) appear at the bottom of the screen as the result of a short snippet of
dynamic JavaScript code. This simulates the behavior of web sites that show you the skeleton of a page quickly but
then do another round-trip to the server before the data in which you are interested appears.
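Because the Requests library simply downloads whatever HTML the server returns and never runs JavaScript, those two late-arriving links are invisible to a scraper built on it. A quick check along these lines, using a placeholder address for the example site, would show them missing from the link list:

import requests
from lxml import etree

response = requests.get('http://127.0.0.1:5000/')   # placeholder URL for the example site
html = etree.HTML(response.text)
print([a.attrib['href'] for a in html.findall('.//a[@href]')])
# "page5" and "page6" will be absent: Requests returns the raw HTML without running scripts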
At this point—where you want to do a full-fledged recursive search of all of the URLs on a web site or even just
within part of it—you might want to go looking for a web-scraping engine that could help you. In the same way that web
frameworks factor common patterns out of web applications, like needing to return 404 for nonexistent pages, scraping
frameworks know all about keeping track of which pages have already been visited and which ones still await a visit.
The most popular web scraper at the moment is Scrapy (http://scrapy.org/), whose documentation you can
study if you want to try describing a scraping task in a way that will fit into its model.
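To give a feel for its model, here is a rough sketch of a minimal Scrapy spider that does nothing but follow every link it finds. It assumes a recent version of Scrapy, and the spider name and starting URL are placeholders:

import scrapy

class LinkSpider(scrapy.Spider):
    """Visit a page, then queue up every link found on it."""
    name = 'links'
    start_urls = ['http://127.0.0.1:5000/']   # placeholder starting URL

    def parse(self, response):
        # Scrapy itself remembers which URLs have already been requested,
        # so yielding a request for every link is enough to avoid loops.
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)

Scrapy deduplicates repeated URLs on its own, which is exactly the bookkeeping that the hand-rolled scraper in Listing 11-13 has to do for itself.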
In Listing 11-13 you can look behind the scenes to see what a real—if simple—scraper looks like underneath. This
one requires lxml, so if you can, install that third-party library as described in the previous section.


Listing 11-13. Simple Recursive Web Scraper That Does GET


#!/usr/bin/env python3
# Foundations of Python Network Programming, Third Edition
# https://github.com/brandon-rhodes/fopnp/blob/m/py3/chapter11/rscrape1.py
# Recursive scraper built using the Requests library.

import argparse, requests
from urllib.parse import urljoin, urlsplit
from lxml import etree

def GET(url):
    response = requests.get(url)
    if response.headers.get('Content-Type', '').split(';')[0] != 'text/html':
        return
    text = response.text
    try:
        html = etree.HTML(text)
    except Exception as e:
        print('    {}: {}'.format(e.__class__.__name__, e))
        return
    links = html.findall('.//a[@href]')
    for link in links:
        yield GET, urljoin(url, link.attrib['href'])

def scrape(start, url_filter):
    further_work = {start}
    already_seen = {start}
    while further_work:
        call_tuple = further_work.pop()
        function, url, *etc = call_tuple
        print(function.__name__, url, *etc)
        for call_tuple in function(url, *etc):
            if call_tuple in already_seen:
                continue
            already_seen.add(call_tuple)
            function, url, *etc = call_tuple
            if not url_filter(url):
                continue
            further_work.add(call_tuple)

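The listing imports urlsplit but stops short of the code that actually starts a crawl. A driver along the following lines, sketched here as an assumption rather than quoted from the listing, would read a starting URL from the command line and confine the crawl to that URL's hostname, which is the obvious use for urlsplit:

def main():
    # Continues Listing 11-13: relies on argparse, urlsplit, GET(), and scrape() from above.
    parser = argparse.ArgumentParser(description='Scrape a simple site.')
    parser.add_argument('url', help='the URL at which to begin')
    start_url = parser.parse_args().url
    starting_netloc = urlsplit(start_url).netloc
    # Only follow links that stay on the same host as the starting page.
    url_filter = lambda url: urlsplit(url).netloc == starting_netloc
    scrape((GET, start_url), url_filter)

if __name__ == '__main__':
    main()

The initial work item is the tuple (GET, start_url), so scrape() calls GET() on the starting page first and then on every same-host link that it discovers.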