Chapter 11 ■ the World Wide Web
Listing 11-14. Recursively Scraping a Web Site with Selenium
#!/usr/bin/env python3
# Foundations of Python Network Programming, Third Edition
# https://github.com/brandon-rhodes/fopnp/blob/m/py3/chapter11/rscrape2.py
# Recursive scraper built using the Selenium Webdriver.
# Note: the find_element(s)_by_xpath() calls are the older Selenium API;
# Selenium 4 spells them find_elements(By.XPATH, ...).

from urllib.parse import urljoin
from rscrape1 import main
from selenium import webdriver

class WebdriverVisitor:
    def __init__(self):
        # Start one browser and reuse it for every request.
        self.browser = webdriver.Firefox()

    def GET(self, url):
        self.browser.get(url)
        yield from self.parse()
        if self.browser.find_elements_by_xpath('.//form'):
            yield self.submit_form, url

    def parse(self):
        # (Could also parse page.source with lxml yourself, as in scraper1.py)
        url = self.browser.current_url
        links = self.browser.find_elements_by_xpath('.//a[@href]')
        for link in links:
            yield self.GET, urljoin(url, link.get_attribute('href'))

    def submit_form(self, url):
        self.browser.get(url)
        self.browser.find_element_by_xpath('.//form').submit()
        yield from self.parse()

if __name__ == '__main__':
    main(WebdriverVisitor().GET)
Because Selenium instances are expensive to create (each one has to start up a copy of Firefox, after all), you
should not call Firefox() every time you need to fetch a URL. The GET() routine is therefore written as a method
rather than a bare function, so that the browser attribute survives from one GET() call to the next and is also
available when you are ready to call submit_form().
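The cost difference can be illustrated without a real browser. The following is only a minimal sketch: FakeBrowser, get_with_new_browser(), and Visitor are hypothetical names invented for this illustration, and FakeBrowser merely counts how many times it is started instead of launching Firefox.

```python
class FakeBrowser:
    """Hypothetical stand-in for webdriver.Firefox(); counts its startups."""
    started = 0

    def __init__(self):
        FakeBrowser.started += 1
        self.current_url = None

    def get(self, url):
        self.current_url = url


def get_with_new_browser(url):
    # Bare function: pays the startup cost on every single call.
    browser = FakeBrowser()
    browser.get(url)
    return browser.current_url


class Visitor:
    def __init__(self):
        # The startup cost is paid once, when the visitor is created.
        self.browser = FakeBrowser()

    def GET(self, url):
        # Every call reuses the browser stored on the instance.
        self.browser.get(url)
        return self.browser.current_url


urls = ["/", "/page1.html", "/page2.html"]

for url in urls:
    get_with_new_browser(url)
bare_function_starts = FakeBrowser.started

FakeBrowser.started = 0
visitor = Visitor()
for url in urls:
    visitor.GET(url)
method_starts = FakeBrowser.started

print("bare function:", bare_function_starts, "browser starts")
print("method:", method_starts, "browser start")
```

In this sketch the bare function starts one browser per URL, while the method version starts a single browser no matter how many URLs are fetched.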
The submit_form() method is where this listing really diverges from the previous one. When the GET() method
sees the search form sitting on the page, it sends an additional tuple back to the engine: besides yielding one
tuple for every link that it sees on a page, it yields a tuple that will load the page again and submit the form,
the equivalent of clicking the big Search button.
That is what lets this scraper reach deeper into this site than the previous one.
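To make the tuple protocol concrete, here is a rough sketch of how an engine might consume the (function, url) tuples that a visitor yields. It is not the main() imported from rscrape1, which is not shown here. The crawl() function, the FakeVisitor class, and the in-memory PAGES and FORM_RESULTS tables are all hypothetical, and no browser or network access is involved.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (links on the page, has a form?)
PAGES = {
    "/": (["/page1.html", "/page2.html"], True),
    "/page1.html": (["/"], False),
    "/page2.html": (["/", "/page1.html"], False),
    "/page5.html": ([], False),
}
# Links that only become visible after the form on a page is submitted.
FORM_RESULTS = {"/": ["/page5.html"]}


class FakeVisitor:
    """Stands in for WebdriverVisitor, using the tables above."""

    def GET(self, url):
        links, has_form = PAGES[url]
        for link in links:
            yield self.GET, link
        if has_form:
            # The extra tuple: ask the engine to submit this page's form.
            yield self.submit_form, url

    def submit_form(self, url):
        for link in FORM_RESULTS[url]:
            yield self.GET, link


def crawl(start):
    """Run each (function, url) tuple at most once, breadth first."""
    queue = deque([start])
    seen = {start}
    visited = []
    while queue:
        function, url = queue.popleft()
        visited.append((function.__name__, url))
        for task in function(url):
            if task not in seen:
                seen.add(task)
                queue.append(task)
    return visited


visitor = FakeVisitor()
log = crawl((visitor.GET, "/"))
for name, url in log:
    print(name, url)
```

In this toy site, /page5.html is linked from nowhere and is reached only because GET() yielded the submit_form tuple. The sketch visits tasks in first-in, first-out order, which may differ from the order the real engine uses.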
$ python rscrape2.py http://127.0.0.1:8000/
GET http://127.0.0.1:8000/
GET http://127.0.0.1:8000/page1.html
GET http://127.0.0.1:8000/page2.html
submit_form http://127.0.0.1:8000/
GET http://127.0.0.1:8000/page5.html
GET http://127.0.0.1:8000/page6.html
GET http://127.0.0.1:8000/page4.html
GET http://127.0.0.1:8000/page3.html