Foundations of Python Network Programming

Chapter 11 ■ the World Wide Web


def scrape_with_lxml(text):
    root = lxml.html.document_fromstring(text)
    total = 0
    # Outgoing payments are the list items marked with the class "to".
    for li in root.cssselect('li.to'):
        # The first word of the item is the amount, such as "$125".
        dollars = int(li.text_content().split()[0].lstrip('$'))
        memo = li.cssselect('i')[0].text_content()
        total += dollars
        print(ROW.format(dollars, memo))
    print(ROW.format('-' * 8, '-' * 30))
    print(ROW.format(total, 'Total payments made'))


def main():
    parser = argparse.ArgumentParser(description='Scrape our payments site.')
    parser.add_argument('url', help='the URL at which to begin')
    parser.add_argument('-l', action='store_true', help='scrape using lxml')
    parser.add_argument('-s', action='store_true', help='get with selenium')
    args = parser.parse_args()
    # Choose how to download the page, then how to scrape it.
    if args.s:
        text = download_page_with_selenium(args.url)
    else:
        text = download_page_with_requests(args.url)
    if args.l:
        scrape_with_lxml(text)
    else:
        scrape_with_soup(text)


if __name__ == '__main__':
    main()


Once this Flask application is running on port 5000, you are ready to kick off mscrape.py in another terminal
window. If they are not already available on your system, first install the third-party Beautiful Soup and Requests
libraries.


$ pip install beautifulsoup4
$ pip install requests
$ python mscrape.py http://127.0.0.1:5000/
         125  Registration for PyCon
         200  Payment for writing that code
    --------  ------------------------------
         325  Total payments made


Running in its default mode like this, mscrape.py first uses the Requests library to log in to the site through the
login form. Logging in gives the Session object the cookie that it then needs to fetch the front page
successfully. The script then parses the page, finds the list-item elements marked with the class "to", and adds up
those outgoing payments as it displays them with a few print() calls.
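
The parse-and-total step can be tried without a running server. The following is only a minimal sketch: it swaps in the standard library's html.parser module in place of Beautiful Soup or lxml so that it runs with no third-party packages, and the SAMPLE markup, the PaymentParser class, the outgoing_payments() helper, and the row format are all hypothetical stand-ins rather than anything taken from mscrape.py or the real page.

```python
from html.parser import HTMLParser

# Hypothetical markup modeled on the output shown above; the real page may differ.
SAMPLE = """
<ul>
  <li class="to">$125 to <b>pycon</b> for: <i>Registration for PyCon</i></li>
  <li class="from">$50 from <b>sam</b> for: <i>Lunch</i></li>
  <li class="to">$200 to <b>psf</b> for: <i>Payment for writing that code</i></li>
</ul>
"""


class PaymentParser(HTMLParser):
    """Collect (dollars, memo) pairs from <li class="to"> elements."""

    def __init__(self):
        super().__init__()
        self.payments = []
        self._in_li = False     # inside an <li class="to"> element?
        self._in_memo = False   # inside the <i> memo of such an element?
        self._text = []
        self._memo = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get('class') or '').split()
        if tag == 'li' and 'to' in classes:
            self._in_li, self._text, self._memo = True, [], []
        elif tag == 'i' and self._in_li:
            self._in_memo = True

    def handle_endtag(self, tag):
        if tag == 'i':
            self._in_memo = False
        elif tag == 'li' and self._in_li:
            # The first word of the item text is the amount, such as "$125".
            dollars = int(''.join(self._text).split()[0].lstrip('$'))
            self.payments.append((dollars, ''.join(self._memo).strip()))
            self._in_li = False

    def handle_data(self, data):
        if self._in_li:
            self._text.append(data)
            if self._in_memo:
                self._memo.append(data)


def outgoing_payments(html):
    parser = PaymentParser()
    parser.feed(html)
    parser.close()
    return parser.payments


payments = outgoing_payments(SAMPLE)
total = sum(dollars for dollars, memo in payments)
for dollars, memo in payments:
    print('{:>8}  {}'.format(dollars, memo))
print('{:>8}  {}'.format(total, 'Total payments made'))
```

The item with the class "from" is skipped, so only the two outgoing payments contribute to the total. A CSS selector such as li.to expresses the same filter in a single call, which is one reason a scraping library is more convenient here than a hand-written parser.
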
By providing the -s option, you can switch mscrape.py so that it does something rather more exciting: it runs
a full version of Firefox, if it finds one installed on your system, to visit the web site instead! You will need the Selenium
package installed for this mode to work.
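
To see how the two switches combine, the argument parser from main() can be exercised on its own. This is a sketch under stated assumptions: build_parser() and describe() are hypothetical helpers introduced only for illustration, and the strings they return merely name the download and scrape paths instead of calling the real functions.

```python
import argparse


def build_parser():
    # Same arguments as the parser built in main().
    parser = argparse.ArgumentParser(description='Scrape our payments site.')
    parser.add_argument('url', help='the URL at which to begin')
    parser.add_argument('-l', action='store_true', help='scrape using lxml')
    parser.add_argument('-s', action='store_true', help='get with selenium')
    return parser


def describe(argv):
    """Return the (downloader, scraper) names selected by an argument list."""
    args = build_parser().parse_args(argv)
    downloader = 'selenium' if args.s else 'requests'
    scraper = 'lxml' if args.l else 'soup'
    return downloader, scraper


url = 'http://127.0.0.1:5000/'
print(describe([url]))              # neither switch given
print(describe(['-s', url]))        # download with Selenium
print(describe(['-s', '-l', url]))  # both switches given
```

Because the two options are independent, any combination is valid: the download method and the parsing library are chosen separately.
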
