
$ pip install selenium
$ python mscrape.py -s http://127.0.0.1:5000/
125 Registration for PyCon
200 Payment for writing that code
325 Total payments made


You can press Ctrl+W to dismiss Firefox once the script has printed its output. While you can write Selenium
scripts so that they close Firefox automatically, I prefer to leave it open when writing and debugging so that I can see
what went wrong in the browser if the program hits an error.
The difference between these two approaches deserves to be stressed. To write the code that uses Requests, you
need to open the site yourself, study the login form, and copy the information you find there into the data that the
post() method uses to log in. Once you have done so, your code has no way to know whether the login form later
changes. It will simply keep submitting the hard-coded input names 'username' and 'password', whether or not they
are still relevant.
So, the Requests approach is, at least when written this way, really nothing like a browser. It is at no point opening
the login page and seeing a form there. It is, rather, assuming the existence of the login page and doing an end-run
around it to POST the form that is its result. Obviously, this approach will break if the login form is ever given, say,
a secret token to prevent mass attempts to guess user passwords. In that case, you would need to add a first GET of
the /login page itself to grab the secret token that would need to be combined with your username and password to
make a valid POST.
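
To make the contrast concrete, here is a rough sketch of such Requests code (not the chapter's actual listing, and with stand-in credentials), assuming the form posts its two fields to /login on the example server:

import requests

# Hard-coded login: the /login path and the input names 'username' and
# 'password' were copied from a manual reading of the form, so this code
# breaks silently if the form ever changes. The credentials are stand-ins.
session = requests.Session()
session.post('http://127.0.0.1:5000/login',
             data={'username': 'brandon', 'password': 'aq9e4f'})

# The session object now carries the login cookie for further fetches.
print(session.get('http://127.0.0.1:5000/').text)

If the form did grow a secret token, this sketch would need the preliminary GET of /login just described, to pull the token out of the HTML before building the data dictionary.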
The Selenium-based code in mscrape.py takes the opposite approach. Like a user sitting down at the browser,
it acts as though it simply sees a form and selects its elements and starts typing. Then it reaches over and clicks the
button to submit the form. As long as its CSS selectors continue to identify the form fields correctly, the code will
succeed in logging in regardless of any secret tokens or special JavaScript code to sign or automate the form post
because Selenium is simply doing in Firefox exactly what you would do to log on.
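
In sketch form, such Selenium code might look like the following; the CSS selectors and credentials are illustrative guesses rather than the real mscrape.py listing:

from selenium import webdriver
from selenium.webdriver.common.by import By

# Drive a real copy of Firefox: load the login page, type into the form
# fields that the browser actually renders, and click the submit button.
browser = webdriver.Firefox()
browser.get('http://127.0.0.1:5000/login')
browser.find_element(By.CSS_SELECTOR, 'input[name="username"]').send_keys('brandon')
browser.find_element(By.CSS_SELECTOR, 'input[name="password"]').send_keys('aq9e4f')
browser.find_element(By.CSS_SELECTOR, 'button[type="submit"]').click()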
Selenium is, of course, much slower than Requests, especially when you first kick it off and have to wait for
Firefox to start. But it can quickly perform actions that might otherwise take you hours of experimentation to get
working in Python. A hybrid can be an interesting approach to a difficult scraping job: could you use Selenium
to log in and gain the necessary cookies, and then tell Requests about them so that your mass fetch of further
pages does not need to wait on the browser?
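
Here is one way that hand-off might look, picking up after the Selenium login above. The get_cookies() and Session.cookies.set() calls are the real Selenium and Requests APIs, while the URL belongs to this chapter's example application:

import requests

# Once Selenium has logged in, copy the browser's cookies into a
# Requests session, retire the browser, and fetch pages at full speed.
session = requests.Session()
for cookie in browser.get_cookies():
    session.cookies.set(cookie['name'], cookie['value'])
browser.quit()

print(session.get('http://127.0.0.1:5000/').text)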


Scraping Pages

When a site returns data in CSV, JSON, or some other recognized data format, you will of course use the
corresponding module in the Standard Library or a third-party library to get it parsed so that you can process it. But
what if the information you need is hidden in user-facing HTML?
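
If, say, the payments application offered a JSON version of its data (the endpoint here is purely hypothetical), the fetch-and-parse step would be trivial:

import requests

# A hypothetical JSON endpoint; json() parses the response body for us.
for payment in requests.get('http://127.0.0.1:5000/payments.json').json():
    print(payment)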
Reading raw HTML after pressing Ctrl+U in Google Chrome or Firefox can be quite wearisome, depending on
how the site has chosen to format it. It is often more pleasant to right-click, select Inspect Element, and then happily
browse the collapsible document tree of elements that the browser sees—assuming that the HTML is properly
formatted and that a mistake in the markup has not hidden the data you need from the browser! The problem with the
live element inspector, as you have already seen, is that by the time you see the document, any JavaScript programs
that run in the web page might already have edited it out of all recognition.
There are at least two easy tricks for looking at such pages. The first is to turn JavaScript off in your browser and
click Reload for the page you are reading. It should now reappear in the element inspector, but without any changes
having been made: you should see exactly what your Python code will see when it downloads the same document.
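
You can verify the claim by printing the raw document exactly as Python receives it, before any JavaScript has had a chance to run:

import requests

# The untouched HTML, as delivered over the wire: what a scraper sees.
print(requests.get('http://127.0.0.1:5000/').text)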
