Foundations of Python Network Programming

Chapter 11 ■ the World Wide Web



The scraper is thus able to find every single page on the site even though some links are loaded dynamically through
JavaScript and others are reached only through a form post. With powerful techniques like this, you should find
that your interactions with any web site can be automated through Python.


Summary


HTTP was designed to deliver the World Wide Web: a collection of documents interconnected with hyperlinks
that each name the URL of a further page, or section of a page, that can be visited simply by clicking the text of the
hyperlink. The Python Standard Library has helpful routines for parsing and building URLs and for turning partial
“relative URLs” into absolute URLs by filling in any incomplete components with information from the base URL of
the page where they appeared.
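
As a brief illustration of those URL routines, the sketch below uses the Standard Library's urllib.parse module to resolve relative URLs and to split and rebuild a URL. The example.com addresses are made up for the example.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

# Hypothetical base URL of the page on which the links were found.
base = 'http://example.com/docs/chapter11/index.html'

# Each relative URL is completed with components taken from the base URL.
sibling = urljoin(base, 'summary.html')
parent = urljoin(base, '../toc.html')
rooted = urljoin(base, '/search?q=python')
fragment = urljoin(base, '#summary')

# urlsplit() breaks a URL into named components; urlunsplit() rebuilds it.
parts = urlsplit(rooted)
rebuilt = urlunsplit(parts)

print(sibling)
print(parent)
print(rooted)
print(fragment)
print(parts.scheme, parts.netloc, parts.path, parts.query)
```

Note that a relative URL beginning with a slash replaces the whole path of the base URL, while a bare fragment keeps the path and only appends the section name.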
Web applications typically connect some persistent data store, like a database, with code that responds to
incoming HTTP requests and builds HTML pages in response. It is crucial to let the database do its own quoting when
you try to insert untrusted information from out on the Web, and both the DB-API 2.0 and any ORM you might use in
Python will be careful to do this quoting correctly.
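
A minimal sketch of letting the database do its own quoting, using the Standard Library's sqlite3 module as one DB-API 2.0 driver. The person table and the hostile string are invented for the example.

```python
import sqlite3

# An in-memory database keeps the example self-contained.
db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)')

# Untrusted input that would break a query built with string formatting.
untrusted = "Robert'); DROP TABLE person;--"

# The ? placeholder hands the value to the driver separately from the SQL
# text, so the value is never interpreted as part of the statement.
db.execute('INSERT INTO person (name) VALUES (?)', (untrusted,))
db.commit()

rows = db.execute(
    'SELECT name FROM person WHERE name = ?', (untrusted,)
).fetchall()
print(rows)
```

Other DB-API drivers follow the same pattern but may use a different placeholder style, such as %s, so the exact syntax depends on the driver in use.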
Web frameworks range from simple to full stack. With a simple framework, you will make your own choice
of both a template language and an ORM or other persistence layer. A full-stack framework will instead offer its
own versions of these tools. In either case, some means of connecting URLs to your own code will be available that
supports both static URLs and also URLs such as /person/123/ that have path components that can vary. Quick ways
to render and return templates, as well as to return redirects or HTTP errors, will also be provided.
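
To make the idea of connecting URLs to code concrete, here is a rough sketch of a tiny regular-expression router written as a plain WSGI callable. It is not how any particular framework implements routing; the names ROUTES, home, and person are hypothetical.

```python
import re

def home(environ):
    # View for the static URL "/".
    return '200 OK', 'Welcome'

def person(environ, person_id):
    # View for URLs such as /person/123/ whose last component varies.
    return '200 OK', 'Person #{}'.format(int(person_id))

# Each route pairs a compiled pattern with the view that handles it.
ROUTES = [
    (re.compile(r'^/$'), home),
    (re.compile(r'^/person/(?P<person_id>\d+)/$'), person),
]

def app(environ, start_response):
    """WSGI callable that dispatches on PATH_INFO, or returns a 404."""
    path = environ.get('PATH_INFO', '/')
    for pattern, view in ROUTES:
        match = pattern.match(path)
        if match:
            # Named groups in the pattern become keyword arguments.
            status, body = view(environ, **match.groupdict())
            break
    else:
        status, body = '404 Not Found', 'Not Found'
    data = body.encode('utf-8')
    start_response(status, [
        ('Content-Type', 'text/plain; charset=utf-8'),
        ('Content-Length', str(len(data))),
    ])
    return [data]
```

Real frameworks add much more on top of this, such as type converters for the variable components and helpers for redirects and error pages.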
The great danger that faces every site author is that the many ways in which components interact in a complicated
system like the Web can allow users to subvert either your own intentions or each other’s. The possibility of cross-site
scripting attacks, cross-site request forgery, and attacks on your users’ privacy must all be kept in mind at the interface
between the outside world and your own code. These dangers should be thoroughly understood before you ever write
code that accepts data from a URL path, a URL query string, or a POST or file upload.
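
As one small example of defending that interface, the sketch below escapes untrusted text before placing it in HTML, which is the basic defense against cross-site scripting. It uses html.escape from the Standard Library; render_comment is a hypothetical helper, and a real application would normally rely on a template language that escapes automatically.

```python
from html import escape

def render_comment(author, text):
    # Escape every untrusted value so that it is displayed as text
    # instead of being interpreted by the browser as markup.
    return '<p><b>{}</b>: {}</p>'.format(escape(author), escape(text))

# Input that would run a script if it were inserted without escaping.
hostile = '<script>alert("xss")</script>'
page = render_comment('guest', hostile)
print(page)
```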
The trade-off between frameworks is often the choice between a full-stack solution like Django, which
encourages you to stay within its tool set but tends to choose good defaults for you (such as having CSRF protection
turned on automatically in your forms), and a solution such as Flask or Bottle, which feels sleeker and lighter and lets
you assemble your own solution but requires you to know up front all of the pieces you need. If you write an app
in Flask simply not knowing that you need CSRF protection, you will go without it.
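
To show roughly what such protection involves when you assemble it yourself, here is a minimal sketch of a per-session CSRF token. It is an illustration only, not the mechanism of Django or of any Flask extension: the session is modeled as a plain dictionary, and issue_csrf_token and is_valid_csrf are hypothetical names.

```python
import hmac
import secrets

def issue_csrf_token(session):
    # Store an unguessable token in the session; the same value would be
    # embedded as a hidden field in every form rendered for this user.
    token = secrets.token_urlsafe(32)
    session['csrf_token'] = token
    return token

def is_valid_csrf(session, submitted):
    # A form post is accepted only if it carries the session's token.
    expected = session.get('csrf_token')
    if not expected or not submitted:
        return False
    # compare_digest() avoids leaking information through timing.
    return hmac.compare_digest(
        expected.encode('utf-8'), submitted.encode('utf-8')
    )

session = {}                      # stand-in for a real server-side session
token = issue_csrf_token(session)
print(is_valid_csrf(session, token))       # legitimate form post
print(is_valid_csrf(session, 'forged'))    # request forged by another site
```

A site on another domain can cause the browser to submit a form, but it cannot read the token out of your pages, so its forged request fails the check.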
The Tornado framework stands out for its async approach that allows many clients to be served from a single
operating-system-level thread of control. With the emergence of asyncio in Python 3, approaches like Tornado
can be expected to move toward a common set of idioms like those that WSGI already provides for threaded web
frameworks today.
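
The single-thread claim can be illustrated with asyncio itself. The sketch below does not use Tornado or real sockets: asyncio.sleep() stands in for waiting on network I/O, and handle_client and serve_all are hypothetical names. It assumes Python 3.7 or later for asyncio.run().

```python
import asyncio
import threading

async def handle_client(client_id, delay):
    # While this coroutine waits, the event loop is free to run others.
    await asyncio.sleep(delay)
    # Report which operating-system thread did the work.
    return client_id, threading.get_ident()

async def serve_all(count):
    # Start every simulated client at once and wait for all to finish.
    tasks = [handle_client(i, 0.05) for i in range(count)]
    return await asyncio.gather(*tasks)

results = asyncio.run(serve_all(100))
thread_ids = {thread_id for _, thread_id in results}
print(len(results), 'clients served by', len(thread_ids), 'thread')
```

Because the clients wait concurrently rather than one after another, all of them complete in roughly the time of a single delay.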
Turning around and scraping a web page involves a thorough knowledge of how web sites normally work so
that what would normally be user interactions can instead be scripted—including such complexities as logging on or
filling out and submitting a form. Several approaches are available in Python both for fetching pages and for parsing
them. Requests or Selenium for fetching and BeautifulSoup or lxml for parsing are among the favorites at this point.
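
The parsing half of scraping can be sketched without third-party packages. The example below swaps in the Standard Library's html.parser in place of BeautifulSoup or lxml, and parses a hard-coded snippet rather than fetching a page with Requests; LinkCollector and the example.com URL are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect the absolute URL of every <a href="..."> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                # Relative links are resolved against the page's own URL.
                self.links.append(urljoin(self.base_url, href))

html_text = '''<html><body>
<a href="/page1">One</a>
<a href="page2.html">Two</a>
<a name="anchor-only">No link</a>
</body></html>'''

collector = LinkCollector('http://example.com/site/index.html')
collector.feed(html_text)
print(collector.links)
```

A scraper would then fetch each collected URL in turn, keeping a set of pages already visited so that it does not loop.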
And thus with a study of web application writing and scraping, this book completes its coverage of HTTP and
the World Wide Web. The next chapter begins a tour of several less well-known protocols supported in the Python
Standard Library by turning to the subject of e-mail messages and how they are formatted.
