
Here, we have downloaded and parsed the web page, pulling out the text that we
want with the help of XPath. We then used a regular expression to pull out the
codename, jessie, and a split to extract the version, 8.0. Finally, we print them out.
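For reference, a script along these lines might look like the following. This is only a
minimal sketch: the URL, the XPath expression, and the regular expression are
assumptions about the page layout, and not necessarily the ones used in
get_debian_version.py.

# get_debian_version.py - a minimal sketch; the URL, the XPath expression,
# and the regular expression are assumptions about the page layout.
import re
import urllib.request

from lxml.etree import HTML

url = 'https://www.debian.org/releases/stable/index.en.html'  # assumed URL
response = urllib.request.urlopen(url)

# Parse the downloaded page and pull out the text we want with XPath.
root = HTML(response.read())
title_text = root.xpath('//title/text()')[0]

# Use a regular expression to pull out the codename (assuming it appears
# in quotes, for example: Debian 8.0 "jessie" Release Information).
codename = re.search(r'"(.*?)"', title_text).group(1)

# Use a split to extract the version, assuming it is the second word.
version = title_text.split()[1]

print('Codename: {}'.format(codename))
print('Version: {}'.format(version))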


So, run it as shown here:


$ python3.4 get_debian_version.py
Codename: jessie
Version: 8.0


Magnificent. Well, darned nifty, at least. There are some third-party packages
available that can speed up scraping and form submission; two popular ones are
Mechanize and Scrapy. Check them out at http://wwwsearch.sourceforge.net/mechanize/
and http://scrapy.org.
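As a taster, a minimal Scrapy spider might look something like the following sketch.
This is not taken from this chapter's examples; the start URL and the XPath
expression are placeholder assumptions. It extracts page titles with the same XPath
approach we used above, while Scrapy takes care of downloading and crawling.

# title_spider.py - a minimal Scrapy spider sketch; the start URL and the
# XPath expression are placeholder assumptions.
import scrapy


class TitleSpider(scrapy.Spider):
    name = 'titles'
    start_urls = ['https://www.debian.org/']  # assumed starting page

    def parse(self, response):
        # Scrapy handles the downloading and crawl scheduling; we just
        # extract what we want from each response with XPath, as before.
        yield {'title': response.xpath('//title/text()').extract_first()}

A spider like this can be run with scrapy runspider title_spider.py, which prints the
scraped items as it goes.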


With great power...


As an HTTP client developer, you may have different priorities from the webmasters
who run websites. A webmaster typically provides a site for human users, possibly
offering a service designed to generate revenue, and most likely has to do all of this
with very limited resources. They will be interested in analyzing how humans use
their site, and may have areas of the site that they would prefer automated clients
didn't explore.


HTTP clients that automatically parse and download pages on websites are called
various things, such as bots, web crawlers, and spiders. Bots have many legitimate
uses. All the search engine providers make extensive use of bots for crawling the
web and building their huge page indexes. Bots can be used to check for dead links,
and to archive sites for repositories, such as the Wayback Machine. But there are
also many uses that might be considered illegitimate: automatically traversing an
information service to extract the data on its pages and then repackaging that data
for presentation elsewhere without the site owners' permission, or downloading
large batches of media files in one go when the spirit of the service is online
viewing. Some sites have terms of service which explicitly bar automated
downloads. Although some actions, such as copying and republishing copyrighted
material, are clearly illegitimate, others are subject to interpretation. This gray area
is a subject of ongoing debate, and it is unlikely that it will ever be resolved to
everyone's satisfaction.


However, even when they serve a legitimate purpose, bots generally make
webmasters' lives somewhat more difficult. They pollute the web server logs that
webmasters use for calculating statistics on how their site is being used by its
human audience. Bots also consume bandwidth and other server resources.
