Learning Python Network Programming


APIs in Action


Using the methods that we are looking at in this chapter, it is quite straightforward
to write a bot that performs many of the functions described earlier. Webmasters
provide the services that we will be using, so in return we should respect the areas
of concern mentioned earlier and design our bots so that they impact their sites as
little as possible.


Choosing a User Agent


There are a few things that we can do to help webmasters out. We should always
pick an appropriate user agent for our client. The principal way in which webmasters
filter bot traffic out of their logfiles is by performing user agent analysis.


There are lists of the user agents of known bots; one such list can be found at
http://www.useragentstring.com/pages/Crawlerlist/.


Webmasters can use these lists in their filters. Many webmasters also simply filter
out any user agents that contain the words bot, spider, or crawler. So, if we are writing
an automated bot rather than a browser, we will make webmasters' lives a little
easier if we use a user agent that contains one of these words. Many of the bots used
by search engine providers follow this convention; for example, Google's Googlebot
and Microsoft's Bingbot both include the word bot in their user agent strings.


There are also some guidelines on user agent strings in section 5.5.3 of RFC 7231,
the HTTP/1.1 semantics specification.
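
If we are using the urllib.request module from the standard library, we can set such
a user agent on each request by supplying a User-Agent header. The following is a
minimal sketch; the bot name and the contact and target URLs are made-up
placeholders rather than values from this chapter:

import urllib.request

# A descriptive user agent that contains the word 'bot' and points
# webmasters at a page describing the bot (placeholder values).
headers = {'User-Agent': 'ExampleNewsBot/1.0 (+http://example.com/botinfo)'}

request = urllib.request.Request('http://example.com/', headers=headers)
with urllib.request.urlopen(request) as response:
    print(response.status, response.getheader('Content-Type'))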


The robots.txt file


There is an unofficial but widely adopted mechanism for telling bots whether there
are any parts of a website that they should not crawl. This mechanism is called
robots.txt, and it takes the form of a text file called, unsurprisingly, robots.txt.
This file always lives in the root of a website so that bots can always find it. It
contains rules that describe which parts of the website are accessible. The file
format is described at http://www.robotstxt.org.
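
As a purely hypothetical illustration, a robots.txt file that asks all bots to stay out
of a /private/ area while leaving the rest of the site open might look like this:

User-agent: *
Disallow: /private/

Each User-agent line names the bots that the following rules apply to (* means all
of them), and each Disallow line gives a path prefix that those bots should not fetch.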


The Python standard library provides the urllib.robotparser module for parsing
and working with robots.txt files. You can create a parser object, feed it a
robots.txt file, and then simply query it to see whether a given URL is permitted
for a given user agent. A good example can be found in the standard library
documentation. If you check every URL that your client might want to access before
you access it, and honor the webmasters' wishes, then you'll be helping them out.
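
A minimal sketch of such a check, reusing the hypothetical rules from the earlier
example and a made-up ExampleBot user agent, might look like this:

import urllib.robotparser

# The same hypothetical rules as in the earlier robots.txt example.
rules = [
    'User-agent: *',
    'Disallow: /private/',
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

# Against a live site we would instead point the parser at the real file,
# using rp.set_url('http://example.com/robots.txt') followed by rp.read().

print(rp.can_fetch('ExampleBot', 'http://example.com/private/page.html'))  # False
print(rp.can_fetch('ExampleBot', 'http://example.com/index.html'))         # True

Calling can_fetch() before each request is all that is needed to honor the rules; if it
returns False, the bot should simply skip that URL.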
