Digital Marketing Handbook


Robots Exclusion Standard


Disadvantages


Despite the use of the terms "allow" and "disallow", the protocol is purely advisory. It relies on the cooperation of
the web robot, so marking an area of a site out of bounds with robots.txt does not guarantee that all web robots will
stay out. In particular, malicious web robots are unlikely to honor robots.txt.

The robots.txt protocol originated without an official standards body or RFC: it was created by consensus [4] in June 1994
by members of the robots mailing list ([email protected]). The parts of a site that should not be accessed are
listed in a file called robots.txt in the top-level directory of the website. The robots.txt patterns are matched by
simple substring comparisons against the start of the URL path, so patterns meant to match directories should end with
the final '/' character. Otherwise, every file whose name starts with that substring will match, rather than just those
in the intended directory.
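
The trailing-slash pitfall can be illustrated with a minimal Python sketch. The helper name is_disallowed and the sample paths are hypothetical, and the sketch assumes plain prefix matching as described above; real crawlers may apply additional rules.

```python
from typing import Iterable


def is_disallowed(path: str, patterns: Iterable[str]) -> bool:
    """Return True if the URL path starts with any Disallow pattern."""
    # An empty pattern (a bare "Disallow:") excludes nothing, so it is skipped.
    return any(bool(p) and path.startswith(p) for p in patterns)


# Without the trailing '/', the pattern also catches unrelated files.
no_slash_dir = is_disallowed("/private/report.html", ["/private"])
no_slash_file = is_disallowed("/private-notes.html", ["/private"])

# With the trailing '/', only paths inside the directory match.
slash_dir = is_disallowed("/private/report.html", ["/private/"])
slash_file = is_disallowed("/private-notes.html", ["/private/"])

print(no_slash_dir, no_slash_file)  # True True
print(slash_dir, slash_file)        # True False
```

Here "/private" unintentionally blocks /private-notes.html, while "/private/" blocks only the contents of the directory.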

Examples


This example tells all robots that they may visit all files, because the wildcard * matches all robots and the empty Disallow directive excludes nothing:


User-agent: *
Disallow:

This example tells all robots to stay out of a website:


User-agent: *
Disallow: /
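
As a rough check of the two examples above, the following sketch feeds each one to Python's standard urllib.robotparser module. The helper name can_fetch, the agent name "ExampleBot", and the URL are assumptions made for illustration only.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse robots.txt text and ask whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


ALLOW_ALL = "User-agent: *\nDisallow:\n"   # empty Disallow: nothing is excluded
DENY_ALL = "User-agent: *\nDisallow: /\n"  # "/" is a prefix of every path

URL = "http://www.example.com/index.html"  # hypothetical page
allow_result = can_fetch(ALLOW_ALL, "ExampleBot", URL)
deny_result = can_fetch(DENY_ALL, "ExampleBot", URL)

print(allow_result)  # True
print(deny_result)   # False
```

Because the protocol is advisory, this check only has an effect in a robot that chooses to run it before each request.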

This example tells all robots not to enter four directories of a website:


User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/
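
A file like the one above could be generated rather than written by hand. The sketch below is one possible approach, with a hypothetical helper build_robots_txt that normalizes each directory name so that it always starts and ends with '/'.

```python
def build_robots_txt(directories, user_agent="*"):
    """Build robots.txt text that disallows each directory for user_agent."""
    lines = [f"User-agent: {user_agent}"]
    for directory in directories:
        name = directory.strip("/")
        if not name:
            # Skip empty names so the site root is never blocked by accident.
            continue
        # Always emit a leading and a trailing '/' so only the directory matches.
        lines.append(f"Disallow: /{name}/")
    return "\n".join(lines) + "\n"


robots_txt = build_robots_txt(["cgi-bin", "/images", "tmp/", "/private/"])
print(robots_txt)
```

Whatever form the input names take, the output contains the same four Disallow lines as the example.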

This example tells a specific robot not to enter one specific directory:


User-agent: BadBot # replace the 'BadBot' with the actual user-agent of the bot
Disallow: /private/
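
To see that the rule above applies only to the named robot, the sketch below parses it with Python's standard urllib.robotparser module. The agent name "OtherBot" and both URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: BadBot # replace the 'BadBot' with the actual user-agent of the bot
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

private_url = "http://www.example.com/private/data.html"  # hypothetical URLs
public_url = "http://www.example.com/index.html"

badbot_private = parser.can_fetch("BadBot", private_url)
badbot_public = parser.can_fetch("BadBot", public_url)
otherbot_private = parser.can_fetch("OtherBot", private_url)

print(badbot_private)    # False: the rule names BadBot and the path matches
print(badbot_public)     # True: the path is outside /private/
print(otherbot_private)  # True: no rule applies to OtherBot
```

Robots that are not named, and for which no "User-agent: *" record exists, are treated as unrestricted.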

This example tells all robots not to access one specific file:


User-agent: *
Disallow: /directory/file.html

Note that all other files in the specified directory will be processed.

Example demonstrating how comments can be used:

# Comments appear after the "#" symbol at the start of a line, or after a directive
User-agent: * # match all bots
Disallow: / # keep them out
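
A parser has to discard these comments before it reads the directives. The sketch below shows one simple way to do that; the helper strip_comment is hypothetical, and it assumes that every "#" starts a comment, which would also cut off a "#" that appears inside a path.

```python
def strip_comment(line: str) -> str:
    """Drop everything from the first '#' onward and trim whitespace."""
    return line.split("#", 1)[0].strip()


EXAMPLE = [
    '# Comments appear after the "#" symbol at the start of a line, or after a directive',
    "User-agent: * # match all bots",
    "Disallow: / # keep them out",
]

# Lines that consist only of a comment become empty and are dropped.
directives = [text for text in map(strip_comment, EXAMPLE) if text]
print(directives)  # ['User-agent: *', 'Disallow: /']
```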

Example demonstrating how to add the Sitemap parameter to tell bots where the sitemap is located:


User-agent: *
Sitemap: http://www.example.com/sitemap.xml # tell the bots where your sitemap is located
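
A robot can read the Sitemap parameter back out of the file. The sketch below uses the site_maps() method of Python's standard urllib.robotparser module, which is assumed to be available (Python 3.8 or later).

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Sitemap: http://www.example.com/sitemap.xml # tell the bots where your sitemap is located
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns a list of sitemap URLs, or None if the file lists none.
sitemaps = parser.site_maps()
print(sitemaps)  # ['http://www.example.com/sitemap.xml']
```

The trailing comment is removed by the parser, so only the URL itself is returned.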