whole, or out of a desire that an application only operate on certain data. Links to pages listed in robots.txt can still
appear in search results if they are linked to from a page that is crawled.[3]
For websites with multiple subdomains, each subdomain must have its own robots.txt file. If example.com had a
robots.txt file but a.example.com did not, the rules that would apply for example.com would not apply to
a.example.com.
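For instance, a crawler honoring the protocol would request each of the following files separately, and the rules in each would govern only its own host:
http://example.com/robots.txt
http://a.example.com/robots.txt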
Disadvantages
Despite the use of the terms "allow" and "disallow", the protocol is purely advisory. It relies on the cooperation of
the web robot, so marking an area of a site out of bounds with robots.txt does not guarantee exclusion of all web
robots. In particular, malicious web robots are unlikely to honor robots.txt.
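As a sketch of how a cooperative crawler might consult the file, the following uses Python's standard-library urllib.robotparser module (the URL and user-agent name are hypothetical):
from urllib import robotparser

# Fetch and parse the site's robots.txt (hypothetical URL).
rp = robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()

# A well-behaved robot checks each URL before fetching it;
# nothing in the protocol forces a malicious robot to do so.
if rp.can_fetch("ExampleBot", "http://example.com/private/page.html"):
    print("allowed to fetch")
else:
    print("disallowed by robots.txt")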
There is no official standards body or RFC for the robots.txt protocol. It was created by consensus[4] in June 1994
by members of the robots mailing list (robots-request@nexor.co.uk). The parts of the site that should not be
accessed are specified in a file called robots.txt in the top-level directory of the website. The robots.txt
patterns are matched by simple substring comparisons, so care should be taken to make sure that patterns intended to
match directories have the final '/' character appended; otherwise all files with names starting with that substring
will match, rather than just those in the intended directory.
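For example (the directory names here are illustrative), the first pattern below matches only pages under the /private/ directory, while the second, which lacks the trailing slash, also matches any other path beginning with the same characters:
User-agent: *
Disallow: /private/ # matches /private/page.html but not /private.html
Disallow: /docs # matches /docs/, /docs.html and /docs-old/ alike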
Examples
This example tells all robots that they can visit all files, because the wildcard * matches all robots and the empty Disallow directive disallows nothing:
User-agent: *
Disallow:
This example tells all robots to stay out of a website:
User-agent: *
Disallow: /
This example tells all robots not to enter four directories of a website:
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/
This example tells a specific robot not to enter one specific directory:
User-agent: BadBot # replace the 'BadBot' with the actual user-agent of the bot
Disallow: /private/
This example tells all robots not to enter one specific file:
User-agent: *
Disallow: /directory/file.html
Note that all other files in the specified directory will still be processed.
Example demonstrating how comments can be used:
# Comments appear after the "#" symbol at the start of a line, or after a directive
User-agent: * # match all bots
Disallow: / # keep them out