
Robots Exclusion Standard


[ 10 ]"Google's Hidden Interpretation of Robots.txt" (http:/ / blog. semetrical. com/ googles-secret-approach-to-robots-txt/ ).. Retrieved
2010-11-15.
[ 11 ]"Robots Exclusion Protocol - joining together to provide better documentation" (http:/ / http://www. bing. com/ community/ site_blogs/ b/
webmaster/ archive/ 2008/ 06/ 03/ robots-exclusion-protocol-joining-together-to-provide-better-documentation. aspx).. Retrieved 2009-12-03.
[ 12 ]"Yahoo! Search Blog - Webmasters can now auto-discover with Sitemaps" (http:/ / ysearchblog. com/ 2007/ 04/ 11/
webmasters-can-now-auto-discover-with-sitemaps/ ).. Retrieved 2009-03-23.
[ 13 ]"Search engines and dynamic content issues" (http:/ / ghita. org/ search-engines-dynamic-content-issues. html). MSNbot issues with
robots.txt.. Retrieved 2007-04-01.

External links



  • The Web Robots Pages (http://www.robotstxt.org/)

  • History of robots.txt (http://www.antipope.org/charlie/blog-static/2009/06/how_i_got_here_in_the_end_part_3.html) -
    how Charles Stross prompted its invention; original comment on Slashdot
    (http://yro.slashdot.org/comments.pl?sid=377285&cid=21554125)

  • Block or remove pages using a robots.txt file - Google Webmaster Tools Help: Using the robots.txt analysis tool
    (http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=156449)

  • About Robots.txt at the MediaWiki website (http://www.mediawiki.org/wiki/Robots.txt)

  • List of Bad Bots (http://www.kloth.net/internet/badbots.php) - rogue robots and spiders which ignore these
    guidelines

  • Wikipedia's robots.txt - an example (http://en.wikipedia.org/robots.txt)

  • Robots.txt Generator + Tutorial (http://www.mcanerin.com/EN/search-engine/robots-txt.asp)

  • Robots.txt Generator Tool (http://www.howrank.com/Robots.txt-Tool.php)

  • Robots.txt is not a security measure (http://www.diovo.com/2008/09/robotstxt-is-not-a-security-measure/)


Robots.txt


The Robot Exclusion Standard, also known as the Robots Exclusion Protocol or robots.txt protocol, is a
convention used to prevent cooperating web crawlers and other web robots from accessing all or part of a website
that is otherwise publicly viewable. Robots are often used by search engines to categorize and archive websites, or by
webmasters to proofread source code. The standard is different from, but can be used in conjunction with, Sitemaps,
a robot inclusion standard for websites.

History


The invention of "robots.txt" is attributed to Martijn Koster, who proposed it while working for WebCrawler in 1994.[1][2]
"robots.txt" was then popularized with the advent of AltaVista and other popular search engines in the following
years.

About the standard


If a site owner wishes to give instructions to web robots, they must place a text file called robots.txt in the root
of the website hierarchy (e.g. http://www.example.com/robots.txt). This text file should contain the instructions
in a specific format (see examples below). Robots that choose to follow the instructions try to fetch this file and read
the instructions before fetching any other file from the website. If this file does not exist, web robots assume that the
site owner wishes to give no specific instructions.
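
The sketch below illustrates this fetch-and-check behaviour using Python's standard urllib.robotparser module. It is
only an illustration of a cooperating crawler, not part of the standard itself; the ExampleBot name and the
example.com addresses are placeholders.

# Illustrative sketch of a cooperating crawler: fetch robots.txt first,
# then ask whether a given URL may be crawled.
# ExampleBot and example.com are placeholder names.
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleBot"

parser = RobotFileParser()
parser.set_url("http://www.example.com/robots.txt")  # the file lives at the site root
parser.read()  # fetch and parse; a missing (404) file is treated as "no restrictions"

page = "http://www.example.com/some/page.html"
if parser.can_fetch(USER_AGENT, page):
    print("robots.txt permits crawling", page)
else:
    print("robots.txt asks", USER_AGENT, "not to crawl", page)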
A robots.txt file on a website will function as a request that specified robots ignore specified files or directories when
crawling a site. This might be, for example, out of a preference for privacy from search engine results, or the belief
that the content of the selected directories might be misleading or irrelevant to the categorization of the site as a
whole.
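
As a further sketch, the hypothetical rules below ask one named robot to stay out of two directories while leaving
all other robots unrestricted; the same standard-library parser shows how such a request is interpreted.

# Hypothetical robots.txt rules, parsed offline with the standard library.
from urllib.robotparser import RobotFileParser

SAMPLE_RULES = """\
User-agent: ExampleBot
Disallow: /private/
Disallow: /drafts/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(SAMPLE_RULES.splitlines())

print(parser.can_fetch("ExampleBot", "/private/report.html"))  # False: excluded for this robot
print(parser.can_fetch("ExampleBot", "/index.html"))           # True: not covered by any Disallow rule
print(parser.can_fetch("OtherBot", "/private/report.html"))    # True: other robots are unrestricted

An empty Disallow line, as in the wildcard group above, places no restriction at all, so robots other than the named
one may crawl the whole site.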