Digital Marketing Handbook

Sitemap


The field operations and other Sitemap options are defined at http://www.sitemaps.org
(Sitemaps.org: Google, Inc., Yahoo, Inc., and Microsoft Corporation).
See also Robots.txt, which can be used to identify sitemaps on the server.
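As a rough illustration of how robots.txt can identify sitemaps, the sketch below parses a small, made-up robots.txt with Python's standard urllib.robotparser module and lists the Sitemap entries it declares. The file contents and the example.com URLs are assumptions for illustration only, and site_maps() requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the URLs are placeholders.
ROBOTS_TXT = """\
User-agent: *
Disallow:

Sitemap: http://www.example.com/sitemap.xml
Sitemap: http://www.example.com/news-sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns the declared sitemap URLs, or None if there are none.
sitemaps = parser.site_maps()
print(sitemaps)
```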

References
[1] Site Map Usability (http://www.useit.com/alertbox/sitemaps.html), Jakob Nielsen's Alertbox, August 12, 2008.
[2] "WordPress Plugin: Google XML Sitemaps" (http://linksku.com/10/wordpress-plugins). Linksku.
[3] Joint announcement (http://www.google.com/press/pressrel/sitemapsorg.html) from Google, Yahoo, Bing supporting Sitemaps.

External links



  • Common Official Website (http://www.sitemaps.org/) - Jointly maintained website by Google, Yahoo, MSN
    for an XML sitemap format.

  • Sitemap generators (http://www.dmoz.org/Computers/Internet/Searching/Search_Engines/Sitemaps) at the
    Open Directory Project

  • Tools and tutorial (http://www.scriptol.com/seo/simple-map.html) Helping to build a cross-systems sitemap
    generator.


Robots Exclusion Standard


The Robots Exclusion Standard, also known as the Robots Exclusion Protocol or robots.txt protocol, is a
convention for asking cooperating web crawlers and other web robots not to access all or part of a website that
is otherwise publicly viewable. Robots are often used by search engines to categorize and archive web sites, or by
webmasters to proofread source code. The standard is different from, but can be used in conjunction with, Sitemaps,
a robot inclusion standard for websites.

History


The invention of "robots.txt" is attributed to Martijn Koster, who was working for WebCrawler in 1994.[1][2]
"robots.txt" was then popularized with the advent of AltaVista, and other popular search engines, in the following
years.

About the standard


If a site owner wishes to give instructions to web robots, they must place a text file called robots.txt in the root
of the web site hierarchy (e.g. http://www.example.com/robots.txt). This text file should contain the instructions
in a specific format (see examples below). Robots that choose to follow the instructions try to fetch this file and read
the instructions before fetching any other file from the web site. If this file doesn't exist, web robots assume that the
site owner wishes to provide no specific instructions.
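The lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name robots_txt_url and the user agent "ExampleBot" are hypothetical, and a missing robots.txt is modelled as an empty rule set instead of being fetched over the network.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL at the root of the page's web site."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.example.com/articles/2012/index.html"))

# A site without a robots.txt is modelled here as an empty rule set:
# with no instructions, nothing is off limits.
no_rules = RobotFileParser()
no_rules.parse([])
print(no_rules.can_fetch("ExampleBot", "http://www.example.com/articles/"))
```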
A robots.txt file on a website functions as a request that specified robots ignore specified files or directories when
crawling a site. This might be, for example, out of a preference for privacy from search engine results, or the belief
that the content of the selected directories might be misleading or irrelevant to the categorization of the site as a
whole, or out of a desire that an application only operate on certain data. Links to pages listed in robots.txt can still
appear in search results if they are linked to from a page that is crawled.[3]
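How a cooperating robot might honor such a request can be sketched with Python's standard urllib.robotparser module. The rules, the /private/ directory, and the user agent "ExampleBot" are assumptions made up for this illustration; nothing here stops a robot that chooses to ignore the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: ask every robot to stay out of /private/.
RULES = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def is_allowed(url: str, agent: str = "ExampleBot") -> bool:
    """Return True if the parsed rules permit `agent` to fetch `url`."""
    return parser.can_fetch(agent, url)


print(is_allowed("http://www.example.com/index.html"))
print(is_allowed("http://www.example.com/private/report.html"))
```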
For websites with multiple subdomains, each subdomain must have its own robots.txt file. If example.com had a
robots.txt file but a.example.com did not, the rules that would apply for example.com would not apply to
a.example.com.
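The per-subdomain behavior can be sketched by keying rule sets on the exact host name. In this hypothetical setup only example.com publishes a robots.txt, so its rules are not consulted for a.example.com; the rules, the helper names, and the user agent "ExampleBot" are assumptions for illustration.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def make_parser(text: str) -> RobotFileParser:
    """Build a parser from robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# Rules keyed by exact host name; a.example.com has no entry of its own.
rules_by_host = {
    "example.com": make_parser("User-agent: *\nDisallow: /private/\n"),
}


def is_allowed(url: str, agent: str = "ExampleBot") -> bool:
    host = urlsplit(url).hostname
    parser = rules_by_host.get(host)
    if parser is None:
        return True  # no robots.txt for this host, so no restrictions apply
    return parser.can_fetch(agent, url)


print(is_allowed("http://example.com/private/page.html"))
print(is_allowed("http://a.example.com/private/page.html"))
```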