User-agent: CrawlerName
Disallow: /tmp/
Disallow: /links/listing.html
This bit of text first tells all crawlers to ignore the temporary directories, so every crawler
reading that file will automatically skip the temporary files. But you’ve also told a specific
crawler (indicated by CrawlerName) to disallow both the temporary directories and the links
on the Listing page. The problem is that the specified crawler will never get that message, because
it has already read that all crawlers should ignore the temporary directories.
If you want to command multiple crawlers, begin by naming the crawlers you want to control.
Only after they’ve been named should you leave your instructions for all crawlers. Written
properly, the text from the preceding code should look like this:
User-agent: CrawlerName
Disallow: /tmp/
Disallow: /links/listing.html
User-agent: *
Disallow: /tmp/
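One way to check a file like this before you publish it is to feed it to a robots.txt parser and ask what each crawler may fetch. The sketch below uses Python's standard urllib.robotparser module on the corrected listing. OtherBot and the sample paths are made-up names used only for illustration, and real crawlers may interpret the file somewhat differently than this parser does.

```python
from urllib.robotparser import RobotFileParser

# The corrected robots.txt from the text: the named crawler first, the
# catch-all record last. A blank line separates the two records.
ROBOTS_TXT = """\
User-agent: CrawlerName
Disallow: /tmp/
Disallow: /links/listing.html

User-agent: *
Disallow: /tmp/
"""


def build_parser(text):
    """Parse robots.txt text held in memory (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "OtherBot" is a hypothetical name standing in for any crawler that is
# not named in the file; the paths are illustrative as well.
for agent in ("CrawlerName", "OtherBot"):
    for path in ("/tmp/scratch.html", "/links/listing.html", "/index.html"):
        verdict = "allowed" if parser.can_fetch(agent, path) else "blocked"
        print(f"{agent:12} {path:22} {verdict}")
```

With this file, the named crawler is kept out of both the temporary directory and the Listing page, while every other crawler is kept out of the temporary directory only.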
NOTE If you have certain pages or links that you want the crawler to ignore, you can accomplish
this without causing the crawler to ignore a whole site or a whole directory or having to
put a specific meta tag on each page.
Each search engine crawler goes by a different name, and if you look at your web server log, you’ll
probably see that name. Here’s a quick list of some of the crawler names that you’re likely to see in
that web server log:
Google: Googlebot
MSN: MSNbot
Yahoo! Web Search: Yahoo SLURP or just SLURP
Ask: Teoma
AltaVista: Scooter
LookSmart: MantraAgent
WebCrawler: WebCrawler
SearchHippo: Fluffy the Spider
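If you would rather not read the log by eye, a short script can tally how often each of these names shows up. The following is a minimal sketch, not a full log analyzer: it assumes the crawler name appears somewhere in each log line, it shortens a few names to a single token (for example Slurp and Fluffy), and the sample log lines are invented, since real log formats and user-agent strings vary by server and crawler.

```python
from collections import Counter

# Tokens based on the crawler names listed above, mapped to their search
# engines. Shortening names to one token is an assumption of this sketch.
CRAWLER_TOKENS = {
    "Googlebot": "Google",
    "MSNbot": "MSN",
    "Slurp": "Yahoo! Web Search",
    "Teoma": "Ask",
    "Scooter": "AltaVista",
    "MantraAgent": "LookSmart",
    "WebCrawler": "WebCrawler",
    "Fluffy": "SearchHippo",
}


def count_crawler_hits(log_lines):
    """Count log lines that mention a known crawler token (case-insensitive)."""
    hits = Counter()
    for line in log_lines:
        lowered = line.lower()
        for token, engine in CRAWLER_TOKENS.items():
            if token.lower() in lowered:
                hits[engine] += 1
                break  # attribute each line to at most one crawler
    return hits


# Invented sample lines for illustration only.
SAMPLE_LOG = [
    '192.0.2.10 - - "GET /index.html HTTP/1.1" 200 "ExampleBrowser/1.0"',
    '192.0.2.20 - - "GET /robots.txt HTTP/1.1" 200 "Googlebot/2.1"',
    '192.0.2.21 - - "GET /links/listing.html HTTP/1.1" 200 "Googlebot/2.1"',
    '192.0.2.30 - - "GET /robots.txt HTTP/1.1" 200 "Yahoo! Slurp"',
    '192.0.2.40 - - "GET /tmp/cache.html HTTP/1.1" 200 "msnbot/1.0"',
]

for engine, count in count_crawler_hits(SAMPLE_LOG).most_common():
    print(f"{engine}: {count}")
```

To run it against a real log, pass an open file object instead of SAMPLE_LOG; the function only needs something it can iterate over line by line.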
These are just a few of the search engine crawlers that might crawl across your site. You can find a
complete list along with the text of the Robots Exclusion Standard document on the Web Robots
Pages (www.robotstxt.org). Take the time to read the Robots Exclusion Standard document.
It’s not terribly long, and reading it will help you understand how search crawlers interact with
your web site. That understanding can also help you learn how to control crawlers better when
they come to visit.