Wisman WL040/Bidgoli-Vol III-Ch-59 August 14, 2003 18:3 Char Count= 0
WEB SEARCH FUNDAMENTALS
<html>
<head>
<meta name="description" content="Human lists and automated search engines.">
<meta name="keywords" content="search engine, indexing">
<title>How Search Engines Work</title>
</head>
<body>
<h1>Automated Search Engines</h1>
Automated Web search engines have two main tasks; one of indexing the Web
information, the second of answering search queries from the index. First, an indexing
program visits a website much as you would with a browser, normally starting at the
default homepage, visiting connected pages and indexing the site information (<a
href="Figure6.html">see Figure 6</a>).
</body>
</html>

Figure 7: An HTML page contains visible parts displayed by the browser and hidden
parts that can help spiders index the page more accurately, provide descriptive
information to the searcher, and link to other HTML pages.

relatively high. Influencing other sites to link to a site’s pages is not easy; it
depends upon good content to attract recognition. Cross-listing agreements with other
sites to link to a site are one technique that can provide a quicker start to that
recognition.

What Search Engines Ignore
Can designers make their Web sites invisible? Are all spiders the same? Not every
word on a Web page will lead the searcher to the page. Most spiders purposely ignore
or cannot see large parts of a page that human readers might see and use;
carelessness in page design can drive both spiders and human searchers away. The
most common spider problems and solutions are the following:

Frames: The purpose of frames is to visibly divide a
browser screen into several parts, but unfortunately frames can stop an indexing
spider and create confusion for visitors arriving from a search engine. At least
three separate pages are needed for frames: a hidden frameset page that defines the
frames and links visible pages into the frames, a second page for visible content,
and a third that is often for navigation. A spider normally arrives at the hidden
frameset page but must understand how to handle frames in order to follow links to
the other, visible pages. Spiders that do not understand frames simply stop and never
index further.
For those spiders and browsers that do not understand frames, the remaining site
pages may be unreachable.

How Search Engines Work
Human lists and automated search engines.
http://www.insearchof.org/how.htm

Figure 8: An example of how a search engine might respond to a query. The word
“indexing” is part of the keyword meta tag and embedded in the text content. The
title is “How Search Engines Work” and the description meta tag is “Human lists and
automated search engines.” The document URL “www.insearchof.org/how.htm” and the
title provide links to the complete document.

Because frames cause problems for some spiders, the obvious solution is to avoid the
use of frames entirely. However, when the Web site designer is forced to use frames,
including the “noframes” tag exposes alternative text for navigation and content to
frame-ignorant spiders and browsers. The “noframes” tag designates text to be
displayed in place of the framed pages, effectively duplicating the page content and
design effort. Spiders that understand frames create a different problem. Visitors
can now arrive at a content page directly from the search engine rather than through
the main frameset page as the Web site designer intended. Without the main frameset
page there is no navigation page either; visitors can become wedged on a dead-end
page and, without any navigation, be forced to leave the site. One solution is to
place a link to the main frameset or site home page in every navigation and content
page to help keep the visitor on the site. Of course, another solution is to avoid
frames altogether.
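
As a rough illustration of the two frame remedies described above, the following
sketch shows a frameset page that carries a “noframes” alternative. The file names
(index.html, nav.html, content.html) and the frame names are hypothetical
placeholders chosen for this example; the title and description text are borrowed
from Figure 7.

```html
<!-- index.html: hypothetical frameset page; file and frame names are assumptions -->
<html>
<head>
<title>How Search Engines Work</title>
</head>
<frameset cols="25%,75%">
  <frame src="nav.html" name="navigation">
  <frame src="content.html" name="content">
  <!-- Displayed by browsers, and readable by spiders, that do not understand frames -->
  <noframes>
  <body>
    <h1>How Search Engines Work</h1>
    <p>Human lists and automated search engines.</p>
    <p><a href="nav.html">Site navigation</a> |
       <a href="content.html">Main content</a></p>
  </body>
  </noframes>
</frameset>
</html>
```

For the dead-end problem, a content or navigation page might include an ordinary link
back to the frameset page. The target="_top" attribute replaces the whole window, so
the frameset is not nested inside one of its own frames when the link is followed
from within the framed site.

```html
<!-- content.html (fragment): link for visitors who arrive directly from a search engine -->
<p><a href="index.html" target="_top">Return to the site home page</a></p>
```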

Scripts: Most spiders ignore script programs written in JavaScript or other scripting
languages; others simply index the script program text. Spiders that index the script
may also index only the first few hundred words of the document and possibly never
reach the content. Place important content and keywords before scripts and, for pages
that are mostly scripts, include title, keyword, and description tags.
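
One way to apply this advice is sketched below: the title, description, and keyword
tags sit in the head, the visible content comes first in the body, and the script is
placed last. The page text reuses the Figure 7 example, and the one-line script is a
hypothetical stand-in for whatever script a real page would carry.

```html
<html>
<head>
<!-- Title, description, and keywords stay available even if the body is skipped or truncated -->
<title>How Search Engines Work</title>
<meta name="description" content="Human lists and automated search engines.">
<meta name="keywords" content="search engine, indexing">
</head>
<body>
<h1>Automated Search Engines</h1>
<p>Automated Web search engines index the Web and answer search queries
from the index.</p>
<!-- Script placed after the important content; its body is only a placeholder -->
<script type="text/javascript">
  document.write("Last updated: " + document.lastModified);
</script>
</body>
</html>
```

A spider that indexes only the first few hundred words of this page would reach the
heading and paragraph before any script text.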

Java Applets and Plug-Ins: To a spider, a Java applet, plug-in, or other
browser-executed program is invisible. For indexing purposes, include descriptive
content and tags within the page that contains the program.
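
A minimal sketch of such a page follows. The applet class name SpiderDemo.class, its
dimensions, and all of the descriptive wording are invented for illustration. The
applet element reflects the HTML in use when this text was written and is obsolete in
current HTML.

```html
<html>
<head>
<title>Spider Demonstration</title>
<meta name="description" content="An animated demonstration of a spider indexing a Web site.">
<meta name="keywords" content="spider, indexing, search engine">
</head>
<body>
<h1>Spider Demonstration</h1>
<!-- Ordinary text that a spider can index even though it cannot see the applet -->
<p>This page animates how an indexing spider starts at a home page and
follows links to connected pages.</p>
<!-- SpiderDemo.class is a hypothetical applet name -->
<applet code="SpiderDemo.class" width="400" height="300">
  <!-- Alternative content for browsers that cannot run the applet -->
  <p>A Java-enabled browser is needed to view the animation.</p>
</applet>
</body>
</html>
```

The descriptive paragraph is deliberately placed in the page body rather than only
inside the applet element, because whether a given spider reads the alternative
content inside the element is not something the text above establishes.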

Server-Generated Pages: Spiders may ignore any unusual link references, such as ones
that do not end in “HTM” or “HTML.” For example, a spider will follow the connecting
link “<a href = ‘mainpage.HTML’>” but may not follow the link to the Web server
program of “<a href = ‘mainpage.ASP’>” due to the ending “ASP.” Generating the Web
site main page with a server program could mean that some spiders ignore the complete
site. One solution is to provide some