Wisman WL040/Bidgoli-Vol III-Ch-59 August 14, 2003 18:3 Char Count= 0
HTML pages for the spider to index and eventually
guide the searcher to program-generated pages.
Forms: Collecting visitor information is one of the most
important functions of many Web sites, but spiders
do not know how to fill out forms; a site login form
automatically stops a spider. Spiders that do index
the content and links of the page containing the form
create potential problems by leading visitors directly
to the form page from a search engine rather than
through pages intended to precede the form. An exam-
ple would be an airline reservation system with forms
for itinerary and payment. Visitors arriving directly
at the payment form would obviously have problems;
consider adding links back to a starting page.
Robot Exclusion: Forms represent one good reason to
exclude spiders from indexing certain pages. Two stan-
dards exist that instruct well-behaved spiders on ex-
cluding specified pages. The recognized standard is the
“robots.txt” file that lists acceptable and unacceptable
directories, pages, and robots (i.e., spiders). A single
robots.txt file exists for the entire Web site, which only
the site administrator can access, creating a mainte-
nance bottleneck as multiple designers make changes
to the site. A better but less accepted solution defines a
special “robots” meta tag to specify how to index each
individual page. Its options allow everything or nothing
(all, none), whether the page itself should be indexed
(index, noindex), and whether its links should be
followed (follow, nofollow).
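The robots.txt mechanism can be exercised from code. The sketch below uses Python's standard urllib.robotparser; the file contents, site name, and excluded paths are illustrative, not taken from any real site.

```python
from urllib import robotparser

# Hypothetical robots.txt: every spider ("User-agent: *") is
# asked to skip a payment form page and a private directory.
rules = """\
User-agent: *
Disallow: /reservations/payment.html
Disallow: /private/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# A well-behaved spider checks each URL before fetching it.
rp.can_fetch("*", "http://example.com/index.html")          # allowed
rp.can_fetch("*", "http://example.com/private/notes.html")  # excluded
```

The page-level alternative is a meta tag in the page head, for example <meta name="robots" content="noindex,nofollow">, which any designer can edit without touching the sitewide file.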
Images: Spiders may index the image location, image
title, and alternate text but that is probably all. For
searchers to find the image, include some additional
information about the image in the page content and
the HTML tags.
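Those few attributes are easy to enumerate. The sketch below, using Python's standard html.parser module, plays the part of a spider and records only the location, title, and alternate text of each image; the markup it parses is invented for illustration.

```python
from html.parser import HTMLParser

class ImageIndexer(HTMLParser):
    """Collects the only image attributes a spider can usually index."""
    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attrs = dict(attrs)
            self.images.append(
                {k: attrs.get(k) for k in ("src", "title", "alt")})

# Hypothetical page: the caption sentence in the surrounding
# content is what makes the image findable by a text query.
page = ('<p>Our best-selling home refrigerator:</p>'
        '<img src="/img/fridge.jpg" title="Model X100"'
        ' alt="White two-door home refrigerator">')
indexer = ImageIndexer()
indexer.feed(page)
```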
Deeply Linked Pages: Most spiders completely index
only small sites; on larger sites they index only a
limited number of pages. Spiders limit indexing to several con-
necting links deep, ignoring pages linked beyond that
depth. As a rule, keep pages important in attracting
visitors linked directly from the site home page.
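The depth cutoff is easy to picture as a breadth-first walk of the site's link graph. In the sketch below the site map and the two-click limit are hypothetical; the point is that deep.html, three links from the home page, never enters the index.

```python
from collections import deque

def reachable_within(links, start, max_depth):
    """Breadth-first walk of a site's link graph, ignoring pages
    more than max_depth clicks away from the start page."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # too deep; do not follow this page's links
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return seen

# Hypothetical site map: deep.html sits three clicks from home.
links = {
    "index.html": ["products.html"],
    "products.html": ["fridges.html"],
    "fridges.html": ["deep.html"],
}
indexed = reachable_within(links, "index.html", 2)
```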
Reorganization and Broken Links: Resist the urge to
reorganize the site. Until all the spiders come again,
the new greeting for visitors arriving from search en-
gines to pages that have been renamed or otherwise
permanently hidden may be “404 Not Found.” Adding
new pages to the site is fine; just leave existing page
locations and names alone.

Metasearch
Individual search engines produce results biased in ways
that are unpredictable and invisible to a searcher. Send-
ing the same query to several search engines will return
widely different results that clearly illustrate the bias. One
study on search engine bias (Mowshowitz & Kawaguchi,
2002) demonstrated that querying nine popular search
engines for information on “home refrigerators” pro-
duced 14 different brand names in the cumulative top 50
results of each engine. Reporting of the brands was un-
predictable and uneven across search engines consulted;
several brands were found by only a single search engine
and no search engine found a majority of the 14 brands.
And obviously there was no clue as to the brands not found
at all.
Given that individual search engines index only a small
fraction of the Web and the degree of index overlap among
a group of search engines may be small, it makes sense
to consult multiple search engines. Metasearch engines
automate multiple searches by sending the query to sev-
eral standard search engines and organizing the fusion
of the results uniformly. Unfortunately, metasearch
engines cannot broadcast a query to every search engine;
instead they attempt to minimize the use of limited resources
such as network bandwidth while maximizing informa-
tion quality. Although increasing the number of infor-
mation sources will generally improve recall, it is also
likely that precision will suffer correspondingly. Balanc-
ing these conflicting goals is the key challenge in designing
a metasearch engine.
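The fusion step can be sketched concretely. In the sketch below the engines and their ranked result lists are invented, and the scoring rule, summed reciprocal rank, is just one simple way to reward pages that several engines agree on once duplicates are merged.

```python
def merge_results(result_lists):
    """Fuse ranked result lists from several engines: duplicates
    collapse into one entry, and each URL is scored by the sum of
    the reciprocals of its ranks, so agreement lifts a page."""
    scores = {}
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] = scores.get(url, 0.0) + 1.0 / rank
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical top results from three engines for one query.
engine_a = ["a.com", "b.com", "c.com"]
engine_b = ["b.com", "a.com", "d.com"]
engine_c = ["b.com", "e.com"]
merged = merge_results([engine_a, engine_b, engine_c])
```

Here b.com, ranked highly by all three engines, rises to the top of the fused list even though two engines put a.com first or second.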
The essential architecture of a metasearch engine
(Dreilinger & Howe, 1997) consists of a dispatch mecha-
nism to determine which search engines receive a specific
query; interface agents to contact and adapt the query
and result formats of each search engine; and a display
mechanism that creates and displays a uniform ranking
of results after removing duplicates. By depending upon
the direct results returned from regular search engines,
metasearch engines cannot expand or improve upon the
information sources. However, to a searcher, metase-
arch represents an obvious improvement over a single
search engine by the simple increase in the number of
search engines consulted and the corresponding increase
in the fraction of the Web examined.

HOW TO BE SEARCHED—VIEWS FROM
THE WEB SITE
The purpose of building a Web site is to attract visitors;
information is the lure. Visitors most often find a new
site via a search engine, so building an easily found and
searched Web site is critical. A study of search success
(Users Don’t Learn to Search Better, 2001) illustrates the
challenges of designing a Web site for search. After watch-
ing 30 searchers search different sites for content that
was on the sites, the study concluded: “The more times
the users searched, the less likely they were to find what
they wanted.” Single searches found the content 55% of
the time, those searching twice found the content only
38% of the time, and those searching more than twice
never found the content. Nearly 23% of the searchers
received a “no results” response on their first searches,
causing most to give up immediately. For those who con-
tinued to search, results only grew worse. Further com-
pounding search problems is the prevalence of invalid
links to pages that are no longer accessible; one study
(Lawrence et al., 2000) gives the percentage of invalid
links ranging from 23% of 1999 pages to 53% of 1993
pages. The collective message seems clear: design the site
and pages for search and continually test that search
works.
Designing a Web site for search is the subject of this
section; the details of page search were covered in the
previous section. This section divides search of a Web site
into two main parts: search that includes the Web site as