database D_i, r_min be the smallest database ranking score, r
be the local rank of a page from D_i, and g be the converted
similarity of the page. The conversion function is g = 1 −
(r − 1) * F_i, where F_i = r_min/(m * r_i) and m is the number
of documents desired across all searched databases. This
conversion has the following properties. First, all locally
top-ranked pages have the same converted similarity, i.e., 1.
Second, F_i is the difference between the converted
similarities of the jth and the (j+1)th ranked pages from
database D_i, for any j = 1, 2, .... Note that this difference is
larger for databases with smaller ranking scores. Consequently,
if the rank of a page p in a higher-ranked database is
the same as the rank of a page p′ in a lower-ranked database
and neither p nor p′ is top-ranked, then the converted
similarity of p will be higher than that of p′. This property can
lead to the selection of more pages from databases with
higher scores into the merged result. As an example, consider
two databases D_1 and D_2. Suppose r_1 = 0.2, r_2 = 0.5,
and m = 4. Then r_min = 0.2, F_1 = 0.25, and F_2 = 0.1. Thus,
the three top-ranked pages from D_1 will have converted
similarities 1, 0.75, and 0.5, respectively, and the three
top-ranked pages from D_2 will have converted similarities 1,
0.9, and 0.8, respectively. As a result, the merged list will
contain three pages from D_2 and one page from D_1.
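The rank-to-similarity conversion and the resulting merge can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of any particular metasearch engine. The function names are hypothetical. The tie-breaking rule (prefer the database with the higher ranking score) is also an assumption, since the text does not say how equal converted similarities are ordered.

```python
from collections import Counter


def converted_similarity(rank, db_score, min_score, m):
    """Return g = 1 - (r - 1) * F_i, where F_i = r_min / (m * r_i)."""
    f_i = min_score / (m * db_score)
    return 1 - (rank - 1) * f_i


def merge_by_converted_similarity(db_scores, m):
    """Merge locally ranked pages from several databases.

    db_scores maps a database name to its ranking score r_i.
    Returns the top m entries as (database, local_rank, g) tuples.
    """
    min_score = min(db_scores.values())
    candidates = []
    for name, score in db_scores.items():
        # No single database can contribute more than m pages.
        for rank in range(1, m + 1):
            g = converted_similarity(rank, score, min_score, m)
            candidates.append((name, rank, g))
    # Highest converted similarity first; ties go to the database with
    # the higher ranking score (an assumption, not stated in the text).
    candidates.sort(key=lambda c: (-c[2], -db_scores[c[0]]))
    return candidates[:m]


# The example from the text: r_1 = 0.2, r_2 = 0.5, m = 4.
merged = merge_by_converted_similarity({"D1": 0.2, "D2": 0.5}, m=4)
counts = Counter(name for name, _, _ in merged)
print(counts)
```

With the scores from the example, the merged list holds three pages from D2 and one from D1, matching the result derived above.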
Selected Databases for a Given Query Share Pages. In
this case, the same page may be returned by multiple local
search engines. Result merging in this situation is usually
carried out in two steps. In the first step, techniques dis-
cussed in the first two cases can be applied to all pages,
regardless of whether they are returned by one or more
search engines, to compute their similarities for merging.
In the second step, for each page p returned by multiple
search engines, the similarities of p due to multiple
search engines are combined in a certain way to generate
a final similarity for p. Many combination functions
have been proposed and studied (Croft, 2000), and some of
these functions have been used in metasearch engines. For
example, the max function is used in ProFusion (Gauch
et al., 1996), and the sum function is used in MetaCrawler
(Selberg & Etzioni, 1997).

CONCLUSION
In the past decade, we have all witnessed the explosion
of the Web. The Web has now become the largest
digital library, used by millions of people. Search engines
and metasearch engines have become indispensable tools
for Web users to find desired information.
While most Web users have probably used search
engines and metasearch engines, few know the technologies
behind these wonderful tools. This chapter has provided
an overview of these technologies, from basic ideas to
more advanced algorithms. As can be seen from this chapter,
Web-based search technology has its roots in text
retrieval techniques, but it also has many unique features.
Some efforts to compare the quality of different search
engines have been reported (see, for example, Hawking,
Craswell, Bailey, & Griffiths, 2001). An interesting issue is
how to evaluate and compare the effectiveness of different
techniques. Since most search engines employ multiple
techniques, it is difficult to isolate the effect of a particular
technique on effectiveness, even when the overall effectiveness
of the search engines can be measured.
Web-based search is still a relatively young discipline, and
it has a lot of room to grow. The upcoming transition
of the Web from mostly HTML pages to XML pages will
probably have a significant impact on Web-based search
technology.

ACKNOWLEDGMENT
This work is supported in part by NSF Grants
IIS-9902872, IIS-9902792, EIA-9911099, IIS-0208574,
IIS-0208434 and ARO-2-5-30267.

GLOSSARY
Authority page A Web page that is linked from hub
pages in a group of pages related to the same topic.
Collection fusion A technique that determines how
to retrieve documents from multiple collections and
merge them into a single ranked list.
Database selection The process of selecting potentially
useful data sources (databases, search engines, etc.) for
each user query.
Hub page A Web page with links to important (author-
ity) Web pages all related to the same topic.
Metasearch engine A Web-based search tool that uti-
lizes other search engines to retrieve information for
its user.
PageRank A measure of Web page importance based on
how Web pages are linked to each other on the Web.
Search engine A Web-based tool that retrieves poten-
tially useful results (Web pages, products, etc.) for each
user query.
Result merging The process of merging documents re-
trieved from multiple sources into a single ranked list.
Text retrieval A discipline that studies techniques to
retrieve relevant text documents from a document
collection for each query.
Web (World Wide Web) Hyperlinked documents residing
on networked computers, allowing users to navigate
from one document to any linked document.

CROSS REFERENCES
See Intelligent Agents; Web Search Fundamentals; Web Site
Design.

REFERENCES
Bergman, M. (2000). The deep Web: Surfacing the hidden
value. Retrieved April 25, 2002, from
http://www.completeplanet.com/Tutorials/DeepWeb/index.asp
Callan, J. (2000). Distributed information retrieval. In W.
Bruce Croft (Ed.), Advances in information retrieval: Recent
research from the Center for Intelligent Information
Retrieval (pp. 127–150). Dordrecht, The Netherlands:
Kluwer Academic.
Callan, J., Connell, M., & Du, A. (1999). Automatic discovery
of language models for text databases. In ACM
SIGMOD Conference (pp. 479–490). New York: ACM
Press.