P1: IML
Wisman WL040/Bidgoli-Vol III-Ch-59 August 14, 2003 18:3 Char Count= 0
HOWSEARCHWORKS—VIEWSFROM THESEARCHENGINE 729from other pages, under the assumption that a frequently
referenced page is more relevant than a page with fewer
references. However, a popular page is not automatically
important. Page importance further assumes that pages
or sites that mutually refer to one another form groups
of importance or authority on a subject. The more ref-
erences to a page from others in the group, the higher
the importance and consequent rank of the page. The re-
sulting groups are somewhat analogous to the subjects
of a list service with the exception that humans organize
subject lists and the interconnecting references organize
a group. Automatic grouping can produce more focused
and higher quality references but eschew the hierarchical
organization that supports browsing of subjects by the
searcher.
Which ranking approach produces the best result
depends upon the user’s search needs. Comprehensive
search is the natural outcome of search based on word
match ranking alone but yields no organization of the
results. Existence and exploratory search can benefit from
the reference-based ranking methods of popularity and
importance. Popularity ranking anticipates that the infor-
mation that many others reference represents common
knowledge of a subject. Importance attempts to refine
popularity ranking by organizing references into support-
ing groups. Grouping together documents that have com-
mon references will generally provide more homogenous
results and is best suited for exploratory search where the
subject is recognized.HOW SEARCH WORKS—VIEWS FROM
THE SEARCH ENGINE
Although human-organized lists and automated searches
are often lumped together as search engines, how each
gathers and represents information is quite different.Human-Organized Lists
Human-organized lists depend upon editors to first review
Web site pages and organize the collected information
into subject hierarchies. A searcher can then manually
browse a subject hierarchy list to find information. Lists
more accurately reflect true page content than automated
searches because human editors do not review the parts
of a page not normally seen by a reader and can ignore
gratuitous or repetitive words used to attract a search en-
gine. One weakness of lists is the effort required of the
editors to perform the review. Once a site review is com-
pleted, significant changes to the site content are unlikely
to prompt another review, although most lists do accept
site resubmissions.Search Engines
Automated Web search engines have two main tasks: plac-
ing keywords from each page into an index and answering
search queries from the index. Web-based search engines
are only recent versions of traditional full-text systems,
such as the SMART and SIRE systems (Salton & McGill,
1983). Rather than employing human editors to extract
and index subject information from pages, such systemsperform automatic indexing of complete pages into a
database, creating lists of words and the page locations
that contain each word. Answering queries from the in-
dex compares query words to page words, retrieving those
pages that have a high calculated similarity to the query.Indexing
The automatic indexing of a page first removes most com-
mon or stop words such as “a,” “the,” and “it”, because
almost all pages contain those words and they are nearly
useless in discriminating between pages. Each remain-
ing word then has a weight factor calculated based in
part on word frequency. Weight measures also may give
greater or lesser importance to words in different parts of
the page; for example, title words generally receive more
weight than regular text words. A word that occurs in few
pages also has a higher weight, based on the proportion of
times that the word occurs in a single page relative to all
other pages. A word that occurs on only one page would
have relatively high weight whereas a word occurring on
all pages would have low weight, the rationale being that
a rare word is more useful in finding a page than a word
common to many pages. The same rationale applies to the
removal of stop words.
Conceptually, the resulting index contains all words
from all pages, excluding stop words. Included with each
word is a list of all the pages that contain the word. For
each page in the list, there is stored the frequency with
which the word occurs in the page for weight calculation
and the location of the page for retrieval.
Web search engines differ significantly from traditional
search engines in that pages are scattered across many
thousands of Web sites and each page can have connec-
tions to many other pages. To find pages and build the
index, the indexing program must then visit Web sites as
one would with a browser, starting at a page, visiting con-
nected pages, and indexing each page as it goes. In the
jargon of the Web, a spider, robot, or Web-crawler is the
program that visits or crawls connected pages, indexing
selected parts from the visited pages. The resulting index
contains the word lists of traditional text searching and
the unique location of the page for retrieval. For each page
indexed, the connecting references to other Web pages are
included when the popularity and importance of pages are
to be calculated.Retrieval
The traditional retrieval process produces a ranking of all
pages that contain one or more query words entered by the
searcher. The retrieval operation consists of converting
the query words to the same representation as the pages,
calculating a similarity measure between the query and
each page in the collection, and retrieving pages from the
collection ranked in order of high to low similarity.
A common similarity measure is the cosine of the angle
between the words of the query and the words from indi-
vidual pages. Determining the cosine similarity measure
requires representing the words of the query and pages as
vectors in a multidimensional word space where each axis
corresponds to a different word drawn from the indexed
pages. Roughly, if a page matches the query, the angle
between the query and the page vector is zero; having