The Internet Encyclopedia (Volume 3)

P1: IML

Wisman WL040/Bidgoli-Vol III-Ch-59 August 14, 2003 18:3 Char Count= 0

HOWSEARCHWORKS—VIEWSFROM THESEARCHENGINE 729

from other pages, under the assumption that a frequently referenced page is more relevant than a page with fewer references. However, a popular page is not automatically important. Page importance further assumes that pages or sites that mutually refer to one another form groups of importance or authority on a subject. The more references to a page from others in the group, the higher the importance and consequent rank of the page. The resulting groups are somewhat analogous to the subjects of a list service with the exception that humans organize subject lists and the interconnecting references organize a group. Automatic grouping can produce more focused and higher quality references but eschew the hierarchical organization that supports browsing of subjects by the searcher. Which ranking approach produces the best result depends upon the user’s search needs. Comprehensive search is the natural outcome of search based on word match ranking alone but yields no organization of the results. Existence and exploratory search can benefit from the reference-based ranking methods of popularity and importance. Popularity ranking anticipates that the information that many others reference represents common knowledge of a subject. Importance attempts to refine popularity ranking by organizing references into support- ing groups. Grouping together documents that have common references will generally provide more homogenous results and is best suited for exploratory search where the subject is recognized.

HOW SEARCH WORKS—VIEWS FROM THE SEARCH ENGINE Although human-organized lists and automated searches are often lumped together as search engines, how each gathers and represents information is quite different.

Human-Organized Lists Human-organized lists depend upon editors to first review Web site pages and organize the collected information into subject hierarchies. A searcher can then manually browse a subject hierarchy list to find information. Lists more accurately reflect true page content than automated searches because human editors do not review the parts of a page not normally seen by a reader and can ignore gratuitous or repetitive words used to attract a search engine. One weakness of lists is the effort required of the editors to perform the review. Once a site review is com- pleted, significant changes to the site content are unlikely to prompt another review, although most lists do accept site resubmissions.

Search Engines Automated Web search engines have two main tasks: plac- ing keywords from each page into an index and answering search queries from the index. Web-based search engines are only recent versions of traditional full-text systems, such as the SMART and SIRE systems (Salton & McGill, 1983). Rather than employing human editors to extract and index subject information from pages, such systems

perform automatic indexing of complete pages into a database, creating lists of words and the page locations that contain each word. Answering queries from the index compares query words to page words, retrieving those pages that have a high calculated similarity to the query.

Indexing The automatic indexing of a page first removes most common or stop words such as “a,” “the,” and “it”, because almost all pages contain those words and they are nearly useless in discriminating between pages. Each remain- ing word then has a weight factor calculated based in part on word frequency. Weight measures also may give greater or lesser importance to words in different parts of the page; for example, title words generally receive more weight than regular text words. A word that occurs in few pages also has a higher weight, based on the proportion of times that the word occurs in a single page relative to all other pages. A word that occurs on only one page would have relatively high weight whereas a word occurring on all pages would have low weight, the rationale being that a rare word is more useful in finding a page than a word common to many pages. The same rationale applies to the removal of stop words. Conceptually, the resulting index contains all words from all pages, excluding stop words. Included with each word is a list of all the pages that contain the word. For each page in the list, there is stored the frequency with which the word occurs in the page for weight calculation and the location of the page for retrieval. Web search engines differ significantly from traditional search engines in that pages are scattered across many thousands of Web sites and each page can have connec- tions to many other pages. To find pages and build the index, the indexing program must then visit Web sites as one would with a browser, starting at a page, visiting connected pages, and indexing each page as it goes. In the jargon of the Web, a spider, robot, or Web-crawler is the program that visits or crawls connected pages, indexing selected parts from the visited pages. The resulting index contains the word lists of traditional text searching and the unique location of the page for retrieval. For each page indexed, the connecting references to other Web pages are included when the popularity and importance of pages are to be calculated.

Retrieval The traditional retrieval process produces a ranking of all pages that contain one or more query words entered by the searcher. The retrieval operation consists of converting the query words to the same representation as the pages, calculating a similarity measure between the query and each page in the collection, and retrieving pages from the collection ranked in order of high to low similarity. A common similarity measure is the cosine of the angle between the words of the query and the words from indi- vidual pages. Determining the cosine similarity measure requires representing the words of the query and pages as vectors in a multidimensional word space where each axis corresponds to a different word drawn from the indexed pages. Roughly, if a page matches the query, the angle between the query and the page vector is zero; having

The Internet Encyclopedia (Volume 3)

Get our desktop app

Company

Features

Documentation

Resources