The Internet Encyclopedia (Volume 3)

Wisman WL040/Bidgoli-Vol III-Ch-59 August 14, 2003 18:3 Char Count= 0


WEB SEARCH FUNDAMENTALS

phrases “vegetarian diet” “weight loss” produces only
about 3,000 page references, of which the
highest-ranked pages contain both phrases.
Inclusion and Exclusion: Query words prefixed with a
“+” must be found on a Web page, whereas words
prefixed with a “−” must not occur on the page. The
query “+diet −vegetarian” then means that “diet” must
be found on the page and “vegetarian” must not be found.
The Boolean operators AND, NOT, and AND NOT are
often-used alternatives to the “+” and “−” operators. The
query “diet AND NOT vegetarian” is normally
equivalent to our earlier query. The OR operator matches
either term; “weight OR mass” matches “weight”
interchangeably with “mass,” much to the annoyance of
physics teachers.
Proximity: Limiting matches to words or phrases that
occur in close proximity to one another assumes
that nearness implies some relation between the words.
The query “vegetarian NEAR diet” would find pages
having “vegetarian” within a few words of “diet” but
exclude pages having “diet” separated from
“vegetarian” by more than the proximity word limit.
Wildcard: An “*” at the end of a word or partial word
expands the range and number of matching words. For
example, the query “veget*” finds pages that include
“vegetarian,” “vegetable,” and any other words starting
with “veget.”
Field: Limiting the search to a designated field of the
Web page, or to pages available on a specified Web
site, can greatly narrow the search focus. As an
example, the query title:“vegetarian diet”
requires “vegetarian diet” to be part of the page title
field, whereas “site:food.com” limits the search to the
“food.com” Web site pages. The link control lists all
pages with a reference link to a specified site; the query
“link:food.com” will list all pages that link to pages on
the “food.com” Web site.
Combined Controls: Combining multiple search controls
yields a more discriminating search. The query
“vegetarian diet domain:au” limits the search for
vegetarian diets to Australian sites; Australian-based
sites end with “au” in the domain name.
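
As a rough illustration of how the controls above might be interpreted, the following Python sketch parses a query into required, excluded, phrase, field, and plain terms and then tests a page against it. The names parse_query and matches, the page dictionary layout, and the rule that at least one plain term must match are all assumptions made for this example; only the “site” field is checked, and real search engines differ in both syntax and behavior.

```python
import re

def parse_query(query):
    """Split a query string into the kinds of controls described above."""
    parsed = {"phrases": [], "required": [], "excluded": [],
              "fields": {}, "terms": []}
    # Quoted phrases such as "vegetarian diet" are extracted first.
    parsed["phrases"] = [p.lower() for p in re.findall(r'"([^"]+)"', query)]
    rest = re.sub(r'"[^"]+"', " ", query)
    for token in rest.lower().split():
        if token.startswith("+") and len(token) > 1:
            parsed["required"].append(token[1:])
        elif token.startswith(("-", "\u2212")) and len(token) > 1:
            # Accept both the ASCII hyphen and the typographic minus sign.
            parsed["excluded"].append(token[1:])
        elif ":" in token:
            field, value = token.split(":", 1)
            parsed["fields"][field] = value
        else:
            parsed["terms"].append(token)
    return parsed

def word_matches(pattern, word):
    """Compare a word with a pattern; a trailing * acts as a wildcard."""
    if pattern.endswith("*"):
        return word.startswith(pattern[:-1])
    return word == pattern

def matches(parsed, page):
    """Test a page (hypothetical dict with 'site' and 'text' keys)."""
    text = page["text"].lower()
    words = re.findall(r"\w+", text)

    def has(pattern):
        return any(word_matches(pattern, w) for w in words)

    if not all(has(t) for t in parsed["required"]):
        return False
    if any(has(t) for t in parsed["excluded"]):
        return False
    if not all(phrase in text for phrase in parsed["phrases"]):
        return False
    site = parsed["fields"].get("site")
    if site is not None and page["site"] != site:
        return False
    # Assumed rule: if plain terms are given, at least one must match.
    return not parsed["terms"] or any(has(t) for t in parsed["terms"])
```

For instance, matches(parse_query("+diet -vegetarian"), page) accepts a page whose text mentions “diet” and rejects one that also mentions “vegetarian.”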

Search Engine Performance
No single search engine performs best in all cases;
a search that fails on one can succeed on another. Search
engines compete based on the scale and strategy of search;
finding the best pages for the searcher is not only a point
of technical distinction but also a competitive advantage.
Using many different search engines for a single search is
one strategy for improving the probability of finding
relevant information, which is essentially what a metasearch
engine does by automatically submitting a query to
multiple search engines and presenting the fusion of the
highest-ranked results from each.
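
The metasearch idea can be sketched in a few lines of Python. In this sketch each search engine is represented by an ordinary function that returns a ranked list of URLs, and the results are fused with a reciprocal-rank score; both the stand-in engines and the fusion rule are assumptions for illustration, since the text does not say how any particular metasearch engine combines results.

```python
from collections import defaultdict

def metasearch(query, engines, top_k=10):
    """Submit a query to several engines and fuse their top results.

    engines is a list of callables; each takes a query string and
    returns a list of URLs ordered from most to least relevant.
    """
    scores = defaultdict(float)
    for engine in engines:
        for position, url in enumerate(engine(query)[:top_k], start=1):
            # Assumed fusion rule: a result at position p earns 1/p.
            scores[url] += 1.0 / position
    # Highest fused score first; ties are broken alphabetically.
    return sorted(scores, key=lambda url: (-scores[url], url))

# Hypothetical stand-ins for real search engines.
def engine_a(query):
    return ["a.com", "b.com", "c.com"]

def engine_b(query):
    return ["b.com", "d.com"]

fused = metasearch("vegetarian diet", [engine_a, engine_b])
```

A page returned by several engines accumulates score from each, so it tends to rise above a page that only one engine ranks highly.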

Performance Measures
The three important performance measures are recall,
precision, and ranking. As discussed earlier, recall is
defined as the percentage of relevant pages found and
precision as the percentage of pages found that are relevant.
A search engine calculates rank to measure relative page
relevancy, using individual page rank to order the pages
by relevance. As a way to gauge individual
search engine strengths and weaknesses, the following
describes searches that provide observations of these three
measures.
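
The definitions of recall and precision translate directly into code. The following minimal sketch assumes that the set of relevant pages is already known, which in practice is the difficult part; the function names and the sample page identifiers are invented for the example.

```python
def recall(found, relevant):
    """Percentage of the relevant pages that were found."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return 100.0 * len(set(found) & relevant) / len(relevant)

def precision(found, relevant):
    """Percentage of the found pages that are relevant."""
    found = set(found)
    if not found:
        return 0.0
    return 100.0 * len(found & set(relevant)) / len(found)

# Five pages returned; four pages are actually relevant; two overlap.
found_pages = ["p1", "p2", "p3", "p4", "p5"]
relevant_pages = ["p1", "p2", "p6", "p7"]
print(recall(found_pages, relevant_pages))     # 50.0
print(precision(found_pages, relevant_pages))  # 40.0
```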

Recall: Searching for something one knows is on a Web
site gives a rough estimate of recall. If a user’s name
appears in Web pages on his or her site, searching on the
name should return results from the site and perhaps
other sites as well. Restricting the search to the user’s
specific site, if recall is 100%, should return all pages on
the site containing the user’s name. For testing total
recall, the query “vegetarian site:www.food.com” should
return all pages with the word “vegetarian” from the
site “www.food.com.” Most search engines follow only
a limited number of page links on a single site, stopping
at some maximum number of links deep. Searching for
pages several links deep from the site main page then
measures the search engine’s recall ability. If one-half
of the relevant pages on a site are found, recall is 50%,
which indicates that the search engine spider stopped
after following some arbitrary number of links from one
page to the next.
Precision: Precision is difficult to quantify mechanically
because it measures the number of relevant pages among
those found, and relevancy is by nature subjective.
Because search engines generally list only pages that
contain matching query words, precision is arguably
always high. However, when users find thousands
of pages and only a few are relevant to their needs,
the challenge is to focus the search on more relevant
and generally fewer pages. The searcher can influence
the search precision by including or excluding query
words, limiting the search to known sites, and using
other controls discussed earlier. Searching for a known
page can test the degree of search control and
precision afforded by a search engine. Most search
engines provide sufficient control to limit the results to a
single page. For example, on some search engines the
query “vegetarian diet” would find all pages with
“vegetarian diet” anywhere in the page, whereas the query
title:“vegetarian diet” would find only pages with that
phrase in the page title. The more accurate the search
controls, the better the search precision.
Rank: Rank reflects the calculated relevancy of a
document. One key factor determining rank is the number
of query words matched in a page; generally, the
more words matched, the higher the rank. Matching
rare words also increases the rank of the page.
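
The two ranking factors just described, the number of query words matched and the rarity of those words, can be illustrated with a small scoring function. This is only a sketch: the weight log(1 + N/df), where N is the number of pages and df is the number of pages containing the word, is an assumed formula chosen for the example, not one attributed to any search engine in the text.

```python
import math
import re

def tokenize(text):
    """Return the set of lowercase words in a piece of text."""
    return set(re.findall(r"\w+", text.lower()))

def rank_pages(query, pages):
    """Order page names from most to least relevant to the query.

    pages maps a page name to its text. Each matched query word adds
    to the score, and words found on fewer pages add more.
    """
    page_words = {name: tokenize(text) for name, text in pages.items()}
    total = len(pages)
    scores = {}
    for name, words in page_words.items():
        score = 0.0
        for term in tokenize(query):
            if term in words:
                # Number of pages that contain this query word.
                df = sum(1 for w in page_words.values() if term in w)
                score += math.log(1 + total / df)  # assumed rarity weight
        scores[name] = score
    # Highest score first; ties are broken alphabetically by name.
    return sorted(scores, key=lambda name: (-scores[name], name))

# Hypothetical sample pages.
sample_pages = {
    "a": "vegetarian diet for weight loss",
    "b": "diet tips",
    "c": "diet and exercise",
}
ordering = rank_pages("vegetarian diet", sample_pages)
```

Here page “a” ranks first because it matches both query words, one of which (“vegetarian”) appears on no other page.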

Search engines can also use the link structure of the
Web, that is, the links to and from other pages, to
augment the ranking of pages. Pages with
many links from other pages generally rank higher than
pages of equal similarity to the query but with fewer links.
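
One simple way to picture this use of links is to treat the inbound link count as a tie-breaker between pages of equal similarity. The sketch below assumes that similarity scores have already been computed and that the link graph is available as a dictionary; the function names and the tie-breaking rule are illustrative assumptions rather than a description of how any actual engine combines the two signals.

```python
def count_inlinks(links):
    """Count, for each page, how many distinct other pages link to it.

    links maps a page name to the list of pages it links to.
    """
    counts = {}
    for source, targets in links.items():
        counts.setdefault(source, 0)
        for target in set(targets):
            if target != source:  # ignore links from a page to itself
                counts[target] = counts.get(target, 0) + 1
    return counts

def rank_with_links(similarity, links):
    """Order pages by similarity, then by inbound links, then by name."""
    inlinks = count_inlinks(links)
    return sorted(
        similarity,
        key=lambda page: (-similarity[page], -inlinks.get(page, 0), page),
    )

# Hypothetical similarity scores and link graph.
similarity = {"a": 0.8, "b": 0.8, "c": 0.5}
links = {"a": ["b"], "b": ["a"], "c": ["b"]}
ordering_with_links = rank_with_links(similarity, links)
```

Pages “a” and “b” are equally similar to the query, but “b” has two inbound links to one for “a,” so “b” is listed first.
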
Page popularity and importance are two common
ranking measures based on link structure. Popularity
assigns a higher rank to pages having more references