The Internet Encyclopedia (Volume 3)

(coco) #1

P1: IML


Wisman WL040/Bidgoli-Vol III-Ch-59 August 14, 2003 18:3 Char Count= 0


730 WEBSEARCHFUNDAMENTALS

Figure 3: Two pages indexed on the words “weight,”
“loss,” and “diet,” where page p1 contains the word
“weight” and p2 contains the words “loss” and “diet.”
Similarity is measured as the cosine of the angle between
a page and query vectors; the smaller angle corresponds
to greater similarity. A query of “diet” would have some
similarity to page p2 but no similarity to the “weight”
page p1.

fewer words in common produce a greater angle or
smaller similarity. Figure 3 illustrates the indexing of
pages p1 and p2 containing the words “weight” and “diet
loss” respectively, where each axis corresponds to one of
the three words. Representing the pages and query as vec-
tors in three-dimensional space, the query vector for “diet”
has a smaller angle with respect to the “diet loss” page p2
than with respect to the “weight” page p1. Intuitively, the
“diet” query is somewhat similar to the “diet loss” page
and not similar to the “weight” page. Based strictly on
similarity to the query “diet,” the search engine would re-
turn a high ranking for the p2 page and a low ranking for
the p1 page.
Web search engines can exploit the special features of
Web pages to further improve retrieval quality beyond that
possible using query and page words alone. Specific ele-
ments of a page, such as title words, can augment strict
similarity measure ranking. When a page is retrieved, the
engine generally returns the calculated ranking, desig-
nated parts of the page such as the title and a descrip-
tion of the page content, some text surrounding the word
found, and most importantly, the location of the page to
retrieve.
Page popularity and importance measures utilize the
natural links between Web pages to refine rank calculated
on word matches alone. Popularity can be defined to as-
sign a higher rank to pages having more references from
other pages, under the assumption that a frequently ref-
erenced page is more relevant than a page with fewer ref-
erences. Figure 4 illustrates the popularity of page A to
be greater than that of either B or C, as more links or
references are to A than to either B or C.
However, a popular page is not automatically impor-
tant. Importance of a page can be defined in terms of the
number of links to the page and the importance of the

Figure 4: The number of links to a page measures its
popularity. Two links confer on page A the greatest popu-
larity, B the next, and C the least popularity. The ranking
based on popularity would be ABC.

linking sites. For example, a link from an Internal Revenue
Service page to a page on taxes is intuitively more impor-
tant and adds greater rank than a link from a random in-
dividual to the same page. This global ranking scheme is
the basis of PageRank (Page, Brin, Motwani, & Winograd,
1998), used by the search engine Google.
Although a discussion of the full PageRank algorithm
is beyond the scope of this chapter, the basic concepts
(Arasu et al., 2001) are relatively clear when limited to
a Web where every page can be reached by every other
page. For that assumption, let 1, 2, 3,..., m bepages on
the Web; let N(i) be the number of links from page i; let
B(i) be the set of pages linking to page i. Then r(i), the
simple PageRank of page i, is

r(i)=


j∈B(i)

r(j)
N(j)

.

Note that dividing byN(j), the number of pages linked
from a page, reduces the raking contribution of a page as
the number of pages linked to increases. This fits one’s in-
tuition that a page with links to many, possibly loosely re-
lated pages should not contribute as much rank as a page
that links to only a few, likely more closely related pages.
Authority ranking assumes that a page referring to an-
other confers some measure of authority to the page.
Authority and hub ranking utilize page links to iden-
tify authoritative pages on a subject and hub pages that
link to many of these authoritative pages. Mutually refer-
ring pages then form authoritative groups on a subject
(Kleinberg, 1999), where the more links to a page from
others within the authority group, the higher consequent
rank of the page. Figure 5 illustrates a simplistic method
of forming groups through mutual B and C links and com-
mon BC links to A. Page D marginally contributes to the
authority of A but is not included in the ABC group as no
other group member references D.
Hub pages are defined to have extensive links to au-
thority pages and serve to bring together authorities on a
common subject to form a mutually self-referential com-
munity. In Figure 5, page D acts as a hub that connects
two separate groups into a single community. The result-
ing community is somewhat analogous to a single subject
within a list service except that humans organize sub-
ject lists and the interconnecting links organize a subject
community. Automatic grouping can produce more focu-
sed and higher quality references than strict similarity
ranking but eschew the hierarchical organization that re-
quires subject knowledge on the part of the searcher.

Figure 5: Pages ABC and EF are two separate authority
groups connected by hub page D to form a common
subject authority.
Free download pdf