6.4 Organizing by Citation 145


is the case when a relatively small community uses the same terminology as
a much larger community. For this reason, commercial search engines like
Google that are based on the Kleinberg algorithm do not implement it in its
original form.
Google, for example, uses a formula that differs from the Kleinberg
algorithm in several ways:



  1. The rank that a document passes on to each document it cites is its own
    rank divided by the total number of references made by that document.
    Thus a document with a large number of references contributes much less
    to each document it cites, while a document with a small number of
    references contributes more to each. Presumably this was done to make
    the algorithm harder to subvert.

  2. The normal eigenvector equation is modified by an additional term that
    serves to “dampen” the process of computing the rank, but which
    introduces some arbitrariness into the computed rank.

  3. The original adjacency matrix is used rather than either the authority or
    hub matrix. Thus the algorithm is measuring a form of popularity rather
    than whether the document is authoritative or a central source.
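
The three modifications above can be sketched together in a few lines of
Python. This is only an illustration of the ideas in the list, not the formula
Google actually uses. The function name damped_rank, the damping value of 0.85,
the fixed iteration count, and the treatment of documents that make no
references are all assumptions introduced for the example.

```python
def damped_rank(links, damping=0.85, iterations=50):
    """Rank documents from a citation graph.

    links maps each document to the list of documents it references,
    i.e. it is the original adjacency matrix in dictionary form (item 3).
    """
    docs = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(docs)
    rank = {d: 1.0 / n for d in docs}
    for _ in range(iterations):
        # Item 2: the damping term gives every document a fixed baseline share.
        new_rank = {d: (1.0 - damping) / n for d in docs}
        for doc in docs:
            targets = links.get(doc, [])
            if targets:
                # Item 1: a document's rank is divided by its number of references.
                share = damping * rank[doc] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a document with no references spreads its rank evenly.
                share = damping * rank[doc] / n
                for d in docs:
                    new_rank[d] += share
        rank = new_rank
    return rank


# Hypothetical four-document citation graph.
example_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
example_rank = damped_rank(example_links)
print(sorted(example_rank, key=example_rank.get, reverse=True))
```

In this small graph, document c is cited by three others and ends up with the
highest rank, while d, which nobody cites, keeps only the baseline share
supplied by the damping term. Changing the damping value changes the computed
ranks, which is the arbitrariness noted in item 2.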


Current search engines have another weakness. The original set of candi-
date documents is obtained using simple word-matching strategies that do
not incorporate any of the meaning of the words. As a simple example, try
running these two queries with Google: “spinal tap” and “spinal taps.” From
almost any point of view these two have essentially the same meaning. Yet,
the documents displayed by Google have completely different rankings in
these two cases. Among the first ten documents of each query there is only
one document in common. Although the spinal tap query is problematic
because there is a popular movie by that name, one can easily create many
more such examples by just varying the inflection of the words in the query
or by substituting synonymous words or phrases.
One obvious way to deal with this shortcoming of Google would be to in-
dex using concepts rather than character strings. This leads to the possibility
of search based on the meaning of the documents. Many search engines, in-
cluding Google, are starting to incorporate semantics in their algorithms. We
discuss this in section 6.6.
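
The difference between indexing by character strings and indexing by concepts
can be illustrated with a small sketch. The phrase-to-concept table, the
concept identifier, and the three documents below are hypothetical and exist
only for this example. A real system would draw its concepts from an ontology
or thesaurus rather than from a hand-written dictionary.

```python
from collections import defaultdict

# Hypothetical phrase-to-concept table (an assumption for illustration only).
CONCEPTS = {
    "spinal tap": "lumbar_puncture",
    "spinal taps": "lumbar_puncture",
    "lumbar puncture": "lumbar_puncture",
}


def string_key(phrase):
    """Index key for plain string matching: the phrase itself, lowercased."""
    return phrase.lower()


def concept_key(phrase):
    """Index key for concept matching; unknown phrases fall back to the string."""
    text = phrase.lower()
    return CONCEPTS.get(text, text)


def build_index(documents, key):
    """Map each key to the set of documents containing a phrase with that key."""
    index = defaultdict(set)
    for doc_id, phrases in documents.items():
        for phrase in phrases:
            index[key(phrase)].add(doc_id)
    return index


def search(index, query, key):
    """Return the sorted ids of documents indexed under the query's key."""
    return sorted(index.get(key(query), set()))


# Hypothetical documents, each reduced to the phrases it contains.
documents = {
    "doc1": ["spinal tap"],
    "doc2": ["spinal taps"],
    "doc3": ["lumbar puncture"],
}

by_string = build_index(documents, string_key)
by_concept = build_index(documents, concept_key)

print(search(by_string, "spinal taps", string_key))    # ['doc2']
print(search(by_concept, "spinal taps", concept_key))  # ['doc1', 'doc2', 'doc3']
```

With the string index, the singular and plural queries retrieve disjoint sets
of documents. With the concept index, both queries, as well as the synonymous
phrase, retrieve the same set.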
