138 6 Information Retrieval


length and has the same direction as v. This vector is obtained by dividing v
by its length: $\frac{v}{|v|}$. Thus the cosine of the angle between vectors v
and w is given by $\frac{v}{|v|} \cdot \frac{w}{|w|}$. The length $|v|$ of a
vector is also called its norm, hence $\frac{v}{|v|}$ is called the
normalization of v. Some systems normalize the vectors of documents so that all
documents have the same “size” with respect to information retrieval, and so
that the dot product is the cosine of the angle between vectors. Normalization
does not have a probabilistic interpretation, so it is not appropriate for
information retrieval using a query. However, it is useful when documents
are compared with one another. In this case, the cosine of the angle between
the document vectors is a measure of similarity that varies between 0 and 1.
A value of 0 means that the documents are unrelated, while a value of 1 means
that the documents use the same terms with the same relative frequencies. One
can use similarity functions such as the cosine as a means of classifying
documents by looking for clusters of documents that are near one another. All
of the clustering algorithms mentioned in section 1.5 can be used to cluster
documents either hierarchically or by using some other organizing principle.
Clustering techniques based on similarity functions are still in use, but they
have been superseded to some extent by citation-based techniques, to be
discussed in section 6.4.
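
The cosine measure described above can be illustrated with a minimal Python
sketch. The function names and the three sample term-frequency vectors are
assumptions made for illustration; they do not come from the text. The sketch
assumes both vectors are nonzero and are indexed by the same terms.

```python
import math

def norm(v):
    """Length (norm) of a vector given as a list of numbers."""
    return math.sqrt(sum(x * x for x in v))

def normalize(v):
    """Return v divided by its length; assumes v is not the zero vector."""
    length = norm(v)
    return [x / length for x in v]

def cosine(v, w):
    """Cosine of the angle between v and w: dot product of the normalizations."""
    return sum(a * b for a, b in zip(normalize(v), normalize(w)))

# Hypothetical term-frequency vectors over the same three terms.
doc_a = [2, 1, 0]
doc_b = [4, 2, 0]  # same terms as doc_a, same relative frequencies
doc_c = [0, 0, 3]  # shares no terms with doc_a

print(cosine(doc_a, doc_b))  # close to 1: same relative frequencies
print(cosine(doc_a, doc_c))  # 0: no terms in common
```

Because doc_b is a multiple of doc_a, the two normalizations coincide and the
similarity is 1 up to rounding, even though the raw counts differ.
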
In spite of the logical elegance of the vector space model, it has several
deficiencies.

  1. In many languages, words are composed of letters which can be in more
    than one “case.” In English, letters can be upper- or lower-case.
    Computers actually deal with characters, not letters, so the upper-case
    variant differs from the lower-case variant. To deal with this ambiguity,
    most search techniques ignore case distinctions when comparing words.
    Unfortunately, case distinctions are sometimes important. For example,
    acronyms are usually written using upper-case letters to prevent
    confusion with the ordinary word. Thus “COLD” (which is the acronym for
    chronic obstructive lung disease) can be distinguished from “cold” (which
    has several meanings) by the use of upper-case letters.

  2. Many languages, including English, also vary the form of a word for
    grammatical purposes. This is known as inflection. For example, English
    words can be singular or plural. While “normetanephrines” occurs in only
    five PubMed citations, the singular form “normetanephrine” occurs in 1207
    citations. Although the singular and plural forms have different
    meanings, such distinctions are rarely important during a search.
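
The effect of ignoring case and inflection when comparing words can be sketched
in Python as follows. The helper names and the single plural rule are
hypothetical simplifications chosen for illustration; practical systems use
much richer stemming rules than dropping a trailing "s".

```python
def fold_case(term):
    """Ignore case distinctions by mapping a term to lower case."""
    return term.lower()

def strip_plural(term):
    """Naive, hypothetical rule: drop one trailing 's' (but not 'ss').

    This only illustrates the idea; it is not a real stemmer.
    """
    if len(term) > 3 and term.endswith("s") and not term.endswith("ss"):
        return term[:-1]
    return term

def index_key(term):
    """Key under which a term would be matched in this sketch."""
    return strip_plural(fold_case(term))

# Inflection: the plural and singular forms map to the same key.
print(index_key("normetanephrines") == index_key("normetanephrine"))  # True

# Case folding: the acronym and the ordinary word become indistinguishable.
print(index_key("COLD") == index_key("cold"))  # True
```

The second comparison shows the cost noted in the first item: once case is
folded, a search can no longer tell the acronym “COLD” from the word “cold.”
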
