length and has the same direction as v. This vector is obtained by dividing v
by its length: $v/|v|$. Thus the cosine of the angle between vectors v and w is
given by $\frac{v}{|v|} \cdot \frac{w}{|w|}$. The length $|v|$ of a vector is
also called its norm, hence $v/|v|$ is called the normalization of v. Some
systems normalize the vectors of documents so that all
documents have the same “size” with respect to information retrieval, and so
that the dot product is the cosine of the angle between vectors. Normaliza-
tion does not have a probabilistic interpretation, so it is not appropriate for
information retrieval using a query. However, it is useful when documents
are compared with one another. In this case, the cosine of the angle between
the document vectors is a measure of similarity that varies between 0 and
1. A value of 0 means that the documents are unrelated, while a value of
1 means that the documents use the same terms with the same relative fre-
quencies. One can use similarity functions such as the cosine as a means of
classifying documents by looking for clusters of documents that are near one
another. All of the clustering algorithms mentioned in section 1.5 can be used
to cluster documents either hierarchically or by using some other organizing
principle. Clustering techniques based on similarity functions are still in use,
but they have been superseded to some extent by citation-based techniques,
to be discussed in section 6.4.
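As a minimal sketch of the normalization and cosine similarity described above, the following Python fragment treats each document as a dictionary of term counts. The function names (`norm`, `normalize`, `dot`, `cosine`) and the toy documents are assumptions made for illustration; they are not taken from any particular retrieval system.

```python
import math
from collections import Counter

def norm(v):
    """Length |v| of a sparse term vector (dict of term -> weight)."""
    return math.sqrt(sum(x * x for x in v.values()))

def normalize(v):
    """Return v / |v|; the zero vector is returned unchanged."""
    n = norm(v)
    return dict(v) if n == 0 else {t: x / n for t, x in v.items()}

def dot(v, w):
    """Dot product; only terms present in both vectors contribute."""
    if len(v) > len(w):
        v, w = w, v  # iterate over the shorter vector
    return sum(x * w.get(t, 0.0) for t, x in v.items())

def cosine(v, w):
    """Cosine of the angle between v and w: (v/|v|) . (w/|w|)."""
    return dot(normalize(v), normalize(w))

# Hypothetical toy documents, represented by raw term counts.
d1 = Counter("gene expression in yeast gene".split())
d2 = Counter("yeast gene gene in expression".split())  # same terms, same counts
d3 = Counter("protein folding kinetics".split())       # shares no terms with d1
```

With these definitions, `cosine(d1, d2)` is 1 up to floating-point rounding, and `cosine(d1, d3)` is 0 because the two documents share no terms. Since term counts are never negative, the value cannot fall below 0.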
In spite of the logical elegance of the vector space model, it has several
deficiencies.
- In many languages, words are composed of letters which can be in more
than one “case.” In English, letters can be upper- or lower-case. Computers
actually deal with characters, not letters, so the upper-case variant
differs from the lower-case variant. To deal with this ambiguity, most
search techniques ignore case distinctions when comparing words.
Unfortunately, case distinctions are sometimes important. For example,
acronyms are usually written using upper-case letters to prevent confu-
sion with the ordinary word. Thus “COLD” (which is the acronym for
chronic obstructive lung disease) can be distinguished from “cold” (which
has several meanings) by the use of upper-case letters.
- Many languages, including English, also vary the form of a word for
grammatical purposes. This is known as inflection. In English, words can
be singular or plural. For example, while “normetanephrines”
only occurs in five PubMed citations, the singular form “normetanephrine”
occurs in 1207 citations. Although the singular and plural forms have dif-
ferent meanings, such distinctions are rarely important during a search.
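The two deficiencies discussed so far, case and inflection, are commonly handled by normalizing terms before they are compared. The following Python sketch illustrates the idea under stated assumptions: the acronym set is a hypothetical, hand-maintained list, and `strip_plural` is a deliberately crude rule that only removes a trailing "s". It is not a real stemmer and will mangle words such as "this".

```python
def fold_case(token, acronyms=frozenset()):
    """Lower-case a token unless it is a known upper-case acronym."""
    if token in acronyms:
        return token
    return token.lower()

def strip_plural(token):
    """Naive singularization: drop one trailing 's' from longer words."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def normalize_terms(text, acronyms=frozenset()):
    """Split text on whitespace and normalize case and plural forms."""
    terms = []
    for raw in text.split():
        token = raw.strip(".,;:()\"'")  # trim surrounding punctuation
        if not token:
            continue
        token = fold_case(token, acronyms)
        if token not in acronyms:  # leave protected acronyms untouched
            token = strip_plural(token)
        terms.append(token)
    return terms

# "COLD" is kept distinct from "cold" only when listed as an acronym.
example = normalize_terms("Normetanephrines and COLD cold.", acronyms={"COLD"})
```

Here `example` is `["normetanephrine", "and", "COLD", "cold"]`. Without the acronym set, "COLD" and "cold" collapse into the same term, which is exactly the loss of information described above.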