132 6 Information Retrieval
that contain the word. Note that “of” occurs in more documents than “the,”
although the latter occurs more often.
One can deal with the varying selectivity of words in several ways. One
could ignore the most commonly occurring words. The list of ignored words
is called the “stop word list.” One can also weight the matches so that more
commonly occurring words have a smaller effect on the choice of documents
to be returned. When this technique is used, the documents are ranked by
how well they match the query. Many algorithms have
been proposed for how one should rank the selected documents, but the one
that has been the most effective is vector space retrieval, also called the vector
space model. This method was pioneered by Salton (Salton et al. 1983; Salton 1989).
In this model, each document and query is represented by a vector in a very
high-dimensional vector space. The components of the vector (i.e., the axes
or dimensions of the vector space) correspond to the words that can occur in a
document or query and that can be used for searching. Such words are called
terms. Terms normally do not include stop words, and one commonly maps
synonymous words (such as words that differ only by upper- or lower-case
distinctions) to the same term.
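The mapping from raw text to terms can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name extract_terms, the tiny stop word list, and the tokenization rule (runs of letters and digits) are hypothetical choices, not part of the vector space model itself.

```python
import re

# Hypothetical, deliberately tiny stop word list; real lists are much longer.
STOP_WORDS = {"the", "of", "a", "an", "and", "in", "to", "is"}

def extract_terms(text, stop_words=STOP_WORDS):
    """Return the terms of a text: lower-cased words with stop words removed."""
    # Lower-casing maps words that differ only by case to the same term.
    words = re.findall(r"[a-z0-9]+", text.lower())
    # Stop words are dropped because they are not used for searching.
    return [w for w in words if w not in stop_words]

print(extract_terms("The Role of the Human Genome in Disease"))
# ['role', 'human', 'genome', 'disease']
```

Each distinct term that survives this step corresponds to one dimension of the vector space.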
The vector of a document or query will be very sparse: nearly all entries
will be zero for a particular document or query. The entry for a particular
term in the vector is a number called the term weight. Term weights can be
based on many criteria, but the two most important are the following (Salton
and McGill 1986):
- Term Frequency. The number of times that the term occurs in a doc-
ument. The assumption is that if a term occurs more frequently in the
document, then it must be more important for that document.
- Document Frequency. The number of documents that make use of the
term. When a term occurs in more documents, then it is less impor-
tant for the purposes of information retrieval. One makes this assump-
tion because terms that occur in more documents are less selective and
therefore less useful for distinguishing the relevant documents. For ex-
ample, “human” occurs in over 8.65 million PubMed documents, while
“normetanephrines” only occurs in five documents. Thus “human” is
much less selective than “normetanephrines.”
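The two criteria above can be computed directly from documents that have already been reduced to terms. The following Python sketch assumes a hypothetical toy corpus and hypothetical function names; the counts it prints describe that toy corpus only, not PubMed. A dictionary that stores only the terms actually present is one simple way to hold a sparse vector.

```python
from collections import Counter

def term_frequencies(terms):
    """Sparse vector of raw counts: term -> occurrences in one document."""
    return Counter(terms)

def document_frequencies(documents):
    """Map each term to the number of documents containing it at least once."""
    df = Counter()
    for terms in documents:
        df.update(set(terms))  # set() so a document counts once per term
    return df

# Hypothetical toy corpus, already reduced to terms.
corpus = [
    ["human", "gene", "human", "disease"],
    ["human", "protein"],
    ["normetanephrines", "human", "urine"],
]

tf = term_frequencies(corpus[0])
df = document_frequencies(corpus)
print(tf["human"], df["human"], df["normetanephrines"])  # 2 3 1
```

In this toy corpus "human" appears in every document and so does little to distinguish one document from another, whereas "normetanephrines" appears in only one.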
The term weight assigned to a term in a document should combine the term
frequency with the document frequency. The most common way to do this