
132 6 Information Retrieval

that contain the word. Note that “of” occurs in more documents than “the,” although the latter occurs more often. One can deal with the varying selectivity of words in several ways. One could ignore the most commonly occurring words. The list of ignored words is called the “stop word list.” One can also weight the matches so that more commonly occurring words have a smaller effect on the choice of documents to be returned. When this technique is used, the documents are arranged in order by how well they match the query.

Many algorithms have been proposed for ranking the selected documents, but the one that has been the most effective is vector space retrieval, also called the vector space model. This method was pioneered by Salton (Salton et al. 1983; Salton 1989). In this model, each document and each query is represented by a vector in a very high-dimensional vector space. Each component of the vector (i.e., each axis or dimension of the vector space) corresponds to a word that can occur in a document or query and that can be used for searching. Such words are called terms. Terms normally do not include stop words, and one commonly maps equivalent words (such as words that differ only by upper- or lower-case distinctions) to the same term. The vector of a document or query will be very sparse: nearly all entries will be zero for a particular document or query.

The entry for a particular term in the vector is a number called the term weight. Term weights can be based on many criteria, but the two most important are the following (Salton and McGill 1986):

Term Frequency. The number of times that the term occurs in a document. The assumption is that if a term occurs more frequently in the document, then it must be more important for that document.

Document Frequency. The number of documents that make use of the term. When a term occurs in more documents, it is less important for the purposes of information retrieval. One makes this assumption because terms that occur in more documents are less selective and therefore less useful for distinguishing the relevant documents. For example, “human” occurs in over 8.65 million PubMed documents, while “normetanephrines” occurs in only five documents. Thus “human” is much less selective than “normetanephrines.”
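
The two counts above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not part of the original method: the stop word list, the three-document toy corpus, the whitespace tokenizer, and the helper names `terms`, `term_frequencies`, and `document_frequencies` are all hypothetical. Each document is stored as a sparse vector that holds only the terms that actually occur in it.

```python
from collections import Counter

# Hypothetical stop word list, chosen only for this illustration.
STOP_WORDS = {"the", "of", "a", "in", "is", "and"}


def terms(text):
    """Split on whitespace, strip punctuation, fold case, drop stop words."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]


def term_frequencies(text):
    """Sparse document vector: term -> number of occurrences in the document."""
    return Counter(terms(text))


def document_frequencies(docs):
    """Term -> number of documents in which the term occurs at least once."""
    df = Counter()
    for doc in docs:
        df.update(set(terms(doc)))  # set(): count each document only once
    return df


# Toy corpus (invented for this sketch, not real PubMed data).
docs = [
    "The human genome is the genome of a human.",
    "Normetanephrines in human plasma.",
    "Sequencing of the mouse genome.",
]

tf = [term_frequencies(d) for d in docs]
df = document_frequencies(docs)

print(tf[0]["human"])          # term frequency of "human" in the first document
print(df["human"])             # document frequency of "human"
print(df["normetanephrines"])  # document frequency of a more selective term
```

A `Counter` returns zero for any term it has never seen, which matches the sparse representation: terms absent from a document need no stored entry.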

The term weight to be assigned to a document should combine the term frequency with the document frequency. The most common way to do this
