untitled

(ff) #1

136 6 Information Retrieval


If there are N documents in the entire collection, and if M of them contain a
particular term T, then Pr(T) = M/N. The number M is the called thedocument
frequencyof the term. Since N is a constant for all documents, queries, and
terms, it is not needed for determining the ranking of documents. The ratio
1/M is theinverse document frequency(IDF). Most term-weighting techniques
make use of the IDF or some variation of it.
To see why IDF is important, consider a term such as “human,” which
occurs in 66.2% of the documents in Medline. Knowing that a citation con-
tains this word gives one less than 1 bit of information. By comparison, the
term “normetanephrines” only occurs in five Medline citations, so know-
ing that a citation contains this term gives one much more information. To
be precise, since there are over 12 million Medline citations, this term gives
approximately 21 bits of information.
Returning to the computation of Pr(D|Relevant)/Pr(D), we find that it is
the product of 1/Pr(T) for all terms that occur both in D and in the query Q.
Since the logarithm preserves order and converts products to sums, it is con-
venient to take the logarithm of this product. This implies that one should
order documents by the sum of log(1/Pr(T)) = -log(Pr(T)), for all terms that
occur both in each document and the query. This representation has the fur-
ther advantage that the terms in the sum can be interpreted as measurements
of information. It also explains why it makes sense to interpret documents
and queries as high-dimensional vectors. The vectors have one entry for each
possible term. Define the vector for a document D to have log(1/Pr(T)) as the
entry for the term T, whenever T is in D, and to have 0 if T is not in D. De-
fine the vector for a query Q to have 1 as the entry for the term T, whenever
T is in Q, and to have 0 if T is not in Q. In general, if v=(v 1 ,v 2 ,...,vn)and
w=(w 1 ,w 2 ,...,wn) are two vectors, then thedot product(also called theinner
product) of v and w is given by this sum:

v·w=

∑n

1

viwi

Consequently, the logarithm of the ratio Pr(D|Relevant)/Pr(D) is the dot prod-
uct of the vector for D and the vector for Q.
When the same term occurs more than once in a document, there is a ques-
tion of how to incorporate this in the term-weighting scheme. One way is to
treat each occurrence as being independent, so that one should add the IDF
each time. The number of occurrences of a single term in one document is
called itsterm frequency. The resulting term-weighting scheme is called the
Free download pdf