can then examine the documents in this order until it is found that the
documents are no longer relevant. In other words, the ratio c_1/c_2 is implicitly
determined by the researcher during examination of the document list.
The conditional probabilities Pr(Relevant|D) and Pr(Irrelevant|D) can be
“reversed” by applying Bayes’ law. Thus
$$\Pr(\text{Relevant} \mid D) = \frac{\Pr(D \mid \text{Relevant})\,\Pr(\text{Relevant})}{\Pr(D)}$$
and similarly for the probability of irrelevance. In the ratio of these two, the
term Pr(D) cancels, and we obtain the following expression:
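$$\frac{\Pr(\text{Relevant} \mid D)}{\Pr(\text{Irrelevant} \mid D)} = \frac{\Pr(D \mid \text{Relevant})}{\Pr(D \mid \text{Irrelevant})} \cdot \frac{\Pr(\text{Relevant})}{\Pr(\text{Irrelevant})}$$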
The last factor in the equation above is a ratio that depends only on the
query Q, not on the document D. Consequently, arranging the documents in
descending order by the ratio Pr(D|Relevant)/Pr(D|Irrelevant) will produce
exactly the same order as using the ratio Pr(Relevant|D)/Pr(Irrelevant|D).
This is fortunate because the probabilities in the former ratio are much easier
to compute.
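
The order-preservation argument can be checked numerically. The following is a minimal sketch, not the book's implementation: the document ids, the likelihood ratios, and the prior odds are all made-up illustrative values, and the helper names `posterior_odds` and `rank` are hypothetical.

```python
def posterior_odds(likelihood_ratio, prior_odds):
    """Pr(Relevant|D)/Pr(Irrelevant|D), using the odds form of Bayes' law."""
    return likelihood_ratio * prior_odds


def rank(scores):
    """Return document ids sorted by descending score."""
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical values of Pr(D|Relevant)/Pr(D|Irrelevant) for three documents.
likelihood_ratios = {"d1": 4.0, "d2": 0.5, "d3": 12.0}

# Hypothetical prior odds Pr(Relevant)/Pr(Irrelevant). This depends only on
# the query, so it is the same positive constant for every document.
prior_odds = 0.001

odds = {d: posterior_odds(r, prior_odds) for d, r in likelihood_ratios.items()}

by_likelihood = rank(likelihood_ratios)
by_posterior = rank(odds)
print(by_likelihood)
print(by_posterior)
```

Because every score is multiplied by the same positive constant, the two rankings coincide whatever value the prior odds take.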
To estimate the ratio Pr(D|Relevant)/Pr(D|Irrelevant), first consider the
denominator. In a large corpus such as the web, with billions of pages, or
Medline with over 12 million citations, one will rarely be interested in more
than a very small fraction of all documents. Thus nearly all documents will
be irrelevant. As a result, it is reasonable to assume that Pr(D|Irrelevant) is
approximately the same as Pr(D).
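
One way to see why this assumption is reasonable is to expand Pr(D) with the law of total probability. The derivation below is a sketch added for illustration; it uses only the quantities already defined above, together with Pr(Irrelevant) = 1 - Pr(Relevant).

```latex
\begin{align*}
\Pr(D) &= \Pr(D \mid \text{Relevant})\,\Pr(\text{Relevant})
        + \Pr(D \mid \text{Irrelevant})\,\Pr(\text{Irrelevant}) \\
       &= \Pr(D \mid \text{Irrelevant})
        + \Pr(\text{Relevant})\,\bigl[\Pr(D \mid \text{Relevant})
        - \Pr(D \mid \text{Irrelevant})\bigr]
\end{align*}
```

The bracketed difference lies between -1 and 1, so the correction term is at most Pr(Relevant) in absolute value and shrinks as the fraction of relevant documents goes to zero. The approximation is best when Pr(Relevant) Pr(D|Relevant) is small compared with Pr(D|Irrelevant).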
To estimate Pr(D|Relevant)/Pr(D) it is common to assume that the documents
and queries can be decomposed into statistically independent terms.
We will discuss how to deal with statistical dependencies later. Statistical
independence implies that Pr(D|Relevant) is the product of Pr(T|Relevant) for
all terms T in the document D, and Pr(D) is the product of the unconditional
probabilities Pr(T). Because queries can also be decomposed into independent
terms, there are two possibilities for a term T in a document D: it is
either part of the query Q or it is not. If T is in the query Q, then by
definition the term T is relevant, so Pr(T|Relevant) = 1. If T is not in the
query Q, then the occurrence of T is independent of any relevance determination,
so Pr(T|Relevant) = Pr(T). The ratio Pr(D|Relevant)/Pr(D) is then the product of
two kinds of factors: 1/Pr(T) when T is in the query Q, and Pr(T)/Pr(T) = 1 when
T is not in the query Q. So all that matters are the terms in D that are also in
Q.
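
The resulting score can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than a definitive implementation: the toy corpus and query are invented, the function names are hypothetical, Pr(T) is estimated as the fraction of documents containing T (one possible estimator, not one prescribed by the text), and each distinct term is counted once per document. The score is computed as a logarithm, so the product of the factors 1/Pr(T) becomes a sum of -log Pr(T).

```python
import math
from collections import Counter


def term_probabilities(corpus):
    """Estimate Pr(T) as the fraction of documents that contain term T.

    This document-frequency estimator is an assumption made for the sketch.
    """
    n = len(corpus)
    doc_freq = Counter(t for text in corpus.values() for t in set(text.split()))
    return {term: df / n for term, df in doc_freq.items()}


def log_score(document, query, pr):
    """log of Pr(D|Relevant)/Pr(D).

    Terms outside the query contribute a factor of 1 (log 0), so only the
    distinct terms shared by the document and the query are summed.
    """
    shared = set(document.split()) & set(query.split())
    return sum(-math.log(pr[t]) for t in shared)


# Invented toy corpus and query, for illustration only.
corpus = {
    "d1": "protein binding site",
    "d2": "protein folding",
    "d3": "gene expression",
    "d4": "protein expression",
}
query = "protein binding"

pr = term_probabilities(corpus)
scores = {d: log_score(text, query, pr) for d, text in corpus.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking[0], scores)
```

Rare query terms have small Pr(T) and therefore contribute more to the score than common ones, which resembles the inverse document frequency weighting used in term-weighting schemes.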