134 6 Information Retrieval


is based on a probabilistic cost/benefit approach. The two cost factors asso-
ciated with information retrieval are:


  1. The loss associated with the retrieval of an irrelevant document. Such an
    error is called a Type I error, or a false positive. Let c_1 denote the cost of
    this kind of error.

  2. The loss associated with failing to retrieve a relevant document. Such an
    error is called a Type II error, or a false negative. Let c_2 denote the cost of
    this kind of error.


The cost factors are shown diagrammatically in figure 6.1. This same dia-
gram applies to any situation in which a statistical decision must be made.

Figure 6.1 Types of errors that can occur during document retrieval.

Retrieval begins by specifying a query Q. Documents are either relevant
to the query Q or they are irrelevant to the query. The probability of rel-
evance is Pr(Relevant) and the probability of irrelevance is Pr(Irrelevant) =
1 - Pr(Relevant). If one is considering a particular document D, then the
probability of relevance is the conditional probability Pr(Relevant|D), and
the probability of irrelevance is Pr(Irrelevant|D) = 1 - Pr(Relevant|D). The
cost of retrieving this document is c_1 Pr(Irrelevant|D) and the cost of not
retrieving it is c_2 Pr(Relevant|D). The ideal strategy is to retrieve the document
when the cost of retrieval is less than the cost of nonretrieval, that is, when
c_1 Pr(Irrelevant|D) < c_2 Pr(Relevant|D). Provided c_2 > 0 and
Pr(Irrelevant|D) > 0, this is equivalent to

Pr(Relevant|D) / Pr(Irrelevant|D) > c_1 / c_2.
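
The decision rule above can be illustrated with a minimal Python sketch. The function names and the numeric costs and probabilities are hypothetical, chosen only for illustration; the text does not prescribe an implementation, and in a real system Pr(Relevant|D) would have to be estimated. The sketch compares the two costs directly rather than forming the ratio, so that a document with Pr(Irrelevant|D) = 0 does not cause a division by zero.

```python
def retrieval_costs(p_relevant, c1, c2):
    """Return (cost of retrieving, cost of not retrieving) for one document.

    p_relevant stands for Pr(Relevant|D); c1 is the cost of a false positive
    and c2 the cost of a false negative. All three are assumed to be given.
    """
    if not 0.0 <= p_relevant <= 1.0:
        raise ValueError("p_relevant must lie in [0, 1]")
    cost_retrieve = c1 * (1.0 - p_relevant)  # c1 * Pr(Irrelevant|D)
    cost_skip = c2 * p_relevant              # c2 * Pr(Relevant|D)
    return cost_retrieve, cost_skip


def should_retrieve(p_relevant, c1, c2):
    """Retrieve only when retrieving costs strictly less than not retrieving."""
    cost_retrieve, cost_skip = retrieval_costs(p_relevant, c1, c2)
    return cost_retrieve < cost_skip


if __name__ == "__main__":
    # Hypothetical costs: missing a relevant document is three times as
    # costly as retrieving an irrelevant one, so the threshold c1/c2 is 1/3.
    c1, c2 = 1.0, 3.0
    for p in (0.2, 0.25, 0.5):
        print(p, retrieval_costs(p, c1, c2), should_retrieve(p, c1, c2))
```

With these assumed costs, a document is retrieved only when its odds of relevance exceed 1/3, that is, when Pr(Relevant|D) is greater than 0.25.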

In practice, one does not explicitly specify either c_1 or c_2 or even their
ratio. Rather, one attempts to arrange the documents in descending order by
the ratio Pr(Relevant|D)/Pr(Irrelevant|D). The person requesting the query