17.5 Bipartite Ranking and Multivariate Performance Measures

F_β = (1 + β^2)a / ((1 + β^2)a + b + β^2 c).

Again, we set θ = 0, and the loss function becomes ∆(y′, y) = 1 − F_β.
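To make the definitions concrete, the following minimal sketch (in Python with NumPy; the helper name and the convention that sign(0) = +1 are our own choices, not the book's) computes the loss ∆(y′, y) = 1 − F_β from the counts a, b, c:

```python
import numpy as np

def f_beta_loss(y_true, y_scores, beta=1.0, theta=0.0):
    """Compute 1 - F_beta for labels y_true in {+1, -1}^r and real scores.

    Illustrative helper (not from the book); a score exactly equal to
    theta is treated as a positive prediction.
    """
    b_vec = np.where(y_scores - theta >= 0, 1, -1)  # thresholded prediction
    a = np.sum((y_true == 1) & (b_vec == 1))    # true positives
    b = np.sum((y_true == -1) & (b_vec == 1))   # false positives
    c = np.sum((y_true == 1) & (b_vec == -1))   # false negatives
    denom = (1 + beta**2) * a + b + beta**2 * c
    f_beta = (1 + beta**2) * a / denom if denom > 0 else 0.0
    return 1.0 - f_beta
```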



  • Recall at k: We measure the recall while the prediction must contain at most
    k positive labels. That is, we should set θ so that a + b ≤ k. This is
    convenient, for example, in the application of a fraud detection system,
    where a bank employee can only handle a small number of suspicious
    transactions.

  • Precision at k: We measure the precision while the prediction must contain
    at least k positive labels. That is, we should set θ so that a + b ≥ k.
    (A sketch of how to choose θ for both measures follows this list.)
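Both constraints amount to thresholding at the k-th largest score. Here is a minimal sketch (our own illustration; the name is hypothetical) that produces a prediction with exactly k positive labels, which satisfies a + b ≤ k for recall at k and a + b ≥ k for precision at k:

```python
import numpy as np

def top_k_prediction(y_scores, k):
    """Return b in {+1, -1}^r with exactly k positives: the k largest
    scores are predicted positive (ties broken arbitrarily). Equivalent
    to choosing theta between the k-th and (k+1)-th largest scores when
    these are distinct. Illustrative helper, not from the book.
    """
    order = np.argsort(y_scores)[::-1]   # indices by descending score
    b = -np.ones_like(y_scores)
    b[order[:k]] = 1
    return b
```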


The measures defined previously are often referred to as multivariate performance
measures. Note that these measures are very different from the average zero-one
loss, which in the preceding notation equals (b + c)/(a + b + c + d). In the
aforementioned example of fraud detection, when 99.9% of the examples are
negatively labeled, the zero-one loss of predicting that all the examples are
negative is 0.1%. In contrast, the recall of such a prediction is 0, and hence
the F_1 score is also 0, which means that the corresponding loss will be 1.
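The arithmetic of this example can be checked directly with a small self-contained snippet (our own illustration):

```python
import numpy as np

# Fraud-detection illustration: 999 negatives, 1 positive,
# and a predictor that declares every example negative.
y_true = np.array([-1] * 999 + [1])
b_vec = -np.ones(1000, dtype=int)            # all predictions negative

zero_one = np.mean(b_vec != y_true)          # (b + c)/(a + b + c + d)
a = np.sum((y_true == 1) & (b_vec == 1))     # true positives: 0
recall = 0.0 if a == 0 else a / np.sum(y_true == 1)

print(zero_one)   # 0.001, i.e. 0.1%
print(recall)     # 0.0, hence F_1 = 0 and the loss is 1
```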

17.5.1 Linear Predictors for Bipartite Ranking


We next describe how to train linear predictors for bipartite ranking. As in the
previous section, a linear predictor for ranking is defined to be

h_w(x̄) = (〈w, x_1〉, ..., 〈w, x_r〉).

The corresponding loss function is one of the multivariate performance measures
described before. The loss function depends on y′ = h_w(x̄) via the binary vector
it induces, which we denote by

b(y′) = (sign(y′_1 − θ), ..., sign(y′_r − θ)) ∈ {±1}^r.        (17.12)
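As a sketch, the predictor and the induced binary vector can be computed as follows, assuming the r instances in x̄ are stacked as the rows of a matrix X (the names and the sign(0) = +1 convention are our own):

```python
import numpy as np

def h_w(w, X):
    """Linear ranking predictor: the score vector (<w, x_1>, ..., <w, x_r>)
    for the r instances stacked as the rows of X."""
    return X @ w

def induced_binary_vector(y_prime, theta=0.0):
    """The binary vector b(y') of Equation (17.12); we treat sign(0)
    as +1, a convention the text leaves implicit."""
    return np.where(y_prime - theta >= 0, 1, -1)
```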

As in the previous section, to facilitate an efficient algorithm we derive a convex
surrogate loss function on ∆. The derivation is similar to the derivation of the
generalized hinge loss for the NDCG ranking loss, as described in the previous
section.
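For orientation, such derivations typically follow the structured-hinge template: assuming the vector y of true labels belongs to V, Equation (17.13) below gives ∑_i b(y′)_i y′_i ≥ ∑_i y_i y′_i, and therefore

∆(b(y′), y) ≤ max_{v∈V} ( ∆(v, y) + ∑_{i=1}^r (v_i − y_i)〈w, x_i〉 ),

which is convex in w as a maximum of functions that are affine in w. This is our paraphrase of the standard template, not the book's numbered equation.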
Our first observation is that for all the values of θ defined before, there is some
V ⊆ {±1}^r such that b(y′) can be rewritten as

b(y′) = argmax_{v∈V} ∑_{i=1}^r v_i y′_i.        (17.13)

This is clearly true for the case θ = 0 if we choose V = {±1}^r. The two measures
for which θ is not taken to be 0 are precision at k and recall at k. For precision
at k we can take V to be the set V_{≥k}, containing all vectors in {±1}^r whose
number of ones is at least k. For recall at k, we can take V to be V_{≤k}, which
is defined analogously. See Exercise 5.
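The argmax in Equation (17.13) can be computed directly for each of these choices of V. The sketch below (our own illustration) starts from the unconstrained maximizer v_i = sign(y′_i) and, when the constraint on the number of ones is violated, greedily flips the coordinates whose flips cost the least; this is optimal because flipping v_i changes the objective by exactly 2|y′_i|:

```python
import numpy as np

def argmax_over_V(y_scores, k=None, mode="all"):
    """Compute argmax_{v in V} sum_i v_i * y'_i, as in Equation (17.13).

    mode="all": V = {+1, -1}^r (the theta = 0 measures)
    mode="geq": V = V_{>=k}, at least k ones (precision at k)
    mode="leq": V = V_{<=k}, at most k ones (recall at k)
    Illustrative helper, not from the book; sign(0) is taken to be +1.
    """
    v = np.where(y_scores >= 0, 1, -1)  # coordinatewise unconstrained maximizer
    if mode == "all" or k is None:
        return v
    num_ones = int(np.sum(v == 1))
    if mode == "geq" and num_ones < k:
        # Flip the highest-scoring negatives: each flip changes the
        # objective by 2 * y'_i, so the least negative scores cost least.
        neg = np.where(v == -1)[0]
        flip = neg[np.argsort(y_scores[neg])[::-1][: k - num_ones]]
        v[flip] = 1
    elif mode == "leq" and num_ones > k:
        # Flip the lowest-scoring positives: each flip changes the
        # objective by -2 * y'_i, so the smallest scores cost least.
        pos = np.where(v == 1)[0]
        flip = pos[np.argsort(y_scores[pos])[: num_ones - k]]
        v[flip] = -1
    return v
```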