17.5 Bipartite Ranking and Multivariate Performance Measures

F_β = (1 + β^2)a / ((1 + β^2)a + b + β^2 c).

Again, we set θ = 0, and the loss function becomes ∆(y′, y) = 1 − F_β.
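To make the definitions concrete, the following minimal sketch (in Python with NumPy; the helper name and the convention that sign(0) = +1 are our own choices, not the book's) computes the loss ∆(y′, y) = 1 − F_β from the counts a, b, c:

```python
import numpy as np

def f_beta_loss(y_true, y_scores, beta=1.0, theta=0.0):
    """Compute 1 - F_beta for labels y_true in {+1, -1}^r and real scores.

    Illustrative helper (not from the book); a score exactly equal to
    theta is treated as a positive prediction.
    """
    b_vec = np.where(y_scores - theta >= 0, 1, -1)  # thresholded prediction
    a = np.sum((y_true == 1) & (b_vec == 1))    # true positives
    b = np.sum((y_true == -1) & (b_vec == 1))   # false positives
    c = np.sum((y_true == 1) & (b_vec == -1))   # false negatives
    denom = (1 + beta**2) * a + b + beta**2 * c
    f_beta = (1 + beta**2) * a / denom if denom > 0 else 0.0
    return 1.0 - f_beta
```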



  • Recall at k: We measure the recall while the prediction must contain at most
    k positive labels. That is, we should set θ so that a + b ≤ k. This is
    convenient, for example, in the application of a fraud detection system,
    where a bank employee can only handle a small number of suspicious
    transactions.

  • Precision at k: We measure the precision while the prediction must contain
    at least k positive labels. That is, we should set θ so that a + b ≥ k.
    (A sketch of how to choose θ for both measures follows this list.)
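Both constraints amount to thresholding at the k-th largest score. Here is a minimal sketch (our own illustration; the name is hypothetical) that produces a prediction with exactly k positive labels, which satisfies a + b ≤ k for recall at k and a + b ≥ k for precision at k:

```python
import numpy as np

def top_k_prediction(y_scores, k):
    """Return b in {+1, -1}^r with exactly k positives: the k largest
    scores are predicted positive (ties broken arbitrarily). Equivalent
    to choosing theta between the k-th and (k+1)-th largest scores when
    these are distinct. Illustrative helper, not from the book.
    """
    order = np.argsort(y_scores)[::-1]   # indices by descending score
    b = -np.ones_like(y_scores)
    b[order[:k]] = 1
    return b
```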


The measures defined previously are often referred to as multivariate performance
measures. Note that these measures are very different from the average zero-one
loss, which in the preceding notation equals (b + c)/(a + b + c + d). In the
aforementioned example of fraud detection, when 99.9% of the examples are
negatively labeled, the zero-one loss of predicting that all the examples are
negative is 0.1%. In contrast, the recall of such a prediction is 0, and hence
the F_1 score is also 0, which means that the corresponding loss will be 1.
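The arithmetic of this example can be checked directly with a small self-contained snippet (our own illustration):

```python
import numpy as np

# Fraud-detection illustration: 999 negatives, 1 positive,
# and a predictor that declares every example negative.
y_true = np.array([-1] * 999 + [1])
b_vec = -np.ones(1000, dtype=int)            # all predictions negative

zero_one = np.mean(b_vec != y_true)          # (b + c)/(a + b + c + d)
a = np.sum((y_true == 1) & (b_vec == 1))     # true positives: 0
recall = 0.0 if a == 0 else a / np.sum(y_true == 1)

print(zero_one)   # 0.001, i.e. 0.1%
print(recall)     # 0.0, hence F_1 = 0 and the loss is 1
```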

17.5.1 Linear Predictors for Bipartite Ranking


We next describe how to train linear predictors for bipartite ranking. As in the
previous section, a linear predictor for ranking is defined to be

h_w(x̄) = (〈w, x_1〉, ..., 〈w, x_r〉).

The corresponding loss function is one of the multivariate performance measures
described before. The loss function depends on y′ = h_w(x̄) via the binary vector
it induces, which we denote by

b(y′) = (sign(y′_1 − θ), ..., sign(y′_r − θ)) ∈ {±1}^r.        (17.12)
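As a sketch, the predictor and the induced binary vector can be computed as follows, assuming the r instances in x̄ are stacked as the rows of a matrix X (the names and the sign(0) = +1 convention are our own):

```python
import numpy as np

def h_w(w, X):
    """Linear ranking predictor: the score vector (<w, x_1>, ..., <w, x_r>)
    for the r instances stacked as the rows of X."""
    return X @ w

def induced_binary_vector(y_prime, theta=0.0):
    """The binary vector b(y') of Equation (17.12); we treat sign(0)
    as +1, a convention the text leaves implicit."""
    return np.where(y_prime - theta >= 0, 1, -1)
```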

As in the previous section, to facilitate an efficient algorithm we derive a convex
surrogate loss function on ∆. The derivation is similar to the derivation of the
generalized hinge loss for the NDCG ranking loss, as described in the previous
section.
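For orientation, such derivations typically follow the structured-hinge template: assuming the vector y of true labels belongs to V, Equation (17.13) below gives ∑_i b(y′)_i y′_i ≥ ∑_i y_i y′_i, and therefore

∆(b(y′), y) ≤ max_{v∈V} ( ∆(v, y) + ∑_{i=1}^r (v_i − y_i)〈w, x_i〉 ),

which is convex in w as a maximum of functions that are affine in w. This is our paraphrase of the standard template, not the book's numbered equation.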
Our first observation is that for all the values of θ defined before, there is some
V ⊆ {±1}^r such that b(y′) can be rewritten as

b(y′) = argmax_{v∈V} ∑_{i=1}^r v_i y′_i.        (17.13)

This is clearly true for the case θ = 0 if we choose V = {±1}^r. The two measures
for which θ is not taken to be 0 are precision at k and recall at k. For precision
at k we can take V to be the set V_{≥k}, containing all vectors in {±1}^r whose
number of ones is at least k. For recall at k, we can take V to be V_{≤k}, which
is defined analogously. See Exercise 5.
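The argmax in Equation (17.13) can be computed directly for each of these choices of V. The sketch below (our own illustration) starts from the unconstrained maximizer v_i = sign(y′_i) and, when the constraint on the number of ones is violated, greedily flips the coordinates whose flips cost the least; this is optimal because flipping v_i changes the objective by exactly 2|y′_i|:

```python
import numpy as np

def argmax_over_V(y_scores, k=None, mode="all"):
    """Compute argmax_{v in V} sum_i v_i * y'_i, as in Equation (17.13).

    mode="all": V = {+1, -1}^r (the theta = 0 measures)
    mode="geq": V = V_{>=k}, at least k ones (precision at k)
    mode="leq": V = V_{<=k}, at most k ones (recall at k)
    Illustrative helper, not from the book; sign(0) is taken to be +1.
    """
    v = np.where(y_scores >= 0, 1, -1)  # coordinatewise unconstrained maximizer
    if mode == "all" or k is None:
        return v
    num_ones = int(np.sum(v == 1))
    if mode == "geq" and num_ones < k:
        # Flip the highest-scoring negatives: each flip changes the
        # objective by 2 * y'_i, so the least negative scores cost least.
        neg = np.where(v == -1)[0]
        flip = neg[np.argsort(y_scores[neg])[::-1][: k - num_ones]]
        v[flip] = 1
    elif mode == "leq" and num_ones > k:
        # Flip the lowest-scoring positives: each flip changes the
        # objective by -2 * y'_i, so the smallest scores cost least.
        pos = np.where(v == 1)[0]
        flip = pos[np.argsort(y_scores[pos])[: num_ones - k]]
        v[flip] = -1
    return v
```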