Understanding Machine Learning: From Theory to Algorithms


The following claim states that every doubly stochastic matrix is a convex
combination of permutation matrices.
Claim 17.3 (Birkhoff 1946; Von Neumann 1953) The set of doubly stochastic matrices in ℝ^{r,r} is the convex hull of the set of permutation matrices in ℝ^{r,r}.
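As a concrete illustration of Claim 17.3 (not part of the text), here is a minimal sketch of a greedy Birkhoff decomposition: repeatedly find a permutation supported on the positive entries, subtract off as much of it as possible, and repeat. The function name `birkhoff_decompose` and the brute-force permutation search are this sketch's own choices, suitable only for small r.

```python
import itertools
import numpy as np

def birkhoff_decompose(B, tol=1e-9):
    """Greedily write a doubly stochastic matrix B as sum_k gamma_k * P_k,
    where each P_k is a permutation matrix and the gamma_k sum to 1.
    Brute-forces the permutation search, so only suitable for small r."""
    B = B.astype(float).copy()
    r = B.shape[0]
    terms = []
    while B.max() > tol:
        # Find a permutation supported on the positive entries of B;
        # one must exist by Birkhoff's theorem (via Hall's condition).
        sigma = next(p for p in itertools.permutations(range(r))
                     if all(B[i, p[i]] > tol for i in range(r)))
        gamma = min(B[i, sigma[i]] for i in range(r))
        P = np.zeros((r, r))
        P[range(r), sigma] = 1.0
        terms.append((gamma, P))
        B -= gamma * P  # residual row/column sums drop by gamma each round
    return terms

# Example: an r = 3 doubly stochastic matrix.
B = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.5, 0.5]])
terms = birkhoff_decompose(B)
recovered = sum(g * P for g, P in terms)
```

Each round zeroes at least one entry of the residual, so the loop makes at most r² − r + 1 passes before the matrix is exhausted.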
On the basis of the claim, we easily obtain the following:

Lemma 17.4 There exists an optimal solution of Equation (17.10) that is also an optimal solution of Equation (17.9).
Proof Let B be a solution of Equation (17.10). Then, by Claim 17.3, we can write B = ∑_i γ_i C_i, where each C_i is a permutation matrix, each γ_i > 0, and ∑_i γ_i = 1. Since all the C_i are also doubly stochastic, we clearly have that ⟨A, B⟩ ≤ ⟨A, C_i⟩ for every i. We claim that there is some i for which ⟨A, B⟩ = ⟨A, C_i⟩. This must be true since otherwise, if for every i, ⟨A, B⟩ < ⟨A, C_i⟩, we would have that

⟨A, B⟩ = ⟨A, ∑_i γ_i C_i⟩ = ∑_i γ_i ⟨A, C_i⟩ > ∑_i γ_i ⟨A, B⟩ = ⟨A, B⟩,

which cannot hold. We have thus shown that some permutation matrix, C_i, satisfies ⟨A, B⟩ = ⟨A, C_i⟩. But, since for every other permutation matrix C we have ⟨A, B⟩ ≤ ⟨A, C⟩, we conclude that C_i is an optimal solution of both Equation (17.9) and Equation (17.10).
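Lemma 17.4 can be checked numerically (a sketch of this section's own making, assuming, as in the proof, that Equations (17.9) and (17.10) minimize ⟨A, B⟩ over permutation matrices and doubly stochastic matrices respectively): any convex combination of permutation matrices scores at least as high as the best permutation, so relaxing to the Birkhoff polytope cannot improve on the permutation optimum.

```python
import itertools
import random

def inner(A, B):
    """Frobenius inner product <A, B> = sum_ij A_ij * B_ij."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def perm_matrix(sigma, r):
    """Permutation matrix with a 1 in position (i, sigma(i))."""
    return [[1.0 if sigma[i] == j else 0.0 for j in range(r)] for i in range(r)]

random.seed(0)
r = 4
A = [[random.random() for _ in range(r)] for _ in range(r)]

# Minimum of <A, C> over all permutation matrices C.
perms = list(itertools.permutations(range(r)))
Pmats = [perm_matrix(s, r) for s in perms]
best = min(inner(A, P) for P in Pmats)

# A random doubly stochastic B, built as a convex combination of
# permutation matrices with weights gamma_i > 0 summing to 1.
weights = [random.random() + 1e-6 for _ in perms]
total = sum(weights)
gammas = [w / total for w in weights]
B = [[sum(g * P[i][j] for g, P in zip(gammas, Pmats)) for j in range(r)]
     for i in range(r)]

# <A, B> = sum_i gamma_i <A, C_i> >= min_i <A, C_i> >= best,
# exactly the inequality used in the proof of Lemma 17.4.
assert inner(A, B) >= best - 1e-12
```

Brute-forcing all r! permutations is only feasible for tiny r; the point here is the inequality, not an efficient solver.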

17.5 Bipartite Ranking and Multivariate Performance Measures


In the previous section we described the problem of ranking. We used a vector y ∈ ℝ^r for representing an order over the elements x_1, …, x_r. If all elements in y are different from each other, then y specifies a full order over [r]. However, if two elements of y attain the same value, y_i = y_j for i ≠ j, then y can only specify a partial order over [r]. In such a case, we say that x_i and x_j are of equal relevance according to y. In the extreme case, y ∈ {±1}^r, which means that each x_i is either relevant or nonrelevant. This setting is often called “bipartite ranking.” For example, in the fraud detection application mentioned in the previous section, each transaction is labeled as either fraudulent (y_i = 1) or benign (y_i = −1).
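The distinction between full and partial orders can be made concrete with a small sketch (the helper `preferred_pairs` is this illustration's own name): y prefers element i over element j exactly when y_i > y_j, and ties contribute no preference.

```python
def preferred_pairs(y):
    """Pairs (i, j) such that y ranks element i strictly above element j.
    Ties (y[i] == y[j]) induce no preference, so the order may be partial."""
    r = len(y)
    return {(i, j) for i in range(r) for j in range(r) if y[i] > y[j]}

# A full order: all values distinct, so every pair is comparable.
assert len(preferred_pairs([3, 1, 2])) == 3

# Bipartite ranking: y in {+1, -1}^r; the only preferences are
# "relevant above nonrelevant", and same-label pairs are incomparable.
pairs = preferred_pairs([1, -1, -1, 1])
assert pairs == {(0, 1), (0, 2), (3, 1), (3, 2)}
```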
Seemingly, we can solve the bipartite ranking problem by learning a binary
classifier, applying it on each instance, and putting the positive ones at the top
of the ranked list. However, this may lead to poor results as the goal of a binary
learner is usually to minimize the zero-one loss (or some surrogate of it), while the
goal of a ranker might be significantly different. To illustrate this, consider again
the problem of fraud detection. Usually, most of the transactions are benign (say 99.9%). Therefore, a binary classifier that predicts “benign” on all transactions
will have a zero-one error of 0.1%. While this is a very small number, the resulting
predictor is meaningless for the fraud detection application. The crux of the